geneharmony 0.3.0__tar.gz

@@ -0,0 +1,5 @@
1
+ # SCM syntax highlighting & preventing 3-way merges
2
+ pixi.lock merge=binary linguist-language=YAML linguist-generated=true
3
+
4
+ # Strip notebook outputs on commit (requires the local nbstrip filter; see CLAUDE.md)
5
+ *.ipynb filter=nbstrip
@@ -0,0 +1,72 @@
1
+ name: Publish to PyPI
2
+
3
+ # Build always; publish to TestPyPI on a manual run (dry-run), and to PyPI
4
+ # when a GitHub Release is published. Authentication uses PyPI Trusted
5
+ # Publishing (OIDC) — no API tokens or stored secrets.
6
+ on:
7
+ release:
8
+ types: [published]
9
+ workflow_dispatch:
10
+
11
+ jobs:
12
+ build:
13
+ name: Build distribution
14
+ runs-on: ubuntu-latest
15
+ steps:
16
+ - uses: actions/checkout@v7
17
+ - uses: actions/setup-python@v6
18
+ with:
19
+ python-version: "3.12"
20
+ - name: Install build backend
21
+ run: python -m pip install --upgrade build
22
+ - name: Build sdist and wheel
23
+ run: python -m build
24
+ - name: Check metadata renders on PyPI
25
+ run: |
26
+ python -m pip install --upgrade twine
27
+ python -m twine check dist/*
28
+ - name: Upload distribution artifacts
29
+ uses: actions/upload-artifact@v7
30
+ with:
31
+ name: dist
32
+ path: dist/
33
+
34
+ publish-testpypi:
35
+ name: Publish to TestPyPI
36
+ needs: build
37
+ if: github.event_name == 'workflow_dispatch'
38
+ runs-on: ubuntu-latest
39
+ environment:
40
+ name: testpypi
41
+ url: https://test.pypi.org/p/geneharmony
42
+ permissions:
43
+ id-token: write # required for OIDC trusted publishing
44
+ steps:
45
+ - name: Download distribution artifacts
46
+ uses: actions/download-artifact@v8
47
+ with:
48
+ name: dist
49
+ path: dist/
50
+ - name: Publish to TestPyPI
51
+ uses: pypa/gh-action-pypi-publish@release/v1
52
+ with:
53
+ repository-url: https://test.pypi.org/legacy/
54
+
55
+ publish-pypi:
56
+ name: Publish to PyPI
57
+ needs: build
58
+ if: github.event_name == 'release'
59
+ runs-on: ubuntu-latest
60
+ environment:
61
+ name: pypi
62
+ url: https://pypi.org/p/geneharmony
63
+ permissions:
64
+ id-token: write # required for OIDC trusted publishing
65
+ steps:
66
+ - name: Download distribution artifacts
67
+ uses: actions/download-artifact@v8
68
+ with:
69
+ name: dist
70
+ path: dist/
71
+ - name: Publish to PyPI
72
+ uses: pypa/gh-action-pypi-publish@release/v1
@@ -0,0 +1,26 @@
1
+ # pixi environments
2
+ .pixi/*
3
+ !.pixi/config.toml
4
+
5
+ # pycache
6
+ __pycache__/
7
+
8
+ # distribution files
9
+ dist/
10
+
11
+ # vscode settings
12
+ .vscode/
13
+
14
+ # cache
15
+ cache/
16
+
17
+ # test and archive files
18
+ test.txt
19
+ agr_http
20
+ src_old/
21
+
22
+ # environment variables
23
+ .env
24
+
25
+ # wheel files
26
+ *.whl
@@ -0,0 +1,142 @@
1
+ # CLAUDE.md
2
+
3
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
4
+
5
+ ## What this is
6
+
7
+ An async Python wrapper around the [Alliance of Genome Resources](https://www.alliancegenome.org) (AGR) REST API and its bulk-download files. It resolves gene symbols/IDs to canonical genes with an in-memory index built from AGR's `GENE-TSV-COMBINED` bulk file, fetches API data concurrently, and downloads/parses bulk files. `Annotator` (`annotator.py`) is the user-facing surface — `download` / `ingest_annotation` / `annotate`. The pipeline is developed interactively in `src/notebook.ipynb`; `src/main.py` is still an empty placeholder.
8
+
9
+ This replaces an earlier design (kept in `src_old/` for reference) that used an external Rust `gene_normalizer` binary and a filesystem cache of per-endpoint DataFrames. Gene normalization is now pure Python against the bulk file — there is **no external binary and no `.env`** to set up (`.env.example` is stale).
10
+
11
+ ## Environment & commands
12
+
13
+ The package is **pip-installable** (`pip install geneharmony`); runtime dependencies live in `[project.dependencies]` in `pyproject.toml` (`httpx`, `pydantic` v2, `pandas` 3.x, `pyarrow`) and are the single source of truth for end users. **pixi** (conda-forge) is the **development** environment only — end users don't need it. There are **no defined tasks, tests, or linters** — `[tasks]` in `pixi.toml` is empty. Published floor is Python 3.12+ (`requires-python`); the pixi dev env runs 3.14 (free-threaded).
14
+
15
+ ```bash
16
+ pixi install # create the environment from pixi.lock
17
+ pixi run python <script> # run Python inside the env
18
+ pixi run jupyter lab src/notebook.ipynb # open the driver notebook
19
+ ```
20
+
21
+ **Notebook outputs are stripped on commit** via a git clean filter (`*.ipynb filter=nbstrip` in `.gitattributes`), keeping `notebook.ipynb` diffs to code only. The filter is repo-local config, so enable it once per clone:
22
+
23
+ ```bash
24
+ git config filter.nbstrip.clean "pixi run jupyter nbconvert --clear-output --to notebook --stdin --stdout --log-level=ERROR"
25
+ git config filter.nbstrip.smudge cat
26
+ ```
27
+
28
+ **Imports are flat** (`from client import ...`, `from normalizer import ...`) with no package prefix, so code only resolves when **`src/` is the working directory / on `sys.path`**. The notebook runs there; standalone scripts must too (e.g. `sys.path.insert(0, "src")`).
29
+
30
+ ## Conventions
31
+
32
+ The owner prefers **strong typing** (modern `type` aliases, `Final`, `Self`, `enum`, `NamedTuple`) and **minimal comments** — comments are for user-facing docstrings or genuinely non-obvious logic, not narration. Match this when editing.
33
+
34
+ ## Architecture
35
+
36
+ Two AGR hosts are in play, and keeping them separate matters:
37
+ - **API host** `https://www.alliancegenome.org/api` — JSON endpoints, served by `AGRClient`.
38
+ - **Download host** `https://download.alliancegenome.org` — large bulk files, served by `Downloader`. The `/downloads` *listing* is JSON on the API host; only the file bytes come from the download host (the `s3Url` field).
39
+
40
+ ### `client.py` — `AGRClient`
41
+
42
+ Async API client wrapping one pooled `httpx.AsyncClient` bounded by an `asyncio.Semaphore` (default `max_concurrent=5`). `get_json` / `get_text` issue GETs through `_get`, which **retries** transient failures (statuses 429/502/503/504, timeouts, transport errors) with full-jitter exponential backoff, honoring `Retry-After`. `list_downloads()` fetches `/downloads` and validates it into `list[DownloadFile]` via a module-level `TypeAdapter`. Async context manager; `aclose()` closes the pool.
43
+
44
+ ### `downloader.py` — `Downloader`
45
+
46
+ Host-agnostic streaming file downloader (deliberately **not** AGR-specific — usable for any absolute URL). `download(url, dest, *, expected_size=None)` streams to disk in 1 MiB chunks via `client.stream(...)`, writes to a `.part` temp file then `os.replace`s into place (atomic; `dest` only ever exists complete). Bytes are written **verbatim** — `.gz` stays compressed on disk and is inflated at ingest. Retries transport/transient-status errors with the same backoff style as the client. Raises `SizeMismatchError` on a post-download size check.
47
+
48
+ > Caveat: the `/downloads` listing's `size` is the **uncompressed** size, but we fetch the compressed `.gz`. So do **not** pass `expected_size=DownloadFile.size` — it will always mismatch. The skip-if-already-downloaded shortcut only engages when `expected_size` is given, so it's currently a no-op for these files.
49
+
50
+ ### `ingest.py` — decompress + parse
51
+
52
+ Free functions, no class. Read straight from the compressed file into memory (no decompressed copy on disk):
53
+ - `load_json_gz(path) -> Any` — `gzip.open` + `json.load`.
54
+ - `load_tsv_gz(path, dtype=None) -> pd.DataFrame` — `pd.read_csv(sep="\t", comment="#", compression="gzip")`. The `comment="#"` skips AGR's leading metadata block; `dtype=str` is needed for the gene file so digit-like symbols/IDs aren't coerced to numbers.
55
+
56
+ Note: `sep="\t"` in real source files uses pandas' fast C engine. (A spurious "falling back to the python engine / regex separator" warning only appears when `\t` is passed through a shell `-c` invocation, where it arrives as a literal two-char `\t` — not a real-code problem.)
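To show why string typing matters, here is a standard-library sketch of reading the compressed files directly. `load_json_gz` follows the description above; `read_tsv_gz_rows` is a hypothetical stand-in for the pandas call that only skips whole lines starting with `#` (pandas' `comment="#"` also truncates at a mid-line `#`).

```python
import csv
import gzip
import json
from pathlib import Path
from typing import Any


def load_json_gz(path: Path) -> Any:
    # Decompress and parse in one pass; nothing is written back to disk.
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


def read_tsv_gz_rows(path: Path) -> list[dict[str, str]]:
    # Every cell stays a string, so an identifier like "007" is not coerced to 7.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
        data_lines = (line for line in fh if not line.startswith("#"))
        return list(csv.DictReader(data_lines, delimiter="\t"))
```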
57
+
58
+ ### `normalizer.py` — the gene index
59
+
60
+ `load_gene_index(path) -> GeneIndex` reads `GENE-TSV-COMBINED` (`dtype=str`, all columns) and precomputes O(1) lookups. The file is ~914k rows across 9 species; the real ID column is **`GeneId`** (not `GeneID`); `GeneSymbol` is **not unique** even within a species; `GeneSecondaryIds` are deprecated IDs.
61
+
62
+ `build_gene_index` fills five `dict[str, list[int]]` tables mapping each identifier form to row positions in the retained `records` DataFrame, tagged by `MatchKind`:
63
+
64
+ ```
65
+ PRIMARY_ID > SECONDARY_ID > OFFICIAL_SYMBOL > SYNONYM > CROSS_REFERENCE (precedence, high→low)
66
+ ```
67
+
68
+ **Precedence is the `MatchKind` enum's definition order**, surfaced by `for kind in MatchKind` in `_resolve` (there is no explicit sort; the int values are documentation only). Keep members declared best-to-worst.
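A small sketch of precedence by declaration order follows. The member names come from the text; the `resolve` function and the `tables` layout are simplified assumptions, not the real `_resolve`.

```python
from enum import Enum


class MatchKind(Enum):
    # Declared best-to-worst: iteration order is the precedence.
    # The integer values are documentation only and are never compared.
    PRIMARY_ID = 1
    SECONDARY_ID = 2
    OFFICIAL_SYMBOL = 3
    SYNONYM = 4
    CROSS_REFERENCE = 5


Tables = dict[MatchKind, dict[str, list[int]]]


def resolve(query: str, tables: Tables) -> list[tuple[MatchKind, int]]:
    hits: list[tuple[MatchKind, int]] = []
    for kind in MatchKind:                       # walks tiers in definition order
        for row in tables.get(kind, {}).get(query, []):
            hits.append((kind, row))
    return hits
```

Reordering the members would silently change which match wins, which is why the declaration order has to stay best-to-worst.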
69
+
70
+ `CROSS_REFERENCE` indexes `GeneCrossReferences` — pipe-separated external IDs (`NCBI_Gene:`, `ENSEMBL:`, `UniProtKB:`, `RefSeq:`, …), populated on ~43% of rows. Keys are the **full `PREFIX:ID` token**, not the bare ID (bare IDs collide across databases — e.g. `601309` is both OMIM and MIM). It is the **lowest** tier, so cross-refs only resolve queries no better identifier matches. Family/class databases that fan one token out to hundreds of genes (`_XREF_EXCLUDED_PREFIXES`: `PANTHER`, `TreeFam`, `ExPASy`, `TCDB`) are **excluded**. Some `RGD:*` tokens also appear in `GeneSecondaryIds`; the higher `SECONDARY_ID` tier wins, so the duplication is harmless.
71
+
72
+ `GeneIndex.lookup(queries, *, taxon=None, limit=1, case_insensitive=False) -> pd.DataFrame`:
73
+ - `queries`: `str | list[str]` (scalar is wrapped). Built for batches of thousands.
74
+ - Returns one row per `(query, match)`: columns `query`, `match_kind`, then **all** gene-record columns. Input order preserved.
75
+ - **Unmatched queries are retained** with null `match_kind` (and `NaN` record cells) so misses are visible. Filter with `df.match_kind.notna()`.
76
+ - `limit` caps matches per query (`limit=None` = all → useful for top-N when no taxon disambiguates). Pipeline is **precedence-sort → taxon-filter → dedup → limit** (i.e. filter-then-limit). A kind filter, if ever wanted, is left to the caller on the returned frame — note that user-side kind-filtering happens *after* `limit`.
77
+ - Within a tier, matches are ordered by **row/file order** (not species priority).
78
+ - `case_insensitive=True` consults a **lazily-built** casefolded copy of the tables (zero cost otherwise). Default is case-sensitive, since case can be meaningful across species (human `TP53` vs mouse `Trp53`).
79
+
80
+ ### `taxa.py` — taxon resolution (`taxa.json` + `resolve_taxon`)
81
+
82
+ Self-contained species resolution with no dependency on the gene index (not even pandas); `normalizer.py` and `annotator.py` both import `resolve_taxon` from here. `taxa.json` (in `src/`, read via a path relative to `taxa.py`) holds one entry per species with `id` / `species` / `common`. At import, `_load_taxa` builds the `_TAXA` tuple of `Taxon` records, then `_TAXON_BY_ALIAS` maps every casefolded string tied to a species — full `NCBITaxon:` ID, bare number, species name, each common name — back to its `Taxon`. The full ID is itself an alias, so this one map covers taxon-ID lookups too (no separate by-ID table). `Taxon` is a `NamedTuple` (`id`, `species`, `common`) with `.number` (bare ID, no prefix) and `.common_name` (first common name, or `None`) properties. `resolve_taxon(value)` strips + casefolds against `_TAXON_BY_ALIAS` and returns the matched `Taxon`, raising `ValueError` on unknown — so any alias resolves to the whole record, and callers pull the part they need (`.id`, `.species`, `.common_name`, `.number`). This lets users pass `"human"`, `"9606"`, `"Homo sapiens"`, or `"NCBITaxon:9606"` interchangeably. Ambiguous aliases (`frog`, `xenopus`) are intentionally omitted.
83
+
84
+ **Naming convention:** `taxon` is an *alias string* a user passes; `taxon_id` is a *resolved canonical* `NCBITaxon:` string (`normalize` resolves its `taxon` arg to `taxon_id` once up front via `resolve_taxon(...).id`, so a bad taxon fails fast); a `Taxon` is the record object.
85
+
86
+ To append taxon data to a frame, `taxon_mapper(field)` builds a `value -> field` callable for `df[col].map(...)`. `field` is a `TaxonField` (`StrEnum`: `ID` / `NUMBER` / `SPECIES` / `COMMON_NAME`, each value the matching `Taxon` attribute name). The returned callable accepts any alias (so it works on `normalize`'s `Taxon` column, orthology's `Gene2SpeciesTaxonID`, etc.) and yields `None` for unknown or non-string cells — e.g. `df["common_name"] = df["Taxon"].map(taxon_mapper(TaxonField.COMMON_NAME))`.
87
+
88
+ ### `datasets.py` — dataset registry
89
+
90
+ `AGRDataset` (`StrEnum`: `GENE`, `ORTHOLOGY`, `PHENOTYPES`, `ALLELES`) is the typed handle users pass to `download`/`annotate`. `GENE` is the bulk file backing the gene index — downloaded through the same `download` path as the rest, but built into a `GeneIndex` rather than joined onto a base frame. `DATASETS` maps each member to a `DatasetSpec(bulk, api)`:
91
+ - `BulkSpec(data_type, file_type, data_sub_type, join_key)` — a selector into the `/downloads` listing (matched at runtime; never a hardcoded `s3Url`) plus the column its rows join on.
92
+ - `ApiSpec(endpoint, join_key, project)` — a per-gene endpoint template and a `project(gene_id, result) -> dict` that flattens one API result into one flat row.
93
+
94
+ Each dataset has **one natural backend** so output columns stay predictable: orthology → bulk TSV (`ORTHOLOGY-ALLIANCE`, keyed `Gene1ID`; rich columns `Gene2ID`/`Gene2SpeciesTaxonID`/…); phenotypes & alleles → per-gene API (their bulk files are nested per-MOD JSON, deferred). The API orthology projector mirrors the bulk column names so either backend yields the same shape. Adding a dataset = add an enum member + a `DatasetSpec` (+ a projector for API ones).
95
+
96
+ ### `store.py` — atomic Parquet persistence
97
+
98
+ `write_parquet(df, path)` / `read_parquet(path, *, decode_json=())` back the bulk, per-gene API, and external-annotation caches. Writes go through a same-dir temp file then `os.replace` (atomic), zstd-compressed; object columns holding dicts/lists are JSON-encoded to strings on write (`decode_json` reverses it on read). Ported/trimmed from `src_old/cache.py`. Current projections are flat, so the nested-encoding path is a safety net.
99
+
100
+ ### `annotator.py` — `Annotator` (the user-facing surface)
101
+
102
+ Ties the lower-level pieces together over a resolved cache dir, with a lazily-built `GeneIndex` cached for the instance's lifetime. Intended use is an **iterative filter-then-requery traversal** — one primary AGR dataset per `annotate` call — so cardinality stays under the caller's control (no cross-dataset Cartesian blow-up).
103
+
104
+ Cache resolution lives here: module-level `default_cache_dir()` is `$XDG_CACHE_HOME/geneharmony` (falling back to `~/.cache/geneharmony`); `resolve_cache_dir(cache_dir)` returns the user's override or that default, creating it (called once in `__init__`). Pass a path to share a cache between users; omit it for the home default.
105
+
106
+ The gene index is built lazily in `_gene_index()` and memoized in `self._index` for the instance's lifetime: it calls `download(AGRDataset.GENE)` (the gene bulk file goes to `bulk/gene.parquet` like any other dataset), then builds the index from that parquet in memory (~2 s; **not** pickled — a pickled `GeneIndex` is ~5× the parquet yet saves only ~0.6 s). The parquet is existence-cached and becomes **stale across AGR releases**; `download(AGRDataset.GENE, refresh=True)` (or deleting it) rebuilds.
107
+
108
+ - `Annotator(cache_dir=None, *, client=None, downloader=None)` — one instance serves any genome; `taxon` is passed per call (`normalize`/`ingest_annotation`/`annotate`), defaulting to `None` (no species filter).
109
+ - `async download(dataset, *, refresh=False) -> Path` — resolve the bulk file, stream it, convert TSV→Parquet under `bulk/<dataset>.parquet`, **delete the `.tsv.gz`**; no-op if the parquet exists. `AGRClient`/`Downloader` are created only when a fetch is actually needed (`AsyncExitStack`); passing your own leaves their lifecycle to you. Raises if the dataset has no bulk spec.
110
+ - `async normalize(genes, *, taxon=None, limit=1, case_insensitive=False)` — passes through to `GeneIndex.lookup`.
111
+ - `async ingest_annotation(source, name, *, gene_id_column, normalize=True, taxon=None, case_insensitive=False, override=False) -> tuple[dict, pd.DataFrame | None]` — store an external table under `external/<name>.parquet`, keyed by canonical `GeneId` (gene ids normalized via the index, replacing the old Rust `normalize_symbols_map`). Ported from `src_old/utils.py`. `gene_id_column` is `str | list[str]`: with a list, the columns are tried **left-to-right per row** and the first identifier that resolves wins (a fallback for tables whose primary ID column has gaps) — normalization is mapping-aware, so a non-null-but-unresolvable value falls through to the next column; `normalize=False` degrades to a plain first-non-null coalesce. The id columns are **kept as-is** and the resolved canonical id is written to a **separate, freshly-added `GeneId` column** (so the input must not already contain a `GeneId` column — that raises); rows where no candidate resolves are dropped and counted. Returns `(summary_dict, unmapped_df)` where `unmapped_df` holds the dropped rows with their original columns (null when the cache hit short-circuits or nothing was dropped).
112
+ - `async annotate(genes, *sources, taxon=None, limit=1, case_insensitive=False) -> DataFrame` — build a base frame (`list[str]` → `normalize`, **misses retained** with null cells; or pass a pre-normalized DataFrame), then **wide left-join** each source onto `GeneId` in order. `limit` is forwarded to `normalize`, capping matches per query (`limit=None` = all) — useful for fanning a symbol out to several genes when no `taxon` disambiguates; each matched gene is annotated as its own base row. An `AGRDataset` contributes its **native** columns (bulk filtered by `join_key ∈ gene_ids`, or per-gene API fetched + cached under `api/<dataset>/<id>.parquet`); a `str` names an ingested annotation and contributes `name.`-prefixed columns. An unknown `str` raises.
113
+
114
+ Per-gene API fetches paginate (`limit`/`page`, page-1-for-total, remaining pages concurrent — both phenotypes and alleles paginate) and are cached one Parquet per gene; cache hits skip the network. Clients are created on demand via `AsyncExitStack` (same pattern as `download`).
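The page-1-for-total pattern can be sketched with `asyncio.gather`. The `fetch_page` callable and the `{"total", "results"}` response shape are assumptions for illustration; the real code goes through `AGRClient` and its semaphore.

```python
import asyncio
import math
from collections.abc import Awaitable, Callable

FetchPage = Callable[[int, int], Awaitable[dict]]   # (page, limit) -> response


async def fetch_all_pages(fetch_page: FetchPage, *, limit: int = 100) -> list:
    # Page 1 is fetched alone because it reports the total row count.
    first = await fetch_page(1, limit)
    pages = math.ceil(first["total"] / limit)
    # The remaining pages are independent, so request them concurrently.
    rest = await asyncio.gather(*(fetch_page(p, limit) for p in range(2, pages + 1)))
    results = list(first["results"])
    for page in rest:                                # gather preserves request order
        results.extend(page["results"])
    return results
```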
115
+
116
+ ### `models.py`
117
+
118
+ `DownloadFile` (Pydantic) is the only model — it mirrors a `/downloads` entry verbatim (camelCase fields to match the API), with `size: PositiveInt` and `lastModified: datetime`. The former `In_*` / `Out_*` ortholog/phenotype/allele placeholders are gone; the live API projections live as plain-dict `project` callables in `datasets.py` (no Pydantic).
119
+
120
+ ### Placeholder
121
+
122
+ `main.py` is currently empty — it's the planned user-facing entry point.
123
+
124
+ ## Data flows
125
+
126
+ **Bulk file:** `AGRClient.list_downloads()` (API host) → pick a `DownloadFile` by `dataType`/`fileType` → `Downloader.download(file.s3Url, dest)` (download host) → `ingest.load_tsv_gz` / `load_json_gz`. Don't hardcode URLs — `s3Url` embeds `releaseVersion`, which changes each release; select by intent and resolve at runtime.
127
+
128
+ **Gene normalization:** `Annotator._gene_index()` returns a ready `GeneIndex` — `download(AGRDataset.GENE)` yields `bulk/gene.parquet` (downloaded + converted on a cold cache, one-time ~10 s; ~µs lookups) which it builds into the index, memoized on the instance. Then `Annotator.normalize(symbols_or_ids, taxon=..., limit=...)` → `GeneIndex.lookup` → DataFrame of resolved canonical records → proceed with downstream queries using the official `GeneId`. `normalizer.load_gene_index(path)` remains the lower-level "build straight from a TSV path" entry point.
129
+
130
+ **Annotation (the user-facing flow):** `Annotator(cache_dir)` → optionally `await download(AGRDataset.X)` for bulk datasets → `await annotate(genes, AGRDataset.X, taxon=...)` returns a wide frame on the *normalized* base → slice it (e.g. `df[df.Gene2SpeciesTaxonID == "NCBITaxon:10090"]["Gene2ID"]`) → feed the slice back into the next `annotate` call. One AGR dataset per call; combine with ingested annotations by name in the same call.
131
+
132
+ ## Cache / scratch
133
+
134
+ Caches default to `$XDG_CACHE_HOME/geneharmony` (falling back to `~/.cache/geneharmony`); the repo-root `cache/` is the manual override used by the notebook and ad-hoc scratch. Layout written by `Annotator`:
135
+
136
+ ```
137
+ bulk/<dataset>.parquet # downloaded + converted bulk datasets (incl. bulk/gene.parquet, the gene-index source)
138
+ api/<dataset>/<gene_id>.parquet # per-gene API results (':' -> '_' in filenames)
139
+ external/<name>.parquet # ingested annotations
140
+ ```
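The `':' -> '_'` substitution in per-gene filenames can be illustrated with a one-line helper. `api_cache_path` is a hypothetical name, and the dataset directory name in the usage note is a guess at the enum's string value.

```python
from pathlib import Path


def api_cache_path(cache_dir: Path, dataset: str, gene_id: str) -> Path:
    # ':' is not portable in filenames (for example on Windows), so it becomes '_'.
    return cache_dir / "api" / dataset / f"{gene_id.replace(':', '_')}.parquet"
```

Under these assumptions, `api_cache_path(Path("cache"), "phenotypes", "MGI:98834")` points at `cache/api/phenotypes/MGI_98834.parquet`.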
141
+
142
+ `agr_http/downloads.json` is a saved snapshot of a `/downloads` listing. `src_old/` is the previous (binary + endpoint-cache) implementation, kept for reference.
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Lionel Sequeira
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,225 @@
1
+ Metadata-Version: 2.4
2
+ Name: geneharmony
3
+ Version: 0.3.0
4
+ Summary: Async toolkit to normalize gene identifiers and annotate gene sets with data from the Alliance of Genome Resources (AGR) or user-ingested datasets.
5
+ Project-URL: Homepage, https://github.com/limenode/geneharmony
6
+ Project-URL: Repository, https://github.com/limenode/geneharmony
7
+ Project-URL: Issues, https://github.com/limenode/geneharmony/issues
8
+ Author-email: Lionel Sequeira <lionelsequeira@gmail.com>
9
+ License-Expression: MIT
10
+ License-File: LICENSE
11
+ Keywords: agr,alliance of genome resources,annotation,bioinformatics,gene,genomics
12
+ Classifier: Development Status :: 3 - Alpha
13
+ Classifier: Intended Audience :: Science/Research
14
+ Classifier: Operating System :: OS Independent
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.12
17
+ Classifier: Programming Language :: Python :: 3.13
18
+ Classifier: Programming Language :: Python :: 3.14
19
+ Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
20
+ Classifier: Typing :: Typed
21
+ Requires-Python: >=3.12
22
+ Requires-Dist: httpx>=0.28
23
+ Requires-Dist: pandas>=3
24
+ Requires-Dist: pyarrow>=14
25
+ Requires-Dist: pydantic>=2
26
+ Description-Content-Type: text/markdown
27
+
28
+ # geneharmony
29
+
30
+ An async Python toolkit that normalizes gene identifiers and annotates gene sets using the [Alliance of Genome Resources](https://www.alliancegenome.org) (AGR) REST API and bulk-download files, with functionality to append local annotations.
31
+
32
+ It resolves gene symbols and identifiers to canonical genes using an in-memory index built from AGR's bulk gene file, fetches per-gene API data concurrently, and downloads and parses AGR bulk datasets.
33
+
34
+ ## Highlights
35
+
36
+ - **Gene normalization**: Resolve symbols, primary/secondary IDs, synonyms, systematic names, and external cross-references (NCBI, Ensembl, UniProtKB, RefSeq, …) to the matching records in AGR's `GENE-TSV-COMBINED` file.
37
+ - **Nine model organisms**: Human, mouse, rat, zebrafish, fly, worm, yeast, African clawed frog, and western clawed frog.
38
+ - **Concurrent and resilient**: Pooled, rate-limited HTTP with automatic retry/backoff for transient failures.
39
+ - **Transparent caching**: Bulk files, per-gene API results, and ingested annotations are cached as Parquet to expedite repeat runs.
40
+ - **Bring your own data**: Ingest external annotation tables keyed on whatever gene identifier you have; they normalize to canonical AGR genes and join cleanly.
41
+
42
+ ## Install
43
+
44
+ Requires **Python 3.12+**. Install from PyPI with pip or uv — all dependencies (`httpx`, `pydantic` v2, `pandas` 3.x, `pyarrow`) are resolved automatically:
45
+
46
+ ```bash
47
+ pip install geneharmony
48
+ # or
49
+ uv add geneharmony
50
+ ```
51
+
52
+ ## Development
53
+
54
+ Contributors use **pixi** (conda-forge) for a reproducible environment from the lockfile. End users do not need pixi.
55
+
56
+ ```bash
57
+ # 1. Install the environment from the lockfile
58
+ pixi install
59
+
60
+ # 2. Run Python inside the environment
61
+ pixi run python <script>
62
+
63
+ # 3. Or open the interactive driver notebook
64
+ pixi run jupyter lab src/notebook.ipynb
65
+ ```
66
+
67
+ Notebook outputs are stripped from version control via a git clean filter. The filter config is repo-local, so enable it once per clone:
68
+
69
+ ```bash
70
+ git config filter.nbstrip.clean "pixi run jupyter nbconvert --clear-output --to notebook --stdin --stdout --log-level=ERROR"
71
+ git config filter.nbstrip.smudge cat
72
+ ```
73
+
74
+ ## Usage
75
+
76
+ The `Annotator` is the single entry point. It is async, so call its methods with `await` (inside a notebook cell, an `async def`, or `asyncio.run(...)`).
77
+
78
+ ### Quick start
79
+
80
+ ```python
81
+ from geneharmony import Annotator, AGRDataset
82
+
83
+ ann = Annotator()
84
+
85
+ # Resolve gene symbols to canonical AGR records
86
+ genes = await ann.normalize(["TP53", "BRCA1"], taxon="human")
87
+
88
+ # Annotate genes with phenotypic information
89
+ annotated_genes = await ann.annotate(["Atp7b", "Ttn"], AGRDataset.PHENOTYPES, taxon="mouse")
90
+ ```
91
+
92
+ ### Resolving genes (`normalize`)
93
+
94
+ `normalize` accepts an identifier or a list of identifiers and returns one row per match. Unmatched queries are **retained** with a null `match_kind` so misses stay visible.
95
+
96
+ ```python
97
+ df = await ann.normalize(
98
+ ["TP53", "ENSG00000141510", "not_a_gene"],
99
+ taxon="human", # any alias: "human", "9606", "Homo sapiens", "NCBITaxon:9606"
100
+ limit=1, # max matches per query; use None for all
101
+ case_insensitive=False, # case can be meaningful (human TP53 vs mouse Trp53)
102
+ )
103
+
104
+ resolved = df[df.match_kind.notna()] # drop the misses
105
+ ```
106
+
107
+ Matches are ranked by identifier precedence:
108
+
109
+ ```
110
+ PRIMARY_ID > SECONDARY_ID > OFFICIAL_SYMBOL > SYNONYM > CROSS_REFERENCE
111
+ ```
112
+
113
+ ### Annotating genes (`annotate`)
114
+
115
+ `annotate` builds a normalized base frame, then **left joins** one or more sources onto the canonical `GeneId`:
116
+
117
+ ```python
118
+ from geneharmony import AGRDataset
119
+
120
+ orth = await ann.annotate(
121
+ ["TP53", "BRCA1"],
122
+ AGRDataset.ORTHOLOGY,
123
+ taxon="human",
124
+ )
125
+ ```
126
+
127
+ For chaining annotate calls, the recommended pattern is an **iterative filter-then-requery traversal** — one AGR dataset per call — so result cardinality stays under your control:
128
+
129
+ ```python
130
+ # 1. Find orthologs of human genes
131
+ orth = await ann.annotate(["TP53", "BRCA1"], AGRDataset.ORTHOLOGY, taxon="human")
132
+
133
+ # 2. Keep the mouse orthologs
134
+ mouse = orth.loc[orth.Gene2SpeciesTaxonID == "NCBITaxon:10090", "Gene2ID"].unique()
135
+
136
+ # 3. Fetch their phenotypes
137
+ pheno = await ann.annotate(list(mouse), AGRDataset.PHENOTYPES, taxon="mouse")
138
+ ```
139
+
140
+ #### Available AGR datasets
141
+
142
+ | Dataset | Backend | Key columns contributed |
143
+ | ----------------------- | ------------ | ----------------------------------------------------------- |
144
+ | `AGRDataset.ORTHOLOGY` | Bulk TSV | `Gene2ID`, `Gene2Symbol`, `Gene2SpeciesTaxonID`, … |
145
+ | `AGRDataset.PHENOTYPES` | Per-gene API | `phenotypeStatement`, `references` |
146
+ | `AGRDataset.ALLELES` | Per-gene API | `allele_id`, `symbol`, `alterationType`, `variantType`, … |
147
+
148
+ ### Orthologs convenience helper
149
+
150
+ For the common ortholog case there is a shortcut that returns a tidy subset:
151
+
152
+ ```python
153
+ orthologs = await ann.get_orthologs(
154
+ ["TP53", "BRCA1"],
155
+ taxon="human",
156
+ target_taxon="mouse", # optional: filter to one target species
157
+ )
158
+ # -> columns: query, match_kind, Gene2ID, Gene2Symbol, Gene2SpeciesTaxonID
159
+ ```
160
+
161
+ ### Downloading bulk datasets (`download`)
162
+
163
+ Bulk datasets are downloaded and converted to Parquet on first use (and cached thereafter). You can pre-fetch one explicitly:
164
+
165
+ ```python
166
+ path = await ann.download(AGRDataset.ORTHOLOGY)
167
+ # Force a refresh across AGR releases:
168
+ path = await ann.download(AGRDataset.ORTHOLOGY, refresh=True)
169
+ ```
170
+
171
+ ### Ingesting your own annotations (`ingest_annotation`)
172
+
173
+ Bring an external table (CSV, TSV, or Parquet file, or a `DataFrame`), normalize its gene identifiers to canonical AGR genes, and store it for joining by name:
174
+
175
+ ```python
176
+ summary, unmapped = await ann.ingest_annotation(
177
+ "my_expression_table.csv",
178
+ name="expression",
179
+ gene_id_column="symbol", # or a list of columns, tried left-to-right per row
180
+ taxon="human",
181
+ )
182
+
183
+ # Join it alongside an AGR dataset; its columns are prefixed `expression.`
184
+ df = await ann.annotate(["TP53", "BRCA1"], AGRDataset.ORTHOLOGY, "expression", taxon="human")
185
+ ```
186
+
187
+ `summary` reports rows in / stored / dropped; `unmapped` holds the rows whose identifiers could not be resolved (with their original columns) so nothing is silently lost.
188
+
189
+ ### Supported species
190
+
191
+ | Common name | Species | Taxon ID |
192
+ | --------------------- | -------------------------------- | ------------------ |
193
+ | human | *Homo sapiens* | `NCBITaxon:9606` |
194
+ | mouse | *Mus musculus* | `NCBITaxon:10090` |
195
+ | rat | *Rattus norvegicus* | `NCBITaxon:10116` |
196
+ | zebrafish | *Danio rerio* | `NCBITaxon:7955` |
197
+ | fly / fruit fly | *Drosophila melanogaster* | `NCBITaxon:7227` |
198
+ | worm / roundworm | *Caenorhabditis elegans* | `NCBITaxon:6239` |
199
+ | yeast / budding yeast | *Saccharomyces cerevisiae S288C* | `NCBITaxon:559292` |
200
+ | african clawed frog | *Xenopus laevis* | `NCBITaxon:8355` |
201
+ | western clawed frog | *Xenopus tropicalis* | `NCBITaxon:8364` |
202
+
203
+ Any of the aliases above — common name, full species name, bare number, or `NCBITaxon:` ID — can be passed as `taxon`.
204
+
205
+ ## Caching
206
+
207
+ Results are cached so repeat work is fast and largely offline. The cache defaults to `$XDG_CACHE_HOME/geneharmony` (falling back to `~/.cache/geneharmony`); pass a `cache_dir` to `Annotator(...)` to share or relocate it.
208
+
209
+ ```
210
+ bulk/<dataset>.parquet # downloaded + converted bulk datasets (incl. the gene index source)
211
+ api/<dataset>/<gene_id>.parquet # per-gene API results
212
+ external/<name>.parquet # ingested annotations
213
+ ```
214
+
215
+ The gene index is built from `bulk/gene.parquet`, which becomes stale across AGR releases. Refresh it with `await ann.download(AGRDataset.GENE, refresh=True)` (or by deleting the file).
216
+
217
+ ## Acknowledgements & Citation
218
+
219
+ This project is a client for data and services provided by the **Alliance of Genome Resources (AGR)**. It is not affiliated with or endorsed by the Alliance. All gene, ortholog, phenotype, and allele data are sourced from AGR and its member model-organism databases, and remain subject to the Alliance's terms of use.
220
+
221
+ If you use data obtained through this wrapper, please cite the Alliance of Genome Resources:
222
+
223
+ > Alliance of Genome Resources Consortium. [Updates to the Alliance of Genome Resources central infrastructure.](https://pubmed.ncbi.nlm.nih.gov/38552170/) Genetics. 2024 May 7;227(1):iyae049. doi: 10.1093/genetics/iyae049. PMID: 38552170.
224
+
225
+ Please also consult the [Alliance citation and data-usage guidelines](https://www.alliancegenome.org/cite-us) and acknowledge the underlying model-organism databases (e.g. SGD, WormBase, FlyBase, ZFIN, MGI, RGD, Xenbase) as appropriate.