PyPI - geneharmony - Versions diffs - 0.3.0__tar.gz - Mend

geneharmony 0.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (23)

geneharmony-0.3.0/.gitattributes +5 -0
geneharmony-0.3.0/.github/workflows/publish.yml +72 -0
geneharmony-0.3.0/.gitignore +26 -0
geneharmony-0.3.0/CLAUDE.md +142 -0
geneharmony-0.3.0/LICENSE +21 -0
geneharmony-0.3.0/PKG-INFO +225 -0
geneharmony-0.3.0/README.md +198 -0
geneharmony-0.3.0/pixi.lock +2703 -0
geneharmony-0.3.0/pixi.toml +24 -0
geneharmony-0.3.0/pyproject.toml +44 -0
geneharmony-0.3.0/src/geneharmony/__init__.py +17 -0
geneharmony-0.3.0/src/geneharmony/annotator.py +356 -0
geneharmony-0.3.0/src/geneharmony/client.py +100 -0
geneharmony-0.3.0/src/geneharmony/datasets.py +107 -0
geneharmony-0.3.0/src/geneharmony/downloader.py +96 -0
geneharmony-0.3.0/src/geneharmony/ingest.py +22 -0
geneharmony-0.3.0/src/geneharmony/models.py +15 -0
geneharmony-0.3.0/src/geneharmony/normalizer.py +190 -0
geneharmony-0.3.0/src/geneharmony/store.py +61 -0
geneharmony-0.3.0/src/geneharmony/taxa.json +11 -0
geneharmony-0.3.0/src/geneharmony/taxa.py +86 -0
geneharmony-0.3.0/src/main.py +0 -0
geneharmony-0.3.0/src/notebook.ipynb +5654 -0

geneharmony-0.3.0/.gitattributes ADDED

@@ -0,0 +1,5 @@
+# SCM syntax highlighting & preventing 3-way merges
+pixi.lock merge=binary linguist-language=YAML linguist-generated=true
+# Strip notebook outputs on commit (requires the local nbstrip filter; see CLAUDE.md)
+*.ipynb filter=nbstrip

geneharmony-0.3.0/.github/workflows/publish.yml ADDED

@@ -0,0 +1,72 @@
+name: Publish to PyPI
+# Build always; publish to TestPyPI on a manual run (dry-run), and to PyPI
+# when a GitHub Release is published. Authentication uses PyPI Trusted
+# Publishing (OIDC) — no API tokens or stored secrets.
+on:
+  release:
+    types: [published]
+  workflow_dispatch:
+jobs:
+  build:
+    name: Build distribution
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v7
+      - uses: actions/setup-python@v6
+        with:
+          python-version: "3.12"
+      - name: Install build backend
+        run: python -m pip install --upgrade build
+      - name: Build sdist and wheel
+        run: python -m build
+      - name: Check metadata renders on PyPI
+        run: |
+          python -m pip install --upgrade twine
+          python -m twine check dist/*
+      - name: Upload distribution artifacts
+        uses: actions/upload-artifact@v7
+        with:
+          name: dist
+          path: dist/
+  publish-testpypi:
+    name: Publish to TestPyPI
+    needs: build
+    if: github.event_name == 'workflow_dispatch'
+    runs-on: ubuntu-latest
+    environment:
+      name: testpypi
+      url: https://test.pypi.org/p/geneharmony
+    permissions:
+      id-token: write          # required for OIDC trusted publishing
+    steps:
+      - name: Download distribution artifacts
+        uses: actions/download-artifact@v8
+        with:
+          name: dist
+          path: dist/
+      - name: Publish to TestPyPI
+        uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          repository-url: https://test.pypi.org/legacy/
+  publish-pypi:
+    name: Publish to PyPI
+    needs: build
+    if: github.event_name == 'release'
+    runs-on: ubuntu-latest
+    environment:
+      name: pypi
+      url: https://pypi.org/p/geneharmony
+    permissions:
+      id-token: write          # required for OIDC trusted publishing
+    steps:
+      - name: Download distribution artifacts
+        uses: actions/download-artifact@v8
+        with:
+          name: dist
+          path: dist/
+      - name: Publish to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1

geneharmony-0.3.0/.gitignore ADDED

@@ -0,0 +1,26 @@
+# pixi environments
+.pixi/*
+!.pixi/config.toml
+# pycache
+__pycache__/
+# distribution files
+dist/
+# vscode settings
+.vscode/
+# cache
+cache/
+# test and archive files
+test.txt
+agr_http
+src_old/
+# environment variables
+.env
+# wheel files
+*.whl

geneharmony-0.3.0/CLAUDE.md ADDED

@@ -0,0 +1,142 @@
+# CLAUDE.md
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+## What this is
+An async Python wrapper around the [Alliance of Genome Resources](https://www.alliancegenome.org) (AGR) REST API and its bulk-download files. It resolves gene symbols/IDs to canonical genes with an in-memory index built from AGR's `GENE-TSV-COMBINED` bulk file, fetches API data concurrently, and downloads/parses bulk files. `Annotator` (`annotator.py`) is the user-facing surface — `download` / `ingest_annotation` / `annotate`. The pipeline is developed interactively in `src/notebook.ipynb`; `src/main.py` is still an empty placeholder.
+This replaces an earlier design (kept in `src_old/` for reference) that used an external Rust `gene_normalizer` binary and a filesystem cache of per-endpoint DataFrames. Gene normalization is now pure Python against the bulk file — there is **no external binary and no `.env`** to set up (`.env.example` is stale).
+## Environment & commands
+The package is **pip-installable** (`pip install geneharmony`); runtime dependencies live in `[project.dependencies]` in `pyproject.toml` (`httpx`, `pydantic` v2, `pandas` 3.x, `pyarrow`) and are the single source of truth for end users. **pixi** (conda-forge) is the **development** environment only — end users don't need it. There are **no defined tasks, tests, or linters** — `[tasks]` in `pixi.toml` is empty. Published floor is Python 3.12+ (`requires-python`); the pixi dev env runs 3.14 (free-threaded).
+```bash
+pixi install                                # create the environment from pixi.lock
+pixi run python <script>                    # run Python inside the env
+pixi run jupyter lab src/notebook.ipynb     # open the driver notebook
+```
+**Notebook outputs are stripped on commit** via a git clean filter (`*.ipynb filter=nbstrip` in `.gitattributes`), keeping `notebook.ipynb` diffs to code only. The filter is repo-local config, so enable it once per clone:
+```bash
+git config filter.nbstrip.clean "pixi run jupyter nbconvert --clear-output --to notebook --stdin --stdout --log-level=ERROR"
+git config filter.nbstrip.smudge cat
+```
+**Imports are flat** (`from client import ...`, `from normalizer import ...`) with no package prefix, so code only resolves when **`src/` is the working directory / on `sys.path`**. The notebook runs there; standalone scripts must too (e.g. `sys.path.insert(0, "src")`).
+## Conventions
+The owner prefers **strong typing** (modern `type` aliases, `Final`, `Self`, `enum`, `NamedTuple`) and **minimal comments** — comments are for user-facing docstrings or genuinely non-obvious logic, not narration. Match this when editing.
+## Architecture
+Two AGR hosts are in play, and keeping them separate matters:
+- **API host** `https://www.alliancegenome.org/api` — JSON endpoints, served by `AGRClient`.
+- **Download host** `https://download.alliancegenome.org` — large bulk files, served by `Downloader`. The `/downloads` *listing* is JSON on the API host; only the file bytes come from the download host (the `s3Url` field).
+### `client.py` — `AGRClient`
+Async API client wrapping one pooled `httpx.AsyncClient` bounded by an `asyncio.Semaphore` (default `max_concurrent=5`). `get_json` / `get_text` issue GETs through `_get`, which **retries** transient failures (statuses 429/502/503/504, timeouts, transport errors) with full-jitter exponential backoff, honoring `Retry-After`. `list_downloads()` fetches `/downloads` and validates it into `list[DownloadFile]` via a module-level `TypeAdapter`. Async context manager; `aclose()` closes the pool.
+### `downloader.py` — `Downloader`
+Host-agnostic streaming file downloader (deliberately **not** AGR-specific — usable for any absolute URL). `download(url, dest, *, expected_size=None)` streams to disk in 1 MiB chunks via `client.stream(...)`, writes to a `.part` temp file then `os.replace`s into place (atomic; `dest` only ever exists complete). Bytes are written **verbatim** — `.gz` stays compressed on disk and is inflated at ingest. Retries transport/transient-status errors with the same backoff style as the client. Raises `SizeMismatchError` on a post-download size check.
+> Caveat: the `/downloads` listing's `size` is the **uncompressed** size, but we fetch the compressed `.gz`. So do **not** pass `expected_size=DownloadFile.size` — it will always mismatch. The skip-if-already-downloaded shortcut only engages when `expected_size` is given, so it's currently a no-op for these files.
+### `ingest.py` — decompress + parse
+Free functions, no class. Read straight from the compressed file into memory (no decompressed copy on disk):
+- `load_json_gz(path) -> Any` — `gzip.open` + `json.load`.
+- `load_tsv_gz(path, dtype=None) -> pd.DataFrame` — `pd.read_csv(sep="\t", comment="#", compression="gzip")`. The `comment="#"` skips AGR's leading metadata block; `dtype=str` is needed for the gene file so digit-like symbols/IDs aren't coerced to numbers.
+Note: `sep="\t"` in real source files uses pandas' fast C engine. (A spurious "falling back to the python engine / regex separator" warning only appears when `\t` is passed through a shell `-c` invocation, where it arrives as a literal two-char `\t` — not a real-code problem.)
+### `normalizer.py` — the gene index
+`load_gene_index(path) -> GeneIndex` reads `GENE-TSV-COMBINED` (`dtype=str`, all columns) and precomputes O(1) lookups. The file is ~914k rows across 9 species; the real ID column is **`GeneId`** (not `GeneID`); `GeneSymbol` is **not unique** even within a species; `GeneSecondaryIds` are deprecated IDs.
+`build_gene_index` fills five `dict[str, list[int]]` tables mapping each identifier form to row positions in the retained `records` DataFrame, tagged by `MatchKind`:
+```
+PRIMARY_ID  >  SECONDARY_ID  >  OFFICIAL_SYMBOL  >  SYNONYM  >  CROSS_REFERENCE   (precedence, high→low)
+```
+**Precedence is the `MatchKind` enum's definition order**, surfaced by `for kind in MatchKind` in `_resolve` (there is no explicit sort; the int values are documentation only). Keep members declared best-to-worst.
+`CROSS_REFERENCE` indexes `GeneCrossReferences` — pipe-separated external IDs (`NCBI_Gene:`, `ENSEMBL:`, `UniProtKB:`, `RefSeq:`, …), populated on ~43% of rows. Keys are the **full `PREFIX:ID` token**, not the bare ID (bare IDs collide across databases — e.g. `601309` is both OMIM and MIM). It is the **lowest** tier, so cross-refs only resolve queries no better identifier matches. Family/class databases that fan one token out to hundreds of genes (`_XREF_EXCLUDED_PREFIXES`: `PANTHER`, `TreeFam`, `ExPASy`, `TCDB`) are **excluded**. Some `RGD:*` tokens also appear in `GeneSecondaryIds`; the higher `SECONDARY_ID` tier wins, so the duplication is harmless.
+`GeneIndex.lookup(queries, *, taxon=None, limit=1, case_insensitive=False) -> pd.DataFrame`:
+- `queries`: `str | list[str]` (scalar is wrapped). Built for batches of thousands.
+- Returns one row per `(query, match)`: columns `query`, `match_kind`, then **all** gene-record columns. Input order preserved.
+- **Unmatched queries are retained** with null `match_kind` (and `NaN` record cells) so misses are visible. Filter with `df.match_kind.notna()`.
+- `limit` caps matches per query (`limit=None` = all → useful for top-N when no taxon disambiguates). Pipeline is **precedence-sort → taxon-filter → dedup → limit** (i.e. filter-then-limit). A kind filter, if ever wanted, is left to the caller on the returned frame — note that user-side kind-filtering happens *after* `limit`.
+- Within a tier, matches are ordered by **row/file order** (not species priority).
+- `case_insensitive=True` consults a **lazily-built** casefolded copy of the tables (zero cost otherwise). Default is case-sensitive, since case can be meaningful across species (human `TP53` vs mouse `Trp53`).
+### `taxa.py` — taxon resolution (`taxa.json` + `resolve_taxon`)
+Self-contained species resolution with no dependency on the gene index (not even pandas); `normalizer.py` and `annotator.py` both import `resolve_taxon` from here. `taxa.json` (in `src/`, read via a path relative to `taxa.py`) holds one entry per species with `id` / `species` / `common`. At import, `_load_taxa` builds the `_TAXA` tuple of `Taxon` records, then `_TAXON_BY_ALIAS` maps every casefolded string tied to a species — full `NCBITaxon:` ID, bare number, species name, each common name — back to its `Taxon`. The full ID is itself an alias, so this one map covers taxon-ID lookups too (no separate by-ID table). `Taxon` is a `NamedTuple` (`id`, `species`, `common`) with `.number` (bare ID, no prefix) and `.common_name` (first common name, or `None`) properties. `resolve_taxon(value)` strips + casefolds against `_TAXON_BY_ALIAS` and returns the matched `Taxon`, raising `ValueError` on unknown — so any alias resolves to the whole record, and callers pull the part they need (`.id`, `.species`, `.common_name`, `.number`). This lets users pass `"human"`, `"9606"`, `"Homo sapiens"`, or `"NCBITaxon:9606"` interchangeably. Ambiguous aliases (`frog`, `xenopus`) are intentionally omitted.
+**Naming convention:** `taxon` is an *alias string* a user passes; `taxon_id` is a *resolved canonical* `NCBITaxon:` string (`normalize` resolves its `taxon` arg to `taxon_id` once up front via `resolve_taxon(...).id`, so a bad taxon fails fast); a `Taxon` is the record object.
+To append taxon data to a frame, `taxon_mapper(field)` builds a `value -> field` callable for `df[col].map(...)`. `field` is a `TaxonField` (`StrEnum`: `ID` / `NUMBER` / `SPECIES` / `COMMON_NAME`, each value the matching `Taxon` attribute name). The returned callable accepts any alias (so it works on `normalize`'s `Taxon` column, orthology's `Gene2SpeciesTaxonID`, etc.) and yields `None` for unknown or non-string cells — e.g. `df["common_name"] = df["Taxon"].map(taxon_mapper(TaxonField.COMMON_NAME))`.
+### `datasets.py` — dataset registry
+`AGRDataset` (`StrEnum`: `GENE`, `ORTHOLOGY`, `PHENOTYPES`, `ALLELES`) is the typed handle users pass to `download`/`annotate`. `GENE` is the bulk file backing the gene index — downloaded through the same `download` path as the rest, but built into a `GeneIndex` rather than joined onto a base frame. `DATASETS` maps each member to a `DatasetSpec(bulk, api)`:
+- `BulkSpec(data_type, file_type, data_sub_type, join_key)` — a selector into the `/downloads` listing (matched at runtime; never a hardcoded `s3Url`) plus the column its rows join on.
+- `ApiSpec(endpoint, join_key, project)` — a per-gene endpoint template and a `project(gene_id, result) -> dict` that flattens one API result into one flat row.
+Each dataset has **one natural backend** so output columns stay predictable: orthology → bulk TSV (`ORTHOLOGY-ALLIANCE`, keyed `Gene1ID`; rich columns `Gene2ID`/`Gene2SpeciesTaxonID`/…); phenotypes & alleles → per-gene API (their bulk files are nested per-MOD JSON, deferred). The API orthology projector mirrors the bulk column names so either backend yields the same shape. Adding a dataset = add an enum member + a `DatasetSpec` (+ a projector for API ones).
+### `store.py` — atomic Parquet persistence
+`write_parquet(df, path)` / `read_parquet(path, *, decode_json=())` back the bulk, per-gene API, and external-annotation caches. Writes go through a same-dir temp file then `os.replace` (atomic), zstd-compressed; object columns holding dicts/lists are JSON-encoded to strings on write (`decode_json` reverses it on read). Ported/trimmed from `src_old/cache.py`. Current projections are flat, so the nested-encoding path is a safety net.
+### `annotator.py` — `Annotator` (the user-facing surface)
+Ties the lower-level pieces together over a resolved cache dir, with a lazily-built `GeneIndex` cached for the instance's lifetime. Intended use is an **iterative filter-then-requery traversal** — one primary AGR dataset per `annotate` call — so cardinality stays under the caller's control (no cross-dataset Cartesian blow-up).
+Cache resolution lives here: module-level `default_cache_dir()` is `$XDG_CACHE_HOME/geneharmony` (falling back to `~/.cache/geneharmony`); `resolve_cache_dir(cache_dir)` returns the user's override or that default, creating it (called once in `__init__`). Pass a path to share a cache between users; omit it for the home default.
+The gene index is built lazily in `_gene_index()` and memoized in `self._index` for the instance's lifetime: it calls `download(AGRDataset.GENE)` (the gene bulk file goes to `bulk/gene.parquet` like any other dataset), then builds the index from that parquet in memory (~2 s; **not** pickled — a pickled `GeneIndex` is ~5× the parquet yet saves only ~0.6 s). The parquet is existence-cached and becomes **stale across AGR releases**; `download(AGRDataset.GENE, refresh=True)` (or deleting it) rebuilds.
+- `Annotator(cache_dir=None, *, client=None, downloader=None)` — one instance serves any genome; `taxon` is passed per call (`normalize`/`ingest_annotation`/`annotate`), defaulting to `None` (no species filter).
+- `async download(dataset, *, refresh=False) -> Path` — resolve the bulk file, stream it, convert TSV→Parquet under `bulk/<dataset>.parquet`, **delete the `.tsv.gz`**; no-op if the parquet exists. `AGRClient`/`Downloader` are created only when a fetch is actually needed (`AsyncExitStack`); passing your own leaves their lifecycle to you. Raises if the dataset has no bulk spec.
+- `async normalize(genes, *, taxon=None, limit=1, case_insensitive=False)` — passes through to `GeneIndex.lookup`.
+- `async ingest_annotation(source, name, *, gene_id_column, normalize=True, taxon=None, case_insensitive=False, override=False) -> tuple[dict, pd.DataFrame | None]` — store an external table under `external/<name>.parquet`, keyed by canonical `GeneId` (gene ids normalized via the index, replacing the old Rust `normalize_symbols_map`). Ported from `src_old/utils.py`. `gene_id_column` is `str | list[str]`: with a list, the columns are tried **left-to-right per row** and the first identifier that resolves wins (a fallback for tables whose primary ID column has gaps) — normalization is mapping-aware, so a non-null-but-unresolvable value falls through to the next column; `normalize=False` degrades to a plain first-non-null coalesce. The id columns are **kept as-is** and the resolved canonical id is written to a **separate, freshly-added `GeneId` column** (so the input must not already contain a `GeneId` column — that raises); rows where no candidate resolves are dropped and counted. Returns `(summary_dict, unmapped_df)` where `unmapped_df` holds the dropped rows with their original columns (null when the cache hit short-circuits or nothing was dropped).
+- `async annotate(genes, *sources, taxon=None, limit=1, case_insensitive=False) -> DataFrame` — build a base frame (`list[str]` → `normalize`, **misses retained** with null cells; or pass a pre-normalized DataFrame), then **wide left-join** each source onto `GeneId` in order. `limit` is forwarded to `normalize`, capping matches per query (`limit=None` = all) — useful for fanning a symbol out to several genes when no `taxon` disambiguates; each matched gene is annotated as its own base row. An `AGRDataset` contributes its **native** columns (bulk filtered by `join_key ∈ gene_ids`, or per-gene API fetched + cached under `api/<dataset>/<id>.parquet`); a `str` names an ingested annotation and contributes `name.`-prefixed columns. An unknown `str` raises.
+Per-gene API fetches paginate (`limit`/`page`, page-1-for-total, remaining pages concurrent — both phenotypes and alleles paginate) and are cached one Parquet per gene; cache hits skip the network. Clients are created on demand via `AsyncExitStack` (same pattern as `download`).
+### `models.py`
+`DownloadFile` (Pydantic) is the only model — it mirrors a `/downloads` entry verbatim (camelCase fields to match the API), with `size: PositiveInt` and `lastModified: datetime`. The former `In_*` / `Out_*` ortholog/phenotype/allele placeholders are gone; the live API projections live as plain-dict `project` callables in `datasets.py` (no Pydantic).
+### Placeholder
+`main.py` is currently empty — it's the planned user-facing entry point.
+## Data flows
+**Bulk file:** `AGRClient.list_downloads()` (API host) → pick a `DownloadFile` by `dataType`/`fileType` → `Downloader.download(file.s3Url, dest)` (download host) → `ingest.load_tsv_gz` / `load_json_gz`. Don't hardcode URLs — `s3Url` embeds `releaseVersion`, which changes each release; select by intent and resolve at runtime.
+**Gene normalization:** `Annotator._gene_index()` returns a ready `GeneIndex` — `download(AGRDataset.GENE)` yields `bulk/gene.parquet` (downloaded + converted on a cold cache, one-time ~10 s; ~µs lookups) which it builds into the index, memoized on the instance. Then `Annotator.normalize(symbols_or_ids, taxon=..., limit=...)` → `GeneIndex.lookup` → DataFrame of resolved canonical records → proceed with downstream queries using the official `GeneId`. `normalizer.load_gene_index(path)` remains the lower-level "build straight from a TSV path" entry point.
+**Annotation (the user-facing flow):** `Annotator(cache_dir)` → optionally `await download(AGRDataset.X)` for bulk datasets → `await annotate(genes, AGRDataset.X, taxon=...)` returns a wide frame on the *normalized* base → slice it (e.g. `df[df.Gene2SpeciesTaxonID == "NCBITaxon:10090"]["Gene2ID"]`) → feed the slice back into the next `annotate` call. One AGR dataset per call; combine with ingested annotations by name in the same call.
+## Cache / scratch
+Caches default to `~/.cache/geneharmony` (or `$XDG_CACHE_HOME`); the repo-root `cache/` is the manual override used by the notebook and ad-hoc scratch. Layout written by `Annotator`:
+```
+bulk/<dataset>.parquet            # downloaded + converted bulk datasets (incl. bulk/gene.parquet, the gene-index source)
+api/<dataset>/<gene_id>.parquet   # per-gene API results (':' -> '_' in filenames)
+external/<name>.parquet           # ingested annotations
+```
+`agr_http/downloads.json` is a saved snapshot of a `/downloads` listing. `src_old/` is the previous (binary + endpoint-cache) implementation, kept for reference.

geneharmony-0.3.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Lionel Sequeira
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

geneharmony-0.3.0/PKG-INFO ADDED

@@ -0,0 +1,225 @@
+Metadata-Version: 2.4
+Name: geneharmony
+Version: 0.3.0
+Summary: Async toolkit to normalize gene identifiers and annotate gene sets with data from the Alliance of Genome Resources (AGR) or user-ingested datasets.
+Project-URL: Homepage, https://github.com/limenode/geneharmony
+Project-URL: Repository, https://github.com/limenode/geneharmony
+Project-URL: Issues, https://github.com/limenode/geneharmony/issues
+Author-email: Lionel Sequeira <lionelsequeira@gmail.com>
+License-Expression: MIT
+License-File: LICENSE
+Keywords: agr,alliance of genome resources,annotation,bioinformatics,gene,genomics
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Science/Research
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
+Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
+Classifier: Typing :: Typed
+Requires-Python: >=3.12
+Requires-Dist: httpx>=0.28
+Requires-Dist: pandas>=3
+Requires-Dist: pyarrow>=14
+Requires-Dist: pydantic>=2
+Description-Content-Type: text/markdown
+# geneharmony
+An async Python toolkit that normalizes gene identifiers and annotates gene sets using the [Alliance of Genome Resources](https://www.alliancegenome.org) (AGR) REST API and bulk-download files, with functionality to append local annotations.
+It resolves gene symbols and identifiers to canonical genes using an in-memory index built from AGR's bulk gene file, fetches per-gene API data concurrently, and downloads and parses AGR bulk datasets.
+## Highlights
+- **Gene normalization**: Resolve symbols, primary/secondary IDs, synonyms, systematic names, and external cross-references (NCBI, Ensembl, UniProtKB, RefSeq, …) to the appropriate records in the AGR's `GENE-TSV-COMBINED` file.
+- **Nine model organisms**: Human, mouse, rat, zebrafish, fly, worm, yeast, African clawed frog, and western clawed frog.
+- **Concurrent and resilient**: Pooled, rate-limited HTTP with automatic retry/backoff for transient failures.
+- **Transparent caching**: Bulk files, per-gene API results, and ingested annotations are cached as Parquet to expedite repeat runs.
+- **Bring your own data**: Ingest external annotation tables keyed on whatever gene identifier you have; they normalize to canonical AGR genes and join cleanly.
+## Install
+Requires **Python 3.12+**. Install from PyPI with pip or uv — all dependencies (`httpx`, `pydantic` v2, `pandas` 3.x, `pyarrow`) are resolved automatically:
+```bash
+pip install geneharmony
+# or
+uv add geneharmony
+```
+## Development
+Contributors use **pixi** (conda-forge) for a reproducible environment from the lockfile. End users do not need pixi.
+```bash
+# 1. Install the environment from the lockfile
+pixi install
+# 2. Run Python inside the environment
+pixi run python <script>
+# 3. Or open the interactive driver notebook
+pixi run jupyter lab src/notebook.ipynb
+```
+Notebook outputs are stripped from version control via a git clean filter. The filter config is repo-local, so enable it once per clone:
+```bash
+git config filter.nbstrip.clean "pixi run jupyter nbconvert --clear-output --to notebook --stdin --stdout --log-level=ERROR"
+git config filter.nbstrip.smudge cat
+```
+## Usage
+The `Annotator` is the single entry point. It is async, so call its methods with `await` (inside a notebook cell, an `async def`, or `asyncio.run(...)`).
+### Quick start
+```python
+from geneharmony import Annotator, AGRDataset
+ann = Annotator()
+# Resolve gene symbols to canonical AGR records
+genes = await ann.normalize(["TP53", "BRCA1"], taxon="human")
+# Annotate genes with phenotypic information
+annotated_genes = await ann.annotate(["Atp7b", "Ttn"], AGRDataset.PHENOTYPES, taxon="mouse")
+```
+### Resolving genes (`normalize`)
+`normalize` accepts an identifier or a list of identifiers and returns one row per match. Unmatched queries are **retained** with a null `match_kind` so misses stay visible.
+```python
+df = await ann.normalize(
+    ["TP53", "ENSG00000141510", "not_a_gene"],
+    taxon="human",          # any alias: "human", "9606", "Homo sapiens", "NCBITaxon:9606"
+    limit=1,                # max matches per query; use None for all
+    case_insensitive=False, # case can be meaningful (human TP53 vs mouse Trp53)
+)
+resolved = df[df.match_kind.notna()]   # drop the misses
+```
+Matches are ranked by identifier precedence:
+```
+PRIMARY_ID > SECONDARY_ID > OFFICIAL_SYMBOL > SYNONYM > CROSS_REFERENCE
+```
+### Annotating genes (`annotate`)
+`annotate` builds a normalized base frame, then **left joins** one or more sources onto the canonical `GeneId`:
+```python
+from geneharmony import AGRDataset
+orth = await ann.annotate(
+    ["TP53", "BRCA1"],
+    AGRDataset.ORTHOLOGY,
+    taxon="human",
+)
+```
+For chaining annotate calls, the recommended pattern is an **iterative filter-then-requery traversal** — one AGR dataset per call — so result cardinality stays under your control:
+```python
+# 1. Find orthologs of human genes
+orth = await ann.annotate(["TP53", "BRCA1"], AGRDataset.ORTHOLOGY, taxon="human")
+# 2. Keep the mouse orthologs
+mouse = orth.loc[orth.Gene2SpeciesTaxonID == "NCBITaxon:10090", "Gene2ID"].unique()
+# 3. Fetch their phenotypes
+pheno = await ann.annotate(list(mouse), AGRDataset.PHENOTYPES, taxon="mouse")
+```
+#### Available AGR datasets
+| Dataset                 | Backend      | Key columns contributed                                     |
+| ----------------------- | ------------ | ----------------------------------------------------------- |
+| `AGRDataset.ORTHOLOGY`  | Bulk TSV     | `Gene2ID`, `Gene2Symbol`, `Gene2SpeciesTaxonID`, …          |
+| `AGRDataset.PHENOTYPES` | Per-gene API | `phenotypeStatement`, `references`                          |
+| `AGRDataset.ALLELES`    | Per-gene API | `allele_id`, `symbol`, `alterationType`, `variantType`, …   |
+### Orthologs convenience helper
+For the common ortholog case there is a shortcut that returns a tidy subset:
+```python
+orthologs = await ann.get_orthologs(
+    ["TP53", "BRCA1"],
+    taxon="human",
+    target_taxon="mouse",   # optional: filter to one target species
+)
+# -> columns: query, match_kind, Gene2ID, Gene2Symbol, Gene2SpeciesTaxonID
+```
+### Downloading bulk datasets (`download`)
+Bulk datasets are downloaded and converted to Parquet on first use (and cached thereafter). You can pre-fetch one explicitly:
+```python
+path = await ann.download(AGRDataset.ORTHOLOGY)
+# Force a refresh across AGR releases:
+path = await ann.download(AGRDataset.ORTHOLOGY, refresh=True)
+```
+### Ingesting your own annotations (`ingest_annotation`)
+Bring an external table (CSV, TSV, or Parquet file, or a `DataFrame`), normalize its gene identifiers to canonical AGR genes, and store it for joining by name:
+```python
+summary, unmapped = await ann.ingest_annotation(
+    "my_expression_table.csv",
+    name="expression",
+    gene_id_column="symbol",   # or a list of columns, tried left-to-right per row
+    taxon="human",
+)
+# Join it alongside an AGR dataset; its columns are prefixed `expression.`
+df = await ann.annotate(["TP53", "BRCA1"], AGRDataset.ORTHOLOGY, "expression", taxon="human")
+```
+`summary` reports rows in / stored / dropped; `unmapped` holds the rows whose identifiers could not be resolved (with their original columns) so nothing is silently lost.
+### Supported species
+| Common name           | Species                          | Taxon ID           |
+| --------------------- | -------------------------------- | ------------------ |
+| human                 | *Homo sapiens*                   | `NCBITaxon:9606`   |
+| mouse                 | *Mus musculus*                   | `NCBITaxon:10090`  |
+| rat                   | *Rattus norvegicus*              | `NCBITaxon:10116`  |
+| zebrafish             | *Danio rerio*                    | `NCBITaxon:7955`   |
+| fly / fruit fly       | *Drosophila melanogaster*        | `NCBITaxon:7227`   |
+| worm / roundworm      | *Caenorhabditis elegans*         | `NCBITaxon:6239`   |
+| yeast / budding yeast | *Saccharomyces cerevisiae S288C* | `NCBITaxon:559292` |
+| african clawed frog   | *Xenopus laevis*                 | `NCBITaxon:8355`   |
+| western clawed frog   | *Xenopus tropicalis*             | `NCBITaxon:8364`   |
+Any of the aliases above — common name, full species name, bare number, or `NCBITaxon:` ID — can be passed as `taxon`.
+## Caching
+Results are cached so repeat work is fast and largely offline. The cache defaults to `$XDG_CACHE_HOME/geneharmony` (falling back to `~/.cache/geneharmony`); pass a `cache_dir` to `Annotator(...)` to share or relocate it.
+```
+bulk/<dataset>.parquet            # downloaded + converted bulk datasets (incl. the gene index source)
+api/<dataset>/<gene_id>.parquet   # per-gene API results
+external/<name>.parquet           # ingested annotations
+```
+The gene index is built from `bulk/gene.parquet`, which becomes stale across AGR releases. Refresh it with `await ann.download(AGRDataset.GENE, refresh=True)` (or by deleting the file).
+## Acknowledgements & Citation
+This project is a client for data and services provided by the **Alliance of Genome Resources (AGR)**. It is not affiliated with or endorsed by the Alliance. All gene, ortholog, phenotype, and allele data are sourced from AGR and its member model-organism databases, and remain subject to the Alliance's terms of use.
+If you use data obtained through this wrapper, please cite the Alliance of Genome Resources:
+> [Updates to the Alliance of Genome Resources central infrastructure.](https://pubmed.ncbi.nlm.nih.gov/38552170/) 2024. Alliance of Genome Resources Consortium. Genetics. 2024 May 7;227(1):iyae049. doi: 10.1093/genetics/iyae049. PMID: 38552170.
+Please also consult the [Alliance citation and data-usage guidelines](https://www.alliancegenome.org/cite-us) and acknowledge the underlying model-organism databases (e.g. SGD, WormBase, FlyBase, ZFIN, MGI, RGD, Xenbase) as appropriate.