PyPI - allelix - Versions diffs - 1.8.4__tar.gz → 2.0.0__tar.gz - Mend

allelix 1.8.4.tar.gz → 2.0.0.tar.gz


Files changed (84)

{allelix-1.8.4 → allelix-2.0.0}/PKG-INFO RENAMED

@@ -1,13 +1,14 @@
 Metadata-Version: 2.4
 Name: allelix
-Version: 1.8.4
+Version: 2.0.0
 Summary: Open-source genotype analysis toolkit. Format-agnostic ingestion, database-agnostic annotation, offline-first.
-Author-email: dial481 <dial481@users.noreply.github.com>
+Author: Allelix
+Maintainer-email: dial481 <dial481@users.noreply.github.com>
 License-Expression: AGPL-3.0-or-later
 Project-URL: Homepage, https://allelix.io
-Project-URL: Source, https://github.com/dial481/allelix
-Project-URL: Issues, https://github.com/dial481/allelix/issues
-Project-URL: Changelog, https://github.com/dial481/allelix/blob/main/CHANGELOG.md
+Project-URL: Source, https://github.com/allelix/allelix
+Project-URL: Issues, https://github.com/allelix/allelix/issues
+Project-URL: Changelog, https://github.com/allelix/allelix/blob/main/CHANGELOG.md
 Keywords: genomics,genotype,snp,bioinformatics,dna
 Classifier: Development Status :: 5 - Production/Stable
 Classifier: Intended Audience :: Science/Research
@@ -25,6 +26,8 @@ Requires-Dist: pyyaml>=6.0
 Requires-Dist: rich>=13.7
 Provides-Extra: cadd
 Requires-Dist: pysam>=0.22; extra == "cadd"
+Provides-Extra: vcf-index
+Requires-Dist: pysam>=0.22; extra == "vcf-index"
 Provides-Extra: dev
 Requires-Dist: pre-commit>=3.7; extra == "dev"
 Requires-Dist: pytest>=8.0; extra == "dev"
@@ -36,18 +39,20 @@ Dynamic: license-file
 Open-source command-line toolkit for analyzing raw genotype files from consumer DNA testing services. Format-agnostic ingestion, database-agnostic annotation, offline-first.
-> **Status:** Production — six parser formats, four annotators (ClinVar +
-> PharmGKB + GWAS Catalog + SNPedia), three enrichment sources (gnomAD
-> population frequencies + AlphaMissense pathogenicity + CADD
-> deleteriousness), licensable-source gating for commercial users,
-> dual-build ClinVar caches (GRCh37 + GRCh38),
+> **Status:** Production — eight parser formats (including VCF + gVCF),
+> four annotators (ClinVar + ClinPGx + GWAS Catalog + SNPedia), three
+> enrichment sources (gnomAD population frequencies + AlphaMissense
+> pathogenicity + CADD deleteriousness), licensable-source gating for
+> commercial users, dual-build ClinVar caches (GRCh37 + GRCh38),
 > HTML/JSON/terminal reports, methylation + pharmacogenomics focused
 > commands, report diffing, persistent config with commercial-mode
 > safety switch. Build auto-detection from position data (ADR-0021).
-> No regex on prose anywhere in production. **Latest: v1.8.4** —
-> `--no-cadd` flag for licensing exclusion parity.
+> No regex on prose anywhere in production. **Latest: v2.0.0** — VCF +
+> gVCF parser with multi-sample handling, batched annotation pipeline
+> for WGS scale, FTDNA Illumina raw parser, R-4 ClinVar CLNSIG drift CI
+> test, CLI package restructure.
 > Release notes:
-> [`CHANGELOG.md`](https://github.com/dial481/allelix/blob/main/CHANGELOG.md).
+> [`CHANGELOG.md`](https://github.com/allelix/allelix/blob/main/CHANGELOG.md).
 ## Quickstart
@@ -61,6 +66,15 @@ allelix db update
 # Analyze a genotype file
 allelix analyze your_genotype_file.txt --output report.html
+# VCF / gVCF input — same command, auto-detected
+allelix analyze your_wgs.vcf.gz --output report.html
+# Multi-sample VCF — pick which sample to analyze
+allelix analyze trio.vcf.gz --sample HG002 --output report.html
+# Filter to a custom panel (rsIDs + gene names, one per line; '#' comments and blank lines ignored)
+allelix analyze your_genotype_file.txt --filter-file my_panel.txt --output report.html
 ```
 Requires Python 3.11+. See [Development](#development) for source installs and running tests.
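
Note on the `--filter-file` format added in this hunk: the comment describes it as rsIDs and gene names, one per line, with `#` comments and blank lines ignored. A minimal Python sketch of a reader for that format follows. The function name `read_panel`, the rule for telling rsIDs from gene names (an `rs` prefix followed by digits), and the case normalization are assumptions for illustration, not the package's implementation. The diff does not say whether trailing inline `#` comments are supported, so the sketch skips whole-line comments only.

```python
def read_panel(text: str) -> tuple[set[str], set[str]]:
    """Split panel-file text into (rsids, genes). Hypothetical helper."""
    rsids: set[str] = set()
    genes: set[str] = set()
    for raw in text.splitlines():
        entry = raw.strip()
        # Blank lines and whole-line '#' comments are ignored.
        if not entry or entry.startswith("#"):
            continue
        lowered = entry.lower()
        # Assumed rule: "rs" followed only by digits is an rsID.
        if lowered.startswith("rs") and lowered[2:].isdigit():
            rsids.add(lowered)
        else:
            # Assumed normalization: gene symbols are upper-cased.
            genes.add(entry.upper())
    return rsids, genes


example = """\
# cardiology panel
rs429358
rs7412

APOE
mthfr
"""
rsids, genes = read_panel(example)
```

With the example text above, `rsids` holds the two rsIDs and `genes` holds `APOE` and `MTHFR`. No regular expressions are used, in keeping with the project's stated "no regex on prose" rule.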
@@ -75,16 +89,20 @@ Requires Python 3.11+. See [Development](#development) for source installs and r
 | Family Tree DNA | ✓ | CSV, double-quoted fields, concatenated genotype. Build 37 default. |
 | MyHeritage DNA | ✓ | CSV, same structure as FTDNA. Detected by "MyHeritage" in comment header. Handles double-double-quoted field variant. |
 | Living DNA | ✓ | Tab-delimited despite `.csv` extension. Handles AX-, AFFX-prefixed and CHR:POS positional SNP IDs. |
+| FTDNA Illumina raw | ✓ | Tab-delimited variant of the FTDNA export (distinct from the CSV format above). `RSID/CHROMOSOME/POSITION/RESULT` columns. Build 37 default. |
+| VCF / gVCF | ✓ | REF/ALT encoding, `0/1` genotype notation. Plain VCF: absence at a position means reference. gVCF: explicit reference blocks (lines with `<NON_REF>` ALT and `END=` INFO) are skipped — they match nothing in any annotation database. Multi-sample files require `--sample <ID>`. Streams via stdlib; `.vcf.gz` handled transparently. Optional `pip install allelix[vcf-index]` enables pysam-backed tabix random access for fast `extract --snps` on huge VCFs. |
 Adding a new format means adding one file to `allelix/parsers/` and registering an instance in the `PARSERS` list in `allelix/parsers/__init__.py`.
-### v2 roadmap
+### v2.1+ roadmap
-| Format | Notes |
+| Feature | Notes |
 |---|---|
-| VCF | REF/ALT encoding, `0/1` genotype notation, absence-means-reference semantics. Architecturally different from array parsers — 4-6M variants per file, streaming + batch SQL required. |
 | Per-source scoring | Magnitude breakdown by database. Users see which source drove the composite score. |
+| Annotator-level strand awareness (R-1) | Strand-flip matching wired into every annotator's carrier check. Basic `compare` strand support shipped in v1.1; full annotator integration deferred from v2.0.0. |
+| Good / Bad / Neutral repute | Per-annotation repute field. Reframes the report from "here's what's wrong" to "here's your full picture." Requires Annotation model change + renderer updates. |
 | PLINK import | Read .bed/.bim/.fam as an input format (complement to the v1.7.0 export). |
+| PharmCAT integration | Wrap CPIC's PharmCAT as an optional external engine for star-allele / diplotype calling. Requires VCF input (shipped in v2.0.0). |
 | Genome Watchtower | Real-time variant monitoring via database delta feeds. Privacy-preserving: server publishes universal feed, matching happens locally against your deviation set. Replaces full re-analysis with millisecond set intersection. |
 ## Supported Databases
@@ -92,22 +110,32 @@ Adding a new format means adding one file to `allelix/parsers/` and registering
 | Database | Status | Notes |
 |---|---|---|
 | ClinVar (GRCh37 + GRCh38) | ✓ | Public domain (NCBI). SNVs + indels + multi-allelic sites. **Both builds cached**; `analyze` dispatches by detected build (ADR-0021). Carrier rule (ADR-0007) requires the user to carry the ALT allele. Indel-anchor protection (ADR-0011) prevents single-base array readouts from matching anchor-base indels. |
-| PharmGKB | ✓ | CC BY-SA 4.0. Clinical annotations only — single-rsid SNVs; star alleles and haplotypes deferred (ADR-0009). **Primary non-finding filter is the ClinVar REF carrier rule (ADR-0023):** if ClinVar publishes a single-base REF for the rsid and the user is homozygous for it, the row is suppressed. CPIC's `(rsid, base) → function_class` join (ADR-0020) survives as a secondary tier for rsids ClinVar doesn't catalog. Earlier prose tiers (ADR-0013, ADR-0017, ADR-0018) are superseded. |
-| CPIC (per-allele function table) | ✓ | Internal data source for the PharmGKB filter. Fetched from `api.cpicpgx.org` at `db update` time. Used to populate the `pharmgkb_allele_function` table — not surfaced to end users as its own annotator. |
+| ClinPGx (formerly PharmGKB) | ✓ | CC BY-SA 4.0. Clinical annotations only — single-rsid SNVs; star alleles and haplotypes deferred (ADR-0009). **Primary non-finding filter is the ClinVar REF carrier rule (ADR-0023):** if ClinVar publishes a single-base REF for the rsid and the user is homozygous for it, the row is suppressed. CPIC's `(rsid, base) → function_class` join (ADR-0020) survives as a secondary tier for rsids ClinVar doesn't catalog. Earlier prose tiers (ADR-0013, ADR-0017, ADR-0018) are superseded. |
+| CPIC (per-allele function table) | ✓ | Internal data source for the ClinPGx filter. Fetched from `api.cpicpgx.org` at `db update` time. Used to populate the `pharmgkb_allele_function` table — not surfaced to end users as its own annotator. |
 | SNPedia | ✓ | CC BY-NC-SA 3.0 US. Pre-built cache downloaded via `db update` (~216K wiki pages, ~105K genotype rows). If the SNPedia database is absent, analysis runs without it. For commercial use, pass `--exclude-snpedia` — `analyze` runs using all other databases and omits SNPedia annotations. The cache can also be rebuilt from source via `scripts/scrape_snpedia.py` + `scripts/parse_snpedia.py`. |
 | GWAS Catalog | ✓ | Public domain (EBI/NHGRI). Trait–SNP associations with p-values and effect sizes. Carrier rule (ADR-0007) requires the user to carry the risk allele. P-value magnitude scoring (ADR-0024) maps continuous p-values to the 0–10 scale; unknown-risk-allele entries fire on rsID match alone but are capped at 3.0. |
 | gnomAD | ✓ | ODbL v1.0. **Enrichment annotator** — adds population allele frequency context to existing annotations. Shows how common each variant is in the general population (~16M exome variants from 730K individuals). A pathogenic variant that 35% of people carry reads very differently from one seen in 0.001%. Pre-built cache downloaded via `db update` (~6GB on disk). Use `--no-gnomad` to skip. |
 | AlphaMissense | ✓ | CC BY 4.0. **Enrichment annotator** — adds DeepMind's protein-structure-based pathogenicity predictions to existing annotations. Scores 71M missense variants on a 0–1 scale: <0.34 = likely benign, >0.564 = likely pathogenic. Complements ClinVar's expert classifications with computational predictions — especially valuable for variants ClinVar hasn't reviewed yet. Pre-built cache downloaded via `db update` (~8GB on disk). Use `--no-alphamissense` to skip. |
 | CADD | ✓ | LicenseRef-CADD (non-commercial). **Enrichment annotator** — adds PHRED-scaled deleteriousness scores from CADD v1.7. Ranks how deleterious any single-nucleotide variant is using 100+ annotation tracks (coding, non-coding, regulatory). PHRED 10 = top 10% most deleterious, 20 = top 1%, 30 = top 0.1%. **Opt-in** — disabled by default (`sources.cadd = false`). Enable via `allelix db update --cadd` or `allelix config set sources.cadd true`. Use `--no-cadd` to skip enrichment for a single run. Pre-built cache (~5 GB on disk, ~120M variant keys). Full mode available via pysam for GRCh38 data (`options.cadd_full = true`). Cache mode covers the large majority of variants present in gnomAD, AlphaMissense, and ClinVar — nearly every position allelix can annotate from its other databases. For genotyping chip data (23andMe, AncestryDNA, MyHappyGenes, etc.), cache and full mode produce effectively identical results because chip probes overwhelmingly target known, cataloged variants. Full mode adds coverage for novel or private variants that appear only in whole-genome or whole-exome sequencing data and are not in any pre-computed database. If your input is a genotyping chip file, cache mode is all you need. |
-### Known PharmGKB limitation: reference-genotype rows where ClinVar and CPIC both lack data
+### Build coverage asymmetry (GRCh37 vs GRCh38)
+ClinVar dispatches per-build (ADR-0021) and ships with both GRCh37 and GRCh38 caches. The two caches are essentially equivalent in coverage: 2,896,063 rows / 2,645,206 distinct rsIDs in GRCh37 vs 2,896,102 / 2,645,243 in GRCh38 — a difference of 39 rows.
+Despite that equivalence, the same person's WGS file produces noticeably more annotations as GRCh37 than as GRCh38. The mechanism is in the resolution step, not in upstream-data shape. Position-keyed rsID resolution requires exact `(chromosome, position, ref, alt)` alignment between the user's variant call and ClinVar's stored row. Lift-over between builds does not preserve that alignment perfectly: the `~0.4%` of the genome where the reference assembly was rebuilt has different REF alleles, multi-allelic sites split differently, and some benchmark VCF positions drop out entirely in the GRCh38 lift. Each misalignment loses one resolution, which in turn loses all the rsID-keyed downstream annotations that rsID would have driven (ClinVar's own carrier annotation, plus GWAS Catalog, SNPedia, and ClinPGx).
+Real GIAB HG002 benchmark, surviving the default `--min-magnitude 5.0` filter: GRCh37 surfaces 520 distinct rsIDs across all sources, GRCh38 surfaces 341. The two sets overlap on 331 rsIDs; 189 are GRCh37-only and 10 are GRCh38-only — pure asymmetric loss in the GRCh38 lift, not different upstream coverage. The unfiltered totals (65,965 vs 4,867) magnify the same pattern at lower magnitudes, mostly via GWAS-Catalog weak-association rows.
+If you have a choice of build for the input, GRCh37 surfaces more annotations today on rsID-less VCFs that flow through position-keyed resolution. GRCh38 still surfaces every ClinVar carrier hit it has an exact alignment for.
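
The figures quoted in this section are internally consistent, which a few lines of arithmetic confirm. The sketch below uses only numbers that appear in the text; the variable names are illustrative.

```python
# ClinVar cache sizes per build, as quoted: (rows, distinct rsIDs).
grch37_rows, grch37_rsids = 2_896_063, 2_645_206
grch38_rows, grch38_rsids = 2_896_102, 2_645_243
row_gap = grch38_rows - grch37_rows        # difference in cached rows

# HG002 rsIDs surviving --min-magnitude 5.0, as quoted.
shared, only_37, only_38 = 331, 189, 10
total_37 = shared + only_37                # rsIDs surfaced on GRCh37
total_38 = shared + only_38                # rsIDs surfaced on GRCh38

# Share of the GRCh37 result set that has no GRCh38 counterpart.
lost_fraction = only_37 / total_37
```

The row gap comes out at 39 and the per-build totals at 520 and 341, matching the text. Roughly 36% of the rsIDs surfaced on GRCh37 have no counterpart in the GRCh38 run.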
+### Known ClinPGx limitation: reference-genotype rows where ClinVar and CPIC both lack data
-ADR-0022 + ADR-0023: a tiny residual of PharmGKB rows may appear in reports even when the user is homozygous reference. PharmGKB publishes one annotation per genotype including the reference homozygote, and for the reference-homozygote row to be suppressed Allelix needs structured data on the variant from either:
+ADR-0022 + ADR-0023: a tiny residual of ClinPGx rows may appear in reports even when the user is homozygous reference. ClinPGx publishes one annotation per genotype including the reference homozygote, and for the reference-homozygote row to be suppressed Allelix needs structured data on the variant from either:
 - **ClinVar's REF allele** (the primary filter — see ADR-0023). Covers any rsID ClinVar catalogs.
 - **CPIC's per-allele function table** (the secondary fallback — see ADR-0020). Covers rsIDs CPIC has classified.
-For the rare rsID where PharmGKB has an annotation but *neither* ClinVar nor CPIC has data, the row emits. These are identifiable by a homozygous-reference genotype combined with "decreased risk," "may have a typical response," or similar comparative language. They are an upstream data gap, not an Allelix bug — we surface them honestly rather than hide them behind a curated exclusion list (which would recreate the maintenance trap the v0.5–v0.7 prose filters were trying to escape).
+For the rare rsID where ClinPGx has an annotation but *neither* ClinVar nor CPIC has data, the row emits. These are identifiable by a homozygous-reference genotype combined with "decreased risk," "may have a typical response," or similar comparative language. They are an upstream data gap, not an Allelix bug — we surface them honestly rather than hide them behind a curated exclusion list (which would recreate the maintenance trap the v0.5–v0.7 prose filters were trying to escape).
 The CFTR × ivacaftor leak (~30+ rows on real data, pre-v0.7.3) is fixed by the ADR-0023 ClinVar REF check: CPIC's CFTR vocabulary (`"ivacaftor responsive"`) doesn't match the four-class enum the secondary tier expects, but ClinVar publishes REF for every CFTR rsID, so the primary tier catches them universally.
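
The two-tier suppression described in this section can be sketched as a pair of lookups. This is an illustration under stated assumptions, not the code shipped in the package. The function name `suppress_reference_row`, the dictionary shapes, the `"normal function"` label, and the rule that the secondary tier suppresses only when every carried base has that label are all hypothetical. The diff states only that the primary tier checks ClinVar's single-base REF and that the secondary tier joins on `(rsid, base) → function_class`. The rsIDs in the example are made up.

```python
def suppress_reference_row(
    rsid: str,
    genotype: str,
    clinvar_ref: dict[str, str],
    cpic_function: dict[tuple[str, str], str],
) -> bool:
    """Return True when a ClinPGx row should be hidden as a non-finding."""
    alleles = set(genotype)
    if not alleles:
        return False  # no call: nothing to suppress
    # Primary tier: ClinVar publishes a single-base REF for this rsID.
    ref = clinvar_ref.get(rsid)
    if ref is not None and len(ref) == 1:
        return alleles == {ref}  # suppress only homozygous reference
    # Secondary tier (assumed rule): every carried base is classified
    # by CPIC, and every class is "normal function".
    classes = [cpic_function.get((rsid, base)) for base in alleles]
    if all(c is not None for c in classes):
        return all(c == "normal function" for c in classes)
    # Neither source has structured data: the row is emitted.
    return False


clinvar_ref = {"rs0000001": "C"}                       # made-up rsIDs
cpic_function = {("rs0000002", "G"): "normal function"}
leak = suppress_reference_row("rs0000003", "AA", clinvar_ref, cpic_function)
```

Here `leak` is `False`: an rsID that neither table knows about is emitted, which is the residual case this section documents.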
@@ -164,7 +192,7 @@ Not all databases are equal in size. `allelix db update` downloads them all by d
 | Database | On disk | Download time | What it adds |
 |---|---|---|---|
 | ClinVar (GRCh37 + GRCh38) | ~900MB | 1–2 min | Core clinical variant classifications. Required. |
-| PharmGKB + CPIC | ~6MB | seconds | Drug-gene interactions. |
+| ClinPGx + CPIC | ~6MB | seconds | Drug-gene interactions. |
 | GWAS Catalog | ~200MB | 1–2 min | Trait-SNP associations from genome-wide studies. |
 | gnomAD | ~6GB | 5–15 min | Population allele frequencies (how common is this variant?). |
 | AlphaMissense | ~8GB | 5–15 min | Missense pathogenicity predictions (how likely to break protein function?). |
@@ -182,14 +210,14 @@ Allelix source code is licensed under the **GNU Affero General Public License v3
 |---|---|---|---|
 | ClinVar | NCBI | Public domain | No restrictions |
 | GWAS Catalog | EBI/NHGRI | Public domain | No restrictions |
-| PharmGKB | pharmgkb.org | CC BY-SA 4.0 | Attribution required |
-| CPIC | cpicpgx.org | CC BY-SA 4.0 | Attribution required. Per-allele function data fetched from `api.cpicpgx.org` at `db update` time; used internally for the PharmGKB non-finding filter (ADR-0020), not surfaced as its own annotator. |
+| ClinPGx (formerly PharmGKB) | clinpgx.org | CC BY-SA 4.0 | Attribution required |
+| CPIC | cpicpgx.org | CC BY-SA 4.0 | Attribution required. Per-allele function data fetched from `api.cpicpgx.org` at `db update` time; used internally for the ClinPGx non-finding filter (ADR-0020), not surfaced as its own annotator. |
 | SNPedia | snpedia.com | CC BY-NC-SA 3.0 US | Attribution required, **non-commercial only**. Use `--exclude-snpedia` to omit. |
 | gnomAD | gnomad.broadinstitute.org | ODbL v1.0 | Attribution required. Population allele frequencies for context; not a clinical annotator. Use `--no-gnomad` to omit. |
 | AlphaMissense | zenodo.org/records/10813168 | CC BY 4.0 | Attribution required. Cheng et al., Science 2023. Missense variant pathogenicity predictions. Use `--no-alphamissense` to omit. |
 | CADD | cadd.gs.washington.edu | LicenseRef-CADD | Attribution required, **non-commercial by default**. Commercial licenses available from UW CoMotion. Opt-in via `allelix db update --cadd`. Use `--no-cadd` to omit. |
-**Commercial users:** When `license.commercial = true`, non-commercial sources are gated by a three-state permission model. SNPedia is permanently blocked (no commercial license is available). CADD is blocked by default but can be unlocked — the University of Washington offers commercial licenses at `https://els2.comotion.uw.edu/product/cadd-scores`; after purchasing, assert your license with `allelix config set license.cadd true` to re-enable CADD in commercial mode. All other databases (ClinVar, PharmGKB, GWAS Catalog, gnomAD, AlphaMissense) are compatible with commercial use. `allelix config show` displays the permission state for each source.
+**Commercial users:** When `license.commercial = true`, non-commercial sources are gated by a three-state permission model. SNPedia is permanently blocked (no commercial license is available). CADD is blocked by default but can be unlocked — the University of Washington offers commercial licenses at `https://els2.comotion.uw.edu/product/cadd-scores`; after purchasing, assert your license with `allelix config set license.cadd true` to re-enable CADD in commercial mode. All other databases (ClinVar, ClinPGx, GWAS Catalog, gnomAD, AlphaMissense) are compatible with commercial use. `allelix config show` displays the permission state for each source.
 ### SNPedia data download
@@ -218,17 +246,17 @@ None of these are scraping errors. They are editorial inconsistencies on the sou
 ## Architecture & Design Decisions
-The "why" behind major design choices lives in [`docs/adr/`](https://github.com/dial481/allelix/blob/main/docs/adr/README.md) as Architecture Decision Records. Read these before proposing changes that touch the parser/annotator interfaces, the regulatory posture, or the data-handling model.
+The "why" behind major design choices lives in [`docs/adr/`](https://github.com/allelix/allelix/blob/main/docs/adr/README.md) as Architecture Decision Records. Read these before proposing changes that touch the parser/annotator interfaces, the regulatory posture, or the data-handling model.
 Notable load-bearing ADRs:
 - **ADR-0016 — Data Classification Principle.** Classification reads structured fields only. Regex on prose is forbidden in production code.
-- **ADR-0020 — CPIC API as the per-allele function source.** The PharmGKB non-finding filter is a table join keyed on `(rsid, base) → clinicalfunctionalstatus`, sourced from CPIC's structured API. Supersedes the prose-extraction tiers from earlier versions (ADR-0017, ADR-0018).
+- **ADR-0020 — CPIC API as the per-allele function source.** The ClinPGx non-finding filter is a table join keyed on `(rsid, base) → clinicalfunctionalstatus`, sourced from CPIC's structured API. Supersedes the prose-extraction tiers from earlier versions (ADR-0017, ADR-0018).
 - **ADR-0007 — Genotype matching requires the user to carry the ALT allele.** Applies to ClinVar.
-- **ADR-0009 — PharmGKB matches the user's exact normalized diploid call.**
+- **ADR-0009 — ClinPGx matches the user's exact normalized diploid call.**
 - **ADR-0015 — Mock data generators are the contract.** Fixture shape must mirror real data shape; invariants tested.
-Release history: see [`CHANGELOG.md`](https://github.com/dial481/allelix/blob/main/CHANGELOG.md).
+Release history: see [`CHANGELOG.md`](https://github.com/allelix/allelix/blob/main/CHANGELOG.md).
 ## Development
@@ -248,4 +276,4 @@ The pre-commit hook enforces `ruff check` + `ruff format --check`. If a commit i
 ## License
-AGPL-3.0-or-later. See `LICENSE`.
+GNU Affero General Public License v3.0 or later (AGPL-3.0-or-later). See `LICENSE`.
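
For the VCF / gVCF parser row introduced in this release, the skip rule for gVCF reference blocks (a `<NON_REF>` ALT together with an `END=` INFO key) can be sketched with the standard library alone. The function names below are hypothetical and this is not the package's parser. It only illustrates the rule as the README states it, and it assumes that "handled transparently" for `.vcf.gz` can be approximated by checking the file extension.

```python
import gzip
import io
from typing import Iterator, TextIO


def is_reference_block(fields: list[str]) -> bool:
    """True for a gVCF reference block: ALT is <NON_REF> and INFO has END=."""
    alt, info = fields[4], fields[7]
    has_end = any(item.startswith("END=") for item in info.split(";"))
    return alt == "<NON_REF>" and has_end


def variant_records(handle: TextIO) -> Iterator[list[str]]:
    """Stream tab-split data lines, skipping headers and reference blocks."""
    for line in handle:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8 or is_reference_block(fields):
            continue
        yield fields


def open_vcf(path: str) -> TextIO:
    """Open plain or gzip-compressed VCF as text (extension-based, assumed)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


sample = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n"
    "1\t100\t.\tA\t<NON_REF>\t.\t.\tEND=199\tGT\t0/0\n"
    "1\t200\trs1\tC\tT,<NON_REF>\t50\tPASS\t.\tGT\t0/1\n"
)
kept = list(variant_records(io.StringIO(sample)))
```

Only the second data line survives: a real variant in a gVCF lists `<NON_REF>` alongside a concrete ALT allele, so an exact match on `<NON_REF>` alone singles out the reference blocks.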

{allelix-1.8.4 → allelix-2.0.0}/README.md RENAMED

@@ -2,18 +2,20 @@
 Open-source command-line toolkit for analyzing raw genotype files from consumer DNA testing services. Format-agnostic ingestion, database-agnostic annotation, offline-first.
-> **Status:** Production — six parser formats, four annotators (ClinVar +
-> PharmGKB + GWAS Catalog + SNPedia), three enrichment sources (gnomAD
-> population frequencies + AlphaMissense pathogenicity + CADD
-> deleteriousness), licensable-source gating for commercial users,
-> dual-build ClinVar caches (GRCh37 + GRCh38),
+> **Status:** Production — eight parser formats (including VCF + gVCF),
+> four annotators (ClinVar + ClinPGx + GWAS Catalog + SNPedia), three
+> enrichment sources (gnomAD population frequencies + AlphaMissense
+> pathogenicity + CADD deleteriousness), licensable-source gating for
+> commercial users, dual-build ClinVar caches (GRCh37 + GRCh38),
 > HTML/JSON/terminal reports, methylation + pharmacogenomics focused
 > commands, report diffing, persistent config with commercial-mode
 > safety switch. Build auto-detection from position data (ADR-0021).
-> No regex on prose anywhere in production. **Latest: v1.8.4** —
-> `--no-cadd` flag for licensing exclusion parity.
+> No regex on prose anywhere in production. **Latest: v2.0.0** — VCF +
+> gVCF parser with multi-sample handling, batched annotation pipeline
+> for WGS scale, FTDNA Illumina raw parser, R-4 ClinVar CLNSIG drift CI
+> test, CLI package restructure.
 > Release notes:
-> [`CHANGELOG.md`](https://github.com/dial481/allelix/blob/main/CHANGELOG.md).
+> [`CHANGELOG.md`](https://github.com/allelix/allelix/blob/main/CHANGELOG.md).
 ## Quickstart
@@ -27,6 +29,15 @@ allelix db update
 # Analyze a genotype file
 allelix analyze your_genotype_file.txt --output report.html
+# VCF / gVCF input — same command, auto-detected
+allelix analyze your_wgs.vcf.gz --output report.html
+# Multi-sample VCF — pick which sample to analyze
+allelix analyze trio.vcf.gz --sample HG002 --output report.html
+# Filter to a custom panel (rsIDs + gene names, one per line; '#' comments and blank lines ignored)
+allelix analyze your_genotype_file.txt --filter-file my_panel.txt --output report.html
 ```
 Requires Python 3.11+. See [Development](#development) for source installs and running tests.
@@ -41,16 +52,20 @@ Requires Python 3.11+. See [Development](#development) for source installs and r
 | Family Tree DNA | ✓ | CSV, double-quoted fields, concatenated genotype. Build 37 default. |
 | MyHeritage DNA | ✓ | CSV, same structure as FTDNA. Detected by "MyHeritage" in comment header. Handles double-double-quoted field variant. |
 | Living DNA | ✓ | Tab-delimited despite `.csv` extension. Handles AX-, AFFX-prefixed and CHR:POS positional SNP IDs. |
+| FTDNA Illumina raw | ✓ | Tab-delimited variant of the FTDNA export (distinct from the CSV format above). `RSID/CHROMOSOME/POSITION/RESULT` columns. Build 37 default. |
+| VCF / gVCF | ✓ | REF/ALT encoding, `0/1` genotype notation. Plain VCF: absence at a position means reference. gVCF: explicit reference blocks (lines with `<NON_REF>` ALT and `END=` INFO) are skipped — they match nothing in any annotation database. Multi-sample files require `--sample <ID>`. Streams via stdlib; `.vcf.gz` handled transparently. Optional `pip install allelix[vcf-index]` enables pysam-backed tabix random access for fast `extract --snps` on huge VCFs. |
 Adding a new format means adding one file to `allelix/parsers/` and registering an instance in the `PARSERS` list in `allelix/parsers/__init__.py`.
-### v2 roadmap
+### v2.1+ roadmap
-| Format | Notes |
+| Feature | Notes |
 |---|---|
-| VCF | REF/ALT encoding, `0/1` genotype notation, absence-means-reference semantics. Architecturally different from array parsers — 4-6M variants per file, streaming + batch SQL required. |
 | Per-source scoring | Magnitude breakdown by database. Users see which source drove the composite score. |
+| Annotator-level strand awareness (R-1) | Strand-flip matching wired into every annotator's carrier check. Basic `compare` strand support shipped in v1.1; full annotator integration deferred from v2.0.0. |
+| Good / Bad / Neutral repute | Per-annotation repute field. Reframes the report from "here's what's wrong" to "here's your full picture." Requires Annotation model change + renderer updates. |
 | PLINK import | Read .bed/.bim/.fam as an input format (complement to the v1.7.0 export). |
+| PharmCAT integration | Wrap CPIC's PharmCAT as an optional external engine for star-allele / diplotype calling. Requires VCF input (shipped in v2.0.0). |
 | Genome Watchtower | Real-time variant monitoring via database delta feeds. Privacy-preserving: server publishes universal feed, matching happens locally against your deviation set. Replaces full re-analysis with millisecond set intersection. |
 ## Supported Databases
@@ -58,22 +73,32 @@ Adding a new format means adding one file to `allelix/parsers/` and registering
 | Database | Status | Notes |
 |---|---|---|
 | ClinVar (GRCh37 + GRCh38) | ✓ | Public domain (NCBI). SNVs + indels + multi-allelic sites. **Both builds cached**; `analyze` dispatches by detected build (ADR-0021). Carrier rule (ADR-0007) requires the user to carry the ALT allele. Indel-anchor protection (ADR-0011) prevents single-base array readouts from matching anchor-base indels. |
-| PharmGKB | ✓ | CC BY-SA 4.0. Clinical annotations only — single-rsid SNVs; star alleles and haplotypes deferred (ADR-0009). **Primary non-finding filter is the ClinVar REF carrier rule (ADR-0023):** if ClinVar publishes a single-base REF for the rsid and the user is homozygous for it, the row is suppressed. CPIC's `(rsid, base) → function_class` join (ADR-0020) survives as a secondary tier for rsids ClinVar doesn't catalog. Earlier prose tiers (ADR-0013, ADR-0017, ADR-0018) are superseded. |
-| CPIC (per-allele function table) | ✓ | Internal data source for the PharmGKB filter. Fetched from `api.cpicpgx.org` at `db update` time. Used to populate the `pharmgkb_allele_function` table — not surfaced to end users as its own annotator. |
+| ClinPGx (formerly PharmGKB) | ✓ | CC BY-SA 4.0. Clinical annotations only — single-rsid SNVs; star alleles and haplotypes deferred (ADR-0009). **Primary non-finding filter is the ClinVar REF carrier rule (ADR-0023):** if ClinVar publishes a single-base REF for the rsid and the user is homozygous for it, the row is suppressed. CPIC's `(rsid, base) → function_class` join (ADR-0020) survives as a secondary tier for rsids ClinVar doesn't catalog. Earlier prose tiers (ADR-0013, ADR-0017, ADR-0018) are superseded. |
+| CPIC (per-allele function table) | ✓ | Internal data source for the ClinPGx filter. Fetched from `api.cpicpgx.org` at `db update` time. Used to populate the `pharmgkb_allele_function` table — not surfaced to end users as its own annotator. |
 | SNPedia | ✓ | CC BY-NC-SA 3.0 US. Pre-built cache downloaded via `db update` (~216K wiki pages, ~105K genotype rows). If the SNPedia database is absent, analysis runs without it. For commercial use, pass `--exclude-snpedia` — `analyze` runs using all other databases and omits SNPedia annotations. The cache can also be rebuilt from source via `scripts/scrape_snpedia.py` + `scripts/parse_snpedia.py`. |
 | GWAS Catalog | ✓ | Public domain (EBI/NHGRI). Trait–SNP associations with p-values and effect sizes. Carrier rule (ADR-0007) requires the user to carry the risk allele. P-value magnitude scoring (ADR-0024) maps continuous p-values to the 0–10 scale; unknown-risk-allele entries fire on rsID match alone but are capped at 3.0. |
 | gnomAD | ✓ | ODbL v1.0. **Enrichment annotator** — adds population allele frequency context to existing annotations. Shows how common each variant is in the general population (~16M exome variants from 730K individuals). A pathogenic variant that 35% of people carry reads very differently from one seen in 0.001%. Pre-built cache downloaded via `db update` (~6GB on disk). Use `--no-gnomad` to skip. |
 | AlphaMissense | ✓ | CC BY 4.0. **Enrichment annotator** — adds DeepMind's protein-structure-based pathogenicity predictions to existing annotations. Scores 71M missense variants on a 0–1 scale: <0.34 = likely benign, >0.564 = likely pathogenic. Complements ClinVar's expert classifications with computational predictions — especially valuable for variants ClinVar hasn't reviewed yet. Pre-built cache downloaded via `db update` (~8GB on disk). Use `--no-alphamissense` to skip. |
 | CADD | ✓ | LicenseRef-CADD (non-commercial). **Enrichment annotator** — adds PHRED-scaled deleteriousness scores from CADD v1.7. Ranks how deleterious any single-nucleotide variant is using 100+ annotation tracks (coding, non-coding, regulatory). PHRED 10 = top 10% most deleterious, 20 = top 1%, 30 = top 0.1%. **Opt-in** — disabled by default (`sources.cadd = false`). Enable via `allelix db update --cadd` or `allelix config set sources.cadd true`. Use `--no-cadd` to skip enrichment for a single run. Pre-built cache (~5 GB on disk, ~120M variant keys). Full mode available via pysam for GRCh38 data (`options.cadd_full = true`). Cache mode covers the large majority of variants present in gnomAD, AlphaMissense, and ClinVar — nearly every position allelix can annotate from its other databases. For genotyping chip data (23andMe, AncestryDNA, MyHappyGenes, etc.), cache and full mode produce effectively identical results because chip probes overwhelmingly target known, cataloged variants. Full mode adds coverage for novel or private variants that appear only in whole-genome or whole-exome sequencing data and are not in any pre-computed database. If your input is a genotyping chip file, cache mode is all you need. |
-### Known PharmGKB limitation: reference-genotype rows where ClinVar and CPIC both lack data
+### Build coverage asymmetry (GRCh37 vs GRCh38)
+ClinVar dispatches per-build (ADR-0021) and ships with both GRCh37 and GRCh38 caches. The two caches are essentially equivalent in coverage: 2,896,063 rows / 2,645,206 distinct rsIDs in GRCh37 vs 2,896,102 / 2,645,243 in GRCh38 — a difference of 39 rows.
+Despite that equivalence, the same person's WGS file produces noticeably more annotations as GRCh37 than as GRCh38. The mechanism is in the resolution step, not in upstream-data shape. Position-keyed rsID resolution requires exact `(chromosome, position, ref, alt)` alignment between the user's variant call and ClinVar's stored row. Lift-over between builds does not preserve that alignment perfectly: the `~0.4%` of the genome where the reference assembly was rebuilt has different REF alleles, multi-allelic sites split differently, and some benchmark VCF positions drop out entirely in the GRCh38 lift. Each misalignment loses one resolution, which in turn loses all the rsID-keyed downstream annotations that rsID would have driven (ClinVar's own carrier annotation, plus GWAS Catalog, SNPedia, and ClinPGx).
+Real GIAB HG002 benchmark, surviving the default `--min-magnitude 5.0` filter: GRCh37 surfaces 520 distinct rsIDs across all sources, GRCh38 surfaces 341. The two sets overlap on 331 rsIDs; 189 are GRCh37-only and 10 are GRCh38-only — pure asymmetric loss in the GRCh38 lift, not different upstream coverage. The unfiltered totals (65,965 vs 4,867) magnify the same pattern at lower magnitudes, mostly via GWAS-Catalog weak-association rows.
+If you have a choice of build for the input, GRCh37 surfaces more annotations today on rsID-less VCFs that flow through position-keyed resolution. GRCh38 still surfaces every ClinVar carrier hit it has an exact alignment for.
+### Known ClinPGx limitation: reference-genotype rows where ClinVar and CPIC both lack data
-ADR-0022 + ADR-0023: a tiny residual of PharmGKB rows may appear in reports even when the user is homozygous reference. PharmGKB publishes one annotation per genotype including the reference homozygote, and for the reference-homozygote row to be suppressed Allelix needs structured data on the variant from either:
+ADR-0022 + ADR-0023: a tiny residual of ClinPGx rows may appear in reports even when the user is homozygous reference. ClinPGx publishes one annotation per genotype including the reference homozygote, and for the reference-homozygote row to be suppressed Allelix needs structured data on the variant from either:
 - **ClinVar's REF allele** (the primary filter — see ADR-0023). Covers any rsID ClinVar catalogs.
 - **CPIC's per-allele function table** (the secondary fallback — see ADR-0020). Covers rsIDs CPIC has classified.
-For the rare rsID where PharmGKB has an annotation but *neither* ClinVar nor CPIC has data, the row emits. These are identifiable by a homozygous-reference genotype combined with "decreased risk," "may have a typical response," or similar comparative language. They are an upstream data gap, not an Allelix bug — we surface them honestly rather than hide them behind a curated exclusion list (which would recreate the maintenance trap the v0.5–v0.7 prose filters were trying to escape).
+For the rare rsID where ClinPGx has an annotation but *neither* ClinVar nor CPIC has data, the row emits. These are identifiable by a homozygous-reference genotype combined with "decreased risk," "may have a typical response," or similar comparative language. They are an upstream data gap, not an Allelix bug — we surface them honestly rather than hide them behind a curated exclusion list (which would recreate the maintenance trap the v0.5–v0.7 prose filters were trying to escape).
 The CFTR × ivacaftor leak (~30+ rows on real data, pre-v0.7.3) is fixed by the ADR-0023 ClinVar REF check: CPIC's CFTR vocabulary (`"ivacaftor responsive"`) doesn't match the four-class enum the secondary tier expects, but ClinVar publishes REF for every CFTR rsID, so the primary tier catches them universally.
@@ -130,7 +155,7 @@ Not all databases are equal in size. `allelix db update` downloads them all by d
 | Database | On disk | Download time | What it adds |
 |---|---|---|---|
 | ClinVar (GRCh37 + GRCh38) | ~900MB | 1–2 min | Core clinical variant classifications. Required. |
-| PharmGKB + CPIC | ~6MB | seconds | Drug-gene interactions. |
+| ClinPGx + CPIC | ~6MB | seconds | Drug-gene interactions. |
 | GWAS Catalog | ~200MB | 1–2 min | Trait-SNP associations from genome-wide studies. |
 | gnomAD | ~6GB | 5–15 min | Population allele frequencies (how common is this variant?). |
 | AlphaMissense | ~8GB | 5–15 min | Missense pathogenicity predictions (how likely to break protein function?). |
@@ -148,14 +173,14 @@ Allelix source code is licensed under the **GNU Affero General Public License v3
 |---|---|---|---|
 | ClinVar | NCBI | Public domain | No restrictions |
 | GWAS Catalog | EBI/NHGRI | Public domain | No restrictions |
-| PharmGKB | pharmgkb.org | CC BY-SA 4.0 | Attribution required |
-| CPIC | cpicpgx.org | CC BY-SA 4.0 | Attribution required. Per-allele function data fetched from `api.cpicpgx.org` at `db update` time; used internally for the PharmGKB non-finding filter (ADR-0020), not surfaced as its own annotator. |
+| ClinPGx (formerly PharmGKB) | clinpgx.org | CC BY-SA 4.0 | Attribution required |
+| CPIC | cpicpgx.org | CC BY-SA 4.0 | Attribution required. Per-allele function data fetched from `api.cpicpgx.org` at `db update` time; used internally for the ClinPGx non-finding filter (ADR-0020), not surfaced as its own annotator. |
 | SNPedia | snpedia.com | CC BY-NC-SA 3.0 US | Attribution required, **non-commercial only**. Use `--exclude-snpedia` to omit. |
 | gnomAD | gnomad.broadinstitute.org | ODbL v1.0 | Attribution required. Population allele frequencies for context; not a clinical annotator. Use `--no-gnomad` to omit. |
 | AlphaMissense | zenodo.org/records/10813168 | CC BY 4.0 | Attribution required. Cheng et al., Science 2023. Missense variant pathogenicity predictions. Use `--no-alphamissense` to omit. |
 | CADD | cadd.gs.washington.edu | LicenseRef-CADD | Attribution required, **non-commercial by default**. Commercial licenses available from UW CoMotion. Opt-in via `allelix db update --cadd`. Use `--no-cadd` to omit. |
-**Commercial users:** When `license.commercial = true`, non-commercial sources are gated by a three-state permission model. SNPedia is permanently blocked (no commercial license is available). CADD is blocked by default but can be unlocked — the University of Washington offers commercial licenses at `https://els2.comotion.uw.edu/product/cadd-scores`; after purchasing, assert your license with `allelix config set license.cadd true` to re-enable CADD in commercial mode. All other databases (ClinVar, PharmGKB, GWAS Catalog, gnomAD, AlphaMissense) are compatible with commercial use. `allelix config show` displays the permission state for each source.
+**Commercial users:** When `license.commercial = true`, non-commercial sources are gated by a three-state permission model. SNPedia is permanently blocked (no commercial license is available). CADD is blocked by default but can be unlocked — the University of Washington offers commercial licenses at `https://els2.comotion.uw.edu/product/cadd-scores`; after purchasing, assert your license with `allelix config set license.cadd true` to re-enable CADD in commercial mode. All other databases (ClinVar, ClinPGx, GWAS Catalog, gnomAD, AlphaMissense) are compatible with commercial use. `allelix config show` displays the permission state for each source.
 ### SNPedia data download
@@ -184,17 +209,17 @@ None of these are scraping errors. They are editorial inconsistencies on the sou
 ## Architecture & Design Decisions
-The "why" behind major design choices lives in [`docs/adr/`](https://github.com/dial481/allelix/blob/main/docs/adr/README.md) as Architecture Decision Records. Read these before proposing changes that touch the parser/annotator interfaces, the regulatory posture, or the data-handling model.
+The "why" behind major design choices lives in [`docs/adr/`](https://github.com/allelix/allelix/blob/main/docs/adr/README.md) as Architecture Decision Records. Read these before proposing changes that touch the parser/annotator interfaces, the regulatory posture, or the data-handling model.
 Notable load-bearing ADRs:
 - **ADR-0016 — Data Classification Principle.** Classification reads structured fields only. Regex on prose is forbidden in production code.
-- **ADR-0020 — CPIC API as the per-allele function source.** The PharmGKB non-finding filter is a table join keyed on `(rsid, base) → clinicalfunctionalstatus`, sourced from CPIC's structured API. Supersedes the prose-extraction tiers from earlier versions (ADR-0017, ADR-0018).
+- **ADR-0020 — CPIC API as the per-allele function source.** The ClinPGx non-finding filter is a table join keyed on `(rsid, base) → clinicalfunctionalstatus`, sourced from CPIC's structured API. Supersedes the prose-extraction tiers from earlier versions (ADR-0017, ADR-0018).
 - **ADR-0007 — Genotype matching requires the user to carry the ALT allele.** Applies to ClinVar.
-- **ADR-0009 — PharmGKB matches the user's exact normalized diploid call.**
+- **ADR-0009 — ClinPGx matches the user's exact normalized diploid call.**
 - **ADR-0015 — Mock data generators are the contract.** Fixture shape must mirror real data shape; invariants tested.
-Release history: see [`CHANGELOG.md`](https://github.com/dial481/allelix/blob/main/CHANGELOG.md).
+Release history: see [`CHANGELOG.md`](https://github.com/allelix/allelix/blob/main/CHANGELOG.md).
 ## Development
@@ -214,4 +239,4 @@ The pre-commit hook enforces `ruff check` + `ruff format --check`. If a commit i
 ## License
-AGPL-3.0-or-later. See `LICENSE`.
+GNU Affero General Public License v3.0 or later (AGPL-3.0-or-later). See `LICENSE`.
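
Reviewer note on the "Build coverage asymmetry" section added in this README diff: the exact-match behaviour it describes can be illustrated with a minimal sketch. Everything below is hypothetical (the index contents, the `resolve_rsid` helper, and the placeholder rsID are not taken from the package); it only shows why a REF allele that differs after lift-over yields no resolution under an exact `(chromosome, position, ref, alt)` key.

```python
from __future__ import annotations

# Hypothetical stand-in for a per-build ClinVar position index.
# Keys are exact (chromosome, position, ref, alt) tuples; the rsID is a placeholder.
PositionKey = tuple[str, int, str, str]

DEMO_INDEX: dict[PositionKey, str] = {
    ("1", 1000, "A", "G"): "rsDEMO1",
}


def resolve_rsid(index: dict[PositionKey, str], key: PositionKey) -> str | None:
    """Return the rsID for an exact key match, or None when any field differs."""
    return index.get(key)


# Same site, same ALT, but the call carries a different REF allele
# (as can happen where the reference assembly changed between builds).
exact_call: PositionKey = ("1", 1000, "A", "G")
shifted_ref_call: PositionKey = ("1", 1000, "C", "G")

resolved = resolve_rsid(DEMO_INDEX, exact_call)
unresolved = resolve_rsid(DEMO_INDEX, shifted_ref_call)
```

In this sketch `unresolved` is `None`, so any rsID-keyed lookup that would have followed from `rsDEMO1` never runs for that call. That is the loss pattern the README attributes to the GRCh38 lift.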

{allelix-1.8.4 → allelix-2.0.0}/allelix/__init__.py RENAMED

@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: AGPL-3.0-or-later
-# Copyright (C) 2026 dial481
+# Copyright (C) 2026 Allelix
 """Allelix: open-source genotype analysis toolkit."""
 from __future__ import annotations

{allelix-1.8.4 → allelix-2.0.0}/allelix/annotators/__init__.py RENAMED

@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: AGPL-3.0-or-later
-# Copyright (C) 2026 dial481
+# Copyright (C) 2026 Allelix
 """Annotator registry. Unlike parsers, ALL annotators run on every variant."""
 from __future__ import annotations
@@ -43,7 +43,7 @@ def get_annotators(
     complete 81 GB CADD file). Requires ``pysam`` and a local copy.
     ADR-0023: ClinVar's `reference_for(rsid, build)` is wired into
-    PharmGKB and SNPedia as the primary hom-ref suppression filter — the
+    ClinPGx and SNPedia as the primary hom-ref suppression filter — the
     REF allele lookup universally determines whether the user is
     homozygous reference (and thus a non-finding for that variant).
     """

{allelix-1.8.4 → allelix-2.0.0}/allelix/annotators/alphamissense.py RENAMED

@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: AGPL-3.0-or-later
-# Copyright (C) 2026 dial481
+# Copyright (C) 2026 Allelix
 """AlphaMissense variant pathogenicity enrichment.
 AlphaMissense is not a clinical annotator — it does not produce
@@ -226,3 +226,32 @@ class AlphaMissenseAnnotator(Annotator):
                 if score is not None:
                     result[(rsid, alt)] = (score, cls)
         return result
+    def bulk_lookup_by_position(
+        self, keys: set[tuple[str, int, str, str]]
+    ) -> dict[tuple[str, int, str, str], tuple[float, str]]:
+        """Return ``{(chrom, pos, ref, alt): (score, class)}`` via PK lookup.
+        Position-keyed fallback for rsID-less VCFs whose ClinVar-resolved
+        rsIDs aren't indexed in the AlphaMissense cache. Hits the
+        ``(chrom, pos, ref, alt)`` primary key directly.
+        """
+        if not keys:
+            return {}
+        conn = self._connection()
+        result: dict[tuple[str, int, str, str], tuple[float, str]] = {}
+        key_list = list(keys)
+        batch_size = _BULK_BATCH_SIZE // 4
+        for i in range(0, len(key_list), batch_size):
+            batch = key_list[i : i + batch_size]
+            clauses = " OR ".join(["(chrom = ? AND pos = ? AND ref = ? AND alt = ?)"] * len(batch))
+            params = [v for k in batch for v in k]
+            rows = conn.execute(
+                f"SELECT chrom, pos, ref, alt, am_pathogenicity, am_class"
+                f" FROM alphamissense_scores WHERE {clauses}",
+                params,
+            ).fetchall()
+            for chrom, pos, ref, alt, score, cls in rows:
+                if score is not None:
+                    result[(chrom, pos, ref, alt)] = (score, cls)
+        return result
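
Reviewer note: the batched `OR`-clause lookup added above can be exercised against an in-memory SQLite table. The sketch below is a standalone approximation, not the package's code: the table schema, the `BATCH_SIZE` value, and the demo rows are assumptions (the diff does not show `_BULK_BATCH_SIZE` or the cache DDL). Each key binds four parameters, which is presumably why the original divides its batch size by four; SQLite caps the number of bound variables per statement (999 by default before 3.32.0).

```python
from __future__ import annotations

import sqlite3

# Assumed value for the demo; the real _BULK_BATCH_SIZE is not visible in the diff.
BATCH_SIZE = 2

PositionKey = tuple[str, int, str, str]


def build_demo_connection() -> sqlite3.Connection:
    """Create an in-memory table shaped like the query in the diff expects (assumed DDL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE alphamissense_scores ("
        " chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,"
        " am_pathogenicity REAL, am_class TEXT,"
        " PRIMARY KEY (chrom, pos, ref, alt))"
    )
    conn.executemany(
        "INSERT INTO alphamissense_scores VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("1", 100, "A", "G", 0.91, "likely_pathogenic"),
            ("1", 200, "C", "T", 0.12, "likely_benign"),
            ("2", 300, "G", "A", None, None),  # row without a score is skipped
        ],
    )
    return conn


def bulk_lookup_by_position(
    conn: sqlite3.Connection, keys: set[PositionKey]
) -> dict[PositionKey, tuple[float, str]]:
    """Return {(chrom, pos, ref, alt): (score, class)} for keys found with a score."""
    if not keys:
        return {}
    result: dict[PositionKey, tuple[float, str]] = {}
    key_list = sorted(keys)  # sorted only to make the demo deterministic
    for i in range(0, len(key_list), BATCH_SIZE):
        batch = key_list[i : i + BATCH_SIZE]
        # One parenthesised clause per key; four bound parameters each.
        clauses = " OR ".join(
            ["(chrom = ? AND pos = ? AND ref = ? AND alt = ?)"] * len(batch)
        )
        params = [value for key in batch for value in key]
        rows = conn.execute(
            "SELECT chrom, pos, ref, alt, am_pathogenicity, am_class"
            f" FROM alphamissense_scores WHERE {clauses}",
            params,
        ).fetchall()
        for chrom, pos, ref, alt, score, cls in rows:
            if score is not None:
                result[(chrom, pos, ref, alt)] = (score, cls)
    return result
```

Only placeholders are interpolated into the SQL string; all values travel as bound parameters, so the f-string does not open an injection path.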

{allelix-1.8.4 → allelix-2.0.0}/allelix/annotators/base.py RENAMED

@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: AGPL-3.0-or-later
-# Copyright (C) 2026 dial481
+# Copyright (C) 2026 Allelix
 """Abstract base class for reference-database annotators."""
 from __future__ import annotations
@@ -11,7 +11,7 @@ from enum import Enum, auto
 from typing import TYPE_CHECKING, ClassVar
 if TYPE_CHECKING:
-    from collections.abc import Callable
+    from collections.abc import Callable, Iterable, Iterator
     from pathlib import Path
     from types import TracebackType
@@ -173,6 +173,22 @@ class Annotator(ABC):
         """
         ...
+    def batch_annotate(self, variants: Iterable[Variant]) -> Iterator[Annotation]:
+        """Annotate a batch of variants. Yields annotations in arrival order.
+        Default implementation loops over ``annotate(variant)`` so any
+        existing annotator works unchanged. Subclasses with rsID-based
+        SQLite lookups should override this with a chunked ``WHERE rsid
+        IN (...)`` query to avoid per-variant round-trips at WGS scale
+        (4-6M variants per VCF).
+        The default keeps the pipeline single-path: callers always use
+        ``batch_annotate``; the loop fallback gives backward compatibility
+        for annotators that haven't grown a batched query path yet.
+        """
+        for variant in variants:
+            yield from self.annotate(variant)
     @abstractmethod
     def is_ready(self) -> bool:
         """Whether the local cache exists and is queryable."""
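
Reviewer note: the `batch_annotate` docstring suggests that rsID-backed annotators override the loop default with a chunked `WHERE rsid IN (...)` query. A minimal sketch of what such an override might look like follows. The `Variant` and `Annotation` classes, the `notes` table, and `chunk_size` are hypothetical stand-ins, not the package's real types; the point is only that the chunked path can preserve arrival order while issuing one query per chunk.

```python
from __future__ import annotations

import sqlite3
from collections.abc import Iterable, Iterator
from dataclasses import dataclass
from itertools import islice


@dataclass(frozen=True)
class Variant:  # hypothetical stand-in for the package's Variant
    rsid: str


@dataclass(frozen=True)
class Annotation:  # hypothetical stand-in for the package's Annotation
    rsid: str
    note: str


def build_notes_db() -> sqlite3.Connection:
    """In-memory demo table; schema and rows are assumptions for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE notes (rsid TEXT, note TEXT)")
    conn.executemany(
        "INSERT INTO notes VALUES (?, ?)",
        [("rsA", "a1"), ("rsA", "a2"), ("rsB", "b1"), ("rsC", "c1")],
    )
    return conn


class LoopAnnotator:
    """Mirrors the default in the diff: batch_annotate loops over annotate()."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def annotate(self, variant: Variant) -> Iterator[Annotation]:
        query = "SELECT note FROM notes WHERE rsid = ? ORDER BY note"
        for (note,) in self.conn.execute(query, (variant.rsid,)):
            yield Annotation(variant.rsid, note)

    def batch_annotate(self, variants: Iterable[Variant]) -> Iterator[Annotation]:
        for variant in variants:
            yield from self.annotate(variant)


class ChunkedAnnotator(LoopAnnotator):
    """Overrides batch_annotate with one IN (...) query per chunk of variants."""

    chunk_size = 500  # assumed; keep below SQLite's bound-variable limit

    def batch_annotate(self, variants: Iterable[Variant]) -> Iterator[Annotation]:
        iterator = iter(variants)
        while chunk := list(islice(iterator, self.chunk_size)):
            rsids = sorted({v.rsid for v in chunk})
            placeholders = ",".join("?" * len(rsids))
            by_rsid: dict[str, list[str]] = {}
            rows = self.conn.execute(
                f"SELECT rsid, note FROM notes WHERE rsid IN ({placeholders})"
                " ORDER BY note",
                rsids,
            )
            for rsid, note in rows:
                by_rsid.setdefault(rsid, []).append(note)
            # Re-walk the chunk so annotations come out in arrival order.
            for variant in chunk:
                for note in by_rsid.get(variant.rsid, []):
                    yield Annotation(variant.rsid, note)
```

Because callers only ever see `batch_annotate`, swapping `LoopAnnotator` for `ChunkedAnnotator` in this sketch changes the number of queries but not the output sequence.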

{allelix-1.8.4 → allelix-2.0.0}/allelix/annotators/cadd.py RENAMED

@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: AGPL-3.0-or-later
-# Copyright (C) 2026 dial481
+# Copyright (C) 2026 Allelix
 """CADD variant deleteriousness enrichment.
 CADD is not a clinical annotator — it does not produce Annotation
