PyPI - allelix - Versions diffs - 1.8.3__tar.gz → 1.9.0__tar.gz - Mend

allelix 1.8.3__tar.gz → 1.9.0__tar.gz


Files changed (72)

{allelix-1.8.3 → allelix-1.9.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: allelix
-Version: 1.8.3
+Version: 1.9.0
 Summary: Open-source genotype analysis toolkit. Format-agnostic ingestion, database-agnostic annotation, offline-first.
 Author-email: dial481 <dial481@users.noreply.github.com>
 License-Expression: AGPL-3.0-or-later
@@ -44,8 +44,8 @@ Open-source command-line toolkit for analyzing raw genotype files from consumer
 > HTML/JSON/terminal reports, methylation + pharmacogenomics focused
 > commands, report diffing, persistent config with commercial-mode
 > safety switch. Build auto-detection from position data (ADR-0021).
-> No regex on prose anywhere in production. **Latest: v1.8.3** —
-> pip install quickstart, workflow hardening, PyPI link fix.
+> No regex on prose anywhere in production. **Latest: v1.9.0** —
+> `--filter-file` flag for custom-panel filtering on `analyze`.
 > Release notes:
 > [`CHANGELOG.md`](https://github.com/dial481/allelix/blob/main/CHANGELOG.md).
@@ -61,6 +61,9 @@ allelix db update
 # Analyze a genotype file
 allelix analyze your_genotype_file.txt --output report.html
+# Filter to a custom panel (rsIDs + gene names, one per line; '#' comments and blank lines ignored)
+allelix analyze your_genotype_file.txt --filter-file my_panel.txt --output report.html
 ```
 Requires Python 3.11+. See [Development](#development) for source installs and running tests.
@@ -98,7 +101,7 @@ Adding a new format means adding one file to `allelix/parsers/` and registering
 | GWAS Catalog | ✓ | Public domain (EBI/NHGRI). Trait–SNP associations with p-values and effect sizes. Carrier rule (ADR-0007) requires the user to carry the risk allele. P-value magnitude scoring (ADR-0024) maps continuous p-values to the 0–10 scale; unknown-risk-allele entries fire on rsID match alone but are capped at 3.0. |
 | gnomAD | ✓ | ODbL v1.0. **Enrichment annotator** — adds population allele frequency context to existing annotations. Shows how common each variant is in the general population (~16M exome variants from 730K individuals). A pathogenic variant that 35% of people carry reads very differently from one seen in 0.001%. Pre-built cache downloaded via `db update` (~6GB on disk). Use `--no-gnomad` to skip. |
 | AlphaMissense | ✓ | CC BY 4.0. **Enrichment annotator** — adds DeepMind's protein-structure-based pathogenicity predictions to existing annotations. Scores 71M missense variants on a 0–1 scale: <0.34 = likely benign, >0.564 = likely pathogenic. Complements ClinVar's expert classifications with computational predictions — especially valuable for variants ClinVar hasn't reviewed yet. Pre-built cache downloaded via `db update` (~8GB on disk). Use `--no-alphamissense` to skip. |
-| CADD | ✓ | LicenseRef-CADD (non-commercial). **Enrichment annotator** — adds PHRED-scaled deleteriousness scores from CADD v1.7. Ranks how deleterious any single-nucleotide variant is using 100+ annotation tracks (coding, non-coding, regulatory). PHRED 10 = top 10% most deleterious, 20 = top 1%, 30 = top 0.1%. **Opt-in** — disabled by default (`sources.cadd = false`). Enable via `allelix db update --cadd` or `allelix config set sources.cadd true`. Pre-built cache (~5 GB on disk, ~120M variant keys). Full mode available via pysam for GRCh38 data (`options.cadd_full = true`). Cache mode covers the large majority of variants present in gnomAD, AlphaMissense, and ClinVar — nearly every position allelix can annotate from its other databases. For genotyping chip data (23andMe, AncestryDNA, MyHappyGenes, etc.), cache and full mode produce effectively identical results because chip probes overwhelmingly target known, cataloged variants. Full mode adds coverage for novel or private variants that appear only in whole-genome or whole-exome sequencing data and are not in any pre-computed database. If your input is a genotyping chip file, cache mode is all you need. |
+| CADD | ✓ | LicenseRef-CADD (non-commercial). **Enrichment annotator** — adds PHRED-scaled deleteriousness scores from CADD v1.7. Ranks how deleterious any single-nucleotide variant is using 100+ annotation tracks (coding, non-coding, regulatory). PHRED 10 = top 10% most deleterious, 20 = top 1%, 30 = top 0.1%. **Opt-in** — disabled by default (`sources.cadd = false`). Enable via `allelix db update --cadd` or `allelix config set sources.cadd true`. Use `--no-cadd` to skip enrichment for a single run. Pre-built cache (~5 GB on disk, ~120M variant keys). Full mode available via pysam for GRCh38 data (`options.cadd_full = true`). Cache mode covers the large majority of variants present in gnomAD, AlphaMissense, and ClinVar — nearly every position allelix can annotate from its other databases. For genotyping chip data (23andMe, AncestryDNA, MyHappyGenes, etc.), cache and full mode produce effectively identical results because chip probes overwhelmingly target known, cataloged variants. Full mode adds coverage for novel or private variants that appear only in whole-genome or whole-exome sequencing data and are not in any pre-computed database. If your input is a genotyping chip file, cache mode is all you need. |
 ### Known PharmGKB limitation: reference-genotype rows where ClinVar and CPIC both lack data
@@ -155,7 +158,7 @@ allelix config set license.commercial true
 allelix config set license.cadd true
 ```
-CLI flags (`--no-gnomad`, `--no-alphamissense`, `--exclude-snpedia`, `--cadd`) override the config for a single run. The config sets the baseline; flags override per-invocation.
+CLI flags (`--no-gnomad`, `--no-alphamissense`, `--no-cadd`, `--exclude-snpedia`, `--cadd`) override the config for a single run. The config sets the baseline; flags override per-invocation.
 ### Database sizes and download times
@@ -187,7 +190,7 @@ Allelix source code is licensed under the **GNU Affero General Public License v3
 | SNPedia | snpedia.com | CC BY-NC-SA 3.0 US | Attribution required, **non-commercial only**. Use `--exclude-snpedia` to omit. |
 | gnomAD | gnomad.broadinstitute.org | ODbL v1.0 | Attribution required. Population allele frequencies for context; not a clinical annotator. Use `--no-gnomad` to omit. |
 | AlphaMissense | zenodo.org/records/10813168 | CC BY 4.0 | Attribution required. Cheng et al., Science 2023. Missense variant pathogenicity predictions. Use `--no-alphamissense` to omit. |
-| CADD | cadd.gs.washington.edu | LicenseRef-CADD | Attribution required, **non-commercial by default**. Commercial licenses available from UW CoMotion. Opt-in via `allelix db update --cadd`. |
+| CADD | cadd.gs.washington.edu | LicenseRef-CADD | Attribution required, **non-commercial by default**. Commercial licenses available from UW CoMotion. Opt-in via `allelix db update --cadd`. Use `--no-cadd` to omit. |
 **Commercial users:** When `license.commercial = true`, non-commercial sources are gated by a three-state permission model. SNPedia is permanently blocked (no commercial license is available). CADD is blocked by default but can be unlocked — the University of Washington offers commercial licenses at `https://els2.comotion.uw.edu/product/cadd-scores`; after purchasing, assert your license with `allelix config set license.cadd true` to re-enable CADD in commercial mode. All other databases (ClinVar, PharmGKB, GWAS Catalog, gnomAD, AlphaMissense) are compatible with commercial use. `allelix config show` displays the permission state for each source.

{allelix-1.8.3 → allelix-1.9.0}/README.md RENAMED

@@ -10,8 +10,8 @@ Open-source command-line toolkit for analyzing raw genotype files from consumer
 > HTML/JSON/terminal reports, methylation + pharmacogenomics focused
 > commands, report diffing, persistent config with commercial-mode
 > safety switch. Build auto-detection from position data (ADR-0021).
-> No regex on prose anywhere in production. **Latest: v1.8.3** —
-> pip install quickstart, workflow hardening, PyPI link fix.
+> No regex on prose anywhere in production. **Latest: v1.9.0** —
+> `--filter-file` flag for custom-panel filtering on `analyze`.
 > Release notes:
 > [`CHANGELOG.md`](https://github.com/dial481/allelix/blob/main/CHANGELOG.md).
@@ -27,6 +27,9 @@ allelix db update
 # Analyze a genotype file
 allelix analyze your_genotype_file.txt --output report.html
+# Filter to a custom panel (rsIDs + gene names, one per line; '#' comments and blank lines ignored)
+allelix analyze your_genotype_file.txt --filter-file my_panel.txt --output report.html
 ```
 Requires Python 3.11+. See [Development](#development) for source installs and running tests.
@@ -64,7 +67,7 @@ Adding a new format means adding one file to `allelix/parsers/` and registering
 | GWAS Catalog | ✓ | Public domain (EBI/NHGRI). Trait–SNP associations with p-values and effect sizes. Carrier rule (ADR-0007) requires the user to carry the risk allele. P-value magnitude scoring (ADR-0024) maps continuous p-values to the 0–10 scale; unknown-risk-allele entries fire on rsID match alone but are capped at 3.0. |
 | gnomAD | ✓ | ODbL v1.0. **Enrichment annotator** — adds population allele frequency context to existing annotations. Shows how common each variant is in the general population (~16M exome variants from 730K individuals). A pathogenic variant that 35% of people carry reads very differently from one seen in 0.001%. Pre-built cache downloaded via `db update` (~6GB on disk). Use `--no-gnomad` to skip. |
 | AlphaMissense | ✓ | CC BY 4.0. **Enrichment annotator** — adds DeepMind's protein-structure-based pathogenicity predictions to existing annotations. Scores 71M missense variants on a 0–1 scale: <0.34 = likely benign, >0.564 = likely pathogenic. Complements ClinVar's expert classifications with computational predictions — especially valuable for variants ClinVar hasn't reviewed yet. Pre-built cache downloaded via `db update` (~8GB on disk). Use `--no-alphamissense` to skip. |
-| CADD | ✓ | LicenseRef-CADD (non-commercial). **Enrichment annotator** — adds PHRED-scaled deleteriousness scores from CADD v1.7. Ranks how deleterious any single-nucleotide variant is using 100+ annotation tracks (coding, non-coding, regulatory). PHRED 10 = top 10% most deleterious, 20 = top 1%, 30 = top 0.1%. **Opt-in** — disabled by default (`sources.cadd = false`). Enable via `allelix db update --cadd` or `allelix config set sources.cadd true`. Pre-built cache (~5 GB on disk, ~120M variant keys). Full mode available via pysam for GRCh38 data (`options.cadd_full = true`). Cache mode covers the large majority of variants present in gnomAD, AlphaMissense, and ClinVar — nearly every position allelix can annotate from its other databases. For genotyping chip data (23andMe, AncestryDNA, MyHappyGenes, etc.), cache and full mode produce effectively identical results because chip probes overwhelmingly target known, cataloged variants. Full mode adds coverage for novel or private variants that appear only in whole-genome or whole-exome sequencing data and are not in any pre-computed database. If your input is a genotyping chip file, cache mode is all you need. |
+| CADD | ✓ | LicenseRef-CADD (non-commercial). **Enrichment annotator** — adds PHRED-scaled deleteriousness scores from CADD v1.7. Ranks how deleterious any single-nucleotide variant is using 100+ annotation tracks (coding, non-coding, regulatory). PHRED 10 = top 10% most deleterious, 20 = top 1%, 30 = top 0.1%. **Opt-in** — disabled by default (`sources.cadd = false`). Enable via `allelix db update --cadd` or `allelix config set sources.cadd true`. Use `--no-cadd` to skip enrichment for a single run. Pre-built cache (~5 GB on disk, ~120M variant keys). Full mode available via pysam for GRCh38 data (`options.cadd_full = true`). Cache mode covers the large majority of variants present in gnomAD, AlphaMissense, and ClinVar — nearly every position allelix can annotate from its other databases. For genotyping chip data (23andMe, AncestryDNA, MyHappyGenes, etc.), cache and full mode produce effectively identical results because chip probes overwhelmingly target known, cataloged variants. Full mode adds coverage for novel or private variants that appear only in whole-genome or whole-exome sequencing data and are not in any pre-computed database. If your input is a genotyping chip file, cache mode is all you need. |
 ### Known PharmGKB limitation: reference-genotype rows where ClinVar and CPIC both lack data
@@ -121,7 +124,7 @@ allelix config set license.commercial true
 allelix config set license.cadd true
 ```
-CLI flags (`--no-gnomad`, `--no-alphamissense`, `--exclude-snpedia`, `--cadd`) override the config for a single run. The config sets the baseline; flags override per-invocation.
+CLI flags (`--no-gnomad`, `--no-alphamissense`, `--no-cadd`, `--exclude-snpedia`, `--cadd`) override the config for a single run. The config sets the baseline; flags override per-invocation.
 ### Database sizes and download times
@@ -153,7 +156,7 @@ Allelix source code is licensed under the **GNU Affero General Public License v3
 | SNPedia | snpedia.com | CC BY-NC-SA 3.0 US | Attribution required, **non-commercial only**. Use `--exclude-snpedia` to omit. |
 | gnomAD | gnomad.broadinstitute.org | ODbL v1.0 | Attribution required. Population allele frequencies for context; not a clinical annotator. Use `--no-gnomad` to omit. |
 | AlphaMissense | zenodo.org/records/10813168 | CC BY 4.0 | Attribution required. Cheng et al., Science 2023. Missense variant pathogenicity predictions. Use `--no-alphamissense` to omit. |
-| CADD | cadd.gs.washington.edu | LicenseRef-CADD | Attribution required, **non-commercial by default**. Commercial licenses available from UW CoMotion. Opt-in via `allelix db update --cadd`. |
+| CADD | cadd.gs.washington.edu | LicenseRef-CADD | Attribution required, **non-commercial by default**. Commercial licenses available from UW CoMotion. Opt-in via `allelix db update --cadd`. Use `--no-cadd` to omit. |
 **Commercial users:** When `license.commercial = true`, non-commercial sources are gated by a three-state permission model. SNPedia is permanently blocked (no commercial license is available). CADD is blocked by default but can be unlocked — the University of Washington offers commercial licenses at `https://els2.comotion.uw.edu/product/cadd-scores`; after purchasing, assert your license with `allelix config set license.cadd true` to re-enable CADD in commercial mode. All other databases (ClinVar, PharmGKB, GWAS Catalog, gnomAD, AlphaMissense) are compatible with commercial use. `allelix config show` displays the permission state for each source.

{allelix-1.8.3 → allelix-1.9.0}/allelix/cli.py RENAMED

@@ -5,6 +5,7 @@
 from __future__ import annotations
 import logging
+import re
 import sys
 import time
 from pathlib import Path
@@ -195,6 +196,35 @@ def _format_from_path(output: Path, override: str | None) -> str:
     )
+_RSID_PATTERN = re.compile(r"^rs\d+$", re.IGNORECASE)
+def _parse_filter_file(path: Path) -> tuple[frozenset[str], frozenset[str]]:
+    r"""Parse a filter file into ``(gene_names, rsids)``.
+    Lines matching ``^rs\d+$`` (case-insensitive) are rsIDs. Everything
+    else is a gene name. Lines starting with ``#`` and blank lines are
+    ignored. Gene names starting with ``RS`` (e.g., RSPO1, RSF1) are
+    correctly classified as gene names, not rsIDs.
+    Input is case-tolerant; output is canonical: rsIDs are normalized to
+    lowercase (``rs1801133``), gene names to uppercase (``MTHFR``). The
+    filter recorded in JSON output therefore looks identical regardless
+    of how the user typed the entries in the filter file.
+    """
+    genes: set[str] = set()
+    rsids: set[str] = set()
+    for raw in path.read_text().splitlines():
+        line = raw.strip()
+        if not line or line.startswith("#"):
+            continue
+        if _RSID_PATTERN.match(line):
+            rsids.add(line.lower())
+        else:
+            genes.add(line.upper())
+    return frozenset(genes), frozenset(rsids)
 def _run_analysis_command(
     file_path: Path,
     fmt: str | None,
@@ -204,6 +234,7 @@ def _run_analysis_command(
     min_magnitude: float,
     category: str | None,
     genes: frozenset[str] | None,
+    rsids: frozenset[str] | None = None,
     build: str | None = None,
     include_benign: bool = False,
     gwas_min_magnitude: float | None = None,
@@ -214,6 +245,7 @@ def _run_analysis_command(
     no_update: bool = False,
     no_gnomad: bool = False,
     no_alphamissense: bool = False,
+    no_cadd: bool = False,
 ) -> None:
     resolved = resolve_data_dir(data_dir)
     if not no_update:
@@ -256,12 +288,13 @@ def _run_analysis_command(
     ready = [a for a in ready if a.name != "alphamissense"]
     cadd_annotator = None
-    from allelix.annotators.cadd import CaddAnnotator
+    if not no_cadd:
+        from allelix.annotators.cadd import CaddAnnotator
-    for a in ready:
-        if isinstance(a, CaddAnnotator):
-            cadd_annotator = a
-            break
+        for a in ready:
+            if isinstance(a, CaddAnnotator):
+                cadd_annotator = a
+                break
     ready = [a for a in ready if a.name != "cadd"]
     if not_ready:
@@ -333,6 +366,7 @@ def _run_analysis_command(
             min_magnitude=min_magnitude,
             category=category,
             genes=genes,
+            rsids=rsids,
             source_min_magnitudes=source_floors,
         )
         from allelix.reports._pipeline import rollup_gwas_duplicates
@@ -354,6 +388,7 @@ def _run_analysis_command(
                 min_magnitude=min_magnitude,
                 category=category,
                 genes=genes,
+                rsids=rsids,
                 source_min_magnitudes=source_floors,
             )
     else:
@@ -371,6 +406,7 @@ def _run_analysis_command(
                 min_magnitude=min_magnitude,
                 category=category,
                 genes=genes,
+                rsids=rsids,
                 source_min_magnitudes=source_floors,
                 diff=diff_result,
                 high_value_no_calls=hv_dicts,
@@ -382,6 +418,7 @@ def _run_analysis_command(
                 min_magnitude=min_magnitude,
                 category=category,
                 genes=genes,
+                rsids=rsids,
                 source_min_magnitudes=source_floors,
                 diff=diff_result,
                 high_value_no_calls=hv_warning_lines,
@@ -558,6 +595,16 @@ _DIFF_OPT = click.option(
         "Not a monitoring tool — use for version-to-version validation."
     ),
 )
+_FILTER_FILE_OPT = click.option(
+    "--filter-file",
+    type=click.Path(exists=True, dir_okay=False, path_type=Path),
+    default=None,
+    help=(
+        "Plain text file with rsIDs and/or gene names (one per line) to "
+        "filter the report. Lines matching '^rs\\d+$' are rsIDs; everything "
+        "else is a gene name. Comments (#) and blank lines are ignored."
+    ),
+)
 _NO_UPDATE_OPT = click.option(
     "--no-update",
     is_flag=True,
@@ -576,6 +623,12 @@ _NO_ALPHAMISSENSE_OPT = click.option(
     default=False,
     help="Skip AlphaMissense variant pathogenicity enrichment.",
 )
+_NO_CADD_OPT = click.option(
+    "--no-cadd",
+    is_flag=True,
+    default=False,
+    help="Skip CADD deleteriousness score enrichment.",
+)
 _BUILD_OPT = click.option(
     "--build",
     type=click.Choice(["grch37", "grch38", "auto"], case_sensitive=False),
@@ -669,9 +722,11 @@ def _emit_build_diagnostics(result: object) -> None:
 @_GWAS_ALL_OPT
 @_EXCLUDE_SNPEDIA_OPT
 @_DIFF_OPT
+@_FILTER_FILE_OPT
 @_NO_UPDATE_OPT
 @_NO_GNOMAD_OPT
 @_NO_ALPHAMISSENSE_OPT
+@_NO_CADD_OPT
 def analyze(
     file_path: Path,
     fmt: str | None,
@@ -687,11 +742,20 @@ def analyze(
     gwas_all: bool,
     exclude_snpedia: bool,
     diff_path: Path | None,
+    filter_file: Path | None,
     no_update: bool,
     no_gnomad: bool,
     no_alphamissense: bool,
+    no_cadd: bool,
 ) -> None:
     """Annotate a genotype file against all ready reference databases."""
+    filter_genes: frozenset[str] | None = None
+    filter_rsids: frozenset[str] | None = None
+    if filter_file is not None:
+        filter_genes, filter_rsids = _parse_filter_file(filter_file)
+        # Empty sets (file had only comments/blanks) still apply — they
+        # mean "match nothing", producing an empty report.
     _run_analysis_command(
         file_path=file_path,
         fmt=fmt,
@@ -700,7 +764,8 @@ def analyze(
         report_format=report_format,
         min_magnitude=min_magnitude,
         category=category,
-        genes=None,
+        genes=filter_genes,
+        rsids=filter_rsids,
         build=_normalize_cli_build(build),
         include_benign=include_benign,
         gwas_min_magnitude=gwas_min_magnitude,
@@ -711,6 +776,7 @@ def analyze(
         no_update=no_update,
         no_gnomad=no_gnomad,
         no_alphamissense=no_alphamissense,
+        no_cadd=no_cadd,
     )
@@ -868,6 +934,7 @@ def compare(file1: Path, file2: Path, fmt1: str | None, fmt2: str | None) -> Non
 @_NO_UPDATE_OPT
 @_NO_GNOMAD_OPT
 @_NO_ALPHAMISSENSE_OPT
+@_NO_CADD_OPT
 def methylation(
     file_path: Path,
     fmt: str | None,
@@ -886,6 +953,7 @@ def methylation(
     no_update: bool,
     no_gnomad: bool,
     no_alphamissense: bool,
+    no_cadd: bool,
 ) -> None:
     """Methylation-pathway-focused report (MTHFR, MTR, MTRR, COMT, CBS, …)."""
     excluded: set[str] = set()
@@ -912,6 +980,7 @@ def methylation(
         no_update=no_update,
         no_gnomad=no_gnomad,
         no_alphamissense=no_alphamissense,
+        no_cadd=no_cadd,
     )
@@ -933,6 +1002,7 @@ def methylation(
 @_NO_UPDATE_OPT
 @_NO_GNOMAD_OPT
 @_NO_ALPHAMISSENSE_OPT
+@_NO_CADD_OPT
 def pharmacogenomics(
     file_path: Path,
     fmt: str | None,
@@ -951,6 +1021,7 @@ def pharmacogenomics(
     no_update: bool,
     no_gnomad: bool,
     no_alphamissense: bool,
+    no_cadd: bool,
 ) -> None:
     """Pharmacogenomics-focused report (annotations from PharmGKB-style sources)."""
     excluded: set[str] = set()
@@ -977,6 +1048,7 @@ def pharmacogenomics(
         no_update=no_update,
         no_gnomad=no_gnomad,
         no_alphamissense=no_alphamissense,
+        no_cadd=no_cadd,
     )
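
The classification rule in `_parse_filter_file` can be tried without touching the filesystem. The sketch below is a minimal stand-in, assuming the same regex and the same normalization as the hunk above. `classify_lines` and the sample panel text are hypothetical names introduced for illustration and are not part of allelix.

```python
import re

# Same pattern as in the diff: the whole line must be "rs" followed by digits.
RSID_PATTERN = re.compile(r"^rs\d+$", re.IGNORECASE)


def classify_lines(text: str) -> tuple[frozenset[str], frozenset[str]]:
    """Hypothetical stand-in for _parse_filter_file that takes text, not a Path."""
    genes: set[str] = set()
    rsids: set[str] = set()
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and lines that START with '#' are skipped.
        if not line or line.startswith("#"):
            continue
        if RSID_PATTERN.match(line):
            rsids.add(line.lower())   # rsIDs are normalized to lowercase
        else:
            genes.add(line.upper())   # everything else is treated as a gene name
    return frozenset(genes), frozenset(rsids)


# Illustrative panel text (made up for this sketch).
panel = """\
# methylation panel
RS1801133
mthfr

RSPO1
rs4680 # trailing comment
"""

genes, rsids = classify_lines(panel)
print(sorted(rsids))
print(sorted(genes))
```

Because the pattern is anchored to the whole stripped line, an entry followed by a trailing `#` comment does not match as an rsID under this rule and is kept as an upper-cased gene name. Only lines that begin with `#` are skipped.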

{allelix-1.8.3 → allelix-1.9.0}/allelix/reports/_pipeline.py RENAMED

@@ -105,6 +105,7 @@ class AnalysisResult:
         min_magnitude: float = 0.0,
         category: str | None = None,
         genes: Iterable[str] | None = None,
+        rsids: Iterable[str] | None = None,
         source_min_magnitudes: dict[str, float] | None = None,
     ) -> list[Annotation]:
         """Apply the standard filters and return a sorted list of annotations.
@@ -117,8 +118,14 @@ class AnalysisResult:
         entry, that value IS the floor for that source — it can raise OR
         lower the global ``min_magnitude``. Sources without an entry use
         the global floor.
+        `genes` and `rsids` combine with OR: when either is provided, an
+        annotation passes if it matches the gene set OR the rsid set.
+        Empty collections (vs None) mean "match nothing" — an empty
+        filter file produces an empty report.
         """
-        gene_set = {g.upper() for g in genes} if genes else None
+        gene_set = {g.upper() for g in genes} if genes is not None else None
+        rsid_set = {r.lower() for r in rsids} if rsids is not None else None
         out: list[Annotation] = []
         for a in self.annotations:
             if (
@@ -133,8 +140,11 @@ class AnalysisResult:
                 continue
             if category is not None and a.category != category:
                 continue
-            if gene_set is not None and (a.gene or "").upper() not in gene_set:
-                continue
+            if gene_set is not None or rsid_set is not None:
+                gene_match = gene_set is not None and (a.gene or "").upper() in gene_set
+                rsid_match = rsid_set is not None and a.rsid.lower() in rsid_set
+                if not gene_match and not rsid_match:
+                    continue
             out.append(a)
         out.sort(key=lambda a: (-a.magnitude, a.rsid))
         return out
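
The OR semantics and the `None` versus empty-collection distinction described in the docstring above can be sketched in isolation. This is an illustrative reduction, assuming only the two fields the filter reads. `Ann`, `passes`, and the sample annotations are hypothetical and stand in for the real `Annotation` class and the filtering method in `_pipeline.py`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Ann:
    """Hypothetical minimal annotation: only the fields the filter reads."""
    rsid: str
    gene: str | None


def passes(a: Ann, genes: Iterable[str] | None, rsids: Iterable[str] | None) -> bool:
    # None means "no filter of this kind"; an empty collection means "match nothing".
    gene_set = {g.upper() for g in genes} if genes is not None else None
    rsid_set = {r.lower() for r in rsids} if rsids is not None else None
    if gene_set is None and rsid_set is None:
        return True  # no panel filter requested at all
    gene_match = gene_set is not None and (a.gene or "").upper() in gene_set
    rsid_match = rsid_set is not None and a.rsid.lower() in rsid_set
    return gene_match or rsid_match  # OR: either kind of match is enough


# Sample data made up for this sketch.
anns = [Ann("rs1801133", "MTHFR"), Ann("rs4680", "COMT"), Ann("rs429358", None)]

both = [a.rsid for a in anns if passes(a, {"mthfr"}, {"RS4680"})]
empty = [a.rsid for a in anns if passes(a, frozenset(), frozenset())]
unfiltered = [a.rsid for a in anns if passes(a, None, None)]
print(both, empty, unfiltered)
```

Under these assumptions, a gene entry and an rsID entry each admit their own annotation, two empty sets admit nothing, and two `None` values leave the list unfiltered.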

{allelix-1.8.3 → allelix-1.9.0}/allelix/reports/html.py RENAMED

@@ -951,6 +951,7 @@ def render_html(
     min_magnitude: float = 0.0,
     category: str | None = None,
     genes: Iterable[str] | None = None,
+    rsids: Iterable[str] | None = None,
     source_min_magnitudes: dict[str, float] | None = None,
     title: str = "Allelix Genotype Report",
     diff: DiffResult | None = None,
@@ -961,6 +962,7 @@ def render_html(
         min_magnitude=min_magnitude,
         category=category,
         genes=genes,
+        rsids=rsids,
         source_min_magnitudes=source_min_magnitudes,
     )
     filtered = rollup_gwas_duplicates(filtered)

{allelix-1.8.3 → allelix-1.9.0}/allelix/reports/json_report.py RENAMED

@@ -103,6 +103,7 @@ def render_json(
     min_magnitude: float = 0.0,
     category: str | None = None,
     genes: Iterable[str] | None = None,
+    rsids: Iterable[str] | None = None,
     source_min_magnitudes: dict[str, float] | None = None,
     diff: DiffResult | None = None,
     high_value_no_calls: list[dict[str, str]] | None = None,
@@ -112,6 +113,7 @@ def render_json(
         min_magnitude=min_magnitude,
         category=category,
         genes=genes,
+        rsids=rsids,
         source_min_magnitudes=source_min_magnitudes,
     )
     filtered = rollup_gwas_duplicates(filtered)
@@ -134,7 +136,8 @@ def render_json(
         "filters": {
             "min_magnitude": min_magnitude,
             "category": category,
-            "genes": sorted(genes) if genes else None,
+            "genes": sorted(genes) if genes is not None else None,
+            "rsids": sorted(rsids) if rsids is not None else None,
         },
         "annotations": [_annotation_dict(a) for a in filtered],
     }

{allelix-1.8.3 → allelix-1.9.0}/allelix/reports/terminal.py RENAMED

@@ -27,6 +27,7 @@ def render_terminal(
     min_magnitude: float = 0.0,
     category: str | None = None,
     genes: Iterable[str] | None = None,
+    rsids: Iterable[str] | None = None,
     source_min_magnitudes: dict[str, float] | None = None,
 ) -> int:
     """Render an AnalysisResult as a Rich table. Returns annotation count.
@@ -38,6 +39,7 @@ def render_terminal(
         min_magnitude=min_magnitude,
         category=category,
         genes=genes,
+        rsids=rsids,
         source_min_magnitudes=source_min_magnitudes,
     )
     filtered = rollup_gwas_duplicates(filtered)

{allelix-1.8.3 → allelix-1.9.0}/allelix.egg-info/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: allelix
-Version: 1.8.3
+Version: 1.9.0
 Summary: Open-source genotype analysis toolkit. Format-agnostic ingestion, database-agnostic annotation, offline-first.
 Author-email: dial481 <dial481@users.noreply.github.com>
 License-Expression: AGPL-3.0-or-later
@@ -44,8 +44,8 @@ Open-source command-line toolkit for analyzing raw genotype files from consumer
 > HTML/JSON/terminal reports, methylation + pharmacogenomics focused
 > commands, report diffing, persistent config with commercial-mode
 > safety switch. Build auto-detection from position data (ADR-0021).
-> No regex on prose anywhere in production. **Latest: v1.8.3** —
-> pip install quickstart, workflow hardening, PyPI link fix.
+> No regex on prose anywhere in production. **Latest: v1.9.0** —
+> `--filter-file` flag for custom-panel filtering on `analyze`.
 > Release notes:
 > [`CHANGELOG.md`](https://github.com/dial481/allelix/blob/main/CHANGELOG.md).
@@ -61,6 +61,9 @@ allelix db update
 # Analyze a genotype file
 allelix analyze your_genotype_file.txt --output report.html
+# Filter to a custom panel (rsIDs + gene names, one per line; '#' comments and blank lines ignored)
+allelix analyze your_genotype_file.txt --filter-file my_panel.txt --output report.html
 ```
 Requires Python 3.11+. See [Development](#development) for source installs and running tests.
@@ -98,7 +101,7 @@ Adding a new format means adding one file to `allelix/parsers/` and registering
 | GWAS Catalog | ✓ | Public domain (EBI/NHGRI). Trait–SNP associations with p-values and effect sizes. Carrier rule (ADR-0007) requires the user to carry the risk allele. P-value magnitude scoring (ADR-0024) maps continuous p-values to the 0–10 scale; unknown-risk-allele entries fire on rsID match alone but are capped at 3.0. |
 | gnomAD | ✓ | ODbL v1.0. **Enrichment annotator** — adds population allele frequency context to existing annotations. Shows how common each variant is in the general population (~16M exome variants from 730K individuals). A pathogenic variant that 35% of people carry reads very differently from one seen in 0.001%. Pre-built cache downloaded via `db update` (~6GB on disk). Use `--no-gnomad` to skip. |
 | AlphaMissense | ✓ | CC BY 4.0. **Enrichment annotator** — adds DeepMind's protein-structure-based pathogenicity predictions to existing annotations. Scores 71M missense variants on a 0–1 scale: <0.34 = likely benign, >0.564 = likely pathogenic. Complements ClinVar's expert classifications with computational predictions — especially valuable for variants ClinVar hasn't reviewed yet. Pre-built cache downloaded via `db update` (~8GB on disk). Use `--no-alphamissense` to skip. |
-| CADD | ✓ | LicenseRef-CADD (non-commercial). **Enrichment annotator** — adds PHRED-scaled deleteriousness scores from CADD v1.7. Ranks how deleterious any single-nucleotide variant is using 100+ annotation tracks (coding, non-coding, regulatory). PHRED 10 = top 10% most deleterious, 20 = top 1%, 30 = top 0.1%. **Opt-in** — disabled by default (`sources.cadd = false`). Enable via `allelix db update --cadd` or `allelix config set sources.cadd true`. Pre-built cache (~5 GB on disk, ~120M variant keys). Full mode available via pysam for GRCh38 data (`options.cadd_full = true`). Cache mode covers the large majority of variants present in gnomAD, AlphaMissense, and ClinVar — nearly every position allelix can annotate from its other databases. For genotyping chip data (23andMe, AncestryDNA, MyHappyGenes, etc.), cache and full mode produce effectively identical results because chip probes overwhelmingly target known, cataloged variants. Full mode adds coverage for novel or private variants that appear only in whole-genome or whole-exome sequencing data and are not in any pre-computed database. If your input is a genotyping chip file, cache mode is all you need. |
+| CADD | ✓ | LicenseRef-CADD (non-commercial). **Enrichment annotator** — adds PHRED-scaled deleteriousness scores from CADD v1.7. Ranks how deleterious any single-nucleotide variant is using 100+ annotation tracks (coding, non-coding, regulatory). PHRED 10 = top 10% most deleterious, 20 = top 1%, 30 = top 0.1%. **Opt-in** — disabled by default (`sources.cadd = false`). Enable via `allelix db update --cadd` or `allelix config set sources.cadd true`. Use `--no-cadd` to skip enrichment for a single run. Pre-built cache (~5 GB on disk, ~120M variant keys). Full mode available via pysam for GRCh38 data (`options.cadd_full = true`). Cache mode covers the large majority of variants present in gnomAD, AlphaMissense, and ClinVar — nearly every position allelix can annotate from its other databases. For genotyping chip data (23andMe, AncestryDNA, MyHappyGenes, etc.), cache and full mode produce effectively identical results because chip probes overwhelmingly target known, cataloged variants. Full mode adds coverage for novel or private variants that appear only in whole-genome or whole-exome sequencing data and are not in any pre-computed database. If your input is a genotyping chip file, cache mode is all you need. |
 ### Known PharmGKB limitation: reference-genotype rows where ClinVar and CPIC both lack data
@@ -155,7 +158,7 @@ allelix config set license.commercial true
 allelix config set license.cadd true
 ```
-CLI flags (`--no-gnomad`, `--no-alphamissense`, `--exclude-snpedia`, `--cadd`) override the config for a single run. The config sets the baseline; flags override per-invocation.
+CLI flags (`--no-gnomad`, `--no-alphamissense`, `--no-cadd`, `--exclude-snpedia`, `--cadd`) override the config for a single run. The config sets the baseline; flags override per-invocation.
 ### Database sizes and download times
@@ -187,7 +190,7 @@ Allelix source code is licensed under the **GNU Affero General Public License v3
 | SNPedia | snpedia.com | CC BY-NC-SA 3.0 US | Attribution required, **non-commercial only**. Use `--exclude-snpedia` to omit. |
 | gnomAD | gnomad.broadinstitute.org | ODbL v1.0 | Attribution required. Population allele frequencies for context; not a clinical annotator. Use `--no-gnomad` to omit. |
 | AlphaMissense | zenodo.org/records/10813168 | CC BY 4.0 | Attribution required. Cheng et al., Science 2023. Missense variant pathogenicity predictions. Use `--no-alphamissense` to omit. |
-| CADD | cadd.gs.washington.edu | LicenseRef-CADD | Attribution required, **non-commercial by default**. Commercial licenses available from UW CoMotion. Opt-in via `allelix db update --cadd`. |
+| CADD | cadd.gs.washington.edu | LicenseRef-CADD | Attribution required, **non-commercial by default**. Commercial licenses available from UW CoMotion. Opt-in via `allelix db update --cadd`. Use `--no-cadd` to omit. |
 **Commercial users:** When `license.commercial = true`, non-commercial sources are gated by a three-state permission model. SNPedia is permanently blocked (no commercial license is available). CADD is blocked by default but can be unlocked — the University of Washington offers commercial licenses at `https://els2.comotion.uw.edu/product/cadd-scores`; after purchasing, assert your license with `allelix config set license.cadd true` to re-enable CADD in commercial mode. All other databases (ClinVar, PharmGKB, GWAS Catalog, gnomAD, AlphaMissense) are compatible with commercial use. `allelix config show` displays the permission state for each source.
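
A minimal sketch of the three-state permission model described in the paragraph above, for readers who want to see the gating logic spelled out. The `Permission` enum, its member names, and `resolve_permission` are hypothetical names chosen for illustration; they are not taken from the allelix source. The sketch covers license gating only. CADD remains opt-in through `sources.cadd` regardless of its permission state.

```python
from enum import Enum


class Permission(Enum):
    """Hypothetical labels for the three states; allelix may name them differently."""
    ALLOWED = "allowed"
    UNLOCKABLE = "blocked until a license is asserted"
    BLOCKED = "permanently blocked"


# Sources the README marks as non-commercial.
NON_COMMERCIAL_SOURCES = frozenset({"snpedia", "cadd"})


def resolve_permission(
    source: str, *, commercial: bool, cadd_licensed: bool = False
) -> Permission:
    """Resolve a source's state from license.commercial and license.cadd."""
    # Outside commercial mode, or for commercially compatible sources,
    # nothing is gated.
    if not commercial or source not in NON_COMMERCIAL_SOURCES:
        return Permission.ALLOWED
    # SNPedia has no commercial license, so it can never be unlocked.
    if source == "snpedia":
        return Permission.BLOCKED
    # CADD: blocked by default, unlocked once the user asserts a license.
    return Permission.ALLOWED if cadd_licensed else Permission.UNLOCKABLE
```

Under these assumptions, `resolve_permission("cadd", commercial=True)` reports the unlockable state until `cadd_licensed=True` is passed, mirroring `allelix config set license.cadd true`.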

{allelix-1.8.3 → allelix-1.9.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "allelix"
-version = "1.8.3"
+version = "1.9.0"
 description = "Open-source genotype analysis toolkit. Format-agnostic ingestion, database-agnostic annotation, offline-first."
 readme = "README.md"
 requires-python = ">=3.11"
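
The `tests/test_cli.py` changes below pin down how a `--filter-file` panel is classified: rsIDs are lowercased, gene names are uppercased, `#` comments and blank lines are skipped, and gene symbols such as `RSPO1` that merely start with "rs" stay genes. The following sketch is one parser that would satisfy those assertions. It is not the actual `_parse_filter_file` from `allelix.cli`, which this diff does not show. The rule "rs followed only by digits" is an assumption inferred from the test cases, and plain string methods are used instead of a regex.

```python
from pathlib import Path


def parse_filter_file(path: Path) -> tuple[frozenset[str], frozenset[str]]:
    """Split a panel file into (genes, rsids), one entry per line."""
    genes: set[str] = set()
    rsids: set[str] = set()
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        token = raw_line.strip()
        # Blank lines and whole-line '#' comments carry no entries.
        if not token or token.startswith("#"):
            continue
        lowered = token.lower()
        suffix = lowered[2:]
        # Assumed rule: an rsID is "rs" followed by ASCII digits only,
        # so RSPO1, RSF1 and RSC1A1 fall through to the gene branch.
        if lowered.startswith("rs") and suffix.isascii() and suffix.isdigit():
            rsids.add(lowered)
        else:
            genes.add(token.upper())
    return frozenset(genes), frozenset(rsids)
```

Returning empty frozensets rather than `None` for an empty or comment-only file matches what the CLI tests expect to be forwarded. Trailing comments on the same line as an entry are not handled here, since the tests do not specify that case.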

{allelix-1.8.3 → allelix-1.9.0}/tests/test_cli.py RENAMED

@@ -1527,6 +1527,253 @@ class TestExcludeSnpedia:
         assert captured["exclude_sources"] == frozenset({"snpedia", "gwas"})
+class TestNoCaddFlag:
+    """--no-cadd wires through to no_cadd on all three analysis commands."""
+    def test_analyze_passes_no_cadd(self, mock_mhg_path, monkeypatch):
+        captured: dict = {}
+        def fake_run(**kwargs):
+            captured.update(kwargs)
+        monkeypatch.setattr("allelix.cli._run_analysis_command", fake_run)
+        runner = CliRunner()
+        result = runner.invoke(main, ["analyze", str(mock_mhg_path), "--no-cadd"])
+        assert result.exit_code == 0, result.output
+        assert captured["no_cadd"] is True
+    def test_analyze_default_no_cadd_false(self, mock_mhg_path, monkeypatch):
+        captured: dict = {}
+        def fake_run(**kwargs):
+            captured.update(kwargs)
+        monkeypatch.setattr("allelix.cli._run_analysis_command", fake_run)
+        runner = CliRunner()
+        result = runner.invoke(main, ["analyze", str(mock_mhg_path)])
+        assert result.exit_code == 0, result.output
+        assert captured["no_cadd"] is False
+    def test_methylation_passes_no_cadd(self, mock_mhg_path, monkeypatch):
+        captured: dict = {}
+        def fake_run(**kwargs):
+            captured.update(kwargs)
+        monkeypatch.setattr("allelix.cli._run_analysis_command", fake_run)
+        runner = CliRunner()
+        result = runner.invoke(main, ["methylation", str(mock_mhg_path), "--no-cadd"])
+        assert result.exit_code == 0, result.output
+        assert captured["no_cadd"] is True
+    def test_pharmacogenomics_passes_no_cadd(self, mock_mhg_path, monkeypatch):
+        captured: dict = {}
+        def fake_run(**kwargs):
+            captured.update(kwargs)
+        monkeypatch.setattr("allelix.cli._run_analysis_command", fake_run)
+        runner = CliRunner()
+        result = runner.invoke(main, ["pharmacogenomics", str(mock_mhg_path), "--no-cadd"])
+        assert result.exit_code == 0, result.output
+        assert captured["no_cadd"] is True
+class TestParseFilterFile:
+    """Unit tests for _parse_filter_file (parser classification)."""
+    def test_rsid_lowercase(self, tmp_path):
+        from allelix.cli import _parse_filter_file
+        f = tmp_path / "filter.txt"
+        f.write_text("rs1801133\n")
+        genes, rsids = _parse_filter_file(f)
+        assert genes == frozenset()
+        assert rsids == frozenset({"rs1801133"})
+    def test_rsid_uppercase_normalized_to_lowercase(self, tmp_path):
+        """Input case-tolerant, output canonical: RS1801133 → rs1801133."""
+        from allelix.cli import _parse_filter_file
+        f = tmp_path / "filter.txt"
+        f.write_text("RS1801133\n")
+        genes, rsids = _parse_filter_file(f)
+        assert genes == frozenset()
+        assert rsids == frozenset({"rs1801133"})
+    def test_gene_lowercase_normalized_to_uppercase(self, tmp_path):
+        """Input case-tolerant, output canonical: mthfr → MTHFR."""
+        from allelix.cli import _parse_filter_file
+        f = tmp_path / "filter.txt"
+        f.write_text("mthfr\n")
+        genes, rsids = _parse_filter_file(f)
+        assert genes == frozenset({"MTHFR"})
+        assert rsids == frozenset()
+    def test_mixed_messy_case_normalized(self, tmp_path):
+        """End-to-end case-mixing across rsIDs and genes."""
+        from allelix.cli import _parse_filter_file
+        f = tmp_path / "filter.txt"
+        f.write_text("Rs1801133\ncomt\nRSPO1\nRS4680\nmThFr\n")
+        genes, rsids = _parse_filter_file(f)
+        assert genes == frozenset({"COMT", "RSPO1", "MTHFR"})
+        assert rsids == frozenset({"rs1801133", "rs4680"})
+    def test_gene_only(self, tmp_path):
+        from allelix.cli import _parse_filter_file
+        f = tmp_path / "filter.txt"
+        f.write_text("MTHFR\n")
+        genes, rsids = _parse_filter_file(f)
+        assert genes == frozenset({"MTHFR"})
+        assert rsids == frozenset()
+    def test_gene_starting_with_rs_prefix_is_gene_not_rsid(self, tmp_path):
+        """RSPO1, RSF1, RSC1A1 are real gene names — must not be classified as rsIDs."""
+        from allelix.cli import _parse_filter_file
+        f = tmp_path / "filter.txt"
+        f.write_text("RSPO1\nRSF1\nRSC1A1\n")
+        genes, rsids = _parse_filter_file(f)
+        assert genes == frozenset({"RSPO1", "RSF1", "RSC1A1"})
+        assert rsids == frozenset()
+    def test_mixed(self, tmp_path):
+        from allelix.cli import _parse_filter_file
+        f = tmp_path / "filter.txt"
+        f.write_text("rs1801133\nMTHFR\nrs4680\nCOMT\n")
+        genes, rsids = _parse_filter_file(f)
+        assert genes == frozenset({"MTHFR", "COMT"})
+        assert rsids == frozenset({"rs1801133", "rs4680"})
+    def test_comments_and_blanks_ignored(self, tmp_path):
+        from allelix.cli import _parse_filter_file
+        f = tmp_path / "filter.txt"
+        f.write_text("# this is a comment\n\nMTHFR\n\n# another\nrs1801133\n")
+        genes, rsids = _parse_filter_file(f)
+        assert genes == frozenset({"MTHFR"})
+        assert rsids == frozenset({"rs1801133"})
+    def test_empty_file_returns_empty_sets(self, tmp_path):
+        from allelix.cli import _parse_filter_file
+        f = tmp_path / "filter.txt"
+        f.write_text("")
+        genes, rsids = _parse_filter_file(f)
+        assert genes == frozenset()
+        assert rsids == frozenset()
+    def test_comments_only_returns_empty_sets(self, tmp_path):
+        from allelix.cli import _parse_filter_file
+        f = tmp_path / "filter.txt"
+        f.write_text("# only a comment\n# another\n\n")
+        genes, rsids = _parse_filter_file(f)
+        assert genes == frozenset()
+        assert rsids == frozenset()
+class TestFilterFileOnAnalyze:
+    """--filter-file is only on analyze; threads through _run_analysis_command."""
+    def test_analyze_rsid_only(self, mock_mhg_path, tmp_path, monkeypatch):
+        captured: dict = {}
+        def fake_run(**kwargs):
+            captured.update(kwargs)
+        monkeypatch.setattr("allelix.cli._run_analysis_command", fake_run)
+        f = tmp_path / "filter.txt"
+        f.write_text("rs1801133\n")
+        runner = CliRunner()
+        result = runner.invoke(main, ["analyze", str(mock_mhg_path), "--filter-file", str(f)])
+        assert result.exit_code == 0, result.output
+        assert captured["genes"] == frozenset()
+        assert captured["rsids"] == frozenset({"rs1801133"})
+    def test_analyze_gene_only(self, mock_mhg_path, tmp_path, monkeypatch):
+        captured: dict = {}
+        def fake_run(**kwargs):
+            captured.update(kwargs)
+        monkeypatch.setattr("allelix.cli._run_analysis_command", fake_run)
+        f = tmp_path / "filter.txt"
+        f.write_text("MTHFR\n")
+        runner = CliRunner()
+        result = runner.invoke(main, ["analyze", str(mock_mhg_path), "--filter-file", str(f)])
+        assert result.exit_code == 0, result.output
+        assert captured["genes"] == frozenset({"MTHFR"})
+        assert captured["rsids"] == frozenset()
+    def test_analyze_mixed_or_combination(self, mock_mhg_path, tmp_path, monkeypatch):
+        captured: dict = {}
+        def fake_run(**kwargs):
+            captured.update(kwargs)
+        monkeypatch.setattr("allelix.cli._run_analysis_command", fake_run)
+        f = tmp_path / "filter.txt"
+        f.write_text("rs1801133\nCOMT\n")
+        runner = CliRunner()
+        result = runner.invoke(main, ["analyze", str(mock_mhg_path), "--filter-file", str(f)])
+        assert result.exit_code == 0, result.output
+        assert captured["genes"] == frozenset({"COMT"})
+        assert captured["rsids"] == frozenset({"rs1801133"})
+    def test_analyze_empty_filter_passes_empty_sets(self, mock_mhg_path, tmp_path, monkeypatch):
+        """Empty filter file (only comments/blanks) threads empty frozensets through.
+        The empty-set → match-nothing semantic on AnalysisResult.filter()
+        is covered by a direct unit test in tests/test_pipeline_filter.py;
+        here we verify only that the CLI layer forwards empty frozensets,
+        not None.
+        """
+        captured: dict = {}
+        def fake_run(**kwargs):
+            captured.update(kwargs)
+        monkeypatch.setattr("allelix.cli._run_analysis_command", fake_run)
+        f = tmp_path / "filter.txt"
+        f.write_text("# only comments\n\n")
+        runner = CliRunner()
+        result = runner.invoke(main, ["analyze", str(mock_mhg_path), "--filter-file", str(f)])
+        assert result.exit_code == 0, result.output
+        assert captured["genes"] == frozenset()
+        assert captured["rsids"] == frozenset()
+    def test_analyze_filter_file_nonexistent_path_errors(self, mock_mhg_path):
+        runner = CliRunner()
+        result = runner.invoke(
+            main,
+            ["analyze", str(mock_mhg_path), "--filter-file", "/does/not/exist.txt"],
+        )
+        assert result.exit_code != 0
+    def test_methylation_does_not_have_filter_file(self, mock_mhg_path):
+        runner = CliRunner()
+        result = runner.invoke(
+            main,
+            ["methylation", str(mock_mhg_path), "--filter-file", "/tmp/x.txt"],
+        )
+        assert result.exit_code != 0
+        assert "no such option" in result.output.lower()
+    def test_pharmacogenomics_does_not_have_filter_file(self, mock_mhg_path):
+        runner = CliRunner()
+        result = runner.invoke(
+            main,
+            ["pharmacogenomics", str(mock_mhg_path), "--filter-file", "/tmp/x.txt"],
+        )
+        assert result.exit_code != 0
+        assert "no such option" in result.output.lower()
 class TestHighValueNoCalls:
     def test_stats_flags_dpyd_no_call(self, mock_mhg_path):
         """The MHG fixture has rs3918290 (DPYD) as a no-call; stats should flag it."""