bam2tensor (PyPI) version diff: 2.5.tar.gz → 2.7.tar.gz

Files changed (55)

{bam2tensor-2.5 → bam2tensor-2.7}/.github/workflows/docs.yml RENAMED

@@ -47,7 +47,7 @@ jobs:
           uv run sphinx-build docs docs/_build
       - name: Upload artifact
-        uses: actions/upload-pages-artifact@v4
+        uses: actions/upload-pages-artifact@v5
         with:
           path: "docs/_build"

{bam2tensor-2.5 → bam2tensor-2.7}/.github/workflows/release.yml RENAMED

@@ -67,16 +67,16 @@ jobs:
       - name: Publish package on PyPI
         if: steps.check-version.outputs.tag || steps.check-tag.outputs.tag
-        uses: pypa/gh-action-pypi-publish@v1.13.0
+        uses: pypa/gh-action-pypi-publish@v1.14.0
       - name: Publish package on TestPyPI
         if: (!steps.check-version.outputs.tag && !steps.check-tag.outputs.tag)
-        uses: pypa/gh-action-pypi-publish@v1.13.0
+        uses: pypa/gh-action-pypi-publish@v1.14.0
         with:
           repository-url: https://test.pypi.org/legacy/
       - name: Publish the release notes
-        uses: release-drafter/release-drafter@v7.1.1
+        uses: release-drafter/release-drafter@v7.2.1
         with:
           publish: ${{ steps.check-version.outputs.tag != '' || steps.check-tag.outputs.tag != '' }}
           tag: ${{ steps.check-version.outputs.tag || steps.check-tag.outputs.tag }}

{bam2tensor-2.5 → bam2tensor-2.7}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: bam2tensor
-Version: 2.5
+Version: 2.7
 Summary: Convert bisulfite-seq and EM-seq BAM files to sparse tensor representations of DNA methylation
 Project-URL: Homepage, https://github.com/mcwdsi/bam2tensor
 Project-URL: Repository, https://github.com/mcwdsi/bam2tensor
@@ -72,6 +72,7 @@ Description-Content-Type: text/markdown
   - [Custom Output Directory](#custom-output-directory)
   - [Using a Custom Genome](#using-a-custom-genome)
   - [Command-Line Options](#command-line-options)
+- [Filtering Conversion Errors](#filtering-conversion-errors)
 - [Inspecting Output Files](#inspecting-output-files)
 - [Output Data Structure](#output-data-structure)
   - [Per-Read Fragment Length (TLEN)](#per-read-fragment-length-tlen)
@@ -99,6 +100,7 @@ Description-Content-Type: text/markdown
 - **Caching**: CpG site indexing is cached to accelerate repeated runs on the same genome
 - **Quality Filtering**: Configurable mapping quality thresholds
 - **Per-Read Fragment Length**: Stores BAM TLEN (template length) alongside the methylation tensor for joint fragment-methylation analysis
+- **Conversion-Error Filters**: Optional per-read filters for incomplete bisulfite/EM-seq conversion (ported from `nebiolabs/mark-nonconverted-reads`) and EM-seq fragment-level over-conversion (Loyfer et al. 2026)
 ## Requirements
@@ -256,6 +258,25 @@ Options:
                                   determine CpG sites).
   --quality-limit INTEGER         Quality filter for aligned reads (default =
                                   20)
+  --filter-non-converted          Drop reads with >= --non-converted-threshold
+                                  retained non-CpG cytosines, the signature of
+                                  incomplete bisulfite/EM-seq conversion (port
+                                  of nebiolabs/mark-nonconverted-reads).
+                                  Default: off.
+  --non-converted-threshold INTEGER
+                                  Minimum count of retained non-CpG cytosines
+                                  to drop a read (default = 3, matches NEB
+                                  mark-nonconverted-reads).
+  --filter-em-overconversion      Drop EM-seq reads whose covered CpGs are all
+                                  called unmethylated and cover at least --em-
+                                  overconversion-min-cpgs sites (heuristic for
+                                  the fragment-level over-conversion artifact
+                                  described in Loyfer et al. bioRxiv
+                                  2026.03.24.713040). Default: off.
+  --em-overconversion-min-cpgs INTEGER
+                                  Minimum covered CpG count required before
+                                  the EM over-conversion filter will drop a
+                                  read (default = 3).
   --verbose                       Verbose output.
   --skip-cache                    De-novo generate CpG sites (slow).
   --debug                         Debug mode (extensive validity checking +
@@ -281,6 +302,10 @@ Options:
 | `--expected-chromosomes` | Comma-separated list of chromosome names to process. Chromosomes not in this list are skipped. Defaults to human autosomes + sex chromosomes. |
 | `--reference-fasta` | Path to the reference genome FASTA file. Must match the genome used for alignment. |
 | `--quality-limit` | Minimum mapping quality score (MAPQ) for reads to be included. Default is 20. |
+| `--filter-non-converted` | Drop reads with retained non-CpG cytosines above `--non-converted-threshold` (incomplete conversion). See [Filtering Conversion Errors](#filtering-conversion-errors). |
+| `--non-converted-threshold` | Threshold for the non-converted filter. Default is 3. |
+| `--filter-em-overconversion` | Drop EM-seq reads whose covered CpGs are all unmethylated and cover ≥ `--em-overconversion-min-cpgs` sites. See [Filtering Conversion Errors](#filtering-conversion-errors). |
+| `--em-overconversion-min-cpgs` | Minimum covered CpG count before the EM over-conversion filter will drop a read. Default is 3. |
 | `--verbose` | Enable detailed progress output including per-chromosome progress bars. |
 | `--skip-cache` | Force regeneration of CpG site cache. Useful if you've modified the reference or chromosome list. |
 | `--debug` | Enable extensive validation and debug output. Slower but useful for troubleshooting. |
@@ -289,6 +314,66 @@ Options:
 | `--download-reference` | Download and cache a known reference genome. Choices: `hg38`, `hg19`, `mm10`, `T2T-CHM13`. Replaces `--reference-fasta`. |
 | `--list-genomes` | List available reference genomes for `--download-reference` and exit. |
+## Filtering Conversion Errors
+Bisulfite and EM-seq library preparation can produce two kinds of per-read conversion errors that bias downstream methylation calls. bam2tensor provides two opt-in filters to drop affected reads at extraction time. Both are **default-off**, apply per read, and are recorded in the output `metadata.json` so downstream consumers know which filters were applied.
+### `--filter-non-converted` — incomplete conversion
+Ports the logic of [nebiolabs/mark-nonconverted-reads](https://github.com/nebiolabs/mark-nonconverted-reads). A read is dropped if it carries at least `--non-converted-threshold` (default 3) retained non-CpG cytosines, a signature of incomplete bisulfite or EM-seq conversion.
+- **Bismark BAMs**: counted directly from the `XM` tag's uppercase `H`/`X`/`U` characters (retained cytosines in CHH/CHG/unknown contexts).
+- **Biscuit / bwameth / gem3 BAMs**: counted by comparing the read to the reference via the `MD` tag (using pysam's `get_aligned_pairs(with_seq=True)`). SNPs — where the read's retained `C` sits over a reference base that isn't `C` — are excluded from the count, matching NEB's reference-validation step. No separate FASTA reload is required.
+### `--filter-em-overconversion` — EM-seq fragment-level over-conversion
+A heuristic inspired by [Loyfer et al. (bioRxiv 2026.03.24.713040)](https://www.biorxiv.org/content/10.64898/2026.03.24.713040v1). That paper shows EM-seq reproducibly produces ~1–2.5% of multi-CpG fragments that appear fully unmethylated across every covered CpG — a fragment-level artifact absent from WGBS and Oxford Nanopore. This filter drops any read whose covered CpGs are **all** called unmethylated *and* cover at least `--em-overconversion-min-cpgs` sites (default 3, the regime where the EM-seq artifact is clearly separable from WGBS in Loyfer et al. Fig. 1C).
+The filter is a blunt instrument: it will also drop genuinely fully-unmethylated biological fragments at unmethylated markers. Enable it only when your downstream application (e.g., cfDNA deconvolution at constitutively methylated loci) can tolerate that trade-off.
+### Usage
+```bash
+bam2tensor \
+    --input-path sample.bam \
+    --reference-fasta GRCh38.fa \
+    --genome-name hg38 \
+    --filter-non-converted \
+    --filter-em-overconversion
+```
+Filter parameters and enabled state are written to the output `metadata.json`:
+```json
+{
+    "filters": {
+        "non_converted_reads": {"enabled": true, "threshold": 3},
+        "em_overconversion": {"enabled": true, "min_cpgs": 3}
+    }
+}
+```
+### Reproducibility note
+The two filters differ in whether they can be replayed downstream without the source BAM:
+- **`--filter-em-overconversion` is reproducible from the `.npz` alone.** The heuristic is a pure function of each row's CpG state values. A downstream consumer who receives an unfiltered `.npz` can replay the filter at analysis time:
+  ```python
+  import scipy.sparse
+  mat = scipy.sparse.load_npz("sample.methylation.npz").tocsr()
+  min_cpgs = 3
+  kept_rows = []
+  for i in range(mat.shape[0]):
+      row = mat.getrow(i).toarray().ravel()
+      covered = row[(row == 0) | (row == 1)]  # drop -1 no-data
+      is_overconv = len(covered) >= min_cpgs and (covered == 0).all()
+      if not is_overconv:
+          kept_rows.append(i)
+  ```
+- **`--filter-non-converted` is *not* reproducible from the `.npz` alone.** It relies on retained non-CpG cytosines (or Bismark's `H`/`X`/`U`), which are never written to the matrix. If you need this filter, apply it at extraction time (or re-run bam2tensor against the original BAM).
 ## Inspecting Output Files
 Use `bam2tensor-inspect` to view a summary of any `.methylation.npz` file without writing Python:
@@ -302,11 +387,15 @@ sample.methylation.npz
   CpG sites:       28,217,448
   Data points:     12,847,322 (sparsity: 99.97%)
   Fragment len:    median 167, mean 182, range [50, 600]
+  Filters:         non-converted (>= 3 non-CpG Cs)
+                   EM over-conversion (all-unmethylated, >= 3 CpGs)
   CpG index CRC32: a1b2c3d4
-  bam2tensor:      v2.4
+  bam2tensor:      v2.5
   File size:       14.2 MB
 ```
+When no filters were applied, the line reads `Filters:         none`. Files produced by bam2tensor versions older than v2.5 omit the line entirely.
 You can pass multiple files at once:
 ```bash
@@ -368,6 +457,7 @@ Each `.methylation.npz` file includes a `metadata.json` entry inside the ZIP arc
 | `expected_chromosomes` | List of chromosomes included in the column mapping |
 | `total_cpg_sites` | Total number of CpG columns in the matrix |
 | `cpg_index_crc32` | CRC32 checksum of the CpG site positions (verifies identical column semantics) |
+| `filters` | Nested dict recording which opt-in conversion-error filters were applied (`non_converted_reads`, `em_overconversion`) and their parameters. See [Filtering Conversion Errors](#filtering-conversion-errors). Added in v2.5. |
 This metadata is ignored by `scipy.sparse.load_npz`, so existing code continues to work. To read it:
@@ -570,6 +660,10 @@ extract_methylation_data_from_bam(
     input_bam: str,                                    # Path to BAM file
     genome_methylation_embedding: GenomeMethylationEmbedding,  # Embedding object
     quality_limit: int = 20,                           # Minimum MAPQ
+    filter_non_converted: bool = False,                # Drop reads with retained non-CpG Cs
+    non_converted_threshold: int = 3,                  # Threshold for the above filter
+    filter_em_overconversion: bool = False,            # Drop EM-seq fragment-level over-conversion reads
+    em_overconversion_min_cpgs: int = 3,               # Min CpGs before applying the above filter
     verbose: bool = False,                             # Enable verbose output
     debug: bool = False                                # Enable debug output
 ) -> ExtractionResult
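
The Bismark branch of `--filter-non-converted` described in the hunks above amounts to counting the uppercase `H`, `X`, and `U` characters in a read's `XM` string and comparing the count against the threshold. A minimal sketch, assuming the `XM` string has already been read from the alignment; `count_retained_non_cpg` and `is_non_converted` are hypothetical names used for illustration, not functions from bam2tensor:

```python
def count_retained_non_cpg(xm_tag: str) -> int:
    """Count retained (unconverted) cytosines outside CpG context.

    In a Bismark XM string, uppercase H, X and U mark retained cytosines
    in CHH, CHG and unknown context. CpG calls (Z/z), converted non-CpG
    cytosines (h/x/u) and '.' are ignored.
    """
    return sum(1 for ch in xm_tag if ch in "HXU")


def is_non_converted(xm_tag: str, threshold: int = 3) -> bool:
    """Return True if the read meets or exceeds the drop threshold."""
    return count_retained_non_cpg(xm_tag) >= threshold
```

The comparison is `>=`, which matches the CLI help text ("Drop reads with >= --non-converted-threshold retained non-CpG cytosines").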

{bam2tensor-2.5 → bam2tensor-2.7}/README.md RENAMED

@@ -39,6 +39,7 @@
   - [Custom Output Directory](#custom-output-directory)
   - [Using a Custom Genome](#using-a-custom-genome)
   - [Command-Line Options](#command-line-options)
+- [Filtering Conversion Errors](#filtering-conversion-errors)
 - [Inspecting Output Files](#inspecting-output-files)
 - [Output Data Structure](#output-data-structure)
   - [Per-Read Fragment Length (TLEN)](#per-read-fragment-length-tlen)
@@ -66,6 +67,7 @@
 - **Caching**: CpG site indexing is cached to accelerate repeated runs on the same genome
 - **Quality Filtering**: Configurable mapping quality thresholds
 - **Per-Read Fragment Length**: Stores BAM TLEN (template length) alongside the methylation tensor for joint fragment-methylation analysis
+- **Conversion-Error Filters**: Optional per-read filters for incomplete bisulfite/EM-seq conversion (ported from `nebiolabs/mark-nonconverted-reads`) and EM-seq fragment-level over-conversion (Loyfer et al. 2026)
 ## Requirements
@@ -223,6 +225,25 @@ Options:
                                   determine CpG sites).
   --quality-limit INTEGER         Quality filter for aligned reads (default =
                                   20)
+  --filter-non-converted          Drop reads with >= --non-converted-threshold
+                                  retained non-CpG cytosines, the signature of
+                                  incomplete bisulfite/EM-seq conversion (port
+                                  of nebiolabs/mark-nonconverted-reads).
+                                  Default: off.
+  --non-converted-threshold INTEGER
+                                  Minimum count of retained non-CpG cytosines
+                                  to drop a read (default = 3, matches NEB
+                                  mark-nonconverted-reads).
+  --filter-em-overconversion      Drop EM-seq reads whose covered CpGs are all
+                                  called unmethylated and cover at least --em-
+                                  overconversion-min-cpgs sites (heuristic for
+                                  the fragment-level over-conversion artifact
+                                  described in Loyfer et al. bioRxiv
+                                  2026.03.24.713040). Default: off.
+  --em-overconversion-min-cpgs INTEGER
+                                  Minimum covered CpG count required before
+                                  the EM over-conversion filter will drop a
+                                  read (default = 3).
   --verbose                       Verbose output.
   --skip-cache                    De-novo generate CpG sites (slow).
   --debug                         Debug mode (extensive validity checking +
@@ -248,6 +269,10 @@ Options:
 | `--expected-chromosomes` | Comma-separated list of chromosome names to process. Chromosomes not in this list are skipped. Defaults to human autosomes + sex chromosomes. |
 | `--reference-fasta` | Path to the reference genome FASTA file. Must match the genome used for alignment. |
 | `--quality-limit` | Minimum mapping quality score (MAPQ) for reads to be included. Default is 20. |
+| `--filter-non-converted` | Drop reads with retained non-CpG cytosines above `--non-converted-threshold` (incomplete conversion). See [Filtering Conversion Errors](#filtering-conversion-errors). |
+| `--non-converted-threshold` | Threshold for the non-converted filter. Default is 3. |
+| `--filter-em-overconversion` | Drop EM-seq reads whose covered CpGs are all unmethylated and cover ≥ `--em-overconversion-min-cpgs` sites. See [Filtering Conversion Errors](#filtering-conversion-errors). |
+| `--em-overconversion-min-cpgs` | Minimum covered CpG count before the EM over-conversion filter will drop a read. Default is 3. |
 | `--verbose` | Enable detailed progress output including per-chromosome progress bars. |
 | `--skip-cache` | Force regeneration of CpG site cache. Useful if you've modified the reference or chromosome list. |
 | `--debug` | Enable extensive validation and debug output. Slower but useful for troubleshooting. |
@@ -256,6 +281,66 @@ Options:
 | `--download-reference` | Download and cache a known reference genome. Choices: `hg38`, `hg19`, `mm10`, `T2T-CHM13`. Replaces `--reference-fasta`. |
 | `--list-genomes` | List available reference genomes for `--download-reference` and exit. |
+## Filtering Conversion Errors
+Bisulfite and EM-seq library preparation can produce two kinds of per-read conversion errors that bias downstream methylation calls. bam2tensor provides two opt-in filters to drop affected reads at extraction time. Both are **default-off**, apply per read, and are recorded in the output `metadata.json` so downstream consumers know which filters were applied.
+### `--filter-non-converted` — incomplete conversion
+Ports the logic of [nebiolabs/mark-nonconverted-reads](https://github.com/nebiolabs/mark-nonconverted-reads). A read is dropped if it carries at least `--non-converted-threshold` (default 3) retained non-CpG cytosines, a signature of incomplete bisulfite or EM-seq conversion.
+- **Bismark BAMs**: counted directly from the `XM` tag's uppercase `H`/`X`/`U` characters (retained cytosines in CHH/CHG/unknown contexts).
+- **Biscuit / bwameth / gem3 BAMs**: counted by comparing the read to the reference via the `MD` tag (using pysam's `get_aligned_pairs(with_seq=True)`). SNPs — where the read's retained `C` sits over a reference base that isn't `C` — are excluded from the count, matching NEB's reference-validation step. No separate FASTA reload is required.
+### `--filter-em-overconversion` — EM-seq fragment-level over-conversion
+A heuristic inspired by [Loyfer et al. (bioRxiv 2026.03.24.713040)](https://www.biorxiv.org/content/10.64898/2026.03.24.713040v1). That paper shows EM-seq reproducibly produces ~1–2.5% of multi-CpG fragments that appear fully unmethylated across every covered CpG — a fragment-level artifact absent from WGBS and Oxford Nanopore. This filter drops any read whose covered CpGs are **all** called unmethylated *and* cover at least `--em-overconversion-min-cpgs` sites (default 3, the regime where the EM-seq artifact is clearly separable from WGBS in Loyfer et al. Fig. 1C).
+The filter is a blunt instrument: it will also drop genuinely fully-unmethylated biological fragments at unmethylated markers. Enable it only when your downstream application (e.g., cfDNA deconvolution at constitutively methylated loci) can tolerate that trade-off.
+### Usage
+```bash
+bam2tensor \
+    --input-path sample.bam \
+    --reference-fasta GRCh38.fa \
+    --genome-name hg38 \
+    --filter-non-converted \
+    --filter-em-overconversion
+```
+Filter parameters and enabled state are written to the output `metadata.json`:
+```json
+{
+    "filters": {
+        "non_converted_reads": {"enabled": true, "threshold": 3},
+        "em_overconversion": {"enabled": true, "min_cpgs": 3}
+    }
+}
+```
+### Reproducibility note
+The two filters differ in whether they can be replayed downstream without the source BAM:
+- **`--filter-em-overconversion` is reproducible from the `.npz` alone.** The heuristic is a pure function of each row's CpG state values. A downstream consumer who receives an unfiltered `.npz` can replay the filter at analysis time:
+  ```python
+  import scipy.sparse
+  mat = scipy.sparse.load_npz("sample.methylation.npz").tocsr()
+  min_cpgs = 3
+  kept_rows = []
+  for i in range(mat.shape[0]):
+      row = mat.getrow(i).toarray().ravel()
+      covered = row[(row == 0) | (row == 1)]  # drop -1 no-data
+      is_overconv = len(covered) >= min_cpgs and (covered == 0).all()
+      if not is_overconv:
+          kept_rows.append(i)
+  ```
+- **`--filter-non-converted` is *not* reproducible from the `.npz` alone.** It relies on retained non-CpG cytosines (or Bismark's `H`/`X`/`U`), which are never written to the matrix. If you need this filter, apply it at extraction time (or re-run bam2tensor against the original BAM).
 ## Inspecting Output Files
 Use `bam2tensor-inspect` to view a summary of any `.methylation.npz` file without writing Python:
@@ -269,11 +354,15 @@ sample.methylation.npz
   CpG sites:       28,217,448
   Data points:     12,847,322 (sparsity: 99.97%)
   Fragment len:    median 167, mean 182, range [50, 600]
+  Filters:         non-converted (>= 3 non-CpG Cs)
+                   EM over-conversion (all-unmethylated, >= 3 CpGs)
   CpG index CRC32: a1b2c3d4
-  bam2tensor:      v2.4
+  bam2tensor:      v2.5
   File size:       14.2 MB
 ```
+When no filters were applied, the line reads `Filters:         none`. Files produced by bam2tensor versions older than v2.5 omit the line entirely.
 You can pass multiple files at once:
 ```bash
@@ -335,6 +424,7 @@ Each `.methylation.npz` file includes a `metadata.json` entry inside the ZIP arc
 | `expected_chromosomes` | List of chromosomes included in the column mapping |
 | `total_cpg_sites` | Total number of CpG columns in the matrix |
 | `cpg_index_crc32` | CRC32 checksum of the CpG site positions (verifies identical column semantics) |
+| `filters` | Nested dict recording which opt-in conversion-error filters were applied (`non_converted_reads`, `em_overconversion`) and their parameters. See [Filtering Conversion Errors](#filtering-conversion-errors). Added in v2.5. |
 This metadata is ignored by `scipy.sparse.load_npz`, so existing code continues to work. To read it:
@@ -537,6 +627,10 @@ extract_methylation_data_from_bam(
     input_bam: str,                                    # Path to BAM file
     genome_methylation_embedding: GenomeMethylationEmbedding,  # Embedding object
     quality_limit: int = 20,                           # Minimum MAPQ
+    filter_non_converted: bool = False,                # Drop reads with retained non-CpG Cs
+    non_converted_threshold: int = 3,                  # Threshold for the above filter
+    filter_em_overconversion: bool = False,            # Drop EM-seq fragment-level over-conversion reads
+    em_overconversion_min_cpgs: int = 3,               # Min CpGs before applying the above filter
     verbose: bool = False,                             # Enable verbose output
     debug: bool = False                                # Enable debug output
 ) -> ExtractionResult
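
One caveat when replaying the EM over-conversion heuristic from an `.npz`: `toarray()` fills every unstored position with 0, so a dense row cannot tell an explicitly stored 0 (unmethylated call) from a CpG the read never covered. The sketch below works on the stored values only. It assumes calls are stored explicitly as 0 (unmethylated), 1 (methylated) and -1 (no data), and that explicit zeros survive loading; `rows_to_keep` is a hypothetical helper, not part of bam2tensor:

```python
from typing import Sequence


def rows_to_keep(
    data: Sequence[int], indptr: Sequence[int], min_cpgs: int = 3
) -> list[int]:
    """Return indices of rows that survive the over-conversion heuristic.

    `data` and `indptr` follow the CSR layout: the stored values of row i
    are data[indptr[i]:indptr[i + 1]].
    """
    kept = []
    for i in range(len(indptr) - 1):
        stored = data[indptr[i]:indptr[i + 1]]
        covered = [v for v in stored if v in (0, 1)]  # ignore -1 (no data)
        overconverted = len(covered) >= min_cpgs and all(v == 0 for v in covered)
        if not overconverted:
            kept.append(i)
    return kept
```

With a SciPy CSR matrix `mat`, the call would be `rows_to_keep(mat.data, mat.indptr)`.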

{bam2tensor-2.5 → bam2tensor-2.7}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "bam2tensor"
-version = "2.5"
+version = "2.7"
 description = "Convert bisulfite-seq and EM-seq BAM files to sparse tensor representations of DNA methylation"
 authors = [{ name = "Nick Semenkovich", email = "semenko@alum.mit.edu" }]
 license = "MIT"

{bam2tensor-2.5 → bam2tensor-2.7}/src/bam2tensor/__init__.py RENAMED

@@ -50,4 +50,4 @@ See Also:
     - https://mcwdsi.github.io/bam2tensor for full documentation
 """
-__version__ = "2.5"
+__version__ = "2.7"

{bam2tensor-2.5 → bam2tensor-2.7}/src/bam2tensor/embedding.py RENAMED

@@ -45,6 +45,9 @@ import numpy as np
 from tqdm import tqdm
 from Bio import SeqIO
+from bam2tensor import __version__
+from bam2tensor.metadata import compute_fasta_sha256
 class GenomeMethylationEmbedding:
     """Manages CpG site positions and coordinate conversions for a reference genome.
@@ -173,7 +176,12 @@ class GenomeMethylationEmbedding:
                     window_size == self.window_size
                 ), "Window size does not match cached window size!"
             except FileNotFoundError as e:
-                if self.verbose:
+                # Stale-cache rejections (version or FASTA SHA-256 mismatch)
+                # raise FileNotFoundError too — always surface those so users
+                # are not silently regenerating a cache they thought was valid.
+                if os.path.exists(self.cache_file):
+                    print(f"Discarding stale embedding cache: {e}")
+                elif self.verbose:
                     print("Could not load methylation embedding from cache: " + str(e))
         if not cache_available:
@@ -224,6 +232,9 @@ class GenomeMethylationEmbedding:
         The cache file is named "{genome_name}.cache.json.gz" and contains:
         - genome_name: The genome identifier
         - fasta_source: Path to the original FASTA file
+        - fasta_sha256: SHA-256 of the FASTA file bytes (for cache validation)
+        - bam2tensor_version: Version that produced this cache
+        - total_cpg_sites: Total CpG count across all included chromosomes
         - expected_chromosomes: List of included chromosomes
         - window_size: The window_size parameter (for compatibility checking)
         - cpg_sites_dict: Dictionary of chromosome -> list of CpG positions
@@ -241,9 +252,13 @@ class GenomeMethylationEmbedding:
         assert len(self.cpg_sites_dict) > 0, "CpG sites dict is empty!"
+        total_cpg_sites = sum(len(v) for v in self.cpg_sites_dict.values())
         cache_data = {
             "genome_name": self.genome_name,
             "fasta_source": self.fasta_source,
+            "fasta_sha256": compute_fasta_sha256(self.fasta_source),
+            "bam2tensor_version": __version__,
+            "total_cpg_sites": total_cpg_sites,
             "expected_chromosomes": self.expected_chromosomes,
             "window_size": self.window_size,
             "cpg_sites_dict": self.cpg_sites_dict,
@@ -263,38 +278,66 @@ class GenomeMethylationEmbedding:
         restore all CpG site data. If successful, this avoids the slow
         FASTA parsing step.
+        Provenance is validated before the cached data is trusted: the
+        cache must have been written by the same major.minor of
+        bam2tensor and must reference a FASTA file with the same SHA-256
+        as the current ``fasta_source``. A stale cache is rejected with
+        a ``FileNotFoundError`` so the caller falls through to a fresh
+        FASTA parse and overwrites the stale cache on save.
         Returns:
             True if the cache was successfully loaded.
         Raises:
-            FileNotFoundError: If the cache file does not exist.
+            FileNotFoundError: If the cache file does not exist, or if
+                the cache is stale (version mismatch or FASTA SHA-256
+                mismatch).
         Note:
-            After loading, the caller should verify that expected_chromosomes
-            and window_size match the current configuration, as this method
-            overwrites those attributes with cached values.
+            After loading, the caller should verify that
+            ``expected_chromosomes`` and ``window_size`` match the current
+            configuration, as this method overwrites those attributes
+            with cached values.
         """
-        if os.path.exists(self.cache_file):
-            if self.verbose:
-                print(f"\tReading embedding from cache: {self.cache_file}")
+        if not os.path.exists(self.cache_file):
+            raise FileNotFoundError("No cache of embedding found.")
-            # TODO: Add type hinting via TypedDicts?
-            # e.g. https://stackoverflow.com/questions/51291722/define-a-jsonable-type-using-mypy-pep-526
-            with gzip.open(self.cache_file, "rt") as f:
-                self.cache_data = json.load(f)
+        if self.verbose:
+            print(f"\tReading embedding from cache: {self.cache_file}")
-            # Load the cached data
-            self.genome_name = self.cache_data["genome_name"]
-            self.fasta_source = self.cache_data["fasta_source"]
-            self.expected_chromosomes = self.cache_data["expected_chromosomes"]
-            self.window_size = self.cache_data["window_size"]
-            self.cpg_sites_dict = self.cache_data["cpg_sites_dict"]
+        with gzip.open(self.cache_file, "rt") as f:
+            self.cache_data = json.load(f)
-            if self.verbose:
-                print(f"\tCached genome fasta source: {self.fasta_source}")
-        else:
-            raise FileNotFoundError("No cache of embedding found.")
+        # Validate cache provenance: stale caches predating v2.7 used a
+        # case-sensitive CpG search that silently dropped roughly half
+        # the CpG sites in soft-masked FASTAs (e.g. UCSC's hg38.fa.gz).
+        cached_version = self.cache_data.get("bam2tensor_version")
+        if cached_version != __version__:
+            raise FileNotFoundError(
+                f"Stale cache {self.cache_file!r}: written by bam2tensor "
+                f"{cached_version!r}, current is {__version__!r}. "
+                "Regenerating."
+            )
+        cached_fasta_sha256 = self.cache_data.get("fasta_sha256")
+        current_fasta_sha256 = compute_fasta_sha256(self.fasta_source)
+        if cached_fasta_sha256 != current_fasta_sha256:
+            raise FileNotFoundError(
+                f"Stale cache {self.cache_file!r}: FASTA SHA-256 mismatch "
+                f"(cache={cached_fasta_sha256}, current={current_fasta_sha256}). "
+                "Regenerating."
+            )
+        # Load the cached data
+        self.genome_name = self.cache_data["genome_name"]
+        self.fasta_source = self.cache_data["fasta_source"]
+        self.expected_chromosomes = self.cache_data["expected_chromosomes"]
+        self.window_size = self.cache_data["window_size"]
+        self.cpg_sites_dict = self.cache_data["cpg_sites_dict"]
+        if self.verbose:
+            print(f"\tCached genome fasta source: {self.fasta_source}")
         return True
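
The provenance check in the hunk above reduces to two comparisons: the version recorded in the cache against the running version, and the recorded FASTA digest against a freshly computed one. A minimal standalone sketch, assuming the cache has already been parsed into a dict; `sha256_of_file` and `cache_is_fresh` are hypothetical stand-ins and not the package's `compute_fasta_sha256` or `load_embedding_cache` logic:

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large FASTAs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_is_fresh(cache_data: dict, current_version: str, fasta_path: str) -> bool:
    """Return True only if both the version and the FASTA digest match."""
    # A cache written before these keys existed has neither, so .get()
    # returns None and the cache is treated as stale.
    if cache_data.get("bam2tensor_version") != current_version:
        return False
    return cache_data.get("fasta_sha256") == sha256_of_file(fasta_path)
```

Returning a boolean rather than raising keeps the sketch short; the diff instead raises `FileNotFoundError` so that the existing fallback path regenerates the cache.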
@@ -350,7 +393,11 @@ class GenomeMethylationEmbedding:
                 if self.verbose:
                     tqdm.write(f"\tSkipping chromosome {seqrecord.id}")
                 continue
-            sequence = seqrecord.seq
+            # Upper-case the sequence so soft-masked FASTAs (UCSC's default
+            # hg38.fa.gz uses lowercase for RepeatMasker/TRF regions) do not
+            # silently drop CpGs in repeats — that is roughly half of all
+            # CpGs in the human genome.
+            sequence = seqrecord.seq.upper()
             # Find all CpG sites
             # The pos+1 is because we want to store the 1-based position, because .bed is wild and arguably 1-based maybe:
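
The effect of the `.upper()` fix can be seen on a toy sequence: a case-sensitive search for `"CG"` misses any CpG with a lowercase (soft-masked) base. A minimal sketch on plain strings; `find_cpg_positions` is a hypothetical helper for illustration and does not reproduce the package's search loop:

```python
def find_cpg_positions(sequence: str, normalize_case: bool = True) -> list[int]:
    """Return 1-based positions of the C in every CpG dinucleotide.

    With normalize_case=False the search is case-sensitive, which mimics
    the pre-fix behaviour on soft-masked input.
    """
    seq = sequence.upper() if normalize_case else sequence
    positions = []
    pos = seq.find("CG")
    while pos != -1:
        positions.append(pos + 1)  # convert 0-based index to 1-based
        pos = seq.find("CG", pos + 1)
    return positions


masked = "ACGTcgtaCgcG"  # lowercase marks soft-masked bases
all_sites = find_cpg_positions(masked)
unmasked_only = find_cpg_positions(masked, normalize_case=False)
```

Here `all_sites` holds four positions while `unmasked_only` holds one, since three of the four CpGs contain at least one lowercase base.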

{bam2tensor-2.5 → bam2tensor-2.7}/src/bam2tensor/functions.py RENAMED

@@ -706,25 +706,36 @@ def extract_methylation_data_from_bam(
             # get_aligned_pairs returns a list of tuples of (read_pos, ref_pos)
             # We filter this to only include the specific CpG sites from above
+            aligned_pairs = aligned_segment.get_aligned_pairs(matches_only=True)
             this_segment_cpgs = [
-                e
-                for e in aligned_segment.get_aligned_pairs(matches_only=True)
-                if e[1] + 1 in cpgs_within_read_set
+                e for e in aligned_pairs if e[1] + 1 in cpgs_within_read_set
             ]
             # If no CpGs covered (after filtering for matches only), skip
             if not this_segment_cpgs:
                 continue
-            # Ok we're on the same strand as the methylation (right?)
-            # Let's compare the possible CpGs in this interval to the reference and note status
-            #   A methylated C will be *unchanged* and read as C (pair G)
-            #   An unmethylated C will be *changed* and read as T (pair A)
-            for query_pos, ref_pos in this_segment_cpgs:
-                query_base = aligned_segment.query_sequence[query_pos]  # type: ignore
-                # query_base_raw = aligned_segment.get_forward_sequence()[query_pos] # raw off sequencer
-                # query_base_no_offset = aligned_segment.query_alignment_sequence[query_pos] # this needs to be offset by the soft clip
+            # OT (forward parent): methylation-informative base sits on the
+            #   top-strand C at ref_pos. BAM SEQ is reference-oriented, so
+            #   C = methylated, T = unmethylated.
+            # OB (reverse parent): the original bottom-strand C lives at
+            #   ref_pos + 1 (the G of the top-strand CG). After the aligner
+            #   reverse-complements into reference orientation for BAM
+            #   storage, that base reads G = methylated, A = unmethylated.
+            #   At ref_pos itself, BAM always shows C (the unaffected
+            #   bottom-strand G reverse-complemented), which is why reading
+            #   ref_pos on OB reads collapses every CpG to "methylated".
+            query_sequence = aligned_segment.query_sequence
+            if bisulfite_parent_strand_is_reverse:
+                methylated_base, unmethylated_base = "G", "A"
+                # Indels at the CpG boundary mean ref_pos + 1 isn't always
+                # query_pos + 1 — go through a ref -> query map.
+                ref_to_query: dict[int, int] = {ref: q for q, ref in aligned_pairs}
+            else:
+                methylated_base, unmethylated_base = "C", "T"
+                ref_to_query = {}
+            for query_pos, ref_pos in this_segment_cpgs:
                 read_cpg_cols.append(
                     genome_methylation_embedding.genomic_position_to_embedding(
                         chrom,
@@ -732,21 +743,34 @@ def extract_methylation_data_from_bam(
                     )
                 )
-                if query_base == "C":
-                    # Methylated
+                if bisulfite_parent_strand_is_reverse:
+                    target_query_pos = ref_to_query.get(ref_pos + 1)
+                    if target_query_pos is None:
+                        read_cpg_data.append(-1)
+                        if debug:
+                            print(f"\t{query_pos} {ref_pos} [Indel at OB target]")
+                        continue
+                    query_base = query_sequence[target_query_pos]  # type: ignore[index]
+                else:
+                    query_base = query_sequence[query_pos]  # type: ignore[index]
+                if query_base == methylated_base:
                     read_cpg_data.append(1)
                     if debug:
-                        print(f"\t{query_pos} {ref_pos} C->{query_base} [Methylated]")
-                elif query_base == "T":
+                        print(
+                            f"\t{query_pos} {ref_pos} {methylated_base}->{query_base} [Methylated]"
+                        )
+                elif query_base == unmethylated_base:
                     read_cpg_data.append(0)
-                    # Unmethylated
                     if debug:
-                        print(f"\t{query_pos} {ref_pos} C->{query_base} [Unmethylated]")
+                        print(
+                            f"\t{query_pos} {ref_pos} {methylated_base}->{query_base} [Unmethylated]"
+                        )
                 else:
                     read_cpg_data.append(-1)
                     if debug:
                         print(
-                            f"\t{query_pos} {ref_pos} C->{query_base} [Unknown! SNV? Indel?]"
+                            f"\t{query_pos} {ref_pos} {methylated_base}->{query_base} [Unknown! SNV? Indel?]"
                         )
             if filter_em_overconversion and is_em_overconversion_read(
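
Note on the hunk above: the strand-aware call can be sketched without pysam. This is an illustration under stated assumptions, not the package's code: `call_cpg` is a hypothetical function, `aligned_pairs` is a plain list of `(query_pos, ref_pos)` tuples standing in for `get_aligned_pairs(matches_only=True)`, and `cpg_ref_pos` is the 0-based reference position of the top-strand C. Unlike the diff, the sketch also routes OT reads through the ref -> query map to keep the function short.

```python
def call_cpg(
    query_sequence: str,
    aligned_pairs: list[tuple[int, int]],
    cpg_ref_pos: int,
    parent_is_reverse: bool,
) -> int:
    """Return 1 (methylated), 0 (unmethylated) or -1 (no call) for one CpG."""
    # Map reference positions to query positions so indels are handled.
    ref_to_query = {ref: query for query, ref in aligned_pairs}
    if parent_is_reverse:
        # OB read: the informative base is the G of the top-strand CG.
        methylated, unmethylated = "G", "A"
        target = ref_to_query.get(cpg_ref_pos + 1)
    else:
        # OT read: the informative base is the C itself.
        methylated, unmethylated = "C", "T"
        target = ref_to_query.get(cpg_ref_pos)
    if target is None:
        return -1  # the informative base is deleted or unaligned
    base = query_sequence[target]
    if base == methylated:
        return 1
    if base == unmethylated:
        return 0
    return -1  # SNV or other unexpected base


# Toy alignment: reference "ACGT" at positions 100..103, CpG C at 101.
full_pairs = [(0, 100), (1, 101), (2, 102), (3, 103)]
ot_meth = call_cpg("ACGT", full_pairs, 101, parent_is_reverse=False)
ot_unmeth = call_cpg("ATGT", full_pairs, 101, parent_is_reverse=False)
ob_meth = call_cpg("ACGT", full_pairs, 101, parent_is_reverse=True)
ob_unmeth = call_cpg("ACAT", full_pairs, 101, parent_is_reverse=True)
# Deletion of reference position 102: no query base for the OB target.
ob_indel = call_cpg("ACT", [(0, 100), (1, 101), (2, 103)], 101, parent_is_reverse=True)
print(ot_meth, ot_unmeth, ob_meth, ob_unmeth, ob_indel)
```

The OB unmethylated read shows C at position 101 and A at position 102, which is why reading only the C position would report it as methylated.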

{bam2tensor-2.5 → bam2tensor-2.7}/src/bam2tensor/metadata.py RENAMED Viewed

@@ -27,17 +27,44 @@ Example:
         hg38
 """
+import hashlib
 import io
 import json
 import zipfile
 import zlib
+from typing import TYPE_CHECKING
 import numpy as np
-from bam2tensor.embedding import GenomeMethylationEmbedding
+if TYPE_CHECKING:
+    # Avoid a runtime circular import: embedding.py imports compute_fasta_sha256
+    # from this module, and this module only needs the embedding type for
+    # annotations.
+    from bam2tensor.embedding import GenomeMethylationEmbedding
-def compute_cpg_index_crc32(embedding: GenomeMethylationEmbedding) -> str:
+def compute_fasta_sha256(fasta_source: str) -> str:
+    """Compute the SHA-256 of a FASTA file's bytes on disk.
+    Used to stamp the CpG-site cache (see
+    :py:class:`bam2tensor.embedding.GenomeMethylationEmbedding`) so a
+    cache can be rejected when the underlying FASTA changes (e.g. a
+    user swaps a soft-masked build for a hard-masked one).
+    Args:
+        fasta_source: Path to the reference FASTA file.
+    Returns:
+        The hex-encoded SHA-256 digest of the file's bytes.
+    """
+    h = hashlib.sha256()
+    with open(fasta_source, "rb") as f:
+        for chunk in iter(lambda: f.read(1024 * 1024), b""):
+            h.update(chunk)
+    return h.hexdigest()
+def compute_cpg_index_crc32(embedding: "GenomeMethylationEmbedding") -> str:
     """Compute a CRC32 checksum over the CpG site positions in an embedding.
     The checksum captures the exact column mapping of the sparse matrix:
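
Note on the hunk above: a rough sketch of how a file digest like the one from `compute_fasta_sha256` could gate a cache. The chunked hashing mirrors the diff; the `cache_is_current` helper and the `"fasta_sha256"` metadata key are assumptions made for this example, since the cache format itself is not shown here.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so large FASTAs never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_is_current(cache_meta: dict, fasta_path: str) -> bool:
    """Hypothetical check: accept the cache only if the stored digest still matches."""
    return cache_meta.get("fasta_sha256") == sha256_of_file(fasta_path)


with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "ref.fa")
    # Write a soft-masked toy reference and stamp the cache metadata.
    with open(fasta, "wb") as handle:
        handle.write(b">chr1\nacgtACGT\n")
    meta = {"fasta_sha256": sha256_of_file(fasta)}
    fresh_before = cache_is_current(meta, fasta)
    # Swap in a hard-masked build: the bytes change, so the digest changes.
    with open(fasta, "wb") as handle:
        handle.write(b">chr1\nNNNNACGT\n")
    fresh_after = cache_is_current(meta, fasta)

print(fresh_before, fresh_after)
```

Hashing the raw bytes means any edit to the FASTA invalidates the cache, including changes that do not alter CpG positions; that is the conservative choice.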
