PyPI - bam2tensor - Versions diffs - 2.5__tar.gz → 2.6__tar.gz - Mend

bam2tensor 2.5.tar.gz → 2.6.tar.gz


Files changed (55)

{bam2tensor-2.5 → bam2tensor-2.6}/.github/workflows/docs.yml RENAMED

@@ -47,7 +47,7 @@ jobs:
           uv run sphinx-build docs docs/_build
       - name: Upload artifact
-        uses: actions/upload-pages-artifact@v4
+        uses: actions/upload-pages-artifact@v5
         with:
           path: "docs/_build"

{bam2tensor-2.5 → bam2tensor-2.6}/.github/workflows/release.yml RENAMED

@@ -67,16 +67,16 @@ jobs:
       - name: Publish package on PyPI
         if: steps.check-version.outputs.tag || steps.check-tag.outputs.tag
-        uses: pypa/gh-action-pypi-publish@v1.13.0
+        uses: pypa/gh-action-pypi-publish@v1.14.0
       - name: Publish package on TestPyPI
         if: (!steps.check-version.outputs.tag && !steps.check-tag.outputs.tag)
-        uses: pypa/gh-action-pypi-publish@v1.13.0
+        uses: pypa/gh-action-pypi-publish@v1.14.0
         with:
           repository-url: https://test.pypi.org/legacy/
       - name: Publish the release notes
-        uses: release-drafter/release-drafter@v7.1.1
+        uses: release-drafter/release-drafter@v7.2.1
         with:
           publish: ${{ steps.check-version.outputs.tag != '' || steps.check-tag.outputs.tag != '' }}
           tag: ${{ steps.check-version.outputs.tag || steps.check-tag.outputs.tag }}

{bam2tensor-2.5 → bam2tensor-2.6}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: bam2tensor
-Version: 2.5
+Version: 2.6
 Summary: Convert bisulfite-seq and EM-seq BAM files to sparse tensor representations of DNA methylation
 Project-URL: Homepage, https://github.com/mcwdsi/bam2tensor
 Project-URL: Repository, https://github.com/mcwdsi/bam2tensor
@@ -72,6 +72,7 @@ Description-Content-Type: text/markdown
   - [Custom Output Directory](#custom-output-directory)
   - [Using a Custom Genome](#using-a-custom-genome)
   - [Command-Line Options](#command-line-options)
+- [Filtering Conversion Errors](#filtering-conversion-errors)
 - [Inspecting Output Files](#inspecting-output-files)
 - [Output Data Structure](#output-data-structure)
   - [Per-Read Fragment Length (TLEN)](#per-read-fragment-length-tlen)
@@ -99,6 +100,7 @@ Description-Content-Type: text/markdown
 - **Caching**: CpG site indexing is cached to accelerate repeated runs on the same genome
 - **Quality Filtering**: Configurable mapping quality thresholds
 - **Per-Read Fragment Length**: Stores BAM TLEN (template length) alongside the methylation tensor for joint fragment-methylation analysis
+- **Conversion-Error Filters**: Optional per-read filters for incomplete bisulfite/EM-seq conversion (ported from `nebiolabs/mark-nonconverted-reads`) and EM-seq fragment-level over-conversion (Loyfer et al. 2026)
 ## Requirements
@@ -256,6 +258,25 @@ Options:
                                   determine CpG sites).
   --quality-limit INTEGER         Quality filter for aligned reads (default =
                                   20)
+  --filter-non-converted          Drop reads with >= --non-converted-threshold
+                                  retained non-CpG cytosines, the signature of
+                                  incomplete bisulfite/EM-seq conversion (port
+                                  of nebiolabs/mark-nonconverted-reads).
+                                  Default: off.
+  --non-converted-threshold INTEGER
+                                  Minimum count of retained non-CpG cytosines
+                                  to drop a read (default = 3, matches NEB
+                                  mark-nonconverted-reads).
+  --filter-em-overconversion      Drop EM-seq reads whose covered CpGs are all
+                                  called unmethylated and cover at least --em-
+                                  overconversion-min-cpgs sites (heuristic for
+                                  the fragment-level over-conversion artifact
+                                  described in Loyfer et al. bioRxiv
+                                  2026.03.24.713040). Default: off.
+  --em-overconversion-min-cpgs INTEGER
+                                  Minimum covered CpG count required before
+                                  the EM over-conversion filter will drop a
+                                  read (default = 3).
   --verbose                       Verbose output.
   --skip-cache                    De-novo generate CpG sites (slow).
   --debug                         Debug mode (extensive validity checking +
@@ -281,6 +302,10 @@ Options:
 | `--expected-chromosomes` | Comma-separated list of chromosome names to process. Chromosomes not in this list are skipped. Defaults to human autosomes + sex chromosomes. |
 | `--reference-fasta` | Path to the reference genome FASTA file. Must match the genome used for alignment. |
 | `--quality-limit` | Minimum mapping quality score (MAPQ) for reads to be included. Default is 20. |
+| `--filter-non-converted` | Drop reads with retained non-CpG cytosines above `--non-converted-threshold` (incomplete conversion). See [Filtering Conversion Errors](#filtering-conversion-errors). |
+| `--non-converted-threshold` | Threshold for the non-converted filter. Default is 3. |
+| `--filter-em-overconversion` | Drop EM-seq reads whose covered CpGs are all unmethylated and cover ≥ `--em-overconversion-min-cpgs` sites. See [Filtering Conversion Errors](#filtering-conversion-errors). |
+| `--em-overconversion-min-cpgs` | Minimum covered CpG count before the EM over-conversion filter will drop a read. Default is 3. |
 | `--verbose` | Enable detailed progress output including per-chromosome progress bars. |
 | `--skip-cache` | Force regeneration of CpG site cache. Useful if you've modified the reference or chromosome list. |
 | `--debug` | Enable extensive validation and debug output. Slower but useful for troubleshooting. |
@@ -289,6 +314,66 @@ Options:
 | `--download-reference` | Download and cache a known reference genome. Choices: `hg38`, `hg19`, `mm10`, `T2T-CHM13`. Replaces `--reference-fasta`. |
 | `--list-genomes` | List available reference genomes for `--download-reference` and exit. |
+## Filtering Conversion Errors
+Bisulfite and EM-seq library preparation can produce two kinds of per-read conversion errors that bias downstream methylation calls. bam2tensor provides two opt-in filters to drop affected reads at extraction time. Both are **default-off**, apply per read, and are recorded in the output `metadata.json` so downstream consumers know which filters were applied.
+### `--filter-non-converted` — incomplete conversion
+Ports the logic of [nebiolabs/mark-nonconverted-reads](https://github.com/nebiolabs/mark-nonconverted-reads). A read is dropped if it carries at least `--non-converted-threshold` (default 3) retained non-CpG cytosines, a signature of incomplete bisulfite or EM-seq conversion.
+- **Bismark BAMs**: counted directly from the `XM` tag's uppercase `H`/`X`/`U` characters (retained cytosines in CHH/CHG/unknown contexts).
+- **Biscuit / bwameth / gem3 BAMs**: counted by comparing the read to the reference via the `MD` tag (using pysam's `get_aligned_pairs(with_seq=True)`). SNPs — where the read's retained `C` sits over a reference base that isn't `C` — are excluded from the count, matching NEB's reference-validation step. No separate FASTA reload is required.
+### `--filter-em-overconversion` — EM-seq fragment-level over-conversion
+A heuristic inspired by [Loyfer et al. (bioRxiv 2026.03.24.713040)](https://www.biorxiv.org/content/10.64898/2026.03.24.713040v1). That paper shows EM-seq reproducibly produces ~1–2.5% of multi-CpG fragments that appear fully unmethylated across every covered CpG — a fragment-level artifact absent from WGBS and Oxford Nanopore. This filter drops any read whose covered CpGs are **all** called unmethylated *and* cover at least `--em-overconversion-min-cpgs` sites (default 3, the regime where the EM-seq artifact is clearly separable from WGBS in Loyfer et al. Fig. 1C).
+The filter is a blunt instrument: it will also drop genuinely fully-unmethylated biological fragments at unmethylated markers. Enable it only when your downstream application (e.g., cfDNA deconvolution at constitutively methylated loci) can tolerate that trade-off.
+### Usage
+```bash
+bam2tensor \
+    --input-path sample.bam \
+    --reference-fasta GRCh38.fa \
+    --genome-name hg38 \
+    --filter-non-converted \
+    --filter-em-overconversion
+```
+Filter parameters and enabled state are written to the output `metadata.json`:
+```json
+{
+    "filters": {
+        "non_converted_reads": {"enabled": true, "threshold": 3},
+        "em_overconversion": {"enabled": true, "min_cpgs": 3}
+    }
+}
+```
+### Reproducibility note
+The two filters differ in whether they can be replayed downstream without the source BAM:
+- **`--filter-em-overconversion` is reproducible from the `.npz` alone.** The heuristic is a pure function of each row's CpG state values. A downstream consumer who receives an unfiltered `.npz` can replay the filter at analysis time:
+  ```python
+  import scipy.sparse
+  mat = scipy.sparse.load_npz("sample.methylation.npz").tocsr()
+  min_cpgs = 3
+  kept_rows = []
+  for i in range(mat.shape[0]):
+      row = mat.getrow(i).toarray().ravel()
+      covered = row[(row == 0) | (row == 1)]  # drop -1 no-data
+      is_overconv = len(covered) >= min_cpgs and (covered == 0).all()
+      if not is_overconv:
+          kept_rows.append(i)
+  ```
+- **`--filter-non-converted` is *not* reproducible from the `.npz` alone.** It relies on retained non-CpG cytosines (or Bismark's `H`/`X`/`U`), which are never written to the matrix. If you need this filter, apply it at extraction time (or re-run bam2tensor against the original BAM).
 ## Inspecting Output Files
 Use `bam2tensor-inspect` to view a summary of any `.methylation.npz` file without writing Python:
@@ -302,11 +387,15 @@ sample.methylation.npz
   CpG sites:       28,217,448
   Data points:     12,847,322 (sparsity: 99.97%)
   Fragment len:    median 167, mean 182, range [50, 600]
+  Filters:         non-converted (>= 3 non-CpG Cs)
+                   EM over-conversion (all-unmethylated, >= 3 CpGs)
   CpG index CRC32: a1b2c3d4
-  bam2tensor:      v2.4
+  bam2tensor:      v2.5
   File size:       14.2 MB
 ```
+When no filters were applied, the line reads `Filters:         none`. Files produced by bam2tensor versions older than v2.5 omit the line entirely.
 You can pass multiple files at once:
 ```bash
@@ -368,6 +457,7 @@ Each `.methylation.npz` file includes a `metadata.json` entry inside the ZIP arc
 | `expected_chromosomes` | List of chromosomes included in the column mapping |
 | `total_cpg_sites` | Total number of CpG columns in the matrix |
 | `cpg_index_crc32` | CRC32 checksum of the CpG site positions (verifies identical column semantics) |
+| `filters` | Nested dict recording which opt-in conversion-error filters were applied (`non_converted_reads`, `em_overconversion`) and their parameters. See [Filtering Conversion Errors](#filtering-conversion-errors). Added in v2.5. |
 This metadata is ignored by `scipy.sparse.load_npz`, so existing code continues to work. To read it:
@@ -570,6 +660,10 @@ extract_methylation_data_from_bam(
     input_bam: str,                                    # Path to BAM file
     genome_methylation_embedding: GenomeMethylationEmbedding,  # Embedding object
     quality_limit: int = 20,                           # Minimum MAPQ
+    filter_non_converted: bool = False,                # Drop reads with retained non-CpG Cs
+    non_converted_threshold: int = 3,                  # Threshold for the above filter
+    filter_em_overconversion: bool = False,            # Drop EM-seq fragment-level over-conversion reads
+    em_overconversion_min_cpgs: int = 3,               # Min CpGs before applying the above filter
     verbose: bool = False,                             # Enable verbose output
     debug: bool = False                                # Enable debug output
 ) -> ExtractionResult
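
The README text in this hunk states that the `filters` block is written to a `metadata.json` entry inside the `.npz` ZIP archive, and that files from older versions omit it. A minimal sketch of how a downstream consumer might check which filters were applied, using only the standard library, could look like the following. The helper name `read_filter_metadata` is hypothetical and not part of bam2tensor; the demo builds an in-memory archive rather than reading a real output file.

```python
import io
import json
import zipfile


def read_filter_metadata(npz_file):
    """Return the `filters` dict from metadata.json, or None if it is absent.

    Hypothetical helper: `npz_file` may be a path or a file-like object.
    """
    with zipfile.ZipFile(npz_file) as archive:
        if "metadata.json" not in archive.namelist():
            return None
        metadata = json.loads(archive.read("metadata.json"))
    # Older files carry metadata.json without a `filters` key.
    return metadata.get("filters")


# Demo archive mirroring the JSON shape shown in the README hunk above.
example_metadata = {
    "filters": {
        "non_converted_reads": {"enabled": True, "threshold": 3},
        "em_overconversion": {"enabled": False, "min_cpgs": 3},
    }
}
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("metadata.json", json.dumps(example_metadata))
buffer.seek(0)

filters = read_filter_metadata(buffer)
print(filters["non_converted_reads"]["enabled"])
```

Returning `None` for a missing key lets callers distinguish "no filter information recorded" from "filters recorded as disabled".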

{bam2tensor-2.5 → bam2tensor-2.6}/README.md RENAMED

@@ -39,6 +39,7 @@
   - [Custom Output Directory](#custom-output-directory)
   - [Using a Custom Genome](#using-a-custom-genome)
   - [Command-Line Options](#command-line-options)
+- [Filtering Conversion Errors](#filtering-conversion-errors)
 - [Inspecting Output Files](#inspecting-output-files)
 - [Output Data Structure](#output-data-structure)
   - [Per-Read Fragment Length (TLEN)](#per-read-fragment-length-tlen)
@@ -66,6 +67,7 @@
 - **Caching**: CpG site indexing is cached to accelerate repeated runs on the same genome
 - **Quality Filtering**: Configurable mapping quality thresholds
 - **Per-Read Fragment Length**: Stores BAM TLEN (template length) alongside the methylation tensor for joint fragment-methylation analysis
+- **Conversion-Error Filters**: Optional per-read filters for incomplete bisulfite/EM-seq conversion (ported from `nebiolabs/mark-nonconverted-reads`) and EM-seq fragment-level over-conversion (Loyfer et al. 2026)
 ## Requirements
@@ -223,6 +225,25 @@ Options:
                                   determine CpG sites).
   --quality-limit INTEGER         Quality filter for aligned reads (default =
                                   20)
+  --filter-non-converted          Drop reads with >= --non-converted-threshold
+                                  retained non-CpG cytosines, the signature of
+                                  incomplete bisulfite/EM-seq conversion (port
+                                  of nebiolabs/mark-nonconverted-reads).
+                                  Default: off.
+  --non-converted-threshold INTEGER
+                                  Minimum count of retained non-CpG cytosines
+                                  to drop a read (default = 3, matches NEB
+                                  mark-nonconverted-reads).
+  --filter-em-overconversion      Drop EM-seq reads whose covered CpGs are all
+                                  called unmethylated and cover at least --em-
+                                  overconversion-min-cpgs sites (heuristic for
+                                  the fragment-level over-conversion artifact
+                                  described in Loyfer et al. bioRxiv
+                                  2026.03.24.713040). Default: off.
+  --em-overconversion-min-cpgs INTEGER
+                                  Minimum covered CpG count required before
+                                  the EM over-conversion filter will drop a
+                                  read (default = 3).
   --verbose                       Verbose output.
   --skip-cache                    De-novo generate CpG sites (slow).
   --debug                         Debug mode (extensive validity checking +
@@ -248,6 +269,10 @@ Options:
 | `--expected-chromosomes` | Comma-separated list of chromosome names to process. Chromosomes not in this list are skipped. Defaults to human autosomes + sex chromosomes. |
 | `--reference-fasta` | Path to the reference genome FASTA file. Must match the genome used for alignment. |
 | `--quality-limit` | Minimum mapping quality score (MAPQ) for reads to be included. Default is 20. |
+| `--filter-non-converted` | Drop reads with retained non-CpG cytosines above `--non-converted-threshold` (incomplete conversion). See [Filtering Conversion Errors](#filtering-conversion-errors). |
+| `--non-converted-threshold` | Threshold for the non-converted filter. Default is 3. |
+| `--filter-em-overconversion` | Drop EM-seq reads whose covered CpGs are all unmethylated and cover ≥ `--em-overconversion-min-cpgs` sites. See [Filtering Conversion Errors](#filtering-conversion-errors). |
+| `--em-overconversion-min-cpgs` | Minimum covered CpG count before the EM over-conversion filter will drop a read. Default is 3. |
 | `--verbose` | Enable detailed progress output including per-chromosome progress bars. |
 | `--skip-cache` | Force regeneration of CpG site cache. Useful if you've modified the reference or chromosome list. |
 | `--debug` | Enable extensive validation and debug output. Slower but useful for troubleshooting. |
@@ -256,6 +281,66 @@ Options:
 | `--download-reference` | Download and cache a known reference genome. Choices: `hg38`, `hg19`, `mm10`, `T2T-CHM13`. Replaces `--reference-fasta`. |
 | `--list-genomes` | List available reference genomes for `--download-reference` and exit. |
+## Filtering Conversion Errors
+Bisulfite and EM-seq library preparation can produce two kinds of per-read conversion errors that bias downstream methylation calls. bam2tensor provides two opt-in filters to drop affected reads at extraction time. Both are **default-off**, apply per read, and are recorded in the output `metadata.json` so downstream consumers know which filters were applied.
+### `--filter-non-converted` — incomplete conversion
+Ports the logic of [nebiolabs/mark-nonconverted-reads](https://github.com/nebiolabs/mark-nonconverted-reads). A read is dropped if it carries at least `--non-converted-threshold` (default 3) retained non-CpG cytosines, a signature of incomplete bisulfite or EM-seq conversion.
+- **Bismark BAMs**: counted directly from the `XM` tag's uppercase `H`/`X`/`U` characters (retained cytosines in CHH/CHG/unknown contexts).
+- **Biscuit / bwameth / gem3 BAMs**: counted by comparing the read to the reference via the `MD` tag (using pysam's `get_aligned_pairs(with_seq=True)`). SNPs — where the read's retained `C` sits over a reference base that isn't `C` — are excluded from the count, matching NEB's reference-validation step. No separate FASTA reload is required.
+### `--filter-em-overconversion` — EM-seq fragment-level over-conversion
+A heuristic inspired by [Loyfer et al. (bioRxiv 2026.03.24.713040)](https://www.biorxiv.org/content/10.64898/2026.03.24.713040v1). That paper shows EM-seq reproducibly produces ~1–2.5% of multi-CpG fragments that appear fully unmethylated across every covered CpG — a fragment-level artifact absent from WGBS and Oxford Nanopore. This filter drops any read whose covered CpGs are **all** called unmethylated *and* cover at least `--em-overconversion-min-cpgs` sites (default 3, the regime where the EM-seq artifact is clearly separable from WGBS in Loyfer et al. Fig. 1C).
+The filter is a blunt instrument: it will also drop genuinely fully-unmethylated biological fragments at unmethylated markers. Enable it only when your downstream application (e.g., cfDNA deconvolution at constitutively methylated loci) can tolerate that trade-off.
+### Usage
+```bash
+bam2tensor \
+    --input-path sample.bam \
+    --reference-fasta GRCh38.fa \
+    --genome-name hg38 \
+    --filter-non-converted \
+    --filter-em-overconversion
+```
+Filter parameters and enabled state are written to the output `metadata.json`:
+```json
+{
+    "filters": {
+        "non_converted_reads": {"enabled": true, "threshold": 3},
+        "em_overconversion": {"enabled": true, "min_cpgs": 3}
+    }
+}
+```
+### Reproducibility note
+The two filters differ in whether they can be replayed downstream without the source BAM:
+- **`--filter-em-overconversion` is reproducible from the `.npz` alone.** The heuristic is a pure function of each row's CpG state values. A downstream consumer who receives an unfiltered `.npz` can replay the filter at analysis time:
+  ```python
+  import scipy.sparse
+  mat = scipy.sparse.load_npz("sample.methylation.npz").tocsr()
+  min_cpgs = 3
+  kept_rows = []
+  for i in range(mat.shape[0]):
+      row = mat.getrow(i).toarray().ravel()
+      covered = row[(row == 0) | (row == 1)]  # drop -1 no-data
+      is_overconv = len(covered) >= min_cpgs and (covered == 0).all()
+      if not is_overconv:
+          kept_rows.append(i)
+  ```
+- **`--filter-non-converted` is *not* reproducible from the `.npz` alone.** It relies on retained non-CpG cytosines (or Bismark's `H`/`X`/`U`), which are never written to the matrix. If you need this filter, apply it at extraction time (or re-run bam2tensor against the original BAM).
 ## Inspecting Output Files
 Use `bam2tensor-inspect` to view a summary of any `.methylation.npz` file without writing Python:
@@ -269,11 +354,15 @@ sample.methylation.npz
   CpG sites:       28,217,448
   Data points:     12,847,322 (sparsity: 99.97%)
   Fragment len:    median 167, mean 182, range [50, 600]
+  Filters:         non-converted (>= 3 non-CpG Cs)
+                   EM over-conversion (all-unmethylated, >= 3 CpGs)
   CpG index CRC32: a1b2c3d4
-  bam2tensor:      v2.4
+  bam2tensor:      v2.5
   File size:       14.2 MB
 ```
+When no filters were applied, the line reads `Filters:         none`. Files produced by bam2tensor versions older than v2.5 omit the line entirely.
 You can pass multiple files at once:
 ```bash
@@ -335,6 +424,7 @@ Each `.methylation.npz` file includes a `metadata.json` entry inside the ZIP arc
 | `expected_chromosomes` | List of chromosomes included in the column mapping |
 | `total_cpg_sites` | Total number of CpG columns in the matrix |
 | `cpg_index_crc32` | CRC32 checksum of the CpG site positions (verifies identical column semantics) |
+| `filters` | Nested dict recording which opt-in conversion-error filters were applied (`non_converted_reads`, `em_overconversion`) and their parameters. See [Filtering Conversion Errors](#filtering-conversion-errors). Added in v2.5. |
 This metadata is ignored by `scipy.sparse.load_npz`, so existing code continues to work. To read it:
@@ -537,6 +627,10 @@ extract_methylation_data_from_bam(
     input_bam: str,                                    # Path to BAM file
     genome_methylation_embedding: GenomeMethylationEmbedding,  # Embedding object
     quality_limit: int = 20,                           # Minimum MAPQ
+    filter_non_converted: bool = False,                # Drop reads with retained non-CpG Cs
+    non_converted_threshold: int = 3,                  # Threshold for the above filter
+    filter_em_overconversion: bool = False,            # Drop EM-seq fragment-level over-conversion reads
+    em_overconversion_min_cpgs: int = 3,               # Min CpGs before applying the above filter
     verbose: bool = False,                             # Enable verbose output
     debug: bool = False                                # Enable debug output
 ) -> ExtractionResult
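
One caveat about the replay snippet in the Reproducibility note: it densifies each row with `toarray()`, which turns every column that has no stored entry into `0`. If unmethylated calls are kept as explicit zeros in the sparse matrix (the new test in this release asserts `nnz == 2` with data `[0, 1]`, which suggests they are), the dense row can no longer tell "unmethylated" from "not covered". A hedged alternative is to apply the heuristic to the stored entries of each CSR row only. The function name below is hypothetical, and the sketch takes plain sequences so it runs without SciPy.

```python
def em_overconversion_rows(data, indptr, min_cpgs=3):
    """Return indices of rows the all-unmethylated heuristic would drop.

    `data` and `indptr` follow the CSR layout: the stored values of row i
    are data[indptr[i]:indptr[i + 1]]. Assumes 1 = methylated,
    0 = unmethylated (explicitly stored), -1 = no call.
    """
    flagged = []
    for i in range(len(indptr) - 1):
        stored = data[indptr[i]:indptr[i + 1]]
        # Keep only real calls; -1 entries do not count as covered CpGs.
        covered = [v for v in stored if v == 0 or v == 1]
        if len(covered) >= min_cpgs and all(v == 0 for v in covered):
            flagged.append(i)
    return flagged


# Four toy reads: all-unmethylated (3 CpGs), mixed, all-unmethylated
# (2 CpGs, below the minimum), and all-unmethylated with one no-call.
data = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, -1, 0]
indptr = [0, 3, 6, 8, 12]
flagged = em_overconversion_rows(data, indptr)
print(flagged)
```

With a SciPy CSR matrix `mat`, the same function could be called as `em_overconversion_rows(mat.data, mat.indptr)`, since both attributes expose the arrays described in the docstring.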

{bam2tensor-2.5 → bam2tensor-2.6}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "bam2tensor"
-version = "2.5"
+version = "2.6"
 description = "Convert bisulfite-seq and EM-seq BAM files to sparse tensor representations of DNA methylation"
 authors = [{ name = "Nick Semenkovich", email = "semenko@alum.mit.edu" }]
 license = "MIT"

{bam2tensor-2.5 → bam2tensor-2.6}/src/bam2tensor/__init__.py RENAMED

@@ -50,4 +50,4 @@ See Also:
     - https://mcwdsi.github.io/bam2tensor for full documentation
 """
-__version__ = "2.5"
+__version__ = "2.6"

{bam2tensor-2.5 → bam2tensor-2.6}/src/bam2tensor/functions.py RENAMED

@@ -706,25 +706,36 @@ def extract_methylation_data_from_bam(
             # get_aligned_pairs returns a list of tuples of (read_pos, ref_pos)
             # We filter this to only include the specific CpG sites from above
+            aligned_pairs = aligned_segment.get_aligned_pairs(matches_only=True)
             this_segment_cpgs = [
-                e
-                for e in aligned_segment.get_aligned_pairs(matches_only=True)
-                if e[1] + 1 in cpgs_within_read_set
+                e for e in aligned_pairs if e[1] + 1 in cpgs_within_read_set
             ]
             # If no CpGs covered (after filtering for matches only), skip
             if not this_segment_cpgs:
                 continue
-            # Ok we're on the same strand as the methylation (right?)
-            # Let's compare the possible CpGs in this interval to the reference and note status
-            #   A methylated C will be *unchanged* and read as C (pair G)
-            #   An unmethylated C will be *changed* and read as T (pair A)
-            for query_pos, ref_pos in this_segment_cpgs:
-                query_base = aligned_segment.query_sequence[query_pos]  # type: ignore
-                # query_base_raw = aligned_segment.get_forward_sequence()[query_pos] # raw off sequencer
-                # query_base_no_offset = aligned_segment.query_alignment_sequence[query_pos] # this needs to be offset by the soft clip
+            # OT (forward parent): methylation-informative base sits on the
+            #   top-strand C at ref_pos. BAM SEQ is reference-oriented, so
+            #   C = methylated, T = unmethylated.
+            # OB (reverse parent): the original bottom-strand C lives at
+            #   ref_pos + 1 (the G of the top-strand CG). After the aligner
+            #   reverse-complements into reference orientation for BAM
+            #   storage, that base reads G = methylated, A = unmethylated.
+            #   At ref_pos itself, BAM always shows C (the unaffected
+            #   bottom-strand G reverse-complemented), which is why reading
+            #   ref_pos on OB reads collapses every CpG to "methylated".
+            query_sequence = aligned_segment.query_sequence
+            if bisulfite_parent_strand_is_reverse:
+                methylated_base, unmethylated_base = "G", "A"
+                # Indels at the CpG boundary mean ref_pos + 1 isn't always
+                # query_pos + 1 — go through a ref -> query map.
+                ref_to_query: dict[int, int] = {ref: q for q, ref in aligned_pairs}
+            else:
+                methylated_base, unmethylated_base = "C", "T"
+                ref_to_query = {}
+            for query_pos, ref_pos in this_segment_cpgs:
                 read_cpg_cols.append(
                     genome_methylation_embedding.genomic_position_to_embedding(
                         chrom,
@@ -732,21 +743,34 @@ def extract_methylation_data_from_bam(
                     )
                 )
-                if query_base == "C":
-                    # Methylated
+                if bisulfite_parent_strand_is_reverse:
+                    target_query_pos = ref_to_query.get(ref_pos + 1)
+                    if target_query_pos is None:
+                        read_cpg_data.append(-1)
+                        if debug:
+                            print(f"\t{query_pos} {ref_pos} [Indel at OB target]")
+                        continue
+                    query_base = query_sequence[target_query_pos]  # type: ignore[index]
+                else:
+                    query_base = query_sequence[query_pos]  # type: ignore[index]
+                if query_base == methylated_base:
                     read_cpg_data.append(1)
                     if debug:
-                        print(f"\t{query_pos} {ref_pos} C->{query_base} [Methylated]")
-                elif query_base == "T":
+                        print(
+                            f"\t{query_pos} {ref_pos} {methylated_base}->{query_base} [Methylated]"
+                        )
+                elif query_base == unmethylated_base:
                     read_cpg_data.append(0)
-                    # Unmethylated
                     if debug:
-                        print(f"\t{query_pos} {ref_pos} C->{query_base} [Unmethylated]")
+                        print(
+                            f"\t{query_pos} {ref_pos} {methylated_base}->{query_base} [Unmethylated]"
+                        )
                 else:
                     read_cpg_data.append(-1)
                     if debug:
                         print(
-                            f"\t{query_pos} {ref_pos} C->{query_base} [Unknown! SNV? Indel?]"
+                            f"\t{query_pos} {ref_pos} {methylated_base}->{query_base} [Unknown! SNV? Indel?]"
                         )
             if filter_em_overconversion and is_em_overconversion_read(
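
The strand-aware lookup introduced in this hunk can be illustrated in isolation. The sketch below is a simplified stand-in rather than the package's code: `call_read_cpgs` is a hypothetical name, it works on plain strings and `(query_pos, ref_pos)` tuples instead of a pysam segment, and it routes both strands through the reference-to-query map, whereas the real change uses `query_pos` directly for OT reads.

```python
def call_read_cpgs(query_sequence, aligned_pairs, cpg_ref_positions, parent_is_reverse):
    """Call each CpG as 1 (methylated), 0 (unmethylated) or -1 (no call).

    `aligned_pairs` holds (query_pos, ref_pos) tuples for matched bases only.
    `cpg_ref_positions` are 0-based positions of the top-strand C of each CpG.
    """
    ref_to_query = {ref: q for q, ref in aligned_pairs}
    if parent_is_reverse:
        # OB: the informative base is the G of the top-strand CG (ref_pos + 1).
        methylated_base, unmethylated_base, offset = "G", "A", 1
    else:
        # OT: the informative base is the top-strand C itself.
        methylated_base, unmethylated_base, offset = "C", "T", 0

    calls = []
    for ref_pos in cpg_ref_positions:
        target = ref_to_query.get(ref_pos + offset)
        if target is None:
            calls.append(-1)  # indel or clipping at the informative base
            continue
        base = query_sequence[target]
        if base == methylated_base:
            calls.append(1)
        elif base == unmethylated_base:
            calls.append(0)
        else:
            calls.append(-1)
    return calls


# Toy reference "AACGTTCGA": CpGs with top-strand C at 0-based 2 and 6.
cpgs = [2, 6]
ungapped = [(i, i) for i in range(9)]
ot = call_read_cpgs("AACGTTTGA", ungapped, cpgs, parent_is_reverse=False)
ob = call_read_cpgs("AACGTTCAA", ungapped, cpgs, parent_is_reverse=True)
# OB read with reference position 7 deleted: no base to inspect for CpG #2.
gapped = [(i, i) for i in range(7)] + [(7, 8)]
ob_indel = call_read_cpgs("AACGTTCA", gapped, cpgs, parent_is_reverse=True)
print(ot, ob, ob_indel)
```

In the OB example both `ref_pos` bases read `C`, so applying the C/T rule at `ref_pos` would score both sites as methylated, which is the failure mode the hunk's comments describe.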

{bam2tensor-2.5 → bam2tensor-2.6}/tests/test_functions.py RENAMED

@@ -1074,6 +1074,133 @@ def test_biscuit_debug_mode_ct_bases(tmp_path):
     assert result.matrix.shape[0] == 1
+def test_biscuit_ob_strand_methylation_extraction(tmp_path):
+    """Biscuit/bwameth OB-strand (YD=r, is_reverse=True) reads must read the
+    methylation-informative base at ref_pos+1 (G=methylated, A=unmethylated),
+    not ref_pos (which is always C in BAM SEQ regardless of methylation state).
+    Regression for the bug where OB reads were extracted with C/T logic at
+    ref_pos and thus scored as universally methylated.
+    """
+    fasta_path = tmp_path / "ref.fa"
+    # CpGs at 1-based positions 10, 21 (top-strand C at 0-based 9, 20; G at 10, 21).
+    seq = "N" * 9 + "CG" + "N" * 9 + "CG" + "N" * 128
+    with open(fasta_path, "w") as f:
+        f.write(">chr1\n" + seq + "\n")
+    emb = embedding.GenomeMethylationEmbedding(
+        "test_biscuit_ob",
+        expected_chromosomes=["chr1"],
+        fasta_source=str(fasta_path),
+        skip_cache=True,
+    )
+    bam_path = tmp_path / "test.bam"
+    header = {"HD": {"VN": "1.0"}, "SQ": [{"LN": len(seq), "SN": "chr1"}]}
+    # OB read: BAM SEQ is reference-oriented. The C of each top-strand CG is
+    # always C in BAM (bottom-strand G reverse-complemented). The G of each
+    # top-strand CG is what carries the methylation signal: G=methylated,
+    # A=unmethylated.
+    read_seq = list("N" * len(seq))
+    read_seq[9] = "C"  # top-strand C of CpG#1 (always C in BAM for OB)
+    read_seq[10] = "G"  # methylated → G at ref_pos+1
+    read_seq[20] = "C"  # top-strand C of CpG#2 (always C in BAM for OB)
+    read_seq[21] = "A"  # unmethylated → A at ref_pos+1
+    with pysam.AlignmentFile(bam_path, "wb", header=header) as out_bam:
+        a = pysam.AlignedSegment()
+        a.query_name = "ob_read"
+        a.query_sequence = "".join(read_seq)
+        a.flag = 0x10  # reverse-mapped
+        a.reference_id = 0
+        a.reference_start = 0
+        a.mapping_quality = 60
+        a.cigartuples = [(0, len(seq))]
+        a.set_tag("MD", str(len(seq)))
+        a.set_tag("YD", "r")  # OB / reverse parent strand
+        out_bam.write(a)
+    pysam.index(str(bam_path))
+    result = functions.extract_methylation_data_from_bam(
+        input_bam=str(bam_path),
+        genome_methylation_embedding=emb,
+    )
+    assert result.matrix.shape[0] == 1
+    assert result.matrix.nnz == 2
+    data = sorted(result.matrix.data)
+    assert data == [0, 1], (
+        f"Expected one methylated (1) and one unmethylated (0) call, got {data}. "
+        "If this is all 1s, the OB-strand base lookup regression has returned."
+    )
+def test_biscuit_ot_and_ob_share_cpg_columns(tmp_path):
+    """OT and OB reads at the same CpG must land in the same embedding column
+    (canonical CpG site = top-strand C, ref_pos+1 in 1-based coordinates).
+    """
+    fasta_path = tmp_path / "ref.fa"
+    seq = "N" * 9 + "CG" + "N" * 9 + "CG" + "N" * 128
+    with open(fasta_path, "w") as f:
+        f.write(">chr1\n" + seq + "\n")
+    emb = embedding.GenomeMethylationEmbedding(
+        "test_biscuit_ot_ob_columns",
+        expected_chromosomes=["chr1"],
+        fasta_source=str(fasta_path),
+        skip_cache=True,
+    )
+    bam_path = tmp_path / "test.bam"
+    header = {"HD": {"VN": "1.0"}, "SQ": [{"LN": len(seq), "SN": "chr1"}]}
+    # OT read: C at top-strand C positions = methylated at both CpGs.
+    ot_seq = list("N" * len(seq))
+    ot_seq[9] = "C"
+    ot_seq[20] = "C"
+    # OB read (BAM in reference orientation): G at top-strand G positions
+    # = methylated at both CpGs.
+    ob_seq = list("N" * len(seq))
+    ob_seq[9] = "C"
+    ob_seq[10] = "G"
+    ob_seq[20] = "C"
+    ob_seq[21] = "G"
+    with pysam.AlignmentFile(bam_path, "wb", header=header) as out_bam:
+        a = pysam.AlignedSegment()
+        a.query_name = "ot_read"
+        a.query_sequence = "".join(ot_seq)
+        a.flag = 0
+        a.reference_id = 0
+        a.reference_start = 0
+        a.mapping_quality = 60
+        a.cigartuples = [(0, len(seq))]
+        a.set_tag("MD", str(len(seq)))
+        a.set_tag("YD", "f")
+        out_bam.write(a)
+        b = pysam.AlignedSegment()
+        b.query_name = "ob_read"
+        b.query_sequence = "".join(ob_seq)
+        b.flag = 0x10
+        b.reference_id = 0
+        b.reference_start = 0
+        b.mapping_quality = 60
+        b.cigartuples = [(0, len(seq))]
+        b.set_tag("MD", str(len(seq)))
+        b.set_tag("YD", "r")
+        out_bam.write(b)
+    pysam.index(str(bam_path))
+    result = functions.extract_methylation_data_from_bam(
+        input_bam=str(bam_path),
+        genome_methylation_embedding=emb,
+    )
+    assert result.matrix.shape[0] == 2
+    # Both reads call both CpGs methylated, so we expect two reads × two CpGs
+    # in the same two columns, all with value 1.
+    coo = result.matrix.tocoo()
+    ot_cols = sorted(int(c) for r, c in zip(coo.row, coo.col) if r == 0)
+    ob_cols = sorted(int(c) for r, c in zip(coo.row, coo.col) if r == 1)
+    assert ot_cols == ob_cols, f"OT and OB columns diverged: OT={ot_cols} OB={ob_cols}"
+    assert list(result.matrix.data) == [1, 1, 1, 1]
 # ======================================================================
 # XB tag (gem3/Blueprint) extraction tests
 # ======================================================================

{bam2tensor-2.5 → bam2tensor-2.6}/tests/test_inspect.py RENAMED

@@ -122,7 +122,7 @@ def test_inspect_end_to_end(tmp_path) -> None:
     assert result.exit_code == 0
     assert "test" in result.output  # genome_name
     assert "CpG index CRC32:" in result.output
-    assert "v2.5" in result.output
+    assert "v2.6" in result.output
 def test_format_size_bytes() -> None:
