PyPI - bam2tensor - Versions diffs - 2.4__tar.gz → 2.6__tar.gz - Mend



Files changed (55)

{bam2tensor-2.4 → bam2tensor-2.6}/.github/workflows/docs.yml RENAMED

@@ -47,7 +47,7 @@ jobs:
           uv run sphinx-build docs docs/_build
       - name: Upload artifact
-        uses: actions/upload-pages-artifact@v4
+        uses: actions/upload-pages-artifact@v5
         with:
           path: "docs/_build"

{bam2tensor-2.4 → bam2tensor-2.6}/.github/workflows/release.yml RENAMED

@@ -67,16 +67,16 @@ jobs:
       - name: Publish package on PyPI
         if: steps.check-version.outputs.tag || steps.check-tag.outputs.tag
-        uses: pypa/gh-action-pypi-publish@v1.13.0
+        uses: pypa/gh-action-pypi-publish@v1.14.0
       - name: Publish package on TestPyPI
         if: (!steps.check-version.outputs.tag && !steps.check-tag.outputs.tag)
-        uses: pypa/gh-action-pypi-publish@v1.13.0
+        uses: pypa/gh-action-pypi-publish@v1.14.0
         with:
           repository-url: https://test.pypi.org/legacy/
       - name: Publish the release notes
-        uses: release-drafter/release-drafter@v7.1.1
+        uses: release-drafter/release-drafter@v7.2.1
         with:
           publish: ${{ steps.check-version.outputs.tag != '' || steps.check-tag.outputs.tag != '' }}
           tag: ${{ steps.check-version.outputs.tag || steps.check-tag.outputs.tag }}

{bam2tensor-2.4 → bam2tensor-2.6}/CLAUDE.md RENAMED

@@ -40,7 +40,7 @@ uv run mypy src
 ```
 src/bam2tensor/
-  __init__.py      # Package version (2.4)
+  __init__.py      # Package version (2.5)
   __main__.py      # Click CLI entry point (bam2tensor command)
   inspect.py       # Inspect CLI entry point (bam2tensor-inspect command)
   embedding.py     # GenomeMethylationEmbedding class (FASTA parsing, CpG indexing)

{bam2tensor-2.4 → bam2tensor-2.6}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: bam2tensor
-Version: 2.4
+Version: 2.6
 Summary: Convert bisulfite-seq and EM-seq BAM files to sparse tensor representations of DNA methylation
 Project-URL: Homepage, https://github.com/mcwdsi/bam2tensor
 Project-URL: Repository, https://github.com/mcwdsi/bam2tensor
@@ -72,6 +72,7 @@ Description-Content-Type: text/markdown
   - [Custom Output Directory](#custom-output-directory)
   - [Using a Custom Genome](#using-a-custom-genome)
   - [Command-Line Options](#command-line-options)
+- [Filtering Conversion Errors](#filtering-conversion-errors)
 - [Inspecting Output Files](#inspecting-output-files)
 - [Output Data Structure](#output-data-structure)
   - [Per-Read Fragment Length (TLEN)](#per-read-fragment-length-tlen)
@@ -99,6 +100,7 @@ Description-Content-Type: text/markdown
 - **Caching**: CpG site indexing is cached to accelerate repeated runs on the same genome
 - **Quality Filtering**: Configurable mapping quality thresholds
 - **Per-Read Fragment Length**: Stores BAM TLEN (template length) alongside the methylation tensor for joint fragment-methylation analysis
+- **Conversion-Error Filters**: Optional per-read filters for incomplete bisulfite/EM-seq conversion (ported from `nebiolabs/mark-nonconverted-reads`) and EM-seq fragment-level over-conversion (Loyfer et al. 2026)
 ## Requirements
@@ -256,6 +258,25 @@ Options:
                                   determine CpG sites).
   --quality-limit INTEGER         Quality filter for aligned reads (default =
                                   20)
+  --filter-non-converted          Drop reads with >= --non-converted-threshold
+                                  retained non-CpG cytosines, the signature of
+                                  incomplete bisulfite/EM-seq conversion (port
+                                  of nebiolabs/mark-nonconverted-reads).
+                                  Default: off.
+  --non-converted-threshold INTEGER
+                                  Minimum count of retained non-CpG cytosines
+                                  to drop a read (default = 3, matches NEB
+                                  mark-nonconverted-reads).
+  --filter-em-overconversion      Drop EM-seq reads whose covered CpGs are all
+                                  called unmethylated and cover at least --em-
+                                  overconversion-min-cpgs sites (heuristic for
+                                  the fragment-level over-conversion artifact
+                                  described in Loyfer et al. bioRxiv
+                                  2026.03.24.713040). Default: off.
+  --em-overconversion-min-cpgs INTEGER
+                                  Minimum covered CpG count required before
+                                  the EM over-conversion filter will drop a
+                                  read (default = 3).
   --verbose                       Verbose output.
   --skip-cache                    De-novo generate CpG sites (slow).
   --debug                         Debug mode (extensive validity checking +
@@ -281,6 +302,10 @@ Options:
 | `--expected-chromosomes` | Comma-separated list of chromosome names to process. Chromosomes not in this list are skipped. Defaults to human autosomes + sex chromosomes. |
 | `--reference-fasta` | Path to the reference genome FASTA file. Must match the genome used for alignment. |
 | `--quality-limit` | Minimum mapping quality score (MAPQ) for reads to be included. Default is 20. |
+| `--filter-non-converted` | Drop reads with retained non-CpG cytosines above `--non-converted-threshold` (incomplete conversion). See [Filtering Conversion Errors](#filtering-conversion-errors). |
+| `--non-converted-threshold` | Threshold for the non-converted filter. Default is 3. |
+| `--filter-em-overconversion` | Drop EM-seq reads whose covered CpGs are all unmethylated and cover ≥ `--em-overconversion-min-cpgs` sites. See [Filtering Conversion Errors](#filtering-conversion-errors). |
+| `--em-overconversion-min-cpgs` | Minimum covered CpG count before the EM over-conversion filter will drop a read. Default is 3. |
 | `--verbose` | Enable detailed progress output including per-chromosome progress bars. |
 | `--skip-cache` | Force regeneration of CpG site cache. Useful if you've modified the reference or chromosome list. |
 | `--debug` | Enable extensive validation and debug output. Slower but useful for troubleshooting. |
@@ -289,6 +314,66 @@ Options:
 | `--download-reference` | Download and cache a known reference genome. Choices: `hg38`, `hg19`, `mm10`, `T2T-CHM13`. Replaces `--reference-fasta`. |
 | `--list-genomes` | List available reference genomes for `--download-reference` and exit. |
+## Filtering Conversion Errors
+Bisulfite and EM-seq library preparation can produce two kinds of per-read conversion errors that bias downstream methylation calls. bam2tensor provides two opt-in filters to drop affected reads at extraction time. Both are **default-off**, apply per read, and are recorded in the output `metadata.json` so downstream consumers know which filters were applied.
+### `--filter-non-converted` — incomplete conversion
+Ports the logic of [nebiolabs/mark-nonconverted-reads](https://github.com/nebiolabs/mark-nonconverted-reads). A read is dropped if it carries at least `--non-converted-threshold` (default 3) retained non-CpG cytosines, a signature of incomplete bisulfite or EM-seq conversion.
+- **Bismark BAMs**: counted directly from the `XM` tag's uppercase `H`/`X`/`U` characters (retained cytosines in CHH/CHG/unknown contexts).
+- **Biscuit / bwameth / gem3 BAMs**: counted by comparing the read to the reference via the `MD` tag (using pysam's `get_aligned_pairs(with_seq=True)`). SNPs — where the read's retained `C` sits over a reference base that isn't `C` — are excluded from the count, matching NEB's reference-validation step. No separate FASTA reload is required.
+### `--filter-em-overconversion` — EM-seq fragment-level over-conversion
+A heuristic inspired by [Loyfer et al. (bioRxiv 2026.03.24.713040)](https://www.biorxiv.org/content/10.64898/2026.03.24.713040v1). That paper shows EM-seq reproducibly produces ~1–2.5% of multi-CpG fragments that appear fully unmethylated across every covered CpG — a fragment-level artifact absent from WGBS and Oxford Nanopore. This filter drops any read whose covered CpGs are **all** called unmethylated *and* cover at least `--em-overconversion-min-cpgs` sites (default 3, the regime where the EM-seq artifact is clearly separable from WGBS in Loyfer et al. Fig. 1C).
+The filter is a blunt instrument: it will also drop genuinely fully-unmethylated biological fragments at unmethylated markers. Enable it only when your downstream application (e.g., cfDNA deconvolution at constitutively methylated loci) can tolerate that trade-off.
+### Usage
+```bash
+bam2tensor \
+    --input-path sample.bam \
+    --reference-fasta GRCh38.fa \
+    --genome-name hg38 \
+    --filter-non-converted \
+    --filter-em-overconversion
+```
+Filter parameters and enabled state are written to the output `metadata.json`:
+```json
+{
+    "filters": {
+        "non_converted_reads": {"enabled": true, "threshold": 3},
+        "em_overconversion": {"enabled": true, "min_cpgs": 3}
+    }
+}
+```
+### Reproducibility note
+The two filters differ in whether they can be replayed downstream without the source BAM:
+- **`--filter-em-overconversion` is reproducible from the `.npz` alone.** The heuristic is a pure function of each row's CpG state values. A downstream consumer who receives an unfiltered `.npz` can replay the filter at analysis time:
+  ```python
+  import scipy.sparse
+  mat = scipy.sparse.load_npz("sample.methylation.npz").tocsr()
+  min_cpgs = 3
+  kept_rows = []
+  for i in range(mat.shape[0]):
+      row = mat.getrow(i).toarray().ravel()
+      covered = row[(row == 0) | (row == 1)]  # drop -1 no-data
+      is_overconv = len(covered) >= min_cpgs and (covered == 0).all()
+      if not is_overconv:
+          kept_rows.append(i)
+  ```
+- **`--filter-non-converted` is *not* reproducible from the `.npz` alone.** It relies on retained non-CpG cytosines (or Bismark's `H`/`X`/`U`), which are never written to the matrix. If you need this filter, apply it at extraction time (or re-run bam2tensor against the original BAM).
 ## Inspecting Output Files
 Use `bam2tensor-inspect` to view a summary of any `.methylation.npz` file without writing Python:
@@ -302,11 +387,15 @@ sample.methylation.npz
   CpG sites:       28,217,448
   Data points:     12,847,322 (sparsity: 99.97%)
   Fragment len:    median 167, mean 182, range [50, 600]
+  Filters:         non-converted (>= 3 non-CpG Cs)
+                   EM over-conversion (all-unmethylated, >= 3 CpGs)
   CpG index CRC32: a1b2c3d4
-  bam2tensor:      v2.4
+  bam2tensor:      v2.5
   File size:       14.2 MB
 ```
+When no filters were applied, the line reads `Filters:         none`. Files produced by bam2tensor versions older than v2.5 omit the line entirely.
 You can pass multiple files at once:
 ```bash
@@ -368,6 +457,7 @@ Each `.methylation.npz` file includes a `metadata.json` entry inside the ZIP arc
 | `expected_chromosomes` | List of chromosomes included in the column mapping |
 | `total_cpg_sites` | Total number of CpG columns in the matrix |
 | `cpg_index_crc32` | CRC32 checksum of the CpG site positions (verifies identical column semantics) |
+| `filters` | Nested dict recording which opt-in conversion-error filters were applied (`non_converted_reads`, `em_overconversion`) and their parameters. See [Filtering Conversion Errors](#filtering-conversion-errors). Added in v2.5. |
 This metadata is ignored by `scipy.sparse.load_npz`, so existing code continues to work. To read it:
@@ -570,6 +660,10 @@ extract_methylation_data_from_bam(
     input_bam: str,                                    # Path to BAM file
     genome_methylation_embedding: GenomeMethylationEmbedding,  # Embedding object
     quality_limit: int = 20,                           # Minimum MAPQ
+    filter_non_converted: bool = False,                # Drop reads with retained non-CpG Cs
+    non_converted_threshold: int = 3,                  # Threshold for the above filter
+    filter_em_overconversion: bool = False,            # Drop EM-seq fragment-level over-conversion reads
+    em_overconversion_min_cpgs: int = 3,               # Min CpGs before applying the above filter
     verbose: bool = False,                             # Enable verbose output
     debug: bool = False                                # Enable debug output
 ) -> ExtractionResult
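
Note on the Bismark branch of `--filter-non-converted`: the README text added in this hunk says retained non-CpG cytosines are counted from the uppercase `H`/`X`/`U` characters of the `XM` tag. The sketch below illustrates that rule only. It is not the package's code, the names `count_retained_non_cpg` and `is_non_converted` are hypothetical, and the comparison uses `>=` as in the CLI help text and docstring (the options table says "above").

```python
def count_retained_non_cpg(xm_tag: str) -> int:
    """Count uppercase H/X/U calls in a Bismark XM string.

    Uppercase H, X and U mark retained cytosines in CHH, CHG and
    unknown contexts; Z/z are CpG calls and are ignored here.
    """
    return sum(1 for call in xm_tag if call in "HXU")


def is_non_converted(xm_tag: str, threshold: int = 3) -> bool:
    """Hypothetical helper: True when the read would be dropped.

    Assumes the '>=' comparison stated in the CLI help text.
    """
    return count_retained_non_cpg(xm_tag) >= threshold


# Three retained non-CpG calls (H, X, U) reach the default threshold.
print(is_non_converted("..Z..h..H.X..U"))
# One retained non-CpG call (H) does not.
print(is_non_converted("..Z.z.h.H."))
```

Lowercase `h`/`x`/`u` are converted cytosines and do not count toward the threshold.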

{bam2tensor-2.4 → bam2tensor-2.6}/README.md RENAMED

@@ -39,6 +39,7 @@
   - [Custom Output Directory](#custom-output-directory)
   - [Using a Custom Genome](#using-a-custom-genome)
   - [Command-Line Options](#command-line-options)
+- [Filtering Conversion Errors](#filtering-conversion-errors)
 - [Inspecting Output Files](#inspecting-output-files)
 - [Output Data Structure](#output-data-structure)
   - [Per-Read Fragment Length (TLEN)](#per-read-fragment-length-tlen)
@@ -66,6 +67,7 @@
 - **Caching**: CpG site indexing is cached to accelerate repeated runs on the same genome
 - **Quality Filtering**: Configurable mapping quality thresholds
 - **Per-Read Fragment Length**: Stores BAM TLEN (template length) alongside the methylation tensor for joint fragment-methylation analysis
+- **Conversion-Error Filters**: Optional per-read filters for incomplete bisulfite/EM-seq conversion (ported from `nebiolabs/mark-nonconverted-reads`) and EM-seq fragment-level over-conversion (Loyfer et al. 2026)
 ## Requirements
@@ -223,6 +225,25 @@ Options:
                                   determine CpG sites).
   --quality-limit INTEGER         Quality filter for aligned reads (default =
                                   20)
+  --filter-non-converted          Drop reads with >= --non-converted-threshold
+                                  retained non-CpG cytosines, the signature of
+                                  incomplete bisulfite/EM-seq conversion (port
+                                  of nebiolabs/mark-nonconverted-reads).
+                                  Default: off.
+  --non-converted-threshold INTEGER
+                                  Minimum count of retained non-CpG cytosines
+                                  to drop a read (default = 3, matches NEB
+                                  mark-nonconverted-reads).
+  --filter-em-overconversion      Drop EM-seq reads whose covered CpGs are all
+                                  called unmethylated and cover at least --em-
+                                  overconversion-min-cpgs sites (heuristic for
+                                  the fragment-level over-conversion artifact
+                                  described in Loyfer et al. bioRxiv
+                                  2026.03.24.713040). Default: off.
+  --em-overconversion-min-cpgs INTEGER
+                                  Minimum covered CpG count required before
+                                  the EM over-conversion filter will drop a
+                                  read (default = 3).
   --verbose                       Verbose output.
   --skip-cache                    De-novo generate CpG sites (slow).
   --debug                         Debug mode (extensive validity checking +
@@ -248,6 +269,10 @@ Options:
 | `--expected-chromosomes` | Comma-separated list of chromosome names to process. Chromosomes not in this list are skipped. Defaults to human autosomes + sex chromosomes. |
 | `--reference-fasta` | Path to the reference genome FASTA file. Must match the genome used for alignment. |
 | `--quality-limit` | Minimum mapping quality score (MAPQ) for reads to be included. Default is 20. |
+| `--filter-non-converted` | Drop reads with retained non-CpG cytosines above `--non-converted-threshold` (incomplete conversion). See [Filtering Conversion Errors](#filtering-conversion-errors). |
+| `--non-converted-threshold` | Threshold for the non-converted filter. Default is 3. |
+| `--filter-em-overconversion` | Drop EM-seq reads whose covered CpGs are all unmethylated and cover ≥ `--em-overconversion-min-cpgs` sites. See [Filtering Conversion Errors](#filtering-conversion-errors). |
+| `--em-overconversion-min-cpgs` | Minimum covered CpG count before the EM over-conversion filter will drop a read. Default is 3. |
 | `--verbose` | Enable detailed progress output including per-chromosome progress bars. |
 | `--skip-cache` | Force regeneration of CpG site cache. Useful if you've modified the reference or chromosome list. |
 | `--debug` | Enable extensive validation and debug output. Slower but useful for troubleshooting. |
@@ -256,6 +281,66 @@ Options:
 | `--download-reference` | Download and cache a known reference genome. Choices: `hg38`, `hg19`, `mm10`, `T2T-CHM13`. Replaces `--reference-fasta`. |
 | `--list-genomes` | List available reference genomes for `--download-reference` and exit. |
+## Filtering Conversion Errors
+Bisulfite and EM-seq library preparation can produce two kinds of per-read conversion errors that bias downstream methylation calls. bam2tensor provides two opt-in filters to drop affected reads at extraction time. Both are **default-off**, apply per read, and are recorded in the output `metadata.json` so downstream consumers know which filters were applied.
+### `--filter-non-converted` — incomplete conversion
+Ports the logic of [nebiolabs/mark-nonconverted-reads](https://github.com/nebiolabs/mark-nonconverted-reads). A read is dropped if it carries at least `--non-converted-threshold` (default 3) retained non-CpG cytosines, a signature of incomplete bisulfite or EM-seq conversion.
+- **Bismark BAMs**: counted directly from the `XM` tag's uppercase `H`/`X`/`U` characters (retained cytosines in CHH/CHG/unknown contexts).
+- **Biscuit / bwameth / gem3 BAMs**: counted by comparing the read to the reference via the `MD` tag (using pysam's `get_aligned_pairs(with_seq=True)`). SNPs — where the read's retained `C` sits over a reference base that isn't `C` — are excluded from the count, matching NEB's reference-validation step. No separate FASTA reload is required.
+### `--filter-em-overconversion` — EM-seq fragment-level over-conversion
+A heuristic inspired by [Loyfer et al. (bioRxiv 2026.03.24.713040)](https://www.biorxiv.org/content/10.64898/2026.03.24.713040v1). That paper shows EM-seq reproducibly produces ~1–2.5% of multi-CpG fragments that appear fully unmethylated across every covered CpG — a fragment-level artifact absent from WGBS and Oxford Nanopore. This filter drops any read whose covered CpGs are **all** called unmethylated *and* cover at least `--em-overconversion-min-cpgs` sites (default 3, the regime where the EM-seq artifact is clearly separable from WGBS in Loyfer et al. Fig. 1C).
+The filter is a blunt instrument: it will also drop genuinely fully-unmethylated biological fragments at unmethylated markers. Enable it only when your downstream application (e.g., cfDNA deconvolution at constitutively methylated loci) can tolerate that trade-off.
+### Usage
+```bash
+bam2tensor \
+    --input-path sample.bam \
+    --reference-fasta GRCh38.fa \
+    --genome-name hg38 \
+    --filter-non-converted \
+    --filter-em-overconversion
+```
+Filter parameters and enabled state are written to the output `metadata.json`:
+```json
+{
+    "filters": {
+        "non_converted_reads": {"enabled": true, "threshold": 3},
+        "em_overconversion": {"enabled": true, "min_cpgs": 3}
+    }
+}
+```
+### Reproducibility note
+The two filters differ in whether they can be replayed downstream without the source BAM:
+- **`--filter-em-overconversion` is reproducible from the `.npz` alone.** The heuristic is a pure function of each row's CpG state values. A downstream consumer who receives an unfiltered `.npz` can replay the filter at analysis time:
+  ```python
+  import scipy.sparse
+  mat = scipy.sparse.load_npz("sample.methylation.npz").tocsr()
+  min_cpgs = 3
+  kept_rows = []
+  for i in range(mat.shape[0]):
+      row = mat.getrow(i).toarray().ravel()
+      covered = row[(row == 0) | (row == 1)]  # drop -1 no-data
+      is_overconv = len(covered) >= min_cpgs and (covered == 0).all()
+      if not is_overconv:
+          kept_rows.append(i)
+  ```
+- **`--filter-non-converted` is *not* reproducible from the `.npz` alone.** It relies on retained non-CpG cytosines (or Bismark's `H`/`X`/`U`), which are never written to the matrix. If you need this filter, apply it at extraction time (or re-run bam2tensor against the original BAM).
 ## Inspecting Output Files
 Use `bam2tensor-inspect` to view a summary of any `.methylation.npz` file without writing Python:
@@ -269,11 +354,15 @@ sample.methylation.npz
   CpG sites:       28,217,448
   Data points:     12,847,322 (sparsity: 99.97%)
   Fragment len:    median 167, mean 182, range [50, 600]
+  Filters:         non-converted (>= 3 non-CpG Cs)
+                   EM over-conversion (all-unmethylated, >= 3 CpGs)
   CpG index CRC32: a1b2c3d4
-  bam2tensor:      v2.4
+  bam2tensor:      v2.5
   File size:       14.2 MB
 ```
+When no filters were applied, the line reads `Filters:         none`. Files produced by bam2tensor versions older than v2.5 omit the line entirely.
 You can pass multiple files at once:
 ```bash
@@ -335,6 +424,7 @@ Each `.methylation.npz` file includes a `metadata.json` entry inside the ZIP arc
 | `expected_chromosomes` | List of chromosomes included in the column mapping |
 | `total_cpg_sites` | Total number of CpG columns in the matrix |
 | `cpg_index_crc32` | CRC32 checksum of the CpG site positions (verifies identical column semantics) |
+| `filters` | Nested dict recording which opt-in conversion-error filters were applied (`non_converted_reads`, `em_overconversion`) and their parameters. See [Filtering Conversion Errors](#filtering-conversion-errors). Added in v2.5. |
 This metadata is ignored by `scipy.sparse.load_npz`, so existing code continues to work. To read it:
@@ -537,6 +627,10 @@ extract_methylation_data_from_bam(
     input_bam: str,                                    # Path to BAM file
     genome_methylation_embedding: GenomeMethylationEmbedding,  # Embedding object
     quality_limit: int = 20,                           # Minimum MAPQ
+    filter_non_converted: bool = False,                # Drop reads with retained non-CpG Cs
+    non_converted_threshold: int = 3,                  # Threshold for the above filter
+    filter_em_overconversion: bool = False,            # Drop EM-seq fragment-level over-conversion reads
+    em_overconversion_min_cpgs: int = 3,               # Min CpGs before applying the above filter
     verbose: bool = False,                             # Enable verbose output
     debug: bool = False                                # Enable debug output
 ) -> ExtractionResult
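
Note on the replay snippet in the README hunk above: scipy's `toarray()` fills every unstored position with 0, so a dense row cannot tell an uncovered CpG from a stored 0. The sketch below inspects only the stored entries of a CSR matrix. It assumes, without confirmation from this diff, that covered CpGs are stored explicitly as 0 or 1, that -1 marks no-data, and that uncovered sites are simply not stored. `overconverted_rows` is a hypothetical helper name.

```python
from typing import Sequence


def overconverted_rows(
    data: Sequence[int], indptr: Sequence[int], min_cpgs: int = 3
) -> list[bool]:
    """Flag rows whose stored CpG calls are all 0 and number >= min_cpgs.

    `data` and `indptr` are the CSR value and row-pointer arrays.
    Assumption: only covered CpGs are stored; -1 entries are no-data.
    """
    flags = []
    for start, end in zip(indptr[:-1], indptr[1:]):
        # Keep explicit 0/1 calls, skipping -1 no-data entries.
        covered = [value for value in data[start:end] if value in (0, 1)]
        flags.append(
            len(covered) >= min_cpgs and all(value == 0 for value in covered)
        )
    return flags


# Four toy reads laid out as CSR rows.
data = [0, 0, 0,  0, 1, 0,  0, 0,  0, -1, 0, 0]
indptr = [0, 3, 6, 8, 12]
print(overconverted_rows(data, indptr))
```

With a real file, `mat = scipy.sparse.load_npz(path).tocsr()` provides `mat.data` and `mat.indptr` to pass in; rows flagged `True` are the ones the heuristic would drop.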

{bam2tensor-2.4 → bam2tensor-2.6}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "bam2tensor"
-version = "2.4"
+version = "2.6"
 description = "Convert bisulfite-seq and EM-seq BAM files to sparse tensor representations of DNA methylation"
 authors = [{ name = "Nick Semenkovich", email = "semenko@alum.mit.edu" }]
 license = "MIT"

{bam2tensor-2.4 → bam2tensor-2.6}/src/bam2tensor/__init__.py RENAMED

@@ -50,4 +50,4 @@ See Also:
     - https://mcwdsi.github.io/bam2tensor for full documentation
 """
-__version__ = "2.4"
+__version__ = "2.6"

{bam2tensor-2.4 → bam2tensor-2.6}/src/bam2tensor/__main__.py RENAMED

@@ -229,6 +229,43 @@ def validate_input_output(
     default=20,
     type=int,
 )
+@click.option(
+    "--filter-non-converted",
+    help=(
+        "Drop reads with >= --non-converted-threshold retained non-CpG "
+        "cytosines, the signature of incomplete bisulfite/EM-seq conversion "
+        "(port of nebiolabs/mark-nonconverted-reads). Default: off."
+    ),
+    is_flag=True,
+)
+@click.option(
+    "--non-converted-threshold",
+    help=(
+        "Minimum count of retained non-CpG cytosines to drop a read "
+        "(default = 3, matches NEB mark-nonconverted-reads)."
+    ),
+    default=3,
+    type=int,
+)
+@click.option(
+    "--filter-em-overconversion",
+    help=(
+        "Drop EM-seq reads whose covered CpGs are all called unmethylated "
+        "and cover at least --em-overconversion-min-cpgs sites (heuristic "
+        "for the fragment-level over-conversion artifact described in "
+        "Loyfer et al. bioRxiv 2026.03.24.713040). Default: off."
+    ),
+    is_flag=True,
+)
+@click.option(
+    "--em-overconversion-min-cpgs",
+    help=(
+        "Minimum covered CpG count required before the EM over-conversion "
+        "filter will drop a read (default = 3)."
+    ),
+    default=3,
+    type=int,
+)
 @click.option("--verbose", help="Verbose output.", is_flag=True)
 @click.option("--skip-cache", help="De-novo generate CpG sites (slow).", is_flag=True)
 @click.option(
@@ -263,6 +300,10 @@ def main(
     expected_chromosomes: str | None,
     reference_fasta: str | None,
     quality_limit: int,
+    filter_non_converted: bool,
+    non_converted_threshold: int,
+    filter_em_overconversion: bool,
+    em_overconversion_min_cpgs: int,
     verbose: bool,
     skip_cache: bool,
     debug: bool,
@@ -300,6 +341,17 @@ def main(
             ``--download-reference`` is used.
         quality_limit: Minimum mapping quality (MAPQ) threshold. Reads below
             this quality are excluded.
+        filter_non_converted: If True, drop reads with at least
+            ``non_converted_threshold`` retained non-CpG cytosines —
+            indicating incomplete bisulfite/EM-seq conversion.
+        non_converted_threshold: Threshold used by the non-converted
+            read filter.
+        filter_em_overconversion: If True, drop reads whose covered CpGs
+            are all called unmethylated and cover at least
+            ``em_overconversion_min_cpgs`` sites — heuristic for EM-seq
+            fragment-level over-conversion (Loyfer et al. 2026).
+        em_overconversion_min_cpgs: Minimum covered CpG count required
+            before the over-conversion filter will drop a read.
         verbose: If True, print detailed progress information.
         skip_cache: If True, regenerate the CpG site index even if a cache
             file exists.
@@ -382,6 +434,16 @@ def main(
     print(f"  Reference:     {reference_fasta}")
     print(f"  Chromosomes:   {chrom_display}")
     print(f"  Quality limit: MAPQ >= {quality_limit}")
+    if filter_non_converted:
+        print(
+            f"  Filters:       non-converted reads (>= "
+            f"{non_converted_threshold} retained non-CpG Cs)"
+        )
+    if filter_em_overconversion:
+        print(
+            f"                 EM over-conversion (all-unmethylated, >= "
+            f"{em_overconversion_min_cpgs} CpGs)"
+        )
     if output_dir:
         print(f"  Output dir:    {output_dir}")
     else:
@@ -448,6 +510,10 @@ def main(
                 input_bam=input_bam,
                 genome_methylation_embedding=genome_methylation_embedding,
                 quality_limit=quality_limit,
+                filter_non_converted=filter_non_converted,
+                non_converted_threshold=non_converted_threshold,
+                filter_em_overconversion=filter_em_overconversion,
+                em_overconversion_min_cpgs=em_overconversion_min_cpgs,
                 verbose=verbose,
                 debug=debug,
             )
@@ -476,6 +542,16 @@ def main(
                 "expected_chromosomes": chrom_list,
                 "total_cpg_sites": genome_methylation_embedding.total_cpg_sites,
                 "cpg_index_crc32": cpg_crc32,
+                "filters": {
+                    "non_converted_reads": {
+                        "enabled": filter_non_converted,
+                        "threshold": non_converted_threshold,
+                    },
+                    "em_overconversion": {
+                        "enabled": filter_em_overconversion,
+                        "min_cpgs": em_overconversion_min_cpgs,
+                    },
+                },
             },
         )
         print(f"  Output:        {output_file}")
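
Note on the new `filters` metadata block: the README states that `metadata.json` is stored as an entry inside the `.methylation.npz` ZIP archive, so a downstream consumer can check which filters were applied using only the standard library. The sketch below is an illustration, not the `bam2tensor-inspect` implementation. `read_metadata` and `describe_filters` are hypothetical names, and the summary strings copy the example inspect output shown in the README hunk.

```python
import json
import zipfile


def read_metadata(npz_path):
    """Load the metadata.json entry from a .methylation.npz archive."""
    with zipfile.ZipFile(npz_path) as archive:
        with archive.open("metadata.json") as handle:
            return json.load(handle)


def describe_filters(metadata):
    """Summarize the 'filters' block as a list of short strings.

    Returns an empty list when the key is missing, which the README
    says is the case for files written before the key was introduced.
    """
    filters = metadata.get("filters")
    if filters is None:
        return []
    lines = []
    non_converted = filters.get("non_converted_reads", {})
    if non_converted.get("enabled"):
        lines.append(
            f"non-converted (>= {non_converted['threshold']} non-CpG Cs)"
        )
    em = filters.get("em_overconversion", {})
    if em.get("enabled"):
        lines.append(
            f"EM over-conversion (all-unmethylated, >= {em['min_cpgs']} CpGs)"
        )
    return lines or ["none"]


example = {
    "filters": {
        "non_converted_reads": {"enabled": True, "threshold": 3},
        "em_overconversion": {"enabled": False, "min_cpgs": 3},
    }
}
print(describe_filters(example))
```

Because `__main__.py` writes both sub-dicts whether or not the flags are set, the `enabled` field, not the presence of the key, is what indicates that a filter ran.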
