PyPI - bam2tensor - Versions diffs - 2.3__tar.gz → 2.5__tar.gz - Mend

bam2tensor 2.3__tar.gz → 2.5__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (55)

{bam2tensor-2.3 → bam2tensor-2.5}/.github/workflows/constraints.txt RENAMED

@@ -1,2 +1,2 @@
 nox==2026.2.9
-uv==0.10.7
+uv==0.11.2

{bam2tensor-2.3 → bam2tensor-2.5}/.github/workflows/docs.yml RENAMED

@@ -59,8 +59,8 @@ jobs:
     needs: build
     steps:
       - name: Setup Pages
-        uses: actions/configure-pages@v5
+        uses: actions/configure-pages@v6
       - name: Deploy to GitHub Pages
         id: deployment
-        uses: actions/deploy-pages@v4
+        uses: actions/deploy-pages@v5

{bam2tensor-2.3 → bam2tensor-2.5}/.github/workflows/labeler.yml RENAMED

@@ -20,6 +20,6 @@ jobs:
         uses: actions/checkout@v6
       - name: Run Labeler
-        uses: crazy-max/ghaction-github-labeler@v5.3.0
+        uses: crazy-max/ghaction-github-labeler@v6.0.0
         with:
           skip-delete: true

{bam2tensor-2.3 → bam2tensor-2.5}/.github/workflows/release.yml RENAMED

@@ -76,7 +76,7 @@ jobs:
           repository-url: https://test.pypi.org/legacy/
       - name: Publish the release notes
-        uses: release-drafter/release-drafter@v6.2.0
+        uses: release-drafter/release-drafter@v7.1.1
         with:
           publish: ${{ steps.check-version.outputs.tag != '' || steps.check-tag.outputs.tag != '' }}
           tag: ${{ steps.check-version.outputs.tag || steps.check-tag.outputs.tag }}

{bam2tensor-2.3 → bam2tensor-2.5}/CLAUDE.md RENAMED

@@ -40,7 +40,7 @@ uv run mypy src
 ```
 src/bam2tensor/
-  __init__.py      # Package version (2.3)
+  __init__.py      # Package version (2.5)
   __main__.py      # Click CLI entry point (bam2tensor command)
   inspect.py       # Inspect CLI entry point (bam2tensor-inspect command)
   embedding.py     # GenomeMethylationEmbedding class (FASTA parsing, CpG indexing)
@@ -117,6 +117,8 @@ xdoctest validates code examples in docstrings. Important rules:
 - Columns = CpG sites (ordered by genomic position, determined by reference genome)
 - Values: 1 (methylated), 0 (unmethylated), -1 (no data/indels/SNVs)
 - Each .npz file contains a `metadata.json` entry with provenance info (genome name, version, CpG index CRC32, expected chromosomes). Read via `bam2tensor.metadata.read_npz_metadata()`.
+- Each .npz file contains a `tlen.npy` entry with per-read signed template length (BAM TLEN field) as int32. Read via `bam2tensor.metadata.read_npz_tlen()`. Returns `None` for files from older versions.
+- `extract_methylation_data_from_bam()` returns an `ExtractionResult` NamedTuple with `.matrix` (sparse COO) and `.tlen` (numpy int32 array).
 ### Methylation Strand Detection
 - Bismark aligner: XM tag (Z/z for methylated/unmethylated CpG; no strand filtering needed)

{bam2tensor-2.3 → bam2tensor-2.5}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: bam2tensor
-Version: 2.3
+Version: 2.5
 Summary: Convert bisulfite-seq and EM-seq BAM files to sparse tensor representations of DNA methylation
 Project-URL: Homepage, https://github.com/mcwdsi/bam2tensor
 Project-URL: Repository, https://github.com/mcwdsi/bam2tensor
@@ -74,6 +74,7 @@ Description-Content-Type: text/markdown
   - [Command-Line Options](#command-line-options)
 - [Inspecting Output Files](#inspecting-output-files)
 - [Output Data Structure](#output-data-structure)
+  - [Per-Read Fragment Length (TLEN)](#per-read-fragment-length-tlen)
   - [Embedded Metadata](#embedded-metadata)
   - [Loading Output Files](#loading-output-files)
   - [Converting to Dense Arrays](#converting-to-dense-arrays)
@@ -97,6 +98,7 @@ Description-Content-Type: text/markdown
 - **Batch Processing**: Process multiple BAM files with directory recursion
 - **Caching**: CpG site indexing is cached to accelerate repeated runs on the same genome
 - **Quality Filtering**: Configurable mapping quality thresholds
+- **Per-Read Fragment Length**: Stores BAM TLEN (template length) alongside the methylation tensor for joint fragment-methylation analysis
 ## Requirements
@@ -299,8 +301,9 @@ sample.methylation.npz
   Reads:           1,423,891
   CpG sites:       28,217,448
   Data points:     12,847,322 (sparsity: 99.97%)
+  Fragment len:    median 167, mean 182, range [50, 600]
   CpG index CRC32: a1b2c3d4
-  bam2tensor:      v2.3
+  bam2tensor:      v2.4
   File size:       14.2 MB
 ```
@@ -333,6 +336,27 @@ The **column dimension is determined entirely by the reference genome**: it equa
 Note: The matrix uses SciPy's COO sparse format, which explicitly stores all non-zero values. Unmethylated sites (value `0`) **are** stored as explicit entries. Positions not covered by a read are simply absent from the matrix (implicit zero, which is distinct from the explicit `0` = unmethylated).
+### Per-Read Fragment Length (TLEN)
+Each `.methylation.npz` file includes a `tlen.npy` entry inside the ZIP archive containing the signed BAM template length (TLEN) for every read in the matrix. This enables joint fragment-length and methylation analysis without re-processing the BAM.
+- One `int32` value per read (row), in the same order as the sparse matrix rows
+- Signed: positive for the leftmost read in a pair, negative for the rightmost
+- Zero for single-end reads or reads with unmapped mates
+- Use `abs(tlen)` to get fragment lengths
+```python
+from bam2tensor.metadata import read_npz_tlen
+import numpy as np
+tlen = read_npz_tlen("sample.methylation.npz")
+if tlen is not None:
+    frag_lengths = np.abs(tlen)
+    nonzero = frag_lengths[frag_lengths > 0]
+    print(f"Median fragment length: {np.median(nonzero):.0f}")
+    print(f"Mean fragment length: {np.mean(nonzero):.0f}")
+```
 ### Embedded Metadata
 Each `.methylation.npz` file includes a `metadata.json` entry inside the ZIP archive with provenance information:
@@ -548,10 +572,22 @@ extract_methylation_data_from_bam(
     quality_limit: int = 20,                           # Minimum MAPQ
     verbose: bool = False,                             # Enable verbose output
     debug: bool = False                                # Enable debug output
-) -> scipy.sparse.coo_matrix
+) -> ExtractionResult
+```
+**Returns:** An `ExtractionResult` named tuple with two fields:
+- `matrix`: A SciPy COO sparse matrix with shape (n_reads, n_cpg_sites)
+- `tlen`: A 1-D numpy `int32` array of shape (n_reads,) containing the signed template length (BAM TLEN field) for each read
+### `bam2tensor.metadata.read_npz_tlen`
+Read per-read template lengths from a `.methylation.npz` file.
+```python
+read_npz_tlen(npz_path: str) -> np.ndarray | None
 ```
-**Returns:** A SciPy COO sparse matrix with shape (n_reads, n_cpg_sites).
+**Returns:** The per-read template-length array, or `None` if the file was produced by an older version of bam2tensor.
 ## Contributing
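
The PKG-INFO hunks above describe `metadata.json` and `tlen.npy` as extra entries inside the `.npz` ZIP archive, with readers returning `None` for files that predate an entry. The implementation of `write_npz_metadata` and `read_npz_tlen` is not part of this diff, so the following is only a minimal sketch of that pattern using the standard library. The helper names `append_json_entry` and `read_json_entry` are hypothetical, and the placeholder archive stands in for a file written by `scipy.sparse.save_npz`.

```python
import json
import os
import tempfile
import zipfile
from typing import Optional


def append_json_entry(npz_path: str, name: str, payload: dict) -> None:
    """Append a JSON member to an existing ZIP-based archive (hypothetical helper)."""
    with zipfile.ZipFile(npz_path, mode="a", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, json.dumps(payload))


def read_json_entry(npz_path: str, name: str) -> Optional[dict]:
    """Return the parsed JSON member, or None when the archive lacks it."""
    with zipfile.ZipFile(npz_path) as zf:
        if name not in zf.namelist():
            return None
        return json.loads(zf.read(name))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.npz")
    # Stand-in for the archive that scipy.sparse.save_npz would create.
    with zipfile.ZipFile(path, mode="w") as zf:
        zf.writestr("placeholder.bin", b"\x00")

    # An archive without the entry behaves like a file from an older version.
    missing = read_json_entry(path, "metadata.json")

    append_json_entry(path, "metadata.json", {"total_cpg_sites": 3})
    found = read_json_entry(path, "metadata.json")

print(missing, found)
```

Opening the archive in append mode leaves the existing members untouched, which is why an extra entry of this kind can be added after the sparse matrix has already been saved.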

{bam2tensor-2.3 → bam2tensor-2.5}/README.md RENAMED

@@ -41,6 +41,7 @@
   - [Command-Line Options](#command-line-options)
 - [Inspecting Output Files](#inspecting-output-files)
 - [Output Data Structure](#output-data-structure)
+  - [Per-Read Fragment Length (TLEN)](#per-read-fragment-length-tlen)
   - [Embedded Metadata](#embedded-metadata)
   - [Loading Output Files](#loading-output-files)
   - [Converting to Dense Arrays](#converting-to-dense-arrays)
@@ -64,6 +65,7 @@
 - **Batch Processing**: Process multiple BAM files with directory recursion
 - **Caching**: CpG site indexing is cached to accelerate repeated runs on the same genome
 - **Quality Filtering**: Configurable mapping quality thresholds
+- **Per-Read Fragment Length**: Stores BAM TLEN (template length) alongside the methylation tensor for joint fragment-methylation analysis
 ## Requirements
@@ -266,8 +268,9 @@ sample.methylation.npz
   Reads:           1,423,891
   CpG sites:       28,217,448
   Data points:     12,847,322 (sparsity: 99.97%)
+  Fragment len:    median 167, mean 182, range [50, 600]
   CpG index CRC32: a1b2c3d4
-  bam2tensor:      v2.3
+  bam2tensor:      v2.4
   File size:       14.2 MB
 ```
@@ -300,6 +303,27 @@ The **column dimension is determined entirely by the reference genome**: it equa
 Note: The matrix uses SciPy's COO sparse format, which explicitly stores all non-zero values. Unmethylated sites (value `0`) **are** stored as explicit entries. Positions not covered by a read are simply absent from the matrix (implicit zero, which is distinct from the explicit `0` = unmethylated).
+### Per-Read Fragment Length (TLEN)
+Each `.methylation.npz` file includes a `tlen.npy` entry inside the ZIP archive containing the signed BAM template length (TLEN) for every read in the matrix. This enables joint fragment-length and methylation analysis without re-processing the BAM.
+- One `int32` value per read (row), in the same order as the sparse matrix rows
+- Signed: positive for the leftmost read in a pair, negative for the rightmost
+- Zero for single-end reads or reads with unmapped mates
+- Use `abs(tlen)` to get fragment lengths
+```python
+from bam2tensor.metadata import read_npz_tlen
+import numpy as np
+tlen = read_npz_tlen("sample.methylation.npz")
+if tlen is not None:
+    frag_lengths = np.abs(tlen)
+    nonzero = frag_lengths[frag_lengths > 0]
+    print(f"Median fragment length: {np.median(nonzero):.0f}")
+    print(f"Mean fragment length: {np.mean(nonzero):.0f}")
+```
 ### Embedded Metadata
 Each `.methylation.npz` file includes a `metadata.json` entry inside the ZIP archive with provenance information:
@@ -515,10 +539,22 @@ extract_methylation_data_from_bam(
     quality_limit: int = 20,                           # Minimum MAPQ
     verbose: bool = False,                             # Enable verbose output
     debug: bool = False                                # Enable debug output
-) -> scipy.sparse.coo_matrix
+) -> ExtractionResult
+```
+**Returns:** An `ExtractionResult` named tuple with two fields:
+- `matrix`: A SciPy COO sparse matrix with shape (n_reads, n_cpg_sites)
+- `tlen`: A 1-D numpy `int32` array of shape (n_reads,) containing the signed template length (BAM TLEN field) for each read
+### `bam2tensor.metadata.read_npz_tlen`
+Read per-read template lengths from a `.methylation.npz` file.
+```python
+read_npz_tlen(npz_path: str) -> np.ndarray | None
 ```
-**Returns:** A SciPy COO sparse matrix with shape (n_reads, n_cpg_sites).
+**Returns:** The per-read template-length array, or `None` if the file was produced by an older version of bam2tensor.
 ## Contributing
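
The README hunks above add a `Fragment len: median ..., mean ..., range [...]` line to the sample inspect output and recommend `abs(tlen)` for fragment lengths. As a rough illustration of how such a summary could be derived from signed TLEN values, here is a standard-library sketch. The function name `summarize_fragment_lengths` is hypothetical, and excluding zero TLEN values (single-end reads or unmapped mates) is an assumption borrowed from the README example, not confirmed behaviour of `bam2tensor-inspect`.

```python
import statistics


def summarize_fragment_lengths(tlen):
    """Summarize |TLEN| over reads with a nonzero template length."""
    # Signed TLEN: positive for the leftmost read, negative for the rightmost,
    # zero when no template length is available.
    lengths = [abs(t) for t in tlen if t != 0]
    if not lengths:
        return None
    return {
        "median": statistics.median(lengths),
        "mean": statistics.mean(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


# Made-up values for illustration only.
signed_tlen = [167, -167, 0, 150, -200]
summary = summarize_fragment_lengths(signed_tlen)
print(summary)
```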

{bam2tensor-2.3 → bam2tensor-2.5}/docs/reference.md RENAMED

@@ -32,6 +32,14 @@ bam2tensor.functions module
    :show-inheritance:
    :undoc-members:
+bam2tensor.metadata module
+--------------------------
+.. automodule:: bam2tensor.metadata
+   :members:
+   :show-inheritance:
+   :undoc-members:
 bam2tensor.reference module
 ---------------------------

{bam2tensor-2.3 → bam2tensor-2.5}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "bam2tensor"
-version = "2.3"
+version = "2.5"
 description = "Convert bisulfite-seq and EM-seq BAM files to sparse tensor representations of DNA methylation"
 authors = [{ name = "Nick Semenkovich", email = "semenko@alum.mit.edu" }]
 license = "MIT"

{bam2tensor-2.3 → bam2tensor-2.5}/src/bam2tensor/__init__.py RENAMED

@@ -30,14 +30,14 @@ Example:
         )
         # Extract methylation data
-        sparse_matrix = extract_methylation_data_from_bam(
+        result = extract_methylation_data_from_bam(
             input_bam="/path/to/sample.bam",
             genome_methylation_embedding=embedding,
         )
         # Save to file
         import scipy.sparse
-        scipy.sparse.save_npz("output.npz", sparse_matrix)
+        scipy.sparse.save_npz("output.npz", result.matrix)
 Output Format:
     The output is a SciPy sparse COO matrix where:
@@ -50,4 +50,4 @@ See Also:
     - https://mcwdsi.github.io/bam2tensor for full documentation
 """
-__version__ = "2.3"
+__version__ = "2.5"
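
The docstring change in this file reflects the new return type: callers that previously used the return value as a matrix now go through `result.matrix`. The sketch below uses a hypothetical stand-in class with the two documented field names to show what that means for calling code. The real `ExtractionResult` definition is not shown in this diff, and the placeholder lists stand in for the SciPy COO matrix and the NumPy `int32` array.

```python
from typing import Any, NamedTuple


class ExtractionResult(NamedTuple):
    """Hypothetical stand-in mirroring the two documented fields."""

    matrix: Any  # in bam2tensor: a SciPy COO sparse matrix (n_reads x n_cpg_sites)
    tlen: Any  # in bam2tensor: a 1-D int32 array, one signed TLEN per read


# Placeholder values only; real code would receive SciPy/NumPy objects.
result = ExtractionResult(matrix=[[1, 0], [0, -1]], tlen=[167, -167])

# Field access, as the updated docstring does with result.matrix.
matrix = result.matrix

# Positional unpacking also works because a NamedTuple is still a tuple.
matrix_again, tlen = result

print(len(tlen), matrix is matrix_again)
```

Code written against 2.3 that passes the return value directly to `scipy.sparse.save_npz` would need the `.matrix` access shown in the updated docstring.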

{bam2tensor-2.3 → bam2tensor-2.5}/src/bam2tensor/__main__.py RENAMED

@@ -38,7 +38,11 @@ from bam2tensor.functions import (
     detect_aligner,
     extract_methylation_data_from_bam,
 )
-from bam2tensor.metadata import compute_cpg_index_crc32, write_npz_metadata
+from bam2tensor.metadata import (
+    compute_cpg_index_crc32,
+    write_npz_metadata,
+    write_npz_tlen,
+)
 from bam2tensor.reference import (
     KNOWN_GENOMES,
     download_reference as download_reference_fn,
@@ -225,6 +229,43 @@ def validate_input_output(
     default=20,
     type=int,
 )
+@click.option(
+    "--filter-non-converted",
+    help=(
+        "Drop reads with >= --non-converted-threshold retained non-CpG "
+        "cytosines, the signature of incomplete bisulfite/EM-seq conversion "
+        "(port of nebiolabs/mark-nonconverted-reads). Default: off."
+    ),
+    is_flag=True,
+)
+@click.option(
+    "--non-converted-threshold",
+    help=(
+        "Minimum count of retained non-CpG cytosines to drop a read "
+        "(default = 3, matches NEB mark-nonconverted-reads)."
+    ),
+    default=3,
+    type=int,
+)
+@click.option(
+    "--filter-em-overconversion",
+    help=(
+        "Drop EM-seq reads whose covered CpGs are all called unmethylated "
+        "and cover at least --em-overconversion-min-cpgs sites (heuristic "
+        "for the fragment-level over-conversion artifact described in "
+        "Loyfer et al. bioRxiv 2026.03.24.713040). Default: off."
+    ),
+    is_flag=True,
+)
+@click.option(
+    "--em-overconversion-min-cpgs",
+    help=(
+        "Minimum covered CpG count required before the EM over-conversion "
+        "filter will drop a read (default = 3)."
+    ),
+    default=3,
+    type=int,
+)
 @click.option("--verbose", help="Verbose output.", is_flag=True)
 @click.option("--skip-cache", help="De-novo generate CpG sites (slow).", is_flag=True)
 @click.option(
@@ -259,6 +300,10 @@ def main(
     expected_chromosomes: str | None,
     reference_fasta: str | None,
     quality_limit: int,
+    filter_non_converted: bool,
+    non_converted_threshold: int,
+    filter_em_overconversion: bool,
+    em_overconversion_min_cpgs: int,
     verbose: bool,
     skip_cache: bool,
     debug: bool,
@@ -296,6 +341,17 @@ def main(
             ``--download-reference`` is used.
         quality_limit: Minimum mapping quality (MAPQ) threshold. Reads below
             this quality are excluded.
+        filter_non_converted: If True, drop reads with at least
+            ``non_converted_threshold`` retained non-CpG cytosines —
+            indicating incomplete bisulfite/EM-seq conversion.
+        non_converted_threshold: Threshold used by the non-converted
+            read filter.
+        filter_em_overconversion: If True, drop reads whose covered CpGs
+            are all called unmethylated and cover at least
+            ``em_overconversion_min_cpgs`` sites — heuristic for EM-seq
+            fragment-level over-conversion (Loyfer et al. 2026).
+        em_overconversion_min_cpgs: Minimum covered CpG count required
+            before the over-conversion filter will drop a read.
         verbose: If True, print detailed progress information.
         skip_cache: If True, regenerate the CpG site index even if a cache
             file exists.
@@ -378,6 +434,16 @@ def main(
     print(f"  Reference:     {reference_fasta}")
     print(f"  Chromosomes:   {chrom_display}")
     print(f"  Quality limit: MAPQ >= {quality_limit}")
+    if filter_non_converted:
+        print(
+            f"  Filters:       non-converted reads (>= "
+            f"{non_converted_threshold} retained non-CpG Cs)"
+        )
+    if filter_em_overconversion:
+        print(
+            f"                 EM over-conversion (all-unmethylated, >= "
+            f"{em_overconversion_min_cpgs} CpGs)"
+        )
     if output_dir:
         print(f"  Output dir:    {output_dir}")
     else:
@@ -440,10 +506,14 @@ def main(
         # Extract
         print("  Extracting methylation data...")
         try:
-            methylation_data_coo = extract_methylation_data_from_bam(
+            extraction_result = extract_methylation_data_from_bam(
                 input_bam=input_bam,
                 genome_methylation_embedding=genome_methylation_embedding,
                 quality_limit=quality_limit,
+                filter_non_converted=filter_non_converted,
+                non_converted_threshold=non_converted_threshold,
+                filter_em_overconversion=filter_em_overconversion,
+                em_overconversion_min_cpgs=em_overconversion_min_cpgs,
                 verbose=verbose,
                 debug=debug,
             )
@@ -453,16 +523,17 @@ def main(
             continue
         # Matrix stats
-        n_reads = methylation_data_coo.shape[0]
-        n_cpgs = methylation_data_coo.shape[1]
-        n_data = methylation_data_coo.nnz
+        n_reads = extraction_result.matrix.shape[0]
+        n_cpgs = extraction_result.matrix.shape[1]
+        n_data = extraction_result.matrix.nnz
         print(
             f"  Result:        {n_reads:,} reads x {n_cpgs:,} CpG sites"
             f" ({n_data:,} data points)"
         )
         # Save
-        scipy.sparse.save_npz(output_file, methylation_data_coo, compressed=True)
+        scipy.sparse.save_npz(output_file, extraction_result.matrix, compressed=True)
+        write_npz_tlen(output_file, extraction_result.tlen)
         write_npz_metadata(
             output_file,
             {
@@ -471,6 +542,16 @@ def main(
                 "expected_chromosomes": chrom_list,
                 "total_cpg_sites": genome_methylation_embedding.total_cpg_sites,
                 "cpg_index_crc32": cpg_crc32,
+                "filters": {
+                    "non_converted_reads": {
+                        "enabled": filter_non_converted,
+                        "threshold": non_converted_threshold,
+                    },
+                    "em_overconversion": {
+                        "enabled": filter_em_overconversion,
+                        "min_cpgs": em_overconversion_min_cpgs,
+                    },
+                },
             },
         )
         print(f"  Output:        {output_file}")
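
The two new filters are only described in help text and docstrings here; their implementation lives in `extract_methylation_data_from_bam`, which this part of the diff does not show. The following sketch expresses the two documented rules as plain predicates. All function names are hypothetical. Counting retained non-CpG cytosines from a Bismark XM string (where uppercase `X` and `H` mark methylated CHG and CHH calls) is an assumption made for illustration and may differ from how bam2tensor or NEB's mark-nonconverted-reads actually counts them.

```python
def count_retained_non_cpg(xm_tag: str) -> int:
    """Count methylated (unconverted) CHG/CHH calls in a Bismark XM string."""
    # Assumption: 'X' = methylated CHG, 'H' = methylated CHH; calls in
    # unknown context ('U') are ignored in this sketch.
    return sum(1 for call in xm_tag if call in "XH")


def is_non_converted(xm_tag: str, threshold: int = 3) -> bool:
    """Rule from --filter-non-converted: at least `threshold` retained non-CpG Cs."""
    return count_retained_non_cpg(xm_tag) >= threshold


def is_em_overconverted(cpg_calls, min_cpgs: int = 3) -> bool:
    """Rule from --filter-em-overconversion.

    cpg_calls holds one value per covered CpG: 1 (methylated) or 0 (unmethylated).
    """
    return len(cpg_calls) >= min_cpgs and all(call == 0 for call in cpg_calls)


# Made-up inputs for illustration only.
print(is_non_converted("..X..H.h.H"))    # three retained non-CpG calls
print(is_non_converted("..x.Z.H.."))     # one retained non-CpG call
print(is_em_overconverted([0, 0, 0]))    # all unmethylated, enough CpGs
print(is_em_overconverted([0, 0]))       # too few covered CpGs
```

Both defaults of 3 match the `default=3` values on the new `click` options; both filters are off unless their flag is passed.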
