PyPI - bam2tensor - Versions diffs - 2.2__tar.gz → 2.4__tar.gz - Mend

bam2tensor 2.2.tar.gz → 2.4.tar.gz

Files changed (54)

{bam2tensor-2.2 → bam2tensor-2.4}/.github/workflows/constraints.txt RENAMED

@@ -1,2 +1,2 @@
 nox==2026.2.9
-uv==0.10.7
+uv==0.11.2

{bam2tensor-2.2 → bam2tensor-2.4}/.github/workflows/docs.yml RENAMED

@@ -59,8 +59,8 @@ jobs:
     needs: build
     steps:
       - name: Setup Pages
-        uses: actions/configure-pages@v5
+        uses: actions/configure-pages@v6
       - name: Deploy to GitHub Pages
         id: deployment
-        uses: actions/deploy-pages@v4
+        uses: actions/deploy-pages@v5

{bam2tensor-2.2 → bam2tensor-2.4}/.github/workflows/labeler.yml RENAMED

@@ -20,6 +20,6 @@ jobs:
         uses: actions/checkout@v6
       - name: Run Labeler
-        uses: crazy-max/ghaction-github-labeler@v5.3.0
+        uses: crazy-max/ghaction-github-labeler@v6.0.0
         with:
           skip-delete: true

{bam2tensor-2.2 → bam2tensor-2.4}/.github/workflows/release.yml RENAMED

@@ -76,7 +76,7 @@ jobs:
           repository-url: https://test.pypi.org/legacy/
       - name: Publish the release notes
-        uses: release-drafter/release-drafter@v6.2.0
+        uses: release-drafter/release-drafter@v7.1.1
         with:
           publish: ${{ steps.check-version.outputs.tag != '' || steps.check-tag.outputs.tag != '' }}
           tag: ${{ steps.check-version.outputs.tag || steps.check-tag.outputs.tag }}

{bam2tensor-2.2 → bam2tensor-2.4}/CLAUDE.md RENAMED

@@ -40,10 +40,12 @@ uv run mypy src
 ```
 src/bam2tensor/
-  __init__.py      # Package version (2.1)
-  __main__.py      # Click CLI entry point
+  __init__.py      # Package version (2.4)
+  __main__.py      # Click CLI entry point (bam2tensor command)
+  inspect.py       # Inspect CLI entry point (bam2tensor-inspect command)
   embedding.py     # GenomeMethylationEmbedding class (FASTA parsing, CpG indexing)
   functions.py     # Core extraction: extract_methylation_data_from_bam()
+  metadata.py      # .npz metadata read/write (provenance info in output files)
   reference.py     # Reference genome download and caching utilities
 tests/
@@ -51,6 +53,8 @@ tests/
   test_functions.py   # Core function tests
   test_embedding.py   # Embedding class tests
   test_duplication.py  # Read duplication bug tests
+  test_inspect.py     # Inspect CLI tests
+  test_metadata.py    # Metadata read/write/round-trip tests
   test_reference.py   # Reference download/caching tests
   test.bam, test.bam.bai, test_fasta.fa  # Test fixtures
 ```
@@ -110,8 +114,11 @@ xdoctest validates code examples in docstrings. Important rules:
 ### Data Structure
 - Output: scipy sparse COO matrix saved as .npz
 - Rows = unique reads (primary alignments)
-- Columns = CpG sites
+- Columns = CpG sites (ordered by genomic position, determined by reference genome)
 - Values: 1 (methylated), 0 (unmethylated), -1 (no data/indels/SNVs)
+- Each .npz file contains a `metadata.json` entry with provenance info (genome name, version, CpG index CRC32, expected chromosomes). Read via `bam2tensor.metadata.read_npz_metadata()`.
+- Each .npz file contains a `tlen.npy` entry with per-read signed template length (BAM TLEN field) as int32. Read via `bam2tensor.metadata.read_npz_tlen()`. Returns `None` for files from older versions.
+- `extract_methylation_data_from_bam()` returns an `ExtractionResult` NamedTuple with `.matrix` (sparse COO) and `.tlen` (numpy int32 array).
 ### Methylation Strand Detection
 - Bismark aligner: XM tag (Z/z for methylated/unmethylated CpG; no strand filtering needed)
@@ -144,6 +151,8 @@ xdoctest validates code examples in docstrings. Important rules:
 uv run bam2tensor --input-path input.bam --reference-fasta ref.fa
 # Or with auto-download:
 uv run bam2tensor --input-path input.bam --download-reference hg38
+# Inspect an output file:
+uv run bam2tensor-inspect output.methylation.npz
 ```
 ### Reference Genome Downloads
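
Note on the Data Structure hunk above: it distinguishes an explicit `0` (site covered, unmethylated) from a position that is absent from the matrix (site not covered by the read). The plain-Python sketch below mimics COO-style (row, column, value) triplets to show why the two differ. It is not part of the package, and the read and site indices are invented for illustration.

```python
# COO-style triplets: parallel lists of read index, CpG site index, and value.
rows = [0, 0, 1, 1]
cols = [2, 3, 3, 4]
vals = [1, 0, -1, 1]  # 1 = methylated, 0 = unmethylated, -1 = no data

# Index the stored entries by (read, site) for lookup.
stored = {(r, c): v for r, c, v in zip(rows, cols, vals)}


def state(read, site):
    """Return the stored state, or None when the read does not cover the site."""
    return stored.get((read, site))


LABELS = {1: "methylated", 0: "unmethylated", -1: "no data", None: "not covered"}

for read, site in [(0, 2), (0, 3), (1, 3), (1, 0)]:
    print(read, site, LABELS[state(read, site)])
```

Converting such a matrix to a dense array would fill every uncovered position with `0`, making it indistinguishable from an unmethylated call, so analyses that need the distinction should work from the stored entries.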

{bam2tensor-2.2 → bam2tensor-2.4}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: bam2tensor
-Version: 2.2
+Version: 2.4
 Summary: Convert bisulfite-seq and EM-seq BAM files to sparse tensor representations of DNA methylation
 Project-URL: Homepage, https://github.com/mcwdsi/bam2tensor
 Project-URL: Repository, https://github.com/mcwdsi/bam2tensor
@@ -72,7 +72,10 @@ Description-Content-Type: text/markdown
   - [Custom Output Directory](#custom-output-directory)
   - [Using a Custom Genome](#using-a-custom-genome)
   - [Command-Line Options](#command-line-options)
+- [Inspecting Output Files](#inspecting-output-files)
 - [Output Data Structure](#output-data-structure)
+  - [Per-Read Fragment Length (TLEN)](#per-read-fragment-length-tlen)
+  - [Embedded Metadata](#embedded-metadata)
   - [Loading Output Files](#loading-output-files)
   - [Converting to Dense Arrays](#converting-to-dense-arrays)
   - [Working with Genomic Coordinates](#working-with-genomic-coordinates)
@@ -95,6 +98,7 @@ Description-Content-Type: text/markdown
 - **Batch Processing**: Process multiple BAM files with directory recursion
 - **Caching**: CpG site indexing is cached to accelerate repeated runs on the same genome
 - **Quality Filtering**: Configurable mapping quality thresholds
+- **Per-Read Fragment Length**: Stores BAM TLEN (template length) alongside the methylation tensor for joint fragment-methylation analysis
 ## Requirements
@@ -285,24 +289,103 @@ Options:
 | `--download-reference` | Download and cache a known reference genome. Choices: `hg38`, `hg19`, `mm10`, `T2T-CHM13`. Replaces `--reference-fasta`. |
 | `--list-genomes` | List available reference genomes for `--download-reference` and exit. |
+## Inspecting Output Files
+Use `bam2tensor-inspect` to view a summary of any `.methylation.npz` file without writing Python:
+```bash
+$ bam2tensor-inspect sample.methylation.npz
+sample.methylation.npz
+  Genome:          hg38
+  Chromosomes:     24 (chr1, chr2, ... chrX, chrY)
+  Reads:           1,423,891
+  CpG sites:       28,217,448
+  Data points:     12,847,322 (sparsity: 99.97%)
+  Fragment len:    median 167, mean 182, range [50, 600]
+  CpG index CRC32: a1b2c3d4
+  bam2tensor:      v2.4
+  File size:       14.2 MB
+```
+You can pass multiple files at once:
+```bash
+$ bam2tensor-inspect *.methylation.npz
+```
+This works on files produced by older versions of bam2tensor too (metadata fields will be omitted).
 ## Output Data Structure
-bam2tensor generates one `.npz` file per input BAM file. Each file contains a SciPy sparse [COO matrix] with the following structure:
+bam2tensor generates one `.methylation.npz` file per input BAM file. Each file contains a SciPy sparse [COO matrix] (`scipy.sparse.coo_matrix`) with the following structure:
 | Dimension | Represents |
 |-----------|------------|
-| Rows | Unique reads (primary alignments that pass quality filters) |
-| Columns | CpG sites (ordered by genomic position across all chromosomes) |
+| **Rows** | Unique sequencing reads (primary alignments that pass quality and flag filters, numbered sequentially as encountered across chromosomes) |
+| **Columns** | CpG sites from the reference genome, ordered by genomic position across all chromosomes (chr1, chr2, ..., chrX, chrY). Column `i` maps to the `i`-th CpG dinucleotide in the reference FASTA. |
+The **column dimension is determined entirely by the reference genome**: it equals the total number of CpG sites across all `--expected-chromosomes`. For example, hg38 with default chromosomes has ~28 million CpG columns. To map column indices back to genomic coordinates (e.g., column 12345 → chr1:29503), use the `GenomeMethylationEmbedding` class with the same reference FASTA and chromosome list (see [Working with Genomic Coordinates](#working-with-genomic-coordinates) below).
 ### Methylation State Values
 | Value | Meaning |
 |-------|---------|
-| `1` | Methylated (cytosine preserved as C) |
-| `0` | Unmethylated (cytosine converted to T by bisulfite treatment) |
-| `-1` | No data (indel, SNV, or site not covered by read) |
+| `1` | Methylated (cytosine preserved as C after bisulfite/enzymatic conversion) |
+| `0` | Unmethylated (cytosine converted to T by bisulfite/enzymatic treatment) |
+| `-1` | No data (indel, SNV, or other non-C/T base at a CpG position) |
+Note: The matrix uses SciPy's COO sparse format, which explicitly stores all non-zero values. Unmethylated sites (value `0`) **are** stored as explicit entries. Positions not covered by a read are simply absent from the matrix (implicit zero, which is distinct from the explicit `0` = unmethylated).
+### Per-Read Fragment Length (TLEN)
+Each `.methylation.npz` file includes a `tlen.npy` entry inside the ZIP archive containing the signed BAM template length (TLEN) for every read in the matrix. This enables joint fragment-length and methylation analysis without re-processing the BAM.
+- One `int32` value per read (row), in the same order as the sparse matrix rows
+- Signed: positive for the leftmost read in a pair, negative for the rightmost
+- Zero for single-end reads or reads with unmapped mates
+- Use `abs(tlen)` to get fragment lengths
-Note: Sparse matrices only store non-zero values. Positions with value `0` (unmethylated) are stored, but positions not covered by a read are simply absent from the matrix.
+```python
+from bam2tensor.metadata import read_npz_tlen
+import numpy as np
+tlen = read_npz_tlen("sample.methylation.npz")
+if tlen is not None:
+    frag_lengths = np.abs(tlen)
+    nonzero = frag_lengths[frag_lengths > 0]
+    print(f"Median fragment length: {np.median(nonzero):.0f}")
+    print(f"Mean fragment length: {np.mean(nonzero):.0f}")
+```
+### Embedded Metadata
+Each `.methylation.npz` file includes a `metadata.json` entry inside the ZIP archive with provenance information:
+| Field | Description |
+|-------|-------------|
+| `bam2tensor_version` | Version of bam2tensor that produced the file |
+| `genome_name` | Genome identifier (e.g., `hg38`, `mm10`) |
+| `expected_chromosomes` | List of chromosomes included in the column mapping |
+| `total_cpg_sites` | Total number of CpG columns in the matrix |
+| `cpg_index_crc32` | CRC32 checksum of the CpG site positions (verifies identical column semantics) |
+This metadata is ignored by `scipy.sparse.load_npz`, so existing code continues to work. To read it:
+```python
+from bam2tensor.metadata import read_npz_metadata
+meta = read_npz_metadata("sample.methylation.npz")
+if meta is not None:
+    print(f"Genome: {meta['genome_name']}")
+    print(f"CpG sites: {meta['total_cpg_sites']:,}")
+    print(f"CpG index CRC32: {meta['cpg_index_crc32']}")
+```
+The `cpg_index_crc32` field uniquely identifies the column mapping. Two files with the same CRC32 have identical column semantics (same chromosomes, same CpG positions, same order) and their matrices can be directly stacked or compared. The metadata is also accessible without bam2tensor installed, since `.npz` files are ZIP archives:
+```bash
+unzip -p sample.methylation.npz metadata.json | python -m json.tool
+```
 ### Loading Output Files
@@ -489,10 +572,22 @@ extract_methylation_data_from_bam(
     quality_limit: int = 20,                           # Minimum MAPQ
     verbose: bool = False,                             # Enable verbose output
     debug: bool = False                                # Enable debug output
-) -> scipy.sparse.coo_matrix
+) -> ExtractionResult
+```
+**Returns:** An `ExtractionResult` named tuple with two fields:
+- `matrix`: A SciPy COO sparse matrix with shape (n_reads, n_cpg_sites)
+- `tlen`: A 1-D numpy `int32` array of shape (n_reads,) containing the signed template length (BAM TLEN field) for each read
+### `bam2tensor.metadata.read_npz_tlen`
+Read per-read template lengths from a `.methylation.npz` file.
+```python
+read_npz_tlen(npz_path: str) -> np.ndarray | None
 ```
-**Returns:** A SciPy COO sparse matrix with shape (n_reads, n_cpg_sites).
+**Returns:** The per-read template-length array, or `None` if the file was produced by an older version of bam2tensor.
 ## Contributing
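
Note on the `bam2tensor-inspect` sample output above: the diff does not show how `inspect.py` computes the sparsity figure. One plausible definition, assumed here, is one minus the ratio of stored entries to total matrix cells. The helper name `sparsity` is hypothetical.

```python
def sparsity(n_data_points, n_reads, n_cpg_sites):
    """Fraction of matrix cells with no stored entry (assumed definition)."""
    total_cells = n_reads * n_cpg_sites
    if total_cells == 0:
        return 0.0
    return 1.0 - n_data_points / total_cells


# Counts copied from the sample output above.
value = sparsity(12_847_322, 1_423_891, 28_217_448)
print(f"{value:.7%}")
```

Under this definition the sample counts give a value above 99.9999%, so the 99.97% shown in the sample output is presumably illustrative or computed differently.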

{bam2tensor-2.2 → bam2tensor-2.4}/README.md RENAMED

@@ -39,7 +39,10 @@
   - [Custom Output Directory](#custom-output-directory)
   - [Using a Custom Genome](#using-a-custom-genome)
   - [Command-Line Options](#command-line-options)
+- [Inspecting Output Files](#inspecting-output-files)
 - [Output Data Structure](#output-data-structure)
+  - [Per-Read Fragment Length (TLEN)](#per-read-fragment-length-tlen)
+  - [Embedded Metadata](#embedded-metadata)
   - [Loading Output Files](#loading-output-files)
   - [Converting to Dense Arrays](#converting-to-dense-arrays)
   - [Working with Genomic Coordinates](#working-with-genomic-coordinates)
@@ -62,6 +65,7 @@
 - **Batch Processing**: Process multiple BAM files with directory recursion
 - **Caching**: CpG site indexing is cached to accelerate repeated runs on the same genome
 - **Quality Filtering**: Configurable mapping quality thresholds
+- **Per-Read Fragment Length**: Stores BAM TLEN (template length) alongside the methylation tensor for joint fragment-methylation analysis
 ## Requirements
@@ -252,24 +256,103 @@ Options:
 | `--download-reference` | Download and cache a known reference genome. Choices: `hg38`, `hg19`, `mm10`, `T2T-CHM13`. Replaces `--reference-fasta`. |
 | `--list-genomes` | List available reference genomes for `--download-reference` and exit. |
+## Inspecting Output Files
+Use `bam2tensor-inspect` to view a summary of any `.methylation.npz` file without writing Python:
+```bash
+$ bam2tensor-inspect sample.methylation.npz
+sample.methylation.npz
+  Genome:          hg38
+  Chromosomes:     24 (chr1, chr2, ... chrX, chrY)
+  Reads:           1,423,891
+  CpG sites:       28,217,448
+  Data points:     12,847,322 (sparsity: 99.97%)
+  Fragment len:    median 167, mean 182, range [50, 600]
+  CpG index CRC32: a1b2c3d4
+  bam2tensor:      v2.4
+  File size:       14.2 MB
+```
+You can pass multiple files at once:
+```bash
+$ bam2tensor-inspect *.methylation.npz
+```
+This works on files produced by older versions of bam2tensor too (metadata fields will be omitted).
 ## Output Data Structure
-bam2tensor generates one `.npz` file per input BAM file. Each file contains a SciPy sparse [COO matrix] with the following structure:
+bam2tensor generates one `.methylation.npz` file per input BAM file. Each file contains a SciPy sparse [COO matrix] (`scipy.sparse.coo_matrix`) with the following structure:
 | Dimension | Represents |
 |-----------|------------|
-| Rows | Unique reads (primary alignments that pass quality filters) |
-| Columns | CpG sites (ordered by genomic position across all chromosomes) |
+| **Rows** | Unique sequencing reads (primary alignments that pass quality and flag filters, numbered sequentially as encountered across chromosomes) |
+| **Columns** | CpG sites from the reference genome, ordered by genomic position across all chromosomes (chr1, chr2, ..., chrX, chrY). Column `i` maps to the `i`-th CpG dinucleotide in the reference FASTA. |
+The **column dimension is determined entirely by the reference genome**: it equals the total number of CpG sites across all `--expected-chromosomes`. For example, hg38 with default chromosomes has ~28 million CpG columns. To map column indices back to genomic coordinates (e.g., column 12345 → chr1:29503), use the `GenomeMethylationEmbedding` class with the same reference FASTA and chromosome list (see [Working with Genomic Coordinates](#working-with-genomic-coordinates) below).
 ### Methylation State Values
 | Value | Meaning |
 |-------|---------|
-| `1` | Methylated (cytosine preserved as C) |
-| `0` | Unmethylated (cytosine converted to T by bisulfite treatment) |
-| `-1` | No data (indel, SNV, or site not covered by read) |
+| `1` | Methylated (cytosine preserved as C after bisulfite/enzymatic conversion) |
+| `0` | Unmethylated (cytosine converted to T by bisulfite/enzymatic treatment) |
+| `-1` | No data (indel, SNV, or other non-C/T base at a CpG position) |
+Note: The matrix uses SciPy's COO sparse format, which explicitly stores all non-zero values. Unmethylated sites (value `0`) **are** stored as explicit entries. Positions not covered by a read are simply absent from the matrix (implicit zero, which is distinct from the explicit `0` = unmethylated).
+### Per-Read Fragment Length (TLEN)
+Each `.methylation.npz` file includes a `tlen.npy` entry inside the ZIP archive containing the signed BAM template length (TLEN) for every read in the matrix. This enables joint fragment-length and methylation analysis without re-processing the BAM.
+- One `int32` value per read (row), in the same order as the sparse matrix rows
+- Signed: positive for the leftmost read in a pair, negative for the rightmost
+- Zero for single-end reads or reads with unmapped mates
+- Use `abs(tlen)` to get fragment lengths
-Note: Sparse matrices only store non-zero values. Positions with value `0` (unmethylated) are stored, but positions not covered by a read are simply absent from the matrix.
+```python
+from bam2tensor.metadata import read_npz_tlen
+import numpy as np
+tlen = read_npz_tlen("sample.methylation.npz")
+if tlen is not None:
+    frag_lengths = np.abs(tlen)
+    nonzero = frag_lengths[frag_lengths > 0]
+    print(f"Median fragment length: {np.median(nonzero):.0f}")
+    print(f"Mean fragment length: {np.mean(nonzero):.0f}")
+```
+### Embedded Metadata
+Each `.methylation.npz` file includes a `metadata.json` entry inside the ZIP archive with provenance information:
+| Field | Description |
+|-------|-------------|
+| `bam2tensor_version` | Version of bam2tensor that produced the file |
+| `genome_name` | Genome identifier (e.g., `hg38`, `mm10`) |
+| `expected_chromosomes` | List of chromosomes included in the column mapping |
+| `total_cpg_sites` | Total number of CpG columns in the matrix |
+| `cpg_index_crc32` | CRC32 checksum of the CpG site positions (verifies identical column semantics) |
+This metadata is ignored by `scipy.sparse.load_npz`, so existing code continues to work. To read it:
+```python
+from bam2tensor.metadata import read_npz_metadata
+meta = read_npz_metadata("sample.methylation.npz")
+if meta is not None:
+    print(f"Genome: {meta['genome_name']}")
+    print(f"CpG sites: {meta['total_cpg_sites']:,}")
+    print(f"CpG index CRC32: {meta['cpg_index_crc32']}")
+```
+The `cpg_index_crc32` field uniquely identifies the column mapping. Two files with the same CRC32 have identical column semantics (same chromosomes, same CpG positions, same order) and their matrices can be directly stacked or compared. The metadata is also accessible without bam2tensor installed, since `.npz` files are ZIP archives:
+```bash
+unzip -p sample.methylation.npz metadata.json | python -m json.tool
+```
 ### Loading Output Files
@@ -456,10 +539,22 @@ extract_methylation_data_from_bam(
     quality_limit: int = 20,                           # Minimum MAPQ
     verbose: bool = False,                             # Enable verbose output
     debug: bool = False                                # Enable debug output
-) -> scipy.sparse.coo_matrix
+) -> ExtractionResult
+```
+**Returns:** An `ExtractionResult` named tuple with two fields:
+- `matrix`: A SciPy COO sparse matrix with shape (n_reads, n_cpg_sites)
+- `tlen`: A 1-D numpy `int32` array of shape (n_reads,) containing the signed template length (BAM TLEN field) for each read
+### `bam2tensor.metadata.read_npz_tlen`
+Read per-read template lengths from a `.methylation.npz` file.
+```python
+read_npz_tlen(npz_path: str) -> np.ndarray | None
 ```
-**Returns:** A SciPy COO sparse matrix with shape (n_reads, n_cpg_sites).
+**Returns:** The per-read template-length array, or `None` if the file was produced by an older version of bam2tensor.
 ## Contributing
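
Note on the Embedded Metadata hunk above: since two files with the same `cpg_index_crc32` are described as directly stackable, a caller may want to check that before combining matrices. The sketch below does so with only the standard library, relying on the stated fact that `.npz` files are ZIP archives with a `metadata.json` entry. The helper names are hypothetical, and treating a file without metadata as incompatible is a conservative assumption.

```python
import json
import zipfile


def column_fingerprint(npz_path):
    """Return cpg_index_crc32 from metadata.json, or None if unavailable."""
    with zipfile.ZipFile(npz_path) as archive:
        if "metadata.json" not in archive.namelist():
            return None  # e.g. a file written before metadata was embedded
        meta = json.loads(archive.read("metadata.json"))
    return meta.get("cpg_index_crc32")


def same_column_mapping(paths):
    """True only if every file has a fingerprint and all fingerprints agree."""
    fingerprints = [column_fingerprint(p) for p in paths]
    if not fingerprints or any(f is None for f in fingerprints):
        return False
    return len(set(fingerprints)) == 1
```

A typical use would be to call `same_column_mapping([...])` on a batch of output files and only stack their matrices when it returns `True`.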

{bam2tensor-2.2 → bam2tensor-2.4}/docs/reference.md RENAMED

@@ -32,6 +32,14 @@ bam2tensor.functions module
    :show-inheritance:
    :undoc-members:
+bam2tensor.metadata module
+--------------------------
+.. automodule:: bam2tensor.metadata
+   :members:
+   :show-inheritance:
+   :undoc-members:
 bam2tensor.reference module
 ---------------------------
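
Note on the new `bam2tensor.metadata` module: its source is not included in this diff, so how `compute_cpg_index_crc32` hashes the index is unknown here. The sketch below shows one way an order-sensitive CRC32 over chromosome names and CpG positions could be built with `zlib.crc32`. The function name, the byte encoding, and the assumption that the index is a mapping from chromosome name to a list of integer positions are all illustrative.

```python
import struct
import zlib


def cpg_index_checksum(cpg_sites_by_chrom):
    """CRC32 over chromosome names and CpG positions, in iteration order."""
    crc = 0
    for chrom, positions in cpg_sites_by_chrom.items():
        # NUL-terminate the name so it cannot run into the position bytes.
        crc = zlib.crc32(chrom.encode("utf-8") + b"\0", crc)
        for pos in positions:
            # Fixed-width little-endian encoding keeps the byte stream unambiguous.
            crc = zlib.crc32(struct.pack("<q", pos), crc)
    return f"{crc:08x}"


index = {"chr1": [10, 20], "chr2": [5]}
print(cpg_index_checksum(index))
```

Because the running CRC is fed chromosome by chromosome, changing the set of chromosomes, any position, or the order alters the byte stream being hashed, which is the property the README attributes to `cpg_index_crc32`.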

{bam2tensor-2.2 → bam2tensor-2.4}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "bam2tensor"
-version = "2.2"
+version = "2.4"
 description = "Convert bisulfite-seq and EM-seq BAM files to sparse tensor representations of DNA methylation"
 authors = [{ name = "Nick Semenkovich", email = "semenko@alum.mit.edu" }]
 license = "MIT"
@@ -38,6 +38,7 @@ Changelog = "https://github.com/mcwdsi/bam2tensor/releases"
 [project.scripts]
 bam2tensor = "bam2tensor.__main__:main"
+bam2tensor-inspect = "bam2tensor.inspect:main"
 [dependency-groups]
 dev = [
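
Note on the `[project.scripts]` change: each entry maps a command name to a `module:function` target, so `bam2tensor-inspect` will call `main` in `bam2tensor.inspect`. The sketch below shows roughly how such a target string resolves to a callable. It is a simplification of what installers generate, `resolve_entry_point` is a hypothetical helper, and a standard-library target is used so the example runs without bam2tensor installed.

```python
import importlib


def resolve_entry_point(spec):
    """Resolve a 'package.module:attribute' string to the object it names."""
    module_name, _, attribute = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# Stand-in target; the real entry would be "bam2tensor.inspect:main".
dumps = resolve_entry_point("json:dumps")
print(dumps({"ok": True}))
```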

{bam2tensor-2.2 → bam2tensor-2.4}/src/bam2tensor/__init__.py RENAMED

@@ -30,14 +30,14 @@ Example:
         )
         # Extract methylation data
-        sparse_matrix = extract_methylation_data_from_bam(
+        result = extract_methylation_data_from_bam(
             input_bam="/path/to/sample.bam",
             genome_methylation_embedding=embedding,
         )
         # Save to file
         import scipy.sparse
-        scipy.sparse.save_npz("output.npz", sparse_matrix)
+        scipy.sparse.save_npz("output.npz", result.matrix)
 Output Format:
     The output is a SciPy sparse COO matrix where:
@@ -50,4 +50,4 @@ See Also:
     - https://mcwdsi.github.io/bam2tensor for full documentation
 """
-__version__ = "2.2"
+__version__ = "2.4"

{bam2tensor-2.2 → bam2tensor-2.4}/src/bam2tensor/__main__.py RENAMED

@@ -38,6 +38,11 @@ from bam2tensor.functions import (
     detect_aligner,
     extract_methylation_data_from_bam,
 )
+from bam2tensor.metadata import (
+    compute_cpg_index_crc32,
+    write_npz_metadata,
+    write_npz_tlen,
+)
 from bam2tensor.reference import (
     KNOWN_GENOMES,
     download_reference as download_reference_fn,
@@ -393,10 +398,12 @@ def main(
         verbose=verbose,
     )
     n_chroms = len(genome_methylation_embedding.cpg_sites_dict)
+    cpg_crc32 = compute_cpg_index_crc32(genome_methylation_embedding)
     print(
         f"  Total CpG sites: {genome_methylation_embedding.total_cpg_sites:,}"
         f" across {n_chroms} chromosome(s)"
     )
+    print(f"  CpG index CRC32: {cpg_crc32}")
     print(f"  Index loaded in {_format_elapsed(time.time() - time_start)}")
     # ── Discover BAM files ──────────────────────────────────────────────
@@ -437,7 +444,7 @@ def main(
         # Extract
         print("  Extracting methylation data...")
         try:
-            methylation_data_coo = extract_methylation_data_from_bam(
+            extraction_result = extract_methylation_data_from_bam(
                 input_bam=input_bam,
                 genome_methylation_embedding=genome_methylation_embedding,
                 quality_limit=quality_limit,
@@ -450,16 +457,27 @@ def main(
             continue
         # Matrix stats
-        n_reads = methylation_data_coo.shape[0]
-        n_cpgs = methylation_data_coo.shape[1]
-        n_data = methylation_data_coo.nnz
+        n_reads = extraction_result.matrix.shape[0]
+        n_cpgs = extraction_result.matrix.shape[1]
+        n_data = extraction_result.matrix.nnz
         print(
             f"  Result:        {n_reads:,} reads x {n_cpgs:,} CpG sites"
             f" ({n_data:,} data points)"
         )
         # Save
-        scipy.sparse.save_npz(output_file, methylation_data_coo, compressed=True)
+        scipy.sparse.save_npz(output_file, extraction_result.matrix, compressed=True)
+        write_npz_tlen(output_file, extraction_result.tlen)
+        write_npz_metadata(
+            output_file,
+            {
+                "bam2tensor_version": __version__,
+                "genome_name": genome_name,
+                "expected_chromosomes": chrom_list,
+                "total_cpg_sites": genome_methylation_embedding.total_cpg_sites,
+                "cpg_index_crc32": cpg_crc32,
+            },
+        )
         print(f"  Output:        {output_file}")
         print(f"  Time:          {_format_elapsed(time.time() - time_bam)}")
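
Note on the save step above: the bodies of `write_npz_tlen` and `write_npz_metadata` are not part of this diff. Because an `.npz` file is a ZIP archive, one plausible approach is to append an extra member after `scipy.sparse.save_npz` has written the matrix. The helper below is a hypothetical sketch of that idea using only the standard library, not the package's implementation.

```python
import json
import zipfile


def append_json_member(archive_path, member_name, payload):
    """Append a JSON document as a new member of an existing ZIP archive."""
    with zipfile.ZipFile(
        archive_path, mode="a", compression=zipfile.ZIP_DEFLATED
    ) as archive:
        # Refuse to add a second member with the same name.
        if member_name in archive.namelist():
            raise ValueError(f"{member_name} already present in {archive_path}")
        archive.writestr(member_name, json.dumps(payload, sort_keys=True))
```

Appending leaves the members already in the archive untouched, which is consistent with the README's statement that `scipy.sparse.load_npz` ignores the extra entry.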

{bam2tensor-2.2 → bam2tensor-2.4}/src/bam2tensor/functions.py RENAMED

@@ -39,15 +39,18 @@ Example:
     ... )
     >>>
     >>> # Extract methylation data
-    >>> matrix = extract_methylation_data_from_bam(
+    >>> result = extract_methylation_data_from_bam(
     ...     input_bam="sample.bam",
     ...     genome_methylation_embedding=embedding,
     ...     quality_limit=20,
     ... )
     >>>
-    >>> print(f"Extracted {matrix.shape[0]} reads, {matrix.nnz} data points")
+    >>> print(f"Extracted {result.matrix.shape[0]} reads, {result.matrix.nnz} data points")
 """
+from typing import NamedTuple
+import numpy as np
 import scipy.sparse
 import pysam
 import bisect
@@ -55,6 +58,23 @@ import bisect
 from tqdm import tqdm
 from bam2tensor.embedding import GenomeMethylationEmbedding
+class ExtractionResult(NamedTuple):
+    """Result of methylation extraction from a BAM file.
+    Attributes:
+        matrix: Sparse COO matrix of shape (n_reads, n_cpg_sites) with
+            methylation states: 1 (methylated), 0 (unmethylated), -1
+            (no data).
+        tlen: 1-D numpy array of shape (n_reads,) containing the signed
+            template length (TLEN from BAM) for each read. 0 for
+            single-end reads or reads with unmapped mates.
+    """
+    matrix: scipy.sparse.coo_matrix
+    tlen: np.ndarray
 # BAM flag bits for reads to skip: duplicate (0x400), qcfail (0x200),
 # secondary (0x100), supplementary (0x800).
 _SKIP_FLAGS = 0x400 | 0x200 | 0x100 | 0x800
@@ -180,7 +200,7 @@ def extract_methylation_data_from_bam(
     quality_limit: int = 20,
     verbose: bool = False,
     debug: bool = False,
-) -> scipy.sparse.coo_matrix:
+) -> ExtractionResult:
     """Extract read-level CpG methylation data from a BAM file.
     Parses a bisulfite-sequencing or EM-seq BAM file and extracts methylation
@@ -225,14 +245,19 @@ def extract_methylation_data_from_bam(
             only processed once. Significantly slower.
     Returns:
-        A scipy.sparse.coo_matrix with shape (n_reads, n_cpg_sites) where:
-            - n_reads is the number of reads that passed filters and covered
-              at least one CpG site
-            - n_cpg_sites is genome_methylation_embedding.total_cpg_sites
-            - Values are: 1 (methylated), 0 (unmethylated), -1 (no data)
-        The matrix uses COO format for efficient construction. Convert to
-        CSR (tocsr()) for row slicing or CSC (tocsc()) for column slicing.
+        An ExtractionResult named tuple with two fields:
+        - **matrix**: A scipy.sparse.coo_matrix with shape
+          (n_reads, n_cpg_sites) where n_reads is the number of reads
+          that passed filters and covered at least one CpG site,
+          n_cpg_sites is genome_methylation_embedding.total_cpg_sites,
+          and values are: 1 (methylated), 0 (unmethylated), -1 (no data).
+          The matrix uses COO format for efficient construction; convert
+          to CSR (tocsr()) for row slicing or CSC (tocsc()) for column
+          slicing.
+        - **tlen**: A 1-D numpy int32 array of shape (n_reads,) containing
+          the signed template length (BAM TLEN field) for each read.
+          Values are 0 for single-end reads or reads with unmapped mates.
     Raises:
         FileNotFoundError: If the BAM file index (.bam.bai) is missing.
@@ -252,7 +277,7 @@ def extract_methylation_data_from_bam(
         ... )
         >>>
         >>> # Extract methylation data
-        >>> matrix = extract_methylation_data_from_bam(
+        >>> result = extract_methylation_data_from_bam(
         ...     input_bam="sample.bam",
         ...     genome_methylation_embedding=embedding,
         ...     quality_limit=30,  # Stricter quality filter
@@ -260,12 +285,12 @@ def extract_methylation_data_from_bam(
         ... )
         >>>
         >>> # Analyze results
-        >>> print(f"Reads with CpG data: {matrix.shape[0]:,}")
-        >>> print(f"Total CpG sites: {matrix.shape[1]:,}")
-        >>> print(f"Data points: {matrix.nnz:,}")
+        >>> print(f"Reads with CpG data: {result.matrix.shape[0]:,}")
+        >>> print(f"Total CpG sites: {result.matrix.shape[1]:,}")
+        >>> print(f"Data points: {result.matrix.nnz:,}")
         >>>
         >>> # Save to file
-        >>> scipy.sparse.save_npz("sample.methylation.npz", matrix)
+        >>> scipy.sparse.save_npz("sample.methylation.npz", result.matrix)
     Note:
         The function processes chromosomes in the order they appear in
@@ -304,6 +329,7 @@ def extract_methylation_data_from_bam(
     coo_row = []  # Read number
     coo_col = []  # CpG number (embedding)
     coo_data = []  # Methylation state
+    tlen_list: list[int] = []  # Template length (TLEN) per read
     # This is slow, but we only run it once and store the results for later
     for chrom, cpg_sites in tqdm(
@@ -398,6 +424,7 @@ def extract_methylation_data_from_bam(
                         ), "Read seen twice!"
                         debug_read_name_to_row_number[read_key] = read_number
                         print("************************************************\n")
+                    tlen_list.append(aligned_segment.template_length)
                     read_number += 1
                 continue  # Skip the Biscuit/bwameth/gem3 path below
@@ -526,6 +553,7 @@ def extract_methylation_data_from_bam(
                             f"\t{query_pos} {ref_pos} C->{query_base} [Unknown! SNV? Indel?]"
                         )
+            tlen_list.append(aligned_segment.template_length)
             read_number += 1
             if debug:
@@ -557,6 +585,6 @@ def extract_methylation_data_from_bam(
     #   Number of columns = number of CpG sites
     assert sparse_matrix.shape[1] == genome_methylation_embedding.total_cpg_sites
-    return sparse_matrix
+    tlen_array = np.array(tlen_list, dtype=np.int32)
-    # return scipy.sparse.coo_matrix((coo_data, (coo_row, coo_col)), shape=(len(read_name_to_row_number) + 1, total_cpg_sites))
+    return ExtractionResult(matrix=sparse_matrix, tlen=tlen_array)
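
Note on the return-type change: `extract_methylation_data_from_bam()` now returns an `ExtractionResult` named tuple instead of a bare matrix, so existing callers that used the return value directly as a matrix need `.matrix`. The stand-in below uses plain lists in place of the SciPy matrix and NumPy array (an assumption made so the example runs without those libraries) to show the access patterns a `NamedTuple` supports.

```python
from typing import NamedTuple


class ExtractionResultSketch(NamedTuple):
    """Stand-in mirroring the two fields of ExtractionResult."""

    matrix: list  # placeholder for the sparse COO matrix
    tlen: list  # placeholder for the int32 TLEN array


result = ExtractionResultSketch(matrix=[[1, 0], [0, -1]], tlen=[167, -167])

# Field access, as used in the updated docstrings.
n_reads = len(result.matrix)

# Positional unpacking also works because a NamedTuple is a tuple.
matrix, tlen = result

# One TLEN value per matrix row; abs() gives the fragment length.
fragment_lengths = [abs(t) for t in tlen]
print(n_reads, fragment_lengths)
```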
