PyPI: bam2tensor version diff, bam2tensor-2.1.tar.gz → bam2tensor-2.3.tar.gz (Mend)


Files changed (54)

{bam2tensor-2.1 → bam2tensor-2.3}/CLAUDE.md RENAMED

@@ -40,10 +40,12 @@ uv run mypy src
 ```
 src/bam2tensor/
-  __init__.py      # Package version (2.1)
-  __main__.py      # Click CLI entry point
+  __init__.py      # Package version (2.3)
+  __main__.py      # Click CLI entry point (bam2tensor command)
+  inspect.py       # Inspect CLI entry point (bam2tensor-inspect command)
   embedding.py     # GenomeMethylationEmbedding class (FASTA parsing, CpG indexing)
   functions.py     # Core extraction: extract_methylation_data_from_bam()
+  metadata.py      # .npz metadata read/write (provenance info in output files)
   reference.py     # Reference genome download and caching utilities
 tests/
@@ -51,6 +53,8 @@ tests/
   test_functions.py   # Core function tests
   test_embedding.py   # Embedding class tests
   test_duplication.py  # Read duplication bug tests
+  test_inspect.py     # Inspect CLI tests
+  test_metadata.py    # Metadata read/write/round-trip tests
   test_reference.py   # Reference download/caching tests
   test.bam, test.bam.bai, test_fasta.fa  # Test fixtures
 ```
@@ -110,8 +114,9 @@ xdoctest validates code examples in docstrings. Important rules:
 ### Data Structure
 - Output: scipy sparse COO matrix saved as .npz
 - Rows = unique reads (primary alignments)
-- Columns = CpG sites
+- Columns = CpG sites (ordered by genomic position, determined by reference genome)
 - Values: 1 (methylated), 0 (unmethylated), -1 (no data/indels/SNVs)
+- Each .npz file contains a `metadata.json` entry with provenance info (genome name, version, CpG index CRC32, expected chromosomes). Read via `bam2tensor.metadata.read_npz_metadata()`.
 ### Methylation Strand Detection
 - Bismark aligner: XM tag (Z/z for methylated/unmethylated CpG; no strand filtering needed)
@@ -144,6 +149,8 @@ xdoctest validates code examples in docstrings. Important rules:
 uv run bam2tensor --input-path input.bam --reference-fasta ref.fa
 # Or with auto-download:
 uv run bam2tensor --input-path input.bam --download-reference hg38
+# Inspect an output file:
+uv run bam2tensor-inspect output.methylation.npz
 ```
 ### Reference Genome Downloads

bam2tensor-2.1/README.md → bam2tensor-2.3/PKG-INFO RENAMED

@@ -1,3 +1,36 @@
+Metadata-Version: 2.4
+Name: bam2tensor
+Version: 2.3
+Summary: Convert bisulfite-seq and EM-seq BAM files to sparse tensor representations of DNA methylation
+Project-URL: Homepage, https://github.com/mcwdsi/bam2tensor
+Project-URL: Repository, https://github.com/mcwdsi/bam2tensor
+Project-URL: Documentation, https://mcwdsi.github.io/bam2tensor
+Project-URL: Changelog, https://github.com/mcwdsi/bam2tensor/releases
+Author-email: Nick Semenkovich <semenko@alum.mit.edu>
+License-Expression: MIT
+License-File: LICENSE
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: MacOS
+Classifier: Operating System :: POSIX :: Linux
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
+Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
+Classifier: Typing :: Typed
+Requires-Python: >=3.10
+Requires-Dist: biopython>=1.81
+Requires-Dist: click>=8.0.1
+Requires-Dist: numpy>=1.26.0
+Requires-Dist: pysam>=0.22.0
+Requires-Dist: scipy>=1.11.4
+Requires-Dist: tqdm>=4.66.1
+Description-Content-Type: text/markdown
 # bam2tensor
 **Author:** [Nick Semenkovich](https://nick.semenkovich.com/) (semenko@alum.mit.edu)
@@ -39,7 +72,9 @@
   - [Custom Output Directory](#custom-output-directory)
   - [Using a Custom Genome](#using-a-custom-genome)
   - [Command-Line Options](#command-line-options)
+- [Inspecting Output Files](#inspecting-output-files)
 - [Output Data Structure](#output-data-structure)
+  - [Embedded Metadata](#embedded-metadata)
   - [Loading Output Files](#loading-output-files)
   - [Converting to Dense Arrays](#converting-to-dense-arrays)
   - [Working with Genomic Coordinates](#working-with-genomic-coordinates)
@@ -252,24 +287,81 @@ Options:
 | `--download-reference` | Download and cache a known reference genome. Choices: `hg38`, `hg19`, `mm10`, `T2T-CHM13`. Replaces `--reference-fasta`. |
 | `--list-genomes` | List available reference genomes for `--download-reference` and exit. |
+## Inspecting Output Files
+Use `bam2tensor-inspect` to view a summary of any `.methylation.npz` file without writing Python:
+```bash
+$ bam2tensor-inspect sample.methylation.npz
+sample.methylation.npz
+  Genome:          hg38
+  Chromosomes:     24 (chr1, chr2, ... chrX, chrY)
+  Reads:           1,423,891
+  CpG sites:       28,217,448
+  Data points:     12,847,322 (sparsity: 99.97%)
+  CpG index CRC32: a1b2c3d4
+  bam2tensor:      v2.3
+  File size:       14.2 MB
+```
+You can pass multiple files at once:
+```bash
+$ bam2tensor-inspect *.methylation.npz
+```
+This works on files produced by older versions of bam2tensor too (metadata fields will be omitted).
 ## Output Data Structure
-bam2tensor generates one `.npz` file per input BAM file. Each file contains a SciPy sparse [COO matrix] with the following structure:
+bam2tensor generates one `.methylation.npz` file per input BAM file. Each file contains a SciPy sparse [COO matrix] (`scipy.sparse.coo_matrix`) with the following structure:
 | Dimension | Represents |
 |-----------|------------|
-| Rows | Unique reads (primary alignments that pass quality filters) |
-| Columns | CpG sites (ordered by genomic position across all chromosomes) |
+| **Rows** | Unique sequencing reads (primary alignments that pass quality and flag filters, numbered sequentially as encountered across chromosomes) |
+| **Columns** | CpG sites from the reference genome, ordered by genomic position across all chromosomes (chr1, chr2, ..., chrX, chrY). Column `i` maps to the `i`-th CpG dinucleotide in the reference FASTA. |
+The **column dimension is determined entirely by the reference genome**: it equals the total number of CpG sites across all `--expected-chromosomes`. For example, hg38 with default chromosomes has ~28 million CpG columns. To map column indices back to genomic coordinates (e.g., column 12345 → chr1:29503), use the `GenomeMethylationEmbedding` class with the same reference FASTA and chromosome list (see [Working with Genomic Coordinates](#working-with-genomic-coordinates) below).
 ### Methylation State Values
 | Value | Meaning |
 |-------|---------|
-| `1` | Methylated (cytosine preserved as C) |
-| `0` | Unmethylated (cytosine converted to T by bisulfite treatment) |
-| `-1` | No data (indel, SNV, or site not covered by read) |
+| `1` | Methylated (cytosine preserved as C after bisulfite/enzymatic conversion) |
+| `0` | Unmethylated (cytosine converted to T by bisulfite/enzymatic treatment) |
+| `-1` | No data (indel, SNV, or other non-C/T base at a CpG position) |
-Note: Sparse matrices only store non-zero values. Positions with value `0` (unmethylated) are stored, but positions not covered by a read are simply absent from the matrix.
+Note: The matrix uses SciPy's COO sparse format, which explicitly stores all non-zero values. Unmethylated sites (value `0`) **are** stored as explicit entries. Positions not covered by a read are simply absent from the matrix (implicit zero, which is distinct from the explicit `0` = unmethylated).
+### Embedded Metadata
+Each `.methylation.npz` file includes a `metadata.json` entry inside the ZIP archive with provenance information:
+| Field | Description |
+|-------|-------------|
+| `bam2tensor_version` | Version of bam2tensor that produced the file |
+| `genome_name` | Genome identifier (e.g., `hg38`, `mm10`) |
+| `expected_chromosomes` | List of chromosomes included in the column mapping |
+| `total_cpg_sites` | Total number of CpG columns in the matrix |
+| `cpg_index_crc32` | CRC32 checksum of the CpG site positions (verifies identical column semantics) |
+This metadata is ignored by `scipy.sparse.load_npz`, so existing code continues to work. To read it:
+```python
+from bam2tensor.metadata import read_npz_metadata
+meta = read_npz_metadata("sample.methylation.npz")
+if meta is not None:
+    print(f"Genome: {meta['genome_name']}")
+    print(f"CpG sites: {meta['total_cpg_sites']:,}")
+    print(f"CpG index CRC32: {meta['cpg_index_crc32']}")
+```
+The `cpg_index_crc32` field uniquely identifies the column mapping. Two files with the same CRC32 have identical column semantics (same chromosomes, same CpG positions, same order) and their matrices can be directly stacked or compared. The metadata is also accessible without bam2tensor installed, since `.npz` files are ZIP archives:
+```bash
+unzip -p sample.methylation.npz metadata.json | python -m json.tool
+```
 ### Loading Output Files
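
Note on the Embedded Metadata section above: the README says two files with the same `cpg_index_crc32` can be stacked directly. A minimal sketch of that check follows, using only the standard library, since the README states that `.npz` files are ZIP archives. The helper names `read_metadata` and `same_column_mapping` are hypothetical and not part of bam2tensor. CRC32 is a 32-bit checksum, so a match is strong evidence of an identical column mapping rather than a proof.

```python
import json
import zipfile


def read_metadata(npz_path):
    """Return the parsed metadata.json entry, or None if the archive has none."""
    with zipfile.ZipFile(npz_path, "r") as zf:
        if "metadata.json" not in zf.namelist():
            return None
        return json.loads(zf.read("metadata.json"))


def same_column_mapping(npz_paths):
    """Return True only if every file has metadata and all CRC32 values agree."""
    checksums = set()
    for path in npz_paths:
        meta = read_metadata(path)
        if meta is None or "cpg_index_crc32" not in meta:
            # Older file or missing field: compatibility cannot be confirmed.
            return False
        checksums.add(meta["cpg_index_crc32"])
    return len(checksums) == 1
```

A caller would run `same_column_mapping([...])` before stacking matrices and fall back to rebuilding the embedding when it returns `False`.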

bam2tensor-2.1/PKG-INFO → bam2tensor-2.3/README.md RENAMED

@@ -1,24 +1,3 @@
-Metadata-Version: 2.4
-Name: bam2tensor
-Version: 2.1
-Summary: Convert bisulfite-seq and EM-seq BAM files to sparse tensor representations of DNA methylation
-Project-URL: Homepage, https://github.com/mcwdsi/bam2tensor
-Project-URL: Repository, https://github.com/mcwdsi/bam2tensor
-Project-URL: Documentation, https://mcwdsi.github.io/bam2tensor
-Project-URL: Changelog, https://github.com/mcwdsi/bam2tensor/releases
-Author-email: Nick Semenkovich <semenko@alum.mit.edu>
-License-Expression: MIT
-License-File: LICENSE
-Classifier: Development Status :: 5 - Production/Stable
-Requires-Python: >=3.10
-Requires-Dist: biopython>=1.81
-Requires-Dist: click>=8.0.1
-Requires-Dist: numpy>=1.26.0
-Requires-Dist: pysam>=0.22.0
-Requires-Dist: scipy>=1.11.4
-Requires-Dist: tqdm>=4.66.1
-Description-Content-Type: text/markdown
 # bam2tensor
 **Author:** [Nick Semenkovich](https://nick.semenkovich.com/) (semenko@alum.mit.edu)
@@ -60,7 +39,9 @@ Description-Content-Type: text/markdown
   - [Custom Output Directory](#custom-output-directory)
   - [Using a Custom Genome](#using-a-custom-genome)
   - [Command-Line Options](#command-line-options)
+- [Inspecting Output Files](#inspecting-output-files)
 - [Output Data Structure](#output-data-structure)
+  - [Embedded Metadata](#embedded-metadata)
   - [Loading Output Files](#loading-output-files)
   - [Converting to Dense Arrays](#converting-to-dense-arrays)
   - [Working with Genomic Coordinates](#working-with-genomic-coordinates)
@@ -273,24 +254,81 @@ Options:
 | `--download-reference` | Download and cache a known reference genome. Choices: `hg38`, `hg19`, `mm10`, `T2T-CHM13`. Replaces `--reference-fasta`. |
 | `--list-genomes` | List available reference genomes for `--download-reference` and exit. |
+## Inspecting Output Files
+Use `bam2tensor-inspect` to view a summary of any `.methylation.npz` file without writing Python:
+```bash
+$ bam2tensor-inspect sample.methylation.npz
+sample.methylation.npz
+  Genome:          hg38
+  Chromosomes:     24 (chr1, chr2, ... chrX, chrY)
+  Reads:           1,423,891
+  CpG sites:       28,217,448
+  Data points:     12,847,322 (sparsity: 99.97%)
+  CpG index CRC32: a1b2c3d4
+  bam2tensor:      v2.3
+  File size:       14.2 MB
+```
+You can pass multiple files at once:
+```bash
+$ bam2tensor-inspect *.methylation.npz
+```
+This works on files produced by older versions of bam2tensor too (metadata fields will be omitted).
 ## Output Data Structure
-bam2tensor generates one `.npz` file per input BAM file. Each file contains a SciPy sparse [COO matrix] with the following structure:
+bam2tensor generates one `.methylation.npz` file per input BAM file. Each file contains a SciPy sparse [COO matrix] (`scipy.sparse.coo_matrix`) with the following structure:
 | Dimension | Represents |
 |-----------|------------|
-| Rows | Unique reads (primary alignments that pass quality filters) |
-| Columns | CpG sites (ordered by genomic position across all chromosomes) |
+| **Rows** | Unique sequencing reads (primary alignments that pass quality and flag filters, numbered sequentially as encountered across chromosomes) |
+| **Columns** | CpG sites from the reference genome, ordered by genomic position across all chromosomes (chr1, chr2, ..., chrX, chrY). Column `i` maps to the `i`-th CpG dinucleotide in the reference FASTA. |
+The **column dimension is determined entirely by the reference genome**: it equals the total number of CpG sites across all `--expected-chromosomes`. For example, hg38 with default chromosomes has ~28 million CpG columns. To map column indices back to genomic coordinates (e.g., column 12345 → chr1:29503), use the `GenomeMethylationEmbedding` class with the same reference FASTA and chromosome list (see [Working with Genomic Coordinates](#working-with-genomic-coordinates) below).
 ### Methylation State Values
 | Value | Meaning |
 |-------|---------|
-| `1` | Methylated (cytosine preserved as C) |
-| `0` | Unmethylated (cytosine converted to T by bisulfite treatment) |
-| `-1` | No data (indel, SNV, or site not covered by read) |
+| `1` | Methylated (cytosine preserved as C after bisulfite/enzymatic conversion) |
+| `0` | Unmethylated (cytosine converted to T by bisulfite/enzymatic treatment) |
+| `-1` | No data (indel, SNV, or other non-C/T base at a CpG position) |
-Note: Sparse matrices only store non-zero values. Positions with value `0` (unmethylated) are stored, but positions not covered by a read are simply absent from the matrix.
+Note: The matrix uses SciPy's COO sparse format, which explicitly stores all non-zero values. Unmethylated sites (value `0`) **are** stored as explicit entries. Positions not covered by a read are simply absent from the matrix (implicit zero, which is distinct from the explicit `0` = unmethylated).
+### Embedded Metadata
+Each `.methylation.npz` file includes a `metadata.json` entry inside the ZIP archive with provenance information:
+| Field | Description |
+|-------|-------------|
+| `bam2tensor_version` | Version of bam2tensor that produced the file |
+| `genome_name` | Genome identifier (e.g., `hg38`, `mm10`) |
+| `expected_chromosomes` | List of chromosomes included in the column mapping |
+| `total_cpg_sites` | Total number of CpG columns in the matrix |
+| `cpg_index_crc32` | CRC32 checksum of the CpG site positions (verifies identical column semantics) |
+This metadata is ignored by `scipy.sparse.load_npz`, so existing code continues to work. To read it:
+```python
+from bam2tensor.metadata import read_npz_metadata
+meta = read_npz_metadata("sample.methylation.npz")
+if meta is not None:
+    print(f"Genome: {meta['genome_name']}")
+    print(f"CpG sites: {meta['total_cpg_sites']:,}")
+    print(f"CpG index CRC32: {meta['cpg_index_crc32']}")
+```
+The `cpg_index_crc32` field uniquely identifies the column mapping. Two files with the same CRC32 have identical column semantics (same chromosomes, same CpG positions, same order) and their matrices can be directly stacked or compared. The metadata is also accessible without bam2tensor installed, since `.npz` files are ZIP archives:
+```bash
+unzip -p sample.methylation.npz metadata.json | python -m json.tool
+```
 ### Loading Output Files
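
Note on the Methylation State Values section above: the revised note distinguishes an explicit `0` (unmethylated) from an absent entry (not covered). A small sketch of that distinction follows, assuming only the public SciPy COO constructor. The toy matrix is hypothetical and far smaller than real output.

```python
import numpy as np
import scipy.sparse

# One read (row 0) over five CpG columns:
# column 0 methylated, column 2 unmethylated, column 3 no data.
# Columns 1 and 4 are not covered and therefore have no entry at all.
data = np.array([1, 0, -1], dtype=np.int8)
rows = np.array([0, 0, 0])
cols = np.array([0, 2, 3])
matrix = scipy.sparse.coo_matrix((data, (rows, cols)), shape=(1, 5))

# nnz counts stored entries, so the explicit zero is included.
stored_entries = matrix.nnz

# Coverage is recovered from the coordinate arrays, not from the values.
covered = set(zip(matrix.row.tolist(), matrix.col.tolist()))

# A dense conversion loses the distinction: columns 1 and 2 both read 0.
dense = matrix.toarray()
```

Because `toarray()` collapses "unmethylated" and "not covered" into the same value, coverage-sensitive analyses should work from `matrix.row`, `matrix.col`, and `matrix.data` directly.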

{bam2tensor-2.1 → bam2tensor-2.3}/pyproject.toml RENAMED

@@ -1,12 +1,26 @@
 [project]
 name = "bam2tensor"
-version = "2.1"
+version = "2.3"
 description = "Convert bisulfite-seq and EM-seq BAM files to sparse tensor representations of DNA methylation"
 authors = [{ name = "Nick Semenkovich", email = "semenko@alum.mit.edu" }]
 license = "MIT"
 readme = "README.md"
 requires-python = ">=3.10"
-classifiers = ["Development Status :: 5 - Production/Stable"]
+classifiers = [
+    "Development Status :: 5 - Production/Stable",
+    "Intended Audience :: Science/Research",
+    "License :: OSI Approved :: MIT License",
+    "Operating System :: MacOS",
+    "Operating System :: POSIX :: Linux",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13",
+    "Topic :: Scientific/Engineering :: Bio-Informatics",
+    "Topic :: Scientific/Engineering :: Medical Science Apps.",
+    "Typing :: Typed",
+]
 dependencies = [
     "click>=8.0.1",
     "numpy>=1.26.0",
@@ -24,6 +38,7 @@ Changelog = "https://github.com/mcwdsi/bam2tensor/releases"
 [project.scripts]
 bam2tensor = "bam2tensor.__main__:main"
+bam2tensor-inspect = "bam2tensor.inspect:main"
 [dependency-groups]
 dev = [

{bam2tensor-2.1 → bam2tensor-2.3}/src/bam2tensor/__init__.py RENAMED

@@ -50,4 +50,4 @@ See Also:
     - https://mcwdsi.github.io/bam2tensor for full documentation
 """
-__version__ = "2.1"
+__version__ = "2.3"

{bam2tensor-2.1 → bam2tensor-2.3}/src/bam2tensor/__main__.py RENAMED

@@ -38,6 +38,7 @@ from bam2tensor.functions import (
     detect_aligner,
     extract_methylation_data_from_bam,
 )
+from bam2tensor.metadata import compute_cpg_index_crc32, write_npz_metadata
 from bam2tensor.reference import (
     KNOWN_GENOMES,
     download_reference as download_reference_fn,
@@ -393,10 +394,12 @@ def main(
         verbose=verbose,
     )
     n_chroms = len(genome_methylation_embedding.cpg_sites_dict)
+    cpg_crc32 = compute_cpg_index_crc32(genome_methylation_embedding)
     print(
         f"  Total CpG sites: {genome_methylation_embedding.total_cpg_sites:,}"
         f" across {n_chroms} chromosome(s)"
     )
+    print(f"  CpG index CRC32: {cpg_crc32}")
     print(f"  Index loaded in {_format_elapsed(time.time() - time_start)}")
     # ── Discover BAM files ──────────────────────────────────────────────
@@ -460,6 +463,16 @@ def main(
         # Save
         scipy.sparse.save_npz(output_file, methylation_data_coo, compressed=True)
+        write_npz_metadata(
+            output_file,
+            {
+                "bam2tensor_version": __version__,
+                "genome_name": genome_name,
+                "expected_chromosomes": chrom_list,
+                "total_cpg_sites": genome_methylation_embedding.total_cpg_sites,
+                "cpg_index_crc32": cpg_crc32,
+            },
+        )
         print(f"  Output:        {output_file}")
         print(f"  Time:          {_format_elapsed(time.time() - time_bam)}")
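
Note on the save step above: the hunk calls `write_npz_metadata` right after `scipy.sparse.save_npz`, which relies on appending an entry to an existing ZIP archive. A minimal standard-library sketch of that append pattern follows. The helper `append_metadata` and the file names are hypothetical, and a plain ZIP with a placeholder entry stands in for a real `.npz` file.

```python
import json
import os
import tempfile
import zipfile


def append_metadata(archive_path, metadata):
    """Append a metadata.json entry to an existing ZIP archive."""
    # Mode "a" keeps the existing entries and adds the new one at the end.
    with zipfile.ZipFile(archive_path, "a") as zf:
        zf.writestr("metadata.json", json.dumps(metadata, indent=2))


# Stand-in for an .npz file: a ZIP with one placeholder array entry.
archive = os.path.join(tempfile.mkdtemp(), "demo.npz")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("data.npy", b"placeholder")

append_metadata(archive, {"genome_name": "hg38"})

with zipfile.ZipFile(archive, "r") as zf:
    names = zf.namelist()
    loaded = json.loads(zf.read("metadata.json"))
```

The original entry is left untouched, which is why loaders that only look for their own entry names are unaffected.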

bam2tensor-2.3/src/bam2tensor/inspect.py ADDED

@@ -0,0 +1,143 @@
+"""Inspect command for bam2tensor .npz output files.
+Provides a CLI entry point (``bam2tensor-inspect``) that prints a summary
+of one or more ``.methylation.npz`` files, including matrix dimensions,
+sparsity, file size, and embedded provenance metadata.
+Example:
+    Inspect a single file::
+        $ bam2tensor-inspect sample.methylation.npz
+    Inspect multiple files::
+        $ bam2tensor-inspect *.methylation.npz
+"""
+import os
+import sys
+import click
+import numpy as np
+import scipy.sparse
+from bam2tensor.metadata import read_npz_metadata
+def _format_size(nbytes: int) -> str:
+    """Format a byte count as a human-readable string.
+    Args:
+        nbytes: Number of bytes.
+    Returns:
+        A string such as ``"14.2 MB"`` or ``"832 bytes"``.
+    Example:
+        >>> _format_size(14_200_000)
+        '13.5 MB'
+        >>> _format_size(500)
+        '500 bytes'
+        >>> _format_size(2048)
+        '2.0 KB'
+    """
+    if nbytes < 1024:
+        return f"{nbytes} bytes"
+    elif nbytes < 1024 * 1024:
+        return f"{nbytes / 1024:.1f} KB"
+    elif nbytes < 1024 * 1024 * 1024:
+        return f"{nbytes / (1024 * 1024):.1f} MB"
+    else:
+        return f"{nbytes / (1024 * 1024 * 1024):.1f} GB"
+def inspect_npz(npz_path: str) -> None:
+    """Print a human-readable summary of a .methylation.npz file.
+    Loads the sparse matrix and any embedded metadata, then prints
+    matrix dimensions, data-point counts, sparsity, provenance
+    information, and file size.
+    Args:
+        npz_path: Path to the ``.npz`` file to inspect.
+    Example:
+        >>> # xdoctest: +SKIP
+        >>> inspect_npz("sample.methylation.npz")
+        sample.methylation.npz
+          Reads:           1,423
+          CpG sites:       28,217,448
+          ...
+    """
+    # Load matrix
+    matrix = scipy.sparse.load_npz(npz_path)
+    n_reads, n_cpgs = matrix.shape
+    n_data = matrix.nnz
+    total_cells = int(np.prod(matrix.shape)) if n_reads > 0 else 0
+    sparsity = 1 - (n_data / total_cells) if total_cells > 0 else 0.0
+    file_size = os.path.getsize(npz_path)
+    # Load metadata (may be None for old files)
+    meta = read_npz_metadata(npz_path)
+    # Print summary
+    print(os.path.basename(npz_path))
+    if meta and "genome_name" in meta:
+        print(f"  Genome:          {meta['genome_name']}")
+    if meta and "expected_chromosomes" in meta:
+        chroms = meta["expected_chromosomes"]
+        n_chr = len(chroms)
+        if n_chr <= 4:
+            chrom_display = ", ".join(chroms)
+        else:
+            chrom_display = (
+                f"{n_chr} ({chroms[0]}, {chroms[1]}, "
+                f"... {chroms[-2]}, {chroms[-1]})"
+            )
+        print(f"  Chromosomes:     {chrom_display}")
+    print(f"  Reads:           {n_reads:,}")
+    print(f"  CpG sites:       {n_cpgs:,}")
+    print(f"  Data points:     {n_data:,} (sparsity: {sparsity:.2%})")
+    if meta and "cpg_index_crc32" in meta:
+        print(f"  CpG index CRC32: {meta['cpg_index_crc32']}")
+    if meta and "bam2tensor_version" in meta:
+        print(f"  bam2tensor:      v{meta['bam2tensor_version']}")
+    elif meta is None:
+        print("  Metadata:        none (produced by older bam2tensor)")
+    print(f"  File size:       {_format_size(file_size)}")
+@click.command(help="Inspect bam2tensor .methylation.npz output files.")
+@click.argument(
+    "files",
+    nargs=-1,
+    required=True,
+    type=click.Path(exists=True, dir_okay=False, readable=True),
+)
+def main(files: tuple[str, ...]) -> None:
+    """Inspect one or more .methylation.npz files.
+    Prints a summary of each file including matrix dimensions, sparsity,
+    embedded metadata, and file size.
+    Args:
+        files: One or more paths to ``.methylation.npz`` files.
+    """
+    for i, path in enumerate(files):
+        if i > 0:
+            print()
+        try:
+            inspect_npz(path)
+        except Exception as e:
+            print(f"{os.path.basename(path)}", file=sys.stderr)
+            print(f"  Error: {e}", file=sys.stderr)
+if __name__ == "__main__":
+    main()  # pylint: disable=no-value-for-parameter
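
Note on the sparsity line printed by `inspect_npz`: the figure is `1 - nnz / (rows * columns)`, formatted with `:.2%`. A short worked sketch of that arithmetic follows, using the illustrative counts from the README sample output. The standalone `sparsity` function is a hypothetical restatement of the inline expression, not part of the package.

```python
def sparsity(n_reads, n_cpgs, n_data):
    """Fraction of matrix cells with no stored entry (0.0 for an empty matrix)."""
    total_cells = n_reads * n_cpgs
    return 1 - (n_data / total_cells) if total_cells > 0 else 0.0


# Counts taken from the README sample output.
readme_value = sparsity(1_423_891, 28_217_448, 12_847_322)
readme_text = f"{readme_value:.2%}"

# Tiny matrix from the tests: shape (2, 5) with 3 stored entries.
small_text = f"{sparsity(2, 5, 3):.2%}"
```

With the README counts the stored fraction is roughly 3 in 10 million cells, so the two-decimal figure rounds to `100.00%`. The `99.97%` shown in the README sample therefore looks illustrative rather than computed from those counts.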

bam2tensor-2.3/src/bam2tensor/metadata.py ADDED

@@ -0,0 +1,114 @@
+"""Metadata utilities for bam2tensor .npz output files.
+This module provides functions to embed and retrieve provenance metadata
+inside the ``.methylation.npz`` files produced by bam2tensor.  The metadata
+is stored as a ``metadata.json`` entry appended to the ZIP archive that
+underlies every ``.npz`` file.  ``scipy.sparse.load_npz`` silently ignores
+this extra entry, so existing downstream code is unaffected.
+Example:
+    Writing metadata (done automatically by the CLI)::
+        >>> # xdoctest: +SKIP
+        >>> from bam2tensor.metadata import write_npz_metadata, read_npz_metadata
+        >>> write_npz_metadata("sample.methylation.npz", {
+        ...     "bam2tensor_version": "2.2",
+        ...     "genome_name": "hg38",
+        ... })
+        >>> read_npz_metadata("sample.methylation.npz")
+        {'bam2tensor_version': '2.2', 'genome_name': 'hg38'}
+    Reading metadata from an existing file::
+        >>> # xdoctest: +SKIP
+        >>> meta = read_npz_metadata("sample.methylation.npz")
+        >>> if meta is not None:
+        ...     print(meta["genome_name"])
+        hg38
+"""
+import json
+import zipfile
+import zlib
+from bam2tensor.embedding import GenomeMethylationEmbedding
+def compute_cpg_index_crc32(embedding: GenomeMethylationEmbedding) -> str:
+    """Compute a CRC32 checksum over the CpG site positions in an embedding.
+    The checksum captures the exact column mapping of the sparse matrix:
+    which chromosomes are included, in what order, and which genomic
+    positions are CpG sites within each chromosome.  Two embeddings with
+    the same checksum will produce identical column semantics.
+    Args:
+        embedding: A fully initialised GenomeMethylationEmbedding whose
+            ``cpg_sites_dict`` is populated.
+    Returns:
+        The CRC32 checksum as an 8-character lowercase hexadecimal string.
+    Example:
+        >>> # xdoctest: +SKIP
+        >>> from bam2tensor.embedding import GenomeMethylationEmbedding
+        >>> emb = GenomeMethylationEmbedding(
+        ...     genome_name="hg38",
+        ...     expected_chromosomes=["chr1"],
+        ...     fasta_source="ref.fa",
+        ... )
+        >>> compute_cpg_index_crc32(emb)
+        'a1b2c3d4'
+    """
+    # Build a deterministic byte representation:
+    #   chrom\tpos1,pos2,...\n  (one line per chromosome, in order)
+    parts: list[str] = []
+    for chrom in embedding.expected_chromosomes:
+        positions = embedding.cpg_sites_dict.get(chrom, [])
+        parts.append(chrom + "\t" + ",".join(str(p) for p in positions))
+    payload = "\n".join(parts).encode("utf-8")
+    return format(zlib.crc32(payload) & 0xFFFFFFFF, "08x")
+def write_npz_metadata(
+    npz_path: str,
+    metadata: dict,
+) -> None:
+    """Append a ``metadata.json`` entry to an existing ``.npz`` file.
+    The metadata is serialised as compact JSON and appended to the ZIP
+    archive.  ``scipy.sparse.load_npz`` ignores unrecognised entries, so
+    the file remains fully compatible with existing code.
+    Args:
+        npz_path: Path to the ``.npz`` file (must already exist).
+        metadata: A JSON-serialisable dictionary of metadata to embed.
+    Example:
+        >>> # xdoctest: +SKIP
+        >>> write_npz_metadata("out.npz", {"genome_name": "hg38"})
+    """
+    with zipfile.ZipFile(npz_path, "a") as zf:
+        zf.writestr("metadata.json", json.dumps(metadata, indent=2))
+def read_npz_metadata(npz_path: str) -> dict | None:
+    """Read the ``metadata.json`` entry from a ``.npz`` file.
+    Args:
+        npz_path: Path to the ``.npz`` file.
+    Returns:
+        The metadata dictionary, or ``None`` if the file does not contain
+        a ``metadata.json`` entry (e.g. files produced by older versions).
+    Example:
+        >>> # xdoctest: +SKIP
+        >>> meta = read_npz_metadata("sample.methylation.npz")
+        >>> meta["genome_name"]
+        'hg38'
+    """
+    with zipfile.ZipFile(npz_path, "r") as zf:
+        if "metadata.json" in zf.namelist():
+            return json.loads(zf.read("metadata.json"))
+    return None
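
Note on `compute_cpg_index_crc32`: the checksum is taken over one `chrom<TAB>pos1,pos2,...` line per chromosome, in list order. A sketch of the same payload construction on a plain dictionary follows, so it runs without a reference FASTA. The function name `crc32_of_cpg_index` and the toy positions are hypothetical.

```python
import zlib


def crc32_of_cpg_index(chromosomes, cpg_sites):
    """CRC32 (8 lowercase hex chars) over chromosome order and CpG positions."""
    parts = []
    for chrom in chromosomes:
        positions = cpg_sites.get(chrom, [])
        parts.append(chrom + "\t" + ",".join(str(p) for p in positions))
    payload = "\n".join(parts).encode("utf-8")
    return format(zlib.crc32(payload) & 0xFFFFFFFF, "08x")


# Toy CpG index: two chromosomes with a handful of positions each.
toy_sites = {"chr1": [10, 57, 203], "chr2": [4, 88]}

full = crc32_of_cpg_index(["chr1", "chr2"], toy_sites)
again = crc32_of_cpg_index(["chr1", "chr2"], toy_sites)
subset = crc32_of_cpg_index(["chr1"], toy_sites)
```

Because the chromosome list drives the loop, dropping or reordering chromosomes changes the payload and, barring a 32-bit collision, the checksum.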

bam2tensor-2.3/tests/test_inspect.py ADDED

@@ -0,0 +1,146 @@
+"""Test cases for the inspect module."""
+import shutil
+import scipy.sparse
+from click.testing import CliRunner
+from bam2tensor import __main__
+from bam2tensor.inspect import _format_size
+from bam2tensor.inspect import main as inspect_main
+from bam2tensor.metadata import write_npz_metadata
+def test_inspect_with_metadata(tmp_path) -> None:
+    """Inspect prints metadata fields when present."""
+    npz_path = str(tmp_path / "sample.methylation.npz")
+    matrix = scipy.sparse.coo_matrix(([1, 0, -1], ([0, 0, 1], [0, 2, 1])), shape=(2, 5))
+    scipy.sparse.save_npz(npz_path, matrix)
+    write_npz_metadata(
+        npz_path,
+        {
+            "bam2tensor_version": "2.3",
+            "genome_name": "hg38",
+            "expected_chromosomes": ["chr1", "chr2", "chrX", "chrY"],
+            "total_cpg_sites": 5,
+            "cpg_index_crc32": "deadbeef",
+        },
+    )
+    runner = CliRunner()
+    result = runner.invoke(inspect_main, [npz_path])
+    assert result.exit_code == 0
+    assert "hg38" in result.output
+    assert "Reads:" in result.output
+    assert "2" in result.output  # 2 reads
+    assert "CpG sites:" in result.output
+    assert "deadbeef" in result.output
+    assert "v2.3" in result.output
+    assert "chr1, chr2, chrX, chrY" in result.output
+def test_inspect_without_metadata(tmp_path) -> None:
+    """Inspect works on files without metadata (older bam2tensor)."""
+    npz_path = str(tmp_path / "old.methylation.npz")
+    matrix = scipy.sparse.coo_matrix(([1], ([0], [0])), shape=(1, 100))
+    scipy.sparse.save_npz(npz_path, matrix)
+    runner = CliRunner()
+    result = runner.invoke(inspect_main, [npz_path])
+    assert result.exit_code == 0
+    assert "Reads:" in result.output
+    assert "older bam2tensor" in result.output
+    # Should NOT have genome or CRC lines
+    assert "Genome:" not in result.output
+def test_inspect_multiple_files(tmp_path) -> None:
+    """Inspect handles multiple files with blank line separator."""
+    paths = []
+    for name in ["a.npz", "b.npz"]:
+        p = str(tmp_path / name)
+        matrix = scipy.sparse.coo_matrix(([1], ([0], [0])), shape=(1, 10))
+        scipy.sparse.save_npz(p, matrix)
+        paths.append(p)
+    runner = CliRunner()
+    result = runner.invoke(inspect_main, paths)
+    assert result.exit_code == 0
+    assert "a.npz" in result.output
+    assert "b.npz" in result.output
+def test_inspect_many_chromosomes(tmp_path) -> None:
+    """Chromosome list is summarised when > 4 entries."""
+    npz_path = str(tmp_path / "matrix.npz")
+    matrix = scipy.sparse.coo_matrix(([1], ([0], [0])), shape=(1, 10))
+    scipy.sparse.save_npz(npz_path, matrix)
+    chroms = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]
+    write_npz_metadata(
+        npz_path,
+        {
+            "expected_chromosomes": chroms,
+            "genome_name": "hg38",
+        },
+    )
+    runner = CliRunner()
+    result = runner.invoke(inspect_main, [npz_path])
+    assert "24 (" in result.output
+    assert "chrY" in result.output
+def test_inspect_end_to_end(tmp_path) -> None:
+    """Full pipeline: bam2tensor produces file, bam2tensor-inspect reads it."""
+    shutil.copy("tests/test.bam", tmp_path / "test.bam")
+    shutil.copy("tests/test.bam.bai", tmp_path / "test.bam.bai")
+    runner = CliRunner()
+    # Run extraction
+    result = runner.invoke(
+        __main__.main,
+        [
+            "--input-path",
+            str(tmp_path / "test.bam"),
+            "--reference-fasta",
+            "tests/test_fasta.fa",
+            "--genome-name",
+            "test",
+            "--expected-chromosomes",
+            "chr1,chr2,chr3",
+            "--output-dir",
+            str(tmp_path / "out"),
+            "--overwrite",
+        ],
+    )
+    assert result.exit_code == 0
+    # Inspect the output
+    npz_path = str(tmp_path / "out" / "test.methylation.npz")
+    result = runner.invoke(inspect_main, [npz_path])
+    assert result.exit_code == 0
+    assert "test" in result.output  # genome_name
+    assert "CpG index CRC32:" in result.output
+    assert "v2.3" in result.output
+def test_format_size_bytes() -> None:
+    """_format_size handles small byte counts."""
+    assert _format_size(500) == "500 bytes"
+def test_format_size_kb() -> None:
+    """_format_size handles kilobyte range."""
+    assert _format_size(2048) == "2.0 KB"
+def test_format_size_mb() -> None:
+    """_format_size handles megabyte range."""
+    result = _format_size(14_200_000)
+    assert "MB" in result
+def test_format_size_gb() -> None:
+    """_format_size handles gigabyte range."""
+    result = _format_size(2_500_000_000)
+    assert "GB" in result
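
The four `_format_size` tests above pin down only a small contract: counts under 1 KB are rendered as whole bytes, 2048 is rendered as "2.0 KB", and larger values carry an "MB" or "GB" suffix. The implementation in `inspect.py` is not shown in this part of the diff, so the following is only a sketch of a formatter that would satisfy those assertions. The name `format_size`, the 1024-based units, and the one-decimal rounding are assumptions chosen to match the tests, not the package's actual code.

```python
def format_size(num_bytes: int) -> str:
    """Render a byte count with binary (1024-based) units.

    Hypothetical stand-in for bam2tensor's private _format_size helper.
    """
    if num_bytes < 1024:
        # Small values are shown as an exact integer count.
        return f"{num_bytes} bytes"
    size = num_bytes / 1024
    for unit in ("KB", "MB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    # Anything that is 1024 MB or more is reported in gigabytes.
    return f"{size:.1f} GB"
```

Under these assumptions `format_size(2048)` gives "2.0 KB", which is the only exact string the tests require beyond the byte case.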

bam2tensor-2.3/tests/test_metadata.py ADDED

@@ -0,0 +1,162 @@
+"""Test cases for the metadata module."""
+import json
+import zipfile
+import scipy.sparse
+from bam2tensor import embedding
+from bam2tensor.metadata import (
+    compute_cpg_index_crc32,
+    read_npz_metadata,
+    write_npz_metadata,
+)
+TEST_EMBEDDING = embedding.GenomeMethylationEmbedding(
+    "test_genome",
+    expected_chromosomes=["chr1", "chr2", "chr3"],
+    fasta_source="tests/test_fasta.fa",
+    window_size=150,
+    skip_cache=False,
+    verbose=False,
+)
+# -- compute_cpg_index_crc32 -------------------------------------------------
+def test_cpg_index_crc32_deterministic() -> None:
+    """Same embedding always produces the same CRC32."""
+    assert compute_cpg_index_crc32(TEST_EMBEDDING) == compute_cpg_index_crc32(
+        TEST_EMBEDDING
+    )
+def test_cpg_index_crc32_format() -> None:
+    """CRC32 is an 8-character hex string."""
+    crc = compute_cpg_index_crc32(TEST_EMBEDDING)
+    assert len(crc) == 8
+    int(crc, 16)  # must be valid hex
+def test_cpg_index_crc32_differs_for_different_embeddings(tmp_path) -> None:
+    """Different chromosome lists produce different CRC32 values."""
+    emb_subset = embedding.GenomeMethylationEmbedding(
+        "test_subset",
+        expected_chromosomes=["chr1"],
+        fasta_source="tests/test_fasta.fa",
+        window_size=150,
+        skip_cache=True,
+        verbose=False,
+    )
+    assert compute_cpg_index_crc32(TEST_EMBEDDING) != compute_cpg_index_crc32(
+        emb_subset
+    )
+# -- write / read round-trip -------------------------------------------------
+def test_write_then_read_metadata(tmp_path) -> None:
+    """Metadata survives a write-then-read round trip."""
+    npz_path = str(tmp_path / "matrix.npz")
+    matrix = scipy.sparse.coo_matrix(([1, 0, -1], ([0, 0, 1], [0, 2, 1])), shape=(2, 4))
+    scipy.sparse.save_npz(npz_path, matrix)
+    metadata = {
+        "bam2tensor_version": "2.2",
+        "genome_name": "hg38",
+        "cpg_index_crc32": "deadbeef",
+        "total_cpg_sites": 4,
+        "expected_chromosomes": ["chr1", "chr2"],
+    }
+    write_npz_metadata(npz_path, metadata)
+    loaded = read_npz_metadata(npz_path)
+    assert loaded == metadata
+def test_scipy_load_unaffected_by_metadata(tmp_path) -> None:
+    """scipy.sparse.load_npz still works after metadata is appended."""
+    npz_path = str(tmp_path / "matrix.npz")
+    data = [1, 0, -1, 1, 0]
+    row = [0, 0, 1, 1, 2]
+    col = [0, 2, 1, 3, 2]
+    matrix = scipy.sparse.coo_matrix((data, (row, col)), shape=(3, 5))
+    scipy.sparse.save_npz(npz_path, matrix)
+    write_npz_metadata(npz_path, {"genome_name": "hg38"})
+    loaded = scipy.sparse.load_npz(npz_path)
+    assert (loaded.toarray() == matrix.toarray()).all()
+    assert loaded.shape == matrix.shape
+    assert loaded.nnz == matrix.nnz
+def test_read_metadata_returns_none_without_metadata(tmp_path) -> None:
+    """read_npz_metadata returns None for files without metadata."""
+    npz_path = str(tmp_path / "plain.npz")
+    matrix = scipy.sparse.coo_matrix(([1], ([0], [0])), shape=(1, 1))
+    scipy.sparse.save_npz(npz_path, matrix)
+    assert read_npz_metadata(npz_path) is None
+def test_metadata_accessible_via_zipfile(tmp_path) -> None:
+    """Metadata is plain JSON readable with standard zipfile tools."""
+    npz_path = str(tmp_path / "matrix.npz")
+    matrix = scipy.sparse.coo_matrix(([1], ([0], [0])), shape=(1, 1))
+    scipy.sparse.save_npz(npz_path, matrix)
+    write_npz_metadata(npz_path, {"genome_name": "mm10", "total_cpg_sites": 42})
+    with zipfile.ZipFile(npz_path, "r") as zf:
+        assert "metadata.json" in zf.namelist()
+        raw = json.loads(zf.read("metadata.json"))
+        assert raw["genome_name"] == "mm10"
+        assert raw["total_cpg_sites"] == 42
+# -- CLI integration (end-to-end) -------------------------------------------
+def test_main_writes_metadata(tmp_path) -> None:
+    """The CLI embeds metadata in the output .npz file."""
+    import shutil
+    from click.testing import CliRunner
+    from bam2tensor import __main__
+    shutil.copy("tests/test.bam", tmp_path / "test.bam")
+    shutil.copy("tests/test.bam.bai", tmp_path / "test.bam.bai")
+    runner = CliRunner()
+    result = runner.invoke(
+        __main__.main,
+        [
+            "--input-path",
+            str(tmp_path / "test.bam"),
+            "--reference-fasta",
+            "tests/test_fasta.fa",
+            "--genome-name",
+            "test",
+            "--expected-chromosomes",
+            "chr1,chr2,chr3",
+            "--output-dir",
+            str(tmp_path / "out"),
+            "--overwrite",
+        ],
+    )
+    assert result.exit_code == 0, f"CLI failed: {result.output}"
+    npz_path = str(tmp_path / "out" / "test.methylation.npz")
+    meta = read_npz_metadata(npz_path)
+    assert meta is not None
+    assert meta["genome_name"] == "test"
+    assert meta["expected_chromosomes"] == ["chr1", "chr2", "chr3"]
+    assert meta["total_cpg_sites"] == TEST_EMBEDDING.total_cpg_sites
+    assert len(meta["cpg_index_crc32"]) == 8
+    assert "bam2tensor_version" in meta
+    # Verify the sparse matrix is still loadable
+    mat = scipy.sparse.load_npz(npz_path)
+    assert mat.shape[1] == TEST_EMBEDDING.total_cpg_sites
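
Taken together, the tests in `test_metadata.py` describe the storage scheme from the outside: the `.npz` file is a zip archive, the metadata lives in an entry named `metadata.json`, reading a file without that entry yields `None`, and the extra entry does not interfere with `scipy.sparse.load_npz`. The body of `metadata.py` is not part of this excerpt, so the sketch below is an assumption about how such behaviour could be achieved with the standard library alone. The helper names `append_metadata` and `load_metadata` are hypothetical and are not the package's API.

```python
import json
import zipfile
from typing import Any, Optional

# Entry name taken from test_metadata_accessible_via_zipfile above.
METADATA_ENTRY = "metadata.json"


def append_metadata(archive_path: str, metadata: dict[str, Any]) -> None:
    """Append metadata as a JSON entry to an existing zip archive.

    Hypothetical helper. Opening in "a" mode leaves existing entries
    untouched. Calling this twice on the same file would add a second
    entry with the same name, which a real implementation would need
    to handle.
    """
    with zipfile.ZipFile(archive_path, "a") as zf:
        zf.writestr(METADATA_ENTRY, json.dumps(metadata))


def load_metadata(archive_path: str) -> Optional[dict[str, Any]]:
    """Return the stored metadata, or None if the archive has no such entry."""
    with zipfile.ZipFile(archive_path, "r") as zf:
        if METADATA_ENTRY not in zf.namelist():
            return None
        return json.loads(zf.read(METADATA_ENTRY))
```

Because the metadata is plain JSON inside a standard zip container, it can also be inspected without Python, for example with any archive tool that can list and extract zip entries.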

{bam2tensor-2.1 → bam2tensor-2.3}/uv.lock RENAMED

@@ -62,7 +62,7 @@ wheels = [
 [[package]]
 name = "bam2tensor"
-version = "2.0"
+version = "2.3"
 source = { editable = "." }
 dependencies = [
     { name = "biopython" },