PyPI - bio2zarr - Versions diffs - 0.1.6__tar.gz → 0.1.7__tar.gz - Mend


This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of bio2zarr might be problematic.

Files changed (62)

{bio2zarr-0.1.6 → bio2zarr-0.1.7}/.github/workflows/ci.yml RENAMED

@@ -6,6 +6,9 @@ on:
   push:
     branches:
       - main
+  schedule:
+    # At 04:44 on Monday, see https://crontab.guru/
+    - cron: "44 4 * * 1"
 jobs:
   pre-commit:
@@ -22,22 +25,16 @@ jobs:
     runs-on: ${{ matrix.os }}
     strategy:
       matrix:
-        # Use macos-13 because pip binary packages for ARM aren't
-        # available for many dependencies
-        os: [macos-13, macos-14, ubuntu-latest]
-        python-version: ["3.10", "3.11", "3.12"]
+        os: [macos-14, ubuntu-latest]
+        python-version: ["3.10", "3.11", "3.12", "3.13"]
         exclude:
           # Just run macos tests on one Python version
-          - os: macos-13
-            python-version: "3.10"
-          - os: macos-13
-            python-version: "3.11"
-          - os: macos-13
-            python-version: "3.12"
           - os: macos-14
             python-version: "3.10"
           - os: macos-14
             python-version: "3.12"
+          - os: macos-14
+            python-version: "3.13"
     steps:
       - uses: actions/checkout@v4
       - name: Set up Python ${{ matrix.python-version }}
@@ -152,36 +149,16 @@ jobs:
           plink2zarr --help
           python -m bio2zarr plink2zarr --help
-  test-numpy-version:
-    name: Test numpy versions
-    runs-on: ubuntu-latest
-    strategy:
-      matrix:
-        numpy: ["==1.26", ">=2"]
-    steps:
-      - uses: actions/checkout@v4
-      - uses: actions/setup-python@v5
-        with:
-          python-version: '3.11'
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          python -m pip install '.[dev]'
-      - name: Install numpy${{ matrix.numpy }}
-        run: |
-          python -m pip install 'numpy${{ matrix.numpy }}'
-      - name: Run tests
-        run: |
-          # We just run the CLI tests here because it doesn't require other upstream
-          # packages like sgkit (which are tangled up with the numpy 2 dependency)
-          python -m pytest tests/test_cli.py
   test-zarr-version:
     name: Test Zarr versions
     runs-on: ubuntu-latest
     strategy:
       matrix:
         zarr: ["==2.18.3", ">=3.0.3"]
+        zarr-format: [2, 3]
+        exclude:
+          - zarr: "==2.18.3"
+            zarr-format: 3
     steps:
       - uses: actions/checkout@v4
       - uses: actions/setup-python@v5
@@ -197,3 +174,5 @@ jobs:
       - name: Run tests
         run: |
           python -m pytest
+        env:
+          BIO2ZARR_ZARR_FORMAT: ${{ matrix.zarr-format }}
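
To make the effect of the `exclude` entries in the two matrices above concrete, the following is a minimal Python sketch that expands a matrix for this simple case: take the Cartesian product of the axes, then drop every combination that matches all keys of an exclude entry. The `expand_matrix` helper is hypothetical (it is not part of bio2zarr or GitHub Actions) and it ignores `include` entries.

```python
from itertools import product


def expand_matrix(axes, exclude=()):
    """Return the job configurations for a simple matrix (hypothetical helper)."""
    keys = list(axes)
    combos = [dict(zip(keys, values)) for values in product(*axes.values())]
    # A combination is dropped when every key/value of some exclude rule matches it.
    return [
        combo
        for combo in combos
        if not any(all(combo.get(k) == v for k, v in rule.items()) for rule in exclude)
    ]


# The test-zarr-version matrix from the 0.1.7 workflow.
zarr_jobs = expand_matrix(
    {"zarr": ["==2.18.3", ">=3.0.3"], "zarr-format": [2, 3]},
    exclude=[{"zarr": "==2.18.3", "zarr-format": 3}],
)

# The main test matrix from the 0.1.7 workflow.
test_jobs = expand_matrix(
    {
        "os": ["macos-14", "ubuntu-latest"],
        "python-version": ["3.10", "3.11", "3.12", "3.13"],
    },
    exclude=[
        {"os": "macos-14", "python-version": "3.10"},
        {"os": "macos-14", "python-version": "3.12"},
        {"os": "macos-14", "python-version": "3.13"},
    ],
)

for job in zarr_jobs + test_jobs:
    print(job)
```

Under these assumptions the Zarr matrix yields three jobs (Zarr 2.18.3 is only tested with format 2), and the main matrix yields five jobs (all four Python versions on ubuntu-latest, plus Python 3.11 on macos-14).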

{bio2zarr-0.1.6 → bio2zarr-0.1.7}/CHANGELOG.md RENAMED

@@ -1,3 +1,31 @@
+# 0.1.7 2026-02-03
+*Bug fixes*
+- Fix issue with 0-dimensional arrays (#437)
+- Fix issue with pandas 3.x (required in plink code; #439)
+*Breaking changes*
+- Require NumPy 2 (#426)
+- Require tskit >= 1.0.
+- The default `isolated_as_missing` behaviour for tskit conversion now follows
+  tskit's default (currently `True`). To get the previous behaviour, create a
+  model mapping using `ts.map_to_vcf_model(isolated_as_missing=False)` and pass
+  it via the `model_mapping` parameter (or use `tskit2zarr convert --isolated-as-ancestral`).
+- The `contig_id` and `isolated_as_missing` parameters to
+  `bio2zarr.tskit.convert` have been removed; set these via
+  `tskit.TreeSequence.map_to_vcf_model` and pass the returned mapping via the
+  `model_mapping` parameter.
+*Maintenance*
+- Add support for Python 3.13
 # 0.1.6 2025-05-23
 - Initial Python API support for VCF and tskit one-shot conversion. Format

{bio2zarr-0.1.6 → bio2zarr-0.1.7}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: bio2zarr
-Version: 0.1.6
+Version: 0.1.7
 Summary: Convert bioinformatics data to Zarr
 Author-email: sgkit Developers <project@sgkit.dev>
 License:                                  Apache License
@@ -219,11 +219,12 @@ Classifier: Programming Language :: Python :: 3
 Classifier: Programming Language :: Python :: 3.10
 Classifier: Programming Language :: Python :: 3.11
 Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
 Classifier: Topic :: Scientific/Engineering
 Requires-Python: >=3.10
 Description-Content-Type: text/markdown
 License-File: LICENSE
-Requires-Dist: numpy>=1.26
+Requires-Dist: numpy>=2
 Requires-Dist: zarr<3,>=2.17
 Requires-Dist: numcodecs[msgpack]!=0.14.0,!=0.14.1,<0.16
 Requires-Dist: tabulate
@@ -240,22 +241,25 @@ Requires-Dist: pysam; extra == "dev"
 Requires-Dist: pytest; extra == "dev"
 Requires-Dist: pytest-coverage; extra == "dev"
 Requires-Dist: pytest-xdist; extra == "dev"
-Requires-Dist: sgkit>=0.8.0; extra == "dev"
 Requires-Dist: tqdm; extra == "dev"
-Requires-Dist: tskit>=0.6.4; extra == "dev"
+Requires-Dist: tskit>=1; extra == "dev"
 Requires-Dist: bed_reader; extra == "dev"
 Requires-Dist: cyvcf2; extra == "dev"
+Requires-Dist: xarray<2025.03.1; extra == "dev"
+Requires-Dist: dask[array]<=2024.8.0,>=2022.01.0; extra == "dev"
 Provides-Extra: tskit
-Requires-Dist: tskit>=0.6.4; extra == "tskit"
+Requires-Dist: tskit>=1; extra == "tskit"
 Provides-Extra: vcf
 Requires-Dist: cyvcf2; extra == "vcf"
 Provides-Extra: all
-Requires-Dist: tskit>=0.6.4; extra == "all"
+Requires-Dist: tskit>=1; extra == "all"
 Requires-Dist: cyvcf2; extra == "all"
 Dynamic: license-file
 [![CI](https://github.com/sgkit-dev/bio2zarr/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/sgkit-dev/bio2zarr/actions/workflows/ci.yml)
 [![Coverage Status](https://coveralls.io/repos/github/sgkit-dev/bio2zarr/badge.svg)](https://coveralls.io/github/sgkit-dev/bio2zarr)
+[![PyPI Downloads](https://static.pepy.tech/badge/bio2zarr)](https://pepy.tech/projects/bio2zarr)
+[![Anaconda-Server Badge](https://anaconda.org/bioconda/bio2zarr/badges/downloads.svg)](https://anaconda.org/bioconda/bio2zarr)
 # bio2zarr

{bio2zarr-0.1.6 → bio2zarr-0.1.7}/README.md RENAMED

@@ -1,5 +1,7 @@
 [![CI](https://github.com/sgkit-dev/bio2zarr/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/sgkit-dev/bio2zarr/actions/workflows/ci.yml)
 [![Coverage Status](https://coveralls.io/repos/github/sgkit-dev/bio2zarr/badge.svg)](https://coveralls.io/github/sgkit-dev/bio2zarr)
+[![PyPI Downloads](https://static.pepy.tech/badge/bio2zarr)](https://pepy.tech/projects/bio2zarr)
+[![Anaconda-Server Badge](https://anaconda.org/bioconda/bio2zarr/badges/downloads.svg)](https://anaconda.org/bioconda/bio2zarr)
 # bio2zarr

{bio2zarr-0.1.6 → bio2zarr-0.1.7}/bio2zarr/_version.py RENAMED

@@ -1,7 +1,14 @@
 # file generated by setuptools-scm
 # don't change, don't track in version control
-__all__ = ["__version__", "__version_tuple__", "version", "version_tuple"]
+__all__ = [
+    "__version__",
+    "__version_tuple__",
+    "version",
+    "version_tuple",
+    "__commit_id__",
+    "commit_id",
+]
 TYPE_CHECKING = False
 if TYPE_CHECKING:
@@ -9,13 +16,19 @@ if TYPE_CHECKING:
     from typing import Union
     VERSION_TUPLE = Tuple[Union[int, str], ...]
+    COMMIT_ID = Union[str, None]
 else:
     VERSION_TUPLE = object
+    COMMIT_ID = object
 version: str
 __version__: str
 __version_tuple__: VERSION_TUPLE
 version_tuple: VERSION_TUPLE
+commit_id: COMMIT_ID
+__commit_id__: COMMIT_ID
-__version__ = version = '0.1.6'
-__version_tuple__ = version_tuple = (0, 1, 6)
+__version__ = version = '0.1.7'
+__version_tuple__ = version_tuple = (0, 1, 7)
+__commit_id__ = commit_id = 'g4359d72e2'

{bio2zarr-0.1.6 → bio2zarr-0.1.7}/bio2zarr/cli.py RENAMED

@@ -652,7 +652,12 @@ def vcfpartition(vcfs, verbose, num_partitions, partition_size):
 @click.argument("zarr_path", type=click.Path())
 @click.option("--contig-id", type=str, help="Contig/chromosome ID (default: '1')")
 @click.option(
-    "--isolated-as-missing", is_flag=True, help="Treat isolated nodes as missing"
+    "--isolated-as-missing/--isolated-as-ancestral",
+    default=None,
+    help=(
+        "Treat isolated samples without mutations as missing or ancestral "
+        "(default: tskit default)"
+    ),
 )
 @variants_chunk_size
 @samples_chunk_size
@@ -660,6 +665,7 @@ def vcfpartition(vcfs, verbose, num_partitions, partition_size):
 @progress
 @worker_processes
 @force
+@core.requires_optional_dependency("tskit", "tskit")
 def convert_tskit(
     ts_path,
     zarr_path,
@@ -675,11 +681,18 @@ def convert_tskit(
     setup_logging(verbose)
     check_overwrite_dir(zarr_path, force)
+    import tskit
+    ts = tskit.load(ts_path)
+    model_mapping = ts.map_to_vcf_model(
+        contig_id=contig_id,
+        isolated_as_missing=isolated_as_missing,
+    )
     tskit_mod.convert(
         ts_path,
         zarr_path,
-        contig_id=contig_id,
-        isolated_as_missing=isolated_as_missing,
+        model_mapping=model_mapping,
         variants_chunk_size=variants_chunk_size,
         samples_chunk_size=samples_chunk_size,
         worker_processes=worker_processes,
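
The new `--isolated-as-missing/--isolated-as-ancestral` option above is a three-state flag: `True`, `False`, or `None` when neither switch is given, in which case the choice is left to tskit. The sketch below reproduces that behaviour with the standard-library `argparse` module instead of click, purely so it runs without third-party packages. The `build_parser` function and the `tskit2zarr-sketch` program name are hypothetical and not part of bio2zarr.

```python
import argparse


def build_parser():
    """Build a parser with a tri-state flag pair (argparse stand-in for click)."""
    parser = argparse.ArgumentParser(prog="tskit2zarr-sketch")
    group = parser.add_mutually_exclusive_group()
    # Both switches write to the same destination; None means "not specified".
    group.add_argument(
        "--isolated-as-missing",
        dest="isolated_as_missing",
        action="store_const",
        const=True,
        default=None,
    )
    group.add_argument(
        "--isolated-as-ancestral",
        dest="isolated_as_missing",
        action="store_const",
        const=False,
        default=None,
    )
    return parser


parser = build_parser()
unset = parser.parse_args([]).isolated_as_missing
missing = parser.parse_args(["--isolated-as-missing"]).isolated_as_missing
ancestral = parser.parse_args(["--isolated-as-ancestral"]).isolated_as_missing
print(unset, missing, ancestral)
```

Keeping `None` distinct from `False` is what lets the CLI forward the value to `ts.map_to_vcf_model(...)` unchanged, so that an unspecified flag falls back to whatever default tskit uses.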

{bio2zarr-0.1.6 → bio2zarr-0.1.7}/bio2zarr/plink.py RENAMED

@@ -6,6 +6,7 @@ import numpy as np
 import pandas as pd
 from bio2zarr import constants, core, vcz
+from bio2zarr.zarr_utils import STRING_DTYPE_NAME
 logger = logging.getLogger(__name__)
@@ -198,7 +199,7 @@ class PlinkFormat(vcz.Source):
         ref_iter = self.bim.allele_2.values[start:stop]
         gt_iter = self.bed_reader.iter_decode(start, stop)
         for alt, ref, gt in zip(alt_iter, ref_iter, gt_iter):
-            alleles = np.full(num_alleles, constants.STR_FILL, dtype="O")
+            alleles = np.full(num_alleles, constants.STR_FILL, dtype=STRING_DTYPE_NAME)
             alleles[0] = ref
             alleles[1 : 1 + len(alt)] = alt
             phased = np.zeros(gt.shape[0], dtype=bool)
@@ -234,8 +235,9 @@ class PlinkFormat(vcz.Source):
         )
         # If we don't have SVLEN or END annotations, the rlen field is defined
         # as the length of the REF
-        max_len = self.bim.allele_2.values.itemsize
+        # Explicitly cast to fixed size array to support pandas 2.x and 3.x
+        allele_2_array = self.bim.allele_2.values.astype("S")
+        max_len = allele_2_array.itemsize
         array_specs = [
             vcz.ZarrArraySpec(
                 source="position",
@@ -246,13 +248,13 @@ class PlinkFormat(vcz.Source):
             ),
             vcz.ZarrArraySpec(
                 name="variant_allele",
-                dtype="O",
+                dtype=STRING_DTYPE_NAME,
                 dimensions=["variants", "alleles"],
                 description=None,
             ),
             vcz.ZarrArraySpec(
                 name="variant_id",
-                dtype="O",
+                dtype=STRING_DTYPE_NAME,
                 dimensions=["variants"],
                 description=None,
             ),

{bio2zarr-0.1.6 → bio2zarr-0.1.7}/bio2zarr/tskit.py RENAMED

@@ -4,6 +4,7 @@ import pathlib
 import numpy as np
 from bio2zarr import constants, core, vcz
+from bio2zarr.zarr_utils import STRING_DTYPE_NAME
 logger = logging.getLogger(__name__)
@@ -15,8 +16,6 @@ class TskitFormat(vcz.Source):
         ts,
         *,
         model_mapping=None,
-        contig_id=None,
-        isolated_as_missing=False,
     ):
         import tskit
@@ -35,14 +34,14 @@ class TskitFormat(vcz.Source):
             f"{self.ts.num_sites} sites"
         )
-        self.contig_id = contig_id if contig_id is not None else "1"
-        self.isolated_as_missing = isolated_as_missing
-        self.positions = self.ts.sites_position
         if model_mapping is None:
             model_mapping = self.ts.map_to_vcf_model()
+        self.contig_id = model_mapping.contig_id
+        self.contig_length = model_mapping.contig_length
+        self.isolated_as_missing = model_mapping.isolated_as_missing
+        self.raw_positions = self.ts.sites_position
+        self.vcf_positions = model_mapping.transformed_positions
         individuals_nodes = model_mapping.individuals_nodes
         sample_ids = model_mapping.individuals_name
@@ -91,14 +90,14 @@ class TskitFormat(vcz.Source):
     @property
     def contigs(self):
-        return [vcz.Contig(id=self.contig_id)]
+        return [vcz.Contig(id=self.contig_id, length=self.contig_length)]
     def iter_contig(self, start, stop):
         yield from (0 for _ in range(start, stop))
     def iter_field(self, field_name, shape, start, stop):
         if field_name == "position":
-            for pos in self.ts.sites_position[start:stop]:
+            for pos in self.vcf_positions[start:stop]:
                 yield int(pos)
         else:
             raise ValueError(f"Unknown field {field_name}")
@@ -110,13 +109,13 @@ class TskitFormat(vcz.Source):
         for variant in self.ts.variants(
             isolated_as_missing=self.isolated_as_missing,
-            left=self.positions[start],
-            right=self.positions[stop] if stop < self.num_records else None,
+            left=self.raw_positions[start],
+            right=self.raw_positions[stop] if stop < self.num_records else None,
             samples=self.tskit_samples,
             copy=False,
         ):
             gt = np.full(shape, constants.INT_FILL, dtype=np.int8)
-            alleles = np.full(num_alleles, constants.STR_FILL, dtype="O")
+            alleles = np.full(num_alleles, constants.STR_FILL, dtype=STRING_DTYPE_NAME)
             # length is the length of the REF allele unless other fields
             # are included.
             variant_length = len(variant.alleles[0])
@@ -176,8 +175,8 @@ class TskitFormat(vcz.Source):
         min_position = 0
         max_position = 0
         if self.ts.num_sites > 0:
-            min_position = np.min(self.ts.sites_position)
-            max_position = np.max(self.ts.sites_position)
+            min_position = np.min(self.vcf_positions)
+            max_position = np.max(self.vcf_positions)
         tables = self.ts.tables
         ancestral_state_offsets = tables.sites.ancestral_state_offset
@@ -200,7 +199,7 @@ class TskitFormat(vcz.Source):
             vcz.ZarrArraySpec(
                 source=None,
                 name="variant_allele",
-                dtype="O",
+                dtype=STRING_DTYPE_NAME,
                 dimensions=["variants", "alleles"],
                 description="Alleles for each variant",
             ),
@@ -252,8 +251,6 @@ def convert(
     vcz_path,
     *,
     model_mapping=None,
-    contig_id=None,
-    isolated_as_missing=False,
     variants_chunk_size=None,
     samples_chunk_size=None,
     worker_processes=core.DEFAULT_WORKER_PROCESSES,
@@ -277,8 +274,6 @@ def convert(
     tskit_format = TskitFormat(
         ts_or_path,
         model_mapping=model_mapping,
-        contig_id=contig_id,
-        isolated_as_missing=isolated_as_missing,
     )
     schema_instance = tskit_format.generate_schema(
         variants_chunk_size=variants_chunk_size,
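
Because `contig_id` and `isolated_as_missing` are no longer accepted by `convert`, callers written against 0.1.6 need to build a model mapping first. The following is a minimal migration sketch, assuming the 0.1.6 default of `isolated_as_missing=False` is the behaviour to preserve. The `legacy_convert_kwargs` helper and the `FakeTreeSequence` stand-in are hypothetical; the stand-in only exists so the sketch runs without tskit installed, and a real `tskit.TreeSequence` would return a mapping object rather than a dict.

```python
def legacy_convert_kwargs(ts, *, contig_id=None, isolated_as_missing=False):
    """Translate 0.1.6-style arguments into 0.1.7-style keyword arguments.

    `ts` is any object with a `map_to_vcf_model` method, such as a
    tskit.TreeSequence. The defaults mirror the 0.1.6 signature.
    """
    model_mapping = ts.map_to_vcf_model(
        contig_id=contig_id,
        isolated_as_missing=isolated_as_missing,
    )
    return {"model_mapping": model_mapping}


class FakeTreeSequence:
    """Stand-in for tskit.TreeSequence that just echoes the arguments."""

    def map_to_vcf_model(self, **kwargs):
        return dict(kwargs)


kwargs = legacy_convert_kwargs(FakeTreeSequence(), contig_id="chr1")
print(kwargs)

# Intended use with the real libraries (not executed here; file names are
# placeholders):
#   import tskit
#   from bio2zarr import tskit as tskit_mod
#   ts = tskit.load("example.trees")
#   tskit_mod.convert(ts, "example.vcz", **legacy_convert_kwargs(ts))
```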

{bio2zarr-0.1.6 → bio2zarr-0.1.7}/bio2zarr/vcf.py RENAMED

@@ -16,6 +16,8 @@ from typing import Any
 import numcodecs
 import numpy as np
+from bio2zarr.zarr_utils import STRING_DTYPE_NAME, zarr_exists
 from . import constants, core, provenance, vcf_utils, vcz
 logger = logging.getLogger(__name__)
@@ -110,7 +112,7 @@ class VcfField:
             ret = "U1"
         else:
             assert self.vcf_type == "String"
-            ret = "O"
+            ret = STRING_DTYPE_NAME
         return ret
@@ -397,7 +399,7 @@ def sanitise_value_string_scalar(shape, value):
 def sanitise_value_string_1d(shape, value):
     if value is None:
-        return np.full(shape, ".", dtype="O")
+        return np.full(shape, ".", dtype=STRING_DTYPE_NAME)
     else:
         value = drop_empty_second_dim(value)
         result = np.full(shape, "", dtype=value.dtype)
@@ -407,9 +409,9 @@ def sanitise_value_string_1d(shape, value):
 def sanitise_value_string_2d(shape, value):
     if value is None:
-        return np.full(shape, ".", dtype="O")
+        return np.full(shape, ".", dtype=STRING_DTYPE_NAME)
     else:
-        result = np.full(shape, "", dtype="O")
+        result = np.full(shape, "", dtype=STRING_DTYPE_NAME)
         if value.ndim == 2:
             result[: value.shape[0], : value.shape[1]] = value
         else:
@@ -569,7 +571,12 @@ class StringValueTransformer(VcfValueTransformer):
             value = np.array(list(vcf_value.split(",")))
         else:
             # TODO can we make this faster??
-            value = np.array([v.split(",") for v in vcf_value], dtype="O")
+            var_len_values = [v.split(",") for v in vcf_value]
+            number = max(len(v) for v in var_len_values)
+            value = np.array(
+                [v + [""] * (number - len(v)) for v in var_len_values],
+                dtype=STRING_DTYPE_NAME,
+            )
             # print("HERE", vcf_value, value)
             # for v in vcf_value:
             #     print("\t", type(v), len(v), v.split(","))
@@ -1044,7 +1051,7 @@ class IntermediateColumnarFormat(vcz.Source):
             ref_field.iter_values(start, stop),
             alt_field.iter_values(start, stop),
         ):
-            alleles = np.full(num_alleles, constants.STR_FILL, dtype="O")
+            alleles = np.full(num_alleles, constants.STR_FILL, dtype=STRING_DTYPE_NAME)
             alleles[0] = ref[0]
             alleles[1 : 1 + len(alt)] = alt
             yield alleles
@@ -1068,14 +1075,16 @@ class IntermediateColumnarFormat(vcz.Source):
             for variant_length, alleles in zip(
                 variant_lengths, self.iter_alleles(start, stop, num_alleles)
             ):
-                yield vcz.VariantData(variant_length, alleles, None, None)
+                # Stored ICF values are always at least 1D arrays; "rlen" is Number=1
+                # so we must extract the scalar to avoid NumPy scalar-conversion issues.
+                yield vcz.VariantData(variant_length[0], alleles, None, None)
         else:
             for variant_length, alleles, (gt, phased) in zip(
                 variant_lengths,
                 self.iter_alleles(start, stop, num_alleles),
                 self.iter_genotypes(shape, start, stop),
             ):
-                yield vcz.VariantData(variant_length, alleles, gt, phased)
+                yield vcz.VariantData(variant_length[0], alleles, gt, phased)
     def generate_schema(
         self, variants_chunk_size=None, samples_chunk_size=None, local_alleles=None
@@ -1087,8 +1096,10 @@ class IntermediateColumnarFormat(vcz.Source):
         # Add ploidy and genotypes dimensions only when needed
         max_genotypes = 0
+        has_g_field = False
         for field in self.metadata.format_fields:
             if field.vcf_number == "G":
+                has_g_field = True
                 max_genotypes = max(max_genotypes, field.summary.max_number)
         ploidy = None
@@ -1100,7 +1111,7 @@ class IntermediateColumnarFormat(vcz.Source):
             genotypes_size = math.comb(max_alleles + ploidy - 1, ploidy)
             # assert max_genotypes == genotypes_size
         else:
-            if max_genotypes > 0:
+            if max_genotypes > 0 or has_g_field:
                 # there is no GT field, but there is at least one Number=G field,
                 # so need to define genotypes dimension
                 genotypes_size = max_genotypes
@@ -1163,7 +1174,7 @@ class IntermediateColumnarFormat(vcz.Source):
             ),
             fixed_field_spec(
                 name="variant_allele",
-                dtype="O",
+                dtype=STRING_DTYPE_NAME,
                 dimensions=["variants", "alleles"],
             ),
             fixed_field_spec(
@@ -1173,7 +1184,7 @@ class IntermediateColumnarFormat(vcz.Source):
             ),
             fixed_field_spec(
                 name="variant_id",
-                dtype="O",
+                dtype=STRING_DTYPE_NAME,
             ),
             fixed_field_spec(
                 name="variant_id_mask",
@@ -1581,8 +1592,7 @@ def inspect(path):
         raise ValueError(f"Path not found: {path}")
     if (path / "metadata.json").exists():
         obj = IntermediateColumnarFormat(path)
-    # NOTE: this is too strict, we should support more general Zarrs, see #276
-    elif (path / ".zmetadata").exists():
+    elif zarr_exists(path):
         obj = vcz.VcfZarr(path)
     else:
         raise ValueError(f"{path} not in ICF or VCF Zarr format")
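
The `StringValueTransformer` hunk above replaces a ragged object array with a rectangular one: each comma-separated value is split, and shorter rows are padded with empty strings up to the longest row. The padding step can be sketched in plain Python as follows. The `pad_split_values` name is hypothetical, the sketch returns nested lists rather than a NumPy array, and like the diff it assumes at least one input value.

```python
def pad_split_values(vcf_value, fill=""):
    """Split each string on commas and pad rows to a common width."""
    rows = [v.split(",") for v in vcf_value]
    # The widest row determines the second dimension of the result.
    width = max(len(row) for row in rows)
    return [row + [fill] * (width - len(row)) for row in rows]


padded = pad_split_values(["a,b,c", "d", "e,f"])
print(padded)
# → [['a', 'b', 'c'], ['d', '', ''], ['e', 'f', '']]
```

A rectangular layout is what allows the values to be stored under a single fixed string dtype instead of relying on per-row Python objects.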
