bio2zarr (PyPI): version diff 0.1.4.tar.gz → 0.1.6.tar.gz

Files changed (66)

{bio2zarr-0.1.4 → bio2zarr-0.1.6}/.github/workflows/cd.yml RENAMED

@@ -1,6 +1,7 @@
 name: CD
 on:
+  merge_group:
   push:
     branches:
       - main
@@ -18,7 +19,7 @@ jobs:
       - uses: actions/checkout@v4
       - uses: actions/setup-python@v5
         with:
-          python-version: '3.9'
+          python-version: '3.10'
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip

{bio2zarr-0.1.4 → bio2zarr-0.1.6}/.github/workflows/ci.yml RENAMED

@@ -1,6 +1,7 @@
 name: CI
 on:
+  merge_group:
   pull_request:
   push:
     branches:
@@ -24,7 +25,7 @@ jobs:
         # Use macos-13 because pip binary packages for ARM aren't
         # available for many dependencies
         os: [macos-13, macos-14, ubuntu-latest]
-        python-version: ["3.9", "3.10", "3.11", "3.12"]
+        python-version: ["3.10", "3.11", "3.12"]
         exclude:
           # Just run macos tests on one Python version
           - os: macos-13
@@ -33,8 +34,6 @@ jobs:
             python-version: "3.11"
           - os: macos-13
             python-version: "3.12"
-          - os: macos-14
-            python-version: "3.9"
           - os: macos-14
             python-version: "3.10"
           - os: macos-14
@@ -70,6 +69,12 @@ jobs:
           python -m bio2zarr vcf2zarr dencode-partition sample.vcz 1
           python -m bio2zarr vcf2zarr dencode-partition sample.vcz 2
           python -m bio2zarr vcf2zarr dencode-finalise sample.vcz
+      - name: Run tskit2zarr example
+        run: |
+          python -m bio2zarr tskit2zarr convert tests/data/tskit/example.trees sample.vcz -f
+      - name: Run plink2zarr example
+        run: |
+          python -m bio2zarr plink2zarr convert tests/data/plink/example sample.vcz -f
       - name: Run tests
         run: |
           pytest --cov=bio2zarr
@@ -82,6 +87,36 @@ jobs:
           # https://github.com/coverallsapp/github-action
           fail-on-error: false
+  optional_dependencies:
+    name: Optional dependencies
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+      - name: Test optional dependencies
+        run: |
+          python -m venv env-tskit
+          source env-tskit/bin/activate
+          python -m pip install .
+          python -m bio2zarr tskit2zarr convert tests/data/tskit/example.trees ts.vcz > ts.txt 2>&1 || echo $? > ts_exit.txt
+          test "$(cat ts_exit.txt)" = "1"
+            grep -q "This process requires the optional tskit module. Install it with: pip install bio2zarr\[tskit\]" ts.txt
+          python -m pip install '.[tskit]'
+          python -m bio2zarr tskit2zarr convert tests/data/tskit/example.trees ts.vcz
+          deactivate
+          python -m venv env-vcf
+          source env-vcf/bin/activate
+          python -m pip install .
+          python -m bio2zarr vcf2zarr convert tests/data/vcf/sample.vcf.gz sample.vcz > vcf.txt 2>&1 || echo $? > vcf_exit.txt
+          test "$(cat vcf_exit.txt)" = "1"
+          grep -q "This process requires the optional cyvcf2 module. Install it with: pip install bio2zarr\[vcf\]" vcf.txt
+          python -m pip install '.[vcf]'
+          python -m bio2zarr vcf2zarr convert tests/data/vcf/sample.vcf.gz sample.vcz
+          deactivate
   packaging:
     name: Packaging
     runs-on: ubuntu-latest
@@ -108,6 +143,14 @@ jobs:
         run: |
           vcfpartition --help
           python -m bio2zarr vcfpartition --help
+      - name: Check tskit2zarr CLI
+        run: |
+          tskit2zarr --help
+          python -m bio2zarr tskit2zarr --help
+      - name: Check plink2zarr CLI
+        run: |
+          plink2zarr --help
+          python -m bio2zarr plink2zarr --help
   test-numpy-version:
     name: Test numpy versions

{bio2zarr-0.1.4 → bio2zarr-0.1.6}/.github/workflows/docs.yml RENAMED

@@ -1,6 +1,7 @@
 name: Docs
 on:
+  merge_group:
   pull_request:
   push:
     branches:
@@ -37,7 +38,7 @@ jobs:
       - name: Install package
         run: |
-          python3 -m pip install .
+          python3 -m pip install '.[all]'
       - name: Build Docs
         run: |
@@ -50,7 +51,7 @@ jobs:
   deploy:
     needs: build-docs
-    if: github.event_name != 'pull_request'
+    if: github.event_name != 'pull_request' && github.event_name != 'merge_group'
     permissions:
         pages: write
         id-token: write

{bio2zarr-0.1.4 → bio2zarr-0.1.6}/CHANGELOG.md RENAMED

@@ -1,3 +1,36 @@
+# 0.1.6 2025-05-23
+- Initial Python API support for VCF and tskit one-shot conversion. Format
+conversion is done using the functions ``bio2zarr.vcf.convert``
+and ``bio2zarr.tskit.convert``.
+- Initial version of supported plink2zarr (#390, #344, #382)
+- Initial version of tskit2zarr (#232)
+- Make format-specific dependencies optional (#385)
+- Remove bed_reader dependency (#397, #400)
+- Change default number of worker processes to zero (#404) to simplify
+  debugging
+*Breaking changes*
+- Remove explicit sample, contig and filter lists from the schema.
+  Existing ICFs will need to be recreated. (#343)
+- Add dimensions and default compressor and filter settings to the schema.
+  (#361)
+- Various changes to existing experimental plink encoding (#390)
+# 0.1.5 2025-03-31
+- Add support for merging contig IDs across multiple VCFs (#335)
+- Add support for unindexed (and uncompressed) VCFs (#337)
 # 0.1.4 2025-03-10
 - Fix bug in handling all-missing genotypes (#328)

{bio2zarr-0.1.4 → bio2zarr-0.1.6}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
-Metadata-Version: 2.2
+Metadata-Version: 2.4
 Name: bio2zarr
-Version: 0.1.4
+Version: 0.1.6
 Summary: Convert bioinformatics data to Zarr
 Author-email: sgkit Developers <project@sgkit.dev>
 License:                                  Apache License
@@ -216,23 +216,24 @@ Classifier: Operating System :: MacOS :: MacOS X
 Classifier: Intended Audience :: Science/Research
 Classifier: Programming Language :: Python
 Classifier: Programming Language :: Python :: 3
-Classifier: Programming Language :: Python :: 3.9
 Classifier: Programming Language :: Python :: 3.10
 Classifier: Programming Language :: Python :: 3.11
 Classifier: Programming Language :: Python :: 3.12
 Classifier: Topic :: Scientific/Engineering
-Requires-Python: >=3.9
+Requires-Python: >=3.10
 Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: numpy>=1.26
 Requires-Dist: zarr<3,>=2.17
-Requires-Dist: click
+Requires-Dist: numcodecs[msgpack]!=0.14.0,!=0.14.1,<0.16
 Requires-Dist: tabulate
 Requires-Dist: tqdm
 Requires-Dist: humanfriendly
-Requires-Dist: cyvcf2
-Requires-Dist: bed_reader
+Requires-Dist: coloredlogs
+Requires-Dist: click
+Requires-Dist: pandas
 Provides-Extra: dev
+Requires-Dist: click>=8.2.0; extra == "dev"
 Requires-Dist: hypothesis-vcf; extra == "dev"
 Requires-Dist: msprime; extra == "dev"
 Requires-Dist: pysam; extra == "dev"
@@ -241,6 +242,17 @@ Requires-Dist: pytest-coverage; extra == "dev"
 Requires-Dist: pytest-xdist; extra == "dev"
 Requires-Dist: sgkit>=0.8.0; extra == "dev"
 Requires-Dist: tqdm; extra == "dev"
+Requires-Dist: tskit>=0.6.4; extra == "dev"
+Requires-Dist: bed_reader; extra == "dev"
+Requires-Dist: cyvcf2; extra == "dev"
+Provides-Extra: tskit
+Requires-Dist: tskit>=0.6.4; extra == "tskit"
+Provides-Extra: vcf
+Requires-Dist: cyvcf2; extra == "vcf"
+Provides-Extra: all
+Requires-Dist: tskit>=0.6.4; extra == "all"
+Requires-Dist: cyvcf2; extra == "all"
+Dynamic: license-file
 [![CI](https://github.com/sgkit-dev/bio2zarr/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/sgkit-dev/bio2zarr/actions/workflows/ci.yml)
 [![Coverage Status](https://coveralls.io/repos/github/sgkit-dev/bio2zarr/badge.svg)](https://coveralls.io/github/sgkit-dev/bio2zarr)
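
Note on the PKG-INFO hunk above: cyvcf2 and bed_reader are no longer core dependencies, and the format-specific packages now sit behind the `tskit`, `vcf` and `all` extras. The sketch below groups a few of the `Requires-Dist` lines from the 0.1.6 metadata by extra. `group_by_extra` is a hypothetical helper written for this note, and it only understands the simple `extra == "name"` markers that appear in this file, not the full environment-marker grammar.

```python
from collections import defaultdict

# A subset of the Requires-Dist values from the 0.1.6 PKG-INFO hunk.
REQUIRES_DIST = [
    "numpy>=1.26",
    "zarr<3,>=2.17",
    'tskit>=0.6.4; extra == "tskit"',
    'cyvcf2; extra == "vcf"',
    'tskit>=0.6.4; extra == "all"',
    'cyvcf2; extra == "all"',
]


def group_by_extra(requirements):
    """Map each extra name (None for core dependencies) to its requirements.

    Hypothetical helper: handles only markers of the form extra == "name".
    """
    grouped = defaultdict(list)
    for line in requirements:
        spec, _, marker = line.partition(";")
        marker = marker.strip()
        extra = None
        if marker.startswith("extra =="):
            extra = marker.split("==", 1)[1].strip().strip('"')
        grouped[extra].append(spec.strip())
    return dict(grouped)


grouped = group_by_extra(REQUIRES_DIST)
for extra, specs in grouped.items():
    print(extra, specs)
```

Under this reading, a plain `pip install bio2zarr` pulls in only the core list, which matches the optional-dependency checks added to ci.yml.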

{bio2zarr-0.1.4 → bio2zarr-0.1.6}/bio2zarr/__main__.py RENAMED

@@ -15,7 +15,8 @@ def bio2zarr():
 # is handy for development and for those whose PATHs aren't set
 # up in the right way.
 bio2zarr.add_command(cli.vcf2zarr_main)
-bio2zarr.add_command(cli.plink2zarr)
+bio2zarr.add_command(cli.plink2zarr_main)
+bio2zarr.add_command(cli.tskit2zarr_main)
 bio2zarr.add_command(cli.vcfpartition)
 if __name__ == "__main__":

{bio2zarr-0.1.4 → bio2zarr-0.1.6}/bio2zarr/_version.py RENAMED

@@ -17,5 +17,5 @@ __version__: str
 __version_tuple__: VERSION_TUPLE
 version_tuple: VERSION_TUPLE
-__version__ = version = '0.1.4'
-__version_tuple__ = version_tuple = (0, 1, 4)
+__version__ = version = '0.1.6'
+__version_tuple__ = version_tuple = (0, 1, 6)

{bio2zarr-0.1.4 → bio2zarr-0.1.6}/bio2zarr/cli.py RENAMED

@@ -8,8 +8,9 @@ import coloredlogs
 import numcodecs
 import tabulate
-from . import plink, provenance, vcf2zarr, vcf_utils
-from .vcf2zarr import icf as icf_mod
+from . import core, plink, provenance, vcf_utils
+from . import tskit as tskit_mod
+from . import vcf as vcf_mod
 logger = logging.getLogger(__name__)
@@ -88,7 +89,12 @@ json = click.option(
 version = click.version_option(version=f"{provenance.__version__}")
 worker_processes = click.option(
-    "-p", "--worker-processes", type=int, default=1, help="Number of worker processes"
+    "-p",
+    "--worker-processes",
+    type=int,
+    default=core.DEFAULT_WORKER_PROCESSES,
+    help="Number of worker processes",
+    show_default=True,
 )
 column_chunk_size = click.option(
@@ -197,7 +203,7 @@ def check_partitions(num_partitions):
 def get_compressor(cname):
     if cname is None:
         return None
-    config = icf_mod.ICF_DEFAULT_COMPRESSOR.get_config()
+    config = vcf_mod.ICF_DEFAULT_COMPRESSOR.get_config()
     config["cname"] = cname
     return numcodecs.get_codec(config)
@@ -236,7 +242,7 @@ def explode(
     """
     setup_logging(verbose)
     check_overwrite_dir(icf_path, force)
-    vcf2zarr.explode(
+    vcf_mod.explode(
         icf_path,
         vcfs,
         worker_processes=worker_processes,
@@ -276,7 +282,7 @@ def dexplode_init(
     setup_logging(verbose)
     check_overwrite_dir(icf_path, force)
     check_partitions(num_partitions)
-    work_summary = vcf2zarr.explode_init(
+    work_summary = vcf_mod.explode_init(
         icf_path,
         vcfs,
         target_num_partitions=num_partitions,
@@ -304,7 +310,7 @@ def dexplode_partition(icf_path, partition, verbose, one_based):
     setup_logging(verbose)
     if one_based:
         partition -= 1
-    vcf2zarr.explode_partition(icf_path, partition)
+    vcf_mod.explode_partition(icf_path, partition)
 @click.command
@@ -315,7 +321,7 @@ def dexplode_finalise(icf_path, verbose):
     Final step for distributed conversion of VCF(s) to intermediate columnar format.
     """
     setup_logging(verbose)
-    vcf2zarr.explode_finalise(icf_path)
+    vcf_mod.explode_finalise(icf_path)
 @click.command
@@ -326,7 +332,7 @@ def inspect(path, verbose):
     Inspect an intermediate columnar format or Zarr path.
     """
     setup_logging(verbose)
-    data = vcf2zarr.inspect(path)
+    data = vcf_mod.inspect(path)
     click.echo(tabulate.tabulate(data, headers="keys"))
@@ -345,7 +351,7 @@ def mkschema(icf_path, variants_chunk_size, samples_chunk_size, local_alleles):
             err=True,
         )
     stream = click.get_text_stream("stdout")
-    vcf2zarr.mkschema(
+    vcf_mod.mkschema(
         icf_path,
         stream,
         variants_chunk_size=variants_chunk_size,
@@ -380,11 +386,11 @@ def encode(
     worker_processes,
 ):
     """
-    Convert intermediate columnar format to vcfzarr.
+    Convert intermediate columnar format to VCF Zarr.
     """
     setup_logging(verbose)
     check_overwrite_dir(zarr_path, force)
-    vcf2zarr.encode(
+    vcf_mod.encode(
         icf_path,
         zarr_path,
         schema_path=schema,
@@ -438,7 +444,7 @@ def dencode_init(
     setup_logging(verbose)
     check_overwrite_dir(zarr_path, force)
     check_partitions(num_partitions)
-    work_summary = vcf2zarr.encode_init(
+    work_summary = vcf_mod.encode_init(
         icf_path,
         zarr_path,
         target_num_partitions=num_partitions,
@@ -466,7 +472,7 @@ def dencode_partition(zarr_path, partition, verbose, one_based):
     setup_logging(verbose)
     if one_based:
         partition -= 1
-    vcf2zarr.encode_partition(zarr_path, partition)
+    vcf_mod.encode_partition(zarr_path, partition)
 @click.command
@@ -478,7 +484,7 @@ def dencode_finalise(zarr_path, verbose, progress):
     Final step for distributed conversion of ICF to VCF Zarr.
     """
     setup_logging(verbose)
-    vcf2zarr.encode_finalise(zarr_path, show_progress=progress)
+    vcf_mod.encode_finalise(zarr_path, show_progress=progress)
 @click.command(name="convert")
@@ -503,11 +509,11 @@ def convert_vcf(
     local_alleles,
 ):
     """
-    Convert input VCF(s) directly to vcfzarr (not recommended for large files).
+    Convert input VCF(s) directly to VCF Zarr (not recommended for large files).
     """
     setup_logging(verbose)
     check_overwrite_dir(zarr_path, force)
-    vcf2zarr.convert(
+    vcf_mod.convert(
         vcfs,
         zarr_path,
         variants_chunk_size=variants_chunk_size,
@@ -522,9 +528,10 @@ def convert_vcf(
 @click.group(cls=NaturalOrderGroup, name="vcf2zarr")
 def vcf2zarr_main():
     """
-    Convert VCF file(s) to the vcfzarr format.
+    Convert VCF file(s) to VCF Zarr format.
     See the online documentation at https://sgkit-dev.github.io/bio2zarr/
     for more information.
     """
@@ -545,6 +552,7 @@ vcf2zarr_main.add_command(dencode_finalise)
 @click.command(name="convert")
 @click.argument("in_path", type=click.Path())
 @click.argument("zarr_path", type=click.Path())
+@force
 @worker_processes
 @progress
 @verbose
@@ -553,6 +561,7 @@ vcf2zarr_main.add_command(dencode_finalise)
 def convert_plink(
     in_path,
     zarr_path,
+    force,
     verbose,
     worker_processes,
     progress,
@@ -560,9 +569,12 @@ def convert_plink(
     samples_chunk_size,
 ):
     """
-    In development; DO NOT USE!
+    Convert plink fileset to VCF Zarr. Results are equivalent to
+    `plink1.9 --bfile prefix --keep-allele-order --recode vcf-iid --out tmp`
+    then running `vcf2zarr convert tmp.vcf zarr_path`
     """
     setup_logging(verbose)
+    check_overwrite_dir(zarr_path, force)
     plink.convert(
         in_path,
         zarr_path,
@@ -574,12 +586,15 @@ def convert_plink(
 @version
-@click.group()
-def plink2zarr():
+@click.group(name="plink2zarr")
+def plink2zarr_main():
+    """
+    Convert plink fileset(s) to VCF Zarr format
+    """
     pass
-plink2zarr.add_command(convert_plink)
+plink2zarr_main.add_command(convert_plink)
 @click.command
@@ -624,9 +639,61 @@ def vcfpartition(vcfs, verbose, num_partitions, partition_size):
         num_parts_per_path = max(1, num_partitions // len(vcfs))
     for vcf_path in vcfs:
-        indexed_vcf = vcf_utils.IndexedVcf(vcf_path)
-        regions = indexed_vcf.partition_into_regions(
+        vcf_file = vcf_utils.VcfFile(vcf_path)
+        regions = vcf_file.partition_into_regions(
             num_parts=num_parts_per_path, target_part_size=partition_size
         )
         for region in regions:
             click.echo(f"{region}\t{vcf_path}")
+@click.command(name="convert")
+@click.argument("ts_path", type=click.Path(exists=True))
+@click.argument("zarr_path", type=click.Path())
+@click.option("--contig-id", type=str, help="Contig/chromosome ID (default: '1')")
+@click.option(
+    "--isolated-as-missing", is_flag=True, help="Treat isolated nodes as missing"
+)
+@variants_chunk_size
+@samples_chunk_size
+@verbose
+@progress
+@worker_processes
+@force
+def convert_tskit(
+    ts_path,
+    zarr_path,
+    contig_id,
+    isolated_as_missing,
+    variants_chunk_size,
+    samples_chunk_size,
+    verbose,
+    progress,
+    worker_processes,
+    force,
+):
+    setup_logging(verbose)
+    check_overwrite_dir(zarr_path, force)
+    tskit_mod.convert(
+        ts_path,
+        zarr_path,
+        contig_id=contig_id,
+        isolated_as_missing=isolated_as_missing,
+        variants_chunk_size=variants_chunk_size,
+        samples_chunk_size=samples_chunk_size,
+        worker_processes=worker_processes,
+        show_progress=progress,
+    )
+@version
+@click.group(name="tskit2zarr")
+def tskit2zarr_main():
+    """
+    Convert tskit tree sequence(s) to VCF Zarr format
+    """
+    pass
+tskit2zarr_main.add_command(convert_tskit)

{bio2zarr-0.1.4 → bio2zarr-0.1.6}/bio2zarr/core.py RENAMED

@@ -1,16 +1,16 @@
 import concurrent.futures as cf
 import contextlib
 import dataclasses
+import functools
+import importlib
 import json
 import logging
 import math
 import multiprocessing
 import os
 import os.path
-import sys
 import threading
 import time
-import warnings
 import humanfriendly
 import numcodecs
@@ -23,6 +23,26 @@ logger = logging.getLogger(__name__)
 numcodecs.blosc.use_threads = False
+def requires_optional_dependency(module_name, extras_name):
+    """Decorator to check for optional dependencies"""
+    def decorator(func):
+        @functools.wraps(func)
+        def wrapper(*args, **kwargs):
+            try:
+                importlib.import_module(module_name)
+            except ImportError:
+                raise ImportError(
+                    f"This process requires the optional {module_name} module. "
+                    f"Install it with: pip install bio2zarr[{extras_name}]"
+                ) from None
+            return func(*args, **kwargs)
+        return wrapper
+    return decorator
 def display_number(x):
     ret = "n/a"
     if math.isfinite(x):
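
Note on the hunk above: `requires_optional_dependency` defers the import check to call time, so importing bio2zarr does not fail when an optional package is missing; only calling a decorated function does. The sketch below copies the decorator body from the hunk and applies it to two hypothetical functions. `uses_stdlib`, `uses_missing`, `try_call`, the `demo` extras name and the module name `hypothetical_missing_module_xyz` are assumptions made for illustration and do not appear in the package.

```python
import functools
import importlib


def requires_optional_dependency(module_name, extras_name):
    """Decorator to check for optional dependencies (body as in the hunk)."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                importlib.import_module(module_name)
            except ImportError:
                raise ImportError(
                    f"This process requires the optional {module_name} module. "
                    f"Install it with: pip install bio2zarr[{extras_name}]"
                ) from None
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical functions: one guarded by a module that always exists,
# one guarded by a module name assumed not to be installed.
@requires_optional_dependency("json", "demo")
def uses_stdlib():
    return "ok"


@requires_optional_dependency("hypothetical_missing_module_xyz", "demo")
def uses_missing():
    return "unreachable"


def try_call(func):
    """Return the function's result, or the ImportError message if it fails."""
    try:
        return func()
    except ImportError as exc:
        return str(exc)


print(try_call(uses_stdlib))
print(try_call(uses_missing))
```

The message format is the same one that the new `optional_dependencies` CI job greps for in ci.yml.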
@@ -34,6 +54,16 @@ def display_size(n):
     return humanfriendly.format_size(n, binary=True)
+def parse_max_memory(max_memory):
+    if max_memory is None:
+        # Effectively unbounded
+        return 2**63
+    if isinstance(max_memory, str):
+        max_memory = humanfriendly.parse_size(max_memory)
+    logger.info(f"Set memory budget to {display_size(max_memory)}")
+    return max_memory
 def min_int_dtype(min_value, max_value):
     if min_value > max_value:
         raise ValueError("min_value must be <= max_value")
@@ -100,12 +130,20 @@ def du(path):
     return total
+# We set the default number of worker processes to 0 because it avoids
+# complexity in the call chain and makes things easier to debug by
+# default. However, it does use the SynchronousExecutor here, which
+# is technically not recommended by the Python docs.
+DEFAULT_WORKER_PROCESSES = 0
 class SynchronousExecutor(cf.Executor):
-    # Arguably we should use workers=0 as the default and use this
+    # Since https://github.com/sgkit-dev/bio2zarr/issues/404 we
+    # set worker_processses=0 as the default and use this
     # executor implementation. However, the docs are fairly explicit
     # about saying we shouldn't instantiate Future objects directly,
-    # so it's best to keep this as a semi-secret debugging interface
-    # for now.
+    # so we may need to revisit this is obscure problems start to
+    # arise.
     def submit(self, fn, /, *args, **kwargs):
         future = cf.Future()
         future.set_result(fn(*args, **kwargs))
@@ -246,22 +284,6 @@ def setup_progress_counter(counter):
     _progress_counter = counter
-def warn_py39_mac():
-    if sys.platform == "darwin" and sys.version_info[:2] == (3, 9):
-        warnings.warn(
-            "There is a known issue with bio2zarr on MacOS Python 3.9 "
-            "in which OS-level named semaphores are leaked. "
-            "You will also probably see warnings like 'There appear to be N "
-            "leaked semaphore objects at shutdown'. "
-            "While this is likely harmless for a few runs, it could lead to "
-            "issues if you do a lot of conversion. To get prevent this issue "
-            "either: (1) use --worker-processes=0 or (2) upgrade to a newer "
-            "Python version. See https://github.com/sgkit-dev/bio2zarr/issues/209 "
-            "for more details.",
-            stacklevel=2,
-        )
 class ParallelWorkManager(contextlib.AbstractContextManager):
     def __init__(self, worker_processes=1, progress_config=None):
         # Need to specify this explicitly to suppport Macs and
@@ -274,7 +296,6 @@ class ParallelWorkManager(contextlib.AbstractContextManager):
             # production. See note on the SynchronousExecutor class.
             self.executor = SynchronousExecutor()
         else:
-            warn_py39_mac()
             self.executor = cf.ProcessPoolExecutor(
                 max_workers=worker_processes,
                 mp_context=ctx,
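
Note on the executor changes above: with `DEFAULT_WORKER_PROCESSES = 0`, work runs in the calling process through `SynchronousExecutor` unless the caller asks for worker processes. The sketch below reproduces that class from the hunk and pairs it with a hypothetical `make_executor` helper. The `worker_processes <= 0` test is an assumption, since the diff shows only the `else` branch of the real selection logic, and the `mp_context` argument used in the package is left out here.

```python
import concurrent.futures as cf

DEFAULT_WORKER_PROCESSES = 0


class SynchronousExecutor(cf.Executor):
    """Runs each submitted call immediately in the calling process."""

    def submit(self, fn, /, *args, **kwargs):
        future = cf.Future()
        # The call completes before submit returns, so the future is
        # already resolved when the caller receives it.
        future.set_result(fn(*args, **kwargs))
        return future


def make_executor(worker_processes=DEFAULT_WORKER_PROCESSES):
    """Hypothetical helper: choose an executor from the worker count.

    Assumption: zero or fewer workers selects the synchronous path.
    """
    if worker_processes <= 0:
        return SynchronousExecutor()
    return cf.ProcessPoolExecutor(max_workers=worker_processes)


def square(x):
    return x * x


with make_executor() as executor:
    futures = [executor.submit(square, n) for n in range(4)]
    results = [future.result() for future in futures]

print(results)
```

One consequence of this design is that an exception raised by the submitted function propagates directly out of `submit` rather than being stored on the future, which gives a shorter traceback when debugging.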
