pywombat 1.0.2__tar.gz → 1.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -5,6 +5,109 @@ All notable changes to PyWombat will be documented in this file.
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+ ## [1.2.0] - 2026-02-05
+
+ ### Added
+
+ - **Per-Chromosome DNM Processing**: Dramatically reduced memory usage for de novo mutation (DNM) filtering
+   - Processes one chromosome at a time instead of loading all variants into memory
+   - Reduces peak memory from (total_variants × samples) to (max_chr_variants × samples)
+   - Example: 38 samples, 4.2M variants
+     - Before: 200GB+ (OOM failure)
+     - After: ~24GB (completes successfully in 20 seconds)
+   - **88% memory reduction** for DNM workflows
+
+ - **Early Frequency Filtering for DNM**: Applies population frequency filters BEFORE melting
+   - Frequency filters (fafmax_faf95_max_genomes) applied on wide-format data
+   - Quality filters (genomes_filters PASS) applied before melting
+   - Reduces data expansion by filtering variants early in the pipeline
+
+ - **New Helper Functions**:
+   - `get_unique_chromosomes()`: Discovers and naturally sorts chromosomes from Parquet files
+   - `apply_dnm_prefilters()`: Applies variant-level filters before melting
+   - `process_dnm_by_chromosome()`: Orchestrates per-chromosome DNM filtering
+
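For orientation, a minimal sketch of how the helpers listed above could fit together. The function names match this changelog; the signatures, the `CHROM` column name, and the specific Polars calls are assumptions rather than the package's actual code:

```python
import re
import polars as pl

def get_unique_chromosomes(parquet_path: str) -> list[str]:
    # Lazily scan the Parquet file so only the CHROM column is read,
    # then sort naturally so chr2 precedes chr10.
    chroms = (
        pl.scan_parquet(parquet_path)
        .select(pl.col("CHROM").unique())
        .collect()["CHROM"]
        .to_list()
    )
    def natural_key(c: str):
        m = re.fullmatch(r"chr(\d+)", c)
        return (0, int(m.group(1))) if m else (1, c)
    return sorted(chroms, key=natural_key)

def process_dnm_by_chromosome(parquet_path: str, sample_cols: list[str]) -> pl.DataFrame:
    # Peak memory is bounded by (max_chr_variants × samples) because each
    # chromosome is loaded, prefiltered, melted, and filtered on its own.
    parts = []
    for chrom in get_unique_chromosomes(parquet_path):
        wide = (
            pl.scan_parquet(parquet_path)
            .filter(pl.col("CHROM") == chrom)
            .collect()
        )
        # apply_dnm_prefilters() would drop common / non-PASS variants here
        melted = wide.unpivot(
            index=[c for c in wide.columns if c not in sample_cols],
            on=sample_cols,
            variable_name="SAMPLE",
            value_name="GT_DATA",
        )
        # ... sample-level DNM filters on the melted chunk ...
        parts.append(melted)
    return pl.concat(parts)
```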
+ ### Changed
+
+ - **DNM Filter Architecture**: Refactored `apply_de_novo_filter()` to support a `skip_prefilters` parameter
+   - Allows separation of variant-level filters (applied before melting) from sample-level filters
+   - Prevents double-filtering when prefilters were already applied
+
+ - **Filter Command Routing**: Automatically detects DNM mode and routes to per-chromosome processing
+   - Transparent to users: no command syntax changes required
+   - Optimized memory usage is automatic when using a DNM config with Parquet input
+
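A sketch of the variant-level/sample-level split that `skip_prefilters` enables. The parameter name comes from this changelog; the signature, config keys, and null handling are illustrative assumptions:

```python
import polars as pl

def apply_de_novo_filter(df: pl.DataFrame, config: dict,
                         skip_prefilters: bool = False) -> pl.DataFrame:
    if not skip_prefilters:
        # Variant-level prefilters on wide-format data; skipped when the
        # per-chromosome driver already applied them (no double-filtering).
        df = df.filter(
            (
                (pl.col("fafmax_faf95_max_genomes") < config["max_faf"])
                | pl.col("fafmax_faf95_max_genomes").is_null()  # absent from gnomAD
            )
            & (pl.col("genomes_filters") == "PASS")
        )
    # ... sample-level DNM logic follows (child het, parents hom-ref, depth/GQ) ...
    return df
```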
+ ### Performance
+
+ - **DNM Memory Usage**: 88% reduction in peak memory (200GB+ → ~24GB)
+ - **DNM Processing Time**: 20 seconds for a 38-sample, 4.2M-variant dataset (previously failed with OOM)
+ - **Yield**: Successfully identifies 6,788 DNM variants from 4.2M input variants
+
+ ### Testing
+
+ - Added 3 new test cases for the DNM optimization:
+   - `test_get_unique_chromosomes()`: Verifies chromosome discovery and natural sorting
+   - `test_apply_dnm_prefilters()`: Validates frequency prefiltering logic
+   - `test_dnm_skip_prefilters()`: Ensures the `skip_prefilters` parameter works correctly
+ - Total test suite: 25 tests (all passing)
+
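For flavor, the first of those tests might look something like this; the import path and the exact expected ordering are assumptions:

```python
import polars as pl
from pywombat.filters import get_unique_chromosomes  # module path assumed

def test_get_unique_chromosomes(tmp_path):
    # Fixture with chromosomes deliberately out of lexicographic order:
    # natural sorting must place chr2 before chr10, and contigs like chrX last.
    path = tmp_path / "variants.parquet"
    pl.DataFrame({"CHROM": ["chr10", "chr2", "chrX", "chr1", "chr2"]}).write_parquet(path)
    assert get_unique_chromosomes(str(path)) == ["chr1", "chr2", "chr10", "chrX"]
```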
+ ## [1.1.0] - 2026-02-05
+
+ ### Added
+
+ - **Memory-Optimized Two-Step Workflow**: New `wombat prepare` command for preprocessing large files
+   - Converts TSV/TSV.gz to Parquet format with pre-expanded INFO fields
+   - Processes files in chunks (50k rows by default) to handle files of any size
+   - Applies memory-efficient data types (Categorical, UInt32, etc.)
+   - Reduces file size by ~30% compared to gzipped TSV
+
+ - **Parquet Input Support**: `wombat filter` now accepts both TSV and Parquet input
+   - Auto-detects input format (TSV, TSV.gz, or Parquet)
+   - Pre-filtering optimization: applies expression filters BEFORE melting samples
+   - Reduces memory usage by 95%+ for large files (e.g., 200GB → 1.2GB for a 38-sample, 4.2M-variant dataset)
+   - Processing time improved from minutes (or OOM) to <1 second for filtered datasets
+
+ - **Subcommand Architecture**: Converted the CLI to use Click groups
+   - `wombat prepare`: Preprocess TSV to Parquet
+   - `wombat filter`: Process and filter data (replaces the old direct command)
+   - **Breaking Change**: The old syntax `wombat input.tsv` no longer works; use `wombat filter input.tsv` instead
+
+ - **Test Suite**: Added comprehensive pytest test suite
+   - 22 tests covering CLI structure, the prepare command, and the filter command
+   - Test fixtures for creating synthetic test data
+   - Integration tests with real data validation
+   - Added pytest and pytest-cov to dev dependencies
+
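Chunked conversion in this spirit could be sketched as follows. The batched reader and `pyarrow` writer are one plausible approach, not necessarily what `wombat prepare` does internally, and gzipped input may need decompressing before batched reading:

```python
import polars as pl
import pyarrow.parquet as pq

def prepare(tsv_path: str, out_path: str, chunk_size: int = 50_000) -> None:
    # Stream fixed-size batches so peak memory tracks chunk_size, not file size.
    reader = pl.read_csv_batched(tsv_path, separator="\t", batch_size=chunk_size)
    writer = None
    while batches := reader.next_batches(1):
        df = batches[0]
        # ... expand INFO fields and downcast dtypes (Categorical, UInt32) here ...
        table = df.to_arrow()
        if writer is None:
            writer = pq.ParquetWriter(out_path, table.schema)
        writer.write_table(table)
    if writer is not None:
        writer.close()
```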
+ ### Changed
+
+ - **CLI Architecture**: Restructured from single command to group-based subcommands
+ - **Filter Command**: Now applies expression filters before melting when using Parquet input
+ - **Sample Column Detection**: Improved heuristics to work with both TSV and Parquet formats
+ - **Documentation**: Updated README with two-step workflow examples and memory comparison table
+
+ ### Fixed
+
+ - **INFO Field Extraction**: Fixed column index detection in the `prepare` command (previously used a hardcoded index)
+ - **Type Casting**: Added an explicit `.cast(pl.Utf8)` to preserve string types when all values are NULL
+ - **Parquet Processing**: Fixed `format_bcftools_tsv_minimal` to work without the `(null)` column
+
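The `.cast(pl.Utf8)` fix matters because Polars infers an all-NULL column as the `Null` dtype; a small illustration:

```python
import polars as pl

df = pl.DataFrame({"VEP_LoF": [None, None, None]})
print(df.schema)  # {'VEP_LoF': Null}; string operations would fail on this dtype

df = df.with_columns(pl.col("VEP_LoF").cast(pl.Utf8))
print(df.schema)  # {'VEP_LoF': String}; now safe for downstream string filters
```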
+ ### Performance
+
+ - **Memory Usage**: 95%+ reduction for large files with expression filters
+   - Example: 38 samples, 4.2M variants
+     - Before: 200GB+ (OOM failure)
+     - After: ~1.2GB peak memory
+ - **Processing Speed**: <1 second for filtered datasets (vs minutes or failure before)
+ - **Pre-filtering**: Applying expression filters before melting reduces data expansion
+
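To make the pre-filtering point concrete, a sketch of filter-then-melt with Polars; the file name and sample-column naming are assumptions:

```python
import polars as pl

lf = pl.scan_parquet("prepared.parquet")  # hypothetical prepared file
cols = lf.collect_schema().names()
sample_cols = [c for c in cols if c.startswith("Sample")]  # assumed naming

# Filtering while still wide: 4.2M rows shrink before any expansion happens.
wide = lf.filter(pl.col("VEP_IMPACT") == "HIGH").collect()

# Melting afterwards multiplies rows by the sample count; had we melted first,
# 4.2M variants × 38 samples ≈ 160M rows would exist before any filter ran.
long = wide.unpivot(
    index=[c for c in wide.columns if c not in sample_cols],
    on=sample_cols,
    variable_name="SAMPLE",
    value_name="GT_DATA",
)
```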
+ ### Documentation
+
+ - Added memory optimization workflow section to README
+ - Added performance comparison table showing memory/time improvements
+ - Updated all examples to use the new `wombat filter` syntax
+ - Added section explaining when to use the `prepare` command
+ - Documented two-step workflow benefits and use cases
+
  ## [1.0.1] - 2026-01-24
 
  ### Added
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: pywombat
- Version: 1.0.2
+ Version: 1.2.0
  Summary: A CLI tool for processing and filtering bcftools tabulated TSV files with pedigree support
  Project-URL: Homepage, https://github.com/bourgeron-lab/pywombat
  Project-URL: Repository, https://github.com/bourgeron-lab/pywombat
@@ -18,6 +18,9 @@ Requires-Dist: click>=8.1.0
  Requires-Dist: polars>=0.19.0
  Requires-Dist: pyyaml>=6.0
  Requires-Dist: tqdm>=4.67.1
+ Provides-Extra: dev
+ Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
+ Requires-Dist: pytest>=7.0.0; extra == 'dev'
  Description-Content-Type: text/markdown
 
  # PyWombat 🦘
@@ -29,14 +32,15 @@ A high-performance CLI tool for processing and filtering bcftools tabulated TSV
 
  ## Features
 
- ✨ **Fast Processing**: Uses Polars for efficient data handling
- 🔬 **Quality Filtering**: Configurable depth, quality, and VAF thresholds
- 👨‍👩‍👧 **Pedigree Support**: Trio and family analysis with parent genotypes
- 🧬 **De Novo Detection**: Sex-chromosome-aware DNM identification
- 📊 **Flexible Output**: TSV, compressed TSV, or Parquet formats
- 🎯 **Expression Filters**: Complex filtering with logical expressions
- 🏷️ **Boolean Flag Support**: INFO field flags (PASS, DB, etc.) extracted as True/False columns
- ⚡ **Streaming Mode**: Memory-efficient processing of large files
+ ✨ **Fast Processing**: Uses Polars for efficient data handling
+ 🔬 **Quality Filtering**: Configurable depth, quality, and VAF thresholds
+ 👨‍👩‍👧 **Pedigree Support**: Trio and family analysis with parent genotypes
+ 🧬 **De Novo Detection**: Sex-chromosome-aware DNM identification
+ 📊 **Flexible Output**: TSV, compressed TSV, or Parquet formats
+ 🎯 **Expression Filters**: Complex filtering with logical expressions
+ 🏷️ **Boolean Flag Support**: INFO field flags (PASS, DB, etc.) extracted as True/False columns
+ ⚡ **Memory Optimized**: Two-step workflow for large files (prepare → filter)
+ 💾 **Parquet Support**: Pre-process large files for repeated, memory-efficient analysis
 
  ---
 
@@ -47,17 +51,37 @@ A high-performance CLI tool for processing and filtering bcftools tabulated TSV
  Use `uvx` to run PyWombat without installation:
 
  ```bash
- # Basic formatting
- uvx pywombat input.tsv -o output
+ # Basic filtering
+ uvx pywombat filter input.tsv -o output
 
- # With filtering
- uvx pywombat input.tsv -F examples/rare_variants_high_impact.yml -o output
+ # With filter configuration
+ uvx pywombat filter input.tsv -F examples/rare_variants_high_impact.yml -o output
 
  # De novo mutation detection
- uvx pywombat input.tsv --pedigree pedigree.tsv \
+ uvx pywombat filter input.tsv --pedigree pedigree.tsv \
    -F examples/de_novo_mutations.yml -o denovo
  ```
 
+ ### For Large Files (>1GB or >50 samples)
+
+ Use the two-step workflow for memory-efficient processing:
+
+ ```bash
+ # Step 1: Prepare (one-time preprocessing)
+ uvx pywombat prepare input.tsv.gz -o prepared.parquet
+
+ # Step 2: Filter (fast, memory-efficient, can be run multiple times)
+ uvx pywombat filter prepared.parquet \
+   -p pedigree.tsv \
+   -F config.yml \
+   -o filtered
+ ```
+
+ **Benefits:**
+ - Pre-expands INFO fields once (saves time on repeated filtering)
+ - Applies filters before melting samples (reduces memory by 95%+)
+ - Parquet format enables fast columnar access
+
  ### Installation for Development/Repeated Use
 
  ```bash
@@ -69,7 +93,7 @@ cd pywombat
  uv sync
 
  # Run with uv run
- uv run wombat input.tsv -o output
+ uv run wombat filter input.tsv -o output
  ```
 
  ---
@@ -114,25 +138,62 @@ chr1 100 A T 2 0.5 30 true Sample2 1/1 18 99
 
  ---
 
- ## Basic Usage
+ ## Commands
+
+ PyWombat has two main commands:
+
+ ### `wombat prepare` - Preprocess Large Files
+
+ Converts TSV/TSV.gz to optimized Parquet format with pre-expanded INFO fields:
+
+ ```bash
+ # Basic usage
+ wombat prepare input.tsv.gz -o prepared.parquet
+
+ # With verbose output
+ wombat prepare input.tsv.gz -o prepared.parquet -v
+
+ # Adjust chunk size for memory constraints
+ wombat prepare input.tsv.gz -o prepared.parquet --chunk-size 25000
+ ```
+
+ **What it does:**
+ - Extracts all INFO fields (VEP_*, AF, etc.) as separate columns
+ - Keeps samples in wide format (not melted yet)
+ - Writes memory-efficient Parquet format
+ - Processes in chunks to handle files of any size
+
+ **When to use:**
+ - Files >1GB or >50 samples
+ - Large families (>10 members)
+ - Running multiple filter configurations
+ - Repeated analysis of the same dataset
+
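For intuition, expanding INFO fields amounts to turning `KEY=VALUE` pairs into columns; a toy version in Polars (the real command's parsing and dtype handling surely differ):

```python
import polars as pl

df = pl.DataFrame({"INFO": ["AF=0.01;DB;VEP_IMPACT=HIGH", "AF=0.2;VEP_IMPACT=LOW"]})

def expand_info(df: pl.DataFrame, keys: list[str]) -> pl.DataFrame:
    # One regex capture per key; rows missing a key get a null in that column.
    return df.with_columns(
        pl.col("INFO").str.extract(rf"(?:^|;){key}=([^;]*)").alias(key)
        for key in keys
    )

# Boolean flags such as DB would instead use .str.contains(r"(?:^|;)DB(?:;|$)")
print(expand_info(df, ["AF", "VEP_IMPACT"]))
```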
+ ### `wombat filter` - Process and Filter Data
 
- ### Format Without Filtering
+ Transforms and filters variant data (works with both TSV and Parquet input):
 
  ```bash
- # Output to file
- uvx pywombat input.tsv -o output
+ # Basic filtering (TSV input)
+ wombat filter input.tsv -o output
 
- # Output to stdout (useful for piping)
- uvx pywombat input.tsv
+ # From prepared Parquet (faster, more memory-efficient)
+ wombat filter prepared.parquet -o output
+
+ # With filter configuration
+ wombat filter input.tsv -F config.yml -o output
+
+ # With pedigree
+ wombat filter input.tsv -p pedigree.tsv -o output
 
  # Compressed output
- uvx pywombat input.tsv -o output -f tsv.gz
+ wombat filter input.tsv -o output -f tsv.gz
 
- # Parquet format (fastest for large files)
- uvx pywombat input.tsv -o output -f parquet
+ # Parquet output
+ wombat filter input.tsv -o output -f parquet
 
  # With verbose output
- uvx pywombat input.tsv -o output --verbose
+ wombat filter input.tsv -o output -v
  ```
 
  ### With Pedigree (Trio/Family Analysis)
@@ -140,7 +201,7 @@ uvx pywombat input.tsv -o output --verbose
  Add parent genotype information for inheritance analysis:
 
  ```bash
- uvx pywombat input.tsv --pedigree pedigree.tsv -o output
+ wombat filter input.tsv --pedigree pedigree.tsv -o output
  ```
 
  **Pedigree File Format** (tab-separated):
@@ -178,7 +239,7 @@ PyWombat supports two types of filtering:
  Filter for ultra-rare, high-impact variants:
 
  ```bash
- uvx pywombat input.tsv \
+ wombat filter input.tsv \
    -F examples/rare_variants_high_impact.yml \
    -o rare_variants
  ```
@@ -210,7 +271,7 @@ expression: "VEP_CANONICAL = YES & VEP_IMPACT = HIGH & VEP_LoF = HC & VEP_LoF_fl
  Identify de novo mutations in trio data:
 
  ```bash
- uvx pywombat input.tsv \
+ wombat filter input.tsv \
    --pedigree pedigree.tsv \
    -F examples/de_novo_mutations.yml \
    -o denovo
@@ -290,7 +351,7 @@ expression: "VEP_IMPACT = HIGH & VEP_CANONICAL = YES & gnomad_AF < 0.01 & CADD_P
  Inspect specific variants for troubleshooting:
 
  ```bash
- uvx pywombat input.tsv \
+ wombat filter input.tsv \
    -F config.yml \
    --debug chr11:70486013
  ```
@@ -309,20 +370,20 @@ Shows:
  ### TSV (Default)
 
  ```bash
- uvx pywombat input.tsv -o output           # Creates output.tsv
- uvx pywombat input.tsv -o output -f tsv    # Same as above
+ wombat filter input.tsv -o output          # Creates output.tsv
+ wombat filter input.tsv -o output -f tsv   # Same as above
  ```
 
  ### Compressed TSV
 
  ```bash
- uvx pywombat input.tsv -o output -f tsv.gz   # Creates output.tsv.gz
+ wombat filter input.tsv -o output -f tsv.gz  # Creates output.tsv.gz
  ```
 
  ### Parquet (Fastest for Large Files)
 
  ```bash
- uvx pywombat input.tsv -o output -f parquet   # Creates output.parquet
+ wombat filter input.tsv -o output -f parquet  # Creates output.parquet
  ```
 
  **When to use Parquet:**
@@ -340,7 +401,7 @@ uvx pywombat input.tsv -o output -f parquet # Creates output.parquet
 
  ```bash
  # Step 1: Filter for rare, high-impact variants
- uvx pywombat cohort.tsv \
+ wombat filter cohort.tsv \
    -F examples/rare_variants_high_impact.yml \
    -o rare_variants
 
@@ -352,24 +413,34 @@ uvx pywombat cohort.tsv \
 
  ```bash
  # Identify de novo mutations in autism cohort
- uvx pywombat autism_trios.tsv \
+ wombat filter autism_trios.tsv \
    --pedigree autism_pedigree.tsv \
    -F examples/de_novo_mutations.yml \
    -o autism_denovo \
-   --verbose
+   -v
 
  # Review output for genes in autism risk lists
  ```
 
- ### 3. Multi-Family Rare Variant Analysis
+ ### 3. Large Multi-Family Analysis (Memory-Optimized)
 
  ```bash
- # Process multiple families together
- uvx pywombat families.tsv \
+ # Step 1: Prepare once (preprocesses INFO fields)
+ wombat prepare large_cohort.tsv.gz -o prepared.parquet -v
+
+ # Step 2: Filter with different configurations (fast, memory-efficient)
+ wombat filter prepared.parquet \
    --pedigree families_pedigree.tsv \
    -F examples/rare_variants_high_impact.yml \
    -o families_rare_variants \
-   -f parquet   # Parquet for fast downstream analysis
+   -v
+
+ # Step 3: Run additional filters without re-preparing
+ wombat filter prepared.parquet \
+   --pedigree families_pedigree.tsv \
+   -F examples/de_novo_mutations.yml \
+   -o families_denovo \
+   -v
  ```
 
  ### 4. Custom Expression Filter
@@ -389,7 +460,7 @@ expression: "VEP_IMPACT = HIGH & (gnomad_AF < 0.0001 | gnomad_AF = null)"
  Apply:
 
  ```bash
- uvx pywombat input.tsv -F custom_filter.yml -o output
+ wombat filter input.tsv -F custom_filter.yml -o output
  ```
 
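One plausible reading of that expression as a Polars predicate; PyWombat's actual expression parser may translate it differently:

```python
import polars as pl

# VEP_IMPACT = HIGH & (gnomad_AF < 0.0001 | gnomad_AF = null)
predicate = (pl.col("VEP_IMPACT") == "HIGH") & (
    (pl.col("gnomad_AF") < 0.0001) | pl.col("gnomad_AF").is_null()
)
filtered = pl.scan_parquet("prepared.parquet").filter(predicate).collect()
```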
  ---
@@ -464,7 +535,7 @@ bcftools query -HH \
    annotated.split.bcf > annotated.tsv
 
  # 4. Process with PyWombat
- uvx pywombat annotated.tsv -F examples/rare_variants_high_impact.yml -o output
+ wombat filter annotated.tsv -F examples/rare_variants_high_impact.yml -o output
  ```
 
  **Why split-vep is required:**
@@ -481,7 +552,7 @@ For production workflows, these commands can be piped together:
  # Efficient pipeline (single pass through data)
  bcftools +split-vep -c - -p VEP_ input.vcf.gz | \
  bcftools query -HH -f '%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t%INFO[\t%GT:%DP:%GQ:%AD]\n' | \
- uvx pywombat - -F config.yml -o output
+ wombat filter - -F config.yml -o output
  ```
 
  **Note**: For multiple filter configurations, it's more efficient to save the intermediate TSV file rather than regenerating it each time.
@@ -517,11 +588,49 @@ Each configuration file is fully documented with:
 
  ## Performance Tips
 
- 1. **Use streaming mode** (default): Efficient for most workflows
- 2. **Parquet output**: Faster for large files and repeated analysis
+ ### For Large Files (>1GB or >50 samples)
+
+ 1. **Use the two-step workflow**: `wombat prepare` → `wombat filter`
+    - Reduces memory usage by 95%+ (4.2M variants → ~100 after early filtering)
+    - Pre-expands INFO fields once; reuse for multiple filter configurations
+    - Example: a 38-sample family with 4.2M variants processes in <1 second with ~1.2GB RAM
+
+ 2. **Parquet format benefits**:
+    - Columnar storage enables selective column loading
+    - Pre-filtering before melting (expression filters applied before expanding to per-sample rows)
+    - **Per-chromosome processing for DNM**: Automatically processes DNM filtering chromosome-by-chromosome
+    - 30% smaller file size vs gzipped TSV
+
+ 3. **De Novo Mutation (DNM) filtering optimization**:
+    - Automatically uses per-chromosome processing when DNM mode is enabled
+    - Processes one chromosome at a time to reduce peak memory
+    - Applies frequency filters before melting to reduce data expansion
+    - Example: a 38-sample family with 4.2M variants completes in 20 seconds with ~24GB RAM (vs 200GB+ OOM failure)
+
+ ### For All Files
+
- 3. **Pre-filter with bcftools**: Filter by region/gene before PyWombat
- 4. **Compressed input**: PyWombat handles `.gz` files natively
- 5. **Filter early**: Apply quality filters before complex expression filters
+ 1. **Pre-filter with bcftools**: Filter by region/gene before PyWombat
+ 2. **Compressed input**: PyWombat handles `.gz` files natively
+ 3. **Use verbose mode** (`-v`): Monitor progress and filtering statistics
+
+ ### Memory Comparison
+
+ **Expression Filtering** (e.g., VEP_IMPACT filters):
+
+ | Approach | 38 samples, 4.2M variants | Memory | Time |
+ |----------|---------------------------|--------|------|
+ | Direct TSV | ❌ OOM (>200GB) | 200+ GB | Failed |
+ | TSV with chunking | ⚠️ Slow | ~30GB | ~3 min |
+ | **Parquet + pre-filter** | ✅ **Optimal** | **~1.2GB** | **<1 sec** |
+
+ **De Novo Mutation (DNM) Filtering**:
+
+ | Approach | 38 samples, 4.2M variants | Memory | Time | Result |
+ |----------|---------------------------|--------|------|--------|
+ | Without optimization | ❌ OOM (>200GB) | 200+ GB | Failed | N/A |
+ | **Parquet + per-chromosome** | ✅ **Success** | **~24GB** | **20 sec** | **6,788 DNM variants** |
+
+ *DNM filtering needs sample-level genotypes, so it cannot be fully pre-filtered before melting; per-chromosome processing still cuts peak memory by 88%.*
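A back-of-envelope check on those numbers; the largest chromosome's share is an assumption, and real memory also depends on row width:

```python
samples = 38
total_variants = 4_200_000
max_chr_variants = 400_000  # assumed share of the largest chromosome

rows_melted_all = total_variants * samples     # ~160M rows if melted all at once
rows_melted_peak = max_chr_variants * samples  # ~15M rows at peak, per chromosome
print(1 - rows_melted_peak / rows_melted_all)  # ≈ 0.90, same order as the observed 88%
```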
 
  ---
 
@@ -588,11 +697,15 @@ pywombat/
 
  **Issue**: Memory errors on large files
 
- - **Solution**: Files are processed in streaming mode by default; if issues persist, pre-filter with bcftools
+ - **Solution**: Use the two-step workflow (`wombat prepare`, then `wombat filter`) for a 95%+ memory reduction
+
+ **Issue**: Command not found after upgrading
+
+ - **Solution**: PyWombat now uses subcommands; use `wombat filter` instead of bare `wombat`
 
  ### Getting Help
 
- 1. Check `--help` for command options: `uvx pywombat --help`
+ 1. Check `--help` for command options: `wombat --help` or `wombat filter --help`
  2. Review example configurations in [`examples/`](examples/)
  3. Use `--debug` mode to inspect specific variants
  4. Use `--verbose` to see filtering steps