pywombat 1.0.2__tar.gz → 1.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -5,6 +5,109 @@ All notable changes to PyWombat will be documented in this file.
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+ ## [1.2.0] - 2026-02-05
+
+ ### Added
+
+ - **Per-Chromosome DNM Processing**: Dramatically reduced memory usage for de novo mutation (DNM) filtering
+   - Processes one chromosome at a time instead of loading all variants into memory
+   - Reduces peak memory from (total_variants × samples) to (max_chr_variants × samples)
+   - Example: 38 samples, 4.2M variants
+     - Before: 200GB+ (OOM failure)
+     - After: ~24GB (completes successfully in 20 seconds)
+   - **88% memory reduction** for DNM workflows
+
+ - **Early Frequency Filtering for DNM**: Applies population frequency filters BEFORE melting
+   - Frequency filters (fafmax_faf95_max_genomes) applied on wide-format data
+   - Quality filters (genomes_filters PASS) applied before melting
+   - Reduces data expansion by filtering variants early in the pipeline
+
+ - **New Helper Functions**:
+   - `get_unique_chromosomes()`: Discovers and naturally sorts chromosomes from Parquet files
+   - `apply_dnm_prefilters()`: Applies variant-level filters before melting
+   - `process_dnm_by_chromosome()`: Orchestrates per-chromosome DNM filtering
+
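For orientation, a minimal sketch of how the helpers listed above could fit together. The function names match this changelog; the signatures, the `CHROM` column name, and the specific Polars calls are assumptions rather than the package's actual code:

```python
import re
import polars as pl

def get_unique_chromosomes(parquet_path: str) -> list[str]:
    # Lazily scan the Parquet file so only the CHROM column is read,
    # then sort naturally so chr2 precedes chr10.
    chroms = (
        pl.scan_parquet(parquet_path)
        .select(pl.col("CHROM").unique())
        .collect()["CHROM"]
        .to_list()
    )
    def natural_key(c: str):
        m = re.fullmatch(r"chr(\d+)", c)
        return (0, int(m.group(1))) if m else (1, c)
    return sorted(chroms, key=natural_key)

def process_dnm_by_chromosome(parquet_path: str, sample_cols: list[str]) -> pl.DataFrame:
    # Peak memory is bounded by (max_chr_variants × samples) because each
    # chromosome is loaded, prefiltered, melted, and filtered on its own.
    parts = []
    for chrom in get_unique_chromosomes(parquet_path):
        wide = (
            pl.scan_parquet(parquet_path)
            .filter(pl.col("CHROM") == chrom)
            .collect()
        )
        # apply_dnm_prefilters() would drop common / non-PASS variants here
        melted = wide.unpivot(
            index=[c for c in wide.columns if c not in sample_cols],
            on=sample_cols,
            variable_name="SAMPLE",
            value_name="GT_DATA",
        )
        # ... sample-level DNM filters on the melted chunk ...
        parts.append(melted)
    return pl.concat(parts)
```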
+ ### Changed
+
+ - **DNM Filter Architecture**: Refactored `apply_de_novo_filter()` to support a `skip_prefilters` parameter
+   - Allows separation of variant-level filters (applied before melting) from sample-level filters
+   - Prevents double-filtering when prefilters were already applied
+
+ - **Filter Command Routing**: Automatically detects DNM mode and routes to per-chromosome processing
+   - Transparent to users: no command syntax changes required
+   - Optimized memory usage is automatic when using a DNM config with Parquet input
+
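A sketch of the variant-level/sample-level split that `skip_prefilters` enables. The parameter name comes from this changelog; the signature, config keys, and null handling are illustrative assumptions:

```python
import polars as pl

def apply_de_novo_filter(df: pl.DataFrame, config: dict,
                         skip_prefilters: bool = False) -> pl.DataFrame:
    if not skip_prefilters:
        # Variant-level prefilters on wide-format data; skipped when the
        # per-chromosome driver already applied them (no double-filtering).
        df = df.filter(
            (
                (pl.col("fafmax_faf95_max_genomes") < config["max_faf"])
                | pl.col("fafmax_faf95_max_genomes").is_null()  # absent from gnomAD
            )
            & (pl.col("genomes_filters") == "PASS")
        )
    # ... sample-level DNM logic follows (child het, parents hom-ref, depth/GQ) ...
    return df
```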
+ ### Performance
+
+ - **DNM Memory Usage**: 88% reduction in peak memory (200GB+ → ~24GB)
+ - **DNM Processing Time**: 20 seconds for a 38-sample, 4.2M-variant dataset (previously failed with OOM)
+ - **Yield**: Successfully identifies 6,788 DNM variants from 4.2M input variants
+
+ ### Testing
+
+ - Added 3 new test cases for the DNM optimization:
+   - `test_get_unique_chromosomes()`: Verifies chromosome discovery and natural sorting
+   - `test_apply_dnm_prefilters()`: Validates frequency prefiltering logic
+   - `test_dnm_skip_prefilters()`: Ensures the `skip_prefilters` parameter works correctly
+ - Total test suite: 25 tests (all passing)
+
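For flavor, the first of those tests might look something like this; the import path and the exact expected ordering are assumptions:

```python
import polars as pl
from pywombat.filters import get_unique_chromosomes  # module path assumed

def test_get_unique_chromosomes(tmp_path):
    # Fixture with chromosomes deliberately out of lexicographic order:
    # natural sorting must place chr2 before chr10, and contigs like chrX last.
    path = tmp_path / "variants.parquet"
    pl.DataFrame({"CHROM": ["chr10", "chr2", "chrX", "chr1", "chr2"]}).write_parquet(path)
    assert get_unique_chromosomes(str(path)) == ["chr1", "chr2", "chr10", "chrX"]
```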
+ ## [1.1.0] - 2026-02-05
+
+ ### Added
+
+ - **Memory-Optimized Two-Step Workflow**: New `wombat prepare` command for preprocessing large files
+   - Converts TSV/TSV.gz to Parquet format with pre-expanded INFO fields
+   - Processes files in chunks (50k rows by default) to handle files of any size
+   - Applies memory-efficient data types (Categorical, UInt32, etc.)
+   - Reduces file size by ~30% compared to gzipped TSV
+
+ - **Parquet Input Support**: `wombat filter` now accepts both TSV and Parquet input
+   - Auto-detects input format (TSV, TSV.gz, or Parquet)
+   - Pre-filtering optimization: applies expression filters BEFORE melting samples
+   - Reduces memory usage by 95%+ for large files (e.g., 200GB → 1.2GB for a 38-sample, 4.2M-variant dataset)
+   - Processing time improved from minutes (or OOM) to <1 second for filtered datasets
+
+ - **Subcommand Architecture**: Converted the CLI to use Click groups
+   - `wombat prepare`: Preprocess TSV to Parquet
+   - `wombat filter`: Process and filter data (replaces the old direct command)
+   - **Breaking Change**: The old syntax `wombat input.tsv` no longer works; use `wombat filter input.tsv` instead
+
+ - **Test Suite**: Added comprehensive pytest test suite
+   - 22 tests covering CLI structure, the prepare command, and the filter command
+   - Test fixtures for creating synthetic test data
+   - Integration tests with real data validation
+   - Added pytest and pytest-cov to dev dependencies
+
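Chunked conversion in this spirit could be sketched as follows. The batched reader and `pyarrow` writer are one plausible approach, not necessarily what `wombat prepare` does internally, and gzipped input may need decompressing before batched reading:

```python
import polars as pl
import pyarrow.parquet as pq

def prepare(tsv_path: str, out_path: str, chunk_size: int = 50_000) -> None:
    # Stream fixed-size batches so peak memory tracks chunk_size, not file size.
    reader = pl.read_csv_batched(tsv_path, separator="\t", batch_size=chunk_size)
    writer = None
    while batches := reader.next_batches(1):
        df = batches[0]
        # ... expand INFO fields and downcast dtypes (Categorical, UInt32) here ...
        table = df.to_arrow()
        if writer is None:
            writer = pq.ParquetWriter(out_path, table.schema)
        writer.write_table(table)
    if writer is not None:
        writer.close()
```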
+ ### Changed
+
+ - **CLI Architecture**: Restructured from single command to group-based subcommands
+ - **Filter Command**: Now applies expression filters before melting when using Parquet input
+ - **Sample Column Detection**: Improved heuristics to work with both TSV and Parquet formats
+ - **Documentation**: Updated README with two-step workflow examples and memory comparison table
+
+ ### Fixed
+
+ - **INFO Field Extraction**: Fixed column index detection in the `prepare` command (previously used a hardcoded index)
+ - **Type Casting**: Added an explicit `.cast(pl.Utf8)` to preserve string types when all values are NULL
+ - **Parquet Processing**: Fixed `format_bcftools_tsv_minimal` to work without the `(null)` column
+
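The `.cast(pl.Utf8)` fix matters because Polars infers an all-NULL column as the `Null` dtype; a small illustration:

```python
import polars as pl

df = pl.DataFrame({"VEP_LoF": [None, None, None]})
print(df.schema)  # {'VEP_LoF': Null}; string operations would fail on this dtype

df = df.with_columns(pl.col("VEP_LoF").cast(pl.Utf8))
print(df.schema)  # {'VEP_LoF': String}; now safe for downstream string filters
```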
+ ### Performance
+
+ - **Memory Usage**: 95%+ reduction for large files with expression filters
+   - Example: 38 samples, 4.2M variants
+     - Before: 200GB+ (OOM failure)
+     - After: ~1.2GB peak memory
+ - **Processing Speed**: <1 second for filtered datasets (vs minutes or failure before)
+ - **Pre-filtering**: Applying expression filters before melting reduces data expansion
+
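To make the pre-filtering point concrete, a sketch of filter-then-melt with Polars; the file name and sample-column naming are assumptions:

```python
import polars as pl

lf = pl.scan_parquet("prepared.parquet")  # hypothetical prepared file
cols = lf.collect_schema().names()
sample_cols = [c for c in cols if c.startswith("Sample")]  # assumed naming

# Filtering while still wide: 4.2M rows shrink before any expansion happens.
wide = lf.filter(pl.col("VEP_IMPACT") == "HIGH").collect()

# Melting afterwards multiplies rows by the sample count; had we melted first,
# 4.2M variants × 38 samples ≈ 160M rows would exist before any filter ran.
long = wide.unpivot(
    index=[c for c in wide.columns if c not in sample_cols],
    on=sample_cols,
    variable_name="SAMPLE",
    value_name="GT_DATA",
)
```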
+ ### Documentation
+
+ - Added memory optimization workflow section to README
+ - Added performance comparison table showing memory/time improvements
+ - Updated all examples to use the new `wombat filter` syntax
+ - Added section explaining when to use the `prepare` command
+ - Documented two-step workflow benefits and use cases
+
  ## [1.0.1] - 2026-01-24
 
  ### Added
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: pywombat
- Version: 1.0.2
+ Version: 1.2.0
  Summary: A CLI tool for processing and filtering bcftools tabulated TSV files with pedigree support
  Project-URL: Homepage, https://github.com/bourgeron-lab/pywombat
  Project-URL: Repository, https://github.com/bourgeron-lab/pywombat
@@ -18,6 +18,9 @@ Requires-Dist: click>=8.1.0
  Requires-Dist: polars>=0.19.0
  Requires-Dist: pyyaml>=6.0
  Requires-Dist: tqdm>=4.67.1
+ Provides-Extra: dev
+ Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
+ Requires-Dist: pytest>=7.0.0; extra == 'dev'
  Description-Content-Type: text/markdown
 
  # PyWombat 🦘
@@ -29,14 +32,15 @@ A high-performance CLI tool for processing and filtering bcftools tabulated TSV
 
  ## Features
 
- ✨ **Fast Processing**: Uses Polars for efficient data handling
- 🔬 **Quality Filtering**: Configurable depth, quality, and VAF thresholds
- 👨‍👩‍👧 **Pedigree Support**: Trio and family analysis with parent genotypes
- 🧬 **De Novo Detection**: Sex-chromosome-aware DNM identification
- 📊 **Flexible Output**: TSV, compressed TSV, or Parquet formats
- 🎯 **Expression Filters**: Complex filtering with logical expressions
- 🏷️ **Boolean Flag Support**: INFO field flags (PASS, DB, etc.) extracted as True/False columns
- ⚡ **Streaming Mode**: Memory-efficient processing of large files
+ ✨ **Fast Processing**: Uses Polars for efficient data handling
+ 🔬 **Quality Filtering**: Configurable depth, quality, and VAF thresholds
+ 👨‍👩‍👧 **Pedigree Support**: Trio and family analysis with parent genotypes
+ 🧬 **De Novo Detection**: Sex-chromosome-aware DNM identification
+ 📊 **Flexible Output**: TSV, compressed TSV, or Parquet formats
+ 🎯 **Expression Filters**: Complex filtering with logical expressions
+ 🏷️ **Boolean Flag Support**: INFO field flags (PASS, DB, etc.) extracted as True/False columns
+ ⚡ **Memory Optimized**: Two-step workflow for large files (prepare → filter)
+ 💾 **Parquet Support**: Pre-process large files for repeated, memory-efficient analysis
 
  ---
 
@@ -47,17 +51,37 @@ A high-performance CLI tool for processing and filtering bcftools tabulated TSV
  Use `uvx` to run PyWombat without installation:
 
  ```bash
- # Basic formatting
- uvx pywombat input.tsv -o output
+ # Basic filtering
+ uvx pywombat filter input.tsv -o output
 
- # With filtering
- uvx pywombat input.tsv -F examples/rare_variants_high_impact.yml -o output
+ # With filter configuration
+ uvx pywombat filter input.tsv -F examples/rare_variants_high_impact.yml -o output
 
  # De novo mutation detection
- uvx pywombat input.tsv --pedigree pedigree.tsv \
+ uvx pywombat filter input.tsv --pedigree pedigree.tsv \
    -F examples/de_novo_mutations.yml -o denovo
  ```
 
+ ### For Large Files (>1GB or >50 samples)
+
+ Use the two-step workflow for memory-efficient processing:
+
+ ```bash
+ # Step 1: Prepare (one-time preprocessing)
+ uvx pywombat prepare input.tsv.gz -o prepared.parquet
+
+ # Step 2: Filter (fast, memory-efficient, can be run multiple times)
+ uvx pywombat filter prepared.parquet \
+   -p pedigree.tsv \
+   -F config.yml \
+   -o filtered
+ ```
+
+ **Benefits:**
+ - Pre-expands INFO fields once (saves time on repeated filtering)
+ - Applies filters before melting samples (reduces memory by 95%+)
+ - Parquet format enables fast columnar access
+
  ### Installation for Development/Repeated Use
 
  ```bash
@@ -69,7 +93,7 @@ cd pywombat
  uv sync
 
  # Run with uv run
- uv run wombat input.tsv -o output
+ uv run wombat filter input.tsv -o output
  ```
 
  ---
@@ -114,25 +138,62 @@ chr1 100 A T 2 0.5 30 true Sample2 1/1 18 99
 
  ---
 
- ## Basic Usage
+ ## Commands
+
+ PyWombat has two main commands:
+
+ ### `wombat prepare` - Preprocess Large Files
+
+ Converts TSV/TSV.gz to optimized Parquet format with pre-expanded INFO fields:
+
+ ```bash
+ # Basic usage
+ wombat prepare input.tsv.gz -o prepared.parquet
+
+ # With verbose output
+ wombat prepare input.tsv.gz -o prepared.parquet -v
+
+ # Adjust chunk size for memory constraints
+ wombat prepare input.tsv.gz -o prepared.parquet --chunk-size 25000
+ ```
+
+ **What it does:**
+ - Extracts all INFO fields (VEP_*, AF, etc.) as separate columns
+ - Keeps samples in wide format (not melted yet)
+ - Writes memory-efficient Parquet format
+ - Processes in chunks to handle files of any size
+
+ **When to use:**
+ - Files >1GB or >50 samples
+ - Large families (>10 members)
+ - Running multiple filter configurations
+ - Repeated analysis of the same dataset
+
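For intuition, expanding INFO fields amounts to turning `KEY=VALUE` pairs into columns; a toy version in Polars (the real command's parsing and dtype handling surely differ):

```python
import polars as pl

df = pl.DataFrame({"INFO": ["AF=0.01;DB;VEP_IMPACT=HIGH", "AF=0.2;VEP_IMPACT=LOW"]})

def expand_info(df: pl.DataFrame, keys: list[str]) -> pl.DataFrame:
    # One regex capture per key; rows missing a key get a null in that column.
    return df.with_columns(
        pl.col("INFO").str.extract(rf"(?:^|;){key}=([^;]*)").alias(key)
        for key in keys
    )

# Boolean flags such as DB would instead use .str.contains(r"(?:^|;)DB(?:;|$)")
print(expand_info(df, ["AF", "VEP_IMPACT"]))
```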
+ ### `wombat filter` - Process and Filter Data
 
- ### Format Without Filtering
+ Transforms and filters variant data (works with both TSV and Parquet input):
 
  ```bash
- # Output to file
- uvx pywombat input.tsv -o output
+ # Basic filtering (TSV input)
+ wombat filter input.tsv -o output
 
- # Output to stdout (useful for piping)
- uvx pywombat input.tsv
+ # From prepared Parquet (faster, more memory-efficient)
+ wombat filter prepared.parquet -o output
+
+ # With filter configuration
+ wombat filter input.tsv -F config.yml -o output
+
+ # With pedigree
+ wombat filter input.tsv -p pedigree.tsv -o output
 
  # Compressed output
- uvx pywombat input.tsv -o output -f tsv.gz
+ wombat filter input.tsv -o output -f tsv.gz
 
- # Parquet format (fastest for large files)
- uvx pywombat input.tsv -o output -f parquet
+ # Parquet output
+ wombat filter input.tsv -o output -f parquet
 
  # With verbose output
- uvx pywombat input.tsv -o output --verbose
+ wombat filter input.tsv -o output -v
  ```
 
  ### With Pedigree (Trio/Family Analysis)
@@ -140,7 +201,7 @@ uvx pywombat input.tsv -o output --verbose
  Add parent genotype information for inheritance analysis:
 
  ```bash
- uvx pywombat input.tsv --pedigree pedigree.tsv -o output
+ wombat filter input.tsv --pedigree pedigree.tsv -o output
  ```
 
  **Pedigree File Format** (tab-separated):
@@ -178,7 +239,7 @@ PyWombat supports two types of filtering:
  Filter for ultra-rare, high-impact variants:
 
  ```bash
- uvx pywombat input.tsv \
+ wombat filter input.tsv \
    -F examples/rare_variants_high_impact.yml \
    -o rare_variants
  ```
@@ -210,7 +271,7 @@ expression: "VEP_CANONICAL = YES & VEP_IMPACT = HIGH & VEP_LoF = HC & VEP_LoF_fl
  Identify de novo mutations in trio data:
 
  ```bash
- uvx pywombat input.tsv \
+ wombat filter input.tsv \
    --pedigree pedigree.tsv \
    -F examples/de_novo_mutations.yml \
    -o denovo
@@ -290,7 +351,7 @@ expression: "VEP_IMPACT = HIGH & VEP_CANONICAL = YES & gnomad_AF < 0.01 & CADD_P
  Inspect specific variants for troubleshooting:
 
  ```bash
- uvx pywombat input.tsv \
+ wombat filter input.tsv \
    -F config.yml \
    --debug chr11:70486013
  ```
@@ -309,20 +370,20 @@ Shows:
  ### TSV (Default)
 
  ```bash
- uvx pywombat input.tsv -o output           # Creates output.tsv
- uvx pywombat input.tsv -o output -f tsv    # Same as above
+ wombat filter input.tsv -o output          # Creates output.tsv
+ wombat filter input.tsv -o output -f tsv   # Same as above
  ```
 
  ### Compressed TSV
 
  ```bash
- uvx pywombat input.tsv -o output -f tsv.gz   # Creates output.tsv.gz
+ wombat filter input.tsv -o output -f tsv.gz  # Creates output.tsv.gz
  ```
 
  ### Parquet (Fastest for Large Files)
 
  ```bash
- uvx pywombat input.tsv -o output -f parquet   # Creates output.parquet
+ wombat filter input.tsv -o output -f parquet  # Creates output.parquet
  ```
 
  **When to use Parquet:**
@@ -340,7 +401,7 @@ uvx pywombat input.tsv -o output -f parquet # Creates output.parquet
 
  ```bash
  # Step 1: Filter for rare, high-impact variants
- uvx pywombat cohort.tsv \
+ wombat filter cohort.tsv \
    -F examples/rare_variants_high_impact.yml \
    -o rare_variants
 
@@ -352,24 +413,34 @@ uvx pywombat cohort.tsv \
 
  ```bash
  # Identify de novo mutations in autism cohort
- uvx pywombat autism_trios.tsv \
+ wombat filter autism_trios.tsv \
    --pedigree autism_pedigree.tsv \
    -F examples/de_novo_mutations.yml \
    -o autism_denovo \
-   --verbose
+   -v
 
  # Review output for genes in autism risk lists
  ```
 
- ### 3. Multi-Family Rare Variant Analysis
+ ### 3. Large Multi-Family Analysis (Memory-Optimized)
 
  ```bash
- # Process multiple families together
- uvx pywombat families.tsv \
+ # Step 1: Prepare once (preprocesses INFO fields)
+ wombat prepare large_cohort.tsv.gz -o prepared.parquet -v
+
+ # Step 2: Filter with different configurations (fast, memory-efficient)
+ wombat filter prepared.parquet \
    --pedigree families_pedigree.tsv \
    -F examples/rare_variants_high_impact.yml \
    -o families_rare_variants \
-   -f parquet   # Parquet for fast downstream analysis
+   -v
+
+ # Step 3: Run additional filters without re-preparing
+ wombat filter prepared.parquet \
+   --pedigree families_pedigree.tsv \
+   -F examples/de_novo_mutations.yml \
+   -o families_denovo \
+   -v
  ```
 
  ### 4. Custom Expression Filter
@@ -389,7 +460,7 @@ expression: "VEP_IMPACT = HIGH & (gnomad_AF < 0.0001 | gnomad_AF = null)"
  Apply:
 
  ```bash
- uvx pywombat input.tsv -F custom_filter.yml -o output
+ wombat filter input.tsv -F custom_filter.yml -o output
  ```
 
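One plausible reading of that expression as a Polars predicate; PyWombat's actual expression parser may translate it differently:

```python
import polars as pl

# VEP_IMPACT = HIGH & (gnomad_AF < 0.0001 | gnomad_AF = null)
predicate = (pl.col("VEP_IMPACT") == "HIGH") & (
    (pl.col("gnomad_AF") < 0.0001) | pl.col("gnomad_AF").is_null()
)
filtered = pl.scan_parquet("prepared.parquet").filter(predicate).collect()
```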
  ---
@@ -464,7 +535,7 @@ bcftools query -HH \
    annotated.split.bcf > annotated.tsv
 
  # 4. Process with PyWombat
- uvx pywombat annotated.tsv -F examples/rare_variants_high_impact.yml -o output
+ wombat filter annotated.tsv -F examples/rare_variants_high_impact.yml -o output
  ```
 
  **Why split-vep is required:**
@@ -481,7 +552,7 @@ For production workflows, these commands can be piped together:
  # Efficient pipeline (single pass through data)
  bcftools +split-vep -c - -p VEP_ input.vcf.gz | \
  bcftools query -HH -f '%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t%INFO[\t%GT:%DP:%GQ:%AD]\n' | \
- uvx pywombat - -F config.yml -o output
+ wombat filter - -F config.yml -o output
  ```
 
  **Note**: For multiple filter configurations, it's more efficient to save the intermediate TSV file rather than regenerating it each time.
@@ -517,11 +588,49 @@ Each configuration file is fully documented with:
 
  ## Performance Tips
 
- 1. **Use streaming mode** (default): Efficient for most workflows
- 2. **Parquet output**: Faster for large files and repeated analysis
+ ### For Large Files (>1GB or >50 samples)
+
+ 1. **Use the two-step workflow**: `wombat prepare` → `wombat filter`
+    - Reduces memory usage by 95%+ (4.2M variants → ~100 after early filtering)
+    - Pre-expands INFO fields once; reuse for multiple filter configurations
+    - Example: a 38-sample family with 4.2M variants processes in <1 second with ~1.2GB RAM
+
+ 2. **Parquet format benefits**:
+    - Columnar storage enables selective column loading
+    - Pre-filtering before melting (expression filters applied before expanding to per-sample rows)
+    - **Per-chromosome processing for DNM**: Automatically processes DNM filtering chromosome-by-chromosome
+    - 30% smaller file size vs gzipped TSV
+
+ 3. **De Novo Mutation (DNM) filtering optimization**:
+    - Automatically uses per-chromosome processing when DNM mode is enabled
+    - Processes one chromosome at a time to reduce peak memory
+    - Applies frequency filters before melting to reduce data expansion
+    - Example: a 38-sample family with 4.2M variants completes in 20 seconds with ~24GB RAM (vs 200GB+ OOM failure)
+
+ ### For All Files
+
- 3. **Pre-filter with bcftools**: Filter by region/gene before PyWombat
- 4. **Compressed input**: PyWombat handles `.gz` files natively
- 5. **Filter early**: Apply quality filters before complex expression filters
+ 1. **Pre-filter with bcftools**: Filter by region/gene before PyWombat
+ 2. **Compressed input**: PyWombat handles `.gz` files natively
+ 3. **Use verbose mode** (`-v`): Monitor progress and filtering statistics
+
+ ### Memory Comparison
+
+ **Expression Filtering** (e.g., VEP_IMPACT filters):
+
+ | Approach | 38 samples, 4.2M variants | Memory | Time |
+ |----------|---------------------------|--------|------|
+ | Direct TSV | ❌ OOM (>200GB) | 200+ GB | Failed |
+ | TSV with chunking | ⚠️ Slow | ~30GB | ~3 min |
+ | **Parquet + pre-filter** | ✅ **Optimal** | **~1.2GB** | **<1 sec** |
+
+ **De Novo Mutation (DNM) Filtering**:
+
+ | Approach | 38 samples, 4.2M variants | Memory | Time | Result |
+ |----------|---------------------------|--------|------|--------|
+ | Without optimization | ❌ OOM (>200GB) | 200+ GB | Failed | N/A |
+ | **Parquet + per-chromosome** | ✅ **Success** | **~24GB** | **20 sec** | **6,788 DNM variants** |
+
+ *DNM filtering needs sample-level genotypes, so it cannot be fully pre-filtered before melting; per-chromosome processing still cuts peak memory by 88%.*
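A back-of-envelope check on those numbers; the largest chromosome's share is an assumption, and real memory also depends on row width:

```python
samples = 38
total_variants = 4_200_000
max_chr_variants = 400_000  # assumed share of the largest chromosome

rows_melted_all = total_variants * samples     # ~160M rows if melted all at once
rows_melted_peak = max_chr_variants * samples  # ~15M rows at peak, per chromosome
print(1 - rows_melted_peak / rows_melted_all)  # ≈ 0.90, same order as the observed 88%
```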
 
  ---
 
@@ -588,11 +697,15 @@ pywombat/
 
  **Issue**: Memory errors on large files
 
- - **Solution**: Files are processed in streaming mode by default; if issues persist, pre-filter with bcftools
+ - **Solution**: Use the two-step workflow (`wombat prepare`, then `wombat filter`) for a 95%+ memory reduction
+
+ **Issue**: Command not found after upgrading
+
+ - **Solution**: PyWombat now uses subcommands; use `wombat filter` instead of bare `wombat`
 
  ### Getting Help
 
- 1. Check `--help` for command options: `uvx pywombat --help`
+ 1. Check `--help` for command options: `wombat --help` or `wombat filter --help`
  2. Review example configurations in [`examples/`](examples/)
  3. Use `--debug` mode to inspect specific variants
  4. Use `--verbose` to see filtering steps