pywombat 0.5.0__tar.gz → 1.0.1__tar.gz

@@ -0,0 +1,130 @@
1
+ # PyWombat Development Guide
2
+
3
+ ## Project Overview
4
+
5
+ PyWombat is a bioinformatics CLI tool that transforms bcftools tabulated TSV files into analysis-ready long-format data. It specializes in:
6
+ - Expanding VCF INFO fields from the `(null)` column into individual columns
7
+ - Converting wide-format sample data to long-format (melting)
8
+ - Extracting and calculating genotype metrics (GT, DP, GQ, AD, VAF)
9
+ - De novo mutation (DNM) detection with sex-chromosome awareness
10
+ - Trio/family analysis with pedigree-based parent genotype joining
11
+
12
+ ## Architecture
13
+
14
+ ### Single-File Monolith
15
+ All functionality lives in [src/pywombat/cli.py](src/pywombat/cli.py) (~2100 lines). This is intentional: a single module keeps deployment simple with `uvx`, which runs the tool in one command without a prior installation.
16
+
17
+ ### Core Data Flow
18
+ 1. **Input**: bcftools tabulated TSV (wide format, one row per variant)
19
+ 2. **Transform Pipeline**:
20
+ - `format_bcftools_tsv_lazy()` → Expand INFO, melt samples, parse genotypes
21
+ - `apply_filters_lazy()` → Apply quality/expression filters OR DNM detection
22
+ 3. **Output**: Long-format TSV/Parquet (one row per variant-sample)
23
+
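The melting step in the data flow above can be illustrated with a small standard-library sketch. It is not the PyWombat implementation (which uses Polars); the column names `S1`/`S2`, the `FIXED_COLUMNS` list, and the helper `melt_wide_to_long` are hypothetical and only show how one wide row per variant becomes one row per variant-sample pair.

```python
import csv
import io

# Hypothetical wide-format input: fixed variant columns plus one column per sample.
WIDE_TSV = (
    "#CHROM\tPOS\tREF\tALT\tS1\tS2\n"
    "chr1\t100\tA\tG\t0/1:25:99:10,15\t0/0:30:99:30,0\n"
)
FIXED_COLUMNS = ["#CHROM", "POS", "REF", "ALT"]


def melt_wide_to_long(text, fixed_columns):
    """Yield one dict per (variant, sample) pair from a wide TSV string."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    # Every column that is not a fixed variant column is treated as a sample.
    sample_columns = [c for c in reader.fieldnames if c not in fixed_columns]
    for row in reader:
        for sample in sample_columns:
            long_row = {c: row[c] for c in fixed_columns}
            long_row["sample"] = sample
            long_row["sample_value"] = row[sample]
            yield long_row


long_rows = list(melt_wide_to_long(WIDE_TSV, FIXED_COLUMNS))
```

Here one input variant with two samples yields two long-format rows, each still carrying the unparsed genotype string for later parsing.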
24
+ ### Key Functions
25
+ - **`format_bcftools_tsv()`**: Eager transformation (collects data)
26
+ - **`format_bcftools_tsv_lazy()`**: Streaming wrapper using Polars LazyFrame
27
+ - **`apply_filters_lazy()`**: Conditional filter dispatcher (quality/expression OR DNM)
28
+ - **`apply_de_novo_filter()`**: Complex vectorized DNM logic with PAR region handling
29
+ - **`parse_impact_filter_expression()`**: Expression parser for flexible YAML-based filtering
30
+
31
+ ## Development Workflow
32
+
33
+ ### Setup & Running
34
+ ```bash
35
+ # Install dependencies (uses uv package manager)
36
+ uv sync
37
+
38
+ # Run CLI locally
39
+ uv run wombat input.tsv -o output
40
+
41
+ # Production usage (no installation)
42
+ uvx pywombat input.tsv -o output
43
+ ```
44
+
45
+ ### Testing
46
+ - Test data: `tests/C0733-011-068.*` files (real variant data)
47
+ - Compare script: [tests/compare_dnm_results.py](tests/compare_dnm_results.py) validates DNM detection
48
+ - No formal test framework yet; validation is manual, against reference outputs in `tests/check/`
49
+
50
+ ## Critical Domain Logic
51
+
52
+ ### Genotype Parsing
53
+ - Input format: `GT:DP:GQ:AD` (e.g., `0/1:25:99:10,15`)
54
+ - Extracted fields:
55
+ - `sample_gt`: Genotype (0/1, 1/1, etc.)
56
+ - `sample_dp`: Total depth (25)
57
+ - `sample_gq`: Quality (99)
58
+ - `sample_ad`: **Second value** from AD (15 = alternate allele depth)
59
+ - `sample_vaf`: Calculated as `sample_ad / sample_dp` (0.6)
60
+
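The field extraction listed above can be sketched in plain Python as follows. The function name `parse_genotype` is hypothetical, and the sketch assumes exactly four colon-separated fields in `GT:DP:GQ:AD` order with a biallelic AD value; the real code in `cli.py` may handle more variants of the format.

```python
def parse_genotype(value):
    """Split a GT:DP:GQ:AD string into the fields described above."""
    gt, dp, gq, ad = value.split(":")
    depth = int(dp)
    # AD is "ref_depth,alt_depth"; the second value is the alternate allele depth.
    alt_depth = int(ad.split(",")[1])
    # Guard against zero depth instead of dividing by zero.
    vaf = alt_depth / depth if depth > 0 else None
    return {
        "sample_gt": gt,
        "sample_dp": depth,
        "sample_gq": int(gq),
        "sample_ad": alt_depth,
        "sample_vaf": vaf,
    }


parsed = parse_genotype("0/1:25:99:10,15")
```

For the example string, `sample_ad` is 15 and `sample_vaf` is 15 / 25 = 0.6, matching the values in the list.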
61
+ ### De Novo Mutation Detection
62
+ Sex-chromosome-aware logic in `apply_de_novo_filter()`:
63
+ - **Autosomes**: Both parents must have VAF < 2% (reference genotype)
64
+ - **X chromosome (male proband)**: Mother must be reference; the father is ignored because a son inherits his X from the mother
65
+ - **Y chromosome**: Only father checked (mother doesn't have Y)
66
+ - **PAR regions**: Treated as autosomal (both parents checked)
67
+ - **Hemizygous variants** (X/Y in males): Require VAF ≥ 85% (not ~50% like het)
68
+
69
+ PAR regions defined in [examples/de_novo_mutations.yml](examples/de_novo_mutations.yml):
70
+ ```yaml
71
+ par_regions:
72
+ GRCh38:
73
+ PAR1: { chrom: "X", start: 10001, end: 2781479 }
74
+ PAR2: { chrom: "X", start: 155701383, end: 156030895 }
75
+ ```
76
+
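The parent-checking rules above can be summarized in a minimal sketch. It is an illustration, not the vectorized logic in `apply_de_novo_filter()`: the helper names are hypothetical, the PAR bounds are treated as inclusive (an assumption), only the X coordinates shown in the config excerpt are used, and a female proband on X is assumed to fall back to the autosomal rule because the text above only specifies the male case.

```python
# PAR coordinates copied from the GRCh38 config excerpt above (X chromosome only).
PAR_REGIONS_GRCH38 = {
    "PAR1": ("X", 10001, 2781479),
    "PAR2": ("X", 155701383, 156030895),
}


def normalize_chrom(chrom):
    """Strip a leading 'chr' so that 'chrX' and 'X' compare equal."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    return name.upper()


def in_par(chrom, pos):
    """Return True when the position lies in a configured PAR (inclusive bounds assumed)."""
    name = normalize_chrom(chrom)
    return any(
        c == name and start <= pos <= end
        for c, start, end in PAR_REGIONS_GRCH38.values()
    )


def parents_to_check(chrom, pos, proband_sex):
    """Return the set of parents whose genotype must be reference."""
    name = normalize_chrom(chrom)
    if in_par(chrom, pos):
        return {"father", "mother"}  # PAR is treated as autosomal
    if name == "Y":
        return {"father"}  # mother has no Y
    if name == "X" and proband_sex == "M":
        return {"mother"}  # father is ignored for a male proband
    return {"father", "mother"}  # autosomes (and, by assumption, X in females)


def is_hemizygous(chrom, pos, proband_sex):
    """Non-PAR X/Y calls in males, where the stricter VAF threshold applies."""
    return (
        proband_sex == "M"
        and normalize_chrom(chrom) in {"X", "Y"}
        and not in_par(chrom, pos)
    )
```

A caller would then require a parental VAF below 2% for each parent returned by `parents_to_check`, and a proband VAF of at least 85% when `is_hemizygous` is true.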
77
+ ## Conventions & Patterns
78
+
79
+ ### Polars Usage
80
+ - **Lazy evaluation**: Use `scan_csv()` → `LazyFrame` → `sink_csv()` for streaming
81
+ - **Streaming mode**: `collect(streaming=True)` for memory-efficient processing
82
+ - **Schema overrides**: Force string type for sample IDs to prevent numeric inference:
83
+ ```python
84
+ schema_overrides = {col: pl.Utf8 for col in ["sample", "FatherBarcode", ...]}
85
+ ```
86
+
87
+ ### YAML Configuration Files
88
+ Filter configs in [examples/](examples/) define:
89
+ - `quality`: Thresholds (DP, GQ, VAF ranges for het/hom)
90
+ - `expression`: Polars-compatible filter expressions (see `parse_impact_filter_expression()`)
91
+ - `dnm`: De novo detection parameters (parent thresholds, PAR regions)
92
+
93
+ Example expression syntax:
94
+ ```yaml
95
+ expression: "VEP_IMPACT = HIGH & fafmax_faf95_max_genomes <= 0.001"
96
+ ```
97
+ Supports: `=`, `!=`, `<=`, `>=`, `<`, `>`, `&`, `|`, `()`, NULL, NaN
98
+
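To make the expression syntax concrete, here is a deliberately simplified evaluator written with the standard library. It is not `parse_impact_filter_expression()`: it handles only comparisons joined by `&` (no `|`, parentheses, or NULL handling), and the names `compile_expression` and `parse_value` are hypothetical.

```python
import operator
import re

# Longer operators come first so "<=" is not read as "<" followed by "=".
COMPARISON = re.compile(r"^\s*(\w+)\s*(<=|>=|!=|=|<|>)\s*(.+?)\s*$")
OPERATORS = {
    "=": operator.eq,
    "!=": operator.ne,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
}


def parse_value(text):
    """Interpret the right-hand side as a number when possible, else as a string."""
    try:
        return float(text)
    except ValueError:
        return text


def compile_expression(expression):
    """Turn 'A = x & B <= y' into a predicate over dict-like rows."""
    clauses = []
    for part in expression.split("&"):
        match = COMPARISON.match(part)
        if match is None:
            raise ValueError(f"Cannot parse clause: {part!r}")
        column, op, raw = match.groups()
        clauses.append((column, OPERATORS[op], parse_value(raw)))

    def predicate(row):
        # Every clause must hold, mirroring the '&' operator.
        return all(op(row[column], value) for column, op, value in clauses)

    return predicate


keep = compile_expression("VEP_IMPACT = HIGH & fafmax_faf95_max_genomes <= 0.001")
```

Calling `keep(row)` on a dictionary with `VEP_IMPACT` and `fafmax_faf95_max_genomes` keys returns whether the row passes both conditions.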
99
+ ### Error Handling
100
+ - Use `click.echo(..., err=True)` for verbose messages (stderr)
101
+ - Raise `click.Abort()` instead of generic exceptions for CLI errors
102
+ - Validate pedigree/config requirements early (e.g., DNM mode requires pedigree)
103
+
104
+ ## Common Pitfalls
105
+
106
+ 1. **VAF Calculation**: Always use `sample_ad / sample_dp`, where `sample_ad` is the second AD value (alternate depth), not the first (reference depth)
107
+ 2. **Genotype Filtering**: Use `str.contains("1")` rather than a more elaborate regex; it matches 0/1, 1/0, 1/1, and 1/2
108
+ 3. **Parent Joining**: Pedigree must have `sample_id`, `FatherBarcode`, `MotherBarcode` columns
109
+ 4. **Sex Normalization**: DNM filter accepts "1"/"2" or "M"/"F" for sex; normalizes to uppercase
110
+ 5. **Categorical Types**: Convert `#CHROM` from Categorical to Utf8 before string operations
111
+
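Pitfalls 2 and 4 can be illustrated with two small helpers. Both names are hypothetical, and mapping "1" to "M" and "2" to "F" is an assumption based on the usual pedigree-file coding (1 = male, 2 = female); the guide itself only states that both codings are accepted and that values are uppercased.

```python
def normalize_sex(value):
    """Uppercase the sex code and map '1'/'2' to 'M'/'F' (assumed PED coding)."""
    text = str(value).strip().upper()
    return {"1": "M", "2": "F"}.get(text, text)


def carries_alt(genotype):
    """Plain substring test, the pure-Python analogue of str.contains("1")."""
    return "1" in genotype
```

The substring test deliberately ignores phasing and allele order, which is why it covers 0/1, 1/0, 1/1, and 1/2 with a single check.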
112
+ ## File References
113
+
114
+ - Main CLI: [src/pywombat/cli.py](src/pywombat/cli.py)
115
+ - Config examples: [examples/de_novo_mutations.yml](examples/de_novo_mutations.yml), [examples/rare_variants_high_impact.yml](examples/rare_variants_high_impact.yml)
116
+ - Test data: [tests/C0733-011-068.pedigree.tsv](tests/C0733-011-068.pedigree.tsv)
117
+ - Documentation: [README.md](README.md), [QUICKSTART.md](QUICKSTART.md)
118
+
119
+ ## Adding New Features
120
+
121
+ When adding filters:
122
+ 1. Update YAML schema in config examples
123
+ 2. Extend `apply_filters_lazy()` or create a specialized function (like `apply_de_novo_filter()`)
124
+ 3. Use lazy operations when possible; collect only if vectorization requires it
125
+ 4. Add verbose logging with `--verbose` flag support
126
+
127
+ When modifying transforms:
128
+ - Keep `format_bcftools_tsv()` and `format_bcftools_tsv_lazy()` in sync
129
+ - Test with real data from `tests/` directory
130
+ - Verify output with `compare_dnm_results.py` for DNM changes
@@ -0,0 +1,135 @@
1
+ # Changelog
2
+
3
+ All notable changes to PyWombat will be documented in this file.
4
+
5
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
+
8
+ ## [1.0.1] - 2026-01-24
9
+
10
+ ### Added
11
+
12
+ - **Boolean Flag Support in INFO Fields**: INFO field entries without `=` signs (e.g., `PASS`, `DB`, `SOMATIC`) are now extracted as boolean columns with `True`/`False` values instead of being ignored. This enables filtering on VCF flag fields.
13
+
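The flag behaviour described above can be sketched as follows. This is an illustration under assumptions, not the code shipped in 1.0.1: the helpers `parse_info` and `info_columns` are hypothetical, INFO entries are assumed to be separated by `;`, and values are left as strings.

```python
def parse_info(info):
    """Parse 'KEY=VALUE;FLAG' into a dict; entries without '=' become True."""
    fields = {}
    for entry in info.split(";"):
        if not entry:
            continue  # tolerate empty entries such as a trailing ';'
        if "=" in entry:
            key, value = entry.split("=", 1)
            fields[key] = value
        else:
            fields[entry] = True
    return fields


def info_columns(info_values):
    """Parse several INFO strings and fill absent flags with False."""
    parsed = [parse_info(value) for value in info_values]
    flags = {k for record in parsed for k, v in record.items() if v is True}
    for record in parsed:
        for flag in flags:
            record.setdefault(flag, False)
    return parsed


rows = info_columns(["DP=10;DB", "DP=12"])
```

The second pass is what gives every row a `True`/`False` value for each flag, so the resulting column can be used directly in a filter.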
14
+ ### Fixed
15
+
16
+ - INFO field parsing now handles all field types correctly, including standalone boolean flags commonly used in VCF files.
17
+
18
+ ## [1.0.0] - 2026-01-23
19
+
20
+ First stable release of PyWombat! 🎉
21
+
22
+ ### Added
23
+
24
+ #### Core Features
25
+
26
+ - **Fast TSV Processing**: Efficient processing of bcftools tabulated TSV files using Polars
27
+ - **Flexible Output Formats**: Support for TSV, compressed TSV (`.gz`), and Parquet formats
28
+ - **Streaming Mode**: Memory-efficient processing for large files
29
+ - **Pedigree Support**: Trio and family analysis with automatic parent genotype joining
30
+ - **Multiple Sample Formats**: Handles various genotype formats (GT:DP:GQ:AD and variants)
31
+
32
+ #### Filtering Capabilities
33
+
34
+ - **Quality Filters**: Configurable thresholds for depth (DP), genotype quality (GQ), and variant allele frequency (VAF)
35
+ - **Genotype-Specific VAF Filters**: Separate thresholds for heterozygous, homozygous alternate, and homozygous reference calls
36
+ - **Expression-Based Filtering**: Complex logical expressions with comparison operators (`=`, `!=`, `<`, `>`, `<=`, `>=`) and logical operators (`&`, `|`)
37
+ - **Parent Quality Filtering**: Optional quality filter application to parent genotypes
38
+
39
+ #### De Novo Mutation Detection
40
+
41
+ - **Sex-Chromosome Aware Logic**: Proper handling of X and Y chromosomes in males
42
+ - **PAR Region Support**: Configurable pseudo-autosomal region (PAR) coordinates for GRCh37 and GRCh38
43
+ - **Hemizygous Variant Detection**: Specialized VAF thresholds for X chromosome in males (non-PAR) and Y chromosome
44
+ - **Homozygous VAF Thresholds**: Higher VAF requirements (≥85%) for homozygous variants
45
+ - **Parent Genotype Validation**: Ensures parents are homozygous reference with low VAF (<2%)
46
+ - **Missing Genotype Filtering**: Removes variants with partial/missing genotypes (`./.`, `0/.`, etc.)
47
+ - **Population Frequency Filtering**: Maximum allele frequency thresholds (gnomAD fafmax_faf95_max_genomes)
48
+ - **Quality Filter Support**: gnomAD genomes_filters PASS-only option
49
+
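The missing-genotype rule in the list above can be expressed as a short check. The function name is hypothetical, and treating both `/` and `|` as allele separators is an assumption, since the changelog only shows unphased examples.

```python
import re


def is_partial_genotype(genotype):
    """Return True when any allele in the call is missing ('.')."""
    alleles = re.split(r"[/|]", genotype)
    return "." in alleles
```

Rows for which this returns true (`./.`, `0/.`, `1/.`) would be dropped before de novo evaluation.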
50
+ #### User Experience
51
+
52
+ - **Debug Mode**: Inspect specific variants by chromosome:position for troubleshooting
53
+ - **Verbose Mode**: Detailed filtering step information with variant counts
54
+ - **Automatic Output Naming**: Intelligent output file naming based on input and filter config
55
+ - **Configuration Examples**: Two comprehensive example configurations with extensive documentation
56
+ - `rare_variants_high_impact.yml`: Ultra-rare, high-impact variant filtering
57
+ - `de_novo_mutations.yml`: De novo mutation detection with full documentation
58
+
59
+ #### Documentation
60
+
61
+ - **Comprehensive README**: Complete usage guide with examples for all features
62
+ - **Example Workflows**: Real-world usage scenarios (rare disease, autism trios, etc.)
63
+ - **Input Requirements**: Detailed bcftools command examples for generating input files
64
+ - **VEP Annotation Guide**: Complete workflow from VEP annotation to PyWombat processing
65
+ - **Examples Directory**: Dedicated directory with configuration files and detailed README
66
+ - **Troubleshooting Section**: Common issues and solutions
67
+
68
+ #### Installation Methods
69
+
70
+ - **uvx Support**: One-line execution without installation (`uvx pywombat`)
71
+ - **uv Development Mode**: Local installation for repeated use (`uv sync`, `uv run wombat`)
72
+
73
+ ### Changed
74
+
75
+ - Improved performance with streaming lazy operations
76
+ - Optimized parent genotype lookup (excludes 0/0 genotypes from storage)
77
+ - Enhanced error messages for better user experience
78
+ - Normalized chromosome names for PAR region matching (handles both 'X' and 'chrX')
79
+
80
+ ### Fixed
81
+
82
+ - Sex column reading from pedigree file
83
+ - Parent genotype column naming consistency (father_id/mother_id)
84
+ - Genotype filtering to catch all partial genotypes (`./.`, `0/.`, `1/.`)
85
+ - PAR region matching for different chromosome naming conventions
86
+ - Empty chunk handling in output to avoid blank lines
87
+
88
+ ### Performance Optimizations
89
+
90
+ - Delayed annotation expansion (filter before expanding `(null)` field)
91
+ - Vectorized filtering operations (no Python loops)
92
+ - Early genotype filtering (skip 0/0 before parent lookup)
93
+ - Optimized parent lookup (stores only non-reference genotypes)
94
+ - Streaming mode by default for memory efficiency
95
+
96
+ ### Removed
97
+
98
+ - **Progress bar options**: Removed `--progress`/`--no-progress` and `--chunk-size` options for simplicity
99
+ - **Chunked processing mode**: Simplified to use only efficient streaming mode
100
+
101
+ ## [0.5.0] - 2026-01-20
102
+
103
+ ### Added
104
+
105
+ - Initial de novo mutation detection implementation
106
+ - Pedigree file support
107
+ - Basic quality filtering
108
+ - Expression-based filtering
109
+
110
+ ### Known Issues
111
+
112
+ - Progress bar had reliability issues (removed in 1.0.0)
113
+ - Chunked processing was complex (simplified in 1.0.0)
114
+
115
+ ---
116
+
117
+ ## Release Notes
118
+
119
+ ### v1.0.0 - Production Ready
120
+
121
+ This release marks PyWombat as production-ready for:
122
+
123
+ - Rare disease gene discovery
124
+ - De novo mutation detection in autism and developmental disorders
125
+ - Trio and family-based variant analysis
126
+ - High-throughput variant filtering workflows
127
+
128
+ **Recommended for**: Research groups working with rare variants, de novo mutations, and family-based genomic studies.
129
+
130
+ **Breaking Changes**: The `--progress`/`--no-progress` and `--chunk-size` options were removed, so invocations that pass them must be updated. No other breaking changes from 0.5.0.
131
+
132
+ ---
133
+
134
+ [1.0.0]: https://github.com/bourgeron-lab/pywombat/releases/tag/v1.0.0
135
+ [0.5.0]: https://github.com/bourgeron-lab/pywombat/releases/tag/v0.5.0