pywombat 0.5.0__tar.gz → 1.0.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- pywombat-1.0.1/.github/copilot-instructions.md +130 -0
- pywombat-1.0.1/CHANGELOG.md +135 -0
- pywombat-1.0.1/PKG-INFO +641 -0
- pywombat-1.0.1/README.md +619 -0
- pywombat-1.0.1/examples/README.md +219 -0
- pywombat-1.0.1/examples/de_novo_mutations.yml +159 -0
- pywombat-1.0.1/examples/rare_variants_high_impact.yml +84 -0
- {pywombat-0.5.0 → pywombat-1.0.1}/pyproject.toml +3 -3
- {pywombat-0.5.0 → pywombat-1.0.1}/src/pywombat/cli.py +907 -30
- {pywombat-0.5.0 → pywombat-1.0.1}/uv.lock +16 -2
- pywombat-0.5.0/PKG-INFO +0 -142
- pywombat-0.5.0/README.md +0 -121
- {pywombat-0.5.0 → pywombat-1.0.1}/.github/workflows/publish.yml +0 -0
- {pywombat-0.5.0 → pywombat-1.0.1}/.gitignore +0 -0
- {pywombat-0.5.0 → pywombat-1.0.1}/.python-version +0 -0
- {pywombat-0.5.0 → pywombat-1.0.1}/QUICKSTART.md +0 -0
- {pywombat-0.5.0 → pywombat-1.0.1}/src/pywombat/__init__.py +0 -0
@@ -0,0 +1,130 @@
# PyWombat Development Guide

## Project Overview

PyWombat is a bioinformatics CLI tool that transforms bcftools tabulated TSV files into analysis-ready long-format data. It specializes in:
- Expanding VCF INFO fields from the `(null)` column into individual columns
- Converting wide-format sample data to long-format (melting)
- Extracting and calculating genotype metrics (GT, DP, GQ, AD, VAF)
- De novo mutation (DNM) detection with sex-chromosome awareness
- Trio/family analysis with pedigree-based parent genotype joining

## Architecture

### Single-File Monolith
All functionality lives in [src/pywombat/cli.py](src/pywombat/cli.py) (~2100 lines). This is intentional for deployment simplicity with `uvx` (one-command installation-free execution).

### Core Data Flow
1. **Input**: bcftools tabulated TSV (wide format, one row per variant)
2. **Transform Pipeline**:
   - `format_bcftools_tsv_lazy()` → Expand INFO, melt samples, parse genotypes
   - `apply_filters_lazy()` → Apply quality/expression filters OR DNM detection
3. **Output**: Long-format TSV/Parquet (one row per variant-sample)
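
The wide-to-long step in this flow can be illustrated with a small pure-Python sketch. PyWombat itself does this with Polars; the helper name `melt_rows`, the `sample_value` column, and the sample column names `S1`/`S2` are assumptions made for illustration and are not taken from `cli.py`.

```python
def melt_rows(rows, id_cols):
    """Yield one long-format record per (variant, sample) pair.

    Hypothetical helper: every column not listed in id_cols is treated
    as a per-sample column.
    """
    for row in rows:
        fixed = {col: row[col] for col in id_cols}
        for col, value in row.items():
            if col not in id_cols:
                yield {**fixed, "sample": col, "sample_value": value}


# One wide row (one variant, two samples) becomes two long rows.
wide = [
    {"#CHROM": "chr1", "POS": "12345", "S1": "0/1:25:99:10,15", "S2": "0/0:30:99:30,0"},
]
long_rows = list(melt_rows(wide, id_cols=("#CHROM", "POS")))
```

Each record in `long_rows` carries the variant columns plus a single sample, which is the shape the later genotype parsing and filtering steps operate on.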

### Key Functions
- **`format_bcftools_tsv()`**: Eager transformation (collects data)
- **`format_bcftools_tsv_lazy()`**: Streaming wrapper using Polars LazyFrame
- **`apply_filters_lazy()`**: Conditional filter dispatcher (quality/expression OR DNM)
- **`apply_de_novo_filter()`**: Complex vectorized DNM logic with PAR region handling
- **`parse_impact_filter_expression()`**: Expression parser for flexible YAML-based filtering

## Development Workflow

### Setup & Running
```bash
# Install dependencies (uses uv package manager)
uv sync

# Run CLI locally
uv run wombat input.tsv -o output

# Production usage (no installation)
uvx pywombat input.tsv -o output
```

### Testing
- Test data: `tests/C0733-011-068.*` files (real variant data)
- Compare script: [tests/compare_dnm_results.py](tests/compare_dnm_results.py) validates DNM detection
- No formal test framework yet; validation is manual, against reference outputs in `tests/check/`

## Critical Domain Logic

### Genotype Parsing
- Input format: `GT:DP:GQ:AD` (e.g., `0/1:25:99:10,15`)
- Extracted fields:
  - `sample_gt`: Genotype (0/1, 1/1, etc.)
  - `sample_dp`: Total depth (25)
  - `sample_gq`: Quality (99)
  - `sample_ad`: **Second value** from AD (15 = alternate allele depth)
  - `sample_vaf`: Calculated as `sample_ad / sample_dp` (0.6)
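
As a worked illustration of the field extraction above, here is a minimal pure-Python sketch. The function name `parse_genotype` is hypothetical, the real implementation is vectorized with Polars, and missing values such as `.` are deliberately not handled here.

```python
def parse_genotype(value):
    """Split a GT:DP:GQ:AD string into the sample_* fields listed above."""
    gt, dp, gq, ad = value.split(":")
    depth = int(dp)
    # sample_ad is the SECOND AD value (alternate allele depth).
    alt_depth = int(ad.split(",")[1])
    return {
        "sample_gt": gt,
        "sample_dp": depth,
        "sample_gq": int(gq),
        "sample_ad": alt_depth,
        # Assumption: report no VAF when depth is zero instead of dividing by zero.
        "sample_vaf": alt_depth / depth if depth > 0 else None,
    }


parsed = parse_genotype("0/1:25:99:10,15")
```

For the example string, `parsed["sample_ad"]` is 15 and `parsed["sample_vaf"]` is 15 / 25.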

### De Novo Mutation Detection
Sex-chromosome-aware logic in `apply_de_novo_filter()`:
- **Autosomes**: Both parents must have VAF < 2% (reference genotype)
- **X chromosome (male proband)**: Mother must be reference; father is hemizygous (ignored)
- **Y chromosome**: Only father checked (mother doesn't have Y)
- **PAR regions**: Treated as autosomal (both parents checked)
- **Hemizygous variants** (X/Y in males): Require VAF ≥ 85% (not ~50% like het)
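
The rules above can be read as a small decision table. The sketch below is one way to express it in plain Python and is not the vectorized logic in `apply_de_novo_filter()`. The function names are hypothetical, `chrom` is assumed to be already normalized (`"X"`, `"Y"`, or an autosome name), and the handling of a female proband on X is an assumption because the list above does not cover that case.

```python
PARENT_MAX_VAF = 0.02      # parents must stay below 2% VAF
HEMIZYGOUS_MIN_VAF = 0.85  # hemizygous calls need at least 85% VAF


def parents_to_check(chrom, proband_sex, in_par):
    """Return which parents must look reference for a de novo call."""
    if in_par or chrom not in ("X", "Y"):
        return ("father", "mother")  # autosomes and PAR regions
    if chrom == "Y":
        return ("father",)           # mother has no Y chromosome
    if proband_sex == "M":
        return ("mother",)           # father is hemizygous on X, so ignored
    # Assumption: a female proband on non-PAR X is checked like an autosome.
    return ("father", "mother")


def is_hemizygous(chrom, proband_sex, in_par):
    """True for non-PAR X/Y sites in a male proband."""
    return proband_sex == "M" and chrom in ("X", "Y") and not in_par


def is_dnm_candidate(chrom, proband_sex, in_par, proband_vaf, father_vaf, mother_vaf):
    """Apply the parent VAF rule, plus the hemizygous VAF rule where relevant."""
    parent_vafs = {"father": father_vaf, "mother": mother_vaf}
    parents_ok = all(
        parent_vafs[parent] < PARENT_MAX_VAF
        for parent in parents_to_check(chrom, proband_sex, in_par)
    )
    if is_hemizygous(chrom, proband_sex, in_par):
        return parents_ok and proband_vaf >= HEMIZYGOUS_MIN_VAF
    return parents_ok
```

Keeping the "which parents" decision separate from the thresholds makes each rule in the list checkable on its own.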

PAR regions defined in [examples/de_novo_mutations.yml](examples/de_novo_mutations.yml):
```yaml
par_regions:
  GRCh38:
    PAR1: { chrom: "X", start: 10001, end: 2781479 }
    PAR2: { chrom: "X", start: 155701383, end: 156030895 }
```
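
A lookup against these coordinates might look like the following sketch. The names `normalize_chrom` and `in_par` are hypothetical, the interval bounds are assumed to be inclusive, and only the GRCh38 X-chromosome entries shown above are included.

```python
# GRCh38 PAR coordinates copied from the YAML excerpt above.
PAR_REGIONS = {
    "GRCh38": {
        "PAR1": {"chrom": "X", "start": 10001, "end": 2781479},
        "PAR2": {"chrom": "X", "start": 155701383, "end": 156030895},
    }
}


def normalize_chrom(chrom):
    """Map 'chrX' and 'x' style names onto 'X' so both conventions match."""
    chrom = str(chrom)
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return chrom.upper()


def in_par(chrom, pos, build="GRCh38"):
    """Return True if pos falls inside a configured PAR region (inclusive bounds assumed)."""
    chrom = normalize_chrom(chrom)
    return any(
        region["chrom"] == chrom and region["start"] <= pos <= region["end"]
        for region in PAR_REGIONS[build].values()
    )
```

Normalizing the chromosome name before comparing means inputs using `X` and inputs using `chrX` resolve to the same regions.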

## Conventions & Patterns

### Polars Usage
- **Lazy evaluation**: Use `scan_csv()` → `LazyFrame` → `sink_csv()` for streaming
- **Streaming mode**: `collect(streaming=True)` for memory-efficient processing
- **Schema overrides**: Force string type for sample IDs to prevent numeric inference:
  ```python
  schema_overrides = {col: pl.Utf8 for col in ["sample", "FatherBarcode", ...]}
  ```

### YAML Configuration Files
Filter configs in [examples/](examples/) define:
- `quality`: Thresholds (DP, GQ, VAF ranges for het/hom)
- `expression`: Polars-compatible filter expressions (see `parse_impact_filter_expression()`)
- `dnm`: De novo detection parameters (parent thresholds, PAR regions)

Example expression syntax:
```yaml
expression: "VEP_IMPACT = HIGH & fafmax_faf95_max_genomes <= 0.001"
```
Supports: `=`, `!=`, `<=`, `>=`, `<`, `>`, `&`, `|`, `()`, NULL, NaN
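
To make the expression syntax concrete, here is a small recursive-descent evaluator for the comparison and logical operators listed above. It is a sketch and not `parse_impact_filter_expression()`: it evaluates against a plain `dict` rather than building Polars expressions, it assumes `&` binds tighter than `|`, and it leaves out the `NULL` and `NaN` handling. All names in it are hypothetical.

```python
import operator
import re

# Tokens: two-character operators first, then single symbols, then bare words.
TOKEN_RE = re.compile(r"<=|>=|!=|[=<>&|()]|[^\s()&|<>=!]+")
COMPARATORS = {
    "=": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def _coerce(text):
    """Return a float for numeric literals, otherwise the raw string."""
    try:
        return float(text)
    except ValueError:
        return text


def _either(a, b):
    return lambda row: a(row) or b(row)


def _both(a, b):
    return lambda row: a(row) and b(row)


def compile_filter(text):
    """Compile a filter expression into a predicate over a row dict."""
    tokens = TOKEN_RE.findall(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        left = parse_and()
        while peek() == "|":
            take()
            left = _either(left, parse_and())
        return left

    def parse_and():
        left = parse_atom()
        while peek() == "&":
            take()
            left = _both(left, parse_atom())
        return left

    def parse_atom():
        token = take()
        if token == "(":
            inner = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return inner
        column, op, literal = token, take(), _coerce(take())
        if op not in COMPARATORS:
            raise ValueError(f"unknown operator {op!r}")
        compare = COMPARATORS[op]

        def predicate(row):
            value = row[column]
            if isinstance(literal, float):
                value = float(value)  # numeric literal: compare numerically
            return compare(value, literal)

        return predicate

    compiled = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return compiled


rare_high = compile_filter("VEP_IMPACT = HIGH & fafmax_faf95_max_genomes <= 0.001")
```

Calling `rare_high(row)` on a dict with `VEP_IMPACT` and `fafmax_faf95_max_genomes` keys returns whether that row passes the example expression.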

### Error Handling
- Use `click.echo(..., err=True)` for verbose messages (stderr)
- Raise `click.Abort()` instead of generic exceptions for CLI errors
- Validate pedigree/config requirements early (e.g., DNM mode requires pedigree)

## Common Pitfalls

1. **VAF Calculation**: Always compute VAF as `sample_ad / sample_dp`, where `sample_ad` is the second AD value (alternate allele depth), not the first (reference depth)
2. **Genotype Filtering**: Use `str.contains("1")`, not a more complex regex; it handles 0/1, 1/0, 1/1, 1/2
3. **Parent Joining**: Pedigree must have `sample_id`, `FatherBarcode`, `MotherBarcode` columns
4. **Sex Normalization**: DNM filter accepts "1"/"2" or "M"/"F" for sex; normalizes to uppercase
5. **Categorical Types**: Convert `#CHROM` from Categorical to Utf8 before string operations
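
Pitfalls 2 and 4 can be illustrated with two tiny helpers. Both names are hypothetical, and mapping `"1"` to male and `"2"` to female follows the usual pedigree-file convention, which is an assumption about what the DNM filter does internally.

```python
# Assumption: "1"/"2" follow the common pedigree convention (1 = male, 2 = female).
SEX_CODES = {"1": "M", "2": "F", "M": "M", "F": "F"}


def normalize_sex(raw):
    """Return 'M' or 'F' for the accepted codes, or None if unrecognized."""
    return SEX_CODES.get(str(raw).strip().upper())


def carries_alt(gt):
    """Substring check for allele 1, the plain-Python analogue of str.contains("1")."""
    return "1" in gt
```

One caveat of the substring check is that it also matches allele indices that merely contain the digit 1, such as `0/10`, so it is only safe when allele indices stay below 10.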

## File References

- Main CLI: [src/pywombat/cli.py](src/pywombat/cli.py)
- Config examples: [examples/de_novo_mutations.yml](examples/de_novo_mutations.yml), [examples/rare_variants_high_impact.yml](examples/rare_variants_high_impact.yml)
- Test data: [tests/C0733-011-068.pedigree.tsv](tests/C0733-011-068.pedigree.tsv)
- Documentation: [README.md](README.md), [QUICKSTART.md](QUICKSTART.md)

## Adding New Features

When adding filters:
1. Update YAML schema in config examples
2. Extend `apply_filters_lazy()` or create specialized function (like `apply_de_novo_filter()`)
3. Use lazy operations when possible; collect only if vectorization requires it
4. Add verbose logging with `--verbose` flag support

When modifying transforms:
- Keep `format_bcftools_tsv()` and `format_bcftools_tsv_lazy()` in sync
- Test with real data from `tests/` directory
- Verify output with `compare_dnm_results.py` for DNM changes
@@ -0,0 +1,135 @@
# Changelog

All notable changes to PyWombat will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [1.0.1] - 2026-01-24

### Added

- **Boolean Flag Support in INFO Fields**: INFO field entries without `=` signs (e.g., `PASS`, `DB`, `SOMATIC`) are now extracted as boolean columns with `True`/`False` values instead of being ignored. This enables filtering on VCF flag fields.

### Fixed

- INFO field parsing now handles all field types correctly, including standalone boolean flags commonly used in VCF files.
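
The flag behavior described in this release can be sketched in plain Python as follows. This is an illustration and not the Polars code in `cli.py`: the names `parse_info` and `flag_columns` are hypothetical, and treating a bare `.` as an empty INFO value is an assumption.

```python
def parse_info(info):
    """Split an INFO string into key/value pairs; bare entries become True flags."""
    fields = {}
    for entry in info.split(";"):
        if not entry or entry == ".":
            continue  # assumption: skip empty entries and the "." placeholder
        if "=" in entry:
            key, value = entry.split("=", 1)
            fields[key] = value
        else:
            fields[entry] = True
    return fields


def flag_columns(records):
    """Give every record an explicit True/False value for each flag seen anywhere."""
    flags = sorted({key for rec in records for key, value in rec.items() if value is True})
    return [{**rec, **{flag: rec.get(flag, False) for flag in flags}} for rec in records]


records = flag_columns([parse_info("DP=25;DB"), parse_info("DP=30")])
```

After `flag_columns`, a variant that lacks the `DB` flag carries `DB: False` instead of having no `DB` entry, which is what makes the flag usable in a filter.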

## [1.0.0] - 2026-01-23

First stable release of PyWombat! 🎉

### Added

#### Core Features

- **Fast TSV Processing**: Efficient processing of bcftools tabulated TSV files using Polars
- **Flexible Output Formats**: Support for TSV, compressed TSV (`.gz`), and Parquet formats
- **Streaming Mode**: Memory-efficient processing for large files
- **Pedigree Support**: Trio and family analysis with automatic parent genotype joining
- **Multiple Sample Formats**: Handles various genotype formats (GT:DP:GQ:AD and variants)

#### Filtering Capabilities

- **Quality Filters**: Configurable thresholds for depth (DP), genotype quality (GQ), and variant allele frequency (VAF)
- **Genotype-Specific VAF Filters**: Separate thresholds for heterozygous, homozygous alternate, and homozygous reference calls
- **Expression-Based Filtering**: Complex logical expressions with comparison operators (`=`, `!=`, `<`, `>`, `<=`, `>=`) and logical operators (`&`, `|`)
- **Parent Quality Filtering**: Optional quality filter application to parent genotypes

#### De Novo Mutation Detection

- **Sex-Chromosome Aware Logic**: Proper handling of X and Y chromosomes in males
- **PAR Region Support**: Configurable pseudo-autosomal region (PAR) coordinates for GRCh37 and GRCh38
- **Hemizygous Variant Detection**: Specialized VAF thresholds for X chromosome in males (non-PAR) and Y chromosome
- **Homozygous VAF Thresholds**: Higher VAF requirements (≥85%) for homozygous variants
- **Parent Genotype Validation**: Ensures parents are homozygous reference with low VAF (<2%)
- **Missing Genotype Filtering**: Removes variants with partial/missing genotypes (`./.`, `0/.`, etc.)
- **Population Frequency Filtering**: Maximum allele frequency thresholds (gnomAD fafmax_faf95_max_genomes)
- **Quality Filter Support**: gnomAD genomes_filters PASS-only option

#### User Experience

- **Debug Mode**: Inspect specific variants by chromosome:position for troubleshooting
- **Verbose Mode**: Detailed filtering step information with variant counts
- **Automatic Output Naming**: Intelligent output file naming based on input and filter config
- **Configuration Examples**: Two comprehensive example configurations with extensive documentation
  - `rare_variants_high_impact.yml`: Ultra-rare, high-impact variant filtering
  - `de_novo_mutations.yml`: De novo mutation detection with full documentation

#### Documentation

- **Comprehensive README**: Complete usage guide with examples for all features
- **Example Workflows**: Real-world usage scenarios (rare disease, autism trios, etc.)
- **Input Requirements**: Detailed bcftools command examples for generating input files
- **VEP Annotation Guide**: Complete workflow from VEP annotation to PyWombat processing
- **Examples Directory**: Dedicated directory with configuration files and detailed README
- **Troubleshooting Section**: Common issues and solutions

#### Installation Methods

- **uvx Support**: One-line execution without installation (`uvx pywombat`)
- **uv Development Mode**: Local installation for repeated use (`uv sync`, `uv run wombat`)

### Changed

- Improved performance with streaming lazy operations
- Optimized parent genotype lookup (excludes 0/0 genotypes from storage)
- Enhanced error messages for better user experience
- Normalized chromosome names for PAR region matching (handles both 'X' and 'chrX')

### Fixed

- Sex column reading from pedigree file
- Parent genotype column naming consistency (father_id/mother_id)
- Genotype filtering to catch all partial genotypes (`./.`, `0/.`, `1/.`)
- PAR region matching for different chromosome naming conventions
- Empty chunk handling in output to avoid blank lines

### Performance Optimizations

- Delayed annotation expansion (filter before expanding `(null)` field)
- Vectorized filtering operations (no Python loops)
- Early genotype filtering (skip 0/0 before parent lookup)
- Optimized parent lookup (stores only non-reference genotypes)
- Streaming mode by default for memory efficiency

### Removed

- **Progress bar options**: Removed `--progress`/`--no-progress` and `--chunk-size` options for simplicity
- **Chunked processing mode**: Simplified to use only efficient streaming mode

## [0.5.0] - 2026-01-20

### Added

- Initial de novo mutation detection implementation
- Pedigree file support
- Basic quality filtering
- Expression-based filtering

### Known Issues

- Progress bar had reliability issues (removed in 1.0.0)
- Chunked processing was complex (simplified in 1.0.0)

---

## Release Notes

### v1.0.0 - Production Ready

This release marks PyWombat as production-ready for:

- Rare disease gene discovery
- De novo mutation detection in autism and developmental disorders
- Trio and family-based variant analysis
- High-throughput variant filtering workflows

**Recommended for**: Research groups working with rare variants, de novo mutations, and family-based genomic studies.

**Breaking Changes**: The `--progress`/`--no-progress` and `--chunk-size` options were removed for a cleaner interface; there are no other breaking changes from 0.5.0.

---

[1.0.0]: https://github.com/bourgeron-lab/pywombat/releases/tag/v1.0.0
[0.5.0]: https://github.com/bourgeron-lab/pywombat/releases/tag/v0.5.0