pywombat 0.4.0__py3-none-any.whl → 1.0.0__py3-none-any.whl
- pywombat/cli.py +929 -32
- pywombat-1.0.0.dist-info/METADATA +638 -0
- pywombat-1.0.0.dist-info/RECORD +6 -0
- pywombat-0.4.0.dist-info/METADATA +0 -142
- pywombat-0.4.0.dist-info/RECORD +0 -6
- {pywombat-0.4.0.dist-info → pywombat-1.0.0.dist-info}/WHEEL +0 -0
- {pywombat-0.4.0.dist-info → pywombat-1.0.0.dist-info}/entry_points.txt +0 -0
|
@@ -0,0 +1,638 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: pywombat
|
|
3
|
+
Version: 1.0.0
|
|
4
|
+
Summary: A CLI tool for processing and filtering bcftools tabulated TSV files with pedigree support
|
|
5
|
+
Project-URL: Homepage, https://github.com/bourgeron-lab/pywombat
|
|
6
|
+
Project-URL: Repository, https://github.com/bourgeron-lab/pywombat
|
|
7
|
+
Project-URL: Issues, https://github.com/bourgeron-lab/pywombat/issues
|
|
8
|
+
Author-email: Freddy Cliquet <fcliquet@pasteur.fr>
|
|
9
|
+
License: MIT
|
|
10
|
+
Keywords: bioinformatics,genomics,pedigree,variant-calling,vcf
|
|
11
|
+
Classifier: Development Status :: 5 - Production/Stable
|
|
12
|
+
Classifier: Intended Audience :: Science/Research
|
|
13
|
+
Classifier: Programming Language :: Python :: 3
|
|
14
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
15
|
+
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
|
|
16
|
+
Requires-Python: >=3.12
|
|
17
|
+
Requires-Dist: click>=8.1.0
|
|
18
|
+
Requires-Dist: polars>=0.19.0
|
|
19
|
+
Requires-Dist: pyyaml>=6.0
|
|
20
|
+
Requires-Dist: tqdm>=4.67.1
|
|
21
|
+
Description-Content-Type: text/markdown
|
|
22
|
+
|
|
23
|
+
# PyWombat 🦘
|
|
24
|
+
|
|
25
|
+
A high-performance CLI tool for processing and filtering bcftools tabulated TSV files with advanced filtering capabilities, pedigree support, and de novo mutation detection.
|
|
26
|
+
|
|
27
|
+
[Python 3.12+](https://www.python.org/downloads/)
|
|
28
|
+
[License: MIT](https://opensource.org/licenses/MIT)
|
|
29
|
+
|
|
30
|
+
## Features
|
|
31
|
+
|
|
32
|
+
✨ **Fast Processing**: Uses Polars for efficient data handling
|
|
33
|
+
🔬 **Quality Filtering**: Configurable depth, quality, and VAF thresholds
|
|
34
|
+
👨👩👧 **Pedigree Support**: Trio and family analysis with parent genotypes
|
|
35
|
+
🧬 **De Novo Detection**: Sex-chromosome-aware DNM identification
|
|
36
|
+
📊 **Flexible Output**: TSV, compressed TSV, or Parquet formats
|
|
37
|
+
🎯 **Expression Filters**: Complex filtering with logical expressions
|
|
38
|
+
⚡ **Streaming Mode**: Memory-efficient processing of large files
|
|
39
|
+
|
|
40
|
+
---
|
|
41
|
+
|
|
42
|
+
## Quick Start
|
|
43
|
+
|
|
44
|
+
### One-Time Usage (Recommended for Most Users)
|
|
45
|
+
|
|
46
|
+
Use `uvx` to run PyWombat without installation:
|
|
47
|
+
|
|
48
|
+
```bash
|
|
49
|
+
# Basic formatting
|
|
50
|
+
uvx pywombat input.tsv -o output
|
|
51
|
+
|
|
52
|
+
# With filtering
|
|
53
|
+
uvx pywombat input.tsv -F examples/rare_variants_high_impact.yml -o output
|
|
54
|
+
|
|
55
|
+
# De novo mutation detection
|
|
56
|
+
uvx pywombat input.tsv --pedigree pedigree.tsv \
|
|
57
|
+
-F examples/de_novo_mutations.yml -o denovo
|
|
58
|
+
```
|
|
59
|
+
|
|
60
|
+
### Installation for Development/Repeated Use
|
|
61
|
+
|
|
62
|
+
```bash
|
|
63
|
+
# Clone the repository
|
|
64
|
+
git clone https://github.com/bourgeron-lab/pywombat.git
|
|
65
|
+
cd pywombat
|
|
66
|
+
|
|
67
|
+
# Install with uv
|
|
68
|
+
uv sync
|
|
69
|
+
|
|
70
|
+
# Run with uv run
|
|
71
|
+
uv run wombat input.tsv -o output
|
|
72
|
+
```
|
|
73
|
+
|
|
74
|
+
---
|
|
75
|
+
|
|
76
|
+
## What Does PyWombat Do?
|
|
77
|
+
|
|
78
|
+
PyWombat transforms bcftools tabulated TSV files into analysis-ready formats by:
|
|
79
|
+
|
|
80
|
+
1. **Expanding the `(null)` INFO column**: Extracts all `NAME=value` fields (e.g., `DP=30;AF=0.5;AC=2`) into separate columns
|
|
81
|
+
2. **Melting sample columns**: Converts wide-format sample data into long format with one row per variant-sample combination
|
|
82
|
+
3. **Extracting genotype data**: Parses `GT:DP:GQ:AD` format into separate columns with calculated VAF
|
|
83
|
+
4. **Adding parent data**: Joins father/mother genotypes when pedigree is provided
|
|
84
|
+
5. **Applying filters**: Quality filters, expression-based filters, or de novo detection
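Step 1 above can be pictured with a minimal sketch. This is not PyWombat's implementation; the function name `expand_info` and the handling of flag-style entries (no `=`) are assumptions made for illustration only.

```python
from typing import Dict, Optional


def expand_info(info: str) -> Dict[str, Optional[str]]:
    """Split a bcftools INFO string into a {name: value} mapping."""
    fields: Dict[str, Optional[str]] = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue  # skip empty entries and the VCF missing-value marker
        name, sep, value = item.partition("=")
        # Flag-style entries (no "=") are stored with a None value; this is an
        # assumption of the sketch, not documented PyWombat behavior.
        fields[name] = value if sep else None
    return fields


row = expand_info("DP=30;AF=0.5;AC=2")
print(row)  # {'DP': '30', 'AF': '0.5', 'AC': '2'}
```

Each key would become one output column; values are left as strings here, so type casting is a separate concern.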
|
|
85
|
+
|
|
86
|
+
### Example Transformation
|
|
87
|
+
|
|
88
|
+
**Input (Wide Format):**
|
|
89
|
+
|
|
90
|
+
```tsv
|
|
91
|
+
#CHROM POS REF ALT (null) Sample1:GT:DP:GQ:AD Sample2:GT:DP:GQ:AD
|
|
92
|
+
chr1 100 A T DP=30;AF=0.5;AC=2 0/1:15:99:5,10 1/1:18:99:0,18
|
|
93
|
+
```
|
|
94
|
+
|
|
95
|
+
**Output (Long Format):**
|
|
96
|
+
|
|
97
|
+
```tsv
|
|
98
|
+
#CHROM POS REF ALT AC AF DP sample sample_gt sample_dp sample_gq sample_ad sample_vaf
|
|
99
|
+
chr1 100 A T 2 0.5 30 Sample1 0/1 15 99 10 0.6667
|
|
100
|
+
chr1 100 A T 2 0.5 30 Sample2 1/1 18 99 18 1.0
|
|
101
|
+
```
|
|
102
|
+
|
|
103
|
+
**Generated Columns:**
|
|
104
|
+
|
|
105
|
+
- `sample`: Sample identifier
|
|
106
|
+
- `sample_gt`: Genotype (e.g., 0/1, 1/1)
|
|
107
|
+
- `sample_dp`: Read depth (total coverage)
|
|
108
|
+
- `sample_gq`: Genotype quality score
|
|
109
|
+
- `sample_ad`: Alternate allele depth (from second value in AD field)
|
|
110
|
+
- `sample_vaf`: Variant allele frequency (sample_ad / sample_dp)
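As a rough illustration of how the generated columns relate to one `GT:DP:GQ:AD` cell, the sketch below parses a single sample value and derives the VAF as `sample_ad / sample_dp`. The names `SampleCall` and `parse_sample` are hypothetical, and treating non-numeric values as missing is an assumption rather than documented behavior.

```python
from typing import NamedTuple, Optional


class SampleCall(NamedTuple):
    gt: str
    dp: Optional[int]
    gq: Optional[int]
    ad: Optional[int]      # alternate allele depth (second AD value)
    vaf: Optional[float]   # ad / dp, or None when it cannot be computed


def _to_int(text: str) -> Optional[int]:
    # "." (missing) and any other non-numeric text become None
    return int(text) if text.isdigit() else None


def parse_sample(value: str) -> SampleCall:
    """Parse one 'GT:DP:GQ:AD' cell, e.g. '0/1:15:99:5,10'."""
    gt, dp_s, gq_s, ad_s = value.split(":")
    dp, gq = _to_int(dp_s), _to_int(gq_s)
    ad_parts = ad_s.split(",")
    ad = _to_int(ad_parts[1]) if len(ad_parts) > 1 else None
    # Guard against missing values and zero depth
    vaf = ad / dp if ad is not None and dp else None
    return SampleCall(gt, dp, gq, ad, vaf)


call = parse_sample("0/1:15:99:5,10")
print(call.ad, round(call.vaf, 4))  # 10 0.6667
```

The printed values match the `Sample1` row of the example output above.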
|
|
111
|
+
|
|
112
|
+
---
|
|
113
|
+
|
|
114
|
+
## Basic Usage
|
|
115
|
+
|
|
116
|
+
### Format Without Filtering
|
|
117
|
+
|
|
118
|
+
```bash
|
|
119
|
+
# Output to file
|
|
120
|
+
uvx pywombat input.tsv -o output
|
|
121
|
+
|
|
122
|
+
# Output to stdout (useful for piping)
|
|
123
|
+
uvx pywombat input.tsv
|
|
124
|
+
|
|
125
|
+
# Compressed output
|
|
126
|
+
uvx pywombat input.tsv -o output -f tsv.gz
|
|
127
|
+
|
|
128
|
+
# Parquet format (fastest for large files)
|
|
129
|
+
uvx pywombat input.tsv -o output -f parquet
|
|
130
|
+
|
|
131
|
+
# With verbose output
|
|
132
|
+
uvx pywombat input.tsv -o output --verbose
|
|
133
|
+
```
|
|
134
|
+
|
|
135
|
+
### With Pedigree (Trio/Family Analysis)
|
|
136
|
+
|
|
137
|
+
Add parent genotype information for inheritance analysis:
|
|
138
|
+
|
|
139
|
+
```bash
|
|
140
|
+
uvx pywombat input.tsv --pedigree pedigree.tsv -o output
|
|
141
|
+
```
|
|
142
|
+
|
|
143
|
+
**Pedigree File Format** (tab-separated):
|
|
144
|
+
|
|
145
|
+
```tsv
|
|
146
|
+
FID sample_id FatherBarcode MotherBarcode Sex Pheno
|
|
147
|
+
FAM1 Child1 Father1 Mother1 1 2
|
|
148
|
+
FAM1 Father1 0 0 1 1
|
|
149
|
+
FAM1 Mother1 0 0 2 1
|
|
150
|
+
```
|
|
151
|
+
|
|
152
|
+
- `FID`: Family ID
|
|
153
|
+
- `sample_id`: Sample name (must match VCF)
|
|
154
|
+
- `FatherBarcode`: Father's sample name (0 = unknown)
|
|
155
|
+
- `MotherBarcode`: Mother's sample name (0 = unknown)
|
|
156
|
+
- `Sex`: 1=male, 2=female (or M/F)
|
|
157
|
+
- `Pheno`: 1=unaffected, 2=affected
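To make the pedigree columns concrete, here is a small sketch that reads the file format described above and maps each sample to its parents. The helper name `load_parents` is hypothetical; only the column names and the `0 = unknown` convention come from the format description.

```python
import csv
import io

MISSING_PARENT = "0"  # the pedigree format uses 0 for an unknown parent


def load_parents(handle) -> dict:
    """Map each sample_id to a (father, mother) tuple; None means unknown."""
    parents = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        father = row["FatherBarcode"]
        mother = row["MotherBarcode"]
        parents[row["sample_id"]] = (
            None if father == MISSING_PARENT else father,
            None if mother == MISSING_PARENT else mother,
        )
    return parents


example = io.StringIO(
    "FID\tsample_id\tFatherBarcode\tMotherBarcode\tSex\tPheno\n"
    "FAM1\tChild1\tFather1\tMother1\t1\t2\n"
    "FAM1\tFather1\t0\t0\t1\t1\n"
    "FAM1\tMother1\t0\t0\t2\t1\n"
)
parents = load_parents(example)
print(parents["Child1"])  # ('Father1', 'Mother1')
```

A mapping like this is what a join of father and mother genotypes onto each child row would need as its key.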
|
|
158
|
+
|
|
159
|
+
**Output with pedigree includes additional columns:**
|
|
160
|
+
|
|
161
|
+
- `father_gt`, `father_dp`, `father_gq`, `father_ad`, `father_vaf`
|
|
162
|
+
- `mother_gt`, `mother_dp`, `mother_gq`, `mother_ad`, `mother_vaf`
|
|
163
|
+
|
|
164
|
+
---
|
|
165
|
+
|
|
166
|
+
## Advanced Filtering
|
|
167
|
+
|
|
168
|
+
PyWombat supports two types of filtering:
|
|
169
|
+
|
|
170
|
+
1. **Expression-based filtering**: For rare variants, impact filtering, frequency filtering
|
|
171
|
+
2. **De novo mutation detection**: Specialized logic for identifying DNMs in trios/families
|
|
172
|
+
|
|
173
|
+
### 1. Rare Variant Filtering
|
|
174
|
+
|
|
175
|
+
Filter for ultra-rare, high-impact variants:
|
|
176
|
+
|
|
177
|
+
```bash
|
|
178
|
+
uvx pywombat input.tsv \
|
|
179
|
+
-F examples/rare_variants_high_impact.yml \
|
|
180
|
+
-o rare_variants
|
|
181
|
+
```
|
|
182
|
+
|
|
183
|
+
**Configuration** (`rare_variants_high_impact.yml`):
|
|
184
|
+
|
|
185
|
+
```yaml
quality:
  sample_dp_min: 10             # Minimum read depth
  sample_gq_min: 19             # Minimum genotype quality
  sample_vaf_het_min: 0.25      # Het VAF range: 25-75%
  sample_vaf_het_max: 0.75
  sample_vaf_homalt_min: 0.85   # Hom alt VAF ≥ 85%
  apply_to_parents: true        # Also filter parent genotypes
  filter_no_alt_allele: true    # Exclude 0/0 genotypes

expression: "VEP_CANONICAL = YES & VEP_IMPACT = HIGH & VEP_LoF = HC & VEP_LoF_flags = . & ( fafmax_faf95_max_genomes = null | fafmax_faf95_max_genomes <= 0.001 )"
```
|
|
197
|
+
|
|
198
|
+
**What this does:**
|
|
199
|
+
|
|
200
|
+
- Filters for canonical transcripts with HIGH impact
|
|
201
|
+
- Loss-of-function (LoF) with high confidence, no flags
|
|
202
|
+
- Ultra-rare: ≤0.1% frequency in gnomAD genomes
|
|
203
|
+
- Stringent quality: DP≥10, GQ≥19, appropriate VAF
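The quality thresholds in this configuration can be read as a per-genotype check. The sketch below is one plausible interpretation, not PyWombat's actual code: the function name `passes_quality`, the inclusive comparisons, and the set of genotype strings treated as het or hom-alt are all assumptions.

```python
# Thresholds copied from the rare-variant configuration above
CFG = {
    "sample_dp_min": 10,
    "sample_gq_min": 19,
    "sample_vaf_het_min": 0.25,
    "sample_vaf_het_max": 0.75,
    "sample_vaf_homalt_min": 0.85,
}

HET = {"0/1", "1/0", "0|1", "1|0"}
HOM_ALT = {"1/1", "1|1"}


def passes_quality(gt: str, dp: int, gq: int, vaf: float, cfg: dict = CFG) -> bool:
    """Return True if one sample genotype meets the depth, GQ and VAF limits."""
    if dp < cfg["sample_dp_min"] or gq < cfg["sample_gq_min"]:
        return False
    if gt in HET:
        return cfg["sample_vaf_het_min"] <= vaf <= cfg["sample_vaf_het_max"]
    if gt in HOM_ALT:
        return vaf >= cfg["sample_vaf_homalt_min"]
    # Remaining genotypes (0/0, missing, multi-allelic) are dropped here,
    # mirroring filter_no_alt_allele: true; the real handling may differ.
    return False


print(passes_quality("0/1", 15, 99, 10 / 15))  # True
print(passes_quality("0/1", 15, 99, 0.10))     # False
```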
|
|
204
|
+
|
|
205
|
+
### 2. De Novo Mutation Detection
|
|
206
|
+
|
|
207
|
+
Identify de novo mutations in trio data:
|
|
208
|
+
|
|
209
|
+
```bash
|
|
210
|
+
uvx pywombat input.tsv \
|
|
211
|
+
--pedigree pedigree.tsv \
|
|
212
|
+
-F examples/de_novo_mutations.yml \
|
|
213
|
+
-o denovo
|
|
214
|
+
```
|
|
215
|
+
|
|
216
|
+
**Configuration** (`de_novo_mutations.yml`):
|
|
217
|
+
|
|
218
|
+
```yaml
quality:
  sample_dp_min: 10
  sample_gq_min: 18
  sample_vaf_min: 0.20

dnm:
  enabled: true
  parent_dp_min: 10             # Parent quality thresholds
  parent_gq_min: 18
  parent_vaf_max: 0.02          # Max 2% VAF in parents

  sample_vaf_hemizygous_min: 0.85  # For X male, Y, and hom variants

  fafmax_faf95_max_genomes_max: 0.001  # Max 0.1% in gnomAD
  genomes_filters_pass_only: true      # Only PASS variants

  par_regions:                  # Pseudo-autosomal regions (GRCh38)
    grch38:
      PAR1:
        chrom: X
        start: 10000
        end: 2781479
      PAR2:
        chrom: X
        start: 155701383
        end: 156030895
```
|
|
246
|
+
|
|
247
|
+
**Sex-Chromosome Aware Logic:**
|
|
248
|
+
|
|
249
|
+
- **Autosomes & PAR**: Both parents must be 0/0 with VAF<2%
|
|
250
|
+
- **X in males (non-PAR)**: Mother must be 0/0, father not informative
|
|
251
|
+
- **Y in males**: Father must be 0/0, mother N/A
|
|
252
|
+
- **Hemizygous variants**: Require VAF ≥ 85%
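The rules above can be sketched as a small decision function that picks which parents have to be checked for a given locus. This is an illustration under stated assumptions, not the tool's code: the function names are hypothetical, female X calls are treated like autosomal calls (the list above does not say), and the PAR coordinates are the GRCh38 values from the configuration.

```python
# GRCh38 pseudo-autosomal regions on chrX, as listed in the configuration
PAR_GRCH38 = {"PAR1": (10000, 2781479), "PAR2": (155701383, 156030895)}


def in_par(chrom: str, pos: int) -> bool:
    """True if the position lies in a pseudo-autosomal region of chrX."""
    name = chrom.removeprefix("chr")
    return name == "X" and any(s <= pos <= e for s, e in PAR_GRCH38.values())


def parents_to_check(chrom: str, pos: int, child_is_male: bool) -> tuple:
    """Return which parents must be 0/0 for a de novo call at this locus."""
    name = chrom.removeprefix("chr")
    if child_is_male and name == "X" and not in_par(chrom, pos):
        return ("mother",)            # father is not informative
    if child_is_male and name == "Y":
        return ("father",)            # mother is not applicable
    return ("father", "mother")       # autosomes, PAR, and (assumed) female X


def is_candidate_dnm(chrom, pos, child_is_male, parent_calls, parent_vaf_max=0.02):
    """parent_calls maps 'father'/'mother' to a (genotype, vaf) tuple."""
    for who in parents_to_check(chrom, pos, child_is_male):
        gt, vaf = parent_calls[who]
        if gt != "0/0" or vaf >= parent_vaf_max:
            return False
    return True


print(parents_to_check("chrX", 5_000_000, True))  # ('mother',)
```

The child-side checks (depth, GQ, and the 85% hemizygous VAF floor) are left out to keep the sketch focused on parent selection.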
|
|
253
|
+
|
|
254
|
+
---
|
|
255
|
+
|
|
256
|
+
## Expression-Based Filtering
|
|
257
|
+
|
|
258
|
+
Create custom filters using logical expressions:
|
|
259
|
+
|
|
260
|
+
### Available Operators
|
|
261
|
+
|
|
262
|
+
- **Comparison**: `=`, `!=`, `<`, `>`, `<=`, `>=`
|
|
263
|
+
- **Logical**: `&` (AND), `|` (OR)
|
|
264
|
+
- **Grouping**: `(`, `)`
|
|
265
|
+
- **Null checks**: `= null`, `!= null`
|
|
266
|
+
|
|
267
|
+
### Example Expressions
|
|
268
|
+
|
|
269
|
+
```yaml
|
|
270
|
+
# High or moderate impact with low frequency
|
|
271
|
+
expression: "(VEP_IMPACT = HIGH | VEP_IMPACT = MODERATE) & gnomad_AF < 0.001"
|
|
272
|
+
|
|
273
|
+
# Canonical transcripts with CADD score
|
|
274
|
+
expression: "VEP_CANONICAL = YES & CADD_PHRED >= 20"
|
|
275
|
+
|
|
276
|
+
# Specific consequence types
|
|
277
|
+
expression: "VEP_Consequence = frameshift_variant | VEP_Consequence = stop_gained"
|
|
278
|
+
|
|
279
|
+
# Multiple criteria
|
|
280
|
+
expression: "VEP_IMPACT = HIGH & VEP_CANONICAL = YES & gnomad_AF < 0.01 & CADD_PHRED >= 25"
|
|
281
|
+
```
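For readers who want to reason about how such expressions evaluate, here is a small recursive-descent sketch that supports the operators listed above. It is not PyWombat's parser: the `evaluate` function, the precedence (`&` binding tighter than `|`), and the numeric-versus-string comparison rule are assumptions chosen for the illustration.

```python
import re

# Parentheses, logical operators, comparison operators, then bare words
TOKEN = re.compile(r"\(|\)|&|\||<=|>=|!=|=|<|>|[^\s()&|<>=!]+")


def _compare(actual, op, expected):
    if expected == "null":
        if op not in ("=", "!="):
            raise ValueError("null only supports = and !=")
        return (actual is None) == (op == "=")
    if actual is None:
        return False  # a missing value never satisfies a non-null comparison
    try:  # compare numerically when both sides look like numbers
        left, right = float(actual), float(expected)
    except ValueError:
        left, right = str(actual), expected
    return {
        "=": left == right, "!=": left != right,
        "<": left < right, ">": left > right,
        "<=": left <= right, ">=": left >= right,
    }[op]


def evaluate(expression: str, row: dict) -> bool:
    """Evaluate a filter expression against one row (column -> value)."""
    tokens = TOKEN.findall(expression)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        result = parse_and()
        while peek() == "|":
            take()
            rhs = parse_and()  # always parse the right side, then combine
            result = result or rhs
        return result

    def parse_and():
        result = parse_atom()
        while peek() == "&":
            take()
            rhs = parse_atom()
            result = result and rhs
        return result

    def parse_atom():
        if peek() == "(":
            take()
            result = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return result
        name, op, value = take(), take(), take()
        return _compare(row.get(name), op, value)

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result


row = {"VEP_IMPACT": "MODERATE", "gnomad_AF": 0.0002}
print(evaluate("(VEP_IMPACT = HIGH | VEP_IMPACT = MODERATE) & gnomad_AF < 0.001", row))  # True
```

Because `&` and `|` precedence is an assumption here, explicit parentheses (as in the first example expression) are the safer way to write mixed conditions.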
|
|
282
|
+
|
|
283
|
+
---
|
|
284
|
+
|
|
285
|
+
## Debug Mode
|
|
286
|
+
|
|
287
|
+
Inspect specific variants for troubleshooting:
|
|
288
|
+
|
|
289
|
+
```bash
|
|
290
|
+
uvx pywombat input.tsv \
|
|
291
|
+
-F config.yml \
|
|
292
|
+
--debug chr11:70486013
|
|
293
|
+
```
|
|
294
|
+
|
|
295
|
+
The debug output shows:

- All rows matching the position
- `VEP_SYMBOL`, if available
- All columns mentioned in the filter expression

This helps explain why a variant passes or fails the filters.
|
|
301
|
+
|
|
302
|
+
---
|
|
303
|
+
|
|
304
|
+
## Output Formats
|
|
305
|
+
|
|
306
|
+
### TSV (Default)
|
|
307
|
+
|
|
308
|
+
```bash
|
|
309
|
+
uvx pywombat input.tsv -o output # Creates output.tsv
|
|
310
|
+
uvx pywombat input.tsv -o output -f tsv # Same as above
|
|
311
|
+
```
|
|
312
|
+
|
|
313
|
+
### Compressed TSV
|
|
314
|
+
|
|
315
|
+
```bash
|
|
316
|
+
uvx pywombat input.tsv -o output -f tsv.gz # Creates output.tsv.gz
|
|
317
|
+
```
|
|
318
|
+
|
|
319
|
+
### Parquet (Fastest for Large Files)
|
|
320
|
+
|
|
321
|
+
```bash
|
|
322
|
+
uvx pywombat input.tsv -o output -f parquet # Creates output.parquet
|
|
323
|
+
```
|
|
324
|
+
|
|
325
|
+
**When to use Parquet:**
|
|
326
|
+
|
|
327
|
+
- Large cohorts (>100 samples)
|
|
328
|
+
- Downstream analysis with Polars/Pandas
|
|
329
|
+
- Need for fast repeated access
|
|
330
|
+
- Storage efficiency
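The three format sections above each state which file is created from the `-o` prefix. The mapping can be summarized in a short sketch; the helper `output_path` is hypothetical and only mirrors the file names given in the comments above.

```python
from pathlib import Path

# -f value -> file extension appended to the -o prefix
EXTENSIONS = {"tsv": ".tsv", "tsv.gz": ".tsv.gz", "parquet": ".parquet"}


def output_path(prefix: str, fmt: str = "tsv") -> Path:
    """Build the output file name for a given prefix and format."""
    try:
        return Path(prefix + EXTENSIONS[fmt])
    except KeyError:
        raise ValueError(f"unsupported format: {fmt!r}") from None


print(output_path("output"))             # output.tsv
print(output_path("output", "tsv.gz"))   # output.tsv.gz
print(output_path("output", "parquet"))  # output.parquet
```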
|
|
331
|
+
|
|
332
|
+
---
|
|
333
|
+
|
|
334
|
+
## Example Workflows
|
|
335
|
+
|
|
336
|
+
### 1. Rare Disease Gene Discovery
|
|
337
|
+
|
|
338
|
+
```bash
|
|
339
|
+
# Step 1: Filter for rare, high-impact variants
|
|
340
|
+
uvx pywombat cohort.tsv \
|
|
341
|
+
-F examples/rare_variants_high_impact.yml \
|
|
342
|
+
-o rare_variants
|
|
343
|
+
|
|
344
|
+
# Step 2: Further filter with gene list (in downstream analysis)
|
|
345
|
+
# Use the output TSV with your favorite tools (R, Python, etc.)
|
|
346
|
+
```
|
|
347
|
+
|
|
348
|
+
### 2. Autism Trio Analysis
|
|
349
|
+
|
|
350
|
+
```bash
|
|
351
|
+
# Identify de novo mutations in autism cohort
|
|
352
|
+
uvx pywombat autism_trios.tsv \
|
|
353
|
+
--pedigree autism_pedigree.tsv \
|
|
354
|
+
-F examples/de_novo_mutations.yml \
|
|
355
|
+
-o autism_denovo \
|
|
356
|
+
--verbose
|
|
357
|
+
|
|
358
|
+
# Review output for genes in autism risk lists
|
|
359
|
+
```
|
|
360
|
+
|
|
361
|
+
### 3. Multi-Family Rare Variant Analysis
|
|
362
|
+
|
|
363
|
+
```bash
|
|
364
|
+
# Process multiple families together
|
|
365
|
+
uvx pywombat families.tsv \
|
|
366
|
+
--pedigree families_pedigree.tsv \
|
|
367
|
+
-F examples/rare_variants_high_impact.yml \
|
|
368
|
+
-o families_rare_variants \
|
|
369
|
+
-f parquet # Parquet for fast downstream analysis
|
|
370
|
+
```
|
|
371
|
+
|
|
372
|
+
### 4. Custom Expression Filter
|
|
373
|
+
|
|
374
|
+
Create `custom_filter.yml`:
|
|
375
|
+
|
|
376
|
+
```yaml
quality:
  sample_dp_min: 15
  sample_gq_min: 20
  sample_vaf_het_min: 0.30
  sample_vaf_het_max: 0.70

expression: "VEP_IMPACT = HIGH & (gnomad_AF < 0.0001 | gnomad_AF = null)"
```
|
|
385
|
+
|
|
386
|
+
Apply:
|
|
387
|
+
|
|
388
|
+
```bash
|
|
389
|
+
uvx pywombat input.tsv -F custom_filter.yml -o output
|
|
390
|
+
```
|
|
391
|
+
|
|
392
|
+
---
|
|
393
|
+
|
|
394
|
+
## Input Requirements
|
|
395
|
+
|
|
396
|
+
### bcftools Tabulated Format
|
|
397
|
+
|
|
398
|
+
PyWombat expects TSV files created with bcftools:
|
|
399
|
+
|
|
400
|
+
```bash
|
|
401
|
+
# From VCF to tabulated TSV
|
|
402
|
+
bcftools query -HH -f '%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t%INFO[\t%GT:%DP:%GQ:%AD]\n' \
|
|
403
|
+
input.vcf.gz > input.tsv
|
|
404
|
+
|
|
405
|
+
# With specific INFO fields
|
|
406
|
+
bcftools query -HH \
|
|
407
|
+
-f '%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t%INFO/DP;%INFO/AF;%INFO/AC[\t%GT:%DP:%GQ:%AD]\n' \
|
|
408
|
+
input.vcf.gz > input.tsv
|
|
409
|
+
```
|
|
410
|
+
|
|
411
|
+
**Expected format:**
|
|
412
|
+
|
|
413
|
+
- Tab-separated values
|
|
414
|
+
- Header row with column names (use `-HH` for proper headers)
|
|
415
|
+
- `(null)` column containing INFO fields
|
|
416
|
+
- Sample columns in `GT:DP:GQ:AD` format (or similar)
|
|
417
|
+
- Optional FILTER column for quality control
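A quick way to check a file against the expected format is to inspect its header line before running the tool. The sketch below is a hypothetical pre-flight check, not part of PyWombat; the function name `split_header` is an assumption, and the error text simply echoes the wording listed under Troubleshooting.

```python
INFO_COLUMN = "(null)"  # header bcftools prints for the %INFO column


def split_header(header_line: str):
    """Return (fixed_columns, sample_names) from a bcftools header line."""
    columns = header_line.rstrip("\n").split("\t")
    if INFO_COLUMN not in columns:
        raise ValueError(f"Column '{INFO_COLUMN}' not found")
    fixed, samples = [], []
    for col in columns:
        name, sep, fmt = col.partition(":")
        if sep and fmt.startswith("GT"):
            samples.append(name)   # e.g. 'Sample1:GT:DP:GQ:AD'
        else:
            fixed.append(col)
    return fixed, samples


header = "#CHROM\tPOS\tREF\tALT\t(null)\tSample1:GT:DP:GQ:AD\tSample2:GT:DP:GQ:AD\n"
fixed, samples = split_header(header)
print(samples)  # ['Sample1', 'Sample2']
```

Running a check like this on the first line of `input.tsv` catches a missing INFO column or malformed sample headers before any filtering starts.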
|
|
418
|
+
|
|
419
|
+
### VEP Annotations
|
|
420
|
+
|
|
421
|
+
For expression-based filtering on VEP annotations, a two-step process is required:
|
|
422
|
+
|
|
423
|
+
#### Step 1: Split VEP CSQ Field
|
|
424
|
+
|
|
425
|
+
**IMPORTANT**: PyWombat requires VEP annotations to be split into individual fields with the `VEP_` prefix:
|
|
426
|
+
|
|
427
|
+
```bash
|
|
428
|
+
# Split VEP CSQ field - creates one row per transcript/consequence
|
|
429
|
+
# This prefixes all CSQ columns with "VEP_" (e.g., VEP_SYMBOL, VEP_IMPACT)
|
|
430
|
+
bcftools +split-vep -c - -p VEP_ -O b -o annotated.split.bcf input.vcf.gz
|
|
431
|
+
```
|
|
432
|
+
|
|
433
|
+
#### Step 2: Convert to Tabulated Format
|
|
434
|
+
|
|
435
|
+
```bash
|
|
436
|
+
# Extract to TSV with genotype information
|
|
437
|
+
bcftools query -HH \
|
|
438
|
+
-f '%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t%INFO[\t%GT:%DP:%GQ:%AD]\n' \
|
|
439
|
+
annotated.split.bcf > annotated.tsv
|
|
440
|
+
```
|
|
441
|
+
|
|
442
|
+
#### Complete VEP Workflow
|
|
443
|
+
|
|
444
|
+
```bash
|
|
445
|
+
# 1. Annotate with VEP
|
|
446
|
+
vep -i input.vcf.gz \
|
|
447
|
+
--cache --offline \
|
|
448
|
+
--format vcf \
|
|
449
|
+
--vcf \
|
|
450
|
+
--everything \
|
|
451
|
+
--canonical \
|
|
452
|
+
--plugin LoF \
|
|
453
|
+
-o annotated.vcf.gz
|
|
454
|
+
|
|
455
|
+
# 2. Split VEP CSQ field (REQUIRED for PyWombat)
|
|
456
|
+
bcftools +split-vep -c - -p VEP_ -O b -o annotated.split.bcf annotated.vcf.gz
|
|
457
|
+
|
|
458
|
+
# 3. Convert to tabulated format
|
|
459
|
+
bcftools query -HH \
|
|
460
|
+
-f '%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t%INFO[\t%GT:%DP:%GQ:%AD]\n' \
|
|
461
|
+
annotated.split.bcf > annotated.tsv
|
|
462
|
+
|
|
463
|
+
# 4. Process with PyWombat
|
|
464
|
+
uvx pywombat annotated.tsv -F examples/rare_variants_high_impact.yml -o output
|
|
465
|
+
```
|
|
466
|
+
|
|
467
|
+
**Why split-vep is required:**
|
|
468
|
+
|
|
469
|
+
- Creates one row per transcript/consequence (instead of all in one CSQ field)
|
|
470
|
+
- Prefixes VEP fields with `VEP_` making them accessible in expressions
|
|
471
|
+
- Enables filtering on `VEP_CANONICAL`, `VEP_IMPACT`, `VEP_LoF`, etc.
|
|
472
|
+
|
|
473
|
+
#### Pipeline Optimization
|
|
474
|
+
|
|
475
|
+
For production workflows, these commands can be piped together:
|
|
476
|
+
|
|
477
|
+
```bash
|
|
478
|
+
# Efficient pipeline (single pass through data)
|
|
479
|
+
bcftools +split-vep -c - -p VEP_ input.vcf.gz | \
|
|
480
|
+
bcftools query -HH -f '%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t%INFO[\t%GT:%DP:%GQ:%AD]\n' | \
|
|
481
|
+
uvx pywombat - -F config.yml -o output
|
|
482
|
+
```
|
|
483
|
+
|
|
484
|
+
**Note**: For multiple filter configurations, it's more efficient to save the intermediate TSV file rather than regenerating it each time.
|
|
485
|
+
|
|
486
|
+
### gnomAD Annotations
|
|
487
|
+
|
|
488
|
+
For frequency filtering, annotate with gnomAD:
|
|
489
|
+
|
|
490
|
+
```bash
|
|
491
|
+
# Using bcftools annotate
|
|
492
|
+
bcftools annotate -a gnomad.genomes.vcf.gz \
|
|
493
|
+
-c INFO/AF,INFO/fafmax_faf95_max \
|
|
494
|
+
input.vcf.gz -o annotated.vcf.gz
|
|
495
|
+
```
|
|
496
|
+
|
|
497
|
+
---
|
|
498
|
+
|
|
499
|
+
## Configuration Examples
|
|
500
|
+
|
|
501
|
+
See the [`examples/`](examples/) directory for complete configuration files:
|
|
502
|
+
|
|
503
|
+
- **[`rare_variants_high_impact.yml`](examples/rare_variants_high_impact.yml)**: Ultra-rare, high-impact variants
|
|
504
|
+
- **[`de_novo_mutations.yml`](examples/de_novo_mutations.yml)**: De novo mutation detection with sex-chromosome handling
|
|
505
|
+
|
|
506
|
+
Each configuration file is fully documented with:
|
|
507
|
+
|
|
508
|
+
- Parameter descriptions
|
|
509
|
+
- Use case recommendations
|
|
510
|
+
- Customization tips
|
|
511
|
+
- Example command lines
|
|
512
|
+
|
|
513
|
+
---
|
|
514
|
+
|
|
515
|
+
## Performance Tips
|
|
516
|
+
|
|
517
|
+
1. **Use streaming mode** (default): Efficient for most workflows
|
|
518
|
+
2. **Parquet output**: Faster for large files and repeated analysis
|
|
519
|
+
3. **Pre-filter with bcftools**: Filter by region/gene before PyWombat
|
|
520
|
+
4. **Compressed input**: PyWombat handles `.gz` files natively
|
|
521
|
+
5. **Filter early**: Apply quality filters before complex expression filters
|
|
522
|
+
|
|
523
|
+
---
|
|
524
|
+
|
|
525
|
+
## Development
|
|
526
|
+
|
|
527
|
+
### Setup
|
|
528
|
+
|
|
529
|
+
```bash
|
|
530
|
+
# Clone repository
|
|
531
|
+
git clone https://github.com/bourgeron-lab/pywombat.git
|
|
532
|
+
cd pywombat
|
|
533
|
+
|
|
534
|
+
# Install dependencies
|
|
535
|
+
uv sync
|
|
536
|
+
|
|
537
|
+
# Run tests (if available)
|
|
538
|
+
uv run pytest
|
|
539
|
+
```
|
|
540
|
+
|
|
541
|
+
### Project Structure
|
|
542
|
+
|
|
543
|
+
```
pywombat/
├── src/pywombat/
│   ├── __init__.py
│   └── cli.py              # Main CLI implementation
├── examples/               # Configuration examples
│   ├── README.md
│   ├── rare_variants_high_impact.yml
│   └── de_novo_mutations.yml
├── tests/                  # Test files and data
├── pyproject.toml          # Project metadata
└── README.md
```
|
|
556
|
+
|
|
557
|
+
### Technology Stack
|
|
558
|
+
|
|
559
|
+
- **[Polars](https://pola.rs/)**: Fast DataFrame operations
|
|
560
|
+
- **[Click](https://click.palletsprojects.com/)**: CLI interface
|
|
561
|
+
- **[PyYAML](https://pyyaml.org/)**: Configuration parsing
|
|
562
|
+
- **[uv](https://github.com/astral-sh/uv)**: Package management
|
|
563
|
+
|
|
564
|
+
---
|
|
565
|
+
|
|
566
|
+
## Troubleshooting
|
|
567
|
+
|
|
568
|
+
### Common Issues
|
|
569
|
+
|
|
570
|
+
**Issue**: `Column '(null)' not found`
|
|
571
|
+
|
|
572
|
+
- **Solution**: Ensure input is bcftools tabulated format with INFO column
|
|
573
|
+
|
|
574
|
+
**Issue**: `No parent genotypes found`
|
|
575
|
+
|
|
576
|
+
- **Solution**: Check pedigree file format and sample name matching
|
|
577
|
+
|
|
578
|
+
**Issue**: `DNM filter requires pedigree`
|
|
579
|
+
|
|
580
|
+
- **Solution**: Add `--pedigree` option when using de novo config
|
|
581
|
+
|
|
582
|
+
**Issue**: Variants missing from output
|
|
583
|
+
|
|
584
|
+
- **Solution**: Use `--debug chr:pos` to see why variants are filtered
|
|
585
|
+
|
|
586
|
+
**Issue**: Memory errors on large files
|
|
587
|
+
|
|
588
|
+
- **Solution**: Files are processed in streaming mode by default; if issues persist, pre-filter with bcftools
|
|
589
|
+
|
|
590
|
+
### Getting Help
|
|
591
|
+
|
|
592
|
+
1. Check `--help` for command options: `uvx pywombat --help`
|
|
593
|
+
2. Review example configurations in [`examples/`](examples/)
|
|
594
|
+
3. Use `--debug` mode to inspect specific variants
|
|
595
|
+
4. Use `--verbose` to see filtering steps
|
|
596
|
+
|
|
597
|
+
---
|
|
598
|
+
|
|
599
|
+
## Citation
|
|
600
|
+
|
|
601
|
+
If you use PyWombat in your research, please cite:
|
|
602
|
+
|
|
603
|
+
```
|
|
604
|
+
PyWombat: A tool for processing and filtering bcftools tabulated files
|
|
605
|
+
Bourgeron Lab, Institut Pasteur
|
|
606
|
+
https://github.com/bourgeron-lab/pywombat
|
|
607
|
+
```
|
|
608
|
+
|
|
609
|
+
---
|
|
610
|
+
|
|
611
|
+
## License
|
|
612
|
+
|
|
613
|
+
MIT License - see [LICENSE](LICENSE) file for details
|
|
614
|
+
|
|
615
|
+
---
|
|
616
|
+
|
|
617
|
+
## Contributing
|
|
618
|
+
|
|
619
|
+
Contributions are welcome! Please:
|
|
620
|
+
|
|
621
|
+
1. Fork the repository
|
|
622
|
+
2. Create a feature branch
|
|
623
|
+
3. Make your changes
|
|
624
|
+
4. Submit a pull request
|
|
625
|
+
|
|
626
|
+
---
|
|
627
|
+
|
|
628
|
+
## Authors
|
|
629
|
+
|
|
630
|
+
- **Freddy Cliquet** - [Bourgeron Lab](https://research.pasteur.fr/en/team/human-genetics-and-cognitive-functions/), Institut Pasteur
|
|
631
|
+
|
|
632
|
+
---
|
|
633
|
+
|
|
634
|
+
## Acknowledgments
|
|
635
|
+
|
|
636
|
+
- Bourgeron Lab for project support
|
|
637
|
+
- Institut Pasteur for infrastructure
|
|
638
|
+
- Polars team for the excellent DataFrame library
|
|
@@ -0,0 +1,6 @@
|
|
|
1
|
+
pywombat/__init__.py,sha256=iIPN9vJtsIUhl_DiKNnknxCamLinfayodLLFK8y-aJg,54
|
|
2
|
+
pywombat/cli.py,sha256=FK1bEKtFD1Drp5LNdXaVie4zyjYbZc3wTbsjms-wISU,74176
|
|
3
|
+
pywombat-1.0.0.dist-info/METADATA,sha256=bIm5-Az795PLluvA_6yBPcHkcq6EOZbvB_g-4jPjx_U,16828
|
|
4
|
+
pywombat-1.0.0.dist-info/WHEEL,sha256=WLgqFyCfm_KASv4WHyYy0P3pM_m7J5L9k2skdKLirC8,87
|
|
5
|
+
pywombat-1.0.0.dist-info/entry_points.txt,sha256=Vt7U2ypbiEgCBlEV71ZPk287H5_HKmPBT4iBu6duEcE,44
|
|
6
|
+
pywombat-1.0.0.dist-info/RECORD,,
|