pywombat 1.0.0__tar.gz → 1.0.1__tar.gz

@@ -0,0 +1,130 @@
+ # PyWombat Development Guide
+
+ ## Project Overview
+
+ PyWombat is a bioinformatics CLI tool that transforms bcftools tabulated TSV files into analysis-ready long-format data. It specializes in:
+ - Expanding VCF INFO fields from `(null)` column into individual columns
+ - Converting wide-format sample data to long-format (melting)
+ - Extracting and calculating genotype metrics (GT, DP, GQ, AD, VAF)
+ - De novo mutation (DNM) detection with sex-chromosome awareness
+ - Trio/family analysis with pedigree-based parent genotype joining
+
+ ## Architecture
+
+ ### Single-File Monolith
+ All functionality lives in [src/pywombat/cli.py](src/pywombat/cli.py) (~2100 lines). This is intentional for deployment simplicity with `uvx` (one-command installation-free execution).
+
+ ### Core Data Flow
+ 1. **Input**: bcftools tabulated TSV (wide format, one row per variant)
+ 2. **Transform Pipeline**:
+ - `format_bcftools_tsv_lazy()` → Expand INFO, melt samples, parse genotypes
+ - `apply_filters_lazy()` → Apply quality/expression filters OR DNM detection
+ 3. **Output**: Long-format TSV/Parquet (one row per variant-sample)
+
+ ### Key Functions
+ - **`format_bcftools_tsv()`**: Eager transformation (collects data)
+ - **`format_bcftools_tsv_lazy()`**: Streaming wrapper using Polars LazyFrame
+ - **`apply_filters_lazy()`**: Conditional filter dispatcher (quality/expression OR DNM)
+ - **`apply_de_novo_filter()`**: Complex vectorized DNM logic with PAR region handling
+ - **`parse_impact_filter_expression()`**: Expression parser for flexible YAML-based filtering
+
+ ## Development Workflow
+
+ ### Setup & Running
+ ```bash
+ # Install dependencies (uses uv package manager)
+ uv sync
+
+ # Run CLI locally
+ uv run wombat input.tsv -o output
+
+ # Production usage (no installation)
+ uvx pywombat input.tsv -o output
+ ```
+
+ ### Testing
+ - Test data: `tests/C0733-011-068.*` files (real variant data)
+ - Compare script: [tests/compare_dnm_results.py](tests/compare_dnm_results.py) validates DNM detection
+ - No formal test framework yet—manual validation against reference outputs in `tests/check/`
+
+ ## Critical Domain Logic
+
+ ### Genotype Parsing
+ - Input format: `GT:DP:GQ:AD` (e.g., `0/1:25:99:10,15`)
+ - Extracted fields:
+ - `sample_gt`: Genotype (0/1, 1/1, etc.)
+ - `sample_dp`: Total depth (25)
+ - `sample_gq`: Quality (99)
+ - `sample_ad`: **Second value** from AD (15 = alternate allele depth)
+ - `sample_vaf`: Calculated as `sample_ad / sample_dp` (0.6)
+
+ ### De Novo Mutation Detection
+ Sex-chromosome-aware logic in `apply_de_novo_filter()`:
+ - **Autosomes**: Both parents must have VAF < 2% (reference genotype)
+ - **X chromosome (male proband)**: Mother must be reference; father is hemizygous (ignored)
+ - **Y chromosome**: Only father checked (mother doesn't have Y)
+ - **PAR regions**: Treated as autosomal (both parents checked)
+ - **Hemizygous variants** (X/Y in males): Require VAF ≥ 85% (not ~50% like het)
+
+ PAR regions defined in [examples/de_novo_mutations.yml](examples/de_novo_mutations.yml):
+ ```yaml
+ par_regions:
+ GRCh38:
+ PAR1: { chrom: "X", start: 10001, end: 2781479 }
+ PAR2: { chrom: "X", start: 155701383, end: 156030895 }
+ ```
+
+ ## Conventions & Patterns
+
+ ### Polars Usage
+ - **Lazy evaluation**: Use `scan_csv()` → `LazyFrame` → `sink_csv()` for streaming
+ - **Streaming mode**: `collect(streaming=True)` for memory-efficient processing
+ - **Schema overrides**: Force string type for sample IDs to prevent numeric inference:
+ ```python
+ schema_overrides = {col: pl.Utf8 for col in ["sample", "FatherBarcode", ...]}
+ ```
+
+ ### YAML Configuration Files
+ Filter configs in [examples/](examples/) define:
+ - `quality`: Thresholds (DP, GQ, VAF ranges for het/hom)
+ - `expression`: Polars-compatible filter expressions (see `parse_impact_filter_expression()`)
+ - `dnm`: De novo detection parameters (parent thresholds, PAR regions)
+
+ Example expression syntax:
+ ```yaml
+ expression: "VEP_IMPACT = HIGH & fafmax_faf95_max_genomes <= 0.001"
+ ```
+ Supports: `=`, `!=`, `<=`, `>=`, `<`, `>`, `&`, `|`, `()`, NULL, NaN
+
+ ### Error Handling
+ - Use `click.echo(..., err=True)` for verbose messages (stderr)
+ - Raise `click.Abort()` instead of generic exceptions for CLI errors
+ - Validate pedigree/config requirements early (e.g., DNM mode requires pedigree)
+
+ ## Common Pitfalls
+
+ 1. **VAF Calculation**: Always use `sample_ad / sample_dp`, not first AD value
+ 2. **Genotype Filtering**: Use `str.contains("1")` not regex—handles 0/1, 1/0, 1/1, 1/2
+ 3. **Parent Joining**: Pedigree must have `sample_id`, `FatherBarcode`, `MotherBarcode` columns
+ 4. **Sex Normalization**: DNM filter accepts "1"/"2" or "M"/"F" for sex; normalizes to uppercase
+ 5. **Categorical Types**: Convert `#CHROM` from Categorical to Utf8 before string operations
+
+ ## File References
+
+ - Main CLI: [src/pywombat/cli.py](src/pywombat/cli.py)
+ - Config examples: [examples/de_novo_mutations.yml](examples/de_novo_mutations.yml), [examples/rare_variants_high_impact.yml](examples/rare_variants_high_impact.yml)
+ - Test data: [tests/C0733-011-068.pedigree.tsv](tests/C0733-011-068.pedigree.tsv)
+ - Documentation: [README.md](README.md), [QUICKSTART.md](QUICKSTART.md)
+
+ ## Adding New Features
+
+ When adding filters:
+ 1. Update YAML schema in config examples
+ 2. Extend `apply_filters_lazy()` or create specialized function (like `apply_de_novo_filter()`)
+ 3. Use lazy operations when possible; collect only if vectorization requires it
+ 4. Add verbose logging with `--verbose` flag support
+
+ When modifying transforms:
+ - Keep `format_bcftools_tsv()` and `format_bcftools_tsv_lazy()` in sync
+ - Test with real data from `tests/` directory
+ - Verify output with `compare_dnm_results.py` for DNM changes
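
Note on the "Genotype Parsing" rules in the guide above: the sketch below illustrates them in plain Python on the guide's own example value. It is not the package's implementation, which works on Polars columns. The `Genotype` tuple and `parse_genotype()` helper are hypothetical names introduced here, and missing values such as `./.:.:.:.` are not handled.

```python
from typing import NamedTuple, Optional


class Genotype(NamedTuple):
    """Hypothetical container mirroring the sample_* columns in the guide."""

    sample_gt: str
    sample_dp: int
    sample_gq: int
    sample_ad: int  # alternate allele depth (second AD value)
    sample_vaf: Optional[float]


def parse_genotype(value: str) -> Genotype:
    """Split a GT:DP:GQ:AD string and derive VAF = alt depth / total depth."""
    gt, dp, gq, ad = value.split(":")
    alt_depth = int(ad.split(",")[1])  # second AD value, not the first
    depth = int(dp)
    vaf = alt_depth / depth if depth > 0 else None  # guard against DP = 0
    return Genotype(gt, depth, int(gq), alt_depth, vaf)


example = parse_genotype("0/1:25:99:10,15")
print(example)
```

With the guide's example, `sample_ad` is 15 and `sample_vaf` is 15 / 25 = 0.6. Taking the first AD value (10) instead would give 0.4, which is the pitfall listed under "VAF Calculation".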
@@ -5,6 +5,16 @@ All notable changes to PyWombat will be documented in this file.
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+ ## [1.0.1] - 2026-01-24
+
+ ### Added
+
+ - **Boolean Flag Support in INFO Fields**: INFO field entries without `=` signs (e.g., `PASS`, `DB`, `SOMATIC`) are now extracted as boolean columns with `True`/`False` values instead of being ignored. This enables filtering on VCF flag fields.
+
+ ### Fixed
+
+ - INFO field parsing now handles all field types correctly, including standalone boolean flags commonly used in VCF files.
+
  ## [1.0.0] - 2026-01-23
 
  First stable release of PyWombat! 🎉
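
Note on the 1.0.1 changelog entry above: the behaviour it describes (entries with `=` become value columns, entries without `=` become boolean flags) can be sketched in plain Python. This is an illustration under stated assumptions, not the package's code. `expand_info()` is a hypothetical helper, and values are kept as strings, as the updated docstring in this release says.

```python
def expand_info(info: str) -> tuple[dict[str, str], set[str]]:
    """Split a ';'-delimited INFO string into key-value fields and flags."""
    fields: dict[str, str] = {}
    flags: set[str] = set()
    for entry in info.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # skip empty entries such as a trailing ';'
        if "=" in entry:
            name, value = entry.split("=", 1)  # split on the first '=' only
            fields[name] = value
        else:
            flags.add(entry)  # no '=' means a standalone boolean flag
    return fields, flags


fields, flags = expand_info("DP=30;AF=0.5;PASS;AC=2")
print(fields, flags)
```

For the README's example row this yields the fields `DP`, `AF`, and `AC`, plus the single flag `PASS`. Before 1.0.1 the flag would have been ignored.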
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: pywombat
- Version: 1.0.0
+ Version: 1.0.1
  Summary: A CLI tool for processing and filtering bcftools tabulated TSV files with pedigree support
  Project-URL: Homepage, https://github.com/bourgeron-lab/pywombat
  Project-URL: Repository, https://github.com/bourgeron-lab/pywombat
@@ -35,6 +35,7 @@ A high-performance CLI tool for processing and filtering bcftools tabulated TSV
  🧬 **De Novo Detection**: Sex-chromosome-aware DNM identification
  📊 **Flexible Output**: TSV, compressed TSV, or Parquet formats
  🎯 **Expression Filters**: Complex filtering with logical expressions
+ 🏷️ **Boolean Flag Support**: INFO field flags (PASS, DB, etc.) extracted as True/False columns
  ⚡ **Streaming Mode**: Memory-efficient processing of large files
 
  ---
@@ -77,7 +78,7 @@ uv run wombat input.tsv -o output
 
  PyWombat transforms bcftools tabulated TSV files into analysis-ready formats by:
 
- 1. **Expanding the `(null)` INFO column**: Extracts all `NAME=value` fields (e.g., `DP=30;AF=0.5;AC=2`) into separate columns
+ 1. **Expanding the `(null)` INFO column**: Extracts all `NAME=value` fields (e.g., `DP=30;AF=0.5;AC=2`) and boolean flags (e.g., `PASS`, `DB`) into separate columns
  2. **Melting sample columns**: Converts wide-format sample data into long format with one row per variant-sample combination
  3. **Extracting genotype data**: Parses `GT:DP:GQ:AD` format into separate columns with calculated VAF
  4. **Adding parent data**: Joins father/mother genotypes when pedigree is provided
@@ -88,20 +89,22 @@ PyWombat transforms bcftools tabulated TSV files into analysis-ready formats by:
  **Input (Wide Format):**
 
  ```tsv
- #CHROM POS REF ALT (null) Sample1:GT:DP:GQ:AD Sample2:GT:DP:GQ:AD
- chr1 100 A T DP=30;AF=0.5;AC=2 0/1:15:99:5,10 1/1:18:99:0,18
+ #CHROM POS REF ALT (null) Sample1:GT:DP:GQ:AD Sample2:GT:DP:GQ:AD
+ chr1 100 A T DP=30;AF=0.5;PASS;AC=2 0/1:15:99:5,10 1/1:18:99:0,18
  ```
 
  **Output (Long Format):**
 
  ```tsv
- #CHROM POS REF ALT AC AF DP sample sample_gt sample_dp sample_gq sample_ad sample_vaf
- chr1 100 A T 2 0.5 30 Sample1 0/1 15 99 10 0.6667
- chr1 100 A T 2 0.5 30 Sample2 1/1 18 99 18 1.0
+ #CHROM POS REF ALT AC AF DP PASS sample sample_gt sample_dp sample_gq sample_ad sample_vaf
+ chr1 100 A T 2 0.5 30 true Sample1 0/1 15 99 10 0.6667
+ chr1 100 A T 2 0.5 30 true Sample2 1/1 18 99 18 1.0
  ```
 
  **Generated Columns:**
 
+ - INFO fields with `=`: Extracted as separate columns (e.g., `DP`, `AF`, `AC`)
+ - INFO boolean flags: Extracted as True/False columns (e.g., `PASS`, `DB`, `SOMATIC`)
  - `sample`: Sample identifier
  - `sample_gt`: Genotype (e.g., 0/1, 1/1)
  - `sample_dp`: Read depth (total coverage)
@@ -13,6 +13,7 @@ A high-performance CLI tool for processing and filtering bcftools tabulated TSV
  🧬 **De Novo Detection**: Sex-chromosome-aware DNM identification
  📊 **Flexible Output**: TSV, compressed TSV, or Parquet formats
  🎯 **Expression Filters**: Complex filtering with logical expressions
+ 🏷️ **Boolean Flag Support**: INFO field flags (PASS, DB, etc.) extracted as True/False columns
  ⚡ **Streaming Mode**: Memory-efficient processing of large files
 
  ---
@@ -55,7 +56,7 @@ uv run wombat input.tsv -o output
 
  PyWombat transforms bcftools tabulated TSV files into analysis-ready formats by:
 
- 1. **Expanding the `(null)` INFO column**: Extracts all `NAME=value` fields (e.g., `DP=30;AF=0.5;AC=2`) into separate columns
+ 1. **Expanding the `(null)` INFO column**: Extracts all `NAME=value` fields (e.g., `DP=30;AF=0.5;AC=2`) and boolean flags (e.g., `PASS`, `DB`) into separate columns
  2. **Melting sample columns**: Converts wide-format sample data into long format with one row per variant-sample combination
  3. **Extracting genotype data**: Parses `GT:DP:GQ:AD` format into separate columns with calculated VAF
  4. **Adding parent data**: Joins father/mother genotypes when pedigree is provided
@@ -66,20 +67,22 @@ PyWombat transforms bcftools tabulated TSV files into analysis-ready formats by:
  **Input (Wide Format):**
 
  ```tsv
- #CHROM POS REF ALT (null) Sample1:GT:DP:GQ:AD Sample2:GT:DP:GQ:AD
- chr1 100 A T DP=30;AF=0.5;AC=2 0/1:15:99:5,10 1/1:18:99:0,18
+ #CHROM POS REF ALT (null) Sample1:GT:DP:GQ:AD Sample2:GT:DP:GQ:AD
+ chr1 100 A T DP=30;AF=0.5;PASS;AC=2 0/1:15:99:5,10 1/1:18:99:0,18
  ```
 
  **Output (Long Format):**
 
  ```tsv
- #CHROM POS REF ALT AC AF DP sample sample_gt sample_dp sample_gq sample_ad sample_vaf
- chr1 100 A T 2 0.5 30 Sample1 0/1 15 99 10 0.6667
- chr1 100 A T 2 0.5 30 Sample2 1/1 18 99 18 1.0
+ #CHROM POS REF ALT AC AF DP PASS sample sample_gt sample_dp sample_gq sample_ad sample_vaf
+ chr1 100 A T 2 0.5 30 true Sample1 0/1 15 99 10 0.6667
+ chr1 100 A T 2 0.5 30 true Sample2 1/1 18 99 18 1.0
  ```
 
  **Generated Columns:**
 
+ - INFO fields with `=`: Extracted as separate columns (e.g., `DP`, `AF`, `AC`)
+ - INFO boolean flags: Extracted as True/False columns (e.g., `PASS`, `DB`, `SOMATIC`)
  - `sample`: Sample identifier
  - `sample_gt`: Genotype (e.g., 0/1, 1/1)
  - `sample_dp`: Read depth (total coverage)
@@ -1,6 +1,6 @@
  [project]
  name = "pywombat"
- version = "1.0.0"
+ version = "1.0.1"
  description = "A CLI tool for processing and filtering bcftools tabulated TSV files with pedigree support"
  readme = "README.md"
  authors = [{ name = "Freddy Cliquet", email = "fcliquet@pasteur.fr" }]
@@ -1313,6 +1313,10 @@ def format_expand_annotations(df: pl.DataFrame) -> pl.DataFrame:
      This is a separate step that can be applied after filtering to avoid
      expensive annotation expansion on variants that will be filtered out.
 
+     Handles two types of INFO fields:
+     - Key-value pairs (e.g., "DP=30") -> extracted as string values
+     - Boolean flags (e.g., "PASS", "DB") -> created as True/False columns
+
      Args:
          df: DataFrame with (null) column
 
@@ -1324,9 +1328,10 @@ def format_expand_annotations(df: pl.DataFrame) -> pl.DataFrame:
          # Already expanded or missing - return as-is
          return df
 
-     # Extract all unique field names from the (null) column
+     # Extract all unique field names and flags from the (null) column
      null_values = df.select("(null)").to_series()
      all_fields = set()
+     all_flags = set()
 
      for value in null_values:
          if value and not (isinstance(value, float)):  # Skip null/NaN values
@@ -1335,8 +1340,10 @@ def format_expand_annotations(df: pl.DataFrame) -> pl.DataFrame:
                  if "=" in pair:
                      field_name = pair.split("=", 1)[0]
                      all_fields.add(field_name)
+                 elif pair.strip():  # Boolean flag (no '=')
+                     all_flags.add(pair.strip())
 
-     # Create expressions to extract each field
+     # Create expressions to extract each key-value field
      for field in sorted(all_fields):
          # Extract the field value from the (null) column
          # Pattern: extract value after "field=" and before ";" or end of string
@@ -1344,6 +1351,14 @@ def format_expand_annotations(df: pl.DataFrame) -> pl.DataFrame:
              pl.col("(null)").str.extract(f"{field}=([^;]+)").alias(field)
          )
 
+     # Create boolean columns for flags
+     for flag in sorted(all_flags):
+         # Check if flag appears in the (null) column (as whole word)
+         # Use regex to match flag as a separate field (not part of another field name)
+         df = df.with_columns(
+             pl.col("(null)").str.contains(f"(^|;){flag}(;|$)").alias(flag)
+         )
+
      # Drop the original (null) column
      df = df.drop("(null)")
 
File without changes
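
Note on the flag-matching pattern added in `format_expand_annotations()` above: the `(^|;){flag}(;|$)` regex only matches a flag that forms a complete `;`-delimited entry. The sketch below reproduces that check with Python's standard `re` module as a stand-in for the Polars `str.contains` call. `has_flag()` is a hypothetical helper. The `re.escape()` call is an addition made here, not part of the diff: it is a precaution in case a flag name ever contains a regex metacharacter such as `.` or `+`, since the diff interpolates the flag name unescaped.

```python
import re


def has_flag(info: str, flag: str) -> bool:
    """Return True if `flag` is a complete ';'-delimited entry of `info`."""
    # Anchor on start-of-string or ';' before, and ';' or end-of-string after,
    # so that "DB" does not match inside "DBSNP=1" or "XDB".
    pattern = f"(^|;){re.escape(flag)}(;|$)"
    return re.search(pattern, info) is not None


print(has_flag("DP=30;DB;AC=2", "DB"))   # flag present as its own entry
print(has_flag("DP=30;DBSNP=1", "DB"))   # only a prefix of another field
```

The first call matches because `DB` sits between two `;` separators. The second does not, because the character after `DB` is `S` rather than `;` or the end of the string.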