pywombat 0.4.0__py3-none-any.whl → 1.0.0__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- pywombat/cli.py +929 -32
- pywombat-1.0.0.dist-info/METADATA +638 -0
- pywombat-1.0.0.dist-info/RECORD +6 -0
- pywombat-0.4.0.dist-info/METADATA +0 -142
- pywombat-0.4.0.dist-info/RECORD +0 -6
- {pywombat-0.4.0.dist-info → pywombat-1.0.0.dist-info}/WHEEL +0 -0
- {pywombat-0.4.0.dist-info → pywombat-1.0.0.dist-info}/entry_points.txt +0 -0
pywombat-0.4.0.dist-info/METADATA
DELETED
@@ -1,142 +0,0 @@
-Metadata-Version: 2.4
-Name: pywombat
-Version: 0.4.0
-Summary: A CLI tool for processing and filtering bcftools tabulated TSV files with pedigree support
-Project-URL: Homepage, https://github.com/bourgeron-lab/pywombat
-Project-URL: Repository, https://github.com/bourgeron-lab/pywombat
-Project-URL: Issues, https://github.com/bourgeron-lab/pywombat/issues
-Author-email: Freddy Cliquet <fcliquet@pasteur.fr>
-License: MIT
-Keywords: bioinformatics,genomics,pedigree,variant-calling,vcf
-Classifier: Development Status :: 3 - Alpha
-Classifier: Intended Audience :: Science/Research
-Classifier: Programming Language :: Python :: 3
-Classifier: Programming Language :: Python :: 3.12
-Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
-Requires-Python: >=3.12
-Requires-Dist: click>=8.1.0
-Requires-Dist: polars>=0.19.0
-Requires-Dist: pyyaml>=6.0
-Description-Content-Type: text/markdown
-
-# PyWombat
-
-A CLI tool for processing bcftools tabulated TSV files.
-
-## Installation
-
-This is a UV-managed Python package. To install:
-
-```bash
-uv sync
-```
-
-## Usage
-
-The `wombat` command processes bcftools tabulated TSV files:
-
-```bash
-# Format a bcftools TSV file and print to stdout
-wombat input.tsv
-
-# Format and save to output file (creates output.tsv by default)
-wombat input.tsv -o output
-
-# Format and save as parquet
-wombat input.tsv -o output -f parquet
-wombat input.tsv -o output --format parquet
-
-# Format with pedigree information to add parent genotypes
-wombat input.tsv --pedigree pedigree.tsv -o output
-```
-
-### What does `wombat` do?
-
-The `wombat` command processes bcftools tabulated TSV files by:
-
-1. **Expanding the `(null)` column**: This column contains multiple fields in the format `NAME=value` separated by semicolons (e.g., `DP=30;AF=0.5;AC=2`). Each field is extracted into its own column.
-
-2. **Preserving the `CSQ` column**: The CSQ (Consequence) column is preserved as-is and not melted, allowing VEP annotations to remain intact.
-
-3. **Melting and splitting sample columns**: After the `(null)` column, there are typically sample columns with values in `GT:DP:GQ:AD` format. The tool:
-- Extracts the sample name (the part before the first `:` character)
-- Transforms the wide format into long format
-- Creates a `sample` column with the sample names
-- Splits the sample values into separate columns:
-- `sample_gt`: Genotype (e.g., 0/1, 1/1)
-- `sample_dp`: Read depth
-- `sample_gq`: Genotype quality
-- `sample_ad`: Allele depth (takes the second value from comma-separated list)
-- `sample_vaf`: Variant allele frequency (calculated as sample_ad / sample_dp)
-
-### Example
-
-**Input:**
-
-```tsv
-CHROM POS REF ALT (null) Sample1:GT:Sample1:DP:Sample1:GQ:Sample1:AD Sample2:GT:Sample2:DP:Sample2:GQ:Sample2:AD
-chr1 100 A T DP=30;AF=0.5;AC=2 0/1:15:99:5,10 1/1:18:99:0,18
-```
-
-**Output:**
-
-```tsv
-CHROM POS REF ALT AC AF DP sample sample_gt sample_dp sample_gq sample_ad sample_vaf
-chr1 100 A T 2 0.5 30 Sample1 0/1 15 99 10 0.6667
-chr1 100 A T 2 0.5 30 Sample2 1/1 18 99 18 1.0
-```
-
-Notes:
-
-- The `sample_ad` column contains the second value from the AD field (e.g., from `5,10` it extracts `10`)
-- The `sample_vaf` column is the variant allele frequency calculated as `sample_ad / sample_dp`
-- By default, output is in TSV format. Use `-f parquet` to output as Parquet files
-- The `-o` option specifies an output prefix (e.g., `-o output` creates `output.tsv` or `output.parquet`)
-
-### Pedigree Support
-
-You can provide a pedigree file with the `--pedigree` option to add parent genotype information to the output. This enables trio analysis by including the father's and mother's genotypes for each sample.
-
-**Pedigree File Format:**
-
-The pedigree file should be a tab-separated file with the following columns:
-
-- `FID`: Family ID
-- `sample_id`: Sample identifier (matches the sample names in the VCF)
-- `FatherBarcode`: Father's sample identifier (use `0` or `-9` if unknown)
-- `MotherBarcode`: Mother's sample identifier (use `0` or `-9` if unknown)
-- `Sex`: Sex of the sample (optional)
-- `Pheno`: Phenotype information (optional)
-
-Example pedigree file:
-
-```tsv
-FID sample_id FatherBarcode MotherBarcode Sex Pheno
-FAM1 Child1 Father1 Mother1 1 2
-FAM1 Father1 0 0 1 1
-FAM1 Mother1 0 0 2 1
-```
-
-**Output with Pedigree:**
-
-When using `--pedigree`, the output will include additional columns for each parent:
-
-- `father_gt`, `father_dp`, `father_gq`, `father_ad`, `father_vaf`: Father's genotype information
-- `mother_gt`, `mother_dp`, `mother_gq`, `mother_ad`, `mother_vaf`: Mother's genotype information
-
-These columns will contain the parent's genotype data for the same variant, allowing you to analyze inheritance patterns.
-
-## Development
-
-This project uses:
-
-- **UV** for package management
-- **Polars** for fast data processing
-- **Click** for CLI interface
-
-## Testing
-
-Test files are available in the `tests/` directory:
-
-- `test.tabulated.tsv` - Real bcftools output
-- `test_small.tsv` - Small example for quick testing
pywombat-0.4.0.dist-info/RECORD
DELETED
@@ -1,6 +0,0 @@
-pywombat/__init__.py,sha256=iIPN9vJtsIUhl_DiKNnknxCamLinfayodLLFK8y-aJg,54
-pywombat/cli.py,sha256=dg38E39VpdJhKQt3aGSHwSiLWn1W8JnUkcsy3ZUHD5w,43518
-pywombat-0.4.0.dist-info/METADATA,sha256=ZKPTIp9ud2AIVbcujg4ciq900DX-UkGs5oafa41jxTQ,4982
-pywombat-0.4.0.dist-info/WHEEL,sha256=WLgqFyCfm_KASv4WHyYy0P3pM_m7J5L9k2skdKLirC8,87
-pywombat-0.4.0.dist-info/entry_points.txt,sha256=Vt7U2ypbiEgCBlEV71ZPk287H5_HKmPBT4iBu6duEcE,44
-pywombat-0.4.0.dist-info/RECORD,,
File without changes
File without changes