ferromic 0.1.0__cp310-cp310-manylinux_2_34_x86_64.whl

ferromic/__init__.py ADDED
@@ -0,0 +1,5 @@
1
+ from .ferromic import *
2
+
3
+ __doc__ = ferromic.__doc__
4
+ if hasattr(ferromic, "__all__"):
5
+     __all__ = ferromic.__all__
ferromic-0.1.0.dist-info/METADATA ADDED
@@ -0,0 +1,234 @@
1
+ Metadata-Version: 2.4
2
+ Name: ferromic
3
+ Version: 0.1.0
4
+ Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
5
+
6
+ # Ferromic ⚙️
7
+ Ferromic is a collection of bioinformatics tools, written in Rust, for working with VCF files. It currently includes file merging and region statistics calculation.
8
+
9
+ # VCF Statistics Calculator 📊
10
+
11
+ Welcome to the **VCF Statistics Calculator**, a Rust-based tool designed to compute **Watterson's Theta (θ)** and **Pi (π)** (and others, coming soon) for genomic regions defined in VCF (Variant Call Format) files.
12
+
13
+ ---
14
+
15
+ ## Features ✨
16
+
17
+ - **Calculate Genetic Diversity Metrics**: Compute Watterson's Theta (θ) and Pi (π) for specified regions.
18
+ - **Haplotype Group Analysis**: Separate calculations for haplotypes with and without a structural variant class (such as inversions).
19
+ - **Flexible Input Handling**: Supports configuration via TSV files for multiple regions and haplotype groupings (e.g., by presence of structural variant).
20
+ - **Filtering**: Filter variants based on genotype quality (GQ) scores and predefined genomic masks.
21
+ - **Output**: Generates CSV files with statistical metrics for each genomic region.
22
+
23
+ ---
24
+
25
+ ## Background 🧬
26
+
27
+ This tool processes VCF files to calculate **Watterson's Theta (θ)** and **Pi (π)**, two metrics of genetic diversity within genomic regions. Using a TSV configuration file, users can define multiple regions and categorize haplotypes by structural variant (SV) status, for example inverted versus non-inverted, across regions and samples. Note that the input VCFs must include every site that differs between samples.
28
+
29
+ **Metrics**:
30
+ - **Watterson's Theta (θ)**: Based on the number of segregating (polymorphic) sites.
31
+ - **Pi (π)**: Measures nucleotide diversity, based on pairwise per-site nucleotide differences.
32
+
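A minimal Python sketch of the two metrics, using their standard textbook definitions rather than anything taken from ferromic's Rust source. The function names, the per-site normalization by sequence length, and the encoding of haplotypes as equal-length allele strings over the variable sites are assumptions made for illustration.

```python
from itertools import combinations


def watterson_theta(num_segregating_sites: int, num_haplotypes: int, seq_length: int) -> float:
    """Per-site Watterson's theta: S / (a_n * L), where a_n = sum_{i=1}^{n-1} 1/i."""
    if num_haplotypes < 2 or seq_length <= 0:
        # Assumption: undefined cases are reported as infinity.
        return float("inf")
    a_n = sum(1.0 / i for i in range(1, num_haplotypes))
    return num_segregating_sites / (a_n * seq_length)


def nucleotide_diversity(haplotypes: list[str], seq_length: int) -> float:
    """Per-site pi: mean number of pairwise differences divided by L."""
    n = len(haplotypes)
    if n < 2 or seq_length <= 0:
        return float("inf")
    total_diffs = sum(
        sum(a != b for a, b in zip(h1, h2))
        for h1, h2 in combinations(haplotypes, 2)
    )
    num_pairs = n * (n - 1) / 2
    return total_diffs / (num_pairs * seq_length)
```

Here each string in `haplotypes` holds one allele character per variable site, so three haplotypes `"00"`, `"01"`, `"11"` describe two segregating sites.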
33
+ ---
34
+
35
+ ## Installation 🛠️
36
+
37
+ The easiest way to install the project is to run this command in your terminal:
38
+ ```
39
+ curl -sSL https://github.com/ScottSauers/ferromic/raw/main/install.sh | bash
40
+ ```
41
+
42
+ Or, you can download the binary from the releases section.
43
+
44
+ Alternatively, with [Rust](https://www.rust-lang.org/tools/install) and Cargo installed, you can clone the repository and build the project from source.
45
+
46
+ ---
47
+
48
+ ## Usage 🚀
49
+
50
+ ### Command-Line Arguments
51
+
52
+ ```bash
53
+ vcf_stats_calculator -v <VCF_FOLDER> \
54
+ -c <CONFIG_FILE> \
55
+ -o <OUTPUT_CSV> \
56
+ --min_gq <MIN_GQ> \
57
+ --mask_file <MASK_FILE> \
58
+ -h <CHR> \
59
+ -r <REGION>
60
+ ```
61
+
62
+ **Parameters**:
63
+
64
+ - `-v`, `--vcf_folder`: **(Required)** Path to the directory containing VCF files.
65
+ - `-c`, `--config_file`: **(Optional)** Path to the TSV configuration file defining regions and haplotype groupings.
66
+ - `-o`, `--output_file`: **(Optional)** Path for the output CSV file containing statistical results. Defaults to `output.csv` if not specified.
67
+ - `--min_gq`: **(Optional)** Minimum genotype quality (GQ) Phred score for filtering variants. Defaults to `30`.
68
+ - `--mask_file`: **(Optional)** Path to the BED file specifying genomic regions to mask (filter out).
69
+ - `-h`, `--chr`: **(Optional)** Chromosome name to process when not using a config file.
70
+ - `-r`, `--region`: **(Optional)** Specific region to process within the chromosome, in the format `start-end` (e.g., `10732039-23685112`).
71
+
72
+ **Notes**:
73
+ - Either `--config_file` or both `--chr` and `--region` must be provided.
74
+ - When using `--config_file`, the tool can process multiple regions and haplotype groupings as defined in the TSV.
75
+ - When not using a config file, the tool will process the specified chromosome and region and output results to the console.
76
+
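The rule that either a config file or both a chromosome and a region must be supplied can be sketched as a small Python helper that assembles the argument list. The helper name `build_command` and its keyword names are hypothetical; only the long flags listed above are used, and the binary is assumed to be on `PATH`.

```python
from typing import Optional


def build_command(
    vcf_folder: str,
    config_file: Optional[str] = None,
    chrom: Optional[str] = None,
    region: Optional[str] = None,
    output_file: Optional[str] = None,
    min_gq: Optional[int] = None,
    mask_file: Optional[str] = None,
) -> list[str]:
    """Assemble an argument list from the documented long flags."""
    # Either a config file, or both a chromosome and a region, must be given.
    if config_file is None and (chrom is None or region is None):
        raise ValueError("provide config_file, or both chrom and region")

    cmd = ["vcf_stats_calculator", "--vcf_folder", vcf_folder]
    optional_flags = [
        ("--config_file", config_file),
        ("--chr", chrom),
        ("--region", region),
        ("--output_file", output_file),
        ("--min_gq", min_gq),
        ("--mask_file", mask_file),
    ]
    for flag, value in optional_flags:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`.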
77
+ ### Input Files
78
+
79
+ #### VCF File 🧬
80
+
81
+ - **Format**: [VCF v4.2](https://samtools.github.io/hts-specs/VCFv4.2.pdf)
82
+ - **Contents**: Variant data including positions, alleles, and genotype information for multiple samples.
83
+ - **Genotype Format**: Must include `GT` (genotype) and `GQ` (genotype quality) fields.
84
+
85
+ **Example**:
86
+ ```vcf
87
+ ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Phred scaled genotype quality computed by whatshap genotyping algorithm.">
88
+ #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2
89
+ chr1 1500 . A T . PASS . GT:GQ 0|0:35 0|1:40
90
+ ```
91
+
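As an illustration of how the `GT` and `GQ` values pair up with the `FORMAT` column, here is a short Python sketch that parses one tab-delimited VCF data line. The function name and the returned dictionary layout are assumptions; this is not ferromic's parser.

```python
def parse_vcf_record(line: str) -> dict:
    """Split one VCF data line and map FORMAT keys onto each sample's values."""
    fields = line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")  # e.g. ["GT", "GQ"]
    samples = [dict(zip(format_keys, sample.split(":"))) for sample in fields[9:]]
    return {
        "chrom": fields[0],
        "pos": int(fields[1]),
        "ref": fields[3],
        "alt": fields[4],
        "samples": samples,
    }


# The example record above, written with explicit tab separators.
record = parse_vcf_record("chr1\t1500\t.\tA\tT\t.\tPASS\t.\tGT:GQ\t0|0:35\t0|1:40")
```

Values stay as strings here; a caller would convert `GQ` with `int(...)` before comparing it to a threshold.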
92
+ #### TSV Configuration File 📋
93
+
94
+ - **Purpose**: Defines multiple genomic regions and specifies haplotype groupings based on inversion statuses.
95
+ - **Structure**:
96
+ - **Columns**:
97
+ - `seqnames`: Chromosome name (e.g., `chr1`).
98
+ - `start`: Start position of the region.
99
+ - `end`: End position of the region.
100
+ - **Sample Columns**: Genotype information for each sample in the format `0|1`, `1|0`, `0|0`, `1|1`, etc.
101
+
102
+ **Example**:
103
+ ```tsv
104
+ seqnames start end POS orig_ID verdict categ NA19434 HG00036 HG00191 ...
105
+ chr1 13004251 13122531 13113384 chr1-13113384-INV-62181 pass inv 1|1 1|1 1|1 ...
106
+ ```
107
+
108
+ **Notes**:
109
+ - Genotypes beyond the standard `0|0`, `0|1`, `1|0`, `1|1` (e.g., `0|1_lowconf`) will be used only for the "unfiltered" outputs.
110
+ - Haplotype groupings (presence or absence) are determined by the values in the genotype columns, indicating, e.g., inversion (`1`) or direct (`0`) haplotypes.
111
+
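The grouping described in these notes can be sketched in Python as follows. The function name, the `(sample, side)` representation of a haplotype, and the choice to skip non-standard genotypes (the strict, "filtered" behavior) are assumptions for illustration.

```python
STANDARD_GENOTYPES = {"0|0", "0|1", "1|0", "1|1"}


def group_haplotypes(sample_genotypes: dict[str, str]) -> dict[int, list[tuple[str, int]]]:
    """Assign each sample's left (0) and right (1) haplotype to group 0 or group 1."""
    groups: dict[int, list[tuple[str, int]]] = {0: [], 1: []}
    for sample, genotype in sample_genotypes.items():
        if genotype not in STANDARD_GENOTYPES:
            continue  # strict mode: e.g. "0|1_lowconf" is treated as missing
        for side, allele in enumerate(genotype.split("|")):
            groups[int(allele)].append((sample, side))
    return groups


groups = group_haplotypes({"NA19434": "1|1", "HG00036": "0|1", "HG00191": "0|1_lowconf"})
```

In this example `NA19434` contributes both haplotypes to group `1`, while `HG00036` contributes one haplotype to each group.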
112
+ #### Mask File 🛡️
113
+
114
+ - **Format**: [BED](https://genome.ucsc.edu/FAQ/FAQformat.html#format1)
115
+ - **Contains**: Genomic regions to exclude from analysis (variants within these regions will be treated similarly to variants with low GQ scores).
116
+ - **Structure**:
117
+ - Three columns: `chromosome`, `start`, `end`.
118
+
119
+ **Example**:
120
+ ```bed
121
+ chr1 0 77102
122
+ chr1 88113 190752
123
+ ...
124
+ ```
125
+
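One way a mask can translate into an adjusted sequence length is to subtract the masked bases that overlap the region. The Python sketch below assumes that the region and the mask intervals lie on the same chromosome and share half-open `[start, end)` coordinates; the README does not state the coordinate convention ferromic uses, and the function name is hypothetical.

```python
def adjusted_length(region_start: int, region_end: int, masks: list[tuple[int, int]]) -> int:
    """Length of [region_start, region_end) minus bases covered by any mask interval."""
    # Keep only overlapping intervals, clipped to the region, in sorted order.
    clipped = sorted(
        (max(start, region_start), min(end, region_end))
        for start, end in masks
        if start < region_end and end > region_start
    )
    covered = 0
    cursor = region_start  # everything before the cursor is already counted
    for start, end in clipped:
        start = max(start, cursor)  # avoid double-counting overlapping masks
        if end > start:
            covered += end - start
            cursor = end
    return (region_end - region_start) - covered
```

With the two example intervals above and a hypothetical region `0-200000`, the masked portion is `77102 + 102639 = 179741` bases, leaving an adjusted length of `20259`.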
126
+ ---
127
+
128
+ ### Output File 📈
129
+
130
+ - **Format**: CSV
131
+ - **Filename**: As specified by the `--output_file` parameter.
132
+ - **Headers**:
133
+ ```
134
+ chr,region_start,region_end,0_sequence_length,1_sequence_length,0_sequence_length_adjusted,1_sequence_length_adjusted,0_segregating_sites,1_segregating_sites,0_w_theta,1_w_theta,0_pi,1_pi,0_segregating_sites_filtered,1_segregating_sites_filtered,0_w_theta_filtered,1_w_theta_filtered,0_pi_filtered,1_pi_filtered,0_num_hap_no_filter,1_num_hap_no_filter,0_num_hap_filter,1_num_hap_filter,inversion_freq_no_filter,inversion_freq_filter
135
+ ```
136
+
137
+ - **Column Descriptions**:
138
+ - `chr`: Chromosome name.
139
+ - `region_start`: Start position of the region.
140
+ - `region_end`: End position of the region.
141
+ - `0_sequence_length`: Total length of the sequence for haplotype group `0`.
142
+ - `1_sequence_length`: Total length of the sequence for haplotype group `1`.
143
+ - `0_sequence_length_adjusted`: Adjusted sequence length for haplotype group `0` after filtering.
144
+ - `1_sequence_length_adjusted`: Adjusted sequence length for haplotype group `1` after filtering.
145
+ - `0_segregating_sites`: Number of segregating sites (unfiltered) for haplotype group `0`.
146
+ - `1_segregating_sites`: Number of segregating sites (unfiltered) for haplotype group `1`.
147
+ - `0_w_theta`: Watterson's Theta (unfiltered) for haplotype group `0`.
148
+ - `1_w_theta`: Watterson's Theta (unfiltered) for haplotype group `1`.
149
+ - `0_pi`: Pi (unfiltered) for haplotype group `0`.
150
+ - `1_pi`: Pi (unfiltered) for haplotype group `1`.
151
+ - `0_segregating_sites_filtered`: Number of segregating sites (filtered) for haplotype group `0`.
152
+ - `1_segregating_sites_filtered`: Number of segregating sites (filtered) for haplotype group `1`.
153
+ - `0_w_theta_filtered`: Watterson's Theta (filtered) for haplotype group `0`.
154
+ - `1_w_theta_filtered`: Watterson's Theta (filtered) for haplotype group `1`.
155
+ - `0_pi_filtered`: Pi (filtered) for haplotype group `0`.
156
+ - `1_pi_filtered`: Pi (filtered) for haplotype group `1`.
157
+ - `0_num_hap_no_filter`: Number of haplotypes for group `0` before filtering.
158
+ - `1_num_hap_no_filter`: Number of haplotypes for group `1` before filtering.
159
+ - `0_num_hap_filter`: Number of haplotypes for group `0` after filtering.
160
+ - `1_num_hap_filter`: Number of haplotypes for group `1` after filtering.
161
+ - `inversion_freq_no_filter`: Allele frequency of inversion (1) before filtering.
162
+ - `inversion_freq_filter`: Allele frequency of inversion (1) after filtering.
163
+
164
+ - **Special Values**:
165
+ - `θ = 0`: No segregating sites; no genetic variation observed.
166
+ - `θ = Infinity (inf)`: Insufficient haplotypes or zero-length region; metrics undefined.
167
+ - `π = 0`: No nucleotide differences.
168
+ - `π = Infinity (inf)`: Insufficient data; metrics undefined.
169
+
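When reading the output CSV downstream, the `inf` special value needs to survive parsing. A minimal Python sketch using only the standard library is shown below; it assumes that every column except `chr` is numeric, and the sample row uses an abbreviated header with made-up values purely for illustration.

```python
import csv
import io
import math


def read_stats(handle) -> list[dict]:
    """Parse the output CSV, converting every column except `chr` to float."""
    rows = []
    for row in csv.DictReader(handle):
        parsed = {}
        for key, value in row.items():
            # float() accepts "inf", so undefined metrics become math.inf.
            parsed[key] = value if key == "chr" else float(value)
        rows.append(parsed)
    return rows


# Illustrative, abbreviated input; real files carry the full header shown above.
sample = io.StringIO("chr,region_start,region_end,0_w_theta,0_pi\nchr1,13004251,13122531,inf,0\n")
stats = read_stats(sample)
undefined = [row for row in stats if math.isinf(row["0_w_theta"])]
```

Rows collected in `undefined` can then be excluded or reported separately before computing summaries.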
170
+ ---
171
+
172
+ ## Filtering Mechanisms 🔍
173
+
174
+ ### Genotype Quality (GQ) Filtering
175
+
176
+ - **Purpose**: Exclude variants with low genotype quality.
177
+ - **Mechanism**:
178
+ - If any sample within a variant has a GQ score below the specified `--min_gq` threshold, the entire variant is excluded from **filtered** (but not unfiltered) analyses.
179
+ - Variants passing the GQ filter are included in both **unfiltered** and **filtered** analyses.
180
+
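The GQ rule above can be sketched in a few lines of Python. The function names and the `gqs` key are hypothetical, and the sketch does not cover how missing GQ values are handled, which the README does not specify.

```python
def passes_gq_filter(sample_gqs: list[int], min_gq: int = 30) -> bool:
    """A variant passes only if every sample's GQ is at least min_gq."""
    return all(gq >= min_gq for gq in sample_gqs)


def split_variants(variants: list[dict], min_gq: int = 30) -> tuple[list[dict], list[dict]]:
    """Return (unfiltered, filtered): all variants, and those passing the GQ filter."""
    unfiltered = list(variants)
    filtered = [v for v in unfiltered if passes_gq_filter(v["gqs"], min_gq)]
    return unfiltered, filtered
```

A variant with sample GQ values `[35, 29]` therefore stays in the unfiltered set but is dropped from the filtered set at the default threshold of `30`.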
181
+ ### Genotype Matching
182
+
183
+ - **Purpose**: Restrict **filtered** analyses to exact genotype matches (`0|0`, `0|1`, `1|0`, `1|1`).
184
+ - **Mechanism**:
185
+ - Genotypes not strictly matching the four expected formats (e.g., `0|1_lowconf`) are considered missing data and excluded from **filtered** analyses.
186
+ - **Unfiltered** analyses include all genotypes that can be parsed into valid formats based on the first three characters.
187
+
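The difference between strict and lenient genotype matching can be sketched in Python as follows. The function name and the `strict` flag are hypothetical; the lenient branch follows the description above by looking only at the first three characters.

```python
from typing import Optional

VALID_GENOTYPES = {"0|0", "0|1", "1|0", "1|1"}


def parse_genotype(raw: str, strict: bool) -> Optional[tuple[int, int]]:
    """Return (left, right) alleles, or None if the genotype counts as missing."""
    # Strict (filtered) mode needs an exact match; lenient (unfiltered) mode
    # only inspects the first three characters.
    candidate = raw if strict else raw[:3]
    if candidate not in VALID_GENOTYPES:
        return None
    left, right = candidate.split("|")
    return int(left), int(right)
```

Under this sketch, `0|1_lowconf` is missing data in strict mode but parses as `(0, 1)` in lenient mode.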
188
+ ### Masking
189
+
190
+ - **Purpose**: Exclude entire genomic regions from analysis based on predefined masks.
191
+ - **Mechanism**:
192
+ - Regions specified in the BED mask file are treated similarly to low GQ variants and excluded from **filtered** (but not unfiltered) analyses.
193
+ - Sequence lengths are adjusted to account for masked regions in statistic calculations.
194
+
195
+ ---
196
+
197
+ ## Common Warnings and Errors ⚠️
198
+
199
+ - **Missing Samples**: If certain samples defined in the configuration file are not found in the VCF, a warning is displayed with the missing samples.
200
+
201
+ - **Invalid Genotypes**: Genotypes not conforming to the expected formats (`0|0`, `0|1`, `1|0`, `1|1`) will be considered missing data. The number and percentage of invalid genotypes encountered will be shown.
202
+
203
+ - **Multi-allelic Sites**: Multi-allelic variants are not supported.
204
+
205
+ - **No Variants Found**: If no variants are found within the specified region or all variants are filtered out, a warning will be printed.
206
+
207
+ ---
208
+
209
+ ## Examples 🧪
210
+
211
+ ### Running the Tool with All Parameters
212
+
213
+ ```bash
214
+ cargo run --release --bin vcf_stats_calculator \
215
+ -v ../vcfs \
216
+ -c ../config/regions.tsv \
217
+ -o ../results/output_stats.csv \
218
+ --min_gq 30 \
219
+ --mask_file ../masks/hardmask.hg38.v4_acroANDsdOnly.over99.bed
220
+ ```
221
+
222
+ ### Running the Tool Without a Configuration File
223
+
224
+ If you prefer to calculate statistics for a specific chromosome and region without using a configuration file, you can run the tool with the `--chr` and `--region` flags. **Note:** In this mode, the results will be printed to the console rather than written to a CSV file.
225
+
226
+ ```bash
227
+ cargo run --release --bin vcf_stats_calculator \
228
+ -v ../vcfs \
229
+ --chr chr22 \
230
+ -r 10732039-23685112 \
231
+ --min_gq 30 \
232
+ --mask_file ../masks/hardmask.hg38.v4_acroANDsdOnly.over99.bed
233
+ ```
234
+
ferromic-0.1.0.dist-info/RECORD ADDED
@@ -0,0 +1,5 @@
1
+ ferromic-0.1.0.dist-info/METADATA,sha256=XZSQZK_qt_vkucqF8epG8bPocz-wrWdC1UDKMrZavPw,10957
2
+ ferromic-0.1.0.dist-info/WHEEL,sha256=T5N8wV_FFFFqNMyMMjqvVZVA4nBXclX-pkxhEjZDV9w,108
3
+ ferromic/__init__.py,sha256=aNX1Ggsg2XdYFZ-hNIAQzCzjz7dGwk9BCdmrwwYZKh0,115
4
+ ferromic/ferromic.cpython-310-x86_64-linux-gnu.so,sha256=zl3EMnJqMAXvpkHXHZViORDQsYkYu7hHw1w5Lvu_hsk,446368
5
+ ferromic-0.1.0.dist-info/RECORD,,
ferromic-0.1.0.dist-info/WHEEL ADDED
@@ -0,0 +1,4 @@
1
+ Wheel-Version: 1.0
2
+ Generator: maturin (1.8.2)
3
+ Root-Is-Purelib: false
4
+ Tag: cp310-cp310-manylinux_2_34_x86_64