PyPI - ferromic - Versions diffs - 0.1.0__cp310-cp310-manylinux_2_34_x86_64.whl - Mend

ferromic 0.1.0__cp310-cp310-manylinux_2_34_x86_64.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of ferromic might be problematic. Click here for more details.

Files changed (5) hide show

ferromic/__init__.py +5 -0
ferromic/ferromic.cpython-310-x86_64-linux-gnu.so +0 -0
ferromic-0.1.0.dist-info/METADATA +234 -0
ferromic-0.1.0.dist-info/RECORD +5 -0
ferromic-0.1.0.dist-info/WHEEL +4 -0

ferromic/__init__.py ADDED Viewed

@@ -0,0 +1,5 @@
+from .ferromic import *
+__doc__ = ferromic.__doc__
+if hasattr(ferromic, "__all__"):
+    __all__ = ferromic.__all__

ferromic/ferromic.cpython-310-x86_64-linux-gnu.so ADDED Viewed

Binary file

ferromic-0.1.0.dist-info/METADATA ADDED Viewed

@@ -0,0 +1,234 @@
+Metadata-Version: 2.4
+Name: ferromic
+Version: 0.1.0
+Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
+# Ferromic ⚙️
+Ferromic is a place for multiple bioinformatics tools for VCFs written in Rust. Right now, this includes file merging and region statistics calculation.
+# VCF Statistics Calculator 📊
+Welcome to the **VCF Statistics Calculator**, a Rust-based tool designed to compute **Watterson's Theta (θ)** and **Pi (π)** (and others, coming soon) for genomic regions defined in VCF (Variant Call Format) files.
+---
+## Features ✨
+- **Calculate Genetic Diversity Metrics**: Compute Watterson's Theta (θ) and Pi (π) for specified regions.
+- **Haplotype Group Analysis**: Separate calculations for haplotypes with and without a structural variant class (such as inversions).
+- **Flexible Input Handling**: Supports configuration via TSV files for multiple regions and haplotype groupings (e.g., by presence of structural variant).
+- **Filtering**: Filter variants based on genotype quality (GQ) scores and predefined genomic masks.
+- **Output**: Generates CSV files with statistical metrics for each genomic region.
+---
+## Background 🧬
+This tool processes VCF files to calculate **Watterson's Theta (θ)** and **Pi (π)**, metrics for understanding genetic diversity within genomic regions. By using a TSV configuration file, users can define multiple regions and categorize haplotypes based on SV (e.g. inversion) statuses. This allows for distinguishing between inverted and non-inverted haplotypes across different regions and samples. Note that if you are using VCFs, all sites which differ between samples must be included.
+**Metrics**:
+- **Watterson's Theta (θ)**: Based on the number of segregating (polymorphic) sites.
+- **Pi (π)**: Measures nucleotide diversity, based on pairwise per-site nucleotide differences.
+---
+## Installation 🛠️
+The easiest way to install the project is to run this command in your terminal:
+```
+curl -sSL https://github.com/ScottSauers/ferromic/raw/main/install.sh | bash
+```
+Or, you can download the binary from the releases section.
+Or, make sure you have [Rust](https://www.rust-lang.org/tools/install) and Cargo installed, and you can clone the repository and build the project.
+---
+## Usage 🚀
+### Command-Line Arguments
+```bash
+vcf_stats_calculator -v <VCF_FOLDER> \
+                    -c <CONFIG_FILE> \
+                    -o <OUTPUT_CSV> \
+                    --min_gq <MIN_GQ> \
+                    --mask_file <MASK_FILE> \
+                    -h <CHR> \
+                    -r <REGION>
+```
+**Parameters**:
+- `-v`, `--vcf_folder`: **(Required)** Path to the directory containing VCF files.
+- `-c`, `--config_file`: **(Optional)** Path to the TSV configuration file defining regions and haplotype groupings.
+- `-o`, `--output_file`: **(Optional)** Path for the output CSV file containing statistical results. Defaults to `output.csv` if not specified.
+- `--min_gq`: **(Optional)** Minimum genotype quality (GQ) Phred score for filtering variants. Defaults to `30`.
+- `--mask_file`: **(Optional)** Path to the BED file specifying genomic regions to mask (filter out).
+- `-h`, `--chr`: **(Optional)** Chromosome name to process when not using a config file.
+- `-r`, `--region`: **(Optional)** Specific region to process within the chromosome, in the format `start-end` (e.g., `10732039-23685112`).
+**Notes**:
+- Either `--config_file` or both `--chr` and `--region` must be provided.
+- When using `--config_file`, the tool can process multiple regions and haplotype groupings as defined in the TSV.
+- When not using a config file, the tool will process the specified chromosome and region and output results to the console.
+### Input Files
+#### VCF File 🧬
+- **Format**: [VCF v4.2](https://samtools.github.io/hts-specs/VCFv4.2.pdf)
+- **Contents**: Variant data including positions, alleles, and genotype information for multiple samples.
+- **Genotype Format**: Must include `GT` (genotype) and `GQ` (genotype quality) fields.
+**Example**:
+```vcf
+##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Phred scaled genotype quality computed by whatshap genotyping algorithm.">
+#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2
+chr1 1500 . A T . PASS . GT:GQ 0|0:35 0|1:40
+```
+#### TSV Configuration File 📋
+- **Purpose**: Defines multiple genomic regions and specifies haplotype groupings based on inversion statuses.
+- **Structure**:
+    - **Columns**:
+        - `seqnames`: Chromosome name (e.g., `chr1`).
+        - `start`: Start position of the region.
+        - `end`: End position of the region.
+        - **Sample Columns**: Genotype information for each sample in the format `0|1`, `1|0`, `0|0`, `1|1`, etc.
+**Example**:
+```tsv
+seqnames	start	end	POS	orig_ID	verdict	categ	NA19434	HG00036	HG00191	...
+chr1	13004251	13122531	13113384	chr1-13113384-INV-62181	pass	inv	1|1	1|1	1|1	...
+```
+**Notes**:
+- Genotypes beyond the standard `0|0`, `0|1`, `1|0`, `1|1` (e.g., `0|1_lowconf`) will be used only for the "unfiltered" outputs.
+- Haplotype groupings (presence or absence) are determined by the values in the genotype columns, indicating, e.g., inversion (`1`) or direct (`0`) haplotypes.
+#### Mask File 🛡️
+- **Format**: [BED](https://genome.ucsc.edu/FAQ/FAQformat.html#format1)
+- **Contains**: Genomic regions to exclude from analysis (variants within the regions will betreated similarly to variants with low GQ scores).
+- **Structure**:
+    - Three columns: `chromosome`, `start`, `end`.
+**Example**:
+```bed
+chr1	0	77102
+chr1	88113	190752
+...
+```
+---
+### Output File 📈
+- **Format**: CSV
+- **Filename**: As specified by the `--output_file` parameter.
+- **Headers**:
+    ```
+    chr,region_start,region_end,0_sequence_length,1_sequence_length,0_sequence_length_adjusted,1_sequence_length_adjusted,0_segregating_sites,1_segregating_sites,0_w_theta,1_w_theta,0_pi,1_pi,0_segregating_sites_filtered,1_segregating_sites_filtered,0_w_theta_filtered,1_w_theta_filtered,0_pi_filtered,1_pi_filtered,0_num_hap_no_filter,1_num_hap_no_filter,0_num_hap_filter,1_num_hap_filter,inversion_freq_no_filter,inversion_freq_filter
+    ```
+- **Column Descriptions**:
+    - `chr`: Chromosome name.
+    - `region_start`: Start position of the region.
+    - `region_end`: End position of the region.
+    - `0_sequence_length`: Total length of the sequence for haplotype group `0`.
+    - `1_sequence_length`: Total length of the sequence for haplotype group `1`.
+    - `0_sequence_length_adjusted`: Adjusted sequence length for haplotype group `0` after filtering.
+    - `1_sequence_length_adjusted`: Adjusted sequence length for haplotype group `1` after filtering.
+    - `0_segregating_sites`: Number of segregating sites (unfiltered) for haplotype group `0`.
+    - `1_segregating_sites`: Number of segregating sites (unfiltered) for haplotype group `1`.
+    - `0_w_theta`: Watterson's Theta (unfiltered) for haplotype group `0`.
+    - `1_w_theta`: Watterson's Theta (unfiltered) for haplotype group `1`.
+    - `0_pi`: Pi (unfiltered) for haplotype group `0`.
+    - `1_pi`: Pi (unfiltered) for haplotype group `1`.
+    - `0_segregating_sites_filtered`: Segregating sites for haplotype group `0`.
+    - `1_segregating_sites_filtered`: Segregating sites for haplotype group `1`.
+    - `0_w_theta_filtered`: Watterson's Theta for haplotype group `0`.
+    - `1_w_theta_filtered`: Watterson's Theta for haplotype group `1`.
+    - `0_pi_filtered`: Pi for haplotype group `0`.
+    - `1_pi_filtered`: Pi for haplotype group `1`.
+    - `0_num_hap_no_filter`: Number of haplotypes for group `0` before filtering.
+    - `1_num_hap_no_filter`: Number of haplotypes for group `1` before filtering.
+    - `0_num_hap_filter`: Number of haplotypes for group `0`.
+    - `1_num_hap_filter`: Number of haplotypes for group `1`.
+    - `inversion_freq_no_filter`: Allele frequency of inversion (1) before filtering.
+    - `inversion_freq_filter`: Allele frequency of inversion (1).
+- **Special Values**:
+    - `θ = 0`: No segregating sites; no genetic variation observed.
+    - `θ = Infinity (inf)`: Insufficient haplotypes or zero-length region; metrics undefined.
+    - `π = 0`: No nucleotide differences.
+    - `π = Infinity (inf)`: Insufficient data; metrics undefined.
+---
+## Filtering Mechanisms 🔍
+### Genotype Quality (GQ) Filtering
+- **Purpose**: Exclude variants with low genotype quality.
+- **Mechanism**:
+    - If any sample within a variant has a GQ score below the specified `--min_gq` threshold, the entire variant is excluded from **filtered** (but not unfiltered) analyses.
+    - Variants passing the GQ filter are included in both **unfiltered** and **filtered** analyses.
+### Genotype Matching
+- **Purpose**: Only exact genotype matches (`0|0`, `0|1`, `1|0`, `1|1`) are included in **filtered** analyses.
+- **Mechanism**:
+    - Genotypes not strictly matching the four expected formats (e.g., `0|1_lowconf`) are considered missing data and excluded from **filtered** analyses.
+    - **Unfiltered** analyses include all genotypes that can be parsed into valid formats based on the first three characters.
+### Masking
+- **Purpose**: Exclude entire genomic regions from analysis based on predefined masks.
+- **Mechanism**:
+    - Regions specified in the BED mask file are treated similarly to low GQ variants and excluded from **filtered** (but not unfiltered) analyses.
+    - Sequence lengths are adjusted to account for masked regions in statistic calculations
+---
+## Common Warnings and Errors ⚠️
+- **Missing Samples**: If certain samples defined in the configuration file are not found in the VCF, a warning is displayed with the missing samples.
+- **Invalid Genotypes**: Genotypes not conforming to the expected formats (`0|0`, `0|1`, `1|0`, `1|1`) will be considered missing data. The number and percentage of invalid genotypes encountered will be shown.
+- **Multi-allelic Sites**: Multi-allelic variants are not supported.
+- **No Variants Found**: If no variants are found within the specified region or all variants are filtered out, a warning will be printed.
+---
+## Examples 🧪
+### Running the Tool with All Parameters
+```bash
+cargo run --release --bin vcf_stats_calculator \
+    -v ../vcfs \
+    -c ../config/regions.tsv \
+    -o ../results/output_stats.csv \
+    --min_gq 30 \
+    --mask_file ../masks/hardmask.hg38.v4_acroANDsdOnly.over99.bed
+```
+### Running the Tool Without a Configuration File
+If you prefer to calculate statistics for a specific chromosome and region without using a configuration file, you can run the tool with the `--chr` and `--region` flags. **Note:** In this mode, the results will be printed to the console rather than written to a CSV file.
+```bash
+cargo run --release --bin vcf_stats_calculator \
+    -v ../vcfs \
+    -c chr22 \
+    -r 10732039-23685112 \
+    --min_gq 30 \
+    --mask_file ../masks/hardmask.hg38.v4_acroANDsdOnly.over99.bed
+```

ferromic-0.1.0.dist-info/RECORD ADDED Viewed

@@ -0,0 +1,5 @@
+ferromic-0.1.0.dist-info/METADATA,sha256=XZSQZK_qt_vkucqF8epG8bPocz-wrWdC1UDKMrZavPw,10957
+ferromic-0.1.0.dist-info/WHEEL,sha256=T5N8wV_FFFFqNMyMMjqvVZVA4nBXclX-pkxhEjZDV9w,108
+ferromic/__init__.py,sha256=aNX1Ggsg2XdYFZ-hNIAQzCzjz7dGwk9BCdmrwwYZKh0,115
+ferromic/ferromic.cpython-310-x86_64-linux-gnu.so,sha256=zl3EMnJqMAXvpkHXHZViORDQsYkYu7hHw1w5Lvu_hsk,446368
+ferromic-0.1.0.dist-info/RECORD,,

ferromic-0.1.0.dist-info/WHEEL ADDED Viewed

@@ -0,0 +1,4 @@
+Wheel-Version: 1.0
+Generator: maturin (1.8.2)
+Root-Is-Purelib: false
+Tag: cp310-cp310-manylinux_2_34_x86_64