ferromic-0.1.2-cp39-cp39-manylinux_2_17_armv7l.manylinux2014_armv7l.whl

ferromic/__init__.py ADDED
@@ -0,0 +1,5 @@
1
+ from .ferromic import *
2
+
3
+ __doc__ = ferromic.__doc__
4
+ if hasattr(ferromic, "__all__"):
5
+     __all__ = ferromic.__all__
ferromic-0.1.2.dist-info/METADATA ADDED
@@ -0,0 +1,366 @@
1
+ Metadata-Version: 2.4
2
+ Name: ferromic
3
+ Version: 0.1.2
4
+ Classifier: Development Status :: 4 - Beta
5
+ Classifier: Intended Audience :: Science/Research
6
+ Classifier: License :: Other/Proprietary License
7
+ Classifier: Programming Language :: Python
8
+ Classifier: Programming Language :: Python :: 3
9
+ Classifier: Programming Language :: Rust
10
+ Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
11
+ Requires-Dist: numpy>=1.23
12
+ Requires-Dist: pytest ; extra == 'test'
13
+ Requires-Dist: numpy ; extra == 'test'
14
+ Requires-Dist: scikit-allel ; extra == 'test'
15
+ Requires-Dist: scipy ; extra == 'test'
16
+ Requires-Dist: pytest-benchmark ; extra == 'test'
17
+ Requires-Dist: patchelf ; platform_system == 'Linux' and extra == 'test'
18
+ Provides-Extra: test
19
+ License-File: LICENSE.md
20
+ Summary: Rust-accelerated population genetics toolkit with ergonomic Python bindings
21
+ Keywords: population-genetics,bioinformatics,rust,pyo3,numpy
22
+ Home-Page: https://github.com/SauersML/ferromic
23
+ Author: SauersML
24
+ Requires-Python: >=3.9
25
+ Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
26
+
27
+ # Ferromic
28
+
29
+ A Rust-based tool for population genetic analysis that calculates diversity statistics from VCF files, with support for haplotype-group-specific analyses and genomic regions.
30
+
31
+ ## Overview
32
+
33
+ Ferromic processes genomic variant data from VCF files to calculate key population genetic statistics. It can analyze diversity metrics separately for different haplotype groups (0 and 1) as defined in a configuration file, making it particularly useful for analyzing regions with structural variants or any other genomic features where haplotypes can be classified into distinct groups.
34
+
35
+ ## Features
36
+
37
+ - Efficient VCF processing using multi-threaded parallelization
38
+ - Calculate key population genetic statistics:
39
+   - Nucleotide diversity (π)
40
+   - Watterson's theta (θ)
41
+   - Segregating sites counts
42
+   - Allele frequencies
43
+ - Apply various filtering strategies:
44
+   - Genotype quality (GQ) thresholds
45
+   - Genomic masks (exclude regions)
46
+   - Allowed regions (include only)
47
+   - Multi-allelic site handling
48
+   - Missing data management
49
+ - Extract coding sequences (CDS) from genomic regions using GTF annotations
50
+ - Generate PHYLIP format sequence files for phylogenetic analysis
51
+ - Create per-site diversity statistics for fine-grained analysis
52
+ - Support both individual region analysis and batch processing via configuration files
53
+
54
+ ## Python bindings
55
+
56
+ Ferromic's Rust core is exposed to Python through [PyO3](https://pyo3.rs) and
57
+ is distributed as a binary wheel on PyPI. Installing the extension pulls in the
58
+ compiled Rust library together with its runtime dependency on NumPy:
59
+
60
+ ```bash
61
+ pip install ferromic
62
+ ```
63
+
64
+ Once installed, the `ferromic` module mirrors the high-level statistics API of
65
+ the Rust crate. The example below shows how to construct an in-memory
66
+ population, compute basic diversity statistics, and run a principal component
67
+ analysis directly from Python:
68
+
69
+ ```python
70
+ import numpy as np
71
+ import ferromic as fm
72
+
73
+ genotypes = np.array([
74
+     [[0, 0], [0, 1], [1, 1]],
75
+     [[0, 1], [0, 0], [1, 1]],
76
+ ], dtype=np.uint8)
77
+
78
+ population = fm.Population.from_numpy(
79
+     "demo",
80
+     genotypes=genotypes,
81
+     positions=[101, 202],  # plain Python sequences are accepted
82
+     haplotypes=[(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)],
83
+     sequence_length=1000,
84
+     sample_names=["sampleA", "sampleB", "sampleC"],
85
+ )
86
+
87
+ print(f"Ferromic version: {fm.__version__}")
88
+ print("Segregating sites:", population.segregating_sites())
89
+ print("Nucleotide diversity:", population.nucleotide_diversity())
90
+
91
+ pca = fm.chromosome_pca(
92
+     variants=[
93
+         {"position": 101, "genotypes": [[0, 0], [0, 1], [1, 1]]},
94
+         {"position": 202, "genotypes": [[0, 1], [0, 0], [1, 1]]},
95
+     ],
96
+     sample_names=["sampleA", "sampleB", "sampleC"],
97
+ )
98
+
99
+ print("PCA components shape:", pca.coordinates.shape)
100
+ ```
101
+
102
+ Additional helpers for Hudson FST, Weir & Cockerham FST, sequence length
103
+ adjustment, and inversion allele frequency are available under the top-level
104
+ `ferromic` namespace. Consult `src/pytests` for further end-to-end examples
105
+ and integration tests.
106
+
107
+ ## Usage
108
+
109
+ ```
110
+ cargo run --release --bin run_vcf -- [OPTIONS]
111
+ ```
112
+
113
+ ### Required Arguments
114
+
115
+ - `--vcf_folder <FOLDER>`: Directory containing VCF files
116
+ - `--reference <PATH>`: Path to reference genome FASTA file
117
+ - `--gtf <PATH>`: Path to GTF annotation file
118
+
119
+ ### Optional Arguments
120
+
121
+ - `--chr <CHROMOSOME>`: Process a specific chromosome
122
+ - `--region <START-END>`: Process a specific region (1-based coordinates)
123
+ - `--config_file <FILE>`: Configuration file for batch processing multiple regions
124
+ - `--output_file <FILE>`: Output file path (default: output.csv)
125
+ - `--min_gq <INT>`: Minimum genotype quality threshold (default: 30)
126
+ - `--mask_file <FILE>`: BED file of regions to exclude
127
+ - `--allow_file <FILE>`: BED file of regions to include
128
+ - `--pca`: Perform principal component analysis on filtered haplotypes (writes per-chromosome TSV files under `pca_per_chr_outputs/`)
129
+ - `--pca_components <INT>`: Number of principal components to compute (default: 10)
130
+ - `--pca_output <FILE>`: Desired filename for the combined PCA summary table produced by Ferromic's PCA utilities (default: `pca_results.tsv`)
131
+ - `--fst`: Enable haplotype FST calculations (required for Weir & Cockerham and Hudson outputs)
132
+ - `--fst_populations <FILE>`: Optional CSV (population name followed by comma-separated sample IDs) describing named populations for additional FST comparisons
133
+
134
+ ## Example Command
135
+
136
+ ```
137
+ cargo run --release --bin run_vcf -- \
138
+     --vcf_folder ../vcfs \
139
+     --config_file ../variants.tsv \
140
+     --mask_file ../hardmask.bed \
141
+     --reference ../hg38.no_alt.fa \
142
+     --gtf ../hg38.knownGene.gtf
143
+ ```
144
+
145
+ ## Coordinate Systems
146
+
147
+ Ferromic handles different coordinate systems:
148
+ - VCF files: 1-based coordinates
149
+ - BED mask/allow files: 0-based, half-open intervals
150
+ - TSV config files: 1-based, inclusive coordinates
151
+ - GTF files: 1-based, inclusive coordinates
152
+
153
+ ## Configuration File Format
154
+
155
+ The configuration file must be tab-delimited. Ferromic expects the header to begin with seven metadata columns followed by one column per sample:
156
+
157
+ 1. `seqnames`: Chromosome (with or without "chr" prefix)
158
+ 2. `start`: Region start position (1-based, inclusive)
159
+ 3. `end`: Region end position (1-based, inclusive)
160
+ 4. `POS`: Representative variant position within the region (retained for bookkeeping)
161
+ 5. `orig_ID`: Region identifier
162
+ 6. `verdict`: Manual/automated review verdict
163
+ 7. `categ`: Category label for the region
164
+
165
+ Columns eight onward must be sample names. Each cell in these sample columns stores a genotype string such as `"0|0"`, `"0|1"`, `"1|0"`, or `"1|1"` that assigns both haplotypes to group 0 or group 1.
166
+
167
+ Where:
168
+ - "0" and "1" represent the two haplotype groups to be analyzed separately
169
+ - The "|" character indicates the phase separation between left and right haplotypes
170
+ - Genotypes with special formats (e.g., "0|1_lowconf") are included in unfiltered analyses but excluded from filtered analyses
171
+
172
+ ## Output Files
173
+
174
+ ### Main CSV Output
175
+
176
+ Contains summary statistics for each region with columns:
177
+ ```
178
+ chr,region_start,region_end,0_sequence_length,1_sequence_length,
179
+ 0_sequence_length_adjusted,1_sequence_length_adjusted,
180
+ 0_segregating_sites,1_segregating_sites,0_w_theta,1_w_theta,
181
+ 0_pi,1_pi,0_segregating_sites_filtered,1_segregating_sites_filtered,
182
+ 0_w_theta_filtered,1_w_theta_filtered,0_pi_filtered,1_pi_filtered,
183
+ 0_num_hap_no_filter,1_num_hap_no_filter,0_num_hap_filter,1_num_hap_filter,
184
+ inversion_freq_no_filter,inversion_freq_filter,
185
+ haplotype_overall_fst_wc,haplotype_between_pop_variance_wc,
186
+ haplotype_within_pop_variance_wc,haplotype_num_informative_sites_wc,
187
+ hudson_fst_hap_group_0v1,hudson_dxy_hap_group_0v1,
188
+ hudson_pi_hap_group_0,hudson_pi_hap_group_1,hudson_pi_avg_hap_group_0v1
189
+ ```
190
+
191
+ Where:
192
+ - Values prefixed with "0_" are statistics for haplotype group 0
193
+ - Values prefixed with "1_" are statistics for haplotype group 1
194
+ - "sequence_length" is the raw length of the region
195
+ - "sequence_length_adjusted" accounts for masked regions
196
+ - "num_hap" columns indicate the number of haplotypes in each group
197
+ - Statistics with "_filtered" are calculated from strictly filtered data
198
+ - Columns prefixed with `haplotype_` contain Weir & Cockerham FST outputs; they are
199
+ populated when haplotype FST analysis is enabled and `NA` when insufficient data
200
+ are available
201
+ - Columns prefixed with `hudson_` summarise Hudson-style FST components for the
202
+ haplotype 0 vs. 1 comparison and are likewise `NA` when FST statistics cannot be
203
+ computed
204
+
205
+ ### Per-site FASTA-style outputs
206
+
207
+ Two FASTA-like files are produced in the working directory to capture
208
+ position-specific metrics:
209
+
210
+ - `per_site_diversity_output.falsta` – per-haplotype π and θ values. Each record is
211
+ emitted with a FASTA-style header such as
212
+ `>filtered_pi_chr_X_start_Y_end_Z_group_0` followed by a comma-separated vector of
213
+ site-wise values (one entry per base in the region). `NA` marks positions without
214
+ data and `0` marks zero-valued statistics.
215
+ - `per_site_fst_output.falsta` – per-site summaries for Weir & Cockerham and Hudson
216
+ FST. Headers identify the statistic (overall FST, pairwise 0 vs 1, or Hudson
217
+ haplotype FST) and the associated region, with comma-separated values mirroring the
218
+ region length.
219
+
220
+ Both files encode positions implicitly by index: the first entry corresponds to the
221
+ region start (1-based), the second to start + 1, and so on.
222
+
223
+ ### Hudson FST TSV (optional)
224
+
225
+ When run with `--fst`, Ferromic writes `hudson_fst_results.tsv` alongside the main
226
+ CSV. The TSV header is: `chr`, `region_start_0based`, `region_end_0based`,
227
+ `pop1_id_type`, `pop1_id_name`, `pop2_id_type`, `pop2_id_name`, `Dxy`, `pi_pop1`,
228
+ `pi_pop2`, `pi_xy_avg`, `FST`. Region coordinates are 0-based inclusive and the
229
+ population columns capture the identifier type (haplotype group or named
230
+ population) alongside the label for each comparison.
231
+
232
+ ### PHYLIP Files
233
+
234
+ Generated for each transcript that overlaps with the query region:
235
+ - File naming: `group_{0/1}_{transcript_id}_chr_{chromosome}_start_{start}_end_{end}_combined.phy`
236
+ - Contains aligned sequences (based on the reference genome with variants applied)
237
+ - Sample names in the PHYLIP files are constructed from sample names with "_L" or "_R" suffixes to indicate left or right haplotypes
238
+
239
+ ## Implementation Details
240
+
241
+ - For PHYLIP files, if a CDS region overlaps with the query region (even partially), the entire transcript's coding sequence is included
242
+ - For diversity statistics (π and θ), only variants strictly within the region boundaries are used
243
+ - Different filtering approaches:
244
+   - Unfiltered: Includes all valid genotypes, regardless of quality or exact format
245
+   - Filtered: Excludes low-quality variants, masked regions, and non-standard genotypes
246
+ - Sequence length is adjusted for masked regions when calculating diversity statistics
247
+ - Multi-threading is implemented via Rayon for efficient processing
248
+ - Missing data is properly accounted for in diversity calculations
249
+ - Special values in results:
250
+   - θ = 0: No segregating sites (no genetic variation)
251
+   - θ = Infinity: Insufficient haplotypes or zero sequence length
252
+   - π = 0: No nucleotide differences (genetic uniformity)
253
+   - π = Infinity: Insufficient data
254
+
255
+ ## Python bindings with PyO3
256
+
257
+ Ferromic now ships with a rich, Python-first API powered by
258
+ [PyO3](https://pyo3.rs/). You can compute the same high-performance statistics
259
+ that drive the Rust binaries directly from notebooks or scripts using familiar
260
+ Python data structures.
261
+
262
+ ### Building the extension module
263
+
264
+ 1. Install Python 3.9+ and the [maturin](https://github.com/PyO3/maturin) build tool
265
+    (include the optional `patchelf` dependency on Linux to enable rpath fixing):
266
+    ```bash
267
+    python -m pip install "maturin[patchelf]"
268
+    ```
269
+ 2. Compile and install the extension into your active virtual environment:
270
+    ```bash
271
+    maturin develop --release
272
+    ```
273
+    The command compiles the `ferromic` shared library and makes it importable from Python. To
274
+    target a specific interpreter (for example, one provided by Conda), pass
275
+    `--python /path/to/python` or set the `PYO3_PYTHON` environment variable before invoking
276
+    `maturin`.
277
+
278
+ After `maturin develop` completes successfully, you can import the module with `import ferromic`
279
+ inside Python.
280
+
281
+ ### Ergonomic Python API
282
+
283
+ All functions accept plain Python collections. Variants can be dictionaries,
284
+ dataclasses, namedtuples or any object exposing a ``position`` attribute and a
285
+ ``genotypes`` iterable (with allele integers or ``None``). Haplotype entries are
286
+ interpreted from tuples like ``(sample_index, "L")`` or ``(sample_index, 1)``.
287
+
288
+ | Function or class | Description |
289
+ | --- | --- |
290
+ | `ferromic.segregating_sites(variants)` | Count polymorphic sites. |
291
+ | `ferromic.nucleotide_diversity(variants, haplotypes, sequence_length)` | Compute π (nucleotide diversity). |
292
+ | `ferromic.watterson_theta(segregating_sites, sample_count, sequence_length)` | Watterson's θ estimator. |
293
+ | `ferromic.pairwise_differences(variants, sample_count)` | List of `PairwiseDifference` objects containing counts for every sample pair. |
294
+ | `ferromic.per_site_diversity(variants, haplotypes, region=None)` | Per-position π and θ as `DiversitySite` objects. |
295
+ | `ferromic.Population` | Reusable container for Hudson-style statistics. Pass either a haplotype group (0/1) or a custom label. |
296
+ | `ferromic.hudson_dxy(pop1, pop2)` | Between-population nucleotide diversity. |
297
+ | `ferromic.hudson_fst(pop1, pop2)` | Hudson FST with rich metadata. |
298
+ | `ferromic.hudson_fst_sites(pop1, pop2, region)` | Per-site Hudson components across a region. |
299
+ | `ferromic.hudson_fst_with_sites(pop1, pop2, region)` | Tuple ``(HudsonFstResult, [HudsonFstSite, ...])``. |
300
+ | `ferromic.wc_fst(variants, sample_names, sample_to_group, region)` | Weir & Cockerham FST with pairwise and per-site summaries. |
301
+ | `ferromic.wc_fst_components(fst_estimate)` | Extract `(value, sum_a, sum_b, sites)` from any `FstEstimate`. |
302
+ | `ferromic.chromosome_pca(variants, sample_names, n_components=10)` | Run PCA for a single chromosome and return a `ChromosomePcaResult`. |
303
+ | `ferromic.chromosome_pca_to_file(variants, sample_names, chromosome, output_dir, n_components=10)` | Convenience helper that writes a TSV with PCA coordinates for one chromosome. |
304
+ | `ferromic.per_chromosome_pca(variants_by_chromosome, sample_names, output_dir, n_components=10)` | Batch PCA analysis across chromosomes, emitting one TSV per chromosome. |
305
+ | `ferromic.global_pca(variants_by_chromosome, sample_names, output_dir, n_components=10)` | Memory-efficient pipeline that runs per-chromosome PCA and produces a combined summary table. |
306
+ | `ferromic.ChromosomePcaResult` | Light-weight container exposing `haplotype_labels`, `coordinates`, and `positions`. |
307
+ | `ferromic.adjusted_sequence_length(start, end, allow=None, mask=None)` | Apply BED-style masks to a region length. |
308
+ | `ferromic.inversion_allele_frequency(sample_map)` | Frequency of allele ``1`` across haplotypes. |
309
+
310
+ Every result type is a tiny Python class with descriptive attributes and a
311
+ readable ``repr``, making it pleasant to explore interactively.
312
+
313
+ ### End-to-end example
314
+
315
+ ```python
316
+ from dataclasses import dataclass
317
+
318
+ import ferromic
319
+
320
+
321
+ @dataclass
322
+ class Variant:
323
+     # Positions are zero-based and inclusive to match Ferromic's internal representation.
324
+     position: int
325
+     genotypes: list
326
+
327
+
328
+ variants = [
329
+     Variant(position=999, genotypes=[(0, 0), (0, 1), None]),
330
+     Variant(position=1_009, genotypes=[(0, 0), (0, 0), (1, 1)]),
331
+ ]
332
+
333
+ haplotypes = [(0, "L"), (0, "R"), (1, 0), (1, 1), (2, "L")]
334
+
335
+ pi = ferromic.nucleotide_diversity(variants, haplotypes, sequence_length=100)
336
+ theta = ferromic.watterson_theta(ferromic.segregating_sites(variants), len(haplotypes), 100)
337
+
338
+ group0 = ferromic.Population(0, variants, haplotypes, sequence_length=100)
339
+ group1 = ferromic.Population("inversion", variants, haplotypes, sequence_length=100)
340
+
341
+ hudson = ferromic.hudson_fst(group0, group1)
342
+ sites = ferromic.hudson_fst_sites(group0, group1, region=(990, 1_020))
343
+
344
+ wc = ferromic.wc_fst(
345
+     variants,
346
+     sample_names=["S0", "S1", "S2"],
347
+     sample_to_group={"S0": (0, 0), "S1": (0, 1), "S2": (1, 1)},
348
+     region=(990, 1_020),
349
+ )
350
+
351
+ pca_result = ferromic.chromosome_pca(variants, ["S0", "S1", "S2"], n_components=3)
352
+ ferromic.chromosome_pca_to_file(variants, ["S0", "S1", "S2"], "2L", "./pca_outputs")
353
+
354
+ print(f"π={pi:.6f}, θ={theta:.6f}")
355
+ print(hudson)
356
+ print(sites[0])
357
+ print(wc.overall_fst)
358
+ print(pca_result.coordinates[0][:3])
359
+ ```
360
+
361
+ The example demonstrates how the Python API mirrors Ferromic's Rust types while
362
+ remaining easy to use from high-level workflows. Variants and haplotypes can be
363
+ assembled from pandas data frames, NumPy arrays, or plain Python lists—Ferromic
364
+ only inspects the fields it needs.
365
+
366
+
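The README in the METADATA hunk above lists nucleotide diversity (π) and Watterson's θ among its statistics but does not spell out the estimators. The following is a minimal sketch of the textbook per-site definitions in plain Python. The function names `textbook_pi` and `textbook_watterson_theta` are hypothetical and are not part of the `ferromic` API, and whether Ferromic normalizes or treats missing data in exactly this way is an assumption. Returning infinity for fewer than two haplotypes or a non-positive sequence length follows the special values the README describes.

```python
from itertools import combinations


def textbook_watterson_theta(segregating_sites, n_haplotypes, sequence_length):
    """Per-site Watterson's theta: S / (a_n * L), with a_n = sum_{i=1}^{n-1} 1/i."""
    if n_haplotypes < 2 or sequence_length <= 0:
        return float("inf")  # assumption: mirrors the README's "Infinity" special value
    a_n = sum(1.0 / i for i in range(1, n_haplotypes))
    return segregating_sites / (a_n * sequence_length)


def textbook_pi(haplotypes, sequence_length):
    """Per-site pi: mean pairwise differences between haplotypes, divided by L."""
    if len(haplotypes) < 2 or sequence_length <= 0:
        return float("inf")  # assumption: mirrors the README's "Infinity" special value
    pairs = list(combinations(haplotypes, 2))
    total_diffs = sum(
        sum(a != b for a, b in zip(hap1, hap2)) for hap1, hap2 in pairs
    )
    return total_diffs / (len(pairs) * sequence_length)


# Six haplotypes (three diploid samples) over two variant sites, laid out as
# one tuple of alleles per haplotype. The values follow the README's demo
# genotype matrix, read sample by sample, left haplotype then right.
haplotypes = [(0, 0), (0, 1), (0, 0), (1, 0), (1, 1), (1, 1)]

# A site is segregating when more than one allele is observed at it.
segregating = sum(len(set(column)) > 1 for column in zip(*haplotypes))

pi = textbook_pi(haplotypes, sequence_length=1000)
theta = textbook_watterson_theta(segregating, len(haplotypes), sequence_length=1000)
print(f"S={segregating}, pi={pi:.6f}, theta={theta:.6f}")
```

With this input both sites are segregating, and the 15 haplotype pairs differ at 18 site comparisons in total, so π is 18 / (15 × 1000). These numbers illustrate the formulas only; they are not claimed to match the output of `fm.Population` for the same data.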
ferromic-0.1.2.dist-info/RECORD ADDED
@@ -0,0 +1,6 @@
1
+ ferromic-0.1.2.dist-info/METADATA,sha256=VRWo8M4ok3Sf5W_T2hnqaBnoGIQElmX4iMWhIhI-c8k,16759
2
+ ferromic-0.1.2.dist-info/WHEEL,sha256=Lx-pFv0bv94KLoPldUDqVDYOObYbX38u182v00hM9hI,127
3
+ ferromic-0.1.2.dist-info/licenses/LICENSE.md,sha256=ZUhjdsYcsoc7ZisJORA4kRd2t3eKONA_vPDeIqiRNFY,169
4
+ ferromic/__init__.py,sha256=aNX1Ggsg2XdYFZ-hNIAQzCzjz7dGwk9BCdmrwwYZKh0,115
5
+ ferromic/ferromic.cpython-39-arm-linux-gnueabihf.so,sha256=H47DZNIu8xdgpkr99XMVw-dF8C4oz5BkddsvPNpzNHY,1633776
6
+ ferromic-0.1.2.dist-info/RECORD,,
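Each RECORD row above has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the SHA-256 hash with trailing `=` padding removed, as the wheel format specifies. The following is a minimal sketch for checking an unpacked file against its row. The helper names `record_digest`, `parse_record_row`, and `matches_record` are hypothetical.

```python
import base64
import hashlib


def record_digest(data: bytes) -> str:
    """Return the RECORD-style digest string for a file's raw bytes."""
    raw = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    return f"sha256={encoded}"


def parse_record_row(row: str):
    """Split a RECORD row into (path, digest or None, size or None)."""
    # Split from the right so a comma inside the path would not break parsing.
    path, digest, size = row.rsplit(",", 2)
    return path, digest or None, int(size) if size else None


def matches_record(data: bytes, row: str) -> bool:
    """Check file bytes against the digest and size recorded in a row."""
    _, digest, size = parse_record_row(row)
    if digest is None or size is None:
        return False  # the RECORD file's own row carries no hash or size
    return len(data) == size and record_digest(data) == digest


row = (
    "ferromic/__init__.py,"
    "sha256=aNX1Ggsg2XdYFZ-hNIAQzCzjz7dGwk9BCdmrwwYZKh0,115"
)
print(parse_record_row(row))
```

To use it, read each extracted file in binary mode and pass the bytes together with the matching row. The final `ferromic-0.1.2.dist-info/RECORD,,` row is expected to have empty digest and size fields, so it cannot be checked this way.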
ferromic-0.1.2.dist-info/WHEEL ADDED
@@ -0,0 +1,4 @@
1
+ Wheel-Version: 1.0
2
+ Generator: maturin (1.9.4)
3
+ Root-Is-Purelib: false
4
+ Tag: cp39-cp39-manylinux_2_17_armv7l.manylinux2014_armv7l
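The `Tag` line above is a compressed tag set: three dash-separated fields (Python tag, ABI tag, platform tag), each of which may hold several dot-separated alternatives. The following is a minimal sketch of expanding it into individual compatibility triples. The helper name `expand_wheel_tag` is hypothetical, and the sketch assumes a well-formed tag with exactly three fields.

```python
from itertools import product


def expand_wheel_tag(tag: str):
    """Expand a compressed wheel tag into sorted (python, abi, platform) triples."""
    python_tags, abi_tags, platform_tags = (field.split(".") for field in tag.split("-"))
    return sorted(product(python_tags, abi_tags, platform_tags))


triples = expand_wheel_tag("cp39-cp39-manylinux_2_17_armv7l.manylinux2014_armv7l")
for triple in triples:
    print("-".join(triple))
```

For this wheel the expansion yields two triples that differ only in the platform field, one for `manylinux_2_17_armv7l` and one for `manylinux2014_armv7l`.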
ferromic-0.1.2.dist-info/licenses/LICENSE.md ADDED
@@ -0,0 +1,4 @@
1
+ Copyright (c) 2025 Ferromic Developers. All rights reserved.
2
+
3
+ Unauthorized copying of this project, via any medium is strictly prohibited.
4
+ Proprietary and confidential.