ancify 1.3.0__tar.gz

ancify-1.3.0/PKG-INFO ADDED
@@ -0,0 +1,31 @@
1
+ Metadata-Version: 2.4
2
+ Name: ancify
3
+ Version: 1.3.0
4
+ Summary: Ancestral allele polarization pipeline using outgroup species
5
+ License: MIT
6
+ Requires-Python: >=3.8
7
+ Requires-Dist: pyyaml>=5.0
8
+ Requires-Dist: numpy>=1.20
9
+ Requires-Dist: lightgbm>=4.6.0
10
+ Provides-Extra: fast
11
+ Requires-Dist: isal>=1.0; extra == "fast"
12
+ Provides-Extra: evaluate
13
+ Requires-Dist: scikit-allel>=1.3; extra == "evaluate"
14
+ Requires-Dist: matplotlib>=3.0; extra == "evaluate"
15
+ Provides-Extra: docs
16
+ Requires-Dist: sphinx>=7.0; extra == "docs"
17
+ Requires-Dist: sphinx-rtd-theme>=2.0; extra == "docs"
18
+ Requires-Dist: sphinx-autodoc-typehints>=1.25; extra == "docs"
19
+ Requires-Dist: myst-parser>=2.0; extra == "docs"
20
+ Provides-Extra: ml
21
+ Requires-Dist: lightgbm>=4.0; extra == "ml"
22
+ Requires-Dist: scikit-learn>=1.0; extra == "ml"
23
+ Provides-Extra: dev
24
+ Requires-Dist: pytest>=7.0; extra == "dev"
25
+ Provides-Extra: all
26
+ Requires-Dist: isal>=1.0; extra == "all"
27
+ Requires-Dist: scikit-allel>=1.3; extra == "all"
28
+ Requires-Dist: matplotlib>=3.0; extra == "all"
29
+ Requires-Dist: lightgbm>=4.0; extra == "all"
30
+ Requires-Dist: scikit-learn>=1.0; extra == "all"
31
+ Requires-Dist: pytest>=7.0; extra == "all"
ancify-1.3.0/README.md ADDED
@@ -0,0 +1,376 @@
1
+ # ancify
2
+
3
+ **Infer ancestral alleles for any species using outgroup alignments.**
4
+
5
+ [![Documentation](https://img.shields.io/badge/docs-Read%20the%20Docs-blue)](https://ancify.readthedocs.io)
6
+
7
+ ancify is a config-driven Python pipeline that determines the ancestral state at every position in a reference genome by comparing pairwise alignments from multiple outgroup species. It supports two inference methods: a **two-tier inner/outer outgroup voting** scheme and **Fitch parsimony** on a phylogenetic tree, both with case-encoded confidence levels.
8
+
9
+ **Full documentation:** [ancify.readthedocs.io](https://ancify.readthedocs.io) — includes a population genetics background primer, step-by-step tutorials, algorithm deep dives, and a species adaptation guide.
10
+
11
+ ---
12
+
13
+ ## For the Impatient
14
+
15
+ ```bash
16
+ pip install .
17
+ ancify init -o config.yaml # generate template config
18
+ # edit config.yaml with your species, alignments, and paths
19
+ ancify run -c config.yaml # run everything
20
+ ```
21
+
22
+ That's it. Your ancestral FASTA files appear in the configured `output_dir`. Uppercase = high confidence, lowercase = low confidence.
23
+
24
+ ### Quick Example (Human, hg38)
25
+
26
+ ```bash
27
+ ancify run -c example_configs/hg38_bcgm.yaml
28
+ ```
29
+
30
+ This polarizes the human genome using bonobo + chimp + gorilla (inner) and macaque (outer), producing one ancestral FASTA per chromosome.
31
+
32
+ ---
33
+
34
+ ## What It Does
35
+
36
+ ```
37
+ Net AXT alignments          Projected sequences              Ancestral FASTA
38
+ (outgroup vs focal)    -->  (in focal coordinates)      -->  (with confidence)
39
+
40
+ hg38.panTro6.axt.gz         projected/chimp/chr1.fa
41
+ hg38.panPan3.axt.gz    -->  projected/bonobo/chr1.fa    -->  ancestral/chr1.fa
42
+ hg38.gorGor6.axt.gz         projected/gorilla/chr1.fa
43
+ hg38.rheMac10.axt.gz        projected/macaque/chr1.fa
44
+
45
+        Phase 1: project            Phase 2: call             Done.
46
+ ```
47
+
48
+ **Phase 1** projects each outgroup alignment onto the focal genome's coordinates.
49
+ **Phase 2** infers the ancestral allele at every position using two-tier voting (default) or Fitch parsimony.
50
+ **Phase 3** (optional) evaluates calls against a reference and/or VCF variants.
51
+
52
+ ### Confidence Encoding
53
+
54
+ **Voting method** (default):
55
+
56
+ | Character | Confidence | Meaning |
57
+ |-----------|-----------|---------|
58
+ | `ACGT` | High | Inner and outer outgroups agree |
59
+ | `acgt` | Low | Only one tier has data |
60
+ | `n` | Unresolved | Inner and outer disagree |
61
+ | `N` | Missing | No data from either tier |
62
+
63
+ **Parsimony method**:
64
+
65
+ | Character | Confidence | Meaning |
66
+ |-----------|-----------|---------|
67
+ | `ACGT` | High | Unique most-parsimonious root state |
68
+ | `acgt` | Low | Ambiguous root (multiple equally parsimonious states) |
69
+ | `N` | Missing | All outgroup leaves lack data |
70
+
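The two encodings above can be read back with a small helper. This is a minimal sketch and not part of ancify: `confidence_class` and `summarize` are hypothetical names, and the mapping follows the tables above.

```python
from collections import Counter

def confidence_class(ch):
    """Map one character of an ancestral FASTA to its confidence class."""
    if ch in ("A", "C", "G", "T"):
        return "high"
    if ch in ("a", "c", "g", "t"):
        return "low"
    if ch == "n":
        return "unresolved"  # emitted by the voting method only
    if ch == "N":
        return "missing"
    return "other"           # anything outside the documented alphabet

def summarize(seq):
    """Count how many positions fall into each confidence class."""
    return Counter(confidence_class(ch) for ch in seq)

counts = summarize("ACgtnNNA")
print(dict(counts))
```

A summary like this gives a quick per-chromosome view of how much of the sequence is usable at each confidence level.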
71
+ ---
72
+
73
+ ## Installation
74
+
75
+ ```bash
76
+ git clone https://github.com/kevinkorfmann/ancify.git
77
+ cd ancify
78
+ pip install .
79
+
80
+ # with evaluation dependencies (scikit-allel, matplotlib):
81
+ pip install '.[evaluate]'
82
+ ```
83
+
84
+ **Requirements:** Python >= 3.8, PyYAML >= 5.0, NumPy >= 1.20, LightGBM >= 4.6.0.
85
+
86
+ ---
87
+
88
+ ## Configuration
89
+
90
+ Everything is controlled by a single YAML file. Generate a starter template:
91
+
92
+ ```bash
93
+ ancify init -o config.yaml
94
+ ```
95
+
96
+ ### Minimal Config
97
+
98
+ ```yaml
99
+ focal_species: human
100
+ chromosome_lengths: chromoLens.txt
101
+
102
+ outgroups:
103
+   inner:
104
+     - name: bonobo
105
+       alignment: hg38.panPan3.net.axt.gz
106
+     - name: chimp
107
+       alignment: hg38.panTro6.net.axt.gz
108
+     - name: gorilla
109
+       alignment: hg38.gorGor6.net.axt.gz
110
+   outer:
111
+     - name: macaque
112
+       alignment: hg38.rheMac10.net.axt.gz
113
+
114
+ output_dir: ./ancestral_calls
115
+ num_cpus: 24
116
+ ```
117
+
118
+ ### Key Fields
119
+
120
+ | Field | Description |
121
+ |-------|-------------|
122
+ | `focal_species` | Label for the focal species (cosmetic) |
123
+ | `chromosome_lengths` | Tab-separated file: `chrom_name\tlength` |
124
+ | `chromosomes` | Optional list; defaults to all chroms in lengths file |
125
+ | `outgroups.inner` | Closely related species (majority vote) |
126
+ | `outgroups.outer` | Distantly related species (independent check) |
127
+ | `min_inner_freq` | Min count for inner majority vote (default: 1) |
128
+ | `min_outer_freq` | Min count for outer majority vote (default: 1) |
129
+ | `method` | `"voting"` (default) or `"parsimony"` |
130
+ | `tree` | Newick tree string or path to `.nwk` file (required for parsimony) |
131
+ | `num_cpus` | Parallel workers (default: 4) |
132
+ | `evaluation` | Optional block for Phase 3 (reference + VCF comparison) |
133
+
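The defaults in the table can be illustrated on a plain dictionary, such as one returned by `yaml.safe_load`. This is a sketch under stated assumptions, not ancify's own `load_config`: `DEFAULTS` and `with_defaults` are hypothetical names, and only the defaults and the tree requirement documented in the table are encoded.

```python
# Defaults as documented in the Key Fields table.
DEFAULTS = {
    "method": "voting",
    "min_inner_freq": 1,
    "min_outer_freq": 1,
    "num_cpus": 4,
}

def with_defaults(cfg):
    """Return a copy of cfg with documented defaults filled in."""
    merged = {**DEFAULTS, **cfg}
    if merged["method"] not in ("voting", "parsimony"):
        raise ValueError("method must be 'voting' or 'parsimony'")
    if merged["method"] == "parsimony" and not merged.get("tree"):
        raise ValueError("method 'parsimony' requires a 'tree'")
    return merged

cfg = with_defaults({"focal_species": "human", "num_cpus": 24})
print(cfg["method"], cfg["num_cpus"], cfg["min_inner_freq"])
```

Values given in the config win over the defaults, so `num_cpus: 24` from the minimal config is preserved here.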
134
+ ---
135
+
136
+ ## CLI Reference
137
+
138
+ ```bash
139
+ ancify init [-o FILE] # generate template config
140
+ ancify project -c CONFIG [-n N] # Phase 1: project alignments
141
+ ancify call -c CONFIG [-n N] # Phase 2: call ancestral states
142
+ ancify evaluate -c CONFIG [-n N] # Phase 3: evaluate (optional)
143
+ ancify run -c CONFIG [-n N] # all phases end-to-end
144
+ ```
145
+
146
+ Also works as: `python -m ancify run -c config.yaml`
147
+
148
+ ---
149
+
150
+ ## Works With Any Species
151
+
152
+ ancify is not tied to humans. It works with any focal species for which you have pairwise net AXT alignments (widely available from the [UCSC Genome Browser](https://hgdownload.soe.ucsc.edu/downloads.html)).
153
+
154
+ ### Mouse
155
+
156
+ ```yaml
157
+ focal_species: mouse
158
+ chromosome_lengths: mm39.chromLens.txt
159
+ outgroups:
160
+   inner:
161
+     - name: rat
162
+       alignment: mm39.rn7.net.axt.gz
163
+   outer:
164
+     - name: rabbit
165
+       alignment: mm39.oryCun2.net.axt.gz
166
+ ```
167
+
168
+ ### Drosophila
169
+
170
+ ```yaml
171
+ focal_species: drosophila_melanogaster
172
+ chromosome_lengths: dm6.chromLens.txt
173
+ chromosomes: [2L, 2R, 3L, 3R, 4, X]
174
+ outgroups:
175
+   inner:
176
+     - name: simulans
177
+       alignment: dm6.droSim2.net.axt.gz
178
+     - name: sechellia
179
+       alignment: dm6.droSec1.net.axt.gz
180
+   outer:
181
+     - name: yakuba
182
+       alignment: dm6.droYak3.net.axt.gz
183
+ ```
184
+
185
+ ### Brassica rapa (plant)
186
+
187
+ ```yaml
188
+ focal_species: brassica_rapa
189
+ chromosome_lengths: braRap1.chromLens.txt
190
+ outgroups:
191
+   inner:
192
+     - name: brassica_oleracea
193
+       alignment: braRap1.braOleracea.net.axt.gz
194
+   outer:
195
+     - name: arabidopsis_thaliana
196
+       alignment: braRap1.araTha1.net.axt.gz
197
+ ```
198
+
199
+ See `example_configs/` for complete examples.
200
+
201
+ ---
202
+
203
+ ## How It Works
204
+
205
+ ### Method 1: Two-tier voting (default)
206
+
207
+ 1. **Inner consensus**: majority vote among closely related outgroup species (e.g. bonobo, chimp, gorilla).
208
+ 2. **Outer consensus**: majority vote among distantly related outgroup species (e.g. macaque).
209
+ 3. **Compare**:
210
+ - Agree → **high confidence** (uppercase)
211
+ - One missing → **low confidence** (lowercase, use the available call)
212
+ - Disagree → **unresolved** (`n`)
213
+ - Both missing → **missing** (`N`)
214
+
215
+ This two-tier approach guards against incomplete lineage sorting and lineage-specific substitutions. The outer outgroup provides an independent evolutionary check on the inner consensus.
216
+
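The three steps above can be sketched as follows. This is an illustrative reimplementation, not the code in `ancify.ancestral`: `majority` and `two_tier_call` are hypothetical names, and treating a tie as "no consensus" is an assumption that the README does not confirm.

```python
from collections import Counter

VALID = {"A", "C", "G", "T"}

def majority(bases, min_freq=1):
    """Most common valid base, or None if absent, tied, or below min_freq."""
    counts = Counter(b.upper() for b in bases if b.upper() in VALID)
    if not counts:
        return None
    ranked = counts.most_common(2)
    top, n = ranked[0]
    if n < min_freq:
        return None
    if len(ranked) > 1 and ranked[1][1] == n:
        return None  # assumption: a tie counts as no consensus
    return top

def two_tier_call(inner_bases, outer_bases, min_inner_freq=1, min_outer_freq=1):
    """Combine inner and outer consensus into one case-encoded character."""
    inner = majority(inner_bases, min_inner_freq)
    outer = majority(outer_bases, min_outer_freq)
    if inner is None and outer is None:
        return "N"                       # both tiers missing
    if inner is None or outer is None:
        return (inner or outer).lower()  # one tier only: low confidence
    return inner if inner == outer else "n"

print(two_tier_call(["A", "A", "G"], ["A"]))  # tiers agree
print(two_tier_call(["G", "G", "A"], ["A"]))  # tiers disagree
```

In the second call the inner majority is `G` while the outer tier says `A`, which is the kind of site the voting scheme leaves unresolved.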
217
+ ### Method 2: Fitch parsimony
218
+
219
+ Instead of splitting outgroups into two tiers, you provide a **Newick phylogenetic tree** and ancify uses the **Fitch (1971) algorithm** to reconstruct the most parsimonious ancestral state at the root:
220
+
221
+ ```yaml
222
+ method: parsimony
223
+ tree: "(((bonobo,chimp),gorilla),macaque)"
224
+ ```
225
+
226
+ The tree topology determines how species are weighted, which can resolve cases that the voting method marks as unresolved. See the [algorithm docs](https://ancify.readthedocs.io/en/latest/algorithm.html) for a detailed walkthrough.
227
+
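The bottom-up pass of Fitch parsimony can be sketched on a tree written as nested tuples. This is a minimal illustration, not the code in `ancify.parsimony`: `fitch_root_set` and `classify` are hypothetical names, the tuple representation stands in for a parsed Newick tree, and skipping leaves without data is an assumption.

```python
VALID = {"A", "C", "G", "T"}

def fitch_root_set(tree, leaf_bases):
    """Fitch state set at the root of `tree`.

    `tree` is a leaf name (str) or a tuple of subtrees.
    Returns None when no leaf below this node has data.
    """
    if isinstance(tree, str):
        base = leaf_bases.get(tree, "N").upper()
        return {base} if base in VALID else None
    child_sets = [s for s in (fitch_root_set(child, leaf_bases) for child in tree)
                  if s is not None]  # assumption: leaves without data are skipped
    if not child_sets:
        return None
    shared = set.intersection(*child_sets)
    # Fitch rule: keep the intersection if non-empty, otherwise take the union.
    return shared if shared else set.union(*child_sets)

def classify(root_set):
    """Translate a root state set into the confidence classes of the README."""
    if root_set is None:
        return "missing"
    return "high" if len(root_set) == 1 else "low"

tree = ((("bonobo", "chimp"), "gorilla"), "macaque")
bases = {"bonobo": "G", "chimp": "G", "gorilla": "A", "macaque": "A"}
root = fitch_root_set(tree, bases)
print(root, classify(root))
```

For this input the bonobo/chimp node is `{G}`, joining gorilla gives `{A, G}`, and intersecting with macaque leaves a single root state.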
228
+ ### Input: Net AXT Alignments
229
+
230
+ The pipeline reads pairwise **net AXT** alignment files from UCSC. These represent best-in-genome one-to-one alignments between the focal species and each outgroup. Download them from:
231
+
232
+ ```
233
+ https://hgdownload.soe.ucsc.edu/goldenPath/<assembly>/vs<Outgroup>/
234
+ ```
235
+
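Projection onto focal coordinates (Phase 1) can be illustrated on a single AXT block. An AXT block has a summary line (alignment number, primary chromosome, 1-based start and end, then the aligned sequence's chromosome, start, end, strand, and score) followed by the two gapped sequences. The sketch below assumes the focal genome is the primary sequence. `project_block` is a hypothetical name, and leaving outgroup deletions as `N` is an assumption, not ancify's documented behaviour.

```python
def project_block(header, focal_aln, outgroup_aln, projected):
    """Write outgroup bases into `projected` at focal (0-based) positions."""
    fields = header.split()
    pos = int(fields[2]) - 1  # AXT starts are 1-based; convert to 0-based
    for f, o in zip(focal_aln, outgroup_aln):
        if f == "-":
            continue          # insertion in the outgroup: no focal coordinate
        if o != "-":
            projected[pos] = o.upper()
        pos += 1              # advance only when the focal sequence has a base
    return projected

projected = ["N"] * 10        # toy chromosome of length 10, all missing
project_block("0 chr1 3 8 chr1 100 105 + 500", "AC-GTAC", "ACTG-AC", projected)
print("".join(projected))
```

Positions outside the block, and the focal base aligned to a gap in the outgroup, stay `N`.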
236
+ ### Getting Chromosome Lengths
237
+
238
+ From UCSC:
239
+ ```bash
240
+ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A \
241
+   -N -e "SELECT chrom, size FROM chromInfo" hg38 > chromoLens.txt
242
+ ```
243
+
244
+ Or from a FASTA index:
245
+ ```bash
246
+ samtools faidx reference.fa
247
+ cut -f1,2 reference.fa.fai > chromoLens.txt
248
+ ```
249
+
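Either command yields a two-column, tab-separated file. A reader for it might look like the sketch below. `parse_chrom_lengths` is a hypothetical helper and not ancify's own loader. It skips any row whose second column is not an integer, such as a column-name header.

```python
def parse_chrom_lengths(lines):
    """Parse `chrom<TAB>length` rows into a dict of chromosome -> length."""
    lengths = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2 or not parts[1].strip().isdigit():
            continue  # blank line or header row such as "chrom<TAB>size"
        lengths[parts[0]] = int(parts[1])
    return lengths

example = ["chrom\tsize\n", "chr1\t248956422\n", "chr22\t50818468\n", "\n"]
lengths = parse_chrom_lengths(example)
print(lengths)
```

With a real file, pass the open file handle instead of the example list.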
250
+ ---
251
+
252
+ ## Using the Output
253
+
254
+ ### Look Up an Ancestral Allele
255
+
256
+ ```python
257
+ from ancify.utils import read_fasta
258
+
259
+ _, seq = read_fasta("ancestral_calls/chr1.fa")
260
+ allele = seq[999999] # 0-based index for position 1,000,000
261
+ print(f"Ancestral: {allele}, High confidence: {allele in 'ACGT'}")
262
+ ```
263
+
264
+ ### Polarize a VCF
265
+
266
+ ```python
267
+ for variant in vcf:
268
+     anc = seq[variant.POS - 1].upper()
269
+     if anc == variant.REF:
270
+         # REF is ancestral, ALT is derived
271
+         ...
272
+     elif anc in variant.ALT:  # also works when the reader exposes ALT as a list
273
+         # ALT is ancestral, REF is derived (flip frequencies)
274
+         ...
275
+ ```
276
+
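The same decision can be written as a self-contained function that does not depend on a particular VCF reader. This is a sketch: `polarize` is a hypothetical name, alleles are plain strings, and uppercasing the ancestral call discards its confidence level, as in the loop above.

```python
def polarize(ref, alts, anc):
    """Return (ancestral, derived_alleles), or None if the site cannot be polarized."""
    anc = anc.upper()
    if anc not in ("A", "C", "G", "T"):
        return None                      # unresolved ("n") or missing ("N")
    if anc == ref:
        return ref, list(alts)           # REF is ancestral, ALT alleles are derived
    if anc in alts:
        derived = [ref] + [a for a in alts if a != anc]
        return anc, derived              # an ALT allele is ancestral: flip frequencies
    return None                          # ancestral base matches neither allele

print(polarize("A", ["G"], "g"))
```

To keep only high-confidence sites, check `allele in ("A", "C", "G", "T")` on the raw character before calling the function.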
277
+ ### Python API
278
+
279
+ ```python
280
+ from ancify.config import load_config
281
+ from ancify.project import run_projection
282
+ from ancify.ancestral import run_ancestral_calling, call_ancestral_base
283
+
284
+ # Run the full pipeline programmatically
285
+ cfg = load_config("config.yaml")
286
+ run_projection(cfg)
287
+ run_ancestral_calling(cfg)
288
+
289
+ # Or call the core function directly (voting)
290
+ base = call_ancestral_base(
291
+     inner_bases=["A", "A", "G"],
292
+     outer_bases=["A"],
293
+ )
294
+ # Returns "A" (high confidence)
295
+
296
+ # Or use Fitch parsimony directly
297
+ from ancify.ancestral import call_ancestral_base_parsimony
298
+ from ancify.parsimony import parse_newick
299
+
300
+ tree = parse_newick("(((bonobo,chimp),gorilla),macaque)")
301
+ base = call_ancestral_base_parsimony(tree, {
302
+     "bonobo": "G", "chimp": "G", "gorilla": "A", "macaque": "A"
303
+ })
304
+ # Returns "A" (high confidence -- tree resolves the ambiguity)
305
+ ```
306
+
307
+ ---
308
+
309
+ ## Evaluation (Optional)
310
+
311
+ Compare your calls against a reference ancestral sequence (e.g. Ensembl EPO) and/or VCF variant data:
312
+
313
+ ```yaml
314
+ evaluation:
315
+   reference_dir: ./ensembl_ancestor/
316
+   reference_pattern: "homo_sapiens_ancestor_{chrom_id}.fa"
317
+   vcf_dir: ./vcf/
318
+   vcf_pattern: "ALL.chr{chrom_id}.vcf.gz"
319
+ ```
320
+
321
+ Pattern placeholders: `{chrom}` = full name (e.g. `chr1`), `{chrom_id}` = without `chr` prefix (e.g. `1`).
322
+
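Placeholder expansion of this kind can be reproduced with `str.format`. The helper below is a sketch: `expand_pattern` is a hypothetical name and may differ from how ancify resolves patterns internally.

```python
def expand_pattern(pattern, chrom):
    """Fill {chrom} and {chrom_id} placeholders for one chromosome name."""
    chrom_id = chrom[3:] if chrom.startswith("chr") else chrom
    return pattern.format(chrom=chrom, chrom_id=chrom_id)

print(expand_pattern("homo_sapiens_ancestor_{chrom_id}.fa", "chr1"))
print(expand_pattern("ALL.chr{chrom_id}.vcf.gz", "chr22"))
```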
323
+ ### Human (hg38) Validation Results
324
+
325
+ The BCGM method (bonobo + chimp + gorilla + macaque) was validated against the Ensembl EPO 13-primate ancestral reference:
326
+
327
+ | Metric | chr1 | chr22 |
328
+ |--------|------|-------|
329
+ | Coverage (BCGM) | 79.1% | 74.6% |
330
+ | Coverage (Ensembl EPO) | 90.2% | 81.2% |
331
+ | Disagreement rate | 0.08% | 0.11% |
332
+ | Matches REF or ALT | 99.6% | 99.5% |
333
+
334
+ ---
335
+
336
+ ## Documentation
337
+
338
+ - **Online docs:** [ancify.readthedocs.io](https://ancify.readthedocs.io)
339
+ - **Manual (PDF):** [docs/manual.pdf](https://github.com/kevinkorfmann/ancify/raw/main/docs/manual.pdf) — comprehensive LaTeX manual with algorithm descriptions, flowcharts, and worked examples.
340
+
341
+ To rebuild the PDF from source:
342
+
343
+ ```bash
344
+ cd docs && pdflatex manual.tex && pdflatex manual.tex
345
+ ```
346
+
347
+ ---
348
+
349
+ ## Project Structure
350
+
351
+ ```
352
+ ancify/
353
+ ├── pyproject.toml # package metadata
354
+ ├── example_configs/
355
+ │ ├── hg38_bcgm.yaml # human (worked example)
356
+ │ ├── mouse_example.yaml # mouse (hypothetical)
357
+ │ ├── drosophila_example.yaml # fruit fly (hypothetical)
358
+ │ └── brassica_rapa_example.yaml # Brassica rapa plant (hypothetical)
359
+ ├── docs/
360
+ │ ├── manual.tex # comprehensive LaTeX manual
361
+ │ └── manual.pdf # pre-compiled PDF
362
+ └── ancify/ # Python package
363
+     ├── __init__.py
364
+     ├── __main__.py
365
+     ├── cli.py # command-line interface
366
+     ├── config.py # YAML config loading
367
+     ├── utils.py # FASTA I/O, majority vote
368
+     ├── project.py # Phase 1: coordinate projection
369
+     ├── ancestral.py # Phase 2: ancestral calling
370
+     ├── parsimony.py # Fitch algorithm & Newick parser
371
+     └── evaluate.py # Phase 3: evaluation
372
+ ```
373
+
374
+ ## License
375
+
376
+ MIT
@@ -0,0 +1,3 @@
1
+ """ancify -- Ancestral allele polarization pipeline using outgroup species."""
2
+
3
+ __version__ = "1.0.0"
@@ -0,0 +1,3 @@
1
+ from .cli import main
2
+
3
+ main()