assemblytics 2.0.0__tar.gz

@@ -0,0 +1,22 @@
1
+
2
+ The MIT License (MIT)
3
+
4
+ Copyright (c) 2016 Maria Nattestad
5
+
6
+ Permission is hereby granted, free of charge, to any person obtaining a copy
7
+ of this software and associated documentation files (the "Software"), to deal
8
+ in the Software without restriction, including without limitation the rights
9
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
10
+ copies of the Software, and to permit persons to whom the Software is
11
+ furnished to do so, subject to the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be included in all
14
+ copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
17
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
18
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
19
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
20
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
21
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
22
+ SOFTWARE.
@@ -0,0 +1,196 @@
1
+ Metadata-Version: 2.4
2
+ Name: assemblytics
3
+ Version: 2.0.0
4
+ Summary: Detect and analyze structural variants from a de novo genome assembly aligned to a reference genome
5
+ Author: Maria Nattestad
6
+ License: MIT
7
+ Project-URL: Homepage, http://assemblytics.com/
8
+ Project-URL: Repository, https://github.com/MariaNattestad/assemblytics
9
+ Classifier: Programming Language :: Python :: 3
10
+ Classifier: License :: OSI Approved :: MIT License
11
+ Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
12
+ Requires-Python: >=3.8
13
+ Description-Content-Type: text/markdown
14
+ License-File: LICENSE
15
+ Requires-Dist: numpy
16
+ Requires-Dist: pandas
17
+ Requires-Dist: matplotlib
18
+ Dynamic: license-file
19
+
20
+ # Assemblytics: detect variants from an assembly
21
+
22
+ If you use Assemblytics, please cite our paper in Bioinformatics: http://www.ncbi.nlm.nih.gov/pubmed/27318204
23
+
24
+ BioRxiv preprint also available: https://www.biorxiv.org/content/10.1101/044925v1
25
+
26
+ ## How Assemblytics works
27
+
28
+ Assemblytics analyzes alignments of a "query" assembly to a "reference" genome (or another assembly) to identify structural variants. The pipeline consists of the following key steps:
29
+
30
+ 1. **Unique Anchor Filtering:** For every alignment, Assemblytics calculates how much of the query sequence is "unique" (not covered by any other alignments). Alignments are only retained if they meet a minimum unique anchor length requirement (default 10,000 bp). This ensures that variants are called from high-confidence, non-repetitive regions.
31
+ 2. **Calling Variants Between Alignments:** Assemblytics identifies variants that occur in the gaps between adjacent alignments of the same query sequence. These include insertions, deletions, and tandem expansions/contractions that occur when the assembly and reference don't quite meet up.
32
+ 3. **Calling Variants Within Alignments:** The pipeline also scans within individual alignments for mismatches in the gap sizes on the reference vs. query side.
33
+ 4. **Integration and Categorization:** All identified variants are combined and categorized by type (Insertion, Deletion, Tandem Expansion/Contraction, Repeat Expansion/Contraction) and size.
34
+ 5. **Visualization and Summary:** Finally, the tool generates summary statistics and several plots, including a dot plot of filtered alignments, an Nchart of the assembly, and size distributions of all called variants.
35
+
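The unique anchor filter in step 1 can be illustrated with a short sketch. This is not the code in the package: the function names, the half-open `(start, end)` interval representation, and the assumption that all alignments belong to the same query sequence are hypothetical simplifications, and the quadratic loop favors clarity over speed.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one query sequence (assumed layout)


def unique_length(target: Interval, others: List[Interval]) -> int:
    """Count query bases in `target` that no interval in `others` covers."""
    start, end = target
    # Keep only the parts of other alignments that overlap the target, sorted by start.
    clipped = sorted(
        (max(s, start), min(e, end)) for s, e in others if s < end and e > start
    )
    covered = 0
    cursor = start  # bases left of cursor are already accounted for
    for s, e in clipped:
        if e <= cursor:
            continue  # fully inside a region counted earlier
        covered += e - max(s, cursor)
        cursor = e
    return (end - start) - covered


def filter_by_unique_anchor(
    alignments: List[Interval], min_unique: int = 10_000
) -> List[Interval]:
    """Keep alignments whose uniquely covered query length is at least `min_unique`."""
    kept = []
    for i, aln in enumerate(alignments):
        others = alignments[:i] + alignments[i + 1:]
        if unique_length(aln, others) >= min_unique:
            kept.append(aln)
    return kept


alignments = [(0, 20_000), (15_000, 40_000), (16_000, 18_000)]
kept = filter_by_unique_anchor(alignments)
```

In this toy input the third alignment lies entirely inside regions covered by the other two, so it has no unique anchor and is dropped, while the first two keep 15 kb and 20 kb of unique sequence respectively.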
36
+ ## How to use Assemblytics
37
+
38
+ 1. Align your assembly fasta file to some kind of reference you want to compare against. See `nucmer` input instructions below for the exact command we recommend.
39
+ 2. Go to [assemblytics.com](https://assemblytics.com) and input your .delta file for analysis.
40
+
41
+ Important: Use only contigs rather than scaffolds from the assembly. This will prevent false positives when the number of Ns in the scaffolded sequence does not match perfectly to the distance in the reference.
42
+
43
+ ## `nucmer` input instructions
44
+
45
+ See my [MUMmer tutorial on sandbox.bio](https://sandbox.bio/tutorials/mummer-circa).
46
+
47
+ IMPORTANT: Assemblytics was built for `nucmer -maxmatch` output and tuned to the parameters shown below. Using them is important for the unique anchor filtering in Assemblytics to work correctly.
48
+
49
+ Upload a delta file to analyze alignments of an assembly to another assembly or a reference genome
50
+
51
+ 1. Download and install [MUMmer 4](https://github.com/mummer4/mummer/releases).
52
+ 2. Align your assembly to a reference genome using nucmer (from MUMmer package)
53
+ ```bash
54
+ nucmer -maxmatch -l 100 -c 500 REFERENCE.fa ASSEMBLY.fa -prefix OUT
55
+ # Settings above are important for unique anchor filtering to work correctly in Assemblytics.
56
+ # I increased -l to 10000 for the human in input_examples, which cut down on file size significantly at the cost of losing a lot of sensitivity and thus alignments. I don't really recommend setting it that high for your main analysis, but it can be useful for a fast initial run.
57
+
58
+ # Optionally gzip
59
+ gzip OUT.delta
60
+ ```
61
+
62
+ Consult the [MUMmer github](https://github.com/mummer4/mummer/releases) if you encounter problems.
63
+
64
+ 3. Use the output .delta or .delta.gz file at assemblytics.com
65
+
66
+ ## FAQ
67
+
68
+ ### What do the different variant types mean? What is tandem expansion versus repeat expansion?
69
+
70
+ ![variants types in Assemblytics](docs/variant_types_in_Assemblytics.jpg)
71
+
72
+ ### What is unique anchor filtering for?
73
+ See this example showing the point of unique anchor filtering (from the bioRxiv preprint supplementary materials): ![unique anchor filtering](docs/unique_anchor_filtering.png)
74
+
75
+ <sub>
76
+ <b>Supplementary Figure 1 caption:</b> Each repetitive element in a genome assembly can map ambiguously to multiple locations in the reference genome. Delta-filter, a component of MUMmer, filters repetitive alignments using a longest-increasing subsequence (LIS) dynamic programming algorithm to select subsets of long, high-identity alignments while penalizing overlaps (Kurtz et al., 2004; Phillippy et al., 2008). In contrast, Assemblytics eliminates repeats lacking substantial unique anchoring sequence (default: 10 kb). <b>A</b>. Example: a simulated 20 kb contig sequence matches three locations in the reference except for a single nucleotide (red point) providing a better match on the right. <b>B</b>. Dot plot of all raw, unfiltered alignments from nucmer. <b>C</b>. Dot plot after <code>delta-filter -r</code> (equivalent to unfiltered). <b>D</b>. Dot plot after <code>delta-filter -q</code>; here, a single nucleotide is enough for <code>-q</code> to prefer the third alignment. <b>E</b>. Dot plot after Assemblytics unique anchor filtering: only alignments with at least 10 kb uniquely anchored sequence (aligning to a single position in the reference) are retained; the repeats are removed. Assemblytics annotates structural variants within such filtered gaps as repeat expansions or contractions, depending on whether the gap is larger in the query or reference, respectively. No variant is reported unless the gap size changes, so repeats themselves are not reported as SVs—only expansions (increased size) or contractions (decreased size) are.
77
+ </sub>
78
+
79
+ For small genomes (e.g. bacteria), you may want to reduce the unique_length to 1000.
80
+
81
+ ### How long does the analysis take?
82
+
83
+ The analysis runs in a few seconds for most genomes. The human example, a 6 MB gzipped delta file, takes 50 seconds. Runtime should scale linearly with file size, so expect at least a minute per 10 MB. On assemblytics.com the analysis runs client-side, on your own computer's CPU, so it can be somewhat slower on a really slow computer. If that is a problem, see the note on `-l` in the `nucmer` instructions above, or consider running the Python version.
84
+
85
+ ### What aligners can I use?
86
+
87
+ Assemblytics was built on [MUMmer 3](https://sourceforge.net/projects/mummer/files/), and MUMmer 4 is also compatible. Other aligners produce SAM/BAM output rather than .delta files. MUMmer 4 now supports SAM/BAM output too, but MUMmer was more or less the original aligner for genome assemblies (as opposed to reads), so that is what Assemblytics was built to work with. Other aligners also make different choices about which alignments to keep, so I don't recommend using Assemblytics with anything other than MUMmer.
88
+
89
+ ### Why no translocations?
90
+
91
+ By default, candidate variants that span two different reference chromosomes ("Interchromosomal") are left out of the main results, since most of them come from misassemblies rather than real variants. Pass `--long-range` to also write these candidates to a separate `assemblytics_long_range_variants.bed`, so they're easy to find but clearly kept apart from the main, higher-confidence call set:
92
+
93
+ ```bash
94
+ assemblytics -d input_examples/ecoli.delta.gz -o ecoli_output --long-range
95
+
96
+ # In addition to the usual output, this also writes ecoli_output/assemblytics_long_range_variants.bed.
97
+ # These candidates are usually misassemblies, but can occasionally be real translocations or
98
+ # other large-scale rearrangements -- review them manually before trusting them as true variants.
99
+ ```
100
+
101
+ ## Python-only version for pipelines
102
+
103
+ The Python part of Assemblytics can be run without the web app.
104
+ It requires Python 3.8+. Installing with `pip` also pulls in the `numpy`, `pandas`, and `matplotlib` dependencies.
105
+
106
+ ```bash
107
+ pip install assemblytics
108
+ ```
109
+
110
+ The `assemblytics` command orchestrates the entire pipeline from filtering to plotting.
111
+
112
+ ```bash
113
+ assemblytics -d <delta_file> -o <output_dir>
114
+ ```
115
+
116
+ Example using the provided *E. coli* sample:
117
+ ```bash
118
+ assemblytics -d input_examples/ecoli.delta.gz -o ecoli_output
119
+
120
+ # The output should match the one in the output_examples/ecoli folder.
121
+ ```
122
+
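For pipelines that would rather stay inside Python than shell out, the same run can probably be driven through the package's `assemblytics.cli` module. The sketch below is an assumption based on the attribute names that the packaged `cli.py` reads from its arguments (`delta`, `output_dir`, `unique_length`, `minimum_size`, `maximum_size`, `long_range`); `build_args` and `run_pipeline` are hypothetical helpers, and whether `run` is meant as a supported public entry point is not stated anywhere in the package.

```python
import argparse


def build_args(delta, output_dir, unique_length, minimum_size, maximum_size,
               long_range=False):
    """Bundle the settings into the attribute layout that cli.run(args) reads."""
    return argparse.Namespace(
        delta=delta,
        output_dir=output_dir,
        unique_length=unique_length,
        minimum_size=minimum_size,
        maximum_size=maximum_size,
        long_range=long_range,
    )


def run_pipeline(args):
    """Run the full pipeline in-process (assumes `pip install assemblytics`)."""
    # Imported lazily so that this module loads even without the package installed.
    from assemblytics.cli import run
    run(args)


# The size limits below are illustrative values, not documented defaults.
args = build_args(
    "input_examples/ecoli.delta.gz", "ecoli_output",
    unique_length=1000, minimum_size=50, maximum_size=10000,
)
# run_pipeline(args)  # uncomment once the package and the input file are available
```

The command-line form above remains the documented interface; this variant mainly helps when the calling code wants to handle exceptions or reuse the interpreter.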
123
+ ## Development instructions
124
+
125
+ ### Python
126
+
127
+ ```bash
128
+ git clone https://github.com/MariaNattestad/assemblytics.git
129
+ cd assemblytics
130
+ pip install -e .
131
+ assemblytics
132
+ ```
133
+
134
+ ### Local web app
135
+
136
+ The web app (`public/`) runs the entire Assemblytics pipeline client-side in the browser via [Pyodide](https://pyodide.org/) (Python compiled to WebAssembly) in a Web Worker. There is no server-side code, no upload step, and no installation beyond a static file server — your delta file never leaves your machine.
137
+
138
+ To run it locally, serve the `public/` folder with any static file server, for example:
139
+
140
+ ```bash
141
+ cd assemblytics
142
+ python3 -m http.server 8000 --directory public
143
+ # Then open http://localhost:8000 in your browser
144
+ ```
145
+
146
+ The Python source lives in `assemblytics/` at the repo root. The web app loads it as a Python wheel (`public/assemblytics-2.0.0-py3-none-any.whl`) installed at runtime by Pyodide's `micropip`.
147
+
148
+ After editing any Python files under `assemblytics/`, rebuild the wheel before testing or deploying:
149
+
150
+ ```bash
151
+ make wheel
152
+ ```
153
+
154
+ This runs `python3 -m build --wheel` and copies the result into `public/`. If you bump the version in `pyproject.toml`, also update the filename on line 18 of `public/worker.js` to match.
155
+
156
+ ## Testing
157
+
158
+ `output_examples/` contains pre-computed results for five organisms, generated from the delta files in `input_examples/`. These are kept around (and untouched by any refactoring) specifically so the pipeline's correctness can be checked by re-running it and comparing the variant calls. The most important file to compare is `assemblytics_structural_variants.bed` (the combined, final set of structural variant calls) — everything else (plots, indices, summary stats) is derived from it.
159
+
160
+ To re-run the pipeline on each input and diff its variant calls against the matching example output:
161
+
162
+ ```bash
163
+ # E. coli (uses a smaller unique anchor length since it's a small genome)
164
+ assemblytics -d input_examples/ecoli.delta.gz -o /tmp/assemblytics_test/ecoli -l 1000
165
+ diff <(tail -n +2 /tmp/assemblytics_test/ecoli/assemblytics_structural_variants.bed | sort) \
166
+ <(tail -n +2 output_examples/ecoli/E__coli_example.Assemblytics_structural_variants.bed | sort) \
167
+ && echo "ecoli: OK"
168
+
169
+ # Yeast (Saccharomyces cerevisiae)
170
+ assemblytics -d input_examples/yeast.delta.gz -o /tmp/assemblytics_test/yeast
171
+ diff <(tail -n +2 /tmp/assemblytics_test/yeast/assemblytics_structural_variants.bed | sort) \
172
+ <(tail -n +2 output_examples/yeast/Saccharomyces_cerevisiae_example.Assemblytics_structural_variants.bed | sort) \
173
+ && echo "yeast: OK"
174
+
175
+ # Arabidopsis thaliana
176
+ assemblytics -d input_examples/arabidopsis.delta.gz -o /tmp/assemblytics_test/arabidopsis
177
+ diff <(tail -n +2 /tmp/assemblytics_test/arabidopsis/assemblytics_structural_variants.bed | sort) \
178
+ <(tail -n +2 output_examples/arabidopsis/Arabidopsis_example.Assemblytics_structural_variants.bed | sort) \
179
+ && echo "arabidopsis: OK"
180
+
181
+ # Drosophila melanogaster
182
+ assemblytics -d input_examples/drosophila.delta.gz -o /tmp/assemblytics_test/drosophila
183
+ diff <(tail -n +2 /tmp/assemblytics_test/drosophila/assemblytics_structural_variants.bed | sort) \
184
+ <(tail -n +2 output_examples/drosophila/Drosophila_example.Assemblytics_structural_variants.bed | sort) \
185
+ && echo "drosophila: OK"
186
+
187
+ # Human (assembly aligned to hg19) -- the largest input, this one takes the longest to run
188
+ assemblytics -d input_examples/human.delta.gz -o /tmp/assemblytics_test/human
189
+ diff <(tail -n +2 /tmp/assemblytics_test/human/assemblytics_structural_variants.bed | sort) \
190
+ <(tail -n +2 output_examples/human/Human_NA12878_to_hg19.Assemblytics_structural_variants.bed | sort) \
191
+ && echo "human: OK"
192
+ ```
193
+
194
+ (No `pip install -e .` yet? Run these from inside `public/` instead, replacing `assemblytics` with `python -m assemblytics.cli` and adjusting the `input_examples/`/`output_examples/` paths to `../input_examples/`/`../output_examples/`.)
195
+
196
+ Each `diff` should print nothing (no differences) followed by the "OK" line. The `tail -n +2` skips the header line, and `sort` makes the comparison order-independent since variant IDs can legitimately be assigned in a different order between runs.
@@ -0,0 +1,177 @@
1
+ # Assemblytics: detect variants from an assembly
2
+
3
+ If you use Assemblytics, please cite our paper in Bioinformatics: http://www.ncbi.nlm.nih.gov/pubmed/27318204
4
+
5
+ BioRxiv preprint also available: https://www.biorxiv.org/content/10.1101/044925v1
6
+
7
+ ## How Assemblytics works
8
+
9
+ Assemblytics analyzes alignments of a "query" assembly to a "reference" genome (or another assembly) to identify structural variants. The pipeline consists of the following key steps:
10
+
11
+ 1. **Unique Anchor Filtering:** For every alignment, Assemblytics calculates how much of the query sequence is "unique" (not covered by any other alignments). Alignments are only retained if they meet a minimum unique anchor length requirement (default 10,000 bp). This ensures that variants are called from high-confidence, non-repetitive regions.
12
+ 2. **Calling Variants Between Alignments:** Assemblytics identifies variants that occur in the gaps between adjacent alignments of the same query sequence. These include insertions, deletions, and tandem expansions/contractions that occur when the assembly and reference don't quite meet up.
13
+ 3. **Calling Variants Within Alignments:** The pipeline also scans within individual alignments for mismatches in the gap sizes on the reference vs. query side.
14
+ 4. **Integration and Categorization:** All identified variants are combined and categorized by type (Insertion, Deletion, Tandem Expansion/Contraction, Repeat Expansion/Contraction) and size.
15
+ 5. **Visualization and Summary:** Finally, the tool generates summary statistics and several plots, including a dot plot of filtered alignments, an Nchart of the assembly, and size distributions of all called variants.
16
+
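The comparison behind steps 2 to 4 comes down to measuring the same gap on both sequences. The sketch below is a simplified illustration and not the package's implementation: the `Alignment` class and `gap_between` function are hypothetical, both alignments are assumed to be forward-strand hits on the same reference sequence, and the sketch only labels the direction of the size change. It does not attempt the full Insertion/Deletion/Tandem/Repeat classification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alignment:
    # Coordinates of one alignment block (hypothetical layout for illustration).
    ref_start: int
    ref_end: int
    query_start: int
    query_end: int


def gap_between(left: Alignment, right: Alignment) -> dict:
    """Compare the gap between two adjacent alignments on reference vs. query."""
    ref_gap = right.ref_start - left.ref_end
    query_gap = right.query_start - left.query_end
    difference = query_gap - ref_gap
    direction: Optional[str]
    if difference > 0:
        direction = "expansion"    # gap is larger in the query
    elif difference < 0:
        direction = "contraction"  # gap is larger in the reference
    else:
        direction = None           # no size change, so nothing to report
    return {
        "ref_gap": ref_gap,
        "query_gap": query_gap,
        "size": abs(difference),
        "direction": direction,
    }


left = Alignment(ref_start=1000, ref_end=5000, query_start=0, query_end=4000)
right = Alignment(ref_start=5200, ref_end=9000, query_start=4700, query_end=8500)
call = gap_between(left, right)
```

Here the reference gap is 200 bp and the query gap is 700 bp, so the sketch reports a 500 bp expansion. A negative gap on either side would mean the two alignments overlap there.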
17
+ ## How to use Assemblytics
18
+
19
+ 1. Align your assembly fasta file to some kind of reference you want to compare against. See `nucmer` input instructions below for the exact command we recommend.
20
+ 2. Go to [assemblytics.com](https://assemblytics.com) and input your .delta file for analysis.
21
+
22
+ Important: Use only contigs rather than scaffolds from the assembly. This will prevent false positives when the number of Ns in the scaffolded sequence does not match perfectly to the distance in the reference.
23
+
24
+ ## `nucmer` input instructions
25
+
26
+ See my [MUMmer tutorial on sandbox.bio](https://sandbox.bio/tutorials/mummer-circa).
27
+
28
+ IMPORTANT: Assemblytics was built for `nucmer -maxmatch` output and tuned to the parameters shown below. Using them is important for the unique anchor filtering in Assemblytics to work correctly.
29
+
30
+ Upload a delta file to analyze alignments of an assembly to another assembly or a reference genome
31
+
32
+ 1. Download and install [MUMmer 4](https://github.com/mummer4/mummer/releases).
33
+ 2. Align your assembly to a reference genome using nucmer (from MUMmer package)
34
+ ```bash
35
+ nucmer -maxmatch -l 100 -c 500 REFERENCE.fa ASSEMBLY.fa -prefix OUT
36
+ # Settings above are important for unique anchor filtering to work correctly in Assemblytics.
37
+ # I increased -l to 10000 for the human in input_examples, which cut down on file size significantly at the cost of losing a lot of sensitivity and thus alignments. I don't really recommend setting it that high for your main analysis, but it can be useful for a fast initial run.
38
+
39
+ # Optionally gzip
40
+ gzip OUT.delta
41
+ ```
42
+
43
+ Consult the [MUMmer github](https://github.com/mummer4/mummer/releases) if you encounter problems.
44
+
45
+ 3. Use the output .delta or .delta.gz file at assemblytics.com
46
+
47
+ ## FAQ
48
+
49
+ ### What do the different variant types mean? What is tandem expansion versus repeat expansion?
50
+
51
+ ![variants types in Assemblytics](docs/variant_types_in_Assemblytics.jpg)
52
+
53
+ ### What is unique anchor filtering for?
54
+ See this example showing the point of unique anchor filtering (from the bioRxiv preprint supplementary materials): ![unique anchor filtering](docs/unique_anchor_filtering.png)
55
+
56
+ <sub>
57
+ <b>Supplementary Figure 1 caption:</b> Each repetitive element in a genome assembly can map ambiguously to multiple locations in the reference genome. Delta-filter, a component of MUMmer, filters repetitive alignments using a longest-increasing subsequence (LIS) dynamic programming algorithm to select subsets of long, high-identity alignments while penalizing overlaps (Kurtz et al., 2004; Phillippy et al., 2008). In contrast, Assemblytics eliminates repeats lacking substantial unique anchoring sequence (default: 10 kb). <b>A</b>. Example: a simulated 20 kb contig sequence matches three locations in the reference except for a single nucleotide (red point) providing a better match on the right. <b>B</b>. Dot plot of all raw, unfiltered alignments from nucmer. <b>C</b>. Dot plot after <code>delta-filter -r</code> (equivalent to unfiltered). <b>D</b>. Dot plot after <code>delta-filter -q</code>; here, a single nucleotide is enough for <code>-q</code> to prefer the third alignment. <b>E</b>. Dot plot after Assemblytics unique anchor filtering: only alignments with at least 10 kb uniquely anchored sequence (aligning to a single position in the reference) are retained; the repeats are removed. Assemblytics annotates structural variants within such filtered gaps as repeat expansions or contractions, depending on whether the gap is larger in the query or reference, respectively. No variant is reported unless the gap size changes, so repeats themselves are not reported as SVs—only expansions (increased size) or contractions (decreased size) are.
58
+ </sub>
59
+
60
+ For small genomes (e.g. bacteria), you may want to reduce the unique_length to 1000.
61
+
62
+ ### How long does the analysis take?
63
+
64
+ The analysis runs in a few seconds for most genomes. The human example, a 6 MB gzipped delta file, takes 50 seconds. Runtime should scale linearly with file size, so expect at least a minute per 10 MB. On assemblytics.com the analysis runs client-side, on your own computer's CPU, so it can be somewhat slower on a really slow computer. If that is a problem, see the note on `-l` in the `nucmer` instructions above, or consider running the Python version.
65
+
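As a rough planning aid, the linear scaling described above can be written as a one-line estimate. The function name is hypothetical, and the only data point behind it is the human example (a 6 MB gzipped delta taking 50 seconds), so treat the result as an order-of-magnitude guess for a comparable machine.

```python
def estimate_runtime_seconds(delta_gz_mb, reference_mb=6.0, reference_seconds=50.0):
    """Scale the human example's runtime linearly with gzipped delta size."""
    return delta_gz_mb * reference_seconds / reference_mb


estimate_10mb = estimate_runtime_seconds(10)
```

For a 10 MB file this gives a little over 80 seconds, consistent with the "at least a minute per 10 MB" rule of thumb.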
66
+ ### What aligners can I use?
67
+
68
+ Assemblytics was built on [MUMmer 3](https://sourceforge.net/projects/mummer/files/), and MUMmer 4 is also compatible. Other aligners produce SAM/BAM output rather than .delta files. MUMmer 4 now supports SAM/BAM output too, but MUMmer was more or less the original aligner for genome assemblies (as opposed to reads), so that is what Assemblytics was built to work with. Other aligners also make different choices about which alignments to keep, so I don't recommend using Assemblytics with anything other than MUMmer.
69
+
70
+ ### Why no translocations?
71
+
72
+ By default, candidate variants that span two different reference chromosomes ("Interchromosomal") are left out of the main results, since most of them come from misassemblies rather than real variants. Pass `--long-range` to also write these candidates to a separate `assemblytics_long_range_variants.bed`, so they're easy to find but clearly kept apart from the main, higher-confidence call set:
73
+
74
+ ```bash
75
+ assemblytics -d input_examples/ecoli.delta.gz -o ecoli_output --long-range
76
+
77
+ # In addition to the usual output, this also writes ecoli_output/assemblytics_long_range_variants.bed.
78
+ # These candidates are usually misassemblies, but can occasionally be real translocations or
79
+ # other large-scale rearrangements -- review them manually before trusting them as true variants.
80
+ ```
81
+
82
+ ## Python-only version for pipelines
83
+
84
+ The Python part of Assemblytics can be run without the web app.
85
+ It requires Python 3.8+. Installing with `pip` also pulls in the `numpy`, `pandas`, and `matplotlib` dependencies.
86
+
87
+ ```bash
88
+ pip install assemblytics
89
+ ```
90
+
91
+ The `assemblytics` command orchestrates the entire pipeline from filtering to plotting.
92
+
93
+ ```bash
94
+ assemblytics -d <delta_file> -o <output_dir>
95
+ ```
96
+
97
+ Example using the provided *E. coli* sample:
98
+ ```bash
99
+ assemblytics -d input_examples/ecoli.delta.gz -o ecoli_output
100
+
101
+ # The output should match the one in the output_examples/ecoli folder.
102
+ ```
103
+
104
+ ## Development instructions
105
+
106
+ ### Python
107
+
108
+ ```bash
109
+ git clone https://github.com/MariaNattestad/assemblytics.git
110
+ cd assemblytics
111
+ pip install -e .
112
+ assemblytics
113
+ ```
114
+
115
+ ### Local web app
116
+
117
+ The web app (`public/`) runs the entire Assemblytics pipeline client-side in the browser via [Pyodide](https://pyodide.org/) (Python compiled to WebAssembly) in a Web Worker. There is no server-side code, no upload step, and no installation beyond a static file server — your delta file never leaves your machine.
118
+
119
+ To run it locally, serve the `public/` folder with any static file server, for example:
120
+
121
+ ```bash
122
+ cd assemblytics
123
+ python3 -m http.server 8000 --directory public
124
+ # Then open http://localhost:8000 in your browser
125
+ ```
126
+
127
+ The Python source lives in `assemblytics/` at the repo root. The web app loads it as a Python wheel (`public/assemblytics-2.0.0-py3-none-any.whl`) installed at runtime by Pyodide's `micropip`.
128
+
129
+ After editing any Python files under `assemblytics/`, rebuild the wheel before testing or deploying:
130
+
131
+ ```bash
132
+ make wheel
133
+ ```
134
+
135
+ This runs `python3 -m build --wheel` and copies the result into `public/`. If you bump the version in `pyproject.toml`, also update the filename on line 18 of `public/worker.js` to match.
136
+
137
+ ## Testing
138
+
139
+ `output_examples/` contains pre-computed results for five organisms, generated from the delta files in `input_examples/`. These are kept around (and untouched by any refactoring) specifically so the pipeline's correctness can be checked by re-running it and comparing the variant calls. The most important file to compare is `assemblytics_structural_variants.bed` (the combined, final set of structural variant calls) — everything else (plots, indices, summary stats) is derived from it.
140
+
141
+ To re-run the pipeline on each input and diff its variant calls against the matching example output:
142
+
143
+ ```bash
144
+ # E. coli (uses a smaller unique anchor length since it's a small genome)
145
+ assemblytics -d input_examples/ecoli.delta.gz -o /tmp/assemblytics_test/ecoli -l 1000
146
+ diff <(tail -n +2 /tmp/assemblytics_test/ecoli/assemblytics_structural_variants.bed | sort) \
147
+ <(tail -n +2 output_examples/ecoli/E__coli_example.Assemblytics_structural_variants.bed | sort) \
148
+ && echo "ecoli: OK"
149
+
150
+ # Yeast (Saccharomyces cerevisiae)
151
+ assemblytics -d input_examples/yeast.delta.gz -o /tmp/assemblytics_test/yeast
152
+ diff <(tail -n +2 /tmp/assemblytics_test/yeast/assemblytics_structural_variants.bed | sort) \
153
+ <(tail -n +2 output_examples/yeast/Saccharomyces_cerevisiae_example.Assemblytics_structural_variants.bed | sort) \
154
+ && echo "yeast: OK"
155
+
156
+ # Arabidopsis thaliana
157
+ assemblytics -d input_examples/arabidopsis.delta.gz -o /tmp/assemblytics_test/arabidopsis
158
+ diff <(tail -n +2 /tmp/assemblytics_test/arabidopsis/assemblytics_structural_variants.bed | sort) \
159
+ <(tail -n +2 output_examples/arabidopsis/Arabidopsis_example.Assemblytics_structural_variants.bed | sort) \
160
+ && echo "arabidopsis: OK"
161
+
162
+ # Drosophila melanogaster
163
+ assemblytics -d input_examples/drosophila.delta.gz -o /tmp/assemblytics_test/drosophila
164
+ diff <(tail -n +2 /tmp/assemblytics_test/drosophila/assemblytics_structural_variants.bed | sort) \
165
+ <(tail -n +2 output_examples/drosophila/Drosophila_example.Assemblytics_structural_variants.bed | sort) \
166
+ && echo "drosophila: OK"
167
+
168
+ # Human (assembly aligned to hg19) -- the largest input, this one takes the longest to run
169
+ assemblytics -d input_examples/human.delta.gz -o /tmp/assemblytics_test/human
170
+ diff <(tail -n +2 /tmp/assemblytics_test/human/assemblytics_structural_variants.bed | sort) \
171
+ <(tail -n +2 output_examples/human/Human_NA12878_to_hg19.Assemblytics_structural_variants.bed | sort) \
172
+ && echo "human: OK"
173
+ ```
174
+
175
+ (No `pip install -e .` yet? Run these from inside `public/` instead, replacing `assemblytics` with `python -m assemblytics.cli` and adjusting the `input_examples/`/`output_examples/` paths to `../input_examples/`/`../output_examples/`.)
176
+
177
+ Each `diff` should print nothing (no differences) followed by the "OK" line. The `tail -n +2` skips the header line, and `sort` makes the comparison order-independent since variant IDs can legitimately be assigned in a different order between runs.
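Where process substitution is unavailable (for example on Windows or in a CI job without bash), the same order-independent check can be sketched in Python. The helper names below are hypothetical; the logic mirrors `tail -n +2 | sort | diff` by dropping the header line and comparing the remaining lines as multisets.

```python
from collections import Counter
from pathlib import Path


def bed_records(path):
    """Return the non-header lines of a BED file as a multiset."""
    lines = Path(path).read_text().splitlines()
    return Counter(lines[1:])  # lines[0] is the header, as with `tail -n +2`


def compare_beds(new_path, expected_path):
    """Return (lines only in new, lines only in expected), each sorted."""
    new = bed_records(new_path)
    expected = bed_records(expected_path)
    only_new = sorted((new - expected).elements())
    only_expected = sorted((expected - new).elements())
    return only_new, only_expected
```

Two empty lists mean the call sets match regardless of row order, for example `compare_beds("/tmp/assemblytics_test/ecoli/assemblytics_structural_variants.bed", "output_examples/ecoli/E__coli_example.Assemblytics_structural_variants.bed")`.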
@@ -0,0 +1 @@
1
+ __version__ = "2.0.0"
@@ -0,0 +1,211 @@
1
+ #!/usr/bin/env python3
2
+
3
+ """Python orchestrator for the Assemblytics pipeline."""
4
+
5
+ import argparse
6
+ import io
7
+ import os
8
+ import sys
9
+ import zipfile
10
+
11
+ from .dot_prep import run as run_dot_prep
12
+ from .dotplot import run as run_dotplot
13
+ from .index import run as run_index
14
+ from .nchart import run as run_nchart
15
+ from .summary import SVtable as run_summary
16
+ from .uniq_anchor import run as run_uniq_anchor
17
+ from .variant_charts import run as run_variant_charts
18
+ from .variants import run as run_variants
19
+
20
+
21
+ USAGE = "assemblytics -d delta -o output_dir -l unique_length -min min_size -max max_size"
22
+
23
+
24
+ def log_progress(log_file, message):
25
+     with open(log_file, "a") as log:
26
+         log.write(message + "\n")
27
+
28
+
29
+ def fail(log_file, message, exit_code=1):
30
+     log_progress(log_file, message)
31
+     sys.exit(exit_code)
32
+
33
+
34
+ def zip_results(output_dir):
35
+     zip_path = os.path.join(output_dir, "assemblytics_results.zip")
36
+     zip_filename = os.path.basename(zip_path)
37
+     with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
38
+         for filename in os.listdir(output_dir):
39
+             if filename.startswith("assemblytics_") and filename != zip_filename:
40
+                 archive.write(os.path.join(output_dir, filename), filename)
41
+
42
+
43
+ def run_summary_to_file(output_dir, minimum_size, maximum_size):
44
+     summary_path = os.path.join(output_dir, "assemblytics_structural_variants_summary.txt")
45
+     bed_path = os.path.join(output_dir, "assemblytics_structural_variants.bed")
46
+     summary_args = argparse.Namespace(
47
+         file=bed_path,
48
+         minimum_variant_size=minimum_size,
49
+         maximum_variant_size=maximum_size,
50
+     )
51
+     buffer = io.StringIO()
52
+     stdout = sys.stdout
53
+     sys.stdout = buffer
54
+     try:
55
+         run_summary(summary_args)
56
+     finally:
57
+         sys.stdout = stdout
58
+     with open(summary_path, "w") as summary:
59
+         summary.write(buffer.getvalue())
60
+
61
+
62
+ def run(args):
63
+     delta = args.delta
64
+     output_dir = args.output_dir
65
+     unique_length = args.unique_length
66
+     minimum_size = args.minimum_size
67
+     maximum_size = args.maximum_size
68
+     long_range = getattr(args, "long_range", False)
69
+
70
+     print("Input delta file:", delta)
71
+     print("Output directory:", output_dir)
72
+     print("Unique anchor length:", unique_length)
73
+     print("Minimum variant size to call:", minimum_size)
74
+     print("Maximum variant size to call:", maximum_size)
75
+
76
+     os.makedirs(output_dir, exist_ok=True)
77
+
78
+     log_file = os.path.join(output_dir, "assemblytics_progress.log")
79
+     print("Logging progress updates in", log_file)
80
+
81
+     log_progress(log_file, "STARTING,DONE,Starting unique anchor filtering.")
82
+
83
+     print("1. Filter delta file")
84
+     run_uniq_anchor(
85
+         argparse.Namespace(
86
+             delta=delta,
87
+             out=output_dir,
88
+             unique_length=unique_length,
89
+             keep_small_uniques=True,
90
+         )
91
+     )
92
+     print("FILE_READY:assemblytics_assembly_stats.txt")
93
+     print("FILE_READY:assemblytics_coords.tab")
94
+     print("FILE_READY:assemblytics_coords.csv")
95
+
96
+     filtered_delta = os.path.join(output_dir, "assemblytics_unique_length_filtered_l{}.delta.gz".format(unique_length))
97
+     if not os.path.exists(filtered_delta):
98
+         fail(
99
+             log_file,
100
+             "UNIQFILTER,FAIL,Step 1: uniq_anchor.py failed: "
101
+             "Possible problem with Python or Python packages on server.",
102
+         )
103
+     print("FILE_READY:" + os.path.basename(filtered_delta))
104
+
105
+     log_progress(
106
+         log_file,
107
+         "UNIQFILTER,DONE,Step 1: uniq_anchor.py completed successfully. "
108
+         "Now finding variants between alignments.",
109
+     )
110
+
111
+     print("2. Finding structural variants")
112
+     combined_path = os.path.join(output_dir, "assemblytics_structural_variants.bed")
113
+     long_range_path = os.path.join(output_dir, "assemblytics_long_range_variants.bed") if long_range else None
114
+     run_variants(filtered_delta, minimum_size, maximum_size, minimum_size, combined_path, long_range_path)
115
+     if not os.path.exists(combined_path):
116
+         fail(
117
+             log_file,
118
+             "VARIANTS,FAIL,Step 2: variants.py failed: "
119
+             "Possible problem with Python on server.",
120
+         )
121
+     print("FILE_READY:" + os.path.basename(combined_path))
122
+     if long_range:
123
+         print("FILE_READY:" + os.path.basename(long_range_path))
124
+
125
+     log_progress(
126
+         log_file,
127
+         "VARIANTS,DONE,Step 2: variants.py completed successfully. "
128
+         "Now generating figures and summary statistics.",
129
+     )
130
+
131
+     print("3. Index coordinates and generate summary statistics")
+     run_index(
+         argparse.Namespace(
+             coords=os.path.join(output_dir, "assemblytics_coords.csv"),
+             out=output_dir,
+         )
+     )
+     run_summary_to_file(output_dir, minimum_size, maximum_size)
+     print("FILE_READY:assemblytics_structural_variants_summary.txt")
+
+     print("4. Generating figures")
+     run_variant_charts(output_dir, minimum_size, maximum_size)
+     # Charts are ready incrementally too
+     charts = [f for f in os.listdir(output_dir) if f.startswith("assemblytics_size_distributions") and f.endswith(".png")]
+     for chart in charts:
+         print("FILE_READY:" + chart)
+
+     run_dotplot(output_dir)
+     print("FILE_READY:assemblytics_dotplot_filtered.png")
+
+     run_nchart(output_dir)
+     print("FILE_READY:assemblytics_nchart.png")
+
+     print("5. Preparing interactive Dot plot")
+     dot_prefix = os.path.join(output_dir, "assemblytics_dot")
+     run_dot_prep(
+         argparse.Namespace(
+             delta=delta,
+             out=dot_prefix,
+             unique_length=unique_length,
+             overview=1000,
+         ),
+         write_delta=False,
+     )
+     print("FILE_READY:assemblytics_dot.coords")
+     print("FILE_READY:assemblytics_dot.coords.idx")
+
+     zip_results(output_dir)
+     print("FILE_READY:assemblytics_results.zip")
+
+     summary_path = os.path.join(output_dir, "assemblytics_structural_variants_summary.txt")
+     with open(summary_path) as summary:
+         if "Total" not in summary.read():
+             fail(log_file, "SUMMARY,FAIL,Step 3: summary.py failed")
+
+     log_progress(
+         log_file,
+         "SUMMARY,DONE,Step 3: summary.py completed successfully",
+     )
+
+
+ def main():
+     parser = argparse.ArgumentParser(
+         description="Assemblytics structural variant detection pipeline",
+         usage=USAGE,
+     )
+     parser.add_argument("-d", "--delta", help="MUMmer delta file (.delta or .delta.gz)", required=True)
+     parser.add_argument("-o", "--output_dir", help="Output directory for assemblytics_* result files (default: current directory)", default=".")
+     parser.add_argument("-l", "--unique_length", type=int, default=10000, help="Unique anchor length requirement (default: 10000)")
+     parser.add_argument("-min", "--minimum_size", type=int, default=50, help="Minimum variant size to call (default: 50)")
+     parser.add_argument("-max", "--maximum_size", type=int, default=10000, help="Maximum variant size to call (default: 10000)")
+     parser.add_argument(
+         "--long-range",
+         dest="long_range",
+         action="store_true",
+         help=(
+             "Also report long-range and inter-chromosomal candidate variants (events bigger "
+             "than --maximum_size, or spanning two different reference chromosomes) to a "
+             "separate assemblytics_long_range_variants.bed file. These are usually caused by "
+             "misassemblies, but can also represent real translocations or other large-scale "
+             "rearrangements, so they're kept out of the main results by default and require "
+             "manual review."
+         ),
+     )
+     parser.set_defaults(func=run)
+     args = parser.parse_args()
+     args.func(args)
+
+
+ if __name__ == "__main__":
+     main()