speconsense 0.7.2__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,1449 @@
1
+ Metadata-Version: 2.4
2
+ Name: speconsense
3
+ Version: 0.7.2
4
+ Summary: High-quality clustering and consensus generation for Oxford Nanopore amplicon reads
5
+ Author-email: Josh Walker <joshowalker@yahoo.com>
6
+ License: BSD-3-Clause
7
+ Project-URL: Homepage, https://github.com/joshuaowalker/speconsense
8
+ Project-URL: Repository, https://github.com/joshuaowalker/speconsense
9
+ Project-URL: Issues, https://github.com/joshuaowalker/speconsense/issues
10
+ Keywords: bioinformatics,nanopore,consensus,clustering,amplicon
11
+ Classifier: Development Status :: 4 - Beta
12
+ Classifier: Intended Audience :: Science/Research
13
+ Classifier: Operating System :: OS Independent
14
+ Classifier: Programming Language :: Python :: 3
15
+ Classifier: Programming Language :: Python :: 3.8
16
+ Classifier: Programming Language :: Python :: 3.9
17
+ Classifier: Programming Language :: Python :: 3.10
18
+ Classifier: Programming Language :: Python :: 3.11
19
+ Classifier: Programming Language :: Python :: 3.12
20
+ Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
21
+ Requires-Python: >=3.8
22
+ Description-Content-Type: text/markdown
23
+ License-File: LICENSE
24
+ Requires-Dist: edlib>=1.3.9
25
+ Requires-Dist: numpy>=1.20.0
26
+ Requires-Dist: biopython>=1.79
27
+ Requires-Dist: tqdm>=4.62.0
28
+ Requires-Dist: scipy>=1.7.0
29
+ Requires-Dist: adjusted-identity>=0.2.4
30
+ Requires-Dist: pyyaml>=5.0
31
+ Provides-Extra: dev
32
+ Requires-Dist: pytest>=7.0.0; extra == "dev"
33
+ Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
34
+ Dynamic: license-file
35
+
36
+ # Speconsense
37
+
38
+ A tool for high-quality clustering and consensus generation from Oxford Nanopore amplicon reads.
39
+
40
+ ## Overview
41
+
42
+ Speconsense is a specialized tool for generating high-quality consensus sequences from Oxford Nanopore Technologies (ONT) amplicon data. It is designed as an experimental alternative to NGSpeciesID in the fungal DNA barcoding pipeline.
43
+
44
+ The key features of Speconsense include:
45
+ - Robust clustering of amplicon reads using either Markov Clustering (MCL) graph-based or greedy algorithms
46
+ - **Automatic variant phasing**: Detects variant positions within clusters and splits reads into separate haplotypes
47
+ - Automatic merging of clusters with identical or similar consensus sequences
48
+ - High-quality consensus generation using SPOA
49
+ - Primer trimming for clean consensus sequences
50
+ - Read identity metrics for quality assessment
51
+ - IUPAC ambiguity codes for unphased heterozygous positions
52
+ - Optimized for fungal amplicon datasets but suitable for any amplicon sequencing application
53
+
54
+ ## Installation
55
+
56
+ ### Requirements
57
+
58
+ - Python 3.8 or higher
59
+ - External dependencies:
60
+   - [SPOA (SIMD POA)](https://github.com/rvaser/spoa) - Required (install via conda)
61
+   - [MCL](https://micans.org/mcl/) - Optional but recommended for graph-based clustering (install via conda)
62
+   - [vsearch](https://github.com/torognes/vsearch) - Optional, required for scalability mode with large datasets (install via conda)
63
+
64
+ ### Install from GitHub (Recommended)
65
+
66
+ The easiest way to install speconsense is directly from GitHub using pip. We recommend using a virtual environment to avoid dependency conflicts:
67
+
68
+ ```bash
69
+ # Create and activate a virtual environment (recommended)
70
+ python -m venv speconsense-env
71
+ source speconsense-env/bin/activate # On Windows: speconsense-env\Scripts\activate
72
+
73
+ # Install directly from GitHub
74
+ pip install git+https://github.com/joshuaowalker/speconsense.git
75
+
76
+ # External dependencies need to be installed separately
77
+ # SPOA: conda install bioconda::spoa
78
+ # MCL: conda install bioconda::mcl (optional but recommended)
79
+ ```
80
+
81
+ After installation, the tools will be available as command-line programs:
82
+ - `speconsense` - Main clustering and consensus tool
83
+ - `speconsense-summarize` - Post-processing and summary tool
84
+ - `speconsense-synth` - Synthetic read generator for testing
85
+
86
+ To deactivate the virtual environment when done:
87
+ ```bash
88
+ deactivate
89
+ ```
90
+
91
+ ### Installing External Dependencies
92
+
93
+ **MCL (Markov Clustering) - Recommended:**
94
+ ```bash
95
+ # Via conda (easiest)
96
+ conda install bioconda::mcl
97
+
98
+ # Or from source (more complex)
99
+ # See https://micans.org/mcl/ for source installation
100
+ ```
101
+
102
+ **SPOA (SIMD POA) - Required:**
103
+ ```bash
104
+ # Via conda (easiest)
105
+ conda install bioconda::spoa
106
+
107
+ # Or install from GitHub releases or build from source
108
+ # See https://github.com/rvaser/spoa for source installation instructions
109
+ ```
110
+
111
+ **vsearch - Optional (for scalability mode):**
112
+ ```bash
113
+ # Via conda (easiest)
114
+ conda install bioconda::vsearch
115
+
116
+ # Or download from https://github.com/torognes/vsearch/releases
117
+ ```
118
+
119
+ **Note:** If the mcl tool is not available, speconsense will automatically fall back to the greedy clustering algorithm.
120
+
121
+ ## Usage
122
+
123
+ ### Basic Usage
124
+
125
+ ```bash
126
+ speconsense input.fastq
127
+ ```
128
+
129
+ By default, this will:
130
+ 1. Cluster reads using the graph-based Markov Clustering (MCL) algorithm
131
+ 2. Merge clusters with identical or homopolymer-equivalent consensus sequences
132
+ 3. Generate a consensus sequence for each cluster
133
+ 4. Output FASTA files containing consensus sequences
134
+
135
+ ### Common Options
136
+
137
+ ```bash
138
+ # Use greedy clustering algorithm instead of Markov Clustering
139
+ speconsense input.fastq --algorithm greedy
140
+
141
+ # Set minimum cluster size
142
+ speconsense input.fastq --min-size 10
143
+
144
+ # Set minimum identity threshold for clustering (default: 0.9)
145
+ speconsense input.fastq --min-identity 0.85
146
+
147
+ # Control the maximum sample size for consensus generation (default: 100)
148
+ speconsense input.fastq --max-sample-size 200
149
+
150
+ # Specify output directory (default: clusters)
151
+ speconsense input.fastq --output-dir results/
152
+
153
+ # Disable automatic variant phasing
154
+ speconsense input.fastq --disable-position-phasing
155
+
156
+ # Using short form for output directory
157
+ speconsense input.fastq -O my_results/
158
+ ```
159
+
160
+ ### Using Profiles
161
+
162
+ Profiles save parameter configurations for different workflows:
163
+
164
+ ```bash
165
+ # List available profiles
166
+ speconsense --list-profiles
167
+
168
+ # Use a profile (CLI arguments override profile values)
169
+ speconsense input.fastq -p herbarium
170
+ speconsense input.fastq -p herbarium --min-size 10
171
+ ```
172
+
173
+ **Bundled profiles:**
174
+ - `herbarium` — High-recall for degraded DNA/type specimens
175
+ - `largedata` — Experimental settings for large input files
176
+ - `nostalgia` — Simulate older bioinformatics pipelines
177
+ - `strict` — High-precision for confident results
178
+
179
+ On first use, an `example.yaml` template is created in `~/.config/speconsense/profiles/` — copy and edit it to create custom profiles.
180
+
181
+ ### Two-Phase Processing Philosophy
182
+
183
+ **speconsense** extracts distinct biological sequences from raw reads within a single specimen. It separates true biological variation (alleles, loci, contaminants) from sequencing noise by clustering reads and generating consensus sequences. Philosophy: *split rather than conflate* — better to produce separate clusters that can be merged downstream than to lose real biological distinctions.
184
+
185
+ **speconsense-summarize** curates and consolidates consensus sequences for downstream analysis. It handles residual ambiguity that consensus generation cannot resolve — primarily minor SNP differences that may represent the same biological entity — groups related variants, selects representatives, and can analyze patterns across a run. Philosophy: *conservative simplification* — merge only what is clearly equivalent, preserve distinctions when uncertain.
186
+
187
+ | Aspect | speconsense | speconsense-summarize |
188
+ |--------|-------------|----------------------|
189
+ | Input | Raw reads | Consensus sequences |
190
+ | Scope | Single specimen | Single or multiple specimens |
191
+ | Goal | Discovery/enumeration | Curation/simplification |
192
+ | Error model | Read-level noise | Residual ambiguity |
193
+ | Bias | Prefer over-splitting | Conservative merging |
194
+
195
+ ### Post-processing with Speconsense-Summarize
196
+
197
+ After running speconsense, use the summarize tool to process and refine outputs:
198
+
199
+ ```bash
200
+ # Generate summary with default settings
201
+ speconsense-summarize
202
+
203
+ # Custom minimum RiC (Reads in Consensus) threshold
204
+ speconsense-summarize --min-ric 5
205
+
206
+ # Process specific source directory and custom output
207
+ speconsense-summarize --source /path/to/speconsense/output --summary-dir MyResults
208
+ ```
209
+
210
+ **Variant Detection and Haplotype Phasing:**
211
+
212
+ Speconsense can detect and isolate sequence variants within specimens (aka "haplotype phasing"). The graph-based clustering algorithm excels at discriminating between variants, and `speconsense-summarize` provides sophisticated tools for managing multiple variants per specimen, including:
213
+ - MSA-based variant merging with IUPAC ambiguity codes and size-weighted consensus
214
+ - Hierarchical variant grouping to separate contaminants from primary targets
215
+ - Size-based or diversity-based variant selection strategies
216
+
217
+ For detailed information on variant handling options, see the [Advanced Post-Processing](#advanced-post-processing) section below.
218
+
219
+ ## Quick Start: Complete Workflow
220
+
221
+ Speconsense is designed to replace the NGSpeciesID step in the [ONT DNA Barcoding Fungal Amplicons protocol](https://www.protocols.io/view/primary-data-analysis-basecalling-demultiplexing-a-dm6gpbm88lzp/v4). Here's a complete workflow:
222
+
223
+ **1. Demultiplex with Specimux**
224
+
225
+ Follow the protocol through the "Demultiplex the Reads with Specimux" step.
226
+
227
+ **2. Run Speconsense**
228
+
229
+ Replace the NGSpeciesID step with speconsense:
230
+
231
+ ```bash
232
+ # Instead of NGSpeciesID:
233
+ # ls *.fastq | parallel NGSpeciesID --ont --consensus --t 1 --abundance_ratio 0.2 ...
234
+
235
+ # Use Speconsense:
236
+ ls *.fastq | parallel speconsense {}
237
+ ```
238
+
239
+ **3. Post-process with Speconsense-Summarize**
240
+
241
+ Process the outputs to prepare for downstream analysis:
242
+
243
+ ```bash
244
+ speconsense-summarize
245
+ ```
246
+
247
+ This will create a `__Summary__/` directory with:
248
+ - `summary.fasta` - All final consensus sequences (merged variants only)
249
+ - Individual FASTA files per specimen
250
+ - `summary.txt` - Statistics and metrics
251
+ - `quality_report.txt` - Prioritized list of sequences with potential quality concerns
252
+ - `FASTQ Files/` - Reads contributing to each final consensus
253
+ - `variants/` - Pre-merge variant FASTA and FASTQ files (for merged sequences)
254
+
255
+ ## Output Files
256
+
257
+ ### Speconsense Core Output
258
+
259
+ For each specimen, Speconsense generates:
260
+
261
+ 1. **Main consensus FASTA**: `{sample_name}-all.fasta`
262
+    - Contains all consensus sequences for the specimen (one per cluster)
263
+    - Uses original cluster numbering: `{sample_name}-c{cluster_num}-RiC{size}`
264
+
265
+ 2. **Debug directory** (`cluster_debug/`):
266
+    - `{sample_name}-c{cluster_num}-RiC{size}-reads.fastq`: Original reads in each cluster
267
+    - `{sample_name}-c{cluster_num}-RiC{size}-sampled.fastq`: Sampled reads used for consensus generation
268
+    - `{sample_name}-c{cluster_num}-RiC{size}-untrimmed.fasta`: Untrimmed consensus sequences
269
+
270
+ ### Speconsense-Summarize Output
271
+
272
+ `speconsense-summarize` creates a `__Summary__/` directory containing:
273
+
274
+ #### **Main Output Files** (summarization numbering: `-1.v1`, `-1.v2`, `-2.v1`):
275
+ - **Individual FASTA files**: `{sample_name}-{group}.v{variant}-RiC{reads}.fasta` (all final consensus variants)
276
+ - **Combined file**: `summary.fasta` - all final consensus sequences (merged variants only, excludes .raw files)
277
+ - **Statistics**: `summary.txt` - sequence counts and metrics
278
+ - **Quality report**: `quality_report.txt` - highlights sequences with potential quality concerns
279
+ - **Log file**: `summarize_log.txt` - complete processing log
280
+
281
+ #### **FASTQ Files/** (aggregated reads for final consensus):
282
+ - `{sample_name}-{group}.v{variant}-RiC{reads}.fastq` - all reads contributing to final variant consensus
283
+
284
+ #### **variants/** (pre-merge variant files for traceability):
285
+ - `{sample_name}-{group}.v{variant}.raw{N}-RiC{reads}.fasta` - FASTA for pre-merge variant N (when variants were merged)
286
+ - **FASTQ Files/**:
287
+   - `{sample_name}-{group}.v{variant}.raw{N}-RiC{reads}.fastq` - reads for pre-merge variant N
288
+
289
+ ### Naming Convention Summary
290
+
291
+ **Two distinct namespaces maintain traceability:**
292
+
293
+ | **Namespace** | **Used In** | **Format** | **Purpose** |
294
+ |---------------|-------------|------------|-------------|
295
+ | **Original** | Source `cluster_debug/` | `-c1`, `-c2`, `-c3` | Preserves speconsense clustering results |
296
+ | **Summarization** | `__Summary__/`, `FASTQ Files/`, `variants/` | `-1.v1`, `-1.v2`, `-2.v1`, `.raw1` | Post-processing groups and variants |
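The two namespaces can be told apart mechanically. The following is a minimal sketch, assuming file names follow exactly the formats listed above; the regular expressions and the `parse_name` helper are illustrative and are not part of speconsense.

```python
import re

# Hypothetical patterns inferred from the naming formats documented above.
ORIGINAL_RE = re.compile(r"^(?P<sample>.+)-c(?P<cluster>\d+)-RiC(?P<ric>\d+)")
SUMMARY_RE = re.compile(
    r"^(?P<sample>.+)-(?P<group>\d+)\.v(?P<variant>\d+)"
    r"(?:\.raw(?P<raw>\d+))?-RiC(?P<ric>\d+)"
)


def parse_name(name):
    """Return (namespace, fields) for a file name, or (None, {}) if it matches neither."""
    # Try the summarization format first because it is the more specific one.
    for namespace, pattern in (("summarization", SUMMARY_RE), ("original", ORIGINAL_RE)):
        match = pattern.match(name)
        if match:
            # Drop optional groups (such as .rawN) that were not present.
            fields = {k: v for k, v in match.groupdict().items() if v is not None}
            return namespace, fields
    return None, {}


print(parse_name("sample-1.v1.raw2-RiC15.fasta"))
print(parse_name("sample-c3-RiC45-reads.fastq"))
```

Because the sample-name group is greedy, sample names that themselves contain hyphens are kept whole, and the numeric suffixes are taken from the end of the name.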
297
+
298
+ ### Example Directory Structure
299
+ ```
300
+ __Summary__/
301
+ ├── sample-1.v1-RiC45.fasta # Primary variant (group 1, merged)
302
+ ├── sample-1.v2-RiC23.fasta # Additional variant (not merged)
303
+ ├── sample-2.v1-RiC30.fasta # Second organism group, primary variant
304
+ ├── summary.fasta # All final consensus sequences (excludes .raw)
305
+ ├── summary.txt # Statistics
306
+ ├── quality_report.txt # Quality assessment report
307
+ ├── summarize_log.txt # Processing log
308
+ ├── FASTQ Files/ # Reads for final consensus
309
+ │ ├── sample-1.v1-RiC45.fastq # All reads for merged consensus
310
+ │ ├── sample-1.v2-RiC23.fastq # Reads for additional variant
311
+ │ └── sample-2.v1-RiC30.fastq # All reads for second group
312
+ └── variants/ # Pre-merge variant files (for traceability)
313
+     ├── sample-1.v1.raw1-RiC30.fasta # Pre-merge variant 1 FASTA
314
+     ├── sample-1.v1.raw2-RiC15.fasta # Pre-merge variant 2 FASTA
315
+     └── FASTQ Files/ # Pre-merge variant FASTQ files
316
+         ├── sample-1.v1.raw1-RiC30.fastq # Reads for pre-merge variant 1
317
+         └── sample-1.v1.raw2-RiC15.fastq # Reads for pre-merge variant 2
318
+ ```
319
+
320
+ ### FASTA Header Metadata
321
+
322
+ Consensus sequence headers contain metadata fields separated by spaces:
323
+
324
+ **Core Fields (Always Present):**
325
+ - `size=N` - Total number of raw sequence reads in the cluster
326
+ - `ric=N` - **Reads in Consensus** - Number of reads actually used for consensus generation (may be less than size due to sampling limits)
327
+
328
+ **Optional Fields:**
329
+ - `rawric=N+N+...` - RiC values of .raw source variants (pre-merge, largest-first, only present in merged variants from speconsense-summarize)
330
+ - `rawlen=N+N+...` - Original sequence lengths before overlap merging (largest-first, only present in overlap-merged variants)
331
+ - `snp=N` - Number of SNP positions from IUPAC merging (only present in merged variants from speconsense-summarize)
332
+ - `ambig=N` - Count of IUPAC ambiguity codes in consensus (Y, R, W, S, K, M, etc.)
333
+ - `length=N` - Sequence length in bases (available via --fasta-fields option)
334
+ - `primers=list` - Comma-separated list of detected primer names (e.g., `primers=ITS1F,ITS4`)
335
+ - `rid=X.X` - Mean read identity percentage (0-100%, available via --fasta-fields qc preset)
336
+ - `rid_min=X.X` - Minimum read identity percentage (worst-case read, available via --fasta-fields qc preset)
337
+ - `group=N` - Variant group number (available via --fasta-fields option)
338
+ - `variant=vN` - Variant identifier within group (available via --fasta-fields option, only for variants)
339
+
340
+ **Example Headers:**
341
+ ```
342
+ # Simple consensus from speconsense (debug files):
343
+ >sample-c1 size=50 ric=45 rid=97.2 rid_min=94.1 primers=ITS1F,ITS4
344
+
345
+ # Merged variant from speconsense-summarize (default fields):
346
+ >sample-1.v1 size=250 ric=250 rawric=100+89+61 snp=2 ambig=2 primers=ITS1F,ITS4
347
+
348
+ # Overlap-merged variant (different-length sequences merged):
349
+ >sample-1.v1 size=471 ric=471 rawric=248+223 rawlen=630+361 primers=ITS1F,ITS4
350
+
351
+ # With QC preset (--fasta-fields qc):
352
+ >sample-1.v1 size=250 ric=250 length=589 rid=98.5 ambig=1
353
+ ```
354
+
355
+ **Notes:**
356
+ - Use `--fasta-fields` option to customize which fields appear in output headers (see [Customizing FASTA Header Fields](#customizing-fasta-header-fields))
357
+ - Read identity metrics (`rid`, `rid_min`) reflect homopolymer-normalized sequence identity
358
+ - Field name changes from earlier versions: `merged_ric` → `rawric`, `p50diff`/`p95diff` → `rid`/`rid_min`
359
+ - Variant merging only occurs between sequences with identical primer sets
360
+ - `snp` counts positions where IUPAC codes were introduced during merging; `ambig` counts total ambiguity codes in the final sequence
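For downstream scripts, the space-separated `key=value` layout is straightforward to parse. The sketch below assumes headers look like the examples above; `parse_header` is a hypothetical helper, not a function shipped with speconsense.

```python
def parse_header(line):
    """Split a consensus FASTA header into its sequence ID and a dict of fields."""
    seq_id, *tokens = line.lstrip(">").split()
    fields = {}
    for token in tokens:
        key, _, value = token.partition("=")
        if key in ("rawric", "rawlen"):
            # Plus-separated lists, largest first (e.g. rawric=100+89+61).
            fields[key] = [int(v) for v in value.split("+")]
        elif key == "primers":
            fields[key] = value.split(",")
        elif key in ("rid", "rid_min"):
            fields[key] = float(value)
        elif value.isdigit():
            fields[key] = int(value)
        else:
            # Anything else (e.g. variant=v1) is kept as text.
            fields[key] = value
    return seq_id, fields


seq_id, fields = parse_header(
    ">sample-1.v1 size=250 ric=250 rawric=100+89+61 snp=2 ambig=2 primers=ITS1F,ITS4"
)
print(seq_id, fields["rawric"], fields["primers"])
```

In the merged example headers above, the `rawric` values sum to `ric`, which makes a convenient sanity check when inspecting merged variants.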
361
+
362
+ ## Scalability Features
363
+
364
+ For large datasets (thousands of sequences), speconsense uses external tools to accelerate pairwise comparison operations. **Scalability mode is enabled by default** for datasets with 1000+ sequences.
365
+
366
+ ### Configuration
367
+
368
+ ```bash
369
+ # Default: scalability enabled for datasets with 1000+ sequences
370
+ speconsense input.fastq
371
+
372
+ # Custom threshold: enable scalability only for datasets with 500+ sequences
373
+ speconsense input.fastq --scale-threshold 500
374
+
375
+ # Disable scalability entirely (use brute-force for all sizes)
376
+ speconsense input.fastq --scale-threshold 0
377
+
378
+ # Same options work for speconsense-summarize
379
+ speconsense-summarize --scale-threshold 500 --source clusters/
380
+ ```
381
+
382
+ The threshold parameter allows you to run a single command across multiple specimens of varying sizes. Smaller specimens will use brute-force (which has less overhead for small inputs), while larger specimens will use vsearch acceleration.
383
+
384
+ ### Requirements
385
+
386
+ Scalability mode requires **vsearch** to be installed:
387
+
388
+ ```bash
389
+ conda install bioconda::vsearch
390
+ ```
391
+
392
+ If vsearch is not available, speconsense will automatically fall back to brute-force computation.
393
+
394
+ ### How It Works
395
+
396
+ Scalability mode uses a two-stage approach to reduce the O(n^2) pairwise comparison cost:
397
+
398
+ 1. **Fast candidate finding**: vsearch quickly identifies approximate sequence matches using heuristic search
399
+ 2. **Exact scoring**: Only candidate pairs are scored using the full alignment algorithm
400
+
401
+ This reduces the computational complexity from O(n^2) to approximately O(n × k), where k is the number of candidates per sequence.
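The two-stage idea can be illustrated in plain Python. This is a sketch only: a shared k-mer index stands in for vsearch's heuristic search, and `difflib.SequenceMatcher` stands in for the full alignment scoring. Neither is what speconsense actually runs, and the function names and thresholds are assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def kmers(seq, k=8):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_pairs(seqs, k=8, min_shared=2):
    """Stage 1: cheap heuristic. Pairs sharing at least min_shared k-mers become candidates."""
    index = defaultdict(set)
    for idx, seq in enumerate(seqs):
        for kmer in kmers(seq, k):
            index[kmer].add(idx)
    shared = defaultdict(int)
    for members in index.values():
        ordered = sorted(members)
        for pos, a in enumerate(ordered):
            for b in ordered[pos + 1:]:
                shared[(a, b)] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}


def score_candidates(seqs, pairs):
    """Stage 2: the expensive comparison runs only on candidate pairs."""
    return {
        (a, b): SequenceMatcher(None, seqs[a], seqs[b], autojunk=False).ratio()
        for a, b in pairs
    }


reads = [
    "ACGTACGTTAGCCGATAGGCTA",
    "ACGTACGTTAGCCGATAGGCTT",
    "TTTTGGGGCCCCAAAATTTTGG",
]
pairs = candidate_pairs(reads)
print(score_candidates(reads, pairs))
```

Only the first two reads share k-mers, so the unrelated third read is never aligned against anything. That pruning is where the savings come from when most pairs in a dataset are dissimilar.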
402
+
403
+ ### Performance
404
+
405
+ Based on benchmarking, the break-even point is approximately 1000 sequences:
406
+ - Below 1000 sequences: brute-force is faster (less overhead)
407
+ - Above 1000 sequences: vsearch acceleration provides significant speedup
408
+ - At 10,000 sequences: ~10x faster than brute-force
409
+
410
+ ### Implementation Notes
411
+
412
+ - The scalability layer is designed to be backend-agnostic; future versions may support alternative tools (BLAST, minimap2, etc.)
413
+ - For summarize's HAC clustering, scalability is most effective when there are many consensus sequences to cluster
414
+ - Scalability mode produces identical or near-identical results to brute-force mode
415
+
416
+ ### Concurrency Control
417
+
418
+ The `--threads` option controls internal parallelism (vsearch, SPOA consensus generation). Use `0` for auto-detect.
419
+
420
+ **Default behavior differs between tools:**
421
+ - `speconsense`: defaults to `--threads 1` (safe for GNU parallel workflows)
422
+ - `speconsense-summarize`: defaults to `--threads 0` (auto-detect, since it runs once across all specimens)
423
+
424
+ ```bash
425
+ # Many speconsense jobs via parallel - single-threaded by default (safe)
426
+ ls *.fastq | parallel speconsense {}
427
+
428
+ # Single large speconsense job - enable internal parallelism
429
+ speconsense large_dataset.fastq --threads 0
430
+
431
+ # speconsense-summarize auto-detects threads by default
432
+ speconsense-summarize --source clusters/
433
+ ```
434
+
435
+ ## Early Filtering
436
+
437
+ Speconsense can optionally apply size filtering early in the pipeline (after pre-phasing merge) to skip expensive variant phasing on small clusters that will ultimately be filtered out. This can significantly improve performance for large datasets.
438
+
439
+ ### Options
440
+
441
+ - `--enable-early-filter`: Apply min-size and min-cluster-ratio filtering after pre-phasing merge. Small clusters skip variant phasing, improving performance.
442
+ - `--collect-discards`: Write all discarded reads (outliers + filtered clusters) to `cluster_debug/{sample}-discards.fastq` for inspection.
443
+
444
+ ### Performance Optimization
445
+
446
+ For large datasets where performance is critical:
447
+
448
+ ```bash
449
+ speconsense input.fastq --enable-early-filter
450
+ ```
451
+
452
+ This skips variant phasing for clusters that will be filtered out anyway, while still using scalability optimizations for O(n²) comparisons.
453
+
454
+ ## Algorithm Details
455
+
456
+ ### Clustering Methods
457
+
458
+ Speconsense offers two clustering approaches with different characteristics:
459
+
460
+ #### **Graph-based clustering with Markov Clustering (MCL)** - Default and recommended
461
+ - Constructs a similarity graph between reads and applies the Markov Clustering algorithm to identify clusters
462
+ - **Tends to produce more clusters** by discriminating between sequence variants within a specimen
463
+ - Suitable when you want to identify multiple variants per specimen and are willing to interpret complex results
464
+ - Excellent for detecting subtle sequence differences and biological variation
465
+ - Relatively fast with high-quality results for most datasets
466
+
467
+ **Why MCL?** Markov Clustering was chosen for its ability to detect natural communities in densely connected graphs without requiring a priori specification of cluster numbers. The algorithm simulates flow in the graph, causing flow to spread within natural clusters and evaporate between different clusters. By varying a single parameter (inflation), MCL can detect clusters at different scales of granularity. MCL has proven highly effective in bioinformatics applications, particularly for protein family detection in large-scale sequence databases, making it well-suited for identifying sequence variants in amplicon data.
468
+
469
+ #### **Greedy clustering** - Fast and simple alternative
470
+ - Uses greedy star clustering, which iteratively selects the sequence with the largest number of similar neighbors as the next cluster center
471
+ - **Tends to produce fewer, simpler clusters** by focusing on well-separated sequences
472
+ - Excellent at discriminating between distinct targets (e.g., primary sequence vs. contaminant)
473
+ - Usually does not separate variants within the same biological sequence
474
+ - Suitable when you want a fast algorithm with minimal complexity in interpreting results
475
+ - Ideal for applications where one consensus per specimen is sufficient
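A greedy star clustering pass can be sketched as follows. This is an illustrative outline under stated assumptions, not the speconsense implementation: the `neighbors` map (read ID to the set of read IDs at or above the identity threshold) is assumed to be precomputed, and ties are broken by read ID for determinism.

```python
def greedy_star_clusters(neighbors):
    """Greedy star clustering over a precomputed neighbor map.

    neighbors: dict mapping each read ID to the set of read IDs considered
    similar to it (hypothetical input; the similarity computation is omitted).
    Returns a list of (center, members) tuples in the order they were formed.
    """
    unassigned = set(neighbors)
    clusters = []
    while unassigned:
        # Choose the read with the most still-unassigned similar neighbors.
        center = max(sorted(unassigned), key=lambda r: len(neighbors[r] & unassigned))
        members = {center} | (neighbors[center] & unassigned)
        clusters.append((center, members))
        unassigned -= members
    return clusters


example = {
    "a": {"b", "c", "d"},
    "b": {"a"},
    "c": {"a"},
    "d": {"a"},
    "e": {"f"},
    "f": {"e"},
    "g": set(),
}
for center, members in greedy_star_clusters(example):
    print(center, sorted(members))
```

Each read joins the first center that claims it, which is why this style of clustering tends to yield fewer, coarser clusters than a graph-based method.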
476
+
477
+ #### **Choosing the Right Algorithm**
478
+
479
+ **Use `--algorithm greedy` when:**
480
+ - You want one clear consensus sequence per specimen
481
+ - Speed is prioritized over detailed variant detection
482
+ - You prefer simpler output interpretation
483
+ - Your dataset has well-separated sequences (targets vs. contaminants)
484
+
485
+ **Use `--algorithm graph` (default) when:**
486
+ - You want to detect sequence variants within specimens
487
+ - You're willing to evaluate multiple consensus sequences per specimen
488
+ - You need fine-grained discrimination between similar sequences
489
+ - You want the most comprehensive analysis of sequence diversity
490
+ - **Note**: Use `speconsense-summarize` after clustering to manage multiple variants per specimen. Key options include `--merge-position-count` for merging variants differing by few SNPs/indels, `--group-identity` for grouping similar variants, and `--select-max-variants`/`--select-strategy` for controlling which variants to output
491
+
492
+ ### Cluster Merging
493
+
494
+ After initial clustering, Speconsense automatically merges clusters with identical or homopolymer-equivalent consensus sequences:
495
+
496
+ **Default behavior (homopolymer-aware merging):**
497
+ - Clusters with identical consensus sequences are automatically merged
498
+ - Clusters with homopolymer-equivalent sequences are also merged (e.g., "AAA" vs "AAAAA" treated as identical)
499
+ - Homopolymer length differences are ignored as they typically represent ONT sequencing artifacts
500
+ - This helps eliminate redundant clusters that represent the same biological sequence but differ only in homopolymer lengths
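One simple way to picture homopolymer equivalence is to collapse every run of identical bases to a single base and compare the results. The sketch below is a deliberately simplified stand-in: speconsense relies on the adjusted-identity package for this, and the helper names here are hypothetical.

```python
from itertools import groupby


def collapse_homopolymers(seq):
    """Reduce every run of identical bases to a single base (AAAAA -> A)."""
    return "".join(base for base, _run in groupby(seq))


def homopolymer_equivalent(a, b):
    """True when two sequences differ only in homopolymer run lengths."""
    return collapse_homopolymers(a) == collapse_homopolymers(b)


print(homopolymer_equivalent("ACGAAAT", "ACGAAAAAT"))  # run-length difference only
print(homopolymer_equivalent("ACGAAAT", "ACGCAAT"))    # a real substitution
```

Note that full collapsing also treats a single base and a doubled base as equivalent, which may be more permissive than the actual scoring.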
501
+
502
+ **Strict identity merging (`--disable-homopolymer-equivalence`):**
503
+ - Only clusters with perfectly identical consensus sequences are merged
504
+ - Use this flag when homopolymer differences may have biological significance
505
+ - Results in more clusters but ensures no information loss from homopolymer variation
506
+
507
+ This automatic merging step helps consolidate redundant clusters that were separated during initial clustering. The adjusted identity scoring with homopolymer normalization provides more accurate assessment of sequence similarity for merging decisions, especially for nanopore data where homopolymer length calling can be inconsistent.
508
+
509
+ **Variant merging with IUPAC codes:** For more aggressive merging based on SNP and indel thresholds (which creates IUPAC consensus sequences with ambiguity codes), use the `--merge-position-count` and `--merge-indel-length` options in `speconsense-summarize` during post-processing. See the [Advanced Post-Processing](#advanced-post-processing) section.
510
+
511
+ ### Cluster Size Filtering
512
+
513
+ Speconsense provides two complementary filters to control which clusters are output:
514
+
515
+ **Absolute size filtering (`--min-size`, default: 5):**
516
+ - Filters clusters by absolute number of sequences
517
+ - Applied **after merging** identical/homopolymer-equivalent clusters
518
+ - Set to 0 to disable and output all clusters regardless of size
519
+
520
+ **Relative size filtering (`--min-cluster-ratio`, default: 0.01):**
521
+ - Filters clusters based on size relative to the largest cluster
522
+ - Also applied **after merging** identical/homopolymer-equivalent clusters (post-merge sizes)
523
+ - **Based on full cluster sizes** (post-merge, before sampling), not the sampled sizes from `--max-sample-size`
524
+ - Set to 0 to disable and keep all clusters that pass `--min-size`
525
+
526
+ **Processing order:**
527
+ 1. Initial clustering produces raw clusters
528
+ 2. Merge identical/homopolymer-equivalent clusters
529
+ 3. Filter by `--min-size` (absolute threshold, using post-merge sizes)
530
+ 4. Filter by `--min-cluster-ratio` (relative threshold, using post-merge sizes)
531
+ 5. Sample sequences for consensus generation if cluster > `--max-sample-size`
532
+
533
+ This order ensures that small clusters with identical/homopolymer-equivalent consensus sequences can merge before size filtering is applied. This allows, for example, two clusters of size 3 with identical consensus to merge into a size-6 cluster that passes the `--min-size=5` threshold, rather than being discarded prematurely.
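The effect of merging before filtering can be shown with a few lines of Python. This sketch merges only exactly identical consensus strings and assumes the ratio test is "size at least `min_cluster_ratio` times the largest cluster"; both the helper and that comparison are illustrative assumptions rather than the tool's code.

```python
def merge_then_filter(clusters, min_size=5, min_cluster_ratio=0.01):
    """clusters: list of (consensus, read_count) pairs.

    Returns the surviving (consensus, size) pairs, largest first.
    """
    # Step 1: merge clusters that share a consensus sequence.
    merged = {}
    for consensus, size in clusters:
        merged[consensus] = merged.get(consensus, 0) + size
    if not merged:
        return []
    # Steps 2 and 3: apply absolute and relative thresholds to post-merge sizes.
    largest = max(merged.values())
    kept = [
        (consensus, size)
        for consensus, size in merged.items()
        if size >= min_size and size >= min_cluster_ratio * largest
    ]
    return sorted(kept, key=lambda item: -item[1])


raw = [("ACGT", 3), ("ACGT", 3), ("TTGA", 4), ("GGCC", 100)]
print(merge_then_filter(raw))
```

Here the two size-3 clusters survive as one size-6 cluster, while the size-4 cluster is dropped. Filtering first would have discarded all three small clusters.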
534
+
535
+ **Deferred filtering strategy:**
536
+ For maximum flexibility in detecting rare variants and contaminants, disable filtering in speconsense (`--min-size 0 --min-cluster-ratio 0`) and apply final quality thresholds using `--min-ric` in speconsense-summarize. This allows you to run expensive clustering once and experiment with different quality thresholds during post-processing. However, be aware that permissive filtering may allow more bioinformatic contamination through the pipeline. When using this approach, consider stricter filtering during upstream demultiplexing or perform careful manual review of low-abundance clusters.
537
+
538
+ ### Consensus Generation
539
+
540
+ Consensus sequences are generated using SPOA (SIMD Partial Order Alignment) which efficiently handles the error profile of nanopore reads. For larger clusters, a random subset of reads (controlled by `--max-sample-size`) is used to generate the consensus.
541
+
542
+ ### Read Identity and Variance Metrics
543
+
544
+ To evaluate the reliability of consensus sequences, Speconsense calculates read identity metrics by:
545
+ 1. Aligning all reads in a cluster to the consensus sequence using SPOA multiple sequence alignment
546
+ 2. Computing per-read identity scores with homopolymer normalization
547
+ 3. Reporting mean read identity (`rid`) and minimum read identity (`rid_min`)
548
+
549
+ **Key metrics:**
550
+ - `rid` - Mean read identity: The average identity of all reads to the consensus (0-100%). Higher values indicate more homogeneous clusters.
551
+ - `rid_min` - Minimum read identity: The worst-case read identity. Low values may indicate outliers or mixed clusters.
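As a rough illustration of how the two metrics relate, the sketch below computes a plain percent identity for each read from already-aligned rows and then reports the mean and minimum. It omits homopolymer normalization, and the choice of denominator (alignment columns where at least one row has a base) is an assumption; the function names are hypothetical.

```python
def read_identity(aligned_read, aligned_consensus):
    """Percent identity over columns where at least one row has a base."""
    columns = [
        (r, c)
        for r, c in zip(aligned_read, aligned_consensus)
        if (r, c) != ("-", "-")
    ]
    if not columns:
        return 0.0
    matches = sum(1 for r, c in columns if r == c)
    return 100.0 * matches / len(columns)


def identity_metrics(aligned_reads, aligned_consensus):
    """Summarize per-read identities as mean (rid) and minimum (rid_min)."""
    scores = [read_identity(read, aligned_consensus) for read in aligned_reads]
    return {
        "rid": round(sum(scores) / len(scores), 1),
        "rid_min": round(min(scores), 1),
    }


consensus = "ACGTTACGTA"
reads = ["ACGTTACGTA", "ACGT-ACGTA"]
print(identity_metrics(reads, consensus))
```

A large gap between `rid` and `rid_min` in real output suggests that a few reads fit the consensus much worse than the rest.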
552
+
553
+ **Homopolymer-normalized identity**: Speconsense uses the adjusted-identity algorithm with homopolymer normalization for more accurate sequence comparisons. This means that differences in homopolymer run lengths (e.g., AAA vs AAAAA) are treated as identical, which is particularly important for nanopore sequencing data where homopolymer length calling can be inconsistent. The identity metrics exclude homopolymer length differences, so low `rid` values indicate true substitutions or structural indels rather than sequencing artifacts.
554
+
555
+ **Automatic variant detection**: By default, Speconsense analyzes positional variation within clusters to distinguish true biological variants from sequencing errors. Positions where variant alleles exceed the minimum frequency threshold are flagged as variant positions, and reads are automatically separated into distinct haplotypes (see [Automatic Variant Phasing](#automatic-variant-phasing)).
556
+
557
+ ### Primer Trimming
558
+
559
+ When a primers file is provided via `--primers`, Speconsense will identify and trim primer sequences from the 5' and 3' ends of consensus sequences, producing clean amplicon sequences for downstream analysis.
560
+
561
+ **Automatic primer detection**: If `--primers` is not specified, Speconsense will automatically look for `primers.fasta` in the same directory as the input FASTQ file. If found, primer trimming will be enabled automatically.
562
+
563
+ ### Automatic Variant Phasing
564
+
565
+ By default, Speconsense automatically detects and separates biological variants within clusters. This feature is particularly useful for heterozygous samples or mixed-species amplicons.
566
+
567
+ **How variant phasing works:**
568
+
569
+ 1. **Variant detection**: After initial clustering, Speconsense analyzes positional variation using multiple sequence alignment. Positions where the minor allele frequency exceeds the threshold (default 10%) and meets minimum read count requirements are identified as variant positions.
570
+
571
+ 2. **Position selection**: When multiple variant positions are detected, Speconsense selects the single best position for splitting (minimizing within-cluster error), then recursively regenerates MSA for each subcluster to discover additional variant positions. This hierarchical approach prevents over-fragmentation while allowing deep phasing when supported by the data.
572
+
573
+ 3. **Haplotype separation**: Reads are grouped by their allele combinations at selected variant positions. Each unique combination becomes a separate sub-cluster (haplotype).
574
+
575
+ 4. **Haplotype filtering**: Small haplotypes that don't meet minimum thresholds are reassigned to the nearest qualifying haplotype, preventing read loss.
576
+
577
+ 5. **IUPAC ambiguity calling**: When only one haplotype qualifies (insufficient support for phasing), variant positions are encoded using IUPAC ambiguity codes (e.g., `Y` for C/T, `R` for A/G) rather than forcing a potentially incorrect consensus call. Ambiguity calling uses a lower count threshold (3 reads vs 5 for splitting), capturing variation that doesn't have enough support to form a viable separate cluster.
578
+
579
+ **Key parameters for variant phasing:**
580
+ - `--min-variant-frequency` - Minimum minor allele frequency to trigger cluster splitting (default: 0.10 = 10%)
581
+ - `--min-variant-count` - Minimum read count for minor allele to trigger splitting (default: 5)
582
+ - `--disable-position-phasing` - Disable variant phasing entirely
583
+
584
+ **Key parameters for IUPAC ambiguity calling:**
585
+ - `--min-ambiguity-frequency` - Minimum minor allele frequency for IUPAC codes (default: 0.10 = 10%)
586
+ - `--min-ambiguity-count` - Minimum read count for minor allele for IUPAC codes (default: 3)
587
+ - `--disable-ambiguity-calling` - Disable IUPAC codes for unphased variants
588
+
589
+ The frequency thresholds are unified at 10%, but the count thresholds differ: splitting requires 5 reads (matching `--min-size`) to ensure viable clusters, while ambiguity calling uses 3 reads since it doesn't create new clusters.
590
+
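The interaction between the two sets of thresholds can be sketched as below. This is a simplified model, not Speconsense's implementation: it assumes the "minor allele" is the second most common base in an alignment column, treats the frequency thresholds as inclusive, and only handles two-base IUPAC codes. The function name `classify_position` is hypothetical.

```python
from collections import Counter

# Standard IUPAC codes for two-base ambiguities
IUPAC_PAIRS = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def classify_position(column, min_variant_frequency=0.10, min_variant_count=5,
                      min_ambiguity_frequency=0.10, min_ambiguity_count=3):
    """Return ('split' | 'ambiguity' | 'none', iupac_code_or_None) for one column.

    Assumption: the minor allele is the second most common base in the column.
    """
    top_two = Counter(column).most_common(2)
    if len(top_two) < 2:
        return "none", None  # no variation at this position
    (major, _major_count), (minor, minor_count) = top_two
    frequency = minor_count / len(column)
    if frequency >= min_variant_frequency and minor_count >= min_variant_count:
        return "split", None  # enough support to form a separate haplotype
    if frequency >= min_ambiguity_frequency and minor_count >= min_ambiguity_count:
        return "ambiguity", IUPAC_PAIRS.get(frozenset(major + minor))
    return "none", None


print(classify_position("C" * 14 + "T" * 6))   # 6 minor reads at 30%
print(classify_position("C" * 16 + "T" * 4))   # 4 minor reads at 20%
print(classify_position("C" * 19 + "T" * 1))   # 1 minor read at 5%
```

With the defaults, a minor allele seen in 4 of 20 reads passes the frequency threshold but not the splitting count, so it falls through to an IUPAC code instead of a new cluster.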
591
+ **Example:**
592
+ ```bash
593
+ # Default behavior: variant phasing enabled
594
+ speconsense input.fastq
595
+
596
+ # More permissive variant detection (lower frequency threshold)
597
+ speconsense input.fastq --min-variant-frequency 0.05
598
+
599
+ # More aggressive IUPAC ambiguity calling
600
+ speconsense input.fastq --min-ambiguity-frequency 0.05 --min-ambiguity-count 2
601
+
602
+ # Disable variant phasing but keep ambiguity calling
603
+ speconsense input.fastq --disable-position-phasing
604
+
605
+ # Disable both phasing and ambiguity calling
606
+ speconsense input.fastq --disable-position-phasing --disable-ambiguity-calling
607
+ ```
608
+
609
+ **When to use variant phasing (default):**
610
+ - Heterozygous specimens where you want to separate alleles
611
+ - Samples with potential mixed-species amplification
612
+ - Quality control to identify clusters with internal heterogeneity
613
+
614
+ **When to disable variant phasing:**
615
+ - Homozygous samples where variation indicates sequencing errors only
616
+ - When you prefer merged IUPAC consensus over separate haplotypes
617
+ - For faster processing when variant separation is not needed
618
+
619
+ ## Advanced Post-Processing
620
+
621
+ The `speconsense-summarize` tool provides sophisticated options for managing multiple variants per specimen. This section covers advanced variant handling - for basic usage, see the [Usage](#usage) section above.
622
+
623
+ ### Variant Grouping and Selection
624
+
625
+ When multiple variants exist per specimen, `speconsense-summarize` first groups similar variants together, then applies selection strategies within each group:
626
+
627
+ **Variant Grouping:**
628
+ ```bash
629
+ speconsense-summarize --group-identity 0.9
630
+ ```
631
+ - Uses **Hierarchical Agglomerative Clustering (HAC)** to group variants with sequence identity ≥ threshold
632
+ - Default threshold is 0.9 (90% identity)
633
+ - Variants within each group are considered similar enough to represent the same biological entity
634
+ - **Occurs before merging** to separate dissimilar sequences (e.g., contaminants from primary targets)
635
+ - Each group will contribute one primary variant plus additional variants based on selection strategy
636
+
637
+ **Variant Selection (within each group):**
638
+
639
+ Within each group, `speconsense-summarize` offers two strategies for selecting which variants to output:
640
+
641
+ **Size-based selection (default):**
642
+ ```bash
643
+ speconsense-summarize --select-strategy size --select-max-variants 2
644
+ ```
645
+ - Selects variants by cluster size (largest first)
646
+ - Primary variant is always the largest cluster
647
+ - Additional variants are selected in order of decreasing size
648
+ - Best for identifying the most abundant sequence variants
649
+ - Suitable when read count reflects biological abundance
650
+ - **Default**: `--select-max-variants=-1` (outputs all variants, no limit)
651
+
652
+ **Diversity-based selection:**
653
+ ```bash
654
+ speconsense-summarize --select-strategy diversity --select-max-variants 2
655
+ ```
656
+ - Uses a **maximum distance algorithm** to select variants that are most genetically different from each other
657
+ - Primary variant is still the largest cluster
658
+ - Additional variants are selected to maximize sequence diversity in the output
659
+ - Iteratively selects the variant with the **maximum minimum distance** to all previously selected variants
660
+ - Best for capturing the full range of genetic variation in your sample
661
+ - Suitable when you want to detect distinct sequence types regardless of their abundance
662
+
663
+ **Algorithm Details for Diversity Selection (within each group):**
664
+ 1. Primary variant = largest cluster within the group (by read count)
665
+ 2. For each additional variant slot in the group:
666
+ - Calculate the minimum sequence distance from each remaining candidate to all already-selected variants in this group
667
+ - Select the candidate with the largest minimum distance (farthest from all selected in this group)
668
+ - Repeat until `--select-max-variants` is reached for this group
669
+
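The steps above amount to a greedy max-min selection, which can be sketched as follows. The tuple layout, the function names, and the Hamming distance are assumptions made for illustration; the tool's actual distance is alignment-based.

```python
def hamming_distance(a: str, b: str) -> int:
    """Toy distance for equal-length sequences (stand-in for a real alignment distance)."""
    return sum(x != y for x, y in zip(a, b))


def select_diverse(variants, max_variants=-1, distance=hamming_distance):
    """Greedy max-min selection within one group.

    variants: non-empty list of (name, sequence, read_count) tuples.
    max_variants: -1 means no limit, mirroring the documented default.
    """
    limit = len(variants) if max_variants < 0 else max_variants
    remaining = sorted(variants, key=lambda v: v[2], reverse=True)
    selected = [remaining.pop(0)]  # primary variant = largest cluster
    while remaining and len(selected) < limit:
        # Pick the candidate whose nearest already-selected variant is farthest away
        best = max(
            remaining,
            key=lambda cand: min(distance(cand[1], s[1]) for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return [name for name, _seq, _count in selected]


group = [
    ("c1", "ACGTACGT", 400),
    ("c2", "ACGTACGA", 200),  # 1 difference from c1
    ("c3", "TCGAACTT", 50),   # 3 differences from c1
]
print(select_diverse(group, max_variants=2))
```

Here the second slot goes to `c3` even though `c2` has more reads, because `c3` is farther from the primary; size-based selection would have picked `c2`.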
670
+ **Overall Process:**
671
+ 1. Group variants by sequence identity using HAC clustering
672
+ 2. For each group independently:
673
+ - Apply MSA-based merging to find largest compatible subsets
674
+ - Apply variant selection strategy (size or diversity)
675
+ - Output up to `--select-max-variants` variants per group
676
+ 3. Final output contains representatives from all groups, ensuring both biological diversity (between groups) and appropriate sampling within each biological entity (within groups)
677
+
678
+ This two-stage process ensures that distinct biological sequences are preserved as separate groups, while providing control over variant complexity within each group.
679
+
680
+ ### Customizing FASTA Header Fields
681
+
682
+ Control which metadata fields appear in FASTA headers using the `--fasta-fields` option:
683
+
684
+ **Presets:**
685
+ ```bash
686
+ # Default preset (current behavior)
687
+ speconsense-summarize --fasta-fields default
688
+ # Output: >sample-1 size=638 ric=638 rawric=333+305 snp=1 ambig=1 primers=5'-ITS1F,3'-ITS4_RC
689
+
690
+ # Minimal headers - just the essentials
691
+ speconsense-summarize --fasta-fields minimal
692
+ # Output: >sample-1 size=638 ric=638
693
+
694
+ # QC preset - includes read identity and length
695
+ speconsense-summarize --fasta-fields qc
696
+ # Output: >sample-1 size=638 ric=638 length=589 rid=98.5 ambig=1
697
+
698
+ # Full metadata (all available fields)
699
+ speconsense-summarize --fasta-fields full
700
+ # Output: >sample-1 size=638 ric=638 length=589 rawric=333+305 snp=1 ambig=1 rid=98.5 primers=...
701
+
702
+ # ID only (no metadata)
703
+ speconsense-summarize --fasta-fields id-only
704
+ # Output: >sample-1
705
+ ```
706
+
707
+ **Custom field selection:**
708
+ ```bash
709
+ # Specify individual fields (comma-separated)
710
+ speconsense-summarize --fasta-fields size,ric,primers
711
+
712
+ # Combine presets and fields
713
+ speconsense-summarize --fasta-fields minimal,rid
714
+
715
+ # Combine multiple presets
716
+ speconsense-summarize --fasta-fields minimal,qc
717
+ ```
718
+
719
+ **Available fields:**
720
+ - `size` - Total reads across merged variants
721
+ - `ric` - Reads in consensus
722
+ - `length` - Sequence length in bases
723
+ - `rawric` - RiC values of .raw source variants (only when merged)
724
+ - `rawlen` - Original sequence lengths before overlap merging (only when overlap-merged)
725
+ - `snp` - Number of IUPAC positions from merging (only when >0)
726
+ - `ambig` - Count of IUPAC ambiguity codes in consensus
727
+ - `rid` - Mean read identity percentage (when available)
728
+ - `rid_min` - Minimum read identity percentage (when available)
729
+ - `primers` - Detected primer names (when detected)
730
+ - `group` - Variant group number
731
+ - `variant` - Variant identifier within group (only for variants)
732
+
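For downstream scripts, headers in this style can be read back into a dictionary. The sketch below assumes every token after the sequence ID has the form `key=value` and is separated by whitespace, as in the example outputs above; `parse_header` is a hypothetical helper, not part of Speconsense.

```python
def parse_header(header: str):
    """Split a header like '>sample-1 size=638 ric=638' into (id, fields)."""
    seq_id, *tokens = header.lstrip(">").split()
    # Assumption: every remaining token is key=value
    fields = dict(token.split("=", 1) for token in tokens)
    return seq_id, fields


seq_id, fields = parse_header(
    ">sample-1 size=638 ric=638 rawric=333+305 snp=1 ambig=1 primers=5'-ITS1F,3'-ITS4_RC"
)
raw_rics = [int(x) for x in fields["rawric"].split("+")]  # per-variant RiC values
primers = fields["primers"].split(",")
print(seq_id, int(fields["size"]), raw_rics, primers)
```

Fields that are only emitted conditionally (such as `rawric` or `snp`) may be absent, so real code should use `fields.get(...)` for them.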
733
+ **Use cases:**
734
+ - **Downstream tool compatibility**: Use `minimal` or `id-only` for tools expecting simple headers
735
+ - **Quality control**: Use `qc` preset to include read identity metrics for assessing consensus quality
736
+ - **File size optimization**: Use `minimal` to reduce file size for large datasets
737
+ - **Custom workflows**: Combine presets and fields for workflow-specific needs
738
+
739
+ ### Quality Assessment and Reporting
740
+
741
+ Speconsense-summarize automatically generates a `quality_report.txt` file to help prioritize manual review of sequences with potential quality concerns. This is particularly valuable for high-throughput workflows where human time for manual inspection is limited.
742
+
743
+ **Report Generation:**
744
+ - Created automatically in the summary output directory
745
+ - No configuration required - generated regardless of `--fasta-fields` settings
746
+ - Focuses exclusively on sequences that may need review (no "excellent" sequences listed)
747
+
748
+ **What the Report Includes:**
749
+
750
+ **1. Statistical Outliers (Low Read Identity):**
751
+ - Sequences with mean read identity (`rid`) below the statistical threshold
752
+ - Threshold is calculated as: mean - 2×standard deviation across all sequences
753
+ - If `rid` values are roughly normally distributed, this flags about the ~2.5% of sequences with lowest internal consistency
754
+ - May indicate mixed clusters, contamination, or problematic consensus
755
+
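The threshold rule can be reproduced on your own data with a few lines. This sketch assumes the population standard deviation; the report may use the sample standard deviation instead, which would shift the threshold slightly. The function name and the `rid` values are made up for illustration.

```python
from statistics import mean, pstdev


def rid_outliers(rid_by_sequence):
    """Return (threshold, flagged_names) using threshold = mean - 2 * SD."""
    values = list(rid_by_sequence.values())
    threshold = mean(values) - 2 * pstdev(values)  # assumption: population SD
    flagged = sorted(
        (name for name, rid in rid_by_sequence.items() if rid < threshold),
        key=rid_by_sequence.get,  # lowest rid first, matching the triage order
    )
    return threshold, flagged


# Hypothetical data: nine consistent sequences and one poor one
rids = {f"specimen-{i}": 98.0 for i in range(1, 10)}
rids["specimen-10"] = 90.0
threshold, flagged = rid_outliers(rids)
print(round(threshold, 1), flagged)
```

Because the threshold is relative to the run, a batch of uniformly poor sequences will flag few outliers; absolute `rid` values are still worth a glance.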
756
+ **2. Sequences with IUPAC Ambiguity Codes:**
757
+ - Sequences containing IUPAC ambiguity codes (Y, R, W, S, K, M, etc.)
758
+ - May result from merging or from unphased heterozygous positions
759
+ - Count shown in `ambig` field
760
+
761
+ **3. Overlap Merge Analysis:**
762
+ - Lists specimens where partial overlap merging occurred (different-length sequences with shared overlap region)
763
+ - Shows merge iterations, overlap percentages, and prefix/suffix extensions
764
+ - Flags edge cases (overlap near threshold, large length ratios >3:1)
765
+ - Only includes partial overlaps (prefix or suffix extension >0); excludes full containment merges
766
+
767
+ **Understanding Read Identity Metrics:**
768
+
769
+ Read identity assessment works by:
770
+ 1. Aligning all reads in a cluster to the consensus using SPOA multiple sequence alignment
771
+ 2. Computing per-read identity with homopolymer normalization
772
+ 3. Reporting mean identity (`rid`) and minimum identity (`rid_min`)
773
+
774
+ **Key points about identity calculation:**
775
+ - Uses homopolymer normalization - differences like AAA vs AAAAA don't reduce identity
776
+ - Therefore, low identity indicates **true substitutions or structural indels**
777
+ - `rid` typically ranges from 95-99% for good quality clusters
778
+ - `rid_min` shows the worst-case read - very low values may indicate outlier reads
779
+
780
+ **Example Report Entry:**
781
+ ```
782
+ Type Sequence RiC rid Notes
783
+ ------------------------------------------------------------------------------------------------
784
+ STAT specimen-1 450 93.2 Statistical outlier (rid below threshold)
785
+ specimen-2 320 97.8 Contains 2 ambiguity codes
786
+ ```
787
+
788
+ **Recommended Actions:**
789
+
790
+ **For Statistical Outliers (low rid):**
791
+ - Review cluster using `cluster_debug/` FASTQ files in source directory
792
+ - Check for biological variation (multiple true variants) vs bioinformatic contamination
793
+ - Variant phasing is enabled by default; consider adjusting `--min-variant-frequency` if variants weren't separated
794
+ - May require manual curation or re-demultiplexing
795
+
796
+ **For Sequences with Ambiguity Codes:**
797
+ - Review whether ambiguity represents true heterozygosity or merging artifact
798
+ - Consider adjusting merge parameters if too many variants are being merged
799
+ - Check if variant phasing would produce cleaner separate haplotypes
800
+
801
+ **Workflow Integration:**
802
+
803
+ The quality report is designed for efficient triage:
804
+ 1. Statistical outliers are flagged automatically based on global thresholds
805
+ 2. Review flagged sequences starting with lowest `rid` values
806
+ 3. Use `rid_min` to identify sequences with problematic individual reads
807
+ 4. Consider re-processing with variant phasing for mixed clusters
808
+
809
+ For high-throughput workflows (e.g., 100K sequences/year), this prioritization ensures human review time focuses on the most actionable quality issues.
810
+
811
+ ### Additional Summarize Options
812
+
813
+ **Quality Filtering:**
814
+ ```bash
815
+ speconsense-summarize --min-ric 5
816
+ ```
817
+ - Filters out consensus sequences with fewer than the specified number of Reads in Consensus (RiC)
818
+ - Default is 3 - only sequences supported by at least 3 reads are processed
819
+ - Higher values provide more stringent quality control but may exclude valid low-abundance variants
820
+
821
+ **Length Filtering:**
822
+ ```bash
823
+ speconsense-summarize --min-len 400 --max-len 800
824
+ ```
825
+ - `--min-len`: Minimum sequence length in bp (default: 0 = disabled)
826
+ - `--max-len`: Maximum sequence length in bp (default: 0 = disabled)
827
+ - Applied during initial sequence loading, before HAC grouping
828
+ - Useful for filtering incomplete amplicons or chimeric sequences
829
+
830
+ **Variant Merging:**
831
+ ```bash
832
+ # Basic SNP-only merging (default)
833
+ speconsense-summarize --merge-position-count 2
834
+
835
+ # Enable indel merging (up to 3bp indels)
836
+ speconsense-summarize --merge-position-count 3 --merge-indel-length 3
837
+
838
+ # Disable SNP merging (only merge identical sequences)
839
+ speconsense-summarize --merge-snp false
840
+
841
+ # Disable all merging (fastest, skip MSA evaluation entirely)
842
+ speconsense-summarize --disable-merging
843
+
844
+ # Control merge evaluation effort (performance vs thoroughness)
845
+ speconsense-summarize --merge-effort fast # Faster, may miss some merges in large groups
846
+ speconsense-summarize --merge-effort balanced # Default
847
+ speconsense-summarize --merge-effort thorough # More exhaustive for large variant groups
848
+
849
+ # Legacy parameter (still supported)
850
+ speconsense-summarize --snp-merge-limit 2 # Equivalent to --merge-position-count 2
851
+ ```
852
+ - **Occurs within each HAC group** - merges variants that differ by small numbers of SNPs and/or indels
853
+ - Uses **MSA-based approach with SPOA**: evaluates all possible subsets to find the largest compatible group for merging
854
+ - Creates **IUPAC consensus sequences** with size-weighted majority voting at polymorphic positions
855
+ - **Order-independent**: produces identical results regardless of input order (unlike pairwise greedy approaches)
856
+ - Only merges variants with **identical primer sets** to maintain biological validity
857
+
858
+ **Merge Parameters:**
859
+ - `--disable-merging`: Skip MSA-based merge evaluation entirely (fastest option when merging is not needed)
860
+ - `--merge-effort LEVEL`: Control merge evaluation thoroughness. Presets: `fast` (8), `balanced` (10, default), `thorough` (12), or numeric 6-14. Higher values use larger batch sizes for exhaustive subset search, improving merge quality for large variant groups at the cost of runtime
861
+ - `--merge-position-count N`: Maximum total SNP + structural indel positions allowed (default: 2)
862
+ - `--merge-indel-length N`: Maximum length of individual structural indels allowed (default: 0 = disabled)
863
+ - `--merge-snp`: Enable/disable SNP merging (default: True)
864
+ - `--merge-min-size-ratio R`: Minimum size ratio (smaller/larger) for merging clusters (default: 0.1, 0 to disable)
865
+ - `--disable-homopolymer-equivalence`: Treat homopolymer length differences as structural indels (off by default, so homopolymer equivalence is enabled unless this flag is given)
866
+ - `--snp-merge-limit N`: Legacy parameter, equivalent to `--merge-position-count` (deprecated)
867
+
868
+ **Homopolymer-Aware Merging (Default Behavior):**
869
+
870
+ By default, speconsense-summarize uses **homopolymer-aware merging** that distinguishes between structural indels (true insertions/deletions) and homopolymer length differences (e.g., AAA vs AAAA). This matches the semantics of adjusted-identity alignment used throughout the pipeline.
871
+
872
+ **How it works:**
873
+ - Analyzes SPOA multiple sequence alignment to classify each indel column
874
+ - **Homopolymer indel**: All non-gap bases are the same, and all sequences agree on that base in adjacent columns
875
+ - Example: `ATA-AAGC` vs `ATAAAAGC` → homopolymer (all have A's flanking the gap)
876
+ - **Structural indel**: True insertion/deletion, or indel adjacent to SNP
877
+ - Example: `CTAA-GC` vs `CTG-AGC` → structural (A vs G flanking the gap)
878
+ - **Homopolymer indels are ignored** when checking merge compatibility (treated as equivalent)
879
+ - **Structural indels count** against `--merge-position-count` and `--merge-indel-length` limits
880
+
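The classification rule can be sketched on aligned strings as shown below. This is an illustrative simplification, not the tool's code: it assumes a gap column counts as a homopolymer indel when at least one directly adjacent column shows the same base in every sequence, and it treats any adjacent column containing a gap as non-matching. The function name is hypothetical.

```python
def classify_indel_columns(aligned):
    """Label each gapped column as 'homopolymer' or 'structural'.

    aligned: equal-length strings that use '-' for gaps.
    Returns {column_index: label} for columns containing at least one gap.
    """
    length = len(aligned[0])
    columns = [[seq[i] for seq in aligned] for i in range(length)]
    labels = {}
    for i, col in enumerate(columns):
        if "-" not in col:
            continue
        bases = {b for b in col if b != "-"}
        if len(bases) != 1:
            labels[i] = "structural"  # mixed bases (or none) in a gapped column
            continue
        (base,) = bases
        neighbours = [columns[j] for j in (i - 1, i + 1) if 0 <= j < length]
        # Assumption: one fully agreeing neighbour column is enough
        flanked = any(all(b == base for b in n) for n in neighbours)
        labels[i] = "homopolymer" if flanked else "structural"
    return labels


print(classify_indel_columns(["ATA-AAGC", "ATAAAAGC"]))
print(classify_indel_columns(["CTAA-GC", "CTG-AGC"]))
```

Applied to the two examples in the list, the first alignment yields a single homopolymer column and the second yields two structural columns.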
881
+ **Default behavior examples:**
882
+ - Sequences differing only by homopolymer length (AAA vs AAAA): **Merge** ✓
883
+ - Sequences with 2 SNPs + 5 homopolymer indels: **Merge** ✓ (only SNPs count)
884
+ - Sequences with 1 structural indel: **Blocked** (default `--merge-indel-length=0`)
885
+
886
+ **Strict identity merging (`--disable-homopolymer-equivalence`):**
887
+ ```bash
888
+ speconsense-summarize --disable-homopolymer-equivalence
889
+ ```
890
+ - Treats homopolymer length differences as structural indels
891
+ - All indels (both homopolymer and structural) count against merge limits
892
+ - Only sequences identical (or differing by SNPs if `--merge-snp` enabled) can merge
893
+ - Use when you want to preserve homopolymer length variation as distinct variants
894
+
895
+ **Example comparison:**
896
+ ```bash
897
+ # Default (homopolymer equivalence enabled)
898
+ # Variants: ATAAAGC (3 A's) vs ATAAAAGC (4 A's)
899
+ speconsense-summarize
900
+ # Result: Merge ✓ (homopolymer difference ignored)
901
+
902
+ # Strict mode (homopolymer equivalence disabled)
903
+ speconsense-summarize --disable-homopolymer-equivalence
904
+ # Result: Do not merge (treated as structural indel)
905
+ ```
906
+
907
+ **Position counting examples:**
908
+
909
+ With default settings (`--merge-position-count 2 --merge-indel-length 0`):
910
+ - 2 SNPs + 0 structural indels + 10 homopolymer indels ✓ (homopolymer indels are not counted)
911
+ - 0 SNPs + 1 structural indel + 5 homopolymer indels ✗ (structural indel blocked)
912
+ - 1 SNP + 1 structural indel (≤2bp) ✗ (indel blocked by length limit)
913
+
914
+ With indels enabled (`--merge-position-count 3 --merge-indel-length 2`):
915
+ - 3 SNPs + 0 structural indels + any homopolymer indels ✓
916
+ - 2 SNPs + 1 structural indel (≤2bp) + any homopolymer indels ✓
917
+ - 0 SNPs + 3 structural indels (each ≤2bp) + any homopolymer indels ✓
918
+ - 2 SNPs + 2 structural indels (each ≤2bp) ✗ (total=4 > 3)
919
+ - 2 SNPs + 1 structural indel (3bp) ✗ (indel too long)
920
+
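The counting rules in these examples can be expressed as a small predicate. The sketch assumes SNPs and structural indels each count as one position and that homopolymer indels have already been excluded; the function name is hypothetical.

```python
def within_merge_limits(snp_count, structural_indel_lengths,
                        merge_position_count=2, merge_indel_length=0):
    """Check SNP and structural-indel counts against the two limits.

    structural_indel_lengths: length in bp of each structural indel.
    Homopolymer indels are not passed in because they are ignored by default.
    """
    if any(length > merge_indel_length for length in structural_indel_lengths):
        return False  # at least one structural indel exceeds the length limit
    total_positions = snp_count + len(structural_indel_lengths)
    return total_positions <= merge_position_count


# Default settings: any structural indel blocks the merge
print(within_merge_limits(2, []))
print(within_merge_limits(1, [2]))

# Indels enabled: --merge-position-count 3 --merge-indel-length 2
print(within_merge_limits(2, [2], merge_position_count=3, merge_indel_length=2))
print(within_merge_limits(2, [2, 2], merge_position_count=3, merge_indel_length=2))
```

Checking the length limit first mirrors the examples, where a too-long indel blocks the merge regardless of the total position count.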
921
+ **Merged consensus tracking:**
922
+ - Merged sequences include `rawric` header field showing RiC values of merged .raw variants
923
+ - Example: `>sample-1 size=250 ric=250 rawric=100+89+61 snp=2`
924
+ - Helps trace which original clusters contributed to merged consensus
925
+
926
+ **Note**: Homopolymer-aware merging in speconsense-summarize complements the automatic homopolymer-equivalent cluster merging that occurs during the main clustering step in speconsense.
927
+
928
+ **Understanding MSA-Based Merge Evaluation:**
929
+
930
+ A key architectural feature of speconsense-summarize is that **all sequences within a HAC group are aligned together by SPOA first**, creating a single multiple sequence alignment (MSA) that provides biological and evolutionary context for all subsequent merge decisions within that group.
931
+
932
+ **Why this matters:**
933
+
934
+ When evaluating whether two sequences can merge, the algorithm doesn't perform a simple pairwise alignment. Instead, it examines how those sequences align within the context of all other sequences in their HAC group. This shared alignment context can reveal that apparent structural differences are actually homopolymer length variations relative to the majority pattern.
935
+
936
+ **Example:**
937
+
938
+ Consider two sequences (c4 and c9) from the same HAC group. When aligned in isolation, they appear incompatible:
939
+
940
+ ```
941
+ Pairwise alignment (c4 vs c9 only):
942
+ c4: ..504..GTAAATT..242..AG
943
+ c9: ..504..G-GTACT..242..AA
944
+ ```
945
+
946
+ This pairwise alignment shows 4 SNPs + 1 structural indel, making them incompatible for merging with default parameters.
947
+
948
+ However, when aligned with all sequences in their HAC group, a different picture emerges:
949
+
950
+ ```
951
+ Multi-sequence alignment (c1, c2, c4, c5, c7, c8, c9):
952
+ c1: ..153..AAA..57..AGC..151..CTCTT..131..A-GTAAATT.TCA..213..TCAG..21..AA
953
+ c2: ..153..ATA..57..AAC..151..CTCTT..131..A-GTAAATT.TCA..213..TCAG..21..AA
954
+ c4: ..153..AAA..57..AGC..151..CTCTT..131..A-GTAAATT.TCA..213..TCAG..21..AG
955
+ c5: ..153..AAA..57..AGC..151..C---T..131..A-GTAAATT.TCA..213..TCAG..21..AA
956
+ c7: ..153..AAA..57..AGC..151..CTCTT..131..A-GT-ACTT.T-A..213..TCAG..21..AA
957
+ c8: ..153..AAA..57..AGC..151..CTCTT..131..A-GTAAATT.TCA..213..TAGG..21..AA
958
+ c9: ..153..AAA..57..AGC..151..CTCTT..131..AGGT-AC-T.TCA..213..TCAG..21..AA
959
+ ```
960
+
961
+ In this multi-sequence context, the majority pattern (seen in c1, c2, c4, c5, c8) is `A-GTAAATT` in the critical region that follows the `..131..` elision. The c9 sequence (`AGGT-AC-T`) differs from c4 there, but each of its three gap columns sits next to a matching base: one extra G, one fewer A in the poly-A run, and one fewer T in the poly-T run. The multi-sequence alignment therefore explains these as homopolymer length variations rather than true structural changes, leaving one substitution in this region (A vs C) and one at the end of the alignment (G vs A). This results in only 2 SNPs + 3 homopolymer indels, making c4 and c9 compatible for merging.
962
+
963
+ This biological interpretation, informed by the evolutionary context of multiple related sequences, produces more accurate merge decisions than pairwise comparisons alone.
964
+
965
+ **Practical implication:**
966
+
967
+ Merge decisions depend on the complete HAC group composition, not just the sequences being evaluated. Two sequences that appear incompatible in isolation may merge when analyzed in their full biological context, and vice versa. This is by design: the multi-sequence context provides more accurate biological interpretation of sequence variation.
968
+
969
+ **Merge Size Ratio Filtering:**
970
+ ```bash
971
+ speconsense-summarize --merge-min-size-ratio 0.1
972
+ ```
973
+ - Prevents merging clusters with very different sizes (e.g., well-supported variant + poorly-supported variant)
974
+ - Ratio calculated as `smaller_size / larger_size` - must be ≥ threshold to merge
975
+ - Example: `--merge-min-size-ratio 0.1` means smaller cluster must be ≥10% size of larger
976
+ - Default is 0.1 (set to 0 to disable the check)
977
+ - **Use cases:**
978
+ - Prevent poorly-supported variants (low read count) from introducing ambiguities into well-supported sequences
979
+ - Avoid merging weakly-supported bioinformatic artifacts into high-confidence sequences
980
+ - Applied during MSA-based merging step, within each HAC group
981
+
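The ratio check itself is a one-liner; the sketch below is only meant to make the direction of the comparison explicit. The function name is hypothetical, and treating 0 as "always pass" follows the documented "0 to disable" behavior.

```python
def passes_size_ratio(size_a, size_b, min_ratio=0.1):
    """True when smaller_size / larger_size meets the minimum ratio."""
    if min_ratio <= 0:
        return True  # check disabled
    smaller, larger = sorted((size_a, size_b))
    return smaller / larger >= min_ratio


print(passes_size_ratio(403, 269))  # ratio is about 0.67
print(passes_size_ratio(403, 30))   # ratio is about 0.07
```

With the default of 0.1, a 30-read cluster cannot introduce ambiguity codes into a 403-read consensus.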
982
+ **Group Output Limiting:**
983
+ ```bash
984
+ speconsense-summarize --select-max-groups 2
985
+ ```
986
+ - Limits output to top N groups per specimen (by size of largest member in each group)
987
+ - Default is -1 (output all groups)
988
+ - Applied after HAC clustering but before MSA-based merging and variant selection
989
+ - Useful for focusing on primary specimens and ignoring small contaminant groups
990
+ - Example: `--select-max-groups=1` outputs only the largest variant group per specimen
991
+
992
+ **Directory Control:**
993
+ ```bash
994
+ speconsense-summarize --source /path/to/speconsense/output --summary-dir MyResults
995
+ ```
996
+ - `--source`: Directory containing speconsense output files (default: clusters)
997
+ - `--summary-dir`: Output directory name (default: `__Summary__`)
998
+
999
+ ### Overlap Merging for Primer Pools
1000
+
1001
+ When using multiple primers targeting the same locus ("primer pools"), reads may have overlapping but different coverage. The overlap merge feature (enabled by default) allows merging such sequences when they share sufficient overlap.
1002
+
1003
+ **How it works:**
1004
+ 1. During HAC clustering, uses single-linkage to group overlapping sequences
1005
+ 2. Identifies sequences with sufficient overlap meeting identity threshold
1006
+ 3. Creates consensus from union of coverage (overlap region uses majority voting)
1007
+ 4. Supports iterative merging for 3+ overlapping sequences
1008
+ 5. **Primer-constrained**: Overlap-aware distance only applies when sequences have different primer pairs (legitimate primer pool scenarios). Same-primer sequences use global distance to prevent chimeras from incorrectly merging with shorter amplicons.
1009
+
1010
+ **Parameters:**
1011
+ - `--min-merge-overlap N`: Minimum overlap in bp (default: 200, 0 to disable)
1012
+ - `--group-identity`: Identity threshold for HAC grouping, also used for overlap region identity (default: 0.9)
1013
+
1014
+ **Example:**
1015
+ ```bash
1016
+ # Default behavior (overlap merging enabled)
1017
+ speconsense-summarize --source clusters
1018
+
1019
+ # Disable overlap merging (original behavior)
1020
+ speconsense-summarize --source clusters --min-merge-overlap 0
1021
+
1022
+ # More permissive overlap (allow smaller overlaps)
1023
+ speconsense-summarize --source clusters --min-merge-overlap 100
1024
+ ```
1025
+
1026
+ **Use cases:**
1027
+ - ITS2 sequence merging with full ITS sequence
1028
+ - Overlapping amplicons from primer pools
1029
+ - Partial sequences merging with complete references
1030
+
1031
+ **Output indicators:**
1032
+ - Log messages show `(overlap=Xbp, prefix=Ybp, suffix=Zbp)` for overlap merges
1033
+ - FASTA headers include `rawlen=X+Y` showing original sequence lengths
1034
+ - Quality report includes "OVERLAP MERGE ANALYSIS" section with details on partial overlaps
1035
+
1036
+ **Containment handling:**
1037
+ When a shorter sequence is fully contained within a longer one (e.g., ITS2 within full ITS), the merge is allowed if `overlap >= min(threshold, shorter_length)`.
1038
+
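The containment rule can be checked with a short helper. This sketch assumes the same formula is applied to every candidate pair and that `--min-merge-overlap 0` disables overlap merging, as documented; the function name and the lengths in the example are made up.

```python
def overlap_is_sufficient(overlap_bp, len_a, len_b, min_merge_overlap=200):
    """Apply overlap >= min(threshold, shorter_length)."""
    if min_merge_overlap == 0:
        return False  # overlap merging disabled
    required = min(min_merge_overlap, min(len_a, len_b))
    return overlap_bp >= required


# A 180 bp ITS2 fully contained in a 600 bp full ITS: overlap equals its own length
print(overlap_is_sufficient(180, 180, 600))

# A 150 bp partial overlap between 500 bp and 600 bp sequences
print(overlap_is_sufficient(150, 500, 600))
```

Without the `min(...)` term, any sequence shorter than the threshold could never merge, even when it is contained entirely within a longer one.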
1039
+ ### Processing Workflow Summary
1040
+
1041
+ The complete speconsense-summarize workflow operates in this order:
1042
+
1043
+ 1. **Load sequences** with RiC filtering (`--min-ric`) and length filtering (`--min-len`, `--max-len`)
1044
+ 2. **HAC variant grouping** by sequence identity to separate dissimilar sequences (`--group-identity`); uses single-linkage when overlap merging is enabled
1045
+ 3. **Group filtering** to limit output groups (`--select-max-groups`)
1046
+ 4. **Homopolymer-aware MSA-based variant merging** within each group, including **overlap merging** for different-length sequences (`--disable-merging`, `--merge-effort`, `--merge-position-count`, `--merge-indel-length`, `--min-merge-overlap`, `--merge-snp`, `--merge-min-size-ratio`, `--disable-homopolymer-equivalence`)
1047
+ 5. **Variant selection** within each group (`--select-max-variants`, `--select-strategy`)
1048
+ 6. **Output generation** with customizable header fields (`--fasta-fields`) and full traceability
1049
+
1050
+ **Key architectural features**:
1051
+ - HAC grouping occurs BEFORE merging to prevent inappropriate merging of dissimilar sequences (e.g., contaminants with primary targets)
1052
+ - Merging is applied independently within each group using MSA-based consensus generation
1053
+ - Homopolymer-aware merging by default (AAA ≈ AAAA) to match pipeline-wide adjusted-identity semantics
1054
+ - Overlap merging enabled by default (`--min-merge-overlap 200`) for primer pool support
1055
+
1056
+ ### Enhanced Logging and Traceability
1057
+
1058
+ Speconsense-summarize provides comprehensive logging to help users understand processing decisions:
1059
+
1060
+ **Variant Analysis Logging:**
1061
+ - **Complete variant summaries** for every variant in each group, including those that are skipped
1062
+ - **Detailed difference categorization**: substitutions, single-nt indels, short (≤3nt) indels, and long indels
1063
+ - **IUPAC-aware comparisons**: treats ambiguity codes as matches (e.g., R matches A or G)
1064
+ - **Homopolymer-aware merge reporting**: distinguishes structural indels from homopolymer indels
1065
+ - **Group context**: clearly shows which variants belong to each HAC clustering group
1066
+ - **Selection rationale**: explains why variants were included or excluded
1067
+
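The IUPAC-aware comparison mentioned above can be sketched as a set-intersection test over the standard nucleotide ambiguity codes. This is an illustration of the idea, not the tool's code; the function names are hypothetical and the sequences are assumed to be already aligned and of equal length.

```python
# Standard IUPAC nucleotide codes and the bases each one can represent
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def bases_match(x: str, y: str) -> bool:
    """True when the two symbols share at least one possible base."""
    return bool(set(IUPAC[x.upper()]) & set(IUPAC[y.upper()]))


def count_substitutions(a: str, b: str) -> int:
    """Count positions whose symbols cannot represent the same base."""
    return sum(not bases_match(x, y) for x, y in zip(a, b))


print(bases_match("R", "A"), bases_match("R", "G"), bases_match("R", "C"))
print(count_substitutions("ACRT", "ACGT"), count_substitutions("ACRT", "ACCT"))
```

Under this rule a merged consensus carrying `R` is not reported as differing from either of the variants that produced it.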
1068
+ **Example log output:**
1069
+ ```
1070
+ HAC clustering created 2 groups
1071
+ Group 1: ['sample-c3']
1072
+ Group 2: ['sample-c1', 'sample-c2']
1073
+ Found mergeable subset of 2 variants: 2 SNPs, 1 homopolymer indels
1074
+ Processing Variants in Group 2
1075
+ Primary: sample-c1 (size=403, ric=403)
1076
+ Variant 1: (size=269, ric=269) - 1 short (<= 3nt) indel
1077
+ Variant 2: (size=180, ric=180) - 3 substitutions, 1 single-nt indel - skipping
1078
+ ```
1079
+
1080
+ **Traceability Features:**
1081
+ - **Merge history**: tracks which original clusters were combined during variant merging
1082
+ - **File lineage**: maintains connection between final outputs and original speconsense clusters
1083
+ - **Read aggregation**: `FASTQ Files/` directory contains all reads that contributed to each final consensus
1084
+ - **Pre-merge preservation**: `variants/` directory contains `.raw` files that preserve individual pre-merge variants with their original sequences and reads
1085
+ - **Cluster boundaries in merged FASTQ**: When multiple clusters are merged, synthetic delimiter records mark boundaries between clusters for easy identification in sequence viewers (e.g., UGENE). Format: `@CLUSTER_BOUNDARY_{n}:{cluster}:RiC={ric}:reads={count}`
1086
+
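When post-processing merged FASTQ files, the delimiter records can be recognized by their header format. The regular expression below is derived only from the format string given above; the example header line and the helper name are made up for illustration.

```python
import re

# Pattern for @CLUSTER_BOUNDARY_{n}:{cluster}:RiC={ric}:reads={count}
BOUNDARY_RE = re.compile(
    r"^@CLUSTER_BOUNDARY_(?P<n>\d+):(?P<cluster>.+):RiC=(?P<ric>\d+):reads=(?P<reads>\d+)$"
)


def parse_boundary(header_line: str):
    """Return the boundary fields as a dict, or None for ordinary read headers."""
    match = BOUNDARY_RE.match(header_line.strip())
    if match is None:
        return None
    return {
        "n": int(match["n"]),
        "cluster": match["cluster"],
        "ric": int(match["ric"]),
        "reads": int(match["reads"]),
    }


# Hypothetical header line following the documented format
print(parse_boundary("@CLUSTER_BOUNDARY_1:sample-c1:RiC=403:reads=403"))
print(parse_boundary("@some-ordinary-read-id"))
```

Scripts that re-cluster or count reads from the `FASTQ Files/` directory should skip these records so the synthetic delimiters are not treated as real reads.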
1087
+ This comprehensive logging allows users to understand exactly how the pipeline processed their data and make informed decisions about parameter tuning.
1088
+
1089
+ ## Full Command Line Options
1090
+
1091
+ ```
1092
+ usage: speconsense [-h] [-O OUTPUT_DIR] [--primers PRIMERS]
1093
+ [--augment-input AUGMENT_INPUT]
1094
+ [--algorithm {graph,greedy}] [--min-identity MIN_IDENTITY]
1095
+ [--inflation INFLATION]
1096
+ [--k-nearest-neighbors K_NEAREST_NEIGHBORS]
1097
+ [--min-size MIN_SIZE]
1098
+ [--min-cluster-ratio MIN_CLUSTER_RATIO]
1099
+ [--max-sample-size MAX_SAMPLE_SIZE]
1100
+ [--outlier-identity OUTLIER_IDENTITY]
1101
+ [--disable-position-phasing]
1102
+ [--min-variant-frequency MIN_VARIANT_FREQUENCY]
1103
+ [--min-variant-count MIN_VARIANT_COUNT]
1104
+ [--disable-ambiguity-calling]
1105
+ [--min-ambiguity-frequency MIN_AMBIGUITY_FREQUENCY]
1106
+ [--min-ambiguity-count MIN_AMBIGUITY_COUNT]
1107
+ [--disable-cluster-merging]
1108
+ [--disable-homopolymer-equivalence]
1109
+ [--orient-mode {skip,keep-all,filter-failed}]
1110
+ [--presample PRESAMPLE] [--scale-threshold SCALE_THRESHOLD]
1111
+ [--threads N] [--enable-early-filter] [--collect-discards]
1112
+ [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
1113
+ [--version] [-p NAME] [--list-profiles]
1114
+ input_file
1115
+
1116
+ MCL-based clustering of nanopore amplicon reads
1117
+
1118
+ options:
1119
+ -h, --help show this help message and exit
1120
+ --version Show program's version number and exit
1121
+ -p NAME, --profile NAME
1122
+ Load parameter profile (use --list-profiles to see
1123
+ available)
1124
+ --list-profiles List available profiles and exit
1125
+
1126
+ Input/Output:
1127
+ input_file Input FASTQ file
1128
+ -O OUTPUT_DIR, --output-dir OUTPUT_DIR
1129
+ Output directory for all files (default: clusters)
1130
+ --primers PRIMERS FASTA file containing primer sequences (default: looks
1131
+ for primers.fasta in input file directory)
1132
+ --augment-input AUGMENT_INPUT
1133
+ Additional FASTQ/FASTA file with sequences recovered
1134
+ after primary demultiplexing (e.g., from specimine)
1135
+
1136
+ Clustering:
1137
+ --algorithm {graph,greedy}
1138
+ Clustering algorithm to use (default: graph)
1139
+ --min-identity MIN_IDENTITY
1140
+ Minimum sequence identity threshold for clustering
1141
+ (default: 0.9)
1142
+ --inflation INFLATION
1143
+ MCL inflation parameter (default: 4.0)
1144
+ --k-nearest-neighbors K_NEAREST_NEIGHBORS
1145
+ Number of nearest neighbors for graph construction
1146
+ (default: 5)
1147
+
1148
+ Filtering:
1149
+ --min-size MIN_SIZE Minimum cluster size (default: 5, 0 to disable)
1150
+ --min-cluster-ratio MIN_CLUSTER_RATIO
1151
+ Minimum size ratio between a cluster and the largest
1152
+ cluster (default: 0.01, 0 to disable)
1153
+ --max-sample-size MAX_SAMPLE_SIZE
1154
+ Maximum cluster size for consensus (default: 100)
1155
+ --outlier-identity OUTLIER_IDENTITY
1156
+ Minimum read-to-consensus identity to keep a read
1157
+ (default: auto). Reads below this threshold are
1158
+ removed as outliers before final consensus generation.
1159
+ Auto-calculated as (1 + min_identity) / 2. This
1160
+ threshold is typically higher than --min-identity
1161
+ because the consensus is error-corrected through
1162
+ averaging.
1163
+
1164
+ Variant Phasing:
1165
+ --disable-position-phasing
1166
+ Disable position-based variant phasing (enabled by
1167
+ default). MCL graph clustering already separates most
1168
+ variants; this second pass analyzes MSA positions to
1169
+ phase remaining variants.
1170
+ --min-variant-frequency MIN_VARIANT_FREQUENCY
1171
+ Minimum alternative allele frequency to call variant
1172
+ (default: 0.10 for 10%)
1173
+ --min-variant-count MIN_VARIANT_COUNT
1174
+ Minimum alternative allele read count to call variant
1175
+ (default: 5)
1176
+
1177
+ Ambiguity Calling:
1178
+ --disable-ambiguity-calling
1179
+ Disable IUPAC ambiguity code calling for unphased
1180
+ variant positions
1181
+ --min-ambiguity-frequency MIN_AMBIGUITY_FREQUENCY
1182
+ Minimum alternative allele frequency for IUPAC
1183
+ ambiguity calling (default: 0.10 for 10%)
1184
+ --min-ambiguity-count MIN_AMBIGUITY_COUNT
1185
+ Minimum alternative allele read count for IUPAC
1186
+ ambiguity calling (default: 3)
1187
+
1188
+ Cluster Merging:
1189
+ --disable-cluster-merging
1190
+ Disable merging of clusters with identical consensus
1191
+ sequences
1192
+ --disable-homopolymer-equivalence
1193
+ Disable homopolymer equivalence in cluster merging
1194
+ (only merge identical sequences)
1195
+
1196
+ Orientation:
1197
+ --orient-mode {skip,keep-all,filter-failed}
1198
+ Sequence orientation mode: skip (default, no
1199
+ orientation), keep-all (orient but keep failed), or
1200
+ filter-failed (orient and remove failed)
1201
+
1202
+ Performance:
1203
+ --presample PRESAMPLE
1204
+ Presample size for initial reads (default: 1000, 0 to
1205
+ disable)
1206
+ --scale-threshold SCALE_THRESHOLD
1207
+ Sequence count threshold for scalable mode (requires
1208
+ vsearch). Set to 0 to disable. Default: 1001
1209
+ --threads N Max threads for internal parallelism (vsearch, SPOA).
1210
+ 0=auto-detect, default=1 (safe for parallel
1211
+ workflows).
1212
+ --enable-early-filter
1213
+ Enable early filtering to skip small clusters before
1214
+ variant phasing (improves performance for large
1215
+ datasets)
1216
+
1217
+ Debugging:
1218
+ --collect-discards Write discarded reads (outliers and filtered clusters)
1219
+ to cluster_debug/{sample}-discards.fastq
1220
+ --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
1221
+ ```
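The help text above states that `--outlier-identity` is auto-calculated as `(1 + min_identity) / 2`. A minimal sketch of that arithmetic follows; the function name `auto_outlier_identity` is hypothetical and is not taken from the speconsense source.

```python
def auto_outlier_identity(min_identity: float = 0.9) -> float:
    """Midpoint between the clustering threshold and perfect identity (1.0)."""
    if not 0.0 <= min_identity <= 1.0:
        raise ValueError("min_identity must be between 0 and 1")
    return (1.0 + min_identity) / 2.0


# With the documented default --min-identity of 0.9, the threshold is 0.95.
default_threshold = auto_outlier_identity(0.9)
```

Because the result is always at least as large as `min_identity`, reads are held to a stricter standard against the consensus than against each other, which matches the rationale given in the help text.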
1222
+
1223
+ ### speconsense-summarize Options
1224
+
1225
+ ```
1226
+ usage: speconsense-summarize [-h] [--source SOURCE]
1227
+ [--summary-dir SUMMARY_DIR]
1228
+ [--fasta-fields FASTA_FIELDS] [--min-ric MIN_RIC]
1229
+ [--min-len MIN_LEN] [--max-len MAX_LEN]
1230
+ [--group-identity GROUP_IDENTITY] [--merge-snp]
1231
+ [--merge-indel-length MERGE_INDEL_LENGTH]
1232
+ [--merge-position-count MERGE_POSITION_COUNT]
1233
+ [--merge-min-size-ratio MERGE_MIN_SIZE_RATIO]
1234
+ [--min-merge-overlap MIN_MERGE_OVERLAP]
1235
+ [--disable-homopolymer-equivalence]
1236
+ [--select-max-groups SELECT_MAX_GROUPS]
1237
+ [--select-max-variants SELECT_MAX_VARIANTS]
1238
+ [--select-strategy {size,diversity}]
1239
+ [--scale-threshold SCALE_THRESHOLD] [--threads N]
1240
+ [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
1241
+ [--version] [-p NAME] [--list-profiles]
1242
+
1243
+ Process Speconsense output with advanced variant handling.
1244
+
1245
+ options:
1246
+ -h, --help show this help message and exit
1247
+ --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
1248
+ Logging level
1249
+ --version Show program's version number and exit
1250
+ -p NAME, --profile NAME
1251
+ Load parameter profile (use --list-profiles to see
1252
+ available)
1253
+ --list-profiles List available profiles and exit
1254
+
1255
+ Input/Output:
1256
+ --source SOURCE Source directory containing Speconsense output
1257
+ (default: clusters)
1258
+ --summary-dir SUMMARY_DIR
1259
+ Output directory for summary files (default:
1260
+ __Summary__)
1261
+ --fasta-fields FASTA_FIELDS
1262
+ FASTA header fields to output. Can be: (1) a preset
1263
+ name (default, minimal, qc, full, id-only), (2) comma-
1264
+ separated field names (size, ric, length, rawric, snp,
1265
+ rid, rid_min, primers, group, variant), or (3) a
1266
+ combination of presets and fields (e.g., minimal,qc or
1267
+ minimal,rid). Duplicates removed, order preserved left
1268
+ to right. Default: default
1269
+
1270
+ Filtering:
1271
+ --min-ric MIN_RIC Minimum Reads in Consensus (RiC) threshold (default:
1272
+ 3)
1273
+ --min-len MIN_LEN Minimum sequence length in bp (default: 0 = disabled)
1274
+ --max-len MAX_LEN Maximum sequence length in bp (default: 0 = disabled)
1275
+
1276
+ Grouping:
1277
+ --group-identity GROUP_IDENTITY, --variant-group-identity GROUP_IDENTITY
1278
+ Identity threshold for variant grouping using HAC
1279
+ (default: 0.9)
1280
+
1281
+ Merging:
1282
+ --disable-merging Disable all variant merging (skip MSA-based merge
1283
+ evaluation entirely)
1284
+ --merge-effort LEVEL Merging effort level: fast (8), balanced (10),
1285
+ thorough (12), or numeric 6-14. Higher values allow
1286
+ larger batch sizes for exhaustive subset search.
1287
+ Default: balanced
1288
+ --merge-snp, --no-merge-snp
1289
+ Enable SNP-based merging (default: True, use --no-
1290
+ merge-snp to disable)
1291
+ --merge-indel-length MERGE_INDEL_LENGTH
1292
+ Maximum length of individual indels allowed in merging
1293
+ (default: 0 = disabled)
1294
+ --merge-position-count MERGE_POSITION_COUNT
1295
+ Maximum total SNP+indel positions allowed in merging
1296
+ (default: 2)
1297
+ --merge-min-size-ratio MERGE_MIN_SIZE_RATIO
1298
+ Minimum size ratio (smaller/larger) for merging
1299
+ clusters (default: 0.1, 0 to disable)
1300
+ --min-merge-overlap MIN_MERGE_OVERLAP
1301
+ Minimum overlap in bp for merging sequences of
1302
+ different lengths (default: 200, 0 to disable)
1303
+ --disable-homopolymer-equivalence
1304
+ Disable homopolymer equivalence in merging (treat AAA
1305
+ vs AAAA as different)
1306
+
1307
+ Selection:
1308
+ --select-max-groups SELECT_MAX_GROUPS, --max-groups SELECT_MAX_GROUPS
1309
+ Maximum number of groups to output per specimen
1310
+ (default: -1 = all groups)
1311
+ --select-max-variants SELECT_MAX_VARIANTS, --max-variants SELECT_MAX_VARIANTS
1312
+ Maximum total variants to output per group (default:
1313
+ -1 = no limit, 0 also means no limit)
1314
+ --select-strategy {size,diversity}, --variant-selection {size,diversity}
1315
+ Variant selection strategy: size or diversity
1316
+ (default: size)
1317
+
1318
+ Performance:
1319
+ --scale-threshold SCALE_THRESHOLD
1320
+ Sequence count threshold for scalable mode in HAC
1321
+ clustering (requires vsearch). Set to 0 to disable.
1322
+ Default: 1001
1323
+ --threads N Max threads for internal parallelism. 0=auto-detect
1324
+ (default), N>0 for explicit count.
1325
+ ```
1326
+
1327
+ ## Specialized Workflows
1328
+
1329
+ The following features support less common workflows and specialized use cases.
1330
+
1331
+ ### Sequence Orientation Normalization
1332
+
1333
+ **Use Case:** Reprocessing output from older bioinformatics pipelines (such as minibar) that do not automatically normalize sequence orientation.
1334
+
1335
+ **Note:** When using speconsense downstream of specimux (the recommended workflow), orientation normalization is unnecessary because specimux automatically orients all sequences during demultiplexing.
1336
+
1337
+ The `--orient-mode` parameter enables automatic detection and correction of sequence orientation based on primer positions:
1338
+
1339
+ ```bash
1340
+ # Keep all sequences, including those with ambiguous orientation
1341
+ speconsense input.fastq --primers primers.fasta --orient-mode keep-all
1342
+
1343
+ # Filter out sequences that couldn't be reliably oriented
1344
+ speconsense input.fastq --primers primers.fasta --orient-mode filter-failed
1345
+
1346
+ # Skip orientation (default, appropriate when using specimux output)
1347
+ speconsense input.fastq --primers primers.fasta
1348
+ ```
1349
+
1350
+ **Available modes:**
1351
+ - `skip` (default): No orientation performed, sequences processed as-is
1352
+ - `keep-all`: Perform orientation but keep all sequences, including those that failed
1353
+ - `filter-failed`: Perform orientation and remove sequences with failed/ambiguous orientation
1354
+
1355
+ **How it works:**
1356
+ - Detects orientation by checking for forward and reverse primers at expected positions
1357
+ - Uses binary scoring: +1 for forward primer match, +1 for reverse primer match
1358
+ - Sequences with clear orientation (score >0 in one direction, 0 in the other) are reoriented if needed
1359
+ - Sequences with ambiguous orientation (both 0 or both >0) are marked as failed
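The scoring rules above can be sketched as follows. This is an illustration under stated assumptions, not the speconsense implementation: primers are detected by exact substring match within a fixed 50-nt window at each end (the real tool may match differently), and the names `orientation_score` and `orient` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string (plain A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def orientation_score(seq, fwd_primer, rev_primer, window=50):
    """Binary scoring: +1 for the forward primer near the 5' end,
    +1 for the reverse primer (reverse-complemented) near the 3' end.
    Assumption: exact substring match inside a fixed window."""
    score = 0
    if fwd_primer in seq[:window]:
        score += 1
    if reverse_complement(rev_primer) in seq[-window:]:
        score += 1
    return score


def orient(seq, fwd_primer, rev_primer):
    """Return (sequence, status), status in {'forward', 'reversed', 'failed'}."""
    as_is = orientation_score(seq, fwd_primer, rev_primer)
    flipped = orientation_score(reverse_complement(seq), fwd_primer, rev_primer)
    if as_is > 0 and flipped == 0:
        return seq, "forward"
    if flipped > 0 and as_is == 0:
        return reverse_complement(seq), "reversed"
    # Both zero or both positive: ambiguous, so mark as failed.
    return seq, "failed"


ITS1F = "CTTGGTCATTTAGAGGAAGTAA"
ITS4 = "TCCTCCGCTTATTGATATGC"
read = ITS1F + "GATTACA" * 6 + reverse_complement(ITS4)
```

In this sketch, `keep-all` would correspond to retaining reads whose status is `failed`, and `filter-failed` to dropping them. For FASTQ input the quality string would also need to be reversed whenever the sequence is flipped.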
1360
+
1361
+ **Requirements:**
1362
+ - Primers FASTA file must include position annotations: `position=forward` or `position=reverse`
1363
+ - Example format:
1364
+ ```
1365
+ >ITS1F pool=ITS position=forward
1366
+ CTTGGTCATTTAGAGGAAGTAA
1367
+ >ITS4 pool=ITS position=reverse
1368
+ TCCTCCGCTTATTGATATGC
1369
+ ```
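Headers in the format shown above can be read with a few lines of Python. This is a minimal sketch that assumes whitespace-separated `key=value` annotations after the primer name; `parse_primers` is a hypothetical helper, not part of the speconsense API.

```python
def parse_primers(fasta_text):
    """Parse primer FASTA text into dicts with name, pool, position, sequence."""
    primers = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            # Everything after the name is treated as key=value annotations.
            attrs = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
            primers.append({
                "name": fields[0],
                "pool": attrs.get("pool"),
                "position": attrs.get("position"),
                "sequence": "",
            })
        elif primers:
            # Sequence lines are appended to the most recent header.
            primers[-1]["sequence"] += line.upper()
    return primers


EXAMPLE = """>ITS1F pool=ITS position=forward
CTTGGTCATTTAGAGGAAGTAA
>ITS4 pool=ITS position=reverse
TCCTCCGCTTATTGATATGC
"""
primers = parse_primers(EXAMPLE)
```

A primer whose `position` comes back as `None` lacks the annotation that orientation detection requires.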
1370
+
1371
+ **Technical details:**
1372
+ - Orientation occurs before clustering, ensuring all sequences are in the same direction
1373
+ - Failed orientations typically indicate: no primers found, chimeric sequences, or degraded primers
1374
+ - Quality scores are reversed when sequences are reverse-complemented
1375
+
1376
+ ### Augmenting Clusters with Recovered Sequences
1377
+
1378
+ **Use Case:** Increasing cluster abundance by including sequences that were partially demultiplexed or recovered through mining tools like `specimine`.
1379
+
1380
+ The `--augment-input` parameter allows you to supplement primary demultiplexing results with additional sequences recovered from unmatched reads:
1381
+
1382
+ ```bash
1383
+ # 1. Run primary demultiplexing with specimux
1384
+ specimux primers.fasta specimens.txt input.fastq -O results/
1385
+
1386
+ # 2. Recover additional sequences from unmatched reads
1387
+ specimine results/unknown/ specimen_name --output recovered.fastq
1388
+
1389
+ # 3. Cluster with both primary and recovered sequences
1390
+ speconsense results/full/specimen_name.fastq --augment-input recovered.fastq
1391
+ ```
1392
+
1393
+ **How it works:**
1394
+ - Augmented sequences participate equally in clustering with primary sequences
1395
+ - During presampling, primary sequences are prioritized to ensure representative sampling
1396
+ - All sequences (primary + augmented) contribute to final consensus generation
1397
+ - Final output headers show total read counts including augmented sequences
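One way to picture "primary sequences are prioritized" during presampling is sketched below. The exact sampling strategy is an assumption made for illustration (fill the presample from primary reads first, then top up from augmented reads at random), and the function name `presample` is hypothetical.

```python
import random


def presample(primary, augmented, size=1000, seed=0):
    """Pick up to `size` reads, preferring primary reads over augmented ones.

    Assumption: size <= 0 disables presampling and keeps everything,
    mirroring the documented '--presample 0 to disable' behavior.
    """
    primary = list(primary)
    augmented = list(augmented)
    if size <= 0:
        return primary + augmented
    rng = random.Random(seed)
    if len(primary) >= size:
        # Enough primary reads: augmented reads are not needed for the sample.
        return rng.sample(primary, size)
    # Otherwise keep every primary read and fill the remainder from augmented.
    remaining = size - len(primary)
    extra = rng.sample(augmented, min(remaining, len(augmented)))
    return primary + extra


primary_ids = [f"p{i}" for i in range(3)]
augmented_ids = [f"a{i}" for i in range(10)]
sample = presample(primary_ids, augmented_ids, size=5)
```

Under this scheme the presample is dominated by confidently demultiplexed reads whenever enough of them exist, while every read, primary or augmented, still counts toward the final consensus.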
1398
+
1399
+ **Key features:**
1400
+ - Supports both FASTQ and FASTA formats (auto-detected by file extension)
1401
+ - Augmented sequences are fully traceable through the entire pipeline
1402
+ - Sequences only cluster together if they meet the similarity threshold
1403
+ - Recovered sequences may form separate clusters if they represent different taxa
1404
+ - Maximizes data utilization by including sequences that would otherwise be discarded
1405
+
1406
+ **Typical workflow:**
1407
+ After primary demultiplexing, some sequences may remain unmatched due to sequencing errors, primer degradation, or edge cases in barcode detection. Mining tools like `specimine` can recover these sequences based on sequence composition or other characteristics, so that they can be included in consensus generation and increase cluster support.
1408
+
1409
+ ### Testing with Synthetic Data
1410
+
1411
+ For empirical testing of consensus quality, variant detection, contamination scenarios, and understanding toolchain behavior with controlled datasets, see [Testing with Speconsense-Synth](docs/synthetic-testing.md).
1412
+
1413
+ ## Future Enhancements
1414
+
1415
+ The following features are under consideration for future development:
1416
+
1417
+ - **Background contamination detection tool**: A utility to identify contamination patterns affecting multiple specimens within the same sequencing run, helping to distinguish systematic contamination from genuine biological sequences
1418
+
1419
+ - **Scalability improvements**: Algorithm enhancements to enable graph-based clustering to efficiently handle datasets with hundreds of thousands of sequences while maintaining accuracy
1420
+
1421
+ These features are being explored based on user feedback. Implementation timelines and feasibility are still being evaluated.
1422
+
1423
+ ## Changelog
1424
+
1425
+ See [CHANGELOG.md](CHANGELOG.md) for detailed version history.
1426
+
1427
+ ## Citations
1428
+
1429
+ This project uses and builds upon:
1430
+
1431
+ - **Markov Clustering (MCL) algorithm**:
1432
+ - van Dongen, S. (2000). *Graph Clustering by Flow Simulation*. PhD thesis, University of Utrecht. https://micans.org/mcl/
1433
+ - van Dongen, S. (2008). *Graph clustering via a discrete uncoupling process*. SIAM Journal on Matrix Analysis and Applications 30(1):121-141. https://doi.org/10.1137/040608635
1434
+
1435
+ - **MCL in bioinformatics**: Enright, A.J., Van Dongen, S., Ouzounis, C.A. (2002). *An efficient algorithm for large-scale detection of protein families*. Nucleic Acids Research 30(7):1575-1584. https://doi.org/10.1093/nar/30.7.1575 (PMC: https://pmc.ncbi.nlm.nih.gov/articles/PMC101833/)
1436
+
1437
+ - **ONT fungal barcoding protocol**: Russell, S.D., Geurin, Z., Walker, J. (2024). *Primary Data Analysis - Basecalling, Demultiplexing, and Consensus Building for ONT Fungal Barcodes*. protocols.io. https://dx.doi.org/10.17504/protocols.io.dm6gpbm88lzp/v4
1438
+
1439
+ ## Contributing
1440
+
1441
+ Contributions are welcome! For development setup, testing guidelines, and contribution workflow, please see [CONTRIBUTING.md](CONTRIBUTING.md).
1442
+
1443
+ ## License
1444
+
1445
+ [BSD 3-Clause License](LICENSE)
1446
+
1447
+ ## Name
1448
+
1449
+ Speconsense is a playful portmanteau of "Specimen" and "Consensus", reflecting the tool's focus on generating high-quality consensus sequences from specimen amplicon reads.