PyPI - speconsense - Versions diffs - 0.7.2__tar.gz → 0.7.3__tar.gz - Mend

speconsense 0.7.2.tar.gz → 0.7.3.tar.gz

Files changed (55)

{speconsense-0.7.2/speconsense.egg-info → speconsense-0.7.3}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: speconsense
-Version: 0.7.2
+Version: 0.7.3
 Summary: High-quality clustering and consensus generation for Oxford Nanopore amplicon reads
 Author-email: Josh Walker <joshowalker@yahoo.com>
 License: BSD-3-Clause
@@ -171,6 +171,7 @@ speconsense input.fastq -p herbarium --min-size 10
 ```
 **Bundled profiles:**
+- `compressed` — Compress variants into minimal IUPAC consensus sequences (aggressive merging with indels, 20% thresholds, full consensus)
 - `herbarium` — High-recall for degraded DNA/type specimens
 - `largedata` — Experimental settings for large input files
 - `nostalgia` — Simulate older bioinformatics pipelines
@@ -294,12 +295,14 @@ When using `speconsense-summarize` for post-processing, creates `__Summary__/` d
 |---------------|-------------|------------|-------------|
 | **Original** | Source `cluster_debug/` | `-c1`, `-c2`, `-c3` | Preserves speconsense clustering results |
 | **Summarization** | `__Summary__/`, `FASTQ Files/`, `variants/` | `-1.v1`, `-1.v2`, `-2.v1`, `.raw1` | Post-processing groups and variants |
+| **Full consensus** | `__Summary__/` | `-1.full` | IUPAC consensus from all pre-merge variants in a group |
 ### Example Directory Structure
 ```
 __Summary__/
 ├── sample-1.v1-RiC45.fasta                  # Primary variant (group 1, merged)
 ├── sample-1.v2-RiC23.fasta                  # Additional variant (not merged)
+├── sample-1.full-RiC68.fasta                # Full IUPAC consensus for group 1 (all pre-merge variants)
 ├── sample-2.v1-RiC30.fasta                  # Second organism group, primary variant
 ├── summary.fasta                            # All final consensus sequences (excludes .raw)
 ├── summary.txt                              # Statistics
@@ -810,6 +813,18 @@ For high-throughput workflows (e.g., 100K sequences/year), this prioritization e
 ### Additional Summarize Options
+**Full Consensus:**
+```bash
+speconsense-summarize --enable-full-consensus
+```
+- Generates a full IUPAC consensus sequence per variant group from all pre-merge variants
+- Output named `{specimen}-{group}.full-RiC{reads}.fasta` in the `__Summary__/` directory
+- Uses majority voting across all variants in the group; **gaps never win** — at each alignment column, the most common non-gap base is chosen, with IUPAC codes for ties among bases
+- Useful when you want a single representative sequence that captures all variation within a group as IUPAC ambiguity codes
+- Included in `summary.fasta` (but excluded from total RiC to avoid double-counting)
+- Enabled by default in the `compressed` profile
+- Use `--disable-full-consensus` to override when set by a profile
 **Quality Filtering:**
 ```bash
 speconsense-summarize --min-ric 5
@@ -1045,7 +1060,8 @@ The complete speconsense-summarize workflow operates in this order:
 3. **Group filtering** to limit output groups (`--select-max-groups`)
 4. **Homopolymer-aware MSA-based variant merging** within each group, including **overlap merging** for different-length sequences (`--disable-merging`, `--merge-effort`, `--merge-position-count`, `--merge-indel-length`, `--min-merge-overlap`, `--merge-snp`, `--merge-min-size-ratio`, `--disable-homopolymer-equivalence`)
 5. **Variant selection** within each group (`--select-max-variants`, `--select-strategy`)
-6. **Output generation** with customizable header fields (`--fasta-fields`) and full traceability
+6. **Full consensus generation** (optional) — IUPAC consensus from all pre-merge variants per group (`--enable-full-consensus`)
+7. **Output generation** with customizable header fields (`--fasta-fields`) and full traceability
 **Key architectural features**:
 - HAC grouping occurs BEFORE merging to prevent inappropriate merging of dissimilar sequences (e.g., contaminants with primary targets)
@@ -1098,17 +1114,20 @@ usage: speconsense [-h] [-O OUTPUT_DIR] [--primers PRIMERS]
                    [--min-cluster-ratio MIN_CLUSTER_RATIO]
                    [--max-sample-size MAX_SAMPLE_SIZE]
                    [--outlier-identity OUTLIER_IDENTITY]
-                   [--disable-position-phasing]
+                   [--disable-position-phasing] [--enable-position-phasing]
                    [--min-variant-frequency MIN_VARIANT_FREQUENCY]
                    [--min-variant-count MIN_VARIANT_COUNT]
-                   [--disable-ambiguity-calling]
+                   [--disable-ambiguity-calling] [--enable-ambiguity-calling]
                    [--min-ambiguity-frequency MIN_AMBIGUITY_FREQUENCY]
                    [--min-ambiguity-count MIN_AMBIGUITY_COUNT]
-                   [--disable-cluster-merging]
+                   [--disable-cluster-merging] [--enable-cluster-merging]
                    [--disable-homopolymer-equivalence]
+                   [--enable-homopolymer-equivalence]
                    [--orient-mode {skip,keep-all,filter-failed}]
                    [--presample PRESAMPLE] [--scale-threshold SCALE_THRESHOLD]
-                   [--threads N] [--enable-early-filter] [--collect-discards]
+                   [--threads N] [--enable-early-filter]
+                   [--disable-early-filter] [--collect-discards]
+                   [--no-collect-discards]
                    [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                    [--version] [-p NAME] [--list-profiles]
                    input_file
@@ -1167,6 +1186,8 @@ Variant Phasing:
                         default). MCL graph clustering already separates most
                         variants; this second pass analyzes MSA positions to
                         phase remaining variants.
+  --enable-position-phasing
+                        Override --disable-position-phasing or profile setting
   --min-variant-frequency MIN_VARIANT_FREQUENCY
                         Minimum alternative allele frequency to call variant
                         (default: 0.10 for 10%)
@@ -1178,6 +1199,9 @@ Ambiguity Calling:
   --disable-ambiguity-calling
                         Disable IUPAC ambiguity code calling for unphased
                         variant positions
+  --enable-ambiguity-calling
+                        Override --disable-ambiguity-calling or profile
+                        setting
   --min-ambiguity-frequency MIN_AMBIGUITY_FREQUENCY
                         Minimum alternative allele frequency for IUPAC
                         ambiguity calling (default: 0.10 for 10%)
@@ -1189,9 +1213,14 @@ Cluster Merging:
   --disable-cluster-merging
                         Disable merging of clusters with identical consensus
                         sequences
+  --enable-cluster-merging
+                        Override --disable-cluster-merging or profile setting
   --disable-homopolymer-equivalence
                         Disable homopolymer equivalence in cluster merging
                         (only merge identical sequences)
+  --enable-homopolymer-equivalence
+                        Override --disable-homopolymer-equivalence or profile
+                        setting
 Orientation:
   --orient-mode {skip,keep-all,filter-failed}
@@ -1213,10 +1242,14 @@ Performance:
                         Enable early filtering to skip small clusters before
                         variant phasing (improves performance for large
                         datasets)
+  --disable-early-filter
+                        Override --enable-early-filter or profile setting
 Debugging:
   --collect-discards    Write discarded reads (outliers and filtered clusters)
                         to cluster_debug/{sample}-discards.fastq
+  --no-collect-discards
+                        Override --collect-discards or profile setting
   --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
 ```
@@ -1227,15 +1260,21 @@ usage: speconsense-summarize [-h] [--source SOURCE]
                              [--summary-dir SUMMARY_DIR]
                              [--fasta-fields FASTA_FIELDS] [--min-ric MIN_RIC]
                              [--min-len MIN_LEN] [--max-len MAX_LEN]
-                             [--group-identity GROUP_IDENTITY] [--merge-snp]
+                             [--group-identity GROUP_IDENTITY]
+                             [--disable-merging] [--enable-merging]
+                             [--merge-snp | --no-merge-snp]
                              [--merge-indel-length MERGE_INDEL_LENGTH]
                              [--merge-position-count MERGE_POSITION_COUNT]
                              [--merge-min-size-ratio MERGE_MIN_SIZE_RATIO]
                              [--min-merge-overlap MIN_MERGE_OVERLAP]
                              [--disable-homopolymer-equivalence]
+                             [--enable-homopolymer-equivalence]
+                             [--merge-effort LEVEL]
                              [--select-max-groups SELECT_MAX_GROUPS]
                              [--select-max-variants SELECT_MAX_VARIANTS]
                              [--select-strategy {size,diversity}]
+                             [--enable-full-consensus]
+                             [--disable-full-consensus]
                              [--scale-threshold SCALE_THRESHOLD] [--threads N]
                              [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                              [--version] [-p NAME] [--list-profiles]
@@ -1281,10 +1320,7 @@ Grouping:
 Merging:
   --disable-merging     Disable all variant merging (skip MSA-based merge
                         evaluation entirely)
-  --merge-effort LEVEL  Merging effort level: fast (8), balanced (10),
-                        thorough (12), or numeric 6-14. Higher values allow
-                        larger batch sizes for exhaustive subset search.
-                        Default: balanced
+  --enable-merging      Override --disable-merging or profile setting
   --merge-snp, --no-merge-snp
                         Enable SNP-based merging (default: True, use --no-
                         merge-snp to disable)
@@ -1303,6 +1339,13 @@ Merging:
   --disable-homopolymer-equivalence
                         Disable homopolymer equivalence in merging (treat AAA
                         vs AAAA as different)
+  --enable-homopolymer-equivalence
+                        Override --disable-homopolymer-equivalence or profile
+                        setting
+  --merge-effort LEVEL  Merging effort level: fast (8), balanced (10),
+                        thorough (12), or numeric 6-14. Higher values allow
+                        larger batch sizes for exhaustive subset search.
+                        Default: balanced
 Selection:
   --select-max-groups SELECT_MAX_GROUPS, --max-groups SELECT_MAX_GROUPS
@@ -1314,6 +1357,12 @@ Selection:
   --select-strategy {size,diversity}, --variant-selection {size,diversity}
                         Variant selection strategy: size or diversity
                         (default: size)
+  --enable-full-consensus
+                        Generate a full consensus per variant group
+                        representing all variation from pre-merge variants
+                        (gaps never win)
+  --disable-full-consensus
+                        Override --enable-full-consensus or profile setting
 Performance:
   --scale-threshold SCALE_THRESHOLD

{speconsense-0.7.2 → speconsense-0.7.3}/README.md RENAMED

@@ -136,6 +136,7 @@ speconsense input.fastq -p herbarium --min-size 10
 ```
 **Bundled profiles:**
+- `compressed` — Compress variants into minimal IUPAC consensus sequences (aggressive merging with indels, 20% thresholds, full consensus)
 - `herbarium` — High-recall for degraded DNA/type specimens
 - `largedata` — Experimental settings for large input files
 - `nostalgia` — Simulate older bioinformatics pipelines
@@ -259,12 +260,14 @@ When using `speconsense-summarize` for post-processing, creates `__Summary__/` d
 |---------------|-------------|------------|-------------|
 | **Original** | Source `cluster_debug/` | `-c1`, `-c2`, `-c3` | Preserves speconsense clustering results |
 | **Summarization** | `__Summary__/`, `FASTQ Files/`, `variants/` | `-1.v1`, `-1.v2`, `-2.v1`, `.raw1` | Post-processing groups and variants |
+| **Full consensus** | `__Summary__/` | `-1.full` | IUPAC consensus from all pre-merge variants in a group |
 ### Example Directory Structure
 ```
 __Summary__/
 ├── sample-1.v1-RiC45.fasta                  # Primary variant (group 1, merged)
 ├── sample-1.v2-RiC23.fasta                  # Additional variant (not merged)
+├── sample-1.full-RiC68.fasta                # Full IUPAC consensus for group 1 (all pre-merge variants)
 ├── sample-2.v1-RiC30.fasta                  # Second organism group, primary variant
 ├── summary.fasta                            # All final consensus sequences (excludes .raw)
 ├── summary.txt                              # Statistics
@@ -775,6 +778,18 @@ For high-throughput workflows (e.g., 100K sequences/year), this prioritization e
 ### Additional Summarize Options
+**Full Consensus:**
+```bash
+speconsense-summarize --enable-full-consensus
+```
+- Generates a full IUPAC consensus sequence per variant group from all pre-merge variants
+- Output named `{specimen}-{group}.full-RiC{reads}.fasta` in the `__Summary__/` directory
+- Uses majority voting across all variants in the group; **gaps never win** — at each alignment column, the most common non-gap base is chosen, with IUPAC codes for ties among bases
+- Useful when you want a single representative sequence that captures all variation within a group as IUPAC ambiguity codes
+- Included in `summary.fasta` (but excluded from total RiC to avoid double-counting)
+- Enabled by default in the `compressed` profile
+- Use `--disable-full-consensus` to override when set by a profile
 **Quality Filtering:**
 ```bash
 speconsense-summarize --min-ric 5
@@ -1010,7 +1025,8 @@ The complete speconsense-summarize workflow operates in this order:
 3. **Group filtering** to limit output groups (`--select-max-groups`)
 4. **Homopolymer-aware MSA-based variant merging** within each group, including **overlap merging** for different-length sequences (`--disable-merging`, `--merge-effort`, `--merge-position-count`, `--merge-indel-length`, `--min-merge-overlap`, `--merge-snp`, `--merge-min-size-ratio`, `--disable-homopolymer-equivalence`)
 5. **Variant selection** within each group (`--select-max-variants`, `--select-strategy`)
-6. **Output generation** with customizable header fields (`--fasta-fields`) and full traceability
+6. **Full consensus generation** (optional) — IUPAC consensus from all pre-merge variants per group (`--enable-full-consensus`)
+7. **Output generation** with customizable header fields (`--fasta-fields`) and full traceability
 **Key architectural features**:
 - HAC grouping occurs BEFORE merging to prevent inappropriate merging of dissimilar sequences (e.g., contaminants with primary targets)
@@ -1063,17 +1079,20 @@ usage: speconsense [-h] [-O OUTPUT_DIR] [--primers PRIMERS]
                    [--min-cluster-ratio MIN_CLUSTER_RATIO]
                    [--max-sample-size MAX_SAMPLE_SIZE]
                    [--outlier-identity OUTLIER_IDENTITY]
-                   [--disable-position-phasing]
+                   [--disable-position-phasing] [--enable-position-phasing]
                    [--min-variant-frequency MIN_VARIANT_FREQUENCY]
                    [--min-variant-count MIN_VARIANT_COUNT]
-                   [--disable-ambiguity-calling]
+                   [--disable-ambiguity-calling] [--enable-ambiguity-calling]
                    [--min-ambiguity-frequency MIN_AMBIGUITY_FREQUENCY]
                    [--min-ambiguity-count MIN_AMBIGUITY_COUNT]
-                   [--disable-cluster-merging]
+                   [--disable-cluster-merging] [--enable-cluster-merging]
                    [--disable-homopolymer-equivalence]
+                   [--enable-homopolymer-equivalence]
                    [--orient-mode {skip,keep-all,filter-failed}]
                    [--presample PRESAMPLE] [--scale-threshold SCALE_THRESHOLD]
-                   [--threads N] [--enable-early-filter] [--collect-discards]
+                   [--threads N] [--enable-early-filter]
+                   [--disable-early-filter] [--collect-discards]
+                   [--no-collect-discards]
                    [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                    [--version] [-p NAME] [--list-profiles]
                    input_file
@@ -1132,6 +1151,8 @@ Variant Phasing:
                         default). MCL graph clustering already separates most
                         variants; this second pass analyzes MSA positions to
                         phase remaining variants.
+  --enable-position-phasing
+                        Override --disable-position-phasing or profile setting
   --min-variant-frequency MIN_VARIANT_FREQUENCY
                         Minimum alternative allele frequency to call variant
                         (default: 0.10 for 10%)
@@ -1143,6 +1164,9 @@ Ambiguity Calling:
   --disable-ambiguity-calling
                         Disable IUPAC ambiguity code calling for unphased
                         variant positions
+  --enable-ambiguity-calling
+                        Override --disable-ambiguity-calling or profile
+                        setting
   --min-ambiguity-frequency MIN_AMBIGUITY_FREQUENCY
                         Minimum alternative allele frequency for IUPAC
                         ambiguity calling (default: 0.10 for 10%)
@@ -1154,9 +1178,14 @@ Cluster Merging:
   --disable-cluster-merging
                         Disable merging of clusters with identical consensus
                         sequences
+  --enable-cluster-merging
+                        Override --disable-cluster-merging or profile setting
   --disable-homopolymer-equivalence
                         Disable homopolymer equivalence in cluster merging
                         (only merge identical sequences)
+  --enable-homopolymer-equivalence
+                        Override --disable-homopolymer-equivalence or profile
+                        setting
 Orientation:
   --orient-mode {skip,keep-all,filter-failed}
@@ -1178,10 +1207,14 @@ Performance:
                         Enable early filtering to skip small clusters before
                         variant phasing (improves performance for large
                         datasets)
+  --disable-early-filter
+                        Override --enable-early-filter or profile setting
 Debugging:
   --collect-discards    Write discarded reads (outliers and filtered clusters)
                         to cluster_debug/{sample}-discards.fastq
+  --no-collect-discards
+                        Override --collect-discards or profile setting
   --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
 ```
@@ -1192,15 +1225,21 @@ usage: speconsense-summarize [-h] [--source SOURCE]
                              [--summary-dir SUMMARY_DIR]
                              [--fasta-fields FASTA_FIELDS] [--min-ric MIN_RIC]
                              [--min-len MIN_LEN] [--max-len MAX_LEN]
-                             [--group-identity GROUP_IDENTITY] [--merge-snp]
+                             [--group-identity GROUP_IDENTITY]
+                             [--disable-merging] [--enable-merging]
+                             [--merge-snp | --no-merge-snp]
                              [--merge-indel-length MERGE_INDEL_LENGTH]
                              [--merge-position-count MERGE_POSITION_COUNT]
                              [--merge-min-size-ratio MERGE_MIN_SIZE_RATIO]
                              [--min-merge-overlap MIN_MERGE_OVERLAP]
                              [--disable-homopolymer-equivalence]
+                             [--enable-homopolymer-equivalence]
+                             [--merge-effort LEVEL]
                              [--select-max-groups SELECT_MAX_GROUPS]
                              [--select-max-variants SELECT_MAX_VARIANTS]
                              [--select-strategy {size,diversity}]
+                             [--enable-full-consensus]
+                             [--disable-full-consensus]
                              [--scale-threshold SCALE_THRESHOLD] [--threads N]
                              [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                              [--version] [-p NAME] [--list-profiles]
@@ -1246,10 +1285,7 @@ Grouping:
 Merging:
   --disable-merging     Disable all variant merging (skip MSA-based merge
                         evaluation entirely)
-  --merge-effort LEVEL  Merging effort level: fast (8), balanced (10),
-                        thorough (12), or numeric 6-14. Higher values allow
-                        larger batch sizes for exhaustive subset search.
-                        Default: balanced
+  --enable-merging      Override --disable-merging or profile setting
   --merge-snp, --no-merge-snp
                         Enable SNP-based merging (default: True, use --no-
                         merge-snp to disable)
@@ -1268,6 +1304,13 @@ Merging:
   --disable-homopolymer-equivalence
                         Disable homopolymer equivalence in merging (treat AAA
                         vs AAAA as different)
+  --enable-homopolymer-equivalence
+                        Override --disable-homopolymer-equivalence or profile
+                        setting
+  --merge-effort LEVEL  Merging effort level: fast (8), balanced (10),
+                        thorough (12), or numeric 6-14. Higher values allow
+                        larger batch sizes for exhaustive subset search.
+                        Default: balanced
 Selection:
   --select-max-groups SELECT_MAX_GROUPS, --max-groups SELECT_MAX_GROUPS
@@ -1279,6 +1322,12 @@ Selection:
   --select-strategy {size,diversity}, --variant-selection {size,diversity}
                         Variant selection strategy: size or diversity
                         (default: size)
+  --enable-full-consensus
+                        Generate a full consensus per variant group
+                        representing all variation from pre-merge variants
+                        (gaps never win)
+  --disable-full-consensus
+                        Override --enable-full-consensus or profile setting
 Performance:
   --scale-threshold SCALE_THRESHOLD
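
The "gaps never win" rule described in the README hunks above can be sketched as follows. This is a minimal illustration, not the package's `create_full_consensus_from_msa`: the function name `gaps_never_win_consensus` is hypothetical, every aligned sequence gets one equal vote (whether the real code weights variants by read count is not shown in this diff), input is assumed to be uppercase A/C/G/T with `-` for gaps, and columns containing only gaps are dropped.

```python
from collections import Counter

# Standard IUPAC nucleotide codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def gaps_never_win_consensus(aligned):
    """Majority vote per column over non-gap bases; ties become IUPAC codes."""
    consensus = []
    for column in zip(*aligned):
        # Gaps are excluded before counting, so they can never be the winner.
        counts = Counter(base for base in column if base != "-")
        if not counts:
            continue  # assumption: an all-gap column contributes nothing
        top = max(counts.values())
        winners = frozenset(base for base, n in counts.items() if n == top)
        consensus.append(IUPAC[winners])
    return "".join(consensus)


majority = gaps_never_win_consensus(["ACG-T", "ACGAT", "AT--T"])
tie = gaps_never_win_consensus(["AC", "AT"])
```

In the first call the fourth column holds two gaps and one `A`, and the `A` is still emitted; in the second call the C/T tie collapses to `Y`.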

{speconsense-0.7.2 → speconsense-0.7.3}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "speconsense"
-version = "0.7.2"
+version = "0.7.3"
 description = "High-quality clustering and consensus generation for Oxford Nanopore amplicon reads"
 readme = "README.md"
 requires-python = ">=3.8"

{speconsense-0.7.2 → speconsense-0.7.3}/speconsense/__init__.py RENAMED

@@ -5,7 +5,7 @@ A Python tool for experimental clustering and consensus generation as an alterna
 in the fungal DNA barcoding pipeline.
 """
-__version__ = "0.7.2"
+__version__ = "0.7.3"
 __author__ = "Josh Walker"
 __email__ = "joshowalker@yahoo.com"

{speconsense-0.7.2 → speconsense-0.7.3}/speconsense/core/cli.py RENAMED

@@ -66,6 +66,9 @@ def main():
                                help="Disable position-based variant phasing (enabled by default). "
                                     "MCL graph clustering already separates most variants; this "
                                     "second pass analyzes MSA positions to phase remaining variants.")
+    phasing_group.add_argument("--enable-position-phasing", action="store_false",
+                               dest="disable_position_phasing",
+                               help="Override --disable-position-phasing or profile setting")
     phasing_group.add_argument("--min-variant-frequency", type=float, default=0.10,
                                help="Minimum alternative allele frequency to call variant (default: 0.10 for 10%%)")
     phasing_group.add_argument("--min-variant-count", type=int, default=5,
@@ -75,6 +78,9 @@ def main():
     ambiguity_group = parser.add_argument_group("Ambiguity Calling")
     ambiguity_group.add_argument("--disable-ambiguity-calling", action="store_true",
                                  help="Disable IUPAC ambiguity code calling for unphased variant positions")
+    ambiguity_group.add_argument("--enable-ambiguity-calling", action="store_false",
+                                 dest="disable_ambiguity_calling",
+                                 help="Override --disable-ambiguity-calling or profile setting")
     ambiguity_group.add_argument("--min-ambiguity-frequency", type=float, default=0.10,
                                  help="Minimum alternative allele frequency for IUPAC ambiguity calling (default: 0.10 for 10%%)")
     ambiguity_group.add_argument("--min-ambiguity-count", type=int, default=3,
@@ -84,8 +90,14 @@ def main():
     merging_group = parser.add_argument_group("Cluster Merging")
     merging_group.add_argument("--disable-cluster-merging", action="store_true",
                                help="Disable merging of clusters with identical consensus sequences")
+    merging_group.add_argument("--enable-cluster-merging", action="store_false",
+                               dest="disable_cluster_merging",
+                               help="Override --disable-cluster-merging or profile setting")
     merging_group.add_argument("--disable-homopolymer-equivalence", action="store_true",
                                help="Disable homopolymer equivalence in cluster merging (only merge identical sequences)")
+    merging_group.add_argument("--enable-homopolymer-equivalence", action="store_false",
+                               dest="disable_homopolymer_equivalence",
+                               help="Override --disable-homopolymer-equivalence or profile setting")
     # Orientation group
     orient_group = parser.add_argument_group("Orientation")
@@ -104,11 +116,17 @@ def main():
                                  "0=auto-detect, default=1 (safe for parallel workflows).")
     perf_group.add_argument("--enable-early-filter", action="store_true",
                             help="Enable early filtering to skip small clusters before variant phasing (improves performance for large datasets)")
+    perf_group.add_argument("--disable-early-filter", action="store_false",
+                            dest="enable_early_filter",
+                            help="Override --enable-early-filter or profile setting")
     # Debugging group
     debug_group = parser.add_argument_group("Debugging")
     debug_group.add_argument("--collect-discards", action="store_true",
                              help="Write discarded reads (outliers and filtered clusters) to cluster_debug/{sample}-discards.fastq")
+    debug_group.add_argument("--no-collect-discards", action="store_false",
+                             dest="collect_discards",
+                             help="Override --collect-discards or profile setting")
     debug_group.add_argument("--log-level", default="INFO",
                              choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"])
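
The paired flags added in this file all follow one argparse pattern: a `store_false` action that writes to the same `dest` as the existing `store_true` action, so whichever flag appears last on the command line wins. A minimal standalone sketch of that pattern is below. The `build_parser` helper and the `demo` program name are illustrative, and using `set_defaults` to stand in for a profile is an assumption; how speconsense actually applies profile values is not shown in this diff.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="demo")
    # Registered first, so its default (False) is the one argparse applies.
    parser.add_argument("--disable-position-phasing", action="store_true")
    # Same dest, opposite action: resets the value to False when given.
    parser.add_argument("--enable-position-phasing", action="store_false",
                        dest="disable_position_phasing")
    return parser


parser = build_parser()
no_flags = parser.parse_args([])
both = parser.parse_args(["--disable-position-phasing",
                          "--enable-position-phasing"])

# Assumption: a profile is modelled here as parser-level defaults.
profiled = build_parser()
profiled.set_defaults(disable_position_phasing=True)
from_profile = profiled.parse_args([])
overridden = profiled.parse_args(["--enable-position-phasing"])
```

With no flags the value stays `False`; with a profile-style default of `True`, passing `--enable-position-phasing` brings it back to `False`, which matches the "Override ... or profile setting" wording in the help strings.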

{speconsense-0.7.2 → speconsense-0.7.3}/speconsense/profiles/__init__.py RENAMED

@@ -103,6 +103,7 @@ VALID_SUMMARIZE_KEYS = {
     "select-max-groups",
     "select-max-variants",
     "select-strategy",
+    "enable-full-consensus",
     # Processing
     "scale-threshold",
     "threads",

speconsense-0.7.3/speconsense/profiles/compressed.yaml ADDED

@@ -0,0 +1,27 @@
+# Compress variants into minimal IUPAC consensus sequences
+#
+# Aggressively merges similar variants (including indels) into single
+# IUPAC consensus sequences. Only truly dissimilar sequences remain
+# separate. Uses 20% frequency thresholds throughout.
+#
+# Designed for workflows where reviewers want fewer sequences to
+# examine, with all variation represented via IUPAC ambiguity codes.
+# Partial overlap merging is disabled as a safety measure.
+#
+# Use with:
+#   speconsense input.fastq -p compressed
+#   speconsense-summarize -p compressed
+speconsense-version: "0.7.*"
+description: "Compress variants into minimal IUPAC consensus sequences"
+speconsense:
+  min-ambiguity-frequency: 0.20  # 20% threshold for IUPAC ambiguity calling
+  min-variant-frequency: 0.20   # 20% threshold for variant phasing
+speconsense-summarize:
+  merge-indel-length: 5         # Merge indels up to 5bp
+  merge-position-count: 10      # Allow up to 10 variant positions in a merge
+  merge-min-size-ratio: 0.2     # Match 20% calling threshold
+  min-merge-overlap: 0          # Disable partial overlap merging
+  enable-full-consensus: true   # Include full IUPAC consensus per group
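
For a profile key such as `enable-full-consensus` to take effect, it has to pass key validation and end up under the argparse attribute name `enable_full_consensus`. The sketch below shows one way that mapping could work. It is an assumption, not the package's loader: `profile_to_defaults` is a hypothetical helper, the key set is an illustrative subset of `VALID_SUMMARIZE_KEYS`, and the profile is written as a dict literal rather than parsed from YAML.

```python
# Illustrative subset only; the real set is defined in speconsense/profiles/__init__.py.
VALID_SUMMARIZE_KEYS = {
    "merge-indel-length",
    "merge-position-count",
    "merge-min-size-ratio",
    "min-merge-overlap",
    "enable-full-consensus",
}


def profile_to_defaults(section):
    """Reject unknown keys, then map hyphenated keys to argparse-style names."""
    unknown = sorted(set(section) - VALID_SUMMARIZE_KEYS)
    if unknown:
        raise ValueError(f"Unknown profile keys: {', '.join(unknown)}")
    return {key.replace("-", "_"): value for key, value in section.items()}


# The speconsense-summarize section of compressed.yaml, as a plain dict.
compressed = {
    "merge-indel-length": 5,
    "merge-position-count": 10,
    "merge-min-size-ratio": 0.2,
    "min-merge-overlap": 0,
    "enable-full-consensus": True,
}
defaults = profile_to_defaults(compressed)
```

Under this scheme a profile key missing from the valid set fails loudly instead of being silently ignored, which is presumably why the diff adds `enable-full-consensus` to the set alongside the new profile.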

{speconsense-0.7.2 → speconsense-0.7.3}/speconsense/summarize/cli.py RENAMED

@@ -54,8 +54,8 @@ from .io import (
     write_output_files,
 )
 from .clustering import perform_hac_clustering, select_variants
-from .merging import merge_group_with_msa
-from .analysis import MAX_MSA_MERGE_VARIANTS, MIN_MERGE_BATCH, MAX_MERGE_BATCH
+from .merging import merge_group_with_msa, create_full_consensus_from_msa
+from .analysis import run_spoa_msa, MAX_MSA_MERGE_VARIANTS, MIN_MERGE_BATCH, MAX_MERGE_BATCH
 # Merge effort configuration
@@ -132,6 +132,8 @@ def parse_arguments():
     merging_group = parser.add_argument_group("Merging")
     merging_group.add_argument("--disable-merging", action="store_true",
                                help="Disable all variant merging (skip MSA-based merge evaluation entirely)")
+    merging_group.add_argument("--enable-merging", action="store_false", dest="disable_merging",
+                               help="Override --disable-merging or profile setting")
     merging_group.add_argument("--merge-snp", action=argparse.BooleanOptionalAction, default=True,
                                help="Enable SNP-based merging (default: True, use --no-merge-snp to disable)")
     merging_group.add_argument("--merge-indel-length", type=int, default=0,
@@ -144,6 +146,9 @@ def parse_arguments():
                                help="Minimum overlap in bp for merging sequences of different lengths (default: 200, 0 to disable)")
     merging_group.add_argument("--disable-homopolymer-equivalence", action="store_true",
                                help="Disable homopolymer equivalence in merging (treat AAA vs AAAA as different)")
+    merging_group.add_argument("--enable-homopolymer-equivalence", action="store_false",
+                               dest="disable_homopolymer_equivalence",
+                               help="Override --disable-homopolymer-equivalence or profile setting")
     merging_group.add_argument("--merge-effort", type=str, default="balanced", metavar="LEVEL",
                                help="Merging effort level: fast (8), balanced (10), thorough (12), "
                                     "or numeric 6-14. Higher values allow larger batch sizes for "
@@ -164,6 +169,12 @@ def parse_arguments():
     selection_group.add_argument("--select-strategy", "--variant-selection",
                                  dest="select_strategy", choices=["size", "diversity"], default="size",
                                  help="Variant selection strategy: size or diversity (default: size)")
+    selection_group.add_argument("--enable-full-consensus", action="store_true",
+                                 help="Generate a full consensus per variant group representing all variation "
+                                      "from pre-merge variants (gaps never win)")
+    selection_group.add_argument("--disable-full-consensus", action="store_false",
+                                 dest="enable_full_consensus",
+                                 help="Override --enable-full-consensus or profile setting")
     # Performance group
     perf_group = parser.add_argument_group("Performance")
@@ -345,7 +356,7 @@ def process_single_specimen(file_consensuses: List[ConsensusInfo],
                           key=lambda x: max(m.size for m in x[1]),
                           reverse=True)
-    for group_idx, (_, group_members) in enumerate(sorted_groups):
+    for group_idx, (group_id, group_members) in enumerate(sorted_groups):
         final_group_name = group_idx + 1
         # Select variants for this group
@@ -366,6 +377,24 @@ def process_single_specimen(file_consensuses: List[ConsensusInfo],
             final_consensus.append(renamed_variant)
             group_naming.append((variant.sample_name, new_name))
+        # Generate full consensus from PRE-MERGE variants
+        if getattr(args, 'enable_full_consensus', False):
+            pre_merge_variants = variant_groups[group_id]
+            specimen_base = selected_variants[0].sample_name.rsplit('-c', 1)[0]
+            full_name = f"{specimen_base}-{group_idx + 1}.full"
+            if len(pre_merge_variants) == 1:
+                # Single variant — copy directly
+                full_consensus = pre_merge_variants[0]._replace(sample_name=full_name)
+            else:
+                # MSA on pre-merge variants, full consensus logic
+                sequences = [v.sequence for v in pre_merge_variants]
+                aligned_seqs = run_spoa_msa(sequences, alignment_mode=1)
+                full_consensus = create_full_consensus_from_msa(aligned_seqs, pre_merge_variants)
+                full_consensus = full_consensus._replace(sample_name=full_name)
+            final_consensus.append(full_consensus)
         naming_info[group_idx + 1] = group_naming
     logging.info(f"Processed {file_name}: {len(final_consensus)} final variants across {len(merged_groups)} groups")
@@ -421,6 +450,7 @@ def main():
     logging.info(f"  --select-max-variants: {args.select_max_variants}")
     logging.info(f"  --select-max-groups: {args.select_max_groups}")
     logging.info(f"  --select-strategy: {args.select_strategy}")
+    logging.info(f"  --enable-full-consensus: {args.enable_full_consensus}")
     logging.info(f"  --log-level: {args.log_level}")
     logging.info("")
     logging.info("Processing each specimen file independently to organize variants within specimens")
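
The hunks above add `--enable-…`/`--disable-…` flag pairs that write to a single `dest`, with help text saying the second flag can override the first "or profile setting". A minimal sketch of how such a pair behaves in `argparse` follows. The `build_parser` helper and the use of `set_defaults` to stand in for a profile are assumptions for illustration, not taken from speconsense itself.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser containing only the flag pair from the diff."""
    parser = argparse.ArgumentParser(prog="flag-pair-sketch")
    # The first action registered for a dest supplies its default (False here).
    parser.add_argument("--enable-full-consensus", action="store_true")
    # Same dest, opposite action: lets a later flag switch the option back off.
    parser.add_argument("--disable-full-consensus", action="store_false",
                        dest="enable_full_consensus")
    return parser


parser = build_parser()
default_value = parser.parse_args([]).enable_full_consensus
enabled = parser.parse_args(["--enable-full-consensus"]).enable_full_consensus
# When both flags are given, the last one on the command line wins.
last_wins = parser.parse_args(
    ["--enable-full-consensus", "--disable-full-consensus"]
).enable_full_consensus

# Assumption: a profile changes the default, modeled here with set_defaults().
profiled = build_parser()
profiled.set_defaults(enable_full_consensus=True)
profile_default = profiled.parse_args([]).enable_full_consensus
profile_overridden = profiled.parse_args(
    ["--disable-full-consensus"]
).enable_full_consensus
```

Sharing one `dest` keeps downstream code reading a single attribute (`args.enable_full_consensus`) regardless of which flag or default produced the value.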

{speconsense-0.7.2 → speconsense-0.7.3}/speconsense/summarize/fields.py RENAMED

@@ -124,8 +124,8 @@ class GroupField(FastaField):
         super().__init__('group', 'Variant group number')
     def format_value(self, consensus: ConsensusInfo) -> Optional[str]:
-        # Extract from sample_name (e.g., "...-1.v1" or "...-2.v1.raw1")
-        match = re.search(r'-(\d+)\.v\d+(?:\.raw\d+)?$', consensus.sample_name)
+        # Extract from sample_name (e.g., "...-1.v1", "...-2.v1.raw1", or "...-1.full")
+        match = re.search(r'-(\d+)(?:\.v\d+(?:\.raw\d+)?|\.full)$', consensus.sample_name)
         if match:
             return f"group={match.group(1)}"
         return None
@@ -136,8 +136,10 @@ class VariantField(FastaField):
         super().__init__('variant', 'Variant identifier within group')
     def format_value(self, consensus: ConsensusInfo) -> Optional[str]:
-        # Extract from sample_name (e.g., "...-1.v1" -> "v1" or "...-1.v1.raw1" -> "v1")
+        # Extract from sample_name (e.g., "...-1.v1" -> "v1", "...-1.v1.raw1" -> "v1", "...-1.full" -> "full")
         match = re.search(r'\.(v\d+)(?:\.raw\d+)?$', consensus.sample_name)
+        if not match:
+            match = re.search(r'\.(full)$', consensus.sample_name)
         if match:
             return f"variant={match.group(1)}"
         return None
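
The two patterns changed in `fields.py` can be exercised on their own to see which sample names they accept. The sketch below copies the regular expressions from the hunks above; the `parse_name` helper and the sample names are hypothetical and exist only for this illustration.

```python
import re
from typing import Optional, Tuple

# Patterns copied from the 0.7.3 side of the diff.
GROUP_RE = re.compile(r'-(\d+)(?:\.v\d+(?:\.raw\d+)?|\.full)$')
VARIANT_RE = re.compile(r'\.(v\d+)(?:\.raw\d+)?$')
FULL_RE = re.compile(r'\.(full)$')


def parse_name(sample_name: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (group, variant) extracted from a sample name, or None for each miss."""
    group = GROUP_RE.search(sample_name)
    # Try the versioned form first, then fall back to the ".full" suffix.
    variant = VARIANT_RE.search(sample_name) or FULL_RE.search(sample_name)
    return (
        group.group(1) if group else None,
        variant.group(1) if variant else None,
    )


for name in ("specimen-1.v1", "specimen-2.v1.raw1", "specimen-1.full", "specimen-c3"):
    print(name, parse_name(name))
```

Names that carry neither a `.vN` nor a `.full` suffix, such as the original `-c3` cluster names, yield `None` for both fields.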

{speconsense-0.7.2 → speconsense-0.7.3}/speconsense/summarize/io.py RENAMED

@@ -358,6 +358,9 @@ def write_specimen_data_files(specimen_consensus: List[ConsensusInfo],
     # Generate .raw file consensuses for merged variants
     raw_file_consensuses = []
     for consensus in specimen_consensus:
+        # Skip .raw generation for .full consensus (synthetic/derived)
+        if consensus.sample_name.endswith('.full'):
+            continue
         # Only create .raw files if this consensus was actually merged
         if consensus.raw_ric and len(consensus.raw_ric) > 1:
             # Find the original cluster name from naming_info
@@ -412,6 +415,9 @@ def write_specimen_data_files(specimen_consensus: List[ConsensusInfo],
     # Write FASTQ files for each final consensus containing all contributing reads
     for consensus in specimen_consensus:
+        # Skip FASTQ for .full consensus (synthetic/derived, no traceable cluster reads)
+        if consensus.sample_name.endswith('.full'):
+            continue
         write_consensus_fastq(consensus, merge_traceability, naming_info, fastq_dir, fastq_lookup, original_consensus_lookup)
     # Write .raw files (individual FASTA and FASTQ for pre-merge variants)
@@ -704,7 +710,10 @@ def write_output_files(final_consensus: List[ConsensusInfo],
             multiple_id = specimen_counters[base_name]
             writer.writerow([consensus.sample_name, len(consensus.sequence), consensus.ric, multiple_id])
             unique_samples.add(base_name)
-            total_ric += consensus.ric
+            # Exclude .full from total RiC to avoid double-counting
+            # (.full aggregates reads already counted in merged variants)
+            if not consensus.sample_name.endswith('.full'):
+                total_ric += consensus.ric
         writer.writerow([])
         writer.writerow(['Total Unique Samples', len(unique_samples)])
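
The last hunk excludes `.full` entries from the RiC total because a full consensus aggregates reads already counted under the group's variants. A small sketch of that effect follows, using the RiC values from the README's example directory listing (45, 23, and 68). The `Record` type and `total_ric` function are hypothetical stand-ins, not the package's `ConsensusInfo` or its writer code.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    """Hypothetical stand-in carrying only the two fields this sketch needs."""
    sample_name: str
    ric: int


def total_ric(records: List[Record]) -> int:
    """Sum RiC while skipping .full entries, mirroring the check in the diff."""
    return sum(r.ric for r in records if not r.sample_name.endswith('.full'))


records = [
    Record("sample-1.v1", 45),
    Record("sample-1.v2", 23),
    Record("sample-1.full", 68),  # same reads as v1 + v2
]

naive_total = sum(r.ric for r in records)  # counts the group's reads twice
corrected_total = total_ric(records)       # counts each read once
```

With these values the naive sum is 136, while the corrected total is 68, which matches the reads actually present in the group.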
