PyPI - speconsense - Versions diffs - 0.7.2__tar.gz → 0.7.4__tar.gz - Mend

speconsense 0.7.2__tar.gz → 0.7.4__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (55)

{speconsense-0.7.2/speconsense.egg-info → speconsense-0.7.4}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: speconsense
-Version: 0.7.2
+Version: 0.7.4
 Summary: High-quality clustering and consensus generation for Oxford Nanopore amplicon reads
 Author-email: Josh Walker <joshowalker@yahoo.com>
 License: BSD-3-Clause
@@ -171,6 +171,7 @@ speconsense input.fastq -p herbarium --min-size 10
 ```
 **Bundled profiles:**
+- `compressed` — Compress variants into minimal IUPAC consensus sequences (aggressive merging with indels, 20% thresholds, full consensus, 20% selection size ratio)
 - `herbarium` — High-recall for degraded DNA/type specimens
 - `largedata` — Experimental settings for large input files
 - `nostalgia` — Simulate older bioinformatics pipelines
@@ -294,12 +295,14 @@ When using `speconsense-summarize` for post-processing, creates `__Summary__/` d
 |---------------|-------------|------------|-------------|
 | **Original** | Source `cluster_debug/` | `-c1`, `-c2`, `-c3` | Preserves speconsense clustering results |
 | **Summarization** | `__Summary__/`, `FASTQ Files/`, `variants/` | `-1.v1`, `-1.v2`, `-2.v1`, `.raw1` | Post-processing groups and variants |
+| **Full consensus** | `__Summary__/` | `-1.full` | IUPAC consensus from all pre-merge variants in a group |
 ### Example Directory Structure
 ```
 __Summary__/
 ├── sample-1.v1-RiC45.fasta                  # Primary variant (group 1, merged)
 ├── sample-1.v2-RiC23.fasta                  # Additional variant (not merged)
+├── sample-1.full-RiC68.fasta                # Full IUPAC consensus for group 1 (all pre-merge variants)
 ├── sample-2.v1-RiC30.fasta                  # Second organism group, primary variant
 ├── summary.fasta                            # All final consensus sequences (excludes .raw)
 ├── summary.txt                              # Statistics
@@ -675,6 +678,18 @@ speconsense-summarize --select-strategy diversity --select-max-variants 2
    - Output up to select_max_variants per group
 3. Final output contains representatives from all groups, ensuring both biological diversity (between groups) and appropriate sampling within each biological entity (within groups)
+**Selection Size Ratio Filtering:**
+```bash
+speconsense-summarize --select-min-size-ratio 0.2
+```
+- Filters out post-merge variants whose size is too small relative to the largest variant in their group
+- Ratio calculated as `variant_size / largest_size` — must be ≥ threshold to keep
+- Example: `--select-min-size-ratio 0.2` means a variant must have ≥20% the reads of the largest variant in its group
+- Default is 0 (disabled) — all post-merge variants pass through to selection
+- Applied after merging but before variant selection
+- Useful for suppressing noise variants that survived merging but are too small to be meaningful
+- Set to 0.2 in the `compressed` profile to match the 20% calling threshold theme
 This two-stage process ensures that distinct biological sequences are preserved as separate groups, while providing control over variant complexity within each group.
 ### Customizing FASTA Header Fields
@@ -810,6 +825,18 @@ For high-throughput workflows (e.g., 100K sequences/year), this prioritization e
 ### Additional Summarize Options
+**Full Consensus:**
+```bash
+speconsense-summarize --enable-full-consensus
+```
+- Generates a full IUPAC consensus sequence per variant group from all pre-merge variants
+- Output named `{specimen}-{group}.full-RiC{reads}.fasta` in the `__Summary__/` directory
+- Uses majority voting across all variants in the group; **gaps never win** — at each alignment column, the most common non-gap base is chosen, with IUPAC codes for ties among bases
+- Useful when you want a single representative sequence that captures all variation within a group as IUPAC ambiguity codes
+- Included in `summary.fasta` (but excluded from total RiC to avoid double-counting)
+- Enabled by default in the `compressed` profile
+- Use `--disable-full-consensus` to override when set by a profile
 **Quality Filtering:**
 ```bash
 speconsense-summarize --min-ric 5
@@ -1044,8 +1071,10 @@ The complete speconsense-summarize workflow operates in this order:
 2. **HAC variant grouping** by sequence identity to separate dissimilar sequences (`--group-identity`); uses single-linkage when overlap merging is enabled
 3. **Group filtering** to limit output groups (`--select-max-groups`)
 4. **Homopolymer-aware MSA-based variant merging** within each group, including **overlap merging** for different-length sequences (`--disable-merging`, `--merge-effort`, `--merge-position-count`, `--merge-indel-length`, `--min-merge-overlap`, `--merge-snp`, `--merge-min-size-ratio`, `--disable-homopolymer-equivalence`)
-5. **Variant selection** within each group (`--select-max-variants`, `--select-strategy`)
-6. **Output generation** with customizable header fields (`--fasta-fields`) and full traceability
+5. **Selection size ratio filtering** to remove tiny post-merge variants (`--select-min-size-ratio`)
+6. **Variant selection** within each group (`--select-max-variants`, `--select-strategy`)
+7. **Full consensus generation** (optional) — IUPAC consensus from all pre-merge variants per group (`--enable-full-consensus`)
+8. **Output generation** with customizable header fields (`--fasta-fields`) and full traceability
 **Key architectural features**:
 - HAC grouping occurs BEFORE merging to prevent inappropriate merging of dissimilar sequences (e.g., contaminants with primary targets)
@@ -1098,17 +1127,20 @@ usage: speconsense [-h] [-O OUTPUT_DIR] [--primers PRIMERS]
                    [--min-cluster-ratio MIN_CLUSTER_RATIO]
                    [--max-sample-size MAX_SAMPLE_SIZE]
                    [--outlier-identity OUTLIER_IDENTITY]
-                   [--disable-position-phasing]
+                   [--disable-position-phasing] [--enable-position-phasing]
                    [--min-variant-frequency MIN_VARIANT_FREQUENCY]
                    [--min-variant-count MIN_VARIANT_COUNT]
-                   [--disable-ambiguity-calling]
+                   [--disable-ambiguity-calling] [--enable-ambiguity-calling]
                    [--min-ambiguity-frequency MIN_AMBIGUITY_FREQUENCY]
                    [--min-ambiguity-count MIN_AMBIGUITY_COUNT]
-                   [--disable-cluster-merging]
+                   [--disable-cluster-merging] [--enable-cluster-merging]
                    [--disable-homopolymer-equivalence]
+                   [--enable-homopolymer-equivalence]
                    [--orient-mode {skip,keep-all,filter-failed}]
                    [--presample PRESAMPLE] [--scale-threshold SCALE_THRESHOLD]
-                   [--threads N] [--enable-early-filter] [--collect-discards]
+                   [--threads N] [--enable-early-filter]
+                   [--disable-early-filter] [--collect-discards]
+                   [--no-collect-discards]
                    [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                    [--version] [-p NAME] [--list-profiles]
                    input_file
@@ -1167,6 +1199,8 @@ Variant Phasing:
                         default). MCL graph clustering already separates most
                         variants; this second pass analyzes MSA positions to
                         phase remaining variants.
+  --enable-position-phasing
+                        Override --disable-position-phasing or profile setting
   --min-variant-frequency MIN_VARIANT_FREQUENCY
                         Minimum alternative allele frequency to call variant
                         (default: 0.10 for 10%)
@@ -1178,6 +1212,9 @@ Ambiguity Calling:
   --disable-ambiguity-calling
                         Disable IUPAC ambiguity code calling for unphased
                         variant positions
+  --enable-ambiguity-calling
+                        Override --disable-ambiguity-calling or profile
+                        setting
   --min-ambiguity-frequency MIN_AMBIGUITY_FREQUENCY
                         Minimum alternative allele frequency for IUPAC
                         ambiguity calling (default: 0.10 for 10%)
@@ -1189,9 +1226,14 @@ Cluster Merging:
   --disable-cluster-merging
                         Disable merging of clusters with identical consensus
                         sequences
+  --enable-cluster-merging
+                        Override --disable-cluster-merging or profile setting
   --disable-homopolymer-equivalence
                         Disable homopolymer equivalence in cluster merging
                         (only merge identical sequences)
+  --enable-homopolymer-equivalence
+                        Override --disable-homopolymer-equivalence or profile
+                        setting
 Orientation:
   --orient-mode {skip,keep-all,filter-failed}
@@ -1213,10 +1255,14 @@ Performance:
                         Enable early filtering to skip small clusters before
                         variant phasing (improves performance for large
                         datasets)
+  --disable-early-filter
+                        Override --enable-early-filter or profile setting
 Debugging:
   --collect-discards    Write discarded reads (outliers and filtered clusters)
                         to cluster_debug/{sample}-discards.fastq
+  --no-collect-discards
+                        Override --collect-discards or profile setting
   --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
 ```
@@ -1227,15 +1273,22 @@ usage: speconsense-summarize [-h] [--source SOURCE]
                              [--summary-dir SUMMARY_DIR]
                              [--fasta-fields FASTA_FIELDS] [--min-ric MIN_RIC]
                              [--min-len MIN_LEN] [--max-len MAX_LEN]
-                             [--group-identity GROUP_IDENTITY] [--merge-snp]
+                             [--group-identity GROUP_IDENTITY]
+                             [--disable-merging] [--enable-merging]
+                             [--merge-snp | --no-merge-snp]
                              [--merge-indel-length MERGE_INDEL_LENGTH]
                              [--merge-position-count MERGE_POSITION_COUNT]
                              [--merge-min-size-ratio MERGE_MIN_SIZE_RATIO]
                              [--min-merge-overlap MIN_MERGE_OVERLAP]
                              [--disable-homopolymer-equivalence]
+                             [--enable-homopolymer-equivalence]
+                             [--merge-effort LEVEL]
                              [--select-max-groups SELECT_MAX_GROUPS]
                              [--select-max-variants SELECT_MAX_VARIANTS]
                              [--select-strategy {size,diversity}]
+                             [--select-min-size-ratio SELECT_MIN_SIZE_RATIO]
+                             [--enable-full-consensus]
+                             [--disable-full-consensus]
                              [--scale-threshold SCALE_THRESHOLD] [--threads N]
                              [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                              [--version] [-p NAME] [--list-profiles]
@@ -1281,10 +1334,7 @@ Grouping:
 Merging:
   --disable-merging     Disable all variant merging (skip MSA-based merge
                         evaluation entirely)
-  --merge-effort LEVEL  Merging effort level: fast (8), balanced (10),
-                        thorough (12), or numeric 6-14. Higher values allow
-                        larger batch sizes for exhaustive subset search.
-                        Default: balanced
+  --enable-merging      Override --disable-merging or profile setting
   --merge-snp, --no-merge-snp
                         Enable SNP-based merging (default: True, use --no-
                         merge-snp to disable)
@@ -1303,6 +1353,13 @@ Merging:
   --disable-homopolymer-equivalence
                         Disable homopolymer equivalence in merging (treat AAA
                         vs AAAA as different)
+  --enable-homopolymer-equivalence
+                        Override --disable-homopolymer-equivalence or profile
+                        setting
+  --merge-effort LEVEL  Merging effort level: fast (8), balanced (10),
+                        thorough (12), or numeric 6-14. Higher values allow
+                        larger batch sizes for exhaustive subset search.
+                        Default: balanced
 Selection:
   --select-max-groups SELECT_MAX_GROUPS, --max-groups SELECT_MAX_GROUPS
@@ -1314,6 +1371,16 @@ Selection:
   --select-strategy {size,diversity}, --variant-selection {size,diversity}
                         Variant selection strategy: size or diversity
                         (default: size)
+  --select-min-size-ratio SELECT_MIN_SIZE_RATIO
+                        Minimum size ratio (variant/largest) to include in
+                        output (default: 0 = disabled, e.g. 0.2 for 20%
+                        cutoff)
+  --enable-full-consensus
+                        Generate a full consensus per variant group
+                        representing all variation from pre-merge variants
+                        (gaps never win)
+  --disable-full-consensus
+                        Override --enable-full-consensus or profile setting
 Performance:
   --scale-threshold SCALE_THRESHOLD
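
Note on the `--select-min-size-ratio` hunks in this file: the README text states that a post-merge variant is kept when `variant_size / largest_size` is at least the threshold, and that 0 disables the filter. A minimal sketch of that rule follows. It is an illustration under stated assumptions, not the package's implementation: the function name `filter_by_size_ratio` and the `(name, read_count)` tuples are hypothetical.

```python
from typing import List, Tuple

Variant = Tuple[str, int]  # hypothetical shape: (variant name, read count)


def filter_by_size_ratio(variants: List[Variant], min_ratio: float) -> List[Variant]:
    """Keep variants whose read count is at least min_ratio of the group's largest."""
    if min_ratio <= 0 or not variants:
        return list(variants)  # 0 means the filter is disabled
    largest = max(size for _, size in variants)
    if largest == 0:
        return list(variants)  # nothing to compare against; keep everything
    return [(name, size) for name, size in variants if size / largest >= min_ratio]


group = [("sample-1.v1", 45), ("sample-1.v2", 23), ("sample-1.v3", 5)]
kept = filter_by_size_ratio(group, 0.2)
print(kept)  # v3 is dropped because 5 / 45 is below 0.2
```

With the threshold at 0.2, a variant needs at least 9 reads to survive next to a 45-read variant in the same group.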

{speconsense-0.7.2 → speconsense-0.7.4}/README.md RENAMED

@@ -136,6 +136,7 @@ speconsense input.fastq -p herbarium --min-size 10
 ```
 **Bundled profiles:**
+- `compressed` — Compress variants into minimal IUPAC consensus sequences (aggressive merging with indels, 20% thresholds, full consensus, 20% selection size ratio)
 - `herbarium` — High-recall for degraded DNA/type specimens
 - `largedata` — Experimental settings for large input files
 - `nostalgia` — Simulate older bioinformatics pipelines
@@ -259,12 +260,14 @@ When using `speconsense-summarize` for post-processing, creates `__Summary__/` d
 |---------------|-------------|------------|-------------|
 | **Original** | Source `cluster_debug/` | `-c1`, `-c2`, `-c3` | Preserves speconsense clustering results |
 | **Summarization** | `__Summary__/`, `FASTQ Files/`, `variants/` | `-1.v1`, `-1.v2`, `-2.v1`, `.raw1` | Post-processing groups and variants |
+| **Full consensus** | `__Summary__/` | `-1.full` | IUPAC consensus from all pre-merge variants in a group |
 ### Example Directory Structure
 ```
 __Summary__/
 ├── sample-1.v1-RiC45.fasta                  # Primary variant (group 1, merged)
 ├── sample-1.v2-RiC23.fasta                  # Additional variant (not merged)
+├── sample-1.full-RiC68.fasta                # Full IUPAC consensus for group 1 (all pre-merge variants)
 ├── sample-2.v1-RiC30.fasta                  # Second organism group, primary variant
 ├── summary.fasta                            # All final consensus sequences (excludes .raw)
 ├── summary.txt                              # Statistics
@@ -640,6 +643,18 @@ speconsense-summarize --select-strategy diversity --select-max-variants 2
    - Output up to select_max_variants per group
 3. Final output contains representatives from all groups, ensuring both biological diversity (between groups) and appropriate sampling within each biological entity (within groups)
+**Selection Size Ratio Filtering:**
+```bash
+speconsense-summarize --select-min-size-ratio 0.2
+```
+- Filters out post-merge variants whose size is too small relative to the largest variant in their group
+- Ratio calculated as `variant_size / largest_size` — must be ≥ threshold to keep
+- Example: `--select-min-size-ratio 0.2` means a variant must have ≥20% the reads of the largest variant in its group
+- Default is 0 (disabled) — all post-merge variants pass through to selection
+- Applied after merging but before variant selection
+- Useful for suppressing noise variants that survived merging but are too small to be meaningful
+- Set to 0.2 in the `compressed` profile to match the 20% calling threshold theme
 This two-stage process ensures that distinct biological sequences are preserved as separate groups, while providing control over variant complexity within each group.
 ### Customizing FASTA Header Fields
@@ -775,6 +790,18 @@ For high-throughput workflows (e.g., 100K sequences/year), this prioritization e
 ### Additional Summarize Options
+**Full Consensus:**
+```bash
+speconsense-summarize --enable-full-consensus
+```
+- Generates a full IUPAC consensus sequence per variant group from all pre-merge variants
+- Output named `{specimen}-{group}.full-RiC{reads}.fasta` in the `__Summary__/` directory
+- Uses majority voting across all variants in the group; **gaps never win** — at each alignment column, the most common non-gap base is chosen, with IUPAC codes for ties among bases
+- Useful when you want a single representative sequence that captures all variation within a group as IUPAC ambiguity codes
+- Included in `summary.fasta` (but excluded from total RiC to avoid double-counting)
+- Enabled by default in the `compressed` profile
+- Use `--disable-full-consensus` to override when set by a profile
 **Quality Filtering:**
 ```bash
 speconsense-summarize --min-ric 5
@@ -1009,8 +1036,10 @@ The complete speconsense-summarize workflow operates in this order:
 2. **HAC variant grouping** by sequence identity to separate dissimilar sequences (`--group-identity`); uses single-linkage when overlap merging is enabled
 3. **Group filtering** to limit output groups (`--select-max-groups`)
 4. **Homopolymer-aware MSA-based variant merging** within each group, including **overlap merging** for different-length sequences (`--disable-merging`, `--merge-effort`, `--merge-position-count`, `--merge-indel-length`, `--min-merge-overlap`, `--merge-snp`, `--merge-min-size-ratio`, `--disable-homopolymer-equivalence`)
-5. **Variant selection** within each group (`--select-max-variants`, `--select-strategy`)
-6. **Output generation** with customizable header fields (`--fasta-fields`) and full traceability
+5. **Selection size ratio filtering** to remove tiny post-merge variants (`--select-min-size-ratio`)
+6. **Variant selection** within each group (`--select-max-variants`, `--select-strategy`)
+7. **Full consensus generation** (optional) — IUPAC consensus from all pre-merge variants per group (`--enable-full-consensus`)
+8. **Output generation** with customizable header fields (`--fasta-fields`) and full traceability
 **Key architectural features**:
 - HAC grouping occurs BEFORE merging to prevent inappropriate merging of dissimilar sequences (e.g., contaminants with primary targets)
@@ -1063,17 +1092,20 @@ usage: speconsense [-h] [-O OUTPUT_DIR] [--primers PRIMERS]
                    [--min-cluster-ratio MIN_CLUSTER_RATIO]
                    [--max-sample-size MAX_SAMPLE_SIZE]
                    [--outlier-identity OUTLIER_IDENTITY]
-                   [--disable-position-phasing]
+                   [--disable-position-phasing] [--enable-position-phasing]
                    [--min-variant-frequency MIN_VARIANT_FREQUENCY]
                    [--min-variant-count MIN_VARIANT_COUNT]
-                   [--disable-ambiguity-calling]
+                   [--disable-ambiguity-calling] [--enable-ambiguity-calling]
                    [--min-ambiguity-frequency MIN_AMBIGUITY_FREQUENCY]
                    [--min-ambiguity-count MIN_AMBIGUITY_COUNT]
-                   [--disable-cluster-merging]
+                   [--disable-cluster-merging] [--enable-cluster-merging]
                    [--disable-homopolymer-equivalence]
+                   [--enable-homopolymer-equivalence]
                    [--orient-mode {skip,keep-all,filter-failed}]
                    [--presample PRESAMPLE] [--scale-threshold SCALE_THRESHOLD]
-                   [--threads N] [--enable-early-filter] [--collect-discards]
+                   [--threads N] [--enable-early-filter]
+                   [--disable-early-filter] [--collect-discards]
+                   [--no-collect-discards]
                    [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                    [--version] [-p NAME] [--list-profiles]
                    input_file
@@ -1132,6 +1164,8 @@ Variant Phasing:
                         default). MCL graph clustering already separates most
                         variants; this second pass analyzes MSA positions to
                         phase remaining variants.
+  --enable-position-phasing
+                        Override --disable-position-phasing or profile setting
   --min-variant-frequency MIN_VARIANT_FREQUENCY
                         Minimum alternative allele frequency to call variant
                         (default: 0.10 for 10%)
@@ -1143,6 +1177,9 @@ Ambiguity Calling:
   --disable-ambiguity-calling
                         Disable IUPAC ambiguity code calling for unphased
                         variant positions
+  --enable-ambiguity-calling
+                        Override --disable-ambiguity-calling or profile
+                        setting
   --min-ambiguity-frequency MIN_AMBIGUITY_FREQUENCY
                         Minimum alternative allele frequency for IUPAC
                         ambiguity calling (default: 0.10 for 10%)
@@ -1154,9 +1191,14 @@ Cluster Merging:
   --disable-cluster-merging
                         Disable merging of clusters with identical consensus
                         sequences
+  --enable-cluster-merging
+                        Override --disable-cluster-merging or profile setting
   --disable-homopolymer-equivalence
                         Disable homopolymer equivalence in cluster merging
                         (only merge identical sequences)
+  --enable-homopolymer-equivalence
+                        Override --disable-homopolymer-equivalence or profile
+                        setting
 Orientation:
   --orient-mode {skip,keep-all,filter-failed}
@@ -1178,10 +1220,14 @@ Performance:
                         Enable early filtering to skip small clusters before
                         variant phasing (improves performance for large
                         datasets)
+  --disable-early-filter
+                        Override --enable-early-filter or profile setting
 Debugging:
   --collect-discards    Write discarded reads (outliers and filtered clusters)
                         to cluster_debug/{sample}-discards.fastq
+  --no-collect-discards
+                        Override --collect-discards or profile setting
   --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
 ```
@@ -1192,15 +1238,22 @@ usage: speconsense-summarize [-h] [--source SOURCE]
                              [--summary-dir SUMMARY_DIR]
                              [--fasta-fields FASTA_FIELDS] [--min-ric MIN_RIC]
                              [--min-len MIN_LEN] [--max-len MAX_LEN]
-                             [--group-identity GROUP_IDENTITY] [--merge-snp]
+                             [--group-identity GROUP_IDENTITY]
+                             [--disable-merging] [--enable-merging]
+                             [--merge-snp | --no-merge-snp]
                              [--merge-indel-length MERGE_INDEL_LENGTH]
                              [--merge-position-count MERGE_POSITION_COUNT]
                              [--merge-min-size-ratio MERGE_MIN_SIZE_RATIO]
                              [--min-merge-overlap MIN_MERGE_OVERLAP]
                              [--disable-homopolymer-equivalence]
+                             [--enable-homopolymer-equivalence]
+                             [--merge-effort LEVEL]
                              [--select-max-groups SELECT_MAX_GROUPS]
                              [--select-max-variants SELECT_MAX_VARIANTS]
                              [--select-strategy {size,diversity}]
+                             [--select-min-size-ratio SELECT_MIN_SIZE_RATIO]
+                             [--enable-full-consensus]
+                             [--disable-full-consensus]
                              [--scale-threshold SCALE_THRESHOLD] [--threads N]
                              [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                              [--version] [-p NAME] [--list-profiles]
@@ -1246,10 +1299,7 @@ Grouping:
 Merging:
   --disable-merging     Disable all variant merging (skip MSA-based merge
                         evaluation entirely)
-  --merge-effort LEVEL  Merging effort level: fast (8), balanced (10),
-                        thorough (12), or numeric 6-14. Higher values allow
-                        larger batch sizes for exhaustive subset search.
-                        Default: balanced
+  --enable-merging      Override --disable-merging or profile setting
   --merge-snp, --no-merge-snp
                         Enable SNP-based merging (default: True, use --no-
                         merge-snp to disable)
@@ -1268,6 +1318,13 @@ Merging:
   --disable-homopolymer-equivalence
                         Disable homopolymer equivalence in merging (treat AAA
                         vs AAAA as different)
+  --enable-homopolymer-equivalence
+                        Override --disable-homopolymer-equivalence or profile
+                        setting
+  --merge-effort LEVEL  Merging effort level: fast (8), balanced (10),
+                        thorough (12), or numeric 6-14. Higher values allow
+                        larger batch sizes for exhaustive subset search.
+                        Default: balanced
 Selection:
   --select-max-groups SELECT_MAX_GROUPS, --max-groups SELECT_MAX_GROUPS
@@ -1279,6 +1336,16 @@ Selection:
   --select-strategy {size,diversity}, --variant-selection {size,diversity}
                         Variant selection strategy: size or diversity
                         (default: size)
+  --select-min-size-ratio SELECT_MIN_SIZE_RATIO
+                        Minimum size ratio (variant/largest) to include in
+                        output (default: 0 = disabled, e.g. 0.2 for 20%
+                        cutoff)
+  --enable-full-consensus
+                        Generate a full consensus per variant group
+                        representing all variation from pre-merge variants
+                        (gaps never win)
+  --disable-full-consensus
+                        Override --enable-full-consensus or profile setting
 Performance:
   --scale-threshold SCALE_THRESHOLD
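
Note on the `--enable-full-consensus` hunks in this file: the README describes majority voting per alignment column in which gaps never win and ties among bases become IUPAC codes. The sketch below shows one way to read that rule. Several points are assumptions that the diff does not confirm: each pre-merge variant casts one vote (no weighting by read count), inputs are already aligned and contain only `A`, `C`, `G`, `T` and `-`, and an all-gap column contributes nothing. The function name `full_consensus` is hypothetical.

```python
from collections import Counter
from typing import List

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they cover.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def full_consensus(aligned: List[str]) -> str:
    """Column-wise majority vote over aligned sequences; gaps are never counted."""
    out = []
    for column in zip(*aligned):
        counts = Counter(base for base in column if base != "-")
        if not counts:
            continue  # assumption: an all-gap column is skipped
        top = max(counts.values())
        winners = frozenset(base for base, n in counts.items() if n == top)
        out.append(IUPAC[winners])  # a single winner maps to itself, a tie to a code
    return "".join(out)


consensus = full_consensus(["ACG-A", "AC--G", "ATGT-"])
print(consensus)  # ACGTR
```

In the fourth column a single `T` beats two gaps, and in the last column the `A`/`G` tie becomes `R`.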

{speconsense-0.7.2 → speconsense-0.7.4}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "speconsense"
-version = "0.7.2"
+version = "0.7.4"
 description = "High-quality clustering and consensus generation for Oxford Nanopore amplicon reads"
 readme = "README.md"
 requires-python = ">=3.8"

{speconsense-0.7.2 → speconsense-0.7.4}/speconsense/__init__.py RENAMED

@@ -5,7 +5,7 @@ A Python tool for experimental clustering and consensus generation as an alterna
 in the fungal DNA barcoding pipeline.
 """
-__version__ = "0.7.2"
+__version__ = "0.7.4"
 __author__ = "Josh Walker"
 __email__ = "joshowalker@yahoo.com"

{speconsense-0.7.2 → speconsense-0.7.4}/speconsense/core/cli.py RENAMED

@@ -66,6 +66,9 @@ def main():
                                help="Disable position-based variant phasing (enabled by default). "
                                     "MCL graph clustering already separates most variants; this "
                                     "second pass analyzes MSA positions to phase remaining variants.")
+    phasing_group.add_argument("--enable-position-phasing", action="store_false",
+                               dest="disable_position_phasing",
+                               help="Override --disable-position-phasing or profile setting")
     phasing_group.add_argument("--min-variant-frequency", type=float, default=0.10,
                                help="Minimum alternative allele frequency to call variant (default: 0.10 for 10%%)")
     phasing_group.add_argument("--min-variant-count", type=int, default=5,
@@ -75,6 +78,9 @@ def main():
     ambiguity_group = parser.add_argument_group("Ambiguity Calling")
     ambiguity_group.add_argument("--disable-ambiguity-calling", action="store_true",
                                  help="Disable IUPAC ambiguity code calling for unphased variant positions")
+    ambiguity_group.add_argument("--enable-ambiguity-calling", action="store_false",
+                                 dest="disable_ambiguity_calling",
+                                 help="Override --disable-ambiguity-calling or profile setting")
     ambiguity_group.add_argument("--min-ambiguity-frequency", type=float, default=0.10,
                                  help="Minimum alternative allele frequency for IUPAC ambiguity calling (default: 0.10 for 10%%)")
     ambiguity_group.add_argument("--min-ambiguity-count", type=int, default=3,
@@ -84,8 +90,14 @@ def main():
     merging_group = parser.add_argument_group("Cluster Merging")
     merging_group.add_argument("--disable-cluster-merging", action="store_true",
                                help="Disable merging of clusters with identical consensus sequences")
+    merging_group.add_argument("--enable-cluster-merging", action="store_false",
+                               dest="disable_cluster_merging",
+                               help="Override --disable-cluster-merging or profile setting")
     merging_group.add_argument("--disable-homopolymer-equivalence", action="store_true",
                                help="Disable homopolymer equivalence in cluster merging (only merge identical sequences)")
+    merging_group.add_argument("--enable-homopolymer-equivalence", action="store_false",
+                               dest="disable_homopolymer_equivalence",
+                               help="Override --disable-homopolymer-equivalence or profile setting")
     # Orientation group
     orient_group = parser.add_argument_group("Orientation")
@@ -104,11 +116,17 @@ def main():
                                  "0=auto-detect, default=1 (safe for parallel workflows).")
     perf_group.add_argument("--enable-early-filter", action="store_true",
                             help="Enable early filtering to skip small clusters before variant phasing (improves performance for large datasets)")
+    perf_group.add_argument("--disable-early-filter", action="store_false",
+                            dest="enable_early_filter",
+                            help="Override --enable-early-filter or profile setting")
     # Debugging group
     debug_group = parser.add_argument_group("Debugging")
     debug_group.add_argument("--collect-discards", action="store_true",
                              help="Write discarded reads (outliers and filtered clusters) to cluster_debug/{sample}-discards.fastq")
+    debug_group.add_argument("--no-collect-discards", action="store_false",
+                             dest="collect_discards",
+                             help="Override --collect-discards or profile setting")
     debug_group.add_argument("--log-level", default="INFO",
                              choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"])

{speconsense-0.7.2 → speconsense-0.7.4}/speconsense/profiles/__init__.py RENAMED

@@ -103,6 +103,8 @@ VALID_SUMMARIZE_KEYS = {
     "select-max-groups",
     "select-max-variants",
     "select-strategy",
+    "select-min-size-ratio",
+    "enable-full-consensus",
     # Processing
     "scale-threshold",
     "threads",

speconsense-0.7.4/speconsense/profiles/compressed.yaml ADDED

@@ -0,0 +1,28 @@
+# Compress variants into minimal IUPAC consensus sequences
+#
+# Aggressively merges similar variants (including indels) into single
+# IUPAC consensus sequences. Only truly dissimilar sequences remain
+# separate. Uses 20% frequency thresholds throughout.
+#
+# Designed for workflows where reviewers want fewer sequences to
+# examine, with all variation represented via IUPAC ambiguity codes.
+# Partial overlap merging is disabled as a safety measure.
+#
+# Use with:
+#   speconsense input.fastq -p compressed
+#   speconsense-summarize -p compressed
+speconsense-version: "0.7.*"
+description: "Compress variants into minimal IUPAC consensus sequences"
+speconsense:
+  min-ambiguity-frequency: 0.20  # 20% threshold for IUPAC ambiguity calling
+  min-variant-frequency: 0.20   # 20% threshold for variant phasing
+speconsense-summarize:
+  merge-indel-length: 5         # Merge indels up to 5bp
+  merge-position-count: 10      # Allow up to 10 variant positions in a merge
+  merge-min-size-ratio: 0.2     # Match 20% calling threshold
+  select-min-size-ratio: 0.2    # Match 20% calling threshold
+  min-merge-overlap: 0          # Disable partial overlap merging
+  enable-full-consensus: true   # Include full IUPAC consensus per group
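
Note on the `compressed` profile: every key in its `speconsense-summarize` section also appears as a long option in the `speconsense-summarize` usage text earlier in this diff. Assuming profile keys map one-to-one onto those options (an assumption; the profile loader is not part of this diff), the sketch below expands the section into a command line to make the profile's effect easier to read. The helper `profile_to_argv` is hypothetical.

```python
from typing import Dict, List, Union

Value = Union[bool, int, float, str]


def profile_to_argv(section: Dict[str, Value]) -> List[str]:
    """Expand profile keys into long options; true booleans become bare flags."""
    argv: List[str] = []
    for key, value in section.items():
        if isinstance(value, bool):  # checked before numbers: bool is an int subclass
            if value:
                argv.append(f"--{key}")
            # assumption: a false boolean emits no flag
        else:
            argv.extend([f"--{key}", str(value)])
    return argv


summarize = {
    "merge-indel-length": 5,
    "merge-position-count": 10,
    "merge-min-size-ratio": 0.2,
    "select-min-size-ratio": 0.2,
    "min-merge-overlap": 0,
    "enable-full-consensus": True,
}
argv = profile_to_argv(summarize)
print(" ".join(["speconsense-summarize"] + argv))
```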

{speconsense-0.7.2 → speconsense-0.7.4}/speconsense/profiles/example.yaml RENAMED

@@ -91,6 +91,7 @@ speconsense-summarize:
   # select-max-groups: -1         # Max groups to output (-1 = no limit)
   # select-max-variants: -1       # Max variants per group (-1 = no limit)
   # select-strategy: size         # Selection strategy: size or diversity
+  # select-min-size-ratio: 0    # Min size ratio to include variant (0 = disabled)
   # --- Processing ---
   # threads: 0                    # Max threads (0 = auto-detect)
