PyPI - speconsense - Versions diffs - 0.7.3__tar.gz → 0.7.5__tar.gz - Mend

speconsense 0.7.3.tar.gz → 0.7.5.tar.gz


Files changed (55)

{speconsense-0.7.3/speconsense.egg-info → speconsense-0.7.5}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: speconsense
-Version: 0.7.3
+Version: 0.7.5
 Summary: High-quality clustering and consensus generation for Oxford Nanopore amplicon reads
 Author-email: Josh Walker <joshowalker@yahoo.com>
 License: BSD-3-Clause
@@ -171,7 +171,7 @@ speconsense input.fastq -p herbarium --min-size 10
 ```
 **Bundled profiles:**
-- `compressed` — Compress variants into minimal IUPAC consensus sequences (aggressive merging with indels, 20% thresholds, full consensus)
+- `compressed` — Compress variants into minimal IUPAC consensus sequences (aggressive merging with indels, 20% thresholds, full consensus, 20% selection size ratio)
 - `herbarium` — High-recall for degraded DNA/type specimens
 - `largedata` — Experimental settings for large input files
 - `nostalgia` — Simulate older bioinformatics pipelines
@@ -295,14 +295,14 @@ When using `speconsense-summarize` for post-processing, creates `__Summary__/` d
 |---------------|-------------|------------|-------------|
 | **Original** | Source `cluster_debug/` | `-c1`, `-c2`, `-c3` | Preserves speconsense clustering results |
 | **Summarization** | `__Summary__/`, `FASTQ Files/`, `variants/` | `-1.v1`, `-1.v2`, `-2.v1`, `.raw1` | Post-processing groups and variants |
-| **Full consensus** | `__Summary__/` | `-1.full` | IUPAC consensus from all pre-merge variants in a group |
+| **Full consensus** | `__Summary__/` | `-1.full` | IUPAC consensus from pre-merge components of surviving variants |
 ### Example Directory Structure
 ```
 __Summary__/
 ├── sample-1.v1-RiC45.fasta                  # Primary variant (group 1, merged)
 ├── sample-1.v2-RiC23.fasta                  # Additional variant (not merged)
-├── sample-1.full-RiC68.fasta                # Full IUPAC consensus for group 1 (all pre-merge variants)
+├── sample-1.full-RiC68.fasta                # Full IUPAC consensus for group 1 (surviving variants' components)
 ├── sample-2.v1-RiC30.fasta                  # Second organism group, primary variant
 ├── summary.fasta                            # All final consensus sequences (excludes .raw)
 ├── summary.txt                              # Statistics
@@ -678,6 +678,18 @@ speconsense-summarize --select-strategy diversity --select-max-variants 2
    - Output up to select_max_variants per group
 3. Final output contains representatives from all groups, ensuring both biological diversity (between groups) and appropriate sampling within each biological entity (within groups)
+**Selection Size Ratio Filtering:**
+```bash
+speconsense-summarize --select-min-size-ratio 0.2
+```
+- Filters out post-merge variants whose size is too small relative to the largest variant in their group
+- Ratio calculated as `variant_size / largest_size` — must be ≥ threshold to keep
+- Example: `--select-min-size-ratio 0.2` means a variant must have ≥20% the reads of the largest variant in its group
+- Default is 0 (disabled) — all post-merge variants pass through to selection
+- Applied after merging but before variant selection
+- Useful for suppressing noise variants that survived merging but are too small to be meaningful
+- Set to 0.2 in the `compressed` profile to match the 20% calling threshold theme
 This two-stage process ensures that distinct biological sequences are preserved as separate groups, while providing control over variant complexity within each group.
 ### Customizing FASTA Header Fields
@@ -817,7 +829,7 @@ For high-throughput workflows (e.g., 100K sequences/year), this prioritization e
 ```bash
 speconsense-summarize --enable-full-consensus
 ```
-- Generates a full IUPAC consensus sequence per variant group from all pre-merge variants
+- Generates a full IUPAC consensus sequence per variant group from pre-merge variants that contributed to surviving post-merge variants
 - Output named `{specimen}-{group}.full-RiC{reads}.fasta` in the `__Summary__/` directory
 - Uses majority voting across all variants in the group; **gaps never win** — at each alignment column, the most common non-gap base is chosen, with IUPAC codes for ties among bases
 - Useful when you want a single representative sequence that captures all variation within a group as IUPAC ambiguity codes
@@ -1059,9 +1071,10 @@ The complete speconsense-summarize workflow operates in this order:
 2. **HAC variant grouping** by sequence identity to separate dissimilar sequences (`--group-identity`); uses single-linkage when overlap merging is enabled
 3. **Group filtering** to limit output groups (`--select-max-groups`)
 4. **Homopolymer-aware MSA-based variant merging** within each group, including **overlap merging** for different-length sequences (`--disable-merging`, `--merge-effort`, `--merge-position-count`, `--merge-indel-length`, `--min-merge-overlap`, `--merge-snp`, `--merge-min-size-ratio`, `--disable-homopolymer-equivalence`)
-5. **Variant selection** within each group (`--select-max-variants`, `--select-strategy`)
-6. **Full consensus generation** (optional) — IUPAC consensus from all pre-merge variants per group (`--enable-full-consensus`)
-7. **Output generation** with customizable header fields (`--fasta-fields`) and full traceability
+5. **Selection size ratio filtering** to remove tiny post-merge variants (`--select-min-size-ratio`)
+6. **Variant selection** within each group (`--select-max-variants`, `--select-strategy`)
+7. **Full consensus generation** (optional) — IUPAC consensus from pre-merge components of surviving post-merge variants (`--enable-full-consensus`)
+8. **Output generation** with customizable header fields (`--fasta-fields`) and full traceability
 **Key architectural features**:
 - HAC grouping occurs BEFORE merging to prevent inappropriate merging of dissimilar sequences (e.g., contaminants with primary targets)
@@ -1273,6 +1286,7 @@ usage: speconsense-summarize [-h] [--source SOURCE]
                              [--select-max-groups SELECT_MAX_GROUPS]
                              [--select-max-variants SELECT_MAX_VARIANTS]
                              [--select-strategy {size,diversity}]
+                             [--select-min-size-ratio SELECT_MIN_SIZE_RATIO]
                              [--enable-full-consensus]
                              [--disable-full-consensus]
                              [--scale-threshold SCALE_THRESHOLD] [--threads N]
@@ -1357,6 +1371,10 @@ Selection:
   --select-strategy {size,diversity}, --variant-selection {size,diversity}
                         Variant selection strategy: size or diversity
                         (default: size)
+  --select-min-size-ratio SELECT_MIN_SIZE_RATIO
+                        Minimum size ratio (variant/largest) to include in
+                        output (default: 0 = disabled, e.g. 0.2 for 20%
+                        cutoff)
   --enable-full-consensus
                         Generate a full consensus per variant group
                         representing all variation from pre-merge variants
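
The size ratio rule described in the README hunk above can be sketched in isolation. This is a minimal illustration, not the package's code: the `Variant` dataclass and the `filter_by_size_ratio` name are hypothetical, and only the rule itself (`variant_size / largest_size` must be at least the threshold, with 0 meaning disabled) comes from the diff.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    """Hypothetical stand-in for a post-merge variant; only name and read count matter here."""
    name: str
    size: int


def filter_by_size_ratio(variants: List[Variant], min_ratio: float) -> List[Variant]:
    """Keep variants whose size is at least min_ratio times the largest size in the group."""
    # A ratio of 0 disables the filter; a single-variant group has nothing to compare against.
    if min_ratio <= 0 or len(variants) <= 1:
        return list(variants)
    largest = max(v.size for v in variants)
    return [v for v in variants if v.size / largest >= min_ratio]


group = [Variant("v1", 100), Variant("v2", 25), Variant("v3", 5)]
kept = filter_by_size_ratio(group, 0.2)
print([v.name for v in kept])  # ['v1', 'v2']
```

With a threshold of 0.2, the 25-read variant (ratio 0.25) is kept and the 5-read variant (ratio 0.05) is dropped. The largest variant always survives because its ratio is 1.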

{speconsense-0.7.3 → speconsense-0.7.5}/README.md RENAMED

@@ -136,7 +136,7 @@ speconsense input.fastq -p herbarium --min-size 10
 ```
 **Bundled profiles:**
-- `compressed` — Compress variants into minimal IUPAC consensus sequences (aggressive merging with indels, 20% thresholds, full consensus)
+- `compressed` — Compress variants into minimal IUPAC consensus sequences (aggressive merging with indels, 20% thresholds, full consensus, 20% selection size ratio)
 - `herbarium` — High-recall for degraded DNA/type specimens
 - `largedata` — Experimental settings for large input files
 - `nostalgia` — Simulate older bioinformatics pipelines
@@ -260,14 +260,14 @@ When using `speconsense-summarize` for post-processing, creates `__Summary__/` d
 |---------------|-------------|------------|-------------|
 | **Original** | Source `cluster_debug/` | `-c1`, `-c2`, `-c3` | Preserves speconsense clustering results |
 | **Summarization** | `__Summary__/`, `FASTQ Files/`, `variants/` | `-1.v1`, `-1.v2`, `-2.v1`, `.raw1` | Post-processing groups and variants |
-| **Full consensus** | `__Summary__/` | `-1.full` | IUPAC consensus from all pre-merge variants in a group |
+| **Full consensus** | `__Summary__/` | `-1.full` | IUPAC consensus from pre-merge components of surviving variants |
 ### Example Directory Structure
 ```
 __Summary__/
 ├── sample-1.v1-RiC45.fasta                  # Primary variant (group 1, merged)
 ├── sample-1.v2-RiC23.fasta                  # Additional variant (not merged)
-├── sample-1.full-RiC68.fasta                # Full IUPAC consensus for group 1 (all pre-merge variants)
+├── sample-1.full-RiC68.fasta                # Full IUPAC consensus for group 1 (surviving variants' components)
 ├── sample-2.v1-RiC30.fasta                  # Second organism group, primary variant
 ├── summary.fasta                            # All final consensus sequences (excludes .raw)
 ├── summary.txt                              # Statistics
@@ -643,6 +643,18 @@ speconsense-summarize --select-strategy diversity --select-max-variants 2
    - Output up to select_max_variants per group
 3. Final output contains representatives from all groups, ensuring both biological diversity (between groups) and appropriate sampling within each biological entity (within groups)
+**Selection Size Ratio Filtering:**
+```bash
+speconsense-summarize --select-min-size-ratio 0.2
+```
+- Filters out post-merge variants whose size is too small relative to the largest variant in their group
+- Ratio calculated as `variant_size / largest_size` — must be ≥ threshold to keep
+- Example: `--select-min-size-ratio 0.2` means a variant must have ≥20% the reads of the largest variant in its group
+- Default is 0 (disabled) — all post-merge variants pass through to selection
+- Applied after merging but before variant selection
+- Useful for suppressing noise variants that survived merging but are too small to be meaningful
+- Set to 0.2 in the `compressed` profile to match the 20% calling threshold theme
 This two-stage process ensures that distinct biological sequences are preserved as separate groups, while providing control over variant complexity within each group.
 ### Customizing FASTA Header Fields
@@ -782,7 +794,7 @@ For high-throughput workflows (e.g., 100K sequences/year), this prioritization e
 ```bash
 speconsense-summarize --enable-full-consensus
 ```
-- Generates a full IUPAC consensus sequence per variant group from all pre-merge variants
+- Generates a full IUPAC consensus sequence per variant group from pre-merge variants that contributed to surviving post-merge variants
 - Output named `{specimen}-{group}.full-RiC{reads}.fasta` in the `__Summary__/` directory
 - Uses majority voting across all variants in the group; **gaps never win** — at each alignment column, the most common non-gap base is chosen, with IUPAC codes for ties among bases
 - Useful when you want a single representative sequence that captures all variation within a group as IUPAC ambiguity codes
@@ -1024,9 +1036,10 @@ The complete speconsense-summarize workflow operates in this order:
 2. **HAC variant grouping** by sequence identity to separate dissimilar sequences (`--group-identity`); uses single-linkage when overlap merging is enabled
 3. **Group filtering** to limit output groups (`--select-max-groups`)
 4. **Homopolymer-aware MSA-based variant merging** within each group, including **overlap merging** for different-length sequences (`--disable-merging`, `--merge-effort`, `--merge-position-count`, `--merge-indel-length`, `--min-merge-overlap`, `--merge-snp`, `--merge-min-size-ratio`, `--disable-homopolymer-equivalence`)
-5. **Variant selection** within each group (`--select-max-variants`, `--select-strategy`)
-6. **Full consensus generation** (optional) — IUPAC consensus from all pre-merge variants per group (`--enable-full-consensus`)
-7. **Output generation** with customizable header fields (`--fasta-fields`) and full traceability
+5. **Selection size ratio filtering** to remove tiny post-merge variants (`--select-min-size-ratio`)
+6. **Variant selection** within each group (`--select-max-variants`, `--select-strategy`)
+7. **Full consensus generation** (optional) — IUPAC consensus from pre-merge components of surviving post-merge variants (`--enable-full-consensus`)
+8. **Output generation** with customizable header fields (`--fasta-fields`) and full traceability
 **Key architectural features**:
 - HAC grouping occurs BEFORE merging to prevent inappropriate merging of dissimilar sequences (e.g., contaminants with primary targets)
@@ -1238,6 +1251,7 @@ usage: speconsense-summarize [-h] [--source SOURCE]
                              [--select-max-groups SELECT_MAX_GROUPS]
                              [--select-max-variants SELECT_MAX_VARIANTS]
                              [--select-strategy {size,diversity}]
+                             [--select-min-size-ratio SELECT_MIN_SIZE_RATIO]
                              [--enable-full-consensus]
                              [--disable-full-consensus]
                              [--scale-threshold SCALE_THRESHOLD] [--threads N]
@@ -1322,6 +1336,10 @@ Selection:
   --select-strategy {size,diversity}, --variant-selection {size,diversity}
                         Variant selection strategy: size or diversity
                         (default: size)
+  --select-min-size-ratio SELECT_MIN_SIZE_RATIO
+                        Minimum size ratio (variant/largest) to include in
+                        output (default: 0 = disabled, e.g. 0.2 for 20%
+                        cutoff)
   --enable-full-consensus
                         Generate a full consensus per variant group
                         representing all variation from pre-merge variants
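
The "gaps never win" voting rule quoted in the README hunk above can be illustrated with a small column-wise consensus. This is a sketch under stated assumptions, not the package's implementation: the function names are hypothetical, each variant is assumed to cast one unweighted vote, the inputs are assumed to be pre-aligned and equal in length, and any symbol set missing from the table falls back to `N`.

```python
from collections import Counter
from typing import List

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus_column(column: str) -> str:
    """Consensus for one alignment column; gaps only appear if the column is all gaps."""
    counts = Counter(base for base in column.upper() if base != "-")
    if not counts:
        return "-"  # nothing but gaps in this column
    top = max(counts.values())
    tied = frozenset(base for base, n in counts.items() if n == top)
    return IUPAC.get(tied, "N")  # assumption: unknown symbol sets collapse to N


def full_consensus(aligned: List[str]) -> str:
    """Column-wise consensus over equal-length aligned sequences, dropping all-gap columns."""
    columns = ("".join(col) for col in zip(*aligned))
    return "".join(consensus_column(c) for c in columns).replace("-", "")


aligned = ["ATCGATCGATCGATCGATCGATCG", "ATCGATCGATCAATCGATCGATCG"]
print(full_consensus(aligned))                # ATCGATCGATCRATCGATCGATCG
print(full_consensus(["A-G", "ACG", "A-G"]))  # ACG
```

In the first call, the two sequences disagree at one column (G versus A), so the tie becomes the IUPAC code `R`. In the second, the single `C` beats two gaps because gaps are excluded from the vote.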

{speconsense-0.7.3 → speconsense-0.7.5}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "speconsense"
-version = "0.7.3"
+version = "0.7.5"
 description = "High-quality clustering and consensus generation for Oxford Nanopore amplicon reads"
 readme = "README.md"
 requires-python = ">=3.8"

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/__init__.py RENAMED

@@ -5,7 +5,7 @@ A Python tool for experimental clustering and consensus generation as an alterna
 in the fungal DNA barcoding pipeline.
 """
-__version__ = "0.7.3"
+__version__ = "0.7.5"
 __author__ = "Josh Walker"
 __email__ = "joshowalker@yahoo.com"

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/profiles/__init__.py RENAMED

@@ -103,6 +103,7 @@ VALID_SUMMARIZE_KEYS = {
     "select-max-groups",
     "select-max-variants",
     "select-strategy",
+    "select-min-size-ratio",
     "enable-full-consensus",
     # Processing
     "scale-threshold",

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/profiles/compressed.yaml RENAMED

@@ -23,5 +23,6 @@ speconsense-summarize:
   merge-indel-length: 5         # Merge indels up to 5bp
   merge-position-count: 10      # Allow up to 10 variant positions in a merge
   merge-min-size-ratio: 0.2     # Match 20% calling threshold
+  select-min-size-ratio: 0.2    # Match 20% calling threshold
   min-merge-overlap: 0          # Disable partial overlap merging
   enable-full-consensus: true   # Include full IUPAC consensus per group
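
Profile keys such as `select-min-size-ratio` are hyphenated, while argparse stores `--select-min-size-ratio` under the attribute `select_min_size_ratio`. The sketch below shows one way a profile value could be installed as a default that an explicit flag still overrides. How speconsense actually applies profiles is not shown in this diff, so the `profile` dict, the parser name, and the use of `set_defaults` are all assumptions for illustration.

```python
import argparse

# Hypothetical profile contents, mirroring one key from the YAML above.
profile = {"select-min-size-ratio": 0.2}

parser = argparse.ArgumentParser(prog="summarize-sketch")
# "%%" is needed because argparse applies %-formatting to help strings.
parser.add_argument("--select-min-size-ratio", type=float, default=0,
                    help="Minimum size ratio (default: 0 = disabled, e.g. 0.2 for 20%% cutoff)")

# argparse derives the dest "select_min_size_ratio" from the flag name,
# so hyphenated profile keys are converted before being installed as defaults.
parser.set_defaults(**{key.replace("-", "_"): value for key, value in profile.items()})

from_profile = parser.parse_args([])
overridden = parser.parse_args(["--select-min-size-ratio", "0.5"])
print(from_profile.select_min_size_ratio)  # 0.2
print(overridden.select_min_size_ratio)    # 0.5
```

Parser-level defaults set through `set_defaults` take precedence over the `default=` given to `add_argument`, and a value passed on the command line takes precedence over both.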

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/profiles/example.yaml RENAMED

@@ -91,6 +91,7 @@ speconsense-summarize:
   # select-max-groups: -1         # Max groups to output (-1 = no limit)
   # select-max-variants: -1       # Max variants per group (-1 = no limit)
   # select-strategy: size         # Selection strategy: size or diversity
+  # select-min-size-ratio: 0    # Min size ratio to include variant (0 = disabled)
   # --- Processing ---
   # threads: 0                    # Max threads (0 = auto-detect)

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/summarize/cli.py RENAMED

@@ -169,6 +169,9 @@ def parse_arguments():
     selection_group.add_argument("--select-strategy", "--variant-selection",
                                  dest="select_strategy", choices=["size", "diversity"], default="size",
                                  help="Variant selection strategy: size or diversity (default: size)")
+    selection_group.add_argument("--select-min-size-ratio", type=float, default=0,
+                                 help="Minimum size ratio (variant/largest) to include in output "
+                                      "(default: 0 = disabled, e.g. 0.2 for 20%% cutoff)")
     selection_group.add_argument("--enable-full-consensus", action="store_true",
                                  help="Generate a full consensus per variant group representing all variation "
                                       "from pre-merge variants (gaps never win)")
@@ -359,6 +362,18 @@ def process_single_specimen(file_consensuses: List[ConsensusInfo],
     for group_idx, (group_id, group_members) in enumerate(sorted_groups):
         final_group_name = group_idx + 1
+        # Apply select-min-size-ratio filter
+        if args.select_min_size_ratio > 0 and len(group_members) > 1:
+            largest_size = max(v.size for v in group_members)
+            filtered = [v for v in group_members
+                        if (v.size / largest_size) >= args.select_min_size_ratio]
+            if len(filtered) < len(group_members):
+                filtered_count = len(group_members) - len(filtered)
+                logging.debug(f"Group {group_idx + 1}: filtered out {filtered_count} "
+                              f"variants with size ratio < {args.select_min_size_ratio} "
+                              f"relative to largest (size={largest_size})")
+                group_members = filtered
         # Select variants for this group
         selected_variants = select_variants(group_members, args.select_max_variants, args.select_strategy, group_number=final_group_name)
@@ -377,9 +392,20 @@ def process_single_specimen(file_consensuses: List[ConsensusInfo],
             final_consensus.append(renamed_variant)
             group_naming.append((variant.sample_name, new_name))
-        # Generate full consensus from PRE-MERGE variants
+        # Generate full consensus from PRE-MERGE variants that contributed
+        # to surviving post-merge variants (after select-min-size-ratio)
         if getattr(args, 'enable_full_consensus', False):
-            pre_merge_variants = variant_groups[group_id]
+            # Collect original cluster names from surviving post-merge variants
+            surviving_originals = set()
+            for v in group_members:
+                if v.sample_name in all_merge_traceability:
+                    surviving_originals.update(all_merge_traceability[v.sample_name])
+                else:
+                    surviving_originals.add(v.sample_name)
+            pre_merge_variants = [v for v in variant_groups[group_id]
+                                  if v.sample_name in surviving_originals]
             specimen_base = selected_variants[0].sample_name.rsplit('-c', 1)[0]
             full_name = f"{specimen_base}-{group_idx + 1}.full"
@@ -450,6 +476,7 @@ def main():
     logging.info(f"  --select-max-variants: {args.select_max_variants}")
     logging.info(f"  --select-max-groups: {args.select_max_groups}")
     logging.info(f"  --select-strategy: {args.select_strategy}")
+    logging.info(f"  --select-min-size-ratio: {args.select_min_size_ratio}")
     logging.info(f"  --enable-full-consensus: {args.enable_full_consensus}")
     logging.info(f"  --log-level: {args.log_level}")
     logging.info("")
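
The full consensus change in the hunk above narrows the pre-merge inputs to those that fed into a surviving post-merge variant. The lookup can be sketched on plain strings. The function name and sample names here are hypothetical, and the shape of the traceability mapping (merged variant name to a list of original cluster names) is inferred from how `all_merge_traceability` is used in the diff rather than confirmed.

```python
from typing import Dict, List, Set


def surviving_pre_merge(
    pre_merge: List[str],
    post_merge_survivors: List[str],
    merge_traceability: Dict[str, List[str]],
) -> List[str]:
    """Return the pre-merge names that fed into any surviving post-merge variant."""
    originals: Set[str] = set()
    for name in post_merge_survivors:
        # A merged variant expands to its components; an unmerged one stands for itself.
        originals.update(merge_traceability.get(name, [name]))
    # Preserve the original ordering of the pre-merge list.
    return [name for name in pre_merge if name in originals]


pre_merge = ["s-c1", "s-c2", "s-c3", "s-c4"]
traceability = {"s-c1": ["s-c1", "s-c2"]}  # s-c1 absorbed s-c2 during merging
survivors = ["s-c1", "s-c3"]               # s-c4 was dropped by the size ratio filter
print(surviving_pre_merge(pre_merge, survivors, traceability))
# ['s-c1', 's-c2', 's-c3']
```

Here `s-c2` still contributes to the full consensus because the variant that absorbed it survived, while `s-c4` is excluded because it was filtered out after merging.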

{speconsense-0.7.3 → speconsense-0.7.5/speconsense.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: speconsense
-Version: 0.7.3
+Version: 0.7.5
 Summary: High-quality clustering and consensus generation for Oxford Nanopore amplicon reads
 Author-email: Josh Walker <joshowalker@yahoo.com>
 License: BSD-3-Clause
@@ -171,7 +171,7 @@ speconsense input.fastq -p herbarium --min-size 10
 ```
 **Bundled profiles:**
-- `compressed` — Compress variants into minimal IUPAC consensus sequences (aggressive merging with indels, 20% thresholds, full consensus)
+- `compressed` — Compress variants into minimal IUPAC consensus sequences (aggressive merging with indels, 20% thresholds, full consensus, 20% selection size ratio)
 - `herbarium` — High-recall for degraded DNA/type specimens
 - `largedata` — Experimental settings for large input files
 - `nostalgia` — Simulate older bioinformatics pipelines
@@ -295,14 +295,14 @@ When using `speconsense-summarize` for post-processing, creates `__Summary__/` d
 |---------------|-------------|------------|-------------|
 | **Original** | Source `cluster_debug/` | `-c1`, `-c2`, `-c3` | Preserves speconsense clustering results |
 | **Summarization** | `__Summary__/`, `FASTQ Files/`, `variants/` | `-1.v1`, `-1.v2`, `-2.v1`, `.raw1` | Post-processing groups and variants |
-| **Full consensus** | `__Summary__/` | `-1.full` | IUPAC consensus from all pre-merge variants in a group |
+| **Full consensus** | `__Summary__/` | `-1.full` | IUPAC consensus from pre-merge components of surviving variants |
 ### Example Directory Structure
 ```
 __Summary__/
 ├── sample-1.v1-RiC45.fasta                  # Primary variant (group 1, merged)
 ├── sample-1.v2-RiC23.fasta                  # Additional variant (not merged)
-├── sample-1.full-RiC68.fasta                # Full IUPAC consensus for group 1 (all pre-merge variants)
+├── sample-1.full-RiC68.fasta                # Full IUPAC consensus for group 1 (surviving variants' components)
 ├── sample-2.v1-RiC30.fasta                  # Second organism group, primary variant
 ├── summary.fasta                            # All final consensus sequences (excludes .raw)
 ├── summary.txt                              # Statistics
@@ -678,6 +678,18 @@ speconsense-summarize --select-strategy diversity --select-max-variants 2
    - Output up to select_max_variants per group
 3. Final output contains representatives from all groups, ensuring both biological diversity (between groups) and appropriate sampling within each biological entity (within groups)
+**Selection Size Ratio Filtering:**
+```bash
+speconsense-summarize --select-min-size-ratio 0.2
+```
+- Filters out post-merge variants whose size is too small relative to the largest variant in their group
+- Ratio calculated as `variant_size / largest_size` — must be ≥ threshold to keep
+- Example: `--select-min-size-ratio 0.2` means a variant must have ≥20% the reads of the largest variant in its group
+- Default is 0 (disabled) — all post-merge variants pass through to selection
+- Applied after merging but before variant selection
+- Useful for suppressing noise variants that survived merging but are too small to be meaningful
+- Set to 0.2 in the `compressed` profile to match the 20% calling threshold theme
 This two-stage process ensures that distinct biological sequences are preserved as separate groups, while providing control over variant complexity within each group.
 ### Customizing FASTA Header Fields
@@ -817,7 +829,7 @@ For high-throughput workflows (e.g., 100K sequences/year), this prioritization e
 ```bash
 speconsense-summarize --enable-full-consensus
 ```
-- Generates a full IUPAC consensus sequence per variant group from all pre-merge variants
+- Generates a full IUPAC consensus sequence per variant group from pre-merge variants that contributed to surviving post-merge variants
 - Output named `{specimen}-{group}.full-RiC{reads}.fasta` in the `__Summary__/` directory
 - Uses majority voting across all variants in the group; **gaps never win** — at each alignment column, the most common non-gap base is chosen, with IUPAC codes for ties among bases
 - Useful when you want a single representative sequence that captures all variation within a group as IUPAC ambiguity codes
@@ -1059,9 +1071,10 @@ The complete speconsense-summarize workflow operates in this order:
 2. **HAC variant grouping** by sequence identity to separate dissimilar sequences (`--group-identity`); uses single-linkage when overlap merging is enabled
 3. **Group filtering** to limit output groups (`--select-max-groups`)
 4. **Homopolymer-aware MSA-based variant merging** within each group, including **overlap merging** for different-length sequences (`--disable-merging`, `--merge-effort`, `--merge-position-count`, `--merge-indel-length`, `--min-merge-overlap`, `--merge-snp`, `--merge-min-size-ratio`, `--disable-homopolymer-equivalence`)
-5. **Variant selection** within each group (`--select-max-variants`, `--select-strategy`)
-6. **Full consensus generation** (optional) — IUPAC consensus from all pre-merge variants per group (`--enable-full-consensus`)
-7. **Output generation** with customizable header fields (`--fasta-fields`) and full traceability
+5. **Selection size ratio filtering** to remove tiny post-merge variants (`--select-min-size-ratio`)
+6. **Variant selection** within each group (`--select-max-variants`, `--select-strategy`)
+7. **Full consensus generation** (optional) — IUPAC consensus from pre-merge components of surviving post-merge variants (`--enable-full-consensus`)
+8. **Output generation** with customizable header fields (`--fasta-fields`) and full traceability
 **Key architectural features**:
 - HAC grouping occurs BEFORE merging to prevent inappropriate merging of dissimilar sequences (e.g., contaminants with primary targets)
@@ -1273,6 +1286,7 @@ usage: speconsense-summarize [-h] [--source SOURCE]
                              [--select-max-groups SELECT_MAX_GROUPS]
                              [--select-max-variants SELECT_MAX_VARIANTS]
                              [--select-strategy {size,diversity}]
+                             [--select-min-size-ratio SELECT_MIN_SIZE_RATIO]
                              [--enable-full-consensus]
                              [--disable-full-consensus]
                              [--scale-threshold SCALE_THRESHOLD] [--threads N]
@@ -1357,6 +1371,10 @@ Selection:
   --select-strategy {size,diversity}, --variant-selection {size,diversity}
                         Variant selection strategy: size or diversity
                         (default: size)
+  --select-min-size-ratio SELECT_MIN_SIZE_RATIO
+                        Minimum size ratio (variant/largest) to include in
+                        output (default: 0 = disabled, e.g. 0.2 for 20%
+                        cutoff)
   --enable-full-consensus
                         Generate a full consensus per variant group
                         representing all variation from pre-merge variants

{speconsense-0.7.3 → speconsense-0.7.5}/tests/test_summarize.py RENAMED

@@ -520,6 +520,114 @@ class TestFullConsensus:
         assert result.rid == 0.95
+    def test_full_consensus_filters_small_variants(self):
+        """Integration test: select_min_size_ratio filters small variants from full consensus."""
+        temp_dir = tempfile.mkdtemp()
+        source_dir = os.path.join(temp_dir, "clusters")
+        summary_dir = os.path.join(temp_dir, "__Summary__")
+        os.makedirs(source_dir)
+        try:
+            # Two similar sequences (1 SNP at position 12: G vs A)
+            # Very different sizes so the small one is filtered by select_min_size_ratio
+            seq_large = "ATCGATCGATCGATCGATCGATCG"  # G at position 12
+            seq_small = "ATCGATCGATCAATCGATCGATCG"  # A at position 12
+            fasta_content = f""">test-c1 size=100 ric=100 primers=test
+{seq_large}
+>test-c2 size=5 ric=5 primers=test
+{seq_small}
+"""
+            fasta_file = os.path.join(source_dir, "test-all.fasta")
+            with open(fasta_file, 'w') as f:
+                f.write(fasta_content)
+            # select-min-size-ratio 0.1 filters 5/100=0.05 post-merge variant,
+            # so its pre-merge components are excluded from .full consensus
+            result = subprocess.run(
+                [
+                    "speconsense-summarize",
+                    "--source", source_dir,
+                    "--summary-dir", summary_dir,
+                    "--min-ric", "3",
+                    "--enable-full-consensus",
+                    "--select-min-size-ratio", "0.1",
+                    "--disable-merging",
+                    "--min-merge-overlap", "0",
+                ],
+                capture_output=True,
+                text=True
+            )
+            assert result.returncode == 0, f"speconsense-summarize failed: {result.stderr}"
+            output_fasta = os.path.join(summary_dir, "summary.fasta")
+            output_sequences = list(SeqIO.parse(output_fasta, "fasta"))
+            full_seqs = [s for s in output_sequences if '.full' in s.id]
+            assert len(full_seqs) == 1, f"Expected 1 .full sequence, got {len(full_seqs)}"
+            # Small variant was filtered — .full should be the large variant only (no IUPAC)
+            full_seq_str = str(full_seqs[0].seq)
+            assert full_seq_str == seq_large, \
+                f"Expected large variant sequence, got {full_seq_str}"
+        finally:
+            shutil.rmtree(temp_dir)
+    def test_full_consensus_no_filter_when_all_survive(self):
+        """Integration test: all post-merge variants surviving means all contribute to .full."""
+        temp_dir = tempfile.mkdtemp()
+        source_dir = os.path.join(temp_dir, "clusters")
+        summary_dir = os.path.join(temp_dir, "__Summary__")
+        os.makedirs(source_dir)
+        try:
+            # Same sequences as above — 1 SNP at position 12 (G vs A)
+            seq_large = "ATCGATCGATCGATCGATCGATCG"
+            seq_small = "ATCGATCGATCAATCGATCGATCG"
+            fasta_content = f""">test-c1 size=100 ric=100 primers=test
+{seq_large}
+>test-c2 size=5 ric=5 primers=test
+{seq_small}
+"""
+            fasta_file = os.path.join(source_dir, "test-all.fasta")
+            with open(fasta_file, 'w') as f:
+                f.write(fasta_content)
+            # No select-min-size-ratio — both variants survive, both contribute to .full
+            result = subprocess.run(
+                [
+                    "speconsense-summarize",
+                    "--source", source_dir,
+                    "--summary-dir", summary_dir,
+                    "--min-ric", "3",
+                    "--enable-full-consensus",
+                    "--disable-merging",
+                    "--min-merge-overlap", "0",
+                ],
+                capture_output=True,
+                text=True
+            )
+            assert result.returncode == 0, f"speconsense-summarize failed: {result.stderr}"
+            output_fasta = os.path.join(summary_dir, "summary.fasta")
+            output_sequences = list(SeqIO.parse(output_fasta, "fasta"))
+            full_seqs = [s for s in output_sequences if '.full' in s.id]
+            assert len(full_seqs) == 1, f"Expected 1 .full sequence, got {len(full_seqs)}"
+            # Both variants contribute — SNP position should be IUPAC R (A/G)
+            full_seq_str = str(full_seqs[0].seq)
+            assert "R" in full_seq_str, \
+                f"Expected IUPAC R (A/G) in full consensus, got {full_seq_str}"
+        finally:
+            shutil.rmtree(temp_dir)
 class TestFieldRegexFullConsensus:
     """Tests for GroupField and VariantField regex handling of .full names."""
@@ -578,6 +686,101 @@ class TestFieldRegexFullConsensus:
         assert field.format_value(cons) == "variant=v1"
+class TestSelectMinSizeRatio:
+    """Tests for --select-min-size-ratio filtering."""
+    def test_select_min_size_ratio_filters_small_variants(self):
+        """Integration test: --select-min-size-ratio 0.1 filters out tiny variants."""
+        temp_dir = tempfile.mkdtemp()
+        source_dir = os.path.join(temp_dir, "clusters")
+        summary_dir = os.path.join(temp_dir, "__Summary__")
+        os.makedirs(source_dir)
+        try:
+            seq1 = "ATCGATCGATCGATCGATCGATCG"
+            seq2 = "ATCGATCGATCAATCGATCGATCG"  # One SNP — different enough to not merge
+            fasta_content = f""">test-c1 size=100 ric=100 primers=test
+{seq1}
+>test-c2 size=3 ric=3 primers=test
+{seq2}
+"""
+            fasta_file = os.path.join(source_dir, "test-all.fasta")
+            with open(fasta_file, 'w') as f:
+                f.write(fasta_content)
+            result = subprocess.run(
+                [
+                    "speconsense-summarize",
+                    "--source", source_dir,
+                    "--summary-dir", summary_dir,
+                    "--min-ric", "3",
+                    "--select-min-size-ratio", "0.1",
+                    "--disable-merging",
+                ],
+                capture_output=True,
+                text=True
+            )
+            assert result.returncode == 0, f"speconsense-summarize failed: {result.stderr}"
+            output_fasta = os.path.join(summary_dir, "summary.fasta")
+            output_sequences = list(SeqIO.parse(output_fasta, "fasta"))
+            # Only the large variant should remain (3/100 = 0.03 < 0.1)
+            assert len(output_sequences) == 1, \
+                f"Expected 1 sequence after filtering, got {len(output_sequences)}"
+            assert "size=100" in output_sequences[0].description
+        finally:
+            shutil.rmtree(temp_dir)
+    def test_select_min_size_ratio_disabled_preserves_all(self):
+        """Integration test: --select-min-size-ratio 0 preserves all variants."""
+        temp_dir = tempfile.mkdtemp()
+        source_dir = os.path.join(temp_dir, "clusters")
+        summary_dir = os.path.join(temp_dir, "__Summary__")
+        os.makedirs(source_dir)
+        try:
+            seq1 = "ATCGATCGATCGATCGATCGATCG"
+            seq2 = "ATCGATCGATCAATCGATCGATCG"  # One SNP
+            fasta_content = f""">test-c1 size=100 ric=100 primers=test
+{seq1}
+>test-c2 size=3 ric=3 primers=test
+{seq2}
+"""
+            fasta_file = os.path.join(source_dir, "test-all.fasta")
+            with open(fasta_file, 'w') as f:
+                f.write(fasta_content)
+            result = subprocess.run(
+                [
+                    "speconsense-summarize",
+                    "--source", source_dir,
+                    "--summary-dir", summary_dir,
+                    "--min-ric", "3",
+                    "--select-min-size-ratio", "0",
+                    "--disable-merging",
+                ],
+                capture_output=True,
+                text=True
+            )
+            assert result.returncode == 0, f"speconsense-summarize failed: {result.stderr}"
+            output_fasta = os.path.join(summary_dir, "summary.fasta")
+            output_sequences = list(SeqIO.parse(output_fasta, "fasta"))
+            # Both variants should be preserved
+            assert len(output_sequences) == 2, \
+                f"Expected 2 sequences with ratio=0, got {len(output_sequences)}"
+        finally:
+            shutil.rmtree(temp_dir)
 class TestFullConsensusIntegration:
     """Integration test for --enable-full-consensus."""

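The `TestSelectMinSizeRatio` tests above pin down the observable behaviour of `--select-min-size-ratio`: with a ratio of 0.1, a variant of size 3 next to one of size 100 is dropped (3/100 = 0.03 < 0.1), and a ratio of 0 keeps everything. A rough sketch of a filter consistent with those two cases is shown below. It assumes the ratio is taken against the largest variant and that 0 disables filtering; `Variant` and `filter_by_size_ratio` are hypothetical names, not part of speconsense, and the real selection logic may differ.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical stand-in for a cluster record; not a speconsense type."""
    name: str
    size: int


def filter_by_size_ratio(variants, min_ratio):
    """Keep variants whose size is at least min_ratio times the largest size.

    Assumption: a ratio of 0 (or less) disables filtering entirely.
    """
    if min_ratio <= 0 or not variants:
        return list(variants)
    largest = max(v.size for v in variants)
    return [v for v in variants if v.size / largest >= min_ratio]


# Same sizes as the FASTA headers in the tests above.
variants = [Variant("test-c1", 100), Variant("test-c2", 3)]
kept = filter_by_size_ratio(variants, 0.1)   # drops test-c2
kept_all = filter_by_size_ratio(variants, 0)  # filtering disabled
```

Measuring against the largest variant rather than the total read count is a design assumption here; it matches the arithmetic in the test comment but is not confirmed by the diff itself.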
{speconsense-0.7.3 → speconsense-0.7.5}/LICENSE RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/setup.cfg RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/cli.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/core/__init__.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/core/__main__.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/core/cli.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/core/clusterer.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/core/workers.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/msa.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/profiles/herbarium.yaml RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/profiles/largedata.yaml RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/profiles/nostalgia.yaml RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/profiles/strict.yaml RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/quality_report.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/scalability/__init__.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/scalability/base.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/scalability/config.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/scalability/vsearch.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/summarize/__init__.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/summarize/__main__.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/summarize/analysis.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/summarize/clustering.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/summarize/fields.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/summarize/io.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/summarize/iupac.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/summarize/merging.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/synth.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense/types.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense.egg-info/SOURCES.txt RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense.egg-info/dependency_links.txt RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense.egg-info/entry_points.txt RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense.egg-info/requires.txt RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/speconsense.egg-info/top_level.txt RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/tests/test_ambiguity_calling.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/tests/test_augment_input.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/tests/test_complement_flags.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/tests/test_fields.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/tests/test_haplotype_filtering.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/tests/test_orientation.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/tests/test_overlap_merge.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/tests/test_overlap_merge_integration.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/tests/test_profiles.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/tests/test_regression.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/tests/test_synth.py RENAMED Viewed

File without changes

{speconsense-0.7.3 → speconsense-0.7.5}/tests/test_variant_phasing.py RENAMED Viewed

File without changes
