uht-tooling 0.1.4.tar.gz → 0.1.6.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/PKG-INFO +14 -4
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/README.md +12 -3
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/pyproject.toml +2 -1
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/src/uht_tooling/workflows/gui.py +4 -3
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/src/uht_tooling/workflows/mut_rate.py +203 -106
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/src/uht_tooling.egg-info/PKG-INFO +14 -4
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/src/uht_tooling.egg-info/requires.txt +1 -0
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/setup.cfg +0 -0
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/src/uht_tooling/__init__.py +0 -0
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/src/uht_tooling/cli.py +0 -0
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/src/uht_tooling/models/__init__.py +0 -0
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/src/uht_tooling/workflows/__init__.py +0 -0
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/src/uht_tooling/workflows/design_gibson.py +0 -0
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/src/uht_tooling/workflows/design_slim.py +0 -0
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/src/uht_tooling/workflows/mutation_caller.py +0 -0
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/src/uht_tooling/workflows/nextera_designer.py +0 -0
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/src/uht_tooling/workflows/profile_inserts.py +0 -0
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/src/uht_tooling/workflows/umi_hunter.py +0 -0
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/src/uht_tooling.egg-info/SOURCES.txt +0 -0
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/src/uht_tooling.egg-info/dependency_links.txt +0 -0
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/src/uht_tooling.egg-info/entry_points.txt +0 -0
- {uht_tooling-0.1.4 → uht_tooling-0.1.6}/src/uht_tooling.egg-info/top_level.txt +0 -0
--- uht_tooling-0.1.4/PKG-INFO
+++ uht_tooling-0.1.6/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: uht-tooling
-Version: 0.1.4
+Version: 0.1.6
 Summary: Tooling for ultra-high throughput screening workflows.
 Author: Matt115A
 License: MIT
@@ -18,6 +18,7 @@ Requires-Dist: seaborn==0.13.2
 Requires-Dist: tabulate==0.9.0
 Requires-Dist: tqdm==4.67.1
 Requires-Dist: typer==0.20.0
+Requires-Dist: mappy==2.30
 Provides-Extra: gui
 Requires-Dist: gradio==5.49.1; extra == "gui"
 Provides-Extra: dev
@@ -35,16 +36,18 @@ Automation helpers for ultra-high-throughput molecular biology workflows. The pa

 ### Quick install (recommended, easiest file maintainance)
 ```bash
-pip install "uht-tooling[gui]
+pip install "uht-tooling[gui]"

 ```

-This installs the core workflows plus the optional GUI
+This installs the core workflows plus the optional GUI dependency (Gradio). Omit the `[gui]` extras if you only need the CLI:

 ```bash
 pip install uht-tooling
 ```

+You will need a functioning version of mafft - you should install this separately and it should be accessible from your environment.
+
 ### Development install
 ```bash
 git clone https://github.com/Matt115A/uht-tooling-packaged.git
@@ -222,7 +225,14 @@ Please be aware, this toolkit will not scale well beyond around 50k reads/sample
   --fastq data/ep-library-profile/*.fastq.gz \
   --output-dir results/ep-library-profile/
 ```
-- Output bundle includes per-sample directories
+- Output bundle includes per-sample directories, a master summary TSV, and a `summary_panels` figure that visualises positional mutation rates, coverage, and amino-acid simulations.
+
+**How the mutation rate and AA expectations are derived**
+
+1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the “target” rate; mismatches elsewhere provide the background.
+2. The per-base background rate is subtracted from the target rate to yield a net nucleotide mutation rate, and the standard deviation reflects binomial sampling and quality-score uncertainty.
+3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these drives the AA mutation mean/variance that appear in the panel plot.
+4. If multiple Q-score thresholds are analysed, the CLI aggregates them via a precision-weighted consensus (1 / standard deviation weighting) after filtering out thresholds with insufficient coverage; the consensus value is written to `aa_mutation_consensus.txt` and plotted as a horizontal guide.

 ---

--- uht_tooling-0.1.4/README.md
+++ uht_tooling-0.1.6/README.md
@@ -8,16 +8,18 @@ Automation helpers for ultra-high-throughput molecular biology workflows. The pa

 ### Quick install (recommended, easiest file maintainance)
 ```bash
-pip install "uht-tooling[gui]
+pip install "uht-tooling[gui]"

 ```

-This installs the core workflows plus the optional GUI
+This installs the core workflows plus the optional GUI dependency (Gradio). Omit the `[gui]` extras if you only need the CLI:

 ```bash
 pip install uht-tooling
 ```

+You will need a functioning version of mafft - you should install this separately and it should be accessible from your environment.
+
 ### Development install
 ```bash
 git clone https://github.com/Matt115A/uht-tooling-packaged.git
@@ -195,7 +197,14 @@ Please be aware, this toolkit will not scale well beyond around 50k reads/sample
   --fastq data/ep-library-profile/*.fastq.gz \
   --output-dir results/ep-library-profile/
 ```
-- Output bundle includes per-sample directories
+- Output bundle includes per-sample directories, a master summary TSV, and a `summary_panels` figure that visualises positional mutation rates, coverage, and amino-acid simulations.
+
+**How the mutation rate and AA expectations are derived**
+
+1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the “target” rate; mismatches elsewhere provide the background.
+2. The per-base background rate is subtracted from the target rate to yield a net nucleotide mutation rate, and the standard deviation reflects binomial sampling and quality-score uncertainty.
+3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these drives the AA mutation mean/variance that appear in the panel plot.
+4. If multiple Q-score thresholds are analysed, the CLI aggregates them via a precision-weighted consensus (1 / standard deviation weighting) after filtering out thresholds with insufficient coverage; the consensus value is written to `aa_mutation_consensus.txt` and plotted as a horizontal guide.

 ---

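Step 3 of the derivation quoted above (draw a Poisson number of point mutations per copy, apply them at random positions, translate, count amino-acid differences) can be sketched as standalone Python. Everything below is illustrative: the names `simulate_aa_mutations`, `poisson_draw`, and `translate` are hypothetical stand-ins, not the package's actual `simulate_aa_distribution` API.

```python
import math
import random

BASES = "TCAG"
# Standard genetic code packed in TCAG-major order (a common compact encoding).
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AA[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(cds):
    """Translate a CDS codon by codon; stop codons are rendered as '*'."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - len(cds) % 3, 3))

def poisson_draw(rng, lam):
    """Knuth's inversion sampler; adequate for the small lambda_bp values here."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= L:
            return k - 1

def simulate_aa_mutations(lambda_bp, cds, n_trials=1000, seed=0):
    """Per trial: draw a Poisson(lambda_bp) number of point mutations, flip
    random bases, translate, and count amino-acid differences vs. wild type."""
    rng = random.Random(seed)
    wt_protein = translate(cds)
    samples = []
    for _ in range(n_trials):
        seq = list(cds)
        for _ in range(poisson_draw(rng, lambda_bp)):
            pos = rng.randrange(len(seq))
            seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
        samples.append(sum(a != b for a, b in zip(wt_protein, translate("".join(seq)))))
    return samples
```

The mean and sample variance of `samples` play the role of the AA mutation mean/variance that the README says feed the panel plot.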
--- uht_tooling-0.1.4/pyproject.toml
+++ uht_tooling-0.1.6/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "uht-tooling"
-version = "0.1.4"
+version = "0.1.6"
 description = "Tooling for ultra-high throughput screening workflows."
 readme = "README.md"
 requires-python = ">=3.8"
@@ -23,6 +23,7 @@ dependencies = [
     "tabulate==0.9.0",
     "tqdm==4.67.1",
     "typer==0.20.0",
+    "mappy==2.30",
 ]

 [project.optional-dependencies]
--- uht_tooling-0.1.4/src/uht_tooling/workflows/gui.py
+++ uht_tooling-0.1.6/src/uht_tooling/workflows/gui.py
@@ -957,9 +957,10 @@ def create_gui() -> gr.Blocks:
             gr.Markdown(
                 textwrap.dedent(
                     """
-                    - Reads are aligned against
-                    -
-                    -
+                    - Reads are aligned against both the region-of-interest and the full plasmid to measure target and background mismatch rates; their difference yields the net nucleotide mutation rate with propagated binomial and quality-score uncertainty.
+                    - The net per-base rate is multiplied by the CDS length to obtain λ₍bp₎ (mutations per copy), then Monte Carlo simulations flip random bases, translate the mutated CDS, and count amino-acid differences—those simulated means and confidence intervals are the values plotted in the QC figure.
+                    - When multiple Q-score thresholds are analysed, the CLI combines them via a precision-weighted consensus (after discarding filters with <1000 mappable bases). The consensus AA mutation rate is written to `aa_mutation_consensus.txt` and drawn as a horizontal guide in the plot.
+                    - Download the archive to inspect per-sample plots, TSV summaries, the consensus summary, and logs for troubleshooting.
                     """
                 )
             )
--- uht_tooling-0.1.4/src/uht_tooling/workflows/mut_rate.py
+++ uht_tooling-0.1.6/src/uht_tooling/workflows/mut_rate.py
@@ -15,7 +15,7 @@ import matplotlib.pyplot as plt
 import math
 import tempfile
 from pathlib import Path
-from typing import Dict, Iterable, List, Optional, Sequence
+from typing import Dict, Iterable, List, Optional, Sequence, Tuple

 # Use a built-in Matplotlib style ("ggplot") for consistency
 plt.style.use("ggplot")
@@ -505,94 +505,153 @@ def run_qc_analysis(fastq_path, results_dir, ref_hit_fasta, plasmid_fasta):
         else:
             logging.warning(f"Failed to calculate mutation rate for quality threshold {q_threshold}")

-    #
-
+    # Derive consensus AA mutation estimates across valid Q-score thresholds
+    consensus_info, _ = compute_consensus_aa_mutation(qc_results)

     # Create QC plots
     if len(qc_results) >= 2:
-        create_simple_qc_plots(
+        create_simple_qc_plots(
+            successful_thresholds,
+            qc_results,
+            results_dir,
+            consensus_info=consensus_info,
+        )
     else:
         logging.warning("Insufficient data points for QC plots (need at least 2)")

-    # Save
-
-
-
-
-        f.write(f"
-        f.write(
-
-
-
-        f.write(f"
-        f.write(f"
-
-
-
-
-
-
-
+    # Save consensus summary
+    consensus_summary_path = os.path.join(results_dir, "aa_mutation_consensus.txt")
+    with open(consensus_summary_path, "w") as f:
+        f.write("=== CONSENSUS AMINO-ACID MUTATION ESTIMATE ===\n")
+        if consensus_info:
+            f.write(f"Minimum mappable bases required: {consensus_info['min_mappable_bases']}\n")
+            f.write(
+                f"Consensus AA mutations per gene: {consensus_info['consensus_mean']:.4f} ± "
+                f"{consensus_info['consensus_std']:.4f}\n"
+            )
+            f.write(f"Thresholds contributing: {consensus_info['thresholds_used']}\n")
+            f.write(f"Normalized weights: {consensus_info['weights']}\n")
+            if consensus_info.get("note"):
+                f.write(f"Note: {consensus_info['note']}\n")
+        else:
+            f.write("Consensus AA mutation rate could not be computed; see QC logs for details.\n")
+        f.write("\n=== ALL Q-SCORE RESULTS ===\n")
+        f.write(
+            "Q-score\tMean_AA\tStd_AA\tCI_Lower\tCI_Upper\tMappable_Bases\tSegments\n"
+        )
+        for result in qc_results:
+            f.write(
+                f"{result['quality_threshold']}\t"
+                f"{result['mean_aa_mutations']:.6f}\t"
+                f"{result['std_aa_mutations']:.6f}\t"
+                f"{result['ci_lower']:.6f}\t"
+                f"{result['ci_upper']:.6f}\t"
+                f"{result['total_mappable_bases']}\t"
+                f"{result['n_segments']}\n"
+            )
+    logging.info("Consensus AA mutation summary saved to: %s", consensus_summary_path)

     # Clean up segment files
-    import shutil
     segment_dir = os.path.dirname(segment_files[0])
     if os.path.exists(segment_dir):
         shutil.rmtree(segment_dir)
         logging.info(f"Cleaned up segment directory: {segment_dir}")

-    # Return
-    return qc_results,
+    # Return QC results and consensus information for downstream analysis
+    return qc_results, consensus_info

-def
+def compute_consensus_aa_mutation(
+    qc_results: List[dict],
+    min_mappable_bases: int = 1000,
+) -> Tuple[Optional[dict], List[dict]]:
     """
-
-
-
-
-
-
+    Derive a consensus amino-acid mutation estimate across Q-score thresholds.
+
+    Each threshold must meet a minimum coverage requirement. The consensus is a
+    precision-weighted average (weights = 1 / std_aa_mutations).
+
     Returns:
-
+        consensus_info (dict or None)
+            {
+                'consensus_mean': float,
+                'consensus_std': float,
+                'thresholds_used': List[int],
+                'weights': List[float],
+                'min_mappable_bases': int,
+            }
+        valid_results: list of QC result dicts that were included in the consensus
     """
-    logging.info("=== FINDING OPTIMAL Q-SCORE THRESHOLD (PRECISION-WEIGHTED) ===")
-
     if not qc_results:
-        return None,
-
-
-    max_score = -1
-    optimal_result = None
-    optimal_qscore = None
-
-    logging.info("Q-score\tEmpirical_Error\tPrecision_Score\tMappable_Bases")
-    logging.info("-" * 60)
-
+        return None, []
+
+    valid_results = []
     for result in qc_results:
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+        total_bases = result.get("total_mappable_bases", 0)
+        std_aa = result.get("std_aa_mutations", 0.0)
+        if total_bases is None:
+            total_bases = 0
+        if total_bases >= min_mappable_bases and std_aa is not None:
+            valid_results.append(result)
+
+    if not valid_results:
+        logging.warning(
+            "No Q-score thresholds met the minimum mappable base requirement (%s). "
+            "Consensus AA mutation rate will fall back to the threshold with the highest coverage.",
+            min_mappable_bases,
+        )
+        best_by_coverage = max(qc_results, key=lambda r: r.get("total_mappable_bases", 0))
+        fallback_std = best_by_coverage.get("std_aa_mutations", 0.0)
+        consensus_info = {
+            "consensus_mean": best_by_coverage.get("mean_aa_mutations", 0.0),
+            "consensus_std": fallback_std,
+            "thresholds_used": [best_by_coverage.get("quality_threshold")],
+            "weights": [1.0],
+            "min_mappable_bases": min_mappable_bases,
+            "note": "FELL_BACK_TO_MAX_COVERAGE",
+        }
+        return consensus_info, [best_by_coverage]
+
+    weights = []
+    means = []
+    variances = []
+    thresholds = []
+    for result in valid_results:
+        std_aa = result.get("std_aa_mutations", 0.0) or 0.0
+        weight = 1.0 / max(std_aa, 1e-9)  # Avoid division by zero; effectively a very large weight.
+        weights.append(weight)
+        means.append(result.get("mean_aa_mutations", 0.0))
+        variances.append(std_aa**2)
+        thresholds.append(result.get("quality_threshold"))
+
+    weight_sum = float(np.sum(weights))
+    normalized_weights = [w / weight_sum for w in weights]
+    consensus_mean = float(np.sum(np.array(normalized_weights) * np.array(means)))
+
+    combined_variance = 0.0
+    for w, mean, var in zip(normalized_weights, means, variances):
+        combined_variance += w * (var + (mean - consensus_mean) ** 2)
+    combined_variance = max(combined_variance, 0.0)
+    consensus_std = float(np.sqrt(combined_variance))
+
+    consensus_info = {
+        "consensus_mean": consensus_mean,
+        "consensus_std": consensus_std,
+        "thresholds_used": thresholds,
+        "weights": normalized_weights,
+        "min_mappable_bases": min_mappable_bases,
+        "note": "WEIGHTED_AVERAGE",
+    }
+
+    logging.info(
+        "Consensus AA mutation estimate: %.4f ± %.4f (thresholds used: %s)",
+        consensus_mean,
+        consensus_std,
+        thresholds,
+    )

-
+    return consensus_info, valid_results
+
+def create_simple_qc_plots(quality_thresholds, qc_results, results_dir, consensus_info=None):
     """
     Create simple QC plots with empirical error bars.

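The precision-weighted combination introduced in `compute_consensus_aa_mutation` above (weights of 1/std, then a mixture variance of w·(var + (mean − consensus)²)) can be checked with a dependency-free sketch. The function name and toy numbers below are illustrative, not part of the package:

```python
def weighted_consensus(means, stds, floor=1e-9):
    """Precision-weighted mean with a mixture (law-of-total-variance) spread,
    mirroring the combination rule in the diff above."""
    raw = [1.0 / max(s, floor) for s in stds]  # precision weights, floored to avoid division by zero
    total = sum(raw)
    w = [x / total for x in raw]               # normalized weights
    mean = sum(wi * mi for wi, mi in zip(w, means))
    var = sum(wi * (si ** 2 + (mi - mean) ** 2) for wi, si, mi in zip(w, stds, means))
    return mean, var ** 0.5

# Toy example: the tightest estimate (std = 0.1) dominates the consensus.
consensus_mean, consensus_std = weighted_consensus([2.0, 2.4, 2.2], [0.2, 0.4, 0.1])
```

Note that the mixture variance keeps both the per-threshold uncertainty and the between-threshold disagreement, so the consensus std cannot shrink below the spread of the contributing means.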
@@ -600,8 +659,7 @@ def create_simple_qc_plots(quality_thresholds, qc_results, results_dir, optimal_
         quality_thresholds: List of quality score thresholds
         qc_results: List of segmentation analysis results
         results_dir: Directory to save the plots
-
-        optimal_result: Optimal result data (optional)
+        consensus_info: Optional dict describing the consensus AA mutation estimate.
     """
     try:
         # Extract data for plotting
@@ -624,10 +682,17 @@ def create_simple_qc_plots(quality_thresholds, qc_results, results_dir, optimal_
         ax1.fill_between(quality_thresholds, aa_ci_lower, aa_ci_upper,
                          alpha=0.3, color=color1, label='95% Confidence Interval')

-        #
-        if
-
-
+        # Add consensus AA mutation estimate if available
+        if consensus_info and consensus_info.get("consensus_mean") is not None:
+            consensus_mean = consensus_info["consensus_mean"]
+            consensus_std = consensus_info.get("consensus_std", 0.0)
+            ax1.axhline(
+                y=consensus_mean,
+                color='red',
+                linestyle='--',
+                alpha=0.7,
+                label=f"Consensus AA mutations ({consensus_mean:.3f}±{consensus_std:.3f})",
+            )

         ax1.set_xlabel('Quality Score Threshold', fontsize=12, fontweight='bold')
         ax1.set_ylabel('Estimated AA Mutations per Gene', fontsize=12, fontweight='bold', color=color1)
@@ -1185,16 +1250,23 @@ def run_segmented_analysis(segment_files, quality_threshold, work_dir, ref_hit_f
             bg_rate = bg_mis / bg_cov if bg_cov > 0 else 0
             net_rate = max(hit_rate - bg_rate, 0.0)

-            # Calculate AA mutations per gene
+            # Calculate AA mutations per gene via Monte Carlo simulation
             lambda_bp = net_rate * len(hit_seq)
-
+            aa_samples = simulate_aa_distribution(lambda_bp, hit_seq, n_trials=500)
+            if len(aa_samples) > 1:
+                aa_mean = float(np.mean(aa_samples))
+                aa_var = float(np.var(aa_samples, ddof=1))
+            else:
+                aa_mean = float(aa_samples[0]) if aa_samples else 0.0
+                aa_var = 0.0

             segment_results.append({
                 'segment': i+1,
                 'hit_rate': hit_rate,
                 'bg_rate': bg_rate,
                 'net_rate': net_rate,
-                'aa_mutations':
+                'aa_mutations': aa_mean,
+                'aa_variance': aa_var,
                 'mappable_bases': hit_cov,
                 'hit_mismatches': hit_mis,
                 'hit_coverage': hit_cov
@@ -1204,29 +1276,44 @@ def run_segmented_analysis(segment_files, quality_threshold, work_dir, ref_hit_f
         return None

     # Calculate empirical statistics
-    aa_mutations_list = [r['aa_mutations'] for r in segment_results]
-
-
+    aa_mutations_list = np.array([r['aa_mutations'] for r in segment_results], dtype=float)
+    aa_variances = np.array([r.get('aa_variance', 0.0) for r in segment_results], dtype=float)
+    net_rates_list = np.array([r['net_rate'] for r in segment_results], dtype=float)
+    mappable_bases_list = np.array([r['mappable_bases'] for r in segment_results], dtype=float)
+
+    total_mappable_bases = float(mappable_bases_list.sum())
+    if total_mappable_bases > 0:
+        weights = mappable_bases_list
+        mean_aa = float(np.average(aa_mutations_list, weights=weights))
+        mean_net_rate = float(np.average(net_rates_list, weights=weights))
+        weighted_var = float(
+            np.sum(weights * (aa_variances + (aa_mutations_list - mean_aa) ** 2)) / total_mappable_bases
+        )
+        weighted_net_var = float(
+            np.sum(weights * ((net_rates_list - mean_net_rate) ** 2)) / total_mappable_bases
+        )
+    else:
+        weights = None
+        mean_aa = float(np.mean(aa_mutations_list))
+        mean_net_rate = float(np.mean(net_rates_list))
+        weighted_var = float(np.var(aa_mutations_list, ddof=1)) if len(aa_mutations_list) > 1 else 0.0
+        weighted_net_var = float(np.var(net_rates_list, ddof=1)) if len(net_rates_list) > 1 else 0.0

-
-
-    mean_net_rate = np.mean(net_rates_list)
-    std_net_rate = np.std(net_rates_list, ddof=1)
-    total_mappable_bases = sum(mappable_bases_list)
+    std_aa = float(np.sqrt(max(weighted_var, 0.0)))
+    std_net_rate = float(np.sqrt(max(weighted_net_var, 0.0)))

     # Calculate confidence interval using t-distribution
     n_segments = len(segment_results)
     if n_segments > 1:
-        # 95% confidence interval
-        from scipy.stats import t
-        t_val = t.ppf(0.975, n_segments - 1)
         se_aa = std_aa / np.sqrt(n_segments)
-        ci_lower = mean_aa -
-        ci_upper = mean_aa +
+        ci_lower = mean_aa - 1.96 * se_aa
+        ci_upper = mean_aa + 1.96 * se_aa
     else:
         ci_lower = mean_aa
         ci_upper = mean_aa

+    ci_lower = max(ci_lower, 0.0)
+
     return {
         'mean_aa_mutations': mean_aa,
         'std_aa_mutations': std_aa,
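The confidence-interval change in this hunk replaces the t-distribution lookup with a fixed z of 1.96 and clips the lower bound at zero. As a standalone sketch (the function name is illustrative, not the package API):

```python
import math

def normal_ci(mean, std, n, z=1.96):
    """95% normal-approximation CI from the standard error of the mean,
    with the lower bound clipped at zero as in the diff above."""
    if n <= 1:
        # Degenerate interval, matching the single-segment branch in the diff.
        return max(mean, 0.0), mean
    se = std / math.sqrt(n)
    return max(mean - z * se, 0.0), mean + z * se
```

For example, `normal_ci(2.0, 0.5, 25)` uses a standard error of 0.1, and a mean close to zero (e.g. `normal_ci(0.05, 1.0, 4)`) gets its lower bound clipped to 0.0.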
@@ -2072,18 +2159,23 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
            ax3.bar(unique_vals, [1.0], color="#C44E52", alpha=0.7, width=0.1)
            ax3.set_xlim(unique_vals[0] - 0.5, unique_vals[0] + 0.5)
        else:
-           # Not protein or no AA differences
-           ax3.text(
-
-
-
-
-
-
-
-
-
+           # Not protein or no AA differences — display an informative message
+           ax3.text(
+               0.5,
+               0.5,
+               "Amino-acid distribution unavailable",
+               horizontalalignment="center",
+               verticalalignment="center",
+               fontsize=12,
+               color="gray",
+               transform=ax3.transAxes,
+           )
+
+       ax3.set_title("AA Mutation Distribution", fontsize=14, fontweight='bold')
+       ax3.set_xlabel("Number of AA Mutations", fontsize=12)
+       ax3.set_ylabel("Density", fontsize=12)
+       ax3.spines['top'].set_visible(False)
+       ax3.spines['right'].set_visible(False)

        # Save the combined figure as both PNG and PDF
        panel_path_png = os.path.join(qscore_results_dir, "summary_panels.png")
@@ -2412,7 +2504,7 @@ def process_single_fastq(
     logging.info("Running QC analysis to get Q-score results...")
     qc_results = None
     try:
-        qc_results,
+        qc_results, consensus_info = run_qc_analysis(
             str(fastq_path),
             str(results_dir),
             str(region_fasta),
@@ -2420,8 +2512,13 @@ def process_single_fastq(
         )
         if qc_results is not None:
             logging.info("QC analysis completed successfully. Found %s Q-score results.", len(qc_results))
-            if
-                logging.info(
+            if consensus_info and consensus_info.get("consensus_mean") is not None:
+                logging.info(
+                    "Consensus AA mutations per gene: %.4f ± %.4f (thresholds used: %s)",
+                    consensus_info["consensus_mean"],
+                    consensus_info.get("consensus_std", 0.0),
+                    consensus_info.get("thresholds_used"),
+                )
         else:
             logging.warning("QC analysis completed but no Q-score results found.")
     except Exception as exc:
--- uht_tooling-0.1.4/src/uht_tooling.egg-info/PKG-INFO
+++ uht_tooling-0.1.6/src/uht_tooling.egg-info/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: uht-tooling
-Version: 0.1.4
+Version: 0.1.6
 Summary: Tooling for ultra-high throughput screening workflows.
 Author: Matt115A
 License: MIT
@@ -18,6 +18,7 @@ Requires-Dist: seaborn==0.13.2
 Requires-Dist: tabulate==0.9.0
 Requires-Dist: tqdm==4.67.1
 Requires-Dist: typer==0.20.0
+Requires-Dist: mappy==2.30
 Provides-Extra: gui
 Requires-Dist: gradio==5.49.1; extra == "gui"
 Provides-Extra: dev
@@ -35,16 +36,18 @@ Automation helpers for ultra-high-throughput molecular biology workflows. The pa

 ### Quick install (recommended, easiest file maintainance)
 ```bash
-pip install "uht-tooling[gui]
+pip install "uht-tooling[gui]"

 ```

-This installs the core workflows plus the optional GUI
+This installs the core workflows plus the optional GUI dependency (Gradio). Omit the `[gui]` extras if you only need the CLI:

 ```bash
 pip install uht-tooling
 ```

+You will need a functioning version of mafft - you should install this separately and it should be accessible from your environment.
+
 ### Development install
 ```bash
 git clone https://github.com/Matt115A/uht-tooling-packaged.git
@@ -222,7 +225,14 @@ Please be aware, this toolkit will not scale well beyond around 50k reads/sample
   --fastq data/ep-library-profile/*.fastq.gz \
   --output-dir results/ep-library-profile/
 ```
-- Output bundle includes per-sample directories
+- Output bundle includes per-sample directories, a master summary TSV, and a `summary_panels` figure that visualises positional mutation rates, coverage, and amino-acid simulations.
+
+**How the mutation rate and AA expectations are derived**
+
+1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the “target” rate; mismatches elsewhere provide the background.
+2. The per-base background rate is subtracted from the target rate to yield a net nucleotide mutation rate, and the standard deviation reflects binomial sampling and quality-score uncertainty.
+3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these drives the AA mutation mean/variance that appear in the panel plot.
+4. If multiple Q-score thresholds are analysed, the CLI aggregates them via a precision-weighted consensus (1 / standard deviation weighting) after filtering out thresholds with insufficient coverage; the consensus value is written to `aa_mutation_consensus.txt` and plotted as a horizontal guide.

 ---
