uht-tooling 0.1.4.tar.gz → 0.1.5.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (22)
  1. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/PKG-INFO +10 -3
  2. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/README.md +9 -2
  3. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/pyproject.toml +1 -1
  4. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling/workflows/gui.py +4 -3
  5. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling/workflows/mut_rate.py +203 -106
  6. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling.egg-info/PKG-INFO +10 -3
  7. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/setup.cfg +0 -0
  8. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling/__init__.py +0 -0
  9. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling/cli.py +0 -0
  10. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling/models/__init__.py +0 -0
  11. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling/workflows/__init__.py +0 -0
  12. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling/workflows/design_gibson.py +0 -0
  13. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling/workflows/design_slim.py +0 -0
  14. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling/workflows/mutation_caller.py +0 -0
  15. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling/workflows/nextera_designer.py +0 -0
  16. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling/workflows/profile_inserts.py +0 -0
  17. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling/workflows/umi_hunter.py +0 -0
  18. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling.egg-info/SOURCES.txt +0 -0
  19. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling.egg-info/dependency_links.txt +0 -0
  20. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling.egg-info/entry_points.txt +0 -0
  21. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling.egg-info/requires.txt +0 -0
  22. {uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling.egg-info/top_level.txt +0 -0
{uht_tooling-0.1.4 → uht_tooling-0.1.5}/PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: uht-tooling
- Version: 0.1.4
+ Version: 0.1.5
  Summary: Tooling for ultra-high throughput screening workflows.
  Author: Matt115A
  License: MIT
@@ -35,7 +35,7 @@ Automation helpers for ultra-high-throughput molecular biology workflows. The pa

  ### Quick install (recommended, easiest file maintainance)
  ```bash
- pip install "uht-tooling[gui]==0.1.3"
+ pip install "uht-tooling[gui]==0.1.4"

  ```

@@ -222,7 +222,14 @@ Please be aware, this toolkit will not scale well beyond around 50k reads/sample
  --fastq data/ep-library-profile/*.fastq.gz \
  --output-dir results/ep-library-profile/
  ```
- - Output bundle includes per-sample directories and a master summary TSV.
+ - Output bundle includes per-sample directories, a master summary TSV, and a `summary_panels` figure that visualises positional mutation rates, coverage, and amino-acid simulations.
+
+ **How the mutation rate and AA expectations are derived**
+
+ 1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the “target” rate; mismatches elsewhere provide the background.
+ 2. The per-base background rate is subtracted from the target rate to yield a net nucleotide mutation rate, and the standard deviation reflects binomial sampling and quality-score uncertainty.
+ 3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these drive the AA mutation mean/variance that appear in the panel plot.
+ 4. If multiple Q-score thresholds are analysed, the CLI aggregates them via a precision-weighted consensus (1 / standard deviation weighting) after filtering out thresholds with insufficient coverage; the consensus value is written to `aa_mutation_consensus.txt` and plotted as a horizontal guide.

  ---

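For readers skimming the README changes above, steps 1-4 boil down to a small Monte Carlo calculation. The sketch below is not code from uht-tooling: `estimate_aa_mutations`, its parameters, and the use of Biopython for translation are illustrative assumptions only.

```python
# Illustrative sketch of steps 1-4 above (not uht-tooling's internal API).
# Assumes Biopython is available for translation; names and defaults are invented.
import numpy as np
from Bio.Seq import Seq

BASES = "ACGT"

def estimate_aa_mutations(cds, net_rate_per_base, n_trials=1000, seed=0):
    """Monte Carlo estimate of amino-acid mutations per gene copy."""
    rng = np.random.default_rng(seed)
    lam_bp = net_rate_per_base * len(cds)      # expected nucleotide mutations per copy
    wild_aa = str(Seq(cds).translate())
    counts = []
    for _ in range(n_trials):
        seq = list(cds)
        # Draw a Poisson number of substitutions and flip each to a different base.
        for pos in rng.integers(0, len(cds), size=rng.poisson(lam_bp)):
            seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
        mut_aa = str(Seq("".join(seq)).translate())
        counts.append(sum(a != b for a, b in zip(wild_aa, mut_aa)))
    return float(np.mean(counts)), float(np.std(counts, ddof=1))

# e.g. a 900 bp CDS with a net rate of 0.005 substitutions per base:
# mean_aa, std_aa = estimate_aa_mutations("ATG" * 300, 0.005)
```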
{uht_tooling-0.1.4 → uht_tooling-0.1.5}/README.md
@@ -8,7 +8,7 @@ Automation helpers for ultra-high-throughput molecular biology workflows. The pa

  ### Quick install (recommended, easiest file maintainance)
  ```bash
- pip install "uht-tooling[gui]==0.1.3"
+ pip install "uht-tooling[gui]==0.1.4"

  ```

@@ -195,7 +195,14 @@ Please be aware, this toolkit will not scale well beyond around 50k reads/sample
  --fastq data/ep-library-profile/*.fastq.gz \
  --output-dir results/ep-library-profile/
  ```
- - Output bundle includes per-sample directories and a master summary TSV.
+ - Output bundle includes per-sample directories, a master summary TSV, and a `summary_panels` figure that visualises positional mutation rates, coverage, and amino-acid simulations.
+
+ **How the mutation rate and AA expectations are derived**
+
+ 1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the “target” rate; mismatches elsewhere provide the background.
+ 2. The per-base background rate is subtracted from the target rate to yield a net nucleotide mutation rate, and the standard deviation reflects binomial sampling and quality-score uncertainty.
+ 3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these drive the AA mutation mean/variance that appear in the panel plot.
+ 4. If multiple Q-score thresholds are analysed, the CLI aggregates them via a precision-weighted consensus (1 / standard deviation weighting) after filtering out thresholds with insufficient coverage; the consensus value is written to `aa_mutation_consensus.txt` and plotted as a horizontal guide.

  ---

{uht_tooling-0.1.4 → uht_tooling-0.1.5}/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

  [project]
  name = "uht-tooling"
- version = "0.1.4"
+ version = "0.1.5"
  description = "Tooling for ultra-high throughput screening workflows."
  readme = "README.md"
  requires-python = ">=3.8"
{uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling/workflows/gui.py
@@ -957,9 +957,10 @@ def create_gui() -> gr.Blocks:
  gr.Markdown(
  textwrap.dedent(
  """
- - Reads are aligned against the plasmid reference; mismatches inside the region-of-interest drive target rate estimates, while mismatches elsewhere define the background rate.
- - Z-scores and p-values summarise enrichment versus background, mirroring the CLI outputs.
- - Download the archive to inspect per-sample plots, TSV summaries, and logs for troubleshooting.
+ - Reads are aligned against both the region-of-interest and the full plasmid to measure target and background mismatch rates; their difference yields the net nucleotide mutation rate with propagated binomial and quality-score uncertainty.
+ - The net per-base rate is multiplied by the CDS length to obtain λ₍bp₎ (mutations per copy), then Monte Carlo simulations flip random bases, translate the mutated CDS, and count amino-acid differences—those simulated means and confidence intervals are the values plotted in the QC figure.
+ - When multiple Q-score thresholds are analysed, the CLI combines them via a precision-weighted consensus (after discarding filters with <1000 mappable bases). The consensus AA mutation rate is written to `aa_mutation_consensus.txt` and drawn as a horizontal guide in the plot.
+ - Download the archive to inspect per-sample plots, TSV summaries, the consensus summary, and logs for troubleshooting.
  """
  )
  )
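The precision-weighted consensus mentioned in the help text above (and implemented in `compute_consensus_aa_mutation` further down in `mut_rate.py`) is a short calculation. The numbers below are invented purely to show the arithmetic:

```python
import numpy as np

# Invented per-threshold estimates (AA mutations per gene) at, say, Q10/Q15/Q20.
means = np.array([4.2, 4.0, 3.7])
stds = np.array([0.6, 0.4, 0.9])

weights = 1.0 / stds              # precision weighting: tighter estimates count more
weights /= weights.sum()          # normalise so the weights sum to 1

consensus_mean = float(np.sum(weights * means))
# Spread combines each threshold's own variance with its disagreement from the consensus.
consensus_std = float(np.sqrt(np.sum(weights * (stds**2 + (means - consensus_mean) ** 2))))
print(f"consensus: {consensus_mean:.3f} ± {consensus_std:.3f}")
```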
{uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling/workflows/mut_rate.py
@@ -15,7 +15,7 @@ import matplotlib.pyplot as plt
  import math
  import tempfile
  from pathlib import Path
- from typing import Dict, Iterable, List, Optional, Sequence
+ from typing import Dict, Iterable, List, Optional, Sequence, Tuple

  # Use a built-in Matplotlib style ("ggplot") for consistency
  plt.style.use("ggplot")
@@ -505,94 +505,153 @@ def run_qc_analysis(fastq_path, results_dir, ref_hit_fasta, plasmid_fasta):
  else:
  logging.warning(f"Failed to calculate mutation rate for quality threshold {q_threshold}")

- # Find optimal Q-score threshold (lowest empirical error)
- optimal_qscore, optimal_result = find_optimal_qscore_simple(qc_results)
+ # Derive consensus AA mutation estimates across valid Q-score thresholds
+ consensus_info, _ = compute_consensus_aa_mutation(qc_results)

  # Create QC plots
  if len(qc_results) >= 2:
- create_simple_qc_plots(successful_thresholds, qc_results, results_dir, optimal_qscore, optimal_result)
+ create_simple_qc_plots(
+ successful_thresholds,
+ qc_results,
+ results_dir,
+ consensus_info=consensus_info,
+ )
  else:
  logging.warning("Insufficient data points for QC plots (need at least 2)")

- # Save optimal Q-score information
- if optimal_qscore is not None:
- optimal_qscore_path = os.path.join(results_dir, "optimal_qscore_analysis.txt")
- with open(optimal_qscore_path, 'w') as f:
- f.write("=== OPTIMAL Q-SCORE ANALYSIS (PRECISION-WEIGHTED) ===\n")
- f.write(f"Optimal Q-score threshold: {optimal_qscore}\n")
- f.write(f"Precision-weighted score: {(1.0 / optimal_result['std_aa_mutations']) * optimal_qscore:.6f}\n" if optimal_result['std_aa_mutations'] > 0 else "Precision-weighted score: inf (perfect precision)\n")
- f.write(f"Empirical error (std): {optimal_result['std_aa_mutations']:.6f}\n")
- f.write(f"AA mutations per gene: {optimal_result['mean_aa_mutations']:.4f} ± {optimal_result['std_aa_mutations']:.4f}\n")
- f.write(f"95% Confidence Interval: [{optimal_result['ci_lower']:.4f}, {optimal_result['ci_upper']:.4f}]\n")
- f.write(f"Total mappable bases: {optimal_result['total_mappable_bases']}\n")
- f.write(f"Number of segments: {optimal_result['n_segments']}\n")
- f.write("\n=== ALL Q-SCORE COMPARISON ===\n")
- f.write("Q-score\tEmpirical_Error\tPrecision_Score\tMappable_Bases\tAA_Mutations\tCI_Lower\tCI_Upper\n")
- for result in qc_results:
- precision_score = (1.0 / result['std_aa_mutations']) * result['quality_threshold'] if result['std_aa_mutations'] > 0 else float('inf')
- f.write(f"{result['quality_threshold']}\t{result['std_aa_mutations']:.6f}\t{precision_score:.6f}\t{result['total_mappable_bases']}\t{result['mean_aa_mutations']:.4f}\t{result['ci_lower']:.4f}\t{result['ci_upper']:.4f}\n")
-
- logging.info(f"Optimal Q-score analysis saved to: {optimal_qscore_path}")
+ # Save consensus summary
+ consensus_summary_path = os.path.join(results_dir, "aa_mutation_consensus.txt")
+ with open(consensus_summary_path, "w") as f:
+ f.write("=== CONSENSUS AMINO-ACID MUTATION ESTIMATE ===\n")
+ if consensus_info:
+ f.write(f"Minimum mappable bases required: {consensus_info['min_mappable_bases']}\n")
+ f.write(
+ f"Consensus AA mutations per gene: {consensus_info['consensus_mean']:.4f} ± "
+ f"{consensus_info['consensus_std']:.4f}\n"
+ )
+ f.write(f"Thresholds contributing: {consensus_info['thresholds_used']}\n")
+ f.write(f"Normalized weights: {consensus_info['weights']}\n")
+ if consensus_info.get("note"):
+ f.write(f"Note: {consensus_info['note']}\n")
+ else:
+ f.write("Consensus AA mutation rate could not be computed; see QC logs for details.\n")
+ f.write("\n=== ALL Q-SCORE RESULTS ===\n")
+ f.write(
+ "Q-score\tMean_AA\tStd_AA\tCI_Lower\tCI_Upper\tMappable_Bases\tSegments\n"
+ )
+ for result in qc_results:
+ f.write(
+ f"{result['quality_threshold']}\t"
+ f"{result['mean_aa_mutations']:.6f}\t"
+ f"{result['std_aa_mutations']:.6f}\t"
+ f"{result['ci_lower']:.6f}\t"
+ f"{result['ci_upper']:.6f}\t"
+ f"{result['total_mappable_bases']}\t"
+ f"{result['n_segments']}\n"
+ )
+ logging.info("Consensus AA mutation summary saved to: %s", consensus_summary_path)

  # Clean up segment files
- import shutil
  segment_dir = os.path.dirname(segment_files[0])
  if os.path.exists(segment_dir):
  shutil.rmtree(segment_dir)
  logging.info(f"Cleaned up segment directory: {segment_dir}")

- # Return both QC results and optimal Q-score for use in main analysis
- return qc_results, optimal_qscore
+ # Return QC results and consensus information for downstream analysis
+ return qc_results, consensus_info

- def find_optimal_qscore_simple(qc_results):
+ def compute_consensus_aa_mutation(
+ qc_results: List[dict],
+ min_mappable_bases: int = 1000,
+ ) -> Tuple[Optional[dict], List[dict]]:
  """
- Find the Q-score threshold with the highest precision-weighted score.
- Precision-weighted score = (1 / standard_deviation) * q_score
-
- Args:
- qc_results: List of segmentation analysis results
-
+ Derive a consensus amino-acid mutation estimate across Q-score thresholds.
+
+ Each threshold must meet a minimum coverage requirement. The consensus is a
+ precision-weighted average (weights = 1 / std_aa_mutations).
+
  Returns:
- tuple: (optimal_qscore, optimal_result)
+ consensus_info (dict or None)
+ {
+ 'consensus_mean': float,
+ 'consensus_std': float,
+ 'thresholds_used': List[int],
+ 'weights': List[float],
+ 'min_mappable_bases': int,
+ }
+ valid_results: list of QC result dicts that were included in the consensus
  """
- logging.info("=== FINDING OPTIMAL Q-SCORE THRESHOLD (PRECISION-WEIGHTED) ===")
-
  if not qc_results:
- return None, None
-
- # Find Q-score with highest precision-weighted score
- max_score = -1
- optimal_result = None
- optimal_qscore = None
-
- logging.info("Q-score\tEmpirical_Error\tPrecision_Score\tMappable_Bases")
- logging.info("-" * 60)
-
+ return None, []
+
+ valid_results = []
  for result in qc_results:
- qscore = result['quality_threshold']
- empirical_error = result['std_aa_mutations']
- mappable_bases = result['total_mappable_bases']
-
- # Calculate precision-weighted score: (1/sd) * q_score
- if empirical_error > 0:
- precision_score = (1.0 / empirical_error) * qscore
- else:
- precision_score = float('inf') # Perfect precision
-
- logging.info(f"Q{qscore}\t{empirical_error:.6f}\t{precision_score:.6f}\t{mappable_bases}")
-
- if precision_score > max_score:
- max_score = precision_score
- optimal_result = result
- optimal_qscore = qscore
-
- logging.info("-" * 60)
- logging.info(f"OPTIMAL Q-SCORE: Q{optimal_qscore} (highest precision-weighted score: {max_score:.6f})")
- logging.info(f"Optimal result: AA mutations = {optimal_result['mean_aa_mutations']:.4f} ± {optimal_result['std_aa_mutations']:.4f}")
-
- return optimal_qscore, optimal_result
+ total_bases = result.get("total_mappable_bases", 0)
+ std_aa = result.get("std_aa_mutations", 0.0)
+ if total_bases is None:
+ total_bases = 0
+ if total_bases >= min_mappable_bases and std_aa is not None:
+ valid_results.append(result)
+
+ if not valid_results:
+ logging.warning(
+ "No Q-score thresholds met the minimum mappable base requirement (%s). "
+ "Consensus AA mutation rate will fall back to the threshold with the highest coverage.",
+ min_mappable_bases,
+ )
+ best_by_coverage = max(qc_results, key=lambda r: r.get("total_mappable_bases", 0))
+ fallback_std = best_by_coverage.get("std_aa_mutations", 0.0)
+ consensus_info = {
+ "consensus_mean": best_by_coverage.get("mean_aa_mutations", 0.0),
+ "consensus_std": fallback_std,
+ "thresholds_used": [best_by_coverage.get("quality_threshold")],
+ "weights": [1.0],
+ "min_mappable_bases": min_mappable_bases,
+ "note": "FELL_BACK_TO_MAX_COVERAGE",
+ }
+ return consensus_info, [best_by_coverage]
+
+ weights = []
+ means = []
+ variances = []
+ thresholds = []
+ for result in valid_results:
+ std_aa = result.get("std_aa_mutations", 0.0) or 0.0
+ weight = 1.0 / max(std_aa, 1e-9) # Avoid division by zero; effectively a very large weight.
+ weights.append(weight)
+ means.append(result.get("mean_aa_mutations", 0.0))
+ variances.append(std_aa**2)
+ thresholds.append(result.get("quality_threshold"))
+
+ weight_sum = float(np.sum(weights))
+ normalized_weights = [w / weight_sum for w in weights]
+ consensus_mean = float(np.sum(np.array(normalized_weights) * np.array(means)))
+
+ combined_variance = 0.0
+ for w, mean, var in zip(normalized_weights, means, variances):
+ combined_variance += w * (var + (mean - consensus_mean) ** 2)
+ combined_variance = max(combined_variance, 0.0)
+ consensus_std = float(np.sqrt(combined_variance))
+
+ consensus_info = {
+ "consensus_mean": consensus_mean,
+ "consensus_std": consensus_std,
+ "thresholds_used": thresholds,
+ "weights": normalized_weights,
+ "min_mappable_bases": min_mappable_bases,
+ "note": "WEIGHTED_AVERAGE",
+ }
+
+ logging.info(
+ "Consensus AA mutation estimate: %.4f ± %.4f (thresholds used: %s)",
+ consensus_mean,
+ consensus_std,
+ thresholds,
+ )

- def create_simple_qc_plots(quality_thresholds, qc_results, results_dir, optimal_qscore=None, optimal_result=None):
+ return consensus_info, valid_results
+
+ def create_simple_qc_plots(quality_thresholds, qc_results, results_dir, consensus_info=None):
  """
  Create simple QC plots with empirical error bars.

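A minimal usage sketch for the new helper shown in this hunk. The dictionaries carry only the keys the function reads, every value is invented, and the import path is simply the module location listed in this diff:

```python
from uht_tooling.workflows.mut_rate import compute_consensus_aa_mutation

# Invented per-threshold QC results; only the keys the helper reads are included.
qc_results = [
    {"quality_threshold": 10, "mean_aa_mutations": 4.1, "std_aa_mutations": 0.5, "total_mappable_bases": 52000},
    {"quality_threshold": 15, "mean_aa_mutations": 3.9, "std_aa_mutations": 0.4, "total_mappable_bases": 38000},
    {"quality_threshold": 20, "mean_aa_mutations": 3.6, "std_aa_mutations": 0.9, "total_mappable_bases": 700},
]

consensus_info, used = compute_consensus_aa_mutation(qc_results, min_mappable_bases=1000)
# Q20 is dropped for insufficient coverage; Q10 and Q15 form the precision-weighted consensus.
print(consensus_info["consensus_mean"], consensus_info["consensus_std"], consensus_info["thresholds_used"])
```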
@@ -600,8 +659,7 @@ def create_simple_qc_plots(quality_thresholds, qc_results, results_dir, optimal_
  quality_thresholds: List of quality score thresholds
  qc_results: List of segmentation analysis results
  results_dir: Directory to save the plots
- optimal_qscore: Optimal Q-score threshold (optional)
- optimal_result: Optimal result data (optional)
+ consensus_info: Optional dict describing the consensus AA mutation estimate.
  """
  try:
  # Extract data for plotting
@@ -624,10 +682,17 @@ def create_simple_qc_plots(quality_thresholds, qc_results, results_dir, optimal_
  ax1.fill_between(quality_thresholds, aa_ci_lower, aa_ci_upper,
  alpha=0.3, color=color1, label='95% Confidence Interval')

- # Highlight optimal Q-score
- if optimal_qscore is not None:
- ax1.axvline(x=optimal_qscore, color='red', linestyle='--', alpha=0.7,
- label=f'Optimal Q{optimal_qscore}')
+ # Add consensus AA mutation estimate if available
+ if consensus_info and consensus_info.get("consensus_mean") is not None:
+ consensus_mean = consensus_info["consensus_mean"]
+ consensus_std = consensus_info.get("consensus_std", 0.0)
+ ax1.axhline(
+ y=consensus_mean,
+ color='red',
+ linestyle='--',
+ alpha=0.7,
+ label=f"Consensus AA mutations ({consensus_mean:.3f}±{consensus_std:.3f})",
+ )

  ax1.set_xlabel('Quality Score Threshold', fontsize=12, fontweight='bold')
  ax1.set_ylabel('Estimated AA Mutations per Gene', fontsize=12, fontweight='bold', color=color1)
@@ -1185,16 +1250,23 @@ def run_segmented_analysis(segment_files, quality_threshold, work_dir, ref_hit_f
  bg_rate = bg_mis / bg_cov if bg_cov > 0 else 0
  net_rate = max(hit_rate - bg_rate, 0.0)

- # Calculate AA mutations per gene (simplified)
+ # Calculate AA mutations per gene via Monte Carlo simulation
  lambda_bp = net_rate * len(hit_seq)
- aa_mutations = lambda_bp / 3.0 # Approximate: 3 bp per AA
+ aa_samples = simulate_aa_distribution(lambda_bp, hit_seq, n_trials=500)
+ if len(aa_samples) > 1:
+ aa_mean = float(np.mean(aa_samples))
+ aa_var = float(np.var(aa_samples, ddof=1))
+ else:
+ aa_mean = float(aa_samples[0]) if aa_samples else 0.0
+ aa_var = 0.0

  segment_results.append({
  'segment': i+1,
  'hit_rate': hit_rate,
  'bg_rate': bg_rate,
  'net_rate': net_rate,
- 'aa_mutations': aa_mutations,
+ 'aa_mutations': aa_mean,
+ 'aa_variance': aa_var,
  'mappable_bases': hit_cov,
  'hit_mismatches': hit_mis,
  'hit_coverage': hit_cov
@@ -1204,29 +1276,44 @@
  return None

  # Calculate empirical statistics
- aa_mutations_list = [r['aa_mutations'] for r in segment_results]
- net_rates_list = [r['net_rate'] for r in segment_results]
- mappable_bases_list = [r['mappable_bases'] for r in segment_results]
+ aa_mutations_list = np.array([r['aa_mutations'] for r in segment_results], dtype=float)
+ aa_variances = np.array([r.get('aa_variance', 0.0) for r in segment_results], dtype=float)
+ net_rates_list = np.array([r['net_rate'] for r in segment_results], dtype=float)
+ mappable_bases_list = np.array([r['mappable_bases'] for r in segment_results], dtype=float)
+
+ total_mappable_bases = float(mappable_bases_list.sum())
+ if total_mappable_bases > 0:
+ weights = mappable_bases_list
+ mean_aa = float(np.average(aa_mutations_list, weights=weights))
+ mean_net_rate = float(np.average(net_rates_list, weights=weights))
+ weighted_var = float(
+ np.sum(weights * (aa_variances + (aa_mutations_list - mean_aa) ** 2)) / total_mappable_bases
+ )
+ weighted_net_var = float(
+ np.sum(weights * ( (net_rates_list - mean_net_rate) ** 2 )) / total_mappable_bases
+ )
+ else:
+ weights = None
+ mean_aa = float(np.mean(aa_mutations_list))
+ mean_net_rate = float(np.mean(net_rates_list))
+ weighted_var = float(np.var(aa_mutations_list, ddof=1)) if len(aa_mutations_list) > 1 else 0.0
+ weighted_net_var = float(np.var(net_rates_list, ddof=1)) if len(net_rates_list) > 1 else 0.0

- mean_aa = np.mean(aa_mutations_list)
- std_aa = np.std(aa_mutations_list, ddof=1) # Sample standard deviation
- mean_net_rate = np.mean(net_rates_list)
- std_net_rate = np.std(net_rates_list, ddof=1)
- total_mappable_bases = sum(mappable_bases_list)
+ std_aa = float(np.sqrt(max(weighted_var, 0.0)))
+ std_net_rate = float(np.sqrt(max(weighted_net_var, 0.0)))

  # Calculate confidence interval using t-distribution
  n_segments = len(segment_results)
  if n_segments > 1:
- # 95% confidence interval
- from scipy.stats import t
- t_val = t.ppf(0.975, n_segments - 1)
  se_aa = std_aa / np.sqrt(n_segments)
- ci_lower = mean_aa - t_val * se_aa
- ci_upper = mean_aa + t_val * se_aa
+ ci_lower = mean_aa - 1.96 * se_aa
+ ci_upper = mean_aa + 1.96 * se_aa
  else:
  ci_lower = mean_aa
  ci_upper = mean_aa

+ ci_lower = max(ci_lower, 0.0)
+
  return {
  'mean_aa_mutations': mean_aa,
  'std_aa_mutations': std_aa,
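Note the two different weightings now in play: segments are pooled by coverage (mappable bases) as in the hunk above, whereas Q-score thresholds are later combined by precision. A toy version of the coverage-weighted pooling, with invented arrays:

```python
import numpy as np

seg_means = np.array([3.8, 4.3, 4.0])             # per-segment AA mutation means (invented)
seg_vars = np.array([0.20, 0.35, 0.25])           # per-segment Monte Carlo variances (invented)
coverage = np.array([12000.0, 8000.0, 20000.0])   # mappable bases per segment (invented)

mean_aa = np.average(seg_means, weights=coverage)
# Law of total variance: within-segment variance plus between-segment disagreement.
pooled_var = np.sum(coverage * (seg_vars + (seg_means - mean_aa) ** 2)) / coverage.sum()
std_aa = np.sqrt(pooled_var)

se = std_aa / np.sqrt(len(seg_means))
ci = (max(mean_aa - 1.96 * se, 0.0), mean_aa + 1.96 * se)  # normal-approximation CI, floored at zero
```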
@@ -2072,18 +2159,23 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
  ax3.bar(unique_vals, [1.0], color="#C44E52", alpha=0.7, width=0.1)
  ax3.set_xlim(unique_vals[0] - 0.5, unique_vals[0] + 0.5)
  else:
- # Not protein or no AA differences
- ax3.text(0.5, 0.5, "Not a protein‐coding region",
- horizontalalignment='center', verticalalignment='center',
- fontsize=12, color='gray', transform=ax3.transAxes)
-
- ax3.set_title("AA Mutation Distribution", fontsize=14, fontweight='bold')
- ax3.set_xlabel("Number of AA Mutations", fontsize=12)
- ax3.set_ylabel("Density", fontsize=12)
- ax3.spines['top'].set_visible(False)
- ax3.spines['right'].set_visible(False)
- ax3.set_xticks([])
- ax3.set_yticks([])
+ # Not protein or no AA differences — display an informative message
+ ax3.text(
+ 0.5,
+ 0.5,
+ "Amino-acid distribution unavailable",
+ horizontalalignment="center",
+ verticalalignment="center",
+ fontsize=12,
+ color="gray",
+ transform=ax3.transAxes,
+ )
+
+ ax3.set_title("AA Mutation Distribution", fontsize=14, fontweight='bold')
+ ax3.set_xlabel("Number of AA Mutations", fontsize=12)
+ ax3.set_ylabel("Density", fontsize=12)
+ ax3.spines['top'].set_visible(False)
+ ax3.spines['right'].set_visible(False)

  # Save the combined figure as both PNG and PDF
  panel_path_png = os.path.join(qscore_results_dir, "summary_panels.png")
@@ -2412,7 +2504,7 @@ def process_single_fastq(
  logging.info("Running QC analysis to get Q-score results...")
  qc_results = None
  try:
- qc_results, optimal_qscore = run_qc_analysis(
+ qc_results, consensus_info = run_qc_analysis(
  str(fastq_path),
  str(results_dir),
  str(region_fasta),
@@ -2420,8 +2512,13 @@ def process_single_fastq(
  )
  if qc_results is not None:
  logging.info("QC analysis completed successfully. Found %s Q-score results.", len(qc_results))
- if optimal_qscore is not None:
- logging.info("Optimal Q-score determined: %s", optimal_qscore)
+ if consensus_info and consensus_info.get("consensus_mean") is not None:
+ logging.info(
+ "Consensus AA mutations per gene: %.4f ± %.4f (thresholds used: %s)",
+ consensus_info["consensus_mean"],
+ consensus_info.get("consensus_std", 0.0),
+ consensus_info.get("thresholds_used"),
+ )
  else:
  logging.warning("QC analysis completed but no Q-score results found.")
  except Exception as exc:
{uht_tooling-0.1.4 → uht_tooling-0.1.5}/src/uht_tooling.egg-info/PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: uht-tooling
- Version: 0.1.4
+ Version: 0.1.5
  Summary: Tooling for ultra-high throughput screening workflows.
  Author: Matt115A
  License: MIT
@@ -35,7 +35,7 @@ Automation helpers for ultra-high-throughput molecular biology workflows. The pa

  ### Quick install (recommended, easiest file maintainance)
  ```bash
- pip install "uht-tooling[gui]==0.1.3"
+ pip install "uht-tooling[gui]==0.1.4"

  ```

@@ -222,7 +222,14 @@ Please be aware, this toolkit will not scale well beyond around 50k reads/sample
  --fastq data/ep-library-profile/*.fastq.gz \
  --output-dir results/ep-library-profile/
  ```
- - Output bundle includes per-sample directories and a master summary TSV.
+ - Output bundle includes per-sample directories, a master summary TSV, and a `summary_panels` figure that visualises positional mutation rates, coverage, and amino-acid simulations.
+
+ **How the mutation rate and AA expectations are derived**
+
+ 1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the “target” rate; mismatches elsewhere provide the background.
+ 2. The per-base background rate is subtracted from the target rate to yield a net nucleotide mutation rate, and the standard deviation reflects binomial sampling and quality-score uncertainty.
+ 3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these drive the AA mutation mean/variance that appear in the panel plot.
+ 4. If multiple Q-score thresholds are analysed, the CLI aggregates them via a precision-weighted consensus (1 / standard deviation weighting) after filtering out thresholds with insufficient coverage; the consensus value is written to `aa_mutation_consensus.txt` and plotted as a horizontal guide.

  ---
