uht-tooling 0.1.9__tar.gz → 0.2.0__tar.gz

Files changed (24)
  1. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/PKG-INFO +48 -4
  2. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/README.md +47 -3
  3. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/pyproject.toml +1 -1
  4. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/workflows/mut_rate.py +476 -122
  5. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling.egg-info/PKG-INFO +48 -4
  6. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/setup.cfg +0 -0
  7. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/__init__.py +0 -0
  8. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/cli.py +0 -0
  9. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/models/__init__.py +0 -0
  10. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/workflows/__init__.py +0 -0
  11. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/workflows/design_gibson.py +0 -0
  12. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/workflows/design_kld.py +0 -0
  13. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/workflows/design_slim.py +0 -0
  14. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/workflows/gui.py +0 -0
  15. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/workflows/mutation_caller.py +0 -0
  16. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/workflows/nextera_designer.py +0 -0
  17. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/workflows/profile_inserts.py +0 -0
  18. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/workflows/umi_hunter.py +0 -0
  19. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling.egg-info/SOURCES.txt +0 -0
  20. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling.egg-info/dependency_links.txt +0 -0
  21. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling.egg-info/entry_points.txt +0 -0
  22. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling.egg-info/requires.txt +0 -0
  23. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling.egg-info/top_level.txt +0 -0
  24. {uht_tooling-0.1.9 → uht_tooling-0.2.0}/tests/test_design_kld.py +0 -0
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: uht-tooling
3
- Version: 0.1.9
3
+ Version: 0.2.0
4
4
  Summary: Tooling for ultra-high throughput screening workflows.
5
5
  Author: Matt115A
6
6
  License-Expression: MIT
@@ -313,13 +313,57 @@ Please be aware, this toolkit will not scale well beyond around 50k reads/sample
313
313
  --fastq data/ep-library-profile/*.fastq.gz \
314
314
  --output-dir results/ep-library-profile/
315
315
  ```
316
- - Output bundle includes per-sample directories, a master summary TSV, and a `summary_panels` figure that visualises positional mutation rates, coverage, and amino-acid simulations.
316
+
317
+ **Output structure**
318
+
319
+ Each sample produces an organized output directory:
320
+
321
+ ```
322
+ sample_name/
323
+ ├── KEY_FINDINGS.txt # Lay-user executive summary
324
+ ├── summary_panels.png/pdf # Main visualization
325
+ ├── aa_mutation_consensus.txt # Consensus estimate details
326
+ ├── run.log # Analysis log
327
+ └── detailed/ # Technical outputs
328
+     ├── methodology_notes.txt # Documents which lambda drives what
329
+     ├── lambda_comparison.csv # Side-by-side lambda comparison
330
+     ├── gene_mismatch_rates.csv
331
+     ├── base_distribution.csv
332
+     ├── aa_substitutions.csv
333
+     ├── plasmid_coverage.csv
334
+     ├── aa_mutation_distribution.csv
335
+     ├── comprehensive_qc_data.csv
336
+     ├── simple_qc_data.csv
337
+     └── qc_plots/ # QC visualizations
338
+         ├── qc_plot_*.png
339
+         ├── comprehensive_qc_analysis.png
340
+         ├── error_analysis.png
341
+         └── qc_mutation_rate_vs_quality.png/csv
342
+ ```
343
+
344
+ **Lambda estimates: which to use**
345
+
346
+ The profiler calculates lambda (mutations per gene copy) via two methods:
347
+
348
+ | Method | Formula | Error Quantified? | Used For |
349
+ |--------|---------|-------------------|----------|
350
+ | Simple | `(hit_rate - bg_rate) × seq_len` | No | KDE plot, Monte Carlo simulation |
351
+ | Consensus | Precision-weighted average across Q-scores | Yes | Recommended for reporting |
352
+
353
+ - **For publication/reporting**: Use the consensus value from `KEY_FINDINGS.txt` or `aa_mutation_consensus.txt`.
354
+ - **For understanding distribution shape**: See the KDE plot in `summary_panels.png` (note: uses simple lambda).
355
+ - **For detailed error analysis**: See `detailed/comprehensive_qc_data.csv`.
356
+
357
+ The `KEY_FINDINGS.txt` file provides a plain-language summary including:
358
+ - Expected AA mutations per gene copy
359
+ - Poisson-based interpretation (% wild-type, % 1 mutation, % 2+ mutations)
360
+ - Quality assessment (GOOD/ACCEPTABLE/LOW COVERAGE)
317
361
 
318
362
  **How the mutation rate and AA expectations are derived**
319
363
 
320
- 1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the target rate; mismatches elsewhere provide the background.
364
+ 1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the "target" rate; mismatches elsewhere provide the background.
321
365
  2. The per-base background rate is subtracted from the target rate to yield a net nucleotide mutation rate, and the standard deviation reflects binomial sampling and quality-score uncertainty.
322
- 3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these drives the AA mutation mean/variance that appear in the panel plot.
366
+ 3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these drive the AA mutation mean/variance that appear in the panel plot.
323
367
  4. If multiple Q-score thresholds are analysed, the CLI aggregates them via a precision-weighted consensus (1 / standard deviation weighting) after filtering out thresholds with insufficient coverage; the consensus value is written to `aa_mutation_consensus.txt` and plotted as a horizontal guide.
324
368
 
325
369
  ---
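
The Poisson-based interpretation that the hunk above lists for `KEY_FINDINGS.txt` (% wild-type, % 1 mutation, % 2+ mutations) can be illustrated with a short sketch. It assumes the number of AA mutations per gene copy is Poisson-distributed with mean lambda; the helper name `poisson_breakdown` is hypothetical and does not come from the package.

```python
import math


def poisson_breakdown(lam):
    """Return (P(0), P(1), P(>=2)) for a Poisson mean `lam`.

    Hypothetical helper for illustration only; not part of uht-tooling.
    """
    if lam < 0:
        raise ValueError("lambda must be non-negative")
    p0 = math.exp(-lam)               # fraction of wild-type copies
    p1 = lam * p0                     # fraction with exactly one mutation
    p2_plus = max(0.0, 1.0 - p0 - p1) # remainder: two or more mutations
    return p0, p1, p2_plus


if __name__ == "__main__":
    wt, one, multi = poisson_breakdown(1.5)
    print(f"wild-type: {wt:.1%}, 1 mutation: {one:.1%}, 2+ mutations: {multi:.1%}")
```

Under this assumption, a library with lambda near 1 has roughly equal shares of wild-type and single-mutant copies, which is one way to read the headline number in the summary file.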
@@ -284,13 +284,57 @@ Please be aware, this toolkit will not scale well beyond around 50k reads/sample
284
284
  --fastq data/ep-library-profile/*.fastq.gz \
285
285
  --output-dir results/ep-library-profile/
286
286
  ```
287
- - Output bundle includes per-sample directories, a master summary TSV, and a `summary_panels` figure that visualises positional mutation rates, coverage, and amino-acid simulations.
287
+
288
+ **Output structure**
289
+
290
+ Each sample produces an organized output directory:
291
+
292
+ ```
293
+ sample_name/
294
+ ├── KEY_FINDINGS.txt # Lay-user executive summary
295
+ ├── summary_panels.png/pdf # Main visualization
296
+ ├── aa_mutation_consensus.txt # Consensus estimate details
297
+ ├── run.log # Analysis log
298
+ └── detailed/ # Technical outputs
299
+     ├── methodology_notes.txt # Documents which lambda drives what
300
+     ├── lambda_comparison.csv # Side-by-side lambda comparison
301
+     ├── gene_mismatch_rates.csv
302
+     ├── base_distribution.csv
303
+     ├── aa_substitutions.csv
304
+     ├── plasmid_coverage.csv
305
+     ├── aa_mutation_distribution.csv
306
+     ├── comprehensive_qc_data.csv
307
+     ├── simple_qc_data.csv
308
+     └── qc_plots/ # QC visualizations
309
+         ├── qc_plot_*.png
310
+         ├── comprehensive_qc_analysis.png
311
+         ├── error_analysis.png
312
+         └── qc_mutation_rate_vs_quality.png/csv
313
+ ```
314
+
315
+ **Lambda estimates: which to use**
316
+
317
+ The profiler calculates lambda (mutations per gene copy) via two methods:
318
+
319
+ | Method | Formula | Error Quantified? | Used For |
320
+ |--------|---------|-------------------|----------|
321
+ | Simple | `(hit_rate - bg_rate) × seq_len` | No | KDE plot, Monte Carlo simulation |
322
+ | Consensus | Precision-weighted average across Q-scores | Yes | Recommended for reporting |
323
+
324
+ - **For publication/reporting**: Use the consensus value from `KEY_FINDINGS.txt` or `aa_mutation_consensus.txt`.
325
+ - **For understanding distribution shape**: See the KDE plot in `summary_panels.png` (note: uses simple lambda).
326
+ - **For detailed error analysis**: See `detailed/comprehensive_qc_data.csv`.
327
+
328
+ The `KEY_FINDINGS.txt` file provides a plain-language summary including:
329
+ - Expected AA mutations per gene copy
330
+ - Poisson-based interpretation (% wild-type, % 1 mutation, % 2+ mutations)
331
+ - Quality assessment (GOOD/ACCEPTABLE/LOW COVERAGE)
288
332
 
289
333
  **How the mutation rate and AA expectations are derived**
290
334
 
291
- 1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the target rate; mismatches elsewhere provide the background.
335
+ 1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the "target" rate; mismatches elsewhere provide the background.
292
336
  2. The per-base background rate is subtracted from the target rate to yield a net nucleotide mutation rate, and the standard deviation reflects binomial sampling and quality-score uncertainty.
293
- 3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these drives the AA mutation mean/variance that appear in the panel plot.
337
+ 3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these drive the AA mutation mean/variance that appear in the panel plot.
294
338
  4. If multiple Q-score thresholds are analysed, the CLI aggregates them via a precision-weighted consensus (1 / standard deviation weighting) after filtering out thresholds with insufficient coverage; the consensus value is written to `aa_mutation_consensus.txt` and plotted as a horizontal guide.
295
339
 
296
340
  ---
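
Step 4 in the hunk above describes a precision-weighted consensus across Q-score thresholds, using 1 / standard deviation weights after dropping thresholds with insufficient coverage. The following is a minimal sketch of that idea, not the package's implementation: the function name `precision_weighted_consensus`, the `min_bases` parameter, and its default are assumptions, since the actual coverage filter is not shown in this diff.

```python
def precision_weighted_consensus(estimates, min_bases=1000):
    """Combine per-threshold estimates with 1/std weights.

    `estimates` is an iterable of (mean, std, mappable_bases) tuples,
    one per Q-score threshold. `min_bases` is an assumed coverage cutoff.
    Returns the weighted mean, or None if no threshold passes the filter.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for mean, std, bases in estimates:
        # Skip thresholds with too little coverage or an unusable std.
        if bases < min_bases or std <= 0:
            continue
        weight = 1.0 / std  # 1 / standard deviation, as the README states
        weighted_sum += weight * mean
        weight_total += weight
    return weighted_sum / weight_total if weight_total > 0 else None


if __name__ == "__main__":
    per_threshold = [(1.0, 0.5, 5000), (2.0, 1.0, 5000), (9.0, 0.1, 10)]
    print(precision_weighted_consensus(per_threshold))
```

In this example the third estimate is excluded by the coverage filter, so a tight but poorly supported value cannot dominate the consensus.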
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
4
4
 
5
5
  [project]
6
6
  name = "uht-tooling"
7
- version = "0.1.9"
7
+ version = "0.2.0"
8
8
  description = "Tooling for ultra-high throughput screening workflows."
9
9
  readme = "README.md"
10
10
  requires-python = ">=3.8"
@@ -743,40 +743,50 @@ def create_simple_qc_plots(quality_thresholds, qc_results, results_dir, consensu
743
743
 
744
744
  plt.tight_layout()
745
745
 
746
- # Save the plot
746
+ # Create detailed/qc_plots/ subdirectory for QC plots
747
+ detailed_dir = os.path.join(results_dir, "detailed")
748
+ qc_plots_dir = os.path.join(detailed_dir, "qc_plots")
749
+ os.makedirs(qc_plots_dir, exist_ok=True)
750
+
751
+ # Save the plot to detailed/qc_plots/
747
752
  project_name = os.path.basename(results_dir)
748
- qc_plot_path = os.path.join(results_dir, f"qc_plot_{project_name}.png")
753
+ qc_plot_path = os.path.join(qc_plots_dir, f"qc_plot_{project_name}.png")
749
754
  fig.savefig(qc_plot_path, dpi=300, bbox_inches='tight')
750
755
  plt.close(fig)
751
-
756
+
752
757
  logging.info(f"QC plot saved to: {qc_plot_path}")
753
-
754
- # Save data as CSV
755
- qc_data_path = os.path.join(results_dir, "simple_qc_data.csv")
758
+
759
+ # Save data as CSV to detailed/
760
+ qc_data_path = os.path.join(detailed_dir, "simple_qc_data.csv")
756
761
  with open(qc_data_path, 'w') as f:
757
762
  f.write("quality_threshold,mean_aa_mutations,std_aa_mutations,ci_lower,ci_upper,")
758
763
  f.write("total_mappable_bases,n_segments\n")
759
-
764
+
760
765
  for q, r in zip(quality_thresholds, qc_results):
761
766
  f.write(f"{q},{r['mean_aa_mutations']:.6f},{r['std_aa_mutations']:.6f},")
762
767
  f.write(f"{r['ci_lower']:.6f},{r['ci_upper']:.6f},")
763
768
  f.write(f"{r['total_mappable_bases']},{r['n_segments']}\n")
764
-
769
+
765
770
  logging.info(f"Simple QC data saved to: {qc_data_path}")
766
-
771
+
767
772
  except Exception as e:
768
773
  logging.error(f"Error creating simple QC plots: {e}")
774
+
775
+ def create_comprehensive_qc_plots(quality_thresholds, qc_results, results_dir):
769
776
  """
770
777
  Create comprehensive QC plots with error bars and uncertainty quantification.
771
-
778
+
772
779
  Args:
773
780
  quality_thresholds: List of quality score thresholds
774
781
  qc_results: List of comprehensive analysis results
775
782
  results_dir: Directory to save the plots
776
- optimal_qscore: Optimal Q-score threshold (optional)
777
- optimal_result: Optimal result data (optional)
778
783
  """
779
784
  try:
785
+ # Create detailed/qc_plots/ subdirectory
786
+ detailed_dir = os.path.join(results_dir, "detailed")
787
+ qc_plots_dir = os.path.join(detailed_dir, "qc_plots")
788
+ os.makedirs(qc_plots_dir, exist_ok=True)
789
+
780
790
  # Extract data for plotting
781
791
  aa_mutations = [r['mean_aa_mutations'] for r in qc_results]
782
792
  aa_errors = [r['std_aa_mutations'] for r in qc_results]
@@ -785,46 +795,46 @@ def create_simple_qc_plots(quality_thresholds, qc_results, results_dir, consensu
785
795
  mappable_bases = [r['mappable_bases'] for r in qc_results]
786
796
  net_rates = [r['net_rate'] for r in qc_results]
787
797
  net_rate_errors = [r['net_rate_error'] for r in qc_results]
788
-
798
+
789
799
  # Create main QC plot with error bars
790
800
  fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 12))
791
-
801
+
792
802
  # Top plot: AA mutations per gene with error bars
793
803
  color1 = '#2E8B57'
794
- ax1.errorbar(quality_thresholds, aa_mutations, yerr=aa_errors,
795
- fmt='o', capsize=5, capthick=2, markersize=8,
804
+ ax1.errorbar(quality_thresholds, aa_mutations, yerr=aa_errors,
805
+ fmt='o', capsize=5, capthick=2, markersize=8,
796
806
  color=color1, ecolor=color1, alpha=0.8, label='Mean ± Std')
797
-
807
+
798
808
  # Add confidence intervals as shaded area
799
- ax1.fill_between(quality_thresholds, aa_ci_lower, aa_ci_upper,
809
+ ax1.fill_between(quality_thresholds, aa_ci_lower, aa_ci_upper,
800
810
  alpha=0.3, color=color1, label='95% Confidence Interval')
801
-
811
+
802
812
  ax1.set_xlabel('Quality Score Threshold', fontsize=12, fontweight='bold')
803
813
  ax1.set_ylabel('Estimated AA Mutations per Gene', fontsize=12, fontweight='bold', color=color1)
804
814
  ax1.tick_params(axis='y', labelcolor=color1)
805
- ax1.set_title('AA Mutations per Gene vs Quality Score Filter (with Error Propagation)',
815
+ ax1.set_title('AA Mutations per Gene vs Quality Score Filter (with Error Propagation)',
806
816
  fontsize=14, fontweight='bold')
807
817
  ax1.grid(True, alpha=0.3)
808
818
  ax1.legend(frameon=False, fontsize=10)
809
-
819
+
810
820
  # Add data point labels
811
821
  for i, (q, aa_mut, aa_err) in enumerate(zip(quality_thresholds, aa_mutations, aa_errors)):
812
- ax1.annotate(f'Q{q}\n{aa_mut:.3f}±{aa_err:.3f}',
813
- (q, aa_mut), xytext=(5, 5),
822
+ ax1.annotate(f'Q{q}\n{aa_mut:.3f}±{aa_err:.3f}',
823
+ (q, aa_mut), xytext=(5, 5),
814
824
  textcoords='offset points', fontsize=8, alpha=0.8, color=color1)
815
-
825
+
816
826
  # Bottom plot: Mappable bases and AA mutations per gene
817
827
  color2 = '#FF6B6B'
818
828
  color3 = '#4169E1'
819
-
829
+
820
830
  # Mappable bases (left y-axis)
821
831
  ax2_twin = ax2.twinx()
822
- ax2_twin.scatter(quality_thresholds, mappable_bases,
823
- s=100, alpha=0.7, color=color2, edgecolors='black',
832
+ ax2_twin.scatter(quality_thresholds, mappable_bases,
833
+ s=100, alpha=0.7, color=color2, edgecolors='black',
824
834
  linewidth=1, marker='s', label='Mappable Bases')
825
835
  ax2_twin.set_ylabel('Number of Mappable Bases', fontsize=12, fontweight='bold', color=color2)
826
836
  ax2_twin.tick_params(axis='y', labelcolor=color2)
827
-
837
+
828
838
  # AA mutations per gene with error bars (right y-axis)
829
839
  ax2.errorbar(quality_thresholds, aa_mutations, yerr=aa_errors,
830
840
  fmt='^', capsize=5, capthick=2, markersize=8,
@@ -832,34 +842,34 @@ def create_simple_qc_plots(quality_thresholds, qc_results, results_dir, consensu
832
842
  ax2.set_ylabel('Estimated AA Mutations per Gene', fontsize=12, fontweight='bold', color=color3)
833
843
  ax2.tick_params(axis='y', labelcolor=color3)
834
844
  ax2.set_xlabel('Quality Score Threshold', fontsize=12, fontweight='bold')
835
- ax2.set_title('Mappable Bases and AA Mutations per Gene vs Quality Score Filter',
845
+ ax2.set_title('Mappable Bases and AA Mutations per Gene vs Quality Score Filter',
836
846
  fontsize=14, fontweight='bold')
837
847
  ax2.grid(True, alpha=0.3)
838
-
848
+
839
849
  # Add legends
840
850
  lines1, labels1 = ax2.get_legend_handles_labels()
841
851
  lines2, labels2 = ax2_twin.get_legend_handles_labels()
842
852
  ax2.legend(lines1 + lines2, labels1 + labels2, loc='upper right', frameon=False, fontsize=10)
843
-
853
+
844
854
  # Add data point labels for mappable bases
845
855
  for i, (q, bases) in enumerate(zip(quality_thresholds, mappable_bases)):
846
- ax2_twin.annotate(f'{bases}', (q, bases), xytext=(5, -15),
856
+ ax2_twin.annotate(f'{bases}', (q, bases), xytext=(5, -15),
847
857
  textcoords='offset points', fontsize=8, alpha=0.8, color=color2)
848
-
858
+
849
859
  plt.tight_layout()
850
-
851
- # Save the comprehensive plot
852
- qc_plot_path = os.path.join(results_dir, "comprehensive_qc_analysis.png")
860
+
861
+ # Save the comprehensive plot to detailed/qc_plots/
862
+ qc_plot_path = os.path.join(qc_plots_dir, "comprehensive_qc_analysis.png")
853
863
  fig.savefig(qc_plot_path, dpi=300, bbox_inches='tight')
854
864
  plt.close(fig)
855
-
865
+
856
866
  logging.info(f"Comprehensive QC plot saved to: {qc_plot_path}")
857
-
867
+
858
868
  # Create error analysis plot
859
869
  create_error_analysis_plot(quality_thresholds, qc_results, results_dir)
860
-
861
- # Save comprehensive data as CSV
862
- qc_data_path = os.path.join(results_dir, "comprehensive_qc_data.csv")
870
+
871
+ # Save comprehensive data as CSV to detailed/
872
+ qc_data_path = os.path.join(detailed_dir, "comprehensive_qc_data.csv")
863
873
  with open(qc_data_path, 'w') as f:
864
874
  f.write("quality_threshold,mean_aa_mutations,std_aa_mutations,ci_lower,ci_upper,")
865
875
  f.write("mappable_bases,hit_rate,hit_rate_ci_lower,hit_rate_ci_upper,")
@@ -869,7 +879,7 @@ def create_simple_qc_plots(quality_thresholds, qc_results, results_dir, consensu
869
879
  f.write("bg_qscore_mean,bg_qscore_std,bg_qscore_uncertainty,")
870
880
  f.write("hit_weighted_rate,hit_weighted_error,bg_weighted_rate,bg_weighted_error,")
871
881
  f.write("net_weighted_rate,net_weighted_error,lambda_bp_weighted,lambda_error_weighted\n")
872
-
882
+
873
883
  for q, r in zip(quality_thresholds, qc_results):
874
884
  f.write(f"{q},{r['mean_aa_mutations']:.6f},{r['std_aa_mutations']:.6f},")
875
885
  f.write(f"{r['ci_lower']:.6f},{r['ci_upper']:.6f},")
@@ -878,93 +888,98 @@ def create_simple_qc_plots(quality_thresholds, qc_results, results_dir, consensu
878
888
  f.write(f"{r['bg_rate']:.6f},{r['bg_rate_ci'][0]:.6f},{r['bg_rate_ci'][1]:.6f},")
879
889
  f.write(f"{r['net_rate']:.6f},{r['net_rate_error']:.6f},")
880
890
  f.write(f"{r['lambda_bp']:.6f},{r['lambda_error']:.6f},{r['alignment_error']:.6f},")
881
-
891
+
882
892
  # Q-score information
883
893
  hit_qscore_mean = r['hit_qscore_stats']['mean_qscore'] if r['hit_qscore_stats'] else 0.0
884
894
  hit_qscore_std = r['hit_qscore_stats']['std_qscore'] if r['hit_qscore_stats'] else 0.0
885
895
  bg_qscore_mean = r['bg_qscore_stats']['mean_qscore'] if r['bg_qscore_stats'] else 0.0
886
896
  bg_qscore_std = r['bg_qscore_stats']['std_qscore'] if r['bg_qscore_stats'] else 0.0
887
-
897
+
888
898
  f.write(f"{hit_qscore_mean:.2f},{hit_qscore_std:.2f},{r['hit_qscore_uncertainty']:.6f},")
889
899
  f.write(f"{bg_qscore_mean:.2f},{bg_qscore_std:.2f},{r['bg_qscore_uncertainty']:.6f},")
890
900
  f.write(f"{r.get('hit_weighted_rate', 0.0):.6f},{r.get('hit_weighted_error', 0.0):.6f},")
891
901
  f.write(f"{r.get('bg_weighted_rate', 0.0):.6f},{r.get('bg_weighted_error', 0.0):.6f},")
892
902
  f.write(f"{r.get('net_weighted_rate', 0.0):.6f},{r.get('net_weighted_error', 0.0):.6f},")
893
903
  f.write(f"{r.get('lambda_bp_weighted', 0.0):.6f},{r.get('lambda_error_weighted', 0.0):.6f}\n")
894
-
904
+
895
905
  logging.info(f"Comprehensive QC data saved to: {qc_data_path}")
896
-
906
+
897
907
  except Exception as e:
898
908
  logging.error(f"Error creating comprehensive QC plots: {e}")
899
909
 
900
910
  def create_error_analysis_plot(quality_thresholds, qc_results, results_dir):
901
911
  """
902
912
  Create a detailed error analysis plot showing different sources of uncertainty.
903
-
913
+
904
914
  Args:
905
915
  quality_thresholds: List of quality score thresholds
906
916
  qc_results: List of comprehensive analysis results
907
917
  results_dir: Directory to save the plot
908
918
  """
909
919
  try:
920
+ # Create detailed/qc_plots/ subdirectory
921
+ detailed_dir = os.path.join(results_dir, "detailed")
922
+ qc_plots_dir = os.path.join(detailed_dir, "qc_plots")
923
+ os.makedirs(qc_plots_dir, exist_ok=True)
924
+
910
925
  fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))
911
-
926
+
912
927
  # Extract error components
913
928
  aa_std = [r['std_aa_mutations'] for r in qc_results]
914
929
  net_rate_errors = [r['net_rate_error'] for r in qc_results]
915
930
  lambda_errors = [r['lambda_error'] for r in qc_results]
916
931
  alignment_errors = [r['alignment_error'] for r in qc_results]
917
932
  mappable_bases = [r['mappable_bases'] for r in qc_results]
918
-
933
+
919
934
  # Plot 1: AA mutation uncertainty vs quality threshold
920
935
  ax1.plot(quality_thresholds, aa_std, 'o-', color='#2E8B57', linewidth=2, markersize=6)
921
936
  ax1.set_xlabel('Quality Score Threshold')
922
937
  ax1.set_ylabel('AA Mutation Standard Deviation')
923
938
  ax1.set_title('AA Mutation Uncertainty vs Quality Filter')
924
939
  ax1.grid(True, alpha=0.3)
925
-
940
+
926
941
  # Plot 2: Net rate error vs quality threshold
927
942
  ax2.plot(quality_thresholds, net_rate_errors, 's-', color='#FF6B6B', linewidth=2, markersize=6)
928
943
  ax2.set_xlabel('Quality Score Threshold')
929
944
  ax2.set_ylabel('Net Mutation Rate Error')
930
945
  ax2.set_title('Net Rate Error vs Quality Filter')
931
946
  ax2.grid(True, alpha=0.3)
932
-
947
+
933
948
  # Plot 3: Lambda error vs quality threshold
934
949
  ax3.plot(quality_thresholds, lambda_errors, '^-', color='#4169E1', linewidth=2, markersize=6)
935
950
  ax3.set_xlabel('Quality Score Threshold')
936
951
  ax3.set_ylabel('Lambda Error (mutations per copy)')
937
952
  ax3.set_title('Lambda Error vs Quality Filter')
938
953
  ax3.grid(True, alpha=0.3)
939
-
954
+
940
955
  # Plot 4: Alignment error vs mappable bases
941
956
  ax4.scatter(mappable_bases, alignment_errors, s=100, alpha=0.7, color='#FF8C00')
942
957
  ax4.set_xlabel('Mappable Bases')
943
- ax4.set_ylabel('Alignment Error (1/√reads)')
958
+ ax4.set_ylabel('Alignment Error (1/sqrt(reads))')
944
959
  ax4.set_title('Alignment Error vs Read Count')
945
960
  ax4.grid(True, alpha=0.3)
946
-
961
+
947
962
  # Add quality threshold labels to scatter plot
948
963
  for i, q in enumerate(quality_thresholds):
949
- ax4.annotate(f'Q{q}', (mappable_bases[i], alignment_errors[i]),
964
+ ax4.annotate(f'Q{q}', (mappable_bases[i], alignment_errors[i]),
950
965
  xytext=(5, 5), textcoords='offset points', fontsize=8)
951
-
966
+
952
967
  plt.tight_layout()
953
-
954
- # Save error analysis plot
955
- error_plot_path = os.path.join(results_dir, "error_analysis.png")
968
+
969
+ # Save error analysis plot to detailed/qc_plots/
970
+ error_plot_path = os.path.join(qc_plots_dir, "error_analysis.png")
956
971
  fig.savefig(error_plot_path, dpi=300, bbox_inches='tight')
957
972
  plt.close(fig)
958
-
973
+
959
974
  logging.info(f"Error analysis plot saved to: {error_plot_path}")
960
-
975
+
961
976
  except Exception as e:
962
977
  logging.error(f"Error creating error analysis plot: {e}")
963
978
 
964
979
  def create_qc_plot(quality_thresholds, aa_mutations, mappable_bases, results_dir):
965
980
  """
966
981
  Create a dual-axis plot showing quality score threshold vs AA mutations per gene and mappable bases.
967
-
982
+
968
983
  Args:
969
984
  quality_thresholds: List of quality score thresholds
970
985
  aa_mutations: List of corresponding AA mutations per gene
@@ -972,68 +987,73 @@ def create_qc_plot(quality_thresholds, aa_mutations, mappable_bases, results_dir
972
987
  results_dir: Directory to save the plot
973
988
  """
974
989
  try:
990
+ # Create detailed/qc_plots/ subdirectory
991
+ detailed_dir = os.path.join(results_dir, "detailed")
992
+ qc_plots_dir = os.path.join(detailed_dir, "qc_plots")
993
+ os.makedirs(qc_plots_dir, exist_ok=True)
994
+
975
995
  # Create the plot with dual y-axes
976
996
  fig, ax1 = plt.subplots(figsize=(12, 8))
977
-
997
+
978
998
  # Left y-axis: AA mutations per gene
979
999
  color1 = '#2E8B57'
980
1000
  ax1.set_xlabel('Quality Score Threshold', fontsize=12, fontweight='bold')
981
1001
  ax1.set_ylabel('Estimated AA Mutations per Gene', fontsize=12, fontweight='bold', color=color1)
982
- ax1.scatter(quality_thresholds, aa_mutations,
1002
+ ax1.scatter(quality_thresholds, aa_mutations,
983
1003
  s=100, alpha=0.7, color=color1, edgecolors='black', linewidth=1, label='AA Mutations per Gene')
984
1004
  ax1.tick_params(axis='y', labelcolor=color1)
985
-
1005
+
986
1006
  # Right y-axis: Mappable bases
987
1007
  ax2 = ax1.twinx()
988
1008
  color2 = '#FF6B6B'
989
1009
  ax2.set_ylabel('Number of Mappable Bases', fontsize=12, fontweight='bold', color=color2)
990
- ax2.scatter(quality_thresholds, mappable_bases,
1010
+ ax2.scatter(quality_thresholds, mappable_bases,
991
1011
  s=100, alpha=0.7, color=color2, edgecolors='black', linewidth=1, marker='s', label='Mappable Bases')
992
1012
  ax2.tick_params(axis='y', labelcolor=color2)
993
-
1013
+
994
1014
  # Customize the plot
995
1015
  ax1.set_title('AA Mutations per Gene and Mappable Bases vs Quality Score Filter', fontsize=14, fontweight='bold')
996
-
1016
+
997
1017
  # Add grid for better readability
998
1018
  ax1.grid(True, alpha=0.3)
999
-
1019
+
1000
1020
  # Customize ticks and spines
1001
1021
  ax1.tick_params(axis='both', which='major', labelsize=10, direction='in', length=6)
1002
1022
  ax1.tick_params(axis='both', which='minor', direction='in', length=3)
1003
1023
  ax1.spines['top'].set_visible(False)
1004
1024
  ax1.spines['right'].set_visible(False)
1005
-
1025
+
1006
1026
  # Add data point labels for AA mutations
1007
1027
  for i, (q, aa_mut) in enumerate(zip(quality_thresholds, aa_mutations)):
1008
- ax1.annotate(f'Q{q}', (q, aa_mut), xytext=(5, 5),
1028
+ ax1.annotate(f'Q{q}', (q, aa_mut), xytext=(5, 5),
1009
1029
  textcoords='offset points', fontsize=9, alpha=0.8, color=color1)
1010
-
1030
+
1011
1031
  # Add data point labels for mappable bases
1012
1032
  for i, (q, bases) in enumerate(zip(quality_thresholds, mappable_bases)):
1013
- ax2.annotate(f'{reads}', (q, reads), xytext=(5, -15),
1033
+ ax2.annotate(f'{bases}', (q, bases), xytext=(5, -15),
1014
1034
  textcoords='offset points', fontsize=8, alpha=0.8, color=color2)
1015
-
1035
+
1016
1036
  # Add legend
1017
1037
  lines1, labels1 = ax1.get_legend_handles_labels()
1018
1038
  lines2, labels2 = ax2.get_legend_handles_labels()
1019
1039
  ax1.legend(lines1 + lines2, labels1 + labels2, loc='upper right', frameon=False, fontsize=10)
1020
-
1021
- # Save the plot
1022
- qc_plot_path = os.path.join(results_dir, "qc_mutation_rate_vs_quality.png")
1040
+
1041
+ # Save the plot to detailed/qc_plots/
1042
+ qc_plot_path = os.path.join(qc_plots_dir, "qc_mutation_rate_vs_quality.png")
1023
1043
  fig.savefig(qc_plot_path, dpi=300, bbox_inches='tight')
1024
1044
  plt.close(fig)
1025
-
1045
+
1026
1046
  logging.info(f"QC plot saved to: {qc_plot_path}")
1027
-
1028
- # Also save data as CSV for reference
1029
- qc_data_path = os.path.join(results_dir, "qc_mutation_rate_vs_quality.csv")
1047
+
1048
+ # Also save data as CSV to detailed/qc_plots/
1049
+ qc_data_path = os.path.join(qc_plots_dir, "qc_mutation_rate_vs_quality.csv")
1030
1050
  with open(qc_data_path, 'w') as f:
1031
1051
  f.write("quality_threshold,aa_mutations_per_gene,mappable_bases\n")
1032
1052
  for q, aa_mut, bases in zip(quality_thresholds, aa_mutations, mappable_bases):
1033
1053
  f.write(f"{q},{aa_mut:.6f},{bases}\n")
1034
-
1054
+
1035
1055
  logging.info(f"QC data saved to: {qc_data_path}")
1036
-
1056
+
1037
1057
  except Exception as e:
1038
1058
  logging.error(f"Error creating QC plot: {e}")
1039
1059
 
@@ -1330,74 +1350,74 @@ def run_segmented_analysis(segment_files, quality_threshold, work_dir, ref_hit_f
1330
1350
  except Exception as e:
1331
1351
  logging.error(f"Error in segmented analysis: {e}")
1332
1352
  return None
1353
+
1354
+ def calculate_qscore_weighted_mismatches(sam_file, ref_seq, qscore_stats):
1333
1355
  """
1334
1356
  Calculate mismatches weighted by Q-score uncertainty with proper sampling error.
1335
-
1357
+
1336
1358
  Args:
1337
1359
  sam_file: Path to SAM file
1338
1360
  ref_seq: Reference sequence
1339
1361
  qscore_stats: Q-score statistics from extract_qscores_from_sam
1340
-
1362
+
1341
1363
  Returns:
1342
1364
  tuple: (weighted_mismatches, total_weighted_coverage, raw_mismatches, raw_coverage, position_weights, position_outcomes)
1343
1365
  """
1344
1366
  try:
1345
- import pysam
1346
-
1347
1367
  weighted_mismatches = 0.0
1348
1368
  total_weighted_coverage = 0.0
1349
1369
  raw_mismatches = 0
1350
1370
  raw_coverage = 0
1351
-
1371
+
1352
1372
  # Store position-level data for proper sampling error calculation
1353
1373
  position_weights = []
1354
1374
  position_outcomes = []
1355
-
1375
+
1356
1376
  position_qscores = qscore_stats['position_avg_qscores']
1357
-
1377
+
1358
1378
  with pysam.AlignmentFile(sam_file, "r") as samfile:
1359
1379
  for read in samfile:
1360
1380
  if read.is_unmapped:
1361
1381
  continue
1362
-
1382
+
1363
1383
  # Get aligned pairs (read_pos, ref_pos)
1364
1384
  for read_pos, ref_pos in read.get_aligned_pairs(matches_only=False):
1365
1385
  if ref_pos is None or read_pos is None:
1366
1386
  continue
1367
-
1387
+
1368
1388
  if ref_pos >= len(ref_seq):
1369
1389
  continue
1370
-
1390
+
1371
1391
  # Get base calls
1372
1392
  read_base = read.query_sequence[read_pos].upper()
1373
1393
  ref_base = ref_seq[ref_pos].upper()
1374
-
1394
+
1375
1395
  # Skip if either base is N
1376
1396
  if read_base == 'N' or ref_base == 'N':
1377
1397
  continue
1378
-
1398
+
1379
1399
  # Get Q-score for this position
1380
1400
  qscore = position_qscores.get(ref_pos, qscore_stats['mean_qscore'])
1381
1401
  uncertainty_factor = qscore_uncertainty_factor(qscore)
1382
-
1402
+
1383
1403
  # Weight by uncertainty (lower Q-score = higher uncertainty = lower weight)
1384
1404
  weight = 1.0 - uncertainty_factor
1385
-
1405
+
1386
1406
  # Store position-level data
1387
1407
  position_weights.append(weight)
1388
1408
  position_outcomes.append(1 if read_base != ref_base else 0)
1389
-
1409
+
1390
1410
  # Count coverage
1391
1411
  total_weighted_coverage += weight
1392
1412
  raw_coverage += 1
1393
-
1413
+
1394
1414
  # Count mismatches
1395
1415
  if read_base != ref_base:
1396
1416
  weighted_mismatches += weight
1397
1417
  raw_mismatches += 1
1398
-
1418
+
1399
1419
  return weighted_mismatches, total_weighted_coverage, raw_mismatches, raw_coverage, position_weights, position_outcomes
1400
-
1420
+
1401
1421
  except Exception as e:
1402
1422
  logging.error(f"Error calculating Q-score weighted mismatches: {e}")
1403
1423
  return 0.0, 0.0, 0, 0, [], []
@@ -1827,6 +1847,289 @@ def simulate_aa_distribution(lambda_bp, cds_seq, n_trials=1000):
1827
1847
 
1828
1848
  return aa_diffs
1829
1849
 
1850
+ def create_output_directories(results_dir):
1851
+ """
1852
+ Create the output directory structure with detailed/ and detailed/qc_plots/ subdirectories.
1853
+
1854
+ Args:
1855
+ results_dir: Base results directory path
1856
+
1857
+ Returns:
1858
+ dict: Paths to created directories
1859
+ """
1860
+ results_dir = Path(results_dir)
1861
+ detailed_dir = results_dir / "detailed"
1862
+ qc_plots_dir = detailed_dir / "qc_plots"
1863
+
1864
+ detailed_dir.mkdir(parents=True, exist_ok=True)
1865
+ qc_plots_dir.mkdir(parents=True, exist_ok=True)
1866
+
1867
+ logging.info(f"Created output directories: {detailed_dir}, {qc_plots_dir}")
1868
+
1869
+ return {
1870
+ 'results_dir': results_dir,
1871
+ 'detailed_dir': detailed_dir,
1872
+ 'qc_plots_dir': qc_plots_dir,
1873
+ }
1874
+
1875
+ def write_key_findings(results_dir, consensus_info, simple_lambda, simple_aa_mean, is_protein, hit_seq):
1876
+ """
1877
+ Generate lay-user executive summary KEY_FINDINGS.txt.
1878
+
1879
+ Args:
1880
+ results_dir: Results directory path
1881
+ consensus_info: Consensus AA mutation estimate from QC analysis
1882
+ simple_lambda: Simple lambda (bp mutations per copy) from main analysis
1883
+ simple_aa_mean: Simple AA mutation mean from Monte Carlo simulation
1884
+ is_protein: Whether the region is protein-coding
1885
+ hit_seq: The hit sequence (for length calculation)
1886
+ """
1887
+ key_findings_path = Path(results_dir) / "KEY_FINDINGS.txt"
1888
+
1889
+ with open(key_findings_path, "w") as f:
1890
+ f.write("=" * 60 + "\n")
1891
+ f.write("EP LIBRARY PROFILER - KEY FINDINGS\n")
1892
+ f.write("=" * 60 + "\n\n")
1893
+
1894
+ # Determine which value to use as the "headline" number
1895
+ if consensus_info and consensus_info.get("consensus_mean") is not None:
1896
+ headline_aa = consensus_info["consensus_mean"]
1897
+ headline_std = consensus_info.get("consensus_std", 0.0)
1898
+ method_note = "consensus (precision-weighted average across Q-score thresholds)"
1899
+ elif simple_aa_mean is not None:
1900
+ headline_aa = simple_aa_mean
1901
+ headline_std = 0.0 # Simple method doesn't provide error
1902
+ method_note = "Monte Carlo simulation (single Q-score)"
1903
+ else:
1904
+ headline_aa = None
1905
+ headline_std = 0.0
1906
+ method_note = "N/A"
1907
+
1908
+ f.write("EXPECTED AMINO ACID MUTATIONS PER GENE COPY\n")
1909
+ f.write("-" * 45 + "\n")
1910
+ if is_protein and headline_aa is not None:
1911
+ f.write(f" {headline_aa:.2f} +/- {headline_std:.2f} AA mutations per gene copy\n")
1912
+ f.write(f" (Method: {method_note})\n\n")
1913
+
1914
+ # Plain-language interpretation using Poisson distribution
1915
+ f.write("WHAT THIS MEANS (Poisson distribution):\n")
1916
+ f.write("-" * 45 + "\n")
1917
+ if headline_aa > 0:
1918
+ # P(k=0) = e^(-lambda)
1919
+ p_wildtype = np.exp(-headline_aa) * 100
1920
+ # P(k=1) = lambda * e^(-lambda)
1921
+ p_one_mut = headline_aa * np.exp(-headline_aa) * 100
1922
+ # P(k>=2) = 1 - P(0) - P(1)
1923
+ p_two_plus = 100 - p_wildtype - p_one_mut
1924
+
1925
+ f.write(f" ~{p_wildtype:.1f}% of gene copies are wild-type (0 AA mutations)\n")
1926
+ f.write(f" ~{p_one_mut:.1f}% have exactly 1 AA mutation\n")
1927
+ f.write(f" ~{p_two_plus:.1f}% have 2 or more AA mutations\n\n")
1928
+ else:
1929
+ f.write(" Nearly all gene copies are expected to be wild-type.\n\n")
1930
+ else:
1931
+ if not is_protein:
1932
+ f.write(" Region is not protein-coding; AA mutation estimate not applicable.\n\n")
1933
+ else:
1934
+ f.write(" AA mutation estimate could not be calculated.\n\n")
1935
+
1936
+ # Quality assessment
1937
+ f.write("QUALITY ASSESSMENT\n")
1938
+ f.write("-" * 45 + "\n")
1939
+ if consensus_info:
1940
+ n_thresholds = len(consensus_info.get("thresholds_used", []))
1941
+ min_bases = consensus_info.get("min_mappable_bases", 0)
1942
+ note = consensus_info.get("note", "")
1943
+
1944
+ if n_thresholds >= 3 and note != "FELL_BACK_TO_MAX_COVERAGE":
1945
+ f.write(" GOOD - Multiple Q-score thresholds contributed to consensus\n")
1946
+ elif n_thresholds >= 1:
1947
+ f.write(" ACCEPTABLE - Limited Q-score thresholds available\n")
1948
+ else:
1949
+ f.write(" LOW COVERAGE - Results may be unreliable\n")
1950
+
1951
+ if note == "FELL_BACK_TO_MAX_COVERAGE":
1952
+ f.write(" WARNING: Fell back to max-coverage threshold due to low coverage\n")
1953
+ else:
1954
+ f.write(" UNKNOWN - Consensus analysis not available\n")
1955
+
1956
+ f.write("\n")
1957
+ f.write("FOR DETAILED TECHNICAL INFORMATION\n")
1958
+ f.write("-" * 45 + "\n")
1959
+ f.write(" See the detailed/ folder for:\n")
1960
+ f.write(" - methodology_notes.txt: Full explanation of calculations\n")
1961
+ f.write(" - lambda_comparison.csv: Side-by-side lambda estimates\n")
1962
+ f.write(" - comprehensive_qc_data.csv: All Q-score threshold results\n")
1963
+ f.write("\n")
1964
+
1965
+ logging.info(f"Wrote KEY_FINDINGS.txt to: {key_findings_path}")
1966
+
1967
+ def write_lambda_comparison(detailed_dir, simple_lambda, simple_aa_mean, consensus_info, hit_seq_length):
1968
+ """
1969
+ Write CSV comparing all lambda estimates side-by-side.
1970
+
1971
+ Args:
1972
+ detailed_dir: Path to detailed/ directory
1973
+ simple_lambda: Simple lambda (bp mutations per copy)
1974
+ simple_aa_mean: Simple AA mutation mean from Monte Carlo
1975
+ consensus_info: Consensus info from QC analysis
1976
+ hit_seq_length: Length of the hit sequence
1977
+ """
1978
+ lambda_csv_path = Path(detailed_dir) / "lambda_comparison.csv"
1979
+
1980
+ with open(lambda_csv_path, "w") as f:
1981
+ f.write("method,lambda_bp,lambda_error,aa_estimate,aa_error,notes\n")
1982
+
1983
+ # Simple method (from main analysis)
1984
+ simple_error = "N/A" # Simple method doesn't compute error
1985
+ simple_aa_err = "N/A"
1986
+ f.write(f"simple,{simple_lambda:.6f},{simple_error},")
1987
+ if simple_aa_mean is not None:
1988
+ f.write(f"{simple_aa_mean:.4f},{simple_aa_err},")
1989
+ else:
1990
+ f.write("N/A,N/A,")
1991
+ f.write("Used for KDE plot and Monte Carlo simulation\n")
1992
+
1993
+ # Consensus method (from QC analysis)
1994
+ if consensus_info and consensus_info.get("consensus_mean") is not None:
1995
+ consensus_mean = consensus_info["consensus_mean"]
1996
+ consensus_std = consensus_info.get("consensus_std", 0.0)
1997
+ thresholds = consensus_info.get("thresholds_used", [])
1998
+ # Consensus is in AA mutations, back-calculate approximate lambda
1999
+ # Rough approximation: lambda_bp ~ 3 * aa_mutations
2000
+ approx_lambda = consensus_mean * 3.0
2001
+ approx_lambda_err = consensus_std * 3.0
2002
+ f.write(f"consensus_weighted,{approx_lambda:.6f},{approx_lambda_err:.6f},")
2003
+ f.write(f"{consensus_mean:.4f},{consensus_std:.4f},")
2004
+ f.write(f"Precision-weighted across Q-scores: {';'.join(str(t) for t in thresholds)}\n")
2005
+ else:
2006
+ f.write("consensus_weighted,N/A,N/A,N/A,N/A,Not computed or insufficient data\n")
2007
+
2008
+ logging.info(f"Wrote lambda_comparison.csv to: {lambda_csv_path}")
2009
+
2010
+ def write_methodology_notes(detailed_dir):
2011
+ """
2012
+ Write detailed methodology documentation explaining each lambda calculation method.
2013
+
2014
+ Args:
2015
+ detailed_dir: Path to detailed/ directory
2016
+ """
2017
+ methodology_path = Path(detailed_dir) / "methodology_notes.txt"
2018
+
2019
+ content = """EP LIBRARY PROFILER - METHODOLOGY NOTES
2020
+ =======================================
2021
+
2022
+ This document explains the different mutation rate estimates produced by the
2023
+ EP library profiler and which outputs use which estimates.
2024
+
2025
+
2026
+ LAMBDA CALCULATION METHODS
2027
+ --------------------------
2028
+
2029
+ 1. SIMPLE LAMBDA (used for KDE plot and Monte Carlo simulation)
2030
+
2031
+ Formula: lambda_bp = (hit_rate - bg_rate) * sequence_length
2032
+
2033
+ Where:
2034
+ - hit_rate = total_mismatches / total_covered_bases (in target region)
2035
+ - bg_rate = total_mismatches / total_covered_bases (in plasmid excluding target)
2036
+ - sequence_length = length of target CDS in base pairs
2037
+
2038
+ This method:
2039
+ - Does NOT include error propagation
2040
+ - Does NOT weight by Q-score
2041
+ - Is fast and provides a point estimate
2042
+
2043
+ Used in:
2044
+ - summary_panels.png/pdf (Panel 4: KDE of AA mutations)
2045
+ - summary.txt
2046
+ - aa_mutation_distribution.csv
2047
+
2048
+
2049
+ 2. Q-SCORE WEIGHTED LAMBDA (used in comprehensive QC analysis)
2050
+
2051
+ Formula: lambda_bp_weighted = net_weighted_rate * sequence_length
2052
+
2053
+ Where:
2054
+ - net_weighted_rate = hit_weighted_rate - bg_weighted_rate
2055
+ - Weighted rates account for per-base Q-score uncertainty
2056
+ - Weights = 1 - sqrt(10^(-Q/10)) for each position
2057
+
2058
+ This method:
2059
+ - DOES include error propagation
2060
+ - DOES weight by Q-score (higher Q-score = higher weight)
2061
+ - Provides confidence intervals
2062
+
2063
+ Used in:
2064
+ - comprehensive_qc_data.csv
2065
+ - error_analysis.png
2066
+
2067
+
2068
+ 3. CONSENSUS LAMBDA (recommended for reporting)
2069
+
2070
+ Formula: Precision-weighted average across Q-score thresholds
2071
+
2072
+ weights[i] = 1 / std_aa_mutations[i]
2073
+ consensus_mean = sum(weights * means) / sum(weights)
2074
+
2075
+ This method:
2076
+ - Aggregates estimates from multiple Q-score filtering thresholds
2077
+ - Weights by precision (lower uncertainty = higher weight)
2078
+ - Requires minimum coverage threshold (default 1000 mappable bases)
2079
+ - Provides the most robust estimate when multiple thresholds pass QC
2080
+
2081
+ Used in:
2082
+ - aa_mutation_consensus.txt
2083
+ - KEY_FINDINGS.txt
2084
+ - QC plots (red dashed line)
2085
+
2086
+
2087
+ WHICH VALUE SHOULD I USE?
2088
+ -------------------------
2089
+
2090
+ For publication/reporting:
2091
+ Use the CONSENSUS value from aa_mutation_consensus.txt or KEY_FINDINGS.txt
2092
+ This is the most statistically robust estimate.
2093
+
2094
+ For understanding the distribution shape:
2095
+ Use the KDE plot in summary_panels.png
2096
+ Note: This uses the SIMPLE lambda, not the consensus.
2097
+
2098
+ For detailed error analysis:
2099
+ Use comprehensive_qc_data.csv in the detailed/ folder
2100
+ This contains per-Q-score estimates with full error propagation.
2101
+
2102
+
2103
+ OUTPUT FILE REFERENCE
2104
+ ---------------------
2105
+
2106
+ Root folder:
2107
+ - KEY_FINDINGS.txt: Executive summary with consensus AA mutations
2108
+ - summary_panels.png/pdf: Main visualization (uses simple lambda for KDE)
2109
+ - aa_mutation_consensus.txt: Consensus estimate details
2110
+
2111
+ detailed/ folder:
2112
+ - methodology_notes.txt: This file
2113
+ - lambda_comparison.csv: Side-by-side comparison of all methods
2114
+ - comprehensive_qc_data.csv: Full QC data with error estimates
2115
+ - simple_qc_data.csv: Simplified QC data
2116
+ - gene_mismatch_rates.csv: Per-position mismatch rates
2117
+ - base_distribution.csv: Base counts at each position
2118
+ - aa_substitutions.csv: Amino acid substitution data
2119
+ - plasmid_coverage.csv: Coverage across plasmid
2120
+ - aa_mutation_distribution.csv: Monte Carlo AA mutation trials
2121
+
2122
+ detailed/qc_plots/ folder:
2123
+ - qc_plot_*.png: Q-score threshold analysis plot
2124
+ - comprehensive_qc_analysis.png: Detailed QC visualization
2125
+ - error_analysis.png: Error component breakdown
2126
+ """
2127
+
2128
+ with open(methodology_path, "w") as f:
2129
+ f.write(content)
2130
+
2131
+ logging.info(f"Wrote methodology_notes.txt to: {methodology_path}")
2132
+
1830
2133
  def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, work_dir, results_dir,
1831
2134
  chunks, ref_hit_fasta, plasmid_fasta, hit_seq, hit_id, plasmid_seq, idx):
1832
2135
  """
@@ -1854,13 +2157,18 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
1854
2157
 
1855
2158
  # Ensure work directory exists
1856
2159
  os.makedirs(work_dir, exist_ok=True)
1857
-
2160
+
1858
2161
  # Create subdirectory for this Q-score analysis
1859
2162
  qscore_results_dir = results_dir
1860
2163
  if qscore is not None:
1861
2164
  qscore_results_dir = os.path.join(results_dir, f"q{qscore}_analysis")
1862
2165
  os.makedirs(qscore_results_dir, exist_ok=True)
1863
-
2166
+
2167
+ # Create output directory structure (detailed/ and detailed/qc_plots/)
2168
+ output_dirs = create_output_directories(qscore_results_dir)
2169
+ detailed_dir = output_dirs['detailed_dir']
2170
+ qc_plots_dir = output_dirs['qc_plots_dir']
2171
+
1864
2172
  # Write chunks FASTA & align to background‐chunks
1865
2173
  chunks_fasta = create_multi_fasta(chunks, work_dir)
1866
2174
  sam_chunks = run_minimap2(fastq_path, chunks_fasta, "plasmid_chunks_alignment", work_dir)
@@ -1976,9 +2284,9 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
1976
2284
  qscore_info = f" ({qscore_desc})" if qscore_desc != "unfiltered" else ""
1977
2285
 
1978
2286
  # ----------------------------
1979
- # SAVE CSV FOR MUTATION RATES (PANEL 1)
2287
+ # SAVE CSV FOR MUTATION RATES (PANEL 1) - to detailed/
1980
2288
  # ----------------------------
1981
- gene_mismatch_csv = os.path.join(qscore_results_dir, "gene_mismatch_rates.csv")
2289
+ gene_mismatch_csv = os.path.join(detailed_dir, "gene_mismatch_rates.csv")
1982
2290
  with open(gene_mismatch_csv, "w", newline="") as csvfile:
1983
2291
  csvfile.write(f"# gene_id: {hit_id}\n")
1984
2292
  csvfile.write(f"# background_rate_per_kb: {bg_rate_per_kb:.6f}\n")
@@ -1988,9 +2296,9 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
1988
2296
  logging.info(f"Saved CSV for gene mismatch rates: {gene_mismatch_csv}")
1989
2297
 
1990
2298
  # ----------------------------
1991
- # SAVE CSV FOR BASE DISTRIBUTION (PANEL 2)
2299
+ # SAVE CSV FOR BASE DISTRIBUTION (PANEL 2) - to detailed/
1992
2300
  # ----------------------------
1993
- base_dist_csv = os.path.join(qscore_results_dir, "base_distribution.csv")
2301
+ base_dist_csv = os.path.join(detailed_dir, "base_distribution.csv")
1994
2302
  with open(base_dist_csv, "w", newline="") as csvfile:
1995
2303
  csvfile.write(f"# gene_id: {hit_id}\n")
1996
2304
  csvfile.write("position_1based,ref_base,A_count,C_count,G_count,T_count,N_count\n")
@@ -2000,10 +2308,10 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
2000
2308
  logging.info(f"Saved CSV for base distribution: {base_dist_csv}")
2001
2309
 
2002
2310
  # ----------------------------
2003
- # SAVE CSV FOR AA SUBSTITUTIONS (PANEL 3) - only if protein
2311
+ # SAVE CSV FOR AA SUBSTITUTIONS (PANEL 3) - to detailed/ - only if protein
2004
2312
  # ----------------------------
2005
2313
  if is_protein:
2006
- aa_subst_csv = os.path.join(qscore_results_dir, "aa_substitutions.csv")
2314
+ aa_subst_csv = os.path.join(detailed_dir, "aa_substitutions.csv")
2007
2315
  with open(aa_subst_csv, "w", newline="") as csvfile:
2008
2316
  csvfile.write(f"# gene_id: {hit_id}\n")
2009
2317
  csvfile.write(f"# lambda_bp_mut: {est_mut_per_copy:.6f}\n")
@@ -2013,9 +2321,9 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
2013
2321
  logging.info(f"Saved CSV for AA substitutions: {aa_subst_csv}")
2014
2322
 
2015
2323
  # ----------------------------
2016
- # SAVE CSV FOR PLASMID COVERAGE (PANEL 4)
2324
+ # SAVE CSV FOR PLASMID COVERAGE (PANEL 4) - to detailed/
2017
2325
  # ----------------------------
2018
- plasmid_cov_csv = os.path.join(qscore_results_dir, "plasmid_coverage.csv")
2326
+ plasmid_cov_csv = os.path.join(detailed_dir, "plasmid_coverage.csv")
2019
2327
  with open(plasmid_cov_csv, "w", newline="") as csvfile:
2020
2328
  csvfile.write("position_1based,coverage\n")
2021
2329
  for pos0, cov in enumerate(plasmid_cov):
@@ -2023,9 +2331,9 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
2023
2331
  logging.info(f"Saved CSV for plasmid coverage: {plasmid_cov_csv}")
2024
2332
 
2025
2333
  # ----------------------------
2026
- # SAVE CSV FOR AA MUTATION DISTRIBUTION (PANEL 3)
2334
+ # SAVE CSV FOR AA MUTATION DISTRIBUTION (PANEL 3) - to detailed/
2027
2335
  # ----------------------------
2028
- aa_dist_csv = os.path.join(qscore_results_dir, "aa_mutation_distribution.csv")
2336
+ aa_dist_csv = os.path.join(detailed_dir, "aa_mutation_distribution.csv")
2029
2337
  with open(aa_dist_csv, "w", newline="") as csvfile:
2030
2338
  csvfile.write(f"# gene_id: {hit_id}\n")
2031
2339
  csvfile.write(f"# lambda_bp_mut: {est_mut_per_copy:.6f}\n")
@@ -2135,7 +2443,7 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
2135
2443
  if is_protein and aa_diffs and len(aa_diffs) > 0:
2136
2444
  x_vals = np.array(aa_diffs)
2137
2445
  unique_vals = np.unique(x_vals)
2138
-
2446
+
2139
2447
  if len(unique_vals) > 1:
2140
2448
  # Multiple unique values - use KDE or histogram
2141
2449
  if HAVE_SCIPY:
@@ -2149,15 +2457,23 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
2149
2457
  ax3.set_ylim(bottom=0)
2150
2458
  except Exception as e:
2151
2459
  logging.warning(f"KDE failed: {e}, falling back to histogram")
2152
- ax3.hist(x_vals, bins=min(20, len(unique_vals)),
2460
+ ax3.hist(x_vals, bins=min(20, len(unique_vals)),
2153
2461
  color="#C44E52", alpha=0.7, density=True, edgecolor='black')
2154
2462
  else:
2155
- ax3.hist(x_vals, bins=min(20, len(unique_vals)),
2463
+ ax3.hist(x_vals, bins=min(20, len(unique_vals)),
2156
2464
  color="#C44E52", alpha=0.7, density=True, edgecolor='black')
2157
2465
  else:
2158
2466
  # Single unique value - just show a bar
2159
2467
  ax3.bar(unique_vals, [1.0], color="#C44E52", alpha=0.7, width=0.1)
2160
2468
  ax3.set_xlim(unique_vals[0] - 0.5, unique_vals[0] + 0.5)
2469
+
2470
+ # Set title with lambda value for protein-coding sequences
2471
+ ax3.set_title(f"AA Mutation Distribution (Monte Carlo, \u03bb={est_mut_per_copy:.2f}){qscore_info}",
2472
+ fontsize=14, fontweight='bold')
2473
+ ax3.set_xlabel("Number of AA Mutations", fontsize=12)
2474
+ ax3.set_ylabel("Density", fontsize=12)
2475
+ ax3.spines['top'].set_visible(False)
2476
+ ax3.spines['right'].set_visible(False)
2161
2477
  else:
2162
2478
  # Not protein or no AA differences — display an informative message
2163
2479
  ax3.text(
@@ -2170,7 +2486,7 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
2170
2486
  color="gray",
2171
2487
  transform=ax3.transAxes,
2172
2488
  )
2173
-
2489
+
2174
2490
  ax3.set_title("AA Mutation Distribution", fontsize=14, fontweight='bold')
2175
2491
  ax3.set_xlabel("Number of AA Mutations", fontsize=12)
2176
2492
  ax3.set_ylabel("Density", fontsize=12)
@@ -2231,9 +2547,9 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
2231
2547
  sample_percent[cat] = 0.0
2232
2548
 
2233
2549
  # ----------------------------
2234
- # GENERATE PDF TABLE (MUTATION SPECTRUM)
2550
+ # GENERATE PDF TABLE (MUTATION SPECTRUM) - to detailed/
2235
2551
  # ----------------------------
2236
- pdf_path = os.path.join(qscore_results_dir, f"{sample_name}_mutation_spectrum.pdf")
2552
+ pdf_path = os.path.join(detailed_dir, f"{sample_name}_mutation_spectrum.pdf")
2237
2553
  # Prepare table data
2238
2554
  table_rows = []
2239
2555
  for cat in categories:
@@ -2341,9 +2657,6 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
2341
2657
  }
2342
2658
 
2343
2659
 
2344
-
2345
- main()
2346
-
2347
2660
  def expand_fastq_inputs(inputs: Iterable[str]) -> List[Path]:
2348
2661
  paths: List[Path] = []
2349
2662
  for item in inputs:
@@ -2503,6 +2816,7 @@ def process_single_fastq(
2503
2816
 
2504
2817
  logging.info("Running QC analysis to get Q-score results...")
2505
2818
  qc_results = None
2819
+ consensus_info = None
2506
2820
  try:
2507
2821
  qc_results, consensus_info = run_qc_analysis(
2508
2822
  str(fastq_path),
@@ -2563,6 +2877,45 @@ def process_single_fastq(
2563
2877
  )
2564
2878
  analysis_results.append(result)
2565
2879
 
2880
+ # Generate unified summary files in the sample's root results directory
2881
+ # Get simple lambda from the unfiltered analysis (first result)
2882
+ simple_lambda = 0.0
2883
+ simple_aa_mean = None
2884
+ is_protein = False
2885
+ unfiltered_result = analysis_results[0] if analysis_results else None
2886
+ if unfiltered_result:
2887
+ simple_lambda = unfiltered_result.get('est_mut_per_copy', 0.0)
2888
+ simple_aa_mean = unfiltered_result.get('avg_aa_mutations')
2889
+ is_protein = unfiltered_result.get('is_protein', False)
2890
+
2891
+ # Create output directories and generate summary files
2892
+ output_dirs = create_output_directories(results_dir)
2893
+ detailed_dir = output_dirs['detailed_dir']
2894
+
2895
+ # Write KEY_FINDINGS.txt (lay-user summary)
2896
+ write_key_findings(
2897
+ results_dir,
2898
+ consensus_info,
2899
+ simple_lambda,
2900
+ simple_aa_mean,
2901
+ is_protein,
2902
+ hit_seq,
2903
+ )
2904
+
2905
+ # Write lambda_comparison.csv
2906
+ write_lambda_comparison(
2907
+ detailed_dir,
2908
+ simple_lambda,
2909
+ simple_aa_mean,
2910
+ consensus_info,
2911
+ len(hit_seq),
2912
+ )
2913
+
2914
+ # Write methodology_notes.txt
2915
+ write_methodology_notes(detailed_dir)
2916
+
2917
+ logging.info("Generated unified summary files: KEY_FINDINGS.txt, lambda_comparison.csv, methodology_notes.txt")
2918
+
2566
2919
  if work_dir.exists():
2567
2920
  shutil.rmtree(work_dir)
2568
2921
  logging.info("Removed temporary work directory: %s", work_dir)
@@ -2573,5 +2926,6 @@ def process_single_fastq(
2573
2926
  "sample": sample_name,
2574
2927
  "results_dir": results_dir,
2575
2928
  "analysis_results": analysis_results,
2929
+ "consensus_info": consensus_info,
2576
2930
  }
2577
2931
 
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: uht-tooling
3
- Version: 0.1.9
3
+ Version: 0.2.0
4
4
  Summary: Tooling for ultra-high throughput screening workflows.
5
5
  Author: Matt115A
6
6
  License-Expression: MIT
@@ -313,13 +313,57 @@ Please be aware, this toolkit will not scale well beyond around 50k reads/sample
313
313
  --fastq data/ep-library-profile/*.fastq.gz \
314
314
  --output-dir results/ep-library-profile/
315
315
  ```
316
- - Output bundle includes per-sample directories, a master summary TSV, and a `summary_panels` figure that visualises positional mutation rates, coverage, and amino-acid simulations.
316
+
317
+ **Output structure**
318
+
319
+ Each sample produces an organized output directory:
320
+
321
+ ```
322
+ sample_name/
323
+ ├── KEY_FINDINGS.txt # Lay-user executive summary
324
+ ├── summary_panels.png/pdf # Main visualization
325
+ ├── aa_mutation_consensus.txt # Consensus estimate details
326
+ ├── run.log # Analysis log
327
+ └── detailed/ # Technical outputs
328
+ ├── methodology_notes.txt # Documents which lambda drives what
329
+ ├── lambda_comparison.csv # Side-by-side lambda comparison
330
+ ├── gene_mismatch_rates.csv
331
+ ├── base_distribution.csv
332
+ ├── aa_substitutions.csv
333
+ ├── plasmid_coverage.csv
334
+ ├── aa_mutation_distribution.csv
335
+ ├── comprehensive_qc_data.csv
336
+ ├── simple_qc_data.csv
337
+ └── qc_plots/ # QC visualizations
338
+ ├── qc_plot_*.png
339
+ ├── comprehensive_qc_analysis.png
340
+ ├── error_analysis.png
341
+ └── qc_mutation_rate_vs_quality.png/csv
342
+ ```
343
+
344
+ **Lambda estimates: which to use**
345
+
346
+ The profiler calculates lambda (mutations per gene copy) via two methods:
347
+
348
+ | Method | Formula | Error Quantified? | Used For |
349
+ |--------|---------|-------------------|----------|
350
+ | Simple | `(hit_rate - bg_rate) × seq_len` | No | KDE plot, Monte Carlo simulation |
351
+ | Consensus | Precision-weighted average across Q-scores | Yes | Recommended for reporting |
352
+
353
+ - **For publication/reporting**: Use the consensus value from `KEY_FINDINGS.txt` or `aa_mutation_consensus.txt`.
354
+ - **For understanding distribution shape**: See the KDE plot in `summary_panels.png` (note: uses simple lambda).
355
+ - **For detailed error analysis**: See `detailed/comprehensive_qc_data.csv`.
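The two rows of the table above can be sketched in a few lines. This is an illustrative sketch only, not the package's implementation: the function names, argument names, and input numbers are hypothetical, and the consensus uses the `1 / standard deviation` weighting described in this README.

```python
def simple_lambda(hit_mismatches, hit_bases, bg_mismatches, bg_bases, seq_len):
    """Background-subtracted mismatch rate scaled to the CDS length (hypothetical helper)."""
    hit_rate = hit_mismatches / hit_bases  # mismatch rate inside the target region
    bg_rate = bg_mismatches / bg_bases     # mismatch rate in the rest of the plasmid
    return (hit_rate - bg_rate) * seq_len


def consensus_mean(means, stds):
    """Average per-threshold estimates, weighting each by 1 / std (hypothetical helper)."""
    weights = [1.0 / s for s in stds]
    return sum(w * m for w, m in zip(weights, means)) / sum(weights)


# Made-up inputs, chosen only to make the arithmetic easy to follow.
lam = simple_lambda(hit_mismatches=600, hit_bases=100_000,
                    bg_mismatches=200, bg_bases=100_000, seq_len=900)
consensus = consensus_mean(means=[1.0, 2.0], stds=[0.5, 1.0])
print(f"simple lambda: {lam:.2f} bp mutations per copy")
print(f"consensus: {consensus:.2f} AA mutations per copy")
```

The estimate with the smaller standard deviation pulls the consensus toward itself, which is why thresholds with insufficient coverage are filtered out before averaging.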
356
+
357
+ The `KEY_FINDINGS.txt` file provides a plain-language summary including:
358
+ - Expected AA mutations per gene copy
359
+ - Poisson-based interpretation (% wild-type, % 1 mutation, % 2+ mutations)
360
+ - Quality assessment (GOOD/ACCEPTABLE/LOW COVERAGE)
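The Poisson-based interpretation follows from treating the expected AA mutations per gene copy as the Poisson mean. A minimal sketch, assuming a hypothetical helper name `poisson_breakdown` and an illustrative mean of 1.5:

```python
import math


def poisson_breakdown(mean_aa):
    """Percent of gene copies with 0, exactly 1, and 2+ AA mutations under a Poisson model."""
    p0 = math.exp(-mean_aa)   # P(k = 0) = e^(-lambda)
    p1 = mean_aa * p0         # P(k = 1) = lambda * e^(-lambda)
    return {
        "wild_type": 100 * p0,
        "one_mutation": 100 * p1,
        "two_plus": 100 * (1 - p0 - p1),  # P(k >= 2) = 1 - P(0) - P(1)
    }


breakdown = poisson_breakdown(1.5)  # 1.5 is an illustrative value, not a measured result
for label, percent in breakdown.items():
    print(f"{label}: ~{percent:.1f}%")
```

A mean of zero gives 100% wild-type, which corresponds to the "nearly all gene copies are expected to be wild-type" case in the summary.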
317
361
 
318
362
  **How the mutation rate and AA expectations are derived**
319
363
 
320
- 1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the target rate; mismatches elsewhere provide the background.
364
+ 1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the "target" rate; mismatches elsewhere provide the background.
321
365
  2. The per-base background rate is subtracted from the target rate to yield a net nucleotide mutation rate, and the standard deviation reflects binomial sampling and quality-score uncertainty.
322
- 3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these drives the AA mutation mean/variance that appear in the panel plot.
366
+ 3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these drive the AA mutation mean/variance that appear in the panel plot.
323
367
  4. If multiple Q-score thresholds are analysed, the CLI aggregates them via a precision-weighted consensus (1 / standard deviation weighting) after filtering out thresholds with insufficient coverage; the consensus value is written to `aa_mutation_consensus.txt` and plotted as a horizontal guide.
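Step 3 can be sketched as below. This is a sketch under stated assumptions, not the package's `simulate_aa_distribution`: it assumes the number of substitutions per trial is drawn from a Poisson distribution with mean λ_bp, that positions and replacement bases are chosen uniformly, and that the standard genetic code applies. All helper names here are hypothetical.

```python
import math
import random

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate an uppercase A/C/G/T sequence, ignoring any trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def sample_poisson(lam, rng):
    """Draw one Poisson sample with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_aa_mutations(lambda_bp, cds, n_trials=1000, seed=0):
    """Return the AA-difference count for each simulated mutated copy of the CDS."""
    rng = random.Random(seed)
    wild_type = translate(cds)
    counts = []
    for _ in range(n_trials):
        mutated = list(cds)
        n_mut = min(sample_poisson(lambda_bp, rng), len(cds))
        for pos in rng.sample(range(len(cds)), n_mut):
            # Substitute with one of the three other bases.
            mutated[pos] = rng.choice([b for b in BASES if b != cds[pos]])
        protein = translate("".join(mutated))
        counts.append(sum(x != y for x, y in zip(wild_type, protein)))
    return counts


counts = simulate_aa_mutations(lambda_bp=2.0, cds="ATGGCCAAAGGG" * 5, n_trials=1000)
print(f"mean AA mutations per copy: {sum(counts) / len(counts):.2f}")
```

Because some substitutions are synonymous and two substitutions can fall in the same codon, the mean AA count sits below λ_bp, which is why the AA expectation is simulated rather than read off directly.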
324
368
 
325
369
  ---
File without changes