uht-tooling 0.1.9__tar.gz → 0.2.0__tar.gz
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/PKG-INFO +48 -4
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/README.md +47 -3
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/pyproject.toml +1 -1
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/workflows/mut_rate.py +476 -122
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling.egg-info/PKG-INFO +48 -4
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/setup.cfg +0 -0
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/__init__.py +0 -0
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/cli.py +0 -0
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/models/__init__.py +0 -0
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/workflows/__init__.py +0 -0
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/workflows/design_gibson.py +0 -0
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/workflows/design_kld.py +0 -0
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/workflows/design_slim.py +0 -0
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/workflows/gui.py +0 -0
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/workflows/mutation_caller.py +0 -0
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/workflows/nextera_designer.py +0 -0
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/workflows/profile_inserts.py +0 -0
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling/workflows/umi_hunter.py +0 -0
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling.egg-info/SOURCES.txt +0 -0
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling.egg-info/dependency_links.txt +0 -0
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling.egg-info/entry_points.txt +0 -0
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling.egg-info/requires.txt +0 -0
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/src/uht_tooling.egg-info/top_level.txt +0 -0
- {uht_tooling-0.1.9 → uht_tooling-0.2.0}/tests/test_design_kld.py +0 -0
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: uht-tooling
|
|
3
|
-
Version: 0.
|
|
3
|
+
Version: 0.2.0
|
|
4
4
|
Summary: Tooling for ultra-high throughput screening workflows.
|
|
5
5
|
Author: Matt115A
|
|
6
6
|
License-Expression: MIT
|
|
@@ -313,13 +313,57 @@ Please be aware, this toolkit will not scale well beyond around 50k reads/sample
|
|
|
313
313
|
--fastq data/ep-library-profile/*.fastq.gz \
|
|
314
314
|
--output-dir results/ep-library-profile/
|
|
315
315
|
```
|
|
316
|
-
|
|
316
|
+
|
|
317
|
+
**Output structure**
|
|
318
|
+
|
|
319
|
+
Each sample produces an organized output directory:
|
|
320
|
+
|
|
321
|
+
```
|
|
322
|
+
sample_name/
|
|
323
|
+
├── KEY_FINDINGS.txt # Lay-user executive summary
|
|
324
|
+
├── summary_panels.png/pdf # Main visualization
|
|
325
|
+
├── aa_mutation_consensus.txt # Consensus estimate details
|
|
326
|
+
├── run.log # Analysis log
|
|
327
|
+
└── detailed/ # Technical outputs
|
|
328
|
+
├── methodology_notes.txt # Documents which lambda drives what
|
|
329
|
+
├── lambda_comparison.csv # Side-by-side lambda comparison
|
|
330
|
+
├── gene_mismatch_rates.csv
|
|
331
|
+
├── base_distribution.csv
|
|
332
|
+
├── aa_substitutions.csv
|
|
333
|
+
├── plasmid_coverage.csv
|
|
334
|
+
├── aa_mutation_distribution.csv
|
|
335
|
+
├── comprehensive_qc_data.csv
|
|
336
|
+
├── simple_qc_data.csv
|
|
337
|
+
└── qc_plots/ # QC visualizations
|
|
338
|
+
├── qc_plot_*.png
|
|
339
|
+
├── comprehensive_qc_analysis.png
|
|
340
|
+
├── error_analysis.png
|
|
341
|
+
└── qc_mutation_rate_vs_quality.png/csv
|
|
342
|
+
```
|
|
343
|
+
|
|
344
|
+
**Lambda estimates: which to use**
|
|
345
|
+
|
|
346
|
+
The profiler calculates lambda (mutations per gene copy) via two methods:
|
|
347
|
+
|
|
348
|
+
| Method | Formula | Error Quantified? | Used For |
|
|
349
|
+
|--------|---------|-------------------|----------|
|
|
350
|
+
| Simple | `(hit_rate - bg_rate) × seq_len` | No | KDE plot, Monte Carlo simulation |
|
|
351
|
+
| Consensus | Precision-weighted average across Q-scores | Yes | Recommended for reporting |
|
|
352
|
+
|
|
353
|
+
- **For publication/reporting**: Use the consensus value from `KEY_FINDINGS.txt` or `aa_mutation_consensus.txt`.
|
|
354
|
+
- **For understanding distribution shape**: See the KDE plot in `summary_panels.png` (note: uses simple lambda).
|
|
355
|
+
- **For detailed error analysis**: See `detailed/comprehensive_qc_data.csv`.
|
|
356
|
+
|
|
357
|
+
The `KEY_FINDINGS.txt` file provides a plain-language summary including:
|
|
358
|
+
- Expected AA mutations per gene copy
|
|
359
|
+
- Poisson-based interpretation (% wild-type, % 1 mutation, % 2+ mutations)
|
|
360
|
+
- Quality assessment (GOOD/ACCEPTABLE/LOW COVERAGE)
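The Poisson-based interpretation listed above can be illustrated with a short sketch. It assumes the number of AA mutations per gene copy follows a Poisson distribution with mean λ; the function name `poisson_breakdown` and the example value 1.5 are hypothetical and not taken from the package.

```python
import math


def poisson_breakdown(lam):
    """Return the fractions of gene copies with 0, 1, and 2+ mutations.

    Assumes mutation counts per copy are Poisson-distributed with mean `lam`.
    """
    p0 = math.exp(-lam)                 # P(k = 0): wild-type copies
    p1 = lam * math.exp(-lam)           # P(k = 1): exactly one mutation
    p2_plus = max(0.0, 1.0 - p0 - p1)   # remainder: two or more mutations
    return p0, p1, p2_plus


# Hypothetical example: an expected 1.5 AA mutations per gene copy.
p0, p1, p2_plus = poisson_breakdown(1.5)
print(f"wild-type: {p0:.1%}, 1 mutation: {p1:.1%}, 2+ mutations: {p2_plus:.1%}")
```

A library with λ well below 1 is dominated by wild-type copies, while a large λ pushes most copies into the 2+ class, which is why the three percentages are a useful plain-language summary.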
|
|
317
361
|
|
|
318
362
|
**How the mutation rate and AA expectations are derived**
|
|
319
363
|
|
|
320
|
-
1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the
|
|
364
|
+
1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the "target" rate; mismatches elsewhere provide the background.
|
|
321
365
|
2. The per-base background rate is subtracted from the target rate to yield a net nucleotide mutation rate, and the standard deviation reflects binomial sampling and quality-score uncertainty.
|
|
322
|
-
3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these
|
|
366
|
+
3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these drive the AA mutation mean/variance that appear in the panel plot.
|
|
323
367
|
4. If multiple Q-score thresholds are analysed, the CLI aggregates them via a precision-weighted consensus (1 / standard deviation weighting) after filtering out thresholds with insufficient coverage; the consensus value is written to `aa_mutation_consensus.txt` and plotted as a horizontal guide.
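Step 4 can be sketched as follows. This is a minimal illustration, not the package's implementation: it reuses the result keys that appear in the QC code (`mean_aa_mutations`, `std_aa_mutations`, `mappable_bases`), while the function name `consensus_estimate` and the coverage cutoff `min_mappable_bases` are assumptions made for the example.

```python
def consensus_estimate(qc_results, min_mappable_bases=1000):
    """Combine per-threshold estimates using 1 / standard deviation weights.

    `min_mappable_bases` is a hypothetical coverage cutoff; the real filter
    used by the package is not shown in this diff.
    """
    # Drop thresholds with too little coverage or an unusable (zero) std.
    kept = [
        r for r in qc_results
        if r["mappable_bases"] >= min_mappable_bases and r["std_aa_mutations"] > 0
    ]
    if not kept:
        return None

    weights = [1.0 / r["std_aa_mutations"] for r in kept]
    weighted_sum = sum(w * r["mean_aa_mutations"] for w, r in zip(weights, kept))
    return weighted_sum / sum(weights)


# Hypothetical per-threshold results; the last entry fails the coverage filter.
example = [
    {"mean_aa_mutations": 2.0, "std_aa_mutations": 0.5, "mappable_bases": 50000},
    {"mean_aa_mutations": 3.0, "std_aa_mutations": 1.0, "mappable_bases": 20000},
    {"mean_aa_mutations": 9.0, "std_aa_mutations": 0.1, "mappable_bases": 100},
]
print(consensus_estimate(example))
```

Thresholds with smaller standard deviations pull the consensus toward their estimate, and filtering first keeps a low-coverage threshold with a deceptively small error from dominating.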
|
|
324
368
|
|
|
325
369
|
---
|
|
@@ -284,13 +284,57 @@ Please be aware, this toolkit will not scale well beyond around 50k reads/sample
|
|
|
284
284
|
--fastq data/ep-library-profile/*.fastq.gz \
|
|
285
285
|
--output-dir results/ep-library-profile/
|
|
286
286
|
```
|
|
287
|
-
|
|
287
|
+
|
|
288
|
+
**Output structure**
|
|
289
|
+
|
|
290
|
+
Each sample produces an organized output directory:
|
|
291
|
+
|
|
292
|
+
```
|
|
293
|
+
sample_name/
|
|
294
|
+
├── KEY_FINDINGS.txt # Lay-user executive summary
|
|
295
|
+
├── summary_panels.png/pdf # Main visualization
|
|
296
|
+
├── aa_mutation_consensus.txt # Consensus estimate details
|
|
297
|
+
├── run.log # Analysis log
|
|
298
|
+
└── detailed/ # Technical outputs
|
|
299
|
+
├── methodology_notes.txt # Documents which lambda drives what
|
|
300
|
+
├── lambda_comparison.csv # Side-by-side lambda comparison
|
|
301
|
+
├── gene_mismatch_rates.csv
|
|
302
|
+
├── base_distribution.csv
|
|
303
|
+
├── aa_substitutions.csv
|
|
304
|
+
├── plasmid_coverage.csv
|
|
305
|
+
├── aa_mutation_distribution.csv
|
|
306
|
+
├── comprehensive_qc_data.csv
|
|
307
|
+
├── simple_qc_data.csv
|
|
308
|
+
└── qc_plots/ # QC visualizations
|
|
309
|
+
├── qc_plot_*.png
|
|
310
|
+
├── comprehensive_qc_analysis.png
|
|
311
|
+
├── error_analysis.png
|
|
312
|
+
└── qc_mutation_rate_vs_quality.png/csv
|
|
313
|
+
```
|
|
314
|
+
|
|
315
|
+
**Lambda estimates: which to use**
|
|
316
|
+
|
|
317
|
+
The profiler calculates lambda (mutations per gene copy) via two methods:
|
|
318
|
+
|
|
319
|
+
| Method | Formula | Error Quantified? | Used For |
|
|
320
|
+
|--------|---------|-------------------|----------|
|
|
321
|
+
| Simple | `(hit_rate - bg_rate) × seq_len` | No | KDE plot, Monte Carlo simulation |
|
|
322
|
+
| Consensus | Precision-weighted average across Q-scores | Yes | Recommended for reporting |
|
|
323
|
+
|
|
324
|
+
- **For publication/reporting**: Use the consensus value from `KEY_FINDINGS.txt` or `aa_mutation_consensus.txt`.
|
|
325
|
+
- **For understanding distribution shape**: See the KDE plot in `summary_panels.png` (note: uses simple lambda).
|
|
326
|
+
- **For detailed error analysis**: See `detailed/comprehensive_qc_data.csv`.
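The simple estimate from the table above, `(hit_rate - bg_rate) × seq_len`, can be worked through in a few lines. The function name, its arguments, and the example counts are hypothetical; clamping the net rate at zero is also an assumption of this sketch rather than documented behaviour.

```python
def simple_lambda(hit_mismatches, hit_coverage, bg_mismatches, bg_coverage, seq_len):
    """Estimate mutations per gene copy from raw mismatch counts."""
    hit_rate = hit_mismatches / hit_coverage   # per-base rate in the target region
    bg_rate = bg_mismatches / bg_coverage      # per-base rate elsewhere on the plasmid
    net_rate = max(hit_rate - bg_rate, 0.0)    # assumption: never report a negative rate
    return net_rate * seq_len


# Hypothetical counts: 0.3% mismatches in the gene, 0.1% background, 900 bp CDS.
lam = simple_lambda(
    hit_mismatches=300, hit_coverage=100_000,
    bg_mismatches=100, bg_coverage=100_000,
    seq_len=900,
)
print(f"simple lambda: {lam:.2f} mutations per copy")
```

Because this estimate carries no error term, it is suited to driving the KDE plot and simulation, while the consensus value remains the one to report.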
|
|
327
|
+
|
|
328
|
+
The `KEY_FINDINGS.txt` file provides a plain-language summary including:
|
|
329
|
+
- Expected AA mutations per gene copy
|
|
330
|
+
- Poisson-based interpretation (% wild-type, % 1 mutation, % 2+ mutations)
|
|
331
|
+
- Quality assessment (GOOD/ACCEPTABLE/LOW COVERAGE)
|
|
288
332
|
|
|
289
333
|
**How the mutation rate and AA expectations are derived**
|
|
290
334
|
|
|
291
|
-
1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the
|
|
335
|
+
1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the "target" rate; mismatches elsewhere provide the background.
|
|
292
336
|
2. The per-base background rate is subtracted from the target rate to yield a net nucleotide mutation rate, and the standard deviation reflects binomial sampling and quality-score uncertainty.
|
|
293
|
-
3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these
|
|
337
|
+
3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these drive the AA mutation mean/variance that appear in the panel plot.
|
|
294
338
|
4. If multiple Q-score thresholds are analysed, the CLI aggregates them via a precision-weighted consensus (1 / standard deviation weighting) after filtering out thresholds with insufficient coverage; the consensus value is written to `aa_mutation_consensus.txt` and plotted as a horizontal guide.
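The Monte Carlo procedure in step 3 can be sketched with the standard library alone. This is an illustration under stated assumptions: the number of flipped bases per trial is drawn from a Poisson distribution with mean λ_bp (the README does not say how the count is chosen), the CDS contains only uppercase A/C/G/T, and all function names are hypothetical.

```python
import math
import random
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a CDS codon by codon, ignoring any trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_aa_mutations(cds, lambda_bp, n_trials=1000, seed=0):
    """Return the mean and variance of AA differences across simulated copies."""
    rng = random.Random(seed)
    wild_type = translate(cds)
    counts = []
    for _ in range(n_trials):
        seq = list(cds)
        # Assumption: the number of flipped bases per copy is Poisson(lambda_bp).
        n_flips = min(sample_poisson(lambda_bp, rng), len(seq))
        for pos in rng.sample(range(len(seq)), n_flips):
            seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
        mutant = translate("".join(seq))
        counts.append(sum(a != b for a, b in zip(wild_type, mutant)))
    mean = sum(counts) / n_trials
    variance = sum((c - mean) ** 2 for c in counts) / n_trials
    return mean, variance


# Hypothetical 150 bp CDS and lambda_bp value.
mean_aa, var_aa = simulate_aa_mutations("ATGGCTAAAGGTCTG" * 10, lambda_bp=2.0)
print(f"mean AA mutations: {mean_aa:.3f}, variance: {var_aa:.3f}")
```

The AA mean comes out below λ_bp because some base flips are synonymous and two flips can land in the same codon, which is why the README separates nucleotide-level λ from the AA expectation.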
|
|
295
339
|
|
|
296
340
|
---
|
|
@@ -743,40 +743,50 @@ def create_simple_qc_plots(quality_thresholds, qc_results, results_dir, consensu
|
|
|
743
743
|
|
|
744
744
|
plt.tight_layout()
|
|
745
745
|
|
|
746
|
-
#
|
|
746
|
+
# Create detailed/qc_plots/ subdirectory for QC plots
|
|
747
|
+
detailed_dir = os.path.join(results_dir, "detailed")
|
|
748
|
+
qc_plots_dir = os.path.join(detailed_dir, "qc_plots")
|
|
749
|
+
os.makedirs(qc_plots_dir, exist_ok=True)
|
|
750
|
+
|
|
751
|
+
# Save the plot to detailed/qc_plots/
|
|
747
752
|
project_name = os.path.basename(results_dir)
|
|
748
|
-
qc_plot_path = os.path.join(
|
|
753
|
+
qc_plot_path = os.path.join(qc_plots_dir, f"qc_plot_{project_name}.png")
|
|
749
754
|
fig.savefig(qc_plot_path, dpi=300, bbox_inches='tight')
|
|
750
755
|
plt.close(fig)
|
|
751
|
-
|
|
756
|
+
|
|
752
757
|
logging.info(f"QC plot saved to: {qc_plot_path}")
|
|
753
|
-
|
|
754
|
-
# Save data as CSV
|
|
755
|
-
qc_data_path = os.path.join(
|
|
758
|
+
|
|
759
|
+
# Save data as CSV to detailed/
|
|
760
|
+
qc_data_path = os.path.join(detailed_dir, "simple_qc_data.csv")
|
|
756
761
|
with open(qc_data_path, 'w') as f:
|
|
757
762
|
f.write("quality_threshold,mean_aa_mutations,std_aa_mutations,ci_lower,ci_upper,")
|
|
758
763
|
f.write("total_mappable_bases,n_segments\n")
|
|
759
|
-
|
|
764
|
+
|
|
760
765
|
for q, r in zip(quality_thresholds, qc_results):
|
|
761
766
|
f.write(f"{q},{r['mean_aa_mutations']:.6f},{r['std_aa_mutations']:.6f},")
|
|
762
767
|
f.write(f"{r['ci_lower']:.6f},{r['ci_upper']:.6f},")
|
|
763
768
|
f.write(f"{r['total_mappable_bases']},{r['n_segments']}\n")
|
|
764
|
-
|
|
769
|
+
|
|
765
770
|
logging.info(f"Simple QC data saved to: {qc_data_path}")
|
|
766
|
-
|
|
771
|
+
|
|
767
772
|
except Exception as e:
|
|
768
773
|
logging.error(f"Error creating simple QC plots: {e}")
|
|
774
|
+
|
|
775
|
+
def create_comprehensive_qc_plots(quality_thresholds, qc_results, results_dir):
|
|
769
776
|
"""
|
|
770
777
|
Create comprehensive QC plots with error bars and uncertainty quantification.
|
|
771
|
-
|
|
778
|
+
|
|
772
779
|
Args:
|
|
773
780
|
quality_thresholds: List of quality score thresholds
|
|
774
781
|
qc_results: List of comprehensive analysis results
|
|
775
782
|
results_dir: Directory to save the plots
|
|
776
|
-
optimal_qscore: Optimal Q-score threshold (optional)
|
|
777
|
-
optimal_result: Optimal result data (optional)
|
|
778
783
|
"""
|
|
779
784
|
try:
|
|
785
|
+
# Create detailed/qc_plots/ subdirectory
|
|
786
|
+
detailed_dir = os.path.join(results_dir, "detailed")
|
|
787
|
+
qc_plots_dir = os.path.join(detailed_dir, "qc_plots")
|
|
788
|
+
os.makedirs(qc_plots_dir, exist_ok=True)
|
|
789
|
+
|
|
780
790
|
# Extract data for plotting
|
|
781
791
|
aa_mutations = [r['mean_aa_mutations'] for r in qc_results]
|
|
782
792
|
aa_errors = [r['std_aa_mutations'] for r in qc_results]
|
|
@@ -785,46 +795,46 @@ def create_simple_qc_plots(quality_thresholds, qc_results, results_dir, consensu
|
|
|
785
795
|
mappable_bases = [r['mappable_bases'] for r in qc_results]
|
|
786
796
|
net_rates = [r['net_rate'] for r in qc_results]
|
|
787
797
|
net_rate_errors = [r['net_rate_error'] for r in qc_results]
|
|
788
|
-
|
|
798
|
+
|
|
789
799
|
# Create main QC plot with error bars
|
|
790
800
|
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 12))
|
|
791
|
-
|
|
801
|
+
|
|
792
802
|
# Top plot: AA mutations per gene with error bars
|
|
793
803
|
color1 = '#2E8B57'
|
|
794
|
-
ax1.errorbar(quality_thresholds, aa_mutations, yerr=aa_errors,
|
|
795
|
-
fmt='o', capsize=5, capthick=2, markersize=8,
|
|
804
|
+
ax1.errorbar(quality_thresholds, aa_mutations, yerr=aa_errors,
|
|
805
|
+
fmt='o', capsize=5, capthick=2, markersize=8,
|
|
796
806
|
color=color1, ecolor=color1, alpha=0.8, label='Mean ± Std')
|
|
797
|
-
|
|
807
|
+
|
|
798
808
|
# Add confidence intervals as shaded area
|
|
799
|
-
ax1.fill_between(quality_thresholds, aa_ci_lower, aa_ci_upper,
|
|
809
|
+
ax1.fill_between(quality_thresholds, aa_ci_lower, aa_ci_upper,
|
|
800
810
|
alpha=0.3, color=color1, label='95% Confidence Interval')
|
|
801
|
-
|
|
811
|
+
|
|
802
812
|
ax1.set_xlabel('Quality Score Threshold', fontsize=12, fontweight='bold')
|
|
803
813
|
ax1.set_ylabel('Estimated AA Mutations per Gene', fontsize=12, fontweight='bold', color=color1)
|
|
804
814
|
ax1.tick_params(axis='y', labelcolor=color1)
|
|
805
|
-
ax1.set_title('AA Mutations per Gene vs Quality Score Filter (with Error Propagation)',
|
|
815
|
+
ax1.set_title('AA Mutations per Gene vs Quality Score Filter (with Error Propagation)',
|
|
806
816
|
fontsize=14, fontweight='bold')
|
|
807
817
|
ax1.grid(True, alpha=0.3)
|
|
808
818
|
ax1.legend(frameon=False, fontsize=10)
|
|
809
|
-
|
|
819
|
+
|
|
810
820
|
# Add data point labels
|
|
811
821
|
for i, (q, aa_mut, aa_err) in enumerate(zip(quality_thresholds, aa_mutations, aa_errors)):
|
|
812
|
-
ax1.annotate(f'Q{q}\n{aa_mut:.3f}±{aa_err:.3f}',
|
|
813
|
-
(q, aa_mut), xytext=(5, 5),
|
|
822
|
+
ax1.annotate(f'Q{q}\n{aa_mut:.3f}±{aa_err:.3f}',
|
|
823
|
+
(q, aa_mut), xytext=(5, 5),
|
|
814
824
|
textcoords='offset points', fontsize=8, alpha=0.8, color=color1)
|
|
815
|
-
|
|
825
|
+
|
|
816
826
|
# Bottom plot: Mappable bases and AA mutations per gene
|
|
817
827
|
color2 = '#FF6B6B'
|
|
818
828
|
color3 = '#4169E1'
|
|
819
|
-
|
|
829
|
+
|
|
820
830
|
# Mappable bases (right y-axis, created with twinx)
|
|
821
831
|
ax2_twin = ax2.twinx()
|
|
822
|
-
ax2_twin.scatter(quality_thresholds, mappable_bases,
|
|
823
|
-
s=100, alpha=0.7, color=color2, edgecolors='black',
|
|
832
|
+
ax2_twin.scatter(quality_thresholds, mappable_bases,
|
|
833
|
+
s=100, alpha=0.7, color=color2, edgecolors='black',
|
|
824
834
|
linewidth=1, marker='s', label='Mappable Bases')
|
|
825
835
|
ax2_twin.set_ylabel('Number of Mappable Bases', fontsize=12, fontweight='bold', color=color2)
|
|
826
836
|
ax2_twin.tick_params(axis='y', labelcolor=color2)
|
|
827
|
-
|
|
837
|
+
|
|
828
838
|
# AA mutations per gene with error bars (left y-axis)
|
|
829
839
|
ax2.errorbar(quality_thresholds, aa_mutations, yerr=aa_errors,
|
|
830
840
|
fmt='^', capsize=5, capthick=2, markersize=8,
|
|
@@ -832,34 +842,34 @@ def create_simple_qc_plots(quality_thresholds, qc_results, results_dir, consensu
|
|
|
832
842
|
ax2.set_ylabel('Estimated AA Mutations per Gene', fontsize=12, fontweight='bold', color=color3)
|
|
833
843
|
ax2.tick_params(axis='y', labelcolor=color3)
|
|
834
844
|
ax2.set_xlabel('Quality Score Threshold', fontsize=12, fontweight='bold')
|
|
835
|
-
ax2.set_title('Mappable Bases and AA Mutations per Gene vs Quality Score Filter',
|
|
845
|
+
ax2.set_title('Mappable Bases and AA Mutations per Gene vs Quality Score Filter',
|
|
836
846
|
fontsize=14, fontweight='bold')
|
|
837
847
|
ax2.grid(True, alpha=0.3)
|
|
838
|
-
|
|
848
|
+
|
|
839
849
|
# Add legends
|
|
840
850
|
lines1, labels1 = ax2.get_legend_handles_labels()
|
|
841
851
|
lines2, labels2 = ax2_twin.get_legend_handles_labels()
|
|
842
852
|
ax2.legend(lines1 + lines2, labels1 + labels2, loc='upper right', frameon=False, fontsize=10)
|
|
843
|
-
|
|
853
|
+
|
|
844
854
|
# Add data point labels for mappable bases
|
|
845
855
|
for i, (q, bases) in enumerate(zip(quality_thresholds, mappable_bases)):
|
|
846
|
-
ax2_twin.annotate(f'{bases}', (q, bases), xytext=(5, -15),
|
|
856
|
+
ax2_twin.annotate(f'{bases}', (q, bases), xytext=(5, -15),
|
|
847
857
|
textcoords='offset points', fontsize=8, alpha=0.8, color=color2)
|
|
848
|
-
|
|
858
|
+
|
|
849
859
|
plt.tight_layout()
|
|
850
|
-
|
|
851
|
-
# Save the comprehensive plot
|
|
852
|
-
qc_plot_path = os.path.join(
|
|
860
|
+
|
|
861
|
+
# Save the comprehensive plot to detailed/qc_plots/
|
|
862
|
+
qc_plot_path = os.path.join(qc_plots_dir, "comprehensive_qc_analysis.png")
|
|
853
863
|
fig.savefig(qc_plot_path, dpi=300, bbox_inches='tight')
|
|
854
864
|
plt.close(fig)
|
|
855
|
-
|
|
865
|
+
|
|
856
866
|
logging.info(f"Comprehensive QC plot saved to: {qc_plot_path}")
|
|
857
|
-
|
|
867
|
+
|
|
858
868
|
# Create error analysis plot
|
|
859
869
|
create_error_analysis_plot(quality_thresholds, qc_results, results_dir)
|
|
860
|
-
|
|
861
|
-
# Save comprehensive data as CSV
|
|
862
|
-
qc_data_path = os.path.join(
|
|
870
|
+
|
|
871
|
+
# Save comprehensive data as CSV to detailed/
|
|
872
|
+
qc_data_path = os.path.join(detailed_dir, "comprehensive_qc_data.csv")
|
|
863
873
|
with open(qc_data_path, 'w') as f:
|
|
864
874
|
f.write("quality_threshold,mean_aa_mutations,std_aa_mutations,ci_lower,ci_upper,")
|
|
865
875
|
f.write("mappable_bases,hit_rate,hit_rate_ci_lower,hit_rate_ci_upper,")
|
|
@@ -869,7 +879,7 @@ def create_simple_qc_plots(quality_thresholds, qc_results, results_dir, consensu
|
|
|
869
879
|
f.write("bg_qscore_mean,bg_qscore_std,bg_qscore_uncertainty,")
|
|
870
880
|
f.write("hit_weighted_rate,hit_weighted_error,bg_weighted_rate,bg_weighted_error,")
|
|
871
881
|
f.write("net_weighted_rate,net_weighted_error,lambda_bp_weighted,lambda_error_weighted\n")
|
|
872
|
-
|
|
882
|
+
|
|
873
883
|
for q, r in zip(quality_thresholds, qc_results):
|
|
874
884
|
f.write(f"{q},{r['mean_aa_mutations']:.6f},{r['std_aa_mutations']:.6f},")
|
|
875
885
|
f.write(f"{r['ci_lower']:.6f},{r['ci_upper']:.6f},")
|
|
@@ -878,93 +888,98 @@ def create_simple_qc_plots(quality_thresholds, qc_results, results_dir, consensu
|
|
|
878
888
|
f.write(f"{r['bg_rate']:.6f},{r['bg_rate_ci'][0]:.6f},{r['bg_rate_ci'][1]:.6f},")
|
|
879
889
|
f.write(f"{r['net_rate']:.6f},{r['net_rate_error']:.6f},")
|
|
880
890
|
f.write(f"{r['lambda_bp']:.6f},{r['lambda_error']:.6f},{r['alignment_error']:.6f},")
|
|
881
|
-
|
|
891
|
+
|
|
882
892
|
# Q-score information
|
|
883
893
|
hit_qscore_mean = r['hit_qscore_stats']['mean_qscore'] if r['hit_qscore_stats'] else 0.0
|
|
884
894
|
hit_qscore_std = r['hit_qscore_stats']['std_qscore'] if r['hit_qscore_stats'] else 0.0
|
|
885
895
|
bg_qscore_mean = r['bg_qscore_stats']['mean_qscore'] if r['bg_qscore_stats'] else 0.0
|
|
886
896
|
bg_qscore_std = r['bg_qscore_stats']['std_qscore'] if r['bg_qscore_stats'] else 0.0
|
|
887
|
-
|
|
897
|
+
|
|
888
898
|
f.write(f"{hit_qscore_mean:.2f},{hit_qscore_std:.2f},{r['hit_qscore_uncertainty']:.6f},")
|
|
889
899
|
f.write(f"{bg_qscore_mean:.2f},{bg_qscore_std:.2f},{r['bg_qscore_uncertainty']:.6f},")
|
|
890
900
|
f.write(f"{r.get('hit_weighted_rate', 0.0):.6f},{r.get('hit_weighted_error', 0.0):.6f},")
|
|
891
901
|
f.write(f"{r.get('bg_weighted_rate', 0.0):.6f},{r.get('bg_weighted_error', 0.0):.6f},")
|
|
892
902
|
f.write(f"{r.get('net_weighted_rate', 0.0):.6f},{r.get('net_weighted_error', 0.0):.6f},")
|
|
893
903
|
f.write(f"{r.get('lambda_bp_weighted', 0.0):.6f},{r.get('lambda_error_weighted', 0.0):.6f}\n")
|
|
894
|
-
|
|
904
|
+
|
|
895
905
|
logging.info(f"Comprehensive QC data saved to: {qc_data_path}")
|
|
896
|
-
|
|
906
|
+
|
|
897
907
|
except Exception as e:
|
|
898
908
|
logging.error(f"Error creating comprehensive QC plots: {e}")
|
|
899
909
|
|
|
900
910
|
def create_error_analysis_plot(quality_thresholds, qc_results, results_dir):
|
|
901
911
|
"""
|
|
902
912
|
Create a detailed error analysis plot showing different sources of uncertainty.
|
|
903
|
-
|
|
913
|
+
|
|
904
914
|
Args:
|
|
905
915
|
quality_thresholds: List of quality score thresholds
|
|
906
916
|
qc_results: List of comprehensive analysis results
|
|
907
917
|
results_dir: Directory to save the plot
|
|
908
918
|
"""
|
|
909
919
|
try:
|
|
920
|
+
# Create detailed/qc_plots/ subdirectory
|
|
921
|
+
detailed_dir = os.path.join(results_dir, "detailed")
|
|
922
|
+
qc_plots_dir = os.path.join(detailed_dir, "qc_plots")
|
|
923
|
+
os.makedirs(qc_plots_dir, exist_ok=True)
|
|
924
|
+
|
|
910
925
|
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))
|
|
911
|
-
|
|
926
|
+
|
|
912
927
|
# Extract error components
|
|
913
928
|
aa_std = [r['std_aa_mutations'] for r in qc_results]
|
|
914
929
|
net_rate_errors = [r['net_rate_error'] for r in qc_results]
|
|
915
930
|
lambda_errors = [r['lambda_error'] for r in qc_results]
|
|
916
931
|
alignment_errors = [r['alignment_error'] for r in qc_results]
|
|
917
932
|
mappable_bases = [r['mappable_bases'] for r in qc_results]
|
|
918
|
-
|
|
933
|
+
|
|
919
934
|
# Plot 1: AA mutation uncertainty vs quality threshold
|
|
920
935
|
ax1.plot(quality_thresholds, aa_std, 'o-', color='#2E8B57', linewidth=2, markersize=6)
|
|
921
936
|
ax1.set_xlabel('Quality Score Threshold')
|
|
922
937
|
ax1.set_ylabel('AA Mutation Standard Deviation')
|
|
923
938
|
ax1.set_title('AA Mutation Uncertainty vs Quality Filter')
|
|
924
939
|
ax1.grid(True, alpha=0.3)
|
|
925
|
-
|
|
940
|
+
|
|
926
941
|
# Plot 2: Net rate error vs quality threshold
|
|
927
942
|
ax2.plot(quality_thresholds, net_rate_errors, 's-', color='#FF6B6B', linewidth=2, markersize=6)
|
|
928
943
|
ax2.set_xlabel('Quality Score Threshold')
|
|
929
944
|
ax2.set_ylabel('Net Mutation Rate Error')
|
|
930
945
|
ax2.set_title('Net Rate Error vs Quality Filter')
|
|
931
946
|
ax2.grid(True, alpha=0.3)
|
|
932
|
-
|
|
947
|
+
|
|
933
948
|
# Plot 3: Lambda error vs quality threshold
|
|
934
949
|
ax3.plot(quality_thresholds, lambda_errors, '^-', color='#4169E1', linewidth=2, markersize=6)
|
|
935
950
|
ax3.set_xlabel('Quality Score Threshold')
|
|
936
951
|
ax3.set_ylabel('Lambda Error (mutations per copy)')
|
|
937
952
|
ax3.set_title('Lambda Error vs Quality Filter')
|
|
938
953
|
ax3.grid(True, alpha=0.3)
|
|
939
|
-
|
|
954
|
+
|
|
940
955
|
# Plot 4: Alignment error vs mappable bases
|
|
941
956
|
ax4.scatter(mappable_bases, alignment_errors, s=100, alpha=0.7, color='#FF8C00')
|
|
942
957
|
ax4.set_xlabel('Mappable Bases')
|
|
943
|
-
ax4.set_ylabel('Alignment Error (1
|
|
958
|
+
ax4.set_ylabel('Alignment Error (1/sqrt(reads))')
|
|
944
959
|
ax4.set_title('Alignment Error vs Read Count')
|
|
945
960
|
ax4.grid(True, alpha=0.3)
|
|
946
|
-
|
|
961
|
+
|
|
947
962
|
# Add quality threshold labels to scatter plot
|
|
948
963
|
for i, q in enumerate(quality_thresholds):
|
|
949
|
-
ax4.annotate(f'Q{q}', (mappable_bases[i], alignment_errors[i]),
|
|
964
|
+
ax4.annotate(f'Q{q}', (mappable_bases[i], alignment_errors[i]),
|
|
950
965
|
xytext=(5, 5), textcoords='offset points', fontsize=8)
|
|
951
|
-
|
|
966
|
+
|
|
952
967
|
plt.tight_layout()
|
|
953
|
-
|
|
954
|
-
# Save error analysis plot
|
|
955
|
-
error_plot_path = os.path.join(
|
|
968
|
+
|
|
969
|
+
# Save error analysis plot to detailed/qc_plots/
|
|
970
|
+
error_plot_path = os.path.join(qc_plots_dir, "error_analysis.png")
|
|
956
971
|
fig.savefig(error_plot_path, dpi=300, bbox_inches='tight')
|
|
957
972
|
plt.close(fig)
|
|
958
|
-
|
|
973
|
+
|
|
959
974
|
logging.info(f"Error analysis plot saved to: {error_plot_path}")
|
|
960
|
-
|
|
975
|
+
|
|
961
976
|
except Exception as e:
|
|
962
977
|
logging.error(f"Error creating error analysis plot: {e}")
|
|
963
978
|
|
|
964
979
|
def create_qc_plot(quality_thresholds, aa_mutations, mappable_bases, results_dir):
|
|
965
980
|
"""
|
|
966
981
|
Create a dual-axis plot showing quality score threshold vs AA mutations per gene and mappable bases.
|
|
967
|
-
|
|
982
|
+
|
|
968
983
|
Args:
|
|
969
984
|
quality_thresholds: List of quality score thresholds
|
|
970
985
|
aa_mutations: List of corresponding AA mutations per gene
|
|
@@ -972,68 +987,73 @@ def create_qc_plot(quality_thresholds, aa_mutations, mappable_bases, results_dir
|
|
|
972
987
|
results_dir: Directory to save the plot
|
|
973
988
|
"""
|
|
974
989
|
try:
|
|
990
|
+
# Create detailed/qc_plots/ subdirectory
|
|
991
|
+
detailed_dir = os.path.join(results_dir, "detailed")
|
|
992
|
+
qc_plots_dir = os.path.join(detailed_dir, "qc_plots")
|
|
993
|
+
os.makedirs(qc_plots_dir, exist_ok=True)
|
|
994
|
+
|
|
975
995
|
# Create the plot with dual y-axes
|
|
976
996
|
fig, ax1 = plt.subplots(figsize=(12, 8))
|
|
977
|
-
|
|
997
|
+
|
|
978
998
|
# Left y-axis: AA mutations per gene
|
|
979
999
|
color1 = '#2E8B57'
|
|
980
1000
|
ax1.set_xlabel('Quality Score Threshold', fontsize=12, fontweight='bold')
|
|
981
1001
|
ax1.set_ylabel('Estimated AA Mutations per Gene', fontsize=12, fontweight='bold', color=color1)
|
|
982
|
-
ax1.scatter(quality_thresholds, aa_mutations,
|
|
1002
|
+
ax1.scatter(quality_thresholds, aa_mutations,
|
|
983
1003
|
s=100, alpha=0.7, color=color1, edgecolors='black', linewidth=1, label='AA Mutations per Gene')
|
|
984
1004
|
ax1.tick_params(axis='y', labelcolor=color1)
|
|
985
|
-
|
|
1005
|
+
|
|
986
1006
|
# Right y-axis: Mappable bases
|
|
987
1007
|
ax2 = ax1.twinx()
|
|
988
1008
|
color2 = '#FF6B6B'
|
|
989
1009
|
ax2.set_ylabel('Number of Mappable Bases', fontsize=12, fontweight='bold', color=color2)
|
|
990
|
-
ax2.scatter(quality_thresholds, mappable_bases,
|
|
1010
|
+
ax2.scatter(quality_thresholds, mappable_bases,
|
|
991
1011
|
s=100, alpha=0.7, color=color2, edgecolors='black', linewidth=1, marker='s', label='Mappable Bases')
|
|
992
1012
|
ax2.tick_params(axis='y', labelcolor=color2)
|
|
993
|
-
|
|
1013
|
+
|
|
994
1014
|
# Customize the plot
|
|
995
1015
|
ax1.set_title('AA Mutations per Gene and Mappable Bases vs Quality Score Filter', fontsize=14, fontweight='bold')
|
|
996
|
-
|
|
1016
|
+
|
|
997
1017
|
# Add grid for better readability
|
|
998
1018
|
ax1.grid(True, alpha=0.3)
|
|
999
|
-
|
|
1019
|
+
|
|
1000
1020
|
# Customize ticks and spines
|
|
1001
1021
|
ax1.tick_params(axis='both', which='major', labelsize=10, direction='in', length=6)
|
|
1002
1022
|
ax1.tick_params(axis='both', which='minor', direction='in', length=3)
|
|
1003
1023
|
ax1.spines['top'].set_visible(False)
|
|
1004
1024
|
ax1.spines['right'].set_visible(False)
|
|
1005
|
-
|
|
1025
|
+
|
|
1006
1026
|
# Add data point labels for AA mutations
|
|
1007
1027
|
for i, (q, aa_mut) in enumerate(zip(quality_thresholds, aa_mutations)):
|
|
1008
|
-
ax1.annotate(f'Q{q}', (q, aa_mut), xytext=(5, 5),
|
|
1028
|
+
ax1.annotate(f'Q{q}', (q, aa_mut), xytext=(5, 5),
|
|
1009
1029
|
textcoords='offset points', fontsize=9, alpha=0.8, color=color1)
|
|
1010
|
-
|
|
1030
|
+
|
|
1011
1031
|
# Add data point labels for mappable bases
|
|
1012
1032
|
for i, (q, bases) in enumerate(zip(quality_thresholds, mappable_bases)):
|
|
1013
|
-
ax2.annotate(f'{
|
|
1033
|
+
ax2.annotate(f'{bases}', (q, bases), xytext=(5, -15),
|
|
1014
1034
|
textcoords='offset points', fontsize=8, alpha=0.8, color=color2)
|
|
1015
|
-
|
|
1035
|
+
|
|
1016
1036
|
# Add legend
|
|
1017
1037
|
lines1, labels1 = ax1.get_legend_handles_labels()
|
|
1018
1038
|
lines2, labels2 = ax2.get_legend_handles_labels()
|
|
1019
1039
|
ax1.legend(lines1 + lines2, labels1 + labels2, loc='upper right', frameon=False, fontsize=10)
|
|
1020
|
-
|
|
1021
|
-
# Save the plot
|
|
1022
|
-
qc_plot_path = os.path.join(
|
|
1040
|
+
|
|
1041
|
+
# Save the plot to detailed/qc_plots/
|
|
1042
|
+
qc_plot_path = os.path.join(qc_plots_dir, "qc_mutation_rate_vs_quality.png")
|
|
1023
1043
|
fig.savefig(qc_plot_path, dpi=300, bbox_inches='tight')
|
|
1024
1044
|
plt.close(fig)
|
|
1025
|
-
|
|
1045
|
+
|
|
1026
1046
|
logging.info(f"QC plot saved to: {qc_plot_path}")
|
|
1027
|
-
|
|
1028
|
-
# Also save data as CSV
|
|
1029
|
-
qc_data_path = os.path.join(
|
|
1047
|
+
|
|
1048
|
+
# Also save data as CSV to detailed/qc_plots/
|
|
1049
|
+
qc_data_path = os.path.join(qc_plots_dir, "qc_mutation_rate_vs_quality.csv")
|
|
1030
1050
|
with open(qc_data_path, 'w') as f:
|
|
1031
1051
|
f.write("quality_threshold,aa_mutations_per_gene,mappable_bases\n")
|
|
1032
1052
|
for q, aa_mut, bases in zip(quality_thresholds, aa_mutations, mappable_bases):
|
|
1033
1053
|
f.write(f"{q},{aa_mut:.6f},{bases}\n")
|
|
1034
|
-
|
|
1054
|
+
|
|
1035
1055
|
logging.info(f"QC data saved to: {qc_data_path}")
|
|
1036
|
-
|
|
1056
|
+
|
|
1037
1057
|
except Exception as e:
|
|
1038
1058
|
logging.error(f"Error creating QC plot: {e}")
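Each of the four plotting functions above repeats the same three lines to build `detailed/qc_plots/`. One way to factor that out is sketched below; the helper name `ensure_qc_dirs` is hypothetical and is not part of the 0.2.0 release.

```python
import os
import tempfile


def ensure_qc_dirs(results_dir):
    """Create <results_dir>/detailed/qc_plots/ and return both directory paths."""
    detailed_dir = os.path.join(results_dir, "detailed")
    qc_plots_dir = os.path.join(detailed_dir, "qc_plots")
    # exist_ok=True makes the call safe when several plot functions run in turn.
    os.makedirs(qc_plots_dir, exist_ok=True)
    return detailed_dir, qc_plots_dir


# Usage with a throwaway directory standing in for a real sample folder.
with tempfile.TemporaryDirectory() as tmp:
    detailed_dir, qc_plots_dir = ensure_qc_dirs(os.path.join(tmp, "sample_name"))
    created = os.path.isdir(qc_plots_dir)
    relative = os.path.relpath(qc_plots_dir, tmp)
    print(relative)
```

Keeping the layout in one place would mean a future change to the output structure touches a single function instead of four.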
|
|
1039
1059
|
|
|
@@ -1330,74 +1350,74 @@ def run_segmented_analysis(segment_files, quality_threshold, work_dir, ref_hit_f
|
|
|
1330
1350
|
except Exception as e:
|
|
1331
1351
|
logging.error(f"Error in segmented analysis: {e}")
|
|
1332
1352
|
return None
|
|
1353
|
+
|
|
1354
|
+
def calculate_qscore_weighted_mismatches(sam_file, ref_seq, qscore_stats):
|
|
1333
1355
|
"""
|
|
1334
1356
|
Calculate mismatches weighted by Q-score uncertainty with proper sampling error.
|
|
1335
|
-
|
|
1357
|
+
|
|
1336
1358
|
Args:
|
|
1337
1359
|
sam_file: Path to SAM file
|
|
1338
1360
|
ref_seq: Reference sequence
|
|
1339
1361
|
qscore_stats: Q-score statistics from extract_qscores_from_sam
|
|
1340
|
-
|
|
1362
|
+
|
|
1341
1363
|
Returns:
|
|
1342
1364
|
tuple: (weighted_mismatches, total_weighted_coverage, raw_mismatches, raw_coverage, position_weights, position_outcomes)
|
|
1343
1365
|
"""
|
|
1344
1366
|
try:
|
|
1345
|
-
import pysam
|
|
1346
|
-
|
|
1347
1367
|
weighted_mismatches = 0.0
|
|
1348
1368
|
total_weighted_coverage = 0.0
|
|
1349
1369
|
raw_mismatches = 0
|
|
1350
1370
|
raw_coverage = 0
|
|
1351
|
-
|
|
1371
|
+
|
|
1352
1372
|
# Store position-level data for proper sampling error calculation
|
|
1353
1373
|
position_weights = []
|
|
1354
1374
|
position_outcomes = []
|
|
1355
|
-
|
|
1375
|
+
|
|
1356
1376
|
position_qscores = qscore_stats['position_avg_qscores']
|
|
1357
|
-
|
|
1377
|
+
|
|
1358
1378
|
with pysam.AlignmentFile(sam_file, "r") as samfile:
|
|
1359
1379
|
for read in samfile:
|
|
1360
1380
|
if read.is_unmapped:
|
|
1361
1381
|
continue
|
|
1362
|
-
|
|
1382
|
+
|
|
1363
1383
|
# Get aligned pairs (read_pos, ref_pos)
|
|
1364
1384
|
for read_pos, ref_pos in read.get_aligned_pairs(matches_only=False):
|
|
1365
1385
|
if ref_pos is None or read_pos is None:
|
|
1366
1386
|
continue
|
|
1367
|
-
|
|
1387
|
+
|
|
1368
1388
|
if ref_pos >= len(ref_seq):
|
|
1369
1389
|
continue
|
|
1370
|
-
|
|
1390
|
+
|
|
1371
1391
|
# Get base calls
|
|
1372
1392
|
read_base = read.query_sequence[read_pos].upper()
|
|
1373
1393
|
ref_base = ref_seq[ref_pos].upper()
|
|
1374
|
-
|
|
1394
|
+
|
|
1375
1395
|
# Skip if either base is N
|
|
1376
1396
|
if read_base == 'N' or ref_base == 'N':
|
|
1377
1397
|
continue
|
|
1378
|
-
|
|
1398
|
+
|
|
1379
1399
|
# Get Q-score for this position
|
|
1380
1400
|
qscore = position_qscores.get(ref_pos, qscore_stats['mean_qscore'])
|
|
1381
1401
|
uncertainty_factor = qscore_uncertainty_factor(qscore)
|
|
1382
|
-
|
|
1402
|
+
|
|
1383
1403
|
# Weight by uncertainty (lower Q-score = higher uncertainty = lower weight)
|
|
1384
1404
|
weight = 1.0 - uncertainty_factor
|
|
1385
|
-
|
|
1405
|
+
|
|
1386
1406
|
# Store position-level data
|
|
1387
1407
|
position_weights.append(weight)
|
|
1388
1408
|
position_outcomes.append(1 if read_base != ref_base else 0)
|
|
1389
|
-
|
|
1409
|
+
|
|
1390
1410
|
# Count coverage
|
|
1391
1411
|
total_weighted_coverage += weight
|
|
1392
1412
|
raw_coverage += 1
|
|
1393
|
-
|
|
1413
|
+
|
|
1394
1414
|
# Count mismatches
|
|
1395
1415
|
if read_base != ref_base:
|
|
1396
1416
|
weighted_mismatches += weight
|
|
1397
1417
|
raw_mismatches += 1
|
|
1398
|
-
|
|
1418
|
+
|
|
1399
1419
|
return weighted_mismatches, total_weighted_coverage, raw_mismatches, raw_coverage, position_weights, position_outcomes
|
|
1400
|
-
|
|
1420
|
+
|
|
1401
1421
|
except Exception as e:
|
|
1402
1422
|
logging.error(f"Error calculating Q-score weighted mismatches: {e}")
|
|
1403
1423
|
return 0.0, 0.0, 0, 0, [], []
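The position-level lists returned by `calculate_qscore_weighted_mismatches` can be collapsed into a single weighted mismatch rate. The following is a minimal sketch, not the package's own code: `qscore_weight` and `weighted_mismatch_rate` are hypothetical helpers, and the weight formula `1 - sqrt(10^(-Q/10))` is taken from the methodology notes rather than from `qscore_uncertainty_factor`, whose body is not shown in this diff.

```python
import math


def qscore_weight(qscore):
    """Hypothetical weight: 1 - sqrt(P_error), where P_error = 10^(-Q/10)."""
    p_error = 10 ** (-qscore / 10.0)
    return 1.0 - math.sqrt(p_error)


def weighted_mismatch_rate(position_weights, position_outcomes):
    """Weighted rate = sum(weight * outcome) / sum(weight); 0.0 if no coverage."""
    total_weight = sum(position_weights)
    if total_weight == 0:
        return 0.0
    weighted_hits = sum(w * o for w, o in zip(position_weights, position_outcomes))
    return weighted_hits / total_weight


# Three aligned bases at Q10, Q20 and Q30; only the Q10 base is a mismatch.
weights = [qscore_weight(q) for q in (10, 20, 30)]
outcomes = [1, 0, 0]
rate = weighted_mismatch_rate(weights, outcomes)
```

Because the only mismatch sits on the lowest-quality base, the weighted rate comes out below the raw rate of 1/3, which is the intended effect of down-weighting uncertain calls.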
|
|
@@ -1827,6 +1847,289 @@ def simulate_aa_distribution(lambda_bp, cds_seq, n_trials=1000):
|
|
|
1827
1847
|
|
|
1828
1848
|
return aa_diffs
|
|
1829
1849
|
|
|
1850
|
+
def create_output_directories(results_dir):
|
|
1851
|
+
"""
|
|
1852
|
+
Create the output directory structure with detailed/ and detailed/qc_plots/ subdirectories.
|
|
1853
|
+
|
|
1854
|
+
Args:
|
|
1855
|
+
results_dir: Base results directory path
|
|
1856
|
+
|
|
1857
|
+
Returns:
|
|
1858
|
+
dict: Paths to created directories
|
|
1859
|
+
"""
|
|
1860
|
+
results_dir = Path(results_dir)
|
|
1861
|
+
detailed_dir = results_dir / "detailed"
|
|
1862
|
+
qc_plots_dir = detailed_dir / "qc_plots"
|
|
1863
|
+
|
|
1864
|
+
detailed_dir.mkdir(parents=True, exist_ok=True)
|
|
1865
|
+
qc_plots_dir.mkdir(parents=True, exist_ok=True)
|
|
1866
|
+
|
|
1867
|
+
logging.info(f"Created output directories: {detailed_dir}, {qc_plots_dir}")
|
|
1868
|
+
|
|
1869
|
+
return {
|
|
1870
|
+
'results_dir': results_dir,
|
|
1871
|
+
'detailed_dir': detailed_dir,
|
|
1872
|
+
'qc_plots_dir': qc_plots_dir,
|
|
1873
|
+
}
|
|
1874
|
+
|
|
1875
|
+
def write_key_findings(results_dir, consensus_info, simple_lambda, simple_aa_mean, is_protein, hit_seq):
|
|
1876
|
+
"""
|
|
1877
|
+
Generate lay-user executive summary KEY_FINDINGS.txt.
|
|
1878
|
+
|
|
1879
|
+
Args:
|
|
1880
|
+
results_dir: Results directory path
|
|
1881
|
+
consensus_info: Consensus AA mutation estimate from QC analysis
|
|
1882
|
+
simple_lambda: Simple lambda (bp mutations per copy) from main analysis
|
|
1883
|
+
simple_aa_mean: Simple AA mutation mean from Monte Carlo simulation
|
|
1884
|
+
is_protein: Whether the region is protein-coding
|
|
1885
|
+
hit_seq: The hit sequence (for length calculation)
|
|
1886
|
+
"""
|
|
1887
|
+
key_findings_path = Path(results_dir) / "KEY_FINDINGS.txt"
|
|
1888
|
+
|
|
1889
|
+
with open(key_findings_path, "w") as f:
|
|
1890
|
+
f.write("=" * 60 + "\n")
|
|
1891
|
+
f.write("EP LIBRARY PROFILER - KEY FINDINGS\n")
|
|
1892
|
+
f.write("=" * 60 + "\n\n")
|
|
1893
|
+
|
|
1894
|
+
# Determine which value to use as the "headline" number
|
|
1895
|
+
if consensus_info and consensus_info.get("consensus_mean") is not None:
|
|
1896
|
+
headline_aa = consensus_info["consensus_mean"]
|
|
1897
|
+
headline_std = consensus_info.get("consensus_std", 0.0)
|
|
1898
|
+
method_note = "consensus (precision-weighted average across Q-score thresholds)"
|
|
1899
|
+
elif simple_aa_mean is not None:
|
|
1900
|
+
headline_aa = simple_aa_mean
|
|
1901
|
+
headline_std = 0.0 # Simple method doesn't provide error
|
|
1902
|
+
method_note = "Monte Carlo simulation (single Q-score)"
|
|
1903
|
+
else:
|
|
1904
|
+
headline_aa = None
|
|
1905
|
+
headline_std = 0.0
|
|
1906
|
+
method_note = "N/A"
|
|
1907
|
+
|
|
1908
|
+
f.write("EXPECTED AMINO ACID MUTATIONS PER GENE COPY\n")
|
|
1909
|
+
f.write("-" * 45 + "\n")
|
|
1910
|
+
if is_protein and headline_aa is not None:
|
|
1911
|
+
f.write(f" {headline_aa:.2f} +/- {headline_std:.2f} AA mutations per gene copy\n")
|
|
1912
|
+
f.write(f" (Method: {method_note})\n\n")
|
|
1913
|
+
|
|
1914
|
+
# Plain-language interpretation using Poisson distribution
|
|
1915
|
+
f.write("WHAT THIS MEANS (Poisson distribution):\n")
|
|
1916
|
+
f.write("-" * 45 + "\n")
|
|
1917
|
+
if headline_aa > 0:
|
|
1918
|
+
# P(k=0) = e^(-lambda)
|
|
1919
|
+
p_wildtype = np.exp(-headline_aa) * 100
|
|
1920
|
+
# P(k=1) = lambda * e^(-lambda)
|
|
1921
|
+
p_one_mut = headline_aa * np.exp(-headline_aa) * 100
|
|
1922
|
+
# P(k>=2) = 1 - P(0) - P(1)
|
|
1923
|
+
p_two_plus = 100 - p_wildtype - p_one_mut
|
|
1924
|
+
|
|
1925
|
+
f.write(f" ~{p_wildtype:.1f}% of gene copies are wild-type (0 AA mutations)\n")
|
|
1926
|
+
f.write(f" ~{p_one_mut:.1f}% have exactly 1 AA mutation\n")
|
|
1927
|
+
f.write(f" ~{p_two_plus:.1f}% have 2 or more AA mutations\n\n")
|
|
1928
|
+
else:
|
|
1929
|
+
f.write(" Nearly all gene copies are expected to be wild-type.\n\n")
|
|
1930
|
+
else:
|
|
1931
|
+
if not is_protein:
|
|
1932
|
+
f.write(" Region is not protein-coding; AA mutation estimate not applicable.\n\n")
|
|
1933
|
+
else:
|
|
1934
|
+
f.write(" AA mutation estimate could not be calculated.\n\n")
|
|
1935
|
+
|
|
1936
|
+
# Quality assessment
|
|
1937
|
+
f.write("QUALITY ASSESSMENT\n")
|
|
1938
|
+
f.write("-" * 45 + "\n")
|
|
1939
|
+
if consensus_info:
|
|
1940
|
+
n_thresholds = len(consensus_info.get("thresholds_used", []))
|
|
1941
|
+
min_bases = consensus_info.get("min_mappable_bases", 0)
|
|
1942
|
+
note = consensus_info.get("note", "")
|
|
1943
|
+
|
|
1944
|
+
if n_thresholds >= 3 and note != "FELL_BACK_TO_MAX_COVERAGE":
|
|
1945
|
+
f.write(" GOOD - Multiple Q-score thresholds contributed to consensus\n")
|
|
1946
|
+
elif n_thresholds >= 1:
|
|
1947
|
+
f.write(" ACCEPTABLE - Limited Q-score thresholds available\n")
|
|
1948
|
+
else:
|
|
1949
|
+
f.write(" LOW COVERAGE - Results may be unreliable\n")
|
|
1950
|
+
|
|
1951
|
+
if note == "FELL_BACK_TO_MAX_COVERAGE":
|
|
1952
|
+
f.write(" WARNING: Fell back to max-coverage threshold due to low coverage\n")
|
|
1953
|
+
else:
|
|
1954
|
+
f.write(" UNKNOWN - Consensus analysis not available\n")
|
|
1955
|
+
|
|
1956
|
+
f.write("\n")
|
|
1957
|
+
f.write("FOR DETAILED TECHNICAL INFORMATION\n")
|
|
1958
|
+
f.write("-" * 45 + "\n")
|
|
1959
|
+
f.write(" See the detailed/ folder for:\n")
|
|
1960
|
+
f.write(" - methodology_notes.txt: Full explanation of calculations\n")
|
|
1961
|
+
f.write(" - lambda_comparison.csv: Side-by-side lambda estimates\n")
|
|
1962
|
+
f.write(" - comprehensive_qc_data.csv: All Q-score threshold results\n")
|
|
1963
|
+
f.write("\n")
|
|
1964
|
+
|
|
1965
|
+
logging.info(f"Wrote KEY_FINDINGS.txt to: {key_findings_path}")
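The Poisson interpretation written by `write_key_findings` can be checked by hand. This is a small sketch of the same arithmetic using only the standard library; `poisson_breakdown` is a hypothetical name introduced here for illustration.

```python
import math


def poisson_breakdown(mean_aa_mutations):
    """Return (% with 0, % with exactly 1, % with 2 or more) under Poisson(mean)."""
    p_zero = math.exp(-mean_aa_mutations)    # P(k=0) = e^(-lambda)
    p_one = mean_aa_mutations * p_zero       # P(k=1) = lambda * e^(-lambda)
    p_two_plus = 1.0 - p_zero - p_one        # P(k>=2) = 1 - P(0) - P(1)
    return 100 * p_zero, 100 * p_one, 100 * p_two_plus


wild_type, single, multiple = poisson_breakdown(1.0)
print(f"{wild_type:.1f}% wild-type, {single:.1f}% single, {multiple:.1f}% 2+")
# prints "36.8% wild-type, 36.8% single, 26.4% 2+"
```

With a mean of one AA mutation per copy, roughly a third of the library is still wild-type, which is why the summary reports the full breakdown rather than the mean alone.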
|
|
1966
|
+
|
|
1967
|
+
def write_lambda_comparison(detailed_dir, simple_lambda, simple_aa_mean, consensus_info, hit_seq_length):
|
|
1968
|
+
"""
|
|
1969
|
+
Write CSV comparing all lambda estimates side-by-side.
|
|
1970
|
+
|
|
1971
|
+
Args:
|
|
1972
|
+
detailed_dir: Path to detailed/ directory
|
|
1973
|
+
simple_lambda: Simple lambda (bp mutations per copy)
|
|
1974
|
+
simple_aa_mean: Simple AA mutation mean from Monte Carlo
|
|
1975
|
+
consensus_info: Consensus info from QC analysis
|
|
1976
|
+
hit_seq_length: Length of the hit sequence
|
|
1977
|
+
"""
|
|
1978
|
+
lambda_csv_path = Path(detailed_dir) / "lambda_comparison.csv"
|
|
1979
|
+
|
|
1980
|
+
with open(lambda_csv_path, "w") as f:
|
|
1981
|
+
f.write("method,lambda_bp,lambda_error,aa_estimate,aa_error,notes\n")
|
|
1982
|
+
|
|
1983
|
+
# Simple method (from main analysis)
|
|
1984
|
+
simple_error = "N/A" # Simple method doesn't compute error
|
|
1985
|
+
simple_aa_err = "N/A"
|
|
1986
|
+
f.write(f"simple,{simple_lambda:.6f},{simple_error},")
|
|
1987
|
+
if simple_aa_mean is not None:
|
|
1988
|
+
f.write(f"{simple_aa_mean:.4f},{simple_aa_err},")
|
|
1989
|
+
else:
|
|
1990
|
+
f.write("N/A,N/A,")
|
|
1991
|
+
f.write("Used for KDE plot and Monte Carlo simulation\n")
|
|
1992
|
+
|
|
1993
|
+
# Consensus method (from QC analysis)
|
|
1994
|
+
if consensus_info and consensus_info.get("consensus_mean") is not None:
|
|
1995
|
+
consensus_mean = consensus_info["consensus_mean"]
|
|
1996
|
+
consensus_std = consensus_info.get("consensus_std", 0.0)
|
|
1997
|
+
thresholds = consensus_info.get("thresholds_used", [])
|
|
1998
|
+
# Consensus is in AA mutations, back-calculate approximate lambda
|
|
1999
|
+
# Rough approximation: lambda_bp ~ 3 * aa_mutations
|
|
2000
|
+
approx_lambda = consensus_mean * 3.0
|
|
2001
|
+
approx_lambda_err = consensus_std * 3.0
|
|
2002
|
+
f.write(f"consensus_weighted,{approx_lambda:.6f},{approx_lambda_err:.6f},")
|
|
2003
|
+
f.write(f"{consensus_mean:.4f},{consensus_std:.4f},")
|
|
2004
|
+
f.write(f'"Precision-weighted across Q-scores: {thresholds}"\n')
|
|
2005
|
+
else:
|
|
2006
|
+
f.write("consensus_weighted,N/A,N/A,N/A,N/A,Not computed or insufficient data\n")
|
|
2007
|
+
|
|
2008
|
+
logging.info(f"Wrote lambda_comparison.csv to: {lambda_csv_path}")
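A comparison file like `lambda_comparison.csv` has a free-text notes column, and a threshold list rendered as `[10, 15, 20]` contains commas. One way to keep every row at six columns is to let the standard-library `csv` module handle quoting. This is a sketch under that assumption: `format_rows` is a hypothetical helper, and all numeric values below are made up for illustration.

```python
import csv
import io

HEADER = ["method", "lambda_bp", "lambda_error", "aa_estimate", "aa_error", "notes"]


def format_rows(rows):
    """Serialise rows to CSV text; csv.writer quotes any field containing commas."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


thresholds = [10, 15, 20]  # illustrative values only
text = format_rows([
    ["simple", "2.700000", "N/A", "1.9000", "N/A",
     "Used for KDE plot and Monte Carlo simulation"],
    ["consensus_weighted", "5.400000", "0.300000", "1.8000", "0.1000",
     f"Precision-weighted across Q-scores: {thresholds}"],
])

# Reading the text back confirms that each row still has six fields.
parsed = list(csv.reader(io.StringIO(text)))
```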
|
|
2009
|
+
|
|
2010
|
+
def write_methodology_notes(detailed_dir):
|
|
2011
|
+
"""
|
|
2012
|
+
Write detailed methodology documentation explaining each lambda calculation method.
|
|
2013
|
+
|
|
2014
|
+
Args:
|
|
2015
|
+
detailed_dir: Path to detailed/ directory
|
|
2016
|
+
"""
|
|
2017
|
+
methodology_path = Path(detailed_dir) / "methodology_notes.txt"
|
|
2018
|
+
|
|
2019
|
+
content = """EP LIBRARY PROFILER - METHODOLOGY NOTES
|
|
2020
|
+
=======================================
|
|
2021
|
+
|
|
2022
|
+
This document explains the different mutation rate estimates produced by the
|
|
2023
|
+
EP library profiler and which outputs use which estimates.
|
|
2024
|
+
|
|
2025
|
+
|
|
2026
|
+
LAMBDA CALCULATION METHODS
|
|
2027
|
+
--------------------------
|
|
2028
|
+
|
|
2029
|
+
1. SIMPLE LAMBDA (used for KDE plot and Monte Carlo simulation)
|
|
2030
|
+
|
|
2031
|
+
Formula: lambda_bp = (hit_rate - bg_rate) * sequence_length
|
|
2032
|
+
|
|
2033
|
+
Where:
|
|
2034
|
+
- hit_rate = total_mismatches / total_covered_bases (in target region)
|
|
2035
|
+
- bg_rate = total_mismatches / total_covered_bases (in plasmid excluding target)
|
|
2036
|
+
- sequence_length = length of target CDS in base pairs
|
|
2037
|
+
|
|
2038
|
+
This method:
|
|
2039
|
+
- Does NOT include error propagation
|
|
2040
|
+
- Does NOT weight by Q-score
|
|
2041
|
+
- Is fast and provides a point estimate
|
|
2042
|
+
|
|
2043
|
+
Used in:
|
|
2044
|
+
- summary_panels.png/pdf (Panel 4: KDE of AA mutations)
|
|
2045
|
+
- summary.txt
|
|
2046
|
+
- aa_mutation_distribution.csv
|
|
2047
|
+
|
|
2048
|
+
|
|
2049
|
+
2. Q-SCORE WEIGHTED LAMBDA (used in comprehensive QC analysis)
|
|
2050
|
+
|
|
2051
|
+
Formula: lambda_bp_weighted = net_weighted_rate * sequence_length
|
|
2052
|
+
|
|
2053
|
+
Where:
|
|
2054
|
+
- net_weighted_rate = hit_weighted_rate - bg_weighted_rate
|
|
2055
|
+
- Weighted rates account for per-base Q-score uncertainty
|
|
2056
|
+
- Weights = 1 - sqrt(10^(-Q/10)) for each position
|
|
2057
|
+
|
|
2058
|
+
This method:
|
|
2059
|
+
- DOES include error propagation
|
|
2060
|
+
- DOES weight by Q-score (higher Q-score = higher weight)
|
|
2061
|
+
- Provides confidence intervals
|
|
2062
|
+
|
|
2063
|
+
Used in:
|
|
2064
|
+
- comprehensive_qc_data.csv
|
|
2065
|
+
- error_analysis.png
|
|
2066
|
+
|
|
2067
|
+
|
|
2068
|
+
3. CONSENSUS LAMBDA (recommended for reporting)
|
|
2069
|
+
|
|
2070
|
+
Formula: Precision-weighted average across Q-score thresholds
|
|
2071
|
+
|
|
2072
|
+
weights[i] = 1 / std_aa_mutations[i]
|
|
2073
|
+
consensus_mean = sum(weights * means) / sum(weights)
|
|
2074
|
+
|
|
2075
|
+
This method:
|
|
2076
|
+
- Aggregates estimates from multiple Q-score filtering thresholds
|
|
2077
|
+
- Weights by precision (lower uncertainty = higher weight)
|
|
2078
|
+
- Requires minimum coverage threshold (default 1000 mappable bases)
|
|
2079
|
+
- Provides the most robust estimate when multiple thresholds pass QC
|
|
2080
|
+
|
|
2081
|
+
Used in:
|
|
2082
|
+
- aa_mutation_consensus.txt
|
|
2083
|
+
- KEY_FINDINGS.txt
|
|
2084
|
+
- QC plots (red dashed line)
|
|
2085
|
+
|
|
2086
|
+
|
|
2087
|
+
WHICH VALUE SHOULD I USE?
|
|
2088
|
+
-------------------------
|
|
2089
|
+
|
|
2090
|
+
For publication/reporting:
|
|
2091
|
+
Use the CONSENSUS value from aa_mutation_consensus.txt or KEY_FINDINGS.txt
|
|
2092
|
+
This is the most statistically robust estimate.
|
|
2093
|
+
|
|
2094
|
+
For understanding the distribution shape:
|
|
2095
|
+
Use the KDE plot in summary_panels.png
|
|
2096
|
+
Note: This uses the SIMPLE lambda, not the consensus.
|
|
2097
|
+
|
|
2098
|
+
For detailed error analysis:
|
|
2099
|
+
Use comprehensive_qc_data.csv in the detailed/ folder
|
|
2100
|
+
This contains per-Q-score estimates with full error propagation.
|
|
2101
|
+
|
|
2102
|
+
|
|
2103
|
+
OUTPUT FILE REFERENCE
|
|
2104
|
+
---------------------
|
|
2105
|
+
|
|
2106
|
+
Root folder:
|
|
2107
|
+
- KEY_FINDINGS.txt: Executive summary with consensus AA mutations
|
|
2108
|
+
- summary_panels.png/pdf: Main visualization (uses simple lambda for KDE)
|
|
2109
|
+
- aa_mutation_consensus.txt: Consensus estimate details
|
|
2110
|
+
|
|
2111
|
+
detailed/ folder:
|
|
2112
|
+
- methodology_notes.txt: This file
|
|
2113
|
+
- lambda_comparison.csv: Side-by-side comparison of all methods
|
|
2114
|
+
- comprehensive_qc_data.csv: Full QC data with error estimates
|
|
2115
|
+
- simple_qc_data.csv: Simplified QC data
|
|
2116
|
+
- gene_mismatch_rates.csv: Per-position mismatch rates
|
|
2117
|
+
- base_distribution.csv: Base counts at each position
|
|
2118
|
+
- aa_substitutions.csv: Amino acid substitution data
|
|
2119
|
+
- plasmid_coverage.csv: Coverage across plasmid
|
|
2120
|
+
- aa_mutation_distribution.csv: Monte Carlo AA mutation trials
|
|
2121
|
+
|
|
2122
|
+
detailed/qc_plots/ folder:
|
|
2123
|
+
- qc_plot_*.png: Q-score threshold analysis plot
|
|
2124
|
+
- comprehensive_qc_analysis.png: Detailed QC visualization
|
|
2125
|
+
- error_analysis.png: Error component breakdown
|
|
2126
|
+
"""
|
|
2127
|
+
|
|
2128
|
+
with open(methodology_path, "w") as f:
|
|
2129
|
+
f.write(content)
|
|
2130
|
+
|
|
2131
|
+
logging.info(f"Wrote methodology_notes.txt to: {methodology_path}")
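The consensus formula described in the methodology notes (`weights[i] = 1 / std_aa_mutations[i]`, then a weighted mean) can be sketched in a few lines. `precision_weighted_mean` is a hypothetical name, and the input values are invented for illustration. The coverage filtering and the consensus standard deviation are not covered here because their details are not shown in this diff.

```python
def precision_weighted_mean(means, stds):
    """Weighted mean with weights 1/std, as stated in the methodology notes."""
    if not means or len(means) != len(stds):
        raise ValueError("means and stds must be non-empty and the same length")
    if any(s <= 0 for s in stds):
        raise ValueError("each standard deviation must be positive")
    weights = [1.0 / s for s in stds]
    return sum(w * m for w, m in zip(weights, means)) / sum(weights)


# Two thresholds: the estimate with the smaller std gets twice the weight.
consensus = precision_weighted_mean([1.0, 2.0], [0.5, 1.0])
```

Here the weights are 2 and 1, so the consensus is (2 * 1.0 + 1 * 2.0) / 3, which pulls the result towards the more precise estimate.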
|
|
2132
|
+
|
|
1830
2133
|
def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, work_dir, results_dir,
|
|
1831
2134
|
chunks, ref_hit_fasta, plasmid_fasta, hit_seq, hit_id, plasmid_seq, idx):
|
|
1832
2135
|
"""
|
|
@@ -1854,13 +2157,18 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
|
|
|
1854
2157
|
|
|
1855
2158
|
# Ensure work directory exists
|
|
1856
2159
|
os.makedirs(work_dir, exist_ok=True)
|
|
1857
|
-
|
|
2160
|
+
|
|
1858
2161
|
# Create subdirectory for this Q-score analysis
|
|
1859
2162
|
qscore_results_dir = results_dir
|
|
1860
2163
|
if qscore is not None:
|
|
1861
2164
|
qscore_results_dir = os.path.join(results_dir, f"q{qscore}_analysis")
|
|
1862
2165
|
os.makedirs(qscore_results_dir, exist_ok=True)
|
|
1863
|
-
|
|
2166
|
+
|
|
2167
|
+
# Create output directory structure (detailed/ and detailed/qc_plots/)
|
|
2168
|
+
output_dirs = create_output_directories(qscore_results_dir)
|
|
2169
|
+
detailed_dir = output_dirs['detailed_dir']
|
|
2170
|
+
qc_plots_dir = output_dirs['qc_plots_dir']
|
|
2171
|
+
|
|
1864
2172
|
# Write chunks FASTA & align to background‐chunks
|
|
1865
2173
|
chunks_fasta = create_multi_fasta(chunks, work_dir)
|
|
1866
2174
|
sam_chunks = run_minimap2(fastq_path, chunks_fasta, "plasmid_chunks_alignment", work_dir)
|
|
@@ -1976,9 +2284,9 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
|
|
|
1976
2284
|
qscore_info = f" ({qscore_desc})" if qscore_desc != "unfiltered" else ""
|
|
1977
2285
|
|
|
1978
2286
|
# ----------------------------
|
|
1979
|
-
# SAVE CSV FOR MUTATION RATES (PANEL 1)
|
|
2287
|
+
# SAVE CSV FOR MUTATION RATES (PANEL 1) - to detailed/
|
|
1980
2288
|
# ----------------------------
|
|
1981
|
-
gene_mismatch_csv = os.path.join(
|
|
2289
|
+
gene_mismatch_csv = os.path.join(detailed_dir, "gene_mismatch_rates.csv")
|
|
1982
2290
|
with open(gene_mismatch_csv, "w", newline="") as csvfile:
|
|
1983
2291
|
csvfile.write(f"# gene_id: {hit_id}\n")
|
|
1984
2292
|
csvfile.write(f"# background_rate_per_kb: {bg_rate_per_kb:.6f}\n")
|
|
@@ -1988,9 +2296,9 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
|
|
|
1988
2296
|
logging.info(f"Saved CSV for gene mismatch rates: {gene_mismatch_csv}")
|
|
1989
2297
|
|
|
1990
2298
|
# ----------------------------
|
|
1991
|
-
# SAVE CSV FOR BASE DISTRIBUTION (PANEL 2)
|
|
2299
|
+
# SAVE CSV FOR BASE DISTRIBUTION (PANEL 2) - to detailed/
|
|
1992
2300
|
# ----------------------------
|
|
1993
|
-
base_dist_csv = os.path.join(
|
|
2301
|
+
base_dist_csv = os.path.join(detailed_dir, "base_distribution.csv")
|
|
1994
2302
|
with open(base_dist_csv, "w", newline="") as csvfile:
|
|
1995
2303
|
csvfile.write(f"# gene_id: {hit_id}\n")
|
|
1996
2304
|
csvfile.write("position_1based,ref_base,A_count,C_count,G_count,T_count,N_count\n")
|
|
@@ -2000,10 +2308,10 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
|
|
|
2000
2308
|
logging.info(f"Saved CSV for base distribution: {base_dist_csv}")
|
|
2001
2309
|
|
|
2002
2310
|
# ----------------------------
|
|
2003
|
-
# SAVE CSV FOR AA SUBSTITUTIONS (PANEL 3) - only if protein
|
|
2311
|
+
# SAVE CSV FOR AA SUBSTITUTIONS (PANEL 3) - to detailed/ - only if protein
|
|
2004
2312
|
# ----------------------------
|
|
2005
2313
|
if is_protein:
|
|
2006
|
-
aa_subst_csv = os.path.join(
|
|
2314
|
+
aa_subst_csv = os.path.join(detailed_dir, "aa_substitutions.csv")
|
|
2007
2315
|
with open(aa_subst_csv, "w", newline="") as csvfile:
|
|
2008
2316
|
csvfile.write(f"# gene_id: {hit_id}\n")
|
|
2009
2317
|
csvfile.write(f"# lambda_bp_mut: {est_mut_per_copy:.6f}\n")
|
|
@@ -2013,9 +2321,9 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
|
|
|
2013
2321
|
logging.info(f"Saved CSV for AA substitutions: {aa_subst_csv}")
|
|
2014
2322
|
|
|
2015
2323
|
# ----------------------------
|
|
2016
|
-
# SAVE CSV FOR PLASMID COVERAGE (PANEL 4)
|
|
2324
|
+
# SAVE CSV FOR PLASMID COVERAGE (PANEL 4) - to detailed/
|
|
2017
2325
|
# ----------------------------
|
|
2018
|
-
plasmid_cov_csv = os.path.join(
|
|
2326
|
+
plasmid_cov_csv = os.path.join(detailed_dir, "plasmid_coverage.csv")
|
|
2019
2327
|
with open(plasmid_cov_csv, "w", newline="") as csvfile:
|
|
2020
2328
|
csvfile.write("position_1based,coverage\n")
|
|
2021
2329
|
for pos0, cov in enumerate(plasmid_cov):
|
|
@@ -2023,9 +2331,9 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
|
|
|
2023
2331
|
logging.info(f"Saved CSV for plasmid coverage: {plasmid_cov_csv}")
|
|
2024
2332
|
|
|
2025
2333
|
# ----------------------------
|
|
2026
|
-
# SAVE CSV FOR AA MUTATION DISTRIBUTION (PANEL 3)
|
|
2334
|
+
# SAVE CSV FOR AA MUTATION DISTRIBUTION (PANEL 3) - to detailed/
|
|
2027
2335
|
# ----------------------------
|
|
2028
|
-
aa_dist_csv = os.path.join(
|
|
2336
|
+
aa_dist_csv = os.path.join(detailed_dir, "aa_mutation_distribution.csv")
|
|
2029
2337
|
with open(aa_dist_csv, "w", newline="") as csvfile:
|
|
2030
2338
|
csvfile.write(f"# gene_id: {hit_id}\n")
|
|
2031
2339
|
csvfile.write(f"# lambda_bp_mut: {est_mut_per_copy:.6f}\n")
|
|
@@ -2135,7 +2443,7 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
|
|
|
2135
2443
|
if is_protein and aa_diffs and len(aa_diffs) > 0:
|
|
2136
2444
|
x_vals = np.array(aa_diffs)
|
|
2137
2445
|
unique_vals = np.unique(x_vals)
|
|
2138
|
-
|
|
2446
|
+
|
|
2139
2447
|
if len(unique_vals) > 1:
|
|
2140
2448
|
# Multiple unique values - use KDE or histogram
|
|
2141
2449
|
if HAVE_SCIPY:
|
|
@@ -2149,15 +2457,23 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
|
|
|
2149
2457
|
ax3.set_ylim(bottom=0)
|
|
2150
2458
|
except Exception as e:
|
|
2151
2459
|
logging.warning(f"KDE failed: {e}, falling back to histogram")
|
|
2152
|
-
ax3.hist(x_vals, bins=min(20, len(unique_vals)),
|
|
2460
|
+
ax3.hist(x_vals, bins=min(20, len(unique_vals)),
|
|
2153
2461
|
color="#C44E52", alpha=0.7, density=True, edgecolor='black')
|
|
2154
2462
|
else:
|
|
2155
|
-
ax3.hist(x_vals, bins=min(20, len(unique_vals)),
|
|
2463
|
+
ax3.hist(x_vals, bins=min(20, len(unique_vals)),
|
|
2156
2464
|
color="#C44E52", alpha=0.7, density=True, edgecolor='black')
|
|
2157
2465
|
else:
|
|
2158
2466
|
# Single unique value - just show a bar
|
|
2159
2467
|
ax3.bar(unique_vals, [1.0], color="#C44E52", alpha=0.7, width=0.1)
|
|
2160
2468
|
ax3.set_xlim(unique_vals[0] - 0.5, unique_vals[0] + 0.5)
|
|
2469
|
+
|
|
2470
|
+
# Set title with lambda value for protein-coding sequences
|
|
2471
|
+
ax3.set_title(f"AA Mutation Distribution (Monte Carlo, \u03bb={est_mut_per_copy:.2f}){qscore_info}",
|
|
2472
|
+
fontsize=14, fontweight='bold')
|
|
2473
|
+
ax3.set_xlabel("Number of AA Mutations", fontsize=12)
|
|
2474
|
+
ax3.set_ylabel("Density", fontsize=12)
|
|
2475
|
+
ax3.spines['top'].set_visible(False)
|
|
2476
|
+
ax3.spines['right'].set_visible(False)
|
|
2161
2477
|
else:
|
|
2162
2478
|
# Not protein or no AA differences — display an informative message
|
|
2163
2479
|
ax3.text(
|
|
@@ -2170,7 +2486,7 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
|
|
|
2170
2486
|
color="gray",
|
|
2171
2487
|
transform=ax3.transAxes,
|
|
2172
2488
|
)
|
|
2173
|
-
|
|
2489
|
+
|
|
2174
2490
|
ax3.set_title("AA Mutation Distribution", fontsize=14, fontweight='bold')
|
|
2175
2491
|
ax3.set_xlabel("Number of AA Mutations", fontsize=12)
|
|
2176
2492
|
ax3.set_ylabel("Density", fontsize=12)
|
|
@@ -2231,9 +2547,9 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
|
|
|
2231
2547
|
sample_percent[cat] = 0.0
|
|
2232
2548
|
|
|
2233
2549
|
# ----------------------------
|
|
2234
|
-
# GENERATE PDF TABLE (MUTATION SPECTRUM)
|
|
2550
|
+
# GENERATE PDF TABLE (MUTATION SPECTRUM) - to detailed/
|
|
2235
2551
|
# ----------------------------
|
|
2236
|
-
pdf_path = os.path.join(
|
|
2552
|
+
pdf_path = os.path.join(detailed_dir, f"{sample_name}_mutation_spectrum.pdf")
|
|
2237
2553
|
# Prepare table data
|
|
2238
2554
|
table_rows = []
|
|
2239
2555
|
for cat in categories:
|
|
@@ -2341,9 +2657,6 @@ def run_main_analysis_for_qscore(fastq_path, qscore, qscore_desc, sample_name, w
|
|
|
2341
2657
|
}
|
|
2342
2658
|
|
|
2343
2659
|
|
|
2344
|
-
|
|
2345
|
-
main()
|
|
2346
|
-
|
|
2347
2660
|
def expand_fastq_inputs(inputs: Iterable[str]) -> List[Path]:
|
|
2348
2661
|
paths: List[Path] = []
|
|
2349
2662
|
for item in inputs:
|
|
@@ -2503,6 +2816,7 @@ def process_single_fastq(
|
|
|
2503
2816
|
|
|
2504
2817
|
logging.info("Running QC analysis to get Q-score results...")
|
|
2505
2818
|
qc_results = None
|
|
2819
|
+
consensus_info = None
|
|
2506
2820
|
try:
|
|
2507
2821
|
qc_results, consensus_info = run_qc_analysis(
|
|
2508
2822
|
str(fastq_path),
|
|
@@ -2563,6 +2877,45 @@ def process_single_fastq(
|
|
|
2563
2877
|
)
|
|
2564
2878
|
analysis_results.append(result)
|
|
2565
2879
|
|
|
2880
|
+
# Generate unified summary files in the sample's root results directory
|
|
2881
|
+
# Get simple lambda from the unfiltered analysis (first result)
|
|
2882
|
+
simple_lambda = 0.0
|
|
2883
|
+
simple_aa_mean = None
|
|
2884
|
+
is_protein = False
|
|
2885
|
+
unfiltered_result = analysis_results[0] if analysis_results else None
|
|
2886
|
+
if unfiltered_result:
|
|
2887
|
+
simple_lambda = unfiltered_result.get('est_mut_per_copy', 0.0)
|
|
2888
|
+
simple_aa_mean = unfiltered_result.get('avg_aa_mutations')
|
|
2889
|
+
is_protein = unfiltered_result.get('is_protein', False)
|
|
2890
|
+
|
|
2891
|
+
# Create output directories and generate summary files
|
|
2892
|
+
output_dirs = create_output_directories(results_dir)
|
|
2893
|
+
detailed_dir = output_dirs['detailed_dir']
|
|
2894
|
+
|
|
2895
|
+
# Write KEY_FINDINGS.txt (lay-user summary)
|
|
2896
|
+
write_key_findings(
|
|
2897
|
+
results_dir,
|
|
2898
|
+
consensus_info,
|
|
2899
|
+
simple_lambda,
|
|
2900
|
+
simple_aa_mean,
|
|
2901
|
+
is_protein,
|
|
2902
|
+
hit_seq,
|
|
2903
|
+
)
|
|
2904
|
+
|
|
2905
|
+
# Write lambda_comparison.csv
|
|
2906
|
+
write_lambda_comparison(
|
|
2907
|
+
detailed_dir,
|
|
2908
|
+
simple_lambda,
|
|
2909
|
+
simple_aa_mean,
|
|
2910
|
+
consensus_info,
|
|
2911
|
+
len(hit_seq),
|
|
2912
|
+
)
|
|
2913
|
+
|
|
2914
|
+
# Write methodology_notes.txt
|
|
2915
|
+
write_methodology_notes(detailed_dir)
|
|
2916
|
+
|
|
2917
|
+
logging.info("Generated unified summary files: KEY_FINDINGS.txt, lambda_comparison.csv, methodology_notes.txt")
|
|
2918
|
+
|
|
2566
2919
|
if work_dir.exists():
|
|
2567
2920
|
shutil.rmtree(work_dir)
|
|
2568
2921
|
logging.info("Removed temporary work directory: %s", work_dir)
|
|
@@ -2573,5 +2926,6 @@ def process_single_fastq(
|
|
|
2573
2926
|
"sample": sample_name,
|
|
2574
2927
|
"results_dir": results_dir,
|
|
2575
2928
|
"analysis_results": analysis_results,
|
|
2929
|
+
"consensus_info": consensus_info,
|
|
2576
2930
|
}
|
|
2577
2931
|
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: uht-tooling
|
|
3
|
-
Version: 0.
|
|
3
|
+
Version: 0.2.0
|
|
4
4
|
Summary: Tooling for ultra-high throughput screening workflows.
|
|
5
5
|
Author: Matt115A
|
|
6
6
|
License-Expression: MIT
|
|
@@ -313,13 +313,57 @@ Please be aware, this toolkit will not scale well beyond around 50k reads/sample
|
|
|
313
313
|
--fastq data/ep-library-profile/*.fastq.gz \
|
|
314
314
|
--output-dir results/ep-library-profile/
|
|
315
315
|
```
|
|
316
|
-
|
|
316
|
+
|
|
317
|
+
**Output structure**
|
|
318
|
+
|
|
319
|
+
Each sample produces an organized output directory:
|
|
320
|
+
|
|
321
|
+
```
|
|
322
|
+
sample_name/
|
|
323
|
+
├── KEY_FINDINGS.txt # Lay-user executive summary
|
|
324
|
+
├── summary_panels.png/pdf # Main visualization
|
|
325
|
+
├── aa_mutation_consensus.txt # Consensus estimate details
|
|
326
|
+
├── run.log # Analysis log
|
|
327
|
+
└── detailed/ # Technical outputs
|
|
328
|
+
├── methodology_notes.txt # Documents which lambda drives what
|
|
329
|
+
├── lambda_comparison.csv # Side-by-side lambda comparison
|
|
330
|
+
├── gene_mismatch_rates.csv
|
|
331
|
+
├── base_distribution.csv
|
|
332
|
+
├── aa_substitutions.csv
|
|
333
|
+
├── plasmid_coverage.csv
|
|
334
|
+
├── aa_mutation_distribution.csv
|
|
335
|
+
├── comprehensive_qc_data.csv
|
|
336
|
+
├── simple_qc_data.csv
|
|
337
|
+
└── qc_plots/ # QC visualizations
|
|
338
|
+
├── qc_plot_*.png
|
|
339
|
+
├── comprehensive_qc_analysis.png
|
|
340
|
+
├── error_analysis.png
|
|
341
|
+
└── qc_mutation_rate_vs_quality.png/csv
|
|
342
|
+
```
|
|
343
|
+
|
|
344
|
+
**Lambda estimates: which to use**
|
|
345
|
+
|
|
346
|
+
The profiler calculates lambda (mutations per gene copy) via two methods:
|
|
347
|
+
|
|
348
|
+
| Method | Formula | Error Quantified? | Used For |
|
|
349
|
+
|--------|---------|-------------------|----------|
|
|
350
|
+
| Simple | `(hit_rate - bg_rate) × seq_len` | No | KDE plot, Monte Carlo simulation |
|
|
351
|
+
| Consensus | Precision-weighted average across Q-scores | Yes | Recommended for reporting |
|
|
352
|
+
|
|
353
|
+
- **For publication/reporting**: Use the consensus value from `KEY_FINDINGS.txt` or `aa_mutation_consensus.txt`.
|
|
354
|
+
- **For understanding distribution shape**: See the KDE plot in `summary_panels.png` (note: uses simple lambda).
|
|
355
|
+
- **For detailed error analysis**: See `detailed/comprehensive_qc_data.csv`.
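The simple formula in the table can be worked through with concrete numbers. This is a minimal sketch: `simple_lambda` is a hypothetical function name, the rate definitions follow `methodology_notes.txt` (mismatches divided by covered bases), and the counts are invented for illustration.

```python
def simple_lambda(hit_mismatches, hit_bases, bg_mismatches, bg_bases, seq_len):
    """lambda_bp = (hit_rate - bg_rate) * seq_len, with rates as mismatches per base."""
    hit_rate = hit_mismatches / hit_bases
    bg_rate = bg_mismatches / bg_bases
    return (hit_rate - bg_rate) * seq_len


# 0.4% mismatches in the target, 0.1% background, 900 bp CDS
lambda_bp = simple_lambda(400, 100_000, 100, 100_000, 900)
```

The net rate is 0.003 mismatches per base, so a 900 bp CDS carries about 2.7 nucleotide mutations per copy.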
|
|
356
|
+
|
|
357
|
+
The `KEY_FINDINGS.txt` file provides a plain-language summary including:
|
|
358
|
+
- Expected AA mutations per gene copy
|
|
359
|
+
- Poisson-based interpretation (% wild-type, % 1 mutation, % 2+ mutations)
|
|
360
|
+
- Quality assessment (GOOD/ACCEPTABLE/LOW COVERAGE)
|
|
317
361
|
|
|
318
362
|
**How the mutation rate and AA expectations are derived**
|
|
319
363
|
|
|
320
|
-
1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the
|
|
364
|
+
1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the "target" rate; mismatches elsewhere provide the background.
|
|
321
365
|
2. The per-base background rate is subtracted from the target rate to yield a net nucleotide mutation rate, and the standard deviation reflects binomial sampling and quality-score uncertainty.
|
|
322
|
-
3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these
|
|
366
|
+
3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these drive the AA mutation mean/variance that appear in the panel plot.
|
|
323
367
|
4. If multiple Q-score thresholds are analysed, the CLI aggregates them via a precision-weighted consensus (1 / standard deviation weighting) after filtering out thresholds with insufficient coverage; the consensus value is written to `aa_mutation_consensus.txt` and plotted as a horizontal guide.
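Step 3 above can be sketched with the standard library alone. This is an illustration, not the package's `simulate_aa_distribution`: it assumes the number of flipped bases per trial is drawn from a Poisson distribution with mean λ_bp, which the README does not state explicitly, and `sample_poisson`, `translate`, and `simulate_aa_mutations` are hypothetical names.

```python
import math
import random
from itertools import product

# Standard genetic code in TCAG order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small lambda."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_aa_mutations(lambda_bp, cds_seq, n_trials=1000, seed=0):
    """Return the number of amino-acid differences observed in each trial."""
    rng = random.Random(seed)
    cds = cds_seq.upper()
    ref_protein = translate(cds)
    counts = []
    for _ in range(n_trials):
        # Assumption: mutations per copy ~ Poisson(lambda_bp), capped at CDS length
        n_mut = min(sample_poisson(lambda_bp, rng), len(cds))
        seq = list(cds)
        for pos in rng.sample(range(len(cds)), n_mut):
            seq[pos] = rng.choice([b for b in "ACGT" if b != seq[pos]])
        mutated = translate("".join(seq))
        counts.append(sum(a != b for a, b in zip(ref_protein, mutated)))
    return counts


counts = simulate_aa_mutations(2.0, "ATGGCTGCAAAAGGT", n_trials=200, seed=1)
mean_aa = sum(counts) / len(counts)
```

The mean of `counts` is typically lower than λ_bp because some base flips are synonymous and two flips in one codon change at most one residue.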
|
|
324
368
|
|
|
325
369
|
---
|
|
File without changes
|
|