uht-tooling 0.1.9__tar.gz → 0.3.0__tar.gz

Files changed (26)
  1. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/PKG-INFO +123 -5
  2. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/README.md +122 -4
  3. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/pyproject.toml +1 -1
  4. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/src/uht_tooling/cli.py +153 -4
  5. uht_tooling-0.3.0/src/uht_tooling/config.py +137 -0
  6. uht_tooling-0.3.0/src/uht_tooling/tools.py +143 -0
  7. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/src/uht_tooling/workflows/gui.py +19 -0
  8. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/src/uht_tooling/workflows/mut_rate.py +484 -124
  9. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/src/uht_tooling/workflows/mutation_caller.py +11 -2
  10. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/src/uht_tooling/workflows/umi_hunter.py +9 -4
  11. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/src/uht_tooling.egg-info/PKG-INFO +123 -5
  12. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/src/uht_tooling.egg-info/SOURCES.txt +2 -0
  13. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/setup.cfg +0 -0
  14. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/src/uht_tooling/__init__.py +0 -0
  15. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/src/uht_tooling/models/__init__.py +0 -0
  16. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/src/uht_tooling/workflows/__init__.py +0 -0
  17. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/src/uht_tooling/workflows/design_gibson.py +0 -0
  18. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/src/uht_tooling/workflows/design_kld.py +0 -0
  19. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/src/uht_tooling/workflows/design_slim.py +0 -0
  20. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/src/uht_tooling/workflows/nextera_designer.py +0 -0
  21. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/src/uht_tooling/workflows/profile_inserts.py +0 -0
  22. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/src/uht_tooling.egg-info/dependency_links.txt +0 -0
  23. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/src/uht_tooling.egg-info/entry_points.txt +0 -0
  24. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/src/uht_tooling.egg-info/requires.txt +0 -0
  25. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/src/uht_tooling.egg-info/top_level.txt +0 -0
  26. {uht_tooling-0.1.9 → uht_tooling-0.3.0}/tests/test_design_kld.py +0 -0
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: uht-tooling
3
- Version: 0.1.9
3
+ Version: 0.3.0
4
4
  Summary: Tooling for ultra-high throughput screening workflows.
5
5
  Author: Matt115A
6
6
  License-Expression: MIT
@@ -47,7 +47,22 @@ This installs the core workflows plus the optional GUI dependency (Gradio). Omit
47
47
  pip install uht-tooling
48
48
  ```
49
49
 
50
- You will need a functioning version of mafft - you should install this separately and it should be accessible from your environment.
50
+ ### External Tools
51
+
52
+ Some workflows require external bioinformatics tools:
53
+
54
+ | Workflow | Required Tools |
55
+ |----------|---------------|
56
+ | mutation-caller | mafft |
57
+ | umi-hunter | mafft |
58
+ | ep-library-profile | minimap2, NanoFilt |
59
+
60
+ Install via conda:
61
+ ```bash
62
+ conda install -c bioconda mafft minimap2 nanofilt
63
+ ```
64
+
65
+ The CLI and GUI will validate tool availability before running and provide clear error messages if tools are missing.
51
66
 
52
67
  ### Development install
53
68
  ```bash
@@ -95,10 +110,69 @@ Each command provides detailed help, including option descriptions and expected
95
110
  uht-tooling mutation-caller --help
96
111
  ```
97
112
 
113
+ ### Short Flags
114
+
115
+ All commands support short flags for common options:
116
+
117
+ ```bash
118
+ # Long form
119
+ uht-tooling design-slim --gene-fasta gene.fa --context-fasta ctx.fa --mutations-csv mut.csv --output-dir out/
120
+
121
+ # Short form
122
+ uht-tooling design-slim -g gene.fa -c ctx.fa -m mut.csv -o out/
123
+ ```
124
+
125
+ | Long Flag | Short | Commands |
126
+ |-----------|-------|----------|
127
+ | `--gene-fasta` | `-g` | design-slim, design-kld, design-gibson |
128
+ | `--context-fasta` | `-c` | design-slim, design-kld, design-gibson |
129
+ | `--mutations-csv` | `-m` | design-slim, design-kld, design-gibson |
130
+ | `--output-dir` | `-o` | 7 commands |
131
+ | `--log-path` | `-l` | 7 commands |
132
+ | `--template-fasta` | `-t` | mutation-caller, umi-hunter |
133
+ | `--fastq` | `-q` | 4 commands |
134
+ | `--threshold` | `-T` | mutation-caller |
135
+ | `--config-csv` | `-C` | umi-hunter |
136
+ | `--binding-csv` | `-b` | nextera-primers |
137
+ | `--probes-csv` | `-P` | profile-inserts |
138
+ | `--region-fasta` | `-R` | ep-library-profile |
139
+ | `--plasmid-fasta` | `-p` | ep-library-profile |
140
+ | `--work-dir` | `-w` | ep-library-profile |
141
+ | `--config` | `-K` | global (all commands) |
142
+
98
143
  You can pass multiple FASTQ paths using repeated `--fastq` options or glob patterns. Optional `--log-path` flags redirect logs if you prefer a location outside the default results directory.
99
144
 
100
145
  ---
101
146
 
147
+ ## Configuration File
148
+
149
+ uht-tooling supports a YAML configuration file for default options.
150
+
151
+ **Auto-discovery locations** (in order):
152
+ 1. `$UHT_TOOLING_CONFIG` environment variable
153
+ 2. `~/.uht-tooling.yaml`
154
+ 3. `~/.config/uht-tooling/config.yaml`
155
+ 4. `.uht-tooling.yaml` (current directory)
156
+
157
+ Or specify explicitly: `uht-tooling --config my-config.yaml ...`
158
+
159
+ **Example ~/.uht-tooling.yaml:**
160
+ ```yaml
161
+ paths:
162
+   output_dir: ~/results/uht-tooling
163
+
164
+ defaults:
165
+   mutation_caller:
166
+     threshold: 15
167
+   umi_hunter:
168
+     umi_identity_threshold: 0.85
169
+     min_cluster_size: 5
170
+ ```
171
+
172
+ CLI options always take precedence over config values.
173
+
174
+ ---
175
+
102
176
  ## Workflow reference
103
177
 
104
178
  ### Nextera XT primer design
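
Reviewer note on the "Configuration File" hunk: the README lists four auto-discovery locations "in order" but does not show the lookup itself. The sketch below is one way such a lookup could work, assuming the first existing file wins and that an explicit `--config` path bypasses discovery. The function names `candidate_config_paths` and `discover_config` are hypothetical and are not taken from `config.py`.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def candidate_config_paths(env: Optional[Mapping[str, str]] = None,
                           home: Optional[Path] = None,
                           cwd: Optional[Path] = None) -> list:
    """Return the documented locations in the documented order."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    cwd = Path.cwd() if cwd is None else Path(cwd)

    paths = []
    env_value = env.get("UHT_TOOLING_CONFIG")
    if env_value:  # 1. environment variable, only if set and non-empty
        paths.append(Path(env_value).expanduser())
    paths.append(home / ".uht-tooling.yaml")                        # 2.
    paths.append(home / ".config" / "uht-tooling" / "config.yaml")  # 3.
    paths.append(cwd / ".uht-tooling.yaml")                         # 4.
    return paths


def discover_config(explicit: Optional[str] = None, **kwargs) -> Optional[Path]:
    """Pick a config file; assumes an explicit path overrides discovery."""
    if explicit is not None:
        return Path(explicit).expanduser()
    for path in candidate_config_paths(**kwargs):
        if path.is_file():  # assumption: first existing file wins
            return path
    return None
```

The `env`, `home`, and `cwd` parameters exist only so the lookup can be exercised against temporary directories without touching the real home directory.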
@@ -313,13 +387,57 @@ Please be aware, this toolkit will not scale well beyond around 50k reads/sample
313
387
  --fastq data/ep-library-profile/*.fastq.gz \
314
388
  --output-dir results/ep-library-profile/
315
389
  ```
316
- - Output bundle includes per-sample directories, a master summary TSV, and a `summary_panels` figure that visualises positional mutation rates, coverage, and amino-acid simulations.
390
+
391
+ **Output structure**
392
+
393
+ Each sample produces an organized output directory:
394
+
395
+ ```
396
+ sample_name/
397
+ ├── KEY_FINDINGS.txt # Lay-user executive summary
398
+ ├── summary_panels.png/pdf # Main visualization
399
+ ├── aa_mutation_consensus.txt # Consensus estimate details
400
+ ├── run.log # Analysis log
401
+ └── detailed/ # Technical outputs
402
+     ├── methodology_notes.txt # Documents which lambda drives what
403
+     ├── lambda_comparison.csv # Side-by-side lambda comparison
404
+     ├── gene_mismatch_rates.csv
405
+     ├── base_distribution.csv
406
+     ├── aa_substitutions.csv
407
+     ├── plasmid_coverage.csv
408
+     ├── aa_mutation_distribution.csv
409
+     ├── comprehensive_qc_data.csv
410
+     ├── simple_qc_data.csv
411
+     └── qc_plots/ # QC visualizations
412
+         ├── qc_plot_*.png
413
+         ├── comprehensive_qc_analysis.png
414
+         ├── error_analysis.png
415
+         └── qc_mutation_rate_vs_quality.png/csv
416
+ ```
417
+
418
+ **Lambda estimates: which to use**
419
+
420
+ The profiler calculates lambda (mutations per gene copy) via two methods:
421
+
422
+ | Method | Formula | Error Quantified? | Used For |
423
+ |--------|---------|-------------------|----------|
424
+ | Simple | `(hit_rate - bg_rate) × seq_len` | No | KDE plot, Monte Carlo simulation |
425
+ | Consensus | Precision-weighted average across Q-scores | Yes | Recommended for reporting |
426
+
427
+ - **For publication/reporting**: Use the consensus value from `KEY_FINDINGS.txt` or `aa_mutation_consensus.txt`.
428
+ - **For understanding distribution shape**: See the KDE plot in `summary_panels.png` (note: uses simple lambda).
429
+ - **For detailed error analysis**: See `detailed/comprehensive_qc_data.csv`.
430
+
431
+ The `KEY_FINDINGS.txt` file provides a plain-language summary including:
432
+ - Expected AA mutations per gene copy
433
+ - Poisson-based interpretation (% wild-type, % 1 mutation, % 2+ mutations)
434
+ - Quality assessment (GOOD/ACCEPTABLE/LOW COVERAGE)
317
435
 
318
436
  **How the mutation rate and AA expectations are derived**
319
437
 
320
- 1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the target rate; mismatches elsewhere provide the background.
438
+ 1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the "target" rate; mismatches elsewhere provide the background.
321
439
  2. The per-base background rate is subtracted from the target rate to yield a net nucleotide mutation rate, and the standard deviation reflects binomial sampling and quality-score uncertainty.
322
- 3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these drives the AA mutation mean/variance that appear in the panel plot.
440
+ 3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these drive the AA mutation mean/variance that appear in the panel plot.
323
441
  4. If multiple Q-score thresholds are analysed, the CLI aggregates them via a precision-weighted consensus (1 / standard deviation weighting) after filtering out thresholds with insufficient coverage; the consensus value is written to `aa_mutation_consensus.txt` and plotted as a horizontal guide.
324
442
 
325
443
  ---
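
Reviewer note on the "Lambda estimates" hunk: the table gives the simple formula `(hit_rate - bg_rate) × seq_len`, and `KEY_FINDINGS.txt` is described as reporting a Poisson-based split into wild-type, one-mutation, and two-or-more fractions. A minimal sketch of both calculations follows. The function names and the example numbers are illustrative assumptions, not values or code from `mut_rate.py`.

```python
import math


def simple_lambda(hit_rate: float, bg_rate: float, seq_len: int) -> float:
    """Simple estimate from the README table: net per-base rate times length."""
    return (hit_rate - bg_rate) * seq_len


def poisson_fractions(lam: float) -> dict:
    """Fractions of copies with 0, 1, and 2+ mutations if counts are Poisson(lam)."""
    if lam < 0:
        raise ValueError("expected mutation count must be non-negative")
    p0 = math.exp(-lam)   # P(k = 0) = e^-lam
    p1 = lam * p0         # P(k = 1) = lam * e^-lam
    return {"wild_type": p0, "one_mutation": p1, "two_plus": 1.0 - p0 - p1}


# Hypothetical inputs: 0.4% target mismatch rate, 0.1% background, 1,000 bp CDS.
lam_bp = simple_lambda(hit_rate=0.004, bg_rate=0.001, seq_len=1000)
fractions = poisson_fractions(lam_bp)
```

The README applies the Poisson interpretation to the expected amino-acid mutations per gene copy; `poisson_fractions` accepts any expected count, so the same function serves for that value.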
@@ -18,7 +18,22 @@ This installs the core workflows plus the optional GUI dependency (Gradio). Omit
18
18
  pip install uht-tooling
19
19
  ```
20
20
 
21
- You will need a functioning version of mafft - you should install this separately and it should be accessible from your environment.
21
+ ### External Tools
22
+
23
+ Some workflows require external bioinformatics tools:
24
+
25
+ | Workflow | Required Tools |
26
+ |----------|---------------|
27
+ | mutation-caller | mafft |
28
+ | umi-hunter | mafft |
29
+ | ep-library-profile | minimap2, NanoFilt |
30
+
31
+ Install via conda:
32
+ ```bash
33
+ conda install -c bioconda mafft minimap2 nanofilt
34
+ ```
35
+
36
+ The CLI and GUI will validate tool availability before running and provide clear error messages if tools are missing.
22
37
 
23
38
  ### Development install
24
39
  ```bash
@@ -66,10 +81,69 @@ Each command provides detailed help, including option descriptions and expected
66
81
  uht-tooling mutation-caller --help
67
82
  ```
68
83
 
84
+ ### Short Flags
85
+
86
+ All commands support short flags for common options:
87
+
88
+ ```bash
89
+ # Long form
90
+ uht-tooling design-slim --gene-fasta gene.fa --context-fasta ctx.fa --mutations-csv mut.csv --output-dir out/
91
+
92
+ # Short form
93
+ uht-tooling design-slim -g gene.fa -c ctx.fa -m mut.csv -o out/
94
+ ```
95
+
96
+ | Long Flag | Short | Commands |
97
+ |-----------|-------|----------|
98
+ | `--gene-fasta` | `-g` | design-slim, design-kld, design-gibson |
99
+ | `--context-fasta` | `-c` | design-slim, design-kld, design-gibson |
100
+ | `--mutations-csv` | `-m` | design-slim, design-kld, design-gibson |
101
+ | `--output-dir` | `-o` | 7 commands |
102
+ | `--log-path` | `-l` | 7 commands |
103
+ | `--template-fasta` | `-t` | mutation-caller, umi-hunter |
104
+ | `--fastq` | `-q` | 4 commands |
105
+ | `--threshold` | `-T` | mutation-caller |
106
+ | `--config-csv` | `-C` | umi-hunter |
107
+ | `--binding-csv` | `-b` | nextera-primers |
108
+ | `--probes-csv` | `-P` | profile-inserts |
109
+ | `--region-fasta` | `-R` | ep-library-profile |
110
+ | `--plasmid-fasta` | `-p` | ep-library-profile |
111
+ | `--work-dir` | `-w` | ep-library-profile |
112
+ | `--config` | `-K` | global (all commands) |
113
+
69
114
  You can pass multiple FASTQ paths using repeated `--fastq` options or glob patterns. Optional `--log-path` flags redirect logs if you prefer a location outside the default results directory.
70
115
 
71
116
  ---
72
117
 
118
+ ## Configuration File
119
+
120
+ uht-tooling supports a YAML configuration file for default options.
121
+
122
+ **Auto-discovery locations** (in order):
123
+ 1. `$UHT_TOOLING_CONFIG` environment variable
124
+ 2. `~/.uht-tooling.yaml`
125
+ 3. `~/.config/uht-tooling/config.yaml`
126
+ 4. `.uht-tooling.yaml` (current directory)
127
+
128
+ Or specify explicitly: `uht-tooling --config my-config.yaml ...`
129
+
130
+ **Example ~/.uht-tooling.yaml:**
131
+ ```yaml
132
+ paths:
133
+   output_dir: ~/results/uht-tooling
134
+
135
+ defaults:
136
+   mutation_caller:
137
+     threshold: 15
138
+   umi_hunter:
139
+     umi_identity_threshold: 0.85
140
+     min_cluster_size: 5
141
+ ```
142
+
143
+ CLI options always take precedence over config values.
144
+
145
+ ---
146
+
73
147
  ## Workflow reference
74
148
 
75
149
  ### Nextera XT primer design
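
Reviewer note on "CLI options always take precedence over config values": the README states the rule without showing how the layers combine. The sketch below is one way to express it, assuming that an option not given on the command line arrives as `None`. The function name `resolve_options` and the built-in default value are hypothetical.

```python
from typing import Any, Dict


def resolve_options(cli_args: Dict[str, Any],
                    config_defaults: Dict[str, Any],
                    builtin_defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Layer options so that CLI beats config, and config beats built-ins."""
    resolved = dict(builtin_defaults)
    resolved.update(config_defaults)
    # Assumption: None means "not passed on the command line".
    resolved.update({key: value for key, value in cli_args.items()
                     if value is not None})
    return resolved


# Values mirror the README's YAML example; the built-in default is made up.
options = resolve_options(
    cli_args={"threshold": 20, "output_dir": None},
    config_defaults={"threshold": 15, "output_dir": "~/results/uht-tooling"},
    builtin_defaults={"threshold": 10},
)
```

Here `threshold` comes from the command line and `output_dir` falls back to the config file.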
@@ -284,13 +358,57 @@ Please be aware, this toolkit will not scale well beyond around 50k reads/sample
284
358
  --fastq data/ep-library-profile/*.fastq.gz \
285
359
  --output-dir results/ep-library-profile/
286
360
  ```
287
- - Output bundle includes per-sample directories, a master summary TSV, and a `summary_panels` figure that visualises positional mutation rates, coverage, and amino-acid simulations.
361
+
362
+ **Output structure**
363
+
364
+ Each sample produces an organized output directory:
365
+
366
+ ```
367
+ sample_name/
368
+ ├── KEY_FINDINGS.txt # Lay-user executive summary
369
+ ├── summary_panels.png/pdf # Main visualization
370
+ ├── aa_mutation_consensus.txt # Consensus estimate details
371
+ ├── run.log # Analysis log
372
+ └── detailed/ # Technical outputs
373
+     ├── methodology_notes.txt # Documents which lambda drives what
374
+     ├── lambda_comparison.csv # Side-by-side lambda comparison
375
+     ├── gene_mismatch_rates.csv
376
+     ├── base_distribution.csv
377
+     ├── aa_substitutions.csv
378
+     ├── plasmid_coverage.csv
379
+     ├── aa_mutation_distribution.csv
380
+     ├── comprehensive_qc_data.csv
381
+     ├── simple_qc_data.csv
382
+     └── qc_plots/ # QC visualizations
383
+         ├── qc_plot_*.png
384
+         ├── comprehensive_qc_analysis.png
385
+         ├── error_analysis.png
386
+         └── qc_mutation_rate_vs_quality.png/csv
387
+ ```
388
+
389
+ **Lambda estimates: which to use**
390
+
391
+ The profiler calculates lambda (mutations per gene copy) via two methods:
392
+
393
+ | Method | Formula | Error Quantified? | Used For |
394
+ |--------|---------|-------------------|----------|
395
+ | Simple | `(hit_rate - bg_rate) × seq_len` | No | KDE plot, Monte Carlo simulation |
396
+ | Consensus | Precision-weighted average across Q-scores | Yes | Recommended for reporting |
397
+
398
+ - **For publication/reporting**: Use the consensus value from `KEY_FINDINGS.txt` or `aa_mutation_consensus.txt`.
399
+ - **For understanding distribution shape**: See the KDE plot in `summary_panels.png` (note: uses simple lambda).
400
+ - **For detailed error analysis**: See `detailed/comprehensive_qc_data.csv`.
401
+
402
+ The `KEY_FINDINGS.txt` file provides a plain-language summary including:
403
+ - Expected AA mutations per gene copy
404
+ - Poisson-based interpretation (% wild-type, % 1 mutation, % 2+ mutations)
405
+ - Quality assessment (GOOD/ACCEPTABLE/LOW COVERAGE)
288
406
 
289
407
  **How the mutation rate and AA expectations are derived**
290
408
 
291
- 1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the target rate; mismatches elsewhere provide the background.
409
+ 1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the "target" rate; mismatches elsewhere provide the background.
292
410
  2. The per-base background rate is subtracted from the target rate to yield a net nucleotide mutation rate, and the standard deviation reflects binomial sampling and quality-score uncertainty.
293
- 3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these drives the AA mutation mean/variance that appear in the panel plot.
411
+ 3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these drive the AA mutation mean/variance that appear in the panel plot.
294
412
  4. If multiple Q-score thresholds are analysed, the CLI aggregates them via a precision-weighted consensus (1 / standard deviation weighting) after filtering out thresholds with insufficient coverage; the consensus value is written to `aa_mutation_consensus.txt` and plotted as a horizontal guide.
295
413
 
296
414
  ---
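
Reviewer note on step 4 of the derivation: the README describes a precision-weighted consensus with 1 / standard deviation weighting, applied after dropping Q-score thresholds with insufficient coverage. A minimal sketch of that aggregation follows. The tuple layout, the `min_coverage` parameter, and the function name are assumptions for illustration, and the package's actual coverage filter may differ.

```python
from typing import Iterable, Optional, Tuple


def consensus_estimate(estimates: Iterable[Tuple[float, float, float]],
                       min_coverage: float) -> Optional[float]:
    """Weighted mean of per-threshold estimates using weights of 1 / std dev.

    Each item is (value, std_dev, coverage). Items below min_coverage, or with
    a non-positive std_dev, are dropped before averaging.
    """
    kept = [(value, sd) for value, sd, coverage in estimates
            if coverage >= min_coverage and sd > 0]
    if not kept:
        return None  # nothing survived the coverage filter
    weights = [1.0 / sd for _, sd in kept]
    weighted_sum = sum(w * value for w, (value, _) in zip(weights, kept))
    return weighted_sum / sum(weights)


# Hypothetical per-threshold results: (AA mutations per copy, std dev, coverage).
per_threshold = [(2.0, 0.5, 100.0), (4.0, 1.0, 100.0), (10.0, 0.1, 5.0)]
consensus = consensus_estimate(per_threshold, min_coverage=50.0)
```

This follows the README's stated 1 / standard deviation weights. Inverse-variance weighting (1 / standard deviation squared) is a different scheme and would give a different consensus.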
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
4
4
 
5
5
  [project]
6
6
  name = "uht-tooling"
7
- version = "0.1.9"
7
+ version = "0.3.0"
8
8
  description = "Tooling for ultra-high throughput screening workflows."
9
9
  readme = "README.md"
10
10
  requires-python = ">=3.8"