GenoChar 0.6.3.2__tar.gz
- genochar-0.6.3.2/GenoChar.egg-info/PKG-INFO +296 -0
- genochar-0.6.3.2/GenoChar.egg-info/SOURCES.txt +19 -0
- genochar-0.6.3.2/GenoChar.egg-info/dependency_links.txt +1 -0
- genochar-0.6.3.2/GenoChar.egg-info/entry_points.txt +2 -0
- genochar-0.6.3.2/GenoChar.egg-info/requires.txt +2 -0
- genochar-0.6.3.2/GenoChar.egg-info/top_level.txt +1 -0
- genochar-0.6.3.2/LICENSE +21 -0
- genochar-0.6.3.2/PKG-INFO +296 -0
- genochar-0.6.3.2/README.md +280 -0
- genochar-0.6.3.2/genochar/__init__.py +2 -0
- genochar-0.6.3.2/genochar/assembly_stats.py +124 -0
- genochar-0.6.3.2/genochar/checkm2.py +35 -0
- genochar-0.6.3.2/genochar/cli.py +585 -0
- genochar-0.6.3.2/genochar/coverage.py +48 -0
- genochar-0.6.3.2/genochar/gff_stats.py +223 -0
- genochar-0.6.3.2/genochar/managed_envs.py +201 -0
- genochar-0.6.3.2/genochar/metadata.py +47 -0
- genochar-0.6.3.2/genochar/pipeline.py +172 -0
- genochar-0.6.3.2/genochar/utils.py +228 -0
- genochar-0.6.3.2/pyproject.toml +40 -0
- genochar-0.6.3.2/setup.cfg +4 -0
@@ -0,0 +1,296 @@
Metadata-Version: 2.4
Name: GenoChar
Version: 0.6.3.2
Summary: GenoChar: summarize-first genome characterization workflow with one-time managed setup for Prokka and CheckM2
Author: Jun Won Lee
License-Expression: MIT
Project-URL: Homepage, https://github.com/ljunwon1114/GenoChar
Project-URL: Issues, https://github.com/ljunwon1114/GenoChar/issues
Keywords: bioinformatics,genomics,genome-assembly,genome-annotation,microbiology,bacteria,archaea,checkm2,prokka
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=2.0
Requires-Dist: openpyxl>=3.1
Dynamic: license-file

# GenoChar

**GenoChar** generates publication-ready genome characterization tables for bacterial and archaeal genomes.

Version **0.6.3.2** keeps the **single-command, summarize-first interface**, but adds a practical solution for external-tool dependency conflicts:

- **`genochar`** builds the final genome characterization table
- **`genochar setup`** prepares **managed Prokka and CheckM2 conda environments** under `~/.genochar`
- later, a normal `genochar ... --check --annotate prokka` run automatically calls those managed environments with `conda run -p ...`

That means users do **not** need to manually solve a shared Prokka/CheckM2 environment.

The main output is a wide table with one row per strain. Optional outputs include a feature-style table and an Excel workbook.

**Naming note:** the project/display name is **GenoChar**, while the Python package name and command-line executable remain lowercase as `genochar`.

## What the tool computes

### FASTA-derived fields
These work even if you provide only FASTA files:

- `Strain`
- `Strain name`
- `Genus`
- `Species`
- `Accession`
- `Genome size (bp)`
- `GC content (%)`
- `No. of contigs`
- `N50 (bp)`
- `N90 (bp)`
- `L50`
- `L90`
- `Longest contig (bp)`
- `Gaps (N per 100 kb)`
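
The numeric FASTA-derived fields above follow conventional assembly-statistics definitions, so a minimal Python sketch may help make them concrete. This is not taken from `genochar/assembly_stats.py`; the function name `assembly_stats`, the choice to compute GC content over all bases (including `N`), and the reading of `Gaps (N per 100 kb)` as the number of `N` bases per 100,000 bp are assumptions made for illustration.

```python
def assembly_stats(contigs):
    """Compute basic assembly statistics from a list of contig sequences.

    Illustrative sketch only: GenoChar's own definitions may differ.
    """
    lengths = sorted((len(seq) for seq in contigs), reverse=True)
    total = sum(lengths)
    if total == 0:
        raise ValueError("no sequence data")

    def nx_lx(percent):
        # Walk contigs from longest to shortest until the running sum
        # reaches the requested percentage of the assembly size.
        running = 0
        for count, length in enumerate(lengths, start=1):
            running += length
            if running * 100 >= total * percent:
                return length, count
        return lengths[-1], len(lengths)

    joined = "".join(contigs).upper()
    gc_bases = joined.count("G") + joined.count("C")
    n_bases = joined.count("N")
    n50, l50 = nx_lx(50)
    n90, l90 = nx_lx(90)

    return {
        "Genome size (bp)": total,
        # Assumption: GC is reported relative to all bases, N included.
        "GC content (%)": 100.0 * gc_bases / total,
        "No. of contigs": len(lengths),
        "N50 (bp)": n50,
        "N90 (bp)": n90,
        "L50": l50,
        "L90": l90,
        "Longest contig (bp)": lengths[0],
        # Assumption: ambiguous N bases per 100,000 bp of assembly.
        "Gaps (N per 100 kb)": 100_000.0 * n_bases / total,
    }
```

Integer comparisons are used for the N50/N90 thresholds so that the result does not depend on floating-point rounding.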

### GFF-derived fields
These are added when you provide GFF files or ask GenoChar to create annotation files:

- `CDSs`
- `tRNAs`
- `rRNAs`
- `tmRNA`
- `misc RNA`
- `Repeat regions`
- `16S rRNA count`
- `16S rRNA length (bp)`
- `16S rRNA contig`
- `16S rRNA sequence`
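
The count-style fields in this list could plausibly be obtained by tallying the feature-type column of a GFF3 file. The sketch below assumes Prokka-style GFF3 (nine tab-separated columns, with an optional `##FASTA` section at the end) and assumes that the feature types `CDS`, `tRNA`, `rRNA`, `tmRNA`, `misc_RNA`, and `repeat_region` map onto the columns above; the function name `count_gff_features` is hypothetical and this is not GenoChar's own parser.

```python
from collections import Counter


def count_gff_features(gff_text):
    """Tally GFF3 feature types (column 3) from the text of a GFF file."""
    counts = Counter()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            # Prokka-style GFF3 appends the sequences after this marker.
            break
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # skip malformed records rather than guessing
        counts[columns[2]] += 1
    return counts
```

A `Counter` returns 0 for feature types that never occur, which keeps the lookup for absent types (for example `tmRNA` in an archaeal genome) simple.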

### CheckM2-derived fields
These are added when you provide or generate a CheckM2 report:

- `Completeness (%)`
- `Contamination (%)`
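
As a rough illustration of how these two fields might be pulled from a CheckM2 report, the sketch below reads a tab-separated `quality_report.tsv` with the standard library. The column names `Name`, `Completeness`, and `Contamination` are assumed from the usual CheckM2 report layout, and the function name `read_checkm2_report` is hypothetical; GenoChar's own reader in `genochar/checkm2.py` may work differently.

```python
import csv


def read_checkm2_report(handle):
    """Map each genome name to its completeness and contamination values.

    `handle` is any open text file or file-like object containing a
    tab-separated CheckM2 report with a header row.
    """
    report = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        # Column names are an assumption about the report layout.
        report[row["Name"]] = {
            "Completeness (%)": float(row["Completeness"]),
            "Contamination (%)": float(row["Contamination"]),
        }
    return report
```

The function could be used as `read_checkm2_report(open("checkm2_out/quality_report.tsv", newline=""))`, with the resulting names matched against the `Strain` column of the main table.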

### User-supplied metadata
Optional input tables can add:

- `Sequencing coverage (×)`
- `Sequencing platforms`
- `Assembly method`
- `Genus`
- `Species`
- `Accession`
- `Repeat regions`

## Default output columns

The main output table contains:

- `Strain`
- `Strain name`
- `Genus`
- `Species`
- `Accession`
- `Genome size (bp)`
- `GC content (%)`
- `No. of contigs`
- `N50 (bp)`
- `N90 (bp)`
- `L50`
- `L90`
- `Longest contig (bp)`
- `Gaps (N per 100 kb)`
- `Sequencing coverage (×)`
- `Sequencing platforms`
- `Assembly method`
- `CDSs`
- `tRNAs`
- `rRNAs`
- `tmRNA`
- `misc RNA`
- `Repeat regions`
- `16S rRNA count`
- `16S rRNA length (bp)`
- `16S rRNA contig`
- `16S rRNA sequence`
- `Completeness (%)`
- `Contamination (%)`

## Installation

Clone the repository and install GenoChar into the active Python environment:

```bash
git clone https://github.com/ljunwon1114/GenoChar.git
cd GenoChar
pip install -e .
```

If you just want the latest GitHub version without cloning for development, you can also use:

```bash
pip install git+https://github.com/ljunwon1114/GenoChar.git
```

### Core Python requirement

GenoChar itself is lightweight. The **core package** now declares:

- `requires-python >=3.10`

The external tools are the difficult part, so v0.6.3.2 no longer assumes that Prokka and CheckM2 should share the same environment.

## One-time managed setup (recommended)

Run this once:

```bash
genochar setup
```

This creates managed environments under `~/.genochar`, typically:

```text
~/.genochar/
  config.json
  envs/
    prokka/
    checkm2/
  databases/
    CheckM2_database/
```

After that, normal workflow commands automatically use those managed environments when `--annotate prokka` and/or `--check` are requested.

When `--check` is used, GenoChar now passes the resolved input FASTA files directly to `checkm2 predict --input ...`, matching the official CheckM2 interface that accepts either a folder of bins or a list of FASTA files.
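
To make the `conda run -p ...` mechanism more tangible, here is a minimal sketch of how such a CheckM2 invocation could be assembled as an argument list. It only builds the command and does not run it. The function name `managed_checkm2_command` and its parameters are hypothetical, the environment location `~/.genochar/envs/checkm2` follows the layout shown above, and the way GenoChar itself constructs the call in `genochar/managed_envs.py` is not shown in this document.

```python
from pathlib import Path


def managed_checkm2_command(fasta_files, out_dir, threads=1, env_root=None):
    """Build a `conda run -p <env> checkm2 predict ...` argument list.

    Illustrative only: nothing is executed here.
    """
    if env_root is None:
        # Assumed default, following the ~/.genochar layout in the README.
        env_root = Path.home() / ".genochar" / "envs"
    env_prefix = Path(env_root) / "checkm2"
    return [
        "conda", "run", "-p", str(env_prefix),
        "checkm2", "predict",
        # FASTA files are passed as a list after a single --input flag.
        "--input", *(str(path) for path in fasta_files),
        "--output-directory", str(out_dir),
        "--threads", str(threads),
    ]
```

Returning a list rather than a shell string means the result can be handed to `subprocess.run(command, check=True)` without shell quoting concerns, even when FASTA paths contain spaces.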

### Reuse an existing CheckM2 database

If you already downloaded the CheckM2 database, point setup at it directly:

```bash
genochar setup --checkm2-db /home/jwlee/databases/CheckM2_database/uniref100.KO.1.dmnd
```

You can also pass a directory that contains the `.dmnd` file.

### Optional setup flags

```bash
genochar setup --skip-prokka
genochar setup --skip-checkm2
genochar setup --force
```

- `--skip-prokka`: only prepare CheckM2
- `--skip-checkm2`: only prepare Prokka
- `--force`: recreate managed environments even if they already exist

## Command overview

### A. FASTA only

```bash
genochar -i "assemblies/*.fasta" -o genome_characterization.tsv
```

### B. FASTA + existing GFF + existing CheckM2 report

```bash
genochar -i "assemblies/*.fasta" --gff "annotations/*.gff*" --check-report checkm2_out/quality_report.tsv -o genome_characterization.tsv
```

### C. FASTA + managed CheckM2 first + managed Prokka annotation

```bash
genochar -i "assemblies/*.fasta" --check --annotate prokka -k Archaea -t 8 -w genochar_work -o genome_characterization.tsv
```

### D. Reuse existing GFF files automatically

```bash
genochar -i "assemblies/*.fasta" --annotate existing --check-report checkm2_out/quality_report.tsv -o genome_characterization.tsv
```

### E. Reuse explicitly supplied GFF files in existing-annotation mode

```bash
genochar -i "assemblies/*.fasta" --annotate existing --gff "annotations/*.gff*" --check-report checkm2_out/quality_report.tsv -o genome_characterization.tsv
```

## Optional extra outputs

### Feature-style table

```bash
genochar -i "assemblies/*.fasta" -f genome_characterization_feature.tsv -o genome_characterization.tsv
```

### Excel workbook

```bash
genochar -i "assemblies/*.fasta" -x genome_characterization.xlsx -o genome_characterization.tsv
```

## Coverage input

Coverage cannot be derived from FASTA alone. If you want to fill `Sequencing coverage (×)`, provide a coverage table.

Example:

```text
Strain Coverage
IOH03 55.7
IOH05 50.3
```

or

```text
Strain Total bases
IOH03 110.8 Mbp
IOH05 107.6 Mbp
```

If `Total bases` is provided, GenoChar computes:

```text
Sequencing coverage (×) = Total bases / Genome size
```
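
The formula above can be sketched in Python as follows. The helper names `parse_total_bases` and `sequencing_coverage`, the set of accepted unit suffixes, and the 2,000,000 bp genome size in the usage note are assumptions for illustration; the parsing rules in `genochar/coverage.py` are not shown in this document.

```python
import re

# Assumed unit suffixes; a bare number is treated as base pairs.
_UNIT_FACTORS = {
    "": 1, "bp": 1,
    "kb": 1_000, "kbp": 1_000,
    "mb": 1_000_000, "mbp": 1_000_000,
    "gb": 1_000_000_000, "gbp": 1_000_000_000,
}


def parse_total_bases(text):
    """Convert a value such as '110.8 Mbp' into a number of bases."""
    match = re.fullmatch(r"\s*([0-9][0-9,]*\.?[0-9]*)\s*([A-Za-z]*)\s*", text)
    if match is None:
        raise ValueError(f"cannot parse total bases: {text!r}")
    unit = match.group(2).lower()
    if unit not in _UNIT_FACTORS:
        raise ValueError(f"unknown unit in total bases: {text!r}")
    return float(match.group(1).replace(",", "")) * _UNIT_FACTORS[unit]


def sequencing_coverage(total_bases, genome_size_bp):
    """Sequencing coverage (x) = total bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size_bp
```

For a hypothetical 2,000,000 bp genome, `sequencing_coverage(parse_total_bases("110.8 Mbp"), 2_000_000)` gives a coverage of about 55.4.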

## Metadata input

Optional metadata columns include:

- `Strain`
- `Sequencing platforms`
- `Assembly method`
- `Genus`
- `Species`
- `Accession`
- `Repeat regions`
- `Sequencing coverage (×)`

Example:

```text
Strain Genus Species Accession Sequencing platforms Assembly method
IOH03 Thermococcus waiotapuensis GCF_032304395 Illumina iSeq 100 Unicycler (short-read assembly)
IOH05 Thermococcus sp. GCA_000000000 Illumina iSeq 100 Unicycler (short-read assembly)
```

## Notes

- **GenoChar** is summarize-first by default. If you only pass FASTA, GFF, CheckM2, coverage, and metadata inputs, it behaves like a direct summarization tool.
- **`genochar setup`** is the recommended way to prepare Prokka and CheckM2 without forcing them into one shared environment.
- `--annotate prokka` tells GenoChar to create annotation files before building the final table.
- `--annotate existing` tells GenoChar to reuse nearby GFF files or explicitly supplied `--gff` inputs.
- `--check` runs CheckM2 internally **before annotation** and automatically integrates the resulting `quality_report.tsv` into the final table.
- `--check-report` reuses an existing CheckM2 `quality_report.tsv` file.
- `--check` and `--check-report` are mutually exclusive.
- `--gff` is intended for existing annotation files and should not be combined with `--annotate prokka`.
- If more than one 16S rRNA feature is found, GenoChar stores the longest 16S sequence in the main table.
- Legacy `genochar summarize ...` and `genochar pipeline ...` calls are still accepted in v0.6.3, but the preferred interface is the single-command form shown above.
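
The longest-16S rule in the notes above could be sketched like this. The function name `longest_16s`, the test for the text `16S` in the GFF attributes column, and the reverse-complement handling for minus-strand features are assumptions for illustration; how `genochar/gff_stats.py` actually recognizes 16S features is not shown in this document.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def longest_16s(gff_lines, contigs):
    """Pick the longest 16S rRNA feature and extract its sequence.

    `gff_lines` is an iterable of GFF3 lines; `contigs` maps contig
    identifiers to their sequences. Returns None when no 16S is found.
    """
    candidates = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "rRNA":
            continue
        if "16S" not in cols[8]:  # assumed way of spotting 16S features
            continue
        start, end = int(cols[3]), int(cols[4])
        candidates.append((end - start + 1, cols[0], start, end, cols[6]))
    if not candidates:
        return None

    # max() keeps the first feature encountered when lengths tie.
    length, contig, start, end, strand = max(candidates, key=lambda c: c[0])
    sequence = contigs[contig][start - 1:end]  # GFF is 1-based, inclusive
    if strand == "-":
        sequence = sequence.translate(_COMPLEMENT)[::-1]
    return {
        "16S rRNA count": len(candidates),
        "16S rRNA length (bp)": length,
        "16S rRNA contig": contig,
        "16S rRNA sequence": sequence,
    }
```

The count reflects every 16S feature found, while length, contig, and sequence describe only the longest one, which mirrors the four 16S columns of the main table.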

## License

MIT
@@ -0,0 +1,19 @@
LICENSE
README.md
pyproject.toml
GenoChar.egg-info/PKG-INFO
GenoChar.egg-info/SOURCES.txt
GenoChar.egg-info/dependency_links.txt
GenoChar.egg-info/entry_points.txt
GenoChar.egg-info/requires.txt
GenoChar.egg-info/top_level.txt
genochar/__init__.py
genochar/assembly_stats.py
genochar/checkm2.py
genochar/cli.py
genochar/coverage.py
genochar/gff_stats.py
genochar/managed_envs.py
genochar/metadata.py
genochar/pipeline.py
genochar/utils.py
@@ -0,0 +1 @@

@@ -0,0 +1 @@
genochar
genochar-0.6.3.2/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Jun Won Lee

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.