moducomp-0.7.11.tar.gz → moducomp-0.7.13.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {moducomp-0.7.11 → moducomp-0.7.13}/PKG-INFO +86 -14
- {moducomp-0.7.11 → moducomp-0.7.13}/README.md +85 -13
- {moducomp-0.7.11 → moducomp-0.7.13}/moducomp/__init__.py +1 -1
- {moducomp-0.7.11 → moducomp-0.7.13}/moducomp/moducomp.py +993 -10
- {moducomp-0.7.11 → moducomp-0.7.13}/pyproject.toml +1 -0
- {moducomp-0.7.11 → moducomp-0.7.13}/recipe.yaml +4 -2
- {moducomp-0.7.11 → moducomp-0.7.13}/.gitignore +0 -0
- {moducomp-0.7.11 → moducomp-0.7.13}/LICENSE.txt +0 -0
- {moducomp-0.7.11 → moducomp-0.7.13}/moducomp/__main__.py +0 -0
- {moducomp-0.7.11 → moducomp-0.7.13}/moducomp/data/test_genomes/IMG2562617132.faa +0 -0
- {moducomp-0.7.11 → moducomp-0.7.13}/moducomp/data/test_genomes/IMG2568526683.faa +0 -0
- {moducomp-0.7.11 → moducomp-0.7.13}/moducomp/data/test_genomes/IMG2740892217.faa +0 -0
- {moducomp-0.7.11 → moducomp-0.7.13}/pixi.lock +0 -0
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: moducomp
|
|
3
|
-
Version: 0.7.
|
|
3
|
+
Version: 0.7.13
|
|
4
4
|
Summary: moducomp: metabolic module completeness and complementarity for microbiomes.
|
|
5
5
|
Keywords: bioinformatics,microbiome,metabolic,kegg,genomics
|
|
6
6
|
Author-email: "Juan C. Villada" <jvillada@lbl.gov>
|
|
@@ -38,6 +38,7 @@ Project-URL: Repository, https://github.com/NeLLi-team/moducomp
|
|
|
38
38
|
- Tracks and reports the actual proteins that are responsible for the completion of the module in the combination of N genomes.
|
|
39
39
|
- **Automatic resource monitoring** with timestamped logs tracking CPU usage, memory consumption, and runtime for reproducibility.
|
|
40
40
|
- **Consistent logging to stdout/stderr** with a per-command resource summary emitted at the end of each run.
|
|
41
|
+
- **Built-in validation (`moducomp validate`)** for scientific consistency checks across annotations, KO matrices, KPCT outputs, and complementarity reports.
|
|
41
42
|
|
|
42
43
|
## Installation (Recommended)
|
|
43
44
|
|
|
@@ -59,16 +60,16 @@ pixi global install \
|
|
|
59
60
|
|
|
60
61
|
## Setup data (required)
|
|
61
62
|
|
|
62
|
-
`moducomp` needs the eggNOG-mapper database to run.
|
|
63
|
+
`moducomp` needs the eggNOG-mapper database to run. The primary (recommended) way to download it is using the `download_eggnog_data.py` wrapper, which mirrors the upstream downloader behavior. For upstream details, see the eggNOG-mapper setup guide: [eggNOG-mapper database setup](https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.13#user-content-Setup).
|
|
63
64
|
|
|
64
65
|
```bash
|
|
65
66
|
export EGGNOG_DATA_DIR="/path/to/eggnog-data"
|
|
66
|
-
|
|
67
|
-
#
|
|
68
|
-
#
|
|
67
|
+
download_eggnog_data.py --eggnog-data-dir "$EGGNOG_DATA_DIR"
|
|
68
|
+
# equivalent:
|
|
69
|
+
# moducomp download-eggnog-data --eggnog-data-dir "$EGGNOG_DATA_DIR"
|
|
69
70
|
```
|
|
70
71
|
|
|
71
|
-
If `EGGNOG_DATA_DIR` is not set,
|
|
72
|
+
If `EGGNOG_DATA_DIR` is not set, the downloader defaults to `${XDG_DATA_HOME:-~/.local/share}/moducomp/eggnog`.
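The fallback rule can be illustrated with a minimal Python sketch. `resolve_eggnog_data_dir` is a hypothetical helper written for this illustration, not a function exposed by `moducomp`. It assumes an empty variable is treated like an unset one, which is how the shell `${VAR:-default}` form behaves.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_eggnog_data_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Hypothetical helper: mimic the documented default-location rule."""
    env = os.environ if env is None else env

    # An explicitly set EGGNOG_DATA_DIR is used as-is.
    explicit = env.get("EGGNOG_DATA_DIR")
    if explicit:
        return Path(explicit).expanduser()

    # ${XDG_DATA_HOME:-~/.local/share}: fall back when unset or empty
    # (assumption: empty counts as unset, as in the shell expansion).
    xdg_data_home = env.get("XDG_DATA_HOME") or "~/.local/share"
    return Path(xdg_data_home).expanduser() / "moducomp" / "eggnog"
```

Passing a plain dictionary instead of the real environment, for example `resolve_eggnog_data_dir({"XDG_DATA_HOME": "/xdg"})`, makes the rule easy to check without touching the shell.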
|
|
72
73
|
|
|
73
74
|
### Quick test
|
|
74
75
|
|
|
@@ -148,6 +149,9 @@ This section lists all CLI options implemented today, along with their default v
|
|
|
148
149
|
| `--del-tmp/--keep-tmp` | `true` | Delete temporary files after completion. |
|
|
149
150
|
| `--lowmem/--fullmem` (`--low-mem/--full-mem`) | `fullmem` | Run eggNOG-mapper without `--dbmem` to reduce RAM. |
|
|
150
151
|
| `--verbose/--quiet` | `false` | Enable verbose progress output. |
|
|
152
|
+
| `--validate/--no-validate` | `validate` | Run post-run validation checks. |
|
|
153
|
+
| `--validate-report/--no-validate-report` | `validate-report` | Write `validation_report.json` in the output directory. |
|
|
154
|
+
| `--validate-strict/--validate-lenient` | `lenient` | Treat validation warnings as failures when strict. |
|
|
151
155
|
| `--log-level`, `-l` | `INFO` | Logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR`. |
|
|
152
156
|
| `--eggnog-data-dir` | `EGGNOG_DATA_DIR` | Path to eggNOG-mapper data (sets `EGGNOG_DATA_DIR`). |
|
|
153
157
|
|
|
@@ -162,6 +166,9 @@ This section lists all CLI options implemented today, along with their default v
|
|
|
162
166
|
| `--del-tmp/--keep-tmp` | `true` | Delete temporary files after the test completes. |
|
|
163
167
|
| `--lowmem/--fullmem` (`--low-mem/--full-mem`) | `lowmem` | Low-memory mode is the default for tests. |
|
|
164
168
|
| `--verbose/--quiet` | `verbose` | Verbose output is the default for tests. |
|
|
169
|
+
| `--validate/--no-validate` | `validate` | Run post-run validation checks. |
|
|
170
|
+
| `--validate-report/--no-validate-report` | `validate-report` | Write `validation_report.json` in the output directory. |
|
|
171
|
+
| `--validate-strict/--validate-lenient` | `lenient` | Treat validation warnings as failures when strict. |
|
|
165
172
|
| `--log-level`, `-l` | `INFO` | Logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR`. |
|
|
166
173
|
| `--eggnog-data-dir` | `EGGNOG_DATA_DIR` | Path to eggNOG-mapper data (sets `EGGNOG_DATA_DIR`). |
|
|
167
174
|
|
|
@@ -174,6 +181,21 @@ This section lists all CLI options implemented today, along with their default v
|
|
|
174
181
|
| `--del-tmp/--keep-tmp` | `true` | Delete temporary files after completion. |
|
|
175
182
|
| `--ncpus`, `-n` | `16` | CPU cores for KPCT parallel processing. |
|
|
176
183
|
| `--verbose/--quiet` | `false` | Enable verbose progress output. |
|
|
184
|
+
| `--validate/--no-validate` | `validate` | Run post-run validation checks. |
|
|
185
|
+
| `--validate-report/--no-validate-report` | `validate-report` | Write `validation_report.json` in the output directory. |
|
|
186
|
+
| `--validate-strict/--validate-lenient` | `lenient` | Treat validation warnings as failures when strict. |
|
|
187
|
+
| `--log-level`, `-l` | `INFO` | Logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR`. |
|
|
188
|
+
|
|
189
|
+
#### `validate` command (positional args: `savedir`)
|
|
190
|
+
|
|
191
|
+
| Option | Default | Description |
|
|
192
|
+
| --- | --- | --- |
|
|
193
|
+
| `--mode` | `auto` | Validation mode: `auto`, `pipeline`, or `ko-matrix`. |
|
|
194
|
+
| `--calculate-complementarity`, `-c` | `auto-detect` | Expected complementarity size (0 disables). |
|
|
195
|
+
| `--kpct-outprefix` | `output_give_completeness` | KPCT output prefix used during analysis. |
|
|
196
|
+
| `--strict/--lenient` | `lenient` | Treat warnings as failures when strict. |
|
|
197
|
+
| `--report` | _none_ | Write JSON validation report to this path. |
|
|
198
|
+
| `--verbose/--quiet` | `false` | Enable verbose progress output. |
|
|
177
199
|
| `--log-level`, `-l` | `INFO` | Logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR`. |
|
|
178
200
|
|
|
179
201
|
#### `download-eggnog-data` command
|
|
@@ -198,6 +220,33 @@ This section lists all CLI options implemented today, along with their default v
|
|
|
198
220
|
- For KPCT parallel processing, the system creates the same number of chunks as CPU cores specified
|
|
199
221
|
- Example: `--ncpus 8` will use 8 cores and create 8 chunks for optimal parallel processing
|
|
200
222
|
|
|
223
|
+
### Validation (QC)
|
|
224
|
+
|
|
225
|
+
Use the built-in validator to check scientific consistency across outputs after a run. The validator compares:
|
|
226
|
+
- KO sets and counts between eggNOG-mapper annotations and `kos_matrix.csv`
|
|
227
|
+
- KO sets between `kos_matrix.csv` and `ko_file_for_kpct.txt`
|
|
228
|
+
- KPCT contigs vs pathways outputs
|
|
229
|
+
- Module completeness ranges and combination naming
|
|
230
|
+
- Complementarity reports versus module completeness values
|
|
231
|
+
- Protein provenance fields (pipeline mode) or placeholders (KO-matrix mode)
|
|
232
|
+
|
|
233
|
+
Example:
|
|
234
|
+
|
|
235
|
+
```bash
|
|
236
|
+
# Validation runs by default after pipeline/analyze/test.
|
|
237
|
+
# Use --no-validate to disable or --no-validate-report to skip JSON output.
|
|
238
|
+
# When validation reports errors (or warnings in strict mode), the command exits non-zero.
|
|
239
|
+
|
|
240
|
+
# Validate a pipeline run and write a JSON report
|
|
241
|
+
moducomp validate /path/to/output --mode pipeline --report /path/to/output/validation_report.json
|
|
242
|
+
|
|
243
|
+
# Validate KO-matrix mode outputs (non-default KPCT prefix)
|
|
244
|
+
moducomp validate /path/to/output --mode ko-matrix --kpct-outprefix my_prefix
|
|
245
|
+
|
|
246
|
+
# Treat warnings as failures
|
|
247
|
+
moducomp validate /path/to/output --strict
|
|
248
|
+
```
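For scripted QC, the same invocations can be assembled from Python. The sketch below is an illustration under assumptions: `build_validate_command` and `validation_passed` are hypothetical helper names, only flags listed in the `validate` option table are used, and a zero exit code is assumed to mean that validation passed.

```python
import subprocess
from typing import List, Optional

VALID_MODES = {"auto", "pipeline", "ko-matrix"}


def build_validate_command(
    savedir: str,
    mode: str = "auto",
    strict: bool = False,
    report: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: assemble a `moducomp validate` argument list."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown validation mode: {mode!r}")
    cmd = ["moducomp", "validate", savedir, "--mode", mode]
    # --strict treats warnings as failures; --lenient is the documented default.
    cmd.append("--strict" if strict else "--lenient")
    if report is not None:
        cmd += ["--report", report]
    return cmd


def validation_passed(savedir: str, **options) -> bool:
    """Run the validator; assumes exit code 0 means no errors were reported."""
    result = subprocess.run(build_validate_command(savedir, **options), check=False)
    return result.returncode == 0
```

Keeping command construction separate from execution lets the argument list be inspected or logged before anything runs.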
|
|
249
|
+
|
|
201
250
|
### ⚠️ Important note 1
|
|
202
251
|
|
|
203
252
|
**Prepare FAA files**: Ensure FAA headers are in the form `>genomeName|proteinId`, or use the `--adapt-headers` option to format your headers into `>fileName_prefix|protein_id_counter`.
|
|
@@ -211,7 +260,7 @@ This section lists all CLI options implemented today, along with their default v
|
|
|
211
260
|
You can override the bundled data location with `MODUCOMP_DATA_DIR`.
|
|
212
261
|
When working from source, the bundled test genomes live at `moducomp/data/test_genomes`.
|
|
213
262
|
|
|
214
|
-
`download_eggnog_data.py` is
|
|
263
|
+
`download_eggnog_data.py` is exposed by `moducomp` as a convenience wrapper for the eggnog-mapper downloader and is available in the Pixi environment (including `pixi global` installs).
|
|
215
264
|
|
|
216
265
|
Pixi task (supports passing a custom location):
|
|
217
266
|
|
|
@@ -298,15 +347,38 @@ moducomp analyze-ko-matrix ./ko_matrix.csv ./output_moderate --ncpus 16 --calcul
|
|
|
298
347
|
moducomp pipeline ./genomes ./output_lowmem --ncpus 8 --lowmem --calculate-complementarity 2
|
|
299
348
|
```
|
|
300
349
|
|
|
301
|
-
##
|
|
350
|
+
## Expected outputs
|
|
351
|
+
|
|
352
|
+
The sections below describe the expected output files, naming conventions, and the column-level meaning of each file. These details are the same for `moducomp pipeline` and `moducomp test` (pipeline mode), and the subset noted for `moducomp analyze-ko-matrix` (KO-matrix mode).
|
|
353
|
+
|
|
354
|
+
**Naming conventions**
|
|
355
|
+
|
|
356
|
+
Genome identifiers are stored as `taxon_oid`. In pipeline mode, ModuComp expects protein headers in the format `genome_id|protein_id`. If you set `--adapt-headers`, ModuComp rewrites headers to `>genomeName|protein_N`, where `genomeName` is the FAA filename stem. Combination identifiers use `__` (double underscore), for example `GenomeA__GenomeB`, and `n_members` in `module_completeness.tsv` records the size of each combination.
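These conventions can be sketched in a few lines of Python. All function names below are hypothetical and chosen for illustration only. The sketch assumes that genome names never contain `__` themselves, and it leaves the starting value of the protein counter to the caller because the text does not state it.

```python
from pathlib import Path
from typing import List

SEPARATOR = "__"  # double underscore joins genome IDs in a combination


def make_combination_id(genome_ids: List[str]) -> str:
    """Join genome IDs into a combination identifier, e.g. GenomeA__GenomeB."""
    return SEPARATOR.join(genome_ids)


def split_combination_id(taxon_oid: str) -> List[str]:
    """Recover the member genomes (assumes no genome name contains '__')."""
    return taxon_oid.split(SEPARATOR)


def count_members(taxon_oid: str) -> int:
    """Size of a combination, comparable to the n_members column."""
    return len(split_combination_id(taxon_oid))


def adapted_header(faa_path: str, counter: int) -> str:
    """Header in the >genomeName|protein_N form; genomeName is the file stem."""
    return f">{Path(faa_path).stem}|protein_{counter}"
```

With these definitions, a single genome yields a member count of 1 and `GenomeA__GenomeB` yields 2.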
|
|
357
|
+
|
|
358
|
+
**Pipeline mode outputs (`moducomp pipeline`, `moducomp test`)**
|
|
359
|
+
|
|
360
|
+
- `emapper_out.emapper.annotations`: Full eggNOG-mapper annotations. The `#query` column must match `genome_id|protein_id`. `KEGG_ko` entries are prefixed `ko:KXXXXX` and are converted to `KXXXXX` for downstream matrices.
|
|
361
|
+
- `kos_matrix.csv`: Genome × KO count matrix. Columns: `taxon_oid` followed by KO IDs (e.g., `K00001`). Values are integer protein counts per KO.
|
|
362
|
+
- `ko_file_for_kpct.txt`: KPCT input file. Each line starts with `taxon_oid` followed by the set of KO IDs present in that genome or combination. If `--calculate-complementarity` is `N>=2`, combinations up to `N` are included as `GenomeA__GenomeB`.
|
|
363
|
+
- `output_give_completeness_contigs.with_weights.tsv`: KPCT module results per genome/combination. Columns: `contig` (genome/combination ID), `module_accession`, `completeness` (0–100), `pathway_name`, `pathway_class`, `matching_ko` (KO weights), `missing_ko`.
|
|
364
|
+
- `output_give_completeness_pathways.with_weights.tsv`: Same rows and order as the contigs file, but without the `contig` column. This is provided for compatibility with legacy tools; prefer the contigs file when you need genome-level provenance.
|
|
365
|
+
- `module_completeness.tsv`: Pivoted module completeness matrix. Columns: `n_members`, `taxon_oid`, followed by KEGG module IDs (`M00001`, …). Values are numeric percentages in the range 0–100.
|
|
366
|
+
- `module_completeness_complementarity_Nmember.tsv`: Complementarity report for `N`-member combinations (only when `--calculate-complementarity N` is set). Columns: `taxon_oid_1..N`, `completeness_taxon_oid_1..N`, `module_id`, `module_name`, `pathway_class`, `matching_ko`, `proteins_taxon_oid_1..N`. Protein fields list contributing proteins per KO (from eggNOG-mapper) as `{'KXXXXX': 'genome|protein'}`.
|
|
367
|
+
- `logs/moducomp.log`: Detailed run log with structured progress messages and per-command resource summaries.
|
|
368
|
+
- `logs/resource_usage_YYYYMMDD_HHMMSS.log`: Resource monitoring log capturing wall time, CPU time, CPU utilization, peak RAM, and exit code for each monitored command.
|
|
369
|
+
- `tmp/` (only if `--keep-tmp`): Intermediate files such as `merged_genomes.faa`, `emapper_output/`, and KPCT chunk outputs.
|
|
370
|
+
- `validation_report.json` (default when validation is enabled): JSON report produced by the validator.
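
The relationship between the annotations file and `kos_matrix.csv` can be sketched as follows. This is an illustration, not the package's implementation: the helper names are hypothetical, `KEGG_ko` is assumed to be a comma-separated field with `-` marking "no KO" (as in eggNOG-mapper v2 output), and a protein annotated with several KOs is assumed to count once toward each of them.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_kegg_ko(field: str) -> List[str]:
    """Turn 'ko:K00001,ko:K00002' into ['K00001', 'K00002']."""
    if not field or field == "-":
        return []
    kos = []
    for entry in field.split(","):
        entry = entry.strip()
        if entry:
            # Drop the 'ko:' prefix when present.
            kos.append(entry[3:] if entry.startswith("ko:") else entry)
    return kos


def count_kos(rows: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count proteins per KO for each genome.

    Each row is (query, kegg_ko), where query looks like 'genome_id|protein_id'.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for query, kegg_ko in rows:
        genome_id = query.split("|", 1)[0]
        for ko in parse_kegg_ko(kegg_ko):
            counts[genome_id][ko] += 1
    return dict(counts)
```

Genomes whose proteins carry no KO annotation do not appear in the result of this sketch; a real matrix writer would still need a row for them.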
|
|
302
371
|
|
|
303
|
-
|
|
372
|
+
**KO-matrix mode outputs (`moducomp analyze-ko-matrix`)**
|
|
304
373
|
|
|
305
|
-
-
|
|
306
|
-
-
|
|
307
|
-
-
|
|
308
|
-
-
|
|
309
|
-
-
|
|
374
|
+
- `kos_matrix.csv`: A copy of the input KO matrix (same format as above).
|
|
375
|
+
- `ko_file_for_kpct.txt`: KPCT input generated from the KO matrix. If `--calculate-complementarity` is set, combination lines are added using `GenomeA__GenomeB` identifiers.
|
|
376
|
+
- `output_give_completeness_contigs.with_weights.tsv`: KPCT module results per genome/combination (same format as pipeline mode).
|
|
377
|
+
- `output_give_completeness_pathways.with_weights.tsv`: Same rows as the contigs file, without the `contig` column.
|
|
378
|
+
- `module_completeness.tsv`: Module completeness matrix (same format as pipeline mode).
|
|
379
|
+
- `module_completeness_complementarity_Nmember.tsv`: Complementarity report. Protein contribution columns are filled with `No protein data available for <genome>` because no eggNOG-mapper annotations are available in KO-matrix mode.
|
|
380
|
+
- `logs/moducomp.log` and `logs/resource_usage_YYYYMMDD_HHMMSS.log`: Standard run logs and resource summaries.
|
|
381
|
+
- `validation_report.json` (default when validation is enabled): JSON report produced by the validator.
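
In both modes, `module_completeness.tsv` is described as holding percentages in the range 0–100. A minimal sketch of such a range check is shown here. It is not the validator's actual code: `out_of_range_cells` is a hypothetical name, the file is assumed to be tab-separated with the documented `n_members` and `taxon_oid` columns, and empty cells are skipped.

```python
import csv
from typing import Iterable, List, Tuple

ID_COLUMNS = ("n_members", "taxon_oid")


def out_of_range_cells(lines: Iterable[str]) -> List[Tuple[str, str, float]]:
    """Return (taxon_oid, module_id, value) for values outside 0-100."""
    problems = []
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        for column, value in row.items():
            # Skip identifier columns and empty cells (assumption).
            if column in ID_COLUMNS or value is None or value == "":
                continue
            number = float(value)
            if not 0.0 <= number <= 100.0:
                problems.append((row["taxon_oid"], column, number))
    return problems
```

The function accepts any iterable of lines, so it works with an open file handle as well as with an in-memory list of strings.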
|
|
310
382
|
|
|
311
383
|
## Citation
|
|
312
384
|
Villada, JC. & Schulz, F. (2025). Assessment of metabolic module completeness of genomes and metabolic complementarity in microbiomes with `moducomp`. `moducomp` (v0.5.1) Zenodo. https://doi.org/10.5281/zenodo.16116092
|
|
@@ -13,6 +13,7 @@
|
|
|
13
13
|
- Tracks and reports the actual proteins that are responsible for the completion of the module in the combination of N genomes.
|
|
14
14
|
- **Automatic resource monitoring** with timestamped logs tracking CPU usage, memory consumption, and runtime for reproducibility.
|
|
15
15
|
- **Consistent logging to stdout/stderr** with a per-command resource summary emitted at the end of each run.
|
|
16
|
+
- **Built-in validation (`moducomp validate`)** for scientific consistency checks across annotations, KO matrices, KPCT outputs, and complementarity reports.
|
|
16
17
|
|
|
17
18
|
## Installation (Recommended)
|
|
18
19
|
|
|
@@ -34,16 +35,16 @@ pixi global install \
|
|
|
34
35
|
|
|
35
36
|
## Setup data (required)
|
|
36
37
|
|
|
37
|
-
`moducomp` needs the eggNOG-mapper database to run.
|
|
38
|
+
`moducomp` needs the eggNOG-mapper database to run. The primary (recommended) way to download it is using the `download_eggnog_data.py` wrapper, which mirrors the upstream downloader behavior. For upstream details, see the eggNOG-mapper setup guide: [eggNOG-mapper database setup](https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.13#user-content-Setup).
|
|
38
39
|
|
|
39
40
|
```bash
|
|
40
41
|
export EGGNOG_DATA_DIR="/path/to/eggnog-data"
|
|
41
|
-
|
|
42
|
-
#
|
|
43
|
-
#
|
|
42
|
+
download_eggnog_data.py --eggnog-data-dir "$EGGNOG_DATA_DIR"
|
|
43
|
+
# equivalent:
|
|
44
|
+
# moducomp download-eggnog-data --eggnog-data-dir "$EGGNOG_DATA_DIR"
|
|
44
45
|
```
|
|
45
46
|
|
|
46
|
-
If `EGGNOG_DATA_DIR` is not set,
|
|
47
|
+
If `EGGNOG_DATA_DIR` is not set, the downloader defaults to `${XDG_DATA_HOME:-~/.local/share}/moducomp/eggnog`.
|
|
47
48
|
|
|
48
49
|
### Quick test
|
|
49
50
|
|
|
@@ -123,6 +124,9 @@ This section lists all CLI options implemented today, along with their default v
|
|
|
123
124
|
| `--del-tmp/--keep-tmp` | `true` | Delete temporary files after completion. |
|
|
124
125
|
| `--lowmem/--fullmem` (`--low-mem/--full-mem`) | `fullmem` | Run eggNOG-mapper without `--dbmem` to reduce RAM. |
|
|
125
126
|
| `--verbose/--quiet` | `false` | Enable verbose progress output. |
|
|
127
|
+
| `--validate/--no-validate` | `validate` | Run post-run validation checks. |
|
|
128
|
+
| `--validate-report/--no-validate-report` | `validate-report` | Write `validation_report.json` in the output directory. |
|
|
129
|
+
| `--validate-strict/--validate-lenient` | `lenient` | Treat validation warnings as failures when strict. |
|
|
126
130
|
| `--log-level`, `-l` | `INFO` | Logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR`. |
|
|
127
131
|
| `--eggnog-data-dir` | `EGGNOG_DATA_DIR` | Path to eggNOG-mapper data (sets `EGGNOG_DATA_DIR`). |
|
|
128
132
|
|
|
@@ -137,6 +141,9 @@ This section lists all CLI options implemented today, along with their default v
|
|
|
137
141
|
| `--del-tmp/--keep-tmp` | `true` | Delete temporary files after the test completes. |
|
|
138
142
|
| `--lowmem/--fullmem` (`--low-mem/--full-mem`) | `lowmem` | Low-memory mode is the default for tests. |
|
|
139
143
|
| `--verbose/--quiet` | `verbose` | Verbose output is the default for tests. |
|
|
144
|
+
| `--validate/--no-validate` | `validate` | Run post-run validation checks. |
|
|
145
|
+
| `--validate-report/--no-validate-report` | `validate-report` | Write `validation_report.json` in the output directory. |
|
|
146
|
+
| `--validate-strict/--validate-lenient` | `lenient` | Treat validation warnings as failures when strict. |
|
|
140
147
|
| `--log-level`, `-l` | `INFO` | Logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR`. |
|
|
141
148
|
| `--eggnog-data-dir` | `EGGNOG_DATA_DIR` | Path to eggNOG-mapper data (sets `EGGNOG_DATA_DIR`). |
|
|
142
149
|
|
|
@@ -149,6 +156,21 @@ This section lists all CLI options implemented today, along with their default v
|
|
|
149
156
|
| `--del-tmp/--keep-tmp` | `true` | Delete temporary files after completion. |
|
|
150
157
|
| `--ncpus`, `-n` | `16` | CPU cores for KPCT parallel processing. |
|
|
151
158
|
| `--verbose/--quiet` | `false` | Enable verbose progress output. |
|
|
159
|
+
| `--validate/--no-validate` | `validate` | Run post-run validation checks. |
|
|
160
|
+
| `--validate-report/--no-validate-report` | `validate-report` | Write `validation_report.json` in the output directory. |
|
|
161
|
+
| `--validate-strict/--validate-lenient` | `lenient` | Treat validation warnings as failures when strict. |
|
|
162
|
+
| `--log-level`, `-l` | `INFO` | Logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR`. |
|
|
163
|
+
|
|
164
|
+
#### `validate` command (positional args: `savedir`)
|
|
165
|
+
|
|
166
|
+
| Option | Default | Description |
|
|
167
|
+
| --- | --- | --- |
|
|
168
|
+
| `--mode` | `auto` | Validation mode: `auto`, `pipeline`, or `ko-matrix`. |
|
|
169
|
+
| `--calculate-complementarity`, `-c` | `auto-detect` | Expected complementarity size (0 disables). |
|
|
170
|
+
| `--kpct-outprefix` | `output_give_completeness` | KPCT output prefix used during analysis. |
|
|
171
|
+
| `--strict/--lenient` | `lenient` | Treat warnings as failures when strict. |
|
|
172
|
+
| `--report` | _none_ | Write JSON validation report to this path. |
|
|
173
|
+
| `--verbose/--quiet` | `false` | Enable verbose progress output. |
|
|
152
174
|
| `--log-level`, `-l` | `INFO` | Logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR`. |
|
|
153
175
|
|
|
154
176
|
#### `download-eggnog-data` command
|
|
@@ -173,6 +195,33 @@ This section lists all CLI options implemented today, along with their default v
|
|
|
173
195
|
- For KPCT parallel processing, the system creates the same number of chunks as CPU cores specified
|
|
174
196
|
- Example: `--ncpus 8` will use 8 cores and create 8 chunks for optimal parallel processing
|
|
175
197
|
|
|
198
|
+
### Validation (QC)
|
|
199
|
+
|
|
200
|
+
Use the built-in validator to check scientific consistency across outputs after a run. The validator compares:
|
|
201
|
+
- KO sets and counts between eggNOG-mapper annotations and `kos_matrix.csv`
|
|
202
|
+
- KO sets between `kos_matrix.csv` and `ko_file_for_kpct.txt`
|
|
203
|
+
- KPCT contigs vs pathways outputs
|
|
204
|
+
- Module completeness ranges and combination naming
|
|
205
|
+
- Complementarity reports versus module completeness values
|
|
206
|
+
- Protein provenance fields (pipeline mode) or placeholders (KO-matrix mode)
|
|
207
|
+
|
|
208
|
+
Example:
|
|
209
|
+
|
|
210
|
+
```bash
|
|
211
|
+
# Validation runs by default after pipeline/analyze/test.
|
|
212
|
+
# Use --no-validate to disable or --no-validate-report to skip JSON output.
|
|
213
|
+
# When validation reports errors (or warnings in strict mode), the command exits non-zero.
|
|
214
|
+
|
|
215
|
+
# Validate a pipeline run and write a JSON report
|
|
216
|
+
moducomp validate /path/to/output --mode pipeline --report /path/to/output/validation_report.json
|
|
217
|
+
|
|
218
|
+
# Validate KO-matrix mode outputs (non-default KPCT prefix)
|
|
219
|
+
moducomp validate /path/to/output --mode ko-matrix --kpct-outprefix my_prefix
|
|
220
|
+
|
|
221
|
+
# Treat warnings as failures
|
|
222
|
+
moducomp validate /path/to/output --strict
|
|
223
|
+
```
|
|
224
|
+
|
|
176
225
|
### ⚠️ Important note 1
|
|
177
226
|
|
|
178
227
|
**Prepare FAA files**: Ensure FAA headers are in the form `>genomeName|proteinId`, or use the `--adapt-headers` option to format your headers into `>fileName_prefix|protein_id_counter`.
|
|
@@ -186,7 +235,7 @@ This section lists all CLI options implemented today, along with their default v
|
|
|
186
235
|
You can override the bundled data location with `MODUCOMP_DATA_DIR`.
|
|
187
236
|
When working from source, the bundled test genomes live at `moducomp/data/test_genomes`.
|
|
188
237
|
|
|
189
|
-
`download_eggnog_data.py` is
|
|
238
|
+
`download_eggnog_data.py` is exposed by `moducomp` as a convenience wrapper for the eggnog-mapper downloader and is available in the Pixi environment (including `pixi global` installs).
|
|
190
239
|
|
|
191
240
|
Pixi task (supports passing a custom location):
|
|
192
241
|
|
|
@@ -273,15 +322,38 @@ moducomp analyze-ko-matrix ./ko_matrix.csv ./output_moderate --ncpus 16 --calcul
|
|
|
273
322
|
moducomp pipeline ./genomes ./output_lowmem --ncpus 8 --lowmem --calculate-complementarity 2
|
|
274
323
|
```
|
|
275
324
|
|
|
276
|
-
##
|
|
325
|
+
## Expected outputs
|
|
326
|
+
|
|
327
|
+
The sections below describe the expected output files, naming conventions, and the column-level meaning of each file. These details are the same for `moducomp pipeline` and `moducomp test` (pipeline mode), and the subset noted for `moducomp analyze-ko-matrix` (KO-matrix mode).
|
|
328
|
+
|
|
329
|
+
**Naming conventions**
|
|
330
|
+
|
|
331
|
+
Genome identifiers are stored as `taxon_oid`. In pipeline mode, ModuComp expects protein headers in the format `genome_id|protein_id`. If you set `--adapt-headers`, ModuComp rewrites headers to `>genomeName|protein_N`, where `genomeName` is the FAA filename stem. Combination identifiers use `__` (double underscore), for example `GenomeA__GenomeB`, and `n_members` in `module_completeness.tsv` records the size of each combination.
|
|
332
|
+
|
|
333
|
+
**Pipeline mode outputs (`moducomp pipeline`, `moducomp test`)**
|
|
334
|
+
|
|
335
|
+
- `emapper_out.emapper.annotations`: Full eggNOG-mapper annotations. The `#query` column must match `genome_id|protein_id`. `KEGG_ko` entries are prefixed `ko:KXXXXX` and are converted to `KXXXXX` for downstream matrices.
|
|
336
|
+
- `kos_matrix.csv`: Genome × KO count matrix. Columns: `taxon_oid` followed by KO IDs (e.g., `K00001`). Values are integer protein counts per KO.
|
|
337
|
+
- `ko_file_for_kpct.txt`: KPCT input file. Each line starts with `taxon_oid` followed by the set of KO IDs present in that genome or combination. If `--calculate-complementarity` is `N>=2`, combinations up to `N` are included as `GenomeA__GenomeB`.
|
|
338
|
+
- `output_give_completeness_contigs.with_weights.tsv`: KPCT module results per genome/combination. Columns: `contig` (genome/combination ID), `module_accession`, `completeness` (0–100), `pathway_name`, `pathway_class`, `matching_ko` (KO weights), `missing_ko`.
|
|
339
|
+
- `output_give_completeness_pathways.with_weights.tsv`: Same rows and order as the contigs file, but without the `contig` column. This is provided for compatibility with legacy tools; prefer the contigs file when you need genome-level provenance.
|
|
340
|
+
- `module_completeness.tsv`: Pivoted module completeness matrix. Columns: `n_members`, `taxon_oid`, followed by KEGG module IDs (`M00001`, …). Values are numeric percentages in the range 0–100.
|
|
341
|
+
- `module_completeness_complementarity_Nmember.tsv`: Complementarity report for `N`-member combinations (only when `--calculate-complementarity N` is set). Columns: `taxon_oid_1..N`, `completeness_taxon_oid_1..N`, `module_id`, `module_name`, `pathway_class`, `matching_ko`, `proteins_taxon_oid_1..N`. Protein fields list contributing proteins per KO (from eggNOG-mapper) as `{'KXXXXX': 'genome|protein'}`.
|
|
342
|
+
- `logs/moducomp.log`: Detailed run log with structured progress messages and per-command resource summaries.
|
|
343
|
+
- `logs/resource_usage_YYYYMMDD_HHMMSS.log`: Resource monitoring log capturing wall time, CPU time, CPU utilization, peak RAM, and exit code for each monitored command.
|
|
344
|
+
- `tmp/` (only if `--keep-tmp`): Intermediate files such as `merged_genomes.faa`, `emapper_output/`, and KPCT chunk outputs.
|
|
345
|
+
- `validation_report.json` (default when validation is enabled): JSON report produced by the validator.
|
|
277
346
|
|
|
278
|
-
|
|
347
|
+
**KO-matrix mode outputs (`moducomp analyze-ko-matrix`)**
|
|
279
348
|
|
|
280
|
-
-
|
|
281
|
-
-
|
|
282
|
-
-
|
|
283
|
-
-
|
|
284
|
-
-
|
|
349
|
+
- `kos_matrix.csv`: A copy of the input KO matrix (same format as above).
|
|
350
|
+
- `ko_file_for_kpct.txt`: KPCT input generated from the KO matrix. If `--calculate-complementarity` is set, combination lines are added using `GenomeA__GenomeB` identifiers.
|
|
351
|
+
- `output_give_completeness_contigs.with_weights.tsv`: KPCT module results per genome/combination (same format as pipeline mode).
|
|
352
|
+
- `output_give_completeness_pathways.with_weights.tsv`: Same rows as the contigs file, without the `contig` column.
|
|
353
|
+
- `module_completeness.tsv`: Module completeness matrix (same format as pipeline mode).
|
|
354
|
+
- `module_completeness_complementarity_Nmember.tsv`: Complementarity report. Protein contribution columns are filled with `No protein data available for <genome>` because no eggNOG-mapper annotations are available in KO-matrix mode.
|
|
355
|
+
- `logs/moducomp.log` and `logs/resource_usage_YYYYMMDD_HHMMSS.log`: Standard run logs and resource summaries.
|
|
356
|
+
- `validation_report.json` (default when validation is enabled): JSON report produced by the validator.
|
|
285
357
|
|
|
286
358
|
## Citation
|
|
287
359
|
Villada, JC. & Schulz, F. (2025). Assessment of metabolic module completeness of genomes and metabolic complementarity in microbiomes with `moducomp`. `moducomp` (v0.5.1) Zenodo. https://doi.org/10.5281/zenodo.16116092
|