oxymetag 1.0.0__tar.gz → 1.1.1__tar.gz
- {oxymetag-1.0.0/oxymetag.egg-info → oxymetag-1.1.1}/PKG-INFO +116 -52
- {oxymetag-1.0.0 → oxymetag-1.1.1}/README.md +115 -51
- {oxymetag-1.0.0 → oxymetag-1.1.1}/environment.yml +1 -0
- {oxymetag-1.0.0 → oxymetag-1.1.1}/oxymetag/__init__.py +1 -1
- {oxymetag-1.0.0 → oxymetag-1.1.1}/oxymetag/cli.py +15 -13
- {oxymetag-1.0.0 → oxymetag-1.1.1}/oxymetag/core.py +114 -15
- oxymetag-1.1.1/oxymetag/data/VTML20.out +33 -0
- oxymetag-1.1.1/oxymetag/data/nucleotide.out +9 -0
- oxymetag-1.1.1/oxymetag/data/oxygen_model.rds +0 -0
- oxymetag-1.1.1/oxymetag/data/oxymetag_pfams_n117.dmnd +0 -0
- oxymetag-1.1.1/oxymetag/data/oxymetag_pfams_n117_db +0 -0
- oxymetag-1.1.1/oxymetag/data/oxymetag_pfams_n117_db.dbtype +0 -0
- oxymetag-1.1.1/oxymetag/data/oxymetag_pfams_n117_db.index +23972 -0
- oxymetag-1.1.1/oxymetag/data/oxymetag_pfams_n117_db.lookup +23972 -0
- oxymetag-1.1.1/oxymetag/data/oxymetag_pfams_n117_db.source +1 -0
- oxymetag-1.1.1/oxymetag/data/oxymetag_pfams_n117_db_h +0 -0
- oxymetag-1.1.1/oxymetag/data/oxymetag_pfams_n117_db_h.dbtype +0 -0
- oxymetag-1.1.1/oxymetag/data/oxymetag_pfams_n117_db_h.index +23972 -0
- {oxymetag-1.0.0 → oxymetag-1.1.1}/oxymetag/scripts/predict_oxygen.R +86 -38
- {oxymetag-1.0.0 → oxymetag-1.1.1}/oxymetag/utils.py +32 -14
- {oxymetag-1.0.0 → oxymetag-1.1.1/oxymetag.egg-info}/PKG-INFO +116 -52
- {oxymetag-1.0.0 → oxymetag-1.1.1}/oxymetag.egg-info/SOURCES.txt +11 -0
- {oxymetag-1.0.0 → oxymetag-1.1.1}/setup.py +1 -1
- oxymetag-1.0.0/oxymetag/data/oxygen_model.rds +0 -0
- {oxymetag-1.0.0 → oxymetag-1.1.1}/LICENSE +0 -0
- {oxymetag-1.0.0 → oxymetag-1.1.1}/MANIFEST.in +0 -0
- {oxymetag-1.0.0 → oxymetag-1.1.1}/oxymetag/data/.DS_Store +0 -0
- {oxymetag-1.0.0 → oxymetag-1.1.1}/oxymetag/data/Oxygen_pfams.csv +0 -0
- {oxymetag-1.0.0 → oxymetag-1.1.1}/oxymetag/data/oxymetag_pfams.dmnd +0 -0
- {oxymetag-1.0.0 → oxymetag-1.1.1}/oxymetag/data/pfam_headers_table.txt +0 -0
- {oxymetag-1.0.0 → oxymetag-1.1.1}/oxymetag/data/pfam_lengths.tsv +0 -0
- {oxymetag-1.0.0 → oxymetag-1.1.1}/oxymetag.egg-info/dependency_links.txt +0 -0
- {oxymetag-1.0.0 → oxymetag-1.1.1}/oxymetag.egg-info/entry_points.txt +0 -0
- {oxymetag-1.0.0 → oxymetag-1.1.1}/oxymetag.egg-info/requires.txt +0 -0
- {oxymetag-1.0.0 → oxymetag-1.1.1}/oxymetag.egg-info/top_level.txt +0 -0
- {oxymetag-1.0.0 → oxymetag-1.1.1}/requirements.txt +0 -0
- {oxymetag-1.0.0 → oxymetag-1.1.1}/setup.cfg +0 -0
- {oxymetag-1.0.0 → oxymetag-1.1.1}/tests/__init__.py +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: oxymetag
-Version: 1.
+Version: 1.1.1
 Summary: Oxygen metabolism profiling from metagenomic data
 Home-page: https://github.com/cliffbueno/oxymetag
 Author: Clifton P. Bueno de Mesquita
@@ -25,9 +25,9 @@ Requires-Dist: numpy>=1.20.0
 
 Oxygen metabolism profiling from metagenomic data using Pfam domains. OxyMetaG predicts the percent relative abundance of aerobic bacteria in metagenomic reads based on the ratio of abundances of a set of 20 Pfams. It is recommended to use a HPC cluster or server rather than laptop to run OxyMetaG due to memory requirements, particularly for the step of extracting bacterial reads. If you already have bacterial reads, the "profile" and "predict" functions will run quickly on a laptop.
 
-If you are working with modern metagenomes, we recommend first quality filtering the raw reads with your method of choice and standard practices, and then extracting bacterial reads with Kraken2 and KrakenTools, which is performed with the OxyMetaG extract function.
+If you are working with modern metagenomes, we recommend first quality filtering the raw reads with your method of choice and standard practices, and then extracting bacterial reads with Kraken2 and KrakenTools, which is performed with the OxyMetaG extract function. For profiling modern metagenomes, use DIAMOND blastx with the default `-m diamond` mode for the profile step. You can also use `-m custom` in the predict step to test different hit cutoffs.
 
-If you are working with ancient metagenomes, we recommend first quality filtering the raw reads with your method of choice and standard practices, and then extracting bacterial reads with a workflow optimized for ancient DNA, such as the
+If you are working with ancient metagenomes, we recommend first quality filtering the raw reads with your method of choice and standard practices, and then extracting bacterial reads with a workflow optimized for ancient DNA, such as the read mapping approach employed by De Sanctis et al. (2025). For profiling ancient metagenomes, use MMseqs2 with `-m mmseqs2` for the profile step and `-m ancient` for the predict step. The ancient mode uses parameters optimized for ancient DNA along with 97 decoy Pfams to reduce instances of false positives. We are still working on optimizing the methods for ancient DNA, which will be released as v2.0.0.
 
 ## Installation
 
@@ -46,16 +46,22 @@ conda env create -f environment.yml
 conda activate oxymetag
 
 # Install OxyMetaG
-pip install
+pip install -e .
+
+# Index the MMseqs2 database (one-time setup, ~5-10 minutes)
+mmseqs createindex oxymetag/data/oxymetag_pfams_n117_db tmp
 ```
 
+**Note:** The MMseqs2 database indexing is optional but highly recommended for faster searches.
+
 ### Using Pip
 
 First install external dependencies:
 - Kraken2
 - DIAMOND
+- MMseqs2
 - KrakenTools
-- R with mgcv and
+- R with mgcv, dplyr, tidyr, and rlang packages
 
 Then install OxyMetaG:
 ```bash
@@ -64,30 +70,37 @@ pip install oxymetag
 
 ## Quick Start
 
-###
+### Modern DNA workflow
+
 ```bash
+# 1. Setup Kraken2 database (one-time)
 oxymetag setup
-```
 
-
-
-oxymetag extract -i sample1_R1.fastq.gz sample1_R2.fastq.gz -o BactReads -t 48
-```
+# 2. Extract bacterial reads
+oxymetag extract -i sample1_R1.fastq.gz -o BactReads -t 48
 
-
-
-
+# 3. Profile samples with DIAMOND
+oxymetag profile -i BactReads -o diamond_output -m diamond -t 8
+
+# 4. Predict aerobe levels
+oxymetag predict -i diamond_output -o per_aerobe_predictions.tsv -m modern
 ```
 
-###
+### Ancient DNA workflow
+
 ```bash
-#
-
+# 1. Extract bacterial reads with an ancient DNA-optimized workflow (not currently provided by OxyMetaG)
+
+# 2. Profile samples with MMseqs2
+oxymetag profile -i BactReads -o mmseqs_output -m mmseqs2 -t 8
 
-#
-oxymetag predict -i
+# 3. Predict aerobe levels with ancient mode
+oxymetag predict -i mmseqs_output -o per_aerobe_predictions.tsv -m ancient
+```
+
+### Custom parameters
 
-
+```bash
 oxymetag predict -i diamond_output -o per_aerobe_predictions.tsv -m custom --idcut 50 --bitcut 30 --ecut 0.01
 ```
 
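The custom-mode command at the end of the hunk above lends itself to a small cutoff sweep. The sketch below only builds argument lists for `oxymetag predict`, using the flags that appear in the README (`-i`, `-o`, `-m custom`, `--idcut`, `--bitcut`, `--ecut`). The helper name `build_sweep_commands`, the cutoff grid, and the output file naming are illustrative assumptions and are not part of OxyMetaG.

```python
from itertools import product


def build_sweep_commands(input_dir, idcuts, bitcuts, ecuts):
    """Return one `oxymetag predict -m custom` argv list per cutoff combination.

    Hypothetical helper: only the CLI flags come from the README; the
    output naming scheme is an assumption made for this sketch.
    """
    commands = []
    for idcut, bitcut, ecut in product(idcuts, bitcuts, ecuts):
        out = f"per_aerobe_id{idcut}_bit{bitcut}_e{ecut}.tsv"
        commands.append([
            "oxymetag", "predict",
            "-i", input_dir,
            "-o", out,
            "-m", "custom",
            "--idcut", str(idcut),
            "--bitcut", str(bitcut),
            "--ecut", str(ecut),
        ])
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them; each list could be passed
    # to subprocess.run(cmd, check=True) once oxymetag is installed.
    for cmd in build_sweep_commands("diamond_output", [50, 60], [30, 50], [0.01, 0.001]):
        print(" ".join(cmd))
```

Writing each run to its own output file keeps the per-sample predictions comparable across cutoffs, which is the sensitivity check the README recommends.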
@@ -98,11 +111,11 @@ oxymetag predict -i diamond_output -o per_aerobe_predictions.tsv -m custom --idc
 
 **What it does:** Downloads and builds the standard Kraken2 database containing bacterial, archaeal, and viral genomes. This database is used by the `extract` command to identify bacterial sequences from metagenomic samples.
 
-**Time:**
+**Time:** Depends on internet speed and system performance, but will likely take several hours (4-20 hours typical).
 
-**Output:** Creates a `kraken2_db/` directory with the standard database.
+**Output:** Creates a `kraken2_db/` directory with the standard database (~90 GB).
 
-Make sure you run oxymetag setup from the directory where you want the database to live, or plan to always specify the --kraken-db path when running extract. The database is quite large
+Make sure you run oxymetag setup from the directory where you want the database to live, or plan to always specify the --kraken-db path when running extract. The database is quite large, so choose a location with sufficient storage.
 
 ---
 
@@ -114,7 +127,7 @@ Make sure you run oxymetag setup from the directory where you want the database
 2. Uses KrakenTools to extract only the reads classified as bacterial
 3. Outputs cleaned bacterial-only FASTQ files for downstream analysis
 
-**Input:** Quality filtered metagenomic read FASTQ files (paired-end or merged)
+**Input:** Quality filtered metagenomic read FASTQ files (paired-end or merged)
 **Output:** Bacterial-only FASTQ files in `BactReads/` directory
 
 **Arguments:**
@@ -126,22 +139,26 @@ Make sure you run oxymetag setup from the directory where you want the database
 ---
 
 ### oxymetag profile
-**Function:** Profiles bacterial reads against oxygen metabolism protein domains.
+**Function:** Profiles bacterial reads against oxygen metabolism protein domains using DIAMOND or MMseqs2.
 
 **What it does:**
 1. Takes bacterial-only reads from the `extract` step
-2. Uses DIAMOND blastx to search against a curated database of
+2. Uses DIAMOND blastx (for modern DNA) or MMseqs2 (for ancient DNA) to search against a curated database of Pfam domains related to oxygen metabolism
+- DIAMOND mode: 20 target Pfams
+- MMseqs2 mode: 20 target Pfams + 97 decoy Pfams (117 total) to reduce false positives
 3. Identifies protein-coding sequences and their functional annotations
 4. Creates detailed hit tables for each sample
 
-**Input:** Bacterial FASTQ files (uses R1 or merged reads only)
-**Output:**
+**Input:** Bacterial FASTQ files (uses R1 or merged reads only)
+**Output:** Alignment files (TSV format) in `diamond_output/` or `mmseqs_output/` directory
 
 **Arguments:**
 - `-i, --input`: Input directory with bacterial reads (default: BactReads)
-- `-o, --output`: Output directory (default: diamond_output)
+- `-o, --output`: Output directory (default: diamond_output or mmseqs_output depending on method)
 - `-t, --threads`: Number of threads (default: 4)
--
+- `-m, --method`: Profiling method - 'diamond' or 'mmseqs2' (default: diamond)
+- `--diamond-db`: Custom DIAMOND database path (optional, for diamond method)
+- `--mmseqs-db`: Custom MMseqs2 database path (optional, for mmseqs2 method)
 
 ---
 
@@ -149,17 +166,24 @@ Make sure you run oxymetag setup from the directory where you want the database
 **Function:** Predicts aerobe abundance from protein domain profiles using machine learning.
 
 **What it does:**
-1. Processes DIAMOND output files with appropriate quality filters
-2.
-3.
-4.
-5.
-
-
+1. Processes DIAMOND or MMseqs2 output files with appropriate quality filters
+2. Selects the top hit per read based on bitscore
+3. For MMseqs2 (ancient mode): filters out decoy Pfams after selecting top hits
+4. Normalizes protein domain counts by gene length (reads per kilobase)
+5. Calculates aerobic/anaerobic domain ratios for each sample
+6. Applies a trained GAM (Generalized Additive Model) to predict percentage of aerobes
+7. Outputs a table with the sampleID, # Pfams detected, and predicted % aerobic bacteria
+
+**Input:** DIAMOND or MMseqs2 output directory from `profile` step
 **Output:** Tab-separated file with aerobe predictions for each sample
 
+**Mode determines input type:**
+- `-m modern`: Uses DIAMOND output (default input: diamond_output/)
+- `-m ancient`: Uses MMseqs2 output (default input: mmseqs_output/)
+- `-m custom`: Auto-detects DIAMOND or MMseqs2 files in input directory
+
 **Arguments:**
-- `-i, --input`: Directory with
+- `-i, --input`: Directory with profiling output (default: diamond_output for modern, mmseqs_output for ancient)
 - `-o, --output`: Output file (default: per_aerobe_predictions.tsv)
 - `-t, --threads`: Number of threads (default: 4)
 - `-m, --mode`: Filtering mode - 'modern', 'ancient', or 'custom' (default: modern)
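Steps 1 to 5 of the updated `predict` description in the hunk above can be sketched in plain Python. This is an illustration under stated assumptions, not the package's implementation: the hit field names, the `lengths_bp`, `aerobic`, `anaerobic`, and `decoys` inputs, the length unit (base pairs), and the definition of the ratio as summed aerobic RPK over summed anaerobic RPK are all hypothetical. The GAM prediction (step 6) is omitted.

```python
from collections import Counter


def aerobe_ratio(hits, lengths_bp, aerobic, anaerobic, decoys=frozenset(),
                 idcut=60.0, bitcut=50.0, ecut=0.001):
    """Compute an aerobic/anaerobic Pfam ratio from alignment hits.

    hits: iterable of dicts with keys read, pfam, pident, bitscore, evalue
          (hypothetical field names).
    lengths_bp: dict mapping each non-decoy Pfam ID to a gene length in
          base pairs (assumed unit).
    """
    # Step 1: keep hits that pass the identity, bitscore, and e-value cutoffs.
    passed = [h for h in hits
              if h["pident"] >= idcut
              and h["bitscore"] >= bitcut
              and h["evalue"] <= ecut]

    # Step 2: keep the highest-bitscore hit for each read.
    best = {}
    for h in passed:
        current = best.get(h["read"])
        if current is None or h["bitscore"] > current["bitscore"]:
            best[h["read"]] = h

    # Step 3: drop reads whose top hit is a decoy Pfam (ancient mode).
    counts = Counter(h["pfam"] for h in best.values() if h["pfam"] not in decoys)

    # Step 4: normalize counts to reads per kilobase of gene length.
    rpk = {pfam: n / (lengths_bp[pfam] / 1000.0) for pfam, n in counts.items()}

    # Step 5: ratio of summed aerobic RPK to summed anaerobic RPK.
    aerobic_sum = sum(v for p, v in rpk.items() if p in aerobic)
    anaerobic_sum = sum(v for p, v in rpk.items() if p in anaerobic)
    if anaerobic_sum == 0:
        return None  # ratio undefined without any anaerobic signal
    return aerobic_sum / anaerobic_sum
```

Removing decoys only after the top hit is chosen means a read that matches a decoy better than any target Pfam contributes nothing, which is the behavior the README describes for ancient mode.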
@@ -169,32 +193,56 @@ Make sure you run oxymetag setup from the directory where you want the database
 
 ## Filtering Modes
 
-OxyMetaG includes three pre-configured filtering modes optimized for different types of DNA
+OxyMetaG includes three pre-configured filtering modes optimized for different types of DNA. In any case, it is always recommended to try several different parameters (using -m custom) to check how sensitive the results are to the cutoffs.
 
 ### Modern DNA (default)
-**Best for:** Modern environmental metagenomes
+**Best for:** Modern environmental metagenomes
+**Method:** DIAMOND blastx
+**Filters:**
 - Identity ≥ 60%
 - Bitscore ≥ 50
 - E-value ≤ 0.001
 
-
-
-
-
-
+**Usage:**
+```bash
+oxymetag profile -m diamond
+oxymetag predict -m modern
+```
+
+### Ancient DNA
+**Best for:** Archaeological samples, paleogenomic data, degraded environmental DNA
+**Method:** MMseqs2 with decoy Pfams
+**Filters:**
+- Identity ≥ 86%
+- Bitscore ≥ 50
+- E-value ≤ 0.001
+
+**Note:** The ancient mode uses stricter identity cutoffs but employs 97 decoy Pfams to reduce false positives from damaged DNA. Reads matching decoys better than target Pfams are filtered out.
+
+**Usage:**
+```bash
+oxymetag profile -m mmseqs2
+oxymetag predict -m ancient
+```
 
 ### Custom
 **Best for:** Specialized applications or when you want to optimize parameters
 - Specify your own `--idcut`, `--bitcut`, and `--ecut` values
+- Auto-detects whether input is from DIAMOND or MMseqs2
 - Useful for method development or unusual sample types
 
+**Usage:**
+```bash
+oxymetag predict -m custom --idcut 50 --bitcut 30 --ecut 0.01
+```
+
 ## Output
 
 The final output (`per_aerobe_predictions.tsv`) contains:
 - `SampleID`: Sample identifier extracted from filenames
 - `ratio`: Aerobic/anaerobic domain ratio
-- `aerobe_pfams`: Number of aerobic Pfam domains detected
-- `anaerobe_pfams`: Number of anaerobic Pfam domains detected
+- `aerobe_pfams`: Number of aerobic Pfam domains detected (from 20 target Pfams)
+- `anaerobe_pfams`: Number of anaerobic Pfam domains detected (from 20 target Pfams)
 - `Per_aerobe`: **Predicted percentage of aerobic bacteria (0-100%)**
 
 ## Biological Interpretation
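The three filtering modes documented in the hunk above amount to a lookup from mode name to cutoffs. A minimal sketch follows, using the identity, bitscore, and e-value thresholds stated in the README. The names `MODE_CUTOFFS`, `resolve_cutoffs`, and `passes`, the hit field names, and the rule that custom mode requires all three values are assumptions for illustration rather than OxyMetaG's actual behavior.

```python
# Cutoffs as listed in the README's "Filtering Modes" section.
MODE_CUTOFFS = {
    "modern": {"idcut": 60.0, "bitcut": 50.0, "ecut": 0.001},
    "ancient": {"idcut": 86.0, "bitcut": 50.0, "ecut": 0.001},
}


def resolve_cutoffs(mode, idcut=None, bitcut=None, ecut=None):
    """Return the filter cutoffs for a mode (hypothetical helper)."""
    if mode in MODE_CUTOFFS:
        return dict(MODE_CUTOFFS[mode])
    if mode == "custom":
        # Assumption: this sketch requires all three values in custom mode.
        if idcut is None or bitcut is None or ecut is None:
            raise ValueError("custom mode needs idcut, bitcut, and ecut")
        return {"idcut": float(idcut), "bitcut": float(bitcut), "ecut": float(ecut)}
    raise ValueError(f"unknown mode: {mode!r}")


def passes(hit, cutoffs):
    """True if a hit meets identity >=, bitscore >=, and e-value <= cutoffs."""
    return (hit["pident"] >= cutoffs["idcut"]
            and hit["bitscore"] >= cutoffs["bitcut"]
            and hit["evalue"] <= cutoffs["ecut"])
```

For example, a hit with 70% identity, bitscore 55, and e-value 1e-4 passes the modern cutoffs but fails the stricter 86% identity requirement of ancient mode.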
@@ -212,17 +260,33 @@ The `Per_aerobe` value represents the predicted percentage of aerobic bacteria i
 If you use OxyMetaG in your research, please cite:
 
 ```
-Bueno de Mesquita, C.P., Stallard-Olivera, E., Fierer, N. (2025).
+Bueno de Mesquita, C.P., Stallard-Olivera, E., Fierer, N. (2025).
+Quantifying the oxygen preferences of bacterial communities using a
+metagenome-based approach.
+```
+
+### Additional citations
+
+If you use the **extract** function, also cite Kraken2 and KrakenTools:
+```
+Lu, J., Rincon, N., Wood, D.E. et al. Metagenome analysis using the Kraken
+software suite. Nat Protoc 17, 2815–2839 (2022).
+https://doi.org/10.1038/s41596-022-00738-y
 ```
-If you use the extract function, also cite Kraken2 and KrakenTools:
 
+If you use the **profile** function with DIAMOND (`-m diamond`), also cite:
 ```
-
+Buchfink, B., Xie, C. & Huson, D. Fast and sensitive protein alignment using
+DIAMOND. Nat Methods 12, 59–60 (2015).
+https://doi.org/10.1038/nmeth.3176
 ```
-If you use the profile function, also cite DIAMOND
 
+If you use the **profile** function with MMseqs2 (`-m mmseqs2`), also cite:
 ```
-
+Steinegger, M., Söding, J. MMseqs2 enables sensitive protein sequence
+searching for the analysis of massive data sets. Nat Biotechnol 35,
+1026–1028 (2017).
+https://doi.org/10.1038/nbt.3988
 ```
 
 ## License
@@ -2,9 +2,9 @@
 
 Oxygen metabolism profiling from metagenomic data using Pfam domains. OxyMetaG predicts the percent relative abundance of aerobic bacteria in metagenomic reads based on the ratio of abundances of a set of 20 Pfams. It is recommended to use a HPC cluster or server rather than laptop to run OxyMetaG due to memory requirements, particularly for the step of extracting bacterial reads. If you already have bacterial reads, the "profile" and "predict" functions will run quickly on a laptop.
 
-If you are working with modern metagenomes, we recommend first quality filtering the raw reads with your method of choice and standard practices, and then extracting bacterial reads with Kraken2 and KrakenTools, which is performed with the OxyMetaG extract function.
+If you are working with modern metagenomes, we recommend first quality filtering the raw reads with your method of choice and standard practices, and then extracting bacterial reads with Kraken2 and KrakenTools, which is performed with the OxyMetaG extract function. For profiling modern metagenomes, use DIAMOND blastx with the default `-m diamond` mode for the profile step. You can also use `-m custom` in the predict step to test different hit cutoffs.
 
-If you are working with ancient metagenomes, we recommend first quality filtering the raw reads with your method of choice and standard practices, and then extracting bacterial reads with a workflow optimized for ancient DNA, such as the
+If you are working with ancient metagenomes, we recommend first quality filtering the raw reads with your method of choice and standard practices, and then extracting bacterial reads with a workflow optimized for ancient DNA, such as the read mapping approach employed by De Sanctis et al. (2025). For profiling ancient metagenomes, use MMseqs2 with `-m mmseqs2` for the profile step and `-m ancient` for the predict step. The ancient mode uses parameters optimized for ancient DNA along with 97 decoy Pfams to reduce instances of false positives. We are still working on optimizing the methods for ancient DNA, which will be released as v2.0.0.
 
 ## Installation
 
@@ -23,16 +23,22 @@ conda env create -f environment.yml
 conda activate oxymetag
 
 # Install OxyMetaG
-pip install
+pip install -e .
+
+# Index the MMseqs2 database (one-time setup, ~5-10 minutes)
+mmseqs createindex oxymetag/data/oxymetag_pfams_n117_db tmp
 ```
 
+**Note:** The MMseqs2 database indexing is optional but highly recommended for faster searches.
+
 ### Using Pip
 
 First install external dependencies:
 - Kraken2
 - DIAMOND
+- MMseqs2
 - KrakenTools
-- R with mgcv and
+- R with mgcv, dplyr, tidyr, and rlang packages
 
 Then install OxyMetaG:
 ```bash
@@ -41,30 +47,37 @@ pip install oxymetag
 
 ## Quick Start
 
-###
+### Modern DNA workflow
+
 ```bash
+# 1. Setup Kraken2 database (one-time)
 oxymetag setup
-```
 
-
-
-oxymetag extract -i sample1_R1.fastq.gz sample1_R2.fastq.gz -o BactReads -t 48
-```
+# 2. Extract bacterial reads
+oxymetag extract -i sample1_R1.fastq.gz -o BactReads -t 48
 
-
-
-
+# 3. Profile samples with DIAMOND
+oxymetag profile -i BactReads -o diamond_output -m diamond -t 8
+
+# 4. Predict aerobe levels
+oxymetag predict -i diamond_output -o per_aerobe_predictions.tsv -m modern
 ```
 
-###
+### Ancient DNA workflow
+
 ```bash
-#
-
+# 1. Extract bacterial reads with an ancient DNA-optimized workflow (not currently provided by OxyMetaG)
+
+# 2. Profile samples with MMseqs2
+oxymetag profile -i BactReads -o mmseqs_output -m mmseqs2 -t 8
 
-#
-oxymetag predict -i
+# 3. Predict aerobe levels with ancient mode
+oxymetag predict -i mmseqs_output -o per_aerobe_predictions.tsv -m ancient
+```
+
+### Custom parameters
 
-
+```bash
 oxymetag predict -i diamond_output -o per_aerobe_predictions.tsv -m custom --idcut 50 --bitcut 30 --ecut 0.01
 ```
 
@@ -75,11 +88,11 @@ oxymetag predict -i diamond_output -o per_aerobe_predictions.tsv -m custom --idc
 
 **What it does:** Downloads and builds the standard Kraken2 database containing bacterial, archaeal, and viral genomes. This database is used by the `extract` command to identify bacterial sequences from metagenomic samples.
 
-**Time:**
+**Time:** Depends on internet speed and system performance, but will likely take several hours (4-20 hours typical).
 
-**Output:** Creates a `kraken2_db/` directory with the standard database.
+**Output:** Creates a `kraken2_db/` directory with the standard database (~90 GB).
 
-Make sure you run oxymetag setup from the directory where you want the database to live, or plan to always specify the --kraken-db path when running extract. The database is quite large
+Make sure you run oxymetag setup from the directory where you want the database to live, or plan to always specify the --kraken-db path when running extract. The database is quite large, so choose a location with sufficient storage.
 
 ---
 
@@ -91,7 +104,7 @@ Make sure you run oxymetag setup from the directory where you want the database
 2. Uses KrakenTools to extract only the reads classified as bacterial
 3. Outputs cleaned bacterial-only FASTQ files for downstream analysis
 
-**Input:** Quality filtered metagenomic read FASTQ files (paired-end or merged)
+**Input:** Quality filtered metagenomic read FASTQ files (paired-end or merged)
 **Output:** Bacterial-only FASTQ files in `BactReads/` directory
 
 **Arguments:**
@@ -103,22 +116,26 @@ Make sure you run oxymetag setup from the directory where you want the database
 ---
 
 ### oxymetag profile
-**Function:** Profiles bacterial reads against oxygen metabolism protein domains.
+**Function:** Profiles bacterial reads against oxygen metabolism protein domains using DIAMOND or MMseqs2.
 
 **What it does:**
 1. Takes bacterial-only reads from the `extract` step
-2. Uses DIAMOND blastx to search against a curated database of
+2. Uses DIAMOND blastx (for modern DNA) or MMseqs2 (for ancient DNA) to search against a curated database of Pfam domains related to oxygen metabolism
+- DIAMOND mode: 20 target Pfams
+- MMseqs2 mode: 20 target Pfams + 97 decoy Pfams (117 total) to reduce false positives
 3. Identifies protein-coding sequences and their functional annotations
 4. Creates detailed hit tables for each sample
 
-**Input:** Bacterial FASTQ files (uses R1 or merged reads only)
-**Output:**
+**Input:** Bacterial FASTQ files (uses R1 or merged reads only)
+**Output:** Alignment files (TSV format) in `diamond_output/` or `mmseqs_output/` directory
 
 **Arguments:**
 - `-i, --input`: Input directory with bacterial reads (default: BactReads)
-- `-o, --output`: Output directory (default: diamond_output)
+- `-o, --output`: Output directory (default: diamond_output or mmseqs_output depending on method)
 - `-t, --threads`: Number of threads (default: 4)
--
+- `-m, --method`: Profiling method - 'diamond' or 'mmseqs2' (default: diamond)
+- `--diamond-db`: Custom DIAMOND database path (optional, for diamond method)
+- `--mmseqs-db`: Custom MMseqs2 database path (optional, for mmseqs2 method)
 
 ---
 
@@ -126,17 +143,24 @@ Make sure you run oxymetag setup from the directory where you want the database
 **Function:** Predicts aerobe abundance from protein domain profiles using machine learning.
 
 **What it does:**
-1. Processes DIAMOND output files with appropriate quality filters
-2.
-3.
-4.
-5.
-
-
+1. Processes DIAMOND or MMseqs2 output files with appropriate quality filters
+2. Selects the top hit per read based on bitscore
+3. For MMseqs2 (ancient mode): filters out decoy Pfams after selecting top hits
+4. Normalizes protein domain counts by gene length (reads per kilobase)
+5. Calculates aerobic/anaerobic domain ratios for each sample
+6. Applies a trained GAM (Generalized Additive Model) to predict percentage of aerobes
+7. Outputs a table with the sampleID, # Pfams detected, and predicted % aerobic bacteria
+
+**Input:** DIAMOND or MMseqs2 output directory from `profile` step
 **Output:** Tab-separated file with aerobe predictions for each sample
 
+**Mode determines input type:**
+- `-m modern`: Uses DIAMOND output (default input: diamond_output/)
+- `-m ancient`: Uses MMseqs2 output (default input: mmseqs_output/)
+- `-m custom`: Auto-detects DIAMOND or MMseqs2 files in input directory
+
 **Arguments:**
-- `-i, --input`: Directory with
+- `-i, --input`: Directory with profiling output (default: diamond_output for modern, mmseqs_output for ancient)
 - `-o, --output`: Output file (default: per_aerobe_predictions.tsv)
 - `-t, --threads`: Number of threads (default: 4)
 - `-m, --mode`: Filtering mode - 'modern', 'ancient', or 'custom' (default: modern)
@@ -146,32 +170,56 @@ Make sure you run oxymetag setup from the directory where you want the database
 
 ## Filtering Modes
 
-OxyMetaG includes three pre-configured filtering modes optimized for different types of DNA
+OxyMetaG includes three pre-configured filtering modes optimized for different types of DNA. In any case, it is always recommended to try several different parameters (using -m custom) to check how sensitive the results are to the cutoffs.
 
 ### Modern DNA (default)
-**Best for:** Modern environmental metagenomes
+**Best for:** Modern environmental metagenomes
+**Method:** DIAMOND blastx
+**Filters:**
 - Identity ≥ 60%
 - Bitscore ≥ 50
 - E-value ≤ 0.001
 
-
-
-
-
-
+**Usage:**
+```bash
+oxymetag profile -m diamond
+oxymetag predict -m modern
+```
+
+### Ancient DNA
+**Best for:** Archaeological samples, paleogenomic data, degraded environmental DNA
+**Method:** MMseqs2 with decoy Pfams
+**Filters:**
+- Identity ≥ 86%
+- Bitscore ≥ 50
+- E-value ≤ 0.001
+
+**Note:** The ancient mode uses stricter identity cutoffs but employs 97 decoy Pfams to reduce false positives from damaged DNA. Reads matching decoys better than target Pfams are filtered out.
+
+**Usage:**
+```bash
+oxymetag profile -m mmseqs2
+oxymetag predict -m ancient
+```
 
 ### Custom
 **Best for:** Specialized applications or when you want to optimize parameters
 - Specify your own `--idcut`, `--bitcut`, and `--ecut` values
+- Auto-detects whether input is from DIAMOND or MMseqs2
 - Useful for method development or unusual sample types
 
+**Usage:**
+```bash
+oxymetag predict -m custom --idcut 50 --bitcut 30 --ecut 0.01
+```
+
 ## Output
 
 The final output (`per_aerobe_predictions.tsv`) contains:
 - `SampleID`: Sample identifier extracted from filenames
 - `ratio`: Aerobic/anaerobic domain ratio
-- `aerobe_pfams`: Number of aerobic Pfam domains detected
-- `anaerobe_pfams`: Number of anaerobic Pfam domains detected
+- `aerobe_pfams`: Number of aerobic Pfam domains detected (from 20 target Pfams)
+- `anaerobe_pfams`: Number of anaerobic Pfam domains detected (from 20 target Pfams)
 - `Per_aerobe`: **Predicted percentage of aerobic bacteria (0-100%)**
 
 ## Biological Interpretation
@@ -189,17 +237,33 @@ The `Per_aerobe` value represents the predicted percentage of aerobic bacteria i
 If you use OxyMetaG in your research, please cite:
 
 ```
-Bueno de Mesquita, C.P., Stallard-Olivera, E., Fierer, N. (2025).
+Bueno de Mesquita, C.P., Stallard-Olivera, E., Fierer, N. (2025).
+Quantifying the oxygen preferences of bacterial communities using a
+metagenome-based approach.
+```
+
+### Additional citations
+
+If you use the **extract** function, also cite Kraken2 and KrakenTools:
+```
+Lu, J., Rincon, N., Wood, D.E. et al. Metagenome analysis using the Kraken
+software suite. Nat Protoc 17, 2815–2839 (2022).
+https://doi.org/10.1038/s41596-022-00738-y
 ```
-If you use the extract function, also cite Kraken2 and KrakenTools:
 
+If you use the **profile** function with DIAMOND (`-m diamond`), also cite:
 ```
-
+Buchfink, B., Xie, C. & Huson, D. Fast and sensitive protein alignment using
+DIAMOND. Nat Methods 12, 59–60 (2015).
+https://doi.org/10.1038/nmeth.3176
 ```
-If you use the profile function, also cite DIAMOND
 
+If you use the **profile** function with MMseqs2 (`-m mmseqs2`), also cite:
 ```
-
+Steinegger, M., Söding, J. MMseqs2 enables sensitive protein sequence
+searching for the analysis of massive data sets. Nat Biotechnol 35,
+1026–1028 (2017).
+https://doi.org/10.1038/nbt.3988
 ```
 
 ## License