oxymetag 1.0.0__tar.gz → 1.1.0__tar.gz

Files changed (38)
  1. {oxymetag-1.0.0/oxymetag.egg-info → oxymetag-1.1.0}/PKG-INFO +117 -52
  2. {oxymetag-1.0.0 → oxymetag-1.1.0}/README.md +116 -51
  3. {oxymetag-1.0.0 → oxymetag-1.1.0}/environment.yml +1 -0
  4. {oxymetag-1.0.0 → oxymetag-1.1.0}/oxymetag/__init__.py +1 -1
  5. {oxymetag-1.0.0 → oxymetag-1.1.0}/oxymetag/cli.py +15 -13
  6. {oxymetag-1.0.0 → oxymetag-1.1.0}/oxymetag/core.py +114 -15
  7. oxymetag-1.1.0/oxymetag/data/VTML20.out +33 -0
  8. oxymetag-1.1.0/oxymetag/data/nucleotide.out +9 -0
  9. oxymetag-1.1.0/oxymetag/data/oxygen_model.rds +0 -0
  10. oxymetag-1.1.0/oxymetag/data/oxymetag_pfams_n117.dmnd +0 -0
  11. oxymetag-1.1.0/oxymetag/data/oxymetag_pfams_n117_db +0 -0
  12. oxymetag-1.1.0/oxymetag/data/oxymetag_pfams_n117_db.dbtype +0 -0
  13. oxymetag-1.1.0/oxymetag/data/oxymetag_pfams_n117_db.index +23972 -0
  14. oxymetag-1.1.0/oxymetag/data/oxymetag_pfams_n117_db.lookup +23972 -0
  15. oxymetag-1.1.0/oxymetag/data/oxymetag_pfams_n117_db.source +1 -0
  16. oxymetag-1.1.0/oxymetag/data/oxymetag_pfams_n117_db_h +0 -0
  17. oxymetag-1.1.0/oxymetag/data/oxymetag_pfams_n117_db_h.dbtype +0 -0
  18. oxymetag-1.1.0/oxymetag/data/oxymetag_pfams_n117_db_h.index +23972 -0
  19. {oxymetag-1.0.0 → oxymetag-1.1.0}/oxymetag/scripts/predict_oxygen.R +86 -38
  20. {oxymetag-1.0.0 → oxymetag-1.1.0}/oxymetag/utils.py +32 -14
  21. {oxymetag-1.0.0 → oxymetag-1.1.0/oxymetag.egg-info}/PKG-INFO +117 -52
  22. {oxymetag-1.0.0 → oxymetag-1.1.0}/oxymetag.egg-info/SOURCES.txt +11 -0
  23. {oxymetag-1.0.0 → oxymetag-1.1.0}/setup.py +1 -1
  24. oxymetag-1.0.0/oxymetag/data/oxygen_model.rds +0 -0
  25. {oxymetag-1.0.0 → oxymetag-1.1.0}/LICENSE +0 -0
  26. {oxymetag-1.0.0 → oxymetag-1.1.0}/MANIFEST.in +0 -0
  27. {oxymetag-1.0.0 → oxymetag-1.1.0}/oxymetag/data/.DS_Store +0 -0
  28. {oxymetag-1.0.0 → oxymetag-1.1.0}/oxymetag/data/Oxygen_pfams.csv +0 -0
  29. {oxymetag-1.0.0 → oxymetag-1.1.0}/oxymetag/data/oxymetag_pfams.dmnd +0 -0
  30. {oxymetag-1.0.0 → oxymetag-1.1.0}/oxymetag/data/pfam_headers_table.txt +0 -0
  31. {oxymetag-1.0.0 → oxymetag-1.1.0}/oxymetag/data/pfam_lengths.tsv +0 -0
  32. {oxymetag-1.0.0 → oxymetag-1.1.0}/oxymetag.egg-info/dependency_links.txt +0 -0
  33. {oxymetag-1.0.0 → oxymetag-1.1.0}/oxymetag.egg-info/entry_points.txt +0 -0
  34. {oxymetag-1.0.0 → oxymetag-1.1.0}/oxymetag.egg-info/requires.txt +0 -0
  35. {oxymetag-1.0.0 → oxymetag-1.1.0}/oxymetag.egg-info/top_level.txt +0 -0
  36. {oxymetag-1.0.0 → oxymetag-1.1.0}/requirements.txt +0 -0
  37. {oxymetag-1.0.0 → oxymetag-1.1.0}/setup.cfg +0 -0
  38. {oxymetag-1.0.0 → oxymetag-1.1.0}/tests/__init__.py +0 -0
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.1
2
2
  Name: oxymetag
3
- Version: 1.0.0
3
+ Version: 1.1.0
4
4
  Summary: Oxygen metabolism profiling from metagenomic data
5
5
  Home-page: https://github.com/cliffbueno/oxymetag
6
6
  Author: Clifton P. Bueno de Mesquita
@@ -25,9 +25,9 @@ Requires-Dist: numpy>=1.20.0
25
25
 
26
26
  Oxygen metabolism profiling from metagenomic data using Pfam domains. OxyMetaG predicts the percent relative abundance of aerobic bacteria in metagenomic reads based on the ratio of abundances of a set of 20 Pfams. It is recommended to use a HPC cluster or server rather than laptop to run OxyMetaG due to memory requirements, particularly for the step of extracting bacterial reads. If you already have bacterial reads, the "profile" and "predict" functions will run quickly on a laptop.
27
27
 
28
- If you are working with modern metagenomes, we recommend first quality filtering the raw reads with your method of choice and standard practices, and then extracting bacterial reads with Kraken2 and KrakenTools, which is performed with the OxyMetaG extract function.
28
+ If you are working with modern metagenomes, we recommend first quality filtering the raw reads with your method of choice and standard practices, and then extracting bacterial reads with Kraken2 and KrakenTools, which is performed with the OxyMetaG extract function. For profiling modern metagenomes, use DIAMOND blastx with the default `-m modern` mode for the predict step.
29
29
 
30
- If you are working with ancient metagenomes, we recommend first quality filtering the raw reads with your method of choice and standard practices, and then extracting bacterial reads with a workflow optimized for ancient DNA, such as the one employed by De Sanctis et al. (2025).
30
+ If you are working with ancient metagenomes, we recommend first quality filtering the raw reads with your method of choice and standard practices, and then extracting bacterial reads with a workflow optimized for ancient DNA, such as the read mapping approach employed by De Sanctis et al. (2025). For profiling ancient metagenomes, use MMseqs2 with `-m mmseqs2` for the profile step and `-m ancient` for the predict step. The ancient mode uses parameters optimized for ancient DNA along with 97 decoy Pfams to reduce instances of false positives.
31
31
 
32
32
  ## Installation
33
33
 
@@ -46,16 +46,22 @@ conda env create -f environment.yml
46
46
  conda activate oxymetag
47
47
 
48
48
  # Install OxyMetaG
49
- pip install oxymetag
49
+ pip install -e .
50
+
51
+ # Index the MMseqs2 database (one-time setup, ~5-10 minutes)
52
+ mmseqs createindex oxymetag/data/oxymetag_pfams_n117_db tmp
50
53
  ```
51
54
 
55
+ **Note:** The MMseqs2 database indexing is optional but highly recommended for faster searches.
56
+
52
57
  ### Using Pip
53
58
 
54
59
  First install external dependencies:
55
60
  - Kraken2
56
61
  - DIAMOND
62
+ - MMseqs2
57
63
  - KrakenTools
58
- - R with mgcv and dplyr packages
64
+ - R with mgcv, dplyr, tidyr, and rlang packages
59
65
 
60
66
  Then install OxyMetaG:
61
67
  ```bash
@@ -64,30 +70,38 @@ pip install oxymetag
64
70
 
65
71
  ## Quick Start
66
72
 
67
- ### 1. Setup the standard Kraken2 database
73
+ ### Modern DNA workflow
74
+
68
75
  ```bash
76
+ # 1. Setup Kraken2 database (one-time)
69
77
  oxymetag setup
70
- ```
71
78
 
72
- ### 2. Extract bacterial reads
73
- ```bash
74
- oxymetag extract -i sample1_R1.fastq.gz sample1_R2.fastq.gz -o BactReads -t 48
75
- ```
79
+ # 2. Extract bacterial reads
80
+ oxymetag extract -i sample1_R1.fastq.gz -o BactReads -t 48
76
81
 
77
- ### 3. Profile samples
78
- ```bash
79
- oxymetag profile -i BactReads -o diamond_output -t 8
82
+ # 3. Profile samples with DIAMOND
83
+ oxymetag profile -i BactReads -o diamond_output -m diamond -t 8
84
+
85
+ # 4. Predict aerobe levels
86
+ oxymetag predict -i diamond_output -o per_aerobe_predictions.tsv -m modern
80
87
  ```
81
88
 
82
- ### 4. Predict aerobe levels
89
+ ### Ancient DNA workflow
90
+
83
91
  ```bash
84
- # For modern DNA
85
- oxymetag predict -i diamond_output -o per_aerobe_predictions.tsv -m modern
92
+ # 1. Extract bacterial reads (use ancient DNA-optimized workflow if available)
93
+ # If using oxymetag extract, same as modern workflow
94
+
95
+ # 2. Profile samples with MMseqs2
96
+ oxymetag profile -i BactReads -o mmseqs_output -m mmseqs2 -t 8
86
97
 
87
- # For ancient DNA
88
- oxymetag predict -i diamond_output -o per_aerobe_predictions.tsv -m ancient
98
+ # 3. Predict aerobe levels with ancient mode
99
+ oxymetag predict -i mmseqs_output -o per_aerobe_predictions.tsv -m ancient
100
+ ```
101
+
102
+ ### Custom parameters
89
103
 
90
- # Custom cutoffs
104
+ ```bash
91
105
  oxymetag predict -i diamond_output -o per_aerobe_predictions.tsv -m custom --idcut 50 --bitcut 30 --ecut 0.01
92
106
  ```
93
107
 
@@ -98,11 +112,11 @@ oxymetag predict -i diamond_output -o per_aerobe_predictions.tsv -m custom --idc
98
112
 
99
113
  **What it does:** Downloads and builds the standard Kraken2 database containing bacterial, archaeal, and viral genomes. This database is used by the `extract` command to identify bacterial sequences from metagenomic samples.
100
114
 
101
- **Time:** 2-4 hours depending on internet speed and system performance.
115
+ **Time:** Depends on internet speed and system performance, but will likely take several hours (4-20 hours typical).
102
116
 
103
- **Output:** Creates a `kraken2_db/` directory with the standard database.
117
+ **Output:** Creates a `kraken2_db/` directory with the standard database (~90 GB).
104
118
 
105
- Make sure you run oxymetag setup from the directory where you want the database to live, or plan to always specify the --kraken-db path when running extract. The database is quite large (~50-100 GB), so choose a location with sufficient storage.
119
+ Make sure you run oxymetag setup from the directory where you want the database to live, or plan to always specify the --kraken-db path when running extract. The database is quite large, so choose a location with sufficient storage.
106
120
 
107
121
  ---
108
122
 
@@ -114,7 +128,7 @@ Make sure you run oxymetag setup from the directory where you want the database
114
128
  2. Uses KrakenTools to extract only the reads classified as bacterial
115
129
  3. Outputs cleaned bacterial-only FASTQ files for downstream analysis
116
130
 
117
- **Input:** Quality filtered metagenomic read FASTQ files (paired-end or merged)\
131
+ **Input:** Quality filtered metagenomic read FASTQ files (paired-end or merged)
118
132
  **Output:** Bacterial-only FASTQ files in `BactReads/` directory
119
133
 
120
134
  **Arguments:**
@@ -126,22 +140,26 @@ Make sure you run oxymetag setup from the directory where you want the database
126
140
  ---
127
141
 
128
142
  ### oxymetag profile
129
- **Function:** Profiles bacterial reads against oxygen metabolism protein domains.
143
+ **Function:** Profiles bacterial reads against oxygen metabolism protein domains using DIAMOND or MMseqs2.
130
144
 
131
145
  **What it does:**
132
146
  1. Takes bacterial-only reads from the `extract` step
133
- 2. Uses DIAMOND blastx to search against a curated database of 20 Pfam domains related to oxygen metabolism
147
+ 2. Uses DIAMOND blastx (for modern DNA) or MMseqs2 (for ancient DNA) to search against a curated database of Pfam domains related to oxygen metabolism
148
+ - DIAMOND mode: 20 target Pfams
149
+ - MMseqs2 mode: 20 target Pfams + 97 decoy Pfams (117 total) to reduce false positives
134
150
  3. Identifies protein-coding sequences and their functional annotations
135
151
  4. Creates detailed hit tables for each sample
136
152
 
137
- **Input:** Bacterial FASTQ files (uses R1 or merged reads only)\
138
- **Output:** DIAMOND alignment files (TSV format) in `diamond_output/` directory
153
+ **Input:** Bacterial FASTQ files (uses R1 or merged reads only)
154
+ **Output:** Alignment files (TSV format) in `diamond_output/` or `mmseqs_output/` directory
139
155
 
140
156
  **Arguments:**
141
157
  - `-i, --input`: Input directory with bacterial reads (default: BactReads)
142
- - `-o, --output`: Output directory (default: diamond_output)
158
+ - `-o, --output`: Output directory (default: diamond_output or mmseqs_output depending on method)
143
159
  - `-t, --threads`: Number of threads (default: 4)
144
- - `--diamond-db`: Custom DIAMOND database path (optional)
160
+ - `-m, --method`: Profiling method - 'diamond' or 'mmseqs2' (default: diamond)
161
+ - `--diamond-db`: Custom DIAMOND database path (optional, for diamond method)
162
+ - `--mmseqs-db`: Custom MMseqs2 database path (optional, for mmseqs2 method)
145
163
 
146
164
  ---
147
165
 
@@ -149,17 +167,24 @@ Make sure you run oxymetag setup from the directory where you want the database
149
167
  **Function:** Predicts aerobe abundance from protein domain profiles using machine learning.
150
168
 
151
169
  **What it does:**
152
- 1. Processes DIAMOND output files with appropriate quality filters
153
- 2. Normalizes protein domain counts by gene length (reads per kilobase)
154
- 3. Calculates aerobic/anaerobic domain ratios for each sample
155
- 4. Applies a trained GAM (Generalized Additive Model) to predict percentage of aerobes
156
- 5. Outputs a table with the sampleID, # Pfams detected, and predicted % aerobic bacteria
157
-
158
- **Input:** DIAMOND output directory from `profile` step\
170
+ 1. Processes DIAMOND or MMseqs2 output files with appropriate quality filters
171
+ 2. Selects the top hit per read based on bitscore
172
+ 3. For MMseqs2 (ancient mode): filters out decoy Pfams after selecting top hits
173
+ 4. Normalizes protein domain counts by gene length (reads per kilobase)
174
+ 5. Calculates aerobic/anaerobic domain ratios for each sample
175
+ 6. Applies a trained GAM (Generalized Additive Model) to predict percentage of aerobes
176
+ 7. Outputs a table with the sampleID, # Pfams detected, and predicted % aerobic bacteria
177
+
178
+ **Input:** DIAMOND or MMseqs2 output directory from `profile` step
159
179
  **Output:** Tab-separated file with aerobe predictions for each sample
160
180
 
181
+ **Mode determines input type:**
182
+ - `-m modern`: Uses DIAMOND output (default input: diamond_output/)
183
+ - `-m ancient`: Uses MMseqs2 output (default input: mmseqs_output/)
184
+ - `-m custom`: Auto-detects DIAMOND or MMseqs2 files in input directory
185
+
161
186
  **Arguments:**
162
- - `-i, --input`: Directory with DIAMOND output (default: diamond_output)
187
+ - `-i, --input`: Directory with profiling output (default: diamond_output for modern, mmseqs_output for ancient)
163
188
  - `-o, --output`: Output file (default: per_aerobe_predictions.tsv)
164
189
  - `-t, --threads`: Number of threads (default: 4)
165
190
  - `-m, --mode`: Filtering mode - 'modern', 'ancient', or 'custom' (default: modern)
@@ -169,32 +194,56 @@ Make sure you run oxymetag setup from the directory where you want the database
169
194
 
170
195
  ## Filtering Modes
171
196
 
172
- OxyMetaG includes three pre-configured filtering modes optimized for different types of DNA:
197
+ OxyMetaG includes three pre-configured filtering modes optimized for different types of DNA. In any case, it is always recommended to try several different parameters (using -m custom) to check how sensitive the results are to the cutoffs.
173
198
 
174
199
  ### Modern DNA (default)
175
- **Best for:** Modern environmental metagenomes
200
+ **Best for:** Modern environmental metagenomes
201
+ **Method:** DIAMOND blastx
202
+ **Filters:**
176
203
  - Identity ≥ 60%
177
204
  - Bitscore ≥ 50
178
205
  - E-value ≤ 0.001
179
206
 
180
- ### Ancient DNA
181
- **Best for:** Archaeological samples, paleogenomic data, degraded environmental DNA
182
- - Identity 45% (accounts for DNA damage)
183
- - Bitscore 25 (accommodates shorter fragments)
184
- - E-value ≤ 0.1 (more permissive for low-quality data)
207
+ **Usage:**
208
+ ```bash
209
+ oxymetag profile -m diamond
210
+ oxymetag predict -m modern
211
+ ```
212
+
213
+ ### Ancient DNA
214
+ **Best for:** Archaeological samples, paleogenomic data, degraded environmental DNA
215
+ **Method:** MMseqs2 with decoy Pfams
216
+ **Filters:**
217
+ - Identity ≥ 86%
218
+ - Bitscore ≥ 50
219
+ - E-value ≤ 0.001
220
+
221
+ **Note:** The ancient mode uses stricter identity cutoffs but employs 97 decoy Pfams to reduce false positives from damaged DNA. Reads matching decoys better than target Pfams are filtered out.
222
+
223
+ **Usage:**
224
+ ```bash
225
+ oxymetag profile -m mmseqs2
226
+ oxymetag predict -m ancient
227
+ ```
185
228
 
186
229
  ### Custom
187
230
  **Best for:** Specialized applications or when you want to optimize parameters
188
231
  - Specify your own `--idcut`, `--bitcut`, and `--ecut` values
232
+ - Auto-detects whether input is from DIAMOND or MMseqs2
189
233
  - Useful for method development or unusual sample types
190
234
 
235
+ **Usage:**
236
+ ```bash
237
+ oxymetag predict -m custom --idcut 50 --bitcut 30 --ecut 0.01
238
+ ```
239
+
191
240
  ## Output
192
241
 
193
242
  The final output (`per_aerobe_predictions.tsv`) contains:
194
243
  - `SampleID`: Sample identifier extracted from filenames
195
244
  - `ratio`: Aerobic/anaerobic domain ratio
196
- - `aerobe_pfams`: Number of aerobic Pfam domains detected
197
- - `anaerobe_pfams`: Number of anaerobic Pfam domains detected
245
+ - `aerobe_pfams`: Number of aerobic Pfam domains detected (from 20 target Pfams)
246
+ - `anaerobe_pfams`: Number of anaerobic Pfam domains detected (from 20 target Pfams)
198
247
  - `Per_aerobe`: **Predicted percentage of aerobic bacteria (0-100%)**
199
248
 
200
249
  ## Biological Interpretation
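Editor's sketch (not part of the package diff): the cutoffs documented for the three modes in this hunk can be summarized as a small lookup. The values for `modern` and `ancient` are taken from the README text above; the names `PRESETS` and `resolve_cutoffs` are hypothetical, and requiring all three values in custom mode is an assumption, since the diff does not show how the CLI handles missing values.

```python
# Preset cutoffs as documented in the 1.1.0 README (identity %, bitscore, e-value).
PRESETS = {
    "modern": {"idcut": 60.0, "bitcut": 50.0, "ecut": 0.001},
    "ancient": {"idcut": 86.0, "bitcut": 50.0, "ecut": 0.001},
}


def resolve_cutoffs(mode, idcut=None, bitcut=None, ecut=None):
    """Return the cutoffs for a filtering mode.

    For 'custom', all three values must be supplied; that requirement is an
    assumption of this sketch, not confirmed behaviour of the package.
    """
    if mode in PRESETS:
        return dict(PRESETS[mode])
    if mode == "custom":
        if idcut is None or bitcut is None or ecut is None:
            raise ValueError("custom mode needs idcut, bitcut, and ecut")
        return {"idcut": float(idcut), "bitcut": float(bitcut), "ecut": float(ecut)}
    raise ValueError(f"unknown mode: {mode!r}")


# Mirrors: oxymetag predict -m custom --idcut 50 --bitcut 30 --ecut 0.01
custom = resolve_cutoffs("custom", idcut=50, bitcut=30, ecut=0.01)
```

Compared with 1.0.0, the ancient preset moves from permissive cutoffs (identity 45%, bitscore 25, e-value 0.1) to an identity cutoff of 86%, with the decoy Pfams taking over the job of controlling false positives.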
@@ -212,17 +261,33 @@ The `Per_aerobe` value represents the predicted percentage of aerobic bacteria i
212
261
  If you use OxyMetaG in your research, please cite:
213
262
 
214
263
  ```
215
- Bueno de Mesquita, C.P., Stallard-Olivera, E., Fierer, N. (2025). Bueno de Mesquita, C.P. et al. (2025). Predicting the proportion of aerobic and anaerobic bacteria from metagenomic reads with OxyMetaG.
264
+ Bueno de Mesquita, C.P., Stallard-Olivera, E., Fierer, N. (2025).
265
+ Quantifying the oxygen preferences of bacterial communities using a
266
+ metagenome-based approach.
267
+ ```
268
+
269
+ ### Additional citations
270
+
271
+ If you use the **extract** function, also cite Kraken2 and KrakenTools:
272
+ ```
273
+ Lu, J., Rincon, N., Wood, D.E. et al. Metagenome analysis using the Kraken
274
+ software suite. Nat Protoc 17, 2815–2839 (2022).
275
+ https://doi.org/10.1038/s41596-022-00738-y
216
276
  ```
217
- If you use the extract function, also cite Kraken2 and KrakenTools:
218
277
 
278
+ If you use the **profile** function with DIAMOND (`-m diamond`), also cite:
219
279
  ```
220
- Lu, J., Rincon, N., Wood, D.E. et al. Metagenome analysis using the Kraken software suite. Nat Protoc 17, 2815–2839 (2022). https://doi.org/10.1038/s41596-022-00738-y
280
+ Buchfink, B., Xie, C. & Huson, D. Fast and sensitive protein alignment using
281
+ DIAMOND. Nat Methods 12, 59–60 (2015).
282
+ https://doi.org/10.1038/nmeth.3176
221
283
  ```
222
- If you use the profile function, also cite DIAMOND
223
284
 
285
+ If you use the **profile** function with MMseqs2 (`-m mmseqs2`), also cite:
224
286
  ```
225
- Buchfink, B., Xie, C. & Huson, D. Fast and sensitive protein alignment using DIAMOND. Nat Methods 12, 59–60 (2015). https://doi.org/10.1038/nmeth.3176
287
+ Steinegger, M., Söding, J. MMseqs2 enables sensitive protein sequence
288
+ searching for the analysis of massive data sets. Nat Biotechnol 35,
289
+ 1026–1028 (2017).
290
+ https://doi.org/10.1038/nbt.3988
226
291
  ```
227
292
 
228
293
  ## License
@@ -2,9 +2,9 @@
2
2
 
3
3
  Oxygen metabolism profiling from metagenomic data using Pfam domains. OxyMetaG predicts the percent relative abundance of aerobic bacteria in metagenomic reads based on the ratio of abundances of a set of 20 Pfams. It is recommended to use a HPC cluster or server rather than laptop to run OxyMetaG due to memory requirements, particularly for the step of extracting bacterial reads. If you already have bacterial reads, the "profile" and "predict" functions will run quickly on a laptop.
4
4
 
5
- If you are working with modern metagenomes, we recommend first quality filtering the raw reads with your method of choice and standard practices, and then extracting bacterial reads with Kraken2 and KrakenTools, which is performed with the OxyMetaG extract function.
5
+ If you are working with modern metagenomes, we recommend first quality filtering the raw reads with your method of choice and standard practices, and then extracting bacterial reads with Kraken2 and KrakenTools, which is performed with the OxyMetaG extract function. For profiling modern metagenomes, use DIAMOND blastx with the default `-m modern` mode for the predict step.
6
6
 
7
- If you are working with ancient metagenomes, we recommend first quality filtering the raw reads with your method of choice and standard practices, and then extracting bacterial reads with a workflow optimized for ancient DNA, such as the one employed by De Sanctis et al. (2025).
7
+ If you are working with ancient metagenomes, we recommend first quality filtering the raw reads with your method of choice and standard practices, and then extracting bacterial reads with a workflow optimized for ancient DNA, such as the read mapping approach employed by De Sanctis et al. (2025). For profiling ancient metagenomes, use MMseqs2 with `-m mmseqs2` for the profile step and `-m ancient` for the predict step. The ancient mode uses parameters optimized for ancient DNA along with 97 decoy Pfams to reduce instances of false positives.
8
8
 
9
9
  ## Installation
10
10
 
@@ -23,16 +23,22 @@ conda env create -f environment.yml
23
23
  conda activate oxymetag
24
24
 
25
25
  # Install OxyMetaG
26
- pip install oxymetag
26
+ pip install -e .
27
+
28
+ # Index the MMseqs2 database (one-time setup, ~5-10 minutes)
29
+ mmseqs createindex oxymetag/data/oxymetag_pfams_n117_db tmp
27
30
  ```
28
31
 
32
+ **Note:** The MMseqs2 database indexing is optional but highly recommended for faster searches.
33
+
29
34
  ### Using Pip
30
35
 
31
36
  First install external dependencies:
32
37
  - Kraken2
33
38
  - DIAMOND
39
+ - MMseqs2
34
40
  - KrakenTools
35
- - R with mgcv and dplyr packages
41
+ - R with mgcv, dplyr, tidyr, and rlang packages
36
42
 
37
43
  Then install OxyMetaG:
38
44
  ```bash
@@ -41,30 +47,38 @@ pip install oxymetag
41
47
 
42
48
  ## Quick Start
43
49
 
44
- ### 1. Setup the standard Kraken2 database
50
+ ### Modern DNA workflow
51
+
45
52
  ```bash
53
+ # 1. Setup Kraken2 database (one-time)
46
54
  oxymetag setup
47
- ```
48
55
 
49
- ### 2. Extract bacterial reads
50
- ```bash
51
- oxymetag extract -i sample1_R1.fastq.gz sample1_R2.fastq.gz -o BactReads -t 48
52
- ```
56
+ # 2. Extract bacterial reads
57
+ oxymetag extract -i sample1_R1.fastq.gz -o BactReads -t 48
53
58
 
54
- ### 3. Profile samples
55
- ```bash
56
- oxymetag profile -i BactReads -o diamond_output -t 8
59
+ # 3. Profile samples with DIAMOND
60
+ oxymetag profile -i BactReads -o diamond_output -m diamond -t 8
61
+
62
+ # 4. Predict aerobe levels
63
+ oxymetag predict -i diamond_output -o per_aerobe_predictions.tsv -m modern
57
64
  ```
58
65
 
59
- ### 4. Predict aerobe levels
66
+ ### Ancient DNA workflow
67
+
60
68
  ```bash
61
- # For modern DNA
62
- oxymetag predict -i diamond_output -o per_aerobe_predictions.tsv -m modern
69
+ # 1. Extract bacterial reads (use ancient DNA-optimized workflow if available)
70
+ # If using oxymetag extract, same as modern workflow
71
+
72
+ # 2. Profile samples with MMseqs2
73
+ oxymetag profile -i BactReads -o mmseqs_output -m mmseqs2 -t 8
63
74
 
64
- # For ancient DNA
65
- oxymetag predict -i diamond_output -o per_aerobe_predictions.tsv -m ancient
75
+ # 3. Predict aerobe levels with ancient mode
76
+ oxymetag predict -i mmseqs_output -o per_aerobe_predictions.tsv -m ancient
77
+ ```
78
+
79
+ ### Custom parameters
66
80
 
67
- # Custom cutoffs
81
+ ```bash
68
82
  oxymetag predict -i diamond_output -o per_aerobe_predictions.tsv -m custom --idcut 50 --bitcut 30 --ecut 0.01
69
83
  ```
70
84
 
@@ -75,11 +89,11 @@ oxymetag predict -i diamond_output -o per_aerobe_predictions.tsv -m custom --idc
75
89
 
76
90
  **What it does:** Downloads and builds the standard Kraken2 database containing bacterial, archaeal, and viral genomes. This database is used by the `extract` command to identify bacterial sequences from metagenomic samples.
77
91
 
78
- **Time:** 2-4 hours depending on internet speed and system performance.
92
+ **Time:** Depends on internet speed and system performance, but will likely take several hours (4-20 hours typical).
79
93
 
80
- **Output:** Creates a `kraken2_db/` directory with the standard database.
94
+ **Output:** Creates a `kraken2_db/` directory with the standard database (~90 GB).
81
95
 
82
- Make sure you run oxymetag setup from the directory where you want the database to live, or plan to always specify the --kraken-db path when running extract. The database is quite large (~50-100 GB), so choose a location with sufficient storage.
96
+ Make sure you run oxymetag setup from the directory where you want the database to live, or plan to always specify the --kraken-db path when running extract. The database is quite large, so choose a location with sufficient storage.
83
97
 
84
98
  ---
85
99
 
@@ -91,7 +105,7 @@ Make sure you run oxymetag setup from the directory where you want the database
91
105
  2. Uses KrakenTools to extract only the reads classified as bacterial
92
106
  3. Outputs cleaned bacterial-only FASTQ files for downstream analysis
93
107
 
94
- **Input:** Quality filtered metagenomic read FASTQ files (paired-end or merged)\
108
+ **Input:** Quality filtered metagenomic read FASTQ files (paired-end or merged)
95
109
  **Output:** Bacterial-only FASTQ files in `BactReads/` directory
96
110
 
97
111
  **Arguments:**
@@ -103,22 +117,26 @@ Make sure you run oxymetag setup from the directory where you want the database
103
117
  ---
104
118
 
105
119
  ### oxymetag profile
106
- **Function:** Profiles bacterial reads against oxygen metabolism protein domains.
120
+ **Function:** Profiles bacterial reads against oxygen metabolism protein domains using DIAMOND or MMseqs2.
107
121
 
108
122
  **What it does:**
109
123
  1. Takes bacterial-only reads from the `extract` step
110
- 2. Uses DIAMOND blastx to search against a curated database of 20 Pfam domains related to oxygen metabolism
124
+ 2. Uses DIAMOND blastx (for modern DNA) or MMseqs2 (for ancient DNA) to search against a curated database of Pfam domains related to oxygen metabolism
125
+ - DIAMOND mode: 20 target Pfams
126
+ - MMseqs2 mode: 20 target Pfams + 97 decoy Pfams (117 total) to reduce false positives
111
127
  3. Identifies protein-coding sequences and their functional annotations
112
128
  4. Creates detailed hit tables for each sample
113
129
 
114
- **Input:** Bacterial FASTQ files (uses R1 or merged reads only)\
115
- **Output:** DIAMOND alignment files (TSV format) in `diamond_output/` directory
130
+ **Input:** Bacterial FASTQ files (uses R1 or merged reads only)
131
+ **Output:** Alignment files (TSV format) in `diamond_output/` or `mmseqs_output/` directory
116
132
 
117
133
  **Arguments:**
118
134
  - `-i, --input`: Input directory with bacterial reads (default: BactReads)
119
- - `-o, --output`: Output directory (default: diamond_output)
135
+ - `-o, --output`: Output directory (default: diamond_output or mmseqs_output depending on method)
120
136
  - `-t, --threads`: Number of threads (default: 4)
121
- - `--diamond-db`: Custom DIAMOND database path (optional)
137
+ - `-m, --method`: Profiling method - 'diamond' or 'mmseqs2' (default: diamond)
138
+ - `--diamond-db`: Custom DIAMOND database path (optional, for diamond method)
139
+ - `--mmseqs-db`: Custom MMseqs2 database path (optional, for mmseqs2 method)
122
140
 
123
141
  ---
124
142
 
@@ -126,17 +144,24 @@ Make sure you run oxymetag setup from the directory where you want the database
126
144
  **Function:** Predicts aerobe abundance from protein domain profiles using machine learning.
127
145
 
128
146
  **What it does:**
129
- 1. Processes DIAMOND output files with appropriate quality filters
130
- 2. Normalizes protein domain counts by gene length (reads per kilobase)
131
- 3. Calculates aerobic/anaerobic domain ratios for each sample
132
- 4. Applies a trained GAM (Generalized Additive Model) to predict percentage of aerobes
133
- 5. Outputs a table with the sampleID, # Pfams detected, and predicted % aerobic bacteria
134
-
135
- **Input:** DIAMOND output directory from `profile` step\
147
+ 1. Processes DIAMOND or MMseqs2 output files with appropriate quality filters
148
+ 2. Selects the top hit per read based on bitscore
149
+ 3. For MMseqs2 (ancient mode): filters out decoy Pfams after selecting top hits
150
+ 4. Normalizes protein domain counts by gene length (reads per kilobase)
151
+ 5. Calculates aerobic/anaerobic domain ratios for each sample
152
+ 6. Applies a trained GAM (Generalized Additive Model) to predict percentage of aerobes
153
+ 7. Outputs a table with the sampleID, # Pfams detected, and predicted % aerobic bacteria
154
+
155
+ **Input:** DIAMOND or MMseqs2 output directory from `profile` step
136
156
  **Output:** Tab-separated file with aerobe predictions for each sample
137
157
 
158
+ **Mode determines input type:**
159
+ - `-m modern`: Uses DIAMOND output (default input: diamond_output/)
160
+ - `-m ancient`: Uses MMseqs2 output (default input: mmseqs_output/)
161
+ - `-m custom`: Auto-detects DIAMOND or MMseqs2 files in input directory
162
+
138
163
  **Arguments:**
139
- - `-i, --input`: Directory with DIAMOND output (default: diamond_output)
164
+ - `-i, --input`: Directory with profiling output (default: diamond_output for modern, mmseqs_output for ancient)
140
165
  - `-o, --output`: Output file (default: per_aerobe_predictions.tsv)
141
166
  - `-t, --threads`: Number of threads (default: 4)
142
167
  - `-m, --mode`: Filtering mode - 'modern', 'ancient', or 'custom' (default: modern)
@@ -146,32 +171,56 @@ Make sure you run oxymetag setup from the directory where you want the database
146
171
 
147
172
  ## Filtering Modes
148
173
 
149
- OxyMetaG includes three pre-configured filtering modes optimized for different types of DNA:
174
+ OxyMetaG includes three pre-configured filtering modes optimized for different types of DNA. In any case, it is always recommended to try several different parameters (using -m custom) to check how sensitive the results are to the cutoffs.
150
175
 
151
176
  ### Modern DNA (default)
152
- **Best for:** Modern environmental metagenomes
177
+ **Best for:** Modern environmental metagenomes
178
+ **Method:** DIAMOND blastx
179
+ **Filters:**
153
180
  - Identity ≥ 60%
154
181
  - Bitscore ≥ 50
155
182
  - E-value ≤ 0.001
156
183
 
157
- ### Ancient DNA
158
- **Best for:** Archaeological samples, paleogenomic data, degraded environmental DNA
159
- - Identity 45% (accounts for DNA damage)
160
- - Bitscore 25 (accommodates shorter fragments)
161
- - E-value ≤ 0.1 (more permissive for low-quality data)
184
+ **Usage:**
185
+ ```bash
186
+ oxymetag profile -m diamond
187
+ oxymetag predict -m modern
188
+ ```
189
+
190
+ ### Ancient DNA
191
+ **Best for:** Archaeological samples, paleogenomic data, degraded environmental DNA
192
+ **Method:** MMseqs2 with decoy Pfams
193
+ **Filters:**
194
+ - Identity ≥ 86%
195
+ - Bitscore ≥ 50
196
+ - E-value ≤ 0.001
197
+
198
+ **Note:** The ancient mode uses stricter identity cutoffs but employs 97 decoy Pfams to reduce false positives from damaged DNA. Reads matching decoys better than target Pfams are filtered out.
199
+
200
+ **Usage:**
201
+ ```bash
202
+ oxymetag profile -m mmseqs2
203
+ oxymetag predict -m ancient
204
+ ```
162
205
 
163
206
  ### Custom
164
207
  **Best for:** Specialized applications or when you want to optimize parameters
165
208
  - Specify your own `--idcut`, `--bitcut`, and `--ecut` values
209
+ - Auto-detects whether input is from DIAMOND or MMseqs2
166
210
  - Useful for method development or unusual sample types
167
211
 
212
+ **Usage:**
213
+ ```bash
214
+ oxymetag predict -m custom --idcut 50 --bitcut 30 --ecut 0.01
215
+ ```
216
+
168
217
  ## Output
169
218
 
170
219
  The final output (`per_aerobe_predictions.tsv`) contains:
171
220
  - `SampleID`: Sample identifier extracted from filenames
172
221
  - `ratio`: Aerobic/anaerobic domain ratio
173
- - `aerobe_pfams`: Number of aerobic Pfam domains detected
174
- - `anaerobe_pfams`: Number of anaerobic Pfam domains detected
222
+ - `aerobe_pfams`: Number of aerobic Pfam domains detected (from 20 target Pfams)
223
+ - `anaerobe_pfams`: Number of anaerobic Pfam domains detected (from 20 target Pfams)
175
224
  - `Per_aerobe`: **Predicted percentage of aerobic bacteria (0-100%)**
176
225
 
177
226
  ## Biological Interpretation
@@ -189,17 +238,33 @@ The `Per_aerobe` value represents the predicted percentage of aerobic bacteria i
189
238
  If you use OxyMetaG in your research, please cite:
190
239
 
191
240
  ```
192
- Bueno de Mesquita, C.P., Stallard-Olivera, E., Fierer, N. (2025). Bueno de Mesquita, C.P. et al. (2025). Predicting the proportion of aerobic and anaerobic bacteria from metagenomic reads with OxyMetaG.
241
+ Bueno de Mesquita, C.P., Stallard-Olivera, E., Fierer, N. (2025).
242
+ Quantifying the oxygen preferences of bacterial communities using a
243
+ metagenome-based approach.
244
+ ```
245
+
246
+ ### Additional citations
247
+
248
+ If you use the **extract** function, also cite Kraken2 and KrakenTools:
249
+ ```
250
+ Lu, J., Rincon, N., Wood, D.E. et al. Metagenome analysis using the Kraken
251
+ software suite. Nat Protoc 17, 2815–2839 (2022).
252
+ https://doi.org/10.1038/s41596-022-00738-y
193
253
  ```
194
- If you use the extract function, also cite Kraken2 and KrakenTools:
195
254
 
255
+ If you use the **profile** function with DIAMOND (`-m diamond`), also cite:
196
256
  ```
197
- Lu, J., Rincon, N., Wood, D.E. et al. Metagenome analysis using the Kraken software suite. Nat Protoc 17, 2815–2839 (2022). https://doi.org/10.1038/s41596-022-00738-y
257
+ Buchfink, B., Xie, C. & Huson, D. Fast and sensitive protein alignment using
258
+ DIAMOND. Nat Methods 12, 59–60 (2015).
259
+ https://doi.org/10.1038/nmeth.3176
198
260
  ```
199
- If you use the profile function, also cite DIAMOND
200
261
 
262
+ If you use the **profile** function with MMseqs2 (`-m mmseqs2`), also cite:
201
263
  ```
202
- Buchfink, B., Xie, C. & Huson, D. Fast and sensitive protein alignment using DIAMOND. Nat Methods 12, 59–60 (2015). https://doi.org/10.1038/nmeth.3176
264
+ Steinegger, M., Söding, J. MMseqs2 enables sensitive protein sequence
265
+ searching for the analysis of massive data sets. Nat Biotechnol 35,
266
+ 1026–1028 (2017).
267
+ https://doi.org/10.1038/nbt.3988
203
268
  ```
204
269
 
205
270
  ## License
@@ -8,6 +8,7 @@ dependencies:
8
8
  - numpy>=1.20.0
9
9
  - kraken2
10
10
  - diamond
11
+ - mmseqs2
11
12
  - krakentools
12
13
  - r-base>=4.0
13
14
  - r-mgcv
@@ -2,7 +2,7 @@
2
2
  OxyMetaG: Oxygen metabolism profiling from metagenomic data
3
3
  """
4
4
 
5
- __version__ = "1.0.0"
5
+ __version__ = "1.1.0"
6
6
  __author__ = "Clifton P. Bueno de Mesquita"
7
7
  __email__ = "cliff.buenodemesquita@colorado.edu"
8
8