fetchm2 0.1.0__tar.gz → 0.1.2__tar.gz
- {fetchm2-0.1.0 → fetchm2-0.1.2}/MANIFEST.in +2 -0
- fetchm2-0.1.2/PKG-INFO +555 -0
- fetchm2-0.1.2/README.md +519 -0
- fetchm2-0.1.2/docs/METADATA_ANALYSIS.md +94 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/docs/STANDARDIZATION.md +14 -1
- fetchm2-0.1.2/docs/VALIDATION_REPORT.md +236 -0
- fetchm2-0.1.2/environment.yml +18 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/pyproject.toml +1 -1
- {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/__init__.py +1 -2
- fetchm2-0.1.2/src/fetchm2/analysis.py +366 -0
- fetchm2-0.1.2/src/fetchm2/audit.py +419 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/cli.py +93 -3
- fetchm2-0.1.2/src/fetchm2/data/collection_date_reviewed_rules.csv +1 -0
- fetchm2-0.1.2/src/fetchm2/metadata.py +476 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/sequence.py +10 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/standardization.py +27 -1
- {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/utils.py +6 -4
- fetchm2-0.1.2/src/fetchm2.egg-info/PKG-INFO +555 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2.egg-info/SOURCES.txt +5 -0
- fetchm2-0.1.2/test.tsv +201 -0
- fetchm2-0.1.2/tests/test_cli.py +207 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/tests/test_standardization.py +14 -0
- fetchm2-0.1.0/PKG-INFO +0 -208
- fetchm2-0.1.0/README.md +0 -172
- fetchm2-0.1.0/docs/VALIDATION_REPORT.md +0 -99
- fetchm2-0.1.0/src/fetchm2/audit.py +0 -126
- fetchm2-0.1.0/src/fetchm2/metadata.py +0 -244
- fetchm2-0.1.0/src/fetchm2.egg-info/PKG-INFO +0 -208
- fetchm2-0.1.0/tests/test_cli.py +0 -70
- {fetchm2-0.1.0 → fetchm2-0.1.2}/LICENSE +0 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/docs/RELEASE_CHECKLIST.md +0 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/docs/SEQUENCE_DOWNLOAD.md +0 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/examples/offline_metadata.tsv +0 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/examples/test_ncbi_dataset.tsv +0 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/setup.cfg +0 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/data/__init__.py +0 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/data/approved_broad_categories.csv +0 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/data/controlled_categories.csv +0 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/data/country_mapping.json +0 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/data/geography_reviewed_rules.csv +0 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/data/host_negative_rules.csv +0 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/data/host_synonyms.csv +0 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2.egg-info/dependency_links.txt +0 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2.egg-info/entry_points.txt +0 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2.egg-info/requires.txt +0 -0
- {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2.egg-info/top_level.txt +0 -0
fetchm2-0.1.2/PKG-INFO ADDED
@@ -0,0 +1,555 @@
+Metadata-Version: 2.4
+Name: fetchm2
+Version: 0.1.2
+Summary: Standalone comprehensive genome metadata standardization and sequence download toolkit.
+Author-email: Tasnimul Arabi Anik <arabianik987@gmail.com>
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/Tasnimul-Arabi-Anik/FetchM2
+Project-URL: Repository, https://github.com/Tasnimul-Arabi-Anik/FetchM2
+Project-URL: Issues, https://github.com/Tasnimul-Arabi-Anik/FetchM2/issues
+Keywords: NCBI,BioSample,metadata,genomics,standardization,sequence-download
+Classifier: Development Status :: 3 - Alpha
+Classifier: Environment :: Console
+Classifier: Intended Audience :: Science/Research
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: pandas>=2.0
+Requires-Dist: requests>=2.31
+Requires-Dist: tqdm>=4.66
+Requires-Dist: matplotlib>=3.7
+Requires-Dist: seaborn>=0.13
+Requires-Dist: plotly>=5.20
+Requires-Dist: kaleido<1.0.0,>=0.2.1
+Requires-Dist: xmltodict>=0.13
+Provides-Extra: dev
+Requires-Dist: pytest>=8.0; extra == "dev"
+Requires-Dist: build>=1.2; extra == "dev"
+Requires-Dist: twine>=5.0; extra == "dev"
+Dynamic: license-file
+
+# FetchM2
+
+FetchM2 is a comprehensive standalone command-line toolkit for bacterial genome metadata analysis, metadata standardization, audit reporting, and optional genome sequence download.
+
+FetchM2 is designed as the updated successor to the original FetchM standalone tool. It keeps the same practical command-line workflow, but adds many more standardized metadata fields, richer filtering, packaged curation rules, audit outputs, and reproducible test data.
+
+Recommended one-command workflow:
+
+```bash
+fetchm2 run --input ncbi_dataset.tsv --outdir results --download
+```
+
+## Key Features
+
+- Standalone command-line tool installable with `pip` or a conda environment.
+- Reads NCBI Genome Datasets TSV/CSV exports.
+- Optionally fetches linked BioSample metadata from NCBI with retry, cache, and fallback lookup support.
+- Supports offline analysis when metadata columns are already present.
+- Applies packaged deterministic standardization rules.
+- Writes clean CSV and TSV metadata outputs.
+- Generates metadata analysis tables and figures automatically.
+- Produces audit summaries and review queues.
+- Downloads genome FASTA files from NCBI.
+- Supports flexible sequence-download filtering by standardized metadata.
+- Includes `test.tsv`, matching the public FetchM-style test dataset layout.
+- Includes `examples/offline_metadata.tsv` for fast local smoke testing.
+
+## Installation
+
+### Option 1: pip
+
+```bash
+python -m venv fetchm2-env
+source fetchm2-env/bin/activate
+pip install fetchm2
+```
+
+Verify:
+
+```bash
+fetchm2 --version
+```
+
+### Option 2: conda / mamba environment
+
+Clone the repository and create the environment:
+
+```bash
+git clone https://github.com/Tasnimul-Arabi-Anik/FetchM2.git
+cd FetchM2
+mamba env create -f environment.yml
+conda activate fetchm2
+```
+
+If you use `conda` instead of `mamba`:
+
+```bash
+conda env create -f environment.yml
+conda activate fetchm2
+```
+
+The conda environment includes `taxonkit`, which can improve host lineage enrichment for less common TaxIDs. FetchM2 still works without `taxonkit`; common host lineages are bundled.
+
+### Option 3: developer install
+
+```bash
+git clone https://github.com/Tasnimul-Arabi-Anik/FetchM2.git
+cd FetchM2
+python -m pip install -e ".[dev]"
+pytest
+```
+
+## Quick Start
+
+Run the bundled standalone smoke test:
+
+```bash
+fetchm2 metadata --input examples/offline_metadata.tsv --outdir demo_out --offline
+```
+
+Run the FetchM-style test dataset:
+
+```bash
+fetchm2 metadata --input test.tsv --outdir test_out --offline
+```
+
+`test.tsv` contains assembly-level NCBI dataset columns and BioSample accessions. In offline mode, FetchM2 analyzes assembly statistics and any metadata already present in the table. To populate host, source, sample, environment, and geography from NCBI BioSample records, run without `--offline`.
+
+Run metadata retrieval with BioSample enrichment:
+
+```bash
+fetchm2 metadata --input test.tsv --outdir test_out_live --workers 3 --sleep 0.4
+```
+
+Use an NCBI API key for larger jobs:
+
+```bash
+export NCBI_API_KEY=YOUR_NCBI_API_KEY
+export NCBI_EMAIL=you@example.com
+fetchm2 metadata --input ncbi_dataset.tsv --outdir results --workers 6 --sleep 0.15
+```
+
+Run metadata standardization and sequence download in one command:
+
+```bash
+fetchm2 run --input ncbi_dataset.tsv --outdir results --download
+```
+
+## Typical Species/Genus Workflow
+
+1. Download an NCBI Genome Datasets TSV or CSV for your target species or genus.
+2. Run FetchM2:
+
+```bash
+fetchm2 run --input ncbi_dataset.tsv --outdir results --download
+```
+
+3. Review the main outputs:
+
+- `results/metadata_output/fetchm2_clean.csv`
+- `results/metadata_analysis/metadata_analysis_report.md`
+- `results/audit/standardization_audit.md`
+- `results/audit/production_readiness_gate.md`
+- `results/sequence/`
+
+For large NCBI retrieval jobs without an API key, use a conservative request delay:
+
+```bash
+fetchm2 run --input ncbi_dataset.tsv --outdir results --download --workers 3 --sleep 0.4
+```
+
+## Metadata Retrieval Workflow
+
+FetchM2 can work in two modes.
+
+Offline mode:
+
+- Uses metadata columns already present in the input table.
+- Applies standardization rules.
+- Generates audit and metadata analysis outputs.
+- Does not contact NCBI.
+
+Live BioSample mode:
+
+- Reads BioSample accessions from NCBI dataset exports.
+- Retrieves BioSample records through NCBI E-utilities.
+- Uses direct BioSample XML first, then an `esummary` fallback when the direct record lacks usable attributes.
+- Tracks raw BioSample attribute names and matched standardized attribute names.
+- Uses a local SQLite cache so repeated runs do not refetch the same BioSample records.
+- Uses request throttling, retry, and backoff behavior for temporary NCBI rate-limit or server errors.
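The SQLite cache described in the list above could work along the following lines. This is a minimal sketch, not FetchM2's actual code: the table name `biosample_cache`, the helper names `open_cache` and `get_record`, and the caller-supplied `fetch` function are all assumptions made for illustration.

```python
import json
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a small key-value cache keyed by BioSample accession."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS biosample_cache ("
        "accession TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    return conn


def get_record(conn, accession, fetch):
    """Return the cached record, calling fetch(accession) only on a cache miss."""
    row = conn.execute(
        "SELECT payload FROM biosample_cache WHERE accession = ?", (accession,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    record = fetch(accession)  # hypothetical network call supplied by the caller
    conn.execute(
        "INSERT OR REPLACE INTO biosample_cache (accession, payload) VALUES (?, ?)",
        (accession, json.dumps(record)),
    )
    conn.commit()
    return record


# Stand-in for a real NCBI request so the sketch runs offline.
calls = []


def fake_fetch(accession):
    calls.append(accession)
    return {"accession": accession, "host": "Homo sapiens"}


conn = open_cache()
first = get_record(conn, "SAMN00000001", fake_fetch)
second = get_record(conn, "SAMN00000001", fake_fetch)  # served from the cache
```

Passing a file path instead of `":memory:"` would let the cache persist between runs, which is the behavior the list describes.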
+
+Important output columns from retrieval include:
+
+- `BioSample`
+- `BioSample Taxonomy Name`
+- `Metadata Fetch Status`
+- `Metadata Fetch Reason`
+- `Metadata Fetch Error`
+- `Metadata Raw Attribute Names`
+- `Metadata Matched Attribute Names`
+
+FetchM2 currently recognizes common BioSample attribute aliases for host, source, sample type, isolation site, collection date, geography, environmental medium/broad/local scale, host disease, and host health state.
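Alias recognition of this kind can be pictured as a lookup from normalized raw attribute names to standardized fields. The sketch below is illustrative only: the alias table, the field labels, and the function names are assumptions, and the aliases FetchM2 actually recognizes may differ.

```python
# Hypothetical alias table: standardized field -> raw attribute names that feed it.
ALIASES = {
    "Host": {"host", "specific_host"},
    "Isolation Source": {"isolation_source", "isolation source"},
    "Collection Date": {"collection_date", "collection date"},
    "Geographic Location": {"geo_loc_name", "geographic location"},
}


def normalize(name):
    """Lowercase, turn underscores into spaces, and collapse whitespace."""
    return " ".join(name.strip().lower().replace("_", " ").split())


LOOKUP = {
    normalize(alias): field
    for field, aliases in ALIASES.items()
    for alias in aliases
}


def match_attributes(raw):
    """Return (standardized fields, raw names that matched) for one record."""
    matched, matched_names = {}, []
    for raw_name, value in raw.items():
        field = LOOKUP.get(normalize(raw_name))
        if field and value and field not in matched:
            matched[field] = value
            matched_names.append(raw_name)
    return matched, matched_names


raw_record = {
    "geo_loc_name": "Bangladesh: Dhaka",
    "Host": "Homo sapiens",
    "strain": "K12",
}
matched, matched_names = match_attributes(raw_record)
```

Keeping both the raw names and the matched names mirrors the `Metadata Raw Attribute Names` and `Metadata Matched Attribute Names` columns listed above.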
+
+## Main Commands
+
+```bash
+fetchm2 metadata --help
+fetchm2 run --help
+fetchm2 seq --help
+fetchm2 audit --help
+fetchm2 validate --help
+fetchm2 analyze --help
+```
+
+### `fetchm2 metadata`
+
+Reads an NCBI dataset TSV/CSV, optionally fetches BioSample metadata, standardizes fields, and writes clean outputs.
+
+Example:
+
+```bash
+fetchm2 metadata \
+  --input ncbi_dataset.tsv \
+  --outdir results \
+  --ani OK \
+  --checkm 95 \
+  --workers 6
+```
+
+Common options:
+
+- `--input`: NCBI dataset TSV/CSV.
+- `--outdir`: output directory.
+- `--ani`: filter by ANI Check status, for example `OK`.
+- `--checkm`: minimum CheckM completeness.
+- `--api-key`: NCBI API key; can also use `NCBI_API_KEY`.
+- `--email`: NCBI email; can also use `NCBI_EMAIL`.
+- `--workers`: BioSample fetch worker count.
+- `--sleep`: shared request delay between NCBI calls. Use a slower value such as `0.4` to `0.5` for unauthenticated larger jobs.
+- `--offline`: skip NCBI fetching and standardize existing columns only.
+- `--no-analysis`: skip automatic `metadata_analysis/` table and figure generation.
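To see why a shared `--sleep` delay matters when several `--workers` run at once, consider the sketch below. It is a generic throttle-plus-backoff pattern under stated assumptions, not FetchM2's implementation: the `Throttle` class, `call_with_retry`, and the use of `ConnectionError` as the transient failure are all hypothetical. For context, NCBI documents E-utilities limits of roughly 3 requests per second without an API key and 10 with one, which is consistent with the slower delays suggested for unauthenticated jobs.

```python
import threading
import time


class Throttle:
    """Space out calls across threads so they start at least `delay` seconds apart."""

    def __init__(self, delay):
        self.delay = delay
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_allowed)
            self._next_allowed = start + self.delay
            pause = start - now
        if pause > 0:
            time.sleep(pause)


def call_with_retry(func, throttle, retries=3, backoff=0.5):
    """Call func(); on a transient error wait backoff * 2**attempt and retry."""
    for attempt in range(retries + 1):
        throttle.wait()
        try:
            return func()
        except ConnectionError:
            if attempt == retries:
                raise
            time.sleep(backoff * (2 ** attempt))


# Simulated request that fails twice before succeeding.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated temporary server error")
    return "ok"


result = call_with_retry(flaky, Throttle(delay=0.01), retries=3, backoff=0.01)
```

Because every worker thread would share one `Throttle` instance, adding workers does not raise the overall request rate; it only keeps the pipeline busy while individual requests are in flight.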
+
+### `fetchm2 run`
+
+Runs metadata analysis and, if requested, sequence download.
+
+```bash
+fetchm2 run \
+  --input ncbi_dataset.tsv \
+  --outdir results \
+  --ani OK \
+  --checkm 95 \
+  --download \
+  --country Bangladesh \
+  --host "Homo sapiens" \
+  --year-from 2018 \
+  --year-to 2024
+```
+
+### `fetchm2 seq`
+
+Downloads genome FASTA files using the standardized clean metadata table.
+
+```bash
+fetchm2 seq \
+  --input results/metadata_output/fetchm2_clean.csv \
+  --outdir results/sequence \
+  --host "Homo sapiens" \
+  --country Bangladesh \
+  --year-from 2018 \
+  --year-to 2024
+```
+
+Check expected sequences without downloading:
+
+```bash
+fetchm2 seq \
+  --input results/metadata_output/fetchm2_clean.csv \
+  --outdir results/sequence \
+  --country Bangladesh \
+  --check-only
+```
+
+### `fetchm2 audit`
+
+Audits an existing standardized output:
+
+```bash
+fetchm2 audit \
+  --input results/metadata_output/fetchm2_clean.csv \
+  --outdir results/audit_rerun
+```
+
+### `fetchm2 validate`
+
+Runs the same production-readiness checks as `audit`, but names the workflow explicitly for CLI validation:
+
+```bash
+fetchm2 validate \
+  --input results/metadata_output/fetchm2_clean.csv \
+  --outdir results/validation
+```
+
+### `fetchm2 analyze`
+
+Generates metadata analysis outputs from any existing clean metadata CSV.
+
+```bash
+fetchm2 analyze \
+  --input results/metadata_output/fetchm2_clean.csv \
+  --outdir results/metadata_analysis_rerun \
+  --top-n 30
+```
+
+## Metadata Outputs
+
+FetchM2 writes:
+
+- `metadata_output/fetchm2_clean.csv`
+- `metadata_output/fetchm2_clean.tsv`
+- `metadata_output/fetchm2_report.md`
+- `audit/standardization_summary.csv`
+- `audit/top_host_review_needed.csv`
+- `audit/standardization_audit.md`
+- `metadata_analysis/metadata_analysis_report.md`
+- `metadata_analysis/tables/field_coverage_summary.csv`
+- `metadata_analysis/tables/top_values_by_field.csv`
+- `metadata_analysis/tables/numeric_summary.csv`
+- `metadata_analysis/figures/*.png`
+
+Typical output structure:
+
+```text
+results/
+├── metadata_output/
+│   ├── fetchm2_clean.csv
+│   ├── fetchm2_clean.tsv
+│   └── fetchm2_report.md
+├── metadata_analysis/
+│   ├── metadata_analysis_report.md
+│   ├── tables/
+│   └── figures/
+├── audit/
+│   ├── standardization_summary.csv
+│   ├── standardization_audit.md
+│   ├── production_readiness_gate.md
+│   ├── production_readiness_gate.json
+│   ├── top_host_review_needed.csv
+│   ├── non_country_values_in_country.csv
+│   ├── country_continent_mismatch.csv
+│   ├── country_subcontinent_mismatch.csv
+│   ├── invalid_collection_years.csv
+│   ├── invalid_host_like_sample_type.csv
+│   ├── source_like_mapped_hosts.csv
+│   ├── source_like_unmapped_hosts_for_review.csv
+│   ├── broad_vocabulary_leakage.csv
+│   ├── sequence_readiness.csv
+│   └── rule_count_summary.csv
+└── sequence/
+    ├── *.fna
+    ├── failed_accessions.txt
+    ├── sequence_download_summary.csv
+    └── fetchm2_sequence_cache.sqlite3
+```
+
+## Standardized Metadata Fields
+
+FetchM2 keeps the original input columns and adds standardized fields.
+
+### Host Standardization
+
+Original FetchM had host-oriented metadata summaries. FetchM2 expands this into detailed host standardization:
+
+- `Host_Original`
+- `Host_Cleaned`
+- `Host_SD`
+- `Host_TaxID`
+- `Host_Rank`
+- `Host_Superkingdom`
+- `Host_Phylum`
+- `Host_Class`
+- `Host_Order`
+- `Host_Family`
+- `Host_Genus`
+- `Host_Species`
+- `Host_Common_Name`
+- `Host_Context_SD`
+- `Host_Match_Method`
+- `Host_Confidence`
+- `Host_Review_Status`
+
+Examples:
+
+- `human`, `human blood`, `Homosapines` variants can map to `Homo sapiens`, TaxID `9606`.
+- `cattle feces` can map to `Bos taurus`, TaxID `9913`, while also preserving feces/stool as sample metadata.
+- `bacteria culture`, `DH5a`, lab strain terms, missing values, and source/material terms are blocked from becoming host values.
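The host examples above combine a synonym lookup with a block list. A toy version of that idea is sketched below; the three tables and the `standardize_host` function are hypothetical stand-ins, and the packaged `host_synonyms.csv` and `host_negative_rules.csv` may be structured quite differently.

```python
# Hypothetical rule tables for illustration only.
HOST_SYNONYMS = {
    "human": ("Homo sapiens", "9606"),
    "homo sapiens": ("Homo sapiens", "9606"),
    "cattle": ("Bos taurus", "9913"),
}
NEGATIVE_TERMS = {"bacteria culture", "dh5a", "missing", "not collected"}
SAMPLE_TERMS = {
    "blood": "blood",
    "urine": "urine",
    "feces": "feces/stool",
    "stool": "feces/stool",
}


def standardize_host(raw):
    """Map a raw host string to host name, TaxID, and any embedded sample type."""
    empty = {"Host_SD": "", "Host_TaxID": "", "Sample_Type_SD": ""}
    cleaned = " ".join(raw.strip().lower().split())
    if not cleaned or cleaned in NEGATIVE_TERMS:
        return empty  # blocked values never become hosts
    words = cleaned.split()
    # Pull out sample-material words so they are preserved as sample metadata.
    sample = next((SAMPLE_TERMS[w] for w in words if w in SAMPLE_TERMS), "")
    host_part = " ".join(w for w in words if w not in SAMPLE_TERMS)
    name, taxid = HOST_SYNONYMS.get(host_part, ("", ""))
    return {"Host_SD": name, "Host_TaxID": taxid, "Sample_Type_SD": sample}


cattle = standardize_host("cattle feces")
human = standardize_host("human blood")
blocked = standardize_host("DH5a")
```

Checking the block list before the synonym table is a deliberate ordering in this sketch: it keeps lab strains and placeholder values from being matched by a looser rule later on.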
+
+### Source, Sample, and Environment
+
+FetchM2 standardizes source/sample/environment fields into:
+
+- `Sample_Type_SD`
+- `Sample_Type_SD_Broad`
+- `Isolation_Source_SD`
+- `Isolation_Source_SD_Broad`
+- `Isolation_Site_SD`
+- `Environment_Medium_SD`
+- `Environment_Medium_SD_Broad`
+- `Environment_Broad_Scale_SD`
+- `Environment_Local_Scale_SD`
+
+Examples:
+
+- `blood` -> `Sample_Type_SD=blood`
+- `urine` -> `Sample_Type_SD=urine`
+- `feces`, `faeces`, `stool` -> `Sample_Type_SD=feces/stool`
+- `soil` -> `Environment_Medium_SD=soil`
+- `sediment` -> `Environment_Medium_SD=sediment`
+- `wastewater`, `sewage` -> `Environment_Medium_SD=wastewater/sewage`
+- `hospital surface` -> healthcare/source context
+- `rectal swab` -> sample type plus anatomical site when available
+
+### Geography and Date
+
+FetchM2 standardizes:
+
+- `Country`
+- `Continent`
+- `Subcontinent`
+- `Collection_Year`
+
+It also blocks common false positives such as:
+
+- `Hospital` as country
+- `ground turkey` as Turkey
+- `Guinea pig` as Guinea
+- `Norway rat` as Norway
+- `Aspergillus niger` as Niger
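One way to avoid the false positives listed above is to reject known non-geographic phrases first and then match only the leading part of the value rather than any substring. The sketch below illustrates that approach; the country table, the blocked phrases, and `detect_country` are assumptions for this example and do not reflect the contents of `country_mapping.json` or `geography_reviewed_rules.csv`.

```python
# Hypothetical, heavily truncated lookup tables.
COUNTRIES = {
    "bangladesh": "Bangladesh",
    "turkey": "Turkey",
    "guinea": "Guinea",
    "norway": "Norway",
    "niger": "Niger",
}
BLOCKED_PHRASES = ("ground turkey", "guinea pig", "norway rat", "aspergillus niger")


def detect_country(value):
    """Return a country name, or an empty string when the value is not a country."""
    text = " ".join(value.strip().lower().split())
    if any(phrase in text for phrase in BLOCKED_PHRASES):
        return ""
    # Values such as "Country: region" keep only the part before the colon.
    head = text.split(":", 1)[0].strip()
    # Exact match on the whole head, so "hospital" or "turkey meat" never match.
    return COUNTRIES.get(head, "")


results = {
    value: detect_country(value)
    for value in ["Bangladesh: Dhaka", "ground turkey", "Guinea pig", "Hospital", "Turkey"]
}
```

Exact matching on the leading segment is stricter than substring search, which is what keeps organism and food names from leaking into `Country`.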
+
+### Disease and Health State
+
+FetchM2 includes:
+
+- `Host_Disease_SD`
+- `Host_Health_State_SD`
+
+These are conservative deterministic fields. Disease words are not treated as sample material unless an actual specimen is present.
+
+## Sequence Download Features
+
+FetchM2 downloads genome FASTA files from the NCBI genomes FTP structure using `Assembly Accession` and `Assembly Name`.
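The NCBI genomes FTP site nests each assembly under directories derived from its accession digits, which is why both `Assembly Accession` and `Assembly Name` are needed. The sketch below builds such a URL; `genome_fasta_url` is a hypothetical helper, and it only handles spaces in assembly names, so FetchM2's own path construction may differ, particularly for names with other special characters.

```python
def genome_fasta_url(accession, assembly_name):
    """Build the genomic FASTA URL for an assembly on the NCBI genomes FTP site."""
    prefix, rest = accession.split("_", 1)   # e.g. "GCF" and "000005845.2"
    digits = rest.split(".", 1)[0]           # drop the version suffix
    # The nine accession digits become three directory levels of three digits.
    parts = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    safe_name = assembly_name.strip().replace(" ", "_")
    folder = f"{accession}_{safe_name}"
    return (
        "https://ftp.ncbi.nlm.nih.gov/genomes/all/"
        f"{prefix}/{'/'.join(parts)}/{folder}/{folder}_genomic.fna.gz"
    )


url = genome_fasta_url("GCF_000005845.2", "ASM584v2")
```

For the accession and name shown, the result points at the `GCF/000/005/845/` directory tree, with the folder and file both named after the accession and assembly name.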
+
+Filtering options:
+
+- `--host`
+- `--host-rank`
+- `--country`
+- `--continent`
+- `--subcontinent`
+- `--sample-type`
+- `--isolation-source`
+- `--environment-medium`
+- `--year-from`
+- `--year-to`
+- `--max-genomes`
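Conceptually, these filters select rows of the clean metadata table before anything is downloaded. The sketch below shows the idea for a few of the options using only the standard library. It assumes that `--host`, `--country`, and the year options compare against the `Host_SD`, `Country`, and `Collection_Year` columns with exact, case-insensitive matching; that mapping and the `filter_rows` helper are assumptions, not documented behavior.

```python
import csv
import io


def filter_rows(rows, host=None, country=None, year_from=None, year_to=None,
                max_genomes=None):
    """Keep rows matching every filter that was supplied."""
    selected = []
    for row in rows:
        if host and row.get("Host_SD", "").lower() != host.lower():
            continue
        if country and row.get("Country", "").lower() != country.lower():
            continue
        if year_from is not None or year_to is not None:
            year_text = row.get("Collection_Year", "")
            if not year_text.isdigit():
                continue  # rows without a usable year cannot satisfy a year filter
            year = int(year_text)
            if year_from is not None and year < year_from:
                continue
            if year_to is not None and year > year_to:
                continue
        selected.append(row)
        if max_genomes is not None and len(selected) >= max_genomes:
            break
    return selected


# Made-up rows standing in for fetchm2_clean.csv.
SAMPLE = """Assembly Accession,Host_SD,Country,Collection_Year
GCF_000000001.1,Homo sapiens,Bangladesh,2019
GCF_000000002.1,Homo sapiens,Bangladesh,2015
GCF_000000003.1,Bos taurus,Bangladesh,2020
GCF_000000004.1,Homo sapiens,India,2021
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
picked = filter_rows(rows, host="Homo sapiens", country="Bangladesh",
                     year_from=2018, year_to=2024)
accessions = [row["Assembly Accession"] for row in picked]
```

Running a selection like this first is similar in spirit to `--check-only`, which reports the expected sequences without downloading them.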
+
+Download control:
+
+- `--download-workers`
+- `--retries`
+- `--retry-delay`
+- `--keep-gz`
+- `--check-only`
+
+Outputs:
+
+- genome FASTA files
+- `failed_accessions.txt`
+- `sequence_download_summary.csv`
+- `fetchm2_sequence_cache.sqlite3`
+
+## Test Dataset
+
+FetchM2 includes:
+
+- `test.tsv`: FetchM-style NCBI dataset example copied from the public FetchM test dataset.
+- `examples/test_ncbi_dataset.tsv`: same dataset stored under examples.
+- `examples/offline_metadata.tsv`: small annotated metadata table for fast offline testing.
+
+Run:
+
+```bash
+fetchm2 metadata --input test.tsv --outdir test_run --offline
+fetchm2 audit --input test_run/metadata_output/fetchm2_clean.csv --outdir test_run/audit_check
+```
+
+For BioSample metadata retrieval:
+
+```bash
+fetchm2 metadata --input test.tsv --outdir test_run_live --workers 3 --sleep 0.34
+```
+
+## Rule Files Packaged With FetchM2
+
+FetchM2 ships deterministic rules in `src/fetchm2/data/`:
+
+- `host_synonyms.csv`
+- `host_negative_rules.csv`
+- `controlled_categories.csv`
+- `approved_broad_categories.csv`
+- `geography_reviewed_rules.csv`
+- `collection_date_reviewed_rules.csv`
+- `country_mapping.json`
+
+These rules let the standalone tool produce richer standardized fields without needing a web database.
+
+## API Keys
+
+For NCBI, use environment variables:
+
+```bash
+export NCBI_API_KEY=YOUR_NCBI_API_KEY
+export NCBI_EMAIL=you@example.com
+```
+
+Do not put API keys in scripts, notebooks, README files, Git commits, or issue reports.
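Reading credentials from the environment at call time is what keeps them out of source files. The sketch below shows how a script might pick up `NCBI_API_KEY` and `NCBI_EMAIL` and attach them to E-utilities query parameters. The `eutils_params` helper is hypothetical; `api_key` and `email` are standard E-utilities parameters, but how FetchM2 itself assembles requests is not shown in this README.

```python
import os


def eutils_params(db, accession, environ=os.environ):
    """Build E-utilities query parameters, adding credentials only when set."""
    params = {"db": db, "id": accession}
    api_key = environ.get("NCBI_API_KEY")
    email = environ.get("NCBI_EMAIL")
    if api_key:
        params["api_key"] = api_key
    if email:
        params["email"] = email
    return params


# Explicit dictionaries stand in for the real environment so the sketch is testable.
with_key = eutils_params(
    "biosample",
    "SAMN00000001",
    {"NCBI_API_KEY": "dummy-key", "NCBI_EMAIL": "you@example.com"},
)
without_key = eutils_params("biosample", "SAMN00000001", {})
```

When the variables are unset the request is simply sent unauthenticated, which is the case where the slower `--sleep` values apply.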
+
+## Validation
+
+Run local validation:
+
+```bash
+pytest
+python -m build
+python -m twine check dist/*
+python -m pip install dist/fetchm2-*.whl
+fetchm2 metadata --input examples/offline_metadata.tsv --outdir smoke_out --offline
+fetchm2 validate --input smoke_out/metadata_output/fetchm2_clean.csv --outdir smoke_out/validation
+fetchm2 seq --input smoke_out/metadata_output/fetchm2_clean.csv --outdir smoke_seq --country Bangladesh --check-only
+```
+
+The validation report is in:
+
+```text
+docs/VALIDATION_REPORT.md
+```
+
+More analysis details:
+
+```text
+docs/METADATA_ANALYSIS.md
+docs/STANDARDIZATION.md
+docs/SEQUENCE_DOWNLOAD.md
+docs/RELEASE_CHECKLIST.md
+```
+
+## License
+
+MIT License.