fetchm2 0.1.4__tar.gz → 0.1.7__tar.gz
- {fetchm2-0.1.4/src/fetchm2.egg-info → fetchm2-0.1.7}/PKG-INFO +158 -20
- {fetchm2-0.1.4 → fetchm2-0.1.7}/README.md +157 -19
- fetchm2-0.1.7/docs/FETCHM_COMPATIBILITY.md +80 -0
- fetchm2-0.1.7/docs/SEQUENCE_DOWNLOAD.md +67 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/docs/STANDARDIZATION.md +4 -2
- {fetchm2-0.1.4 → fetchm2-0.1.7}/docs/VALIDATION_REPORT.md +158 -4
- {fetchm2-0.1.4 → fetchm2-0.1.7}/pyproject.toml +1 -1
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/__init__.py +1 -1
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/audit.py +54 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/cli.py +28 -1
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/data/country_mapping.json +147 -3
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/metadata.py +349 -6
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/sequence.py +105 -20
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/standardization.py +298 -32
- {fetchm2-0.1.4 → fetchm2-0.1.7/src/fetchm2.egg-info}/PKG-INFO +158 -20
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2.egg-info/SOURCES.txt +1 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/tests/test_cli.py +109 -5
- fetchm2-0.1.7/tests/test_standardization.py +154 -0
- fetchm2-0.1.4/docs/SEQUENCE_DOWNLOAD.md +0 -45
- fetchm2-0.1.4/tests/test_standardization.py +0 -69
- {fetchm2-0.1.4 → fetchm2-0.1.7}/LICENSE +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/MANIFEST.in +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/docs/METADATA_ANALYSIS.md +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/docs/RELEASE_CHECKLIST.md +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/environment.yml +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/examples/offline_metadata.tsv +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/examples/test_ncbi_dataset.tsv +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/setup.cfg +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/analysis.py +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/data/__init__.py +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/data/approved_broad_categories.csv +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/data/collection_date_reviewed_rules.csv +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/data/controlled_categories.csv +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/data/geography_reviewed_rules.csv +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/data/host_negative_rules.csv +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/data/host_synonyms.csv +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/utils.py +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2.egg-info/dependency_links.txt +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2.egg-info/entry_points.txt +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2.egg-info/requires.txt +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2.egg-info/top_level.txt +0 -0
- {fetchm2-0.1.4 → fetchm2-0.1.7}/test.tsv +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: fetchm2
-Version: 0.1.
+Version: 0.1.7
 Summary: Standalone comprehensive genome metadata standardization and sequence download toolkit.
 Author-email: Tasnimul Arabi Anik <arabianik987@gmail.com>
 License-Expression: MIT
@@ -36,9 +36,11 @@ Dynamic: license-file
 
 # FetchM2
 
-
+## Overview
 
-FetchM2 is
+FetchM2 is a comprehensive standalone command-line toolkit for bacterial genome metadata retrieval, deterministic metadata standardization, metadata analysis, audit/validation reporting, and optional genome sequence download from NCBI Genome Datasets exports.
+
+FetchM2 is designed as the updated successor to the original FetchM standalone tool. It keeps the practical FetchM command-line workflow, but adds expanded host taxonomy fields, source/sample/environment standardization, geography and collection-year recovery, production-readiness audits, richer sequence-download filters, and reproducible test data.
 
 Recommended one-command workflow:
 
@@ -46,16 +48,42 @@ Recommended one-command workflow:
 fetchm2 run --input ncbi_dataset.tsv --outdir results --download
 ```
 
-
+The tool is intended primarily for bacterial genome datasets. It can process other NCBI Genome Datasets TSV/CSV exports, but metadata conventions outside bacterial datasets may be less consistent.
+
+## Workflow
+
+FetchM2 starts from an NCBI Genome Datasets TSV/CSV, retrieves linked BioSample metadata when requested, standardizes metadata fields with packaged deterministic rules, generates analysis/audit outputs, and optionally downloads FASTA genome sequences.
+
+Typical flow:
+
+```text
+NCBI ncbi_dataset.tsv/csv
+|
+v
+BioSample metadata retrieval or offline metadata parsing
+|
+v
+Deterministic standardization
+|
+v
+Clean metadata + analysis tables/figures + audit reports
+|
+v
+Optional filtered sequence download
+```
+
+## Features
 
 - Standalone command-line tool installable with `pip` or a conda environment.
 - Reads NCBI Genome Datasets TSV/CSV exports.
 - Optionally fetches linked BioSample metadata from NCBI with retry, cache, and fallback lookup support.
 - Supports offline analysis when metadata columns are already present.
-- Applies packaged deterministic standardization rules.
--
+- Applies packaged deterministic standardization rules for host, source, sample, environment, geography, collection year, disease, and health state.
+- Adds `Host_SD`, `Host_TaxID`, host lineage/rank fields, `Host_Context_SD`, standardized sample/source/environment fields, `Country`, `Continent`, `Subcontinent`, and geography traceability fields.
+- Labels 238 country/territory/marine-region entries, including common territories and ocean/sea regions.
+- Writes representative clean CSV/TSV outputs plus full all-assembly outputs.
 - Generates metadata analysis tables and figures automatically.
-- Produces audit summaries and review queues.
+- Produces audit summaries, production-readiness gates, leakage checks, and review queues.
 - Downloads genome FASTA files from NCBI.
 - Supports flexible sequence-download filtering by standardized metadata.
 - Includes `test.tsv`, matching the public FetchM-style test dataset layout.
@@ -68,7 +96,7 @@ fetchm2 run --input ncbi_dataset.tsv --outdir results --download
 ```bash
 python -m venv fetchm2-env
 source fetchm2-env/bin/activate
-pip install fetchm2
+pip install fetchm2
 ```
 
 Verify:
@@ -77,6 +105,12 @@ Verify:
 fetchm2 --version
 ```
 
+To install the validated `0.1.7` GitHub release tag before the PyPI package is updated:
+
+```bash
+pip install "git+https://github.com/Tasnimul-Arabi-Anik/FetchM2.git@v0.1.7"
+```
+
 ### Option 2: conda / mamba environment
 
 Clone the repository and create the environment:
@@ -106,7 +140,51 @@ python -m pip install -e ".[dev]"
 pytest
 ```
 
-##
+## NCBI API Key
+
+FetchM2 can run without an NCBI API key, but larger BioSample retrieval jobs are more reliable with one.
+
+Create an NCBI API key from your My NCBI account, then either pass it directly:
+
+```bash
+fetchm2 metadata --input ncbi_dataset.tsv --outdir results --api-key YOUR_NCBI_API_KEY
+```
+
+Or use environment variables:
+
+```bash
+export NCBI_API_KEY=YOUR_NCBI_API_KEY
+export NCBI_EMAIL=you@example.com
+fetchm2 metadata --input ncbi_dataset.tsv --outdir results
+```
+
+Recommended request pacing:
+
+- without an API key: use `--workers 3 --sleep 0.4` for larger jobs
+- with an API key: `--workers 6 --sleep 0.15` is usually reasonable
+
+FetchM2 keeps a persistent SQLite BioSample cache in `metadata_output/fetchm2_biosample_cache.sqlite3`, so repeated runs do not refetch BioSamples that were already resolved.
+
+Do not put API keys in scripts, notebooks, README files, Git commits, or issue reports.
+
+## Usage
+
+### Recommended All-In-One Workflow
+
+```bash
+fetchm2 run --input ncbi_dataset.tsv --outdir results --download
+```
+
+This command:
+
+1. reads the NCBI genome export
+2. filters rows if `--ani` and/or `--checkm` are provided
+3. retrieves linked BioSample metadata unless `--offline` is used
+4. standardizes metadata fields
+5. writes clean tables, analysis outputs, and audit reports
+6. downloads FASTA files when `--download` is provided
+
+### Quick Start
 
 Run the bundled standalone smoke test:
 
@@ -154,6 +232,7 @@ fetchm2 run --input ncbi_dataset.tsv --outdir results --download
 3. Review the main outputs:
 
 - `results/metadata_output/fetchm2_clean.csv`
+- `results/metadata_output/fetchm2_all_assemblies.csv`
 - `results/metadata_analysis/metadata_analysis_report.md`
 - `results/audit/standardization_audit.md`
 - `results/audit/production_readiness_gate.md`
@@ -314,6 +393,14 @@ FetchM2 writes:
 
 - `metadata_output/fetchm2_clean.csv`
 - `metadata_output/fetchm2_clean.tsv`
+- `metadata_output/fetchm2_clean_compat.csv`
+- `metadata_output/ncbi_clean.csv`
+- `metadata_output/fetchm2_all_assemblies.csv`
+- `metadata_output/fetchm2_all_assemblies.tsv`
+- `metadata_output/sample_map.csv`
+- `metadata_output/metadata_completeness.csv`
+- `metadata_output/metadata_bias_warning.txt`
+- `metadata_output/fetchm2_manifest.json`
 - `metadata_output/fetchm2_report.md`
 - `audit/standardization_summary.csv`
 - `audit/top_host_review_needed.csv`
@@ -331,6 +418,14 @@ results/
 ├── metadata_output/
 │ ├── fetchm2_clean.csv
 │ ├── fetchm2_clean.tsv
+│ ├── fetchm2_clean_compat.csv
+│ ├── ncbi_clean.csv
+│ ├── fetchm2_all_assemblies.csv
+│ ├── fetchm2_all_assemblies.tsv
+│ ├── sample_map.csv
+│ ├── metadata_completeness.csv
+│ ├── metadata_bias_warning.txt
+│ ├── fetchm2_manifest.json
 │ └── fetchm2_report.md
 ├── metadata_analysis/
 │ ├── metadata_analysis_report.md
@@ -359,6 +454,41 @@ results/
 └── fetchm2_sequence_cache.sqlite3
 ```
 
+By default, `fetchm2_clean.csv` follows original FetchM behavior: it selects one representative row per `Assembly Name`, preferring RefSeq `GCF_*` over GenBank `GCA_*` when both are present. This prevents paired GCA/GCF assemblies sharing the same BioSample from being double-counted in downstream prevalence analyses. The full row-preserving output is still saved as `fetchm2_all_assemblies.csv`.
+
+If you intentionally want paired GCA/GCF rows retained in `fetchm2_clean.csv`, use:
+
+```bash
+fetchm2 metadata --input ncbi_dataset.tsv --outdir results --keep-assembly-duplicates
+```
+
+For PanR2/PanResistome-style downstream pipelines, FetchM2 always includes these compatibility columns in `fetchm2_clean.csv`, even when values are blank:
+
+- `Assembly Accession`
+- `Assembly Name`
+- `Assembly BioSample Accession`
+- `Organism Name`
+- `Geographic Location`
+- `Continent`
+- `Subcontinent`
+- `Collection Date`
+- `Collection_Year`
+- `Host`
+- `Host_SD`
+- `Isolation_Source`
+- `Isolation_Source_SD`
+- `Sample_Type_SD`
+- `Environment_Medium_SD`
+
+`sample_map.csv` provides stable sequence-analysis matching columns:
+
+- `sample_id`
+- `Assembly Accession`
+- `Assembly Name`
+- `sequence_file`
+
+Assembly accession versions such as `GCF_000123456.1` are preserved.
+
 ## Standardized Metadata Fields
 
 FetchM2 keeps the original input columns and adds standardized fields.
@@ -423,8 +553,14 @@ FetchM2 standardizes:
 - `Country`
 - `Continent`
 - `Subcontinent`
+- `Country_Source`
+- `Country_Confidence`
+- `Country_Evidence`
+- `Geo_Recovery_Status`
 - `Collection_Year`
 
+The packaged region mapping covers countries, selected territories, historical labels, and marine regions such as `Arctic Ocean`, `Pacific Ocean`, `Mediterranean Sea`, and `North Sea`.
+
 It also blocks common false positives such as:
 
 - `Hospital` as country
@@ -446,6 +582,8 @@ These are conservative deterministic fields. Disease words are not treated as sa
 
 FetchM2 downloads genome FASTA files from the NCBI genomes FTP structure using `Assembly Accession` and `Assembly Name`.
 
+When using the default `fetchm2_clean.csv`, sequence download operates on representative assemblies only, matching original FetchM behavior. Use `fetchm2_all_assemblies.csv` or `--keep-assembly-duplicates` only when you deliberately want both paired `GCA_*` and `GCF_*` accessions.
+
 Filtering options:
 
 - `--host`
@@ -475,6 +613,17 @@ Outputs:
 - `sequence_download_summary.csv`
 - `fetchm2_sequence_cache.sqlite3`
 
+`sequence_download_summary.csv` includes stable downstream matching columns:
+
+- `Assembly Accession`
+- `Assembly Name`
+- `BioSample`
+- `selected_for_download`
+- `download_status`
+- `sequence_file`
+- `failure_reason`
+- `ftp_path`
+
 ## Test Dataset
 
 FetchM2 includes:
@@ -510,17 +659,6 @@ FetchM2 ships deterministic rules in `src/fetchm2/data/`:
 
 These rules let the standalone tool produce richer standardized fields without needing a web database.
 
-## API Keys
-
-For NCBI, use environment variables:
-
-```bash
-export NCBI_API_KEY=YOUR_NCBI_API_KEY
-export NCBI_EMAIL=you@example.com
-```
-
-Do not put API keys in scripts, notebooks, README files, Git commits, or issue reports.
-
 ## Validation
 
 Run local validation:
@@ -1,8 +1,10 @@
 # FetchM2
 
-
+## Overview
 
-FetchM2 is
+FetchM2 is a comprehensive standalone command-line toolkit for bacterial genome metadata retrieval, deterministic metadata standardization, metadata analysis, audit/validation reporting, and optional genome sequence download from NCBI Genome Datasets exports.
+
+FetchM2 is designed as the updated successor to the original FetchM standalone tool. It keeps the practical FetchM command-line workflow, but adds expanded host taxonomy fields, source/sample/environment standardization, geography and collection-year recovery, production-readiness audits, richer sequence-download filters, and reproducible test data.
 
 Recommended one-command workflow:
 
@@ -10,16 +12,42 @@ Recommended one-command workflow:
 fetchm2 run --input ncbi_dataset.tsv --outdir results --download
 ```
 
-
+The tool is intended primarily for bacterial genome datasets. It can process other NCBI Genome Datasets TSV/CSV exports, but metadata conventions outside bacterial datasets may be less consistent.
+
+## Workflow
+
+FetchM2 starts from an NCBI Genome Datasets TSV/CSV, retrieves linked BioSample metadata when requested, standardizes metadata fields with packaged deterministic rules, generates analysis/audit outputs, and optionally downloads FASTA genome sequences.
+
+Typical flow:
+
+```text
+NCBI ncbi_dataset.tsv/csv
+|
+v
+BioSample metadata retrieval or offline metadata parsing
+|
+v
+Deterministic standardization
+|
+v
+Clean metadata + analysis tables/figures + audit reports
+|
+v
+Optional filtered sequence download
+```
+
+## Features
 
 - Standalone command-line tool installable with `pip` or a conda environment.
 - Reads NCBI Genome Datasets TSV/CSV exports.
 - Optionally fetches linked BioSample metadata from NCBI with retry, cache, and fallback lookup support.
 - Supports offline analysis when metadata columns are already present.
-- Applies packaged deterministic standardization rules.
--
+- Applies packaged deterministic standardization rules for host, source, sample, environment, geography, collection year, disease, and health state.
+- Adds `Host_SD`, `Host_TaxID`, host lineage/rank fields, `Host_Context_SD`, standardized sample/source/environment fields, `Country`, `Continent`, `Subcontinent`, and geography traceability fields.
+- Labels 238 country/territory/marine-region entries, including common territories and ocean/sea regions.
+- Writes representative clean CSV/TSV outputs plus full all-assembly outputs.
 - Generates metadata analysis tables and figures automatically.
-- Produces audit summaries and review queues.
+- Produces audit summaries, production-readiness gates, leakage checks, and review queues.
 - Downloads genome FASTA files from NCBI.
 - Supports flexible sequence-download filtering by standardized metadata.
 - Includes `test.tsv`, matching the public FetchM-style test dataset layout.
@@ -32,7 +60,7 @@ fetchm2 run --input ncbi_dataset.tsv --outdir results --download
 ```bash
 python -m venv fetchm2-env
 source fetchm2-env/bin/activate
-pip install fetchm2
+pip install fetchm2
 ```
 
 Verify:
@@ -41,6 +69,12 @@ Verify:
 fetchm2 --version
 ```
 
+To install the validated `0.1.7` GitHub release tag before the PyPI package is updated:
+
+```bash
+pip install "git+https://github.com/Tasnimul-Arabi-Anik/FetchM2.git@v0.1.7"
+```
+
 ### Option 2: conda / mamba environment
 
 Clone the repository and create the environment:
@@ -70,7 +104,51 @@ python -m pip install -e ".[dev]"
 pytest
 ```
 
-##
+## NCBI API Key
+
+FetchM2 can run without an NCBI API key, but larger BioSample retrieval jobs are more reliable with one.
+
+Create an NCBI API key from your My NCBI account, then either pass it directly:
+
+```bash
+fetchm2 metadata --input ncbi_dataset.tsv --outdir results --api-key YOUR_NCBI_API_KEY
+```
+
+Or use environment variables:
+
+```bash
+export NCBI_API_KEY=YOUR_NCBI_API_KEY
+export NCBI_EMAIL=you@example.com
+fetchm2 metadata --input ncbi_dataset.tsv --outdir results
+```
+
+Recommended request pacing:
+
+- without an API key: use `--workers 3 --sleep 0.4` for larger jobs
+- with an API key: `--workers 6 --sleep 0.15` is usually reasonable
+
+FetchM2 keeps a persistent SQLite BioSample cache in `metadata_output/fetchm2_biosample_cache.sqlite3`, so repeated runs do not refetch BioSamples that were already resolved.
+
+Do not put API keys in scripts, notebooks, README files, Git commits, or issue reports.
+
+## Usage
+
+### Recommended All-In-One Workflow
+
+```bash
+fetchm2 run --input ncbi_dataset.tsv --outdir results --download
+```
+
+This command:
+
+1. reads the NCBI genome export
+2. filters rows if `--ani` and/or `--checkm` are provided
+3. retrieves linked BioSample metadata unless `--offline` is used
+4. standardizes metadata fields
+5. writes clean tables, analysis outputs, and audit reports
+6. downloads FASTA files when `--download` is provided
+
+### Quick Start
 
 Run the bundled standalone smoke test:
 
@@ -118,6 +196,7 @@ fetchm2 run --input ncbi_dataset.tsv --outdir results --download
 3. Review the main outputs:
 
 - `results/metadata_output/fetchm2_clean.csv`
+- `results/metadata_output/fetchm2_all_assemblies.csv`
 - `results/metadata_analysis/metadata_analysis_report.md`
 - `results/audit/standardization_audit.md`
 - `results/audit/production_readiness_gate.md`
@@ -278,6 +357,14 @@ FetchM2 writes:
 
 - `metadata_output/fetchm2_clean.csv`
 - `metadata_output/fetchm2_clean.tsv`
+- `metadata_output/fetchm2_clean_compat.csv`
+- `metadata_output/ncbi_clean.csv`
+- `metadata_output/fetchm2_all_assemblies.csv`
+- `metadata_output/fetchm2_all_assemblies.tsv`
+- `metadata_output/sample_map.csv`
+- `metadata_output/metadata_completeness.csv`
+- `metadata_output/metadata_bias_warning.txt`
+- `metadata_output/fetchm2_manifest.json`
 - `metadata_output/fetchm2_report.md`
 - `audit/standardization_summary.csv`
 - `audit/top_host_review_needed.csv`
@@ -295,6 +382,14 @@ results/
 ├── metadata_output/
 │ ├── fetchm2_clean.csv
 │ ├── fetchm2_clean.tsv
+│ ├── fetchm2_clean_compat.csv
+│ ├── ncbi_clean.csv
+│ ├── fetchm2_all_assemblies.csv
+│ ├── fetchm2_all_assemblies.tsv
+│ ├── sample_map.csv
+│ ├── metadata_completeness.csv
+│ ├── metadata_bias_warning.txt
+│ ├── fetchm2_manifest.json
 │ └── fetchm2_report.md
 ├── metadata_analysis/
 │ ├── metadata_analysis_report.md
@@ -323,6 +418,41 @@ results/
 └── fetchm2_sequence_cache.sqlite3
 ```
 
+By default, `fetchm2_clean.csv` follows original FetchM behavior: it selects one representative row per `Assembly Name`, preferring RefSeq `GCF_*` over GenBank `GCA_*` when both are present. This prevents paired GCA/GCF assemblies sharing the same BioSample from being double-counted in downstream prevalence analyses. The full row-preserving output is still saved as `fetchm2_all_assemblies.csv`.
+
+If you intentionally want paired GCA/GCF rows retained in `fetchm2_clean.csv`, use:
+
+```bash
+fetchm2 metadata --input ncbi_dataset.tsv --outdir results --keep-assembly-duplicates
+```
+
+For PanR2/PanResistome-style downstream pipelines, FetchM2 always includes these compatibility columns in `fetchm2_clean.csv`, even when values are blank:
+
+- `Assembly Accession`
+- `Assembly Name`
+- `Assembly BioSample Accession`
+- `Organism Name`
+- `Geographic Location`
+- `Continent`
+- `Subcontinent`
+- `Collection Date`
+- `Collection_Year`
+- `Host`
+- `Host_SD`
+- `Isolation_Source`
+- `Isolation_Source_SD`
+- `Sample_Type_SD`
+- `Environment_Medium_SD`
+
+`sample_map.csv` provides stable sequence-analysis matching columns:
+
+- `sample_id`
+- `Assembly Accession`
+- `Assembly Name`
+- `sequence_file`
+
+Assembly accession versions such as `GCF_000123456.1` are preserved.
+
 ## Standardized Metadata Fields
 
 FetchM2 keeps the original input columns and adds standardized fields.
@@ -387,8 +517,14 @@ FetchM2 standardizes:
 - `Country`
 - `Continent`
 - `Subcontinent`
+- `Country_Source`
+- `Country_Confidence`
+- `Country_Evidence`
+- `Geo_Recovery_Status`
 - `Collection_Year`
 
+The packaged region mapping covers countries, selected territories, historical labels, and marine regions such as `Arctic Ocean`, `Pacific Ocean`, `Mediterranean Sea`, and `North Sea`.
+
 It also blocks common false positives such as:
 
 - `Hospital` as country
@@ -410,6 +546,8 @@ These are conservative deterministic fields. Disease words are not treated as sa
 
 FetchM2 downloads genome FASTA files from the NCBI genomes FTP structure using `Assembly Accession` and `Assembly Name`.
 
+When using the default `fetchm2_clean.csv`, sequence download operates on representative assemblies only, matching original FetchM behavior. Use `fetchm2_all_assemblies.csv` or `--keep-assembly-duplicates` only when you deliberately want both paired `GCA_*` and `GCF_*` accessions.
+
 Filtering options:
 
 - `--host`
@@ -439,6 +577,17 @@ Outputs:
 - `sequence_download_summary.csv`
 - `fetchm2_sequence_cache.sqlite3`
 
+`sequence_download_summary.csv` includes stable downstream matching columns:
+
+- `Assembly Accession`
+- `Assembly Name`
+- `BioSample`
+- `selected_for_download`
+- `download_status`
+- `sequence_file`
+- `failure_reason`
+- `ftp_path`
+
 ## Test Dataset
 
 FetchM2 includes:
@@ -474,17 +623,6 @@ FetchM2 ships deterministic rules in `src/fetchm2/data/`:
 
 These rules let the standalone tool produce richer standardized fields without needing a web database.
 
-## API Keys
-
-For NCBI, use environment variables:
-
-```bash
-export NCBI_API_KEY=YOUR_NCBI_API_KEY
-export NCBI_EMAIL=you@example.com
-```
-
-Do not put API keys in scripts, notebooks, README files, Git commits, or issue reports.
-
 ## Validation
 
 Run local validation:
@@ -0,0 +1,80 @@
+# FetchM Compatibility Notes
+
+FetchM2 is designed as a standalone CLI successor to FetchM. The points below document compatibility-sensitive behavior that affects downstream analysis.
+
+## BioSample Fetching
+
+FetchM2 follows FetchM by fetching metadata once per unique BioSample accession, not once per assembly row. If paired `GCA_*` and `GCF_*` rows share the same BioSample, FetchM2 retrieves that BioSample metadata once and applies it back to both rows before representative selection.
+
+Audit outputs report both units:
+
+- `rows`: rows in `fetchm2_clean.csv`
+- `unique_assembly_accessions`: unique assemblies represented in the clean output
+- `biosample_linked_rows`: clean rows with a BioSample accession
+- `unique_biosample_accessions`: unique BioSample accessions represented
+- `biosample_reused_extra_rows`: extra rows caused by BioSample reuse across assemblies
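
As a rough illustration of how the audit units listed above relate to each other, the sketch below computes them from a few in-memory rows. The column names `Assembly Accession` and `Assembly BioSample Accession` come from the compatibility column list. The function name `audit_units`, the placeholder accessions, and the formula for `biosample_reused_extra_rows` (linked rows minus unique BioSamples) are assumptions made for illustration, not code taken from `audit.py`.

```python
def audit_units(rows):
    """Compute row-level and BioSample-level counts for a list of dict rows.

    Hypothetical helper: the real FetchM2 audit code may differ.
    """
    accessions = {r["Assembly Accession"] for r in rows if r.get("Assembly Accession")}
    # Rows count as "linked" only when the BioSample accession is non-empty.
    linked = [r for r in rows if r.get("Assembly BioSample Accession")]
    biosamples = {r["Assembly BioSample Accession"] for r in linked}
    return {
        "rows": len(rows),
        "unique_assembly_accessions": len(accessions),
        "biosample_linked_rows": len(linked),
        "unique_biosample_accessions": len(biosamples),
        # Assumed definition: linked rows beyond one per BioSample.
        "biosample_reused_extra_rows": len(linked) - len(biosamples),
    }


# Made-up placeholder accessions, not real NCBI records.
example_rows = [
    {"Assembly Accession": "GCF_000000001.1", "Assembly BioSample Accession": "SAMN00000001"},
    {"Assembly Accession": "GCF_000000002.1", "Assembly BioSample Accession": "SAMN00000001"},
    {"Assembly Accession": "GCF_000000003.1", "Assembly BioSample Accession": ""},
]
counts = audit_units(example_rows)
```

Keeping both units side by side makes it easy to choose the right denominator: assembly-level analyses use `unique_assembly_accessions`, BioSample-level analyses use `unique_biosample_accessions`.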
+
+## Representative Clean Output
+
+Original FetchM writes its main clean output after deduplicating by `Assembly Name`, prioritizing `GCF_*` rows when paired RefSeq/GenBank assemblies exist.
+
+FetchM2 follows that behavior by default:
+
+- `fetchm2_clean.csv`: one representative row per `Assembly Name`, preferring `GCF_*`.
+- `fetchm2_all_assemblies.csv`: all standardized assembly rows before representative selection.
+- `ncbi_clean.csv`: FetchM-compatible alias of `fetchm2_clean.csv`.
+- `fetchm2_clean_compat.csv`: explicit compatibility alias of `fetchm2_clean.csv`.
+
+Use `--keep-assembly-duplicates` if you intentionally want paired `GCA_*` and `GCF_*` rows retained in `fetchm2_clean.csv`.
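
The representative rule described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the FetchM2 implementation: `select_representatives` and its `keep_duplicates` parameter are hypothetical names, every row is assumed to carry a non-empty `Assembly Name`, and the accessions are made-up placeholders.

```python
def select_representatives(rows, keep_duplicates=False):
    """Keep one row per Assembly Name, preferring GCF_* over GCA_*.

    Hypothetical sketch; keep_duplicates mimics the effect described for
    --keep-assembly-duplicates by returning every row unchanged.
    """
    if keep_duplicates:
        return list(rows)
    chosen = {}  # Assembly Name -> representative row (insertion-ordered)
    for row in rows:
        name = row["Assembly Name"]
        current = chosen.get(name)
        if current is None:
            chosen[name] = row
        elif (row["Assembly Accession"].startswith("GCF_")
              and not current["Assembly Accession"].startswith("GCF_")):
            # A RefSeq row replaces a previously seen GenBank row.
            chosen[name] = row
    return list(chosen.values())


paired_rows = [
    {"Assembly Accession": "GCA_000000001.1", "Assembly Name": "ExampleAsm1"},
    {"Assembly Accession": "GCF_000000001.1", "Assembly Name": "ExampleAsm1"},
    {"Assembly Accession": "GCA_000000002.1", "Assembly Name": "ExampleAsm2"},
]
representatives = select_representatives(paired_rows)
all_rows = select_representatives(paired_rows, keep_duplicates=True)
```

A GenBank-only assembly (`ExampleAsm2` here) is still kept, since the preference only applies when both members of a pair are present.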
+
+## Stable Pipeline Contract
+
+FetchM2 always writes the following downstream-facing columns in `fetchm2_clean.csv`, even when values are blank:
+
+- `Assembly Accession`
+- `Assembly Name`
+- `Assembly BioSample Accession`
+- `Organism Name`
+- `Geographic Location`
+- `Continent`
+- `Subcontinent`
+- `Collection Date`
+- `Collection_Year`
+- `Host`
+- `Host_SD`
+- `Isolation_Source`
+- `Isolation_Source_SD`
+- `Sample_Type_SD`
+- `Environment_Medium_SD`
+
+Assembly accession versions are preserved. For example, `GCF_000123456.1` remains `GCF_000123456.1`.
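
A downstream pipeline can verify this contract before consuming the file. The sketch below is one possible check using only the Python standard library; the helper names `missing_contract_columns` and `read_accessions` are hypothetical, and the demo CSV is built in memory rather than read from a real FetchM2 run.

```python
import csv
import io

# Column names copied from the contract list above.
REQUIRED_COLUMNS = [
    "Assembly Accession", "Assembly Name", "Assembly BioSample Accession",
    "Organism Name", "Geographic Location", "Continent", "Subcontinent",
    "Collection Date", "Collection_Year", "Host", "Host_SD",
    "Isolation_Source", "Isolation_Source_SD", "Sample_Type_SD",
    "Environment_Medium_SD",
]


def missing_contract_columns(csv_text):
    """Return the contract columns absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [name for name in REQUIRED_COLUMNS if name not in header]


def read_accessions(csv_text):
    """Return Assembly Accession values exactly as stored (version suffix kept)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["Assembly Accession"] for row in reader]


# Build a small stand-in for fetchm2_clean.csv with mostly blank values.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=REQUIRED_COLUMNS)
writer.writeheader()
writer.writerow({"Assembly Accession": "GCF_000123456.1", "Assembly Name": "ExampleAsm"})
demo_csv = buffer.getvalue()

missing = missing_contract_columns(demo_csv)
accessions = read_accessions(demo_csv)
```

To check a real run, read `results/metadata_output/fetchm2_clean.csv` as text and pass it to the same functions.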
|
|
51
|
+
|
|
52
|
+
FetchM2 also writes:
|
|
53
|
+
|
|
54
|
+
- `sample_map.csv` for stable sequence-analysis sample IDs.
|
|
55
|
+
- `metadata_completeness.csv` for field-level coverage checks.
|
|
56
|
+
- `metadata_bias_warning.txt` for low-coverage and representative-selection warnings.
|
|
57
|
+
- `fetchm2_manifest.json` for machine-readable run metadata.

## Sequence Download Unit

Sequence download from the default `fetchm2_clean.csv` operates on representative assemblies, matching FetchM and avoiding duplicate GCA/GCF downloads for the same assembly name.

If downstream prevalence or resistome tools operate at assembly/genome level, use `Assembly Accession` as the denominator. If they operate at BioSample level, deduplicate intentionally and document the representative rule.
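To make the two denominators concrete, here is a small sketch that counts both from the same rows. The function name `denominators` is hypothetical; the column names are taken from the contract columns of `fetchm2_clean.csv`.

```python
from typing import Dict, Iterable


def denominators(rows: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Count distinct assemblies and distinct BioSamples in the given rows."""
    assemblies = set()
    biosamples = set()
    for row in rows:
        accession = row.get("Assembly Accession", "").strip()
        biosample = row.get("Assembly BioSample Accession", "").strip()
        if accession:
            assemblies.add(accession)
        if biosample:
            biosamples.add(biosample)
    return {"assemblies": len(assemblies), "biosamples": len(biosamples)}
```

When the two counts differ, several assemblies share a BioSample, and a BioSample-level analysis needs an explicit rule for which assembly represents each sample.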

## Cache and Fallback Behavior

FetchM2 aligns with FetchM's BioSample fallback behavior:

- direct BioSample XML is tried first
- esummary fallback is used when direct XML lacks usable attributes
- recovered fallback XML is cached, not the incomplete direct XML
- stale incomplete cache entries are ignored and retried

This avoids rerun regressions where a recovered BioSample would later appear missing because incomplete direct XML had been cached.
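The four rules above can be sketched as a single lookup function. This is an illustration under stated assumptions, not FetchM2's code: the names `get_biosample_xml` and `has_usable_attributes` are hypothetical, the two fetchers are injected callables rather than real NCBI calls, and the completeness test (presence of an `<Attribute` element) is a stand-in for whatever check FetchM2 actually applies.

```python
from typing import Callable, Dict, Optional

Fetcher = Callable[[str], Optional[str]]


def has_usable_attributes(xml_text: Optional[str]) -> bool:
    # Assumed completeness test: at least one <Attribute ...> element.
    return bool(xml_text) and "<Attribute" in xml_text


def get_biosample_xml(
    accession: str,
    cache: Dict[str, str],
    fetch_direct: Fetcher,
    fetch_esummary: Fetcher,
) -> Optional[str]:
    """Return usable BioSample XML, caching only complete records."""
    cached = cache.get(accession)
    if has_usable_attributes(cached):
        return cached
    # A missing or stale incomplete cache entry is ignored and retried.
    xml_text = fetch_direct(accession)
    if not has_usable_attributes(xml_text):
        # Direct XML lacks usable attributes, so fall back.
        xml_text = fetch_esummary(accession)
    if has_usable_attributes(xml_text):
        # Cache the recovered record, never the incomplete direct XML.
        cache[accession] = xml_text
        return xml_text
    return None
```

Because only complete records are written to the cache, a rerun reads the recovered record instead of repeating the incomplete direct lookup.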

## Known Intentional Differences

FetchM2 adds standardized fields, richer audits, production gate outputs, and additional sequence filters. These are additions, not replacements for the core FetchM workflow.

FetchM2 does not generate FetchM's DOCX report; it focuses on CSV/TSV, Markdown, JSON, and figure outputs that are easier to use in automated CLI workflows.

@@ -0,0 +1,67 @@

# FetchM2 Sequence Download

FetchM2 downloads genome FASTA files from the public NCBI genomes FTP layout using the assembly accession and assembly name in `fetchm2_clean.csv`.

By default, `fetchm2_clean.csv` contains one representative row per `Assembly Name`, preferring RefSeq `GCF_*` over GenBank `GCA_*` when paired rows share the same assembly name. This follows original FetchM behavior and prevents duplicate downloads/counting for paired GCA/GCF assemblies. If you need all input assembly rows, use `fetchm2_all_assemblies.csv` or rerun metadata with `--keep-assembly-duplicates`.
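For orientation, the public NCBI genomes layout places each assembly under a directory derived from the accession digits and the assembly name. The sketch below builds such a URL. It is not FetchM2's own URL builder: the function name `genome_fasta_url` is hypothetical, and replacing spaces with underscores in the assembly name is an assumption about how names are sanitized.

```python
import re

NCBI_GENOMES_BASE = "https://ftp.ncbi.nlm.nih.gov/genomes/all"


def genome_fasta_url(accession: str, assembly_name: str) -> str:
    """Build the expected genomic FASTA URL for a versioned accession."""
    match = re.fullmatch(r"(GC[AF])_(\d{9})\.(\d+)", accession)
    if match is None:
        raise ValueError(f"not a versioned assembly accession: {accession!r}")
    prefix, digits, _version = match.groups()
    # Assumption: spaces in the assembly name become underscores.
    safe_name = assembly_name.strip().replace(" ", "_")
    stem = f"{accession}_{safe_name}"
    # The nine digits are split into three directory levels.
    triplets = "/".join(digits[i:i + 3] for i in range(0, 9, 3))
    return f"{NCBI_GENOMES_BASE}/{prefix}/{triplets}/{stem}/{stem}_genomic.fna.gz"
```

The accession version is part of the directory and file name, which is one reason the version must be preserved in `Assembly Accession`.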

## Typical Workflow

```bash
fetchm2 metadata --input ncbi_dataset.tsv --outdir results
fetchm2 seq --input results/metadata_output/fetchm2_clean.csv --outdir results/sequence
```

## Filtered Download

You can filter using standardized metadata:

```bash
fetchm2 seq \
  --input results/metadata_output/fetchm2_clean.csv \
  --outdir results/sequence_human_bd \
  --host "Homo sapiens" \
  --country Bangladesh \
  --year-from 2018 \
  --year-to 2024
```

Supported filters include:

- `--host`
- `--host-rank`
- `--country`
- `--continent`
- `--subcontinent`
- `--sample-type`
- `--isolation-source`
- `--environment-medium`
- `--year-from` and `--year-to`
- `--max-genomes`
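If you want to preview which rows a host, country, and year filter would select before running `fetchm2 seq`, a rough equivalent can be written against the clean CSV rows. This sketch makes several assumptions that may not match FetchM2's matching rules: it compares `--host` against `Host_SD` case-insensitively, treats `--country` as a prefix of `Geographic Location`, and drops rows whose `Collection_Year` is not a plain integer when a year bound is given. The function name `filter_rows` is hypothetical.

```python
from typing import Dict, Iterable, List, Optional


def filter_rows(
    rows: Iterable[Dict[str, str]],
    host: Optional[str] = None,
    country: Optional[str] = None,
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
) -> List[Dict[str, str]]:
    """Select rows matching the given host, country, and year range."""
    selected = []
    for row in rows:
        if host and row.get("Host_SD", "").strip().lower() != host.lower():
            continue
        location = row.get("Geographic Location", "").strip().lower()
        if country and not location.startswith(country.lower()):
            continue
        if year_from is not None or year_to is not None:
            year_text = row.get("Collection_Year", "").strip()
            if not year_text.isdigit():
                continue  # assumed: rows without a usable year are excluded
            year = int(year_text)
            if year_from is not None and year < year_from:
                continue
            if year_to is not None and year > year_to:
                continue
        selected.append(row)
    return selected
```

The CLI remains the source of truth for what is downloaded; a preview like this is only useful for estimating how many genomes a filter combination is likely to keep.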

## Check Only

Use `--check-only` to compare expected accessions against an output directory without downloading:

```bash
fetchm2 seq --input fetchm2_clean.csv --outdir sequence --check-only
```
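Conceptually, a check like this compares the accessions expected from the metadata with the file names already present. The sketch below shows that comparison only; it is not the `--check-only` implementation, the name `missing_accessions` is hypothetical, and the assumption that each FASTA file name begins with the versioned accession followed by `_` or `.` may not match FetchM2's naming exactly.

```python
from typing import Iterable, List


def missing_accessions(expected: Iterable[str], filenames: Iterable[str]) -> List[str]:
    """Return expected accessions that have no matching file name."""
    names = list(filenames)
    missing = []
    for accession in expected:
        # Require a boundary after the accession so that ".1" does not match ".10".
        prefixes = (accession + "_", accession + ".")
        if not any(name.startswith(prefixes) for name in names):
            missing.append(accession)
    return missing
```

In practice the second argument could come from `os.listdir("sequence")`, with the first read from the `Assembly Accession` column of `fetchm2_clean.csv`.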

## Stable Summary Output

FetchM2 always writes:

- `sequence_download_summary.csv`
- `failed_accessions.txt`

`sequence_download_summary.csv` includes these stable downstream matching columns:

- `Assembly Accession`
- `Assembly Name`
- `BioSample`
- `selected_for_download`
- `download_status`
- `sequence_file`
- `failure_reason`
- `ftp_path`

The `sequence_file` value matches the expected FASTA basename used by FetchM2. Assembly accession versions are preserved so downstream tools can match ABRicate, MLST, MobileElementFinder, IntegronFinder, DefenseFinder, PanR2, and PanResistome outputs by `Assembly Accession` or by `sample_id` from `metadata_output/sample_map.csv`.
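As an illustration of matching by `Assembly Accession`, the following sketch indexes the summary and attaches `BioSample` and `sequence_file` to rows from a downstream tool. The function names and the shape of the tool rows are hypothetical; the only assumption about the tool output is that each row carries a versioned `Assembly Accession`.

```python
import csv
import io
from typing import Dict, Iterable, List


def index_summary(summary_csv_text: str) -> Dict[str, Dict[str, str]]:
    """Index sequence_download_summary.csv rows by versioned accession."""
    reader = csv.DictReader(io.StringIO(summary_csv_text))
    return {row["Assembly Accession"]: row for row in reader}


def attach_sequence_files(
    tool_rows: Iterable[Dict[str, str]],
    summary_index: Dict[str, Dict[str, str]],
) -> List[Dict[str, str]]:
    """Add BioSample and sequence_file to each downstream tool row."""
    merged = []
    for row in tool_rows:
        # Exact match on the versioned accession; unmatched rows get blanks.
        summary = summary_index.get(row["Assembly Accession"], {})
        merged.append({
            **row,
            "BioSample": summary.get("BioSample", ""),
            "sequence_file": summary.get("sequence_file", ""),
        })
    return merged
```

Matching on the exact versioned accession avoids silently joining results from different versions of the same assembly.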