fetchm2 0.1.4__tar.gz → 0.1.7__tar.gz

Files changed (42)
  1. {fetchm2-0.1.4/src/fetchm2.egg-info → fetchm2-0.1.7}/PKG-INFO +158 -20
  2. {fetchm2-0.1.4 → fetchm2-0.1.7}/README.md +157 -19
  3. fetchm2-0.1.7/docs/FETCHM_COMPATIBILITY.md +80 -0
  4. fetchm2-0.1.7/docs/SEQUENCE_DOWNLOAD.md +67 -0
  5. {fetchm2-0.1.4 → fetchm2-0.1.7}/docs/STANDARDIZATION.md +4 -2
  6. {fetchm2-0.1.4 → fetchm2-0.1.7}/docs/VALIDATION_REPORT.md +158 -4
  7. {fetchm2-0.1.4 → fetchm2-0.1.7}/pyproject.toml +1 -1
  8. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/__init__.py +1 -1
  9. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/audit.py +54 -0
  10. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/cli.py +28 -1
  11. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/data/country_mapping.json +147 -3
  12. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/metadata.py +349 -6
  13. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/sequence.py +105 -20
  14. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/standardization.py +298 -32
  15. {fetchm2-0.1.4 → fetchm2-0.1.7/src/fetchm2.egg-info}/PKG-INFO +158 -20
  16. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2.egg-info/SOURCES.txt +1 -0
  17. {fetchm2-0.1.4 → fetchm2-0.1.7}/tests/test_cli.py +109 -5
  18. fetchm2-0.1.7/tests/test_standardization.py +154 -0
  19. fetchm2-0.1.4/docs/SEQUENCE_DOWNLOAD.md +0 -45
  20. fetchm2-0.1.4/tests/test_standardization.py +0 -69
  21. {fetchm2-0.1.4 → fetchm2-0.1.7}/LICENSE +0 -0
  22. {fetchm2-0.1.4 → fetchm2-0.1.7}/MANIFEST.in +0 -0
  23. {fetchm2-0.1.4 → fetchm2-0.1.7}/docs/METADATA_ANALYSIS.md +0 -0
  24. {fetchm2-0.1.4 → fetchm2-0.1.7}/docs/RELEASE_CHECKLIST.md +0 -0
  25. {fetchm2-0.1.4 → fetchm2-0.1.7}/environment.yml +0 -0
  26. {fetchm2-0.1.4 → fetchm2-0.1.7}/examples/offline_metadata.tsv +0 -0
  27. {fetchm2-0.1.4 → fetchm2-0.1.7}/examples/test_ncbi_dataset.tsv +0 -0
  28. {fetchm2-0.1.4 → fetchm2-0.1.7}/setup.cfg +0 -0
  29. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/analysis.py +0 -0
  30. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/data/__init__.py +0 -0
  31. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/data/approved_broad_categories.csv +0 -0
  32. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/data/collection_date_reviewed_rules.csv +0 -0
  33. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/data/controlled_categories.csv +0 -0
  34. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/data/geography_reviewed_rules.csv +0 -0
  35. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/data/host_negative_rules.csv +0 -0
  36. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/data/host_synonyms.csv +0 -0
  37. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2/utils.py +0 -0
  38. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2.egg-info/dependency_links.txt +0 -0
  39. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2.egg-info/entry_points.txt +0 -0
  40. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2.egg-info/requires.txt +0 -0
  41. {fetchm2-0.1.4 → fetchm2-0.1.7}/src/fetchm2.egg-info/top_level.txt +0 -0
  42. {fetchm2-0.1.4 → fetchm2-0.1.7}/test.tsv +0 -0
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: fetchm2
3
- Version: 0.1.4
3
+ Version: 0.1.7
4
4
  Summary: Standalone comprehensive genome metadata standardization and sequence download toolkit.
5
5
  Author-email: Tasnimul Arabi Anik <arabianik987@gmail.com>
6
6
  License-Expression: MIT
@@ -36,9 +36,11 @@ Dynamic: license-file
36
36
 
37
37
  # FetchM2
38
38
 
39
- FetchM2 is a comprehensive standalone command-line toolkit for bacterial genome metadata analysis, metadata standardization, audit reporting, and optional genome sequence download.
39
+ ## Overview
40
40
 
41
- FetchM2 is designed as the updated successor to the original FetchM standalone tool. It keeps the same practical command-line workflow, but adds many more standardized metadata fields, richer filtering, packaged curation rules, audit outputs, and reproducible test data.
41
+ FetchM2 is a comprehensive standalone command-line toolkit for bacterial genome metadata retrieval, deterministic metadata standardization, metadata analysis, audit/validation reporting, and optional genome sequence download from NCBI Genome Datasets exports.
42
+
43
+ FetchM2 is designed as the updated successor to the original FetchM standalone tool. It keeps the practical FetchM command-line workflow, but adds expanded host taxonomy fields, source/sample/environment standardization, geography and collection-year recovery, production-readiness audits, richer sequence-download filters, and reproducible test data.
42
44
 
43
45
  Recommended one-command workflow:
44
46
 
@@ -46,16 +48,42 @@ Recommended one-command workflow:
46
48
  fetchm2 run --input ncbi_dataset.tsv --outdir results --download
47
49
  ```
48
50
 
49
- ## Key Features
51
+ The tool is intended primarily for bacterial genome datasets. It can process other NCBI Genome Datasets TSV/CSV exports, but metadata conventions outside bacterial datasets may be less consistent.
52
+
53
+ ## Workflow
54
+
55
+ FetchM2 starts from an NCBI Genome Datasets TSV/CSV, retrieves linked BioSample metadata when requested, standardizes metadata fields with packaged deterministic rules, generates analysis/audit outputs, and optionally downloads FASTA genome sequences.
56
+
57
+ Typical flow:
58
+
59
+ ```text
60
+ NCBI ncbi_dataset.tsv/csv
61
+ |
62
+ v
63
+ BioSample metadata retrieval or offline metadata parsing
64
+ |
65
+ v
66
+ Deterministic standardization
67
+ |
68
+ v
69
+ Clean metadata + analysis tables/figures + audit reports
70
+ |
71
+ v
72
+ Optional filtered sequence download
73
+ ```
74
+
75
+ ## Features
50
76
 
51
77
  - Standalone command-line tool installable with `pip` or a conda environment.
52
78
  - Reads NCBI Genome Datasets TSV/CSV exports.
53
79
  - Optionally fetches linked BioSample metadata from NCBI with retry, cache, and fallback lookup support.
54
80
  - Supports offline analysis when metadata columns are already present.
55
- - Applies packaged deterministic standardization rules.
56
- - Writes clean CSV and TSV metadata outputs.
81
+ - Applies packaged deterministic standardization rules for host, source, sample, environment, geography, collection year, disease, and health state.
82
+ - Adds `Host_SD`, `Host_TaxID`, host lineage/rank fields, `Host_Context_SD`, standardized sample/source/environment fields, `Country`, `Continent`, `Subcontinent`, and geography traceability fields.
83
+ - Labels 238 country/territory/marine-region entries, including common territories and ocean/sea regions.
84
+ - Writes representative clean CSV/TSV outputs plus full all-assembly outputs.
57
85
  - Generates metadata analysis tables and figures automatically.
58
- - Produces audit summaries and review queues.
86
+ - Produces audit summaries, production-readiness gates, leakage checks, and review queues.
59
87
  - Downloads genome FASTA files from NCBI.
60
88
  - Supports flexible sequence-download filtering by standardized metadata.
61
89
  - Includes `test.tsv`, matching the public FetchM-style test dataset layout.
@@ -68,7 +96,7 @@ fetchm2 run --input ncbi_dataset.tsv --outdir results --download
68
96
  ```bash
69
97
  python -m venv fetchm2-env
70
98
  source fetchm2-env/bin/activate
71
- pip install fetchm2==0.1.4
99
+ pip install fetchm2
72
100
  ```
73
101
 
74
102
  Verify:
@@ -77,6 +105,12 @@ Verify:
77
105
  fetchm2 --version
78
106
  ```
79
107
 
108
+ To install the validated `0.1.7` GitHub release tag before the PyPI package is updated:
109
+
110
+ ```bash
111
+ pip install "git+https://github.com/Tasnimul-Arabi-Anik/FetchM2.git@v0.1.7"
112
+ ```
113
+
80
114
  ### Option 2: conda / mamba environment
81
115
 
82
116
  Clone the repository and create the environment:
@@ -106,7 +140,51 @@ python -m pip install -e ".[dev]"
106
140
  pytest
107
141
  ```
108
142
 
109
- ## Quick Start
143
+ ## NCBI API Key
144
+
145
+ FetchM2 can run without an NCBI API key, but larger BioSample retrieval jobs are more reliable with one.
146
+
147
+ Create an NCBI API key from your My NCBI account, then either pass it directly:
148
+
149
+ ```bash
150
+ fetchm2 metadata --input ncbi_dataset.tsv --outdir results --api-key YOUR_NCBI_API_KEY
151
+ ```
152
+
153
+ Or use environment variables:
154
+
155
+ ```bash
156
+ export NCBI_API_KEY=YOUR_NCBI_API_KEY
157
+ export NCBI_EMAIL=you@example.com
158
+ fetchm2 metadata --input ncbi_dataset.tsv --outdir results
159
+ ```
160
+
161
+ Recommended request pacing:
162
+
163
+ - without an API key: use `--workers 3 --sleep 0.4` for larger jobs
164
+ - with an API key: `--workers 6 --sleep 0.15` is usually reasonable
165
+
166
+ FetchM2 keeps a persistent SQLite BioSample cache in `metadata_output/fetchm2_biosample_cache.sqlite3`, so repeated runs do not refetch BioSamples that were already resolved.
167
+
168
+ Do not put API keys in scripts, notebooks, README files, Git commits, or issue reports.
169
+
170
+ ## Usage
171
+
172
+ ### Recommended All-In-One Workflow
173
+
174
+ ```bash
175
+ fetchm2 run --input ncbi_dataset.tsv --outdir results --download
176
+ ```
177
+
178
+ This command:
179
+
180
+ 1. reads the NCBI genome export
181
+ 2. filters rows if `--ani` and/or `--checkm` are provided
182
+ 3. retrieves linked BioSample metadata unless `--offline` is used
183
+ 4. standardizes metadata fields
184
+ 5. writes clean tables, analysis outputs, and audit reports
185
+ 6. downloads FASTA files when `--download` is provided
186
+
187
+ ### Quick Start
110
188
 
111
189
  Run the bundled standalone smoke test:
112
190
 
@@ -154,6 +232,7 @@ fetchm2 run --input ncbi_dataset.tsv --outdir results --download
154
232
  3. Review the main outputs:
155
233
 
156
234
  - `results/metadata_output/fetchm2_clean.csv`
235
+ - `results/metadata_output/fetchm2_all_assemblies.csv`
157
236
  - `results/metadata_analysis/metadata_analysis_report.md`
158
237
  - `results/audit/standardization_audit.md`
159
238
  - `results/audit/production_readiness_gate.md`
@@ -314,6 +393,14 @@ FetchM2 writes:
314
393
 
315
394
  - `metadata_output/fetchm2_clean.csv`
316
395
  - `metadata_output/fetchm2_clean.tsv`
396
+ - `metadata_output/fetchm2_clean_compat.csv`
397
+ - `metadata_output/ncbi_clean.csv`
398
+ - `metadata_output/fetchm2_all_assemblies.csv`
399
+ - `metadata_output/fetchm2_all_assemblies.tsv`
400
+ - `metadata_output/sample_map.csv`
401
+ - `metadata_output/metadata_completeness.csv`
402
+ - `metadata_output/metadata_bias_warning.txt`
403
+ - `metadata_output/fetchm2_manifest.json`
317
404
  - `metadata_output/fetchm2_report.md`
318
405
  - `audit/standardization_summary.csv`
319
406
  - `audit/top_host_review_needed.csv`
@@ -331,6 +418,14 @@ results/
331
418
  ├── metadata_output/
332
419
  │ ├── fetchm2_clean.csv
333
420
  │ ├── fetchm2_clean.tsv
421
+ │ ├── fetchm2_clean_compat.csv
422
+ │ ├── ncbi_clean.csv
423
+ │ ├── fetchm2_all_assemblies.csv
424
+ │ ├── fetchm2_all_assemblies.tsv
425
+ │ ├── sample_map.csv
426
+ │ ├── metadata_completeness.csv
427
+ │ ├── metadata_bias_warning.txt
428
+ │ ├── fetchm2_manifest.json
334
429
  │ └── fetchm2_report.md
335
430
  ├── metadata_analysis/
336
431
  │ ├── metadata_analysis_report.md
@@ -359,6 +454,41 @@ results/
359
454
  └── fetchm2_sequence_cache.sqlite3
360
455
  ```
361
456
 
457
+ By default, `fetchm2_clean.csv` follows original FetchM behavior: it selects one representative row per `Assembly Name`, preferring RefSeq `GCF_*` over GenBank `GCA_*` when both are present. This prevents paired GCA/GCF assemblies sharing the same BioSample from being double-counted in downstream prevalence analyses. The full row-preserving output is still saved as `fetchm2_all_assemblies.csv`.
458
+
459
+ If you intentionally want paired GCA/GCF rows retained in `fetchm2_clean.csv`, use:
460
+
461
+ ```bash
462
+ fetchm2 metadata --input ncbi_dataset.tsv --outdir results --keep-assembly-duplicates
463
+ ```
464
+
465
+ For PanR2/PanResistome-style downstream pipelines, FetchM2 always includes these compatibility columns in `fetchm2_clean.csv`, even when values are blank:
466
+
467
+ - `Assembly Accession`
468
+ - `Assembly Name`
469
+ - `Assembly BioSample Accession`
470
+ - `Organism Name`
471
+ - `Geographic Location`
472
+ - `Continent`
473
+ - `Subcontinent`
474
+ - `Collection Date`
475
+ - `Collection_Year`
476
+ - `Host`
477
+ - `Host_SD`
478
+ - `Isolation_Source`
479
+ - `Isolation_Source_SD`
480
+ - `Sample_Type_SD`
481
+ - `Environment_Medium_SD`
482
+
483
+ `sample_map.csv` provides stable sequence-analysis matching columns:
484
+
485
+ - `sample_id`
486
+ - `Assembly Accession`
487
+ - `Assembly Name`
488
+ - `sequence_file`
489
+
490
+ Assembly accession versions such as `GCF_000123456.1` are preserved.
491
+
362
492
  ## Standardized Metadata Fields
363
493
 
364
494
  FetchM2 keeps the original input columns and adds standardized fields.
@@ -423,8 +553,14 @@ FetchM2 standardizes:
423
553
  - `Country`
424
554
  - `Continent`
425
555
  - `Subcontinent`
556
+ - `Country_Source`
557
+ - `Country_Confidence`
558
+ - `Country_Evidence`
559
+ - `Geo_Recovery_Status`
426
560
  - `Collection_Year`
427
561
 
562
+ The packaged region mapping covers countries, selected territories, historical labels, and marine regions such as `Arctic Ocean`, `Pacific Ocean`, `Mediterranean Sea`, and `North Sea`.
563
+
428
564
  It also blocks common false positives such as:
429
565
 
430
566
  - `Hospital` as country
@@ -446,6 +582,8 @@ These are conservative deterministic fields. Disease words are not treated as sa
446
582
 
447
583
  FetchM2 downloads genome FASTA files from the NCBI genomes FTP structure using `Assembly Accession` and `Assembly Name`.
448
584
 
585
+ When using the default `fetchm2_clean.csv`, sequence download operates on representative assemblies only, matching original FetchM behavior. Use `fetchm2_all_assemblies.csv` or `--keep-assembly-duplicates` only when you deliberately want both paired `GCA_*` and `GCF_*` accessions.
586
+
449
587
  Filtering options:
450
588
 
451
589
  - `--host`
@@ -475,6 +613,17 @@ Outputs:
475
613
  - `sequence_download_summary.csv`
476
614
  - `fetchm2_sequence_cache.sqlite3`
477
615
 
616
+ `sequence_download_summary.csv` includes stable downstream matching columns:
617
+
618
+ - `Assembly Accession`
619
+ - `Assembly Name`
620
+ - `BioSample`
621
+ - `selected_for_download`
622
+ - `download_status`
623
+ - `sequence_file`
624
+ - `failure_reason`
625
+ - `ftp_path`
626
+
478
627
  ## Test Dataset
479
628
 
480
629
  FetchM2 includes:
@@ -510,17 +659,6 @@ FetchM2 ships deterministic rules in `src/fetchm2/data/`:
510
659
 
511
660
  These rules let the standalone tool produce richer standardized fields without needing a web database.
512
661
 
513
- ## API Keys
514
-
515
- For NCBI, use environment variables:
516
-
517
- ```bash
518
- export NCBI_API_KEY=YOUR_NCBI_API_KEY
519
- export NCBI_EMAIL=you@example.com
520
- ```
521
-
522
- Do not put API keys in scripts, notebooks, README files, Git commits, or issue reports.
523
-
524
662
  ## Validation
525
663
 
526
664
  Run local validation:
@@ -1,8 +1,10 @@
1
1
  # FetchM2
2
2
 
3
- FetchM2 is a comprehensive standalone command-line toolkit for bacterial genome metadata analysis, metadata standardization, audit reporting, and optional genome sequence download.
3
+ ## Overview
4
4
 
5
- FetchM2 is designed as the updated successor to the original FetchM standalone tool. It keeps the same practical command-line workflow, but adds many more standardized metadata fields, richer filtering, packaged curation rules, audit outputs, and reproducible test data.
5
+ FetchM2 is a comprehensive standalone command-line toolkit for bacterial genome metadata retrieval, deterministic metadata standardization, metadata analysis, audit/validation reporting, and optional genome sequence download from NCBI Genome Datasets exports.
6
+
7
+ FetchM2 is designed as the updated successor to the original FetchM standalone tool. It keeps the practical FetchM command-line workflow, but adds expanded host taxonomy fields, source/sample/environment standardization, geography and collection-year recovery, production-readiness audits, richer sequence-download filters, and reproducible test data.
6
8
 
7
9
  Recommended one-command workflow:
8
10
 
@@ -10,16 +12,42 @@ Recommended one-command workflow:
10
12
  fetchm2 run --input ncbi_dataset.tsv --outdir results --download
11
13
  ```
12
14
 
13
- ## Key Features
15
+ The tool is intended primarily for bacterial genome datasets. It can process other NCBI Genome Datasets TSV/CSV exports, but metadata conventions outside bacterial datasets may be less consistent.
16
+
17
+ ## Workflow
18
+
19
+ FetchM2 starts from an NCBI Genome Datasets TSV/CSV, retrieves linked BioSample metadata when requested, standardizes metadata fields with packaged deterministic rules, generates analysis/audit outputs, and optionally downloads FASTA genome sequences.
20
+
21
+ Typical flow:
22
+
23
+ ```text
24
+ NCBI ncbi_dataset.tsv/csv
25
+ |
26
+ v
27
+ BioSample metadata retrieval or offline metadata parsing
28
+ |
29
+ v
30
+ Deterministic standardization
31
+ |
32
+ v
33
+ Clean metadata + analysis tables/figures + audit reports
34
+ |
35
+ v
36
+ Optional filtered sequence download
37
+ ```
38
+
39
+ ## Features
14
40
 
15
41
  - Standalone command-line tool installable with `pip` or a conda environment.
16
42
  - Reads NCBI Genome Datasets TSV/CSV exports.
17
43
  - Optionally fetches linked BioSample metadata from NCBI with retry, cache, and fallback lookup support.
18
44
  - Supports offline analysis when metadata columns are already present.
19
- - Applies packaged deterministic standardization rules.
20
- - Writes clean CSV and TSV metadata outputs.
45
+ - Applies packaged deterministic standardization rules for host, source, sample, environment, geography, collection year, disease, and health state.
46
+ - Adds `Host_SD`, `Host_TaxID`, host lineage/rank fields, `Host_Context_SD`, standardized sample/source/environment fields, `Country`, `Continent`, `Subcontinent`, and geography traceability fields.
47
+ - Labels 238 country/territory/marine-region entries, including common territories and ocean/sea regions.
48
+ - Writes representative clean CSV/TSV outputs plus full all-assembly outputs.
21
49
  - Generates metadata analysis tables and figures automatically.
22
- - Produces audit summaries and review queues.
50
+ - Produces audit summaries, production-readiness gates, leakage checks, and review queues.
23
51
  - Downloads genome FASTA files from NCBI.
24
52
  - Supports flexible sequence-download filtering by standardized metadata.
25
53
  - Includes `test.tsv`, matching the public FetchM-style test dataset layout.
@@ -32,7 +60,7 @@ fetchm2 run --input ncbi_dataset.tsv --outdir results --download
32
60
  ```bash
33
61
  python -m venv fetchm2-env
34
62
  source fetchm2-env/bin/activate
35
- pip install fetchm2==0.1.4
63
+ pip install fetchm2
36
64
  ```
37
65
 
38
66
  Verify:
@@ -41,6 +69,12 @@ Verify:
41
69
  fetchm2 --version
42
70
  ```
43
71
 
72
+ To install the validated `0.1.7` GitHub release tag before the PyPI package is updated:
73
+
74
+ ```bash
75
+ pip install "git+https://github.com/Tasnimul-Arabi-Anik/FetchM2.git@v0.1.7"
76
+ ```
77
+
44
78
  ### Option 2: conda / mamba environment
45
79
 
46
80
  Clone the repository and create the environment:
@@ -70,7 +104,51 @@ python -m pip install -e ".[dev]"
70
104
  pytest
71
105
  ```
72
106
 
73
- ## Quick Start
107
+ ## NCBI API Key
108
+
109
+ FetchM2 can run without an NCBI API key, but larger BioSample retrieval jobs are more reliable with one.
110
+
111
+ Create an NCBI API key from your My NCBI account, then either pass it directly:
112
+
113
+ ```bash
114
+ fetchm2 metadata --input ncbi_dataset.tsv --outdir results --api-key YOUR_NCBI_API_KEY
115
+ ```
116
+
117
+ Or use environment variables:
118
+
119
+ ```bash
120
+ export NCBI_API_KEY=YOUR_NCBI_API_KEY
121
+ export NCBI_EMAIL=you@example.com
122
+ fetchm2 metadata --input ncbi_dataset.tsv --outdir results
123
+ ```
124
+
125
+ Recommended request pacing:
126
+
127
+ - without an API key: use `--workers 3 --sleep 0.4` for larger jobs
128
+ - with an API key: `--workers 6 --sleep 0.15` is usually reasonable
129
+
130
+ FetchM2 keeps a persistent SQLite BioSample cache in `metadata_output/fetchm2_biosample_cache.sqlite3`, so repeated runs do not refetch BioSamples that were already resolved.
131
+
132
+ Do not put API keys in scripts, notebooks, README files, Git commits, or issue reports.
133
+
134
+ ## Usage
135
+
136
+ ### Recommended All-In-One Workflow
137
+
138
+ ```bash
139
+ fetchm2 run --input ncbi_dataset.tsv --outdir results --download
140
+ ```
141
+
142
+ This command:
143
+
144
+ 1. reads the NCBI genome export
145
+ 2. filters rows if `--ani` and/or `--checkm` are provided
146
+ 3. retrieves linked BioSample metadata unless `--offline` is used
147
+ 4. standardizes metadata fields
148
+ 5. writes clean tables, analysis outputs, and audit reports
149
+ 6. downloads FASTA files when `--download` is provided
150
+
151
+ ### Quick Start
74
152
 
75
153
  Run the bundled standalone smoke test:
76
154
 
@@ -118,6 +196,7 @@ fetchm2 run --input ncbi_dataset.tsv --outdir results --download
118
196
  3. Review the main outputs:
119
197
 
120
198
  - `results/metadata_output/fetchm2_clean.csv`
199
+ - `results/metadata_output/fetchm2_all_assemblies.csv`
121
200
  - `results/metadata_analysis/metadata_analysis_report.md`
122
201
  - `results/audit/standardization_audit.md`
123
202
  - `results/audit/production_readiness_gate.md`
@@ -278,6 +357,14 @@ FetchM2 writes:
278
357
 
279
358
  - `metadata_output/fetchm2_clean.csv`
280
359
  - `metadata_output/fetchm2_clean.tsv`
360
+ - `metadata_output/fetchm2_clean_compat.csv`
361
+ - `metadata_output/ncbi_clean.csv`
362
+ - `metadata_output/fetchm2_all_assemblies.csv`
363
+ - `metadata_output/fetchm2_all_assemblies.tsv`
364
+ - `metadata_output/sample_map.csv`
365
+ - `metadata_output/metadata_completeness.csv`
366
+ - `metadata_output/metadata_bias_warning.txt`
367
+ - `metadata_output/fetchm2_manifest.json`
281
368
  - `metadata_output/fetchm2_report.md`
282
369
  - `audit/standardization_summary.csv`
283
370
  - `audit/top_host_review_needed.csv`
@@ -295,6 +382,14 @@ results/
295
382
  ├── metadata_output/
296
383
  │ ├── fetchm2_clean.csv
297
384
  │ ├── fetchm2_clean.tsv
385
+ │ ├── fetchm2_clean_compat.csv
386
+ │ ├── ncbi_clean.csv
387
+ │ ├── fetchm2_all_assemblies.csv
388
+ │ ├── fetchm2_all_assemblies.tsv
389
+ │ ├── sample_map.csv
390
+ │ ├── metadata_completeness.csv
391
+ │ ├── metadata_bias_warning.txt
392
+ │ ├── fetchm2_manifest.json
298
393
  │ └── fetchm2_report.md
299
394
  ├── metadata_analysis/
300
395
  │ ├── metadata_analysis_report.md
@@ -323,6 +418,41 @@ results/
323
418
  └── fetchm2_sequence_cache.sqlite3
324
419
  ```
325
420
 
421
+ By default, `fetchm2_clean.csv` follows original FetchM behavior: it selects one representative row per `Assembly Name`, preferring RefSeq `GCF_*` over GenBank `GCA_*` when both are present. This prevents paired GCA/GCF assemblies sharing the same BioSample from being double-counted in downstream prevalence analyses. The full row-preserving output is still saved as `fetchm2_all_assemblies.csv`.
422
+
423
+ If you intentionally want paired GCA/GCF rows retained in `fetchm2_clean.csv`, use:
424
+
425
+ ```bash
426
+ fetchm2 metadata --input ncbi_dataset.tsv --outdir results --keep-assembly-duplicates
427
+ ```
428
+
429
+ For PanR2/PanResistome-style downstream pipelines, FetchM2 always includes these compatibility columns in `fetchm2_clean.csv`, even when values are blank:
430
+
431
+ - `Assembly Accession`
432
+ - `Assembly Name`
433
+ - `Assembly BioSample Accession`
434
+ - `Organism Name`
435
+ - `Geographic Location`
436
+ - `Continent`
437
+ - `Subcontinent`
438
+ - `Collection Date`
439
+ - `Collection_Year`
440
+ - `Host`
441
+ - `Host_SD`
442
+ - `Isolation_Source`
443
+ - `Isolation_Source_SD`
444
+ - `Sample_Type_SD`
445
+ - `Environment_Medium_SD`
446
+
447
+ `sample_map.csv` provides stable sequence-analysis matching columns:
448
+
449
+ - `sample_id`
450
+ - `Assembly Accession`
451
+ - `Assembly Name`
452
+ - `sequence_file`
453
+
454
+ Assembly accession versions such as `GCF_000123456.1` are preserved.
455
+
326
456
  ## Standardized Metadata Fields
327
457
 
328
458
  FetchM2 keeps the original input columns and adds standardized fields.
@@ -387,8 +517,14 @@ FetchM2 standardizes:
387
517
  - `Country`
388
518
  - `Continent`
389
519
  - `Subcontinent`
520
+ - `Country_Source`
521
+ - `Country_Confidence`
522
+ - `Country_Evidence`
523
+ - `Geo_Recovery_Status`
390
524
  - `Collection_Year`
391
525
 
526
+ The packaged region mapping covers countries, selected territories, historical labels, and marine regions such as `Arctic Ocean`, `Pacific Ocean`, `Mediterranean Sea`, and `North Sea`.
527
+
392
528
  It also blocks common false positives such as:
393
529
 
394
530
  - `Hospital` as country
@@ -410,6 +546,8 @@ These are conservative deterministic fields. Disease words are not treated as sa
410
546
 
411
547
  FetchM2 downloads genome FASTA files from the NCBI genomes FTP structure using `Assembly Accession` and `Assembly Name`.
412
548
 
549
+ When using the default `fetchm2_clean.csv`, sequence download operates on representative assemblies only, matching original FetchM behavior. Use `fetchm2_all_assemblies.csv` or `--keep-assembly-duplicates` only when you deliberately want both paired `GCA_*` and `GCF_*` accessions.
550
+
413
551
  Filtering options:
414
552
 
415
553
  - `--host`
@@ -439,6 +577,17 @@ Outputs:
439
577
  - `sequence_download_summary.csv`
440
578
  - `fetchm2_sequence_cache.sqlite3`
441
579
 
580
+ `sequence_download_summary.csv` includes stable downstream matching columns:
581
+
582
+ - `Assembly Accession`
583
+ - `Assembly Name`
584
+ - `BioSample`
585
+ - `selected_for_download`
586
+ - `download_status`
587
+ - `sequence_file`
588
+ - `failure_reason`
589
+ - `ftp_path`
590
+
442
591
  ## Test Dataset
443
592
 
444
593
  FetchM2 includes:
@@ -474,17 +623,6 @@ FetchM2 ships deterministic rules in `src/fetchm2/data/`:
474
623
 
475
624
  These rules let the standalone tool produce richer standardized fields without needing a web database.
476
625
 
477
- ## API Keys
478
-
479
- For NCBI, use environment variables:
480
-
481
- ```bash
482
- export NCBI_API_KEY=YOUR_NCBI_API_KEY
483
- export NCBI_EMAIL=you@example.com
484
- ```
485
-
486
- Do not put API keys in scripts, notebooks, README files, Git commits, or issue reports.
487
-
488
626
  ## Validation
489
627
 
490
628
  Run local validation:
@@ -0,0 +1,80 @@
1
+ # FetchM Compatibility Notes
2
+
3
+ FetchM2 is designed as a standalone CLI successor to FetchM. The points below document compatibility-sensitive behavior that affects downstream analysis.
4
+
5
+ ## BioSample Fetching
6
+
7
+ FetchM2 follows FetchM by fetching metadata once per unique BioSample accession, not once per assembly row. If paired `GCA_*` and `GCF_*` rows share the same BioSample, FetchM2 retrieves that BioSample metadata once and applies it back to both rows before representative selection.
8
+
9
+ Audit outputs report both units:
10
+
11
+ - `rows`: rows in `fetchm2_clean.csv`
12
+ - `unique_assembly_accessions`: unique assemblies represented in the clean output
13
+ - `biosample_linked_rows`: clean rows with a BioSample accession
14
+ - `unique_biosample_accessions`: unique BioSample accessions represented
15
+ - `biosample_reused_extra_rows`: extra rows caused by BioSample reuse across assemblies
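The five audit units listed above can be illustrated with a small sketch. It is not FetchM2's audit code: the function name `audit_units` is hypothetical, the sample rows are invented, and it assumes `biosample_reused_extra_rows` equals BioSample-linked rows minus unique BioSample accessions, which the notes do not state explicitly.

```python
def audit_units(rows):
    """Count row-level and BioSample-level units for a clean table.

    Assumption: "reused extra rows" = linked rows - unique BioSamples.
    """
    accessions = {r["Assembly Accession"] for r in rows if r["Assembly Accession"]}
    linked = [r for r in rows if r["Assembly BioSample Accession"]]
    biosamples = {r["Assembly BioSample Accession"] for r in linked}
    return {
        "rows": len(rows),
        "unique_assembly_accessions": len(accessions),
        "biosample_linked_rows": len(linked),
        "unique_biosample_accessions": len(biosamples),
        "biosample_reused_extra_rows": len(linked) - len(biosamples),
    }


# Invented example rows: two assemblies share one BioSample, one has none.
rows = [
    {"Assembly Accession": "GCF_000000001.1", "Assembly BioSample Accession": "SAMN00000001"},
    {"Assembly Accession": "GCF_000000002.1", "Assembly BioSample Accession": "SAMN00000001"},
    {"Assembly Accession": "GCF_000000003.1", "Assembly BioSample Accession": "SAMN00000002"},
    {"Assembly Accession": "GCF_000000004.1", "Assembly BioSample Accession": ""},
]
units = audit_units(rows)
print(units)
```

Reporting both units makes it clear whether a count refers to assemblies or to BioSamples before it is used as a denominator.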
16
+
17
+ ## Representative Clean Output
18
+
19
+ Original FetchM writes its main clean output after deduplicating by `Assembly Name`, prioritizing `GCF_*` rows when paired RefSeq/GenBank assemblies exist.
20
+
21
+ FetchM2 follows that behavior by default:
22
+
23
+ - `fetchm2_clean.csv`: one representative row per `Assembly Name`, preferring `GCF_*`.
24
+ - `fetchm2_all_assemblies.csv`: all standardized assembly rows before representative selection.
25
+ - `ncbi_clean.csv`: FetchM-compatible alias of `fetchm2_clean.csv`.
26
+ - `fetchm2_clean_compat.csv`: explicit compatibility alias of `fetchm2_clean.csv`.
27
+
28
+ Use `--keep-assembly-duplicates` if you intentionally want paired `GCA_*` and `GCF_*` rows retained in `fetchm2_clean.csv`.
29
+
30
+ ## Stable Pipeline Contract
31
+
32
+ FetchM2 always writes the following downstream-facing columns in `fetchm2_clean.csv`, even when values are blank:
33
+
34
+ - `Assembly Accession`
35
+ - `Assembly Name`
36
+ - `Assembly BioSample Accession`
37
+ - `Organism Name`
38
+ - `Geographic Location`
39
+ - `Continent`
40
+ - `Subcontinent`
41
+ - `Collection Date`
42
+ - `Collection_Year`
43
+ - `Host`
44
+ - `Host_SD`
45
+ - `Isolation_Source`
46
+ - `Isolation_Source_SD`
47
+ - `Sample_Type_SD`
48
+ - `Environment_Medium_SD`
49
+
50
+ Assembly accession versions are preserved. For example, `GCF_000123456.1` remains `GCF_000123456.1`.
51
+
52
+ FetchM2 also writes:
53
+
54
+ - `sample_map.csv` for stable sequence-analysis sample IDs.
55
+ - `metadata_completeness.csv` for field-level coverage checks.
56
+ - `metadata_bias_warning.txt` for low-coverage and representative-selection warnings.
57
+ - `fetchm2_manifest.json` for machine-readable run metadata.
58
+
59
+ ## Sequence Download Unit
60
+
61
+ Sequence download from the default `fetchm2_clean.csv` operates on representative assemblies, matching FetchM and avoiding duplicate GCA/GCF downloads for the same assembly name.
62
+
63
+ If downstream prevalence or resistome tools operate at assembly/genome level, use `Assembly Accession` as the denominator. If they operate at BioSample level, deduplicate intentionally and document the representative rule.
64
+
65
+ ## Cache and Fallback Behavior
66
+
67
+ FetchM2 aligns with FetchM's BioSample fallback behavior:
68
+
69
+ - direct BioSample XML is tried first
70
+ - esummary fallback is used when direct XML lacks usable attributes
71
+ - recovered fallback XML is cached, not the incomplete direct XML
72
+ - stale incomplete cache entries are ignored and retried
73
+
74
+ This avoids rerun regressions where a recovered BioSample would later appear missing because incomplete direct XML had been cached.
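The ordering described in this section can be sketched as follows. This is a simplified illustration, not the code in `metadata.py`: records are plain dicts rather than XML, the cache is an in-memory dict rather than the SQLite cache, and `resolve_biosample`, `has_usable_attributes`, and the two fetcher callables are hypothetical names.

```python
def has_usable_attributes(record):
    # Assumption: a record is "usable" when it carries at least one attribute.
    return bool(record and record.get("attributes"))


def resolve_biosample(accession, cache, fetch_direct, fetch_fallback):
    """Return a BioSample record, caching only usable results."""
    cached = cache.get(accession)
    if has_usable_attributes(cached):
        return cached
    # A stale incomplete cache entry is ignored and the lookup is retried.
    record = fetch_direct(accession)
    if not has_usable_attributes(record):
        record = fetch_fallback(accession)
    # Cache the recovered record, never the incomplete direct result.
    if has_usable_attributes(record):
        cache[accession] = record
    return record


calls = []


def direct(accession):
    # Stub: the direct lookup returns a record without attributes.
    calls.append(("direct", accession))
    return {"attributes": {}}


def fallback(accession):
    # Stub: the fallback lookup recovers one attribute.
    calls.append(("fallback", accession))
    return {"attributes": {"host": "Homo sapiens"}}


cache = {"SAMN00000001": {"attributes": {}}}  # stale incomplete entry
first = resolve_biosample("SAMN00000001", cache, direct, fallback)
second = resolve_biosample("SAMN00000001", cache, direct, fallback)
print(first, len(calls))
```

The second call is served from the cache, so a rerun does not lose the recovered record.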
75
+
76
+ ## Known Intentional Differences
77
+
78
+ FetchM2 adds standardized fields, richer audits, production gate outputs, and additional sequence filters. These are additions, not replacements for the core FetchM workflow.
79
+
80
+ FetchM2 does not generate FetchM's DOCX report; it focuses on CSV/TSV, Markdown, JSON, and figure outputs that are easier to use in automated CLI workflows.
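The representative-selection rule from "Representative Clean Output" (one row per `Assembly Name`, preferring `GCF_*` over `GCA_*`) can be sketched with the standard library. This is an illustration under stated assumptions, not FetchM2's implementation: `pick_representatives` is a hypothetical name, the sample rows are invented, and when no `GCF_*` row exists the first row seen for a name is kept.

```python
import csv
import io


def pick_representatives(rows):
    """Keep one row per Assembly Name, preferring GCF_* over GCA_*."""
    chosen = {}
    for row in rows:
        name = row["Assembly Name"]
        current = chosen.get(name)
        if current is None:
            chosen[name] = row
            continue
        # Swap in a RefSeq row only when the kept row is not already RefSeq.
        is_refseq = row["Assembly Accession"].startswith("GCF_")
        kept_is_refseq = current["Assembly Accession"].startswith("GCF_")
        if is_refseq and not kept_is_refseq:
            chosen[name] = row
    return list(chosen.values())


# Invented sample: one paired GCA/GCF assembly and one GenBank-only assembly.
SAMPLE = """Assembly Accession,Assembly Name,Assembly BioSample Accession
GCA_000123456.1,ASM12345v1,SAMN00000001
GCF_000123456.1,ASM12345v1,SAMN00000001
GCA_000999999.2,ASM99999v2,SAMN00000002
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
representatives = pick_representatives(rows)
for row in representatives:
    print(row["Assembly Accession"], row["Assembly Name"])
```

Accession versions are left untouched, in line with the stable pipeline contract above.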
@@ -0,0 +1,67 @@
1
+ # FetchM2 Sequence Download
2
+
3
+ FetchM2 downloads genome FASTA files from the public NCBI genomes FTP layout using the assembly accession and assembly name in `fetchm2_clean.csv`.
4
+
5
+ By default, `fetchm2_clean.csv` contains one representative row per `Assembly Name`, preferring RefSeq `GCF_*` over GenBank `GCA_*` when paired rows share the same assembly name. This follows original FetchM behavior and prevents duplicate downloads/counting for paired GCA/GCF assemblies. If you need all input assembly rows, use `fetchm2_all_assemblies.csv` or rerun metadata with `--keep-assembly-duplicates`.
6
+
7
+ ## Typical Workflow
8
+
9
+ ```bash
10
+ fetchm2 metadata --input ncbi_dataset.tsv --outdir results
11
+ fetchm2 seq --input results/metadata_output/fetchm2_clean.csv --outdir results/sequence
12
+ ```
13
+
14
+ ## Filtered Download
15
+
16
+ You can filter using standardized metadata:
17
+
18
+ ```bash
19
+ fetchm2 seq \
20
+ --input results/metadata_output/fetchm2_clean.csv \
21
+ --outdir results/sequence_human_bd \
22
+ --host "Homo sapiens" \
23
+ --country Bangladesh \
24
+ --year-from 2018 \
25
+ --year-to 2024
26
+ ```
27
+
28
+ Supported filters include:
29
+
30
+ - `--host`
31
+ - `--host-rank`
32
+ - `--country`
33
+ - `--continent`
34
+ - `--subcontinent`
35
+ - `--sample-type`
36
+ - `--isolation-source`
37
+ - `--environment-medium`
38
+ - `--year-from` and `--year-to`
39
+ - `--max-genomes`
40
+
41
+ ## Check Only
42
+
43
+ Use `--check-only` to compare expected accessions against an output directory without downloading:
44
+
45
+ ```bash
46
+ fetchm2 seq --input fetchm2_clean.csv --outdir sequence --check-only
47
+ ```
48
+
49
+ ## Stable Summary Output
50
+
51
+ FetchM2 always writes:
52
+
53
+ - `sequence_download_summary.csv`
54
+ - `failed_accessions.txt`
55
+
56
+ `sequence_download_summary.csv` includes these stable downstream matching columns:
57
+
58
+ - `Assembly Accession`
59
+ - `Assembly Name`
60
+ - `BioSample`
61
+ - `selected_for_download`
62
+ - `download_status`
63
+ - `sequence_file`
64
+ - `failure_reason`
65
+ - `ftp_path`
66
+
67
+ The `sequence_file` value matches the expected FASTA basename used by FetchM2. Assembly accession versions are preserved so downstream tools can match ABRicate, MLST, MobileElementFinder, IntegronFinder, DefenseFinder, PanR2, and PanResistome outputs by `Assembly Accession` or by `sample_id` from `metadata_output/sample_map.csv`.
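Matching a downstream tool table to `sequence_download_summary.csv` by versioned `Assembly Accession` might look like the sketch below. The column names come from the list above; every value is a placeholder (real `download_status`, `sequence_file`, and `ftp_path` values depend on the run), and the downstream table with its `hits` column is hypothetical.

```python
import csv
import io

# Placeholder summary content; a real file is written by `fetchm2 seq`.
SUMMARY_CSV = """Assembly Accession,Assembly Name,BioSample,selected_for_download,download_status,sequence_file,failure_reason,ftp_path
GCF_000123456.1,ASM12345v1,SAMN00000001,placeholder,placeholder,placeholder.fna,,
"""

# Hypothetical downstream tool output keyed by the same accession column.
TOOL_CSV = """Assembly Accession,hits
GCF_000123456.1,12
GCF_000999999.1,3
"""


def index_by_accession(text):
    """Index CSV rows by the full, versioned Assembly Accession."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["Assembly Accession"]: row for row in reader}


summary = index_by_accession(SUMMARY_CSV)
tool = index_by_accession(TOOL_CSV)

# Join on the exact accession string, including the version suffix.
matched = {
    acc: (summary[acc]["sequence_file"], tool[acc]["hits"])
    for acc in summary.keys() & tool.keys()
}
unmatched = sorted(tool.keys() - summary.keys())
print(matched, unmatched)
```

Joining on the exact string avoids conflating different versions of the same accession.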