fetchm2 0.1.0__tar.gz → 0.1.2__tar.gz

Files changed (46)
  1. {fetchm2-0.1.0 → fetchm2-0.1.2}/MANIFEST.in +2 -0
  2. fetchm2-0.1.2/PKG-INFO +555 -0
  3. fetchm2-0.1.2/README.md +519 -0
  4. fetchm2-0.1.2/docs/METADATA_ANALYSIS.md +94 -0
  5. {fetchm2-0.1.0 → fetchm2-0.1.2}/docs/STANDARDIZATION.md +14 -1
  6. fetchm2-0.1.2/docs/VALIDATION_REPORT.md +236 -0
  7. fetchm2-0.1.2/environment.yml +18 -0
  8. {fetchm2-0.1.0 → fetchm2-0.1.2}/pyproject.toml +1 -1
  9. {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/__init__.py +1 -2
  10. fetchm2-0.1.2/src/fetchm2/analysis.py +366 -0
  11. fetchm2-0.1.2/src/fetchm2/audit.py +419 -0
  12. {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/cli.py +93 -3
  13. fetchm2-0.1.2/src/fetchm2/data/collection_date_reviewed_rules.csv +1 -0
  14. fetchm2-0.1.2/src/fetchm2/metadata.py +476 -0
  15. {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/sequence.py +10 -0
  16. {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/standardization.py +27 -1
  17. {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/utils.py +6 -4
  18. fetchm2-0.1.2/src/fetchm2.egg-info/PKG-INFO +555 -0
  19. {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2.egg-info/SOURCES.txt +5 -0
  20. fetchm2-0.1.2/test.tsv +201 -0
  21. fetchm2-0.1.2/tests/test_cli.py +207 -0
  22. {fetchm2-0.1.0 → fetchm2-0.1.2}/tests/test_standardization.py +14 -0
  23. fetchm2-0.1.0/PKG-INFO +0 -208
  24. fetchm2-0.1.0/README.md +0 -172
  25. fetchm2-0.1.0/docs/VALIDATION_REPORT.md +0 -99
  26. fetchm2-0.1.0/src/fetchm2/audit.py +0 -126
  27. fetchm2-0.1.0/src/fetchm2/metadata.py +0 -244
  28. fetchm2-0.1.0/src/fetchm2.egg-info/PKG-INFO +0 -208
  29. fetchm2-0.1.0/tests/test_cli.py +0 -70
  30. {fetchm2-0.1.0 → fetchm2-0.1.2}/LICENSE +0 -0
  31. {fetchm2-0.1.0 → fetchm2-0.1.2}/docs/RELEASE_CHECKLIST.md +0 -0
  32. {fetchm2-0.1.0 → fetchm2-0.1.2}/docs/SEQUENCE_DOWNLOAD.md +0 -0
  33. {fetchm2-0.1.0 → fetchm2-0.1.2}/examples/offline_metadata.tsv +0 -0
  34. {fetchm2-0.1.0 → fetchm2-0.1.2}/examples/test_ncbi_dataset.tsv +0 -0
  35. {fetchm2-0.1.0 → fetchm2-0.1.2}/setup.cfg +0 -0
  36. {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/data/__init__.py +0 -0
  37. {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/data/approved_broad_categories.csv +0 -0
  38. {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/data/controlled_categories.csv +0 -0
  39. {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/data/country_mapping.json +0 -0
  40. {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/data/geography_reviewed_rules.csv +0 -0
  41. {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/data/host_negative_rules.csv +0 -0
  42. {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2/data/host_synonyms.csv +0 -0
  43. {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2.egg-info/dependency_links.txt +0 -0
  44. {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2.egg-info/entry_points.txt +0 -0
  45. {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2.egg-info/requires.txt +0 -0
  46. {fetchm2-0.1.0 → fetchm2-0.1.2}/src/fetchm2.egg-info/top_level.txt +0 -0
@@ -1,5 +1,7 @@
  include README.md
  include LICENSE
+ include test.tsv
+ include environment.yml
  recursive-include src/fetchm2/data *.csv *.json
  recursive-include examples *.tsv
  recursive-include docs *.md
fetchm2-0.1.2/PKG-INFO ADDED
@@ -0,0 +1,555 @@
+ Metadata-Version: 2.4
+ Name: fetchm2
+ Version: 0.1.2
+ Summary: Standalone comprehensive genome metadata standardization and sequence download toolkit.
+ Author-email: Tasnimul Arabi Anik <arabianik987@gmail.com>
+ License-Expression: MIT
+ Project-URL: Homepage, https://github.com/Tasnimul-Arabi-Anik/FetchM2
+ Project-URL: Repository, https://github.com/Tasnimul-Arabi-Anik/FetchM2
+ Project-URL: Issues, https://github.com/Tasnimul-Arabi-Anik/FetchM2/issues
+ Keywords: NCBI,BioSample,metadata,genomics,standardization,sequence-download
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: Environment :: Console
+ Classifier: Intended Audience :: Science/Research
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
+ Requires-Python: >=3.10
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: pandas>=2.0
+ Requires-Dist: requests>=2.31
+ Requires-Dist: tqdm>=4.66
+ Requires-Dist: matplotlib>=3.7
+ Requires-Dist: seaborn>=0.13
+ Requires-Dist: plotly>=5.20
+ Requires-Dist: kaleido<1.0.0,>=0.2.1
+ Requires-Dist: xmltodict>=0.13
+ Provides-Extra: dev
+ Requires-Dist: pytest>=8.0; extra == "dev"
+ Requires-Dist: build>=1.2; extra == "dev"
+ Requires-Dist: twine>=5.0; extra == "dev"
+ Dynamic: license-file
+
+ # FetchM2
+
+ FetchM2 is a comprehensive standalone command-line toolkit for bacterial genome metadata analysis, metadata standardization, audit reporting, and optional genome sequence download.
+
+ FetchM2 is designed as the updated successor to the original FetchM standalone tool. It keeps the same practical command-line workflow, but adds many more standardized metadata fields, richer filtering, packaged curation rules, audit outputs, and reproducible test data.
+
+ Recommended one-command workflow:
+
+ ```bash
+ fetchm2 run --input ncbi_dataset.tsv --outdir results --download
+ ```
+
+ ## Key Features
+
+ - Standalone command-line tool installable with `pip` or a conda environment.
+ - Reads NCBI Genome Datasets TSV/CSV exports.
+ - Optionally fetches linked BioSample metadata from NCBI with retry, cache, and fallback lookup support.
+ - Supports offline analysis when metadata columns are already present.
+ - Applies packaged deterministic standardization rules.
+ - Writes clean CSV and TSV metadata outputs.
+ - Generates metadata analysis tables and figures automatically.
+ - Produces audit summaries and review queues.
+ - Downloads genome FASTA files from NCBI.
+ - Supports flexible sequence-download filtering by standardized metadata.
+ - Includes `test.tsv`, matching the public FetchM-style test dataset layout.
+ - Includes `examples/offline_metadata.tsv` for fast local smoke testing.
+
+ ## Installation
+
+ ### Option 1: pip
+
+ ```bash
+ python -m venv fetchm2-env
+ source fetchm2-env/bin/activate
+ pip install fetchm2
+ ```
+
+ Verify:
+
+ ```bash
+ fetchm2 --version
+ ```
+
+ ### Option 2: conda / mamba environment
+
+ Clone the repository and create the environment:
+
+ ```bash
+ git clone https://github.com/Tasnimul-Arabi-Anik/FetchM2.git
+ cd FetchM2
+ mamba env create -f environment.yml
+ conda activate fetchm2
+ ```
+
+ If you use `conda` instead of `mamba`:
+
+ ```bash
+ conda env create -f environment.yml
+ conda activate fetchm2
+ ```
+
+ The conda environment includes `taxonkit`, which can improve host lineage enrichment for less common TaxIDs. FetchM2 still works without `taxonkit`; common host lineages are bundled.
+
+ ### Option 3: developer install
+
+ ```bash
+ git clone https://github.com/Tasnimul-Arabi-Anik/FetchM2.git
+ cd FetchM2
+ python -m pip install -e ".[dev]"
+ pytest
+ ```
+
+ ## Quick Start
+
+ Run the bundled standalone smoke test:
+
+ ```bash
+ fetchm2 metadata --input examples/offline_metadata.tsv --outdir demo_out --offline
+ ```
+
+ Run the FetchM-style test dataset:
+
+ ```bash
+ fetchm2 metadata --input test.tsv --outdir test_out --offline
+ ```
+
+ `test.tsv` contains assembly-level NCBI dataset columns and BioSample accessions. In offline mode, FetchM2 analyzes assembly statistics and any metadata already present in the table. To populate host, source, sample, environment, and geography from NCBI BioSample records, run without `--offline`.
+
+ Run metadata retrieval with BioSample enrichment:
+
+ ```bash
+ fetchm2 metadata --input test.tsv --outdir test_out_live --workers 3 --sleep 0.4
+ ```
+
+ Use an NCBI API key for larger jobs:
+
+ ```bash
+ export NCBI_API_KEY=YOUR_NCBI_API_KEY
+ export NCBI_EMAIL=you@example.com
+ fetchm2 metadata --input ncbi_dataset.tsv --outdir results --workers 6 --sleep 0.15
+ ```
+
+ Run metadata standardization and sequence download in one command:
+
+ ```bash
+ fetchm2 run --input ncbi_dataset.tsv --outdir results --download
+ ```
+
+ ## Typical Species/Genus Workflow
+
+ 1. Download an NCBI Genome Datasets TSV or CSV for your target species or genus.
+ 2. Run FetchM2:
+
+ ```bash
+ fetchm2 run --input ncbi_dataset.tsv --outdir results --download
+ ```
+
+ 3. Review the main outputs:
+
+ - `results/metadata_output/fetchm2_clean.csv`
+ - `results/metadata_analysis/metadata_analysis_report.md`
+ - `results/audit/standardization_audit.md`
+ - `results/audit/production_readiness_gate.md`
+ - `results/sequence/`
+
+ For large NCBI retrieval jobs without an API key, use a conservative request delay:
+
+ ```bash
+ fetchm2 run --input ncbi_dataset.tsv --outdir results --download --workers 3 --sleep 0.4
+ ```
+
+ ## Metadata Retrieval Workflow
+
+ FetchM2 can work in two modes.
+
+ Offline mode:
+
+ - Uses metadata columns already present in the input table.
+ - Applies standardization rules.
+ - Generates audit and metadata analysis outputs.
+ - Does not contact NCBI.
+
+ Live BioSample mode:
+
+ - Reads BioSample accessions from NCBI dataset exports.
+ - Retrieves BioSample records through NCBI E-utilities.
+ - Uses direct BioSample XML first, then an `esummary` fallback when the direct record lacks usable attributes.
+ - Tracks raw BioSample attribute names and matched standardized attribute names.
+ - Uses a local SQLite cache so repeated runs do not refetch the same BioSample records.
+ - Uses request throttling, retry, and backoff behavior for temporary NCBI rate-limit or server errors.
+
+ Important output columns from retrieval include:
+
+ - `BioSample`
+ - `BioSample Taxonomy Name`
+ - `Metadata Fetch Status`
+ - `Metadata Fetch Reason`
+ - `Metadata Fetch Error`
+ - `Metadata Raw Attribute Names`
+ - `Metadata Matched Attribute Names`
+
+ FetchM2 currently recognizes common BioSample attribute aliases for host, source, sample type, isolation site, collection date, geography, environmental medium/broad/local scale, host disease, and host health state.
+
+ ## Main Commands
+
+ ```bash
+ fetchm2 metadata --help
+ fetchm2 run --help
+ fetchm2 seq --help
+ fetchm2 audit --help
+ fetchm2 validate --help
+ fetchm2 analyze --help
+ ```
+
+ ### `fetchm2 metadata`
+
+ Reads an NCBI dataset TSV/CSV, optionally fetches BioSample metadata, standardizes fields, and writes clean outputs.
+
+ Example:
+
+ ```bash
+ fetchm2 metadata \
+ --input ncbi_dataset.tsv \
+ --outdir results \
+ --ani OK \
+ --checkm 95 \
+ --workers 6
+ ```
+
+ Common options:
+
+ - `--input`: NCBI dataset TSV/CSV.
+ - `--outdir`: output directory.
+ - `--ani`: filter by ANI Check status, for example `OK`.
+ - `--checkm`: minimum CheckM completeness.
+ - `--api-key`: NCBI API key; can also use `NCBI_API_KEY`.
+ - `--email`: NCBI email; can also use `NCBI_EMAIL`.
+ - `--workers`: BioSample fetch worker count.
+ - `--sleep`: shared request delay between NCBI calls. Use a slower value such as `0.4` to `0.5` for unauthenticated larger jobs.
+ - `--offline`: skip NCBI fetching and standardize existing columns only.
+ - `--no-analysis`: skip automatic `metadata_analysis/` table and figure generation.
+
+ ### `fetchm2 run`
+
+ Runs metadata analysis and, if requested, sequence download.
+
+ ```bash
+ fetchm2 run \
+ --input ncbi_dataset.tsv \
+ --outdir results \
+ --ani OK \
+ --checkm 95 \
+ --download \
+ --country Bangladesh \
+ --host "Homo sapiens" \
+ --year-from 2018 \
+ --year-to 2024
+ ```
+
+ ### `fetchm2 seq`
+
+ Downloads genome FASTA files using the standardized clean metadata table.
+
+ ```bash
+ fetchm2 seq \
+ --input results/metadata_output/fetchm2_clean.csv \
+ --outdir results/sequence \
+ --host "Homo sapiens" \
+ --country Bangladesh \
+ --year-from 2018 \
+ --year-to 2024
+ ```
+
+ Check expected sequences without downloading:
+
+ ```bash
+ fetchm2 seq \
+ --input results/metadata_output/fetchm2_clean.csv \
+ --outdir results/sequence \
+ --country Bangladesh \
+ --check-only
+ ```
+
+ ### `fetchm2 audit`
+
+ Audits an existing standardized output:
+
+ ```bash
+ fetchm2 audit \
+ --input results/metadata_output/fetchm2_clean.csv \
+ --outdir results/audit_rerun
+ ```
+
+ ### `fetchm2 validate`
+
+ Runs the same production-readiness checks as `audit`, but names the workflow explicitly for CLI validation:
+
+ ```bash
+ fetchm2 validate \
+ --input results/metadata_output/fetchm2_clean.csv \
+ --outdir results/validation
+ ```
+
+ ### `fetchm2 analyze`
+
+ Generates metadata analysis outputs from any existing clean metadata CSV.
+
+ ```bash
+ fetchm2 analyze \
+ --input results/metadata_output/fetchm2_clean.csv \
+ --outdir results/metadata_analysis_rerun \
+ --top-n 30
+ ```
+
+ ## Metadata Outputs
+
+ FetchM2 writes:
+
+ - `metadata_output/fetchm2_clean.csv`
+ - `metadata_output/fetchm2_clean.tsv`
+ - `metadata_output/fetchm2_report.md`
+ - `audit/standardization_summary.csv`
+ - `audit/top_host_review_needed.csv`
+ - `audit/standardization_audit.md`
+ - `metadata_analysis/metadata_analysis_report.md`
+ - `metadata_analysis/tables/field_coverage_summary.csv`
+ - `metadata_analysis/tables/top_values_by_field.csv`
+ - `metadata_analysis/tables/numeric_summary.csv`
+ - `metadata_analysis/figures/*.png`
+
+ Typical output structure:
+
+ ```text
+ results/
+ ├── metadata_output/
+ │   ├── fetchm2_clean.csv
+ │   ├── fetchm2_clean.tsv
+ │   └── fetchm2_report.md
+ ├── metadata_analysis/
+ │   ├── metadata_analysis_report.md
+ │   ├── tables/
+ │   └── figures/
+ ├── audit/
+ │   ├── standardization_summary.csv
+ │   ├── standardization_audit.md
+ │   ├── production_readiness_gate.md
+ │   ├── production_readiness_gate.json
+ │   ├── top_host_review_needed.csv
+ │   ├── non_country_values_in_country.csv
+ │   ├── country_continent_mismatch.csv
+ │   ├── country_subcontinent_mismatch.csv
+ │   ├── invalid_collection_years.csv
+ │   ├── invalid_host_like_sample_type.csv
+ │   ├── source_like_mapped_hosts.csv
+ │   ├── source_like_unmapped_hosts_for_review.csv
+ │   ├── broad_vocabulary_leakage.csv
+ │   ├── sequence_readiness.csv
+ │   └── rule_count_summary.csv
+ └── sequence/
+     ├── *.fna
+     ├── failed_accessions.txt
+     ├── sequence_download_summary.csv
+     └── fetchm2_sequence_cache.sqlite3
+ ```
+
+ ## Standardized Metadata Fields
+
+ FetchM2 keeps the original input columns and adds standardized fields.
+
+ ### Host Standardization
+
+ Original FetchM had host-oriented metadata summaries. FetchM2 expands this into detailed host standardization:
+
+ - `Host_Original`
+ - `Host_Cleaned`
+ - `Host_SD`
+ - `Host_TaxID`
+ - `Host_Rank`
+ - `Host_Superkingdom`
+ - `Host_Phylum`
+ - `Host_Class`
+ - `Host_Order`
+ - `Host_Family`
+ - `Host_Genus`
+ - `Host_Species`
+ - `Host_Common_Name`
+ - `Host_Context_SD`
+ - `Host_Match_Method`
+ - `Host_Confidence`
+ - `Host_Review_Status`
+
+ Examples:
+
+ - `human`, `human blood`, `Homosapines` variants can map to `Homo sapiens`, TaxID `9606`.
+ - `cattle feces` can map to `Bos taurus`, TaxID `9913`, while also preserving feces/stool as sample metadata.
+ - `bacteria culture`, `DH5a`, lab strain terms, missing values, and source/material terms are blocked from becoming host values.
+
+ ### Source, Sample, and Environment
+
+ FetchM2 standardizes source/sample/environment fields into:
+
+ - `Sample_Type_SD`
+ - `Sample_Type_SD_Broad`
+ - `Isolation_Source_SD`
+ - `Isolation_Source_SD_Broad`
+ - `Isolation_Site_SD`
+ - `Environment_Medium_SD`
+ - `Environment_Medium_SD_Broad`
+ - `Environment_Broad_Scale_SD`
+ - `Environment_Local_Scale_SD`
+
+ Examples:
+
+ - `blood` -> `Sample_Type_SD=blood`
+ - `urine` -> `Sample_Type_SD=urine`
+ - `feces`, `faeces`, `stool` -> `Sample_Type_SD=feces/stool`
+ - `soil` -> `Environment_Medium_SD=soil`
+ - `sediment` -> `Environment_Medium_SD=sediment`
+ - `wastewater`, `sewage` -> `Environment_Medium_SD=wastewater/sewage`
+ - `hospital surface` -> healthcare/source context
+ - `rectal swab` -> sample type plus anatomical site when available
+
+ ### Geography and Date
+
+ FetchM2 standardizes:
+
+ - `Country`
+ - `Continent`
+ - `Subcontinent`
+ - `Collection_Year`
+
+ It also blocks common false positives such as:
+
+ - `Hospital` as country
+ - `ground turkey` as Turkey
+ - `Guinea pig` as Guinea
+ - `Norway rat` as Norway
+ - `Aspergillus niger` as Niger
+
+ ### Disease and Health State
+
+ FetchM2 includes:
+
+ - `Host_Disease_SD`
+ - `Host_Health_State_SD`
+
+ These are conservative deterministic fields. Disease words are not treated as sample material unless an actual specimen is present.
+
+ ## Sequence Download Features
+
+ FetchM2 downloads genome FASTA files from the NCBI genomes FTP structure using `Assembly Accession` and `Assembly Name`.
+
+ Filtering options:
+
+ - `--host`
+ - `--host-rank`
+ - `--country`
+ - `--continent`
+ - `--subcontinent`
+ - `--sample-type`
+ - `--isolation-source`
+ - `--environment-medium`
+ - `--year-from`
+ - `--year-to`
+ - `--max-genomes`
+
+ Download control:
+
+ - `--download-workers`
+ - `--retries`
+ - `--retry-delay`
+ - `--keep-gz`
+ - `--check-only`
+
+ Outputs:
+
+ - genome FASTA files
+ - `failed_accessions.txt`
+ - `sequence_download_summary.csv`
+ - `fetchm2_sequence_cache.sqlite3`
+
+ ## Test Dataset
+
+ FetchM2 includes:
+
+ - `test.tsv`: FetchM-style NCBI dataset example copied from the public FetchM test dataset.
+ - `examples/test_ncbi_dataset.tsv`: same dataset stored under examples.
+ - `examples/offline_metadata.tsv`: small annotated metadata table for fast offline testing.
+
+ Run:
+
+ ```bash
+ fetchm2 metadata --input test.tsv --outdir test_run --offline
+ fetchm2 audit --input test_run/metadata_output/fetchm2_clean.csv --outdir test_run/audit_check
+ ```
+
+ For BioSample metadata retrieval:
+
+ ```bash
+ fetchm2 metadata --input test.tsv --outdir test_run_live --workers 3 --sleep 0.34
+ ```
+
+ ## Rule Files Packaged With FetchM2
+
+ FetchM2 ships deterministic rules in `src/fetchm2/data/`:
+
+ - `host_synonyms.csv`
+ - `host_negative_rules.csv`
+ - `controlled_categories.csv`
+ - `approved_broad_categories.csv`
+ - `geography_reviewed_rules.csv`
+ - `collection_date_reviewed_rules.csv`
+ - `country_mapping.json`
+
+ These rules let the standalone tool produce richer standardized fields without needing a web database.
+
+ ## API Keys
+
+ For NCBI, use environment variables:
+
+ ```bash
+ export NCBI_API_KEY=YOUR_NCBI_API_KEY
+ export NCBI_EMAIL=you@example.com
+ ```
+
+ Do not put API keys in scripts, notebooks, README files, Git commits, or issue reports.
+
+ ## Validation
+
+ Run local validation:
+
+ ```bash
+ pytest
+ python -m build
+ python -m twine check dist/*
+ python -m pip install dist/fetchm2-*.whl
+ fetchm2 metadata --input examples/offline_metadata.tsv --outdir smoke_out --offline
+ fetchm2 validate --input smoke_out/metadata_output/fetchm2_clean.csv --outdir smoke_out/validation
+ fetchm2 seq --input smoke_out/metadata_output/fetchm2_clean.csv --outdir smoke_seq --country Bangladesh --check-only
+ ```
+
+ The validation report is in:
+
+ ```text
+ docs/VALIDATION_REPORT.md
+ ```
+
+ More analysis details:
+
+ ```text
+ docs/METADATA_ANALYSIS.md
+ docs/STANDARDIZATION.md
+ docs/SEQUENCE_DOWNLOAD.md
+ docs/RELEASE_CHECKLIST.md
+ ```
+
+ ## License
+
+ MIT License.