PyPI - fetchm2 - Versions diffs - 0.1.4__tar.gz → 0.1.7__tar.gz - Mend

fetchm2 0.1.4tar.gz → 0.1.7tar.gz


Files changed (42)

{fetchm2-0.1.4/src/fetchm2.egg-info → fetchm2-0.1.7}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: fetchm2
-Version: 0.1.4
+Version: 0.1.7
 Summary: Standalone comprehensive genome metadata standardization and sequence download toolkit.
 Author-email: Tasnimul Arabi Anik <arabianik987@gmail.com>
 License-Expression: MIT
@@ -36,9 +36,11 @@ Dynamic: license-file
 # FetchM2
-FetchM2 is a comprehensive standalone command-line toolkit for bacterial genome metadata analysis, metadata standardization, audit reporting, and optional genome sequence download.
+## Overview
-FetchM2 is designed as the updated successor to the original FetchM standalone tool. It keeps the same practical command-line workflow, but adds many more standardized metadata fields, richer filtering, packaged curation rules, audit outputs, and reproducible test data.
+FetchM2 is a comprehensive standalone command-line toolkit for bacterial genome metadata retrieval, deterministic metadata standardization, metadata analysis, audit/validation reporting, and optional genome sequence download from NCBI Genome Datasets exports.
+FetchM2 is designed as the updated successor to the original FetchM standalone tool. It keeps the practical FetchM command-line workflow, but adds expanded host taxonomy fields, source/sample/environment standardization, geography and collection-year recovery, production-readiness audits, richer sequence-download filters, and reproducible test data.
 Recommended one-command workflow:
@@ -46,16 +48,42 @@ Recommended one-command workflow:
 fetchm2 run --input ncbi_dataset.tsv --outdir results --download
 ```
-## Key Features
+The tool is intended primarily for bacterial genome datasets. It can process other NCBI Genome Datasets TSV/CSV exports, but metadata conventions outside bacterial datasets may be less consistent.
+## Workflow
+FetchM2 starts from an NCBI Genome Datasets TSV/CSV, retrieves linked BioSample metadata when requested, standardizes metadata fields with packaged deterministic rules, generates analysis/audit outputs, and optionally downloads FASTA genome sequences.
+Typical flow:
+```text
+NCBI ncbi_dataset.tsv/csv
+        |
+        v
+BioSample metadata retrieval or offline metadata parsing
+        |
+        v
+Deterministic standardization
+        |
+        v
+Clean metadata + analysis tables/figures + audit reports
+        |
+        v
+Optional filtered sequence download
+```
+## Features
 - Standalone command-line tool installable with `pip` or a conda environment.
 - Reads NCBI Genome Datasets TSV/CSV exports.
 - Optionally fetches linked BioSample metadata from NCBI with retry, cache, and fallback lookup support.
 - Supports offline analysis when metadata columns are already present.
-- Applies packaged deterministic standardization rules.
-- Writes clean CSV and TSV metadata outputs.
+- Applies packaged deterministic standardization rules for host, source, sample, environment, geography, collection year, disease, and health state.
+- Adds `Host_SD`, `Host_TaxID`, host lineage/rank fields, `Host_Context_SD`, standardized sample/source/environment fields, `Country`, `Continent`, `Subcontinent`, and geography traceability fields.
+- Labels 238 country/territory/marine-region entries, including common territories and ocean/sea regions.
+- Writes representative clean CSV/TSV outputs plus full all-assembly outputs.
 - Generates metadata analysis tables and figures automatically.
-- Produces audit summaries and review queues.
+- Produces audit summaries, production-readiness gates, leakage checks, and review queues.
 - Downloads genome FASTA files from NCBI.
 - Supports flexible sequence-download filtering by standardized metadata.
 - Includes `test.tsv`, matching the public FetchM-style test dataset layout.
@@ -68,7 +96,7 @@ fetchm2 run --input ncbi_dataset.tsv --outdir results --download
 ```bash
 python -m venv fetchm2-env
 source fetchm2-env/bin/activate
-pip install fetchm2==0.1.4
+pip install fetchm2
 ```
 Verify:
@@ -77,6 +105,12 @@ Verify:
 fetchm2 --version
 ```
+To install the validated `0.1.7` GitHub release tag before the PyPI package is updated:
+```bash
+pip install "git+https://github.com/Tasnimul-Arabi-Anik/FetchM2.git@v0.1.7"
+```
 ### Option 2: conda / mamba environment
 Clone the repository and create the environment:
@@ -106,7 +140,51 @@ python -m pip install -e ".[dev]"
 pytest
 ```
-## Quick Start
+## NCBI API Key
+FetchM2 can run without an NCBI API key, but larger BioSample retrieval jobs are more reliable with one.
+Create an NCBI API key from your My NCBI account, then either pass it directly:
+```bash
+fetchm2 metadata --input ncbi_dataset.tsv --outdir results --api-key YOUR_NCBI_API_KEY
+```
+Or use environment variables:
+```bash
+export NCBI_API_KEY=YOUR_NCBI_API_KEY
+export NCBI_EMAIL=you@example.com
+fetchm2 metadata --input ncbi_dataset.tsv --outdir results
+```
+Recommended request pacing:
+- without an API key: use `--workers 3 --sleep 0.4` for larger jobs
+- with an API key: `--workers 6 --sleep 0.15` is usually reasonable
+FetchM2 keeps a persistent SQLite BioSample cache in `metadata_output/fetchm2_biosample_cache.sqlite3`, so repeated runs do not refetch BioSamples that were already resolved.
+Do not put API keys in scripts, notebooks, README files, Git commits, or issue reports.
+## Usage
+### Recommended All-In-One Workflow
+```bash
+fetchm2 run --input ncbi_dataset.tsv --outdir results --download
+```
+This command:
+1. reads the NCBI genome export
+2. filters rows if `--ani` and/or `--checkm` are provided
+3. retrieves linked BioSample metadata unless `--offline` is used
+4. standardizes metadata fields
+5. writes clean tables, analysis outputs, and audit reports
+6. downloads FASTA files when `--download` is provided
+### Quick Start
 Run the bundled standalone smoke test:
@@ -154,6 +232,7 @@ fetchm2 run --input ncbi_dataset.tsv --outdir results --download
 3. Review the main outputs:
 - `results/metadata_output/fetchm2_clean.csv`
+- `results/metadata_output/fetchm2_all_assemblies.csv`
 - `results/metadata_analysis/metadata_analysis_report.md`
 - `results/audit/standardization_audit.md`
 - `results/audit/production_readiness_gate.md`
@@ -314,6 +393,14 @@ FetchM2 writes:
 - `metadata_output/fetchm2_clean.csv`
 - `metadata_output/fetchm2_clean.tsv`
+- `metadata_output/fetchm2_clean_compat.csv`
+- `metadata_output/ncbi_clean.csv`
+- `metadata_output/fetchm2_all_assemblies.csv`
+- `metadata_output/fetchm2_all_assemblies.tsv`
+- `metadata_output/sample_map.csv`
+- `metadata_output/metadata_completeness.csv`
+- `metadata_output/metadata_bias_warning.txt`
+- `metadata_output/fetchm2_manifest.json`
 - `metadata_output/fetchm2_report.md`
 - `audit/standardization_summary.csv`
 - `audit/top_host_review_needed.csv`
@@ -331,6 +418,14 @@ results/
 ├── metadata_output/
 │   ├── fetchm2_clean.csv
 │   ├── fetchm2_clean.tsv
+│   ├── fetchm2_clean_compat.csv
+│   ├── ncbi_clean.csv
+│   ├── fetchm2_all_assemblies.csv
+│   ├── fetchm2_all_assemblies.tsv
+│   ├── sample_map.csv
+│   ├── metadata_completeness.csv
+│   ├── metadata_bias_warning.txt
+│   ├── fetchm2_manifest.json
 │   └── fetchm2_report.md
 ├── metadata_analysis/
 │   ├── metadata_analysis_report.md
@@ -359,6 +454,41 @@ results/
     └── fetchm2_sequence_cache.sqlite3
 ```
+By default, `fetchm2_clean.csv` follows original FetchM behavior: it selects one representative row per `Assembly Name`, preferring RefSeq `GCF_*` over GenBank `GCA_*` when both are present. This prevents paired GCA/GCF assemblies sharing the same BioSample from being double-counted in downstream prevalence analyses. The full row-preserving output is still saved as `fetchm2_all_assemblies.csv`.
+If you intentionally want paired GCA/GCF rows retained in `fetchm2_clean.csv`, use:
+```bash
+fetchm2 metadata --input ncbi_dataset.tsv --outdir results --keep-assembly-duplicates
+```
+For PanR2/PanResistome-style downstream pipelines, FetchM2 always includes these compatibility columns in `fetchm2_clean.csv`, even when values are blank:
+- `Assembly Accession`
+- `Assembly Name`
+- `Assembly BioSample Accession`
+- `Organism Name`
+- `Geographic Location`
+- `Continent`
+- `Subcontinent`
+- `Collection Date`
+- `Collection_Year`
+- `Host`
+- `Host_SD`
+- `Isolation_Source`
+- `Isolation_Source_SD`
+- `Sample_Type_SD`
+- `Environment_Medium_SD`
+`sample_map.csv` provides stable sequence-analysis matching columns:
+- `sample_id`
+- `Assembly Accession`
+- `Assembly Name`
+- `sequence_file`
+Assembly accession versions such as `GCF_000123456.1` are preserved.
 ## Standardized Metadata Fields
 FetchM2 keeps the original input columns and adds standardized fields.
@@ -423,8 +553,14 @@ FetchM2 standardizes:
 - `Country`
 - `Continent`
 - `Subcontinent`
+- `Country_Source`
+- `Country_Confidence`
+- `Country_Evidence`
+- `Geo_Recovery_Status`
 - `Collection_Year`
+The packaged region mapping covers countries, selected territories, historical labels, and marine regions such as `Arctic Ocean`, `Pacific Ocean`, `Mediterranean Sea`, and `North Sea`.
 It also blocks common false positives such as:
 - `Hospital` as country
@@ -446,6 +582,8 @@ These are conservative deterministic fields. Disease words are not treated as sa
 FetchM2 downloads genome FASTA files from the NCBI genomes FTP structure using `Assembly Accession` and `Assembly Name`.
+When using the default `fetchm2_clean.csv`, sequence download operates on representative assemblies only, matching original FetchM behavior. Use `fetchm2_all_assemblies.csv` or `--keep-assembly-duplicates` only when you deliberately want both paired `GCA_*` and `GCF_*` accessions.
 Filtering options:
 - `--host`
@@ -475,6 +613,17 @@ Outputs:
 - `sequence_download_summary.csv`
 - `fetchm2_sequence_cache.sqlite3`
+`sequence_download_summary.csv` includes stable downstream matching columns:
+- `Assembly Accession`
+- `Assembly Name`
+- `BioSample`
+- `selected_for_download`
+- `download_status`
+- `sequence_file`
+- `failure_reason`
+- `ftp_path`
 ## Test Dataset
 FetchM2 includes:
@@ -510,17 +659,6 @@ FetchM2 ships deterministic rules in `src/fetchm2/data/`:
 These rules let the standalone tool produce richer standardized fields without needing a web database.
-## API Keys
-For NCBI, use environment variables:
-```bash
-export NCBI_API_KEY=YOUR_NCBI_API_KEY
-export NCBI_EMAIL=you@example.com
-```
-Do not put API keys in scripts, notebooks, README files, Git commits, or issue reports.
 ## Validation
 Run local validation:

{fetchm2-0.1.4 → fetchm2-0.1.7}/README.md RENAMED

@@ -1,8 +1,10 @@
 # FetchM2
-FetchM2 is a comprehensive standalone command-line toolkit for bacterial genome metadata analysis, metadata standardization, audit reporting, and optional genome sequence download.
+## Overview
-FetchM2 is designed as the updated successor to the original FetchM standalone tool. It keeps the same practical command-line workflow, but adds many more standardized metadata fields, richer filtering, packaged curation rules, audit outputs, and reproducible test data.
+FetchM2 is a comprehensive standalone command-line toolkit for bacterial genome metadata retrieval, deterministic metadata standardization, metadata analysis, audit/validation reporting, and optional genome sequence download from NCBI Genome Datasets exports.
+FetchM2 is designed as the updated successor to the original FetchM standalone tool. It keeps the practical FetchM command-line workflow, but adds expanded host taxonomy fields, source/sample/environment standardization, geography and collection-year recovery, production-readiness audits, richer sequence-download filters, and reproducible test data.
 Recommended one-command workflow:
@@ -10,16 +12,42 @@ Recommended one-command workflow:
 fetchm2 run --input ncbi_dataset.tsv --outdir results --download
 ```
-## Key Features
+The tool is intended primarily for bacterial genome datasets. It can process other NCBI Genome Datasets TSV/CSV exports, but metadata conventions outside bacterial datasets may be less consistent.
+## Workflow
+FetchM2 starts from an NCBI Genome Datasets TSV/CSV, retrieves linked BioSample metadata when requested, standardizes metadata fields with packaged deterministic rules, generates analysis/audit outputs, and optionally downloads FASTA genome sequences.
+Typical flow:
+```text
+NCBI ncbi_dataset.tsv/csv
+        |
+        v
+BioSample metadata retrieval or offline metadata parsing
+        |
+        v
+Deterministic standardization
+        |
+        v
+Clean metadata + analysis tables/figures + audit reports
+        |
+        v
+Optional filtered sequence download
+```
+## Features
 - Standalone command-line tool installable with `pip` or a conda environment.
 - Reads NCBI Genome Datasets TSV/CSV exports.
 - Optionally fetches linked BioSample metadata from NCBI with retry, cache, and fallback lookup support.
 - Supports offline analysis when metadata columns are already present.
-- Applies packaged deterministic standardization rules.
-- Writes clean CSV and TSV metadata outputs.
+- Applies packaged deterministic standardization rules for host, source, sample, environment, geography, collection year, disease, and health state.
+- Adds `Host_SD`, `Host_TaxID`, host lineage/rank fields, `Host_Context_SD`, standardized sample/source/environment fields, `Country`, `Continent`, `Subcontinent`, and geography traceability fields.
+- Labels 238 country/territory/marine-region entries, including common territories and ocean/sea regions.
+- Writes representative clean CSV/TSV outputs plus full all-assembly outputs.
 - Generates metadata analysis tables and figures automatically.
-- Produces audit summaries and review queues.
+- Produces audit summaries, production-readiness gates, leakage checks, and review queues.
 - Downloads genome FASTA files from NCBI.
 - Supports flexible sequence-download filtering by standardized metadata.
 - Includes `test.tsv`, matching the public FetchM-style test dataset layout.
@@ -32,7 +60,7 @@ fetchm2 run --input ncbi_dataset.tsv --outdir results --download
 ```bash
 python -m venv fetchm2-env
 source fetchm2-env/bin/activate
-pip install fetchm2==0.1.4
+pip install fetchm2
 ```
 Verify:
@@ -41,6 +69,12 @@ Verify:
 fetchm2 --version
 ```
+To install the validated `0.1.7` GitHub release tag before the PyPI package is updated:
+```bash
+pip install "git+https://github.com/Tasnimul-Arabi-Anik/FetchM2.git@v0.1.7"
+```
 ### Option 2: conda / mamba environment
 Clone the repository and create the environment:
@@ -70,7 +104,51 @@ python -m pip install -e ".[dev]"
 pytest
 ```
-## Quick Start
+## NCBI API Key
+FetchM2 can run without an NCBI API key, but larger BioSample retrieval jobs are more reliable with one.
+Create an NCBI API key from your My NCBI account, then either pass it directly:
+```bash
+fetchm2 metadata --input ncbi_dataset.tsv --outdir results --api-key YOUR_NCBI_API_KEY
+```
+Or use environment variables:
+```bash
+export NCBI_API_KEY=YOUR_NCBI_API_KEY
+export NCBI_EMAIL=you@example.com
+fetchm2 metadata --input ncbi_dataset.tsv --outdir results
+```
+Recommended request pacing:
+- without an API key: use `--workers 3 --sleep 0.4` for larger jobs
+- with an API key: `--workers 6 --sleep 0.15` is usually reasonable
+FetchM2 keeps a persistent SQLite BioSample cache in `metadata_output/fetchm2_biosample_cache.sqlite3`, so repeated runs do not refetch BioSamples that were already resolved.
+Do not put API keys in scripts, notebooks, README files, Git commits, or issue reports.
+## Usage
+### Recommended All-In-One Workflow
+```bash
+fetchm2 run --input ncbi_dataset.tsv --outdir results --download
+```
+This command:
+1. reads the NCBI genome export
+2. filters rows if `--ani` and/or `--checkm` are provided
+3. retrieves linked BioSample metadata unless `--offline` is used
+4. standardizes metadata fields
+5. writes clean tables, analysis outputs, and audit reports
+6. downloads FASTA files when `--download` is provided
+### Quick Start
 Run the bundled standalone smoke test:
@@ -118,6 +196,7 @@ fetchm2 run --input ncbi_dataset.tsv --outdir results --download
 3. Review the main outputs:
 - `results/metadata_output/fetchm2_clean.csv`
+- `results/metadata_output/fetchm2_all_assemblies.csv`
 - `results/metadata_analysis/metadata_analysis_report.md`
 - `results/audit/standardization_audit.md`
 - `results/audit/production_readiness_gate.md`
@@ -278,6 +357,14 @@ FetchM2 writes:
 - `metadata_output/fetchm2_clean.csv`
 - `metadata_output/fetchm2_clean.tsv`
+- `metadata_output/fetchm2_clean_compat.csv`
+- `metadata_output/ncbi_clean.csv`
+- `metadata_output/fetchm2_all_assemblies.csv`
+- `metadata_output/fetchm2_all_assemblies.tsv`
+- `metadata_output/sample_map.csv`
+- `metadata_output/metadata_completeness.csv`
+- `metadata_output/metadata_bias_warning.txt`
+- `metadata_output/fetchm2_manifest.json`
 - `metadata_output/fetchm2_report.md`
 - `audit/standardization_summary.csv`
 - `audit/top_host_review_needed.csv`
@@ -295,6 +382,14 @@ results/
 ├── metadata_output/
 │   ├── fetchm2_clean.csv
 │   ├── fetchm2_clean.tsv
+│   ├── fetchm2_clean_compat.csv
+│   ├── ncbi_clean.csv
+│   ├── fetchm2_all_assemblies.csv
+│   ├── fetchm2_all_assemblies.tsv
+│   ├── sample_map.csv
+│   ├── metadata_completeness.csv
+│   ├── metadata_bias_warning.txt
+│   ├── fetchm2_manifest.json
 │   └── fetchm2_report.md
 ├── metadata_analysis/
 │   ├── metadata_analysis_report.md
@@ -323,6 +418,41 @@ results/
     └── fetchm2_sequence_cache.sqlite3
 ```
+By default, `fetchm2_clean.csv` follows original FetchM behavior: it selects one representative row per `Assembly Name`, preferring RefSeq `GCF_*` over GenBank `GCA_*` when both are present. This prevents paired GCA/GCF assemblies sharing the same BioSample from being double-counted in downstream prevalence analyses. The full row-preserving output is still saved as `fetchm2_all_assemblies.csv`.
+If you intentionally want paired GCA/GCF rows retained in `fetchm2_clean.csv`, use:
+```bash
+fetchm2 metadata --input ncbi_dataset.tsv --outdir results --keep-assembly-duplicates
+```
+For PanR2/PanResistome-style downstream pipelines, FetchM2 always includes these compatibility columns in `fetchm2_clean.csv`, even when values are blank:
+- `Assembly Accession`
+- `Assembly Name`
+- `Assembly BioSample Accession`
+- `Organism Name`
+- `Geographic Location`
+- `Continent`
+- `Subcontinent`
+- `Collection Date`
+- `Collection_Year`
+- `Host`
+- `Host_SD`
+- `Isolation_Source`
+- `Isolation_Source_SD`
+- `Sample_Type_SD`
+- `Environment_Medium_SD`
+`sample_map.csv` provides stable sequence-analysis matching columns:
+- `sample_id`
+- `Assembly Accession`
+- `Assembly Name`
+- `sequence_file`
+Assembly accession versions such as `GCF_000123456.1` are preserved.
 ## Standardized Metadata Fields
 FetchM2 keeps the original input columns and adds standardized fields.
@@ -387,8 +517,14 @@ FetchM2 standardizes:
 - `Country`
 - `Continent`
 - `Subcontinent`
+- `Country_Source`
+- `Country_Confidence`
+- `Country_Evidence`
+- `Geo_Recovery_Status`
 - `Collection_Year`
+The packaged region mapping covers countries, selected territories, historical labels, and marine regions such as `Arctic Ocean`, `Pacific Ocean`, `Mediterranean Sea`, and `North Sea`.
 It also blocks common false positives such as:
 - `Hospital` as country
@@ -410,6 +546,8 @@ These are conservative deterministic fields. Disease words are not treated as sa
 FetchM2 downloads genome FASTA files from the NCBI genomes FTP structure using `Assembly Accession` and `Assembly Name`.
+When using the default `fetchm2_clean.csv`, sequence download operates on representative assemblies only, matching original FetchM behavior. Use `fetchm2_all_assemblies.csv` or `--keep-assembly-duplicates` only when you deliberately want both paired `GCA_*` and `GCF_*` accessions.
 Filtering options:
 - `--host`
@@ -439,6 +577,17 @@ Outputs:
 - `sequence_download_summary.csv`
 - `fetchm2_sequence_cache.sqlite3`
+`sequence_download_summary.csv` includes stable downstream matching columns:
+- `Assembly Accession`
+- `Assembly Name`
+- `BioSample`
+- `selected_for_download`
+- `download_status`
+- `sequence_file`
+- `failure_reason`
+- `ftp_path`
 ## Test Dataset
 FetchM2 includes:
@@ -474,17 +623,6 @@ FetchM2 ships deterministic rules in `src/fetchm2/data/`:
 These rules let the standalone tool produce richer standardized fields without needing a web database.
-## API Keys
-For NCBI, use environment variables:
-```bash
-export NCBI_API_KEY=YOUR_NCBI_API_KEY
-export NCBI_EMAIL=you@example.com
-```
-Do not put API keys in scripts, notebooks, README files, Git commits, or issue reports.
 ## Validation
 Run local validation:

fetchm2-0.1.7/docs/FETCHM_COMPATIBILITY.md ADDED

@@ -0,0 +1,80 @@
+# FetchM Compatibility Notes
+FetchM2 is designed as a standalone CLI successor to FetchM. The points below document compatibility-sensitive behavior that affects downstream analysis.
+## BioSample Fetching
+FetchM2 follows FetchM by fetching metadata once per unique BioSample accession, not once per assembly row. If paired `GCA_*` and `GCF_*` rows share the same BioSample, FetchM2 retrieves that BioSample metadata once and applies it back to both rows before representative selection.
+Audit outputs report both units:
+- `rows`: rows in `fetchm2_clean.csv`
+- `unique_assembly_accessions`: unique assemblies represented in the clean output
+- `biosample_linked_rows`: clean rows with a BioSample accession
+- `unique_biosample_accessions`: unique BioSample accessions represented
+- `biosample_reused_extra_rows`: extra rows caused by BioSample reuse across assemblies
+## Representative Clean Output
+Original FetchM writes its main clean output after deduplicating by `Assembly Name`, prioritizing `GCF_*` rows when paired RefSeq/GenBank assemblies exist.
+FetchM2 follows that behavior by default:
+- `fetchm2_clean.csv`: one representative row per `Assembly Name`, preferring `GCF_*`.
+- `fetchm2_all_assemblies.csv`: all standardized assembly rows before representative selection.
+- `ncbi_clean.csv`: FetchM-compatible alias of `fetchm2_clean.csv`.
+- `fetchm2_clean_compat.csv`: explicit compatibility alias of `fetchm2_clean.csv`.
+Use `--keep-assembly-duplicates` if you intentionally want paired `GCA_*` and `GCF_*` rows retained in `fetchm2_clean.csv`.
+## Stable Pipeline Contract
+FetchM2 always writes the following downstream-facing columns in `fetchm2_clean.csv`, even when values are blank:
+- `Assembly Accession`
+- `Assembly Name`
+- `Assembly BioSample Accession`
+- `Organism Name`
+- `Geographic Location`
+- `Continent`
+- `Subcontinent`
+- `Collection Date`
+- `Collection_Year`
+- `Host`
+- `Host_SD`
+- `Isolation_Source`
+- `Isolation_Source_SD`
+- `Sample_Type_SD`
+- `Environment_Medium_SD`
+Assembly accession versions are preserved. For example, `GCF_000123456.1` remains `GCF_000123456.1`.
+FetchM2 also writes:
+- `sample_map.csv` for stable sequence-analysis sample IDs.
+- `metadata_completeness.csv` for field-level coverage checks.
+- `metadata_bias_warning.txt` for low-coverage and representative-selection warnings.
+- `fetchm2_manifest.json` for machine-readable run metadata.
+## Sequence Download Unit
+Sequence download from the default `fetchm2_clean.csv` operates on representative assemblies, matching FetchM and avoiding duplicate GCA/GCF downloads for the same assembly name.
+If downstream prevalence or resistome tools operate at assembly/genome level, use `Assembly Accession` as the denominator. If they operate at BioSample level, deduplicate intentionally and document the representative rule.
+## Cache and Fallback Behavior
+FetchM2 aligns with FetchM's BioSample fallback behavior:
+- direct BioSample XML is tried first
+- esummary fallback is used when direct XML lacks usable attributes
+- recovered fallback XML is cached, not the incomplete direct XML
+- stale incomplete cache entries are ignored and retried
+This avoids rerun regressions where a recovered BioSample would later appear missing because incomplete direct XML had been cached.
+## Known Intentional Differences
+FetchM2 adds standardized fields, richer audits, production gate outputs, and additional sequence filters. These are additions, not replacements for the core FetchM workflow.
+FetchM2 does not generate FetchM's DOCX report; it focuses on CSV/TSV, Markdown, JSON, and figure outputs that are easier to use in automated CLI workflows.
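
Reviewer note on the representative-selection rule described in `FETCHM_COMPATIBILITY.md` above (one row per `Assembly Name`, preferring `GCF_*` over `GCA_*`). The following is a minimal sketch of that rule, not FetchM2's actual implementation. The function name `select_representatives` and the sample rows are hypothetical; only the column names come from the documentation.

```python
def select_representatives(rows):
    """Keep one row per Assembly Name, preferring a GCF_* accession.

    Illustrative only: the real tool may apply extra tie-breaking rules.
    """
    chosen = {}
    for row in rows:
        name = row["Assembly Name"]
        current = chosen.get(name)
        if current is None:
            # First row seen for this assembly name.
            chosen[name] = row
        elif (row["Assembly Accession"].startswith("GCF_")
              and not current["Assembly Accession"].startswith("GCF_")):
            # A RefSeq row replaces a previously chosen GenBank row.
            chosen[name] = row
    return list(chosen.values())


# Hypothetical input: one paired GCA/GCF assembly and one GenBank-only assembly.
rows = [
    {"Assembly Accession": "GCA_000123456.1", "Assembly Name": "ASM12345v1",
     "Assembly BioSample Accession": "SAMN00000001"},
    {"Assembly Accession": "GCF_000123456.1", "Assembly Name": "ASM12345v1",
     "Assembly BioSample Accession": "SAMN00000001"},
    {"Assembly Accession": "GCA_000999999.2", "Assembly Name": "ASM99999v2",
     "Assembly BioSample Accession": "SAMN00000002"},
]

representatives = select_representatives(rows)
for row in representatives:
    print(row["Assembly Accession"], row["Assembly Name"])
```

Under these assumptions the paired assembly collapses to its `GCF_*` row, while the GenBank-only assembly is kept as is, with accession versions left untouched.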

fetchm2-0.1.7/docs/SEQUENCE_DOWNLOAD.md ADDED

@@ -0,0 +1,67 @@
+# FetchM2 Sequence Download
+FetchM2 downloads genome FASTA files from the public NCBI genomes FTP layout using the assembly accession and assembly name in `fetchm2_clean.csv`.
+By default, `fetchm2_clean.csv` contains one representative row per `Assembly Name`, preferring RefSeq `GCF_*` over GenBank `GCA_*` when paired rows share the same assembly name. This follows original FetchM behavior and prevents duplicate downloads/counting for paired GCA/GCF assemblies. If you need all input assembly rows, use `fetchm2_all_assemblies.csv` or rerun metadata with `--keep-assembly-duplicates`.
+## Typical Workflow
+```bash
+fetchm2 metadata --input ncbi_dataset.tsv --outdir results
+fetchm2 seq --input results/metadata_output/fetchm2_clean.csv --outdir results/sequence
+```
+## Filtered Download
+You can filter using standardized metadata:
+```bash
+fetchm2 seq \
+  --input results/metadata_output/fetchm2_clean.csv \
+  --outdir results/sequence_human_bd \
+  --host "Homo sapiens" \
+  --country Bangladesh \
+  --year-from 2018 \
+  --year-to 2024
+```
+Supported filters include:
+- `--host`
+- `--host-rank`
+- `--country`
+- `--continent`
+- `--subcontinent`
+- `--sample-type`
+- `--isolation-source`
+- `--environment-medium`
+- `--year-from` and `--year-to`
+- `--max-genomes`
+## Check Only
+Use `--check-only` to compare expected accessions against an output directory without downloading:
+```bash
+fetchm2 seq --input fetchm2_clean.csv --outdir sequence --check-only
+```
+## Stable Summary Output
+FetchM2 always writes:
+- `sequence_download_summary.csv`
+- `failed_accessions.txt`
+`sequence_download_summary.csv` includes these stable downstream matching columns:
+- `Assembly Accession`
+- `Assembly Name`
+- `BioSample`
+- `selected_for_download`
+- `download_status`
+- `sequence_file`
+- `failure_reason`
+- `ftp_path`
+The `sequence_file` value matches the expected FASTA basename used by FetchM2. Assembly accession versions are preserved so downstream tools can match ABRicate, MLST, MobileElementFinder, IntegronFinder, DefenseFinder, PanR2, and PanResistome outputs by `Assembly Accession` or by `sample_id` from `metadata_output/sample_map.csv`.
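
Reviewer note on downstream matching by `Assembly Accession`, as described in `SEQUENCE_DOWNLOAD.md` above. This sketch joins `sample_map.csv` to `sequence_download_summary.csv` on the versioned accession. The column names come from the documentation; the row values, the `download_status` strings (`downloaded`, `failed`), the `sample_id` values, and the FASTA file name are assumptions made up for illustration.

```python
import csv
import io

# Hypothetical file contents; real files live under the run's output directory.
SUMMARY_CSV = """Assembly Accession,Assembly Name,BioSample,selected_for_download,download_status,sequence_file,failure_reason,ftp_path
GCF_000123456.1,ASM12345v1,SAMN00000001,True,downloaded,GCF_000123456.1_ASM12345v1_genomic.fna,,
GCA_000999999.2,ASM99999v2,SAMN00000002,True,failed,,timeout,
"""

SAMPLE_MAP_CSV = """sample_id,Assembly Accession,Assembly Name,sequence_file
S1,GCF_000123456.1,ASM12345v1,GCF_000123456.1_ASM12345v1_genomic.fna
S2,GCA_000999999.2,ASM99999v2,
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by header name."""
    return list(csv.DictReader(io.StringIO(text)))


def join_by_accession(sample_rows, summary_rows):
    """Attach the download status to each sample via the exact versioned accession."""
    by_accession = {row["Assembly Accession"]: row for row in summary_rows}
    joined = []
    for sample in sample_rows:
        match = by_accession.get(sample["Assembly Accession"], {})
        joined.append({
            "sample_id": sample["sample_id"],
            "Assembly Accession": sample["Assembly Accession"],
            "download_status": match.get("download_status", ""),
            "failure_reason": match.get("failure_reason", ""),
        })
    return joined


joined = join_by_accession(read_rows(SAMPLE_MAP_CSV), read_rows(SUMMARY_CSV))
for row in joined:
    print(row)
```

Matching on the full accession string, including the `.1` or `.2` version suffix, relies on the documented guarantee that accession versions are preserved in both files.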
