fetchm2 0.1.0__py3-none-any.whl

@@ -0,0 +1,208 @@
1
+ Metadata-Version: 2.4
2
+ Name: fetchm2
3
+ Version: 0.1.0
4
+ Summary: Standalone comprehensive genome metadata standardization and sequence download toolkit.
5
+ Author-email: Tasnimul Arabi Anik <arabianik987@gmail.com>
6
+ License-Expression: MIT
7
+ Project-URL: Homepage, https://github.com/Tasnimul-Arabi-Anik/FetchM2
8
+ Project-URL: Repository, https://github.com/Tasnimul-Arabi-Anik/FetchM2
9
+ Project-URL: Issues, https://github.com/Tasnimul-Arabi-Anik/FetchM2/issues
10
+ Keywords: NCBI,BioSample,metadata,genomics,standardization,sequence-download
11
+ Classifier: Development Status :: 3 - Alpha
12
+ Classifier: Environment :: Console
13
+ Classifier: Intended Audience :: Science/Research
14
+ Classifier: Operating System :: OS Independent
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.10
17
+ Classifier: Programming Language :: Python :: 3.11
18
+ Classifier: Programming Language :: Python :: 3.12
19
+ Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
20
+ Requires-Python: >=3.10
21
+ Description-Content-Type: text/markdown
22
+ License-File: LICENSE
23
+ Requires-Dist: pandas>=2.0
24
+ Requires-Dist: requests>=2.31
25
+ Requires-Dist: tqdm>=4.66
26
+ Requires-Dist: matplotlib>=3.7
27
+ Requires-Dist: seaborn>=0.13
28
+ Requires-Dist: plotly>=5.20
29
+ Requires-Dist: kaleido<1.0.0,>=0.2.1
30
+ Requires-Dist: xmltodict>=0.13
31
+ Provides-Extra: dev
32
+ Requires-Dist: pytest>=8.0; extra == "dev"
33
+ Requires-Dist: build>=1.2; extra == "dev"
34
+ Requires-Dist: twine>=5.0; extra == "dev"
35
+ Dynamic: license-file
36
+
37
+ # FetchM2
38
+
39
+ FetchM2 is a standalone command-line toolkit for genome metadata retrieval, comprehensive metadata standardization, audit reporting, and optional sequence download.
40
+
41
+ It keeps the simple standalone installation model of the original public [`FetchM`](https://github.com/Tasnimul-Arabi-Anik/FetchM), while packaging deterministic rule files and QA concepts developed in FetchM Web.
42
+
43
+ ## What FetchM2 Does
44
+
45
+ - Reads NCBI Genome Datasets TSV/CSV exports.
46
+ - Optionally fetches linked BioSample metadata from NCBI.
47
+ - Standardizes host, country/geography, collection year, sample type, isolation source, isolation site, environment medium, host disease, and host health state.
48
+ - Adds host TaxID, rank, lineage fields, match method, confidence, and review status.
49
+ - Writes clean metadata tables and audit reports.
50
+ - Downloads genome FASTA files from NCBI with flexible filters.
51
+ - Runs offline on already annotated tables for reproducible tests and local standardization.
52
+
53
+ ## Installation
54
+
55
+ Recommended clean environment:
56
+
57
+ ```bash
58
+ python -m venv fetchm2-env
59
+ source fetchm2-env/bin/activate
60
+ pip install fetchm2
61
+ ```
62
+
63
+ For development from source:
64
+
65
+ ```bash
66
+ git clone https://github.com/Tasnimul-Arabi-Anik/FetchM2.git
67
+ cd FetchM2
68
+ python -m pip install -e ".[dev]"
69
+ pytest
70
+ ```
71
+
72
+ FetchM2 requires only Python dependencies. `taxonkit` is optional: when it is available, FetchM2 can use it to enrich less common host TaxIDs with lineage fields. Lineages for common hosts are bundled.
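
As a rough illustration of the optional-dependency pattern described above, the sketch below checks whether a `taxonkit` executable is on `PATH` before lineage enrichment is attempted. The function name `taxonkit_available` is hypothetical, and this is only an assumption about how such a check could look, not FetchM2's actual detection logic.

```python
import shutil


def taxonkit_available(executable: str = "taxonkit") -> bool:
    """Return True if the named executable can be found on PATH."""
    # shutil.which returns the full path of the executable, or None if absent.
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if taxonkit_available():
        print("taxonkit found: lineage enrichment for uncommon hosts is possible")
    else:
        print("taxonkit not found: only bundled host lineages are available")
```

A check like this keeps the external tool optional: its absence changes which lineages can be filled in, but does not need to stop the run.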
73
+
74
+ ## Quick Start
75
+
76
+ Offline smoke test using the bundled example:
77
+
78
+ ```bash
79
+ fetchm2 metadata --input examples/offline_metadata.tsv --outdir demo_out --offline
80
+ ```
81
+
82
+ Full BioSample metadata retrieval:
83
+
84
+ ```bash
85
+ fetchm2 metadata --input ncbi_dataset.tsv --outdir results
86
+ ```
87
+
88
+ With an NCBI API key:
89
+
90
+ ```bash
91
+ export NCBI_API_KEY=YOUR_NCBI_API_KEY
92
+ fetchm2 metadata --input ncbi_dataset.tsv --outdir results --workers 6 --sleep 0.15
93
+ ```
94
+
95
+ All-in-one metadata plus sequence download:
96
+
97
+ ```bash
98
+ fetchm2 run --input ncbi_dataset.tsv --outdir results --download
99
+ ```
100
+
101
+ Filtered sequence download from a clean table:
102
+
103
+ ```bash
104
+ fetchm2 seq \
105
+ --input results/metadata_output/fetchm2_clean.csv \
106
+ --outdir results/sequence \
107
+ --host "Homo sapiens" \
108
+ --country Bangladesh \
109
+ --year-from 2018 \
110
+ --year-to 2024
111
+ ```
112
+
113
+ ## Main Commands
114
+
115
+ ```bash
116
+ fetchm2 metadata --help
117
+ fetchm2 run --help
118
+ fetchm2 seq --help
119
+ fetchm2 audit --help
120
+ ```
121
+
122
+ ## Metadata Outputs
123
+
124
+ FetchM2 writes:
125
+
126
+ - `metadata_output/fetchm2_clean.csv`
127
+ - `metadata_output/fetchm2_clean.tsv`
128
+ - `metadata_output/fetchm2_report.md`
129
+ - `audit/standardization_summary.csv`
130
+ - `audit/top_host_review_needed.csv`
131
+ - `audit/standardization_audit.md`
132
+
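
The file list above can be turned into a quick post-run check. This is a minimal sketch, assuming that both `metadata_output/` and `audit/` are created directly under the directory passed to `--outdir` (the Quick Start paths suggest this for `metadata_output/`; for `audit/` it is an assumption). The helper name `missing_outputs` is hypothetical and not part of FetchM2.

```python
from pathlib import Path

# Relative output paths exactly as listed in the README.
EXPECTED_OUTPUTS = [
    "metadata_output/fetchm2_clean.csv",
    "metadata_output/fetchm2_clean.tsv",
    "metadata_output/fetchm2_report.md",
    "audit/standardization_summary.csv",
    "audit/top_host_review_needed.csv",
    "audit/standardization_audit.md",
]


def missing_outputs(outdir):
    """Return the expected relative paths that are not files under outdir."""
    root = Path(outdir)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]


if __name__ == "__main__":
    # "results" mirrors the --outdir value used in the Quick Start examples.
    for rel in missing_outputs("results"):
        print(f"missing: {rel}")
```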
133
+ Important standardized fields include:
134
+
135
+ - `Host_SD`, `Host_TaxID`, `Host_Rank`, `Host_Superkingdom`, `Host_Phylum`, `Host_Class`, `Host_Order`, `Host_Family`, `Host_Genus`, `Host_Species`
136
+ - `Host_Common_Name`, `Host_Match_Method`, `Host_Confidence`, `Host_Review_Status`
137
+ - `Sample_Type_SD`, `Sample_Type_SD_Broad`
138
+ - `Isolation_Source_SD`, `Isolation_Source_SD_Broad`
139
+ - `Isolation_Site_SD`
140
+ - `Environment_Medium_SD`, `Environment_Medium_SD_Broad`
141
+ - `Environment_Broad_Scale_SD`, `Environment_Local_Scale_SD`
142
+ - `Host_Disease_SD`, `Host_Health_State_SD`
143
+ - `Country`, `Continent`, `Subcontinent`, `Collection_Year`
144
+
145
+ ## Sequence Download Options
146
+
147
+ FetchM2 supports filtering by:
148
+
149
+ - host
150
+ - host rank
151
+ - country
152
+ - continent
153
+ - subcontinent
154
+ - sample type
155
+ - isolation source
156
+ - environment medium
157
+ - collection year range
158
+ - maximum genomes
159
+
160
+ Use `--check-only` to audit a sequence output directory without downloading.
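
Conceptually, the filters listed above amount to row selection on the standardized columns of the clean table. The sketch below shows that idea with the standard `csv` module for the host, country, and year-range filters. It assumes exact string matching on `Host_SD` and `Country` and an integer-like `Collection_Year`; FetchM2's real matching rules may differ. The `Record_ID` column and the sample rows are invented placeholders, not real data.

```python
import csv
import io


def filter_rows(rows, host=None, country=None, year_from=None, year_to=None):
    """Keep rows that satisfy every filter that is set; unset filters are ignored."""
    kept = []
    for row in rows:
        if host is not None and row.get("Host_SD") != host:
            continue
        if country is not None and row.get("Country") != country:
            continue
        if year_from is not None or year_to is not None:
            try:
                year = int(float(row.get("Collection_Year", "")))
            except (TypeError, ValueError, OverflowError):
                continue  # no usable year, so a year filter cannot match
            if year_from is not None and year < year_from:
                continue
            if year_to is not None and year > year_to:
                continue
        kept.append(row)
    return kept


# Placeholder table: Record_ID is a hypothetical column, values are made up.
SAMPLE = """Record_ID,Host_SD,Country,Collection_Year
EXAMPLE_1,Homo sapiens,Bangladesh,2019
EXAMPLE_2,Homo sapiens,India,2020
EXAMPLE_3,Gallus gallus,Bangladesh,2021
EXAMPLE_4,Homo sapiens,Bangladesh,2015
EXAMPLE_5,Homo sapiens,Bangladesh,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
selected = filter_rows(
    rows, host="Homo sapiens", country="Bangladesh", year_from=2018, year_to=2024
)
selected_ids = [row["Record_ID"] for row in selected]
```

With these placeholder rows, only `EXAMPLE_1` passes all four filters: the others fail on country, host, year range, or a missing year.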
161
+
162
+ ## API Keys
163
+
164
+ For NCBI, prefer environment variables:
165
+
166
+ ```bash
167
+ export NCBI_API_KEY=YOUR_NCBI_API_KEY
168
+ export NCBI_EMAIL=you@example.com
169
+ ```
170
+
171
+ Do not place API keys in scripts, notebooks, README files, or Git commits.
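
To illustrate why environment variables keep credentials out of code, here is a minimal sketch that reads `NCBI_API_KEY` and `NCBI_EMAIL` at run time and turns them into query parameters. The parameter names `api_key` and `email` are the ones NCBI E-utilities accepts. Whether FetchM2 builds its requests this way is an assumption, and `ncbi_request_params` is a hypothetical helper name.

```python
import os


def ncbi_request_params(env=None):
    """Build optional NCBI query parameters from environment-style variables."""
    env = os.environ if env is None else env
    params = {}
    api_key = (env.get("NCBI_API_KEY") or "").strip()
    email = (env.get("NCBI_EMAIL") or "").strip()
    if api_key:
        params["api_key"] = api_key  # sent with each request, never hard-coded
    if email:
        params["email"] = email
    return params


if __name__ == "__main__":
    params = ncbi_request_params()
    # Report only whether a key is present; never print the key itself.
    print("API key set:", "api_key" in params)
```

Passing a plain dictionary as `env` makes the function easy to test without touching the real environment.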
172
+
173
+ ## Design Compared With FetchM and FetchM Web
174
+
175
+ FetchM2 uses the original FetchM standalone flow as the command-line baseline:
176
+
177
+ - metadata
178
+ - run
179
+ - seq
180
+ - SQLite cache
181
+ - NCBI BioSample fetch
182
+ - sequence download from NCBI FTP
183
+
184
+ FetchM2 adds FetchM Web-style standardized metadata fields and deterministic rule files:
185
+
186
+ - host synonyms and negative host rules
187
+ - controlled source/sample/environment categories
188
+ - approved broad vocabulary
189
+ - production-style audit gate
190
+ - richer sequence filtering on standardized fields
191
+
192
+ FetchM2 intentionally does not use embeddings or AI for production mappings. Embeddings can be used later as a review assistant, but final production rules should remain deterministic and auditable.
193
+
194
+ ## Testing
195
+
196
+ Run:
197
+
198
+ ```bash
199
+ pytest
200
+ python -m build
201
+ python -m pip install dist/fetchm2-*.whl
202
+ fetchm2 metadata --input examples/offline_metadata.tsv --outdir smoke_out --offline
203
+ fetchm2 seq --input smoke_out/metadata_output/fetchm2_clean.csv --outdir smoke_seq --country Bangladesh --check-only
204
+ ```
205
+
206
+ ## License
207
+
208
+ MIT License.
@@ -0,0 +1,20 @@
1
+ fetchm2/__init__.py,sha256=CLanHxpL8WLckFTU_jlA8aCDMOaZfM-9GQ8XO5syJ3U,94
2
+ fetchm2/audit.py,sha256=Sn_MK05O7snpJ2FOUMEJZdKvbs4HqN4U-rdrMH8ua-k,5760
3
+ fetchm2/cli.py,sha256=roRAzM1eU-hecHqq3BJlzmKuzLMT3t6nGPxEWtJhAtg,8464
4
+ fetchm2/metadata.py,sha256=0IDzJwoOJMlMH50PSTe9wSzhfJdz6NqGKSZvb90AqRo,8201
5
+ fetchm2/sequence.py,sha256=7t7C0PZv_aeEA4E0I3SHLxS8Yjz1t1ob4mKc9E8od1E,7956
6
+ fetchm2/standardization.py,sha256=_AciT1VkYWtkWdAu6RpWhpkre6OMxEExHKMSNPmjMlc,20618
7
+ fetchm2/utils.py,sha256=jmLo2Zbcfj6ikKjNdEg2XrU-wrUp8GVyRkgcDxzDrAU,1652
8
+ fetchm2/data/__init__.py,sha256=8NlOfHm8CpgA9TnBDw6UC7JfZQTzKDHnAnhjTbrk68w,46
9
+ fetchm2/data/approved_broad_categories.csv,sha256=LD7B22NmcjaFLQopC5_Ax-7xLjTbwsXqIKEDX7niKvg,4872
10
+ fetchm2/data/controlled_categories.csv,sha256=xLa8_r1vklLg789wYCEzBt7T4IU6TmuUCE89gS_MZds,2395053
11
+ fetchm2/data/country_mapping.json,sha256=PYk_PGYM1F_uHxpbUXylZIWoPu4NVjKKqvol8t971xQ,17344
12
+ fetchm2/data/geography_reviewed_rules.csv,sha256=1zTDMq078ltYs7IVabn8QGmn5Vf8eyxCBNRMYOWUR2o,1560
13
+ fetchm2/data/host_negative_rules.csv,sha256=YodAzIvSpEX2rSSe1JTxoZtzdPbDRA8Ld7Hu0GxFEmY,45300
14
+ fetchm2/data/host_synonyms.csv,sha256=Byy5igpM2acOB_d-awLOvxkvQqsnJDAlVPx5By-QVJQ,716973
15
+ fetchm2-0.1.0.dist-info/licenses/LICENSE,sha256=8QW78fG4kk5GSOCzLE_D9SrD2eTlANgRr8vc4baQ-MI,1076
16
+ fetchm2-0.1.0.dist-info/METADATA,sha256=IZd-qZ2psh95S3EaSvk7HuDkJEanbr60ke6I_H-vWe0,6240
17
+ fetchm2-0.1.0.dist-info/WHEEL,sha256=aeYiig01lYGDzBgS8HxWXOg3uV61G9ijOsup-k9o1sk,91
18
+ fetchm2-0.1.0.dist-info/entry_points.txt,sha256=vvc7-U-NmlkAe3LaZArXHEQcp8eilgQKAS-T1FyInt4,72
19
+ fetchm2-0.1.0.dist-info/top_level.txt,sha256=pIvb4NVf1LumJONqwoqxTxErRiLimYFPnaPtpiSuM9I,8
20
+ fetchm2-0.1.0.dist-info/RECORD,,
@@ -0,0 +1,5 @@
1
+ Wheel-Version: 1.0
2
+ Generator: setuptools (82.0.1)
3
+ Root-Is-Purelib: true
4
+ Tag: py3-none-any
5
+
@@ -0,0 +1,3 @@
1
+ [console_scripts]
2
+ fetchM2 = fetchm2.cli:main
3
+ fetchm2 = fetchm2.cli:main
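
Each `console_scripts` entry maps a command name to a `module:function` target, so both `fetchM2` and `fetchm2` invoke `main` in `fetchm2.cli`. The sketch below parses the file content shown above with the standard `configparser` module. The helper name `parse_console_scripts` is hypothetical. Option names are kept case-sensitive because `configparser` lowercases them by default, which would make the two names collide.

```python
import configparser

# Content of entry_points.txt as shown in the diff above.
ENTRY_POINTS_TXT = """\
[console_scripts]
fetchM2 = fetchm2.cli:main
fetchm2 = fetchm2.cli:main
"""


def parse_console_scripts(text):
    """Map each console script name to its (module, function) target."""
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.optionxform = str  # preserve case so fetchM2 and fetchm2 stay distinct
    parser.read_string(text)
    scripts = {}
    for name, target in parser.items("console_scripts"):
        module, _, attr = target.partition(":")
        scripts[name] = (module.strip(), attr.strip())
    return scripts


if __name__ == "__main__":
    for name, (module, attr) in parse_console_scripts(ENTRY_POINTS_TXT).items():
        print(f"{name} -> {module}.{attr}()")
```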
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Tasnimul Arabi Anik
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1 @@
1
+ fetchm2