fetchm2-0.1.0-py3-none-any.whl
- fetchm2/__init__.py +6 -0
- fetchm2/audit.py +126 -0
- fetchm2/cli.py +175 -0
- fetchm2/data/__init__.py +2 -0
- fetchm2/data/approved_broad_categories.csv +51 -0
- fetchm2/data/controlled_categories.csv +7506 -0
- fetchm2/data/country_mapping.json +810 -0
- fetchm2/data/geography_reviewed_rules.csv +17 -0
- fetchm2/data/host_negative_rules.csv +409 -0
- fetchm2/data/host_synonyms.csv +7114 -0
- fetchm2/metadata.py +244 -0
- fetchm2/sequence.py +194 -0
- fetchm2/standardization.py +586 -0
- fetchm2/utils.py +54 -0
- fetchm2-0.1.0.dist-info/METADATA +208 -0
- fetchm2-0.1.0.dist-info/RECORD +20 -0
- fetchm2-0.1.0.dist-info/WHEEL +5 -0
- fetchm2-0.1.0.dist-info/entry_points.txt +3 -0
- fetchm2-0.1.0.dist-info/licenses/LICENSE +21 -0
- fetchm2-0.1.0.dist-info/top_level.txt +1 -0
@@ -0,0 +1,208 @@
Metadata-Version: 2.4
Name: fetchm2
Version: 0.1.0
Summary: Standalone comprehensive genome metadata standardization and sequence download toolkit.
Author-email: Tasnimul Arabi Anik <arabianik987@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/Tasnimul-Arabi-Anik/FetchM2
Project-URL: Repository, https://github.com/Tasnimul-Arabi-Anik/FetchM2
Project-URL: Issues, https://github.com/Tasnimul-Arabi-Anik/FetchM2/issues
Keywords: NCBI,BioSample,metadata,genomics,standardization,sequence-download
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=2.0
Requires-Dist: requests>=2.31
Requires-Dist: tqdm>=4.66
Requires-Dist: matplotlib>=3.7
Requires-Dist: seaborn>=0.13
Requires-Dist: plotly>=5.20
Requires-Dist: kaleido<1.0.0,>=0.2.1
Requires-Dist: xmltodict>=0.13
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: build>=1.2; extra == "dev"
Requires-Dist: twine>=5.0; extra == "dev"
Dynamic: license-file

# FetchM2

FetchM2 is a standalone command-line toolkit for genome metadata retrieval, comprehensive metadata standardization, audit reporting, and optional sequence download.

It keeps the simple standalone installation model of the original public [`FetchM`](https://github.com/Tasnimul-Arabi-Anik/FetchM), while packaging deterministic rule files and QA concepts developed in FetchM Web.

## What FetchM2 Does

- Reads NCBI Genome Datasets TSV/CSV exports.
- Optionally fetches linked BioSample metadata from NCBI.
- Standardizes host, country/geography, collection year, sample type, isolation source, isolation site, environment medium, host disease, and host health state.
- Adds host TaxID, rank, lineage fields, match method, confidence, and review status.
- Writes clean metadata tables and audit reports.
- Downloads genome FASTA files from NCBI with flexible filters.
- Runs offline on already annotated tables for reproducible tests and local standardization.
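
To make the collection-year step in the list above concrete, here is a minimal Python sketch. It is not FetchM2's actual implementation: the helper name `extract_collection_year` and the accepted range (1800 to 2099) are assumptions chosen for illustration.

```python
import re
from typing import Optional

# Hypothetical pattern: a standalone four-digit year between 1800 and 2099.
_YEAR_RE = re.compile(r"(?<!\d)(1[89]\d{2}|20\d{2})(?!\d)")


def extract_collection_year(raw: Optional[str]) -> Optional[int]:
    """Return the first plausible year found in a free-text date, or None."""
    if raw is None:
        return None
    match = _YEAR_RE.search(str(raw))
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Toy inputs, not taken from any real BioSample record.
    for value in ["2019-07-14", "Jul-2021", "missing", "2015/2016"]:
        print(repr(value), "->", extract_collection_year(value))
```

A rule like this is deterministic, so the same input always yields the same year, which matches the auditability goal stated later in this README. Ambiguous values such as ranges would still need an explicit policy; this sketch simply takes the first year.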

## Installation

Recommended clean environment:

```bash
python -m venv fetchm2-env
source fetchm2-env/bin/activate
pip install fetchm2
```

For development from source:

```bash
git clone https://github.com/Tasnimul-Arabi-Anik/FetchM2.git
cd FetchM2
python -m pip install -e ".[dev]"
pytest
```

FetchM2 uses Python dependencies only. `taxonkit` is optional. If available, FetchM2 can use it to enrich less common host TaxIDs with lineage fields; common host lineages are bundled.
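
One common way to treat an external tool as optional is to check whether it is on `PATH` before using it. The sketch below shows that pattern with the standard library. The function name `taxonkit_available` is hypothetical, and how FetchM2 itself detects `taxonkit` is not shown in this README.

```python
import shutil


def taxonkit_available(executable: str = "taxonkit") -> bool:
    """Return True if the named executable can be found on PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if taxonkit_available():
        print("taxonkit found: extra lineage enrichment could be enabled")
    else:
        print("taxonkit not found: only bundled lineages would be used")
```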

## Quick Start

Offline smoke test using the bundled example:

```bash
fetchm2 metadata --input examples/offline_metadata.tsv --outdir demo_out --offline
```

Full BioSample metadata retrieval:

```bash
fetchm2 metadata --input ncbi_dataset.tsv --outdir results
```

With NCBI API key:

```bash
export NCBI_API_KEY=YOUR_NCBI_API_KEY
fetchm2 metadata --input ncbi_dataset.tsv --outdir results --workers 6 --sleep 0.15
```

All-in-one metadata plus sequence download:

```bash
fetchm2 run --input ncbi_dataset.tsv --outdir results --download
```

Filtered sequence download from a clean table:

```bash
fetchm2 seq \
  --input results/metadata_output/fetchm2_clean.csv \
  --outdir results/sequence \
  --host "Homo sapiens" \
  --country Bangladesh \
  --year-from 2018 \
  --year-to 2024
```
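
The same kind of selection can be previewed on a clean table before downloading anything. The sketch below filters rows with the standard library only. It assumes that `--host`, `--country`, and the year flags correspond to the `Host_SD`, `Country`, and `Collection_Year` columns named in this README; that mapping, the `Accession` column, the case-insensitive exact matching, and the toy rows are all assumptions for illustration, not FetchM2's documented behavior.

```python
import csv
import io
from typing import Dict, Iterable, List, Optional

# Toy table; accessions and values are invented for the example.
SAMPLE = """Accession,Host_SD,Country,Collection_Year
GCF_TOY_1,Homo sapiens,Bangladesh,2019
GCF_TOY_2,Homo sapiens,Bangladesh,2015
GCF_TOY_3,Gallus gallus,Bangladesh,2020
GCF_TOY_4,Homo sapiens,India,2021
GCF_TOY_5,Homo sapiens,Bangladesh,
"""


def filter_rows(
    rows: Iterable[Dict[str, str]],
    host: Optional[str] = None,
    country: Optional[str] = None,
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
) -> List[Dict[str, str]]:
    """Keep rows matching every filter that was supplied."""
    selected = []
    for row in rows:
        if host is not None and (row.get("Host_SD") or "").strip().lower() != host.lower():
            continue
        if country is not None and (row.get("Country") or "").strip().lower() != country.lower():
            continue
        if year_from is not None or year_to is not None:
            try:
                year = int(float(row.get("Collection_Year") or ""))
            except (TypeError, ValueError, OverflowError):
                continue  # no usable year: excluded when a year range is requested
            if year_from is not None and year < year_from:
                continue
            if year_to is not None and year > year_to:
                continue
        selected.append(row)
    return selected


if __name__ == "__main__":
    rows = list(csv.DictReader(io.StringIO(SAMPLE)))
    kept = filter_rows(rows, host="Homo sapiens", country="Bangladesh",
                       year_from=2018, year_to=2024)
    print([row["Accession"] for row in kept])
```

In this sketch, rows with a missing year are dropped whenever a year range is given. Whether the real `seq` command keeps or drops such rows should be checked with `fetchm2 seq --help`.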

## Main Commands

```bash
fetchm2 metadata --help
fetchm2 run --help
fetchm2 seq --help
fetchm2 audit --help
```

## Metadata Outputs

FetchM2 writes:

- `metadata_output/fetchm2_clean.csv`
- `metadata_output/fetchm2_clean.tsv`
- `metadata_output/fetchm2_report.md`
- `audit/standardization_summary.csv`
- `audit/top_host_review_needed.csv`
- `audit/standardization_audit.md`

Important standardized fields include:

- `Host_SD`, `Host_TaxID`, `Host_Rank`, `Host_Superkingdom`, `Host_Phylum`, `Host_Class`, `Host_Order`, `Host_Family`, `Host_Genus`, `Host_Species`
- `Host_Common_Name`, `Host_Match_Method`, `Host_Confidence`, `Host_Review_Status`
- `Sample_Type_SD`, `Sample_Type_SD_Broad`
- `Isolation_Source_SD`, `Isolation_Source_SD_Broad`
- `Isolation_Site_SD`
- `Environment_Medium_SD`, `Environment_Medium_SD_Broad`
- `Environment_Broad_Scale_SD`, `Environment_Local_Scale_SD`
- `Host_Disease_SD`, `Host_Health_State_SD`
- `Country`, `Continent`, `Subcontinent`, `Collection_Year`

## Sequence Download Options

FetchM2 supports filtering by:

- host
- host rank
- country
- continent
- subcontinent
- sample type
- isolation source
- environment medium
- collection year range
- maximum genomes

Use `--check-only` to audit a sequence output directory without downloading.

## API Keys

For NCBI, prefer environment variables:

```bash
export NCBI_API_KEY=YOUR_NCBI_API_KEY
export NCBI_EMAIL=you@example.com
```

Do not place API keys in scripts, notebooks, README files, or Git commits.
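
For readers writing their own scripts against NCBI, the sketch below shows one way to read these variables at runtime instead of hard-coding them. It assumes requests go to NCBI E-utilities, which accept `api_key`, `email`, and `tool` query parameters. The function names and the `tool` value are hypothetical, and this is not a description of how FetchM2 builds its requests.

```python
import os
from typing import Dict, Mapping, Optional


def ncbi_request_params(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build common E-utilities query parameters from environment variables."""
    env = os.environ if env is None else env
    params = {"tool": "fetchm2-example"}  # hypothetical tool name
    api_key = env.get("NCBI_API_KEY", "").strip()
    email = env.get("NCBI_EMAIL", "").strip()
    if api_key:
        params["api_key"] = api_key
    if email:
        params["email"] = email
    return params


def min_request_interval(params: Mapping[str, str]) -> float:
    """Seconds to wait between requests under NCBI's default rate limits.

    NCBI allows 3 requests per second without a key and 10 with one.
    """
    return 0.1 if "api_key" in params else 1 / 3


if __name__ == "__main__":
    params = ncbi_request_params()
    # Never print the key itself; report only whether one was found.
    print("API key configured:", "api_key" in params)
    print("Minimum delay between requests:", round(min_request_interval(params), 3))
```

Because the key is read from the environment when the script runs, it never needs to appear in the script, a notebook, or a commit.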

## Design Compared With FetchM and FetchM Web

FetchM2 uses the original FetchM standalone flow as the command-line baseline:

- metadata
- run
- seq
- SQLite cache
- NCBI BioSample fetch
- sequence download from NCBI FTP

FetchM2 adds FetchM Web-style standardized metadata fields and deterministic rule files:

- host synonyms and negative host rules
- controlled source/sample/environment categories
- approved broad vocabulary
- production-style audit gate
- richer sequence filtering on standardized fields

FetchM2 intentionally does not use embeddings or AI for production mappings. Embeddings can be used later as a review assistant, but final production rules should remain deterministic and auditable.
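
The sketch below illustrates what a deterministic, auditable mapping can look like: a normalized lookup in a synonym table, guarded by negative rules, with the matching method recorded for each result. It is an illustration only. The in-memory tables, the method labels, and the function names are assumptions; the real schemas of `host_synonyms.csv` and `host_negative_rules.csv` are not shown in this README.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical stand-ins for the bundled rule files.
SYNONYMS: Dict[str, Tuple[str, int]] = {
    "human": ("Homo sapiens", 9606),
    "homo sapiens": ("Homo sapiens", 9606),
    "chicken": ("Gallus gallus", 9031),
}
NEGATIVE: Set[str] = {"missing", "not applicable", "unknown"}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so lookups are repeatable."""
    return " ".join(text.lower().split())


def standardize_host(raw: str) -> Tuple[Optional[str], Optional[int], str]:
    """Return (standard name, TaxID, match method) for a raw host string."""
    key = normalize(raw)
    if not key or key in NEGATIVE:
        return (None, None, "negative_rule")
    if key in SYNONYMS:
        name, taxid = SYNONYMS[key]
        return (name, taxid, "synonym_exact")
    # No rule matched: flag for manual review instead of guessing.
    return (None, None, "review_needed")


if __name__ == "__main__":
    for value in ["  Human ", "chicken", "missing", "swamp creature"]:
        print(repr(value), "->", standardize_host(value))
```

Every output can be traced back to a specific table entry, and unmatched values are surfaced for review rather than mapped by similarity.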

## Testing

Run:

```bash
pytest
python -m build
python -m pip install dist/fetchm2-*.whl
fetchm2 metadata --input examples/offline_metadata.tsv --outdir smoke_out --offline
fetchm2 seq --input smoke_out/metadata_output/fetchm2_clean.csv --outdir smoke_seq --country Bangladesh --check-only
```

## License

MIT License.
@@ -0,0 +1,20 @@
fetchm2/__init__.py,sha256=CLanHxpL8WLckFTU_jlA8aCDMOaZfM-9GQ8XO5syJ3U,94
fetchm2/audit.py,sha256=Sn_MK05O7snpJ2FOUMEJZdKvbs4HqN4U-rdrMH8ua-k,5760
fetchm2/cli.py,sha256=roRAzM1eU-hecHqq3BJlzmKuzLMT3t6nGPxEWtJhAtg,8464
fetchm2/metadata.py,sha256=0IDzJwoOJMlMH50PSTe9wSzhfJdz6NqGKSZvb90AqRo,8201
fetchm2/sequence.py,sha256=7t7C0PZv_aeEA4E0I3SHLxS8Yjz1t1ob4mKc9E8od1E,7956
fetchm2/standardization.py,sha256=_AciT1VkYWtkWdAu6RpWhpkre6OMxEExHKMSNPmjMlc,20618
fetchm2/utils.py,sha256=jmLo2Zbcfj6ikKjNdEg2XrU-wrUp8GVyRkgcDxzDrAU,1652
fetchm2/data/__init__.py,sha256=8NlOfHm8CpgA9TnBDw6UC7JfZQTzKDHnAnhjTbrk68w,46
fetchm2/data/approved_broad_categories.csv,sha256=LD7B22NmcjaFLQopC5_Ax-7xLjTbwsXqIKEDX7niKvg,4872
fetchm2/data/controlled_categories.csv,sha256=xLa8_r1vklLg789wYCEzBt7T4IU6TmuUCE89gS_MZds,2395053
fetchm2/data/country_mapping.json,sha256=PYk_PGYM1F_uHxpbUXylZIWoPu4NVjKKqvol8t971xQ,17344
fetchm2/data/geography_reviewed_rules.csv,sha256=1zTDMq078ltYs7IVabn8QGmn5Vf8eyxCBNRMYOWUR2o,1560
fetchm2/data/host_negative_rules.csv,sha256=YodAzIvSpEX2rSSe1JTxoZtzdPbDRA8Ld7Hu0GxFEmY,45300
fetchm2/data/host_synonyms.csv,sha256=Byy5igpM2acOB_d-awLOvxkvQqsnJDAlVPx5By-QVJQ,716973
fetchm2-0.1.0.dist-info/licenses/LICENSE,sha256=8QW78fG4kk5GSOCzLE_D9SrD2eTlANgRr8vc4baQ-MI,1076
fetchm2-0.1.0.dist-info/METADATA,sha256=IZd-qZ2psh95S3EaSvk7HuDkJEanbr60ke6I_H-vWe0,6240
fetchm2-0.1.0.dist-info/WHEEL,sha256=aeYiig01lYGDzBgS8HxWXOg3uV61G9ijOsup-k9o1sk,91
fetchm2-0.1.0.dist-info/entry_points.txt,sha256=vvc7-U-NmlkAe3LaZArXHEQcp8eilgQKAS-T1FyInt4,72
fetchm2-0.1.0.dist-info/top_level.txt,sha256=pIvb4NVf1LumJONqwoqxTxErRiLimYFPnaPtpiSuM9I,8
fetchm2-0.1.0.dist-info/RECORD,,
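
Each RECORD row above has the form `path,hash,size`, where the hash is the file's SHA-256 digest encoded as URL-safe base64 with the trailing `=` padding removed, as the wheel format specifies. The following Python sketch checks one row against file bytes. The helper names `record_digest` and `verify_entry` are hypothetical, and the sketch makes no claim about whether the hashes listed above match the published files.

```python
import base64
import csv
import hashlib
from typing import Optional


def record_digest(data: bytes) -> str:
    """Return the RECORD-style hash string for the given file content."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return "sha256=" + encoded


def verify_entry(record_line: str, data: bytes) -> Optional[bool]:
    """Check file bytes against one RECORD row.

    Returns None when the row carries no hash, as for RECORD itself.
    """
    path, hash_field, size_field = next(csv.reader([record_line]))
    if not hash_field:
        return None
    size_ok = (not size_field) or int(size_field) == len(data)
    return size_ok and record_digest(data) == hash_field


if __name__ == "__main__":
    content = b"example bytes\n"  # toy content, not a file from this wheel
    line = "example.txt,{},{}".format(record_digest(content), len(content))
    print(line)
    print("verified:", verify_entry(line, content))
    print("tampered:", verify_entry(line, content + b"x"))
```

To check an installed or unpacked wheel, read each listed file in binary mode and pass its bytes together with the matching RECORD row.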
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Tasnimul Arabi Anik

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1 @@
fetchm2