RefgenDetector 3.0.0__tar.gz

@@ -0,0 +1,3 @@
1
+ RefgenDetector is released under GNU General Public License v3.0.
2
+
3
+ It was funded by ELIXIR, the research infrastructure for life-science data (ELIXIR Beacon Implementation Studies 2019-2021 and 2022-2023).
@@ -0,0 +1,300 @@
1
+ Metadata-Version: 2.4
2
+ Name: RefgenDetector
3
+ Version: 3.0.0
4
+ Summary: RefgenDetector
5
+ Author: Mireia Marin i Ginestar
6
+ Author-email: <mireia.marin@crg.eu>
7
+ Keywords: python
8
+ Classifier: Programming Language :: Python :: 3
9
+ Classifier: Operating System :: Unix
10
+ Description-Content-Type: text/markdown
11
+ License-File: LICENSE
12
+ Requires-Dist: argparse
13
+ Requires-Dist: pysam
14
+ Requires-Dist: psutil
15
+ Requires-Dist: rich
16
+ Requires-Dist: pandas
17
+ Requires-Dist: dnspython
18
+ Requires-Dist: msgpack
19
+ Dynamic: author
20
+ Dynamic: author-email
21
+ Dynamic: classifier
22
+ Dynamic: description
23
+ Dynamic: description-content-type
24
+ Dynamic: keywords
25
+ Dynamic: license-file
26
+ Dynamic: requires-dist
27
+ Dynamic: summary
28
+
29
+ # EGA - RefgenDetector
30
+
31
+ RefgenDetector is a bioinformatics tool that **infers the reference genome assembly** used to create alignment files (BAM/CRAM/header) and VCFs.
32
+
33
+ ## Alignment Files
34
+
35
+ It identifies major genome releases and derived assemblies across humans and multiple other species by analyzing contig names and lengths **from the header**. Benchmarking against 94 synthetic datasets achieved a 100% accuracy rate, while large-scale testing on 918,404 real-world files demonstrated 97.13% correctness, failing only when files’ headers are incomplete.
36
+
37
+ ### Description
38
+
39
+ RefgenDetector is able to infer the following reference genomes:
40
+
41
+ **Primates**
42
+
43
+ 👤 Homo sapiens
44
+
45
+ - hg16
46
+ - hg17
47
+ - hg18
48
+ - GRCh37
49
+ - GRCh38
50
+ - T2T
51
+
52
+ 🐒 Pan troglodytes
53
+
54
+ - pantro3_0
55
+ - Pan_troglodytes-2.1
56
+
57
+ 🐵 Macaca mulatta
58
+
59
+ - Mmul10
60
+ - rheMac8
61
+ - rheMac3
62
+
63
+ **Rodents**
64
+
65
+ 🐭 Mus musculus
66
+
67
+ - mm7
68
+ - mm8
69
+ - mm9
70
+ - mm10
71
+ - mm39
72
+
73
+ 🐀 Rattus norvegicus
74
+
75
+ - mRatBN7_2
76
+ - Rnor_6_0
77
+
78
+ **Other Mammals**
79
+
80
+ 🐷 Sus scrofa
81
+
82
+ - Sscrofa10_2
83
+ - Sscrofa11_1
84
+
85
+ **Vertebrates (Non-Mammalian)**
86
+
87
+ 🐟 Danio rerio
88
+
89
+ - danRer10
90
+ - danRer11
91
+
92
+ **Invertebrates**
93
+
94
+ 🪰 Drosophila melanogaster
95
+
96
+ - dm5
97
+ - dm6
98
+
99
+ 🐛 Caenorhabditis elegans
100
+
101
+ - WBcel215
102
+ - WBcel235
103
+
104
+ **Microorganisms & Plants**
105
+
106
+ 🧫 Escherichia coli
107
+
108
+ - ASM886v2
109
+ - ASM584v2
110
+
111
+ 🌱 Arabidopsis thaliana
112
+
113
+ - TAIR
114
+
115
+ 🍺 Saccharomyces cerevisiae
116
+
117
+ - R64
118
+
119
+ ## `ref_manager.py` - Customize the assemblies database.
120
+
121
+ `ref_manager.py` provides command-line management of reference genomes used by RefgenDetector. It allows users to add custom assemblies from FASTA index (`.fai`) files, list all available references, and remove previously added custom entries without modifying the source code.
122
+
123
+ ### Usage
124
+
125
+ ```bash
126
+ python ref_manager.py <command> [options]
127
+ ```
128
+
129
+ ### Commands
130
+
131
+ #### Add a reference
132
+
133
+ ```bash
134
+ python ref_manager.py add <genome.fai> <reference_name> <species>
135
+ ```
136
+
137
+ Registers a new reference from a valid `.fai` file. If the contig structure matches an existing reference, the entry is not added.
138
+
139
+ #### List references
140
+
141
+ ```bash
142
+ python ref_manager.py list
143
+ ```
144
+
145
+ Displays all available references, including both built-in and user-defined assemblies.
146
+
147
+ #### Remove a reference
148
+
149
+ ```bash
150
+ python ref_manager.py remove <reference_name>
151
+ ```
152
+
153
+ Removes a custom reference from the local database. Built-in references cannot be removed.
154
+
155
+ ### Notes
156
+
157
+ - Custom references are stored separately from the default reference database.
158
+ - Input files must be valid FASTA index files generated with `samtools faidx`.
159
+ - Duplicate assemblies are detected based on exact contig composition.
160
+
161
+ ## Variant Calling Files (VCFs)
162
+
163
+ From VCF files only 4 human assemblies can be inferred:
164
+
165
+ - hg18
166
+ - GRCh37
167
+ - GRCh38
168
+ - T2T
169
+
170
+ Two different sources of information are used to infer the reference genome from variant calling files:
171
+
172
+ * **Header**
173
+
174
+ The VCF specification recommends, but **does not require**, that the header include tags describing the reference and contigs backing the data in the file. When these tags are present, the tool analyzes them and reports the reference genome version based on the contig lengths, following the same logic used for alignment files.
175
+
176
+ * **Variants**
177
+
178
+ To infer the reference genome from the variants, the tool reads the VCF file in chunks of 100,000 variants to avoid loading the complete file into memory. The `POS` and `REF` columns are extracted and compared to the msgpack files.
179
+
180
+ The msgpack files were created by comparing the nucleotide at each position in hg18, GRCh37, GRCh38 and T2T. Each file contains a list of the positions where the references have different nucleotides (distinguishing positions).
181
+
182
+ By counting the matches between these distinguishing positions and the `REF` values present in the VCF, the tool infers the reference genome version used to call the variants.
183
+
184
+ ## Requirements
185
+
186
+ - Python 3.10.6
187
+
188
+ Depending on how you want to install the package:
189
+
190
+ - pip
191
+ - Docker
192
+
193
+ Download the `msgpack` files for the inference with VCFs:
194
+
195
+ 1. [Download the msgpack reference](https://crgcnag-my.sharepoint.com/:u:/g/personal/mimarin_crg_es/IQDa5CICZDAoRZmbfhBG3ZPEAWdVnNqvefFJB_r5Hc8aM70?e=kID7zn)
196
+
197
+ 2. Move the `msgpack` to the correct path:
198
+
199
+ ```
200
+ mv msgpack.zip /refgenDetector/src/refgenDetector/
201
+ unzip /refgenDetector/src/refgenDetector/msgpack.zip
202
+ ```
203
+
204
+ ## Installation
205
+
206
+ ### Cloning this repository
207
+
208
+ 1. Clone this repository
209
+
210
+ 2. ``` $ cd PATH_WHERE_YOU_CLONED_THE_REPOSITORY/src/refgenDetector ```
211
+
212
+ 3. ``$ python3 refgenDetector_main.py -h ``
213
+
214
+ ### From PyPI
215
+
216
+ ``$ pip install refgenDetector``
217
+
218
+ ### From Docker
219
+ ``
220
+
221
+ ## Usage
222
+
223
+ You can get the help menu by running:
224
+
225
+ ```
226
+ $ refgenDetector -h
227
+ ```
228
+
229
+ ```
230
+ usage: INFERRING THE REFERENCE GENOME USED TO ALIGN BAM OR CRAM FILE [-h] -f FILE -t {BAM/CRAM,Header,VCF,BIM} [--md5] [-a] [-v MAX_N_VAR] [-m MATCHES] [-r]
231
+
232
+ optional arguments:
233
+ -h, --help show this help message and exit
234
+ -f FILE, --file FILE Input file path
235
+ -t {BAM/CRAM,Header,VCF,BIM}, --type {BAM/CRAM,Header,VCF,BIM}
236
+ Type of files to analyze.
237
+ --md5 Print md5 values if present in header.
238
+ -a, --assembly Print assembly if present in header.
239
+ -v MAX_N_VAR, --max_n_var MAX_N_VAR
240
+ Maximum number of variants to read before stopping inference. The file is processed in chunks of 100,000 variants, so this value must be a multiple of 100,000 (e.g. 100000,
241
+ 200000, 300000, ...).
242
+ -m MATCHES, --matches MATCHES
243
+ Number of matches required before stopping. [DEFAULT:5000]
244
+ -r, --resources When set, print execution time, CPU, memory, and disk I/O usage
245
+ ```
246
+
247
+ ## Test RefgenDetector
248
+
249
+ In the folder **examples** you can find headers, BAM and CRAM files to test RefgenDetector.
250
+
251
+ *All these files belong to the [synthetic data cohort](https://ega-archive.org/synthetic-data) from the European
252
+ Genome-Phenome Archive ([EGA](https://ega-archive.org/)).*
253
+
254
+ ### Test with headers in a TXT
255
+
256
+ In the folder TEST_HEADERS there are four headers obtained from synthetic BAM and CRAM files stored in the EGA. Each one of
257
+ them belongs to a different synthetic study:
258
+
259
+ - Test Study for EGA using data from 1000 Genomes Project - Phase
260
+ 3 [EGAS00001005042](https://ega-archive.org/studies/EGAS00001005042).
261
+ - Synthetic data - Genome in a Bottle - [EGAS00001005591](https://ega-archive.org/studies/EGAS00001005591).
262
+ - Human genomic and phenotypic synthetic data for the study of rare
263
+ diseases - [EGAS00001005702](https://ega-archive.org/studies/EGAS00001005702).
264
+ - CINECA synthetic data. Please note: This study contains synthetic data (with cohort “participants” / ”subjects” marked
265
+ with FAKE) has no identifiable data and cannot be used to make any inference about cohort data or
266
+ results - [EGAS00001002472](https://ega-archive.org/studies/EGAS00001002472).
267
+
268
+ Further information about them can be found in the file *where_to_find_this_files.txt*, saved in the same folder.
269
+
270
+ To run RefgenDetector with the files:
271
+
272
+ 1. Modify the txt *path_to_headers* so the paths match those on your computer.
273
+ 2. Run:
274
+
275
+ ``` $ refgenDetector -p /PATH_WHERE_YOU_CLONED_THE_REPOSITORY/refgenDetector/examples/path_to_headers -t Headers```
276
+
277
+ ### Test with BAM and CRAMs
278
+
279
+ In the folder TEST_BAM_CRAM there are a BAM and a CRAM file obtained from the synthetic BAM and CRAM files stored in the EGA. They
280
+ belong to the synthetic study - Test Study for EGA using data from 1000 Genomes Project - Phase
281
+ 3 [EGAS00001005042](https://ega-archive.org/studies/EGAS00001005042).
282
+
283
+ Further information about them can be found in the file *where_to_find_this_files.txt*, saved in the same folder.
284
+
285
+ To run RefgenDetector with the files:
286
+
287
+ 1. Modify the txt *path_to_bam_cram* so the paths match those on your computer.
288
+
289
+ 2. Run:
290
+
291
+ ``` $ refgenDetector -p /PATH_WHERE_YOU_CLONED_THE_REPOSITORY/refgenDetector_pip-master/examples/path_to_bam_cram -t BAM/CRAM```
292
+
293
+
294
+
295
+ ## Licence and funding
296
+
297
+ RefgenDetector is released under GNU General Public License v3.0.
298
+
299
+ It was funded by ELIXIR, the research infrastructure for life-science data (ELIXIR Beacon Implementation Studies
300
+ 2019-2021 and 2022-2023).
@@ -0,0 +1,272 @@
1
+ # EGA - RefgenDetector
2
+
3
+ RefgenDetector is a bioinformatics tool that **infers the reference genome assembly** used to create alignment files (BAM/CRAM/header) and VCFs.
4
+
5
+ ## Alignment Files
6
+
7
+ It identifies major genome releases and derived assemblies across humans and multiple other species by analyzing contig names and lengths **from the header**. Benchmarking against 94 synthetic datasets achieved a 100% accuracy rate, while large-scale testing on 918,404 real-world files demonstrated 97.13% correctness, failing only when files’ headers are incomplete.
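The header-based inference described above can be illustrated with a minimal sketch. It is not RefgenDetector's own code: the function names `parse_sq_lines` and `infer_assembly`, the small `KNOWN_LENGTHS` table (chr1 and chr2 only) and the simple voting rule are assumptions made for illustration. The idea is to read contig names and lengths from the `@SQ` lines of a header and see which known assembly they agree with.

```python
from collections import Counter

# Illustrative subset: lengths (bp) of chr1 and chr2 for three human assemblies.
# A real database would cover every contig of every supported assembly.
KNOWN_LENGTHS = {
    "hg18": {"1": 247249719, "2": 242951149},
    "GRCh37": {"1": 249250621, "2": 243199373},
    "GRCh38": {"1": 248956422, "2": 242193529},
}


def parse_sq_lines(header_text):
    """Return {contig_name: length} from the @SQ lines of a SAM-style header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field looks like TAG:VALUE, e.g. SN:chr1 or LN:248956422.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in fields and "LN" in fields:
            contigs[fields["SN"]] = int(fields["LN"])
    return contigs


def infer_assembly(contigs):
    """Return the assembly whose contig lengths match most often, or None."""
    votes = Counter()
    for name, length in contigs.items():
        # Hypothetical normalisation: treat "chr1" and "1" as the same contig.
        key = name[3:] if name.startswith("chr") else name
        for assembly, lengths in KNOWN_LENGTHS.items():
            if lengths.get(key) == length:
                votes[assembly] += 1
    return votes.most_common(1)[0][0] if votes else None


header = "@HD\tVN:1.6\n@SQ\tSN:chr1\tLN:248956422\n@SQ\tSN:chr2\tLN:242193529\n"
print(infer_assembly(parse_sq_lines(header)))  # GRCh38
```

A header without `@SQ` lines yields no votes and therefore no inference, which is consistent with the statement that the tool fails only on incomplete headers.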
8
+
9
+ ### Description
10
+
11
+ RefgenDetector is able to infer the following reference genomes:
12
+
13
+ **Primates**
14
+
15
+ 👤 Homo sapiens
16
+
17
+ - hg16
18
+ - hg17
19
+ - hg18
20
+ - GRCh37
21
+ - GRCh38
22
+ - T2T
23
+
24
+ 🐒 Pan troglodytes
25
+
26
+ - pantro3_0
27
+ - Pan_troglodytes-2.1
28
+
29
+ 🐵 Macaca mulatta
30
+
31
+ - Mmul10
32
+ - rheMac8
33
+ - rheMac3
34
+
35
+ **Rodents**
36
+
37
+ 🐭 Mus musculus
38
+
39
+ - mm7
40
+ - mm8
41
+ - mm9
42
+ - mm10
43
+ - mm39
44
+
45
+ 🐀 Rattus norvegicus
46
+
47
+ - mRatBN7_2
48
+ - Rnor_6_0
49
+
50
+ **Other Mammals**
51
+
52
+ 🐷 Sus scrofa
53
+
54
+ - Sscrofa10_2
55
+ - Sscrofa11_1
56
+
57
+ **Vertebrates (Non-Mammalian)**
58
+
59
+ 🐟 Danio rerio
60
+
61
+ - danRer10
62
+ - danRer11
63
+
64
+ **Invertebrates**
65
+
66
+ 🪰 Drosophila melanogaster
67
+
68
+ - dm5
69
+ - dm6
70
+
71
+ 🐛 Caenorhabditis elegans
72
+
73
+ - WBcel215
74
+ - WBcel235
75
+
76
+ **Microorganisms & Plants**
77
+
78
+ 🧫 Escherichia coli
79
+
80
+ - ASM886v2
81
+ - ASM584v2
82
+
83
+ 🌱 Arabidopsis thaliana
84
+
85
+ - TAIR
86
+
87
+ 🍺 Saccharomyces cerevisiae
88
+
89
+ - R64
90
+
91
+ ## `ref_manager.py` - Customize the assemblies database.
92
+
93
+ `ref_manager.py` provides command-line management of reference genomes used by RefgenDetector. It allows users to add custom assemblies from FASTA index (`.fai`) files, list all available references, and remove previously added custom entries without modifying the source code.
94
+
95
+ ### Usage
96
+
97
+ ```bash
98
+ python ref_manager.py <command> [options]
99
+ ```
100
+
101
+ ### Commands
102
+
103
+ #### Add a reference
104
+
105
+ ```bash
106
+ python ref_manager.py add <genome.fai> <reference_name> <species>
107
+ ```
108
+
109
+ Registers a new reference from a valid `.fai` file. If the contig structure matches an existing reference, the entry is not added.
110
+
111
+ #### List references
112
+
113
+ ```bash
114
+ python ref_manager.py list
115
+ ```
116
+
117
+ Displays all available references, including both built-in and user-defined assemblies.
118
+
119
+ #### Remove a reference
120
+
121
+ ```bash
122
+ python ref_manager.py remove <reference_name>
123
+ ```
124
+
125
+ Removes a custom reference from the local database. Built-in references cannot be removed.
126
+
127
+ ### Notes
128
+
129
+ - Custom references are stored separately from the default reference database.
130
+ - Input files must be valid FASTA index files generated with `samtools faidx`.
131
+ - Duplicate assemblies are detected based on exact contig composition.
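As an illustration of the notes above, the sketch below shows one way duplicate detection "based on exact contig composition" could work. It is an assumption about the approach, not the code in `ref_manager.py`: the helpers `read_fai`, `fingerprint` and `find_duplicate` are hypothetical names. It relies only on the documented `.fai` layout, where the first two tab-separated columns are the contig name and its length.

```python
import hashlib


def read_fai(text):
    """Parse .fai content into a list of (contig_name, length) tuples."""
    contigs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Columns: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH; only two are needed.
        name, length = line.split("\t")[:2]
        contigs.append((name, int(length)))
    return contigs


def fingerprint(contigs):
    """Hash the sorted (name, length) pairs so contig order does not matter."""
    canonical = "\n".join(f"{name}\t{length}" for name, length in sorted(contigs))
    return hashlib.sha256(canonical.encode()).hexdigest()


def find_duplicate(new_contigs, database):
    """Return the name of a stored reference with identical contigs, or None."""
    new_fp = fingerprint(new_contigs)
    for ref_name, contigs in database.items():
        if fingerprint(contigs) == new_fp:
            return ref_name
    return None


# Toy database and toy .fai content, invented for this example.
database = {"toy_ref_v1": [("chrA", 1000), ("chrB", 500)]}
same = read_fai("chrB\t500\t1020\t60\t61\nchrA\t1000\t6\t60\t61\n")
different = read_fai("chrA\t1000\t6\t60\t61\nchrB\t501\t1020\t60\t61\n")

print(find_duplicate(same, database))       # toy_ref_v1
print(find_duplicate(different, database))  # None
```

Sorting before hashing is a design choice of this sketch: two indexes listing the same contigs in a different order are treated as the same assembly.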
132
+
133
+ ## Variant Calling Files (VCFs)
134
+
135
+ From VCF files only 4 human assemblies can be inferred:
136
+
137
+ - hg18
138
+ - GRCh37
139
+ - GRCh38
140
+ - T2T
141
+
142
+ Two different sources of information are used to infer the reference genome from variant calling files:
143
+
144
+ * **Header**
145
+
146
+ The VCF specification recommends, but **does not require**, that the header include tags describing the reference and contigs backing the data in the file. When these tags are present, the tool analyzes them and reports the reference genome version based on the contig lengths, following the same logic used for alignment files.
147
+
148
+ * **Variants**
149
+
150
+ To infer the reference genome from the variants, the tool reads the VCF file in chunks of 100,000 variants to avoid loading the complete file into memory. The `POS` and `REF` columns are extracted and compared to the msgpack files.
151
+
152
+ The msgpack files were created by comparing the nucleotide at each position in hg18, GRCh37, GRCh38 and T2T. Each file contains a list of the positions where the references have different nucleotides (distinguishing positions).
153
+
154
+ By counting the matches between these distinguishing positions and the `REF` values present in the VCF, the tool infers the reference genome version used to call the variants.
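The variant-based procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the tool's implementation: the distinguishing positions are plain dictionaries instead of msgpack files, the records are `(chrom, pos, ref)` tuples instead of parsed VCF lines, and the function names and the exact stopping rule are hypothetical.

```python
from collections import Counter
from itertools import islice


def iter_chunks(records, size):
    """Yield lists of at most `size` records without materialising the input."""
    iterator = iter(records)
    while chunk := list(islice(iterator, size)):
        yield chunk


def infer_from_variants(records, distinguishing, chunk_size=100_000,
                        matches_needed=5000, max_variants=None):
    """Count REF matches per assembly, chunk by chunk, and return the best one.

    `distinguishing` maps assembly -> {(chrom, pos): nucleotide}.
    Stops once an assembly reaches `matches_needed` matches or, if set,
    once `max_variants` records have been read (assumed stopping rule).
    """
    matches = Counter()
    seen = 0
    for chunk in iter_chunks(records, chunk_size):
        for chrom, pos, ref in chunk:
            for assembly, table in distinguishing.items():
                if table.get((chrom, pos)) == ref:
                    matches[assembly] += 1
        seen += len(chunk)
        if matches and matches.most_common(1)[0][1] >= matches_needed:
            break
        if max_variants is not None and seen >= max_variants:
            break
    best = matches.most_common(1)[0][0] if matches else None
    return best, matches


# Toy data invented for the example: two positions where the assemblies differ.
distinguishing = {
    "GRCh37": {("1", 100): "A", ("1", 200): "C"},
    "GRCh38": {("1", 100): "G", ("1", 200): "T"},
}
records = [("1", 100, "G"), ("1", 200, "T"), ("1", 300, "A")]

best, counts = infer_from_variants(records, distinguishing,
                                   chunk_size=2, matches_needed=2)
print(best, dict(counts))  # GRCh38 {'GRCh38': 2}
```

Because the input is consumed through an iterator, only one chunk is held in memory at a time, which mirrors the motivation for chunked reading given above.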
155
+
156
+ ## Requirements
157
+
158
+ - Python 3.10.6
159
+
160
+ Depending on how you want to install the package:
161
+
162
+ - pip
163
+ - Docker
164
+
165
+ Download the `msgpack` files for the inference with VCFs:
166
+
167
+ 1. [Download the msgpack reference](https://crgcnag-my.sharepoint.com/:u:/g/personal/mimarin_crg_es/IQDa5CICZDAoRZmbfhBG3ZPEAWdVnNqvefFJB_r5Hc8aM70?e=kID7zn)
168
+
169
+ 2. Move the `msgpack` to the correct path:
170
+
171
+ ```
172
+ mv msgpack.zip /refgenDetector/src/refgenDetector/
173
+ unzip /refgenDetector/src/refgenDetector/msgpack.zip
174
+ ```
175
+
176
+ ## Installation
177
+
178
+ ### Cloning this repository
179
+
180
+ 1. Clone this repository
181
+
182
+ 2. ``` $ cd PATH_WHERE_YOU_CLONED_THE_REPOSITORY/src/refgenDetector ```
183
+
184
+ 3. ``$ python3 refgenDetector_main.py -h ``
185
+
186
+ ### From PyPI
187
+
188
+ ``$ pip install refgenDetector``
189
+
190
+ ### From Docker
191
+ ``
192
+
193
+ ## Usage
194
+
195
+ You can get the help menu by running:
196
+
197
+ ```
198
+ $ refgenDetector -h
199
+ ```
200
+
201
+ ```
202
+ usage: INFERRING THE REFERENCE GENOME USED TO ALIGN BAM OR CRAM FILE [-h] -f FILE -t {BAM/CRAM,Header,VCF,BIM} [--md5] [-a] [-v MAX_N_VAR] [-m MATCHES] [-r]
203
+
204
+ optional arguments:
205
+ -h, --help show this help message and exit
206
+ -f FILE, --file FILE Input file path
207
+ -t {BAM/CRAM,Header,VCF,BIM}, --type {BAM/CRAM,Header,VCF,BIM}
208
+ Type of files to analyze.
209
+ --md5 Print md5 values if present in header.
210
+ -a, --assembly Print assembly if present in header.
211
+ -v MAX_N_VAR, --max_n_var MAX_N_VAR
212
+ Maximum number of variants to read before stopping inference. The file is processed in chunks of 100,000 variants, so this value must be a multiple of 100,000 (e.g. 100000,
213
+ 200000, 300000, ...).
214
+ -m MATCHES, --matches MATCHES
215
+ Number of matches required before stopping. [DEFAULT:5000]
216
+ -r, --resources When set, print execution time, CPU, memory, and disk I/O usage
217
+ ```
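The constraint that `--max_n_var` must be a multiple of 100,000 can be enforced at parse time. The sketch below shows one way to do this with `argparse`. The flag names, choices and the default of 5000 come from the help text above, while the validator `multiple_of_chunk` and the overall parser layout are assumptions, not the code in `refgenDetector_main.py`.

```python
import argparse

CHUNK_SIZE = 100_000  # chunk size stated in the help text


def multiple_of_chunk(value):
    """argparse type: accept only positive multiples of CHUNK_SIZE."""
    number = int(value)
    if number <= 0 or number % CHUNK_SIZE != 0:
        raise argparse.ArgumentTypeError(
            f"{value} is not a positive multiple of {CHUNK_SIZE}"
        )
    return number


# Hypothetical reconstruction of part of the command-line interface.
parser = argparse.ArgumentParser(prog="refgenDetector")
parser.add_argument("-f", "--file", required=True, help="Input file path")
parser.add_argument("-t", "--type", required=True,
                    choices=["BAM/CRAM", "Header", "VCF", "BIM"],
                    help="Type of files to analyze.")
parser.add_argument("-v", "--max_n_var", type=multiple_of_chunk,
                    help="Maximum number of variants to read.")
parser.add_argument("-m", "--matches", type=int, default=5000,
                    help="Number of matches required before stopping.")

args = parser.parse_args(["-f", "sample.vcf", "-t", "VCF", "-v", "300000"])
print(args.max_n_var, args.matches)  # 300000 5000
```

With this validator, a value such as `-v 150000` makes `argparse` exit with a usage error instead of silently rounding to a chunk boundary.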
218
+
219
+ ## Test RefgenDetector
220
+
221
+ In the folder **examples** you can find headers, BAM and CRAM files to test RefgenDetector.
222
+
223
+ *All these files belong to the [synthetic data cohort](https://ega-archive.org/synthetic-data) from the European
224
+ Genome-Phenome Archive ([EGA](https://ega-archive.org/)).*
225
+
226
+ ### Test with headers in a TXT
227
+
228
+ In the folder TEST_HEADERS there are four headers obtained from synthetic BAM and CRAM files stored in the EGA. Each one of
229
+ them belongs to a different synthetic study:
230
+
231
+ - Test Study for EGA using data from 1000 Genomes Project - Phase
232
+ 3 [EGAS00001005042](https://ega-archive.org/studies/EGAS00001005042).
233
+ - Synthetic data - Genome in a Bottle - [EGAS00001005591](https://ega-archive.org/studies/EGAS00001005591).
234
+ - Human genomic and phenotypic synthetic data for the study of rare
235
+ diseases - [EGAS00001005702](https://ega-archive.org/studies/EGAS00001005702).
236
+ - CINECA synthetic data. Please note: This study contains synthetic data (with cohort “participants” / ”subjects” marked
237
+ with FAKE) has no identifiable data and cannot be used to make any inference about cohort data or
238
+ results - [EGAS00001002472](https://ega-archive.org/studies/EGAS00001002472).
239
+
240
+ Further information about them can be found in the file *where_to_find_this_files.txt*, saved in the same folder.
241
+
242
+ To run RefgenDetector with the files:
243
+
244
+ 1. Modify the txt *path_to_headers* so the paths match those on your computer.
245
+ 2. Run:
246
+
247
+ ``` $ refgenDetector -p /PATH_WHERE_YOU_CLONED_THE_REPOSITORY/refgenDetector/examples/path_to_headers -t Headers```
248
+
249
+ ### Test with BAM and CRAMs
250
+
251
+ In the folder TEST_BAM_CRAM there are a BAM and a CRAM file obtained from the synthetic BAM and CRAM files stored in the EGA. They
252
+ belong to the synthetic study - Test Study for EGA using data from 1000 Genomes Project - Phase
253
+ 3 [EGAS00001005042](https://ega-archive.org/studies/EGAS00001005042).
254
+
255
+ Further information about them can be found in the file *where_to_find_this_files.txt*, saved in the same folder.
256
+
257
+ To run RefgenDetector with the files:
258
+
259
+ 1. Modify the txt *path_to_bam_cram* so the paths match those on your computer.
260
+
261
+ 2. Run:
262
+
263
+ ``` $ refgenDetector -p /PATH_WHERE_YOU_CLONED_THE_REPOSITORY/refgenDetector_pip-master/examples/path_to_bam_cram -t BAM/CRAM```
264
+
265
+
266
+
267
+ ## Licence and funding
268
+
269
+ RefgenDetector is released under GNU General Public License v3.0.
270
+
271
+ It was funded by ELIXIR, the research infrastructure for life-science data (ELIXIR Beacon Implementation Studies
272
+ 2019-2021 and 2022-2023).
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
@@ -0,0 +1,32 @@
1
+ from setuptools import setup, find_packages
2
+ # read the contents of your README file
3
+ from pathlib import Path
4
+ this_directory = Path(__file__).parent
5
+ long_description = (this_directory / "README.md").read_text()
6
+
7
+ VERSION = '3.0.0'
8
+ DESCRIPTION = 'RefgenDetector'
9
+
10
+ # Setting up
11
+ setup(
12
+ name="RefgenDetector",
13
+ version=VERSION,
14
+ author="Mireia Marin i Ginestar",
15
+ author_email="<mireia.marin@crg.eu>",
16
+ description=DESCRIPTION,
17
+ long_description=long_description,
18
+ long_description_content_type='text/markdown',
19
+ install_requires=['argparse', 'pysam', 'psutil', 'rich', 'pandas', 'dnspython', 'msgpack'],
20
+ keywords=['python'],
21
+ classifiers=[
22
+ "Programming Language :: Python :: 3",
23
+ "Operating System :: Unix"],
24
+ entry_points={
25
+ 'console_scripts': [
26
+ 'refgenDetector=refgenDetector.refgenDetector_main:main',
27
+ ],
28
+ },
29
+ packages=find_packages(where='src'),
30
+ package_dir={'': 'src'}
31
+
32
+ )