RefgenDetector 3.0.0__tar.gz
- refgendetector-3.0.0/LICENSE +3 -0
- refgendetector-3.0.0/PKG-INFO +300 -0
- refgendetector-3.0.0/README.md +272 -0
- refgendetector-3.0.0/setup.cfg +4 -0
- refgendetector-3.0.0/setup.py +32 -0
- refgendetector-3.0.0/src/RefgenDetector.egg-info/PKG-INFO +300 -0
- refgendetector-3.0.0/src/RefgenDetector.egg-info/SOURCES.txt +16 -0
- refgendetector-3.0.0/src/RefgenDetector.egg-info/dependency_links.txt +1 -0
- refgendetector-3.0.0/src/RefgenDetector.egg-info/entry_points.txt +2 -0
- refgendetector-3.0.0/src/RefgenDetector.egg-info/requires.txt +7 -0
- refgendetector-3.0.0/src/RefgenDetector.egg-info/top_level.txt +1 -0
- refgendetector-3.0.0/src/refgenDetector/__init__.py +0 -0
- refgendetector-3.0.0/src/refgenDetector/aligment_files.py +272 -0
- refgendetector-3.0.0/src/refgenDetector/chromosomes_dict.py +34 -0
- refgendetector-3.0.0/src/refgenDetector/ref_manager.py +240 -0
- refgendetector-3.0.0/src/refgenDetector/reference_genome_dictionaries.py +820 -0
- refgendetector-3.0.0/src/refgenDetector/refgenDetector_main.py +114 -0
- refgendetector-3.0.0/src/refgenDetector/variant_files.py +363 -0
|
@@ -0,0 +1,300 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: RefgenDetector
|
|
3
|
+
Version: 3.0.0
|
|
4
|
+
Summary: RefgenDetector
|
|
5
|
+
Author: Mireia Marin i Ginestar
|
|
6
|
+
Author-email: <mireia.marin@crg.eu>
|
|
7
|
+
Keywords: python
|
|
8
|
+
Classifier: Programming Language :: Python :: 3
|
|
9
|
+
Classifier: Operating System :: Unix
|
|
10
|
+
Description-Content-Type: text/markdown
|
|
11
|
+
License-File: LICENSE
|
|
12
|
+
Requires-Dist: argparse
|
|
13
|
+
Requires-Dist: pysam
|
|
14
|
+
Requires-Dist: psutil
|
|
15
|
+
Requires-Dist: rich
|
|
16
|
+
Requires-Dist: pandas
|
|
17
|
+
Requires-Dist: dnspython
|
|
18
|
+
Requires-Dist: msgpack
|
|
19
|
+
Dynamic: author
|
|
20
|
+
Dynamic: author-email
|
|
21
|
+
Dynamic: classifier
|
|
22
|
+
Dynamic: description
|
|
23
|
+
Dynamic: description-content-type
|
|
24
|
+
Dynamic: keywords
|
|
25
|
+
Dynamic: license-file
|
|
26
|
+
Dynamic: requires-dist
|
|
27
|
+
Dynamic: summary
|
|
28
|
+
|
|
29
|
+
# EGA - RefgenDetector
|
|
30
|
+
|
|
31
|
+
RefgenDetector is a bioinformatics tool that **infers the reference genome assembly** used to create alignment files (BAM/CRAM/header) and VCFs.
|
|
32
|
+
|
|
33
|
+
## Alignment Files
|
|
34
|
+
|
|
35
|
+
It identifies major genome releases and derived assemblies across humans and multiple other species by analyzing contig names and lengths **from the header**. Benchmarking against 94 synthetic datasets achieved a 100% accuracy rate, while large-scale testing on 918,404 real-world files demonstrated 97.13% correctness, failing only when files’ headers are incomplete.
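The header-based inference described above can be illustrated with a minimal sketch. It is not the package's implementation: the names `KNOWN_LENGTHS`, `parse_sq_lines` and `infer_assembly` are hypothetical, only chr1 and chr2 of GRCh37 and GRCh38 are listed, and the "most matching contig lengths wins" rule is an assumption made for illustration.

```python
from collections import Counter

# Illustrative subset of primary-contig lengths (chr1 and chr2 only).
# The real tool ships a much larger dictionary; this table is an assumption.
KNOWN_LENGTHS = {
    "GRCh37": {"1": 249250621, "2": 243199373},
    "GRCh38": {"1": 248956422, "2": 242193529},
}


def parse_sq_lines(header_text):
    """Return {contig_name: length} from the @SQ lines of a SAM-style header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in fields and "LN" in fields:
            name = fields["SN"]
            # Drop a leading "chr" so UCSC- and Ensembl-style names compare equal.
            if name.startswith("chr"):
                name = name[3:]
            contigs[name] = int(fields["LN"])
    return contigs


def infer_assembly(header_text):
    """Return the assembly whose contig lengths match the header most often, or None."""
    contigs = parse_sq_lines(header_text)
    hits = Counter()
    for assembly, lengths in KNOWN_LENGTHS.items():
        for name, length in lengths.items():
            if contigs.get(name) == length:
                hits[assembly] += 1
    return hits.most_common(1)[0][0] if hits else None


header = "@HD\tVN:1.6\n@SQ\tSN:chr1\tLN:248956422\n@SQ\tSN:chr2\tLN:242193529\n"
print(infer_assembly(header))  # GRCh38
```

A header without `@SQ` lines yields `None` in this sketch, which mirrors the incomplete-header failure case mentioned above.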
|
|
36
|
+
|
|
37
|
+
### Description
|
|
38
|
+
|
|
39
|
+
RefgenDetector is able to infer the following reference genomes:
|
|
40
|
+
|
|
41
|
+
**Primates**
|
|
42
|
+
|
|
43
|
+
👤 Homo sapiens
|
|
44
|
+
|
|
45
|
+
- hg16
|
|
46
|
+
- hg17
|
|
47
|
+
- hg18
|
|
48
|
+
- GRCh37
|
|
49
|
+
- GRCh38
|
|
50
|
+
- T2T
|
|
51
|
+
|
|
52
|
+
🐒 Pan troglodytes
|
|
53
|
+
|
|
54
|
+
- pantro3_0
|
|
55
|
+
- Pan_troglodytes-2.1
|
|
56
|
+
|
|
57
|
+
🐵 Macaca mulatta
|
|
58
|
+
|
|
59
|
+
- Mmul10
|
|
60
|
+
- rheMac8
|
|
61
|
+
- rheMac3
|
|
62
|
+
|
|
63
|
+
**Rodents**
|
|
64
|
+
|
|
65
|
+
🐭 Mus musculus
|
|
66
|
+
|
|
67
|
+
- mm7
|
|
68
|
+
- mm8
|
|
69
|
+
- mm9
|
|
70
|
+
- mm10
|
|
71
|
+
- mm39
|
|
72
|
+
|
|
73
|
+
🐀 Rattus norvegicus
|
|
74
|
+
|
|
75
|
+
- mRatBN7_2
|
|
76
|
+
- Rnor_6_0
|
|
77
|
+
|
|
78
|
+
**Other Mammals**
|
|
79
|
+
|
|
80
|
+
🐷 Sus scrofa
|
|
81
|
+
|
|
82
|
+
- Sscrofa10_2
|
|
83
|
+
- Sscrofa11_1
|
|
84
|
+
|
|
85
|
+
**Vertebrates (Non-Mammalian)**
|
|
86
|
+
|
|
87
|
+
🐟 Danio rerio
|
|
88
|
+
|
|
89
|
+
- danRer10
|
|
90
|
+
- danRer11
|
|
91
|
+
|
|
92
|
+
**Invertebrates**
|
|
93
|
+
|
|
94
|
+
🪰 Drosophila melanogaster
|
|
95
|
+
|
|
96
|
+
- dm5
|
|
97
|
+
- dm6
|
|
98
|
+
|
|
99
|
+
🐛 Caenorhabditis elegans
|
|
100
|
+
|
|
101
|
+
- WBcel215
|
|
102
|
+
- WBcel235
|
|
103
|
+
|
|
104
|
+
**Microorganisms & Plants**
|
|
105
|
+
|
|
106
|
+
🧫 Escherichia coli
|
|
107
|
+
|
|
108
|
+
- ASM886v2
|
|
109
|
+
- ASM584v2
|
|
110
|
+
|
|
111
|
+
🌱 Arabidopsis thaliana
|
|
112
|
+
|
|
113
|
+
- TAIR
|
|
114
|
+
|
|
115
|
+
🍺 Saccharomyces cerevisiae
|
|
116
|
+
|
|
117
|
+
- R64
|
|
118
|
+
|
|
119
|
+
## `ref_manager.py` - Customize the assemblies database.
|
|
120
|
+
|
|
121
|
+
`ref_manager.py` provides command-line management of reference genomes used by RefgenDetector. It allows users to add custom assemblies from FASTA index (`.fai`) files, list all available references, and remove previously added custom entries without modifying the source code.
|
|
122
|
+
|
|
123
|
+
### Usage
|
|
124
|
+
|
|
125
|
+
```bash
|
|
126
|
+
python ref_manager.py <command> [options]
|
|
127
|
+
```
|
|
128
|
+
|
|
129
|
+
### Commands
|
|
130
|
+
|
|
131
|
+
#### Add a reference
|
|
132
|
+
|
|
133
|
+
```bash
|
|
134
|
+
python ref_manager.py add <genome.fai> <reference_name> <species>
|
|
135
|
+
```
|
|
136
|
+
|
|
137
|
+
Registers a new reference from a valid `.fai` file. If the contig structure matches an existing reference, the entry is not added.
|
|
138
|
+
|
|
139
|
+
#### List references
|
|
140
|
+
|
|
141
|
+
```bash
|
|
142
|
+
python ref_manager.py list
|
|
143
|
+
```
|
|
144
|
+
|
|
145
|
+
Displays all available references, including both built-in and user-defined assemblies.
|
|
146
|
+
|
|
147
|
+
#### Remove a reference
|
|
148
|
+
|
|
149
|
+
```bash
|
|
150
|
+
python ref_manager.py remove <reference_name>
|
|
151
|
+
```
|
|
152
|
+
|
|
153
|
+
Removes a custom reference from the local database. Built-in references cannot be removed.
|
|
154
|
+
|
|
155
|
+
### Notes
|
|
156
|
+
|
|
157
|
+
- Custom references are stored separately from the default reference database.
|
|
158
|
+
- Input files must be valid FASTA index files generated with `samtools faidx`.
|
|
159
|
+
- Duplicate assemblies are detected based on exact contig composition.
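As a rough illustration of the duplicate check described in these notes, the sketch below parses `.fai` text and rejects an assembly whose exact set of contig names and lengths is already registered. The function names, the in-memory `database` dictionary and the SHA-256 fingerprint are assumptions made for this example; `ref_manager.py` may store and compare references differently.

```python
import hashlib


def parse_fai(fai_text):
    """Return {contig_name: length} from the text of a FASTA index (.fai) file."""
    contigs = {}
    for line in fai_text.splitlines():
        if not line.strip():
            continue
        # .fai columns: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH (tab-separated).
        name, length = line.split("\t")[:2]
        contigs[name] = int(length)
    return contigs


def fingerprint(contigs):
    """Order-independent digest of the exact (name, length) pairs."""
    canonical = "\n".join(f"{name}\t{length}" for name, length in sorted(contigs.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def add_reference(database, name, species, fai_text):
    """Add a reference unless identical contigs are already present. Returns True if added."""
    contigs = parse_fai(fai_text)
    digest = fingerprint(contigs)
    if any(entry["fingerprint"] == digest for entry in database.values()):
        return False
    database[name] = {"species": species, "contigs": contigs, "fingerprint": digest}
    return True


custom_db = {}  # hypothetical stand-in for the custom reference store
fai = "chrA\t1000\t6\t60\t61\nchrB\t500\t1030\t60\t61\n"
print(add_reference(custom_db, "toy_v1", "Toy species", fai))    # True
print(add_reference(custom_db, "toy_copy", "Toy species", fai))  # False
```

Hashing the sorted pairs keeps the comparison independent of contig order, which is one possible reading of "exact contig composition".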
|
|
160
|
+
|
|
161
|
+
## Variant Calling Files (VCFs)
|
|
162
|
+
|
|
163
|
+
From VCF files only 4 human assemblies can be inferred:
|
|
164
|
+
|
|
165
|
+
- hg18
|
|
166
|
+
- GRCh37
|
|
167
|
+
- GRCh38
|
|
168
|
+
- T2T
|
|
169
|
+
|
|
170
|
+
Two different sources of information are used to infer the reference genome from variant calling files:
|
|
171
|
+
|
|
172
|
+
* **Header**
|
|
173
|
+
|
|
174
|
+
In the VCF specification it is recommended, but **not mandatory**, that the VCF header includes tags describing the reference and contigs backing the data contained in the file. When present, the tool analyzes this information and reports the reference genome version based on the contig lengths, following the same logic as the alignment-file inference.
|
|
175
|
+
|
|
176
|
+
* **Variants**
|
|
177
|
+
|
|
178
|
+
To infer the reference genome from the variants, the tool reads the VCF file in chunks of 100,000 variants, which avoids loading the complete file into memory. The `POS` and `REF` columns are extracted and compared to the msgpack files.
|
|
179
|
+
|
|
180
|
+
The msgpack files were created by comparing the nucleotides at each position in hg18, GRCh37, GRCh38 and T2T. Each file contains a list of the positions where each reference had a different nucleotide (distinguishing positions).
|
|
181
|
+
|
|
182
|
+
By counting the matches between these distinguishing positions and the `REF` values present in the VCF, the tool infers the reference genome version used to call the variants.
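The variant-based procedure can be sketched as follows, using only the standard library. The `DISTINGUISHING` table holds made-up positions and bases standing in for the msgpack content, and the function names and stopping rules (a `matches` threshold and an optional `max_n_var` cap, echoing the command-line options) are assumptions, not the package's actual code.

```python
import io
from collections import Counter
from itertools import islice

CHUNK_SIZE = 100_000  # chunk size stated in the text above

# Hypothetical distinguishing positions: (chrom, pos) -> {assembly: reference base}.
# These values are invented for the example; the real data lives in the msgpack files.
DISTINGUISHING = {
    ("1", 1000): {"GRCh37": "A", "GRCh38": "G"},
    ("1", 2000): {"GRCh37": "C", "GRCh38": "T"},
}


def iter_chunks(handle, size=CHUNK_SIZE):
    """Yield lists of up to `size` (chrom, pos, ref) tuples from a VCF text stream."""
    rows = (
        line.rstrip("\n").split("\t")
        for line in handle
        if line.strip() and not line.startswith("#")
    )
    records = ((f[0].removeprefix("chr"), int(f[1]), f[3]) for f in rows)
    while chunk := list(islice(records, size)):
        yield chunk


def infer_from_variants(handle, max_n_var=None, matches=5000):
    """Count REF matches per assembly, chunk by chunk, and return the best one or None."""
    counts = Counter()
    seen = 0
    for chunk in iter_chunks(handle):
        for chrom, pos, ref in chunk:
            for assembly, base in DISTINGUISHING.get((chrom, pos), {}).items():
                if ref == base:
                    counts[assembly] += 1
        seen += len(chunk)
        best = counts.most_common(1)
        if best and best[0][1] >= matches:
            break  # enough evidence collected
        if max_n_var is not None and seen >= max_n_var:
            break  # read as many variants as allowed
    return counts.most_common(1)[0][0] if counts else None


vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1000\t.\tG\tA\t.\t.\t.\n"
    "1\t2000\t.\tT\tC\t.\t.\t.\n"
)
print(infer_from_variants(vcf))  # GRCh38
```

Because only one chunk is materialized at a time, memory use stays bounded by the chunk size rather than by the size of the VCF.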
|
|
183
|
+
|
|
184
|
+
## Requirements
|
|
185
|
+
|
|
186
|
+
- Python 3.10.6
|
|
187
|
+
|
|
188
|
+
Depending on how you want to install the package:
|
|
189
|
+
|
|
190
|
+
- pip
|
|
191
|
+
- Docker
|
|
192
|
+
|
|
193
|
+
Download the `msgpack` files for the inference with VCFs:
|
|
194
|
+
|
|
195
|
+
1. [Download the msgpack reference](https://crgcnag-my.sharepoint.com/:u:/g/personal/mimarin_crg_es/IQDa5CICZDAoRZmbfhBG3ZPEAWdVnNqvefFJB_r5Hc8aM70?e=kID7zn)
|
|
196
|
+
|
|
197
|
+
2. Move the `msgpack` to the correct path:
|
|
198
|
+
|
|
199
|
+
```
|
|
200
|
+
mv msgpack.zip /refgenDetector/src/refgenDetector/
|
|
201
|
+
unzip /refgenDetector/src/refgenDetector/msgpack.zip
|
|
202
|
+
```
|
|
203
|
+
|
|
204
|
+
## Installation
|
|
205
|
+
|
|
206
|
+
### Cloning this repository
|
|
207
|
+
|
|
208
|
+
1. Clone this repository
|
|
209
|
+
|
|
210
|
+
2. ``` $ cd PATH_WHERE_YOU_CLONED_THE_REPOSITORY/src/refgenDetector ```
|
|
211
|
+
|
|
212
|
+
3. ``$ python3 refgenDetector_main.py -h ``
|
|
213
|
+
|
|
214
|
+
### From PyPI
|
|
215
|
+
|
|
216
|
+
``$ pip install refgenDetector``
|
|
217
|
+
|
|
218
|
+
### From Docker
|
|
219
|
+
``
|
|
220
|
+
|
|
221
|
+
## Usage
|
|
222
|
+
|
|
223
|
+
You can get the help menu by running:
|
|
224
|
+
|
|
225
|
+
```
|
|
226
|
+
$ refgenDetector -h
|
|
227
|
+
```
|
|
228
|
+
|
|
229
|
+
```
|
|
230
|
+
usage: INFERRING THE REFERENCE GENOME USED TO ALIGN BAM OR CRAM FILE [-h] -f FILE -t {BAM/CRAM,Header,VCF,BIM} [--md5] [-a] [-v MAX_N_VAR] [-m MATCHES] [-r]
|
|
231
|
+
|
|
232
|
+
optional arguments:
|
|
233
|
+
-h, --help show this help message and exit
|
|
234
|
+
-f FILE, --file FILE Input file path
|
|
235
|
+
-t {BAM/CRAM,Header,VCF,BIM}, --type {BAM/CRAM,Header,VCF,BIM}
|
|
236
|
+
Type of files to analyze.
|
|
237
|
+
--md5 Print md5 values if present in header.
|
|
238
|
+
-a, --assembly Print assembly if present in header.
|
|
239
|
+
-v MAX_N_VAR, --max_n_var MAX_N_VAR
|
|
240
|
+
Maximum number of variants to read before stopping inference. The file is processed in chunks of 100,000 variants, so this value must be a multiple of 100,000 (e.g. 100000,
|
|
241
|
+
200000, 300000, ...).
|
|
242
|
+
-m MATCHES, --matches MATCHES
|
|
243
|
+
Number of matches required before stopping. [DEFAULT:5000]
|
|
244
|
+
-r, --resources When set, print execution time, CPU, memory, and disk I/O usage
|
|
245
|
+
```
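The help text states that `--max_n_var` must be a multiple of 100,000. One way such a constraint could be enforced with `argparse` is sketched below; `chunk_multiple` and `build_parser` are hypothetical names, only a subset of the options is reproduced, and whether the tool itself validates the value this way is not known from this document.

```python
import argparse

CHUNK_SIZE = 100_000  # chunk size named in the help text


def chunk_multiple(value):
    """argparse type: accept only positive multiples of CHUNK_SIZE."""
    number = int(value)
    if number <= 0 or number % CHUNK_SIZE != 0:
        raise argparse.ArgumentTypeError(
            f"{value} is not a positive multiple of {CHUNK_SIZE}"
        )
    return number


def build_parser():
    """Build a parser mirroring part of the options shown in the help output."""
    parser = argparse.ArgumentParser(prog="refgenDetector")
    parser.add_argument("-f", "--file", required=True, help="Input file path")
    parser.add_argument("-t", "--type", required=True,
                        choices=["BAM/CRAM", "Header", "VCF", "BIM"],
                        help="Type of files to analyze.")
    parser.add_argument("-v", "--max_n_var", type=chunk_multiple, default=None,
                        help="Maximum number of variants to read.")
    parser.add_argument("-m", "--matches", type=int, default=5000,
                        help="Number of matches required before stopping.")
    return parser


# "sample.vcf.gz" is a placeholder file name used only for this example.
args = build_parser().parse_args(["-f", "sample.vcf.gz", "-t", "VCF", "-v", "300000"])
print(args.max_n_var, args.matches)  # 300000 5000
```

With this approach a value such as `150000` is rejected at parse time with a clear message instead of silently being rounded to a chunk boundary.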
|
|
246
|
+
|
|
247
|
+
## Test RefgenDetector
|
|
248
|
+
|
|
249
|
+
In the folder **examples** you can find headers, BAMs and CRAMs to test RefgenDetector.
|
|
250
|
+
|
|
251
|
+
*All these files belong to the [synthetic data cohort](https://ega-archive.org/synthetic-data) from the European
|
|
252
|
+
Genome-Phenome Archive ([EGA](https://ega-archive.org/)).*
|
|
253
|
+
|
|
254
|
+
### Test with headers in a TXT
|
|
255
|
+
|
|
256
|
+
In the folder TEST_HEADERS there are four headers obtained from synthetic BAMs and CRAMs stored in the EGA. Each one of
|
|
257
|
+
them belongs to a different synthetic study:
|
|
258
|
+
|
|
259
|
+
- Test Study for EGA using data from 1000 Genomes Project - Phase
|
|
260
|
+
3 [EGAS00001005042](https://ega-archive.org/studies/EGAS00001005042).
|
|
261
|
+
- Synthetic data - Genome in a Bottle - [EGAS00001005591](https://ega-archive.org/studies/EGAS00001005591).
|
|
262
|
+
- Human genomic and phenotypic synthetic data for the study of rare
|
|
263
|
+
diseases - [EGAS00001005702](https://ega-archive.org/studies/EGAS00001005702).
|
|
264
|
+
- CINECA synthetic data. Please note: This study contains synthetic data (with cohort “participants” / ”subjects” marked
|
|
265
|
+
with FAKE) has no identifiable data and cannot be used to make any inference about cohort data or
|
|
266
|
+
results - [EGAS00001002472](https://ega-archive.org/studies/EGAS00001002472).
|
|
267
|
+
|
|
268
|
+
Further information about them can be found in the file *where_to_find_this_files.txt*, saved in the same folder.
|
|
269
|
+
|
|
270
|
+
To run RefgenDetector with the files:
|
|
271
|
+
|
|
272
|
+
1. Edit the text file *path_to_headers* so the paths match those on your computer.
|
|
273
|
+
2. Run:
|
|
274
|
+
|
|
275
|
+
``` $ refgenDetector -f /PATH_WHERE_YOU_CLONED_THE_REPOSITORY/refgenDetector/examples/path_to_headers -t Header```
|
|
276
|
+
|
|
277
|
+
### Test with BAM and CRAMs
|
|
278
|
+
|
|
279
|
+
In the folder TEST_BAM_CRAM there are a BAM and a CRAM obtained from the synthetic BAMs and CRAMs stored in the EGA. They
|
|
280
|
+
belong to the synthetic study - Test Study for EGA using data from 1000 Genomes Project - Phase
|
|
281
|
+
3 [EGAS00001005042](https://ega-archive.org/studies/EGAS00001005042).
|
|
282
|
+
|
|
283
|
+
Further information about them can be found in the file *where_to_find_this_files.txt*, saved in the same folder.
|
|
284
|
+
|
|
285
|
+
To run RefgenDetector with the files:
|
|
286
|
+
|
|
287
|
+
1. Edit the text file *path_to_bam_cram* so the paths match those on your computer.
|
|
288
|
+
|
|
289
|
+
2. Run:
|
|
290
|
+
|
|
291
|
+
``` $ refgenDetector -f /PATH_WHERE_YOU_CLONED_THE_REPOSITORY/refgenDetector_pip-master/examples/path_to_bam_cram -t BAM/CRAM```
|
|
292
|
+
|
|
293
|
+
|
|
294
|
+
|
|
295
|
+
## Licence and funding
|
|
296
|
+
|
|
297
|
+
RefgenDetector is released under the GNU General Public License v3.0.
|
|
298
|
+
|
|
299
|
+
It was funded by ELIXIR, the research infrastructure for life-science data (ELIXIR Beacon Implementation Studies
|
|
300
|
+
2019-2021 and 2022-2023).
|
|
@@ -0,0 +1,272 @@
|
|
|
1
|
+
# EGA - RefgenDetector
|
|
2
|
+
|
|
3
|
+
RefgenDetector is a bioinformatics tool that **infers the reference genome assembly** used to create alignment files (BAM/CRAM/header) and VCFs.
|
|
4
|
+
|
|
5
|
+
## Alignment Files
|
|
6
|
+
|
|
7
|
+
It identifies major genome releases and derived assemblies across humans and multiple other species by analyzing contig names and lengths **from the header**. Benchmarking against 94 synthetic datasets achieved a 100% accuracy rate, while large-scale testing on 918,404 real-world files demonstrated 97.13% correctness, failing only when files’ headers are incomplete.
|
|
8
|
+
|
|
9
|
+
### Description
|
|
10
|
+
|
|
11
|
+
RefgenDetector is able to infer the following reference genomes:
|
|
12
|
+
|
|
13
|
+
**Primates**
|
|
14
|
+
|
|
15
|
+
👤 Homo sapiens
|
|
16
|
+
|
|
17
|
+
- hg16
|
|
18
|
+
- hg17
|
|
19
|
+
- hg18
|
|
20
|
+
- GRCh37
|
|
21
|
+
- GRCh38
|
|
22
|
+
- T2T
|
|
23
|
+
|
|
24
|
+
🐒 Pan troglodytes
|
|
25
|
+
|
|
26
|
+
- pantro3_0
|
|
27
|
+
- Pan_troglodytes-2.1
|
|
28
|
+
|
|
29
|
+
🐵 Macaca mulatta
|
|
30
|
+
|
|
31
|
+
- Mmul10
|
|
32
|
+
- rheMac8
|
|
33
|
+
- rheMac3
|
|
34
|
+
|
|
35
|
+
**Rodents**
|
|
36
|
+
|
|
37
|
+
🐭 Mus musculus
|
|
38
|
+
|
|
39
|
+
- mm7
|
|
40
|
+
- mm8
|
|
41
|
+
- mm9
|
|
42
|
+
- mm10
|
|
43
|
+
- mm39
|
|
44
|
+
|
|
45
|
+
🐀 Rattus norvegicus
|
|
46
|
+
|
|
47
|
+
- mRatBN7_2
|
|
48
|
+
- Rnor_6_0
|
|
49
|
+
|
|
50
|
+
**Other Mammals**
|
|
51
|
+
|
|
52
|
+
🐷 Sus scrofa
|
|
53
|
+
|
|
54
|
+
- Sscrofa10_2
|
|
55
|
+
- Sscrofa11_1
|
|
56
|
+
|
|
57
|
+
**Vertebrates (Non-Mammalian)**
|
|
58
|
+
|
|
59
|
+
🐟 Danio rerio
|
|
60
|
+
|
|
61
|
+
- danRer10
|
|
62
|
+
- danRer11
|
|
63
|
+
|
|
64
|
+
**Invertebrates**
|
|
65
|
+
|
|
66
|
+
🪰 Drosophila melanogaster
|
|
67
|
+
|
|
68
|
+
- dm5
|
|
69
|
+
- dm6
|
|
70
|
+
|
|
71
|
+
🐛 Caenorhabditis elegans
|
|
72
|
+
|
|
73
|
+
- WBcel215
|
|
74
|
+
- WBcel235
|
|
75
|
+
|
|
76
|
+
**Microorganisms & Plants**
|
|
77
|
+
|
|
78
|
+
🧫 Escherichia coli
|
|
79
|
+
|
|
80
|
+
- ASM886v2
|
|
81
|
+
- ASM584v2
|
|
82
|
+
|
|
83
|
+
🌱 Arabidopsis thaliana
|
|
84
|
+
|
|
85
|
+
- TAIR
|
|
86
|
+
|
|
87
|
+
🍺 Saccharomyces cerevisiae
|
|
88
|
+
|
|
89
|
+
- R64
|
|
90
|
+
|
|
91
|
+
## `ref_manager.py` - Customize the assemblies database.
|
|
92
|
+
|
|
93
|
+
`ref_manager.py` provides command-line management of reference genomes used by RefgenDetector. It allows users to add custom assemblies from FASTA index (`.fai`) files, list all available references, and remove previously added custom entries without modifying the source code.
|
|
94
|
+
|
|
95
|
+
### Usage
|
|
96
|
+
|
|
97
|
+
```bash
|
|
98
|
+
python ref_manager.py <command> [options]
|
|
99
|
+
```
|
|
100
|
+
|
|
101
|
+
### Commands
|
|
102
|
+
|
|
103
|
+
#### Add a reference
|
|
104
|
+
|
|
105
|
+
```bash
|
|
106
|
+
python ref_manager.py add <genome.fai> <reference_name> <species>
|
|
107
|
+
```
|
|
108
|
+
|
|
109
|
+
Registers a new reference from a valid `.fai` file. If the contig structure matches an existing reference, the entry is not added.
|
|
110
|
+
|
|
111
|
+
#### List references
|
|
112
|
+
|
|
113
|
+
```bash
|
|
114
|
+
python ref_manager.py list
|
|
115
|
+
```
|
|
116
|
+
|
|
117
|
+
Displays all available references, including both built-in and user-defined assemblies.
|
|
118
|
+
|
|
119
|
+
#### Remove a reference
|
|
120
|
+
|
|
121
|
+
```bash
|
|
122
|
+
python ref_manager.py remove <reference_name>
|
|
123
|
+
```
|
|
124
|
+
|
|
125
|
+
Removes a custom reference from the local database. Built-in references cannot be removed.
|
|
126
|
+
|
|
127
|
+
### Notes
|
|
128
|
+
|
|
129
|
+
- Custom references are stored separately from the default reference database.
|
|
130
|
+
- Input files must be valid FASTA index files generated with `samtools faidx`.
|
|
131
|
+
- Duplicate assemblies are detected based on exact contig composition.
|
|
132
|
+
|
|
133
|
+
## Variant Calling Files (VCFs)
|
|
134
|
+
|
|
135
|
+
From VCF files only 4 human assemblies can be inferred:
|
|
136
|
+
|
|
137
|
+
- hg18
|
|
138
|
+
- GRCh37
|
|
139
|
+
- GRCh38
|
|
140
|
+
- T2T
|
|
141
|
+
|
|
142
|
+
Two different sources of information are used to infer the reference genome from variant calling files:
|
|
143
|
+
|
|
144
|
+
* **Header**
|
|
145
|
+
|
|
146
|
+
In the VCF specification it is recommended, but **not mandatory**, that the VCF header includes tags describing the reference and contigs backing the data contained in the file. When present, the tool analyzes this information and reports the reference genome version based on the contig lengths, following the same logic as the alignment-file inference.
|
|
147
|
+
|
|
148
|
+
* **Variants**
|
|
149
|
+
|
|
150
|
+
To infer the reference genome from the variants, the tool reads the VCF file in chunks of 100,000 variants, which avoids loading the complete file into memory. The `POS` and `REF` columns are extracted and compared to the msgpack files.
|
|
151
|
+
|
|
152
|
+
The msgpack files were created by comparing the nucleotides at each position in hg18, GRCh37, GRCh38 and T2T. Each file contains a list of the positions where each reference had a different nucleotide (distinguishing positions).
|
|
153
|
+
|
|
154
|
+
By counting the matches between these distinguishing positions and the `REF` values present in the VCF, the tool infers the reference genome version used to call the variants.
|
|
155
|
+
|
|
156
|
+
## Requirements
|
|
157
|
+
|
|
158
|
+
- Python 3.10.6
|
|
159
|
+
|
|
160
|
+
Depending on how you want to install the package:
|
|
161
|
+
|
|
162
|
+
- pip
|
|
163
|
+
- Docker
|
|
164
|
+
|
|
165
|
+
Download the `msgpack` files for the inference with VCFs:
|
|
166
|
+
|
|
167
|
+
1. [Download the msgpack reference](https://crgcnag-my.sharepoint.com/:u:/g/personal/mimarin_crg_es/IQDa5CICZDAoRZmbfhBG3ZPEAWdVnNqvefFJB_r5Hc8aM70?e=kID7zn)
|
|
168
|
+
|
|
169
|
+
2. Move the `msgpack` to the correct path:
|
|
170
|
+
|
|
171
|
+
```
|
|
172
|
+
mv msgpack.zip /refgenDetector/src/refgenDetector/
|
|
173
|
+
unzip /refgenDetector/src/refgenDetector/msgpack.zip
|
|
174
|
+
```
|
|
175
|
+
|
|
176
|
+
## Installation
|
|
177
|
+
|
|
178
|
+
### Cloning this repository
|
|
179
|
+
|
|
180
|
+
1. Clone this repository
|
|
181
|
+
|
|
182
|
+
2. ``` $ cd PATH_WHERE_YOU_CLONED_THE_REPOSITORY/src/refgenDetector ```
|
|
183
|
+
|
|
184
|
+
3. ``$ python3 refgenDetector_main.py -h ``
|
|
185
|
+
|
|
186
|
+
### From PyPI
|
|
187
|
+
|
|
188
|
+
``$ pip install refgenDetector``
|
|
189
|
+
|
|
190
|
+
### From Docker
|
|
191
|
+
``
|
|
192
|
+
|
|
193
|
+
## Usage
|
|
194
|
+
|
|
195
|
+
You can get the help menu by running:
|
|
196
|
+
|
|
197
|
+
```
|
|
198
|
+
$ refgenDetector -h
|
|
199
|
+
```
|
|
200
|
+
|
|
201
|
+
```
|
|
202
|
+
usage: INFERRING THE REFERENCE GENOME USED TO ALIGN BAM OR CRAM FILE [-h] -f FILE -t {BAM/CRAM,Header,VCF,BIM} [--md5] [-a] [-v MAX_N_VAR] [-m MATCHES] [-r]
|
|
203
|
+
|
|
204
|
+
optional arguments:
|
|
205
|
+
-h, --help show this help message and exit
|
|
206
|
+
-f FILE, --file FILE Input file path
|
|
207
|
+
-t {BAM/CRAM,Header,VCF,BIM}, --type {BAM/CRAM,Header,VCF,BIM}
|
|
208
|
+
Type of files to analyze.
|
|
209
|
+
--md5 Print md5 values if present in header.
|
|
210
|
+
-a, --assembly Print assembly if present in header.
|
|
211
|
+
-v MAX_N_VAR, --max_n_var MAX_N_VAR
|
|
212
|
+
Maximum number of variants to read before stopping inference. The file is processed in chunks of 100,000 variants, so this value must be a multiple of 100,000 (e.g. 100000,
|
|
213
|
+
200000, 300000, ...).
|
|
214
|
+
-m MATCHES, --matches MATCHES
|
|
215
|
+
Number of matches required before stopping. [DEFAULT:5000]
|
|
216
|
+
-r, --resources When set, print execution time, CPU, memory, and disk I/O usage
|
|
217
|
+
```
|
|
218
|
+
|
|
219
|
+
## Test RefgenDetector
|
|
220
|
+
|
|
221
|
+
In the folder **examples** you can find headers, BAMs and CRAMs to test RefgenDetector.
|
|
222
|
+
|
|
223
|
+
*All these files belong to the [synthetic data cohort](https://ega-archive.org/synthetic-data) from the European
|
|
224
|
+
Genome-Phenome Archive ([EGA](https://ega-archive.org/)).*
|
|
225
|
+
|
|
226
|
+
### Test with headers in a TXT
|
|
227
|
+
|
|
228
|
+
In the folder TEST_HEADERS there are four headers obtained from synthetic BAMs and CRAMs stored in the EGA. Each one of
|
|
229
|
+
them belongs to a different synthetic study:
|
|
230
|
+
|
|
231
|
+
- Test Study for EGA using data from 1000 Genomes Project - Phase
|
|
232
|
+
3 [EGAS00001005042](https://ega-archive.org/studies/EGAS00001005042).
|
|
233
|
+
- Synthetic data - Genome in a Bottle - [EGAS00001005591](https://ega-archive.org/studies/EGAS00001005591).
|
|
234
|
+
- Human genomic and phenotypic synthetic data for the study of rare
|
|
235
|
+
diseases - [EGAS00001005702](https://ega-archive.org/studies/EGAS00001005702).
|
|
236
|
+
- CINECA synthetic data. Please note: This study contains synthetic data (with cohort “participants” / ”subjects” marked
|
|
237
|
+
with FAKE) has no identifiable data and cannot be used to make any inference about cohort data or
|
|
238
|
+
results - [EGAS00001002472](https://ega-archive.org/studies/EGAS00001002472).
|
|
239
|
+
|
|
240
|
+
Further information about them can be found in the file *where_to_find_this_files.txt*, saved in the same folder.
|
|
241
|
+
|
|
242
|
+
To run RefgenDetector with the files:
|
|
243
|
+
|
|
244
|
+
1. Edit the text file *path_to_headers* so the paths match those on your computer.
|
|
245
|
+
2. Run:
|
|
246
|
+
|
|
247
|
+
``` $ refgenDetector -f /PATH_WHERE_YOU_CLONED_THE_REPOSITORY/refgenDetector/examples/path_to_headers -t Header```
|
|
248
|
+
|
|
249
|
+
### Test with BAM and CRAMs
|
|
250
|
+
|
|
251
|
+
In the folder TEST_BAM_CRAM there are a BAM and a CRAM obtained from the synthetic BAMs and CRAMs stored in the EGA. They
|
|
252
|
+
belong to the synthetic study - Test Study for EGA using data from 1000 Genomes Project - Phase
|
|
253
|
+
3 [EGAS00001005042](https://ega-archive.org/studies/EGAS00001005042).
|
|
254
|
+
|
|
255
|
+
Further information about them can be found in the file *where_to_find_this_files.txt*, saved in the same folder.
|
|
256
|
+
|
|
257
|
+
To run RefgenDetector with the files:
|
|
258
|
+
|
|
259
|
+
1. Edit the text file *path_to_bam_cram* so the paths match those on your computer.
|
|
260
|
+
|
|
261
|
+
2. Run:
|
|
262
|
+
|
|
263
|
+
``` $ refgenDetector -f /PATH_WHERE_YOU_CLONED_THE_REPOSITORY/refgenDetector_pip-master/examples/path_to_bam_cram -t BAM/CRAM```
|
|
264
|
+
|
|
265
|
+
|
|
266
|
+
|
|
267
|
+
## Licence and funding
|
|
268
|
+
|
|
269
|
+
RefgenDetector is released under the GNU General Public License v3.0.
|
|
270
|
+
|
|
271
|
+
It was funded by ELIXIR, the research infrastructure for life-science data (ELIXIR Beacon Implementation Studies
|
|
272
|
+
2019-2021 and 2022-2023).
|
|
@@ -0,0 +1,32 @@
|
|
|
1
|
+
from setuptools import setup, find_packages
|
|
2
|
+
# read the contents of your README file
|
|
3
|
+
from pathlib import Path
|
|
4
|
+
this_directory = Path(__file__).parent
|
|
5
|
+
long_description = (this_directory / "README.md").read_text(encoding="utf-8")
|
|
6
|
+
|
|
7
|
+
VERSION = '3.0.0'
|
|
8
|
+
DESCRIPTION = 'RefgenDetector'
|
|
9
|
+
|
|
10
|
+
# Setting up
|
|
11
|
+
setup(
|
|
12
|
+
name="RefgenDetector",
|
|
13
|
+
version=VERSION,
|
|
14
|
+
author="Mireia Marin i Ginestar",
|
|
15
|
+
author_email="<mireia.marin@crg.eu>",
|
|
16
|
+
description=DESCRIPTION,
|
|
17
|
+
long_description=long_description,
|
|
18
|
+
long_description_content_type='text/markdown',
|
|
19
|
+
install_requires=['argparse', 'pysam', 'psutil', 'rich', 'pandas', 'dnspython', 'msgpack'],
|
|
20
|
+
keywords=['python'],
|
|
21
|
+
classifiers=[
|
|
22
|
+
"Programming Language :: Python :: 3",
|
|
23
|
+
"Operating System :: Unix"],
|
|
24
|
+
entry_points={
|
|
25
|
+
'console_scripts': [
|
|
26
|
+
'refgenDetector=refgenDetector.refgenDetector_main:main',
|
|
27
|
+
],
|
|
28
|
+
},
|
|
29
|
+
packages=find_packages(where='src'),
|
|
30
|
+
package_dir={'': 'src'}
|
|
31
|
+
|
|
32
|
+
)
|