pyvariantdb 1.1__tar.gz
- pyvariantdb-1.1/MANIFEST.in +1 -0
- pyvariantdb-1.1/PKG-INFO +93 -0
- pyvariantdb-1.1/README.md +76 -0
- pyvariantdb-1.1/pyproject.toml +54 -0
- pyvariantdb-1.1/pyvariantdb/__init__.py +7 -0
- pyvariantdb-1.1/pyvariantdb/assets/Snakefile +129 -0
- pyvariantdb-1.1/pyvariantdb/assets/config.yaml +20 -0
- pyvariantdb-1.1/pyvariantdb/assets/contig_map.tsv +24 -0
- pyvariantdb-1.1/pyvariantdb/const.py +14 -0
- pyvariantdb-1.1/pyvariantdb/download.py +96 -0
- pyvariantdb-1.1/pyvariantdb/lookup.py +110 -0
- pyvariantdb-1.1/pyvariantdb.egg-info/PKG-INFO +93 -0
- pyvariantdb-1.1/pyvariantdb.egg-info/SOURCES.txt +17 -0
- pyvariantdb-1.1/pyvariantdb.egg-info/dependency_links.txt +1 -0
- pyvariantdb-1.1/pyvariantdb.egg-info/entry_points.txt +4 -0
- pyvariantdb-1.1/pyvariantdb.egg-info/requires.txt +13 -0
- pyvariantdb-1.1/pyvariantdb.egg-info/top_level.txt +1 -0
- pyvariantdb-1.1/setup.cfg +4 -0
- pyvariantdb-1.1/tests/test_lookup.py +0 -0
pyvariantdb-1.1/MANIFEST.in ADDED
@@ -0,0 +1 @@
+recursive-include pyvariantdb/assets *
pyvariantdb-1.1/PKG-INFO ADDED
@@ -0,0 +1,93 @@
+Metadata-Version: 2.4
+Name: pyvariantdb
+Version: 1.1
+Summary: A python package to download, process and access dbSNP data.
+Requires-Python: >=3.11
+Description-Content-Type: text/markdown
+Provides-Extra: test
+Requires-Dist: pytest; extra == "test"
+Requires-Dist: twine; extra == "test"
+Requires-Dist: pre-commit; extra == "test"
+Requires-Dist: pytest-ruff; extra == "test"
+Requires-Dist: ruff; extra == "test"
+Provides-Extra: prod
+Requires-Dist: pyvariantdb; extra == "prod"
+Provides-Extra: all
+Requires-Dist: pyvariantdb[test]; extra == "all"
+
+# dbSNP to Parquet Converter
+
+A Python-package-focused rewrite of the [dbSNP to Parquet](https://github.com/weinstockj/dbSNP_to_parquet)
+repository by Weinstock.
+This package provides convenient download, processing and access to dbSNP data. Once the data is
+processed, a simple Python interface can be used to work with it.
+
+## Installation
+
+`pyvariantdb` is available on PyPI and can be installed with the package management tool of your choice. For
+development we use pixi:
+
+```bash
+# install pixi
+curl -fsSL https://pixi.sh/install.sh | sh
+pixi update && pixi install
+```
+
+## Usage
+
+We recommend preparing the data from the command line before using the package, since download and
+processing take quite some time. By default the data is stored at `~/.cache/pyvariantdb`. This can be
+changed via an environment variable:
+
+```bash
+export PYVARIANTDB_HOME="/raid/cache/pyvariantdb"
+```
+
+Run the pipeline with:
+
+```bash
+# activate pixi
+pixi shell -e default
+# download dbsnp
+pyvariantdb-download
+# transform the database to parquets
+pyvariantdb-make-dbsnp
+```
+
+We aim to provide an easy interface to extract genomic coordinates from dbSNP. The main functionality
+is extremely simple: just provide the rs-identifiers (with or without a chromosome) and we fetch the
+respective coordinates. Note that this requires running the pipeline mentioned above first.
+
+Extracting coordinates is simple:
+
+```python
+from pyvariantdb.lookup import SNPLookup
+
+lookup = SNPLookup()
+# P53 mutations, chromosome 17
+rsids = ["rs1042522", "rs17878362", "rs1800372"]
+df_all = lookup.query_all(rsids)
+df_chr = lookup.query_chromosome(rsids, "17")
+print(df_all)
+print(df_chr)
+```
+
+## Processing Pipeline
+
+`pyvariantdb` offers some quality-of-life improvements over the original repository for working with dbSNP.
+The original pipeline remains the same:
+
+1. Downloads dbSNP data (GRCh38 build 156)
+2. Filters for SNVs only
+3. Converts chromosome contigs to standard naming
+4. Splits data by chromosome
+5. Creates Parquet lookup tables with RSID mappings
+
+## Output
+
+The script entrypoint writes the following files to the configured cache directory.
+
+- `dbSNP_156.bcf` - Full filtered BCF file
+- `dbSNP_156.chr*.bcf` - Per-chromosome BCF files
+- `dbSNP_156.chr*.lookup.parquet` - Per-chromosome RSID lookup tables
+- `dbSNP_156.lookup.parquet` - Combined RSID lookup table
pyvariantdb-1.1/README.md ADDED
@@ -0,0 +1,76 @@
+# dbSNP to Parquet Converter
+
+A Python-package-focused rewrite of the [dbSNP to Parquet](https://github.com/weinstockj/dbSNP_to_parquet)
+repository by Weinstock.
+This package provides convenient download, processing and access to dbSNP data. Once the data is
+processed, a simple Python interface can be used to work with it.
+
+## Installation
+
+`pyvariantdb` is available on PyPI and can be installed with the package management tool of your choice. For
+development we use pixi:
+
+```bash
+# install pixi
+curl -fsSL https://pixi.sh/install.sh | sh
+pixi update && pixi install
+```
+
+## Usage
+
+We recommend preparing the data from the command line before using the package, since download and
+processing take quite some time. By default the data is stored at `~/.cache/pyvariantdb`. This can be
+changed via an environment variable:
+
+```bash
+export PYVARIANTDB_HOME="/raid/cache/pyvariantdb"
+```
+
+Run the pipeline with:
+
+```bash
+# activate pixi
+pixi shell -e default
+# download dbsnp
+pyvariantdb-download
+# transform the database to parquets
+pyvariantdb-make-dbsnp
+```
+
+We aim to provide an easy interface to extract genomic coordinates from dbSNP. The main functionality
+is extremely simple: just provide the rs-identifiers (with or without a chromosome) and we fetch the
+respective coordinates. Note that this requires running the pipeline mentioned above first.
+
+Extracting coordinates is simple:
+
+```python
+from pyvariantdb.lookup import SNPLookup
+
+lookup = SNPLookup()
+# P53 mutations, chromosome 17
+rsids = ["rs1042522", "rs17878362", "rs1800372"]
+df_all = lookup.query_all(rsids)
+df_chr = lookup.query_chromosome(rsids, "17")
+print(df_all)
+print(df_chr)
+```
+
+## Processing Pipeline
+
+`pyvariantdb` offers some quality-of-life improvements over the original repository for working with dbSNP.
+The original pipeline remains the same:
+
+1. Downloads dbSNP data (GRCh38 build 156)
+2. Filters for SNVs only
+3. Converts chromosome contigs to standard naming
+4. Splits data by chromosome
+5. Creates Parquet lookup tables with RSID mappings
+
+## Output
+
+The script entrypoint writes the following files to the configured cache directory.
+
+- `dbSNP_156.bcf` - Full filtered BCF file
+- `dbSNP_156.chr*.bcf` - Per-chromosome BCF files
+- `dbSNP_156.chr*.lookup.parquet` - Per-chromosome RSID lookup tables
+- `dbSNP_156.lookup.parquet` - Combined RSID lookup table
pyvariantdb-1.1/pyproject.toml ADDED
@@ -0,0 +1,54 @@
+[project]
+name = "pyvariantdb"
+version = "1.1"
+description = "A python package to download, process and access dbSNP data."
+readme = "README.md"
+requires-python = ">=3.11"
+
+[project.scripts]
+pyvariantdb-download = "pyvariantdb.download:download_dbsnp"
+pyvariantdb-snp2parquet = "pyvariantdb.scripts.convert:main"
+pyvariantdb-make-dbsnp = "pyvariantdb.scripts.dbsnp2parquet:main"
+
+# ---- build
+[build-system]
+requires = ["setuptools"]
+build-backend = "setuptools.build_meta"
+
+[tool.setuptools.packages.find]
+include = ["pyvariantdb"]
+
+# -- pixi settings
+[tool.pixi.workspace]
+channels = ["conda-forge", "bioconda"]
+platforms = ["linux-64", "osx-arm64"]
+
+[tool.pixi.dependencies]
+python = "*"
+bcftools = "*"
+pandas = "*"
+polars = "*"
+cyvcf2 = "*"
+snakemake-executor-plugin-cluster-generic = "*"
+snakemake = "*"
+pyarrow = "*"
+ipython = "*"
+loguru = "*"
+aria2 = "*"
+
+[tool.pixi.pypi-dependencies]
+pyvariantdb = { path = ".", editable = true }
+
+[project.optional-dependencies]
+test = ["pytest", "twine", "pre-commit", "pytest-ruff", "ruff"]
+prod = ["pyvariantdb"]
+all = ["pyvariantdb[test]"]
+
+[tool.pixi.environments]
+default = { features = ["all"], solve-group = "default" }
+prod = { features = ["prod"], solve-group = "default" }
+test = { features = ["test"], solve-group = "default" }
+
+[tool.pixi.feature.test.tasks]
+download = "pytest"
pyvariantdb-1.1/pyvariantdb/assets/Snakefile ADDED
@@ -0,0 +1,129 @@
+"""Snakemake workflow for processing dbSNP.
+
+This workflow downloads, processes, and converts dbSNP data (GRCh38 build 156) to Parquet format.
+
+Pipeline Steps:
+1. convert_contig: Filters the downloaded dbSNP data for SNVs only and renames contigs
+2. subset_BCF: Splits data by chromosome
+3. convert_parquet: Converts each chromosome to Parquet lookup tables
+4. combine_all_parquets: Combines all chromosome-specific parquet files into one
+
+Usage Examples:
+# Run all rules (default)
+>>> snakemake --cores all
+
+# Run on a cluster with SLURM
+>>> snakemake --cluster "sbatch -p {resources.partition} --mem={resources.mem} -t {resources.time} -c {threads}" -j 23
+
+# Run only the combine_all_parquets rule
+>>> snakemake --cores 1 output/dbSNP_156.combined.lookup.parquet
+
+# Use a custom config file
+>>> snakemake --configfile my_config.yaml --cores all
+
+# Dry run to see what will be executed
+>>> snakemake -n
+"""
+import yaml
+from importlib.resources import files
+from loguru import logger
+
+from pyvariantdb.const import get_cache_dir
+
+pyvariant_assets_dir = files("pyvariantdb") / "assets"
+pyvariant_contigs = pyvariant_assets_dir / "contig_map.tsv"
+pyvariant_config = pyvariant_assets_dir / "config.yaml"
+
+# Use packaged config only when no config was provided on the command line.
+if not config:
+    with open(pyvariant_config, "r") as f:
+        config = yaml.safe_load(f)
+
+CHROMS = [f"chr{x}" for x in range(1, 23)] + ["chrX"]
+
+config["output_dir"] = get_cache_dir()
+
+logger.info("Starting snakemake workflow from pyvariantdb to download variants.")
+for conf_key, conf_value in config.items():
+    logger.info(f"{conf_key}: {conf_value}")
+
+rule all:
+    input:
+        f"{config['output_dir']}/dbSNP_156.bcf",
+        expand(f"{config['output_dir']}/dbSNP_156.{{CHROM}}.bcf", CHROM=CHROMS),
+        expand(f"{config['output_dir']}/dbSNP_156.{{CHROM}}.lookup.parquet", CHROM=CHROMS),
+        f"{config['output_dir']}/dbSNP_156.combined.lookup.parquet"
+
+rule convert_contig:
+    input:
+        f"{config['output_dir']}/GCF_000001405.40.gz"
+    output:
+        f"{config['output_dir']}/dbSNP_156.bcf"
+    threads: config['resources']['convert_contig']['threads']
+    resources:
+        mem=config['resources']['convert_contig']['mem'],
+        partition=config['cluster']['partition'],
+        time=config['resources']['convert_contig']['time']
+    shell:
+        """
+        bcftools view -i "INFO/VC='SNV'" {input} |
+        bcftools annotate --rename-chrs {pyvariant_contigs} -Ob --threads {threads} > {output}
+
+        bcftools index {output}
+        """
+
+rule subset_BCF:
+    input:
+        f"{config['output_dir']}/dbSNP_156.bcf"
+    output:
+        f"{config['output_dir']}/dbSNP_156.{{CHROM}}.bcf"
+    threads: config['resources']['subset_BCF']['threads']
+    resources:
+        mem=config['resources']['subset_BCF']['mem'],
+        partition=config['cluster']['partition'],
+        time=config['resources']['subset_BCF']['time']
+    shell:
+        """
+        bcftools view -Ob --threads {threads} {input} {wildcards.CHROM} > {output}
+        bcftools index {output}
+        """
+
+rule convert_parquet:
+    input:
+        f"{config['output_dir']}/dbSNP_156.{{CHROM}}.bcf"
+    output:
+        f"{config['output_dir']}/dbSNP_156.{{CHROM}}.lookup.parquet"
+    resources:
+        mem=config['resources']['convert_parquet']['mem'],
+        partition=config['cluster']['partition'],
+        time=config['resources']['convert_parquet']['time']
+    shell:
+        """
+        pyvariantdb-snp2parquet {input} {output}
+        """
+
+rule combine_all_parquets:
+    input:
+        expand(f"{config['output_dir']}/dbSNP_156.{{CHROM}}.lookup.parquet", CHROM=CHROMS)
+    output:
+        f"{config['output_dir']}/dbSNP_156.combined.lookup.parquet"
+    resources:
+        mem=config['resources'].get('combine_all_parquets', {}).get('mem', '16G'),
+        partition=config['cluster']['partition'],
+        time=config['resources'].get('combine_all_parquets', {}).get('time', '02:00:00')
+    run:
+        import pyarrow.parquet as pq
+        import pyarrow as pa
+
+        # Read all parquet files
+        tables = []
+        for input_file in input:
+            table = pq.read_table(input_file)
+            tables.append(table)
+
+        # Concatenate all tables
+        combined_table = pa.concat_tables(tables)
+
+        # Write the combined table
+        pq.write_table(combined_table, output[0])
pyvariantdb-1.1/pyvariantdb/assets/config.yaml ADDED
@@ -0,0 +1,20 @@
+output_dir: "output"
+
+cluster:
+  partition: "nodes*"
+
+resources:
+  convert_contig:
+    mem: "6G"
+    time: "07:30:00"
+    threads: 10
+  subset_BCF:
+    mem: "3G"
+    time: "03:30:00"
+    threads: 10
+  convert_parquet:
+    mem: "8G"
+    time: "05:30:00"
+  combine_all_parquets:
+    mem: "16G"
+    time: "02:00:00"
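Note that in the Snakefile only the `combine_all_parquets` section is read via chained `.get(...)` calls with defaults, so it can be omitted from this file without breaking the workflow. A stdlib sketch of that lookup pattern (plain dicts stand in for the parsed YAML):

```python
# Config as it would look parsed from YAML, with the optional
# combine_all_parquets section deliberately missing.
config = {"resources": {"convert_contig": {"mem": "6G"}}}

# Chained .get with defaults: a missing section yields {} and then
# the hard-coded fallback, instead of raising KeyError.
mem = config["resources"].get("combine_all_parquets", {}).get("mem", "16G")
print(mem)  # "16G" - the default, since the section is absent
```

The other rules use direct indexing (`config['resources']['convert_contig']['mem']`), so their sections remain mandatory.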
pyvariantdb-1.1/pyvariantdb/assets/contig_map.tsv ADDED
@@ -0,0 +1,24 @@
+NC_000001.11 chr1
+NC_000002.12 chr2
+NC_000003.12 chr3
+NC_000004.12 chr4
+NC_000005.10 chr5
+NC_000006.12 chr6
+NC_000007.14 chr7
+NC_000008.11 chr8
+NC_000009.12 chr9
+NC_000010.11 chr10
+NC_000011.10 chr11
+NC_000012.12 chr12
+NC_000013.11 chr13
+NC_000014.9 chr14
+NC_000015.10 chr15
+NC_000016.10 chr16
+NC_000017.11 chr17
+NC_000018.10 chr18
+NC_000019.10 chr19
+NC_000020.11 chr20
+NC_000021.9 chr21
+NC_000022.11 chr22
+NC_000023.11 chrX
+NC_000024.10 chrY
pyvariantdb-1.1/pyvariantdb/const.py ADDED
@@ -0,0 +1,14 @@
+"""Define constants used in pyvariantdb."""
+
+import os
+from loguru import logger
+from pathlib import Path
+
+
+def get_cache_dir() -> Path:
+    """Return the cache directory used to store pyvariantdb data."""
+    # use env var PYVARIANTDB_HOME if set, otherwise ~/.cache/pyvariantdb
+    cache_dir = Path(os.getenv("PYVARIANTDB_HOME", Path.home() / ".cache" / "pyvariantdb"))
+    cache_dir.mkdir(parents=True, exist_ok=True)
+    logger.info(f"Cache directory: {cache_dir}")
+    return cache_dir
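The cache-dir resolution can be exercised without the package. A minimal sketch of the same env-var fallback; note that `os.getenv` returns a string when the variable is set, hence the `Path(...)` wrap (the `resolve_cache_dir` helper name is illustrative, not part of the package API):

```python
from pathlib import Path


def resolve_cache_dir(env: dict[str, str]) -> Path:
    # Mirror of get_cache_dir's lookup logic, taking the environment
    # as a parameter so it is easy to test (helper name is illustrative).
    default = str(Path.home() / ".cache" / "pyvariantdb")
    return Path(env.get("PYVARIANTDB_HOME", default))


print(resolve_cache_dir({"PYVARIANTDB_HOME": "/raid/cache/pyvariantdb"}))
print(resolve_cache_dir({}) == Path.home() / ".cache" / "pyvariantdb")  # True
```

Passing the environment explicitly avoids mutating `os.environ` in tests; the real function reads it via `os.getenv`.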
pyvariantdb-1.1/pyvariantdb/download.py ADDED
@@ -0,0 +1,96 @@
+#!/usr/bin/env python3
+"""Download utilities for dbSNP to Parquet conversion."""
+
+import subprocess
+from pathlib import Path
+from loguru import logger
+from pyvariantdb.const import get_cache_dir
+
+
+def download_with_aria2(url: str, output_dir: Path, filename: str | None = None):
+    """
+    Download a file using aria2c.
+
+    Args:
+        url: The URL to download from
+        output_dir: Directory where the file should be saved
+        filename: Optional custom filename (defaults to name from URL)
+
+    Returns:
+        Path: Path to the downloaded file
+
+    Raises:
+        RuntimeError: If aria2c is not installed or download fails
+    """
+    # Ensure output directory exists
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    # Build aria2c command
+    cmd = [
+        "aria2c",
+        "--dir",
+        str(output_dir),
+        "--max-connection-per-server=16",  # Use up to 16 connections per server
+        "--split=16",  # Split file into 16 pieces for parallel download
+        "--min-split-size=1M",  # Minimum size for splitting
+        "--continue=true",  # Continue partial downloads
+        "--allow-overwrite=true",  # Allow overwriting existing files
+        "--auto-file-renaming=false",  # Don't auto-rename files
+    ]
+
+    if filename:
+        cmd.extend(["--out", filename])
+
+    cmd.append(url)
+
+    # Run aria2c
+    logger.info(f"Running: {' '.join(cmd)}")
+    result = subprocess.run(cmd, capture_output=True, text=True)
+
+    if result.returncode != 0:
+        logger.error(f"aria2c stderr: {result.stderr}")
+        raise RuntimeError(f"Failed to download {url}: {result.stderr}")
+
+    # Determine output path
+    if filename:
+        output_path = output_dir / filename
+    else:
+        output_path = output_dir / url.split("/")[-1]
+
+    return output_path
+
+
+def download_dbsnp():
+    """
+    Download dbSNP VCF file and its index from NCBI's FTP server using aria2.
+
+    Downloads:
+    - GCF_000001405.40.gz (main VCF file)
+    - GCF_000001405.40.gz.tbi (index file)
+
+    Uses aria2c for fast multi-connection downloads.
+    """
+    logger.info("Starting dbSNP download with aria2...")
+
+    urls = [
+        "https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.40.gz",
+        "https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.40.gz.tbi",
+    ]
+    destination_dir = get_cache_dir()
+    logger.info(f"Destination directory: {destination_dir}")
+
+    for url in urls:
+        try:
+            filename = url.split("/")[-1]
+            logger.info(f"Downloading {url}...")
+            output_path = download_with_aria2(url, destination_dir, filename)
+            logger.info(f"✓ Successfully downloaded to {output_path}")
+        except Exception as e:
+            logger.error(f"Failed to download {url}: {e}")
+            raise
+
+    logger.info("Download completed successfully!")
+
+
+if __name__ == "__main__":
+    download_dbsnp()
pyvariantdb-1.1/pyvariantdb/lookup.py ADDED
@@ -0,0 +1,110 @@
+"""Lookup utilities for dbSNP Parquet outputs.
+
+This module provides lightweight query helpers over per-chromosome dbSNP
+lookup Parquet files generated by the pipeline.
+
+Ideally, you already know which chromosome holds a specific rsID. In that case,
+it is most efficient to query the respective chromosome file directly.
+
+Example:
+    from pyvariantdb.lookup import SNPLookup
+
+    lookup = SNPLookup()
+    # P53 mutations, chromosome 17
+    rsids = ["rs1042522", "rs17878362", "rs1800372"]
+    df_all = lookup.query_all(rsids)
+    df_chr = lookup.query_chromosome(rsids, "17")
+    print(df_all)
+    print(df_chr)
+"""
+
+from __future__ import annotations
+
+import polars as pl
+from loguru import logger
+from pathlib import Path
+from typing import Optional
+
+from pyvariantdb.const import get_cache_dir
+
+DEFAULT_VERSION = "156"
+DEFAULT_OUTPUT_DIR = "output"
+
+
+class SNPLookup:
+    """Query dbSNP lookup Parquet files by rsID.
+
+    Example:
+        lookup = SNPLookup(version="156")
+        rsids = ["rs1042522", "rs17878362", "rs1800372"]
+        df = lookup.query_all(rsids)
+        print(df)
+    """
+
+    def __init__(self, version: str = DEFAULT_VERSION, cache_dir: Optional[str] = None):
+        self.version = version
+        self.cache_dir = (
+            Path(cache_dir) if cache_dir else get_cache_dir() / DEFAULT_OUTPUT_DIR
+        )
+
+    def query_all(self, rsids: list[str]) -> pl.DataFrame:
+        """Query all lookup files for a list of rsIDs.
+
+        Args:
+            rsids: List of rsIDs to search for.
+
+        Returns:
+            A Polars DataFrame with columns "RSID" and "ID".
+
+        Example:
+            lookup = SNPLookup()
+            rsids = ["rs1042522", "rs17878362", "rs1800372"]
+            df = lookup.query_all(rsids)
+            print(df.head(3))
+        """
+        logger.info(f"Querying dbSNP Parquet files for {len(rsids)} rsIDs...")
+        if not rsids:
+            return pl.DataFrame(schema={"RSID": pl.String, "ID": pl.String})
+
+        pattern = f"dbSNP_{self.version}.chr*.lookup.parquet"
+        lookup_files = sorted(self.cache_dir.glob(pattern))
+        if not lookup_files:
+            raise FileNotFoundError(f"No dbSNP Parquet files found in {self.cache_dir}")
+
+        lazy_frames = []
+        for in_file in lookup_files:
+            lazy_frames.append(
+                pl.scan_parquet(in_file).filter(pl.col("RSID").is_in(rsids))
+            )
+
+        return pl.concat(lazy_frames).collect()
+
+    def query_chromosome(self, rsids: list[str], chrom: str) -> pl.DataFrame:
+        """Query a specific chromosome lookup file for a list of rsIDs.
+
+        Args:
+            rsids: List of rsIDs to search for.
+            chrom: Chromosome label, e.g. "chr17" or "17".
+
+        Returns:
+            A Polars DataFrame with columns "RSID" and "ID".
+
+        Example:
+            lookup = SNPLookup()
+            rsids = ["rs1042522", "rs17878362", "rs1800372"]
+            df = lookup.query_chromosome(rsids, "17")
+            print(df)
+        """
+        logger.info(f"Querying dbSNP Parquet files for {len(rsids)} rsIDs...")
+        if not rsids:
+            return pl.DataFrame(schema={"RSID": pl.String, "ID": pl.String})
+
+        chrom_label = chrom if chrom.startswith("chr") else f"chr{chrom}"
+        in_file = self.cache_dir / f"dbSNP_{self.version}.{chrom_label}.lookup.parquet"
+        if not in_file.exists():
+            raise FileNotFoundError(f"dbSNP Parquet file not found: {in_file}")
+
+        subset_df = (
+            pl.scan_parquet(in_file).filter(pl.col("RSID").is_in(rsids)).collect()
+        )
+        return subset_df
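`query_chromosome` accepts both "17" and "chr17" via a one-line normalization; isolated here for clarity (the standalone `normalize_chrom` helper name is illustrative, not part of the package API):

```python
def normalize_chrom(chrom: str) -> str:
    # Same expression as inside SNPLookup.query_chromosome:
    # prepend "chr" only when it is missing.
    return chrom if chrom.startswith("chr") else f"chr{chrom}"


print(normalize_chrom("17"))     # chr17
print(normalize_chrom("chr17"))  # chr17
print(normalize_chrom("X"))      # chrX
```

The normalized label then selects the file `dbSNP_156.chr17.lookup.parquet` in the cache directory.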
pyvariantdb-1.1/pyvariantdb.egg-info/PKG-INFO ADDED
@@ -0,0 +1,93 @@
+Metadata-Version: 2.4
+Name: pyvariantdb
+Version: 1.1
+Summary: A python package to download, process and access dbSNP data.
+Requires-Python: >=3.11
+Description-Content-Type: text/markdown
+Provides-Extra: test
+Requires-Dist: pytest; extra == "test"
+Requires-Dist: twine; extra == "test"
+Requires-Dist: pre-commit; extra == "test"
+Requires-Dist: pytest-ruff; extra == "test"
+Requires-Dist: ruff; extra == "test"
+Provides-Extra: prod
+Requires-Dist: pyvariantdb; extra == "prod"
+Provides-Extra: all
+Requires-Dist: pyvariantdb[test]; extra == "all"
+
+# dbSNP to Parquet Converter
+
+A Python-package-focused rewrite of the [dbSNP to Parquet](https://github.com/weinstockj/dbSNP_to_parquet)
+repository by Weinstock.
+This package provides convenient download, processing and access to dbSNP data. Once the data is
+processed, a simple Python interface can be used to work with it.
+
+## Installation
+
+`pyvariantdb` is available on PyPI and can be installed with the package management tool of your choice. For
+development we use pixi:
+
+```bash
+# install pixi
+curl -fsSL https://pixi.sh/install.sh | sh
+pixi update && pixi install
+```
+
+## Usage
+
+We recommend preparing the data from the command line before using the package, since download and
+processing take quite some time. By default the data is stored at `~/.cache/pyvariantdb`. This can be
+changed via an environment variable:
+
+```bash
+export PYVARIANTDB_HOME="/raid/cache/pyvariantdb"
+```
+
+Run the pipeline with:
+
+```bash
+# activate pixi
+pixi shell -e default
+# download dbsnp
+pyvariantdb-download
+# transform the database to parquets
+pyvariantdb-make-dbsnp
+```
+
+We aim to provide an easy interface to extract genomic coordinates from dbSNP. The main functionality
+is extremely simple: just provide the rs-identifiers (with or without a chromosome) and we fetch the
+respective coordinates. Note that this requires running the pipeline mentioned above first.
+
+Extracting coordinates is simple:
+
+```python
+from pyvariantdb.lookup import SNPLookup
+
+lookup = SNPLookup()
+# P53 mutations, chromosome 17
+rsids = ["rs1042522", "rs17878362", "rs1800372"]
+df_all = lookup.query_all(rsids)
+df_chr = lookup.query_chromosome(rsids, "17")
+print(df_all)
+print(df_chr)
+```
+
+## Processing Pipeline
+
+`pyvariantdb` offers some quality-of-life improvements over the original repository for working with dbSNP.
+The original pipeline remains the same:
+
+1. Downloads dbSNP data (GRCh38 build 156)
+2. Filters for SNVs only
+3. Converts chromosome contigs to standard naming
+4. Splits data by chromosome
+5. Creates Parquet lookup tables with RSID mappings
+
+## Output
+
+The script entrypoint writes the following files to the configured cache directory.
+
+- `dbSNP_156.bcf` - Full filtered BCF file
+- `dbSNP_156.chr*.bcf` - Per-chromosome BCF files
+- `dbSNP_156.chr*.lookup.parquet` - Per-chromosome RSID lookup tables
+- `dbSNP_156.lookup.parquet` - Combined RSID lookup table
pyvariantdb-1.1/pyvariantdb.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,17 @@
+MANIFEST.in
+README.md
+pyproject.toml
+pyvariantdb/__init__.py
+pyvariantdb/const.py
+pyvariantdb/download.py
+pyvariantdb/lookup.py
+pyvariantdb.egg-info/PKG-INFO
+pyvariantdb.egg-info/SOURCES.txt
+pyvariantdb.egg-info/dependency_links.txt
+pyvariantdb.egg-info/entry_points.txt
+pyvariantdb.egg-info/requires.txt
+pyvariantdb.egg-info/top_level.txt
+pyvariantdb/assets/Snakefile
+pyvariantdb/assets/config.yaml
+pyvariantdb/assets/contig_map.tsv
+tests/test_lookup.py
pyvariantdb-1.1/pyvariantdb.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
+
pyvariantdb-1.1/pyvariantdb.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
+pyvariantdb

pyvariantdb-1.1/tests/test_lookup.py
File without changes