pyvariantdb 1.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1 @@
1
+ recursive-include pyvariantdb/assets *
@@ -0,0 +1,93 @@
1
+ Metadata-Version: 2.4
2
+ Name: pyvariantdb
3
+ Version: 1.1
4
+ Summary: A python package to download, process and access dbSNP data.
5
+ Requires-Python: >=3.11
6
+ Description-Content-Type: text/markdown
7
+ Provides-Extra: test
8
+ Requires-Dist: pytest; extra == "test"
9
+ Requires-Dist: twine; extra == "test"
10
+ Requires-Dist: pre-commit; extra == "test"
11
+ Requires-Dist: pytest-ruff; extra == "test"
12
+ Requires-Dist: ruff; extra == "test"
13
+ Provides-Extra: prod
14
+ Requires-Dist: pyvariantdb; extra == "prod"
15
+ Provides-Extra: all
16
+ Requires-Dist: pyvariantdb[test]; extra == "all"
17
+
18
+ # dbSNP to Parquet Converter
19
+
20
+ A Python-package-focused rewrite of the [dbSNP to parquet](https://github.com/weinstockj/dbSNP_to_parquet)
21
+ repository by Weinstock.
22
+ This package allows convenient download, processing, and access to dbSNP data. A simple Python interface
23
+ can be used to work with the data once it is processed.
24
+
25
+ ## Installation
26
+
27
+ `pyvariantdb` is available on PyPI and can be installed with the package manager of your choice. For
28
+ development we use pixi:
29
+
30
+ ```bash
31
+ # install pixi
32
+ curl -fsSL https://pixi.sh/install.sh | sh
33
+ pixi update && pixi install
34
+ ```
35
+
36
+ ## Usage
37
+
38
+ We recommend preparing the data from the command line before using the package, since downloading and processing take
39
+ quite some time. By default the data is stored at `~/.cache/pyvariantdb`. This can be changed via an
40
+ environment variable:
41
+
42
+ ```bash
43
+ export PYVARIANTDB_HOME="/raid/cache/pyvariantdb"
44
+ ```
45
+
46
+ Execution of the pipeline can be done with:
47
+
48
+ ```bash
49
+ # activate pixi
50
+ pixi shell -e default
51
+ # download dbsnp
52
+ pyvariantdb-download
53
+ # transform the database to parquets
54
+ pyvariantdb-make-dbsnp
55
+ ```
56
+
57
+ We aim to provide an easy interface for extracting genomic coordinates from dbSNP. The main functionality
58
+ is deliberately simple: provide the rs identifier (with or without a chromosome) and the respective
59
+ coordinates are fetched. Note that the pipeline above must be run first.
60
+
61
+ Extracting coordinates is simple:
62
+
63
+ ```python
64
+ from pyvariantdb.lookup import SNPLookup
65
+
66
+ lookup = SNPLookup()
67
+ # P53 mutations, chromosome 17
68
+ rsids = ["rs1042522", "rs17878362", "rs1800372"]
69
+ df_all = lookup.query_all(rsids)
70
+ df_chr = lookup.query_chromosome(rsids, "17")
71
+ print(df_all)
72
+ print(df_chr)
73
+ ```
74
+
75
+ ## Processing Pipeline
76
+
77
+ `pyvariantdb` offers some quality-of-life improvements over the original repository for working with dbSNP.
78
+ The original pipeline remains the same:
79
+
80
+ 1. Downloads dbSNP data (GRCh38 build 156)
81
+ 2. Filters for SNVs only
82
+ 3. Converts chromosome contigs to standard naming
83
+ 4. Splits data by chromosome
84
+ 5. Creates Parquet lookup tables with RSID mappings
85
+
86
+ ## Output
87
+
88
+ The script entry points write the following files to the configured cache directory:
89
+
90
+ - `dbSNP_156.bcf` - Full filtered BCF file
91
+ - `dbSNP_156.chr*.bcf` - Per-chromosome BCF files
92
+ - `dbSNP_156.chr*.lookup.parquet` - Per-chromosome RSID lookup tables
93
+ - `dbSNP_156.combined.lookup.parquet` - Combined RSID lookup table
@@ -0,0 +1,76 @@
1
+ # dbSNP to Parquet Converter
2
+
3
+ A Python-package-focused rewrite of the [dbSNP to parquet](https://github.com/weinstockj/dbSNP_to_parquet)
4
+ repository by Weinstock.
5
+ This package allows convenient download, processing, and access to dbSNP data. A simple Python interface
6
+ can be used to work with the data once it is processed.
7
+
8
+ ## Installation
9
+
10
+ `pyvariantdb` is available on PyPI and can be installed with the package manager of your choice. For
11
+ development we use pixi:
12
+
13
+ ```bash
14
+ # install pixi
15
+ curl -fsSL https://pixi.sh/install.sh | sh
16
+ pixi update && pixi install
17
+ ```
18
+
19
+ ## Usage
20
+
21
+ We recommend preparing the data from the command line before using the package, since downloading and processing take
22
+ quite some time. By default the data is stored at `~/.cache/pyvariantdb`. This can be changed via an
23
+ environment variable:
24
+
25
+ ```bash
26
+ export PYVARIANTDB_HOME="/raid/cache/pyvariantdb"
27
+ ```
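The same resolution rule can be sketched in Python (a minimal sketch of the documented behaviour; the package's own `get_cache_dir` additionally creates the directory):

```python
import os
from pathlib import Path

# Resolve the cache directory the way the package documents it:
# PYVARIANTDB_HOME if set, otherwise ~/.cache/pyvariantdb.
cache_dir = Path(os.getenv("PYVARIANTDB_HOME", str(Path.home() / ".cache" / "pyvariantdb")))
print(cache_dir)
```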
28
+
29
+ Execution of the pipeline can be done with:
30
+
31
+ ```bash
32
+ # activate pixi
33
+ pixi shell -e default
34
+ # download dbsnp
35
+ pyvariantdb-download
36
+ # transform the database to parquets
37
+ pyvariantdb-make-dbsnp
38
+ ```
39
+
40
+ We aim to provide an easy interface for extracting genomic coordinates from dbSNP. The main functionality
41
+ is deliberately simple: provide the rs identifier (with or without a chromosome) and the respective
42
+ coordinates are fetched. Note that the pipeline above must be run first.
43
+
44
+ Extracting coordinates is simple:
45
+
46
+ ```python
47
+ from pyvariantdb.lookup import SNPLookup
48
+
49
+ lookup = SNPLookup()
50
+ # P53 mutations, chromosome 17
51
+ rsids = ["rs1042522", "rs17878362", "rs1800372"]
52
+ df_all = lookup.query_all(rsids)
53
+ df_chr = lookup.query_chromosome(rsids, "17")
54
+ print(df_all)
55
+ print(df_chr)
56
+ ```
57
+
58
+ ## Processing Pipeline
59
+
60
+ `pyvariantdb` offers some quality-of-life improvements over the original repository for working with dbSNP.
61
+ The original pipeline remains the same:
62
+
63
+ 1. Downloads dbSNP data (GRCh38 build 156)
64
+ 2. Filters for SNVs only
65
+ 3. Converts chromosome contigs to standard naming
66
+ 4. Splits data by chromosome
67
+ 5. Creates Parquet lookup tables with RSID mappings
68
+
69
+ ## Output
70
+
71
+ The script entry points write the following files to the configured cache directory:
72
+
73
+ - `dbSNP_156.bcf` - Full filtered BCF file
74
+ - `dbSNP_156.chr*.bcf` - Per-chromosome BCF files
75
+ - `dbSNP_156.chr*.lookup.parquet` - Per-chromosome RSID lookup tables
76
+ - `dbSNP_156.combined.lookup.parquet` - Combined RSID lookup table
@@ -0,0 +1,54 @@
1
+ [project]
2
+ name = "pyvariantdb"
3
+ version = "1.1"
4
+ description = "A python package to download, process and access dbSNP data."
5
+ readme = "README.md"
6
+ requires-python = ">=3.11"
7
+
8
+ [project.scripts]
9
+ pyvariantdb-download = "pyvariantdb.download:download_dbsnp"
10
+ pyvariantdb-snp2parquet = "pyvariantdb.scripts.convert:main"
11
+ pyvariantdb-make-dbsnp = "pyvariantdb.scripts.dbsnp2parquet:main"
12
+
13
+ # ---- build
14
+ [build-system]
15
+ requires = ["setuptools"]
16
+ build-backend = "setuptools.build_meta"
17
+
18
+ [tool.setuptools.packages.find]
19
+ include = ["pyvariantdb"]
20
+
21
+ # -- pixi settings
22
+ [tool.pixi.workspace]
23
+ channels = ["conda-forge", "bioconda"]
24
+ platforms = ["linux-64", "osx-arm64"]
25
+
26
+
27
+ [tool.pixi.dependencies]
28
+ python = "*"
29
+ bcftools = "*"
30
+ pandas = "*"
31
+ polars = "*"
32
+ cyvcf2 = "*"
33
+ snakemake-executor-plugin-cluster-generic = "*"
34
+ snakemake = "*"
35
+ pyarrow = "*"
36
+ ipython = "*"
37
+ loguru = "*"
38
+ aria2 = "*"
39
+
40
+ [tool.pixi.pypi-dependencies]
41
+ pyvariantdb = { path = ".", editable = true}
42
+
43
+ [project.optional-dependencies]
44
+ test = ["pytest", "twine", "pre-commit", "pytest-ruff", "ruff"]
45
+ prod = ["pyvariantdb"]
46
+ all = ["pyvariantdb[test]"]
47
+
48
+ [tool.pixi.environments]
49
+ default = { features = ["all"], solve-group = "default" }
50
+ prod = { features = ["prod"], solve-group = "default" }
51
+ test = { features = ["test"], solve-group = "default" }
52
+
53
+ [tool.pixi.feature.test.tasks]
54
+ download = "pytest"
@@ -0,0 +1,7 @@
1
+ """pyvariantdb - A python package to download, process and access dbSNP data."""
2
+
3
+ import importlib
4
+
5
+ __all__ = ["__version__"]
6
+
7
+ __version__ = importlib.metadata.version(__name__)
@@ -0,0 +1,129 @@
1
+ """Snakemake workflow for processing of dbSNP.
2
+
3
+ This workflow downloads, processes, and converts dbSNP data (GRCh38 build 156) to Parquet format.
4
+
5
+ Pipeline Steps:
6
+ 1. convert_contig: Downloads and filters dbSNP data for SNVs only
7
+ 2. subset_BCF: Splits data by chromosome
8
+ 3. convert_parquet: Converts each chromosome to Parquet lookup tables
9
+ 4. combine_all_parquets: Combines all chromosome-specific parquet files into one
10
+
11
+ Usage Examples:
12
+ # Run all rules (default)
13
+ >>> snakemake --cores all
14
+
15
+ # Run on a cluster with SLURM
16
+ >>> snakemake --cluster "sbatch -p {resources.partition} --mem={resources.mem} -t {resources.time} -c {threads}" -j 23
17
+
18
+ # Run only the combine_all_parquets rule
19
+ >>> snakemake --cores 1 output/dbSNP_156.combined.lookup.parquet
20
+
21
+ # Use a custom config file
22
+ >>> snakemake --configfile my_config.yaml --cores all
23
+
24
+ # Dry run to see what will be executed
25
+ >>> snakemake -n
26
+
27
+ """
28
+ import yaml
29
+ from importlib.resources import files
30
+ from loguru import logger
31
+
32
+ from pyvariantdb.const import get_cache_dir
33
+
34
+ pyvariant_assets_dir = files("pyvariantdb") / "assets"
35
+ pyvariant_contigs = pyvariant_assets_dir / "contig_map.tsv"
36
+ pyvariant_config = pyvariant_assets_dir / "config.yaml"
37
+
38
+ # Use packaged config only when no config was provided on the command line.
39
+ if not config:
40
+ with open(pyvariant_config,"r") as f:
41
+ config = yaml.safe_load(f)
42
+
43
+ CHROMS = [f"chr{x}" for x in range(1,23)] + ['chrX']
44
+
45
+ config["output_dir"] = get_cache_dir()
46
+
47
+ logger.info("Starting snakemake workflow from pyvariantdb to download variants.")
48
+ for conf_key, conf_value in config.items():
49
+ logger.info(f"{conf_key}: {conf_value}")
50
+
51
+ rule all:
52
+ input:
53
+ f"{config['output_dir']}/dbSNP_156.bcf",
54
+ expand(f"{config['output_dir']}/dbSNP_156.{{CHROM}}.bcf",CHROM=CHROMS),
55
+ expand(f"{config['output_dir']}/dbSNP_156.{{CHROM}}.lookup.parquet",CHROM=CHROMS),
56
+ f"{config['output_dir']}/dbSNP_156.combined.lookup.parquet"
57
+
58
+ rule convert_contig:
59
+ input:
60
+ f"{config['output_dir']}/GCF_000001405.40.gz"
61
+ output:
62
+ f"{config['output_dir']}/dbSNP_156.bcf"
63
+ threads: config['resources']['convert_contig']['threads']
64
+ resources:
65
+ mem=config['resources']['convert_contig']['mem'],
66
+ partition=config['cluster']['partition'],
67
+ time=config['resources']['convert_contig']['time']
68
+ shell:
69
+ """
70
+ bcftools view -i "INFO/VC='SNV'" {input} |
71
+ bcftools annotate --rename-chrs {pyvariant_contigs} -Ob --threads {threads} > {output}
72
+
73
+ bcftools index {output}
74
+ """
75
+
76
+ rule subset_BCF:
77
+ input:
78
+ f"{config['output_dir']}/dbSNP_156.bcf"
79
+ output:
80
+ f"{config['output_dir']}/dbSNP_156.{{CHROM}}.bcf"
81
+ threads: config['resources']['subset_BCF']['threads']
82
+ resources:
83
+ mem=config['resources']['subset_BCF']['mem'],
84
+ partition=config['cluster']['partition'],
85
+ time=config['resources']['subset_BCF']['time']
86
+ shell:
87
+ """
88
+ bcftools view -Ob --threads {threads} {input} {wildcards.CHROM} > {output}
89
+ bcftools index {output}
90
+ """
91
+
92
+ rule convert_parquet:
93
+ input:
94
+ f"{config['output_dir']}/dbSNP_156.{{CHROM}}.bcf"
95
+ output:
96
+ f"{config['output_dir']}/dbSNP_156.{{CHROM}}.lookup.parquet"
97
+ resources:
98
+ mem=config['resources']['convert_parquet']['mem'],
99
+ partition=config['cluster']['partition'],
100
+ time=config['resources']['convert_parquet']['time']
101
+ shell:
102
+ """
103
+ pyvariantdb-snp2parquet {input} {output}
104
+ """
105
+
106
+ rule combine_all_parquets:
107
+ input:
108
+ expand(f"{config['output_dir']}/dbSNP_156.{{CHROM}}.lookup.parquet",CHROM=CHROMS)
109
+ output:
110
+ f"{config['output_dir']}/dbSNP_156.combined.lookup.parquet"
111
+ resources:
112
+ mem=config['resources'].get('combine_all_parquets',{}).get('mem','16G'),
113
+ partition=config['cluster']['partition'],
114
+ time=config['resources'].get('combine_all_parquets',{}).get('time','02:00:00')
115
+ run:
116
+ import pyarrow.parquet as pq
117
+ import pyarrow as pa
118
+
119
+ # Read all parquet files
120
+ tables = []
121
+ for input_file in input:
122
+ table = pq.read_table(input_file)
123
+ tables.append(table)
124
+
125
+ # Concatenate all tables
126
+ combined_table = pa.concat_tables(tables)
127
+
128
+ # Write the combined table
129
+ pq.write_table(combined_table, output[0])
@@ -0,0 +1,20 @@
1
+ output_dir: "output"
2
+
3
+ cluster:
4
+ partition: "nodes*"
5
+
6
+ resources:
7
+ convert_contig:
8
+ mem: "6G"
9
+ time: "07:30:00"
10
+ threads: 10
11
+ subset_BCF:
12
+ mem: "3G"
13
+ time: "03:30:00"
14
+ threads: 10
15
+ convert_parquet:
16
+ mem: "8G"
17
+ time: "05:30:00"
18
+ combine_all_parquets:
19
+ mem: "16G"
20
+ time: "02:00:00"
@@ -0,0 +1,24 @@
1
+ NC_000001.11 chr1
2
+ NC_000002.12 chr2
3
+ NC_000003.12 chr3
4
+ NC_000004.12 chr4
5
+ NC_000005.10 chr5
6
+ NC_000006.12 chr6
7
+ NC_000007.14 chr7
8
+ NC_000008.11 chr8
9
+ NC_000009.12 chr9
10
+ NC_000010.11 chr10
11
+ NC_000011.10 chr11
12
+ NC_000012.12 chr12
13
+ NC_000013.11 chr13
14
+ NC_000014.9 chr14
15
+ NC_000015.10 chr15
16
+ NC_000016.10 chr16
17
+ NC_000017.11 chr17
18
+ NC_000018.10 chr18
19
+ NC_000019.10 chr19
20
+ NC_000020.11 chr20
21
+ NC_000021.9 chr21
22
+ NC_000022.11 chr22
23
+ NC_000023.11 chrX
24
+ NC_000024.10 chrY
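The mapping above is consumed directly by `bcftools annotate --rename-chrs`, but it can also be loaded as a plain dict for ad-hoc use (a sketch over an inline excerpt of the table):

```python
from io import StringIO

# Excerpt of contig_map.tsv: RefSeq accession -> standard chromosome name.
tsv = "NC_000001.11\tchr1\nNC_000017.11\tchr17\nNC_000024.10\tchrY\n"
contig_map = dict(line.split() for line in StringIO(tsv))
print(contig_map["NC_000017.11"])
```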
@@ -0,0 +1,14 @@
1
+ """Define constants used in pyvariantdb."""
2
+
3
+ import os
4
+ from loguru import logger
5
+ from pathlib import Path
6
+
7
+
8
+ def get_cache_dir() -> Path:
9
+ """Return the cache directory, creating it if it does not exist."""
10
+ # check if env var PYVARIANTDB_HOME exists, otherwise ~/.cache/pyvariantdb
11
+ cache_dir = Path(os.getenv("PYVARIANTDB_HOME", Path.home() / ".cache" / "pyvariantdb"))
12
+ cache_dir.mkdir(parents=True, exist_ok=True)
13
+ logger.info(f"Cache directory: {cache_dir}")
14
+ return cache_dir
@@ -0,0 +1,96 @@
1
+ #!/usr/bin/env python3
2
+ """Download utilities for dbSNP to Parquet conversion."""
3
+
4
+ import subprocess
5
+ from pathlib import Path
6
+ from loguru import logger
7
+ from pyvariantdb.const import get_cache_dir
8
+
9
+
10
+ def download_with_aria2(url: str, output_dir: Path, filename: str | None = None) -> Path:
11
+ """
12
+ Download a file using aria2c.
13
+
14
+ Args:
15
+ url: The URL to download from
16
+ output_dir: Directory where the file should be saved
17
+ filename: Optional custom filename (defaults to name from URL)
18
+
19
+ Returns:
20
+ Path: Path to the downloaded file
21
+
22
+ Raises:
23
+ RuntimeError: If aria2c is not installed or download fails
24
+ """
25
+ # Ensure output directory exists
26
+ output_dir.mkdir(parents=True, exist_ok=True)
27
+
28
+ # Build aria2c command
29
+ cmd = [
30
+ "aria2c",
31
+ "--dir",
32
+ str(output_dir),
33
+ "--max-connection-per-server=16", # Use up to 16 connections per server
34
+ "--split=16", # Split file into 16 pieces for parallel download
35
+ "--min-split-size=1M", # Minimum size for splitting
36
+ "--continue=true", # Continue partial downloads
37
+ "--allow-overwrite=true", # Allow overwriting existing files
38
+ "--auto-file-renaming=false", # Don't auto-rename files
39
+ ]
40
+
41
+ if filename:
42
+ cmd.extend(["--out", filename])
43
+
44
+ cmd.append(url)
45
+
46
+ # Run aria2c
47
+ logger.info(f"Running: {' '.join(cmd)}")
48
+ result = subprocess.run(cmd, capture_output=True, text=True)
49
+
50
+ if result.returncode != 0:
51
+ logger.error(f"aria2c stderr: {result.stderr}")
52
+ raise RuntimeError(f"Failed to download {url}: {result.stderr}")
53
+
54
+ # Determine output path
55
+ if filename:
56
+ output_path = output_dir / filename
57
+ else:
58
+ output_path = output_dir / url.split("/")[-1]
59
+
60
+ return output_path
61
+
62
+
63
+ def download_dbsnp():
64
+ """
65
+ Download dbSNP VCF file and its index from NCBI's FTP server using aria2.
66
+
67
+ Downloads:
68
+ - GCF_000001405.40.gz (main VCF file)
69
+ - GCF_000001405.40.gz.tbi (index file)
70
+
71
+ Uses aria2c for fast multi-connection downloads.
72
+ """
73
+ logger.info("Starting dbSNP download with aria2...")
74
+
75
+ urls = [
76
+ "https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.40.gz",
77
+ "https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.40.gz.tbi",
78
+ ]
79
+ destination_dir = get_cache_dir()
80
+ logger.info(f"Destination directory: {destination_dir}")
81
+
82
+ for url in urls:
83
+ try:
84
+ filename = url.split("/")[-1]
85
+ logger.info(f"Downloading {url}...")
86
+ output_path = download_with_aria2(url, destination_dir, filename)
87
+ logger.info(f"✓ Successfully downloaded to {output_path}")
88
+ except Exception as e:
89
+ logger.error(f"Failed to download {url}: {e}")
90
+ raise
91
+
92
+ logger.info("Download completed successfully!")
93
+
94
+
95
+ if __name__ == "__main__":
96
+ download_dbsnp()
@@ -0,0 +1,110 @@
1
+ """Lookup utilities for dbSNP Parquet outputs.
2
+
3
+ This module provides lightweight query helpers over per-chromosome dbSNP
4
+ lookup Parquet files generated by the pipeline.
5
+
6
+ In the ideal case, you already know which chromosome holds a specific rsID. In that case,
7
+ it is most efficient to query the respective chromosome file directly.
8
+
9
+ Example:
10
+ from pyvariantdb.lookup import SNPLookup
11
+
12
+ lookup = SNPLookup()
13
+ # P53 mutations, chromosome 17
14
+ rsids = ["rs1042522", "rs17878362", "rs1800372"]
15
+ df_all = lookup.query_all(rsids)
16
+ df_chr = lookup.query_chromosome(rsids, "17")
17
+ print(df_all)
18
+ print(df_chr)
19
+ """
20
+
21
+ from __future__ import annotations
22
+
23
+ import polars as pl
24
+ from loguru import logger
25
+ from pathlib import Path
26
+ from typing import Optional
27
+
28
+ from pyvariantdb.const import get_cache_dir
29
+
30
+ DEFAULT_VERSION = "156"
31
+ DEFAULT_OUTPUT_DIR = "output"
32
+
33
+
34
+ class SNPLookup:
35
+ """Query dbSNP lookup Parquet files by rsID.
36
+
37
+ Example:
38
+ lookup = SNPLookup(version="156")
39
+ rsids = ["rs1042522", "rs17878362", "rs1800372"]
40
+ df = lookup.query_all(rsids)
41
+ print(df)
42
+ """
43
+
44
+ def __init__(self, version: str = DEFAULT_VERSION, cache_dir: Optional[str] = None):
45
+ self.version = version
46
+ self.cache_dir = (
47
+ Path(cache_dir) if cache_dir else get_cache_dir()
48
+ )
49
+
50
+ def query_all(self, rsids: list[str]) -> pl.DataFrame:
51
+ """Query all lookup files for a list of rsIDs.
52
+
53
+ Args:
54
+ rsids: List of rsIDs to search for.
55
+
56
+ Returns:
57
+ A Polars DataFrame with columns "RSID" and "ID".
58
+
59
+ Example:
60
+ lookup = SNPLookup()
61
+ rsids = ["rs1042522", "rs17878362", "rs1800372"]
62
+ df = lookup.query_all(rsids)
63
+ print(df.head(3))
64
+ """
65
+ logger.info(f"Querying dbSNP Parquet files for {len(rsids)} rsIDs...")
66
+ if not rsids:
67
+ return pl.DataFrame(schema={"RSID": pl.String, "ID": pl.String})
68
+
69
+ pattern = f"dbSNP_{self.version}.chr*.lookup.parquet"
70
+ lookup_files = sorted(self.cache_dir.glob(pattern))
71
+ if not lookup_files:
72
+ raise FileNotFoundError(f"No dbSNP Parquet files found in {self.cache_dir}")
73
+
74
+ lazy_frames = []
75
+ for in_file in lookup_files:
76
+ lazy_frames.append(
77
+ pl.scan_parquet(in_file).filter(pl.col("RSID").is_in(rsids))
78
+ )
79
+
80
+ return pl.concat(lazy_frames).collect()
81
+
82
+ def query_chromosome(self, rsids: list[str], chrom: str) -> pl.DataFrame:
83
+ """Query a specific chromosome lookup file for a list of rsIDs.
84
+
85
+ Args:
86
+ rsids: List of rsIDs to search for.
87
+ chrom: Chromosome label, e.g. "chr17" or "17".
88
+
89
+ Returns:
90
+ A Polars DataFrame with columns "RSID" and "ID".
91
+
92
+ Example:
93
+ lookup = SNPLookup()
94
+ rsids = ["rs1042522", "rs17878362", "rs1800372"]
95
+ df = lookup.query_chromosome(rsids, "17")
96
+ print(df)
97
+ """
98
+ logger.info(f"Querying dbSNP Parquet files for {len(rsids)} rsIDs...")
99
+ if not rsids:
100
+ return pl.DataFrame(schema={"RSID": pl.String, "ID": pl.String})
101
+
102
+ chrom_label = chrom if chrom.startswith("chr") else f"chr{chrom}"
103
+ in_file = self.cache_dir / f"dbSNP_{self.version}.{chrom_label}.lookup.parquet"
104
+ if not in_file.exists():
105
+ raise FileNotFoundError(f"dbSNP Parquet file not found: {in_file}")
106
+
107
+ subset_df = (
108
+ pl.scan_parquet(in_file).filter(pl.col("RSID").is_in(rsids)).collect()
109
+ )
110
+ return subset_df
@@ -0,0 +1,93 @@
1
+ Metadata-Version: 2.4
2
+ Name: pyvariantdb
3
+ Version: 1.1
4
+ Summary: A python package to download, process and access dbSNP data.
5
+ Requires-Python: >=3.11
6
+ Description-Content-Type: text/markdown
7
+ Provides-Extra: test
8
+ Requires-Dist: pytest; extra == "test"
9
+ Requires-Dist: twine; extra == "test"
10
+ Requires-Dist: pre-commit; extra == "test"
11
+ Requires-Dist: pytest-ruff; extra == "test"
12
+ Requires-Dist: ruff; extra == "test"
13
+ Provides-Extra: prod
14
+ Requires-Dist: pyvariantdb; extra == "prod"
15
+ Provides-Extra: all
16
+ Requires-Dist: pyvariantdb[test]; extra == "all"
17
+
18
+ # dbSNP to Parquet Converter
19
+
20
+ A Python-package-focused rewrite of the [dbSNP to parquet](https://github.com/weinstockj/dbSNP_to_parquet)
21
+ repository by Weinstock.
22
+ This package allows convenient download, processing, and access to dbSNP data. A simple Python interface
23
+ can be used to work with the data once it is processed.
24
+
25
+ ## Installation
26
+
27
+ `pyvariantdb` is available on PyPI and can be installed with the package manager of your choice. For
28
+ development we use pixi:
29
+
30
+ ```bash
31
+ # install pixi
32
+ curl -fsSL https://pixi.sh/install.sh | sh
33
+ pixi update && pixi install
34
+ ```
35
+
36
+ ## Usage
37
+
38
+ We recommend preparing the data from the command line before using the package, since downloading and processing take
39
+ quite some time. By default the data is stored at `~/.cache/pyvariantdb`. This can be changed via an
40
+ environment variable:
41
+
42
+ ```bash
43
+ export PYVARIANTDB_HOME="/raid/cache/pyvariantdb"
44
+ ```
45
+
46
+ Execution of the pipeline can be done with:
47
+
48
+ ```bash
49
+ # activate pixi
50
+ pixi shell -e default
51
+ # download dbsnp
52
+ pyvariantdb-download
53
+ # transform the database to parquets
54
+ pyvariantdb-make-dbsnp
55
+ ```
56
+
57
+ We aim to provide an easy interface for extracting genomic coordinates from dbSNP. The main functionality
58
+ is deliberately simple: provide the rs identifier (with or without a chromosome) and the respective
59
+ coordinates are fetched. Note that the pipeline above must be run first.
60
+
61
+ Extracting coordinates is simple:
62
+
63
+ ```python
64
+ from pyvariantdb.lookup import SNPLookup
65
+
66
+ lookup = SNPLookup()
67
+ # P53 mutations, chromosome 17
68
+ rsids = ["rs1042522", "rs17878362", "rs1800372"]
69
+ df_all = lookup.query_all(rsids)
70
+ df_chr = lookup.query_chromosome(rsids, "17")
71
+ print(df_all)
72
+ print(df_chr)
73
+ ```
74
+
75
+ ## Processing Pipeline
76
+
77
+ `pyvariantdb` offers some quality-of-life improvements over the original repository for working with dbSNP.
78
+ The original pipeline remains the same:
79
+
80
+ 1. Downloads dbSNP data (GRCh38 build 156)
81
+ 2. Filters for SNVs only
82
+ 3. Converts chromosome contigs to standard naming
83
+ 4. Splits data by chromosome
84
+ 5. Creates Parquet lookup tables with RSID mappings
85
+
86
+ ## Output
87
+
88
+ The script entry points write the following files to the configured cache directory:
89
+
90
+ - `dbSNP_156.bcf` - Full filtered BCF file
91
+ - `dbSNP_156.chr*.bcf` - Per-chromosome BCF files
92
+ - `dbSNP_156.chr*.lookup.parquet` - Per-chromosome RSID lookup tables
93
+ - `dbSNP_156.combined.lookup.parquet` - Combined RSID lookup table
@@ -0,0 +1,17 @@
1
+ MANIFEST.in
2
+ README.md
3
+ pyproject.toml
4
+ pyvariantdb/__init__.py
5
+ pyvariantdb/const.py
6
+ pyvariantdb/download.py
7
+ pyvariantdb/lookup.py
8
+ pyvariantdb.egg-info/PKG-INFO
9
+ pyvariantdb.egg-info/SOURCES.txt
10
+ pyvariantdb.egg-info/dependency_links.txt
11
+ pyvariantdb.egg-info/entry_points.txt
12
+ pyvariantdb.egg-info/requires.txt
13
+ pyvariantdb.egg-info/top_level.txt
14
+ pyvariantdb/assets/Snakefile
15
+ pyvariantdb/assets/config.yaml
16
+ pyvariantdb/assets/contig_map.tsv
17
+ tests/test_lookup.py
@@ -0,0 +1,4 @@
1
+ [console_scripts]
2
+ pyvariantdb-download = pyvariantdb.download:download_dbsnp
3
+ pyvariantdb-make-dbsnp = pyvariantdb.scripts.dbsnp2parquet:main
4
+ pyvariantdb-snp2parquet = pyvariantdb.scripts.convert:main
@@ -0,0 +1,13 @@
1
+
2
+ [all]
3
+ pyvariantdb[test]
4
+
5
+ [prod]
6
+ pyvariantdb
7
+
8
+ [test]
9
+ pytest
10
+ twine
11
+ pre-commit
12
+ pytest-ruff
13
+ ruff
@@ -0,0 +1 @@
1
+ pyvariantdb
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
File without changes