genal-python 1.2.8__tar.gz → 1.3.0__tar.gz
- {genal_python-1.2.8 → genal_python-1.3.0}/PKG-INFO +49 -70
- {genal_python-1.2.8 → genal_python-1.3.0}/README.md +47 -69
- {genal_python-1.2.8 → genal_python-1.3.0}/genal/Geno.py +164 -55
- {genal_python-1.2.8 → genal_python-1.3.0}/genal/MR_tools.py +5 -1
- {genal_python-1.2.8 → genal_python-1.3.0}/genal/__init__.py +1 -1
- {genal_python-1.2.8 → genal_python-1.3.0}/genal/association.py +2 -8
- {genal_python-1.2.8 → genal_python-1.3.0}/genal/clump.py +12 -22
- genal_python-1.3.0/genal/colocalization.py +159 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/genal/constants.py +5 -2
- {genal_python-1.2.8 → genal_python-1.3.0}/genal/geno_tools.py +64 -26
- {genal_python-1.2.8 → genal_python-1.3.0}/genal/lift.py +4 -4
- {genal_python-1.2.8 → genal_python-1.3.0}/genal/proxy.py +13 -18
- {genal_python-1.2.8 → genal_python-1.3.0}/genal/tools.py +184 -90
- {genal_python-1.2.8 → genal_python-1.3.0}/pyproject.toml +3 -2
- {genal_python-1.2.8 → genal_python-1.3.0}/.DS_Store +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/.gitignore +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/.readthedocs.yaml +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/Genal_flowchart.png +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/LICENSE +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/.DS_Store +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/Makefile +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/.DS_Store +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/.buildinfo +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/.doctrees/api.doctree +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/.doctrees/environment.pickle +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/.doctrees/genal.doctree +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/.doctrees/index.doctree +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/.doctrees/introduction.doctree +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/.doctrees/modules.doctree +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_images/MR_plot_SBP_AS.png +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/Geno.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/MR.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/MR_tools.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/MRpresso.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/association.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/clump.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/extract_prs.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/geno_tools.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/lift.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/proxy.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/snp_query.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/tools.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/index.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_sources/api.rst.txt +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_sources/genal.rst.txt +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_sources/index.rst.txt +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_sources/introduction.rst.txt +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_sources/modules.rst.txt +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/basic.css +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/badge_only.css +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/Roboto-Slab-Bold.woff +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/Roboto-Slab-Bold.woff2 +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/Roboto-Slab-Regular.woff +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/Roboto-Slab-Regular.woff2 +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/fontawesome-webfont.eot +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/fontawesome-webfont.svg +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/fontawesome-webfont.ttf +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/fontawesome-webfont.woff +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/fontawesome-webfont.woff2 +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/lato-bold-italic.woff +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/lato-bold-italic.woff2 +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/lato-bold.woff +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/lato-bold.woff2 +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/lato-normal-italic.woff +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/lato-normal-italic.woff2 +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/lato-normal.woff +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/lato-normal.woff2 +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/theme.css +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/doctools.js +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/documentation_options.js +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/file.png +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/js/badge_only.js +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/js/html5shiv-printshiv.min.js +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/js/html5shiv.min.js +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/js/theme.js +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/language_data.js +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/minus.png +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/plus.png +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/pygments.css +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/searchtools.js +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/sphinx_highlight.js +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/api.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/genal.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/genindex.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/index.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/introduction.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/modules.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/objects.inv +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/py-modindex.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/search.html +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/searchindex.js +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/make.bat +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/requirements.txt +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/source/.DS_Store +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/source/Images/Genal_flowchart.png +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/source/Images/MR_plot_SBP_AS.png +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/source/Images/genal_logo.png +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/source/api.rst +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/source/conf.py +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/source/index.rst +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/source/introduction.rst +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/docs/source/modules.rst +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/genal/MR.py +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/genal/MRpresso.py +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/genal/extract_prs.py +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/genal/snp_query.py +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/genal_logo.png +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/gitignore +0 -0
- {genal_python-1.2.8 → genal_python-1.3.0}/readthedocs.yaml +0 -0
{genal_python-1.2.8 → genal_python-1.3.0}/PKG-INFO
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.3
|
|
2
2
|
Name: genal-python
|
|
3
|
-
Version: 1.
|
|
3
|
+
Version: 1.3.0
|
|
4
4
|
Summary: A python toolkit for polygenic risk scoring and mendelian randomization.
|
|
5
5
|
Author-email: Cyprien Rivier <riviercyprien@gmail.com>
|
|
6
6
|
Requires-Python: >=3.8
|
|
@@ -20,6 +20,7 @@ Requires-Dist: scipy>=1.10.1, <1.11
|
|
|
20
20
|
Requires-Dist: statsmodels==0.14.0
|
|
21
21
|
Requires-Dist: tqdm==4.66.1
|
|
22
22
|
Requires-Dist: wget==3.2
|
|
23
|
+
Requires-Dist: fastparquet>=2024.2.0
|
|
23
24
|
Project-URL: Home, https://github.com/CypRiv/genal
|
|
24
25
|
|
|
25
26
|
[](https://www.python.org/downloads/release/python-3100/)
|
|
@@ -48,11 +49,34 @@ Project-URL: Home, https://github.com/CypRiv/genal
|
|
|
48
49
|
|
|
49
50
|
|
|
50
51
|
## Introduction <a name="introduction"></a>
|
|
51
|
-
Genal is a python module designed to make it easy to run genetic risk scores and
|
|
52
|
-
|
|
53
|
-
|
|
54
|
-
|
|
55
|
-
|
|
52
|
+
Genal is a python module designed to make it easy and intuitive to run genetic risk scores and Mendelian Randomization analyses. The functionalities provided by genal include:
|
|
53
|
+
|
|
54
|
+
- Data preprocessing and cleaning of variant data (usually from GWAS summary statistics)
|
|
55
|
+
- Selection of independent genetic instruments through clumping
|
|
56
|
+
- Polygenic risk score calculations (currently C+T only, soon LDPRED2 and PRScs)
|
|
57
|
+
- More than 10 Mendelian Randomization methods, including heterogeneity and pleiotropy tests, with parallel processing support:
|
|
58
|
+
- Inverse Variance Weighted (IVW) methods
|
|
59
|
+
- MR-Egger methods
|
|
60
|
+
- Weighted Median methods
|
|
61
|
+
- Mode methods
|
|
62
|
+
- MR-PRESSO in parallel
|
|
63
|
+
- More to come...
|
|
64
|
+
- SNP-trait association testing
|
|
65
|
+
- Lifting of genetic data to another genomic build
|
|
66
|
+
- Variant-phenotype querying with the GWAS Catalog
|
|
67
|
+
|
|
68
|
+
### Key Features
|
|
69
|
+
|
|
70
|
+
- **Efficient Parallel Processing**: Parallel computation for bootstrapping-based MR methods and MR-PRESSO significantly reduces computation time compared to the original R packages (up to 85% faster for MR-PRESSO)
|
|
71
|
+
- **Flexible Data Handling**: Automatic formatting of variant data and summary statistics
|
|
72
|
+
- **Comprehensive MR Pipeline**: From data preprocessing to sensitivity analyses and plotting in a single package
|
|
73
|
+
- **Reference Panel Support**: Automatically download and use the latest 1000 Genomes reference panels in builds 37 and 38 with the option to use custom reference panels
|
|
74
|
+
- **Customizable**: Ability to choose all the parameters, but defaults are set to the most common values
|
|
75
|
+
- **Proxy SNP Support**: Includes functionality for finding and using proxy SNPs when instruments are missing (for polygenic risk scores, Mendelian Randomization, and association testing)
|
|
76
|
+
|
|
77
|
+
The objective of genal is to bring the functionalities of well-established R packages such as TwoSampleMR, MR-Presso, MendelianRandomization, and gwasvcf, in a more user-friendly Python environment. This approach ensures that users have access to tried and tested techniques with the versatility of Python's data science tools.
|
|
78
|
+
|
|
79
|
+
This package is still under development, feel free to report any issues or suggest improvements!
|
|
56
80
|
|
|
57
81
|
<img src="/Genal_flowchart.png" data-canonical-src="/Genal_flowchart.png" style="max-width:100%;" />
|
|
58
82
|
|
|
@@ -65,6 +89,8 @@ Cyprien A. Rivier, Santiago Clocchiatti-Tuozzo, Shufan Huo, Victor Torres-Lopez,
|
|
|
65
89
|
Bioinformatics Advances 2024.
|
|
66
90
|
doi: https://doi.org/10.1093/bioadv/vbae207
|
|
67
91
|
|
|
92
|
+
If you're using methods derived from R packages, such as MR-PRESSO, please also cite the original papers.
|
|
93
|
+
|
|
68
94
|
## Requirements for the genal module <a name="paragraph1"></a>
|
|
69
95
|
***Python 3.8 or later***. https://www.python.org/ <br>
|
|
70
96
|
|
|
@@ -145,16 +171,7 @@ import pandas as pd
|
|
|
145
171
|
sbp_gwas = pd.read_csv("Evangelou_30224653_SBP.txt", sep=" ")
|
|
146
172
|
sbp_gwas.head(5)
|
|
147
173
|
```
|
|
148
|
-
|
|
149
|
-
| MarkerName | Allele1 | Allele2 | Freq1 | Effect | StdErr | P | TotalSampleSize | N_effective |
|
|
150
|
-
|-------------------|---------|---------|-------|--------|--------|---------|-----------------|-------------|
|
|
151
|
-
| 10:100000625:SNP | a | g | 0.5660| 0.0523 | 0.0303 | 0.083940| 738170 | 736847 |
|
|
152
|
-
| 10:100000645:SNP | a | c | 0.7936| 0.0200 | 0.0372 | 0.591100| 738168 | 735018 |
|
|
153
|
-
| 10:100003242:SNP | t | g | 0.8831| 0.1417 | 0.0469 | 0.002526| 738168 | 733070 |
|
|
154
|
-
| 10:100003304:SNP | a | g | 0.9609| 0.0245 | 0.0838 | 0.769800| 737054 | 663809 |
|
|
155
|
-
| 10:100003785:SNP | t | c | 0.6406| -0.0680| 0.0313 | 0.029870| 738169 | 735681 |
|
|
156
|
-
|
|
157
|
-
We can now load this data into a `genal.Geno` instance. The `genal.Geno` class is the central piece of the package. It is designed to store Single Nucleotide Polymorphisms (SNP) data and make it easy to preprocess and clean.
|
|
174
|
+
We can now load this data into a `genal.Geno` instance. The `genal.Geno` class is the central piece of the package. It is designed to store Single Nucleotide Polymorphisms (SNP) data and make it easy to preprocess and clean it.
|
|
158
175
|
|
|
159
176
|
The `genal.Geno` takes as input a pandas dataframe where each row corresponds to a SNP, with columns describing the position and possibly the effect of the SNP for the given trait (SBP in our case). To indicate the names of the columns, the following arguments can be passed:
|
|
160
177
|
- **CHR**: Column name for chromosome. Defaults to `'CHR'`.
|
|
@@ -175,15 +192,12 @@ After inspecting the dataframe, we first need to extract the chromosome and posi
|
|
|
175
192
|
|
|
176
193
|
```python
|
|
177
194
|
sbp_gwas[["CHR", "POS", "Filler"]] = sbp_gwas["MarkerName"].str.split(":", expand=True)
|
|
178
|
-
sbp_gwas.head(
|
|
195
|
+
sbp_gwas.head(2)
|
|
179
196
|
```
|
|
180
197
|
| MarkerName | Allele1 | Allele2 | Freq1 | Effect | StdErr | P | TotalSampleSize | N_effective | CHR | POS | Filler |
|
|
181
198
|
|-------------------|---------|---------|-------|--------|--------|----------|-----------------|-------------|-----|-----------|--------|
|
|
182
199
|
| 10:100000625:SNP | a | g | 0.5660| 0.0523 | 0.0303 | 0.083940 | 738170 | 736847 | 10 | 100000625 | SNP |
|
|
183
200
|
| 10:100000645:SNP | a | c | 0.7936| 0.0200 | 0.0372 | 0.591100 | 738168 | 735018 | 10 | 100000645 | SNP |
|
|
184
|
-
| 10:100003242:SNP | t | g | 0.8831| 0.1417 | 0.0469 | 0.002526 | 738168 | 733070 | 10 | 100003242 | SNP |
|
|
185
|
-
| 10:100003304:SNP | a | g | 0.9609| 0.0245 | 0.0838 | 0.769800 | 737054 | 663809 | 10 | 100003304 | SNP |
|
|
186
|
-
| 10:100003785:SNP | t | c | 0.6406| -0.0680| 0.0313 | 0.029870 | 738169 | 735681 | 10 | 100003785 | SNP |
|
|
187
201
|
|
|
188
202
|
And it can now be loaded into a `genal.Geno` instance:
|
|
189
203
|
|
|
@@ -220,7 +234,7 @@ By default, and depending on the global preprocessing level (`'None'`, `'Fill'`,
|
|
|
220
234
|
- Validate the `P` (p-value) column for proper values.
|
|
221
235
|
- Check for no duplicated SNPs based on rsid.
|
|
222
236
|
- Determine if the `BETA` (effect) column contains beta estimates or odds ratios, and log-transform odds ratios if necessary.
|
|
223
|
-
- Create `SNP` column using a reference panel if CHR and POS columns are present.
|
|
237
|
+
- Create `SNP` column (containing rsids) using a reference panel if CHR and POS columns are present.
|
|
224
238
|
- Create `CHR` and/or `POS` column using a reference panel if `SNP` column is present.
|
|
225
239
|
- Create `NEA` (non-effect allele) column using a reference panel if `EA` (effect allele) column is present.
|
|
226
240
|
- Create the `SE` (standard-error) column if the `BETA` and `P` (p-value) columns are present.
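Two of the preprocessing steps listed above (log-transforming odds ratios and deriving `SE` from `BETA` and `P`) can be illustrated with a minimal sketch. It assumes a two-sided p-value from a normal (Wald) test with 0 < p < 1. The helper names `se_from_beta_p` and `log_odds` are hypothetical and are not taken from genal's source, which may handle these steps differently.

```python
import math
from statistics import NormalDist


def se_from_beta_p(beta: float, p: float) -> float:
    """Approximate a standard error from an effect size and a two-sided p-value.

    Assumes the p-value comes from a normal (Wald) test and that 0 < p < 1.
    """
    # Convert the two-sided p-value into the absolute z-score.
    z = NormalDist().inv_cdf(1 - p / 2)
    # For a Wald test, z = |beta| / SE, hence SE = |beta| / z.
    return abs(beta) / z


def log_odds(odds_ratio: float) -> float:
    """Convert an odds ratio into a beta (log-odds) estimate."""
    return math.log(odds_ratio)


# First SBP row of the README table: Effect = 0.0523, P = 0.083940.
sbp_se = se_from_beta_p(0.0523, 0.083940)
# First stroke row of the README table: odds_ratio = 1.003005.
stroke_beta = log_odds(1.003005)
```

For the rows used here, the results should land close to the `StdErr` and `beta` values reported in the README tables, which is a quick sanity check on the formula rather than a guarantee about genal's internals.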
|
|
@@ -243,17 +257,17 @@ SBP_Geno.data
|
|
|
243
257
|
|---------|----|-----|-------|--------|--------|----------|-----|---------|-----------|
|
|
244
258
|
| 0 | A | G | 0.5660| 0.0523| 0.0303 | 0.083940 | 10 | 100000625 | rs7899632 |
|
|
245
259
|
| 1 | A | C | 0.7936| 0.0200| 0.0372 | 0.591100 | 10 | 100000645 | rs61875309 |
|
|
246
|
-
| 2 | T | G | 0.8831| 0.1417| 0.0469 | 0.002526 | 10 | 100003242 | rs12258651 |
|
|
247
|
-
| 3 | A | G | 0.9609| 0.0245| 0.0838 | 0.769800 | 10 | 100003304 | rs72828461 |
|
|
248
|
-
| 4 | T | C | 0.6406| -0.0680| 0.0313 | 0.029870 | 10 | 100003785 | rs1359508 |
|
|
249
|
-
| ... | .. | .. | ... | ... | ... | ... | ... | ... | ... |
|
|
250
|
-
| 7088120 | A | G | 0.9028| -0.0184| 0.0517 | 0.722300 | 9 | 99999468 | rs10981301 |
|
|
251
260
|
|
|
252
|
-
|
|
253
|
-
|
|
261
|
+
|
|
262
|
+
The `SNP` column with the rsids has been added based on the reference data.
|
|
263
|
+
You do not need to obtain the 1000 genome reference panel yourself, genal will download it the first time you use it.
|
|
264
|
+
|
|
265
|
+
> **Note:**
|
|
266
|
+
>
|
|
267
|
+
> By default, the reference panel used is the european (eur) one in build 37. You can specify another valid reference panel (afr, eas, sas, amr) with the reference_panel argument. For instance "AFR_38" for the african reference panel in build 38:
|
|
254
268
|
|
|
255
269
|
```python
|
|
256
|
-
SBP_Geno.preprocess_data(preprocessing = 'Fill_delete', reference_panel = "
|
|
270
|
+
SBP_Geno.preprocess_data(preprocessing = 'Fill_delete', reference_panel = "AFR_37")
|
|
257
271
|
```
|
|
258
272
|
|
|
259
273
|
You can also use a custom reference panel by specifying to the reference_panel argument a path to bed/bim/fam (plink v1.9 format) or pgen/pvar/psam files (plink v2.0 format), without the extension.
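As an illustration of the two kinds of `reference_panel` values described above (a population and build identifier such as `"AFR_38"`, or a path prefix to custom files), here is a small sketch. The function `parse_reference_panel` and its return format are hypothetical and only mirror the README text. They are not genal's actual implementation.

```python
# Populations and builds named in the README text.
VALID_POPULATIONS = {"EUR", "AFR", "EAS", "SAS", "AMR"}
VALID_BUILDS = {"37", "38"}


def parse_reference_panel(reference_panel: str):
    """Classify a reference_panel value (hypothetical helper).

    Returns ("builtin", population, build) for identifiers such as "AFR_38",
    or ("custom", path_prefix, None) for anything else.
    """
    parts = reference_panel.upper().split("_")
    if len(parts) == 2 and parts[0] in VALID_POPULATIONS and parts[1] in VALID_BUILDS:
        return ("builtin", parts[0], int(parts[1]))
    # Otherwise treat the value as a path prefix to bed/bim/fam
    # or pgen/pvar/psam files, given without the extension.
    return ("custom", reference_panel, None)


builtin_panel = parse_reference_panel("AFR_38")
custom_panel = parse_reference_panel("/data/panels/my_panel")
```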
|
|
@@ -264,22 +278,15 @@ Clumping, or C+T: Clumping + Thresholding, is the step at which we select the SN
|
|
|
264
278
|
|
|
265
279
|
The SNP-data loaded in a `genal.Geno` instance can be clumped using the `genal.Geno.clump` method. It will return another `genal.Geno` instance containing only the clumped data:
|
|
266
280
|
|
|
267
|
-
|
|
268
281
|
```python
|
|
269
|
-
SBP_clumped = SBP_Geno.clump(p1 = 5e-8, r2 = 0.
|
|
282
|
+
SBP_clumped = SBP_Geno.clump(p1 = 5e-8, r2 = 0.01, kb = 10000, reference_panel = "EUR_37")
|
|
270
283
|
```
|
|
271
284
|
|
|
272
|
-
It will output the number of instruments obtained::
|
|
273
|
-
|
|
274
|
-
Using the EUR reference panel.
|
|
275
|
-
Warning: 760 top variant IDs missing
|
|
276
|
-
1545 clumps formed from 73594 top variants.
|
|
277
|
-
|
|
278
285
|
You can specify the thresholds you want to use for the clumping with the following arguments:
|
|
279
286
|
- `p1`: P-value threshold during clumping. SNPs with a P-value higher than this value are excluded. Defaults to `5e-8`.
|
|
280
|
-
- `r2`: Linkage disequilibrium threshold for the independence check. Takes values between 0 and 1. Defaults to `0.
|
|
281
|
-
- `kb`: Genomic window used for the independence check (the unit is thousands of base-pair positions). Defaults to `
|
|
282
|
-
- `reference_panel`: The reference population used to derive linkage disequilibrium values and select independent SNPs. Defaults to `
|
|
287
|
+
- `r2`: Linkage disequilibrium threshold for the independence check. Takes values between 0 and 1. Defaults to `0.01`.
|
|
288
|
+
- `kb`: Genomic window used for the independence check (the unit is thousands of base-pair positions). Defaults to `10000`.
|
|
289
|
+
- `reference_panel`: The reference population used to derive linkage disequilibrium values and select independent SNPs. Defaults to `EUR_37`.
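To make the roles of `p1`, `r2`, and `kb` concrete, the following is a simplified sketch of greedy clumping in pure Python. It is not genal's implementation: genal relies on a reference panel to derive linkage disequilibrium, whereas this sketch assumes a user-supplied `r2_lookup` function. The names `greedy_clump` and `r2_lookup` are hypothetical.

```python
def greedy_clump(variants, r2_lookup, p1=5e-8, r2=0.01, kb=10000):
    """Simplified greedy clumping (illustrative only).

    variants:  list of dicts with keys "snp", "chr", "pos", "p".
    r2_lookup: function (snp_a, snp_b) -> LD r^2 between two SNPs
               (hypothetical; real tools compute it from a reference panel).
    """
    # Exclude SNPs with a p-value above p1, then sort most significant first.
    candidates = sorted((v for v in variants if v["p"] <= p1), key=lambda v: v["p"])
    window = kb * 1000  # kb is expressed in thousands of base-pair positions
    index_variants = []
    for v in candidates:
        # A candidate is dropped if it sits within the window of an already
        # selected index variant and is in LD with it at or above the threshold.
        in_ld = any(
            v["chr"] == lead["chr"]
            and abs(v["pos"] - lead["pos"]) <= window
            and r2_lookup(v["snp"], lead["snp"]) >= r2
            for lead in index_variants
        )
        if not in_ld:
            index_variants.append(v)
    return index_variants


# Toy data: rs1 and rs2 are in strong LD, rs3 is on another chromosome,
# and rs4 does not pass the p-value threshold.
toy_variants = [
    {"snp": "rs1", "chr": 1, "pos": 1000, "p": 1e-10},
    {"snp": "rs2", "chr": 1, "pos": 2000, "p": 1e-9},
    {"snp": "rs3", "chr": 2, "pos": 1500, "p": 1e-8},
    {"snp": "rs4", "chr": 1, "pos": 3000, "p": 0.5},
]
toy_ld = {frozenset(("rs1", "rs2")): 0.9}


def toy_r2(a, b):
    return toy_ld.get(frozenset((a, b)), 0.0)


kept = [v["snp"] for v in greedy_clump(toy_variants, toy_r2)]
```

Lowering `r2` or widening `kb` makes the independence check stricter, so fewer instruments are retained.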
|
|
283
290
|
|
|
284
291
|
### Polygenic Risk Scoring <a name="paragraph3.4"></a>
|
|
285
292
|
|
|
@@ -302,25 +309,7 @@ The output of the `genal.Geno.prs` method will include how many SNPs were used t
|
|
|
302
309
|
Extracting SNPs for each chromosome...
|
|
303
310
|
SNPs extracted for chr1.
|
|
304
311
|
SNPs extracted for chr2.
|
|
305
|
-
|
|
306
|
-
SNPs extracted for chr4.
|
|
307
|
-
SNPs extracted for chr5.
|
|
308
|
-
SNPs extracted for chr6.
|
|
309
|
-
SNPs extracted for chr7.
|
|
310
|
-
SNPs extracted for chr8.
|
|
311
|
-
SNPs extracted for chr9.
|
|
312
|
-
SNPs extracted for chr10.
|
|
313
|
-
SNPs extracted for chr11.
|
|
314
|
-
SNPs extracted for chr12.
|
|
315
|
-
SNPs extracted for chr13.
|
|
316
|
-
SNPs extracted for chr14.
|
|
317
|
-
SNPs extracted for chr15.
|
|
318
|
-
SNPs extracted for chr16.
|
|
319
|
-
SNPs extracted for chr17.
|
|
320
|
-
SNPs extracted for chr18.
|
|
321
|
-
SNPs extracted for chr19.
|
|
322
|
-
SNPs extracted for chr20.
|
|
323
|
-
SNPs extracted for chr21.
|
|
312
|
+
...
|
|
324
313
|
SNPs extracted for chr22.
|
|
325
314
|
Merging SNPs extracted from each chromosome...
|
|
326
315
|
Created bed/bim/fam fileset with extracted SNPs: tmp_GENAL/4f4ce6a7_allchr
|
|
@@ -379,16 +368,6 @@ To get their association with the outcome trait (instrument-stroke estimates), w
|
|
|
379
368
|
stroke_gwas = pd.read_csv("GCST90104539_buildGRCh37.tsv",sep="\t")
|
|
380
369
|
```
|
|
381
370
|
|
|
382
|
-
We inspect it to determine the column names:
|
|
383
|
-
|
|
384
|
-
| chromosome | base_pair_location | effect_allele_frequency | beta | standard_error | p_value | odds_ratio | ci_lower | ci_upper | effect_allele | other_allele |
|
|
385
|
-
|------------|--------------------|-------------------------|--------|----------------|---------|------------|----------|----------|---------------|--------------|
|
|
386
|
-
| 5 | 29439275 | 0.3569 | 0.0030 | 0.0070 | 0.6658 | 1.003005 | 0.989337 | 1.016861 | T | C |
|
|
387
|
-
| 5 | 85928892 | 0.0639 | -0.0152| 0.0137 | 0.2686 | 0.984915 | 0.958820 | 1.011720 | T | C |
|
|
388
|
-
| 10 | 128341232 | 0.4613 | 0.0025 | 0.0065 | 0.6998 | 1.002503 | 0.989812 | 1.015357 | T | C |
|
|
389
|
-
| 3 | 62707519 | 0.0536 | 0.0152 | 0.0152 | 0.3177 | 1.015316 | 0.985514 | 1.046019 | T | C |
|
|
390
|
-
| 2 | 80464120 | 0.9789 | 0.0057 | 0.0254 | 0.8223 | 1.005716 | 0.956874 | 1.057052 | T | G |
|
|
391
|
-
|
|
392
371
|
We load it in a `genal.Geno` instance:
|
|
393
372
|
|
|
394
373
|
```python
|
|
@@ -420,7 +399,7 @@ Genal will print how many SNPs were successfully found and extracted from the ou
|
|
|
420
399
|
> Here as well you have the option to use proxies for the instruments that are not present in the outcome data:
|
|
421
400
|
>
|
|
422
401
|
> ```python
|
|
423
|
-
> SBP_clumped.query_outcome(Stroke_geno, proxy = True, reference_panel = "
|
|
402
|
+
> SBP_clumped.query_outcome(Stroke_geno, proxy = True, reference_panel = "EUR_37", kb = 5000, r2 = 0.8, window_snps = 5000)
|
|
424
403
|
> ```
|
|
425
404
|
>
|
|
426
405
|
> And genal will print the number of missing instruments that have been proxied:
|
|
{genal_python-1.2.8 → genal_python-1.3.0}/README.md
@@ -24,11 +24,34 @@
|
|
|
24
24
|
|
|
25
25
|
|
|
26
26
|
## Introduction <a name="introduction"></a>
|
|
27
|
-
Genal is a python module designed to make it easy to run genetic risk scores and
|
|
28
|
-
|
|
29
|
-
|
|
30
|
-
|
|
31
|
-
|
|
27
|
+
Genal is a python module designed to make it easy and intuitive to run genetic risk scores and Mendelian Randomization analyses. The functionalities provided by genal include:
|
|
28
|
+
|
|
29
|
+
- Data preprocessing and cleaning of variant data (usually from GWAS summary statistics)
|
|
30
|
+
- Selection of independent genetic instruments through clumping
|
|
31
|
+
- Polygenic risk score calculations (currently C+T only, soon LDPRED2 and PRScs)
|
|
32
|
+
- More than 10 Mendelian Randomization methods, including heterogeneity and pleiotropy tests, with parallel processing support:
|
|
33
|
+
- Inverse Variance Weighted (IVW) methods
|
|
34
|
+
- MR-Egger methods
|
|
35
|
+
- Weighted Median methods
|
|
36
|
+
- Mode methods
|
|
37
|
+
- MR-PRESSO in parallel
|
|
38
|
+
- More to come...
|
|
39
|
+
- SNP-trait association testing
|
|
40
|
+
- Lifting of genetic data to another genomic build
|
|
41
|
+
- Variant-phenotype querying with the GWAS Catalog
|
|
42
|
+
|
|
43
|
+
### Key Features
|
|
44
|
+
|
|
45
|
+
- **Efficient Parallel Processing**: Parallel computation for bootstrapping-based MR methods and MR-PRESSO significantly reduces computation time compared to the original R packages (up to 85% faster for MR-PRESSO)
|
|
46
|
+
- **Flexible Data Handling**: Automatic formatting of variant data and summary statistics
|
|
47
|
+
- **Comprehensive MR Pipeline**: From data preprocessing to sensitivity analyses and plotting in a single package
|
|
48
|
+
- **Reference Panel Support**: Automatically download and use the latest 1000 Genomes reference panels in builds 37 and 38 with the option to use custom reference panels
|
|
49
|
+
- **Customizable**: Ability to choose all the parameters, but defaults are set to the most common values
|
|
50
|
+
- **Proxy SNP Support**: Includes functionality for finding and using proxy SNPs when instruments are missing (for polygenic risk scores, Mendelian Randomization, and association testing)
|
|
51
|
+
|
|
52
|
+
The objective of genal is to bring the functionalities of well-established R packages such as TwoSampleMR, MR-Presso, MendelianRandomization, and gwasvcf, in a more user-friendly Python environment. This approach ensures that users have access to tried and tested techniques with the versatility of Python's data science tools.
|
|
53
|
+
|
|
54
|
+
This package is still under development, feel free to report any issues or suggest improvements!
|
|
32
55
|
|
|
33
56
|
<img src="/Genal_flowchart.png" data-canonical-src="/Genal_flowchart.png" style="max-width:100%;" />
|
|
34
57
|
|
|
@@ -41,6 +64,8 @@ Cyprien A. Rivier, Santiago Clocchiatti-Tuozzo, Shufan Huo, Victor Torres-Lopez,
|
|
|
41
64
|
Bioinformatics Advances 2024.
|
|
42
65
|
doi: https://doi.org/10.1093/bioadv/vbae207
|
|
43
66
|
|
|
67
|
+
If you're using methods derived from R packages, such as MR-PRESSO, please also cite the original papers.
|
|
68
|
+
|
|
44
69
|
## Requirements for the genal module <a name="paragraph1"></a>
|
|
45
70
|
***Python 3.8 or later***. https://www.python.org/ <br>
|
|
46
71
|
|
|
@@ -121,16 +146,7 @@ import pandas as pd
|
|
|
121
146
|
sbp_gwas = pd.read_csv("Evangelou_30224653_SBP.txt", sep=" ")
|
|
122
147
|
sbp_gwas.head(5)
|
|
123
148
|
```
|
|
124
|
-
|
|
125
|
-
| MarkerName | Allele1 | Allele2 | Freq1 | Effect | StdErr | P | TotalSampleSize | N_effective |
|
|
126
|
-
|-------------------|---------|---------|-------|--------|--------|---------|-----------------|-------------|
|
|
127
|
-
| 10:100000625:SNP | a | g | 0.5660| 0.0523 | 0.0303 | 0.083940| 738170 | 736847 |
|
|
128
|
-
| 10:100000645:SNP | a | c | 0.7936| 0.0200 | 0.0372 | 0.591100| 738168 | 735018 |
|
|
129
|
-
| 10:100003242:SNP | t | g | 0.8831| 0.1417 | 0.0469 | 0.002526| 738168 | 733070 |
|
|
130
|
-
| 10:100003304:SNP | a | g | 0.9609| 0.0245 | 0.0838 | 0.769800| 737054 | 663809 |
|
|
131
|
-
| 10:100003785:SNP | t | c | 0.6406| -0.0680| 0.0313 | 0.029870| 738169 | 735681 |
|
|
132
|
-
|
|
133
|
-
We can now load this data into a `genal.Geno` instance. The `genal.Geno` class is the central piece of the package. It is designed to store Single Nucleotide Polymorphisms (SNP) data and make it easy to preprocess and clean.
|
|
149
|
+
We can now load this data into a `genal.Geno` instance. The `genal.Geno` class is the central piece of the package. It is designed to store Single Nucleotide Polymorphisms (SNP) data and make it easy to preprocess and clean it.
|
|
134
150
|
|
|
135
151
|
The `genal.Geno` takes as input a pandas dataframe where each row corresponds to a SNP, with columns describing the position and possibly the effect of the SNP for the given trait (SBP in our case). To indicate the names of the columns, the following arguments can be passed:
|
|
136
152
|
- **CHR**: Column name for chromosome. Defaults to `'CHR'`.
|
|
@@ -151,15 +167,12 @@ After inspecting the dataframe, we first need to extract the chromosome and posi
|
|
|
151
167
|
|
|
152
168
|
```python
|
|
153
169
|
sbp_gwas[["CHR", "POS", "Filler"]] = sbp_gwas["MarkerName"].str.split(":", expand=True)
|
|
154
|
-
sbp_gwas.head(
|
|
170
|
+
sbp_gwas.head(2)
|
|
155
171
|
```
|
|
156
172
|
| MarkerName | Allele1 | Allele2 | Freq1 | Effect | StdErr | P | TotalSampleSize | N_effective | CHR | POS | Filler |
|
|
157
173
|
|-------------------|---------|---------|-------|--------|--------|----------|-----------------|-------------|-----|-----------|--------|
|
|
158
174
|
| 10:100000625:SNP | a | g | 0.5660| 0.0523 | 0.0303 | 0.083940 | 738170 | 736847 | 10 | 100000625 | SNP |
|
|
159
175
|
| 10:100000645:SNP | a | c | 0.7936| 0.0200 | 0.0372 | 0.591100 | 738168 | 735018 | 10 | 100000645 | SNP |
|
|
160
|
-
| 10:100003242:SNP | t | g | 0.8831| 0.1417 | 0.0469 | 0.002526 | 738168 | 733070 | 10 | 100003242 | SNP |
|
|
161
|
-
| 10:100003304:SNP | a | g | 0.9609| 0.0245 | 0.0838 | 0.769800 | 737054 | 663809 | 10 | 100003304 | SNP |
|
|
162
|
-
| 10:100003785:SNP | t | c | 0.6406| -0.0680| 0.0313 | 0.029870 | 738169 | 735681 | 10 | 100003785 | SNP |
|
|
163
176
|
|
|
164
177
|
And it can now be loaded into a `genal.Geno` instance:
|
|
165
178
|
|
|
@@ -196,7 +209,7 @@ By default, and depending on the global preprocessing level (`'None'`, `'Fill'`,
|
|
|
196
209
|
- Validate the `P` (p-value) column for proper values.
|
|
197
210
|
- Check for no duplicated SNPs based on rsid.
|
|
198
211
|
- Determine if the `BETA` (effect) column contains beta estimates or odds ratios, and log-transform odds ratios if necessary.
|
|
199
|
-
- Create `SNP` column using a reference panel if CHR and POS columns are present.
|
|
212
|
+
- Create `SNP` column (containing rsids) using a reference panel if CHR and POS columns are present.
|
|
200
213
|
- Create `CHR` and/or `POS` column using a reference panel if `SNP` column is present.
|
|
201
214
|
- Create `NEA` (non-effect allele) column using a reference panel if `EA` (effect allele) column is present.
|
|
202
215
|
- Create the `SE` (standard-error) column if the `BETA` and `P` (p-value) columns are present.
|
|
@@ -219,17 +232,17 @@ SBP_Geno.data
|
|
|
219
232
|
|---------|----|-----|-------|--------|--------|----------|-----|---------|-----------|
|
|
220
233
|
| 0 | A | G | 0.5660| 0.0523| 0.0303 | 0.083940 | 10 | 100000625 | rs7899632 |
|
|
221
234
|
| 1 | A | C | 0.7936| 0.0200| 0.0372 | 0.591100 | 10 | 100000645 | rs61875309 |
|
|
222
|
-
| 2 | T | G | 0.8831| 0.1417| 0.0469 | 0.002526 | 10 | 100003242 | rs12258651 |
|
|
223
|
-
| 3 | A | G | 0.9609| 0.0245| 0.0838 | 0.769800 | 10 | 100003304 | rs72828461 |
|
|
224
|
-
| 4 | T | C | 0.6406| -0.0680| 0.0313 | 0.029870 | 10 | 100003785 | rs1359508 |
|
|
225
|
-
| ... | .. | .. | ... | ... | ... | ... | ... | ... | ... |
|
|
226
|
-
| 7088120 | A | G | 0.9028| -0.0184| 0.0517 | 0.722300 | 9 | 99999468 | rs10981301 |
|
|
227
235
|
|
|
228
|
-
|
|
229
|
-
|
|
236
|
+
|
|
237
|
+
The `SNP` column with the rsids has been added based on the reference data.
|
|
238
|
+
You do not need to obtain the 1000 Genomes reference panel yourself: genal will download it the first time you use it.
|
|
239
|
+
|
|
240
|
+
> **Note:**
|
|
241
|
+
>
|
|
242
|
+
> By default, the reference panel used is the European (EUR) one in build 37. You can specify another valid reference panel (AFR, EAS, SAS, AMR) with the `reference_panel` argument. For instance, `"AFR_37"` selects the African reference panel in build 37:
|
|
230
243
|
|
|
231
244
|
```python
|
|
232
|
-
SBP_Geno.preprocess_data(preprocessing = 'Fill_delete', reference_panel = "
|
|
245
|
+
SBP_Geno.preprocess_data(preprocessing = 'Fill_delete', reference_panel = "AFR_37")
|
|
233
246
|
```
|
|
234
247
|
|
|
235
248
|
You can also use a custom reference panel by passing to the `reference_panel` argument a path to bed/bim/fam files (PLINK 1.9 format) or pgen/pvar/psam files (PLINK 2.0 format), without the extension.
|
|
@@ -240,22 +253,15 @@ Clumping, or C+T: Clumping + Thresholding, is the step at which we select the SN
|
|
|
240
253
|
|
|
241
254
|
The SNP data loaded in a `genal.Geno` instance can be clumped using the `genal.Geno.clump` method. It returns another `genal.Geno` instance containing only the clumped data:
|
|
242
255
|
|
|
243
|
-
|
|
244
256
|
```python
|
|
245
|
-
SBP_clumped = SBP_Geno.clump(p1 = 5e-8, r2 = 0.
|
|
257
|
+
SBP_clumped = SBP_Geno.clump(p1 = 5e-8, r2 = 0.01, kb = 10000, reference_panel = "EUR_37")
|
|
246
258
|
```
|
|
247
259
|
|
|
248
|
-
It will output the number of instruments obtained:
|
|
249
|
-
|
|
250
|
-
Using the EUR reference panel.
|
|
251
|
-
Warning: 760 top variant IDs missing
|
|
252
|
-
1545 clumps formed from 73594 top variants.
|
|
253
|
-
|
|
254
260
|
You can specify the thresholds you want to use for the clumping with the following arguments:
|
|
255
261
|
- `p1`: P-value threshold during clumping. SNPs with a P-value higher than this value are excluded. Defaults to `5e-8`.
|
|
256
|
-
- `r2`: Linkage disequilibrium threshold for the independence check. Takes values between 0 and 1. Defaults to `0.
|
|
257
|
-
- `kb`: Genomic window used for the independence check (the unit is thousands of base-pair positions). Defaults to `
|
|
258
|
-
- `reference_panel`: The reference population used to derive linkage disequilibrium values and select independent SNPs. Defaults to `
|
|
262
|
+
- `r2`: Linkage disequilibrium threshold for the independence check. Takes values between 0 and 1. Defaults to `0.01`.
|
|
263
|
+
- `kb`: Genomic window used for the independence check (the unit is thousands of base-pair positions). Defaults to `10000`.
|
|
264
|
+
- `reference_panel`: The reference population used to derive linkage disequilibrium values and select independent SNPs. Defaults to `EUR_37`.
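
To make the roles of `p1`, `r2` and `kb` concrete, here is a conceptual sketch of greedy clumping on toy data. It is not genal's implementation: the `Snp` type, the `greedy_clump` function, the `toy_r2` lookup and the SNP names are all hypothetical, and the pairwise LD values are supplied by hand rather than derived from a reference panel.

```python
from typing import NamedTuple


class Snp(NamedTuple):
    name: str
    chrom: int
    pos: int    # base-pair position
    p: float    # association p-value


def greedy_clump(snps, ld_r2, p1=5e-8, r2=0.01, kb=10000):
    """Return index SNPs, most significant first, dropping correlated neighbours."""
    window = kb * 1000  # kb is expressed in thousands of base pairs
    # Keep only SNPs passing the p-value threshold, most significant first
    candidates = sorted((s for s in snps if s.p <= p1), key=lambda s: s.p)
    kept = []
    for snp in candidates:
        # A SNP is dropped if it is close to, and correlated with, a kept SNP
        in_ld = any(
            snp.chrom == lead.chrom
            and abs(snp.pos - lead.pos) <= window
            and ld_r2(snp.name, lead.name) >= r2
            for lead in kept
        )
        if not in_ld:
            kept.append(snp)
    return kept


# Hand-made LD table: only rsA and rsB are correlated
toy_ld = {frozenset(("rsA", "rsB")): 0.9}


def toy_r2(a, b):
    return toy_ld.get(frozenset((a, b)), 0.0)


snps = [
    Snp("rsA", 1, 1_000_000, 1e-12),
    Snp("rsB", 1, 1_050_000, 1e-9),    # near rsA and in LD with it
    Snp("rsC", 1, 90_000_000, 1e-10),  # same chromosome, outside the window
    Snp("rsD", 2, 500_000, 1e-3),      # fails the p1 threshold
]
index_snps = [s.name for s in greedy_clump(snps, toy_r2)]
print(index_snps)
```

In this toy run, `rsB` is absorbed into the clump led by `rsA`, `rsD` is excluded by `p1`, and `rsC` survives because it lies outside the `kb` window.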
|
|
259
265
|
|
|
260
266
|
### Polygenic Risk Scoring <a name="paragraph3.4"></a>
|
|
261
267
|
|
|
@@ -278,25 +284,7 @@ The output of the `genal.Geno.prs` method will include how many SNPs were used t
|
|
|
278
284
|
Extracting SNPs for each chromosome...
|
|
279
285
|
SNPs extracted for chr1.
|
|
280
286
|
SNPs extracted for chr2.
|
|
281
|
-
|
|
282
|
-
SNPs extracted for chr4.
|
|
283
|
-
SNPs extracted for chr5.
|
|
284
|
-
SNPs extracted for chr6.
|
|
285
|
-
SNPs extracted for chr7.
|
|
286
|
-
SNPs extracted for chr8.
|
|
287
|
-
SNPs extracted for chr9.
|
|
288
|
-
SNPs extracted for chr10.
|
|
289
|
-
SNPs extracted for chr11.
|
|
290
|
-
SNPs extracted for chr12.
|
|
291
|
-
SNPs extracted for chr13.
|
|
292
|
-
SNPs extracted for chr14.
|
|
293
|
-
SNPs extracted for chr15.
|
|
294
|
-
SNPs extracted for chr16.
|
|
295
|
-
SNPs extracted for chr17.
|
|
296
|
-
SNPs extracted for chr18.
|
|
297
|
-
SNPs extracted for chr19.
|
|
298
|
-
SNPs extracted for chr20.
|
|
299
|
-
SNPs extracted for chr21.
|
|
287
|
+
...
|
|
300
288
|
SNPs extracted for chr22.
|
|
301
289
|
Merging SNPs extracted from each chromosome...
|
|
302
290
|
Created bed/bim/fam fileset with extracted SNPs: tmp_GENAL/4f4ce6a7_allchr
|
|
@@ -355,16 +343,6 @@ To get their association with the outcome trait (instrument-stroke estimates), w
|
|
|
355
343
|
stroke_gwas = pd.read_csv("GCST90104539_buildGRCh37.tsv",sep="\t")
|
|
356
344
|
```
|
|
357
345
|
|
|
358
|
-
We inspect it to determine the column names:
|
|
359
|
-
|
|
360
|
-
| chromosome | base_pair_location | effect_allele_frequency | beta | standard_error | p_value | odds_ratio | ci_lower | ci_upper | effect_allele | other_allele |
|
|
361
|
-
|------------|--------------------|-------------------------|--------|----------------|---------|------------|----------|----------|---------------|--------------|
|
|
362
|
-
| 5 | 29439275 | 0.3569 | 0.0030 | 0.0070 | 0.6658 | 1.003005 | 0.989337 | 1.016861 | T | C |
|
|
363
|
-
| 5 | 85928892 | 0.0639 | -0.0152| 0.0137 | 0.2686 | 0.984915 | 0.958820 | 1.011720 | T | C |
|
|
364
|
-
| 10 | 128341232 | 0.4613 | 0.0025 | 0.0065 | 0.6998 | 1.002503 | 0.989812 | 1.015357 | T | C |
|
|
365
|
-
| 3 | 62707519 | 0.0536 | 0.0152 | 0.0152 | 0.3177 | 1.015316 | 0.985514 | 1.046019 | T | C |
|
|
366
|
-
| 2 | 80464120 | 0.9789 | 0.0057 | 0.0254 | 0.8223 | 1.005716 | 0.956874 | 1.057052 | T | G |
|
|
367
|
-
|
|
368
346
|
We load it in a `genal.Geno` instance:
|
|
369
347
|
|
|
370
348
|
```python
|
|
@@ -396,7 +374,7 @@ Genal will print how many SNPs were successfully found and extracted from the ou
|
|
|
396
374
|
> Here as well you have the option to use proxies for the instruments that are not present in the outcome data:
|
|
397
375
|
>
|
|
398
376
|
> ```python
|
|
399
|
-
> SBP_clumped.query_outcome(Stroke_geno, proxy = True, reference_panel = "
|
|
377
|
+
> SBP_clumped.query_outcome(Stroke_geno, proxy = True, reference_panel = "EUR_37", kb = 5000, r2 = 0.8, window_snps = 5000)
|
|
400
378
|
> ```
|
|
401
379
|
>
|
|
402
380
|
> And genal will print the number of missing instruments that have been proxied:
|