genal-python 0.3__tar.gz → 0.5__tar.gz
- {genal_python-0.3 → genal_python-0.5}/PKG-INFO +111 -96
- {genal_python-0.3 → genal_python-0.5}/README.md +110 -95
- {genal_python-0.3 → genal_python-0.5}/genal/Geno.py +114 -14
- {genal_python-0.3 → genal_python-0.5}/genal/MR.py +221 -57
- {genal_python-0.3 → genal_python-0.5}/genal/MR_tools.py +42 -62
- {genal_python-0.3 → genal_python-0.5}/genal/MRpresso.py +2 -1
- {genal_python-0.3 → genal_python-0.5}/genal/__init__.py +2 -2
- {genal_python-0.3 → genal_python-0.5}/genal/constants.py +7 -5
- {genal_python-0.3 → genal_python-0.5}/genal/tools.py +2 -2
- {genal_python-0.3 → genal_python-0.5}/pyproject.toml +1 -1
- {genal_python-0.3 → genal_python-0.5}/requirements.txt +5 -5
- {genal_python-0.3 → genal_python-0.5}/.gitignore +0 -0
- {genal_python-0.3 → genal_python-0.5}/LICENSE +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/Images/MR_plot_SBP_AS.png +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/Makefile +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/doctrees/environment.pickle +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/doctrees/index.doctree +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/doctrees/source/genal.doctree +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/doctrees/source/modules.doctree +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/.buildinfo +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_sources/index.rst.txt +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_sources/source/genal.rst.txt +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_sources/source/modules.rst.txt +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/_sphinx_javascript_frameworks_compat.js +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/basic.css +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/css/badge_only.css +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/css/fonts/Roboto-Slab-Bold.woff +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/css/fonts/Roboto-Slab-Bold.woff2 +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/css/fonts/Roboto-Slab-Regular.woff +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/css/fonts/Roboto-Slab-Regular.woff2 +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/css/fonts/fontawesome-webfont.eot +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/css/fonts/fontawesome-webfont.svg +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/css/fonts/fontawesome-webfont.ttf +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/css/fonts/fontawesome-webfont.woff +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/css/fonts/fontawesome-webfont.woff2 +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/css/fonts/lato-bold-italic.woff +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/css/fonts/lato-bold-italic.woff2 +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/css/fonts/lato-bold.woff +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/css/fonts/lato-bold.woff2 +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/css/fonts/lato-normal-italic.woff +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/css/fonts/lato-normal-italic.woff2 +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/css/fonts/lato-normal.woff +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/css/fonts/lato-normal.woff2 +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/css/theme.css +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/doctools.js +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/documentation_options.js +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/file.png +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/jquery.js +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/js/badge_only.js +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/js/html5shiv-printshiv.min.js +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/js/html5shiv.min.js +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/js/theme.js +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/language_data.js +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/minus.png +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/plus.png +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/pygments.css +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/searchtools.js +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/_static/sphinx_highlight.js +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/genindex.html +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/index.html +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/objects.inv +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/py-modindex.html +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/search.html +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/searchindex.js +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/source/genal.html +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/_build/html/source/modules.html +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/make.bat +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/requirements.txt +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/source/api.rst +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/source/conf.py +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/source/genal.rst +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/source/index.rst +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/source/introduction.rst +0 -0
- {genal_python-0.3 → genal_python-0.5}/docs/source/modules.rst +0 -0
- {genal_python-0.3 → genal_python-0.5}/genal/association.py +0 -0
- {genal_python-0.3 → genal_python-0.5}/genal/clump.py +0 -0
- {genal_python-0.3 → genal_python-0.5}/genal/extract_prs.py +0 -0
- {genal_python-0.3 → genal_python-0.5}/genal/geno_tools.py +0 -0
- {genal_python-0.3 → genal_python-0.5}/genal/lift.py +0 -0
- {genal_python-0.3 → genal_python-0.5}/genal/proxy.py +0 -0
- {genal_python-0.3 → genal_python-0.5}/gitignore +0 -0
- {genal_python-0.3 → genal_python-0.5}/readthedocs.yaml +0 -0
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.1
|
|
2
2
|
Name: genal-python
|
|
3
|
-
Version: 0.
|
|
3
|
+
Version: 0.5
|
|
4
4
|
Summary: A python toolkit for polygenic risk scoring and mendelian randomization.
|
|
5
5
|
Author-email: Cyprien Rivier <riviercyprien@gmail.com>
|
|
6
6
|
Requires-Python: >=3.7
|
|
@@ -28,8 +28,8 @@ Project-URL: Home, https://github.com/CypRiv/genal
|
|
|
28
28
|
|
|
29
29
|
# Table of contents
|
|
30
30
|
1. [Introduction](#introduction)
|
|
31
|
-
2. [Requirements for the
|
|
32
|
-
3. [Installation and how to use
|
|
31
|
+
2. [Requirements for the genal module](#paragraph1)
|
|
32
|
+
3. [Installation and how to use genal](#paragraph2)
|
|
33
33
|
1. [Installation](#paragraph2.1)
|
|
34
34
|
4. [Tutorial and presentation of the main tools](#paragraph3)
|
|
35
35
|
1. [Data loading](#paragraph3.1)
|
|
@@ -50,11 +50,11 @@ The module prioritizes user-friendliness and intuitive operation, aiming to redu
|
|
|
50
50
|
|
|
51
51
|
Genal draws on concepts from well-established R packages such as TwoSampleMR, MR-Presso, MendelianRandomization, and gwasvcf, adapting their proven methodologies to the Python environment. This approach ensures that users have access to tried and tested techniques with the versatility of Python's data science tools.
|
|
52
52
|
|
|
53
|
-
## Requirements for the
|
|
53
|
+
## Requirements for the genal module <a name="paragraph1"></a>
|
|
54
54
|
***Python 3.9 or later***. https://www.python.org/ <br>
|
|
55
55
|
|
|
56
56
|
|
|
57
|
-
## Installation and How to use the
|
|
57
|
+
## Installation and How to use the genal module <a name="paragraph2"></a>
|
|
58
58
|
|
|
59
59
|
### Installation <a name="paragraph2.1"></a>
|
|
60
60
|
|
|
@@ -75,7 +75,7 @@ genal.set_plink(path="/path/to/plink/executable/file")
|
|
|
75
75
|
```
|
|
76
76
|
|
|
77
77
|
## Tutorial <a name="paragraph3"></a>
|
|
78
|
-
For this tutorial, we will
|
|
78
|
+
For this tutorial, we will obtain genetic instruments for systolic blood pressure (SBP), compute a Polygenic Risk Score (PRS), and run a Mendelian Randomization analysis to investigate the genetically-determined effect of SBP on the risk of stroke. We will utilize summary statistics from Genome-Wide Association Studies (GWAS) and individual-level data from the UK Biobank. The steps include:
|
|
79
79
|
|
|
80
80
|
- Data loading
|
|
81
81
|
- Data preprocessing
|
|
@@ -84,7 +84,7 @@ For this tutorial, we will build a Polygenic Risk Score (PRS) for systolic blood
|
|
|
84
84
|
- Build a genomic risk score for SBP in a test population
|
|
85
85
|
- Include risk score calculations with proxies
|
|
86
86
|
- Perform Mendelian Randomization
|
|
87
|
-
- Analyze SBP as an exposure and acute
|
|
87
|
+
- Analyze SBP as an exposure and acute stroke as an outcome
|
|
88
88
|
- Plot the results
|
|
89
89
|
- Conduct sensitivity analyses using the weighted median, MR-Egger, and MR-PRESSO methods
|
|
90
90
|
- Calibrate SNP-trait weights with individual-level genetic data
|
|
@@ -113,20 +113,20 @@ sbp_gwas.head(5)
|
|
|
113
113
|
| 10:100003304:SNP | a | g | 0.9609| 0.0245 | 0.0838 | 0.769800| 737054 | 663809 |
|
|
114
114
|
| 10:100003785:SNP | t | c | 0.6406| -0.0680| 0.0313 | 0.029870| 738169 | 735681 |
|
|
115
115
|
|
|
116
|
-
We can now load this data into a `genal.Geno`
|
|
116
|
+
We can now load this data into a `genal.Geno` instance. The `genal.Geno` class is the central piece of the package. It is designed to store Single Nucleotide Polymorphism (SNP) data and make it easy to preprocess and clean.
|
|
117
117
|
|
|
118
118
|
The `genal.Geno` class takes as input a pandas dataframe where each row corresponds to a SNP, with columns describing the position and possibly the effect of the SNP for the given trait (SBP in our case). To indicate the names of the columns, the following arguments can be passed:
|
|
119
|
-
- **CHR**: Column name for chromosome. Defaults to
|
|
120
|
-
- **POS**: Column name for genomic position. Defaults to
|
|
121
|
-
- **SNP**: Column name for SNP identifier (rsid). Defaults to
|
|
122
|
-
- **EA**: Column name for effect allele. Defaults to
|
|
123
|
-
- **NEA**: Column name for non-effect allele. Defaults to
|
|
124
|
-
- **BETA**: Column name for effect estimate. Defaults to
|
|
125
|
-
- **SE**: Column name for effect standard error. Defaults to
|
|
126
|
-
- **P**: Column name for effect p-value. Defaults to
|
|
127
|
-
- **EAF**: Column name for effect allele frequency. Defaults to
|
|
128
|
-
|
|
129
|
-
After inspecting the dataframe, we first need to extract the chromosome and position information from the
|
|
119
|
+
- **CHR**: Column name for chromosome. Defaults to `'CHR'`.
|
|
120
|
+
- **POS**: Column name for genomic position. Defaults to `'POS'`.
|
|
121
|
+
- **SNP**: Column name for SNP identifier (rsid). Defaults to `'SNP'`.
|
|
122
|
+
- **EA**: Column name for effect allele. Defaults to `'EA'`.
|
|
123
|
+
- **NEA**: Column name for non-effect allele. Defaults to `'NEA'`.
|
|
124
|
+
- **BETA**: Column name for effect estimate. Defaults to `'BETA'`.
|
|
125
|
+
- **SE**: Column name for effect standard error. Defaults to `'SE'`.
|
|
126
|
+
- **P**: Column name for effect p-value. Defaults to `'P'`.
|
|
127
|
+
- **EAF**: Column name for effect allele frequency. Defaults to `'EAF'`.
|
|
128
|
+
|
|
129
|
+
After inspecting the dataframe, we first need to extract the chromosome and position information from the `MarkerName` column into two new columns `CHR` and `POS`:
|
|
130
130
|
|
|
131
131
|
```python
|
|
132
132
|
sbp_gwas[["CHR", "POS", "Filler"]] = sbp_gwas["MarkerName"].str.split(":", expand=True)
|
|
@@ -140,14 +140,14 @@ sbp_gwas.head(5)
|
|
|
140
140
|
| 10:100003304:SNP | a | g | 0.9609| 0.0245 | 0.0838 | 0.769800 | 737054 | 663809 | 10 | 100003304 | SNP |
|
|
141
141
|
| 10:100003785:SNP | t | c | 0.6406| -0.0680| 0.0313 | 0.029870 | 738169 | 735681 | 10 | 100003785 | SNP |
|
|
142
142
|
|
|
143
|
-
And it can now be loaded into a `genal.Geno`
|
|
143
|
+
And it can now be loaded into a `genal.Geno` instance:
|
|
144
144
|
|
|
145
145
|
```python
|
|
146
146
|
import genal
|
|
147
147
|
SBP_Geno = genal.Geno(sbp_gwas, CHR="CHR", POS="POS", EA="Allele1", NEA="Allele2", BETA="Effect", SE="StdErr", P="P", EAF="Freq1", keep_columns=False)
|
|
148
148
|
```
|
|
149
149
|
|
|
150
|
-
The last argument (`keep_columns = False`) indicates that we do not wish to keep the other (non-main) columns in the dataframe.
|
|
150
|
+
The last argument (`keep_columns = False`) indicates that we do not wish to keep the other (non-main) columns in the dataframe. Defaults to `True`.
|
|
151
151
|
|
|
152
152
|
> **Note:**
|
|
153
153
|
>
|
|
@@ -155,7 +155,7 @@ The last argument (`keep_columns = False`) indicates that we do not wish to keep
|
|
|
155
155
|
|
|
156
156
|
### Data preprocessing <a name="paragraph3.2"></a>
|
|
157
157
|
|
|
158
|
-
Now that we have loaded the data into a `genal.Geno`
|
|
158
|
+
Now that we have loaded the data into a `genal.Geno` instance, we can begin cleaning and formatting it. Methods such as Polygenic Risk Scoring or Mendelian Randomization require the SNP data to be in a specific format. Also, raw summary statistics can sometimes contain missing or invalid values that need to be handled. Additionally, some columns may be missing from the data (such as the SNP rsid column, or the non-effect allele column) and these columns can be created based on existing ones and a reference panel.
|
|
159
159
|
|
|
160
160
|
Genal can run all the basic cleaning and preprocessing steps in one command:
|
|
161
161
|
|
|
@@ -167,28 +167,29 @@ The `preprocessing` argument specifies the global level of preprocessing applied
|
|
|
167
167
|
- `preprocessing = 'None'`: The data won't be modified.
|
|
168
168
|
- `preprocessing = 'Fill'`: Missing columns will be added based on reference data and invalid values set to NaN, but no rows will be deleted.
|
|
169
169
|
- `preprocessing = 'Fill_delete'`: Missing columns will be added, and all rows containing missing, duplicated, or invalid values will be deleted. This option is recommended before running genetic methods.
|
|
170
|
+
Defaults to `'Fill'`.
|
|
170
171
|
|
|
171
|
-
By default, and depending on the global preprocessing level ('None'
|
|
172
|
-
- Ensure the CHR (chromosome) and POS (genomic position) columns are integers.
|
|
173
|
-
- Ensure the EA (effect allele) and NEA (non-effect allele) columns are uppercase characters containing A, T, C, G letters. Multiallelic values are set to NaN.
|
|
174
|
-
- Validate the P (p-value) column for proper values.
|
|
172
|
+
By default, and depending on the global preprocessing level (`'None'`, `'Fill'`, `'Fill_delete'`) chosen, the `preprocess_data` method of `genal.Geno` will run the following checks:
|
|
173
|
+
- Ensure the `CHR` (chromosome) and `POS` (genomic position) columns are integers.
|
|
174
|
+
- Ensure the `EA` (effect allele) and `NEA` (non-effect allele) columns are uppercase characters containing A, T, C, G letters. Multiallelic values are set to NaN.
|
|
175
|
+
- Validate the `P` (p-value) column for proper values.
|
|
175
176
|
- Check for no duplicated SNPs based on rsid.
|
|
176
|
-
- Determine if the BETA (effect) column contains beta estimates or odds ratios, and log-transform odds ratios if necessary.
|
|
177
|
-
- Create SNP column using a reference panel if CHR and POS columns are present.
|
|
178
|
-
- Create CHR and/or POS column using a reference panel if SNP column is present.
|
|
179
|
-
- Create NEA (non-effect allele) column using a reference panel if EA (effect allele) column is present.
|
|
180
|
-
- Create the SE (standard-error) column if the BETA and P (p-value) columns are present.
|
|
181
|
-
- Create the P column if the BETA and SE columns are present.
|
|
177
|
+
- Determine if the `BETA` (effect) column contains beta estimates or odds ratios, and log-transform odds ratios if necessary.
|
|
178
|
+
- Create `SNP` column using a reference panel if `CHR` and `POS` columns are present.
|
|
179
|
+
- Create `CHR` and/or `POS` column using a reference panel if `SNP` column is present.
|
|
180
|
+
- Create `NEA` (non-effect allele) column using a reference panel if `EA` (effect allele) column is present.
|
|
181
|
+
- Create the `SE` (standard-error) column if the `BETA` and `P` (p-value) columns are present.
|
|
182
|
+
- Create the `P` column if the `BETA` and `SE` columns are present.
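
The last two checks, and the odds-ratio conversion, rest on standard relationships between an effect estimate, its standard error, and its p-value. The sketch below illustrates them under the assumption of a two-sided Wald test on a normal statistic. The function names are made up for this illustration, and nothing here is taken from genal's own code, which may compute these columns differently.

```python
from math import log
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def p_from_beta_se(beta: float, se: float) -> float:
    """Two-sided p-value of the Wald statistic z = beta / se."""
    z = abs(beta / se)
    return 2.0 * (1.0 - _STD_NORMAL.cdf(z))


def se_from_beta_p(beta: float, p: float) -> float:
    """Standard error implied by an effect estimate and its two-sided p-value.

    Requires 0 < p < 1 (p = 1 would imply z = 0 and a division by zero).
    """
    z = -_STD_NORMAL.inv_cdf(p / 2.0)  # |z| that yields this p-value
    return abs(beta) / z


def beta_from_odds_ratio(odds_ratio: float) -> float:
    """Move an odds ratio to the log-odds (beta) scale."""
    return log(odds_ratio)
```

For the second SBP row shown earlier (`Effect = -0.0680`, `StdErr = 0.0313`), `p_from_beta_se(-0.0680, 0.0313)` gives roughly 0.03, in line with the reported `P` of 0.029870.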
|
|
182
183
|
|
|
183
184
|
If you do not wish to run certain steps, or wish to run only certain steps, you can use additional arguments. For more information, please refer to the `genal.Geno.preprocess_data` method in the API documentation.
|
|
184
185
|
|
|
185
|
-
In our case, the SNP column (for SNP identifier - rsid) was missing from our dataframe and has been added based on a 1000 genome reference panel:
|
|
186
|
+
In our case, the `SNP` column (for SNP identifier - rsid) was missing from our dataframe and has been added based on a 1000 Genomes reference panel:
|
|
186
187
|
|
|
187
188
|
Using the EUR reference panel.
|
|
188
189
|
The SNP column (rsID) has been created. 197511(2.787%) SNPs were not found in the reference data and their ID set to CHR:POS:EA.
|
|
189
190
|
The BETA column looks like Beta estimates. Use effect_column='OR' if it is a column of Odds Ratios.
|
|
190
191
|
|
|
191
|
-
You can always check the data of a `genal.Geno`
|
|
192
|
+
You can always check the data of a `genal.Geno` instance by accessing the `data` attribute:
|
|
192
193
|
|
|
193
194
|
```python
|
|
194
195
|
SBP_Geno.data
|
|
@@ -203,8 +204,8 @@ SBP_Geno.data
|
|
|
203
204
|
| ... | .. | .. | ... | ... | ... | ... | ... | ... | ... |
|
|
204
205
|
| 7088120 | A | G | 0.9028| -0.0184| 0.0517 | 0.722300 | 9 | 99999468 | rs10981301 |
|
|
205
206
|
|
|
206
|
-
And we see that the SNP column with the rsids has been added based on the reference data.
|
|
207
|
-
You do not need to obtain the 1000 genome reference panel yourself,
|
|
207
|
+
And we see that the `SNP` column with the rsids has been added based on the reference data.
|
|
208
|
+
You do not need to obtain the 1000 Genomes reference panel yourself: genal will download it the first time you use it. By default, the reference panel used is the European (`eur`) one. You can specify another valid reference panel (`afr`, `eas`, `sas`, `amr`) with the `reference_panel` argument:
|
|
208
209
|
|
|
209
210
|
```python
|
|
210
211
|
SBP_Geno.preprocess_data(preprocessing = 'Fill_delete', reference_panel = "afr")
|
|
@@ -214,9 +215,9 @@ You can also use a custom reference panel by specifying to the reference_panel a
|
|
|
214
215
|
|
|
215
216
|
### Clumping <a name="paragraph3.3"></a>
|
|
216
217
|
|
|
217
|
-
Clumping is the step at which we select the SNPs that will be used as our genetic instruments in future Polygenic Risk Scores and Mendelian Randomization analyses. The process involves identifying the SNPs that are strongly associated with our trait of interest (systolic blood pressure in this tutorial) and are independent from each other. This second step ensures that selected SNPs are not highly correlated, (i.e., they are not in
|
|
218
|
+
Clumping is the step at which we select the SNPs that will be used as our genetic instruments in future Polygenic Risk Scores and Mendelian Randomization analyses. The process involves identifying the SNPs that are strongly associated with our trait of interest (systolic blood pressure in this tutorial) and are independent from each other. This second step ensures that selected SNPs are not highly correlated (i.e., they are not in high linkage disequilibrium). For this step, we again need to use a reference panel.
|
|
218
219
|
|
|
219
|
-
The SNP-data loaded in a `genal.Geno`
|
|
220
|
+
The SNP-data loaded in a `genal.Geno` instance can be clumped using the `genal.Geno.clump` method. It will return another `genal.Geno` instance containing only the clumped data:
|
|
220
221
|
|
|
221
222
|
|
|
222
223
|
```python
|
|
@@ -230,10 +231,10 @@ It will output the number of instruments obtained::
|
|
|
230
231
|
1545 clumps formed from 73594 top variants.
|
|
231
232
|
|
|
232
233
|
You can specify the thresholds you want to use for the clumping with the following arguments:
|
|
233
|
-
-
|
|
234
|
-
-
|
|
235
|
-
-
|
|
236
|
-
-
|
|
234
|
+
- `p1`: P-value threshold during clumping. SNPs with a P-value higher than this value are excluded. Defaults to `5e-8`.
|
|
235
|
+
- `r2`: Linkage disequilibrium threshold for the independence check. Takes values between 0 and 1. Defaults to `0.1`.
|
|
236
|
+
- `kb`: Genomic window used for the independence check (the unit is thousands of base-pair positions). Defaults to `250`.
|
|
237
|
+
- `reference_panel`: The reference population used to derive linkage disequilibrium values and select independent SNPs. Defaults to `eur`.
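
To make the roles of `p1`, `r2`, and `kb` concrete, here is a minimal sketch of a greedy clumping pass. It is a simplified illustration of the general technique, not genal's implementation (which relies on plink and a reference panel). The `Variant` type, the `greedy_clump` function, and the toy LD lookup are all hypothetical names introduced for this example.

```python
from typing import Callable, List, NamedTuple


class Variant(NamedTuple):
    snp: str     # SNP identifier
    chrom: int   # chromosome
    pos: int     # base-pair position
    p: float     # association p-value


def greedy_clump(
    variants: List[Variant],
    ld_r2: Callable[[str, str], float],
    p1: float = 5e-8,
    r2: float = 0.1,
    kb: int = 250,
) -> List[str]:
    """Return the index SNPs kept by a simple greedy clumping pass."""
    # Only variants passing the significance threshold can become index SNPs.
    candidates = sorted((v for v in variants if v.p <= p1), key=lambda v: v.p)
    kept: List[Variant] = []
    for v in candidates:
        # Discard v if it lies within the window of, and is correlated with,
        # a more significant SNP that has already been kept.
        in_existing_clump = any(
            v.chrom == k.chrom
            and abs(v.pos - k.pos) <= kb * 1000
            and ld_r2(v.snp, k.snp) >= r2
            for k in kept
        )
        if not in_existing_clump:
            kept.append(v)
    return [v.snp for v in kept]


# Toy data: identifiers, positions, and LD values are invented for illustration.
toy_variants = [
    Variant("rsA", 1, 1_000, 1e-12),
    Variant("rsB", 1, 50_000, 1e-10),   # close to rsA and in LD with it
    Variant("rsC", 1, 900_000, 1e-9),   # outside the 250 kb window of rsA
    Variant("rsD", 2, 1_000, 1e-3),     # does not pass p1
]
toy_ld = {frozenset(("rsA", "rsB")): 0.9}


def toy_r2(a: str, b: str) -> float:
    """Look up pairwise LD; unlisted pairs are treated as uncorrelated."""
    return toy_ld.get(frozenset((a, b)), 0.0)


index_snps = greedy_clump(toy_variants, toy_r2)  # ['rsA', 'rsC']
```

In a real analysis the pairwise LD values come from the reference panel selected with `reference_panel`, which is why the choice of population matters.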
|
|
237
238
|
|
|
238
239
|
### Polygenic Risk Scoring <a name="paragraph3.4"></a>
|
|
239
240
|
|
|
@@ -243,7 +244,7 @@ Computing a Polygenic Risk Score (PRS) can be done in one line with the `genal.G
|
|
|
243
244
|
SBP_clumped.prs(name = "SBP_prs", path = "path/to/genetic/files")
|
|
244
245
|
```
|
|
245
246
|
|
|
246
|
-
The genetic files of the target population can be either one triple of bed/bim/fam files
|
|
247
|
+
The genetic files of the target population can be either contained in one triple of bed/bim/fam files with information for all SNPs, or divided by chromosome (one bed/bim/fam triple for chr 1, another for chr 2, etc.). In the latter case, provide the path by replacing the chromosome number with `$`, and genal will extract the necessary SNPs from each chromosome and merge them before running the PRS. For instance, if the genetic files are named `Pop_chr1.bed`, `Pop_chr1.bim`, `Pop_chr1.fam`, `Pop_chr2.bed`, ..., you can use:
|
|
247
248
|
|
|
248
249
|
```python
|
|
249
250
|
SBP_clumped.prs(name = "SBP_prs", path = "Pop_chr$")
|
|
@@ -283,7 +284,7 @@ The output of the `genal.Geno.prs` method will include how many SNPs were used t
|
|
|
283
284
|
The PRS computation was successful and used 759/1545 (49.126%) SNPs.
|
|
284
285
|
PRS data saved to SBP_prs.csv
|
|
285
286
|
|
|
286
|
-
Here, we see that about half of the SNPs were not extracted from the data. In such cases, we may want to try and salvage some of these SNPs by looking for proxies (SNPs in high linkage disequilibrium, i.e. highly correlated SNPs). This can be done by specifying the
|
|
287
|
+
Here, we see that about half of the SNPs were not extracted from the data. In such cases, we may want to try and salvage some of these SNPs by looking for proxies (SNPs in high linkage disequilibrium, i.e. highly correlated SNPs). This can be done by specifying the `proxy = True` argument:
|
|
287
288
|
|
|
288
289
|
```python
|
|
289
290
|
SBP_clumped.prs(name = "SBP_prs" ,path = "Pop_chr$", proxy = True, reference_panel = "eur", r2=0.8, kb=5000, window_snps=5000)
|
|
@@ -332,19 +333,19 @@ and the output is:
|
|
|
332
333
|
In our case, we have been able to find proxies for 571 of the 786 SNPs that were missing in the population genetic data (7 potential proxies have been removed because they were identical to SNPs already present in our data).
|
|
333
334
|
|
|
334
335
|
You can customize how the proxies are chosen with the following arguments:
|
|
335
|
-
-
|
|
336
|
-
-
|
|
337
|
-
-
|
|
338
|
-
-
|
|
336
|
+
- `reference_panel`: The reference population used to derive linkage disequilibrium values and find proxies. Defaults to `eur`.
|
|
337
|
+
- `kb`: Width of the genomic window to look for proxies (in thousands of base-pair positions). Defaults to `5000`.
|
|
338
|
+
- `r2`: Minimum linkage disequilibrium value with the original SNP for a proxy to be included. Defaults to `0.8`.
|
|
339
|
+
- `window_snps`: Width of the window to look for proxies (in number of SNPs). Defaults to `5000`.
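
As an illustration of how `r2` and `kb` constrain the search, the sketch below picks a proxy for one missing SNP from a list of candidates. It assumes the rule "keep the eligible candidate with the highest LD", which is a common choice but is not confirmed by the text above as genal's exact rule. The `pick_proxy` function and the candidate tuples are hypothetical, and the `window_snps` limit is left out for brevity.

```python
from typing import Iterable, Optional, Set, Tuple

# (rsid, base-pair position, r2 with the missing SNP)
Candidate = Tuple[str, int, float]


def pick_proxy(
    missing_pos: int,
    candidates: Iterable[Candidate],
    available: Set[str],
    r2: float = 0.8,
    kb: int = 5000,
) -> Optional[str]:
    """Return the best proxy among the candidates, or None if none qualifies."""
    eligible = [
        (ld, snp)
        for snp, pos, ld in candidates
        if snp in available                        # present in the target data
        and ld >= r2                               # correlated enough
        and abs(pos - missing_pos) <= kb * 1000    # inside the genomic window
    ]
    if not eligible:
        return None
    # Highest LD wins; ties are broken by identifier for determinism.
    _, best_snp = max(eligible)
    return best_snp


# Toy candidates: identifiers and LD values are invented for illustration.
toy_candidates = [
    ("rs1", 100_500, 0.95),
    ("rs2", 101_000, 0.99),  # best LD, but absent from the target data
    ("rs3", 102_000, 0.85),
    ("rs4", 99_000, 0.50),   # below the r2 threshold
]
proxy = pick_proxy(100_000, toy_candidates, available={"rs1", "rs3", "rs4"})  # 'rs1'
```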
|
|
339
340
|
|
|
340
341
|
> **Note:**
|
|
341
342
|
>
|
|
342
|
-
> You can call the `genal.Geno.prs` method on any `Geno`
|
|
343
|
+
> You can call the `genal.Geno.prs` method on any `Geno` instance (containing at least the EA, BETA, and either SNP or CHR/POS columns). The data does not need to be clumped, and there is no limit to the number of instruments used to compute the scores.
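
For readers unfamiliar with how such a score is built, a PRS is at its core a weighted sum of effect-allele counts. The sketch below shows that idea for a single individual. It is a conceptual illustration with a hypothetical `polygenic_score` function, not the computation genal delegates to plink, and whether the final score is summed or averaged over SNPs is not stated in this tutorial.

```python
from typing import Mapping, Tuple


def polygenic_score(
    weights: Mapping[str, float],
    dosages: Mapping[str, float],
) -> Tuple[float, int]:
    """Weighted sum of effect-allele dosages for one individual.

    weights: SNP identifier -> BETA for the effect allele (EA).
    dosages: SNP identifier -> number of EA copies carried (0, 1 or 2).
    Returns the score and the number of SNPs that contributed to it.
    """
    # SNPs missing from the genotype data simply do not contribute.
    shared = [snp for snp in weights if snp in dosages]
    score = sum(weights[snp] * dosages[snp] for snp in shared)
    return score, len(shared)


# Toy inputs: identifiers and values are invented for illustration.
toy_weights = {"rs1": 0.2, "rs2": -0.1, "rs3": 0.05}
toy_dosages = {"rs1": 2, "rs2": 1}  # rs3 is absent from the genotype data
score, n_used = polygenic_score(toy_weights, toy_dosages)
```

The second return value mirrors the "used 759/1545 SNPs" message above: instruments absent from the genetic files are skipped unless a proxy is found for them.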
|
|
343
344
|
|
|
344
345
|
|
|
345
346
|
### Mendelian Randomization <a name="paragraph3.5"></a>
|
|
346
347
|
|
|
347
|
-
To run MR, we need to load both our exposure and outcome SNP-level data in `genal.Geno`
|
|
348
|
+
To run MR, we need to load both our exposure and outcome SNP-level data in `genal.Geno` instances. In our case, the genetic instruments of the MR are the SNPs associated with blood pressure at genome-wide significant levels resulting from the clumping of the blood pressure GWAS. They are stored in our `SBP_clumped` `genal.Geno` instance, which also includes their association with the exposure trait (instrument-SBP estimates in the `BETA` column).
|
|
348
349
|
|
|
349
350
|
To get their association with the outcome trait (instrument-stroke estimates), we are going to use SNP-level data from a large GWAS of stroke performed by the GIGASTROKE consortium ([https://www.nature.com/articles/s41586-022-05165-3](https://www.nature.com/articles/s41586-022-05165-3)):
|
|
350
351
|
|
|
@@ -362,7 +363,7 @@ We inspect it to determine the column names:
|
|
|
362
363
|
| 3 | 62707519 | 0.0536 | 0.0152 | 0.0152 | 0.3177 | 1.015316 | 0.985514 | 1.046019 | T | C |
|
|
363
364
|
| 2 | 80464120 | 0.9789 | 0.0057 | 0.0254 | 0.8223 | 1.005716 | 0.956874 | 1.057052 | T | G |
|
|
364
365
|
|
|
365
|
-
We load it in a `genal.Geno`
|
|
366
|
+
We load it in a `genal.Geno` instance:
|
|
366
367
|
|
|
367
368
|
```python
|
|
368
369
|
Stroke_Geno = genal.Geno(stroke_gwas, CHR = "chromosome", POS = "base_pair_location", EA = "effect_allele", NEA = "other_allele", BETA = "beta", SE = "standard_error", P = "p_value", EAF = "effect_allele_frequency", keep_columns = False)
|
|
@@ -374,7 +375,7 @@ We preprocess it as well to put it in the correct format and make sure there is
|
|
|
374
375
|
Stroke_Geno.preprocess_data(preprocessing = 'Fill_delete')
|
|
375
376
|
```
|
|
376
377
|
|
|
377
|
-
Now, we need to extract our instruments (SNPs of the SBP_clumped data) from the outcome data to obtain their association with the outcome trait (stroke). It can be done by calling the `genal.Geno.query_outcome` method:
|
|
378
|
+
Now, we need to extract our instruments (SNPs of the `SBP_clumped` data) from the outcome data to obtain their association with the outcome trait (stroke). It can be done by calling the `genal.Geno.query_outcome` method:
|
|
378
379
|
|
|
379
380
|
```python
|
|
380
381
|
SBP_clumped.query_outcome(Stroke_Geno, proxy = False)
|
|
@@ -382,7 +383,7 @@ SBP_clumped.query_outcome(Stroke_Geno, proxy = False)
|
|
|
382
383
|
|
|
383
384
|
Genal will print how many SNPs were successfully found and extracted from the outcome data:
|
|
384
385
|
|
|
385
|
-
Outcome data successfully loaded from 'b352e412' geno
|
|
386
|
+
Outcome data successfully loaded from 'b352e412' geno instance.
|
|
386
387
|
Identifying the exposure SNPs present in the outcome data...
|
|
387
388
|
1541 SNPs out of 1545 are present in the outcome data.
|
|
388
389
|
(Exposure data, Outcome data, Outcome name) stored in the .MR_data attribute.
|
|
@@ -393,9 +394,9 @@ Here as well you have the option to use proxies for the instruments that are not
|
|
|
393
394
|
SBP_clumped.query_outcome(Stroke_Geno, proxy = True, reference_panel = "eur", kb = 5000, r2 = 0.6, window_snps = 5000)
|
|
394
395
|
```
|
|
395
396
|
|
|
396
|
-
And
|
|
397
|
+
And genal will print the number of missing instruments which have been proxied:
|
|
397
398
|
|
|
398
|
-
Outcome data successfully loaded from 'b352e412' geno
|
|
399
|
+
Outcome data successfully loaded from 'b352e412' geno instance.
|
|
399
400
|
Identifying the exposure SNPs present in the outcome data...
|
|
400
401
|
1541 SNPs out of 1545 are present in the outcome data.
|
|
401
402
|
Searching proxies for 4 SNPs...
|
|
@@ -403,65 +404,78 @@ And Genal will print the number of missing instruments which have been proxied:
|
|
|
403
404
|
Found proxies for 4 SNPs.
|
|
404
405
|
(Exposure data, Outcome data, Outcome name) stored in the .MR_data attribute.
|
|
405
406
|
|
|
406
|
-
After extracting the instruments from the outcome data, the SBP_clumped `genal.Geno`
|
|
407
|
+
After extracting the instruments from the outcome data, the `SBP_clumped` `genal.Geno` instance has an `MR_data` attribute containing the instruments-exposure and instruments-outcome associations necessary to run MR. Running MR is now as simple as calling the `genal.Geno.MR` method of the `SBP_clumped` `genal.Geno` instance:
|
|
407
408
|
|
|
408
409
|
```python
|
|
409
|
-
SBP_clumped.MR(action =
|
|
410
|
+
SBP_clumped.MR(action = 2, exposure_name = "SBP", outcome_name = "Stroke_eur")
|
|
410
411
|
```
|
|
411
412
|
|
|
412
|
-
The `genal.Geno.MR` method
|
|
413
|
-
|
|
414
|
-
|
|
415
|
-
|
|
416
|
-
|
|
417
|
-
|
|
|
418
|
-
|
|
419
|
-
| SBP | Stroke_eur |
|
|
420
|
-
| SBP | Stroke_eur |
|
|
421
|
-
| SBP | Stroke_eur |
|
|
422
|
-
| SBP | Stroke_eur |
|
|
423
|
-
| SBP | Stroke_eur | Egger
|
|
424
|
-
| SBP | Stroke_eur |
|
|
425
|
-
|
|
426
|
-
|
|
427
|
-
|
|
428
|
-
|
|
429
|
-
-
|
|
430
|
-
-
|
|
431
|
-
|
|
432
|
-
|
|
433
|
-
|
|
434
|
-
|
|
435
|
-
|
|
413
|
+
The `genal.Geno.MR` method prints a message specifying which SNPs have been excluded from the analysis (it depends on the action argument, as we will see):
|
|
414
|
+
|
|
415
|
+
Action = 2: 42 SNPs excluded for being palindromic with intermediate allele frequencies: rs11817866, rs3802517, rs2788293, rs2274224, rs7310615, rs7953257, rs2024385, rs61912333, rs11632436, rs1012089, rs3851018, rs9899540, rs4617956, rs773432, rs11585169, rs7796, rs2487904, rs12321, rs73029563, rs4673238, rs3845811, rs2160236, rs10165271, rs9848170, rs2724535, rs6842486, rs4834792, rs990619, rs155364, rs480882, rs6875372, rs258951, rs1870735, rs1800795, rs12700814, rs1821002, rs3021500, rs28601761, rs7463212, rs907183, rs534523, rs520015
|
|
416
|
+
|
|
417
|
+
It returns a dataframe containing the results for different MR methods:
|
|
418
|
+
| exposure | outcome | method | nSNP | b | se | pval |
|
|
419
|
+
|----------|------------|-------------------------------------------|------|----------|----------|---------------|
|
|
420
|
+
| SBP | Stroke_eur | Inverse-Variance Weighted | 1499 | 0.023049 | 0.001061 | 1.382645e-104 |
|
|
421
|
+
| SBP | Stroke_eur | Inverse Variance Weighted (Fixed Effects) | 1499 | 0.023049 | 0.000754 | 4.390655e-205 |
|
|
422
|
+
| SBP | Stroke_eur | Weighted Median | 1499 | 0.022365 | 0.001337 | 8.863203e-63 |
|
|
423
|
+
| SBP | Stroke_eur | Simple mode | 1499 | 0.027125 | 0.007698 | 4.382993e-04 |
|
|
424
|
+
| SBP | Stroke_eur | MR Egger | 1499 | 0.027543 | 0.002849 | 1.723156e-21 |
|
|
425
|
+
| SBP | Stroke_eur | Egger Intercept | 1499 | -0.001381| 0.000813 | 8.935529e-02 |
|
|
426
|
+
|
|
427
|
+
You can specify several arguments. We refer to the API for a full list, but the most important one is the `action` argument. It determines how palindromic SNPs are treated during the exposure-outcome harmonization step. Palindromic SNPs are SNPs whose two alleles are complementary bases, so the allele pair reads the same on both strands of DNA and the strand cannot be inferred from the alleles alone (for instance `EA = 'A'` and `NEA = 'T'`).
|
|
428
|
+
|
|
429
|
+
- `action = 1`: Palindromic SNPs are not treated (assumes all alleles are on the forward strand)
|
|
430
|
+
- `action = 2`: Uses effect allele frequencies to attempt to flip them (conservative, default)
|
|
431
|
+
- `action = 3`: Removes all palindromic SNPs (very conservative)
|
|
432
|
+
|
|
433
|
+
If you choose option 2 or 3 (recommended), genal will print the list of palindromic SNPs that have been removed from the analysis.
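To make the three `action` values concrete, here is a minimal, self-contained sketch of the decision they describe. It is not genal's implementation: the helper names are hypothetical, and the `0.42` frequency cut-off is an assumed, illustrative value for what counts as an "intermediate" allele frequency.

```python
# Illustrative sketch only: hypothetical helpers, not genal's code.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic(ea: str, nea: str) -> bool:
    """Return True when the two alleles are complementary bases (A/T or C/G)."""
    return COMPLEMENT.get(ea.upper()) == nea.upper()


def has_intermediate_frequency(eaf: float, threshold: float = 0.42) -> bool:
    """Return True when the effect allele frequency is too close to 0.5
    for the strand to be inferred from it (threshold is an assumed value)."""
    return threshold <= eaf <= 1 - threshold


def would_be_excluded(ea: str, nea: str, eaf: float, action: int = 2) -> bool:
    """Decide whether a SNP would be dropped under the given action."""
    if not is_palindromic(ea, nea):
        return False  # non-palindromic SNPs are never dropped by this step
    if action == 1:
        return False  # palindromic SNPs are kept as-is
    if action == 2:
        return has_intermediate_frequency(eaf)  # dropped only if ambiguous
    if action == 3:
        return True  # every palindromic SNP is dropped
    raise ValueError("action must be 1, 2 or 3")


print(would_be_excluded("A", "T", eaf=0.50, action=2))
print(would_be_excluded("A", "T", eaf=0.10, action=2))
```

Under this sketch, an A/T SNP with an effect allele frequency of 0.50 is dropped with `action = 2`, while the same SNP with a frequency of 0.10 is kept because the frequency is informative enough to orient the strand.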
|
|
434
|
+
|
|
435
|
+
By default, only some MR methods (inverse-variance weighted, weighted median, simple mode, MR-Egger) are run. To run a different set of MR methods, pass a list of strings to the `methods` argument. The possible strings are:
|
|
436
|
+
- `IVW` for the classical Inverse-Variance Weighted method with random effects
|
|
437
|
+
- `IVW-RE` for the Inverse Variance Weighted method with Random Effects where the standard error is not corrected for under-dispersion
|
|
438
|
+
- `IVW-FE` for the Inverse Variance Weighted with fixed effects
|
|
439
|
+
- `UWR` for the Unweighted Regression method
|
|
440
|
+
- `WM` for the Weighted Median method
|
|
441
|
+
- `WM-pen` for the penalised Weighted Median method
|
|
442
|
+
- `Simple-median` for the Simple Median method
|
|
443
|
+
- `Sign` for the Sign concordance test
|
|
444
|
+
- `Egger` for MR-Egger and the MR-Egger intercept
|
|
445
|
+
- `Egger-boot` for the bootstrapped version of MR-Egger and its intercept
|
|
446
|
+
- `Simple-mode` for the Simple mode method
|
|
447
|
+
- `Weighted-mode` for the Weighted mode method
|
|
448
|
+
- `all` to run all the above methods
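The three IVW options in the list above differ only in how the standard error is computed. The following sketch shows the textbook formulas under the assumption of a multiplicative random-effects model. It is an illustration in plain Python, not genal's code, and the function name `ivw_sketch` and its return keys are hypothetical.

```python
import math


def ivw_sketch(beta_exp, beta_out, se_out):
    """Textbook inverse-variance weighted estimate from per-SNP summary statistics.

    beta_exp: SNP effects on the exposure
    beta_out: SNP effects on the outcome
    se_out:   standard errors of the SNP effects on the outcome
    """
    weights = [1 / s**2 for s in se_out]
    num = sum(w * bx * by for w, bx, by in zip(weights, beta_exp, beta_out))
    den = sum(w * bx * bx for w, bx in zip(weights, beta_exp))
    b = num / den

    # Fixed effects: the standard error ignores heterogeneity between SNPs.
    se_fixed = math.sqrt(1 / den)

    # Cochran's Q and the residual dispersion around the IVW slope.
    q = sum(w * (by - b * bx) ** 2 for w, bx, by in zip(weights, beta_exp, beta_out))
    q_df = len(beta_exp) - 1
    dispersion = math.sqrt(q / q_df)

    return {
        "b": b,
        "se_fixed": se_fixed,
        # Random effects without correction: dispersion may shrink the SE.
        "se_random_uncorrected": se_fixed * dispersion,
        # Random effects with correction: the SE is never below the fixed-effects SE.
        "se_random": se_fixed * max(1.0, dispersion),
        "Q": q,
        "Q_df": q_df,
    }


result = ivw_sketch(
    beta_exp=[0.1, 0.2, 0.3],
    beta_out=[0.05, 0.30, 0.00],
    se_out=[0.01, 0.01, 0.01],
)
print(result)
```

The point estimate `b` is identical across the three variants; only the standard error (and therefore the p-value) changes.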
|
|
436
449
|
|
|
437
450
|
For more fine-tuning, such as settings for the number of bootstrapping iterations, please refer to the API.
|
|
438
451
|
|
|
439
|
-
If you want to visualize the obtained MR results, you can use the `genal.Geno.MR_plot` method that will plot each SNP in an effect_on_exposure x effect_on_outcome plane as well as lines corresponding to different MR methods:
|
|
452
|
+
If you want to visualize the obtained MR results, you can use the `genal.Geno.MR_plot` method that will plot each SNP in an `effect_on_exposure x effect_on_outcome` plane as well as lines corresponding to different MR methods:
|
|
440
453
|
|
|
441
454
|
```python
|
|
442
455
|
SBP_clumped.MR_plot(filename="MR_plot_SBP_AS")
|
|
443
456
|
```
|
|
444
457
|
|
|
445
458
|

|
|
446
|
-
You can select which MR methods you wish to plot with the
|
|
459
|
+
You can select which MR methods you wish to plot with the `methods` argument. Note that for an MR method to be plotted, it must be included in the latest `genal.Geno.MR` call of this `genal.Geno` instance.
|
|
447
460
|
|
|
448
461
|
If you wish to include the heterogeneity values (Cochran's Q) in the results, you can use the `heterogeneity` argument in the `genal.Geno.MR` call. Here, we request heterogeneity for the MR-Egger and inverse-variance weighted methods:
|
|
449
462
|
|
|
450
463
|
```python
|
|
451
|
-
SBP_clumped.MR(action =
|
|
464
|
+
SBP_clumped.MR(action = 2, methods = ["Egger","IVW"], exposure_name = "SBP", outcome_name = "Stroke_eur", heterogeneity = True)
|
|
452
465
|
```
|
|
453
466
|
|
|
454
467
|
And that will give:
|
|
455
|
-
| exposure | outcome | method
|
|
456
|
-
|
|
457
|
-
| SBP | Stroke_eur |
|
|
458
|
-
|
|
468
|
+
| exposure | outcome | method | nSNP | b | se | pval | Q | Q_df | Q_pval |
|
|
469
|
+
|----------|------------|--------------------------|------|-----------|-----------|---------------|-------------|------|--------------|
|
|
470
|
+
| SBP | Stroke_eur | MR Egger | 1499 | 0.027543 | 0.002849 | 1.723156e-21 | 2959.965136 | 1497 | 1.253763e-98 |
|
|
471
|
+
| SBP | Stroke_eur | Egger Intercept | 1499 | -0.001381 | 0.000813 | 8.935529e-02 | 2959.965136 | 1497 | 1.253763e-98 |
|
|
472
|
+
| SBP | Stroke_eur | Inverse-Variance Weighted | 1499 | 0.023049 | 0.001061 | 1.382645e-104 | 2965.678836 | 1498 | 4.280737e-99 |
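One way to read the `Q` and `Q_df` columns is as an over-dispersion factor. The short check below uses only the rounded values printed in the tables above and assumes that the default IVW standard error is the fixed-effects one inflated by `sqrt(Q / Q_df)` (a multiplicative random-effects model). That is an assumption about the method, not something stated here, although the numbers are consistent with it.

```python
import math

# Values copied from the result tables above (rounded as printed).
se_fixed = 0.000754          # Inverse Variance Weighted (Fixed Effects)
q, q_df = 2965.678836, 1498  # Cochran's Q and its degrees of freedom for IVW

# Over-dispersion factor: values above 1 indicate heterogeneity between instruments.
dispersion = math.sqrt(q / q_df)

# Assumed multiplicative random-effects standard error.
se_random = se_fixed * max(1.0, dispersion)

print(f"dispersion factor: {dispersion:.3f}")
print(f"random-effects SE: {se_random:.6f}")
```

The resulting standard error is close to the `0.001061` reported for the Inverse-Variance Weighted row, up to rounding of the printed inputs.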
|
|
459
473
|
|
|
460
|
-
As expected, many MR methods indicate that SBP is strongly associated with stroke, but there
|
|
474
|
+
As expected, many MR methods indicate that SBP is strongly associated with stroke, but there could be concerns about horizontal pleiotropy (instruments influencing the outcome through a pathway other than the exposure) given the MR-Egger intercept p-value of about 0.089, which is close to the conventional 0.05 threshold.
|
|
461
475
|
To investigate horizontal pleiotropy in more detail, a very useful method is Mendelian Randomization Pleiotropy RESidual Sum and Outlier (MR-PRESSO). MR-PRESSO is a method designed to detect and correct for horizontal pleiotropy. It will identify which instruments are likely to be pleiotropic in their effect on the outcome, and it will rerun an inverse-variance weighted MR after excluding them. It can be run using the `genal.Geno.MRpresso` method:
|
|
462
476
|
|
|
463
477
|
```python
|
|
464
|
-
SBP_clumped.MRpresso(action =
|
|
478
|
+
SBP_clumped.MRpresso(action = 2, n_iterations = 30000)
|
|
465
479
|
```
|
|
466
480
|
|
|
467
481
|
As with the `genal.Geno.MR` method, the `action` argument determines how the palindromic SNPs will be treated. The output is a list containing:
|
|
@@ -488,19 +502,19 @@ df_pheno = pd.read_csv("path/to/trait/data")
|
|
|
488
502
|
>
|
|
489
503
|
> One important detail is to make sure that the individual IDs are identical between the phenotypic data and the genetic data for the target population.
|
|
490
504
|
|
|
491
|
-
Then, it is advised to make a copy of the `genal.Geno`
|
|
505
|
+
It is then advisable to make a copy of the `genal.Geno` instance containing our instruments, because we are going to update their coefficients and want to avoid any confusion:
|
|
492
506
|
|
|
493
507
|
```python
|
|
494
508
|
SBP_adjusted = SBP_clumped.copy()
|
|
495
509
|
```
|
|
496
510
|
|
|
497
|
-
We can then call the `genal.Geno.set_phenotype` method, specifying which column contains our trait of interest (for the association testing) and which column contains the IDs:
|
|
511
|
+
We can then call the `genal.Geno.set_phenotype` method, specifying which column contains our trait of interest (for the association testing) and which column contains the individual IDs:
|
|
498
512
|
|
|
499
513
|
```python
|
|
500
514
|
SBP_adjusted.set_phenotype(df_pheno, PHENO = "htn", IID = "IID")
|
|
501
515
|
```
|
|
502
516
|
|
|
503
|
-
At this point,
|
|
517
|
+
At this point, genal will identify whether the phenotype is binary or quantitative in order to choose the appropriate regression model. If the phenotype is binary, it will assume that the most frequent value codes for controls (and the other value for cases); this can be changed with `alternate_control = True`:
|
|
504
518
|
|
|
505
519
|
Detected a binary phenotype in the 'PHENO' column. Specify 'PHENO_type="quant"' if this is incorrect.
|
|
506
520
|
Identified 0 as the control code in 'PHENO'. Set 'alternate_control=True' to inverse this interpretation.
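The rule described above (most frequent value treated as the control code, reversible with `alternate_control`) can be sketched in a few lines. This is an illustration only: the function `describe_phenotype` is hypothetical, and treating "exactly two distinct non-missing values" as the test for a binary phenotype is an assumption rather than genal's documented behavior.

```python
from collections import Counter


def describe_phenotype(values, alternate_control=False):
    """Guess the phenotype type and, for binary traits, the control/case codes."""
    observed = [v for v in values if v is not None]  # ignore missing entries
    counts = Counter(observed)

    # Assumption: exactly two distinct values means a binary phenotype.
    if len(counts) != 2:
        return {"type": "quant", "control": None, "case": None}

    # most_common() sorts by decreasing frequency.
    (frequent, _), (rare, _) = counts.most_common(2)
    if alternate_control:
        control, case = rare, frequent
    else:
        control, case = frequent, rare
    return {"type": "binary", "control": control, "case": case}


print(describe_phenotype([0, 0, 0, 1, 1, None]))
print(describe_phenotype([0, 0, 0, 1, 1, None], alternate_control=True))
```

With three `0` values and two `1` values, the first call reports `0` as the control code, matching the message printed above; the second call swaps the two codes.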
|
|
@@ -544,18 +558,19 @@ Genal will print information regarding the number of individuals used in the tes
|
|
|
544
558
|
Running single-SNP logistic regression tests on tmp_GENAL/e415aab3_allchr data with adjustment for: ['age'].
|
|
545
559
|
The BETA, SE, P columns of the .data attribute have been updated.
|
|
546
560
|
|
|
547
|
-
The
|
|
561
|
+
The `BETA`, `SE`, and `P` columns of the `SBP_adjusted.data` attribute have been updated with the results of the association tests.
|
|
548
562
|
|
|
549
563
|
### Lifting <a name="paragraph3.7"></a>
|
|
550
564
|
|
|
551
565
|
It is sometimes necessary to lift the SNP data to a different build. For instance, the genetic data of our target population may be in build 38 (hg38) while the GWAS summary statistics are in build 37 (hg19).
|
|
552
|
-
This can easily be done in
|
|
566
|
+
This can easily be done in genal using the `genal.Geno.lift` method:
|
|
553
567
|
|
|
554
568
|
```python
|
|
555
569
|
SBP_clumped.lift(start = "hg19", end = "hg38", replace = False)
|
|
556
570
|
```
|
|
557
571
|
|
|
558
|
-
This outputs a table with the lifted SBP instruments (stored in the `SBP_clumped`
|
|
572
|
+
This outputs a table with the SBP instruments (stored in the `SBP_clumped` instance) lifted from build 37 (hg19) to build 38 (hg38). We specified `replace = False` so that the `SBP_clumped.data` attribute is not modified, but we may want to modify it (for instance, before running a PRS in a population whose data is stored in build 38).
|
|
573
|
+
Genal will download the appropriate chain files required for the lift, and the lift will be done in Python by default. However, if you plan to lift large datasets of SNPs (the whole summary statistics for instance), it may be useful to install the LiftOver executable, which runs faster than the Python version. It can be downloaded here: [https://genome-store.ucsc.edu/](https://genome-store.ucsc.edu/). You will need to create an account, scroll down to "LiftOver program", add it to your cart, and declare that you are a non-profit user.
|
|
559
574
|
|
|
560
575
|
You can pass the path of the LiftOver executable to the `liftover_path` argument:
|
|
561
576
|
|