genal-python 0.9__tar.gz → 1.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (110)
  1. {genal_python-0.9 → genal_python-1.1}/.DS_Store +0 -0
  2. genal_python-1.1/Genal_flowchart.png +0 -0
  3. {genal_python-0.9 → genal_python-1.1}/PKG-INFO +39 -16
  4. {genal_python-0.9 → genal_python-1.1}/README.md +35 -12
  5. {genal_python-0.9 → genal_python-1.1}/docs/.DS_Store +0 -0
  6. {genal_python-0.9 → genal_python-1.1/docs}/requirements.txt +4 -3
  7. genal_python-1.1/docs/source/Images/Genal_flowchart.png +0 -0
  8. genal_python-1.1/docs/source/Images/genal_logo.png +0 -0
  9. {genal_python-0.9 → genal_python-1.1}/docs/source/conf.py +1 -1
  10. {genal_python-0.9 → genal_python-1.1}/docs/source/index.rst +8 -4
  11. {genal_python-0.9 → genal_python-1.1}/docs/source/introduction.rst +25 -12
  12. {genal_python-0.9 → genal_python-1.1}/genal/Geno.py +175 -105
  13. {genal_python-0.9 → genal_python-1.1}/genal/MR.py +1 -7
  14. {genal_python-0.9 → genal_python-1.1}/genal/MR_tools.py +24 -25
  15. {genal_python-0.9 → genal_python-1.1}/genal/MRpresso.py +12 -16
  16. genal_python-1.1/genal/__init__.py +19 -0
  17. {genal_python-0.9 → genal_python-1.1}/genal/association.py +173 -115
  18. {genal_python-0.9 → genal_python-1.1}/genal/clump.py +33 -18
  19. {genal_python-0.9 → genal_python-1.1}/genal/constants.py +3 -0
  20. {genal_python-0.9 → genal_python-1.1}/genal/extract_prs.py +115 -107
  21. {genal_python-0.9 → genal_python-1.1}/genal/geno_tools.py +15 -9
  22. {genal_python-0.9 → genal_python-1.1}/genal/proxy.py +120 -54
  23. {genal_python-0.9 → genal_python-1.1}/genal/snp_query.py +49 -25
  24. genal_python-1.1/genal/tools.py +499 -0
  25. genal_python-1.1/genal_logo.png +0 -0
  26. {genal_python-0.9 → genal_python-1.1}/pyproject.toml +3 -3
  27. genal_python-0.9/genal/__init__.py +0 -20
  28. genal_python-0.9/genal/tools.py +0 -300
  29. {genal_python-0.9 → genal_python-1.1}/.gitignore +0 -0
  30. {genal_python-0.9 → genal_python-1.1}/.readthedocs.yaml +0 -0
  31. {genal_python-0.9 → genal_python-1.1}/LICENSE +0 -0
  32. {genal_python-0.9 → genal_python-1.1}/docs/Makefile +0 -0
  33. {genal_python-0.9 → genal_python-1.1}/docs/build/.DS_Store +0 -0
  34. {genal_python-0.9 → genal_python-1.1}/docs/build/.buildinfo +0 -0
  35. {genal_python-0.9 → genal_python-1.1}/docs/build/.doctrees/api.doctree +0 -0
  36. {genal_python-0.9 → genal_python-1.1}/docs/build/.doctrees/environment.pickle +0 -0
  37. {genal_python-0.9 → genal_python-1.1}/docs/build/.doctrees/genal.doctree +0 -0
  38. {genal_python-0.9 → genal_python-1.1}/docs/build/.doctrees/index.doctree +0 -0
  39. {genal_python-0.9 → genal_python-1.1}/docs/build/.doctrees/introduction.doctree +0 -0
  40. {genal_python-0.9 → genal_python-1.1}/docs/build/.doctrees/modules.doctree +0 -0
  41. {genal_python-0.9 → genal_python-1.1}/docs/build/_images/MR_plot_SBP_AS.png +0 -0
  42. {genal_python-0.9 → genal_python-1.1}/docs/build/_modules/genal/Geno.html +0 -0
  43. {genal_python-0.9 → genal_python-1.1}/docs/build/_modules/genal/MR.html +0 -0
  44. {genal_python-0.9 → genal_python-1.1}/docs/build/_modules/genal/MR_tools.html +0 -0
  45. {genal_python-0.9 → genal_python-1.1}/docs/build/_modules/genal/MRpresso.html +0 -0
  46. {genal_python-0.9 → genal_python-1.1}/docs/build/_modules/genal/association.html +0 -0
  47. {genal_python-0.9 → genal_python-1.1}/docs/build/_modules/genal/clump.html +0 -0
  48. {genal_python-0.9 → genal_python-1.1}/docs/build/_modules/genal/extract_prs.html +0 -0
  49. {genal_python-0.9 → genal_python-1.1}/docs/build/_modules/genal/geno_tools.html +0 -0
  50. {genal_python-0.9 → genal_python-1.1}/docs/build/_modules/genal/lift.html +0 -0
  51. {genal_python-0.9 → genal_python-1.1}/docs/build/_modules/genal/proxy.html +0 -0
  52. {genal_python-0.9 → genal_python-1.1}/docs/build/_modules/genal/snp_query.html +0 -0
  53. {genal_python-0.9 → genal_python-1.1}/docs/build/_modules/genal/tools.html +0 -0
  54. {genal_python-0.9 → genal_python-1.1}/docs/build/_modules/index.html +0 -0
  55. {genal_python-0.9 → genal_python-1.1}/docs/build/_sources/api.rst.txt +0 -0
  56. {genal_python-0.9 → genal_python-1.1}/docs/build/_sources/genal.rst.txt +0 -0
  57. {genal_python-0.9 → genal_python-1.1}/docs/build/_sources/index.rst.txt +0 -0
  58. {genal_python-0.9 → genal_python-1.1}/docs/build/_sources/introduction.rst.txt +0 -0
  59. {genal_python-0.9 → genal_python-1.1}/docs/build/_sources/modules.rst.txt +0 -0
  60. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/basic.css +0 -0
  61. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/css/badge_only.css +0 -0
  62. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/css/fonts/Roboto-Slab-Bold.woff +0 -0
  63. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/css/fonts/Roboto-Slab-Bold.woff2 +0 -0
  64. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/css/fonts/Roboto-Slab-Regular.woff +0 -0
  65. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/css/fonts/Roboto-Slab-Regular.woff2 +0 -0
  66. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/css/fonts/fontawesome-webfont.eot +0 -0
  67. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/css/fonts/fontawesome-webfont.svg +0 -0
  68. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/css/fonts/fontawesome-webfont.ttf +0 -0
  69. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/css/fonts/fontawesome-webfont.woff +0 -0
  70. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/css/fonts/fontawesome-webfont.woff2 +0 -0
  71. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/css/fonts/lato-bold-italic.woff +0 -0
  72. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/css/fonts/lato-bold-italic.woff2 +0 -0
  73. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/css/fonts/lato-bold.woff +0 -0
  74. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/css/fonts/lato-bold.woff2 +0 -0
  75. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/css/fonts/lato-normal-italic.woff +0 -0
  76. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/css/fonts/lato-normal-italic.woff2 +0 -0
  77. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/css/fonts/lato-normal.woff +0 -0
  78. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/css/fonts/lato-normal.woff2 +0 -0
  79. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/css/theme.css +0 -0
  80. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/doctools.js +0 -0
  81. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/documentation_options.js +0 -0
  82. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/file.png +0 -0
  83. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/js/badge_only.js +0 -0
  84. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/js/html5shiv-printshiv.min.js +0 -0
  85. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/js/html5shiv.min.js +0 -0
  86. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/js/theme.js +0 -0
  87. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/language_data.js +0 -0
  88. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/minus.png +0 -0
  89. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/plus.png +0 -0
  90. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/pygments.css +0 -0
  91. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/searchtools.js +0 -0
  92. {genal_python-0.9 → genal_python-1.1}/docs/build/_static/sphinx_highlight.js +0 -0
  93. {genal_python-0.9 → genal_python-1.1}/docs/build/api.html +0 -0
  94. {genal_python-0.9 → genal_python-1.1}/docs/build/genal.html +0 -0
  95. {genal_python-0.9 → genal_python-1.1}/docs/build/genindex.html +0 -0
  96. {genal_python-0.9 → genal_python-1.1}/docs/build/index.html +0 -0
  97. {genal_python-0.9 → genal_python-1.1}/docs/build/introduction.html +0 -0
  98. {genal_python-0.9 → genal_python-1.1}/docs/build/modules.html +0 -0
  99. {genal_python-0.9 → genal_python-1.1}/docs/build/objects.inv +0 -0
  100. {genal_python-0.9 → genal_python-1.1}/docs/build/py-modindex.html +0 -0
  101. {genal_python-0.9 → genal_python-1.1}/docs/build/search.html +0 -0
  102. {genal_python-0.9 → genal_python-1.1}/docs/build/searchindex.js +0 -0
  103. {genal_python-0.9 → genal_python-1.1}/docs/make.bat +0 -0
  104. {genal_python-0.9 → genal_python-1.1}/docs/source/.DS_Store +0 -0
  105. {genal_python-0.9 → genal_python-1.1}/docs/source/Images/MR_plot_SBP_AS.png +0 -0
  106. {genal_python-0.9 → genal_python-1.1}/docs/source/api.rst +0 -0
  107. {genal_python-0.9 → genal_python-1.1}/docs/source/modules.rst +0 -0
  108. {genal_python-0.9 → genal_python-1.1}/genal/lift.py +0 -0
  109. {genal_python-0.9 → genal_python-1.1}/gitignore +0 -0
  110. {genal_python-0.9 → genal_python-1.1}/readthedocs.yaml +0 -0
Binary file
{genal_python-0.9 → genal_python-1.1}/PKG-INFO
@@ -1,9 +1,9 @@
- Metadata-Version: 2.1
+ Metadata-Version: 2.3
  Name: genal-python
- Version: 0.9
+ Version: 1.1
  Summary: A python toolkit for polygenic risk scoring and mendelian randomization.
  Author-email: Cyprien Rivier <riviercyprien@gmail.com>
- Requires-Python: >=3.7
+ Requires-Python: >=3.8
  Description-Content-Type: text/markdown
  Classifier: Programming Language :: Python :: 3
  Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
@@ -16,12 +16,16 @@ Requires-Dist: plotnine==0.12.3
  Requires-Dist: psutil==5.9.1
  Requires-Dist: pyliftover==0.4
  Requires-Dist: scikit_learn>=1.3.0
- Requires-Dist: scipy>=1.11.4
+ Requires-Dist: scipy>=1.10.1, <1.11
  Requires-Dist: statsmodels==0.14.0
  Requires-Dist: tqdm==4.66.1
  Requires-Dist: wget==3.2
  Project-URL: Home, https://github.com/CypRiv/genal
 
+ [![Python 3.8](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue)](https://www.python.org/downloads/release/python-3100/)
+
+ <img src="/genal_logo.png" data-canonical-src="/genal_logo.png" height="80" />
+
  <center><h1> genal: A Python Toolkit for Genetic Risk Scoring and Mendelian Randomization </h1></center>
 
 
@@ -54,12 +58,18 @@ The module prioritizes user-friendliness and intuitive operation, aiming to redu
 
  Genal draws on concepts from well-established R packages such as TwoSampleMR, MR-Presso, MendelianRandomization, and gwasvcf, adapting their proven methodologies to the Python environment. This approach ensures that users have access to tried and tested techniques with the versatility of Python's data science tools.
 
+ <img src="/Genal_flowchart.png" data-canonical-src="/Genal_flowchart.png" style="max-width:100%;" />
+
+ Genal flowchart. Created in https://www.BioRender.com
  ## Citation <a name="citation"></a>
- If you're using genal, please cite the following paper:
- **Genal: A Python Toolkit for Genetic Risk Scoring and Mendelian Randomization.** Cyprien A. Rivier, Santiago Clocchiatti-Tuozzo, Shufan Huo, Victor Torres-Lopez, Daniela Renedo, Kevin N. Sheth, Guido J. Falcone, Julian N. Acosta. medRxiv 2024.05.23.24307776; doi: https://doi.org/10.1101/2024.05.23.24307776
+ If you're using genal, please cite the following paper:
+ **Genal: A Python Toolkit for Genetic Risk Scoring and Mendelian Randomization.**
+ Cyprien A. Rivier, Santiago Clocchiatti-Tuozzo, Shufan Huo, Victor Torres-Lopez, Daniela Renedo, Kevin N. Sheth, Guido J. Falcone, Julian N. Acosta.
+ Bioinformatics Advances 2024.
+ doi: https://doi.org/10.1093/bioadv/vbae207
 
  ## Requirements for the genal module <a name="paragraph1"></a>
- ***Python 3.11 or later***. https://www.python.org/ <br>
+ ***Python 3.8 or later***. https://www.python.org/ <br>
 
 
  ## Installation and How to use the genal module <a name="paragraph2"></a>
@@ -70,7 +80,7 @@ If you're using genal, please cite the following paper:
  >
  > **Optional**: It is recommended to create a new environment to avoid dependencies conflicts. Here, we create a new conda environment called 'genal_env'.
  > ```
- > conda create --name genal_env python=3.11
+ > conda create --name genal_env python=3.8
  > conda activate genal_env
  > ```
 
@@ -84,12 +94,19 @@ And import it in a python environment with:
  import genal
  ```
 
- The main genal functionalities require a working installation of PLINK v1.9 that can be downloaded here: https://www.cog-genomics.org/plink/
- Once downloaded, the path to the plink executable can be set with:
+ The main genal functionalities require a working installation of PLINK v2.0.
+ If you have already installed plink v2.0, you can set the path to its executable with:
 
  ```
  genal.set_plink(path="/path/to/plink/executable/file")
  ```
+
+ If plink is not installed, genal can install the correct version for your system with the following line:
+
+ ```
+ genal.install_plink()
+ ```
+
  ### Documentation <a name="paragraph2.2"></a>
 
  For detailed information on how to use the functionalities of Genal, please refer to the documentation: https://genal.rtfd.io
@@ -124,7 +141,7 @@ For this tutorial, we will obtain genetic instruments for systolic blood pressur
 
  ### Data loading <a name="paragraph3.1"></a>
 
- We start this tutorial with publicly available summary statistics from a large GWAS study of systolic blood pressure. [Link to study](https://www.nature.com/articles/s41588-018-0205-x). After downloading and unzipping the summary statistics, we load them into a pandas DataFrame:
+ We start this tutorial with publicly available summary statistics from a large GWAS study of systolic blood pressure. [Link to study](https://www.nature.com/articles/s41588-018-0205-x). [Download link](http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST006001-GCST007000/GCST006624/Evangelou_30224653_SBP.txt.gz). After downloading and unzipping the summary statistics, we load them into a pandas DataFrame:
 
  ```python
  import pandas as pd
@@ -242,7 +259,7 @@ You do not need to obtain the 1000 genome reference panel yourself, genal will d
  SBP_Geno.preprocess_data(preprocessing = 'Fill_delete', reference_panel = "afr")
  ```
 
- You can also use a custom reference panel by specifying to the reference_panel argument a path to bed/bim/fam files (without the extension).
+ You can also use a custom reference panel by specifying to the reference_panel argument a path to bed/bim/fam (plink v1.9 format) or pgen/pvar/psam files (plink v2.0 format), without the extension.
 
  ### Clumping <a name="paragraph3.3"></a>
 
@@ -275,7 +292,7 @@ Computing a Polygenic Risk Score (PRS) can be done in one line with the `genal.G
  SBP_clumped.prs(name = "SBP_prs", path = "path/to/genetic/files")
  ```
 
- The genetic files of the target population can be either contained in one triple of bed/bim/fam files with information for all SNPs, or divided by chromosome (one bed/bim/fam triple for chr 1, another for chr 2, etc...). In the latter case, provide the path by replacing the chromosome number by `$` and genal will extract the necessary SNPs from each chromosome and merge them before running the PRS. For instance, if the genetic files are named `Pop_chr1.bed`, `Pop_chr1.bim`, `Pop_chr1.fam`, `Pop_chr2.bed`, ..., you can use:
+ The genetic files of the target population can be either contained in one triple of bed/bim/fam or pgen/pvar/psam files with information for all SNPs, or divided by chromosome (one bed/bim/fam or pgen/pvar/psam triple for chr 1, another for chr 2, etc...). In the latter case, provide the path by replacing the chromosome number by `$` and genal will extract the necessary SNPs from each chromosome and merge them before running the PRS. For instance, if the genetic files are named `Pop_chr1.bed`, `Pop_chr1.bim`, `Pop_chr1.fam`, `Pop_chr2.bed`, ..., you can use:
 
  ```python
  SBP_clumped.prs(name = "SBP_prs", path = "Pop_chr$")
@@ -378,7 +395,7 @@ You can customize how the proxies are chosen with the following arguments:
 
  To run MR, we need to load both our exposure and outcome SNP-level data in `genal.Geno` instances. In our case, the genetic instruments of the MR are the SNPs associated with blood pressure at genome-wide significant levels resulting from the clumping of the blood pressure GWAS. They are stored in our `SBP_clumped` `genal.Geno` instance which also include their association with the exposure trait (instrument-SBP estimates in the `BETA` column).
 
- To get their association with the outcome trait (instrument-stroke estimates), we are going to use SNP-level data from a large GWAS of stroke performed by the GIGASTROKE consortium ([https://www.nature.com/articles/s41586-022-05165-3](https://www.nature.com/articles/s41586-022-05165-3)):
+ To get their association with the outcome trait (instrument-stroke estimates), we are going to use SNP-level data from a large GWAS of stroke performed by the GIGASTROKE consortium: [Link to study](https://www.nature.com/articles/s41586-022-05165-3). [Link to download](http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90104001-GCST90105000/GCST90104539/GCST90104539_buildGRCh37.tsv.gz):
 
  ```python
  stroke_gwas = pd.read_csv("GCST90104539_buildGRCh37.tsv",sep="\t")
@@ -506,6 +523,12 @@ And that will give:
  | SBP | Stroke_eur | Egger Intercept | 1499 | -0.001381 | 0.000813 | 8.935529e-02 | 2959.965136 | 1497 | 1.253763e-98 |
  | SBP | Stroke_eur | Inverse-Variance Weighted| 1499 | 0.023049 | 0.001061 | 1.382645e-104 | 2965.678836 | 1498 | 4.280737e-99 |
 
+ If you wish to display the coefficients as odds ratios with confidence intervals for a binary outcome trait, you can use the `odds = True` argument:
+
+ ```python
+ SBP_clumped.MR(action = 2, methods = ["Egger","IVW"], exposure_name = "SBP", outcome_name = "Stroke_eur", heterogeneity = True, odds = True)
+ ```
+
  As expected, many MR methods indicate that SBP is strongly associated with stroke, but there could be concerns for horizontal pleiotropy (instruments influencing the outcome through a different pathway than the one used as exposure) given the almost significant MR-Egger intercept p-value.
  To investigate horizontal pleiotropy in more details, a very useful method is Mendelian Randomization Pleiotropy RESidual Sum and Outlier (MR-PRESSO). MR-PRESSO is a method designed to detect and correct for horizontal pleiotropy. It will identify which instruments are likely to be pleiotropic on their effect on the outcome, and it will rerun an inverse-variance weighted MR after excluding them. It can be run using the `genal.Geno.MRpresso` method:
 
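Note on the hunk above: the new `odds = True` argument reports MR coefficients as odds ratios with confidence intervals. The sketch below illustrates the conversion that is conventionally meant by this, OR = exp(β) with a 95% interval of exp(β ± 1.96·SE). It is an assumption about the standard formula, not a transcription of genal's code, and the helper name `beta_to_odds_ratio` is hypothetical. The inputs are the Inverse-Variance Weighted row of the results table shown in the hunk.

```python
import math


def beta_to_odds_ratio(beta, se, z=1.96):
    """Convert a log-odds coefficient and its standard error to (OR, lower, upper).

    Hypothetical helper for illustration only; z=1.96 gives a 95% interval.
    """
    return (math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se))


# IVW estimate from the results table: beta = 0.023049, SE = 0.001061
odds_ratio, ci_low, ci_high = beta_to_odds_ratio(0.023049, 0.001061)
print(f"OR = {odds_ratio:.4f} (95% CI {ci_low:.4f} to {ci_high:.4f})")
```

Because the exponential is monotonic, the interval bounds can be exponentiated directly, which is why the interval is asymmetric around the odds ratio.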
@@ -535,7 +558,7 @@ df_pheno = pd.read_csv("path/to/trait/data")
 
  > **Note:**
  >
- > One important point is to make sure that the IDs of the participants are identical in the phenotypic data and in the genetic data.
+ > One important point is to make sure that both the Family IDs (FID) and Individual IDs (IID) of the participants are identical in the phenotypic data and in the genetic data.
 
  Then, it is advised to make a copy of the `genal.Geno` instance containing our instruments as we are going to update their coefficients and to avoid any confusion:
 
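Note on the hunk above: the revised note asks that both FID and IID match between the phenotype table and the genetic files. A minimal sketch of such a check follows, assuming a PLINK v1.9 `.fam` file (whitespace-separated, no header, FID and IID in the first two columns) and phenotype rows given as dictionaries. The helper names `read_fam_ids` and `unmatched_ids` are hypothetical and this is not genal's own validation.

```python
def read_fam_ids(lines):
    """Return the set of (FID, IID) pairs from the lines of a PLINK .fam file."""
    ids = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank lines
            ids.add((fields[0], fields[1]))
    return ids


def unmatched_ids(pheno_rows, fam_ids, fid_col="FID", iid_col="IID"):
    """Return phenotype (FID, IID) pairs that are absent from the genetic data."""
    return [
        (str(row[fid_col]), str(row[iid_col]))
        for row in pheno_rows
        if (str(row[fid_col]), str(row[iid_col])) not in fam_ids
    ]


# Toy data standing in for a .fam file and a phenotype table
fam_lines = ["F1 I1 0 0 1 -9", "F2 I2 0 0 2 -9"]
pheno_rows = [
    {"FID": "F1", "IID": "I1", "htn": 1},
    {"FID": "F9", "IID": "I2", "htn": 0},  # IID matches but FID does not
]
missing = unmatched_ids(pheno_rows, read_fam_ids(fam_lines))
print(missing)
```

With a pandas DataFrame such as `df_pheno`, `df_pheno.to_dict("records")` yields rows in the shape this sketch expects.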
@@ -546,7 +569,7 @@ SBP_adjusted = SBP_clumped.copy()
  We can then call the `genal.Geno.set_phenotype` method, specifying which column contains our trait of interest (for the association testing) and which column contains the individual IDs:
 
  ```python
- SBP_adjusted.set_phenotype(df_pheno, PHENO = "htn", IID = "IID")
+ SBP_adjusted.set_phenotype(df_pheno, PHENO = "htn", IID = "IID", FID = "FID")
  ```
 
  At this point, genal will identify if the phenotype is binary or quantitative in order to choose the appropriate regression model. If the phenotype is binary, it will assume that the most frequent value is coding for control (and the other value for case), this can be changed with `alternate_control = True`:
{genal_python-0.9 → genal_python-1.1}/README.md
@@ -1,3 +1,7 @@
+ [![Python 3.8](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue)](https://www.python.org/downloads/release/python-3100/)
+
+ <img src="/genal_logo.png" data-canonical-src="/genal_logo.png" height="80" />
+
  <center><h1> genal: A Python Toolkit for Genetic Risk Scoring and Mendelian Randomization </h1></center>
 
 
@@ -30,12 +34,18 @@ The module prioritizes user-friendliness and intuitive operation, aiming to redu
 
  Genal draws on concepts from well-established R packages such as TwoSampleMR, MR-Presso, MendelianRandomization, and gwasvcf, adapting their proven methodologies to the Python environment. This approach ensures that users have access to tried and tested techniques with the versatility of Python's data science tools.
 
+ <img src="/Genal_flowchart.png" data-canonical-src="/Genal_flowchart.png" style="max-width:100%;" />
+
+ Genal flowchart. Created in https://www.BioRender.com
  ## Citation <a name="citation"></a>
- If you're using genal, please cite the following paper:
- **Genal: A Python Toolkit for Genetic Risk Scoring and Mendelian Randomization.** Cyprien A. Rivier, Santiago Clocchiatti-Tuozzo, Shufan Huo, Victor Torres-Lopez, Daniela Renedo, Kevin N. Sheth, Guido J. Falcone, Julian N. Acosta. medRxiv 2024.05.23.24307776; doi: https://doi.org/10.1101/2024.05.23.24307776
+ If you're using genal, please cite the following paper:
+ **Genal: A Python Toolkit for Genetic Risk Scoring and Mendelian Randomization.**
+ Cyprien A. Rivier, Santiago Clocchiatti-Tuozzo, Shufan Huo, Victor Torres-Lopez, Daniela Renedo, Kevin N. Sheth, Guido J. Falcone, Julian N. Acosta.
+ Bioinformatics Advances 2024.
+ doi: https://doi.org/10.1093/bioadv/vbae207
 
  ## Requirements for the genal module <a name="paragraph1"></a>
- ***Python 3.11 or later***. https://www.python.org/ <br>
+ ***Python 3.8 or later***. https://www.python.org/ <br>
 
 
  ## Installation and How to use the genal module <a name="paragraph2"></a>
@@ -46,7 +56,7 @@ If you're using genal, please cite the following paper:
  >
  > **Optional**: It is recommended to create a new environment to avoid dependencies conflicts. Here, we create a new conda environment called 'genal_env'.
  > ```
- > conda create --name genal_env python=3.11
+ > conda create --name genal_env python=3.8
  > conda activate genal_env
  > ```
 
@@ -60,12 +70,19 @@ And import it in a python environment with:
  import genal
  ```
 
- The main genal functionalities require a working installation of PLINK v1.9 that can be downloaded here: https://www.cog-genomics.org/plink/
- Once downloaded, the path to the plink executable can be set with:
+ The main genal functionalities require a working installation of PLINK v2.0.
+ If you have already installed plink v2.0, you can set the path to its executable with:
 
  ```
  genal.set_plink(path="/path/to/plink/executable/file")
  ```
+
+ If plink is not installed, genal can install the correct version for your system with the following line:
+
+ ```
+ genal.install_plink()
+ ```
+
  ### Documentation <a name="paragraph2.2"></a>
 
  For detailed information on how to use the functionalities of Genal, please refer to the documentation: https://genal.rtfd.io
@@ -100,7 +117,7 @@ For this tutorial, we will obtain genetic instruments for systolic blood pressur
 
  ### Data loading <a name="paragraph3.1"></a>
 
- We start this tutorial with publicly available summary statistics from a large GWAS study of systolic blood pressure. [Link to study](https://www.nature.com/articles/s41588-018-0205-x). After downloading and unzipping the summary statistics, we load them into a pandas DataFrame:
+ We start this tutorial with publicly available summary statistics from a large GWAS study of systolic blood pressure. [Link to study](https://www.nature.com/articles/s41588-018-0205-x). [Download link](http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST006001-GCST007000/GCST006624/Evangelou_30224653_SBP.txt.gz). After downloading and unzipping the summary statistics, we load them into a pandas DataFrame:
 
  ```python
  import pandas as pd
@@ -218,7 +235,7 @@ You do not need to obtain the 1000 genome reference panel yourself, genal will d
  SBP_Geno.preprocess_data(preprocessing = 'Fill_delete', reference_panel = "afr")
  ```
 
- You can also use a custom reference panel by specifying to the reference_panel argument a path to bed/bim/fam files (without the extension).
+ You can also use a custom reference panel by specifying to the reference_panel argument a path to bed/bim/fam (plink v1.9 format) or pgen/pvar/psam files (plink v2.0 format), without the extension.
 
  ### Clumping <a name="paragraph3.3"></a>
 
@@ -251,7 +268,7 @@ Computing a Polygenic Risk Score (PRS) can be done in one line with the `genal.G
  SBP_clumped.prs(name = "SBP_prs", path = "path/to/genetic/files")
  ```
 
- The genetic files of the target population can be either contained in one triple of bed/bim/fam files with information for all SNPs, or divided by chromosome (one bed/bim/fam triple for chr 1, another for chr 2, etc...). In the latter case, provide the path by replacing the chromosome number by `$` and genal will extract the necessary SNPs from each chromosome and merge them before running the PRS. For instance, if the genetic files are named `Pop_chr1.bed`, `Pop_chr1.bim`, `Pop_chr1.fam`, `Pop_chr2.bed`, ..., you can use:
+ The genetic files of the target population can be either contained in one triple of bed/bim/fam or pgen/pvar/psam files with information for all SNPs, or divided by chromosome (one bed/bim/fam or pgen/pvar/psam triple for chr 1, another for chr 2, etc...). In the latter case, provide the path by replacing the chromosome number by `$` and genal will extract the necessary SNPs from each chromosome and merge them before running the PRS. For instance, if the genetic files are named `Pop_chr1.bed`, `Pop_chr1.bim`, `Pop_chr1.fam`, `Pop_chr2.bed`, ..., you can use:
 
  ```python
  SBP_clumped.prs(name = "SBP_prs", path = "Pop_chr$")
@@ -354,7 +371,7 @@ You can customize how the proxies are chosen with the following arguments:
 
  To run MR, we need to load both our exposure and outcome SNP-level data in `genal.Geno` instances. In our case, the genetic instruments of the MR are the SNPs associated with blood pressure at genome-wide significant levels resulting from the clumping of the blood pressure GWAS. They are stored in our `SBP_clumped` `genal.Geno` instance which also include their association with the exposure trait (instrument-SBP estimates in the `BETA` column).
 
- To get their association with the outcome trait (instrument-stroke estimates), we are going to use SNP-level data from a large GWAS of stroke performed by the GIGASTROKE consortium ([https://www.nature.com/articles/s41586-022-05165-3](https://www.nature.com/articles/s41586-022-05165-3)):
+ To get their association with the outcome trait (instrument-stroke estimates), we are going to use SNP-level data from a large GWAS of stroke performed by the GIGASTROKE consortium: [Link to study](https://www.nature.com/articles/s41586-022-05165-3). [Link to download](http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90104001-GCST90105000/GCST90104539/GCST90104539_buildGRCh37.tsv.gz):
 
  ```python
  stroke_gwas = pd.read_csv("GCST90104539_buildGRCh37.tsv",sep="\t")
@@ -482,6 +499,12 @@ And that will give:
  | SBP | Stroke_eur | Egger Intercept | 1499 | -0.001381 | 0.000813 | 8.935529e-02 | 2959.965136 | 1497 | 1.253763e-98 |
  | SBP | Stroke_eur | Inverse-Variance Weighted| 1499 | 0.023049 | 0.001061 | 1.382645e-104 | 2965.678836 | 1498 | 4.280737e-99 |
 
+ If you wish to display the coefficients as odds ratios with confidence intervals for a binary outcome trait, you can use the `odds = True` argument:
+
+ ```python
+ SBP_clumped.MR(action = 2, methods = ["Egger","IVW"], exposure_name = "SBP", outcome_name = "Stroke_eur", heterogeneity = True, odds = True)
+ ```
+
  As expected, many MR methods indicate that SBP is strongly associated with stroke, but there could be concerns for horizontal pleiotropy (instruments influencing the outcome through a different pathway than the one used as exposure) given the almost significant MR-Egger intercept p-value.
  To investigate horizontal pleiotropy in more details, a very useful method is Mendelian Randomization Pleiotropy RESidual Sum and Outlier (MR-PRESSO). MR-PRESSO is a method designed to detect and correct for horizontal pleiotropy. It will identify which instruments are likely to be pleiotropic on their effect on the outcome, and it will rerun an inverse-variance weighted MR after excluding them. It can be run using the `genal.Geno.MRpresso` method:
 
@@ -511,7 +534,7 @@ df_pheno = pd.read_csv("path/to/trait/data")
 
  > **Note:**
  >
- > One important point is to make sure that the IDs of the participants are identical in the phenotypic data and in the genetic data.
+ > One important point is to make sure that both the Family IDs (FID) and Individual IDs (IID) of the participants are identical in the phenotypic data and in the genetic data.
 
  Then, it is advised to make a copy of the `genal.Geno` instance containing our instruments as we are going to update their coefficients and to avoid any confusion:
 
@@ -522,7 +545,7 @@ SBP_adjusted = SBP_clumped.copy()
  We can then call the `genal.Geno.set_phenotype` method, specifying which column contains our trait of interest (for the association testing) and which column contains the individual IDs:
 
  ```python
- SBP_adjusted.set_phenotype(df_pheno, PHENO = "htn", IID = "IID")
+ SBP_adjusted.set_phenotype(df_pheno, PHENO = "htn", IID = "IID", FID = "FID")
  ```
 
  At this point, genal will identify if the phenotype is binary or quantitative in order to choose the appropriate regression model. If the phenotype is binary, it will assume that the most frequent value is coding for control (and the other value for case), this can be changed with `alternate_control = True`:
{genal_python-0.9 → genal_python-1.1/docs}/requirements.txt
@@ -1,13 +1,14 @@
+ sphinx
+ sphinx_rtd_theme
  aiohttp==3.9.5
  nest_asyncio==1.5.5
- numpy>=1.24.4, <2.0
+ numpy>=1.24.4,<2.0
  pandas>=2.0.3
  plotnine==0.12.3
  psutil==5.9.1
  pyliftover==0.4
  scikit_learn>=1.3.0
  scipy>=1.11.4
- sphinx_rtd_theme==1.3.0
  statsmodels==0.14.0
  tqdm==4.66.1
- wget==3.2
+ wget==3.2
{genal_python-0.9 → genal_python-1.1}/docs/source/conf.py
@@ -13,7 +13,7 @@ sys.path.insert(0, os.path.abspath('../../'))
  project = 'genal'
  copyright = '2023, Cyprien A. Rivier'
  author = 'Cyprien A. Rivier'
- release = 'v0.9'
+ release = 'v1.1'
 
 
  # -- General configuration ---------------------------------------------------
@@ -3,12 +3,16 @@
3
3
  You can adapt this file completely to your liking, but it should at least
4
4
  contain the root `toctree` directive.
5
5
 
6
+ .. image:: Images/genal_logo.png
7
+ :alt: genal_logo
8
+ :width: 400px
9
+
6
10
  genal: A Python Toolkit for Genetic Risk Scoring and Mendelian Randomization
7
11
  ============================================================================
8
12
 
9
- :Author: Cyprien Rivier
13
+ :Author: Cyprien A. Rivier
10
14
  :Date: |today|
11
- :Version: "0.8"
15
+ :Version: "1.0"
12
16
 
13
17
  Genal is a Python module designed to make it easy to run genetic risk scores and Mendelian randomization analyses. It integrates a collection of tools that facilitate the cleaning of single nucleotide polymorphism data (usually derived from Genome-Wide Association Studies) and enable the execution of key clinical population genetic workflows. The functionalities provided by genal include clumping, lifting, association testing, polygenic risk scoring, and Mendelian randomization analyses, all within a single Python module.
14
18
 
@@ -46,8 +50,8 @@ Citation
46
50
  If you use genal in your work, please cite the following paper:
47
51
 
48
52
  .. [Rivier.2024] *Genal: A Python Toolkit for Genetic Risk Scoring and Mendelian Randomization*
49
- Cyprien Rivier, Cyprien A. Rivier, Santiago Clocchiatti-Tuozzo, Shufan Huo, Victor Torres-Lopez, Daniela Renedo, Kevin N. Sheth, Guido J. Falcone, Julian N. Acosta.
50
- medRxiv. 2024 May `10.1101/2024.05.23.24307776 <https://doi.org/10.1101/2024.05.23.24307776>`_.
53
+ Cyprien A. Rivier, Santiago Clocchiatti-Tuozzo, Shufan Huo, Victor Torres-Lopez, Daniela Renedo, Kevin N. Sheth, Guido J. Falcone, Julian N. Acosta.
54
+ Bioinformatics Advances. 2024 December; `10.1093/bioadv/vbae207 <https://doi.org/10.1093/bioadv/vbae207>`_.
51
55
 
52
56
  References
53
57
  ----------
@@ -7,10 +7,10 @@ Installation
7
7
 
8
8
  .. code-block:: bash
9
9
 
10
- conda create --name genal_env python=3.11
10
+ conda create --name genal_env python=3.8
11
11
  conda activate genal_env
12
12
 
13
- The genal package requires Python 3.11. Download and install it with pip:
13
+ The genal package requires Python 3.8 or later. Download and install it with pip:
14
14
 
15
15
  .. code-block:: bash
16
16
 
@@ -22,17 +22,26 @@ And import it in a python environment with:
22
22
 
23
23
  import genal
24
24
 
25
- The main genal functionalities require a working installation of PLINK v1.9 (and not 2.0 as certain functionalities have not been updated yet) that can be downloaded here: https://www.cog-genomics.org/plink/
26
- Once downloaded, the path to the plink 1.9 executable should be set with:
25
+ The main genal functionalities require a working installation of PLINK v2.0.
26
+ If you have already installed plink v2.0, you can set the path to its executable with:
27
27
 
28
28
  .. code-block:: python
29
29
 
30
30
  genal.set_plink(path="/path/to/plink/executable/file")
31
31
 
32
+ If plink is not installed, genal can install the correct version for your system with the :meth:`~genal.tools.install_plink` function:
33
+
34
+ .. code-block:: python
35
+
36
+ genal.install_plink()
37
+
32
38
  ========
33
39
  Tutorial
34
40
  ========
35
41
 
42
+ .. image:: Images/Genal_flowchart.png
43
+ :alt: Genal_flowchart
44
+
36
45
  For the purpose of this tutorial, we are going to build a PRS of systolic blood pressure (SBP) and investigate the genetically-determined effect of SBP on the risk of stroke. We will use both summary statistics from Genome-Wide Association Studies (GWAS) and individual-level data from the UK Biobank as our test population. We are going to go through the following steps:
37
46
 
38
47
  Table of contents
@@ -51,7 +60,7 @@ h. `GWAS Catalog`_
51
60
  Data loading
52
61
  ============
53
62
 
54
- We start this tutorial with publicly available summary statistics data from a large GWAS of systolic blood pressure (https://www.nature.com/articles/s41588-018-0205-x). After downloading and unzipping the summary statistics, we load the data into a pandas dataframe:
63
+ We start this tutorial with publicly available summary statistics from a large GWAS of systolic blood pressure (`Link to study <https://www.nature.com/articles/s41588-018-0205-x>`_, `Download link <http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST006001-GCST007000/GCST006624/Evangelou_30224653_SBP.txt.gz>`_). After downloading and unzipping the summary statistics, we load the data into a pandas dataframe:
55
64
 
56
65
  .. code-block:: python
57
66
 
@@ -179,8 +188,7 @@ By default, the reference panel used is the European (EUR) one. You can specify
179
188
 
180
189
  SBP_Geno.preprocess_data(preprocessing='Fill_delete', reference_panel="afr")
181
190
 
182
- You can also use a custom reference panel by specifying the path to bed/bim/fam files (without the extension) in the ``reference_panel`` argument.
183
-
191
+ You can also use a custom reference panel by passing to the ``reference_panel`` argument the path to bed/bim/fam files (plink v1.9 format) or pgen/pvar/psam files (plink v2.0 format), without the extension.
184
192
 
185
193
  Clumping
186
194
  --------
@@ -215,7 +223,7 @@ Computing a Polygenic Risk Score (PRS) can be done in one line with the :meth:`~
215
223
 
216
224
  SBP_clumped.prs(name="SBP_prs", path="path/to/genetic/files")
217
225
 
218
- The genetic files of the target population can be either contained in one triple of bed/bim/fam files with information for all SNPs, or divided by chromosome (one bed/bim/fam triple for chr 1, another for chr 2, etc...). In the latter case, provide the path by replacing the chromosome number by ``$`` and genal will extract the necessary SNPs from each chromosome and merge them before running the PRS. For instance, if the genetic files are named ``Pop_chr1.bed``, ``Pop_chr1.bim``, ``Pop_chr1.fam``, ``Pop_chr2.bed``, ..., you can use:
226
+ The genetic files of the target population can be either contained in one triple of bed/bim/fam or pgen/pvar/psam files with information for all SNPs, or divided by chromosome (one bed/bim/fam or pgen/pvar/psam triple for chr 1, another for chr 2, etc.). In the latter case, provide the path by replacing the chromosome number by ``$`` and genal will extract the necessary SNPs from each chromosome and merge them before running the PRS. For instance, if the genetic files are named ``Pop_chr1.bed``, ``Pop_chr1.bim``, ``Pop_chr1.fam``, ``Pop_chr2.bed``, ..., you can use:
219
227
 
220
228
  .. code-block:: python
221
229
 
@@ -318,7 +326,8 @@ Mendelian Randomization
318
326
 
319
327
  To run MR, we need to load both our exposure and outcome SNP-level data in :class:`~genal.Geno` instances. In our case, the genetic instruments of the MR are the SNPs associated with blood pressure at genome-wide significant levels resulting from the clumping of the blood pressure GWAS. They are stored in our ``SBP_clumped`` :class:`~genal.Geno` instance which also include their association with the exposure trait (instrument-SBP estimates in the ``BETA`` column).
320
328
 
321
- To get their association with the outcome trait (instrument-stroke estimates), we are going to use SNP-level data from a large GWAS of stroke performed by the GIGASTROKE consortium (`Nature article <https://www.nature.com/articles/s41586-022-05165-3>`_):
329
+ To get their association with the outcome trait (instrument-stroke estimates), we are going to use SNP-level data from a large GWAS of stroke performed by the GIGASTROKE consortium:
330
+ `Link to study <https://www.nature.com/articles/s41586-022-05165-3>`_. `Download link <http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90104001-GCST90105000/GCST90104539/GCST90104539_buildGRCh37.tsv.gz>`_.
322
331
 
323
332
  .. code-block:: python
324
333
 
@@ -461,8 +470,12 @@ And that will give:
461
470
  1 SBP Stroke_eur Egger Intercept 1499 -0.001381 0.000813 8.935529e-02 2959.965136 1497 1.253763e-98
462
471
  2 SBP Stroke_eur Inverse-Variance Weighted 1499 0.023049 0.001061 1.382645e-104 2965.678836 1498 4.280737e-99
463
472
 
473
+ If you wish to display the coefficients as odds ratios with confidence intervals for a binary outcome trait, you can use the ``odds=True`` argument:
474
+
475
+ .. code-block:: python
476
+
477
+ SBP_clumped.MR(action=2, methods=["Egger","IVW"], exposure_name="SBP", outcome_name="Stroke_eur", heterogeneity=True, odds=True)
464
478
 
465
-
466
479
  As expected, many MR methods indicate that SBP is strongly associated with stroke, but there could be concerns for horizontal pleiotropy (instruments influencing the outcome through a different pathway than the one used as exposure) given the almost significant MR-Egger intercept p-value.
467
480
 
468
481
  To investigate horizontal pleiotropy in more detail, a very useful method is Mendelian Randomization Pleiotropy RESidual Sum and Outlier (MR-PRESSO).
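For readers wondering what the `odds=True` display added in this hunk amounts to, the usual conversion from a log-odds estimate to an odds ratio with a 95% confidence interval can be sketched as follows. This is the textbook formula, not code taken from genal; `odds_ratio_with_ci` is a hypothetical helper, and reading the two numbers after the SNP count in the IVW row as the estimate and its standard error is an assumption.

```python
import math


def odds_ratio_with_ci(beta, se, z=1.959964):
    """Convert a log-odds estimate and its standard error to (OR, lower, upper).

    Hypothetical helper for illustration only; z defaults to the normal
    quantile for a two-sided 95% interval.
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# IVW row of the results table above, assuming the columns are b and se.
or_ivw, ci_low, ci_high = odds_ratio_with_ci(0.023049, 0.001061)
print(f"OR = {or_ivw:.4f} (95% CI {ci_low:.4f} to {ci_high:.4f})")
```

The interval is symmetric on the log scale, so it is slightly asymmetric around the odds ratio itself.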
@@ -499,7 +512,7 @@ Let's start by loading phenotypic data:
499
512
  df_pheno = pd.read_csv("path/to/trait/data")
500
513
 
501
514
  .. note::
502
- One important point is to make sure that the IDs of the participants are identical in the phenotypic data and in the genetic data.
515
+ One important point is to make sure that both the Family IDs (FID) and Individual IDs (IID) of the participants are identical in the phenotypic data and in the genetic data.
503
516
 
504
517
  Then, to avoid any confusion, it is advisable to make a copy of the :class:`~genal.Geno` instance containing our instruments, since we are going to update their coefficients:
505
518
 
@@ -511,7 +524,7 @@ We can then call the :meth:`~genal.Geno.set_phenotype` method, specifying which
511
524
 
512
525
  .. code-block:: python
513
526
 
514
- SBP_adjusted.set_phenotype(df_pheno, PHENO="htn", IID="IID")
527
+ SBP_adjusted.set_phenotype(df_pheno, PHENO="htn", IID="IID", FID="FID")
515
528
 
516
529
  At this point, genal will identify whether the phenotype is binary or quantitative in order to choose the appropriate regression model. If the phenotype is binary, it will assume that the most frequent value codes for controls (and the other value for cases); this can be changed with ``alternate_control=True``::
517
530
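Since this hunk now passes both `FID` and `IID` to `set_phenotype`, it may help to check beforehand that the (FID, IID) pairs of the phenotype table also appear in the genetic data. The sketch below is a minimal illustration with toy data and is independent of genal. It assumes the genetic IDs come from a plink `.fam` file, whose first two whitespace-separated columns are FID and IID; `id_pairs` is a hypothetical helper.

```python
def id_pairs(rows, fid_col, iid_col):
    """Collect (FID, IID) pairs as strings from a list of row dictionaries."""
    return {(str(row[fid_col]), str(row[iid_col])) for row in rows}


# Toy phenotype table (in practice, the rows of df_pheno).
pheno_rows = [
    {"FID": "1", "IID": "A", "htn": 0},
    {"FID": "2", "IID": "B", "htn": 1},
    {"FID": "4", "IID": "D", "htn": 0},
]

# Toy .fam content: FID, IID, father, mother, sex, phenotype.
fam_lines = ["1 A 0 0 1 -9", "2 B 0 0 2 -9", "3 C 0 0 1 -9"]

phenotype_ids = id_pairs(pheno_rows, "FID", "IID")
genetic_ids = {tuple(line.split()[:2]) for line in fam_lines}

shared = phenotype_ids & genetic_ids
missing_from_genetic = phenotype_ids - genetic_ids
print(f"{len(shared)} participants matched, {len(missing_from_genetic)} unmatched")
```

Comparing the IDs as strings avoids spurious mismatches when one source stores them as integers and the other as text.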