genal-python 1.2.8__tar.gz → 1.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (109)
  1. {genal_python-1.2.8 → genal_python-1.3.0}/PKG-INFO +49 -70
  2. {genal_python-1.2.8 → genal_python-1.3.0}/README.md +47 -69
  3. {genal_python-1.2.8 → genal_python-1.3.0}/genal/Geno.py +164 -55
  4. {genal_python-1.2.8 → genal_python-1.3.0}/genal/MR_tools.py +5 -1
  5. {genal_python-1.2.8 → genal_python-1.3.0}/genal/__init__.py +1 -1
  6. {genal_python-1.2.8 → genal_python-1.3.0}/genal/association.py +2 -8
  7. {genal_python-1.2.8 → genal_python-1.3.0}/genal/clump.py +12 -22
  8. genal_python-1.3.0/genal/colocalization.py +159 -0
  9. {genal_python-1.2.8 → genal_python-1.3.0}/genal/constants.py +5 -2
  10. {genal_python-1.2.8 → genal_python-1.3.0}/genal/geno_tools.py +64 -26
  11. {genal_python-1.2.8 → genal_python-1.3.0}/genal/lift.py +4 -4
  12. {genal_python-1.2.8 → genal_python-1.3.0}/genal/proxy.py +13 -18
  13. {genal_python-1.2.8 → genal_python-1.3.0}/genal/tools.py +184 -90
  14. {genal_python-1.2.8 → genal_python-1.3.0}/pyproject.toml +3 -2
  15. {genal_python-1.2.8 → genal_python-1.3.0}/.DS_Store +0 -0
  16. {genal_python-1.2.8 → genal_python-1.3.0}/.gitignore +0 -0
  17. {genal_python-1.2.8 → genal_python-1.3.0}/.readthedocs.yaml +0 -0
  18. {genal_python-1.2.8 → genal_python-1.3.0}/Genal_flowchart.png +0 -0
  19. {genal_python-1.2.8 → genal_python-1.3.0}/LICENSE +0 -0
  20. {genal_python-1.2.8 → genal_python-1.3.0}/docs/.DS_Store +0 -0
  21. {genal_python-1.2.8 → genal_python-1.3.0}/docs/Makefile +0 -0
  22. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/.DS_Store +0 -0
  23. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/.buildinfo +0 -0
  24. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/.doctrees/api.doctree +0 -0
  25. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/.doctrees/environment.pickle +0 -0
  26. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/.doctrees/genal.doctree +0 -0
  27. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/.doctrees/index.doctree +0 -0
  28. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/.doctrees/introduction.doctree +0 -0
  29. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/.doctrees/modules.doctree +0 -0
  30. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_images/MR_plot_SBP_AS.png +0 -0
  31. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/Geno.html +0 -0
  32. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/MR.html +0 -0
  33. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/MR_tools.html +0 -0
  34. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/MRpresso.html +0 -0
  35. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/association.html +0 -0
  36. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/clump.html +0 -0
  37. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/extract_prs.html +0 -0
  38. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/geno_tools.html +0 -0
  39. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/lift.html +0 -0
  40. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/proxy.html +0 -0
  41. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/snp_query.html +0 -0
  42. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/genal/tools.html +0 -0
  43. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_modules/index.html +0 -0
  44. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_sources/api.rst.txt +0 -0
  45. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_sources/genal.rst.txt +0 -0
  46. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_sources/index.rst.txt +0 -0
  47. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_sources/introduction.rst.txt +0 -0
  48. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_sources/modules.rst.txt +0 -0
  49. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/basic.css +0 -0
  50. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/badge_only.css +0 -0
  51. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/Roboto-Slab-Bold.woff +0 -0
  52. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/Roboto-Slab-Bold.woff2 +0 -0
  53. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/Roboto-Slab-Regular.woff +0 -0
  54. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/Roboto-Slab-Regular.woff2 +0 -0
  55. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/fontawesome-webfont.eot +0 -0
  56. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/fontawesome-webfont.svg +0 -0
  57. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/fontawesome-webfont.ttf +0 -0
  58. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/fontawesome-webfont.woff +0 -0
  59. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/fontawesome-webfont.woff2 +0 -0
  60. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/lato-bold-italic.woff +0 -0
  61. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/lato-bold-italic.woff2 +0 -0
  62. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/lato-bold.woff +0 -0
  63. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/lato-bold.woff2 +0 -0
  64. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/lato-normal-italic.woff +0 -0
  65. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/lato-normal-italic.woff2 +0 -0
  66. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/lato-normal.woff +0 -0
  67. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/fonts/lato-normal.woff2 +0 -0
  68. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/css/theme.css +0 -0
  69. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/doctools.js +0 -0
  70. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/documentation_options.js +0 -0
  71. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/file.png +0 -0
  72. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/js/badge_only.js +0 -0
  73. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/js/html5shiv-printshiv.min.js +0 -0
  74. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/js/html5shiv.min.js +0 -0
  75. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/js/theme.js +0 -0
  76. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/language_data.js +0 -0
  77. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/minus.png +0 -0
  78. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/plus.png +0 -0
  79. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/pygments.css +0 -0
  80. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/searchtools.js +0 -0
  81. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/_static/sphinx_highlight.js +0 -0
  82. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/api.html +0 -0
  83. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/genal.html +0 -0
  84. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/genindex.html +0 -0
  85. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/index.html +0 -0
  86. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/introduction.html +0 -0
  87. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/modules.html +0 -0
  88. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/objects.inv +0 -0
  89. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/py-modindex.html +0 -0
  90. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/search.html +0 -0
  91. {genal_python-1.2.8 → genal_python-1.3.0}/docs/build/searchindex.js +0 -0
  92. {genal_python-1.2.8 → genal_python-1.3.0}/docs/make.bat +0 -0
  93. {genal_python-1.2.8 → genal_python-1.3.0}/docs/requirements.txt +0 -0
  94. {genal_python-1.2.8 → genal_python-1.3.0}/docs/source/.DS_Store +0 -0
  95. {genal_python-1.2.8 → genal_python-1.3.0}/docs/source/Images/Genal_flowchart.png +0 -0
  96. {genal_python-1.2.8 → genal_python-1.3.0}/docs/source/Images/MR_plot_SBP_AS.png +0 -0
  97. {genal_python-1.2.8 → genal_python-1.3.0}/docs/source/Images/genal_logo.png +0 -0
  98. {genal_python-1.2.8 → genal_python-1.3.0}/docs/source/api.rst +0 -0
  99. {genal_python-1.2.8 → genal_python-1.3.0}/docs/source/conf.py +0 -0
  100. {genal_python-1.2.8 → genal_python-1.3.0}/docs/source/index.rst +0 -0
  101. {genal_python-1.2.8 → genal_python-1.3.0}/docs/source/introduction.rst +0 -0
  102. {genal_python-1.2.8 → genal_python-1.3.0}/docs/source/modules.rst +0 -0
  103. {genal_python-1.2.8 → genal_python-1.3.0}/genal/MR.py +0 -0
  104. {genal_python-1.2.8 → genal_python-1.3.0}/genal/MRpresso.py +0 -0
  105. {genal_python-1.2.8 → genal_python-1.3.0}/genal/extract_prs.py +0 -0
  106. {genal_python-1.2.8 → genal_python-1.3.0}/genal/snp_query.py +0 -0
  107. {genal_python-1.2.8 → genal_python-1.3.0}/genal_logo.png +0 -0
  108. {genal_python-1.2.8 → genal_python-1.3.0}/gitignore +0 -0
  109. {genal_python-1.2.8 → genal_python-1.3.0}/readthedocs.yaml +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.3
  Name: genal-python
- Version: 1.2.8
+ Version: 1.3.0
  Summary: A python toolkit for polygenic risk scoring and mendelian randomization.
  Author-email: Cyprien Rivier <riviercyprien@gmail.com>
  Requires-Python: >=3.8
@@ -20,6 +20,7 @@ Requires-Dist: scipy>=1.10.1, <1.11
  Requires-Dist: statsmodels==0.14.0
  Requires-Dist: tqdm==4.66.1
  Requires-Dist: wget==3.2
+ Requires-Dist: fastparquet>=2024.2.0
  Project-URL: Home, https://github.com/CypRiv/genal
 
  [![Python 3.8](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue)](https://www.python.org/downloads/release/python-3100/)
@@ -48,11 +49,34 @@ Project-URL: Home, https://github.com/CypRiv/genal
 
 
  ## Introduction <a name="introduction"></a>
- Genal is a python module designed to make it easy to run genetic risk scores and mendelian randomization analyses. It integrates a collection of tools that facilitate the cleaning of single nucleotide polymorphism data (usually derived from Genome-Wide Association Studies) and enable the execution of clinical population genetic workflows. The functionalities provided by genal include clumping, lifting, association testing, polygenic risk scoring, and Mendelian randomization analyses, all within a single Python module.
-
- The module prioritizes user-friendliness and intuitive operation, aiming to reduce the complexity of data analysis for researchers. Despite its focus on simplicity, Genal does not sacrifice the depth of customization or the precision of analysis. Researchers can expect to maintain analytical rigour while benefiting from the streamlined experience.
-
- Genal draws on concepts from well-established R packages such as TwoSampleMR, MR-Presso, MendelianRandomization, and gwasvcf, adapting their proven methodologies to the Python environment. This approach ensures that users have access to tried and tested techniques with the versatility of Python's data science tools.
+ Genal is a python module designed to make it easy and intuitive to run genetic risk scores and Mendelian Randomization analyses. The functionalities provided by genal include:
+
+ - Data preprocessing and cleaning of variant data (usually from GWAS summary statistics)
+ - Selection of independent genetic instruments through clumping
+ - Polygenic risk score calculations (currently C+T only, soon LDPRED2 and PRScs)
+ - More than 10 Mendelian Randomization methods, including heterogeneity and pleiotropy tests, with parallel processing support:
+ - Inverse Variance Weighted (IVW) methods
+ - MR-Egger methods
+ - Weighted Median methods
+ - Mode methods
+ - MR-PRESSO in parallel
+ - More to come...
+ - SNP-trait association testing
+ - Lifting of genetic data to another genomic build
+ - Variant-phenotype querying with the GWAS Catalog
+
+ ### Key Features
+
+ - **Efficient Parallel Processing**: Parallel computation for bootstrapping-based MR methods and MR-PRESSO significantly reduces computation time compared to the original R packages (up to 85% faster for MR-PRESSO)
+ - **Flexible Data Handling**: Automatic formatting of variant data and summary statistics
+ - **Comprehensive MR Pipeline**: From data preprocessing to sensitivity analyses and plotting in a single package
+ - **Reference Panel Support**: Automatically download and use the latest 1000 Genomes reference panels in builds 37 and 38 with the option to use custom reference panels
+ - **Customizable**: Ability to choose all the parameters, but defaults are set to the most common values
+ - **Proxy SNP Support**: Includes functionality for finding and using proxy SNPs when instruments are missing (for polygenic risk scores, Mendelian Randomization, and association testing)
+
+ The objective of genal is to bring the functionalities of well-established R packages such as TwoSampleMR, MR-Presso, MendelianRandomization, and gwasvcf, in a more user-friendly Python environment. This approach ensures that users have access to tried and tested techniques with the versatility of Python's data science tools.
+
+ This package is still under development, feel free to report any issues or suggest improvements!
 
  <img src="/Genal_flowchart.png" data-canonical-src="/Genal_flowchart.png" style="max-width:100%;" />
 
@@ -65,6 +89,8 @@ Cyprien A. Rivier, Santiago Clocchiatti-Tuozzo, Shufan Huo, Victor Torres-Lopez,
  Bioinformatics Advances 2024.
  doi: https://doi.org/10.1093/bioadv/vbae207
 
+ If you're using methods derived from R packages, such as MR-PRESSO, please also cite the original papers.
+
  ## Requirements for the genal module <a name="paragraph1"></a>
  ***Python 3.8 or later***. https://www.python.org/ <br>
 
@@ -145,16 +171,7 @@ import pandas as pd
  sbp_gwas = pd.read_csv("Evangelou_30224653_SBP.txt", sep=" ")
  sbp_gwas.head(5)
  ```
-
- | MarkerName | Allele1 | Allele2 | Freq1 | Effect | StdErr | P | TotalSampleSize | N_effective |
- |-------------------|---------|---------|-------|--------|--------|---------|-----------------|-------------|
- | 10:100000625:SNP | a | g | 0.5660| 0.0523 | 0.0303 | 0.083940| 738170 | 736847 |
- | 10:100000645:SNP | a | c | 0.7936| 0.0200 | 0.0372 | 0.591100| 738168 | 735018 |
- | 10:100003242:SNP | t | g | 0.8831| 0.1417 | 0.0469 | 0.002526| 738168 | 733070 |
- | 10:100003304:SNP | a | g | 0.9609| 0.0245 | 0.0838 | 0.769800| 737054 | 663809 |
- | 10:100003785:SNP | t | c | 0.6406| -0.0680| 0.0313 | 0.029870| 738169 | 735681 |
-
- We can now load this data into a `genal.Geno` instance. The `genal.Geno` class is the central piece of the package. It is designed to store Single Nucleotide Polymorphisms (SNP) data and make it easy to preprocess and clean.
+ We can now load this data into a `genal.Geno` instance. The `genal.Geno` class is the central piece of the package. It is designed to store Single Nucleotide Polymorphisms (SNP) data and make it easy to preprocess and clean it.
 
  The `genal.Geno` takes as input a pandas dataframe where each row corresponds to a SNP, with columns describing the position and possibly the effect of the SNP for the given trait (SBP in our case). To indicate the names of the columns, the following arguments can be passed:
  - **CHR**: Column name for chromosome. Defaults to `'CHR'`.
@@ -175,15 +192,12 @@ After inspecting the dataframe, we first need to extract the chromosome and posi
 
  ```python
  sbp_gwas[["CHR", "POS", "Filler"]] = sbp_gwas["MarkerName"].str.split(":", expand=True)
- sbp_gwas.head(5)
+ sbp_gwas.head(2)
  ```
  | MarkerName | Allele1 | Allele2 | Freq1 | Effect | StdErr | P | TotalSampleSize | N_effective | CHR | POS | Filler |
  |-------------------|---------|---------|-------|--------|--------|----------|-----------------|-------------|-----|-----------|--------|
  | 10:100000625:SNP | a | g | 0.5660| 0.0523 | 0.0303 | 0.083940 | 738170 | 736847 | 10 | 100000625 | SNP |
  | 10:100000645:SNP | a | c | 0.7936| 0.0200 | 0.0372 | 0.591100 | 738168 | 735018 | 10 | 100000645 | SNP |
- | 10:100003242:SNP | t | g | 0.8831| 0.1417 | 0.0469 | 0.002526 | 738168 | 733070 | 10 | 100003242 | SNP |
- | 10:100003304:SNP | a | g | 0.9609| 0.0245 | 0.0838 | 0.769800 | 737054 | 663809 | 10 | 100003304 | SNP |
- | 10:100003785:SNP | t | c | 0.6406| -0.0680| 0.0313 | 0.029870 | 738169 | 735681 | 10 | 100003785 | SNP |
 
  And it can now be loaded into a `genal.Geno` instance:
 
@@ -220,7 +234,7 @@ By default, and depending on the global preprocessing level (`'None'`, `'Fill'`,
  - Validate the `P` (p-value) column for proper values.
  - Check for no duplicated SNPs based on rsid.
  - Determine if the `BETA` (effect) column contains beta estimates or odds ratios, and log-transform odds ratios if necessary.
- - Create `SNP` column using a reference panel if CHR and POS columns are present.
+ - Create `SNP` column (containing rsids) using a reference panel if CHR and POS columns are present.
  - Create `CHR` and/or `POS` column using a reference panel if `SNP` column is present.
  - Create `NEA` (non-effect allele) column using a reference panel if `EA` (effect allele) column is present.
  - Create the `SE` (standard-error) column if the `BETA` and `P` (p-value) columns are present.
@@ -243,17 +257,17 @@ SBP_Geno.data
  |---------|----|-----|-------|--------|--------|----------|-----|---------|-----------|
  | 0 | A | G | 0.5660| 0.0523| 0.0303 | 0.083940 | 10 | 100000625 | rs7899632 |
  | 1 | A | C | 0.7936| 0.0200| 0.0372 | 0.591100 | 10 | 100000645 | rs61875309 |
- | 2 | T | G | 0.8831| 0.1417| 0.0469 | 0.002526 | 10 | 100003242 | rs12258651 |
- | 3 | A | G | 0.9609| 0.0245| 0.0838 | 0.769800 | 10 | 100003304 | rs72828461 |
- | 4 | T | C | 0.6406| -0.0680| 0.0313 | 0.029870 | 10 | 100003785 | rs1359508 |
- | ... | .. | .. | ... | ... | ... | ... | ... | ... | ... |
- | 7088120 | A | G | 0.9028| -0.0184| 0.0517 | 0.722300 | 9 | 99999468 | rs10981301 |
 
- And we see that the `SNP` column with the rsids has been added based on the reference data.
- You do not need to obtain the 1000 genome reference panel yourself, genal will download it the first time you use it. By default, the reference panel used is the european (eur) one. You can specify another valid reference panel (afr, eas, sas, amr) with the reference_panel argument:
+
+ The `SNP` column with the rsids has been added based on the reference data.
+ You do not need to obtain the 1000 genome reference panel yourself, genal will download it the first time you use it.
+
+ > **Note:**
+ >
+ > By default, the reference panel used is the european (eur) one in build 37. You can specify another valid reference panel (afr, eas, sas, amr) with the reference_panel argument. For instance "AFR_38" for the african reference panel in build 38:
 
  ```python
- SBP_Geno.preprocess_data(preprocessing = 'Fill_delete', reference_panel = "afr")
+ SBP_Geno.preprocess_data(preprocessing = 'Fill_delete', reference_panel = "AFR_37")
  ```
 
  You can also use a custom reference panel by specifying to the reference_panel argument a path to bed/bim/fam (plink v1.9 format) or pgen/pvar/psam files (plink v2.0 format), without the extension.
@@ -264,22 +278,15 @@ Clumping, or C+T: Clumping + Thresholding, is the step at which we select the SN
 
  The SNP-data loaded in a `genal.Geno` instance can be clumped using the `genal.Geno.clump` method. It will return another `genal.Geno` instance containing only the clumped data:
 
-
  ```python
- SBP_clumped = SBP_Geno.clump(p1 = 5e-8, r2 = 0.1, kb = 250, reference_panel = "eur")
+ SBP_clumped = SBP_Geno.clump(p1 = 5e-8, r2 = 0.01, kb = 10000, reference_panel = "EUR_37")
  ```
 
- It will output the number of instruments obtained::
-
- Using the EUR reference panel.
- Warning: 760 top variant IDs missing
- 1545 clumps formed from 73594 top variants.
-
  You can specify the thresholds you want to use for the clumping with the following arguments:
  - `p1`: P-value threshold during clumping. SNPs with a P-value higher than this value are excluded. Defaults to `5e-8`.
- - `r2`: Linkage disequilibrium threshold for the independence check. Takes values between 0 and 1. Defaults to `0.1`.
- - `kb`: Genomic window used for the independence check (the unit is thousands of base-pair positions). Defaults to `250`.
- - `reference_panel`: The reference population used to derive linkage disequilibrium values and select independent SNPs. Defaults to `eur`.
+ - `r2`: Linkage disequilibrium threshold for the independence check. Takes values between 0 and 1. Defaults to `0.01`.
+ - `kb`: Genomic window used for the independence check (the unit is thousands of base-pair positions). Defaults to `10000`.
+ - `reference_panel`: The reference population used to derive linkage disequilibrium values and select independent SNPs. Defaults to `EUR_37`.
 
  ### Polygenic Risk Scoring <a name="paragraph3.4"></a>
 
@@ -302,25 +309,7 @@ The output of the `genal.Geno.prs` method will include how many SNPs were used t
  Extracting SNPs for each chromosome...
  SNPs extracted for chr1.
  SNPs extracted for chr2.
- SNPs extracted for chr3.
- SNPs extracted for chr4.
- SNPs extracted for chr5.
- SNPs extracted for chr6.
- SNPs extracted for chr7.
- SNPs extracted for chr8.
- SNPs extracted for chr9.
- SNPs extracted for chr10.
- SNPs extracted for chr11.
- SNPs extracted for chr12.
- SNPs extracted for chr13.
- SNPs extracted for chr14.
- SNPs extracted for chr15.
- SNPs extracted for chr16.
- SNPs extracted for chr17.
- SNPs extracted for chr18.
- SNPs extracted for chr19.
- SNPs extracted for chr20.
- SNPs extracted for chr21.
+ ...
  SNPs extracted for chr22.
  Merging SNPs extracted from each chromosome...
  Created bed/bim/fam fileset with extracted SNPs: tmp_GENAL/4f4ce6a7_allchr
@@ -379,16 +368,6 @@ To get their association with the outcome trait (instrument-stroke estimates), w
  stroke_gwas = pd.read_csv("GCST90104539_buildGRCh37.tsv",sep="\t")
  ```
 
- We inspect it to determine the column names:
-
- | chromosome | base_pair_location | effect_allele_frequency | beta | standard_error | p_value | odds_ratio | ci_lower | ci_upper | effect_allele | other_allele |
- |------------|--------------------|-------------------------|--------|----------------|---------|------------|----------|----------|---------------|--------------|
- | 5 | 29439275 | 0.3569 | 0.0030 | 0.0070 | 0.6658 | 1.003005 | 0.989337 | 1.016861 | T | C |
- | 5 | 85928892 | 0.0639 | -0.0152| 0.0137 | 0.2686 | 0.984915 | 0.958820 | 1.011720 | T | C |
- | 10 | 128341232 | 0.4613 | 0.0025 | 0.0065 | 0.6998 | 1.002503 | 0.989812 | 1.015357 | T | C |
- | 3 | 62707519 | 0.0536 | 0.0152 | 0.0152 | 0.3177 | 1.015316 | 0.985514 | 1.046019 | T | C |
- | 2 | 80464120 | 0.9789 | 0.0057 | 0.0254 | 0.8223 | 1.005716 | 0.956874 | 1.057052 | T | G |
-
  We load it in a `genal.Geno` instance:
 
  ```python
@@ -420,7 +399,7 @@ Genal will print how many SNPs were successfully found and extracted from the ou
  > Here as well you have the option to use proxies for the instruments that are not present in the outcome data:
  >
  > ```python
- > SBP_clumped.query_outcome(Stroke_geno, proxy = True, reference_panel = "eur", kb = 5000, r2 = 0.6, window_snps = 5000)
+ > SBP_clumped.query_outcome(Stroke_geno, proxy = True, reference_panel = "EUR_37", kb = 5000, r2 = 0.8, window_snps = 5000)
  > ```
  >
  > And genal will print the number of missing instruments that have been proxied:
@@ -24,11 +24,34 @@
 
 
  ## Introduction <a name="introduction"></a>
- Genal is a python module designed to make it easy to run genetic risk scores and mendelian randomization analyses. It integrates a collection of tools that facilitate the cleaning of single nucleotide polymorphism data (usually derived from Genome-Wide Association Studies) and enable the execution of clinical population genetic workflows. The functionalities provided by genal include clumping, lifting, association testing, polygenic risk scoring, and Mendelian randomization analyses, all within a single Python module.
-
- The module prioritizes user-friendliness and intuitive operation, aiming to reduce the complexity of data analysis for researchers. Despite its focus on simplicity, Genal does not sacrifice the depth of customization or the precision of analysis. Researchers can expect to maintain analytical rigour while benefiting from the streamlined experience.
-
- Genal draws on concepts from well-established R packages such as TwoSampleMR, MR-Presso, MendelianRandomization, and gwasvcf, adapting their proven methodologies to the Python environment. This approach ensures that users have access to tried and tested techniques with the versatility of Python's data science tools.
+ Genal is a python module designed to make it easy and intuitive to run genetic risk scores and Mendelian Randomization analyses. The functionalities provided by genal include:
+
+ - Data preprocessing and cleaning of variant data (usually from GWAS summary statistics)
+ - Selection of independent genetic instruments through clumping
+ - Polygenic risk score calculations (currently C+T only, soon LDPRED2 and PRScs)
+ - More than 10 Mendelian Randomization methods, including heterogeneity and pleiotropy tests, with parallel processing support:
+ - Inverse Variance Weighted (IVW) methods
+ - MR-Egger methods
+ - Weighted Median methods
+ - Mode methods
+ - MR-PRESSO in parallel
+ - More to come...
+ - SNP-trait association testing
+ - Lifting of genetic data to another genomic build
+ - Variant-phenotype querying with the GWAS Catalog
+
+ ### Key Features
+
+ - **Efficient Parallel Processing**: Parallel computation for bootstrapping-based MR methods and MR-PRESSO significantly reduces computation time compared to the original R packages (up to 85% faster for MR-PRESSO)
+ - **Flexible Data Handling**: Automatic formatting of variant data and summary statistics
+ - **Comprehensive MR Pipeline**: From data preprocessing to sensitivity analyses and plotting in a single package
+ - **Reference Panel Support**: Automatically download and use the latest 1000 Genomes reference panels in builds 37 and 38 with the option to use custom reference panels
+ - **Customizable**: Ability to choose all the parameters, but defaults are set to the most common values
+ - **Proxy SNP Support**: Includes functionality for finding and using proxy SNPs when instruments are missing (for polygenic risk scores, Mendelian Randomization, and association testing)
+
+ The objective of genal is to bring the functionalities of well-established R packages such as TwoSampleMR, MR-Presso, MendelianRandomization, and gwasvcf, in a more user-friendly Python environment. This approach ensures that users have access to tried and tested techniques with the versatility of Python's data science tools.
+
+ This package is still under development, feel free to report any issues or suggest improvements!
 
  <img src="/Genal_flowchart.png" data-canonical-src="/Genal_flowchart.png" style="max-width:100%;" />
 
@@ -41,6 +64,8 @@ Cyprien A. Rivier, Santiago Clocchiatti-Tuozzo, Shufan Huo, Victor Torres-Lopez,
  Bioinformatics Advances 2024.
  doi: https://doi.org/10.1093/bioadv/vbae207
 
+ If you're using methods derived from R packages, such as MR-PRESSO, please also cite the original papers.
+
  ## Requirements for the genal module <a name="paragraph1"></a>
  ***Python 3.8 or later***. https://www.python.org/ <br>
 
@@ -121,16 +146,7 @@ import pandas as pd
  sbp_gwas = pd.read_csv("Evangelou_30224653_SBP.txt", sep=" ")
  sbp_gwas.head(5)
  ```
-
- | MarkerName | Allele1 | Allele2 | Freq1 | Effect | StdErr | P | TotalSampleSize | N_effective |
- |-------------------|---------|---------|-------|--------|--------|---------|-----------------|-------------|
- | 10:100000625:SNP | a | g | 0.5660| 0.0523 | 0.0303 | 0.083940| 738170 | 736847 |
- | 10:100000645:SNP | a | c | 0.7936| 0.0200 | 0.0372 | 0.591100| 738168 | 735018 |
- | 10:100003242:SNP | t | g | 0.8831| 0.1417 | 0.0469 | 0.002526| 738168 | 733070 |
- | 10:100003304:SNP | a | g | 0.9609| 0.0245 | 0.0838 | 0.769800| 737054 | 663809 |
- | 10:100003785:SNP | t | c | 0.6406| -0.0680| 0.0313 | 0.029870| 738169 | 735681 |
-
- We can now load this data into a `genal.Geno` instance. The `genal.Geno` class is the central piece of the package. It is designed to store Single Nucleotide Polymorphisms (SNP) data and make it easy to preprocess and clean.
+ We can now load this data into a `genal.Geno` instance. The `genal.Geno` class is the central piece of the package. It is designed to store Single Nucleotide Polymorphisms (SNP) data and make it easy to preprocess and clean it.
 
  The `genal.Geno` takes as input a pandas dataframe where each row corresponds to a SNP, with columns describing the position and possibly the effect of the SNP for the given trait (SBP in our case). To indicate the names of the columns, the following arguments can be passed:
  - **CHR**: Column name for chromosome. Defaults to `'CHR'`.
@@ -151,15 +167,12 @@ After inspecting the dataframe, we first need to extract the chromosome and posi
 
  ```python
  sbp_gwas[["CHR", "POS", "Filler"]] = sbp_gwas["MarkerName"].str.split(":", expand=True)
- sbp_gwas.head(5)
+ sbp_gwas.head(2)
  ```
  | MarkerName | Allele1 | Allele2 | Freq1 | Effect | StdErr | P | TotalSampleSize | N_effective | CHR | POS | Filler |
  |-------------------|---------|---------|-------|--------|--------|----------|-----------------|-------------|-----|-----------|--------|
  | 10:100000625:SNP | a | g | 0.5660| 0.0523 | 0.0303 | 0.083940 | 738170 | 736847 | 10 | 100000625 | SNP |
  | 10:100000645:SNP | a | c | 0.7936| 0.0200 | 0.0372 | 0.591100 | 738168 | 735018 | 10 | 100000645 | SNP |
- | 10:100003242:SNP | t | g | 0.8831| 0.1417 | 0.0469 | 0.002526 | 738168 | 733070 | 10 | 100003242 | SNP |
- | 10:100003304:SNP | a | g | 0.9609| 0.0245 | 0.0838 | 0.769800 | 737054 | 663809 | 10 | 100003304 | SNP |
- | 10:100003785:SNP | t | c | 0.6406| -0.0680| 0.0313 | 0.029870 | 738169 | 735681 | 10 | 100003785 | SNP |
 
  And it can now be loaded into a `genal.Geno` instance:
 
@@ -196,7 +209,7 @@ By default, and depending on the global preprocessing level (`'None'`, `'Fill'`,
196
209
  - Validate the `P` (p-value) column for proper values.
197
210
  - Check that there are no duplicated SNPs based on rsid.
198
211
  - Determine if the `BETA` (effect) column contains beta estimates or odds ratios, and log-transform odds ratios if necessary.
199
- - Create `SNP` column using a reference panel if CHR and POS columns are present.
212
+ - Create `SNP` column (containing rsids) using a reference panel if CHR and POS columns are present.
200
213
  - Create `CHR` and/or `POS` column using a reference panel if `SNP` column is present.
201
214
  - Create `NEA` (non-effect allele) column using a reference panel if `EA` (effect allele) column is present.
202
215
  - Create the `SE` (standard-error) column if the `BETA` and `P` (p-value) columns are present.
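
Two of the steps above (log-transforming odds ratios and deriving `SE` from `BETA` and `P`) can be illustrated with a minimal standard-library sketch. It assumes the p-value comes from a two-sided normal (Wald) test; the helper names `log_odds` and `se_from_beta_p` are hypothetical and are not part of the genal API, whose internal implementation may differ.

```python
import math
from statistics import NormalDist


def log_odds(odds_ratio: float) -> float:
    """Convert an odds ratio into a beta (log-odds) estimate."""
    return math.log(odds_ratio)


def se_from_beta_p(beta: float, p: float) -> float:
    """Recover a standard error from an effect estimate and a p-value.

    Assumption: p is a two-sided p-value from a normal (Wald) test,
    so that |z| = |beta| / SE.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    # |z| score whose two-sided tail probability equals p
    z = abs(NormalDist().inv_cdf(p / 2.0))
    return abs(beta) / z


# First SBP row shown earlier: Effect = 0.0523, P = 0.083940
se_first_row = se_from_beta_p(0.0523, 0.083940)
```

Under that assumption, `se_first_row` should land close to the `StdErr` value reported for the same row (0.0303), which is a quick sanity check that the effect, standard error and p-value columns are mutually consistent.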
@@ -219,17 +232,17 @@ SBP_Geno.data
219
232
  |---------|----|-----|-------|--------|--------|----------|-----|---------|-----------|
220
233
  | 0 | A | G | 0.5660| 0.0523| 0.0303 | 0.083940 | 10 | 100000625 | rs7899632 |
221
234
  | 1 | A | C | 0.7936| 0.0200| 0.0372 | 0.591100 | 10 | 100000645 | rs61875309 |
222
- | 2 | T | G | 0.8831| 0.1417| 0.0469 | 0.002526 | 10 | 100003242 | rs12258651 |
223
- | 3 | A | G | 0.9609| 0.0245| 0.0838 | 0.769800 | 10 | 100003304 | rs72828461 |
224
- | 4 | T | C | 0.6406| -0.0680| 0.0313 | 0.029870 | 10 | 100003785 | rs1359508 |
225
- | ... | .. | .. | ... | ... | ... | ... | ... | ... | ... |
226
- | 7088120 | A | G | 0.9028| -0.0184| 0.0517 | 0.722300 | 9 | 99999468 | rs10981301 |
227
235
 
228
- And we see that the `SNP` column with the rsids has been added based on the reference data.
229
- You do not need to obtain the 1000 genome reference panel yourself, genal will download it the first time you use it. By default, the reference panel used is the european (eur) one. You can specify another valid reference panel (afr, eas, sas, amr) with the reference_panel argument:
236
+
237
+ The `SNP` column with the rsids has been added based on the reference data.
238
+ You do not need to obtain the 1000 Genomes reference panel yourself: genal will download it the first time you use it.
239
+
240
+ > **Note:**
241
+ >
242
+ > By default, the reference panel used is the European (eur) one in build 37. You can specify another valid reference panel (afr, eas, sas, amr) with the `reference_panel` argument. For instance, use "AFR_37" for the African reference panel in build 37:
230
243
 
231
244
  ```python
232
- SBP_Geno.preprocess_data(preprocessing = 'Fill_delete', reference_panel = "afr")
245
+ SBP_Geno.preprocess_data(preprocessing = 'Fill_delete', reference_panel = "AFR_37")
233
246
  ```
234
247
 
235
248
  You can also use a custom reference panel by passing to the `reference_panel` argument a path to bed/bim/fam (plink v1.9 format) or pgen/pvar/psam (plink v2.0 format) files, without the extension.
@@ -240,22 +253,15 @@ Clumping, or C+T: Clumping + Thresholding, is the step at which we select the SN
240
253
 
241
254
  The SNP data loaded in a `genal.Geno` instance can be clumped using the `genal.Geno.clump` method. It will return another `genal.Geno` instance containing only the clumped data:
242
255
 
243
-
244
256
  ```python
245
- SBP_clumped = SBP_Geno.clump(p1 = 5e-8, r2 = 0.1, kb = 250, reference_panel = "eur")
257
+ SBP_clumped = SBP_Geno.clump(p1 = 5e-8, r2 = 0.01, kb = 10000, reference_panel = "EUR_37")
246
258
  ```
247
259
 
248
- It will output the number of instruments obtained::
249
-
250
- Using the EUR reference panel.
251
- Warning: 760 top variant IDs missing
252
- 1545 clumps formed from 73594 top variants.
253
-
254
260
  You can specify the thresholds you want to use for the clumping with the following arguments:
255
261
  - `p1`: P-value threshold during clumping. SNPs with a P-value higher than this value are excluded. Defaults to `5e-8`.
256
- - `r2`: Linkage disequilibrium threshold for the independence check. Takes values between 0 and 1. Defaults to `0.1`.
257
- - `kb`: Genomic window used for the independence check (the unit is thousands of base-pair positions). Defaults to `250`.
258
- - `reference_panel`: The reference population used to derive linkage disequilibrium values and select independent SNPs. Defaults to `eur`.
262
+ - `r2`: Linkage disequilibrium threshold for the independence check. Takes values between 0 and 1. Defaults to `0.01`.
263
+ - `kb`: Genomic window used for the independence check (the unit is thousands of base-pair positions). Defaults to `10000`.
264
+ - `reference_panel`: The reference population used to derive linkage disequilibrium values and select independent SNPs. Defaults to `EUR_37`.
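
To make the roles of `p1`, `r2` and `kb` concrete, here is a simplified, pure-Python sketch of greedy clumping. It is not genal's implementation (genal relies on a reference panel to derive linkage disequilibrium); the function `toy_clump`, the SNP names and the pairwise LD values are all hypothetical and only serve to illustrate the logic.

```python
from typing import Dict, FrozenSet, List, Tuple

# (snp_id, chromosome, position in base pairs, p_value)
Variant = Tuple[str, int, int, float]


def toy_clump(
    variants: List[Variant],
    ld_r2: Dict[FrozenSet[str], float],
    p1: float = 5e-8,
    r2: float = 0.01,
    kb: int = 10000,
) -> List[str]:
    """Greedy clumping sketch: keep the most significant SNP, then drop
    any later SNP that is close to and correlated with a kept SNP."""
    # Exclude SNPs with a p-value above p1, most significant first.
    candidates = sorted((v for v in variants if v[3] <= p1), key=lambda v: v[3])
    kept: List[Variant] = []
    for snp, chrom, pos, pval in candidates:
        correlated = any(
            chrom == k_chrom
            and abs(pos - k_pos) <= kb * 1000  # kb is in thousands of base pairs
            and ld_r2.get(frozenset((snp, k_snp)), 0.0) > r2
            for k_snp, k_chrom, k_pos, _ in kept
        )
        if not correlated:
            kept.append((snp, chrom, pos, pval))
    return [v[0] for v in kept]


# Hypothetical variants and pairwise LD (r2) values; missing pairs count as 0.
variants = [
    ("snpA", 1, 1_000_000, 1e-12),
    ("snpB", 1, 1_050_000, 1e-10),  # in strong LD with snpA
    ("snpC", 1, 1_200_000, 1e-9),   # near snpA but almost uncorrelated
    ("snpD", 2, 5_000_000, 3e-9),   # different chromosome
    ("snpE", 2, 5_010_000, 1e-3),   # fails the p1 threshold
]
ld_r2 = {
    frozenset(("snpA", "snpB")): 0.8,
    frozenset(("snpA", "snpC")): 0.001,
}
clumped = toy_clump(variants, ld_r2)
```

In this toy example, `snpB` is dropped because it is in LD with the more significant `snpA`, and `snpE` is dropped by the `p1` filter, leaving `snpA`, `snpC` and `snpD`. Lowering `r2` or widening `kb` makes the independence check stricter and generally yields fewer instruments.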
259
265
 
260
266
  ### Polygenic Risk Scoring <a name="paragraph3.4"></a>
261
267
 
@@ -278,25 +284,7 @@ The output of the `genal.Geno.prs` method will include how many SNPs were used t
278
284
  Extracting SNPs for each chromosome...
279
285
  SNPs extracted for chr1.
280
286
  SNPs extracted for chr2.
281
- SNPs extracted for chr3.
282
- SNPs extracted for chr4.
283
- SNPs extracted for chr5.
284
- SNPs extracted for chr6.
285
- SNPs extracted for chr7.
286
- SNPs extracted for chr8.
287
- SNPs extracted for chr9.
288
- SNPs extracted for chr10.
289
- SNPs extracted for chr11.
290
- SNPs extracted for chr12.
291
- SNPs extracted for chr13.
292
- SNPs extracted for chr14.
293
- SNPs extracted for chr15.
294
- SNPs extracted for chr16.
295
- SNPs extracted for chr17.
296
- SNPs extracted for chr18.
297
- SNPs extracted for chr19.
298
- SNPs extracted for chr20.
299
- SNPs extracted for chr21.
287
+ ...
300
288
  SNPs extracted for chr22.
301
289
  Merging SNPs extracted from each chromosome...
302
290
  Created bed/bim/fam fileset with extracted SNPs: tmp_GENAL/4f4ce6a7_allchr
@@ -355,16 +343,6 @@ To get their association with the outcome trait (instrument-stroke estimates), w
355
343
  stroke_gwas = pd.read_csv("GCST90104539_buildGRCh37.tsv",sep="\t")
356
344
  ```
357
345
 
358
- We inspect it to determine the column names:
359
-
360
- | chromosome | base_pair_location | effect_allele_frequency | beta | standard_error | p_value | odds_ratio | ci_lower | ci_upper | effect_allele | other_allele |
361
- |------------|--------------------|-------------------------|--------|----------------|---------|------------|----------|----------|---------------|--------------|
362
- | 5 | 29439275 | 0.3569 | 0.0030 | 0.0070 | 0.6658 | 1.003005 | 0.989337 | 1.016861 | T | C |
363
- | 5 | 85928892 | 0.0639 | -0.0152| 0.0137 | 0.2686 | 0.984915 | 0.958820 | 1.011720 | T | C |
364
- | 10 | 128341232 | 0.4613 | 0.0025 | 0.0065 | 0.6998 | 1.002503 | 0.989812 | 1.015357 | T | C |
365
- | 3 | 62707519 | 0.0536 | 0.0152 | 0.0152 | 0.3177 | 1.015316 | 0.985514 | 1.046019 | T | C |
366
- | 2 | 80464120 | 0.9789 | 0.0057 | 0.0254 | 0.8223 | 1.005716 | 0.956874 | 1.057052 | T | G |
367
-
368
346
  We load it in a `genal.Geno` instance:
369
347
 
370
348
  ```python
@@ -396,7 +374,7 @@ Genal will print how many SNPs were successfully found and extracted from the ou
396
374
  > Here as well you have the option to use proxies for the instruments that are not present in the outcome data:
397
375
  >
398
376
  > ```python
399
- > SBP_clumped.query_outcome(Stroke_geno, proxy = True, reference_panel = "eur", kb = 5000, r2 = 0.6, window_snps = 5000)
377
+ > SBP_clumped.query_outcome(Stroke_geno, proxy = True, reference_panel = "EUR_37", kb = 5000, r2 = 0.8, window_snps = 5000)
400
378
  > ```
401
379
  >
402
380
  > And genal will print the number of missing instruments that have been proxied: