PyPI - genal-python - Versions diffs - 0.7__tar.gz → 0.9__tar.gz - Mend

genal-python 0.7.tar.gz → 0.9.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (128)

genal_python-0.9/.DS_Store ADDED

Binary file

{genal_python-0.7 → genal_python-0.9}/.gitignore RENAMED

@@ -3,4 +3,4 @@ dist/
 .ipynb_checkpoints/
 ipynb_checkpoints/
 genal/.ipynb_checkpoints/
-docs/
+test_data/

genal_python-0.9/.readthedocs.yaml ADDED

@@ -0,0 +1,22 @@
+# .readthedocs.yaml
+# Read the Docs configuration file
+# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
+# Required
+version: 2
+# Set the version of Python and other tools you might need
+build:
+  os: ubuntu-22.04
+  tools:
+    python: "3.11"
+# Build documentation in the docs/ directory with Sphinx
+sphinx:
+  configuration: docs/source/conf.py
+# We recommend specifying your dependencies to enable reproducible builds:
+# https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
+python:
+  install:
+  - requirements: docs/requirements.txt

{genal_python-0.7 → genal_python-0.9}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: genal-python
-Version: 0.7
+Version: 0.9
 Summary: A python toolkit for polygenic risk scoring and mendelian randomization.
 Author-email: Cyprien Rivier <riviercyprien@gmail.com>
 Requires-Python: >=3.7
@@ -17,7 +17,6 @@ Requires-Dist: psutil==5.9.1
 Requires-Dist: pyliftover==0.4
 Requires-Dist: scikit_learn>=1.3.0
 Requires-Dist: scipy>=1.11.4
-Requires-Dist: sphinx_rtd_theme==1.3.0
 Requires-Dist: statsmodels==0.14.0
 Requires-Dist: tqdm==4.66.1
 Requires-Dist: wget==3.2
@@ -32,10 +31,11 @@ Project-URL: Home, https://github.com/CypRiv/genal
 # Table of contents
 1. [Introduction](#introduction)
-2. [Citation] (#citation)
+2. [Citation](#citation)
 3. [Requirements for the genal module](#paragraph1)
 4. [Installation and how to use genal](#paragraph2)
     1. [Installation](#paragraph2.1)
+    2. [Documentation](#paragraph2.2)
 5. [Tutorial and presentation of the main tools](#paragraph3)
     1. [Data loading](#paragraph3.1)
     2. [Data preprocessing](#paragraph3.2)
@@ -44,6 +44,7 @@ Project-URL: Home, https://github.com/CypRiv/genal
     5. [Mendelian Randomization](#paragraph3.5)
     6. [SNP-association testing](#paragraph3.6)
     7. [Lifting](#paragraph3.7)
+    8. [GWAS Catalog](#paragraph3.8)
 ## Introduction <a name="introduction"></a>
@@ -58,18 +59,27 @@ If you're using genal, please cite the following paper:
 **Genal: A Python Toolkit for Genetic Risk Scoring and Mendelian Randomization.** Cyprien A. Rivier, Santiago Clocchiatti-Tuozzo, Shufan Huo, Victor Torres-Lopez, Daniela Renedo, Kevin N. Sheth, Guido J. Falcone, Julian N. Acosta. medRxiv 2024.05.23.24307776; doi: https://doi.org/10.1101/2024.05.23.24307776
 ## Requirements for the genal module <a name="paragraph1"></a>
-***Python 3.9 or later***. https://www.python.org/ <br>
+***Python 3.11 or later***. https://www.python.org/ <br>
 ## Installation and How to use the genal module <a name="paragraph2"></a>
 ### Installation <a name="paragraph2.1"></a>
+> **Note:**
+>
+> **Optional**: It is recommended to create a new environment to avoid dependencies conflicts. Here, we create a new conda environment called 'genal_env'.
+> ```
+> conda create --name genal_env python=3.11
+> conda activate genal_env
+> ```
 Download and install the package with pip:
 ```
 pip install genal-python
 ```
-And it can be imported in a python environment with:
+And import it in a python environment with:
 ```python
 import genal
 ```
@@ -80,6 +90,16 @@ Once downloaded, the path to the plink executable can be set with:
 ```
 genal.set_plink(path="/path/to/plink/executable/file")
 ```
+### Documentation <a name="paragraph2.2"></a>
+For detailed information on how to use the functionalities of Genal, please refer to the documentation: https://genal.rtfd.io
+The documentation covers:
+- Installation
+- This tutorial
+- The list of the main functions with complete description of their arguments
+- An exhaustive API reference
 ## Tutorial <a name="paragraph3"></a>
 For this tutorial, we will obtain genetic instruments for systolic blood pressure (SBP), compute a Polygenic Risk Score (PRS), and run a Mendelian Randomization analysis to investigate the genetically-determined effect of SBP on the risk of stroke. We will utilize summary statistics from Genome-Wide Association Studies (GWAS) and individual-level data from the UK Biobank. The steps include:
@@ -100,11 +120,11 @@ For this tutorial, we will obtain genetic instruments for systolic blood pressur
   - Data lifting to another genomic build
     - In pure Python
     - Using LiftOver
-  - Phenoscanner (to be added)
+  - Querying the GWAS Catalog
 ### Data loading <a name="paragraph3.1"></a>
-We begin with publicly available summary statistics from a large GWAS study of systolic blood pressure. [Link to study](https://www.nature.com/articles/s41588-018-0205-x). After downloading and unzipping the summary statistics, we load them into a pandas DataFrame:
+We start this tutorial with publicly available summary statistics from a large GWAS study of systolic blood pressure. [Link to study](https://www.nature.com/articles/s41588-018-0205-x). After downloading and unzipping the summary statistics, we load them into a pandas DataFrame:
 ```python
 import pandas as pd
@@ -133,6 +153,10 @@ The `genal.Geno` takes as input a pandas dataframe where each row corresponds to
 - **P**: Column name for effect p-value. Defaults to `'P'`.
 - **EAF**: Column name for effect allele frequency. Defaults to `'EAF'`.
+> **Note:**
+>
+> You do not need all columns to move forward, as not all columns are required by every function. Additionally, some columns can be imputed as we will see in the next paragraph.
 After inspecting the dataframe, we first need to extract the chromosome and position information from the `MarkerName` column into two new columns `CHR` and `POS`:
 ```python
@@ -158,7 +182,7 @@ The last argument (`keep_columns = False`) indicates that we do not wish to keep
 > **Note:**
 >
-> Make sure to read the readme file usually provided with the summary statistics to identify the correct columns. It is particularly important to correctly identify the allele that represents the effect allele. Also, you do not need all columns to move forward, as some can be inputted as we will see next.
+> Make sure to read the readme file usually provided with the summary statistics to identify the correct columns. It is particularly important to correctly identify the allele that represents the effect allele.
 ### Data preprocessing <a name="paragraph3.2"></a>
@@ -222,7 +246,7 @@ You can also use a custom reference panel by specifying to the reference_panel a
 ### Clumping <a name="paragraph3.3"></a>
-Clumping is the step at which we select the SNPs that will be used as our genetic instruments in future Polygenic Risk Scores and Mendelian Randomization analyses. The process involves identifying the SNPs that are strongly associated with our trait of interest (systolic blood pressure in this tutorial) and are independent from each other. This second step ensures that selected SNPs are not highly correlated, (i.e., they are not in high linkage disequilibrium). For this step, we again need to use a reference panel.
+Clumping, or C+T: Clumping + Thresholding, is the step at which we select the SNPs that will be used as our genetic instruments in future Polygenic Risk Scores and Mendelian Randomization analyses. The process involves identifying the SNPs that are strongly associated with our trait of interest (systolic blood pressure in this tutorial) and are independent from each other. This second step ensures that selected SNPs are not highly correlated, (i.e., they are not in high linkage disequilibrium). For this step, we again need to use a reference panel.
 The SNP-data loaded in a `genal.Geno` instance can be clumped using the `genal.Geno.clump` method. It will return another `genal.Geno` instance containing only the clumped data:
@@ -294,7 +318,7 @@ The output of the `genal.Geno.prs` method will include how many SNPs were used t
 Here, we see that about half of the SNPs were not extracted from the data. In such cases, we may want to try and salvage some of these SNPs by looking for proxies (SNPs in high linkage disequilibrium, i.e. highly correlated SNPs). This can be done by specifying the `proxy = True`. argument:
 ```python
-SBP_clumped.prs(name = "SBP_prs" ,path = "Pop_chr$", proxy = True, reference_panel = "eur", r2=0.8, kb=5000, window_snps=5000)
+SBP_clumped.prs(name = "SBP_prs_proxy" ,path = "Pop_chr$", proxy = True, reference_panel = "eur", r2=0.8, kb=5000, window_snps=5000)
 ```
 and the output is:
@@ -337,7 +361,7 @@ and the output is:
     The PRS computation was successful and used 1330/1538 (86.476%) SNPs.
     PRS data saved to SBP_prs.csv
-In our case, we have been able to find proxies for 571 of the 786 SNPs that were missing in the population genetic data (7 potential proxies have been removed because they were identical to SNPs already present in our data).
+In our case, we have been able to find proxies for 578 of the 786 SNPs that were missing in the population genetic data (7 potential proxies have been removed because they were identical to SNPs already present in our data).
 You can customize how the proxies are chosen with the following arguments:
 - `reference_panel`: The reference population used to derive linkage disequilibrium values and find proxies. Defaults to `eur`.
@@ -347,7 +371,7 @@ You can customize how the proxies are chosen with the following arguments:
 > **Note:**
 >
-> You can call the `genal.Geno.prs` method on any `Geno` instance (containing at least the EA, BETA, and either SNP or CHR/POS columns). The data does not need to be clumped, and there is no limit to the number of instruments used to compute the scores.
+> You can call the `genal.Geno.prs` method on any `Geno` instance (containing at least the EA, BETA, and either SNP or CHR/POS columns). The data does not need to be clumped, and there is no limit to the number of SNPs used to compute the scores.
 ### Mendelian Randomization <a name="paragraph3.5"></a>
@@ -395,21 +419,25 @@ Genal will print how many SNPs were successfully found and extracted from the ou
     1541 SNPs out of 1545 are present in the outcome data.
     (Exposure data, Outcome data, Outcome name) stored in the .MR_data attribute.
-Here as well you have the option to use proxies for the instruments that are not present in the outcome data:
-```python
-SBP_clumped.query_outcome(Stroke_geno, proxy = True, reference_panel = "eur", kb = 5000, r2 = 0.6, window_snps = 5000)
-```
-And genal will print the number of missing instruments which have been proxied:
-    Outcome data successfully loaded from 'b352e412' geno instance.
-    Identifying the exposure SNPs present in the outcome data...
-    1541 SNPs out of 1545 are present in the outcome data.
-    Searching proxies for 4 SNPs...
-    Using the EUR reference panel.
-    Found proxies for 4 SNPs.
-    (Exposure data, Outcome data, Outcome name) stored in the .MR_data attribute.
+> **Note:**
+>Here as well you have the option to use proxies for the instruments that are not present in the outcome data:
+>
+> Here as well you have the option to use proxies for the instruments that are not present in the outcome data:
+>
+> ```python
+> SBP_clumped.query_outcome(Stroke_geno, proxy = True, reference_panel = "eur", kb = 5000, r2 = 0.6, window_snps = 5000)
+> ```
+>
+> And genal will print the number of missing instruments that have been proxied:
+>
+>     Outcome data successfully loaded from 'b352e412' geno instance.
+>     Identifying the exposure SNPs present in the outcome data...
+>     1541 SNPs out of 1545 are present in the outcome data.
+>     Searching proxies for 4 SNPs...
+>     Using the EUR reference panel.
+>     Found proxies for 4 SNPs.
+>     (Exposure data, Outcome data, Outcome name) stored in the .MR_data attribute.
 After extracting the instruments from the outcome data, the `SBP_clumped` `genal.Geno` instance contains an `MR_data` attribute containing the instruments-exposure and instruments-outcome associations necessary to run MR. Running MR is now as simple as calling the `genal.Geno.MR` method of the SBP_clumped `genal.Geno` instance:
@@ -454,7 +482,7 @@ By default, only some MR methods (inverse-variance weighted, weighted median, Si
 - `Weighted-mode` for the Weighted mode method
 - `all` to run all the above methods
-For more fine-tuning, such as settings for the number of boostrapping iterations, please refer to the API.
+For more fine-tuning, such as settings for the number of boostrapping iterations, please refer to the API: [https://genal.readthedocs.io/en/latest/modules.html#id4](MR method).
 If you want to visualize the obtained MR results, you can use the `genal.Geno.MR_plot` method that will plot each SNP in an `effect_on_exposure x effect_on_outcome` plane as well as lines corresponding to different MR methods:
@@ -462,7 +490,7 @@ If you want to visualize the obtained MR results, you can use the `genal.Geno.MR
 SBP_clumped.MR_plot(filename="MR_plot_SBP_AS")
 ```
-![MR plot](docs/Images/MR_plot_SBP_AS.png)
+![MR plot](docs/build/_images/MR_plot_SBP_AS.png)
 You can select which MR methods you wish to plot with the `methods` argument. Note that for an MR method to be plotted, they must be included in the latest `genal.Geno.MR` call of this `genal.Geno` instance.
 If you wish to include the heterogeneity values (Cochran's Q) in the results, you can use the heterogeneity argument in the `genal.Geno.MR` call. Here, the heterogeneity for the inverse-variance weighted method:
@@ -507,7 +535,7 @@ df_pheno = pd.read_csv("path/to/trait/data")
 > **Note:**
 >
-> One important detail is to make sure that the individual IDs are identical between the phenotypic data and the genetic data for the target population.
+>    One important point is to make sure that the IDs of the participants are identical in the phenotypic data and in the genetic data.
 Then, it is advised to make a copy of the `genal.Geno` instance containing our instruments as we are going to update their coefficients and to avoid any confusion:
@@ -585,6 +613,54 @@ You can specify the path of the LiftOver executable to the `liftover_path` argum
 SBP_Geno.lift(start = "hg19", end = "hg38", replace = False, liftover_path = "path/to/liftover/exec")
 ```
+### GWAS Catalog <a name="paragraph3.8"></a>
+It is sometimes interesting to determine the traits associated with our SNPs. In Mendelian Randomization, for instance, we may want to exclude instruments that are associated with traits likely causing horizontal pleiotropy. For this purpose, we can use the `genal.Geno.query_gwas_catalog` method. This method will query the GWAS Catalog API to determine the list of traits associated with each of our SNPs and store the results in a list in the `ASSOC` column of the `.data` attribute:
+```python
+SBP_clumped.query_gwas_catalog(p_threshold=5e-8)
+```
+Which will output:
+    Querying the GWAS Catalog and creating the ASSOC column.
+    Only associations with a p-value <= 5e-08 are reported. Use the p_threshold argument to change the threshold.
+    To report the p-value for each association, use return_p=True.
+    To report the study ID for each association, use return_study=True.
+    The .data attribute will be modified. Use replace=False to leave it as is.
+    100%|██████████| 1545/1545 [00:34<00:00, 44.86it/s]
+    The ASSOC column has been successfully created.
+    701 (45.37%) SNPs failed to query (not found in GWAS Catalog) and 7 (0.5%) SNPs timed out after 34.33 seconds. You can increase the timeout value with the timeout argument.
+| EA  | NEA | EAF   | BETA   | SE     | CHR | POS        | SNP        | ASSOC                                                                 |
+|-----|-----|-------|--------|--------|-----|------------|------------|------------------------------------------------------------------------|
+| A   | G   | 0.1784| 0.2330 | 0.0402 | 10  | 102075479  | rs603424   | [eicosanoids measurement, decadienedioic acid (...]                     |
+| A   | G   | 0.0706| -0.3873| 0.0626 | 10  | 102403682  | rs2996303  | FAILED_QUERY                                                           |
+| T   | G   | 0.8872| 0.6846 | 0.0480 | 10  | 102553647  | rs1006545  | [diastolic blood pressure, systolic blood pressure...]                  |
+| T   | G   | 0.6652| -0.2098| 0.0340 | 10  | 102558506  | rs12570050 | FAILED_QUERY                                                           |
+| T   | C   | 0.3057| -0.2448| 0.0334 | 10  | 102603924  | rs4919502  | FAILED_QUERY                                                           |
+| ... | ... | ...   | ...    | ...    | ... | ...        | ...        | ...                                                                    |                                          |
+| T   | C   | 0.3514| 0.2203 | 0.0314 | 9   | 9350706    | rs1332813  | [diastolic blood pressure, systolic blood pressure...]                  |
+| T   | C   | 0.6880| -0.1897| 0.0332 | 9   | 94201341   | rs10820855 | FAILED_QUERY                                                           |
+| A   | T   | 0.3669| -0.1862| 0.0313 | 9   | 95201540   | rs7045409  | [protein measurement, pulse pressure measurement...]                   |
+If you are also interested in the p-values of each SNP-trait association, or the ID of the study from which the association was reported, you can use the `return_p = True` and `return_study = True` arguments. Then, the `ASSOC` column will contain a list of tuples, where each tuple contains the trait name, the p-value, and the study ID:
+```python
+SBP_clumped.query_gwas_catalog(p_threshold=5e-8, return_p=True, return_study=True)
+```
+| EA  | NEA | EAF   | BETA   | SE     | CHR | POS        | SNP        | ASSOC                                                                 |
+|-----|-----|-------|--------|--------|-----|------------|------------|------------------------------------------------------------------------|
+| A   | G   | 0.1784| 0.2330 | 0.0402 | 10  | 102075479  | rs603424   | TIMEOUT                                                                |
+| A   | G   | 0.0706| -0.3873| 0.0626 | 10  | 102403682  | rs2996303  | FAILED_QUERY                                                           |
+| T   | G   | 0.8872| 0.6846 | 0.0480 | 10  | 102553647  | rs1006545  | [(heart rate response to exercise, 6e-12, GCST...                      |
+| T   | G   | 0.6652| -0.2098| 0.0340 | 10  | 102558506  | rs12570050 | FAILED_QUERY                                                           |
+| T   | C   | 0.3057| -0.2448| 0.0334 | 10  | 102603924  | rs4919502  | FAILED_QUERY                                                           |
+| ... | ... | ...   | ...    | ...    | ... | ...        | ...        | ...                                                                    |                                                         |
+| T   | C   | 0.3514| 0.2203 | 0.0314 | 9   | 9350706    | rs1332813  | [(diastolic blood pressure, 1e-12, GCST9031029...                      |
+| T   | C   | 0.6880| -0.1897| 0.0332 | 9   | 94201341   | rs10820855 | FAILED_QUERY                                                           |
+| A   | T   | 0.3669| -0.1862| 0.0313 | 9   | 95201540   | rs7045409  | [(systolic blood pressure, 9e-13, GCST006624),...                      |
+> **Note:**
+>
+> As you can see, many SNPs failed to be queried. This is normal as the GWAS Catalog is not exhaustive.
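
The GWAS Catalog section added in this hunk says the `ASSOC` column can be used to exclude instruments associated with traits likely to cause horizontal pleiotropy, but it does not show that filtering step. The following is a minimal sketch of one way it might be done in plain Python, not part of genal itself. It assumes each `ASSOC` entry is either a list of trait names, a list of `(trait, p-value, study ID)` tuples, or one of the strings `FAILED_QUERY` / `TIMEOUT`, as in the tables above. The helper names (`traits_of`, `split_by_pleiotropy`), the sample rows, and the choice of unwanted trait are hypothetical and only illustrative.

```python
# Sentinel strings shown in the tutorial tables for SNPs without usable results.
SENTINELS = {"FAILED_QUERY", "TIMEOUT"}


def traits_of(assoc):
    """Return the set of lower-cased trait names found in one ASSOC entry."""
    if isinstance(assoc, str):
        # A bare string is a sentinel, so there are no traits to report.
        return set()
    traits = set()
    for item in assoc:
        # With return_p / return_study the entry is a tuple whose first
        # element is the trait name; otherwise it is the trait name itself.
        name = item[0] if isinstance(item, tuple) else item
        traits.add(name.lower())
    return traits


def split_by_pleiotropy(rows, unwanted):
    """Partition SNP ids into (kept, flagged, unknown).

    kept:    queried successfully, no unwanted trait found
    flagged: associated with at least one unwanted trait
    unknown: the query failed or timed out, so nothing can be concluded
    """
    unwanted = {t.lower() for t in unwanted}
    kept, flagged, unknown = [], [], []
    for row in rows:
        assoc = row["ASSOC"]
        if isinstance(assoc, str) and assoc in SENTINELS:
            unknown.append(row["SNP"])
        elif traits_of(assoc) & unwanted:
            flagged.append(row["SNP"])
        else:
            kept.append(row["SNP"])
    return kept, flagged, unknown


# Hypothetical rows mixing the two ASSOC formats shown above; only the trait
# names visible in the (truncated) tutorial tables are used.
rows = [
    {"SNP": "rs603424", "ASSOC": ["eicosanoids measurement"]},
    {"SNP": "rs2996303", "ASSOC": "FAILED_QUERY"},
    {"SNP": "rs1006545", "ASSOC": ["diastolic blood pressure", "systolic blood pressure"]},
    {"SNP": "rs7045409", "ASSOC": [("systolic blood pressure", 9e-13, "GCST006624")]},
]

# The unwanted trait is an arbitrary illustration, not a recommendation.
kept, flagged, unknown = split_by_pleiotropy(rows, {"eicosanoids measurement"})
print("kept:", kept)
print("flagged:", flagged)
print("unknown:", unknown)
```

SNPs whose query failed or timed out are reported separately rather than silently kept, because the absence of a result is not evidence that the SNP has no pleiotropic associations.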

{genal_python-0.7 → genal_python-0.9}/README.md RENAMED

@@ -7,10 +7,11 @@
 # Table of contents
 1. [Introduction](#introduction)
-2. [Citation] (#citation)
+2. [Citation](#citation)
 3. [Requirements for the genal module](#paragraph1)
 4. [Installation and how to use genal](#paragraph2)
     1. [Installation](#paragraph2.1)
+    2. [Documentation](#paragraph2.2)
 5. [Tutorial and presentation of the main tools](#paragraph3)
     1. [Data loading](#paragraph3.1)
     2. [Data preprocessing](#paragraph3.2)
@@ -19,6 +20,7 @@
     5. [Mendelian Randomization](#paragraph3.5)
     6. [SNP-association testing](#paragraph3.6)
     7. [Lifting](#paragraph3.7)
+    8. [GWAS Catalog](#paragraph3.8)
 ## Introduction <a name="introduction"></a>
@@ -33,18 +35,27 @@ If you're using genal, please cite the following paper:
 **Genal: A Python Toolkit for Genetic Risk Scoring and Mendelian Randomization.** Cyprien A. Rivier, Santiago Clocchiatti-Tuozzo, Shufan Huo, Victor Torres-Lopez, Daniela Renedo, Kevin N. Sheth, Guido J. Falcone, Julian N. Acosta. medRxiv 2024.05.23.24307776; doi: https://doi.org/10.1101/2024.05.23.24307776
 ## Requirements for the genal module <a name="paragraph1"></a>
-***Python 3.9 or later***. https://www.python.org/ <br>
+***Python 3.11 or later***. https://www.python.org/ <br>
 ## Installation and How to use the genal module <a name="paragraph2"></a>
 ### Installation <a name="paragraph2.1"></a>
+> **Note:**
+>
+> **Optional**: It is recommended to create a new environment to avoid dependencies conflicts. Here, we create a new conda environment called 'genal_env'.
+> ```
+> conda create --name genal_env python=3.11
+> conda activate genal_env
+> ```
 Download and install the package with pip:
 ```
 pip install genal-python
 ```
-And it can be imported in a python environment with:
+And import it in a python environment with:
 ```python
 import genal
 ```
@@ -55,6 +66,16 @@ Once downloaded, the path to the plink executable can be set with:
 ```
 genal.set_plink(path="/path/to/plink/executable/file")
 ```
+### Documentation <a name="paragraph2.2"></a>
+For detailed information on how to use the functionalities of Genal, please refer to the documentation: https://genal.rtfd.io
+The documentation covers:
+- Installation
+- This tutorial
+- The list of the main functions with complete description of their arguments
+- An exhaustive API reference
 ## Tutorial <a name="paragraph3"></a>
 For this tutorial, we will obtain genetic instruments for systolic blood pressure (SBP), compute a Polygenic Risk Score (PRS), and run a Mendelian Randomization analysis to investigate the genetically-determined effect of SBP on the risk of stroke. We will utilize summary statistics from Genome-Wide Association Studies (GWAS) and individual-level data from the UK Biobank. The steps include:
@@ -75,11 +96,11 @@ For this tutorial, we will obtain genetic instruments for systolic blood pressur
   - Data lifting to another genomic build
     - In pure Python
     - Using LiftOver
-  - Phenoscanner (to be added)
+  - Querying the GWAS Catalog
 ### Data loading <a name="paragraph3.1"></a>
-We begin with publicly available summary statistics from a large GWAS study of systolic blood pressure. [Link to study](https://www.nature.com/articles/s41588-018-0205-x). After downloading and unzipping the summary statistics, we load them into a pandas DataFrame:
+We start this tutorial with publicly available summary statistics from a large GWAS study of systolic blood pressure. [Link to study](https://www.nature.com/articles/s41588-018-0205-x). After downloading and unzipping the summary statistics, we load them into a pandas DataFrame:
 ```python
 import pandas as pd
@@ -108,6 +129,10 @@ The `genal.Geno` takes as input a pandas dataframe where each row corresponds to
 - **P**: Column name for effect p-value. Defaults to `'P'`.
 - **EAF**: Column name for effect allele frequency. Defaults to `'EAF'`.
+> **Note:**
+>
+> You do not need all columns to move forward, as not all columns are required by every function. Additionally, some columns can be imputed as we will see in the next paragraph.
 After inspecting the dataframe, we first need to extract the chromosome and position information from the `MarkerName` column into two new columns `CHR` and `POS`:
 ```python
@@ -133,7 +158,7 @@ The last argument (`keep_columns = False`) indicates that we do not wish to keep
 > **Note:**
 >
-> Make sure to read the readme file usually provided with the summary statistics to identify the correct columns. It is particularly important to correctly identify the allele that represents the effect allele. Also, you do not need all columns to move forward, as some can be inputted as we will see next.
+> Make sure to read the readme file usually provided with the summary statistics to identify the correct columns. It is particularly important to correctly identify the allele that represents the effect allele.
 ### Data preprocessing <a name="paragraph3.2"></a>
@@ -197,7 +222,7 @@ You can also use a custom reference panel by specifying to the reference_panel a
 ### Clumping <a name="paragraph3.3"></a>
-Clumping is the step at which we select the SNPs that will be used as our genetic instruments in future Polygenic Risk Scores and Mendelian Randomization analyses. The process involves identifying the SNPs that are strongly associated with our trait of interest (systolic blood pressure in this tutorial) and are independent from each other. This second step ensures that selected SNPs are not highly correlated, (i.e., they are not in high linkage disequilibrium). For this step, we again need to use a reference panel.
+Clumping, or C+T: Clumping + Thresholding, is the step at which we select the SNPs that will be used as our genetic instruments in future Polygenic Risk Scores and Mendelian Randomization analyses. The process involves identifying the SNPs that are strongly associated with our trait of interest (systolic blood pressure in this tutorial) and are independent from each other. This second step ensures that selected SNPs are not highly correlated, (i.e., they are not in high linkage disequilibrium). For this step, we again need to use a reference panel.
 The SNP-data loaded in a `genal.Geno` instance can be clumped using the `genal.Geno.clump` method. It will return another `genal.Geno` instance containing only the clumped data:
@@ -269,7 +294,7 @@ The output of the `genal.Geno.prs` method will include how many SNPs were used t
 Here, we see that about half of the SNPs were not extracted from the data. In such cases, we may want to try and salvage some of these SNPs by looking for proxies (SNPs in high linkage disequilibrium, i.e. highly correlated SNPs). This can be done by specifying the `proxy = True`. argument:
 ```python
-SBP_clumped.prs(name = "SBP_prs" ,path = "Pop_chr$", proxy = True, reference_panel = "eur", r2=0.8, kb=5000, window_snps=5000)
+SBP_clumped.prs(name = "SBP_prs_proxy" ,path = "Pop_chr$", proxy = True, reference_panel = "eur", r2=0.8, kb=5000, window_snps=5000)
 ```
 and the output is:
@@ -312,7 +337,7 @@ and the output is:
     The PRS computation was successful and used 1330/1538 (86.476%) SNPs.
     PRS data saved to SBP_prs.csv
-In our case, we have been able to find proxies for 571 of the 786 SNPs that were missing in the population genetic data (7 potential proxies have been removed because they were identical to SNPs already present in our data).
+In our case, we have been able to find proxies for 578 of the 786 SNPs that were missing in the population genetic data (7 potential proxies have been removed because they were identical to SNPs already present in our data).
 You can customize how the proxies are chosen with the following arguments:
 - `reference_panel`: The reference population used to derive linkage disequilibrium values and find proxies. Defaults to `eur`.
@@ -322,7 +347,7 @@ You can customize how the proxies are chosen with the following arguments:
 > **Note:**
 >
-> You can call the `genal.Geno.prs` method on any `Geno` instance (containing at least the EA, BETA, and either SNP or CHR/POS columns). The data does not need to be clumped, and there is no limit to the number of instruments used to compute the scores.
+> You can call the `genal.Geno.prs` method on any `Geno` instance (containing at least the EA, BETA, and either SNP or CHR/POS columns). The data does not need to be clumped, and there is no limit to the number of SNPs used to compute the scores.
 ### Mendelian Randomization <a name="paragraph3.5"></a>
@@ -370,21 +395,25 @@ Genal will print how many SNPs were successfully found and extracted from the ou
     1541 SNPs out of 1545 are present in the outcome data.
     (Exposure data, Outcome data, Outcome name) stored in the .MR_data attribute.
-Here as well you have the option to use proxies for the instruments that are not present in the outcome data:
-```python
-SBP_clumped.query_outcome(Stroke_geno, proxy = True, reference_panel = "eur", kb = 5000, r2 = 0.6, window_snps = 5000)
-```
-And genal will print the number of missing instruments which have been proxied:
-    Outcome data successfully loaded from 'b352e412' geno instance.
-    Identifying the exposure SNPs present in the outcome data...
-    1541 SNPs out of 1545 are present in the outcome data.
-    Searching proxies for 4 SNPs...
-    Using the EUR reference panel.
-    Found proxies for 4 SNPs.
-    (Exposure data, Outcome data, Outcome name) stored in the .MR_data attribute.
+> **Note:**
+>Here as well you have the option to use proxies for the instruments that are not present in the outcome data:
+>
+> Here as well you have the option to use proxies for the instruments that are not present in the outcome data:
+>
+> ```python
+> SBP_clumped.query_outcome(Stroke_geno, proxy = True, reference_panel = "eur", kb = 5000, r2 = 0.6, window_snps = 5000)
+> ```
+>
+> And genal will print the number of missing instruments that have been proxied:
+>
+>     Outcome data successfully loaded from 'b352e412' geno instance.
+>     Identifying the exposure SNPs present in the outcome data...
+>     1541 SNPs out of 1545 are present in the outcome data.
+>     Searching proxies for 4 SNPs...
+>     Using the EUR reference panel.
+>     Found proxies for 4 SNPs.
+>     (Exposure data, Outcome data, Outcome name) stored in the .MR_data attribute.
 After extracting the instruments from the outcome data, the `SBP_clumped` `genal.Geno` instance contains an `MR_data` attribute containing the instruments-exposure and instruments-outcome associations necessary to run MR. Running MR is now as simple as calling the `genal.Geno.MR` method of the SBP_clumped `genal.Geno` instance:
@@ -429,7 +458,7 @@ By default, only some MR methods (inverse-variance weighted, weighted median, Si
 - `Weighted-mode` for the Weighted mode method
 - `all` to run all the above methods
-For more fine-tuning, such as settings for the number of boostrapping iterations, please refer to the API.
+For more fine-tuning, such as settings for the number of boostrapping iterations, please refer to the API: [https://genal.readthedocs.io/en/latest/modules.html#id4](MR method).
 If you want to visualize the obtained MR results, you can use the `genal.Geno.MR_plot` method that will plot each SNP in an `effect_on_exposure x effect_on_outcome` plane as well as lines corresponding to different MR methods:
@@ -437,7 +466,7 @@ If you want to visualize the obtained MR results, you can use the `genal.Geno.MR
 SBP_clumped.MR_plot(filename="MR_plot_SBP_AS")
 ```
-![MR plot](docs/Images/MR_plot_SBP_AS.png)
+![MR plot](docs/build/_images/MR_plot_SBP_AS.png)
 You can select which MR methods you wish to plot with the `methods` argument. Note that for an MR method to be plotted, they must be included in the latest `genal.Geno.MR` call of this `genal.Geno` instance.
 If you wish to include the heterogeneity values (Cochran's Q) in the results, you can use the heterogeneity argument in the `genal.Geno.MR` call. Here, the heterogeneity for the inverse-variance weighted method:
@@ -482,7 +511,7 @@ df_pheno = pd.read_csv("path/to/trait/data")
 > **Note:**
 >
-> One important detail is to make sure that the individual IDs are identical between the phenotypic data and the genetic data for the target population.
+>    One important point is to make sure that the IDs of the participants are identical in the phenotypic data and in the genetic data.
 Then, it is advisable to make a copy of the `genal.Geno` instance containing our instruments, since we are going to update their coefficients and want to avoid any confusion:
@@ -560,5 +589,54 @@ You can specify the path of the LiftOver executable to the `liftover_path` argum
 SBP_Geno.lift(start = "hg19", end = "hg38", replace = False, liftover_path = "path/to/liftover/exec")
 ```
+### GWAS Catalog <a name="paragraph3.8"></a>
+It is sometimes interesting to determine the traits associated with our SNPs. In Mendelian Randomization, for instance, we may want to exclude instruments that are associated with traits likely causing horizontal pleiotropy. For this purpose, we can use the `genal.Geno.query_gwas_catalog` method. This method will query the GWAS Catalog API to determine the list of traits associated with each of our SNPs and store the results in a list in the `ASSOC` column of the `.data` attribute:
+```python
+SBP_clumped.query_gwas_catalog(p_threshold=5e-8)
+```
+Which will output:
+    Querying the GWAS Catalog and creating the ASSOC column.
+    Only associations with a p-value <= 5e-08 are reported. Use the p_threshold argument to change the threshold.
+    To report the p-value for each association, use return_p=True.
+    To report the study ID for each association, use return_study=True.
+    The .data attribute will be modified. Use replace=False to leave it as is.
+    100%|██████████| 1545/1545 [00:34<00:00, 44.86it/s]
+    The ASSOC column has been successfully created.
+    701 (45.37%) SNPs failed to query (not found in GWAS Catalog) and 7 (0.5%) SNPs timed out after 34.33 seconds. You can increase the timeout value with the timeout argument.
+| EA  | NEA | EAF   | BETA   | SE     | CHR | POS        | SNP        | ASSOC                                                                 |
+|-----|-----|-------|--------|--------|-----|------------|------------|------------------------------------------------------------------------|
+| A   | G   | 0.1784| 0.2330 | 0.0402 | 10  | 102075479  | rs603424   | [eicosanoids measurement, decadienedioic acid (...]                     |
+| A   | G   | 0.0706| -0.3873| 0.0626 | 10  | 102403682  | rs2996303  | FAILED_QUERY                                                           |
+| T   | G   | 0.8872| 0.6846 | 0.0480 | 10  | 102553647  | rs1006545  | [diastolic blood pressure, systolic blood pressure...]                  |
+| T   | G   | 0.6652| -0.2098| 0.0340 | 10  | 102558506  | rs12570050 | FAILED_QUERY                                                           |
+| T   | C   | 0.3057| -0.2448| 0.0334 | 10  | 102603924  | rs4919502  | FAILED_QUERY                                                           |
+| ... | ... | ...   | ...    | ...    | ... | ...        | ...        | ...                                                                    |
+| T   | C   | 0.3514| 0.2203 | 0.0314 | 9   | 9350706    | rs1332813  | [diastolic blood pressure, systolic blood pressure...]                  |
+| T   | C   | 0.6880| -0.1897| 0.0332 | 9   | 94201341   | rs10820855 | FAILED_QUERY                                                           |
+| A   | T   | 0.3669| -0.1862| 0.0313 | 9   | 95201540   | rs7045409  | [protein measurement, pulse pressure measurement...]                   |
+If you are also interested in the p-values of each SNP-trait association, or the ID of the study from which the association was reported, you can use the `return_p = True` and `return_study = True` arguments. Then, the `ASSOC` column will contain a list of tuples, where each tuple contains the trait name, the p-value, and the study ID:
+```python
+SBP_clumped.query_gwas_catalog(p_threshold=5e-8, return_p=True, return_study=True)
+```
+| EA  | NEA | EAF   | BETA   | SE     | CHR | POS        | SNP        | ASSOC                                                                 |
+|-----|-----|-------|--------|--------|-----|------------|------------|------------------------------------------------------------------------|
+| A   | G   | 0.1784| 0.2330 | 0.0402 | 10  | 102075479  | rs603424   | TIMEOUT                                                                |
+| A   | G   | 0.0706| -0.3873| 0.0626 | 10  | 102403682  | rs2996303  | FAILED_QUERY                                                           |
+| T   | G   | 0.8872| 0.6846 | 0.0480 | 10  | 102553647  | rs1006545  | [(heart rate response to exercise, 6e-12, GCST...                      |
+| T   | G   | 0.6652| -0.2098| 0.0340 | 10  | 102558506  | rs12570050 | FAILED_QUERY                                                           |
+| T   | C   | 0.3057| -0.2448| 0.0334 | 10  | 102603924  | rs4919502  | FAILED_QUERY                                                           |
+| ... | ... | ...   | ...    | ...    | ... | ...        | ...        | ...                                                                    |
+| T   | C   | 0.3514| 0.2203 | 0.0314 | 9   | 9350706    | rs1332813  | [(diastolic blood pressure, 1e-12, GCST9031029...                      |
+| T   | C   | 0.6880| -0.1897| 0.0332 | 9   | 94201341   | rs10820855 | FAILED_QUERY                                                           |
+| A   | T   | 0.3669| -0.1862| 0.0313 | 9   | 95201540   | rs7045409  | [(systolic blood pressure, 9e-13, GCST006624),...                      |
+> **Note:**
+>
+> As you can see, many SNPs failed to be queried. This is normal as the GWAS Catalog is not exhaustive.
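Once the `ASSOC` column exists, excluding potentially pleiotropic instruments comes down to scanning each entry for unwanted traits. The following is a minimal sketch in plain Python, not part of genal: the helper names (`trait_names`, `is_pleiotropic`), the SNP labels, the study ID, and the keyword list are all hypothetical. It assumes, based on the outputs above, that an `ASSOC` value is either a list of trait names, a list of tuples whose first element is the trait name, or one of the strings `FAILED_QUERY` and `TIMEOUT`.

```python
# Status strings as they appear in the ASSOC column shown above.
FAILED_STATUSES = {"FAILED_QUERY", "TIMEOUT"}


def trait_names(assoc):
    """Return the trait names in one ASSOC value (empty list for failed queries)."""
    if not isinstance(assoc, (list, tuple)):
        return []
    # With return_p / return_study, each entry is a tuple starting with the trait name.
    return [entry[0] if isinstance(entry, tuple) else entry for entry in assoc]


def is_pleiotropic(assoc, keywords):
    """True if any associated trait contains one of the keywords (case-insensitive)."""
    return any(
        keyword.lower() in name.lower()
        for name in trait_names(assoc)
        for keyword in keywords
    )


# Hypothetical ASSOC values mimicking the two output formats above.
rows = {
    "rs_a": ["diastolic blood pressure", "systolic blood pressure"],
    "rs_b": "FAILED_QUERY",
    "rs_c": [("heart rate response to exercise", 6e-12, "GCST_EXAMPLE")],
    "rs_d": "TIMEOUT",
}
keywords = ["heart rate"]  # traits assumed to act through a pleiotropic pathway

# SNPs without a flagged trait are kept; failed queries are kept but listed separately.
kept = [snp for snp, assoc in rows.items() if not is_pleiotropic(assoc, keywords)]
unverified = [
    snp for snp, assoc in rows.items()
    if isinstance(assoc, str) and assoc in FAILED_STATUSES
]

print(kept)
print(unverified)
```

With a pandas DataFrame, the same helper could presumably be applied row-wise, for example through `Series.apply` on the `ASSOC` column. SNPs whose query failed or timed out carry no trait information, so this sketch keeps them and reports them separately rather than treating them as free of pleiotropy.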

genal_python-0.9/docs/.DS_Store ADDED Viewed

Binary file

genal_python-0.9/docs/build/.DS_Store ADDED Viewed

Binary file

{genal_python-0.7/docs/_build/html → genal_python-0.9/docs/build}/.buildinfo RENAMED Viewed

@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: f327128ef63678afda2847b68c505d0e
+config: 1a3c03fa317dbf0f46b6f7567774d6c5
 tags: 645f666f9bcd5a90fca523b33c5a78b7