PyPI - genal-python - Versions diffs - 0.0.dev0__tar.gz → 0.1__tar.gz - Mend

genal-python 0.0.dev0 (tar.gz) → 0.1 (tar.gz)


Files changed (82)

{genal_python-0.0.dev0 → genal_python-0.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: genal-python
-Version: 0.0.dev0
+Version: 0.1
 Summary: A python toolkit for polygenic risk scoring and mendelian randomization.
 Author-email: Cyprien Rivier <riviercyprien@gmail.com>
 Requires-Python: >=3.7
@@ -51,7 +51,7 @@ The module prioritizes user-friendliness and intuitive operation, aiming to redu
 Genal draws on concepts from well-established R packages such as TwoSampleMR, MR-Presso, MendelianRandomization, and gwasvcf, adapting their proven methodologies to the Python environment. This approach ensures that users have access to tried and tested techniques with the versatility of Python's data science tools.
 ## Requirements for the GENAL module <a name="paragraph1"></a>
-***Python 3.7 or later***. https://www.python.org/ or  https://www.python.org/downloads/release/python-379/ <br>
+***Python 3.9 or later***. https://www.python.org/ <br>
 ## Installation and How to use the GENAL module <a name="paragraph2"></a>
@@ -60,7 +60,11 @@ Genal draws on concepts from well-established R packages such as TwoSampleMR, MR
 Download and install the package with pip:
 ```
-pip install genal
+pip install genal-python
+```
+And it can be imported in a python environment with:
+```python
+import genal
 ```
 The main genal functionalities require a working installation of PLINK v1.9 that can be downloaded here: https://www.cog-genomics.org/plink/
@@ -81,6 +85,7 @@ For this tutorial, we will build a Polygenic Risk Score (PRS) for systolic blood
   - Include risk score calculations with proxies
 - Perform Mendelian Randomization
   - Analyze SBP as an exposure and acute ischemic stroke as an outcome
+  - Plot the results
   - Conduct sensitivity analyses using the weighted median, MR-Egger, and MR-PRESSO methods
 - Calibrate SNP-trait weights with individual-level genetic data
   - Execute single-SNP association tests for calibrating SBP genetic instruments
@@ -155,15 +160,15 @@ Now that we have loaded the data into a `genal.Geno` object, we can begin cleani
 Genal can run all the basic cleaning and preprocessing steps in one command:
 ```python
-SBP_Geno.preprocess_data(preprocessing = 2)
+SBP_Geno.preprocess_data(preprocessing = 'Fill_delete')
 ```
 The `preprocessing` argument specifies the global level of preprocessing applied to the data:
-- `preprocessing = 0`: The data won't be modified.
-- `preprocessing = 1`: Missing columns will be added based on reference data and invalid values set to NaN, but no rows will be deleted.
-- `preprocessing = 2`: Missing columns will be added, and all rows containing missing, duplicated, or invalid values will be deleted. This option is recommended before running genetic methods.
+- `preprocessing = 'None'`: The data won't be modified.
+- `preprocessing = 'Fill'`: Missing columns will be added based on reference data and invalid values set to NaN, but no rows will be deleted.
+- `preprocessing = 'Fill_delete'`: Missing columns will be added, and all rows containing missing, duplicated, or invalid values will be deleted. This option is recommended before running genetic methods.
-By default, and depending on the global preprocessing level (0, 1, 2) chosen, the `preprocess_data` method of `genal.Geno` will run the following checks:
+By default, and depending on the global preprocessing level ('None', 'Fill', 'Fill_delete') chosen, the `preprocess_data` method of `genal.Geno` will run the following checks:
 - Ensure the CHR (chromosome) and POS (genomic position) columns are integers.
 - Ensure the EA (effect allele) and NEA (non-effect allele) columns are uppercase characters containing A, T, C, G letters. Multiallelic values are set to NaN.
 - Validate the P (p-value) column for proper values.
@@ -202,7 +207,7 @@ And we see that the SNP column with the rsids has been added based on the refere
 You do not need to obtain the 1000 genome reference panel yourself, Genal will download it the first time you use it. By default, the reference panel used is the european (eur) one. You can specify another valid reference panel (afr, eas, sas, amr) with the reference_panel argument:
 ```python
-SBP_Geno.preprocess_data(preprocessing = 2, reference_panel = "afr")
+SBP_Geno.preprocess_data(preprocessing = 'Fill_delete', reference_panel = "afr")
 ```
 You can also use a custom reference panel by specifying to the reference_panel argument a path to bed/bim/fam files (without the extension).
@@ -366,7 +371,7 @@ Stroke_Geno = genal.Geno(stroke_gwas, CHR = "chromosome", POS = "base_pair_locat
 We preprocess it as well to put it in the correct format and make sure there is no invalid values:
 ```python
-Stroke_Geno.preprocess_data(preprocessing = 2)
+Stroke_Geno.preprocess_data(preprocessing = 'Fill_delete')
 ```
 Now, we need to extract our instruments (SNPs of the SBP_clumped data) from the outcome data to obtain their association with the outcome trait (stroke). It can be done by calling the `genal.Geno.query_outcome` method:
@@ -431,8 +436,16 @@ By default, all MR methods (inverse-variance weighted, weighted median, MR-Egger
 For more fine-tuning, such as settings for the number of boostrapping iterations, please refer to the API.
-If you wish to include the heterogeneity values (Cochran's Q), you can use the heterogeneity argument. Here, the heterogeneity for the inverse-variance weighted method:
+If you want to visualize the obtained MR results, you can use the `genal.Geno.MR_plot` method that will plot each SNP in an effect_on_exposure x effect_on_outcome plane as well as lines corresponding to different MR methods:
+```python
+SBP_clumped.MR_plot(filename="MR_plot_SBP_AS")
+```
+![MR plot](docs/Images/MR_plot_SBP_AS.png)
+You can select which MR methods you wish to plot with the 'methods' argument. Note that for an MR method to be plotted, they must be included in the latest `genal.Geno.MR` call of this `genal.Geno` object.
+If you wish to include the heterogeneity values (Cochran's Q) in the results, you can use the heterogeneity argument in the `genal.Geno.MR` call. Here, the heterogeneity for the inverse-variance weighted method:
 ```python
 SBP_clumped.MR(action = 3, methods = ["IVW"], exposure_name = "SBP", outcome_name = "Stroke_eur", heterogeneity = True)

{genal_python-0.0.dev0 → genal_python-0.1}/README.md RENAMED

@@ -29,7 +29,7 @@ The module prioritizes user-friendliness and intuitive operation, aiming to redu
 Genal draws on concepts from well-established R packages such as TwoSampleMR, MR-Presso, MendelianRandomization, and gwasvcf, adapting their proven methodologies to the Python environment. This approach ensures that users have access to tried and tested techniques with the versatility of Python's data science tools.
 ## Requirements for the GENAL module <a name="paragraph1"></a>
-***Python 3.7 or later***. https://www.python.org/ or  https://www.python.org/downloads/release/python-379/ <br>
+***Python 3.9 or later***. https://www.python.org/ <br>
 ## Installation and How to use the GENAL module <a name="paragraph2"></a>
@@ -38,7 +38,11 @@ Genal draws on concepts from well-established R packages such as TwoSampleMR, MR
 Download and install the package with pip:
 ```
-pip install genal
+pip install genal-python
+```
+And it can be imported in a python environment with:
+```python
+import genal
 ```
 The main genal functionalities require a working installation of PLINK v1.9 that can be downloaded here: https://www.cog-genomics.org/plink/
@@ -59,6 +63,7 @@ For this tutorial, we will build a Polygenic Risk Score (PRS) for systolic blood
   - Include risk score calculations with proxies
 - Perform Mendelian Randomization
   - Analyze SBP as an exposure and acute ischemic stroke as an outcome
+  - Plot the results
   - Conduct sensitivity analyses using the weighted median, MR-Egger, and MR-PRESSO methods
 - Calibrate SNP-trait weights with individual-level genetic data
   - Execute single-SNP association tests for calibrating SBP genetic instruments
@@ -133,15 +138,15 @@ Now that we have loaded the data into a `genal.Geno` object, we can begin cleani
 Genal can run all the basic cleaning and preprocessing steps in one command:
 ```python
-SBP_Geno.preprocess_data(preprocessing = 2)
+SBP_Geno.preprocess_data(preprocessing = 'Fill_delete')
 ```
 The `preprocessing` argument specifies the global level of preprocessing applied to the data:
-- `preprocessing = 0`: The data won't be modified.
-- `preprocessing = 1`: Missing columns will be added based on reference data and invalid values set to NaN, but no rows will be deleted.
-- `preprocessing = 2`: Missing columns will be added, and all rows containing missing, duplicated, or invalid values will be deleted. This option is recommended before running genetic methods.
+- `preprocessing = 'None'`: The data won't be modified.
+- `preprocessing = 'Fill'`: Missing columns will be added based on reference data and invalid values set to NaN, but no rows will be deleted.
+- `preprocessing = 'Fill_delete'`: Missing columns will be added, and all rows containing missing, duplicated, or invalid values will be deleted. This option is recommended before running genetic methods.
-By default, and depending on the global preprocessing level (0, 1, 2) chosen, the `preprocess_data` method of `genal.Geno` will run the following checks:
+By default, and depending on the global preprocessing level ('None', 'Fill', 'Fill_delete') chosen, the `preprocess_data` method of `genal.Geno` will run the following checks:
 - Ensure the CHR (chromosome) and POS (genomic position) columns are integers.
 - Ensure the EA (effect allele) and NEA (non-effect allele) columns are uppercase characters containing A, T, C, G letters. Multiallelic values are set to NaN.
 - Validate the P (p-value) column for proper values.
@@ -180,7 +185,7 @@ And we see that the SNP column with the rsids has been added based on the refere
 You do not need to obtain the 1000 genome reference panel yourself, Genal will download it the first time you use it. By default, the reference panel used is the european (eur) one. You can specify another valid reference panel (afr, eas, sas, amr) with the reference_panel argument:
 ```python
-SBP_Geno.preprocess_data(preprocessing = 2, reference_panel = "afr")
+SBP_Geno.preprocess_data(preprocessing = 'Fill_delete', reference_panel = "afr")
 ```
 You can also use a custom reference panel by specifying to the reference_panel argument a path to bed/bim/fam files (without the extension).
@@ -344,7 +349,7 @@ Stroke_Geno = genal.Geno(stroke_gwas, CHR = "chromosome", POS = "base_pair_locat
 We preprocess it as well to put it in the correct format and make sure there is no invalid values:
 ```python
-Stroke_Geno.preprocess_data(preprocessing = 2)
+Stroke_Geno.preprocess_data(preprocessing = 'Fill_delete')
 ```
 Now, we need to extract our instruments (SNPs of the SBP_clumped data) from the outcome data to obtain their association with the outcome trait (stroke). It can be done by calling the `genal.Geno.query_outcome` method:
@@ -409,8 +414,16 @@ By default, all MR methods (inverse-variance weighted, weighted median, MR-Egger
 For more fine-tuning, such as settings for the number of boostrapping iterations, please refer to the API.
-If you wish to include the heterogeneity values (Cochran's Q), you can use the heterogeneity argument. Here, the heterogeneity for the inverse-variance weighted method:
+If you want to visualize the obtained MR results, you can use the `genal.Geno.MR_plot` method that will plot each SNP in an effect_on_exposure x effect_on_outcome plane as well as lines corresponding to different MR methods:
+```python
+SBP_clumped.MR_plot(filename="MR_plot_SBP_AS")
+```
+![MR plot](docs/Images/MR_plot_SBP_AS.png)
+You can select which MR methods you wish to plot with the 'methods' argument. Note that for an MR method to be plotted, they must be included in the latest `genal.Geno.MR` call of this `genal.Geno` object.
+If you wish to include the heterogeneity values (Cochran's Q) in the results, you can use the heterogeneity argument in the `genal.Geno.MR` call. Here, the heterogeneity for the inverse-variance weighted method:
 ```python
 SBP_clumped.MR(action = 3, methods = ["IVW"], exposure_name = "SBP", outcome_name = "Stroke_eur", heterogeneity = True)

genal_python-0.1/docs/Images/MR_plot_SBP_AS.png ADDED

Binary file

{genal_python-0.0.dev0 → genal_python-0.1}/genal/Geno.py RENAMED

@@ -167,7 +167,7 @@ class Geno:
     def preprocess_data(
         self,
-        preprocessing=1,
+        preprocessing='Fill',
         reference_panel="eur",
         effect_column=None,
         keep_multi=None,
@@ -179,11 +179,11 @@ class Geno:
         Clean and preprocess the main dataframe of Single Nucleotide Polymorphisms (SNP) data.
         Args:
-            preprocessing (int, optional): Level of preprocessing to apply. Options include:
-                - 0: The dataframe is not modified.
-                - 1: Missing columns are added based on reference data and invalid values set to NaN, but no rows are deleted.
-                - 2: Missing columns are added, and rows with missing, duplicated, or invalid values are deleted.
-                Defaults to 1.
+            preprocessing (str, optional): Level of preprocessing to apply. Options include:
+                - "None": The dataframe is not modified.
+                - "Fill": Missing columns are added based on reference data and invalid values set to NaN, but no rows are deleted.
+                - "Fill_delete": Missing columns are added, and rows with missing, duplicated, or invalid values are deleted.
+                Defaults to 'Fill'.
             reference_panel (str or pd.DataFrame, optional): Reference panel for SNP adjustments. Can be a string representing ancestry classification ("eur", "afr", "eas", "sas", "amr") or a DataFrame with ["CHR","SNP","POS","A1","A2"] columns or a path to a .bim file. Defaults to "eur".
             effect_column (str, optional): Specifies the type of effect column ("BETA" or "OR"). If None, the method tries to determine it. Odds Ratios will be log-transformed and the standard error adjusted. Defaults to None.
             keep_multi (bool, optional): Determines if multiallelic SNPs should be kept. If None, defers to preprocessing value. Defaults to None.
@@ -207,7 +207,7 @@ class Geno:
         # Ensure CHR and POS columns are integers if preprocessing is enabled
         for int_col in ["CHR", "POS"]:
-            if int_col in data.columns and preprocessing > 0:
+            if int_col in data.columns and preprocessing in ['Fill', 'Fill_delete']:
                 check_int_column(data, int_col)
                 self.checks[int_col] = True
@@ -237,7 +237,7 @@ class Geno:
             and "NEA" not in data.columns
             and "EA" in data.columns
         )
-        if missing_nea_condition and preprocessing > 0:
+        if missing_nea_condition and preprocessing in ['Fill', 'Fill_delete']:
             data = fill_nea(data, self.get_reference_panel(reference_panel))
         # Fill missing EA and NEA columns from reference data if necessary and preprocessing is enabled
@@ -247,7 +247,7 @@ class Geno:
             and "NEA" not in data.columns
             and "EA" not in data.columns
         )
-        if missing_ea_nea_condition and preprocessing > 0:
+        if missing_ea_nea_condition and preprocessing in ['Fill', 'Fill_delete']:
             data = fill_ea_nea(data, self.get_reference_panel(reference_panel))
         # Convert effect column to Beta estimates if present
@@ -256,18 +256,18 @@ class Geno:
             self.checks["BETA"] = True
         # Ensure P column contains valid values
-        if "P" in data.columns and preprocessing > 0:
+        if "P" in data.columns and preprocessing in ['Fill', 'Fill_delete']:
             check_p_column(data)
             self.checks["P"] = True
         # Fill missing SE or P columns if necessary
-        if preprocessing > 0:
+        if preprocessing in ['Fill', 'Fill_delete']:
             fill_se_p(data)
         # Process allele columns
         for allele_col in ["EA", "NEA"]:
             check_allele_condition = (allele_col in data.columns) and (
-                (preprocessing > 0) or (not keep_multi)
+                (preprocessing in ['Fill', 'Fill_delete']) or (not keep_multi)
             )
             if check_allele_condition:
                 check_allele_column(data, allele_col, keep_multi)
@@ -285,8 +285,8 @@ class Geno:
                     f"Warning: the data doesn't include a {column} column. This may become an issue later on."
                 )
-        # Remove missing values if preprocessing level is set to 2
-        if preprocessing == 2:
+        # Remove missing values if preprocessing level is set to 'Fill_delete'
+        if preprocessing == 'Fill_delete':
             remove_na(data)
             self.checks["NA_removal"] = True
@@ -547,7 +547,7 @@ class Geno:
         if "EA" not in self.checks:
             check_allele_column(data_prs, "EA", keep_multi=False)
         if "BETA" not in self.checks:
-            check_beta_column(data_prs, effect_column=None, preprocessing=2)
+            check_beta_column(data_prs, effect_column=None, preprocessing='Fill_delete')
         initial_rows = data_prs.shape[0]
         data_prs.dropna(subset=["SNP", "P", "BETA"], inplace=True)
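
Note on the Geno.py changes: `preprocessing` moves from integer levels (0, 1, 2) to string levels ('None', 'Fill', 'Fill_delete'), so calls written against 0.0.dev0 such as `preprocess_data(preprocessing = 2)` would presumably be rejected by the string-only validation shown in the geno_tools.py diff. The sketch below is one possible caller-side shim for code that still carries the old integers. `normalize_preprocessing`, `LEGACY_LEVELS`, and `VALID_LEVELS` are hypothetical names for illustration and are not part of genal.

```python
# Hypothetical caller-side helper; NOT part of the genal package.
# Mapping taken from the diff: 0 -> 'None', 1 -> 'Fill', 2 -> 'Fill_delete'.
LEGACY_LEVELS = {0: "None", 1: "Fill", 2: "Fill_delete"}
VALID_LEVELS = ("None", "Fill", "Fill_delete")


def normalize_preprocessing(value):
    """Return the 0.1-style string level for a legacy integer or a valid string."""
    # bool is a subclass of int (True == 1), so reject it explicitly
    # instead of silently mapping True to 'Fill'.
    if isinstance(value, bool):
        raise ValueError(f"Unsupported preprocessing value: {value!r}")
    if isinstance(value, int) and value in LEGACY_LEVELS:
        return LEGACY_LEVELS[value]
    # The Python object None is not the string 'None' and falls through here.
    if value in VALID_LEVELS:
        return value
    raise ValueError(
        f"Unsupported preprocessing value: {value!r}; expected one of {VALID_LEVELS}"
    )


print(normalize_preprocessing(2))       # Fill_delete
print(normalize_preprocessing("Fill"))  # Fill
```

The result can then be passed as `preprocessing=` to `preprocess_data`. One detail worth keeping in mind: the level that leaves data untouched is the string `'None'`, not Python's `None`; the shim rejects the latter rather than guessing.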

{genal_python-0.0.dev0 → genal_python-0.1}/genal/__init__.py RENAMED

@@ -3,6 +3,8 @@ import json
 from .tools import default_config, write_config, set_plink
 from .geno_tools import delete_tmp
+__version__ = "0.1"
 config_dir = os.path.expanduser(
     "~/.genal/"
 )  # Don't forget to change the config_path dans tools.py
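
Note on the `__version__` addition: the package is distributed as `genal-python` but imported as `genal`, and 0.1 is the first version to expose `genal.__version__`. A minimal sketch of checking which release is installed without importing the package, using only the standard library; `installed_version` is a hypothetical helper name, not something genal provides.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="genal-python"):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        # Looks up installed distribution metadata (the PyPI name, not the import name).
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("genal-python is not installed")
else:
    print(f"genal-python {found} is installed")
```

When the package is importable, `genal.__version__` should report the same string from 0.1 onward; the metadata lookup is mainly useful for environments that may still hold 0.0.dev0, where that attribute does not exist.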

{genal_python-0.0.dev0 → genal_python-0.1}/genal/geno_tools.py RENAMED

@@ -21,7 +21,7 @@ def remove_na(data):
     n_del = nrows - data.shape[0]
     if n_del > 0:
         print(
-            f"Deleted {n_del}({n_del/nrows*100:.3f}%) rows containing NA values in columns {columns_na}. Use preprocessing = 1 to keep the rows containing NA values."
+            f"Deleted {n_del}({n_del/nrows*100:.3f}%) rows containing NA values in columns {columns_na}. Use preprocessing = 'Fill' to keep the rows containing NA values."
         )
     return
@@ -100,7 +100,7 @@ def check_beta_column(data, effect_column, preprocessing):
     If no effect_column argument is specified, determine if the BETA column are beta estimates or odds ratios.
     """
     if effect_column is None:
-        if preprocessing == 0:
+        if preprocessing == 'None':
             return data
         median = np.median(data.BETA)
         if 0.5 < median < 1.5:
@@ -303,9 +303,9 @@ def check_arguments(
     """
     # Validate preprocessing value
-    if preprocessing not in [0, 1, 2]:
+    if preprocessing not in ['None', 'Fill', 'Fill_delete']:
         raise ValueError(
-            "preprocessing must be one of [0, 1, 2]. Refer to the Geno class docstring for details."
+            "preprocessing must be one of ['None', 'Fill', 'Fill_delete']. Refer to the Geno class docstring for details."
         )
     # Validate effect_column value
@@ -326,11 +326,11 @@ def check_arguments(
     # Helper functions for preprocessing logic
     def keeptype_column(arg):
         """Helper function to decide whether to keep multi-values/duplicates."""
-        return True if arg is None and preprocessing < 2 else arg
+        return True if arg is None and preprocessing in ['None', 'Fill'] else arg
     def filltype_column(arg):
         """Helper function to decide whether to fill snpids/coordinates."""
-        return False if arg is None and preprocessing == 0 else arg
+        return False if arg is None and preprocessing == 'None' else arg
     # Apply preprocessing logic
     keep_multi = keeptype_column(keep_multi)
@@ -385,7 +385,7 @@ def save_data(data, name, path="", fmt="h5", sep="\t", header=True):
     print(f"Data saved to {path_name}")
-def Combine_Geno(Gs, name="noname", clumped=False, preprocessing=0):
+def Combine_Geno(Gs, name="noname", clumped=False, preprocessing='None'):
     """
     Combine a list of GWAS objects into one.
@@ -393,7 +393,7 @@ def Combine_Geno(Gs, name="noname", clumped=False, preprocessing=0):
     - Gs (list): List of GWAS objects.
     - name (str, optional): Name for the combined object. Default is "noname".
     - clumped (bool, optional): If True, uses the clumped data of each object. Default is False.
-    - preprocessing (int, optional): Level of preprocessing to apply. Default is 0.
+    - preprocessing (int, optional): Level of preprocessing to apply. Default is 'None'.
     Returns:
     Geno object: Combined Geno object.
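
Note on the `check_arguments` hunks above: the ordered comparisons (`preprocessing < 2`, `preprocessing == 0`) become explicit membership tests on the string levels. The sketch below reproduces only the validation and the two helper expressions visible in the diff, as a standalone function, to show what each level resolves to when an option is left as `None`. `resolve_flags` and its parameter names are hypothetical; the real function takes more arguments that are not shown in this diff.

```python
VALID_LEVELS = ["None", "Fill", "Fill_delete"]


def resolve_flags(preprocessing, keep_arg=None, fill_arg=None):
    """Standalone restatement of the defaulting logic visible in the diff."""
    if preprocessing not in VALID_LEVELS:
        raise ValueError(f"preprocessing must be one of {VALID_LEVELS}.")
    # Same expression as keeptype_column: an unset "keep" option becomes True
    # for 'None' and 'Fill'; otherwise the argument is returned unchanged.
    keep = True if keep_arg is None and preprocessing in ["None", "Fill"] else keep_arg
    # Same expression as filltype_column: an unset "fill" option becomes False
    # only for 'None'; otherwise the argument is returned unchanged.
    fill = False if fill_arg is None and preprocessing == "None" else fill_arg
    return keep, fill


for level in VALID_LEVELS:
    print(level, resolve_flags(level))
# None (True, False)
# Fill (True, None)
# Fill_delete (None, None)
```

Unset options stay `None` at the higher levels. How the calling code interprets that `None` is only partly visible here (the Geno.py hunk tests `not keep_multi`, which treats `None` as falsy), so the table above should be read as the output of these two expressions, not as a full account of the preprocessing behavior.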

{genal_python-0.0.dev0 → genal_python-0.1}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "flit_core.buildapi"
 [project]
 name = "genal-python"  # Updated name for PyPI
-version = "0.0.dev0"
+version = "0.1"
 authors = [{name = "Cyprien Rivier", email = "riviercyprien@gmail.com"}]
 description = "A python toolkit for polygenic risk scoring and mendelian randomization."
 readme = "README.md"