PyPI - SIMPApy - Versions diffs - 0.1.4__tar.gz → 0.3.2__tar.gz - Mend

SIMPApy 0.1.4.tar.gz → 0.3.2.tar.gz


Files changed (23)

{simpapy-0.1.4 → simpapy-0.3.2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: SIMPApy
-Version: 0.1.4
+Version: 0.3.2
 Summary: Normalized Single Sample Integrated Multiomics Pathway Analysis
 Author-email: Hasan Alsharoh <hasanalsharoh@gmail.com>
 License: Apache License (2.0)
@@ -35,12 +35,12 @@ Normalized Single Sample Integrated Multi-Omics Pathway Analysis for Python
 ## Description
-`SIMPApy` is a Python package for performing Gene Set Enrichment Analysis (GSEA) on multiomics data in single samples and integrating the results. It supports RNA sequencing, DNA methylation, and copy number variation data types.
+`SIMPApy` is a Python package for performing Gene Set Enrichment Analysis (GSEA) on multiomics data in single samples and integrating the results. It supports RNA sequencing, DNA methylation, and copy number variation data types. This package uses the normalized single sample single --omic pathway analysis (SOPA) framework and integrates the results through normalized single sample integrated multiomics pathway analysis (SIMPA) extension.
 ## Installation
-To install SIMPApy, create a new virtual environment (preferably also installing Jupyter Notebooks through anaconda). Afterwards, use:
+To install SIMPApy, create a new virtual environment (also installing Jupyter Notebooks through anaconda). Afterwards, use:
 ```bash
 pip install SIMPApy
 ```
@@ -60,7 +60,7 @@ To use SOPA, we need raw data for a single -omic, and as follows:
     1. For RNAseq data: TPM, or normalized counts. FPKM, RPKM may work, but the tool was not validated on them.
     2. For CNV data: Copy numbers. Each gene must have its individual copy numbers.
     3. For DNAm data: Beta values. The values must be mapped to gene names rather than CpG sites.
-- The dataframes must contain sample divided into 2 groups, cases and controls. For the case group, each sample must start with the name 'tm', while for the control group, each sample must start with 'tw'.
+- The dataframes must contain samples divided into 2 groups, cases and controls. For the case group, each sample must start with the name 'tm', while for the control group, each sample must start with 'tw'.
 - It is possible to use more than 2 groups. For example:
         Assume 3 groups (group1, group2, group3). Analyses could be done between group1 and group2, and then group3 and group2, or group3 and group1, according to preference.
@@ -99,7 +99,7 @@ dnaranks_df = pd.concat({k: v['weighted'] for k, v in dnaranks.items()}, axis=1)
 ### CNV data
 Here, make sure to have copy number data. If GISTIC2 data is present (often through log(Copy_Number +1)), then conversion is usually done through 2**(GISTIC2_value -1)
 ```python
-cnvs = sp.calculate_ranking(cnv_data, omic = 'cnv'):
+cnranks = sp.calculate_ranking(cnv_data, omic = 'cnv'):
 # cnv_data: Pandas DataFrame with gene-level copy numbers.
 # Rows are genes and columns are samples ('tm' for cases, 'tw' for controls).
 ```
@@ -211,8 +211,13 @@ Figure settings allow the following:
     - Reset View (button): Restores the view to unfiltered.
     - Export (300DPI): This options exports the current view of the 3D box.
-## Requirements
+## Downstream analysis
+This package also offers direct analysis of results obtained with both SOPA and SIMPA through the .analyze module.
+Please consult the module's documentation for further instructions.
+# Requirements
 - Python ≥ 3.8
 - gseapy==1.1.3
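
The sample-naming rule in the README hunk above (case columns start with 'tm', control columns with 'tw') can be illustrated with a minimal sketch. The table `expr` and its values are invented for illustration. Only the prefix rule and the `df.filter(regex='^tw')` idiom, which appears in ranking.py in this diff, come from the package.

```python
import pandas as pd

# Hypothetical gene-by-sample table (rows are genes, columns are samples).
# The numbers are placeholders, not real measurements.
expr = pd.DataFrame(
    {
        "tm1": [5.0, 1.2],
        "tm2": [4.1, 0.9],
        "tw1": [2.0, 1.1],
        "tw2": [2.2, 1.0],
    },
    index=["GENE_A", "GENE_B"],
)

# Split columns by prefix, mirroring the regex filter used in ranking.py.
cases = expr.filter(regex="^tm")
controls = expr.filter(regex="^tw")

# Flag any column that follows neither convention before running an analysis.
unmatched = [c for c in expr.columns if not c.startswith(("tm", "tw"))]
if unmatched:
    raise ValueError(f"Columns without a 'tm'/'tw' prefix: {unmatched}")
```

For a three-group design, as the README suggests, the same check would presumably be applied once per pairwise comparison after renaming the two groups of interest to 'tm' and 'tw'.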

{simpapy-0.1.4 → simpapy-0.3.2}/README.md RENAMED

@@ -4,12 +4,12 @@ Normalized Single Sample Integrated Multi-Omics Pathway Analysis for Python
 ## Description
-`SIMPApy` is a Python package for performing Gene Set Enrichment Analysis (GSEA) on multiomics data in single samples and integrating the results. It supports RNA sequencing, DNA methylation, and copy number variation data types.
+`SIMPApy` is a Python package for performing Gene Set Enrichment Analysis (GSEA) on multiomics data in single samples and integrating the results. It supports RNA sequencing, DNA methylation, and copy number variation data types. This package uses the normalized single sample single --omic pathway analysis (SOPA) framework and integrates the results through normalized single sample integrated multiomics pathway analysis (SIMPA) extension.
 ## Installation
-To install SIMPApy, create a new virtual environment (preferably also installing Jupyter Notebooks through anaconda). Afterwards, use:
+To install SIMPApy, create a new virtual environment (also installing Jupyter Notebooks through anaconda). Afterwards, use:
 ```bash
 pip install SIMPApy
 ```
@@ -29,7 +29,7 @@ To use SOPA, we need raw data for a single -omic, and as follows:
     1. For RNAseq data: TPM, or normalized counts. FPKM, RPKM may work, but the tool was not validated on them.
     2. For CNV data: Copy numbers. Each gene must have its individual copy numbers.
     3. For DNAm data: Beta values. The values must be mapped to gene names rather than CpG sites.
-- The dataframes must contain sample divided into 2 groups, cases and controls. For the case group, each sample must start with the name 'tm', while for the control group, each sample must start with 'tw'.
+- The dataframes must contain samples divided into 2 groups, cases and controls. For the case group, each sample must start with the name 'tm', while for the control group, each sample must start with 'tw'.
 - It is possible to use more than 2 groups. For example:
         Assume 3 groups (group1, group2, group3). Analyses could be done between group1 and group2, and then group3 and group2, or group3 and group1, according to preference.
@@ -68,7 +68,7 @@ dnaranks_df = pd.concat({k: v['weighted'] for k, v in dnaranks.items()}, axis=1)
 ### CNV data
 Here, make sure to have copy number data. If GISTIC2 data is present (often through log(Copy_Number +1)), then conversion is usually done through 2**(GISTIC2_value -1)
 ```python
-cnvs = sp.calculate_ranking(cnv_data, omic = 'cnv'):
+cnranks = sp.calculate_ranking(cnv_data, omic = 'cnv'):
 # cnv_data: Pandas DataFrame with gene-level copy numbers.
 # Rows are genes and columns are samples ('tm' for cases, 'tw' for controls).
 ```
@@ -180,8 +180,13 @@ Figure settings allow the following:
     - Reset View (button): Restores the view to unfiltered.
     - Export (300DPI): This options exports the current view of the 3D box.
-## Requirements
+## Downstream analysis
+This package also offers direct analysis of results obtained with both SOPA and SIMPA through the .analyze module.
+Please consult the module's documentation for further instructions.
+# Requirements
 - Python ≥ 3.8
 - gseapy==1.1.3

{simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy/__init__.py RENAMED

@@ -18,7 +18,7 @@ from .preprocess import _extract_tag_genes, _create_aggregated_dataframes, proce
 from .visualize import _create_traces, create_interactive_plot
 from .analyze import group_diffs, plot_volcano, calculate_correlation, plot_correlation_scatterplot
-__version__ = "0.1.4"
+__version__ = "0.3.2"
 __all__ = [
     "calculate_ranking",
     "sopa",

{simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy/analyze.py RENAMED

@@ -12,9 +12,9 @@ from statsmodels.stats.multitest import multipletests
 def group_diffs(
     data: pd.DataFrame,
-    pathway_col: str,
-    value_col: str,
-    group_col: str,
+    pathway_col: str = 'Term',
+    value_col: str = 'nes',
+    group_col: str = 'sample_name',
     group1_prefix: str = 'tm',
     group2_prefix: str = 'tw',
     adj_method: str = 'fdr_bh',
@@ -84,43 +84,68 @@ def group_diffs(
     return results_df.sort_values('p_value').reset_index(drop=True)
-def plot_volcano(
-    data: pd.DataFrame,
-    x_col: str = 'mean_diff',
-    y_col: str = 'neg_log10_p_adj',
-    p_thresh: float = 0.05,
-    title: str = "Group Differences Volcano Plot",
-    xlabel: str = "Mean Difference",
-    ylabel: str = "-log10(Adjusted P-value)",
-    save_path: str = None
-) -> None:
+def plot_volcano(df,
+                 pvalthresh=0.05,
+                 mean_thresh=0.5,
+                 figsize=(6, 4),
+                 fontsize=18,
+                 labelsize=20,
+                 out='volcano.png'):
     """
-    Generates and optionally saves a volcano plot from group difference results.
-    Args:
-        data (pd.DataFrame): DataFrame from group_diffs function.
-        x_col (str): Column for the x-axis (mean difference).
-        y_col (str): Column for the y-axis (-log10 adjusted p-value).
-        p_thresh (float): Significance threshold for p-value.
-        title (str): The title of the plot.
-        xlabel (str): The label for the x-axis.
-        ylabel (str): The label for the y-axis.
-        save_path (str): If provided, the plot is saved to this path.
+    Volcano plot that always displays all points and highlights:
+      - both significant by p_adj and mean_diff (both)
+      - only p_adj significant (p_only)
+      - only mean_diff large (mean_only)
+      - neither (none)
+    This prevents points "disappearing" when mean_thresh is set high.
     """
-    plt.figure(figsize=(10, 8))
-    sns.scatterplot(data=data, x=x_col, y=y_col, color='blue')
-    neg_log10_p_threshold = -np.log10(p_thresh)
-    plt.axhline(y=neg_log10_p_threshold, color='gray', linestyle='--')
-    plt.xlabel(xlabel)
-    plt.ylabel(ylabel)
-    plt.title(title)
-    plt.grid(True)
-    if save_path:
-        plt.savefig(save_path, dpi=300)
+    # thresholds
+    neg_log10_p_thresh = -np.log10(pvalthresh)
+    sig_pval = df['p_adj'] < pvalthresh
+    sig_mean = df['mean_diff'].abs() > mean_thresh
+    both = sig_pval & sig_mean
+    p_only = sig_pval & ~sig_mean
+    mean_only = ~sig_pval & sig_mean
+    none = ~sig_pval & ~sig_mean
+    plt.figure(figsize=figsize)
+    # plot all categories in order (background -> foreground)
+    plt.scatter(df.loc[none, 'mean_diff'], df.loc[none, 'neg_log10_p_adj'],
+                color='lightgrey', alpha=0.6, label='Not significant', s=40, zorder=1)
+    if mean_only.any():
+        plt.scatter(df.loc[mean_only, 'mean_diff'], df.loc[mean_only, 'neg_log10_p_adj'],
+                    color='grey', alpha=0.9, label=f'|mean| > {mean_thresh}', s=50, zorder=2)
+    if p_only.any():
+        plt.scatter(df.loc[p_only, 'mean_diff'], df.loc[p_only, 'neg_log10_p_adj'],
+                    color='grey', alpha=0.9, label=f'p_adj < {pvalthresh}', s=50, zorder=3)
+    if both.any():
+        plt.scatter(df.loc[both, 'mean_diff'], df.loc[both, 'neg_log10_p_adj'],
+                    color='blue', alpha=0.9, label='Both criteria', s=60, zorder=4)
+    # threshold lines
+    plt.axhline(y=neg_log10_p_thresh, color='gray', linestyle='--', lw=1)
+    plt.axvline(x=mean_thresh, color='gray', linestyle='--', lw=1)
+    plt.axvline(x=-mean_thresh, color='gray', linestyle='--', lw=1)
+    # axis labels and styling
+    plt.xlabel("Mean NES Difference (TMA - TWA)", fontsize=fontsize)
+    plt.ylabel("-log10(adjusted p-value)", fontsize=fontsize)
+    plt.tick_params(axis='both', labelsize=labelsize)
+    plt.grid(True, alpha=0.3)
+    # expand limits a bit so lines/points are not at the edge
+    xpad = (df['mean_diff'].max() - df['mean_diff'].min()) * 0.05
+    ypad = (df['neg_log10_p_adj'].max() - df['neg_log10_p_adj'].min()) * 0.05
+    if np.isfinite(xpad):
+        plt.xlim(df['mean_diff'].min() - max(0.1, xpad), df['mean_diff'].max() + max(0.1, xpad))
+    if np.isfinite(ypad):
+        plt.ylim(max(0, df['neg_log10_p_adj'].min() - max(0.1, ypad)), df['neg_log10_p_adj'].max() + max(0.1, ypad))
+    plt.tight_layout()
+    plt.savefig(out, dpi=300)
     plt.show()
 def calculate_correlation(data: pd.DataFrame, x_col: str, y_col: str, group_col: str = None) -> pd.DataFrame:
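
The four point categories that the new `plot_volcano` draws can be reproduced without plotting. The sketch below is not part of SIMPApy: `classify_volcano_points` is a hypothetical helper and the `toy` table is invented. It assumes only the column names `p_adj` and `mean_diff` and the default thresholds shown in the hunk above.

```python
import numpy as np
import pandas as pd


def classify_volcano_points(df, pvalthresh=0.05, mean_thresh=0.5):
    """Label each row 'both', 'p_only', 'mean_only', or 'none'.

    Hypothetical helper that mirrors the boolean masks in plot_volcano.
    """
    sig_pval = df["p_adj"] < pvalthresh
    sig_mean = df["mean_diff"].abs() > mean_thresh
    labels = np.select(
        [sig_pval & sig_mean, sig_pval & ~sig_mean, ~sig_pval & sig_mean],
        ["both", "p_only", "mean_only"],
        default="none",
    )
    return pd.Series(labels, index=df.index, name="category")


# Illustrative input with one row per category.
toy = pd.DataFrame(
    {
        "mean_diff": [1.2, 0.1, -0.9, 0.0],
        "p_adj": [0.001, 0.01, 0.5, 0.9],
    }
)
categories = classify_volcano_points(toy)
print(categories.tolist())
```

Because the masks are mutually exclusive and exhaustive, every row lands in exactly one category, which is why no point disappears when `mean_thresh` is raised. In the hunk as shown, `mean_only` and `p_only` are both drawn in grey with the same size and alpha, and no `plt.legend()` call appears, so the two groups may be hard to tell apart in the saved figure.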

{simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy/ranking.py RENAMED

@@ -101,39 +101,71 @@ def calculate_ranking(
             # Store the DataFrame in the dictionary
             ranked_dfs[sample] = sample_df
-            # Delete the DataFrame to free up memory (optional)
             del sample_df
         return ranked_dfs
     elif omic.upper() == "CNV":
-        # Calculate baseline importance (Bg) for each gene
-        control_data = df.filter(regex='^tw')  # Control data
-        control_std = control_data.std(axis=1)
-        control_max = control_data.max(axis=1)
-        Bg = control_std / control_max
-        # Create a dictionary to store results
-        adjusted_weights_dict = {}
+        control_data = df.filter(regex='^tw')
+        N = len(control_data.columns)
+        epsilon = 0.01 # Small constant to prevent division by zero
-        # Loop through all samples (cases and controls)
-        for sample_name in df.columns:
-            x_s = df[sample_name]  # Copy numbers for the current sample
-            # Calculate non-linear weight (w(x_s,g))
-            w_x_s_g = 2 ** np.abs(x_s - 2) * np.sign(x_s - 2)
-            # Calculate adjusted weight (w_adjusted(x_s,g))
-            adjusted_weight = w_x_s_g.where(w_x_s_g != 0, Bg)
+        # Pre-compute all necessary stats for the control group
+        control_counts_df = control_data.apply(pd.Series.value_counts, axis=1).fillna(0).astype(int)
+        mu_controls = control_data.mean(axis=1)
+        sigma_controls = control_data.std(axis=1)
-            # Create a DataFrame for the current sample
-            df_sample = pd.DataFrame({
-                'adjusted_weight': adjusted_weight
-            })
-            adjusted_weights_dict[sample_name] = df_sample
+        ranked_dfs = {}
-        return adjusted_weights_dict
+        # 2. Loop through all samples
+        for sample_name in df.columns:
+            sample_series = df[sample_name]
+            scores = []
+            # Loop through each gene in the current sample
+            for gene, cn_value in sample_series.items():
+                if cn_value != 2:
+                    # Look up k: number of controls with the same CN value
+                    k = control_counts_df.loc[gene, cn_value] if cn_value in control_counts_df.columns else 0
+                    # Construct 2x2 table cells for a stable Odds Ratio calculation
+                    a, b = 1.5, 0.5
+                    c, d = k + 0.5, (N - k) + 0.5
+                    # Calculate the corrected odds ratio
+                    or_corrected = (a * d) / (b * c)
+                    # Handle edge case for log transform if OR is somehow non-positive
+                    if or_corrected <= 0:
+                        or_corrected = epsilon
+                    # Look up the pre-computed standard deviation for the gene
+                    sigma_for_gene = sigma_controls.loc[gene]
+                    # Calculate the final score using the enhanced formula
+                    score = (np.sign(cn_value - 2) * np.log10(or_corrected)) / (sigma_for_gene + epsilon)
+                else: # cn_value == 2
+                    # Look up pre-computed mean and std dev for the gene
+                    mu_for_gene = mu_controls.loc[gene]
+                    sigma_for_gene = sigma_controls.loc[gene]
+                    # Calculate the Z-score relative to the control mean to capture nuance
+                    score = (2 - mu_for_gene) / (sigma_for_gene + epsilon)
+                scores.append(score)
+            df_sample = pd.DataFrame(
+                {'adjusted_weight': scores},
+                index=df.index
+            )
+            ranked_dfs[sample_name] = df_sample
+        return ranked_dfs
     else:
         raise ValueError("Omic type must be 'RNA', 'DNAm', or 'CNV'")
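
The new per-gene CNV score can be checked by hand on a small example. The sketch below restates the arithmetic from the hunk above for a single gene. `cnv_gene_score` is a hypothetical name and the control values are invented. It omits the `or_corrected <= 0` guard, since the corrected odds ratio is always positive when `a`, `b`, `c`, and `d` are built this way.

```python
import numpy as np
import pandas as pd


def cnv_gene_score(cn_value, control_values, epsilon=0.01):
    """Score one gene in one sample against the control copy numbers.

    Hypothetical single-gene restatement of the 0.3.2 CNV branch.
    """
    controls = pd.Series(control_values, dtype=float)
    n_controls = len(controls)
    sigma = controls.std()  # pandas default: sample std (ddof=1)

    if cn_value != 2:
        # k: number of controls sharing this exact copy number
        k = int((controls == cn_value).sum())
        # 2x2 table with 0.5 added to every cell
        a, b = 1.5, 0.5
        c, d = k + 0.5, (n_controls - k) + 0.5
        odds_ratio = (a * d) / (b * c)
        score = np.sign(cn_value - 2) * np.log10(odds_ratio) / (sigma + epsilon)
    else:
        # Diploid call: z-score of 2 relative to the control mean
        score = (2 - controls.mean()) / (sigma + epsilon)
    return float(score)


controls = [2, 2, 2, 3]  # illustrative control copy numbers for one gene
gain_score = cnv_gene_score(3, controls)     # seen in 1 of 4 controls
neutral_score = cnv_gene_score(2, controls)  # diploid
loss_score = cnv_gene_score(1, controls)     # seen in 0 of 4 controls
```

With these controls, the gain has k = 1, giving an odds ratio of (1.5 × 3.5) / (0.5 × 1.5) = 7, while the loss has k = 0, giving (1.5 × 4.5) / (0.5 × 0.5) = 27. A copy number that is rarer among controls therefore receives a larger absolute score, and the sign follows the direction of the change relative to 2.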

{simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy/visualize.py RENAMED

@@ -362,19 +362,26 @@ def create_interactive_plot(data, title_suffix=""):
         title=f"{title_suffix}",
         title_x=0.5,
         title_y=0.95,
+        title_font=dict(size=18),  # Optional: also increase title font
         scene=dict(
             xaxis=dict(
                 title='DNAm',
+                titlefont=dict(size=18),  # Axis title font size
+                tickfont=dict(size=18),   # Tick label font size
                 backgroundcolor="white",
                 gridcolor="grey"
             ),
             yaxis=dict(
                 title='TPM',
+                titlefont=dict(size=18),  # Axis title font size
+                tickfont=dict(size=18),   # Tick label font size
                 backgroundcolor="white",
                 gridcolor="grey"
             ),
             zaxis=dict(
                 title='CNV',
+                titlefont=dict(size=18),  # Axis title font size
+                tickfont=dict(size=18),   # Tick label font size
                 backgroundcolor="white",
                 gridcolor="grey"
             ),
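
The three axis blocks in this hunk repeat the same font settings. A minimal sketch of one way to express them once is shown below, using plain dictionaries so that it runs without Plotly installed. `axis_style` is a hypothetical helper, not part of SIMPApy. It uses the nested `title.font` form because Plotly documents the flat `titlefont` axis attribute as deprecated, and newer releases may reject it.

```python
def axis_style(label, size=18):
    """Build one 3D scene-axis dict with matching title and tick font sizes.

    Hypothetical helper; the colours are copied from the hunk above.
    """
    return dict(
        title=dict(text=label, font=dict(size=size)),
        tickfont=dict(size=size),
        backgroundcolor="white",
        gridcolor="grey",
    )


# Axis labels as they appear in create_interactive_plot.
scene = dict(
    xaxis=axis_style("DNAm"),
    yaxis=axis_style("TPM"),
    zaxis=axis_style("CNV"),
)
layout = dict(title=dict(font=dict(size=18)), scene=scene)
```

A dictionary of this shape could be passed to a Plotly figure with `fig.update_layout(**layout)`. That call is not exercised here.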

{simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy.egg-info/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: SIMPApy
-Version: 0.1.4
+Version: 0.3.2
 Summary: Normalized Single Sample Integrated Multiomics Pathway Analysis
 Author-email: Hasan Alsharoh <hasanalsharoh@gmail.com>
 License: Apache License (2.0)
@@ -35,12 +35,12 @@ Normalized Single Sample Integrated Multi-Omics Pathway Analysis for Python
 ## Description
-`SIMPApy` is a Python package for performing Gene Set Enrichment Analysis (GSEA) on multiomics data in single samples and integrating the results. It supports RNA sequencing, DNA methylation, and copy number variation data types.
+`SIMPApy` is a Python package for performing Gene Set Enrichment Analysis (GSEA) on multiomics data in single samples and integrating the results. It supports RNA sequencing, DNA methylation, and copy number variation data types. This package uses the normalized single sample single --omic pathway analysis (SOPA) framework and integrates the results through normalized single sample integrated multiomics pathway analysis (SIMPA) extension.
 ## Installation
-To install SIMPApy, create a new virtual environment (preferably also installing Jupyter Notebooks through anaconda). Afterwards, use:
+To install SIMPApy, create a new virtual environment (also installing Jupyter Notebooks through anaconda). Afterwards, use:
 ```bash
 pip install SIMPApy
 ```
@@ -60,7 +60,7 @@ To use SOPA, we need raw data for a single -omic, and as follows:
     1. For RNAseq data: TPM, or normalized counts. FPKM, RPKM may work, but the tool was not validated on them.
     2. For CNV data: Copy numbers. Each gene must have its individual copy numbers.
     3. For DNAm data: Beta values. The values must be mapped to gene names rather than CpG sites.
-- The dataframes must contain sample divided into 2 groups, cases and controls. For the case group, each sample must start with the name 'tm', while for the control group, each sample must start with 'tw'.
+- The dataframes must contain samples divided into 2 groups, cases and controls. For the case group, each sample must start with the name 'tm', while for the control group, each sample must start with 'tw'.
 - It is possible to use more than 2 groups. For example:
         Assume 3 groups (group1, group2, group3). Analyses could be done between group1 and group2, and then group3 and group2, or group3 and group1, according to preference.
@@ -99,7 +99,7 @@ dnaranks_df = pd.concat({k: v['weighted'] for k, v in dnaranks.items()}, axis=1)
 ### CNV data
 Here, make sure to have copy number data. If GISTIC2 data is present (often through log(Copy_Number +1)), then conversion is usually done through 2**(GISTIC2_value -1)
 ```python
-cnvs = sp.calculate_ranking(cnv_data, omic = 'cnv'):
+cnranks = sp.calculate_ranking(cnv_data, omic = 'cnv'):
 # cnv_data: Pandas DataFrame with gene-level copy numbers.
 # Rows are genes and columns are samples ('tm' for cases, 'tw' for controls).
 ```
@@ -211,8 +211,13 @@ Figure settings allow the following:
     - Reset View (button): Restores the view to unfiltered.
     - Export (300DPI): This options exports the current view of the 3D box.
-## Requirements
+## Downstream analysis
+This package also offers direct analysis of results obtained with both SOPA and SIMPA through the .analyze module.
+Please consult the module's documentation for further instructions.
+# Requirements
 - Python ≥ 3.8
 - gseapy==1.1.3