SIMPApy 0.1.4__tar.gz → 0.3.2__tar.gz

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: SIMPApy
-Version: 0.1.4
+Version: 0.3.2
 Summary: Normalized Single Sample Integrated Multiomics Pathway Analysis
 Author-email: Hasan Alsharoh <hasanalsharoh@gmail.com>
 License: Apache License (2.0)
@@ -35,12 +35,12 @@ Normalized Single Sample Integrated Multi-Omics Pathway Analysis for Python
 
 ## Description
 
-`SIMPApy` is a Python package for performing Gene Set Enrichment Analysis (GSEA) on multiomics data in single samples and integrating the results. It supports RNA sequencing, DNA methylation, and copy number variation data types.
+`SIMPApy` is a Python package for performing Gene Set Enrichment Analysis (GSEA) on multiomics data in single samples and integrating the results. It supports RNA sequencing, DNA methylation, and copy number variation data types. This package uses the normalized single sample single --omic pathway analysis (SOPA) framework and integrates the results through normalized single sample integrated multiomics pathway analysis (SIMPA) extension.
 
 ## Installation
 
 
-To install SIMPApy, create a new virtual environment (preferably also installing Jupyter Notebooks through anaconda). Afterwards, use:
+To install SIMPApy, create a new virtual environment (also installing Jupyter Notebooks through anaconda). Afterwards, use:
 ```bash
 pip install SIMPApy
 ```
@@ -60,7 +60,7 @@ To use SOPA, we need raw data for a single -omic, and as follows:
 1. For RNAseq data: TPM, or normalized counts. FPKM, RPKM may work, but the tool was not validated on them.
 2. For CNV data: Copy numbers. Each gene must have its individual copy numbers.
 3. For DNAm data: Beta values. The values must be mapped to gene names rather than CpG sites.
-- The dataframes must contain sample divided into 2 groups, cases and controls. For the case group, each sample must start with the name 'tm', while for the control group, each sample must start with 'tw'.
+- The dataframes must contain samples divided into 2 groups, cases and controls. For the case group, each sample must start with the name 'tm', while for the control group, each sample must start with 'tw'.
 - It is possible to use more than 2 groups. For example:
 
 Assume 3 groups (group1, group2, group3). Analyses could be done between group1 and group2, and then group3 and group2, or group3 and group1, according to preference.
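The 'tm'/'tw' prefix convention in this hunk can be applied to any pair of groups by renaming columns before analysis. The following is a minimal sketch, assuming a gene-by-sample DataFrame and a mapping from sample name to group label. The helper `label_pair`, the sample names, and the `groups` mapping are hypothetical and not part of SIMPApy.

```python
import pandas as pd

def label_pair(df, groups, case, control):
    """Keep samples from two groups and prefix them 'tm' (case) / 'tw' (control).

    `groups` maps each column name to its group label (hypothetical input).
    """
    renamed = {}
    for col in df.columns:
        if groups[col] == case:
            renamed[col] = f"tm_{col}"
        elif groups[col] == control:
            renamed[col] = f"tw_{col}"
    # Columns belonging to any other group are dropped for this comparison.
    return df[list(renamed)].rename(columns=renamed)

# Toy expression table: rows are genes, columns are samples.
expr = pd.DataFrame(
    {"s1": [1.0, 2.0], "s2": [3.0, 4.0], "s3": [5.0, 6.0]},
    index=["GENE_A", "GENE_B"],
)
groups = {"s1": "group1", "s2": "group2", "s3": "group3"}

g1_vs_g2 = label_pair(expr, groups, case="group1", control="group2")
g3_vs_g2 = label_pair(expr, groups, case="group3", control="group2")
print(list(g1_vs_g2.columns))
print(list(g3_vs_g2.columns))
```

Each comparison gets its own relabelled copy, so the same control group can be reused against different case groups without touching the original table.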
@@ -99,7 +99,7 @@ dnaranks_df = pd.concat({k: v['weighted'] for k, v in dnaranks.items()}, axis=1)
 ### CNV data
 Here, make sure to have copy number data. If GISTIC2 data is present (often through log(Copy_Number +1)), then conversion is usually done through 2**(GISTIC2_value -1)
 ```python
-cnvs = sp.calculate_ranking(cnv_data, omic = 'cnv'):
+cnranks = sp.calculate_ranking(cnv_data, omic = 'cnv'):
 # cnv_data: Pandas DataFrame with gene-level copy numbers.
 # Rows are genes and columns are samples ('tm' for cases, 'tw' for controls).
 ```
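Both versions of the README snippet end with a trailing colon, which Python rejects if the line is copied verbatim. Judging from the DNAm line in the hunk header and the `adjusted_weight` column built in the CNV branch of `calculate_ranking`, the CNV result appears to be a dict of one-column DataFrames keyed by sample name. The sketch below collects such a dict into a gene-by-sample table. `cnranks` here is hand-built mock data standing in for the real return value, so no SIMPApy call is made.

```python
import pandas as pd

genes = ["GENE_A", "GENE_B", "GENE_C"]

# Mock stand-in for the dict returned by calculate_ranking(..., omic='cnv').
cnranks = {
    "tm_01": pd.DataFrame({"adjusted_weight": [0.8, -1.2, 0.0]}, index=genes),
    "tw_01": pd.DataFrame({"adjusted_weight": [0.1, 0.0, -0.3]}, index=genes),
}

# One column per sample, one row per gene.
cnranks_df = pd.concat(
    {k: v["adjusted_weight"] for k, v in cnranks.items()}, axis=1
)
print(cnranks_df)
```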
@@ -211,8 +211,13 @@ Figure settings allow the following:
 - Reset View (button): Restores the view to unfiltered.
 
 - Export (300DPI): This options exports the current view of the 3D box.
-
-## Requirements
+
+## Downstream analysis
+This package also offers direct analysis of results obtained with both SOPA and SIMPA through the .analyze module.
+Please consult the module's documentation for further instructions.
+
+
+# Requirements
 
 - Python ≥ 3.8
 - gseapy==1.1.3
@@ -4,12 +4,12 @@ Normalized Single Sample Integrated Multi-Omics Pathway Analysis for Python
 
 ## Description
 
-`SIMPApy` is a Python package for performing Gene Set Enrichment Analysis (GSEA) on multiomics data in single samples and integrating the results. It supports RNA sequencing, DNA methylation, and copy number variation data types.
+`SIMPApy` is a Python package for performing Gene Set Enrichment Analysis (GSEA) on multiomics data in single samples and integrating the results. It supports RNA sequencing, DNA methylation, and copy number variation data types. This package uses the normalized single sample single --omic pathway analysis (SOPA) framework and integrates the results through normalized single sample integrated multiomics pathway analysis (SIMPA) extension.
 
 ## Installation
 
 
-To install SIMPApy, create a new virtual environment (preferably also installing Jupyter Notebooks through anaconda). Afterwards, use:
+To install SIMPApy, create a new virtual environment (also installing Jupyter Notebooks through anaconda). Afterwards, use:
 ```bash
 pip install SIMPApy
 ```
@@ -29,7 +29,7 @@ To use SOPA, we need raw data for a single -omic, and as follows:
 1. For RNAseq data: TPM, or normalized counts. FPKM, RPKM may work, but the tool was not validated on them.
 2. For CNV data: Copy numbers. Each gene must have its individual copy numbers.
 3. For DNAm data: Beta values. The values must be mapped to gene names rather than CpG sites.
-- The dataframes must contain sample divided into 2 groups, cases and controls. For the case group, each sample must start with the name 'tm', while for the control group, each sample must start with 'tw'.
+- The dataframes must contain samples divided into 2 groups, cases and controls. For the case group, each sample must start with the name 'tm', while for the control group, each sample must start with 'tw'.
 - It is possible to use more than 2 groups. For example:
 
 Assume 3 groups (group1, group2, group3). Analyses could be done between group1 and group2, and then group3 and group2, or group3 and group1, according to preference.
@@ -68,7 +68,7 @@ dnaranks_df = pd.concat({k: v['weighted'] for k, v in dnaranks.items()}, axis=1)
 ### CNV data
 Here, make sure to have copy number data. If GISTIC2 data is present (often through log(Copy_Number +1)), then conversion is usually done through 2**(GISTIC2_value -1)
 ```python
-cnvs = sp.calculate_ranking(cnv_data, omic = 'cnv'):
+cnranks = sp.calculate_ranking(cnv_data, omic = 'cnv'):
 # cnv_data: Pandas DataFrame with gene-level copy numbers.
 # Rows are genes and columns are samples ('tm' for cases, 'tw' for controls).
 ```
@@ -180,8 +180,13 @@ Figure settings allow the following:
 - Reset View (button): Restores the view to unfiltered.
 
 - Export (300DPI): This options exports the current view of the 3D box.
-
-## Requirements
+
+## Downstream analysis
+This package also offers direct analysis of results obtained with both SOPA and SIMPA through the .analyze module.
+Please consult the module's documentation for further instructions.
+
+
+# Requirements
 
 - Python ≥ 3.8
 - gseapy==1.1.3
@@ -18,7 +18,7 @@ from .preprocess import _extract_tag_genes, _create_aggregated_dataframes, proce
 from .visualize import _create_traces, create_interactive_plot
 from .analyze import group_diffs, plot_volcano, calculate_correlation, plot_correlation_scatterplot
 
-__version__ = "0.1.4"
+__version__ = "0.3.2"
 __all__ = [
     "calculate_ranking",
     "sopa",
@@ -12,9 +12,9 @@ from statsmodels.stats.multitest import multipletests
 
 def group_diffs(
     data: pd.DataFrame,
-    pathway_col: str,
-    value_col: str,
-    group_col: str,
+    pathway_col: str = 'Term',
+    value_col: str = 'nes',
+    group_col: str = 'sample_name',
     group1_prefix: str = 'tm',
     group2_prefix: str = 'tw',
     adj_method: str = 'fdr_bh',
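The new defaults suggest that `group_diffs` expects a long-format table with one row per pathway and sample, using the columns `Term`, `nes`, and `sample_name`. The sketch below shows that shape and a prefix-based group split. It assumes, without confirmation from this hunk, that the mean difference is the group1 mean minus the group2 mean. The data are made up, and the code does not call or reproduce `group_diffs` itself.

```python
import pandas as pd

# Long-format enrichment results: one row per (pathway, sample).
long_df = pd.DataFrame({
    "Term": ["P1", "P1", "P1", "P1", "P2", "P2", "P2", "P2"],
    "nes": [1.0, 2.0, 0.0, 1.0, -1.0, -2.0, 0.5, 1.5],
    "sample_name": ["tm_1", "tm_2", "tw_1", "tw_2"] * 2,
})

# Split samples by the default prefixes 'tm' (group1) and 'tw' (group2).
is_case = long_df["sample_name"].str.startswith("tm")
is_ctrl = long_df["sample_name"].str.startswith("tw")

case_mean = long_df[is_case].groupby("Term")["nes"].mean()
ctrl_mean = long_df[is_ctrl].groupby("Term")["nes"].mean()

# Assumed definition: case mean minus control mean, per pathway.
mean_diff = case_mean - ctrl_mean
print(mean_diff)
```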
@@ -84,43 +84,68 @@ def group_diffs(
     return results_df.sort_values('p_value').reset_index(drop=True)
 
 
-def plot_volcano(
-    data: pd.DataFrame,
-    x_col: str = 'mean_diff',
-    y_col: str = 'neg_log10_p_adj',
-    p_thresh: float = 0.05,
-    title: str = "Group Differences Volcano Plot",
-    xlabel: str = "Mean Difference",
-    ylabel: str = "-log10(Adjusted P-value)",
-    save_path: str = None
-) -> None:
+def plot_volcano(df,
+                 pvalthresh=0.05,
+                 mean_thresh=0.5,
+                 figsize=(6, 4),
+                 fontsize=18,
+                 labelsize=20,
+                 out='volcano.png'):
     """
-    Generates and optionally saves a volcano plot from group difference results.
-
-    Args:
-        data (pd.DataFrame): DataFrame from group_diffs function.
-        x_col (str): Column for the x-axis (mean difference).
-        y_col (str): Column for the y-axis (-log10 adjusted p-value).
-        p_thresh (float): Significance threshold for p-value.
-        title (str): The title of the plot.
-        xlabel (str): The label for the x-axis.
-        ylabel (str): The label for the y-axis.
-        save_path (str): If provided, the plot is saved to this path.
+    Volcano plot that always displays all points and highlights:
+      - both significant by p_adj and mean_diff (both)
+      - only p_adj significant (p_only)
+      - only mean_diff large (mean_only)
+      - neither (none)
+    This prevents points "disappearing" when mean_thresh is set high.
     """
-    plt.figure(figsize=(10, 8))
-    sns.scatterplot(data=data, x=x_col, y=y_col, color='blue')
-
-    neg_log10_p_threshold = -np.log10(p_thresh)
-    plt.axhline(y=neg_log10_p_threshold, color='gray', linestyle='--')
-
-    plt.xlabel(xlabel)
-    plt.ylabel(ylabel)
-    plt.title(title)
-    plt.grid(True)
-
-    if save_path:
-        plt.savefig(save_path, dpi=300)
-
+    # thresholds
+    neg_log10_p_thresh = -np.log10(pvalthresh)
+
+    sig_pval = df['p_adj'] < pvalthresh
+    sig_mean = df['mean_diff'].abs() > mean_thresh
+
+    both = sig_pval & sig_mean
+    p_only = sig_pval & ~sig_mean
+    mean_only = ~sig_pval & sig_mean
+    none = ~sig_pval & ~sig_mean
+
+    plt.figure(figsize=figsize)
+
+    # plot all categories in order (background -> foreground)
+    plt.scatter(df.loc[none, 'mean_diff'], df.loc[none, 'neg_log10_p_adj'],
+                color='lightgrey', alpha=0.6, label='Not significant', s=40, zorder=1)
+    if mean_only.any():
+        plt.scatter(df.loc[mean_only, 'mean_diff'], df.loc[mean_only, 'neg_log10_p_adj'],
+                    color='grey', alpha=0.9, label=f'|mean| > {mean_thresh}', s=50, zorder=2)
+    if p_only.any():
+        plt.scatter(df.loc[p_only, 'mean_diff'], df.loc[p_only, 'neg_log10_p_adj'],
+                    color='grey', alpha=0.9, label=f'p_adj < {pvalthresh}', s=50, zorder=3)
+    if both.any():
+        plt.scatter(df.loc[both, 'mean_diff'], df.loc[both, 'neg_log10_p_adj'],
+                    color='blue', alpha=0.9, label='Both criteria', s=60, zorder=4)
+
+    # threshold lines
+    plt.axhline(y=neg_log10_p_thresh, color='gray', linestyle='--', lw=1)
+    plt.axvline(x=mean_thresh, color='gray', linestyle='--', lw=1)
+    plt.axvline(x=-mean_thresh, color='gray', linestyle='--', lw=1)
+
+    # axis labels and styling
+    plt.xlabel("Mean NES Difference (TMA - TWA)", fontsize=fontsize)
+    plt.ylabel("-log10(adjusted p-value)", fontsize=fontsize)
+    plt.tick_params(axis='both', labelsize=labelsize)
+    plt.grid(True, alpha=0.3)
+
+    # expand limits a bit so lines/points are not at the edge
+    xpad = (df['mean_diff'].max() - df['mean_diff'].min()) * 0.05
+    ypad = (df['neg_log10_p_adj'].max() - df['neg_log10_p_adj'].min()) * 0.05
+    if np.isfinite(xpad):
+        plt.xlim(df['mean_diff'].min() - max(0.1, xpad), df['mean_diff'].max() + max(0.1, xpad))
+    if np.isfinite(ypad):
+        plt.ylim(max(0, df['neg_log10_p_adj'].min() - max(0.1, ypad)), df['neg_log10_p_adj'].max() + max(0.1, ypad))
+
+    plt.tight_layout()
+    plt.savefig(out, dpi=300)
     plt.show()
 
 def calculate_correlation(data: pd.DataFrame, x_col: str, y_col: str, group_col: str = None) -> pd.DataFrame:
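The rewritten `plot_volcano` partitions every point into exactly one of four categories before plotting. The sketch below isolates that partition without any plotting, which makes it easy to check on a small table. The helper `classify_points` and the toy data are hypothetical. The column names `p_adj`, `mean_diff`, and `neg_log10_p_adj`, and the default thresholds, are taken from the hunk.

```python
import numpy as np
import pandas as pd

def classify_points(df, pvalthresh=0.05, mean_thresh=0.5):
    """Label each row 'both', 'p_only', 'mean_only', or 'none'."""
    sig_pval = df["p_adj"] < pvalthresh
    sig_mean = df["mean_diff"].abs() > mean_thresh

    # The four masks are mutually exclusive and cover every row.
    category = pd.Series("none", index=df.index, name="category")
    category[~sig_pval & sig_mean] = "mean_only"
    category[sig_pval & ~sig_mean] = "p_only"
    category[sig_pval & sig_mean] = "both"
    return category

results = pd.DataFrame({
    "mean_diff": [1.2, 0.1, -0.9, 0.2],
    "p_adj": [0.01, 0.001, 0.5, 0.8],
})
results["neg_log10_p_adj"] = -np.log10(results["p_adj"])
results["category"] = classify_points(results)
print(results)
```

Because the comparison uses the absolute value of `mean_diff`, a strongly negative difference such as -0.9 counts as large in the same way as a positive one.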
@@ -101,39 +101,71 @@ def calculate_ranking(
             # Store the DataFrame in the dictionary
             ranked_dfs[sample] = sample_df
 
-            # Delete the DataFrame to free up memory (optional)
             del sample_df
 
         return ranked_dfs
 
     elif omic.upper() == "CNV":
-        # Calculate baseline importance (Bg) for each gene
-        control_data = df.filter(regex='^tw') # Control data
-        control_std = control_data.std(axis=1)
-        control_max = control_data.max(axis=1)
-        Bg = control_std / control_max
 
-        # Create a dictionary to store results
-        adjusted_weights_dict = {}
+        control_data = df.filter(regex='^tw')
+        N = len(control_data.columns)
+        epsilon = 0.01 # Small constant to prevent division by zero
 
-        # Loop through all samples (cases and controls)
-        for sample_name in df.columns:
-            x_s = df[sample_name] # Copy numbers for the current sample
-
-            # Calculate non-linear weight (w(x_s,g))
-            w_x_s_g = 2 ** np.abs(x_s - 2) * np.sign(x_s - 2)
-
-            # Calculate adjusted weight (w_adjusted(x_s,g))
-            adjusted_weight = w_x_s_g.where(w_x_s_g != 0, Bg)
+        # Pre-compute all necessary stats for the control group
+        control_counts_df = control_data.apply(pd.Series.value_counts, axis=1).fillna(0).astype(int)
+        mu_controls = control_data.mean(axis=1)
+        sigma_controls = control_data.std(axis=1)
 
-            # Create a DataFrame for the current sample
-            df_sample = pd.DataFrame({
-                'adjusted_weight': adjusted_weight
-            })
-
-            adjusted_weights_dict[sample_name] = df_sample
+        ranked_dfs = {}
 
-        return adjusted_weights_dict
+        # 2. Loop through all samples
+        for sample_name in df.columns:
+            sample_series = df[sample_name]
+            scores = []
+
+            # Loop through each gene in the current sample
+            for gene, cn_value in sample_series.items():
+
+                if cn_value != 2:
+
+                    # Look up k: number of controls with the same CN value
+                    k = control_counts_df.loc[gene, cn_value] if cn_value in control_counts_df.columns else 0
+
+                    # Construct 2x2 table cells for a stable Odds Ratio calculation
+                    a, b = 1.5, 0.5
+                    c, d = k + 0.5, (N - k) + 0.5
+
+                    # Calculate the corrected odds ratio
+                    or_corrected = (a * d) / (b * c)
+
+                    # Handle edge case for log transform if OR is somehow non-positive
+                    if or_corrected <= 0:
+                        or_corrected = epsilon
+
+                    # Look up the pre-computed standard deviation for the gene
+                    sigma_for_gene = sigma_controls.loc[gene]
+
+                    # Calculate the final score using the enhanced formula
+                    score = (np.sign(cn_value - 2) * np.log10(or_corrected)) / (sigma_for_gene + epsilon)
+
+                else: # cn_value == 2
+
+                    # Look up pre-computed mean and std dev for the gene
+                    mu_for_gene = mu_controls.loc[gene]
+                    sigma_for_gene = sigma_controls.loc[gene]
+
+                    # Calculate the Z-score relative to the control mean to capture nuance
+                    score = (2 - mu_for_gene) / (sigma_for_gene + epsilon)
+
+                scores.append(score)
+
+            df_sample = pd.DataFrame(
+                {'adjusted_weight': scores},
+                index=df.index
+            )
+            ranked_dfs[sample_name] = df_sample
+
+        return ranked_dfs
 
     else:
         raise ValueError("Omic type must be 'RNA', 'DNAm', or 'CNV'")
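The new CNV branch scores each gene from a continuity-corrected odds ratio when the copy number differs from 2, and from a z-score-like term otherwise. The sketch below reproduces that arithmetic for a single gene so the numbers can be checked by hand. The function `cnv_score` is hypothetical: it counts matching controls directly instead of using the precomputed `control_counts_df`, and it uses `ddof=1` to match the pandas `.std()` default. Since `a`, `b`, `c`, and `d` are all at least 0.5, the odds ratio is always positive, so the `or_corrected <= 0` guard in the hunk appears unreachable.

```python
import numpy as np

def cnv_score(cn_value, control_values, epsilon=0.01):
    """Score one gene in one sample against that gene's control copy numbers."""
    control_values = np.asarray(control_values, dtype=float)
    n = control_values.size
    sigma = control_values.std(ddof=1)  # sample std, as pandas .std() computes

    if cn_value != 2:
        # k: number of controls sharing the sample's copy number.
        k = int(np.sum(control_values == cn_value))
        a, b = 1.5, 0.5
        c, d = k + 0.5, (n - k) + 0.5
        odds_ratio = (a * d) / (b * c)
        return np.sign(cn_value - 2) * np.log10(odds_ratio) / (sigma + epsilon)

    # Diploid in the sample: distance of 2 from the control mean.
    return (2 - control_values.mean()) / (sigma + epsilon)

controls = [2, 2, 2, 3]
gain = cnv_score(3, controls)     # k = 1, odds ratio = 7
loss = cnv_score(1, controls)     # k = 0, odds ratio = 27
neutral = cnv_score(2, controls)  # (2 - 2.25) / (0.5 + 0.01)
print(gain, loss, neutral)
```

A copy number that is rare among controls yields a larger odds ratio and therefore a larger absolute score, while the sign follows the direction of the change relative to 2.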
@@ -362,19 +362,26 @@ def create_interactive_plot(data, title_suffix=""):
         title=f"{title_suffix}",
         title_x=0.5,
         title_y=0.95,
+        title_font=dict(size=18), # Optional: also increase title font
         scene=dict(
             xaxis=dict(
                 title='DNAm',
+                titlefont=dict(size=18), # Axis title font size
+                tickfont=dict(size=18), # Tick label font size
                 backgroundcolor="white",
                 gridcolor="grey"
             ),
             yaxis=dict(
                 title='TPM',
+                titlefont=dict(size=18), # Axis title font size
+                tickfont=dict(size=18), # Tick label font size
                 backgroundcolor="white",
                 gridcolor="grey"
             ),
             zaxis=dict(
                 title='CNV',
+                titlefont=dict(size=18), # Axis title font size
+                tickfont=dict(size=18), # Tick label font size
                 backgroundcolor="white",
                 gridcolor="grey"
             ),
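The flat `titlefont` axis attribute added in this hunk is documented by Plotly as deprecated in favour of the nested `title.font` form, and newer Plotly releases may reject it. The sketch below builds the same three axes with the nested form using plain dicts, so it runs without Plotly installed. The helper `axis_style` is hypothetical and not part of SIMPApy.

```python
def axis_style(label, size=18):
    """Build a 3D axis dict using the nested title form (hypothetical helper)."""
    return dict(
        title=dict(text=label, font=dict(size=size)),  # replaces title= plus titlefont=
        tickfont=dict(size=size),
        backgroundcolor="white",
        gridcolor="grey",
    )

scene = dict(
    xaxis=axis_style("DNAm"),
    yaxis=axis_style("TPM"),
    zaxis=axis_style("CNV"),
)

# `scene` could then be passed on as fig.update_layout(scene=scene).
print(scene["xaxis"]["title"])
```

Factoring the axis settings into one helper also removes the threefold repetition of the font sizes seen in the hunk.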
File without changes