SIMPApy 0.1.4__tar.gz → 0.3.2__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {simpapy-0.1.4 → simpapy-0.3.2}/PKG-INFO +12 -7
- {simpapy-0.1.4 → simpapy-0.3.2}/README.md +11 -6
- {simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy/__init__.py +1 -1
- {simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy/analyze.py +63 -38
- {simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy/ranking.py +56 -24
- {simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy/visualize.py +7 -0
- {simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy.egg-info/PKG-INFO +12 -7
- {simpapy-0.1.4 → simpapy-0.3.2}/LICENSE.txt +0 -0
- {simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy/SIMPA.py +0 -0
- {simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy/core.py +0 -0
- {simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy/preprocess.py +0 -0
- {simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy.egg-info/SOURCES.txt +0 -0
- {simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy.egg-info/dependency_links.txt +0 -0
- {simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy.egg-info/requires.txt +0 -0
- {simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy.egg-info/top_level.txt +0 -0
- {simpapy-0.1.4 → simpapy-0.3.2}/pyproject.toml +0 -0
- {simpapy-0.1.4 → simpapy-0.3.2}/setup.cfg +0 -0
- {simpapy-0.1.4 → simpapy-0.3.2}/tests/test_SIMPA.py +0 -0
- {simpapy-0.1.4 → simpapy-0.3.2}/tests/test_analyze.py +0 -0
- {simpapy-0.1.4 → simpapy-0.3.2}/tests/test_core.py +0 -0
- {simpapy-0.1.4 → simpapy-0.3.2}/tests/test_preprocess.py +0 -0
- {simpapy-0.1.4 → simpapy-0.3.2}/tests/test_ranking.py +0 -0
- {simpapy-0.1.4 → simpapy-0.3.2}/tests/test_visualize.py +0 -0
{simpapy-0.1.4 → simpapy-0.3.2}/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: SIMPApy
-Version: 0.
+Version: 0.3.2
 Summary: Normalized Single Sample Integrated Multiomics Pathway Analysis
 Author-email: Hasan Alsharoh <hasanalsharoh@gmail.com>
 License: Apache License (2.0)
@@ -35,12 +35,12 @@ Normalized Single Sample Integrated Multi-Omics Pathway Analysis for Python
 
 ## Description
 
-`SIMPApy` is a Python package for performing Gene Set Enrichment Analysis (GSEA) on multiomics data in single samples and integrating the results. It supports RNA sequencing, DNA methylation, and copy number variation data types.
+`SIMPApy` is a Python package for performing Gene Set Enrichment Analysis (GSEA) on multiomics data in single samples and integrating the results. It supports RNA sequencing, DNA methylation, and copy number variation data types. This package uses the normalized single sample single --omic pathway analysis (SOPA) framework and integrates the results through normalized single sample integrated multiomics pathway analysis (SIMPA) extension.
 
 ## Installation
 
 
-To install SIMPApy, create a new virtual environment (
+To install SIMPApy, create a new virtual environment (also installing Jupyter Notebooks through anaconda). Afterwards, use:
 ```bash
 pip install SIMPApy
 ```
@@ -60,7 +60,7 @@ To use SOPA, we need raw data for a single -omic, and as follows:
 1. For RNAseq data: TPM, or normalized counts. FPKM, RPKM may work, but the tool was not validated on them.
 2. For CNV data: Copy numbers. Each gene must have its individual copy numbers.
 3. For DNAm data: Beta values. The values must be mapped to gene names rather than CpG sites.
-- The dataframes must contain
+- The dataframes must contain samples divided into 2 groups, cases and controls. For the case group, each sample must start with the name 'tm', while for the control group, each sample must start with 'tw'.
 - It is possible to use more than 2 groups. For example:
 
 Assume 3 groups (group1, group2, group3). Analyses could be done between group1 and group2, and then group3 and group2, or group3 and group1, according to preference.
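The 'tm'/'tw' naming rule in the hunk above can be checked before running any analysis. The following is a minimal sketch, assuming a toy gene-by-sample DataFrame; the sample and gene names are hypothetical, and only the two prefixes and the `df.filter(regex='^tw')` idiom come from the package source shown in this diff.

```python
import pandas as pd

# Hypothetical toy matrix: rows are genes, columns are samples.
# Only the 'tm' (case) and 'tw' (control) prefixes follow the README.
df = pd.DataFrame(
    {
        "tm_01": [2, 3, 1],
        "tm_02": [2, 4, 2],
        "tw_01": [2, 2, 2],
        "tw_02": [2, 2, 1],
    },
    index=["GENE_A", "GENE_B", "GENE_C"],
)

# Prefix-based split of the columns, using the same regex style as ranking.py.
cases = df.filter(regex="^tm")
controls = df.filter(regex="^tw")

# Columns matching neither prefix would be missed by a prefix-based split,
# so it can help to list them explicitly.
unmatched = [c for c in df.columns if not c.startswith(("tm", "tw"))]

print(list(cases.columns))
print(list(controls.columns))
print(unmatched)
```

For a design with more than two groups, one option under the same assumption is to rename the columns of the two groups being compared to the 'tm' and 'tw' prefixes before each pairwise run.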
@@ -99,7 +99,7 @@ dnaranks_df = pd.concat({k: v['weighted'] for k, v in dnaranks.items()}, axis=1)
 ### CNV data
 Here, make sure to have copy number data. If GISTIC2 data is present (often through log(Copy_Number +1)), then conversion is usually done through 2**(GISTIC2_value -1)
 ```python
-
+cnranks = sp.calculate_ranking(cnv_data, omic = 'cnv'):
 # cnv_data: Pandas DataFrame with gene-level copy numbers.
 # Rows are genes and columns are samples ('tm' for cases, 'tw' for controls).
 ```
@@ -211,8 +211,13 @@ Figure settings allow the following:
 - Reset View (button): Restores the view to unfiltered.
 
 - Export (300DPI): This options exports the current view of the 3D box.
-
-##
+
+## Downstream analysis
+This package also offers direct analysis of results obtained with both SOPA and SIMPA through the .analyze module.
+Please consult the module's documentation for further instructions.
+
+
+# Requirements
 
 - Python ≥ 3.8
 - gseapy==1.1.3
{simpapy-0.1.4 → simpapy-0.3.2}/README.md
@@ -4,12 +4,12 @@ Normalized Single Sample Integrated Multi-Omics Pathway Analysis for Python
 
 ## Description
 
-`SIMPApy` is a Python package for performing Gene Set Enrichment Analysis (GSEA) on multiomics data in single samples and integrating the results. It supports RNA sequencing, DNA methylation, and copy number variation data types.
+`SIMPApy` is a Python package for performing Gene Set Enrichment Analysis (GSEA) on multiomics data in single samples and integrating the results. It supports RNA sequencing, DNA methylation, and copy number variation data types. This package uses the normalized single sample single --omic pathway analysis (SOPA) framework and integrates the results through normalized single sample integrated multiomics pathway analysis (SIMPA) extension.
 
 ## Installation
 
 
-To install SIMPApy, create a new virtual environment (
+To install SIMPApy, create a new virtual environment (also installing Jupyter Notebooks through anaconda). Afterwards, use:
 ```bash
 pip install SIMPApy
 ```
@@ -29,7 +29,7 @@ To use SOPA, we need raw data for a single -omic, and as follows:
 1. For RNAseq data: TPM, or normalized counts. FPKM, RPKM may work, but the tool was not validated on them.
 2. For CNV data: Copy numbers. Each gene must have its individual copy numbers.
 3. For DNAm data: Beta values. The values must be mapped to gene names rather than CpG sites.
-- The dataframes must contain
+- The dataframes must contain samples divided into 2 groups, cases and controls. For the case group, each sample must start with the name 'tm', while for the control group, each sample must start with 'tw'.
 - It is possible to use more than 2 groups. For example:
 
 Assume 3 groups (group1, group2, group3). Analyses could be done between group1 and group2, and then group3 and group2, or group3 and group1, according to preference.
@@ -68,7 +68,7 @@ dnaranks_df = pd.concat({k: v['weighted'] for k, v in dnaranks.items()}, axis=1)
 ### CNV data
 Here, make sure to have copy number data. If GISTIC2 data is present (often through log(Copy_Number +1)), then conversion is usually done through 2**(GISTIC2_value -1)
 ```python
-
+cnranks = sp.calculate_ranking(cnv_data, omic = 'cnv'):
 # cnv_data: Pandas DataFrame with gene-level copy numbers.
 # Rows are genes and columns are samples ('tm' for cases, 'tw' for controls).
 ```
@@ -180,8 +180,13 @@ Figure settings allow the following:
 - Reset View (button): Restores the view to unfiltered.
 
 - Export (300DPI): This options exports the current view of the 3D box.
-
-##
+
+## Downstream analysis
+This package also offers direct analysis of results obtained with both SOPA and SIMPA through the .analyze module.
+Please consult the module's documentation for further instructions.
+
+
+# Requirements
 
 - Python ≥ 3.8
 - gseapy==1.1.3
{simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy/__init__.py
@@ -18,7 +18,7 @@ from .preprocess import _extract_tag_genes, _create_aggregated_dataframes, proce
 from .visualize import _create_traces, create_interactive_plot
 from .analyze import group_diffs, plot_volcano, calculate_correlation, plot_correlation_scatterplot
 
-__version__ = "0.
+__version__ = "0.3.2"
 __all__ = [
     "calculate_ranking",
     "sopa",
{simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy/analyze.py
@@ -12,9 +12,9 @@ from statsmodels.stats.multitest import multipletests
 
 def group_diffs(
     data: pd.DataFrame,
-    pathway_col: str,
-    value_col: str,
-    group_col: str,
+    pathway_col: str = 'Term',
+    value_col: str = 'nes',
+    group_col: str = 'sample_name',
     group1_prefix: str = 'tm',
     group2_prefix: str = 'tw',
     adj_method: str = 'fdr_bh',
@@ -84,43 +84,68 @@ def group_diffs(
     return results_df.sort_values('p_value').reset_index(drop=True)
 
 
-def plot_volcano(
-
-
-
-
-
-
-    ylabel: str = "-log10(Adjusted P-value)",
-    save_path: str = None
-) -> None:
+def plot_volcano(df,
+                 pvalthresh=0.05,
+                 mean_thresh=0.5,
+                 figsize=(6, 4),
+                 fontsize=18,
+                 labelsize=20,
+                 out='volcano.png'):
     """
-
-
-
-
-
-
-        p_thresh (float): Significance threshold for p-value.
-        title (str): The title of the plot.
-        xlabel (str): The label for the x-axis.
-        ylabel (str): The label for the y-axis.
-        save_path (str): If provided, the plot is saved to this path.
+    Volcano plot that always displays all points and highlights:
+    - both significant by p_adj and mean_diff (both)
+    - only p_adj significant (p_only)
+    - only mean_diff large (mean_only)
+    - neither (none)
+    This prevents points "disappearing" when mean_thresh is set high.
     """
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+    # thresholds
+    neg_log10_p_thresh = -np.log10(pvalthresh)
+
+    sig_pval = df['p_adj'] < pvalthresh
+    sig_mean = df['mean_diff'].abs() > mean_thresh
+
+    both = sig_pval & sig_mean
+    p_only = sig_pval & ~sig_mean
+    mean_only = ~sig_pval & sig_mean
+    none = ~sig_pval & ~sig_mean
+
+    plt.figure(figsize=figsize)
+
+    # plot all categories in order (background -> foreground)
+    plt.scatter(df.loc[none, 'mean_diff'], df.loc[none, 'neg_log10_p_adj'],
+                color='lightgrey', alpha=0.6, label='Not significant', s=40, zorder=1)
+    if mean_only.any():
+        plt.scatter(df.loc[mean_only, 'mean_diff'], df.loc[mean_only, 'neg_log10_p_adj'],
+                    color='grey', alpha=0.9, label=f'|mean| > {mean_thresh}', s=50, zorder=2)
+    if p_only.any():
+        plt.scatter(df.loc[p_only, 'mean_diff'], df.loc[p_only, 'neg_log10_p_adj'],
+                    color='grey', alpha=0.9, label=f'p_adj < {pvalthresh}', s=50, zorder=3)
+    if both.any():
+        plt.scatter(df.loc[both, 'mean_diff'], df.loc[both, 'neg_log10_p_adj'],
+                    color='blue', alpha=0.9, label='Both criteria', s=60, zorder=4)
+
+    # threshold lines
+    plt.axhline(y=neg_log10_p_thresh, color='gray', linestyle='--', lw=1)
+    plt.axvline(x=mean_thresh, color='gray', linestyle='--', lw=1)
+    plt.axvline(x=-mean_thresh, color='gray', linestyle='--', lw=1)
+
+    # axis labels and styling
+    plt.xlabel("Mean NES Difference (TMA - TWA)", fontsize=fontsize)
+    plt.ylabel("-log10(adjusted p-value)", fontsize=fontsize)
+    plt.tick_params(axis='both', labelsize=labelsize)
+    plt.grid(True, alpha=0.3)
+
+    # expand limits a bit so lines/points are not at the edge
+    xpad = (df['mean_diff'].max() - df['mean_diff'].min()) * 0.05
+    ypad = (df['neg_log10_p_adj'].max() - df['neg_log10_p_adj'].min()) * 0.05
+    if np.isfinite(xpad):
+        plt.xlim(df['mean_diff'].min() - max(0.1, xpad), df['mean_diff'].max() + max(0.1, xpad))
+    if np.isfinite(ypad):
+        plt.ylim(max(0, df['neg_log10_p_adj'].min() - max(0.1, ypad)), df['neg_log10_p_adj'].max() + max(0.1, ypad))
+
+    plt.tight_layout()
+    plt.savefig(out, dpi=300)
     plt.show()
 
 def calculate_correlation(data: pd.DataFrame, x_col: str, y_col: str, group_col: str = None) -> pd.DataFrame:
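The four point categories introduced in the new `plot_volcano` can be reproduced without any plotting, which may help when checking thresholds. This is a minimal sketch under stated assumptions: `classify_volcano_points` is a hypothetical helper that is not part of SIMPApy, the toy values are invented, and `neg_log10_p_adj` is computed here as `-log10(p_adj)` only because the plotting code above reads a column of that name.

```python
import numpy as np
import pandas as pd


def classify_volcano_points(df, pvalthresh=0.05, mean_thresh=0.5):
    """Label each row with the category plot_volcano would draw it in.

    Hypothetical helper (not a SIMPApy function); it mirrors the boolean
    masks from the diff above.
    """
    sig_pval = df["p_adj"] < pvalthresh
    sig_mean = df["mean_diff"].abs() > mean_thresh

    labels = pd.Series("none", index=df.index, name="category")
    labels[~sig_pval & sig_mean] = "mean_only"
    labels[sig_pval & ~sig_mean] = "p_only"
    labels[sig_pval & sig_mean] = "both"
    return labels


# Invented example values, one row per pathway.
toy = pd.DataFrame(
    {
        "mean_diff": [1.2, 0.1, -0.9, 0.2],
        "p_adj": [0.001, 0.01, 0.30, 0.80],
    },
    index=["pathway_1", "pathway_2", "pathway_3", "pathway_4"],
)
# Assumption: the column read by the plotting code is -log10 of p_adj.
toy["neg_log10_p_adj"] = -np.log10(toy["p_adj"])

categories = classify_volcano_points(toy)
print(categories)
```

Because the four masks are mutually exclusive and cover every row with non-missing values, each point is drawn exactly once, which is the behaviour the new docstring describes.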
{simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy/ranking.py
@@ -101,39 +101,71 @@ def calculate_ranking(
             # Store the DataFrame in the dictionary
             ranked_dfs[sample] = sample_df
 
-            # Delete the DataFrame to free up memory (optional)
             del sample_df
 
         return ranked_dfs
 
     elif omic.upper() == "CNV":
-        # Calculate baseline importance (Bg) for each gene
-        control_data = df.filter(regex='^tw') # Control data
-        control_std = control_data.std(axis=1)
-        control_max = control_data.max(axis=1)
-        Bg = control_std / control_max
 
-
-
+        control_data = df.filter(regex='^tw')
+        N = len(control_data.columns)
+        epsilon = 0.01 # Small constant to prevent division by zero
 
-        #
-
-
-
-            # Calculate non-linear weight (w(x_s,g))
-            w_x_s_g = 2 ** np.abs(x_s - 2) * np.sign(x_s - 2)
-
-            # Calculate adjusted weight (w_adjusted(x_s,g))
-            adjusted_weight = w_x_s_g.where(w_x_s_g != 0, Bg)
+        # Pre-compute all necessary stats for the control group
+        control_counts_df = control_data.apply(pd.Series.value_counts, axis=1).fillna(0).astype(int)
+        mu_controls = control_data.mean(axis=1)
+        sigma_controls = control_data.std(axis=1)
 
-
-            df_sample = pd.DataFrame({
-                'adjusted_weight': adjusted_weight
-            })
-
-            adjusted_weights_dict[sample_name] = df_sample
+        ranked_dfs = {}
 
-
+        # 2. Loop through all samples
+        for sample_name in df.columns:
+            sample_series = df[sample_name]
+            scores = []
+
+            # Loop through each gene in the current sample
+            for gene, cn_value in sample_series.items():
+
+                if cn_value != 2:
+
+                    # Look up k: number of controls with the same CN value
+                    k = control_counts_df.loc[gene, cn_value] if cn_value in control_counts_df.columns else 0
+
+                    # Construct 2x2 table cells for a stable Odds Ratio calculation
+                    a, b = 1.5, 0.5
+                    c, d = k + 0.5, (N - k) + 0.5
+
+                    # Calculate the corrected odds ratio
+                    or_corrected = (a * d) / (b * c)
+
+                    # Handle edge case for log transform if OR is somehow non-positive
+                    if or_corrected <= 0:
+                        or_corrected = epsilon
+
+                    # Look up the pre-computed standard deviation for the gene
+                    sigma_for_gene = sigma_controls.loc[gene]
+
+                    # Calculate the final score using the enhanced formula
+                    score = (np.sign(cn_value - 2) * np.log10(or_corrected)) / (sigma_for_gene + epsilon)
+
+                else: # cn_value == 2
+
+                    # Look up pre-computed mean and std dev for the gene
+                    mu_for_gene = mu_controls.loc[gene]
+                    sigma_for_gene = sigma_controls.loc[gene]
+
+                    # Calculate the Z-score relative to the control mean to capture nuance
+                    score = (2 - mu_for_gene) / (sigma_for_gene + epsilon)
+
+                scores.append(score)
+
+            df_sample = pd.DataFrame(
+                {'adjusted_weight': scores},
+                index=df.index
+            )
+            ranked_dfs[sample_name] = df_sample
+
+        return ranked_dfs
 
     else:
         raise ValueError("Omic type must be 'RNA', 'DNAm', or 'CNV'")
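The per-gene arithmetic in the new CNV branch can be followed on a single gene with plain Python. This is a minimal sketch, not the package's implementation: `cnv_score` is a hypothetical standalone function, the control values are invented, and it assumes at least two control samples so that a sample standard deviation exists (`statistics.stdev` uses the same n-1 denominator as the pandas default for `.std()`).

```python
import math
import statistics


def cnv_score(cn_value, control_values, epsilon=0.01):
    """Score one gene in one sample against that gene's control copy numbers.

    Hypothetical helper mirroring the formula in the ranking.py hunk above.
    """
    n = len(control_values)
    sigma = statistics.stdev(control_values)  # sample std dev (n - 1)

    if cn_value != 2:
        # k: number of controls sharing the sample's copy-number value.
        k = sum(1 for v in control_values if v == cn_value)

        # 2x2 table cells with 0.5 added to each count, as in the diff.
        a, b = 1.5, 0.5
        c, d = k + 0.5, (n - k) + 0.5
        odds_ratio = (a * d) / (b * c)

        sign = 1.0 if cn_value > 2 else -1.0
        return sign * math.log10(odds_ratio) / (sigma + epsilon)

    # Diploid value: distance of 2 from the control mean, scaled by spread.
    return (2 - statistics.mean(control_values)) / (sigma + epsilon)


# Invented control copy numbers for one gene.
controls = [2, 2, 2, 3]

gain = cnv_score(3, controls)     # gain seen in 1 of 4 controls
loss = cnv_score(1, controls)     # loss seen in no control
neutral = cnv_score(2, controls)  # diploid value

print(gain, loss, neutral)
```

One property worth noting from the arithmetic: with these cell values the odds ratio equals 3 * (N - k + 0.5) / (k + 0.5), so it drops below 1 when most controls share the sample's copy-number value, and the logarithm then changes the sign of the score.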
{simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy/visualize.py
@@ -362,19 +362,26 @@ def create_interactive_plot(data, title_suffix=""):
         title=f"{title_suffix}",
         title_x=0.5,
         title_y=0.95,
+        title_font=dict(size=18), # Optional: also increase title font
         scene=dict(
             xaxis=dict(
                 title='DNAm',
+                titlefont=dict(size=18), # Axis title font size
+                tickfont=dict(size=18), # Tick label font size
                 backgroundcolor="white",
                 gridcolor="grey"
             ),
             yaxis=dict(
                 title='TPM',
+                titlefont=dict(size=18), # Axis title font size
+                tickfont=dict(size=18), # Tick label font size
                 backgroundcolor="white",
                 gridcolor="grey"
             ),
             zaxis=dict(
                 title='CNV',
+                titlefont=dict(size=18), # Axis title font size
+                tickfont=dict(size=18), # Tick label font size
                 backgroundcolor="white",
                 gridcolor="grey"
             ),
{simpapy-0.1.4 → simpapy-0.3.2}/SIMPApy.egg-info/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: SIMPApy
-Version: 0.
+Version: 0.3.2
 Summary: Normalized Single Sample Integrated Multiomics Pathway Analysis
 Author-email: Hasan Alsharoh <hasanalsharoh@gmail.com>
 License: Apache License (2.0)
@@ -35,12 +35,12 @@ Normalized Single Sample Integrated Multi-Omics Pathway Analysis for Python
 
 ## Description
 
-`SIMPApy` is a Python package for performing Gene Set Enrichment Analysis (GSEA) on multiomics data in single samples and integrating the results. It supports RNA sequencing, DNA methylation, and copy number variation data types.
+`SIMPApy` is a Python package for performing Gene Set Enrichment Analysis (GSEA) on multiomics data in single samples and integrating the results. It supports RNA sequencing, DNA methylation, and copy number variation data types. This package uses the normalized single sample single --omic pathway analysis (SOPA) framework and integrates the results through normalized single sample integrated multiomics pathway analysis (SIMPA) extension.
 
 ## Installation
 
 
-To install SIMPApy, create a new virtual environment (
+To install SIMPApy, create a new virtual environment (also installing Jupyter Notebooks through anaconda). Afterwards, use:
 ```bash
 pip install SIMPApy
 ```
@@ -60,7 +60,7 @@ To use SOPA, we need raw data for a single -omic, and as follows:
 1. For RNAseq data: TPM, or normalized counts. FPKM, RPKM may work, but the tool was not validated on them.
 2. For CNV data: Copy numbers. Each gene must have its individual copy numbers.
 3. For DNAm data: Beta values. The values must be mapped to gene names rather than CpG sites.
-- The dataframes must contain
+- The dataframes must contain samples divided into 2 groups, cases and controls. For the case group, each sample must start with the name 'tm', while for the control group, each sample must start with 'tw'.
 - It is possible to use more than 2 groups. For example:
 
 Assume 3 groups (group1, group2, group3). Analyses could be done between group1 and group2, and then group3 and group2, or group3 and group1, according to preference.
@@ -99,7 +99,7 @@ dnaranks_df = pd.concat({k: v['weighted'] for k, v in dnaranks.items()}, axis=1)
 ### CNV data
 Here, make sure to have copy number data. If GISTIC2 data is present (often through log(Copy_Number +1)), then conversion is usually done through 2**(GISTIC2_value -1)
 ```python
-
+cnranks = sp.calculate_ranking(cnv_data, omic = 'cnv'):
 # cnv_data: Pandas DataFrame with gene-level copy numbers.
 # Rows are genes and columns are samples ('tm' for cases, 'tw' for controls).
 ```
@@ -211,8 +211,13 @@ Figure settings allow the following:
 - Reset View (button): Restores the view to unfiltered.
 
 - Export (300DPI): This options exports the current view of the 3D box.
-
-##
+
+## Downstream analysis
+This package also offers direct analysis of results obtained with both SOPA and SIMPA through the .analyze module.
+Please consult the module's documentation for further instructions.
+
+
+# Requirements
 
 - Python ≥ 3.8
 - gseapy==1.1.3
File without changes