python-katlas 0.0.9__tar.gz → 0.1.1__tar.gz
- {python-katlas-0.0.9/python_katlas.egg-info → python-katlas-0.1.1}/PKG-INFO +13 -10
- {python-katlas-0.0.9 → python-katlas-0.1.1}/README.md +12 -9
- python-katlas-0.1.1/katlas/__init__.py +1 -0
- {python-katlas-0.0.9 → python-katlas-0.1.1}/katlas/_modidx.py +1 -2
- {python-katlas-0.0.9 → python-katlas-0.1.1}/katlas/core.py +61 -60
- {python-katlas-0.0.9 → python-katlas-0.1.1}/katlas/dl.py +14 -14
- {python-katlas-0.0.9 → python-katlas-0.1.1}/katlas/feature.py +9 -9
- {python-katlas-0.0.9 → python-katlas-0.1.1}/katlas/plot.py +20 -20
- {python-katlas-0.0.9 → python-katlas-0.1.1}/katlas/train.py +7 -7
- {python-katlas-0.0.9 → python-katlas-0.1.1/python_katlas.egg-info}/PKG-INFO +13 -10
- {python-katlas-0.0.9 → python-katlas-0.1.1}/settings.ini +1 -1
- python-katlas-0.0.9/katlas/__init__.py +0 -1
- {python-katlas-0.0.9 → python-katlas-0.1.1}/LICENSE +0 -0
- {python-katlas-0.0.9 → python-katlas-0.1.1}/MANIFEST.in +0 -0
- {python-katlas-0.0.9 → python-katlas-0.1.1}/katlas/imports.py +0 -0
- {python-katlas-0.0.9 → python-katlas-0.1.1}/python_katlas.egg-info/SOURCES.txt +0 -0
- {python-katlas-0.0.9 → python-katlas-0.1.1}/python_katlas.egg-info/dependency_links.txt +0 -0
- {python-katlas-0.0.9 → python-katlas-0.1.1}/python_katlas.egg-info/entry_points.txt +0 -0
- {python-katlas-0.0.9 → python-katlas-0.1.1}/python_katlas.egg-info/not-zip-safe +0 -0
- {python-katlas-0.0.9 → python-katlas-0.1.1}/python_katlas.egg-info/requires.txt +0 -0
- {python-katlas-0.0.9 → python-katlas-0.1.1}/python_katlas.egg-info/top_level.txt +0 -0
- {python-katlas-0.0.9 → python-katlas-0.1.1}/setup.cfg +0 -0
- {python-katlas-0.0.9 → python-katlas-0.1.1}/setup.py +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: python-katlas
-Version: 0.
+Version: 0.1.1
 Summary: tools for predicting kinome specificities
 Home-page: https://github.com/sky1ove/python-katlas
 Author: lily
@@ -60,6 +60,13 @@ helpful to your research.
 phosphoproteome](https://www.nature.com/articles/s41587-019-0344-3),
 and [CPTAC](https://pdc.cancer.gov/pdc/cptac-pancancer) /
 [LinkedOmics](https://academic.oup.com/nar/article/46/D1/D956/4607804)
+
+
+## Web applications
+
+Users can now run the analysis directly on the web without needing to code.
+
+Check out our latest web: [kinase-atlas.com](https://kinase-atlas.com/)

 ## Tutorials on Colab

@@ -67,20 +74,16 @@ helpful to your research.
   sequence](https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/tutorial_01_sinlge_input.ipynb)
 - 2. [High throughput substrate scoring on phosphoproteomics
   dataset](https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/tutorial_02_high_throughput.ipynb)
-- 3. [
-
-
-- 4. [Kinase enrichment analysis for AKT
-  inhibitor](https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/tutorial_04a_enrichment_AKTi.ipynb)
-  / [Kinase enrichment analysis for EGFR
-  inhibitor](https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/tutorial_04b_enrichment_EGFRi.ipynb)
+- 3. [Kinase enrichment analysis for AKT
+  inhibitor](https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/tutorial_03a_enrichment_AKTi.ipynb)
+

 ## Install

-Install the latest version through
+Install the latest version through pip

 ``` python
-
+pip install python-katlas -Uq
 ```

 ## Import
@@ -38,6 +38,13 @@ helpful to your research.
 phosphoproteome](https://www.nature.com/articles/s41587-019-0344-3),
 and [CPTAC](https://pdc.cancer.gov/pdc/cptac-pancancer) /
 [LinkedOmics](https://academic.oup.com/nar/article/46/D1/D956/4607804)
+
+
+## Web applications
+
+Users can now run the analysis directly on the web without needing to code.
+
+Check out our latest web: [kinase-atlas.com](https://kinase-atlas.com/)

 ## Tutorials on Colab

@@ -45,20 +52,16 @@ helpful to your research.
   sequence](https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/tutorial_01_sinlge_input.ipynb)
 - 2. [High throughput substrate scoring on phosphoproteomics
   dataset](https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/tutorial_02_high_throughput.ipynb)
-- 3. [
-
-
-- 4. [Kinase enrichment analysis for AKT
-  inhibitor](https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/tutorial_04a_enrichment_AKTi.ipynb)
-  / [Kinase enrichment analysis for EGFR
-  inhibitor](https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/tutorial_04b_enrichment_EGFRi.ipynb)
+- 3. [Kinase enrichment analysis for AKT
+  inhibitor](https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/tutorial_03a_enrichment_AKTi.ipynb)
+

 ## Install

-Install the latest version through
+Install the latest version through pip

 ``` python
-
+pip install python-katlas -Uq
 ```

 ## Import
@@ -0,0 +1 @@
+__version__ = "0.1.0"
@@ -46,13 +46,12 @@ d = { 'settings': { 'branch': 'main',
                              'katlas.core.get_one_kinase': ('core.html#get_one_kinase', 'katlas/core.py'),
                              'katlas.core.get_pct': ('core.html#get_pct', 'katlas/core.py'),
                              'katlas.core.get_pct_df': ('core.html#get_pct_df', 'katlas/core.py'),
-                             'katlas.core.
+                             'katlas.core.get_pvalue': ('core.html#get_pvalue', 'katlas/core.py'),
                              'katlas.core.get_unique_site': ('core.html#get_unique_site', 'katlas/core.py'),
                              'katlas.core.multiply': ('core.html#multiply', 'katlas/core.py'),
                              'katlas.core.multiply_func': ('core.html#multiply_func', 'katlas/core.py'),
                              'katlas.core.predict_kinase': ('core.html#predict_kinase', 'katlas/core.py'),
                              'katlas.core.predict_kinase_df': ('core.html#predict_kinase_df', 'katlas/core.py'),
-                             'katlas.core.query_gene': ('core.html#query_gene', 'katlas/core.py'),
                              'katlas.core.raw2norm': ('core.html#raw2norm', 'katlas/core.py'),
                              'katlas.core.sumup': ('core.html#sumup', 'katlas/core.py')},
             'katlas.dl': { 'katlas.dl.CNN1D_1': ('dl.html#cnn1d_1', 'katlas/dl.py'),
@@ -6,7 +6,7 @@
 __all__ = ['param_PSPA_st', 'param_PSPA_y', 'param_PSPA', 'param_CDDM', 'param_CDDM_upper', 'Data', 'CPTAC', 'convert_string',
            'checker', 'STY2sty', 'cut_seq', 'get_dict', 'multiply_func', 'multiply', 'sumup', 'predict_kinase',
            'predict_kinase_df', 'get_pct', 'get_pct_df', 'get_unique_site', 'extract_site_seq', 'get_freq',
-           '
+           'get_pvalue', 'get_metaP', 'raw2norm', 'get_one_kinase']

 # %% ../nbs/00_core.ipynb 4
 import math, pandas as pd, numpy as np
@@ -14,7 +14,7 @@ from tqdm import tqdm
 from scipy.stats import chi2
 from typing import Callable
 from functools import partial
-from scipy.stats import ttest_ind
+from scipy.stats import ttest_ind, mannwhitneyu, wilcoxon
 from statsmodels.stats.multitest import multipletests

 # %% ../nbs/00_core.ipynb 7
@@ -451,17 +451,18 @@ def predict_kinase(input_string: str, # site sequence

 # %% ../nbs/00_core.ipynb 41
 # PSPA
-param_PSPA_st = {'ref':Data.get_pspa_st_norm(), 'func':multiply} # Johnson et al. Nature official
-param_PSPA_y = {'ref':Data.get_pspa_tyr_norm(), 'func':multiply}
-param_PSPA = {'ref':Data.get_pspa_all_norm(), 'func':multiply}
+param_PSPA_st = {'ref':Data.get_pspa_st_norm().astype('float32'), 'func':multiply} # Johnson et al. Nature official
+param_PSPA_y = {'ref':Data.get_pspa_tyr_norm().astype('float32'), 'func':multiply}
+param_PSPA = {'ref':Data.get_pspa_all_norm().astype('float32'), 'func':multiply}


 # Kinase-substrate dataset, CDDM
-param_CDDM = {'ref':Data.get_cddm(), 'func':sumup}
-param_CDDM_upper = {'ref':Data.get_cddm_upper(), 'func':sumup, 'to_upper':True} # specific for all uppercase
+param_CDDM = {'ref':Data.get_cddm().astype('float32'), 'func':sumup}
+param_CDDM_upper = {'ref':Data.get_cddm_upper().astype('float32'), 'func':sumup, 'to_upper':True} # specific for all uppercase

-# %% ../nbs/00_core.ipynb
+# %% ../nbs/00_core.ipynb 46
 def predict_kinase_df(df, seq_col, ref, func, to_lower=False, to_upper=False):
+
     print('input dataframe has a length', df.shape[0])
     print('Preprocessing')

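Note on the hunk above: every reference matrix is now cast to `float32`. The diff does not state why; a plausible reading is lower memory use in the later merge step. The sketch below illustrates that trade-off on a random stand-in matrix. The name `ref` and its shape are assumptions for illustration, not the output of `Data.get_cddm()`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a reference matrix (float64 by default)
rng = np.random.default_rng(0)
ref = pd.DataFrame(rng.random((300, 230)))

# Same cast as in the diff
ref32 = ref.astype('float32')

# Compare memory footprint and the rounding error introduced by the cast
bytes64 = int(ref.memory_usage(deep=True).sum())
bytes32 = int(ref32.memory_usage(deep=True).sum())
max_abs_err = float((ref - ref32).abs().to_numpy().max())

print(bytes64, bytes32, max_abs_err)
```

For values between 0 and 1 the rounding error stays far below typical score differences, while the data buffer is roughly halved.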
@@ -493,12 +494,20 @@ def predict_kinase_df(df, seq_col, ref, func, to_lower=False, to_upper=False):
     df['keys'] = df['site_seq'].apply(get_dict)
     input_keys_df = df[['keys']].explode('keys').reset_index()
     input_keys_df.columns = ['input_index', 'key']
+
+
     ref_T = ref.T

-
+    input_keys_df = input_keys_df.set_index('key')
+
+
+    print('Merging reference')
+    merged_df = input_keys_df.merge(ref_T, left_index=True, right_index=True, how='inner')
+
+    print('Finish merging')

     if func == sumup:
-        grouped_df = merged_df.
+        grouped_df = merged_df.groupby('input_index').sum()
         out = grouped_df.reindex(df.index)

     elif func==multiply:
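Note on the hunk above: the new code indexes the exploded keys by `key`, inner-joins them against the transposed reference on the index, and sums per `input_index` for the `sumup` path. A minimal sketch of that flow on toy data follows. The kinase names, keys, and values are made up, and the key lists stand in for whatever `get_dict` returns.

```python
import pandas as pd

# Hypothetical reference: rows are kinases, columns are position+residue keys
ref = pd.DataFrame(
    {'-1R': [0.5, 0.1], '0s': [1.0, 2.0], '1P': [0.2, 0.3]},
    index=['KIN_A', 'KIN_B'],
)

# Each input site already expanded to a list of keys; '2X' is absent from ref
df = pd.DataFrame({'keys': [['-1R', '0s'], ['0s', '1P', '2X']]})

# One row per (site, key) pair, remembering which input row it came from
input_keys_df = df[['keys']].explode('keys').reset_index()
input_keys_df.columns = ['input_index', 'key']
input_keys_df = input_keys_df.set_index('key')

# Index-on-index inner join: keys missing from the reference are dropped
ref_T = ref.T
merged_df = input_keys_df.merge(ref_T, left_index=True, right_index=True, how='inner')

# sumup path: add the per-key values for each input row
out = merged_df.groupby('input_index').sum().reindex(df.index)
print(out)
```

The final `reindex(df.index)` restores the original row order and leaves NaN rows for inputs that matched no key.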
@@ -514,7 +523,7 @@ def predict_kinase_df(df, seq_col, ref, func, to_lower=False, to_upper=False):
             kinase_df = kinase_df.rename(columns={kinase: 'value'})

             # Compute log_value
-            kinase_df['log_value'] = np.log2(kinase_df['value']
+            kinase_df['log_value'] = np.log2(kinase_df['value'].where(kinase_df['value'] > 0))

             # Group by 'input_index' and compute sum and count
             grouped = kinase_df.dropna().groupby('input_index')
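Note on the hunk above: the added line masks non-positive values with `.where(...)` before taking `log2`, so those entries become NaN and are removed by the `dropna()` that follows. A small sketch of the behaviour, with made-up values:

```python
import numpy as np
import pandas as pd

values = pd.Series([4.0, 0.0, 0.5, -1.0])

# .where keeps entries that satisfy the condition and sets the rest to NaN,
# so np.log2 never receives a zero or negative input
log_value = np.log2(values.where(values > 0))

print(log_value.tolist())
```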
@@ -541,7 +550,7 @@ def predict_kinase_df(df, seq_col, ref, func, to_lower=False, to_upper=False):
     # Return results as a DataFrame
     return out

-# %% ../nbs/00_core.ipynb
+# %% ../nbs/00_core.ipynb 53
 def get_pct(site,ref,func,pct_ref):

     "Replicate the precentile results from The Kinase Library."
@@ -566,7 +575,7 @@ def get_pct(site,ref,func,pct_ref):
     final.columns=['log2(score)','percentile']
     return final

-# %% ../nbs/00_core.ipynb
+# %% ../nbs/00_core.ipynb 59
 def get_pct_df(score_df, # output from predict_kinase_df
                pct_ref, # a reference df for percentile calculation
                ):
@@ -591,7 +600,7 @@ def get_pct_df(score_df, # output from predict_kinase_df

     return percentiles_df

-# %% ../nbs/00_core.ipynb
+# %% ../nbs/00_core.ipynb 64
 def get_unique_site(df:pd.DataFrame = None,# dataframe that contains phosphorylation sites
                     seq_col: str='site_seq', # column name of site sequence
                     id_col: str='gene_site' # column name of site id
@@ -607,7 +616,7 @@ def get_unique_site(df:pd.DataFrame = None,# dataframe that contains phosphoryla

     return unique

-# %% ../nbs/00_core.ipynb
+# %% ../nbs/00_core.ipynb 67
 def extract_site_seq(df: pd.DataFrame, # dataframe that contains protein sequence
                      seq_col: str, # column name of protein sequence
                      position_col: str # column name of position 0
@@ -633,7 +642,7 @@ def extract_site_seq(df: pd.DataFrame, # dataframe that contains protein sequenc

     return np.array(data)

-# %% ../nbs/00_core.ipynb
+# %% ../nbs/00_core.ipynb 72
 def get_freq(df_k: pd.DataFrame, # a dataframe for a single kinase that contains phosphorylation sequence splitted by their position
              aa_order = [i for i in 'PGACSTVILMFYWHKRQNDEsty'], # amino acid to include in the full matrix
              aa_order_paper = [i for i in 'PGACSTVILMFYWHKRQNDEsty'], # amino acid to include in the partial matrix
@@ -674,35 +683,16 @@ def get_freq(df_k: pd.DataFrame, # a dataframe for a single kinase that contains

     return paper,full

-# %% ../nbs/00_core.ipynb
-def
-
-    "Query gene in the phosphoproteomics dataset"
-
-    # query gene in the dataframe
-    df_gene = df[df.gene_site.str.contains(f'{gene}_')]
-
-    # sort dataframe based on position
-    sort_position = df_gene.gene_site.str.split('_').str[-1].str[1:].astype(int).sort_values().index
-    df_gene = df_gene.loc[sort_position]
-
-    return df_gene
-
-# %% ../nbs/00_core.ipynb 81
-def get_ttest(df,
+# %% ../nbs/00_core.ipynb 76
+def get_pvalue(df,
               columns1, # list of column names for group1
               columns2, # list of column names for group2
+              test_method = 'mann_whitney', # 'student_t', 'mann_whitney', 'wilcoxon'
               FC_method = 'median', # or mean
-              alpha=0.05, # significance level in multipletests for p_adj
-              correction_method='fdr_bh', # method in multipletests for p_adj
               ):
-    """
-    Performs t-tests and calculates log2 fold change between two groups of columns in a DataFrame.
-    NaN p-values are excluded from the multiple testing correction.

-
-
-    """
+    "Performs statistical tests and calculates difference between the median or mean of two groups of columns."
+
     group1 = df[columns1]
     group2 = df[columns2]

@@ -717,24 +707,36 @@ def get_ttest(df,
     # As phosphoproteomics data has already been log transformed, we can directly use subtraction
     FCs = m2 - m1

-    # Perform
-
+    # Perform the chosen test and handle NaN p-values
+    if test_method == 'student_t': # data is normally distributed, non-paired
+        test_func = ttest_ind
+    elif test_method == 'mann_whitney': # not normally distributed, non-paired, mann_whitney considers the rank, ignore the differences
+        test_func = mannwhitneyu
+    elif test_method == 'wilcoxon': # not normally distributed, paired
+        test_func = wilcoxon
+
+    t_results = []
+    for idx in tqdm(df.index, desc=f"Computing {test_method} tests"):
+        try:
+            if test_method == 'wilcoxon': # as wilcoxon is paired, if lack a paired sample, just give nan, as default nanpolicy is propagate (gives nan if nan in input)
+                stat, pvalue = test_func(group1.loc[idx], group2.loc[idx])
+            else:
+                stat, pvalue = test_func(group1.loc[idx], group2.loc[idx], nan_policy='omit')
+        except ValueError: # Handle cases with insufficient data
+            pvalue = np.nan
+        t_results.append(pvalue)

     # Exclude NaN p-values before multiple testing correction
-    p_values =
-    valid_p_values = np.
+    p_values = np.array(t_results, dtype=float) # Ensure the correct data type
+    valid_p_values = p_values[~np.isnan(p_values)]

-    # valid_p_values = np.array(p_values)
-    valid_p_values = valid_p_values[~np.isnan(valid_p_values)]
-
     # Adjust for multiple testing on valid p-values only
-    reject, pvals_corrected, _, _ = multipletests(valid_p_values, alpha=
-
+    reject, pvals_corrected, _, _ = multipletests(valid_p_values, alpha=0.05, method='fdr_bh')
+
     # Create a full list of corrected p-values including NaNs
-    full_pvals_corrected = np.
-    full_pvals_corrected[:] = np.nan
+    full_pvals_corrected = np.full_like(p_values, np.nan)
     np.place(full_pvals_corrected, ~np.isnan(p_values), pvals_corrected)
-
+
     # Adjust the significance accordingly
     full_reject = np.zeros_like(p_values, dtype=bool)
     np.place(full_reject, ~np.isnan(p_values), reject)
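Note on the hunk above: the new loop picks a SciPy test by name, records NaN when a test cannot run, and applies the FDR correction to the valid p-values only before scattering them back with `np.place`. The sketch below reproduces that pattern in isolation. The helpers `row_pvalue`, `bh_adjust`, and `adjust_with_nans` are hypothetical names, NaNs are dropped by hand instead of through `nan_policy='omit'`, and `bh_adjust` is a NumPy-only Benjamini-Hochberg stand-in for `multipletests(..., method='fdr_bh')`, not the package's code.

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu, wilcoxon

# Same three method names as the new test_method argument
TESTS = {'student_t': ttest_ind, 'mann_whitney': mannwhitneyu, 'wilcoxon': wilcoxon}

def row_pvalue(x, y, test_method='mann_whitney'):
    """Hypothetical helper: p-value for one row, NaN when the test cannot run."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if test_method == 'wilcoxon':
        # Paired test: a missing value breaks the pairing, so report NaN
        if x.shape != y.shape or np.isnan(x).any() or np.isnan(y).any():
            return np.nan
    else:
        # Unpaired tests: drop NaNs by hand
        x, y = x[~np.isnan(x)], y[~np.isnan(y)]
        if x.size == 0 or y.size == 0:
            return np.nan
    try:
        return float(TESTS[test_method](x, y).pvalue)
    except ValueError:  # insufficient data for the chosen test
        return np.nan

def bh_adjust(p):
    """NumPy-only Benjamini-Hochberg adjustment (stand-in for multipletests)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, walking down from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(n)
    adj[order] = np.clip(scaled, 0.0, 1.0)
    return adj

def adjust_with_nans(p_values):
    """Correct only the valid p-values, then scatter them back into place."""
    p_values = np.asarray(p_values, dtype=float)
    valid = ~np.isnan(p_values)
    full = np.full_like(p_values, np.nan)
    if valid.any():
        np.place(full, valid, bh_adjust(p_values[valid]))
    return full

# Toy rows: the second one has no usable values in group 1
rows = [([1, 2, 3, 4, 5], [6, 7, 8, 9, 10, np.nan]),
        ([np.nan, np.nan], [1, 2])]
p_values = [row_pvalue(x, y) for x, y in rows]
p_adj = adjust_with_nans(p_values)
print(p_values, p_adj)
```

Rows whose test could not run keep NaN in both the raw and adjusted columns, and they do not count toward the number of hypotheses being corrected.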
@@ -743,22 +745,21 @@ def get_ttest(df,
     results = pd.DataFrame({
         'log2FC': FCs,
         'p_value': p_values,
-        'p_adj': full_pvals_corrected
-        'significant': full_reject
+        'p_adj': full_pvals_corrected
     })

     results['p_value'] = results['p_value'].astype(float)
-
+
     def get_signed_logP(r,p_col):
         log10 = -np.log10(r[p_col])
         return -log10 if r['log2FC']<0 else log10
-
+
     results['signed_logP'] = results.apply(partial(get_signed_logP,p_col='p_value'),axis=1)
     results['signed_logPadj'] = results.apply(partial(get_signed_logP,p_col='p_adj'),axis=1)
-
+
     return results

-# %% ../nbs/00_core.ipynb
+# %% ../nbs/00_core.ipynb 77
 def get_metaP(p_values):

     "Use Fisher's method to calculate a combined p value given a list of p values; this function also allows negative p values (negative correlation)"
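Note on the hunk above: `get_signed_logP` gives `-log10(p)` the sign of the fold change, so down-regulated sites get negative scores. The sketch below applies the same function to a toy table and compares it with a vectorized equivalent. The `np.where` variant and the example values are assumptions for illustration, not part of the package.

```python
import numpy as np
import pandas as pd
from functools import partial

# Made-up results table with the two columns the helper reads
results = pd.DataFrame({'log2FC': [1.2, -0.8, 0.0], 'p_value': [0.01, 0.001, 1.0]})

def get_signed_logP(r, p_col):
    log10 = -np.log10(r[p_col])
    return -log10 if r['log2FC'] < 0 else log10

# Row-wise application, as in the diff
results['signed_logP'] = results.apply(partial(get_signed_logP, p_col='p_value'), axis=1)

# Vectorized equivalent (assumption): flip the sign only where log2FC < 0
sign = np.where(results['log2FC'] < 0, -1.0, 1.0)
results['signed_logP_vec'] = sign * -np.log10(results['p_value'])

print(results)
```

A fold change of exactly zero keeps the positive branch in both versions, which is why `np.where` is used here instead of `np.sign`.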
@@ -770,7 +771,7 @@ def get_metaP(p_values):

     return score

-# %% ../nbs/00_core.ipynb
+# %% ../nbs/00_core.ipynb 80
 def raw2norm(df: pd.DataFrame, # single kinase's df has position as index, and single amino acid as columns
              PDHK: bool=False, # whether this kinase belongs to PDHK family
              ):
@@ -793,7 +794,7 @@ def raw2norm(df: pd.DataFrame, # single kinase's df has position as index, and s

     return df2

-# %% ../nbs/00_core.ipynb
+# %% ../nbs/00_core.ipynb 82
 def get_one_kinase(df: pd.DataFrame, #stacked dataframe (paper's raw data)
                    kinase:str, # a specific kinase
                    normalize: bool=False, # normalize according to the paper; special for PDHK1/4
@@ -6,7 +6,7 @@
 __all__ = ['def_device', 'seed_everything', 'GeneralDataset', 'get_sampler', 'MLP_1', 'CNN1D_1', 'init_weights', 'lin_wn',
            'conv_wn', 'CNN1D_2', 'train_dl', 'train_dl_cv', 'predict_dl']

-# %% ../nbs/04_DL.ipynb
+# %% ../nbs/04_DL.ipynb 5
 from fastbook import *
 import fastcore.all as fc,torch.nn.init as init
 from fastai.callback.training import GradientClip
@@ -22,7 +22,7 @@ from sklearn.model_selection import *
 from sklearn.metrics import mean_squared_error
 from scipy.stats import spearmanr,pearsonr

-# %% ../nbs/04_DL.ipynb
+# %% ../nbs/04_DL.ipynb 7
 def seed_everything(seed=123):
     random.seed(seed)
     os.environ['PYTHONHASHSEED'] = str(seed)
@@ -32,10 +32,10 @@ def seed_everything(seed=123):
     torch.backends.cudnn.deterministic = True
     torch.backends.cudnn.benchmark = False

-# %% ../nbs/04_DL.ipynb
+# %% ../nbs/04_DL.ipynb 9
 def_device = 'mps' if torch.backends.mps.is_available() else 'cuda' if torch.cuda.is_available() else 'cpu'

-# %% ../nbs/04_DL.ipynb
+# %% ../nbs/04_DL.ipynb 14
 class GeneralDataset:
     def __init__(self,
                  df, # a dataframe of values
@@ -62,7 +62,7 @@ class GeneralDataset:
         y = torch.Tensor(self.y[index])
         return X, y

-# %% ../nbs/04_DL.ipynb
+# %% ../nbs/04_DL.ipynb 18
 def get_sampler(info,col):

     "For imbalanced data, get higher weights for less-represented samples"
@@ -82,7 +82,7 @@ def get_sampler(info,col):

     return sampler

-# %% ../nbs/04_DL.ipynb
+# %% ../nbs/04_DL.ipynb 24
 def MLP_1(num_features,
           num_targets,
           hidden_units = [512, 218],
@@ -112,7 +112,7 @@ def MLP_1(num_features,

     return model

-# %% ../nbs/04_DL.ipynb
+# %% ../nbs/04_DL.ipynb 30
 class CNN1D_1(Module):

     def __init__(self,
@@ -137,12 +137,12 @@ class CNN1D_1(Module):
         x = self.fc2(x)
         return x

-# %% ../nbs/04_DL.ipynb
+# %% ../nbs/04_DL.ipynb 34
 def init_weights(m, leaky=0.):
     "Initiate any Conv layer with Kaiming norm."
     if isinstance(m, (nn.Conv1d,nn.Conv2d,nn.Conv3d)): init.kaiming_normal_(m.weight, a=leaky)

-# %% ../nbs/04_DL.ipynb
+# %% ../nbs/04_DL.ipynb 35
 def lin_wn(ni,nf,dp=0.1,act=nn.SiLU):
     "Weight norm of linear."
     layers = nn.Sequential(
@@ -152,7 +152,7 @@ def lin_wn(ni,nf,dp=0.1,act=nn.SiLU):
     if act: layers.append(act())
     return layers

-# %% ../nbs/04_DL.ipynb
+# %% ../nbs/04_DL.ipynb 36
 def conv_wn(ni, nf, ks=3, stride=1, padding=1, dp=0.1,act=nn.ReLU):
     "Weight norm of conv."
     layers = nn.Sequential(
@@ -162,7 +162,7 @@ def conv_wn(ni, nf, ks=3, stride=1, padding=1, dp=0.1,act=nn.ReLU):
     if act: layers.append(act())
     return layers

-# %% ../nbs/04_DL.ipynb
+# %% ../nbs/04_DL.ipynb 37
 class CNN1D_2(nn.Module):

     def __init__(self, ni, nf, amp_scale = 16):
@@ -212,7 +212,7 @@ class CNN1D_2(nn.Module):

         return x

-# %% ../nbs/04_DL.ipynb
+# %% ../nbs/04_DL.ipynb 41
 def train_dl(df,
              feat_col,
              target_col,
@@ -275,7 +275,7 @@ def train_dl(df,

     return target, pred

-# %% ../nbs/04_DL.ipynb
+# %% ../nbs/04_DL.ipynb 46
 @fc.delegates(train_dl)
 def train_dl_cv(df,
                 feat_col,
@@ -325,7 +325,7 @@ def train_dl_cv(df,

     return oof, metrics

-# %% ../nbs/04_DL.ipynb
+# %% ../nbs/04_DL.ipynb 54
 def predict_dl(df,
                feat_col,
                target_col,
@@ -5,7 +5,7 @@
 # %% auto 0
 __all__ = ['get_rdkit', 'get_morgan', 'get_esm', 'get_t5', 'get_t5_bfd', 'reduce_feature', 'remove_hi_corr', 'preprocess']

-# %% ../nbs/01_feature.ipynb
+# %% ../nbs/01_feature.ipynb 5
 from fastbook import *
 import torch,re,joblib,gc,esm
 from tqdm.notebook import tqdm; tqdm.pandas()
@@ -30,7 +30,7 @@ from umap.umap_ import UMAP

 set_config(transform_output="pandas")

-# %% ../nbs/01_feature.ipynb
+# %% ../nbs/01_feature.ipynb 8
 def get_rdkit(df: pd.DataFrame, # a dataframe that contains smiles
               col:str = "SMILES", # colname of smile
               normalize: bool = True, # normalize features using StandardScaler()
@@ -49,7 +49,7 @@ def get_rdkit(df: pd.DataFrame, # a dataframe that contains smiles
     # feature_df = feature_df.reset_index()
     return feature_df

-# %% ../nbs/01_feature.ipynb
+# %% ../nbs/01_feature.ipynb 12
 def get_morgan(df: pd.DataFrame, # a dataframe that contains smiles
                col: str = "SMILES", # colname of smile
                radius=3
@@ -61,7 +61,7 @@ def get_morgan(df: pd.DataFrame, # a dataframe that contains smiles
     fp_df.columns = "morgan_" + fp_df.columns.astype(str)
     return fp_df

-# %% ../nbs/01_feature.ipynb
+# %% ../nbs/01_feature.ipynb 16
 def get_esm(df:pd.DataFrame, # a dataframe that contains amino acid sequence
             col: str = 'sequence', # colname of amino acid sequence
             model_name: str = "esm2_t33_650M_UR50D", # Name of the ESM model to use for the embeddings.
@@ -128,7 +128,7 @@ def get_esm(df:pd.DataFrame, # a dataframe that contains amino acid sequence

     return df_feature

-# %% ../nbs/01_feature.ipynb
+# %% ../nbs/01_feature.ipynb 20
 def get_t5(df: pd.DataFrame,
            col: str = 'sequence'
            ):
@@ -170,7 +170,7 @@ def get_t5(df: pd.DataFrame,

     return T5_feature

-# %% ../nbs/01_feature.ipynb
+# %% ../nbs/01_feature.ipynb 23
 def get_t5_bfd(df:pd.DataFrame,
                col: str = 'sequence'
                ):
@@ -212,7 +212,7 @@ def get_t5_bfd(df:pd.DataFrame,

     return T5_feature

-# %% ../nbs/01_feature.ipynb
+# %% ../nbs/01_feature.ipynb 27
 def reduce_feature(df: pd.DataFrame,
                    method: str='pca', # dimensionality reduction method, accept both capital and lower case
                    complexity: int=20, # None for PCA; perfplexity for TSNE, recommend: 30; n_neigbors for UMAP, recommend: 15
@@ -258,7 +258,7 @@ def reduce_feature(df: pd.DataFrame,

     return embedding_df

-# %% ../nbs/01_feature.ipynb
+# %% ../nbs/01_feature.ipynb 30
 def remove_hi_corr(df: pd.DataFrame,
                    thr: float=0.98 # threshold
                    ):
@@ -278,7 +278,7 @@ def remove_hi_corr(df: pd.DataFrame,

     return df

-# %% ../nbs/01_feature.ipynb
+# %% ../nbs/01_feature.ipynb 34
 def preprocess(df: pd.DataFrame,
                thr: float=0.98):

@@ -7,7 +7,7 @@ __all__ = ['set_sns', 'get_color_dict', 'logo_func', 'get_logo', 'get_logo2', 'p
            'plot_cluster', 'plot_bokeh', 'plot_count', 'plot_bar', 'plot_group_bar', 'plot_box', 'plot_corr',
            'draw_corr', 'get_AUCDF', 'plot_confusion_matrix']

-# %% ../nbs/02_plot.ipynb
+# %% ../nbs/02_plot.ipynb 5
 import joblib,logomaker
 import fastcore.all as fc, pandas as pd, numpy as np, seaborn as sns
 from adjustText import adjust_text
@@ -32,14 +32,14 @@ from bokeh.layouts import column
 from bokeh.palettes import Category20_20
 from itertools import cycle

-# %% ../nbs/02_plot.ipynb
+# %% ../nbs/02_plot.ipynb 7
 def set_sns():
     "Set seaborn resolution for notebook display"
     sns.set(rc={"figure.dpi":300, 'savefig.dpi':300})
     sns.set_context('notebook')
     sns.set_style("ticks")

-# %% ../nbs/02_plot.ipynb
+# %% ../nbs/02_plot.ipynb 8
 def get_color_dict(categories, # list of names to assign color
                    palette: str='tab20', # choose from sns.color_palette
                    ):
@@ -49,7 +49,7 @@ def get_color_dict(categories, # list of names to assign color
     color_map = {category: next(color_cycle) for category in categories}
     return color_map

-# %% ../nbs/02_plot.ipynb
+# %% ../nbs/02_plot.ipynb 12
 def logo_func(df:pd.DataFrame, # a dataframe that contains ratios for each amino acid at each position
               title: str='logo', # title of the motif logo
               ):
@@ -81,7 +81,7 @@ def logo_func(df:pd.DataFrame, # a dataframe that contains ratios for each amino
     logo.ax.set_yticks([])
     logo.ax.set_title(title)

-# %% ../nbs/02_plot.ipynb
+# %% ../nbs/02_plot.ipynb 13
 def get_logo(df: pd.DataFrame, # stacked Dataframe with kinase as index, substrates as columns
              kinase: str, # a specific kinase name in index
              ):
@@ -120,7 +120,7 @@ def get_logo(df: pd.DataFrame, # stacked Dataframe with kinase as index, substra
     # plot logo
     logo_func(ratio2, kinase)

-# %% ../nbs/02_plot.ipynb
+# %% ../nbs/02_plot.ipynb 17
 def get_logo2(full: pd.DataFrame, # a dataframe that contains the full matrix of a kinase, with index as amino acid, and columns as positions
               title: str = 'logo', # title of the graph
               ):
@@ -159,7 +159,7 @@ def get_logo2(full: pd.DataFrame, # a dataframe that contains the full matrix of

     logo_func(ratio2,title)

-# %% ../nbs/02_plot.ipynb
+# %% ../nbs/02_plot.ipynb 20
 @fc.delegates(sns.scatterplot)
 def plot_rank(sorted_df: pd.DataFrame, # a sorted dataframe
               x: str, # column name for x axis
@@ -203,7 +203,7 @@ def plot_rank(sorted_df: pd.DataFrame, # a sorted dataframe

     plt.tight_layout()

-# %% ../nbs/02_plot.ipynb
+# %% ../nbs/02_plot.ipynb 24
 @fc.delegates(sns.histplot)
 def plot_hist(df: pd.DataFrame, # a dataframe that contain values for plot
               x: str, # column name of values
@@ -220,7 +220,7 @@ def plot_hist(df: pd.DataFrame, # a dataframe that contain values for plot
     plt.figure(figsize=figsize)
     sns.histplot(data=df,x=x,**hist_params,**kwargs)

-# %% ../nbs/02_plot.ipynb
+# %% ../nbs/02_plot.ipynb 28
 @fc.delegates(sns.heatmap)
 def plot_heatmap(matrix, # a matrix of values
                  title: str='heatmap', # title of the heatmap
@@ -235,7 +235,7 @@ def plot_heatmap(matrix, # a matrix of values
     sns.heatmap(matrix, cmap=cmap, annot=False,**kwargs)
     plt.title(title)

-# %% ../nbs/02_plot.ipynb
+# %% ../nbs/02_plot.ipynb 32
 @fc.delegates(sns.scatterplot)
 def plot_2d(X: pd.DataFrame, # a dataframe that has first column to be x, and second column to be y
             **kwargs, # arguments for sns.scatterplot
@@ -244,7 +244,7 @@ def plot_2d(X: pd.DataFrame, # a dataframe that has first column to be x, and se
 plt.figure(figsize=(7,7))
 sns.scatterplot(data = X,x=X.columns[0],y=X.columns[1],alpha=0.7,**kwargs)
 
-# %% ../nbs/02_plot.ipynb
+# %% ../nbs/02_plot.ipynb 34
 def plot_cluster(df: pd.DataFrame, # a dataframe of values that is waited for dimensionality reduction
 method: str='pca', # dimensionality reduction method, choose from pca, umap, and tsne
 hue: str=None, # colname of color
@@ -269,7 +269,7 @@ def plot_cluster(df: pd.DataFrame, # a dataframe of values that is waited for di
 texts = [plt.text(embedding_df[x_col][i], embedding_df[y_col][i], name_list[i],fontsize=8) for i in range(len(embedding_df))]
 adjust_text(texts, arrowprops=dict(arrowstyle='-', color='black'))
 
-# %% ../nbs/02_plot.ipynb
+# %% ../nbs/02_plot.ipynb 38
 def plot_bokeh(X:pd.DataFrame, # a dataframe of two columns from dimensionality reduction
 idx, # pd.Series or list that indicates identities for searching box
 hue:None, # pd.Series or list that indicates category for each sample
@@ -367,7 +367,7 @@ def plot_bokeh(X:pd.DataFrame, # a dataframe of two columns from dimensionality
 layout = column(autocomplete, p)
 show(layout)
 
-# %% ../nbs/02_plot.ipynb
+# %% ../nbs/02_plot.ipynb 41
 def plot_count(cnt, # from df['x'].value_counts()
 tick_spacing: float= None, # tick spacing for x axis
 palette: str='tab20'):
@@ -383,7 +383,7 @@ def plot_count(cnt, # from df['x'].value_counts()
 if tick_spacing is not None:
 ax.xaxis.set_major_locator(MultipleLocator(tick_spacing))
 
-# %% ../nbs/02_plot.ipynb
+# %% ../nbs/02_plot.ipynb 43
 @fc.delegates(sns.barplot)
 def plot_bar(df,
 value, # colname of value
@@ -438,7 +438,7 @@ def plot_bar(df,
 
 plt.gca().spines[['right', 'top']].set_visible(False)
 
-# %% ../nbs/02_plot.ipynb
+# %% ../nbs/02_plot.ipynb 46
 @fc.delegates(sns.barplot)
 def plot_group_bar(df,
 value_cols, # list of column names for values, the order depends on the first item
@@ -481,7 +481,7 @@ def plot_group_bar(df,
 plt.gca().spines[['right', 'top']].set_visible(False)
 plt.legend(fontsize=fontsize) # if change legend location, use loc='upper right'
 
-# %% ../nbs/02_plot.ipynb
+# %% ../nbs/02_plot.ipynb 49
 @fc.delegates(sns.boxplot)
 def plot_box(df,
 value, # colname of value
@@ -523,7 +523,7 @@ def plot_box(df,
 # plt.gca().spines[['right', 'top']].set_visible(False)
 
 
-# %% ../nbs/02_plot.ipynb
+# %% ../nbs/02_plot.ipynb 52
 @fc.delegates(sns.regplot)
 def plot_corr(x, # x axis values, or colname of x axis
 y, # y axis values, or colname of y axis
@@ -560,7 +560,7 @@ def plot_corr(x, # x axis values, or colname of x axis
 transform=plt.gca().transAxes,
 ha='center', va='center')
 
-# %% ../nbs/02_plot.ipynb
+# %% ../nbs/02_plot.ipynb 56
 def draw_corr(corr):
 
 "plot heatmap from df.corr()"
@@ -572,7 +572,7 @@ def draw_corr(corr):
 plt.figure(figsize=(20, 16)) # Set the figure size
 sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1, mask=mask, fmt='.2f')
 
-# %% ../nbs/02_plot.ipynb
+# %% ../nbs/02_plot.ipynb 60
 def get_AUCDF(df,col, reverse=False,plot=True,xlabel='Rank of reported kinase'):
 
 "Plot CDF curve and get relative area under the curve"
@@ -637,7 +637,7 @@ def get_AUCDF(df,col, reverse=False,plot=True,xlabel='Rank of reported kinase'):
 
 return AUCDF
 
-# %% ../nbs/02_plot.ipynb
+# %% ../nbs/02_plot.ipynb 63
 def plot_confusion_matrix(target, # pd.Series
 pred, # pd.Series
 class_names:list=['0','1'],
@@ -5,7 +5,7 @@
 # %% auto 0
 __all__ = ['get_splits', 'split_data', 'score_each', 'train_ml', 'train_ml_cv', 'predict_ml']
 
-# %% ../nbs/03_ML.ipynb
+# %% ../nbs/03_ML.ipynb 5
 # katlas
 from .core import Data
 from .feature import *
@@ -29,7 +29,7 @@ from sklearn.ensemble import *
 from sklearn import set_config
 set_config(transform_output="pandas")
 
-# %% ../nbs/03_ML.ipynb
+# %% ../nbs/03_ML.ipynb 8
 def get_splits(df: pd.DataFrame, # df contains info for split
 stratified: str=None, # colname to make stratified kfold; sampling from different groups
 group: str=None, # colname to make group kfold; test and train are from different groups
@@ -79,7 +79,7 @@ def get_splits(df: pd.DataFrame, # df contains info for split
 
 return splits
 
-# %% ../nbs/03_ML.ipynb
+# %% ../nbs/03_ML.ipynb 13
 def split_data(df: pd.DataFrame, # dataframe of values
 feat_col: list, # feature columns
 target_col: list, # target columns
@@ -95,7 +95,7 @@ def split_data(df: pd.DataFrame, # dataframe of values
 
 return X_train, y_train, X_test, y_test
 
-# %% ../nbs/03_ML.ipynb
+# %% ../nbs/03_ML.ipynb 17
 def score_each(target: pd.DataFrame, # target dataframe
 pred: pd.DataFrame, # predicted dataframe
 absolute = True, # if absolute, take average with absolute values for pearson/spearman
@@ -134,7 +134,7 @@ def score_each(target: pd.DataFrame, # target dataframe
 
 return mse,pearson_mean, metrics_df
 
-# %% ../nbs/03_ML.ipynb
+# %% ../nbs/03_ML.ipynb 22
 def train_ml(df, # dataframe of values
 feat_col, # feature columns
 target_col, # target columns
@@ -169,7 +169,7 @@ def train_ml(df, # dataframe of values
 
 return y_test, y_pred
 
-# %% ../nbs/03_ML.ipynb
+# %% ../nbs/03_ML.ipynb 25
 def train_ml_cv( df, # dataframe of values
 feat_col, # feature columns
 target_col, # target columns
@@ -213,7 +213,7 @@ def train_ml_cv( df, # dataframe of values
 
 return oof, metrics
 
-# %% ../nbs/03_ML.ipynb
+# %% ../nbs/03_ML.ipynb 32
 def predict_ml(df, # Dataframe that contains features
 feat_col, # feature columns
 target_col=None,
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: python-katlas
-Version: 0.
+Version: 0.1.1
 Summary: tools for predicting kinome specificities
 Home-page: https://github.com/sky1ove/python-katlas
 Author: lily
@@ -60,6 +60,13 @@ helpful to your research.
 phosphoproteome](https://www.nature.com/articles/s41587-019-0344-3),
 and [CPTAC](https://pdc.cancer.gov/pdc/cptac-pancancer) /
 [LinkedOmics](https://academic.oup.com/nar/article/46/D1/D956/4607804)
+
+
+## Web applications
+
+Users can now run the analysis directly on the web without needing to code.
+
+Check out our latest web: [kinase-atlas.com](https://kinase-atlas.com/)
 
 ## Tutorials on Colab
 
@@ -67,20 +74,16 @@ helpful to your research.
 sequence](https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/tutorial_01_sinlge_input.ipynb)
 - 2. [High throughput substrate scoring on phosphoproteomics
 dataset](https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/tutorial_02_high_throughput.ipynb)
-- 3. [
-
-
-- 4. [Kinase enrichment analysis for AKT
-inhibitor](https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/tutorial_04a_enrichment_AKTi.ipynb)
-/ [Kinase enrichment analysis for EGFR
-inhibitor](https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/tutorial_04b_enrichment_EGFRi.ipynb)
+- 3. [Kinase enrichment analysis for AKT
+inhibitor](https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/tutorial_03a_enrichment_AKTi.ipynb)
+
 
 ## Install
 
-Install the latest version through
+Install the latest version through pip
 
 ``` python
-
+pip install python-katlas -Uq
 ```
 
 ## Import
@@ -1 +0,0 @@
-__version__ = "0.0.8"
File without changes (11 files)
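The PKG-INFO hunk above bumps the version to 0.1.1, and the README now installs with `pip install python-katlas -Uq`. As a minimal sketch of how one might confirm which version ended up installed, the following uses only the standard library's `importlib.metadata`. The helper name `installed_version` is hypothetical and is not part of katlas.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


if __name__ == "__main__":
    # "python-katlas" is the distribution name from the Name: field in PKG-INFO.
    found = installed_version("python-katlas")
    print(found if found is not None else "python-katlas is not installed")
```

This queries the distribution metadata rather than the module's `__version__` attribute, so it does not depend on what `katlas/__init__.py` declares.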