python-katlas 0.0.1__py3-none-any.whl

@@ -0,0 +1,402 @@
1
+ Metadata-Version: 2.1
2
+ Name: python-katlas
3
+ Version: 0.0.1
4
+ Summary: tools for predicting kinome specificities
5
+ Home-page: https://github.com/sky1ove/python-katlas
6
+ Author: lily
7
+ Author-email: lcai888666@gmail.com
8
+ License: Apache Software License 2.0
9
+ Keywords: nbdev jupyter notebook python
10
+ Classifier: Development Status :: 4 - Beta
11
+ Classifier: Intended Audience :: Developers
12
+ Classifier: Natural Language :: English
13
+ Classifier: Programming Language :: Python :: 3.7
14
+ Classifier: Programming Language :: Python :: 3.8
15
+ Classifier: Programming Language :: Python :: 3.9
16
+ Classifier: Programming Language :: Python :: 3.10
17
+ Classifier: License :: OSI Approved :: Apache Software License
18
+ Requires-Python: >=3.7
19
+ Description-Content-Type: text/markdown
20
+ License-File: LICENSE
21
+ Requires-Dist: fastai (>=2.7.12)
22
+ Requires-Dist: pandas
23
+ Requires-Dist: logomaker
24
+ Requires-Dist: seaborn
25
+ Requires-Dist: rdkit
26
+ Requires-Dist: fairscale
27
+ Requires-Dist: fair-esm
28
+ Requires-Dist: umap-learn
29
+ Requires-Dist: adjustText
30
+ Requires-Dist: bokeh
31
+ Requires-Dist: fastbook
32
+ Requires-Dist: biopython
33
+ Requires-Dist: scikit-learn (>=1.3.0)
34
+ Requires-Dist: statsmodels
35
+ Requires-Dist: openpyxl
36
+ Provides-Extra: dev
37
+ Requires-Dist: nbdev ; extra == 'dev'
38
+ Requires-Dist: pyngrok ; extra == 'dev'
39
+
40
+ # KATLAS
41
+
42
+
43
+ <!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->
44
+
45
+ <a target="_blank" href="https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/index.ipynb">
46
+ <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
47
+ </a>
48
+
49
+ <img alt="Katlas logo" width="700" caption="Katlas logo" src="https://github.com/sky1ove/katlas/raw/main/dataset/images/logo.png" id="logo"/>
50
+
51
+ KATLAS is a repository containing python tools to predict kinases given
52
+ a substrate sequence. It also contains datasets of kinase substrate
53
+ specificities and human phosphoproteomics.
54
+
55
+ ***References***: Please cite the appropriate papers if KATLAS is
56
+ helpful to your research.
57
+
58
+ - KATLAS was described in the paper \[Decoding Human Kinome
59
+ Specificities through a Computational Data-Driven Approach
60
+ (manuscript)\]
61
+
62
+ - The positional scanning peptide array (PSPA) data is from the paper [An
63
+ atlas of substrate specificities for the human serine/threonine
64
+ kinome](https://www.nature.com/articles/s41586-022-05575-3) and the paper
65
+ [The intrinsic substrate specificity of the human tyrosine
66
+ kinome](https://www.nature.com/articles/s41586-024-07407-y)
67
+
68
+ - The kinase substrate datasets used for generating PSSMs are derived
69
+ from
70
+ [PhosphoSitePlus](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245126/)
71
+ and the paper [Large-scale Discovery of Substrates of the Human
72
+ Kinome](https://www.nature.com/articles/s41598-019-46385-4)
73
+
74
+ - Phosphorylation sites are acquired from
75
+ [PhosphoSitePlus](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245126/),
76
+ the paper [The functional landscape of the human
77
+ phosphoproteome](https://www.nature.com/articles/s41587-019-0344-3),
78
+ and [CPTAC](https://pdc.cancer.gov/pdc/cptac-pancancer) /
79
+ [LinkedOmics](https://academic.oup.com/nar/article/46/D1/D956/4607804)
80
+
81
+ ## Tutorials on Colab
82
+
83
+ - 1. [Substrate scoring on a single substrate
84
+ sequence](https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/tutorial_01_sinlge_input.ipynb)
85
+ - 2. [High throughput substrate scoring on phosphoproteomics
86
+ dataset](https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/tutorial_02_high_throughput.ipynb)
87
+ - 3. [Query a protein’s phosphorylation sites and predict their
88
+ upstream
89
+ kinases](https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/tutorial_03_query_gene.ipynb)
90
+ - 4. [Kinase enrichment analysis for AKT
91
+ inhibitor](https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/tutorial_04a_enrichment_AKTi.ipynb)
92
+ / [Kinase enrichment analysis for EGFR
93
+ inhibitor](https://colab.research.google.com/github/sky1ove/katlas/blob/main/nbs/tutorial_04b_enrichment_EGFRi.ipynb)
94
+
95
+ ## Install
96
+
97
+ Install the latest version through git
98
+
99
+ ``` python
100
+ !pip install git+https://github.com/sky1ove/katlas.git -Uqq
101
+ ```
102
+
103
+ ## Import
104
+
105
+ ``` python
106
+ from katlas.core import *
107
+ ```
108
+
109
+ # Quick start
110
+
111
+ We provide two methods to score a substrate sequence:
112
+
113
+ - Computational Data-Driven Method (CDDM)
114
+ - Positional Scanning Peptide Array (PSPA)
115
+
116
+ We consider the input in two formats:
117
+
118
+ - a single input string (phosphorylation site)
119
+ - a csv/dataframe that contains a column of phosphorylation sites
120
+
121
+ Input sequences can also be written in two ways:
122
+
123
+ - all capital
124
+ - contains lowercase letters indicating phosphorylation status
125
+
126
+ ## Single sequence as input
127
+
128
+ ### CDDM, all capital
129
+
130
+ ``` python
131
+ predict_kinase('AAAAAAASGGAGSDN',**param_CDDM_upper)
132
+ ```
133
+
134
+ considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0S', '1G', '2G', '3A', '4G', '5S', '6D', '7N']
135
+
136
+ kinase
137
+ PAK6 2.032
138
+ ULK3 2.032
139
+ PRKX 2.012
140
+ ATR 1.991
141
+ PRKD1 1.988
142
+ ...
143
+ DDR2 0.928
144
+ EPHA4 0.928
145
+ TEK 0.921
146
+ KIT 0.915
147
+ FGFR3 0.910
148
+ Length: 289, dtype: float64
149
+
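The `considering string` line in the output above labels each residue with its offset from the phospho-acceptor. The following is a minimal sketch of how such labels can be produced, not the katlas implementation: `label_positions` is a hypothetical helper, and it assumes the acceptor sits at the middle index of the input (`len(seq) // 2`), which is consistent with the 15-, 11- and 10-residue examples in this README.

```python
from typing import List


def label_positions(site_seq: str) -> List[str]:
    """Label each residue with its offset from the central phospho-acceptor.

    Hypothetical helper for illustration; assumes the acceptor is at the
    middle index of the sequence.
    """
    center = len(site_seq) // 2
    return [f"{i - center}{aa}" for i, aa in enumerate(site_seq)]


# Case is preserved, so lowercase s/t/y keep their phosphorylation marking.
print(label_positions("AAAAAAAsGGAGsDN"))
```

Because the offset is attached to the residue letter, a lowercase `s` at position 5 (`'5s'`) and an uppercase `S` at the same position (`'5S'`) are distinct keys, which is why the two CDDM examples give slightly different scores.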
150
+ ### CDDM, with lower case indicating phosphorylation status
151
+
152
+ ``` python
153
+ predict_kinase('AAAAAAAsGGAGsDN',**param_CDDM)
154
+ ```
155
+
156
+ considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0s', '1G', '2G', '3A', '4G', '5s', '6D', '7N']
157
+
158
+ kinase
159
+ ULK3 1.987
160
+ PAK6 1.981
161
+ PRKD1 1.946
162
+ PIM3 1.944
163
+ PRKX 1.939
164
+ ...
165
+ EPHA4 0.905
166
+ EGFR 0.900
167
+ TEK 0.898
168
+ FGFR3 0.894
169
+ KIT 0.882
170
+ Length: 289, dtype: float64
171
+
172
+ ### PSPA, with lower case indicating phosphorylation status
173
+
174
+ ``` python
175
+ predict_kinase('AEEKEyHsEGG',**param_PSPA).head()
176
+ ```
177
+
178
+ considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2s', '3E', '4G', '5G']
179
+
180
+ kinase
181
+ EGFR 4.013
182
+ FGFR4 3.568
183
+ ZAP70 3.412
184
+ CSK 3.241
185
+ SYK 3.209
186
+ dtype: float64
187
+
188
+ ### To replicate the results from The Kinase Library (PSPA)
189
+
190
+ Check this link: [The Kinase
191
+ Library](https://kinase-library.phosphosite.org/site?s=AEEKEy*HsEGG&pp=false&scp=true),
192
+ and rank by log2(score); the results match the output below (with
193
+ slight differences due to rounding).
194
+
195
+ ``` python
196
+ predict_kinase('AEEKEyHSEGG',**param_PSPA).head(10)
197
+ ```
198
+
199
+ considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2S', '3E', '4G', '5G']
200
+
201
+ kinase
202
+ EGFR 3.181
203
+ FGFR4 2.390
204
+ CSK 2.308
205
+ ZAP70 2.068
206
+ SYK 1.998
207
+ PDHK1_TYR 1.922
208
+ RET 1.732
209
+ MATK 1.688
210
+ FLT1 1.627
211
+ BMPR2_TYR 1.456
212
+ dtype: float64
213
+
214
+ - So far [The Kinase Library](https://kinase-library.phosphosite.org)
215
+ treats all ***tyr sequences*** as uppercase regardless of whether or
216
+ not they contain lowercase letters, which is a small bug and should be fixed
217
+ soon.
218
+ - A kinase name ending in “\_TYR” indicates a dual-specificity kinase tested
219
+ in the PSPA tyrosine setting, which has not been included in
220
+ The Kinase Library yet.
221
+
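Ranking by log2(score) gives the same order as ranking by the raw score, because log2 is monotonic. The sketch below illustrates this with a generic position-specific scoring scheme in which per-position values are multiplied and the product is log2-transformed. This is an assumption made for illustration: the exact formula used by `predict_kinase` is not stated here, and `toy_matrix` and both function names are hypothetical.

```python
import math

# Hypothetical toy values for one kinase: position-labelled residue -> value.
toy_matrix = {"-1E": 1.6, "0y": 1.0, "1H": 0.5, "2S": 2.0}


def log2_product_score(tokens, matrix):
    """Multiply the per-position values, then take log2 of the product."""
    return math.log2(math.prod(matrix[t] for t in tokens))


def sum_of_log2_score(tokens, matrix):
    """Equivalent form: sum the log2 of each per-position value."""
    return sum(math.log2(matrix[t]) for t in tokens)


tokens = ["-1E", "0y", "1H", "2S"]
print(log2_product_score(tokens, toy_matrix))
print(sum_of_log2_score(tokens, toy_matrix))
```

The two functions agree up to floating-point error; the summed form is usually preferred for long sequences because it avoids very small or very large intermediate products.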
222
+ We can also calculate the percentile score using a reference score
223
+ sheet.
224
+
225
+ ``` python
226
+ # Percentile reference sheet
227
+ y_pct = Data.get_pspa_tyr_pct()
228
+
229
+ get_pct('AEEKEyHSEGG',**param_PSPA_y, pct_ref = y_pct)
230
+ ```
231
+
232
+ considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0Y', '1H', '2S', '3E', '4G', '5G']
233
+
234
+
235
+
236
+ | | log2(score) | percentile |
237
+ |-------|-------------|------------|
238
+ | EGFR | 3.181 | 96.787423 |
239
+ | FGFR4 | 2.390 | 94.012303 |
240
+ | CSK | 2.308 | 95.201640 |
241
+ | ZAP70 | 2.068 | 88.380041 |
242
+ | SYK | 1.998 | 85.522898 |
243
+ | ... | ... | ... |
244
+ | EPHA1 | -3.501 | 12.139440 |
245
+ | FES | -3.699 | 21.216678 |
246
+ | TNK1 | -4.269 | 5.481887 |
247
+ | TNK2 | -4.577 | 2.050581 |
248
+ | DDR2 | -4.920 | 10.403281 |
249
+
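A percentile places a score within a reference distribution of scores for the same kinase. The sketch below shows one common definition, the share of reference scores less than or equal to the query score. It is not necessarily the definition used by `get_pct`; `percentile_of` and the reference values are hypothetical.

```python
from bisect import bisect_right


def percentile_of(score, reference_scores):
    """Percentage of reference scores that are <= the given score."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


# Hypothetical reference log2(score) values for a single kinase.
reference = [-4.9, -3.5, -1.2, 0.4, 1.9, 2.3, 3.2, 4.0]
print(percentile_of(2.3, reference))
```

Because each kinase has its own reference distribution, percentiles need not follow the same order as the raw log2(score) values, as the table above shows for FGFR4 and CSK.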
250
+
251
+
252
+ ## High-throughput substrate scoring on a dataframe
253
+
254
+ ### Load your csv
255
+
256
+ ``` python
257
+ # df = pd.read_csv('your_file.csv')
258
+ ```
259
+
260
+ ### Load a demo df
261
+
262
+ ``` python
263
+ # Load a demo df with phosphorylation sites
264
+ df = Data.get_ochoa_site().head()
265
+ df.iloc[:,-2:]
266
+ ```
267
+
268
+
269
+ | | site_seq | gene_site |
270
+ |-----|-----------------|----------------|
271
+ | 0 | VDDEKGDSNDDYDSA | A0A075B6Q4_S24 |
272
+ | 1 | YDSAGLLSDEDCMSV | A0A075B6Q4_S35 |
273
+ | 2 | IADHLFWSEETKSRF | A0A075B6Q4_S57 |
274
+ | 3 | KSRFTEYSMTSSVMR | A0A075B6Q4_S68 |
275
+ | 4 | FTEYSMTSSVMRRNE | A0A075B6Q4_S71 |
276
+
277
+
278
+
279
+ ### Set the column name and param to calculate
280
+
281
+ Here we choose param_CDDM_upper, as the sequences in the demo df are all
282
+ uppercase. You can also choose other params.
283
+
284
+ ``` python
285
+ results = predict_kinase_df(df,'site_seq',**param_CDDM_upper)
286
+ results
287
+ ```
288
+
289
+ input dataframe has a length 5
290
+ Preprocessing
291
+ Finish preprocessing
292
+ Calculating position: [-7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7]
293
+
294
+ 100%|██████████| 289/289 [00:05<00:00, 56.64it/s]
295
+
296
+
297
+
298
+ | kinase | SRC | EPHA3 | FES | NTRK3 | ALK | EPHA8 | ABL1 | FLT3 | EPHB2 | FYN | ... | MEK5 | PKN2 | MAP2K7 | MRCKB | HIPK3 | CDK8 | BUB1 | MEKK3 | MAP2K3 | GRK1 |
299
+ |--------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|-----|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
300
+ | 0 | 0.991760 | 1.093712 | 1.051750 | 1.067134 | 1.013682 | 1.097519 | 0.966379 | 0.982464 | 1.054986 | 1.055910 | ... | 1.314859 | 1.635470 | 1.652251 | 1.622672 | 1.362973 | 1.797155 | 1.305198 | 1.423618 | 1.504941 | 1.872020 |
301
+ | 1 | 0.910262 | 0.953743 | 0.942327 | 0.950601 | 0.872694 | 0.932586 | 0.846899 | 0.826662 | 0.915020 | 0.942713 | ... | 1.175454 | 1.402006 | 1.430392 | 1.215826 | 1.569373 | 1.716455 | 1.270999 | 1.195081 | 1.223082 | 1.793290 |
302
+ | 2 | 0.849866 | 0.899910 | 0.848895 | 0.879652 | 0.874959 | 0.899414 | 0.839200 | 0.836523 | 0.858040 | 0.867269 | ... | 1.408003 | 1.813739 | 1.454786 | 1.084522 | 1.352556 | 1.524663 | 1.377839 | 1.173830 | 1.305691 | 1.811849 |
303
+ | 3 | 0.803826 | 0.836527 | 0.800759 | 0.894570 | 0.839905 | 0.781001 | 0.847847 | 0.807040 | 0.805877 | 0.801402 | ... | 1.110307 | 1.703637 | 1.795092 | 1.469653 | 1.549936 | 1.491344 | 1.446922 | 1.055452 | 1.534895 | 1.741090 |
304
+ | 4 | 0.822793 | 0.796532 | 0.792343 | 0.839882 | 0.810122 | 0.781420 | 0.805251 | 0.795022 | 0.790380 | 0.864538 | ... | 1.062617 | 1.357689 | 1.485945 | 1.249266 | 1.456078 | 1.422782 | 1.376471 | 1.089629 | 1.121309 | 1.697524 |
305
+
306
+
307
+
308
+ ## Phosphorylation sites
309
+
310
+ Besides calculating sequence scores, we also provide multiple datasets
311
+ of phosphorylation sites.
312
+
313
+ ### CPTAC pan-cancer phosphoproteomics
314
+
315
+ ``` python
316
+ df = Data.get_cptac_ensembl_site()
317
+ df.head(3)
318
+ ```
319
+
320
+
321
+
322
+ | | gene | site | site_seq | protein | gene_name | gene_site | protein_site |
323
+ |-----|--------------------|-------|-----------------|-------------------|-----------|-------------|-----------------------|
324
+ | 0 | ENSG00000003056.8 | S267 | DDQLGEESEERDDHL | ENSP00000000412.3 | M6PR | M6PR_S267 | ENSP00000000412_S267 |
325
+ | 1 | ENSG00000003056.8 | S267 | DDQLGEESEERDDHL | ENSP00000440488.2 | M6PR | M6PR_S267 | ENSP00000440488_S267 |
326
+ | 2 | ENSG00000048028.11 | S1053 | PPTIRPNSPYDLCSR | ENSP00000003302.4 | USP28 | USP28_S1053 | ENSP00000003302_S1053 |
327
+
328
+
329
+
330
+ ### [Ochoa et al. human phosphoproteome](https://www.nature.com/articles/s41587-019-0344-3)
331
+
332
+ ``` python
333
+ df = Data.get_ochoa_site()
334
+ df.head(3)
335
+ ```
336
+
337
+
338
+ | | uniprot | position | residue | is_disopred | disopred_score | log10_hotspot_pval_min | isHotspot | uniprot_position | functional_score | current_uniprot | name | gene | Sequence | is_valid | site_seq | gene_site |
339
+ |-----|------------|----------|---------|-------------|----------------|------------------------|-----------|------------------|------------------|-----------------|------------------|------|---------------------------------------------------|----------|-----------------|----------------|
340
+ | 0 | A0A075B6Q4 | 24 | S | True | 0.91 | 6.839384 | True | A0A075B6Q4_24 | 0.149257 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | VDDEKGDSNDDYDSA | A0A075B6Q4_S24 |
341
+ | 1 | A0A075B6Q4 | 35 | S | True | 0.87 | 9.192622 | False | A0A075B6Q4_35 | 0.136966 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | YDSAGLLSDEDCMSV | A0A075B6Q4_S35 |
342
+ | 2 | A0A075B6Q4 | 57 | S | False | 0.28 | 0.818834 | False | A0A075B6Q4_57 | 0.125364 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | IADHLFWSEETKSRF | A0A075B6Q4_S57 |
343
+
344
+
345
+
346
+ ### PhosphoSitePlus human phosphorylation sites
347
+
348
+ ``` python
349
+ df = Data.get_psp_human_site()
350
+ df.head(3)
351
+ ```
352
+
353
+
354
+ | | gene | protein | uniprot | site | gene_site | SITE_GRP_ID | species | site_seq | LT_LIT | MS_LIT | MS_CST | CST_CAT# | Ambiguous_Site |
355
+ |-----|-------|-------------|---------|------|-----------|-------------|---------|-----------------------|--------|--------|--------|----------|----------------|
356
+ | 0 | YWHAB | 14-3-3 beta | P31946 | T2 | YWHAB_T2 | 15718712 | human | \_\_\_\_\_\_MtMDksELV | NaN | 3.0 | 1.0 | None | 0 |
357
+ | 1 | YWHAB | 14-3-3 beta | P31946 | S6 | YWHAB_S6 | 15718709 | human | \_\_MtMDksELVQkAk | NaN | 8.0 | NaN | None | 0 |
358
+ | 2 | YWHAB | 14-3-3 beta | P31946 | Y21 | YWHAB_Y21 | 3426383 | human | LAEQAERyDDMAAAM | NaN | NaN | 4.0 | None | 0 |
359
+
360
+
361
+
362
+ ### Unique sites of combined Ochoa & PhosphoSitePlus
363
+
364
+ ``` python
365
+ df = Data.get_combine_site_psp_ochoa()
366
+ df.head(3)
367
+ ```
368
+
369
+
370
+ | | site_seq | gene_site | gene | source | num_site | acceptor | -7 | -6 | -5 | -4 | ... | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
371
+ |-----|-----------------|------------|-------|--------|----------|----------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
372
+ | 0 | AAAAAAASGGAGSDN | PBX1_S136 | PBX1 | ochoa | 1 | S | A | A | A | A | ... | A | A | S | G | G | A | G | S | D | N |
373
+ | 1 | AAAAAAASGGGVSPD | PBX2_S146 | PBX2 | ochoa | 1 | S | A | A | A | A | ... | A | A | S | G | G | G | V | S | P | D |
374
+ | 2 | AAAAAAASGVTTGKP | CLASR_S349 | CLASR | ochoa | 1 | S | A | A | A | A | ... | A | A | S | G | V | T | T | G | K | P |
375
+
376
+
377
+
378
+ ## Phosphorylation site sequence example
379
+
380
+ ***All capital - 15 length (-7 to +7)***
381
+
382
+ - QSEEEKLSPSPTTED
383
+ - TLQHVPDYRQNVYIP
384
+ - TMGLSARYGPQFTLQ
385
+
386
+ ***All capital - 10 length (-5 to +4)***
387
+
388
+ - SRDPHYQDPH
389
+ - LDNPDYQQDF
390
+ - AAAAASGGAG
391
+
392
+ ***With lowercase - (-7 to +7)***
393
+
394
+ - QsEEEKLsPsPTTED
395
+ - TLQHVPDyRQNVYIP
396
+ - TMGLsARyGPQFTLQ
397
+
398
+ ***With lowercase - (-5 to +4)***
399
+
400
+ - sRDPHyQDPH
401
+ - LDNPDyQQDF
402
+ - AAAAAsGGAG
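
If you start from a full protein sequence and a 1-based site position, a -7 to +7 window like the ones listed above can be cut out as sketched below. `extract_site_window` is a hypothetical helper, not part of katlas. Padding with `_` near the protein termini is an assumption modelled on the PhosphoSitePlus `site_seq` values shown earlier. The test sequence is the visible prefix of the Ochoa `Sequence` column for A0A075B6Q4.

```python
def extract_site_window(protein_seq, position, flank=7, pad="_"):
    """Return the residues from -flank to +flank around a 1-based position.

    Hypothetical helper; pads with `pad` where the window runs past
    either end of the protein.
    """
    idx = position - 1  # convert to a 0-based index
    if not 0 <= idx < len(protein_seq):
        raise ValueError("position is outside the protein sequence")
    padded = pad * flank + protein_seq + pad * flank
    # The residue at idx now sits at idx + flank, so the window starts at idx.
    return padded[idx: idx + 2 * flank + 1]


# Visible prefix of the A0A075B6Q4 sequence from the Ochoa table above.
prefix = "MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT"
print(extract_site_window(prefix, 24))
print(extract_site_window(prefix, 35))
```

This helper returns the residues in whatever case the protein sequence uses. Marking phosphorylated residues in lowercase would require additional site information and is not handled here.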
@@ -0,0 +1,14 @@
1
+ katlas/__init__.py,sha256=sXLh7g3KC4QCFxcZGBTpG2scR7hmmBsMjq6LqRptkRg,22
2
+ katlas/_modidx.py,sha256=wuIOxQQtyUyUDt8xnoZYyHfJAjnWMcoSYO6D3PXUFGE,10996
3
+ katlas/core.py,sha256=25yF0J2RBO_Fup1dUQA_h6Tfwcs96-A5uuzdf_lCpo0,34975
4
+ katlas/dl.py,sha256=Rm1EO6oGTiHpqp4EA2xAvbIUnh608FPYOdzndRGKVkc,10849
5
+ katlas/feature.py,sha256=3zgTuCnXqH1e0LGZ2Hkvan852PiaIHxj27cg_TJfKzo,11471
6
+ katlas/imports.py,sha256=-ZphRU8K1KspxMpgRxisE0OskrCw3S8JR8tvmeXBRY0,147
7
+ katlas/plot.py,sha256=vB3gv0aaCNERW1CtdDWqM4jIZOx1auGWwi_1I22xBa0,23630
8
+ katlas/train.py,sha256=s0ucsZVaixCTZPz-XAI2J7zQDeGkiYEJKOc2dFTYsAc,7625
9
+ python_katlas-0.0.1.dist-info/LICENSE,sha256=xx0jnfkXJvxRnG63LTGOxlggYnIysveWIZ6H3PNdCrQ,11357
10
+ python_katlas-0.0.1.dist-info/METADATA,sha256=3yYodyC6FDFo2E4vGk4DgDuJHGGK0PWIXXyIivPFk_s,15256
11
+ python_katlas-0.0.1.dist-info/WHEEL,sha256=EVRjI69F5qVjm_YgqcTXPnTAv3BfSUr0WVAHuSP3Xoo,92
12
+ python_katlas-0.0.1.dist-info/entry_points.txt,sha256=SF3xDlCmE84ECTBIMDo_FNg1aXGX2-lXkCvH5o4VgpM,34
13
+ python_katlas-0.0.1.dist-info/top_level.txt,sha256=pKBKw9KOSJgnnFkoilkDij_iJ_tJbIO4XnrSXIleqNc,7
14
+ python_katlas-0.0.1.dist-info/RECORD,,
@@ -0,0 +1,5 @@
1
+ Wheel-Version: 1.0
2
+ Generator: bdist_wheel (0.35.1)
3
+ Root-Is-Purelib: true
4
+ Tag: py3-none-any
5
+
@@ -0,0 +1,2 @@
1
+ [nbdev]
2
+ katlas = katlas._modidx:d
@@ -0,0 +1 @@
1
+ katlas