python-katlas 0.0.8

Creator: railscoderz

KATLAS





KATLAS is a repository containing python tools to predict kinases given
a substrate sequence. It also contains datasets of kinase substrate
specificities and human phosphoproteomics.
References
Please cite the appropriate papers if KATLAS is helpful to your research.


KATLAS is described in the paper "Decoding Human Kinome Specificities
through a Computational Data-Driven Approach" (manuscript).


The positional scanning peptide array (PSPA) data are from the papers
"An atlas of substrate specificities for the human serine/threonine
kinome" and "The intrinsic substrate specificity of the human tyrosine
kinome".


The kinase substrate datasets used for generating PSSMs are derived
from PhosphoSitePlus and the paper "Large-scale Discovery of Substrates
of the Human Kinome".


Phosphorylation sites are acquired from PhosphoSitePlus, the paper
"The functional landscape of the human phosphoproteome", and
CPTAC / LinkedOmics.


Tutorials on Colab



Substrate scoring on a single substrate sequence
High throughput substrate scoring on phosphoproteomics dataset
Query a protein’s phosphorylation sites and predict their upstream kinases
Kinase enrichment analysis for AKT inhibitor / Kinase enrichment analysis for EGFR inhibitor



Install
Install the latest version through git
!pip install git+https://github.com/sky1ove/katlas.git -Uqq

Import
from katlas.core import *

Quick start
We provide two methods to score a substrate sequence:

Computational Data-Driven Method (CDDM)
Positional Scanning Peptide Array (PSPA)

Input is accepted in two formats:

a single input string (phosphorylation site)
a CSV file or dataframe that contains a column of phosphorylation sites

Input sequences can be written in two ways:

all capital
with lowercase letters indicating phosphorylation status
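
The two ways of writing a sequence can be told apart mechanically. The sketch below is only an illustration: `pick_cddm_params` is a hypothetical helper, not part of katlas, and it simply returns the name of the CDDM parameter set used in the examples that follow (`param_CDDM_upper` for all-capital input, `param_CDDM` when lowercase letters are present).

```python
def pick_cddm_params(site_seq):
    """Return the name of the CDDM parameter set matching the input's case.

    Hypothetical helper for illustration; not a katlas function.
    """
    has_lower = any(ch.islower() for ch in site_seq)
    return "param_CDDM" if has_lower else "param_CDDM_upper"


print(pick_cddm_params("AAAAAAASGGAGSDN"))  # all-capital input
print(pick_cddm_params("AAAAAAAsGGAGsDN"))  # lowercase marks phosphorylated residues
```

The returned name can then be used to decide which parameter dictionary to unpack into `predict_kinase`.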

Single sequence as input
CDDM, all capital
predict_kinase('AAAAAAASGGAGSDN',**param_CDDM_upper)

considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0S', '1G', '2G', '3A', '4G', '5S', '6D', '7N']

kinase
PAK6 2.032
ULK3 2.032
PRKX 2.012
ATR 1.991
PRKD1 1.988
...
DDR2 0.928
EPHA4 0.928
TEK 0.921
KIT 0.915
FGFR3 0.910
Length: 289, dtype: float64
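
The "considering string" line labels every residue with its position relative to the central phospho-acceptor. A minimal sketch of that labeling, assuming the acceptor sits at index `len(seq) // 2` (which matches the 15-, 11- and 10-residue examples in this README); `position_labels` is a hypothetical name and this is not the katlas implementation.

```python
def position_labels(site_seq):
    """Label each residue with its offset from the assumed central acceptor."""
    center = len(site_seq) // 2  # assumption: acceptor is the middle residue
    return [f"{i - center}{aa}" for i, aa in enumerate(site_seq)]


print(position_labels("AAAAAAASGGAGSDN"))
```

For a 10-residue input the same rule yields offsets from -5 to +4, in line with the sequence examples at the end of this README.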

CDDM, with lower case indicating phosphorylation status
predict_kinase('AAAAAAAsGGAGsDN',**param_CDDM)

considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0s', '1G', '2G', '3A', '4G', '5s', '6D', '7N']

kinase
ULK3 1.987
PAK6 1.981
PRKD1 1.946
PIM3 1.944
PRKX 1.939
...
EPHA4 0.905
EGFR 0.900
TEK 0.898
FGFR3 0.894
KIT 0.882
Length: 289, dtype: float64
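
To see which residues a lowercase-aware input marks as phosphorylated, the lowercase s/t/y letters can be listed with their offsets. This is a hedged sketch with a hypothetical helper name (`phospho_positions`); it assumes the same central-acceptor convention as the "considering string" output and says nothing about how katlas weights those positions.

```python
def phospho_positions(site_seq):
    """Return (offset, residue) pairs for lowercase s/t/y in the sequence."""
    center = len(site_seq) // 2  # assumption: acceptor is the middle residue
    return [(i - center, aa) for i, aa in enumerate(site_seq) if aa in "sty"]


print(phospho_positions("AAAAAAAsGGAGsDN"))  # acceptor plus one flanking phosphoserine
print(phospho_positions("AAAAAAASGGAGSDN"))  # all capital: nothing is marked
```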

PSPA, with lower case indicating phosphorylation status
predict_kinase('AEEKEyHsEGG',**param_PSPA).head()

considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2s', '3E', '4G', '5G']

kinase
EGFR 4.013
FGFR4 3.568
ZAP70 3.412
CSK 3.241
SYK 3.209
dtype: float64

To replicate the results from The Kinase Library (PSPA)
Open The Kinase Library and rank by log2(score); it shows the same
results as the output below, with slight differences due to rounding.
predict_kinase('AEEKEyHSEGG',**param_PSPA).head(10)

considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2S', '3E', '4G', '5G']

kinase
EGFR 3.181
FGFR4 2.390
CSK 2.308
ZAP70 2.068
SYK 1.998
PDHK1_TYR 1.922
RET 1.732
MATK 1.688
FLT1 1.627
BMPR2_TYR 1.456
dtype: float64


So far, The Kinase Library treats all tyrosine sequences as all capital,
regardless of whether they contain lowercase letters; this is a small
bug and should be fixed soon.
A kinase name ending in “_TYR” indicates a dual-specificity kinase
tested in the PSPA tyrosine setting, which has not been included in
kinase-library yet.
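
When comparing against The Kinase Library, it can help to drop the “_TYR” entries first, since they have no counterpart there. The sketch below does this on the top-10 values printed above, copied into a plain dict; with the real pandas output the same filter could be applied to the Series index.

```python
# Top-10 log2 scores copied from the output above.
top10 = {
    "EGFR": 3.181, "FGFR4": 2.390, "CSK": 2.308, "ZAP70": 2.068,
    "SYK": 1.998, "PDHK1_TYR": 1.922, "RET": 1.732, "MATK": 1.688,
    "FLT1": 1.627, "BMPR2_TYR": 1.456,
}

# Keep only kinases without the dual-specificity "_TYR" suffix.
comparable = {k: v for k, v in top10.items() if not k.endswith("_TYR")}

print(list(comparable))
```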

We can also calculate the percentile score using a reference score
sheet.
# Percentile reference sheet
y_pct = Data.get_pspa_tyr_pct()

get_pct('AEEKEyHSEGG',**param_PSPA_y, pct_ref = y_pct)

considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0Y', '1H', '2S', '3E', '4G', '5G']





       log2(score)  percentile
EGFR         3.181   96.787423
FGFR4        2.390   94.012303
CSK          2.308   95.201640
ZAP70        2.068   88.380041
SYK          1.998   85.522898
...            ...         ...
EPHA1       -3.501   12.139440
FES         -3.699   21.216678
TNK1        -4.269    5.481887
TNK2        -4.577    2.050581
DDR2        -4.920   10.403281
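
In the table above, FES has a lower log2(score) than EPHA1 (-3.699 vs -3.501) yet a higher percentile (21.2 vs 12.1), which is consistent with each kinase being ranked against its own reference distribution. The sketch below illustrates that idea only; `percentile_of` is a hypothetical helper, the reference lists are made-up toy numbers rather than katlas data, and the exact percentile definition used by `get_pct` may differ.

```python
from bisect import bisect_right


def percentile_of(score, reference_scores):
    """Percentage of reference scores that are less than or equal to score."""
    ref = sorted(reference_scores)
    return 100.0 * bisect_right(ref, score) / len(ref)


# Toy reference distributions for two imaginary kinases (NOT katlas data).
ref_a = [-6.0, -5.0, -4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ref_b = [-9.0, -8.0, -7.0, -6.0, -5.0, -4.5, -4.0, -3.0, -2.0, -1.0]

print(percentile_of(-3.5, ref_a))  # 30.0
print(percentile_of(-3.7, ref_b))  # 70.0
```

The lower raw score lands at the higher percentile here because the second reference distribution is shifted toward lower values.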



High-throughput substrate scoring on a dataframe
Load your csv
# df = pd.read_csv('your_file.csv')
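
Before scoring a file of your own, it can be worth checking that the site column contains only characters the scorer is likely to understand. The sketch below uses the standard library rather than pandas so that it runs anywhere. The allowed alphabet (20 standard amino acids, lowercase s/t/y, and `_` padding) is an assumption and should be widened if your data carries other lowercase letters. `read_site_column` is a hypothetical helper, and the third CSV row is deliberately invalid.

```python
import csv
import io

# Assumed alphabet: standard amino acids, lowercase s/t/y, "_" padding.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY") | set("sty") | {"_"}


def read_site_column(csv_text, column):
    """Return (sites, problems) for one column of a CSV given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    sites = [row[column].strip() for row in reader]
    problems = [s for s in sites if not s or set(s) - ALLOWED]
    return sites, problems


demo_csv = (
    "site_seq,gene_site\n"
    "VDDEKGDSNDDYDSA,A0A075B6Q4_S24\n"
    "YDSAGLLSDEDCMSV,A0A075B6Q4_S35\n"
    "VDDEKGD?NDDYDSA,made_up_bad_row\n"  # invalid on purpose
)
sites, problems = read_site_column(demo_csv, "site_seq")
print(len(sites), problems)
```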

Load a demo df
# Load a demo df with phosphorylation sites
df = Data.get_ochoa_site().head()
df.iloc[:,-2:]





          site_seq       gene_site
0  VDDEKGDSNDDYDSA  A0A075B6Q4_S24
1  YDSAGLLSDEDCMSV  A0A075B6Q4_S35
2  IADHLFWSEETKSRF  A0A075B6Q4_S57
3  KSRFTEYSMTSSVMR  A0A075B6Q4_S68
4  FTEYSMTSSVMRRNE  A0A075B6Q4_S71



Set the column name and param to calculate
Here we choose param_CDDM_upper, as the sequences in the demo df are all
in capital. You can also choose other params.
results = predict_kinase_df(df,'site_seq',**param_CDDM_upper)
results

input dataframe has a length 5
Preprocessing
Finish preprocessing
Calculating position: [-7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7]

100%|██████████| 289/289 [00:05<00:00, 56.64it/s]




kinase       SRC     EPHA3       FES     NTRK3       ALK     EPHA8      ABL1      FLT3     EPHB2       FYN  ...      MEK5      PKN2    MAP2K7     MRCKB     HIPK3      CDK8      BUB1     MEKK3    MAP2K3      GRK1
0       0.991760  1.093712  1.051750  1.067134  1.013682  1.097519  0.966379  0.982464  1.054986  1.055910  ...  1.314859  1.635470  1.652251  1.622672  1.362973  1.797155  1.305198  1.423618  1.504941  1.872020
1       0.910262  0.953743  0.942327  0.950601  0.872694  0.932586  0.846899  0.826662  0.915020  0.942713  ...  1.175454  1.402006  1.430392  1.215826  1.569373  1.716455  1.270999  1.195081  1.223082  1.793290
2       0.849866  0.899910  0.848895  0.879652  0.874959  0.899414  0.839200  0.836523  0.858040  0.867269  ...  1.408003  1.813739  1.454786  1.084522  1.352556  1.524663  1.377839  1.173830  1.305691  1.811849
3       0.803826  0.836527  0.800759  0.894570  0.839905  0.781001  0.847847  0.807040  0.805877  0.801402  ...  1.110307  1.703637  1.795092  1.469653  1.549936  1.491344  1.446922  1.055452  1.534895  1.741090
4       0.822793  0.796532  0.792343  0.839882  0.810122  0.781420  0.805251  0.795022  0.790380  0.864538  ...  1.062617  1.357689  1.485945  1.249266  1.456078  1.422782  1.376471  1.089629  1.121309  1.697524
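
Each row of the result holds one score per kinase, so the usual next step is to pick the highest-scoring kinases for each site. The sketch below does that for row 0 using only the 20 kinases displayed above (the full row has 289 columns, so the true top hits may differ). It uses a plain dict to stay dependency-free.

```python
# Row 0 of the results, restricted to the kinases shown in the output above.
row0 = {
    "SRC": 0.991760, "EPHA3": 1.093712, "FES": 1.051750, "NTRK3": 1.067134,
    "ALK": 1.013682, "EPHA8": 1.097519, "ABL1": 0.966379, "FLT3": 0.982464,
    "EPHB2": 1.054986, "FYN": 1.055910, "MEK5": 1.314859, "PKN2": 1.635470,
    "MAP2K7": 1.652251, "MRCKB": 1.622672, "HIPK3": 1.362973, "CDK8": 1.797155,
    "BUB1": 1.305198, "MEKK3": 1.423618, "MAP2K3": 1.504941, "GRK1": 1.872020,
}

# Sort kinase names by score, highest first, and keep the top three.
top3 = sorted(row0, key=row0.get, reverse=True)[:3]
print(top3)
```

On the actual pandas dataframe, `results.iloc[0].nlargest(3)` should give the equivalent ranking over all kinases.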



Phosphorylation sites
Besides calculating sequence scores, we also provide multiple datasets
of phosphorylation sites.
CPTAC pan-cancer phosphoproteomics
df = Data.get_cptac_ensembl_site()
df.head(3)





                 gene   site         site_seq            protein  gene_name    gene_site           protein_site
0   ENSG00000003056.8   S267  DDQLGEESEERDDHL  ENSP00000000412.3       M6PR    M6PR_S267   ENSP00000000412_S267
1   ENSG00000003056.8   S267  DDQLGEESEERDDHL  ENSP00000440488.2       M6PR    M6PR_S267   ENSP00000440488_S267
2  ENSG00000048028.11  S1053  PPTIRPNSPYDLCSR  ENSP00000003302.4      USP28  USP28_S1053  ENSP00000003302_S1053
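
The `gene_site` and `protein_site` columns appear to be simple joins of the other columns: gene name plus site, and the Ensembl protein ID without its version suffix plus site. A minimal sketch of that pattern, inferred from the three rows shown and not taken from the katlas source (`make_site_ids` is a hypothetical name):

```python
def make_site_ids(gene_name, site, protein):
    """Build (gene_site, protein_site) identifiers from one row's fields."""
    protein_base = protein.split(".")[0]  # drop the Ensembl version suffix
    return f"{gene_name}_{site}", f"{protein_base}_{site}"


print(make_site_ids("M6PR", "S267", "ENSP00000000412.3"))
print(make_site_ids("USP28", "S1053", "ENSP00000003302.4"))
```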



Ochoa et al. human phosphoproteome
df = Data.get_ochoa_site()
df.head(3)





      uniprot  position  residue  is_disopred  disopred_score  log10_hotspot_pval_min  isHotspot  uniprot_position  functional_score  current_uniprot              name  gene                                           Sequence  is_valid         site_seq       gene_site
0  A0A075B6Q4        24        S         True            0.91                6.839384       True     A0A075B6Q4_24          0.149257       A0A075B6Q4  A0A075B6Q4_HUMAN  None  MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...      True  VDDEKGDSNDDYDSA  A0A075B6Q4_S24
1  A0A075B6Q4        35        S         True            0.87                9.192622      False     A0A075B6Q4_35          0.136966       A0A075B6Q4  A0A075B6Q4_HUMAN  None  MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...      True  YDSAGLLSDEDCMSV  A0A075B6Q4_S35
2  A0A075B6Q4        57        S        False            0.28                0.818834      False     A0A075B6Q4_57          0.125364       A0A075B6Q4  A0A075B6Q4_HUMAN  None  MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...      True  IADHLFWSEETKSRF  A0A075B6Q4_S57



PhosphoSitePlus human phosphorylation sites
df = Data.get_psp_human_site()
df.head(3)





    gene      protein  uniprot  site  gene_site  SITE_GRP_ID  species         site_seq  LT_LIT  MS_LIT  MS_CST  CST_CAT#  Ambiguous_Site
0  YWHAB  14-3-3 beta   P31946    T2   YWHAB_T2     15718712    human  ______MtMDksELV     NaN     3.0     1.0      None               0
1  YWHAB  14-3-3 beta   P31946    S6   YWHAB_S6     15718709    human  __MtMDksELVQkAk     NaN     8.0     NaN      None               0
2  YWHAB  14-3-3 beta   P31946   Y21  YWHAB_Y21      3426383    human  LAEQAERyDDMAAAM     NaN     NaN     4.0      None               0



Unique sites of combined Ochoa & PhosphoSitePlus
df = Data.get_combine_site_psp_ochoa()
df.head(3)





          site_seq   gene_site   gene  source  num_site  acceptor  -7  -6  -5  -4  ...  -2  -1   0   1   2   3   4   5   6   7
0  AAAAAAASGGAGSDN   PBX1_S136   PBX1   ochoa         1         S   A   A   A   A  ...   A   A   S   G   G   A   G   S   D   N
1  AAAAAAASGGGVSPD   PBX2_S146   PBX2   ochoa         1         S   A   A   A   A  ...   A   A   S   G   G   G   V   S   P   D
2  AAAAAAASGVTTGKP  CLASR_S349  CLASR   ochoa         1         S   A   A   A   A  ...   A   A   S   G   V   T   T   G   K   P



Phosphorylation site sequence example
All capital - 15 length (-7 to +7)

QSEEEKLSPSPTTED
TLQHVPDYRQNVYIP
TMGLSARYGPQFTLQ

All capital - 10 length (-5 to +4)

SRDPHYQDPH
LDNPDYQQDF
AAAAASGGAG

With lowercase - (-7 to +7)

QsEEEKLsPsPTTED
TLQHVPDyRQNVYIP
TMGLsARyGPQFTLQ

With lowercase - (-5 to +4)

sRDPHyQDPH
LDNPDyQQDF
AAAAAsGGAG
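
Site windows like the ones above can be cut from a full protein sequence. A minimal sketch, assuming 1-based site positions and `_` padding near the protein termini (the padding convention is read off the PhosphoSitePlus rows shown earlier, such as `______MtMDksELV`). `site_window` is a hypothetical helper, not a katlas function. The Ochoa sequence is used only up to the point where the table truncates it, and the 13-residue YWHAB prefix is reconstructed from the PhosphoSitePlus site windows.

```python
def site_window(protein_seq, position, left=7, right=7, pad="_"):
    """Return the residues from -left to +right around a 1-based position."""
    idx = position - 1
    if not 0 <= idx < len(protein_seq):
        raise ValueError("position is outside the protein sequence")
    padded = pad * left + protein_seq + pad * right
    # The residue at idx sits at idx + left in the padded string,
    # so the window starts at idx.
    return padded[idx: idx + left + right + 1]


# Visible prefix of the A0A075B6Q4 sequence (truncated in the table above).
ochoa_prefix = "MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT"
print(site_window(ochoa_prefix, 24))                    # 15-residue window, -7 to +7
print(site_window(ochoa_prefix, 28, left=5, right=4))   # 10-residue window, -5 to +4
print(site_window("MTMDKSELVQKAK", 2))                  # padded at the N-terminus
```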

License

For personal and professional use. You cannot resell or redistribute these repositories in their original state.
