gao bosc2010 musite


Upload: bosc-2010

Post on 11-Jun-2015


Page 1: Gao bosc2010 musite

Musite: Prediction of Protein Phosphorylation Sites

Jianjiong Gao
University of Missouri, Columbia

http://musite.sourceforge.net/

Page 2: Gao bosc2010 musite

Background: Protein Phosphorylation

Protein phosphorylation is one of the most important post-translational modifications.

It was estimated that up to 50% of proteins are phosphorylated in some cellular state.
Abnormality in phosphorylation is a cause or consequence of many diseases:

Cancer
Diabetes
Parkinson's
Hepatitis B
…

Page 3: Gao bosc2010 musite

Background: Protein Phosphorylation

Phosphorylation-dephosphorylation is a biochemical switch system regulating various cellular processes.
Catalyzed by various specific protein kinases.

[Diagram: Kinase → ON; Phosphatase → OFF]

Page 4: Gao bosc2010 musite

Phosphorylation Site Prediction: Problem Formulation

Phosphorylation site: a phosphorylated amino acid in a protein (determined by protein sequence)
General phosphorylation site prediction: to predict whether an amino acid can be phosphorylated
Kinase-specific phosphorylation site prediction: to predict whether an amino acid can be phosphorylated by a specific kinase
Based on protein sequence only
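
The formulation above treats every candidate residue as one yes/no prediction made from sequence alone. A minimal sketch of how candidates might be enumerated is shown here. The function name `candidate_sites`, the flank size, and the `-` padding character are illustrative assumptions, not part of Musite.

```python
from typing import List, Tuple


def candidate_sites(sequence: str, residues: str = "ST", flank: int = 6,
                    pad: str = "-") -> List[Tuple[int, str]]:
    """Return (1-based position, local window) for every candidate residue.

    Each window holds `flank` residues on both sides of the candidate;
    positions beyond the protein termini are filled with `pad`.
    """
    padded = pad * flank + sequence + pad * flank
    sites = []
    for i, aa in enumerate(sequence):
        if aa in residues:
            # Index i in the sequence is index i + flank in the padded string,
            # so the window starts at padded index i.
            sites.append((i + 1, padded[i:i + 2 * flank + 1]))
    return sites


# Each returned window is one input to a general or kinase-specific predictor.
print(candidate_sites("MASTK", residues="ST", flank=2))
```

Passing `residues="Y"` would enumerate tyrosine candidates instead of serine/threonine ones.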

Page 5: Gao bosc2010 musite

Limitations of Current Methods

Current prediction tools have limitations when applied to whole proteomes

Prediction accuracy could be improved
Most were released as web servers and restrict the data that users can upload
Training data were out of date
Stringency adjustment was not fully supported

Page 6: Gao bosc2010 musite

Our tool Musite is unique

Novel method with better accuracy
First open-source tool in the field that meets OSI Open Standards Requirement
Standalone program designed for proteome-scale prediction
Supports both general and kinase-specific phosphorylation site prediction
Supports customized model training
Supports continuous stringency adjustment

Page 7: Gao bosc2010 musite

Phosphorylation Site Prediction Flowchart

[Flowchart; recoverable node labels, in approximate order:]
Data collection from high-quality sources, such as UniProt/Swiss-Prot, Phospho.ELM, PhosphoPep, and PhosPhAt
Non-redundant datasets built by BLASTclust
Training data and control data
Training data: phosphorylation sites and non-phosphorylation sites
Feature extraction: KNN scores, disorder scores, amino acid frequencies
Features from positive set and features from negative set
Bootstrap sample 1 ... bootstrap sample m
Training: classifier 1 ... classifier m
Aggregating: phosphorylation prediction model
Specificity estimation on control data
Making predictions on new data
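
The flowchart estimates specificity on control data, which is what makes a stringency setting meaningful. A minimal sketch of one way to turn a target specificity into a score cutoff is shown here. It assumes the control scores all come from non-phosphorylation sites; the function names and the strict "score above cutoff" rule are illustrative choices, not taken from the slides.

```python
import math
from typing import Sequence


def threshold_for_specificity(control_scores: Sequence[float],
                              specificity: float) -> float:
    """Return a cutoff so that at least `specificity` of the control
    (negative) scores are at or below it. Sites scoring strictly above
    the cutoff are the ones reported as predicted phosphosites."""
    if not control_scores:
        raise ValueError("control_scores must not be empty")
    if not 0.0 <= specificity <= 1.0:
        raise ValueError("specificity must be between 0 and 1")
    ranked = sorted(control_scores)
    # Number of control sites that must be rejected; the small epsilon
    # guards against floating-point overshoot in the product.
    k = math.ceil(specificity * len(ranked) - 1e-9)
    if k <= 0:
        return float("-inf")
    return ranked[k - 1]


def estimated_specificity(control_scores: Sequence[float],
                          threshold: float) -> float:
    """Fraction of control scores that would be rejected at `threshold`."""
    return sum(s <= threshold for s in control_scores) / len(control_scores)


control = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
cutoff = threshold_for_specificity(control, 0.75)
print(cutoff, estimated_specificity(control, cutoff))
```

Raising the target specificity moves the cutoff up, so fewer sites are reported but with fewer false positives.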

Page 8: Gao bosc2010 musite

Phosphorylation Site Prediction: Data Extraction

[Same flowchart as on Page 7]

Page 9: Gao bosc2010 musite

Phosphorylation Site Prediction: Feature Extraction

[Same flowchart as on Page 7]

Page 10: Gao bosc2010 musite

Phosphorylation Site Prediction: Feature Extraction

[Same flowchart as on Page 7]

Page 11: Gao bosc2010 musite

KNN Features: Motivation

Rationale of using KNN features: local sequence clusters exist around phosphorylation sites, since

Each phosphorylation site is a substrate of a specific protein kinase
Substrates of the same kinase or kinase family usually share similar patterns in local sequences
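
The clustering rationale above can be illustrated with a small nearest-neighbor score: the fraction of a window's closest labeled neighbors that are known phosphosites. This is only a sketch. The mismatch-fraction distance is a simple stand-in chosen for brevity, and the slides do not specify which sequence distance Musite uses.

```python
from typing import Sequence, Tuple


def mismatch_distance(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length windows differ
    (a stand-in distance, not necessarily the one used by Musite)."""
    if len(a) != len(b):
        raise ValueError("windows must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def knn_score(query: str, labeled_windows: Sequence[Tuple[str, bool]],
              k: int) -> float:
    """Fraction of the k nearest labeled windows that are phosphosites."""
    if not 1 <= k <= len(labeled_windows):
        raise ValueError("k must be between 1 and the number of windows")
    ranked = sorted(labeled_windows,
                    key=lambda item: mismatch_distance(query, item[0]))
    return sum(1 for _, is_phospho in ranked[:k] if is_phospho) / k


# Toy labeled windows: True marks a known phosphosite at the center.
training = [("RRASV", True), ("RRKSL", True),
            ("GGASG", False), ("LLGSL", False)]
print(knn_score("RRASL", training, k=2))
print(knn_score("GGGSG", training, k=2))
```

A window that resembles known substrates of some kinase lands near other positives and scores close to 1, while a window with no such pattern scores near the background rate.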

Page 12: Gao bosc2010 musite

KNN Features: Result

Overall, phosphosites have larger KNN scores than non-phosphosites
Average KNN scores: 0.7~0.8 for phosphosites, ≈0.5 for non-phosphosites

[Figure: Boxplot of KNN features (Human Ser/Thr), phospho vs. non-phospho. x-axis: size of nearest neighbors (% of sample size): 0.25, 0.5, 1, 2, 4; y-axis: KNN score]

Page 13: Gao bosc2010 musite

Disorder Features: Concept & Rationale

Disordered region (structure)
Some parts of a protein have a rigid structure, such as α-helix and β-sheet.
Other parts, disordered regions, do not have well-defined conformations
The conformational flexibility of disordered regions may facilitate protein phosphorylation
[Dunker, 2008]: protein phosphorylation sites are frequently located within disordered regions
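
One way such a rationale could become a numeric feature is to average per-residue disorder scores around each candidate site. The sketch below assumes the per-residue scores (0 to 1) have already been produced by an external disorder predictor. The windowed average and the function name are illustrative assumptions, since the slides do not say how the disorder feature is computed.

```python
from typing import Sequence


def window_mean_disorder(disorder: Sequence[float], position: int,
                         flank: int) -> float:
    """Mean disorder score within `flank` residues of a 1-based position.

    The window is truncated at the protein termini rather than padded.
    """
    if not 1 <= position <= len(disorder):
        raise ValueError("position is outside the protein")
    start = max(0, position - 1 - flank)
    end = min(len(disorder), position + flank)
    window = disorder[start:end]
    return sum(window) / len(window)


# Hypothetical per-residue disorder scores for a five-residue protein.
scores = [0.1, 0.2, 0.9, 0.8, 0.7]
print(window_mean_disorder(scores, position=3, flank=1))
```

Using several flank sizes would give several disorder features per site, each describing flexibility at a different scale.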

Page 14: Gao bosc2010 musite

Disorder Features: Result

For phosphosites: occurrence increases exponentially when disorder score increases
For non-phosphosites: significantly different distribution
Disorder score > 0.5: phosphosites ~91%, non-phosphosites ~55%
Phosphosites are significantly over-represented in disordered regions

[Figure: Histogram of disorder features (Human Ser/Thr). (A) Phospho-S/T in H. sapiens; (B) Non-phospho-S/T in H. sapiens. x-axis: disorder score; y-axis: occurrence]

Page 15: Gao bosc2010 musite

Amino Acid Frequencies: Result

[Figure: log2(ratio of frequency) per amino acid, in the order P R D E S K G A Q N V T H L M I F Y W C, for H. sapiens (S/T), M. musculus (S/T), D. melanogaster (S/T), C. elegans (S/T), S. cerevisiae (S/T), and A. thaliana (S/T)]

P, R, D, E, S, K, and G are enriched around phosphosites
C, W, Y, F, I, M, L, H, T, and V are depleted
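
The enrichment and depletion reported above are log2 ratios of amino acid frequencies around phosphosites versus non-phosphosites. A minimal sketch of that calculation follows. Excluding the central residue and adding a pseudocount are assumptions made here to keep the example well defined; the slides do not give these details.

```python
import math
from collections import Counter
from typing import Dict, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def residue_frequencies(windows: Sequence[str],
                        pseudocount: float = 1.0) -> Dict[str, float]:
    """Smoothed amino acid frequencies in the flanks of the given windows
    (the central residue of each window is skipped)."""
    counts: Counter = Counter()
    for window in windows:
        center = len(window) // 2
        for i, aa in enumerate(window):
            if i != center and aa in AMINO_ACIDS:
                counts[aa] += 1
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def log2_frequency_ratio(pos_windows: Sequence[str],
                         neg_windows: Sequence[str],
                         pseudocount: float = 1.0) -> Dict[str, float]:
    """log2(frequency around phosphosites / frequency around non-phosphosites)."""
    pos = residue_frequencies(pos_windows, pseudocount)
    neg = residue_frequencies(neg_windows, pseudocount)
    return {aa: math.log2(pos[aa] / neg[aa]) for aa in AMINO_ACIDS}


# Toy windows; a positive value means enriched around phosphosites.
ratios = log2_frequency_ratio(["PPSPP", "RPSPR"], ["LLSLL", "LCSCL"])
print(round(ratios["P"], 2), round(ratios["L"], 2))
```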

Page 16: Gao bosc2010 musite

Phosphorylation Site Prediction: Classifier Training

[Same flowchart as on Page 7]
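
The training stage in the flowchart draws m bootstrap samples, trains one classifier per sample, and aggregates their outputs. A minimal sketch of that bootstrap aggregation follows. The base learner here is a deliberately simple nearest-centroid scorer used as a stand-in; the slides do not name the classifier Musite trains, and resampling each class separately is also an assumption.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = Sequence[float]


def centroid(rows: Sequence[Vector]) -> List[float]:
    """Column-wise mean of a set of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train_centroid_scorer(pos: Sequence[Vector],
                          neg: Sequence[Vector]) -> Callable[[Vector], float]:
    """Stand-in base classifier: score in [0, 1], higher when the input is
    closer to the positive centroid than to the negative one."""
    c_pos, c_neg = centroid(pos), centroid(neg)

    def score(x: Vector) -> float:
        d_pos, d_neg = math.dist(x, c_pos), math.dist(x, c_neg)
        return 0.5 if d_pos + d_neg == 0 else d_neg / (d_pos + d_neg)

    return score


def train_bagged_model(pos: Sequence[Vector], neg: Sequence[Vector],
                       m: int = 10, seed: int = 0) -> Callable[[Vector], float]:
    """Train m scorers on bootstrap samples and average their scores."""
    rng = random.Random(seed)
    scorers = []
    for _ in range(m):
        # Sample each class with replacement to form one bootstrap sample.
        boot_pos = rng.choices(pos, k=len(pos))
        boot_neg = rng.choices(neg, k=len(neg))
        scorers.append(train_centroid_scorer(boot_pos, boot_neg))

    def predict(x: Vector) -> float:
        return sum(s(x) for s in scorers) / len(scorers)

    return predict


# Toy feature vectors, e.g. (KNN score, disorder score) per site.
positives = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.9]]
negatives = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.2]]
model = train_bagged_model(positives, negatives, m=5, seed=1)
print(model([0.85, 0.85]) > 0.5, model([0.15, 0.15]) < 0.5)
```

Averaging over bootstrap samples yields a continuous score, which fits the continuous stringency adjustment mentioned earlier in the talk.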

Page 17: Gao bosc2010 musite

Results: Trained Models

General Prediction
Human ser/thr
Human tyr
Mouse ser/thr
Mouse tyr
Fruit fly ser/thr
Worm ser/thr
Yeast ser/thr
Arabidopsis ser/thr

Kinase-Specific Prediction
ATM
CDK/CDK1/CDK2
CK1/CK2
MAPK1/MAPK3
PKA
PKB
PKC
Src

Page 18: Gao bosc2010 musite

Results: Cross validation

[Figure: ROC curves (sensitivity vs. 1 - specificity), with a zoomed panel for 1 - specificity from 0 to 0.1, for C. elegans (S/T), A. thaliana (S/T), H. sapiens (S/T), M. musculus (S/T), S. cerevisiae (S/T), D. melanogaster (S/T), M. musculus (Y), H. sapiens (Y), and random guess]
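
Curves of sensitivity against 1 - specificity, like those in this figure, can be traced from held-out prediction scores by sweeping a threshold. A minimal sketch follows, assuming one list of scores for true phosphosites and one for non-phosphosites; the function names are illustrative, and a site is counted as predicted when its score is at or above the threshold.

```python
from typing import List, Sequence, Tuple


def roc_points(pos_scores: Sequence[float],
               neg_scores: Sequence[float]) -> List[Tuple[float, float]]:
    """(1 - specificity, sensitivity) pairs, one per distinct threshold,
    ordered from the strictest threshold to the most permissive."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        true_pos = sum(s >= t for s in pos_scores)
        false_pos = sum(s >= t for s in neg_scores)
        points.append((false_pos / len(neg_scores),
                       true_pos / len(pos_scores)))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy held-out scores for three phosphosites and three non-phosphosites.
points = roc_points([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(points)
print(area_under_curve(points))
```

A random-guess predictor would trace the diagonal with an area of 0.5, which is the baseline curve named in the figure legend.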

Page 19: Gao bosc2010 musite

Results: Comparison to other tools

[Figure: ROC curves (sensitivity vs. 1 - specificity), with a zoomed panel for 1 - specificity from 0 to 0.1, for Musite, Scan-x, DISPHOS, and NetPhos]

Page 20: Gao bosc2010 musite

Phosphorylation Site Prediction Software Implementation: Musite

Open Source
License: GNU General Public License (GPL)
http://musite.sourceforge.net/

Stand-alone application
Based on Java
Supports Windows, Linux, and Mac OS X

A web server is also being developed
http://musite.net/

Page 21: Gao bosc2010 musite

Implementation: User Interface

Page 22: Gao bosc2010 musite

Implementation: Customized Model Training

A unique utility for users to train prediction models from their own data

Take advantage of latest data
Train disease-specific models
Train organ-specific models
Integrate into experimental procedure in an iterative way

Page 23: Gao bosc2010 musite

Summary

Musite predicts general and kinase-specific phosphosites with better accuracy

Musite is an open-source standalone program capable of performing proteome-wide predictions

Page 24: Gao bosc2010 musite

Acknowledgements

Dr. Dong Xu (University of Missouri)
Dr. Jay Thelen (University of Missouri)
Dr. Keith Dunker (Indiana University)
Curtis Bollinger (University of Missouri)

Funding
NSF [# DBI-0604439]
NIH [# R21/R33 GM078601]

Visit us at
http://musite.sourceforge.net
http://musite.net
Poster R09 at ISMB