gao bosc2010 musite

Post on 11-Jun-2015


Musite: Prediction of Protein Phosphorylation Sites

Jianjiong Gao, University of Missouri, Columbia

http://musite.sourceforge.net/

Background: Protein Phosphorylation

Protein phosphorylation is one of the most important post-translational modifications.

It has been estimated that up to 50% of proteins are phosphorylated in some cellular state.
Abnormal phosphorylation is a cause or consequence of many diseases:

Cancer
Diabetes
Parkinson's
Hepatitis B
…

Background: Protein Phosphorylation

Phosphorylation-dephosphorylation is a biochemical switch system that regulates various cellular processes.
Phosphorylation is catalyzed by various specific protein kinases.

[Diagram labels: Kinase, ON; Phosphatase, OFF]

Phosphorylation Site Prediction: Problem Formulation

Phosphorylation site: a phosphorylated amino acid in a protein (determined by protein sequence).
General phosphorylation site prediction: predict whether an amino acid can be phosphorylated.
Kinase-specific phosphorylation site prediction: predict whether an amino acid can be phosphorylated by a specific kinase.
Both predictions are based on protein sequence only.
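To make the formulation concrete, the sketch below enumerates candidate residues in a sequence together with their local sequence windows, which is the input a sequence-only predictor would score. It is an illustration only: the function name `candidate_sites`, the flank length of 6, and the `*` padding character are assumptions, not details taken from Musite.

```python
from typing import Iterator, Tuple


def candidate_sites(sequence: str, residues: str = "ST", flank: int = 6,
                    pad: str = "*") -> Iterator[Tuple[int, str]]:
    """Yield (1-based position, local window) for every candidate residue.

    The window holds `flank` residues on each side of the candidate.
    Positions beyond the protein termini are filled with `pad`.
    """
    seq = sequence.upper()
    padded = pad * flank + seq + pad * flank
    for index, amino_acid in enumerate(seq):
        if amino_acid in residues:
            # index in `seq` corresponds to index + flank in `padded`,
            # so the window starts at `index` in the padded string.
            yield index + 1, padded[index:index + 2 * flank + 1]


if __name__ == "__main__":
    for position, window in candidate_sites("MKRASPTEEL", residues="ST", flank=3):
        print(position, window)
```

General prediction would score every window produced this way; kinase-specific prediction would score the same windows with a model trained on substrates of a single kinase.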

Limitations of Current Methods

Current prediction tools have limitations when applied to whole proteomes:

Prediction accuracy could be improved.
Most were released as web servers and restrict the data that users can upload.
Training data are out of date.
Stringency adjustment is not fully supported.

Our tool Musite is unique

Novel method with better accuracy
First open-source tool in the field that meets the OSI Open Standards Requirement
Standalone program designed for proteome-scale prediction
Supports both general and kinase-specific phosphorylation site prediction
Supports customized model training
Supports continuous stringency adjustment

Phosphorylation Site Prediction: Flowchart

[Flowchart; the boxes recoverable from the slide are listed below.]

Data collection from high-quality sources, such as UniProt/Swiss-Prot, Phospho.ELM, PhosphoPep, and PhosPhAt
Non-redundant datasets built by BLASTclust
Training data: phosphorylation sites and non-phosphorylation sites
Feature extraction: KNN scores, disorder scores, and amino acid frequencies (features from the positive set and features from the negative set)
Bootstrap: bootstrap sample 1 ... bootstrap sample m
Training: classifier 1 ... classifier m
Aggregating
Specificity estimation on control data
Phosphorylation prediction model
Making predictions on new data
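The bootstrap, training, and aggregating boxes of the flowchart can be sketched as follows. This is a minimal illustration under stated assumptions: the slides do not name the base classifier or the aggregation rule, so a nearest-centroid scorer stands in for the real classifier and the m scores are simply averaged. All function names here are hypothetical.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence

Vector = Sequence[float]
Scorer = Callable[[Vector], float]


def euclidean(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def train_centroid_scorer(pos: List[Vector], neg: List[Vector]) -> Scorer:
    """Stand-in base classifier (an assumption, not Musite's classifier).

    Scores a feature vector by its relative distance to the class centroids:
    close to the positive centroid gives a score near 1.
    """
    pos_centroid = [mean(column) for column in zip(*pos)]
    neg_centroid = [mean(column) for column in zip(*neg)]

    def score(x: Vector) -> float:
        d_pos = euclidean(x, pos_centroid)
        d_neg = euclidean(x, neg_centroid)
        total = d_pos + d_neg
        return 0.5 if total == 0 else d_neg / total

    return score


def train_bagged_model(pos: List[Vector], neg: List[Vector],
                       m: int = 5, seed: int = 0) -> Scorer:
    """Train m classifiers on bootstrap samples and average their scores."""
    rng = random.Random(seed)
    size = min(len(pos), len(neg))
    scorers = []
    for _ in range(m):
        # Sampling with replacement gives one balanced bootstrap sample.
        pos_sample = rng.choices(pos, k=size)
        neg_sample = rng.choices(neg, k=size)
        scorers.append(train_centroid_scorer(pos_sample, neg_sample))
    return lambda x: mean(scorer(x) for scorer in scorers)


if __name__ == "__main__":
    positives = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.95]]
    negatives = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.15]]
    model = train_bagged_model(positives, negatives, m=5, seed=1)
    print(model([0.85, 0.85]), model([0.15, 0.15]))
```

Drawing equally sized positive and negative samples in each round is one common way to handle the much larger pool of non-phosphorylation sites; whether Musite balances its samples this way is not stated on the slide.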

Phosphorylation Site Prediction: Data Extraction

[The flowchart is repeated on this slide, which covers the data extraction step.]

Phosphorylation Site Prediction: Feature Extraction

[The flowchart is repeated on this slide, which covers the feature extraction step.]


KNN Features: Motivation

Rationale for using KNN features: local sequence clusters exist around phosphorylation sites, because

Each phosphorylation site is a substrate of a specific protein kinase.
Substrates of the same kinase or kinase family usually share similar patterns in their local sequences.
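One way to turn this rationale into a feature is to look up the nearest neighbors of a query window among known phosphosite and non-phosphosite windows and report the fraction that are phosphosites. The sketch below assumes a plain mismatch count (Hamming distance) as the sequence distance and windows of equal length; the distance actually used by Musite is not given on the slide, and `knn_score` is a hypothetical name.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length windows."""
    return sum(x != y for x, y in zip(a, b))


def knn_score(query: str, labeled: List[Tuple[str, bool]],
              fraction: float = 0.01) -> float:
    """Fraction of phosphosites among the nearest neighbors of `query`.

    `labeled` holds (window, is_phosphosite) pairs. The neighborhood size
    is given as a fraction of the sample size, with a minimum of one.
    """
    k = max(1, round(fraction * len(labeled)))
    nearest = sorted(labeled, key=lambda item: hamming(query, item[0]))[:k]
    return sum(is_positive for _, is_positive in nearest) / k


if __name__ == "__main__":
    reference = [
        ("RRASPEK", True), ("RRNSPEK", True), ("KRASLEE", True),
        ("GLLSVAC", False), ("FILSWCV", False), ("MLVSICF", False),
    ]
    print(knn_score("RRASPEE", reference, fraction=0.5))
    print(knn_score("GLLSICF", reference, fraction=0.5))
```

With a balanced reference set, a window unrelated to any cluster would be expected to score around 0.5, while a window resembling known substrates scores higher.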

KNN Features: Result

Overall, phosphosites have larger KNN scores than non-phosphosites.

Average KNN scores:
0.7~0.8 for phosphosites
≈0.5 for non-phosphosites

[Figure (A): boxplot of KNN features (Human Ser/Thr), phospho vs. non-phospho. x-axis: size of nearest neighbors (% of sample size): 0.25, 0.5, 1, 2, 4. y-axis: KNN score.]

Disorder Features: Concept & Rationale

Disordered region (structure)
Some parts of a protein have a rigid structure, such as α-helix and β-sheet.
Other parts, the disordered regions, do not have well-defined conformations.
The conformational flexibility of disordered regions may facilitate protein phosphorylation.
[Dunker, 2008]: protein phosphorylation sites are frequently located within disordered regions.

Disorder Features: Result

For phosphosites: occurrence increases exponentially as the disorder score increases.
For non-phosphosites: significantly different distribution.

Disorder score > 0.5:
Phosphosites: ~91%
Non-phosphosites: ~55%

Phosphosites are significantly over-represented in disordered regions.

[Figure: histograms of disorder features (Human Ser/Thr). (A) Phospho-S/T in H. sapiens. (B) Non-phospho-S/T in H. sapiens. x-axis: disorder score, 0 to 1. y-axis: occurrence.]

Amino Acid Frequencies: Result

[Figure: log2(ratio of frequency) per amino acid, in the order P R D E S K G A Q N V T H L M I F Y W C, for H. sapiens (S/T), M. musculus (S/T), D. melanogaster (S/T), C. elegans (S/T), S. cerevisiae (S/T), and A. thaliana (S/T).]

P, R, D, E, S, K, and G are enriched around phosphosites.
C, W, Y, F, I, M, L, H, T, and V are depleted.
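The enrichment and depletion pattern can be computed along the following lines. This sketch assumes the ratio compares residue frequencies in windows around phosphosites with those around non-phosphosites, excluding the central residue; the exact window size and background used for the figure are not given on the slide, and the pseudocount is an added safeguard against division by zero.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def residue_frequencies(windows: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each flanking residue (central residue excluded)."""
    counts: Counter = Counter()
    for window in windows:
        center = len(window) // 2
        counts.update(window[:center] + window[center + 1:])
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}


def log2_frequency_ratio(pos_windows: Iterable[str], neg_windows: Iterable[str],
                         pseudocount: float = 1e-6) -> Dict[str, float]:
    """log2 of (frequency around phosphosites / frequency around non-phosphosites).

    Positive values mean enrichment around phosphosites, negative values depletion.
    """
    pos_freq = residue_frequencies(pos_windows)
    neg_freq = residue_frequencies(neg_windows)
    residues = sorted(set(pos_freq) | set(neg_freq))
    return {
        residue: math.log2((pos_freq.get(residue, 0.0) + pseudocount)
                           / (neg_freq.get(residue, 0.0) + pseudocount))
        for residue in residues
    }


if __name__ == "__main__":
    ratios = log2_frequency_ratio(["PRSPE", "RPSDE"], ["LISCW", "VLSFM"])
    for residue, value in ratios.items():
        print(residue, round(value, 2))
```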

Phosphorylation Site Prediction: Classifier Training

[The flowchart is repeated on this slide, which covers the classifier training step.]

Results: Trained Models

General prediction:
Human ser/thr
Human tyr
Mouse ser/thr
Mouse tyr
Fruit fly ser/thr
Worm ser/thr
Yeast ser/thr
Arabidopsis ser/thr

Kinase-specific prediction:
ATM
CDK/CDK1/CDK2
CK1/CK2
MAPK1/MAPK3
PKA
PKB
PKC
Src

Results: Cross-Validation

[Figure: ROC curves (sensitivity vs. 1 - specificity) for C. elegans (S/T), A. thaliana (S/T), H. sapiens (S/T), M. musculus (S/T), S. cerevisiae (S/T), D. melanogaster (S/T), M. musculus (Y), H. sapiens (Y), and random guess. A second set of axis ticks covers 1 - specificity from 0 to 0.1.]
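Each point on such a curve pairs a score threshold with the specificity and sensitivity it yields, which is also how a stringency setting can be mapped to a cutoff. The sketch below is a hedged illustration rather than Musite's procedure: it picks the threshold from scores of control (non-phosphorylated) sites so that a requested specificity is met, then measures sensitivity on known phosphosites. All names and the example scores are hypothetical.

```python
import math
from typing import Sequence


def threshold_for_specificity(control_scores: Sequence[float],
                              specificity: float) -> float:
    """Smallest control score t such that at least `specificity` of the
    control scores are <= t. A site is predicted positive when score > t.
    """
    ranked = sorted(control_scores)
    needed = math.ceil(specificity * len(ranked))
    return -math.inf if needed == 0 else ranked[needed - 1]


def observed_specificity(control_scores: Sequence[float], threshold: float) -> float:
    """Fraction of control sites correctly predicted as negative."""
    return sum(score <= threshold for score in control_scores) / len(control_scores)


def observed_sensitivity(positive_scores: Sequence[float], threshold: float) -> float:
    """Fraction of known phosphosites predicted as positive."""
    return sum(score > threshold for score in positive_scores) / len(positive_scores)


if __name__ == "__main__":
    control = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.6, 0.9]
    positives = [0.3, 0.7, 0.8, 0.95]
    cutoff = threshold_for_specificity(control, 0.9)
    print(cutoff,
          observed_specificity(control, cutoff),
          observed_sensitivity(positives, cutoff))
```

Sweeping the requested specificity from 0 to 1 and recording the resulting sensitivity traces out the ROC curve; raising the specificity corresponds to a more stringent prediction.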

Results: Comparison to Other Tools

[Figure: ROC curves (sensitivity vs. 1 - specificity) for Musite, Scan-x, DISPHOS, and NetPhos. A second set of axis ticks covers 1 - specificity from 0 to 0.1.]

Phosphorylation Site Prediction: Software Implementation - Musite

Open source
License: GNU General Public License (GPL)
http://musite.sourceforge.net/

Stand-alone application
Based on Java
Supports Windows, Linux, and Mac OS X

A web server is also being developed:
http://musite.net/

Implementation: User Interface

Implementation: Customized Model Training

A unique utility for users to train prediction models from their own data:

Take advantage of the latest data
Train disease-specific models
Train organ-specific models
Integrate into the experimental procedure in an iterative way

Summary

Musite predicts general and kinase-specific phosphosites with better accuracy.

Musite is an open-source standalone program capable of performing proteome-wide predictions.

Acknowledgements

Dr. Dong Xu (University of Missouri)
Dr. Jay Thelen (University of Missouri)
Dr. Keith Dunker (Indiana University)
Curtis Bollinger (University of Missouri)

Funding
NSF [# DBI-0604439]
NIH [# R21/R33 GM078601]

Visit us at
http://musite.sourceforge.net
http://musite.net
Poster R09 at ISMB
