conditional graphical models for protein structure prediction

48
Carnegie Mellon School of Computer Science 1 Conditional Graphical Models for Protein Structure Prediction Yan Liu Language Technologies Institute Carnegie Mellon University Thesis Committee Jaime Carbonell (Chair) John Lafferty Eric P. Xing Vanathi Gopalakrishnan (Univ. of Pittsburgh) PhD Thesis Proposal

Upload: ryu

Post on 16-Jan-2016

43 views

Category:

Documents


0 download

DESCRIPTION

PhD Thesis Proposal. Conditional Graphical Models for Protein Structure Prediction. Yan Liu Language Technologies Institute Carnegie Mellon University. Thesis Committee Jaime Carbonell (Chair) John Lafferty Eric P. Xing Vanathi Gopalakrishnan (Univ. of Pittsburgh). - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

1

Conditional Graphical Models for Protein Structure Prediction

Yan LiuLanguage Technologies Institute

Carnegie Mellon University

Thesis Committee

Jaime Carbonell (Chair)John Lafferty Eric P. Xing

Vanathi Gopalakrishnan (Univ. of Pittsburgh)

PhD Thesis Proposal

Page 2: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

2

Proteins in Our LifeNobelprize.org

Predict protein structures from

sequences

Page 3: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

3

Protein Structure Hierarchy

• Our focus: predicting the topology of the structures

Page 4: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

4

Previous work

• General approaches for structural motif recognition Sequence similarity searches, e.g. PSI-BLAST [Altschul et al, 1997]

Profile HMM, .e.g. HMMER [Durbin et al, 1998] and SAM [Karplus et al, 1998]

Homology modeling or threading, e.g. Threader [Jones, 1998]

Window-based methods, e.g. PSI_pred [Jones, 2001]

• Methods of careful design for specific structure motifs Example: αα- and ββ- hairpins, β-turn and β-helix

- Structural similarity without clear sequence similarity- Long-range interactions

- Hard to generalize

Page 5: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

5

Missing pieces

• Informative features without clear interpretation

• No principled models to formulate the structured properties of proteins

Discriminative models

Graphical models

Conditional graphical models for protein structure prediction

Page 6: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

6

Conditional Graphical Models (CGM)

• Structural prediction is reduced to segmentation and labeling problem

• Define segment semantics structural modules W = (M, {Wi}), M: # of segments, Wi: configuration of segment i

• The conditional probability of a possible segmentation W given the observation x is defined as

1

1( | ) exp( ( , ))

G

K

k k cc C k

fP W wZ

x x

Page 7: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

7

Advantages of Conditional Graphical Models

• Flexible feature definition and kernelization

• Different loss functions and regularizers

• Convex functions for globally optimal solutions

• Structured information, especially the long-range interactions

Page 8: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

8

Training and Testing

• Training phase : learn the model parameters

Minimizing regularized log loss

Seek the direction whose empirical values agree with the expectation

Iterative searching algorithms have to be applied

• Testing phase: search the segmentation that maximizes P(w|x)

1

( , ) log ( )G

K

k k cc C k

L f W Z

x

( | )( ( , ) [ ( , )]) ( ) 0G

k c p W x k cc Ck

Lf w E f W

x x

Page 9: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

9

Thesis WorkConditional graphical models for protein structure prediction

Page 10: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

10

Model Roadmap

Conditional random fields

Kernel CRFs

Segmentation CRFs

Chain graph model

Factorial segmentation CRFs

Kernelized

Segmentation

Locally normalized

Factorized

Page 11: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

11

OutlineConditional graphical models for protein structure prediction

Page 12: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

12

Task Definition and Evaluation Materials

• Given a protein sequence, predict its secondary structure assignments Three class: helix (C), sheets (E) and coil (C)

APAFSVSPASGACGPECA

CCEEEEECCCCCHHHCCC

Protein Secondary Structure Prediction

Page 13: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

13

CGM on Secondary Structure Prediction

[Liu, Carbonell, Klein-Seetharaman and Gopalakrishnan, Bioinformatics 2004]

• Structural component Individual residues

• Segmentation definition W = (n, Y), n: # of residues, Yi = H, E

or C

• Specific models Conditional random fields (CRFs)

Kernel conditional random fields (kCRFs)

• where

Protein Secondary Structure Prediction

11 1

1( | ) exp( ( , , , ))

N K

k k i ii k

P f i y yZ

y x x

*1( | ) exp( ( , , ))

G

cc C

P y f c yZ

x x* ( )

1.. | |

( , ( , , ))c

cG cj

jy c

j L c C y Y

f K c y

x

Page 14: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

14

Target Directions

Protein Secondary Structure Prediction

Input Features(PSI-BLAST profile)

Prediction Combination(CRFs)

Feature Exploration(KCRFs)

Learning Algorithm(SVM)

Beta-sheet Detection

Page 15: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

15

Prediction Combination

• Previous work Window-based label combination

• Rule-based algorithm Window-based score combination

• SVM, Neural networks

• Other graphical models for score combination Maximum entropy Markov model (MEMM)

Higher-order MEMMs (HOMEMM)

Pseudo state duration MEMMs (PSMEMM)

Protein Secondary Structure Prediction

MEMM

HOMEMM

Page 16: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

16

Experiment Results

Protein Secondary Structure Prediction - Prediction Combination

• Graphical models are consistently better than the window-based approaches• CRFs perform the best among the four graphical models

Page 17: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

17

Experiment Results

• Offset from the correct register for the correctly predicted beta-strand pairs (CB513)

Protein Secondary Structure Prediction

- Beta-sheet detection

Considerable improvement in prediction performance

Prediction accuracy for beta-sheets

Page 18: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

18

Experiment Results

• Prediction accuracy for KCRFs with vertex cliques (V) and edge cliques (E) PSI-BLAST profile with RBF kernel

Protein Secondary Structure Prediction

- Feature exploration

Page 19: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

19

Summary• Conditional graphical model for protein secondary structure

prediction Conditional random fields (CRFs) Kernel conditional random fields (KCRFs)

Protein Secondary Structure Prediction

Page 20: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

20

OutlineConditional graphical models for protein structure prediction

Page 21: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

21

Task Definition and Evaluation Materials

..APAFSVSPASGACGPECA..

Contains the structural motif?

..NNEEEEECCCCCHHHCCC..

Structural motif recognition

• Structural motif Regular arrangement of secondary structural elements Super-secondary structure, or protein fold

Yes

Page 22: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

22

CGM for Structural Motif Recognition

• Structural component Secondary structure elements

• Protein structural graph Nodes for the states of secondary structural elements of unknown lengths Edges for interactions between nodes in 3-D

• Example: β-α-β motif

• Tradeoff between fidelity of the model and graph complexity

Structural Motif Recognition

Page 23: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

23

CGM for Structural Motif Recognition

• Segmentation definition Yi: state of segment i

Si: the length of segment iM: number of segments

• Segmentation conditional random fields (SCRFs) For any graph, we have

For a simplified graph, we have

1 1

1( | ) exp( ( , , ))

i

M K

k k ii k

P W f W WZ

x x

1

1( | ) exp( ( , ))

G

K

k k cc C k

P W f wZ

x x

Structural Motif Recognition

Page 24: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

24

Protein Folds with Structural Repeats

• Prevalent in proteins and important in functions

• Each repeats consists of structural motifs and insertions

• Challenge Low sequence similarity in

structural motifs Long-range interactions

Structural Motif Recognition

Page 25: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

25

CGM for Structural Motif Recognition

• Chain graph A graph consisting of directed and

undirected graphs Given a variable set V that forms

multiple subgraphs U

• Segmentation defintion Two layer segmentation W = {M, {Ξi},

T}M: # of envelops Ξi: the segmentation of envelops

Ti: the state of envelop i = repeat, non-repeat

• Chain graph model

( ) ( | parents( ))u U

P V P u u

Structural Motif Recognition

1 1 11

( | ) ( ,{ }, | ) ( ) ( | , , ) ( | , , )M

i i i i i i ii

P W P M T P M P T x T P T

x x x

Page 26: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

26

Experiments

• Right-handed β-helix fold An elongated helix-like structures whose

successive rungs composed of three parallel β-strands (B1, B2, B3 strands)

T2 turn: a unique two-residue turn Low sequence identity

• Leucine-rich repeats (LLR) Solenoid-like regular arrangement of beta-

strands and an alpha-helix, connected by coils

Relatively high sequence identity (many Leucines)

Structural Motif Recognition

Page 27: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

27

Experiment Results

• Cross-family validation for classifying β-helix proteins SCRFs can score all known β-helices higher than non β-helices

Structural Motif Recognition SCRFs model

Page 28: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

28

Experiment Results

• Predicted Segmentation for known Beta-helices

Structural Motif Recognition SCRFs model

Page 29: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

29

Experiment Results

• Histograms of scores for known β-helices against PDB-minus dataset 18 non β-helix proteins have a

score higher than 0

13 from β-class and 5 from α/β class

Most confusing proteins: β-sandwiches and left-handed β-helix

5

Structural Motif Recognition SCRFs model

Page 30: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

30

Experiment Results

• Verification on recently crystallized structures

Successfully identify 1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase

• score 10.47

1PXZ: Jun A 1, The Major Allergen From Cedar Pollen• score 32.35

GP14 of Shigella bacteriophage as a β-helix protein with scoring 15.63

Structural Motif Recognition SCRFs model

Page 31: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

31

Experiment Results

• Cross-family validation for classifying β-helix proteins Chain graph model can score all known β-helices higher than non

β-helices

Structural Motif Recognition Chain graph model

Page 32: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

32

Experiment Results

• Cross-family validation for classifying LLR proteins Chain graph model can score all known LLR higher than non-LLR

Structural Motif Recognition Chain graph model

Page 33: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

33

Experiment Results

• Predicted Segmentation for known Beta-helices and LLRs

Structural Motif Recognition Chain graph model

Page 34: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

34

Further Experiments

• Virus proteins Noncellular biological entity that

can reproduce only within a host cell

Dynamical properties

Various kinds of viruses• Adenovirus (common cold)• Bacteriophage (which infects

bacteria)• DNA virus, RNA virus

Structural Motif Recognition

Gp 41: core protein of HIV virus (1AIK)

Page 35: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

35

Summary• Segmented conditional graphical models for structural

motif recognition Segmentation conditional random fields Chain graph model

• Successful applications Right-handed beta-helices Leucine-rich repeats

• Further verfication Virus spike folds and others

Structural Motif Recognition

Page 36: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

36

OutlineConditional graphical models for protein structure prediction

Page 37: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

37

Quaternary Structure Prediction

• Quaternary structures Multiple chains associated together

through noncovalent bonds or disulfide bonds

• Classes of quaternary structures Based on the number of subunits Based on the identity of the subunits

• Homo-oligomers* and hetero-oligomers

• Very few related research work

..APAFSVSPASGACGPECA..

Contains the quaternary structures?

..NNEEEEECCCCCHHHCCC..

Yes

Page 38: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

38

Proposed Work

• Structural component Secondary structure elements Super-secondary structure elements

• Protein structural graph Nodes for the states of secondary structural

or super-secondary structural elements of unknown lengths

Edges for interactions between nodes in 3-D

Protocol: nodes representing secondary structure elements must involve long-range interactions

Quaternary structure prediction

Page 39: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

39

Proposed Work

• Segmentation definition W = (M, {Wi}),

M: # of segmentsWi: configuration of segment i

• Factorial segmentation conditional random fields

R: # of chains in quaternary structures

Quaternary structure prediction

(1) ( )

1

1( ,..., | ) exp( ( , ))

G

KR

k k cc C k

P W W f wZ

x x

Page 40: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

40

Experiment Design• Triple beta-spirals

Described by van Raaij et al. in Nature (1999) Clear sequence repeats Two proteins with crystallized structures and

about 20 without structure annotation

• Tripe beta-helices Described by van Raaij et al. in JMB (2001) Structurally similar to beta-helix without clear

sequence similarity Two proteins with crystallized structures Information from beta-helices can be used

• Both folds characterized by unusual stability to heat, protease, and detergent

Quaternary structure prediction

Page 41: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

41

SummaryConditional graphical models for protein structure

prediction

Page 42: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

42

Expected Contribution• Computational contribution

Conditional graphical models for structured data prediction “First effective models for the long-range interactions”

• Biological contribution Improvement in protein structure prediction Hypothesis on potential proteins with biological important

folds Aids for function analysis and drug design

Page 43: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

43

Timeline• Jun, 2005 -- Aug, 2005

Data collection for triple beta-spirals and triple beta-helices Preliminary investigation for virus protein folds

• Sep, 2005 -- Nov, 2005 Implementation and Testing the model on synthetic data and real data Determining the specific virus proteins to work on

• Nov, 2005 -- Jan, 2006 Virus protein fold recognition Analysis the properties for virus protein folds

• Feb, 2006 -- June, 2006: Virus protein fold recognition Investigation the possibility for protein function prediction or information

extraction

• July, 2006 -- Aug, 2006 Writing up thesis

Page 44: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

44

Page 45: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

45

Features• Node features

Regular expression template, HMM profiles Secondary structure prediction scores Segment length

• Inter-node features β-strand Side-chain alignment scores Preferences for parallel alignment scores Distance between adjacent B23 segments

• Features are general and easy to extend

Structural Motif Recognition

Page 46: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

46

Evaluation Measure

• Q3 (accuracy)

• Precision, Recall

• Segment Overlap quantity (SOV)

• Matthew’s Correlation coefficients

)(

)1()2;1(

)2;1()2;1(1)2,1(

iS

SLENSSMAXOV

SSDELTASSMINOV

NSSSOV

))(()()( iiiiiiii

iiiii

onunopup

ounpC

P + P-

T + P u

T - o nuonp

npQ

3

op

pQ pre

up

pQ

Page 47: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

47

Local Information• PSI-blast profile

Position-specific scoring matrices (PSSM) Linear transformation[Kim & Park, 2003]

• SVM classifier with RBF kernel • Feature#1 (Si): Prediction score for each

residue Ri

)5(0.1

)55(1.05.0

)5(0

)(

xif

xifx

xif

xf

Page 48: Conditional Graphical Models  for Protein Structure Prediction

Carnegie MellonSchool of Computer Science

48

Previous Work

• Huge literature over decades Window-based methods Hidden Markov models

• Major breakthroughs in recent years Combine the predictions from neighboring residues or

various methods (Cuff & Barton, 1999)

Explore evolutionary information from sequences (Jones, 1999)

Specific algorithm for beta-sheets or infer the paring of beta-strands (Meiler & Baker, 2003)

Protein Secondary Structure Prediction