Introduction to Protein Structure Bioinformatics 29.9.2004

Lorenza Bordoli 1

Swiss Institute of Bioinformatics

Protein Structure Bioinformatics: Introduction

Secondary Structure Prediction & Fold recognition

EMBnet course Basel, September 29, 2004

Lorenza Bordoli

Overview

Introduction

Secondary Structure Prediction

Fold Recognition


Principles of protein structure

Primary Structure

Secondary Structure

Tertiary Structure (Fold)

Quaternary Structure

Principles of protein structure

Protein structures include:

Core region: secondary structure elements packed in close proximity in a hydrophobic environment

• Limited amino acid substitution

Outside the core: loops and structural elements in contact with water, membrane or other proteins

• Amino acid substitution: not as restricted as in the core


PDB Holdings



Protein Structure Databases

PDB: http://www.pdb.org

X-ray, NMR => atom coordinates of the proteins are deposited in PDB: worldwide repository for 3-D biological macromolecular structure data.

EBI-MSD: http://www.ebi.ac.uk/msd/ (2003)

Suite of web-based search and retrieval interfaces for macromolecular structure research.

Protein Structure Databases

http://www.wwpdb.org/


Introduction

Goal: Relationship between amino acid sequence and three-dimensional structure in proteins? Can we predict the structure from the sequence?

Currently: comparative (homology) modeling

See Thursday's lecture (Torsten): Homology Modeling

Similar sequence => similar structure

Homology modeling = Comparative protein modeling

Idea: Using experimental 3D-structures of related family members (templates) to calculate a model for a new sequence (target).

Structure is better conserved than sequence


Flow chart: analyze a new protein sequence

[Flow chart; boxes and decision points recovered from the slide:]

• Protein sequence

• Database similarity search (BLAST)

• Does the sequence align with a protein of known structure?

• Protein family sequence search (Pfam)

• Hints for domain assignment? Function?

• Relationship to known structure?

• Structure prediction (secondary structure, fold recognition)

• Homology modeling

• Predicted 3D structural model

• 3D structural analysis in laboratory

Secondary structure assignment

DSSP

Dictionary of Secondary Structure of Proteins (Kabsch & Sander, 1983)

Based on recognition of hydrogen-bonding patterns in known structures

Automated assignment of secondary structure

Interprets backbone hydrogen bonds

Uses a Coulomb approximation for the hydrogen bond energy (-0.5 kcal/mol cut-off)

Secondary structures are assigned to consecutive segments of residues with hydrogen bonds
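
The hydrogen-bond criterion can be sketched in a few lines of Python. This is only an illustration, assuming the usual Kabsch & Sander electrostatic form E = q1 q2 (1/r_ON + 1/r_CH - 1/r_OH - 1/r_CN) f with q1 = 0.42 e, q2 = 0.20 e and f = 332 (distances in Å, energy in kcal/mol). The function names and the toy coordinates are hypothetical and not part of the DSSP program.

```python
import math

Q1, Q2, F = 0.42, 0.20, 332.0   # partial charges (e) and dimensional factor
CUTOFF = -0.5                   # kcal/mol, the cut-off quoted on the slide

def hbond_energy(c, o, n, h):
    """Electrostatic energy between a backbone C=O (acceptor) and N-H (donor)."""
    r_on = math.dist(o, n)
    r_ch = math.dist(c, h)
    r_oh = math.dist(o, h)
    r_cn = math.dist(c, n)
    return Q1 * Q2 * F * (1 / r_on + 1 / r_ch - 1 / r_oh - 1 / r_cn)

def is_hbond(c, o, n, h):
    """True if the energy is below the -0.5 kcal/mol cut-off."""
    return hbond_energy(c, o, n, h) < CUTOFF

# Toy linear geometry C=O ... H-N (illustrative coordinates in Angstrom)
C, O = (-1.23, 0.0, 0.0), (0.0, 0.0, 0.0)
H_near, N_near = (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)
H_far, N_far = (12.0, 0.0, 0.0), (13.0, 0.0, 0.0)

print(round(hbond_energy(C, O, N_near, H_near), 2), is_hbond(C, O, N_near, H_near))
print(round(hbond_energy(C, O, N_far, H_far), 2), is_hbond(C, O, N_far, H_far))
```

With an O...H distance of 2 Å the pair falls well below the cut-off; moved 10 Å further apart it does not.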


Secondary structure assignment

DSSP secondary structure elements: 8 secondary structure classes, reduced to 3 states (H, E, L)

– H (α-helix) → H

– G (310-helix) → H

– I (π-helix) → H

– E (extended strand) → E

– B (residue in isolated β-bridge) → E

– T (turn) → L

– S (bend) → L

– " " (blank = other) → L
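
The 8-to-3 state reduction listed above can be written as a simple lookup. A minimal sketch; the names `DSSP_TO_3` and `reduce_states` are made up for this example.

```python
# Mapping of the 8 DSSP classes to the 3-state alphabet, as listed on the slide
DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strand and isolated bridge
    "T": "L", "S": "L", " ": "L",   # turn, bend, other
}

def reduce_states(dssp_string):
    """Convert a string of DSSP codes into H/E/L states."""
    return "".join(DSSP_TO_3[code] for code in dssp_string)

print(reduce_states("HGIEBTS "))  # HHHEELLL
```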

Secondary Structure prediction

What is protein secondary structure prediction?

Simplification of prediction problem

3D → 1D

Why do we need it?

As starting point for 3D modeling:

• Improve sequence alignments

• Use in fold recognition (discover family/superfamily relationship)

• Definition of loops / core regions


Secondary Structure prediction

Assumption: there should be a correlation between amino acid sequence and secondary structure

What can we predict?

• α-helix

• β-strand

• Loop (coil)

Secondary Structure prediction

Projection onto strings of structural assignments

"Secondary Structure" 3-state model: (S) β-strand (E in DSSP), (H) α-helix, (L) loop

SEQ MRIILLGAPGAGKGTQAQFIMEKYGIPQISTGDMLRAAVKSGSELGKQAK
SS  SSSSSSLLLLLLHHHHHHHHHHHLLLSSSLHHHHHHHHHHHLLLLLLHHH
SS  SSSSSS      HHHHHHHHHHH   SSS HHHHHHHHHHH      HHH


Accuracy of prediction

3-state-per-residue accuracy:

Gives % of correctly predicted residues in α, β or other state

$Q_3 = 100 \cdot \frac{1}{N} \sum_{i \in \{H, E, L\}} c_i$

• N = total number of residues

• c_i = number of correctly predicted residues in state i (H, E, L)
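
As a small worked example, Q3 can be computed directly from an observed and a predicted state string. A sketch only; the function name `q3` and the toy strings are illustrative, not taken from the slides.

```python
def q3(observed, predicted, states="HEL"):
    """Percentage of residues whose predicted state matches the observed one."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    # c_i: number of correctly predicted residues in each state i
    c = {s: sum(1 for o, p in zip(observed, predicted) if o == p == s)
         for s in states}
    return 100.0 * sum(c.values()) / len(observed)

observed = "HHHEELLL"
predicted = "HHHEELLH"   # one loop residue mispredicted as helix
print(q3(observed, predicted))  # 87.5
```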

Performance Evaluation

Assumption: there should be a correlation* between amino acid sequence and secondary structure

Systematic performance testing pre-requisite for reliability of method

Dataset: PDB, split into

• Training set (PDB subset): derive correlation*

• Test set (PDB subset): => Q3


Conformational Preferences

[Figure: conformational preferences of the amino acids for α, β and RT; Biochimica et Biophysica Acta 916: 200-204 (1987).]

1st Generation secondary structure prediction

1st generation: based on single amino acid propensities

• Chou and Fasman, 1974

• Robson, 1976

• GOR-1: Garnier, Osguthorpe, and Robson, 1978

Preference of particular residues for certain secondary structure elements:

Single-residue statistics: analysis of the frequency of each of the 20 amino acids in α-helices, β-strands or coils

Databases of very limited size

< 55% Q3 accuracy


1st Generation secondary structure prediction

Chou and Fasman (partial table):

Amino acid   Pα     Pβ     Pt
Glu          1.51   0.37   0.74
Met          1.45   1.05   0.60
Ala          1.42   0.83   0.66
Val          1.06   1.70   0.50
Ile          1.08   1.60   0.50
Tyr          0.69   1.47   1.14
Pro          0.57   0.55   1.52
Gly          0.57   0.75   1.56

Chou-Fasman Pij-values

Name            P(H)  P(E)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine         142    83    66      0.06   0.076   0.035   0.058
Arginine         98    93    95      0.07   0.106   0.099   0.085
Aspartic Acid   101    54   146      0.147  0.11    0.179   0.081
Asparagine       67    89   156      0.161  0.083   0.191   0.091
Cysteine         70   119   119      0.149  0.05    0.117   0.128
Glutamic Acid   151    37    74      0.056  0.06    0.077   0.064
Glutamine       111   110    98      0.074  0.098   0.037   0.098
Glycine          57    75   156      0.102  0.085   0.19    0.152
Histidine       100    87    95      0.14   0.047   0.093   0.054
Isoleucine      108   160    47      0.043  0.034   0.013   0.056
Leucine         121   130    59      0.061  0.025   0.036   0.07
Lysine          114    74   101      0.055  0.115   0.072   0.095
Methionine      145   105    60      0.068  0.082   0.014   0.055
Phenylalanine   113   138    60      0.059  0.041   0.065   0.065
Proline          57    55   152      0.102  0.301   0.034   0.068
Serine           77    75   143      0.12   0.139   0.125   0.106
Threonine        83   119    96      0.086  0.108   0.065   0.079
Tryptophan      108   137    96      0.077  0.013   0.064   0.167
Tyrosine         69   147   114      0.082  0.065   0.114   0.125
Valine          106   170    50      0.062  0.048   0.028   0.053


Chou-Fasman

How it works:

a. Assign all of the residues the appropriate set of parameters

b. Identify α-helix and β-sheet regions. Extend the regions in both directions.

c. If structures overlap, compare the average values of P(H) and P(E) and assign the secondary structure with the better score.

d. Turns are modeled as tetra-peptides using 2 different probability values.

Assign Pij values

1. Assign all of the residues the appropriate set of parameters

           T    S    P    T    A    E    L    M    R    S    T    G
P(H)      83   77   57   83  142  151  121  145   98   77   83   57
P(E)     119   75   55  119   83   37  130  105   93   75  119   75
P(turn)   96  143  152   96   66   74   59   60   95  143   96  156
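
Step 1 amounts to a table lookup. The sketch below copies the P(H), P(E) and P(turn) columns from the Chou-Fasman table; the one-letter keys and the function name `assign_parameters` are added here for illustration. Note that T is looked up as threonine (83, 119, 96); the values 69, 147, 114 belong to tyrosine (Y).

```python
# (P(H), P(E), P(turn)), scaled x100, from the Chou-Fasman table
CF = {
    "A": (142, 83, 66),   "R": (98, 93, 95),    "D": (101, 54, 146),
    "N": (67, 89, 156),   "C": (70, 119, 119),  "E": (151, 37, 74),
    "Q": (111, 110, 98),  "G": (57, 75, 156),   "H": (100, 87, 95),
    "I": (108, 160, 47),  "L": (121, 130, 59),  "K": (114, 74, 101),
    "M": (145, 105, 60),  "F": (113, 138, 60),  "P": (57, 55, 152),
    "S": (77, 75, 143),   "T": (83, 119, 96),   "W": (108, 137, 96),
    "Y": (69, 147, 114),  "V": (106, 170, 50),
}

def assign_parameters(seq):
    """Return one list of values per parameter, in sequence order."""
    rows = {"P(H)": [], "P(E)": [], "P(turn)": []}
    for aa in seq:
        p_h, p_e, p_turn = CF[aa]
        rows["P(H)"].append(p_h)
        rows["P(E)"].append(p_e)
        rows["P(turn)"].append(p_turn)
    return rows

rows = assign_parameters("TSPTAELMRSTG")
for name, values in rows.items():
    print(f"{name:8s}", *(f"{v:4d}" for v in values))
```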


Scan peptide for α−helix regions

2. Identify regions where 4/6 aa have a P(H) > 100: "α-helix nucleus"

           T    S    P    T    A    E    L    M    R    S    T    G
P(H)      83   77   57   83  142  151  121  145   98   77   83   57

Extend α-helix nucleus

3. Extend helix in both directions until a set of four residues have an average P(H) <100.

           T    S    P    T    A    E    L    M    R    S    T    G
P(H)      83   77   57   83  142  151  121  145   98   77   83   57

Repeat steps 1 – 3 for entire peptide
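
The helix scan (steps 2 and 3) can be sketched as follows. This is one possible reading of the slide, not the CHOFAS implementation: the exact stopping rule for the extension is an assumption (a side stops growing when the four residues including the next candidate average below 100), and the names `helix_nuclei` and `extend_helix` are hypothetical. P(H) values are taken from the Chou-Fasman table, with T read as threonine.

```python
# P(H) x100 from the Chou-Fasman table (one-letter keys added here)
P_H = {"A": 142, "R": 98, "D": 101, "N": 67, "C": 70, "E": 151, "Q": 111,
       "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
       "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106}

def helix_nuclei(seq, window=6, needed=4, cutoff=100):
    """Start indices of windows with at least `needed` residues of P(H) > cutoff."""
    starts = []
    for i in range(len(seq) - window + 1):
        strong = sum(1 for aa in seq[i:i + window] if P_H[aa] > cutoff)
        if strong >= needed:
            starts.append(i)
    return starts

def extend_helix(seq, start, end, cutoff=100):
    """Grow the half-open segment [start, end) residue by residue.

    Assumed rule: keep extending on a side while the four residues that
    include the next candidate residue average at least `cutoff`.
    """
    p = [P_H[aa] for aa in seq]
    while start > 0 and sum(p[start - 1:start + 3]) / 4 >= cutoff:
        start -= 1
    while end < len(seq) and sum(p[end - 3:end + 1]) / 4 >= cutoff:
        end += 1
    return start, end

peptide = "TSPTAELMRSTG"
nuclei = helix_nuclei(peptide)
segment = extend_helix(peptide, nuclei[0], nuclei[0] + 6)
print(nuclei, segment, peptide[segment[0]:segment[1]])
```

Whether the extended segment is finally called a helix still depends on the comparison with P(E), which this sketch leaves out.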


Scan peptide for β-sheet regions

4. Identify regions where 3/5 aa have a P(E) > 100: "β-sheet nucleus"

Extend the β-sheet until 4 contiguous residues have an average P(E) < 100

If the region average is > 105 and the average P(E) > average P(H), then "β-sheet"

           T    S    P    T    A    E    L    M    R    S    T    G
P(H)      83   77   57   83  142  151  121  145   98   77   83   57
P(E)     119   75   55  119   83   37  130  105   93   75  119   75

Chou-Fasman

1. Assign all of the residues in the peptide the appropriate set of parameters.

2. Scan through the peptide and identify regions where 4 out of 6 contiguous residues have P(α-helix) > 100. That region is declared an α-helix. Extend the helix in both directions until a set of four contiguous residues with an average P(α-helix) < 100 is reached. That is declared the end of the helix. If the segment defined by this procedure is longer than 5 residues and the average P(α-helix) > average P(β-sheet) for that segment, the segment can be assigned as a helix.

3. Repeat this procedure to locate all of the helical regions in the sequence.

4. Scan through the peptide and identify a region where 3 out of 5 of the residues have P(β-sheet) > 100. That region is declared a β-sheet. Extend the sheet in both directions until a set of four contiguous residues with an average P(β-sheet) < 100 is reached. That is declared the end of the β-sheet. Any segment of the region located by this procedure is assigned as a β-sheet if the average P(β-sheet) > 105 and the average P(β-sheet) > average P(α-helix) for that region.

5. Any region containing overlapping α-helical and β-sheet assignments is taken to be helical if the average P(α-helix) > average P(β-sheet) for that region. It is a β-sheet if the average P(β-sheet) > average P(α-helix) for that region.

6. To identify a bend at residue number j, calculate the following value:

$p(t) = f_{i}(j) \cdot f_{i+1}(j+1) \cdot f_{i+2}(j+2) \cdot f_{i+3}(j+3)$

where the f(i) value is used for residue j, the f(i+1) value for residue j+1, the f(i+2) value for residue j+2 and the f(i+3) value for residue j+3. If (1) p(t) > 0.000075; (2) the average value of P(turn) in the tetra-peptide is > 100 (1.00 on the unscaled propensity scale); and (3) the averages for the tetra-peptide obey the inequality P(α-helix) < P(turn) > P(β-sheet), then a β-turn is predicted at that location.
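
The turn rule in step 6 can be checked on the example peptide TSPTAELMRSTG. A minimal sketch: only the residues that occur in this peptide are included, the values are copied from the Chou-Fasman table (T read as threonine), and the function name `is_turn` is made up for this example.

```python
from math import prod

# f(i), f(i+1), f(i+2), f(i+3) for the residues of the example peptide
F_TURN = {
    "T": (0.086, 0.108, 0.065, 0.079), "S": (0.12, 0.139, 0.125, 0.106),
    "P": (0.102, 0.301, 0.034, 0.068), "A": (0.06, 0.076, 0.035, 0.058),
    "E": (0.056, 0.06, 0.077, 0.064),  "L": (0.061, 0.025, 0.036, 0.07),
    "M": (0.068, 0.082, 0.014, 0.055), "R": (0.07, 0.106, 0.099, 0.085),
    "G": (0.102, 0.085, 0.19, 0.152),
}
# (P(H), P(E), P(turn)), scaled x100
P = {"T": (83, 119, 96), "S": (77, 75, 143), "P": (57, 55, 152),
     "A": (142, 83, 66), "E": (151, 37, 74), "L": (121, 130, 59),
     "M": (145, 105, 60), "R": (98, 93, 95), "G": (57, 75, 156)}

def is_turn(seq, j):
    """Apply the three turn conditions to the tetra-peptide starting at j."""
    tetra = seq[j:j + 4]
    if len(tetra) < 4:
        return False
    # residue at offset k contributes its f(i+k) value
    p_t = prod(F_TURN[aa][k] for k, aa in enumerate(tetra))
    avg_h, avg_e, avg_turn = (sum(P[aa][i] for aa in tetra) / 4 for i in range(3))
    return p_t > 0.000075 and avg_turn > 100 and avg_h < avg_turn > avg_e

peptide = "TSPTAELMRSTG"
print(is_turn(peptide, 0), is_turn(peptide, 1))  # False True
```

For the tetra-peptide SPTA (j = 1, counting from 0), p(t) is about 0.000136 and the average P(turn) of 114.25 exceeds both the helix and the sheet average, so a turn is predicted there.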


Chou-Fasman Results

CHOFAS predicts protein secondary structure
version 2.0u61 September 1998
Please cite: Chou and Fasman (1974) Biochem., 13:222-245
Chou-Fasman plot of @, 12 aa; SEQ1 sequence.

TSPTAELMRSTG
helix <>
sheet EEEEEEE
turns T

Residue totals: H: 2 E: 7 T: 1
percent: H: 16.7 E: 58.3 T: 8.3

(Column positions of the helix, sheet and turn rows were lost in extraction.)

2nd Generation secondary structure prediction

Improvements

Larger database of protein structures

Segment-based statistics (11-21 residue window)

Basic idea:

"How likely is it that the central residue in a window adopts a particular secondary structure state?"
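
The window idea can be sketched as a function that pairs every residue with its surrounding segment. The window size of 13 lies within the 11-21 range quoted above; the padding character and the function name `windows` are assumptions made for this illustration.

```python
def windows(seq, size=13, pad="-"):
    """Return (central residue, window) pairs, padding the sequence ends."""
    if size % 2 == 0:
        raise ValueError("window size must be odd so that it has a centre")
    half = size // 2
    padded = pad * half + seq + pad * half
    return [(seq[i], padded[i:i + size]) for i in range(len(seq))]

for centre, window in windows("MRIILLGAPGAGKGTQ", size=13)[:3]:
    print(centre, window)
```

A second-generation method would then score each window for H, E and L and assign the best state to the central residue.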

Algorithm used:

Presumably all conceivable algorithms on this planet have been applied to the secondary structure prediction problem.

E.g. statistical information, physicochemical properties, sequence patterns, neural networks, graph theory, expert rules


(H) α-Helix, local interactions

Neural Network

Artificial intelligence: computer programs are trained to recognize amino acid patterns that are located in known secondary structures and to distinguish them from other patterns not located in these structures

NNs can detect interactions between amino acids in a sequence window.
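
A network of this kind needs the sequence window as numbers. One common choice, shown here only as an assumed sketch, is a one-hot code over 21 symbols (the 20 amino acids plus "." for positions outside the sequence); the function name `encode_window` is hypothetical.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY."   # 20 amino acids plus a spacer symbol

def encode_window(window):
    """Concatenate one 21-bit one-hot vector per window position."""
    vector = []
    for symbol in window:
        bits = [0] * len(ALPHABET)
        bits[ALPHABET.index(symbol)] = 1
        vector.extend(bits)
    return vector

x = encode_window("DRQGFVPAAYVKK")   # 13 residues -> 13 * 21 inputs
print(len(x), sum(x))                # 273 13
```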

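Neural Networks for Secondary Structure prediction

[Figure (B. Rost, Columbia, New York): a sequence window is encoded over the 21-symbol alphabet ACDEFGHIKLMNPQRSTVWY. and passed through the input layer, hidden layer and output layer, connected by weights, to the output units H, E and L. Example window with observed states: D (L), R (E), Q (E), G (E), F (E), V (E), P (E), A (H), A (H), Y (H), V (E), K (E), K (E).]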


Neural Networks for secondary structure predictions

[Figure (B. Rost, Columbia, New York): the same window D (L), R (E), Q (E), G (E), F (E), V (E), P (E), A (H), A (H), Y (H), V (E), K (E), K (E) is fed through the network. Output units: H = 0.19, E = 0.61, L = 0.17.]

The winner is: E
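
The "winner takes all" step is simply the output unit with the largest value. A tiny sketch, assuming the three numbers on the slide belong to the units H, E and L in that order:

```python
# Output unit activations as read from the slide (order H, E, L assumed)
outputs = {"H": 0.19, "E": 0.61, "L": 0.17}

# The predicted state is the unit with the maximal output
winner = max(outputs, key=outputs.get)
print(winner)  # E
```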

Neural Networks

Benefits

• Generally applicable

• Can capture higher-order correlations

• Inputs other than sequence information

Drawbacks

• Needs many data points (solved structures)

• Risk of overtraining


2nd Generation secondary structure prediction

Methods:

GORIII

COMBINE

Q3 accuracy < 70%

Problems with first and second generation methods

Q3 accuracy < 70%

β-strands predicted at only 28-48 % (slightly better than random)

Predicted helices and strands are too short


3rd Generation secondary structure prediction

Breakthrough: Using evolutionary information

(alignment columns 1 to 50)

fyn_human   VTLFVALYDY EARTEDDLSF HKGEKFQILN SSEGDWWEAR SLTTGETGYI
yrk_chick   VTLFIALYDY EARTEDDLSF QKGEKFHIIN NTEGDWWEAR SLSSGATGYI
fgr_human   VTLFIALYDY EARTEDDLTF TKGEKFHILN NTEGDWWEAR SLSSGKTGCI
yes_chick   VTVFVALYDY EARTTDDLSF KKGERFQIIN NTEGDWWEAR SIATGKTGYI
src_avis2   VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_aviss   VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_avisr   VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_chick   VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
stk_hydat   VTIFVALYDY EARISEDLSF KKGERLQIIN TADGDWWYAR SLITNSEGYI
src_rsvpa   .......... ESRIETDLSF KKRERLQIVN NTEGTWWLAH SLTTGQTGYI
hck_human   ..IVVALYDY EAIHHEDLSF QKGDQMVVLE ES.GEWWKAR SLATRKEGYI
blk_mouse   ..FVVALFDY AAVNDRDLQV LKGEKLQVLR .STGDWWLAR SLVTGREGYV
hck_mouse   .TIVVALYDY EAIHREDLSF QKGDQMVVLE .EAGEWWKAR SLATKKEGYI
lyn_human   ..IVVALYPY DGIHPDDLSF KKGEKMKVLE .EHGEWWKAK SLLTKKEGFI
lck_human   ..LVIALHSY EPSHDGDLGF EKGEQLRILE QS.GEWWKAQ SLTTGQEGFI
ss81_yeast  .....ALYPY DADDDdeISF EQNEILQVSD .IEGRWWKAR R.ANGETGII
abl_mouse   ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
abl1_human  ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
src1_drome  ..VVVSLYDY KSRDESDLSF MKGDRMEVID DTESDWWRVV NLTTRQEGLI
mysd_dicdi  .....ALYDF DAESSMELSF KEGDILTVLD QSSGDWWDAE L..KGRRGKV
yfj4_yeast  ....VALYSF AGEESGDLPF RKGDVITILK ksQNDWWTGR V..NGREGIF
abl2_human  ..LFVALYDF VASGDNTLSI TKGEKLRVLG YNQNGEWSEV RSKNG.QGWV
tec_human   .EIVVAMYDF QAAEGHDLRL ERGQEYLILE KNDVHWWRAR D.KYGNEGYI
abl1_caeel  ..LFVALYDF HGVGEEQLSL RKGDQVRILG YNKNNEWCEA RlrLGEIGWV
txk_human   .....ALYDF LPREPCNLAL RRAEEYLILE KYNPHWWKAR D.RLGNEGLI
yha2_yeast  VRRVRALYDL TTNEPDELSF RKGDVITVLE QVYRDWWKGA L..RGNMGIF
abp1_sacex  .....AEYDY EAGEDNELTF AENDKIINIE FVDDDWWLGE LETTGQKGLF

3rd Generation secondary structure prediction

PHD method (Rost and Sander)

Combine neural networks with MAXHOM multiple sequence profiles

6-8 Percentage points increase in prediction accuracy over standard neural networks
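
What a "multiple sequence profile" contains can be illustrated with per-column residue counts. The sketch below is not MAXHOM or PHD; it simply counts residues per alignment column for the five 21-residue fragments shown in Rost's figure (protein plus four aligned sequences), and the function name `column_counts` is made up.

```python
from collections import Counter

def column_counts(alignment):
    """Residue counts for every column of an ungapped, equal-length alignment."""
    if len({len(s) for s in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [dict(Counter(column)) for column in zip(*alignment)]

alignment = [
    "GYIYDPEDGDPDDGVNPGTDF",   # the protein itself
    "GYIYDPAVGDPDNGVEPGTEF",
    "GYIYDPEVGDPTQNIPPGTKF",
    "GYEYDPAEGDPDNGVKPGTSF",
    "GYEYDPAEGDPDNGVKPGTAF",
]
profile = column_counts(alignment)
print(profile[0])   # fully conserved column: {'G': 5}
print(profile[6])   # variable column: {'E': 2, 'A': 3}
```

Instead of a single residue per position, the network then receives these counts (or frequencies derived from them), which is where the evolutionary information enters.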


3rd Generation secondary structure prediction

[Figure (B. Rost, Columbia, New York): PHD network architecture. The protein fragment

:GYIY DPEDGDPDDGVNP GTDF:

and its alignments

:GYIY DPAVGDPDNGVEP GTEF:
:GYIY DPEVGDPTQNIPP GTKF:
:GYEY DPAEGDPDNGVKP GTSF:
:GYEY DPAEGDPDNGVKP GTAF:

are converted into a profile table of per-position residue counts (columns ordered GSAPD NTEKQ CVHIR LMYFW). The profile is fed to the input layer, then to the first or hidden layer, then to the second or output layer with units H, E and L. The maximal unit is picked as the current prediction. A label on the slide reads: corresponds to the 21*3 bits coding for the profile of one residue.]

3rd generation secondary structure prediction

PHD (Rost et al.): Q3 better than 72 %

[ B.Rost (2001) J.Struct.Biol. 134, 204 ]

[Figure: Q3 and prediction reliability (0 = weak, 9 = strong); Q3 values shown on the slide: 59 %, 65 %, 72 %]

[http://www.embl-heidelberg.de/predictprotein/]


3rd generation secondary structure prediction

PSI-Pred (Jones, DT)

Uses alignments from iterative sequence searches (PSI-Blast) as input to a neural network

Better predictions due to better sequence profiles

Available as stand alone program and via the web

[http://bioinf.cs.ucl.ac.uk/psipred/psiform.html]

How accurate are predictions today?

[Figure (B. Rost, Columbia, New York): histogram of the number of protein chains versus per-residue accuracy (Q3); <Q3> = 72.3%, sigma = 10.5%; labelled chains: 1spf, 1bct, 1stu, 3ifm, 1psm]


How accurate are predictions today?

Q3 = 72-76% ± 11% (on average)

I.e. roughly 30% of predicted assignments are wrong

I.e. for 2/3 of all proteins, between 60% and 80% of residues are predicted correctly

I.e. for your protein, accuracy can be lower than 60% or higher than 80%

How accurate are predictions today?

At present it is not always possible to predict secondary structure with very high reliability

As methods have improved (from the 1st to the 3rd generation), prediction has reached an average accuracy of 64%-75%


Secondary Structure Prediction

META-PredictProtein Server

http://cubic.bioc.columbia.edu/meta/

Simultaneous submission tool to several other servers, e.g. JPRED, PHD, PROF, PSIPRED, SAM-T99, APSSP2, SSpro

Includes also motif searches, domain assignments, TM predictions, etc.

1D-Structure prediction

Secondary Structure Prediction

Solvent Accessibility Prediction

Identify exposed residues, e.g. for mutation

studies, epitopes, etc.


1D-Structure prediction

Projection onto strings of structural assignments

E.g. "Solvent Accessibility" (buried or exposed?)

A B C D E F G …
| | | | | | |
e e b b e e e …

Accuracy of two-state prediction: 75% ± 10%

PHDacc: solvent accessibility prediction

[http://cubic.bioc.columbia.edu/predictprotein/]


1D-Structure prediction

Secondary Structure Prediction

Solvent Accessibility Prediction

Transmembrane Helices prediction

PHDhtm [http://www.embl-heidelberg.de/predictprotein/predictprotein.html]

TMHMM [http://www.cbs.dtu.dk/services/TMHMM/]

TMpred [http://www.ch.embnet.org/software/TMPRED_form.html]

Fold Recognition


[ PDB: http://www.pdb.org ]

Growth of the Protein Data Bank PDB

Christine Orengo (Structure, 1997, 5, 1093-1108)

Fold Classification Databases


Christine Orengo (Structure, 1997, 5, 1093-1108)

Fold Classification Databases

Protein structure classification databases

Databases: provide structural comparisons for the proteins in PDB

Methods used to classify the protein structures:

• Manual examination

• Fully automatic computer algorithms

Examples:

• SCOP

• CATH

• FSSP


[ http://scop.mrc-lmb.cam.ac.uk/scop/ ]

SCOP - Structural Classification of Proteins

MRC Cambridge UK, A. Murzin, S. E. Brenner, T. Hubbard, C. Chothia

• Created by manual inspection

• Hierarchical classification of protein domain structures

• Comprehensive description of the structural and evolutionary relationships

• Organized as a tree structure:

Class: all α class
Fold: globin-like fold (6 helices; folded leaf)
Superfamily: globin-like superfamily
Family: globin and phycocyanin families
Domain: hemoglobin 1, myoglobin, …
Species

Domain = segment of a polypeptide chain that can autonomously fold into a 3D structure

[ http://www.biochem.ucl.ac.uk/bsm/cath_new/ ]

CATH - Protein Structure Classification

UCL, Janet Thornton & Christine Orengo

Hierarchical classification of protein domain structures

clusters proteins at four major levels:

Class (C)

Architecture(A)

Topology(T)

Homologous superfamily (H)


[ http://www.biochem.ucl.ac.uk/bsm/cath_new/ ]

CATH - Protein Structure Classification

Class (C)

Derived from secondary structure content; assigned automatically.

Architecture (A)

Describes the gross orientation of secondary structures, independent of connectivity.

Topology (T)

Clusters structures according to their topological connections and numbers of secondary structures.

FSSP - Fold classification based on Structure-Structure alignment of Proteins

Holm and Sander, EBI, UK

Fold classification based on pair-wise structural alignment of PDB. (DALI program)

Clusters of fold types = unique configuration of secondary structure elements

[http://www2.ebi.ac.uk/dali/fssp/]


Structural Alignments

Protein Structure is better conserved than sequence

Structural alignments establish equivalences between amino acid residues based on the 3D structures of two or more proteins

Structure alignments therefore provide information not available from sequence alignment methods

Structural alignments can be used to guide sequence alignments (see: T_COFFEE / SAP)

See Thursday's lecture (Laurent): Sequence alignment

[ PDB: http://www.pdb.org ]

Growth of the Protein Data Bank PDB


[ PDB: http://www.pdb.org ]

Growth of the Protein Data Bank PDB

New folds per year

“Old” folds per year

The number of folds appears to be limited

Many different sequences will adopt the same fold:

There is a reasonable probability that a new sequence will possess an already identified fold

Goal of fold recognition: discover which fold is best matched

• Sequence alignment methods (e.g. HMM)

• 3D structure prediction methods (e.g. threading)


Find a compatible fold for a given sequence ....

>Protein XY
MSTLYEKLGGTTAVDLAVDKFYERVLQDDRIKHFFADVDMAKQRAHQKAFLTYAFGGTDKYDGRYMREAHKELVENHGLNGEHFDAVAEDLLATLKEMGVPEDLIAEVAAVAGAPAHKRDVLNQ

[Figure: sequence ≈ ? candidate fold]

Fold recognition

The number of protein folds that occur in nature is limited. Fold recognition can be used to:

Identify templates for modeling

Assign Protein Function

Fold recognition: sequence based

Sequence alignment (HMM) can be used to identify a family of homologous proteins that have similar sequences and presumably a similar 3D structure

E.g. the SUPERFAMILY database: uses a library (covering all proteins of known structure) consisting of 1294 SCOP superfamilies, each of which is represented by a group of hidden Markov models (HMMs)

[http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/]


Fold recognition: threading

The amino acid sequence of a query protein is examined for compatibility with the structural core of known protein structures:

• Structure profile method (e.g. 3D-PSSM)

• Contact potential method (e.g. 123D)

Fold recognition methods

3DPSSM: three-dimensional position-specific scoring matrix

Kelley et al, JMB, 299, 499 (2000)

http://www.sbg.bio.ic.ac.uk/~3dpssm/


Fold recognition and Function

Some words of warning concerning fold recognition:

There is no simple close association of fold and function in a one-to-one sense.

The five most versatile folds (TIM-barrel, alpha-beta hydrolase, Rossmann, P-loop containing NTP hydrolase, ferredoxin fold), accommodate from six to as many as 16 functions.

The two most versatile enzymatic functions (hydrolases and o-glycosyl-glucosidases) are associated with seven folds each.

Functional assignment by fold recognition?

[Figure: Aspartase [1JSW] and Histidase [1B8F], each with its reaction scheme, and δ2-Crystallin [1AUW], an avian eye lens protein.]


Fold Recognition Servers

Meta server: http://bioinfo.pl/meta/

3DPSSM: http://www.sbg.bio.ic.ac.uk/servers/3dpssm/

GenTHREADER: http://bioinf.cs.ucl.ac.uk/psipred/

FUGUE2: http://www-cryst.bioc.cam.ac.uk/~fugue/prfsearch.html

SAM: http://www.cse.ucsc.edu/research/compbio/HMM-apps/T99-query.html

FOLD: http://fold.doe-mbi.ucla.edu/

FFAS/PDBBLAST: http://bioinformatics.burnham-inst.org/

References

D.W. Mount, Bioinformatics, CSHLP.

P.E. Bourne, H. Weissig, Structural Bioinformatics, Wiley-Liss and Sons.

Methods in Molecular Biology 143: Protein Structure Prediction, Humana Press.

Protein Structure Prediction: A Practical Approach, Oxford University Press.