Introduction to Protein Structure Bioinformatics 29.9.2004
Lorenza Bordoli 1
Swiss Institute of Bioinformatics
Protein Structure Bioinformatics: Introduction
Secondary Structure Prediction & Fold recognition
EMBnet course Basel, September 29, 2004
Lorenza Bordoli
Overview
Introduction
Secondary Structure Prediction
Fold Recognition
Principles of protein structure
Primary Structure
Secondary Structure
Tertiary Structure (Fold)
Quaternary Structure
Principles of protein structure
Protein structures include:
Core region: secondary structure elements packed in close proximity in a
hydrophobic environment
Limited amino acid substitution
Outside the core: loops and structural elements in contact with water, membranes
or other proteins
Amino acid substitution: less restricted than in the core
PDB Holdings
Protein Structure Databases
PDB: http://www.pdb.org
Atomic coordinates determined by X-ray crystallography or NMR are
deposited in the PDB, the worldwide repository for 3-D
biological macromolecular structure data.
EBI-MSD: http://www.ebi.ac.uk/msd/ (2003)
suite of web-based search and retrieval interfaces for
macromolecular structure research.
Protein Structure Databases
http://www.wwpdb.org/
Introduction
Goal: understand the relationship between amino acid sequence and three-dimensional structure in proteins. Can we predict the structure from the sequence?
Currently: comparative (homology) modeling
See lecture on Thursday (Torsten): Homology Modeling
Similar sequence => similar structure
Homology modeling = comparative protein modeling
Idea: use experimental 3D structures of related family members (templates) to calculate a model for a new sequence (target).
Structure is better conserved than sequence
Flow chart: analyzing a new protein sequence
Protein sequence
Database similarity search (BLAST)
Does the sequence align with a protein of known structure?
Protein family / sequence search (Pfam)
Relationship to known structure?
Homology modeling
Predicted 3D structural model
Structure prediction (secondary structure, fold recognition)
Hints for domain assignment? Function?
3D structural analysis in laboratory
Secondary structure assignment
DSSP
Dictionary of Secondary Structure of Proteins (Kabsch & Sander, 1983)
Based on recognition of hydrogen-bonding patterns in known structures
Automated assignment of secondary structure
Interprets backbone hydrogen bonds
Uses a Coulomb approximation for the hydrogen bond energy (-0.5 kcal/mol cut-off)
Secondary structures are assigned to consecutive segments of residues with hydrogen bonds
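The Coulomb approximation mentioned above can be sketched as follows. This is a minimal illustration, not the DSSP program itself: the partial charges (0.42 e and 0.20 e) and the factor 332 are the constants published by Kabsch & Sander (1983); the function names and the coordinates in the usage example are hypothetical.

```python
import math

# Constants from Kabsch & Sander (1983); distances are in Angstrom.
Q1, Q2 = 0.42, 0.20   # partial charges on C=O and N-H, in units of e
F = 332.0             # dimensional factor, yields kcal/mol
CUTOFF = -0.5         # kcal/mol, the cut-off quoted on the slide


def hbond_energy(c, o, n, h):
    """Electrostatic energy between a backbone C=O group and an N-H group."""
    return Q1 * Q2 * F * (
        1.0 / math.dist(o, n)
        + 1.0 / math.dist(c, h)
        - 1.0 / math.dist(o, h)
        - 1.0 / math.dist(c, n)
    )


def is_hbond(c, o, n, h):
    """A hydrogen bond is assigned when the energy is below the cut-off."""
    return hbond_energy(c, o, n, h) < CUTOFF


# Hypothetical, perfectly linear geometry: C=O...H-N with an O...N distance of 2.9 A.
c, o = (0.0, 0.0, 0.0), (1.23, 0.0, 0.0)
h, n = (3.13, 0.0, 0.0), (4.13, 0.0, 0.0)
print(round(hbond_energy(c, o, n, h), 2), is_hbond(c, o, n, h))
```

For this idealized geometry the energy comes out at roughly -2.9 kcal/mol, well below the -0.5 kcal/mol cut-off.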
Secondary structure assignment
DSSP secondary structure elements: 8 secondary structure classes, reduced to 3 states
– H (α-helix) → H
– G (3₁₀-helix) → H
– I (π-helix) → H
– E (extended strand) → E
– B (residue in isolated β-bridge) → E
– T (turn) → L
– S (bend) → L
– " " (blank = other) → L
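The reduction from eight DSSP classes to three states listed above amounts to a simple lookup. A minimal sketch; the names `DSSP_TO_3STATE` and `reduce_dssp` are illustrative only.

```python
# Mapping taken directly from the list above (8 DSSP classes -> H, E, L).
DSSP_TO_3STATE = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands and isolated bridges
    "T": "L", "S": "L", " ": "L",   # turn, bend, other
}


def reduce_dssp(dssp_string):
    """Convert a per-residue DSSP string into the 3-state alphabet."""
    return "".join(DSSP_TO_3STATE[state] for state in dssp_string)


print(reduce_dssp("HGIEBTS "))
```

Other reduction schemes exist (for example treating G or B as loop), so reported accuracies depend on the mapping that was used.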
Secondary Structure prediction
What is protein secondary structure prediction?
Simplification of prediction problem
3D → 1D
Why do we need it?
As starting point for 3D modeling:
• Improve sequence alignments
• Use in fold recognition (discover family/superfamily relationship)
• Definition of loops / core regions
Secondary Structure prediction
Assumption: there should be a correlation between amino acid sequence
and secondary structure
What can we predict?
α-helix
β-strand
Loop (coil)
Secondary Structure prediction
Projection onto strings of structural assignments
“Secondary Structure” 3-state model: (S) β-strand (also written E), (H) α-helix, (L) loop
SEQ MRIILLGAPGAGKGTQAQFIMEKYGIPQISTGDMLRAAVKSGSELGKQAK
SS  SSSSSSLLLLLLHHHHHHHHHHHLLLSSSLHHHHHHHHHHHLLLLLLHHH
SS  SSSSSS      HHHHHHHHHHH   SSS HHHHHHHHHHH      HHH
Accuracy of prediction
3-state-per-residue accuracy:
Gives the percentage of residues correctly predicted in the α,
β or other state
Q3 = 100 · (Σ_i c_i) / N
• N = total number of residues
• c_i = number of correctly predicted residues in state
i (i = H, E, L)
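As a small worked example of the Q3 definition above: the sketch below counts the correctly predicted residues per state (the c_i) and divides by N. The function name `q3` and the two example strings are made up for illustration.

```python
from collections import Counter


def q3(observed, predicted):
    """Percentage of residues whose predicted state equals the observed state."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    # c[i] = number of correctly predicted residues in state i (H, E or L)
    c = Counter(o for o, p in zip(observed, predicted) if o == p)
    return 100.0 * sum(c.values()) / len(observed)


observed = "HHHHEEEELL"
predicted = "HHHLEEELLL"
print(q3(observed, predicted))
```

Here 3 helix, 3 strand and 2 loop residues are correct, so Q3 = 100 · 8 / 10 = 80.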
Performance Evaluation
Assumption: there should be a correlation* between amino acid sequence and secondary structure
Systematic performance testing is a prerequisite for assessing the reliability of a method
Dataset: PDB, split into a training set and a test set
Training set (PDB subset): derive the correlation*
Test set (PDB subset): => Q3
Conformational Preferences
[Figure: conformational preferences of the amino acids, panels α, β and RT; Biochimica et Biophysica Acta 916: 200-204 (1987)]
1st Generation secondary structure prediction
1st generation: based on single amino acid propensities
Chou and Fasman, 1974
Robson, 1976
GOR-1: Garnier, Osguthorpe, and Robson, 1978
Preference of particular residues for certain secondary structure elements:
Single-residue statistics: analysis of the frequency of each of the 20 amino acids in α-helices, β-strands or coils
Databases of very limited size
< 55% Q3 accuracy
1st Generation secondary structure prediction
Chou and Fasman (partial table):
Amino acid   Pα     Pβ     Pt
Glu          1.51   0.37   0.74
Met          1.45   1.05   0.60
Ala          1.42   0.83   0.66
Val          1.06   1.70   0.50
Ile          1.08   1.60   0.50
Tyr          0.69   1.47   1.14
Pro          0.57   0.55   1.52
Gly          0.57   0.75   1.56
Chou-Fasman Pij-values
Name            P(H)  P(E)  P(turn)  f(i)    f(i+1)  f(i+2)  f(i+3)
Alanine         142    83    66      0.06    0.076   0.035   0.058
Arginine         98    93    95      0.07    0.106   0.099   0.085
Aspartic Acid   101    54   146      0.147   0.11    0.179   0.081
Asparagine       67    89   156      0.161   0.083   0.191   0.091
Cysteine         70   119   119      0.149   0.05    0.117   0.128
Glutamic Acid   151    37    74      0.056   0.06    0.077   0.064
Glutamine       111   110    98      0.074   0.098   0.037   0.098
Glycine          57    75   156      0.102   0.085   0.19    0.152
Histidine       100    87    95      0.14    0.047   0.093   0.054
Isoleucine      108   160    47      0.043   0.034   0.013   0.056
Leucine         121   130    59      0.061   0.025   0.036   0.07
Lysine          114    74   101      0.055   0.115   0.072   0.095
Methionine      145   105    60      0.068   0.082   0.014   0.055
Phenylalanine   113   138    60      0.059   0.041   0.065   0.065
Proline          57    55   152      0.102   0.301   0.034   0.068
Serine           77    75   143      0.12    0.139   0.125   0.106
Threonine        83   119    96      0.086   0.108   0.065   0.079
Tryptophan      108   137    96      0.077   0.013   0.064   0.167
Tyrosine         69   147   114      0.082   0.065   0.114   0.125
Valine          106   170    50      0.062   0.048   0.028   0.053
Chou-Fasman
How it works:
a. Assign all of the residues the appropriate set of parameters
b. Identify α-helix and β-sheet regions. Extend the regions in both
directions.
c. If structures overlap, compare the average values for P(H) and P(E) and
assign the secondary structure with the better score.
d. Turns are modeled as tetra-peptides using 2 different probability values.
Assign Pij values
1. Assign all of the residues the appropriate set of parameters
          T    S    P    T    A    E    L    M    R    S    T    G
P(H)      83   77   57   83  142  151  121  145   98   77   83   57
P(E)     119   75   55  119   83   37  130  105   93   75  119   75
P(turn)   96  143  152   96   66   74   59   60   95  143   96  156
Scan peptide for α-helix regions
2. Identify regions where 4 out of 6 residues have P(H) > 100: “α-helix nucleus”
          T    S    P    T    A    E    L    M    R    S    T    G
P(H)      83   77   57   83  142  151  121  145   98   77   83   57
Extend α-helix nucleus
3. Extend the helix in both directions until a set of four residues has an average P(H) < 100.
          T    S    P    T    A    E    L    M    R    S    T    G
P(H)      83   77   57   83  142  151  121  145   98   77   83   57
Repeat steps 1 – 3 for entire peptide
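The nucleation and extension steps for helices can be sketched as below. This is one possible reading of the rules on the slides, not the CHOFAS implementation: the P(H) values are the P(H) column of the Chou-Fasman Pij table (keyed by one-letter code), while the function names and the exact way the four-residue window is slid outwards are assumptions.

```python
# P(H) column of the Chou-Fasman table, keyed by one-letter amino acid code.
P_H = {
    "A": 142, "R": 98, "D": 101, "N": 67, "C": 70, "E": 151, "Q": 111,
    "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
    "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106,
}


def find_nuclei(seq, table, window=6, needed=4, threshold=100):
    """Start indices of windows in which at least `needed` residues exceed `threshold`."""
    return [
        start
        for start in range(len(seq) - window + 1)
        if sum(table[aa] > threshold for aa in seq[start:start + window]) >= needed
    ]


def extend_region(seq, start, end, table, threshold=100, probe=4):
    """Grow the half-open region [start, end) one residue at a time.

    Assumption: a neighbouring residue is added while the four-residue window
    that starts (left side) or ends (right side) at that residue still has an
    average of at least `threshold`. The region must be at least `probe` long.
    """
    def average(i, j):
        return sum(table[aa] for aa in seq[i:j]) / (j - i)

    while start > 0 and average(start - 1, start - 1 + probe) >= threshold:
        start -= 1
    while end < len(seq) and average(end + 1 - probe, end + 1) >= threshold:
        end += 1
    return start, end


peptide = "TSPTAELMRSTG"
nuclei = find_nuclei(peptide, P_H)
helix = extend_region(peptide, nuclei[0], nuclei[0] + 6, P_H)
print(nuclei, helix, peptide[helix[0]:helix[1]])
```

The same two functions can be reused for β-sheet regions by passing the P(E) column with `window=5` and `needed=3`.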
Scan peptide for β-sheet regions
4. Identify regions where 3 out of 5 residues have P(E) > 100: “β-sheet nucleus”
Extend the β-sheet until 4 contiguous residues have an average P(E) < 100
If the region average P(E) > 105 and the average P(E) > average P(H), then “β-sheet”
          T    S    P    T    A    E    L    M    R    S    T    G
P(H)      83   77   57   83  142  151  121  145   98   77   83   57
P(E)     119   75   55  119   83   37  130  105   93   75  119   75
Chou-Fasman
1. Assign all of the residues in the peptide the appropriate set of parameters.
2. Scan through the peptide and identify regions where 4 out of 6 contiguous residues have P(α-helix) > 100. That region is declared an α-helix. Extend the helix in both directions until a set of four contiguous residues with an average P(α-helix) < 100 is reached. That is declared the end of the helix. If the segment defined by this procedure is longer than 5 residues and the average P(α-helix) > P(β-sheet) for that segment, the segment can be assigned as a helix.
3. Repeat this procedure to locate all of the helical regions in the sequence.
4. Scan through the peptide and identify a region where 3 out of 5 of the residues have a value of P(β-sheet) > 100. That region is declared a β-sheet. Extend the sheet in both directions until a set of four contiguous residues with an average P(β-sheet) < 100 is reached. That is declared the end of the β-sheet. Any segment of the region located by this procedure is assigned as a β-sheet if the average P(β-sheet) > 105 and the average P(β-sheet) > P(α-helix) for that region.
5. Any region containing overlapping α-helical and β-sheet assignments is taken to be helical if the average P(α-helix) > P(β-sheet) for that region. It is a β-sheet if the average P(β-sheet) > P(α-helix) for that region.
6. To identify a turn at residue number j, calculate the following value:
p(t) = f(j) × f(j+1) × f(j+2) × f(j+3)
where f(j) is the f(i) value of residue j, f(j+1) is the f(i+1) value of residue j+1, f(j+2) is the f(i+2) value of residue j+2 and f(j+3) is the f(i+3) value of residue j+3. If: (1) p(t) > 0.000075; (2) the average value of P(turn) in the tetra-peptide is > 1.00 (i.e. > 100 on the scale of the Pij table); and (3) the averages for the tetra-peptide obey the inequality P(α-helix) < P(turn) > P(β-sheet), then a β-turn is predicted at that location.
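Step 6 can be written down almost literally. The sketch below is illustrative only: the parameter table repeats the Chou-Fasman Pij values (keyed by one-letter code), the function names are hypothetical, and the P(turn) threshold is expressed as 100 because the table stores the propensities multiplied by 100.

```python
# (P(H), P(E), P(turn), f(i), f(i+1), f(i+2), f(i+3)) from the Chou-Fasman Pij table.
CF = {
    "A": (142, 83, 66, 0.060, 0.076, 0.035, 0.058),
    "R": (98, 93, 95, 0.070, 0.106, 0.099, 0.085),
    "D": (101, 54, 146, 0.147, 0.110, 0.179, 0.081),
    "N": (67, 89, 156, 0.161, 0.083, 0.191, 0.091),
    "C": (70, 119, 119, 0.149, 0.050, 0.117, 0.128),
    "E": (151, 37, 74, 0.056, 0.060, 0.077, 0.064),
    "Q": (111, 110, 98, 0.074, 0.098, 0.037, 0.098),
    "G": (57, 75, 156, 0.102, 0.085, 0.190, 0.152),
    "H": (100, 87, 95, 0.140, 0.047, 0.093, 0.054),
    "I": (108, 160, 47, 0.043, 0.034, 0.013, 0.056),
    "L": (121, 130, 59, 0.061, 0.025, 0.036, 0.070),
    "K": (114, 74, 101, 0.055, 0.115, 0.072, 0.095),
    "M": (145, 105, 60, 0.068, 0.082, 0.014, 0.055),
    "F": (113, 138, 60, 0.059, 0.041, 0.065, 0.065),
    "P": (57, 55, 152, 0.102, 0.301, 0.034, 0.068),
    "S": (77, 75, 143, 0.120, 0.139, 0.125, 0.106),
    "T": (83, 119, 96, 0.086, 0.108, 0.065, 0.079),
    "W": (108, 137, 96, 0.077, 0.013, 0.064, 0.167),
    "Y": (69, 147, 114, 0.082, 0.065, 0.114, 0.125),
    "V": (106, 170, 50, 0.062, 0.048, 0.028, 0.053),
}


def turn_probability(tetra):
    """p(t): product of f(i), f(i+1), f(i+2), f(i+3) taken from residues j..j+3."""
    p_t = 1.0
    for offset, aa in enumerate(tetra):
        p_t *= CF[aa][3 + offset]
    return p_t


def is_turn(seq, j):
    """Apply the three conditions of step 6 to the tetra-peptide starting at j."""
    tetra = seq[j:j + 4]
    if len(tetra) < 4:
        return False

    def mean(column):
        return sum(CF[aa][column] for aa in tetra) / 4

    p_helix, p_sheet, p_turn = mean(0), mean(1), mean(2)
    return (turn_probability(tetra) > 0.000075
            and p_turn > 100
            and p_helix < p_turn > p_sheet)


print(is_turn("NPDG", 0), is_turn("AELM", 0))
```

The tetra-peptide NPDG is a made-up example with high turn propensities throughout; AELM, taken from the helix region of the example peptide, fails all three conditions.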
Chou-Fasman Results
CHOFAS predicts protein secondary structure
version 2.0u61 September 1998
Please cite: Chou and Fasman (1974) Biochem., 13:222-245
Chou-Fasman plot of @, 12 aa; SEQ1 sequence.
TSPTAELMRSTG
helix <>
sheet EEEEEEE
turns T
Residue totals: H: 2 E: 7 T: 1
percent: H: 16.7 E: 58.3 T: 8.3
2nd Generation secondary structure prediction
Improvements
Larger database of protein structures
Segment-based statistics (11-21 residue window)
Basic idea:
"How likely is it that the central residue in a window adopts a particular
secondary structure state?"
Algorithm used:
Presumably all conceivable algorithms on this planet have been
applied to the Secondary Structure prediction problem.
E.g. statistical information, physicochemical properties, sequence
patterns, neural networks, graph theory, expert rules
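The window idea behind the second-generation methods can be illustrated with a few lines of code. This is only a sketch: the function name, the default window size of 13 (within the 11-21 range given above) and the use of "." as a padding symbol at the chain ends are assumptions.

```python
def sequence_windows(seq, size=13, pad="."):
    """Yield (window, central residue) pairs, one per residue of `seq`.

    The sequence is padded on both sides so that the first and last residues
    can also sit at the centre of a full-length window.
    """
    if size % 2 == 0:
        raise ValueError("window size must be odd so that a central residue exists")
    half = size // 2
    padded = pad * half + seq + pad * half
    return [(padded[i:i + size], seq[i]) for i in range(len(seq))]


for window, centre in sequence_windows("ACDEF", size=3):
    print(window, centre)
```

A predictor then answers the question above for every window: which state does the central residue most likely adopt?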
(H) α-Helix, local interactions
Neural Network
Artificial intelligence: computer programs are trained to recognize amino acid patterns that occur in known secondary structures and to distinguish them from patterns that do not occur in these structures
A NN can detect interactions between amino acids within a sequence window.
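Neural Networks for Secondary Structure prediction
[Figure (B. Rost, Columbia, New York): a sequence window is presented to a neural network consisting of an input layer (alphabet ACDEFGHIKLMNPQRSTVWY.), a hidden layer and an output layer with the units H, E and L; the layers are connected by weights.]
Example window with observed states:
D (L), R (E), Q (E), G (E), F (E), V (E), P (E), A (H), A (H), Y (H), V (E), K (E), K (E)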
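How a sequence window might be presented to the input layer can be sketched with a simple one-hot encoding over the 21-symbol alphabet shown in the figure (20 amino acids plus "."). The encoding scheme and the function name are assumptions for illustration; the slides do not specify the exact input coding of the network.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY."   # alphabet from the figure; "." is the extra symbol


def one_hot(window):
    """Encode each residue of the window as 21 input units, exactly one of them set to 1."""
    units = []
    for residue in window:
        block = [0] * len(ALPHABET)
        block[ALPHABET.index(residue)] = 1
        units.extend(block)
    return units


window = "DRQGFVPAAYVKK"   # the 13-residue window shown in the figure
inputs = one_hot(window)
print(len(inputs), sum(inputs))
```

For the 13-residue window of the figure this gives 13 × 21 = 273 input units, 13 of which are switched on.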
Neural Networks for secondary structure predictions
[Figure (B. Rost, Columbia, New York): for the window D (L), R (E), Q (E), G (E), F (E), V (E), P (E), A (H), A (H), Y (H), V (E), K (E), K (E), the output units give H = 0.19, E = 0.61, L = 0.17.]
The winner is: E
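The "winner takes all" decision on the three output units is just a maximum over the output values. A minimal sketch using the numbers from the figure; the function name is illustrative.

```python
def pick_state(outputs):
    """Return the state whose output unit has the highest value."""
    return max(outputs, key=outputs.get)


outputs = {"H": 0.19, "E": 0.61, "L": 0.17}   # output values shown in the figure
print(pick_state(outputs))
```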
Neural Networks
Benefits:
Generally applicable
Can capture higher-order correlations
Can use inputs other than sequence information
Drawbacks:
Need many data points (solved structures)
Risk of overtraining
2nd Generation secondary structure prediction
Methods:
GORIII
COMBINE
Q3 accuracy < 70%
Problems with first and second generation methods
Q3 accuracy < 70%
β-strands: only 28 - 48 % predicted correctly (slightly better than random)
Predicted helices and strands are too short
3rd Generation secondary structure prediction
Breakthrough: Using evolutionary information
           1                                                   50
fyn_human  VTLFVALYDY EARTEDDLSF HKGEKFQILN SSEGDWWEAR SLTTGETGYI
yrk_chick  VTLFIALYDY EARTEDDLSF QKGEKFHIIN NTEGDWWEAR SLSSGATGYI
fgr_human  VTLFIALYDY EARTEDDLTF TKGEKFHILN NTEGDWWEAR SLSSGKTGCI
yes_chick  VTVFVALYDY EARTTDDLSF KKGERFQIIN NTEGDWWEAR SIATGKTGYI
src_avis2  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_aviss  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_avisr  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_chick  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
stk_hydat  VTIFVALYDY EARISEDLSF KKGERLQIIN TADGDWWYAR SLITNSEGYI
src_rsvpa  .......... ESRIETDLSF KKRERLQIVN NTEGTWWLAH SLTTGQTGYI
hck_human  ..IVVALYDY EAIHHEDLSF QKGDQMVVLE ES.GEWWKAR SLATRKEGYI
blk_mouse  ..FVVALFDY AAVNDRDLQV LKGEKLQVLR .STGDWWLAR SLVTGREGYV
hck_mouse  .TIVVALYDY EAIHREDLSF QKGDQMVVLE .EAGEWWKAR SLATKKEGYI
lyn_human  ..IVVALYPY DGIHPDDLSF KKGEKMKVLE .EHGEWWKAK SLLTKKEGFI
lck_human  ..LVIALHSY EPSHDGDLGF EKGEQLRILE QS.GEWWKAQ SLTTGQEGFI
ss81_yeast .....ALYPY DADDDdeISF EQNEILQVSD .IEGRWWKAR R.ANGETGII
abl_mouse  ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
abl1_human ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
src1_drome ..VVVSLYDY KSRDESDLSF MKGDRMEVID DTESDWWRVV NLTTRQEGLI
mysd_dicdi .....ALYDF DAESSMELSF KEGDILTVLD QSSGDWWDAE L..KGRRGKV
yfj4_yeast ....VALYSF AGEESGDLPF RKGDVITILK ksQNDWWTGR V..NGREGIF
abl2_human ..LFVALYDF VASGDNTLSI TKGEKLRVLG YNQNGEWSEV RSKNG.QGWV
tec_human  .EIVVAMYDF QAAEGHDLRL ERGQEYLILE KNDVHWWRAR D.KYGNEGYI
abl1_caeel ..LFVALYDF HGVGEEQLSL RKGDQVRILG YNKNNEWCEA RlrLGEIGWV
txk_human  .....ALYDF LPREPCNLAL RRAEEYLILE KYNPHWWKAR D.RLGNEGLI
yha2_yeast VRRVRALYDL TTNEPDELSF RKGDVITVLE QVYRDWWKGA L..RGNMGIF
abp1_sacex .....AEYDY EAGEDNELTF AENDKIINIE FVDDDWWLGE LETTGQKGLF
3rd Generation secondary structure prediction
PHD method (Rost and Sander)
Combine neural networks with MAXHOM multiple sequence profiles
6-8 Percentage points increase in prediction accuracy over standard neural networks
3rd Generation secondary structure prediction
[Figure (B. Rost, Columbia, New York): PHD network. The protein and its alignments are converted into a profile table that feeds the input layer; the first or hidden layer and the second or output layer (units H, E, L) follow; the maximal output unit is picked => current prediction. Figure labels: s0, s1, s2, J1, J2; "corresponds to the 21*3 bits coding for the profile of one residue".]
Protein     :GYIY DPEDGDPDDGVNP GTDF:
Alignments  :GYIY DPAVGDPDNGVEP GTEF:
            :GYIY DPEVGDPTQNIPP GTKF:
            :GYEY DPAEGDPDNGVKP GTSF:
            :GYEY DPAEGDPDNGVKP GTAF:
Profile table (one row per residue of the protein; counts over the protein and its alignments)
    G S A P D  N T E K Q  C V H I R  L M Y F W
G   5 . . . .  . . . . .  . . . . .  . . . . .
Y   . . . . .  . . . . .  . . . . .  . . 5 . .
I   . . . . .  . . 2 . .  . . . 3 .  . . . . .
Y   . . . . .  . . . . .  . . . . .  . . 5 . .
D   . . . . 5  . . . . .  . . . . .  . . . . .
P   . . . 5 .  . . . . .  . . . . .  . . . . .
E   . . 3 . .  . . 2 . .  . . . . .  . . . . .
D   . . . . 1  . . 2 . .  . 2 . . .  . . . . .
G   5 . . . .  . . . . .  . . . . .  . . . . .
D   . . . . 5  . . . . .  . . . . .  . . . . .
P   . . . 5 .  . . . . .  . . . . .  . . . . .
D   . . . . 4  . 1 . . .  . . . . .  . . . . .
D   . . . . 1  3 . . . 1  . . . . .  . . . . .
G   4 . . . .  1 . . . .  . . . . .  . . . . .
V   . . . . .  . . . . .  . 4 . 1 .  . . . . .
N   . . . 1 .  1 . 1 2 .  . . . . .  . . . . .
P   . . . 5 .  . . . . .  . . . . .  . . . . .
G   5 . . . .  . . . . .  . . . . .  . . . . .
T   . . . . .  . 5 . . .  . . . . .  . . . . .
D   . 1 1 . 1  . . 1 1 .  . . . . .  . . . . .
F   . . . . .  . . . . .  . . . . .  . . 5 . .
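The profile table in the figure can be reproduced by counting, for every alignment column, how often each amino acid occurs. The sketch below uses the protein and the four aligned sequences from the figure and the column order GSAPD NTEKQ CVHIR LMYFW; the function and variable names are illustrative, and real profiles are usually normalized or weighted rather than kept as raw counts.

```python
from collections import Counter

COLUMNS = "GSAPDNTEKQCVHIRLMYFW"   # column order of the profile table in the figure

# Protein (first entry) and its four aligned sequences, as shown in the figure.
ALIGNMENT = [
    "GYIYDPEDGDPDDGVNPGTDF",
    "GYIYDPAVGDPDNGVEPGTEF",
    "GYIYDPEVGDPTQNIPPGTKF",
    "GYEYDPAEGDPDNGVKPGTSF",
    "GYEYDPAEGDPDNGVKPGTAF",
]


def profile(alignment, columns=COLUMNS):
    """One row per alignment position; each row holds the residue counts per column."""
    rows = []
    for position in zip(*alignment):
        counts = Counter(position)
        rows.append([counts.get(aa, 0) for aa in columns])
    return rows


for residue, row in zip(ALIGNMENT[0], profile(ALIGNMENT)):
    print(residue, " ".join(str(n) if n else "." for n in row))
```

Conserved positions (for example the leading G) show a single count of 5, while variable positions spread their counts over several columns; this is the evolutionary information the network receives in addition to the sequence itself.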
3rd generation secondary structure prediction
PHD (Rost et al.): Q3 better than 72 %
[ B. Rost (2001) J. Struct. Biol. 134, 204 ]
[Figure: Q3 values of 59 %, 65 % and 72 %; prediction reliability (0 = weak, 9 = strong)]
[http://www.embl-heidelberg.de/predictprotein/]
3rd generation secondary structure prediction
PSI-Pred (Jones, DT)
Uses alignments from iterative sequence searches (PSI-Blast) as input to a neural network
Better predictions due to better sequence profiles
Available as a stand-alone program and via the web
[http://bioinf.cs.ucl.ac.uk/psipred/psiform.html]
How accurate are predictions today?
[Figure (B. Rost, Columbia, New York): histogram of the number of protein chains versus per-residue accuracy (Q3); <Q3> = 72.3 %, sigma = 10.5 %; labelled chains: 1spf, 1bct, 1stu, 3ifm, 1psm]
How accurate are predictions today?
Q3 = 72-76 % ± 11 % (on average)
I.e. roughly 25-30 % of predicted assignments are wrong
I.e. for 2/3 of all proteins, between 60 % and 80 % of residues are predicted correctly
I.e. for your protein, accuracy can be lower than 60 % or higher than 80 %
How accurate are predictions today?
At present it is not always possible to predict secondary structure with very high reliability
As methods have improved (from the 1st to the 3rd generation), prediction has reached an average accuracy of 64%-75%
Secondary Structure Prediction
META-PredictProtein Server
http://cubic.bioc.columbia.edu/meta/
Simultaneous submission tool to several other servers, e.g. JPRED, PHD, PROF, PSIPRED, SAM-T99, APSSP2, SSpro
Includes also motif searches, domain assignments, TM predictions, etc.
1D-Structure prediction
Secondary Structure Prediction
Solvent Accessibility Prediction
Identify exposed residues, e.g. for mutation
studies, epitopes, etc.
1D-Structure prediction
Projection onto strings of structural assignments
E.g. “Solvent Accessibility” (buried or exposed?)
A B C D E F G …
¦ ¦ ¦ ¦ ¦ ¦ ¦
e e b b e e e …
Accuracy of two-state prediction: 75 % ± 10 %
PHDacc: solvent accessibility prediction
[http://cubic.bioc.columbia.edu/predictprotein/]
1D-Structure prediction
Secondary Structure Prediction
Solvent Accessibility Prediction
Transmembrane Helices prediction
PHDhtm [http://www.embl-heidelberg.de/predictprotein/predictprotein.html]
TMHMM [http://www.cbs.dtu.dk/services/TMHMM/]
TMpred [http://www.ch.embnet.org/software/TMPRED_form.html]
Fold Recognition
[ PDB: http://www.pdb.org ]
Growth of the Protein Data Bank PDB
Christine Orengo (Structure, 1997, 5, 1093-1108)
Fold Classification Databases
Protein structure classification databases
Databases: provide structural comparisons for the proteins
in PDB
Methods used to classify the protein structures:
Manual examination
Fully automatic computer algorithms
Examples:
SCOP
CATH
FSSP
[ http://scop.mrc-lmb.cam.ac.uk/scop/ ]
SCOP - Structural Classification of Proteins
MRC Cambridge UK: A. Murzin, S. E. Brenner, T. Hubbard, C. Chothia
Created by manual inspection
Hierarchical classification of protein domain structures
Comprehensive description of the structural and evolutionary relationships
Organized as a tree structure:
Class        all α class
Fold         globin-like fold (6 helices; folded leaf)
Superfamily  globin-like superfamily
Family       globin and phycocyanin families
Domain       hemoglobin 1, myoglobin, …
Species
Domain = segment of a polypeptide chain that can autonomously fold into a 3D structure
[ http://www.biochem.ucl.ac.uk/bsm/cath_new/ ]
CATH - Protein Structure Classification
UCL, Janet Thornton & Christine Orengo
Hierarchical classification of protein domain structures
clusters proteins at four major levels:
Class (C)
Architecture(A)
Topology(T)
Homologous superfamily (H)
Class (C)
derived from secondary structure content; assigned automatically
Architecture (A)
describes the gross orientation of secondary structures, independent of connectivity
Topology (T)
clusters structures according to their topological connections and numbers of secondary structures
FSSP - Fold classification based on Structure-Structure alignment of Proteins
Holm and Sander, EBI, UK
Fold classification based on pair-wise structural alignment of PDB. (DALI program)
Clusters of fold types = unique configuration of secondary structure elements
[http://www2.ebi.ac.uk/dali/fssp/]
Structural Alignments
Protein Structure is better conserved than sequence
Structural alignments establish equivalences between amino acid
residues based on the 3D structures of two or more proteins
Structure alignments therefore provide information not available
from sequence alignment methods
Structural alignments can be used to guide sequence alignments
(see: T_COFFEE / SAP)
See lecture on Thursday (Laurent): Sequence alignment
[ PDB: http://www.pdb.org ]
Growth of the Protein Data Bank PDB
[Figure: new folds per year versus “old” folds per year]
The number of folds appears to be limited
Many different sequences will adopt the same fold:
There is a reasonable probability that a new sequence will possess an already identified fold
Goal of fold recognition: discover which fold is the best match
Sequence alignment methods (e.g. HMM)
3D structure prediction methods (e.g. threading)
Find a compatible fold for a given sequence ....
>Protein XY
MSTLYEKLGGTTAVDLAVDKFYERVLQDDRIKHFFADVDMAKQRAHQKAFLTYAFGGTDKYDGRYMREAHKELVENHGLNGEHFDAVAEDLLATLKEMGVPEDLIAEVAAVAGAPAHKRDVLNQ
≈?
Fold recognition
The number of protein folds that occur in nature is limited. Fold recognition
can be used to:
Identify templates for modeling
Assign protein function
Fold recognition: sequence based
Sequence alignment (HMM) can be used to identify a family of homologous proteins that have similar sequences and presumably a similar 3D structure
E.g. the Superfamily database uses a library (covering all proteins of known structure) consisting of 1294 SCOP superfamilies, each of which is represented by a group of hidden Markov models (HMMs)
[http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/]
Fold recognition: threading
The amino acid sequence of a query protein is examined for compatibility with the structural core of known protein structures:
Structure profile method (e.g. 3D-PSSM)
Contact potential method (e.g. 123D)
Fold recognition methods
3DPSSM
Three-dimensional
position specific
scoring matrix
Kelley et al, JMB, 299, 499 (2000)
http://www.sbg.bio.ic.ac.uk/~3dpssm/
Fold recognition and Function
Some words of warning concerning fold recognition:
There is no simple close association of fold and function in a one-to-one sense.
The five most versatile folds (TIM-barrel, alpha-beta hydrolase, Rossmann, P-loop containing NTP hydrolase, ferredoxin fold), accommodate from six to as many as 16 functions.
The two most versatile enzymatic functions (hydrolases and O-glycosyl glucosidases) are associated with seven folds each.
Functional assignment by fold recognition?
[Figure: reaction schemes of aspartase [1JSW] and histidase [1B8F], shown together with δ2-crystallin [1AUW], an avian eye lens protein]
Fold Recognition Servers
Meta server: http://bioinfo.pl/meta/
3DPSSM: http://www.sbg.bio.ic.ac.uk/servers/3dpssm/
GenTHREADER: http://bioinf.cs.ucl.ac.uk/psipred/
FUGUE2: http://www-cryst.bioc.cam.ac.uk/~fugue/prfsearch.html
SAM: http://www.cse.ucsc.edu/research/compbio/HMM-apps/T99-query.html
FOLD: http://fold.doe-mbi.ucla.edu/
FFAS/PDBBLAST: http://bioinformatics.burnham-inst.org/