Post on 18-Dec-2015
Local Statistical Dependencies in Protein Structure: Discovery, Evaluation, Prediction and Applications
Advancement to Candidacy
Computer Science Department
by Rachel Karchin
Advisor: Kevin Karplus
2
Outline
Protein structure - primary, secondary, tertiary
Fold recognition, local and secondary structure
Alphabets of local structure
Designing and evaluating local structure alphabets
Improving fold recognition
3
Molecular structure of proteins
Proteins are large, organic molecules composed of smaller molecules called amino acids.
Ball-and-stick atomic model of crambin, a plant seed protein with 46 amino acids
threonine cysteine arginine
4
The amino acids
There are 20 kinds of amino acids found in natural proteins.
All share a common structure.
Biochemistry, Mathews, 3rd ed., Addison-Wesley
R side chain
carboxyl group
amine group
alpha carbon (with attached hydrogen)
5
Primary structure
Proteins consist of one or more polypeptide chains of amino acids connected by peptide bonds.
The sequence of linked amino acids along the chain is called the protein’s primary structure.
Phe-Leu-Ser-Cys . . .
FLSC . . .
Access Excellence NHGRI Graphics Gallery
6
Secondary structure
Symmetric patterns of hydrogen bonds between amino acids.
Anthony Day/Pace et al. 1996
Helix. H-bonds between residues close in primary sequence.
7
Secondary structure
Strand. H-bonds between residues not close in primary sequence.
Anthony Day/Pace et al. 1996
8
Protein Folding
In an aqueous environment (such as cell cytoplasm), polypeptide chains fold into 3D shapes (tertiary structure).
9
From primary to tertiary structure
A protein’s 3D shape is determined by its primary amino acid sequence (Anfinsen et al. 1963).
Predicting tertiary structure from amino acid sequence is an unsolved problem.
– Difficult to model the energies that stabilize a protein molecule.
– Conformational search space is enormous.
Laboratory of Molecular Biophysics, University of Oxford
10
Fold recognition
In nature, proteins are observed to assume on the order of a thousand shapes or “folds”.
Biochemistry Mathews, 3ed. AddisonWesley
11
Fold recognition
Given an amino acid sequence target:
– search a set of known folds by aligning target and a template fold representative
– predict the fold that gets the best scoring alignment
Target amino acid sequence: YLAADTYK
Template fold library
Template amino acid sequence   FISSETCN  MEPSSYV  TGLIRKN
Target/template score          7         21       2
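The search on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the scoring used in the talk: it assumes a plain Needleman-Wunsch global alignment with made-up scores (match +2, mismatch -1, gap -2), and the names `align_score` and `recognize_fold` are hypothetical. The scores it produces will not match the illustrative scores on the slide.

```python
def align_score(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment score (Needleman-Wunsch) with a linear gap cost."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                           prev[j] + gap,       # gap in b
                           cur[j - 1] + gap))   # gap in a
        prev = cur
    return prev[-1]


def recognize_fold(target, library):
    """Score the target against every template; return (best_name, all_scores)."""
    scores = {name: align_score(target, seq) for name, seq in library.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Template names are placeholders; the sequences are the ones on the slide.
library = {"fold_1": "FISSETCN", "fold_2": "MEPSSYV", "fold_3": "TGLIRKN"}
best, scores = recognize_fold("YLAADTYK", library)
print(best, scores)
```

The predicted fold is simply the template with the highest alignment score, which is why the quality of the scoring function dominates the method.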
12
Twilight zone sequence relationships
This method is very effective when target and template have > 30% sequence identity.
Approximately 1/3 of protein sequences can be assigned folds and modeled this way.
We would like to extend the method to sequences in the twilight zone (< 30% identity to any sequence of known structure).
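The 30% threshold refers to sequence identity between two aligned sequences. A minimal sketch of that measurement follows, assuming identity is counted over columns where both sequences have a residue (one of several conventions in use). The function names are hypothetical and the example strings are for illustration only.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != gap and y != gap]
    if not pairs:
        return 0.0
    same = sum(1 for x, y in pairs if x == y)
    return 100.0 * same / len(pairs)


def in_twilight_zone(aligned_a, aligned_b, threshold=30.0):
    """True when the pair falls below the identity threshold."""
    return percent_identity(aligned_a, aligned_b) < threshold


print(percent_identity("YLAADTYK", "YLASDS-R"))
print(in_twilight_zone("YLAADTYK", "FISTE-HR"))
```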
13
SAM-T98
Build a target HMM of amino acid frequencies from a multiple alignment of target plus homologs (SAM-T98).
Target amino acid sequence: YLAADTYK
Search protein database for homologs
Multiple alignment:
YLAADTYK
FISTE-HR
HVATD-H-
-ITA--HR
YLASDS-R
Target amino acid HMM
Courtesy of K. Karplus
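The per-column amino acid frequencies behind a target HMM can be illustrated with a simple profile. This sketch is a deliberate simplification and not SAM-T98 itself: it has no insert or delete states, no transition probabilities, and no priors or sequence weighting. The function name `column_frequencies` is hypothetical.

```python
from collections import Counter


def column_frequencies(alignment, gap="-"):
    """Per-column residue frequencies, ignoring gap characters."""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all rows must have the same length")
    profile = []
    for col in range(width):
        counts = Counter(row[col] for row in alignment if row[col] != gap)
        total = sum(counts.values())
        # An all-gap column yields an empty distribution.
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


# The multiple alignment shown on the slide.
alignment = ["YLAADTYK", "FISTE-HR", "HVATD-H-", "-ITA--HR", "YLASDS-R"]
profile = column_frequencies(alignment)
print(profile[0])
```

Each column of the alignment becomes one probability distribution over amino acids, which is the role a match state's emission table plays in a profile HMM.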
14
SAM-T98
Amino acid HMM for target. Amino acid strings for templates.
Three-fold increase in recognizing twilight zone similarities (Park et al. 1998).
Target amino acid HMM
Template fold library
Template amino acid sequence   FISSETCN  MEPSSYV  TGLIRKN
Target/template score          7         21       2
Courtesy of K. Karplus
15
SAM-T98 enhancements
Two-way scoring.
Augment the method with secondary structure information.
16
Two-way SAM-T98
Also build amino acid HMMs for templates. Do 2-way scoring to strengthen recognition of twilight zone relationships.
Target amino acid sequence: YLAADTYK
Template fold library: template amino acid HMMs
Target/template score: 19 82 31
17
Secondary structure
DSSP alphabet (Kabsch and Sander 1983). Classifies the secondary structure of a residue using known tertiary structure.
H  alpha helix
E  beta strand
I  pi helix
G  3-10 helix
T  turn
S  bend
B  bridge
C  random coil
Figure groupings: basic patterns, repeating turns, repeating bridges, other.
Biochemistry, Mathews, 3rd ed., Addison-Wesley
18
Secondary structure
Alternatives to DSSP definitions.
– Collapse 8 classes to 3: H, E, C
– Other programs to automate assignment:
• Richards and Kundrot (1988) Define
• Sklenar (1989) P-Curve
• Adzhubei and Sternberg (1993)
• Frishman and Argos (1995) STRIDE
• King and Johnson (1999) xlsstr
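Collapsing the 8 DSSP classes to 3 is a simple lookup. The slide does not say which mapping was used, so the table below is an assumption: it follows one common convention (helices H, G, I to H; strand and bridge E, B to E; everything else to C). The names `DSSP_TO_3` and `collapse_dssp` are hypothetical.

```python
# One common 8-to-3 reduction; other conventions exist (an assumption here).
DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strand and isolated bridge
    "T": "C", "S": "C", "C": "C",   # turn, bend, coil
}


def collapse_dssp(ss8):
    """Map an 8-state DSSP string to 3 states; unknown symbols become coil."""
    return "".join(DSSP_TO_3.get(ch, "C") for ch in ss8)


print(collapse_dssp("CCEEBTTHHHHGGGSC"))  # prints CCEEECCHHHHHHHCC
```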
19
Predicting secondary structure
Extensive research on predicting secondary structure from primary sequence.
Neural nets are the most successful approach.
– PHD (Rost and Sander 1996)
– Predict_2nd (Karplus and Barrett 1998)
The best methods are around 75-80% accurate.
20
Secondary structure and fold recognition
Predicted secondary structure has been shown to be useful for fold recognition (Russell et al. 1998).
Fold recognition accuracy is correlated with secondary structure prediction accuracy (Di Francesco 1995, 1997, 1999).
Why?
– Structure is more conserved than sequence.
– Proteins in the same fold family have similar topologies (secondary structure elements have similar lengths, spatial organization and connectivities).
21
Two-track SAM-T2K
Predicted probability vectors of secondary structure are added to the target HMM.
Target amino acid sequence: YLAADTYK
Multiple alignment:
YLAADTYK
FISTE-HR
HVATD-H-
-ITA--HR
Target two-track HMM:
     P(H)   P(E)   P(C)
Y    0.65   0.2    0.15
L    0.15   0.7    0.25
A    0.01   0.04   0.9
A    0.47   0.45   0.08
D    0.85   0.1    0.05
T    0.32   0.18   0.5
Y    0.81   0.09   0.1
K    0.5    0.25   0.15
Courtesy of C. Barrett
Courtesy of K. Karplus
22
Two-track SAM-T2K
Search template library of sequence pairs with two-track target HMM.
Target two-track HMM
Template fold library: templates with sequence pairs (amino acid, secondary structure)
FISSETCN  CCEECHHH
MEPSSYV   HHHHCCE
TGLIRKN   EEECEEE
Target/template score: 22 68 15
Courtesy of K. Karplus
23
Motivation for alternatives to secondary structure classes
What’s wrong with secondary structure classes?
– The most widely used secondary structure alphabet (3-state DSSP) is crude (Helix, Strand, Coil).
– Secondary structure classes are ambiguous.
• Automated assignment methods disagree.
• 63% agreement between DSSP, Define and P-Curve (Colloc’h et al. 1993).
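Agreement between assignment methods can be measured per residue. The slide does not give the exact definition behind the 63% figure, so this sketch assumes the simplest one: the fraction of positions where all methods assign the same class. The function name and the three strings are made up for illustration.

```python
def agreement(*assignments):
    """Fraction of positions where every assignment string has the same label."""
    if not assignments:
        raise ValueError("need at least one assignment string")
    length = len(assignments[0])
    if any(len(s) != length for s in assignments):
        raise ValueError("assignment strings must have equal length")
    if length == 0:
        return 0.0
    agree = sum(1 for column in zip(*assignments) if len(set(column)) == 1)
    return agree / length


# Invented three-state assignments for one 15-residue protein.
method_a = "CCHHHHHCCEEEECC"
method_b = "CCCHHHHCCEEEECC"
method_c = "CCHHHHCCCEEECCC"
print(agreement(method_a, method_b, method_c))
```

Disagreements in this toy example fall at the ends of helices and strands, where boundary placement is most sensitive to the assignment criteria.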
24
Local structure and fold recognition
What is local structure?
– describes environment of a residue
– a residue’s relationship to neighbors
Can use this information to predict fold from primary structure.
Requires comparing local structure of target and template.
Known
Must predict (easier than 3D)
25
Low level descriptions of local structure
Lowest level representation of protein structure - atomic position vectors.
      Atom  Atom  Residue  Residue  Position vector
      No.   Type  Type     No.      X        Y        Z
ATOM  1     CA    THR      1        7.047    14.099   3.625
ATOM  2     C     THR      1        16.967   12.784   4.338
ATOM  3     O     THR      1        15.685   12.755   5.133
ATOM  4     N     SER      2        15.115   11.555   5.265
ATOM  5     CA    SER      2        13.856   11.469   6.066
ATOM  6     C     SER      2        14.164   10.785   7.379
ATOM  7     O     SER      2        14.993   9.862    7.443
ATOM  8     CB    SER      2        12.732   10.711   5.261
ATOM  9     N     CYS      3        13.488   11.241   8.417
ATOM  10    CA    CYS      3        13.660   10.707   9.787
Conformations of Biopolymers, IUPAC-IUB
26
Low level descriptions of local structure
“One level up”. From atomic position vectors we can derive a list of properties that describe a residue’s local environment.
Conformations of Biopolymers, IUPAC-IUB
27
Dihedral and bond angles
Dihedral angles are defined by 4 atoms.
Bond angles are defined by 3 atoms.
Conformations of Biopolymers, IUPAC-IUB
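Both definitions can be computed directly from atomic position vectors. The sketch below uses the standard vector formulas with only the Python standard library. The function names are hypothetical, and the sign convention for the dihedral (positive for a clockwise rotation of the far bond when looking along the central bond) is the usual one but is stated here as an assumption.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def bond_angle(a, b, c):
    """Angle in degrees at the middle atom b, defined by three atoms a-b-c."""
    u, v = _sub(a, b), _sub(c, b)
    cos_t = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def dihedral(p0, p1, p2, p3):
    """Signed angle in degrees between planes (p0, p1, p2) and (p1, p2, p3)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    x = _dot(n1, n2)
    y = _dot(b2, _cross(n1, n2)) / math.sqrt(_dot(b2, b2))
    return math.degrees(math.atan2(y, x))


print(bond_angle((1, 0, 0), (0, 0, 0), (0, 1, 0)))
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Applied to consecutive backbone atoms, the same `dihedral` function would give phi, psi and omega, since each is defined by four atoms.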
28
Dihedral angles: Phi, Psi, Omega
The 6 atoms in each peptide unit lie in the same plane.
ω = 180 (trans) or 0 (cis)
φ and ψ are free to rotate
Biochemistry, Mathews, 3rd ed., Addison-Wesley
29
Dihedral angles: Phi, Psi, Omega
Result: a good approximation of the polypeptide backbone is a list of (φ, ψ) pairs (ω cis is rare).
(φ, ψ) pairs are often represented on a plane called the Ramachandran plot.
http://www.biochem.artizona.edu
Biochemistry 462A Lecture Notes
30
A small gallery of properties: the geometry of local structure
Kappa. Virtual bond angle between Cα of residues i-2, i, i+2
Alpha. Virtual dihedral angle between Cα of residues i-1, i, i+1, i+2
Tau. Virtual bond angle between Cα of residues i-1, i, i+1
Zeta. Dihedral angle between carbonyl bonds of residues i and i-1
31
Relationship of a residue to its neighbors
Density measures. How many residues are within a given distance?
Count of H-bond partners.
Example: 12 neighboring residues within 6 Å radius, 2 H-bond partners
32
Existing local structure alphabets
Approximately 30 alphabets of local structure in the literature.
Can they be used to improve fold recognition?
33
Phi/psi alphabets
Classes based on partition of phi/psi space
Bystroff et al. 2000. 10 classes: B E b d e G H L I x
Kang et al. 1993. 1296 classes: uniform partitioning by 10°
Sun et al. 1996. DSSP H, E plus 5 phi/psi classes: a b e l t
Bystroff et al. 2000
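A uniform partition of phi/psi space is easy to express as a grid lookup. This sketch assumes "by 10" means 10-degree bins, which gives 36 x 36 = 1296 cells and so agrees with the class count on the slide. The class numbering and the function name are arbitrary choices made here, not taken from the cited paper.

```python
def phi_psi_class(phi, psi, bin_size=10):
    """Grid-cell index for a (phi, psi) pair given in degrees."""
    bins = 360 // bin_size

    def bin_index(angle):
        wrapped = (angle + 180.0) % 360.0      # map into [0, 360)
        return min(int(wrapped // bin_size), bins - 1)

    return bin_index(phi) * bins + bin_index(psi)


print((360 // 10) ** 2)             # number of classes: 1296
print(phi_psi_class(-57.0, -47.0))  # a typical alpha-helical pair
```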
34
Backbone fragment alphabets
Classes based on clustering low-level properties of contiguous series of residues.
Unger et al. 1987. ~100 6-residue fragments
k-nearest neighbor clustering by RMSD of Cα atoms
Centroid of each cluster selected as building block
Unger et al. 1987
35
Backbone fragment alphabets
De Brevern et al. 2000. Protein Building Blocks (PBBs).
16 classes of 5-residue fragments.
SOM clustering of vectors of 8 dihedral angles (φ and ψ).
De Brevern et al. 2000
36
Desired properties of local structural alphabets
For purposes of improving fold recognition:
– Predictable from primary sequence
– Conserved within a fold family
37
Comparison of existing local structure alphabets
Only a few of the alphabets have been tested for predictability.
None of the alphabets has been tested for conservation within fold families.
38
Designing a Local Structure Alphabet
Extract properties with respect to each residue in the dataset.
Selected property: TCO (defined on residues i-1 and i)
Selected PDB structures, then property extraction:
PDBNo  AA  TCO
1      M   -0.3
2      L   -0.34
3      S   0.91
4      P   0.935
5      E   -0.1
6      V   0.2
..
39
Designing a Local Structure Alphabet
Partition the data into k populations.
Input to the unsupervised learning algorithm:
PDBNo  AA  TCO
1      M   -0.3
2      L   -0.34
3      S   0.91
4      P   0.935
5      E   -0.1
6      V   0.2
..
Class A:
PDBNo  AA  TCO
1      M   -0.3
2      L   -0.34
5      E   -0.1
Class B:
PDBNo  AA  TCO
3      S   0.91
4      P   0.935
6      V   0.2
[Figure: TCO values on an axis from -1 to 1, marked X for Class A and O for Class B]
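Partitioning one-dimensional property values into k populations can be illustrated with k-means. The slide does not name the unsupervised learning algorithm, so k-means (Lloyd's algorithm) is an assumption made for this sketch, as are the function name and the deterministic starting centroids.

```python
def kmeans_1d(values, k=2, iterations=100):
    """Lloyd's algorithm on scalars; returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread initial centroids across the sorted data.
    centroids = [ordered[(len(ordered) - 1) * i // max(k - 1, 1)]
                 for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assign each value to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:   # converged
            break
        centroids = updated
    return centroids, labels


tco = [-0.3, -0.34, 0.91, 0.935, -0.1, 0.2]   # TCO values from the slide
centroids, labels = kmeans_1d(tco, k=2)
print(centroids, labels)
```

On these six values this algorithm groups 0.2 with the negative values, whereas the slide's illustration places it in Class B. The resulting alphabet depends on which learning algorithm is chosen.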
40
Designing a Local Structure Alphabet
Selected property: KJ descriptor vector*: [ζ, τ, d1, d2, d3]
ZETA, TAU
D1 dison3: H-bond length from O(i) to N(i+3)
D2 dison4: H-bond length from O(i) to N(i+4)
D3 discn3: length from C(i) to N(i+3)
* Descriptor vector of key geometric properties identified by King and Johnson 1999
[Figure: residue labels i-1, i, i+1, i+3, i+4 marking the atoms used for each property]
41
Designing a Local Structure Alphabet
Extract properties with respect to each residue in the dataset.
Selected property: KJ descriptor vector: [ζ, τ, d1, d2, d3]
Selected PDB structures, then property extraction:
PDBNo  AA  KJDV
1      M   [13.6, 92.9, 3.7, 3.1, 4.1]
2      L   [14.4, 95.7, 4.9, 7.1, 4.9]
3      S   [19.8, 100.3, 7.2, 10.1, 6.9]
4      P   [18.1, 116.2, 6.7, 9.2, 6.9]
...
42
Designing a Local Structure Alphabet
Clustering multi-dimensional data points.
PDBNo  AA  KJDV
1      M   [13.6, 92.9, 3.7, 3.1, 4.1]
2      L   [14.4, 95.7, 4.9, 7.1, 4.9]
3      S   [19.8, 100.3, 7.2, 10.1, 6.9]
4      P   [18.1, 116.2, 6.7, 9.2, 6.9]
...
Components are in different units. Scale to the same range?
Very high-dimensional vectors require feature reduction.
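One way to put components with different units on a common footing is min-max scaling of each component to [0, 1]. The slide only raises scaling as a question, so this is one possible answer, not the method chosen in the talk. The function name and the toy vectors are made up.

```python
def min_max_scale(vectors):
    """Rescale each component (column) of equal-length vectors to [0, 1]."""
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for vec in vectors:
        scaled.append([
            (x - lo) / (hi - lo) if hi > lo else 0.0   # constant column maps to 0
            for x, lo, hi in zip(vec, lows, highs)
        ])
    return scaled


# Invented vectors mixing an angle-like and two distance-like components.
toy_vectors = [[10.0, 90.0, 3.0], [20.0, 100.0, 7.0], [15.0, 120.0, 5.0]]
for row in min_max_scale(toy_vectors):
    print(row)
```

Without some such scaling, a Euclidean distance between descriptor vectors would be dominated by whichever component has the largest numeric range.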
43
Evaluation protocol
Protocol is based on:
– testing candidate alphabets for their conservation within fold families.
– testing predictability of candidate alphabets.
– testing improvements in fold recognition when candidate alphabets are used.
44
Evaluation Protocol: string translation
Selected PDB structures and selected alphabet are passed to a string builder.
Amino acid strings:
>2abd
MDAAVKTG
>4eca
MELVIRSG
. . .
Position-equivalent strings in new alphabet:
>2abd
CAAABCAB
>4eca
ACBBABCA
. . .
45
Evaluation Protocol: alignment translation
Fold family alignments and position-equivalent strings in new alphabet are passed to an alignment builder.
Fold family alignment:
MD-AAVKTG
ME-LVIRSG
M-SAGCRDK
MEA-SC-E-
Position-equivalent alignment in new alphabet:
CA-AABCAB
AC-BBABCA
C-AACCBBC
CCA-BB-A-
46
Evaluation Protocol: alphabet conservation
Position-equivalent alignments in new alphabet: conserved?
CA-AABCAB
AC-BBABCA
C-AACCBBC
CCA-BB-A-
Average entropy in columns of alignments.
Relative entropy of substitution matrix constructed from alignments (Altschul 91).
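The first conservation measure, average entropy in alignment columns, can be sketched as follows. The slide does not give details, so this assumes Shannon entropy in bits with gap characters ignored. The function names are hypothetical; the alignment is the one shown on the slide.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy (bits) of the non-gap letters in one alignment column."""
    letters = [ch for ch in column if ch != gap]
    if not letters:
        return 0.0
    total = len(letters)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(letters).values())


def average_entropy(alignment):
    """Mean column entropy; lower values mean a more conserved alphabet."""
    columns = list(zip(*alignment))
    return sum(column_entropy(col) for col in columns) / len(columns)


alignment = ["CA-AABCAB", "AC-BBABCA", "C-AACCBBC", "CCA-BB-A-"]
print(average_entropy(alignment))
```

A column where every sequence carries the same letter contributes zero, so an alphabet that is well conserved within a fold family yields a low average.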
47
Evaluation Protocol: alphabet predictability
Test predictability with Predict_2nd neural net.
Improve on neural net performance with alternate methods.
Position-equivalent strings in new alphabet: predictable?
Predicted probabilities: P(A) P(B) P(C)
Courtesy of C. Barrett
48
Evaluation Protocol: fold recognition
Build a fold library that incorporates the local structure alphabet and do fold recognition testing using this library.
49
Incorporating local structure alphabets into a fold library
Simplest. Use predicted local structure string for target and known local structure string for templates.
Target local structure string: ABBCACAB
Template fold library
Template local structure string   CCABBBAC  AACBCAA  CAACBBB
Target/template score             7         21       2
PROBLEM! Wrong letter predicted.
50
Incorporating local structure information into a fold library
Use several strings (amino acid and local structure) for target and templates.
Target with string tuple:
YLAADTYK
ABBCACAB
WYTZTTVU
Template fold library: templates with string tuples
FISSETCN  MEPSSYV  TGLIRKN
CCABBBAC  AACBCAA  CAACBBB
YVUUTZVV  TTYUVWZ  YUUUVZW
Target/template score: 6 23 5
PROBLEM! Wrong letters predicted.
51
Extending the SAM-T2K method with local structure information
Add tracks to the target HMM. Search template library of sequence tuples with multi-track target HMM.
Target multi-track HMM
Template fold library: templates with sequence tuples
FISSETCN  MEPSSYV  TGLIRKN
CCABBBAC  AACBCAA  CAACBBB
YVUUTZVV  TTYUVWZ  YUUUVZW
Target/template score: 75 3 22
52
Extending the SAM-T2K method with local structure information
Adding local structure strings to the template HMM. Enable 2-way HMM scoring.
Target:
YLAADTYK
ABBCACAB
WYTZTTVU
Target track probabilities:
     A      B      C
Y    0.65   0.2    0.15
L    0.15   0.7    0.25
A    0.01   0.04   0.9
A    0.47   0.45   0.08
D    0.85   0.1    0.05
T    0.32   0.18   0.5
Y    0.81   0.09   0.1
K    0.5    0.25   0.15
Template fold library: template amino acid HMMs plus local structure strings
CCABBBAC  AACBCAA  CAACBBB
YVUUTZVV  TTYUVWZ  YUUUVZW
Target/template score: 8 24 49
53
Extending the SAM-T2K method with local structure information
Build multi-track HMMs for target and template.
Target multi-track HMM
Template fold library: template multi-track HMMs
Target/template score: 6 23 5
54
Evaluation Protocol: fold recognition
Fold classification database
Non-redundant fold test set:
119l   T4 Lysozyme
12asA  Asparagine Synthetase
153l   Goose Lysozyme
16pk   Phosphoglycerate Kinase
16vpA  VP16 regulatory protein
. . .
Template fold library: 119l, 12asA, 153l, 16pk, 16vpA, . . .
Target: 119l
Templates:               12asA  153l  16pk
Target/template score:   12     2     71
55
Evaluation Protocol: fold recognition
[Figure, courtesy of K. Karplus: false positives (axis ticks 1 to 2000) plotted against true positives (axis ticks 500 to 10000), where a positive (+) means same fold. Methods compared: old PSI-blast, PSI-blast, SAM-T2K, SAM-T2K EHL 50-50, SAM-T2K EBGHTL 50-50, DALI.]
56
Research Schedule
Year 1: Find a local structure alphabet that improves fold recognition. Build a fold library that uses the alphabet. Put up a webserver for public use of the library.
Summer 2002: CASP5
57
Research Schedule
Year 2: Design more alphabets. Compare and combine new and existing alphabets. Expand the methods to continuous-value predictions. Incorporate best combination into my fold library.
June 2003: Produce completed dissertation.