data mining for protein structure prediction
DESCRIPTION
Data Mining for Protein Structure Prediction. Mohammed J. Zaki SPIDER Data Mining Project: S calable, P arallel and I nteractive D ata Mining and E xploration at R PI http://www.cs.rpi.edu/~zaki. Outline of the Talk. How do proteins form? Protein folding problem Contact map mining - PowerPoint PPT PresentationTRANSCRIPT
Data Mining for Protein Structure Prediction
Mohammed J. Zaki
SPIDER Data Mining Project: Scalable, Parallel and Interactive
Data Mining and Exploration at RPI
http://www.cs.rpi.edu/~zaki
Outline of the Talk
How do proteins form? Protein folding problem Contact map mining Using HMMs based on local motifs Mining “physical” dense frequent patterns
(non-local motifs) Future directions
Heuristic rules Folding pathways
How do Proteins Form?
Building Blocks of Biological Systems DNA (nucleotides, 4 types): information carrier/encoder RNA: bridge from DNA to protein Protein (amino acids, 20 types): action molecules.
Processes Replication of DNA Transcription of gene (DNA) to messenger RNA (mRNA)
Splicing of non-coding regions of the genes (introns) Translation of mRNA into proteins Folding of proteins into 3D structure Biochemical or structural functions of proteins
How do Proteins Form?
Protein Folding Problem
Protein Structures
Primary structure Un-branched polymer 20 side chains (residues or amino acids)
Higher order structures Secondary: local (consecutive) in sequence Tertiary: 3D fold of one polypeptide chain Quaternary: Chains packing together
Amino Acid
Polypeptide Chain
Torsion Angles
The Protein Folding Problem
Contact Map Mining
Contact Map
Amino acids Ai and Aj are in contact if their 3D distance is less than threshold (7å)
Sequence separation is given as |i-j| Contact map C is an N x N matrix, where
C(i,j) = 1 if Ai and Aj are in contact C(i,j) = 0 otherwise
Consider all pairs with |i-j| >= 4
Protein 2igd: 3D Structure
Anti-parallel Beta Sheets
Parallel Beta Sheets
Alpha Helix
Contact Map (2igd PDB)Anti-parallel Beta Sheets
Alpha Helix
Parallel Beta Sheets
Amino Acid Ai
Am
ino
Aci
d A
j
How much information in Amino Acids Alone: Classification Problem
A pair of amino acids (Ai,Aj) is an instance The class: C (1) or NC (0), i.e., contact or
non-contact Highly skewed class distribution
1.7% C and 98.3% NC; 300K C vs 17,3M NC Features for each instance
Ai and Aj Class: C or NC
Predicting Protein Contacts
Predict contacts for new sequence
A D T S K C PA 0 1 0 0 1 0D 0 1 0 0 1T 1 0 0 0S 0 1 0K 0 1C 0P
Ai Aj F1 FnA T .. ..A C .. ..D S .. ..D P .. ..T S .. ..S C .. ..K P .. ..
Classification via Association Mining Association mining good for skewed data Mining: Mine frequent itemsets in C data (Dc)
P(X | Dc) = Frequency(X | Dc) / |Dc| Counting: find P(X | Dnc) Pruning
Likelihood of a contact: r = P(X|Dc) / P(X|Dnc) Prune pattern X if ratio r of contact to non-contact
probability is less then some threshold i.e., keep only the patterns highly predictive of
contacts
Testing Phase 90-10 split into training and testing 2.4 million pairs, with 36K contacts (1.5%) Evidence calculation:
Find matching patterns P for each instance Compute cumulative frequency in C and NC
Sc = Sum of frequency (X | Dc) where X in P Snc = Sum of frequency (X | Dnc) where X in P
Compute evidence: ratio of Sc / Snc Prediction: Sort instances on evidence
Predict top PR fraction as contacts
Experiments
794 Proteins from Protein Data Bank Distinct structures (< 25% similarity) Longest: 907, Smallest: 35 amino acids 90-10 split for training-testing Total pairs: 20 million (> 2.5 GB) Contacts: 330 thousand (1.6%) Highly uneven class distribution
Evaluation Metrics Na: set of all pairs Na*: all pairs with positive evidence Ntc: true contacts in test data Ntc*: true contacts with positive evidence Npc: predicted contacts Ntpc: correctly predicted contacts Accuracy = Ntpc / Npc Coverage = Ntpc / Ntc Prediction Ratio (PR): Ntc*/Na* Random Predictor Accuracy: Ntc/Na
Results (Amino Acids; All Lengths)
Crossover: 7% accuracy and 7% coverage; 2 times over Random
Results (Amino Acids; by length)
1-100: 12% accuracy(A) and coverage (C); 100-170: 6% A and C
170-300: 4.5% A and C; 300+: 2% A and C
Using HMMs based on Local Motifs to Improve Classification
An HMM for Local Predictions HMMSTR (Chris Bystroff, Biology, RPI) Build a library of short sequences that tend to
fold uniquely across protein families: the I-Sites Library
Treat each motif as a Markov chain Merge the motifs into a global HMM for local
structure prediction
Training the HMM Build I-sites Library
Short sequence motifs (3 to 19) Exhaustive clustering of sequences Non-redundant PDB dataset (< 25% similarity)
Build an HMM Each of 262 motifs is a chain of Markov states Each state has sequence and structure for one
position Merge I-sites motifs hierarchically to get one global
HMM for all the motifs
HMM Output
Total of 282 States in the HMM Each state produces or “emits”:
Amino acid profile (20 probability values) Secondary structure (D) (helix, strand or loop) Backbone angles (R) (11 dihedral angle symbols) Finer structural context (C) (10 context symbols)
I-Sites Motifs (Initiation Sites)Beta Hairpin Beta to Alpha Helix C-Cap
Data Format and Preparation Take the 794 PDB proteins Compute optimal alignment to HMM
Find best state sequence for the observed acids Output probability distribution of a residue over all
the 282 HMM states Integrate the 3 datasets
Alignment probability distribution (Nx282) Amino acid and context information (D, R, C) Contact map (NxN)
HMMSTR Output (per Protein)PDB Name: 153l_
Sequence Length: 185
Position: 1Residue: RCoordinates: 0.0, -73.2, 177.6AA Profile (20 values): 0.0 1.0 0.0HMMSTR State Probabilities (282 values):0.0 0.7 0.3 0.0Distances (185 values): 0 3 5 18 15 13...Position: 185Residue: YCoordinates: -88.7 , 0.0, 0.0AA Profile (20 values): 0.0 0.4 0.6 0.0HMMSTR State Probabilities (282 values):0.0 0.2 0.5 0.3 0.0Distances (185 values): 15 13 10 5 3 0
Adding features from HMMSTR The class: C (1) or NC (0)
Highly skewed class distribution Approx 1.5% C and 98.5% NC
Features for each instance Ai Aj Di Dj Ri Rj Ci Cj Profile: pi1 pi2 … pi20 pj1 pj2 … pj20 HMM States: qi1 qi2 .. qi282 qj1 qj2 .. qj282 Class: C or NC
HMM and AA + (R,D,C) ; All Lengths
Left Crossover: 19% accuracy and coverage; 5.3 times over Random
Right Crossover (+RDC): 17% accuracy and coverage; 5 times over Random
HMM + AA + R,D,C (by length)
1-100: 30% accuracy(A) and coverage (C); 100-170: 17% A and C
170-300: 10% A and C; 300+: 6% A and C
Predicted Contact Map (2igd)
Summary of Classification Results Challenging prediction problem In essence, we have to predict a contact
matrix for a new protein Hybrid HMM/Associations approach Best results to-date: 19% overall
accuracy/coverage, 30% for short proteins 14.4% Accuracy (Fariselli, Casadio ‘99; NN) 13% Accuracy (Thomas et al ‘96) Short proteins: 26% (Olmea, Valencia, ‘97)
Mining “Physical” Dense Frequent Patterns (non-local motifs)
Characterizing Physical, Protein-like Contact Maps A very small subset of all contact maps code
for physically possible proteins (self-avoiding, globular chains)
A contact map must: Satisfy geometric constraints Represent low-energy structure
What are the typical non-local interactions? Frequent dense 0/1 submatrices in contact maps 3-step approach: 1) data generation, 2) dense
pattern mining, and 3) mapping to structure space
Dense Pattern Mining 12,524 protein-like 60 residue structures
Use HMMSTR to generate protein-like sequences Use ROSETTA to generate their structures
Monte Carlo fragment insertion (from I-sites library) Up to 5 possible low-energy structures retained
Frequent 2D Pattern Mining Use WxW sliding window; W window size Measure density under each window (N-W)^2 / 2 possible windows per N length protein Look for “minimum density”; scale away from diag Try different window sizes
Counting Dense Patterns
Naïve Approach:for W=5, N=60 there are 1485 windows per protein. Total 15 Million possible windows for 12,524 proteins Test if two submatrices are equal
Linear search: O(P x W^2) with P current dense patterns Hash based: O(W^2)
Our Approach: 2-level Hashing O(W) time
Pattern (WxW Submatrix) Encoding Encode submatrix as string (W integers)
Submatrix Integer Value
00000 0
01100 12
01000 8
01000 8
00000 0
Concatenated String: 0.12.8.8.0
String ID (M) =
Level 1 (approximate):
Level2 (exact) : h2 (M) = StringID (M)
Two-level Hashing
W
iivMh
1)(1
Wvvv ...... 21
Binding Patterns to Proteins Sequence and Structure Using window size, W=5 StringID:0.12.8.8.0, Support = 170
0000001100010000100000000
Occurrences:pdb-name (X,Y) X_sequence Y_sequence Interaction1070.0 52,30 ILLKN TFVRI alpha::beta1145.0 51,13 VFALH GFHIA alpha::strand1251.2 42,6 EVCLR GSKFG alpha::strand1312.0 54,11 HGYDE ATFAK alpha::beta1732.0 49,6 HRFAK KELAG alpha::beta2895.0 49,7 SRCLD DTIYY alpha::beta...
Frequent Dense Local PatternsSubmatrix 0 0 0 0 0 0
0 00 0 0 0 0 0
0 00 0 0 0 0 0
0 00 0 0 0 0 0
0 00 0 0 0 0 0
0 10 0 0 0 0 0
1 10 0 0 0 0 1
1 10 0 0 0 1 1
1 0
0 1 1 1 0 0 0 0
1 1 1 0 0 0 0 0
1 1 0 0 0 0 0 0
1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0
0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 1
Frequency 2.0% 2.2% 1.9%
Physical Phenomenon
antiparallel beta sheet secondary structure
antiparallel beta sheet secondary structure
parallel beta sheet secondary structure
Frequent Dense Non-Local Patterns
Alpha – Alpha Alpha – Beta Sheet
Frequent Dense Non-Local Patterns
Alpha – Beta Turn Beta Sheet – Beta Turn
Future Directions
Mining Physicality Rules
Comprehensive list of non-local motifs I-sites library catalogs local motifs
Mining heuristic rules for “physicality” Based on simple geometric constraints Rules governing contacts and non-contacts
Parallel Beta Sheets: If C(i,j) = 1 and C(i+2,j+2) = 1, then C(i,j+2) = 0 and C(i+2,j) = 0
Anti-parallel Beta Sheets: If C(i,j+2) = 1 and C(i+2,j) = 1, then C(i,j) = 0 and C(i+2,j+2) = 0
Alpha Helices: If C(i,i+4) = 1, C(i,j) = 1, and C(i+4,j) = 1, then C(i+2,j) = 0
Heuristic Rules of Physicality
i
i+2 j
j+2
If C(i,j+2) = 1 and C(i+2,j) = 1, then C(i,j) = 0 and C(i+2,j+2) = 0
If C(i,j) = 1 and C(i+2,j+2) = 1, then C(i,j+2) = 0 and C(i+2,j) = 0
Parallel Beta SheetsAnti-parallel Beta Sheets
i
i+2
j
j+2
Protein Folding Pathways
Rules for Pathways in Contact Map Space Pathway is time-ordered sequence of contacts Condensation rule: New contact within Smax
U(i,j) <= Smax; U(i,j) is unfolded residues from i to j Pathway prediction is complementary to structure
prediction
Contact Map Folding Pathways