Directions in Protein Contact Map Mining
Mohammed J. Zaki, Computer Science Dept.
Joint work with Jingjing Hu & Xiaolan Shen, CS Dept.,
and Yu Shao & Prof. Chris Bystroff, Biology Dept.
Rensselaer Polytechnic Institute, Troy NY
Protein Structures
Primary structure
  Un-branched polymer
  20 side chains (residues or amino acids)
  PDB file 2IGD: MTPAVTTYSLVINGLTLSGU…..
Higher order structures
  Secondary: local (consecutive) in sequence
  Tertiary: 3D fold of one polypeptide chain
  Quaternary: Chains packing together
Contact Map
Amino acids Ai and Aj are in contact if their 3D distance is less than a contact threshold (e.g., 7 Angstroms)
Sequence separation is given as |i-j|
Contact map C is a symmetric N x N matrix with
  C(i,j) = 1 if Ai and Aj are in contact
  C(i,j) = 0 otherwise
Consider all pairs with |i-j| >= 4
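The definition above can be sketched in a few lines of Python. This is only an illustration: it assumes one representative 3D point per residue (for example the C-alpha atom, which the slide does not specify), and the names `contact_map`, `threshold` and `min_separation` are hypothetical.

```python
import math

def contact_map(coords, threshold=7.0, min_separation=4):
    """Return a symmetric N x N 0/1 matrix from a list of (x, y, z) points."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        # Only pairs with |i-j| >= min_separation are considered.
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap

# Toy chain (made-up coordinates): five residues on a line, 3 units apart,
# plus a sixth residue folded back next to the first one.
coords = [(0, 0, 0), (3, 0, 0), (6, 0, 0), (9, 0, 0), (12, 0, 0), (0, 3, 0)]
cmap = contact_map(coords)
```

In this toy chain residues 0 and 5 are 3 units apart and separated by five positions, so they are marked as a contact, while residues 0 and 1 are ignored because their sequence separation is below 4.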
Contact Map (2IGD)
[Figure: contact map of 2IGD, Amino Acid Ai versus Amino Acid Aj, with regions labeled Anti-parallel Beta Sheets, Alpha Helix, and Parallel Beta Sheets]
Characterizing Physical, Protein-like Contact Maps
A very small subset of all contact maps code for physically possible proteins (self-avoiding, globular chains)
A contact map must:
  Satisfy geometric constraints
  Represent a low-energy structure
Characterizing Physical Contact Maps in Proteins
What are the typical non-local interactions?
  Frequent dense 0/1 sub-matrices in contact maps
3-step approach
  Dense pattern mining
  Pruning mined patterns
  Clustering dense patterns (non-local pattern signatures)
Dense Pattern Mining
Frequent 2D Pattern Mining
  Use WxW sliding window; W window size
  Measure density under each window
  About (N-W)^2 / 2 possible windows for a protein of length N
  Look for “minimum density” (number of 1’s) scale away from diagonal
  Try different window sizes
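A minimal sketch of the sliding-window scan follows. It is an assumption-laden illustration, not the authors' implementation: the function name `dense_windows` is hypothetical, and the choice to keep every window entirely above the diagonal (column offset at least W beyond the row offset) is one possible placement convention, so the number of windows it visits need not match the counts quoted on the slides.

```python
def dense_windows(cmap, w=5, min_density=5):
    """Yield (i, j, window) for each W x W window with at least min_density 1's."""
    n = len(cmap)
    for i in range(n - w + 1):
        # Assumption: only windows lying fully above the diagonal are scanned.
        for j in range(i + w, n - w + 1):
            window = [row[j:j + w] for row in cmap[i:i + w]]
            density = sum(map(sum, window))  # number of contacts in the window
            if density >= min_density:
                yield i, j, window

# Toy 12 x 12 map (made-up data) with one run of five contacts.
n = 12
cmap = [[0] * n for _ in range(n)]
for a, b in [(0, 11), (1, 10), (2, 9), (3, 8), (4, 7)]:
    cmap[a][b] = cmap[b][a] = 1

hits = [(i, j) for i, j, _ in dense_windows(cmap)]
```

Only the window whose upper-left corner is at row 0, column 7 covers all five contacts, so it is the single window that reaches the minimum density of 5.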
Counting Dense Patterns
Naïve Approach:
  For W=5, N=60 there are 1485 windows per protein
  28 million possible windows for 18,544 proteins (in PDB)
Test if two sub-matrices are equal
  Linear search: O(P x W^2) with P current dense patterns
  Hash based: O(W^2)
Our Approach: 2-level Hashing, O(W) time
Pattern (WxW Sub-matrix) Encoding
Encode sub-matrix as string (W ints)

Sub-matrix   Integer Value
00000        0
01100        12
01000        8
01000        8
00000        0

Concatenated String: 0.12.8.8.0
String-ID(M) = $v_1 . v_2 . \ldots . v_W$

Two-level Hashing
Level 1 (approximate): $h_1(M) = \sum_{i=1}^{W} v_i$
Level 2 (exact): $h_2(M) = \text{String-ID}(M)$
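The encoding and the two hash levels can be sketched as follows. This is an illustration only: Python dictionaries stand in for the hash tables, and the names `encode`, `string_id` and `PatternTable` are hypothetical rather than taken from the authors' code.

```python
from collections import defaultdict

def encode(window):
    """Return the row integers v_1..v_W of a 0/1 sub-matrix (each row read as binary)."""
    return [int("".join(map(str, row)), 2) for row in window]

def string_id(values):
    """Concatenate the row integers with dots, e.g. 0.12.8.8.0."""
    return ".".join(map(str, values))

class PatternTable:
    """Level 1 is keyed by the sum of row values, level 2 by the exact String-ID."""

    def __init__(self):
        self.buckets = defaultdict(dict)  # h1 -> {String-ID -> support count}

    def add(self, window):
        values = encode(window)
        h1 = sum(values)          # approximate key: different patterns may collide
        sid = string_id(values)   # exact key within the level-1 bucket
        level2 = self.buckets[h1]
        level2[sid] = level2.get(sid, 0) + 1
        return h1, sid

# The sub-matrix from the encoding example on this slide.
window = [[0, 0, 0, 0, 0],
          [0, 1, 1, 0, 0],
          [0, 1, 0, 0, 0],
          [0, 1, 0, 0, 0],
          [0, 0, 0, 0, 0]]
table = PatternTable()
h1, sid = table.add(window)
table.add(window)
```

For this sub-matrix the row values are 0, 12, 8, 8, 0, so the level-1 key is 28 and the String-ID is 0.12.8.8.0; adding the same window twice raises its support count to 2.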
Binding Patterns to Protein Sequence and Structure
StringID: 0.12.8.8.0, Support = 170 (window size W=5)
Sub-matrix: 00000 01100 01000 01000 00000

Occurrences:
pdb-name   (X,Y)   X_sequence   Y_sequence   Interaction
1070.0     52,30   ILLKN        TFVRI        alpha::beta
1145.0     51,13   VFALH        GFHIA        alpha::strand
1251.2     42,6    EVCLR        GSKFG        alpha::strand
1312.0     54,11   HGYDE        ATFAK        alpha::beta
1732.0     49,6    HRFAK        KELAG        alpha::beta
2895.0     49,7    SRCLD        DTIYY        alpha::beta
...
Frequent Dense Local Patterns

Submatrix 1
0 0 0 0 0 0 0 1
0 0 0 0 0 0 1 0
0 0 0 0 0 1 0 0
0 0 0 0 1 0 0 0
0 0 0 1 0 0 0 0
0 0 1 0 0 0 0 0
0 1 0 0 0 0 0 0
1 0 0 0 0 0 0 0

Submatrix 2
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0
1 1 1 0 0 0 0 0
0 1 1 1 0 0 0 0

Submatrix 3
1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0
1 1 1 0 0 0 0 0
0 1 1 1 0 0 0 0
0 0 1 1 1 0 0 0
0 0 0 1 1 1 0 0
0 0 0 0 1 1 1 0
0 0 0 0 0 1 1 1

Submatrix   Frequency   Physical Phenomenon
1           2.0%        Parallel beta sheet
2           2.0%        Anti-parallel beta sheet
3           2.2%        Anti-parallel beta sheet
Pruning Patterns
0000001000010000100000000
0000000100001000010000000
0000000010000100001000000
Same pattern (shifted to right) but different String-IDs
Merge horizontally or vertically shifted patterns
Prune away the local patterns (alpha/beta)
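One possible way to detect that two patterns differ only by a horizontal or vertical shift is to move every pattern to a canonical position before comparing. The sketch below is an assumption, not the merging procedure used by the authors: it slides all contacts up and to the left as far as they go, and the names `to_grid` and `canonical` are hypothetical.

```python
def to_grid(bits, w=5):
    """Split a W*W bit string into W rows of integers."""
    return [[int(b) for b in bits[r * w:(r + 1) * w]] for r in range(w)]

def canonical(bits, w=5):
    """Shift all contacts up and left as far as possible; return a W*W bit string."""
    grid = to_grid(bits, w)
    cells = [(r, c) for r in range(w) for c in range(w) if grid[r][c]]
    if not cells:
        return bits  # an empty pattern is already canonical
    dr = min(r for r, _ in cells)  # top-most occupied row
    dc = min(c for _, c in cells)  # left-most occupied column
    shifted = [["0"] * w for _ in range(w)]
    for r, c in cells:
        shifted[r - dr][c - dc] = "1"
    return "".join("".join(row) for row in shifted)

# The three shifted patterns listed on this slide.
patterns = ["0000001000010000100000000",
            "0000000100001000010000000",
            "0000000010000100001000000"]
forms = {canonical(p) for p in patterns}
```

All three patterns map to the same canonical string, so they would fall into one merged group even though their String-IDs differ.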
Dense Pattern Mining Results
2702 non-redundant proteins from PDB
Min-Support = 1 (exhaustive patterns)
Window size = 5, Min-Density = 5

Contact Threshold   Number of Patterns
5 Angstroms         2508
6 Angstroms         9929
7 Angstroms         21231
Clustering Dense Patterns
Distance: Mi, Mj are dense sub-matrices
  $d(M_i, M_j) = \sum_{k=1}^{W^2} |M_i[k] - M_j[k]|$
Use agglomerative hierarchical clustering
Find each cluster’s (c) representative (n patterns)
  Conceptually the super-imposition of n sub-matrices
  Compute contact probability at each position
  $p_c[k] = \frac{1}{n} \sum_{i=1}^{n} M_i[k]$
  Note a 1 whenever contact probability is more than a probability threshold
Cluster Representative
Contact Probabilities:
 0: 0.05    1: 0.05    2: 0.68    3: 0.85    4: 0.71
 5: 0.03    6: 0.02    7: 0.14    8: 0.07    9: 0.09
10: 0.05   11: 0.05   12: 0.12   13: 0.09   14: 0.03
15: 0.03   16: 0.05   17: 0.15   18: 0.27   19: 0.85
20: 0.25   21: 0.10   22: 0.59   23: 0.92   24: 0.83
Representative contact pattern: 00111 00000 00000 00001 00011
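The distance, the per-position contact probability and the thresholded representative can be sketched as below. The function names are hypothetical and the agglomerative clustering step itself is left out. The slides do not state the probability threshold; the value 0.6 used here is an assumption, chosen only because it reproduces the representative pattern shown for this cluster.

```python
def distance(a, b):
    """Number of positions at which two equal-length 0/1 strings differ."""
    return sum(x != y for x, y in zip(a, b))

def contact_probabilities(patterns):
    """Fraction of cluster members that have a contact at each position k."""
    n = len(patterns)
    return [sum(int(p[k]) for p in patterns) / n for k in range(len(patterns[0]))]

def representative(probs, threshold):
    """Write a 1 wherever the contact probability exceeds the threshold."""
    return "".join("1" if p > threshold else "0" for p in probs)

# Toy cluster of three 2 x 2 patterns (made-up data).
members = ["0011", "0111", "0010"]
d = distance(members[0], members[1])
probs = contact_probabilities(members)

# Contact probabilities listed on this slide, positions 0..24.
slide_probs = [0.05, 0.05, 0.68, 0.85, 0.71,
               0.03, 0.02, 0.14, 0.07, 0.09,
               0.05, 0.05, 0.12, 0.09, 0.03,
               0.03, 0.05, 0.15, 0.27, 0.85,
               0.25, 0.10, 0.59, 0.92, 0.83]
rep = representative(slide_probs, threshold=0.6)  # 0.6 is an assumed threshold
```

With the assumed threshold, position 22 (probability 0.59) stays 0 while position 2 (probability 0.68) becomes 1, which matches the representative 00111 00000 00000 00001 00011.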
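Clustering Quality
High and low values of pc[k] are good (most cluster members agree on k)
For a cluster c, define quality Qc:
  $S_c^1 = \sum_{k=1}^{W^2} p_c[k], \quad (p_c[k] \ge 0.5)$
  $S_c^0 = \sum_{k=1}^{W^2} (1 - p_c[k]), \quad (p_c[k] < 0.5)$
  $Q_c = S_c^1 + S_c^0$
Overall clustering quality (0.5 <= Q <= 1)
  $Q = \frac{1}{NP} \sum_{i=1}^{NC} |c_i| \, Q_{c_i}$
NC = Number of Clusters
NP = Number of Patterns
Each position contributes max(pc[k], 1 - pc[k]), a value between 0.5 and 1, so the stated range for Q holds when Qc is averaged over the W^2 positions.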
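A small sketch of the quality measure is given below. One assumption is made explicit: the cluster quality is divided by the number of positions so that it falls in the 0.5 to 1 range quoted for Q. The names `cluster_quality` and `overall_quality` are hypothetical.

```python
def cluster_quality(probs):
    """Per-position average of p (where p >= 0.5) and 1 - p (where p < 0.5)."""
    s1 = sum(p for p in probs if p >= 0.5)     # positions where most members have a contact
    s0 = sum(1 - p for p in probs if p < 0.5)  # positions where most members have none
    # Assumption: normalize by the number of positions (W^2).
    return (s1 + s0) / len(probs)

def overall_quality(clusters):
    """Size-weighted mean of cluster qualities; clusters is a list of (size, probs)."""
    total = sum(size for size, _ in clusters)  # NP, the number of patterns
    return sum(size * cluster_quality(probs) for size, probs in clusters) / total

# Toy data: one cluster in full agreement, one in complete disagreement.
perfect = [0.0, 1.0, 1.0, 0.0]
worst = [0.5, 0.5, 0.5, 0.5]
q = overall_quality([(3, perfect), (1, worst)])
```

A cluster whose members agree everywhere scores 1, a cluster with every probability at 0.5 scores 0.5, and the weighted result for this toy example is 0.875.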
Example 1: Mined Cluster

Pattern          Sub-matrix
#1355            0001100011011111100010000
#3496            0000100101111111100010000
#6282            0001000000110001000010000
#7980            0001100101111001000000000
representative   0001100001111001000010000

Example 2: Mined Cluster

Pattern          Sub-matrix
#196             1101001111010000100011000
#503             0100001110010000100011000
#2834            1100001100011100100001000
#8697            1101001110011000110001000
representative   1100001110010000100001000
Clustering Results

Contact Threshold   Number of Patterns   Number of Clusters   Cluster Quality
5 Angstroms         2508                 83                   0.89
6 Angstroms         9929                 99                   0.86
7 Angstroms         21231                367                  0.84
Future Work
Comprehensive list of non-local motifs
  I-sites library (by Prof. Bystroff) catalogs local motifs
Future Directions
  Improving prediction of contact maps
  Mining heuristic rules for “physicality”
  Protein folding pathways
Mining Physicality Rules
Mining heuristic rules for “physicality”
  Based on simple geometric constraints
  Rules governing contacts and non-contacts
Parallel Beta Sheets: If C(i,j) = 1 and C(i+2,j+2) = 1, then C(i,j+2) = 0 and C(i+2,j) = 0
Anti-parallel Beta Sheets: If C(i,j+2) = 1 and C(i+2,j) = 1, then C(i,j) = 0 and C(i+2,j+2) = 0
Alpha Helices: If C(i,i+4) = 1, C(i,j) = 1, and C(i+4,j) = 1, then C(i+2,j) = 0
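The three rules can be checked mechanically against a contact map. The sketch below is illustrative only: applying the rules at every position of the map, treating out-of-range indices as non-contacts, and the names `rule_violations` and `make_map` are all assumptions rather than part of the original work.

```python
def rule_violations(cmap):
    """Return (rule, i, j) for every place where a heuristic rule is broken."""
    n = len(cmap)

    def c(i, j):
        # Assumption: indices outside the map count as non-contacts.
        return cmap[i][j] if 0 <= i < n and 0 <= j < n else 0

    found = []
    for i in range(n):
        for j in range(i + 1, n):
            # Parallel: C(i,j) and C(i+2,j+2) forbid C(i,j+2) and C(i+2,j).
            if c(i, j) and c(i + 2, j + 2) and (c(i, j + 2) or c(i + 2, j)):
                found.append(("parallel", i, j))
            # Anti-parallel: C(i,j+2) and C(i+2,j) forbid C(i,j) and C(i+2,j+2).
            if c(i, j + 2) and c(i + 2, j) and (c(i, j) or c(i + 2, j + 2)):
                found.append(("anti-parallel", i, j))
        if c(i, i + 4):
            for j in range(n):
                # Helix: C(i,i+4), C(i,j) and C(i+4,j) forbid C(i+2,j).
                if c(i, j) and c(i + 4, j) and c(i + 2, j):
                    found.append(("alpha-helix", i, j))
    return found

def make_map(n, contacts):
    """Build a symmetric 0/1 matrix from a list of (i, j) contacts."""
    cmap = [[0] * n for _ in range(n)]
    for a, b in contacts:
        cmap[a][b] = cmap[b][a] = 1
    return cmap

ok_map = make_map(10, [(0, 6), (2, 8)])           # a parallel pair of contacts
bad_map = make_map(10, [(0, 6), (2, 8), (0, 8)])  # adds the forbidden C(i,j+2)
```

Here `rule_violations(ok_map)` returns an empty list, while `rule_violations(bad_map)` reports a parallel-sheet violation at i = 0, j = 6.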
Heuristic Rules of Physicality
[Figure: anti-parallel beta sheet diagram with positions i, i+2, j, j+2]
Anti-parallel Beta Sheets: If C(i,j+2) = 1 and C(i+2,j) = 1, then C(i,j) = 0 and C(i+2,j+2) = 0
[Figure: parallel beta sheet diagram with positions i, i+2, j, j+2]
Parallel Beta Sheets: If C(i,j) = 1 and C(i+2,j+2) = 1, then C(i,j+2) = 0 and C(i+2,j) = 0
[Figure: alpha helix diagram with positions i, i+2, i+4, and j]
Alpha Helix: If C(i,j) = 1 and C(i+4,j) = 1 and C(i,i+4) = 1, then C(i+2,j) = 0
Protein Folding Pathways
Rules for Pathways in Contact Map Space
  Pathway is a time-ordered sequence of contacts
  Consider only native contacts (those that are present in the true map)
  Condensation rule: New contacts within Smax
    U(i,j) <= Smax; U(i,j): unfolded residues from i to j
Pathway prediction is complementary to structure prediction