![Page 1: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/1.jpg)
PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming
Zhiyong Wang and Jinbo Xu
Toyota Technological Institute at Chicago
Web server at http://raptorx.uchicago.edu
See http://arxiv.org/abs/1308.1975 for an extended version
![Page 2: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/2.jpg)
Problem Definition
Contact: distance between two Cα or Cβ atoms < 8 Å
• short range: 6-12 AAs apart
• medium range: 12-24 AAs apart
• long range: >24 AAs apart
Example structure: 1J8B
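The contact definition and the three sequence-separation ranges can be sketched in Python. This is a minimal illustration, not the authors' code: the function names and the toy coordinates are hypothetical, and because the slide leaves the boundaries 12 and 24 ambiguous, both are assigned to the medium range here as an assumption.

```python
import math

def range_class(separation):
    """Classify a sequence separation |i - j| into the ranges named on the slide."""
    if separation < 6:
        return None          # closer than 6 residues: not considered
    if separation < 12:
        return "short"
    if separation <= 24:     # assumption: 12 and 24 both count as medium
        return "medium"
    return "long"

def contact_pairs(coords, threshold=8.0, min_sep=6):
    """Return (i, j, range) for residue pairs whose Calpha or Cbeta atoms are
    closer than `threshold` Angstroms and at least `min_sep` residues apart."""
    contacts = []
    for i in range(len(coords)):
        for j in range(i + min_sep, len(coords)):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts.append((i, j, range_class(j - i)))
    return contacts

# Toy input: ten residues on a straight line, 1 Angstrom apart.
toy_coords = [(float(i), 0.0, 0.0) for i in range(10)]
toy_contacts = contact_pairs(toy_coords)
```

On real structures the coordinates would come from a PDB file; the toy chain is used only to keep the example self-contained.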
![Page 3: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/3.jpg)
Existing Work
• Residue co-evolution methods: mutual information (MI), PSICOV, Evfold
– Need a large number of homologous sequences
– PSICOV and Evfold are better than MI since they differentiate direct and indirect residue couplings (residues A and C are indirectly coupled if their coupling is due to direct A-B and B-C couplings)
– PSICOV and Evfold also enforce sparsity
• Supervised learning methods: NNcon, SVMcon, CMAPpro
– Use mutual information, sequence profile and other features
– Predict contacts one by one, ignoring their correlation
– Do not differentiate direct and indirect residue couplings
• First-principle method: Astro-Fold
– No evolutionary information
– Minimizes contact potential
– Enforces physical feasibility, including sparsity
![Page 4: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/4.jpg)
Our Method: PhyCMAP
1. Focus on proteins with few sequence homologs
– proteins with many sequence homologs very likely have similar templates in PDB
2. Integrate by machine learning
– seq profile, residue co-evolution and non-evolutionary info
– (implicitly) differentiate direct and indirect residue couplings through feature engineering
3. Enforce physical constraints, which imply sparsity
![Page 5: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/5.jpg)
Info used by Random Forests
• Evolution info from a single protein family
– sequence profile
– co-evolution: 2 types of mutual information (MI)
• Non-evolution info from the whole structure space: residue contact potential
• Mixed info from the above 2 sources
– homologous pairwise contact score
– EPAD: context-specific evolutionary-based distance-dependent statistical potential
• Amino acid physico-chemical properties
![Page 6: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/6.jpg)
Mutual Information
1. Contrastive Mutual Information (CMI): remove local background by measuring the MI difference between one pair and its neighbors.
2. Chaining effect of residue couplings: $\mathrm{MI}, \mathrm{MI}^2, \mathrm{MI}^3, \mathrm{MI}^4$, equivalent to $(1-\mathrm{MI}), (1-\mathrm{MI})^2, (1-\mathrm{MI})^3, (1-\mathrm{MI})^4$ (see http://arxiv.org/abs/1308.1975 for more details)
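The idea behind CMI (item 1) can be illustrated with a short sketch. The slide only says that CMI measures the MI difference between a pair and its neighbors; the concrete form used below, a sum of squared differences with the four adjacent matrix entries, is an assumption made for illustration, and the function name is hypothetical. The extended version linked above gives the exact definition.

```python
def contrastive_mi(mi):
    """Contrast each MI entry with its four adjacent entries.

    `mi` is a square matrix (list of lists) of mutual information values.
    Assumed form: sum of squared differences with the up, down, left and
    right neighbors; neighbors outside the matrix are skipped.
    """
    n = len(mi)
    cmi = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                a, b = i + di, j + dj
                if 0 <= a < n and 0 <= b < n:
                    total += (mi[i][j] - mi[a][b]) ** 2
            cmi[i][j] = total
    return cmi

# A flat background gives zero contrast; an isolated peak stands out.
flat = [[0.5] * 3 for _ in range(3)]
peak = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
cmi_flat = contrastive_mi(flat)
cmi_peak = contrastive_mi(peak)
```

A uniformly high MI region, which often reflects background rather than a direct coupling, is thereby suppressed, while isolated peaks are kept.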
![Page 7: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/7.jpg)
CMI Example: 1J8B
• Upper triangle: mutual information
• Lower triangle: contrastive mutual information
• Blue boxes: native contacts
![Page 8: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/8.jpg)
Homologous Pairwise Contact Score
Probability of a residue pair forming a contact between 2 secondary structures.
• $PS_{beta}(a, b)$: prob of two AAs a and b forming a beta contact
• $PS_{helix}(a, b)$: prob of two AAs a and b forming a helix contact
• $H$: the set of sequence homologs in a multiple seq alignment
$$PS(i,j) = \frac{1}{|H|}\left(\sum_{h \in H} PS_{beta}(h_i, h_j)\ \text{or}\ \sum_{h \in H} PS_{helix}(h_i, h_j)\right)$$
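The score is an average of a pairwise probability table over the homologs in the alignment, which a few lines of Python make concrete. The sketch below is illustrative only: the function name, the toy alignment and the probability values are made up, and scoring unknown pairs (for example gaps) as 0 is an assumption.

```python
def homologous_pairwise_score(msa, i, j, pair_prob):
    """Average pair_prob over all homologs at alignment columns i and j.

    `msa` is a list of aligned sequences (the set H); `pair_prob` maps an
    amino-acid pair (a, b) to a contact probability, e.g. a beta or a helix
    table. Pairs missing from the table contribute 0 (assumption).
    """
    total = 0.0
    for homolog in msa:
        total += pair_prob.get((homolog[i], homolog[j]), 0.0)
    return total / len(msa)

# Hypothetical two-column alignment and made-up beta-contact probabilities.
toy_msa = ["AV", "AI", "GV"]
toy_beta = {("A", "V"): 0.4, ("A", "I"): 0.2}
toy_score = homologous_pairwise_score(toy_msa, 0, 1, toy_beta)
```

Which table is used (beta or helix) would depend on the predicted secondary structure of the two columns, as the "or" in the formula indicates.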
![Page 9: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/9.jpg)
Training Random Forests
• Training dataset
– Chosen before CASP10 started
– 900 non-redundant protein structures
– <25% sequence identity
– All contacts and 20% of non-contacts
• Model parameters
– Number of features: 300
– Number of trees: 500
– 5 fold cross validation
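The sampling rule for the training data (all contacts plus 20% of non-contacts) can be sketched as follows. This is a hedged illustration: the function name, the dictionary layout and the fixed random seed are assumptions, and the slides do not say how the non-contacts were drawn, so uniform random sampling is assumed.

```python
import random

def build_training_pairs(labels, neg_fraction=0.2, seed=0):
    """Keep every contact and a random fraction of the non-contacts.

    `labels` maps a residue pair (i, j) to True (contact) or False.
    Uniform sampling of non-contacts is an assumption.
    """
    rng = random.Random(seed)
    positives = [pair for pair, is_contact in labels.items() if is_contact]
    negatives = [pair for pair, is_contact in labels.items() if not is_contact]
    n_keep = round(neg_fraction * len(negatives))
    return positives + rng.sample(negatives, n_keep)

# Toy labels: 5 contacts and 50 non-contacts.
toy_labels = {(0, 6 + k): True for k in range(5)}
toy_labels.update({(1, 7 + k): False for k in range(50)})
toy_training = build_training_pairs(toy_labels)
```

Down-sampling the non-contacts reduces the class imbalance, since contacts are a small minority of all residue pairs.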
![Page 10: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/10.jpg)
Select Physically Feasible Contacts by Integer Linear Programming
Maximize the accumulated contact probability while minimizing violations of physical constraints:
$$\max_{X,R} \sum_{i,j:\ j-i \ge 6} X_{i,j} P_{i,j} - g(R)$$
• $X_{i,j}$: indicates a contact between two residues i and j
• $P_{i,j}$: contact probability of residues i and j
• $R_r$: a relaxation variable of the r-th soft constraint
• $g(R)$: penalty for violation of physical constraints
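To make the trade-off in the objective concrete, here is a toy brute-force search over contact subsets. It is not an ILP solver and not the authors' formulation: the single soft constraint (a per-residue degree bound), the penalty g(R) taken as the sum of the relaxation amounts, the `conflicts` list standing in for hard constraints, and all names and numbers are assumptions chosen for illustration.

```python
from itertools import combinations

def select_contacts(prob, k, max_degree, conflicts=()):
    """Exhaustively pick at most k contacts maximizing sum(P) - penalty.

    prob: dict mapping a pair (i, j) to its predicted contact probability.
    max_degree: soft bound on contacts per residue; each excess contact
        costs 1 (assumed form of g(R)).
    conflicts: pairs of contacts that may not both be selected (hard).
    """
    pairs = sorted(prob)
    best_score, best_set = 0.0, ()
    for size in range(1, k + 1):
        for subset in combinations(pairs, size):
            chosen = set(subset)
            if any(a in chosen and b in chosen for a, b in conflicts):
                continue  # a hard constraint is violated: infeasible
            degree = {}
            for i, j in subset:
                degree[i] = degree.get(i, 0) + 1
                degree[j] = degree.get(j, 0) + 1
            # Relaxation needed by the soft constraint, summed over residues.
            penalty = sum(max(0, d - max_degree) for d in degree.values())
            score = sum(prob[p] for p in subset) - penalty
            if score > best_score:
                best_score, best_set = score, subset
    return best_score, set(best_set)

toy_prob = {(0, 6): 0.9, (0, 8): 0.8, (1, 9): 0.6, (0, 10): 0.3}
toy_conflicts = [((0, 6), (0, 8))]
loose = select_contacts(toy_prob, k=3, max_degree=2, conflicts=toy_conflicts)
tight = select_contacts(toy_prob, k=3, max_degree=1, conflicts=toy_conflicts)
```

With the tighter degree bound, adding the low-probability contact (0, 10) costs more in penalty than it gains in probability, so it is dropped. A real instance would be handed to an integer programming solver rather than enumerated.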
![Page 11: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/11.jpg)
Soft Constraints 1
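The number of contacts between two secondary structure types is limited: a residue i of type s1 can form only a bounded number of contacts with residues of type s2.
$$\forall i,\ SStype(i) = s_1:\quad \sum_{j:\ SStype(j) = s_2} X_{i,j} - R^{1} \le b_{s_1,s_2}$$

| s1,s2 | 95% | Max |
|---|---|---|
| H,H | 5 | 12 |
| H,E | 3 | 10 |
| H,C | 4 | 11 |
| E,H | 4 | 12 |
| E,E | 9 | 13 |
| E,C | 6 | 15 |
| C,H | 3 | 12 |
| C,E | 5 | 12 |
| C,C | 6 | 20 |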
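As a rough illustration of how such a soft bound behaves, the sketch below counts, for each residue, its contacts with each secondary structure type and sums the amount by which the bound is exceeded. Several things here are assumptions: that the bound applies per residue, that the 95% column is the value used as the bound, that relaxation is measured as the excess count, and the function and variable names.

```python
# Bounds taken from the 95% column of the table (assumed to be b).
B95 = {
    ("H", "H"): 5, ("H", "E"): 3, ("H", "C"): 4,
    ("E", "H"): 4, ("E", "E"): 9, ("E", "C"): 6,
    ("C", "H"): 3, ("C", "E"): 5, ("C", "C"): 6,
}

def relaxation_needed(contacts, ss, bounds=B95):
    """Total excess over the per-residue bounds for a set of contacts.

    contacts: iterable of residue index pairs (i, j).
    ss: string of secondary structure types (H, E or C), one per residue.
    """
    counts = {}
    for i, j in contacts:
        # Each contact counts once for each of its two residues.
        for a, b in ((i, j), (j, i)):
            key = (a, ss[b])
            counts[key] = counts.get(key, 0) + 1
    total = 0
    for (a, partner_type), n in counts.items():
        total += max(0, n - bounds[(ss[a], partner_type)])
    return total

# Residue 0 is helical and touches four strand residues: one over the H,E bound.
toy_ss = "HEEEEE"
toy_excess = relaxation_needed([(0, 1), (0, 2), (0, 3), (0, 4)], toy_ss)
```

The toy contacts ignore the sequence-separation requirement; they only exercise the counting.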
![Page 12: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/12.jpg)
Soft Constraints 2
Upper and lower bounds for #contacts between two beta strands u and v:
$$\sum_{i \in SSeg(u),\ j \in SSeg(v)} X_{i,j} + R^{2} \ge 3\, S_{u,v} \min(Len(u), Len(v))$$
$$\sum_{i \in SSeg(u),\ j \in SSeg(v)} X_{i,j} \le 3.3 \max(Len(u), Len(v)) + R^{3}$$
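A small numeric sketch shows how the two bounds act on a strand pair. It assumes the lower bound is 3·min(Len(u), Len(v)) when the strands are paired (0 otherwise) and the upper bound is 3.3·max(Len(u), Len(v)), with the relaxation rounded up to an integer; the function name and the example lengths are hypothetical.

```python
import math

def strand_pair_relaxation(n_contacts, len_u, len_v, paired=True):
    """Return (lower_relax, upper_relax) needed for a pair of beta strands.

    Assumed bounds: at least 3 * min(len) contacts if the strands form a
    sheet, and at most 3.3 * max(len) contacts in any case.
    """
    lower = 3 * min(len_u, len_v) if paired else 0
    upper = 3.3 * max(len_u, len_v)
    lower_relax = max(0, lower - n_contacts)
    upper_relax = max(0, math.ceil(n_contacts - upper))
    return lower_relax, upper_relax

# Strands of length 5 and 7: the bounds are 15 and 23.1 contacts.
too_few = strand_pair_relaxation(10, 5, 7)
within = strand_pair_relaxation(20, 5, 7)
too_many = strand_pair_relaxation(25, 5, 7)
```

Because the relaxation variables are penalized in the objective, a prediction with too few or too many contacts between two strands is discouraged but not forbidden.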
![Page 13: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/13.jpg)
Soft Constraints 3
Statistics show that only 3.4% of loop segments have a contact between their start and end residues.
![Page 14: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/14.jpg)
Hard Constraints 1
• For parallel contacts between two β strands, the contacts of neighboring residue pairs should satisfy the following constraint
$$X_{i,j} \ge X_{i-1,j-1} + X_{i+1,j+1} - 1$$
• For anti-parallel contacts
$$X_{i,j} \ge X_{i-1,j+1} + X_{i+1,j-1} - 1$$
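One way to read these constraints is as a continuity rule: if the two residue pairs flanking (i, j) along the strand direction are both in contact, then (i, j) must be in contact too. The check below encodes that reading, which is an assumption about the intended inequality; the function name and the toy pairs are hypothetical.

```python
def violates_strand_rule(contacts, i, j, parallel):
    """True if both flanking pairs are contacts but (i, j) is not.

    For parallel strands the flanking pairs are (i-1, j-1) and (i+1, j+1);
    for anti-parallel strands they are (i-1, j+1) and (i+1, j-1).
    """
    if parallel:
        before, after = (i - 1, j - 1), (i + 1, j + 1)
    else:
        before, after = (i - 1, j + 1), (i + 1, j - 1)
    return before in contacts and after in contacts and (i, j) not in contacts

parallel_gap = {(4, 14), (6, 16)}             # (5, 15) is missing
parallel_full = {(4, 14), (5, 15), (6, 16)}
antiparallel_gap = {(4, 16), (6, 14)}         # (5, 15) is missing
```

In the integer program the same rule is linear in the binary variables, which is why it can be imposed as a hard constraint.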
![Page 15: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/15.jpg)
Hard Constraints 2
1) One residue cannot form contacts with both j and j+2 when j and j+2 are in the same alpha helix
$$X_{i,j} + X_{i,j+2} \le 1$$
2) One beta-strand can form beta-sheets with up to 2 other beta-strands.
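The helix rule in item 1 can be checked on a predicted contact set with a few lines of Python. This is only a sketch: the function name and the `helix_of` mapping (residue index to helix identifier) are hypothetical, and contacts are treated as unordered pairs.

```python
def helix_violations(contacts, helix_of):
    """List (r, p, p + 2) where residue r contacts both p and p + 2
    and p, p + 2 lie in the same alpha helix.

    contacts: iterable of residue index pairs.
    helix_of: dict mapping a residue index to a helix id (helical residues only).
    """
    partners = {}
    for i, j in contacts:
        partners.setdefault(i, set()).add(j)
        partners.setdefault(j, set()).add(i)
    found = []
    for r, ps in partners.items():
        for p in sorted(ps):
            helix = helix_of.get(p)
            if p + 2 in ps and helix is not None and helix == helix_of.get(p + 2):
                found.append((r, p, p + 2))
    return sorted(found)

toy_helix = {10: 0, 11: 0, 12: 0, 13: 0}
bad = helix_violations([(1, 10), (1, 12), (2, 11)], toy_helix)
ok = helix_violations([(1, 10), (1, 13)], toy_helix)
```

An analogous check for item 2 would count, for each beta-strand, how many other strands it pairs with and flag counts above 2.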
![Page 16: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/16.jpg)
Test Datasets
• CASP10: 123 proteins
– 36 are “hard”, i.e., no similar templates in PDB
– low sequence identity (<25%) among them
– low seq id with the training data, which were chosen before CASP10 started
• Set600: 601 proteins
– share <25% seq ID with the training proteins
– each has ≥50 AAs and an X-ray structure with resolution <1.9 Å
– each has ≥5 AAs with predicted secondary structure being alpha-helix or beta-strand
![Page 17: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/17.jpg)
Accuracy w.r.t. #sequence homologs
1. Meff: #non-redundant sequence homologs of a protein
2. Divide the CASP10 targets into groups by Meff
3. Top L/10 predicted medium- and long-range contacts
[Figure: accuracy plotted against log Meff]
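The accuracy measure used in these evaluations, the fraction of native contacts among the top L/k predictions, can be sketched as below. The function name and the toy scores are hypothetical, and taking a sequence separation of at least 12 as the cutoff for medium- and long-range contacts is an assumption based on the range definitions given earlier.

```python
def top_accuracy(scores, native, length, k=10, min_sep=12):
    """Fraction of the top length // k predicted pairs that are native contacts.

    scores: dict mapping a pair (i, j) with i < j to a predicted probability.
    native: set of native contact pairs.
    min_sep: minimum sequence separation (12 is assumed to select medium-
        and long-range pairs).
    """
    candidates = [(p, pair) for pair, p in scores.items()
                  if pair[1] - pair[0] >= min_sep]
    candidates.sort(key=lambda item: item[0], reverse=True)
    top = candidates[:max(1, length // k)]
    if not top:
        return 0.0
    hits = sum(1 for _, pair in top if pair in native)
    return hits / len(top)

toy_scores = {(0, 12): 0.9, (1, 15): 0.8, (2, 19): 0.7, (0, 5): 0.99}
toy_native = {(0, 12), (2, 19)}
toy_acc = top_accuracy(toy_scores, toy_native, length=20, k=10)
```

Setting `k=5` gives the top L/5 accuracy reported on the results slides.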
![Page 18: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/18.jpg)
Results on CASP10 – Medium Range
Overall accuracy on top L/5 predicted Cβ contacts: PhyCMAP 0.465, CMAPpro 0.370, PSICOV 0.316
[Figure: two panels comparing PhyCMAP with CMAPpro and with PSICOV]
![Page 19: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/19.jpg)
Results on CASP10 – Long Range
Overall accuracy on top L/5 predicted Cβ contacts: PhyCMAP: 0.373, CMAPpro: 0.313, PSICOV: 0.315
[Figure: two panels comparing PhyCMAP with CMAPpro and with PSICOV]
![Page 20: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/20.jpg)
Results on 36 hard CASP10 targets
Accuracy on top L/5 medium- and long-range Cβ contacts: PhyCMAP: 0.363, CMAPpro: 0.308, PSICOV: 0.180
[Figure: two panels comparing PhyCMAP with CMAPpro and with PSICOV]
![Page 21: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/21.jpg)
Results on Set600 with few homologs (Meff ≤ 100)
Accuracy on top L/5 predicted medium- and long-range Cβ contacts: PhyCMAP: 0.345, CMAPpro: 0.287, PSICOV: 0.059
[Figure: two panels comparing PhyCMAP with CMAPpro and with PSICOV]
![Page 22: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/22.jpg)
Example: T0677-D2
Dozens of sequence homologs, Meff=31
• Upper triangle: native Cβ contacts
• Left lower triangle: PhyCMAP, accuracy 0.357
• Right lower triangle: Evfold, accuracy ~0
Note that contacts between alpha helices are not continuous
![Page 23: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/23.jpg)
Example: T0693-D2
Many sequence homologs, Meff=2208
• Upper triangles: native Cβ contacts
• Left lower triangle: PhyCMAP, accuracy 0.744
• Right lower triangle: Evfold, accuracy 0.419
![Page 24: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/24.jpg)
Example: T0701-D1
Many sequence homologs, Meff=3300
• Upper triangle: native Cβ contacts
• Left lower triangle: PhyCMAP, accuracy 0.794
• Right lower triangle: Evfold, accuracy 0.444
![Page 25: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/25.jpg)
Example: T0756-D1
Many sequence homologs, Meff=1824
• Upper triangles: native Cβ contacts
• Left lower triangle: PhyCMAP, accuracy 0.944
• Right lower triangle: Evfold, accuracy 0.500
![Page 26: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/26.jpg)
Summary
Combining seq profile, residue co-evolution and non-evolutionary info can result in good accuracy even for proteins with only 10-100 non-redundant seq homologs
Physical constraints are helpful for proteins with few sequence homologs
[Figure: Cβ accuracy on 130 proteins with Meff ≤ 100, with physical constraints vs. no physical constraints, for the top L/10 and L/5 short-range contacts and medium- and long-range contacts; accuracy axis from 0.2 to 0.5]
![Page 27: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/27.jpg)
Acknowledgements
• Student: Zhiyong Wang
• Funding
– NIH R01GM0897532
– NSF CAREER award
– Alfred P. Sloan Research Fellowship
• Computational resources
– University of Chicago Beagle team
– TeraGrid
Web server at http://raptorx.uchicago.edu
![Page 28: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/28.jpg)
Protein contact
Contact: distance between two Cα or Cβ atoms < 8 Å, or distance between the closest atoms of 2 residues
• short range: 6-12 AAs apart
• medium range: 12-24 AAs apart
• long range: >24 AAs apart
Example structure: 1J8B
![Page 29: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/29.jpg)
Why contact prediction?
• Contacts describe spatial and functional relationships of residues
• Contain key information for 3D structure
• Useful for protein structure prediction
• Used for protein structure alignment and classification
![Page 30: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/30.jpg)
Contrastive Mutual Information
Contrastive Mutual Information (CMI) removes local background, by measuring the MI difference between one pair of residues and neighboring pairs.
![Page 31: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/31.jpg)
Integer Linear Programming
• Objective function:
$$\max_{X,R} \sum_{i,j:\ j-i \ge 6} X_{i,j} P_{i,j} - g(R)$$
• g(R): penalty for violation of physical constraints

| Variables | Explanations |
|---|---|
| $X_{i,j}$ | equal to 1 if there is a contact between two residues i and j. |
| $AP_{u,v}$ | equal to 1 if two beta-strands u and v form an anti-parallel beta-sheet. |
| $P_{u,v}$ | equal to 1 if two beta-strands u and v form a parallel beta-sheet. |
| $S_{u,v}$ | equal to 1 if two beta-strands u and v form a beta-sheet. |
| $T_{u,v}$ | equal to 1 if there is an alpha-bridge between two helices u and v. |
| $R_r$ | a non-negative integral relaxation variable of the r-th soft constraint. |
![Page 32: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/32.jpg)
Hard Constraints 3
One beta-strand can form beta-sheets with up to 2 other beta-strands.
$$\sum_{v:\ SStype(v) = beta} S_{u,v} \le 2$$
![Page 33: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/33.jpg)
Global constraints
• Antiparallel and parallel contacts
$$AP_{u,v} + P_{u,v} = S_{u,v}$$
• A residue contact implies a segment-wise contact
$$X_{i,j} \le S_{u,v}, \quad \forall i \in SSeg(u),\ j \in SSeg(v)$$
• Put a limit on the total number of contacts
$$\sum_{1 \le i < j \le L,\ j-i \ge 6} X_{i,j} \le k$$
– k is the number of top contacts we want to predict.
![Page 34: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/34.jpg)
Results on Set600 with many sequence homologs (Meff > 100)
Accuracy on top L/5 predicted medium- and long-range Cβ contacts: PhyCMAP: 0.611, CMAPpro: 0.515, PSICOV: 0.569
[Figure: two panels comparing PhyCMAP with CMAPpro and with PSICOV]
![Page 35: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/35.jpg)
Contribution of HPS and CMI features
Average Cβ accuracy on the 471 proteins with Meff > 100
[Figure: accuracy with CMI and HPS vs. no CMI and HPS, for the top L/10 and L/5 short-range contacts and medium- and long-range contacts; accuracy axis from 0.4 to 0.7]
![Page 36: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological](https://reader036.vdocuments.us/reader036/viewer/2022081513/56649c7e5503460f94933647/html5/thumbnails/36.jpg)
Contribution of physical constraints
Average Cβ accuracy on 130 proteins with Meff ≤ 100
[Figure: accuracy with physical constraints vs. no physical constraints, for the top L/10 and L/5 short-range contacts and medium- and long-range contacts; accuracy axis from 0.2 to 0.5]