Post on 20-Dec-2015
CAPRA: C-Alpha Pattern Recognition Algorithm
Thomas R. Ioerger
Department of Computer Science
Texas A&M University
Overview of CAPRA
• goal: predict CA chains from density map
• not just “tracing” - more than Bones
• desire 1:1 correspondence, ~3.8A apart
• based on principles of pattern recognition
– use neural net to estimate which pseudo-atoms in trace “look” closest to true C-alphas
– use feature extraction to capture 3D patterns in density for input to neural net
– use other heuristics for “linking” together into chains, including geometric analysis (s.s.)
What can you do with CA chains?
• build in side-chain and backbone atoms
– TEXTAL, Segment-Match Modeling (Levitt), Holm and Sander
• recognize fold from secondary structure
– identify candidates for molecular replacement
• evaluate map quality (num/len of chains)
• density modification
– create poly-alanine backbone and use it to do phase recombination
Role in Automated Model Building
• Model building is one of the bottlenecks in high-throughput Structural Genomics
• Automation is needed
[Diagram: PHENIX, CAPRA, and TEXTAL in the model-building pipeline; labels: reflections, (ha/dm/ncs), map, CA chains, model, refinement]
Steps in CAPRA
Examples of CAPRA Steps
[Figure panel: Tracer]
Neural Network
Feature Extraction
• characterize 3D patterns in local density
• must be “rotation invariant”
• examples:
– average density in region
– standard deviation, kurtosis...
– distance to center of mass
– moments of inertia, ratios of moments
– “spoke angles”
• calculated over spheres of 3A and 4A radius
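A minimal sketch of a few of these features, assuming the map is available as a list of sampled grid points with density values. The function name `sphere_features`, the input layout, and the returned keys are hypothetical and not taken from CAPRA; moments of inertia and spoke angles are omitted. Each quantity depends only on distances from the sphere center and on density values, so it does not change when the map is rotated about that center.

```python
import math

def sphere_features(points, center, radius):
    """Rotation-invariant density statistics inside a sphere.

    points: list of ((x, y, z), density) samples (hypothetical layout).
    center: (x, y, z) of the pseudo-atom; radius in the same units (e.g. 3 or 4 A).
    """
    inside = [
        (tuple(p[i] - center[i] for i in range(3)), rho)
        for p, rho in points
        if math.dist(p, center) <= radius
    ]
    if not inside:
        raise ValueError("no grid points inside sphere")
    rhos = [rho for _, rho in inside]
    n = len(rhos)
    mean = sum(rhos) / n
    var = sum((r - mean) ** 2 for r in rhos) / n
    std = math.sqrt(var)
    # Kurtosis as the normalized fourth central moment (0.0 if density is flat).
    kurt = (sum((r - mean) ** 4 for r in rhos) / n) / var ** 2 if var > 0 else 0.0
    # Distance from sphere center to the density-weighted center of mass.
    total = sum(rhos)
    if total != 0:
        com = [sum(p[i] * rho for p, rho in inside) / total for i in range(3)]
        com_dist = math.sqrt(sum(c * c for c in com))
    else:
        com_dist = 0.0
    return {"mean": mean, "std": std, "kurtosis": kurt, "com_dist": com_dist}
```

Calling it once with `radius=3.0` and once with `radius=4.0` would give two feature sets per pseudo-atom, matching the two sphere sizes mentioned above.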
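Forward Propagation:
$$\mathit{act}_j = \sum_i w_{i,j}\,\mathit{out}_i + \mathit{bias}_j$$
$$\mathit{out}_j = \frac{1}{1 + e^{-\mathit{act}_j}}$$
Backward Propagation:
$$\delta_j = \mathit{out}_j\,(1 - \mathit{out}_j) \sum_k \delta_k\, w_{j,k}$$
$$\Delta w_{j,k} = \eta\,\delta_k\,\mathit{out}_j$$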
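The forward and backward passes can be sketched as standard sigmoid backpropagation. This is an illustration under assumptions, not CAPRA's network: it assumes one hidden layer, a single sigmoid output, a squared-error objective, and a learning rate `eta`; all function and parameter names are hypothetical.

```python
import math

def sigmoid(act):
    return 1.0 / (1.0 + math.exp(-act))

def forward(x, w_hid, b_hid, w_out, b_out):
    """Return (hidden outputs, network output) for one input vector x."""
    hidden = [
        sigmoid(sum(w_hid[i][j] * x[i] for i in range(len(x))) + b_hid[j])
        for j in range(len(b_hid))
    ]
    out = sigmoid(sum(w_out[j] * hidden[j] for j in range(len(hidden))) + b_out)
    return hidden, out

def train_step(x, target, w_hid, b_hid, w_out, b_out, eta=0.5):
    """One backpropagation update; lists are modified in place, new b_out is returned."""
    hidden, out = forward(x, w_hid, b_hid, w_out, b_out)
    # Output-unit error term (assumed squared-error objective).
    delta_out = (target - out) * out * (1.0 - out)
    # Hidden error terms, computed before any weight is changed.
    delta_hid = [h * (1.0 - h) * delta_out * w_out[j] for j, h in enumerate(hidden)]
    for j, h in enumerate(hidden):
        w_out[j] += eta * delta_out * h
        for i, xi in enumerate(x):
            w_hid[i][j] += eta * delta_hid[j] * xi
        b_hid[j] += eta * delta_hid[j]
    return b_out + eta * delta_out
```

In this sketch the input vector `x` would hold the extracted features for one pseudo-atom, and `target` a scaled distance to the nearest true C-alpha; the scaling is an assumption.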
Selection of Candidate C-alpha’s
• method:
– pick candidates in order of lowest predicted distance first,
– among all pseudo-atoms in trace,
– as long as not closer than 2.5A to an already-picked candidate
• notes:
– no 3.8A constraint; distance can be as high as 5A
– don’t rely on branch points (though often near)
– picked in random order throughout map
– initially covers whole map, including side-chains and disconnected regions (e.g. noise in solvent)
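The selection method reads as a greedy pass over the pseudo-atoms. A minimal sketch, assuming each pseudo-atom is a pair of coordinates and the neural net's predicted distance to a true C-alpha; `select_candidates` and its argument layout are hypothetical names for illustration.

```python
import math

def select_candidates(pseudo_atoms, min_sep=2.5):
    """Greedily pick C-alpha candidates.

    pseudo_atoms: list of ((x, y, z), predicted_distance).
    A pseudo-atom is accepted if it lies at least min_sep (A) from every
    candidate accepted so far; lowest predicted distance is tried first.
    """
    chosen = []
    for xyz, pred in sorted(pseudo_atoms, key=lambda atom: atom[1]):
        if all(math.dist(xyz, other) >= min_sep for other, _ in chosen):
            chosen.append((xyz, pred))
    return chosen
```

This version compares each new pseudo-atom against all accepted ones, which is quadratic; a spatial grid would be the usual way to speed it up on large maps.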
Linking into Chains
• initial connectivity of CA candidates based on the trace
• “over-connected” graph - branches, cycles...
• start by computing connected components (islands, or clusters)
• two strategies:
– for small clusters (<=20 candidates), find longest internal chain with “good” atoms
– for large clusters (>20 candidates), incrementally clip branch points using heuristics
Extracting Chains from Small Clusters
• exhaustive depth-first search of all paths
• scoring function:
– length
– penalty for inclusion of points with high predicted distance to true CA by neural net
– preference for following secondary structure (locally straight or helical)
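A sketch of the exhaustive search over a small cluster. The score used here is an assumption: path length minus a weighted sum of predicted distances. The secondary-structure preference is left out, and `best_chain`, `adj`, `predicted`, and `penalty` are hypothetical names.

```python
def best_chain(adj, predicted, penalty=1.0):
    """Return the highest-scoring simple path in a small cluster.

    adj: dict node -> list of neighbouring nodes.
    predicted: dict node -> predicted distance to a true C-alpha.
    """
    best_score, best_path = float("-inf"), []

    def score(path):
        return len(path) - penalty * sum(predicted[v] for v in path)

    def dfs(path, visited):
        nonlocal best_score, best_path
        s = score(path)
        if s > best_score:
            best_score, best_path = s, list(path)
        for nxt in adj[path[-1]]:
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                dfs(path, visited)
                path.pop()
                visited.remove(nxt)

    # Try every node as a starting point so that all simple paths are visited.
    for start in adj:
        dfs([start], {start})
    return best_path
```

Enumerating all simple paths is exponential in the worst case, which is presumably why this strategy is reserved for clusters of at most 20 candidates.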
Secondary Structure Analysis
• generate all 7-mers (connected fragments of candidate CAs of length 7)
• evaluate “straightness”
– ratio of end-to-end distance to sum of link lengths
– straightness>0.8 ==> potential beta-strand
• evaluate “helicity”
– average absolute deviation of angles and torsions along 7-mer from ideal values (95° and 50°)
– helicity<20 ==> potential alpha-helix
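Both measures can be sketched for a single 7-mer given as a list of coordinates. Two assumptions are made here: straightness is taken as end-to-end distance divided by the summed link lengths, so that it lies between 0 and 1 and the 0.8 threshold is meaningful; and helicity averages the deviations of the five pseudo-angles and four pseudo-torsions together. Function names are hypothetical.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def straightness(frag):
    """End-to-end distance over summed link lengths (1.0 for a straight line)."""
    path = sum(math.dist(frag[i], frag[i + 1]) for i in range(len(frag) - 1))
    return math.dist(frag[0], frag[-1]) / path

def angle(a, b, c):
    """Angle at b in degrees."""
    u, v = _sub(a, b), _sub(c, b)
    cosang = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosang))))

def torsion(a, b, c, d):
    """Signed dihedral a-b-c-d in degrees (positive for a right-handed twist)."""
    b1, b2, b3 = _sub(b, a), _sub(c, b), _sub(d, c)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, _dot(n1, n2)))

def helicity(frag, ideal_angle=95.0, ideal_torsion=50.0):
    """Mean absolute deviation of pseudo-angles and pseudo-torsions from ideal."""
    devs = [abs(angle(*frag[i:i + 3]) - ideal_angle) for i in range(len(frag) - 2)]
    devs += [abs(torsion(*frag[i:i + 4]) - ideal_torsion) for i in range(len(frag) - 3)]
    return sum(devs) / len(devs)
```

A fragment would then be flagged as a potential beta-strand when `straightness(frag) > 0.8` and as a potential alpha-helix when `helicity(frag) < 20`.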
Handling Large Clusters
• start by breaking cycles (near “bad” atoms)
• clip links at branch points till only linear chains remain
• clip the most “obvious” links first, e.g.
– if other two links are part of sec. struct.
– if clipped branch has “bad” atom nearby
– if clipped branch is small and other 2 are large
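Only the last of these heuristics is sketched here: at each branch point, clip the link leading to the smallest branch. The sketch assumes cycles have already been broken, ignores the secondary-structure and “bad”-atom criteria, and uses hypothetical names throughout.

```python
def branch_size(adj, root, first):
    """Number of nodes reachable from `first` without passing through `root`."""
    seen = {root, first}
    stack = [first]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) - 1  # exclude root itself

def clip_smallest_branches(adj):
    """Clip links until every node has at most two neighbours.

    adj: dict node -> set of neighbours for an acyclic cluster (modified in place).
    """
    while True:
        branch_points = [v for v in adj if len(adj[v]) > 2]
        if not branch_points:
            return adj
        v = branch_points[0]
        smallest = min(adj[v], key=lambda w: branch_size(adj, v, w))
        adj[v].discard(smallest)
        adj[smallest].discard(v)
```

After clipping, the remaining connected pieces are linear chains; the clipped side branches (often side-chain density) are left as separate fragments.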
Results
protein    PDB id  final res  method  res used  sec. str.  size
CzrA       -       2.3A       MAD/MR  2.8A      -          94/104
IF5a       1bkb    1.75A      MAD     2.8A      -          136/139
MVK        1kkh    2.4A       MAD     2.4A      -          317/317
PCAa       1l1e    2.0A       MAD     2.8A      -          262/287
P2 Myelin  1pmp    2.7A       MIR     2.7A      -          131/131
protein    % built        RMS error  # chains  longest  # ins/del cross-overs
CzrA       84/104 (81%)   1.08A      5         53       0
IF5a       127/136 (93%)  0.78A      4         52       0
MVK        298/317 (95%)  0.83A      6         101      0
PCAa       212/262 (81%)  0.89A      11        50       1
P2 Myelin  111/131 (85%)  0.91A      6         63       2
Analysis of RMS by Sec. Struct. (DSSP)
RMS       CzrA           IF-5a          MVK            PCAa           P2 Myelin
in alpha  1.03A (n=58)   0.57A (n=4)    0.76A (n=128)  0.87A (n=99)   1.15A (n=16)
in beta   1.25A (n=11)   0.78A (n=75)   0.78A (n=76)   0.90A (n=36)   0.79A (n=67)
in coil   1.15A (n=15)   0.79A (n=48)   0.95A (n=94)   0.92A (n=77)   1.03A (n=28)
combined  1.08A (n=84)   0.78A (n=127)  0.83A (n=298)  0.89A (n=212)  0.91A (n=111)
Example of CA-chains for CzrA fit by CAPRA
Results for MVK
Effect of Resolution
• IF5a
– initial map: 2.1A, RMS error: 1.23A
– limited map: 2.8A, RMS error: 0.86A
• PCAa (2Fo-Fc)
– initial map: 2.0A, RMS error: 1.1A
– limited map: 2.8A, RMS error: 0.82A
Effect of Density Modification
• anecdotal evidence from ICL
– before DM: many short, broken chains
– after DM: longer chains, reasonable model
• hard to quantify, but the moral is:
– the accuracy of CAPRA results depends on “quality” of density, and CAPRA might not give useful results in noisy maps
• experiments with “blurring” maps
– convolution with Gaussian by FFT
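Blurring by FFT can be sketched with NumPy: multiply the map's Fourier transform by the transform of a Gaussian, then invert. The sketch assumes the map is a periodic 3D array covering a full unit cell and that the Gaussian width is given in voxels; `blur_map` and `sigma_vox` are hypothetical names, and the parameters used in the actual experiments are not stated.

```python
import numpy as np

def blur_map(density, sigma_vox):
    """Convolve a periodic 3D density map with a Gaussian via FFT.

    density: 3D NumPy array; sigma_vox: Gaussian standard deviation in voxels.
    """
    # Frequency grids in cycles per voxel along each axis.
    freqs = np.meshgrid(*[np.fft.fftfreq(n) for n in density.shape], indexing="ij")
    k2 = sum(f ** 2 for f in freqs)
    # Fourier transform of a unit-area Gaussian: exp(-2 * pi^2 * sigma^2 * k^2).
    kernel_ft = np.exp(-2.0 * np.pi ** 2 * sigma_vox ** 2 * k2)
    return np.real(np.fft.ifftn(np.fft.fftn(density) * kernel_ft))
```

Because the kernel's transform equals 1 at zero frequency, the total density is preserved while peaks are spread out.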
Future Work
• build poly-alanine
– must determine directionality
– currently done as part of TEXTAL (fits backbone carbonyls as well as side-chain atoms)
• connect ends of chains
– improve robustness to breaks in density
• use partial models to improve phases and hence make better maps (iteratively)
– a new form of density modification?
Related Approaches
• Resolve (Terwilliger)
– template convolution search, max. likelihood
• MAID (D. Levitt)
– density correlation search, grow ends
• Critical-point analysis (Glasgow/Fortier)
• ARP/wARP (Perrakis and Lamzin)
• MAIN (D. Turk)
– chiral carbons; iterate: extend ends, phase recomb.
• X-Powerfit (T. Oldfield, MSI)
Availability
• on pompano, add /xray/textal/bin/capra to your path
• run ‘capra <protein>’ where <protein>.xplor is your map in X-PLOR fmt
• map should cover at least one whole molecule, though smaller=faster
• takes a few minutes to an hour (especially for feature calculations)
• any space group & unit cell
• resolution: 2.2-3.2A, 2.8A recommended
• remember: quality of density must be high, e.g. post-solvent-flattening, etc.
Acknowledgements
• Funding
– National Institutes of Health
– Welch Foundation
• People
– Dr. James C. Sacchettini
– The TEXTAL Group!
• Tod Romo
• Kreshna Gopal
• Reetal Pai