
CAPRA: C-Alpha Pattern Recognition Algorithm

Thomas R. Ioerger

Department of Computer Science

Texas A&M University

Overview of CAPRA

• goal: predict CA chains from density map

• not just “tracing” - more than Bones

• desire 1:1 correspondence, ~3.8A apart

• based on principles of pattern recognition
  – use neural net to estimate which pseudo-atoms in trace “look” closest to true C-alphas
  – use feature extraction to capture 3D patterns in density for input to neural net
  – use other heuristics for “linking” together into chains, including geometric analysis (s.s.)

What can you do with CA chains?

• build in side-chain and backbone atoms
  – TEXTAL, Segment-Match Modeling (Levitt), Holm and Sander

• recognize fold from secondary structure
  – identify candidates for molecular replacement

• evaluate map quality (number and length of chains)

• density modification
  – create poly-alanine backbone and use it to do phase recombination

Role in Automated Model Building

• Model building is one of the bottlenecks in high-throughput Structural Genomics

• Automation is needed

[Figure: flow diagram of the automated pipeline. Labels: reflections, map, CA chains, model; PHENIX, CAPRA, TEXTAL; (ha/dm/ncs); refinement]

Steps in CAPRA

Examples of CAPRA Steps

[Figure: Tracer output (image not recoverable)]

Neural Network

Feature Extraction

• characterize 3D patterns in local density

• must be “rotation invariant”

• examples:
  – average density in region
  – standard deviation, kurtosis...
  – distance to center of mass
  – moments of inertia, ratios of moments
  – “spoke angles”

• calculated over spheres of 3A and 4A radius
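A minimal sketch of a few of the listed features, to show why they are rotation invariant: each depends only on density values and on distances from the sphere center, never on the axes. The function name `local_features`, the input layout (a list of `((x, y, z), density)` samples), and the assumption that density inside the sphere is non-negative with a positive total are illustrative choices, not CAPRA's actual code. Moments of inertia and spoke angles are omitted.

```python
import math

def local_features(samples, center, radius):
    """samples: iterable of ((x, y, z), density). Returns a dict of features.

    Hypothetical helper; assumes non-negative density with a positive sum.
    """
    inside = []
    for xyz, rho in samples:
        offset = tuple(p - c for p, c in zip(xyz, center))
        if math.sqrt(sum(o * o for o in offset)) <= radius:
            inside.append((offset, rho))
    n = len(inside)
    total = sum(rho for _, rho in inside)
    mean = total / n
    var = sum((rho - mean) ** 2 for _, rho in inside) / n
    # Fourth standardized moment of the density values in the sphere.
    kurt = sum((rho - mean) ** 4 for _, rho in inside) / n / var ** 2 if var > 0 else 0.0
    # Density-weighted center of mass, relative to the sphere center.
    com = [sum(o[i] * rho for o, rho in inside) / total for i in range(3)]
    return {
        "mean": mean,
        "std": math.sqrt(var),
        "kurtosis": kurt,
        "com_dist": math.sqrt(sum(c * c for c in com)),
    }
```

Calling it once with `radius=3.0` and once with `radius=4.0` would give the two feature sets mentioned above.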

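Forward Propagation:

\[ \mathrm{act}_j = \sum_i w_{i,j}\,\mathrm{out}_i + \mathrm{bias}_j \]

\[ \mathrm{out}_j = \frac{1}{1 + e^{-\mathrm{act}_j}} \]

Backward Propagation:

\[ \delta_j = \mathrm{out}_j\,(1 - \mathrm{out}_j) \sum_k \delta_k\, w_{j,k} \]

\[ \Delta w_{j,k} = \eta\,\delta_k\,\mathrm{out}_j \]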
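The propagation rules can be sketched for a network with one hidden layer and a single sigmoid output. This is a generic illustration, not CAPRA's network: the layer sizes, the learning rate `eta`, the squared-error output delta `(target - out) * out * (1 - out)`, and the names `forward` and `train_step` are all assumptions.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, b1, W2, b2):
    """Forward propagation: weighted sum plus bias, then sigmoid, per unit."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    out = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, out

def train_step(x, target, W1, b1, W2, b2, eta=0.5):
    """One backpropagation update; W1, b1, W2 are modified in place.

    Returns the new output bias and the squared error before the update.
    """
    hidden, out = forward(x, W1, b1, W2, b2)
    # Output delta for squared error (assumed loss).
    delta_out = (target - out) * out * (1.0 - out)
    # Hidden deltas: out_j (1 - out_j) * sum_k delta_k w_jk, with one output k.
    delta_hid = [h * (1.0 - h) * delta_out * W2[j] for j, h in enumerate(hidden)]
    for j, h in enumerate(hidden):
        W2[j] += eta * delta_out * h
    b2 += eta * delta_out
    for j, row in enumerate(W1):
        for i, xi in enumerate(x):
            row[i] += eta * delta_hid[j] * xi
        b1[j] += eta * delta_hid[j]
    return b2, 0.5 * (target - out) ** 2
```

In CAPRA's setting the input `x` would be the feature vector of a pseudo-atom and the target a (suitably scaled) distance to the nearest true C-alpha; the scaling is not specified in the slides.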

Selection of Candidate C-alpha’s

• method:
  – pick candidates in order of lowest predicted distance first,
  – among all pseudo-atoms in trace,
  – as long as not closer than 2.5A

• notes:
  – no 3.8A constraint; distance can be as high as 5A
  – don’t rely on branch points (though often near)
  – picked in random order throughout map
  – initially covers whole map, including side-chains and disconnected regions (e.g. noise in solvent)
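The selection rule reads as a greedy pass over the pseudo-atoms. A minimal sketch, assuming the 2.5A limit applies to the distance from every candidate already picked; the function name and the input layout are hypothetical.

```python
import math

def select_candidates(pseudo_atoms, min_sep=2.5):
    """pseudo_atoms: list of ((x, y, z), predicted_distance_to_true_CA).

    Visits pseudo-atoms from lowest predicted distance upward and keeps one
    only if it is at least min_sep away from every candidate kept so far.
    """
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    chosen = []
    for xyz, pred in sorted(pseudo_atoms, key=lambda atom: atom[1]):
        if all(dist(xyz, kept) >= min_sep for kept, _ in chosen):
            chosen.append((xyz, pred))
    return chosen
```

Because only a lower bound on spacing is enforced, neighboring picks can end up anywhere from 2.5A to about 5A apart, which matches the note that there is no 3.8A constraint.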

Linking into Chains

• initial connectivity of CA candidates based on the trace

• “over-connected” graph - branches, cycles...

• start by computing connected components (islands, or clusters)

• two strategies:
  – for small clusters (<=20 candidates), find longest internal chain with “good” atoms
  – for large clusters (>20 candidates), incrementally clip branch points using heuristics
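Finding the islands is a standard connected-components pass over the candidate graph. A sketch under the assumption that connectivity is stored as a dict from candidate id to a set of neighbor ids; the names are illustrative.

```python
from collections import deque

def connected_components(links):
    """links: dict mapping candidate id -> set of linked candidate ids."""
    seen, clusters = set(), []
    for start in links:
        if start in seen:
            continue
        seen.add(start)
        queue, cluster = deque([start]), []
        while queue:  # breadth-first walk over one island
            node = queue.popleft()
            cluster.append(node)
            for nb in links[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append(cluster)
    return clusters

def split_by_size(clusters, limit=20):
    """Small clusters go to exhaustive search, large ones to branch clipping."""
    small = [c for c in clusters if len(c) <= limit]
    large = [c for c in clusters if len(c) > limit]
    return small, large
```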

Extracting Chains from Small Clusters

• exhaustive depth-first search of all paths

• scoring function:
  – length
  – penalty for inclusion of points with high predicted distance to true CA by neural net
  – preference for following secondary structure (locally straight or helical)
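For a cluster of at most 20 candidates, enumerating every simple path is affordable. The sketch below scores a path by its length minus a penalty per “bad” atom; the cutoff `bad_cutoff`, the weight `penalty`, and the function name are assumed values for illustration, and the secondary-structure preference is left out.

```python
def best_chain(links, pred_dist, bad_cutoff=3.0, penalty=2.0):
    """Exhaustive depth-first search over all simple paths in one cluster.

    links: dict id -> set of neighbor ids; pred_dist: dict id -> predicted
    distance to true CA. Returns (score, path) for the best path found.
    """
    best = (float("-inf"), [])

    def score(path):
        bad = sum(1 for n in path if pred_dist[n] > bad_cutoff)
        return len(path) - penalty * bad

    def dfs(path, visited):
        nonlocal best
        s = score(path)
        if s > best[0]:
            best = (s, list(path))
        for nb in links[path[-1]]:
            if nb not in visited:
                visited.add(nb)
                path.append(nb)
                dfs(path, visited)
                path.pop()
                visited.remove(nb)

    for start in links:
        dfs([start], {start})
    return best
```

The search is exponential in the worst case, which is one plausible reason for restricting it to small clusters.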

Secondary Structure Analysis

• generate all 7-mers (connected fragments of candidate CAs of length 7)

• evaluate “straightness”
  – ratio of end-to-end distance to sum of link lengths
  – straightness>0.8 ==> potential beta-strand

• evaluate “helicity”
  – average absolute deviation of angles and torsions along 7-mer from ideal values (95° and 50°)
  – helicity<20 ==> potential alpha-helix
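Both measures can be computed directly from the seven CA coordinates. In this sketch, straightness is taken as end-to-end distance divided by summed link lengths, so it lies in (0, 1] and a 0.8 cutoff is meaningful. Helicity pools the five CA-CA-CA angles and four CA pseudo-torsions into one average deviation; the pooling and the function names are assumptions.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(a):
    return math.sqrt(_dot(a, a))

def straightness(frag):
    links = sum(_norm(_sub(frag[i + 1], frag[i])) for i in range(len(frag) - 1))
    return _norm(_sub(frag[-1], frag[0])) / links

def bend_angle(a, b, c):
    """Angle at b, in degrees, between the links b->a and b->c."""
    u, v = _sub(a, b), _sub(c, b)
    cosang = _dot(u, v) / (_norm(u) * _norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosang))))

def torsion(a, b, c, d):
    """Signed dihedral angle in degrees about the b->c link."""
    b1, b2, b3 = _sub(b, a), _sub(c, b), _sub(d, c)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    return math.degrees(math.atan2(_norm(b2) * _dot(b1, n2), _dot(n1, n2)))

def helicity(frag, ideal_angle=95.0, ideal_torsion=50.0):
    devs = [abs(bend_angle(*frag[i:i + 3]) - ideal_angle)
            for i in range(len(frag) - 2)]
    for i in range(len(frag) - 3):
        # Wrap the torsion difference into [-180, 180) before taking |.|
        diff = (torsion(*frag[i:i + 4]) - ideal_torsion + 180.0) % 360.0 - 180.0
        devs.append(abs(diff))
    return sum(devs) / len(devs)
```

A 7-mer would then be flagged as a potential beta-strand when `straightness(frag) > 0.8` and as a potential alpha-helix when `helicity(frag) < 20`.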

Handling Large Clusters

• start by breaking cycles (near “bad” atoms)

• clip links at branch points till only linear chains remain

• clip the most “obvious” links first, e.g.
  – if other two links are part of sec. struct.
  – if clipped branch has “bad” atom nearby
  – if clipped branch is small and other 2 are large
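Only the last heuristic (small branch versus two large ones) is simple enough to sketch without further detail. The code below assumes cycles have already been broken, so each cluster is a tree, and it repeatedly clips the link leading to the smallest branch at any branch point. The names and the tie-breaking are illustrative, and the secondary-structure and “bad”-atom tests are not modeled.

```python
def branch_size(links, root, first):
    """Number of nodes reachable from `first` without passing through `root`."""
    seen, stack = {root, first}, [first]
    while stack:
        node = stack.pop()
        for nb in links[node]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) - 1

def clip_branches(links):
    """Return a copy of `links` in which every node has at most two links."""
    links = {n: set(nbs) for n, nbs in links.items()}  # leave the input intact
    while True:
        candidates = [(branch_size(links, n, nb), n, nb)
                      for n in links if len(links[n]) > 2
                      for nb in links[n]]
        if not candidates:
            return links
        # Smallest side branch anywhere in the cluster is clipped first.
        _, node, nb = min(candidates)
        links[node].discard(nb)
        links[nb].discard(node)
```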


Results

protein    PDB id  final res  method  res used  sec. str.  size
CzrA               2.3A       MAD/MR  2.8A                 94/104
IF5a       1bkb    1.75A      MAD     2.8A                 136/139
MVK        1kkh    2.4A       MAD     2.4A                 317/317
PCAa       1l1e    2.0A       MAD     2.8A                 262/287
P2 Myelin  1pmp    2.7A       MIR     2.7A                 131/131

protein    % built        RMS error  # chains  longest  # ins/del cross-overs
CzrA       84/104 (81%)   1.08A      5         53       0
IF5a       127/136 (93%)  0.78A      4         52       0
MVK        298/317 (95%)  0.83A      6         101      0
PCAa       212/262 (81%)  0.89A      11        50       1
P2 Myelin  111/131 (85%)  0.91A      6         63       2

Analysis of RMS by Sec. Struct. (DSSP)

RMS       CzrA          IF-5a          MVK            PCAa           P2 Myelin
in alpha  1.03A (n=58)  0.57A (n=4)    0.76A (n=128)  0.87A (n=99)   1.15A (n=16)
in beta   1.25A (n=11)  0.78A (n=75)   0.78A (n=76)   0.90A (n=36)   0.79A (n=67)
in coil   1.15A (n=15)  0.79A (n=48)   0.95A (n=94)   0.92A (n=77)   1.03A (n=28)
combined  1.08A (n=84)  0.78A (n=127)  0.83A (n=298)  0.89A (n=212)  0.91A (n=111)

Example of CA-chains for CzrA fit by CAPRA

Results for MVK

Effect of Resolution

• IF5a
  – initial map: 2.1A, RMS error: 1.23A
  – limited map: 2.8A, RMS error: 0.86A

• PCAa (2Fo-Fc)
  – initial map: 2.0A, RMS error: 1.1A
  – limited map: 2.8A, RMS error: 0.82A

Effect of Density Modification

• anecdotal evidence from ICL
  – before DM: many short, broken chains
  – after DM: longer chains, reasonable model

• hard to quantify, but the moral is:
  – the accuracy of CAPRA results depends on “quality” of density, and CAPRA might not give useful results in noisy maps

• experiments with “blurring” maps
  – convolution with Gaussian by FFT

Future Work

• build poly-alanine
  – must determine directionality
  – currently done as part of TEXTAL (fits backbone carbonyls as well as side-chain atoms)

• connect ends of chains
  – improve robustness to breaks in density

• use partial models to improve phases and hence make better maps (iteratively)
  – a new form of density modification?

Related Approaches

• Resolve (Terwilliger)
  – template convolution search, max. likelihood

• MAID (D. Levitt)
  – density correlation search, grow ends

• Critical-point analysis (Glasgow/Fortier)

• ARP/wARP (Perrakis and Lamzin)

• MAIN (D. Turk)
  – chiral carbons; iterate: extend ends, phase recomb.

• X-Powerfit (T. Oldfield, MSI)

Availability

• on pompano, add /xray/textal/bin/capra to your path

• run ‘capra <protein>’ where <protein>.xplor is your map in X-PLOR fmt

• map should cover at least one whole molecule, though smaller=faster

• takes minutes to an hour (especially for feature calculations)

• any space group & unit cell

• resolution: 2.2-3.2A, 2.8A recommended

• remember: quality of density must be high, e.g. post-solvent-flattening, etc.

Acknowledgements

• Funding
  – National Institutes of Health
  – Welch Foundation

• People
  – Dr. James C. Sacchettini
  – The TEXTAL Group!
    • Tod Romo
    • Kreshna Gopal
    • Reetal Pai
