Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

Upload: esther-lewis

Post on 18-Dec-2015


Page 1: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

Fast Alignment of Protein Structures Based on

Conformational Letters

Wei-Mou Zheng

Institute of Theoretical Physics

Academia Sinica

Page 2: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

• Introduction

• Conformational alphabet and CLESUM

• CLePAPS: Pairwise structure alignment

• BloMAPS: Multiple structure alignment

• Conclusion

Page 3: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

Introduction

Protein structure comparison: an extremely important problem in structural and evolutionary biology.

• Detection of local or global structural similarity
• Prediction of a new protein's function
• Structures are better conserved than sequences → remote homology detection
• Structural comparison → organizing and classifying structures; discovering structure patterns; discovering correlations between structures and sequences → structure prediction

Page 4: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

Conformational alphabet and CLESUM

The main difference between CLePAPS and other existing algorithms for structure alignment is the use of conformational letters.

Conformational letters = discretized states of 3D segmental conformations.

A letter = a cluster of combinations of the three angles formed by the Cα pseudobonds of four contiguous residues (obtained by clustering according to the probability distribution).

[Table: Centers of 17 conformational letters]
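To make the "three angles of four contiguous residues" concrete, here is a minimal sketch in Python. It assumes the three angles are the two bending angles between successive Cα pseudobonds and the torsion angle about the middle pseudobond; the slide does not spell this out, and the function names are illustrative only. The clustering of these angle triples into 17 letters is not shown.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(a):
    return math.sqrt(_dot(a, a))

def bend_angle(p, q, r):
    """Angle at q between the pseudobonds q-p and q-r, in radians."""
    u, v = _sub(p, q), _sub(r, q)
    c = _dot(u, v) / (_norm(u) * _norm(v))
    return math.acos(max(-1.0, min(1.0, c)))  # clamp against rounding

def torsion_angle(p0, p1, p2, p3):
    """Dihedral angle about the middle pseudobond p1-p2, in radians."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b0, b1), _cross(b1, b2)
    x = _dot(n1, n2)
    y = _dot(_cross(n1, n2), b1) / _norm(b1)
    return math.atan2(y, x)

def segment_angles(ca):
    """(bend, torsion, bend) for four consecutive C-alpha positions.

    Assumption: this triple is what gets clustered into letters.
    """
    a, b, c, d = ca
    return bend_angle(a, b, c), torsion_angle(a, b, c, d), bend_angle(b, c, d)
```

Sliding this over a chain of n residues yields n - 3 angle triples, one per four-residue segment.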

Page 5: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

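Similarity between conformational letters

CLESUM: Conformational LEtter SUbstitution Matrix

$M_{ij} = 20 \log_2 \frac{P_{ij}}{P_i P_j}$, with entropy $H \approx 1.05$, comparable to BLOSUM83.

Constructed using FSSP representatives.

[Figure: CLESUM matrix with the typical helix and typical sheet letters marked. The scores carry evolutionary + geometric information.]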
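The log-odds formula can be sketched in a few lines of Python. This is only an illustration of $M_{ij} = 20 \log_2 (P_{ij} / P_i P_j)$: the input format (a dictionary of ordered pair counts), the rounding to integers, and the function name are assumptions, not details from the slides.

```python
import math
from collections import defaultdict

def log_odds_matrix(pair_counts, scale=20):
    """Build log-odds scores from counts of aligned letter pairs.

    pair_counts: {(a, b): count} over ordered pairs (hypothetical format).
    Returns {(a, b): round(scale * log2(P_ab / (P_a * P_b)))}.
    """
    total = sum(pair_counts.values())
    p_pair = {k: v / total for k, v in pair_counts.items() if v > 0}
    # Marginal probability of each letter: average of row and column sums.
    p_single = defaultdict(float)
    for (a, b), p in p_pair.items():
        p_single[a] += p / 2
        p_single[b] += p / 2
    return {
        (a, b): round(scale * math.log2(p / (p_single[a] * p_single[b])))
        for (a, b), p in p_pair.items()
    }
```

With toy counts such as 40 H-H, 40 E-E and 10 each of H-E and E-H, identical pairs score positive and mixed pairs negative, which is the behaviour a substitution matrix is meant to have.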

Page 6: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

CLePAPS: Pairwise structure alignment

Structure alignment: a self-consistent problem

Correspondence ↔ Rigid transformation (each determines the other)

However, when aligning two protein structures, at the beginning we know neither the transformation nor the correspondence.

DALI, CE, VAST, STRUCTAL, ProSup

CLePAPS: Conformational Letters based Pairwise Alignment of Protein Structures

Initialization + iteration

• Similar Fragment Pairs (SFPs)
• Anchor-based
• Alignment = as many consistent SFPs as possible

Page 7: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

Anchor-based superposition

[Figure: SFPs superimposed according to the anchor SFP, each marked as consistent or inconsistent]

Collect as many consistent SFPs as possible

Page 8: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

[Figure: SFP search and redundancy removal. Labels: the smaller, similar, seed; redundant SFPs are shaved, the others kept.]

SFP = highly scored string pair

• Fast search for SFPs by string comparison
• CLESUM similarity score → importance of SFPs

Guided by CLESUM scores, only the top few SFPs need to be examined to determine the superposition for alignment, and hence a reliable greedy strategy becomes possible.
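The string-comparison search can be sketched as a plain sliding-window scan over two conformational-letter strings. This is a naive illustration, not the CLePAPS implementation: the scoring function `toy_score` stands in for a CLESUM lookup, and the function names, widths and thresholds are assumptions.

```python
def find_sfps(s1, s2, score, width, threshold):
    """Return (total, i, j) for every pair of ungapped fragments of the
    given width whose summed score reaches the threshold, best first.

    s1, s2: conformational-letter strings.
    score:  callable (a, b) -> number, e.g. a CLESUM lookup.
    """
    sfps = []
    for i in range(len(s1) - width + 1):
        for j in range(len(s2) - width + 1):
            total = sum(score(s1[i + k], s2[j + k]) for k in range(width))
            if total >= threshold:
                sfps.append((total, i, j))
    # Highest score first; ties broken by position for reproducibility.
    sfps.sort(key=lambda t: (-t[0], t[1], t[2]))
    return sfps

def toy_score(a, b):
    """Hypothetical stand-in for CLESUM: reward identity only."""
    return 2 if a == b else -1
```

Because the result is sorted by score, a greedy procedure can simply take the first few entries as anchor candidates.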

Page 9: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

Selection of optimal anchor

Example: Top K, K = 2; J = 5

[Figure: five SFPs, shown with ranks in the order 1 3 5 2 4. With the Top-1 SFP as anchor, # of consistent SFPs = 4; with the Top-2 SFP as anchor, # of consistent SFPs = 1.]

The Top-1 SFP is globally supported by three other SFPs, while the Top-2 SFP is supported only by itself.
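The anchor selection in this example can be sketched as follows. The real consistency check superimposes the structures on the anchor and compares coordinates; here it is passed in as a callback, and the demo callback `same_diagonal` (SFPs agree if their sequence offsets are close) is a purely hypothetical stand-in for that spatial test.

```python
def select_anchor(sfps, is_consistent, top_k, top_j):
    """Try each of the top_k SFPs as anchor and count how many of the
    top_j SFPs are consistent with it; return the best-supported anchor.

    sfps: list of (score, i, j), already sorted by descending score.
    is_consistent: callable (anchor, sfp) -> bool (an anchor is expected
                   to be consistent with itself).
    """
    candidates = sfps[:top_j]
    best, best_support = None, -1
    for anchor in sfps[:top_k]:
        support = sum(1 for s in candidates if is_consistent(anchor, s))
        if support > best_support:
            best, best_support = anchor, support
    return best, best_support

def same_diagonal(anchor, sfp, tolerance=2):
    """Hypothetical consistency test: similar offset i - j."""
    return abs((anchor[1] - anchor[2]) - (sfp[1] - sfp[2])) <= tolerance

# Toy data mirroring the slide: K = 2, J = 5.
toy_sfps = [(90, 0, 0), (80, 10, 40), (70, 20, 21), (60, 30, 30), (50, 40, 39)]
anchor, support = select_anchor(toy_sfps, same_diagonal, top_k=2, top_j=5)
```

In this toy input the top-ranked SFP collects a support of 4 (itself plus three others), while the second-ranked one is supported only by itself, so the first is chosen.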

Page 10: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

‘Zoom-in’ around the anchor

[Figure: nested regions around the anchor with cutoffs d1, d2, d3]

d1 > d2 > d3: successively reduced cutoffs for maximal coordinate difference
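A stripped-down sketch of the zoom-in idea is given below. To stay self-contained it refits only a translation (the mean displacement of the kept pairs) rather than a full rigid transformation, so it illustrates the shrinking-cutoff loop and nothing more; the function name, data layout and cutoff values are assumptions.

```python
def zoom_in(pairs, shift, cutoffs):
    """Refine a set of matched point pairs with successively smaller cutoffs.

    pairs:   list of (p, q), each a 3-tuple of coordinates.
    shift:   initial translation applied to p (e.g. from the anchor).
    cutoffs: decreasing cutoffs d1 > d2 > d3 on the maximal coordinate
             difference between the moved p and q.
    Simplification: translation only, no rotation.
    """
    kept = []
    for d in cutoffs:
        # All pairs are re-examined each round, so a pair dropped
        # earlier can be recovered after the shift has improved.
        kept = [(p, q) for p, q in pairs
                if max(abs(p[k] + shift[k] - q[k]) for k in range(3)) <= d]
        if not kept:
            break
        n = len(kept)
        shift = tuple(sum(q[k] - p[k] for p, q in kept) / n for k in range(3))
    return kept, shift
```

Starting with a loose cutoff lets a rough initial superposition collect enough pairs to improve itself before the stricter cutoffs discard the outliers.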

Page 11: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

Flow chart of the CLePAPS algorithm

[Figure: flow chart. Labels recovered from the image: specificity, sensitivity, 1/2, 5/6, 1]

Page 12: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

• Finding SFPs of high CLESUM similarity scores
• The greedy ‘zoom-in’ strategy
• Refinement by elongation and shrinking

• The Fischer benchmark test
• Database search with CLePAPS
• Multiple solutions of alignments: repeats, symmetry, domain move
• Non-topological alignment and domain shuffling

Page 13: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

Multiple structure alignment

Multiple alignment carries significantly more information than pairwise alignment, and hence is a much more powerful tool for classifying proteins, detecting evolutionary relationship and common structural motifs, and assisting structure/function prediction.

Most existing methods of multiple structure alignment combine pairwise alignments with a heuristic, using a progressive-type layout (as in CLUSTAL-W and T-Coffee) to merge the pairwise alignments into a multiple alignment. Examples: MAMMOTH-mult, CE-MC.

• slow
• alignments which are optimal for the whole input set might be missed

A few true multiple alignment tools: MASS, MultiProt

Page 14: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

Vertical equivalency and horizontal consistency

• Vertical equivalency: local similarity among structures
• Horizontal consistency: consistent spatial arrangement for a pair

Page 15: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

Highly similar fragment block (HSFB)

Attributes of HSFB: width, positions, depth, score, consensus

Horizontal consistency of two HSFBs

[Figure: superposition of proteins on the pivot, with the template, the anchor HSFB, and HSFBs marked as consistent or inconsistent. Labels: similar, seed; panels a, b, c, a&b, a&c]

Page 16: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

1. Creating HSFBs
   create HSFBs using the shortest protein as a template
   sort HSFBs according to depths, then to scores
   derive redundancy-removed HSFBs by examining position overlap
   If a new HSFB has a high proportion of positions which overlap with existing HSFBs, remove it.

2. Selecting optimal HSFB
   for each HSFB in top K
       select the pivot protein based on the HSFB consensus;
       superimpose anchored proteins on the pivot;
       find consistent HSFBs;
   select the best HSFB which admits most consistent HSFBs;
   A consistent HSFB contains at least 3 consistent SFPs.
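The sorting and redundancy removal of step 1 can be sketched as below. The block representation (a dictionary with `start`, `width`, `depth`, `score` on the template) and the 0.5 overlap threshold are assumptions for illustration; the slides only say "a high proportion".

```python
def remove_redundant(blocks, max_overlap=0.5):
    """Sort HSFBs by depth, then score, and drop blocks whose template
    positions mostly overlap blocks that were already kept.

    blocks: list of dicts with keys 'start', 'width', 'depth', 'score'
            (hypothetical layout; positions refer to the template).
    max_overlap: assumed meaning of "a high proportion".
    """
    ordered = sorted(blocks, key=lambda b: (-b["depth"], -b["score"]))
    kept, covered = [], set()
    for block in ordered:
        positions = set(range(block["start"], block["start"] + block["width"]))
        if len(positions & covered) / len(positions) > max_overlap:
            continue  # mostly covered by better-ranked blocks
        kept.append(block)
        covered |= positions
    return kept
```

Sorting by depth first means a block shared by more proteins wins over a higher-scoring block shared by fewer.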

Page 17: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

3. Building scaffold
   build a primary scaffold from the consistent HSFBs;
   update the transformation using the consistent HSFBs;
   recruit aligned fragments;
   improve the scaffold;
   create average template;

4. Dealing with unanchored proteins
   Unanchored protein: a protein with no member in the anchor HSFB that is supported by a sufficient number of consistent SFPs.
   for each unanchored protein
       if (members are found in colored HSFBs)
           find top K members;
       else
           search for fragments similar to the scaffold, and select top K;
       align the protein pairwise on the template;
   Try to use ‘colored’ HSFBs other than the anchor HSFB.

Page 18: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

5. Finding missing motifs
   find missing motifs by registering atoms in spatial cells;
   Only patterns shared by the pivot protein have a chance to be discovered above. Two ways of discovering ‘missing motifs’: by searching for maximal star-trees and by registering atoms in spatial cells. The latter: we divide the space occupied by the structures after superimposition into uniform cubic cells of a finite size, say 6 Å. The number of different proteins with atoms in a cell = cell depth. Sort cells in descending order of depth.
   from cells to ‘octads’ → find fragments in octads → find aligned fragments

6. Final refinement
   refine the alignment and the average template;
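The cell registration described in step 5 can be sketched directly: bin every atom of every superimposed structure into a cubic cell and count how many different proteins hit each cell. The input layout (a dictionary from protein name to coordinate list) and the function name are assumptions; the grouping of cells into ‘octads’ is not shown.

```python
import math
from collections import defaultdict

def cell_depths(structures, cell_size=6.0):
    """Register atoms in cubic cells and rank cells by depth.

    structures: {protein_name: [(x, y, z), ...]} after superimposition
                (hypothetical layout).
    cell_size:  edge length of a cell, 6 (angstroms) as in the slide.
    Returns [(cell_index, depth), ...] sorted by descending depth, where
    depth is the number of different proteins with an atom in the cell.
    """
    members = defaultdict(set)
    for name, coords in structures.items():
        for x, y, z in coords:
            cell = (math.floor(x / cell_size),
                    math.floor(y / cell_size),
                    math.floor(z / cell_size))
            members[cell].add(name)
    return sorted(((cell, len(names)) for cell, names in members.items()),
                  key=lambda item: (-item[1], item[0]))
```

Using a set per cell ensures that several atoms of the same protein in one cell count only once toward the depth.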

Page 19: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

Conclusion

CLePAPS and BLOMAPS distinguish themselves from other existing algorithms for structure alignment by the use of conformational letters.

• Conformational alphabet: aptly balances precision with simplicity
• CLESUM: a proper measure of similarity between states
• Fits the ε-congruent problem
• CLESUM, extracted from the FSSP database, contains structure database statistics, which reduces the chance of accidentally matching two irrelevant helices: evolutionary + geometric = specificity gain. For example, two frequent helices are geometrically very similar, but their score is relatively low.
• The CLESUM similarity score can be used to sort SFPs by importance for a greedy algorithm. Only the top few SFPs need to be examined.

Page 20: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

Conclusion

Greedy strategies:

• HSFBs instead of SFPs
• Use the shortest protein to generate HSFBs
• Use consensus to select the pivot
• Top K, guided by scores
• Optimal anchor HSFB
• Missing motifs

Tested on 17 structure datasets

Faster than MASS by 3 orders of magnitude

Page 21: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

Thank you

Page 22: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

* * * Overall algorithm of BLOMAPS * * *

create HSFBs using the shortest protein as a template;
sort HSFBs;
derive redundancy-removed HSFBs;
for each HSFB in top K
    select the pivot protein based on the HSFB consensus;
    superimpose anchored proteins on the pivot;
    find consistent HSFBs;
select the best HSFB which admits most consistent HSFBs;
build a primary scaffold from the consistent HSFBs;
update the transformation using the consistent HSFBs;
recruit aligned fragments;
improve the scaffold;
create average template;
for each unanchored protein
    if (members are found in colored HSFBs)
        find top K members;
    else
        search for fragments similar to the scaffold, and select top K;
    align the protein pairwise on the template;
find missing motifs by registering atoms in spatial cells;
refine the average template;

Page 23: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

17 test sets

Page 24: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

Comparison of BLOMAPS with MAMMOTH-mult and CE-MC

Page 25: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

1ak6-aa sasgvqVADEVCRIFYDMkvrkcstpeeikkrkKAVIFCLSADKKCIIVEEGKeilvgdvgvtitDPFKHFVGMLPEKD
1cfyB-aa vaVADESLTAFNDLKLGKKY---------KFILFGLNDAKTEIVVKETStd----------PSYDAFLEKLPEND
1cnuA-aa giaVSDDCVQKFNELKLGHQH---------RYVTFKMNASNTEVVVEHVGgpn---------ATYEDFKSQLPERD
1f7sA-aa asgmaVHDDCKLRFLELKAKRTH---------RFIVYKIEEKQKQVVVEKVGqpi---------QTYEEFAACLPADE
1cfyB-ss ceECHHHHHHHHHHHHHCCC CEEEEEECCCCCEEEEEEEEcc CCHHHHHCCCCCCC
Core **1111111111111*** xaaaaaa******bbbbbbb **22222xxxxxxx
1ak6-cl mkobbeCCAJJKJJIKJKmlnmgalpjkmjjjcBMEDEEQCQNOLQCFBGNCLqdbahjmklnleCGCMJHJJKMEAJL
1cfyB-cl CCAJJHHHHHHHIKKAPL---------BLDEEECCAHOMLDECPLEEbg----------PGAJJHIJJGCAKL
1cnuA-cl fCCAJJIJIHHHHKKKAPL---------BLDEFECCAHOMLDECBLEEcai---------GFAHJIKJIGCAKL
1f7sA-cl dceDCAJJIHHHHHHIKKAML---------BLDEEEDPPJKOGDECBLEFcal---------EDAHHHHJJMCAIL

1ak6-aa CRYALYDASFETkesrke--ELMFFLWAPELAPLKSKMIYASSKDAIKKKFQGIKHECQANGPeDLNRACIAEKLGGsl
1cfyB-aa CLYAIYDFEYEIngNEGKRSKIVFFTWSPDTAPVRSKMVYASSKDALRRALNGVSTDVQGTDFsEVSYDSVLERVSR
1cnuA-aa CRYAIFDYEFQV--DGGQRNKITFILWAPDSAPIKSKMMYTSTKDSIKKKLVGIQVEVQATDAaEISEDAVSERAKK
1f7sA-aa CRYAIYDFDFVT-aENCQKSKIFFIAWCPDIAKVRSKMIYASSKDRFKRELDGIQVELQATDPte
1cfyB-ss CEEEEEEEEEEEccCCEEEEEEEEEEECCCCCCHHHHHHHHHHHHHHHHHCCCCCEEEEECCHhHHCHHHHHHHHHC
Core xccccc****** ******dddddddxxxxxx333333333333333*******eeeeexx4 44x444*******
1ak6-cl EECEEEAFBLDEbkfnge--BEFCEPOGEAIGCAJIIIKIJKJJIIIJJKJMKPOLBGEEPLAjJJGAHJKKKINIJkq
1cfyB-cl DEDEDEBFEEDEcnOMCQEEECEEEBEBEAJGCAHHHHHHHIIMHJKIHIGENGBLEEBEPLAjHKGAIHIHJHIK
1cnuA-cl DEDEDEAFEEDB--NOGCEEEBFEEBEBEAJGCAHHHHIIHHIMIIIHHHMENBBLEEEEBLAiIIGAJIIIJJII
1f7sA-cl DEDEDEBBEEDF-aJOGCEFDCEEFBEBEAIGCAJHHHHHHHIMIHHHJHMENBBLEEDFEDPj

The alignment of four CL proteins to the template for the CL-GL ensemble. Two additional helices are indexed as 1 and 2; ‘x’: capping; ‘*’: submotifs specific to CL.

Page 26: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

Default parameters of BLOMAPS

Page 27: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

FSSP representative pairs with the same first three family indices are used to construct CLESUM.

amino acids
a.b.c.u.v.w avpetRPNHTIYINNLNEKIKKDELKKSLHAIFSRFGQILDILVSRS...
a.b.c.x.y.z ahLTVKKIFVGGIKEDT....EEHHLRDYFEQYGKIEVIEIMTDRGS

conformational letters
a.b.c.u.v.w CCPMCEALEEEENGCPJGCCIHHHHHHHHIKMJILQEPLDEEEBGAIK
a.b.c.x.y.z ...BBEBGEDEENMFNML....FAHHHHHKKMJJLCEBLDEBCECAKK

For each aligned pair of letters A and B: $N_{AB}$++; $N_{BA}$++;
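The counting rule can be sketched as a short Python function that walks two aligned conformational-letter rows column by column. The gap character `.` follows the example rows; the function name and the choice to skip any column containing a gap are assumptions.

```python
from collections import Counter

def count_aligned_pairs(row_a, row_b, gap="."):
    """Accumulate symmetric pair counts from two aligned letter strings.

    For every column holding letters A and B (no gap), both N[A, B] and
    N[B, A] are incremented, so an identical pair adds 2 to N[A, A].
    """
    counts = Counter()
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue  # unaligned position: no pair to count
        counts[(a, b)] += 1
        counts[(b, a)] += 1
    return counts
```

Summing such counters over all representative pairs gives the symmetric count table from which pair frequencies, and then log-odds scores, can be derived.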

Page 28: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

1, Fast search for SFPs by string comparison alone

2, Width 20 for specificity + width 8 for sensitivity

3, Optimal anchor SFP selected by checking consistency

4, Avoid local traps by ‘zoom-in’

The running time for the 68 pairs of the Fischer benchmark is less than 2% of that of the downloaded CE local version.

Next steps

1, BLOMAPS: fast multiple structure alignment;

SFPs → Highly Similar Fragment Blocks (HSFBs)

2, Include biochemical information into CLESUM by amino acid clustering.

Entropic clustering: AVCFIWLMY (h) + DEGHKNPQRST (p)

Page 29: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

Entropic h-p clustering: AVCFIWLMY and DEGHKNPQRST

CLESUM-hh (lower left) and CLESUM-pp (upper right)

Page 30: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

CLESUM-hp for type h-p (row-column)
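Choosing among the three matrices for a pair of residues can be sketched as below, using the two residue groups listed on the slides. How a p-h pair is handled is not stated; this sketch assumes it is looked up in CLESUM-hp with row and column exchanged, and the function names are illustrative.

```python
HYDROPHOBIC = set("AVCFIWLMY")   # class h, from the entropic clustering
POLAR = set("DEGHKNPQRST")       # class p

def residue_class(aa):
    """Return 'h' or 'p' for a one-letter amino acid code."""
    aa = aa.upper()
    if aa in HYDROPHOBIC:
        return "h"
    if aa in POLAR:
        return "p"
    raise ValueError(f"unknown amino acid: {aa!r}")

def select_matrix(aa_row, aa_col):
    """Return (matrix_name, swap) for a pair of residues.

    swap is True when the pair is of type p-h; the assumption here is
    that CLESUM-hp is then used with row and column exchanged.
    """
    kind = residue_class(aa_row) + residue_class(aa_col)
    if kind == "ph":
        return "CLESUM-hp", True
    return "CLESUM-" + kind, False
```

The lookup of the conformational-letter score itself would then use the selected matrix in place of the plain CLESUM.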

Page 31: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

Rigid transformation for superimposition

Finding the rotation R and the translation T that minimize the sum of squared distances between corresponding points.

If the two sets of points are shifted so that their centers of mass are at the origin, T = 0. Let $X_{3\times n}$ and $Y_{3\times n}$ be the sets after the shift. Introduce the matrix C built from X and Y. The objective function is defined with Lagrange multipliers g and a symmetric matrix L, representing the conditions for R to be an orthogonal and proper rotation matrix.

The resulting condition has the form $C = RM$, where M is symmetric, and $S = \mathrm{diag}(s_i)$, $s_i = 1$ or $-1$. $|C| = |R||M| = |M| = |D||S|$. Singular values are non-negative, so $|D| > 0$. Finally, $|S| = \mathrm{sgn}(|C|)$, which fixes R.
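For reference, one standard formulation of this least-squares superposition (the Kabsch solution via singular value decomposition) that agrees with the relations quoted on the slide is sketched below. The choice $C = YX^{\mathsf T}$, the direction of the mapping (x onto y), and the ordering of singular values in D (descending) are assumptions; the slide's own conventions may differ.

```latex
\begin{aligned}
E(R, T) &= \sum_{i=1}^{n} \lVert R x_i + T - y_i \rVert^2,
  \qquad R^{\mathsf T} R = I, \quad |R| = 1, \\
C &= Y X^{\mathsf T} = U D V^{\mathsf T}
  \qquad \text{(singular value decomposition, } D \text{ diagonal)}, \\
R &= U S V^{\mathsf T},
  \qquad S = \mathrm{diag}(1, 1, \operatorname{sgn}|C|), \\
C &= R M, \qquad M = V S D V^{\mathsf T} \ \text{(symmetric)}, \\
|C| &= |R|\,|M| = |M| = |D|\,|S|
  \quad\Longrightarrow\quad |S| = \operatorname{sgn}(|C|).
\end{aligned}
```

The check $RM = U S V^{\mathsf T} V S D V^{\mathsf T} = U D V^{\mathsf T} = C$ uses $S^2 = I$ and the fact that the diagonal matrices S and D commute.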

Page 32: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

initial correspondence (anchor SFP)
→ optimal transformation for the correspondence
→ correspondence updating by adding consistent SFPs
→ Convergent? no: return to the optimal transformation step; yes: end

Page 33: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

• Protein structure alphabet: discretization of 3D continuous structure states

• Bridge secondary structure and 3D structure

• Enhance correlation between sequence and structure

• Fast structure comparison

• Structure prediction (transplant secondary structure prediction methods to 3D)

Page 34: Fast Alignment of Protein Structures Based on Conformational Letters Wei-Mou Zheng Institute of Theoretical Physics Academia Sinica

Default parameters of CLePAPS

Value   Meaning

20      length of long SFPs
350     similarity threshold for long SFPs
10      number of long SFPs used as seed candidates
50      number of long SFPs for building a star-tree
8       length of short SFPs for blank-filling
0       similarity threshold for short SFPs
4       minimum length of aligned fragments
5 Å     distance cutoff for evaluating overall alignment
10 Å    separation threshold for star construction
8 Å     separation cutoff for blank-filling in first run
6 Å     separation cutoff for blank-filling in second run
5 Å     separation cutoff for blank-filling in third run
0.1     maximal difference for rotational matrix entries of two ‘identical’ alignments