Multiple Sequence Alignment: NP-Hardness and How to Deal with It Jens Stoye Bielefeld University, Germany

Post on 21-Dec-2015


Multiple Sequence Alignment: NP-Hardness and How to Deal with It

Jens Stoye, Bielefeld University, Germany

Preliminaries: Pairwise Alignment

>pdb|1KSW|A Chain A, Structure Of Human C-Src Tyrosine Kinase (Thr338gly Mutant) In Complex With N6-Benzyl Adp
Length=452

Score = 161 bits (408), Expect = 5e-47, Method: Compositional matrix adjust.
Identities = 81/85 (95%), Positives = 81/85 (95%), Gaps = 1/85 (1%)

Query  1    PRESLRLEAKLGQGCFGEVWMGTWNDTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKL  60
            PRESLRLE KLGQGCFGEVWMGTWN TTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKL
Sbjct  182  PRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKL  241

Query  61   VQLYAVVS-EPIYIVIEYMSKGSLL  84
            VQLYAVVS EPIYIV EYMSKGSLL
Sbjct  242  VQLYAVVSEEPIYIVGEYMSKGSLL  266

PRESLRLEAKLGQGCFGEVWMGTWNDTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKL
PRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKL

VQLYAVVS-EPIYIVIEYMSKGSLL
VQLYAVVSEEPIYIVGEYMSKGSLL

Preliminaries: Pairwise Alignment

Find best alignment of two sequences: highest score/lowest cost

Analysis: O(n^2) time
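The quadratic running time comes from filling an (n+1) x (m+1) dynamic programming table, one cell per pair of prefixes. The following is a minimal sketch of global alignment in the Needleman-Wunsch style; the function name `global_alignment` and the unit scores (match +1, mismatch -1, gap -1) are assumptions chosen for illustration and do not come from the slides.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment.

    Scoring values are illustrative assumptions, not taken from the talk.
    """
    n, m = len(a), len(b)
    # S[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # (mis)match column
                          S[i - 1][j] + gap,       # gap in b
                          S[i][j - 1] + gap)       # gap in a

    # Traceback from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, row_a, row_b = global_alignment("VQLYAVVSEPIYIV", "VQLYAVVSEEPIYIV")
print(score)
print(row_a)
print(row_b)
```

Both nested loops run over all prefix pairs, so time and space are proportional to n * m. Real tools use substitution matrices and affine gap costs rather than the unit scores assumed here.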

Multiple Alignment

k sequences, not just 2

sp|P00526|SRC   ---GLAK--DAWEIPRESLRLEAKLGQGCFGEVWMGTWND-TTRVAIKTLKPGT--MSPE 52
sp|P00527|YES   ---GLAK--DAWEIPRESLRLEVKLGQGCFGEVWMGTWNG-TTKVAIKTLKLGT--MMPE 52
sp|P00521|ABL   TIYGVSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDT--MEVE 58
sp|P00542|FES   -VLNRAVPKDKWVLNHEDLVLGEQIGRGNFGEVFSGRLRADNTLVAVKSCRETLPPDIKA 59
sp|P00530|FPS   -VLTRAVLKDKWVLNHEDVLLGERIGRGNFGEVFSGRLRADNTPVAVKSCRETLPPELKA 59
sp|P00532|KRAF  -------SSYYWKMEASEVMLSTRIGSGSFGTVYKGKWHGDVAVKILKVVDPTP--EQLQ 51

sp|P00526|SRC   AFLQEAQVMKKLRHEKLVQLYAVVSEEP-IYIVIEYMSKGSLLDFLKGEMGKYLRLPQLV 111
sp|P00527|YES   AFLQEAQIMKKLRHDKLVPLYAVVSEEP-IYIVTEFMTKGSLLDFLKEGEGKFLKLPQLV 111
sp|P00521|ABL   EFLKEAAVMKEIKHPNLVQLLGVCTREPPFYIITEFMTYGNLLDYLRECNRQEVSAVVLL 118
sp|P00542|FES   KFLQEAKILKQYSHPNIVRLIGVCTQKQPIYIVMELVQGGDFLTFLRT-EGARLRMKTLL 118
sp|P00530|FPS   KFLQEARILKQCNHPNIVRLIGVCTQKQPIYIVMELVQGGDFLSFLRS-KGPRLKMKKLI 118
sp|P00532|KRAF  AFRNEVAVLRKTRHVNILLFMGYMTKDN-LAIVTQWCEGSSLYKHLHV-QETKFQMFQLI 109

sp|P00526|SRC   DMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAK- 170
sp|P00527|YES   DMAAQIADGMAYIERMNYIHRDLRAANILVGDNLVCKIADFGLARLIEDNEYTARQGAK- 170
sp|P00521|ABL   YMATQISSAMEYLEKKNFIHRDLAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAK- 177
sp|P00542|FES   QMVGDAAAGMEYLESKCCIHRDLAARNCLVTEKNVLKISDFGMSREAADGIYAASGGLRQ 178
sp|P00530|FPS   KMMENAAAGMEYLESKHCIHRDLAARNCLVTEKNTLKISDFGMSRQEEDGVYASTGGMKQ 178
sp|P00532|KRAF  DIARQTAQGMDYLHAKNIIHRDMKSNNIFLHEGLTVKIGDFGLATVKSRWSGSQQVEQPT 169

sp|P00526|SRC   FPIKWTAPEAALYG---RFTIKSDVWSFGILLTELTTKGRVPYPGMVNR-EVLDQVERGY 226
sp|P00527|YES   FPIKWTAPEAALYG---RFTIKSDVWSFGILLTELVTKGRVPYPGMVNR-EVLEQVERGY 226
sp|P00521|ABL   FPIKWTAPESLAYN---KFSIKSDVWAFGVLLWEIATYGMSPYPGIDLS-QVYELLEKDY 233
sp|P00542|FES   VPVKWTAPEALNYG---RYSSESDVWSFGILLWETFSLGASPYPNLSNQ-QTREFVEKGG 234
sp|P00530|FPS   IPVKWTAPEALNYG---WYSSESDVWSFGILLWEAFSLGAVPYANLSNQ-QTREAIEQGV 234
sp|P00532|KRAF  GSVLWMAPEVIRMQDDNPFSFQSDVYSYGIVLYELMAG-ELPYAHINNRDQIIFMVGRGY 228

sp|P00526|SRC   RMPCP----PECPESLHDLMCQCWRKDPEERPTFKYLQAQLLPACVLEVAE- 273
sp|P00527|YES   RMPCP----QGCPESLHELMKLCWKKDPDERPTFEYIQSFLEDYFTAAEPSG 274
sp|P00521|ABL   RMERP----EGCPEKVYELMRACWQWNPSDRPSFAEIHQAFETMFQESSIS- 280
sp|P00542|FES   RLPCP----ELCPDAVFRLMEQCWAYEPGQRPSFSAIYQELQSIRKRHR--- 279
sp|P00530|FPS   RLEPP----EQCPEDVYRLMQRCWEYDPHRRPSFGAVHQDLIAIRKRHR--- 279
sp|P00532|KRAF  ASPDLSRLYKNCPKAIKRLVADCVKKVKEERPLFPQILSSIELLQHSLPKIN 280

[CLUSTAL conservation line (*, :, .) under each block: column positions not recoverable]


Multiple Alignment – Why?

Highlight similarities of the sequences in a family:
– sequence assembly
– molecular modeling, structure-function conclusions
– database search (sequence families)
– protein domains
– primer design

Highlight dissimilarities between the sequences in a family:
– reconstruction of phylogenetic trees
– analysis of single nucleotide polymorphisms (SNPs)

"One or two homologous sequences whisper ... a full multiple alignment shouts out loud" (Hubbard et al., 1996)

Multiple Alignment Objective Functions

• Find best alignment of k sequences: highest score/lowest cost

• Based on pairwise projections:

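a) sum of all pairs: add up the scores of the pairwise projections of the alignment onto every pair of sequences (p, q) with p < q

b) tree alignment score: add up the scores of the pairwise projections onto only those pairs (p, q) that are joined by an edge of a given tree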
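To make the sum-of-pairs objective concrete, the sketch below scores a given multiple alignment by projecting it onto every pair of rows and adding up the pairwise scores. The helper names `pair_score` and `sum_of_pairs` and the unit scoring values are assumptions for illustration; columns in which both rows carry a gap are dropped from the projection, which is the usual convention.

```python
from itertools import combinations


def pair_score(row_p, row_q, match=1, mismatch=-1, gap=-1):
    """Score the projection of a multiple alignment onto two of its rows.

    Scoring values are illustrative assumptions.
    """
    score = 0
    for x, y in zip(row_p, row_q):
        if x == "-" and y == "-":
            continue  # this column disappears in the pairwise projection
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


def sum_of_pairs(rows, **scoring):
    """Sum the projection scores over all pairs of rows p < q."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all alignment rows must have the same length")
    return sum(pair_score(p, q, **scoring) for p, q in combinations(rows, 2))


alignment = ["AC-T",
             "A-GT",
             "ACGT"]
print(sum_of_pairs(alignment))  # pairwise scores 0 + 2 + 2
```

A tree alignment score would use the same `pair_score` but sum only over the pairs that form edges of the tree.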

Alignment of 2 Sequences

2 sequences: O(n^2) time

Alignment of 3 Sequences

3 sequences: O(n^3) time

k sequences: O(n^k) time

Alignment of k Sequences

in fact even worse: O(n^k · 2^k) time
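The extra factor 2^k can be seen directly in the recurrence: each cell of the k-dimensional table is computed from up to 2^k - 1 predecessor cells, one for every non-empty choice of sequences that contribute a character (rather than a gap) to the last alignment column. A small sketch, with the hypothetical helper names `predecessors` and `work_estimate`:

```python
from itertools import product


def predecessors(cell):
    """Yield the DP cells from which `cell` is reachable by one alignment column.

    `cell` is a tuple of k prefix lengths. A step of 1 in coordinate i means
    that sequence i contributes a character to the column, 0 means a gap.
    """
    for step in product((0, 1), repeat=len(cell)):
        if not any(step):
            continue  # a column consisting only of gaps is not allowed
        prev = tuple(c - s for c, s in zip(cell, step))
        if min(prev) >= 0:  # stay inside the table
            yield prev


def work_estimate(n, k):
    """Rough count of cell updates: (n+1)^k cells, up to 2^k - 1 predecessors each."""
    return (n + 1) ** k * (2 ** k - 1)


print(len(list(predecessors((1, 1, 1)))))  # 7 = 2^3 - 1
print(work_estimate(100, 3))
print(work_estimate(100, 6))
```

The estimate ignores the cost of scoring each column, so it is only meant to show how quickly the table grows with k.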

NP Hardness

CS terminology:

The computational problem of sum-of-pairs (SP) multiple sequence alignment is NP-hard.

In practice:

Don't even try it for more than 10 or 12 sequences.

What can we do?
– compute anyway
– running time heuristics
– approximation algorithms
– fixed parameter algorithms
– correctness heuristics

Carrillo/Lipman Heuristics

Running time heuristics: often faster, but not in worst case.
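The slide does not spell out the method, so the following is only a sketch of the bound as it is commonly stated for cost minimization under the sum-of-pairs objective. Let U be the cost of any known multiple alignment (an upper bound on the optimum) and d(p, q) the optimal pairwise cost of sequences p and q. Since every projection of the optimal alignment costs at least d, the projection onto (p, q) can cost at most d(p, q) + U - (sum of all d). Cells of the k-dimensional table whose pairwise projection exceeds this bound can be skipped. The function names and the example numbers are hypothetical.

```python
def carrillo_lipman_bounds(pair_opt, upper_bound):
    """Return, per sequence pair, the maximal cost its projection may have.

    pair_opt:    dict mapping (p, q) to the optimal pairwise alignment cost
    upper_bound: SP cost of any known multiple alignment (e.g. from a heuristic)
    """
    lower_bound = sum(pair_opt.values())  # no multiple alignment can cost less
    slack = upper_bound - lower_bound
    if slack < 0:
        raise ValueError("upper bound is below the sum of pairwise optima")
    return {pair: d + slack for pair, d in pair_opt.items()}


def cell_may_be_optimal(best_cost_through_cell, bound):
    """A projected cell is kept only if some pairwise alignment through it
    stays within the bound for that pair."""
    return best_cost_through_cell <= bound


# Hypothetical numbers for three sequences
bounds = carrillo_lipman_bounds({(0, 1): 3, (0, 2): 5, (1, 2): 4}, upper_bound=14)
print(bounds)
print(cell_may_be_optimal(6, bounds[(0, 1)]))
```

The tighter the upper bound U, the smaller the slack and the fewer cells remain, which is why the speed-up is often large but gives no improvement in the worst case.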

Center Star Algorithm

Approximation algorithm: never worse than 2 times the optimum
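In the usual formulation, the center star method picks the sequence whose summed pairwise distance to all others is smallest, aligns every other sequence to that center, and merges the pairwise alignments; the factor-2 guarantee relies on a cost function that satisfies the triangle inequality. The sketch below covers only the first step, choosing the center, and uses unit-cost edit distance as an assumed cost function. The names `edit_distance` and `choose_center` are illustrative.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, computed row by row in O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (x != y),  # substitution or match
                           prev[j] + 1,             # deletion from a
                           cur[j - 1] + 1))         # insertion into a
        prev = cur
    return prev[-1]


def choose_center(seqs):
    """Return (index, total) of the sequence with minimal summed distance."""
    totals = [sum(edit_distance(s, t) for t in seqs) for s in seqs]
    center = min(range(len(seqs)), key=totals.__getitem__)
    return center, totals[center]


seqs = ["ACGT", "AGGT", "ACGA", "TTTT"]
print(choose_center(seqs))
```

Choosing the center needs all k(k-1)/2 pairwise distances, so this step takes O(k^2 n^2) time for k sequences of length about n, which is polynomial in contrast to the exact algorithm.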

Divide and Conquer Alignment

No performance guarantee, but often very good

Multiple Alignment in Practice

Mostly progressive, e.g. CLUSTAL W

Not covered:

– hybrid approaches, e.g. T-COFFEE, MAUVE, Clustal Omega
– local multiple alignment, e.g. DIALIGN

Thank you

---

Any further questions?