Visualisation of Multiple Sequence Alignments
VIZBI 2011
Des Higgins
Conway Institute
University College Dublin
Ireland
Multiple Alignment?
• Align 3 or more sequences together
– Homologous residues lined up in columns
Whale myoglobin  ----VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin   GSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTP---EFFPKFKGLTT
Lupin globin     ---GALTESQAALVKSSWEEF--NIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE
• Needed because of
– Orthologues from different species
But mainly:
– Paralogues from gene duplications
• Multi-gene families
– e.g. humans have approx. 500 protein kinases
Human Protein Kinases
The human kinome comprises 40 atypical PKs and 478 classical PKs. The latter consist of 388 serine/threonine kinases, 90 tyrosine kinases and 50 sequences which lack a functional catalytic site.
(Manning et al., Science, 2002)
Human beta       --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
Horse beta       --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
Human alpha      ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
Horse alpha      ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Whale myoglobin  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin   PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
Lupin globin     --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

Human beta       PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
Horse beta       PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
Human alpha      ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
Horse alpha      ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
Whale myoglobin  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
Lamprey globin   ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
Lupin globin     VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV

Human beta       LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
Horse beta       LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
Human alpha      LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
Horse alpha      LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
Whale myoglobin  ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Lamprey globin   LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------
Lupin globin     VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---
Globin Multiple Alignment
1. Visualise the residues/gaps?
Alpha helices
Haem binding Histidines
2. Visualise the sequence groupings?
[Tree figure. Leaves: Horse beta, Human beta, Horse alpha, Human alpha, Whale myoglobin, Lamprey cyanohaemoglobin, Lupin leghaemoglobin]
So: What is the Problem?
• What if N >> 100,000?
• e.g. SSU rRNA
– www.arb-silva.de
– 1,471,257 seqs
• e.g. ABC transporters
– PFAM
– ABC_tran PF00005
– 127,458 seqs
• Metagenomics
• Sequence 10,000 vertebrate genomes!
=> 5,000,000 protein kinases, GPCRs
SequenceJuxtaposer: Fluid Navigation For Large-Scale Sequence Comparison In Context. James Slack, Kristian Hildebrand, Tamara Munzner, Katherine St. John. Proc. German Conference on Bioinformatics 2004, pp 37-42
Poster D03 VIZBI, 2011
Sequence Surveyor: scalable multiple sequence alignment overview visualisation. Danielle Albers, Colin Dewey, Michael Gleicher
Poster D09 VIZBI, 2011
JProfileGrid: visualising very large multiple sequence alignments.
Alberto Roca, Aaron Abajian, David Vigerust
This talk
• How to make huge multiple alignments
• How to cluster > 100,000 sequences
• MDS/PCA on big datasets
Multiple Sequence Alignment
• NP complete
• Mainly use: “Progressive Alignment”
– Greedy heuristic
– Use a tree/clustering of the seqs
• Barton and Sternberg (1988)
Feng and Doolittle (1987)
Higgins and Sharp (1988)
Hogeweg and Hesper (1984)
Willie Taylor (1987)
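The tree/clustering step of progressive alignment can be illustrated with a small UPGMA (average linkage) sketch. This is only an illustration, not the Clustal implementation: the function name `upgma`, the four labels and the distance values are made up for the example, and a real program would derive the distances from pairwise sequence comparisons.

```python
from itertools import combinations

def upgma(names, dist):
    """Build a guide tree (nested tuples) from a square distance matrix by UPGMA."""
    trees = {i: name for i, name in enumerate(names)}   # cluster id -> subtree
    sizes = {i: 1 for i in trees}                       # cluster id -> number of leaves
    d = {frozenset((i, j)): dist[i][j] for i, j in combinations(trees, 2)}
    next_id = len(names)
    while len(trees) > 1:
        # Greedy step: merge the two closest clusters.
        i, j = min(combinations(sorted(trees), 2), key=lambda p: d[frozenset(p)])
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        for k in trees:
            if k not in (i, j):
                d[frozenset((next_id, k))] = (
                    sizes[i] * d[frozenset((i, k))] + sizes[j] * d[frozenset((j, k))]
                ) / (sizes[i] + sizes[j])
        trees[next_id] = (trees.pop(i), trees.pop(j))
        sizes[next_id] = sizes.pop(i) + sizes.pop(j)
        next_id += 1
    return next(iter(trees.values()))

# Hypothetical distances, chosen only so the globin groupings are visible.
names = ["human_beta", "horse_beta", "human_alpha", "whale_myoglobin"]
dist = [
    [0.0, 0.1, 0.5, 0.8],
    [0.1, 0.0, 0.5, 0.8],
    [0.5, 0.5, 0.0, 0.8],
    [0.8, 0.8, 0.8, 0.0],
]
tree = upgma(names, dist)
print(tree)
```

A progressive aligner would then walk this tree from the leaves towards the root, aligning the closest sequences first and adding the more distant ones later.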
“Guide Tree”
[Figure: the globin multiple alignment above, shown with its guide tree. Leaves: Horse beta, Human beta, Horse alpha, Human alpha, Whale myoglobin, Lamprey cyanohaemoglobin, Lupin leghaemoglobin]
Clustal
• 66,000 citations
• Clustal1-Clustal4
– 1988, Paul Sharp, Dublin
• Clustal V 1992
– EMBL Heidelberg
– Rainer Fuchs
– Alan Bleasby
• Clustal W, Clustal X 1994-2005
– Toby Gibson, EMBL, Heidelberg
– Julie Thompson, IGBMC, Strasbourg
• Clustal W and Clustal X 2.0 2007
– University College Dublin
www.clustal.org
Complexity
• Guide tree construction: O(N^2)
• Later progressive alignment: O(N)
• Guide tree construction is limiting
– >10,000 seq alignment is tough
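To make the O(N^2) cost concrete, the short sketch below counts the distinct sequence pairs that a full distance matrix needs. The helper name `n_pairs` is introduced here for illustration; the largest N is the SSU rRNA count quoted earlier in the talk.

```python
def n_pairs(n):
    """Number of distinct sequence pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2

for n in (1_000, 10_000, 100_000, 1_471_257):
    print(f"{n:>9,} sequences -> {n_pairs(n):,} pairwise distances")
```

Going from 10,000 to 100,000 sequences multiplies the number of pairwise comparisons by about 100, which is why the guide tree stage becomes the bottleneck.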
PartTree
• MAFFT Package
• Select n sequences where n << N
• UPGMA on n sequences
• Cluster the remainder (N-n) with their closest clusters
Katoh, K., Toh, H., 2007. PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences. Bioinformatics 23, 372–374.
Embedding?
• Replace each sequence by a vector
– Vector-vector distances
– MUCH faster than seq.-seq. distances
• Vectors very fast/simple to cluster
• e.g. cluster 10,000 vectors of length 150
– << 1 min on 1 processor
– UPGMA
• e.g. cluster 300,000 vectors of length 300
– 6 mins
– k-means, k = 300
Embedding papers
• FastMap
– Faloutsos, C., Lin, K. (1995) FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. Proc. 1995 ACM SIGMOD International Conf. on Management of Data, pp. 163–174.
• Sparsemap
– G. Hristescu and M. Farach-Colton. Cluster-preserving embedding of proteins. Technical Report 99-50, Computer Science Department, Rutgers University, 1999.
mBED
• Select k seqs “randomly”
– k << N
– k ∝ log N
• Use distance to each of these k “references”
– k-long vector for each sequence
• Use heuristics
– avoid duplicates
– find outliers
• Very fast and simple
– Complexity O(kN), i.e. O(N log N)
• Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG. (2010) Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms Mol Biol. 5:21.
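The embedding idea above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the published mBed code: `kmer_distance` is a stand-in distance invented for the example, the `scale` constant relating k to log N is hypothetical, references are drawn with plain random sampling, and the duplicate/outlier heuristics are left out. The sequences are ungapped fragments of globins from the alignment slides.

```python
import math
import random

def kmer_distance(a, b, k=2):
    """Toy distance: 1 minus the Jaccard similarity of the k-mer sets of a and b."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    return 1.0 - len(ka & kb) / len(ka | kb)

def embed(seqs, scale=1.0, seed=0):
    """Replace each sequence by its vector of distances to k reference sequences."""
    n = len(seqs)
    k = min(n, max(1, round(scale * math.log2(n))))   # k grows like log N
    refs = random.Random(seed).sample(seqs, k)        # "random" reference choice
    # k distance computations per sequence: O(kN) in total.
    return [[kmer_distance(s, r) for r in refs] for s in seqs]

seqs = [
    "VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST",   # human beta
    "VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT",  # whale myoglobin
    "GSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPEFFPKFKGLTT", # lamprey globin
    "GALTESQAALVKSSWEEFNIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE",   # lupin globin
]
vectors = embed(seqs)
for v in vectors:
    print([round(x, 3) for x in v])
```

The resulting fixed-length vectors can be handed to any standard clustering method (UPGMA, k-means) without computing further sequence-sequence distances.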
[Figure: mBED. An N x N distance matrix compared with an N x k matrix of distances to k seeds]
MDS visualisation?
• Do PCA on Embedded sequences
• 3994 H3N2 HA sequences
– 1967 (blue)
– 2008 (orange)
Guide Tree Quality
• 1000 random guide trees
• 1000 sparsemap trees
• Clustal tree
• mBED
Clustal Ω
• Release first version by April 2011
• Scalable
– mBed
– Gordon Blackshields
• Accurate
– HMM-HMM alignment
– HHalign
– Johannes Söding, Munich
• Re-use old alignments
– Kevin Karplus
– UCSC
• Align 120,000 ABC transporters
– 6 hours on 1 core
• More accurate than
– MUSCLE or MAFFT
• Coming soon...
Fabian Sievers, Andreas Wilm, David Dineen
MDS/PCA etc.
• Dimension reduction
• Treat alignment columns as variables
– PCA: Principal Components Analysis
– CA: Correspondence Analysis, Jean Paul Benzécri
• Use NxN distance matrix
– MDS
– PCOORD
Use CA, PCA for Sequences?
• every alignment column:
– 20 binary variables
– Or several physicochemical properties
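The "20 binary variables per column" encoding can be sketched as follows. The function name `one_hot`, the choice to encode gaps as all zeros, and the short toy alignment are assumptions made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(alignment):
    """Encode each aligned sequence as 20 binary variables per alignment column.

    Gaps (and any other non-standard symbol) become 20 zeros.
    """
    rows = []
    for seq in alignment:
        row = []
        for residue in seq:
            row.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
        rows.append(row)
    return rows

# Toy alignment: 2 sequences x 4 columns -> 2 rows x 80 binary variables.
alignment = ["VLS-", "VHLT"]
matrix = one_hot(alignment)
print(len(matrix), "rows x", len(matrix[0]), "variables")
```

The resulting table (sequences as rows, column/residue indicators as variables) is the kind of input that PCA or correspondence analysis can work on.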
[Figure: ordination plots for the trypsin-like serine proteases. Sequence plot (points labelled EC_4_117, EC_1_1, EC_36_5, etc.; scale d = 0.05) with groups Chymotrypsin, Elastase and Trypsin; alignment position plot (labels such as X3N, X98W, X229D; scale d = 0.1); eigenvalues inset]
Trypsin-like serine proteases
15 Chymotrypsins
31 Trypsins
10 Elastases
• Correspondence Analysis
• Supervise:
– Between Groups Analysis
– Dolédec and Chessel (1987) (similar to PLS discriminant analysis)
Wallace IM, Higgins DG.(2007) Supervised multivariate analysis of sequence groups to identify specificity determining residues. BMC Bioinformatics. 8:135.
MDS
• Multidimensional Scaling
• Fit distances to an NxN distance matrix
• Use Euclidean distances?
– “Classical scaling” = Principal Co-Ordinates Analysis
• PCOORD, John Gower
– Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53, 325-328.
– Higgins, D.G. (1992) Sequence ordinations: a multivariate analysis approach to analysing large sequence data sets. CABIOS, 8, 15-22.
– Complexity at least O(N^2)
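Classical scaling can be sketched in plain Python: double-centre the squared distances and take the leading eigenvectors. The sketch below uses simple power iteration with deflation instead of a library eigensolver, assumes a symmetric and roughly Euclidean distance matrix, and uses names (`classical_mds`, `iters`) chosen for the example. It is meant to show the steps, not to be efficient.

```python
import math

def classical_mds(dist, dims=2, iters=500):
    """Embed n objects in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    coords = [[] for _ in range(n)]
    for _axis in range(dims):
        # Power iteration for the leading eigenvector of B (fixed start vector).
        v = [float(i + 1) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _step in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * bv[i] for i in range(n))
        # Coordinates on this axis: eigenvector scaled by sqrt(eigenvalue).
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i].append(v[i] * scale)
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords

# Distances between three points forming a 3-4-5 right triangle.
points_dist = [[0.0, 3.0, 4.0], [3.0, 0.0, 5.0], [4.0, 5.0, 0.0]]
coords = classical_mds(points_dist, dims=2)
for c in coords:
    print([round(x, 3) for x in c])
```

Building and centring the full matrix already costs O(N^2) time and memory, which is the limitation the slide points to.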
Large scale MDS?
• SC-MDS
– Jengnan Tzeng, Henry Horng-Shing Lu, and Wen-Hsiung Li (2008) Multidimensional scaling for large genomic data sets. BMC Bioinformatics 9:179.
• mBED
– Blackshields et al. (2010)
• PCOORD or MDS on a subset of the sequences
– add the rest later
• Landmark MDS + Nyström approximation
– V. de Silva, J.B. Tenenbaum (2004) “Sparse multidimensional scaling using landmark points.” Technical report, Stanford University.
Easily do MDS on >100,000 seqs
• 307,434 lentivirus (HIV etc) sequences from UniProt.
H3N2 flu sequences
• Weifeng Shi
• 8167 HA sequences – human H3N2 influenza viruses
• DNAdist in Phylip – K2P (Kimura two parameter) model
• Python: Matplotlib
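For reference, the textbook closed-form Kimura two-parameter distance is d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions between two aligned sequences. The sketch below implements that formula directly; the function name `k2p_distance` is chosen for the example, and PHYLIP's DNAdist may compute its K2P distances differently (for instance with a fixed transition/transversion ratio), so values need not match exactly.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k2p_distance(seq1, seq2):
    """Closed-form Kimura two-parameter distance between two aligned DNA sequences."""
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # Undefined (math domain error) for very divergent pairs where the
    # log argument is not positive.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

# One transition (A -> G) in 10 sites.
d = k2p_distance("ACGTACGTAC", "GCGTACGTAC")
print(round(d, 4))
```

A matrix of such distances for all sequence pairs is the input that MDS needs.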
[Figure: MDS plot of the H3N2 HA sequences. Legend by year: 1960s, 1970s, 1980s, 1990s, then each year from 2000 to 2010]
BGA, CIA: Aedin Culhane, Ian Jeffery, Stephen Madden, Iain Wallace, Guy Perriere (Lyons)
Clustal Omega: Fabian Sievers, Andreas Wilm, David Dineen, Johannes Soeding (Munich), Rodrigo Lopez (EBI)
mBED: Gordon Blackshields, Mark Larkin
Flu MDS: Weifeng Shi
Supervised PCA or CA?
Malate Dehydrogenases
Lactate Dehydrogenases
ADE-4 http://pbil.univ-lyon1.fr/ADE-4/
Thioulouse J., Chessel D., Dolédec S., & Olivier J.M. (1997) ADE-4: a multivariate analysis and graphical display software. Statistics and Computing, 7, 1, 75-83.
• MADE4 – Culhane, A., Thioulouse, J., Perriere, G., Higgins, D.G. (2005) MADE4: an R package for multivariate analysis of gene expression data. Bioinformatics. 21(11):2789-2790.
Between Group Analysis BGA
Dolédec, S. & Chessel, D. (1987) Acta Oecologica, Oecologica Generalis, 8, 3, 403-426.
Supervised Correspondence Analysis or PCA
CO-Inertia Analysis CIA
Dolédec, S. & Chessel, D. (1994) Freshwater Biology, 31, 277-294.
Thioulouse, J. & Lobry, J.R. (1995) CABIOS, 11, 321-329
2 datasets; Simultaneous CA or PCA
Very large datasets
• e.g. 381,602 tRNA from RF00005
• 40 mins embedding
• Plus 6 mins to cluster with k-means
– k = 300