Page 1: CECS 694-04 Bioinformatics Journal Club Eric Rouchka, D.Sc. September 10, 2003

Eric C. Rouchka, University of Louisville

SATCHMO: sequence alignment and tree construction using hidden Markov models

Edgar, R.C. and Sjolander, K. Bioinformatics. 19(11):1404-1411.

CECS 694-04 Bioinformatics Journal Club
Eric Rouchka, D.Sc.
September 10, 2003

Page 2

What is Multiple Sequence Alignment (MSA)?

• Taking more than two sequences and aligning them based on similarity

Page 3

Globin Example

>gamma_A
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVD
PENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVASALSSRYH
>alfa
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFK
LLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>beta
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVD
PENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
>delta
VHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFSQLSELHCDKLHVD
PENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVANALAHKYH
>epsilon
VHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDP
ENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
>gamma_G
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVD
PENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVASALSSRYH
>myoglobin
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIP
VKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG
>teta1
ALSAEDRALVRALWKKLGSNVGVYTTEALERTFLAFPATKTYFSHLDLSPGSSQVRAHGQKVADALSLAVERLDDLPHALSALSHLHACQLRVDPASFQLL
GHCLLVTLARHYPGDFSPALQASLDKFLSHVISALVSEYR
>zeta
SLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDAVKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHC
LLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEKYR

Page 4

Globin Multiple Alignment

Page 5

Why do MSA?

• Homology Searching
  – Important regions conserved across (or within) species
    • Genic Regions
    • Regulatory Elements

• Phylogenetic Classification

• Subfamily classification

• Identification of critical residues

Page 6

MSA Approaches

• All columns alignable across all sequences
  – MSA
  – ClustalW

• Columns alignable throughout all sequences singled out (Profile HMM)
  – HMMER
  – SAM

Page 7

MSA

• N-dimensional dynamic programming

• Time consuming

• High memory usage

• Guaranteed to yield the optimal (maximum-scoring) alignment
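To make the time and memory cost concrete, here is a minimal sketch (not taken from the MSA program itself) that counts the cells in a full N-dimensional dynamic-programming lattice and the number of predecessor cells each cell must examine. The function names `dp_cells` and `predecessors_per_cell` are illustrative, and the sequence length of 150 is only a rough globin-sized assumption.

```python
def dp_cells(lengths):
    """Number of cells in the full dynamic-programming lattice.

    Each sequence of length L contributes an axis with L + 1 positions
    (the extra position is the empty prefix).
    """
    cells = 1
    for length in lengths:
        cells *= length + 1
    return cells


def predecessors_per_cell(n_sequences):
    """Each cell looks back along every non-empty subset of axes: 2^N - 1."""
    return 2 ** n_sequences - 1


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        lengths = [150] * n  # assumed, roughly globin-sized sequences
        print(n, dp_cells(lengths), predecessors_per_cell(n))
```

The cell count grows as the product of the sequence lengths, which is why exact N-dimensional alignment is practical only for a handful of short sequences.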

Page 8

ClustalW

• Progressive Alignment
  – Sequences aligned in pair-wise fashion
  – Alignment scores produce phylogenetic tree
  – Enhanced dynamic programming approach

Page 9

Hidden Markov Models

• Match State, Insert State, Delete State

Page 10

HMMs

• Models conserved regions

• Successful at detecting and aligning critical motifs and conserved core structure

• Difficulty in aligning sequence outside of these regions

Page 12

SATCHMO

• Progressive Alignment
  – Built iteratively in pairs
  – Profile HMMs used

• Alignments of same sequences not same at each node

• Number of columns predicted smaller as structures diverge

• Output not represented by single matrix

Page 13

Why HMMs?

• Homologs ranked through scoring

• Accurate profiles from small numbers of sequences

• Accurately combines two alignments having low sequence similarity

Page 14

Bits saved relative to background

• k = 1..M: HMM node number

• a: amino acid type

• Pk(a): emission probability of a in kth match state

• P0(a): approximation of background probability of a
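The equation for this slide did not survive in the transcript. A plausible reading, stated here as an assumption rather than as the authors' exact definition, is the average relative entropy of the match states against the background: b = (1/M) Σk Σa Pk(a) log2(Pk(a) / P0(a)). The sketch below computes that quantity; the function name `bits_saved` and the dictionary layout are illustrative.

```python
import math


def bits_saved(match_emissions, background):
    """Average relative entropy (in bits) of match states vs. background.

    match_emissions: list of dicts, one per match state k = 1..M,
                     mapping amino acid a -> Pk(a)
    background:      dict mapping amino acid a -> P0(a)
    NOTE: the formula is an assumed reconstruction, not copied from the slide.
    """
    total = 0.0
    for p_k in match_emissions:
        for a, p in p_k.items():
            if p > 0.0:  # the term 0 * log(0) is taken as 0
                total += p * math.log2(p / background[a])
    return total / len(match_emissions)
```

With a toy two-letter alphabet and a uniform background, a fully conserved column contributes 1 bit and a column identical to the background contributes 0 bits, so an HMM with one of each averages 0.5 bits per node.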

Page 15

Sequence weights

• Sequences weighted such that b converges on a desired value

• Weights compensate for correlation in sequences

Page 16

HMM Construction

• Profile HMM constructed from multiple alignment

• Some columns alignable; others not

Page 17

HMM Construction

• Given an alignment a, a profile HMM is generated

• Each column in a is assigned to an emitter state – transition probabilities are calculated based on observed amino acids
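As a rough illustration of turning alignment columns into per-state probabilities, the sketch below estimates one amino acid distribution per column from observed residue counts. It uses plain add-one pseudocounts, which is a simplification chosen for this example and not necessarily the regularization used in the paper; the names `match_emissions` and `AMINO_ACIDS` are illustrative.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def match_emissions(alignment, pseudocount=1.0):
    """Estimate one amino acid distribution per alignment column.

    alignment: list of equal-length aligned strings; '-' marks a gap.
    Gaps are skipped when counting residues (an assumption of this sketch).
    """
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all rows must have the same length")
    profiles = []
    for col in range(n_cols):
        counts = Counter(row[col] for row in alignment if row[col] != "-")
        total = sum(counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
        profiles.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return profiles
```

For the two-row toy alignment `["AC", "A-"]`, the first column gives P(A) = 3/22 and the second gives P(C) = 2/21; every column's distribution sums to 1.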

Page 18

Transition Probabilities

• If we have a total of five match states, the probabilities can be stored in the following table:

Page 19

HMM Terminology

• π: path through an HMM to produce a sequence s

• P(A|π) = ∏_s P(s|π_s)

• π+: maximum probability path through the HMM
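The maximum probability path is conventionally found with the Viterbi algorithm. The sketch below runs Viterbi on a generic discrete HMM in log space; it is not a profile HMM with match, insert, and delete states, and the function signature and dictionary layout are assumptions made for illustration. All probabilities are assumed to be strictly positive.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (log probability, state path) of the best path for obs.

    Assumes every start, transition, and emission probability is > 0.
    """
    # best[t][s]: log probability of the best path ending in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path
```

Working in log space avoids numerical underflow on long sequences, since the path probability is a product of many small factors.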

Page 20

Aligning Two Alignments

• One alignment is converted to an HMM

• Second alignment is aligned to the HMM
  – Some columns remain alignable
  – Affinities (relative match scores) calculated

• New MSA results

• HMM constructed from new MSA

Page 21

Aligning Two Alignments

Page 22

SATCHMO Algorithm

• Step 1:
  – Create a cluster for each input sequence and construct an HMM from the sequence

• Step 2:
  – Calculate the similarity of all pairs of clusters and identify a pair with highest similarity
  – Align the target and template to produce a new node

Page 23

SATCHMO Algorithm

• Repeat step 2 until:
  – All sequences are assigned to a single cluster
  – Highest similarity between clusters is below a threshold
  – No alignable positions are predicted

• Output: A set of binary trees
  – Leaf nodes are sequences
  – Each node contains an HMM aligning the sequences in the subtree
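The loop described on these two slides can be sketched as a simple agglomerative procedure. Everything HMM-specific is replaced here by toy stand-ins: `toy_similarity` and `toy_merge` are hypothetical placeholders operating on plain strings, not the scoring or alignment used by SATCHMO, and the `threshold` value is arbitrary.

```python
from itertools import combinations


class Node:
    """A tree node; `profile` stands in for the HMM stored at the node."""

    def __init__(self, profile, left=None, right=None):
        self.profile = profile
        self.left = left
        self.right = right

    def leaves(self):
        if self.left is None:
            return [self.profile]
        return self.left.leaves() + self.right.leaves()


def toy_similarity(a, b):
    """Hypothetical stand-in for cluster similarity:
    fraction of identical characters over the shorter profile."""
    n = min(len(a.profile), len(b.profile))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a.profile, b.profile)) / n


def toy_merge(a, b):
    """Hypothetical stand-in for aligning target to template:
    keep only the positions where the two profiles agree."""
    return "".join(x for x, y in zip(a.profile, b.profile) if x == y)


def build_forest(sequences, threshold=0.5):
    clusters = [Node(s) for s in sequences]      # Step 1: one cluster per sequence
    while len(clusters) > 1:                     # Step 2, repeated
        a, b = max(combinations(clusters, 2), key=lambda p: toy_similarity(*p))
        if toy_similarity(a, b) < threshold:
            break                                # stop: best pair too dissimilar
        merged = toy_merge(a, b)
        if not merged:
            break                                # stop: no alignable positions
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(Node(merged, a, b))
    return clusters                              # a set of binary trees
```

For example, `build_forest(["AAAA", "AAAT", "CCCC"])` merges the first two sequences into one tree and leaves the third as a separate single-leaf tree, because its similarity to the merged node falls below the threshold. The merged profile is shorter than its children, loosely mirroring how the number of alignable columns shrinks as sequences diverge.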

Page 24

Graphical Interface for SATCHMO

Page 25

Demonstration of SATCHMO

Page 26

Validation Set

• BAliBASE benchmark alignment set used
  – Ref1: equidistant sequences
  – Ref2: distantly related sequences
  – Ref3: subgroups of sequences; < 25% similarity between groups
  – Ref4: alignments with long extensions on the ends
  – Ref5: alignments with long insertions

Page 27

Comparison of Results

• SATCHMO compared to:
  – ClustalW (Progressive Pairwise Alignment)
  – SAM (HMM)

Page 28

Page 29

Discussion

• SATCHMO effective in identifying protein domains

• Comparison to T-Coffee and PRRP would be useful
  – Time and sensitivity

• Tree representation is unique, modeling structural similarity