Eric C. Rouchka, University of Louisville
SATCHMO: sequence alignment and tree construction using hidden Markov models
Edgar, R.C. and Sjolander, K. Bioinformatics. 19(11):1404-1411.
CECS 694-04 Bioinformatics Journal Club
Eric Rouchka, D.Sc.
September 10, 2003
What is Multiple Sequence Alignment (MSA)?
• Taking more than two sequences and aligning them based on similarity
Globin Example
>gamma_A
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVD
PENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVASALSSRYH
>alfa
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFK
LLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>beta
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVD
PENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
>delta
VHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFSQLSELHCDKLHVD
PENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVANALAHKYH
>epsilon
VHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDP
ENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
>gamma_G
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVD
PENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVASALSSRYH
>myoglobin
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIP
VKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG
>teta1
ALSAEDRALVRALWKKLGSNVGVYTTEALERTFLAFPATKTYFSHLDLSPGSSQVRAHGQKVADALSLAVERLDDLPHALSALSHLHACQLRVDPASFQLL
GHCLLVTLARHYPGDFSPALQASLDKFLSHVISALVSEYR
>zeta
SLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDAVKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHC
LLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEKYR
Globin Multiple Alignment
Why do MSA?
• Homology Searching
– Important regions conserved across (or within) species: genic regions, regulatory elements
• Phylogenetic Classification
• Subfamily classification
• Identification of critical residues
MSA Approaches
• All columns alignable across all sequences
– MSA
– ClustalW
• Columns alignable throughout all sequences singled out (Profile HMM)
– HMMER
– SAM
MSA
• N-dimensional dynamic programming
• Time consuming
• High memory usage
• Guaranteed to yield the optimal (maximum-scoring) alignment
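To make the time and memory cost of N-dimensional dynamic programming concrete, here is a minimal sketch that counts the lattice cells and the moves examined per cell. The function name `msa_dp_cost` is hypothetical, and the count assumes a full, unpruned lattice; practical tools may restrict the search space.

```python
from math import prod

def msa_dp_cost(lengths):
    """Return (cells, moves_per_cell) for a full N-dimensional DP lattice.

    cells: one cell per combination of prefix lengths, i.e. prod(L_i + 1).
    moves_per_cell: a cell can be reached from up to 2**N - 1 neighbours,
    one for every non-empty subset of sequences that advances by a residue.
    """
    n = len(lengths)
    cells = prod(length + 1 for length in lengths)
    moves_per_cell = 2 ** n - 1
    return cells, moves_per_cell

# Growth for a few sequence counts, each sequence 150 residues long.
for n in (2, 3, 4, 5):
    cells, moves = msa_dp_cost([150] * n)
    print(f"{n} sequences: {cells:.3e} cells, {moves} moves per cell")
```

Both factors grow exponentially in the number of sequences, which is why exact alignment is usually limited to a handful of sequences.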
ClustalW
• Progressive Alignment
– Sequences aligned in pair-wise fashion
– Alignment scores produce phylogenetic tree
– Enhanced dynamic programming approach
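The pairwise scoring stage of a progressive aligner can be sketched as follows. This is a simplified illustration, not ClustalW itself: it uses a plain Needleman-Wunsch score with match/mismatch values and a linear gap cost (all assumed parameters), whereas ClustalW relies on substitution matrices and more elaborate gap penalties. The function names are hypothetical.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of a and b (Needleman-Wunsch, linear gap cost)."""
    # prev holds the previous DP row; row 0 aligns a prefix of b against gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def pairwise_scores(seqs):
    """Score every pair (i, j) with i < j; a guide tree would be built from these."""
    return {(i, j): nw_score(seqs[i], seqs[j])
            for i in range(len(seqs)) for j in range(i + 1, len(seqs))}

def most_similar_pair(seqs):
    """Indices of the highest-scoring pair, i.e. the first pair to be joined."""
    scores = pairwise_scores(seqs)
    return max(scores, key=scores.get)
```

Only two rows of the DP matrix are kept because the sketch needs the score, not the traceback.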
Hidden Markov Models
• Match State, Insert State, Delete State
HMMs
• Models conserved regions
• Successful at detecting and aligning critical motifs and conserved core structure
• Difficulty in aligning sequence outside of these regions
SATCHMO
• Simultaneous Alignment and Tree Construction using Hidden Markov mOdels
www.lib.jmu.edu/music/composers/armstrong.htm
SATCHMO
• Progressive Alignment
– Built iteratively in pairs
– Profile HMMs used
• The alignment of the same sequences can differ from node to node
• Fewer alignable columns are predicted as structures diverge
• Output cannot be represented by a single alignment matrix
Why HMMs?
• Homologs ranked through scoring
• Accurate profiles from small numbers of sequences
• Accurately combines two alignments having low sequence similarity
Bits saved relative to background
• k = 1..M: HMM node number
• a: amino acid type
• Pk(a): emission probability of a in kth match state
• P0(a): approximation of background probability of a
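The formula on this slide did not survive in the transcript. A plausible reading of the symbols above, stated here as an assumption, is that b is the relative entropy of the match-state emissions against the background, averaged over the M nodes: b = (1/M) Σk Σa Pk(a) log2(Pk(a) / P0(a)). The sketch below computes that quantity; the function name `bits_saved` is hypothetical.

```python
from math import log2

def bits_saved(match_emissions, background):
    """Average relative entropy (bits) of match-state emissions vs. background.

    match_emissions: list of M dicts, one per node k, mapping a -> P_k(a).
    background: dict mapping a -> P_0(a).
    Assumed definition: b = (1/M) * sum_k sum_a P_k(a) * log2(P_k(a) / P_0(a)).
    """
    total = 0.0
    for p_k in match_emissions:
        for a, p in p_k.items():
            if p > 0.0:  # a zero-probability term contributes nothing
                total += p * log2(p / background[a])
    return total / len(match_emissions)
```

Under this reading, a column that matches the background saves 0 bits, and a fully conserved column saves log2(1 / P0(a)) bits.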
Sequence weights
• Sequences weighted such that b converges on a desired value
• Weights compensate for correlation in sequences
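One way to picture weights that make b converge on a desired value is to rescale the total sequence weight until the resulting profile saves the target number of bits. The sketch below is an illustration under several assumptions: relative weights are already folded into the per-column counts, a simple background-proportional pseudocount (`alpha`) stands in for whatever prior the real method uses, and b is taken to be the average relative entropy against the background. All function names are hypothetical.

```python
from math import log2

def emissions(counts, background, scale, alpha=1.0):
    """Smoothed emissions when every sequence weight is multiplied by `scale`."""
    result = []
    for col in counts:
        n = scale * sum(col.values())
        result.append({a: (scale * col.get(a, 0.0) + alpha * p0) / (n + alpha)
                       for a, p0 in background.items()})
    return result

def bits_saved(match_emissions, background):
    """Average relative entropy (bits) of the emissions vs. the background."""
    total = sum(p * log2(p / background[a])
                for p_k in match_emissions for a, p in p_k.items() if p > 0.0)
    return total / len(match_emissions)

def scale_for_target(counts, background, target_bits, lo=1e-6, hi=1e6, iters=100):
    """Bisect on a log scale for the weight scale at which b reaches target_bits.

    b grows monotonically with the scale: a tiny scale leaves the profile at
    the background (b near 0), a large scale approaches the raw frequencies.
    """
    for _ in range(iters):
        mid = (lo * hi) ** 0.5
        if bits_saved(emissions(counts, background, mid), background) < target_bits:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5
```

If the target exceeds what the raw column frequencies can reach, the search simply ends at the upper bound of the interval.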
HMM Construction
• Profile HMM constructed from multiple alignment
• Some columns alignable; others not
HMM Construction
• Given an alignment a, a profile HMM is generated
• Each column in a is assigned to an emitter state; emission probabilities are estimated from the observed amino acids, and transition probabilities from the observed pattern of residues and gaps
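A minimal sketch of turning alignment columns into match-state emission distributions is given below. It is not SATCHMO's procedure: the gap-fraction rule for choosing match columns and the uniform pseudocount are stand-in assumptions, and the function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def match_columns(alignment, max_gap_fraction=0.5):
    """Indices of columns treated as match states (assumed gap-fraction rule)."""
    n_seqs = len(alignment)
    return [c for c in range(len(alignment[0]))
            if sum(row[c] == "-" for row in alignment) / n_seqs <= max_gap_fraction]

def match_emissions(alignment, pseudocount=1.0):
    """One smoothed emission distribution per match column."""
    result = []
    for c in match_columns(alignment):
        # Count only standard residues; gaps and other symbols are ignored.
        counts = Counter(row[c] for row in alignment if row[c] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        result.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return result
```

The alignment is passed as a list of equal-length strings with "-" for gaps, for example `["AC-D", "AC-D", "ACGD", "A--D"]`.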
Transition Probabilities
• If we have a total of five match states, the probabilities can be stored in the following table:
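The table itself is not preserved in this transcript. As a hedged illustration of how such a table could be stored, the sketch below keeps one row per transition type and one column per node (0 to M), and fills it from example state paths. The seven-transition layout, the use of node 0 as a begin state, and the function name `transition_table` are assumptions; some profile HMM packages use a different set of transitions, and no pseudocounts are applied here.

```python
from collections import defaultdict

TRANSITIONS = ("MM", "MI", "MD", "IM", "II", "DM", "DD")  # assumed layout

def transition_table(paths, n_match):
    """Estimate transition probabilities from state paths.

    paths: list of paths, each a list of (state, node) with state in "MID".
           Node 0 stands for the begin state and is labelled "M" here.
    Returns {transition: [p_0, ..., p_n_match]}, one column per node.
    """
    counts = defaultdict(float)
    for path in paths:
        for (s1, k1), (s2, _k2) in zip(path, path[1:]):
            counts[(s1 + s2, k1)] += 1.0
    table = {t: [0.0] * (n_match + 1) for t in TRANSITIONS}
    for k in range(n_match + 1):
        for src in "MID":
            # Normalise over the transitions leaving this source state.
            outgoing = [t for t in TRANSITIONS if t[0] == src]
            total = sum(counts[(t, k)] for t in outgoing)
            for t in outgoing:
                if total > 0:
                    table[t][k] = counts[(t, k)] / total
    return table
```

With five match states, each row holds six values, indexed by the node the transition leaves.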
HMM Terminology
• π: path through an HMM to produce a sequence s
• P(A|π) = ∏_s P(s|π_s)
• π+: maximum probability path through the HMM
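The maximum probability path through an HMM is usually found with the Viterbi algorithm. The sketch below runs it in log space on a generic HMM described by plain dictionaries; it is not a full profile HMM (there are no silent delete states), all probabilities are assumed to be strictly positive, and the two-state model in the usage note is a made-up toy.

```python
from math import log

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best_path, log_probability) for an observation sequence."""
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]
```

A toy call could use states `("M", "I")`, start probabilities `{"M": 0.6, "I": 0.4}`, and observations given as a string of symbols that appear as keys in the emission dictionaries.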
Aligning Two Alignments
• One alignment is converted to an HMM
• Second alignment is aligned to the HMM
– Some columns remain alignable
– Affinities (relative match scores) calculated
• New MSA results
• HMM constructed from new MSA
Aligning Two Alignments
SATCHMO Algorithm
• Step 1:
– Create a cluster for each input sequence and construct an HMM from the sequence
• Step 2:
– Calculate the similarity of all pairs of clusters and identify a pair with highest similarity
– Align the target and template to produce a new node
SATCHMO Algorithm
• Repeat step 2 until:
– All sequences assigned to a single cluster
– Highest similarity between clusters is below a threshold
– No alignable positions are predicted
• Output: A set of binary trees
– Leaf nodes are sequences
– Each node contains an HMM aligning the sequences in the subtree
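The control flow of the steps above can be sketched as an agglomerative loop. This is only a skeleton under stated assumptions: `toy_similarity` (best fractional identity between member sequences) is a stand-in for HMM-based scoring, no HMMs or alignments are actually built at the nodes, the "no alignable positions" stopping rule is not modelled, and all names are hypothetical. Sequences are assumed to be non-empty.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional

@dataclass
class Node:
    sequences: List[str]
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def toy_similarity(a: Node, b: Node) -> float:
    """Best fractional identity between any two member sequences (stand-in score)."""
    best = 0.0
    for s in a.sequences:
        for t in b.sequences:
            same = sum(x == y for x, y in zip(s, t))
            best = max(best, same / max(len(s), len(t)))
    return best

def build_forest(seqs, threshold=0.5, similarity=toy_similarity):
    """Merge the most similar pair of clusters until a stopping rule fires."""
    clusters = [Node([s]) for s in seqs]          # step 1: one cluster per sequence
    while len(clusters) > 1:                      # stop: a single cluster remains
        a, b = max(combinations(clusters, 2), key=lambda p: similarity(*p))
        if similarity(a, b) < threshold:          # stop: best pair too dissimilar
            break
        merged = Node(a.sequences + b.sequences, a, b)   # step 2: new node
        clusters = [c for c in clusters if c is not a and c is not b] + [merged]
    return clusters
```

The return value is a list of tree roots, so a threshold that stops merging early yields several trees rather than one.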
Graphical Interface for SATCHMO
Demonstration of SATCHMO
Validation Set
• BAliBASE benchmark alignment set used
– Ref1: equidistant sequences
– Ref2: distantly related sequences
– Ref3: subgroups of sequences; < 25% similarity between groups
– Ref4: alignments with long extensions on the ends
– Ref5: alignments with long insertions
Comparison of Results
• SATCHMO compared to:
– ClustalW (Progressive Pairwise Alignment)
– SAM (HMM)
Discussion
• SATCHMO effective in identifying protein domains
• Comparison to T-Coffee and PRRP would be useful
– Time and sensitivity
• Tree representation is unique, modeling structural similarity