CECS 694-04 Bioinformatics Journal Club. Eric Rouchka, D.Sc. September 10, 2003

Eric C. Rouchka, University of Louisville

SATCHMO: sequence alignment and tree construction using hidden Markov models

Edgar, R.C. and Sjolander, K. Bioinformatics. 19(11):1404-1411.

http://www.ncrr.nih.gov/


What is Multiple Sequence Alignment (MSA)?

• Taking more than two sequences and aligning them based on similarity


Globin Example

>gamma_A
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVD
PENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVASALSSRYH
>alfa
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFK
LLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>beta
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVD
PENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
>delta
VHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFSQLSELHCDKLHVD
PENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVANALAHKYH
>epsilon
VHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDP
ENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
>gamma_G
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVD
PENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVASALSSRYH
>myoglobin
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIP
VKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG
>teta1
ALSAEDRALVRALWKKLGSNVGVYTTEALERTFLAFPATKTYFSHLDLSPGSSQVRAHGQKVADALSLAVERLDDLPHALSALSHLHACQLRVDPASFQLL
GHCLLVTLARHYPGDFSPALQASLDKFLSHVISALVSEYR
>zeta
SLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDAVKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHC
LLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEKYR


Globin Multiple Alignment


Why do MSA?

• Homology Searching
  – Important regions conserved across (or within) species
    • Genic Regions
    • Regulatory Elements
• Phylogenetic Classification
• Subfamily classification
• Identification of critical residues


MSA Approaches

• All columns alignable across all sequences
  – MSA
  – ClustalW
• Columns alignable throughout all sequences singled out (Profile HMM)
  – HMMER
  – SAM


MSA

• N-dimensional dynamic programming
• Time consuming
• High memory usage
• Guaranteed to yield the maximum-scoring (optimal) alignment
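
To make the time and memory cost concrete, here is a minimal sketch that counts the cells in a full N-dimensional dynamic programming lattice. The function name `dp_cost` and the sequence length of 146 are illustrative assumptions, and the count ignores any pruning a real program may apply.

```python
from math import prod

def dp_cost(lengths):
    """Return (cells, predecessors per cell) for a full N-dimensional DP lattice."""
    # One axis per sequence, with len + 1 positions along each axis.
    cells = prod(n + 1 for n in lengths)
    # Each cell looks back at every non-empty subset of axes that can advance.
    predecessors = 2 ** len(lengths) - 1
    return cells, predecessors

two = dp_cost([146, 146])          # pairwise alignment
three = dp_cost([146, 146, 146])   # three sequences at once
print(two)    # (21609, 3)
print(three)  # (3176523, 7)
```

Each added sequence multiplies the lattice size by roughly the sequence length, which is why the exact approach does not scale to many sequences.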


ClustalW

• Progressive Alignment
  – Sequences aligned in pair-wise fashion
  – Alignment scores produce phylogenetic tree
  – Enhanced dynamic programming approach
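
The first stage of a progressive method can be sketched as: score every pair, then start the tree from the most similar pair. The sketch below is an illustration only. It assumes a toy scoring scheme (match +1, mismatch -1, linear gap -2) rather than ClustalW's substitution matrices and gap penalties, and the sequences are 20-residue prefixes of the globins from the example.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score (Needleman-Wunsch) with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]   # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def closest_pair(seqs):
    """Return the names of the pair with the highest pairwise score."""
    return max(combinations(sorted(seqs), 2),
               key=lambda p: nw_score(seqs[p[0]], seqs[p[1]]))

seqs = {
    "beta": "VHLTPEEKSAVTALWGKVNV",
    "delta": "VHLTPEEKTAVNALWGKVNV",
    "alfa": "VLSPADKTNVKAAWGKVGAH",
}
first_merge = closest_pair(seqs)
print(first_merge)  # ('beta', 'delta')
```

A full progressive aligner would go on to build a guide tree from all pairwise scores and align profiles along it; only the pair-selection step is shown here.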


Hidden Markov Models

• Match State, Insert State, Delete State


HMMs

• Models conserved regions

• Successful at detecting and aligning critical motifs and conserved core structure

• Difficulty in aligning sequence outside of these regions


SATCHMO

• Simultaneous Alignment and Tree Construction using Hidden Markov mOdels

http://www.lib.jmu.edu/music/composers/armstrong.htm









SATCHMO

• Progressive Alignment
  – Built iteratively in pairs
  – Profile HMMs used
• Alignments of the same sequences are not the same at each node
• Number of columns predicted alignable gets smaller as structures diverge
• Output is not represented by a single matrix


Why HMMs?

• Homologs ranked through scoring
• Accurate profiles from small numbers of sequences
• Accurately combines two alignments having low sequence similarity


Bits saved relative to background

• k = 1..M: HMM node number
• a: amino acid type
• Pk(a): emission probability of a in kth match state
• P0(a): approximation of background probability of a
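
The formula itself did not survive on this slide. One reading consistent with the symbols above is the average relative entropy per match state, b = (1/M) Σk Σa Pk(a) log2(Pk(a) / P0(a)). That form is an assumption here, not a quotation from the paper. A minimal sketch under that assumption, using a toy two-letter alphabet:

```python
from math import log2

def bits_saved(match_emissions, background):
    """Average relative entropy (in bits) of match-state emissions vs. background.

    Assumed form: b = (1/M) * sum_k sum_a Pk(a) * log2(Pk(a) / P0(a)).
    """
    total = 0.0
    for pk in match_emissions:          # one dict per match state k = 1..M
        for a, p in pk.items():
            if p > 0.0:                 # terms with Pk(a) = 0 contribute nothing
                total += p * log2(p / background[a])
    return total / len(match_emissions)

background = {"A": 0.5, "B": 0.5}       # toy P0(a)
profile = [
    {"A": 1.0, "B": 0.0},               # fully conserved column: 1 bit
    {"A": 0.5, "B": 0.5},               # background-like column: 0 bits
]
b = bits_saved(profile, background)
print(b)  # 0.5
```

A conserved column contributes many bits and a background-like column contributes none, so b summarizes how informative the profile is on average.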


Sequence weights

• Sequences weighted such that b converges on a desired value

• Weights compensate for correlation in sequences


HMM Construction

• Profile HMM constructed from multiple alignment

• Some columns alignable; others not


HMM Construction

• Given an alignment a, a profile HMM is generated

• Each column in a is assigned to an emitter state
  – Transition probabilities are calculated based on observed amino acids
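
As an illustration of turning alignment columns into emitter states, the sketch below estimates per-column emission probabilities from observed residues. It assumes every column is a match column and uses a simple add-one pseudocount. Both are simplifications chosen for the example, not the method described in the paper. The rows are 8-residue prefixes of beta, delta and epsilon from the globin example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def match_emissions(alignment, pseudocount=1.0):
    """Estimate one emission distribution per alignment column."""
    emissions = []
    for col in range(len(alignment[0])):
        # Count residues in this column, skipping gap characters.
        counts = Counter(row[col] for row in alignment if row[col] != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        emissions.append(
            {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}
        )
    return emissions

alignment = ["VHLTPEEK", "VHLTPEEK", "VHFTAEEK"]
profile = match_emissions(alignment)
print(round(profile[0]["V"], 3))  # column 1 is all V
print(round(profile[2]["L"], 3), round(profile[2]["F"], 3))  # column 3 is L, L, F
```

The pseudocount keeps unobserved amino acids from getting probability zero, which matters when the alignment contains only a few sequences.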


Transition Probabilities

• If we have a total of five match states, the probabilities can be stored in the following table:


HMM Terminology

• π: path through an HMM to produce a sequence s
• P(A|Π) = ∏s P(s|πs)
• π+: maximum probability path through the HMM
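
The maximum probability path is usually found with the Viterbi algorithm. The sketch below runs Viterbi on a generic two-state HMM, not on a full profile HMM with match, insert and delete states. The state names, symbols and probabilities are invented for illustration, and all probabilities are assumed to be non-zero so that logarithms are defined.

```python
from math import exp, log

def viterbi(obs, states, start, trans, emit):
    """Return (log probability, path) of the most probable state path for obs."""
    # best[s] = (log prob of the best path ending in state s, that path)
    best = {s: (log(start[s]) + log(emit[s][obs[0]]), [s]) for s in states}
    for symbol in obs[1:]:
        nxt = {}
        for s in states:
            # Pick the best predecessor state r for a transition r -> s.
            score, path = max(
                ((best[r][0] + log(trans[r][s]), best[r][1]) for r in states),
                key=lambda t: t[0],
            )
            nxt[s] = (score + log(emit[s][symbol]), path + [s])
        best = nxt
    return max(best.values(), key=lambda t: t[0])

states = ["C", "V"]   # hypothetical "conserved" and "variable" states
start = {"C": 0.5, "V": 0.5}
trans = {"C": {"C": 0.8, "V": 0.2}, "V": {"C": 0.2, "V": 0.8}}
emit = {"C": {"G": 0.9, "x": 0.1}, "V": {"G": 0.2, "x": 0.8}}

log_p, path = viterbi("GGxx", states, start, trans, emit)
print(path)  # ['C', 'C', 'V', 'V']
```

Working in log space avoids numerical underflow on long sequences, since the path probability is a product of many small numbers.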


Aligning Two Alignments

• One alignment is converted to an HMM
• Second alignment is aligned to the HMM
  – Some columns remain alignable
  – Affinities (relative match scores) calculated
• New MSA results
• HMM constructed from new MSA


Aligning Two Alignments


SATCHMO Algorithm

• Step 1:
  – Create a cluster for each input sequence and construct an HMM from the sequence
• Step 2:
  – Calculate the similarity of all pairs of clusters and identify a pair with highest similarity
  – Align the target and template to produce a new node


SATCHMO Algorithm

• Repeat step 2 until:
  – All sequences assigned to a single cluster
  – Highest similarity between clusters is below a threshold
  – No alignable positions are predicted
• Output: A set of binary trees
  – Leaf nodes are sequences
  – Each node contains an HMM aligning the sequences in the subtree
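
The control flow of the algorithm can be sketched as an agglomerative loop with the three stopping conditions. Everything model-specific is a stand-in here: `identity` and `consensus` are hypothetical toy replacements for the HMM-based similarity score and for aligning two alignments, and the threshold of 0.5 is arbitrary. The inputs are 10-residue prefixes from the globin example.

```python
from itertools import combinations

def identity(x, y):
    """Toy similarity: fraction of positions with the same residue (gaps never match)."""
    same = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return same / min(len(x), len(y))

def consensus(x, y):
    """Toy merge: keep agreeing residues, mark the rest '-'; None if nothing agrees."""
    merged = "".join(p if p == q else "-" for p, q in zip(x, y))
    return merged if merged.strip("-") else None

def build_trees(sequences, similarity, merge, threshold):
    # Step 1: one cluster (leaf node) per input sequence.
    clusters = [{"members": [name], "model": seq, "children": None}
                for name, seq in sequences.items()]
    # Step 2, repeated until everything is in a single cluster.
    while len(clusters) > 1:
        score, i, j = max(
            (similarity(clusters[i]["model"], clusters[j]["model"]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if score < threshold:      # highest similarity is below the threshold
            break
        model = merge(clusters[i]["model"], clusters[j]["model"])
        if model is None:          # stand-in for "no alignable positions"
            break
        node = {"members": clusters[i]["members"] + clusters[j]["members"],
                "model": model,
                "children": (clusters[i], clusters[j])}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [node]
    return clusters                # a set of binary trees

sequences = {"beta": "VHLTPEEKSA", "delta": "VHLTPEEKTA", "alfa": "VLSPADKTNV"}
trees = build_trees(sequences, identity, consensus, threshold=0.5)
print([t["members"] for t in trees])  # [['alfa'], ['beta', 'delta']]
```

With these toy inputs the loop joins beta and delta, then stops because the remaining pair falls below the threshold, so the output is two trees rather than one.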


Graphical Interface for SATCHMO


Demonstration of SATCHMO


Validation Set

• BAliBASE benchmark alignment set used
  – Ref1: equidistant sequences
  – Ref2: distantly related sequences
  – Ref3: subgroups of sequences; < 25% similarity between groups
  – Ref4: alignments with long extensions on the ends
  – Ref5: alignments with long insertions


Comparison of Results

• SATCHMO compared to:
  – ClustalW (Progressive Pairwise Alignment)
  – SAM (HMM)


Discussion

• SATCHMO effective in identifying protein domains

• Comparison to T-Coffee and PRRP would be useful
  – Time and sensitivity

• Tree representation is unique, modeling structural similarity
