Multiple sequence alignments and motif discovery
• Multiple sequence alignment
  – ClustalW
  – MUSCLE
• Motif discovery
  – MEME
  – JASPAR
Multiple Sequence Alignment
• More than two sequences
  – DNA
  – Protein
• Evolutionary relation
  – Homology
  – Phylogenetic tree
  – Detect motifs
[Figure: unaligned DNA sequences, their gapped multiple alignment, and a tree with nodes labeled A, B and D]
• Dynamic programming
  – Optimal alignment
  – Exponential in the number of sequences
• Progressive
  – Efficient
  – Heuristic
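To make the cost of exact dynamic programming concrete, here is a minimal sketch of the two-sequence case (Needleman-Wunsch global alignment score), plus a helper that counts how many table cells the same idea would need for more sequences. The scoring values (match, mismatch, gap) and the function names are illustrative assumptions, not parameters taken from the slides or from any particular tool.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of two sequences (Needleman-Wunsch).

    Only the previous row of the DP table is kept, since the score of a cell
    depends on its left, upper and upper-left neighbours.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def dp_cells(length, n_sequences):
    """Cells in the full DP table for n sequences of the same length."""
    return (length + 1) ** n_sequences


print(nw_score("ACGT", "AGT"))
for n in (2, 3, 4):
    print(n, "sequences of length 100:", dp_cells(100, n), "cells")
```

The table grows by a factor of roughly the sequence length with every added sequence, which is why progressive heuristics are used in practice.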
ClustalW
“CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice”, J. D. Thompson, D. G. Higgins and T. J. Gibson, Nucleic Acids Research 22(22):4673-4680, 1994
• Progressive
  – At each step, align two existing alignments or sequences
  – Gaps present in older alignments remain fixed
-TGTTAAC
-TGT-AAC
-TGT--AC
ATGT---C
ATGT-GGC
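One progressive step can be sketched as follows: a new sequence is aligned against the columns of an existing alignment, and any gap the step introduces into the old alignment is inserted as a whole column, so gaps already present are never moved. This is a simplified illustration only. It uses a plain sum-of-pairs column score with assumed penalties and has none of ClustalW's sequence weighting or position-specific gap penalties; `add_sequence` is a hypothetical helper name.

```python
def add_sequence(alignment, seq, match=1, mismatch=-1, gap=-2):
    """Align `seq` to an existing alignment (equal-length gapped rows).

    Existing columns are kept intact; new gaps enter the old rows only as
    whole all-gap columns.
    """
    cols = list(zip(*alignment))
    k, n, m = len(alignment), len(cols), len(seq)

    def pair(col, residue):
        # Sum-of-pairs score of one residue against every symbol in a column.
        return sum(gap if s == "-" else match if s == residue else mismatch
                   for s in col)

    # score[i][j]: best score for the first i columns vs. the first j residues.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap * k
    for j in range(1, m + 1):
        score[0][j] = j * gap * k
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair(cols[i - 1], seq[j - 1]),
                              score[i - 1][j] + gap * k,   # gap in the new sequence
                              score[i][j - 1] + gap * k)   # new all-gap column

    # Trace back from the bottom-right corner to recover the merged alignment.
    out_cols, new_row = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(cols[i - 1], seq[j - 1]):
            out_cols.append(cols[i - 1]); new_row.append(seq[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap * k:
            out_cols.append(cols[i - 1]); new_row.append("-"); i -= 1
        else:
            out_cols.append(("-",) * k); new_row.append(seq[j - 1]); j -= 1
    out_cols.reverse()
    new_row.reverse()
    rows = ["".join(col[r] for col in out_cols) for r in range(k)]
    return rows + ["".join(new_row)]


for row in add_sequence(["TGTTAAC", "TGT-AAC"], "TGTAAC"):
    print(row)
```

In the usage example the gap in the second row stays where it was, and the new sequence receives a gap in the same column.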
ClustalW - Input
http://www.ebi.ac.uk/Tools/clustalw2/index.html
• Input sequences
• Gap scoring
• Scoring matrix
• Email address
• Output format
Can we find motifs using multiple sequence alignment?
    1    2    3    4    5    6    7    8    9    10
A   0    0    0    0    0    1/2  1/6  1/3  0    0
D   0    1/2  1/3  0    0    1/6  5/6  1/6  0    1/6
E   0    0    2/3  1    0    0    0    0    1    5/6
G   0    1/6  0    0    1    1/3  0    0    0    0
H   0    1/6  0    0    0    0    0    0    0    0
N   0    1/6  0    0    0    0    0    0    0    0
Y   1    0    0    0    0    0    0    1/2  0    0
  1 3 5 7 9
..YDEEGGDAEE..
..YDEEGGDAEE..
..YGEEGADYED..
..YDEEGADYEE..
..YNDEGDDYEE..
..YHDEGAADEE..
  * :**   *:
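The frequency profile can be derived from the aligned block by counting residues column by column. The following is a minimal sketch using the six sequences shown; `column_frequencies` is an illustrative helper name, and the code assumes a gap-free block in which all rows have the same length.

```python
from collections import Counter
from fractions import Fraction


def column_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: Fraction(n, len(column)) for res, n in counts.items()})
    return profile


sequences = [
    "YDEEGGDAEE",
    "YDEEGGDAEE",
    "YGEEGADYED",
    "YDEEGADYEE",
    "YNDEGDDYEE",
    "YHDEGAADEE",
]

profile = column_frequencies(sequences)
for position, freqs in enumerate(profile, start=1):
    print(position, {res: str(f) for res, f in sorted(freqs.items())})
```

Exact fractions are used so that the output can be compared directly with entries such as 1/6 and 5/6; each column sums to 1.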
Motif
A widespread pattern with biological significance
MEME – Multiple EM* for Motif Elicitation
• http://meme.sdsc.edu/
• Motif discovery from unaligned sequences
  – Genomic or protein sequences
• Flexible model of motif presence (a motif can be absent in some sequences or appear several times in one sequence)
*Expectation-maximization
MEME - Input
• Email address
• Input file (FASTA file)
• How many times in each sequence?
• How many motifs?
• How many sites?
• Range of motif lengths
MAST
• Searches for motifs (one or more) in sequence databases:
  – Like BLAST, but with motifs as input
  – Similar to iterations of PSI-BLAST
• Profile defines strength of match
  – Multiple motif matches per sequence
  – Combined E-value for all motifs
• MEME uses MAST to summarize results:
  – Each MEME result is accompanied by the MAST result for searching the discovered motifs on the given sequences.
http://meme.sdsc.edu/meme4_4_0/cgi-bin/mast.cgi
JASPAR
• Profiles
  – Transcription factor binding sites
  – Multicellular eukaryotes
  – Derived from published collections of experiments
• Open data access
JASPAR
• Profiles
  – Modeled as matrices
  – Can be converted into PSSMs for scanning genomic sequences
    1    2    3    4    5    6    7    8    9    10
A   0    0    0    0    0    1/2  1/6  1/3  0    0
D   0    1/2  1/3  0    0    1/6  5/6  1/6  0    1/6
E   0    0    2/3  1    0    0    0    0    1    5/6
G   0    1/6  0    0    1    1/3  0    0    0    0
H   0    1/6  0    0    0    0    0    0    0    0
N   0    1/6  0    0    0    0    0    0    0    0
Y   1    0    0    0    0    0    0    1/2  0    0
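Converting a profile matrix into a PSSM and scanning a sequence with it can be sketched as below. The count matrix is a small made-up DNA example (a TATA-like site built from 8 hypothetical sequences), not a real JASPAR profile. The pseudocount, the uniform background of 0.25 and the function names are likewise assumptions chosen for illustration, and the scan assumes the sequence contains only A, C, G and T.

```python
import math


def counts_to_pssm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores."""
    length = len(counts["A"])
    pssm = []
    for pos in range(length):
        total = sum(counts[base][pos] for base in "ACGT") + pseudocount
        column = {}
        for base in "ACGT":
            # The pseudocount is spread over the bases according to the background.
            freq = (counts[base][pos] + pseudocount * background) / total
            column[base] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def scan(sequence, pssm):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        hits.append((start, sum(col[base] for col, base in zip(pssm, window))))
    return hits


# Hypothetical counts for a 4-position motif.
counts = {
    "A": [0, 8, 0, 8],
    "C": [0, 0, 0, 0],
    "G": [0, 0, 0, 0],
    "T": [8, 0, 8, 0],
}

pssm = counts_to_pssm(counts)
hits = scan("GGTATACC", pssm)
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(best_start, round(best_score, 2))
```

The pseudocount keeps zero counts from producing a score of minus infinity, so a single mismatch lowers a window's score without ruling it out entirely.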