Finding and aligning related sequences
Martin C. Frith Computational Biology Research Center
AIST, Tokyo www.cbrc.jp/~martin
2012-12-09 @ BioinfoSummer, Adelaide
CBRC
www.cbrc.jp
Finding and aligning related sequences
• Examples
• What are we really trying to do?
• Simple sequences
• Repeat masking
• Classic score-based alignment
• Alignment & probability models
• Alignment ambiguity
• Using sequence quality data
• Scaling to huge datasets
Compare human and mouse genomes
gctagtgtac
|||  || ||
gct--tgaac
aa-gtaca
|| |||||
aaggtaca
Human
Mouse
Compare DNA from a patient to a reference genome
Patient DNA Sequencer
ctatgctagtcgta
cctatagtctgtatg
atatatatattatta
ccctagtcgtatgg
tttaccagctgga
ctagtcgtagtgtgg
ctgattgcttatttacgttcgtatgctagctgatcgtagtcgtcgagcttatcgtgggc
Reference genome sequence
ctagcttatcgt
DNA reads
What kinds of microbial genes are there?
Water from e.g. a hot spring
DNA Sequencer
ctatgctagtcgta
cctatagtctgtatg
atatatatattatta
ccctagtcgtatgg
tttaccagctgga
ctagtcgtagtgtgg
DNA reads
ArgLysTyrProPheLeuLeuIsoArgLysPheAlaPro-ProGlyGlyAlaGly…
atatatatatattagccgt
|||...||| |||...|||
GlyGlyPhePheGlyAlaLeuCysCysTrpTrpAlaGlyAlaPro…
Database of all known proteins
More examples
• Compare ancient DNA to a reference genome
– Mammoth, neanderthal, Turin Shroud, …
• Align (potentially spliced) RNA sequences to a reference genome
– To see which genes are active
• Align short DNA reads to each other
– In order to assemble them
What are we really trying to do?
1. Find and align similar sequences?
2. Find and align homologous sequences?
3. Find and align orthologous sequences?
4. Find and align paralogous sequences?
Homology, orthology, paralogy
Homology: descent from a common ancestor
Orthology: descent from a common ancestor by genome division
Paralogy: descent from a common ancestor by duplication within a genome
Past
Present
Example
human mouse
β1-globin β2-globin
β-globin
Example
human
β1-globin β2-globin
β-globin
Orthologs
Paralogs
mouse
Example
human
β1-globin β2-globin
β-globin
• Orthology is not necessarily 1-to-1
• Orthology is not transitive
Not an equivalence relation
mouse
Orthologs
Orthologs
Compare human and mouse genomes
What is the aim?
• Find similar sequences
• Find homologs
• Find orthologs
• Find paralogs
Do we want to align mouse α-globin with human β-globin?
Probably not
Compare DNA from a patient to a reference genome
Patient DNA Sequencer
ctatgctagtcgta
cctatagtctgtatg
atatatatattatta
ccctagtcgtatgg
tttaccagctgga
ctagtcgtagtgtgg
ctgattgcttatttacgttcgtatgctagctgatcgtagtcgtcgagcttatcgtgggc
Reference genome sequence
DNA reads
What is the aim?
• Find similar sequences
• Find homologs
• Find orthologs
• Find paralogs
Do we want to align the patient’s α-globin to the reference’s β-globin?
Aims and algorithms
• Sequence comparison algorithms basically find similar sequences
• Finding homologs is harder
• Finding orthologs is even harder
Similarity versus homology
[Venn diagram: two overlapping sets, similar sequences and homologous sequences]
Similar but not homologous: convergent evolution
Homologous but not similar: rapid evolution over a long time span
• The most frequent case of convergent evolution is simple sequences
Simple sequences
• DNA (and RNA and protein) frequently has simple sequences:
atgatcgattatcgtagtctaggtcgtatgctatgatt
cgataaaaaaaaaaaaaaaaaaacggtatgcgtagctg
cgatcgtagtgactatatgagagaggattcgatgctaa
gttctctaggagaggcttaggctgagcgcgtatcactg
gctcgcggctgtgtgtgtgtgtgtgtgtgtgtgtgtga
cgtatcgcacatcgtcgattttgagattcccgatggcc
How do simple sequences evolve?
• Strand slippage during DNA replication:
[Animation frames, condensed]
Template strand: gtagtagtagtagtagta
Growing strand, one base per frame: catcatcatc, catcatcatca, catcatcatcat, catcatcatcatc, catcatcatcatca, catcatcatcatcat
Slippage: the growing strand is then shown as two pieces, catcat and cat; the second piece grows frame by frame: catc, catca, catcat
On the top strand, it has got longer
How do simple sequences evolve?
• An initial (short, mild) simple sequence occurs by chance
• Due to slippage, it gets longer…
• And longer…
Homology between human and banana?
Human
atatatatatatatatatatatatatatatatatatatata
|||||||||||||||||||||||||||||||||||||||||
atatatatatatatatatatatatatatatatatatatata
Banana
• Probably not.
Avoiding non-homologous alignments of simple sequences
• The standard way is to identify and “mask” them, before alignment
Repeat masking
• There are standard “repeat masking” tools
– RepeatMasker, DustMasker, SegMasker, TRF, …
• Most people just assume they work
Repeat confusion
Simple sequence:
atcttatgtctctctctctctctctctctggatgcttgaccac
Interspersed repeat:
cttgttattgctgatcgtcctctctgtaaattgttattgctgatcatgctttaac
They are both called “repeats”, but they are rather different. Don’t confuse them.
Test of avoiding non-homologous alignments
• Compare two sequences after reversing one of them
• Sequences never evolve by reversal, so there are no true homologs in this test
• But repeats may still cause strong similarities, if they are not suppressed
atatatatatatatatatatatatatatatatatatatata
|||||||||||||||||||||||||||||||||||||||||
atatatatatatatatatatatatatatatatatatatata
Human
Banana
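The reversal test can be sketched in a few lines. This is only an illustration, not the procedure behind the slides: the function names are hypothetical, and the longest exact shared run stands in for a real aligner. Note that the sequence is reversed, not reverse-complemented.

```python
def reverse_sequence(seq: str) -> str:
    """Reverse a sequence (no complementing), as in the test above."""
    return seq[::-1]


def longest_shared_run(a: str, b: str) -> int:
    """Length of the longest exact match between a and b.

    A crude stand-in for "strongest similarity"; a real test would
    run an aligner and compare hit counts with E-values.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


# The 41-base "atat..." simple sequence shown in the alignment above.
simple = "at" * 20 + "a"
reversed_simple = reverse_sequence(simple)

# No true homology can survive reversal, yet the match is full length.
spurious_length = longest_shared_run(simple, reversed_simple)
```

An unmasked simple sequence still matches its own reversal along its whole length, which is exactly the kind of spurious hit the test is designed to expose.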
Test result
The C. elegans genome versus the reversed P. pacificus genome, after masking both with DustMasker:
[Plot] Red: observed number of alignments. Black: expected number of alignments for random sequences (E-value).
A spurious alignment
Upper sequence: from C. elegans Lower sequence: from reversed P. pacificus
Conclusion: DustMasker fails to mask some tandem repeats
Other methods?
• SegMasker does not work either:
[Alignment figure] Upper sequence: part of an animal protein. Lower sequence: part of a reversed plant protein.
• Nor does RepeatMasker, TRF…
A new repeat-masking method enables specific detection of homologous sequences. Frith MC. Nucleic Acids Research 2011 39:e23.
Repeat masking
• There are standard “repeat masking” tools
– RepeatMasker, DustMasker, SegMasker, TRF, …
• Most people just assume they work
– Cargo cult science
• Genomic bioinformatics is riddled with it
New repeat-masking method
• tantan: http://www.cbrc.jp/tantan/
• It looks for slippery regions in sequences
• Slippery = similar to shifted versions of itself
• It integrates similarity at different slip distances, using a Forward-Backward algorithm
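The idea of "similar to shifted versions of itself" can be illustrated with a much simpler statistic than the one tantan uses. The sketch below is not tantan's algorithm (which integrates over slip distances with a Forward-Backward calculation); `shift_identity` is a hypothetical helper that just measures identity at one fixed shift.

```python
def shift_identity(seq: str, shift: int) -> float:
    """Fraction of positions whose base equals the base `shift` positions earlier."""
    if shift <= 0 or shift >= len(seq):
        raise ValueError("shift must be between 1 and len(seq) - 1")
    matches = sum(seq[i] == seq[i - shift] for i in range(shift, len(seq)))
    return matches / (len(seq) - shift)


# Two lines from the "Simple sequences" slide.
simple = "gctcgcggctgtgtgtgtgtgtgtgtgtgtgtgtgtga"    # contains a gt repeat
ordinary = "atgatcgattatcgtagtctaggtcgtatgctatgatt"  # no obvious repeat

simple_score = shift_identity(simple, 2)
ordinary_score = shift_identity(ordinary, 2)
```

The gt-rich line is far more similar to itself at shift 2 than the ordinary line is, which is the "slippery" signal in its crudest form.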
A new repeat-masking method enables specific detection of homologous sequences Frith MC. Nucleic Acids Research 2011 39:e23.
tantan test result
The C. elegans genome versus the reversed P. pacificus genome, after masking with tantan:
[Plot] Red: observed number of alignments. Black: expected number of alignments for random sequences (E-value).
Conclusion
• tantan prevents simple-sequence alignments
• Without masking an excessive amount
• It even works for extremely AT-rich DNA
– Plasmodium falciparum (malaria): 80% AT
– Dictyostelium discoideum (slime mould): 80% AT
Classic score-based alignment
1. Define a scoring scheme
2. Find alignments with high (maximum) scores
Alignment scoring scheme
Substitution score matrix:
    a   c   g   t
a   2  -3  -1  -3
c  -3   2  -3  -1
g  -1  -3   2  -3
t  -3  -1  -3   2
Gap scores: gap existence cost 5, gap extension cost 1
Example:
t  a  c  g  t  g  -  -  a  g  g  t
|  |  |     |  |        |  |  |  |
t  a  c  a  t  g  c  t  a  g  g  t
+2 +2 +2 -1 +2 +2   -7  +2 +2 +2 +2
Alignment score: 10
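The example score can be checked with a short script. This is a minimal sketch with hypothetical names; it assumes a gap of length k costs existence + k × extension, which is what reproduces the −7 for the two-base gap on the slide.

```python
BASES = "acgt"
ROWS = [
    [2, -3, -1, -3],   # a
    [-3, 2, -3, -1],   # c
    [-1, -3, 2, -3],   # g
    [-3, -1, -3, 2],   # t
]
SCORE = {(x, y): ROWS[i][j]
         for i, x in enumerate(BASES)
         for j, y in enumerate(BASES)}


def alignment_score(top: str, bottom: str, gap_exist: int = 5, gap_extend: int = 1) -> int:
    """Score a gapped pairwise alignment given as two equal-length rows.

    Simplification: adjacent gap columns count as one gap even if they
    switch from one sequence to the other.
    """
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    total = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            if not in_gap:
                total -= gap_exist     # pay once per gap
            total -= gap_extend        # pay for every gapped column
            in_gap = True
        else:
            total += SCORE[x, y]
            in_gap = False
    return total


example = alignment_score("tacgtg--aggt", "tacatgctaggt")
```

Nine matches (+18), one g:a mismatch (−1) and one gap of length 2 (−7) give the total of 10.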
Classic score-based alignment
1. Define a scoring scheme
2. Find alignments with high (maximum) scores
tacgtg--aggt
||| ||  ||||
tacatgctaggt
ctatgctacgtgaggtgtggc
attacatgctaggtccac
How to find alignments with max score?
• Smith-Waterman algorithm
– Exact: guaranteed to find the max score
– A bit slow
• BLAST, FASTA, etc
– Heuristic: no guarantee
– Faster
tacgtg--aggt
||| ||  ||||
tacatgctaggt
ctatgctacgtgaggtgtggc
attacatgctaggtccac
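For reference, the exact approach can be sketched as a small dynamic program. This is a textbook-style Smith-Waterman with affine gaps (the Gotoh recurrence), returning only the best score; it is an illustration with hypothetical names, not the code of any particular tool, and it uses the score matrix and gap costs from the earlier slide.

```python
BASES = "acgt"
ROWS = [[2, -3, -1, -3], [-3, 2, -3, -1], [-1, -3, 2, -3], [-3, -1, -3, 2]]
SCORE = {(x, y): ROWS[i][j]
         for i, x in enumerate(BASES)
         for j, y in enumerate(BASES)}


def local_alignment_score(a: str, b: str, gap_exist: int = 5, gap_extend: int = 1) -> int:
    """Maximum local alignment score; a gap of length k costs gap_exist + k * gap_extend."""
    neg_inf = float("-inf")
    best = 0
    h_prev = [0] * (len(b) + 1)        # best score ending at (i-1, j)
    f_prev = [neg_inf] * (len(b) + 1)  # same, but ending in a gap that skips bases of a
    for i in range(1, len(a) + 1):
        h = [0] * (len(b) + 1)
        f = [neg_inf] * (len(b) + 1)
        e = neg_inf                    # ending in a gap that skips bases of b
        for j in range(1, len(b) + 1):
            e = max(e, h[j - 1] - gap_exist) - gap_extend
            f[j] = max(f_prev[j], h_prev[j] - gap_exist) - gap_extend
            diag = h_prev[j - 1] + SCORE[a[i - 1], b[j - 1]]
            h[j] = max(0, diag, e, f[j])   # 0 lets a local alignment start anywhere
            best = max(best, h[j])
        h_prev, f_prev = h, f
    return best


# The two sequences from the slide.
slide_score = local_alignment_score("ctatgctacgtgaggtgtggc", "attacatgctaggtccac")
```

The run time is proportional to the product of the two sequence lengths, which is why this is "a bit slow" for genome-scale inputs. The result for the slide's sequences is at least 10, because the alignment shown above is one of the candidates.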
Alignment scoring scheme
• Where do these scores come from?
• Why is this a good method anyway?
Substitution score matrix, Sxy:
    a   c   g   t
a   2  -3  -1  -3
c  -3   2  -3  -1
g  -1  -3   2  -3
t  -3  -1  -3   2
Gap scores: gap existence cost 5, gap extension cost 1
Scores are log likelihood ratios
$$S_{xy} = t \times \log\left(\frac{A_{xy}}{P_x \times Q_y}\right)$$
$A_{xy}$: probability of x aligned to y in a true alignment (model of homologous sequences)
$P_x$: probability of x in the first sequence
$Q_y$: probability of y in the second sequence
$P_x \times Q_y$: model of independent sequences
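A small numeric sketch shows how such scores arise. The model here is an assumption made for illustration only: uniform base frequencies (P = Q = 1/4), all mismatches equally likely, and homologous positions identical with probability `identity`. The function name and parameters are hypothetical.

```python
import math


def llr_scores(identity: float, t: float = 1.0, bases: str = "acgt") -> dict:
    """Scores S_xy = t * log(A_xy / (P_x * Q_y)) under a simple uniform model."""
    n = len(bases)
    background = 1.0 / n                              # P_x = Q_y = 1/4
    scores = {}
    for x in bases:
        for y in bases:
            if x == y:
                a_xy = identity / n                   # the n match pairs share `identity`
            else:
                a_xy = (1.0 - identity) / (n * (n - 1))  # the 12 mismatch pairs share the rest
            scores[x, y] = t * math.log(a_xy / (background * background))
    return scores


weak = llr_scores(0.75)     # about 75% identity
strong = llr_scores(0.99)   # about 99% identity
weak_ratio = weak["a", "c"] / weak["a", "a"]
strong_ratio = strong["a", "c"] / strong["a", "a"]
```

Under these assumptions, 75% identity gives a mismatch score equal to minus the match score (a +1/−1 scheme), while 99% identity gives a mismatch about three times as costly (close to +1/−3). The scale factor t only changes the units.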
Different matrices for different tasks
Strong similarities (~99% identity):
    a   c   g   t
a   1  -3  -3  -3
c  -3   1  -3  -3
g  -3  -3   1  -3
t  -3  -3  -3   1
Weak similarities (~75% identity):
    a   c   g   t
a   1  -1  -1  -1
c  -1   1  -1  -1
g  -1  -1   1  -1
t  -1  -1  -1   1
AT-rich DNA (e.g. malaria):
    a   c   g   t
a   2  -3  -2  -3
c  -3   5  -3  -2
g  -2  -3   5  -3
t  -3  -2  -3   2
Bisulfite-converted DNA:
    a   c   g   t
a   2  -6  -6  -6
c  -6   2  -6   1
g  -6  -6   2  -6
t  -6  -6  -6   1
What about gap scores?
Pair hidden Markov model
The arrows describe probabilities for insertions and deletions. (It looks more complicated than it really is.)
A useful formula
$$\mathrm{Prob}(\text{alignment}) \propto \exp(\text{alignment score} / t)$$
Alignment ambiguity
ctagctaaccgtatcgtgggc
||||| |   |||||  | ||
ctagcca---gtatctagtgc
Or
ctagctaaccgtatcgtgggc
|||||   | |||||  | ||
ctagc---cagtatctagtgc
?
Per-column probabilities
…    g   c   a   t   c   c   t   t   g   g   g   t   c   t   c   g   a   c   a   t   …
…    g   c   c   t   c   g   t   t   a   g   a   -   -   t   a   g   a   t   a   g   …
    .99 .99 .99 .95 .93 .92 .90 .79 .55 .33 .16 .22 .49 .55 .59 .71 .93 .97 .98 .99
Importance
• Column reliability is important for:
– Studying the evolution of binding sites
– Identifying polymorphisms
– Finding recombination breakpoints
– …
How to calculate ambiguity
ctagctaaccgtatcgtgggc
||||| |   |||||  | ||
ctagcca---gtatctagtgc
Prob(column) = sum of probs of all alignments that include the column
$$\mathrm{Prob}(\text{column}) = \frac{\sum_{\text{alignments that include the column}} \exp(\text{score}/t)}{\sum_{\text{all alignments}} \exp(\text{score}/t)}$$
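The formula can be tried on a toy case. The sketch below pretends that only three alignments exist and enumerates them explicitly; the alignments, their scores and the function name are all made up for illustration. In practice the sums run over every possible alignment and are computed by dynamic programming rather than by enumeration.

```python
import math


def column_probabilities(alignments, t: float = 1.0) -> dict:
    """Probability of each aligned column, given (columns, score) pairs.

    `columns` is a set of (i, j) pairs meaning "position i of sequence 1
    is aligned to position j of sequence 2".
    """
    weights = [math.exp(score / t) for _, score in alignments]
    total = sum(weights)
    probs = {}
    for (columns, _), weight in zip(alignments, weights):
        for column in columns:
            probs[column] = probs.get(column, 0.0) + weight / total
    return probs


# Hypothetical candidates: two equally good alignments and one worse one.
candidates = [
    ({(0, 0), (1, 1), (2, 2)}, 6),
    ({(0, 0), (1, 1), (3, 2)}, 6),
    ({(0, 0), (2, 1), (3, 2)}, 2),
]
probs = column_probabilities(candidates)
```

Column (0, 0) appears in every candidate, so its probability is 1; columns that appear only in some candidates get intermediate values, which is how ambiguity shows up.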
An aligner that indicates ambiguity
Since 2008
http://last.cbrc.jp/
Warning: LAST was made by me and colleagues
Sequence quality data
• Some DNA sequencers estimate the error probability of every base
• We ought to use this information when comparing sequences
t a g c t g a
0.01 0.02 0.07 0.24 0.32 0.75 0.75
General case: compare 2 sequences with error probabilities
a t g c c …
0.01 0.02 0.02 0.09 0.17
g t a c c …
0.03 0.01 0.08 0.12 0.44
Traditional sequence comparison:
$$S_{xy} = \log\left[\frac{A_{xy}}{B_x B_y}\right]$$
Score matrix:
    a   c   g   t
a   2  -3  -1  -3
c  -3   2  -3  -1
g  -1  -3   2  -3
t  -3  -1  -3   2
Real substitutions (mutation / evolution)
Giga-sequencers:
t    a    g    c    t
0.01 0.02 0.07 0.24 0.32
Error probabilities
Erroneous substitutions
Generalized log likelihood ratio:
$$S_{xpyq} = \log\left[pq\,\frac{A_{xy}}{B_x B_y} + (1 - pq)\right]$$
Incorporating sequence quality data into alignment improves DNA read mapping. Frith MC, Wan R, Horton P. Nucleic Acids Research 2010 38:e100
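The behaviour of the generalized score can be explored numerically. The transcript does not define p and q, so the sketch below makes an assumption: each is treated as a weight between 0 and 1 saying how reliable the corresponding base is (1 = certainly correct, 0 = uninformative). How such weights are derived from error probabilities is not specified here, and the function names are hypothetical.

```python
import math


def classic_score(a_xy: float, b_x: float, b_y: float) -> float:
    """Traditional log likelihood ratio: log(A_xy / (B_x * B_y))."""
    return math.log(a_xy / (b_x * b_y))


def quality_aware_score(a_xy: float, b_x: float, b_y: float, p: float, q: float) -> float:
    """Generalized form: log(p*q * A_xy / (B_x * B_y) + (1 - p*q)).

    Assumption: p and q are reliability weights in [0, 1].
    """
    return math.log(p * q * a_xy / (b_x * b_y) + (1.0 - p * q))


# A mismatch whose likelihood ratio is 1/3 under uniform base frequencies.
a_xy, b_x, b_y = 1.0 / 48.0, 0.25, 0.25
reliable = quality_aware_score(a_xy, b_x, b_y, 1.0, 1.0)
doubtful = quality_aware_score(a_xy, b_x, b_y, 0.5, 1.0)
useless = quality_aware_score(a_xy, b_x, b_y, 0.0, 1.0)
```

With fully reliable bases the score equals the traditional one; as reliability drops, the mismatch penalty shrinks towards zero, so a doubtful base counts for less either way.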
An aligner that combines score matrix & quality data
Since 2008
http://last.cbrc.jp/
Warning: LAST was made by me and colleagues
Why is BLAST too slow?
1. Find “seeds” (initial matches) of a fixed length (e.g. 11)
2. Try extending an alignment from each seed
…atcgtatcgtatcgtactgctggcctagtggggga…
…ctcgtcgatgctagtcgtactgctgatgctatatatatattaatg…
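Step 1 can be sketched with a simple k-mer index. This illustrates fixed-length seeding in general, not BLAST's actual implementation; `find_seeds` is a hypothetical name, and the two sequences are the fragments shown above with the ellipses dropped.

```python
from collections import defaultdict


def find_seeds(query: str, target: str, k: int = 11) -> list:
    """Return (query_pos, target_pos) pairs where the two sequences share an exact k-mer."""
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)          # every k-mer start in the target
    seeds = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):   # .get avoids growing the index
            seeds.append((i, j))
    return seeds


query = "atcgtatcgtatcgtactgctggcctagtggggga"
target = "ctcgtcgatgctagtcgtactgctgatgctatatatatattaatg"
seeds = find_seeds(query, target, k=11)
```

Each seed would then be handed to step 2 for extension. The number of seeds grows with the product of the occurrence counts of each k-mer, so repetitive k-mers dominate the work.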
Problem
Non-uniform composition: atatatatatatatatatata, Alu, LINEs, SINEs, isochores, CpG islands
too many seeds → too many extensions → too slow
Example
• Compare the human and chimp genomes
• Each genome has ~ 1 million Alu elements
• So we will get ~ 10^12 seed matches…
Solution: adaptive seeds
1. Find “seeds” (initial matches) of a fixed rareness, instead of a fixed length
2. Try extending an alignment from each seed
…atcgtatcgtatcgtactgctggcctagtggggga…
…ctcgtcgatgctagtcgtactgctgatgctatatatatattaatg…
Adaptive seeds can be found efficiently by using a suffix array
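A toy version of the idea: starting at each query position, lengthen the match until it occurs at most `max_hits` times in the reference. The sketch uses a naively built suffix array and binary search; it is an illustration with hypothetical names and a made-up example, not LAST's implementation, and the naive construction is only suitable for tiny inputs.

```python
def build_suffix_array(text: str) -> list:
    """Start positions of all suffixes, in lexicographic order (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurrences(text: str, sa: list, pattern: str) -> list:
    """Start positions of `pattern` in `text`, found by binary search on the suffix array."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                                # first suffix whose prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                                # first suffix whose prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def adaptive_seeds(query: str, reference: str, max_hits: int = 1) -> list:
    """For each query position, the shortest match that is rare enough in the reference."""
    sa = build_suffix_array(reference)
    seeds = []
    for i in range(len(query)):
        for length in range(1, len(query) - i + 1):
            hits = occurrences(reference, sa, query[i:i + length])
            if not hits:
                break                             # no match at all from this position
            if len(hits) <= max_hits:
                seeds.append((i, length, hits))
                break
    return seeds


reference = "atatatatatcgtacg"   # made-up example with a repetitive start
seeds = adaptive_seeds("atcgt", reference, max_hits=1)
```

In the repetitive part of the reference the seed has to grow until it becomes unique, while in ordinary sequence a short seed is already rare enough. This keeps the number of extensions bounded regardless of composition.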
An aligner that uses adaptive seeds
Since 2008
http://last.cbrc.jp/
Warning: LAST was made by me and colleagues
LAST run times
• Compare the human and chicken genomes
– 3.5 hours
• Align 1 million length-87 DNA reads to the human genome
– 6 minutes
• (Using 1 CPU core)
[Plot: simulated DNA reads. Axes: sensitivity (% of reads that are correctly aligned), error rate (% of aligned reads that are wrong), run time (minutes)]
Method              Time (min)
bwa                 16
bwa-n10             67
last                41
last
last
novoalign           518
shrimp2             ?
stampy              72
stampy (sensitive)  248
For more detail
• Adaptive seeds tame genomic sequence comparison. Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. Genome Research 2011 21:487
• Incorporating sequence quality data into alignment improves DNA read mapping. Frith MC, Wan R, Horton P. Nucleic Acids Research 2010 38:e100
• A mostly traditional approach improves alignment of bisulfite-converted DNA. Frith MC, Mori R, Asai K. Nucleic Acids Research 2012 40:e100
Summary
• It is feasible to use classic, statistical alignment approaches with large modern sequence datasets
– This is beneficial for modeling: diverged sequences, biased base frequencies, etc.
• Alignment ambiguity should be used more often
• Try to avoid cargo cult science!
Main collaborators
Paul Horton CBRC
Michiaki Hamada U of Tokyo / CBRC
Szymon Kielbasa Leiden University
Programming wisdom
• Measuring programming progress by lines of code is like measuring aircraft building progress by weight. – Bill Gates
• As you're about to add a comment, ask yourself, 'How can I improve the code so that this comment isn't needed?' – Steve McConnell
• The key to performance is elegance, not battalions of special cases. – Jon Bentley and M. Douglas McIlroy
• Weeks of programming can save you hours of planning. – Unknown
• Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live. – Unknown