Graph Algorithms 8.6-8.10
CS 6030 – Bioinformatics, Summer II 2012
Jason Eric Johnson
Sequencing by Hybridization
• A DNA array reports all strings of length l (l-mers) that occur in the target DNA
• How do we find the order?
• Spectrum(s,l)
– String s of length n
– Spectrum is the multiset of the n-l+1 l-mers in s
Sequencing by Hybridization
• s = TATGGTGC
• l = 3
• Spectrum(s,l) = {TAT,ATG,TGG,GGT,GTG,TGC}
• Problem:
– Input: Set S of all l-mers from s
– Output: String s s.t. Spectrum(s,l) = S
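As a minimal sketch of the definition above, the spectrum can be computed by sliding a window of length l over s. The function name `spectrum` and the use of `Counter` to represent the multiset are illustrative choices, not something the slides prescribe.

```python
from collections import Counter

def spectrum(s: str, l: int) -> Counter:
    """Return Spectrum(s, l): the multiset of the n-l+1 l-mers of s."""
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))

# The example from the slide: s = TATGGTGC, l = 3.
spec = spectrum("TATGGTGC", 3)
print(sorted(spec.elements()))
```

The SBH problem is the inverse direction: given only this multiset, recover a string s that produces it.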
Hybridization on DNA Array
Sequencing by Hybridization
• Special case of Shortest Superstring Problem
• SBH can be solved in linear time
• SSP (NP-complete) is more general
– In SSP, no guaranteed overlap
– In SBH, we know the length of the target sequence
Sequencing by Hybridization
• There is a problem with DNA Arrays
• No good way to distinguish a match from a highly stable mismatch
– Mismatch could give strong hybridization signal
– Need longer probes to deal with mutations
SBH: Hamiltonian Path Approach
• Two l-mers p and q overlap if overlap(p,q) = l – 1
– Last l-1 letters of p are the same as the first l-1 letters of q
• Make each l-mer in Spectrum(s,l) a node
• Construct a directed graph that connects every overlapping pair p and q with a directed edge from p to q
• 1-to-1 correspondence between paths that visit each vertex exactly once (Hamiltonian paths) and DNA fragments with Spectrum(s,l)
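The construction above can be sketched as follows. This is an illustrative brute-force search, assuming the l-mers are distinct; the function names `overlap_graph` and `hamiltonian_sequences` are made up for this example, and the backtracking takes exponential time in the worst case.

```python
def overlap_graph(lmers):
    """Edge p -> q whenever the last l-1 letters of p equal the first l-1 of q."""
    return {p: [q for q in lmers if q != p and p[1:] == q[:-1]] for p in lmers}

def hamiltonian_sequences(lmers):
    """Return every string spelled by a path that visits each l-mer exactly once."""
    graph = overlap_graph(lmers)
    results = []

    def extend(path, visited):
        if len(path) == len(lmers):
            # Spell the sequence: first l-mer plus the last letter of each later one.
            results.append(path[0] + "".join(p[-1] for p in path[1:]))
            return
        for q in graph[path[-1]]:
            if q not in visited:
                extend(path + [q], visited | {q})

    for start in lmers:
        extend([start], {start})
    return sorted(results)

print(hamiltonian_sequences(["ATG", "AGG", "TGC", "TCC", "GTC", "GGT", "GCA", "CAG"]))
```

Trying every start vertex and every unvisited successor is what makes this approach impractical once the spectrum grows.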
SBH: Hamiltonian Path Approach
S = { ATG AGG TGC TCC GTC GGT GCA CAG }
Path visits every VERTEX once:
ATG → TGC → GCA → CAG → AGG → GGT → GTC → TCC
Sequence spelled by the path: ATGCAGGTCC
SBH: Hamiltonian Path Approach
A more complicated graph:
S = { ATG TGG TGC GTG GGC GCA GCG CGT }
SBH: Hamiltonian Path Approach
S = { ATG TGG TGC GTG GGC GCA GCG CGT }
Path 1: ATGCGTGGCA
Path 2: ATGGCGTGCA
SBH: Hamiltonian Path Approach
• Problem: no efficient algorithm is known for finding a Hamiltonian path
• Because the Hamiltonian Path problem is NP-complete, this approach stops being useful as the overlap graph grows
SBH: Eulerian Path Approach
• Recasting SBH as an Eulerian path problem leads to a simple linear-time algorithm for sequence reconstruction
• Construct graph whose edges correspond to l-mers
• Find path(s) that visit each edge exactly once
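A minimal sketch of this idea, assuming the spectrum really does come from some string so that an Eulerian path exists. Each l-mer becomes an edge from its (l-1)-mer prefix to its (l-1)-mer suffix, and the path is found with Hierholzer's algorithm, which is one standard linear-time method (the slides do not name a specific one). The function name `eulerian_sequence` is illustrative.

```python
from collections import defaultdict

def eulerian_sequence(lmers):
    """Reconstruct one string whose l-mers are exactly `lmers` (assumes one exists)."""
    graph = defaultdict(list)    # (l-1)-mer -> list of successor (l-1)-mers
    balance = defaultdict(int)   # outdegree minus indegree of each vertex
    for lmer in lmers:
        prefix, suffix = lmer[:-1], lmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1

    # Start at the vertex with one extra outgoing edge, if any; otherwise anywhere.
    start = next((v for v, b in balance.items() if b == 1), next(iter(graph)))

    # Hierholzer's algorithm: walk until stuck, then back up, emitting vertices.
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    # Spell the sequence: first vertex plus the last letter of each later vertex.
    return path[0] + "".join(v[-1] for v in path[1:])

# The eight 3-mers used in the Hamiltonian path example.
print(eulerian_sequence(["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]))
```

Every edge is pushed and popped once, so the running time is linear in the number of l-mers. When several Eulerian paths exist, this sketch returns just one of them.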
SBH: Eulerian Path Approach
S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
Vertices correspond to (l – 1)-mers: { AT, TG, GC, GG, GT, CA, CG }
Edges correspond to l-mers from S: each l-mer is an edge from its (l – 1)-mer prefix to its (l – 1)-mer suffix (e.g. ATG is the edge AT → TG)
Path visits every EDGE once
SBH: Eulerian Path Approach
The graph on vertices { AT, TG, GC, GG, GT, CA, CG } contains two different Eulerian paths, which spell two different sequences:
ATGGCGTGCA
ATGCGTGGCA
SBH: Eulerian Path Approach
• A vertex is balanced if its number of incoming edges equals its number of outgoing edges; if every vertex is balanced, the graph is balanced
• A vertex is semi-balanced if its numbers of incoming and outgoing edges differ by exactly 1
• Theorem: A connected graph is Eulerian (contains an Eulerian cycle) if and only if each of its vertices is balanced
• Theorem: A connected graph has an Eulerian path if and only if it contains at most two semi-balanced vertices and all other vertices are balanced
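The two theorems translate into a simple degree count. The sketch below assumes the graph is connected (it does not check this) and takes the edges as (u, v) pairs; the function name `euler_status` and its return labels are illustrative.

```python
from collections import Counter

def euler_status(edges):
    """Classify a connected directed graph given as (u, v) edge pairs."""
    indeg, outdeg = Counter(), Counter()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    vertices = set(indeg) | set(outdeg)
    # Keep outdegree - indegree for every vertex that is not balanced.
    diffs = [outdeg[v] - indeg[v] for v in vertices if outdeg[v] != indeg[v]]
    if not diffs:
        return "cycle"   # every vertex balanced: an Eulerian cycle exists
    if sorted(diffs) == [-1, 1]:
        return "path"    # exactly two semi-balanced vertices: an Eulerian path exists
    return "none"

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(euler_status([(m[:-1], m[1:]) for m in S]))
```

In a directed graph the differences always sum to zero, so a single semi-balanced vertex cannot occur; the "at most two" condition therefore means zero or exactly two.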
Some Difficulties with SBH
• Fidelity of Hybridization: difficult to detect differences between probes hybridized with perfect matches and those with 1 or 2 mismatches
• Array Size: Effect of low fidelity can be decreased with longer l-mers, but array size increases exponentially in l. Array size is limited with current technology.
• Practicality: SBH is still impractical. As DNA microarray technology improves, SBH may become practical in the future
• Practicality again: Although SBH is still impractical, it spearheaded expression analysis and SNP analysis techniques
Fragment Assembly
• Now that we have our reads sequenced, we need to assemble them into the entire DNA sequence
Fragment Assembly
• We have some problems:
– Errors in reads (1% to 3%)
– Which strand did the read come from?
  • Did the read come from the target DNA sequence or its Watson-Crick complement?
– Repeats in DNA (this is the major problem)
  • See page 278 of the textbook for a puzzle example
Fragment Assembly
• Very difficult to put it all together if repeats are longer than read length
• Could solve this by increasing read length, but the technology isn’t there yet
Fragment Assembly
• One approach is to break the sequence into about 30,000 Bacterial Artificial Chromosomes (BACs)
– Sequence each BAC individually
– Put them all together
– Used and shown effective (if cumbersome) by the Human Genome Project
Fragment Assembly
• Another option (used in mouse genome assembly) is the Weber-Myers approach
– Uses pairs of reads that are separated by a fixed-size gap
– Gap size L is chosen to be longer than most repeats
– Unlikely that both reads lie in a large repeat
– A read that lies in a unique portion of the DNA tells us which copy of a repeat its mate is in
Fragment Assembly
• Most algorithms consist of these steps:
• Overlap:
– Find potentially overlapping reads
• Layout:
– Find order of reads along DNA
• Consensus:
– Derive DNA sequence from layout
Overlap
• Find the best match between the suffix of one read and the prefix of another
• Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment
• Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring
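The dynamic programming step can be sketched as an overlap alignment between a suffix of read a and a prefix of read b. The scoring values (match +1, mismatch -1, gap -1) and the function name `best_overlap` are assumptions for illustration; real assemblers use tuned scores and quality values.

```python
def best_overlap(a: str, b: str, match=1, mismatch=-1, gap=-1):
    """Score the best alignment of some suffix of a with some prefix of b.

    Returns (score, length of the prefix of b that takes part in the overlap).
    """
    n, m = len(a), len(b)
    # dp[i][j]: best score aligning a suffix of a[:i] with all of b[:j].
    # Column 0 stays 0: the overlap may start anywhere in a at no cost.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        dp[0][j] = j * gap   # the start of b cannot be skipped for free
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    # The overlap must reach the end of a but may stop anywhere in b.
    j_best = max(range(m + 1), key=lambda j: dp[n][j])
    return dp[n][j_best], j_best

# One substitution error inside a 9-base overlap is tolerated.
print(best_overlap("CCCCTAGATTACA", "TAGCTTACAGGGG"))
```

The table has len(a) × len(b) cells, which is why the filtration step matters: running this on every pair of reads would be far too slow.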
Overlapping Reads
TAGATTACACAGATTAC
|||||||||||||||||
TAGATTACACAGATTAC
• Sort all k-mers in reads (k ~ 24)
• Find pairs of reads sharing a k-mer
• Extend to full alignment – throw away if not >95% similar
Overlapping Reads and Repeats
• A k-mer that appears N times initiates N² comparisons
• For an Alu that appears 10⁶ times, that is 10¹² comparisons – too many
• Solution: Discard all k-mers that appear more than t × Coverage times (t ~ 10)
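A rough sketch of the k-mer filter with the repeat cutoff. It uses a hash index rather than sorting, and it approximates "number of appearances" by the number of distinct reads containing the k-mer; both are simplifications made for this example. The name `candidate_pairs` and the parameter `max_count` (standing in for t × Coverage) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k, max_count):
    """Return pairs of read indices that share at least one non-repetitive k-mer."""
    index = defaultdict(set)   # k-mer -> indices of the reads that contain it
    for r, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(r)

    pairs = set()
    for kmer, owners in index.items():
        if len(owners) > max_count:
            continue           # likely a repeat: skip it to avoid the N^2 blow-up
        pairs.update(combinations(sorted(owners), 2))
    return pairs

reads = ["ACGTAC", "GTACGG", "TTTTTT", "ACGGAA"]
print(sorted(candidate_pairs(reads, k=4, max_count=2)))
```

Only the surviving pairs would then be extended to a full alignment and kept if they are similar enough.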
Finding Overlapping Reads
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
Layout
• Repeats are a major challenge
• Do two aligned fragments really overlap, or are they from two copies of a repeat?
• Solution: repeat masking – hide the repeats!
• Masking results in a high rate of misassembly (up to 20%)
• Misassembly means a lot more work at the finishing step
Consensus
• A consensus sequence is derived from a profile of the assembled fragments
• A sufficient number of reads is required to ensure a statistically significant consensus
• Reading errors are corrected
Derive Consensus Sequence
Derive multiple alignment from pairwise read alignments
TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA
Derive each consensus base by weighted voting
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
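A simplified sketch of the voting step. It gives every read the same weight, whereas the slide speaks of weighted voting (a real assembler would typically weight by base quality), and it writes gaps as `-`. The function name `consensus` is illustrative.

```python
from collections import Counter

def consensus(rows):
    """Majority vote per alignment column; '-' marks a gap."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))

# The five aligned reads from the slide, with gaps written as '-'.
rows = [
    "TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA",
]
print(consensus(rows).replace("-", ""))
```

Each isolated read error is outvoted by the other four reads, which is why sufficient coverage is needed for a reliable consensus.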
Protein Sequencing and Identification
• Protein can be digested into peptides by proteases (such as trypsin)
• Can then sequence the fragments individually and re-assemble
• Mass spectrometry allows us to find proteins involved in cell death, for example
Protein Sequencing and Identification
• Tandem mass spectrometer breaks peptides into smaller fragments
• These fragments have electrical charge
• Fragments are spun around in a magnetic field until they hit a detector
• Larger masses are harder to spin than smaller ones, so mass can be determined by the amount of energy required to fling fragments around
Protein Sequencing and Identification
• The problem we encounter is how to reconstruct the amino acid sequence of the peptide from the masses of the broken pieces
References
• Generated from:
• An Introduction to Bioinformatics Algorithms, Neil C. Jones, Pavel A. Pevzner, A Bradford Book, The MIT Press, Cambridge, Mass., London, England, 2004
• Slides 4, 8-10, 13, 14, 16, 23-29 from http://bix.ucsd.edu/bioalgorithms/slides.php#Ch8