10/21/2014 COMP 555 Bioalgorithms (Fall 2014) 1
Lecture 14: DNA Sequencing and Assembly
Study Chapter 8.9
DNA Sequencing
• Shear DNA into millions of small fragments
• Read 500 – 700 nucleotides at a time from the small fragments (Sanger method)
Shotgun Sequencing
• Cut the genomic region many times at random (shotgun)
• Get one or two reads (~500 bp each) from the ends of each segment
Current sequencing technologies
• High throughput: Illumina HiSeq 2500 (8 at UNC), 2 x 100 bp reads, 11 days for 16 samples, ~35 GB per sample (12x coverage)
• Individual labs: Life Technologies Ion Torrent, 2 hours, ~100 MB to 3 GB
• Individual labs: Illumina MiSeq, 2 x 250 bp reads, 20 hours, 1 GB per day
• Pacific Biosciences (1 at UNC), 1000-10,000 bp reads, 20 min, 200 MB
Illumina reversible dye terminator chemistry
[Figure: workflow panels labeled DNA/cDNA (0.1-1 ug total RNA), Sample preparation, Single molecule array, Cluster growth, Sequencing, Image acquisition, and Base calling (T G C T A C G A T …)]
Pacific Biosciences Single-Molecule Real-Time sequencing
• No PCR steps are required
• Mutated polymerase has slower base incorporation (1-3 bp per second)
• Read lengths > 1 kb, but a high error rate (~15%)
Metzker ML (2010) Nat Rev Genet
Fragment Assembly
• Assembles the individual overlapping short reads (fragments) into a genomic sequence
• Shortest Superstring problem is an overly simplified abstraction
• Problems:
  – DNA read error rate of 1% to 3%
  – Can’t separate coding and template strands
  – DNA is full of repeats
• Let’s take a closer look
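One consequence of not being able to separate the two strands is that an assembler has to consider every read in both orientations. A minimal sketch of the reverse complement, assuming reads contain only the letters A, C, G, and T (the function name is illustrative, not from the lecture):

```python
# Translation table mapping each base to its complement.
# Assumption: reads use only the uppercase letters A, C, G, T.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence as read from the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]

read = "TAGATTAC"
# Both orientations are candidates when searching for overlaps.
orientations = (read, reverse_complement(read))
print(orientations)
```

An overlap detector would then compare each read, and its reverse complement, against the other reads.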
Fragment Assembly
• Cover region with ~7-fold redundancy
• Overlap reads and extend to reconstruct the original genomic region
Read Coverage
Length of genomic segment: L
Number of reads: n
Length of each read: l
Coverage C = n l / L
How much coverage is enough?
Lander-Waterman model: Assuming uniform distribution of reads, C=10 results in 1 gapped region per 1,000,000 nucleotides
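The coverage formula and the Lander-Waterman figure can be checked numerically. The sketch below uses the simplified estimate that n uniformly placed reads leave about n·e^(-C) gaps, which ignores the minimum overlap needed to detect that two reads join. The 500 bp read length is an assumption borrowed from the Sanger read lengths quoted earlier; the function names are illustrative.

```python
import math

def coverage(n_reads, read_len, genome_len):
    """Coverage C = n * l / L."""
    return n_reads * read_len / genome_len

def expected_gaps(n_reads, cov):
    """Simplified Lander-Waterman estimate of the number of gaps: n * e^(-C).
    Assumes uniformly placed reads and that any overlap is detectable."""
    return n_reads * math.exp(-cov)

genome_len = 1_000_000   # L: one megabase
read_len = 500           # l: assumed Sanger-like read length
n_reads = 20_000         # n: chosen so that C = 10

c = coverage(n_reads, read_len, genome_len)
print(c, expected_gaps(n_reads, c))
```

With these numbers C = 10 and the estimate comes out at roughly 0.9 gaps per 1,000,000 nucleotides, in line with the figure quoted above.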
Challenges in Fragment Assembly
• > 50% of human genome is repeats:
  – over 1 million Alu repeats (about 300 bp)
  – about 200,000 LINE repeats (1000 bp and longer)
• Repeats are a major problem for fragment assembly
  – assume reads are 100 bp and we have 300 bp repeats
[Figure: a genomic region containing three copies of a repeat]
Green and blue fragments are interchangeable when assembling repetitive DNA
Types of Genome Assemblies
• De Novo – An assembly based entirely on self-consistency or self-similarity of short reads (contigs).
• Comparative – An assembly of a genome using the sequence of a close relative as a reference. Sometimes called a “template assembly” or “resequencing”
• Confounding problem for both types: Repeats
Repeat Types
• Low-Complexity DNA (e.g. ATATATATACATA…)
• Microsatellite repeats (a1…ak)^N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG)
• Transposons/retrotransposons
  – SINE: Short Interspersed Nuclear Elements (e.g., Alu: ~300 bp long, >10^6 in human)
  – LINE: Long Interspersed Nuclear Elements, ~500 - 5,000 bp long, > 200,000 in human
  – LTR retrotransposons: Long Terminal Repeats (~700 bp) at each end
• Gene Families: genes duplicate & then diverge
• Segmental duplications: ~very long, very similar copies
Overlap-Layout-Consensus Assembly
Assembler programs: ARACHNE, PHRAP, CAP, TIGR, CELERA
Common Approach:
• Overlap: find potentially overlapping reads
• Layout: merge reads into contigs and then combine contigs into supercontigs
• Consensus: requires many overlapping reads to derive the DNA sequence and to correct for read errors
..ACGATTACAATAGGTT..
Overlap
• Find the best match between the suffix of one read and the prefix of another (shortest superstring)
• Due to sequencing errors, most algorithms use dynamic programming to find the optimal overlap alignment
• Filter out fragment pairs that do not share a significantly long common substring
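The dynamic-programming overlap alignment mentioned above can be sketched as a semi-global alignment: the alignment may start anywhere in the first read at no cost, must start at the beginning of the second read, and must end at the end of the first read. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration, not the parameters of any particular assembler.

```python
def overlap_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Best-scoring alignment of a suffix of `a` with a prefix of `b`.
    Returns (score, length of the prefix of b that takes part)."""
    n, m = len(a), len(b)
    # score[i][j]: best score aligning some suffix of a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b's prefix cannot be skipped for free
    for i in range(1, n + 1):
        # score[i][0] stays 0: the overlap may begin anywhere in a
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # The overlap must reach the end of a; it may stop anywhere in b.
    best_j = max(range(m + 1), key=lambda j: score[n][j])
    return score[n][best_j], best_j

print(overlap_alignment("TAGATTACACAGATTAC", "ACAGATTACTGA"))
```

In this example the last nine bases of the first read match the first nine bases of the second exactly. The table has n × m cells, which is why the filtering step in the last bullet matters before running it on all read pairs.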
Overlapping Reads
TAGATTACACAGATTAC
|||||||||||||||||
TAGATTACACAGATTAC
• Make an index of all k-mers of all reads (k ~ 20-24)
• Find read-pairs sharing a k-mer
• Extend alignment – throw away if not >95% similar
[Figure: example extensions containing mismatches, e.g. TAGA aligned against TACA]
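The first two steps above (index all k-mers, then find read pairs sharing one) can be sketched with a dictionary from k-mer to the set of reads containing it. The function names are illustrative, and the small k in the usage example is only there to keep the toy reads short; the slide suggests k of about 20-24 for real data.

```python
from collections import defaultdict
from itertools import combinations

def build_kmer_index(reads, k):
    """Map each k-mer to the set of read ids in which it occurs."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)
    return index

def candidate_pairs(reads, k):
    """Read pairs (i, j), i < j, that share at least one k-mer."""
    pairs = set()
    for read_ids in build_kmer_index(reads, k).values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs

reads = ["TAGATTACACAG", "TACACAGATTAC", "GGGGCCCCGGGG"]
print(candidate_pairs(reads, k=6))
```

Only the candidate pairs would then go on to the more expensive alignment-extension step.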
Histogram Similarity
• Histogram of 3-mers (18 total)
v = tagattacacagattattga
Rows give the first letter (A1-T1), columns the second letter (A2-T2); each cell lists the counts for the third letter as A3:C3:G3:T3.

        A2        C2        G2        T2
A1   0:0:0:0   2:0:0:0   2:0:0:0   0:0:0:3
C1   0:1:1:0   0:0:0:0   0:0:0:0   0:0:0:0
G1   0:0:0:2   0:0:0:0   0:0:0:0   0:0:0:0
T1   0:1:1:1   0:0:0:0   1:0:0:0   2:0:1:0
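The table above can be reproduced by counting every 3-mer of v. The slide does not say how two histograms are compared, so the `shared_kmers` measure below (the number of k-mer occurrences two sequences have in common) is only one plausible choice, labeled here as an assumption.

```python
from collections import Counter

def kmer_histogram(seq, k=3):
    """Count every k-mer (overlapping windows) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def shared_kmers(seq_a, seq_b, k=3):
    """Hypothetical similarity measure: size of the multiset intersection
    of the two k-mer histograms."""
    return sum((kmer_histogram(seq_a, k) & kmer_histogram(seq_b, k)).values())

hist = kmer_histogram("tagattacacagattattga")
print(sum(hist.values()), hist["att"], hist["aca"])
```

For v this gives 18 3-mers in total, with att counted 3 times and aca twice, matching the A1/T2 and A1/C2 cells of the table.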
Overlapping Reads and Repeats
• Does this really speed up the process?
• A k-mer that appears N times initiates N^2 comparisons (you consider all pairs of reads that share the k-mer substring)
• For an Alu that appears 10^6 times: 10^12 comparisons – too much
• How to avoid repeats: Discard all k-mers that appear more than t × Coverage times (t ~ 10)
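The discard rule in the last bullet can be sketched as a count-and-threshold pass over all k-mers. The function name and the toy values of k, coverage, and t in the example are assumptions for illustration; the slide suggests t of about 10.

```python
from collections import Counter

def overrepresented_kmers(reads, k, coverage, t=10):
    """k-mers occurring more than t * coverage times across all reads.
    These are likely to come from repeats and would be dropped from the index."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    threshold = t * coverage
    return {kmer for kmer, count in counts.items() if count > threshold}

# Toy data: "ACGT" occurs 5 times, "TTTT" once; threshold is 2 * 2 = 4.
reads = ["ACGT"] * 5 + ["TTTT"]
print(overrepresented_kmers(reads, k=4, coverage=2, t=2))
```

A unique genomic k-mer is expected to appear about Coverage times, so a count far above that is a sign of repeat sequence.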
Finding Overlapping Reads
k-mer table makes it easy to create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
Finding Overlapping Reads (cont’d)
• Correct errors using multiple alignment and consensus scoring
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
Base calls with scores at the two disagreeing columns, before and after correction:
  C: 20, C: 35, T: 30, C: 35, C: 40  →  C: 20, C: 35, C: 0, C: 35, C: 40
  A: 15, A: 25, A: 40, A: 25, -      →  A: 15, A: 25, A: 40, A: 25, A: 0
• Score alignments
• Accept alignments with good scores
Layout
• Repeats are still a major challenge
• Do two aligned fragments really overlap, or are they from two copies of a repeat?
• Solution: repeat masking – hide the repeats?
  – Masking results in high rate of misassembly (up to 20%)
  – Misassembly means a lot more work at the finishing step
2. Merge Reads into Contigs
• Overlap graph:
  – Nodes: reads r1…..rn
  – Edges: overlaps (ri, rj, shift, orientation, score)
[Figure: overlap graph] Reads that come from two regions of the genome (blue and red) that contain the same repeat
Note: of course, we don’t know the “color” of these nodes
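One way to hold the overlap graph in memory is a list of edge records carrying the five fields named above. The sketch below is a simplification under stated assumptions: it only finds exact suffix-prefix overlaps on the same strand, uses the overlap length as the score, and compares all pairs of reads directly. The class and function names are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Overlap:
    ri: int            # index of the left read
    rj: int            # index of the right read
    shift: int         # offset of rj's start relative to ri's start
    orientation: str   # "+" = same strand (the only case handled here)
    score: int         # here: length of the exact overlap

def exact_overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b (0 if < min_len)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def build_overlap_graph(reads, min_len=3):
    """Nodes are read indices; edges are Overlap records."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                length = exact_overlap(a, b, min_len)
                if length:
                    edges.append(Overlap(i, j, len(a) - length, "+", length))
    return edges

for edge in build_overlap_graph(["TAGATTAC", "TTACACAG", "ACAGATTA"]):
    print(edge)
```

In the toy example the third read overlaps back onto the second because both contain TTA, a small-scale version of the repeat ambiguity the slide describes.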
2. Merge Reads into Contigs
We want to merge reads up to potential repeat boundaries
[Figure labels: repeat region, Unique Contig, Overcollapsed Contig]
2. Merge Reads into Contigs
• Ignore non-maximal reads
• Merge only maximal reads into contigs
[Figure label: repeat region]
2. Merge Reads into Contigs
• Remove transitively inferable overlaps
  – If read r overlaps to the right reads r1, r2, and r1 overlaps r2, then (r, r2) can be inferred by (r, r1) and (r1, r2)
[Figure: reads r, r1, r2, r3]
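The rule above can be sketched on a bare set of directed edges (u, v), meaning u overlaps v to the right. This simplified version drops an edge whenever a two-step route exists through another right-neighbor; it assumes the overlaps are consistent and acyclic, and it ignores the shift and score fields a real assembler would also check. The function name is illustrative.

```python
def remove_transitive_edges(edges):
    """Drop every edge (u, v) for which some w has both (u, w) and (w, v)."""
    successors = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)
    reduced = set(edges)
    for u, v in edges:
        for w in successors.get(u, ()):
            # (u, v) is inferable from (u, w) and (w, v)
            if w != v and v in successors.get(w, ()):
                reduced.discard((u, v))
                break
    return reduced

edges = {("r", "r1"), ("r", "r2"), ("r1", "r2")}
print(sorted(remove_transitive_edges(edges)))
```

After the reduction each read keeps only the edge to its nearest right-neighbor, so unique regions collapse into simple chains.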
2. Merge Reads into Contigs
• Ignore “hanging” reads when detecting repeat boundaries
[Figure: a hanging read on branches a and b; sequencing error or repeat boundary?]
Overlap graph after forming contigs
Repeats, errors, and contig lengths
• Repeats shorter than read length are easily resolved
  – A read that spans a repeat disambiguates the order of the flanking regions
• Repeats with more base pair diffs than sequencing error rate are OK
  – We throw away overlaps between two reads in different copies of the repeat
• To make the genome appear less repetitive, try to:
  – Increase read length
  – Decrease sequencing error rate
Role of error correction: discards up to 98% of single-letter sequencing errors
decreases error rate ⇒ decreases effective repeat content ⇒ increases contig length
Consensus
• A consensus sequence is derived from a profile of the assembled fragments
• A sufficient number of reads is required to ensure a statistically significant consensus
• Reading errors are corrected
Derive Consensus Sequence
Derive multiple alignment from pairwise read alignments
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
Derive each consensus base by weighted voting
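Weighted voting can be sketched column by column over the aligned rows. As a simplification, the version below takes one weight per read (uniform by default) rather than a score per base, and it treats a space as a gap symbol that can win the vote like any other character. Both choices, and the function name, are assumptions made for illustration.

```python
from collections import defaultdict

def weighted_consensus(rows, weights=None):
    """Consensus of equal-length aligned rows by per-column weighted vote."""
    if weights is None:
        weights = [1.0] * len(rows)          # assumption: every read counts equally
    consensus = []
    for column in zip(*rows):
        votes = defaultdict(float)
        for symbol, weight in zip(column, weights):
            votes[symbol] += weight
        consensus.append(max(votes, key=votes.get))
    return "".join(consensus)

rows = [
    "TAGATTACACAGATTACTGA TTGATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG TTACACAGATTATTGACTTCATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA CTA",
]
print(weighted_consensus(rows))
```

Applied to the five aligned reads from the slide, the vote outvotes each isolated disagreement and reproduces the consensus row shown beneath them.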
Some Assemblers
• PHRAP
  – Early assembler, widely used, good model of read errors
  – Overlap O(n^2) → layout (no mate pairs) → consensus
• Celera
  – First assembler to handle large genomes (fly, human, mouse)
  – Overlap → layout → consensus
• Arachne
  – Public assembler (mouse, several fungi)
  – Overlap → layout → consensus
• Phusion
  – Overlap → clustering → PHRAP → assemblage → consensus
• Euler
  – Indexing → Euler graph → layout by picking paths → consensus
EULER Fragment Assembly
• Traditional “overlap-layout-consensus” technique has a high rate of mis-assembly
• EULER uses the Eulerian Path approach borrowed from the SBH problem
• Fragment assembly without repeat masking can be done in linear time with greater accuracy
Overlap Graph: Hamiltonian Approach
Find a path visiting every VERTEX exactly once: Hamiltonian path problem
Each vertex represents a read from the original sequence. Vertices from repeats are connected to many others.
[Figure labels: Repeat, Repeat, Repeat]
Overlap Graph: Eulerian Approach
Find a path visiting every EDGE exactly once: Eulerian path problem
Placing each repeat edge together gives a clear progression of the path through the entire sequence.
[Figure labels: Repeat, Repeat, Repeat; Two solutions]
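Unlike the Hamiltonian version, an Eulerian path can be found in time linear in the number of edges. A minimal sketch using Hierholzer's algorithm on a directed edge list follows; it assumes an Eulerian path actually exists (the graph is connected and every vertex is balanced except possibly the start and end), and the function name is illustrative.

```python
from collections import defaultdict

def eulerian_path(edges):
    """Return a vertex list that traverses every directed edge exactly once.
    Assumes such a path exists."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)           # out-degree minus in-degree
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the vertex with one surplus outgoing edge, if there is one.
    start = next((v for v, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out_edges[u]:
            stack.append(out_edges[u].pop())   # follow an unused edge
        else:
            path.append(stack.pop())           # dead end: emit vertex
    return path[::-1]

edges = [("A", "B"), ("B", "C"), ("C", "B"), ("B", "D")]
print(eulerian_path(edges))
```

When a repeat vertex has several outgoing edges, the order in which they are taken is arbitrary, which is why a graph can admit more than one valid solution.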
Multiple Repeats
[Figure labels: Repeat1, Repeat1, Repeat2, Repeat2]
Can be easily constructed with any number of repeats
Two solutions
Construction of Repeat Graph
• Construction of repeat graph from k-mers: emulates an SBH experiment with a huge (virtual) DNA chip.
• Breaking reads into k-mers: Transform sequencing data into virtual DNA chip data.
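Breaking reads into k-mers can be sketched as building the edge list of a de Bruijn-style graph, as in SBH: each distinct k-mer becomes an edge from its (k-1)-letter prefix to its (k-1)-letter suffix. The sketch assumes error-free reads from a single strand and keeps only one edge per distinct k-mer; the function name and the tiny k are illustrative.

```python
def de_bruijn_edges(reads, k):
    """One edge (prefix, suffix) per distinct k-mer found in the reads."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    return sorted((kmer[:-1], kmer[1:]) for kmer in kmers)

# Two overlapping reads from TAGATTAC; they share the 3-mers GAT and ATT.
edges = de_bruijn_edges(["TAGATT", "GATTAC"], k=3)
for prefix, suffix in edges:
    print(prefix, "->", suffix)
```

k-mers shared between reads collapse onto the same edge, so overlaps are found implicitly rather than by pairwise comparison. A path using every edge once spells out a candidate sequence.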
Construction of Repeat Graph (cont’d)
• Error correction in reads: “consensus first” approach to fragment assembly. Makes reads (almost) error-free BEFORE the assembly even starts.
• Using reads and mate-pairs to simplify the repeat graph (Eulerian Superpath Problem).
Hybrid Sequencing
• Use short read sequencing to create accurate overlap graphs
• Align noisy long reads to overlap graphs to link contigs
  – How to align a noisy read to a graph?
Conclusions
• Graph theory is a vital tool for solving biological problems
• Wide range of applications, including sequencing, motif finding, protein networks, and many more