CZ5225: Modeling and Simulation in Biology
Lecture 9: Next Generation Sequencing
Prof. Chen Yu Zong
Tel: 6516-6877
Email: [email protected]
http://bidd.nus.edu.sg
Room 08-14, level 8, S16, NUS
Outline
• First generation sequencing
• Next generation sequencing
• Third generation sequencing
• Analysis challenges
Sanger Sequencing
• DNA is fragmented
• Cloned to a plasmid vector
• Cyclic sequencing reaction
• Separation by electrophoresis
• Readout with fluorescent tags
Steps to Assemble a Genome
1. Find overlapping reads
2. Merge some "good" pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence ..ACGATTACAATAGGTT..
Some Terminology
• read: a 500-900 base long word that comes out of the sequencer
• mate pair: a pair of reads from the two ends of the same insert fragment
• contig: a contiguous sequence formed by several overlapping reads with no gaps
• supercontig (scaffold): an ordered and oriented set of contigs, usually linked by mate pairs
• consensus sequence: the sequence derived from the multiple alignment of reads in a contig
Sequencing Types and Applications
Cyclic-Array Methods
• DNA is fragmented
• Adaptors ligated to fragments
• Several possible protocols yield an array of PCR colonies.
• Enzymatic extension with fluorescently tagged nucleotides.
• Cyclic readout by imaging the array.
Emulsion PCR
• Fragments, with adaptors, are PCR amplified within a water drop in oil.
• One primer is attached to the surface of a bead.
• Used by 454, Polonator and SOLiD.
Bridge PCR
• DNA fragments are flanked with adaptors.
• A flat surface is coated with two types of primers, corresponding to the adaptors.
• Amplification proceeds in cycles, with one end of each bridge tethered to the surface.
• Used by Solexa.
Comparison of Existing Methods
Genome Assembly: Find Overlapping Reads
[Figure: the reads aaactgcagtacggatct, gggcccaaactgcagtac and gtacggatctactacaca are split into 9-letter words, listed as (read, pos., word, orient.) and then reordered as (word, read, orient., pos.)]
• Find pairs of reads sharing a k-mer, k ~ 24
• Extend to full alignment; throw away if not >98% similar
[Figure: example alignments of reads TAGATTACACAGATTAC, with mismatching positions marked]
• Caveat: repeats
  – A k-mer that occurs N times causes O(N^2) read/read comparisons
  – ALU k-mers could cause up to 1,000,000^2 comparisons
• Solution: discard all k-mers that occur "too often"
  – Set the cutoff to balance the sensitivity/speed tradeoff, according to the genome at hand and the computing resources available
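The k-mer filtering idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `candidate_overlaps` and the parameter `max_occurrences` are hypothetical, the cutoff counts the number of reads containing a k-mer, and reverse-complement orientation is ignored. The three reads are the ones shown in the slide figure, with k = 9 as in that figure.

```python
from collections import defaultdict
from itertools import combinations

def candidate_overlaps(reads, k=24, max_occurrences=50):
    """Return pairs of read indices that share at least one k-mer.

    k-mers found in more than max_occurrences reads are skipped, because a
    k-mer present in N reads would otherwise generate O(N^2) pairs.
    """
    index = defaultdict(set)  # k-mer -> indices of the reads containing it
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(read_id)

    pairs = set()
    for read_ids in index.values():
        if len(read_ids) > max_occurrences:  # probably a repeat: discard
            continue
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs

# Reads taken from the slide figure (forward strand only, an assumption).
reads = ["aaactgcagtacggatct", "gggcccaaactgcagtac", "gtacggatctactacaca"]
print(sorted(candidate_overlaps(reads, k=9)))
```

Each candidate pair would then be extended to a full alignment and kept only if it passes the similarity threshold.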
Genome Assembly: Find Overlapping Reads
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
Genome Assembly: Find Overlapping Reads
• Correct errors using multiple alignment

Isolated errors (insert A, replace T with C):
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTACTGA

Correlated errors, probably caused by repeats (disentangle overlaps):
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA

After disentangling:
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA

TAG-TTACACAGATTATTGA
TAG-TTACACAGATTATTGA

In practice, error correction removes up to 98% of the errors
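One simple way to correct isolated errors from a multiple alignment is a column-wise majority vote. The sketch below is an assumption about how this could be done, not the method used by any particular assembler; the function name `consensus` is hypothetical and the five reads are the first group shown on the slide.

```python
from collections import Counter

def consensus(aligned_reads):
    """Column-wise majority vote over equal-length, gap-padded reads."""
    calls = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":  # a column where the gap wins contributes no base
            calls.append(base)
    return "".join(calls)

aligned = [
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTATTGA",  # T where the other reads have C
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTACTGA",  # missing A
]
print(consensus(aligned))
```

A plain majority vote would also outvote the two reads with correlated errors in the second group, which is why such reads need to be separated first rather than corrected.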
Genome Assembly: Merge Reads into Contigs
• Overlap graph:
  – Nodes: reads r1…rn
  – Edges: overlaps (ri, rj, shift, orientation, score)
[Figure: overlap graph of reads that come from two regions of the genome (blue and red) that contain the same repeat. Labels: repeat region, unique contig, overcollapsed contig.]
Note: of course, we don't know the "color" of these nodes.
We want to merge reads up to potential repeat boundaries.
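A toy overlap graph can be built as follows. This is a sketch under simplifying assumptions: overlaps are exact suffix/prefix matches, the score is just the overlap length, and the orientation field is omitted (forward strand only). The names `suffix_prefix_overlap` and `overlap_graph` and the three example reads are made up for illustration.

```python
def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest proper suffix of a equal to a prefix of b."""
    for length in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0  # no overlap of at least min_len characters

def overlap_graph(reads, min_len=3):
    """Return edges (i, j, shift, score) for every ordered pair that overlaps."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            score = suffix_prefix_overlap(a, b, min_len)
            if score:
                shift = len(a) - score  # start of read j relative to read i
                edges.append((i, j, shift, score))
    return edges

reads = ["TAGATTACA", "TTACACAGA", "CAGATTAC"]  # hypothetical reads
for edge in overlap_graph(reads):
    print(edge)
```

In this toy example reads 1 and 2 overlap in both directions, which is the kind of ambiguity a shared repeat introduces into the graph.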
Genome Assembly: Merge Reads into Contigs
• Ignore non-maximal reads
• Merge only maximal reads into contigs
[Figure: reads merged into contigs up to the boundaries of the repeat region]
Read Length and Pairing
• Short reads are problematic, because short sequences do not map uniquely to the genome.
• Solution #1: Get longer reads.
• Solution #2: Get paired reads.
ACTTAAGGCTGACTAGC TCGTACCGATATGCTG
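The effect of read length on uniqueness can be illustrated with a toy count. The function `unique_fraction` and the 8-letter "genome" below are hypothetical; real genomes behave the same way in principle, but with much longer repeats.

```python
from collections import Counter

def unique_fraction(genome, k):
    """Fraction of length-k windows whose sequence occurs exactly once."""
    windows = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(windows)
    return sum(1 for word in windows if counts[word] == 1) / len(windows)

genome = "ATATATGC"  # toy sequence with a short tandem repeat
for k in (2, 4, 6):
    print(k, unique_fraction(genome, k))
```

Longer windows are more often unique, which is the motivation for longer reads; paired reads help in a similar way by tying a non-unique read to a second read placed nearby.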
Third Generation Sequencing
• Nanopore sequencing
  – Nucleic acids driven through a nanopore.
  – Differences in conductance of pore provide readout.
• Real-time monitoring of PCR activity
  – Read-out by fluorescence resonance energy transfer between polymerase and nucleotides, or
  – Waveguides allow direct observation of polymerase and fluorescently labeled nucleotides
Nanopore sequencing
Deamer, D. W., and Akeson, M. 'Nanopores and Nucleic Acids: prospects for ultrarapid sequencing'. Tibtech.
Meller, A. J. Phys.: Condens. Matter 15 (2003) R581–R607
Earlier Findings
– Transmembrane voltage drives RNA through the protein nanopore α-hemolysin.
– Passage of RNA through the pore reduces the ionic current
– Blockage current is modulated by base identity
  • PolyC: Iblock = 5 pA
  • PolyA: Iblock = 20 pA
– Translocation rate depends on base identity
  • PolyC: v = 3 µs/base
  • PolyA: v = 20 µs/base
Automated Rapid DNA Sequencing with Nanopores
Church, George M. ‘Genomes for All’ Scientific American, Jan 2006, pp. 47-54.
Sequencing will require a better understanding of the physics of the interaction between DNA and protein pore during translocation.
Modeling of ssDNA Translocation
• F = zeV/a
  – ze = effective charge / base
  – V = applied voltage
  – a = base-to-base distance
• F = (1)(1.6 x 10^-19 C)(0.125 V)/(0.4 x 10^-9 m) ~ 5 kBT/a ~ 44 pN
• Basis for modeling
  – P(forward or backward) ~ exp(Fa/kBT)
  – Averaged over all monomers
• Model Assumptions:
  – Length of polymer = L >> pore length
  – With short polymers, membrane has 0 thickness
D. K. Lubensky and D. R. Nelson, Biophys. J. 77, 1824 (1999).
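The order of magnitude of the driving force can be checked numerically. The sketch below reads the slide's expression as F = zeV/a (an assumption about the flattened fraction) and takes T = 2 °C from the experimental conditions listed for the poly(dA) measurements; the function name `driving_force` is hypothetical. With these inputs the force comes out at roughly 50 pN and zeV at a little over 5 kBT, the same order as the ~44 pN quoted above.

```python
ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
BOLTZMANN = 1.380649e-23             # joules per kelvin

def driving_force(z, voltage, spacing):
    """Electrostatic force F = z*e*V/a on one base, in newtons."""
    return z * ELEMENTARY_CHARGE * voltage / spacing

z = 1              # effective charge per base, in units of e
voltage = 0.125    # applied voltage in volts
spacing = 0.4e-9   # base-to-base distance in metres
temperature = 275.15  # 2 degrees C in kelvin (assumed to apply here)

force = driving_force(z, voltage, spacing)
energy_ratio = z * ELEMENTARY_CHARGE * voltage / (BOLTZMANN * temperature)

print(f"F = {force * 1e12:.1f} pN")
print(f"zeV = {energy_ratio:.2f} kBT")
```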
Experiment of ssDNA Translocation
Conditions
– Temp: 2 °C
– Electrolyte solution: 1 M KCl, 1 mM Tris-EDTA buffer, pH 8.5
– Polymer
  • Polydeoxyadenylic acid (poly(dA))
  • Length: 4-100 bases
– Driving voltage: 70-300 mV
Meller, A., L. Nivon, D. Branton, 2001. Voltage-Driven DNA Translocations Through a Nanopore, Phys. Rev. Lett., 86, 3435-39
Sequence Alignment as a Mathematical Problem:
Example:
Sequence a: ATTCTTGC
Sequence b: ATCCTATTCTAGC

Best Alignment (one gap):
ATTCTTGC
ATCCTATTCTAGC

Bad Alignment (two gaps):
AT [gap] TCTT [gap] GC
ATCCTATTCTAGC

What is a good alignment?
How to rate an alignment?
• Match: +8 (w(x, y) = 8, if x = y)
• Mismatch: -5 (w(x, y) = -5, if x ≠ y)
• Each gap symbol: -3 (w(-,x)=w(x,-)=-3)
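The scoring scheme translates directly into code. In this sketch the names `w` and `alignment_score` are illustrative, and the example places the single gap of the ATTCTTGC / ATCCTATTCTAGC example at the start of sequence a, which is an assumption about where the gap sits.

```python
MATCH, MISMATCH, GAP = 8, -5, -3

def w(x, y):
    """Score of one alignment column; '-' is the gap symbol."""
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH

def alignment_score(row_a, row_b):
    """Sum of the column scores of two gap-padded rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))

# 5 gap symbols, 7 matches and 1 mismatch: -15 + 56 - 5 = 36
print(alignment_score("-----ATTCTTGC", "ATCCTATTCTAGC"))
```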
Pairwise Alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT

An alignment of a and b:

C---TTAACT
CGGATCA--T

[Figure labels: insertion gap, match, mismatch, deletion gap]
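Alignment Graph
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: alignment graph with sequence b (C G G A T C A T) along the top and sequence a (C T T A A C T) down the side; insertion gaps and deletion gaps are marked on the path of the alignment below]
C---TTAACT
CGGATCA--T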
Graphic representation of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure, built up over five slides: the alignment below is traced column by column as a path through the alignment graph, with sequence b along the top and sequence a down the side]
C---TTAACT
CGGATCA--T
Pathway of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: the complete path of the alignment below through the alignment graph]
C---TTAACT
CGGATCA--T
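Alignment Score
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
[Figure: running score along the path of the alignment]
Running score, column by column (the sixth column, T over C, is a mismatch):
8, 5, 2, -1, -1+8 = 7, 7-5 = 2, 2+8 = 10, 10-3 = 7, 7-3 = 4, 4+8 = 12
Alignment score: 12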
An optimal alignment: the alignment of maximum score
• Let A = a1a2…am and B = b1b2…bn.
• Si,j: the score of an optimal alignment between a1a2…ai and b1b2…bj
• With proper initializations, Si,j can be computed as follows.

$$
S_{i,j} = \max
\begin{cases}
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
S_{i-1,j-1} + w(a_i, b_j)
\end{cases}
$$
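Computing Si,j
[Figure: cell (i, j) of the table is reached from the cell above with w(ai, -), from the cell to the left with w(-, bj), and from the diagonal cell with w(ai, bj); the score of the optimal alignment is Sm,n]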
Initializations
Gap symbol: -3
S0,0 = 0
S0,1 = -3, S0,2 = -6, S0,3 = -9, S0,4 = -12, S0,5 = -15, S0,6 = -18, S0,7 = -21, S0,8 = -24
S1,0 = -3, S2,0 = -6, S3,0 = -9, S4,0 = -12, S5,0 = -15, S6,0 = -18, S7,0 = -21

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3
T   -6
T   -9
A  -12
A  -15
C  -18
T  -21
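The initialization and the recurrence for Si,j can be sketched as a table-filling routine. The function name `fill_table` is illustrative; the scores (match 8, mismatch -5, gap -3) are the ones used in this lecture, and rows correspond to sequence a, columns to sequence b.

```python
def fill_table(a, b, match=8, mismatch=-5, gap=-3):
    """Return the (len(a)+1) x (len(b)+1) table of optimal prefix scores."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]

    # First column and first row: a prefix aligned against gaps only.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j] + gap,       # a_i aligned with a gap
                S[i][j - 1] + gap,       # b_j aligned with a gap
                S[i - 1][j - 1] + pair,  # a_i aligned with b_j
            )
    return S

S = fill_table("CTTAACT", "CGGATCAT")
print(S[0])     # first row of the table
print(S[7][8])  # score of an optimal alignment
```

The table has (m+1)(n+1) cells and each cell takes constant time, so the running time is proportional to the product of the two sequence lengths.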
S1,1 = ?
Match: 8, Mismatch: -5, Gap symbol: -3
Option 1: S1,1 = S0,0 + w(a1, b1) = 0 + 8 = 8
Option 2: S1,1 = S0,1 + w(a1, -) = -3 - 3 = -6
Option 3: S1,1 = S1,0 + w(-, b1) = -3 - 3 = -6
Optimal: S1,1 = 8

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    ?
T   -6
T   -9
A  -12
A  -15
C  -18
T  -21
S1,2 = ?
Match: 8, Mismatch: -5, Gap symbol: -3
Option 1: S1,2 = S0,1 + w(a1, b2) = -3 - 5 = -8
Option 2: S1,2 = S0,2 + w(a1, -) = -6 - 3 = -9
Option 3: S1,2 = S1,1 + w(-, b2) = 8 - 3 = 5
Optimal: S1,2 = 5

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    ?
T   -6
T   -9
A  -12
A  -15
C  -18
T  -21
S2,1 = ?
Match: 8, Mismatch: -5, Gap symbol: -3
Option 1: S2,1 = S1,0 + w(a2, b1) = -3 - 5 = -8
Option 2: S2,1 = S1,1 + w(a2, -) = 8 - 3 = 5
Option 3: S2,1 = S2,0 + w(-, b1) = -6 - 3 = -9
Optimal: S2,1 = 5

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5
T   -6    ?
T   -9
A  -12
A  -15
C  -18
T  -21
S2,2 = ?
Match: 8, Mismatch: -5, Gap symbol: -3
Option 1: S2,2 = S1,1 + w(a2, b2) = 8 - 5 = 3
Option 2: S2,2 = S1,2 + w(a2, -) = 5 - 3 = 2
Option 3: S2,2 = S2,1 + w(-, b2) = 5 - 3 = 2
Optimal: S2,2 = 3

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5
T   -6    5    ?
T   -9
A  -12
A  -15
C  -18
T  -21
S3,5 = ?

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    ?
A  -12
A  -15
C  -18
T  -21
S3,5 = ?
Completed table (S3,5 = 5); the optimal score is the bottom-right entry, S7,8 = 14.

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    5    2   -1    9
A  -12   -1   -3   -5    6    3    0   10    7
A  -15   -4   -6   -8    3    1   -2    8    5
C  -18   -7   -9  -11    0   -2    9    6    3
T  -21  -10  -12  -14   -3    8    6    4   14
Optimal alignment (read off the table by tracing back from S7,8):
C T T A A C - T
C G G A T C A T

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    5    2   -1    9
A  -12   -1   -3   -5    6    3    0   10    7
A  -15   -4   -6   -8    3    1   -2    8    5
C  -18   -7   -9  -11    0   -2    9    6    3
T  -21  -10  -12  -14   -3    8    6    4   14

8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
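The optimal alignment is recovered by walking back from the bottom-right cell and asking, at each step, which of the three options produced the stored value. The sketch below is a minimal version under stated assumptions: the name `global_align` is illustrative, and ties are broken by trying the diagonal first, then the cell above, then the cell to the left. Other tie-breaking orders can return a different alignment with the same score.

```python
def global_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return (optimal score, aligned row for a, aligned row for b)."""
    def w(x, y):
        return match if x == y else mismatch

    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j] + gap,
                          S[i][j - 1] + gap,
                          S[i - 1][j - 1] + w(a[i - 1], b[j - 1]))

    # Trace back from S[m][n] to S[0][0], collecting columns in reverse.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:  # the value came from the cell to the left
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))

score, top, bottom = global_align("CTTAACT", "CGGATCAT")
print(score)
print(top)
print(bottom)
```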
Multiple sequence alignment (MSA)
How to score an MSA?
• Sum-of-Pairs (SP-score)

GC-TC
A---C
G-ATC

Score(MSA) = Score(GC-TC, A---C) + Score(GC-TC, G-ATC) + Score(A---C, G-ATC)

GC-TC
A---C    -5 - 3 + 8 - 3 + 8 = 5

GC-TC
G-ATC    8 - 3 - 3 + 8 + 8 = 18

A---C
G-ATC    -5 + 8 - 3 - 3 + 8 = 5

SP-score = 5 + 18 + 5 = 28
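The SP-score is the sum of the pairwise scores over every pair of rows. In the sketch below the names `pair_score` and `sp_score` are illustrative. One assumption is taken from the arithmetic above: a column with a gap in both rows is counted as a match (+8), since w(x, y) = 8 whenever x = y. Other formulations score such a column as 0, which would change the total.

```python
from itertools import combinations

def pair_score(row_x, row_y, match=8, mismatch=-5, gap=-3):
    """Score two rows of an MSA column by column."""
    total = 0
    for x, y in zip(row_x, row_y):
        if x == y:
            total += match      # also covers '-' against '-' (see assumption)
        elif x == "-" or y == "-":
            total += gap
        else:
            total += mismatch
    return total

def sp_score(msa):
    """Sum-of-pairs score: pair_score summed over all pairs of rows."""
    return sum(pair_score(x, y) for x, y in combinations(msa, 2))

msa = ["GC-TC", "A---C", "G-ATC"]
print(sp_score(msa))
```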