CZ5225: Modeling and Simulation in Biology

Lecture 9: Next Generation Sequencing

Prof. Chen Yu Zong

Tel: 6516-6877
Email: [email protected]
http://bidd.nus.edu.sg
Room 08-14, level 8, S16, NUS

Outline

• First generation sequencing

• Next generation sequencing

• Third generation sequencing

• Analysis challenges

Sanger Sequencing

• DNA is fragmented
• Cloned to a plasmid vector
• Cyclic sequencing reaction
• Separation by electrophoresis
• Readout with fluorescent tags

Steps to Assemble a Genome

1. Find overlapping reads

2. Merge some “good” pairs of reads into longer contigs

3. Link contigs to form supercontigs

4. Derive consensus sequence ..ACGATTACAATAGGTT..

Some Terminology

read: a 500-900 letter word that comes out of the sequencer

mate pair: a pair of reads from the two ends of the same insert fragment

contig: a contiguous sequence formed by several overlapping reads with no gaps

supercontig (scaffold): an ordered and oriented set of contigs, usually by mate pairs

consensus sequence: the sequence derived from the multiple sequence alignment of reads in a contig

Sequencing Types and Applications

Cyclic-Array Methods

• DNA is fragmented
• Adaptors ligated to fragments
• Several possible protocols yield array of PCR colonies.
• Enzymatic extension with fluorescently tagged nucleotides.
• Cyclic readout by imaging the array.

Emulsion PCR

• Fragments, with adaptors, are PCR amplified within a water drop in oil.
• One primer is attached to the surface of a bead.
• Used by 454, Polonator and SOLiD.

Bridge PCR

• DNA fragments are flanked with adaptors.
• A flat surface is coated with two types of primers, corresponding to the adaptors.
• Amplification proceeds in cycles, with one end of each bridge tethered to the surface.
• Used by Solexa.

Comparison of Existing Methods

Genome Assembly: Find Overlapping Reads

[Figure: k-mer words extracted from example reads such as aaactgcagtacggatct, listed first as (read, pos., word, orient.) tuples and then as (word, read, orient., pos.) tuples.]

• Find pairs of reads sharing a k-mer, k ~ 24
• Extend to full alignment; throw away if not >98% similar

[Figure: two reads sharing the word TAGATTACACAGATTAC are extended into a full alignment.]

• Caveat: repeats
– A k-mer that occurs N times causes O(N^2) read/read comparisons
– ALU k-mers could cause up to 1,000,000^2 comparisons

• Solution: Discard all k-mers that occur “too often”

• Set cutoff to balance sensitivity/speed tradeoff, according to genome at hand and computing resources available
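The k-mer filtering step above can be sketched in Python. This is a minimal illustration, not the method of any particular assembler: the function name `candidate_overlaps` and the parameter `max_reads_per_kmer` are hypothetical, only the forward strand is indexed, and the cutoff counts how many reads contain a k-mer rather than its total number of occurrences.

```python
from collections import defaultdict
from itertools import combinations


def candidate_overlaps(reads, k=24, max_reads_per_kmer=50):
    """Return pairs (i, j), i < j, of reads that share at least one k-mer.

    k-mers found in more than max_reads_per_kmer reads are treated as
    repeats and skipped (hypothetical cutoff, to be tuned per genome).
    """
    # Map every k-mer to the set of reads that contain it.
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(read_id)

    pairs = set()
    for read_ids in index.values():
        if len(read_ids) > max_reads_per_kmer:
            continue  # "too often": would cause O(N^2) comparisons
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


reads = ["AAACTGCAGTACGG", "CTGCAGTACGGATC", "TTTTTTTTTTTTTT"]
print(candidate_overlaps(reads, k=8))
```

Each surviving pair would then be extended to a full alignment and kept only if it passes the similarity threshold.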


Create local multiple alignments from the overlapping reads

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA

• Correct errors using multiple alignment

Isolated errors (insert A, replace T with C):

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTACTGA

Correlated errors, probably caused by repeats: disentangle overlaps

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA

After disentangling:

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA

TAG-TTACACAGATTATTGA
TAG-TTACACAGATTATTGA

In practice, error correction removes up to 98% of the errors
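As a rough illustration of correcting isolated errors from a multiple alignment, the sketch below takes a per-column majority vote. The function name `majority_consensus` is hypothetical, ties are not handled specially, and real assemblers also weigh base quality values, which this sketch ignores.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Majority vote per alignment column; columns won by '-' are dropped."""
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)


alignment = [
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTATTGA",  # T where the other reads have C
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTACTGA",  # missing A
]
print(majority_consensus(alignment))
```

A majority vote only works for isolated errors. When the same deviation appears in several reads, as in the correlated case, the reads more likely come from different repeat copies and should be separated rather than corrected.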


Genome Assembly: Merge Reads into Contigs

• Overlap graph:
– Nodes: reads r1…rn
– Edges: overlaps (ri, rj, shift, orientation, score)

[Figure: overlap graph of reads that come from two regions of the genome (blue and red) that contain the same repeat.]

Note: of course, we don’t know the “color” of these nodes.

We want to merge reads up to potential repeat boundaries.

[Figure: a repeat region, a unique contig and an overcollapsed contig.]

• Ignore non-maximal reads
• Merge only maximal reads into contigs
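Merging up to repeat boundaries can be sketched on a toy overlap graph. The code below is a simplified illustration under strong assumptions: reads are error-free, all in the same orientation, and an overlap is an exact suffix/prefix match of at least `min_len` letters. The names `overlap_len` and `merge_unambiguous` are hypothetical. A chain is extended only while a read has exactly one successor and that successor has exactly one predecessor; a branch (a possible repeat boundary) ends the contig.

```python
from collections import defaultdict


def overlap_len(a, b, min_len):
    """Longest suffix of a equal to a prefix of b, if at least min_len long."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def merge_unambiguous(reads, min_len):
    """Merge reads along unbranched paths of the overlap graph."""
    succ = defaultdict(dict)  # succ[i][j] = overlap length of edge i -> j
    pred = defaultdict(dict)
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap_len(a, b, min_len)
                if n:
                    succ[i][j] = n
                    pred[j][i] = n

    def unique_next(i):
        if len(succ[i]) == 1:
            (j, n), = succ[i].items()
            if len(pred[j]) == 1:
                return j, n
        return None

    def has_unique_prev(i):
        if len(pred[i]) == 1:
            (p,) = pred[i]
            return len(succ[p]) == 1
        return False

    contigs, used = [], set()
    for start in range(len(reads)):
        if start in used or has_unique_prev(start):
            continue  # not the first read of an unbranched chain
        contig, cur = reads[start], start
        used.add(start)
        while True:
            step = unique_next(cur)
            if step is None or step[0] in used:
                break
            cur, n = step
            contig += reads[cur][n:]
            used.add(cur)
        contigs.append(contig)
    # Reads left over (for example in cycles) are kept as single-read contigs.
    contigs.extend(reads[i] for i in range(len(reads)) if i not in used)
    return contigs


print(merge_unambiguous(["ACGTTGCA", "TGCAGGCT", "GGCTAACC"], min_len=4))
```

This sketch does not remove transitive edges or non-maximal reads, so with dense coverage it would stop merging earlier than a real assembler.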


Read Length and Pairing

• Short reads are problematic, because short sequences do not map uniquely to the genome.

• Solution #1: Get longer reads.
• Solution #2: Get paired reads.

ACTTAAGGCTGACTAGC TCGTACCGATATGCTG

Third Generation Sequencing

• Nanopore sequencing
– Nucleic acids driven through a nanopore.
– Differences in conductance of pore provide readout.

• Real-time monitoring of PCR activity
– Read-out by fluorescence resonance energy transfer between polymerase and nucleotides, or
– Waveguides allow direct observation of polymerase and fluorescently labeled nucleotides

Nanopore sequencing

Deamer, D. W., and Akeson, M. ‘Nanopores and Nucleic Acids: prospects for ultrarapid sequencing’. Tibtech.
Meller, A. J. Phys.: Condens. Matter 15 (2003) R581–R607

Earlier Findings
– Transmembrane voltage drives RNA through the protein nanopore α-hemolysin.
– Passage of RNA through the pore reduces the ionic current
– Blockage current is modulated by base identity
• PolyC – Iblock = 5 pA
• PolyA – Iblock = 20 pA
– Translocation rate depends on base identity
• PolyC – v = 3 µs/base
• PolyA – v = 20 µs/base

Automated Rapid DNA Sequencing with Nanopores

Church, George M. ‘Genomes for All’ Scientific American, Jan 2006, pp. 47-54.

Sequencing will require a better understanding of the physics of the interaction between DNA and protein pore during translocation.

Modeling of ssDNA Translocation

• F = zeV/a
– ze = effective charge / base
– V = applied voltage
– a = base-to-base distance

• F = (1)(1.6 × 10^-19 C)(0.125 V)/(0.4 × 10^-9 m) ~ 5 kBT/a ~ 44 pN

• Basis for modeling
– P(forward or backward) ~ exp(Fa/kBT)
– Averaged over all monomers

• Model Assumptions:
– Length of polymer = L >> pore length
– With short polymers, membrane has 0 thickness

D. K. Lubensky and D. R. Nelson, Biophys. J. 77, 1824 (1999).
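The order of magnitude of the driving force can be checked numerically. The sketch below reads the force as F = zeV/a (the energy zeV gained per base, divided by the base spacing a) and uses the slide's values z = 1, V = 0.125 V and a = 0.4 nm. The temperature of 2 °C is an assumption borrowed from the experimental conditions quoted later in the lecture, and the function names are hypothetical.

```python
ELEMENTARY_CHARGE_C = 1.602e-19  # coulombs
BOLTZMANN_J_PER_K = 1.381e-23    # joules per kelvin


def driving_force(z, voltage_v, spacing_m):
    """Force in newtons per base: energy z*e*V gained over one step of length a."""
    return z * ELEMENTARY_CHARGE_C * voltage_v / spacing_m


def energy_in_kbt(z, voltage_v, temperature_k):
    """Energy z*e*V per base expressed in units of kB*T."""
    return z * ELEMENTARY_CHARGE_C * voltage_v / (BOLTZMANN_J_PER_K * temperature_k)


force_pn = driving_force(1.0, 0.125, 0.4e-9) * 1e12
ratio = energy_in_kbt(1.0, 0.125, 275.15)  # 2 degrees C, assumed
print(f"F = {force_pn:.1f} pN, zeV = {ratio:.2f} kBT")
```

With these inputs the force comes out at about 50 pN and zeV at a little over 5 kBT, the same order as the ~5 kBT/a ~ 44 pN quoted on the slide; the exact figure depends on the effective charge and temperature assumed.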


Experiment of ssDNA Translocation

Conditions
– Temp: 2 °C
– Electrolyte solution: 1 M KCl, 1 mM Tris-EDTA buffer, pH 8.5
– Polymer
• Polydeoxyadenylic acid (poly(dA))
• Length: 4 – 100 bases
– Driving voltage: 70-300 mV

Meller, A., L. Nivon, D. Branton, 2001. Voltage-Driven DNA Translocations Through a Nanopore, Phys. Rev. Lett., 86, 3435-39

Sequence Alignment as a Mathematical Problem

Example:
Sequence a: ATTCTTGC
Sequence b: ATCCTATTCTAGC

Best alignment (one gap, marked by an arrow on the original slide):
ATTCTTGC
ATCCTATTCTAGC

Bad alignment (two gaps, marked by arrows on the original slide):
AT TCTT GC
ATCCTATTCTAGC

What is a good alignment?

How to rate an alignment?

• Match: +8 (w(x, y) = 8, if x = y)

• Mismatch: -5 (w(x, y) = -5, if x ≠ y)

• Each gap symbol: -3 (w(-,x)=w(x,-)=-3)
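The scoring scheme above translates directly into code. The sketch below is illustrative: the function names `w` and `alignment_score` are chosen here to mirror the slide's notation, and the example rows are one gapped alignment of the lecture's sequences CTTAACT and CGGATCAT.

```python
def w(x, y, match=8, mismatch=-5, gap=-3):
    """Score of one alignment column, following the scheme above."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def alignment_score(row_a, row_b):
    """Sum of column scores for two equally long, gapped rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))


print(alignment_score("CTTAAC-T", "CGGATCAT"))  # 8-5-5+8-5+8-3+8 = 14
```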

Pairwise Alignment

Sequence a: CTTAACT
Sequence b: CGGATCAT

An alignment of a and b:

C---TTAACT
CGGATCA--T

[Figure labels on the alignment columns: Insertion gap, Match, Mismatch, Deletion gap.]

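Alignment Graph

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: alignment graph with the letters of b (C G G A T C A T) along the top and the letters of a (C T T A A C T) down the side; insertion-gap and deletion-gap edges are labeled.]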

Graphic representation of an alignment

Sequence a: CTTAACT
Sequence b: CGGATCAT

C---TTAACT
CGGATCA--T

[Figure: the alignment drawn step by step as a path through the alignment graph, with b (C G G A T C A T) along the top and a (C T T A A C T) down the side.]
Pathway of an alignment

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: the complete path of the alignment through the alignment graph, with b (C G G A T C A T) along the top and a (C T T A A C T) down the side.]

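Alignment Score

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: the path of the alignment C---TTAACT / CGGATCA--T through the alignment graph, annotated with the running score.]

Running score along the path (match +8, mismatch -5, gap -3):
8, 5, 2, -1, -1+8 = 7, 7-5 = 2, 2+8 = 10, 10-3 = 7, 7-3 = 4, 4+8 = 12

Alignment score: 12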

An optimal alignment: the alignment of maximum score

• Let A = a_1 a_2 … a_m and B = b_1 b_2 … b_n.

• S_{i,j}: the score of an optimal alignment between a_1 a_2 … a_i and b_1 b_2 … b_j

• With proper initializations, S_{i,j} can be computed as follows.

$$
S_{i,j} = \max
\begin{cases}
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
S_{i-1,j-1} + w(a_i, b_j)
\end{cases}
$$
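A minimal sketch of filling the score table with this recurrence is shown below. It assumes the lecture's scoring (match +8, mismatch -5, gap -3) and initializes the first row and column with cumulative gap penalties; the function name `score_table` is hypothetical.

```python
def score_table(a, b, match=8, mismatch=-5, gap=-3):
    """Return the (len(a)+1) x (len(b)+1) table S of optimal prefix scores."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap      # a_1..a_i aligned to gaps only
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap      # b_1..b_j aligned to gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j] + gap,       # a_i against a gap
                S[i][j - 1] + gap,       # gap against b_j
                S[i - 1][j - 1] + sub,   # a_i against b_j
            )
    return S


S = score_table("CTTAACT", "CGGATCAT")
print(S[7][8])  # optimal score for the lecture's example: 14
```

The table needs O(mn) time and memory; the optimal score is the entry in the last row and last column.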

Computing S_{i,j}

[Figure: the three edges entering cell (i, j) of the table, weighted w(a_i, -), w(-, b_j) and w(a_i, b_j); the final cell holds S_{m,n}.]

Initializations

S0,0 = 0
S0,1 = -3, S0,2 = -6, S0,3 = -9, S0,4 = -12, S0,5 = -15, S0,6 = -18, S0,7 = -21, S0,8 = -24
S1,0 = -3, S2,0 = -6, S3,0 = -9, S4,0 = -12, S5,0 = -15, S6,0 = -18, S7,0 = -21

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3
T   -6
T   -9
A  -12
A  -15
C  -18
T  -21

Gap symbol: -3

S1,1 = ?

Option 1: S1,1 = S0,0 + w(a1, b1) = 0 + 8 = 8
Option 2: S1,1 = S0,1 + w(a1, -) = -3 - 3 = -6
Option 3: S1,1 = S1,0 + w(-, b1) = -3 - 3 = -6
Optimal: S1,1 = 8

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    ?
T   -6
T   -9
A  -12
A  -15
C  -18
T  -21

Match: 8
Mismatch: -5
Gap symbol: -3
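The three options for a single cell can be written as a small helper. This is only a sketch of the worked example: the name `cell_options` and its argument names are hypothetical, with `diag`, `up` and `left` standing for the already computed neighbours S(i-1,j-1), S(i-1,j) and S(i,j-1).

```python
def cell_options(diag, up, left, x, y, match=8, mismatch=-5, gap=-3):
    """Return the three candidate scores for a cell, in the slide's order."""
    sub = match if x == y else mismatch
    return (
        diag + sub,   # Option 1: align x with y
        up + gap,     # Option 2: align x with a gap
        left + gap,   # Option 3: align a gap with y
    )


options = cell_options(diag=0, up=-3, left=-3, x="C", y="C")
print(options, max(options))  # candidates for S1,1 and the value kept
```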

S1,2 = ?

Option 1: S1,2 = S0,1 + w(a1, b2) = -3 - 5 = -8
Option 2: S1,2 = S0,2 + w(a1, -) = -6 - 3 = -9
Option 3: S1,2 = S1,1 + w(-, b2) = 8 - 3 = 5
Optimal: S1,2 = 5

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    ?
T   -6
T   -9
A  -12
A  -15
C  -18
T  -21

Match: 8
Mismatch: -5
Gap symbol: -3

S2,1 = ?

Option 1: S2,1 = S1,0 + w(a2, b1) = -3 - 5 = -8
Option 2: S2,1 = S1,1 + w(a2, -) = 8 - 3 = 5
Option 3: S2,1 = S2,0 + w(-, b1) = -6 - 3 = -9
Optimal: S2,1 = 5

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5
T   -6    ?
T   -9
A  -12
A  -15
C  -18
T  -21

Match: 8
Mismatch: -5
Gap symbol: -3

S2,2 = ?

Option 1: S2,2 = S1,1 + w(a2, b2) = 8 - 5 = 3
Option 2: S2,2 = S1,2 + w(a2, -) = 5 - 3 = 2
Option 3: S2,2 = S2,1 + w(-, b2) = 5 - 3 = 2
Optimal: S2,2 = 3

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5
T   -6    5    ?
T   -9
A  -12
A  -15
C  -18
T  -21

Match: 8
Mismatch: -5
Gap symbol: -3

S3,5 = ?

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    ?
A  -12
A  -15
C  -18
T  -21

S3,5 = 5

Completed table (the optimal score is the last entry, S7,8 = 14):

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    5    2   -1    9
A  -12   -1   -3   -5    6    3    0   10    7
A  -15   -4   -6   -8    3    1   -2    8    5
C  -18   -7   -9  -11    0   -2    9    6    3
T  -21  -10  -12  -14   -3    8    6    4   14

Optimal alignment:

C T T A A C - T
C G G A T C A T

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    5    2   -1    9
A  -12   -1   -3   -5    6    3    0   10    7
A  -15   -4   -6   -8    3    1   -2    8    5
C  -18   -7   -9  -11    0   -2    9    6    3
T  -21  -10  -12  -14   -3    8    6    4   14

8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
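Recovering the alignment itself requires walking back through the table from the last cell. The sketch below is one possible implementation of global alignment with traceback under the lecture's scoring. The function name `global_align` and the tie-breaking order (diagonal first, then a gap in b, then a gap in a) are choices made here; other tie-breaking rules can return a different alignment with the same score.

```python
def global_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return (score, gapped_a, gapped_b) for an optimal global alignment."""
    m, n = len(a), len(b)

    def sub(x, y):
        return match if x == y else mismatch

    # Fill the score table.
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    # Trace back from S[m][n] to S[0][0], collecting columns in reverse.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return S[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = global_align("CTTAACT", "CGGATCAT")
print(score)
print(top)
print(bottom)
```

For the lecture's pair of sequences this reproduces the alignment CTTAAC-T over CGGATCAT with score 14.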

Multiple sequence alignment (MSA)

How to score an MSA?

• Sum-of-Pairs (SP-score): the score of the MSA is the sum of the scores of all pairwise alignments it contains.

GC-TC
A---C
G-ATC

Score(GC-TC, A---C) = -5-3+8-3+8 = 5
Score(GC-TC, G-ATC) = 8-3-3+8+8 = 18
Score(A---C, G-ATC) = -5+8-3-3+8 = 5

(In these sums a column that pairs two gap symbols is counted as a match, +8.)

SP-score = 5 + 18 + 5 = 28
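The sum-of-pairs computation can be sketched as follows. The function names `pair_score` and `sp_score` are hypothetical. The parameter `gap_gap` is an assumption made explicit here: the arithmetic in this lecture scores a column of two gap symbols like a match (+8), whereas many treatments score such a column as 0, so it is left adjustable.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=8, mismatch=-5, gap=-3, gap_gap=8):
    """Score of the pairwise alignment formed by two rows of an MSA."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            total += gap_gap     # both rows have a gap in this column
        elif x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def sp_score(msa, **scoring):
    """Sum of pair scores over all unordered pairs of rows."""
    return sum(pair_score(a, b, **scoring) for a, b in combinations(msa, 2))


msa = ["GC-TC", "A---C", "G-ATC"]
print(sp_score(msa))             # 5 + 18 + 5 = 28, as in the lecture
print(sp_score(msa, gap_gap=0))  # same MSA with gap/gap columns scored 0
```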
