Concepts and methods in sequencing and genome assembly

B. Franz LANG, Département de Biochimie
Office: H307-15
E-mail: [email protected]

BCM-2002


Post on 31-Dec-2015


Outline

1. Concepts in DNA and RNA sequencing
2. Sequencing technologies
3. Random genome sequencing, with/without cloning
4. Data formats of results – autoradiograms, traces, fastq and base call qualities
5. Sequencing and assembly artifacts

1. Concepts in DNA and RNA sequencing

Reminder

• DNA and RNA are polar (5’P; 3’OH), charged biopolymers, made up of nucleotides.

• By convention, sequences are always written from 5’ (left) to 3’ (right); otherwise, the polarity has to be indicated.

• DNA usually occurs in double-stranded, antiparallel perfectly base-paired form:

5’ AGCTATTGATTTCCTTGG 3’
3’ TCGATAACTAAAGGAACC 5’

• RNAs are most often single-stranded and may form secondary and tertiary base-pairs (intra-molecular, or with other molecules). Single-stranded DNA does the same. For sequencing, DNAs and RNAs have to be denatured and single-stranded, without structure.
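The strand conventions above can be checked with a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, and it assumes upper-case A, C, G, T input.

```python
# Complement table for DNA bases (upper case only; an assumption of this sketch).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq: str) -> str:
    """Return the base-paired strand, position by position (reads 3'->5')."""
    return seq.translate(COMPLEMENT)

def reverse_complement(seq: str) -> str:
    """Return the opposite strand written in the conventional 5'->3' direction."""
    return complement(seq)[::-1]

top = "AGCTATTGATTTCCTTGG"        # upper strand of the example, 5'->3'
print(complement(top))            # lower strand as drawn, 3'->5'
print(reverse_complement(top))    # lower strand rewritten 5'->3'
```

Sequencing both strands of a fragment yields reads that are reverse complements of each other, so an operation like this is usually needed before the two reads can be compared.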


Principles; see also Maniatis (a popular biochemistry cook-book):

The first two sequencing techniques were the enzymatic synthesis method of Sanger et al. (1977) and the chemical degradation method of Maxam and Gilbert (1977). Note that the Maxam-Gilbert method is slow, uses highly toxic and carcinogenic substances, and is no longer used, except for special applications such as mapping of protein binding to DNA. Next-generation sequencing (NGS) techniques have taken over for genome projects (see below). They do not require electrophoretic techniques but instead use various nano-technological approaches. By far the most popular technology at present is Illumina.


Principle:

Although very different in principle, both Maxam/Gilbert and Sanger produce populations of (radio- or fluorochrome-) labeled oligonucleotides that all start at the same site of a given DNA/RNA, and that end in a given nucleotide (G,A,T/U,C) that is generated with a given sequencing biochemistry (nucleotide-specific termination of DNA synthesis, or nucleotide-specific cleavage; etc.).

[Figure: cleavage at a random meG site; ‘visible’ radioactive fragments]

Note that in any sequencing technology, only the labeled single-stranded DNAs or RNAs are sequenced; unlabeled material does not matter. When several molecules carry the same label, these first need to be separated (e.g., by electrophoresis).

Electrophoretic separation, and detection principles:

These populations of oligonucleotides are then resolved by electrophoresis under conditions that discriminate size differences at the single nucleotide level (PAGE). When loaded into four adjacent lanes of a sequencing gel, the order of nucleotides can be read directly from an image after visualizing the radioactive or any other label (see below). When sequence reactions are marked with four different fluorescent dyes (the current standard of Sanger sequencing), these can be loaded on a single lane (or capillary), and read automatically and continuously as different-wavelength light emissions, generated by laser excitation.
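Reading a four-lane gel amounts to sorting all bands by fragment length and noting the lane of each band. A minimal sketch, with invented band positions and a hypothetical function name, assuming fragment lengths are counted from the first nucleotide after the primer:

```python
def read_ladder(lanes: dict[str, list[int]]) -> str:
    """Read a sequence from four lanes of fragment lengths.

    `lanes` maps each terminating base to the lengths (in nt) of the labeled
    fragments seen in that lane. Shorter fragments migrate further, so reading
    the bands from shortest to longest gives the sequence 5'->3'.
    """
    bands = sorted((length, base)
                   for base, lengths in lanes.items()
                   for length in lengths)
    return "".join(base for _, base in bands)

# Hypothetical band positions for a newly synthesized strand
lanes = {"A": [1, 5], "G": [2, 8], "C": [3], "T": [4, 6, 7]}
print(read_ladder(lanes))
```

With four dyes in one capillary the logic is the same, except that the "lane" of each band is its emission color.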


Principles of RNA sequencing:

RNA is sequenced similarly to DNA: either directly by chemical methods (Maxam-Gilbert-like, yet inefficient and slow), by a Sanger-like synthesis protocol with reverse transcriptase (to produce cDNA sequence ladders), or, after conversion to cDNA, by regular DNA sequencing procedures (Sanger or NGS technologies).

RNA classes may be separated by size (micro RNAs, tRNAs, rRNAs …) or, for eukaryotic mRNAs carrying a 3’ poly-A tail, enriched by purification on an oligo-dT column. That is, RNA sequencing may provide more information than just the primary sequence.

Most RNAs have distinct start and processing sites. High volume RNA sequencing (NGS, called RNA-seq) allows precise identification of starts and stops, and measurement of relative quantities (i.e., quantitative mapping of RNA 5’ and 3’ ends).


2. Sequencing technologies

2.1. Maxam and Gilbert (chemical)

• Requires high amounts of highly purified DNA fragments (e.g., restriction fragments).
• Single radioactive label, can be on double- or single-stranded DNA.
• Nucleotide-specific, partial chemical modification (random along DNA).
• Chemical cleavage at modified nucleotides.
• Denaturation (heat, formamide), to allow uniform electrophoresis of single-stranded DNA molecules that are perfectly linear and without secondary structure (if not: sequencing artifacts).
• High-resolution slab gel PAGE, followed by autoradiography.
• Reading (up to a few hundred nt/reaction) usually by a human expert.
• Several days of labor with a few gel runs provide at best 10 kbp of sequence.

2.1. Maxam-Gilbert sequencing – summary

Slow, many DNA purification steps, requires lots of DNA, toxic reagents, no automation available, relatively short reads (up to a few hundred nt).


2.2. Sanger (enzymatic synthesis)

• Unique start of sequencing ladder is determined by a sequencing primer, hybridized to DNA or RNA. Purity of template is not an issue (!), a huge advantage.
• DNA polymerase (reverse transcriptase for RNA) for primer elongation.
• Nucleotide-specific termination (random) with one of four dideoxy-nucleotides that are mixed with the four regular nucleotides.
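The random, nucleotide-specific termination described above can be imitated with a small simulation. This is a sketch under stated assumptions: the termination probability `p_term`, the number of molecules, and the function name are invented for illustration, and the template is the lower strand of the example shown in the reminder slide.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def sanger_fragments(template_3to5: str, dd_base: str, p_term: float,
                     n_molecules: int, rng: random.Random) -> set[str]:
    """Simulate one dideoxy reaction.

    The polymerase copies `template_3to5`; whenever it must add `dd_base`,
    it incorporates the dideoxy form with probability `p_term` and stops.
    Returns the distinct terminated products, written 5'->3'.
    """
    products = set()
    for _ in range(n_molecules):
        strand = []
        for t in template_3to5:
            base = COMPLEMENT[t]
            strand.append(base)
            if base == dd_base and rng.random() < p_term:
                products.add("".join(strand))
                break
    return products

rng = random.Random(0)            # fixed seed so the run is repeatable
template = "TCGATAACTAAAGGAACC"   # read 3'->5' by the polymerase
for dd in "ACGT":
    frags = sanger_fragments(template, dd, p_term=0.2, n_molecules=2000, rng=rng)
    print(dd, sorted(len(f) for f in frags))
```

Each reaction yields only fragments ending in its own dideoxy base, and the four length lists together cover the positions of the synthesized strand, which is what makes the ladder readable.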


• Label may be radioactive or a fluorescent dye on
– Primer itself (e.g., 5’ P32; dye label added during primer synthesis).
– Nucleotides incorporated during synthesis (e.g., P32, S35).
– Dideoxy-nucleotides (different dyes emitting different colors; single-lane or capillary sequencing is possible: current standard).
• High-resolution slab gel or capillary electrophoresis.
• Autoradiography or automated reading of migrating fragments (laser, with camera or diodes).
• Several days of labor may produce ~100 kbp of sequence. Robotic procedures for template purification and sequence reactions allow scale-up.

2.2. Sanger (enzymatic synthesis), summary

In this example, the primer is labeled, therefore requiring gel separation in four lanes


2.3. 454 Technology – Roche GS FLX pyrosequencing (several hundred MB per run; advantage: reads up to 1,000 nt)

Pyrosequencing is a method of DNA sequencing (determining the order of nucleotides in DNA) based on the "sequencing by synthesis" principle. It differs from Sanger sequencing in that it relies on the detection of pyrophosphate release on nucleotide incorporation, rather than chain termination with dideoxynucleotides.

Originally the leader in NGS, this technology is less effective and more error-prone than Illumina, and will therefore be abandoned by the company in a few years!

[Figure slides: 454 Technology – Roche GS FLX; multiplex reaction in oil emulsion droplets]

• DNA polymerase incorporates the correct, complementary dNTPs onto the template. This incorporation releases pyrophosphate (PPi) stoichiometrically.

• ATP sulfurylase quantitatively converts PPi to ATP in the presence of adenosine 5´ phosphosulfate. This ATP fuels the luciferase-mediated conversion of luciferin to oxyluciferin, which generates visible light.

• Unincorporated nucleotides and ATP are degraded by apyrase, and the reaction can restart.
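Because PPi is released stoichiometrically, the light signal of each nucleotide flow is ideally proportional to the number of bases incorporated in that flow. The sketch below builds such an idealized "flowgram"; the flow order "TACG", the function names, and the noise-free signal are assumptions made for illustration.

```python
def flowgram(read: str, flow_order: str = "TACG", max_flows: int = 400):
    """Return (nucleotide, signal) pairs for successive nucleotide flows.

    `read` is the strand being synthesized (5'->3'). In an idealized
    reaction the light signal equals the number of identical bases
    incorporated in that flow (0 if the flowed nucleotide does not match).
    """
    flows, pos = [], 0
    for i in range(max_flows):
        if pos >= len(read):
            break
        nt = flow_order[i % len(flow_order)]
        run = 0
        while pos < len(read) and read[pos] == nt:
            run += 1
            pos += 1
        flows.append((nt, run))
    return flows

def call_bases(flows) -> str:
    """Translate a flowgram back into a base sequence."""
    return "".join(nt * signal for nt, signal in flows)

flows = flowgram("AGCTATTGATTTCC")
print(flows)
print(call_bases(flows))
```

Note that a run such as TTT appears as a single flow with signal 3, not as three separate events; this is why homopolymer lengths are hard to call once the signal is noisy.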


2.4. Illumina (several GB per run; reads up to 300 nt)

[Figure slides: Illumina; base calling example for two clusters]

2.5. ABI SOLiD – sequencing by ligation (2,000 MB per run; but only 35 nt/read)

A library of DNA fragments, ligated with universal sequence adaptors, is attached to the surface of magnetic beads (one fragment per bead). Emulsion PCR taking place in micro-reactors amplifies the fragments that are then covalently bound to a glass slide.

SOLiD technology applies a rather complicated ligation/cleavage procedure. Partially degenerate, fluorescently labeled DNA octamers with dinucleotide sequence recognition cores are hybridized to the template, and perfectly annealing sequences are ligated to the primer. After imaging, unextended strands are capped and fluorophores are cleaved. Repetitions of new priming, primer removal, and ligation cycles will in the end cover a stretch of 35 nt twice (redundantly), which improves the accuracy of base calling. Yet the value of a 35 nt read is dwindling in the face of other NGS technologies that produce longer reads almost every year (e.g., Illumina promising 300 nt for 2014).
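The dinucleotide recognition means that every color reports on two adjacent bases, so each base is interrogated twice. The slides do not give the color table; the sketch below assumes the commonly described two-base code (four colors, with the color depending only on the pair of adjacent bases), written here as an XOR of 2-bit base codes. Function names are hypothetical.

```python
BITS = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit base codes
BASES = "ACGT"

def encode_colors(seq: str) -> list[int]:
    """Translate a base sequence into one color (0-3) per adjacent dinucleotide."""
    return [BITS[a] ^ BITS[b] for a, b in zip(seq, seq[1:])]

def decode_colors(first_base: str, colors: list[int]) -> str:
    """Rebuild the bases from a known first base plus the color string."""
    seq = [first_base]
    for c in colors:
        seq.append(BASES[BITS[seq[-1]] ^ c])
    return "".join(seq)

colors = encode_colors("AGCTATTG")
print(colors)
print(decode_colors("A", colors))
```

In such a code a true single-base change alters two neighboring colors, whereas an isolated color change is more likely a measurement error, which is one way the redundancy can improve base calling.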

[Figure slides: first cycle cleavage … and so on …]

2.6. Ion Torrent (100 MB + per run; up to 200 nt/read)

Incorporation of a deoxyribonucleotide triphosphate (dNTP) into a primed, growing DNA strand involves the release of pyrophosphate and of a hydrogen ion that is measured on a semiconductor chip.

Microwells, each containing one single-stranded template DNA molecule plus a DNA polymerase, are sequentially flooded with A, C, G or T. An introduced dNTP is incorporated into the growing complementary strand only if it is complementary to the next unpaired nucleotide on the template strand. If several identical nucleotides follow each other, the signal strength correlates with the number of identical incorporated nucleotides.

The series of electrical pulses is translated into a DNA sequence, without intermediate signal conversion, the use of labeled nucleotides, or error-prone intermediate amplification steps. However, the signal precision is lower than with 454, Illumina, and SOLiD technologies.

2.7. Pacific Biosciences (about 3,000 nt/read, up to 15,000)

The PacBio RS II is a single-molecule, real-time DNA sequencing system that provides the longest read lengths of any available sequencing technology. However, in comparison to all other NGS technologies, it has the lowest precision. Sequencing occurs on SMRT Cells, each containing thousands of Zero-Mode Waveguides (ZMWs) in which polymerases are immobilized. The ZMWs provide a way of directly watching DNA polymerase with a high-resolution camera as it performs sequencing by synthesis (fluorescence measurement; four different fluorochrome-labeled nucleotides).

The long read length is valuable for the assembly of genomes, in particular in regions containing long sequence repeats that otherwise cause problems in genome assembly. In addition, the system detects DNA base modifications using the kinetics of the polymerization reaction during sequencing.

2. Sequencing technologies – comparison from 2012

Quail et al. BMC Genomics 2012, 13:341

3. Random genome sequencing comparison

3.1. Sanger, Maxam-Gilbert, with cloning: DNA is not amplified in vitro and therefore has no DNA amplification artifacts. Each clone receives an original piece of DNA in a plasmid that is multiplied by E. coli.

3.2. NGS procedures without cloning, using DNAs either attached to nano-chips (micro wells) or in oil drop emulsion.

454, Illumina, SOLiD: DNA is highly PCR-amplified. Errors may therefore come from PCR amplification artifacts.

Pacific Biosciences and Ion Torrent technologies both read single molecules directly without prior PCR amplification. In contrast, their relatively high error rate is due to the signal imprecision itself.


4. Data formats of results – autoradiograms, traces, fastq and base call qualities

Trace file typical for Sanger sequencing, with base call qualities indicated by the height of blue bars and Q numbers. The advantage of this format is that a human expert can easily spot artifacts. The typical NGS format (FastQ) reports only the sequence plus the quality, encoded in a machine-readable format.

4. Data formats of results – quality scores in fastq format

The typical NGS format (FastQ) reports only the sequence plus the quality, encoded in a machine-readable format.
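As an illustration of the machine-readable encoding, the sketch below reads a four-line FastQ record and converts each quality character into a Phred score Q and an error probability p, using Q = -10 * log10(p). It assumes the Phred+33 character offset (older Illumina files used an offset of 64), and the record itself is made up.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, qualities) from a four-line-per-record FastQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()                      # '+' separator line
        qual = handle.readline().rstrip()
        # Phred+33 (assumed): quality Q is stored as the character with code Q + 33
        yield header[1:], seq, [ord(ch) - 33 for ch in qual]

def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

# Made-up record for illustration
example = io.StringIO("@read1\nAGCTATTG\n+\nIIII5+#!\n")
for name, seq, quals in read_fastq(example):
    print(name, seq, quals)
    print([round(error_probability(q), 4) for q in quals])
```

A score of Q20 thus corresponds to an expected error rate of 1 in 100 base calls, and Q30 to 1 in 1,000.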

[Figure slides: quality scores, including an Illumina example]

5. Artifacts in sequencing and sequence assembly

The denatured DNA is not linear, as it folds back on itself and then migrates differently on the sequencing gel (Sanger).

– reason: secondary structures, mainly in G+C-rich regions
– effect: ‘compression’ zones in the sequencing ladder
– solutions: (i) sequence DNA in the two directions of complementary strands; sequencing artifacts due to folding are not symmetric; (ii) for Sanger sequencing, use nucleotide analogs that minimize secondary structure folding, like deaza-NTP, deaza-dITP, or ITP (instead of NTPs or dGTP, respectively)

Sequencing ladders terminate prematurely or contain ‘holes’

Reasons:
• (i) sequencing reactions over-modified (M&G), or terminator concentrations too high (Sanger);
• (ii) strong nucleotide bias, like long runs of A or T that cause many polymerases to fall off the template (Sanger)

Uncertain number of identical nucleotides in a row (homopolymers; > 6)

Reasons:
• Amplification errors by DNA polymerase (Illumina, 454)
• Signal ambiguity when estimating the number of identical nucleotides from the height of a single signal (Illumina, very high error with 454)
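Why the ambiguity grows with the length of the run can be shown with a toy noise model. The model below is an assumption, not taken from the slides: the signal for a run of n identical bases is taken to be normally distributed around n with a standard deviation proportional to n (10% here), and the call is the nearest integer.

```python
import math

def miscall_probability(n: int, cv: float = 0.1) -> float:
    """Chance that rounding a noisy signal gives the wrong homopolymer length.

    Assumed noise model (not from the slides): the signal for a run of n
    identical bases is normally distributed with mean n and standard
    deviation cv * n, and the call is the nearest integer.
    """
    sigma = cv * n
    # P(|signal - n| > 0.5) for a normal distribution with this sigma
    return math.erfc(0.5 / (sigma * math.sqrt(2)))

for n in range(1, 9):
    print(n, f"{miscall_probability(n):.4f}")
```

Under these assumptions single bases are called almost without error, while runs of six or more are miscalled in a large fraction of reads, in line with the "> 6" threshold above.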


Readings that only partially fit a genome sequence (one of the worst artifacts)

Reasons:
• Ligation of separate pieces into one fragment, during primer ligation (applies to all technologies using primer ligation)
• Partial deletion of sequence during PCR at repeat sequences and folded structures (all technologies using PCR amplification)
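A read made of two ligated pieces can be recognized because its two halves match the reference at unrelated positions. The sketch below uses exact string matching against a made-up reference; real mappers tolerate mismatches and search both strands, so this only illustrates the idea. The function name and the `min_part` threshold are hypothetical.

```python
def split_hits(read: str, genome: str, min_part: int = 8):
    """Classify a read against a reference by exact matching.

    Returns ("full", pos) if the whole read occurs in `genome`,
    ("chimeric", cut, left_pos, right_pos) if a prefix and the remaining
    suffix (each at least `min_part` nt) both occur in `genome`, or
    ("unmapped",) otherwise. Because the whole read did not match, the
    two partial hits cannot be adjacent in the reference.
    """
    pos = genome.find(read)
    if pos != -1:
        return ("full", pos)
    for cut in range(min_part, len(read) - min_part + 1):
        left = genome.find(read[:cut])
        right = genome.find(read[cut:])
        if left != -1 and right != -1:
            return ("chimeric", cut, left, right)
    return ("unmapped",)

genome = "AGCTATTGATTTCCTTGGCATGCAAGTCGTACCGATTAGGCTAACGT"   # made-up reference
chimera = genome[0:10] + genome[30:42]                        # two distant pieces joined
print(split_hits(genome[5:25], genome))
print(split_hits(chimera, genome))
```

The reported cut point may differ by a base or two from the true junction when the bases next to it happen to match both locations.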


This is it, folks!