Next Generation DNA Sequencing
IPM-NUS Workshop on Computational Biology
Mehdi Sadeghi
DNA sequencing methodologies: 1977
• Maxam-Gilbert
– base modification by general and specific chemicals.
– depurination or depyrimidination.
– single-strand excision.
– not amenable to automation
• Sanger
– DNA replication.
– substitution of substrate with chain-terminator chemical.
– more efficient
– automation?
DNA sequencing: Chemistry
template + polymerase +
dCTP, dTTP, dGTP, dATP
ddATP, ddGTP, ddTTP, ddCTP
extension
electrophoresis
A•TG•CA•TT•AC•GT•AG•CG•CA•TG•CT•AT•AC•GT•AG•CA•T
DNA Sequencing
Goal:
Find the complete sequence of A, C, G, T’s in DNA
Challenge:
There is no machine that takes long DNA as an input, and gives the complete sequence as output
Can only sequence ~500 letters at a time
Genome Sequencing
ACGTGGTAA CGTATACAC TAGGCCATA GTAATGGCG CACCCTTAG TGGCGTATA CATA…
ACGTGGTAATGGCGTATACACCCTTAGGCCATA
Short fragments of DNA
AC..GCTT..TC
CG..CA
AC..GC
TG..GT TC..CC
GA..GCTG..AC
CT..TGGT..GC AC..GC AC..GC
AT..ATTT..CC
AA..GC
Short DNA sequences
ACGTGACCGGTACTGGTAACGTACACCTACGTGACCGGTACTGGTAACGTACGCCTACGTGACCGGTACTGGTAACGTATACACGTGACCGGTACTGGTAACGTACACCTACGTGACCGGTACTGGTAACGTACGCCTACGTGACCGGTACTGGTAACGTATACCTCT...
Sequenced genome
Genome
DNA sequencing – vectors
+ =
DNA
Shake
DNA fragments
Vector: circular genome (bacterium, plasmid)
Known location (restriction site)
Different types of vectors
VECTOR                                   Size of insert (bp)
Plasmid                                  2,000-10,000 (can control the size)
Cosmid                                   40,000
BAC (Bacterial Artificial Chromosome)    70,000-300,000
YAC (Yeast Artificial Chromosome)        > 300,000 (not used much recently)
Sanger sequencing
• DNA is fragmented
• Cloned to a plasmid vector
• Cyclic sequencing reaction
• Separation by electrophoresis
• Readout with fluorescent tags
Sanger Sequencing
• Advantages
– Long reads (~750 bp)
– Suitable for small projects
• Disadvantages
– Low throughput
– Expensive
Method to sequence longer regions
cut many times at random (Shotgun)
genomic segment
Get one or two reads from each segment
~500 bp ~500 bp
Reconstructing the Sequence (Fragment Assembly)
Cover region with ~7-fold redundancy (7X)
Overlap reads and extend to reconstruct the original genomic region
reads
Definition of Coverage
Length of genomic segment: L
Number of reads: n
Length of each read: l
Definition: Coverage C = n l / L
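The coverage definition can be tried out numerically. The sketch below is illustrative only: the function names and the example numbers are assumptions, and the uncovered-fraction estimate e^(-C) comes from the idealized Lander-Waterman model (reads placed uniformly at random), not from a real sequencing run.

```python
import math

def coverage(n_reads, read_len, segment_len):
    """Coverage C = n * l / L, as defined on the slide."""
    return n_reads * read_len / segment_len

def expected_fraction_uncovered(c):
    """Idealized Lander-Waterman estimate: a given base is missed
    by every read with probability about e^(-C)."""
    return math.exp(-c)

# Hypothetical example: 7,000 reads of 500 bp over a 500 kbp segment.
c = coverage(7000, 500, 500_000)
print(c, expected_fraction_uncovered(c))
```

Under this model, raising coverage shrinks the expected uncovered fraction exponentially, which is why a modest redundancy such as 7X already leaves few gaps.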
How much coverage is enough?
Assembly: How Much DNA?
Input → Output
High coverage: many pieces to assemble → a few contigs, a few gaps
Low coverage: a few pieces to assemble → many contigs, many gaps
(Lander and Waterman, 1988)
Challenges with Fragment Assembly
• Sequencing errors
~1-2% of bases are wrong
• Repeats
false overlap due to repeat
Repeats
Bacterial genomes: 5%
Mammals: 50%
Repeat types:
• Low-Complexity DNA (e.g. ATATATATACATA…)
• Microsatellite repeats (a1…ak)^N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG)
• Transposons
– SINE (Short Interspersed Nuclear Elements), e.g., ALU: ~300-long, 10^6 copies
– LINE (Long Interspersed Nuclear Elements): ~4000-long, 200,000 copies
– LTR retroposons (Long Terminal Repeats (~700 bp) at each end): cousins of HIV
• Gene Families genes duplicate & then diverge (paralogs)
• Recent duplications ~100,000-long, very similar copies
Strategies for whole-genome sequencing
1. Hierarchical – Clone-by-clone
i. Break genome into many long pieces
ii. Map each long piece onto the genome
iii. Sequence each piece with shotgun
Example: Yeast, Worm, Human, Rat
2. Online version of (1) – Walking
i. Break genome into many long pieces
ii. Start sequencing each piece with shotgun
iii. Construct map as you go
Example: Rice genome
3. Whole genome shotgun
One large shotgun pass on the whole genome
Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Fugu
Whole Genome Shotgun Sequencing
cut many times at random
genome
forward-reverse paired reads
plasmids (2 – 10 Kbp)
cosmids (40 Kbp) known dist
~500 bp~500 bp
Assembly
Cut DNA to larger pieces (2Kbp, 15Kbp) and sequence both ends of each piece (Fleischmann et al., 1994)
[Figure: contig 1 and contig 2 linked by 15 Kbp mates and 2 Kbp mates; two ~500 bp end reads separated by ~(length − 1,000) bp]
resolving repeats
Better assembly of contigs, gap lengths estimation
• Many years of hard work
• More than 20,000 BAC clones
• Each containing a fragment of about 100 kb
• Together provided a tiling path through each human chromosome
• Amplification in bacterial culture
• Isolation, selection of pieces of about 2-3 kb
• Subcloned into plasmid vectors, amplification, isolation
• Recreate contigs
• Refinement, gap closure, sequence quality improvement (less than 1 error per 40,000 bases)
• BAC-based approaches toward WGS
Sequencing of Human Genome
Public Consortium
Sanger Sequencing
[Timeline: 1980 to 2000s]
1982: lambda virus, DNA stretches up to 30-40 Kbp (Sanger et al.)
1994: H. influenzae, 1.8 Mbp (Fleischmann et al.)
2001: H. sapiens, D. melanogaster, 3 Gbp (Venter et al.)
2007: Global Ocean Sampling, ~3,000 organisms, 7 Gbp (Venter et al.)
Sequencing the Human Genome
[Plot: log10(price) versus year, 2000-2010]
2001: Human Genome Project, 2.7G$, 11 years
2001: Celera, 100M$, 3 years
2007: 454, 1M$, 3 months
2008: ABI SOLiD, 60K$, 2 weeks
2009: Illumina, Helicos, 40-50K$
2010: 5K$, a few days
2012: 100$, <24 hrs?
2nd Generation: Pyrosequencing
• Sequencing by synthesis
• Advantages:
– Accurate
– Parallel processing
– Easily automated
– Eliminates the need for labeled primers and nucleotides
– No need for gel electrophoresis
Pyrosequencing
• Basic idea:
– Visible light is generated and is proportional to the number of incorporated nucleotides
– 1 pmol DNA = 6×10^11 ATP = 6×10^9 photons at 560 nm
[Figure labels: DNA Polymerase I from E. coli; pyrophosphate; luciferase (from fireflies, oxidizes luciferin and generates light)]
• 1st Method – Solid Phase
– Immobilized DNA
– 3 enzymes
– Wash step to remove nucleotides after each addition
Pyrosequencing
• 2nd Method – Liquid Phase
– 3 enzymes + apyrase (nucleotide degradation enzyme)
– Eliminates need for washing step
• In the well of a microtiter plate:
– primed DNA template
– 4 enzymes
• Nucleotides are added stepwise
• Nucleotide-degrading enzyme degrade previous nucleotides
Pyrosequencing
Disadvantages
• Smaller sequences
• Nonlinear light response after more than 5-6 identical nucleotides
Pyrosequencing
Next Generation Sequencing
• DNA is fragmented
• Adaptors ligated to fragments
• Several possible protocols yield array of PCR colonies.
– Emulsion PCR
– Bridge PCR
• Enzymatic extension with fluorescently tagged nucleotides.
• Cyclic readout by imaging the array.
Next Generation Sequencing
• 454 Life Sciences/Roche
– Genome Sequencer FLX: currently produces 400-600 million bases per day per machine
– Published 1 million bases of Neanderthal DNA in 2006
– May 2007 published complete genome of James Watson (3.2 billion bases ~20x coverage)
• Solexa/Illumina
– 10 GB per machine/week
– May 2008 published complete genomes for 3 HapMap subjects (14x coverage)
• ABI SOLiD
– 20 GB per machine/week
“Paradigm Shift”
• Standard ABI “Sanger” sequencing
– 96 samples/day
– Read length ~750 bp
– Total = 70,000 bases of sequence data
• 454 was the game changer!
– ~400,000 different templates (reads)/day
– Read length ~250 bp
– Total = 100,000,000 bases of sequence data!!!
Solexa ups the Game
• Solexa (Illumina GA)
– 60,000,000 different sequence templates (yes, that is 60 million reads)
– 36 bp read length
– 4 billion bases of DNA per run (3 days)
• Each system works differently, but they are all based on similar principles:
– Shear target DNA into small pieces
– Bind individual DNA molecules to a solid surface
– Amplify each molecule into a cluster
– Copy one base at a time and detect different signals for A, C, T, and G bases
– Requires very precise high-resolution imaging of tiny features (charge-coupled device (CCD))
454
• First high-throughput DNA sequencer, commercially available in 2004
• Now produces ~500 MB reads of 500 bp
• Run of 8 samples in 10 hours, so can do multiple runs/week
• Uses pyrosequencing, beads, and a microtiter plate
• Low error rate, but insert/delete problems with homopolymers (stretches of a single base)
Illumina Genome Analyzer
• Originally developed by Solexa, now a subsidiary of Illumina.
• Commercially available in 2006
• Now produces 8-12 million reads per sample of 36 bp length = 10 GB/week.
• Run takes 3 days for 7 samples.
• Low error rate, mostly base changes, few indels
ABI-SOLiD
• First commercially available in late 2007
• Currently capable of producing 20 GB of data per run (week)
• Most users generate 6 GB/run
• Reads ~30 bp long
• Uses unique sequence-by-ligation method
• “color-space” data
• Very low error rate
454 vs Solexa
454:
• Read length: 400 bp
• Number of reads: 400,000
• Per-base cost greater
• de novo assembly, metagenomics
Solexa:
• Read length: 40 bp
• Number of reads: millions
• Per-base cost cheaper
• Ideal for applications requiring short reads
Applications
• “If you build it, they will come.”
• An explosion of scientific innovation!
• Every new technology enables new applications, which are not directly foreseen by the original developers of the tech.
• Cheap access to high-volume sequencing becomes a data collection method for many different types of experimental applications
• Ancient DNA
• DNA mixtures from diverse ecosystems, metagenomics
• Resequencing previously published reference strains
• Identification of all mutations in an organism
• Expand the number of available genomes
• Comparative studies
• Deciphering cell’s transcripts at sequence level without knowledge of the genome sequence
• Sequencing extremely large genomes, crop plants
• Detection of cancer specific alleles avoiding traditional cloning
• ChIP-seq: protein-DNA interactions
• Epigenomics
• Detecting ncRNA
• Human genetic variation: SNP, CNV (diseases)
Usage of sequencing data
• Transcriptome (RNA) sequencing
• Differential expression
• Alternative splicing
• Complete/targeted genome (DNA) resequencing
• Polymorphism and mutation discovery
De Novo sequencing
• New species/strains
• Challenge of assembly with short reads
– 8x coverage of 3 GB genome = 750 million fragments
– Exponential problem for all-vs-all algorithm
• Again big problem with repeats
• Assemble contigs, fill gaps
• Paired-end reads are essential
• Can sequence the entire genome of a microbe in a single run
Resequencing(mutation discovery/genotyping)
• A lot of current sequencing effort is spent on re-sequencing genomes of known species
– Individual humans (1000 Genomes Project)
– Experimental organisms – looking for genetic variation, copy number variation
• Challenge is to (quickly) align millions of sequence reads to a reference genome with some % of mismatches
• Challenge to accurately call SNPs and indels
• Problems with repeated sequences – both tandem and dispersed repeats
Read length and pairing
• Short reads are problematic, because short sequences do not map uniquely to the genome.
• Solution #1: Get longer reads.
• Solution #2: Get paired reads.
ACTTAAGGCTGACTAGC TCGTACCGATATGCTG
RNA Sequencing
• “Digital Gene Expression” or “RNA-Seq”
• Truly accurate gene expression measurements
– Can replace gene expression microarrays
• 25% more sensitive
• Does not rely on hybridization (no %GC bias, no cross-hybridization between related genes)
• Discover novel genes (and other kinds of RNA molecules)
– one experiment found that 34% of human transcripts were not from known genes
• Sultan et al, Science. 2008 Aug 15;321(5891):956-60.
More information from RNA
• Can capture true alternative splicing information
– Sequence of splice-junctions
• One study found 4,096 previously unknown splice junctions in 3,106 human genes
– Different transcription start and end points for RNA molecules
• Allelic variation (SNPs)
• Small RNAs
Metagenomics
• Survey/discovery of all of the species present in an environmental or medical sample
• “Human Microbiome”
– disease vs. healthy microbe populations in mouth, intestines, skin, reproductive tract, etc.
• Complete multiple genome sequencing
• Complete multi-species transcript profiling (metabolic reconstruction)
• Deep sampling of genetic variation in microbial populations (frequency of drug resistant, toxin producing, etc.)
Informatics is the Bottleneck
• Scientists are currently able to generate sequence data much faster/more easily than they are able to analyze it
• Customized analysis / Bioinformatics consulting is needed for every project
Bioinformatics Challenges
• Need for large amount of CPU power
– Informatics groups must manage compute clusters
– Challenges in parallelizing existing software or redesign of algorithms to work in a parallel environment
– Very large text files (~10 million lines long)
– Impossible memory usage and execution time
Future Directions
• Sequencing will continue to get much faster and cheaper, by 4-10x per year for several more years.
• complete human genome sequencing will be available as a clinical diagnostic tool within 2-3 years.
• Data storage and analysis bottleneck
• Data security/privacy issues
Overview
Whole genome shotgun sequencing
De Novo sequencing
• New species/strains
• Challenge of assembly with short reads
– 8x coverage of 3 GB genome = 750 million fragments (32 bp)
– Exponential problem for all-vs-all algorithm
• Again big problem with repeats
• Assemble contigs, fill gaps
• Paired-end reads are essential
• Can sequence the entire genome of a microbe in a single run
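To give a feel for the scale behind the 750 million figure, here is a small arithmetic sketch. The helper names are hypothetical; the pair count is simply n(n-1)/2 and ignores any indexing tricks a real assembler would use to avoid comparing every pair.

```python
def fragments_needed(coverage, genome_len, read_len):
    """Number of reads n such that n * read_len / genome_len = coverage."""
    return coverage * genome_len // read_len

def all_vs_all_pairs(n):
    """Number of unordered read pairs an all-vs-all comparison would examine."""
    return n * (n - 1) // 2

n = fragments_needed(8, 3_000_000_000, 32)   # 8x coverage, 3 Gbp genome, 32 bp reads
print(n, all_vs_all_pairs(n))
```

Even before considering repeats or errors, the number of candidate pairs is far beyond what a naive comparison loop could handle.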
Genome Sequencing
• Assembly Algorithms
– Shotgun sequencing assembly problem
• Find the shortest common superstring of a set of sequences.
• Given strings {s1, s2, …} find the shortest string T such that every si is a substring of T.
• This is NP-hard.
Greedy Algorithm
• Nodes are fragments
• Edges mean that an overlap exists.
• Weights are the sizes of the overlaps found after calculating pairwise alignments of all fragments.
Greedy Algorithm
• Edge e = (f, g) in the path has a certain weight t, which means that the last t bases of the tail f of e are the same as the first t bases of the head g of e.
• Hamiltonian paths: A path that goes through every vertex
Greedy Algorithm
• Looking for shortest common superstrings is the same as looking for Hamiltonian paths of maximum weight in a directed multigraph.
• A “greedy” attempt at computing the heaviest path: the basic idea is to continuously add the heaviest available edge.
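The greedy idea can be sketched on error-free reads as follows. This is a toy illustration, not the method of any particular assembler: it uses exact suffix-prefix overlaps instead of pairwise alignments, ignores reverse complements, and the function names are assumptions. Because shortest common superstring is NP-hard, the result is a heuristic and need not be the shortest superstring.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the pair of fragments with the heaviest overlap edge."""
    unique = set(reads)
    # Drop reads fully contained in another read; they add no information.
    frags = sorted(r for r in unique
                   if not any(r != s and r in s for s in unique))
    while len(frags) > 1:
        # Heaviest available edge (ties broken by index for determinism).
        t, i, j = max((overlap(a, b), i, j)
                      for i, a in enumerate(frags)
                      for j, b in enumerate(frags) if i != j)
        merged = frags[i] + frags[j][t:]
        frags = [f for n, f in enumerate(frags) if n not in (i, j)] + [merged]
    return frags[0]

print(greedy_assemble(["ACGTGG", "GTGGTA", "GGTAAT"]))
```

With repeats in the genome, the heaviest edge may be a false overlap, which is exactly the failure mode the repeat slides describe.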
• Assembly Algorithms
• Overlap-layout-consensus
– An assembler builds the graph
– Output is a set of nonintersecting simple paths, each path being a contig.
Genome Sequencing
Overlap-layout-consensus
• Overlap-layout-consensus method for assembly.
– Build an overlap graph where each node represents a read. An edge exists between two reads if they overlap.
– Traverse the graph to find unambiguous paths which form contigs.
Overlap graph for a bacterial genome. The thick edges in the picture on the left (a Hamiltonian cycle) correspond to the correct layout of the reads along the genome (figure on the right). The remaining edges represent false overlaps induced by repeats (exemplified by the red lines in the figure on the right)
Overlap-layout-consensus
Next-generation sequencing
• Lower cost / base pair
• Very short fragment lengths (25-75bps)
• High error rate
• Inherent ability to do paired-end (mate-pair) sequencing.
Next-generation sequencing
• Challenging to assemble data.
• Short fragment length = very small overlap, therefore many false overlaps
• Sequenced up to 100x coverage, increase in data size.
• Large number of reads + short overlap + higher error rate make traditional overlap - layout - consensus approach impractical.
Current approaches
• Euler / De Bruijn approach.
• Introduced as an alternative to overlap-layout-consensus approach in capillary sequencing.
• More suited for short read assembly.
• Assembly Algorithms
• Eulerian path
– Eulerian path: a path that visits all edges of a graph
– Breaks reads into overlapping n-mers.
– Each n-mer is an edge whose source is its (n-1)-prefix and whose destination is its (n-1)-suffix.
– Basic problem is to find a path that uses all the edges.
– Eulerian path is more efficient.
Genome Sequencing
Eulerian Circuits and Paths
Eulerian Circuit – visits each edge in a graph exactly once, and ends at the same vertex in which it started.
[Figure: graph on vertices a-f] a-d-b-f-e-d-f-c-b-a is an Eulerian cycle in this particular graph
Eulerian Path – visits each edge in a graph exactly once.
[Figure: graph on vertices a-j] a-b-c-d-e-f-g-c-h-f-i-j is an Eulerian trail in this particular graph
De Bruijn Graphs
• Nodes are (k-1)-mers
• Edges are k-mers
• The set of k-mers is called a k-spectrum
• Finding shortest string with given k-spectrum.
Example 3-spectrum: {AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC}
[Figure: de Bruijn graph on nodes CA, GC, AG, TC, AT, TT]
• Break each read sequence to overlapping fragments of size k. (k-mers)
• Form De Bruijn graph such that each (k-1)-mer represents a node in the graph.
• An edge exists from node a to node b iff there exists a k-mer whose (k-1)-prefix is a and whose (k-1)-suffix is b.
• Traverse the graph in unambiguous path to form contigs.
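The construction above can be sketched on the 3-spectrum {AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC} from the earlier slide. This is a minimal illustration under simplifying assumptions: error-free k-mers, each used exactly once, and a graph that actually has an Eulerian path. The traversal is Hierholzer's algorithm, chosen here for brevity; the function names are hypothetical.

```python
from collections import defaultdict

def de_bruijn_graph(kmers):
    """Nodes are (k-1)-mers; each k-mer is an edge prefix -> suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    nodes = set(graph) | set(in_deg)
    # Start where out-degree exceeds in-degree by one, if such a node exists.
    start = next((u for u in sorted(nodes)
                  if len(graph.get(u, [])) - in_deg.get(u, 0) == 1),
                 min(nodes))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())   # follow an unused edge
        else:
            path.append(stack.pop())           # dead end: emit node
    return path[::-1]

def spell(path):
    """Concatenate the node labels along a path into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])

kmers = ["AGC", "ATC", "ATT", "CAG", "CAT", "GCA", "TCA", "TTC"]
print(spell(eulerian_path(de_bruijn_graph(kmers))))
```

This graph admits more than one Eulerian path, so the printed string is only one of several sequences with the same 3-spectrum, which is the ambiguity the following slides address.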
De Bruijn Graphs
Eulerian Path Approach to DNA Fragment Assembly
• Ultimately, converts an NP-complete Hamilton Path Problem into a simplified Eulerian Path Problem through construction of a de Bruijn graph
• The number of ways to reconstruct the graph is equivalent to the number of paths which follow the respective directions and travel through all edges
• The resulting problem is that there are a number of different Eulerian Paths through this graph, and we cannot tell which would resemble the original path
Eulerian Superpath Problem
• Eulerian Superpath Problem – Given an Eulerian Graph and a collection of paths on this graph, find an Eulerian path in this graph that contains all these paths as subpaths.
• The original Eulerian Path Problem is a case of the Eulerian Superpath Problem, in which every path is a single edge.
Solving: Take graph G and the system of paths P, and transform these to a new graph G1 and a new system P1. With the goal in mind that there is a one-to-one correspondence (equivalence) between (G,P) and (G1,P1), we go on to make a series of these transformations.
(G,P) → (G1,P1) → (G2,P2) →…→ (Gk,Pk)
All these transformations should lead to a system Pk in which every path is represented by one edge. Since every transformation from beginning to end preserves equivalence, every solution of the EPP in (Gk,Pk) will provide a solution to the ESPP in (G,P).
An x,y-detachment (case without multiple edges): let x = (vin,vmid) and y = (vmid,vout) be two consecutive edges in G, and let Px,y be all paths from P that include x,y as a subpath.
P→x is the collection of paths from P that end with x, and Py→ is the collection of paths from P that start with y.
Add a new edge z = (vin,vout) and delete the edges x and y.
We can substitute z for x,y in all paths from Px,y, for x in all paths from P→x, and for y in all paths from Py→. Repeating such detachments reduces an ESPP to an EPP.
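A single x,y-detachment, as described on this slide, might be sketched like this. The data layout is an assumption made for illustration: edges are a dict from edge name to (tail, head), paths are lists of edge names, and only the simple case without multiple edges is handled. It does not check the consistency conditions a full implementation would need.

```python
def xy_detachment(edges, paths, x, y, z):
    """Replace consecutive edges x, y by one new edge z and rewrite the paths.

    edges: dict edge_name -> (tail_vertex, head_vertex)   (assumed layout)
    paths: list of paths, each a list of edge names
    """
    v_in, v_mid = edges[x]
    v_mid_y, v_out = edges[y]
    if v_mid != v_mid_y:
        raise ValueError("x and y are not consecutive edges")
    new_edges = {e: ends for e, ends in edges.items() if e not in (x, y)}
    new_edges[z] = (v_in, v_out)

    new_paths = []
    for path in paths:
        out, i = [], 0
        while i < len(path):
            if path[i] == x and i + 1 < len(path) and path[i + 1] == y:
                out.append(z)          # path contains x,y as a subpath
                i += 2
            elif path[i] == x and i == len(path) - 1:
                out.append(z)          # path ends with x
                i += 1
            elif path[i] == y and i == 0:
                out.append(z)          # path starts with y
                i += 1
            else:
                out.append(path[i])
                i += 1
        new_paths.append(out)
    return new_edges, new_paths

edges = {"a": (1, 2), "x": (2, 3), "y": (3, 4), "b": (4, 5)}
paths = [["a", "x"], ["x", "y"], ["y", "b"]]
print(xy_detachment(edges, paths, "x", "y", "z"))
```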
• Elegant way of representing the problem.
• Very fast execution.
• Error correction can be handled in the graph.
• De Bruijn graph size can be huge.
– ~200GB for human genomes.
• Does not use pair information in initial phase, resulting in overly complicated graphs.
De Bruijn Graphs
Repeats
• Repeats in the sequence
– Assembly programs should detect repeats in the assembly process and not after.
• Incorrect genome reconstruction
– Assemblers should try to resolve correctly as many repeats as possible.
• Detecting repeats
– Euler assembly program
• Finds repeats by complex parts of the graph constructed during the assembly process.
• Researchers look into these complex areas to try and resolve repeats.
• Assemblers can use clone mate (paired end) information to find incorrect assemblies. This is based on finding clone-mate pairs too close or too far from one another.
Repeats
ASSEMBLY OF READS WITH ERRORS
• Errors in read data greatly complicate the task of fragment assembly.
• Error correction is performed prior to assembly by solving the error correction problem.
Resequencing(mutation discovery/genotyping)
• A lot of current sequencing effort is spent on re-sequencing genomes of known species
– Individual humans (1000 Genomes Project)
– Experimental organisms – looking for genetic variation, copy number variation
• Challenge is to (quickly) align millions of sequence reads to a reference genome with some % of mismatches
• Challenge to accurately call SNPs and indels
• Problems with repeated sequences – both tandem and dispersed repeats
New alignment programs are needed to map short reads from next-generation sequencing technologies to a reference genome.
New Challenge
Given a set of reads R, for each read r ∈ R, find its target regions on the reference genome G, such that for each target region t there are at most k mismatches between r and t.
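The problem statement can be made concrete with a brute-force sketch. It is illustrative only: mismatches are counted as Hamming distance (no indels), every genome position is scanned for every read, and the function names are assumptions. Real aligners use seeds or a BWT index precisely to avoid this exhaustive scan.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def map_reads(reads, genome, k):
    """For each read, list every genome position with at most k mismatches."""
    result = {}
    for r in reads:
        result[r] = [pos for pos in range(len(genome) - len(r) + 1)
                     if hamming(r, genome[pos:pos + len(r)]) <= k]
    return result

print(map_reads(["ACGT", "ACGA"], "ACGTACGTTT", k=1))
```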
The reads mapping problem
Aligner algorithms can be divided into two categories:
Seeded alignments algorithms (BLAST like)
Burrows-Wheeler transform based algorithms
Aligner algorithms
BLAST is the most popular tool.
Requires a query sequence to search for, and a sequence to search against.
Step 1: Make a k-letter word list of the query sequence.
Step 2: List the possible matching words.
Step 3: Extend the match to find the high similarity pair.
TAGGACCTAACC
GACCACCTTTT
Seed alignment algorithm
Find seeded matches of 11 base pairs
Extend each match to right and left, until the scores drop too much, to form an alignment
Report all local alignments
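A seed-and-extend pass along these lines might look as follows. This is a simplified sketch, not BLAST itself: seeds are exact k-mer matches, extension is ungapped, and the scoring (+1 match, -1 mismatch) and the drop-off threshold `x_drop` are assumptions chosen for the illustration.

```python
from collections import defaultdict

def extend(query, target, i, j, step, match=1, mismatch=-1, x_drop=3):
    """Ungapped extension from (i, j) in direction step (+1 or -1).
    Stops when the score falls more than x_drop below the best seen.
    Returns (best score, number of bases kept)."""
    score = best = best_len = n = 0
    while 0 <= i < len(query) and 0 <= j < len(target):
        score += match if query[i] == target[j] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score > x_drop:
            break
        i += step
        j += step
    return best, best_len

def seed_and_extend(query, target, k=11):
    """Return sorted (query_start, target_start, length, score) tuples."""
    index = defaultdict(list)            # k-mer -> positions in target
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    hits = set()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            l_score, l_len = extend(query, target, i - 1, j - 1, -1)
            r_score, r_len = extend(query, target, i + k, j + k, +1)
            hits.add((i - l_len, j - l_len,
                      l_len + k + r_len, k + l_score + r_score))
    return sorted(hits)

q = "AGCGATGTCACGCGCCCGTATTTCCGTA"
t = "TCGGATCTCACGCGCCCGGCTTACCGTG"
print(seed_and_extend(q, t, k=11))
```

The two sequences are the pair used in the example on this slide; they share exactly one run of 11 consecutive matches on the main diagonal, which serves as the seed.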
Example:
AGCGATGTCACGCGCCCGTATTTCCGTA
   ||| |||||||||||  || ||||
TCGGATCTCACGCGCCCGGCTTACCGTG
0001110111111111110011011110   (1 = match, 0 = mismatch)
Blast algorithm
Spaced Seed: nonconsecutive matches and optimized match positions.
Represent BLAST seed by 11111111111
Spaced seed: 111010010100110111
1 means a required match
0 means “don’t care” position
The length of the seed is the string length, and the weight of the seed is the number of 1s in the string.
This seemingly simple change makes a huge difference: significantly increases hit to homologous region while reducing bad hits.
Spaced seed
Multiple simultaneous seeds are defined as a set of seeds.
∏ = {seed1, seed2, …, seedi, …, seedn}
∏ detects a similarity if at least one of the component seeds detects the similarity
Example: simultaneous seeds {1101, 1011} detect similarities 100110100001, 1000010110001, 1101001011001
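Seed detection on a 0/1 similarity string can be sketched in a few lines. The helper names are hypothetical; a similarity is represented as a string of 1s (match) and 0s (mismatch), as in the examples on these slides, and a seed hits at an offset when every position marked 1 in the seed lines up with a 1 in the similarity.

```python
def match_string(a, b):
    """0/1 string marking matches between two aligned sequences."""
    return "".join("1" if x == y else "0" for x, y in zip(a, b))

def seed_hits(similarity, seed):
    """Offsets at which every required (1) position of the seed sees a match."""
    required = [i for i, c in enumerate(seed) if c == "1"]
    return [s for s in range(len(similarity) - len(seed) + 1)
            if all(similarity[s + i] == "1" for i in required)]

def detects(similarity, seeds):
    """A seed set detects a similarity if at least one seed hits."""
    return any(seed_hits(similarity, seed) for seed in seeds)

m = match_string("AGCGATGTCACGCGCCCGTATTTCCGTA",
                 "TCGGATCTCACGCGCCCGGCTTACCGTG")
print(m)
print(seed_hits(m, "1" * 11), seed_hits(m, "111010010100110111"))
print([detects(s, ["1101", "1011"])
       for s in ["100110100001", "1000010110001", "1101001011001"]])
```

The spaced seed has the same weight (eleven 1s) as the consecutive BLAST seed but spans 18 positions, so it tolerates mismatches at its "don't care" positions.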
Multiple simultaneous seeds
The prefix trie for string X is a tree where each edge is labeled with a symbol and the string concatenation of the edge symbols on the path from a leaf to the root gives a unique prefix of X.
On the prefix trie, the string concatenation of the edge symbols from a node to the root gives a unique substring of X .
The prefix trie of X is identical to the suffix trie of reverse of X and therefore suffix trie theories can also be applied to prefix trie
Let ∑ be an alphabet. Symbol $ is not present in ∑ and is lexicographically smaller than all the symbols in ∑.
A string X = a0a1…an−1 always ends with symbol $ (i.e. an−1 = $).
Suffix array S of X is a permutation of the integers 0…n−1 such that S(i) is the start position of the i-th smallest suffix.
To compute S(·), string X is cyclically rotated to generate n strings, which are then lexicographically sorted.
After sorting, the positions of the first symbols form the suffix array.
BWT(X) is the last column of the sorted matrix.
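The rotate-and-sort construction can be sketched directly, here on the string googol$ (the example word used later in these slides). This is the naive method with quadratic memory, meant only to illustrate the definitions; it relies on `$` sorting before letters, which holds for ASCII. The function name is an assumption.

```python
def suffix_array_and_bwt(x):
    """Build the suffix array and BWT of x by sorting all rotations.

    x must end with a unique '$' that sorts before every other symbol.
    """
    n = len(x)
    rows = sorted((x[i:] + x[:i], i) for i in range(n))   # sorted rotation matrix
    sa = [i for _, i in rows]                # start positions of first symbols
    bwt = "".join(rot[-1] for rot, _ in rows)  # last column of the matrix
    return sa, bwt

print(suffix_array_and_bwt("googol$"))
```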
Most algorithms for constructing a suffix array require at least n log2 n bits of working space, which amounts to 12GB for the human genome.
Recently, Hon et al. (2007) gave a new algorithm that uses n bits of working space and only requires <1GB memory at peak time for constructing the BWT of the human genome.
If string W is a substring of X, the position of each occurrence of W in X will occur in an interval in the suffix array.
Based on this observation, we define:
R(W) = min{k : W is the prefix of X_S(k)}
R'(W) = max{k : W is the prefix of X_S(k)}
(X_i = X[i, n−1] is a suffix of X.) In particular, if W is an empty string, R(W) = 1 and R'(W) = n−1.
The interval [R(W), R'(W)] is called the SA interval of W, and the set of positions of all occurrences of W in X is
{S(k) : R(W) ≤ k ≤ R'(W)}
For example, the SA interval of string ‘go’ is [1,2]. The suffix array values in this interval are 3 and 0, which give the positions of all the occurrences of ‘go’ in “googol”.
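The SA interval definition can be checked with a short sketch. It builds the suffix array by plain sorting and finds the interval by a linear scan, which is far less efficient than the backward search used by BWT-based aligners; the function names are assumptions.

```python
def suffix_array(x):
    """Start positions of the suffixes of x in lexicographic order."""
    return sorted(range(len(x)), key=lambda i: x[i:])

def sa_interval(x, sa, w):
    """(R(W), R'(W)): first and last rank k whose suffix starts with w,
    or None if w does not occur in x."""
    ranks = [k for k, start in enumerate(sa) if x.startswith(w, start)]
    return (ranks[0], ranks[-1]) if ranks else None

x = "googol$"
sa = suffix_array(x)
lo, hi = sa_interval(x, sa, "go")
print((lo, hi), [sa[k] for k in range(lo, hi + 1)])
```

Because suffixes sharing a prefix are adjacent in sorted order, the matching ranks always form one contiguous interval.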
Knowing the intervals in suffix array we can get the positions.
Therefore, sequence alignment is equivalent to searching for the SA intervals of substrings of X that match the query.
For the exact matching problem, we can find only one such interval
We can compute SA intervals for all nodes in the trie, so mapping each read is equivalent to searching the trie.