Meraculous: De Novo Genome Assembly with Short Paired-End Reads
Jarrod A. Chapman, Isaac Ho, Sirisha Sunkara, Shujun Luo, Gary P. Schroth, Daniel S. Rokhsar
Joint Genome Institute, Illumina Inc., University of California, Berkeley
Presenter: Priyanka Ghosh
De Novo Genome Assembly
1) Take 3 copies of the same DNA:
2) The sequence is broken down to generate reads:
3) Reconstruct the original DNA sequence from the read set
Short Paired-End Reads
Sequence both ends of each DNA fragment.
Why?
1) Generates high-quality, alignable sequence data
2) Likely to align better with a reference genome
Why use shorter reads?
• Readily aligned to a reference
• Error rates are low
• Lower cost per base, higher throughput
• Utilized in short-read assemblers like:
– Velvet
– ABySS
– SOAPdenovo
These take advantage of the de Bruijn graph representation of the assembly problem, using a k-mer approach
Genome Assembly vocabulary
• k-mer : a DNA sequence of bases with length k
• De Bruijn graph of k-mers: a graph representing overlaps between k-mers (consecutive k-mers overlap by k-1 bases)
• Contig: a contiguous sequence of DNA formed by combining k-mers
• Scaffold: one or more contigs linked together by unknown sequence
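The vocabulary above can be made concrete with a small sketch. It is a simplified illustration, not the Meraculous implementation: the function names `kmers` and `de_bruijn_edges` are hypothetical, the reads are toy data, and reverse-complement strands are ignored.

```python
from collections import defaultdict

def kmers(seq, k):
    """Return all overlapping substrings of length k, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def de_bruijn_edges(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read.

    Two consecutive k-mers share k-1 bases, which is the overlap
    the de Bruijn graph represents.
    """
    graph = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        for a, b in zip(ks, ks[1:]):
            graph[a].add(b)
    return dict(graph)

# Toy reads (assumed data) that overlap on the k-mer "GTAC".
reads = ["ACGTAC", "GTACGG"]
graph = de_bruijn_edges(reads, 4)
for kmer, successors in graph.items():
    print(kmer, "->", sorted(successors))
```

Following the single path through this toy graph spells out one contig, ACGTACGG, which neither read contains on its own.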
Meraculous - Contributions
• De novo assembler developed at Joint Genome Institute
• Efficient and conservative traversal of a subgraph of the de Bruijn graph (DBG)
• Takes into consideration unique (single-base) high-quality extensions in the dataset
• Avoids explicit error correction step
• Incorporates a novel low-memory hash structure to access the DBG, giving a small memory footprint compared with other short-read assemblers
Steps for genome assembly using Meraculous:
• Reads: may contain errors
• K-mers: reads are fragmented further into k-mers, to possibly exclude errors
• Contigs: construct & traverse the de Bruijn graph of k-mers to generate contigs
• Scaffolds: use read information to link contigs and generate scaffolds
Slide courtesy: Evangelos Georganas, SC14 paper “Parallel De Bruijn Graph Construction and Traversal for De Novo Genome Assembly”
Test input
• Pichia stipitis CBS 6054 : predominantly haploid yeast genome (15.4 Mbp)
• Statistics: dataset of 3 lanes of 75 bp paired-end shotgun reads
Assembly Algorithm
1) Selection of the k-mer set (Generation of k-mers)
2) Production of maximal linear sub-paths of DBG (Generation of Contigs)
3) Identify read-pair information used to produce scaffolds by ordering and orienting contigs (Generation of Scaffolds)
4) Close gaps contained within scaffolds - with reads projected to lie within the gaps
Generation of sub-graph of DBG
• Select an odd integer k (= 41) such that:
– a) a large fraction of the sequence targeted for assembly is unique as k-mers
– b) reads have multiple overlapping error-free k-mers
• Threshold multiplicity dmin (= 10): compared against the # of occurrences of each k-mer in the dataset
• k-mers with multiplicity > dmin: likely to be error-free and to occur in the genome (part of the k-mer set)
• k-mers with multiplicity < dmin: likely to contain sequencing errors
• For each k-mer, count all single-base extensions (forward & backward) such that the next/previous base has quality Q >= Qmin
• Single-base extensions with Q >= Qmin = “high-quality extensions”
• Each end of a k-mer is marked X, U, or F, depending on whether it has 0, 1, or >= 2 distinct high-quality extensions
– “X”: no high-quality extension
– “U”: unique high-quality extension
– “F”: fork in the DBG
(Figure: a “U-U k-mer” and a “U-U contig”)
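The X/U/F marking described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Meraculous code: `classify_kmer_ends` is a hypothetical name, a single observation with Q >= Qmin is enough to count an extension, k-mers seen at least `d_min` times are kept (the slide leaves the boundary case open), and reverse complements are ignored.

```python
from collections import Counter, defaultdict

def classify_kmer_ends(reads, k, d_min, q_min):
    """reads: list of (sequence, per-base qualities).

    Returns {kmer: (left_code, right_code)} with codes X, U or F.
    """
    counts = Counter()
    left = defaultdict(set)    # high-quality bases seen just before the k-mer
    right = defaultdict(set)   # high-quality bases seen just after the k-mer
    for seq, qual in reads:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            counts[kmer] += 1
            if i > 0 and qual[i - 1] >= q_min:
                left[kmer].add(seq[i - 1])
            if i + k < len(seq) and qual[i + k] >= q_min:
                right[kmer].add(seq[i + k])

    def code(bases):
        # 0 distinct extensions -> X, exactly 1 -> U, 2 or more -> F
        return "X" if not bases else ("U" if len(bases) == 1 else "F")

    # Assumption: keep k-mers observed at least d_min times.
    return {kmer: (code(left[kmer]), code(right[kmer]))
            for kmer, n in counts.items() if n >= d_min}

# Toy reads (assumed data); the last base of the third read is low quality.
toy_reads = [
    ("ACGTA", [30, 30, 30, 30, 30]),
    ("ACGTC", [30, 30, 30, 30, 30]),
    ("ACGTA", [30, 30, 30, 30, 5]),
]
codes = classify_kmer_ends(toy_reads, k=3, d_min=2, q_min=20)
print(codes)
```

In this toy input CGT is followed by both A and C at high quality, so its right end is a fork, while GTC is seen only once and is dropped by the multiplicity threshold.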
Assembly Algorithm Contd.
• “U-U” contig: subgraph of the DBG consisting of a linear chain of U-U k-mers (extensions must be unique)
• Omit [F] k-mers: they are candidates for contig boundaries
• Error correction is made implicitly by adhering to the dmin and Qmin constraints
• # of U-U contigs of the DBG depends on the choice of dmin
• dmin high = U-U contigs likely to terminate at “X”
• dmin low = U-U contigs likely to terminate at “F”
• Next, contigs (>= 100 bp for Pichia) are linked into scaffolds using paired-end links to jump over unassembled repetitive regions, leaving gaps
• Finally, intra-scaffold gaps are closed using reads whose mate-pairs constrain them to lie within a gap
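The chaining of U-U k-mers into contigs can be sketched as below. This is an illustrative simplification, not the Meraculous traversal: `uu_contigs` is a hypothetical name, the input is assumed to map each U-U k-mer to its unique left and right extension bases, only one strand is considered, and a link is followed only when both k-mers agree on it.

```python
def uu_contigs(uu):
    """uu: {kmer: (left_base, right_base)} for U-U k-mers only.

    Returns the contigs spelled by maximal linear chains of U-U k-mers.
    """
    def right_neighbor(kmer):
        nxt = kmer[1:] + uu[kmer][1]
        # Follow the link only if the neighbor is U-U and points back at us.
        return nxt if nxt in uu and uu[nxt][0] == kmer[0] else None

    def left_neighbor(kmer):
        prv = uu[kmer][0] + kmer[:-1]
        return prv if prv in uu and uu[prv][1] == kmer[-1] else None

    visited, contigs = set(), []
    for seed in uu:
        if seed in visited:
            continue
        # Rewind to the leftmost k-mer of the chain (guarding against cycles).
        start, path = seed, {seed}
        while True:
            prev = left_neighbor(start)
            if prev is None or prev in path:
                break
            start = prev
            path.add(prev)
        # Walk right, appending one base per k-mer.
        contig, cur = start, start
        visited.add(start)
        while True:
            nxt = right_neighbor(cur)
            if nxt is None or nxt in visited:
                break
            contig += nxt[-1]
            visited.add(nxt)
            cur = nxt
        contigs.append(contig)
    return contigs

# Toy table (assumed data), k = 3: three chained k-mers and one isolated k-mer.
toy_uu = {
    "ACG": ("T", "T"),
    "CGT": ("A", "A"),
    "GTA": ("C", "C"),
    "TTG": ("C", "C"),
}
print(uu_contigs(toy_uu))
```

The chain stops at GTA because its extension TAC is not a U-U k-mer in the table, which mirrors contigs terminating at an X or F end.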
Lightweight Hash implementation – Contig generation phase
• Reduces memory needed to store and randomly access the DBG
• Utilizes a recursive collision strategy with multiple hash functions to avoid explicitly storing the keys
• Key-value pair: key = U-U k-mer , value= 2 letter code [ACGT][ACGT]
• Algorithm: primed to expose all k-mers (assume hash functions h0, h1, …, hn are already defined)
– 1. Initialize hash depth d to 0, write all keys to file Fd
– 2. For all keys in Fd, evaluate hash function hd. Update a “primer object” Pd to keep track of the keys that collide under hd.
– 3. Write all colliding keys to file Fd+1, increment hash depth d
– 4. Repeat steps 2 and 3 until the number of colliding keys is 0
• All primers P0 … Pd are sent to the lightweight hash initializer to create the lightweight hash object
• K-mers are loaded with extension codes: each key-value pair added to this hash object checks the primer information to determine at which level of recursion to store the value; the key is discarded
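The priming and loading steps above can be sketched in memory as follows. Everything here is an assumption made for illustration rather than the Meraculous data structure: the class `LightweightHash`, the salted BLAKE2b hash family `h`, the choice of one slot per key at each depth, and the use of Python lists in place of the files Fd. The primer at each depth records which slots are hit by exactly one key.

```python
import hashlib

def h(depth, key, size):
    """Hypothetical hash family h_d: BLAKE2b salted with the depth."""
    digest = hashlib.blake2b(key.encode(), digest_size=8,
                             salt=depth.to_bytes(2, "big")).digest()
    return int.from_bytes(digest, "big") % size

class LightweightHash:
    def __init__(self, keys):
        self.primers = []   # primers[d][slot] is True where exactly one key lands
        self.values = []    # values[d][slot] holds the value stored at depth d
        level = sorted(set(keys))
        depth = 0
        while level:
            size = len(level)
            hits = [0] * size
            for key in level:
                hits[h(depth, key, size)] += 1
            self.primers.append([n == 1 for n in hits])
            self.values.append([None] * size)
            # Colliding keys move on to the next depth (the file F_{d+1}).
            level = [key for key in level if hits[h(depth, key, size)] > 1]
            depth += 1

    def _slot(self, key):
        for depth, primer in enumerate(self.primers):
            slot = h(depth, key, len(primer))
            if primer[slot]:
                return depth, slot
        raise KeyError(key)

    def put(self, key, value):
        depth, slot = self._slot(key)
        self.values[depth][slot] = value   # the key itself is never stored

    def get(self, key):
        depth, slot = self._slot(key)
        return self.values[depth][slot]

# Toy U-U k-mers with two-letter extension codes (assumed data).
codes = {"ACG": "TT", "CGT": "AA", "GTA": "CC", "TTG": "CG"}
table = LightweightHash(codes)
for kmer, code in codes.items():
    table.put(kmer, code)
print(table.get("CGT"))
```

Because keys are discarded, a lookup is only meaningful for a key from the original set; an unknown key may silently return another key's value. With one slot per key at each depth, roughly a fraction 1/e of the keys land alone, so the total number of slots approaches e times the number of keys, which would be consistent with an e × G memory estimate.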
Results - I
Advantages w.r.t memory using Lightweight hash:
Conventional hash:
• # of distinct keys comparable to genome size (G)
• Memory required to naively store the hash = 2G × (k+1) bits (with the majority of the cost associated with storing the keys)
Example: human genome, G = 3 × 10^9, with k = 75
Total memory required = 450 GB
Lightweight hash:
• Applied if the complete set of keys is known initially and does not change
• Average lookup time does not depend on genome size
• Hash requires e × G bytes of memory (independent of k; e = 2.71828)
Example: human genome, G = 3 × 10^9
Total memory required = 8 GB
Approx. 60-fold memory savings!
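The two estimates above can be re-derived with a few lines of arithmetic. The sketch evaluates the formulas exactly as written on the slide; the decimal conversion (10^9 per giga) and the variable names are assumptions. The conventional formula evaluates to about 456 × 10^9 in its own unit (bits), which is the figure that matches the quoted 450 and the roughly 60-fold ratio; converted to bytes, the same quantity is about 57 GB.

```python
import math

G = 3e9   # human genome size in bases, from the slide
k = 75    # k-mer length, from the slide

# Conventional hash: 2G(k+1) bits, as written on the slide.
conventional_bits = 2 * G * (k + 1)
conventional_gbit = conventional_bits / 1e9        # value in gigabits
conventional_gbyte = conventional_bits / 8 / 1e9   # same quantity in gigabytes

# Lightweight hash: e * G bytes, as written on the slide.
lightweight_gbyte = math.e * G / 1e9

print(f"conventional: {conventional_gbit:.0f} Gbit = {conventional_gbyte:.0f} GB")
print(f"lightweight:  {lightweight_gbyte:.2f} GB")
print(f"ratio of the raw figures: {conventional_gbit / lightweight_gbyte:.0f}")
```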
Results - II
• Benchmarking Meraculous against other short-read assemblers on an E. coli K-12 MG1655 dataset of 10.4 million pairs of 36-bp reads
• Finished reference sequence: approx. 4.64 Mbp genome
• Short-read dataset represents a nominal ~160X shotgun coverage (10.4 million pairs × 2 reads × 36 bp ≈ 749 Mbp; 749 / 4.64 ≈ 161)
• k = 21, dmin = 9
• Meraculous assembled 97.8% of the genome into contigs ranging from 200 bp to 175 kbp, with no assembly errors
Also to be noted …
• Accuracy of Pichia genome: Meraculous reconstructs 95% of the genome in long contigs (N50=101kbp) and scaffolds (N50=269kbp)
• Most steps of Meraculous assembly pipeline are parallelized, by partitioning reads (or k-mers) among processors
• 2 steps NOT parallelized are:
– a) construction of the U-U subgraph (memory intensive)
– b) scaffolding step
• Limitation: current Meraculous algorithm assumes data from a haploid genome
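The N50 values quoted above summarize contig and scaffold lengths. A minimal sketch of the usual definition follows (the largest length L such that sequences of length at least L contain half of the assembled bases); the function name `n50` and the toy lengths are assumptions.

```python
def n50(lengths):
    """Length at which the running total, longest first, reaches half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

# Toy contig lengths (assumed data): total 300, half is reached at the 80-base contig.
print(n50([100, 80, 60, 40, 20]))
```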
Summary
• Meraculous, a new short-read assembler, produces high-quality, near-complete de novo assemblies of small fungal genomes
• Does not construct the full DBG. Instead limited to “U-U” subgraph, which includes k-mers that possess unique high-quality extensions at each end – thereby removing most error-containing k-mers
• U-U subgraph produced with a memory footprint that scales linearly with the genome size
• Meraculous avoids explicit error correction by identifying outliers and disregarding them in a robust manner, without degrading the assembly
• Gap-filling allows residual errors to be corrected: the initial set of contigs (from the DBG approach) is combined with reads, using mate-pairs to link contigs and fill the gaps between them
Thank you for listening !!!
Questions ??