Meraculous: De Novo Genome Assembly with Short Paired-End Reads

Jarrod A. Chapman, Isaac Ho, Sirisha Sunkara, Shujun Luo, Gary P. Schroth, Daniel S. Rokhsar

Joint Genome Institute, Illumina Inc., University of California, Berkeley

Presenter: Priyanka Ghosh

De Novo Genome Assembly

1) Take 3 copies of the same DNA:

2) Sequence is broken down to generate Reads:

3) Reconstruct original DNA sequence from read set

Short Paired-End Reads

Sequence both ends of each DNA fragment.

Why?

1) Generates high-quality, alignable sequence data

2) Likely to align better with the reference genome

Why use shorter reads?

• Readily aligned to a reference

• Error rates are low

• Lower cost per base, higher throughput

• Utilized in short-read assemblers like:

– Velvet

– ABySS

– SOAPdenovo

These assemblers take advantage of the de Bruijn graph representation of the assembly problem, using a k-mer approach

Genome Assembly vocabulary

• k-mer : a DNA sequence of bases with length k

• De Bruijn graph of k-mers: a graph representing overlaps between k-mers

• Contig: a contiguous sequence of DNA formed by combining k-mers

• Scaffold: one or more contigs linked together by unknown sequence
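To make the first two vocabulary items concrete, here is a minimal sketch that lists the k-mers of a read and records which k-mers follow each other with a (k-1)-base overlap. The function names `kmers` and `de_bruijn_edges` are illustrative, and the sketch ignores reverse complements and sequencing errors, which a real assembler has to handle.

```python
from collections import defaultdict

def kmers(seq, k):
    """Return every length-k substring of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def de_bruijn_edges(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read.

    Consecutive k-mers in a read overlap by k-1 bases, which is the
    overlap relation the de Bruijn graph represents.
    """
    graph = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        for a, b in zip(ks, ks[1:]):
            graph[a].add(b)
    return dict(graph)

# Two toy reads that share the k-mer GTAC
reads = ["ACGTAC", "GTACGG"]
graph = de_bruijn_edges(reads, 4)
```

Following the edges from ACGT spells out ACGTACGG, which is how overlapping reads merge into a single contig.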

Meraculous - Contributions

• De novo assembler developed at Joint Genome Institute

• Efficient and conservative traversal of a subgraph of the de Bruijn graph (DBG)

• Takes into consideration unique (single-base) high-quality extensions in the dataset

• Avoids explicit error correction step

• Incorporates a novel low-memory hash structure to access the DBG, giving a small memory footprint compared with other short-read assemblers

Steps for genome assembly using Meraculous:

Reads: may contain errors

K-mers: reads fragmented further into k-mers, to possibly exclude errors

Contigs: construct & traverse de Bruijn graph of k-mers, generate contigs

Scaffolds: use read information to link contigs and generate scaffolds

Slide courtesy: Evangelos Georganas, SC14 paper “Parallel De Bruijn Graph Construction and Traversal for de novo Genome Assembly”

Test input

• Pichia stipitis CBS 6054 : predominantly haploid yeast genome (15.4 Mbp)

• Statistics: dataset of 3 lanes of 75 bp paired-end shotgun reads

Assembly Algorithm

1) Selection of k-mer set (Generation of k-mers)

2) Production of maximal linear sub-paths of DBG (Generation of Contigs)

3) Identify read-pair information used to produce scaffolds by ordering and orienting contigs (Generation of Scaffolds)

4) Close gaps contained within scaffolds, using reads projected to lie within the gaps

Generation of sub-graph of DBG

• Select an odd integer k (=41) such that: a) a desired fraction of the targeted sequence for assembly is unique as k-mers, b) reads have multiple overlapping error-free k-mers

• Threshold multiplicity dmin (=10): minimum # of occurrences of a k-mer in the dataset for it to be kept

• k-mers with count >= dmin: likely to be error-free and to occur in the genome (part of the k-mer set)

• k-mers with count < dmin: likely to contain sequence errors

• For each k-mer, count all single-base extensions (forward & backward) such that the next/previous base has quality Q >= Qmin

• Single-base extensions with Q >= Qmin = “high-quality extensions”

• Each end of a k-mer is marked X, U, or F, depending on whether it has 0, 1, or >= 2 distinct high-quality extensions

“X” : no high-quality extension

“U” : unique high-quality extension

“F” : fork in the DBG

“U-U k-mer” “U-U contig”
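The k-mer selection and end-marking steps above can be sketched as follows. This is a simplified illustration, not the Meraculous code: the name `classify_kmers` and the input layout (a sequence plus one integer quality per base) are assumptions, a single high-quality observation is enough to count as an extension, and reverse complements are ignored.

```python
from collections import Counter, defaultdict

def classify_kmers(reads, k, dmin, qmin):
    """Return {kmer: (left_code, right_code)} for k-mers seen >= dmin times.

    reads is an iterable of (sequence, qualities) pairs, with one integer
    quality per base. Codes are "X" (no high-quality extension),
    "U" (exactly one) and "F" (two or more, a fork).
    """
    counts = Counter()
    left = defaultdict(set)    # k-mer -> distinct high-quality preceding bases
    right = defaultdict(set)   # k-mer -> distinct high-quality following bases
    for seq, qual in reads:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            counts[kmer] += 1
            if i > 0 and qual[i - 1] >= qmin:
                left[kmer].add(seq[i - 1])
            if i + k < len(seq) and qual[i + k] >= qmin:
                right[kmer].add(seq[i + k])

    def code(bases):
        if len(bases) == 0:
            return "X"
        return "U" if len(bases) == 1 else "F"

    return {
        kmer: (code(left.get(kmer, ())), code(right.get(kmer, ())))
        for kmer, n in counts.items()
        if n >= dmin
    }

# Toy data: ACGT is followed by A and by C with good quality, so it forks.
reads = [
    ("ACGTA", [30] * 5),
    ("ACGTA", [30] * 5),
    ("ACGTC", [30] * 5),
    ("ACGTC", [30, 30, 30, 30, 5]),   # low-quality last base is ignored
    ("TTTTG", [30] * 5),              # seen once, so below dmin
]
ends = classify_kmers(reads, k=4, dmin=2, qmin=20)
```

In this toy input ACGT is marked X on the left and F on the right, while CGTA and CGTC are each U on the left and X on the right; the k-mers from the single TTTTG read are dropped by the dmin threshold.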

Assembly Algorithm Contd ….

• ‘U-U’ contig: subgraph of the DBG comprising a linear chain of U-U k-mers (extensions must be unique)

• Omit [F] k-mers: candidates for contig boundaries

• Error correction made implicitly by adhering to dmin and Qmin constraints

• # of U-U contigs of the DBG depends on choice of dmin

• dmin high = U-U contigs likely to terminate at ‘X’

• dmin low = U-U contigs likely to terminate at ‘F’

• Next we link contigs (>= 100bp for Pichia) into scaffolds using paired-end links to jump over unassembled repetitive regions, leaving gaps

• Finally intra-scaffold gaps closed using reads whose mate-pairs constrain them to lie within a gap
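A U-U contig can be built by walking from a seed k-mer in both directions for as long as the neighbour is also a U-U k-mer and points back at the current one. The sketch below assumes a dictionary `uu` mapping each U-U k-mer to its (left base, right base) pair; the function name and the single-strand simplification are illustrative choices, not taken from the Meraculous source.

```python
def uu_contigs(uu):
    """Chain U-U k-mers into contigs.

    uu maps each U-U k-mer to (left_base, right_base), its unique
    high-quality extensions. A step is taken only if the neighbouring
    k-mer is in uu and its own extension points back at the current k-mer.
    """
    seen = set()
    contigs = []
    for seed in sorted(uu):
        if seed in seen:
            continue
        seen.add(seed)
        contig = seed

        cur = seed
        while True:                      # walk to the right
            nxt = cur[1:] + uu[cur][1]
            if nxt not in uu or nxt in seen or uu[nxt][0] != cur[0]:
                break
            seen.add(nxt)
            contig += nxt[-1]
            cur = nxt

        cur = seed
        while True:                      # walk to the left
            prv = uu[cur][0] + cur[:-1]
            if prv not in uu or prv in seen or uu[prv][1] != cur[-1]:
                break
            seen.add(prv)
            contig = prv[0] + contig
            cur = prv

        contigs.append(contig)
    return contigs

# U-U k-mers from the middle of the toy sequence TGACGTACC (k = 4);
# the end k-mers TGAC and TACC are absent, so the walk stops there.
uu = {
    "GACG": ("T", "T"),
    "ACGT": ("G", "A"),
    "CGTA": ("A", "C"),
    "GTAC": ("C", "C"),
}
contigs = uu_contigs(uu)
```

The four k-mers collapse into the single contig GACGTAC, and the walk terminates on both sides where the next k-mer is missing from the U-U set.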

Lightweight Hash implementation – Contig generation phase

• Reduces memory needed to store and randomly access the DBG

• Utilizes a recursive collision strategy with multiple hash functions to avoid explicitly storing the keys

• Key-value pair: key = U-U k-mer , value= 2 letter code [ACGT][ACGT]

• Algorithm: the hash is first primed by exposing it to all k-mers (assume hash functions h0, h1, …, hn are already defined)

– 1. Initialize hash depth ‘d’ to 0, write all keys to file Fd

– 2. For all keys in Fd, evaluate hash function hd. Update a ‘primer object’ Pd to keep track of the keys that collide under hd.

– 3. Write all colliding keys to file Fd+1, increment hash depth d

– 4. Repeat steps 2 and 3 until the number of colliding keys is 0

• All primers P0 … Pd are sent to the lightweight hash initializer to create the lightweight hash object

• k-mers are then loaded with their extension codes: for each key-value pair added, the hash object checks the primer information to determine at which level of recursion to store the value; the key is discarded
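The priming and loading steps above can be sketched in memory as follows. This is an illustration under stated assumptions rather than the authors' implementation: the class name `LightweightHash`, the salted SHA-256 hash family `h`, the choice of one table slot per pending key, and the use of Python lists instead of files Fd and packed primer bits are all made up for the example.

```python
import hashlib

def h(key, depth, size):
    """Illustrative hash family: h_depth(key), reduced to a table index."""
    digest = hashlib.sha256(f"{depth}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % size

class LightweightHash:
    """Key-free hash over a fixed, fully known key set."""

    def __init__(self, keys, max_depth=64):
        self.levels = []                 # (size, primer counts, values) per depth
        pending = sorted(set(keys))      # plays the role of file F_d
        depth = 0
        while pending:
            if depth == max_depth:
                raise RuntimeError("collisions not resolved within max_depth")
            size = len(pending)
            counts = [0] * size          # primer P_d: keys landing in each slot
            for key in pending:
                counts[h(key, depth, size)] += 1
            self.levels.append((size, counts, [None] * size))
            # Colliding keys move on to the next depth (file F_{d+1}).
            pending = [k for k in pending if counts[h(k, depth, size)] > 1]
            depth += 1

    def _slot(self, key):
        for depth, (size, counts, values) in enumerate(self.levels):
            i = h(key, depth, size)
            if counts[i] == 1:           # slot owned by a single key
                return values, i
            if counts[i] == 0:           # no primed key reached this slot
                break
        raise KeyError(key)

    def __setitem__(self, key, value):
        values, i = self._slot(key)
        values[i] = value                # the key itself is never stored

    def __getitem__(self, key):
        values, i = self._slot(key)
        return values[i]

# Keys are U-U k-mers, values are two-letter extension codes.
codes = {"ACGT": "GA", "CGTA": "AC", "GTAC": "CC", "GACG": "TT", "TACC": "GA"}
table = LightweightHash(codes)
for kmer, code in codes.items():
    table[kmer] = code
```

Because keys are discarded, looking up a k-mer that was not in the primed set can silently return another key's value, which is why the structure only suits a key set that is complete and fixed up front. With one slot per pending key, roughly a fraction 1/e of the keys resolve at each level, so the slot counts across levels sum to about e times the number of keys; this may be where the e*G bytes figure quoted later comes from, assuming one byte per slot.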

Results - I

Advantages w.r.t memory using Lightweight hash:

Conventional hash:

• # of distinct keys comparable to Genome (G) Size

• Memory required to naively store the hash = 2G * (k+1) bits (with majority cost associated with storing the keys)

Example: Human Genome G = 3 × 10^9, with k = 75

Total memory required = 450 GB

Lightweight hash:

• Applied if complete set of keys is known initially and does not change

• Average lookup time does not depend on Genome size

• Hash requires e*G bytes of memory (independent of k; e = 2.71828)

Example: Human Genome G = 3 × 10^9

Total memory required = 8 GB

Approx 60-fold memory savings ..!!!
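As a quick check, the two memory formulas can be evaluated for the human-genome example. The variable names are illustrative, and 1 GB is taken as 10^9 bytes.

```python
import math

G = 3e9    # human genome size in bases, as in the example
k = 75

conventional = 2 * G * (k + 1)                      # value of 2G * (k+1)
conventional_gb_if_bits = conventional / 8 / 1e9    # unit read as bits
conventional_gb_if_bytes = conventional / 1e9       # unit read as bytes
lightweight_gb = math.e * G / 1e9                   # e*G bytes
```

The lightweight hash comes to about 8.15 GB, matching the 8 GB quoted. The expression 2G * (k+1) evaluates to 4.56 × 10^11: read as bits that is 57 GB, and read as bytes it is 456 GB. The quoted 450 GB and the roughly 60-fold saving correspond to the byte reading, so the unit may be worth checking against the paper.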

Results - II

• Benchmarking Meraculous against other short-read assemblers on an E. coli K-12 MG1655 dataset of 10.4 million pairs of 36-bp reads

• Finished reference sequence: approx. 4.64 Mbp genome

• Short-read dataset represents a nominal ~160X shotgun coverage

• k=21, dmin=9

• Meraculous assembled 97.8% of the genome into contigs ranging from 200bp to 175kbp, with no assembly errors
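The nominal coverage figure follows from the read counts. The sketch below takes the E. coli K-12 MG1655 reference to be about 4.64 Mbp; that genome size is stated here as an input, not computed from the data.

```python
read_pairs = 10.4e6       # pairs of reads in the dataset
read_length = 36          # bases per read
genome_size = 4.64e6      # approximate E. coli K-12 MG1655 reference size (bp)

total_bases = read_pairs * 2 * read_length    # two reads per pair
coverage = total_bases / genome_size
```

This gives about 749 Mbp of sequence and a coverage of roughly 161X, in line with the nominal ~160X.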

Also to be noted …

• Accuracy of Pichia genome: Meraculous reconstructs 95% of the genome in long contigs (N50=101kbp) and scaffolds (N50=269kbp)

• Most steps of Meraculous assembly pipeline are parallelized, by partitioning reads (or k-mers) among processors

• 2 steps NOT parallelized are:

– a) construction of U-U subgraph (memory intensive)

– b) scaffolding step

• Limitation: current Meraculous algorithm assumes data from a haploid genome
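The N50 values quoted for the Pichia contigs and scaffolds summarize how long the assembled pieces are. A minimal sketch of one common definition is given below; the function name `n50` is illustrative, and other tools may define the statistic slightly differently (for example relative to the genome size instead of the assembly size).

```python
def n50(lengths):
    """Return the length L such that pieces of length >= L together
    contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")

# Toy contig lengths: total is 280, and the two longest (100 + 80)
# already cover more than half of it.
example = n50([100, 80, 50, 30, 20])
```

For the toy lengths the N50 is 80: half of the assembled bases sit in contigs at least that long.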

Summary

• Meraculous, a new short-read assembler, produces high-quality, near-complete de novo assemblies of small fungal genomes

• Does not construct the full DBG. Instead limited to “U-U” subgraph, which includes k-mers that possess unique high-quality extensions at each end – thereby removing most error-containing k-mers

• U-U subgraph produced with a memory footprint that scales linearly with the genome size

• Meraculous avoids explicit error correction by identifying outliers and disregarding them in a robust manner, without degrading the assembly

• Gap-filling allows residual errors to be corrected: the initial set of contigs (from the DBG approach) is combined with reads, using mate-pairs to link contigs and fill the gaps between them

Thank you for listening !!!

Questions ??
