ASM 2013 Metagenomic Assembly Workshop Slides

Adina Howe

Michigan State University, Adjunct

Argonne National Laboratory, Postdoc

ASM Workshop, May 2013

Visual Complexity http://www.flickr.com/photos/maisonbisson

MSU Lab: Titus Brown, Jim Tiedje, Jason Pell, Qingpeng Zhang, Jordan Fish, Eric McDonald, Chris Welcher, Aaron Garoutte, Jiarong Guo

Collaborators: Janet Jansson, Susannah Tringe

I will upload this on SlideShare (adinachuanghowe)

Khmer documentation

github.com/ged-lab/khmer/

https://khmer.readthedocs.org/en/latest/guide.html

Manuscripts

Scaling metagenome sequence assembly with probabilistic de Bruijn graphs

http://www.pnas.org/content/early/2012/07/25/1121464109

A reference-free algorithm for computational normalization of shotgun sequencing data

http://arxiv.org/abs/1203.4802

Assembling large, complex metagenomes

http://arxiv.org/abs/1212.2832

A few gotchas of sequencing:

Errors / Artifacts (confusion)

Diversity / Complexity (scale)

1. Digital normalization (lossy compression)

2. Partitioning

3. Enabling usage of current, previously unusable assembly tools

Reduces data for analysis

Longer sequences (increased accuracy of annotation)

Gene order

Does not rely on known references, access to unknowns

Creates new references

Lots of assembly tools available

But…

High memory requirements

Depends on good (~10x) sequencing coverage

“Coverage” is simply the average number of reads that overlap each true base in the genome.

Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.

Note that k-mer abundance is not properly represented here! Each blue k-mer will be present around 10 times.
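As a rough sketch of the coverage definition above, average coverage can be estimated as total sequenced bases divided by genome length. The function name and the read and genome sizes below are illustrative assumptions, not numbers from the workshop.

```python
def average_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of reads overlapping each base of the genome."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    # Total sequenced bases spread evenly over the genome.
    return num_reads * read_length / genome_length


# Hypothetical example: 500,000 reads of 100 bp from a 5 Mbp genome.
print(average_coverage(500_000, 100, 5_000_000))  # 10.0
```

This assumes reads are sampled uniformly; in a metagenome each member genome has its own coverage, which is the point of the dilution example later in the talk.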

Each single base error generates ~k new k-mers.

Generally, erroneous k-mers show up only once – errors are random.

Low-abundance peak (errors)

High-abundance peak (true k-mers)
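The two peaks can be reproduced on a toy example. The sketch below uses a made-up 20 bp sequence at 10x coverage, with one read carrying a single substitution. It ignores reverse complements and uses an exact Python `Counter`, so it is a simplification for illustration, not how khmer counts k-mers.

```python
from collections import Counter

K = 5
TRUE_SEQ = "ATGCCGTAAGCTTGACCTAG"  # made-up sequence; all of its 5-mers are distinct
ERROR_POS = 10                     # position of the simulated substitution (C -> A)


def kmers(seq, k):
    """Return every overlapping k-mer in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Ten reads covering the same region: nine correct, one with a single-base error.
bad_read = TRUE_SEQ[:ERROR_POS] + "A" + TRUE_SEQ[ERROR_POS + 1:]
reads = [TRUE_SEQ] * 9 + [bad_read]

counts = Counter(km for read in reads for km in kmers(read, K))
true_kmers = set(kmers(TRUE_SEQ, K))
error_kmers = [km for km in counts if km not in true_kmers]

# Abundance histogram: how many distinct k-mers occur at each count.
histogram = Counter(counts.values())

print(len(error_kmers))           # 5, i.e. k new k-mers from one error
print(sorted(histogram.items()))  # [(1, 5), (9, 5), (10, 11)]
```

The five erroneous k-mers each appear once (the low-abundance peak), while the true k-mers appear 9 or 10 times (the high-abundance peak). An error near the end of a read would produce fewer than k new k-mers, which is why the slide says "~k".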

Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!!

This 100x will consume disk space and, because of errors, memory.

We can discard it for you…
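The dilution arithmetic above can be written out as a small calculation. The function name and the dictionary layout are assumptions made for this sketch; the 10:1 ratio and the 10x target are the ones from the slide.

```python
def coverage_at_target(relative_abundance, target):
    """Coverage each member reaches once the rarest member hits `target`."""
    rarest = min(relative_abundance.values())
    # Coverage scales linearly with relative abundance.
    return {name: target * abundance / rarest
            for name, abundance in relative_abundance.items()}


# A is 10 times more abundant than B; we want 10x coverage of B.
print(coverage_at_target({"A": 10, "B": 1}, target=10))
# {'A': 100.0, 'B': 10.0}
```

Everything sequenced from A beyond the coverage an assembler needs is redundant, and that redundancy is what digital normalization discards.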

A digital analog to cDNA library normalization, diginorm:

Reference free.

Is single pass: looks at each read only once;

Does not “collect” the majority of errors;

Keeps all low-coverage reads;

Smooths out coverage of regions.
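The properties listed above follow from a simple rule: estimate each read's coverage as the median abundance of its k-mers among the reads kept so far, and keep the read only if that estimate is below a cutoff. Below is a minimal sketch of that idea. It uses an exact `Counter` and tiny made-up reads; khmer itself uses a memory-efficient probabilistic counting structure, so treat the function name, the parameters `k` and `cutoff`, and the data as illustrative assumptions rather than khmer's implementation.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every overlapping k-mer in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_by_median(reads, k=5, cutoff=3):
    """Single pass over reads; keep a read if its median k-mer count is below cutoff."""
    counts = Counter()  # k-mer counts from kept reads only
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k, nothing to estimate from
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
        # Otherwise the region is already covered well enough: discard the read.
    return kept


# Made-up data: one region sequenced 10 times, another only twice.
high = "ATGCCGTAAGCTTGACCTAG"
low = "GGATCCTTAACGCGTATACC"
kept = normalize_by_median([high] * 10 + [low] * 2)
print(kept.count(high), kept.count(low))  # 3 2
```

The high-coverage region is cut down to the cutoff while both low-coverage reads survive. Because discarded reads never add their k-mers to the counts, most erroneous k-mers in redundant reads are never collected.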

Digital normalization produces “good” metagenome assemblies.

Smooths out abundance variation, strain variation.

Reduces computational requirements for assembly.

It also kinda makes sense :)

Split reads into “bins” belonging to different source species.

Can do this based almost entirely on connectivity of sequences.

“Divide and conquer”

Memory-efficient implementation helps to scale assembly.

Pell et al., 2012, PNAS
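To illustrate what binning by connectivity means, the sketch below groups reads that share at least one k-mer, directly or through a chain of other reads, using a union-find structure. This is a toy stand-in under stated assumptions: the reads are made up, reverse complements are ignored, and the approach in Pell et al. traverses a probabilistic de Bruijn graph rather than an exact dictionary of k-mers.

```python
def kmers(seq, k):
    """Return the set of overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def partition(reads, k=5):
    """Group reads into partitions connected by shared k-mers."""
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for km in kmers(read, k):
            if km in first_seen:
                parent[find(idx)] = find(first_seen[km])  # merge the two partitions
            else:
                first_seen[km] = idx

    groups = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    return list(groups.values())


# Made-up overlapping reads from two unrelated sequences, interleaved.
reads = [
    "ATGCCGTAAGCT",  # source X
    "GGATCCTTAACG",  # source Y
    "GTAAGCTTGACC",  # source X, overlaps the first X read
    "TTAACGCGTATA",  # source Y, overlaps the first Y read
    "TTGACCTAG",     # source X, overlaps only the second X read
]
parts = partition(reads)
print(sorted(len(p) for p in parts))  # [2, 3]
```

Each partition can then be assembled separately, which is the divide-and-conquer step.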

Low coverage is the dominant problem blocking assembly of your soil metagenome.

In order to build assemblies, each assembler makes choices – uses heuristics – to reach a conclusion.

These heuristics may not be appropriate for your sample!

High polymorphism?

Mixed population vs clonal?

Genomic vs metagenomic vs mRNA?

Low coverage drives differences in assembly.

We can assemble virtually anything but soil ;).

Genomes, transcriptomes, MDA, mixtures, etc.

Repeat resolution will be fundamentally limited by sequencing technology (insert size; sampling depth).

Strain variation confuses assembly, but does not prevent useful results.

Diginorm is a systematic strategy to enable assembly.

Banfield has shown how to deconvolve strains at differential abundance.

Kostas K.’s results suggest that there will be a species gap sufficient to prevent contig misassembly.

Most metagenomes require 50-150 GB of RAM.

Many people don’t have access to computers of that size.

Amazon Web Services (aws.amazon.com) will happily rent you such computers for $1-2/hr.

http://ged.msu.edu/angus/2013-hmp-assembly-webinar/index.html

Optimizing our programs => faster.

Building an evaluation framework for metagenome assemblers.

Error correction!

Achieving one or more assemblies is fairly straightforward.

An assembly is a hypothesis, however, and evaluating assemblies is challenging; this is where you should be thinking hardest about assembly.

There are relatively few pipelines available for analyzing assembled metagenomic data.

Questions?

How do we study complexity? Interactions? Diversity? Communities? Evolution? Our environment?

Visual Complexity http://www.flickr.com/photos/maisonbisson

• Major efforts of data collection

• Open mind for discoveries

• Willingness to adjust to change

• Multiple efforts

• Well-designed experiments

Workshop example: Illumina deep sequencing and scaling large datasets on soil metagenomes

We receive Gb of sequences

Generally, my data is…

Split by barcodes

Untrimmed

Adapters are present

Two paired-end FASTQ files

Underestimation of computational requirements:

Quality control steps usually require 2-3 times the amount of hard drive space

Similarity comparison against known databases impractical (soil metagenome ~50 years to BLAST)

Home Alone Scream

My first slide graphic that I’m scared may date me.

Two ways to reduce the onslaught:

Cluster into known observances (annotate, bin)

Assembly

Some mix of the above

Ten of you upload 1 HiSeq flowcell into MG-RAST

Illumina short reads from soil metagenome (~100 bp)

454 short reads from soil metagenome (~368 bp)

Assembled contigs (Illumina) from soil metagenome (~491 bp)

Read length will increase… computational requirements?

Assembly is a great way to reduce data.
