asm 2013 metagenomic assembly workshop slides

Adina Howe Michigan State University, Adjunct Argonne National Laboratory, Postdoc ASM Workshop, May 2013 Visual Complexity http://www.flickr.com/photos/maisonbisson

Upload: adina-chuang-howe

Post on 10-May-2015

TRANSCRIPT

Page 1: ASM 2013 Metagenomic Assembly Workshop Slides

Adina Howe

Michigan State University, Adjunct

Argonne National Laboratory, Postdoc

ASM Workshop, May 2013

Visual Complexity http://www.flickr.com/photos/maisonbisson

Page 2: ASM 2013 Metagenomic Assembly Workshop Slides

MSU Lab: Titus Brown, Jim Tiedje, Jason Pell, Qingpeng Zhang, Jordan Fish, Eric McDonald, Chris Welcher, Aaron Garoutte, Jiarong Guo

Collaborators: Janet Jansson, Susannah Tringe

Page 3: ASM 2013 Metagenomic Assembly Workshop Slides

I will upload this on slideshare (adinachuanghowe)

Khmer documentation

github.com/ged-lab/khmer/

https://khmer.readthedocs.org/en/latest/guide.html

Manuscripts

Scaling metagenome sequence assembly with probabilistic de Bruijn graphs

http://www.pnas.org/content/early/2012/07/25/1121464109

A reference-free algorithm for computational normalization of shotgun sequencing data

http://arxiv.org/abs/1203.4802

Assembling large, complex metagenomes

http://arxiv.org/abs/1212.2832

Page 4: ASM 2013 Metagenomic Assembly Workshop Slides

A few gotchas of sequencing:

Errors / Artifacts (confusion)

Diversity / Complexity (scale)

Page 5: ASM 2013 Metagenomic Assembly Workshop Slides

1. Digital normalization (lossy compression)

2. Partitioning

3. Enabling use of current, previously unusable assembly tools

Page 6: ASM 2013 Metagenomic Assembly Workshop Slides

Reduces data for analysis

Longer sequences (increased accuracy of annotation)

Gene order

Does not rely on known references, access to unknowns

Creates new references

Lots of assembly tools available

But…

Page 7: ASM 2013 Metagenomic Assembly Workshop Slides

But…

High memory requirements

Depends on good (~10x) sequencing coverage

Page 8: ASM 2013 Metagenomic Assembly Workshop Slides

“Coverage” is simply the average number of reads that overlap each true base in the genome.

Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
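The definition above reduces to simple arithmetic: total sequenced bases divided by genome length. A minimal sketch follows; the function name and the example numbers are hypothetical and not taken from the slides.

```python
def average_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of reads overlapping each base: (reads * read length) / genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return num_reads * read_length / genome_length


# Hypothetical example: 500,000 reads of 100 bp sampled from a 5 Mbp genome.
print(average_coverage(500_000, 100, 5_000_000))  # 10.0
```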

Page 9: ASM 2013 Metagenomic Assembly Workshop Slides

Note that k-mer abundance is not properly represented here! Each blue k-mer will be present around 10 times.

Page 10: ASM 2013 Metagenomic Assembly Workshop Slides

Each single base error generates ~k new k-mers.

Generally, erroneous k-mers show up only once – errors are random.
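The "~k new k-mers" claim can be checked on a toy read. The sketch below is illustrative only: the read sequence, k = 5, and the helper names are assumptions. An error in the middle of a read touches k overlapping k-mers; an error at the very end of a read touches only one, which is why the count is approximate.

```python
def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def new_kmers_from_error(read, pos, base, k):
    """k-mers that exist only because read[pos] was miscalled as `base`."""
    mutated = read[:pos] + base + read[pos + 1:]
    return set(kmers(mutated, k)) - set(kmers(read, k))


# Hypothetical 30 bp read; every 5-mer in it is unique.
read = "ACGTTGCAAGGCTTACCGATGGCATTCAGA"

print(len(new_kmers_from_error(read, 15, "T", 5)))  # 5: mid-read error, k new k-mers
print(len(new_kmers_from_error(read, 0, "G", 5)))   # 1: error at the read edge
```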

Page 11: ASM 2013 Metagenomic Assembly Workshop Slides
Page 12: ASM 2013 Metagenomic Assembly Workshop Slides
Page 13: ASM 2013 Metagenomic Assembly Workshop Slides

Low-abundance peak (errors)

Page 14: ASM 2013 Metagenomic Assembly Workshop Slides

High-abundance peak (true k-mers)
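The two peaks come from a k-mer abundance histogram: for each abundance value, how many distinct k-mers occur that many times. A minimal sketch on toy data follows; the sequence, k = 5, the exact-count `Counter`, and the function names are all assumptions for illustration, not the tooling used for the plots.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(reads, k):
    """Map abundance -> number of distinct k-mers seen that many times."""
    return Counter(kmer_counts(reads, k).values())


# Toy data: ten error-free copies of one read plus one copy with a single error.
true_read = "ACGTTGCAAGGCTTACCGATGGCATTCAGA"
error_read = true_read[:15] + "T" + true_read[16:]
hist = abundance_histogram([true_read] * 10 + [error_read], k=5)

for abundance in sorted(hist):
    print(abundance, hist[abundance])
# Abundance 1 holds the 5 erroneous k-mers; abundances 10 and 11 hold the 26 true k-mers.
```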

Page 15: ASM 2013 Metagenomic Assembly Workshop Slides

Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!!

This 100x will consume disk space and, because of errors, memory.

We can discard it for you…
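The arithmetic behind the A/B example can be written out directly: sequencing deep enough to give the rarest member its target coverage scales every other member by the same factor. The function and argument names below are hypothetical.

```python
def coverage_at_target(abundances, rare, target):
    """Coverage each member receives once the `rare` member reaches `target` coverage."""
    scale = target / abundances[rare]
    return {name: abundance * scale for name, abundance in abundances.items()}


# Relative abundances from the slide: A is 10, B is 1; we want 10x of B.
print(coverage_at_target({"A": 10, "B": 1}, rare="B", target=10))
# {'A': 100.0, 'B': 10.0}
```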

Page 16: ASM 2013 Metagenomic Assembly Workshop Slides
Page 17: ASM 2013 Metagenomic Assembly Workshop Slides
Page 18: ASM 2013 Metagenomic Assembly Workshop Slides
Page 19: ASM 2013 Metagenomic Assembly Workshop Slides
Page 20: ASM 2013 Metagenomic Assembly Workshop Slides
Page 21: ASM 2013 Metagenomic Assembly Workshop Slides
Page 22: ASM 2013 Metagenomic Assembly Workshop Slides

A digital analog to cDNA library normalization, diginorm:

Reference free.

Is single pass: looks at each read only once;

Does not “collect” the majority of errors;

Keeps all low-coverage reads;

Smooths out coverage of regions.
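The properties listed above can be sketched in a few lines. This follows the published diginorm idea of estimating a read's coverage from the median abundance of its k-mers and discarding the read once that estimate reaches a cutoff. The function name, the exact-count dictionary, and the parameter values are assumptions for illustration; khmer uses a memory-efficient probabilistic counting structure, and this sketch ignores reverse complements.

```python
from collections import Counter
from statistics import median


def digital_normalize(reads, k=20, cutoff=20):
    """Single pass: keep a read only if its estimated coverage is still below `cutoff`."""
    counts = Counter()  # k-mer counts from the reads kept so far
    kept = []
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read is shorter than k, nothing to estimate from
        estimated_coverage = median([counts[km] for km in read_kmers])
        if estimated_coverage < cutoff:
            kept.append(read)
            for km in read_kmers:
                counts[km] += 1  # only kept reads contribute to the counts
    return kept


# Toy input: one sequence seen 50 times and one seen once.
abundant = "ACGTTGCAAGGCTTACCGAT"
rare = "TTTTGGGGCCCCAAAATTTT"
kept = digital_normalize([abundant] * 50 + [rare], k=5, cutoff=3)
print(len(kept))  # 4: three copies of the abundant read plus the rare read
```

Because discarded reads never update the counts, k-mers from errors in high-coverage regions mostly stay out of the table, while every low-coverage read passes the cutoff and is kept.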

Page 23: ASM 2013 Metagenomic Assembly Workshop Slides

Digital normalization produces “good” metagenome assemblies.

Smooths out abundance variation, strain variation.

Reduces computational requirements for assembly.

It also kinda makes sense :)

Page 24: ASM 2013 Metagenomic Assembly Workshop Slides

Split reads into “bins” belonging to different source species.

Can do this based almost entirely on connectivity of sequences.

“Divide and conquer”

Memory-efficient implementation helps to scale assembly.

Pell et al., 2012, PNAS
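Connectivity-based binning can be illustrated with a union-find over reads that share a k-mer. This is a simplified stand-in, not the method of Pell et al.: the slides refer to a memory-efficient probabilistic de Bruijn graph, whereas the sketch below stores every k-mer exactly, ignores reverse complements, and uses hypothetical names throughout.

```python
def partition_reads(reads, k):
    """Group reads into bins; reads sharing any k-mer (directly or transitively) share a bin."""
    parent = list(range(len(reads)))

    def find(i):
        # Find the representative of i, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_seen:
                parent[find(idx)] = find(first_seen[kmer])  # merge the two bins
            else:
                first_seen[kmer] = idx

    bins = {}
    for idx, read in enumerate(reads):
        bins.setdefault(find(idx), []).append(read)
    return list(bins.values())


# Toy input: two pairs of overlapping reads from two unrelated sequences.
reads = ["ACGTTGCAAGG", "GCAAGGCTTAC", "TTTTGGGGCCCC", "GGGGCCCCAAAA"]
for group in partition_reads(reads, k=5):
    print(group)
```

Each bin can then be handed to an assembler separately, which is the "divide and conquer" step.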

Page 25: ASM 2013 Metagenomic Assembly Workshop Slides
Page 26: ASM 2013 Metagenomic Assembly Workshop Slides
Page 27: ASM 2013 Metagenomic Assembly Workshop Slides
Page 28: ASM 2013 Metagenomic Assembly Workshop Slides

Low coverage is the dominant problem blocking assembly of your soil metagenome.

Page 29: ASM 2013 Metagenomic Assembly Workshop Slides

In order to build assemblies, each assembler makes choices – uses heuristics – to reach a conclusion.

These heuristics may not be appropriate for your sample!

High polymorphism?

Mixed population vs clonal?

Genomic vs metagenomic vs mRNA

Low coverage drives differences in assembly.

Page 30: ASM 2013 Metagenomic Assembly Workshop Slides
Page 31: ASM 2013 Metagenomic Assembly Workshop Slides

We can assemble virtually anything but soil ;).

Genomes, transcriptomes, MDA, mixtures, etc.

Repeat resolution will be fundamentally limited by sequencing technology (insert size; sampling depth).

Strain variation confuses assembly, but does not prevent useful results.

Diginorm is a systematic strategy to enable assembly.

Banfield has shown how to deconvolve strains at differential abundance.

Kostas K.'s results suggest that there will be a species gap sufficient to prevent contig misassembly.

Page 32: ASM 2013 Metagenomic Assembly Workshop Slides

Most metagenomes require 50-150 GB of RAM.

Many people don’t have access to computers of that size.

Amazon Web Services (aws.amazon.com) will happily rent you such computers for $1-2/hr.

http://ged.msu.edu/angus/2013-hmp-assembly-webinar/index.html

Page 33: ASM 2013 Metagenomic Assembly Workshop Slides

Optimizing our programs => faster.

Building an evaluation framework for metagenome assemblers.

Error correction!

Page 34: ASM 2013 Metagenomic Assembly Workshop Slides

Achieving one or more assemblies is fairly straightforward.

An assembly is a hypothesis, however. Evaluating assemblies is challenging, and it is where you should be thinking hardest about assembly.

There are relatively few pipelines available for analyzing assembled metagenomic data.

Page 35: ASM 2013 Metagenomic Assembly Workshop Slides

Questions?

Page 36: ASM 2013 Metagenomic Assembly Workshop Slides

How do we study complexity? Interactions? Diversity? Communities? Evolution? Our environment?

Visual Complexity http://www.flickr.com/photos/maisonbisson

• Major efforts of data collection

• Open mind for discoveries

• Willingness to adjust to change

• Multiple efforts

• Well-designed experiments

Workshop example: Illumina deep sequencing and scaling large datasets on soil metagenomes

Page 37: ASM 2013 Metagenomic Assembly Workshop Slides

We receive Gb of sequences

Generally, my data is…

• Split by barcodes

• Untrimmed

• Adapters are present

• Two paired-end fastq files

Underestimation of computational requirements:

• Quality control steps usually require 2-3 times the amount of hard drive space

• Similarity comparison against known databases impractical (soil metagenome ~50 years to BLAST)

Home Alone Scream

My first slide graphic that I’m scared may date me.

Page 38: ASM 2013 Metagenomic Assembly Workshop Slides

Two ways to reduce the onslaught:

Cluster into known observances (annotate, bin)

Assembly

Some mix of the above

Page 39: ASM 2013 Metagenomic Assembly Workshop Slides

Ten of you upload 1 Hiseq flowcell into MG-RAST

Page 40: ASM 2013 Metagenomic Assembly Workshop Slides

Illumina short reads from soil metagenome (~100 bp)

454 short reads from soil metagenome (~368 bp)

Assembled contigs (Illumina) from soil metagenome (~491 bp)

Read length will increase… computational requirements?

Assembly is a great way to reduce data.