Page 1: Natalia Ivanova

Advancing Science with DNA Sequence

Natalia Ivanova

MGM Workshop

September 12, 2012

Metagenome analysis: use case

Page 2: Natalia Ivanova

Minoan eruption and metagenomics

…it seemed as though the sea was being sucked backwards, as if it were being pushed back by the shaking of the land…Behind us were frightening dark clouds, rent by lightning twisted and hurled, opening to reveal huge figures of flame. These were like lightning, but bigger.

From Pliny the Younger’s Letter

Page 3: Natalia Ivanova

Apart from Minoan eruption…

from Chernicoff & Stanley, Geology, 2007

Diagram by Gary Massoth/PMEL

Page 4: Natalia Ivanova

Sampling sites

white mat

red mat

Key gradients, white vs red:

• Temperature: 60 vs 18 °C

• CO2 tension: >99% vs <1%

Page 5: Natalia Ivanova

This is what it looks like

Page 6: Natalia Ivanova

Chimney material may be of biological origin

Page 7: Natalia Ivanova

Standard JGI metagenome pipeline

DNA sample=>DNA QC, then two branches:

• SSU pyrotags: http://pyrotagger.jgi-psf.org; community composition; semi-quantitative (OTU abundance)

• Shotgun libraries (Illumina standard, Illumina long mate pair, 454 standard, 454 long mate pair)=>Assembly=>Analysis: metagenome in IMG/M-ER (contigs + unassembled reads); community composition; functional analysis

Page 8: Natalia Ivanova

Pyrotag results – BLASTn against Greengenes database

[Figure: bar chart "Pyrotags - phylum level, filtered at 0.1% of all clusters". X-axis: % pyrotag clusters (0 to 40); y-axis: phylum. Series: Kolumbo_volcano_white, Kolumbo_volcano_red.]

Phyla shown: Proteobacteria, Bacteroidetes, Planctomycetes, Chloroflexi, Marine_group_A, Thaumarchaeota, Unknown, OP11, Acidobacteria, Actinobacteria, Caldithrix_KSB1, OP3, OP8, WS3, Verrucomicrobia, Chlorobi, Gemmatimonadetes, MBMPE71, Nitrospirae, pMC2A209, ABY1_OD1, Thermoplasmata_Eury, Firmicutes, NKB19, VHS-B5-50, Spirochaetes, pMC1, Lentisphaerae, OP5, TM6, MAT-CR-M3-H11, Chlamydiae, TM7, DHVE3, Cyanobacteria, pMC2A384, pMC2A15, WS6, SM2F11, C2, BRC1, Thermotogae, Thermosulfidobacterium, EM3
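The "filtered at 0.1% of all clusters" step can be sketched as follows. This is a minimal illustration, assuming each pyrotag cluster already carries a phylum label; the function name and the toy cluster list are hypothetical and are not taken from the JGI pipeline.

```python
from collections import Counter

def phylum_percentages(assignments, min_percent=0.1):
    """Return {phylum: percent of all clusters}, dropping phyla below min_percent."""
    counts = Counter(assignments)
    total = sum(counts.values())
    percents = {phylum: 100.0 * n / total for phylum, n in counts.items()}
    # Sort by decreasing abundance and apply the abundance cutoff.
    ranked = sorted(percents.items(), key=lambda item: -item[1])
    return {phylum: pct for phylum, pct in ranked if pct >= min_percent}

# Hypothetical cluster labels, NOT data from the slides.
clusters = ["Proteobacteria"] * 1200 + ["Bacteroidetes"] * 799 + ["OP11"] * 1
profile = phylum_percentages(clusters)

for phylum, pct in profile.items():
    print(f"{phylum:15s} {pct:6.2f}% of clusters")
```

In this toy input the single OP11 cluster is 0.05% of all clusters, so it falls below the cutoff and is left out of the profile.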

Page 9: Natalia Ivanova

PhyloDistribution results – BLASTp of metagenome CDSs against isolates in IMG

[Figure: bar chart "PhyloDistribution of CDSs - phylum level, filtered at 0.1% abundance". X-axis: % CDS hits (0.00% to 50.00%); y-axis: phylum. Series: Kolumbo_volcano_white_grey, Kolumbo_volcano_red.]

Phyla shown: Proteobacteria, Bacteroidetes, Planctomycetes, Chloroflexi, Thaumarchaeota, Acidobacteria, Actinobacteria, Caldithrix_KSB1, Verrucomicrobia, Chlorobi, Thermoplasmata_Eury, Firmicutes, Spirochaetes, Cyanobacteria, Thermotogae, Thermosulfidobacterium, Lentisphaerae

Page 10: Natalia Ivanova

Pyrotags vs PhyloDistribution – white mat

[Figure: bar chart by phylum, axis 0.00% to 50.00%. Series: Kolumbo_volcano_white_grey_PhyloDist, Kolumbo_volcano_white_grey_Pyro.]

Phyla shown: Bacteroidetes, Proteobacteria, Thermotogae, OP5, Planctomycetes, Thermosulfidobacteri, EM3, Thaumarchaeota, Thermoplasmata_Eury, Verrucomicrobia, Chlorobi, Unknown, Acidobacteria, Actinobacteria, Caldithrix_KSB1, Lentisphaerae, Marine_group_A

Big differences in abundance (an order of magnitude or more) of Bacteroidetes and Thermotogae
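One way to flag "an order of magnitude or more" is to compare the two profiles on a log10 scale. The sketch below assumes both profiles are expressed as percentages per phylum; the numbers are placeholders chosen only to exercise the code and are not read from the chart, and the pseudocount is an arbitrary choice to avoid dividing by zero.

```python
import math

def log10_fold_change(percent_a, percent_b, pseudo=0.01):
    """log10 ratio of two percentages; the pseudocount guards against zeros."""
    return math.log10((percent_a + pseudo) / (percent_b + pseudo))

def discordant_phyla(pyro, phylodist, threshold=1.0):
    """Phyla whose abundance differs by >= 10**threshold between the two methods."""
    flagged = {}
    for phylum in sorted(set(pyro) | set(phylodist)):
        lfc = log10_fold_change(pyro.get(phylum, 0.0), phylodist.get(phylum, 0.0))
        if abs(lfc) >= threshold:
            flagged[phylum] = lfc
    return flagged

# Placeholder percentages, NOT values from the slide.
pyro = {"phylum_X": 40.0, "phylum_Y": 0.5, "phylum_Z": 20.0}
phylodist = {"phylum_X": 3.0, "phylum_Y": 8.0, "phylum_Z": 25.0}

flagged = discordant_phyla(pyro, phylodist)
for phylum, lfc in flagged.items():
    print(f"{phylum}: log10(pyrotag / PhyloDist) = {lfc:+.2f}")
```

A positive value means the phylum is more abundant in the pyrotag profile, a negative value means it is more abundant in the PhyloDistribution profile.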

Page 11: Natalia Ivanova

Possible explanations

• Amplification artifacts in pyrotags – well known for metagenome data

• Sequencing GC bias in the metagenome: low-GC and high-GC sequences (<30% and >65% GC) are underrepresented in Illumina data

• K-mer assembler problems: abundant populations may be underrepresented in the assembly if incorrect k-mer/coverage parameters are selected

• Primer bias in pyrotags (against Proteobacteria)?
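The GC-bias explanation in the list above can be checked on individual reads or contigs. A minimal sketch, assuming plain nucleotide strings and using the <30% and >65% cutoffs quoted in the bullet; the function names are illustrative.

```python
def gc_percent(seq):
    """GC content as a percentage of unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None  # nothing but ambiguity codes or an empty sequence
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt

def gc_class(seq, low=30.0, high=65.0):
    """Label a sequence by the GC ranges reported as underrepresented."""
    gc = gc_percent(seq)
    if gc is None:
        return "undetermined"
    if gc < low:
        return "low"
    if gc > high:
        return "high"
    return "medium"

for example in ("ATATATATGC", "ACGTACGT", "GGCCGGCCGA"):
    print(example, gc_percent(example), gc_class(example))
```

Sequences labelled "low" or "high" are the ones that would be expected to be undersampled if this explanation holds.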

Page 12: Natalia Ivanova

PCR artifacts in metagenome data

454 technology includes an emulsion PCR (emPCR) step, which may lead to artificial overrepresentation of certain sequences

Reason: free beads are present during the library prep step; escaped emPCR products bind to free beads and are disproportionately amplified
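Artificially amplified 454 reads tend to start at the same position, so a simple screen is to group reads by their leading bases. The sketch below is an assumption about how such a screen could look, not the method used by JGI or IMG; the 20-base prefix length and the read names are arbitrary.

```python
from collections import defaultdict

def group_by_prefix(reads, prefix_len=20):
    """Group (name, sequence) pairs that share the same leading bases."""
    groups = defaultdict(list)
    for name, seq in reads:
        groups[seq[:prefix_len].upper()].append(name)
    return dict(groups)

def dereplicate(reads, prefix_len=20):
    """Keep only the longest read from each prefix group."""
    best = {}
    for name, seq in reads:
        key = seq[:prefix_len].upper()
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (name, seq)
    return list(best.values())

# Hypothetical reads: r1 and r2 share their first 20 bases.
reads = [
    ("r1", "ACGTACGTACGTACGTACGTAAAA"),
    ("r2", "ACGTACGTACGTACGTACGTAAAATTTT"),
    ("r3", "TTTTACGTACGTACGTACGTCCCC"),
]
kept = dereplicate(reads)
print([name for name, _ in kept])
```

Groups with many members are candidates for artificial replicates, although highly abundant genomes can also produce identical read starts by chance.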

Page 13: Natalia Ivanova

What about GC bias?

[Figure: three panels: Low GC (Brachyspira), Medium GC (Arcanobacterium), High GC (Cellulomonas)]

Question: how do you find average/max/min GC content for a clade?

Answer: IMG=>Genome Browser=>View Phylogenetically=>click on green + to select the clade, then “Add selected to Genome Cart”=>Compare Genomes=>Genome Statistics

Result:

• Thermotogae GC percent: 41 average / 47 max / 31 min

• Bacteroidetes GC percent: 42.5 average / 66 max / 31 min
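The clade GC statistics can be compared against the 30% to 65% GC window outside which Illumina data were described as underrepresented. A small sketch using the numbers reported in the result; the dictionary layout and function name are illustrative.

```python
# GC statistics as reported in the result above (percent).
CLADE_GC = {
    "Thermotogae": {"avg": 41.0, "max": 47.0, "min": 31.0},
    "Bacteroidetes": {"avg": 42.5, "max": 66.0, "min": 31.0},
}

LOW, HIGH = 30.0, 65.0  # GC range outside which Illumina reads are underrepresented

def gc_bias_check(stats, low=LOW, high=HIGH):
    """Report whether a clade's GC statistics fall outside the unbiased window."""
    return {
        "average_in_window": low <= stats["avg"] <= high,
        "min_below_window": stats["min"] < low,
        "max_above_window": stats["max"] > high,
    }

for clade, stats in CLADE_GC.items():
    print(clade, gc_bias_check(stats))
```

Both clade averages sit well inside the window, which is why GC bias alone is an unlikely explanation for an order-of-magnitude difference.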

Page 14: Natalia Ivanova

Let’s take a closer look at the unassembled reads

                                              White mat     Red mat
454 reads total                                 299,975   1,429,091
Illumina reads total                         49,227,146  45,337,178
Assembled contigs                               195,590      88,776
N50, bp                                             659         869
Longest contig, bp                               28,145      75,483
Illumina reads mapped to assembly, % total         42.3        12.5
454 reads mapped to assembly, % total              62.1        15.3

[Figure: bar chart by phylum, axis 0 to 70. Series: 454, Illumina.]

Phyla shown: Actinobacteria, Aquificae, Bacteroidetes, Chlorobi, Chloroflexi, Cyanobacteria, Euryarchaeota, Firmicutes, Planctomycetes, Proteobacteria, Spirochaetes, Thaumarchaeota, Thermotogae, Verrucomicrobia
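The mapping percentages imply how many reads never made it into the assembly. A quick arithmetic sketch using the totals from the read statistics (Illumina: 49,227,146 and 45,337,178 reads, 42.3% and 12.5% mapped; 454: 299,975 and 1,429,091 reads, 62.1% and 15.3% mapped); the counts are estimates because the percentages are rounded.

```python
# (total reads, % mapped to assembly) per sample and platform, from the slide.
SAMPLES = {
    "white mat": {"Illumina": (49_227_146, 42.3), "454": (299_975, 62.1)},
    "red mat": {"Illumina": (45_337_178, 12.5), "454": (1_429_091, 15.3)},
}

def unmapped_reads(total, mapped_percent):
    """Estimate how many reads did not map to the assembly."""
    return round(total * (100.0 - mapped_percent) / 100.0)

def unmapped_table(samples=SAMPLES):
    """Return {(sample, platform): estimated unmapped read count}."""
    return {
        (sample, platform): unmapped_reads(total, pct)
        for sample, platforms in samples.items()
        for platform, (total, pct) in platforms.items()
    }

for (sample, platform), n in unmapped_table().items():
    print(f"{sample:9s} {platform:8s} {n:>12,d} reads not in the assembly")
```

Most of the data, especially for the red mat, stays outside the assembly, which is why the unassembled reads deserve a separate look.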

Page 15: Natalia Ivanova

It’s pyrotag bias after all!

• JGI uses primer pair 946F-1492R

1492R primer: TACGCYTACCTTGTTACGACTT

Sequence in the metagenome: TACGGTTACCTTGTTACGACTT

• C/G mismatch

• JGI did extensive testing on artificial communities; this problem was not detected
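The mismatch can be located programmatically by comparing the primer with the metagenome sequence position by position while honoring IUPAC ambiguity codes (Y matches C or T). This sketch assumes the two 22-base strings, 1492R primer TACGCYTACCTTGTTACGACTT and metagenome sequence TACGGTTACCTTGTTACGACTT, are written in the same orientation and aligned without gaps; the function name is illustrative.

```python
# Bases allowed by each IUPAC nucleotide code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def primer_mismatches(primer, target):
    """Return (1-based position, primer base, target base) for each mismatch."""
    if len(primer) != len(target):
        raise ValueError("primer and target must have the same length")
    pairs = zip(primer.upper(), target.upper())
    return [
        (pos, p, t)
        for pos, (p, t) in enumerate(pairs, start=1)
        if t not in IUPAC.get(p, "")
    ]

PRIMER_1492R = "TACGCYTACCTTGTTACGACTT"   # as shown on the slide
METAGENOME = "TACGGTTACCTTGTTACGACTT"     # as shown on the slide

print(primer_mismatches(PRIMER_1492R, METAGENOME))
```

The degenerate Y at position 6 is compatible with the T in the metagenome, so the only disagreement reported is the C versus G at position 5.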

Page 16: Natalia Ivanova

Functional analysis: metagenome as a bag of functions

• Red mat is taxonomically more diverse

• Is it more diverse functionally?

                 White mat   Red mat
COG clusters          3631      3402
Pfam clusters         3847      3505

Question: where do you find this information?

Answer: IMG=>Taxon Details=>Metagenome Statistics; Genes with Pfam=>Display as a list=>Export

[Figure: rarefaction curves. X-axis: Specimens (10000 to 80000); y-axis: Taxa (95% confidence), 0 to 3600.]

Rarefaction curves: white mat is expected to have ~4000 different Pfams; red mat ~3600
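A rarefaction curve gives the expected number of distinct families in a random subsample of n genes. The sketch below uses the standard hypergeometric rarefaction formula, E[S_n] = sum over families of 1 - C(N - N_i, n) / C(N, n), and assumes that "specimens" are genes with Pfam hits and "taxa" are Pfam families. It does not reproduce the confidence intervals or the extrapolated richness from the slide, and the counts are hypothetical.

```python
import math

def log_comb(a, b):
    """Natural log of the binomial coefficient C(a, b)."""
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)

def expected_families(counts, n):
    """Expected number of families seen in n genes drawn without replacement."""
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total gene count")
    expected = 0.0
    for count in counts:
        if total - count < n:
            expected += 1.0  # subsample is too large to miss this family
        else:
            p_absent = math.exp(log_comb(total - count, n) - log_comb(total, n))
            expected += 1.0 - p_absent
    return expected

# Hypothetical genes-per-family counts, NOT data from the slides.
counts = [500, 200, 50, 10, 5, 1, 1, 1]
for n in (10, 100, 768):
    print(n, round(expected_families(counts, n), 2))
```

Plotting `expected_families` against n for each sample gives curves of the kind shown on the slide; a curve that is still rising at the full sample size indicates families that have not yet been observed.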

Page 17: Natalia Ivanova

Abundance Comparisons

Motility and chemotaxis genes are overrepresented in white mat (detected by both Pfams and COG Categories)

[Figure: abundance comparison panels, white mat and red mat]
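Overrepresentation of a functional category can be tested by comparing its share of genes in the two samples. The sketch below uses a two-proportion z-test, which is one common choice and not necessarily the statistic used by the IMG abundance comparison tool; the gene counts are hypothetical.

```python
import math

def two_proportion_z(hits_a, total_a, hits_b, total_b):
    """z statistic and two-sided p-value for a difference in proportions."""
    pooled = (hits_a + hits_b) / (total_a + total_b)
    variance = pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b)
    if variance == 0.0:
        raise ValueError("category is absent from, or saturates, both samples")
    z = (hits_a / total_a - hits_b / total_b) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p_value

# Hypothetical counts: genes in a category vs all annotated genes per sample.
z, p = two_proportion_z(hits_a=300, total_a=10_000, hits_b=100, total_b=10_000)
print(f"z = {z:.2f}, p = {p:.3g}")
```

A positive z means the category takes a larger share of genes in the first sample. With thousands of families tested at once, p-values would also need a multiple-testing correction.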

Page 18: Natalia Ivanova

Is motility/chemotaxis common to all organisms in white mat?

• Scenario 1: the function/pathway is overrepresented because it is present in all members of the community, possibly at higher copy number

• Scenario 2: the function/pathway is overrepresented because it is present in one clade, which is absent from the second sample

Question: can we distinguish between the two scenarios?

Answer: click on the gene count for protein family/functional category, add all genes to Gene Cart=>add scaffolds to Scaffold Cart=>PhyloDistribution of all scaffolds in the Scaffold Cart
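Once each gene in the category has a clade assignment from its scaffold, the two scenarios can be told apart by how concentrated the genes are in a single clade. A minimal sketch, assuming a list with one clade label per gene; the 80% dominance cutoff and the clade names are arbitrary illustrations.

```python
from collections import Counter

def clade_shares(gene_clades):
    """Fraction of the category's genes contributed by each clade, largest first."""
    if not gene_clades:
        raise ValueError("no genes supplied")
    counts = Counter(gene_clades)
    total = sum(counts.values())
    return {clade: n / total for clade, n in counts.most_common()}

def guess_scenario(gene_clades, dominance=0.8):
    """Return (scenario, top clade, its share): 2 if one clade dominates, else 1."""
    shares = clade_shares(gene_clades)
    top_clade, top_share = next(iter(shares.items()))
    scenario = 2 if top_share >= dominance else 1
    return scenario, top_clade, top_share

# Hypothetical clade labels for genes in one functional category.
concentrated = ["clade_A"] * 9 + ["clade_B"]
spread_out = ["clade_A"] * 4 + ["clade_B"] * 3 + ["clade_C"] * 3

print(guess_scenario(concentrated))
print(guess_scenario(spread_out))
```

A dominant clade points to Scenario 2; in practice its share should also be compared with that clade's overall abundance in the community before drawing a conclusion.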

Page 19: Natalia Ivanova

Carbon fixation pathways

Page 20: Natalia Ivanova

Conclusions

The two communities differ in composition; the white mat, sampled next to the hydrothermal vent, has lower complexity

Community composition as sampled by pyrotags and the metagenome may be quite different due to a number of biases

Some protein families/functional categories are more abundant because of different community composition, and not because they are more important