Advancing Science with DNA Sequence

Natalia Ivanova

MGM Workshop

February 2, 2012

Metagenome analysis: use case

Minoan eruption and metagenomics

…it seemed as though the sea was being sucked backwards, as if it were being pushed back by the shaking of the land…Behind us were frightening dark clouds, rent by lightning twisted and hurled, opening to reveal huge figures of flame. These were like lightning, but bigger.

From Pliny the Younger’s Letter

Apart from Minoan eruption…

from Chernicoff & Stanley, Geology, 2007

Diagram by Gary Massoth/PMEL

Sampling sites

white mat

red mat

Key gradients, white vs red:
Temperature: 60 vs 18 °C
CO2 tension: >99% vs <1%

This is what it looks like

Chimney material may be of biological origin

Standard JGI metagenome pipeline

DNA sample => DNA QC => SSU pyrotags and shotgun libraries

SSU pyrotags (http://pyrotagger.jgi-psf.org)
• Community composition
• Semi-quantitative: OTU abundance

Shotgun libraries
• Illumina standard
• Illumina long mate pair
• 454 standard
• 454 long mate pair

Assembly => Analysis

Metagenome IMG/M-ER
• contigs + unassembled reads
• Community composition
• Functional analysis

Pyrotag results – BLASTn against Greengenes database

[Bar chart: Pyrotags, phylum level, filtered at 0.1% of all clusters. Axes: phylum vs % pyrotag clusters (0 to 40). Series: Kolumbo_volcano_white, Kolumbo_volcano_red. Phyla shown: Proteobacteria, Bacteroidetes, Planctomycetes, Chloroflexi, Marine_group_A, Thaumarchaeota, Unknown, OP11, Acidobacteria, Actinobacteria, Caldithrix_KSB1, OP3, OP8, WS3, Verrucomicrobia, Chlorobi, Gemmatimonadetes, MBMPE71, Nitrospirae, pMC2A209, ABY1_OD1, Thermoplasmata_Eury, Firmicutes, NKB19, VHS-B5-50, Spirochaetes, pMC1, Lentisphaerae, OP5, TM6, MAT-CR-M3-H11, Chlamydiae, TM7, DHVE3, Cyanobacteria, pMC2A384, pMC2A15, WS6, SM2F11, C2, BRC1, Thermotogae, Thermosulfidobacterium, EM3]

PhyloDistribution results – BLASTp of metagenome CDSs against isolates in IMG

[Bar chart: PhyloDistribution of CDSs, phylum level, filtered at 0.1% abundance. Axes: phylum vs % CDS hits (0% to 50%). Series: Kolumbo_volcano_white_grey, Kolumbo_volcano_red. Phyla shown: Proteobacteria, Bacteroidetes, Planctomycetes, Chloroflexi, Thaumarchaeota, Acidobacteria, Actinobacteria, Caldithrix_KSB1, Verrucomicrobia, Chlorobi, Thermoplasmata_Eury, Firmicutes, Spirochaetes, Cyanobacteria, Thermotogae, Thermosulfidobacterium, Lentisphaerae]

Pyrotags vs PhyloDistribution – white mat

[Bar chart: phylum-level abundance in the white mat, 0% to 50%. Series: Kolumbo_volcano_white_grey_PhyloDist, Kolumbo_volcano_white_grey_Pyro. Phyla shown: Bacteroidetes, Proteobacteria, Thermotogae, OP5, Planctomycetes, Thermosulfidobacteri, EM3, Thaumarchaeota, Thermoplasmata_Eury, Verrucomicrobia, Chlorobi, Unknown, Acidobacteria, Actinobacteria, Caldithrix_KSB1, Lentisphaerae, Marine_group_A]

Big differences in abundance (an order of magnitude or more) of Bacteroidetes and Thermotogae

Possible explanations

• Amplification artifacts in pyrotags – well known for metagenome data

• Sequencing GC bias in the metagenome – low-GC and high-GC sequences (<30% and >65% GC) are underrepresented in Illumina data

• K-mer assembler problems: abundant populations may be underrepresented in the assembly if incorrect k-mer/coverage parameters are selected

• Primer bias in pyrotags (against Proteobacteria)?
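
The GC-bias explanation can be checked sequence by sequence. A minimal sketch, assuming the <30% and >65% cutoffs quoted in the list above; the function names are illustrative and not part of any JGI tool:

```python
def gc_percent(seq: str) -> float:
    """Return the GC content of a nucleotide sequence as a percentage."""
    seq = seq.upper()
    # Count only unambiguous bases so that N and other codes do not skew the result.
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def gc_bias_class(seq: str, low: float = 30.0, high: float = 65.0) -> str:
    """Classify a sequence by the GC windows assumed to be underrepresented."""
    gc = gc_percent(seq)
    if gc < low:
        return "low GC (likely underrepresented)"
    if gc > high:
        return "high GC (likely underrepresented)"
    return "medium GC"
```

Applied to contigs or reads binned by phylum, this gives a quick view of whether a clade sits in a GC range where Illumina coverage is expected to drop.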

PCR artifacts in metagenome data

454 technology includes an emulsion PCR (emPCR) step, which may lead to artificial overrepresentation of certain sequences

Reason: free beads are present during the library prep step; escaped emPCR products bind to the free beads and are disproportionately amplified
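
One rough way to look for this kind of overrepresentation is to count reads that begin with an identical prefix. A sketch only: the prefix length of 20 and the function name are assumptions, this is not a JGI procedure, and genuine duplicates from very abundant populations will be counted as well:

```python
from collections import Counter


def prefix_duplicate_fraction(reads, prefix_len=20):
    """Fraction of reads that repeat a prefix already seen in another read."""
    # Reads shorter than the prefix length are ignored.
    prefixes = Counter(
        read[:prefix_len].upper() for read in reads if len(read) >= prefix_len
    )
    total = sum(prefixes.values())
    if total == 0:
        return 0.0
    # Every read beyond the first in a prefix group counts as a duplicate.
    duplicates = sum(count - 1 for count in prefixes.values())
    return duplicates / total
```

A fraction far above what random shotgun sampling would produce at the observed coverage is a hint that replicates should be collapsed before counting.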

What about GC bias?

[Figure panels: Low GC (Brachyspira), Medium GC (Arcanobacterium), High GC (Cellulomonas)]

Question: how do you find average/max/min GC content for a clade?
Answer: IMG=>Genome Browser=>View Phylogenetically=>click on green + to select the clade, then “Add selected to Genome Cart”=>Compare Genomes=>Genome Statistics

Result:
Thermotogae GC percent: 41 average / 47 max / 31 min
Bacteroidetes GC percent: 42.5 average / 66 max / 31 min
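
Comparing the Result above with the <30% and >65% GC windows mentioned earlier is simple arithmetic. A small sketch using only the numbers on this slide; the data structure and function name are illustrative:

```python
# clade: (average, max, min) GC percent, as read from IMG Genome Statistics
CLADE_GC = {
    "Thermotogae": (41.0, 47.0, 31.0),
    "Bacteroidetes": (42.5, 66.0, 31.0),
}


def gc_bias_flags(stats, low=30.0, high=65.0):
    """Flag clades whose GC range reaches into the underrepresented windows."""
    flags = {}
    for clade, (average, gc_max, gc_min) in stats.items():
        flags[clade] = {
            "average_in_range": low <= average <= high,
            "some_genomes_below_low": gc_min < low,
            "some_genomes_above_high": gc_max > high,
        }
    return flags
```

Both clade averages fall inside the 30 to 65% window, and only the Bacteroidetes maximum (66%) crosses a cutoff, which suggests GC bias alone is unlikely to account for an order-of-magnitude difference.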

Are there any abundant populations that could be filtered out in assembly?

Typical Pyrotagger output

Cluster    Kolumbo_volcano_white_grey  % identity  Taxonomy
Cluster1   32.4                        100         Bacteria; Bacteroidetes; Bacteroidales; VC21_Bac22
Cluster2   12.84                       94          Bacteria; Thermotogae
Cluster3   4.93                        98.51       Bacteria; Proteobacteria; Zetaproteobacteria
Cluster6   3.44                        94.03       Bacteria; Proteobacteria; Gammaproteobacteria; Methylococcales; Methylococcaceae; Clonothrix
Cluster12  3.42                        97.5        Bacteria; OP5; SRI-280
Cluster15  2.83                        98.5        Bacteria; Thermosulfidobacterium
Cluster13  2.82                        91.13       Bacteria; EM3
Cluster8   2.81                        97          Bacteria; Proteobacteria; Gammaproteobacteria; Thiomicrospira; Thiomicrospira_frisia; Thiomicrospira_chilensis
Cluster9   1.86                        89.22       Bacteria; Proteobacteria; Alphaproteobacteria; Caulobacterales; Caulobacteraceae; Caulobacter
Cluster17  1.53                        92.12       Bacteria; Proteobacteria; Deltaproteobacteria
Cluster19  1.44                        97          Bacteria; Proteobacteria; Deltaproteobacteria; Desulfobacterium_catecholicum; Desulfobulbus_rhabdoformis
Cluster25  1.27                        93.56       Bacteria; Proteobacteria; Desulfurellales; Desulfurellaceae
Cluster53  1.03                        100         Archaea; Thaumarchaeota; Cenarchaeales; Cenarchaeum

There are 2 highly abundant populations – just 2 clusters account for nearly all Bacteroidetes and Thermotogae in the sample
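
The observation above can be checked by summing cluster abundances per phylum. A sketch using the values from the Pyrotagger output; it assumes the first numeric column is the percentage of pyrotags in the sample, and the variable and function names are illustrative:

```python
from collections import defaultdict

# (cluster, abundance in Kolumbo_volcano_white_grey, phylum) from the Pyrotagger output
CLUSTERS = [
    ("Cluster1", 32.4, "Bacteroidetes"),
    ("Cluster2", 12.84, "Thermotogae"),
    ("Cluster3", 4.93, "Proteobacteria"),
    ("Cluster6", 3.44, "Proteobacteria"),
    ("Cluster12", 3.42, "OP5"),
    ("Cluster15", 2.83, "Thermosulfidobacterium"),
    ("Cluster13", 2.82, "EM3"),
    ("Cluster8", 2.81, "Proteobacteria"),
    ("Cluster9", 1.86, "Proteobacteria"),
    ("Cluster17", 1.53, "Proteobacteria"),
    ("Cluster19", 1.44, "Proteobacteria"),
    ("Cluster25", 1.27, "Proteobacteria"),
    ("Cluster53", 1.03, "Thaumarchaeota"),
]


def abundance_by_phylum(clusters):
    """Return {phylum: (summed abundance, number of clusters)}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _, abundance, phylum in clusters:
        totals[phylum] += abundance
        counts[phylum] += 1
    return {phylum: (round(totals[phylum], 2), counts[phylum]) for phylum in totals}
```

In this listing Bacteroidetes and Thermotogae are each carried by a single cluster, while Proteobacteria are spread over seven smaller ones.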

Let’s take a closer look at the assemblies and unassembled reads

                                            White mat    Red mat
454 reads total                             299,975      1,429,091
Illumina reads total                        49,227,146   45,337,178
Assembled contigs                           195,590      88,776
N50, bp                                     659          869
Longest contig, bp                          28,145       75,483
Illumina reads mapped to assembly, % total  42.3         12.5
454 reads mapped to assembly, % total       62.1         15.3

[Bar chart by phylum, axis 0 to 70. Series: 454, Illumina. Phyla shown: Actinobacteria, Aquificae, Bacteroidetes, Chlorobi, Chloroflexi, Cyanobacteria, Euryarchaeota, Firmicutes, Planctomycetes, Proteobacteria, Spirochaetes, Thaumarchaeota, Thermotogae, Verrucomicrobia]
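
For reference, the N50 reported for the assemblies is the contig length at which contigs of that length or longer hold at least half of the assembled bases. A minimal sketch of the usual calculation; the function name is illustrative and no workshop data is used:

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half of all bases are covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length
```

An N50 of 659 or 869 bp therefore means that half of the assembled bases sit in contigs not much longer than a few reads, which is consistent with a large share of reads remaining unassembled.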

It’s pyrotag bias after all!

• JGI uses primer pair 946F-1492R
• 1492R primer: TACGCYTACCTTGTTACGACTT
• Sequence in the metagenome: TACGGTTACCTTGTTACGACTT
• CG mismatch
• JGI did extensive testing on artificial communities; this problem was not detected
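
The mismatch can be located by comparing the two sequences position by position while honoring the degenerate base in the primer. A small sketch; the function name is illustrative, and it assumes both sequences are written in the same orientation and that the target contains only A/C/G/T:

```python
# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_mismatches(primer, target):
    """Return (1-based position, primer base, target base) for each mismatch."""
    if len(primer) != len(target):
        raise ValueError("primer and target must have the same length")
    return [
        (i + 1, p, t)
        for i, (p, t) in enumerate(zip(primer.upper(), target.upper()))
        if t not in IUPAC[p]
    ]


PRIMER_1492R = "TACGCYTACCTTGTTACGACTT"
METAGENOME_SITE = "TACGGTTACCTTGTTACGACTT"
print(primer_mismatches(PRIMER_1492R, METAGENOME_SITE))  # [(5, 'C', 'G')]
```

For the two sequences on this slide the only mismatch is at position 5 (primer C, metagenome G); the degenerate Y at position 6 matches the T in the metagenome.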

Functional analysis: metagenome as a bag of functions

• Red mat is taxonomically more diverse
• Is it more diverse functionally?

                White mat   Red mat
COG clusters    3631        3402
Pfam clusters   3847        3505

Question: where do you find this information?
Answer: IMG=>Taxon Details=>Metagenome Statistics; Genes with Pfam=>Display as a list=>Export

[Rarefaction plot: x-axis Specimens (10000 to 80000), y-axis Taxa with 95% confidence (0 to 3600)]

Rarefaction curves: white mat is expected to have ~4000 different Pfams; red mat ~3600
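
Rarefaction estimates how many distinct families would be seen in a random subsample of a given size. A sketch of the classical analytic (hypergeometric) formula, offered as one common approach; it is not necessarily the method used to draw the curves above, and the function name is illustrative:

```python
from math import comb


def rarefied_richness(counts, n):
    """Expected number of distinct families in a random subsample of n genes.

    counts: number of genes observed per family (for example, per Pfam).
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total count")
    all_subsamples = comb(total, n)
    # A family is missed when all n draws come from the other (total - c) genes.
    return sum(1 - comb(total - c, n) / all_subsamples for c in counts)
```

Evaluating this for increasing n and plotting the result gives a rarefaction curve; where the curve flattens indicates roughly how many families further sequencing would still add.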

Abundance Comparisons

Motility and chemotaxis genes are overrepresented in white mat (detected by both Pfams and COG Categories)

white mat red mat
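
Whether a category is overrepresented in one sample can be tested on gene counts. A sketch using a two-proportion z-test as one reasonable choice; this is not necessarily the statistic IMG reports, and all counts in the usage lines are hypothetical:

```python
from math import erf, sqrt


def two_proportion_z(hits_a, total_a, hits_b, total_b):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a = hits_a / total_a
    p_b = hits_b / total_b
    pooled = (hits_a + hits_b) / (total_a + total_b)
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        raise ValueError("cannot test: pooled proportion is 0 or 1")
    z = (p_a - p_b) / se
    # Standard normal CDF expressed through the error function.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value


# Hypothetical counts: chemotaxis genes / all genes with a functional assignment.
z, p = two_proportion_z(900, 100_000, 300, 60_000)
```

A positive z with a small p-value would indicate the category is more frequent in the first sample; with thousands of categories tested, a multiple-testing correction is also needed.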

Is motility/chemotaxis common to all organisms in white mat?

• Scenario 1: the function/pathway is overrepresented because it is present in all members of the community, possibly at higher copy number

• Scenario 2: the function/pathway is overrepresented because it is present in one clade, which is absent from the second sample

Question: can we distinguish between the two scenarios?
Answer: click on the gene count for protein family/functional category, add all genes to Gene Cart=>add scaffolds to Scaffold Cart=>PhyloDistribution of all scaffolds in the Scaffold Cart
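
Once each gene in the category has a clade assignment from PhyloDistribution, the two scenarios differ in how concentrated the genes are. A sketch; the 0.8 cutoff and the function names are hypothetical choices for illustration:

```python
from collections import Counter


def dominant_clade_share(gene_clades):
    """Return (clade, share) for the clade contributing the most genes."""
    counts = Counter(gene_clades)
    if not counts:
        raise ValueError("no genes given")
    clade, n = counts.most_common(1)[0]
    return clade, n / sum(counts.values())


def likely_scenario(gene_clades, cutoff=0.8):
    """Label the pattern as Scenario 1 (spread out) or Scenario 2 (one clade)."""
    clade, share = dominant_clade_share(gene_clades)
    if share >= cutoff:
        return f"Scenario 2: mostly {clade} ({share:.0%})"
    return f"Scenario 1: spread across clades (top clade {clade}, {share:.0%})"
```

If most motility/chemotaxis genes map to a single clade that is rare or absent in the other sample, the overrepresentation reflects community composition rather than a community-wide trait.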

Are Sulfurimonas-like bacteria present in both samples?

Cluster        red mat / white mat  Taxonomy
Cluster1730    13 33                Epsilonproteobacteria; Campylobacterales; Helicobacteraceae; Helicobacter
Cluster5877    28                   Epsilonproteobacteria; Campylobacterales; Campylobacteraceae; Sulfurospirillum
Cluster8886    5                    Epsilonproteobacteria; Campylobacterales; Helicobacteraceae; Helicobacter
Cluster13550   8                    Epsilonproteobacteria; Campylobacterales; Arcobacteraceae; PL-7C7
Cluster14168   4                    Epsilonproteobacteria; Campylobacterales; Campylobacteraceae; Sulfurospirillum_arcachonense
Cluster17937   4                    Epsilonproteobacteria
Cluster20681   5                    Epsilonproteobacteria; Campylobacterales; Campylobacteraceae; Sulfurospirillum_arcachonense
Cluster22836   2                    Epsilonproteobacteria
Cluster35524   5                    Epsilonproteobacteria; Campylobacterales; Helicobacteraceae; Sulfurimonas
Cluster38665   2                    Epsilonproteobacteria; Sulfurovumales
Cluster44900   1                    Epsilonproteobacteria; Sulfurovumales; Sulfurovumaceae; Rimicaris_exoculata_ectosymbiont
Cluster57912   1                    Epsilonproteobacteria
Cluster60712   1                    Epsilonproteobacteria
Cluster76930   1 1                  Epsilonproteobacteria; unclassified; unclassified; Nitratiruptor
Cluster87523   1                    Epsilonproteobacteria; unclassified; unclassified; Nitratiruptor
Cluster160974  1                    Epsilonproteobacteria; Campylobacterales; Campylobacteraceae; Sulfurospirillum

The total number of sequences in all clusters assigned to Epsilonproteobacteria is 50 in white mat and 66 in red mat

Largest cluster in white mat includes 125K+ sequences
Largest cluster in red mat includes 14K+ sequences

Question: what about the presence of Sulfurimonas-like bacteria in the metagenomes?
Answer: go to Compare Genomes=>PhyloDistribution=>Genome vs Metagenomes, select the genome; the histogram shows the number of BLASTp hits from CDSs in all metagenomes to this genome

Are there any methylotrophs in the white mat?

Conclusions

The two communities differ in composition; the white mat, sampled next to the hydrothermal vent, has lower complexity

Community composition as sampled by pyrotags and the metagenome may be quite different due to a number of biases

Some protein families/functional categories are more abundant in one sample as compared to the other because of different community composition, and not necessarily because they are more important in this environment