18-21 August 2009
METAGENOMIC WORKSHOP
James R. Cole, Ph.D.
Ribosomal Database Project
Center for Microbial Ecology
Michigan State University

RDP Pyrosequencing Pipeline
- Tools for high-throughput analysis

Additional Functions
- Shannon Index
- Rarefaction
- Alignment Merger
- Chao 1 Estimate
- Library Compare
- Dereplicate

Export Formats for Common Tools
- EstimateS
- SPADE
- Phylip
- PAST
- R
- Mothur
- Many others compatible!

SPADE

PAST (PAlaeontological STatistics)

R
- Cluster Based Method

[Figure: two panels, Pigeon Pea and Bare Fallow. Taxa shown in each panel: TM7, Clostridia, Unclassified Bacteria, Actinobacteria, Bacteroidetes, Acidobacteria, Unclassified Proteobacteria, Deltaproteobacteria, Gammaproteobacteria, Verrucomicrobia, Bacilli, Planctomycetes, Gemmatimonadetes, Unclassified Firmicutes, Betaproteobacteria, Alphaproteobacteria. Series: Species, Genus, Family, Novel. Axis labels: Position by cluster order (thousands); Similarity; Total abundance.]

Pipeline Performance
Processing time for 52 samples, 350, FLX reads:
- Classifier: ~2 CPU hrs.
- Aligner: ~12 CPU hrs.
- Clustering: ~2 CPU hrs. (depends on sample sizes)
- SeqMatch: ~23 CPU hrs.
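The Additional Functions slide lists a Shannon Index and a Chao 1 estimate without showing how they are computed. The sketch below uses the textbook definitions, with the bias-corrected form of Chao1; it is not the RDP pipeline's own code, and the function names and the example counts are invented for illustration.

```python
import math


def shannon_index(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over OTUs with nonzero counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one sequence is required")
    return sum(-(c / total) * math.log(c / total) for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1 richness: S_obs + F1*(F1-1) / (2*(F2+1)).

    F1 is the number of singleton OTUs and F2 the number of doubleton OTUs.
    """
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical OTU abundance vector: sequences per cluster in one sample.
otu_counts = [10, 5, 2, 2, 1, 1, 1]
print("Shannon index:", shannon_index(otu_counts))
print("Chao1 estimate:", chao1(otu_counts))
```

Both functions take a list of per-OTU sequence counts, the kind of table a clustering step would produce for one sample.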
Usage Stats
- 380 users since June 2008
- April 2009 stats:
  - 182 initial process jobs
  - 1243 cluster jobs
  - 832 alignment jobs
  - >11 million sequences aligned
- RDP Pyro tools distributed to several major institutions

Analysis of 16S Variable Regions
- Important features

rRNA Gene Regions Processed by the RDP Pyrosequencing Pipeline
[Figure: regions V1-V2, V3, V4, and V6 marked along the 16S rRNA gene (5' to 3'). Y-axis: % of Sequence Covering Position.]

Variable-region statistics
The table below appears on each slide in this group, including the slides titled V3, V4, and V1 V2.

                     V6     V3     V4     V1-V2
% Aligned            78%    89%    99%    84%
Identical Species    49%    36%    37%    6%
Different Operons    10%    12%    7%     32%

Rows whose values are not preserved in this transcript: Canonical Positions, Conserved Base Pairs, % Size Range, % Missing Pairs (x10^-4), % Half Pairs (x10^-4), % Paired.

- Statistics from 300,000 Sanger Sequences (RDP release 10.11)
- Secondary-structure figures from [source not preserved]

Chance a Species Sequence is Identical to at Least One Other Species
- Based on 6,841 bacterial species type strain sequences
- Strain information from The Living Tree Project (projects/living-tree/)

Chance Two Operons Differ in One Organism
- Based on 561 completed genome sequences with two or more rRNA operons

Quality of Recovered Structure (V4, Sanger vs. FLX)
Surviving values; the column assignment is not preserved in this transcript:
- Avg. Size: 207
- % Missing Pairs: 0.3x x10 -4
- % Half Pairs: 8x x10 -2
- % Paired: 62
- % Aligned: 99%

Introduction to the Short Read Archive (SRA)
- myRDP SRA Prepkit

SRA Submission Format

Six Different SRA Document Types
- Submission
- Study
- Experiment
- Analysis
- Run
- Sample
[Diagram: the document types are linked by 1-to-N relationships.]

myRDP SRA Prepkit
[Diagram labels: myRDP SRA PREPKIT; SEQUENCE READS; XML DOCUMENTS; NCBI-SRA; EMBL-ERA; METADATA; SEQUENCING PROJECT; myRDP SWS; SUBMIT.]

Sample Attributes Prefilled
- Genomic Standards Consortium
- MIMS (Minimum Information about a Metagenome Sequence)*
*Nature Biotechnology 26, (2008)

Functional Genes

FGPR Home Page Screenshot

FGPR Screenshots
- seed sequences
- active links to GenBank records
- organism name
- display/filter options
- custom analysis

Functional Gene Pipeline/Repository Sequence Analysis
- interactive commands
- sub-selection for further analysis
- dynamic tree applet

Functional Gene Processing
1) Remove frameshifts
   1) tBLASTX
   2) GeneWise
2) Translate and align sequences
   1) HMMER
   2) MUSCLE
3) Determine conserved residues
   1) Entropy plot
4) Compare to reference sequences
   1) Determine functional subclass

Entropy (Dioxygenase Genes)

Taxomatic: Interactive Taxonomy Explorer
- Interactive distance matrix display
- Couples matrix with taxonomy information
- Allows rapid detection of taxonomic inconsistencies
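The "Determine conserved residues" step of Functional Gene Processing names an entropy plot but gives no detail. One common way to produce such a plot is to compute the Shannon entropy of each alignment column, where low entropy marks a conserved position. The sketch below assumes that approach; the function name, gap characters, and toy alignment are illustrative and not taken from the pipeline.

```python
import math
from collections import Counter


def column_entropies(aligned_seqs, gap_chars="-."):
    """Return the Shannon entropy (in bits) of every alignment column.

    Gap characters are ignored. A column made only of gaps yields None.
    An entropy of 0 means every sequence has the same residue there.
    """
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    entropies = []
    for column in zip(*aligned_seqs):
        residues = [r for r in column if r not in gap_chars]
        if not residues:
            entropies.append(None)
            continue
        n = len(residues)
        counts = Counter(residues)
        entropies.append(sum(-(c / n) * math.log2(c / n) for c in counts.values()))
    return entropies


# Toy protein alignment (hypothetical): column 0 is fully conserved.
alignment = ["MKA", "MRA", "MKG", "MR-"]
for position, h in enumerate(column_entropies(alignment)):
    print(position, h)
```

Plotting these values against alignment position gives the kind of entropy profile the slide refers to; conserved residues show up as troughs.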
Taxomatic: Interactive Taxonomy Explorer
- Integrated overlays
- Zoom and pan
- Can zoom down to individual sequences

MEGAN
- Taxonomic analysis through metagenomic data

MEGAN
- Modified k-nn LCA taxonomic classifier
- Requires BLAST result file
- Extracts taxonomy, COGs from matches
- Features from NCBI Prokaryotic Attributes Table

MEGAN Screenshot 1
MEGAN Screenshot 2
MEGAN Screenshot 3

Metagenomics Analysis Pipelines
- Sequence Comparison

General Considerations
- What databases are used?
  - GenBank nr (not good)
  - Pfam, TIGRfam, FIGfam?
- What search strategy is used?
  - BLAST, HMMER, additional tools?
- Will they process my data?
- Will my data become public?

HMMER vs BLAST

BMC Genomics Aziz

The SEED & RAST
- Subsystems: Pathway database
  - Expert annotation
  - Curated simultaneously across many genomes
- FIGfams: Database of protein families
  - Derived from Subsystems database
  - Controlled addition of new family members
- RAST: Genome annotation system
  - Uses FIGfams for gene annotation
  - Uses Subsystems for pathway annotation

The SEED & RAST
- BMC RAST Fig. 2
- BMC RAST Fig. 4

JGI's IMG/M HOME

CAMERA HOME
CAMERA DASHBOARD
CAMERA PROJECT SAMPLES

Metadata
- Data about data

Metadata Standards
- Minimum Information about a Microarray Experiment (MIAME)
- Minimum Information about a Genome Sequence (MIGS)
- Minimum Information about a Metagenome Sequence (MIMS)

Nature Biotechnology 26, (2008)

Minimum Information about a Genome Sequence (MIGS): Habitat Specific Attributes
Box 1
- MIMS extension: select to report a set of uniform measurements for a given habitat:
- Water body: (temperature, pH, salinity, pressure, chlorophyll, conductivity, light intensity, dissolved organic carbon (DOC), current, atmospheric data, density, alkalinity, dissolved oxygen, particulate organic carbon (POC), phosphate, nitrate, sulfates, sulfides, primary production) (integer, unit)

Soil Metadata Survey
To help establish a set of suggested attributes for soil sequence data
In cooperation with:
- The Genomic Standards Consortium
- The International Soils Metagenome Sequencing Consortium (Terragenome)

Soil Metadata Survey Summary
[Figure: survey results plotted by Importance (Not to Very) and Difficulty to obtain (Easy to Hard).]

Soil Metadata Survey Summary
VERY IMPORTANT / EASY TO OBTAIN
- Chemical: pH (in water or calcium chloride)
- Biological: plant cover (native)
- Soil/Geological: horizon
- Geographical: latitude and longitude, elevation
- Management: land use (e.g., urban, agriculture, forestry), tillage (type), crops (current, rotation), fertilizers (type and annual amount)
- Climate: mean and seasonal rainfall, mean and seasonal temperatures
- Sampling: depth, composite design, moisture content at sampling, area represented by composite sample, weight of sample used for DNA extraction

Technology Issues
- Limitations of Pyrosequencing

Gomez-Alvarez ISME Article

Gomez-Alvarez Fig. 1
Figure 1 (a) Alignment of five sequences in a cluster demonstrates the types of sequencing errors and length variation (highlighted in gray) included in a cluster. (b) Number of reads in a cluster versus the cluster number, ordered from the largest to smallest sized cluster; both axes are plotted on a log10 scale. (c) The best BLAST match and COG affiliation for four of the most abundant clusters in replicate soil metagenomes. (d) Distribution of exact duplicate and all replicate reads in a metagenomic dataset from soil (this study) and seawater metagenomes (Frias-Lopez et al., 2008; Mou et al., 2008). *Rep, technical replicates; +Sp, biological replicates. The number of reads in each category is presented in Table 1.

Gomez-Alvarez Table 1 (left)
Table 1 Total numbers of reads, exact duplicates and all replicate sequences, including duplicates, from representative metagenomic data sets
Column headings preserved: Habitat (metagenome); Number of reads
Gomez-Alvarez, V., Teal, T.K., Schmidt, T.M. (July 2009) Systematic artifacts in metagenomes from complex microbial communities. ISME Journal advance online publication. doi: /ismej

PyroNoise Article

Pyro Fig. 1
Figure 1 | OTU number as a function of percentage sequence difference for 90 pyrosequenced 16S rRNA gene clones of known sequence. (a,b) Results are repeated for complete linkage (a) and average linkage algorithms (b).

Pyro Fig. 2
Figure 2 | Proportion of sequences assigned to the correct OTU as a function of percentage sequence difference for pyrosequenced 16S rRNA gene clones of known sequence. (a,b) Results are repeated for complete linkage (a) and average linkage algorithms (b).

Pyro Table 1
Quince, C., Lanzén, A., Curtis, T.P., Davenport, R.J., Hall, N., Head, I.M., Read, L.F., and Sloan, W.T. (2009) Accurate determination of microbial diversity from 454 pyrosequencing data. Nature Methods Advanced Online Publication Aug doi: /NMETH.1361
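The Pyro Fig. 1 and Fig. 2 captions compare OTU counts under complete linkage and average linkage. The sketch below is a minimal pure-Python illustration of why the two rules can give different OTU numbers at the same distance cutoff. It is a toy agglomerative procedure, not the method used in the cited paper or in the RDP pipeline, and the function name and distance matrix are made up for the example.

```python
from itertools import combinations


def cluster_otus(dist, cutoff, linkage="complete"):
    """Agglomerative clustering of sequences into OTUs.

    dist    -- symmetric matrix of pairwise distances (fraction difference)
    cutoff  -- stop merging once the closest clusters are farther than this
    linkage -- "complete" (largest pairwise distance between clusters)
               or "average" (mean pairwise distance between clusters)
    """
    clusters = [[i] for i in range(len(dist))]

    def cluster_distance(a, b):
        pair_dists = [dist[i][j] for i in a for j in b]
        if linkage == "complete":
            return max(pair_dists)
        if linkage == "average":
            return sum(pair_dists) / len(pair_dists)
        raise ValueError("linkage must be 'complete' or 'average'")

    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage rule.
        d, x, y = min(
            (cluster_distance(clusters[x], clusters[y]), x, y)
            for x, y in combinations(range(len(clusters)), 2)
        )
        if d > cutoff:
            break
        clusters[x] = clusters[x] + clusters[y]  # x < y, so index y is still valid
        del clusters[y]
    return clusters


# Hypothetical distances among four reads (0.01 means 1% difference).
toy_dist = [
    [0.000, 0.010, 0.035, 0.100],
    [0.010, 0.000, 0.020, 0.100],
    [0.035, 0.020, 0.000, 0.100],
    [0.100, 0.100, 0.100, 0.000],
]

complete_otus = cluster_otus(toy_dist, cutoff=0.03, linkage="complete")
average_otus = cluster_otus(toy_dist, cutoff=0.03, linkage="average")
print("complete linkage:", complete_otus)
print("average linkage:", average_otus)
```

In this toy matrix, read 2 lies within 3% of read 1 but not of read 0, so complete linkage keeps it as a separate OTU while average linkage merges it. The example only shows that the choice of linkage rule changes the OTU count at a fixed cutoff.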