Public data resources for metagenomics

Alex Mitchell, mitchell@ebi.ac.uk

My background

Doctorate in pharmacology (1995-1998)

Post-doc in molecular biology (1998-2001)

Bioinformatics research (2001-2011)

Co-ordinator for InterPro and EBI metagenomics databases (2011-)

Overview

• Considerations for the analysis of metagenomic sequence data

• What public metagenomic analysis resources offer

• The EBI metagenomics resource

What is metagenomics?

“Metagenomics is the study of metagenomes, genetic material recovered directly from environmental samples.”

“Metagenomics is the study of all genomes present in any given environment without the need for prior individual identification or amplification”

The term “metagenome” was used by Handelsman et al. in 1998 to describe the “collective genomes of soil microflora”

“Metagenomics” means literally ‘beyond genomics’

Sequencing

Filtering step

Extraction of DNA

Sampling from environment

Quality control

Taxonomic analysis

Functional analysis

16S rRNA, 18S rRNA, ITS, etc.

Identification and characterisation of protein coding sequences

Applications of taxonomic analyses

Diversity analysis

Identification of new species

Comparing populations from different sites or states

Applications of functional analyses

Bioprospecting for novel sequences with functional applications

Reconstruction of pathways present in the community

Comparing functional activities from different sites or states

• Short sequence fragments are hard to characterise

• Assembly can lead to chimeras

• Iddo Friedberg: ‘Metagenomics is like a disaster in a jigsaw shop’

• Millions of different pieces

• Thousands of different puzzles

• All mixed together

• Most of the pieces are missing

• No boxes to refer to

Why is metagenomics challenging?

Limitations and pitfalls

Data used for analysis can have limitations:

• 16S rRNA genes - limited resolving power and subject to copy number variation

• Viral sequences – currently no gold-standard reference database

• Protist sequences – little experimentally-derived annotation of protein function in public databases

Additional pitfalls

• Different functional and taxonomic analysis tools can give different results

• The same tools can give different results depending on the version and underlying algorithm (e.g., HMMER2 vs HMMER3)

• The same version of the same tools can give different results depending on the reference database used

Reference databases

Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. Available at: www.genome.gov/sequencingcosts.

Other considerations: data analysis speed

• The cost of sequencing has really gone down

• Now I can do metagenomics!

• Awesome!

• Amount of sequence generated has increased 5,000-fold

• Computational speed has increased only 10-fold

• Time taken to analyse has increased 500-fold

• $@%*!!!
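The 500-fold figure follows directly from the two growth rates above. A minimal sketch of the arithmetic, assuming analysis time scales linearly with data volume and inversely with compute speed (the variable names are illustrative only):

```python
# Fold-changes quoted on the slide.
sequence_growth = 5_000   # increase in amount of sequence generated
compute_growth = 10       # increase in computational speed

# Assumption: analysis time is proportional to data volume and
# inversely proportional to compute speed.
analysis_time_growth = sequence_growth / compute_growth

print(f"Time taken to analyse increases ~{analysis_time_growth:.0f}-fold")
```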

Data analysis speed

[Figure: cost breakdown chart; surviving labels: 70 %, 14.5 %, 36.5 %, 14.5 %, ~80 bp/$, ~2m bp/$]

Sboner et al. Genome Biology (2011) 12:125

Data analysis cost

Raw sequence data:

• Important for metagenomics as some samples are hard to replicate

• Large file sizes

Analysis results?

• Easiest to repeat, although it takes time & requires keeping track of analysis steps and versions

Data description including metadata

• Essential: what, where, who, how and when

• If absent, raw data have very limited usefulness

What data to store?

Metadata includes the in-depth, controlled description of the sample that your sequence was taken from

The importance of metadata

Where did it come from? What were the environmental conditions (lat/long, depth, pH, salinity, temperature…) or clinical observations?

How was it sampled? How was it extracted? How was it stored? What sequencing platform was used?

• If metadata is adequately described, using a standardised vocabulary, querying and interpretation across projects becomes possible

The importance of metadata

Show the microbial species found in the North Pacific

… at depths of 50 – 100 m

… in samples taken May-June

… compared to the Indian Ocean, under the same conditions
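A query like the one above only works if every sample carries structured metadata. The sketch below shows the idea on a handful of made-up records; the `Sample` class, the field names and the placeholder taxa are all hypothetical and do not reflect any real repository schema or real findings.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class Sample:
    sample_id: str
    region: str        # controlled vocabulary term, e.g. "North Pacific"
    depth_m: float     # sampling depth in metres
    collected: date    # collection date
    taxa: List[str]    # taxa reported for the sample

# Invented records, for illustration only.
samples = [
    Sample("S1", "North Pacific", 75.0, date(2012, 5, 20), ["taxon_A"]),
    Sample("S2", "North Pacific", 300.0, date(2012, 6, 2), ["taxon_B"]),
    Sample("S3", "Indian Ocean", 60.0, date(2012, 6, 11), ["taxon_C"]),
    Sample("S4", "North Pacific", 90.0, date(2012, 11, 3), ["taxon_D"]),
]

def taxa_found(samples, region, min_depth, max_depth, months):
    """Collect taxa from samples matching region, depth range and month."""
    found = set()
    for s in samples:
        if (s.region == region
                and min_depth <= s.depth_m <= max_depth
                and s.collected.month in months):
            found.update(s.taxa)
    return sorted(found)

north_pacific = taxa_found(samples, "North Pacific", 50, 100, {5, 6})
indian_ocean = taxa_found(samples, "Indian Ocean", 50, 100, {5, 6})
```

If depth were recorded as free text ("approx. 75 metres") or the region under varying spellings, none of these comparisons would be possible without manual curation.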

Where are you going to store this?

• Locally: back-up? long term? sharing? access?

• Amazon, Google or specialist research clouds

• Public repositories, such as ENA, NCBI or DDBJ

Considerations: storing data

• Free!

• Secure long term storage

• No need for local infrastructure

• Enforced compliance:
  • Publisher requirements (accession numbers)
  • Institutional requirements
  • Funder requirements

• Data are more useful:
  • Data are reusable and can be discovered by others
  • Available for re- and meta-analyses

Public repositories

• Transferring a 100 Gb NGS data file across the internet
  • 'Normal' network bandwidth (1 Gigabit/s) ~ 1 week*
  • High-speed bandwidth (10 Gigabit/s) < 1 day*

Considerations: moving data

* Stein, Genome Biol. (2010) 11:207

Traditional methods may be the most effective!

Metagenomics portals

http://www.ebi.ac.uk/metagenomics

http://metagenomics.anl.gov/

http://camera.calit2.net/

http://img.jgi.doe.gov/

Submit data

Sequence analysis (prebuilt workflows)

Quality filtering of sequences

Visualisation/Interpretation

What do metagenomics portals offer?

Sequence archiving

Tools to help capture & store metadata

Tools to help transfer data

Data archiving

Powerful analysis

Easy submission

A free resource for the analysis, archiving & browsing of metagenomic study data

http://www.ebi.ac.uk/metagenomics

(1) Register for an account

(2) Upload sequence data and metadata

(3) Sequence data is archived in ENA and accessioned

(4) Sequence data is analysed by the pipeline

(5) Projects, metadata and results are made available on the website for private or public browsing / download

The submission & analysis process

~ 1-2 weeks, depending on study size, compute farm usage, etc

The submission process can be run interactively

The GSC (Genomic Standards Consortium) has created minimum standards for metagenomics metadata

Metagenomics standards

Metadata is captured via a GSC-compliant checklist

GSC MIxS

rRNAselector

reads with rRNA

reads without rRNA

FragGeneScan

predicted CDS

Amplicon-based data

processed reads

discarded reads

raw reads

Taxonomic analysis

InterProScan

Function assignment

Unknown function

The sequence analysis pipeline

EBI Metagenomics: QC step by step

• Clipping - low quality ends trimmed and adapter sequences removed

• Quality filtering - sequences with > 10% undetermined nucleotides removed

• Read length filtering - short sequences (< 100 nt) are removed

• Duplicate sequence removal - clustered on 99% identity (UCLUST v 1.1.579) and a representative sequence chosen

• Repeat masking - RepeatMasker (open-3.2.2), removes reads with 50% or more nucleotides masked (low complexity regions)
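Two of the steps above (the undetermined-nucleotide filter and the length filter) are simple enough to sketch directly. This is an illustration under stated assumptions, not the EBI Metagenomics implementation: clipping and RepeatMasker are omitted, and exact-duplicate removal stands in for the 99% identity clustering done with UCLUST. All function and constant names are hypothetical.

```python
MAX_N_FRACTION = 0.10   # reads with > 10% undetermined nucleotides are removed
MIN_LENGTH = 100        # reads shorter than 100 nt are removed

def passes_basic_qc(seq: str) -> bool:
    """Apply the length and undetermined-nucleotide filters to one read."""
    if len(seq) < MIN_LENGTH:
        return False
    n_fraction = seq.upper().count("N") / len(seq)
    return n_fraction <= MAX_N_FRACTION

def basic_qc(reads):
    """Keep reads that pass the filters, dropping exact duplicates.

    Simplification: real duplicate removal clusters reads at 99% identity
    and keeps a representative; here only identical reads are collapsed.
    """
    seen = set()
    kept = []
    for seq in reads:
        key = seq.upper()
        if passes_basic_qc(seq) and key not in seen:
            seen.add(key)
            kept.append(seq)
    return kept

reads = [
    "ACGT" * 30,              # 120 nt, clean: kept
    "ACGT" * 30,              # exact duplicate: dropped
    "ACGT" * 10,              # 40 nt: too short
    "N" * 20 + "ACGT" * 25,   # 120 nt, ~17% N: dropped
]
filtered = basic_qc(reads)
```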

EBI Metagenomics: QC consequences

Roche 454

Illumina

Ion Torrent

EBI Metagenomics: taxonomic analysis

rRNAselector

reads with rRNA

Amplicon-based data

processed reads

Taxonomic analysis

Taxonomic analysis with EBI Metagenomics

EBI Metagenomics currently provides taxonomic analysis for prokaryotes only.

rRNA sequences are identified using rRNASelector:

hidden Markov models are used to identify rRNA sequences

60 bp minimum overlap with well-curated HMM model

E-value < 10^-5

Annotations are assigned using QIIME:

rRNA are annotated using the Greengenes reference database
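The two rRNA selection thresholds quoted above (minimum overlap and E-value) amount to a simple filter over HMM hits. A minimal sketch, assuming hits are already parsed into records; `HmmHit` and its fields are hypothetical and are not the rRNASelector output format.

```python
from typing import NamedTuple

class HmmHit(NamedTuple):
    read_id: str
    overlap_bp: int    # length of the overlap with the rRNA HMM
    e_value: float     # E-value of the match

MIN_OVERLAP_BP = 60    # slide: 60 bp minimum overlap
MAX_E_VALUE = 1e-5     # slide: E-value < 10^-5

def is_rrna_hit(hit: HmmHit) -> bool:
    """True if the hit meets both thresholds."""
    return hit.overlap_bp >= MIN_OVERLAP_BP and hit.e_value < MAX_E_VALUE

# Invented hits, for illustration only.
hits = [
    HmmHit("read_1", 120, 1e-20),   # passes both thresholds
    HmmHit("read_2", 45, 1e-30),    # overlap too short
    HmmHit("read_3", 200, 1e-3),    # E-value too weak
]
rrna_reads = [h.read_id for h in hits if is_rrna_hit(h)]
```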

EBI Metagenomics taxonomy visualizations

Re-analysis of: Sutton et al. (2013), Impact of Long-Term Diesel Contamination on Soil Microbial Community Structure.

Validation of taxonomic analysis

Alpha diversity analysis

polluted

clean (outlier)

EBI Metagenomics: overview of functional analysis

reads without rRNA

FragGeneScan

predicted CDS

InterProScan

Function assignment

Unknown function

EBI Metagenomics: functional annotation

EBI Metagenomics uses FragGeneScan to predict CDSs directly from the reads:

hidden Markov models that use codon usage to correct frame-shifts

probabilistic identification of start and stop codons

60 bp minimum ORF

Annotation is carried out using InterProScan to mine a subset of the InterPro database

Why not BLAST?

• BLAST: Basic Local Alignment Search Tool

• Relatively fast

• User friendly

• Very good at recognising similarity between closely related sequences

Using BLAST for annotation

Because BLAST performs local pairwise alignment, it:

• can sometimes struggle with multi-domain proteins

• is less useful for weakly-similar sequences (e.g., divergent homologues)

BLAST alignment of 2 proteins: 60S acidic ribosomal protein P0 from 2 closely-related species

60S acidic ribosomal protein P0: multiple sequence alignment

An alternative approach

• Alternatively, we can model the pattern of conserved amino acids at specific positions within a multiple sequence alignment

• We can use these models to infer relationships with the characterised sequences from which the alignment was constructed

• This is the approach taken by protein signature databases
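The idea of modelling conserved positions can be illustrated with a simple position-specific scoring matrix built from an ungapped alignment. This is a deliberately reduced sketch: it is not a hidden Markov model (there are no insert or delete states), it assumes a uniform background distribution, and the toy alignment is invented. It is not how any particular signature database builds its models.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Per-column log-odds scores relative to a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def score(profile, seq):
    """Sum the column scores for a sequence of the same length as the profile."""
    return sum(column.get(aa, 0.0) for column, aa in zip(profile, seq))

# Invented, ungapped toy alignment of a short conserved motif.
alignment = ["GDSGG", "GDSGA", "GDSGS", "GESGG"]
profile = build_profile(alignment)

match_score = score(profile, "GDSGG")      # resembles the alignment
unrelated_score = score(profile, "AAAAA")  # does not
```

A sequence that shares the conserved residues scores well even if it is not closely similar overall to any single sequence in the alignment, which is where pairwise methods tend to struggle.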

Full alignment methods

Single motif methods

Patterns

Multiple motif methods

Fingerprints

Three different protein signature approaches

Profiles & Hidden Markov models (HMMs)

* For a detailed description, see: https://www.ebi.ac.uk/training/online/course/introduction-protein-classification-ebi

Structural domains

Functional annotation of families/domains

Protein features (sites)

Hidden Markov Models, Fingerprints, Profiles, Patterns

The aim of InterPro

InterPro

Features of InterPro

• Manually checked and updated against a manually annotated database

• Errors are identified and fixed

• Annotated with full text abstracts and Gene Ontology terms

… with a brief diversion into the Gene Ontology…

http://geneontology.org/

Aims of the Gene Ontology

• Allow cross-species and/or cross-database comparisons

• Unify the representation of gene and gene product attributes across species

English is not a very precise language

• Same name for different concepts

• Different names for the same concept

Inconsistency in naming of biological concepts

An example …

Tactition

Tactile sense

Taction

Sensory perception of touch ; GO:0050975

• A way to capture biological knowledge in a written and computable form

The Gene Ontology

• A set of concepts and their relationships to each other arranged as a hierarchy

www.ebi.ac.uk/QuickGO

Less specific concepts

More specific concepts

The Concepts in GO

1. Molecular Function

2. Biological Process

3. Cellular Component

An elemental activity or task or job

• protein kinase activity

• insulin receptor activity

A commonly recognised series of events

• cell division

Where a gene product is located

• mitochondrion

• mitochondrial matrix

• mitochondrial inner membrane

Anatomy of a GO term

Unique identifier

Term name

Definition

Synonyms
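The four parts of a GO term map naturally onto a small record type. The sketch below uses the touch example from the earlier slide; the `GOTerm` class and the `refers_to` helper are hypothetical, and the definition is left empty because the slides do not quote it.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class GOTerm:
    identifier: str                  # unique identifier, e.g. "GO:0050975"
    name: str                        # term name
    definition: str = ""             # textual definition (not quoted on the slide)
    synonyms: Tuple[str, ...] = ()   # alternative names for the same concept

    def refers_to(self, label: str) -> bool:
        """Case-insensitive match against the term name and its synonyms."""
        wanted = label.strip().lower()
        return wanted in {self.name.lower(), *(s.lower() for s in self.synonyms)}

touch = GOTerm(
    identifier="GO:0050975",
    name="sensory perception of touch",
    synonyms=("tactition", "tactile sense", "taction"),
)

# Different names, one concept, one stable identifier.
labels = ["Tactition", "Tactile sense", "Taction"]
resolved = [touch.identifier for label in labels if touch.refers_to(label)]
```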

InterPro2GO

InterPro

We now return to your scheduled programming...

Using InterPro for annotation

• Underlies the automated system that adds annotation to UniProtKB/TrEMBL

• Provides matches to 67 million proteins - over 80% of UniProtKB

• Source of ~170 million GO mappings for ~50 million distinct UniProtKB sequences

Annotation consistency:

• Using InterPro and GO for annotation allows direct comparison with all of the proteins in UniProtKB

Analysing metagenomic sequences with InterPro

Considerations for metagenome analysis:

• Vast numbers of short reads

• analysis speed

• ability to cope with sequence fragments

• Making sense of output
  • visualisation on web site
  • downstream analysis and sample comparison

Structural domains

Functional annotation of families/domains

Protein features (sites)

Hidden Markov Models, Fingerprints, Patterns

Databases

Assembly of metagenomics data

• Metagenomics: it is not clear how to avoid assembling sequences from different species together (chimaeras)

EBI Metagenomics does not perform assembly

We are still able to annotate metagenome data, as shown by this re-analysis of the rumen metagenomics data of Hess et al. (2011)

Visualising data: InterProScan results

Visualising data: GO Slims

• GO slims are cut-down versions of the GO ontologies containing a subset of the terms in the whole GO

• Give a broad overview of the ontology content without the detail of the specific fine-grained terms
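Mapping annotations onto a slim means rolling each fine-grained term up to its nearest ancestor that is in the slim. A minimal sketch, assuming a heavily simplified toy hierarchy built from the cellular component examples used earlier; the `PARENTS` map and the counts are invented, and the real GO graph has more relationship types and intermediate terms.

```python
from collections import Counter

# Toy child -> parents map (simplified; not the real GO structure).
PARENTS = {
    "mitochondrial matrix": ["mitochondrion"],
    "mitochondrial inner membrane": ["mitochondrion"],
    "mitochondrion": ["cellular component"],
    "cellular component": [],
}

# The slim keeps only the broad term.
SLIM = {"mitochondrion"}

def slim_terms(term):
    """Return the nearest slim term(s) at or above `term` in the hierarchy."""
    if term in SLIM:
        return {term}
    result = set()
    for parent in PARENTS.get(term, []):
        result |= slim_terms(parent)
    return result

def slim_counts(annotation_counts):
    """Aggregate per-term annotation counts onto slim terms."""
    totals = Counter()
    for term, n in annotation_counts.items():
        for slim in slim_terms(term):
            totals[slim] += n
    return totals

# Invented annotation counts, for illustration only.
counts = {
    "mitochondrial matrix": 3,
    "mitochondrial inner membrane": 2,
    "mitochondrion": 1,
}
summary = slim_counts(counts)
```

Terms with no ancestor in the slim (here, "cellular component" itself) simply drop out of the summary, which is one reason the choice of slim matters for a given kind of data.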

GO Slims

Slimmed term:

Visualising data: GO slims

• For visualisation, EMG uses a GO slim specially developed for metagenomic data sets

EBI Metagenomics output files

sequence files

tab or comma separated files

TreeView, TOL, Newick Viewer …

Megan …

sequence files

Simplified overview of MG-RAST pipeline

Reads

Quality control

Feature prediction (FragGeneScan)

Clustering (Uclust)

Protein databases

http://metagenomics.anl.gov/

Abundance profiles

Metabolic reconstruction

Metabolic model

RNA database

Blat

rRNAs

SILVA

Community profiles

Blat

ammonia monooxygenase: NH3 + A-H2 + O2 → NH2OH + A + H2O

12 Ammonia monooxygenase
2 ammonia monooxygenase family protein
4 Ammonia monooxygenase subunit A
5 Ammonia monooxygenase, putative
62 Putative ammonia monooxygenase
3 putative ammonia monooxygenase protein
4 putative ammonia monooxygenase subunit A

EBI Metagenomics: 3 IPR003393 Ammonia monooxygenase/particulate methane monooxygenase, subunit A

25 IPR007820 Putative ammonia monooxygenase/protein AbrB

8 KEGG, 18 eggNOG, 13 GenBank, 11 IMG, 8 PATRIC, 10 RefSeq, 12 TrEMBL, 9 SEED

MG-RAST & EBI Metagenomics Functional analysis

MG-RAST: 92 hits to 8 different databases
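The MG-RAST hit counts listed on this slide show how free-text names fragment what is essentially one function. The sketch below tallies the seven names and then collapses them with a crude normalisation rule; the rule (lower-casing and dropping "putative", "family" and "protein") is an ad hoc illustration, not something either resource does.

```python
from collections import Counter

# Hit counts per annotation name, as listed on the slide.
mgrast_names = {
    "Ammonia monooxygenase": 12,
    "ammonia monooxygenase family protein": 2,
    "Ammonia monooxygenase subunit A": 4,
    "Ammonia monooxygenase, putative": 5,
    "Putative ammonia monooxygenase": 62,
    "putative ammonia monooxygenase protein": 3,
    "putative ammonia monooxygenase subunit A": 4,
}

total_hits = sum(mgrast_names.values())

FILLER_WORDS = {"putative", "family", "protein"}

def normalise(name: str) -> str:
    """Crude, illustrative normalisation of a free-text annotation name."""
    words = name.lower().replace(",", " ").split()
    return " ".join(w for w in words if w not in FILLER_WORDS)

collapsed = Counter()
for name, n in mgrast_names.items():
    collapsed[normalise(name)] += n

print(total_hits)        # 92
print(dict(collapsed))
```

Seven distinct strings reduce to two groups under this rule, whereas a signature-based annotation attaches stable identifiers (such as the InterPro entries on this slide) that can be counted and compared directly.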

Example: Analysis of Prairie Soil Sample

1 ammonia monooxygenase family protein
2 ammonia monooxygenase subunit A
1 ammonia monooxygenase, putative
6 putative ammonia monooxygenase
2 Putative ammonia monooxygenase
1 putative ammonia monooxygenase subunit A

13 GenBank

MG-RAST & EBI Metagenomics Taxonomy analysis

MG-RAST

EBI Metagenomics: only Prokaryotic taxonomy (333 OTU)

Bacteria

Archaebacteria

Eukaryotes

Others (including virus)

(55 categories)

(15 categories)

(98 categories)

(3 types)

domain level of taxonomy

Phylum level of bacteria domain taxonomy

28 categories

MG-RAST

13 OTU

EBI Metagenomics

MG-RAST & EBI Metagenomics Taxonomy analysis

http://img.jgi.doe.gov/m

Some other metagenomics packages and tools

http://www.computationalbioenergy.org/software.html

http://ab.inf.uni-tuebingen.de/software/megan/

http://cbcb.umd.edu/software/metAMOS

CloVR metagenomics

http://clovr.org/methods/clovr-metagenomics/

Hands-on session

• Using InterProScan to analyse a single metagenomic sequence

• Exploring EMG Portal’s analysis of a metagenomic data set

• Comparing analysis results for samples within a project using STAMP

Questions?
