Public data resources for metagenomics
![Page 2: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/2.jpg)
My background
Doctorate in pharmacology (1995-1998)
Post-doc in molecular biology (1998-2001)
Bioinformatics research (2001-2011)
Co-ordinator for InterPro and EBI metagenomics databases (2011-)
![Page 3: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/3.jpg)
My background
![Page 4: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/4.jpg)
Overview
• Considerations for the analysis of metagenomic sequence data
• What public metagenomic analysis resources offer
• The EBI metagenomics resource
![Page 5: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/5.jpg)
What is metagenomics?
“Metagenomics is the study of metagenomes, genetic material recovered directly from environmental samples.”
“Metagenomics is the study of all genomes present in any given environment without the need for prior individual identification or amplification”
“Metagenome” was used by Handelsman et al. in 1998 to describe the “collective genomes of soil microflora”
“Metagenomics” means literally ‘beyond genomics’
![Page 6: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/6.jpg)
Sequencing
Filtering step
Extraction of DNA
Sampling from environment
![Page 7: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/7.jpg)
Quality control
Taxonomic analysis: 16S rRNA, 18S rRNA, ITS, etc.
Functional analysis: identification and characterisation of protein coding sequences
![Page 8: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/8.jpg)
Applications of taxonomic analyses
• Diversity analysis
• Identification of new species
• Comparing populations from different sites or states
![Page 9: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/9.jpg)
Applications of functional analyses
• Bioprospecting for novel sequences with functional applications
• Reconstruction of pathways present in the community
• Comparing functional activities from different sites or states
![Page 10: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/10.jpg)
Why is metagenomics challenging?
• Short sequence fragments are hard to characterise
• Assembly can lead to chimeras
• Iddo Friedberg: ‘Metagenomics is like a disaster in a jigsaw shop’
  • Millions of different pieces
  • Thousands of different puzzles
  • All mixed together
  • Most of the pieces are missing
  • No boxes to refer to
![Page 11: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/11.jpg)
Limitations and pitfalls
Data used for analysis can have limitations:
• 16S rRNA genes - limited resolving power and subject to copy number variation
• Viral sequences – currently no gold-standard reference database
• Protist sequences – little experimentally-derived annotation of protein function in public databases
![Page 12: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/12.jpg)
Additional pitfalls
• Different functional and taxonomic analysis tools can give different results
• The same tools can give different results depending on the version and underlying algorithm (e.g., HMMER2 vs HMMER3)
• The same version of the same tools can give different results depending on the reference database used
![Page 13: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/13.jpg)
Reference databases
![Page 14: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/14.jpg)
Reference databases
![Page 15: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/15.jpg)
Reference databases
![Page 16: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/16.jpg)
Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. Available at: www.genome.gov/sequencingcosts.
Other considerations: data analysis speed
![Page 17: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/17.jpg)
• The cost of sequencing has really gone down
• Now I can do metagenomics!
• Awesome!
• Amount of sequence generated has increased 5,000-fold
• Computational speed has increased only 10-fold
• Time taken to analyse has increased 500-fold
• $@%*!!!
Data analysis speed
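The 500-fold figure on this slide follows from dividing the growth in data by the growth in compute speed. A minimal sketch of that arithmetic, assuming analysis time scales linearly with the amount of sequence (the variable names are illustrative, not from the talk):

```python
# Figures quoted on the slide
data_growth = 5000       # fold increase in the amount of sequence generated
compute_speedup = 10     # fold increase in computational speed

# Assumption: analysis time is proportional to data volume / compute speed
analysis_time_growth = data_growth / compute_speedup

print(f"Analysis now takes about {analysis_time_growth:.0f}x longer")
```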
![Page 18: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/18.jpg)
Data analysis cost
[Figure: percentage breakdown of costs at ~80 bp/$ and at ~2m bp/$. Source: Sboner et al. Genome Biology (2011) 12:125]
![Page 19: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/19.jpg)
Raw sequence data:
• Important for metagenomics as some samples are hard to replicate
• Large file sizes
Analysis results?
• Easiest to repeat, although it takes time & requires keeping track of analysis steps and versions
Data description including metadata
• Essential: what, where, who, how and when
• If absent, raw data have very limited usefulness
What data to store?
![Page 20: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/20.jpg)
Metadata includes the in-depth, controlled description of the sample that your sequence was taken from
The importance of metadata
Where did it come from? What were the environmental conditions (lat/long, depth, pH, salinity, temperature…) or clinical observations?
How was it sampled? How was it extracted? How was it stored? What sequencing platform was used?
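As a rough illustration of what a controlled sample description might look like in practice, the sketch below stores the kinds of attributes listed above in a plain dictionary and checks for gaps. The field names, the set of required fields and all values are hypothetical; this is not the GSC checklist or any repository's submission format.

```python
# Hypothetical minimum set of fields for this illustration only
REQUIRED_FIELDS = ["latitude", "longitude", "depth_m",
                   "collection_date", "sequencing_platform"]

def missing_fields(sample, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a sample record."""
    return [f for f in required if sample.get(f) in (None, "")]

# Made-up example records
sample = {
    "latitude": 30.0, "longitude": -150.0, "depth_m": 75,
    "ph": 8.1, "salinity_psu": 35.0, "temperature_c": 18.5,
    "collection_date": "2013-05-20", "sequencing_platform": "Illumina",
}
incomplete = {"latitude": 30.0, "longitude": -150.0}

print(missing_fields(sample))      # nothing missing
print(missing_fields(incomplete))  # fields that still need to be captured
```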
![Page 21: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/21.jpg)
• If metadata is adequately described, using a standardised vocabulary, querying and interpretation across projects becomes possible
The importance of metadata
Show the microbial species found in the North Pacific
… at depths of 50 – 100 m
… in samples taken May-June
… compared to the Indian Ocean, under the same conditions
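The query on this slide only works if every project records region, depth and date in the same way. A minimal sketch of such a cross-project query, assuming samples are held as simple records (the record layout, the function name `taxa_found` and the placeholder taxa are all made up for illustration):

```python
from datetime import date

# Made-up sample records from two hypothetical projects
samples = [
    {"region": "North Pacific", "depth_m": 75,  "collected": date(2012, 5, 14), "taxa": {"taxon_A", "taxon_B"}},
    {"region": "North Pacific", "depth_m": 300, "collected": date(2012, 6, 2),  "taxa": {"taxon_C"}},
    {"region": "Indian Ocean",  "depth_m": 60,  "collected": date(2011, 6, 20), "taxa": {"taxon_B", "taxon_D"}},
    {"region": "Indian Ocean",  "depth_m": 80,  "collected": date(2011, 11, 3), "taxa": {"taxon_E"}},
]

def taxa_found(samples, region, depth_range=(50, 100), months=(5, 6)):
    """Collect taxa from samples matching region, depth range (m) and collection months."""
    low, high = depth_range
    found = set()
    for s in samples:
        if (s["region"] == region
                and low <= s["depth_m"] <= high
                and s["collected"].month in months):
            found |= s["taxa"]
    return found

north_pacific = taxa_found(samples, "North Pacific")
indian_ocean = taxa_found(samples, "Indian Ocean")
shared = north_pacific & indian_ocean  # taxa seen at both sites under the same conditions
print(sorted(north_pacific), sorted(indian_ocean), sorted(shared))
```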
![Page 22: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/22.jpg)
Where are you going to store this?
• Locally: back-up? long term? sharing? access?
• Amazon, Google or specialist research clouds
• Public repositories, such as ENA, NCBI or DDBJ
Considerations: storing data
![Page 23: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/23.jpg)
• Free!
• Secure long term storage
• No need for local infrastructure
• Enforced compliance:
  • Publisher requirements (accession numbers)
  • Institutional requirements
  • Funder requirements
• Data are more useful:
  • Data are reusable and can be discovered by others
  • Available for re- and meta-analyses
Public repositories
![Page 24: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/24.jpg)
• Transferring a 100 Gb NGS data file across the internet
  • 'Normal' network bandwidth (1 Gigabit/s) ~ 1 week*
  • High-speed bandwidth (10 Gigabit/s) < 1 day*
Considerations: moving data
* Stein, Genome Biol. (2010) 11:207
Traditional methods may be the most effective!
![Page 25: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/25.jpg)
Metagenomics portals
http://www.ebi.ac.uk/metagenomics
http://metagenomics.anl.gov/
http://camera.calit2.net/
http://img.jgi.doe.gov/
![Page 26: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/26.jpg)
What do metagenomics portals offer?
• Submit data
• Sequence analysis (prebuilt workflows)
• Quality filtering of sequences
• Visualisation/Interpretation
• Sequence archiving
• Tools to help capture & store metadata
• Tools to help transfer data
![Page 27: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/27.jpg)
Data archiving
Powerful analysis
Easy submission
A free resource for the analysis, archiving & browsing of metagenomic study data
http://www.ebi.ac.uk/metagenomics
![Page 28: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/28.jpg)
(1) Register for an account
(2) Upload sequence data and metadata
(3) Sequence data is archived in ENA and accessioned
(4) Sequence data is analysed by the pipeline
(5) Projects, metadata and results are made available on the website for private or public browsing / download
The submission & analysis process
~ 1-2 weeks, depending on study size, compute farm usage, etc
![Page 29: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/29.jpg)
The submission process can be run interactively
![Page 30: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/30.jpg)
The GSC (Genomic Standards Consortium) has created minimum standards for metagenomics metadata
Metagenomics standards
![Page 31: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/31.jpg)
Metadata is captured via a GSC-compliant checklist
GSC MIxS
![Page 32: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/32.jpg)
The sequence analysis pipeline
[Diagram: raw reads → QC → processed reads (with discarded reads removed)]
[Taxonomic branch: rRNAselector → reads with rRNA → Qiime → Taxonomic analysis; amplicon-based data also feed this branch]
[Functional branch: reads without rRNA → FragGeneScan → predicted CDS (pCDS) → InterProScan → Function assignment or Unknown function]
![Page 33: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/33.jpg)
EBI Metagenomics: QC step by step
• Clipping - low quality ends trimmed and adapter sequences removed
• Quality filtering - sequences with > 10% undetermined nucleotides removed
• Read length filtering - short sequences (< 100 nt) are removed
• Duplicate sequence removal - clustered on 99% identity (UCLUST v 1.1.579) and a representative sequence chosen
• Repeat masking - RepeatMasker (open-3.2.2), removes reads with 50% or more nucleotides masked (low complexity regions)
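Two of the filters above are simple enough to sketch directly. The snippet below applies the ambiguity threshold (more than 10% undetermined nucleotides) and the minimum read length (100 nt) quoted on the slide. It is an illustration under those stated thresholds only, not the EBI Metagenomics code; the function name and test reads are made up, and clipping, duplicate removal and repeat masking are not covered.

```python
def passes_qc(seq, max_n_fraction=0.10, min_length=100):
    """Return True if a read survives the length and ambiguity filters."""
    if len(seq) < min_length:           # reads shorter than 100 nt are removed
        return False
    n_fraction = seq.upper().count("N") / len(seq)
    return n_fraction <= max_n_fraction  # reads with > 10% N are removed

# Made-up reads for illustration
reads = {
    "good":   "ACGT" * 30,              # 120 nt, no N
    "short":  "ACGT" * 10,              # 40 nt
    "many_n": "N" * 20 + "ACGT" * 25,   # 120 nt, ~17% N
    "edge":   "N" * 10 + "A" * 90,      # 100 nt, exactly 10% N
}

kept = [name for name, seq in reads.items() if passes_qc(seq)]
print(kept)
```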
![Page 34: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/34.jpg)
EBI Metagenomics: QC consequences
Roche 454
Illumina
Ion Torrent
![Page 35: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/35.jpg)
EBI Metagenomics: taxonomic analysis
rRNAselector
reads with rRNA
Amplicon-based data
processed reads
Qiime
Taxonomic analysis
![Page 36: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/36.jpg)
Taxonomic analysis with EBI Metagenomics
EBI Metagenomics currently only provides taxonomy analysis for Prokaryotes.
rRNA sequences are identified using rRNASelector:
hidden Markov models to identify rRNA sequences
60 bp minimum overlap with well-curated HMM model
E-value < 10^-5
Annotations are assigned using Qiime:
rRNA sequences are annotated using the Greengenes reference database
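The two rRNA selection criteria can be expressed as a simple filter over HMM hits. This is a sketch only, reading the E-value threshold on the slide as 10^-5; the hit records below are hypothetical and do not reflect the actual output format of rRNASelector.

```python
MIN_OVERLAP_BP = 60   # minimum overlap with the HMM, from the slide
MAX_EVALUE = 1e-5     # E-value threshold, from the slide

def is_rrna_hit(overlap_bp, evalue):
    """Apply both criteria: sufficient overlap and a significant E-value."""
    return overlap_bp >= MIN_OVERLAP_BP and evalue < MAX_EVALUE

# Made-up hits for illustration
hits = [
    {"read": "read_1", "overlap_bp": 120, "evalue": 1e-20},  # passes both
    {"read": "read_2", "overlap_bp": 45,  "evalue": 1e-12},  # overlap too short
    {"read": "read_3", "overlap_bp": 80,  "evalue": 1e-3},   # E-value too high
]

rrna_reads = [h["read"] for h in hits if is_rrna_hit(h["overlap_bp"], h["evalue"])]
print(rrna_reads)
```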
![Page 37: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/37.jpg)
EBI Metagenomics taxonomy visualizations
![Page 38: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/38.jpg)
Re-analysis of: Sutton et al. (2013), Impact of Long-Term Diesel Contamination on Soil Microbial Community Structure.
Validation of taxonomic analysis
Alpha diversity analysis
polluted
clean
clean (outlier)
![Page 39: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/39.jpg)
EBI Metagenomics: overview of functional analysis
reads without rRNA
FragGeneScan
predicted CDS
InterProScan
Function assignment
Unknown function
pCDS
![Page 40: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/40.jpg)
EBI Metagenomics: functional annotation
EBI Metagenomics uses FragGeneScan to predict CDSs directly from the reads:
hidden Markov models to correct frame-shifts using codon usage
probabilistic identification of start and stop codons
60 bp minimum ORF
Annotation is carried out using InterProScan to mine a subset of the InterPro database
![Page 41: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/41.jpg)
Why not BLAST?
• BLAST: Basic Local Alignment Search Tool
• Relatively fast
• User friendly
• Very good at recognising similarity between closely related sequences
![Page 42: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/42.jpg)
Using BLAST for annotation
![Page 43: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/43.jpg)
Using BLAST for annotation
![Page 44: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/44.jpg)
Using BLAST for annotation
![Page 45: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/45.jpg)
Because BLAST performs local pairwise alignment, it:
• can sometimes struggle with multi-domain proteins
• is less useful for weakly-similar sequences (e.g., divergent homologues)
Using BLAST for annotation
![Page 46: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/46.jpg)
BLAST alignment of 2 proteins: 60S acidic ribosomal protein P0 from 2 closely-related species
Using BLAST for annotation
![Page 47: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/47.jpg)
60S acidic ribosomal protein P0: multiple sequence alignment
![Page 48: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/48.jpg)
An alternative approach
• Alternatively, we can model the pattern of conserved amino acids at specific positions within a multiple sequence alignment
• We can use these models to infer relationships with the characterised sequences from which the alignment was constructed
• This is the approach taken by protein signature databases
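The idea of modelling conserved residues per alignment column can be illustrated with a very simple position-specific frequency profile. This is a toy sketch, not a hidden Markov model and not what any protein signature database actually runs: the alignment is invented, gaps are ignored, and a uniform background with a pseudocount of 1 is assumed.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def build_profile(alignment, pseudocount=1.0):
    """Per-column amino acid frequencies from an ungapped alignment."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({aa: (counts.get(aa, 0) + pseudocount) / total
                        for aa in ALPHABET})
    return profile

def log_odds_score(profile, seq):
    """Sum of log2(observed frequency / uniform background) over positions."""
    background = 1.0 / len(ALPHABET)
    return sum(math.log2(col[aa] / background) for col, aa in zip(profile, seq))

# Invented alignment of a short conserved motif
alignment = ["GDSGG", "GDSGA", "GDSGG", "GESGG"]
profile = build_profile(alignment)

score_member = log_odds_score(profile, "GDSGG")     # matches the conserved pattern
score_unrelated = log_odds_score(profile, "AKLMW")  # shares no conserved residues
print(round(score_member, 2), round(score_unrelated, 2))
```

A sequence that follows the conserved pattern scores above zero, while an unrelated one scores below, which is the basis for inferring membership of the aligned family.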
![Page 49: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/49.jpg)
Three different protein signature approaches
Single motif methods: Patterns
Multiple motif methods: Fingerprints
Full alignment methods: Profiles & Hidden Markov models (HMMs)
* For a detailed description, see: https://www.ebi.ac.uk/training/online/course/introduction-protein-classification-ebi
![Page 50: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/50.jpg)
Structural domains
Functional annotation of families/domains
Protein features (sites)
Hidden Markov Models, Fingerprints, Profiles, Patterns
HAMAP
![Page 51: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/51.jpg)
The aim of InterPro
InterPro
![Page 52: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/52.jpg)
Features of InterPro
• Manually checked and updated against a manually annotated database
• Errors are identified and fixed
• Annotated with full text abstracts and Gene Ontology terms
![Page 53: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/53.jpg)
… with a brief diversion into the Gene Ontology…
http://geneontology.org/
![Page 54: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/54.jpg)
Aims of the Gene Ontology
• Allow cross-species and/or cross-database comparisons
• Unify the representation of gene and gene product attributes across species
http://geneontology.org/
![Page 55: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/55.jpg)
Inconsistency in naming of biological concepts
English is not a very precise language
• Same name for different concepts
• Different names for the same concept
An example: “Tactition”, “Tactile sense” and “Taction” all refer to one concept:
Sensory perception of touch ; GO:0050975
http://geneontology.org/
![Page 56: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/56.jpg)
• A way to capture biological knowledge in a written and computable form
The Gene Ontology
• A set of concepts and their relationships to each other arranged as a hierarchy
www.ebi.ac.uk/QuickGO
Less specific concepts
More specific concepts
http://geneontology.org/
![Page 57: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/57.jpg)
The Concepts in GO
1. Molecular Function: an elemental activity or task or job
• protein kinase activity
• insulin receptor activity
2. Biological Process: a commonly recognised series of events
• cell division
3. Cellular Component: where a gene product is located
• mitochondrion
• mitochondrial matrix
• mitochondrial inner membrane
http://geneontology.org/
![Page 58: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/58.jpg)
Anatomy of a GO term
Unique identifier
Term name
Definition
Synonyms
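The four parts of a GO term map naturally onto a small record type. The sketch below uses the touch example from earlier in the talk (GO:0050975 and its three synonyms); the class and function names are illustrative, and the definition text is deliberately left as a placeholder rather than quoted.

```python
from dataclasses import dataclass, field

@dataclass
class GOTerm:
    identifier: str                 # unique identifier
    name: str                       # term name
    definition: str                 # definition
    synonyms: list = field(default_factory=list)

touch = GOTerm(
    identifier="GO:0050975",
    name="sensory perception of touch",
    definition="(definition text not reproduced here)",
    synonyms=["tactition", "tactile sense", "taction"],
)

def build_index(terms):
    """Map every name and synonym (lower-cased) to its single GO identifier."""
    index = {}
    for term in terms:
        for label in [term.name, *term.synonyms]:
            index[label.lower()] = term.identifier
    return index

index = build_index([touch])
print(index["taction"], index["tactile sense"])
```

Whatever wording an annotator uses, the lookup resolves to the same identifier, which is what makes cross-database comparison possible.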
http://geneontology.org/
![Page 59: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/59.jpg)
InterPro2GO
InterPro
![Page 60: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/60.jpg)
We now return to your scheduled programming...
![Page 61: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/61.jpg)
Using InterPro for annotation
• Underlies the automated system that adds annotation to UniProtKB/TrEMBL
• Provides matches to 67 million proteins - over 80% of UniProtKB
• Source of ~170 million GO mappings for ~50 million distinct UniProtKB sequences
Annotation consistency:
• Using InterPro and GO for annotation allows direct comparison with all of the proteins in UniProtKB
![Page 62: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/62.jpg)
Analysing metagenomic sequences with InterPro
Considerations for metagenome analysis:
• Vast numbers of short reads
  • analysis speed
  • ability to cope with sequence fragments
• Making sense of output
  • visualisation on web site
  • downstream analysis and sample comparison
![Page 63: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/63.jpg)
Structural domains
Functional annotation of families/domains
Protein features (sites)
Hidden Markov Models, Fingerprints
Patterns
Databases
4
![Page 64: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/64.jpg)
Assembly of metagenomics data
• Metagenomics: it is not clear how to avoid assembling sequences from different species together, which produces chimaeras
![Page 65: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/65.jpg)
EBI Metagenomics does not perform assembly
We are still able to annotate metagenome data, as shown by this re-analysis of the rumen metagenomics study by Hess et al. (2011)
![Page 66: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/66.jpg)
Visualising data: InterProScan results
![Page 67: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/67.jpg)
Visualising data: GO Slims
• GO slims are cut-down versions of the GO ontologies containing a subset of the terms in the whole GO
• Give a broad overview of the ontology content without the detail of the specific fine-grained terms
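As a rough illustration of how fine-grained annotations can be rolled up to a slim, the sketch below walks up a tiny made-up ontology until it reaches a term that belongs to the slim. The term identifiers, the `PARENTS` table and the functions `slim_terms` and `slim_counts` are hypothetical and are not real GO identifiers or any GO tool's API. Real slimming tools handle multiple relationship types and overlapping slim terms more carefully.

```python
from collections import Counter

# Toy ontology: child term -> parent terms (all identifiers are made up).
PARENTS = {
    "T:ammonia_oxidation": ["T:nitrogen_metabolism"],
    "T:nitrogen_metabolism": ["T:metabolism"],
    "T:glycolysis": ["T:carbohydrate_metabolism"],
    "T:carbohydrate_metabolism": ["T:metabolism"],
    "T:metabolism": [],
}

# The slim keeps only a subset of the terms above.
SLIM = {"T:nitrogen_metabolism", "T:carbohydrate_metabolism"}


def slim_terms(term, parents, slim):
    """Return the nearest slim terms reached by walking up from `term`."""
    if term in slim:
        return {term}
    found = set()
    for parent in parents.get(term, []):
        found |= slim_terms(parent, parents, slim)
    return found


def slim_counts(annotations, parents, slim):
    """Add each term's count to every slim term it maps to."""
    counts = Counter()
    for term, n in annotations.items():
        for slim_term in slim_terms(term, parents, slim):
            counts[slim_term] += n
    return counts


# Hypothetical annotation counts for one sample.
annotations = {
    "T:ammonia_oxidation": 3,
    "T:nitrogen_metabolism": 2,
    "T:glycolysis": 5,
}
summary = slim_counts(annotations, PARENTS, SLIM)
for slim_term, n in sorted(summary.items()):
    print(slim_term, n)
```

Three fine-grained terms collapse into two slim categories, which is what makes the slim convenient for charts.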
![Page 68: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/68.jpg)
GO Slims
![Page 69: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/69.jpg)
GO Slims
Slimmed term:
![Page 70: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/70.jpg)
Visualising data: GO slims
• For visualisation, EMG uses a GO slim specially developed for metagenomic data sets
![Page 71: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/71.jpg)
EBI Metagenomics output files
sequence files
tab or comma separated files
TreeView, TOL, Newick Viewer …
Megan …
sequence files
![Page 72: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/72.jpg)
Simplified overview of MG-RAST pipeline (http://metagenomics.anl.gov/)
• Reads
• Quality control
• Feature prediction (FragGeneScan)
• Clustering (Uclust)
• Blat (protein databases)
• Abundance profiles, metabolic reconstruction, metabolic model
• rRNAs: Blat (SILVA), Blat (RNA database)
• Community profiles
![Page 73: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/73.jpg)
MG-RAST & EBI Metagenomics Functional analysis
Example: Analysis of Prairie Soil Sample
ammonia monooxygenase: NH3 + A-H2 + O2 → NH2OH + A + H2O
MG-RAST: 92 hits to 8 different databases
• By annotation: 12 Ammonia monooxygenase; 2 ammonia monooxygenase family protein; 4 Ammonia monooxygenase subunit A; 5 Ammonia monooxygenase, putative; 62 Putative ammonia monooxygenase; 3 putative ammonia monooxygenase protein; 4 putative ammonia monooxygenase subunit A
• By database: 8 KEGG; 18 eggNOG; 13 GenBank; 11 IMG; 8 PATRIC; 10 RefSeq; 12 TrEMBL; 9 SEED
• 13 GenBank: 1 ammonia monooxygenase family protein; 2 ammonia monooxygenase subunit A; 1 ammonia monooxygenase, putative; 6 putative ammonia monooxygenase; 2 Putative ammonia monooxygenase; 1 putative ammonia monooxygenase subunit A
EBI Metagenomics:
• 3 IPR003393 Ammonia monooxygenase/particulate methane monooxygenase, subunit A
• 25 IPR007820 Putative ammonia monooxygenase/protein AbrB
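The MG-RAST annotation counts on this slide can be tallied with a few lines of Python. The counts below are as read from the slide text. The `normalise` function is an assumption made for illustration: it merges name variants by lower-casing, dropping commas and removing the word "putative", and it is not how either resource groups its hits.

```python
import re
from collections import Counter

# MG-RAST hit counts per annotation string, as read from the slide.
mgrast_hits = {
    "Ammonia monooxygenase": 12,
    "ammonia monooxygenase family protein": 2,
    "Ammonia monooxygenase subunit A": 4,
    "Ammonia monooxygenase, putative": 5,
    "Putative ammonia monooxygenase": 62,
    "putative ammonia monooxygenase protein": 3,
    "putative ammonia monooxygenase subunit A": 4,
}


def normalise(name):
    """Illustrative clean-up: lower-case, drop commas and the word 'putative'."""
    name = name.lower().replace(",", " ")
    name = re.sub(r"\bputative\b", " ", name)
    return " ".join(name.split())


# Merge the free-text variants that normalise to the same string.
merged = Counter()
for name, n in mgrast_hits.items():
    merged[normalise(name)] += n

total = sum(mgrast_hits.values())
print("total hits:", total)
for name, n in merged.most_common():
    print(n, name)
```

The seven free-text variants add up to the 92 hits quoted on the slide and collapse to four names under this simple rule, whereas the EBI Metagenomics result is already grouped under two InterPro entries.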
![Page 74: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/74.jpg)
MG-RAST & EBI Metagenomics Taxonomy analysis
Example: Analysis of Prairie Soil Sample
Domain level of taxonomy
MG-RAST:
• Bacteria (55 categories)
• Archaebacteria (15 categories)
• Eukaryotes (98 categories)
• Others (including virus) (3 types)
EBI Metagenomics: only Prokaryotic taxonomy (333 OTU)
![Page 75: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/75.jpg)
MG-RAST & EBI Metagenomics Taxonomy analysis
Example: Analysis of Prairie Soil Sample
Phylum level of bacteria domain taxonomy
MG-RAST: 28 categories
EBI Metagenomics: 13 OTU
![Page 76: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/76.jpg)
IMG/M
http://img.jgi.doe.gov/m
![Page 77: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/77.jpg)
Some other metagenomics packages and tools
http://www.computationalbioenergy.org/software.html
http://ab.inf.uni-tuebingen.de/software/megan/
http://cbcb.umd.edu/software/metAMOS
CloVR metagenomics
http://clovr.org/methods/clovr-metagenomics/
![Page 78: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/78.jpg)
Hands-on session
• Using InterProScan to analyse a single metagenomic sequence
• Exploring EMG Portal’s analysis of a metagenomic data set
• Comparing analysis results for samples within a project using STAMP
![Page 79: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline](https://reader034.vdocuments.us/reader034/viewer/2022042217/5ec120cb31861a12ab18c440/html5/thumbnails/79.jpg)
Questions?