designing a community resource - sandra orchard
TRANSCRIPT
Designing a community resource – the Complex Portal as an example
Sandra Orchard
Hands-on exercise
Design a manually curated data resource that will enable the description of species-agnostic protein complexes, to act as a reference resource in the same way that UniProt does for proteins. Use as examples:
1. Human haemoglobin
2. Arabidopsis light-harvesting complex
Designing a new resource - what else is out there?
• Before starting to design a resource, assess what else is out there – re-inventing the wheel causes community fragmentation and confusion as well as being a waste of limited funds
• Is it needed – what gap in the market is it designed to fill?
• Investigate possibilities for collaboration, rather than competition
• If another resource exists, does it meet your/consumer demands – can you contribute to it and improve it?
Designing a new resource
• How will researchers use it, what information do they want? Conduct extensive user requirement studies before starting the design process.
• How will users search it? This will impact on data entry/annotation.
• Data visualisation – again, what do users want? Usability studies are critical
• Long term plans – will it survive the first grant renewal?
Complex Portal - what else was out there?
• Information on protein complexes scattered between multiple resources but no unifying resource
• MIPS catalogued yeast complexes in 2000
• Corum – human complexes, project terminated in 2009
• Decision – use as starting point or start again?
Information content and presentation
• User consultation – design what they need, not what you want to give them
• Don’t get too attached to your first paper prototype – be prepared to sacrifice your concept to community need
• Develop a beta site, then observe researchers using it.
• Keep testing, react to new demands, novel use cases
Use of community standards
• Use of community standards enables:
• Data merger across multiple resources – contribute to a greater community effort
• Data re-use and longevity
• Immediate access to existing tool suites
Use of Community standards – Complex Portal
• Established standard formats for molecular interactions (PSI-MI XML/MITAB)
• PSI-MI XML 2.5 designed for experimental data, curated complex data not a perfect fit – worked with PSI-MI workgroup to produce new version
• MITAB designed for binary pairs, not complexes – ComplexTAB will be presented to MI workgroup for adoption
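To illustrate why a binary-pair format fits complexes poorly, the following minimal Python sketch flattens one complex into pairwise rows. The component labels and the `matrix_expansion` helper are hypothetical, and the output is not MITAB or ComplexTAB syntax; it only shows what a pairwise representation loses.

```python
from itertools import combinations

# Hypothetical complex: component label -> stoichiometry (copies per complex).
complex_components = {"subunit_A": 2, "subunit_B": 2, "cofactor_X": 4}


def matrix_expansion(components):
    """Return every unordered pair of distinct components (all-vs-all)."""
    return list(combinations(sorted(components), 2))


pairs = matrix_expansion(complex_components)
for a, b in pairs:
    print(a, b)

# n components give n * (n - 1) / 2 pairs. Stoichiometry, and the fact that
# the pairs belong to one assembly, are no longer stated anywhere in the rows.
n = len(complex_components)
print(len(pairs) == n * (n - 1) // 2)
```

A format written for complexes can instead keep one row per complex, with the full component list and stoichiometry held together.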
Use of Community standards – Complex Portal
• Used existing identifiers for components (UniProtKB, ChEBI, RNAcentral) – enables import of additional information using resource APIs, for example can search website using gene synonyms
• Organism non-specific, enables us to describe complexes in a range of species, including non-model organisms
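As a rough sketch of what "existing identifiers for components" could look like in a data model, the Python below describes human haemoglobin (the first hands-on example) using only external identifiers plus a taxon. The class names, the simplified identifier patterns and the record layout are assumptions for illustration, not the Complex Portal schema; the accessions are quoted from memory and should be checked against UniProtKB and ChEBI.

```python
import re
from dataclasses import dataclass, field

# Identifier shapes per source database. These regular expressions are
# simplified assumptions for illustration, not the official specifications.
ID_PATTERNS = {
    "UniProtKB": re.compile(r"[A-Z][0-9][A-Z0-9]{3}[0-9]"),
    "ChEBI": re.compile(r"CHEBI:[0-9]+"),
    "RNAcentral": re.compile(r"URS[0-9A-F]{10}(_[0-9]+)?"),
}


@dataclass(frozen=True)
class Component:
    database: str
    identifier: str
    stoichiometry: int = 1

    def __post_init__(self):
        # Reject identifiers that do not look like they come from the database.
        pattern = ID_PATTERNS.get(self.database)
        if pattern is None or not pattern.fullmatch(self.identifier):
            raise ValueError(f"bad {self.database} identifier: {self.identifier}")


@dataclass
class ComplexRecord:
    name: str
    taxon_id: int  # NCBI taxonomy identifier, so any species can be described
    components: list = field(default_factory=list)


haemoglobin = ComplexRecord(
    name="Haemoglobin",
    taxon_id=9606,  # Homo sapiens
    components=[
        Component("UniProtKB", "P69905", 2),   # alpha chain
        Component("UniProtKB", "P68871", 2),   # beta chain
        Component("ChEBI", "CHEBI:30413", 4),  # haem cofactor
    ],
)
print([c.identifier for c in haemoglobin.components])
```

Because the record stores only identifiers, names, synonyms and sequences can be pulled from the source databases rather than re-curated locally.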
Use of Community standards enables use of existing tools
• Community standards have encouraged tool development by users, software often open-source and freely available – often can be incorporated directly into websites with little/no additional development
• Complex Portal viewer originally written to visualise cross-linking data
Use of Community standards enables use of existing tools
• Look for initiatives which make open-source tools, apps/plug-ins, visualizers and widgets freely available e.g. BioJS, BioPerl, Cytoscape…
Free text vs Ontologies
Free text
Pros – versatile, fully descriptive, flexible
Cons – can be difficult to interpret, long-winded, error-prone, difficult to search
CVs
Pros – structured, consistent, concise
Cons – may not deal well with ‘odd’ cases, lack of information
Consider using both!
Use of controlled vocabularies
• Again, re-use rather than re-invent
• Use of CVs enables searches across resources, but also can make intelligent searches within resources easy to implement
For example can search for
• all transcription factors
• all complexes involved in respiration
• all mitochondrial complexes
Use of controlled vocabularies
In the Complex Portal you can search for
1. All enzymes - GO:0003824 (catalytic activity)
2. All transferases - GO:0016740 (transferase activity)
3. All protein kinases - GO:0004672 (protein kinase activity)
4. All cyclin-dependent protein kinases - GO:0097472 (cyclin-dependent protein kinase activity)
Similarly can use the ChEBI ontology – search on porphyrin
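The searches listed above work because each complex is annotated once, with its most specific term, and the ontology hierarchy does the rest. A minimal Python sketch of that idea follows. The complex names are hypothetical, and the parent links are a simplified assumption that collapses the intermediate terms of the real Gene Ontology into a single chain.

```python
# Simplified is_a links between the four GO terms listed above; the real
# ontology has intermediate terms that are collapsed here (assumption).
PARENT = {
    "GO:0097472": "GO:0004672",  # cyclin-dependent protein kinase -> protein kinase
    "GO:0004672": "GO:0016740",  # protein kinase -> transferase
    "GO:0016740": "GO:0003824",  # transferase -> catalytic activity
}

# Hypothetical complexes, each annotated once with its most specific term.
ANNOTATIONS = {
    "complex_1": "GO:0097472",
    "complex_2": "GO:0004672",
    "complex_3": "GO:0016740",
    "complex_4": "GO:0005515",  # protein binding: not an enzyme
}


def ancestors_and_self(term):
    """Walk up the is_a chain, yielding the term and all of its ancestors."""
    while term is not None:
        yield term
        term = PARENT.get(term)


def search(query_term):
    """Return complexes annotated with the query term or any descendant."""
    return sorted(
        name for name, term in ANNOTATIONS.items()
        if query_term in ancestors_and_self(term)
    )


print(search("GO:0003824"))  # all enzymes
print(search("GO:0004672"))  # all protein kinases
```

The same approach would apply to a chemical ontology such as ChEBI: annotate with the specific compound, then search on a parent class.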
Linking to external resources
• Extensive cross-referencing is time consuming but enables subsequent pulling in of data from other resources
Make this the ‘go to’ resource for your community
• Must fit community need, be easy to search and deliver the results the user wants
• Outreach – publications, conferences, talks…
• Collaborate on a high impact analysis paper, with your resource playing a key role.
• Protocols, tutorials, videos, hands-on training courses.
• Use social media
Using Social Media
InterPro and Annotation transfer to Non-Model Organism Proteomes
What is InterPro
• InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites.
• Combines protein signatures from a number of member databases into a single searchable resource.
• Has resulted in an integrated database and diagnostic tool (InterProScan).
Protein signatures
Model the pattern of conserved amino acids at specific positions within a multiple sequence alignment
• Patterns
• Profiles
• Profile HMMs
Use these models (signatures) to infer relationships with the characterised sequences from which the alignment was constructed
Approach used by a variety of databases: Pfam, TIGRFAMs, PANTHER, Prosite, etc
Introduction to InterPro
How are protein signatures made?
Figure: workflow for building a protein signature
Protein family/domain -> Multiple sequence alignment -> Build model -> Search -> Significant matches -> Refine -> Protein signature
Example alignment:
ITWKGPVCGLDGKTYRNECALL
AVPRSPVCGSDDVTYANECELK
SVPRSPVCGSDGVTYGTECDLK
HPPPGPVCGTDGLTYDNRCELR
Example E-values of significant matches: 1e-49, 3e-42, 5e-39, 6e-10
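The align, build, search loop can be sketched with the simplest signature type, a pattern. The Python below derives a regular expression from the four example sequences on the slide and uses it to screen candidates. This is an illustration only: the `build_pattern` helper and the candidate sequences are hypothetical, the alignment is treated as gap-free, and real member databases use curated patterns, scored profiles or profile HMMs with E-values rather than an exact match.

```python
import re

# Example alignment taken from the slide (four sequences, equal length).
ALIGNMENT = [
    "ITWKGPVCGLDGKTYRNECALL",
    "AVPRSPVCGSDDVTYANECELK",
    "SVPRSPVCGSDGVTYGTECDLK",
    "HPPPGPVCGTDGLTYDNRCELR",
]


def build_pattern(alignment):
    """Build a regex: one residue for conserved columns, a class otherwise."""
    parts = []
    for column in zip(*alignment):
        residues = sorted(set(column))
        if len(residues) == 1:
            parts.append(residues[0])
        else:
            parts.append("[" + "".join(residues) + "]")
    return "".join(parts)


pattern = build_pattern(ALIGNMENT)
signature = re.compile(pattern)
print(pattern)

# Search step: a candidate matches only if every position is compatible
# with the residues seen at that position in the alignment.
candidates = {
    "seq_1": "AVPRSPVCGSDGVTYANECELK",  # new combination of observed residues
    "seq_2": "AVPRSPVAGSDGVTYANECELK",  # conserved C replaced by A
}
hits = [name for name, seq in candidates.items() if signature.fullmatch(seq)]
print(hits)
```

In the refine step, new significant matches would be added to the alignment and the model rebuilt, which widens or tightens the signature.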
Figure: signature types in InterPro, grouped by focus (structural domains; functional annotation of families/domains; protein features (sites)) and by method (hidden Markov models, fingerprints, profiles, patterns; HAMAP)
Database | Basis | Institution | Built from | Focus | URL
Pfam | HMM | EBI | Sequence alignment | Family & domain based on conserved sequence | http://pfam.xfam.org/
Gene3D | HMM | UCL | Structure alignment | Structural domain | http://gene3d.biochem.ucl.ac.uk/Gene3D/
Superfamily | HMM | Uni. of Bristol | Structure alignment | Evolutionary domain relationships | http://supfam.cs.bris.ac.uk/SUPERFAMILY/
SMART | HMM | EMBL Heidelberg | Sequence alignment | Functional domain annotation | http://smart.embl-heidelberg.de/
TIGRFAM | HMM | J. Craig Venter Inst. | Sequence alignment | Microbial functional family classification | http://www.jcvi.org/cms/research/projects/tigrfams/overview/
Panther | HMM | Uni. S. California | Sequence alignment | Family functional classification | http://www.pantherdb.org/
PIRSF | HMM | PIR, Georgetown, Washington D.C. | Sequence alignment | Functional classification | http://pir.georgetown.edu/pirwww/dbinfo/pirsf.shtml
PRINTS | Fingerprints | Uni. of Manchester | Sequence alignment | Family functional classification | http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
PROSITE | Patterns & profiles | SIB | Sequence alignment | Functional annotation | http://expasy.org/prosite/
HAMAP | Profiles | SIB | Sequence alignment | Microbial protein family classification | http://expasy.org/sprot/hamap/
ProDom | Sequence clustering | PRABI: Rhône-Alpes Bioinformatics Center | Sequence alignment | Conserved domain prediction | http://prodom.prabi.fr/prodom/current/html/home.php
The aim of InterPro
InterPro
InterPro: multiple sequence analysis
• Outputs TSV, XML, GFF3, HTML & SVG formats
InterPro as a tool for Automatic annotation
Why automatic annotation is needed
• data growth in UniProtKB is fast:

Release | Section of database | No. of entries | Growth
2015_10 | reviewed (Swiss-Prot) | ~0.5 million | slow
2015_10 | unreviewed (TrEMBL) | >50 million | rapid

• manual curation is time-consuming
• experimental data are unavailable for many sequences/organisms
• organisms’ genomes are sequenced but often no biochemical characterization is conducted
The Concepts in GO
1. Molecular Function
An elemental activity or task or job
• protein kinase activity
• insulin receptor activity
2. Biological Process
A commonly recognised series of events
• cell division
3. Cellular Component
Where a gene product is located
• mitochondrion
• mitochondrial matrix
• mitochondrial inner membrane
The relationship between InterPro and GO (InterPro2GO)
• Curators manually add relevant GO terms to InterPro entries
• InterPro entry specificity determines the GO terms assigned
Example: more specific entry
GO:0007186 G-protein coupled receptor signaling
GO:0016021 integral to membrane
GO:0007601 visual perception
Example: more general entry
GO:0007186 G-protein coupled receptor signaling
GO:0016021 integral to membrane
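A minimal Python sketch of how an InterPro2GO-style mapping turns signature matches into GO annotations, using the GO terms from the example above. The entry names, protein names and the `transfer_go_terms` helper are hypothetical; this is an illustration of the principle, not the InterPro2GO file format or pipeline.

```python
# Hypothetical InterPro-style entries mapped to the GO terms shown above.
INTERPRO2GO = {
    "ENTRY_GENERAL": {"GO:0007186", "GO:0016021"},
    "ENTRY_SPECIFIC": {"GO:0007186", "GO:0016021", "GO:0007601"},
}

# Hypothetical proteins and the entries their sequences match.
PROTEIN_MATCHES = {
    "protein_a": ["ENTRY_GENERAL"],
    "protein_b": ["ENTRY_GENERAL", "ENTRY_SPECIFIC"],
    "protein_c": [],  # no signature match, so nothing can be transferred
}


def transfer_go_terms(matches, mapping):
    """Give each protein the union of GO terms of the entries it matches."""
    return {
        protein: sorted(set().union(*(mapping[e] for e in entries)))
        for protein, entries in matches.items()
    }


go_annotations = transfer_go_terms(PROTEIN_MATCHES, INTERPRO2GO)
for protein, terms in go_annotations.items():
    print(protein, terms)
```

Only the protein matching the more specific entry receives the visual perception term, which is why curators assign GO terms at the level each entry can support.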
Using InterPro for annotation
• InterPro is the world’s major source of GO terms: ~90 million GO terms for ~30 million distinct UniProtKB sequences
• Also underlies the system adding annotation to UniProtKB/TrEMBL
• Provides matches to ~40 million proteins (approx. 80% of UniProtKB)
Annotation consistency:
• Using InterPro and GO for annotation allows direct comparison of proteins in UniProtKB
Automatic Annotation in UniProtKB

System | Rule creation | Trigger | Annotations | Scope
SAAS | automatic | taxonomy, InterPro | protein names, EC numbers, comments, KW, GO terms | all taxa
UniRule | manual | taxonomy, InterPro*, proteome property, sequence length | protein names, EC numbers, gene names, comments, features**, KW, GO terms | all taxa

* flexibility to create custom signatures and submit them to InterPro as required
** predictors for signal, transmembrane, coiled-coil features, alignment for positional ones
Components of a rule: conditions
Restrict application of rules to those unreviewed UniProtKB entries fulfilling the conditions
Types of conditions:
• InterPro signatures
• Functional classification of proteins using predictive models (signatures)
• taxonomy
• sequence features, e.g. length
• proteome features, e.g. outer membrane: yes (bacterial sequences)
Components of a rule: annotations
If an unreviewed UniProtKB entry fulfils conditions of a rule, annotations in a rule are propagated to this entry.
Types of annotations:
• protein names, including enzyme classification (EC) numbers
• functional annotation, e.g. catalytic activities
• gene ontology terms
• keywords
• sequence features, e.g. active sites, transmembrane domains
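The two components of a rule, conditions and annotations, can be sketched as a small Python data model. Everything here is hypothetical (class names, field names, the rule and signature identifiers, the example entries); it is a sketch of the logic described above, not the UniRule or SAAS implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    accession: str
    reviewed: bool
    taxonomy: set    # lineage names, e.g. {"Bacteria"}
    signatures: set  # identifiers of matched signatures
    length: int
    annotations: list = field(default_factory=list)


@dataclass
class Rule:
    rule_id: str
    required_signature: str
    required_taxon: str
    min_length: int
    max_length: int
    annotations: list  # (type, value) pairs to propagate

    def applies_to(self, entry):
        """All conditions must hold, and only unreviewed entries qualify."""
        return (
            not entry.reviewed
            and self.required_signature in entry.signatures
            and self.required_taxon in entry.taxonomy
            and self.min_length <= entry.length <= self.max_length
        )

    def apply(self, entry):
        """Propagate annotations, tagging each with the rule as evidence."""
        if not self.applies_to(entry):
            return False
        for kind, value in self.annotations:
            entry.annotations.append((kind, value, self.rule_id))
        return True


rule = Rule(
    rule_id="RULE_0001",
    required_signature="SIG_0001",
    required_taxon="Bacteria",
    min_length=100,
    max_length=500,
    annotations=[("keyword", "Transferase"), ("GO", "GO:0016740")],
)
entries = [
    Entry("ENTRY_1", False, {"Bacteria"}, {"SIG_0001"}, 250),
    Entry("ENTRY_2", True, {"Bacteria"}, {"SIG_0001"}, 250),    # reviewed
    Entry("ENTRY_3", False, {"Eukaryota"}, {"SIG_0001"}, 250),  # wrong taxon
]
annotated = [e.accession for e in entries if rule.apply(e)]
print(annotated)
```

Storing the rule identifier alongside each propagated annotation is what lets an evidence tag state where the annotation came from.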
How to access automatic annotation data?
Example of a UniRule
UR000172789 applied
evidence tags clearly state where annotation comes from
Example of a UniRule
highlight a rule’s logic
Attributing evidence
It needs to be made clear to the user when information is
1. experimentally based
2. predicted
3. transferred from a related species
Use of evidence codes gives this information
Evidence Code Ontology
http://www.ebi.ac.uk/ols/ontologies/eco
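As a small illustration of how evidence codes let a user separate the three kinds of information, the Python sketch below filters annotations by evidence category. The annotation statements and the `with_category` helper are hypothetical, and the three ECO identifiers are quoted from memory as the codes commonly seen in UniProtKB entries; verify them against the Evidence Code Ontology before relying on them.

```python
# Assumed mapping from ECO identifiers to the three categories above.
EVIDENCE_CATEGORY = {
    "ECO:0000269": "experimental",  # experimental evidence, manual assertion
    "ECO:0000256": "predicted",     # sequence model match, automatic assertion
    "ECO:0000250": "transferred",   # sequence similarity to a related protein
}

# Hypothetical annotations: (statement, evidence code).
annotations = [
    ("binds oxygen", "ECO:0000269"),
    ("protein kinase activity", "ECO:0000256"),
    ("located in mitochondrion", "ECO:0000250"),
    ("transferase activity", "ECO:0000256"),
]


def with_category(annotations, category):
    """Return the statements whose evidence code falls in the category."""
    return [
        statement for statement, code in annotations
        if EVIDENCE_CATEGORY.get(code) == category
    ]


print(with_category(annotations, "experimental"))
print(with_category(annotations, "predicted"))
```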
Thank you!
www.ebi.ac.uk
Twitter: @emblebi
Facebook: EMBLEBI