designing a community resource - sandra orchard

Designing a community resource – the Complex Portal as an example

Sandra Orchard

Hands-on exercise

Design a manually curated data resource that enables the description of species-agnostic protein complexes and acts as a reference resource in the same way that UniProt does for proteins. Use as examples:

1. Human haemoglobin

2. Arabidopsis light-harvesting complex

Designing a new resource - what else is out there?

• Before starting to design a resource, assess what else is out there – re-inventing the wheel causes community fragmentation and confusion, as well as wasting limited funds

• Is it needed – what gap in the market is it designed to fill?

• Investigate possibilities for collaboration, rather than competition

• If another resource exists, does it meet your/consumer demands – can you contribute to it and improve it?

Designing a new resource

• How will researchers use it, and what information do they want? Conduct extensive user requirement studies before starting the design process.

• How will users search it? This will impact on data entry/annotation.

• Data visualisation – again, what do users want? Usability studies are critical

• Long term plans – will it survive the first grant renewal?

Complex Portal - what else was out there?

• Information on protein complexes was scattered between multiple resources, with no unifying resource

• MIPS catalogued yeast complexes in 2000

• CORUM – human complexes, project terminated in 2009

• Decision – use as starting point or start again?

Information content and presentation

• User consultation – design what they need, not what you want to give them

• Don’t get too attached to your first paper prototype – be prepared to sacrifice your concept to community need

• Develop a beta site, then observe researchers using it.

• Keep testing, react to new demands, novel use cases

Use of community standards

• Use of community standards enables:

• Data merger across multiple resources – contribute to a greater community effort

• Data re-use and longevity

• Immediate access to existing tool suites

Use of Community standards – Complex Portal

• Established standard formats for molecular interactions (PSI-MI XML/MITAB)

• PSI-MI XML 2.5 was designed for experimental data, and curated complex data are not a perfect fit – worked with the PSI-MI workgroup to produce a new version

• MITAB was designed for binary pairs, not complexes – ComplexTAB will be presented to the MI workgroup for adoption
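To see why a binary-pair format is an awkward fit for curated complexes, the sketch below expands a complex into the pairs that a pairwise table would have to store. The participant labels and stoichiometries are illustrative assumptions, and `matrix_expansion` is a hypothetical helper, not part of any PSI-MI tool.

```python
from itertools import combinations

# Illustrative complexes: participant label -> stoichiometry (assumed values).
haemoglobin = {"HBA1": 2, "HBB": 2}
ternary = {"A": 1, "B": 1, "C": 1}


def matrix_expansion(participants):
    """Return every unordered pair of distinct participants.

    This is what a binary-pair table can hold; stoichiometry and the fact
    that the pairs belong to one assembly are not carried by the pairs.
    """
    return list(combinations(sorted(participants), 2))


pairs_hb = matrix_expansion(haemoglobin)
pairs_ternary = matrix_expansion(ternary)

print(pairs_hb)        # a single pair, with the 2:2 stoichiometry lost
print(pairs_ternary)   # three pairs that no longer say "one complex"
```

A complex-level record keeps the participant list and stoichiometry together, which is the motivation given above for a complex-specific tabular format.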

Use of Community standards – Complex Portal

• Used existing identifiers for components (UniProtKB, ChEBI, RNAcentral) – enables import of additional information using resource APIs; for example, users can search the website using gene synonyms

• Organism non-specific – enables us to describe complexes in a range of species, including non-model organisms
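As a minimal sketch of what such a record could look like, the example below models a complex as a list of components that carry only a source database and an accession. The class and field names are hypothetical and are not the Complex Portal schema. The UniProtKB accessions are the human haemoglobin alpha and beta chains; the ChEBI accession is believed to be the generic heme entry and should be checked before use.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    database: str          # e.g. "uniprotkb", "chebi", "rnacentral"
    accession: str         # identifier owned by the external resource
    stoichiometry: int = 1


@dataclass
class ComplexRecord:
    name: str
    taxon_id: int          # NCBI taxonomy ID; nothing else is species-specific
    components: list = field(default_factory=list)
    description: str = ""  # free text alongside the structured fields

    def accessions(self, database):
        """Accessions of all components drawn from one external database."""
        return [c.accession for c in self.components if c.database == database]


haemoglobin = ComplexRecord(
    name="Haemoglobin (illustrative record)",
    taxon_id=9606,  # Homo sapiens
    components=[
        Component("uniprotkb", "P69905", 2),   # alpha chain
        Component("uniprotkb", "P68871", 2),   # beta chain
        Component("chebi", "CHEBI:30413", 4),  # heme (accession assumed)
    ],
    description="Tetramer of two alpha and two beta chains, each binding heme.",
)

print(haemoglobin.accessions("uniprotkb"))
```

The same structure would describe an Arabidopsis complex by changing only `taxon_id` (3702) and the component accessions, which is the sense in which the design is organism non-specific.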

Use of Community standards enables use of existing tools

• Community standards have encouraged tool development by users; the software is often open-source and freely available, and can often be incorporated directly into websites with little/no additional development

• Complex Portal viewer was originally written to visualise cross-linking data

Use of Community standards enables use of existing tools

• Look for initiatives which make open-source tools, apps/plug-ins, visualizers and widgets freely available, e.g. BioJS, BioPerl, Cytoscape…

Free text vs Ontologies

Free text

Pros – versatile, fully descriptive, flexible

Cons – can be difficult to interpret, long-winded, error-prone, difficult to search

CVs

Pros – structured, consistent, concise

Cons – may not deal well with ‘odd’ cases, lack of information

Consider using both!

Use of controlled vocabularies

• Again, re-use rather than re-invent

• Use of CVs enables searches across resources, but can also make intelligent searches within resources easy to implement

For example, you can search for

• all transcription factors

• all complexes involved in respiration

• all mitochondrial complexes

Use of controlled vocabularies

In the Complex Portal you can search for

1. All enzymes - GO:0003824 (catalytic activity)

2. All transferases - GO:0016740 (transferase activity)

3. All protein kinases - GO:0004672 (protein kinase activity)

4. All cyclin-dependent protein kinases - GO:0097472 (cyclin-dependent protein kinase activity)

Similarly, you can use the ChEBI ontology – e.g. search on porphyrin
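The searches listed above work because a query for a broad term also returns entries annotated with any of its more specific descendants. The sketch below illustrates the idea with a deliberately simplified parent map built only from the four GO terms on this slide; the real GO graph has intermediate terms and multiple parents, and the complex names are hypothetical.

```python
# Simplified child -> parent chain (assumption: intermediate GO terms omitted).
PARENT = {
    "GO:0097472": "GO:0004672",  # cyclin-dependent protein kinase activity
    "GO:0004672": "GO:0016740",  # protein kinase activity
    "GO:0016740": "GO:0003824",  # transferase activity -> catalytic activity
}

# Hypothetical complexes, each annotated with its most specific term.
ANNOTATIONS = {
    "complex_A": "GO:0097472",
    "complex_B": "GO:0016740",
    "complex_C": "GO:0003824",
}


def ancestors(term):
    """Return the term itself plus every ancestor reachable via PARENT."""
    lineage = [term]
    while term in PARENT:
        term = PARENT[term]
        lineage.append(term)
    return lineage


def search(query):
    """Names of complexes annotated with the query term or a descendant."""
    return sorted(
        name for name, term in ANNOTATIONS.items() if query in ancestors(term)
    )


print(search("GO:0003824"))  # all enzymes
print(search("GO:0004672"))  # protein kinases only
```

Curators annotate once with the most specific term; the broader searches come from the ontology structure rather than from extra annotation work.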

Linking to external resources

• Extensive cross-referencing is time-consuming but enables subsequent pulling in of data from other resources

Make this the ‘go to’ resource for your community

• Must fit community need, be easy to search and deliver the results the user wants

• Outreach – publications, conferences, talks…

• Collaborate on a high-impact analysis paper, with your resource playing a key role

• Protocols, tutorials, videos, hands-on training courses

• Use social media

Using Social Media

InterPro and Annotation transfer to Non-Model Organism Proteomes

What is InterPro

• InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites.

• Combines protein signatures from a number of member databases into a single searchable resource

• Has resulted in an integrated database and diagnostic tool (InterProScan).

Protein signatures

Model the pattern of conserved amino acids at specific positions within a multiple sequence alignment

• Patterns

• Profiles

• Profile HMMs

Use these models (signatures) to infer relationships with the characterised sequences from which the alignment was constructed

Approach used by a variety of databases: Pfam, TIGRFAMs, PANTHER, Prosite, etc


Introduction to InterPro

How are protein signatures made?

[Figure: protein family/domain → multiple sequence alignment → build model (protein signature) → search → significant matches → refine]

Example aligned sequences:

ITWKGPVCGLDGKTYRNECALL
AVPRSPVCGSDDVTYANECELK
SVPRSPVCGSDGVTYGTECDLK
HPPPGPVCGTDGLTYDNRCELR

Example E-values of significant matches: 1e-49, 3e-42, 5e-39, 6e-10
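The simplest signature type, a pattern, can be sketched in a few lines. The example below converts a PROSITE-style pattern into a regular expression and searches the four aligned fragments from the "How are protein signatures made?" slide. The pattern was derived by eye from the conserved columns of those four fragments purely for illustration; it is not a real PROSITE entry. `prosite_to_regex` is a hypothetical helper that supports only a subset of the syntax.

```python
import re

# One pattern element: x, a residue, [allowed] or {forbidden}, optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern (subset of the syntax) to a regex."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{"):       # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)


# Illustrative pattern read off the conserved columns (assumption, see above).
PATTERN = "P-V-C-G-x-D-x(2)-T-Y-x(3)-C-x-L"
SEQUENCES = [
    "ITWKGPVCGLDGKTYRNECALL",
    "AVPRSPVCGSDDVTYANECELK",
    "SVPRSPVCGSDGVTYGTECDLK",
    "HPPPGPVCGTDGLTYDNRCELR",
]

regex = re.compile(prosite_to_regex(PATTERN))
hits = [seq for seq in SEQUENCES if regex.search(seq)]
print(regex.pattern, len(hits))
```

A pattern either matches or it does not. Profiles and profile HMMs score each position instead, which is why searches with them report graded E-values rather than a yes/no answer.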

[Figure: member database signature types – structural domains, functional annotation of families/domains, protein features (sites); methods shown – hidden Markov models, fingerprints, profiles, patterns, HAMAP]

Database | Basis | Institution | Built from | Focus | URL
Pfam | HMM | EBI | Sequence alignment | Family & domain based on conserved sequence | http://pfam.xfam.org/
Gene3D | HMM | UCL | Structure alignment | Structural domain | http://gene3d.biochem.ucl.ac.uk/Gene3D/
Superfamily | HMM | Uni. of Bristol | Structure alignment | Evolutionary domain relationships | http://supfam.cs.bris.ac.uk/SUPERFAMILY/
SMART | HMM | EMBL Heidelberg | Sequence alignment | Functional domain annotation | http://smart.embl-heidelberg.de/
TIGRFAM | HMM | J. Craig Venter Inst. | Sequence alignment | Microbial functional family classification | http://www.jcvi.org/cms/research/projects/tigrfams/overview/
Panther | HMM | Uni. S. California | Sequence alignment | Family functional classification | http://www.pantherdb.org/
PIRSF | HMM | PIR, Georgetown, Washington D.C. | Sequence alignment | Functional classification | http://pir.georgetown.edu/pirwww/dbinfo/pirsf.shtml
PRINTS | Fingerprints | Uni. of Manchester | Sequence alignment | Family functional classification | http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
PROSITE | Patterns & profiles | SIB | Sequence alignment | Functional annotation | http://expasy.org/prosite/
HAMAP | Profiles | SIB | Sequence alignment | Microbial protein family classification | http://expasy.org/sprot/hamap/
ProDom | Sequence clustering | PRABI: Rhône-Alpes Bioinformatics Center | Sequence alignment | Conserved domain prediction | http://prodom.prabi.fr/prodom/current/html/home.php

The aim of InterPro

InterPro

InterPro: multiple sequence analysis

• Outputs TSV, XML, GFF3, HTML & SVG formats
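Of the output formats listed, TSV is the easiest to post-process. The sketch below reads such a file into dictionaries. The 15-column layout is an assumption based on the commonly documented InterProScan 5 TSV output, where the last four columns appear only when InterPro and GO lookups are enabled. The sample line is made up for illustration.

```python
import csv
import io

# Assumed column order of the TSV output (check against your version's docs).
COLUMNS = [
    "protein", "md5", "length", "analysis", "signature_acc", "signature_desc",
    "start", "stop", "score", "status", "date",
    "interpro_acc", "interpro_desc", "go_terms", "pathways",
]

# Made-up example row, not real output.
SAMPLE = (
    "prot1\t0123456789abcdef0123456789abcdef\t350\tPfam\tPF00069\t"
    "Protein kinase domain\t10\t290\t1.2E-50\tT\t01-10-2015\t"
    "IPR000719\tProtein kinase domain\tGO:0004672|GO:0005524\t-\n"
)


def parse_matches(handle):
    """Parse signature matches from a TSV handle into a list of dicts."""
    matches = []
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue  # skip blank lines
        record = dict(zip(COLUMNS, row))
        record["start"] = int(record["start"])
        record["stop"] = int(record["stop"])
        go = record.get("go_terms", "-")  # column may be absent
        record["go_terms"] = [] if go in ("", "-") else go.split("|")
        matches.append(record)
    return matches


matches = parse_matches(io.StringIO(SAMPLE))
print(matches[0]["signature_acc"], matches[0]["go_terms"])
```

For a real run, replace `io.StringIO(SAMPLE)` with an open file handle on the TSV output.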

InterPro as a tool for Automatic annotation

Why automatic annotation is needed

• data growth in UniProtKB is fast

• manual curation is time-consuming

• experimental data are unavailable for many sequences/organisms

• organisms’ genomes are sequenced but often no biochemical characterization is conducted

Release | Section of database | No. of entries | Growth
2015_10 | reviewed (Swiss-Prot) | ~0.5 million | slow
2015_10 | unreviewed (TrEMBL) | >50 million | rapid

The Concepts in GO

1. Molecular Function – an elemental activity or task or job

• protein kinase activity

• insulin receptor activity

2. Biological Process – a commonly recognised series of events

• cell division

3. Cellular Component – where a gene product is located

• mitochondrion

• mitochondrial matrix

• mitochondrial inner membrane

The relationship between InterPro and GO (InterPro2GO)

• Curators manually add relevant GO terms to InterPro entries

• InterPro entry specificity determines the GO terms assigned

Example GO term sets for two entries of differing specificity:

• GO:0007186 G-protein coupled receptor signaling; GO:0016021 integral to membrane; GO:0007601 visual perception

• GO:0007186 G-protein coupled receptor signaling; GO:0016021 integral to membrane
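The InterPro2GO idea can be sketched as a lookup from InterPro entry to GO terms, followed by a union over the entries a protein matches. The GO identifiers are those shown on the slide; the InterPro accessions `IPR_FAMILY` and `IPR_SUBFAMILY` are hypothetical placeholders, not real entries.

```python
# Hypothetical InterPro entries mapped to the GO terms from the slide.
INTERPRO2GO = {
    # Broad entry: only terms true for every matching protein.
    "IPR_FAMILY": {"GO:0007186", "GO:0016021"},
    # More specific entry: can also carry a more specific term.
    "IPR_SUBFAMILY": {"GO:0007186", "GO:0016021", "GO:0007601"},
}


def go_terms_for(matched_entries):
    """Union of the GO terms mapped to each InterPro entry a protein matches."""
    terms = set()
    for entry in matched_entries:
        terms |= INTERPRO2GO.get(entry, set())
    return terms


broad_only = go_terms_for(["IPR_FAMILY"])
broad_and_specific = go_terms_for(["IPR_FAMILY", "IPR_SUBFAMILY"])
print(sorted(broad_only))
print(sorted(broad_and_specific))
```

A protein matching only the broad entry receives the two general terms; visual perception is transferred only when the more specific entry also matches.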

Using InterPro for annotation

• InterPro is the world’s major source of GO terms: ~90 million GO terms for ~30 million distinct UniProtKB sequences

• Also underlies the system adding annotation to UniProtKB/TrEMBL

• Provides matches to ~40 million proteins (approx. 80% of UniProtKB)

Annotation consistency:

• Using InterPro and GO for annotation allows direct comparison of proteins in UniProtKB

System | Rule creation | Trigger | Annotations | Scope
SAAS | automatic | taxonomy, InterPro | protein names, EC numbers, comments, KW, GO terms | all taxa
UniRule | manual | taxonomy, InterPro*, proteome property, sequence length | protein names, EC numbers, gene names, comments, features**, KW, GO terms | all taxa

* flexibility to create custom signatures and submit them to InterPro as required

** predictors for signal, transmembrane and coiled-coil features; alignment for positional ones

Automatic Annotation in UniProtKB

Components of a rule: conditions

Restrict application of rules to those unreviewed UniProtKB entries fulfilling the conditions

Types of conditions:

• InterPro signatures

• Functional classification of proteins using predictive models (signatures)

• taxonomy

• sequence features, e.g. length

• proteome features, e.g. outer membrane: yes (bacterial sequences)

Components of a rule: annotations

If an unreviewed UniProtKB entry fulfils the conditions of a rule, the annotations in the rule are propagated to this entry.

Types of annotations:

• protein names, including enzyme classification (EC) numbers

• functional annotation, e.g. catalytic activities

• gene ontology terms

• keywords

• sequence features, e.g. active sites, transmembrane domains
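The condition/annotation split described on the last two slides can be sketched as follows. This is a toy model under stated assumptions, not the UniRule or SAAS implementation: the class names, the rule identifier `RULE_0001`, the signature `IPR_EXAMPLE` and the entry accessions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    accession: str
    reviewed: bool
    taxonomy: set      # lineage names, e.g. {"Bacteria"}
    signatures: set    # InterPro signatures matched by the sequence
    length: int
    annotations: list = field(default_factory=list)


@dataclass
class Rule:
    rule_id: str
    signature: str     # condition: required InterPro signature
    taxon: str         # condition: required lineage member
    min_length: int    # condition: sequence feature (length range)
    max_length: int
    annotations: list  # what is propagated when all conditions hold

    def applies_to(self, entry):
        """True only for unreviewed entries that fulfil every condition."""
        return (
            not entry.reviewed
            and self.signature in entry.signatures
            and self.taxon in entry.taxonomy
            and self.min_length <= entry.length <= self.max_length
        )

    def apply(self, entry):
        """Propagate annotations, tagging each with the rule as its source."""
        if not self.applies_to(entry):
            return False
        entry.annotations.extend((text, self.rule_id) for text in self.annotations)
        return True


rule = Rule("RULE_0001", "IPR_EXAMPLE", "Bacteria", 100, 500,
            ["keyword: Transferase", "GO:0016740"])
unreviewed = Entry("X00001", False, {"Bacteria"}, {"IPR_EXAMPLE"}, 250)
reviewed = Entry("X00002", True, {"Bacteria"}, {"IPR_EXAMPLE"}, 250)
too_short = Entry("X00003", False, {"Bacteria"}, {"IPR_EXAMPLE"}, 50)

results = [rule.apply(e) for e in (unreviewed, reviewed, too_short)]
print(results, unreviewed.annotations)
```

Storing the rule identifier with each propagated annotation mirrors the evidence tags that state where an annotation comes from.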

How to access automatic annotation data?

Example of a UniRule

UR000172789 applied

evidence tags clearly state where annotation comes from

Example of a UniRule

highlight a rule’s logic

Attributing evidence

It needs to be made clear to the user when information is

1. experimentally based

2. predicted

3. transferred from a related species

Use of evidence codes gives this information

Evidence Code Ontology: http://www.ebi.ac.uk/ols/ontologies/eco
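A minimal sketch of how the three evidence categories might travel with each annotation is shown below. The class and function names are hypothetical. The ECO accessions are given from memory as the codes commonly seen on UniProtKB annotations and should be checked against the ontology before use.

```python
from dataclasses import dataclass

# Category -> ECO accession (assumed mapping, verify against the ontology).
ECO = {
    "experimental": "ECO:0000269",  # experimental evidence, manual assertion
    "predicted": "ECO:0000256",     # sequence model match, automatic assertion
    "transferred": "ECO:0000250",   # sequence similarity, manual assertion
}


@dataclass(frozen=True)
class Annotation:
    statement: str
    evidence: str  # one of the keys of ECO


def render(annotation):
    """Show an annotation with its evidence code attached."""
    return f"{annotation.statement} [{ECO[annotation.evidence]}]"


def experimental_only(annotations):
    """Keep only annotations backed by direct experimental evidence."""
    return [a for a in annotations if a.evidence == "experimental"]


annotations = [
    Annotation("binds oxygen", "experimental"),
    Annotation("protein kinase activity", "predicted"),
    Annotation("located in mitochondrion", "transferred"),
]

for a in annotations:
    print(render(a))
print(len(experimental_only(annotations)))
```

Because the evidence is structured rather than free text, users can filter a whole dataset down to experimentally supported statements with a single query.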

Thank you!

www.ebi.ac.uk

Twitter: @emblebi

Facebook: EMBLEBI
