Designing a community resource – the Complex Portal as an example Sandra Orchard

Upload: embl-abr

Post on 15-Apr-2017


Page 1: Designing a community resource - Sandra Orchard

Designing a community resource – the Complex Portal as an example

Sandra Orchard

Page 2: Designing a community resource - Sandra Orchard

Hands-on exercise

Design a manually curated data resource that enables the description of species-agnostic protein complexes and acts as a reference resource in the same way that UniProt does for proteins. Use as examples:

1. Human haemoglobin

2. Arabidopsis light-harvesting complex

Page 3: Designing a community resource - Sandra Orchard

Designing a new resource - what else is out there?

• Before starting to design a resource, assess what else is out there – re-inventing the wheel causes community fragmentation and confusion, as well as being a waste of limited funds

• Is it needed – what gap in the market is it designed to fill?

• Investigate possibilities for collaboration, rather than competition

• If another resource exists, does it meet your/consumer demands – can you contribute to it and improve it?

Page 4: Designing a community resource - Sandra Orchard

Designing a new resource

• How will researchers use it, and what information do they want? Conduct extensive user requirement studies before starting the design process.

• How will users search it? This will impact on data entry/annotation.

• Data visualisation – again, what do users want? Usability studies are critical

• Long term plans – will it survive the first grant renewal?

Page 5: Designing a community resource - Sandra Orchard

Complex Portal - what else was out there?

• Information on protein complexes scattered between multiple resources but no unifying resource

• MIPS catalogued yeast complexes in 2000

• CORUM – human complexes, project terminated in 2009

• Decision – use as starting point or start again?

Page 6: Designing a community resource - Sandra Orchard

Information content and presentation

• User consultation – design what they need, not what you want to give them

• Don’t get too attached to your first paper prototype – be prepared to sacrifice your concept to community need

• Develop a beta site, then observe researchers using it.

• Keep testing, react to new demands, novel use cases

Page 7: Designing a community resource - Sandra Orchard

Use of community standards

Use of community standards enables:

• Data merger across multiple resources – contribute to a greater community effort

• Data re-use and longevity

• Immediate access to existing tool suites

Page 8: Designing a community resource - Sandra Orchard

Use of Community standards – Complex Portal

• Established standard formats for molecular interactions (PSI-MI XML/MITAB)

• PSI-MI XML2.5 was designed for experimental data, so curated complex data were not a perfect fit – worked with the PSI-MI workgroup to produce a new version

• MITAB was designed for binary pairs, not complexes – ComplexTAB will be presented to the MI workgroup for adoption
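To illustrate why a format built around binary pairs fits complexes poorly, here is a minimal Python sketch. It assumes a complex is simply a list of member labels (the subunit names are hypothetical and do not correspond to real MITAB columns) and shows the two usual ways of forcing a complex into pairs: a spoke expansion (one chosen member paired with each of the others) and a matrix expansion (every possible pair). In both cases the pairs alone no longer state that the members form a single assembly.

```python
from itertools import combinations


def spoke_expansion(bait, members):
    """Pair one chosen member (the 'bait') with every other member."""
    return [(bait, member) for member in members if member != bait]


def matrix_expansion(members):
    """Pair every member with every other member."""
    return list(combinations(members, 2))


# Hypothetical four-subunit complex; the labels are illustrative only.
complex_members = ["subunit_A", "subunit_B", "subunit_C", "subunit_D"]

spoke_pairs = spoke_expansion("subunit_A", complex_members)
matrix_pairs = matrix_expansion(complex_members)

print(len(spoke_pairs))   # number of pairs from the spoke expansion
print(len(matrix_pairs))  # number of pairs from the matrix expansion
```

A complex-level record keeps the member list and stoichiometry together, which is the gap a complex-oriented tabular format would need to fill.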

Page 9: Designing a community resource - Sandra Orchard

Use of Community standards – Complex Portal

• Used existing identifiers for components (UniProtKB, ChEBI, RNAcentral) – enables import of additional information using resource APIs; for example, users can search the website using gene synonyms

• Organism non-specific – enables us to describe complexes in a range of species, including non-model organisms
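One way to picture a record built on existing identifiers is the sketch below. It is not the Complex Portal schema: the class and field names are assumptions chosen for illustration. The two UniProtKB accessions are those of the human haemoglobin alpha and beta subunits, while the ChEBI entry is deliberately left as a placeholder rather than a real identifier.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    database: str          # e.g. "uniprotkb", "chebi", "rnacentral"
    accession: str         # identifier issued by that database
    stoichiometry: int = 1


@dataclass
class ComplexRecord:
    name: str
    taxon_id: int          # NCBI taxonomy ID, so any species can be described
    components: list = field(default_factory=list)

    def accessions(self, database):
        """Return the accessions of all components from one source database."""
        return [c.accession for c in self.components if c.database == database]


haemoglobin = ComplexRecord(
    name="Haemoglobin (illustrative record)",
    taxon_id=9606,  # Homo sapiens
    components=[
        Component("uniprotkb", "P69905", 2),  # haemoglobin subunit alpha
        Component("uniprotkb", "P68871", 2),  # haemoglobin subunit beta
        Component("chebi", "CHEBI:PLACEHOLDER", 4),  # haem cofactor, placeholder ID
    ],
)

print(haemoglobin.accessions("uniprotkb"))
```

Because components are stored as references, names, synonyms and sequences can be fetched from the source databases instead of being copied into the record.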

Page 10: Designing a community resource - Sandra Orchard

Use of Community standards enables use of existing tools

• Community standards have encouraged tool development by users; the software is often open-source and freely available, and can often be incorporated directly into websites with little/no additional development

• Complex Portal viewer originally written to visualise cross-linking data

Page 11: Designing a community resource - Sandra Orchard

Use of Community standards enables use of existing tools

• Look for initiatives which make open-source tools, apps/plug-ins, visualizers and widgets freely available, e.g. BioJS, BioPerl, Cytoscape...

Page 12: Designing a community resource - Sandra Orchard

Free text vs Ontologies

Free text

Pros – versatile, fully descriptive, flexible

Cons – can be difficult to interpret, long-winded, error-prone, difficult to search

CVs

Pros – structured, consistent, concise

Cons – may not deal well with ‘odd’ cases, lack of information

Consider using both!

Page 13: Designing a community resource - Sandra Orchard

Use of controlled vocabularies

• Again, re-use rather than re-invent

• Use of CVs enables searches across resources, but can also make intelligent searches within resources easy to implement

For example, can search for

• all transcription factors

• all complexes involved in respiration

• all mitochondrial complexes

Page 14: Designing a community resource - Sandra Orchard

Use of controlled vocabularies

In the Complex Portal you can search for

1. All enzymes - GO:0003824 (catalytic activity)

2. All transferases - GO:0016740 (transferase activity)

3. All protein kinases - GO:0004672 (protein kinase activity)

4. All cyclin-dependent protein kinases - GO:0097472 (cyclin-dependent protein kinase activity)

Similarly, can use the ChEBI ontology – search on porphyrin
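The searches listed on this slide work because each GO term sits in a hierarchy, so a query on a broad term also returns everything annotated with a more specific one. The sketch below shows the idea under simplifying assumptions: the parent links connect only the four terms named on the slide (the real GO graph has intermediate terms and multiple parents), and the complexes are hypothetical.

```python
# Simplified is_a links between the four GO terms named on the slide.
# Intermediate terms of the real ontology are omitted.
PARENT = {
    "GO:0097472": "GO:0004672",  # cyclin-dependent protein kinase -> protein kinase
    "GO:0004672": "GO:0016740",  # protein kinase -> transferase
    "GO:0016740": "GO:0003824",  # transferase -> catalytic activity
}

# Hypothetical complexes, each annotated with its most specific term.
ANNOTATIONS = {
    "complex_1": "GO:0097472",
    "complex_2": "GO:0004672",
    "complex_3": "GO:0016740",
}


def ancestors(term):
    """Return the term itself followed by every ancestor reachable via PARENT."""
    lineage = [term]
    while term in PARENT:
        term = PARENT[term]
        lineage.append(term)
    return lineage


def search(query_term):
    """Return complexes annotated with the query term or any of its descendants."""
    return sorted(
        name for name, term in ANNOTATIONS.items() if query_term in ancestors(term)
    )


print(search("GO:0003824"))  # all enzymes
print(search("GO:0004672"))  # all protein kinases
```

Curators only record the most specific applicable term; the broader searches come from the ontology structure rather than from extra annotation.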

Page 15: Designing a community resource - Sandra Orchard

Linking to external resources

• Extensive cross-referencing is time-consuming but enables subsequent pulling in of data from other resources

Page 16: Designing a community resource - Sandra Orchard

Make this the ‘go to’ resource for your community

• Must fit community need, be easy to search and deliver the results the user wants

• Outreach – publications, conferences, talks...

• Collaborate on a high-impact analysis paper, with your resource playing a key role

• Protocols, tutorials, videos, hands-on training courses

• Use social media

Page 17: Designing a community resource - Sandra Orchard

Using Social Media

Page 18: Designing a community resource - Sandra Orchard

InterPro and Annotation transfer to Non-Model Organism Proteomes

Page 19: Designing a community resource - Sandra Orchard

What is InterPro

• InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites.

• Combines protein signatures from a number of member databases into a single searchable resource.

• Has resulted in an integrated database and diagnostic tool (InterProScan).

Page 20: Designing a community resource - Sandra Orchard

Protein signatures

Model the pattern of conserved amino acids at specific positions within a multiple sequence alignment

• Patterns

• Profiles

• Profile HMMs

Use these models (signatures) to infer relationships with the characterised sequences from which the alignment was constructed

Approach used by a variety of databases: Pfam, TIGRFAMs, PANTHER, Prosite, etc.

Page 22: Designing a community resource - Sandra Orchard

Introduction to InterPro

How are protein signatures made?

Workflow: Protein family/domain → Multiple sequence alignment → Build model → Search → Significant matches → Refine → Protein signature

Example aligned sequences:

ITWKGPVCGLDGKTYRNECALL
AVPRSPVCGSDDVTYANECELK
SVPRSPVCGSDGVTYGTECDLK
HPPPGPVCGTDGLTYDNRCELR

E-values of significant matches: 1e-49, 3e-42, 5e-39, 6e-10
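As a rough illustration of the simplest signature type, a pattern, the sketch below derives a regular expression from the four aligned sequences on this slide: a column where all sequences agree becomes a fixed residue, and any other column becomes a wildcard. This is a toy rule chosen for clarity, not the method used by any member database; real patterns allow residue sets and variable spacing, and profiles and profile HMMs score each position instead of demanding exact matches.

```python
import re

# The four aligned sequences shown on the slide.
ALIGNMENT = [
    "ITWKGPVCGLDGKTYRNECALL",
    "AVPRSPVCGSDDVTYANECELK",
    "SVPRSPVCGSDGVTYGTECDLK",
    "HPPPGPVCGTDGLTYDNRCELR",
]


def consensus_pattern(alignment):
    """Keep fully conserved columns as literals; replace the rest with '.'."""
    pattern = []
    for column in zip(*alignment):
        pattern.append(column[0] if len(set(column)) == 1 else ".")
    return "".join(pattern)


pattern = consensus_pattern(ALIGNMENT)
print(pattern)  # .....PVCG.D..TY...C.L.

# Every sequence used to build the pattern should match it.
matches = [bool(re.fullmatch(pattern, sequence)) for sequence in ALIGNMENT]
print(matches)
```

Searching a sequence database with such a pattern and then refining it against the hits corresponds to the Search and Refine steps of the workflow.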

Page 23: Designing a community resource - Sandra Orchard

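[Figure: InterPro member databases grouped by focus (structural domains; functional annotation of families/domains; protein features (sites)) and by method (hidden Markov models, fingerprints, profiles, patterns); HAMAP is labelled on the figure.]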

Page 24: Designing a community resource - Sandra Orchard

Database | Basis | Institution | Built from | Focus | URL
Pfam | HMM | EBI | Sequence alignment | Family & Domain based on conserved sequence | http://pfam.xfam.org/
Gene3D | HMM | UCL | Structure alignment | Structural Domain | http://gene3d.biochem.ucl.ac.uk/Gene3D/
Superfamily | HMM | Uni. of Bristol | Structure alignment | Evolutionary domain relationships | http://supfam.cs.bris.ac.uk/SUPERFAMILY/
SMART | HMM | EMBL Heidelberg | Sequence alignment | Functional domain annotation | http://smart.embl-heidelberg.de/
TIGRFAM | HMM | J. Craig Venter Inst. | Sequence alignment | Microbial Functional Family Classification | http://www.jcvi.org/cms/research/projects/tigrfams/overview/
Panther | HMM | Uni. S. California | Sequence alignment | Family functional classification | http://www.pantherdb.org/
PIRSF | HMM | PIR, Georgetown, Washington D.C. | Sequence alignment | Functional classification | http://pir.georgetown.edu/pirwww/dbinfo/pirsf.shtml
PRINTS | Fingerprints | Uni. of Manchester | Sequence alignment | Family functional classification | http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
PROSITE | Patterns & Profiles | SIB | Sequence alignment | Functional annotation | http://expasy.org/prosite/
HAMAP | Profiles | SIB | Sequence alignment | Microbial protein family classification | http://expasy.org/sprot/hamap/
ProDom | Sequence clustering | PRABI : Rhône-Alpes Bioinformatics Center | Sequence alignment | Conserved domain prediction | http://prodom.prabi.fr/prodom/current/html/home.php

Page 25: Designing a community resource - Sandra Orchard

The aim of InterPro

InterPro

Page 26: Designing a community resource - Sandra Orchard

InterPro: multiple sequence analysis

• Outputs TSV, XML, GFF3, HTML & SVG formats
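The TSV output is the easiest of these formats to post-process. The sketch below collects GO terms per protein from such a file. It assumes the 15-column layout described in the InterProScan 5 documentation (with the trailing InterPro, GO and pathway columns optional and GO terms separated by "|"); check this against the version in use. The sample rows are invented and every accession in them is a placeholder.

```python
import csv
import io

# Assumed column order of the TSV output; verify against your version's docs.
COLUMNS = [
    "protein", "md5", "length", "analysis", "signature_acc",
    "signature_desc", "start", "stop", "score", "status", "date",
    "interpro_acc", "interpro_desc", "go_terms", "pathways",
]

# Invented rows for illustration only; accessions are placeholders.
SAMPLE = (
    "protein_1\tmd5placeholder\t300\tPfam\tPF00000\tExample domain\t10\t120"
    "\t1e-20\tT\t01-01-2017\tIPR000000\tExample entry\tGO:0004672|GO:0016740\t-\n"
    "protein_1\tmd5placeholder\t300\tSMART\tSM00000\tExample domain\t12\t118"
    "\t2e-15\tT\t01-01-2017\tIPR000000\tExample entry\tGO:0004672\t-\n"
    "protein_2\tmd5placeholder\t150\tPRINTS\tPR00000\tExample motif\t5\t40"
    "\t3e-05\tT\t01-01-2017\n"
)


def go_terms_by_protein(handle):
    """Map each protein to the sorted, de-duplicated GO terms of its matches."""
    collected = {}
    for row in csv.reader(handle, delimiter="\t"):
        record = dict(zip(COLUMNS, row))  # short rows lack the trailing keys
        terms = collected.setdefault(record["protein"], set())
        go_field = record.get("go_terms", "-")
        if go_field not in ("", "-"):
            terms.update(go_field.split("|"))
    return {protein: sorted(terms) for protein, terms in collected.items()}


go_by_protein = go_terms_by_protein(io.StringIO(SAMPLE))
print(go_by_protein)
```

Proteins with matches but no mapped GO terms are kept with an empty list, which makes gaps in annotation easy to spot.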

Page 27: Designing a community resource - Sandra Orchard

InterPro as a tool for Automatic annotation

Page 28: Designing a community resource - Sandra Orchard

Why automatic annotation is needed

• data growth in UniProtKB is fast:

Release | Section of database | No. of entries | Growth
2015_10 | reviewed (Swiss-Prot) | ~0.5 million | slow
2015_10 | unreviewed (TrEMBL) | >50 million | rapid

• manual curation is time-consuming

• experimental data are unavailable for many sequences/organisms

• organisms’ genomes are sequenced but often no biochemical characterization is conducted

Page 29: Designing a community resource - Sandra Orchard

The Concepts in GO

1. Molecular Function – an elemental activity or task or job

• protein kinase activity

• insulin receptor activity

2. Biological Process – a commonly recognised series of events

• cell division

3. Cellular Component – where a gene product is located

• mitochondrion

• mitochondrial matrix

• mitochondrial inner membrane

Page 30: Designing a community resource - Sandra Orchard

The relationship between InterPro and GO (InterPro2GO)

• Curators manually add relevant GO terms to InterPro entries

• InterPro entry specificity determines the GO terms assigned

GO:0007186 G-protein coupled receptor signaling
GO:0016021 integral to membrane
GO:0007601 visual perception

GO:0007186 G-protein coupled receptor signaling
GO:0016021 integral to membrane
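The transfer step can be sketched as a lookup from matched InterPro entries to their curated GO terms. In the sketch below the entry identifiers are hypothetical placeholders, and assigning the three-term list to the more specific entry and the two-term list to the more general one is an assumption based on the bullet about entry specificity; only the GO identifiers come from the slide.

```python
# Hypothetical InterPro entry identifiers mapped to the GO terms on the slide.
INTERPRO2GO = {
    "IPR_GENERAL_FAMILY": {"GO:0007186", "GO:0016021"},
    "IPR_SPECIFIC_SUBFAMILY": {"GO:0007186", "GO:0016021", "GO:0007601"},
}


def transfer_go_terms(matched_entries):
    """Return the union of GO terms attached to every matched InterPro entry."""
    terms = set()
    for entry in matched_entries:
        terms |= INTERPRO2GO.get(entry, set())
    return sorted(terms)


# A protein matching only the general entry gets the two broad terms.
print(transfer_go_terms(["IPR_GENERAL_FAMILY"]))

# A protein that also matches the specific entry gains the extra term.
print(transfer_go_terms(["IPR_GENERAL_FAMILY", "IPR_SPECIFIC_SUBFAMILY"]))
```

Keeping broad entries restricted to broad terms is what stops a specific function from being transferred to every member of a large family.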

Page 31: Designing a community resource - Sandra Orchard

InterPro2GO

InterPro

Page 32: Designing a community resource - Sandra Orchard

Using InterPro for annotation

• InterPro is the world’s major source of GO terms: ~90 million GO terms for ~30 million distinct UniProtKB sequences

• Also underlies the system adding annotation to UniProtKB/TrEMBL

• Provides matches to ~40 million proteins (approx. 80% of UniProtKB)

Annotation consistency:

• Using InterPro and GO for annotation allows direct comparison of proteins in UniProtKB

Page 33: Designing a community resource - Sandra Orchard

Automatic Annotation in UniProtKB

System | Rule creation | Trigger | Annotations | Scope
SAAS | automatic | taxonomy, InterPro | protein names, EC numbers, comments, KW, GO terms | all taxa
UniRule | manual | taxonomy, InterPro*, proteome property, sequence length | protein names, EC numbers, gene names, comments, features**, KW, GO terms | all taxa

* flexibility to create custom signatures and submit them to InterPro as required

** predictors for signal, transmembrane and coiled-coil features; alignment for positional ones

Page 34: Designing a community resource - Sandra Orchard

Components of a rule: conditions

Restrict application of rules to those unreviewed UniProtKB entries fulfilling the conditions

Types of conditions:

• InterPro signatures – functional classification of proteins using predictive models (signatures)

• taxonomy

• sequence features, e.g. length

• proteome features, e.g. outer membrane: yes (bacterial sequences)

Page 35: Designing a community resource - Sandra Orchard

Components of a rule: annotations

If an unreviewed UniProtKB entry fulfils the conditions of a rule, the annotations in the rule are propagated to this entry.

Types of annotations:

• protein names, including enzyme classification (EC) numbers

• functional annotation, e.g. catalytic activities

• gene ontology terms

• keywords

• sequence features, e.g. active sites, transmembrane domains
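Putting conditions and annotations together, a rule can be pictured as a test followed by a copy step. The sketch below is a simplified model, not UniProt's implementation: the class names, fields and all identifiers are hypothetical, each rule holds a single signature, taxon and length range, and the rule identifier is stored with every propagated annotation so its origin stays visible.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    accession: str
    reviewed: bool
    taxa: set            # lineage names, e.g. {"Bacteria"}
    signatures: set      # InterPro entries matched by the sequence
    length: int
    annotations: list = field(default_factory=list)


@dataclass
class Rule:
    rule_id: str
    required_signature: str
    required_taxon: str
    min_length: int
    max_length: int
    annotations: list    # (annotation type, value) pairs to propagate

    def applies_to(self, entry):
        """Check every condition; rules only fire on unreviewed entries."""
        return (
            not entry.reviewed
            and self.required_signature in entry.signatures
            and self.required_taxon in entry.taxa
            and self.min_length <= entry.length <= self.max_length
        )

    def apply(self, entry):
        """Propagate annotations, tagged with the rule ID, if conditions hold."""
        if not self.applies_to(entry):
            return False
        for kind, value in self.annotations:
            entry.annotations.append((kind, value, self.rule_id))
        return True


# Hypothetical rule and entries; none of these identifiers are real.
rule = Rule("RULE_EXAMPLE", "IPR_EXAMPLE", "Bacteria", 100, 500,
            [("protein name", "Example kinase"), ("GO term", "GO:0004672")])
unreviewed = Entry("ENTRY_1", False, {"Bacteria"}, {"IPR_EXAMPLE"}, 250)
reviewed = Entry("ENTRY_2", True, {"Bacteria"}, {"IPR_EXAMPLE"}, 250)

applied_unreviewed = rule.apply(unreviewed)
applied_reviewed = rule.apply(reviewed)
print(applied_unreviewed, unreviewed.annotations)
print(applied_reviewed, reviewed.annotations)
```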

Page 36: Designing a community resource - Sandra Orchard

How to access automatic annotation data?

Page 37: Designing a community resource - Sandra Orchard

How to access automatic annotation data?

Page 38: Designing a community resource - Sandra Orchard

Example of a UniRule

Page 39: Designing a community resource - Sandra Orchard

UR000172789 applied

evidence tags clearly state where annotation comes from

Page 40: Designing a community resource - Sandra Orchard

Example of a UniRule

highlight a rule’s logic

Page 41: Designing a community resource - Sandra Orchard

Example of a UniRule

highlight a rule’s logic

Page 42: Designing a community resource - Sandra Orchard

Attributing evidence

It needs to be made clear to the user when information is

1. experimentally based

2. predicted

3. transferred from a related species

Use of evidence codes gives this information

Evidence Code Ontology: http://www.ebi.ac.uk/ols/ontologies/eco
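A small sketch of how the three categories might be attached to annotations follows. The mapping of each category to an ECO identifier is an assumption based on the codes commonly seen in UniProtKB (experimental evidence, match to sequence model in automatic assertion, sequence similarity); confirm the identifiers against the ontology before relying on them. The function name is hypothetical.

```python
from enum import Enum


class Evidence(Enum):
    # ECO identifiers are assumptions; verify them in the Evidence Code Ontology.
    EXPERIMENTAL = "ECO:0000269"  # experimentally based
    PREDICTED = "ECO:0000256"     # predicted from a sequence model
    TRANSFERRED = "ECO:0000250"   # transferred by similarity to a related protein


def describe(value, evidence):
    """Render an annotation together with the kind of evidence behind it."""
    return f"{value} [{evidence.name.lower()}; {evidence.value}]"


print(describe("protein kinase activity", Evidence.PREDICTED))
```

Storing the code with each annotation, instead of per entry, lets users filter out predicted or transferred statements when they need experimental data only.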

Page 43: Designing a community resource - Sandra Orchard

Thank you!

www.ebi.ac.uk

Twitter: @emblebi

Facebook: EMBLEBI