Service Discovery in myGrid and the BioCatalogue, a Life Science Service Registry

Katy Wolstencroft, myGrid, University of Manchester

Post on 14-Jan-2016


Page 1: Service Discovery in myGrid and the BioCatalogue, a Life Science Service Registry

Service Discovery in myGrid and the BioCatalogue, a Life Science Service Registry

Katy Wolstencroft

myGrid

University of Manchester

Page 2

Lots of Resources

NAR 2008 – over 1000 databases

Page 3

Taverna Workflow Workbench

• Design and execution of workflows

• Access to local and remote resources and analysis tools

• Automation of data flow

• Iteration over large data sets

• Part of the myGrid project

Page 4

Who Uses Taverna?

Access to 3500+ public service operations

• 55,000+ sourceforge downloads
• 10,000+ downloads of v1.7
• 40+ downloads per day
• Ranked 148 sourceforge activity (11 Nov 2008)
• 350+ known organisations
• 17 known commercial
• 1000+ active users at any one time
• Users throughout UK, USA, Europe, SE Asia and South America

• Netherlands Bioinformatics Centre
• Genome Canada Bioinformatics Platform
• BioMOBY
• US iPlant Consortium
• US FLOSS social science program
• RENCI
• French SIGENAE farm animals project
• ThaiGrid
• CARMEN Neuroscience project
• SPINE consortium
• EU Enfin, EMBRACE, BioSapiens, Casimir
• EU SysMO Consortium
• NEBC The NERC Environmental Bioinformatics Centre
• Bergen Centre for Computational Biology
• Max-Planck institute for Plant Breeding Research
• Genoa Cancer Research Centre
• AstroGrid
• caBIG/caGRID

Page 5

What do Scientists use Taverna for?

• Data gathering, annotation and model building

• Data analysis from distributed tools

• Data mining and knowledge management
– Hypothesis generation and modelling and Text mining

• Data curation and warehouse population

• Parameter sweeps and simulation

• Systems biology model building
• Proteomics
• Sequence analysis
• Protein structure prediction
• Gene/protein annotation
• Microarray data analysis
• QTL studies
• QSAR studies
• Chemoinformatics
• Medical image analysis
• Public Health care epidemiology
• Heart model simulations
• High throughput screening
• Phenotype studies
• Phylogeny
• Statistical analysis
• Text mining
• Astronomy, Music, Meteorology

Page 6

Create and run workflows

Create and manage services as components

API Consumer

Share, discover and reuse workflows

Manage the metadata needed and generated

RDF, OWL

Discover and reuse services

Feta

Open Source Workflow Environment for Scientists

Page 7

Workflow Reuse

• Workflows allow high throughput experiments and automation

• Workflows are encapsulations of experiments

• Workflows developed for one experiment can be reused for others

• Easier to share, reuse and repurpose

The METHODS section of a scientific publication

Page 8
Page 9

Recycling, Reuse, Repurposing

• Paul writes workflows for identifying biological pathways implicated in resistance to Trypanosomiasis in cattle

• Paul meets Jo. Jo is investigating mouse Whipworm infection.

• Jo reuses one of Paul’s workflows without change.

• Jo identifies the biological pathways involved in sex dependence in the mouse model, believed to be involved in the ability of mice to expel the parasite.

• Previously a manual two year study by Jo had failed to do this.

Page 10

Where are the Services From?

• Over 3500 services available

• Major Service Providers
– European Bioinformatics Institute
– DNA DataBank of Japan
– NCBI – USA

• ‘Boutique’ Services
– Individual research labs producing public data sets
– Specialist tools for niche experiments

• We are not service providers

Page 11

What types of services?

• HTML
• WSDL Web Services
• BioMart
• R-processor
• BioMoby
• Soaplab
• Local Java services
• Beanshell
• Workflows
• ….coming soon – REST, Matlab

Variable or non-existent documentation or help

Page 12

Taverna in an ‘open’ world

Advantages
• Connection to lots of resources
• Flexible system
• Can adapt to new technologies

Disadvantages
• Services are developed for other purposes
• We can’t control how they work
• We have to deal with the heterogeneity

Page 13

Finding Services

When using services, scientists need to:

• Find them – in distributed locations, produced by different host institutions

• Interpret them – what do the services do - what experiments can they perform using them?

• Know how to invoke them – what data and initial parameters do they need to supply?

Page 14

Metadata from a WSDL

<wsdl:message name="getGlimmersResponse">
  <wsdl:part name="getGlimmersReturn" type="xsd:string"/>
</wsdl:message>
<wsdl:message name="aboutServiceRequest"/>
<wsdl:message name="getGlimmersRequest">
  <wsdl:part name="in0" type="xsd:string"/>
  <wsdl:part name="in1" type="xsd:string"/>
  <wsdl:part name="in2" type="xsd:string"/>
  <wsdl:part name="in3" type="xsd:string"/>
  <wsdl:part name="in4" type="xsd:string"/>
  <wsdl:part name="in5" type="xsd:string"/>
  <wsdl:part name="in6" type="xsd:string"/>
  <wsdl:part name="in7" type="xsd:int"/>
  <wsdl:part name="in8" type="xsd:string"/>

Pathport Web service from the Virginia Bioinformatics Institute

http://pathport.vbi.vt.edu/services/wsdls/beta/glimmer.wsd

Name of the service

Uninformative names for parameters

What kind of string?
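
To make the point about uninformative metadata concrete, here is a minimal Python sketch that lists everything a client can learn from the message definitions alone. The message and part elements are taken from the slide; the wrapping wsdl:definitions element, the closing tag of the last message and the function name message_parts are assumptions added so the fragment parses.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"

# Message elements as shown on the slide. The <wsdl:definitions> wrapper and
# the final </wsdl:message> are assumptions added so the fragment is well-formed.
WSDL_FRAGMENT = """
<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
                  xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <wsdl:message name="getGlimmersResponse">
    <wsdl:part name="getGlimmersReturn" type="xsd:string"/>
  </wsdl:message>
  <wsdl:message name="aboutServiceRequest"/>
  <wsdl:message name="getGlimmersRequest">
    <wsdl:part name="in0" type="xsd:string"/>
    <wsdl:part name="in1" type="xsd:string"/>
    <wsdl:part name="in2" type="xsd:string"/>
    <wsdl:part name="in3" type="xsd:string"/>
    <wsdl:part name="in4" type="xsd:string"/>
    <wsdl:part name="in5" type="xsd:string"/>
    <wsdl:part name="in6" type="xsd:string"/>
    <wsdl:part name="in7" type="xsd:int"/>
    <wsdl:part name="in8" type="xsd:string"/>
  </wsdl:message>
</wsdl:definitions>
"""


def message_parts(wsdl_text):
    """Map each message name to its list of (part name, declared type)."""
    root = ET.fromstring(wsdl_text.strip())
    parts_by_message = {}
    for message in root.findall(f"{{{WSDL_NS}}}message"):
        parts_by_message[message.get("name")] = [
            (part.get("name"), part.get("type"))
            for part in message.findall(f"{{{WSDL_NS}}}part")
        ]
    return parts_by_message


PARTS = message_parts(WSDL_FRAGMENT)
for message_name, parts in PARTS.items():
    print(message_name, parts)
```

The only facts recovered for getGlimmersRequest are nine positional names (in0 to in8) and their XML Schema types. Nothing says which string is a sequence, a database name or a parameter, which is the gap that semantic annotation is meant to fill.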

Page 15

Semantics and Web Services

• SAWSDL – Semantic Annotations for WSDL working group

• Virtually no uptake by bioinformatics service providers

• Doesn’t address non-WSDL services

Page 16

Adding Semantics – Annotating Services

Find services by their function instead of their name

• The services might be distributed, but a registry of service descriptions can be central and queried

• We need to annotate services with semantics

In myGrid, we use the Feta Semantic Discovery tool

and a semantic annotation tool – and expert curation

Page 17

myGrid Ontology

Logically separated into two parts:

• Service ontology
Physical and operational features of (web) services

• Domain ontology (Semantic Content Model)
Annotation vocabulary for core bioinformatics data, data types and their relationships

Page 18

Service Ontology

• Models services from the point of view of the scientist
– Where is it?
– How many inputs/outputs?
– Who hosts it?

• Invocation details are hidden by the Taverna workbench

• Differs from related initiatives in this respect

Page 19

Domain Ontology

• Informatics: captures the key concepts of data, data structures, databases and metadata.

• Bioinformatics: The domain-specific data sources (e.g. the model organism sequencing databases), and domain-specific algorithms for searching and analyzing data (e.g. the sequence alignment algorithm, clustalw).

• Molecular biology: Concepts include, for example, protein sequence and nucleic acid sequence.

• Formats: A hierarchy describing bioinformatics file formats. For example, fasta format for sequence data, or phylip format for phylogenetic data.

• Tasks: A hierarchy describing the generic tasks a service operation can perform. Examples include retrieving, displaying, and aligning.

Page 20

Example Service Annotation

• Example: BLAST from the DDBJ
– Performs task: Alignment
– Uses Method: Similarity Search Algorithm
– Uses Resources: DNA/Protein sequence databases
– Inputs:
  • biological sequence (and format)
  • database name (and format)
  • blast program (and format)
– Outputs: Blast Report
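
As an illustration only, the annotation above can be written down as a small record and searched by function rather than by name. This is a Python sketch, not the Feta data model: the class ServiceAnnotation, its field names, the helper find_by_task and the second registry entry (fetch_sequence from "example provider") are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceAnnotation:
    """Hypothetical record mirroring the properties listed on the slide."""
    name: str
    provider: str
    task: str
    method: str
    resources: List[str]
    inputs: List[str]
    outputs: List[str]


# Values taken from the BLAST example on the slide.
BLAST_DDBJ = ServiceAnnotation(
    name="BLAST",
    provider="DDBJ",
    task="Alignment",
    method="Similarity Search Algorithm",
    resources=["DNA/Protein sequence databases"],
    inputs=["biological sequence", "database name", "blast program"],
    outputs=["Blast Report"],
)

# Invented entry, present only so the query has something to filter out.
FETCH_SEQUENCE = ServiceAnnotation(
    name="fetch_sequence",
    provider="example provider",
    task="Retrieving",
    method="Database lookup",
    resources=["DNA/Protein sequence databases"],
    inputs=["database name"],
    outputs=["biological sequence"],
)

REGISTRY = [BLAST_DDBJ, FETCH_SEQUENCE]


def find_by_task(registry, task):
    """Return the services annotated as performing the given task."""
    return [s for s in registry if s.task.lower() == task.lower()]


for service in find_by_task(REGISTRY, "alignment"):
    print(service.name, "from", service.provider)
```

A scientist who does not know that the operation is called BLAST can still find it by asking for services that perform alignment.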

Page 21

myGrid Ontology

First version of the ontology ~ 2002

Originally developed in DAML+OIL

Now developed in OWL and a version exported to RDFS

Number of classes in the ontology ~750

Domain and service ontology used by myGrid users and developers of myGrid related plugins

Service ontology also used by BioMoby

W3C compliant with respect to ontology modelling

Page 22

How do we use the ontology?

Two methods of service description

1. Decision Support - querying
Composite matches to ontology terms
Multiple terms are used to query the annotations

2. Decision Making - reasoning
Single description – whole service model
Enables automated detection of service mismatches
Enables possibility of automated addition of services
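
The two uses can be sketched in a few lines of Python. This is a toy stand-in, not an OWL reasoner and not the myGrid ontology: the IS_A table, the dictionary layout of the annotations, the "translate" service and all function names are assumptions made for illustration. Only the terms protein sequence, nucleic acid sequence, biological sequence, Alignment and Blast Report come from the slides.

```python
# Toy is-a hierarchy (child term -> parent term); an assumption for illustration.
IS_A = {
    "protein sequence": "biological sequence",
    "nucleic acid sequence": "biological sequence",
}


def subsumed_by(term, ancestor):
    """True if term equals ancestor or reaches it by following IS_A links."""
    while term is not None:
        if term == ancestor:
            return True
        term = IS_A.get(term)
    return False


def composite_match(annotation, query):
    """Querying: every queried property must match at least one annotated term."""
    return all(
        any(subsumed_by(value, wanted) for value in annotation.get(prop, []))
        for prop, wanted in query.items()
    )


def find_mismatches(upstream, downstream):
    """Reasoning: list downstream inputs that no upstream output can supply."""
    return [
        needed
        for needed in downstream["inputs"]
        if not any(subsumed_by(out, needed) for out in upstream["outputs"])
    ]


blast = {
    "task": ["Alignment"],
    "inputs": ["biological sequence"],
    "outputs": ["Blast Report"],
}
# Hypothetical service, used only to demonstrate the checks.
translate = {
    "task": ["Translating"],
    "inputs": ["nucleic acid sequence"],
    "outputs": ["protein sequence"],
}

print(composite_match(blast, {"task": "Alignment", "outputs": "Blast Report"}))
print(find_mismatches(translate, blast))
print(find_mismatches(blast, translate))
```

Feeding translate into blast raises no mismatch because a protein sequence is a kind of biological sequence. Feeding blast into translate is flagged, since a Blast Report cannot supply a nucleic acid sequence.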

Page 23

Curation Sweatshop

Steady increase in numbers of services and workflows

Users able to find annotated services

BUT

Time-consuming and expensive

More and more services built daily

SO

Should we encourage service providers to add value?

Should we get users involved?

Page 24

Collaboration between University of Manchester and EBI

Drawing on 6 years’ experience in Taverna of semantic annotation of services using RDF and OWL ontologies

Drawing on experience at EBI in service provision

Drawing on experience of social curation and networking from myExperiment

First pilot December 2008

Page 25

Getting the Minimum

Community annotation
• Must be easy and quick
• Must allow partial descriptions
• Multiple annotations of the same service

• What is the minimum information to enable
– service discovery
– service invocation

Page 26

Grading Services

• Bronze – enough to locate the service. Example of service invocation

• Silver

• Gold

• Platinum – full description. All properties annotated – including dependencies between them – reliability metrics – AND CHECKED AND VERIFIED BY A CURATOR

Page 27

Automatic Annotation

• Inferring service descriptions from workflows

• Gathering usage data
– How many workflows use this service

• Gathering reliability data - monitoring
– When is this service available
– How many times does it fail

• Helps with “shopping” for services
– People who used this service also used this service
– Top 10 services
– Services that do the same things
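
A minimal Python sketch of how such usage and reliability figures could be derived, assuming each workflow is available as a set of service names and each monitoring probe as a boolean. The workflow contents, the name fetch_sequence and the function names are invented for illustration and do not describe the BioCatalogue implementation.

```python
from collections import Counter

# Hypothetical data: workflow id -> names of the services it calls.
WORKFLOWS = {
    "wf1": {"blast", "fetch_sequence", "clustalw"},
    "wf2": {"blast", "fetch_sequence"},
    "wf3": {"blast"},
}


def usage_counts(workflows):
    """How many workflows use each service."""
    counts = Counter()
    for services in workflows.values():
        counts.update(services)
    return counts


def also_used(workflows, service):
    """Services that appear in the same workflows as the given service."""
    companions = Counter()
    for services in workflows.values():
        if service in services:
            companions.update(services - {service})
    return companions


def availability(probe_results):
    """Fraction of monitoring probes that found the service up."""
    if not probe_results:
        return None
    return sum(probe_results) / len(probe_results)


print(usage_counts(WORKFLOWS).most_common(10))   # a "Top 10" style ranking
print(also_used(WORKFLOWS, "blast").most_common(1))
print(availability([True, True, False, True]))
```

None of these figures needs a human annotator, which is why they complement rather than replace curated descriptions.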

Page 28

Annotation Provenance

• Who said what about what?
• Harvesting community annotation
• Verifying and augmenting by a curator
• ‘Trust’ Models

• Annotation versions
– In a workflow context
– As stand alone services
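
One way to picture "who said what about what" is to store every annotation together with its author and a curator flag, then resolve conflicts with a simple rule. The Python sketch below uses an assumed trust rule (curator-verified values win, otherwise the most frequent community value); the record layout, the author names and the rule itself are illustrative and are not taken from the presentation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    service: str           # what the statement is about
    prop: str              # which property is described
    value: str             # what was said
    author: str            # who said it
    curated: bool = False  # True once a curator has verified it


def resolve(annotations, service, prop):
    """Assumed trust rule: prefer curated values, else the most common one."""
    relevant = [a for a in annotations if a.service == service and a.prop == prop]
    pool = [a for a in relevant if a.curated] or relevant
    if not pool:
        return None
    return Counter(a.value for a in pool).most_common(1)[0][0]


# Hypothetical community and curator statements about one service.
ANNOTATIONS = [
    Annotation("BLAST", "task", "Searching", "alice"),
    Annotation("BLAST", "task", "Searching", "bob"),
    Annotation("BLAST", "task", "Alignment", "curator1", curated=True),
    Annotation("BLAST", "output", "Blast Report", "alice"),
]

print(resolve(ANNOTATIONS, "BLAST", "task"))
print(resolve(ANNOTATIONS, "BLAST", "output"))
```

Keeping the individual statements, rather than overwriting them, is what allows multiple partial descriptions of the same service to coexist and be re-evaluated when the trust model changes.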

Page 29

Feta Model

Semantic Content Model

Service Model

Page 30

Curation Model

Quantitative Content

Tags

Service Model

Semantic Content Model

Ontologies

Functional

Provenance

Operational

Operational Metrics

Conditions of Use

Social Standing

Biocatalogue Service Profile

Page 31

A.N. Other

Curation

Quant’ve

Service Model

Semantic Content Model

Execution Host

Service Profile

Finding

WSDL

WADL

S-A.N. Other

SAWSDL

SA-REST

Analytics

Ranking

Browse/Shop

Search

Customised

Service

Workflow

Page 32

Annotation Process

Page 33

BioCatalogue: The pilot

Features: User Registration

Service Registration

Search

Annotation

Notification

Integration with myExperiment

Page 34

For More Information

• BioCatalogue website http://www.biocatalogue.org/

• BioCatalogue wiki http://www.biocatalogue.org/wiki

• myGrid website http://www.mygrid.org.uk/

Page 35

myGrid Team

Page 36

Services

Interface

Neutral

Functional

Conditions of Use

Operational

Social Standing

Operational Metrics

Provenance

Multiply described Third Party

Aggregated Feeds

Monitoring

Multiple Sources

Multiple Versions

Dynamic

Multiple Instances

Discovery

Interoperability

Composition

Reuse

Trusted Authorities

Policies

Ranking