Bioinformatics workflow management
Thoughts and case studies from industry.

Mark Schreiber, Bioinformatics Research Investigator

WWWFG, 5-7 June 2007

2 | Bioinformatics workflow management | Mark Schreiber

Outline

Integration and workflows

Early attempts

Case studies and examples

What does the future hold?

Conclusions

Bioinformatics at NITD

Data Integration

• Ontologies, Standards, DBs

Knowledge Discovery

• Algorithms, Informatics, Machine Learning

Modelling

• Pathways, Circuits, Abstraction

Infrastructure

Support

Research

Bioinformatics at NITD

BI combines data gathering, data storage and knowledge management with analytical tools to present complex and competitive information to planners and decision makers.

Hypothesis Generation and Validation. Providing the right information at the right time.

Decision Support.

Data Sources
Heterogeneity

The most significant research is done when heterogeneous data sources can be combined in one analysis.

Data access by source type:

• Webpages / Services: scrapers, CGI-Bin, WS-clients

• Flatfiles: parsers (one per format), BioJava / BioPerl

• XML: parser frameworks

• Images / Video: image analysis

• Relational DB: SQL, JDBC/ODBC, J2EE, .NET

• Instrument: API

Applications (Services)
Yet more heterogeneity

RDBMS

• Oracle, MySQL, PostgreSQL, etc.

Open Source

• Usually just a command line interface

Commercial software

• API, scripting engine, web service

Web services and Web resources

Integration is rarely seamless

Productivity vs. Innovation
Finding a balance

Development and manufacturing prioritize productivity

Research requires more innovation

Standardization increases productivity

Standardization limits innovation

• At the level it is applied

Standardization promotes innovation

• At higher levels

Workflows give a nice balance

What is a workflow?
In Bioinformatics

A data-driven procedure consisting of one or more transformation processes (nodes).

Can be represented as a directed graph.

• Direction is time: the order of transformations.

• A set of transformation rules.

A flow of data from its source to a destination (or result) via a series of merges, joins, manipulations and interconnected tools (services).

A specification designed in a Workflow Design System (modeling component) and run by a Workflow Management System (execution component).
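The directed-graph view above can be sketched in a few lines. This is only an illustration, assuming Python 3.9+ for the standard-library `graphlib` module; the node names (`load`, `gc_content`, `lengths`, `report`) and the toy sequences are hypothetical and do not come from the talk.

```python
from graphlib import TopologicalSorter

# Hypothetical nodes: each receives a dict of upstream results and returns one value.
def load(inputs):
    return ["ATGC", "GGCC", "ATAT"]

def gc_content(inputs):
    return [(s.count("G") + s.count("C")) / len(s) for s in inputs["load"]]

def lengths(inputs):
    return [len(s) for s in inputs["load"]]

def report(inputs):
    # Merge the two branches back into one result.
    return list(zip(inputs["lengths"], inputs["gc_content"]))

NODES = {"load": load, "gc_content": gc_content, "lengths": lengths, "report": report}

# Edges of the directed graph: node -> set of nodes it depends on.
DEPENDS_ON = {
    "load": set(),
    "gc_content": {"load"},
    "lengths": {"load"},
    "report": {"gc_content", "lengths"},
}

def run_workflow(nodes, depends_on):
    """Execute every node after all of its upstream nodes have run."""
    results = {}
    for name in TopologicalSorter(depends_on).static_order():
        upstream = {dep: results[dep] for dep in depends_on[name]}
        results[name] = nodes[name](upstream)
    return results

results = run_workflow(NODES, DEPENDS_ON)
print(results["report"])
```

Here the dictionary plays the role of the specification (design) and `run_workflow` plays the role of the execution component, which is the split described above.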

The UNIX Philosophy
Analogy to workflows

Write programs that do one thing and do it well

Write programs that work together

Write programs to handle text streams, because that is the universal interface

• Text formatted as XML

Do one thing and do it well

A workflow is made up of nodes that do one thing and do it well

• So is a Service Oriented Architecture (SOA)

An early attempt: Polymer
Unix shell scripts + Biojava objects

Biojava is a large API of Java objects that are useful for bioinformatics.

Biojava objects can be assembled into mini-programs that ‘do one thing and do it well’.

Polymer combines these mini-programs into a very simple workflow using Unix shell scripts.

• Much like Unix piping.

Unfortunately it instantiates multiple JVMs

Lacks management and logging systems

How could Polymer have been better?

Provide an execution class and allow it to execute a script.

• This would mean only one JVM is launched and could allow for threading of branches in the script.

Use Groovy script instead of Unix shell script.

• But Groovy hadn’t been invented at the time.

At the same time workflow management systems were emerging which made Polymer redundant.

A production example: Drug Target Identification
Rational bioinformatics prioritization

In collaboration with biologists, identify desirable characteristics of a drug target

Integrate relevant data from large datasets

Combine data and score each target based on the presence or absence of desirable characteristics

Prioritize targets based on their overall score
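The score-and-rank steps above can be illustrated with a minimal sketch. The criteria, weights and gene names below are invented for the example (including the assumption that a human homologue counts against a target); they are not the values used in the published study.

```python
# Hypothetical criteria and weights; a negative weight penalizes an undesirable trait.
WEIGHTS = {"essential": 3.0, "druggable_domain": 2.0, "human_homologue": -2.0}

# Hypothetical integrated data: presence or absence of each characteristic per gene.
TARGETS = {
    "geneA": {"essential": True, "druggable_domain": True, "human_homologue": False},
    "geneB": {"essential": True, "druggable_domain": False, "human_homologue": True},
    "geneC": {"essential": False, "druggable_domain": True, "human_homologue": False},
}

def score(features, weights):
    """Sum the weights of every characteristic that is present."""
    return sum(w for name, w in weights.items() if features.get(name, False))

def prioritize(targets, weights):
    """Return (gene, score) pairs, highest score first."""
    scored = [(gene, score(features, weights)) for gene, features in targets.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = prioritize(TARGETS, WEIGHTS)
for gene, total in ranking:
    print(gene, total)
```

Keeping the weights in one table means the scientist can change the criteria without touching the scoring code.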

[Figure: the AssessDrugTarget workflow. Input labels: Homology, Essentiality, Expression, Druggable domains, Structure, Pathways, DB. Steps: Scientist defines desirable criteria; Assign weights; Produce a score for each gene; Select targets for promotion to D1. Other labels: Competitive advantage, Legal position, Literature, Biological feasibility, Epidemiology, Assayability.]

Hasan S, Daugelat S, Rao PSS, Schreiber M (2006) Prioritizing genomic drug targets in pathogens: Application to Mycobacterium tuberculosis. PLoS Comput Biol 2(6):e61

Workflow Management System
Controlling the workflow

A WMS should provide a means to execute a workflow in a controlled way.

Ideally it will also provide:

• Logging

• Messaging

• Security and provenance management

• Scheduling and load balancing

• Exception handling

• Resource pooling (e.g. DB connections)

Much of the above is easily accessible from a JEE / .NET application server

• JBoss, Glassfish
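Two of the services listed above, logging and exception handling, can be sketched as a thin wrapper around the steps of a workflow. This is a toy stand-in using Python's standard `logging` module, not how an application server or any particular WMS does it; `run_managed` and the step names are hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
log = logging.getLogger("wms")

def run_managed(steps, data):
    """Run steps in order, log each one, and stop cleanly on the first failure."""
    for name, func in steps:
        log.info("starting %s", name)
        try:
            data = func(data)
        except Exception:
            # log.exception records the traceback along with the message.
            log.exception("step %s failed", name)
            return {"status": "failed", "failed_step": name, "result": None}
        log.info("finished %s", name)
    return {"status": "ok", "failed_step": None, "result": data}

# A workflow that succeeds.
outcome = run_managed([("uppercase", str.upper), ("reverse", lambda s: s[::-1])], "atgc")

# A workflow whose only step raises ValueError.
bad = run_managed([("to_int", int)], "atgc")

print(outcome)
print(bad["status"], bad["failed_step"])
```

The steps themselves stay free of logging and error-handling code, which is the point of delegating those concerns to the management layer.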

Workflow Design System
Building the workflow

Many workflow management systems are also workflow design systems

• E.g. Taverna, Pipeline Pilot, Inforsense

A GUI that allows rapid workflow development

• Increases productivity and encourages experimentation

• Drag and drop assembly of a workflow

Provides an API or scripting interface to allow the design of new nodes

A simple scripting interface would also be an alternative to using a GUI for design

Simple Data Mining Workflow

Each node has a discrete function.

Internally the processing can be complex (e.g. a decision tree) but the input and output are simple and generic.

Self documenting.

Can be run by other users.

Annotation
Finding malaria kinases

Semi-automated annotation

Advanced annotation
Combining multiple services

Workflows become nodes
Standing on the shoulders of giants

Elements of workflows that are frequently re-used should become nodes.

Workflow re-use, object-oriented workflows
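One way to picture a workflow becoming a node is to make a workflow a callable with the same shape as a node, so it can be dropped into a larger workflow. The sketch below assumes every node takes one value and returns one value; `make_workflow` and the three small nodes are hypothetical names for the example.

```python
def make_workflow(*nodes):
    """Chain nodes into a single callable that can itself be used as a node."""
    def workflow(data):
        for node in nodes:
            data = node(data)
        return data
    return workflow

# Three small nodes, each doing one thing.
def strip_whitespace(seqs):
    return [s.strip() for s in seqs]

def to_upper(seqs):
    return [s.upper() for s in seqs]

def drop_short(seqs):
    return [s for s in seqs if len(s) >= 4]

# A frequently re-used fragment is packaged once...
clean = make_workflow(strip_whitespace, to_upper)

# ...and then used as an ordinary node inside a larger workflow.
analysis = make_workflow(clean, drop_short, len)

count = analysis([" atgc ", "gg", "ttaacc\n"])
print(count)
```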

Example: From Arrays to Pathways
Using whole workflows as nodes

Process an array and find the over-represented KEGG pathways and NCBI processes.

Workflow design systems promote rapid development

Finding orthologues and paralogues using whole genome pairwise BLAST.

Development of the workflow took about 5 minutes.

Workflow design systems promote experimentation
Mind map data analysis

Integration Via Ontology

Workflows in bioinformatics typically do a lot of integration before and/or after analysis.

Integration is normally done using joins and filters.

• Using equality and Boolean operations.

- E.g. type = protease OR type = serine protease …

Joins and filters should be able to be evaluated using an ontology.

• E.g. filtering for proteases would include all subconcepts automatically.

Data sets could be quickly mapped using custom ontologies.
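The difference between an equality filter and an ontology-aware filter can be shown with a toy is-a hierarchy. The ontology, the rows and the function names below are made up for illustration, and the sketch assumes each term has at most one parent and that the hierarchy has no cycles.

```python
# Hypothetical toy ontology, stored as child -> parent ("is a") links.
IS_A = {
    "serine protease": "protease",
    "cysteine protease": "protease",
    "protease": "hydrolase",
    "kinase": "transferase",
}

def is_subconcept(term, ancestor, is_a):
    """True if term equals ancestor or reaches it by following is-a links."""
    while term is not None:
        if term == ancestor:
            return True
        term = is_a.get(term)  # None once the top of the hierarchy is passed
    return False

def ontology_filter(rows, concept, is_a):
    """Keep rows whose type is the concept or any of its subconcepts."""
    return [row for row in rows if is_subconcept(row["type"], concept, is_a)]

ROWS = [
    {"gene": "g1", "type": "serine protease"},
    {"gene": "g2", "type": "kinase"},
    {"gene": "g3", "type": "protease"},
    {"gene": "g4", "type": "cysteine protease"},
]

# Plain equality finds only the exact label.
exact = [row["gene"] for row in ROWS if row["type"] == "protease"]

# The ontology-aware filter also picks up the subconcepts.
proteases = [row["gene"] for row in ontology_filter(ROWS, "protease", IS_A)]

print(exact, proteases)
```

With the ontology in place, the chain of OR conditions is no longer needed and new subconcepts are included without editing the filter.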

Simplifying Service Integration
Expose an API

All programs likely to be called by a workflow management system should publish a webservice or expose a scripting API.

Easier to learn than a full Java or C API.

Should be based on an existing scripting language, not a new one.

• Python, Groovy, Ruby or Perl

While you are at it, expose your stack via the scripting language.

• Imagine what could be done with BLAST if the stack could be manipulated via scripting.

Web Services and Service Oriented Architecture
‘Outsourcing your processing’

Webservices

• Services can reside on different servers

• Platform independent HTTP protocol

• CGI, REST, XML-RPC, SOAP

• SOAP is the easiest to generically connect to and parse

• Results are available as XML

Service Oriented Architecture

• Usually implies web services

• SOA promotes re-use and simplifies maintenance

• Bottleneck shifts from CPU time to network availability
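Because results come back as XML, a workflow node that consumes a service can be quite small. The sketch below parses an invented response with Python's standard `xml.etree.ElementTree`; the `<hits>` and `<hit>` elements and their attributes are assumptions made for the example, not the schema of any real service, and the network call itself is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML payload, standing in for the body of a web service response.
RESPONSE = """<hits query="seq1">
  <hit id="geneA" score="98.5"/>
  <hit id="geneB" score="42.0"/>
</hits>"""

def parse_hits(xml_text, min_score):
    """Return the ids of all <hit> elements scoring at least min_score."""
    root = ET.fromstring(xml_text)
    return [
        hit.get("id")
        for hit in root.findall("hit")
        if float(hit.get("score")) >= min_score
    ]

good_hits = parse_hits(RESPONSE, 50.0)
print(good_hits)
```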

Resource Oriented Architecture
Outsourcing your data warehouse

Bioinformatics is very resource intensive

ROA simplifies maintenance and removes the need for synchronization.

Many resources are now accessible by webservices in XML format

Resource Oriented Architecture
The challenges

Network latency can become a major problem

• Intelligent caching and increased network speed are a must

Requires resource discovery and cross referencing

• RDF and Ontology will play an increasingly important role

• Workflow management systems will need to understand these

Increasingly workflows will make use of loosely-coupled interoperable resources and services.
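The caching point above can be illustrated with a very simple in-memory cache. This sketch uses `functools.lru_cache` from the standard library; `fetch_remote` is a hypothetical stand-in for a slow network call, and the accession strings are placeholders. A production cache would also need expiry and invalidation, which are not shown.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the "network" is actually hit

def fetch_remote(accession):
    """Stand-in for a slow remote lookup; a real client call would go here."""
    CALLS["count"] += 1
    return f"record for {accession}"

@lru_cache(maxsize=1024)
def fetch_cached(accession):
    """Return the record, going to the remote resource only on a cache miss."""
    return fetch_remote(accession)

# Four lookups, but only two distinct accessions.
records = [fetch_cached(acc) for acc in ["ACC1", "ACC2", "ACC1", "ACC1"]]

print(CALLS["count"])
print(fetch_cached.cache_info().hits)
```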

Business Processes
From proactive to reactive

Business processes are long running, asynchronous processes

• Typically they react to events, e.g. a change in a stock price.

- ‘Push’ vs ‘Pull’ model of data access.

• Known as ‘programming in the large’

• Defined using BPEL with very heavy use of SOA and ROA

Currently, most workflows are explicitly executed, ‘short running’, synchronous processes

Bioinformatics will increasingly use business processes

• React to streaming machine data

• Continuously process literature or database updates
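The ‘push’ model can be sketched as handlers that are registered once and then invoked whenever an event arrives, instead of a workflow that is started by hand and pulls its data. The example below is a synchronous, in-process toy; a real business process would be long running and asynchronous, and `EventBus`, the event names and the entry ids are all hypothetical.

```python
class EventBus:
    """Minimal publish/subscribe hub: handlers react to events pushed to them."""

    def __init__(self):
        self._handlers = {}

    def subscribe(self, event_type, handler):
        self._handlers.setdefault(event_type, []).append(handler)

    def publish(self, event_type, payload):
        # Events with no subscribers are silently dropped.
        for handler in self._handlers.get(event_type, []):
            handler(payload)

processed = []

def annotate_new_entry(entry):
    """Stand-in for a workflow that runs each time a database entry changes."""
    processed.append(f"annotated {entry}")

bus = EventBus()
bus.subscribe("database_update", annotate_new_entry)

# Events arrive in whatever order the outside world produces them.
bus.publish("database_update", "entry-001")
bus.publish("instrument_reading", "ignored")
bus.publish("database_update", "entry-002")

print(processed)
```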

Web Service Choreography
Will it be relevant to bioinformatics?

Business processes and workflows are ‘orchestrations’

• Scope is limited to one participant

• The BP or the Workflow talks to other participants but doesn’t care how they do their job or how they are managed.

Choreography involves the management of several loosely coupled BPs

• A network of long running asynchronous BPs that react to the behavior of their peers.

• Choreography of workflows would require a standard workflow description or exposure of a workflow as a business process

[Figure: two rows of one-to-many relationships. Web Service, BP, Choreography; Node, Workflow, ???]

Conclusions
Design and management

Workflows are created using a workflow design system and executed on a workflow management system

A well-designed workflow management system can considerably increase productivity

Promotes workflow re-use and helps organize a multi-user environment

A good design system allows rapid development of a workflow

A good design system promotes experimentation and data exploration

Conclusions
The future

Ontology will play an increasing role in data integration

• Join and Filter operations that can reason over an ontology model

Business processes and web choreography will become more relevant to bioinformatics

• ‘Live’ data favors programming ‘in the large’

• Workflows exposed as business processes

• Network speed and optimal caching are key

All of these approaches have been used before

• Used and proven in business intelligence

• Bioinformatics needs to acquaint itself with modern IT practice and stop re-inventing technology