Building Scientific Workflows with Taverna and BPEL:
a Comparative Study in caGrid
Wei Tan1, Paolo Missier2, Ravi Madduri1, Ian Foster1
1 University of Chicago and Argonne National Laboratory, USA
2 School of Computer Science, University of Manchester, Manchester, U.K
http://www-fp.mcs.anl.gov/~foster/
W. Tan, et al. Building Scientific Workflow with Taverna and BPEL
Agenda
• Introduction to caGrid
• Why scientific workflows in caGrid?
• BPEL and Taverna comparison
- Service discovery
- Service composition & workflow execution
- Data-driven vs. control-driven modeling
- Implicit vs. explicit definition of data
- Implicit vs. explicit iteration on data
- Workflow result analysis
• Conclusion
Introduction: caBIG and caGrid
As of Oct 19, 2008:
- 122 participants
- 105 services (70 data, 35 analytical)
Introduction: caGrid and workflow
[Figure: caGrid (data, instruments, computation resource; Virtualization, Security, Connectivity; Cancer Data Standards Repository)]
[Figure: Scientific workflow lifecycle: Discovery, Composition, Execution, Analysis, Community; arrows labeled "generate" and "reuse"]
Challenges faced by caGrid users
[Figure: scientific workflow lifecycle on caGrid (Cancer Data Standards Repository): Discovery, Composition, Execution, Analysis, Community; arrows labeled "generate" and "reuse"]
• Locating needed services
• Determining function
• Accessing services from a workflow
• GUI for building workflows easily
• Executing workflow efficiently
• Persisting and visualizing results
• Sharing and reusing workflows
Our goals in this paper
• Communicate practical experiences based on our work in the caGrid project
• Cover the entire scientific workflow lifecycle, from service discovery to service composition, workflow execution, and workflow result analysis
• Based on caGrid requirements for workflow language and tooling
• Also applicable to other areas in data-intensive and exploratory science?
BPEL and Taverna
• Not the only two but they are representative choices
• BPEL
- XML-based specification for web service based process behavior
- Industry standard adopted by IBM, SAP, Oracle, etc.
- Has also attracted attention from the scientific community because of its support for SOA paradigm
• Taverna
- Open-source, from the myGrid consortium in UK
- Design and execution of scientific workflows
- Plug-in architecture for extension (access more applications, visualize more data types, etc.)
Querying semantic data in cancer research
• Identify description logic concepts relating to a particular context, e.g., “caCore”
1) Query all projects related to context “caCore”
2) Find UML classes in each project
3) Use project and UML class information to query the semantic metadata
4) Retrieve the concept code
• We adopt this query as a use case to guide our comparison
[Figure: query chain from context “caCore” through Project, Class, and Class metadata to Description logic concept, with steps numbered 1 to 4]
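The four query steps above can be sketched in a few lines of Python. This is only an illustration: the in-memory dictionaries stand in for the CaDSR service, and every project name, class name, and concept code is made up. The function names loosely follow the findProjects and findClassesInProject operations named later in the deck, and find_semantic_metadata is a hypothetical helper.

```python
# Hypothetical in-memory stand-ins for caDSR content; all values are made up.
PROJECTS_BY_CONTEXT = {"caCore": ["ProjectA", "ProjectB"]}
CLASSES_BY_PROJECT = {"ProjectA": ["Gene", "Protein"], "ProjectB": ["Specimen"]}
CONCEPT_CODE_BY_CLASS = {
    ("ProjectA", "Gene"): "CODE-1",
    ("ProjectA", "Protein"): "CODE-2",
    ("ProjectB", "Specimen"): "CODE-3",
}


def find_projects(context):
    """Step 1: all projects related to a context."""
    return PROJECTS_BY_CONTEXT.get(context, [])


def find_classes_in_project(project):
    """Step 2: UML classes of a single project."""
    return CLASSES_BY_PROJECT.get(project, [])


def find_semantic_metadata(project, uml_class):
    """Step 3: semantic metadata for one (project, class) pair (hypothetical helper)."""
    return {
        "project": project,
        "class": uml_class,
        "concept_code": CONCEPT_CODE_BY_CLASS[(project, uml_class)],
    }


def concept_codes_for_context(context):
    """Steps 1-4 chained: collect the concept code of every class in every project."""
    codes = []
    for project in find_projects(context):
        for uml_class in find_classes_in_project(project):
            metadata = find_semantic_metadata(project, uml_class)
            codes.append(metadata["concept_code"])  # Step 4
    return codes
```

Note that steps 2 and 3 each fan out over a list produced by the previous step. How that fan-out is expressed is one of the points on which the two languages differ.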
Support for service discovery
• Before building a workflow
- Need to find appropriate services to be composed
- Service endpoints are not naturally known to users
- Exact semantics of those services are not known
• Taverna offers
- An extensible scavenger interface for arbitrary service discovery according to users' needs (see next page)
- A native semantic discovery facility called Feta: myGrid ontology based service annotation and search
• BPEL offers
- UDDI, which is not widely adopted
- Research efforts like WSMO and OWL-S, which are more at the specification level
- No open-source tool is available that works with a service query component in an integrated way
Solution for caGrid: Metadata-based service query
caGrid service metadata
caGrid scavenger: query the CaDSR Service in the use case
• Types of query
- String based
- Property based
- Semantic based
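As a rough illustration of how the three query types differ, the sketch below filters a list of service metadata records in three ways. The slide does not define the query semantics, so this is one plausible reading only. The ServiceMetadata record, its fields, and the sample values are all hypothetical and do not reflect the actual caGrid metadata model or scavenger API.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceMetadata:
    """Hypothetical, simplified metadata record for one registered service."""
    name: str
    description: str
    properties: dict = field(default_factory=dict)
    concepts: set = field(default_factory=set)


def string_query(services, text):
    """String based: free-text match against name and description."""
    needle = text.lower()
    return [s for s in services
            if needle in s.name.lower() or needle in s.description.lower()]


def property_query(services, key, value):
    """Property based: exact match on one named metadata property."""
    return [s for s in services if s.properties.get(key) == value]


def semantic_query(services, concept):
    """Semantic based: match on a concept the service is annotated with."""
    return [s for s in services if concept in s.concepts]


# Made-up sample registry used only to exercise the three queries.
SERVICES = [
    ServiceMetadata("CaDSRService",
                    "Cancer Data Standards Repository metadata service",
                    {"hosting_center": "CenterA"}, {"ConceptX"}),
    ServiceMetadata("ExampleAnalysis",
                    "Hypothetical analytical service",
                    {"hosting_center": "CenterB"}, {"ConceptY"}),
]
```

For example, string_query(SERVICES, "standards") selects only the first record, while property_query(SERVICES, "hosting_center", "CenterB") selects only the second.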
Service composition & workflow execution
• Data-driven vs. control-driven modeling
• Implicit vs. explicit definition of data
• Implicit vs. explicit iteration on data
Data-driven vs. control-driven modeling
Comparison of BPEL and Taverna (Scufl) w.r.t. control/data-flow

Aspect | BPEL | Taverna (Scufl)
Activities in model | Basic and structured activities | Processors as data processing units with in/output ports
Semantics of links | Transfer of control | Transfer of data
Data definition | Explicitly defined (global variables) | Implicitly defined (processor’s input/output)
Data initialization | Complex data types must be explicitly initialized | Automatic
Control logic | Full-fledged: sequence, conditional, parallel, event-triggered, etc. | Limited: sequential, parallel and conditional
Parallel execution | Defined in <flow> or <forEach> | By default
Implicit vs. explicit definition of data
• Taverna
- Processors have input/output ports with an associated data type
- Data travels from the output port of a processor to the input of one or more downstream processors
- Interaction among processors is defined entirely by the arcs in the dataflow graph
• BPEL
- Requires the explicit definition of variables, and explicit initiation for complex types
- Data are shared amongst activities (i.e., are global)
- More complexity, but more power and flexibility in data handling
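The contrast can be made concrete with a small Python sketch. It is a toy model under stated assumptions, not how either engine is implemented: run_dataflow mimics the dataflow style, where only processors and links are declared and a processor fires once its input is available, while run_with_variables mimics the variable style, where each intermediate value is declared, initialized, and assigned by name. All function and variable names are hypothetical, and each processor has exactly one input for brevity.

```python
def to_upper(text):
    return text.upper()


def add_suffix(text):
    return text + "!"


def run_dataflow(processors, links, inputs):
    """Dataflow style: values live only on the links between processors.

    processors: name -> one-argument function
    links:      (upstream, downstream) pairs
    inputs:     name -> value for processors fed from outside the workflow
    """
    incoming = {dst: src for src, dst in links}
    values = {}
    pending = dict(processors)
    while pending:
        progressed = False
        for name in list(pending):
            if name in inputs:
                arg = inputs[name]
            elif incoming.get(name) in values:
                arg = values[incoming[name]]
            else:
                continue  # input not available yet
            values[name] = pending.pop(name)(arg)
            progressed = True
        if not progressed:
            raise ValueError("unconnected or cyclic processors")
    return values


def run_with_variables(raw):
    """Variable style: data is declared up front and shared by all activities."""
    variables = {"request": None, "upper": None, "result": None}  # declaration
    variables["request"] = raw                                     # explicit initialization
    variables["upper"] = to_upper(variables["request"])            # activity 1
    variables["result"] = add_suffix(variables["upper"])           # activity 2
    return variables["result"]
```

Both styles compute the same result; the second is longer but gives every activity access to every variable, which is where the extra flexibility in data handling comes from.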
Implicit vs. explicit iteration on data
• Implicit iteration in Taverna
- Occurs when an input port receives a list where a single element is expected:
  - E.g., a processor that outputs a “list of strings” can legally be connected to a processor with an input port of type “string.”
  - Taverna interprets this type mismatch as an indication that the destination processor must be invoked repeatedly, once for each element of the input list
  - This behavior is defined with Taverna's functional programming model
• Explicit iteration in BPEL
- BPEL does not allow type mismatch, and iteration needs to be defined explicitly
- Again, BPEL offers more flexibility to define more advanced iteration patterns (with more complexity in the model, though)
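The two iteration styles can be sketched as follows. This is a simplified model, assuming a single input port whose expected nesting depth is given by a hypothetical port_depth parameter; it does not model how Taverna combines lists arriving on several ports. The function names are illustrative only.

```python
def list_depth(value):
    """Nesting depth of a value: 0 for a scalar, 1 for a flat list, and so on."""
    depth = 0
    while isinstance(value, list):
        depth += 1
        if not value:
            break
        value = value[0]
    return depth


def invoke_implicitly(processor, value, port_depth=0):
    """Implicit iteration: a depth mismatch triggers one invocation per element."""
    if list_depth(value) > port_depth:
        return [invoke_implicitly(processor, item, port_depth) for item in value]
    return processor(value)


def invoke_explicitly(processor, items):
    """Explicit iteration: the workflow author writes the loop."""
    results = []
    for item in items:
        results.append(processor(item))
    return results
```

With a port that expects a single string, invoke_implicitly(str.upper, ["a", "b"]) calls the processor twice, whereas with port_depth=1 the list is passed through in a single call.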
Implicit vs. explicit iteration in CaDSR
findProjects returns an array Project []
findClassesInProject receives type Project and finds all UML classes in this (single) project
In Taverna an xmlsplitter extracts the project array and feeds this directly into findClassesInProject
In BPEL a ForEach construct is needed for the iteration over array Project []
Workflow result analysis
• Workflow provides a natural framework for data tracking and analysis
- In both Taverna and BPEL
• Taverna: offers native provenance support
- More precise linkage annotation between services’ input and output
- Semantic support
- Not the focus of our project, see ref. [16] [17] for more details
Conclusion: Taverna offers lifecycle support
[Figure: scientific workflow lifecycle on caGrid (Cancer Data Standards Repository): Discovery, Composition, Execution, Analysis, Community; arrows labeled "generate" and "reuse"]
• Scavenger: for customized service discovery
• Feta: service annotation and discovery
• Scufl: compact modeling of data flow
• Built-in processors: Soaplab, BioMart, etc.
• Customized processors as plug-ins
• Implicit iteration: handle parallel execution
• Result persistence and visualization
• A community for sharing workflows
• Provides a compact set of primitives that eases the modeling of data flows
• Allows users to specify “what to do” instead of “how to do it”
Conclusion: BPEL offers unique features
• Build-time
- A comprehensive set of primitives to model processes of all flavors
  - control-flow oriented
  - data-flow oriented (although a little verbose)
  - event driven, etc.
- Full featured
  - process logic, data manipulation, event and message processing, fault handling, etc.
• Run-time
- BPEL engines typically run inside application servers with
  - persistent state storage
  - reliability and scalability guarantees
- Important for long-running and computation-intensive workflows
- For now Taverna engine does not provide these capabilities
Conclusion
• Factors in deciding which language/tool to choose
- User IT expertise
  - some prefer a scripting language, others a friendly GUI
- Problem size
  - Taverna often runs on the desktop and handles problems of moderate size (currently common in bioinformatics)
  - Grid/server based systems like Swift can deal with huge volumes of data and intensive computation (for example, applications in medical informatics, neuroscience, physics)
- Applications involved
  - Web services, batch jobs, shell scripts, etc.
• Future work
- Enrich the caGrid workflow tool set based on Taverna
- Build more real workflows to help scientific investigation
- Address issues of scale as they arise
Thank you for your attention