caBIG Workflow. University of Chicago, USA; University of Manchester, UK
Agenda
• caBIG Workflows: The “BIG” picture
• caBIG Workflow Infrastructure
• Semantic Service discovery
• Composing the Workflow
• Invoke stateful and secure services
• Workflow execution service
• Discovering and Executing caBIG workflows using caGrid portal
• Examples of caBIG workflows
• Future directions
The caGrid ecosystem and the role of workflow
[Diagram labels: caGrid (data, instruments, computation resources); virtualization; security; connectivity; Cancer Data Standards Repository; discovery; composition; orchestration; reuse; community]
Scientific workflow lifecycle
[Diagram labels: reuse; generate]
• Workflow as consumer
• Easily reuse services for complex experiments.
• Workflow as contributor
• Workflow as “best practice” wrapped as services.
The caBIG Workflow System
[Diagram labels: caGrid; Cancer Data Standards Repository; discovery; composition; execution; reuse; community; reuse/generate cycle]
• Service discovery based on cancer research metadata.
• Data-flow modeling flavor
• caGrid activity
• State management (WSRF)
• Security (GSI)
• Implicit iteration: handle parallel execution
• WSRF and GSI enforcement
• A “Facebook” for caGrid workflows
• Workflow Execution Service
• Workflows in caGrid Portal
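The “implicit iteration” item above can be illustrated with a small Python sketch: a step written for a single value is applied element by element when it receives a list, and the independent calls may run in parallel. The names `implicit_iterate` and `normalize` are hypothetical; this mimics the idea only and is not Taverna's implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def implicit_iterate(step, value, max_workers=4):
    """Apply a single-value step to a value or, element-wise, to a list.

    Hypothetical helper: it mimics the idea of implicit iteration,
    not the behaviour of any real workflow engine.
    """
    if isinstance(value, list):
        # Each element is independent, so the calls may run in parallel.
        # Executor.map returns results in the order of the inputs.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(step, value))
    return step(value)


def normalize(sample_id):
    """Stand-in for a service call that handles one item at a time."""
    return sample_id.strip().upper()


single = implicit_iterate(normalize, " s1 ")
many = implicit_iterate(normalize, ["s1", " s2", "s3 "])
```

The workflow author writes `normalize` once for a single item; the iteration over a list is supplied by the surrounding machinery.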
Lymphoma Prediction Workflow
•Scientific value
• Use gene-expression patterns associated with two lymphoma types to predict the type of an unknown sample.
• Connect caGrid data service (caArray) with analytical services (PreProcess, SVM and KNN from GenePattern).
•Major steps
• Querying training data from experiments stored in caArray.
• Preprocessing, i.e., normalizing the microarray data.
• Predicting lymphoma type using SVM & KNN services.
•Extension
• Generalized the workflow into a cancer type prediction routine that can be used on other caArray data sets.
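The three major steps can be sketched in plain Python. This is a toy illustration under loud assumptions: the expression values and the labels `type_A`/`type_B` are made up, the in-memory dictionary stands in for a caArray query, and the classifier is a generic k-nearest-neighbour vote, not the GenePattern KNN service.

```python
import math
from collections import Counter

# Made-up training data standing in for a caArray query result:
# sample id -> (expression profile, known type).
TRAINING = {
    "t1": ([2.0, 8.0, 1.0], "type_A"),
    "t2": ([2.5, 7.5, 1.5], "type_A"),
    "t3": ([9.0, 1.0, 6.0], "type_B"),
    "t4": ([8.5, 1.5, 6.5], "type_B"),
}


def preprocess(profile):
    """Scale a profile to zero mean and unit variance (one common normalization)."""
    mean = sum(profile) / len(profile)
    var = sum((x - mean) ** 2 for x in profile) / len(profile)
    std = math.sqrt(var) or 1.0  # avoid dividing by zero for flat profiles
    return [(x - mean) / std for x in profile]


def predict_knn(unknown, training, k=3):
    """Predict a label by majority vote among the k nearest training samples."""
    query = preprocess(unknown)
    ranked = sorted(
        (math.dist(query, preprocess(profile)), label)
        for profile, label in training.values()
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


prediction = predict_knn([2.2, 7.8, 1.2], TRAINING)
```

Swapping `TRAINING` for another labelled data set is the same generalization the Extension item describes.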
*Fig. from MA Shipp. Nature Medicine, 2002
[Figure* labels: microarray from tumor tissue; microarray preprocessing; lymphoma prediction]
Lymphoma Prediction Workflow
Lymphoma type prediction
Acknowledgement: Juli Klemm, Xiaopeng Bian, Rashmi Srinivasa (NCI); Jared Nedzel (MIT)
Configuring Taverna
• 2 default caGrid configurations in Taverna:
• NCI Production caGrid v1.3
• Training caGrid
• Configuration: a set of caGrid services belonging to the same grid
• Other “caGrids” can be defined through preferences
Semantic Service Discovery
• Semantic search: searches the Index Service for registered caGrid services matching various search criteria:
• Service name, inputs, outputs, research center, class names, concept codes, etc.
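The matching logic behind such a search can be sketched as a filter over service metadata. This is a minimal sketch with assumptions throughout: `ServiceRecord`, its fields, and the example entries are hypothetical, and a real search queries the Index Service rather than an in-memory list.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRecord:
    """Hypothetical, simplified view of a registered service's metadata."""
    name: str
    research_center: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    concept_codes: list = field(default_factory=list)


def search(records, name=None, research_center=None, output=None, concept_code=None):
    """Return the records that satisfy every criterion that was supplied."""
    def matches(r):
        return (
            (name is None or name.lower() in r.name.lower())
            and (research_center is None or r.research_center == research_center)
            and (output is None or output in r.outputs)
            and (concept_code is None or concept_code in r.concept_codes)
        )
    return [r for r in records if matches(r)]


# Made-up registry entries for illustration only.
REGISTRY = [
    ServiceRecord("ExampleArrayData", "Center A",
                  outputs=["Experiment"], concept_codes=["C0001"]),
    ServiceRecord("ExamplePreprocess", "Center B",
                  inputs=["Experiment"], outputs=["Matrix"]),
]

hits = search(REGISTRY, output="Experiment")
```

Criteria left as `None` are ignored, so a caller can combine any subset of them.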
Adding caGrid services directly
• If the user knows the WSDL URL of a caGrid service, the service can be added directly
caBIG services palette
• As a result of semantic search or direct adding
• caBIG services appear in Taverna’s Service Panel
• Ready to be dragged and dropped into caGrid workflows
Stateful caGrid services
• Taverna provides support for stateful caGrid services that implement the WSRF specification.
• Taverna can detect whether a service is WSRF-compliant and adds a special input port, ‘EndpointReference’, to it
• The EPR can be passed around the workflow as a normal parameter
Secure caGrid services
• Taverna can invoke secure caGrid services that require user to log in to caGrid
• Taverna interacts with caGrid’s GAARDS infrastructure to obtain the user’s proxy:
• Authenticate the user with the user’s affiliated Authentication Service
• Obtain the user’s proxy from Dorian Service
• Default proxy lifetime: 12 hours
Using secure caGrid services
• Involves:
1. Configuring a secure caGrid service from Taverna
2. Logging onto the selected caGrid to obtain a proxy certificate
3. Saving and managing caGrid proxies, usernames, and passwords
Configuring secure services (1/2)
• Authentication Service and Dorian Service URLs are required in order to obtain the user’s proxy
• Can be configured globally for all services from the same caGrid (in preferences)
• Can be configured individually for a particular caGrid service (overrides configuration from preferences)
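The precedence rule above (a per-service setting overrides the grid-wide preference) can be sketched as a dictionary merge. The key names and the example.org URLs are placeholders invented for this sketch, not real caGrid endpoints or Taverna settings.

```python
# Grid-wide defaults, as they might be set in preferences.
# Placeholder URLs; not real endpoints.
GRID_DEFAULTS = {
    "authentication_service_url": "https://auth.example.org/AuthenticationService",
    "dorian_service_url": "https://dorian.example.org/Dorian",
}


def effective_security_config(grid_defaults, service_overrides=None):
    """Merge settings so that per-service values win over grid-wide ones."""
    config = dict(grid_defaults)
    # Only non-empty overrides replace the grid-wide value.
    config.update({k: v for k, v in (service_overrides or {}).items() if v})
    return config


inherited = effective_security_config(GRID_DEFAULTS)
overridden = effective_security_config(
    GRID_DEFAULTS,
    {"dorian_service_url": "https://dorian.other.example.org/Dorian"},
)
```

A service with no overrides simply inherits both URLs from its grid.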
Configuring secure services (2/2)
• View the secure service’s details
• Configure the service’s security properties
Logging onto caGrid
• User is prompted for his caGrid username and password when any secure service is invoked from a workflow for the first time
Credential management (1/2)
• Taverna obtains proxy for user from Dorian Service using user’s caGrid username and password
• Proxies are saved and managed by the Credential Manager
• caGrid username and password can also be remembered
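A credential cache of this kind can be sketched as follows. Only the 12-hour default lifetime comes from the slides; the `ProxyCache` class, its method names, and the `fake_obtain` stand-in for the call to Dorian are assumptions made for illustration and do not describe Taverna's Credential Manager.

```python
from datetime import datetime, timedelta

DEFAULT_LIFETIME = timedelta(hours=12)  # default proxy lifetime from the slides


class ProxyCache:
    """Hypothetical cache: reuse a saved proxy until it expires."""

    def __init__(self, obtain_proxy, lifetime=DEFAULT_LIFETIME):
        self._obtain = obtain_proxy      # callable standing in for the login step
        self._lifetime = lifetime
        self._entries = {}               # grid name -> (proxy, expiry time)

    def get(self, grid, username, password, now=None):
        now = now or datetime.now()
        cached = self._entries.get(grid)
        if cached and cached[1] > now:
            return cached[0]             # still valid: no new login needed
        proxy = self._obtain(username, password)
        self._entries[grid] = (proxy, now + self._lifetime)
        return proxy


calls = []


def fake_obtain(username, password):
    """Stand-in for obtaining a proxy; records each login attempt."""
    calls.append(username)
    return f"proxy-for-{username}-{len(calls)}"


cache = ProxyCache(fake_obtain)
start = datetime(2024, 1, 1, 9, 0)
first = cache.get("training", "alice", "secret", now=start)
second = cache.get("training", "alice", "secret", now=start + timedelta(hours=11))
third = cache.get("training", "alice", "secret", now=start + timedelta(hours=13))
```

The second request falls inside the lifetime and reuses the saved proxy; the third falls outside it and triggers a new login.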
Workflow execution service
Taverna Workflow Service wraps the Taverna execution engine into a WS-Resource and exposes operations such as createResource, startWorkflow, getStatus, and getOutput for user-submitted workflows.
[Diagram: clients (Client API, Taverna Workbench, Workflow Portlet) call the Workflow Service operations createResource, startWorkflow, getStatus, and getOutput; each stateful resource (with resource properties) is addressed by an EPR; the Taverna Engine invokes data services, analytical services, and caGrid & other services]
Workflow execution service
Taverna Workflow Service provides stateful resources that execute the workflows.
Supports caGrid security architecture (GSI Security).
Allows programmatic submission of workflows.
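Programmatic submission against a stateful service can be sketched with an in-memory stand-in. Only the four operation names come from the slides; the parameters, the status strings, the use of a UUID as a stand-in for an EPR, and the `toy_engine` function are assumptions of this sketch, not the service's actual interface.

```python
import uuid


class WorkflowService:
    """In-memory stand-in mirroring the four operations named above."""

    def __init__(self, engine):
        self._engine = engine        # callable standing in for the execution engine
        self._resources = {}         # EPR -> per-workflow state

    def createResource(self, workflow_definition):
        epr = str(uuid.uuid4())      # stand-in for an endpoint reference
        self._resources[epr] = {
            "workflow": workflow_definition,
            "status": "Pending",
            "output": None,
        }
        return epr

    def startWorkflow(self, epr, inputs):
        resource = self._resources[epr]
        resource["status"] = "Active"
        try:
            resource["output"] = self._engine(resource["workflow"], inputs)
            resource["status"] = "Done"
        except Exception as exc:     # record the failure on the resource
            resource["output"] = str(exc)
            resource["status"] = "Failed"

    def getStatus(self, epr):
        return self._resources[epr]["status"]

    def getOutput(self, epr):
        return self._resources[epr]["output"]


def toy_engine(workflow, inputs):
    """Hypothetical engine: echoes the workflow name and its inputs."""
    return {"workflow": workflow, "inputs": inputs}


service = WorkflowService(toy_engine)
epr = service.createResource("example-workflow")
status_before = service.getStatus(epr)
service.startWorkflow(epr, {"experimentId": "example-id"})
status_after = service.getStatus(epr)
output = service.getOutput(epr)
```

The client keeps only the EPR between calls; all execution state lives in the resource on the service side.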
Access Taverna workflow via caGrid portal
Taverna Workflow Portlet is deployed in the caGrid Portal on the training Grid:
URL: http://portal-demo.training.cagrid.org/web/guest/tools/taverna-workflow
• The Portlet currently lists a few workflows, with their descriptions, that can be browsed from the above URL
• Users can select a workflow they are interested in running.
View : 1
Access Taverna workflow via caGrid portal
• Based on the number of input ports in the workflow, the portlet prompts the user to enter the input values in text boxes.
• For example, the Lymphoma workflow takes only one input, in the form of an Experiment ID that identifies the experiment that caArray uses for data collection.
• Hit submit after entering the data.
View : 2
Access Taverna workflow via caGrid portal
• The portlet stores the user-submitted workflows in the current session of the portal.
• Users can view all the active and completed workflows in the session.
• Clicking the Output button shows the output of the workflow.
• The portlet provides workflow-specific view-resolvers to render the outputs. For example, the Lymphoma workflow currently displays the output in an HTML table.
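The idea of workflow-specific view-resolvers can be sketched as a lookup from workflow name to rendering function, with a plain-text fallback. The registry, the function names, and the row format are hypothetical and do not reflect the portlet's actual code.

```python
from html import escape


def render_table(rows):
    """Render a list of dicts as an HTML table, escaping every value."""
    if not rows:
        return "<table></table>"
    headers = list(rows[0])
    head = "".join(f"<th>{escape(str(h))}</th>" for h in headers)
    body = "".join(
        "<tr>"
        + "".join(f"<td>{escape(str(row.get(h, '')))}</td>" for h in headers)
        + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def render_plain(output):
    """Fallback for workflows without a dedicated resolver."""
    return f"<pre>{escape(str(output))}</pre>"


# Hypothetical registry: workflow name -> rendering function.
VIEW_RESOLVERS = {"lymphoma": render_table}


def render_output(workflow_name, output):
    return VIEW_RESOLVERS.get(workflow_name, render_plain)(output)


table_html = render_output("lymphoma", [{"sample": "s1", "type": "A<B"}])
plain_html = render_output("other", "x")
```

Adding a new output view then means registering one more function rather than changing the dispatch code.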
Views : 3, 4, & 5
Ack. Manav Kher, Joshua Phillips (SemanticBits)
Workflow execution service plug-in
• Submit the workflow to an execution service.
• Retrieve execution result asynchronously.
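Submitting a run and collecting its result later can be sketched with Python's standard `concurrent.futures` module. The `run_workflow` function is a hypothetical stand-in for a remote execution and the experiment identifier is a placeholder; the plug-in's own mechanism is not described in the slides.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_workflow(experiment_id):
    """Stand-in for a remote workflow run that takes some time."""
    time.sleep(0.05)
    return {"experiment": experiment_id, "status": "finished"}


with ThreadPoolExecutor(max_workers=1) as pool:
    # Submission returns immediately with a handle to the pending result.
    future = pool.submit(run_workflow, "example-experiment")

    # The caller is free to do other work while the workflow runs.
    other_work = sum(range(10))

    # Retrieve the result once it is needed; this blocks only until it is ready.
    result = future.result(timeout=5)
```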
Examples of caBIG workflows: caDSR
•Scientific value
• To find all the UML packages related to a given context (‘caCore’).
• Not a real scientific experiment.
• Simple.
• Important in caGrid.
•Steps
• Querying Project object.
• Do data transformation.
• Querying Packages object and get the result.
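The three steps can be sketched as a small pipeline in which a “shim” function reshapes the output of the first query into the input of the second. The records, field names, and function names below are made up for illustration; they are not the caDSR data model or its query interface.

```python
# Made-up records standing in for two queryable object types.
PROJECTS = [
    {"id": "p1", "context": "caCore"},
    {"id": "p2", "context": "other"},
]
PACKAGES = [
    {"name": "pkg.alpha", "projectId": "p1"},
    {"name": "pkg.beta", "projectId": "p1"},
    {"name": "pkg.gamma", "projectId": "p2"},
]


def query_projects(context):
    """Step 1: find the projects that belong to a context."""
    return [p for p in PROJECTS if p["context"] == context]


def shim_project_ids(projects):
    """Step 2 (shim): keep only what the next query needs."""
    return [p["id"] for p in projects]


def query_packages(project_ids):
    """Step 3: find the packages that belong to the given projects."""
    wanted = set(project_ids)
    return [pk["name"] for pk in PACKAGES if pk["projectId"] in wanted]


packages = query_packages(shim_project_ids(query_projects("caCore")))
```

The shim carries no domain logic of its own; it exists only because the two queries do not share an input and output format.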
[Legend: workflow input; caGrid services; “shim” services; workflow output]
Protein sequence information query
•Scientific value
• To query protein sequence information out of 3 caGrid data services: caBIO, CPAS and GridPIR.
• To analyze a protein sequence from different data sources.
•Steps
• Querying CPAS to get the id, name, and value of the sequence.
• Querying caBIO and GridPIR using the id or name obtained from CPAS.
Microarray clustering*
•Scientific value
• A common routine to group genes or experiments into clusters with similar profiles.
• To identify functional groups of genes.
•Steps
• Querying and retrieving the microarray data of interest from a caArrayScrub data service at Columbia University
• Preprocessing, i.e., normalizing, the microarray data using the GenePattern analytical service at the Broad Institute at MIT
• Running hierarchical clustering using the geWorkbench analytical service at Columbia University
[Legend: workflow in/output; caGrid services; “shim” services; others]
*Wei Tan, Ravi Madduri, Kiran Keshav, Baris E. Suzek, Scott Oster, Ian Foster. Orchestrating caGrid Services in Taverna. ICWS 08.
caGrid workflows in myExperiment
•caGrid Workflows covered
• Data service workflows: caDSR query; Protein sequence query
• Data + analytical service workflows: Microarray clustering; Lymphoma type classification
•caGrid workflows are uploaded to myExperiment and accessible from: http://www.myexperiment.org/workflows/search?query=cabig
Future Directions
• More guidance in workflow modeling
• Leverage caDSR, EVS and the workflows at myExperiment
• More friendly user interface
• A CQL builder for caGrid data services
• More shim services for data transformation
• More features
• Integration with caGrid transfer to access data
• Browsing and executing workflows from caGrid portal
• Enhanced security support
• More workflows of real scientific value
More information
• caGrid workflow
• http://cagrid.org/display/workflow/Home
• Our team
Univ. Chicago: Wei Tan, Dinanath Sulakhe, Ravi Madduri
Univ. Manchester, UK: Carole Goble, Stian Soiland-Reyes, Alexandra Nenadic