Exploring Many Task Computing in
Scientific Workflows
MTAGS 2009 - 1
Eduardo Ogasawara Daniel de Oliveira
Fernando Seabra Carlos Barbosa
Renato Elias Vanessa Braganholo
Alvaro Coutinho Marta Mattoso
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies
are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page.
To copy otherwise, to republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
MTAGS '09 November 16th, 2009, Portland, Oregon, USA
Copyright © 2009 ACM 978-1-60558-714-1/09/11... $10.00
Federal University of Rio de Janeiro, Brazil
Agenda
• Introduction
o Scientific experiments
o Scientific workflows
o Experiments life cycle
• Hydra middleware
• Case study
• Related work
• Conclusion
Typical scenario: scientific experiment
1. Data collection
2. Data analyzed by program X
3. Large volume of data produced ...
4. ... which needs to be processed by program Y in a cluster
5. Results are analyzed by program Z
Variations of data or parameters
1. Data collection
2. Data analyzed by program X
3. Large volume of data produced ...
4. ... which needs to be processed by program Y in an MTC environment
5. Results are analyzed by program Z
Current solutions
• Scientific Workflow Management Systems (SWfMS)
• SWfMS allow the execution of Scientific Workflows
o Some SWfMS are strong in workflow design and provenance support (VisTrails, Kepler, Taverna)
o Some SWfMS are strong in HPC support (Pegasus, Swift, Triana)
• Scientists should be free to choose the SWfMS that best suits their needs
• This choice should not prevent the adoption of an MTC solution for executing one or more activities of a workflow
Parallelization difficulties
• Controlling parallel execution in distributed
environments
• Steering activities in distributed environments
• Provenance gathering in distributed/heterogeneous environments
Provenance can support analyzing
scientific experiments
• Before execution:
o What programs may be used? Is there any alternative to explore?
o Is there any dependency between activities? Which activities are mandatory?
• After execution:
o What were the parameters that led to the best result?
o What was the scientific workflow that led to the desired result?
o Where are the output files generated by the distributed activity A using the parameters P?
o How many times was activity A in version V used in experiment E?
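The "after execution" questions above can be read as queries over stored provenance records. The following is a minimal sketch, assuming a hypothetical single-table schema in an in-memory SQLite database; the table, column names, sample rows, and helper functions are illustrative assumptions, not the provenance model used by Hydra or by any SWfMS.

```python
import sqlite3

# Hypothetical provenance store: one row per activity execution.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE execution (
        experiment TEXT,  -- experiment E
        activity   TEXT,  -- activity A
        version    TEXT,  -- version V of the activity
        parameters TEXT,  -- parameters P, kept as a plain string
        output     TEXT,  -- path of the generated output file
        score      REAL   -- assumed quality measure (higher is better)
    )
    """
)
# Made-up sample records, for illustration only.
rows = [
    ("E1", "solver", "v1", "re=100", "/out/run1.dat", 0.71),
    ("E1", "solver", "v2", "re=100", "/out/run2.dat", 0.83),
    ("E1", "solver", "v2", "re=1000", "/out/run3.dat", 0.92),
    ("E1", "viz", "v1", "cmap=jet", "/out/run3.png", 0.50),
]
conn.executemany("INSERT INTO execution VALUES (?, ?, ?, ?, ?, ?)", rows)


def best_parameters(experiment, activity):
    """Which parameters led to the best result?"""
    cur = conn.execute(
        "SELECT parameters FROM execution "
        "WHERE experiment = ? AND activity = ? "
        "ORDER BY score DESC LIMIT 1",
        (experiment, activity),
    )
    return cur.fetchone()[0]


def output_files(activity, parameters):
    """Where are the output files of activity A run with parameters P?"""
    cur = conn.execute(
        "SELECT output FROM execution "
        "WHERE activity = ? AND parameters = ? ORDER BY output",
        (activity, parameters),
    )
    return [row[0] for row in cur]


def usage_count(experiment, activity, version):
    """How many times was activity A in version V used in experiment E?"""
    cur = conn.execute(
        "SELECT COUNT(*) FROM execution "
        "WHERE experiment = ? AND activity = ? AND version = ?",
        (experiment, activity, version),
    )
    return cur.fetchone()[0]
```

In a distributed setting the rows would come from several sites, which is why gathering them into one queryable store matters.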
Our vision of the experiment life cycle
[Figure: experiment life cycle. Phases: Composition, Execution, Analysis. Other labels: Conception, Reuse, Distribution, Monitoring, Visualization, Query, Discovery, Provenance Data, SWfMS, Hydra, HPC. Annotation: the GExpLine tool supports the experiment life cycle.]
Hydra
• Middleware solution that bridges the SWfMS and the HPC environment, supporting MTC parallelization strategies
[Figure: Hydra Middleware placed between the SWfMS and the HPC Environment]
• Goal: reduce the complexity involved in designing and managing activity/workflow parallel executions while gathering distributed provenance data
Supported parallelization types
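[Figure: two diagrams. Data Fragmentation: the Data Input is split into fragments I1 ... In, each fragment is processed by a copy of the Activity/Wf with the same Parameters, and the outputs O1 ... On pass through Data Analysis into the Data Output. Parameter Sweep: the same Data Input is processed by copies of the Activity/Wf with parameter sets Pt1 ... Ptn, and the outputs O1 ... On pass through Data Analysis into the Data Output.]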
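The two supported parallelization types, data fragmentation and parameter sweep, can be sketched in a few lines. This is a minimal illustration and not Hydra code: the `activity` function is a stand-in for the parallelized program, a local thread pool stands in for the MTC environment, and all function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def activity(data, params):
    """Stand-in for the parallelized activity: scale and sum the data."""
    return params["scale"] * sum(data)


def fragment(data, n):
    """Split data into n nearly equal, contiguous fragments (I1 ... In)."""
    size, extra = divmod(len(data), n)
    fragments, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        fragments.append(data[start:end])
        start = end
    return fragments


def data_fragmentation(data, params, n):
    """Same parameters, different data fragments; outputs are merged."""
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(lambda frag: activity(frag, params),
                                fragment(data, n)))
    return sum(outputs)  # the "data analysis" step merges O1 ... On


def parameter_sweep(data, grid):
    """Same data, one task per parameter combination (Pt1 ... Ptn)."""
    names = sorted(grid)
    combos = [dict(zip(names, values))
              for values in product(*(grid[name] for name in names))]
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(lambda p: activity(data, p), combos))
    return list(zip(combos, outputs))
```

Merging by summation works here only because the stand-in activity is linear in its input; a real activity needs its own merge step.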
Hydra Architecture
[Figure: Hydra architecture. Layers: Client Layer (Hydra Client Components) and Hydra MTC Layer (MTC Environment). Components: Hydra Setup, Uploader, Dispatcher, Gatherer, Downloader, Configuration, Workspace Handler, MUX, Parameter Sweeper, Data Fragmenter Cartridge, Dispatcher, Monitor, Data Analyzer Cartridge, Provenance. External elements: SWfMS workflow (VisTrails, Swift), Falkon, PBS Scheduler. Arrow legend: Storage, Control, Data. Component legend: Hydra Client Components, Hydra Setup, Hydra Preprocessing, Hydra Dispatcher/Monitor, Hydra Post-processing, Hydra External Components.]
Hydra pre-processing components
[Figure: pre-processing components highlighted: Workspace Handler, Parameter Sweeper, Data Fragmenter Cartridge]
Hydra post-processing components
[Figure: post-processing components highlighted: Provenance, Data Analyzer Cartridge]
Hydra Architecture
[Figure: the Hydra architecture diagram shown again, with its components annotated by phase: Pre-Processing, MTC Processing, Post-Processing]
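The three phases named on the architecture slide (Pre-Processing, MTC Processing, Post-Processing) can be illustrated with a small pipeline. This is a minimal sketch under stated assumptions: the function names, the one-directory-per-task workspace layout, and the squaring stand-in task are all hypothetical, and a local thread pool replaces the Falkon or PBS back end.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def pre_process(root, parameter_sets):
    """Pre-processing: create one workspace directory per task."""
    tasks = []
    for i, params in enumerate(parameter_sets):
        workspace = Path(root) / f"task_{i}"
        workspace.mkdir()
        tasks.append({"id": i, "workspace": workspace, "params": params})
    return tasks


def run_task(task):
    """MTC processing: run one task and write its output to its workspace."""
    result = task["params"]["x"] ** 2  # stand-in for the real program
    out = task["workspace"] / "output.txt"
    out.write_text(str(result))
    # Return a small provenance record describing this execution.
    return {"task": task["id"], "params": task["params"],
            "output": str(out), "status": "finished"}


def post_process(records):
    """Post-processing: gather the outputs and keep the provenance records."""
    values = [int(Path(r["output"]).read_text()) for r in records]
    return {"values": values, "provenance": records}


def run_pipeline(parameter_sets):
    """Run the three phases in order inside a temporary root workspace."""
    with tempfile.TemporaryDirectory() as root:
        tasks = pre_process(root, parameter_sets)
        with ThreadPoolExecutor() as pool:
            records = list(pool.map(run_task, tasks))
        # Outputs are read before the temporary workspace is removed.
        return post_process(records)
```

Isolating each task in its own workspace keeps concurrent tasks from overwriting each other's files and gives each provenance record an unambiguous output path.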
Case study
• Computational Fluid Dynamics (CFD)
• EdgeCFD: a parallel stabilized finite element
incompressible flow solver
• Synthesized in four steps:
o Modeling
o Preprocessing
o Solution
o Visualization
[Figure: TAU parallel profiling of the CFD solver on an SGI Altix ICE 8200, 128 cores]
EdgeCFD experiment life cycle
[Figure: the experiment life cycle diagram (Composition, Execution, Analysis, with Conception, Reuse, Distribution, Monitoring, Visualization, Query, Discovery, Provenance Data) applied to EdgeCFD and supported by VisTrails & Hydra. Workflow activities: <<Automated>> EdgeCFD Preprocessor; <<Sub-Workflow, Sweep>> EdgeCFD Solver and Control Applications; <<Semi-Automated>> Visualization. File labels: nn.part.in, nn.part.msh, part.mat, part.ic, part.edg. Visualization file labels: .case, nn.geo, velo_nnnn.vecnn, press_0000_sdnn, scal_nnnn_sdnn, DD_nnnn_sdnn.]
Workflow modeled in UML
[Figure: UML model of the workflow in three stages: pre-processing, solver, visualization. Activities: <<Automated>> EdgeCFD Preprocessor; <<Sub-Workflow, Sweep>> EdgeCFD Solver and Control Applications; <<Semi-Automated>> Visualization. File labels: nn.part.in, nn.part.msh, part.mat, part.ic, part.edg. Visualization file labels: .case, nn.geo, velo_nnnn.vecnn, press_0000_sdnn, scal_nnnn_sdnn, DD_nnnn_sdnn.]
Related work
• Swift/Falkon
o Provides MTC support from Swift SWfMS
• MyCluster
o Supports PBS with transient fault support over remote
sites
• Dryad
o Supports data parallelization with high scalability
• Sawzall
o A framework for MTC that explores data parallelism
Conclusions
• The experiment life cycle must be managed as a whole:
o Composition: the experiment is modeled at a workflow abstraction level until it is deployed into a specific SWfMS
o Execution: some activities demand HPC with monitoring facilities and provenance gathering
o Analysis: uses information both from the composition (prospective provenance) and from the execution (local and distributed retrospective provenance)
• Hydra can be a bridge between the SWfMS and the HPC environment
o Supports workflow data and parameter sweep parallelization
o Evaluated in a real case CFD solver with little overhead
o Supports distributed provenance gathering
Future work
• Evaluate different kinds of applications (e.g., BLAST,
uncertainty quantification)
• Model distributed activities that are actually sub-workflows
• Run experiments in HPC with more cores
Thank you!
Please visit our site
http://gexp.nacad.ufrj.br