DYNAMIC CLOUD PROVISIONING FOR SCIENTIFIC GRID WORKFLOWS
Simon Ostermann, Radu Prodan and Thomas Fahringer
Institute of Computer Science, University of Innsbruck
Technikerstrasse 21a, Innsbruck, [email protected]
OVERVIEW
• Introduction
• Optimized Cloud Provisioning
  • Cloud Start
  • Instance Size
  • Grid Scheduling
  • Cloud Stop
• Evaluation using 3 scientific workflows
  • Wien2k
  • Invmod
  • Meteoag
• Conclusion
INTRODUCTION
• Infrastructure as a Service (IaaS) is a branch of Cloud computing
• On-demand resources, e.g. Amazon EC2, GoGrid, ...
• Other common Cloud computing areas not covered:
  • Platform as a Service
  • Software as a Service
  • Specialized solutions for storage, Web hosting, ...
CLOUD COMPUTING FOR SCIENTIFIC COMPUTING?
• Rent resources instead of buying own hardware
• Eliminates permanent operation, maintenance, and depreciation costs
• Scale an infrastructure up or down based on temporary, immediate needs
• Significantly reduced over-provisioning
• Virtualised resources enable scalable deployment and provisioning of application software
• Reliability through business SLA relationships that bind actors to offering higher QoS guarantees
CLOUD MODELS
• Cloud computing is mostly available on an hourly basis
• Some research papers assume finer granularity
• Interesting problems arise:
  • How much of this full hour do I use?
  • How can I maximize the usage / minimize the cost?
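The two questions above can be made concrete with a small calculation. The following is a minimal Python sketch, assuming whole-hour billing in which every started hour is charged in full. The function names and the price of 0.10 $ per hour are illustrative assumptions, not values from the slides.

```python
import math


def billed_hours(used_minutes: float, billing_minutes: int = 60) -> int:
    """Number of billing intervals charged for the given usage time.

    Assumption: every started interval is charged in full.
    """
    if used_minutes <= 0:
        return 0
    return math.ceil(used_minutes / billing_minutes)


def hourly_cost(used_minutes: float, price_per_hour: float) -> float:
    """Cost of an instance under whole-hour billing."""
    return billed_hours(used_minutes) * price_per_hour


def paid_time_utilization(used_minutes: float) -> float:
    """Fraction of the paid time that was actually used (0.0 to 1.0)."""
    hours = billed_hours(used_minutes)
    if hours == 0:
        return 0.0
    return used_minutes / (hours * 60)


if __name__ == "__main__":
    price = 0.10  # hypothetical price in $ per hour
    for minutes in (5, 58, 61):
        print(minutes, hourly_cost(minutes, price), paid_time_utilization(minutes))
```

Under this model, using an instance for 61 minutes costs twice as much as using it for 58 minutes, which is why the utilization of each paid hour matters.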
[Figure: timeline of Cloud instance states (Unallocated, Requested, Starting, Running, Accessible, Shutting down, Terminated, Unallocated) along the time axis, with the paid interval, the usable interval, and the next billing interval marked]
GRID COMPUTING
• The Grid has emerged as a worldwide shared distributed platform for solving large-scale scientific problems
• Grid computing with additional Cloud resources to speed up scientific computing
• Just-in-time scheduler from ASKALON, a workflow execution system for Grid and Cloud resources
• ASKALON is a workflow system developed by the DPS group at the University of Innsbruck
• Multiple scientific workflows from different fields of science
GROUDSIM
• Grid and Cloud simulator
• Event-based for scalability reasons
• Experiments showed up to 90% better performance and better scalability than GridSim
• Java-based, to allow integration into existing software
• Simulation allows wide analysis of Cloud usage without expenses
• Simulation results match real executions
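To illustrate what "event-based" means here, the following is a minimal sketch of a discrete-event loop built around a time-ordered event list. It is written in Python for brevity and is not GroudSim's API (GroudSim itself is Java-based). The class `EventSimulator`, its methods, and the job names and runtimes are all hypothetical.

```python
import heapq
import itertools
from typing import Callable, List, Tuple


class EventSimulator:
    """Tiny discrete-event engine: events are processed in time order."""

    def __init__(self) -> None:
        self.now = 0.0
        # Event list as a heap of (time, sequence number, callback).
        self._events: List[Tuple[float, int, Callable[[], None]]] = []
        self._sequence = itertools.count()  # tie-breaker for equal times

    def schedule(self, delay: float, callback: Callable[[], None]) -> None:
        """Put an event in the list, `delay` time units from now."""
        if delay < 0:
            raise ValueError("delay must be non-negative")
        heapq.heappush(
            self._events, (self.now + delay, next(self._sequence), callback)
        )

    def run(self) -> None:
        """Get the next event, jump the clock to it, and run its callback."""
        while self._events:
            time, _, callback = heapq.heappop(self._events)
            self.now = time
            callback()


def demo() -> list:
    sim = EventSimulator()
    log = []

    def job_finished(name: str) -> None:
        log.append((sim.now, f"{name} finished"))

    def submit_job(name: str, runtime: float) -> None:
        log.append((sim.now, f"{name} started"))
        sim.schedule(runtime, lambda: job_finished(name))

    sim.schedule(0.0, lambda: submit_job("job-1", 120.0))
    sim.schedule(10.0, lambda: submit_job("job-2", 50.0))
    sim.run()
    return log


if __name__ == "__main__":
    for entry in demo():
        print(entry)
```

The clock jumps directly from one event to the next instead of advancing in fixed steps, so the cost of a run depends on the number of events rather than on the simulated time span.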
GROUDSIM ARCHITECTURE
[Figure: GroudSim architecture. Components: GroudSim User, Simulation Engine, Future Event List (FEL), Grid and Cloud Entities, Failure Generator, Background Loader, and three Distribution inputs. Interactions: infrastructure + application simulation, callbacks, put events in list, get next event, submit jobs, transfer files, generate failure]
OPTIMIZED CLOUD PROVISIONING
• Analysis of regular executions and the resulting costs
• The analysis identified multiple parts needing optimization
• Choices have to be made about the start and stop of resources and the number of instances requested
• Four optimizations were found, defined as algorithms (in the paper) and exploited in the evaluation
CLOUD START
• Parallel regions with more tasks than available cores
• Depending on Cloud and Grid speed, serialization and imbalance overheads are analyzed
• When the runtime of the parallel section can be reduced, Cloud resources are started
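The decision can be sketched as a comparison of makespans. The following Python sketch is a simplified model, not the algorithm from the paper: it assumes identical tasks, a fixed runtime per task on each resource type, and an optional Cloud start-up delay. All function and parameter names are hypothetical.

```python
import math


def makespan(tasks: int, cores: int, runtime: float) -> float:
    """Time to run `tasks` identical tasks on `cores` cores, in waves."""
    if tasks == 0:
        return 0.0
    return math.ceil(tasks / cores) * runtime


def best_cloud_split(
    tasks: int,
    grid_cores: int,
    grid_runtime: float,
    cloud_cores: int,
    cloud_runtime: float,
    cloud_startup: float = 0.0,
):
    """Return (tasks_on_cloud, makespan) for the fastest split.

    tasks_on_cloud == 0 means the Grid alone is at least as fast,
    so no Cloud resources should be started.
    """
    best_k = 0
    best_time = makespan(tasks, grid_cores, grid_runtime)  # serialization case
    for k in range(1, tasks + 1):
        grid_time = makespan(tasks - k, grid_cores, grid_runtime)
        cloud_time = cloud_startup + makespan(k, cloud_cores, cloud_runtime)
        total = max(grid_time, cloud_time)  # imbalance shows up here
        if total < best_time:
            best_k, best_time = k, total
    return best_k, best_time


def should_start_cloud(*args, **kwargs) -> bool:
    """True only if moving some tasks to the Cloud shortens the region."""
    return best_cloud_split(*args, **kwargs)[0] > 0


if __name__ == "__main__":
    # Six 120-unit tasks on three Grid cores; one slow Cloud core (300 units).
    print(best_cloud_split(6, 3, 120, 1, 300))
    # Same tasks, but three Cloud cores at 150 units per task.
    print(best_cloud_split(6, 3, 120, 3, 150))
```

In the first call the single slow Cloud core would finish after the Grid, so the imbalance outweighs the avoided serialization and no Cloud resources are started. In the second call three tasks move to the Cloud and the region shortens from 240 to 150 time units.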
[Figure: two Gantt charts of six jobs over time. "Serialization": Grid cores 1 to 3 each run two 120-unit jobs back to back. "Imbalance": the Grid cores run five 120-unit jobs while one job runs on Cloud core 1 for 300 units]
INSTANCE SIZE
• Instances may offer different numbers of cores
• When only part of the Cloud cores are used, the cost efficiency is lower
• Getting too few cores may result in serialization or no benefit
• It is important to decide whether the number of instances to request is rounded up or down, resulting in 2 behaviors:
  • generous: better performance but more expensive
  • economical: less expensive but performance may not improve
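The two behaviors differ only in the rounding direction. The following Python sketch assumes the number of additional cores needed is already known; the function name, the policy strings as parameters, and the 8-core instance type in the example are illustrative assumptions.

```python
import math


def instances_to_request(
    cores_needed: int, cores_per_instance: int, policy: str = "generous"
) -> int:
    """Number of Cloud instances to request for `cores_needed` cores.

    generous   -> round up   (all tasks get a core, some cores may idle)
    economical -> round down (no idle cores, some tasks may be serialized)
    """
    if cores_needed < 0 or cores_per_instance <= 0:
        raise ValueError("cores_needed must be >= 0, cores_per_instance > 0")
    exact = cores_needed / cores_per_instance
    if policy == "generous":
        return math.ceil(exact)
    if policy == "economical":
        return math.floor(exact)
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    # 10 extra cores needed, hypothetical 8-core instance type.
    print(instances_to_request(10, 8, "generous"))    # 2 instances, 6 cores idle
    print(instances_to_request(10, 8, "economical"))  # 1 instance, 2 tasks wait
```

With fewer cores needed than one instance provides, the economical policy requests nothing at all, which corresponds to the "no benefit" case on the slide.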
GRID SCHEDULING
• The Grid is a dynamic shared environment
• Resources may become available while the workflow execution uses Cloud resources
• Rescheduling onto Grid resources might save cost or decrease execution time
• Decisions depend on the work already completed by a job mapped to a Cloud resource and on the speed difference between Grid and Cloud
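One way to read this trade-off is as a comparison of remaining time. The following Python sketch is an assumption-laden simplification, not the rule from the paper: it assumes a job moved to the Grid restarts from scratch (no checkpointing), that progress on the Cloud is linear, and it considers execution time only, not cost. All names are hypothetical.

```python
def remaining_cloud_time(cloud_runtime: float, fraction_done: float) -> float:
    """Time the job still needs if it stays on the Cloud resource."""
    if not 0.0 <= fraction_done <= 1.0:
        raise ValueError("fraction_done must be between 0 and 1")
    return (1.0 - fraction_done) * cloud_runtime


def should_move_to_grid(
    cloud_runtime: float, grid_runtime: float, fraction_done: float
) -> bool:
    """True if restarting on a freed Grid core finishes earlier.

    Assumption: the job restarts from the beginning on the Grid.
    """
    return grid_runtime < remaining_cloud_time(cloud_runtime, fraction_done)


if __name__ == "__main__":
    # Job takes 300 units on the Cloud and 120 units on the Grid.
    print(should_move_to_grid(300, 120, 0.2))  # early in the job
    print(should_move_to_grid(300, 120, 0.8))  # late in the job
```

Early in the job the faster Grid core wins even after a restart; late in the job the completed work outweighs the speed difference and the job stays on the Cloud.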
CLOUD STOP
• Unused resources are shut down to save money
• Shutting down 5 minutes into a paid hour is as expensive as shutting down after 58 minutes
• Resources might be reused in the upcoming 53 minutes, and this reuse reduces the overall Cloud provisioning overheads
• The shutdown time falls within the paid period, so the point in time to stop has to be chosen knowing the shutdown time of the Cloud
• In some cases, 1 hour of Cloud time can be saved
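The latest useful shutdown point can be computed from the billing interval and the shutdown duration. The following Python sketch assumes 60-minute billing intervals counted from instance start and a 2-minute shutdown time; the 2-minute value is an assumption chosen to match the 58-minute figure on the slide, and the function names are hypothetical.

```python
import math


def latest_shutdown_start(
    uptime_minutes: float, shutdown_minutes: float, billing_minutes: int = 60
) -> float:
    """Uptime (in minutes) at which shutdown must begin so that it
    completes before the next billing interval starts."""
    if not 0 <= shutdown_minutes < billing_minutes:
        raise ValueError("shutdown time must be shorter than a billing interval")
    end_of_paid = (math.floor(uptime_minutes / billing_minutes) + 1) * billing_minutes
    deadline = end_of_paid - shutdown_minutes
    if uptime_minutes > deadline:
        # Too late for this interval: the next one will be charged anyway,
        # so the instance can stay available until the following deadline.
        deadline += billing_minutes
    return deadline


def minutes_until_shutdown(
    uptime_minutes: float, shutdown_minutes: float, billing_minutes: int = 60
) -> float:
    """How long an idle instance can still be kept for possible reuse."""
    return (
        latest_shutdown_start(uptime_minutes, shutdown_minutes, billing_minutes)
        - uptime_minutes
    )


if __name__ == "__main__":
    print(latest_shutdown_start(5, 2))   # idle after 5 minutes of a paid hour
    print(minutes_until_shutdown(5, 2))  # time left for possible reuse
```

An instance that becomes idle 5 minutes into a paid hour can therefore stay available for another 53 minutes at no extra cost, provided the shutdown is initiated at minute 58.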
EVALUATION
• Three different scientific workflows with different levels of parallelism
• Execution simulated using GroudSim
• Impact of the different optimizations on the three workflows when using 3 different types of Cloud resources and 3 clusters from the Austrian Grid
METRIC
• Comparison of executions on Grid resources and executions using Grid and additional on-demand Cloud resources
• We define a new metric C_T, called cost per unit of saved time ($/T)
• It represents how expensive a unit of saved execution time is, under the assumption that Grid resources are freely available
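The slide defines the metric only in words. One plausible reading, sketched below in Python, divides the Cloud cost by the difference between the Grid-only execution time and the Grid-plus-Cloud execution time. The exact formula, the function name, the handling of the case with no saved time, and the example numbers are assumptions.

```python
import math


def cost_per_saved_time(
    grid_time: float, grid_cloud_time: float, cloud_cost: float
) -> float:
    """C_T: Cloud cost divided by the execution time saved.

    grid_time       -- execution time using Grid resources only
    grid_cloud_time -- execution time using Grid plus Cloud resources
    cloud_cost      -- money spent on Cloud resources (Grid assumed free)

    Assumption: returns infinity when the Cloud saves no time.
    """
    saved = grid_time - grid_cloud_time
    if saved <= 0:
        return math.inf
    return cloud_cost / saved


if __name__ == "__main__":
    # Hypothetical run: 600 min on the Grid, 480 min with Cloud, 12 $ spent.
    print(cost_per_saved_time(600, 480, 12.0))  # $ per saved minute
```

A lower value means each saved unit of time was bought more cheaply, which makes runs with different Cloud instance types comparable.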
WORKFLOWS
• From different fields of science, with different structures
• The parallelisation size x is a factor that determines the number of tasks in a workflow; it is evaluated for values from 1 to 900
• Computationally intensive; data transfers are a small part of each workflow
• The influence of Cloud network speed and storage is kept low
• Simulation data are based on real executions in the Austrian Grid
GENERAL OBSERVATIONS
[Figure: Cost [$] versus parallelisation size [x] (0 to 900) for Grid+m1.small, Grid+m1.large, and Grid+c1.xlarge, each with the Cloud stop optimization and with no optimization]
Comparison of regular and optimized executions of workflows of different sizes
WIEN2K
• Vienna University of Technology
• Theoretical chemistry (materials science)
• Electronic structure calculations for solids using density functional theory
• Number of activities
  • 2 * x + 3
  • x = parallelisation size
[Figure: Time [hours] versus parallelisation size [x] (0 to 900) for Grid, Grid + m1.small, Grid + m1.large, and Grid + c1.xlarge]
[Figure: Cost [$] versus parallelisation size [x] (0 to 900) for Grid + m1.small, Grid + m1.large, and Grid + c1.xlarge]
WIEN2K
Execution times and cost on the Grid and with additional Cloud resources
Cost per unit of saved time ($/T) for the three different Clouds, with logarithmic scale
[Figure: Cost / Saved time [min/$], logarithmic scale [log C_T], versus parallelisation size [x] (0 to 900) for Grid + m1.small, Grid + m1.large, and Grid + c1.xlarge]
INVMOD
• A hydrological application using the Levenberg-Marquardt algorithm to minimize the error between simulation and measurements
• Number of activities
  • 12 * x + 1
  • x = parallelisation size
[Figure: Time [hours] versus parallelisation size [x] (50 to 300) for Grid, Grid + m1.small, Grid + m1.large, and Grid + c1.xlarge]
[Figure: Cost [$] versus parallelisation size [x] (50 to 300) for Grid + m1.small, Grid + m1.large, and Grid + c1.xlarge]
INVMOD
[Figure: Cost / Saved time [min/$], logarithmic scale [log C_T], versus parallelisation size [x] (50 to 300) for Grid + m1.small, Grid + m1.large, and Grid + c1.xlarge]
Execution times and cost on the Grid and with additional Cloud resources
Cost per unit of saved time ($/T) for the three different Clouds, with logarithmic scale
METEOAG
• Meteorology and Geophysics Institute
• Meteorological simulations with the numerical model RAMS
• Resolve alpine watersheds and thunderstorms in the Arlberg region of western Austria
• Number of activities
  • 69 * x + 2
  • x = parallelisation size
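The activity counts given for the three workflows grow linearly with the parallelisation size x. The following Python sketch simply evaluates the three formulas from the slides; the dictionary and function names are hypothetical.

```python
# (slope, offset) of "slope * x + offset" as stated on the workflow slides.
ACTIVITY_FORMULAS = {
    "wien2k": (2, 3),    # 2 * x + 3
    "invmod": (12, 1),   # 12 * x + 1
    "meteoag": (69, 2),  # 69 * x + 2
}


def number_of_activities(workflow: str, x: int) -> int:
    """Number of activities of a workflow for parallelisation size x."""
    if x < 1:
        raise ValueError("parallelisation size must be at least 1")
    slope, offset = ACTIVITY_FORMULAS[workflow.lower()]
    return slope * x + offset


if __name__ == "__main__":
    for name in ACTIVITY_FORMULAS:
        print(name, number_of_activities(name, 100))
```

For the same x, Meteoag produces far more activities than Wien2k, which is worth keeping in mind when comparing results across the three workflows.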
[Figure: Meteoag workflow graph with the activities simulation_init, case_init, rams_makevfile, rams_init, revu_compare, raver, rams_hist, revu_dump, and stageout, run for case 1 to case n. Stage labels: Initial Conditions, 6 h Simulation, Post Process, Verify and Select, a "continue?" decision (yes/no), 18 h Simulation, Post Process]
METEOAG
[Figure: Time [hours] versus parallelisation size [x] (50 to 300) for Grid, Grid + m1.small, Grid + m1.large, and Grid + c1.xlarge]
[Figure: Cost [$] versus parallelisation size [x] (50 to 300) for Grid + m1.small, Grid + m1.large, and Grid + c1.xlarge]
Execution times and cost on the Grid and with additional Cloud resources
Cost per unit of saved time ($/T) for the three different Clouds, with logarithmic scale
[Figure: Cost / Saved time [min/$], logarithmic scale [log C_T], versus parallelisation size [x] (50 to 300) for Grid + m1.small, Grid + m1.large, and Grid + c1.xlarge]
CONCLUSION
• The granularity of Cloud payment plays an important role in Cloud allocation decisions
• Optimizations like the ones presented are needed to allow efficient usage of this dynamic resource class
• The longer Cloud resources are needed, the lower the impact
• Future extension with full-graph scheduling algorithms is planned
THANK YOU
Any questions?