dynamic cloud provisioning for scientific grid workflows

DYNAMIC CLOUD PROVISIONING FOR SCIENTIFIC GRID WORKFLOWS

Simon Ostermann, Radu Prodan and Thomas Fahringer

Institute of Computer Science, University of Innsbruck

Technikerstrasse 21a, Innsbruck, [email protected]


OVERVIEW

• Introduction

• Optimized Cloud Provisioning

  • Cloud Start

  • Instance Size

  • Grid Scheduling

  • Cloud Stop

• Evaluation using 3 scientific workflows

  • Wien2k

  • Invmod

  • Meteoag

• Conclusion

INTRODUCTION

• Infrastructure as a Service (IaaS) is a branch of Cloud computing

• On-demand resources, e.g. Amazon EC2, GoGrid, ...

• Other common Cloud computing areas not covered:

• Platform as a Service

• Software as a Service

• Specialized solutions for Storage, Web hosting, ...

CLOUD COMPUTING FOR SCIENTIFIC COMPUTING?

• Rent resources instead of buying own hardware

• Eliminates permanent operation, maintenance, and depreciation costs

• Scale an infrastructure up or down based on temporary, immediate needs

• Significantly reduced over-provisioning

• Virtualised resources enable scalable deployment and provisioning of application software

• Reliability through business SLA relationships that bind actors to offering higher QoS guarantees

CLOUD MODELS

• Cloud computing is mostly available on an hourly basis

• Some research papers assume finer granularity

• Interesting problems arise:

  • How much of this full hour do I use?

  • How can I maximize the usage / minimize the cost?
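The hourly billing questions above can be made concrete with a small sketch. It assumes every started hour is charged in full; the function names and the price of 0.10 $ per hour are hypothetical and only serve as illustration.

```python
import math

BILLING_PERIOD_MIN = 60  # assumption: every started hour is charged in full


def billed_hours(runtime_minutes):
    """Number of hours charged for an instance that ran for runtime_minutes."""
    return max(1, math.ceil(runtime_minutes / BILLING_PERIOD_MIN))


def cost(runtime_minutes, price_per_hour):
    """Total charge under hourly billing (price_per_hour is a hypothetical rate)."""
    return billed_hours(runtime_minutes) * price_per_hour


def utilization(runtime_minutes):
    """Fraction of the paid time that was actually used."""
    return runtime_minutes / (billed_hours(runtime_minutes) * BILLING_PERIOD_MIN)


# 5 minutes and 58 minutes cost the same; 61 minutes already costs two hours.
for minutes in (5, 58, 61):
    print(minutes, billed_hours(minutes), cost(minutes, 0.10), round(utilization(minutes), 2))
```

Under this model, maximizing usage means keeping the utilization of each paid hour close to 1 before releasing the instance.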

[Figure: life cycle of a Cloud instance over time. States: Unallocated, Requested, Starting, Running, Accessible, Shutting down, Terminated, Unallocated. Marked intervals: Paid interval, Usable interval, Next billing interval.]

GRID COMPUTING

• The Grid has emerged as a worldwide shared distributed platform for solving large-scale scientific problems

• Grid computing with additional Cloud resources to speed up scientific computing

• Just-in-time scheduler from ASKALON, a workflow execution system for Grid and Cloud resources

• ASKALON is a workflow system developed by the DPS group at the University of Innsbruck

• Multiple scientific workflows from different fields of science

GROUDSIM

• Grid and Cloud simulator

• Event-based for scalability reasons

• Experiments showed up to 90% better performance and better scalability than GridSim

• Java-based, to allow integration into existing software

• Simulation allows broad analysis of Cloud usage without incurring costs

• Simulation results match real executions
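The event-based approach can be illustrated with a minimal discrete-event loop: events are put into a time-ordered list, and the engine repeatedly takes the next event and invokes its callback. This is a generic Python sketch, not GroudSim's Java API; the class and method names are hypothetical.

```python
import heapq
import itertools


class Simulator:
    """Minimal event-based simulation engine (illustrative, not GroudSim code)."""

    def __init__(self):
        self.now = 0.0
        self._events = []               # time-ordered event list (a heap)
        self._seq = itertools.count()   # tie-breaker for events at the same time

    def schedule(self, delay, callback):
        """Put an event in the list, to fire `delay` time units from now."""
        heapq.heappush(self._events, (self.now + delay, next(self._seq), callback))

    def run(self):
        """Get the next event, advance the clock to it, and call its callback."""
        while self._events:
            self.now, _, callback = heapq.heappop(self._events)
            callback(self)


log = []
sim = Simulator()
sim.schedule(120, lambda s: log.append((s.now, "job finished")))
sim.schedule(30, lambda s: log.append((s.now, "file transferred")))
sim.run()
print(log)
```

Because the clock jumps directly from one event to the next, the cost of a simulation depends on the number of events rather than on the simulated time span.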

GROUDSIM ARCHITECTURE

[Figure: GroudSim architecture. Components: Simulation Engine, GroudSim User, Future Event List (FEL), Grid and Cloud Entities, Failure Generator, Background Loader, Distributions. Labels: infrastructure + application simulation; callbacks; put events in list; get next event; submit jobs; transfer files; generate failure.]

OPTIMIZED CLOUD PROVISIONING

• Analysis of regular executions and the resulting costs

• The analysis identified multiple parts needing optimization

• Choices have to be made about when to start and stop resources and how many instances to request

• Four optimizations were found, defined as algorithms (in the paper) and exploited in the evaluation

CLOUD START

• Parallel regions with more tasks than available cores

• Depending on Cloud and Grid speed, Serialization and Imbalance overheads are analyzed

• Cloud resources are started when they can reduce the runtime of the parallel section
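A minimal sketch of the Cloud start decision follows. It is not the algorithm from the paper: it assumes identical tasks, a fixed runtime per task on each resource class, and no instance startup overhead. All function names and the example numbers are hypothetical.

```python
def makespan(n_tasks, grid_cores, grid_time, cloud_cores=0, cloud_time=None):
    """Smallest finish time for n_tasks identical tasks on grid_cores (>= 1)
    plus optional cloud_cores, each core running one task at a time."""
    if n_tasks == 0:
        return 0
    # The optimal finish time is always a multiple of one of the task runtimes.
    candidates = set()
    for k in range(1, n_tasks + 1):
        candidates.add(k * grid_time)
        if cloud_cores:
            candidates.add(k * cloud_time)
    for t in sorted(candidates):
        finished = grid_cores * (t // grid_time)
        if cloud_cores:
            finished += cloud_cores * (t // cloud_time)
        if finished >= n_tasks:
            return t


def should_start_cloud(n_tasks, grid_cores, grid_time, cloud_cores, cloud_time):
    """Start Cloud resources only if they shorten the parallel section."""
    grid_only = makespan(n_tasks, grid_cores, grid_time)
    hybrid = makespan(n_tasks, grid_cores, grid_time, cloud_cores, cloud_time)
    return hybrid < grid_only


# 7 tasks on 3 Grid cores need 3 rounds (serialization); one Cloud core helps.
print(should_start_cloud(7, 3, 120, 1, 250))
# 6 tasks fit in 2 rounds on the Grid; a slower Cloud core brings no benefit.
print(should_start_cloud(6, 3, 120, 1, 250))
```

The first case corresponds to the serialization overhead (an extra round on the Grid for a single leftover task), the second to a situation where adding a slower Cloud core would only create imbalance.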

[Figure: two Gantt charts over Time. First: Jobs 1 to 6 on Grid cores 1 to 3, illustrating the Serialization overhead. Second: Jobs 1 to 6 on Grid cores 1 to 3 plus Cloud core 1, illustrating the Imbalance overhead. Data labels: 120 per job on the Grid cores; 250 and 300 on Cloud core 1.]

INSTANCE SIZE

• Instances may offer different numbers of cores

• When only part of the Cloud cores are used, the cost efficiency is lower

• Getting too few cores may result in serialization / no benefit

• It is important to decide whether the number of instances to request is rounded up or down, resulting in 2 behaviors:

  • generous: better performance but more expensive

  • economical: less expensive but performance may not improve
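The two rounding behaviors can be sketched as follows. The function name, the policy strings and the 8-core instance size are assumptions chosen for illustration.

```python
import math


def instances_to_request(cores_needed, cores_per_instance, policy="generous"):
    """Number of instances to request for a given core demand.

    generous   -> round up: every task gets a core, some cores may stay idle
    economical -> round down: no idle paid cores, some tasks may be serialized
    """
    exact = cores_needed / cores_per_instance
    if policy == "generous":
        return math.ceil(exact)
    if policy == "economical":
        return math.floor(exact)
    raise ValueError(f"unknown policy: {policy}")


# 10 extra cores wanted, hypothetical instance type with 8 cores:
print(instances_to_request(10, 8, "generous"))    # 2 instances, 6 cores idle
print(instances_to_request(10, 8, "economical"))  # 1 instance, 2 tasks must wait
```

With the economical policy a small demand can round down to zero instances, which is the "performance may not improve" case.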

GRID SCHEDULING

• The Grid is a dynamic shared environment

• Grid resources may become available while the workflow execution uses Cloud resources

• Rescheduling to Grid resources might save cost / might decrease execution time

• Decisions depend on the work already completed by a job mapped to a Cloud resource and on the speed difference between Grid and Cloud
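One possible form of this rescheduling decision is sketched below. It assumes that a job moved to the Grid restarts from scratch (no checkpointing) and that runtimes on both resource classes are known; the function and parameter names are hypothetical.

```python
def should_move_to_grid(progress, cloud_runtime, grid_runtime):
    """Decide whether a job running on the Cloud should be restarted on the Grid.

    progress      -- fraction of the job already completed on the Cloud (0.0 to 1.0)
    cloud_runtime -- full runtime of the job on the Cloud resource
    grid_runtime  -- full runtime of the job on the newly available Grid resource
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0 and 1")
    remaining_on_cloud = (1.0 - progress) * cloud_runtime
    # Assumption: the work done so far is lost when the job is moved.
    return grid_runtime < remaining_on_cloud


# A job that has barely started on a slow Cloud core is worth moving ...
print(should_move_to_grid(0.1, cloud_runtime=300, grid_runtime=120))
# ... while a job that is almost done should finish where it is.
print(should_move_to_grid(0.8, cloud_runtime=300, grid_runtime=120))
```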

CLOUD STOP

• Unused resources are shut down to save money

• Shutting down 5 minutes into a paid hour is as expensive as shutting down after 58 minutes

• Resources might be reused in the upcoming 53 minutes, and this reuse reduces the overall Cloud provisioning overheads

• The shutdown time falls within the paid period; therefore the point in time has to be chosen knowing the shutdown time of the Cloud

• In some cases, 1 hour of Cloud time can be saved
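The timing of the shutdown can be sketched as below. The sketch assumes hourly billing measured from the launch time and a shutdown duration of 2 minutes, which is consistent with the 5, 58 and 53 minute figures on the slide but is not stated there; all names are hypothetical.

```python
BILLING_PERIOD = 3600  # seconds; assumption: hourly billing from launch time


def latest_shutdown_start(launch_time, now, shutdown_duration):
    """Latest moment (in seconds) to initiate shutdown so that the instance
    has terminated before the next billing period begins."""
    elapsed = now - launch_time
    periods_started = elapsed // BILLING_PERIOD + 1
    paid_until = launch_time + periods_started * BILLING_PERIOD
    deadline = paid_until - shutdown_duration
    if deadline < now:
        # Too late for the current period: the next one will be paid anyway.
        deadline += BILLING_PERIOD
    return deadline


# Instance becomes idle 5 minutes into its paid hour, shutdown takes 2 minutes:
deadline = latest_shutdown_start(launch_time=0, now=5 * 60, shutdown_duration=120)
print(deadline / 60)             # minute 58 of the paid hour
print((deadline - 5 * 60) / 60)  # 53 minutes in which the instance can be reused
```

Keeping the idle instance until this deadline costs nothing extra and avoids a fresh provisioning overhead if new tasks arrive in the meantime.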

EVALUATION

• Three different scientific workflows with different levels of parallelism

• Execution simulated using GroudSim

• Impact of the different optimizations on the three workflows when using 3 different types of Cloud resources and 3 clusters from the Austrian Grid

METRIC

• Comparison of executions on Grid resources with executions using Grid and additional on-demand Cloud resources

• We define a new metric CT, called cost per unit of saved time ($/T)

• Represents how expensive a unit of saved execution time is, under the assumption that Grid resources are freely available
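Read literally, the metric divides the Cloud cost by the execution time saved relative to the Grid-only run. The sketch below follows that reading; the function name, the handling of the no-saving case and the example numbers are assumptions, not values from the evaluation.

```python
import math


def cost_per_saved_time(cloud_cost, grid_time, hybrid_time):
    """Cloud cost divided by the execution time saved compared to Grid only.

    Grid resources are assumed to be free, so cloud_cost is the total cost.
    Returns infinity when the Cloud resources did not save any time.
    """
    saved = grid_time - hybrid_time
    if saved <= 0:
        return math.inf
    return cloud_cost / saved


# Hypothetical run: 20 h on the Grid alone, 15 h with Cloud resources costing 10 $.
print(cost_per_saved_time(10.0, 20.0, 15.0))  # 2.0 $ per saved hour
```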

WORKFLOWS

• From different fields of science with different structures

• Parallelisation size x: a factor that determines the number of tasks in a workflow, evaluated for values from 1 to 900

• Computationally intensive; data transfers are a small part of each workflow

• Influence of Cloud network speed and storage kept low

• Simulation data based on real executions in the Austrian Grid

GENERAL OBSERVATIONS

[Figure: Comparison of regular and optimized executions of workflows of different sizes. Cost [$] (0 to 180) vs. parallelisation size [x] (0 to 900). Series: Grid+m1.small, Grid+m1.large and Grid+c1.xlarge, each with the Cloud stop optimization and with no optimization.]

WIEN2K

• Vienna University of Technology

• Theoretical chemistry (materials science)

• Electronic structure calculations for solids using density functional theory

• Number of activities: 2 * x + 3, where x = parallelisation size (e.g. x = 900 gives 1803 activities)

[Figure: Wien2k execution time. Time [hours] (0 to 35) over parallelisation size 0 to 900. Series: Grid, Grid + m1.small, Grid + m1.large, Grid + c1.xlarge.]

[Figure: Wien2k cost. Cost [$] (0 to 180) over parallelisation size 0 to 900. Series: Grid + m1.small, Grid + m1.large, Grid + c1.xlarge.]

WIEN2K

Execution times and cost on the Grid and with additional Cloud resources

Cost per unit of saved time ($/T) for the three different Clouds, on a logarithmic scale

[Figure: Wien2k cost per unit of saved time, logarithmic scale (log CT, 0.01 to 10), over parallelisation size 0 to 900. Axis label in the original: Cost / Saved time [min/$].]



INVMOD

• A hydrological application using the Levenberg-Marquardt algorithm to minimize the error between simulation and measurements


[Figure: Invmod execution time. Time [hours] (10 to 50) over parallelisation size 50 to 300.]

[Figure: Invmod cost. Cost [$] (0 to 250) over parallelisation size 50 to 300.]



INVMOD

[Figure: Invmod cost per unit of saved time, logarithmic scale (log CT, 0.01 to 100), over parallelisation size 50 to 300. Axis label in the original: Cost / Saved time [min/$].]






METEOAG

• Meteorology and Geophysics Institute

• Meteorological simulations with the numerical model RAMS

• Resolves alpine watersheds and thunderstorms in the Arlberg region of western Austria


[Figure: Meteoag workflow. Activities: simulation_init, case_init, rams_makevfile, rams_init, rams_hist, raver, revu_compare, revu_dump, stageout. Stage labels for cases 1 to n: Initial Conditions, 6 h Simulation, Post Process, Verify and Select with a "continue?" decision (no / yes), 18 h Simulation, Post Process.]

METEOAG

[Figure: Meteoag execution time. Time [hours] (0 to 160) over parallelisation size 50 to 300.]

[Figure: Meteoag cost. Cost [$] (0 to 900) over parallelisation size 50 to 300.]

[Figure: Meteoag cost per unit of saved time, logarithmic scale (log CT, 0.01 to 100), over parallelisation size 50 to 300. Axis label in the original: Cost / Saved time [min/$].]



CONCLUSION

• The granularity of Cloud payment plays an important role in Cloud allocation decisions

• Optimizations like the ones presented are needed to allow efficient usage of this dynamic resource class

• The longer Cloud resources are needed, the lower the impact

• Future extension with full graph scheduling algorithms is planned

THANK YOU

Any questions?
