
Parallel Processing Letters, Vol. 23, No. 2 (2013) 1340004 (14 pages)
© World Scientific Publishing Company
DOI: 10.1142/S0129626413400045

HOSTED SCIENCE: MANAGING COMPUTATIONAL WORKFLOWS IN THE CLOUD

EWA DEELMAN¹, GIDEON JUVE¹, MACIEJ MALAWSKI² and JAREK NABRZYSKI³

¹USC Information Sciences Institute, Marina del Rey, CA, USA
²AGH University of Science and Technology, Dept. of Computer Science, Krakow, Poland
³University of Notre Dame, Center for Research Computing, Notre Dame, IN, USA

Received December 2012
Revised March 2013
Published 28 June 2013

Communicated by Jack Dongarra and Bernard Tourancheau

ABSTRACT

Scientists today are exploring the use of new tools and computing platforms to do their science. They are using workflow management tools to describe and manage complex applications and are evaluating the features and performance of clouds to see if they meet their computational needs. Although today hosting is limited to providing virtual resources and simple services, one can imagine that in the future entire scientific analyses will be hosted for the user. The latter would specify the desired analysis, the timeframe of the computation, and the available budget. Hosted services would then deliver the desired results within the provided constraints. This paper describes current work on managing scientific applications on the cloud, focusing on workflow management and related data management issues. Frequently, applications are not represented by single workflows but rather as sets of related workflows, called workflow ensembles. Thus, hosted services need to be able to manage entire workflow ensembles, evaluating tradeoffs between completing as many high-value ensemble members as possible and delivering results within a certain time and budget. This paper gives an overview of existing hosted science issues, presents the current state of the art on resource provisioning that can support it, and outlines future research directions in this field.

Keywords: Hosted science; resource provisioning; cloud computing; scientific workflows.

1. Introduction

In recent years, there has been a clear shift in the way computing is done within the commercial and government sectors. Services that were traditionally hosted within local IT infrastructures are being migrated to clouds that are being built by companies such as Amazon, Google, and Microsoft [1]. US government agencies, including NSF and NIST, are exploring the use of cloud technologies to support both governmental functions and scientific computing [2–5]. Today, clouds provide many services we use on a daily basis: email, social networking, word processing, entertainment, etc. At the same time, Infrastructure as a Service is being evaluated and used in the context of scientific applications [6–9].

As the computing infrastructure is changing, so are the scientific applications. Many scientific disciplines are entering the "Big Data" era [10, 11], where data, on the order of peta- or even exabytes, will be distributed in the environment and processed by complex computational pipelines. Moving towards cloud-hosted science services can help democratize research and accelerate discovery by bridging the gap between smaller labs (long-tail science) and large-scale collaborations, creating a unified research community [12–14].

We envision that all scientific data will be stored "in the cloud", i.e. in hosted repositories or databases, and will be accessible over the Internet. This data is often too big to make it practical to transfer to a local machine for processing. Instead, it is possible to deploy processing services on demand, either in the form of queries or scripts, or in the form of more complex scientific workflows operating on the data in situ. In the area of computing, Infrastructure as a Service (IaaS) clouds can provide virtual machines and computing clusters that can be co-located with the data repositories and have configurable resource capacities, such as the number of processing cores, memory, and temporary storage for data processing.
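As an illustration of this co-location idea, consider the following minimal Python sketch using the boto3 AWS SDK. This is one possible realization, not something the paper prescribes; the bucket name, AMI ID, and instance type are hypothetical placeholders.

```python
# Minimal sketch of the co-location idea using the boto3 AWS SDK (one possible
# realization, not something the paper prescribes). The bucket name, AMI ID,
# and instance type are hypothetical placeholders.
import boto3

DATA_BUCKET = "hosted-science-dataset"   # hypothetical hosted data repository

# Find the region where the data lives, then provision compute there, so the
# workflow processes the data in situ instead of downloading it.
s3 = boto3.client("s3")
region = s3.get_bucket_location(Bucket=DATA_BUCKET)["LocationConstraint"] or "us-east-1"

ec2 = boto3.resource("ec2", region_name=region)
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical worker VM image
    InstanceType="c5.xlarge",          # cores/memory sized to the task mix
    MinCount=1,
    MaxCount=4,                        # a small pool for parallel workflow tasks
)
print("provisioned:", [i.id for i in instances])
```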

This new view of computational science implies radical changes to the management of resources for scientific applications. The traditional approach of procuring computing hardware and submitting jobs to local compute clusters will be replaced by a process of provisioning resources and launching an entire analysis at the click of a button, where the amount of research to be done is only limited by the time and budget available to the scientists.

However, to transform the state of the art in scientific computing, and lower the barrier to complex computation, there needs to be a bridge between the demands of the applications and the infrastructure available to them. The bridge today is built with rather low-level middleware and data management services. As we move to hosted environments, more hardware-level concerns will be migrated to large-scale data centers, freeing up the traditional campus IT infrastructures to provide more sophisticated services that deliver the power of data centers to their scientific users. Thus, a new set of online hosted services that can store, access, process, share, and collaborate on data at Web scale can revolutionize scientific research in the same way that cloud computing is changing the commercial IT landscape.

A critical part of providing services based on hosted environments is to manage the computing resources in terms of provisioning, scheduling, utilization, performance, and cost. Optimal decisions in this scenario are complex in nature, and one must consider multiple cloud providers and types of clouds (public, private, community, and hybrid) with different costs, performance, security, reliability and availability, as well as their locality with respect to the data to be processed. We argue that hosted science will need a set of methods and tools for assisting users and service providers in the resource management decisions and capacity planning required to support research.


In this paper, we take a step in the direction of hosted science by examining the issues of managing a class of important scientific applications, scientific workflows, on cloud resources. In particular, we explore the issues related to the management of sets of scientific workflows, called workflow ensembles, that represent an entire scientific analysis. Through simulation, we explore resource provisioning and workflow scheduling algorithms for this class of applications running on IaaS clouds. This paper extends our prior work [15].

2. Hosted Science Ecosystem

We see growing interest in the use of cloud computing for science. The computational science ecosystem combines various services, including traditional instruments and clusters, as well as new cloud services, from infrastructure and storage to higher-level database and analysis services. In bioinformatics, companies such as Illumina [16] sell instruments that can store data in the cloud as part of a sample analysis pipeline, and then offer cloud-based services to share and analyze that data. The cost model adopted by these providers today is not clear: some limited computation may be done at the expense of the provider, but if the costs are too large, they will be passed on to the user, depending on the application and on the set of resources used. Companies such as Illumina need to be able to estimate what resources the scientists will need and what those resources will cost. If the users are to pay for the resources directly, then they also need such capabilities.

We can also observe many examples of partnerships between research organizations and cloud service providers. Internet2 has announced partnerships with multiple cloud providers to offer infrastructure, data and software services for research and education applications [17]. Among these services are Box.net cloud storage, access to the Microsoft Azure platform with waived data transfer charges, simplified access to the Amazon EC2 cloud via CENIC Corporation, and access to an infrastructure-as-a-service cloud platform provided by HP and SHI International [18]. The Center for Research Computing (CRC) of the University of Notre Dame was selected as one of the pilot users of the HP/SHI cloud. We have managed to successfully extend the capacity of the CRC compute cluster with nodes deployed as virtual machines in the SHI cloud datacenter, and to run scalability benchmarks using scientific workflow applications managed by the Pegasus Workflow Management System [19]. In Europe, HelixNebula [20] is a project funded by the European Commission that includes CERN, EMBL and ESA. The goal of this project is to deploy large-scale scientific applications from HEP, genome analysis and radar satellite data processing on cloud services from multiple providers (CloudSigma, Atos, T-Systems). This proof-of-concept deployment is a step towards larger-scale production deployments of scientific applications on clouds. The 1000 Genomes project [21] is hosting a 200 TB dataset of human genomes on Amazon S3. It has obviously become more practical for scientists to process such large datasets in the cloud rather than to download them to local computing resources.


There are also new high-level hosted services developed specifically for research communities. SQLShare [13] is an example of a database-as-a-service for science. It enables creating, managing, accessing and publishing SQL databases online, using a Web-based portal or Python, REST and R APIs. SDSC Cloud Storage Services [22] provide object storage services for researchers based on a pay-per-use model, but in contrast to Amazon S3, there is no bandwidth cost. Globus Online [12] is a service for managing scientific data transfers between multiple endpoints. It is a cloud-hosted service that aims to support the transfer and synchronization of large datasets by providing portal, command-line and API interfaces that support functionality such as security, multiple protocols and reliability using transfer restarts. EarthServer [23] is a European project that creates a service for open access and analytics on Earth Science data. National eResearch Collaboration Tools and Resources (NeCTAR) [24] is an Australian program that provides a research cloud and tools for the scientific community. It is coupled with the Research Data Storage Infrastructure (RDSI) [25], which is mandated by the Australian research funding agency as the repository for all publicly funded research results. These projects are only point solutions that need to be integrated, significantly enhanced, and augmented with new capabilities to provide science as a service. They all need fundamental resource management and capacity planning solutions in order to deliver cost-effective services to their scientific user community.

3. Workflow Ensembles

Workflow ensembles represent an entire scientific analysis as a set of inter-related, but independent, workflows. There are many applications that require scientific workflow ensembles. For example, CyberShake [26], an earthquake science application, uses ensembles to generate seismic hazard maps. Each workflow in a CyberShake ensemble generates a hazard curve for a particular geographic location, and several hazard curves are interpolated to create a hazard map. In 2009 CyberShake was used to generate a hazard map of Southern California that required an ensemble of 239 workflows. Similarly, users of Montage [27], an astronomy application, often need several workflows with different parameters to generate a set of image mosaics that can be combined into a single, large mosaic. The Galactic Plane ensemble, which generates several mosaics of the galactic plane in different wavelengths, consists of 17 workflows, each of which contains 900 sub-workflows. Another example is the Periodograms application [28], which searches for extra-solar planets by detecting periodic dips in the light intensity that occur when they transit their host star. This application uses workflow ensembles to process the input dataset using different parameters. A recent analysis of Kepler satellite data required three ensembles of 15 workflows. Another example of large-scale ensemble computation is described in [29], where ensembles of oil reservoir simulations are run on an IBM Blue Gene/P platform. The supercomputer is divided into partitions, which can be dynamically provisioned as a service in a way similar to compute cloud resources. Ensembles of simulations are also very important in climate change studies [30], since climate prediction models are sensitive to parameter changes. To quantify these uncertainties, a large number of ensembles, or even "super-ensembles", need to be simulated.
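The discussion above treats an ensemble as a prioritized set of independent DAG-shaped workflows sharing one budget and one deadline. A minimal Python sketch of that structure, under our own naming assumptions (the paper defines no such types), is:

```python
# Minimal sketch (our own naming; the paper defines no such types) of a
# workflow ensemble: independent DAG-shaped workflows, each with a
# user-assigned priority, sharing a single budget and deadline.
from dataclasses import dataclass, field

@dataclass
class Task:
    task_id: str
    runtime_estimate: float                       # estimated runtime in hours
    parents: list = field(default_factory=list)   # parent Task objects (DAG edges)

@dataclass
class Workflow:
    name: str
    priority: int                                 # higher = more valuable member
    tasks: list = field(default_factory=list)

@dataclass
class Ensemble:
    workflows: list
    budget: float                                 # total budget in $
    deadline: float                               # deadline in hours

    def by_priority(self):
        """Order in which a scheduler should consider ensemble members."""
        return sorted(self.workflows, key=lambda w: w.priority, reverse=True)
```

The algorithm sketches in Section 4 below reuse these Task and Workflow shapes.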

4. Resource Provisioning for Ensembles

In our prior work we investigated the resource provisioning problems that arise in the hosted science model by developing algorithms to manage ensembles of workflows on a simple IaaS cloud infrastructure [15]. The goal was to maximize the number of user-prioritized workflows that could be completed given budget and deadline constraints. We developed three algorithms to solve this problem: two dynamic algorithms, DPDS (Dynamic Provisioning Dynamic Scheduling) and WA-DPDS (Workflow-Aware DPDS), and one static algorithm, SPSS (Static Provisioning Static Scheduling).

The DPDS algorithm is an online algorithm that adds all the ready tasks of all workflows into a single priority queue, where the priority of a task is set to the corresponding workflow's priority. It provisions the number of VMs that will consume the entire budget just before the deadline. When a VM becomes available, it assigns the first task from the queue to that VM. This approach achieves high utilization of VMs by backfilling idle VMs with tasks of lower-priority workflows. However, the algorithm does not prevent starting workflows that cannot finish within the constraints, which may delay the execution of higher-priority workflows.
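A simplified sketch of the DPDS idea follows. This is our reconstruction from the description above, not the authors' implementation; the pricing model and helper names are assumptions.

```python
# Simplified DPDS sketch (our reconstruction, not the authors' code).
import heapq

def initial_vm_count(budget, deadline_hours, price_per_vm_hour=1.0):
    # Provision the number of VMs whose cost consumes the entire budget
    # just before the deadline.
    return int(budget // (deadline_hours * price_per_vm_hour))

def dpds_dispatch(ready_tasks, idle_vms):
    """ready_tasks: list of (workflow_priority, task); idle_vms: list of VM ids.
    Returns (task, vm) assignments, highest-priority tasks first."""
    # One shared priority queue across all workflows. heapq is a min-heap,
    # so priorities are negated to pop the highest-priority task first.
    queue = [(-priority, i, task) for i, (priority, task) in enumerate(ready_tasks)]
    heapq.heapify(queue)
    assignments = []
    while queue and idle_vms:
        _, _, task = heapq.heappop(queue)
        assignments.append((task, idle_vms.pop()))  # backfill any idle VM
    return assignments
```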

The WA-DPDS algorithm addresses this issue by adding an admission test before starting the first task of a new workflow from the ensemble. The admission test estimates whether it is possible to finish the new workflow by comparing its estimated cost to the estimated remaining budget. If the workflow fits within the constraints, it is accepted and the task is scheduled; otherwise it is rejected and the entire workflow is removed from the queue. This simple test prevents workflows that can never finish from starting, so that they do not consume resources that could be used by higher-priority workflows. It is possible to add other admission tests, such as a comparison of the critical path with the deadline constraint.
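The admission test can be made concrete with a short sketch, again a hedged reconstruction from the text that reuses the Task and Workflow shapes from the ensemble sketch in Section 3; the cost estimate below is the simplest one consistent with the description.

```python
# Sketch of the WA-DPDS admission test (hedged reconstruction from the text,
# reusing the Task/Workflow shapes of the ensemble sketch in Section 3).
def estimated_workflow_cost(workflow, price_per_vm_hour=1.0):
    # Simplest estimate consistent with the text: total task runtime
    # converted to VM-hours. Real estimates could also exploit structure.
    return sum(t.runtime_estimate for t in workflow.tasks) * price_per_vm_hour

def admit(workflow, remaining_budget, price_per_vm_hour=1.0):
    """True if the workflow is expected to fit in the remaining budget;
    otherwise the whole workflow is rejected and removed from the queue."""
    return estimated_workflow_cost(workflow, price_per_vm_hour) <= remaining_budget
```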

The SPSS algorithm is an offline (static) algorithm that prepares a provisioning and scheduling plan ahead of execution using estimates of task runtimes. It is based on a heuristic that adds new workflows to the plan in a way that minimizes the cost of each workflow while maximizing the number of workflows completed. The algorithm considers the workflows in priority order, distributes sub-deadlines to tasks based on workflow levels, and finds a cost-optimal schedule for each workflow. If the schedule fits into the budget, the workflow is admitted; otherwise it is rejected. At runtime, the tasks are executed according to the plan, but in the case of unpredicted behavior of the application or environment, such as deviations from runtime estimates or provisioning delays of VMs, the algorithm cannot adapt to these changes and the constraints may be exceeded.
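A hedged sketch of the SPSS planning step, as we read it, appears below. The equal per-level split of the deadline is our simplification of the level-based sub-deadline distribution, and the cost estimate is again the naive total-runtime one; neither is taken from the paper.

```python
# Hedged sketch of the SPSS planning step (our simplifications, not the
# authors' implementation).
def task_levels(workflow):
    """Level of a task = 1 + length of the longest chain of ancestors above it."""
    levels = {}
    def level(task):
        if task.task_id not in levels:
            levels[task.task_id] = 1 + max((level(p) for p in task.parents),
                                           default=0)
        return levels[task.task_id]
    for t in workflow.tasks:
        level(t)
    return levels

def spss_plan(workflow, deadline, remaining_budget, price_per_vm_hour=1.0):
    """Return (sub_deadlines, planned_cost) if admitted, else (None, 0.0)."""
    levels = task_levels(workflow)
    slot = deadline / max(levels.values())        # equal deadline share per level
    sub_deadlines = {tid: lvl * slot for tid, lvl in levels.items()}
    cost = sum(t.runtime_estimate for t in workflow.tasks) * price_per_vm_hour
    if cost <= remaining_budget:
        return sub_deadlines, cost                # admit and record the plan
    return None, 0.0                              # reject the whole workflow
```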


[Figure 1: two plots, total runtime in hours (left) and computing cost in $/h (right), each as a function of the deadline in hours, with one curve per algorithm (SPSS, WA-DPDS, DPDS).]

Fig. 1. Example performance of algorithms expressed in sum of workflow sizes and effective cost; data for the LIGO application, Pareto Sorted ensemble of 100 workflows, budget = $600.

For the purposes of evaluation we developed a simulator that models the cloud infrastructure, and a workflow engine with tightly coupled scheduler and provisioner modules. We used ensembles of synthetic workflows that were generated based on statistics from real scientific applications. The results of our simulations indicate that the two algorithms that take into account the structure of the workflow and task runtime estimates (WA-DPDS and SPSS) yield better results than the simple priority-based scheduling (DPDS). This underscores the importance of viewing workflow ensembles as a whole rather than as individual tasks or workflows. In cases where there were no provisioning delays, where the task runtime estimates were good, and where failures were rare, we found that SPSS performed significantly better than both dynamic algorithms. However, when conditions were less than perfect, the static plans produced by SPSS were disrupted and frequently exceeded the budget and deadline constraints.

To illustrate the behavior of the algorithms, we show examples of performance curves obtained for selected parameters out of the 5 applications, 3 algorithms, 5 ensemble size distributions and 10 random seeds that we analyzed. Figure 1 (left) shows the total computing time Tc, which is defined as the sum of the runtimes of all tasks of fully completed workflows from the ensemble. We show data obtained for a LIGO (gravitational-wave physics) application with a Pareto Sorted distribution of workflows (sorted ensembles are sorted by size, so that the largest workflows have the highest priority; in this case the sizes are drawn from a Pareto distribution). The figure shows that the SPSS algorithm gives the best performance and that WA-DPDS performs better than the workflow-unaware version, DPDS. This means that, for the workflows we tested, the simple admission procedure based on estimation of workflow cost does not degrade the solution; rather, in many cases it enables larger workflows that would lead to budget overruns to be rejected. Thus, it can save resources for smaller workflows to complete.
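The metric Tc can be written down directly; here is a one-line sketch over assumed result records, where the `runtime` callback stands in for the measured runtime of a task.

```python
# Tc as defined above (a sketch over assumed result records; `runtime(task)`
# stands for the measured runtime of a task, in hours).
def total_computing_time(completed_workflows, runtime):
    """Sum of runtimes of all tasks of fully completed workflows."""
    return sum(runtime(t) for w in completed_workflows for t in w.tasks)
```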

This plot allows us to observe a more general property: for a given budget, when the deadline is tightened, the amount of work done steeply decreases. This is partially due to the fact that fewer workflows can be completed given a short deadline, because their critical paths are simply too long and they are squeezed out of early start times by high-priority workflows. In addition, assigning a shorter deadline means that more VMs need to be provisioned to complete the work. However, because of the dependencies in the workflows, many of these VMs will either be idle, or will be backfilled with tasks from low-priority workflows that never finish, resulting, effectively, in reduced resource utilization and fewer completed workflows.

These observations are further confirmed when examining the effective computing cost, as shown in Figure 1 (right). The computing cost in $/h is the total cost divided by the total computing time Tc, where for simplicity we assume that 1 hour of VM time costs $1. This means that the effective computing cost is the inverse of resource utilization, since it also includes the cost of idle resources. It can be seen that lower resource utilization and lower parallel efficiency for short deadlines lead to substantial increases in the effective computing cost. This increase can be significant: for example, for short deadlines the effective computing cost is ~10–40% higher than the minimum effective cost of $1/hour, depending on the algorithm used. This means that, out of the $600 budget, $60–$240 is spent on idle resources.
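The arithmetic behind these numbers can be spelled out under the text's $1 per VM-hour assumption; the Tc value in the sketch below is hypothetical and chosen to reproduce the ~10% overhead case.

```python
# Worked example of the effective computing cost metric, at the text's
# assumed $1 per VM-hour. The Tc value is hypothetical: 540 useful hours
# out of a $600 spend gives the ~10% overhead case quoted above.
PRICE = 1.0            # $ per VM-hour, as assumed in the text
budget = 600.0         # total money spent on VMs
tc = 540.0             # hypothetical total computing time Tc, in hours

effective_cost = budget / tc            # $/h; $1/h would be perfect utilization
idle_spend = budget - tc * PRICE        # dollars that paid for idle VM time
print(f"effective cost = {effective_cost:.2f} $/h, ${idle_spend:.0f} spent idle")
# -> effective cost = 1.11 $/h, $60 spent idle
```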

Another general conclusion can be drawn from the shape of Figure 1 (right): it represents a trade-off curve between two conflicting objectives, time and cost. We observe that the selection of deadline and cost can be formulated as a multi-criteria optimization problem, and that the curves presented here are approximations of the Pareto front (or Pareto set) of solutions to this problem. In our case the solutions were achieved by maximizing the amount of work done, which is the inverse of minimizing the effective computing cost metric for a given deadline. Therefore, the results obtained can be used to aid in multi-criteria resource allocation planning, where both cost and time are given as objectives rather than constraints.

5. Related Work in Resource Provisioning

Typical auto-scaling techniques such as Amazon Auto Scaling [31] and RightScale [32] provide policy- and rule-based mechanisms for dynamic provisioning on IaaS clouds. They allow the size of a resource pool to be adjusted based on metrics related to the infrastructure and the application, by setting thresholds and limits that tune the behavior of these auto-scaling systems. However, no support for more complex metrics and deadlines is provided. Policy-based approaches for scientific workloads (e.g. [33, 34]) also enable scaling cloud resource pools or expanding the capabilities of clusters using cloud-bursting techniques. A cloud-targeted auto-scaling solution [35] considers a stream of workflows with unpredictable demands and individual deadlines per workflow. A bi-objective workflow scheduling approach for clouds [36] proposes an auction model between resource providers and users; [37] considers non-deterministic workflows where branches are determined at runtime; and more examples of algorithms are given in [38]. For deadline-constrained workflow scheduling, substantial research has been done to address bi-criteria and multi-criteria scheduling [39–41]. Similarly, strategies for deadline-constrained cost-minimization workflow scheduling were developed for utility grid systems, e.g. [42] and [43]. The budget-constrained workflow scheduling problem [44] addresses only a single objective. Other approaches [45] use metaheuristics, which we consider less efficient in dynamic scenarios.

Scheduling of resources for data-intensive applications was a subject of research in the DataGrid environment [46–49], with multiple compute sites and data replicated between multiple storage sites. The goals of this work were to minimize the time or cost of data transfers, or to optimize global system throughput. Workflows on data grids have also been studied in [50], with heuristics to minimize makespan by reducing the overheads of file transfers. Data-aware schedulers, such as Stork [51], have been developed for efficiently managing data transfers and storage space. In [52], data clustering techniques are used for distributing the datasets of workflow applications between cloud storage sites. In [53], an integrated task and data placement scheme for workflows on clouds, based on graph partitioning, is derived with the goal of minimizing data transfers. The approach of using data locality for efficient task scheduling is mainly applied to MapReduce [54], where various improvements over the default Hadoop scheduling have been proposed, e.g. theoretical ones in [55] and [56] and experimental ones in [57].

Our work is different in that none of the existing approaches look at a workflow ensemble as a whole and address the problem of maximizing the number of completed workflows with user-defined priorities under both budget and deadline constraints. Our algorithms are also distinctive in that they address both the workflow task scheduling and the resource provisioning problems, which we consider crucial for cloud environments.

6. Future Research Directions

Based on our preliminary results, we observe that there is a need for substantial research in the area of resource provisioning and capacity planning for applications in the hosted science model. Fundamental algorithmic research is needed to explore the continuum between static and dynamic algorithms: augmenting static algorithms with adaptivity so that they can adjust to a dynamic environment, and providing more foresight to dynamic algorithms, allowing them to "plan ahead" based on knowledge about individual workflow ensemble members (not just individual workflow tasks) and extending the time horizon in which they make their decisions. There is also a need to investigate algorithms in the context of heterogeneous environments that include multiple VM types and cloud providers, including private and community clouds, which will make the problem even more complex and challenging. We argue that both static and dynamic approaches can benefit from mathematical modeling and optimization techniques. The modeling can be used on multiple levels, from coarse-grained optimization of global resource allocation among heterogeneous clouds, to optimal planning within static algorithms, to optimization of control decisions in adaptive algorithms. We have already started research on modeling and cost optimization of scientific applications on multiple heterogeneous clouds [58], but this work needs to be extended to address workflows and ensembles. Our preliminary research was conducted using a simulation environment, but to make the results complete they must be extended with experiments on real cloud infrastructures. This will require the implementation of our algorithms in a workflow ensemble manager prototype.

Another key challenge will be to develop application and infrastructure models that include the various data storage options available on clouds. To the best of our knowledge, there are no simulation studies that address this type of model. Our previous experimental studies [6, 26, 28] suggest that the data demands of scientific workflows have a large impact not only on the execution time, but also on the cost of workflows in commercial clouds. Additionally, faults, such as running out of storage, often cripple the execution of data-intensive applications. The most relevant storage options to include in the model are cloud object stores, such as Amazon S3 and RackSpace Cloud Files, but also private and scientific clouds, which offer limited but free storage capabilities. The data transfer times and speeds between storage sites can also vary, and there are incentives offered by commercial providers (e.g. transfers between Amazon EC2 and S3 are free). If the data resides on a commercial cloud, then the actual cost of transferring it to free computing nodes on a private cloud may actually be higher than the cost of running the compute instances at the commercial provider; for example, 1 CPU hour on EC2 costs about the same as a 1 GB network transfer.
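A back-of-the-envelope sketch of this trade-off follows. The prices are illustrative placeholders, not real or current rates; the text's point is only that 1 CPU-hour and 1 GB of outbound transfer cost about the same.

```python
# Back-of-the-envelope sketch of the move-vs-stay trade-off above.
# Prices are illustrative placeholders, not real or current rates.
PRICE_CPU_HOUR = 0.10    # $ per commercial-cloud CPU-hour (hypothetical)
PRICE_GB_OUT = 0.10      # $ per GB transferred out (hypothetical)

def cheaper_to_compute_beside_data(dataset_gb, compute_hours):
    """True if paying for co-located compute beats shipping the dataset
    to 'free' compute nodes on a private cloud."""
    move_cost = dataset_gb * PRICE_GB_OUT        # one-time transfer cost
    stay_cost = compute_hours * PRICE_CPU_HOUR   # commercial compute cost
    return stay_cost <= move_cost

# E.g. a 200 TB dataset (like the 1000 Genomes data on S3) vs. 100,000 CPU-hours:
print(cheaper_to_compute_beside_data(dataset_gb=200_000, compute_hours=100_000))
# -> True: computing next to the data is cheaper than moving it
```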

Finally, in order to support the vision of hosted science, it will be important to investigate issues related to multitenancy, such as security and fairness. Our existing solution assumes that a single user owns all the workflows in an ensemble and prioritizes them according to his or her needs. This approach works well if the ensemble manager has only one user. In a hosted ensemble management system, however, there will be multiple users submitting workflows at the same time. This will require authentication and authorization mechanisms to ensure users' data is kept separate, and delegation [59] to enable the ensemble manager to provision resources on behalf of the user. In addition, depending on the pricing model used, the service may need to balance the workload of several users accessing the same set of provisioned resources. It may be possible to make more efficient use of resources if the cost of provisioning a resource is shared among users. For example, one user may be able to take advantage of unused partial VM-hours provisioned by other users, or the system may be able to schedule complementary jobs from different users on the same resource (e.g. an I/O-bound job with a CPU-bound job). In supporting such use cases it will be important to consider fairness [60] in deciding how to schedule jobs, so that users get what they pay for without wasting underutilized resources. In the more general case there may be a need to establish a framework for negotiating and monitoring service level agreements (SLAs) [61, 62] between the ensemble manager and users on the one hand, and between the ensemble manager and providers on the other.
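One concrete candidate for such a fairness measure is the index of [60] (Jain's index); a minimal sketch:

```python
# Jain's fairness index, the measure of [60]: for per-user allocations x_i,
# J = (sum x)^2 / (n * sum x^2); J = 1.0 when all users get equal shares,
# and approaches 1/n when a single user gets everything.
def jains_index(allocations):
    n = len(allocations)
    total = sum(allocations)
    return total * total / (n * sum(x * x for x in allocations))

print(jains_index([10, 10, 10]))   # 1.0   - perfectly fair
print(jains_index([30, 0, 0]))     # 0.333 - one user monopolizes the resource
```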


7. Conclusions

In this paper we addressed some of the important problems of scheduling and resource provisioning for scientific workflow ensembles hosted on the cloud. We argue that cloud computing is becoming a valid option for many scientific computing applications. As many commercial cloud providers are starting to offer cloud services to the scientific community, the adoption of cloud offerings by scientists is just a matter of time. However, public clouds as seen today have many limitations, and substantial research and development has to be done in areas such as security, resource and data management, networking, virtualization, and others. Also, although the unit cost of using public clouds is known to researchers a priori, the total cost of complex scientific simulations, such as the workflow ensembles discussed in this paper, is usually far from researchers' best-effort estimates. Hosted science will not be successful without the availability of integrated services and tools that support seamless resource usage and management. Here, we have taken the initial steps towards supporting hosted science.

Recognizing the fact that cloud infrastructures provide more control over the resources than traditional science infrastructures such as grids, we developed resource provisioning mechanisms that can adapt to changing application requirements. The novelty of our approach is two-fold. First, we seek not only to efficiently map tasks to available resources, but also to select the best provisioning plan. Second, we formulate the problem as the maximization of the number of prioritized workflows completed from the ensemble. This approach requires workflows to be admitted or rejected based on their estimated resource demands. We have considered both static and dynamic scheduling approaches and developed the DPDS, WA-DPDS and SPSS algorithms based on these strategies. The algorithms were evaluated on ensembles of synthetic workflows, which were generated based on real scientific applications. The results of our simulation studies indicate that the two algorithms that take into account information about the structure of the workflow and task runtime estimates (WA-DPDS and SPSS) yield better results than the simple online priority-based scheduling strategy (DPDS). We believe that the results of our study can be applied in practice to develop tools that assist researchers in planning their large-scale computational experiments. The estimates of cost, runtime, and number of workflows completed that can be obtained from both the static algorithms and from the simulation runs constitute valuable hints for planning ensembles and for assessing the cost/performance trade-offs.

Our current study suggests many areas for future work. We plan to extend our application and infrastructure model to explicitly include the various data storage options available on clouds (right now we only model data access as part of the task time and do not include data costs). Finally, we plan to investigate heterogeneous environments that include multiple VM types and cloud providers, including private and community clouds, which will make the problem even more complex and challenging.


Acknowledgments

This work was sponsored by the National Science Foundation under awards OCI-0943725 and OCI-1148515.

References

[1] R. Buyya, J. Broberg, and A. Goscinski, Cloud Computing: Principles and Paradigms (Wiley, 2011).
[2] Government Cloud Computing Programs. [Online]. Available: http://www.cloudbook.net/directories/gov-clouds/government-cloud-computing.php. [Accessed: 01-Oct-2012].
[3] K. Yelick, S. Coghlan, B. Draney, and R. S. Canon, Eds., The Magellan Report on Cloud Computing for Science, Dec. 2011.
[4] L. Badger, T. Grance, R. Patt-Corner, and J. Voas, Cloud Computing Synopsis and Recommendations, Recommendations of the National Institute of Standards and Technology, Special Publication 800-146, May 2012.
[5] NSF Report on Support for Cloud Computing (nsf12040), US National Science Foundation (NSF). [Online]. Available: https://www.nsf.gov/publications/pub_summ.jsp?ods_key=nsf12040. [Accessed: 05-Oct-2012].
[6] G. Juve, E. Deelman, K. Vahi, G. Mehta, B. Berriman, B. P. Berman, and P. Maechling, Data Sharing Options for Scientific Workflows on Amazon EC2, in SC '10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2010, pp. 1–9.
[7] A. Iosup, S. Ostermann, N. Yigitbasi, R. Prodan, T. Fahringer, and D. Epema, Performance Analysis of Cloud Computing Services for Many-Tasks Scientific Computing, IEEE Transactions on Parallel and Distributed Systems, 22 (2011) 931–945.
[8] M. Malawski, J. Meizner, M. Bubak, and P. Gepner, Component Approach to Computational Applications on Clouds, Procedia Computer Science, 4 (2011) 432–441.
[9] G. von Laszewski, G. C. Fox, A. J. Younge, A. Kulshrestha, G. G. Pike, W. Smith, J. Vockler, R. J. Figueiredo, J. Fortes, and K. Keahey, Design of the FutureGrid experiment management framework, in 2010 Gateway Computing Environments Workshop (GCE), 2010, pp. 1–10.
[10] T. Hey, S. Tansley, and K. Tolle, Eds., The Fourth Paradigm: Data-Intensive Scientific Discovery (Microsoft Research, 2009).
[11] Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIGDATA) (nsf12499). [Online]. Available: http://www.nsf.gov/pubs/2012/nsf12499/nsf12499.htm. [Accessed: 01-Oct-2012].
[12] B. Allen, J. Bresnahan, L. Childers, I. Foster, G. Kandaswamy, R. Kettimuthu, J. Kordas, M. Link, S. Martin, K. Pickett, and S. Tuecke, Software as a service for data scientists, Commun. ACM, 55 (2011) 81–88.
[13] B. Howe, G. Cole, E. Souroush, P. Koutris, A. Key, N. Khoussainova, and L. Battle, Database-as-a-service for long-tail science, in SSDBM '11: Proceedings of the 23rd International Conference on Scientific and Statistical Database Management, 2011, pp. 480–489.
[14] A. Ebert, Cloud Computing, Research and Innovation and Economics, in Research in Future Cloud Computing (2 May 2012), 2012.
[15] M. Malawski, G. Juve, E. Deelman, and J. Nabrzyski, Cost- and Deadline-Constrained Provisioning for Scientific Workflow Ensembles in IaaS Clouds, in SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2012.


[16] BaseSpace. [Online]. Available: https://basespace.illumina.com/home/index. [Accessed: 01-Oct-2012].
[17] Internet2, 16 Major Technology Companies Announce Cloud Service Partnerships, 24-Apr-2012. [Online]. Available: http://www.internet2.edu/news/pr/2012.04.24.cloud-services-partnerships.html.
[18] HP Cloud Service. [Online]. Available: http://www.internet2.edu/netplus/hp-cloud.html. [Accessed: 02-Oct-2012].
[19] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, G. B. Berriman, J. Good, A. C. Laity, J. C. Jacob, and D. S. Katz, Pegasus: A framework for mapping complex scientific workflows onto distributed systems, Scientific Programming, 13 (2005) 219–237.
[20] Helix Nebula – Home. [Online]. Available: http://www.helix-nebula.eu/. [Accessed: 26-Jul-2012].
[21] 1000 Genomes Project data available on Amazon Cloud, News Release, National Institutes of Health (NIH), 29-Mar-2012.
[22] Home – SDSC Cloud Storage. [Online]. Available: https://cloud.sdsc.edu/hp/index.php. [Accessed: 26-Jul-2012].
[23] EarthServer: Big Earth Data Analytics. [Online]. Available: http://www.earthserver.eu/.
[24] NeCTAR. [Online]. Available: http://www.nectar.org.au/. [Accessed: 26-Jul-2012].
[25] The University of Queensland Research Data Storage Infrastructure (RDSI) Home Page. [Online]. Available: http://www.rdsi.uq.edu.au/. [Accessed: 26-Jul-2012].
[26] S. Callaghan, P. Maechling, P. Small, K. Milner, G. Juve, T. Jordan, E. Deelman, G. Mehta, K. Vahi, D. Gunter, K. Beattie, and C. X. Brooks, Metrics for heterogeneous scientific workflows: A case study of an earthquake science application, IJHPCA, 25 (2011) 274–285.
[27] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, The cost of doing science on the cloud: the Montage example, in SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008, pp. 1–12.
[28] J. S. Voeckler, G. Juve, E. Deelman, M. Rynge, and B. Berriman, Experiences using cloud computing for a scientific workflow application, in Proceedings of the 2nd International Workshop on Scientific Cloud Computing, 2011, pp. 15–24.
[29] M. Parashar, Blue Gene Sniffs for Black Gold in the Cloud, HPC in the Cloud, 2011.
[30] J. M. Murphy, D. M. H. Sexton, D. N. Barnett, G. S. Jones, M. J. Webb, M. Collins, and D. A. Stainforth, Quantification of modelling uncertainties in a large ensemble of climate change simulations, Nature, 430 (2004) 768–772.
[31] Auto Scaling. [Online]. Available: http://aws.amazon.com/autoscaling/. [Accessed: 01-Oct-2012].
[32] Cloud Management for Public and Private Clouds by RightScale. [Online]. Available: http://www.rightscale.com/. [Accessed: 01-Oct-2012].
[33] H. Kim, Y. el-Khamra, I. Rodero, S. Jha, and M. Parashar, Autonomic management of application workflows on hybrid computing infrastructure, Scientific Programming, 19 (2011) 75–89.
[34] P. Marshall, K. Keahey, and T. Freeman, Elastic Site: Using Clouds to Elastically Extend Site Resources, in Cluster Computing and the Grid, IEEE International Symposium on, 2010, pp. 43–52.
[35] M. Mao and M. Humphrey, Auto-scaling to minimize cost and meet application deadlines in cloud workflows, in SC '11: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011.
[36] H. Mohammadi Fard, R. Prodan, and T. Fahringer, A Truthful Dynamic Workflow Scheduling Mechanism for Commercial Multi-Cloud Environments, IEEE Transactions on Parallel and Distributed Systems, preprint, 2012.

[37] E. Caron, F. Desprez, A. Muresan, and F. Suter, Budget Constrained Resource Allocation for Non-deterministic Workflows on an IaaS Cloud, in Algorithms and Architectures for Parallel Processing, LNCS 7439, 2012.
[38] A. Bala and D. I. Chana, A Survey of Various Workflow Scheduling Algorithms in Cloud Environment, IJCA Proceedings on 2nd National Conference on Information and Communication Technology, 4 (2011) 26–30.
[39] M. Wieczorek, A. Hoheisel, and R. Prodan, Towards a general model of the multi-criteria workflow scheduling on the grid, Future Generation Computer Systems, 25 (2009) 237–256.
[40] J. J. Dongarra, E. Jeannot, E. Saule, and Z. Shi, Bi-objective scheduling algorithms for optimizing makespan and reliability on heterogeneous systems, in SPAA '07: Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, 2007, pp. 280–288.
[41] H. M. Fard, R. Prodan, J. J. D. Barrionuevo, and T. Fahringer, A Multi-objective Approach for Workflow Scheduling in Heterogeneous Environments, in Cluster Computing and the Grid, IEEE International Symposium on, 2012, pp. 300–309.
[42] J. Yu, R. Buyya, and C. K. Tham, Cost-Based Scheduling of Scientific Workflow Application on Utility Grids, in E-SCIENCE '05: Proceedings of the First International Conference on e-Science and Grid Computing, pp. 140–147.
[43] S. Abrishami, M. Naghibzadeh, and D. Epema, Cost-driven scheduling of grid workflows using Partial Critical Paths, in 11th IEEE/ACM International Conference on Grid Computing (GRID), 2010, pp. 81–88.
[44] R. Sakellariou, H. Zhao, E. Tsiakkouri, and M. Dikaiakos, Scheduling Workflows with Budget Constraints, in Integrated Research in GRID Computing: CoreGRID Integration Workshop 2005 (Selected Papers), S. Gorlatch and M. Danelutto, Eds., Springer US, 2007, pp. 189–202.
[45] S. Pandey, L. Wu, S. M. Guru, and R. Buyya, A Particle Swarm Optimization-Based Heuristic for Scheduling Workflow Applications in Cloud Computing Environments, in Advanced Information Networking and Applications, International Conference on, 2010, pp. 400–407.
[46] S.-M. Park and J.-H. Kim, Chameleon: a resource scheduler in a data grid environment, in CCGrid 2003: Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003, pp. 258–265.
[47] K. Ranganathan and I. Foster, Decoupling computation and data scheduling in distributed data-intensive applications, in HPDC-11 2002: Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, 2002, pp. 352–358.
[48] S. Venugopal, R. Buyya, and L. Winton, A grid service broker for scheduling distributed data-oriented applications on global grids, in MGC '04: Proceedings of the 2nd Workshop on Middleware for Grid Computing, 2004, pp. 75–80.
[49] S. Kim and J. B. Weissman, A genetic algorithm based approach for scheduling decomposable data grid applications, in ICPP '04: Proceedings of the 2004 International Conference on Parallel Processing, 2004, pp. 406–413, vol. 1.
[50] E. C. Machtans, L. M. Sato, and A. Deppman, Improvement on Scheduling Dependent Tasks for Grid Applications, in Computational Science and Engineering, 2009 (CSE '09), 2009, pp. 95–102.
[51] T. Kosar and M. Balman, A new paradigm: Data-aware scheduling in grid computing, Future Generation Computer Systems, 25 (2009) 406–413.


[52] D. Yuan, Y. Yang, X. Liu, and J. Chen, A data placement strategy in scientific cloud workflows, Future Generation Computer Systems, 26 (2010) 1200–1214.
[53] U. V. Catalyurek, K. Kaya, and B. Ucar, Integrated data placement and task assignment for scientific workflows in clouds, in DIDC '11: Proceedings of the Fourth International Workshop on Data-Intensive Distributed Computing, 2011, pp. 45–54.
[54] J. Dean and S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM, 51 (2008) 107–113.
[55] J. Jin, J. Luo, A. Song, F. Dong, and R. Xiong, BAR: An Efficient Data Locality Driven Task Scheduling Algorithm for Cloud Computing, in CCGRID '11: Proceedings of the 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2011, pp. 295–304.
[56] J. Berlinska and M. Drozdowski, Scheduling divisible MapReduce computations, Journal of Parallel and Distributed Computing, 71 (2011) 450–459.
[57] T. Gunarathne, B. Zhang, T.-L. Wu, and J. Qiu, Portable Parallel Programming on Cloud and HPC: Scientific Applications of Twister4Azure, in Fourth IEEE International Conference on Utility and Cloud Computing (UCC), 2011, pp. 97–104.
[58] M. Malawski, K. Figiela, and J. Nabrzyski, Cost minimization for computational applications on hybrid cloud infrastructures, Future Generation Computer Systems, 2013 (accepted). http://dx.doi.org/10.1016/j.future.2013.01.004.
[59] I. T. Foster, C. Kesselman, G. Tsudik, and S. Tuecke, A Security Architecture for Computational Grids, in CCS '98: Proceedings of the 5th ACM Conference on Computer and Communications Security, 1998, pp. 83–92.
[60] R. Jain, D. Chiu, and W. Hawe, A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer Systems, DEC Research Report TR-301, 1984.
[61] M. D. Dikaiakos, D. Katsaros, P. Mehra, G. Pallis, and A. Vakali, Cloud Computing: Distributed Internet Computing for IT and Scientific Research, IEEE Internet Computing, 13 (2009) 10–13.
[62] V. C. Emeakaroha, M. A. S. Netto, R. N. Calheiros, I. Brandic, R. Buyya, and C. A. F. De Rose, Towards autonomic detection of SLA violations in Cloud infrastructures, Future Generation Computer Systems, 28 (2012) 1017–1029.
