Hosted Science: Managing Computational Workflows in the Cloud
TRANSCRIPT
June 19, 2013 16:37 WSPC/INSTRUCTION FILE S0129626413400045
Parallel Processing Letters
Vol. 23, No. 2 (2013) 1340004 (14 pages) © World Scientific Publishing Company
DOI: 10.1142/S0129626413400045
HOSTED SCIENCE: MANAGING COMPUTATIONAL
WORKFLOWS IN THE CLOUD
EWA DEELMAN1, GIDEON JUVE1, MACIEJ MALAWSKI2 and JAREK NABRZYSKI3
1USC Information Sciences Institute, Marina del Rey, CA, USA
2AGH University of Science and Technology, Dept. of Computer Science, Krakow, Poland
3University of Notre Dame, Center for Research Computing, Notre Dame, IN, USA
Received December 2012
Revised March 2013
Published 28 June 2013
Communicated by Jack Dongarra and Bernard Tourancheau
ABSTRACT
Scientists today are exploring the use of new tools and computing platforms to do their
science. They are using workflow management tools to describe and manage complex
applications, and are evaluating the features and performance of clouds to see if they
meet their computational needs. Although hosting today is limited to providing virtual
resources and simple services, one can imagine that in the future entire scientific
analyses will be hosted for the user. The user would specify the desired analysis, the
timeframe of the computation, and the available budget; hosted services would then
deliver the desired results within the provided constraints. This paper describes current
work on managing scientific applications on the cloud, focusing on workflow management
and related data management issues. Frequently, applications are represented not by
single workflows but rather by sets of related workflows, called workflow ensembles.
Thus, hosted services need to be able to manage entire workflow ensembles, evaluating
tradeoffs between completing as many high-value ensemble members as possible and
delivering results within a certain time and budget. This paper gives an overview of
existing hosted science issues, presents the current state of the art in resource
provisioning that can support it, and outlines future research directions in this field.
Keywords: Hosted science; resource provisioning; cloud computing; scientific workflows.
1. Introduction
In recent years, there has been a clear shift in the way computing is done within
the commercial and government sectors. Services that were traditionally hosted
within local IT infrastructures are being migrated to clouds built by
companies such as Amazon, Google, and Microsoft [1]. US government agencies,
including NSF and NIST, are exploring the use of cloud technologies to support
both governmental functions and scientific computing [2–5]. Today, clouds provide
many services we use on a daily basis: email, social networking, word processing,
Parallel Process. Lett. 2013.23. Downloaded from www.worldscientific.com by KANSAS STATE UNIVERSITY on 07/16/14. For personal use only.
entertainment, etc. At the same time, Infrastructure as a Service is being evaluated
and used in the context of scientific applications [6–9].
As the computing infrastructure is changing, so are the scientific applications.
Many scientific disciplines are entering the “Big Data” era [10, 11], where data, on
the order of peta- or even exa-bytes, will be distributed in the environment and pro-
cessed by complex computational pipelines. Moving towards cloud-hosted science
services can help democratize research and accelerate discovery by bridging the gap
between smaller labs (long tail science) and large-scale collaborations, creating a
unified research community [12–14].
We envision that all scientific data will be stored "in the cloud", i.e. in hosted
repositories or databases, and be accessible over the Internet. This data is often too
big to make it practical to transfer to the local machine for processing. Instead, it
is possible to deploy processing services on demand, either in the form of queries
or scripts, or in the form of more complex scientific workflows operating on the
data in situ. In the area of computing, Infrastructure as a Service (IaaS) clouds can
provide virtual machines and computing clusters that can be co-located with the
data repositories and have configurable resource capacities, such as the number of
processing cores, memory, and temporary storage for data processing.
This new view of computational science implies radical changes to the manage-
ment of resources for scientific applications. The traditional approach of procuring
computing hardware and submitting jobs to local compute clusters will be replaced
by a process of provisioning resources and launching an entire analysis at the click
of a button, where the amount of research to be done is only limited by the time
and budget available to the scientists.
However, to transform the state of the art in scientific computing, and lower the
barrier to complex computation, there needs to be a bridge between the demands of
the applications and the infrastructure available to them. The bridge today is built
with rather low-level middleware and data management services. As we move to
hosted environments, more hardware-level concerns will be migrated to large-scale
data centers, freeing up the traditional campus IT infrastructures to provide more
sophisticated services that deliver the power of data centers to their scientific users.
Thus, a new set of online hosted services that can store, access, process, share, and
collaborate on data at a Web-scale can revolutionize scientific research in the same
way that cloud computing is changing the commercial IT landscape.
A critical part of providing services based on hosted environments is to manage
the computing resources in terms of provisioning, scheduling, utilization, perfor-
mance, and cost. Optimal decisions in this scenario are complex in nature, and one
must consider the multiple cloud providers and types of clouds (public, private,
community, and hybrid) with different costs, performance, security, reliability and
availability, as well as their locality with respect to data to be processed. We argue
that hosted science will need a set of methods and tools for assisting users and
service providers in the resource management decisions and capacity planning to
support research.
In this paper, we take a step in the direction of hosted science by examin-
ing the issues of managing a class of important scientific applications—scientific
workflows—on cloud resources. In particular we explore the issues related to the
management of sets of scientific workflows—workflow ensembles—that represent
an entire scientific analysis. Through simulation, we explore resource provisioning
and workflow scheduling algorithms for this class of applications running on IaaS
clouds. This paper extends our prior work [15].
2. Hosted Science Ecosystem
We see growing interest in the use of cloud computing for science. The compu-
tational science ecosystem combines various services, including traditional instru-
ments and clusters, as well as new cloud services, from infrastructure and storage
to higher-level database and analysis services. In bioinformatics, companies such as
Illumina [16] sell instruments that can store data in the cloud as part of a sam-
ple analysis pipeline, and then offer cloud-based services to share and analyze that
data. The cost model adopted by these providers today is not clear: some limited
computation may be done at the expense of the provider, but if the costs are too
large, they will be passed on to the user, depending on the application and on the
set of resources used. Companies such as Illumina need to be able to estimate what
resources the scientists will need and what those resources will cost. If the users are
to pay for the resources directly, then they also need such capabilities.
We can also observe many examples of partnerships between research organi-
zations and cloud service providers. Internet2 has announced partnerships with
multiple cloud providers to offer infrastructure, data and software services for re-
search and education applications [17]. Among these services are Box.net cloud
storage, access to the Microsoft Azure platform with waived data transfer charges,
simplified access to the Amazon EC2 cloud via CENIC Corporation, and access to
an infrastructure as a service cloud platform provided by HP and SHI International
[18]. The Center for Research Computing of the University of Notre Dame (CRC)
was selected as one of the pilot users of the HP/SHI cloud. We have managed to
successfully extend the capacity of the CRC compute cluster with nodes deployed
as virtual machines in the SHI cloud datacenter, and to run scalability benchmarks
using scientific workflow applications managed by the Pegasus Workflow Manage-
ment System [19]. In Europe, HelixNebula [20] is a project funded by the European
Commission that includes CERN, EMBL and ESA. The goal of this project is
to deploy large-scale scientific applications from HEP, genome analysis and radar
satellite data processing on cloud services from multiple providers (CloudSigma,
Atos, T-Systems). This proof-of-concept deployment is a step towards larger scale
production deployments of scientific applications on clouds. The 1000 Genomes
project [21] is hosting a 200TB dataset of human genomes on Amazon S3. It has
obviously become more practical for scientists to process such large datasets in the
cloud rather than to download them to local computing resources.
There are also new high-level hosted services developed specifically for research
communities. SQLShare [13] is an example of a database as a service for science. It
enables creating, managing, accessing and publishing SQL databases online, using
a Web-based portal or Python, REST and R APIs. SDSC Cloud Storage Services
[22] provide object storage services for researchers based on a pay-per-use model,
but in contrast to Amazon S3, there is no bandwidth cost. Globus Online [12] is
a service for managing scientific data transfers between multiple endpoints. It is
a cloud-hosted service that aims to support transfer and synchronization of large
datasets by providing a portal, command-line and API interfaces that support func-
tionality such as security, multiple protocols and reliability using transfer restarts.
EarthServer [23] is a European project that creates a service for open access and
analytics on Earth Science data. National eResearch Collaboration Tools and Resources
(NeCTAR) [24] is an Australian program that provides a research cloud and tools
for the scientific community. It is coupled with the Research Data Storage Infras-
tructure (RDSI) [25] that is mandated by the Australian research funding agency
as the repository for all publicly funded research results. These projects are only
point solutions that need to be integrated, significantly enhanced, and augmented
with new capabilities to provide science as a service. They all need fundamental re-
source management and capacity planning solutions in order to deliver cost-effective
services to their scientific user community.
3. Workflow Ensembles
Workflow ensembles represent an entire scientific analysis as a set of inter-related,
but independent, workflows. There are many applications that require scientific
workflow ensembles. For example, CyberShake [26], an earthquake science applica-
tion, uses ensembles to generate seismic hazard maps. Each workflow in a Cyber-
Shake ensemble generates a hazard curve for a particular geographic location, and
several hazard curves are interpolated to create a hazard map. In 2009 CyberShake
was used to generate a hazard map of Southern California that required an ensem-
ble of 239 workflows. Similarly, users of Montage [27], an astronomy application,
often need several workflows with different parameters to generate a set of image
mosaics that can be combined into a single, large mosaic. The Galactic Plane ensem-
ble, which generates several mosaics of the galactic plane in different wavelengths,
consists of 17 workflows, each of which contains 900 sub-workflows. Another ex-
ample is the Periodograms application [28], which searches for extra-solar planets
by detecting periodic dips in the light intensity that occur when they transit their
host star. This application uses workflow ensembles to process the input dataset
using different parameters. A recent analysis of Kepler satellite data required three
ensembles of 15 workflows. Another example of large scale ensemble computation
is described in [29], where ensembles of oil reservoir simulations are run on an IBM
Blue Gene/P platform. The supercomputer is divided into partitions, which can be
dynamically provisioned as a service in a similar way as compute cloud resources.
Ensembles of simulations are also very important in climate change studies [30],
since climate prediction models are sensitive to parameter changes. To quantify
these uncertainties, a large number of ensembles, or even “super-ensembles”, needs
to be simulated.
4. Resource Provisioning for Ensembles
In our prior work we investigated the resource provisioning problems that arise in
the hosted science model by developing algorithms to manage ensembles of work-
flows on a simple IaaS cloud infrastructure [15]. The goal was to maximize the
number of user-prioritized workflows that could be completed given budget and
deadline constraints. We developed three algorithms to solve this problem: two
dynamic algorithms, DPDS (Dynamic Provisioning Dynamic Scheduling) and WA-
DPDS (Workflow Aware DPDS), and one static algorithm, SPSS (Static Provision-
ing Static Scheduling).
The DPDS algorithm is an online algorithm that adds all the ready tasks of all
workflows into a single priority queue, where the priority of a task is set to the
corresponding workflow's priority. It provisions the number of VMs that will consume
the entire budget just before the deadline. When a VM becomes available, it assigns
the first task from the queue to that VM. This approach achieves high utilization
of VMs by backfilling idle VMs with tasks of lower priority workflows. However,
the algorithm does not prevent starting workflows that cannot finish within the
constraints and may delay the execution of higher-priority workflows.
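The DPDS dispatch step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the task and VM representations, the field names, and the `vm_count` helper are all assumptions, with lower priority numbers treated as more urgent.

```python
import heapq

def vm_count(budget, deadline_hours, price_per_vm_hour):
    # Provision as many VMs as will consume the whole budget
    # just before the deadline, as the text describes.
    return int(budget // (deadline_hours * price_per_vm_hour))

def dpds_schedule(ready_tasks, free_vms):
    # All ready tasks from all workflows share one priority queue,
    # keyed by the owning workflow's priority (lower = more urgent).
    queue = [(task["priority"], i, task) for i, task in enumerate(ready_tasks)]
    heapq.heapify(queue)
    assignments = []
    for vm in free_vms:
        # Backfill each idle VM with the highest-priority ready task.
        if not queue:
            break
        _, _, task = heapq.heappop(queue)
        assignments.append((vm, task["name"]))
    return assignments
```

For example, with a $600 budget, a 24-hour deadline, and $1/VM-hour, `vm_count` yields 25 VMs, and tasks are handed to free VMs in workflow-priority order.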
The WA-DPDS algorithm addresses this issue by adding an admission test be-
fore starting the first task of a new workflow from the ensemble. The admission
test estimates whether it is possible to finish the new workflow, by comparing its
estimated cost to the estimated remaining budget. If the workflow fits within the
constraints, it is accepted and the task is scheduled, otherwise it is rejected and the
entire workflow is removed from the queue. This simple test prevents workflows that
can never finish from starting so that they do not consume resources that could be
used by higher-priority workflows. It is possible to add other admission tests, such
as a comparison of the critical path with the deadline constraint.
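The budget-based admission test can be expressed compactly. The sketch below is a hedged approximation of the WA-DPDS check described in the text; estimating a workflow's cost as the sum of its task runtime estimates times the VM price is an assumption, as are the field and parameter names.

```python
def admit_workflow(workflow_tasks, price_per_vm_hour, budget_spent, budget):
    # Estimate the cost of the whole workflow from its task runtime
    # estimates, and admit it only if that cost fits within the
    # estimated remaining budget.
    est_cpu_hours = sum(t["runtime_hours"] for t in workflow_tasks)
    est_cost = est_cpu_hours * price_per_vm_hour
    remaining = budget - budget_spent
    return est_cost <= remaining
```

A rejected workflow is removed from the queue entirely, so its tasks never consume resources that higher-priority workflows could use.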
The SPSS algorithm is an offline (static) algorithm that prepares a provisioning
and scheduling plan ahead of execution using estimates of task runtimes. It is based
on a heuristic that adds new workflows to the plan in a way that will minimize the
cost of each workflow while maximizing the number of workflows completed. The
algorithm considers the workflows in priority order, distributes the sub-deadlines to
tasks based on workflow levels and finds a cost-optimal schedule for the workflow.
If the schedule fits into the budget, it is admitted, otherwise it is rejected. At
runtime, the tasks are executed according to the plan, but in the case of unpredicted
behavior of the application or environment, such as deviations from runtime estimates
or provisioning delays of VMs, the algorithm cannot adapt to these changes and the
constraints may be exceeded.
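The outer loop of SPSS can be sketched as below. This is a simplified illustration only: the level-based sub-deadline distribution and the cost-optimal per-workflow scheduling described in the text are abstracted into two crude stand-ins (a critical-path deadline check and a serial cost estimate), and all field names are assumptions.

```python
def spss_plan(ensemble, budget, deadline_hours, price_per_vm_hour):
    plan, planned_cost = [], 0.0
    # Consider workflows in priority order (lower number = higher priority).
    for wf in sorted(ensemble, key=lambda w: w["priority"]):
        wf_cost = sum(wf["task_hours"]) * price_per_vm_hour
        # Stand-ins for SPSS's cost-optimal schedule: reject a workflow if
        # its critical path cannot meet the deadline or its planned cost
        # would exceed the budget; otherwise admit it into the static plan.
        if (wf["critical_path_hours"] <= deadline_hours
                and planned_cost + wf_cost <= budget):
            plan.append(wf["name"])
            planned_cost += wf_cost
    return plan, planned_cost
```

Because the whole plan is fixed before execution, a runtime deviation from these estimates can invalidate it, which is the weakness the text notes.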
Fig. 1. Example performance of algorithms expressed as the sum of workflow sizes (left: total runtime in hours vs. deadline in hours) and effective cost (right: computing cost in $/h vs. deadline in hours), with curves for SPSS, WA-DPDS, and DPDS; data for the LIGO application, Pareto Sorted ensemble of 100 workflows, budget = $600.
For the purposes of evaluation we have developed a simulator that models the
cloud infrastructure, and a workflow engine with tightly-coupled scheduler and pro-
visioner modules. We used ensembles of synthetic workflows that were generated
based on statistics from real scientific applications. The results of our simulations
indicate that the two algorithms that take into account the structure of the work-
flow and task runtime estimates (WA-DPDS and SPSS) yield better results than
the simple priority-based scheduling (DPDS). This underscores the importance of
viewing workflow ensembles as a whole rather than as individual tasks or workflows.
In cases where there are no provisioning delays, where the task runtime estimates
were good, and where failures were rare, we found that SPSS performed signifi-
cantly better than both dynamic algorithms. However, when conditions were less
than perfect, the static plans produced by SPSS were disrupted and frequently
exceeded the budget and deadline constraints.
To illustrate the behavior of the algorithms we show examples of performance
curves obtained for selected parameters out of 5 applications, 3 algorithms, 5 en-
semble size distributions and 10 random seeds that we analyzed. Figure 1 (left)
shows the total computing time Tc, which is defined as a sum of runtimes of all
tasks of fully completed workflows from the ensemble. We show data obtained for
a LIGO (gravitational-wave physics) application with Pareto Sorted distribution of
workflows (sorted ensembles are sorted by size, so that the largest workflows have
the highest priority—in this case the sizes are drawn from a Pareto distribution).
The figure shows that the SPSS algorithm gives the best performance and that
WA-DPDS performs better than the workflow-unaware version, DPDS. This means
that, for the workflows we tested, the simple admission procedure based on estimated
workflow cost does not degrade the solution; rather, in many cases it rejects large
workflows that would lead to a budget overrun, saving resources that allow smaller
workflows to complete.
This plot allows us to observe a more general property: for a given budget, when the
deadline is tightened, the amount of work done decreases steeply. This is partially
due to the fact that fewer workflows can be completed given a short deadline because
their critical paths are simply too long and they are squeezed out of early start
times by high priority workflows. In addition, assigning a shorter deadline means
that more VMs need to be provisioned to complete the work. However, because of
the dependencies in the workflows, many of these VMs will either be idle, or will
be backfilled with tasks from low-priority workflows that never finish, resulting,
effectively, in reduced resource utilization and fewer completed workflows.
These observations are further confirmed when examining the effective comput-
ing cost, as shown in Figure 1 (right). The computing cost in [$/h] is the total cost
divided by the total computing time Tc, where for simplicity we assume that 1 hour
of VM time costs $1. This means that the effective computing cost is the inverse of
resource utilization, since it also includes the cost of idle resources. It can be seen
that lower resource utilization and lower parallel efficiency for short deadlines lead
to substantial increases in the effective computing cost. This increase can be sig-
nificant. For example, for short deadlines the effective computing cost is ∼10–40%
higher than the minimum effective cost of $1/hour, depending on the algorithm
used. This means that, out of the $600 budget, $60–$240 is spent on idle resources.
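The effective-cost arithmetic above can be checked directly. The figures below use the paper's $1/VM-hour assumption together with an illustrative value of Tc; they are a worked example, not data from the experiments.

```python
def effective_cost_per_hour(total_spent, total_compute_hours):
    # Effective computing cost in $/h as defined in the text: total dollars
    # spent divided by Tc, the sum of runtimes of all tasks of fully
    # completed workflows. At $1 per VM-hour, any excess over $1/h
    # measures money spent on idle VM time.
    return total_spent / total_compute_hours

spent, tc = 600.0, 480.0                      # illustrative Tc of 480 hours
rate = effective_cost_per_hour(spent, tc)     # 1.25 $/h, i.e. 25% over the minimum
idle_dollars = spent - tc * 1.0               # 120.0 dollars spent on idle resources
```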
Another general conclusion can be drawn from the shape of Figure 1 (right): the
curves represent a trade-off between two conflicting objectives, time and cost. We
observe that the selection of deadline and cost can be formulated as a multi-criteria
optimization problem, and that the curves presented here are approximations of the
Pareto front (or Pareto set) of solutions to this problem. In our case the solutions
were obtained by maximizing the amount of work done, which for a fixed budget is
equivalent to minimizing the effective computing cost for a given deadline. Therefore, the
results obtained can be used to aid in multi-criteria resource allocation planning,
where both cost and time are given as objectives rather than constraints.
5. Related Work in Resource Provisioning
Typical auto-scaling techniques such as Amazon Auto Scaling [31] and RightScale
[32] provide policy and rule-based mechanisms for dynamic provisioning on IaaS
clouds. They allow the size of a resource pool to be adjusted based on metrics related
to infrastructure and application by setting thresholds and limits to tune the be-
havior of these auto scaling systems. However, no support for more complex metrics
and deadlines is provided. Policy-based approaches for scientific workloads (e.g. [33,
34]) also enable scaling cloud resource pools or expanding the capabilities of clusters
using cloud-bursting techniques. A cloud-targeted auto scaling solution [35] consid-
ers a stream of workflows with unpredictable demands and individual deadlines per
workflow. A bi-objective workflow scheduling approach for clouds [36] proposes an
auction model between resource providers and users; [37] considers non-deterministic
workflows whose branches are determined at runtime; and more examples of algorithms
are given in [38]. For deadline-constrained workflow scheduling, substantial research
has been done to address bi-criteria and multi-criteria scheduling [39–41]. Similarly,
strategies for deadline-constrained cost-minimization workflow scheduling were de-
veloped for utility grid systems, e.g. [42] and [43]. The budget-constrained workflow
scheduling problem [44] addresses only a single objective. Other approaches [45] use
metaheuristics, which we consider less efficient in dynamic scenarios.
Scheduling of resources for data-intensive applications was a subject of research
in the DataGrid environment [46–49] with multiple compute sites and data repli-
cated between multiple storage sites. The goals of this work were to minimize time
or cost of data transfers, or to optimize global system throughput. Workflows on
data grids have also been studied in [50], with heuristics to minimize makespan
by reducing the overheads of file transfers. Data-aware schedulers, such as Stork
[51] have been developed for efficiently managing data transfers and storage space.
In [52] data clustering techniques are used for distributing datasets of workflow
applications between cloud storage sites. In [53] an integrated task and data place-
ment for workflows on clouds based on graph partitioning is derived, with the goal
of minimizing data transfers. The approach of using data locality for efficient task
scheduling is mainly applied to MapReduce [54], where various improvements over
default Hadoop scheduling have been proposed, e.g. theoretical ones in [55] and [56]
and experimental ones in [57].
Our work is different in that none of the existing approaches look at a work-
flow ensemble as a whole and address the problem of maximizing the number of
completed workflows with user-defined priorities under both budget and deadline
constraints. Our algorithms are also distinctive in that they address both the work-
flow task scheduling and the resource provisioning problems, which we consider
crucial for cloud environments.
6. Future Research Directions
Based on our preliminary results, we observe that there is a need for substantial
research in the area of resource provisioning and capacity planning for applica-
tions in the hosted science model. Fundamental algorithmic research is needed to
explore the continuum of static and dynamic algorithms, augmenting static algo-
rithms with adaptivity so that they can adjust to a dynamic environment, and
providing more foresight to dynamic algorithms, allowing them to “plan ahead”
based on the knowledge about individual workflow ensemble members (not just in-
dividual workflow tasks) and extending the time horizon in which they make their
decisions. There is also a need to investigate algorithms in the context of heteroge-
neous environments that include multiple VM types and cloud providers, including
private and community clouds, which will make the problem even more complex
and challenging. We argue that both static and dynamic approaches can benefit
from mathematical modeling and optimization techniques. The modeling can be
used on multiple levels, from coarse grained optimization of global resource alloca-
tion among heterogeneous clouds, to optimal planning within static algorithms, to
optimization of control decisions in adaptive algorithms. We have already started
research on modeling and cost optimization of scientific applications on multiple
heterogeneous clouds [58], but this work needs to be extended to address workflows
and ensembles. Our preliminary research was conducted using a simulation envi-
ronment, but to make the results complete they must be extended to experiments
on real cloud infrastructures. This will require implementation of our algorithms in
a workflow ensemble manager prototype.
Another key challenge will be to develop application and infrastructure mod-
els that include the various data storage options available on clouds. To the best of
our knowledge, there are no simulation studies that address this type of model. Our
previous experimental studies [6, 26, 28] suggest that the data demands of scientific
workflows have a large impact not only on the execution time, but also on the cost
of workflows in commercial clouds. Additionally, faults, such as running out of stor-
age, often cripple the execution of data-intensive applications. The most relevant
storage options to include in the model are cloud object stores, such as Amazon S3
and RackSpace cloud files, but also private and scientific clouds, which offer limited
but free storage capabilities. The data transfer times and speeds between storage
sites can also vary, and there are incentives offered by commercial providers (e.g.
transfers between Amazon EC2 and S3 are free). If the data resides on a commercial
cloud, then the cost of transferring it to free computing nodes on a private
cloud may actually be higher than the cost of running the compute instances on the
commercial provider. For example, one CPU hour on EC2 costs the same as a 1 GB
network transfer.
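The break-even reasoning above can be made concrete. The sketch below compares the egress cost of moving data off the commercial cloud against simply paying for compute next to the data; the function name and the unit prices in the example are illustrative placeholders, not current provider rates.

```python
def compute_beside_data_is_cheaper(data_gb, cpu_hours,
                                   transfer_cost_per_gb, compute_cost_per_hour):
    # True if paying for compute on the commercial cloud (next to the data)
    # costs less than transferring the data out to "free" private-cloud nodes.
    egress_cost = data_gb * transfer_cost_per_gb
    compute_cost = cpu_hours * compute_cost_per_hour
    return compute_cost < egress_cost
```

With equal per-GB and per-CPU-hour prices, as the text assumes, a job that needs 50 CPU hours over 100 GB of commercially hosted data is cheaper to run in place, whereas the same job over only 10 GB is cheaper to relocate.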
Finally, in order to support the vision of hosted science it will be important
to investigate issues related to multitenancy, such as security and fairness. Our
existing solution assumes that a single user owns all the workflows in an ensemble
and prioritizes them according to his or her needs. This approach works well if the
ensemble manager has only one user. In a hosted ensemble management system,
however, there will be multiple users submitting workflows at the same time. This
will require authentication and authorization mechanisms to ensure users' data is
kept separate, and delegation [59] to enable the ensemble manager to provision
resources on behalf of the user. In addition, depending on the pricing model used,
the service may need to balance the workload of several users accessing the same set
of provisioned resources. It may be possible to make more efficient use of resources if
the cost of provisioning a resource is shared among users. For example, one user may
be able to take advantage of unused partial VM-hours provisioned by other users,
or the system may be able to schedule complementary jobs from different users on
the same resource (e.g. an I/O-bound job with a CPU-bound job). In supporting
such use cases it will be important to consider fairness [60] in deciding how to
schedule jobs, so that users get what they pay for without wasting underutilized
resources. In a more general case there may be a need to establish a framework
for negotiating and monitoring service level agreements (SLAs) [61, 62] between
the ensemble manager and users on the one hand, and between the ensemble
manager and providers on the other.
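The fairness measure cited above [60] is Jain's index, J(x) = (Σxᵢ)² / (n·Σxᵢ²), which is 1.0 when all n users receive equal shares and falls toward 1/n as allocation concentrates on a single user. A minimal sketch:

```python
# Jain's fairness index [60] over a list of per-user resource allocations.
# Equal shares give 1.0; a single user getting everything gives 1/n.

def jain_fairness(allocations):
    """Compute Jain's index: (sum x)^2 / (n * sum x^2)."""
    n = len(allocations)
    s = sum(allocations)
    sq = sum(x * x for x in allocations)
    return (s * s) / (n * sq)

print(jain_fairness([1, 1, 1, 1]))  # 1.0 (perfectly fair)
print(jain_fairness([1, 0, 0, 0]))  # 0.25 (one of four users gets everything)
```

A hosted ensemble manager could track such an index per billing period to check that shared provisioning (e.g., reusing partial VM-hours) does not systematically favor some users.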
Parallel Process. Lett. 2013.23. Downloaded from www.worldscientific.com by KANSAS STATE UNIVERSITY on 07/16/14. For personal use only.
7. Conclusions
In this paper we addressed some of the important problems of scheduling and
resource provisioning for scientific workflow ensembles hosted on the cloud. We
argue that cloud computing is becoming a valid option for many scientific
computing applications. As many commercial cloud providers are starting to
offer cloud services to the scientific community, the adoption of cloud
offerings by scientists is just a matter of time. However, public clouds as
seen today have many limitations, and substantial research and development
remains to be done in areas such as security, resource and data management,
networking, and virtualization. Also, although the unit cost of using public
clouds is known a priori, the total cost of complex scientific simulations,
such as the workflow ensembles discussed in this paper, usually departs widely
from researchers' best-effort estimates. Hosted science will not be successful
without integrated services and tools that support seamless resource usage and
management. Here, we have taken the initial steps towards supporting hosted
science.
Recognizing the fact that cloud infrastructures provide more control over
resources than traditional science infrastructures such as grids, we developed
resource provisioning mechanisms that can adapt to changing application
requirements. The novelty of our approach is two-fold. First, we seek not only
to efficiently map tasks to available resources, but also to select the best
provisioning plan. Second, we formulate the problem as a maximization of the
number of prioritized workflows completed from the ensemble. This approach
requires workflows to be admitted or rejected based on their estimated resource
demands. We have considered both static and dynamic scheduling approaches and
developed the DPDS, WA-DPDS and SPSS algorithms based on these strategies. The
algorithms were evaluated on ensembles of synthetic workflows generated from
real scientific applications. The results of our simulation studies indicate
that the two algorithms that take into account information about the structure
of the workflow and task runtime estimates (WA-DPDS and SPSS) yield better
results than the simple on-line priority-based scheduling strategy (DPDS). We
believe that the results of our study can be applied in practice to develop
tools that assist researchers in planning their large-scale computational
experiments. The estimates of cost, runtime, and number of workflows completed
that can be obtained from the static algorithms and from the simulation runs
constitute valuable hints for planning ensembles and for assessing
cost/performance trade-offs.
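The admit-or-reject idea can be illustrated with a deliberately simplified sketch: workflows are considered in priority order and each is admitted only if its estimated cost still fits the remaining budget. This is an illustration of the admission principle only, not the actual DPDS, WA-DPDS, or SPSS algorithms, whose names and workflow fields here are assumptions for the example:

```python
# Simplified priority-based admission for a workflow ensemble.
# Each workflow is a (priority, estimated_cost) pair; higher priority first.
# This sketches the admit/reject decision only, not the paper's algorithms.

def admit_ensemble(workflows, budget):
    """Greedily admit workflows in descending priority order while their
    estimated cost fits within the remaining budget."""
    admitted, spent = [], 0.0
    for wf in sorted(workflows, key=lambda w: -w[0]):
        _prio, cost = wf
        if spent + cost <= budget:
            admitted.append(wf)
            spent += cost
    return admitted

ensemble = [(3, 40.0), (2, 30.0), (1, 50.0)]
print(admit_ensemble(ensemble, 80.0))  # [(3, 40.0), (2, 30.0)]
```

Note that this greedy variant keeps scanning lower-priority workflows after a rejection; a real scheduler would also weigh runtime estimates against the deadline, which is where the structure-aware WA-DPDS and SPSS algorithms gain their advantage.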
Our current study suggests many areas for future work. We plan to extend our
application and infrastructure model to explicitly include the various data
storage options available on clouds (currently we model data access only as
part of the task runtime and do not include data costs). Finally, we plan to
investigate heterogeneous environments that include multiple VM types and
cloud providers, including private and community clouds, which will make the
problem even more complex and challenging.
Acknowledgments
This work was sponsored by the National Science Foundation under awards OCI-
0943725 and OCI-1148515.
References
[1] R. Buyya, J. Broberg, and A. Goscinski, Cloud Computing: Principles and Paradigms (Wiley, 2011).
[2] Government Cloud Computing Programs. [Online]. Available: http://www.cloudbook.net/directories/gov-clouds/government-cloud-computing.php. [Accessed: 01-Oct-2012].
[3] K. Yelick, S. Coghlan, B. Draney, and R. S. Canon, Eds., The Magellan Report on Cloud Computing for Science, Dec. 2011.
[4] L. Badger, T. Grance, R. Patt-Corner, and J. Voas, Cloud Computing Synopsis and Recommendations, Recommendations of the National Institute of Standards and Technology, Special Publication 800-146, May 2012.
[5] NSF Report on Support for Cloud Computing (nsf12040) – US National Science Foundation (NSF). [Online]. Available: https://www.nsf.gov/publications/pub_summ.jsp?ods_key=nsf12040. [Accessed: 05-Oct-2012].
[6] G. Juve, E. Deelman, K. Vahi, G. Mehta, B. Berriman, B. P. Berman, and P. Maechling, Data Sharing Options for Scientific Workflows on Amazon EC2, in SC '10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2010, pp. 1–9.
[7] A. Iosup, S. Ostermann, N. Yigitbasi, R. Prodan, T. Fahringer, and D. Epema, Performance Analysis of Cloud Computing Services for Many-Tasks Scientific Computing, IEEE Transactions on Parallel and Distributed Systems, 22 (2011) 931–945.
[8] M. Malawski, J. Meizner, M. Bubak, and P. Gepner, Component Approach to Computational Applications on Clouds, Procedia Computer Science, 4 (2011) 432–441.
[9] G. von Laszewski, G. C. Fox, A. J. Younge, A. Kulshrestha, G. G. Pike, W. Smith, J. Vockler, R. J. Figueiredo, J. Fortes, and K. Keahey, Design of the FutureGrid experiment management framework, in 2010 Gateway Computing Environments Workshop (GCE), 2010, pp. 1–10.
[10] T. Hey, S. Tansley, and K. Tolle, Eds., The Fourth Paradigm: Data-Intensive Scientific Discovery (Microsoft Research, 2009).
[11] Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIGDATA) (nsf12499). [Online]. Available: http://www.nsf.gov/pubs/2012/nsf12499/nsf12499.htm. [Accessed: 01-Oct-2012].
[12] B. Allen, J. Bresnahan, L. Childers, I. Foster, G. Kandaswamy, R. Kettimuthu, J. Kordas, M. Link, S. Martin, K. Pickett, and S. Tuecke, Software as a service for data scientists, Commun. ACM, 55 (2011) 81–88.
[13] B. Howe, G. Cole, E. Souroush, P. Koutris, A. Key, N. Khoussainova, and L. Battle, Database-as-a-service for long-tail science, in SSDBM '11: Proceedings of the 23rd International Conference on Scientific and Statistical Database Management, 2011, pp. 480–489.
[14] A. Ebert, Cloud Computing, Research and Innovation and Economics, in Research in Future Cloud Computing (2 May 2012), 2012.
[15] M. Malawski, G. Juve, E. Deelman, and J. Nabrzyski, Cost- and Deadline-Constrained Provisioning for Scientific Workflow Ensembles in IaaS Clouds, in SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2012.
[16] BaseSpace. [Online]. Available: https://basespace.illumina.com/home/index. [Accessed: 01-Oct-2012].
[17] Internet2, 16 Major Technology Companies Announce Cloud Service Partnerships, 24-Apr-2012. [Online]. Available: http://www.internet2.edu/news/pr/2012.04.24.cloud-services-partnerships.html.
[18] HP Cloud Service. [Online]. Available: http://www.internet2.edu/netplus/hp-cloud.html. [Accessed: 02-Oct-2012].
[19] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, G. B. Berriman, J. Good, A. C. Laity, J. C. Jacob, and D. S. Katz, Pegasus: A framework for mapping complex scientific workflows onto distributed systems, Scientific Programming, 13 (2005) 219–237.
[20] Helix Nebula – Home. [Online]. Available: http://www.helix-nebula.eu/. [Accessed: 26-Jul-2012].
[21] 1000 Genomes Project data available on Amazon Cloud, News Release – National Institutes of Health (NIH), 29-Mar-2012.
[22] SDSC Cloud Storage – Home. [Online]. Available: https://cloud.sdsc.edu/hp/index.php. [Accessed: 26-Jul-2012].
[23] EarthServer: Big Earth Data Analytics. [Online]. Available: http://www.earthserver.eu/.
[24] NeCTAR. [Online]. Available: http://www.nectar.org.au/. [Accessed: 26-Jul-2012].
[25] The University of Queensland Research Data Storage Infrastructure (RDSI) Home Page. [Online]. Available: http://www.rdsi.uq.edu.au/. [Accessed: 26-Jul-2012].
[26] S. Callaghan, P. Maechling, P. Small, K. Milner, G. Juve, T. Jordan, E. Deelman, G. Mehta, K. Vahi, D. Gunter, K. Beattie, and C. X. Brooks, Metrics for heterogeneous scientific workflows: A case study of an earthquake science application, IJHPCA, 25 (2011) 274–285.
[27] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, The cost of doing science on the cloud: the Montage example, in SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008, pp. 1–12.
[28] J. S. Voeckler, G. Juve, E. Deelman, M. Rynge, and B. Berriman, Experiences using cloud computing for a scientific workflow application, in Proceedings of the 2nd International Workshop on Scientific Cloud Computing, 2011, pp. 15–24.
[29] M. Parashar, Blue Gene Sniffs for Black Gold in the Cloud, HPC in the Cloud, 2011.
[30] J. M. Murphy, D. M. H. Sexton, D. N. Barnett, G. S. Jones, M. J. Webb, M. Collins, and D. A. Stainforth, Quantification of modelling uncertainties in a large ensemble of climate change simulations, Nature, 430 (2004) 768–772.
[31] Auto Scaling. [Online]. Available: http://aws.amazon.com/autoscaling/. [Accessed: 01-Oct-2012].
[32] Cloud Management for Public and Private Clouds by RightScale. [Online]. Available: http://www.rightscale.com/. [Accessed: 01-Oct-2012].
[33] H. Kim, Y. el-Khamra, I. Rodero, S. Jha, and M. Parashar, Autonomic management of application workflows on hybrid computing infrastructure, Scientific Programming, 19 (2011) 75–89.
[34] P. Marshall, K. Keahey, and T. Freeman, Elastic Site: Using Clouds to Elastically Extend Site Resources, in Cluster Computing and the Grid, IEEE International Symposium on, 2010, pp. 43–52.
[35] M. Mao and M. Humphrey, Auto-scaling to minimize cost and meet application deadlines in cloud workflows, in SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011.
[36] H. Mohammadi Fard, R. Prodan, and T. Fahringer, A Truthful Dynamic Workflow
Scheduling Mechanism for Commercial Multi-Cloud Environments, IEEE Transactions on Parallel and Distributed Systems, preprint, 2012.
[37] E. Caron, F. Desprez, A. Muresan, and F. Suter, Budget Constrained Resource Allocation for Non-deterministic Workflows on an IaaS Cloud, in Algorithms and Architectures for Parallel Processing, 7439, 2012.
[38] A. Bala and D. I. Chana, A Survey of Various Workflow Scheduling Algorithms in Cloud Environment, IJCA Proceedings on 2nd National Conference on Information and Communication Technology, 4 (2011) 26–30.
[39] M. Wieczorek, A. Hoheisel, and R. Prodan, Towards a general model of the multi-criteria workflow scheduling on the grid, Future Generation Computer Systems, 25 (2009) 237–256.
[40] J. J. Dongarra, E. Jeannot, E. Saule, and Z. Shi, Bi-objective scheduling algorithms for optimizing makespan and reliability on heterogeneous systems, in SPAA '07: Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, 2007, pp. 280–288.
[41] H. M. Fard, R. Prodan, J. J. D. Barrionuevo, and T. Fahringer, A Multi-objective Approach for Workflow Scheduling in Heterogeneous Environments, in Cluster Computing and the Grid, IEEE International Symposium on, 2012, pp. 300–309.
[42] J. Yu, R. Buyya, and C. K. Tham, Cost-Based Scheduling of Scientific Workflow Application on Utility Grids, in e-Science '05: Proceedings of the First International Conference on e-Science and Grid Computing, 2005, pp. 140–147.
[43] S. Abrishami, M. Naghibzadeh, and D. Epema, Cost-driven scheduling of grid workflows using Partial Critical Paths, in 11th IEEE/ACM International Conference on Grid Computing (GRID), 2010, pp. 81–88.
[44] R. Sakellariou, H. Zhao, E. Tsiakkouri, and M. Dikaiakos, Scheduling Workflows with Budget Constraints, in Integrated Research in GRID Computing: CoreGRID Integration Workshop 2005 (Selected Papers), S. Gorlatch and M. Danelutto, Eds., Springer US, 2007, pp. 189–202.
[45] S. Pandey, L. Wu, S. M. Guru, and R. Buyya, A Particle Swarm Optimization-Based Heuristic for Scheduling Workflow Applications in Cloud Computing Environments, in Advanced Information Networking and Applications, International Conference on, 2010, pp. 400–407.
[46] S.-M. Park and J.-H. Kim, Chameleon: a resource scheduler in a data grid environment, in CCGrid 2003: 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003, pp. 258–265.
[47] K. Ranganathan and I. Foster, Decoupling computation and data scheduling in distributed data-intensive applications, in HPDC-11 2002: 11th IEEE International Symposium on High Performance Distributed Computing, 2002, pp. 352–358.
[48] S. Venugopal, R. Buyya, and L. Winton, A grid service broker for scheduling distributed data-oriented applications on global grids, in MGC '04: Proceedings of the 2nd Workshop on Middleware for Grid Computing, 2004, pp. 75–80.
[49] S. Kim and J. B. Weissman, A genetic algorithm based approach for scheduling decomposable data grid applications, in ICPP '04: Proceedings of the 2004 International Conference on Parallel Processing, 2004, pp. 406–413, vol. 1.
[50] E. C. Machtans, L. M. Sato, and A. Deppman, Improvement on Scheduling Dependent Tasks for Grid Applications, in Computational Science and Engineering (CSE '09), 2009, pp. 95–102.
[51] T. Kosar and M. Balman, A new paradigm: Data-aware scheduling in grid computing, Future Generation Computer Systems, 25 (2009) 406–413.
[52] D. Yuan, Y. Yang, X. Liu, and J. Chen, A data placement strategy in scientific cloud workflows, Future Generation Computer Systems, 26 (2010) 1200–1214.
[53] Ü. V. Çatalyürek, K. Kaya, and B. Uçar, Integrated data placement and task assignment for scientific workflows in clouds, in DIDC '11: Proceedings of the Fourth International Workshop on Data-intensive Distributed Computing, 2011, pp. 45–54.
[54] J. Dean and S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM, 51 (2008) 107–113.
[55] J. Jin, J. Luo, A. Song, F. Dong, and R. Xiong, BAR: An Efficient Data Locality Driven Task Scheduling Algorithm for Cloud Computing, in CCGRID '11: Proceedings of the 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2011, pp. 295–304.
[56] J. Berlinska and M. Drozdowski, Scheduling divisible MapReduce computations, Journal of Parallel and Distributed Computing, 71 (2011) 450–459.
[57] T. Gunarathne, B. Zhang, T.-L. Wu, and J. Qiu, Portable Parallel Programming on Cloud and HPC: Scientific Applications of Twister4Azure, in Fourth IEEE International Conference on Utility and Cloud Computing (UCC), 2011, pp. 97–104.
[58] M. Malawski, K. Figiela, and J. Nabrzyski, Cost minimization for computational applications on hybrid cloud infrastructures, Future Generation Computer Systems, (accepted) http://dx.doi.org/10.1016/j.future.2013.01.004, 2013.
[59] I. T. Foster, C. Kesselman, G. Tsudik, and S. Tuecke, A Security Architecture for Computational Grids, in CCS '98: Proceedings of the 5th ACM Conference on Computer and Communications Security, 1998, pp. 83–92.
[60] R. Jain, D. Chiu, and W. Hawe, A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer Systems, DEC Research Report TR-301, 1984.
[61] M. D. Dikaiakos, D. Katsaros, P. Mehra, G. Pallis, and A. Vakali, Cloud Computing: Distributed Internet Computing for IT and Scientific Research, IEEE Internet Computing, 13 (2009) 10–13.
[62] V. C. Emeakaroha, M. A. S. Netto, R. N. Calheiros, I. Brandic, R. Buyya, and C. A. F. De Rose, Towards autonomic detection of SLA violations in Cloud infrastructures, Future Generation Computer Systems, 28 (2012) 1017–1029.