

CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2011; 23:1261–1283
Published online 25 April 2011 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.1708

Dataflow detection and applications to workflow scheduling

Yang Wang∗,† and Paul Lu

Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8

SUMMARY

In high-performance computing (HPC) workloads (i.e. the set of computations to be completed), the same computational workflow of jobs (e.g. a Pipeline, a Fork&Join, or a Lattice graph) may be applied to different input files and parameters. Each of these workflow instances has the same workflow shape, but accesses (possibly) separate input, intermediate, and output files. Therefore, the selective isolation of each workflow instance can be important for maximizing scheduling flexibility and performance. However, in practice, realizing this benefit is not obvious due to a variety of problems and constraints. For example, the unmediated interaction of different workflow instances can lead to a problem of filename conflicts between concurrent workflow instances overwriting common files, which, for a control-flow driven batch scheduler, may result in either unsafe computation of the multiple instances in the same sub-directory or storage overheads when multiple directories are used. We propose a novel approach of selectively coupling and integrating job schedulers and file systems, known as a Workflow-aware File System (WaFS), with two major benefits. First, separate namespaces can be constructed on a per-instance basis to maximize the concurrency of workflow instances, despite filename conflicts, while minimizing storage overhead. Second, exploiting inferred dataflow information, trade-offs can be made between makespan and storage overhead while maintaining correctness. Through a simulation-based study, we have shown the potential benefits of WaFS to job concurrency and we have characterized the trade-offs that can be made between storage overhead and performance. New scheduling policies, Versioned Namespace (VNS), Overwrite-Safe Concurrency (OSC) and hybrids, are made possible by WaFS, with different advantages and disadvantages. Copyright © 2011 John Wiley & Sons, Ltd.

Received 21 July 2009; Revised 29 December 2010; Accepted 29 December 2010

KEY WORDS: dataflow; concurrency; storage

1. INTRODUCTION

Many high-performance computing (HPC) and scientific workloads (i.e. the set of computations to be completed), such as those in bioinformatics [1, 2], biomedical informatics [3], cheminformatics [4], and geoinformatics [5], consist of jobs with control-flow or dataflow dependencies, represented as a Directed Acyclic Graph (DAG). Control-flow dependency specifies that one job must be completed before other jobs can start. In contrast, dataflow dependency specifies that a job cannot start until all its input data are available, which are typically created by previously completed jobs. Control flow is the more commonly used abstraction to reason about the relationship between different jobs.

Dataflow dependency signifies the actual dependency requirements of the computation and ensures the correctness of the computation under the assumption that there is no data-sharing via side effects (e.g. a shared read–write file). It should be noted that not all computations are easily characterized simply in terms of either their control-flow or dataflow. For example, a common database breaks the producer–consumer relationship of a DAG. But many HPC workloads are of the form described here. For example, the Proteome Analyst (PA) web service [2] has a multistage Pipeline-like workflow (Figure 1) that classifies the proteome (i.e. all of the proteins of an organism, usually represented as a set of strings) in terms of its molecular function and subcellular localization. In this example, two pipelines are constructed. One Pipeline is Job A, B, D and F, and the other is Job A, C, E and F. In this case, the shapes of the control-flow DAG and the dataflow DAG are the same. But in general, they do not have to be the same.

∗Correspondence to: Yang Wang, National University of Singapore, Singapore. †E-mail: [email protected]

[Figure 1 (diagram not reproduced): Proteome analyst workflow (a Pipeline), with jobs BLAST, Feature Extraction 1 and 2, Function Classifier, Localization Classifier and Create Summary, labelled A–F.]

In practice, a workload is often composed of multiple instances of the same workflow, with each instance acting on (possibly) independent input files or different initial parameters. In our example, analyzing a proteome may require one instance of the workflow (e.g. Pipeline) for each protein in the proteome. The goal of this paper is to maximize the performance of these kinds of workloads in HPC systems. The primary metric of performance is makespan, which is the turnaround time for all jobs in a workload.

To achieve this goal, we make a key observation: from a control-flow perspective, workflow instances are inherently independent. However, in the context of a shared file system, where the namespace and finite resources are shared, interactions between instances can lead to incorrect executions. Whether the issue is anti- or output dependency [6] for files, or simple competition for storage resources, the selective isolation of each workflow instance can be important for maximizing scheduling flexibility and performance.

However, in practice, realizing this benefit is not obvious due to a variety of problems and constraints. For example, the concurrent execution of multiple workflow instances may be limited by the filename conflict problem, which occurs when two or more jobs in the computation have output files with exactly the same name, so that the jobs can erroneously overwrite each other's data.

Although there are existing methods to address this problem, such as the sub-directory-based method and the overwrite method, each suffers from its own drawbacks. The root limitation is that most current control-flow-driven batch schedulers base their job scheduling on the inter-job control dependencies (i.e. control-flow) specified by users:

1. Sub-directory-based method: To resolve the filename conflicts, most current batch schedulers adopt a sub-directory-based strategy (also called Sub-dir in the later discussion) that creates a working directory for each workflow instance and moves all required data to that directory (e.g. GEL [7], Triana [8] and DAGMan [9]). Without any dataflow knowledge of what files are used within the workflow instance, control-flow-driven batch schedulers have little choice but to partition the file namespace in a brute-force renaming strategy.

Specifically, all the computations of the instance are carried out in that directory. But based on the control-flow information, it is not always possible to determine if a file is no longer used by the other jobs. For example, in Figure 2(a), after Job B in workflow instance WI1 finishes, Out.A cannot be deleted immediately because we do not know if it will be used by other jobs such as Job D, Job E or even Job F. Therefore, files are usually not deleted immediately, even when they could be, incurring large storage overhead, especially when a large number of workflow instances execute concurrently.

[Figure 2 (diagram not reproduced): (a) Control-flow-based OSC (unsafe); (b) dataflow-based OSC (safe). Control-flow is not always safe in exploiting inter-workflow instance concurrency. A counter-example (a) shows how file Out.A is unsafely overwritten by Job A in WI2 before Job D in WI1 consumes Out.A. The dataflow example (b) is correct in that both Job C and Job D in WI1 must complete before Job A in WI2 can overwrite Out.A. Note that the dataflow from Job A to Job D is only specific to (b) for comparison purposes.]

2. Overwrite method: To maximize the concurrency while minimizing the storage overhead, the overwrite method is another strategy that is often used. However, the control-flow-driven batch scheduler cannot always ensure a correct overlap of multiple workflow instances with respect to safe file overwriting. A counter-example in Figure 2(a) shows how Job A in workflow instance WI2 cannot be overlapped with Job D in WI1. The control-flow-based overwrite strategy might assume that since Job B and Job C in WI1 are finished, the output of Job A (i.e. Out.A) in WI1 can be overwritten. However, the overwrite may be premature, leading to an incorrect schedule. Thus, in practice, a common solution is to execute each workflow instance in a sequential order (also called the BASE policy in the later discussion). Although this serial policy is simple and incurs small storage overhead, it does not allow any inter-instance concurrency and thus suffers from low performance.

To address these problems, we argue that having the dataflow information is fundamentally advantageous to determining the precise scope and time-window in which resources are required. In our previous example (Figure 2(a)), with the dataflow information, we know that the file Out.A in WI1 can be deleted immediately after Job B finishes because no more jobs need that file again. In Figure 2(b), with the dataflow information, we know to delay the start of Job A in WI2 until after Job C and Job D in WI1 are completed. Delaying the start of Job A maintains correctness without requiring any additional resources for a file renaming strategy. And, since Job A of instance WI2 can still overlap Job F of WI1, there is still inter-workflow instance concurrency.
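To make the bookkeeping concrete, here is a minimal sketch (our illustration, not code from the paper) of the reader-tracking that such reasoning requires. Given per-file reader sets taken from a dataflow DAG (the sets below are hypothetical, loosely based on Figure 2(b)), a file becomes deletable, or safely overwritable by a later instance, exactly when its last reader finishes:

    #include <iostream>
    #include <map>
    #include <set>
    #include <string>

    int main() {
        // Hypothetical dataflow of WI1: Out.A is read by Jobs C and D,
        // and Out.B is read by Job E.
        std::map<std::string, std::set<std::string>> readers = {
            {"Out.A", {"C", "D"}},
            {"Out.B", {"E"}},
        };
        // When a job finishes, remove it from every reader set; a file whose
        // reader set just became empty can be deleted or safely overwritten.
        auto job_finished = [&](const std::string& job) {
            for (auto& [file, remaining] : readers)
                if (remaining.erase(job) && remaining.empty())
                    std::cout << file << ": no remaining readers\n";
        };
        job_finished("C");  // Out.A still has reader D
        job_finished("D");  // prints: Out.A: no remaining readers
        job_finished("E");  // prints: Out.B: no remaining readers
    }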

1.1. Contributions

Although some ad hoc solutions exist and other systems have attempted to address these problems, we are advocating a more systematic and comprehensive solution. More specifically, our contributions are the following:

1. New policies exploiting dataflow to maximize concurrency: We introduce two new dataflow-based scheduling policies, Versioned Namespace (VNS) and Overwrite-Safe Concurrency (OSC), to maximize concurrency (and minimize makespan) by reducing the impact of the filename conflict problem. Both VNS and OSC take advantage of dataflow information to maximize the inter-workflow instance concurrency, each with its own advantages and disadvantages.


2. Workflow-aware File System (WaFS) for dataflow collection: We propose and prototype a novel system called the Workflow-aware File System (WaFS) that extends a traditional file system to provide a distinct namespace for each workflow instance (to address the filename conflict problem) and transparently gathers the dataflow information to help the scheduler. Unfortunately, the dataflow information is not usually available from the user submission in control-flow-based systems, nor is it tracked by traditional file systems.

To overcome these challenges, we show how an enhanced VNS manager (VNM) can be layered on top of a traditional file system to integrate the file system and the batch scheduler as a WaFS Scheduler. The WaFS Scheduler uses WaFS to collect dataflow information, stores that dataflow information in the VNM, and the modified scheduler exploits the dataflow information for better scheduling.

WaFS is primarily a proof-of-concept prototype of a new combined file system and scheduler architecture. A full evaluation of different implementation strategies is beyond the scope of this paper.

The remainder of the paper is organized as follows. Sections 2 through 4 introduce our WaFS Scheduler, with a focus on the mechanism for collecting the dataflow information and the proposed dataflow-based scheduling policies. Some preliminary results are presented in Section 5 via simulation and prototype studies. Section 6 covers related work. Concluding remarks are in the last section.

2. DATAFLOW COLLECTION AND SCHEDULING POLICIES

2.1. Motivation

Our dataflow-based scheduling policies rely on having a mechanism to collect the dataflow information for batch-scheduled jobs. Then, our policies exploit the information to maximize the job concurrency within the workflows, despite a possible filename conflict. Knowing the true dependency of the jobs [6] enables a file renaming strategy that eliminates artificial bottlenecks to concurrency, while efficiently using resources. However, there are several major challenges to collecting and using dataflow information:

1. Dataflow information is not always available from the user submission: As discussed, in general, the user-submitted control-flow dependencies and the dataflow dependencies of a workflow do not have to be the same. Therefore, the dataflow information has to be gathered automatically during the computation.

2. Traditional file systems do not track the dataflow information: The underlying file systems used in HPC typically do not track the dataflow information inherent to the jobs. Historically, file systems react to file operations requested by the application, instead of proactively gathering information.

3. Traditional schedulers do not incorporate dataflow information: Batch schedulers are good at using control-flow information to schedule jobs. But schedulers are not able to exploit dataflow information to solve filename conflict problems. Schedulers dispatch jobs to run without much concern about how the jobs will interact with the files.

To address these challenges, we propose a WaFS Scheduler, a novel approach of integrating the file system and the batch scheduler to collect and exploit the dataflow information on a per-workflow-instance (or per-instance, for short) basis. More specifically, with this integration, we can obtain several benefits.

1. The dataflow dependencies between the jobs in a workflow can be inferred by combining the scheduler's knowledge of the jobs (and possibly control-flow) and the file system's knowledge of the files accessed.

2. Separate namespaces can automatically be constructed on a per-instance basis to maximize the workflow instance concurrency while incurring low storage overhead, despite filename conflicts.


3. The dataflow information can be used to make trade-offs between concurrency and storage overhead when there are (potential) filename conflicts.

To achieve these ends, we propose and evaluate two dataflow-based scheduling policies, VNS and OSC, to address the problems of filename conflicts. Using simulation studies for a variety of workloads, we show the value of dataflow-based scheduling policies for improving the degree of job concurrency and makespan while minimizing the storage overhead of workflow-based computations.

3. DATAFLOW COLLECTION: WaFS SCHEDULER

To collect the dataflow information and manage a distinct namespace for each workflow instance, we propose a WaFS that layers a VNM on top of existing file systems and integrates it with a batch scheduler. The integrated system (i.e. WaFS + Batch Scheduler) is called the WaFS Scheduler. Note that in traditional HPC systems, neither the batch scheduler nor the file system can obtain and exploit the dataflow information alone. For example, file systems do not associate files being accessed with a workflow or instance; file systems passively respond to file operations without recording the jobs that access the files. And schedulers do not consider the set of files that a job, workflow or instance will access when making scheduling decisions.

The architecture of the WaFS Scheduler for dataflow collection is shown in Figure 3. It consists of two major components: the batch scheduler (enhanced with VNS, OSC and their hybrid HB policy) and WaFS. The enhanced batch scheduler obtains the dataflow information from WaFS and leverages it to maximize the job concurrency through the proposed policies (discussed later). WaFS monitors the workflow computations and interacts with the underlying file system to capture the file access information and infer the dataflow information on a per-instance basis. More specifically, under the assumption that no filename conflicts occur inside workflow instances, for any pair of control-dependent jobs (i.e. there is a direct path between them in the control-flow graph), if a file is created by one job (the source) and read by the other job (the destination), then a data dependency is established between these two jobs (i.e. from the source job to the destination job). The initial information about the data dependency can be obtained from the first run or a trial run of the workflow.
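The inference rule can be sketched as follows (our illustration, not the prototype's code): replay a per-instance trace of file accesses and add an edge from the creating job to every later reader of the same file. For brevity, the sketch omits the control-dependence check described above, and the trace is a hypothetical one, loosely following one pipeline of Figure 1:

    #include <iostream>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    struct Access { std::string job, file; bool created; };

    int main() {
        // Hypothetical file-access trace of one workflow instance.
        std::vector<Access> trace = {
            {"A", "Out.A", true}, {"B", "Out.A", false}, {"B", "Out.B", true},
            {"D", "Out.B", false}, {"D", "Out.D", true}, {"F", "Out.D", false},
        };
        std::map<std::string, std::string> creator;              // file -> producing job
        std::set<std::pair<std::string, std::string>> dataflow;  // producer -> consumer edges
        for (const auto& a : trace) {
            if (a.created)
                creator[a.file] = a.job;
            else if (auto it = creator.find(a.file); it != creator.end())
                dataflow.insert({it->second, a.job});            // read of a created file
        }
        for (const auto& [src, dst] : dataflow)
            std::cout << src << " -> " << dst << '\n';           // A -> B, B -> D, D -> F
    }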

[Figure 3 (diagram not reproduced): WaFS Scheduler: integration of WaFS with the batch scheduler for dataflow collection. The scheduler, enhanced with the VNS and OSC policies, dispatches each new instance to a computation node; WaFS, through the Versioned Namespace Manager (VNM) layered over the file system, collects file access information and returns the inferred dataflow to the scheduler.]


In addition, WaFS also provides services for the remote batch schedulers to exploit the inferred dataflow information to maximize the job concurrency while minimizing storage overhead.

To validate the basic ideas behind WaFS, we develop a simple WaFS prototype [10]. The prototype works at user level, using ptrace(), via a monitor component (not shown in Figure 3), to trace the file-oriented system calls (e.g. open() and close()) and collect the dataflow information in the VNM. Although a full, production-quality dataflow-based scheduler has not been implemented, the WaFS prototype does validate the basic design and shows one possible implementation strategy of a key mechanism.
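The mechanism can be illustrated with a minimal tracer (a sketch under our own assumptions, for Linux on x86-64; it is not the prototype's actual code and it ignores signal stops and other corner cases). The monitor forks the job, stops it at every system-call boundary and notes calls from the open() family, which is the raw material for the VNM's dataflow records:

    #include <sys/ptrace.h>
    #include <sys/syscall.h>
    #include <sys/user.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        pid_t child = fork();
        if (child == 0) {
            // The traced "job": request tracing, then exec the real program.
            ptrace(PTRACE_TRACEME, 0, nullptr, nullptr);
            execlp("ls", "ls", (char*)nullptr);
            _exit(1);
        }
        int status;
        waitpid(child, &status, 0);          // child stops at exec
        bool entering = true;                // alternate syscall entry/exit stops
        while (true) {
            ptrace(PTRACE_SYSCALL, child, nullptr, nullptr);  // run to next syscall boundary
            waitpid(child, &status, 0);
            if (WIFEXITED(status)) break;
            if (entering) {
                user_regs_struct regs;
                ptrace(PTRACE_GETREGS, child, nullptr, &regs);
                if (regs.orig_rax == SYS_open || regs.orig_rax == SYS_openat)
                    std::printf("pid %d entered an open-family syscall\n", child);
            }
            entering = !entering;
        }
        return 0;
    }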

4. DATAFLOW-BASED SCHEDULING POLICIES

To exploit the WaFS mechanism, we propose two basic policies: VNS and OSC. Both policies, together with the reference policies BASE and Sub-dir, are characterized in Table I.

The essence of VNS and OSC is to exploit the dataflow information to selectively break the name dependencies (i.e. the filename conflicts) between concurrent workflow instances. To simplify the presentation of VNS and OSC, we assume that the final output files are staged out to a different file system, by the workflow instance itself, before each instance is complete. Therefore, the WaFS Scheduler assumes that it can simply deallocate all of the storage resources upon instance completion.

Consider Figure 4(b) and (c) as examples, where three workflow instances (i.e. WI1, WI2 and WI3) are submitted for scheduling. For comparison purposes, we also show the BASE policy (i.e. the serial policy) (Figure 4(a)). In BASE, the inter-workflow instance concurrency is limited by the control-flow information, and thus each workflow instance is executed sequentially (i.e. no inter-instance concurrency). Files are never versioned, and storage is deallocated after the completion of each instance (see Table I). Although it is a bit of a 'straw man' policy to execute workflow instances sequentially, the BASE policy does represent a class of users and workloads in practice.

Perhaps a more reasonable comparison is the Sub-dir policy, which employs a per-instance working directory strategy to isolate the input and output files of each individual workflow instance (i.e. files are always versioned). Therefore, Sub-dir inherently breaks filename conflicts and maximizes the concurrency. In Sub-dir, the inter-instance concurrency is limited by the available total storage, and the storage held by each instance is deallocated after the instance completes (see Table I). As we will see, Sub-dir is similar to VNS. But in contrast, VNS is transparent to the application, and VNS has other benefits.

4.1. VNS policy

Table I. The characteristics of the compared policies: VNS, OSC and HB are our dataflow-based policies, BASE is the control-flow-based serial policy and Sub-dir refers to the policy that employs the working directory to address the filename conflicts and maximize the job concurrency.

Policy  | DOC    | Intra-instance limited by | Inter-instance limited by | Storage overhead | File versioned | Storage allocation/deallocation granularity
BASE    | Low    | Control-flow              | Control-flow              | Low              | Never          | Job/Instance
Sub-dir | High   | Control-flow              | Total storage             | High             | Always         | Job/Instance
VNS     | High   | Dataflow                  | Total storage             | High             | Always         | Job/Job
OSC     | Medium | Dataflow                  | Dataflow                  | Low              | Never          | Job/Job (when safe to overwrite)

BASE and Sub-dir policies are listed for comparison purposes. DOC is short for 'Degree of Concurrency'.

[Figure 4 (diagram not reproduced): Inter-workflow instance concurrency: (a) serial policy (BASE); (b) Versioned Namespace (VNS), with per-instance namespaces NS1–NS3 and versioned files such as Out.A.1 and Out.A.2 (concurrent jobs CJ1); and (c) Overwrite-Safe Concurrency (OSC) (concurrent jobs CJ2).]

The VNS policy adopts a renaming strategy by automatically versioning each output file (Figure 4(b), VNS). Specifically, files are always versioned when created with a file open for writing. Then, when the file is closed and the dataflow information determines that the file is no longer needed (e.g. it has no more readers), the file storage is deallocated (see Table I).
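A minimal sketch of this bookkeeping follows (our illustration; the interface and the idea of seeding each version with an expected reader count, taken from the dataflow of an earlier run, are assumptions rather than the prototype's API):

    #include <iostream>
    #include <map>
    #include <string>

    struct Vns {
        std::map<std::string, int> remaining_readers;  // versioned name -> readers left
        // On create: version the name per instance and record how many jobs
        // will read it (known from the dataflow DAG of an earlier run).
        std::string create(const std::string& name, int instance, int readers) {
            std::string versioned = name + "." + std::to_string(instance);
            remaining_readers[versioned] = readers;
            return versioned;
        }
        // On a reader's close: deallocate when the last reader is done.
        void close_by_reader(const std::string& versioned) {
            if (--remaining_readers[versioned] == 0) {
                remaining_readers.erase(versioned);
                std::cout << "deallocate " << versioned << '\n';
            }
        }
    };

    int main() {
        Vns vns;
        std::string f1 = vns.create("Out.A", 1, 2);  // Out.A.1, read by two jobs
        std::string f2 = vns.create("Out.A", 2, 2);  // Out.A.2: no name conflict
        std::cout << f1 << " and " << f2 << " coexist\n";
        vns.close_by_reader(f1);
        vns.close_by_reader(f1);                     // last reader: storage freed
    }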

The basic strategy is similar to register renaming [11, 12] in processor microarchitecture in that extra (i.e. file) resources are used to improve concurrency. The differences between VNS and register renaming include: First, the file-based dataflow information required for VNS to work is not readily available in current systems. Our proposed WaFS fills in that dataflow gap. Second, for initial evaluation, we will assume an infinite amount of renaming resources (i.e. file storage) to study the inherent concurrency available in the workload.

With VNS, although the different instances may generate files that have the same name, their version numbers are different. For example, in Figure 4(b), Job A in WI1 and WI2 may have output files that have the same name Out.A, but this file will have different version numbers in each workflow instance, such as Out.A.1 in WI1 and Out.A.2 in WI2. Given this versioning policy, together with the integration of file system and job scheduler, VNS can construct a separate namespace for each workflow instance (i.e. NS1, NS2 and NS3). Here, the namespace of VNS, in terms of isolating the workflow instances, is similar to the working directory in the Sub-dir policy. But in contrast, the namespace of VNS can be related back to the workflow instance (since scheduler and file systems are coupled) to capture and exploit its dataflow information.

In our example, given the separate namespaces NS1, NS2 and NS3 (Figure 4(b)), the scheduler can overlap the execution of WI1, WI2 and WI3, and no filename conflicts are incurred. Specifically, the degree of concurrency (DOC) (i.e. the number of concurrent jobs) increases to four (Figure 4(b), dashed box, concurrent jobs CJ1). Since each workflow instance has its own namespace, there are no filename conflicts at all when multiple workflow instances execute concurrently. Although VNS can maximize the job concurrency, it creates significant storage overhead with all the file versions. However, if the intermediate files do not need to be kept, VNS can delete them immediately, based on the dataflow information, at the completion of each job (i.e. job deallocation granularity, see Table I). Although, compared with the instance deallocation granularity of the Sub-dir policy, such a job deallocation granularity can reduce the storage overhead, the storage overhead of VNS is still high due to the potentially large number of concurrent workflow instances.

4.2. OSC policy

To overcome the storage overhead of VNS, the OSC policy overwrites files, when it is safe to do so, instead of always versioning files as per VNS. Specifically, files are never versioned when created and never deleted when closed, but they can be overwritten by later instances as long as they are not needed in the current instance (i.e. job deallocation granularity, see Table I).

As an example, in Figure 4(c), Jobs D and E of WI1 can execute concurrently with Job A of WI2. Specifically, Job A of WI2 has to wait until Job B and Job C of WI1 complete; then WI2's Job A can overwrite the file Out.A (Job A's output file). The DOC increases to three (Figure 4(c), dashed box, concurrent jobs CJ2, limited by the dataflow information, see Table I). Therefore, OSC improves the DOC as compared with the serial policy (Figure 4(a)) by increasing the inter-workflow instance concurrency.

Since OSC solves the filename conflict problem by overwriting files, instead of versioning files as in VNS, the storage overhead of OSC is small. In fact, the storage overhead of OSC is proportional to the actual DOC, not to the number of workflow instances. On the other hand, OSC improves the DOC over strategies (e.g. BASE) that must be conservative in overwriting files (e.g. waiting until all jobs in a workflow instance are completed), without incurring extra storage overhead.
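The safety test at the heart of OSC can be sketched as follows (our illustration; the data structures are assumptions, not the prototype's interface): a job of a later instance may overwrite a file only once every reader of that file in the earlier, still-active instance has completed:

    #include <iostream>
    #include <map>
    #include <set>
    #include <string>

    // readers_of maps a file to the jobs of the earlier instance that read it;
    // done holds the jobs of that instance that have already completed.
    bool safe_to_overwrite(const std::string& file,
                           const std::map<std::string, std::set<std::string>>& readers_of,
                           const std::set<std::string>& done) {
        auto it = readers_of.find(file);
        if (it == readers_of.end()) return true;   // nobody reads the file
        for (const auto& job : it->second)
            if (!done.count(job)) return false;    // a reader is still pending
        return true;
    }

    int main() {
        // Figure 4(c): in WI1, Out.A is read by Jobs B and C.
        std::map<std::string, std::set<std::string>> readers_of = {{"Out.A", {"B", "C"}}};
        std::set<std::string> done = {"A", "B"};
        std::cout << safe_to_overwrite("Out.A", readers_of, done) << '\n';  // 0: C pending
        done.insert("C");
        std::cout << safe_to_overwrite("Out.A", readers_of, done) << '\n';  // 1: WI2's Job A may run
    }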

4.3. Summary

In this section, we proposed two basic policies, VNS and OSC, to maximize the job concurrency by addressing the problems in the control-flow-based batch schedulers (see Table I).

VNS and Sub-dir are consistently the best overall policies in terms of DOC [13, 14]. But both suffer from storage overhead. However, compared with the Sub-dir policy, VNS has the benefit of constructing the namespace by inferring and capturing the dataflow information on a per-workflow-instance basis. Therefore, VNS can improve the intra-instance concurrency and deallocate unused storage at the earliest time. In contrast, without dataflow information, Sub-dir can only deallocate the storage at the end of each instance, suffering from more storage overhead than VNS.

Owing to its low storage overhead, OSC is valuable in situations where storage is scarce, but it suffers from potentially low DOC.

5. RESULTS

We first use simulation-based techniques to show the potential of dataflow information for improving workflow scheduling, and then show the low overhead of WaFS through a prototype study. In all simulation experiments, we use the serial policy and the Sub-dir policy, two common solutions in practice, as our baseline strategies (i.e. BASE and Sub-dir), and identify the circumstances under which OSC and VNS outperform these baselines with respect to makespan, average DOC and storage overhead.

We develop a WaFS prototype as a proof-of-concept to efficiently collect and exploit the file-based dataflow information on a per-instance basis. The prototype is based on a client/server architecture where the batch scheduler is simulated as a batch queuing and workload management system that acts as a client; it communicates with the VNM (the server) to schedule and manage the user-submitted workloads among a set of networked computational hosts. The VNM is built on top of an existing file system and uses the ptrace mechanism to track both open() and close() system calls to build up the dataflow dependencies for each instance.

VNS is equal to Sub-dir (current best practice) on makespan, but always better, and usually by a factor of 2 or more, on storage overhead. And OSC is even more efficient on storage than either VNS or Sub-dir, while remaining comparable to both on makespan for moderate- to low-intensity workloads.

[Figure 5 (diagram not reproduced): Benchmark workflow graphs: a circle represents a job and a rounded rectangle represents an input/output file. The Fork&Join (a) is characterized by fan-out factor and the number of stages, whereas the Lattice (b) is characterized by height and width.]

With WaFS, OSC and VNS can substantially improve makespan over BASE, usually by an order of magnitude. The actual improvement depends on the intensity of the workload and other factors. To different degrees, OSC and VNS exploit the inherent concurrency between workflow instances that BASE is unable to exploit.

5.1. Simulation methodology

We use three representative structures in the simulation studies: Fork&Join, Lattice and Pipeline (see Figure 5). These structures cover a spectrum of workflows and DOC. The Fork&Join structure, characterized by the number of stages and the fan-out factor, exhibits near-constant DOC and is representative of a large class of problems with a Pipeline of parallel phases [2, 15–17]. The Lattice structure, characterized by its width and height, exhibits variable concurrency, where the concurrency increases initially to a maximum degree and then decreases progressively. A variety of numerical linear algebra computations that arise in a broad range of scientific and engineering applications have a Lattice structure [18–21]. The Pipeline structure can be viewed as a special case of Fork&Join (i.e. a fan-out factor of one) or Lattice (i.e. either width or height is one), but it is very common in scientific computation [3, 22–25].

The dataflow DAGs of these workflows are shown in Figure 5. For the Fork&Join, we assume that the control-flow DAG is similar to its dataflow counterpart, except that any two consecutive stages (i.e. all jobs in the stage) are synchronized by an implicit virtual job (like a barrier). In contrast, for both Lattice and Pipeline, the control-flow DAG and the dataflow DAG are assumed to be exactly the same. Although user-submitted control-flow DAGs may have various shapes, these assumptions keep the workflows easy for users to reason about. A sketch of one plausible Lattice construction follows.
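For concreteness, here is one plausible construction (our assumption; the paper does not spell out a generator) of a width x height Lattice dataflow DAG in which job (i, j) feeds its right and lower neighbours, yielding the rising-then-falling wavefront concurrency described above:

    #include <cstdio>

    int main() {
        const int width = 8, height = 12;  // the 8x12 Lattice used in Section 5.2
        auto id = [&](int i, int j) { return i * width + j; };
        for (int i = 0; i < height; ++i)
            for (int j = 0; j < width; ++j) {
                // Emit the dataflow edges of job (i, j) as "producer -> consumer".
                if (j + 1 < width)  std::printf("%d -> %d\n", id(i, j), id(i, j + 1));
                if (i + 1 < height) std::printf("%d -> %d\n", id(i, j), id(i + 1, j));
            }
    }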

As implied in the previous sections, for the OSC and VNS strategies to work, the scheduler must know both the control-flow of the computation (i.e. the control-flow DAG) and the dataflow of the jobs in the workflow. The control-flow information is the typical way in which dependencies are made known to batch schedulers such as LSF [26], PBS [27] and Condor [9]. The dataflow information is gathered by WaFS during the execution of the first workflow instance and exploited by OSC and VNS to improve the DOC of later instances.


Table II. The characteristics of the benchmark workloads.

Characteristics    | Fork&Join      | Lattice      | Pipeline
Shape parameter    | Stages×fan-out | Height×width | Stages
Job service time   | Uniform        | Uniform      | Uniform
Inter-arrival time | Exponential    | Exponential  | Exponential
Workload size      | 100            | 100          | 100
File size          | Uniform        | Uniform      | Uniform

Since there is no well-accepted model for job service times (JSTs) and data file sizes, or for their relationship in workflow-based workloads, in the experiments we assume that, for instances of all the examined workflows, the JSTs and the data file sizes are uniformly distributed. These assumptions are consistent with some previous studies [28–31]. In addition, in each experiment, the workload comprises a total of 100 workflow instances, and the workflow instance inter-arrival time follows an exponential distribution. The characteristics of the benchmark workloads are summarized in Table II, and the compared policies, except for the HB policy, are characterized in Table I.

Finally, we further assume that an unbounded number of homogeneous computational nodes and infinite storage are available, so that the maximum DOC is never constrained by the hardware.
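As an illustration of this workload model (our sketch, standalone rather than part of the simulator; the mean inter-arrival time of 800 is just one of the values swept in the experiments), one instance stream can be generated as follows:

    #include <cstdio>
    #include <random>

    int main() {
        std::mt19937 rng(42);  // one seed; each data point averages 10 such runs
        std::exponential_distribution<double> interarrival(1.0 / 800.0);  // mean 800 time units
        std::uniform_real_distribution<double> jst(500.0, 1000.0);        // moderately balanced JST
        std::uniform_real_distribution<double> fsize(1.0, 10.0);          // file size, storage units

        double t = 0.0;
        for (int i = 0; i < 100; ++i) {  // workload size from Table II
            t += interarrival(rng);
            std::printf("instance %3d: arrival %8.1f, sample JST %6.1f, sample file size %4.1f\n",
                        i, t, jst(rng), fsize(rng));
        }
    }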

We use the discrete event simulation package SMURPH [32] to implement a simulator. The simulated scheduler is given the control-flow DAG by the user submitting the workflow instances. The simulated VNS manager (i.e. VNM) sees all of the file reads and writes and records the dataflow DAG for a workflow. Based on the historical dataflow information, the scheduler knows (from the VNM) the dataflow of each workflow instance. The SMURPH-based simulation is in C++, with both the VNM and the scheduler abstracted into modules independent of the underlying simulation engine.

5.2. Results, data points, and standard deviation

A variety of factors affect the performance and average DOC of the workloads (i.e. makespan and average DOC). Some of them are identified in our experiments:

1. Instance inter-arrival time: simulated user behavior, e.g. exponential distribution.
2. Workflow shape: the structure of the workflow, e.g. Pipeline.
3. JST: e.g. uniform distribution.

Since the storage budget is unbounded, the file size distribution does not affect the makespan or the average DOC; it only affects the storage overhead. Therefore, in all experiments, we fix the file size distribution as a uniform distribution on [1, 10] storage units. The data point in each experiment is averaged over 10 runs obtained by changing the random seed in the simulator.

We found that, in all experiments, the standard deviation for the 10 runs is never greater than 12% of the mean of the 10 runs (i.e. the data point's value). More specifically, for all makespan and DOC data points, the standard deviation is less than 5%. And, for all storage overhead data points, the standard deviation is less than 12%. Therefore, for clarity of presentation, we do not show standard deviation bars on the graphs.

We first vary the average inter-arrival time of workflow instances to understand its impact on performance and storage overhead. For a Fork&Join structure with three stages and a fan-out of 32 per stage, Figure 6 shows the makespan, the corresponding average DOC and the storage overhead for a variety of different simulation parameters. In Figure 6, the JSTs are uniformly distributed between 500 and 1000 time units, and we vary the inter-arrival time between instances from 0 to 6400 time units. Intuitively, a short inter-arrival time represents an intense workload where the instances arrive close to each other. On the extreme right of the scale, an inter-arrival time of 6400 represents a lighter workload, where the inter-arrival time is much larger than the JSTs.


[Figure 6 (plots not reproduced): Simulation results for the Fork&Join (3×32): makespan (a); average DOC (b); and peak storage overhead (c), each versus average inter-arrival time, for BASE, OSC, VNS and Sub-dir (DOC units are numbers of jobs; all other values are either time units or storage units).]

For intensive workloads (i.e. x-axis ≤ 200 in Figure 6(a)), VNS and Sub-dir are better (i.e. lower makespan) than BASE (i.e. the typical serial strategy) by over an order of magnitude. OSC also has a lower makespan than BASE, but not as low as VNS. The performance improvements are due to improvements in the DOC (Figure 6(b)), which typically result in a lower makespan. As discussed earlier, VNS isolates each workflow instance by creating a separate namespace for each instance. As a result, there are no name conflicts between the different instances, and the jobs can be executed immediately as long as their intra-workflow instance data dependencies are resolved. Sub-dir creates a separate directory for each instance and thus has similar performance to VNS. But compared with VNS, Sub-dir has a slightly lower DOC due to its control-flow-based scheduling (VNS is based on the dataflow), especially when all the instances in a workload arrive at the same time (i.e. x-axis 0). However, this difference is marginal.

The main drawback of Sub-dir is its storage overhead, since it never overwrites files until the end of the instance computation. In contrast, both BASE and OSC create only a limited number of different files for the workload, and VNS can delete files immediately based on the data dependency information.

As the instance inter-arrival time increases (i.e. workload intensity decreases), the performance differences, as well as the storage overhead differences, between BASE, OSC, VNS and Sub-dir diminish. A larger inter-arrival time means that fewer workflow instances are in the scheduler's queue at any given time, which implies a smaller number of active instances and a smaller DOC. Since the storage overhead of the compared policies is either proportional to the number of active instances (i.e. Sub-dir and VNS) or proportional to the DOC (i.e. BASE and OSC), it decreases as the instance inter-arrival time increases. Naturally, if there is a lack of inherent job concurrency in the workload, the benefits of OSC and VNS are lessened.

Therefore, for low-intensity workloads (e.g. where the inter-arrival time is 3200 time units or larger), the BASE strategy is preferred since it has the same makespan as the other strategies, with none of the additional complexity and overhead. For medium-intensity workloads (e.g. where the inter-arrival time is between, say, 1200 and 2400 time units), OSC performs almost as well as VNS and Sub-dir, but without the storage overhead. For high-intensity workloads (e.g. where the inter-arrival time is 800 time units or less), VNS is the clear performance leader. It outperforms Sub-dir in terms of makespan and storage overhead.

Of course, many HPC workloads consist of a large parameter sweep, where all workflow instances are known at the beginning of the computation. This corresponds to inter-arrival times of 200 time units (or less), which is also the region of the graphs where OSC and VNS perform the best.

To evaluate the impact of workflow shapes, we perform the same simulation studies on both the Lattice and the Pipeline workflows, whose simulation results are shown in Figures 7 and 8, respectively. Recall that the Lattice is expected to have a lower intra-workflow instance DOC than the Fork&Join because of the additional dependencies between the jobs. For our specific Lattice, an 8×12 rectangle/diamond (see Figure 7), the critical path through each workflow instance is much longer than that of the 3-stage Fork&Join discussed above. This is reflected in the near-constant makespan for BASE despite variations in the inter-arrival times of the workflow instances. Intuitively, the Lattice has a lower average DOC than the 3-stage Fork&Join and a longer critical path, which reduces the intra-workflow instance DOC such that the BASE strategy cannot reduce the makespan, even for low-intensity workloads.

However, both OSC and VNS can still exploit inter-workflow instance concurrency to significantly reduce the makespan through a higher DOC. VNS continues to be better than OSC at reducing the makespan, but (once again) at the cost of increased file storage due to versioning. Sub-dir has the same performance as VNS since the shapes of the control-flow DAG and the dataflow DAG are exactly the same for our Lattice workflow. However, Sub-dir suffers from a larger storage overhead than VNS.

For both VNS and Sub-dir, the performance improvements over BASE are largely independent of the workflow shape, which is not the case for OSC. For OSC, a longer critical path usually implies a larger number of concurrent instances during the computation, so OSC exhibits relatively better performance for a workflow with a longer critical path. This can be observed by comparing the makespan of BASE and OSC in Figures 6(a) and 7(a), where, again, the critical path of the Lattice instance is much longer than that of the Fork&Join instance.


[Figure 7 (plots not reproduced): Simulation results for the Lattice (8×12): makespan (a); average DOC (b); and peak storage overhead (c), each versus average inter-arrival time, for BASE, OSC, VNS and Sub-dir (DOC units are numbers of jobs; all other values are either time units or storage units).]

In contrast to the 3-stage Fork&Join, we also found that the difference in storage overhead between VNS and Sub-dir for the Lattice becomes relatively large (compare Figures 6(c) and 7(c)). The reason is that the storage overhead of VNS is proportional to the DOC, and the DOC of the Lattice is much less than that of the Fork&Join (i.e. the storage overhead of VNS for the Fork&Join is relatively high).

[Figure 8 (plots not reproduced): Simulation results for the Pipeline (10-stage): makespan (a); average DOC (b); and peak storage overhead (c), each versus average inter-arrival time, for BASE, OSC, VNS and Sub-dir (DOC units are numbers of jobs; all other values are either time units or storage units).]

The same performance observations for the compared policies can also be made for the 10-stage Pipeline workflow (see Figure 8), an extreme case of the Fork&Join (and Lattice) workflow. However, the relative performance of OSC versus BASE for the Pipeline is not as good as that for the Lattice (compare Figures 7(a) and 8(a)). The reason is that there is no intra-workflow instance DOC, so the significant performance improvements of OSC, VNS and Sub-dir over BASE come entirely from exploiting the inter-workflow instance job concurrency. On the other hand, in our experiments, the critical paths of the Pipeline instances are shorter than those of the Lattice instances, limiting the number of concurrent instances for OSC.

Since the intra-instance concurrency of the Pipeline workflow is lower, the difference in storage overhead between VNS and Sub-dir is relatively large for the Pipeline, which is similar to the situation for the Lattice.

To summarize, for all the benchmark workflow shapes, we have the following conclusions:

1. OSC and VNS consistently outperform BASE, by up to an order of magnitude. Most performance gains come from exploiting inter-instance concurrency.

2. VNS continues to be better than OSC at reducing makespan, but at the expense of increased file storage. Sub-dir has almost the same performance as VNS but suffers from a larger storage overhead than VNS. The performance improvements of both policies over BASE are independent of the workflow shape.

3. The workflow shape has an impact on the performance of OSC. In general, OSC exhibits better performance for a workflow with a longer critical path.

4. The workflow shape also affects the relative storage overhead between VNS and Sub-dir, which, in general, is reduced when the workflow has higher intra-instance concurrency.

In the following sub-experiments (Figures 9–11), we show how DOC, makespan and storage overhead depend on multiple factors, including the JST, the shape of the workflow DAG and the instance inter-arrival time. We tried various JST ranges to approximate poorly balanced JSTs (i.e. JSTs in the range [10, 1000]), moderately balanced JSTs (i.e. JSTs in the range [500, 1000], see the previous experiments) and well-balanced JSTs (i.e. JSTs in the range [800, 1000]). We also varied the inter-arrival time between 0 and 6400 time units.

Our conclusions from these sub-experiments are:

1. Independent of the JST and the shape of the workflow DAG, OSC and VNS are consistently better than BASE and Sub-dir for high-intensity workloads (i.e. low inter-arrival times) with respect to makespan and storage overhead, respectively.

2. As the JST range varies, the inherent intra-instance DOC of the workload changes (as expected, except for the Pipeline) because processes are left idle due to the load imbalance; but OSC, VNS and Sub-dir continue to achieve a higher DOC than BASE for intensive workloads, at the expense of increased storage overhead.

3. Ultimately, the maximum JST (which is always 1000 time units) in our sub-experiments determines the critical path of each workflow instance and thus the makespan. Consequently, regardless of load imbalance within workflow instances, OSC, VNS and Sub-dir exploit enough concurrency between workflow instances to be preferred over BASE, with similar caveats and trade-offs as discussed for Figure 6.

4. The impact of the JST on the storage overhead of each compared policy is different. Specifically, for intensive workloads, regardless of the workflow shape, the impact on BASE, OSC and VNS is small. However, depending on the workflow shape, the impact on Sub-dir is different: either small, for both the Fork&Join and the Lattice, or large, for the Pipeline (compare Figure 11(e) and (f)).

5.3. Overhead of the WaFS prototype on potential applications

To measure the overhead of our prototype on potential applications, we examine the GROMACS benchmarking system gmxbench [3], which consists of four molecules published by the GROMACS group. The four molecules in the benchmark are d.dppc, d.lzm, d.poly-ch2 and d.villin, whose atom trajectories, in water, over a period of time, are simulated by the GROMACS software. The gmxbench is known to be compute-intensive and is representative of a large class of simulation-based applications.

[Figure 9 (plots not reproduced): Impact of job service time on the Fork&Join (3×12): makespan (a, b); average DOC (c, d); and peak storage overhead (e, f), each versus average inter-arrival time, for BASE, OSC, VNS and Sub-dir (Left: JST [10, 1000], Right: JST [800, 1000]).]

For our study, we compare the runtime of the computationally intensive mdrun program that actually performs the simulation. The configuration for this experiment is shown in Table III. We use two computers: one is assigned to the simulated batch scheduler and monitor, and the other to the VNM. The monitor runs in each compute host to monitor the job execution and sends the intercepted information to the remote VNM. The monitor and the VNM constitute the WaFS prototype. The network between the simulated batch scheduler and the VNM is 1 Gbit/s Ethernet.

The results of the GROMACS gmxbench are shown in Figure 12. Each data point is averaged over 5 runs. The bars labeled Orig are for the same runs, but without the overheads associated with WaFS. The overheads for WaFS range from 0.39% (d.dppc) to 11.74% (d.villin), depending on the computation involved in each simulated molecule. Although 11.74% overhead might be considered high in absolute terms, the short runtime of d.villin (97 s, Table IV) is less typical than the longer runtimes of the d.dppc example.

[Figure 10 (plots not reproduced): Impact of job service time on the Lattice (8×12): makespan (a, b); average DOC (c, d); and peak storage overhead (e, f), each versus average inter-arrival time, for BASE, OSC, VNS and Sub-dir (Left: JST [10, 1000], Right: JST [800, 1000]).]

5.4. Summary

Our simulation studies show that the basic idea of the WaFS Scheduler (i.e. the integration of the file system (WaFS) and the batch scheduler) can effectively resolve filename conflicts and significantly improve job scheduling by maximizing job concurrency while lowering the storage overhead. Specifically, gathering and using the dataflow information to support the novel OSC and VNS policies is shown to

1. reduce makespan, relative to BASE and reduce storage overhead, relative to Sub-dir;2. improve inter-workflow instance concurrency, relative to BASE;

Copyright � 2011 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2011; 23:1261–1283DOI: 10.1002/cpe

Page 18: Dataflow detection and applications to workflow scheduling

1278 Y. WANG AND P. LU

0

(a) Makespan (b) Makespan

(c) Average DOC. (d) Average DOC.

(e) Peak Storage Overhead (f) Peak Storage Overhead

1600 3200 6400

Avg. InterAV. Time

0

2e+05

4e+05

6e+05

8e+05

1e+06

Mak

espa

n

BASEOSCVNSSub-dir

0 1600 3200 6400

Avg. InterAV. Time

0

2e+05

4e+05

6e+05

8e+05

1e+06

Mak

espa

n

BASEOSCVNSSub-dir

0 1600 3200 6400

Avg. InterAV. Time

0

20

40

60

80

100

Avg

. DO

C.

BASEOSCVNSSub-dir

0 1600 3200 6400

Avg. InterAV. Time

0

20

40

60

80

100

Avg

. DO

C.

BASEOSCVNSSub-dir

0 5000 7000

Avg. InterAV. Time

0

1000

2000

3000

4000

5000

Stor

age

BASEOSCVNSSub-dir

Avg. InterAV. Time

0

1000

2000

3000

4000

5000

Stor

age

BASEOSCVNSSub-dir

4800 4800

4800 4800

1000 2000 3000 4000 6000 0 5000 70001000 2000 3000 4000 6000

Figure 11. Impacts of job service time on the Pipeline (10-stage): makespan, average DOC and storageoverhead (Left: JST[10, 1000], Right: JST[800, 1000]).

Table III. Experimental configuration for gmxbench.

Component                    CPU                      Memory (GB)   Cache (kB)   OS
Simulated batch (Monitor)    AMD Athlon 2.4 GHz       1             512          Linux 2.4.29
VNM                          AMD Athlon XP 2.2 GHz    1             512          Linux 2.6.18

The main criterion for choosing between OSC and VNS is the trade-off between performance and the storage overhead of file versioning. In addition, we also measured the performance overhead of WaFS for the GROMACS benchmark gmxbench. The overheads are measured to be between 0.39% and 11.74%, with lower overheads associated with longer JSTs, which qualitatively shows the advantages of our WaFS scheduling policies.
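To make the OSC side of that trade-off concrete, the following minimal sketch in Python shows one plausible reading of an overwrite-safe admission check. It is our hypothetical illustration only, suggested by the policy's name and behaviour; the function and its inputs are not the policy's actual interface.

    def osc_safe(job_writes, live_inputs):
        """Return True if the candidate job can be dispatched safely.

        job_writes: set of files the candidate job would (over)write.
        live_inputs: set of files still to be read by running instances.
        """
        # Overwrite-safe: dispatch only if no live input would be clobbered.
        return job_writes.isdisjoint(live_inputs)

    print(osc_safe({'common.out'}, live_inputs={'common.out'}))  # False: hold the job
    print(osc_safe({'mine.out'}, live_inputs={'common.out'}))    # True: dispatch

Under such a check, held-back jobs cost concurrency but no extra storage, which matches the OSC end of the trade-off described above.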


[Figure 12 appears here: bar chart of time (s) for Orig and WaFS runs of the four gmxbench molecules, with WaFS overheads of 0.39% (d.dppc), 1.6% (d.lzm), 10.02% (d.poly-ch2) and 11.74% (d.villin) shown above the bars.]

Figure 12. Performance of GROMACS gmxbench in the absence and presence of WaFS. The performance overhead of WaFS is shown as a percentage above each bar.

Table IV. WaFS-measured job service times and file sizes for the mdrun stage of GROMACS gmxbench.

Benchmark     Job service time (s)     File size (MB)
d.dppc        5433                     12
d.lzm         708                      2
d.poly-ch2    112                      2
d.villin      97                       1

6. RELATED WORK

To enable multiple workflow instances to execute concurrently, existing systems adopt various strategies to avoid filename conflicts. The simplest solution is to execute the workflow instances sequentially. Other solutions differ from project to project [7, 9].

Grid Execution Language (GEL) [7] is a scripting language developed by the Bioinformatics Institute, Singapore to facilitate job scheduling in Grid computing. It allows multiple instances of the same workflow to execute concurrently by creating a separate directory (i.e. working directory) for each instance. All binaries (in each instance) run in the same working directory, where they read, create and modify files. The output of the instance may finally be moved from its working directory to staging sites.

Unlike GEL, DAGMan [9] and Condor [33] (on which DAGMan is built) provide a number of complementary mechanisms to manage multiple instances of similar jobs and to help avoid filename conflicts. For output files, Condor uses the $(Cluster) (i.e. job ID) macro when naming them, so that each name is unique to its job instance.

Of course, these solutions can avoid filename conflicts, but none of them addresses the resulting drawbacks: they either suffer from low inter-workflow instance concurrency or incur a large storage overhead. To overcome the drawbacks of the existing solutions while enjoying their advantages, we advocate WaFS, a more systematic and comprehensive approach, to gather and exploit the dataflow information for selectively controlling the interaction between workflow instances (via OSC and VNS).
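As a rough illustration of the kind of dataflow gathering involved, consider the Python sketch below. It is our simplification, not the WaFS implementation: it assumes a hypothetical trace of (job, mode, filename) records, such as a monitor might intercept, and derives producer-consumer edges from read-after-write pairs.

    from collections import defaultdict

    def infer_dataflow(accesses):
        """accesses: iterable of (job, mode, filename), mode in {'read', 'write'}.
        Returns producer -> set-of-consumers edges from read-after-write pairs."""
        last_writer = {}              # filename -> job that most recently wrote it
        edges = defaultdict(set)
        for job, mode, fname in accesses:
            if mode == 'write':
                last_writer[fname] = job
            elif fname in last_writer and last_writer[fname] != job:
                edges[last_writer[fname]].add(job)
        return dict(edges)

    # Job A produces f1; jobs B and C consume it, so A -> {B, C}.
    trace = [('A', 'write', 'f1'), ('B', 'read', 'f1'), ('C', 'read', 'f1')]
    print(infer_dataflow(trace))      # {'A': {'B', 'C'}}

A scheduler holding such edges can tell which files are still live inputs, which is the information the OSC and VNS policies exploit.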


Dataflow concepts have been used in a variety of contexts, including computer architecture (e.g. [34]). Typically, dataflow is used as a technique to improve instruction-level parallelism, which is roughly analogous to intra-workflow instance concurrency. OSC and VNS, in contrast, use dataflow information to improve inter-workflow instance concurrency (which could be compared with Simultaneous Multithreading (SMT) processor designs [35, 36]). We have also already drawn the analogy between VNS and register renaming [11, 12] in processor design. However, register renaming has practical limits due to VLSI space budgets, whereas VNS is less constrained because storage for different file versions is far more abundant. Nonetheless, in the software systems context, OSC and VNS are unique in their ability to improve job scheduling through the integration of the file namespace manager (for gathering and controlling dataflow information) and the scheduler (for exploiting that information).
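The renaming analogy can be made concrete with a toy sketch (our illustration; the class and the version-suffix naming scheme are hypothetical, not the WaFS interface): like a renamed register, each write is redirected to a fresh physical version, so concurrent writers of the same logical name never collide.

    import itertools

    class VersionedNamespace:
        """Toy versioned namespace: every write gets a fresh physical version,
        so a later writer never overwrites an earlier writer's file."""

        def __init__(self):
            self._versions = itertools.count()
            self._current = {}        # logical name -> latest physical name

        def open_for_write(self, name):
            physical = f"{name}.v{next(self._versions)}"
            self._current[name] = physical
            return physical

        def open_for_read(self, name):
            # Readers are directed to the newest version of the logical name.
            return self._current.get(name, name)

    ns = VersionedNamespace()
    print(ns.open_for_write('out.dat'))   # out.dat.v0
    print(ns.open_for_write('out.dat'))   # out.dat.v1 (no overwrite)
    print(ns.open_for_read('out.dat'))    # out.dat.v1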

In compiler optimization, dataflow and control-flow analyses are common. Often, the compiler attempts to improve performance by increasing instruction-level parallelism and by re-ordering or transforming code to reduce overheads. OSC and VNS likewise improve the degree of concurrency. However, neither of our proposed strategies attempts to re-order or transform the jobs of a workflow, since there are (as yet) no higher-level semantics (e.g. a programming language for a compiler) to constrain and guide such transformations.

Using dataflow information to minimize storage overhead is also found in recent work by Ramakrishnan et al., who considered the scheduling of data-intensive workflows with more general shapes onto a set of storage-constrained distributed resources [37]. Their basic approach is to add a cleanup job for a data file when that file is no longer required by other jobs in the workflow, or when it has already been transferred to permanent storage (as determined by dataflow information). As a result, unused files are deleted promptly (i.e. garbage collected), and the amount of storage used by the workflow can be reduced significantly. In contrast, the storage minimization in OSC and VNS is controlled directly by the batch scheduler based on the dataflow information gathered in WaFS, instead of by scheduling cleanup jobs.
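The cleanup rule can be paraphrased in a few lines of Python (our simplification, not Ramakrishnan et al.'s code): a file becomes reclaimable once every job that consumes it has finished; the permanent-storage case would simply remove the file's remaining consumers.

    def cleanup_candidates(consumers, finished):
        """consumers: dict mapping a file to the set of jobs that still read it.
        finished: set of jobs that have completed.
        A file is reclaimable once all of its consumers have finished."""
        return [f for f, jobs in consumers.items() if jobs <= finished]

    consumers = {'inter1.dat': {'B', 'C'}, 'inter2.dat': {'C'}}
    print(cleanup_candidates(consumers, {'B'}))        # []
    print(cleanup_candidates(consumers, {'B', 'C'}))   # ['inter1.dat', 'inter2.dat']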

File systems have been studied in various computing environments, for different workloads, and with different goals. For example, FileNet [38] was designed to support a class of read-mostly workloads (e.g. document image processing) in a distributed system. Zebra [39] is a network file system that combines the two ideas of a log-structured file system (LFS) and striping with parity calculations to increase throughput. Elephant [40, 41] is a versioning file system with a design goal of automatically retaining all important versions of the user's files. More recently, the Google File System (GFS) [42] was developed to address fault tolerance, the management of large data sets and the optimization of append-intensive files for large distributed data-intensive applications. In contrast to these file systems, WaFS is oriented to high-performance workflow-based workloads. It is designed to layer on top of traditional file systems and to discover workflow-specific information automatically. With respect to this goal, WaFS bears some similarities to several data provenance systems, such as Karma [43, 44], PASS [45] and ES3 [46], to name a few. These systems, which may require user involvement to instrument the workflow components (e.g. Karma) or not (e.g. PASS and ES3), are designed to aid users in recording and querying provenance data. In contrast, WaFS is designed to integrate with the batch scheduler to improve workflow scheduling. Its design is therefore simplified to detecting data-dependency information and maintaining a namespace for each instance; it does not gather other information to support a sophisticated query language as in Karma, PASS and ES3. Consequently, we expect WaFS to have lower overhead.

As far as the integration of the batch scheduler and the file system is concerned, WaFS is similar to BAD-FS [47], a batch-aware distributed file system. However, BAD-FS (at this time) is designed to deal with the issues of data consistency and replication, not scheduling. In contrast, WaFS is motivated by a desire to improve job concurrency and to allow for efficient deadlock avoidance via dataflow information.

Versioning file systems are not a new idea either. Traditionally, they are designed to record the history of changes to files in order to facilitate easy back-ups and rollbacks to previous versions of files. Versioning file systems in the literature include Elephant [40], Versionfs [48] and Moraine [49]. However, none of these systems is integrated with a scheduler for the purpose of improving job concurrency.

7. CONCLUSION AND FUTURE WORK

In this paper, we studied the potential of using dataflow information to maximize job concurrency while resolving filename conflicts. Our contributions are as follows:

1. We propose the WaFS Scheduler, a novel approach that integrates the file system with the batch scheduler to collect dataflow information and make it available to the control-flow-driven batch scheduler to facilitate workflow scheduling.

2. To exploit the inferred dataflow information, we propose, and evaluate through simulation studies, a set of simple yet effective scheduling policies, VNS and OSC. The essence of these policies is to use the dataflow information to remove the artificial limits (i.e. filename conflicts) on the degree of concurrency, thereby allowing the batch scheduler to better exploit the available HPC resources.

3. To measure the overhead of WaFS on potential applications, we build a prototype and examine the GROMACS benchmarking system, gmxbench.

Our results show that by combining dataflow information with a versioned namespace (i.e. VNS), the makespan can be improved by over an order of magnitude while the storage and computation overheads remain low. In addition, the dataflow information also enables trade-offs between concurrency and storage overhead (i.e. OSC) when there are (potential) filename conflicts.

Given the potential for significant improvements in performance, we will be implementing both OSC and VNS in our own scheduler system. We are also exploring hybrids of OSC and VNS that can make more fine-grained trade-offs between computation time and storage space.

In addition, we are planning to extend our approaches to scheduling computational workflows in cloud computing systems [50, 51]. Cloud computing provides a completely new model of utilizing computing infrastructure, in which compute, storage and network resources are virtualized into a set of virtual machines (VMs) and dynamically allocated to applications on a pay-per-VM basis [52–54]. This creates opportunities for our approaches to optimize the use of VMs. More specifically, the scheduler must be aware of a job's data dependencies so that it can shut down idle VMs and deallocate them in time to reduce cost. Otherwise, the scheduler cannot decide when a particular idle VM is no longer needed, since it has no way of knowing whether the idle VM contains intermediate results that are still required. A follow-up research problem is the scheduling of the exposed concurrent jobs in clouds; its major challenges are the highly dynamic and possibly heterogeneous nature of cloud resources, as well as their opaqueness in terms of data locality [50]. Some progress has been made in this area [50, 55, 56].
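As a first approximation, the dependency check behind such a shutdown decision could look like the sketch below (hypothetical names and a deliberately simplified model; a real scheduler would also have to account for replicas and data staging):

    def can_deallocate(vm_local_files, pending_inputs):
        """vm_local_files: files held only on the idle VM's local storage.
        pending_inputs: input files required by jobs that have not yet run.
        The idle VM may be released only if it holds no still-needed data."""
        return vm_local_files.isdisjoint(pending_inputs)

    print(can_deallocate({'inter.dat'}, {'inter.dat'}))  # False: keep the VM alive
    print(can_deallocate({'inter.dat'}, set()))          # True: safe to deallocate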

REFERENCES

1. Werner T. Target gene identification from expression array data by promoter analysis. Biomolecular Engineering 2001; 17:87–94.
2. Szafron D, Lu P, Greiner R, Wishart DS, Poulin B, Eisner R, Lu Z, Anvik J, Macdonell C, Fyshe A, Meeuwis D. Proteome Analyst: Custom predictions with explanations in a web-based tool for high-throughput proteome annotations. Nucleic Acids Research 2004; 32:W365–W371. Available at: http://www.cs.ualberta.ca/~bioinfo/PA/ [31 March 2011].
3. GROMACS. Available at: http://www.gromacs.org [31 March 2011].
4. Schmidt M, Baldridge K, Boatz J, Elbert S, Gordon M, Jensen J, Koseki S, Matsunaga N, Montgomery J. The general atomic and molecular electronic structure system. Journal of Computational Chemistry 1993; 14:1347–1363. Available at: http://www.msg.ameslab.gov/GAMESS/GAMESS.html [31 March 2011].
5. Ludascher B, Altintas I, Berkley C, Higgins D, Jaeger E, Jones M, Lee EA, Tao J, Zhao Y. Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience 2005; 18(10):1039–1065. Special Issue on Scientific Workflows.
6. Muchnick SS. Advanced Compiler Design and Implementation. Morgan Kaufmann: Los Altos, CA, 1997.
7. Lian CC, Tang F, Issac P, Krishnan A. GEL: Grid execution language. Journal of Parallel and Distributed Computing 2005; 65:857–869.
8. Taylor I, Shields M, Wang I, Rana O. Triana applications within grid computing and peer to peer environments. Journal of Grid Computing 2004; 1:199–217.
9. Condor Team, 2004. Available at: http://www.cs.wisc.edu/condor/dagman [31 March 2011].
10. Wang Y. Transparent dataflow detection and use in workflow scheduling: Concurrency and deadlock avoidance. PhD Thesis, University of Alberta, Canada, 2008.
11. Sima D, Fountain T, Kacsuk P. Advanced Computer Architectures: A Design Space Approach. Addison-Wesley: Reading, MA, 1997.
12. Hennessy JL, Patterson DA. Computer Architecture: A Quantitative Approach (3rd edn). Morgan Kaufmann: Los Altos, CA, 2005.
13. Wang Y, Lu P. On the benefits of a workflow-aware versioning filesystem in metacomputing systems. The Eighth International Conference on High Performance Computing in Asia Pacific Region, Beijing, China, 2005.
14. Wang Y, Lu P. Using dataflow information to improve inter-workflow instance concurrency. Sixth International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), Dalian, China, 2005.
15. Abbott B et al. Search for gravitational waves from binary inspirals in S3 and S4 LIGO data. Available at: http://edoc.mpg.de/316969.
16. Wang J, Kuehl H, Sacchi MD. Least-squares wave-equation AVP imaging of 3D common azimuth data. Proceedings of the 73rd Annual International Meeting, Society of Exploration Geophysicists, Dallas, TX, U.S.A., 2003.
17. Blaha P, Schwarz K, Madsen G, Kvasnicka D, Luitz J. WIEN2k: An augmented plane wave plus local orbitals program for calculating crystal properties. Technical Report, Institute of Physical and Theoretical Chemistry, Vienna University of Technology, 2001.
18. Lake R, Schaeffer J, Lu P. Solving large retrograde analysis problems using a network of workstations. Advances in Computer Chess 1994; VII:135–162.
19. Schaeffer J, Lake R. Solving the game of checkers. In Games of No Chance, Nowakowski RJ (ed.), vol. 20. Cambridge University Press: Cambridge, 1996.
20. Rosenberg AL. On scheduling mesh-structured computations for Internet-based computing. IEEE Transactions on Computers 2004; 53(9):1176–1186.
21. Glatard T, Montagnat J, Pennec X. Grid-enabled workflows for data intensive medical applications. The 18th IEEE Symposium on Computer-Based Medical Systems, Trinity College Dublin, Ireland, 2005; 537–542.
22. Crandall PE, Aydt RA, Chien AA, Reed DA. Input/output characteristics of scalable parallel applications. Proceedings of the IEEE/ACM Conference on Supercomputing, San Diego, CA, U.S.A., 1995; 59–89.
23. Hulith P. The AMANDA experiment. Proceedings of the XVII International Conference on Neutrino Physics and Astrophysics, Helsinki, Finland, June 1996.
24. Sum AK, de Pablo JJ. Nautilus: Molecular simulation code. Technical Report, Department of Chemical Engineering, University of Wisconsin-Madison, 2002.
25. NASA Ames and the Courant Institute at NYU. Cart3D. Available at: http://people.nas.nasa.gov/~aftosmis/cart3d/cart3Dhome.html.
26. Zhou S. LSF: Load sharing in large-scale heterogeneous distributed systems. Proceedings of the Workshop on Cluster Computing, Florida State University, Tallahassee, U.S.A., December 1992.
27. Henderson R, Tweten D. Portable Batch System: External reference specification. NASA Ames Research Center, 1996.
28. Sulistio A, Buyya R. A time optimization algorithm for scheduling bag-of-task applications in auction-based proportional share systems. Proceedings of the 17th International Symposium on Computer Architecture and High Performance Computing, October 2005; 235–242.
29. Yu Z, Shi W. An adaptive rescheduling strategy for grid workflow applications. Proceedings of the IEEE International Parallel and Distributed Processing Symposium, Long Beach, CA, U.S.A., 2007; 214–220.
30. Blythe J, Gil Y, Deelman E. Coordinating workflows in shared grid environments. Proceedings of the 14th International Conference on Automated Planning and Scheduling, Whistler, BC, Canada, 2004.
31. Zhang Y, Koelbel C, Kennedy K. Relative performance of scheduling algorithms in grid environments. Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid, Rio de Janeiro, Brazil, 2007.
32. Gburzynski P. SMURPH. Available at: http://www.cs.ualberta.ca/~pawel/SMURPH/smurph.html.
33. Thain D, Tannenbaum T, Livny M. Condor and the grid. In Grid Computing: Making the Global Infrastructure a Reality, Berman F, Fox G, Hey T (eds). Wiley: New York, 2003.
34. Arvind K, Nikhil RS. Executing a program on the MIT tagged-token dataflow architecture. IEEE Transactions on Computers 1990; 39(3):300–318.
35. Eggers SJ, Emer JS, Levy HM, Lo JL, Stamm RL, Tullsen DM. Simultaneous multithreading: A platform for next-generation processors. IEEE Micro 1997; 17(5):12–19.
36. Ungerer T, Robic B, Silc J. A survey of processors with explicit multithreading. ACM Computing Surveys 2003; 35(1):29–63.
37. Ramakrishnan A, Singh G, Zhao H, Deelman E, Sakellariou R, Vahi K, Blackburn K, Mayers D, Samidi M. Scheduling data-intensive workflows onto storage-constrained distributed resources. Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid, Rio de Janeiro, Brazil, 2007; 401–409.
38. Edwards DA, McKendry MS. Exploiting read-mostly workloads in the FileNet file system. Proceedings of the 12th ACM Symposium on Operating Systems Principles, Litchfield Park, AZ, U.S.A., 1989; 58–70.
39. Hartman JH, Ousterhout JK. The Zebra striped network file system. In High Performance Mass Storage and Parallel I/O: Technologies and Applications, Jin H, Cortes T, Buyya R (eds). IEEE Computer Society Press/Wiley: New York, 2001; 309–329.
40. Santry D, Feeley M, Hutchinson N, Veitch A, Carton R, Ofir J. Deciding when to forget in the Elephant file system. The 17th ACM Symposium on Operating Systems Principles (SOSP), Kiawah Island Resort, near Charleston, SC, U.S.A., 1999; 110–123.
41. Santry DJ, Feeley MJ, Hutchinson NC, Veitch AC. Elephant: The file system that never forgets. Workshop on Hot Topics in Operating Systems, Rio Rico, AZ, U.S.A., 1999; 2–7.
42. Ghemawat S, Gobioff H, Leung S-T. The Google file system. Nineteenth ACM Symposium on Operating Systems Principles (SOSP), Bolton Landing, NY, U.S.A., 2003; 29–43.
43. Simmhan YL, Plale B, Gannon D. A framework for collecting provenance in data-centric scientific workflows. Proceedings of the IEEE International Conference on Web Services, Chicago, IL, U.S.A., 2006.
44. Simmhan YL, Plale B, Gannon D. Query capabilities of the Karma provenance framework. Concurrency and Computation: Practice and Experience 2008; 20(5):441–451.
45. Holland DA, Seltzer MI, Braun U, Muniswamy-Reddy K-K. Passing the provenance challenge. Concurrency and Computation: Practice and Experience 2008; 20(5):531–540.
46. Frew J, Metzger D, Slaughter P. Automatic capture and reconstruction of computational provenance. Concurrency and Computation: Practice and Experience 2007; 20(5):485–496.
47. Bent J, Thain D, Arpaci-Dusseau AC, Arpaci-Dusseau RH, Livny M. Explicit control in a batch-aware distributed file system. Proceedings of Networked Systems Design and Implementation (NSDI), 2004; 365–378.
48. Muniswamy-Reddy K-K, Wright CP, Himmer A, Zadok E. A versatile and user-oriented versioning file system. Proceedings of the Third USENIX Conference on File and Storage Technologies (FAST), San Francisco, CA, 2004; 115–128.
49. Yamamoto T, Matsushita M, Inoue K. Accumulative versioning file system Moraine and its application to metrics environment MAME. Proceedings of the Eighth ACM SIGSOFT International Symposium on Foundations of Software Engineering, San Diego, CA, U.S.A., 2000; 80–87.
50. Warneke D, Kao O. Nephele: Efficient parallel data processing in the cloud. MTAGS'09: Proceedings of the Second Workshop on Many-Task Computing on Grids and Supercomputers, Portland, OR, November 2009.
51. Vecchiola C, Pandey S, Buyya R. High-performance cloud computing: A view of scientific applications. Proceedings of the 10th International Symposium on Pervasive Systems, Algorithms and Networks, Kaohsiung, Taiwan, 2009.
52. Yazir YO, Matthews C, Farahbod R, Neville S, Guitouni A, Ganti S, Coady Y. Dynamic resource allocation in computing clouds using distributed multiple criteria decision analysis. Proceedings of the IEEE Third International Conference on Cloud Computing, Miami, FL, July 2010; 91–98.
53. Armbrust M, Fox A, Griffith R, Joseph AD, Katz RH, Konwinski A, Lee G, Patterson DA, Rabkin A, Stoica I, Zaharia M. Above the clouds: A Berkeley view of cloud computing. Technical Report UCB/EECS-2009-28, EECS Department, University of California, Berkeley, February 2009.
54. Youseff L, Butrico M, Da Silva D. Toward a unified ontology of cloud computing. Grid Computing Environments Workshop, Austin, TX, November 2008.
55. Deelman E, Singh G, Su M-H, Blythe J, Gil Y, Kesselman C, Mehta G, Vahi K, Berriman GB, Good J, Laity A, Jacob JC, Katz DS. Pegasus: A framework for mapping complex scientific workflows onto distributed systems. Scientific Programming 2005; 13(3):219–237.
56. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D. Dryad: Distributed data-parallel programs from sequential building blocks. EuroSys'07: Proceedings of the Second ACM SIGOPS/EuroSys European Conference on Computer Systems, Lisboa, Portugal, 2007; 59–72.