
Performance-aware Workflow Management for Grid Computing

D.P. Spooner1, J. Cao2, S.A. Jarvis1, L. He1 and G.R. Nudd1

1 Department of Computer Science, University of Warwick, Coventry, CV4 7AL, UK
2 Center for Space Research, Massachusetts Institute of Technology, Cambridge, USA

Email: [email protected]

Grid middleware development has advanced rapidly over the past few years to support component-based programming models and service-orientated architectures. This is most evident with the forthcoming release of the Globus toolkit (GT4) which represents a convergence of concepts (and standards) from the web services community. Grid applications are increasingly modular, composed of workflow descriptions that can feature both resource and application dynamism. To manage workflow effectively, it is advantageous to understand the performance implications of executing individual tasks in a given configuration to ensure that specific qualities of service are achieved across the entire flow. In this work, a multi-tiered workflow management system is described. The system couples a performance modelling tool with local and global scheduling algorithms that aim to meet user-specified deadlines while resolving user and resource conflicts.

Keywords: Workflow Management, Grid Middleware, Performance Modelling

1. INTRODUCTION

Grid computing has undergone a number of significant changes in a relatively brief time-frame [1]. With tremendous community effort, supporting grid middleware has expanded significantly from rudimentary batch processing front-ends to fully distributed components with complex scheduling, reservation and information sharing facilities. Central to this evolutionary process have been the releases of the Globus toolkit [2], which represents the de-facto platform for grid applications. With the imminent release of GT4, grid computing moves closer towards web services frameworks [3], where applications are increasingly dynamic, modular and service-orientated.

Component-based systems typically require workflow descriptions that reflect both organisational and technical boundaries. Applications may span multiple administrative domains in order to obtain specific data or to utilise specific processing capabilities. Equally, applications may wish to select components from a particular domain to increase throughput or reduce execution costs. In the grid context for example, an application may have limited choice in data acquisition (possibly requiring a particular type of instrumentation), but more scope for data post-processing (which requires a cluster of commodity processors). A further level of decomposition may exist within the organisation or domain, where individual task atoms are composed to provide the overall service.

Managing workflow is complex and currently the subject of many research projects. The reasons for this are manifold: component architectures encourage code re-use, the ability to build dynamic applications is attractive to active middleware developers and it is a natural way to utilise services in a manner that promotes inter-organisation collaboration. Much of the existing work focuses on workflow languages and how they describe component features and facilities. These languages typically support a variety of operators to express the workflow, with methods to compile (or translate) these descriptions into a working application. In addition, they often describe various input and output conventions used to inter-connect components.

The migration to component-based systems, orchestrated by a workflow management system, has interesting consequences for the performance community. First, the performance implications will become increasingly difficult to evaluate as applications are constructed from components that are deeply de-coupled and widely distributed. It will require an understanding of the impact of executing complex applications on different architectures under dynamic load conditions, as well as the communication and data-transfer effects. There are a number of performance evaluation techniques that are useful and can be applied to this type of activity, including simulation-based approaches [8] and analytical modelling techniques [5, 6]. In certain conditions, historical data is applicable, although usually a hybrid approach is required to allow some form of performance parameterisation [4].

A second performance criterion relates to a strengthening of the quality-of-service (QoS) characteristics and the implications for the anticipated ‘grid user’. Grid users were traditionally perceived as being members of academic scientific collaborations with relatively straightforward requirements in terms of service quality. As the commercial features of web services have influenced Grid middleware design, QoS support has been extended to offer a range of additional facilities. Timeliness, response and experimental accuracy may be differentiating factors for grid resource providers and future grid users. These factors are closely associated with grid performability [7].

Grid workload management systems can be considered as a supplementary layer to existing grid schedulers, an area of research that is well established in the grid community. One such system, Pegasus [9], generates scripts that Condor's [10] DAGman [11] can use in order to execute local tasks. Layering systems in this manner is often seen in grid work and is a customary approach while standards are agreed. In the interim, a number of systems (such as ICENI [12]) use proprietary mechanisms to handle workflow, and include interactive workflow designers, solvers and execution modules to build, plan and execute applications end-to-end. As standard languages develop, it is important to consider the performance effects at all stages in a component's lifetime, including the workflow actuation phase. As evident in work on grid scheduling, significant gains to service quality can be made when application and user behaviour is predicted.

The system described in this paper is an extension to an existing grid scheduling system [13] that maps tasks to appropriate resources based on their expected performance behaviour. The system does not prescribe any particular workflow language, neither is it tied to any particular implementation. The purpose of the scheduler is to predict how an application workflow will behave a priori, permitting stronger assertions regarding quality of service. The research described in this paper uses an analytical modelling system [6] to anticipate component behaviour given a particular resource configuration. This assumes that the models are sufficiently parameterised and produce performance data of reasonable fidelity. While these factors can be problematic when users submit custom code (as in batch processing systems), the use of components is helpful in this context as high quality performance models can be written by a performance specialist when the component is developed.

The paper is organised as follows: section 2 introduces the grid workflow management problem and describes the Warwick architecture and the supporting performance service; section 3 includes a detailed account of the local and global scheduling algorithms. Section 4 evaluates the system and provides supporting experimental results. Related work and conclusions are documented in sections 5 and 6.

FIGURE 1. The multi-tiered workflow management architecture. Grid level is used to describe inter-domain resource balancing. Local level is used to describe the mapping of tasks to resources. (Levels shown: grid level, workflow handled by brokers; local level, sub-workflows handled by schedulers; resource level, tasks executed on the underlying resources.)

2. GRID WORKFLOW MANAGEMENT

Workflow management systems have been the subject of research for a number of years (see [9, 11, 14, 15]), but have grown in popularity since the advent of web services and more recently grid computing. Pioneering work in the grid community has allowed applications to be distributed across different domains to form virtualised applications. This promotes collaboration, a key goal in grid computing, as it improves the ability to co-ordinate resources and share data. The challenges with grid workflow involve the traditional problems related to wide-scale virtualisation: cross-domain management, where processes in the grid traverse different domains of administration; and dynamic state, where the availability and performance of grid resources may vary significantly over time.

Maintaining QoS targets for a flow of tasks across different domains can create resource and user conflicts that are potentially more severe than in single task scheduling. User conflicts can occur when a user in one domain has a priority that does not equate to a priority in a foreign domain, or in which there is not the same level of privilege. In this case it is necessary to map user classes from one domain to another based on a set of definable characteristics. Inevitably, this sort of mapping is the responsibility of the domain administrator and depends on the required behaviour. Similarly, resource conflicts can occur when tasks from one workflow require systems in use by another workflow.

Previous work at Warwick addressed similar issues with the development of a multi-tiered workload management system that consists of a local scheduler [13] for managing applications on physical resources and a grid scheduler [16, 17] that uses a hierarchy of brokers to load balance across distributed domains. Performance models were developed for both scheduled applications and the underlying resources to assist in workload placement decisions. This system, known as TITAN, provides a solid basis for managing more complex applications with associated topologies (workflow). The evaluation of these performance models provides a level of performance awareness that can be used to enhance service quality, and the two-tiered architecture assists in handling workflows across domains.

The distinction between handling tasks at the local and grid level is crucial: local resources must be managed in order to obtain good resource utilisation and throughput. It is the responsibility of the local scheduler to map individual tasks to resources. A grid level scheduler operates across domains to move task sequences to suitable resources that can meet service quality constraints. This is used to provide a general load balancing effect where faster architectures can accommodate a greater proportion of the workload. Figure 1 illustrates the system architecture with brokers at the grid level, schedulers at the local level and resources as the underlying processing systems.

Local resource owners are interested in their own systems and how these resources are utilised. However, the emphasis (or balance) of system parameters may be different depending on the particular domain. One environment might offer a system that aims to move jobs through the system rapidly (high throughput), with users treated on an equal basis. Individual task deadlines may be considered less important than the overall system throughput. A second environment may have different domain criteria: deadlines should always be met (or at least a good attempt undertaken) with a reasonably good throughput, otherwise the domain is limited on how many customers can be accommodated. The local system developed in this work is based on a genetic algorithm (GA) that uses a fitness function that can be adapted to suit various requirements for scheduling. A tuple of deadline, idle-time and end-to-end time is weighted to provide a single penalty measure that can select ‘good quality’ schedules from a population of candidate solutions. The GA uses a coding scheme that facilitates interdependent relationships and workflow, while the deadlines are based on penalty cost functions that are shaped by ‘user class’.

Management at the grid level faces a slightly different problem. Brokers are used to exchange tasks to create a load balancing effect and advertise the capabilities of their local domains to other brokers. A compromise exists between executing tasks locally (possibly for economic or contractual reasons) as well as locating more appropriate remote resources that might better meet deadlines. There are a number of higher level issues involved with multi-domain management that are not considered in this work, including trust, allegiances and economic cost models [18]. Performance models can still assist at this level as they allow brokers to speculate on the potential performance of a remote resource, given appropriate parameterisation. In a workflow scenario this is important as the consequences of misplacement can have far-reaching effects across different domains.

2.1. Workflow Definition

In this work, workflows are considered both abstractly and hierarchically. At the atomic level are tasks that are composed into intra-domain applications referred to as sub-workflows. Sub-workflows can then be connected into workflows that represent the entire application. Relationships between tasks and sub-workflows include continuation operators, forks and joins. In practice an XML script is used to describe workflows and this is based on the ANT Java build tool [19], which provides a mechanism for representing runnable applications with dependencies. The choice of workflow language is independent of this work; here we concentrate on the performance model and scheduling characteristics.

Tasks: Tasks are the basic building blocks of the application in the grid workflow. They are the individual components or jobs that are executed on a local grid resource. They are typically MPI tasks that execute on multiple processors and can be shaped by the GA to change the level of parallelism. By selecting an appropriate mapping for the task, good quality schedules can be constructed that meet the various QoS requirements.

Sub-workflows: A sub-workflow is a sequence of closely related tasks that are executed in a predefined order on local grid resources. This typically occurs when there is significant communication between two components and where it is unlikely that decomposition across domains will improve application response. Again, the GA has some freedom to reorganise the task order as long as it does not change the application's task precedence. Conflicts occur when tasks from different sub-workflows require the same resource simultaneously, and it is the responsibility of the GA to manage this interleaving.

Workflows: A grid application can be represented as a sequence of several different activities represented by sub-workflows. These activities are loosely coupled and may require multi-sited grid resources.
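To make this hierarchy concrete, the sketch below models tasks, sub-workflows and workflows as simple Python records. The class and field names are hypothetical and for illustration only; in the system itself, workflows are described by an ANT-based XML script as noted above.

from dataclasses import dataclass, field
from typing import List

# Hypothetical structures mirroring the task / sub-workflow / workflow hierarchy
# described above (illustration only; the real system uses an ANT-style XML script).
@dataclass
class Task:
    name: str                                            # e.g. an MPI kernel
    processors: int                                      # level of parallelism (adjustable by the GA)
    depends_on: List[str] = field(default_factory=list)  # intra-sub-workflow precedence

@dataclass
class SubWorkflow:
    name: str
    tasks: List[Task]                                    # closely coupled tasks kept within one domain
    deadline: float                                      # user-specified deadline (seconds)

@dataclass
class Workflow:
    name: str
    sub_workflows: List[SubWorkflow]                     # loosely coupled, may span domains

# The simplest sub-workflow form used later in the paper: T6 depends on T3.
wf = Workflow("demo", [SubWorkflow("sw1",
                                   [Task("T3", processors=4),
                                    Task("T6", processors=2, depends_on=["T3"])],
                                   deadline=120.0)])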

Figure 2 illustrates the structure of the TITAN workflow management system. Workflow descriptions, which define each of the tasks with clearly defined pre- and post-activities, enter the system at the portal level and are passed to a broker. The broker runs an initial execution simulation of the workflow using the local performance service and neighbouring brokers to develop a preliminary schedule. The simulation results can be returned to the portal for user agreement or executed directly.

FIGURE 2. Key components in the TITAN architecture. Workflow descriptions are submitted to a broker, which locates a suitable resource set and passes the sub-workflow scripts to a local scheduler. (The figure shows grid users and a user portal feeding workflow management at the global grid level, sub-workflow management and resource scheduling (TITAN) over local resources, with performance services (PACE) and Globus container services supporting both levels.)

The actual execution of the workflow may diverge from the simulation due to the dynamic nature of the grid environment. Where significant changes have occurred, the workflow may be sent back to the simulation engine and rescheduled. Scheduling the sub-workflow at the local resource level is similar to the global process except that the local grid sub-workflow scheduling has to deal with multiple tasks that may belong to different sub-workflows. The execution times have to be estimated with the extra consideration of conflicts, which may occur when multiple tasks require the same grid resource at the same time.

2.2. The PACE Performance Service

Central to this work is the ability to anticipate how components will behave on a particular architecture in a given configuration. The Performance Analysis and Characterisation Environment (PACE [6]) is used to develop high-fidelity models that are used to drive the decision making processes at all levels in the system. While the accuracy of the models has a direct influence on how effective the scheduling system is, it is possible to develop good quality models due to the reusable nature of component-based systems.

The PACE toolkit is based around an evaluation engine that takes separate resource and application models as inputs and is able to predict the execution time of a task prior to run time. A task's scalability (execution time vs. level of parallelism) can be determined using PACE. In the majority of cases, the product of a task's execution time and the number of processors it occupies is not constant and so PACE can be used to explore different run-time scenarios. This is useful for strong-scaling applications to prevent them over-occupying a local resource by using more processors than is necessary, and is especially apparent where the execution time may not be inversely proportional to the allocated size due to the presence of communication among components.

FIGURE 3. Key PACE components. Clients connect to a PACE performance service which runs inside an OGSA/OGSI container. This in turn compiles, caches and evaluates model objects using the PACE back-end tools. (The figure shows PACE clients issuing SOAP requests to a PACE grid front-end in a GT3 grid container, which drives an evaluation engine over application characterisation objects (application, subtask and parallel template objects) and hardware characterisation objects (computation, network (MPI), cache (L1, L2)).)

Due to the separation of resource and application models, it is possible for brokers to execute ‘what-if’ scenarios to identify how a sub-workflow will behave on a foreign architecture prior to runtime. This can be used to construct a preliminary workflow schedule. Each broker makes a resource model available as an ‘advert’ of its local scheduler. It can then share this model with neighbouring brokers.

To integrate with existing grid middleware, PACE is made available as a grid service through a front-end that lies between the existing toolkit and grid clients (including TITAN). Clients are able to invoke models and query the PACE evaluation engine using this interface, which currently runs inside a GT3 container. The performance service itself consists of a grid service factory [20], which can create application model instances that encapsulate the computation and communication behaviour inside a component using a combination of static code analysis and instrumentation. Resource descriptions are also installed which model the processing capability, cache behaviour and network performance. When combined, the models form an evaluation engine instance with which clients interact to obtain the relevant performance information.

FIGURE 4. 2D coding scheme which forms the basis of the GA evaluation and evolutionary phases. Task orders are represented as an m-ary array of task identifiers, with a corresponding binary representation in the processor dimension. (Example coding: preferential task order T3 T1 T6 T2 T5 T4, each column carrying a binary processor allocation map.)

The instance can be used by passing model parameters to a queryModel function that returns the expected execution time. A fundamental property is the expected number of processors. The GA adjusts this parameter in order to improve throughput, utilisation and deadline response by improving the interleaving of each sub-workflow. Where appropriate, other variables can be modified by the GA; this includes application parameters that affect runtime, such as the resolution of a grid in a weak-scaled code for example.

Figure 3 illustrates the structure of the PACE toolkit and its architecture as a grid service. Multiple clients can connect through the container to install application and resource models as well as communicate with the evaluation engine.
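As an illustration of how a scheduler might use such predictions, the sketch below stands in for the queryModel call with a simple analytical scaling curve and shows how the predictions can bound the useful processor count for a task. The function names, the Amdahl-style model and the threshold are assumptions made for this example; they are not the PACE interface itself.

# Stand-in for the performance service: predicted run time of a task on p
# processors, assuming a simple Amdahl-style curve (illustrative only; the real
# queryModel call evaluates compiled application and resource models).
def query_model(task: str, processors: int, serial_fraction: float = 0.1,
                base_time: float = 100.0) -> float:
    return base_time * (serial_fraction + (1.0 - serial_fraction) / processors)

# Use the predictions to find the point beyond which extra processors give less
# than a minimum relative gain, the point at which a scheduler might cap a
# strong-scaling task so that it does not over-occupy a local resource.
def useful_processor_limit(task: str, max_procs: int, min_gain: float = 0.05) -> int:
    best = 1
    for p in range(2, max_procs + 1):
        previous, current = query_model(task, p - 1), query_model(task, p)
        if (previous - current) / previous < min_gain:
            break
        best = p
    return best

print(useful_processor_limit("example_kernel", 16))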

3. THE GRID WORKFLOW SYSTEM

This section introduces the workflow management system that is responsible for analysing workflow descriptions and apportioning them to particular resources. Initially a local level algorithm is introduced; this shapes the tasks in a sub-workflow so that it uses appropriate resources within a cluster as guided by the analytical performance models. At this level it is necessary to ensure that tasks are run in an appropriate order (depending on the sub-workflow) and that deadlines and inter-dependent relationships are maintained. Following the local level, the grid level load balancing algorithm is described. The grid level works across domains and makes use of a different algorithm which aims to find appropriate resources (not necessarily the best) using a system of relaxing deadlines.

3.1. The Genetic Algorithm

A genetic algorithm is used by TITAN which forms the basis of the local workflow system. A two dimensional coding scheme is used to represent a schedule of parallel jobs in a cluster. Each column in the coding scheme specifies the allocation of processing nodes to a parallel job, while the order of these columns in the coding is the preferred sequence in which the corresponding jobs are to be executed. An example is given in Figure 4 and illustrates the coding for a small (6 task) schedule with associated processor mappings.

The task order sequence specified in the coding is termed the preferential task order as it may not represent the actual task sequence in time. Earlier task allocations that utilise resources that a current task requires may prevent the task from running, allowing a later sequenced task to operate instead. Such artefacts are discovered when the coding is transformed into the time domain as part of the schedule composition process. The coding scheme in Figure 4, for example, contains a blocked processor so that task 6 executes earlier than task 1, despite being later in the coding sequence. This transformation is also used to handle task dependencies and inter-task relationships.

In order to transform the coding into time, each task identifier (with its allocated processor maps) is passed to the PACE performance service to obtain an estimation of the application runtime. The expected time and host allocations can then be used to build a schedule for subsequent evaluation. A schedule is evaluated using three metrics: cumulative over-deadline, schedule end-to-end time and cumulative idle-time. These metrics are combined with domain-controlled weights to form a single comprehensive penalty metric (denoted by CP). CP is defined in Equation 1, where Γ, ω and θ are makespan, idle-time and over-deadline, respectively, and W_i, W_m and W_c are their associated weights. For a given weight combination, the lower the value of CP, the better the comprehensive performance of the schedule.

CP_k = (W_i Γ_k + W_m ω_k + W_c θ_k) / (W_i + W_m + W_c)    (1)

TITAN uses the GA to find a schedule with a low CP. The algorithm first generates a set of random schedules with what are felt to be reasonable properties (early deadline first, etc.). The performance of a schedule, evaluated by the CP, is normalised to a fitness function using a dynamic scaling technique which is shown in Equation 2, where CP_min and CP_max represent the best and the worst comprehensive performance in the schedule solution set, and CP_k is the penalty of the k-th schedule.

f_k = (CP_max − CP_k) / (CP_max − CP_min)    (2)

The GA uses f_k to determine the probability that the schedule is selected to create the next generation. Schedules with good comprehensive performance have a far greater chance of being selected than poorer schedules. Crossover and mutation operations are performed to generate the next generation of the current schedule set; this continues until the variation in CP stabilises (diversity → 0) or a pre-set number of iterations have occurred.
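The following sketch restates Equations (1) and (2) as they might be evaluated over a population of candidate schedules. The metric values and weights are arbitrary example numbers; in the system itself they come from the schedule-composition step and the domain configuration.

# Equation (1): comprehensive penalty for one schedule, from its makespan,
# idle time and cumulative over-deadline plus the domain-controlled weights.
def comprehensive_penalty(makespan, idle_time, over_deadline, w_i, w_m, w_c):
    return (w_i * makespan + w_m * idle_time + w_c * over_deadline) / (w_i + w_m + w_c)

# Equation (2): dynamic scaling of CP values into fitnesses in [0, 1].
def fitness(cp_values):
    cp_min, cp_max = min(cp_values), max(cp_values)
    if cp_max == cp_min:                       # degenerate population: all schedules equal
        return [1.0] * len(cp_values)
    return [(cp_max - cp) / (cp_max - cp_min) for cp in cp_values]

population_cp = [comprehensive_penalty(320, 40, 12, 1.0, 1.0, 2.0),
                 comprehensive_penalty(300, 55, 0, 1.0, 1.0, 2.0),
                 comprehensive_penalty(350, 20, 30, 1.0, 1.0, 2.0)]
print(fitness(population_cp))                  # higher fitness -> more likely to be selected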

The crossover operation selects two schedules from the current schedule set, cuts the schedules at a random location, merges the head portion of one schedule with the tail portion of the other and then reorders the schedules to produce two legitimate children; this process is illustrated in Figure 5. The mutation operation consists of two parts; one exchanges the execution order of two randomly selected jobs, while the other involves randomly adjusting job sizes as well as the allocation to host computers. The probability that these adjustments are inherited depends on whether they lead to performance improvements as well as the extent of the improvement. The probabilities of performing crossover and mutation are predetermined.

FIGURE 5. The crossover operator which merges the tail of one schedule with the head of another. Where tasks are duplicated, the crossover process is reversed.
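A minimal sketch of the order-based part of this crossover is given below, operating on the preferential task order only (each gene would also carry its processor column). The duplicate-repair step shown here, which completes each child with the missing tasks in the other parent's order, is one common way of producing legitimate children and is an assumption for illustration rather than the exact TITAN operator.

import random

# Illustrative single-point crossover on the preferential task order. Duplicates
# introduced by the cut are repaired by appending the remaining tasks in the
# other parent's order, so each child is a permutation of the task set.
def crossover(parent_a, parent_b):
    cut = random.randint(1, len(parent_a) - 1)

    def make_child(head_src, tail_src):
        child = head_src[:cut]
        child += [t for t in tail_src if t not in child]   # skip duplicated tasks
        return child

    return make_child(parent_a, parent_b), make_child(parent_b, parent_a)

a = ["T3", "T1", "T6", "T2", "T5", "T4"]
b = ["T6", "T3", "T1", "T2", "T4", "T5"]
print(crossover(a, b))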

The use of a GA with a fitness function that rewards good quality schedules has been applied to a number of machine task processing problems [21]. However, its application in this case is useful because it has the ability to integrate well with the performance tool and be guided by multiple metrics. This has implications for both user class designation (how important a user is) and how the domain controller prioritises scheduled work (resource utilisation versus deadline).

3.2. Local Workflow

At the local level, it is necessary to resolve conflicts that occur when allocated sub-workflows compete for the same resource. Each scheduled task will have associated deadlines together with dependencies specified in the workflow description. In most cases, the domain administrator will want to minimise the degree of deadline failure while increasing throughput and utilisation. The GA is able to deal with this problem effectively as the multi-metric fitness function aims to identify schedule solutions that offer a reasonable compromise between each of the constituent factors.

In addition to the single task representation given in Figure 4, it is possible to specify task dependencies using the ordering property of the chromosome codings. Figure 6 illustrates the method of mapping sub-workflow descriptions onto the two dimensional coding scheme used by the GA. Tasks are moved into the TITAN system by the broker in temporal order and dependencies can only be specified on tasks that have already been accepted by the system. In this case, T6 is dependent on T3 (representing the simplest form of sub-workflow). The preferred task order representation places T6 after T3, although they would run simultaneously if the dependencies did not exist (the processor maps do not clash).

FIGURE 6. 3 sub-workflows coded in TITAN.

The composition algorithm, which queries the relevant performance models to build the schedule, transforms each dependency into a task reservation that behaves in the same manner as a blocked processor, except that this reservation is only visible to the task with the particular dependency. Creating these reservations will inevitably result in significant blocks of idle time in the schedule as tasks are kept back to wait for the completion of the tasks on which they depend. However, assuming that the fitness function considers makespan and idle time (achieved by giving W_i and W_m non-zero weights), offspring schedules will invariably fill these gaps with other tasks as a consequence of the GA packing behaviour. This allows different sub-workflows to interleave according to their deadline, whilst respecting makespan and idle-time metrics.
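The sketch below gives a simplified, greedy version of this composition step: tasks are taken in preferential order and each starts at the earliest point where enough processors are free and all of its dependencies have finished, which is the effect of the per-task reservations described above. The run times, processor counts and placement rule are illustrative assumptions; the actual composition works on the PACE predictions and the processor maps in the coding.

# Simplified time composition: place tasks in preferential order, each starting
# only once its dependencies have finished (the 'reservation') and once enough
# processors are free. Gaps left by waiting tasks can be filled by later tasks.
def compose(order, runtimes, procs_needed, deps, total_procs):
    free_at = [0.0] * total_procs                 # time at which each processor becomes free
    finish, schedule = {}, {}
    for task in order:
        ready = max((finish[d] for d in deps.get(task, [])), default=0.0)
        chosen = sorted(range(total_procs), key=lambda p: free_at[p])[:procs_needed[task]]
        start = max(ready, max(free_at[p] for p in chosen))
        end = start + runtimes[task]
        for p in chosen:
            free_at[p] = end
        finish[task] = end
        schedule[task] = (start, end, chosen)
    return schedule

order = ["T3", "T1", "T6", "T2", "T5", "T4"]
runtimes = {"T3": 30, "T1": 40, "T6": 20, "T2": 50, "T5": 25, "T4": 35}
procs_needed = {"T3": 2, "T1": 3, "T6": 2, "T2": 2, "T5": 2, "T4": 3}
deps = {"T6": ["T3"]}                             # the simple sub-workflow used above
print(compose(order, runtimes, procs_needed, deps, total_procs=6))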

Using the fitness function to improve the interleaving of the sub-workflow is preferable to other techniques that aim to constrain the search through the use of complex coding mechanisms. Such cases can limit the search properties of the scheduler by over-constraining the GA. Adopting a technique that uses similar coding, crossover and mutation processes to the single task case allows exploration into diverse areas of the solution space without overburdening the search. As the scheduler is weighted towards reducing makespan and idle-time, in addition to resolving deadlines, it allows the system to perform its normal ‘gap-filling’ activities. This approach only affects the comprehensive penalty (CP) and so successful schedules that interleave the sub-workflow tend to enjoy a better overall performance and hence a lower CP.

Figure 7 illustrates the initial transform from coding to time. It is apparent in 7(a) that idle-time gaps are prevalent where dependent tasks cause blocking in the queue. However, after 100 iterations of the scheduler GA the queue has been packed, removing most of the idle-time.

FIGURE 7. Schedule containing workflow: before and after GA packing. (a) Time composition of a schedule with 3 sub-workflows; (b) improvement after 100 iterations of the GA.

3.3. Grid Workflow

At the grid level, a different algorithm is used that aims to allocate sub-workflows to a local scheduler that can run the tasks effectively. The algorithm aims to make effective use of the deadlines by allocating more time to sub-workflows where appropriate.

The approach at the grid level is to consider a workflow W as a set of sub-workflows S_i (i = 1, . . . , n). Let p_i be the number of pre-sub-workflows of S_i and q_i be the number of post-sub-workflows of S_i. Suppose that the group of potential local schedulers L_j (j = 1, . . . , m) constitutes an overall grid G. The scheduler aims to identify, for each sub-workflow, a triple of start time (π^s_i), end time (π^e_i) and allocated local grid (ζ_i). The scheduler processes each sub-workflow sequentially using the algorithm described in Algorithm 1.

The process is started with all the properties of each sub-workflow initialised to NULL. An additional parameter K is used to signify whether a sub-workflow has been scheduled. The scheduling process starts by looking for a schedulable sub-workflow, the pre-sub-workflows of which have all been scheduled. The start time of the chosen sub-workflow is configured with the latest end time of its pre-sub-workflows. The details of the sub-workflow as well as the start time are then submitted to a grid level broker. The brokers work together to discover an available local grid that can finish the sub-workflow at the earliest time. These are illustrated in Algorithm 1 as calls to local grid sub-workflow scheduling functions managed by the local TITAN system. Brokers filter the local grid resource information (from the information service) according to other properties (system architecture, etc.) and judge its applicability before a local grid is actually contacted. If there are a large number of local grids in the environment, a discovery scope can be defined to optimise the broker discovery performance. The scheduling ends when the end checkpoint is reached. In general, there is an additional adjustment or rescheduling procedure after scheduling. As shown in Algorithm 1, the adjustment is processed if the end time of a sub-workflow is earlier than the start times of its post-sub-workflows, so that the required deadlines of the sub-workflows are made less critical without increasing the scheduled execution time of the complete workflow. Another process can also be considered for rescheduling the less critical sub-workflows via the brokers. This is required when the cost and the execution time of the workflow are both considered. In this situation, less critical sub-workflows can be allocated to less powerful resources whose compute cost is less.

Algorithm 1 The grid-level scheduling algorithm
 1: /** Initialisation **/
 2: for i = 1 to n do
 3:   π^s_i = NULL; π^e_i = NULL;
 4:   ζ_i = NULL; K_i = FALSE;
 5: end for
 6: π^s_1 = π^e_1 = CurrentTime(); K_1 = TRUE;
 7: /** Scheduling **/
 8: for lp = 2 to n do
 9:   for i = 1 to n do
10:     if K_i = FALSE and K_{i_p} = TRUE (p = 1, . . . , p_i) then
11:       BREAK;
12:     end if
13:   end for
14:   /** Scheduling via TITAN **/
15:   π^s_i = latest{π^e_{i_p} | p = 1, . . . , p_i};
16:   if i ≠ n or π^e_i ≠ π^s_i then
17:     (π^e_i, ζ_i) = earliest{S_i, π^s_i};
18:   end if
19:   K_i = TRUE;
20: end for
21: /** Adjustment **/
22: for i = 2 to n − 1 do
23:   π^e_i = earliest{π^s_{i_q} | q = 1, . . . , q_i};
24: end for
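A compact sketch of the main loop of Algorithm 1 is given below. The earliest_finish stub stands in for the broker negotiation with the local schedulers; its dictionary of advertised durations, and the form of the adjustment pass, are assumptions for illustration only.

import time

# Stub for the broker negotiation: given a sub-workflow and its earliest start,
# pick the local grid that can finish it first (durations are assumed adverts).
def earliest_finish(sub_workflow, start, local_grids):
    best_grid, best_end = None, float("inf")
    for grid, durations in local_grids.items():
        end = start + durations[sub_workflow]
        if end < best_end:
            best_grid, best_end = grid, end
    return best_end, best_grid

# Sketch of Algorithm 1: schedule each sub-workflow once all of its
# pre-sub-workflows are scheduled, then relax end times in the adjustment pass.
def schedule_workflow(subs, pre, post, local_grids):
    start, end, where, done = {}, {}, {}, set()
    first = subs[0]                                   # the start checkpoint S_1
    start[first] = end[first] = time.time()
    done.add(first)
    while len(done) < len(subs):
        s = next(s for s in subs if s not in done and all(p in done for p in pre[s]))
        start[s] = max((end[p] for p in pre[s]), default=start[first])
        end[s], where[s] = earliest_finish(s, start[s], local_grids)
        done.add(s)
    for s in subs[1:-1]:                              # adjustment: make deadlines less critical
        if post[s]:
            end[s] = min(start[q] for q in post[s])
    return start, end, where

subs = ["S1", "S2", "S3", "S4"]
pre = {"S1": [], "S2": ["S1"], "S3": ["S1"], "S4": ["S2", "S3"]}
post = {"S1": ["S2", "S3"], "S2": ["S4"], "S3": ["S4"], "S4": []}
grids = {"A": {"S2": 60, "S3": 90, "S4": 30}, "B": {"S2": 80, "S3": 50, "S4": 45}}
print(schedule_workflow(subs, pre, post, grids))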

This global grid workflow management algorithm relies heavily on information obtained from the local sub-workflow scheduler. The scheduler must be capable of returning the latest completion time of the incoming tasks as well as the complementary operator earliest, which returns the earliest enabling time. The response to these functions allows the broker to build a workflow schedule and compare it against solutions from other brokers. The local TITAN scheduler cannot give precise results to these functions (the GA search is random), but can produce approximate results by attaching the task to the tail of the current queue and using the performance service to obtain an estimate of these figures.

When tasks are received by a broker, they pass a message to the local sub-workflow scheduler to simulate an execution of the task. In practice, this involves running a pre-set number of iterations of the GA. The local scheduler then returns the earliest and latest time to the broker, which then compares these with timings received from other brokers. The sub-workflow scheduler that meets the QoS requirements is subsequently allocated the tasks. If multiple solutions exist, the task is routed to the most local resource.

The difficulty with this approach is handling users with different priorities who will place more importance on task deadlines than on system metrics. The brokers and local schedulers are able to address this issue with user classes.

3.4. User Class Designation

Grid computing grew out of academic and research institutions where the user base was, historically, less concerned than industry with QoS issues. As grid services converge with commercially-orientated web services, it is inevitable that service quality will become increasingly important. Such issues have important implications for workflow systems that must consider multi-domain systems where each domain may have significantly different performance requirements.

The local schedulers offer some flexibility in how the fitness function is configured and it is possible to apply cost functions to the deadline weight in order to dynamically alter the behaviour of the local scheduler. Figures 8(a) and 8(b) illustrate two such functions for a standard domain that wishes to offer gold and silver services to applications. The functions are specified as utility functions that modify the deadline on a per-application basis, returning a penalty measure that is a fraction of the domain's pre-set QoS weight for deadline (W_c). The curves are selected by the domain administrator to obtain a specified behaviour from the GA for the particular domain. In this case, the silver function is flat until the first deadline, at which it takes a ‘penalty hit’; this slopes until twice the deadline time, at which it reaches a higher gradient. Where the GA builds schedules that have applications that fail their deadline, the CP is adversely affected and it is likely that the GA will select a schedule where the applications meet their deadlines.

FIGURE 8. Deadline utility functions. (a) Silver quality (fair); (b) Gold quality (fair); (c) Alternate Silver quality (hard); (d) Alternate Gold quality (hard). (Each panel plots penalty against completion time, measured in deadlines from 0 to 2.)

The gold service function 8(b) has an increasing deadline penalty from the outset which encourages the GA to select schedules that run the application as soon as possible. Failure to meet the deadline results in a large penalty, worse than the equivalent silver failure at the same point in time.

These functions depend on how QoS failure is measured (and paid for). Figures 8(c) and 8(d) offer alternative utility functions that are based on pass/fail contracts (as found in hard scheduling). In other words, once the deadline has failed the user will be compensated, and so it does not matter whether the task is run now or at any time in the future; the penalty is fixed. While this may not feel intuitively correct from a user perspective, it may be reasonable from the broader view of system averages. In this case it may be better to let one deadline slip if it allows many other applications to achieve their deadlines.
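The sketch below expresses the four curves of Figure 8 as penalty functions of completion time, measured as a fraction of the deadline, that would scale the deadline weight W_c. The breakpoints and slopes are illustrative choices only, not the values used in the experiments.

# Illustrative deadline utility functions in the spirit of Figure 8. `t` is the
# completion time as a fraction of the deadline (1.0 = on the deadline); the
# return value scales the domain's deadline weight W_c. Shapes are examples only.
def silver_fair(t):
    if t <= 1.0:
        return 0.0
    if t <= 2.0:
        return 0.5 * (t - 1.0)                 # gentle slope up to twice the deadline
    return 0.5 + 2.0 * (t - 2.0)               # steeper penalty thereafter

def gold_fair(t):
    penalty = 0.3 * t                          # rising from the outset: run it early
    if t > 1.0:
        penalty += 1.0 + 1.5 * (t - 1.0)       # larger hit than silver for a miss
    return penalty

def silver_hard(t):
    return 0.0 if t <= 1.0 else 0.75           # pass/fail: fixed compensation

def gold_hard(t):
    return 0.1 * t if t <= 1.0 else 1.5        # slight early incentive, fixed failure cost

for f in (silver_fair, gold_fair, silver_hard, gold_hard):
    print(f.__name__, [round(f(t), 2) for t in (0.5, 1.0, 1.5, 2.5)])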

A related issue at the grid level is how one user designation maps onto another designation in a foreign domain. An application submitted to a grid as silver may run differently depending on whether the allocated domain uses cost curves that resemble Figure 8(a) or 8(c). One approach is to adopt an economic model similar to work by Buyya [18] where a credit fund is imposed that costs each deadline function against the incoming application and user profile.

A gold user on a local domain may only achieve silver status on another domain given the same level of capital. While this is a relatively crude method of mapping users across a domain, it provides some measure of what level of service the user can expect between different domains. Using a similar mechanism to the resource model sharing, brokers can advertise the cost functions which can be used by remote brokers to determine where to place the sub-workflow.
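A heavily simplified sketch of such a credit-based mapping is shown below: the receiving domain advertises a tariff per service class and an incoming request is granted the best class its credit can buy. The tariff values and the flat-price model are purely illustrative assumptions, not part of the system described here.

# Hypothetical credit-based class mapping in the spirit of the economic model
# referenced above [18]. The receiving domain prices its service classes; an
# incoming sub-workflow is granted the best class its credit affords.
LOCAL_TARIFF = {"gold": 100, "silver": 40, "bronze": 10}      # example prices only

def map_user_class(credit):
    affordable = [cls for cls, price in sorted(LOCAL_TARIFF.items(),
                                               key=lambda kv: -kv[1])
                  if price <= credit]
    return affordable[0] if affordable else "best-effort"

# A remote 'gold' budget of 55 credits only buys 'silver' locally in this example.
print(map_user_class(credit=55))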


4. EXPERIMENTAL EVALUATION

To demonstrate the TITAN workflow architecture, a series of case studies are performed to evaluate the effectiveness of the system. In each instance a selection of heterogeneous applications are executed that have run-times between 10s and 100s, depending on the type of application and the architecture the task is allocated to. The applications are taken from a group of small scientific kernels that are used as PACE test cases and have sufficiently accurate performance models. The system is configured so that each local scheduler is responsible for 16 underlying grid resources: a cluster of 2.6 GHz Pentium 4 workstations in this case.

The first experiment is a straightforward demonstration of the packing characteristic exhibited by the GA, which is used to increase throughput and reduce idle time. The scheduler is compared with a DAG-based scheduler (which approximates the local scheduler's behaviour), configured with the same parameters as TITAN. 100 sub-workflows, each containing 5 tasks, are admitted to the DAG scheduler with a user-specified number of processors and allocated to the cluster. The scheduler is able to utilise free processors to minimise idle-time and improve throughput. While TITAN operates in a similar fashion, the GA is able to change the number of processors allocated by exploring the performance model data. It can identify, from the execution curves, appropriate limits (both upward and downward) on the available processors as well as compromising a task's execution time to improve another task's deadline. This gives TITAN an advantage in removing idle time from the schedule. Another advantage is the dynamic element: should a processor (a host) fail, TITAN is able to adapt quickly, removing the processor from the task-order coding and re-evaluating the schedule. The DAG scheduler may stall if there are not sufficient processors available for a particular task to complete.

Figure 9(a) illustrates the difference between the DAG and TITAN approaches when presented with sub-workflows at different arrival rates. The vertical axis is the CP measure and is a straightforward combination of make-span and idle time (W_c is set to 0 in this experiment). It is evident that at low workloads, the schedulers respond in a similar manner. At higher loads however, the GA-based scheduler is able to maintain a lower make-span (higher utilisation and lower idle-time) through task re-mapping, while using the DAG scheduler results in a poorer performance. The related results in Figure 9(b) illustrate the average duration of a scheduled sub-workflow. While TITAN is able to make a small reduction to the makespan of each sub-workflow, it tends to maintain a ‘system view’ when improving throughput and utilisation. In some cases, this can be to the detriment of an individual user.

FIGURE 9. Comparison of the TITAN and DAG schedulers. (a) Comparison of comprehensive performance for DAG and TITAN under different workloads; (b) mean length of scheduled sub-workflows.

A second experiment evaluates the QoS features of TITAN by comparing the execution of sub-workflows under low, medium and high workloads against user-specified deadlines. For each workload rate, sub-workflows arrive at the scheduler with a deadline specified in a given range. The range is uniformly selected and based on the application and architecture performance, representing a realistic restriction for the application. The results are given in Figures 10(a) and 10(b), which show the ratio of failed tasks and the average ‘over-time’ respectively. TITAN is typically able to meet more deadlines at a given workload.

Figures 11(a) and 11(b) illustrate the effect of the utility functions on the fitness and the associated CP. The workload in this experiment is the same as the previous experiment, where a number of tasks failed to meet their deadlines. However, a proportion of the tasks have been given gold status whilst the remaining are given silver.

FIGURE 10. Impact on QoS metrics with varying workload. (a) Proportion of failed deadlines; (b) mean duration of each sub-workflow.

The ‘fair’ scheduler fails many deadlines at low, medium and high load. However, the degree of failure (or the ‘over-deadline’, which contributes directly to the CP) is small. In contrast, the ‘hard’ scheduler typically fails deadlines less often, but the degree of failure is more severe as there is no incentive to run the task after the deadline has passed.

FIGURE 11. The effect of the utility functions on the deadlines for the same workload. (a) Based on utility functions 8(a) and 8(b); (b) based on utility functions 8(c) and 8(d).

5. RELATED WORK

Workflow management has become an essential service in grid environments in a relatively short time frame. Many issues still exist, including: workflow specification and adjustment, on-the-fly workflow construction, enactment and orchestration, simulation and scheduling. There are a number of on-going projects that address one or more of these challenges. Among existing projects, Pegasus [9] is most related to the work described in this paper. Pegasus aims at planning for workflow execution in grids. Artificial intelligence algorithms are used for workflow scheduling. In addition, it is currently integrated with Condor's [10] DAGman [11], which is similar to the scheduling approach used by TITAN [13]. Pegasus does not provide any QoS support at this time, while the algorithms in TITAN enable deadlines for job executions to be satisfied and provide the basis for QoS support.

There are a number of other research projects which are partly related to this work. GridAnt [22] reuses the Ant framework and provides a client-controllable workflow system for GT3 processes. While GridAnt provides a set of Grid tasks to be used within the Ant framework, workflow scheduling is currently not the key concern in the GridAnt project. IBM BPWS4J is a web services flow execution engine designed for BPEL4WS [23], which also has the potential to become a workflow standard in the grid community. Other current projects that have a workflow focus include the USA ASCI grid [14] and GriPhyN [15], the UK e-Science project MyGrid [24], the EC/IST FP5 projects GEMSS [25], GridLab [26] and GRASP [27], the Swiss project BioOpera [28], Japan's NAREGI [29] and Business Grid projects [30]. Previous research on using workflow management in integrated metacomputing and problem solving environments includes WebFlow [31], Symphony [32], Triana [33], SPINEware [34], TENT [35] and UNICORE [36]. Most of these projects focus on workflow specification and enactment and target specific applications; they do not primarily address workflow scheduling.


6. CONCLUSIONS

In this paper a workflow management system has been presented that is able to enact workflow descriptions at both the local domain and the grid level. Performance models are used extensively to guide decision making at each level in the system. User class utility functions are applied to local and grid scheduling algorithms to improve workflow allocations across multiple domains. Different domains are able to specify different operating parameters to suit their particular environment, which makes the system attractive to diverse grid applications. The main contribution of this work is to add performance-awareness to workflow services. It does not define a new workflow language, as there are already a large number of contenders, but aims to offer an additional management tier that orchestrates the workflow so that deadlines are met, throughput is increased and resource utilisation is improved.

Workflow management is currently the subject of a number of research projects as it clearly facilitates distributed, collaborative working practices. Giving users the ability to connect large storage databases to computational and visualisation services co-ordinated by supporting middleware is a strong benefit of grid computing, and one that is being actively pursued by grid researchers. The approaches presented here allow users to submit workflows to grid resources with expected levels of service, performance and reliability. Just as important is the ability for grid resource providers to carefully tune their environments to meet potentially binding contractual service goals.

ACKNOWLEDGEMENTS

This work is sponsored in part by grants from the EPSRC (contract No. GR/R47424/01) and the EPSRC e-Science Core Programme (contract No. GR/S03058/01).

REFERENCES

[1] F. Berman, R.J. Hey, G. Fox. Grid Computing: Making the Global Infrastructure a Reality. Wiley Series in Communications Networking and Distributed Systems, 2003.

[2] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. Int. Journal of Supercomputer Applications, 11(2):115–128, 1997.

[3] F. Curbera, W. Nagy, S. Weerawarana. Web Services: Why and How. Workshop on Object-Oriented Web Services, ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA '01), 14-18 October 2001, Tampa Bay, Florida.

[4] D. Bacigalupo, S. Jarvis, L. He, G. Nudd. An investigation into the application of different performance techniques to e-Commerce applications. Workshop on Performance Modelling, Evaluation and Optimization of Parallel and Distributed Systems, 18th IEEE International Parallel and Distributed Processing Symposium, 2004.

[5] V. Adve, R. Bagrodia, J. Browne, and E. Deelman et al. Poems: end-to-end performance design of large parallel adaptive computational systems. IEEE Transactions on Software Engineering, 26(11):1027–1048, 2000.

[6] G. Nudd, D. Kerbyson, E. Papaefstathiou, S. Perry, J. Harper, and D. Wilcox. PACE: A toolset for the performance prediction of parallel and distributed systems. Int. Journal of High Performance Computing Applications, 14(3):228–251, 2000.

[7] N. Thomas. Challenges and Opportunities in Grid Performability. CS-TR: 842, School of Computer Science, University of Newcastle, May 2004.

[8] P. Dinda. Online prediction of the running time of tasks. Cluster Computing, 5(3):225–236, 2002.

[9] E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, K. Blackburn, A. Lazzarini, A. Arbree, R. Cavanaugh, S. Koranda. Mapping Abstract Complex Workflows onto Grid Environments. J. Grid Computing, 1(1):25-39, 2003.

[10] M. Litzkow, M. Livny, and M. Mutka. Condor – a hunter of idle workstations. 8th International Conference on Distributed Computing Systems (ICDCS'88), pp. 104–111, 1988.

[11] Condor Team, University of Wisconsin-Madison, Condor Version 6.4.7 Manual, 2003.

[12] N. Furmento, A. Meyer, S. McGough, S. Newhouse, T. Field, J. Darlington. ICENI: Optimisation of Component Applications within a Grid Environment. Parallel Computing, 28(12):1753-1772, 2002.

[13] D. Spooner, S. Jarvis, J. Cao, S. Saini, G. Nudd. Local grid scheduling techniques using performance prediction. IEE Proc.-Comput. Digit. Tech., 150(2):87-96, 2003.

[14] H. P. Bivens. Grid Workflow, Grid Computing Environments Working Group, Global Grid Forum, 2001.

[15] E. Deelman, C. Kesselman, G. Mehta, L. Meshkat, L. Pearlman, K. Blackburn, P. Ehrens, A. Lazzarini, R. Williams, S. Koranda. GriPhyN and LIGO, Building a Virtual Data Grid for Gravitational Wave Scientists. 11th IEEE Int. Symp. on High Performance Distributed Computing, Edinburgh, Scotland, pp. 225-234, 2002.

[16] J. Cao, D. Kerbyson, G. Nudd. High performance service discovery in large-scale multi-agent and mobile-agent systems. Int. Journal of Software Engineering and Knowledge Engineering, 11(5):621–641, 2001.

[17] J. Cao, S. Jarvis, S. Saini, D. Kerbyson, G. Nudd. ARMS: an agent-based resource management system for grid computing. Scientific Programming, 10(2):135–148, 2002.

[18] R. Buyya, D. Abramson, J. Giddy, and H. Stockinger. Economic models for resource management and scheduling in Grid computing. Concurrency and Computation: Practice and Experience, 14:1507-1542, 2002.

[19] Ant 1.6.1 from ant.apache.org

[20] B. Sotomayor. The Globus Toolkit 3 Programmers Tutorial, Chapter 1, Section 3, May 11, 2004. www.casa-sotomayor.net/gt3-tutorial/multiplehtml/ch01s03.htm

[21] A.Y. Zomaya, Y.H. Teh. Observations on using genetic algorithms for dynamic load-balancing. IEEE Trans. Parallel Distrib. Syst., 12(9):899-911, 2001.

[22] K. Amin, G. von Laszewski, M. Hategan, N.J. Zaluzec, S. Hampton, A. Rossi. GridAnt: a Client-Controllable Grid Workflow System. 37th IEEE Annual Hawaii Int. Conf. on System Sciences, pp. 210-219, 2004.

[23] F. Curbera, Y. Goland, J. Klein, F. Leymann, D. Roller, S. Thatte, S. Weerawarana. Business Process Execution Language for Web Services, Version 1.0, July 2002. www.ibm.com/developerworks/library/ws-bpel/.

[24] T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Greenwood, T. Carver, A. Wipat, P. Li. Taverna: A Tool for the Composition and Enactment of Bioinformatics Workflows. Bioinformatics, 2004.

[25] J. Cao, J. Fingberg, G. Berti, J.G. Schmidt. Implementation of Grid-enabled Medical Simulation Applications Using Workflow Techniques. 2nd Int. Workshop on Grid and Cooperative Computing, Shanghai, China, Lecture Notes in Computer Science 3032:34-41, 2003.

[26] E. Seidel, G. Allen, A. Merzky, J. Nabrzyski. GridLab - a Grid Application Toolkit and Testbed. Future Generation Computer Systems, 18(8):1143-1153, 2002.

[27] T. Dimitrakos, D.M. Randal, F. Yuan, M. Gaeta, G. Laria, P. Ritrovato, B. Serhan, S. Wesner, K. Wulf. An Emerging Architecture Enabling Grid Based Application Service Provision. 7th IEEE Int. Enterprise Distributed Object Computing Conf., Brisbane, Australia, pp. 240-251, 2003.

[28] W. Bausch, C. Pautasso, G. Alonso. Programming for Dependability in a Service-based Grid. 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, Tokyo, Japan, pp. 164-171, 2003.

[29] K. Miura. NaReGI - The Japanese National Research Grid Initiative. CCGrid 2003.

[30] H. Kishimoto, T. Kojo, F. Maciel. Business Grid Middleware Goals and Status. GlobusWorld 2004. www.globusworld.org/program/slides/3c 1.pdf

[31] D. Bhatia, V. Burzevski, M. Camuseva, G. Fox, W. Furmanski, G. Premchandran. WebFlow: a Visual Programming Paradigm for Web/Java Based Coarse Grain Distributed Computing. Concurrency: Practice and Experience, 9(6):555-577, 1997.

[32] M. Lorch, D. Kafura. Symphony: A Java-based Composition and Manipulation Framework for Computational Grids. 2nd IEEE/ACM Int. Symp. on Cluster Computing and the Grid, Berlin, Germany, pp. 136-143, 2002.

[33] I. Taylor, M. Shields, I. Wang, O. Rana. Triana Applications within Grid Computing and Peer to Peer Environments. J. Grid Computing, 1(2):199-217, 2003.

[34] E.H. Baalbergen, H. van der Ven. SPINEware - a Framework for User-oriented and Tailorable Metacomputers. Future Generation Computer Systems, 15(5-6):549-558, 1999.

[35] A. Schreiber. The Integrated Simulation Environment TENT. Concurrency and Computation: Practice and Experience, 14(13-15):1553-1568, 2002.

[36] M. Romberg. The UNICORE Grid Infrastructure. Scientific Programming, 10(2):149-157, 2002.
