
Page 1: Analysis of Federated and Global Scheduling for Parallel Real-Time Tasks (...lu/papers/ecrts14.pdf)

Analysis of Federated and Global Scheduling for Parallel Real-Time Tasks

Jing Li†, Jian-Jia Chen§, Kunal Agrawal†, Chenyang Lu†, Chris Gill†, Abusayeed Saifullah†

†Washington University in St. Louis, U.S.A.  §TU Dortmund University, Germany

[email protected], §[email protected], {kunal, lu, cdgill}@cse.wustl.edu, [email protected]

Abstract

This paper considers the scheduling of parallel real-time tasks with implicit deadlines. Each parallel task is characterized as a general directed acyclic graph (DAG). We analyze three different real-time scheduling strategies: two well known algorithms, namely global earliest-deadline-first and global rate-monotonic, and one new algorithm, namely federated scheduling. The federated scheduling algorithm proposed in this paper is a generalization of partitioned scheduling to parallel tasks. In this strategy, each high-utilization task (utilization ≥ 1) is assigned a set of dedicated cores and the remaining low-utilization tasks share the remaining cores. We prove capacity augmentation bounds for all three schedulers. In particular, we show that if on unit-speed cores, a task set has total utilization of at most m and the critical-path length of each task is smaller than its deadline, then federated scheduling can schedule that task set on m cores of speed 2; G-EDF can schedule it with speed (3 + √5)/2 ≈ 2.618; and G-RM can schedule it with speed 2 + √3 ≈ 3.732. We also provide lower bounds on the speedup and show that the bounds are tight for federated scheduling and G-EDF when m is sufficiently large.

I. Introduction

In the last decade, multicore processors have become ubiquitous and there has been extensive work on how to exploit these parallel machines for real-time tasks. In the real-time systems community, there has been significant research on scheduling task sets with inter-task parallelism, where each task in the task set is a sequential program. In this case, increasing the number of cores allows us to increase the number of tasks in the task set. However, since each task can only use one core at a time, the computational requirement of a single task is still limited by the capacity of a single core. Recently, there has been some interest in the design and analysis of scheduling strategies for task sets with intra-task parallelism (in addition to inter-task parallelism), where individual tasks are parallel programs and can potentially utilize more than one core in parallel. These models enable tasks with higher execution demands and tighter deadlines, such as those used in autonomous vehicles [31], video surveillance, computer vision, radar tracking, and real-time hybrid testing [28].

In this paper, we consider the general directed acyclic graph (DAG) model. We analyze three different scheduling strategies: a new strategy, namely federated scheduling, and two classic strategies, namely global EDF and global rate-monotonic scheduling. We prove that all three strategies provide strong performance guarantees, in the form of capacity augmentation bounds, for scheduling parallel DAG tasks with implicit deadlines.

One can generally derive two types of performance bounds for real-time schedulers. The traditional bound is called a resource augmentation bound (also called a processor speed-up factor). A scheduler S provides a resource augmentation bound of b ≥ 1 if it can successfully schedule any task set τ on m cores of speed b as long as the ideal scheduler can schedule τ on m cores of speed 1. A resource augmentation bound provides a good notion of how close a scheduler is to the optimal schedule, but it has a drawback. Note that the ideal scheduler is only a hypothetical scheduler, meaning that it always finds a feasible schedule if one exists. Unfortunately, since we often cannot tell whether the ideal scheduler can schedule a given task set on unit-speed cores, a resource augmentation bound may not provide a schedulability test.

Another bound that is commonly used for sequential tasks is a utilization bound. A scheduler S provides a utilization bound of b if it can successfully schedule any task set which has total utilization at most m/b on m cores.¹ A utilization bound provides more information than a resource augmentation bound; any scheduler that guarantees a utilization bound of b automatically guarantees a resource augmentation bound of b as well. In addition, it acts as a very simple schedulability test in itself, since the total utilization of the task set can be calculated in linear time and compared to m/b. Finally, a utilization bound gives an indication of how much load a system can handle, allowing us to estimate how much over-provisioning may be necessary when designing a platform. Unfortunately, it is often impossible to prove a utilization bound for parallel systems due to Dhall's effect: one can often construct pathological task sets with utilization arbitrarily close to 1 that nonetheless cannot be scheduled on m cores.

¹A utilization bound is often stated in terms of 1/b; we adopt this notation in order to be consistent with the other bounds stated here.

Li et al. [35] defined the concept of a capacity augmentation bound, which is similar to the utilization bound but adds a new condition. A scheduler S provides a capacity augmentation bound of b if it can schedule any task set τ which satisfies the following two conditions: (1) the total utilization of τ is at most m/b, and (2) the worst-case critical-path length Li of each task (the execution time of the task on an infinite number of cores)² is at most a 1/b fraction of its deadline. A capacity augmentation bound is quite similar to a utilization bound: it also provides more information than a resource augmentation bound does; any scheduler that guarantees a capacity augmentation bound of b automatically guarantees a resource augmentation bound of b as well. It also acts as a very simple schedulability test. Finally, it can also provide an estimate of the load a system is expected to handle.

There has been some recent research on proving both resource augmentation bounds and capacity augmentation bounds for various scheduling strategies for parallel tasks. This work falls into two categories. In decomposition-based strategies, the parallel task is decomposed into a set of sequential tasks, which are then scheduled using existing strategies for scheduling sequential tasks on multiprocessors. In general, decomposition-based strategies require explicit knowledge of the structure of the DAG off-line in order to apply the decomposition. In non-decomposition-based strategies, the program can unfold dynamically, since no off-line knowledge is required.

For a decomposed strategy, most prior work considers synchronous tasks (a subcategory of general DAGs) with implicit deadlines. Lakshmanan et al. [32] proved a capacity augmentation bound of 3.42 for partitioned fixed-priority scheduling for a restricted category of synchronous tasks³ under decomposed deadline monotonic scheduling. Saifullah et al. [45] provide a different decomposition strategy for general parallel synchronous tasks and prove a capacity augmentation bound of 4 when the decomposed tasks are scheduled using global EDF, and 5 when scheduled using partitioned DM. Kim et al. [31] provide another decomposition strategy and prove a capacity augmentation bound of 3.73 using global deadline monotonic scheduling. Nelissen et al. [40] proved a resource augmentation bound of 2 for general synchronous tasks. More recently, Saifullah et al. [44] provide a decomposition strategy for general DAG tasks that provides a capacity augmentation bound of 4.

²The critical-path length of a sequential task is equal to its execution time.
³Fork-join task model, in their terminology.

For non-decomposition strategies, researchers have primarily studied global earliest deadline first (G-EDF) and global rate-monotonic (G-RM) scheduling. Andersson and de Niz [4] show that G-EDF provides a resource augmentation bound of 2 for synchronous tasks with constrained deadlines. Both Li et al. [35] and Bonifaci et al. [15] concurrently showed that G-EDF provides a resource augmentation bound of 2 for general DAG tasks with arbitrary deadlines. In their paper, Bonifaci et al. also proved that G-RM provides a resource augmentation bound of 3 for parallel DAG tasks with arbitrary deadlines; Li et al. also provide a capacity augmentation bound of 4 for G-EDF for task sets with implicit deadlines.

In summary, the best known capacity augmentation bounds for implicit-deadline tasks are 4 for DAG tasks using G-EDF, and 3.73 for parallel synchronous tasks using decomposition combined with G-DM. The contributions of this paper are as follows:

1) We propose a novel federated scheduling strategy. Here, each high-utilization task (utilization ≥ 1) is allocated a dedicated cluster (set) of cores. A multiprocessor scheduling algorithm is used to schedule all low-utilization tasks, each of which is run sequentially, on a shared cluster composed of the remaining cores. Federated scheduling can be seen as a partitioned strategy generalized to parallel tasks.

2) We prove that the capacity augmentation bound of this federated scheduler is 2; this is the best known capacity augmentation bound for any scheduler for parallel DAGs. In addition, we show that no scheduler can provide a capacity augmentation bound better than 2 − 1/m for parallel tasks. Therefore, the bound of 2 for federated scheduling is tight when m is large enough.

3) We improve the capacity augmentation bound of G-EDF to (3 + √5)/2 ≈ 2.618 for DAGs. When m is large, there is a matching lower bound for G-EDF [35]; hence, this result closes the gap for large m. This is the best known capacity augmentation bound for any global scheduler for parallel DAGs.

4) We show that G-RM has a capacity augmentation bound of 2 + √3 ≈ 3.732. This is the best known capacity augmentation bound for any fixed-priority scheduler for DAG tasks. Even if restricted to synchronous tasks, this is still the best bound for global fixed-priority scheduling without decomposition.

The paper is organized as follows. Section II defines the DAG model for parallel tasks and provides some definitions. Section III presents our federated scheduling algorithm and proves its augmentation bound. Section IV proves a lower bound for any scheduler for parallel tasks. Section V presents a canonical form that gives an upper bound on the work of a DAG that must be done in a specified interval length. Section VI proves that G-EDF provides a capacity augmentation bound of 2.618. Section VII shows that G-RM provides a capacity augmentation bound of 3.732. We discuss some practical considerations of the three schedulers in Section VIII. Section IX discusses related work and Section X concludes this paper.

II. System Model

We now present the details of the DAG task model for parallel tasks and some additional definitions.

We consider a set τ of n independent sporadic real-time tasks {τ1, τ2, . . . , τn}. A task τi represents an infinite sequence of arrivals and executions of task instances (also called jobs). We consider the sporadic task model [9, 29] where, for a task τi, the minimum inter-arrival time (or period) Ti represents the time between consecutive arrivals of task instances, and the relative deadline Di represents the temporal constraint for executing the job. If a task instance of τi arrives at time t, the execution of this instance must be finished no later than the absolute deadline t + Di, and the release of the next instance of task τi must be no earlier than t plus the minimum inter-arrival time, i.e. t + Ti. In this paper, we consider implicit-deadline tasks where each task τi's relative deadline Di is equal to its minimum inter-arrival time Ti; that is, Ti = Di. We consider the schedulability of this task set on a multicore system consisting of m identical cores.

Each task τi ∈ τ is a parallel task and is characterized as a directed acyclic graph (DAG). Each node (subtask) in the DAG represents a sequence of instructions (a thread) and each edge represents a dependency between nodes. A node (subtask) is ready to be executed when all its predecessors have been executed. Throughout this paper, as it is not necessary to build the analysis on the specific structure of the DAG, only two parameters related to the execution pattern of task τi are defined:

• Total execution time (or work) Ci of task τi: this is the summation of the worst-case execution times of all the subtasks of task τi.

• Critical-path length Li of task τi: this is the length of the critical path in the given DAG, in which each node is characterized by the worst-case execution time of the corresponding subtask of task τi. The critical-path length is the worst-case execution time of the task on an infinite number of cores.

Given a DAG, obtaining the work Ci and the critical-path length Li can both be done in linear time [46, pages 661-666].
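The linear-time computation of Ci and Li can be sketched as follows. The DAG encoding (a map of per-node worst-case execution times plus a successor map) is an illustrative choice of ours, not the paper's own representation:

```python
from graphlib import TopologicalSorter

def work_and_critical_path(wcet, succ):
    """Compute total work C_i and critical-path length L_i of a DAG task.

    wcet: dict mapping each node to its worst-case execution time.
    succ: dict mapping each node to a list of its successor nodes.
    Both quantities are obtained in one linear-time pass over a
    topological order of the DAG.
    """
    total_work = sum(wcet.values())           # C_i: sum of all node WCETs
    # graphlib takes a predecessor map, so invert the successor map.
    preds = {v: set() for v in wcet}
    for u, vs in succ.items():
        for v in vs:
            preds[v].add(u)
    # Longest weighted path ending at each node, in topological order.
    longest = {}
    for v in TopologicalSorter(preds).static_order():
        longest[v] = wcet[v] + max((longest[u] for u in preds[v]), default=0)
    return total_work, max(longest.values())  # (C_i, L_i)
```

For example, a diamond-shaped DAG a→{b, c}→d with node times 1, 2, 3, 1 yields Ci = 7 and Li = 5 (the path a, c, d).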

For brevity, the utilization Ci/Ti = Ci/Di of task τi is denoted by ui for implicit deadlines. The total utilization of the task set is U∑ = ∑_{τi ∈ τ} ui.

Utilization-Based Schedulability Test. In this paper, we analyze schedulers in terms of their capacity augmentation bounds. The formal definition is presented here:

Definition 1. Given a task set τ with total utilization U∑, a scheduling algorithm S with capacity augmentation bound b can always schedule this task set on m cores of speed b as long as τ satisfies the following conditions on unit-speed cores:

Utilization does not exceed the total number of cores:
  ∑_{τi ∈ τ} ui ≤ m    (1)

For each task τi ∈ τ, the critical-path length satisfies
  Li ≤ Di    (2)

Since no scheduler can schedule a task set τ on m unit-speed cores unless Conditions (1) and (2) are met, a capacity augmentation bound automatically leads to a resource augmentation bound. This definition can be equivalently stated (without reference to the speedup factor) as follows: Condition (1) says that the total utilization U∑ is at most m/b, and Condition (2) says that the critical-path length of each task is at most 1/b of its relative deadline, that is, Li ≤ Di/b. Therefore, in order to check whether a task set is schedulable we only need to know the total task set utilization and the maximum critical-path utilization. Note that a scheduler with a smaller b is better than one with a larger b; b = 1 would correspond to an optimal scheduler.
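The two conditions of Definition 1 yield a constant-work-per-task schedulability test. A minimal sketch, in which the function name and the (Ci, Li, Di) task encoding are illustrative assumptions:

```python
def capacity_aug_schedulable(tasks, m, b):
    """Schedulability test implied by a capacity augmentation bound b
    (Definition 1, restated): admit the task set iff
      (1) total utilization sum(C_i / D_i) <= m / b, and
      (2) every critical-path length satisfies L_i <= D_i / b.

    tasks: list of (C_i, L_i, D_i) triples; implicit deadlines (T_i = D_i).
    """
    total_util = sum(C / D for C, L, D in tasks)
    return total_util <= m / b and all(L <= D / b for C, L, D in tasks)
```

With b = 2 this is the test implied by the federated-scheduling bound of this paper; b = (3 + √5)/2 and b = 2 + √3 give the corresponding G-EDF and G-RM tests.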

III. Federated Scheduling

This section presents the federated scheduling strategy for parallel tasks with implicit deadlines. We prove that it provides a capacity augmentation bound of 2 for parallel real-time tasks on an m-core machine.

A. Federated Scheduling Algorithm

Given a task set τ, the federated scheduling algorithm works as follows. First, tasks are divided into two disjoint sets: τhigh contains all high-utilization tasks, i.e. tasks with worst-case utilization at least one (ui ≥ 1), and τlow contains all the remaining low-utilization tasks. Consider a high-utilization task τi with worst-case execution time Ci, worst-case critical-path length Li, and deadline Di (which is equal to its period Ti). We assign ni dedicated cores to τi, where

  ni = ⌈(Ci − Li) / (Di − Li)⌉    (3)

We use nhigh = ∑_{τi ∈ τhigh} ni to denote the total number of cores assigned to the high-utilization tasks τhigh, and assign the remaining nlow = m − nhigh cores to all low-utilization tasks τlow. The federated scheduling algorithm admits the task set τ if nlow is non-negative and nlow ≥ 2 ∑_{τi ∈ τlow} ui.

After a valid core allocation, runtime scheduling proceeds as follows: (1) Any greedy (work-conserving) parallel scheduler can be used to schedule a high-utilization task τi on its assigned ni cores. Informally, a greedy scheduler is one that never keeps a core idle if some node is ready to execute. (2) Low-utilization tasks are treated and executed as though they are sequential tasks, and any multiprocessor scheduling algorithm (such as partitioned EDF [37], or various rate-monotonic schedulers [3]) with a utilization bound of at least 1/2 can be used to schedule all the low-utilization tasks on the allocated nlow cores. The important observation is that we can safely treat low-utilization tasks as sequential tasks, since Ci ≤ Di and parallel execution is not required to meet their deadlines.⁴
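The core-allocation and admission phase described above can be sketched as follows. The (Ci, Li, Di) task encoding is an illustrative assumption, and Li < Di is assumed so that Equation (3) is well defined:

```python
import math

def federated_core_allocation(tasks, m):
    """Core allocation and admission test of the federated algorithm
    (Section III-A), a sketch.

    tasks: list of (C_i, L_i, D_i) triples with L_i < D_i.
    Returns (admitted, cores), where cores maps the index of each
    high-utilization task to its dedicated core count n_i.
    """
    cores = {}      # n_i for each high-utilization task
    low_util = 0.0  # total utilization of tau_low
    for i, (C, L, D) in enumerate(tasks):
        u = C / D
        if u >= 1:
            cores[i] = math.ceil((C - L) / (D - L))  # Equation (3)
        else:
            low_util += u
    n_low = m - sum(cores.values())
    # Admit iff the leftover cores can host tau_low at 50% utilization.
    return (n_low >= 0 and n_low >= 2 * low_util), cores
```

For instance, a high-utilization task with Ci = 20, Li = 12, Di = 16 receives ni = ⌈8/4⌉ = 2 dedicated cores.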

B. Capacity Augmentation Bound of 2 for Federated Scheduling

Theorem 1. The federated scheduling algorithm has a capacity augmentation bound of 2.

To prove Theorem 1, we consider a task set τ that satisfies Conditions (1) and (2) from Definition 1 for b = 2. Then, we (1) state the relatively obvious Lemma 1; (2) prove that a high-utilization task τi meets its deadline when assigned ni cores; and (3) show that nlow is non-negative and satisfies nlow ≥ b ∑_{τi ∈ τlow} ui, and therefore all low-utilization tasks in τ meet their deadlines when scheduled using any multiprocessor scheduling strategy with a utilization bound of at least 1/2 (i.e. one that can afford a total task set utilization of 50% of its cores). These three steps complete the proof.

Lemma 1. Suppose a task set τ is classified into disjoint subsets s1, s2, ..., sk, and each subset is assigned a dedicated cluster of cores of size n1, n2, ..., nk respectively, such that ∑_i ni ≤ m. If each subset sj is schedulable on its nj cores using some scheduling algorithm Sj (possibly different for each subset), then the whole task set is guaranteed to be schedulable on m cores.

High-Utilization Tasks Are Schedulable. Assume that a machine's execution time is divided into discrete quanta called steps. During each step a core can be either idle or performing one unit of work. We say a step is complete if no core is idle during that step; otherwise we say it is incomplete. A greedy scheduler never keeps a core idle if there is ready work available. Then, for a greedy scheduler on ni cores, we can state two straightforward lemmas [35].

Lemma 2. [Li13] Consider a greedy scheduler running on ni cores for t time steps. If the total number of incomplete steps during this period is t*, then the total work F_t done during these time steps is at least F_t ≥ ni·t − (ni − 1)·t*.

⁴Even if these tasks are expressed as parallel programs, it is easy to enforce correct sequential execution of parallel tasks: any topologically ordered execution of the nodes of the DAG is a valid sequential execution.

Lemma 3. [Li13] If a job of task τi is executed by a greedy scheduler, then every incomplete step reduces the remaining critical-path length of the job by 1.

From Lemmas 2 and 3, we can establish Theorem 2.

Theorem 2. If an implicit-deadline deterministic parallel task τi is assigned ni = ⌈(Ci − Li)/(Di − Li)⌉ dedicated cores, then all its jobs meet their deadlines when using a greedy scheduler.

Proof: For contradiction, assume that some job of a high-utilization task τi misses its deadline when scheduled on ni cores by a greedy scheduler. Then, during the Di time steps between the release of this job and its deadline, there are fewer than Li incomplete steps; otherwise, by Lemma 3, the job would have completed. Therefore, by Lemma 2, the scheduler must have finished at least ni·Di − (ni − 1)·Li work, and

  ni·Di − (ni − 1)·Li = ni(Di − Li) + Li
                      = ⌈(Ci − Li)/(Di − Li)⌉ (Di − Li) + Li
                      ≥ ((Ci − Li)/(Di − Li)) (Di − Li) + Li = Ci

Since the job has work of at most Ci, it must have finished within Di steps, leading to a contradiction.
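The inequality at the heart of this proof, ni·Di − (ni − 1)·Li ≥ Ci, can be spot-checked numerically over a grid of hypothetical high-utilization task parameters (ui ≥ 1, Li < Di); this is a sanity check of the algebra, not part of the proof:

```python
import math

def min_work_bound(C, L, D):
    """Work a greedy scheduler is guaranteed to finish in D_i steps on
    n_i cores when fewer than L_i of those steps are incomplete
    (Lemma 2 with t = D_i and t* < L_i)."""
    n = math.ceil((C - L) / (D - L))   # Equation (3)
    return n * D - (n - 1) * L

# Verify Theorem 2's inequality n_i*D_i - (n_i - 1)*L_i >= C_i on a grid.
for D in range(4, 40):
    for L in range(1, D):              # L_i < D_i
        for C in range(D, 4 * D):      # u_i >= 1
            assert min_work_bound(C, L, D) >= C
```

For the example task with Ci = 20, Li = 12, Di = 16, we get ni = 2 and a guaranteed 2·16 − 1·12 = 20 = Ci units of work.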

Low-Utilization Tasks Are Schedulable. We first calculate a lower bound on nlow, the number of cores assigned to low-utilization tasks, when a task set τ that satisfies Conditions (1) and (2) of Definition 1 for b = 2 is scheduled using the federated scheduling strategy.

Lemma 4. The number of cores assigned to low-utilization tasks is at least nlow ≥ 2 ∑_{τi ∈ τlow} ui.

Proof: Here, for brevity, we denote σi = Di/Li. It is obvious that Di = σi·Li and hence Ci = Di·ui = σi·ui·Li. Therefore,

  ni = ⌈(Ci − Li)/(Di − Li)⌉ = ⌈(σi·ui·Li − Li)/(σi·Li − Li)⌉ = ⌈(σi·ui − 1)/(σi − 1)⌉

Each task τi in task set τ satisfies Condition (2) of Definition 1 for b = 2; therefore, the critical-path length of each task is at most 1/b of its relative deadline, that is, Li ≤ Di/b, which implies σi ≥ b = 2.

By the definition of a high-utilization task τi, we have ui ≥ 1. Together with σi ≥ 2, we know that

  0 ≤ (ui − 1)(σi − 2)/(σi − 1)

From the definition of the ceiling function, we can derive

  ni = ⌈(σi·ui − 1)/(σi − 1)⌉
     < (σi·ui − 1)/(σi − 1) + 1
     = (σi·ui + σi − 2)/(σi − 1)
     ≤ (σi·ui + σi − 2)/(σi − 1) + (ui − 1)(σi − 2)/(σi − 1)
     = (σi·ui + σi − 2 + σi·ui − 2ui − σi + 2)/(σi − 1)
     = (2σi·ui − 2ui)/(σi − 1)
     = 2ui(σi − 1)/(σi − 1) = 2ui = b·ui

In summary, for each high-utilization task, ni < b·ui. Summing over τhigh, we get nhigh = ∑_{τi ∈ τhigh} ni < b ∑_{τi ∈ τhigh} ui. Since the task set also satisfies Condition (1), i.e. m ≥ b ∑_{τi ∈ τ} ui, we have

  nlow = m − nhigh > b ∑_{τi ∈ τ} ui − b ∑_{τi ∈ τhigh} ui = b ∑_{τi ∈ τlow} ui

Thus, the number of remaining cores allocated to low-utilization tasks is at least nlow > 2 ∑_{τi ∈ τlow} ui.

Corollary 1. For task sets satisfying Conditions (1) and (2), a multiprocessor scheduler with a utilization bound of at least 50% can schedule all the low-utilization tasks sequentially on the remaining nlow cores.

Proof: Low-utilization tasks are allocated nlow cores, and from Lemma 4 we know that the total utilization of the low-utilization tasks is less than nlow/b = 50% of nlow. Therefore, any multiprocessor scheduling algorithm that provides a utilization bound of 1/2 (i.e., one that can schedule any task set with total worst-case utilization no more than 50% of its cores) can schedule them.

Many multiprocessor scheduling algorithms (such as partitioned EDF or partitioned RM) provide a utilization bound of 1/2 (i.e., 50%) for sequential tasks. That is, given nlow cores, they can schedule any task set with a total worst-case utilization up to nlow/2. Using any of these algorithms for low-utilization tasks guarantees that the federated algorithm meets all deadlines with a capacity augmentation of 2.

Therefore, since we can successfully schedule both high- and low-utilization tasks that satisfy Conditions (1) and (2), we have proven Theorem 1 (using Lemma 1).

As mentioned before, a capacity augmentation bound acts as a simple schedulability test. However, for federated scheduling, this test can be pessimistic, especially for tasks with high parallelism. Note, however, that the federated scheduling algorithm described in Section III-A can also be used directly as a (polynomial-time) schedulability test: given a task set, after assigning cores to each high-utilization task using this algorithm, if the remaining cores are sufficient for all low-utilization tasks, then the task set is schedulable and we can admit it without deadline misses. This schedulability test admits a strict superset of the task sets admitted by the capacity augmentation bound test, and in practice it may admit task sets with utilization greater than m/2.

IV. Lower Bound on Capacity Augmentation of Any Scheduler for Parallel Tasks

On a system with m cores, consider a task set τ with a single task, τ1, which starts with sequential execution for 1 − ε time and then forks (m − 1)/ε + 1 subtasks, each with execution time ε. Here, we assume ε is an arbitrarily small positive number. Therefore, the total work of task τ1 is C1 = m and its critical-path length is L1 = 1. The deadline (and also minimum inter-arrival time) of τ1 is 1.

Theorem 3. The capacity augmentation bound of any scheduler for parallel tasks on m cores is at least 2 − 1/m, as ε →⁺ 0.

Proof: Consider the task set defined above. The finishing time of τ1 when running at speed α is no earlier than

  (1 − ε)/α + ((m − 1)/ε)·ε/(m·α) = 2/α − 1/(m·α) − ε/α

If α < 2 − 1/m and ε →⁺ 0, then 2/α − 1/(m·α) − ε/α > 1, and task τ1 misses its deadline. Therefore, we reach the conclusion.

Since Theorem 3 holds for any scheduler for parallel tasks, the lower bound on the capacity augmentation of federated scheduling is at least 2 when m is sufficiently large. Since we have shown that the upper bound on the capacity augmentation of federated scheduling is also 2, we have closed the gap between the lower and upper bounds of federated scheduling for large m. Moreover, for sufficiently large m, federated scheduling has the best capacity augmentation bound among all schedulers for parallel tasks.
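The finishing-time expression in the proof of Theorem 3 can be evaluated numerically to illustrate the threshold behavior around α = 2 − 1/m; the helper name is ours:

```python
def finish_time(m, alpha, eps):
    """Earliest completion time of Theorem 3's task tau_1 at speed alpha:
    the (1 - eps)-long sequential part, then the m - 1 units of fully
    parallel work spread over m cores, as in the proof.  This equals
    2/alpha - 1/(m*alpha) - eps/alpha."""
    return (1 - eps) / alpha + (m - 1) / (m * alpha)

m, eps = 16, 1e-9
bound = 2 - 1 / m
# Below the bound the deadline D_1 = 1 is missed; above it, it is met.
assert finish_time(m, bound - 0.05, eps) > 1
assert finish_time(m, bound + 0.05, eps) < 1
```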

V. Canonical Form of a DAG Task

In this section, we introduce the concept of a DAG's canonical form. Note that each task can have an arbitrarily complex DAG structure which may be difficult to analyze and may not even be known before runtime. However, given the known task set parameters (work, critical-path length, utilization, critical-path utilization, etc.), we represent each task using a canonical DAG that allows us to upper bound the demand of the task in any given interval of length t. These results play an important role when we analyze the capacity augmentation bounds for G-EDF in Section VI and G-RM in Section VII. Recall that in this paper we analyze tasks with implicit deadlines, so the period equals the deadline (Ti = Di).

Recall that we classify each task τi as a low-utilization task if ui = Ci/Di < 1 (and hence Ci < Di), or a high-utilization task if τi's utilization ui ≥ 1.


For analytical purposes, instead of considering the complex DAG structure of an individual task τi, we consider a canonical form τi* of task τi. The canonical form of a task is represented by a simpler DAG in which each subtask (node) has execution time ε, which is positive and arbitrarily small. Note that ε is a hypothetical unit-node execution time; it is therefore safe to assume that Di/ε and Ci/ε are both integers. Low- and high-utilization tasks have different canonical forms, described below.

• The canonical form τi* of a low-utilization task τi is simply a chain of Ci/ε nodes, each with execution time ε. Note that task τi* is a sequential task.

• The canonical form τi* of a high-utilization task τi starts with a chain of Di/ε − 1 nodes, each with execution time ε. The total work of this chain is Di − ε. The last node of the chain forks all the remaining nodes. Hence, all the remaining (Ci − Di + ε)/ε nodes have an edge from the last node of this chain, and can therefore execute entirely in parallel.

Figure 1 provides an example of such a transformation for a high-utilization task. It is important to note that the canonical form τi* does not depend on the DAG structure of τi at all; it depends only on the task parameters of τi.
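The canonical-form construction above reduces to simple node counts. A sketch, in which the dictionary encoding is an illustrative choice of ours, and Di/ε and Ci/ε are assumed integral as in the text:

```python
def canonical_form(C, D, eps):
    """Node counts of the canonical form tau_i* (Section V).

    Low-utilization (C < D): a chain of C/eps nodes of length eps.
    High-utilization (C >= D): a chain of D/eps - 1 nodes, whose last
    node forks (C - D + eps)/eps fully parallel nodes.
    Note the canonical form depends only on C, D, and eps, not on the
    original DAG structure.
    """
    if C < D:  # low-utilization task: a pure chain, no parallelism
        return {"chain": round(C / eps), "parallel": 0}
    return {"chain": round(D / eps) - 1,            # chain work: D - eps
            "parallel": round((C - D + eps) / eps)}

# The task of Figure 1 (C_i = 20, D_i = 16) with eps = 1:
# a chain of 15 nodes followed by 5 parallel nodes (total work 20).
```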

[Figure 1 shows (a) an original DAG with seven nodes and (b) its canonical form: a chain of 16/ε − 1 nodes whose last node forks (4 + ε)/ε parallel nodes, each of execution time ε.]

Fig. 1: A high-utilization DAG task τi with Li = 12, Ci = 20, Ti = Di = 16, and ui = 1.25, and its canonical form, where the number in each node is its execution time.

As an additional analysis tool, we define a hypothetical scheduling strategy S∞ that schedules a task set τ on an infinite number of cores, that is, m = ∞. With an infinite number of cores, prioritization of sub-jobs becomes unnecessary, and S∞ obtains an optimal schedule by simply assigning a sub-job to a core as soon as that sub-job becomes ready for execution. Under this schedule, every task finishes within its critical-path length; therefore, if Li ≤ Di for all tasks τi in τ, the task set always meets its deadlines. We denote this schedule by S∞. Similarly, S∞,α is the resulting schedule when S∞ schedules tasks on cores of speed α ≥ 1. Note that S∞,α finishes a job of task τi exactly Li/α time units after it is released.

We now define some notation based on S∞,α. Let qi(t, α) be the total work of task τi finished by S∞,α between the arrival time ri of τi and time ri + t. Therefore, in the interval from ri + t to ri + Di (of length Di − t), the remaining Ci − qi(t, α) work has to be finished.

We define the maximum load of task τi, denoted worki(t, α), as the maximum amount of work (computation) that S∞,α must do on the sub-jobs of τi in any interval of length t. We can derive worki(t, α) as follows:

worki(t, α) = Ci − qi(Di − t, α)                          if t ≤ Di,
              ⌊t/Di⌋·Ci + worki(t − ⌊t/Di⌋·Di, α)         if t > Di.     (4)

Clearly, both qi(t, α) and worki(t, α) depend on the structure of the task's DAG.

We similarly define q*i(t, α) for the canonical form τ*i. As the canonical form τ*i is well-defined, we can derive q*i(t, α) directly. Note that ε can be arbitrarily small; hence its impact is ignored when calculating q*i(t, α).

We can now define the canonical maximum load work*i(t, α) as the maximum load of the canonical task τ*i in any interval of length t in schedule S∞,α. For a low-utilization task τi, where Ci/Di < 1 and τ*i is a chain, it is easy to see that the canonical maximum load is

work*i(t, α) = 0                                          if t < Di − Ci/α,
               α·(t − (Di − Ci/α))                        if Di − Ci/α ≤ t ≤ Di,
               ⌊t/Di⌋·Ci + work*i(t − ⌊t/Di⌋·Di, α)       if t > Di.     (5)

Similarly, for a high-utilization task, where Ci/Di ≥ 1, when ε is arbitrarily small we have

work*i(t, α) = 0                                          if t < Di − Di/α,
               Ci − Di + α·(t − (Di − Di/α))              if Di − Di/α ≤ t ≤ Di,
               ⌊t/Di⌋·Ci + work*i(t − ⌊t/Di⌋·Di, α)       if t > Di.     (6)
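The closed forms (5) and (6) are easy to evaluate directly. The sketch below (a hypothetical helper, with the arbitrarily small ε ignored, as the text permits) computes work*i(t, α) for both task classes.

```python
def work_star(C, D, t, alpha):
    """Canonical maximum load work*_i(t, alpha) per Eqs. (5) and (6),
    ignoring the arbitrarily small eps."""
    if t > D:
        k = int(t // D)                  # number of complete periods
        rem = t - k * D
        return k * C + (work_star(C, D, rem, alpha) if rem > 0 else 0.0)
    if C < D:                            # low-utilization task, Eq. (5)
        start = D - C / alpha            # load appears in the last C/alpha
        return 0.0 if t < start else alpha * (t - start)
    start = D - D / alpha                # high-utilization task, Eq. (6)
    return 0.0 if t < start else (C - D) + alpha * (t - start)

# Figure 1 task (C = 20, D = 16): at speed 2 the chain finishes by
# t = D/alpha = 8, exposing the parallel burst of work C - D = 4,
# and the full work of 20 is reached at t = D = 16.
```

At α = 1 this reproduces the unit-speed profile work*(t) = 4 + t over 0 ≤ t ≤ 16 for the Figure 1 task.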

Figure 2 shows q*i(t, α), qi(t, α), work*i(t, α), and worki(t, α) for the high-utilization task τi of Figure 1 when Di = 16, for α = 1 and α = 2. Note that work*i(t, α) ≥ worki(t, α); indeed, the following lemma proves that work*i(t, α) ≥ worki(t, α) for any t > 0 and α ≥ 1.

Lemma 5. For any t > 0 and α ≥ 1, work*i(t, α) ≥ worki(t, α).

Proof: For low-utilization tasks, the entire work Ci is sequential. When t < Ci/α, q*i(t, α) = αt, so qi(t, α) ≥ αt = q*i(t, α). When Ci/α ≤ t < Di, qi(t, α) = Ci = q*i(t, α).

Similarly, for high-utilization tasks, the first Di units of work of τ*i are sequential, so when t < Di/α, q*i(t, α) = αt. In addition, S∞,α finishes τi exactly Li/α time units after it is released, while it finishes τ*i at Di/α. Since the critical-path length satisfies Li ≤ Di, when t < Li/α we have qi(t, α) ≥ αt = q*i(t, α); when Li/α ≤ t < Di/α, qi(t, α) = Ci ≥ q*i(t, α); and lastly, when Di/α ≤ t < Di, qi(t, α) = Ci = q*i(t, α).

We can conclude that q*i(t, α) ≤ qi(t, α) for any 0 ≤ t < Di. Combining this with the definition of worki(t, α) (Equation (4)) completes the proof.

Fig. 2: q*i(t, α), qi(t, α), work*i(t, α) and worki(t, α) for the high-utilization task τi of Figure 1 (Di = 16), for α = 1 and α = 2. (a) q*i(t, α) and qi(t, α). (b) work*i(t, α) and worki(t, α).

We further classify each task τi as light or heavy: τi is a light task if ui = Ci/Di < α, and a heavy task otherwise (ui ≥ α). The following lemmas provide an upper bound on the density (the ratio of the workload that has to be finished to the interval length) for heavy and light tasks.

Lemma 6. For any task τi, t > 0 and α > 1, we have

worki(t, α)/t ≤ work*i(t, α)/t ≤  ui                      if 0 ≤ ui < α,
                                   (ui − 1)/(1 − 1/α)      if α ≤ ui.     (7)

Proof: The first inequality in (7) follows from Lemma 5. We now show that the second inequality also holds for any task; note that the right-hand side is positive in both cases. There are two cases.

Case 1: 0 < t ≤ Di.

• If τi is a low-utilization task, work*i(t, α) is defined by Equation (5). For any 0 < t ≤ Di, we have

work*i(t, α) − (Ci/Di)·t ≤ α·(t − Di + Ci/α) − (Ci/Di)·t = (t − Di)·(α − Ci/Di) ≤ 0,

where we rely on the assumptions that (a) t ≤ Di; (b) Ci/Di < 1, since τi is a low-utilization task; and (c) α > 1. Hence work*i(t, α)/t ≤ Ci/Di = ui.

• If τi is a high-utilization task, work*i(t, α) is defined by Equation (6). Inequality (7) holds trivially when 0 < t < Di − Di/α, since the left-hand side is 0 and the right-hand side is positive. For Di − Di/α ≤ t ≤ Di, we have

work*i(t, α)/t = (Ci + αt − αDi)/t = α + Di·(ui − α)/t.

Therefore, work*i(t, α)/t is maximized either (a) at t = Di − Di/α if ui − α ≥ 0, or (b) at t = Di if ui − α < 0.

If τi is a light task with 1 ≤ ui < α, case (b) applies, and work*i(Di, α)/Di = ui.

If τi is a heavy task with α ≤ ui, case (a) applies, and

work*i(Di − Di/α, α)/(Di − Di/α) = (Ci − Di)/(Di − Di/α) = (ui − 1)/(1 − 1/α).

Therefore, Inequality (7) holds for 0 < t ≤ Di.

Case 2: t > Di. Write t = kDi + t′, where k = ⌊t/Di⌋ and 0 < t′ ≤ Di. When ui < α, by Equations (5) and (6) we have

work*i(t, α)/t = (kCi + work*i(t′, α))/(kDi + t′) ≤ (k·ui·Di + ui·t′)/(kDi + t′) = ui.

When α ≤ ui, we can derive that ui ≤ (ui − 1)/(1 − 1/α). By Equation (6), we have

work*i(t, α)/t = (kCi + work*i(t′, α))/(kDi + t′) ≤ (k·ui·Di + ((ui − 1)/(1 − 1/α))·t′)/(kDi + t′)
              ≤ (k·((ui − 1)/(1 − 1/α))·Di + ((ui − 1)/(1 − 1/α))·t′)/(kDi + t′) = (ui − 1)/(1 − 1/α).

Hence, Inequality (7) holds for any task and any t > 0.

We denote by τL and τH the sets of light and heavy tasks in a task set, respectively; by ‖τH‖ the number of heavy tasks in the task set; and by UL = Σ_{τi∈τL} ui and UH = Σ_{τi∈τH} ui the total utilizations of the light and heavy tasks, respectively.

Lemma 7. For any task set and any t > 0, the following inequalities hold:

W = Σⁿᵢ₌₁ worki(t, α) ≤ ((α·U∑ − UL − α·‖τH‖)/(α − 1))·t     (8)
                       ≤ ((αm − α)/(α − 1))·t.               (9)

Proof: By Lemma 6, for any α > 1, it is clear that

W/t ≤ Σ_{τi∈τL∪τH} sup_{t>0} (work*i(t, α)/t) ≤ Σ_{τi∈τL} ui + Σ_{τi∈τH} (ui − 1)/(1 − 1/α)

    = (α·Σ_{τL} ui − Σ_{τL} ui + α·Σ_{τH} ui − α·Σ_{τH} 1)/(α − 1) = (α·U∑ − UL − α·‖τH‖)/(α − 1),

α−1


where sup denotes the supremum of a set of numbers, and τL and τH are the sets of light tasks (ui < α) and heavy tasks (ui ≥ α), respectively.

Note that Σ_{τL} ui + Σ_{τH} ui = UL + UH = U∑ ≤ m. Since ui ≥ α for every τi ∈ τH, we have UH = Σ_{τH} ui ≥ ‖τH‖·α, and we can derive the following upper bound:

U∑ − UL − ‖τH‖·α ≤ sup_τ (U∑ − UL − ‖τH‖·α) = U∑ − α.

This is because for any task set there are two cases:

• If ‖τH‖ = 0, and hence U∑ = UL, then U∑ − UL − ‖τH‖·α = 0.

• If ‖τH‖ ≥ 1, then U∑ − UL = UH ≤ U∑ and ‖τH‖·α ≥ α; therefore U∑ − UL − ‖τH‖·α ≤ U∑ − α.

Together with the definition of U∑ and Σ_{τH} 1 = ‖τH‖, we obtain

W ≤ ((α·U∑ − UL − α·‖τH‖)/(α − 1))·t ≤ ((α·U∑ − α)/(α − 1))·t ≤ ((αm − α)/(α − 1))·t,

which proves Inequalities (8) and (9) of Lemma 7.

We use Lemma 7 in Sections VI and VII to derive capacity augmentation bounds for G-EDF and G-RM scheduling.
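Lemma 7 can be spot-checked numerically on a concrete task set. The sketch below re-implements the closed forms of Eqs. (5) and (6) (ε ignored) and verifies the slope of Inequality (8); the three task parameter pairs are invented for illustration.

```python
def work_star(C, D, t, alpha):
    # Canonical maximum load per Eqs. (5)/(6), eps ignored.
    if t > D:
        k = int(t // D)
        rem = t - k * D
        return k * C + (work_star(C, D, rem, alpha) if rem > 0 else 0.0)
    start = D - (C if C < D else D) / alpha
    if t < start:
        return 0.0
    base = 0.0 if C < D else C - D
    return base + alpha * (t - start)

alpha = 2.0
# (C_i, D_i): one heavy task (u = 2 >= alpha) and two light tasks.
tasks = [(32.0, 16.0), (4.0, 8.0), (5.0, 10.0)]
U = sum(C / D for C, D in tasks)                      # total utilization
UL = sum(C / D for C, D in tasks if C / D < alpha)
nH = sum(1 for C, D in tasks if C / D >= alpha)       # ||tau_H||
bound = (alpha * U - UL - alpha * nH) / (alpha - 1)   # slope of Ineq. (8)
for t in (x / 4 for x in range(1, 200)):
    W = sum(work_star(C, D, t, alpha) for C, D in tasks)
    assert W <= bound * t + 1e-9                      # Inequality (8)
```

For this set U∑ = 3, UL = 1 and ‖τH‖ = 1, so the bound reduces to W ≤ 3t, which the loop confirms over a range of interval lengths.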

VI. Capacity Augmentation of Global EDF

In this section, we use the results from Section V to prove the capacity augmentation bound of (3 + √5)/2 for G-EDF scheduling of parallel DAG tasks. In addition, we show a matching lower bound when m ≥ 3.

A. Upper Bound on Capacity Augmentation of G-EDF

Our analysis builds on the analysis used to prove the resource augmentation bounds by Bonifaci et al. [15]. We first restate the particular lemma from that paper that we use to achieve our bound.

Lemma 8. If ∀t > 0, (αm − m + 1)·t ≥ Σⁿᵢ₌₁ worki(t, α), then the task set is schedulable by G-EDF on speed-α cores.

Proof: This is a reformulation of Lemma 3 and Definition 10 in [15], considering cores of speed α.

Theorem 4. The capacity augmentation bound for G-EDF is (3 − 2/m + √(5 − 8/m + 4/m²))/2 (≈ (3 + √5)/2 when m is large).

Proof: From Inequality (9) of Lemma 7, we have, for all t > 0, Σⁿᵢ₌₁ worki(t, α) ≤ ((αm − α)/(α − 1))·t. If (αm − α)/(α − 1) ≤ αm − m + 1, then by Lemma 8 the schedulability test for G-EDF holds. To calculate α, we solve the equivalent inequality

mα² − (3m − 2)α + (m − 1) ≥ 0,

which solves to α = (3 − 2/m + √(5 − 8/m + 4/m²))/2.
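The bound of Theorem 4 is easy to evaluate. The sketch below computes it and confirms that it is a root of the quadratic derived in the proof and that it increases toward (3 + √5)/2 as m grows.

```python
from math import sqrt

def gedf_bound(m):
    """Capacity augmentation bound of G-EDF on m cores (Theorem 4)."""
    return (3 - 2 / m + sqrt(5 - 8 / m + 4 / m**2)) / 2

m = 100
a = gedf_bound(m)
# a is the positive root of m*a^2 - (3m - 2)*a + (m - 1) = 0,
# the quadratic solved in the proof of Theorem 4.
assert abs(m * a * a - (3 * m - 2) * a + (m - 1)) < 1e-8
```

For m = 2 the bound is (2 + √2)/2 ≈ 1.707, and it approaches (3 + √5)/2 ≈ 2.618 for large m.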

We now state a more general corollary giving a more precise bound that uses more information about the task set.

Corollary 2. If a task set has total utilization U∑, total heavy-task utilization UH, and ‖τH‖ heavy tasks, then this task set is schedulable under G-EDF on an m-core machine of speed

α = (2 + (U∑ − ‖τH‖ − 1)/m + √(4(UH − ‖τH‖)/m + (U∑ − ‖τH‖ − 1)²/m²))/2.

Fig. 3: The required speedup of G-EDF when m is sufficiently large and UL = 0 (i.e. U∑ = UH), plotted against U∑/m for U∑/‖τH‖ ∈ {2.618, 5.236, 10.47, 20.94}.

Fig. 4: The required speedup of G-EDF when ‖τH‖ = 1 and UL = 0 (i.e. U∑ = UH), plotted against U∑/m for m ∈ {3, 6, 12, 24}.

Proof: The proof is the same as that of Theorem 4, except that instead of Inequality (9) we directly use Inequality (8). If (α·U∑ − UL − α·‖τH‖)/(α − 1) ≤ αm − m + 1, then by Lemma 8 the schedulability test for G-EDF holds for this task set. Solving this inequality gives the required speedup α for the schedulability of the task set.

Note, however, that the heavy tasks are defined as the tasks τi with utilization ui ≥ α, which itself depends on α. Therefore, given a task set, to accurately calculate α we start with the upper bound on α, namely α = (3 − 2/m + √(5 − 8/m + 4/m²))/2; then, in each iteration i, we calculate the required speedup αi using U_H and ‖τH‖ from the (i − 1)-th iteration, iteratively classifying more tasks as heavy, and we stop when no more tasks can be added to the heavy set, i.e., when ‖τH‖ no longer changes between iterations. Through these iterative steps, we obtain an accurate speedup.
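The iterative procedure described above can be sketched as follows; this is a hypothetical implementation of the fixed-point iteration, taking per-task utilizations as the only inputs.

```python
from math import sqrt

def corollary2_speed(m, U, UH, nH):
    """Required G-EDF speed from Corollary 2 for a task set with total
    utilization U, heavy utilization UH, and nH heavy tasks."""
    b = (U - nH - 1) / m
    return (2 + b + sqrt(4 * (UH - nH) / m + b * b)) / 2

def iterative_speed(m, utils):
    """Fixed-point iteration of Section VI-A: start from the Theorem 4
    bound, reclassify heavy tasks (u_i >= alpha), and stop when the
    heavy set no longer changes."""
    U = sum(utils)
    alpha = (3 - 2 / m + sqrt(5 - 8 / m + 4 / m**2)) / 2   # Theorem 4
    while True:
        heavy = [u for u in utils if u >= alpha]
        alpha = corollary2_speed(m, U, sum(heavy), len(heavy))
        if len([u for u in utils if u >= alpha]) == len(heavy):
            return alpha
```

For example, with m = 4 and utilizations {2.0, 1.0, 0.5, 0.5} (total m), the iteration starts at α ≈ 2.151 with no heavy tasks, drops to 1.75, reclassifies the utilization-2 task as heavy, and converges to α = (2.5 + √1.25)/2 ≈ 1.809.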


Figure 3 illustrates the required speedup of G-EDF given by Corollary 2 when m is sufficiently large (i.e., m = ∞ and 1/m = 0) and UL = 0 (i.e. U∑ = UH). We vary U∑/m and U∑/‖τH‖. Note that U∑/‖τH‖ = UH/‖τH‖ is the average utilization of the heavy tasks, which should be no less than (3 + √5)/2 ≈ 2.618. It can also be seen that the bound gets closer to (3 + √5)/2 as U∑/‖τH‖ grows, which corresponds to ‖τH‖ = 1 and m = ∞.

Figure 4 illustrates the required speedup of G-EDF given by Corollary 2 when ‖τH‖ = 1 and UL = 0 (i.e. U∑ = UH) for varying m. Note that α > 1 is required by the proof of Corollary 2, and ‖τH‖ = 1 can only hold if U∑/m ≥ α/m (i.e. U∑/m ≥ 1/3 for m = 3).

B. Lower Bound on Capacity Augmentation of G-EDF

As mentioned above, Li et al.'s lower bound [35] demonstrates the tightness of the above bound for large m. We now provide a lower bound on the capacity augmentation bound for small m.

Consider a task set τ with two tasks, τ1 and τ2. Task τ1 starts with a sequential execution of length 1 − ε and then forks (m − 2)/ε + 1 subtasks, each with execution time ε. Here, ε is an arbitrarily small positive number, and it is safe to assume that (m − 2)/ε is a positive integer. Therefore, the total work of task τ1 is C1 = m − 1 and its critical-path length is L1 = 1. The minimum inter-arrival time of τ1 is 1.

Task τ2 is simply a sequential task with work (execution time) 1 − 1/α and minimum inter-arrival time also 1 − 1/α, where α > 1 will be defined later. Clearly, the total utilization is m, and the critical-path length of each task is at most its relative deadline (minimum inter-arrival time).

Lemma 9. When α < (3 − 2/m − δ + √(5 − 12/m + 4/m²))/2, where δ = 2ε + g(ε), and m ≥ 3, then (1 − 2ε)/α + (m − 2)/(mα) > 1 − (1 − 1/α)/α holds.

mα > 1− 1− 1α

α holds.

Proof: By solving 1−2εα + m−2

mα = 1− 1− 1α

α , we know thatthe equality holds when

α <3− 2

m−2ε+√

5− 12m+ 4

m2−g(ε)2

where g(ε) is a positive function of 1ε , which approaches

to 0 when ε approaches 0. Now, by setting δ to 2ε+ g(ε),we reach the conclusion.

Theorem 5. The capacity augmentation bound for G-EDF is at least (3 − 2/m + √(5 − 12/m + 4/m²))/2, as ε → 0⁺.

Proof: Consider the system with the two tasks τ1 and τ2 defined at the beginning of Section VI-B. Suppose that the first job of τ1 arrives at time 0 and the first job of τ2 arrives at time 1/α + ε/α. By definition, the first jobs of τ1 and τ2 have absolute deadlines at 1 and 1 + ε/α, respectively. Hence, G-EDF will first execute the sequential part and then the sub-jobs of task τ1, and only then execute τ2.

Fig. 5: The upper bound of G-EDF from Theorem 4 and the lower bound from Theorem 5 with respect to the capacity augmentation bound, for m from 10 to 100.

The finishing time of τ1 at speed α is no earlier than (1 − ε)/α + ((m − 2)/ε)·ε/(mα) = (1 − ε)/α + (m − 2)/(mα). Hence, the finishing time of τ2 is no earlier than (1 − ε)/α + (m − 2)/(mα) + (1 − 1/α)/α. If this quantity exceeds 1 + ε/α, then task τ2 misses its deadline. By Lemma 9, we reach the conclusion.

Figure 5 illustrates the upper bound of G-EDF from Theorem 4 and the lower bound from Theorem 5 with respect to the capacity augmentation bound. It can easily be seen that the upper and lower bounds get closer as m grows; when m is 100, the gap between them is roughly 0.00452.

It is important to note that the more precise speedup in Corollary 2 is tight even for small m. In the example task set above, U∑ = m, the total heavy-task utilization is UH = m − 1, and the number of heavy tasks is ‖τH‖ = 1. According to the Corollary, the capacity augmentation bound for this task set under G-EDF is

(2 + (U∑ − ‖τH‖ − 1)/m + √(4(UH − ‖τH‖)/m + (U∑ − ‖τH‖ − 1)²/m²))/2
= (2 + (m − 2)/m + √(4(m − 2)/m + (m − 2)²/m²))/2
= (3 − 2/m + √(5 − 12/m + 4/m²))/2,

which is exactly the lower bound in Theorem 5 as ε → 0⁺.
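The algebra above can be double-checked numerically: plugging U∑ = m, UH = m − 1, and ‖τH‖ = 1 into the Corollary 2 formula reproduces the lower-bound expression of Theorem 5 for every m.

```python
from math import sqrt

def corollary2_speed(m, U, UH, nH):
    """Required G-EDF speed from Corollary 2."""
    b = (U - nH - 1) / m
    return (2 + b + sqrt(4 * (UH - nH) / m + b * b)) / 2

def theorem5_lower(m):
    """Lower bound on the G-EDF capacity augmentation (Theorem 5)."""
    return (3 - 2 / m + sqrt(5 - 12 / m + 4 / m**2)) / 2

# The two expressions agree for the lower-bound task set parameters.
for m in range(3, 200):
    assert abs(corollary2_speed(m, m, m - 1, 1) - theorem5_lower(m)) < 1e-9
```

The agreement follows from (m − 2)(5m − 2)/m² = 5 − 12/m + 4/m² inside the square root.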

VII. G-RM Scheduling

This section proves that G-RM provides a capacity augmentation bound of 2 + √3 for large m. The structure of the proof is very similar to the analysis in Section VI. Again, we use a lemma from [15], restated below.

Lemma 10. If ∀t > 0, 0.5(αm − m + 1)·t ≥ Σᵢ worki(t, α), then the task set is schedulable by G-RM on speed-α cores.


Fig. 6: The required speedup of G-RM when m is sufficiently large and UL = 0 (i.e. U∑ = UH), plotted against U∑/m for U∑/‖τH‖ ∈ {3.732, 7.464, 14.93, 29.86}.

Proof: This is a reformulation of Lemma 6 and Definition 10 in [15]. Note that the analysis in [15] is for deadline-monotonic scheduling, which gives a sub-job of a task higher priority if its relative deadline is shorter. Since we consider tasks with implicit deadlines, deadline-monotonic scheduling is the same as rate-monotonic scheduling.

Proofs like those in Section VI-A give us the bounds below for G-RM scheduling.

Theorem 6. The capacity augmentation bound for G-RM is (4 − 3/m + √(12 − 20/m + 9/m²))/2 (≈ 2 + √3 when m is large).

Proof: (Similar to the proof of Theorem 4.) First, we know from Lemma 7 that ∀t > 0, Σⁿᵢ₌₁ worki(t, α) ≤ ((αm − α)/(α − 1))·t. Second, if (αm − α)/(α − 1) ≤ 0.5(αm − m + 1), we can conclude that the schedulability test for G-RM in Lemma 10 holds. Solving this inequality yields α ≥ (4 − 3/m + √(12 − 20/m + 9/m²))/2, which proves Theorem 6.

The result in Theorem 6 is the best known capacity augmentation bound for global fixed-priority scheduling of general DAG tasks with arbitrary structures. Interestingly, Kim et al. [31] obtain the same bound of 2 + √3 for global fixed-priority scheduling of parallel synchronous tasks (a subset of DAG tasks).
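As with Theorem 4, the Theorem 6 bound can be evaluated directly; it approaches 2 + √3 ≈ 3.732 as m grows, and for any finite m it stays below that limit.

```python
from math import sqrt

def grm_bound(m):
    """Capacity augmentation bound of G-RM on m cores (Theorem 6)."""
    return (4 - 3 / m + sqrt(12 - 20 / m + 9 / m**2)) / 2

m = 100
a = grm_bound(m)
# a is the positive root of m*a^2 - (4m - 3)*a + (m - 1) = 0, the
# quadratic obtained from (alpha*m - alpha)/(alpha - 1) <= 0.5*(alpha*m - m + 1).
assert abs(m * a * a - (4 * m - 3) * a + (m - 1)) < 1e-8
```

For m = 4 the bound is exactly 3, already noticeably below the asymptotic value 2 + √3.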

The strategy used in [31] is quite different. In their algorithm, each parallel task in the original task set undergoes a stretch transformation, which generates a set of sequential subtasks (each with its own release time and deadline). These subtasks are then scheduled using a G-DM scheduling algorithm [11]. Note that even though the parallel tasks in the original task set have implicit deadlines, the transformed sequential tasks have only constrained deadlines, hence the need for deadline-monotonic rather than rate-monotonic scheduling.

Corollary 3. If a task set has total utilization U∑, total heavy-task utilization UH, and ‖τH‖ heavy tasks, then this task set is schedulable under G-RM on an m-core machine of speed

α = (2 + (2U∑ − 2‖τH‖ − 1)/m + √(8(UH − ‖τH‖)/m + (2U∑ − 2‖τH‖ − 1)²/m²))/2.

Proof: The proof is the same as that of Corollary 2, except that it uses Lemma 10 instead of Lemma 8.

Figure 6 illustrates the required speedup of G-RM given by Corollary 3 when m is sufficiently large (i.e., m = ∞ and 1/m = 0) and UL = 0 (i.e. U∑ = UH). We vary U∑/m and U∑/‖τH‖. Again, U∑/‖τH‖ is the average utilization of the heavy tasks. It can be seen that the bound gets closer to 2 + √3 as U∑/‖τH‖ grows, which corresponds to ‖τH‖ = 1 and m = ∞.

VIII. Practical Considerations

As shown in the previous sections, the capacity augmentation bounds of federated scheduling, G-EDF, and G-RM are 2, 2.618, and 3.732, respectively. In this section, we consider their run-time efficiency and efficacy from a practical perspective along four dimensions: static vs. dynamic priorities, global vs. partitioned scheduling, scheduling and synchronization overheads, and work-conserving vs. non-work-conserving behavior.

Practitioners have generally found it easier to implement fixed (task) priority schedulers (such as RM) than dynamic priority schedulers (such as EDF). Fixed-priority schedulers have long been well supported in almost all real-time operating systems. Recently, there have been efforts toward efficient implementations of job-level dynamic priority (EDF and G-EDF) schedulers for sequential tasks [16, 34]. Federated scheduling does not require any priority assignment for high-utilization tasks (since they own their cores exclusively) and can use either fixed or dynamic priorities for low-utilization tasks. Thus, G-RM and federated scheduling are relatively easier to implement.

For sequential tasks, global scheduling generally incurs more overhead due to thread migration and the associated cache penalty, the extent of which depends on the cache architecture and the task sets. For parallel tasks, the overheads of global scheduling can be worse still. With sequential tasks, preemptions and migrations occur only when a new job with higher priority is released. In contrast, with parallel tasks, a preemption and possibly a migration can occur whenever a node in the DAG of a higher-priority job is enabled. Since nodes in a DAG often represent fine-grained units of computation, the number of nodes in a task set can be much larger than the number of tasks, so we can expect a larger number of such events. Since federated scheduling is a generalization of partitioned scheduling to parallel tasks, it has advantages similar to partitioning. In fact, if we use a partitioned RM or partitioned EDF strategy for low-utilization tasks, there are preemptions but no migrations for those tasks. Meanwhile, federated scheduling allocates only the minimum number of dedicated cores needed to ensure the schedulability of each high-utilization task, so there are no preemptions for high-utilization tasks and the number of migrations is minimized. Hence, we expect federated scheduling to have less overhead than global schedulers.

In addition, parallel runtime systems incur further overheads, such as synchronization and scheduling overheads. These per-task overheads are usually approximately linear in the number of cores allocated to the task. Under federated scheduling, a minimum number of cores is assigned; depending on the particular implementation, global scheduling may execute a task on all the cores in the system and may thus have higher overheads.

Finally, note that federated scheduling is not a greedy (work-conserving) strategy for the entire task set, although it uses a greedy schedule for each individual task. In many real systems, worst-case execution times are overestimated. Under federated scheduling, cores allocated to tasks with overestimated execution times may idle due to resource over-provisioning. In contrast, work-conserving strategies (such as G-EDF and G-RM) can dynamically utilize available cores through thread migration.

IX. Related Work

In this section, we review closely related work on real-time scheduling, concentrating primarily on parallel tasks.

Real-time multiprocessor scheduling considers scheduling sequential tasks on computers with multiple processors or cores and has been studied extensively (see [10, 23] for a survey). In addition, platforms such as LITMUSRT [17, 19] have been designed to support such task sets. Here, we review a few relevant theoretical results. Researchers have proven resource augmentation bounds, utilization bounds, and capacity augmentation bounds. The best known resource augmentation bound for G-EDF for sequential tasks on a multiprocessor is 2 [7], together with a capacity augmentation bound of 2 − 1/m + ε for small ε [14]. Partitioned EDF and versions of partitioned static-priority schedulers also provide a utilization bound of 2 [3, 37]. G-RM provides a capacity augmentation bound of 3 [2] for implicit-deadline tasks.

For parallel real-time tasks, most early work considered intra-task parallelism in limited task models, such as malleable tasks [22, 30, 33] and moldable tasks [39]. Kato et al. [30] studied Gang EDF scheduling of moldable parallel task systems.

Researchers have since considered more realistic task models that represent programs written in commonly used parallel programming languages such as the Cilk family [13, 21], OpenMP [41], and Intel's Threading Building Blocks [43]. These languages and libraries support primitives such as parallel for-loops and fork/join or spawn/sync in order to expose parallelism within programs. Using these constructs generates tasks whose structure can be represented by different types of DAGs.

Parallel synchronous tasks have been studied more than other models in the real-time community. Such tasks arise when only parallel for-loops are used to generate parallelism. Lakshmanan et al. [32] proved a (capacity) augmentation bound of 3.42 for a restricted synchronous task model in which each parallel for-loop in a task has the same number of iterations. General synchronous tasks (with no restriction on the number of iterations of the parallel for-loops) have also been studied [4, 31, 40, 45]; more details on these results were presented in Section I. Chwa et al. [20] provide a response time analysis.

If we do not restrict the primitives to parallel for-loops, we get a more general task model, most easily represented by a general directed acyclic graph. A resource augmentation bound of 2 − 1/m for G-EDF was proved for a single DAG with arbitrary deadlines [8] and for multiple DAGs [15, 35]. A capacity augmentation bound of 4 − 2/m was proved in [35] for tasks with implicit deadlines. Liu et al. [36] provide a response time analysis for G-EDF.

There has been significant work on scheduling non-real-time parallel systems [5, 6, 24–26, 42], where the goal is generally to maximize throughput. Various provably good scheduling strategies, such as list scheduling [18, 27] and work stealing [12], have been designed, and many parallel languages and runtime systems have been built based on these results. While running multiple tasks on a single platform has been considered in the context of fairness in resource allocation [1], none of this work considers real-time constraints.

X. Conclusions

In this paper, we considered parallel tasks in the DAG model and proved that, for parallel tasks with implicit deadlines, the capacity augmentation bounds of federated scheduling, G-EDF, and G-RM are 2, 2.618, and 3.732, respectively. The bound of 2 for federated scheduling and the bound of 2.618 for G-EDF are both tight for large m, since there exist matching lower bounds. Moreover, all three bounds are the best known bounds for these schedulers for DAG tasks.

There are several directions for future work. The G-RM capacity augmentation bound is not known to be tight; the current lower bound for G-RM is 2.668, inherited from sequential sporadic real-time tasks without DAG structure [38]. It is therefore worth investigating a matching lower bound or a lower upper bound. In addition, since the lower bound for any scheduler is 2 − 1/m, it would be interesting to investigate whether schedulers reaching this bound can be designed. Finally, all known capacity augmentation bound results are restricted to implicit-deadline tasks; we would like to generalize them to constrained and arbitrary deadline tasks.

Acknowledgment

This research was supported in part by the priority program "Dependable Embedded Systems" (SPP 1500, spp1500.itec.kit.edu) of the DFG, as part of the Collaborative Research Center SFB 876 (http://sfb876.tu-dortmund.de/), and by NSF grants CCF-1136073 (CPS) and CCF-1337218 (XPS). The authors thank the anonymous reviewers for their suggestions on improving this paper.

References

[1] K. Agrawal, C. E. Leiserson, Y. He, and W. J. Hsu. “Adaptivework-stealing with parallelism feedback”. In: ACM Trans. Com-put. Syst. 26 (2008), pp. 112–120.

[2] B. Andersson, S. Baruah, and J. Jonsson. “Static-priority schedu-ling on multiprocessors”. In: RTSS. 2001.

[3] B. Andersson and J. Jonsson. “The utilization bounds of parti-tioned and pfair static-priority scheduling on multiprocessors are50%”. In: ECRTS. 2003.

[4] B. Andersson and D. de Niz. “Analyzing Global-EDF for Mul-tiprocessor Scheduling of Parallel Tasks”. In: Principles of Dis-tributed Systems. 2012, pp. 16–30.

[5] N. S. Arora, R. D. Blumofe, and C. G. Plaxton. “Thread Schedu-ling for Multiprogrammed Multiprocessors”. In: SPAA. 1998.

[6] N. Bansal, K. Dhamdhere, J. Konemann, and A. Sinha. “Non-clairvoyant Scheduling for Minimizing Mean Slowdown”. In:Algorithmica 40.4 (2004), pp. 305–318.

[7] S. Baruah, V. Bonifaci, A. Marchetti-Spaccamela, and S. Stiller.“Improved Multiprocessor Global Schedulability Analysis”. In:Real-Time Syst. 46.1 (2010), pp. 3–24.

[8] S. Baruah, V. Bonifaci, A. Marchetti-Spaccamela, L. Stougie, andA. Wiese. “A generalized parallel task model for recurrent real-time processes”. In: RTSS. 2012.

[9] S. K. Baruah, A. K. Mok, and L. E. Rosier. “PreemptivelyScheduling Hard-Real-Time Sporadic Tasks on One Processor”.In: RTSS. 1990.

[10] M. Bertogna and S. Baruah. “Tests for global EDF schedulabilityanalysis”. In: Journal of System Architecture 57.5 (2011).

[11] M. Bertogna, M. Cirinei, and G. Lipari. “New SchedulabilityTests for Real-time Task Sets Scheduled by Deadline Monotonicon Multiprocessors”. In: Proceedings of the 9th InternationalConference on Principles of Distributed Systems. 2006.

[12] R. D. Blumofe and C. E. Leiserson. “Scheduling multithreadedcomputations by work stealing”. In: Journal of the ACM 46.5(1999), pp. 720–748.

[13] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson,K. H. Randall, and Y. Zhou. “Cilk: An Efficient MultithreadedRuntime System”. In: PPoPP. 1995, pp. 207–216.

[14] V. Bonifaci, A. Marchetti-Spaccamela, and S. Stiller. “A constant-approximate feasibility test for multiprocessor real-time schedu-ling”. In: Algorithmica 62.3-4 (2012), pp. 1034–1049.

[15] V. Bonifaci, A. Marchetti-Spaccamela, S. Stiller, and A. Wiese.“Feasibility Analysis in the Sporadic DAG Task Model”. In:ECRTS. 2013.

[16] B. B. Brandenburg and J. H. Anderson. “On the Implementationof Global Real-Time Schedulers”. In: RTSS. 2009.

[17] B. B. Brandenburg, A. D. Block, J. M. Calandrino, U. Devi, H.Leontyev, and J. H. Anderson. LITMUS RT: A Status Report. 2007.

[18] R. P. Brent. “The Parallel Evaluation of General ArithmeticExpressions”. In: Journal of the ACM (1974), pp. 201–206.

[19] J. M. Calandrino, H. Leontyev, A. Block, U. C. Devi, and J. H.Anderson. “LITMUSRT : A Testbed for Empirically ComparingReal-Time Multiprocessor Schedulers”. In: RTSS. 2006.

[20] H. S. Chwa, J. Lee, K.-M. Phan, A. Easwaran, and I. Shin. “GlobalEDF Schedulability Analysis for Synchronous Parallel Tasks onMulticore Platforms”. In: ECRTS. 2013.

[21] CilkPlus. http://software.intel.com/en-us/articles/intel-cilk-plus.

[22] S. Collette, L. Cucu, and J. Goossens. "Integrating job parallelism in real-time scheduling theory". In: Information Processing Letters 106.5 (2008), pp. 180–187.

[23] R. I. Davis and A. Burns. “A survey of hard real-time schedulingfor multiprocessor systems”. In: ACM Computing Surveys 43(2011), 35:1–44.

[24] X. Deng, N. Gu, T. Brecht, and K. Lu. “Preemptive Schedulingof Parallel Jobs on Multiprocessors”. In: SODA. 1996.

[25] M. Drozdowski. “Real-time scheduling of linear speedup paralleltasks”. In: Inf. Process. Lett. 57 (1996), pp. 35–40.

[26] J. Edmonds, D. D. Chinn, T. Brecht, and X. Deng. “Non-clairvoyant Multiprocessor Scheduling of Jobs with ChangingExecution Characteristics”. In: Journal of Scheduling 6.3 (2003),pp. 231–250.

[27] R. L. Graham. “Bounds on Multiprocessing Anomalies”. In: SIAMJournal on Applied Mathematics (1969), 17(2):416–429.

[28] H.-M. Huang, T. Tidwell, C. Gill, C. Lu, X. Gao, and S. Dyke.“Cyber-physical systems for real-time hybrid structural testing:a case study”. In: International Conference on Cyber PhysicalSystems. 2010.

[29] A. Ka and L. Mok. Fundamental design problems of distributedsystems for the hard-real-time environment. Tech. rep. 3. 1983.

[30] S. Kato and Y. Ishikawa. “Gang EDF Scheduling of Parallel TaskSystems”. In: RTSS. 2009.

[31] J. Kim, H. Kim, K. Lakshmanan, and R. Rajkumar. “Parallelscheduling for cyber-physical systems: analysis and case studyon a self-driving car”. In: ICCPS. 2013.

[32] K. Lakshmanan, S. Kato, and R. R. Rajkumar. “Scheduling Paral-lel Real-Time Tasks on Multi-core Processors”. In: Proceedings ofthe 2010 31st IEEE Real-Time Systems Symposium. RTSS. 2010.

[33] W. Y. Lee and H. Lee. “Optimal Scheduling for Real-Time ParallelTasks”. In: IEICE Transactions on Information Systems E89-D.6(2006), pp. 1962–1966.

[34] J. Lelli, G. Lipari, D. Faggioli, and T. Cucinotta. “An efficient andscalable implementation of global EDF in Linux”. In: OSPERT.2011.

[35] J. Li, K. Agrawal, C.Lu, and C. Gill. “Analysis of Global EDFfor Parallel Tasks”. In: ECRTS. 2013.

[36] C. Liu and J. Anderson. “Supporting Soft Real-Time ParallelApplications on Multicore Processors”. In: RTCSA. 2012.

[37] J. M. López, J. L. Díaz, and D. F. García. "Utilization Bounds for EDF Scheduling on Real-Time Multiprocessor Systems". In: Real-Time Systems 28.1 (2004), pp. 39–68.

[38] L. Lundberg. “Analyzing Fixed-Priority Global MultiprocessorScheduling”. In: IEEE Real Time Technology and ApplicationsSymposium. 2002, pp. 145–153.

[39] G. Manimaran, C. S. R. Murthy, and K. Ramamritham. “ANew Approach for Scheduling of Parallelizable Tasks inReal-Time Multiprocessor Systems”. In: Real-Time Syst. 15 (1998),pp. 39–60.

[40] G. Nelissen, V. Berten, J. Goossens, and D. Milojevic. “Tech-niques optimizing the number of processors to schedule multi-threaded tasks”. In: ECRTS. 2012.

[41] OpenMP Application Program Interface v3.1. http://www.openmp.org/mp-documents/OpenMP3.1.pdf. 2011.

[42] C. D. Polychronopoulos and D. J. Kuck. “Guided Self-Scheduling:A Practical Scheduling Scheme for Parallel Supercomputers”. In:Computers, IEEE Transactions on C-36.12 (1987).

[43] J. Reinders. Intel threading building blocks: outfitting C++ formulti-core processor parallelism. O’Reilly Media, 2010.

[44] A. Saifullah, D. Ferry, J. Li, K. Agrawal, C. Lu, and C. Gill.“Parallel real-time scheduling of DAGs”. In: IEEE Transactionson Parallel and Distributed Systems (2014).

[45] A. Saifullah, J. Li, K. Agrawal, C. Lu, and C. Gill. “Multi-corereal-time scheduling for generalized parallel task models”. In:Real-Time Systems 49.4 (2013), pp. 404–435.

[46] R. Sedgewick and K. D. Wayne. Algorithms. 4th. 2011.
