journal of la response time bounds for typed …wcet, while figure 2(b) shows another execution...

12
JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1 Response Time Bounds for Typed DAG Parallel Tasks on Heterogeneous Multi-cores Meiling Han, Nan Guan, Jinghao Sun, Qingqiang He, Qingxu Deng, and Weichen Liu, Abstract—Heterogenerous multi-cores utilize the strength of different architectures for executing particular types of workload, and usually offer higher performance and energy efficiency. In this paper, we study the worst-case response time (WCRT) analysis of typed scheduling of parallel DAG tasks on heterogeneous multi-cores, where the workload of each vertex in the DAG is only allowed to execute on a particular type of cores. The only known WCRT bound for this problem is grossly pessimistic and suffers the non-self-sustainability problem. In this paper, we propose two new WCRT bounds. The first new bound has the same time complexity as the existing bound, but is more precise and solves its non-self-sustainability problem. The second new bound explores more detailed task graph structure information to greatly improve the precision, but is computationally more expensive. We prove that the problem of computing the second bound is strongly NP-hard if the number of types in the system is a variable, and develop an efficient algorithm which has polynomial time complexity if the number of types is a constant. Experiments with randomly generated workload show that our proposed new methods are significantly more precise than the existing bound while having good scalability. Index Terms—Heterogenerous multi-cores system, embedded real-time scheduling, response time analysis, DAG parallel task. 1 I NTRODUCTION Multi-cores are more and more widely used in real-time systems, to meet rapidly increasing requirements in perfor- mance and energy efficiency. To fully utilize the computa- tion capacity of multi-cores, software should be properly parallelized. A representation that can model a wide range of parallel software is the DAG (directed acyclic graph) task model, where each vertex represents a piece of sequential workload and each edge represents the precedence relation between two vertices. Real-time scheduling and analysis of DAG parallel task models have raised many new challenges over traditional real-time scheduling theory with sequential tasks, and have become an increasingly hot research topic in recent years. Many modern multi-cores adopt heterogeneous ar- chitectures. Examples include Zynq-7000 [37] and OMAP1/OMAP2 [33] that integrate CPU and DSP on the same chip, and the Tegra processors [36] that integrate CPU and GPU on the same chip. Heterogenerous multi-cores utilize specialized processing capabilities to handle partic- ular computational tasks, which usually offer higher perfor- mance and energy efficiency. For example, [38] showed that a heterogeneous-ISA chip multiprocessor can outperform the best same-ISA homogeneous architecture by as much as 21% with 23% energy savings and a reduction of 32% in energy delay product. In this paper, we consider real-time scheduling of typed DAG tasks on heterogeneous multi-cores, where each vertex This paper is under submission to a journal. M. Han,J. Sun and Q. Deng are with Northeastern University, Shenyang 110819, China. Part of the work by M. Han was performed during her visit at The Hong Kong Polytechnic University E-mail: neu [email protected] N. Guan and Q. He are with The Hong Kong Polytechnic University, Hong Kong. Weichen Liu is with Nanyang Technological University, Singapore. Manuscript received-; revised-. is explicitly bound to execute on a particular type of cores. Binding code segments of the program to a certain type of cores is common practice in software development on heterogeneous multi-cores and is supported by mainstream parallel programming frameworks and operating systems. For example, in OpenMP [34] one can use the proc bind clause to specify the mapping of threads to certain process- ing cores. In OpenCL [35], one can use the clCreateCom- mandQueue function to create a command queue to certain devices. In CUDA [10], one can use the cudaSetDevice function to set the following executions to the target device. The target of this paper is to bound the worst-case response time (WCRT) for typed DAG tasks. To the best of our knowledge, the only known WCRT bound for the considered problem model was presented in an early work [18] (called OLD-B), which is not only grossly pessimistic, but also suffers the non-self-sustainability problem 1 . In this paper we develop two new response time bounds to address these problems: NEW-B-1, which dominates OLD-B in analysis pre- cision with the same time complexity and solves its non-self-sustainability problem. NEW-B-2, which significantly improves the analysis precision by exploring more detailed task graph stuc- ture information. NEW-B-2 is more precise, but also more difficult to compute. We prove the problem of computing NEW-B-2 to be strongly NP-hard if the number of types is a variable. 1. By a non-self-sustainable analysis method, a system decided to be schedulable may be decided to be unschedulable when the system parameters become “better”. We will discuss this issue in more details in Section 3. arXiv:1809.07689v1 [cs.DC] 7 Aug 2018

Upload: others

Post on 13-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: JOURNAL OF LA Response Time Bounds for Typed …WCET, while Figure 2(b) shows another execution sequence where some vertices execute for shorter than their WCET JOURNAL OF L A TEX

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1

Response Time Bounds for Typed DAG ParallelTasks on Heterogeneous Multi-cores

Meiling Han, Nan Guan, Jinghao Sun, Qingqiang He, Qingxu Deng, and Weichen Liu,

Abstract—Heterogenerous multi-cores utilize the strength of different architectures for executing particular types of workload, andusually offer higher performance and energy efficiency. In this paper, we study the worst-case response time (WCRT) analysis of typedscheduling of parallel DAG tasks on heterogeneous multi-cores, where the workload of each vertex in the DAG is only allowed toexecute on a particular type of cores. The only known WCRT bound for this problem is grossly pessimistic and suffers thenon-self-sustainability problem. In this paper, we propose two new WCRT bounds. The first new bound has the same time complexityas the existing bound, but is more precise and solves its non-self-sustainability problem. The second new bound explores moredetailed task graph structure information to greatly improve the precision, but is computationally more expensive. We prove that theproblem of computing the second bound is strongly NP-hard if the number of types in the system is a variable, and develop an efficientalgorithm which has polynomial time complexity if the number of types is a constant. Experiments with randomly generated workloadshow that our proposed new methods are significantly more precise than the existing bound while having good scalability.

Index Terms—Heterogenerous multi-cores system, embedded real-time scheduling, response time analysis, DAG parallel task.

F

1 INTRODUCTION

Multi-cores are more and more widely used in real-timesystems, to meet rapidly increasing requirements in perfor-mance and energy efficiency. To fully utilize the computa-tion capacity of multi-cores, software should be properlyparallelized. A representation that can model a wide rangeof parallel software is the DAG (directed acyclic graph) taskmodel, where each vertex represents a piece of sequentialworkload and each edge represents the precedence relationbetween two vertices. Real-time scheduling and analysis ofDAG parallel task models have raised many new challengesover traditional real-time scheduling theory with sequentialtasks, and have become an increasingly hot research topic inrecent years.

Many modern multi-cores adopt heterogeneous ar-chitectures. Examples include Zynq−7000 [37] andOMAP1/OMAP2 [33] that integrate CPU and DSP on thesame chip, and the Tegra processors [36] that integrate CPUand GPU on the same chip. Heterogenerous multi-coresutilize specialized processing capabilities to handle partic-ular computational tasks, which usually offer higher perfor-mance and energy efficiency. For example, [38] showed thata heterogeneous-ISA chip multiprocessor can outperformthe best same-ISA homogeneous architecture by as muchas 21% with 23% energy savings and a reduction of 32% inenergy delay product.

In this paper, we consider real-time scheduling of typedDAG tasks on heterogeneous multi-cores, where each vertex

• This paper is under submission to a journal.• M. Han,J. Sun and Q. Deng are with Northeastern University, Shenyang

110819, China. Part of the work by M. Han was performed during hervisit at The Hong Kong Polytechnic UniversityE-mail: neu [email protected]

• N. Guan and Q. He are with The Hong Kong Polytechnic University,Hong Kong. Weichen Liu is with Nanyang Technological University,Singapore.

Manuscript received-; revised-.

is explicitly bound to execute on a particular type of cores.Binding code segments of the program to a certain typeof cores is common practice in software development onheterogeneous multi-cores and is supported by mainstreamparallel programming frameworks and operating systems.For example, in OpenMP [34] one can use the proc bindclause to specify the mapping of threads to certain process-ing cores. In OpenCL [35], one can use the clCreateCom-mandQueue function to create a command queue to certaindevices. In CUDA [10], one can use the cudaSetDevicefunction to set the following executions to the target device.

The target of this paper is to bound the worst-case responsetime (WCRT) for typed DAG tasks.

To the best of our knowledge, the only known WCRTbound for the considered problem model was presentedin an early work [18] (called OLD-B), which is not onlygrossly pessimistic, but also suffers the non-self-sustainabilityproblem1. In this paper we develop two new response timebounds to address these problems:

• NEW-B-1, which dominates OLD-B in analysis pre-cision with the same time complexity and solves itsnon-self-sustainability problem.

• NEW-B-2, which significantly improves the analysisprecision by exploring more detailed task graph stuc-ture information. NEW-B-2 is more precise, but alsomore difficult to compute.

– We prove the problem of computing NEW-B-2to be strongly NP-hard if the number of typesis a variable.

1. By a non-self-sustainable analysis method, a system decided tobe schedulable may be decided to be unschedulable when the systemparameters become “better”. We will discuss this issue in more detailsin Section 3.

arX

iv:1

809.

0768

9v1

[cs

.DC

] 7

Aug

201

8

Page 2: JOURNAL OF LA Response Time Bounds for Typed …WCET, while Figure 2(b) shows another execution sequence where some vertices execute for shorter than their WCET JOURNAL OF L A TEX

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 2

– We develop an efficient algorithm to computeNEW-B-2 with polynomial time complexity ifthe number of types is a constant.

Experiments with randomly generated parallel tasksshow that the new WCRT bounds proposed in this papercan greatly improve the analysis precision. This paper fo-cuses on analysis of a single typed DAG task, but our resultsare also meaningful to general system setting with multiplerecurrent typed DAG tasks. On one hand, the results ofthis paper are directly applicable to multiple tasks underscheduling algorithms where a subset of cores are assignedto each individual parallel task (e.g., federated scheduling[3], [5], [6], [29], [30]). On the other hand, the analysis ofintra-task interference addressed in this paper is a necessarystep towards the analysis for scheduling algorithms wheredifferent tasks interfere with each other (e.g., global schedul-ing [2], [8], [32]).

2 PRELIMINARY

2.1 Task ModelWe consider a typed DAG taskG = (V,E, γ, c) to be executedon a heterogeneous multi-core platform with different typesof cores. S is the set of core types (or types for short), andfor each s ∈ S there are Ms cores of this type (Ms ≥ 1). Vand E are the set of vertices and edges in G. Each vertexv ∈ V represens a piece of code segment to be sequentiallyexecuted. Each edge (u, v) ∈ E represents the precedencerelation between vertices u and v. The type function γ : V ×Sdefines the type of each vertex, i.e., γ(v) = s, where s ∈ S,represents vertex v must be executed on cores of type s. Theweight function c : V × R+

0 defines the worst-case executiontime (WCET) of each vertex, i.e., v executes for at most c(v)time units (on cores of type γ(v)).

If there is an edge (u, v) ∈ E, u is a predecessor of v,and v is a successor of u. If there is a path in G from uto v, u is an ancestor of v and v is a descendant of u. Weuse pre(u), suc(u), ans(u) and des(u) to denote the set ofpredecessors, successors, ancestors and descendants of u,respectively. Without loss of generality, we assume G has aunique source vertex vsrc (which has no predecessor) and aunique sink vertex vsnk (which has no successor)2. We useπ ∈ G to denote π is a path in G. A path π = {τ1, · · · , τk}is a complete path iff its first vertex τ1 is the source vertexof G and last vertex τk is the sink vertex. We use vol(G) todenote the total WCET of G and vols(G) the total WCET ofvertices of type s:

vol(G) =∑u∈V

c(u), vols(G) =∑

u∈V ∧γ(u)=s

c(u).

The length of a path π is denoted by len(π) and len(G)represents the length of the longest path in G:

len(π) =∑u∈π

c(u), len(G) = maxπ∈G{len(π)}.

Example 2.1. Figure 1 illustrates a typed DAG task with twotypes of vertices (type 1 marked by yellow and type 2 marked byred). The WCET of vertex is annotated by the number next to the

2. In case G has multiple source/sink vertices, one can add a dummysource/sink vertex to make it compliant with our model.

v1 3

v01

v22

v4 2 v5 2

v1112

v8 4

v9 4

v10 4v65

v3 3

v7 1 v122

Fig. 1. A typed DAG task τ with two types.

v0v5

v1v2v4

v11v8v9

v10

v3v6

v7v12

time

type 2

type 1

M2

M1

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 2020 21 22 23

(a) An execution sequence where each vertex executes for its WCET.

v0v5v1

v2v4

v11v8v9

v10

v3v6 v7 v12

time

type 2

type 1

M2

M1

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 2020 21 22 23

(b) An execution sequence where some vertices executes shorter thantheir WCET.

Fig. 2. Two possible execution sequences the task in Figure1 executedon a platform with M1 = 2 and M2 = 3.

vertex. And we can compute that vol(G) = 45, vol1(G) = 11and vol2(G) = 34. For a path π = {v0, v1, v7, v11, v12}, thelength is len(π) = 19.

2.2 Runtime BehaviorA vertex is eligible for execution when all of its predecessorshave finished. Without loss of generality, we assume thesource vertex of G is eligible for execution at time 0. Thetyped DAG taskG is scheduled on the heterogeneous multi-core platform by a work-conserving scheduling algorithm:

Definition 2.1. Under a work-conserving scheduling algo-rithm, an eligible vertex of type s must be executed if there areavailable cores of type s.

We do not put any other constraints to schedulingalgorithms except the work-conserving constraint. Thereare many possible instances of work-conserving schedul-ing algorithms, e.g., the list scheduling [13] algorithm. Theresults of this paper are applicable to any work-conservingscheduling algorithm.

Execution Sequence. At runtime, the vertices of G exe-cute at certain time according to the scheduling algorithm.We call a trace describing which vertex executes at whichtime points an execution sequence of G. Given a schedulingalgorithm, G may generate different execution sequences.This is because, (1) the scheduling algorithm may havenondeterminism (the scheduler may behave differently inthe same situation) and (2) each vertex may execute forshorter than its WCET. For example, Figure 2(a) showsan execution sequence where each vertex executes for itsWCET, while Figure 2(b) shows another execution sequencewhere some vertices execute for shorter than their WCET

Page 3: JOURNAL OF LA Response Time Bounds for Typed …WCET, while Figure 2(b) shows another execution sequence where some vertices execute for shorter than their WCET JOURNAL OF L A TEX

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 3

but lead to a larger response time. In an execution sequenceε, we use fε(v) to denote the finish time of vertex v. Forsimplicity, we omit the subscript and only use f(v) to denotev’s finish time when the execution sequence is clear from thecontext.

Response Time. The response time of G in an executionsequence is the finish time of the sink vertex, and the WCRTofG, denoted byR(G), is the maximum among the responsetimes of all possible execution sequences. The target of thispaper is to derive safe upper bounds for the WCRT of G.Note that the WCRT of G is not necessarily achieved bythe execution sequence in which each vertex executes forits WCET (even if there is only one type in the system) [13],[14]. Therefore, one can not obtain the WCRT ofG by simplysimulating the execution of G using the WCET, but has to(explicitly or implicitly) analyze all the possible executionsequences of G.

2.3 Existing WCRT Bound

To our best knowledge, the only known WCRT upper boundfor the considered model was developed in an early work[18]:

Theorem 2.1 (OLD-B). The WCRT of G is bounded by:

R(G) ≤

1− 1

maxs∈S{Ms}

× len(G) +∑s∈S

vols(G)

Ms. (1)

This bound can be computed in O(|V | + |E|) time[11]. Although OLD-B was originally derived for the listscheduling algorithm [13], it applies to all work-conservingscheduling algorithms. When there is only one type, itdegrades to the classical response time bound for untypedDAG tasks [14]:

R(G) ≤ len(G) +vol(G)− len(G)

M.

3 THE FIRST NEW WCRT BOUND

OLD-B is not only pessimistic but also suffers the problemof being non-self-sustainable with respect to processing ca-pacity. More specifically, the value of the WCRT bound in(1) may increase when the number of cores (of some type)increases, as witnessed by the following example.

Example 3.1. For the task G in Figure 1, we can calculateits vol1(G) = 11, vol2(G) = 34 and len(G) = 19. SupposeM1 = 2 and M2 = 3, we obtain a WCRT bound by OLD-B as29.5. However, if we increase M1 to 20, the bound is increased to29.93.

Note that the actual WCRT of G will not increase whenmore cores are used. The phenomenon shown above ismerely the problem of the bound OLD-B itself ratherthan the system behavior. As pointed out in [1], the self-sustainability property is important in incremental and inter-active design process, which is typically used in the designof real-time systems and in the evolutionary developmentof fielded systems.

In this section we will develop a new WCRT bound,which is not only more precise than OLD-B (with the same

time complexity), but also self-sustainable. We start withintroducing some useful concepts.

Definition 3.1. The scaled graph G = (V,E, c, γ) of G =(V,E, c, γ) has the same topology (V and E) and type functionγ as G, but a different weight function c:

∀v ∈ V : c(v) = c(v)× (1− 1/Mγ(v)).

Definition 3.2. A critical path π = {τ1, · · · , τk} of anexecution sequence of G is a complete path of G satisfying thefollowing condition:

∀τi ∈ π \ {τ1} : f(τi−1) = maxu∈pre(τi)

{f(u)},

where f(v) is the finish time of v in this execution sequence.

For example, a complete path π = {v0, v1, v7, v11, v12} isthe critical path for the execution sequence shown in Figure2(a), while a complete path π′ = {v0, v2, v7, v11, v12} is not acritical path of this execution sequence since the v2’s finishtime is not the latest among all the predecessors of v7.

A task G may generate (infinitely) many different exe-cution sequences at runtime, and it is in general unknownwhich complete path inG is the critical path that leads to theWCRT. In the following, we assume an arbitrary completepath π = {τ1, · · · , τk} to be a critical path, and derive upperbounds for the response time of this particular critical path.Then by getting the maximum bound among all possiblepaths in G, we can safely bound the WCRT of G.

We divide [0, f(τk)) into k segments [0, f(τ1)),[f(τ1), f(τ2)), · · · , [f(τk−1), f(τk)). For each 1 < i ≤ k,we define

Ii = f(τi)− f(τi−1),

and let I1 = f(τ1). We define

• xi: the accumulative length of time intervals in Iiduring which τi is executing;

• yi: the accumulative length of time intervals in Iiduring which τi is not executing.

Obviously, Ii = xi + yi. Figure 3 illustrates xi, yi and Ii.In general the time intervals counted in xi or yi may notbe continuous (e.g., [f(τ1), f(τ2)) in Figure 3). We furtherdefine

Is =∑τi∈π∧γ(τi)=s

Ii, Xs =∑τi∈π∧γ(τi)=s

xi, Ys =∑τi∈π∧γ(τi)=s

yi,

and we know Is = Xs + Ys.

Lemma 3.1. Let π = {τ1, · · · , τk} be a critical path of anarbitrary execution sequence ofG, thenXs and Ys can be boundedby

Xs ≤∑τj∈π∧γ(τj)=s

c(τj), (2)

Ys ≤

vols(G)−∑τj∈π∧γ(τj)=s

c(τj)

/Ms. (3)

Proof. The proof of (2) is trivial. In the following, wefocus on the proof of (3). By the definition of criticalpath, we know all the predecessors of τi have finished

Page 4: JOURNAL OF LA Response Time Bounds for Typed …WCET, while Figure 2(b) shows another execution sequence where some vertices execute for shorter than their WCET JOURNAL OF L A TEX

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 4

Fig. 3. Illustration of xi, yi and Ii.

by time f(τi−1). Therefore, when τi is not executing in[f(τi−1), f(τi)), all the cores of type γ(τi) must be occupiedby vertices of type γ(τi) not on the critical path. Since thetotal workload of vertices of type s that are not on the criticalpath is at most

vols(G)−∑τj∈π∧γ(τj)=s

c(τj)

and the number of cores of type s is Ms, the accumulatedlength of time intervals during which the vertices of type son π are not executing is bounded by (3).

Theorem 3.1 (NEW-B-1). The WCRT of G is bounded by:

R(G) ≤ len(G) +∑s∈S

vols(G)

Ms, (4)

where G is the scaled graph of G.

Proof. By (2), (3) and Is = Xs + Ys we have

Is ≤vols(G)

Ms+

∑τi∈π∧γ(τi)=s

(1− 1/Ms) c(τi)

∑s∈SIs ≤

∑s∈S

vols(G)

Ms+

∑s∈S

∑τi∈π∧γ(τi)=s

(1− 1/Ms) c(τi)

∑s∈SIs ≤

∑s∈S

vols(G)

Ms+

∑τi∈π

c(τi) //by the definition of c(τi)∑s∈S Is is the response time of the execution sequence with

critical path π. Since G and G have the same topology, π isalso a complete path in G, so

∑τi∈π c(τi) is bounded by

len(G). Therefore, R(G) is bounded by (4).

We can compute∑s∈S vols(G)/Ms and construct G

based on G in O(|V |) time, and compute len(G) in O(|V |+|E|) time [11]. Therefore, the overall time complexity tocompute NEW-B-1 is O(|V | + |E|), which is the same asOLD-B-1. By comparing the two bounds we can conclude:

Corollary 3.1. NEW-B-1 strictly dominates OLD-B-1.

Finally, we can easily see the bound in (4) is decreasingwith respect to each Ms, so we can conclude:

Corollary 3.2. NEW-B-1 is self-sustainable with respect toeach Ms.

4 THE SECOND NEW WCRT BOUND

Our first new WCRT bound NEW-B-1 is more precise thanOLD-B, but still very pessimistic. The source of its pes-simism comes from the step of bounding Ys. Intuitively, thebound of Ys in (3) is derived assuming that the workload ofvertices not on the critical path are all executed in the shadedareas in Figure 3. However, in reality much workload ofG may actually be executed outside these shaded areas.Therefore, the length of Ys is significantly over-estimatedin (3) .

In this section, we introduce the second new WCRTbound NEW-B-2, which eliminates workload of vertices thatcannot be executed in the shaded area, and thus reduce thepessimism in bounding Ys.

4.1 WCRT Bound

Definition 4.1. For each vertex v ∈ V , par(v) denotes the setof vertices that have the same type as v but are neither ancestorsnor descendants of v:

par(v) = {u|u ∈ V ∧ γ(u) = γ(v)∧ u /∈ (ans(v) ∪ des(v))}.

Definition 4.2. Let π = {τ1, · · · , τk} be a critical path,ivs(π, s) is defined as

ivs(π, s) =⋃

τi∈π∧γ(τi)=s

par(τi). (5)

Example 4.1. Assume π = {v0, v1, v7, v11, v12} is acritical path of the task in Figure 1. We have par(v0) =∅, par(v1) = {v2, v4, v5, v8, v9, v10}, par(v7) = {v6},par(v11) = {v4, v5, v8, v9, v10}, par(v12) = ∅, ivs(π, 1) ={v6} and ivs(π, 2) = {v2, v4, v5, v8, v9, v10}.

Intuitively, ivs(π, s) is the set of vertices of type s thatare not on the critical path but can actually interfere withvertices of type s on the critical path (i.e., can be executed inthe shaded area in Figure 3). Therefore, Ys can be boundedmore precisely as stated in the following Lemma.

Lemma 4.1. Let π = {τ1, · · · , τk} be a critical path of anarbitrary execution sequence of G, then Ys is bounded by

Ys ≤ivs(π, s)

Ms(6)

Page 5: JOURNAL OF LA Response Time Bounds for Typed …WCET, while Figure 2(b) shows another execution sequence where some vertices execute for shorter than their WCET JOURNAL OF L A TEX

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 5

Proof. To prove the lemma, it is sufficient to prove that atany time instant in [f(τi−1), f(τi)) when τi is not executing,all the cores of type γ(τi) must be executing vertices inivs(π, γ(τi)). We prove this by contradiction. Assumingthat at a time instant t ∈ [f(τi−1), f(τi)) when τi is notexecuting, there exists a core of type γ(τi) which is notexecuting vertices in ivs(π, γ(τi)), then one of the followingtwo cases must be true:

• This core is idle at t. Since π is a critical path, weknow all the predecessors of τi have finished by timef(τi−1), so τi is eligible for execution at t, and thusthis core cannot be idle at t. Therefore, this case isimpossible.

• This core is executing a vertex u /∈ ivs(π, γ(τi))at t. First we know γ(u) = γ(τi), and since u /∈ivs(π, γ(τi)), by the definition of ivs we know u mustbe a predecessor or a successor of τi, so we discusstwo cases:

– u is a predecessor of τi. Since π is a criticalpath, we know all the predecessors of τi havefinished by time f(τi−1), so a predecessor ofτi cannot start execution after f(τi−1), whichcontradicts that u is executing at a time instantt after f(τi−1).

– u is a successor of τi. A successor of τi cannotstart execution before f(τi), so this is also acontradiction.

Therefore, this case is also impossible.

In summary, both cases are impossible, so the assumptionmust be false and the lemma is proved.

Theorem 4.1 (NEW-B-2). The WCRT of G is bounded by

R(G) ≤ maxπ∈G{R(π)} (7)

whereR(π) = len(π) +

∑s∈S

∑v∈ivs(π,s)

c(v)/Ms (8)

Proof. By the same idea as the proof of Theorem 3.1 butusing the new bound (6) for Ys instead of (3), we can get∑

s∈SIs ≤ len(π) +

∑s∈S

∑v∈ivs(π,s)

c(v)/Ms

∑s∈S Is is the response time of the execution sequence

with π being the critical path. Finally, by getting the max-imum bound for all complete paths (assumed to be thecritical path), the theorem is proved.

Comparing with NEW-B-1, NEW-B-2 uses a more pre-cise upper bound of Ys, so we have

Corollary 4.1. NEW-B-2 strictly dominates NEW-B-1.

The bound in (7) is decreasing with respect to each Ms,so

Corollary 4.2. NEW-B-2 is self-sustainable with respect toeach Ms.

u1

x11

x12

x13

u2

x21

x22 x2

3

u3

x31 x3

2

x33

u4

x14 x4

2

x43

v1v0 v2 v3

Fig. 4. The constructed typed DAG task G for 3-SAT problem instanceC = C1 ∧ C2 ∧ C3 ∧ C4, where C1 = x1 ∨ x2 ∨ x3, C2 = x1 ∨ x2 ∨ x3,C3 = x1 ∨ x2 ∨ x3, C4 = x1 ∨ x2 ∨ x3.

4.2 Strong NP-Hardness

NEW-B-2 requires to compute the maximum of R(π)among all paths in the graph G. It is computationallyintractable to explicitly enumerate all the paths, the num-ber of which is exponential. Can we develop efficient al-gorithms of (pseudo-)polynomial complexity to computemaxπ∈G{R(π)}?

Unfortunately, this is impossible unless P = NP.

Theorem 4.2. The problem of computing maxπ∈G{R(π)}is strongly NP-hard.

Proof. We will prove the theorem by showing that even asimpler problem of verifying whether maxπ∈G{R(π)} islarger than a given value ω is strongly NP-hard, which isproved by a reduction from the 3-SAT problem.

Let C be an arbitrary instance of the 3-SAT problem,which has m clauses C1

∧C2

∧· · ·

∧Cm and n variables

{x1, x2, · · · , xn}. Each clause Cr , 1 ≤ r ≤ m, consists ofthree literals, and each literal is a variable or the negation ofa variable. We construct a typed DAG G corresponding tothe 3-SAT instance as follows:

• We first construct n+ 1 vertices {v0, · · · , vn} of types0 with c(v0) = · · · = c(vn) = 1. v0 is the sourcevertex of G and vn is the sink vertex.

• For each clause Cr , we construct a vertex ur of typesr with c(ur) = 1, as well as two edges (v0, ur) and(ur, vn).

• For each variable xi, we construct two paths fromvi−1 to vi:

– Positive path, which includes a vertex xri oftype sr if and only if clause Cr includes a literalxi.

– Negative path, which includes a vertex xri ifand only if clause Cr includes a literal xi.

The WCET of each vertex on these two paths is1

mn+1 .

Note that there are in total m + 1 types in the aboveconstructed DAG. Finally, we set M0 = M1 = · · · = Mm =1 and ω = m+ n+ 1. The above construction is polynomialas there are no more than m + n + 1 + 2nm vertices in theconstructed graph. For illustration, an example of the aboveconstruction is given in Figure 4.

Page 6: JOURNAL OF LA Response Time Bounds for Typed …WCET, while Figure 2(b) shows another execution sequence where some vertices execute for shorter than their WCET JOURNAL OF L A TEX

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 6

In the following we prove that the 3-SAT prob-lem instance C is satisfiable if and only if the boundmaxπ∈G{R(π)} of the above constructed graph is strictlygreater than m+ n+ 1.

First, a complete path that leads to the largest R(π)must be one of those traversing v0, · · · , vn. The choicebetween the positive and negative path between vi−1 andvi corresponds to the choice between assigning 1 or 0 tovariable xi in the 3-SAT problem.

Since each vertex ur is neither an ancestor nor a de-scendant of any vertex on paths traversing v0, · · · , vn, uris included in ivs(π, γ(ur)) if and only if the path π containsat least one vertex of type γ(ur). This corresponds to thatCr is satisfied only if it contains at least one literal assignedwith value 1. Therefore, we can conclude that all vertices urare included in the corresponding ivs(π, γ(ur)) if and onlyif all clauses contain at least one literal assigned with value1, i.e., the 3-SAT problem instance C is satisfiable.

Therefore, the second item of RHS of (8) equals m ifand only if C is satisfiable. Moreover, there are at most mnvertices corresponding to the positive and negative valuesof the variables along any path traversing v0, · · · , vn, sotheir total WCET must be in the range (0, mn

mn+1 ). Therefore,the length len(π) of any such path π must be in the range(n+ 1, n+ 1 + mn

mn+1 ), and thus in the range (n+ 1, n+ 2).Therefore, maxπ∈G is larger than m+n+ 1 if and only if allvertices ur are included in the corresponding ivs(π, γ(ur)),i.e., the 3-SAT problem instance C is satisfiable.

4.3 Computation Algorithm

The construction in the above strong NP-hardness proofuses m+1 different types (where m is the number of clausesin 3-SAT). In realistic heterogeneous multi-core platforms,the number of core types is usually not very large. Will theproblem of computing maxπ∈G{R(π)} remain NP-hard ifthe number of types is a bounded constant? In the following,we will present an algorithm to compute maxπ∈G{R(π)}with complexity O(|V ||S|+2), which shows that the problemis actually in P if the number of types is a constant.

We first describe the intuition of our algorithm. Insteadof explicitly enumerating all the possible paths, our algo-rithm will use abstractions to represent paths in the graphsearching procedure. More specifically, a path starting fromthe source vertex of G and ending at some vertex τi isabstractly represented by a tuple 〈τi,∆(τi),R(τi)〉, where∆(τi) and R(τi) are defined in Definition 4.3 and 4.4 inthe following. The tuple will be updated when the pathis extended from τi to its successor τi+1, and eventuallywhen the path is extended to the sink vertex, R(τi) is theR(π) for this path. The algorithm starts with a single tuplecorresponding to the path consisting only the source vertexand repeatedly extends the paths until they all reach thesink vertex, then the maximal R(π) among all the kepttuples is the desired bound maxπ∈G{R(π)}. The abstractionis compact, so that many different path histories ending withthe same vertex can be represented by a single abstractionand the total number of abstractions generated in the com-putation is polynomially bounded.

Definition 4.3. For a path π = {τ1, · · · , τk} in G and atype s ∈ S, we define

δ(τi, s) =

τi, s = γ(τi)

⊥, s 6= γ(τ1) ∧ i = 1

δ(τi-1, s), s 6= γ(τi) ∧ 2 ≤ i ≤ k

and

∆(τi) = {δ(τi, s) | s ∈ S} (9)

Intuitively, δ(τi, s) is the vertex on path π that is theclosest to τi among all the vertices of type s.

Example 4.2. In the typed task in Figure 1, there are 3paths from v0 to v7. For the path π1 = {v0, v1, v7}, onecan derive δ(v7, 1) = {v7} and δ(v7, 2) = {v1}. For thepath π2 = {v0, v3, v7}, one can derive δ(v7, 1) = {v7} andδ(v7, 2) = {v0}.

Definition 4.4. For a path π = {τ1, · · · , τk} inG we define:

R(τi) =

c(τ1), i = 1

R(τi-1) + c(τi) +∑

v∈ϕ(τi)

c(v)

Mγ(τi), 2 ≤ i ≤ k

(10)where

ϕ(τi) = par(τi) \ par(δ(τi-1))

(since δ(τi-1) may be ⊥, we let par(⊥) = ∅ for completeness).

Lemma 4.2. For a complete path π = {τ1, · · · , τk}, italways holds

R(τk) = R(π). (11)

Proof. The proof goes in two steps: (1) Rewrite R(τi) intoa non-recursive form R′(τi) and prove ∀τi ∈ π : R′(τi) =R(τi), and (2) Prove R′(τk) = R(π).

We first define R′(τi) as follows:

R′(τi) =

c(τ1) i = 1i∑

j=1

c(τj) +∑s∈S

∑v∈ϕ(τj)∧

γ(τj)=s∧j≤i

c(v)

Ms2 ≤ i ≤ k

Now we prove ∀τi∈π : R′(τi) = R(τi) by induction:

• Base case. Both R(τ1) and R′(τ1) equal c(τ1), so theclaim holds for the base case with i = 1.

• Inductive step. SupposeR(τi-1) = R′(τi-1), we wantto prove R(τi) = R′(τi). First, we know

∑v∈ϕ(τi)

c(v)

Mγ(τi)+

∑s∈S

∑v∈ϕ(τj)∧

γ(τj)=s∧j≤i−1

c(v)

Ms=

∑s∈S

∑v∈ϕ(τj)∧

γ(τj)=s∧j≤i

c(v)

Ms

(12)For simplicity we let

F(τi) =∑s∈S

∑v∈ϕ(τj)∧

γ(τj)=s∧j≤i

c(v)

Ms

Page 7: JOURNAL OF LA Response Time Bounds for Typed …WCET, while Figure 2(b) shows another execution sequence where some vertices execute for shorter than their WCET JOURNAL OF L A TEX

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 7

so (12) can be rewritten as∑v∈ϕ(τi)

c(v)

Mγ(τi)+ F(τi-1) = F(τi)

⇔∑

v∈ϕ(τi)

c(v)

Mγ(τi)+F(τi-1)+

i∑j=1

c(τj) = F(τi)+i∑

j=1

c(τj)

⇔∑

v∈ϕ(τi)

c(v)

Mγ(τi)+R′(τi-1) + c(τi) = R′(τi)

⇔∑

v∈ϕ(τi)

c(v)

Mγ(τi)+R(τi-1) + c(τi) = R′(τi)

⇔ R(τi) = R′(τi)

By now, we have proved ∀τi ∈ π : R′(τi) = R(τi). In thefollowing we prove R′(τk) = R(π).

Let πs = {τs1 , · · · , τsh} be the subsequence of π contain-ing all vertices in π with type s. By the definition of ϕ(τj):∑v∈ϕ(τj)∧γ(τj)=s∧j≤k

c(v) =∑

v∈par(τs1 )

c(v) +∑

v∈par(τs2 )

\par(τs1 )

c(v) + · · ·+∑

v∈par(τsh)

\par(τsh-1)

c(v) (13)

In the following we will prove∑v∈ivs(π,s)

c(v) =∑

v∈par(τs1 )

c(v) +∑

v∈par(τs2 )

\par(τs1 )

c(v) + · · ·+∑

v∈par(τsh)

\par(τsh-1)

c(v) (14)

If (14) is true, then by (14) and (13) we have∑v∈ϕ(τj)∧γ(τj)=s∧j≤k

c(v) =∑

v∈ivs(π,s)

c(v)

by which R′(τk) = R(π) is proved.In the following, we focus on proving (14). We use LHS

and RHS to represent the left-hand side and right-hand sideof (14), respectively. In the following, we will prove thatboth LHS ≤ RHS and LHS ≥ RHS hold.

1) LHS ≤ RHS. This is proved by combining thefollowing two claims:

a) Any c(v) counted in LHS is also counted inRHS. If c(v) is counted in LHS, then v mustbe in some par(τsi ), and by the definition ofivs(π, s), we know v must be also in ivs(π, s),so we can conclude that all the c(v) countedin LHS are also counted in RHS.

b) Each c(v) is counted in LHS at most once.Suppose this is not true, then there existssome v such that v ∈ par(τsi ) \ par(τsi-1) andv ∈ par(τsj ) \ par(τsj-1) for some j < i− 1, soit must be the case that

v ∈ par(τsi ) ∧ v /∈ par(τsi-1) ∧ v ∈ par(τsj )

By v ∈ par(τsi ), we know v is not a descen-dant of τsi , and since τsi-1 is a predecessor ofτsi , v is not a descendant of τsi-1 either. Onthe other hand, by v ∈ par(τsj ) we know vis not an ancestor of τsj , and since τsi-1 is adescendant of τsj , v is not an ancestor of τsi−1.

In summary, v is neither a descendant nor anancestor of τsi-1, and v has the same type asτsi-1, which contradicts v /∈ par(τsi-1).

2) RHS ≤ LHS. It is obvious that each c(v) is countedat most once in the RHS, so it suffices to provethat each c(v) counted in the RHS is also countedin LHS. Since v ∈ ivs(π, s), v must be in somepar(τsi ). Suppose i is the smallest index such thatv ∈ par(τsi ) but v /∈ par(τsi−1), then c(v) is countedin item

∑v∈par(τs

i )−par(τsi-1)

c(v) (if i = 1 then c(v) iscounted in

∑v∈par(τs

1 )c(v)).

In summary, we have proved both RHS ≥ LHS and RHS ≤LHS, so (14) is true.

By Lemma 4.2, we know that by using the abstracttuple 〈τi,∆(τi),R(τi)〉 to extend the path, eventually,we can precisely compute R(π). Therefore, we can use〈τi,∆(τi),R(τi)〉 as the abstraction of paths to performgraph searching. All the paths corresponding to the sametuple can be abstractly represented by a single tuple insteadof recording each of them individually. Actually, even pathscorresponding to different tuples can be merged togetherduring the graph searching procedure by the dominationrelation among tuples defined as follows:

Definition 4.5. Given two tuples 〈v,∆1(v),R1(v)〉 and〈v,∆2(v),R2(v)〉 with the same vertex v, 〈v,∆1(v),R1(v)〉dominates 〈v,∆2(v),R2(v)〉, denoted by

〈v,∆1(v),R1(v)〉 < 〈v,∆2(v),R2(v)〉,

if both of the following conditions are satisfied:

1) R1(v) ≥ R2(v)2) ∀s: either δ1(v, s) = ⊥ or

(δ1(v, s) 6=⊥) ∧ (δ2(v, s) 6=⊥) ∧(par(δ1(v, s)) ∩ des(δ2(v, s))=∅)

Suppose v is a successor of u. Given a tuple〈u,∆(u),R(u)〉 for vertex u and a tuple 〈v,∆(v),R(v)〉 forvertex v, we use

〈u,∆(u),R(u)〉 7→ 〈v,∆(v),R(v)〉

to denote that 〈v,∆(v),R(v)〉 is generated based on〈u,∆(u),R(u)〉 according to (9) and (10).

Lemma 4.3. Suppose v is a successor of u, and

〈u,∆1(u),R1(u)〉 7→ 〈v,∆1(v),R1(v)〉〈u,∆2(u),R2(u)〉 7→ 〈v,∆2(v),R2(v)〉

if 〈u,∆1(u),R1(u)〉 < 〈u,∆2(u),R2(u)〉, then we must have〈v,∆1(v),R1(v)〉 < 〈v,∆2(v),R2(v)〉.

Proof. We start with proving R1(v) ≥ R2(v). Assume w1 =δ1(u, γ(v)) and w2 = δ2(u, γ(v)), then

ϕ1(v) = par(v) \ par(w1)

ϕ2(v) = par(v) \ par(w2)

In the following we prove ϕ1(v) ⊇ ϕ2(v). By condition 2) inDefinition 4.5, one of the following two cases must hold

1) w1 = ⊥

Page 8: JOURNAL OF LA Response Time Bounds for Typed …WCET, while Figure 2(b) shows another execution sequence where some vertices execute for shorter than their WCET JOURNAL OF L A TEX

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 8

2) (w1 6=⊥) ∧ (w2 6=⊥) ∧ (par(w1) ∩ des(w2)=∅)

If w1 = ⊥, then par(w1) = ∅, so ϕ1(v) ⊇ ϕ2(v) is obviouslytrue. In the following we focus on the second case

(w1 6=⊥) ∧ (w2 6=⊥) ∧ (par(w1) ∩ des(w2)=∅).

Suppose ϕ1(v) + ϕ2(v) , then there must exist an x s.t.

x ∈ ϕ2(v) ∧ x /∈ ϕ1(v),

by which we know any following conditions must hold

x ∈ par(v) (15)x ∈ par(w1) (16)x /∈ par(w2) (17)

By (15) we have x /∈ ans(v) ∪ des(v) and thus

x /∈ ans(v) (18)

By (17), we have

x ∈ ans(w2) ∪ des(w2). (19)

Since w2 = δ2(u, γ(v)) and v ∈ suc(u), we know w2 ∈ans(v), and thus

ans(w2) ⊂ ans(v) (20)

By (18), (19) and (20) we know x ∈ des(w2), which con-tradicts with par(w1) ∩ des(w2) = ∅, so the assumptionϕ1(v) + ϕ2(v) must be false, i.e., ϕ1(v) ⊇ ϕ2(v) must betrue. Therefore ∑

w∈ϕ1(v)

c(w)

Mγ(v)≥

∑w∈ϕ2(v)

c(w)

Mγ(v),

and since R1(u) ≥ R2(u) , we have

R1(u)+c(v)+∑

w∈ϕ1(v)

c(w)

Mγ(v)≥ R1(u)+c(v)

∑w∈ϕ2(v)

c(w)

Mγ(v)

by which we have R1(v) ≥ R2(v).Next we prove the second condition of the domination

relation is also true. We discuss two cases of s:

• s = γ(v). In this case δ1(v, s) = δ2(v, s) = v, so wehave (δ1(v, s) 6= ⊥)∧(δ2(v, s) 6= ⊥)∧(par(δ1(v, s))∩des(δ2(v, s)) = ∅)

• s 6= γ(v). In this case, δ1(v, s) = δ1(u, s) andδ2(v, s) = δ2(u, s).

– If δ1(u, s) = ⊥, then δ1(v, s) = ⊥.– If (δ1(u, s) 6= ⊥) ∧ (δ2(u, s) 6= ⊥) ∧

(par(δ1(u, s)) ∩ des(δ2(u, s)) = ∅) holds,then (δ1(u, s) 6= ⊥) ∧ (δ2(u, s) 6= ⊥) ∧(par(δ1(u, s)) ∩ des(δ2(u, s)) = ∅) also holds

In summary, the second condition of the domination relationbetween 〈v,∆1(v),R1(v)〉 and 〈v,∆2(v),R2(v)〉 also holds.

If a tuple A dominates another tuple B, then by repeat-edly applying Lemma 4.3 we know it is impossible for B toeventually lead to a larger WCRT bound than A. Therefore,in the graph searching procedure, we can safely discardtuples dominated by others.

Algorithm 1 shows the pseudo-code of the algorithm tocompute maxπ∈G{R(π)} by using the tuple abstractionsto search over G in a width-first manner. The algorithmuses TS to keep the set of tuples at each step, whichinitially contains a single tuple 〈vsrc, {vsrc}, c(vsrc)}〉 (vsrcis the source vertex of G). Then the algorithm repeatedlygenerates new tuples for all successors of each tuple in TSby (9) and (10). A newly generated tuple is added to TSif there are no other tuples in TS dominating it (line 6 to7). A tuple is removed from TS when all of its successors’tuples have been generated. Note that the algorithm alwaysselects a tuple that has no predecessors in TS to generatenew tuples (line 3), so the searching procedure is width-first. Finally, TS only contains tuples associated with thesink vertex vsnk. By Lemma 4.2 we know, the R of eachtuple in the final TS equals the R(π) of the correspondingpath π, so the maximal R among these final tuple are thedesired maxπ∈G{R(π)}. By the above discussions, we canconclude the correctness of our algorithm:

Theorem 4.3. The return value of Algorithm 1 equalsmaxπ∈G{R(π)}.

Algorithm 1 Pseudo-code for computing maxπ∈G{R(π)}1: TS = {〈vsrc, {vsrc}, c(vsrc)〉}2: while (∃〈v,∆,R〉 ∈ TS : v 6= vsnk) do3: Select a tuple 〈v,∆,R〉 in TS s.t.

@〈v′,∆′,R′〉 ∈ TS : v′ ∈ pre(v)

4: for (each v′ ∈ suc(v)) do5: Compute 〈v′,∆′,R′〉 using 〈v,∆,R〉 by (9) and (10)

6: if (@〈v′,∆∗,R∗〉∈TS: 〈v′,∆∗,R∗〉<〈v′,∆′,R′〉) then7: TS = TS ∪ {〈v′,∆′,R′〉}8: end if9: end for

10: TS = TS \ {〈v,∆,R〉}11: end while12: return max{R|〈v,∆,R〉 ∈ TS}

The time complexity of Algorithm 1 depends on the totalnumber of tuples generated in the computation procedure.As a tuple is put into TS only if no other tuples in TSdominate it, the total number of tuples have ever beenrecorded in TS in bounded by O(|V ||S|+1) (at most |V |different vertcies, O(|V ||S|) different values for ∆, and onlya singleR kept for the same v and ∆). Each tuple in TS cangenerate no more than |V | new tuples, so the total numberof tuples ever been generated is bounded by O(|V ||S|+2), sothe overall time complexity of Algorithm 1 is O(|V ||S|+2).

5 EVALUATION

In this section, we experimentally evaluate the performanceof our proposed analysis methods in terms of both precisionand efficiency. We compare the following WCRT bounds inthe experiments:

• OLD-B: the baseline bound in [18] (i.e., Theorem 2.1).• NEW-B-1: our first new bound in Theorem 3.1.• NEW-B-2: our second new bound in Theorem 4.1.

Page 9: JOURNAL OF LA Response Time Bounds for Typed …WCET, while Figure 2(b) shows another execution sequence where some vertices execute for shorter than their WCET JOURNAL OF L A TEX

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 9

0 1 2 3 4 5U

0

20

40

60

80

100Ac

cept

ance

ratio

(%) OLD-B

NEW-B-1NEW-B-2

(a) Accept. ratio with changing U

0 1 2 3 4 5U

50

60

70

80

90

100

Nom

alliz

ed W

CRT

boun

d(%

)

NEW-B-1NEW-B-2

(b) Norm. bound with changing U

70 75 80 85 90 95 100|V|

0

20

40

60

80

100

Acce

ptan

ce ra

tio(%

)

OLD-BNEW-B-1NEW-B-2

(c) Accept. ratio with changing |V |

70 75 80 85 90 95 100|V|

50

60

70

80

90

100

Nom

alliz

ed W

CRT

boun

d(%

)

NEW-B-1NEW-B-2

(d) Norm. bound with changing|V |

0.2 0.4 0.6 0.8pr

0

20

40

60

80

100

Acce

ptan

ce ra

tio(%

) OLD-BNEW-B-1NEW-B-2

(e) Accept. ratio with changing pr

0.2 0.4 0.6 0.8pr

50

60

70

80

90

100

Nom

alliz

ed W

CRT

boun

d(%

)

NEW-B-1NEW-B-2

(f) Norm. bound with changing pr

2 3 4 5 6 7 8 9 10|S|

0

20

40

60

80

100

Acce

ptan

ce ra

tio(%

)

OLD-BNEW-B-1NEW-B-2

(g) Accept. ratio with changing |S|

2 3 4 5 6 7 8 9 10|S|

50

60

70

80

90

100

Nom

alliz

ed W

CRT

boun

d(%

)

NEW-B-1NEW-B-2

(h) Norm. bound with changing|S|

Fig. 5. Comparison of analysis precision of different bounds.

We assume the typed DAG tasks are periodic with aperiod 100 with implicit deadlines (thus we evaluate notonly the WCRT bound but also the schedulability of eachtask). We first define a default parameter setting, and thentune different parameters to evaluate the performance ofthe methods regarding different parameter changing trends.The default parameter setting is defined as follows:

• The number of types |S| is randomly chosen in therange [5, 10], and the number of cores Ms of eachtype s is randomly chosen in [2, 11].

• The DAG structure of the task is generated by themethod proposed in [9], where the number of ver-tices |V | is randomly chosen in the range [70, 100]and the parallelism factor pr is randomly chosen in[0.08, 0.1] (the larger pr , the more sequential is thegraph).

• The total utilization U of the typed DAG task is

0 1 2 3 4 5U

101

102

103

104

105

Aver

age

exec

utio

n tim

e(s)

OLD-BNEW-B-1NEW-B-2

(a) Analysis time with changing U

0.2 0.4 0.6 0.8pr

101

102

103

104

105

Aver

age

exec

utio

n tim

e(m

s)

OLD-BNEW-B-1NEW-B-2

(b) Analysis time with changingpr

2 4 6 8 10 12|S|

101

102

103

104

105

Aver

age

exec

utio

n tim

e(m

s)

OLD-BNEW-B-1NEW-B-2

(c) Analysis time with changing|S|

70 80 90 100 110|V|

101

102

103

104

105

Aver

age

exec

utio

n tim

e(m

s)

OLD-BNEW-B-1NEW-B-2

(d) Analysis time with changing|V |

Fig. 6. Comparison of analysis efficiency of different bounds.

randomly chosen in [1, 3], and thus the total WCETof the task vol(G) = U × 100.

• We use the Unnifast method [7] to distribute the totalWCET to each individual vertex.

• Each vertex is randomly assigned a type in S.

Figure 5 show the evaluation results for analysis preci-sion with different parameters changed based on the defaultsetting. Figure 5(a) and 5(b) show experiment results withchanging U (other parameters are the same as the defaultsetting). Each point in Figure 5(a) represents the acceptanceratio of a WCRT bound, which is defined as the ratiobetween the number of tasks decided to be schedulableby a particular WCRT bound and the total number oftasks generated, with the target utilization U on the axis.Figure 5(b) shows the normalized WCRT bound by our newmethods (the baseline bound is 100%). Figure 5(c) and 5(d)are results with changing |V |, Figure 5(e) and 5(f) are resultswith changing pr, Figure 5(g) and 5(h) are results withchanging |S|. From these experiments we can see that ourproposed new bounds NEW-B-1 and NEW-B-2 consistentlyoutperform OLD-B under various parameter settings.

Figure 6 evaluates the computational efficiency of differ-ent bounds, with different parameter changed based on thedefault setting (the same as Figure 5). Experiments resultsshow that our first new bound NEW-B-1 is almost as effi-cient as OLD-B, while the computation procedure of NEW-B-2 takes much longer time. The lower efficiency of NEW-B-2 is due to the inherent hardness of its computation problem:exponential with respect to |S| and high-order polynomialwith respect to |V | when |S| is large. The computation timeof NEW-B-2 first increases then decreases as pr increasesbecause the number of possible paths is relatively smallerwith very small pr (with which the paths are typically veryshort) and with large pr (with which the task graphs aremore sequential and thus the number of possible paths is notvery large). The number of different core types in realistic

Page 10: JOURNAL OF LA Response Time Bounds for Typed …WCET, while Figure 2(b) shows another execution sequence where some vertices execute for shorter than their WCET JOURNAL OF L A TEX

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 10

0.0 0.2 0.4 0.6 0.8 1.0pr

102

103

104

105

106

num

ber o

f pat

hs (t

uple

s) New-B-2Enu-path

Fig. 7. State space reduction by our computation algorithm for NEW-B-2.

heterogeneous multi-cores is usually a very small number.From Figure 6(c) we can see that the analysis procedurecan finish in several seconds when the number of types issmaller than 8, which should be acceptable for most offlinedesign scenarios.

Figure 7 shows the effectiveness of our proposed algo-rithm for computing NEW-B-2 based on the abstract pathrepresentation, by comparing the total number of tuplesgenerated by our algorithm and the total number of pathsof the task graph. We can see that using our abstraction thesearching state space is reduced by typically 3 to 4 orders ofmagnitude.

6 RELATED WORK

The most relevant related work is [18], which developedthe baseline WCRT bound OLD-B as stated in Section 2.3.The analysis techniques of OLD-B are based on the classicalwork by Graham [13], [14] for untyped DAG (the special caseof the model considered in this paper with only one type).The real-time scheduling problem for multiple recurringuntyped DAG has been intensively studied, with differentscheduling paradigm including federated scheduling [2],[8], [32] and global scheduling [3], [4], [5], [29], [30], whichare also based on the classical work by Graham [13], [14].Similarly, although we assume a single typed DAG task inthis paper, our work would be a necessary step towards thescheduling and analysis of multiple recurring typed DAGtasks.

Yang et al. [40] studied the scheduling of multiple typedDAG tasks by decomposing each DAG task into a set ofindependent tasks with artificial release time and deadlines,and then analyzed these decomposed independent tasksby known methods under non-preemptive G-EDF (globalearliest deadline first). The work in [40] is not directlycomparable with our work due to the difference in problemmodels. However, when our method (based on Graham’sanalysis framework) is extended to the multiple-DAG set-ting in the future, a comparison between these two familiesof techniques would be meaningful.

Yang et al. [40] studied the scheduling of multiple typedDAG tasks by decomposing each DAG task into a set ofindependent tasks with artificial release time and deadlines,and then analyzed these decomposed independent tasksby known methods under non-preemptive G-EDF (globalearliest deadline first). The work in [40] is also applicable to

the problem studies of this paper which is a special case ofthe multiple DAG setting in [40]. However, we did not in-clude experimental comparison results between our methodand [40] as the WCRT bounds yielded by the method in[40] is much larger than our method. The transformation toindependent tasks in [40] is to handle the difficulty causedby multiple DAGs. Therefore, the comparison between ourmethod and [40] under the single DAG setting will beunfairly in favor to us. We do not want to take of advan-tage of this to exaggerate the superiority of our method3

When our method (based on Grahams analysis framework)is extended to the multiple-DAG setting in the future, acomparison between these two families of techniques wouldbe meaningful.

The scheduling problems with different workload-processor binding restrictions have been studied in the past,including the inclusive processing set restriction [16], [26],[27], [31] where for any two processor sets assigned to twotasks, one set must be the subset of the other, the intervalprocessing set restriction [21], [22] where any two processingsets may have the same machines as the other but one is nota subset of the other and the tree-hierarchical processing setrestrictions [12], [17] where each machine is described as avertex of a tree and a processing set is a path from root toa leaf. These models are all different from our work. Onthe other hand, much work has been done on schedulingof independent tasks on heterogeneous multiprocessor withworkload-processor binding restrictions [17], [23], [39], and[24], [25] provided comprehensive surveys for this line ofwork.

[15], [20] studied a special case of the problem modelof this paper, which assumes (1) two types of machines, (2)chain-type precedence constraints (3) unit execution timefor each scheduling unit. [19] studied the special case wherethe task graph structures are either in-trees, out-trees ordisjoint unions of chains. [28] considered a special case withonly two types of machines and assume the execution timeof each scheduling unit to be constant so that they cangenerate a static scheduling list offline (which suffers timinganomalies if applied to the problem model in this paperwhere each task may execute shorter than its WCET).

7 CONCLUSION

This paper derives WCRT bounds for typed DAG paralleltasks on heterogeneous multi-cores, where the workloadof each vertex in the DAG is bound to execute on a par-ticular type of cores. The only known WCRT bound forthis problem is grossly pessimistic and suffers the non-self-sustainability problem (a successful design may degrade tobe unsuccessful when the parameters become better). Wepropose two new WCRT bounds to address these problems.The first new bound is more precise without increasingthe time complexity and solves the non-self-sustainabilityproblem. The second new bound explores more detailedtask graph structure information to greatly improve theprecision, but is computationally more expensive. We provethe strong NP-hardness of the computation problem for the

3. Nevertheless, we provide some comparison results betweenour method and [40] in https://github.com/MelindaHan/Typedscheduling for readers who are interested.

Page 11: JOURNAL OF LA Response Time Bounds for Typed …WCET, while Figure 2(b) shows another execution sequence where some vertices execute for shorter than their WCET JOURNAL OF L A TEX

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 11

second bound, and develop an efficient algorithm whichhas polynomial time complexity if the number of types isa constant. In next step we will extend the results of thispaper to the general setting of multiple recurring typedDAG tasks and compare it with our families of techniquessuch as the decomposition based approach [40]. Anotherpossible direction for our future work is to develop moreefficient algorithms that compute NEW-B-2 approximately,which hopefully can give bounds reasonably close to thosecomputed by the algorithm of this paper in much shortertime.

REFERENCES

[1] Theodore P. Baker and Sanjoy K. Baruah. Sustainable multipro-cessor scheduling of sporadic task systems. In 21st EuromicroConference on Real-Time Systems, ECRTS 2009, Dublin, Ireland, July1-3, 2009, pages 141–150. IEEE Computer Society, 2009.

[2] Sanjoy Baruah. Improved multiprocessor global schedulabilityanalysis of sporadic DAG task systems. In 26th Euromicro Con-ference on Real-Time Systems, ECRTS 2014, Madrid, Spain, July 8-11,2014, pages 97–105. IEEE Computer Society, 2014.

[3] Sanjoy Baruah. The federated scheduling of constrained-deadlinesporadic DAG task systems. In Wolfgang Nebel and DavidAtienza, editors, Proceedings of the 2015 Design, Automation & Test inEurope Conference & Exhibition, DATE 2015, Grenoble, France, March9-13, 2015, pages 1323–1328. ACM, 2015.

[4] Sanjoy Baruah. Federated scheduling of sporadic DAG task sys-tems. In 2015 IEEE International Parallel and Distributed ProcessingSymposium, IPDPS 2015, Hyderabad, India, May 25-29, 2015, pages179–186. IEEE Computer Society, 2015.

[5] Sanjoy Baruah. The federated scheduling of systems of conditionalsporadic DAG tasks. In Alain Girault and Nan Guan, editors,2015 International Conference on Embedded Software, EMSOFT 2015,Amsterdam, Netherlands, October 4-9, 2015, pages 1–10. IEEE, 2015.

[6] Matthias Beckert and Rolf Ernst. Response time analysis for spo-radic server based budget scheduling in real time virtualizationenvironments. ACM Trans. Embedded Comput. Syst., 16(5):161:1–161:19, 2017.

[7] Enrico Bini and Giorgio C. Buttazzo. Measuring the performanceof schedulability tests. Real-Time Systems, 30(1-2):129–154, 2005.

[8] Vincenzo Bonifaci, Alberto Marchetti-Spaccamela, SebastianStiller, and Andreas Wiese. Feasibility analysis in the sporadicDAG task model. In 25th Euromicro Conference on Real-TimeSystems, ECRTS 2013, Paris, France, July 9-12, 2013, pages 225–233.IEEE Computer Society, 2013.

[9] Daniel Cordeiro, Gregory Mounie, Swann Perarnau, Denis Trys-tram, Jean-Marc Vincent, and Frederic Wagner. Random graphgeneration for scheduling simulations. In L. Felipe Perrone, Gio-vanni Stea, Jason Liu, Adelinde M. Uhrmacher, and Manuel Villen-Altamirano, editors, 3rd International Conference on Simulation Toolsand Techniques, SIMUTools ’10, Malaga, Spain - March 16 - 18, 2010,page 60. ICST/ACM, 2010.

[10] NVIDA CUDA[onlie]. https://developer.nvidia.com/cuda-zone.[11] Sanjoy Dasgupta, Christos H Papadimitriou, and Umesh Vazirani.

Algorithms. McGraw-Hill, Inc., 2006.[12] Leah Epstein and Asaf Levin. Scheduling with processing set

restrictions: Ptas results for several variants. 133:586–595, 2011.[13] R. L. Graham. Bounds for certain multiprocessing anomalies.

45:1563–1581, 1966.[14] R. L. Graham. Bounds on multiprocessing timing anomalies.

SIAM Journal on Applied Mathematics, 17:416–429, 1969.[15] Jiye Han, Jianjun Wen, and Guochuan Zhang. A new approxi-

mation algorithm for uet-scheduling with chain-type precedenceconstraints. Computers & OR, 25(9):767–771, 1998.

[16] Zhao hong Jia, Kai Li, and Joseph Y.-T. Leung. Effective heuristicfor makespan minimization in parallel batch machines with non-identical capacities. 169:1–10, 2015.

[17] Yumei Huo and Joseph Y.-T. Leung. Fast approximation al-gorithms for job scheduling with processing set restrictions.411:3947–3955, 2010.

[18] Jeffrey M. Jaffe. Bounds on the scheduling of typed task systems.SIAM Journal on Computing, 9(3):541–551, 1980.

[19] Klaus Jansen. Analysis of scheduling problems with typed tasksystems. Discrete Applied Mathematics, 52(3):223–232, 1994.

[20] Klaus Jansen, Gerhard J. Woeginger, and Zhongliang Yu. Uet-scheduling with chain-type precedence constraints. Computers &OR, 22(9):915–920, 1995.

[21] Shlomo Karhi and Dvir Shabtay. On the optimality of the tlsalgorithm for solving the online-list scheduling problem with twojob types on a set of multipurpose machines. 26:198–222, 2013.

[22] Shlomo Karhi and Dvir Shabtay. Online scheduling of two jobtypes on a set of multipurpose machines. 150:155–162, 2014.

[23] Kangbok Lee, Joseph Y.-T. Leung, and Michael L. Pinedo. Schedul-ing jobs with equal processing times subject to machine eligibilityconstraints. 14:27–38, 2011.

[24] Joseph Y.-T. Leung and Chung-Lun Li. Scheduling with processingset restrictions: A survey. 116:251–262, 2008.

[25] Joseph Y.-T. Leung and Chung-Lun Li. Scheduling with processingset restrictions: A literature update. 175:1–11, 2016.

[26] Chung-Lun Li and Kangbok Lee. A note on scheduling jobs withequal processing times and inclusive processing set restrictions.JORS, 67(1):83–86, 2016.

[27] Chung-Lun Li and Qingying Li. Scheduling jobs with releasedates, equal processing times, and inclusive processing set restric-tions. JORS, 66(3):516–523, 2015.

[28] Dingchao Li and N. Ishii. Scheduling task graphs onto heteroge-neous multiprocessors.

[29] Jing Li, Jian-Jia Chen, Kunal Agrawal, Chenyang Lu, Christo-pher D. Gill, and Abusayeed Saifullah. Analysis of federated andglobal scheduling for parallel real-time tasks. In 26th EuromicroConference on Real-Time Systems, ECRTS 2014, Madrid, Spain, July8-11, 2014, pages 85–96. IEEE Computer Society, 2014.

[30] Jing Li, David Ferry, Shaurya Ahuja, Kunal Agrawal, Christo-pher D. Gill, and Chenyang Lu. Mixed-criticality federatedscheduling for parallel real-time tasks. Real-Time Systems,53(5):760–811, 2017.

[31] Shuguang Li. Parallel batch scheduling with inclusive pro-cessing set restrictions and non-identical capacities to minimizemakespan. European Journal of Operational Research, 260(1):12–20,2017.

[32] Alessandra Melani, Marko Bertogna, Vincenzo Bonifaci, AlbertoMarchetti-Spaccamela, and Giorgio C. Buttazzo. Response-timeanalysis of conditional DAG tasks in multiprocessor systems. In27th Euromicro Conference on Real-Time Systems, ECRTS 2015, Lund,Sweden, July 8-10, 2015, pages 211–221. IEEE Computer Society,2015.

[33] OMAP[online]. https://en.wikipedia.org/wiki/OMAP.[34] Openmp [online]. Openmp application program interface version

4.5. https://www.openmp.org/.[35] OpenCL[online]. https://www.khronos.org/opencl/.[36] Tegra Processors[online]. https://www.nvidia.com/object/tegra.html.[37] Zynq-7000 SoC[online]. https://www.xilinx.com/products/silicon-

devices/soc/zynq-7000.html.[38] Ashish Venkat and Dean M. Tullsen. Harnessing ISA diversity:

Design of a heterogeneous-isa chip multiprocessor. In ACM/IEEE41st International Symposium on Computer Architecture, ISCA 2014,Minneapolis, MN, USA, June 14-18, 2014, pages 121–132. IEEEComputer Society, 2014.

[39] Jia Xu and Zhaohui Liu. Online scheduling with equal processingtimes and machine eligibility constraints. 572:58–65, 2015.

[40] Kecheng Yang, Ming Yang, and James H. Anderson. Reducingresponse-time bounds for dag-based task systems on heteroge-neous multicore platforms. In Alain Plantec, Frank Singhoff,Sebastien Faucou, and Luıs Miguel Pinho, editors, Proceedings ofthe 24th International Conference on Real-Time Networks and Systems,RTNS 2016, Brest, France, October 19-21, 2016, pages 349–358. ACM,2016.

Page 12: JOURNAL OF LA Response Time Bounds for Typed …WCET, while Figure 2(b) shows another execution sequence where some vertices execute for shorter than their WCET JOURNAL OF L A TEX

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 12

Meiling Han as born in Liaocheng, Shandongof China in 1988. She received her master’sdegree in computer applications technology fromShenyang Agricultural University, China in 2013,she is a Ph.D. candidate at the School of Com-puter Science and Engineering of Northeast-ern University, China. Her research interests arebroadly in embedded real-time systems, espe-cially the real-time scheduling on multiprocessorsystems.

Nan Guan is currently an assistant professor atthe Department of Computing, The Hong KongPolytechnic University. Dr Guan received his BEand MS from Northeastern University, China in2003 and 2006 respectively, and a PhD from Up-psala University, Sweden in 2013. Before joiningPolyU in 2015, he worked as a faculty memberin Northeastern University, China. His researchinterests include real-time embedded systemsand cyber-physical systems. He received theEDAA Outstanding Dissertation Award in 2014,

the Best Paper Award of IEEE Real-time Systems Symposium (RTSS)in 2009, the Best Paper Award of Conference on Design Automation andTest in Europe (DATE) in 2013.

Jinghao Sun received the MS and PhD de-gree in computer science from Dalian Univer-sity of Technology in 2012. He is an associatedprofessor at Northeastern University, China. Hewas a postdoctoral fellow in the Department ofComputing at Hong Kong Polytechnic Universitybetween 2016 to 2017, working on schedulingalgorithms for multi-core real time systems. Hisresearch interests include algorithms, schedula-bility analysis and optimization methods.

Qingqiang He received the BS degree in com-puter science and technology from NortheasternUniversity, China, in 2014, and the MS degree incomputer software and theory from Northeast-ern University, China, in 2017. Now he is workingas a research assistant in Hong Kong Polytech-nic University. His research interests include em-bedded real-time system, real-time schedulingtheory, and distributed leger.

Qingxu Deng received his Ph.D. degree incomputer science from Northeastern University,China, in 1997. He is a professor of the Schoolof Computer Science and Engineering , North-eastern University, China, where he serves asthe Director of institute of Cyber Physical Sys-tems. His main research interests include Cyber-Physical systems, embedded systems, and real-time systems.

Weichen Liu Weichen Liu (S’07-M’11) is anassistant professor at School of Computer Sci-ence and Engineering, Nanyang TechnologicalUniversity, Singapore. He received the PhD de-gree from the Hong Kong University of Scienceand Technology, Hong Kong, and the BEng andMEng degrees from Harbin Institute of Tech-nology, China. Dr. Liu has authored and co-authored more than 70 publications in peer-reviewed journals, conferences and books, andreceived the best paper candidate awards from

ASP-DAC 2016, CASES 2015, CODES+ISSS 2009, the best posterawards from RTCSA 2017, AMD-TFE 2010, and the most popular posteraward from ASP-DAC 2017. His research interests include embeddedand real-time systems, multiprocessor systems and network-on-chip.