
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2007; 19:2369–2389
Published online 3 April 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cpe.1175

Classification and generation of schedules for VLIW processors

Christoph Kessler∗,†, Andrzej Bednarski and Mattias Eriksson

PELAB, Department of Computer and Information Science, Linköpings Universitet, S-58183 Linköping, Sweden

SUMMARY

Exact methods for optimal instruction scheduling are gaining importance. They differ, however, considerably in the assumed processor model and in the space of schedules searched for an optimal solution. We identify and analyze different classes of schedules for VLIW processors. The classes are induced by various common techniques for generating or enumerating them, such as integer linear programming or list scheduling with backtracking. In particular, we study the relationship between VLIW schedules and their equivalent linearized forms (which may be used, e.g., with superscalar processors), and we identify classes of VLIW schedules that can be created from a linearized form using VLIW compaction methods that are just the static equivalents of dynamic instruction dispatch algorithms of in-order and out-of-order issue superscalar processors. For example, we study the class of greedy schedules and show that, if all instructions have multiblock reservation tables, it is safe for time optimization to consider greedy schedules only. We also show that, in certain situations, certain schedules generally cannot be constructed by incremental scheduling algorithms that are based on topological sorting of the data dependence graph. We summarize our findings as a hierarchy of classes of VLIW schedules. These results can sharpen the interpretation of the term 'optimality' used with various methods for optimal VLIW scheduling, and may help to identify classes that can be safely ignored when searching for an optimal schedule. Copyright © 2007 John Wiley & Sons, Ltd.

Received 28 February 2006; Revised 15 December 2006; Accepted 23 December 2006

KEY WORDS: instruction-level parallelism; instruction scheduling; code generation; code compaction; integer linear programming; VLIW architecture; superscalar processor

∗Correspondence to: Christoph Kessler, PELAB, Department of Computer and Information Science, Linköpings Universitet, S-58183 Linköping, Sweden.
†E-mail: [email protected]

Contract/grant sponsor: Linköping University; contract/grant number: Ceniit 01.06
Contract/grant sponsor: Vetenskapsrådet
Contract/grant sponsor: SSF (project RISE)
Contract/grant sponsor: CUGS Graduate School


1. INTRODUCTION

With the advent of more powerful computers and better optimization methods and tools, exact methods for solving code optimization problems, such as optimal instruction scheduling, have gained in importance. However, existing approaches to (optimal) instruction scheduling differ in the assumed processor model and in the space of schedules searched for an optimal solution.

Instruction scheduling is the task of mapping each instruction of a program to a point (or set of points) of time when it is to be executed, such that constraints implied by data dependences and limited resource availability are preserved. For RISC and superscalar processors with dynamic instruction dispatch, it is sufficient if the schedule is given as a linear sequence of instructions, such that the information about the issue time and the functional unit can be inferred by simulating the dispatcher's behavior. The goal is usually to minimize the execution time while avoiding severe constraints on register allocation. Alternative goals for scheduling can be minimizing the number of registers (or temporary memory locations) used, or the energy consumed.

The problem of finding a time-optimal schedule, i.e. a schedule that takes a minimum number of clock cycles to execute, is NP-complete for almost any non-trivial target architecture [1–6], except for certain combinations of very simple target processor architectures and tree-shaped dependency structures [7–14]. Even in the absence of dependences, resource-constrained scheduling remains NP-complete for three or more parallel issue units and at least one resource [3].

In many cases, simple heuristics such as greedy list scheduling [15,16] work well enough to obtain a schedule with decent performance. In some cases, however, the user is willing to spend a significant amount of time on optimizing the code, such as in the final compilation of time-critical parts in application programs for digital signal processors (DSPs).

For the general case where the dependence structure is a DAG (directed acyclic graph), various algorithms for time-optimal local instruction scheduling have been proposed, based on integer linear programming [17–21], branch-and-bound [22–24], and constraint logic programming [25]. Also, dynamic programming has been used for time-optimal scheduling of basic blocks up to a certain size [26,27].

VLIW processors require an explicitly parallel instruction stream, packaged in long instruction words. The aggregation of individual instructions into long instruction words, also known as compaction, is done by the compiler back-end. The two tasks of arranging a linear list of instructions (sequencing) and grouping them into long instruction words (compaction) are often performed separately.

Superscalar processors identify dependences in a sequential instruction stream and perform, within a narrow window on the instruction stream, partial (re-)scheduling and compaction at run-time. Parallelism is thus implicit in this case. Still, the compiler needs to predict execution times and thereby simulate the effect of the processor's instruction dispatcher when exposed to a certain instruction stream. See also Table I for a summary of differences between the programming models of RISC, VLIW, and superscalar processor architectures.

Hence, code generation for both VLIW and superscalar processors needs a detailed model of the target processor to predict the resource allocation and time consumption of the program under execution. In both cases, sequencing is followed by compaction. This similarity suggests that one consider both techniques in a uniform way.


Table I. Some differences in the assembler-level programming model of RISC, superscalar, and VLIW architectures (adapted from [28] and [29]).

Scheduling/compaction: RISC/embedded: mostly static. Superscalar: static sequencing, dynamic reordering and compaction. VLIW: static.

Instruction stream: RISC/embedded: linear, sequence of sequential instructions. Superscalar: linear, sequence of sequential instructions. VLIW: parallel, sequence of long instruction words.

Issue order: RISC/embedded: in-order. Superscalar: implicit, in-order or out-of-order. VLIW: explicit, in-order.

Exposure of parallelism: RISC/embedded: none. Superscalar: implicit (absence of hazards). VLIW: explicit (by compiler).

Exposure of data dependence: RISC/embedded: none. Superscalar: via register names. VLIW: none (hardcoded by compiler).

Exposure of structural hazards: RISC/embedded: compiler or hardware. Superscalar: hardware dispatcher. VLIW: programmer/compiler.

Representation of NOPs: RISC/embedded: mostly explicit. Superscalar: implicit where NOPs follow from data dependency or resource usage, otherwise explicit. VLIW: explicit (in uncompressed format).

Schedules for VLIW processors built from imitating a superscalar processor's dispatch algorithm at compile time have special properties that make them appropriate for certain optimization methods such as dynamic programming, while considering only such a subset of VLIW schedules for optimization may actually omit some important VLIW schedules such that, in the worst case, an optimal schedule could be missed.

As a generic processor model, we consider multi-issue superscalar processors and VLIW processors with a single general-purpose register file, where each register is equally accessible to each instruction. Limited resources are modeled by reservation tables. An extension to clustered VLIW processors (with multiple register files, each one connected to a specific subset of the functional units) is possible, but requires one to simultaneously consider the partitioning of data and operations across the clusters. For simplicity, we omit this scenario here and refer instead to our general dynamic programming method for integrated code generation for clustered VLIW processors described elsewhere [26].

We focus in this paper on scheduling straight-line code. Nevertheless, techniques for local scheduling also recur in some global scheduling algorithms for larger code regions with acyclic dependence graphs, such as traces or superblocks, and in some software pipelining methods for loops. We refer to [30, ch. 17–18] for a survey and references of global scheduling methods and software pipelining. We leave a generalization of our classification to global schedules to future work.

Finally, we mostly focus on optimization of execution time here, and refer the interested reader to our earlier work on scheduling for minimal register need [31,32] and energy consumption [33], where further references can be found.

The remainder of this paper is organized as follows. Section 2 introduces basic terminology. In order to make the presentation self-contained, we borrow some notation and definitions from earlier work [26]. Section 3 describes the various classes of VLIW schedules and derives inclusion relationships between these. In particular, we characterize a class of instruction-level parallel architectures for which we prove the existence of greedy optimal schedules. We also provide some experimental results that illustrate the hierarchy of classes of schedules discussed in this paper. Section 4 concludes and suggests future work.

2. BACKGROUND

2.1. A general model of instruction-level parallel target processors

We assume that we are given an in-order issue superscalar or a VLIW processor with f resources U1, . . . , Uf, which may be functional units or other limited resources, such as internal buses.

The issue width ω is the maximum number of instructions that may be issued in the same clock cycle. For a single-issue processor, we have ω = 1, while most superscalar processors and all VLIW architectures are multi-issue architectures, that is, ω > 1.

In order to model instruction issue explicitly, we provide ω separate instruction issue units ui, 1 ≤ i ≤ ω, which can take up at most one instruction per time unit (clock cycle) each. For 1 ≤ i ≤ ω, let Ii denote the set of instructions that can be issued to unit ui. Let I ⊂ I1 × · · · × Iω denote the set of all (statically legal) instruction words. Then, Sk ⊂ I^k denotes the set of all (statically legal) schedules whose length is k instruction words, and

S = ⋃_{k≥1} Sk

is the set of all (statically legal) schedules. Usually, inequality holds for all these subset relations because some combinations of instructions within a long instruction word or in a subsequence of long instruction words may be illegal as they need to find data in specific places or would subscribe to the same resource at the same time.

For a VLIW processor, the contents of all ui at any time correspond directly to a legal long instruction word. In the case of a superscalar processor, this corresponds to the instructions issued at that time as resulting from the dynamic instruction dispatcher's interpretation of a given linear instruction stream.

Beyond an issue unit time slot at the issue time, an instruction usually needs one or several resources, at issue time or later, for its execution. For each instruction y issued at time (clock cycle) t, the resources required for its execution at time t, t + 1, . . . can be specified by a reservation table [34], a bit matrix oy with oy(i, j) = 1 if and only if resource i is occupied by y at time t + j. For simplicity, we assume that an issue unit is occupied for one time slot per instruction issued.
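To make the reservation-table model concrete, the following small Python sketch (ours, not part of the original paper; the resource names and example tables are purely illustrative) encodes reservation tables as dictionaries of 0/1 occupancy rows and shows how a scheduler can test and commit reservations against a global reservation table.

from typing import Dict, List, Set

# Reservation table of an instruction y: rt[r][j] == 1 means that resource r
# is occupied j cycles after the issue time of y. The tables below are made up.
MUL_RT: Dict[str, List[int]] = {"U_mul": [1, 1]}               # block: two consecutive cycles
LOAD_RT: Dict[str, List[int]] = {"U_ls": [1], "BUS": [0, 1]}   # complex: bus used one cycle later

def conflicts(global_rt: Dict[str, Set[int]], rt: Dict[str, List[int]], t: int) -> bool:
    """Would issuing an instruction with reservation table rt at cycle t collide
    with reservations already committed to the global reservation table?"""
    return any(bit and (t + j) in global_rt.get(r, set())
               for r, row in rt.items() for j, bit in enumerate(row))

def commit(global_rt: Dict[str, Set[int]], rt: Dict[str, List[int]], t: int) -> None:
    """Record the resource usage of an instruction issued at cycle t."""
    for r, row in rt.items():
        for j, bit in enumerate(row):
            if bit:
                global_rt.setdefault(r, set()).add(t + j)

R: Dict[str, Set[int]] = {}
commit(R, MUL_RT, 1)
print(conflicts(R, MUL_RT, 2))   # True: U_mul is still occupied at cycle 2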

Rau proposed the following classification of reservation tables: 'A simple reservation table is one which uses a single resource for a single cycle on the cycle of issue. A block reservation table uses a single resource for multiple, consecutive cycles starting with the cycle of issue. Any other type of reservation table is termed a complex reservation table' [35]. Complex reservation tables can be transformed into simple ones if the architecture has enough resources so that these can be privatized across competing instructions, eliminating the need for instructions to share any resource except for the issue slot and perhaps an abstract resource used at issue time [36], representing a (virtual) single fully pipelined execution unit. A general method for reducing the size of reservation tables while exactly preserving resource constraints was proposed by Eichenberger and Davidson [37]. We assume from now on that such simplifications have already been performed. Virtually all algorithms for optimal instruction scheduling in the literature assume simple [9,12,14,18–20,22,38,39] and block [40] reservation tables. Scheduling for architectures with complex reservation tables, such as the Motorola 88100 [41], constitutes a harder problem. We will see an example of this in Section 3.1.

An instruction y may read and/or write registers and/or memory locations in certain clock cycles during its execution. For each instruction y, information on the clock cycles (relative to issue time) for reading operands and for writing a result (the latter is usually referred to as the latency of y) is given, for instance, in a formal architecture description used in a retargetable code generator. For any data dependence from an instruction y to an instruction y′, there is thus a well-defined delay‡ δ(y, y′) of clock cycles that must pass between the issue time of y and the earliest possible issue time of y′. For instance, in the case of a data flow dependence from y to y′, the result of instruction y issued at time t is available for instructions y′ issued at time t + δ(y, y′) or later. We denote by δ(y) := max_{y′} δ(y, y′) the longest delay resulting from any instruction y′ that may possibly depend on y. A somewhat more detailed modeling of latency behavior for VLIW processors is given, for example, by Rau et al. [36]. For now, we assume non-negative delays; we will later relax this assumption in Section 3.6.

Note that we consider any alternative of an instruction that performs the same computation with a different reservation table or different delay as a different instruction. Hence, we assume that the choice between equivalent alternatives is decided in a preceding instruction selection phase. In particular, the issue unit to be used by an instruction is uniquely determined.

2.2. Scheduling

We are given a target-level basic block (i.e. instruction selection has already been performed). For basic blocks, the data dependences among the instructions form a DAG G = (V, E), where V is a set of instructions and E is the set of data dependence edges between instructions in V. Let n = |V| denote the number of nodes (instructions) in the DAG.

An instruction sequence is a bijective mapping S : V → {1, . . . , n} that represents a total order of the n instructions in V. We call an instruction sequence S causal if this total order is compliant with the partial order implied by E, that is, for any edge (u, v) ∈ E, S(u) < S(v), which means that u occurs before v in the sequence. Causal sequences can be generated by topological sorting of the DAG. In the following, we use tuple notation, such as S = 〈u, v, w〉, to denote instruction sequences.

A target schedule (or just schedule for short) is a mapping s from the set of time slots in all issue units, {1, . . . , ω} × N, to instructions such that si,j denotes the instruction issued to unit ui at time slot j. Where no instruction is issued to ui at time slot j, si,j is defined as ⊥, meaning that ui is idle. If an instruction si,j produces a value that is used by an instruction si′,j′, then j′ ≥ j + δ(si,j, si′,j′) must hold. Finally, the resource reservations by the instructions in s must not overlap. The accumulated reservations for all instructions scheduled in s can be described by a resource usage map or global reservation table [34], a boolean matrix Rs where Rs(i, j) = 1 if and only if resource i is occupied at time j.

‡We adopt here the term delay as used, for example, by Rau et al. [36]. The term latency is also often used for the same purpose in the literature.


The reference time ρ(s) of a target schedule s is defined as follows. Let t denote the most recent clock cycle where an instruction (including explicit NOPs) is issued in s to some functional unit. If there is any fillable slot left on an issue unit at time t, we define ρ(s) = t; otherwise ρ(s) = t + 1. Incremental scheduling methods, such as the one described in the next section, may use the reference time as the earliest time slot for adding more instructions to a schedule.

The execution time τ(s) of a target schedule s is the number of clock cycles required for executing s, that is,

τ(s) = max_{i,j} { j + δ(si,j) : si,j ≠ ⊥ }
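As a concrete illustration (ours, with made-up instruction names, delay values, and slot placements), the execution time of a schedule represented as a mapping from (issue unit, time slot) pairs to instructions can be computed directly from this formula:

# Illustrative sketch: a schedule as a dict from (issue unit, time slot) to an
# instruction name; NOP slots are simply omitted. DELAY gives δ(y) per instruction.
DELAY = {"i1": 3, "i2": 1, "i3": 3, "i4": 3}                       # example values
schedule = {(1, 1): "i1", (1, 2): "i4", (2, 2): "i2", (1, 3): "i3"}  # s_{i,j}

def execution_time(s, delay):
    """τ(s) = max over all non-NOP slots (i, j) of j + δ(s_{i,j})."""
    return max(j + delay[y] for (_i, j), y in s.items())

print(execution_time(schedule, DELAY))   # 6 for these example values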

Moreover, most papers on optimal local instruction scheduling simply set τ(s) to the latest issue time, because the pessimistically assumed maximal delays beyond the end of the basic block may not be relevant if the dependent instructions in a potentially subsequent basic block will not be executed at all or are not close enough to be affected.

A target schedule s is time-optimal if it takes no more time than any other target schedule for the DAG.

This notion of time-optimality requires a precise characterization of the solution space of all target schedules that are to be considered by the optimizer. In Section 3 we will define several classes of schedules according to how they can be linearized and re-compacted into an explicit parallel form by various compaction strategies, and discuss their properties. This classification will allow us to a priori reduce the search space, and it will also illuminate the general relationship between VLIW schedules and schedules for superscalar processors.

3. CLASSES OF VLIW SCHEDULES

In a target schedule s, consider any (non-NOP) instruction si,j issued on an issue unit ui at a time j ∈ {1, . . . , ρ(s)}. Let e denote the earliest issue time of si,j as permitted by data dependences from instructions issued earlier than j; thus, e = 1 if si,j does not depend on any other instruction. Obviously, e ≤ j. We say that si,j is tightly scheduled in s if, for each time slot k ∈ {e, e + 1, . . . , j − 1}, there would be a resource conflict with some instruction issued earlier than time j if si,j would be issued (to ui) already at time k.

Note that this definition implicitly assumes forward scheduling. A corresponding construction for backward scheduling is straightforward.

3.1. Greedy schedules

An important subclass of target schedules is the class of so-called greedy schedules, where each instruction is issued as early as data dependences and available resources allow, given the placement of instructions issued earlier.

A target schedule s is said to be greedy if all (non-NOP) instructions si,j ≠ ⊥ in s are tightly scheduled. Figure 1(a) shows an example of a greedy schedule. A greedy compaction algorithm that computes a greedy schedule from a linear sequence of instructions is given in Figure 2.
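Under a deliberately simplified resource model (one instruction per issue unit and cycle, each instruction pre-assigned to one unit), the greediness test can be sketched as follows. This is our own illustration, not the paper's code, and the helper names (unit, delay, preds) are assumptions.

def is_greedy(schedule, unit, delay, preds):
    """schedule: {(issue unit, time): instruction}; unit[y]: issue unit of y;
    delay[(x, y)]: delay of DAG edge x -> y; preds[y]: DAG predecessors of y.
    Returns True iff every scheduled instruction is tightly scheduled."""
    issue_time = {y: j for (_u, j), y in schedule.items()}
    busy = set(schedule)                                   # occupied (unit, time) slots
    for y, j in issue_time.items():
        # earliest issue time permitted by dependences on instructions issued earlier
        e = max([issue_time[x] + delay[(x, y)]
                 for x in preds[y] if issue_time[x] < j], default=1)
        if any((unit[y], k) not in busy for k in range(e, j)):
            return False                                   # y could have been issued earlier
    return True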


Figure 1. A target-level DAG with four instructions, and four example schedules: (a) greedy, strongly linearizable; (b) not greedy, strongly linearizable; (c) not greedy, not strongly linearizable; (d) dawdling. Instructions a, c, and d with delay 3 are to be issued to unit u1, and b to unit u2 with δ(b, c) = 1. We assume no resource conflicts.

Input: instruction sequence S = 〈y1, . . . , yn〉
Output: greedy schedule s
Method: Greedy compaction:
  s ← new empty schedule;
  R ← new empty resource usage map;
  for j = 1, . . . , n do
    for t = 1, . . . , ∞ do
      if instruction yj can be issued to its unit ui at time t
        then schedule si,t ← yj; commit reservations to R; break;
  return s

Figure 2. Pseudocode for greedy compaction.
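A runnable counterpart of Figure 2, under the same simplified resource model as the earlier sketches (one slot per issue unit and cycle, fixed unit assignment, edge delays), might look as follows; it is our illustration, and the parameter names are assumptions rather than the paper's notation.

def greedy_compaction(sequence, unit, delay, preds):
    """sequence: instructions in a causal order; returns {(issue unit, time): instruction}."""
    schedule, issue_time, busy = {}, {}, set()
    for y in sequence:
        # earliest time permitted by data dependences on already placed predecessors
        e = max([issue_time[x] + delay[(x, y)] for x in preds[y]], default=1)
        t = e
        while (unit[y], t) in busy:            # first free slot on y's issue unit
            t += 1
        schedule[(unit[y], t)] = y
        issue_time[y] = t
        busy.add((unit[y], t))
    return schedule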

3.1.1. Conditions for the dominance of greedy schedules

Any given schedule s can be greedified, that is, it can be converted into a greedy schedule. For instance, local greedification (see Figure 3) simply moves, in ascending order of issue time, instructions in s backwards in time to the earliest possible issue time that does not violate resource or delay constraints [22]; see Figure 5(a) later for an example. For the case of simple and block reservation tables, this will result in a greedy schedule whose execution time does not exceed τ(s), because then the instruction to be moved only uses a single resource and hence, for moving the instruction, one needs to consider only local interferences with other instructions using that same resource. For complex reservation tables, local greedification may not succeed in delivering a greedy schedule, as in the example in Figure 5(b). Instead, the global greedification algorithm in Figure 4 could be used. For the case of simple and block reservation tables, global greedification has the same effect as local greedification and hence we also have here the strong property that greedification of any schedule s does not increase the execution time τ(s). In other cases, however, there may be greedification anomalies, situations where delaying the issue time of some instruction (thus having a non-greedy schedule) actually reduces the execution time. Figure 5(b) shows an example.


Input: target schedule s
Output: modified schedule s
Method: Local greedify:
  for c = 1, . . . , ρ(s) do
    for i = 1, . . . , ω where si,c ≠ ⊥ do
      if instruction y = si,c is not tightly scheduled then
        unschedule si,c and free its resource reservations;
        for t = 1, . . . , c do
          if y can be issued to ui at time t
            then schedule si,t ← y; commit resource reservations; break;
  return s

Figure 3. Pseudocode for local greedification.

Input: target schedule s
Output: greedy schedule s′
Method: Global greedify:
  s′ ← new empty schedule;
  R′ ← new empty resource usage map;
  for c = 1, . . . , ρ(s) do
    for i = 1, . . . , ω where si,c ≠ ⊥ do
      // in linearized order of s, schedule its instructions greedily to s′:
      y ← si,c;
      for t = 1, . . . , ∞ do
        if y can be issued to ui at time t
          then schedule s′i,t ← y; commit reservations to R′; break;
  return s′

Figure 4. Pseudocode for global greedification.

The absence of greedification anomalies allows algorithms for time-optimal scheduling to safely omit all non-greedy schedules, which considerably reduces the search space.

Extending Rau's classification [35], we define a multiblock reservation table as a reservation table where r ≥ 1 resources are used from the issue time onwards contiguously for some time slots (although the duration of occupation can vary across these resources), and no other resources§. Special cases of multiblock reservation tables are simple and block reservation tables, where r = 1.
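With the dictionary encoding of reservation tables used in the earlier sketch, the multiblock property can be tested mechanically. The following check is our own illustration of the definition, with made-up resource names.

def is_multiblock(rt):
    """rt: dict mapping a resource name to a 0/1 occupancy row, index 0 = issue cycle.
    Multiblock: every used resource is occupied contiguously from the issue time on."""
    def contiguous_from_issue(row):
        used = [j for j, bit in enumerate(row) if bit]
        return used == list(range(len(used)))          # 0, 1, ..., len(used)-1
    return all(contiguous_from_issue(row) for row in rt.values() if any(row))

print(is_multiblock({"U1": [1, 1, 1]}))             # True: block (r = 1)
print(is_multiblock({"U1": [1], "U2": [1, 1]}))     # True: multiblock (r = 2)
print(is_multiblock({"U1": [1], "U3": [0, 0, 1]}))  # False: U3 is used only later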

Theorem 1. Schedules built from instructions with multiblock reservation tables are free of greedification anomalies.

§This is analogous to an overly conservative strategy for deadlock avoidance in multitasking systems where a task must allocate and hold all resources that it may eventually need right from the beginning, so that it can run to completion without being stalled later on.


Figure 5. Examples of greedification and greedification anomalies. Instructions z, a, b, and c are independent. Instruction a occupies resource U1 at the issue time and U2 two cycles later; b has a symmetric reservation table. In case (a), instruction c needs resource U1 at the issue time. Local greedification yields a greedy schedule here. In case (b), where c needs both resources at the issue time, local greedification would not work at all (the schedule remained unchanged, i.e. non-greedy), and global greedification increases the execution time.

Proof. We show by induction over the issue time c that, if all instructions have multiblock reservation tables, local greedification (Figure 3) always yields a greedy schedule that does not increase the execution time. For c = 1 there is nothing to greedify. Consider now an instruction v with earliest issue time c > 1 that is not yet tightly scheduled, which means that (after unscheduling v) there is an earliest time t < c where v can be scheduled, preserving resource and dependency constraints, and given the placement of instructions issued at time less than c. Moving v to time t makes v tightly scheduled. We show that no other instruction, even if using any of v's resources, needs to be moved for this purpose. We distinguish between three cases. (i) For any instruction x issued at time tx < c, x was tightly scheduled already before v was considered, by the induction hypothesis: if t < tx, then v fits ('below' x) by the assumption that v can be issued at time t; if t = tx, then v and x cannot share resources due to the multiblock property; if t > tx, then v fits ('above' x) again by the assumption. (ii) There is also no need to move another instruction y issued at time c along with v, because they cannot share resources owing to the multiblock property. (iii) Finally, there is no need to move any instruction z issued at time tz > c when moving v: if v and z share a resource, its usage by v (which started at time c) is earlier in time than the usage by z (which starts at tz), and the resources for v now to be used from time t on will also be released c − t > 0 cycles earlier than before. (Moving v may, of course, make some later instructions non-tightly scheduled, but these will be handled in a subsequent step of the local greedification algorithm.) □

Observation 2. The presence of any non-multiblock reservation table can lead to a greedification anomaly.

Proof. Figure 6 gives an example of a greedification anomaly, caused by the non-multiblock instruction c which 'hooks' on the instruction d issued later than c, such that moving c earlier within a certain range also requires moving d to avoid a collision for the shared resource U3: either downwards if there is space, or upwards, and then the anomaly can occur. □

Lundqvist and Stenström [42] consider the closely related problem of timing anomalies in worst-case execution time (WCET) prediction for out-of-order superscalar architectures that are caused by variation in resource occupation and latency behavior, for example of load instructions due to cache effects.


Figure 6. Example of a greedification anomaly: four independent instructions, where a, b, and d have a multiblock reservation table while c has not. (a) A non-greedy schedule where c hooks on d, which is issued later than c. Local greedification does not work (there would be a collision at time 4 on U3). (b) Global greedification increases the execution time by one (6).

A WCET timing anomaly means that the WCET of a sequence of instructions does not necessarily result from assuming the WCET for every instruction; for instance, a cache miss may, in certain situations, actually shorten the execution time. They classify resources into in-order resources that are always used by instructions in increasing order of issue time, and out-of-order resources. They propose that processors with only in-order resources are free of WCET timing anomalies, and hence increasing latency never decreases execution time, while variations in unit occupation that allow resources to be used out of program order can cause timing anomalies. Wenzel et al. [43] show that even in-order resources can lead to timing anomalies if there are alternatives between different resources for the same instruction. Indeed, for architectures where each instruction could be issued on every issue unit, the classical paper by Graham [16] describes how the schedule produced by a greedy scheduler can change and increase in time even if resource occupation decreases, dependences are removed, or another issue unit is added. Eisinger et al. [44] present a method for automatic detection of timing anomalies and the triggering instruction sequences for a given processor description in VHDL. Our multiblock reservation tables also imply that resources are used in-order. Note that we assume that instructions' reservation tables and latencies do not vary; our greedification anomalies stem from modified instruction issue times within the same issue unit only.

Before we continue, we note the following corollary.

Corollary 3. For any time-optimal schedule s there exists a greedy schedule s′ that has the same execution time on a processor with multiblock reservation tables.

Proof. By local greedification we can construct from s a greedy schedule s′ with τ(s′) ≤ τ(s), see Theorem 1. As s is time-optimal by assumption, τ(s′) = τ(s). □

3.2. Strongly linearizable schedules and in-order compaction

We first relax our previous definition of tightness as follows. In a target schedule s, consider any (non-NOP) instruction si,j issued on an issue unit ui at a time j ∈ {1, . . . , ρ(s)}. Let e denote the earliest issue time of si,j as permitted by data dependences from instructions issued earlier than j. si,j is tightly scheduled from time t in s if, for each time slot k ∈ {max(e, t), . . . , j − 1}, there would be a resource conflict with some instruction issued earlier than time j if si,j would be issued (to ui) already at time k.


Input: instruction sequence S = 〈y1, . . . , yn〉
Output: schedule s as in-order compaction of S
Method: In-order compaction:
  c ← 1; // current issue time
  R ← new empty resource usage map;
  s ← new empty schedule;
  for i = 1, . . . , n do
    while yi not yet scheduled do
      if yi can be scheduled at time c
        then si,c ← yi; commit reservations to R; break;
        else c ← c + 1;
  return s

Figure 7. Pseudocode for in-order compaction.
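For comparison with the greedy sketch above, here is our own runnable rendition of Figure 7 under the same simplifying assumptions (one slot per issue unit and cycle, fixed unit assignment, edge delays); the parameter names are ours.

def in_order_compaction(sequence, unit, delay, preds):
    """Place instructions in sequence order; the 'current' issue time c never decreases."""
    schedule, issue_time, busy = {}, {}, set()
    c = 1                                       # currently open instruction word
    for y in sequence:
        while True:
            e = max([issue_time[x] + delay[(x, y)] for x in preds[y]], default=1)
            if c >= e and (unit[y], c) not in busy:
                schedule[(unit[y], c)] = y
                issue_time[y] = c
                busy.add((unit[y], c))
                break
            c += 1                              # close word c, open word c + 1
    return schedule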

Our previous notion of tightness means global tightness for all instructions from the start time 1, while we now limit the scope of the tightness criterion to the interval [t, j − 1] for each instruction si,j.

For each instruction si,j in a schedule s, let Js(j) denote the latest issue time earlier than j when some instruction in s is issued. For the instruction(s) issued immediately, we define Js(1) = 1.

A schedule s is called strongly linearizable if, for all times j ∈ {1, . . . , ρ(s)} where at least one (non-NOP) instruction is issued, at least one of these instructions si,j ≠ ⊥, i ∈ {1, . . . , ω}, is tightly scheduled from time Js(j).

The definition implies that every greedy schedule is strongly linearizable. The target schedules in Figures 1(a) and (b) are strongly linearizable. The schedule in Figure 1(b) is not greedy because instruction b is not tightly scheduled (i.e. not from start time 1). The schedule in Figure 1(c) is not strongly linearizable because instruction b, which is the only instruction issued at time 3, is not tightly scheduled from time Js(3) = 2.

Any strongly linearizable schedule s can be emitted for an in-order issue superscalar processor without insertion of explicit NOP instructions to control the placement of instructions in s. We can represent each strongly linearizable schedule s by a sequence s̄ containing the instructions of s such that the dynamic instruction dispatcher of an in-order issue superscalar processor, when exposed to the linear instruction stream s̄, will schedule instructions precisely as specified in s. For instance, a linearized version of the schedule in Figure 1(b) is 〈a, d, b, c〉.

We can compute a linearization s̄ from s by concatenating all instructions in s in increasing order of issue time. The linearization is strict if we locally reorder the instructions si,j with the same issue time j such that one of them that is tightly scheduled from Js(j) appears first in s̄. A similar construction allows one to construct linearized schedules s̄ for VLIW processors and reconstruct the original schedule s from s̄ if the in-order issue policy is applied instruction by instruction.
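A small sketch of strict linearization (our own illustration; tight_from and J are assumed helpers encapsulating the tightness test and the Js mapping defined above):

def strict_linearization(schedule, tight_from, J):
    """schedule: {(issue unit, time): instruction}. Concatenate instructions by
    increasing issue time; within one issue time t, an instruction that is tightly
    scheduled from J[t] is placed first (this makes the linearization strict)."""
    by_time = {}
    for (_u, t), y in schedule.items():
        by_time.setdefault(t, []).append(y)
    result = []
    for t in sorted(by_time):
        # sort key False (0) for tightly scheduled instructions puts one of them first
        result.extend(sorted(by_time[t], key=lambda y: not tight_from(y, J[t])))
    return result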

In the reverse direction, we can reconstruct a strongly linearizable schedule s from a strict linearized form s̄ by a method that we call in-order compaction (see Figure 7), where instructions are placed on issue units in the order they appear in s̄, as early as possible by resource requirements and data dependence, but always in non-decreasing order of issue time. In other words, there is a non-decreasing 'current' issue time c such that all instruction words issued at time 1, . . . , c − 1 are already closed, i.e. no instruction can be placed there even if there should be a free slot. The instruction word at time c is currently open for further insertion of instructions, and the instruction words for time c + 1 and later are still unused. The 'current' issue time c pointing to the open instruction word is incremented whenever the next instruction cannot be issued at time c (because issue units or required resources are occupied or required operand data are not yet ready), such that the instruction word at time c + 1 is opened and the instruction word at time c is closed. Proceeding in this way, the 'current' issue time c is just the reference time ρ(s) of the schedule s being constructed.

As an example, in Figure 1(c), we cannot reconstruct the original schedule s from the sequence 〈a, d, b, c〉 by in-order compaction, as b would be issued in the (after having issued d at time 2) still free slot s2,2, instead of the original slot s2,3.

Generally, there may exist several possible strict linearizations for a strongly linearizable schedule s, but their in-order compactions will all result in the same schedule s again.

In-order compaction fits well to scheduling methods that are based on topological sorting of the data dependence graph, because such schedules can be constructed (and optimized) incrementally; this is, for instance, exploited in the dynamic programming algorithms in OPTIMIST [26]. In the context of in-order compaction of strongly linearizable schedules it is thus well-defined to speak about appending an instruction y to a schedule s (namely, assigning it an issue slot as early as possible but not earlier than ρ(s), resulting in a new schedule) and about a prefix s′ of a schedule s (namely, the in-order compaction of the prefix s̄′ of a suitable strict linearization s̄ of s).

Finally, we note the following inclusion relationship.

Theorem 4. For each instruction sequence S with greedy compaction s, there exists a (usually different) instruction sequence S′ such that the in-order compaction of S′ is equal to s.

Proof. S′ = s̄, i.e. S′ is obtained by strictly linearizing s as defined above. □

Corollary 3 and Theorem 4 yield that, when searching for a time-optimal schedule for a processor with multiblock reservation tables, it is safe to scan all causal sequences and apply in-order compaction, which is what OPTIMIST does [26].

3.3. Schedules derived by out-of-order compaction

In-order dispatch of instructions in superscalar processors makes the exploitable instruction-level parallelism highly dependent on the given instruction sequence. Most modern superscalar processors allow one, within a limited scope, to look ahead in the instruction stream and issue instructions out-of-order, which is also known as dynamic scheduling [45]. Here, we consider a simplified model that maintains an instruction window of fixed size k ≥ 1, holding the next k non-issued instructions in the sequence. During program execution, this window slides over the instruction sequence from left to right. Out-of-order compaction (see the compile-time version in Figure 8) proceeds cycle by cycle. By scanning the window from left to right, we issue instructions in the window that are ready to be issued in the current cycle until all issue units are used or the window end is reached. If no issue slot at the current cycle can be filled with any instruction within the window, the cycle counter is incremented.


Input: a causal instruction sequence S; instruction window size k
Output: schedule s ∈ OOS(k)
Method: Out-of-order compaction:
  c ← 1; // current cycle
  R ← new empty resource usage map;
  W ← new queue of capacity k; // instruction window
  enqueue the first k instructions of S into W;
  while not all instructions are scheduled do
    Wc ← { instructions in W ready to be issued at time c };
    if Wc is nonempty then
      fill remaining issue slots at time c with instructions from Wc,
      commit their resource reservations to R,
      and mark the entries of these instructions in W as empty;
    c ← c + 1;
    while first entry of W is empty do dequeue it; od
    while W contains less than k entries do
      enqueue the next instruction of S into W;
  return s

Figure 8. Pseudocode for out-of-order compaction.
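The following runnable sketch mirrors Figure 8 under the simplified resource model used in the earlier sketches (one slot per issue unit and cycle, fixed unit assignment): issued window entries are marked None, and the window slides and refills as in the figure. It is our own illustration, with assumed parameter names.

def out_of_order_compaction(sequence, unit, delay, preds, k, omega):
    """sequence must be causal; k: window size; omega: issue width."""
    schedule, issue_time, busy = {}, {}, set()
    rest = list(sequence)
    window, rest = rest[:k], rest[k:]              # instruction window (None = issued)
    c = 1                                          # current cycle
    while any(y is not None for y in window):
        issued = 0
        for idx, y in enumerate(window):           # scan window from left to right
            if y is None or issued == omega:
                continue
            ready = all(x in issue_time and issue_time[x] + delay[(x, y)] <= c
                        for x in preds[y])
            if ready and (unit[y], c) not in busy:
                schedule[(unit[y], c)] = y
                issue_time[y] = c
                busy.add((unit[y], c))
                window[idx] = None
                issued += 1
        c += 1
        while window and window[0] is None:        # slide window past issued entries
            window.pop(0)
        while rest and len(window) < k:            # refill up to k entries
            window.append(rest.pop(0))
    return schedule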

Whenever the leftmost instruction in the window has been issued, the window is moved ahead until the first non-issued instruction appears in its first position. For simplicity, we assume a single, common instruction window¶ as, for example, in Intel Pentium II and III processors, and we allow one to reuse the window space directly after instruction issue‖.

Let OOS(k) denote the set of schedules that can be generated from all causal instruction sequences by applying out-of-order compaction with a window of size k.

Consider any linearization s̄ of a schedule in OOS(k). For each clock cycle c, the first k instructions in s̄ with issue time c could be permuted in any order, and the new sequence submitted to out-of-order compaction with window size k would still produce s. Silvera et al. [46] exploit this additional flexibility in ordering the linear instruction stream to reduce register pressure.

Theorem 5. The hierarchy of the OOS(k) schedules is structured as follows:

(1) OOS(1) is equal to the set of strongly linearizable schedules;
(2) OOS(k) ⊇ OOS(k + 1);
(3) OOS(∞) is the set of greedy schedules.

Proof. (1) Comparing the algorithms in Figures 7 and 8 reveals that in-order compaction is the same as out-of-order compaction with a window of size 1.

¶Several superscalar processors feature separate instruction windows, for example, for integer and floating-point instructions, such as HP PA-8000, MIPS-R10000, or Intel Pentium IV, or couple an instruction window with subsequent separate buffers for subsets of functional units, as, for example, in IBM PowerPC. We do not model these details here.
‖Usually, instructions stay in the reservation stations (modeled by the window) until they have finished execution and are handed over to the reorder buffer, which re-establishes sequential program semantics.


(2) Consider a fixed instruction sequence S where out-of-order compaction with window size k yields a schedule s. Increasing the instruction window by one potentially yields, in each step, one more candidate for issue in the current cycle, resulting in a 'greedier' schedule s′. If there exists, in any step of the outer while loop, a situation where this latest window entry, call it y, is ready for issue in the current cycle and its required issue slot and resources are not yet taken by ready entries at positions 1 to k, then s′ ≠ s. (As s may still be generated from another instruction sequence, the inclusion of OOS(k + 1) is not necessarily proper.) However, s′ can also be generated with a size-k window from a 'better' sequence S′, where the q instructions being issued in the current cycle (q ≤ ω) are grouped closer together, that is, within a distance of k. For instance, any linearization of s′ is appropriate, as it will reproduce s′ with out-of-order compaction for any window size, because the instructions issued in each clock cycle will then just be the leading elements in the window.

(3) If the lookahead is unbounded (that is, the scope is the entire remaining sequence), no ready-to-issue instruction will be postponed at any step of the out-of-order compaction algorithm if the needed resources are still available, which is the characteristic of a greedy schedule. □

3.4. Weakly linearizable schedules

In a non-strongly linearizable schedule s ∈ S there is a time j ∈ {1, . . . , ρ(s)} such that no instruction issued in s at time j is tightly scheduled from Js(j). The class of weakly linearizable schedules comprises the strongly and non-strongly linearizable schedules.

Non-strongly linearizable schedules s, such as those in Figures 1(c) and (d), can only be linearized if explicit NOP instructions are issued in the linearized version to occupy certain issue units and thereby delay the placement of instructions that are subsequently issued on that unit. In Figure 1(c), we could reconstruct the schedule, for example, from the sequence 〈a, d, NOP1, b, c〉, where NOP1 denotes an explicit NOP instruction on issue unit 1.

In principle, all schedules in S are weakly linearizable, as an arbitrary number of explicit NOPs could be used to guide in-order compaction to produce the desired result.

Note that some non-strongly linearizable schedules such as that in Figure 1(c) may be superior to greedy or strongly linearizable schedules if register need or energy consumption are the main optimization criteria, or if non-multiblock reservation tables occur.

3.5. Non-dawdling schedules

Now we will identify a finite-sized subset of S that avoids obviously useless cycles and may require only a limited number of NOPs to guide in-order compaction.

We call a schedule s ∈ S dawdling if there is a time slot t ∈ {1, . . . , τ(s)} such that (a) no instruction in s is issued at time t, and (b) no instruction in s is running at time t, i.e. has been issued earlier than t, occupies some resource at time t, and delivers its result at the end of t or later (see Figure 1(d) for an example). In other words, the computation makes no progress at time t, and time slot t could therefore be removed completely without violating any data dependences or resource constraints. Hence, dawdling schedules are never time-optimal. They are never energy-optimal either, as, according to standard energy models, such a useless cycle still consumes a certain base energy and does not contribute to any energy-decreasing effect (such as closing up in time instructions using the same resource) [33]. Finally, the register pressure does not change in such a useless time slot.


Figure 9. Example of a negative delay along an anti-dependence edge in the basic block's data dependence graph (adapted from [36]). Here, instruction A reads in its second cycle (R) a register that the longer-running instruction B overwrites (W) in its last cycle (disjoint resources are assumed). Instruction B could thus be issued up to three cycles earlier than A without violating the anti-dependence, i.e. the delay from A to B is −3. However, scheduling methods based on topological sorting of the DAG and greedy or in-order compaction are, in general, unable to issue a dependent instruction earlier than its DAG predecessor.

By repeated removal of such useless time slots, each dawdling schedule could eventually be transformed into a non-dawdling schedule. It is obvious that a dawdling schedule always requires explicit NOPs for linearization, i.e. is not strongly linearizable. There are also non-dawdling schedules (such as that in Figure 1(c)) that are not strongly linearizable.

Even for a fixed basic block of n instructions, there are infinitely many dawdling schedules, as arbitrarily many non-productive cycles could be inserted. There are, however, only finitely many non-dawdling schedules, as there are only n instructions and, in each clock cycle, at least one of them progresses in a non-dawdling schedule. Hence, whether for optimization of time, register need, or energy consumption, dawdling schedules can be safely ignored.

From Theorem 1 we know that OPTIMIST is guaranteed to compute a time-optimal schedule only if all instructions have multiblock reservation tables. This constraint was fulfilled in the block reservation table model used in the original version [40] but was not explicitly stated when the resource model was generalized to include complex reservation tables [26]. For energy optimization, we proposed an extension of OPTIMIST [33] to scan all (topsort-generatable) non-dawdling schedules by considering, when appending another instruction to a schedule, all possibilities for issuing NOP instructions to any issue unit up to the longest delay of running instructions, thus without introducing a useless cycle [33]. As OPTIMIST systematically explores all possible ready-lists that could occur during any topological sorting of the dependence graph, the extension yields that for each ready instruction y all possible issue times between the earliest and latest possible (with respect to data dependences) issue time for y will be considered. At the expense of a significant growth of the solution space considered, this extension, re-applied for time optimization, thus allows OPTIMIST to generate time-optimal code even in the presence of greedification anomalies.

3.6. Schedules generatable by topological sorting

Up until now, we assumed non-negative delays along data dependence edges. However, for certain architectures and combinations of dependent instructions, negative delays could occur. For instance, if there are pre-assigned registers, there may be anti-dependences with negative delays. This means that an optimal schedule may actually have to issue an instruction earlier than its predecessor in the dependence graph (see Figure 9 for an example).


In VLIW scheduling, such cases cannot be modeled in a straightforward way if methods based on topological sorting of the dependence graph and in-order compaction are used, because these rely on the causal correspondence between the direction of DAG edges and the relative issue order of instructions.

Likewise, superscalar processors could exploit this additional scheduling slack only if out-of-order issue is possible.

This case demonstrates a general drawback of all methods based on topological sorting of the data dependence graph, such as list scheduling [15], Vegdahl's approach [27], and OPTIMIST [26], or Chou and Chung's branch-and-bound search for superscalar processors [22], all of which assume non-negative delays in their model.

Workarounds to deal with the negative delay problem are possible but have drawbacks as well. A local and very limited workaround could consist in defining superinstructions for frequently occurring types of anti-dependence chains that are discovered as a whole and treated specially. However, we may then miss the opportunity to insert an instruction between the coupled instructions of a superinstruction, which can compromise optimality as well. Another possibility is to build schedules from sequences by greedy or in-order compaction as before but to allow non-causal instruction sequences as input, where a successor node may appear in the sequence before its predecessor. Non-causal sequences cannot be generated by topological sorting; instead, all permutations would be eligible. However, this may lead to deadlocks, namely when a free issue slot or resource is needed at some position in the 'past' that has already been occupied. In contrast, causal sequences in connection with monotonic compaction algorithms such as in-order compaction are deadlock-free, as the schedule can always 'grow' in the direction of the time axis.

A more elegant solution to this problem of negative delays can be provided by integer linear programming (ILP), which does not at all involve the notion of causality, because the declarative formulation of an ILP instance as a system of inequalities is naturally acausal. (Internally, ILP solvers may certainly consider the integer variables in some order, which, however, is generally not under the control of the user.)

3.7. Summary: hierarchy of classes

Figure 10 surveys the inclusion relationships between the various classes that we have derived in this section. In the cases described in the previous subsection, the set of time-optimal (dashed) schedules may actually not include any topsort-generatable schedule (shaded area) at all. In any case, with multiblock reservation tables, it will be sufficient to only consider greedy schedules if a time-optimal schedule is sought, and non-dawdling schedules in the case of complex reservation tables or if looking for an energy-optimal schedule.

3.8. Experiments

In order to illustrate this hierarchy of classes of schedules, we computed, for some basic blocks with 10 to 16 instructions taken from the TI C64x benchmark suite, the number of optimal schedules, greedy schedules, strongly linearizable schedules, non-dawdling schedules, and weakly linearizable schedules. For this purpose, we implemented a brute-force backtracking constraint solver that enumerates all weakly linearizable VLIW schedules up to a certain execution time τmax.


[Figure 10 is a diagram of nested schedule classes; its region labels include: greedy schedules; OOS(k), by out-of-order compaction; OOS(1); strongly linearizable schedules; non-dawdling schedules; schedules generatable by topological sorting; compiler-generatable schedules; weakly linearizable schedules; time-optimal schedules; scanned by OPTIMIST dynamic programming; scanned by extended OPTIMIST; scanned by integer linear programming; scanned by the branch-and-bound method of [22].]

Figure 10. Hierarchy of classes of VLIW schedules for a basic block.

This time limit is necessary because the number of weakly linearizable schedules is unbounded, and also because the time to generate the schedules grows exponentially. Each enumerated schedule is examined for greediness etc. We ran the measurements for four example architectures.

• C62x_single. As base-line we use a single-cluster version of the TI C62x DSP, i.e. a 4-issue VLIW architecture with four fully pipelined units and simple reservation tables only.

• C62x_single_mb. This is a variant of the base-line architecture extended by an additional resource and multiblock reservation tables for addition and multiplication.

• C62x_single_nmb. This is a variant of the base-line architecture extended by an additional resource modeling a data bus shared by load and store instructions, with non-multiblock reservation tables for load instructions.

• risc. This is a variant of the base-line architecture enforcing single-issue execution by mappingall instructions to the same issue unit, here L1.
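To make the enumeration procedure described above concrete, the following minimal sketch shows such a bounded backtracking enumeration under the same strongly simplified machine model as in the earlier sketch (one slot per functional unit and cycle, plain latencies, no reservation tables); it merely counts the distinct schedules per execution time and omits the classification into greedy, strongly linearizable, and non-dawdling schedules that the actual solver performs. All identifiers are illustrative.

def enumerate_schedules(instrs, units_per_cycle, latency, preds, tau_max):
    """Count distinct issue-cycle assignments whose makespan is at most tau_max.

    Brute-force backtracking: instructions are placed in topological order,
    trying every dependence- and resource-feasible issue cycle; schedules
    reachable via several placement orders are deduplicated with a set.
    """
    found = {}                            # makespan -> set of schedules
    def place(remaining, issue, used):
        if not remaining:
            makespan = max(issue[n] + latency[n] for n in issue)
            found.setdefault(makespan, set()).add(frozenset(issue.items()))
            return
        for n in remaining:
            if any(p not in issue for p in preds[n]):
                continue                  # keep the placement order topological
            earliest = max((issue[p] + latency[p] for p in preds[n]), default=0)
            unit = instrs[n]
            for t in range(earliest, tau_max - latency[n] + 1):
                if used.get((t, unit), 0) >= units_per_cycle[unit]:
                    continue              # no free slot on this unit in cycle t
                issue[n] = t
                used[(t, unit)] = used.get((t, unit), 0) + 1
                place(remaining - {n}, issue, used)
                used[(t, unit)] -= 1
                del issue[n]
    place(frozenset(instrs), {}, {})
    return {m: len(s) for m, s in sorted(found.items())}

# Tiny example: two loads competing for one LD slot per cycle, then an add.
instrs  = {"a": "LD", "b": "LD", "c": "ADD"}
preds   = {"a": (), "b": (), "c": ("a", "b")}
latency = {"a": 2, "b": 2, "c": 1}
print(enumerate_schedules(instrs, {"LD": 1, "ADD": 1}, latency, preds, tau_max=5))
# -> {4: 2, 5: 6}: two schedules reach the optimal makespan of four cycles

Even in this toy setting the number of enumerated schedules grows rapidly with tau_max, which is the reason for the time limit.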

Figure 11 shows the results. We can observe the following.

• Generally, the number of greedy and strongly linearizable schedules is significantly smaller than the number of non-dawdling schedules, which in turn is one to two orders of magnitude below that of weakly linearizable schedules, all within the considered time interval.

• For all examples and architectures investigated, there are always greedy schedules among the optimal ones, even with C62x single nmb, for which the theory does not guarantee the existence of a greedy optimal schedule.

• For basic blocks with deep dependence chains and a rather low potential of instruction-level parallelism, such as bmmse_4, bmmse_10, and bmmse_11 (the first three rows in Figure 11), the number of optimal and near-optimal strongly linearizable schedules is only slightly larger than the corresponding number of greedy schedules. In such cases, OPTIMIST does not lose much by scanning strongly linearizable instead of greedy schedules. In basic blocks with a rather flat dependence structure (e.g. those in Figure 12), the number of optimal and near-optimal strongly linearizable schedules can exceed that of corresponding greedy schedules by one or two orders of magnitude.


Figure 11. Numbers of schedules, ordered by execution time, for six basic blocks of the TI-C62x benchmark suite and four example architectures. In each diagram, the number of greedy schedules is displayed in black, strongly linearizable but non-greedy schedules in dark gray, non-dawdling but non-strongly linearizable schedules in light gray, and the remaining weakly linearizable schedules are shown in white. The leftmost column in each diagram represents optimal schedules. Note the logarithmic scale of the vertical axes. (Panels: bmmse_4, bmmse_10, bmmse_11, codebk_srch_12, codebk_srch_20, and fir_vselp_6, each for the C62x_single, C62x_single_mb, C62x_single_nmb, and risc architectures.)


Figure 12. Numbers of schedules, continued from Figure 11, for more example basic blocks. For the larger example in the last row we used OPTIMIST to count all greedy and strongly linearizable schedules. (Panels: codebk_srch_16 and irr_4 for the C62x_single, C62x_single_mb, C62x_single_nmb, and risc architectures, and fourinarow_6 for the c62x_single, c62x_mb, c62x_nmb, and risc variants, where only greedy and strongly linearizable schedules are counted.)

4. CONCLUSION AND FUTURE WORK

We have characterized and analyzed classes of schedules that are of interest in optimal local scheduling for VLIW and superscalar processors.

We have shown that, for finding a time-optimal schedule on a processor with multiblock reservation tables, it is safe to only consider greedy schedules (or any enclosing superclass). Actually, the OPTIMIST optimizer scans the somewhat larger class of strongly linearizable schedules, mainly for technical reasons. For non-multiblock processors or for finding an energy-optimal schedule, we can limit our search space to non-dawdling schedules.

We can also conclude that scheduling methods based on ILP are, under certain circumstances, superior to methods based on topological sorting of the dependence graph. In particular, we pointed out the limitation of topological sorting in the case of negative delays.

Scheduling depends on other decisions made earlier or simultaneously in the code generation process. Alternative choices in instruction selection, partitioning, or register allocation may have considerable influence on scheduling. In [26] and [38] we described integrated code generation methods that solve these problems simultaneously. However, instruction selection will be limited by the amount of formally specified semantic equivalences that were supplied by the compiler designer.

Future work may address an extension of this analysis and classification to global and cyclic schedules.


ACKNOWLEDGEMENTS

We thank Alain Darte for bringing the question of the dominance of greedy schedules to our attention. This research was partially funded by: Linköping University, project Ceniit 01.06; VR (Vetenskapsrådet); SSF, project RISE; and the CUGS Graduate School.

REFERENCES

1. Aho AV, Johnson SC, Ullman JD. Code generation for expressions with common subexpressions. Journal of the ACM 1977; 24(1).

2. Bernstein D, Rodeh M, Gertner I. On the complexity of scheduling problems for parallel/pipelined machines. IEEE Transactions on Computers 1989; 38(9):1308–1314.

3. Garey MR, Johnson DS. Complexity results for multiprocessor scheduling under resource constraints. SIAM Journal on Computing 1975; 4(4):397–411.

4. Hennessy J, Gross T. Postpass code optimization of pipeline constraints. ACM Transactions on Programming Language Systems 1983; 5(3):422–448.

5. Motwani R, Palem KV, Sarkar V, Reyen S. Combining register allocation and instruction scheduling (technical summary). Technical Report TR 698, Courant Institute of Mathematical Sciences, New York, July 1995.

6. Palem KV, Simons BB. Scheduling time-critical instructions on RISC machines. ACM Transactions on Programming Languages and Systems 1993; 15(4):632–658.

7. Barthou D, Gasperoni F, Schwiegelshohn U. Allocating communication channels to parallel tasks. Environments and Tools for Parallel Scientific Computing (Advances in Parallel Computing, vol. 6). Elsevier: Amsterdam, 1993; 275–291.

8. Bernstein D, Gertner I. Scheduling expressions on a pipelined processor with a maximal delay on one cycle. ACM Transactions on Programming Language Systems 1989; 11(1):57–67.

9. Bernstein D, Jaffe JM, Pinter RY, Rodeh M. Optimal scheduling of arithmetic operations in parallel with memory access. Technical Report 88.136, IBM Israel Scientific Center, Technion City, Haifa (Israel), 1985.

10. Eisenbeis C, Gasperoni F, Schwiegelshohn U. Allocating registers in multiple instruction-issuing processors. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT'95). IEEE Computer Society Press: Los Alamitos, CA, 1995. (Extended version available as Technical Report No. 2628, INRIA Roquencourt, France, July 1995).

11. Hu TC. Parallel sequencing and assembly line problems. Operations Research 1961; 9(11):841–848.

12. Kurlander SM, Proebsting TA, Fisher CN. Efficient instruction scheduling for delayed-load architectures. ACM Transactions on Programming Languages Systems 1995; 17(5):740–776.

13. Meleis WM, Davidson ED. Dual-issue scheduling with spills for binary trees. Proceedings of the 10th Annual ACM–SIAM Symposium on Discrete Algorithms. SIAM: Philadelphia, PA, 1999; 678–686.

14. Proebsting TA, Fischer CN. Linear-time, optimal code scheduling for delayed-load architectures. Proceedings of the ACM SIGPLAN Conference Programming Language Design and Implementation, June 1991. ACM Press: New York, 1991; 256–267.

15. Coffman EG Jr (ed.). Computer and Job/Shop Scheduling Theory. Wiley: New York, 1976.

16. Graham RL. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics 1969; 17(2):416–429.

17. Gebotys CH, Elmasry MI. Optimal VLSI Architectural Synthesis. Kluwer: Dordrecht, 1992.

18. Kästner D. Retargetable postpass optimisations by integer linear programming. PhD Thesis, Universität des Saarlandes, Saarbrücken, Germany, 2000.

19. Malik AM, McInnes J, van Beek P. Optimal basic block instruction scheduling for multiple-issue processors using constraint programming. Technical Report CS-2005-19, School of Computer Science, University of Waterloo, Canada, 2005.

20. Wilken K, Liu J, Heffernan M. Optimal instruction scheduling using integer programming. Proceedings of the ACM SIGPLAN Conference Programming Language Design and Implementation. ACM Press: New York, 2000; 121–133.

21. Zhang L. SILP. Scheduling and allocating with integer linear programming. PhD Thesis, Technische Fakultät der Universität des Saarlandes, Saarbrücken, Germany, 1996.

22. Chou H-C, Chung C-P. An optimal instruction scheduler for superscalar processors. IEEE Transactions on Parallel and Distributed Systems 1995; 6(3):303–313.

23. Hanono S, Devadas S. Instruction selection, resource allocation, and scheduling in the AVIV retargetable code generator. Proceedings of the Design Automation Conference. ACM Press: New York, 1998.


24. Yang C-I, Wang J-S, Lee RCT. A branch-and-bound algorithm to solve the equal-execution time job scheduling problem with precedence constraints and profile. Computers Operations Research 1989; 16(3):257–269.

25. Bashford S, Leupers R. Phase-coupled mapping of data flow graphs to irregular data paths. Design Automation for Embedded Systems 1999; 4(2/3):1–50.

26. Kessler C, Bednarski A. Optimal integrated code generation for VLIW architectures. Concurrency and Computation: Practice and Experience 2006; 18(11):1353–1390. DOI: 10.1002/cpe.1012.

27. Vegdahl SR. A dynamic-programming technique for compacting loops. Proceedings of the 25th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society Press: Los Alamitos, CA, 1992; 180–188.

28. Fisher JA, Faraboschi P, Young C. Embedded Computing: A VLIW Approach to Architecture, Compilers, and Tools. Morgan Kaufmann/Elsevier, 2005.

29. Rau BR, Fisher J. Instruction-level parallel processing: History, overview, and perspective. Journal of Supercomputing 1993; 7(1–2):9–50.

30. Srikant YN, Shankar P (eds.). The Compiler Design Handbook: Optimizations and Machine Code Generation. CRC Press: Boca Raton, FL, 2003.

31. Keßler CW, Rauber T. Generating optimal contiguous evaluations for expression DAGs. Computer Languages 1996; 21(2):113–127.

32. Keßler CW. Scheduling expression DAGs for minimal register need. Computer Languages 1998; 24(1):33–53.

33. Kessler C, Bednarski A. Energy-optimal integrated VLIW code generation. Proceedings of the 11th Workshop on Compilers for Parallel Computers, July 2004, Gerndt M, Kereku E (eds.). Shaker-Verlag: Aachen, 2004; 227–238.

34. Davidson ES, Shar LE, Thomas AT, Patel JH. Effective control for pipelined computers. Proceedings of the Spring COMPCON75 Digest of Papers. IEEE Computer Society Press: Los Alamitos, CA, 1975; 181–184.

35. Rau BR. Iterative modulo scheduling: An algorithm for software pipelining loops. Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-27). ACM Press: New York, 1994; 63–74.

36. Rau BR, Kathail V, Aditya S. Machine-description driven compilers for EPIC and VLIW processors. Design Automation for Embedded Systems 1999; 4:71–118. Appeared also as Technical Report HPL-98-40, HP Labs, September 1998.

37. Eichenberger AE, Davidson ES. A reduced multipipeline machine description that preserves scheduling constraints. Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'96). ACM Press: New York, 1996.

38. Bednarski A, Kessler C. Optimal integrated VLIW code generation with integer linear programming. Proceedings of the International Euro-Par 2006 Conference (Lecture Notes in Computer Science, vol. 4128). Springer: Berlin, 2006.

39. Bernstein D, Jaffe JM, Rodeh M. Scheduling arithmetic and load operations in parallel with no spilling. SIAM Journal of Computing 1989; 18:1098–1127.

40. Kessler C, Bednarski A. A dynamic programming approach to optimal integrated code generation. Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems (LCTES'2001). ACM Press: New York, 2001.

41. Ertl MA, Krall A. Optimal instruction scheduling using constraint logic programming. Proceedings of the 3rd International Symposium Programming Language Implementation and Logic Programming (PLILP) (Lecture Notes in Computer Science, vol. 528). Springer: Berlin, 1991; 75–86.

42. Lundqvist T, Stenström P. Timing anomalies in dynamically scheduled microprocessors. Proceedings of the 20th IEEE Real-Time Systems Symposium (RTSS'99). IEEE Computer Society Press: Washington, DC, 1999.

43. Wenzel I, Kirner R, Puschner P, Rieder B. Principles of timing anomalies in superscalar processors. Proceedings of the 5th International Conference on Quality Software (QSIC 2005), September 2005. IEEE Computer Society Press: Los Alamitos, CA, 2005; 295–303.

44. Eisinger J, Polian I, Becker B, Metzner A, Thesing S, Wilhelm R. Automatic identification of timing anomalies for cycle-accurate worst-case execution time analysis. Proceedings of the 9th IEEE Workshop on Design and Diagnostics of Electronic Circuits and Systems (DDECS'06). IEEE Computer Society Press: Washington, DC, 2006.

45. Hennessy J, Patterson D. Computer Architecture: A Quantitative Approach (3rd edn). Morgan Kaufmann: San Francisco, CA, 2003.

46. Silvera R, Wang J, Gao GR, Govindarajan R. A register pressure sensitive instruction scheduler for dynamic issue processors. Proceedings of the PACT'97 International Conference Parallel Architectures and Compilation Techniques. IEEE Computer Society Press: Los Alamitos, CA, 1997.
