
Instruction Scheduling on VLIW Architectures

Spring 2011

4541.775 Topics on Compilers

Instruction Scheduling

● Limited ILP

● Trace Scheduling

● Superblock Scheduling

● Hyperblock Scheduling

● Modulo Scheduling

● Insufficient ILP

● “normal” code does not contain enough ILP

● ILP within basic blocks is limited for control-intensive programs

– the problem is accentuated by longer latencies

unsigned int abs_sum = 0;
for (int i = 0; i < N; i++) {
    int abs = (A[i] >= 0 ? A[i] : -A[i]);
    abs_sum += abs;
}

       mov r0 ← #0
       mov r1 ← #0
       mov r2 ← N shl #2
       mov r3 ← @A
.loop  ld  r4 ← mem[r3 + r1]
       bge r4, #0, .skip
       not r4 ← r4
       add r4 ← r4, #1
.skip  add r0 ← r0, r4
       add r1 ← r1, #4
       blt r1, r2, .loop


– example: bge r4, … has to wait for ld r4 ← … (ld latency: 4 cycles)
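The effect of the load latency on a single basic block can be estimated with a small sketch. It assumes a toy representation that is not part of the slides: the `Instr` struct, the `block_length` function, at most one dependence per instruction, and a 1-cycle branch latency are all assumptions made for illustration.

```c
#include <stddef.h>

/* One instruction in a toy dependence chain (hypothetical representation). */
typedef struct {
    int latency; /* cycles until the result is available */
    int dep;     /* index of the instruction whose result it needs, or -1 */
} Instr;

/* Earliest cycle by which every instruction of the block has completed,
 * assuming unlimited issue width. Instructions must be listed in program
 * order (dep < own index). Supports at most 16 instructions, else -1. */
int block_length(const Instr *code, size_t n) {
    int finish[16] = {0};
    int longest = 0;
    if (n > 16) return -1;
    for (size_t i = 0; i < n; i++) {
        /* An instruction can start once its producer has finished. */
        int start = (code[i].dep >= 0) ? finish[code[i].dep] : 0;
        finish[i] = start + code[i].latency;
        if (finish[i] > longest) longest = finish[i];
    }
    return longest;
}
```

For the block consisting of `ld` (4 cycles) followed by the dependent `bge` (assumed 1 cycle), the function reports 5 cycles for two instructions, regardless of how many issue slots the machine has.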

● ILP within basic blocks is limited for control-intensive programs.

→ optimizations across basic blocks are needed

– trace scheduling (J. Fisher, 1981)

– superblock scheduling (P. Chang, 1991)

– hyperblock scheduling (S. Mahlke, 1992)

● Trace Scheduling: J. A. Fisher, Trace Scheduling: A Technique for Global Microcode Compaction (IEEE Transactions on Computers, vol. 30, no. 7, 1981)

● basic idea: schedule the most frequently executed trace of basic blocks as one unit

● requires compensation code if the program takes a different path than expected

[figure: code motion of add r4 ← r0, r1 across a branch with edge probabilities 0.9 and 0.1; a copy of the instruction serves as compensation code]

● A trace consists of a sequence of instructions

– including branches

– but not including loops

● example: assume B1, B3, B4, B5, B7 is the most frequently executed path

– add compensation code if necessary
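Picking the most frequently executed path can be sketched as a greedy walk over profiled edge probabilities. This is a simplified, forward-only illustration and not Fisher's full algorithm. The `Block` layout, the `select_trace` function, and the CFG and probabilities used in the usage note are hypothetical, since the figure with the real CFG is not part of the text.

```c
#define MAX_SUCC 2
#define MAX_BLOCKS 32

/* A basic block in a toy CFG (hypothetical representation). */
typedef struct {
    int succ[MAX_SUCC];    /* successor block ids, -1 if unused */
    double prob[MAX_SUCC]; /* profiled probability of taking each edge */
} Block;

/* Grow a trace from `seed` by always following the most probable
 * successor. The walk stops at a block without successors or when the
 * next block is already in the trace, so the trace never contains a loop.
 * Returns the trace length; block ids are written to `trace`. */
int select_trace(const Block *cfg, int nblocks, int seed, int *trace) {
    int in_trace[MAX_BLOCKS] = {0};
    int len = 0;
    int cur = seed;
    if (nblocks > MAX_BLOCKS) return 0;
    while (cur >= 0 && cur < nblocks && !in_trace[cur]) {
        int best = -1;
        double best_p = 0.0;
        trace[len++] = cur;
        in_trace[cur] = 1;
        for (int k = 0; k < MAX_SUCC; k++) {
            if (cfg[cur].succ[k] >= 0 && cfg[cur].prob[k] > best_p) {
                best = cfg[cur].succ[k];
                best_p = cfg[cur].prob[k];
            }
        }
        cur = best;
    }
    return len;
}
```

With an assumed CFG in which B1 branches to B3 with probability 0.9 and B4 branches to B5 with probability 0.8, starting at B1 yields the trace B1, B3, B4, B5, B7.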

● Compensation Code

– moving an instruction below a side exit
  [figure: trace instr 1 to instr 6, with a copy of instr 1 as compensation code]

– moving an instruction above a side exit (speculative execution)
  [figure: compensation code "undo instr 5"]

– moving an instruction below a side entrance

– moving an instruction above a side entrance
  [figure: compensation code with instr 5 and instr 4]
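The first case, moving an instruction below a side exit, can be illustrated at source level. The two functions are a hypothetical example constructed for this purpose (`trace_original` and `trace_scheduled` are assumed names); they only show why the moved instruction needs a copy on the exit path.

```c
/* Original trace: instr 1 is executed before the side exit. */
int trace_original(int a, int b, int c, int d) {
    int t = a + b;      /* instr 1 */
    if (c) return t;    /* side exit: t is live on the exit path */
    int u = d * 2;      /* instr 2 */
    return t + u;
}

/* After scheduling: instr 1 has been moved below the side exit.
 * The exit path still needs its result, so a compensation copy of
 * instr 1 is placed on that path. */
int trace_scheduled(int a, int b, int c, int d) {
    if (c) {
        int t = a + b;  /* compensation code */
        return t;
    }
    int u = d * 2;      /* instr 2 */
    int t = a + b;      /* instr 1, moved below the side exit */
    return t + u;
}
```

Both functions return the same value for every input; the scheduled version simply pays for the copy only when the side exit is taken.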

● Superblock Scheduling: Wen-mei Hwu et al., The Superblock: An Effective Technique for VLIW and Superscalar Compilation (The Journal of Supercomputing, vol. 7, issue 1-2, 1993)

● tries to overcome some difficulties with trace scheduling

– complicated bookkeeping when moving instructions above/below a side entrance/exit

– some compiler optimizations require additional bookkeeping when side entrances are present

example: copy propagation

● a superblock is a trace with no side entrances → control may only enter from the top, but may leave at one or more exit points

● similar to extended basic blocks (Aho et al., 1986)

● superblock formation:

1. identify trace using profile information

2. apply tail duplication until all side entrances have been eliminated

● tail duplication

1. copy the tail portion of the trace from the first side entrance to the end

2. move all side entrances to the corresponding duplicated basic blocks
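The two tail-duplication steps can be sketched on a toy control-flow graph. This is a minimal illustration under stated assumptions, not the IMPACT implementation: the `Cfg` array layout and the functions `first_side_entrance` and `tail_duplicate` are names invented here, and blocks carry no instructions.

```c
#define MAX_NODES 32
#define MAX_SUCC 2

/* Toy CFG (hypothetical layout): succ[b][k] is a successor of block b,
 * or -1 if unused. */
typedef struct {
    int succ[MAX_NODES][MAX_SUCC];
    int n; /* number of blocks in use */
} Cfg;

/* Index within the trace of the first block that has a predecessor other
 * than the previous trace block, or -1 if there is no side entrance. */
int first_side_entrance(const Cfg *g, const int *trace, int len) {
    for (int i = 1; i < len; i++)
        for (int p = 0; p < g->n; p++) {
            if (p == trace[i - 1]) continue; /* on-trace predecessor */
            for (int k = 0; k < MAX_SUCC; k++)
                if (g->succ[p][k] == trace[i]) return i;
        }
    return -1;
}

/* Returns the number of duplicated blocks, or -1 if the CFG is full. */
int tail_duplicate(Cfg *g, const int *trace, int len) {
    int copy_of[MAX_NODES];
    int first = first_side_entrance(g, trace, len);
    int old_n = g->n;
    if (first < 0) return 0;
    if (g->n + (len - first) > MAX_NODES) return -1;
    for (int i = 0; i < MAX_NODES; i++) copy_of[i] = -1;

    /* Step 1: copy the tail from the first side entrance to the end. */
    for (int i = first; i < len; i++) copy_of[trace[i]] = g->n++;
    for (int i = first; i < len; i++)
        for (int k = 0; k < MAX_SUCC; k++) {
            int s = g->succ[trace[i]][k];
            /* Edges into the tail stay inside the duplicated tail. */
            g->succ[copy_of[trace[i]]][k] =
                (s >= 0 && copy_of[s] >= 0) ? copy_of[s] : s;
        }

    /* Step 2: move all side entrances to the duplicated blocks. */
    for (int i = first; i < len; i++)
        for (int p = 0; p < old_n; p++) {
            if (p == trace[i - 1]) continue;
            for (int k = 0; k < MAX_SUCC; k++)
                if (g->succ[p][k] == trace[i])
                    g->succ[p][k] = copy_of[trace[i]];
        }
    return len - first;
}
```

For a diamond-shaped CFG A → {B, C} → D → E with trace A, B, D, E, the edge C → D is the side entrance. The sketch creates copies D' and E', redirects C to D', and leaves the trace enterable only at A.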

● example: superblock formation

● superblock ILP optimizations

optimizations that are performed after superblock formation (and before scheduling) with the goal of enlarging the superblock and increasing ILP by removing dependences.

● superblock enlarging optimizations

– branch target expansion
  ● expand the target of the likely taken control transfer that ends a superblock
  ● not applied to back edges
  ● stops when a predefined superblock size is reached or the branch does not favor one direction

● superblock enlarging optimizations (cont’d)

– loop peeling
  ● applied to superblock loops (superblocks that end with a likely taken control transfer to themselves) that tend to iterate only a few (k) times
  ● peel the first k iterations and insert control flow to branch to the original loop body if the loop iterates more than k times
  ● after loop peeling, the superblock may be extended both at the head and the tail of the superblock loop

– loop unrolling
  ● unroll the body of a superblock loop that tends to iterate many times

● superblock dependence removing optimizations: remove data dependences between instructions in a superblock

– register renaming, e.g., in unrolled loop bodies

– operation migration
  ● move instructions whose result is not used within a superblock to a less frequently executed superblock
  ● decision based on a cost function

– induction variable expansion
  ● create a separate copy of the loop induction variable for each unrolled loop body
  ● requires additional patch code at the loop preheader and at exits

● superblock dependence removing optimizations (cont’d)

– accumulator variable expansion
  ● use a separate accumulator for each unrolled instance of loops accumulating a sum or product in every iteration
  ● additional patch code at the loop preheader needed
  ● additional patch code at the loop exits needed (summing up the individual accumulators)

– operation combining
  ● for certain classes of instructions, true dependences can be eliminated by precomputing new immediate values at compile time
  ● example:

      add x ← x, #4
      add x ← x, #4

      add x ← x, #4
      add x’ ← x, #8
      …
      …
      mov x ← x’

● example: superblock dependence removing optimizations

accumulator variable expansion

induction variable expansion
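Both expansions can be shown on a plain summation loop unrolled twice. This is a source-level sketch under stated assumptions: the function names `sum_simple` and `sum_expanded` and the unroll factor of 2 are chosen here for illustration, and a real compiler applies the transformation to the scheduled instructions rather than to C source.

```c
#include <stddef.h>

/* Original loop: every add depends on the previous one through `sum`,
 * and every address depends on the previous update of `i`. */
long sum_simple(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled twice with accumulator variable expansion (sum0, sum1) and
 * induction variable expansion (i0, i1). The two adds in the loop body
 * no longer depend on each other. */
long sum_expanded(const int *a, size_t n) {
    long sum0 = 0, sum1 = 0;   /* patch code in the loop preheader */
    size_t i0 = 0, i1 = 1;     /* one induction variable per unrolled body */
    for (; i1 < n; i0 += 2, i1 += 2) {
        sum0 += a[i0];
        sum1 += a[i1];
    }
    if (i0 < n)                /* leftover iteration when n is odd */
        sum0 += a[i0];
    return sum0 + sum1;        /* patch code at the exit: combine accumulators */
}
```

Integer addition is associative, so both versions return the same result. For floating-point accumulators the reordering can change rounding, which is one reason such an expansion may need to be enabled explicitly.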

● speculative execution

– occurs when moving an instruction up above a control transfer instruction B

– the instruction is executed in any case, even if the control transfer instruction would branch out of the superblock (i.e., speculative instructions)

– restrictions for an instruction I to be executed speculatively

1. the destination of I is not used before it is redefined when B is taken

2. I will never cause an exception that may terminate the program when B is taken

– instructions that may cause exceptions
  ● memory load
  ● memory store
  ● integer divide
  ● floating point operations

● speculative execution (cont’d)

– exception models
  ● restricted percolation model: no support for disregarding exceptions generated by speculatively executed instructions
    – limits performance in superblocks that contain many long-latency, potentially trap-causing instructions (e.g., memory loads) above branches
  ● general percolation model: the architecture provides non-trapping versions of instructions that may cause exceptions
    – convert speculatively executed, potentially trapping instructions to their non-trapping counterparts
    – if the exception must still be detected, additional architecture and compiler support is required
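The general percolation model can be mimicked in C to show what the hoisting looks like. This is only a model: `load_nt` stands in for a hypothetical non-trapping load instruction (real hardware would suppress the trap itself rather than test the pointer), and the function names are assumptions made for this sketch.

```c
#include <stddef.h>

/* Original code: the load (instruction I) is guarded by branch B. */
int guarded(const int *p, int fallback) {
    if (p == NULL) return fallback; /* branch B leaves the superblock */
    int v = *p;                     /* instruction I, may trap */
    return v + 1;
}

/* Model of a non-trapping load: yields a dummy value instead of
 * raising an exception on an invalid address (hypothetical helper). */
static int load_nt(const int *p) {
    return (p != NULL) ? *p : 0;
}

/* Instruction I has been moved above B and converted to its
 * non-trapping counterpart. */
int speculated(const int *p, int fallback) {
    int v = load_nt(p);             /* executed speculatively in any case */
    if (p == NULL) return fallback; /* v is not used when B is taken */
    return v + 1;
}
```

Both restrictions from the slide hold here: the destination `v` is never read on the taken path, and the non-trapping load cannot terminate the program. A silently swallowed exception is exactly the case that needs the extra architecture and compiler support if it has to be reported.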

● Analysis

– implementation complexity in the IMPACT-I C compiler (total size: ~92K lines)

– compilation time (IMPACT-I)

– performance improvement due to superblock ILP optimization

– effect of speculative execution support

– code size increase

● Hyperblock Scheduling: Scott Mahlke et al., Effective Compiler Support for Predicated Execution Using the Hyperblock (MICRO-25, 1992)

● tries to overcome some difficulties with superblock scheduling

– superblocks end when both targets of a control flow instruction have a similar probability to be taken

● hyperblock scheduling

– combine basic blocks from multiple control paths (using if-conversion)

– for programs without heavily biased branches, hyperblocks provide a more flexible framework

● Predicated execution

– When the predicate is TRUE the instruction is executed normally

– When the predicate is FALSE the instruction is treated as a NOP

● Conditional branches can be eliminated with predicated execution (if-conversion)
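If-conversion can be illustrated on the abs_sum loop from the Insufficient ILP example. The sketch models a predicated instruction with a C conditional expression; the function names and the predicate variable `p` are assumptions, and whether a compiler emits predicated or conditional-move instructions for it depends on the target.

```c
/* Branching version: the loop body is split into several basic blocks. */
unsigned int abs_sum_branchy(const int *A, int N) {
    unsigned int abs_sum = 0;
    for (int i = 0; i < N; i++) {
        int v = A[i];
        if (v < 0)          /* conditional branch around the negation */
            v = -v;
        abs_sum += (unsigned int)v;
    }
    return abs_sum;
}

/* If-converted version: the branch is replaced by a predicate define,
 * and the negation is guarded by that predicate. */
unsigned int abs_sum_predicated(const int *A, int N) {
    unsigned int abs_sum = 0;
    for (int i = 0; i < N; i++) {
        int v = A[i];
        int p = (v < 0);    /* predicate define: p = (v < 0) */
        v = p ? -v : v;     /* models "(p) neg v": a NOP when p is FALSE */
        abs_sum += (unsigned int)v;
    }
    return abs_sum;
}
```

After if-conversion the loop body is a single block without the inner branch, so the scheduler is free to overlap the negation with the address update and the accumulation.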

● The Hyperblock

– set of predicated basic blocks in which control may only enter at the top, but several exits may exist.

– very similar to superblock formation

● Building Hyperblocks

1. hyperblock block selection
  ● decide which basic blocks in a region should be included in the hyperblock
  ● three features of each block are examined
    – execution frequency
    – block size
    – instruction characteristics
  ● use heuristic functions

● Building Hyperblocks (cont’d)

2. hyperblock formation
  ● tail duplication
  ● loop peeling
  ● node splitting
    – eliminate dependences created by control path merges
    – duplicate all blocks subsequent to the merge point for each path

● If-conversion

● Building Hyperblocks (cont’d)

● Control Flow Information

– instructions within a hyperblock are not sequential → a more complex analysis is required

● Predicate Hierarchy Graph (PHG)

– determine if two instructions can ever be executed in a single path

– if they can, then there is a control flow path between these two instructions

● Predicate Hierarchy Graph (PHG) example

– same path: AND
  p4 = c1 ∙ c2

– multiple paths meet: OR
  p5 = ~c1 + c1 ∙ ~c2

– ANDing p4 and p5:
  p4 ∙ p5 = (c1 ∙ c2) ∙ (~c1 + c1 ∙ ~c2) = 0

→ there is no viable path between p4 and p5
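The PHG query for this example can be checked by brute force over the two conditions. The sketch below is not how a compiler would represent the graph; `Predicate`, `p4`, `p5`, and `may_coexecute` are names assumed here, and the exhaustive loop only works because there are just two conditions.

```c
#include <stdbool.h>

/* A predicate expressed as a function of the branch conditions c1, c2. */
typedef bool (*Predicate)(bool c1, bool c2);

/* p4 = c1 AND c2 */
bool p4(bool c1, bool c2) { return c1 && c2; }

/* p5 = (NOT c1) OR (c1 AND NOT c2) */
bool p5(bool c1, bool c2) { return !c1 || (c1 && !c2); }

/* True if some assignment of c1 and c2 makes both predicates TRUE,
 * i.e. instructions guarded by a and b can execute on the same path. */
bool may_coexecute(Predicate a, Predicate b) {
    for (int m = 0; m < 4; m++) {
        bool c1 = (m & 1) != 0;
        bool c2 = (m & 2) != 0;
        if (a(c1, c2) && b(c1, c2)) return true;
    }
    return false;
}
```

`may_coexecute(p4, p5)` is false for all four assignments, matching p4 ∙ p5 = 0: an instruction guarded by p4 and one guarded by p5 never execute together, so no control-flow dependence between them has to be honored.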

● Hyperblock-Specific Optimizations

– similar to optimizations for superblocks

– instruction promotion
  ● removes the dependence between the predicated instruction and the instruction which sets the corresponding predicate value

– instruction merging
  ● combine two instructions in a hyperblock with complementary predicates into a single instruction

● Summary

● Trace Scheduling can increase ILP

– side entrances are too complex to handle

● Superblock Scheduling removes the side entrances from the trace

– weak point: unbiased branches

– for programs without heavily biased branches, hyperblocks provide a more flexible framework

● Modulo Scheduling → next class!
