CS8803: Compilers for Embedded Systems
Santosh Pande – Summer 2007
Chapter 8: Compiling for VLIWs and ILP
1
Outline
• 8.1 Profiling
• 8.2 Scheduling
– Acyclic Region Types and Shapes
– Region Formation
– Schedule Construction
– Resource Management During Scheduling
– Loop Scheduling
– Clustering
• 8.3 Register Allocation
• 8.4 Speculation and Predication
• 8.5 Instruction Selection
2
Overview
• This chapter…
– Focuses on optimizations, or code transformations
– These topics are common across all types of ILP processors, for both general-purpose and embedded applications
– Compilers and toolchains used for embedded processors are very similar to those for general-purpose computers
3
1. Profiling
• Profiles
– Statistics about how a program spends its time and resources
– Many ILP optimizations require good profile information
• Two types of profiles
– “Point profiles”
• Call graphs and CFGs
– “Path profiles”
4
Types of Profiles
• Call graph
– Nodes: procedures
– Edges: procedure calls
– Information
• How many times was each procedure called?
• How many times did each caller invoke a callee?
– Limitation:
• Cannot tell what to do within possibly beneficial procedures
5
Types of Profiles (cont.)
• Control Flow Graph (CFG)
– Nodes: basic blocks
• Basic block: a maximal sequence of instructions that always execute together
– Edges: indicate that one basic block can execute after another
– Information
• How many times was a particular basic block executed?
• How many times did control flow from one basic block to one of its immediate neighbors?
6
7
[Figure: a call graph and a control flow graph for the same program]
Types of Profiles (cont.)
• Path profiles
– Measure the number of times a path, or sequence of contiguous blocks in the CFG, is executed
– Optimizations using path profiles have appeared in research compilers, but not in production compilers
– Note that call graphs and CFGs are “point profiles”
8
Profile Collection
• Instrumentation
– Extra code is inserted into the program to gather data
– Can be done by the compiler or by a post-compilation tool
• e.g. Pin: dynamic instrumentation tool and API
– http://rogue.colorado.edu/pin/
– Hardware techniques
• Special registers record statistics about various events
• Statistical-sampling profilers
9
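Instrumentation-based point profiling can be sketched with a toy example: given the sequence of basic-block ids that an instrumented run records, block counts and edge counts fall out of simple counting. The CFG representation and block names below are hypothetical, not Pin's actual API.

```python
from collections import Counter

def profile_edges(trace):
    """Compute point-profile statistics from a recorded basic-block trace.

    `trace` is the sequence of basic-block ids visited at run time, as an
    instrumented binary might log it.  Returns block execution counts
    (node weights) and edge traversal counts (branch weights).
    """
    block_counts = Counter(trace)
    edge_counts = Counter(zip(trace, trace[1:]))
    return block_counts, edge_counts

# One run of a loop: B0 -> (B1 -> B2) three times -> B3
trace = ["B0", "B1", "B2", "B1", "B2", "B1", "B2", "B3"]
blocks, edges = profile_edges(trace)
print(blocks["B1"])          # 3
print(edges[("B2", "B1")])   # 2 (back-edge taken twice)
```

A real tool would insert a counter increment at each block entry instead of logging the whole trace, but the resulting statistics are the same.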
Synthetic Profiles (Heuristics in Lieu of Profiles)
• Synthetic profile
– Assigns weights to each part of the program based solely on the structure of the source program
– Pros
• No need to collect statistics from actual program runs
– Cons
• Cannot see how the program behaves with real data
– No synthetic-profile technique does as well as actual profiling
10
2. Scheduling
• Instruction scheduling
– Directly responsible for identifying and grouping operations that can be executed in parallel
• Taxonomy
– Cyclic: operates on loops in the program
– Acyclic: handles loop-free regions, not loops directly
– Current compilers include both kinds of schedulers
• Hardware support
– Widens the choices available to the scheduler
11
12
Acyclic Region Types and Shapes
• Shapes of Regions
– Basic blocks, traces, …
• Basic Blocks
– A “degenerate” form of region
– Maximal straight-line code fragments
13
Acyclic Region Types and Shapes (cont.)
• Traces: the first proposed region shape
– Linear paths through the code: multiple entrances and exits
– A trace consists of the operations from a list of basic blocks with the following properties
• Each basic block is a predecessor of the next on the list
– e.g. Bk falls through or branches to Bk+1
• For any i and k, there is no path Bi -> Bk -> Bi except for those that go through B0
– i.e. the code is cycle-free, though the entire region can be part of some encompassing loop
– Traces allow forward branches and so on: complex!
14
Acyclic Region Types and Shapes (cont.)
15
[Figure: a trace is a linear, multiple-entry, multiple-exit region; a side entrance is a branch into the middle of the trace]
Acyclic Region Types and Shapes (cont.)
• Superblocks
– Traces with an added restriction
• Single-entry, multiple-exit traces
– Same properties as traces, with one addition
• There may be no branches into a block in the region, except to B0. These outlawed branches are referred to in the superblock literature as side entrances
– Tail duplication: a region-enlarging technique
• Eliminates side entrances by adding compensation code
16
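Tail duplication can be sketched on a toy CFG: every block from the first side entrance onward is copied, so that the outside edges can be redirected to the copies and the original trace becomes single-entry. The region/predecessor representation below is a hypothetical simplification; a real compiler also rewrites branch targets and rescales profile weights.

```python
def tail_duplicate(region, preds):
    """Identify the tail of a candidate superblock that must be duplicated.

    `region` is the ordered block list [B0, B1, ...]; `preds` maps each
    block to its CFG predecessors.  Any block (other than B0) with a
    predecessor outside the region starts a tail; that block and every
    block after it get a copy, and side entrances are redirected there.
    """
    copies = {}
    duplicating = False
    for b in region[1:]:
        outside = [p for p in preds.get(b, []) if p not in region]
        if outside or duplicating:
            duplicating = True          # once a tail starts, copy the rest
            copies[b] = b + "'"
    return copies

region = ["B0", "B1", "B2", "B3"]
preds = {"B1": ["B0"], "B2": ["B1", "B5"], "B3": ["B2"]}  # B5: side entrance
print(tail_duplicate(region, preds))   # {'B2': "B2'", 'B3': "B3'"}
```

After duplication, B5 branches to B2' instead of B2, so the original B0-B1-B2-B3 sequence has no side entrances and qualifies as a superblock.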
Acyclic Region Types and Shapes (cont.)
17
Tail duplication to eliminate side
entrances
e.g. 70*0.8=56
Superblock
Acyclic Region Types and Shapes (cont.)
• Hyperblocks
– Single-entry, multiple-exit regions with internal control flow
– Variants of superblocks that employ predication to fold multiple control paths into a single superblock
– Remove some control-flow complexity
18
Acyclic Region Types and Shapes (cont.)
19
[Figure: a hyperblock formed by if-conversion of basic blocks B2 and B5]
Acyclic Region Types and Shapes (cont.)
• Treegions
– Regions containing a tree of basic blocks within the control flow of the program
– Properties
• Each basic block Bj except for B0 has exactly one predecessor
• That predecessor, Bi, is on the list, where i < j
– Any path through a treegion yields a superblock
• A trace with no side entrances
– Treegion-2: without the restriction on side entrances
20
Acyclic Region Types and Shapes (cont.)
21
[Figure: a CFG partitioned into three treegions (Treegion 1, 2, 3)]
Acyclic Region Types and Shapes (cont.)
• Percolation Scheduling
– Many code-motion rules applied to regions that resemble traces
– One of the earliest versions of DAG scheduling
• DAG scheduling: the most general form of acyclic scheduling
• Cyclic schedulers
– Limited region shapes
• A single innermost loop
• An inner loop that has very simple control flow
22
Acyclic Region Types and Shapes (cont.)
23
Region Formation
• So far, we have discussed region shapes
• Two questions remain
– Region Formation
• How does one divide a program into regions?
• Region formation is more than selecting good regions from the CFG; it also includes duplication (region enlargement)
– Schedule Construction
• How does one build schedules for them?
• Well-selected regions are critical for schedule construction
– Using profiles: how frequently is each region executed?
24
Region Formation (cont.)
• Region Selection
– Trace growing
• The most popular algorithm
– Uses the mutual-most-likely heuristic
– Steps
• A is the last block of the current trace
• Block B is A’s most likely successor, and vice versa
– A and B are “mutually most likely”
• Add B to the trace
• Repeat until there is no mutually-most-likely successor
25
Region Formation (cont.)
• Region Selection
– Shortcomings of using point profiles
• Cumulative effect of conditional probability
• Point profiles measure each branch probability independently
• The probability of remaining on the trace rapidly decreases
• Example:
– A trace that crosses ten splits, each with a 90% probability of staying on the trace, has only a 35% (≈ 0.9^10) probability of running from start to end
• Solutions:
– Building differently shaped regions
– Using predication to remove branches
26
Region Formation (cont.)
• Region Selection
– Hyperblock formation
• Based on mutual-most-likely trace formation
• Considers block size and execution frequency
• Predication can remove unpredictable branches
– Research on better statistics
• Using global, bounded-length path profiles to improve static branch prediction
27
Region Formation (cont.)
• Enlargement Techniques
– Region selection alone is not enough
– Enlargement is needed to increase ILP
• Code size increases, but the code schedules better
• Based on the fact that programs iterate (loop)
– Loop unrolling
• Performed before region selection, to make the larger unrolled code available to the region selector
• Induction-variable simplification etc. are performed to expose more parallelism across iterations
28
Region Formation (cont.)
29
• Simplified example of variants of loop unrolling
– For a while loop: the most general case
– For a for loop: counted loops
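The counted-loop variant can be illustrated at source level (a hypothetical example, not taken from the slides): the unrolled body does four iterations' worth of work per branch test, and a remainder loop handles trip counts not divisible by four.

```python
def sum_squares(n):
    """Original counted loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def sum_squares_unrolled(n):
    """The same loop unrolled by 4: fewer branch tests, and four
    independent multiplies per iteration for the scheduler to overlap."""
    total = 0
    i = 0
    while i + 4 <= n:
        total += i * i
        total += (i + 1) * (i + 1)
        total += (i + 2) * (i + 2)
        total += (i + 3) * (i + 3)
        i += 4
    while i < n:            # remainder loop for leftover iterations
        total += i * i
        i += 1
    return total

print(sum_squares(10) == sum_squares_unrolled(10))   # True
```

For a while loop (trip count unknown), the unrolled body must instead keep an exit test between the copies, which is why the counted case unrolls more cleanly.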
Region Formation (cont.)
• Induction-variable manipulations for loops
30
Region Formation (cont.)
• Enlargement Techniques
– Different approaches for superblocks
• Superblock loop unrolling
– Unrolls superblock loops (where the most likely exit from the superblock jumps back to its beginning)
• Superblock loop peeling
– Used when the profile suggests a small number of iterations for the superblock loop
– The expected number of iterations is copied
• Superblock target expansion
– Similar to the mutual-most-likely heuristic for growing traces
– If superblock A ends in a likely branch to B, then B is added
31
32
Superblock-enlarging optimizations
Target expansion Loop unrolling Loop peeling
Region Formation (cont.)
• Phase-ordering Considerations
– Which comes first?
• The Multiflow compiler performed enlargement before trace selection
• Superblock-based compilers chose and formed superblocks first
• Neither is clearly preferable
– Other transformations
• e.g. Dependence-height reduction should be run before region formation
33
Schedule Construction
• So far, we have discussed region formation
– Selecting and enlarging individual regions
• A Schedule
– A set of annotations that indicate the unit assignment and cycle time of the operations in a region
– Depends on the shape of the region
• Goal: minimize an objective function
– Estimated completion time, plus code size or energy efficiency (in embedded systems)
34
Schedule Construction (cont.)
• Analyzing Programs for Schedule Construction
– Dependences (data and control) prohibit reordering
• They impose a partial order on the pieces of code
• Represented as a DAG or its variants
– DDG (data dependence graph)
– PDG (program dependence graph)
• Creating a DDG or PDG is typically O(n^2)
– where n is the number of operations
35
36
[Figure: data dependence example, showing a true dependence and an output dependence]
37
[Figure: control dependence example and the corresponding control flow]
Schedule Construction (cont.)
• Compaction Techniques
– Cycle versus Operation Scheduling
• Two strategies to minimize an objective function
• 1) Operation scheduling
– Selects an operation in the region and places it in the “best” cycle that violates no dependences
• 2) Cycle scheduling
– Fills a cycle with operations from the region, proceeding to the next cycle only after exhausting the available operations
• Operation scheduling is theoretically more powerful because it can consider long-latency operations
38
Schedule Construction (cont.)
• Compaction Techniques
– Linear Techniques
• Algorithms using the DDG cost O(n^2)
• In practice, linear O(n) techniques are used in modern compilers
• Two techniques
• 1) As-soon-as-possible (ASAP) scheduling
– Places each op in the earliest possible cycle (top-down linear scan)
• 2) As-late-as-possible (ALAP) scheduling
– Places each op in the latest possible cycle (bottom-up linear scan)
• Example: critical-path scheduling uses ASAP followed by ALAP to identify the operations on the critical path
39
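The ASAP/ALAP combination above can be sketched on a toy DDG (op names, latencies, and dependences are hypothetical): ops whose ASAP and ALAP cycles coincide have no slack and lie on the critical path.

```python
def asap_alap(latency, deps):
    """Compute ASAP and ALAP start cycles for a small DDG.

    `deps` maps each op to its DDG predecessors; ops are assumed to be
    listed in topological order.  Ops with asap == alap are critical.
    """
    ops = list(latency)
    asap = {}
    for op in ops:                      # top-down linear scan
        asap[op] = max((asap[p] + latency[p] for p in deps.get(op, [])),
                       default=0)
    length = max(asap[op] + latency[op] for op in ops)
    alap = {}
    for op in reversed(ops):            # bottom-up linear scan
        succs = [s for s in ops if op in deps.get(s, [])]
        alap[op] = min((alap[s] - latency[op] for s in succs),
                       default=length - latency[op])
    critical = [op for op in ops if asap[op] == alap[op]]
    return asap, alap, critical

latency = {"load": 3, "add": 1, "mul": 2, "store": 1}
deps = {"add": ["load"], "mul": ["load"], "store": ["mul"]}
asap, alap, crit = asap_alap(latency, deps)
print(crit)   # ['load', 'mul', 'store']
```

Here `add` has slack (ASAP cycle 3, ALAP cycle 5), so it is the only op the scheduler may slide around freely.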
Schedule Construction (cont.)
• Compaction Techniques
– Graph-based Techniques (List Scheduling)
• Linear techniques cannot see the global properties of the DDG
• Repeatedly assigns a cycle to each operation without backtracking (a greedy algorithm): O(n log n)
• Steps
– Select an operation from the data-ready queue (DRQ)
– An op is ready when all of its DDG predecessors have been scheduled
– Once scheduled, the op is removed from the DRQ
• Performance depends on the order in which candidates are selected, i.e., on the scheduler’s greediness
40
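The DRQ-driven steps above can be sketched as a greedy cycle scheduler. This is a simplified model (uniform issue width, no functional-unit classes, longest-latency-first priority); the op names and latencies are hypothetical.

```python
import heapq

def list_schedule(latency, deps, width):
    """Greedy cycle scheduler: fill each cycle from the data-ready queue
    (DRQ), issuing at most `width` ops per cycle, longest latency first."""
    ops = list(latency)
    npreds = {op: len(deps.get(op, [])) for op in ops}
    succs = {op: [s for s in ops if op in deps.get(s, [])] for op in ops}
    ready_at = {op: 0 for op in ops}           # earliest data-ready cycle
    drq = [(-latency[op], op) for op in ops if npreds[op] == 0]
    heapq.heapify(drq)
    schedule, cycle, deferred = {}, 0, []
    while drq or deferred:
        issued = 0
        while drq and issued < width:
            prio, op = heapq.heappop(drq)
            if ready_at[op] > cycle:           # operands not ready yet
                deferred.append((prio, op))
                continue
            schedule[op] = cycle
            issued += 1
            for s in succs[op]:                # wake up DDG successors
                ready_at[s] = max(ready_at[s], cycle + latency[op])
                npreds[s] -= 1
                if npreds[s] == 0:
                    heapq.heappush(drq, (-latency[s], s))
        for item in deferred:                  # retry deferred ops next cycle
            heapq.heappush(drq, item)
        deferred = []
        cycle += 1
    return schedule

latency = {"load": 3, "add": 1, "mul": 2, "store": 1}
deps = {"add": ["load"], "mul": ["load"], "store": ["mul"]}
print(list_schedule(latency, deps, width=2))
# {'load': 0, 'mul': 3, 'add': 3, 'store': 5}
```

Changing the priority function (here, negative latency) changes the schedule quality, which is exactly the greediness sensitivity the slide mentions.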
Schedule Construction (cont.)
• Compensation Code
– Restores the correct flow of data and control
– Four basic scenarios
41
• (a) No Compensation
– The code motion does not change the relative order of operations w.r.t. joins and splits
– Also covers moving operations above a split point (making them speculative)
– Recall that compensation code for speculative code motions depends on the recovery model
Schedule Construction (cont.)
• Compensation Code
42
• (b) Join Compensation
– B moves above a join point A
– Drop a copy of B (B’) in the join path
• (c) Split Compensation
– A split op B (i.e., a branch) moves above a previous op A
– Produce a copy of A (A’) in the split path
Schedule Construction (cont.)
• Compensation Code
• Summary
– In general, make sure to preserve all paths from the original sequence in the transformed control flow after scheduling
43
• (d) Join-Split Compensation
– Splits moved above joins (shown in the figure)
– Splits moved above splits
[Figure: the duplicated Z-B-W path]
Resource Management During Scheduling
• Resource hazards
– Arise from dependences, operation latencies, and the available resources (i.e., functional units)
• Approaches
– Reservation tables: a simple and early method
– Finite-state automata
44
Resource Management During Scheduling (cont.)
• Resource Vectors
– Enable easy scheduling of instructions
– Rows: each cycle of the schedule
– Columns: each resource in the machine
– Recent work reduces their size
45
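The row-by-column table above can be sketched directly: the scheduler checks that an op's resource usage pattern fits before committing it. This toy model (two resources, boolean busy bits, hypothetical usage shapes) ignores pipelined resource sharing within a unit.

```python
def fits(table, start, usage):
    """Check whether an op's resource usage fits the reservation table.

    `table[cycle][resource]` is True when that resource is busy;
    `usage` lists (cycle_offset, resource) pairs the op would occupy,
    relative to issue cycle `start`.
    """
    return all(not table[start + off][res] for off, res in usage)

def reserve(table, start, usage):
    """Commit an op: mark its resource usage as busy."""
    for off, res in usage:
        table[start + off][res] = True

# Two resources (ALU = 0, MEM = 1), a 6-cycle scheduling window
table = [[False, False] for _ in range(6)]
load_usage = [(0, 1), (1, 1)]        # a load occupies MEM for 2 cycles
reserve(table, 0, load_usage)
print(fits(table, 1, load_usage))    # False: MEM still busy at cycle 1
print(fits(table, 2, load_usage))    # True
```

The FSA techniques on the next slides precompute exactly these legality checks so the scheduler does a table lookup instead of scanning the grid.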
[Figure: a reservation table with busy entries marked]
Resource Management During Scheduling (cont.)
• Finite-state Automata
– Intuition
• “Is this instruction sequence a resource-legal schedule?”
– Analogous to “Does this FSA accept this string?”
• A schedule is a sequence of instructions
– Analogous to a string being a sequence of alphabet characters
– The set of resource-valid schedules forms a language
– FSAs are sufficient to accept these languages
– Several approaches improve efficiency
• Breaking them into “factor” automata, reversing automata, and nondeterminism
46
Resource Management During Scheduling (cont.)
• Finite-state Automata
47
• Original automaton: represents a two-resource machine
• Factored automata: “Letter” and “Number,” since the resources are independent
• The cross-product of the factored automata is equivalent to the original automaton
Resource Management During Scheduling (cont.)
• TODO:
– Reverse automata?
– Nondeterminism?
48
Loop Scheduling
• Loop scheduling approaches
– Most of a program’s execution time is spent in loops
– The simplest approach is loop unrolling
– Software pipelining
• Exploits inter-iteration ILP: parallelism across iterations
• Modulo scheduling
– Produces a kernel of code
– Kernel: overlapped multiple iterations of a loop, with neither data dependences nor resource conflicts
• Prologue and epilogue code is needed for correctness
– Code size increases; hardware techniques can reduce this
49
• Conceptual illustration of software pipelining
Loop Scheduling
50
Loop Scheduling (cont.)
• Modulo Scheduling
– Initiation Interval (II)
• The length of the kernel: the constant interval between starts of successive kernel iterations
• Minimum II (MII)
– The lower bound on II
• Two constraints determine the MII
– Recurrence-constrained minimum II (RecMII)
– Resource-constrained minimum II (ResMII)
51
Loop Scheduling (cont.)
• Modulo Scheduling
– Goal
• Arrange operations so that they can be repeated at the smallest possible II (related to throughput)
– Rather than minimizing the stage count of each iteration, which would minimize latency
– But the stage count is also important, because it relates to the prologue (pipeline filling) and epilogue (pipeline draining)
– Downsides of modulo scheduling
• Hard to handle nested loops
• Control flow in the loop can be handled only by predication
52
• Conceptual model of modulo scheduling
– A 4-wide machine; load (3 cycles), multiply and compare (2 cycles)
53
How many inter-iteration dependences?
Loop Scheduling (cont.)
• Modulo Scheduling
– Modulo Reservation Table (MRT)
• Finds a resource-conflict-free schedule over multiple II intervals
• Ensures the same resource is not used more than once in the same cycle
• The MRT records and checks resource usage per cycle
54
Modulo Reservation Table
55
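The MRT idea can be sketched as wrap-around placement: a resource used at cycle c is busy at every cycle congruent to c mod II. This toy model (identical units, one slot per op, slide-forward placement) uses hypothetical op names and cycles.

```python
def mrt_place(ops, II, num_units):
    """Place (op, preferred_cycle) pairs subject to a modulo reservation
    table with `num_units` identical units: slot c mod II is shared by
    every cycle congruent to c."""
    mrt = [0] * II                     # ops already issued per modulo slot
    placement = {}
    for op, cycle in ops:
        for delay in range(II):        # slide forward until a slot is free
            slot = (cycle + delay) % II
            if mrt[slot] < num_units:
                mrt[slot] += 1
                placement[op] = cycle + delay
                break
        else:
            return None                # no free slot at this II: II must grow
    return placement

# Three ops wanting cycles 0, 3, 6 on one unit with II = 3: all three
# map to modulo slot 0, so the later ops slide to other slots.
print(mrt_place([("a", 0), ("b", 3), ("c", 6)], II=3, num_units=1))
# {'a': 0, 'b': 4, 'c': 8}
```

When `mrt_place` returns None for every placement order, the iterative search on the next slides bumps II and retries, which is exactly how the II search interacts with the MRT.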
Loop Scheduling (cont.)
• Modulo Scheduling
– Searching for the II
• Find two bounds: minII and maxII
• maxII: trivial; the sum of all operation latencies in the loop
• minII: complex; max(ResMII, RecMII)
– Considers resource constraints and both intra- and inter-iteration dependences
• Then find a legal schedule within that range
– Usually with a modified list scheduler that checks resources for each assignment through the MRT
56
Loop Scheduling (cont.)
• Modulo Scheduling
– Searching for the II
• Basic scheme of iterative modulo scheduling
57
minII = compute_minII();
maxII = compute_maxII();
found = false;
II = minII;
while (!found && II <= maxII) {
    found = try_to_modulo_schedule(II, budget);
    II = II + 1;
}
if (!found)
    trouble(); /* wrong maxII */
Loop Scheduling (cont.)
• Modulo Scheduling
– Prologues and Epilogues
• Partial copies of the kernel
• More complex for multiple-exit loops
• In practice, multiple epilogues are almost always a necessity (but this is beyond our scope!)
• Kernel-only loop scheduling
– Condition 1: prologues and epilogues are proper subsets of the kernel code in which some operations have been disabled
– Condition 2: a fully predicated architecture
58
Kernel-only code by predicates
Loop Scheduling (cont.)
• Modulo Scheduling
– Modulo Variable Expansion
• The MRT solves correct resource scheduling for a given II
• But what about register allocation, when the lifetime of a value within an iteration exceeds the II?
– A simple register allocation policy will not work: values get overwritten!
• Solution: artificially extend the II without performance degradation by unrolling the loop body -> modulo variable expansion
• Must unroll by at least a factor k = ceil(v / II)
– v = the length of the longest lifetime
59
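The unroll factor k = ceil(v / II) can be computed directly. A trivial sketch; the lifetime lengths below are hypothetical.

```python
from math import ceil

def mve_unroll_factor(lifetimes, II):
    """Minimum kernel unroll factor for modulo variable expansion:
    k = ceil(v / II), where v is the longest value lifetime in cycles.
    With k kernel copies, a value's k distinct registers are not reused
    until its lifetime has ended."""
    v = max(lifetimes)
    return ceil(v / II)

# Longest lifetime 5 cycles with II = 2: a value would still be live
# when the same register is rewritten, so unroll the kernel 3 times.
print(mve_unroll_factor([3, 5, 2], II=2))   # 3
```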
Loop Scheduling (cont.)
• Modulo Scheduling
– Modulo Variable Expansion
• But: increased kernel code size, register pressure, …
• Solution: rotating registers
– A physical register is the combination of a logical identifier and a register base that is incremented at every iteration
• A reference to register r at iteration i points to a different location than at iteration i+1
– Makes it possible to avoid modulo variable expansion
60
61
Register r1 needs to hold the same variable in two different iterations, but the lifetimes overlap
Unroll the kernel twice!
62
Two registers (r1, r11) resolve the overlap
Same throughput, but code size suffers
Loop Scheduling (cont.)
• Modulo Scheduling
– Iterative Modulo Scheduling
• Sometimes it is hard to find a schedule because of a complex MRT
• To improve the probability of finding a schedule, allow a controlled form of backtracking (unscheduling and rescheduling of instructions)
– Advanced Modulo Scheduling Techniques
• So far, several heuristics: e.g. guessing a good minII
• Recent techniques
– e.g. Hypernode reduction modulo scheduling (HRMS):
» reduces loop-variant lifetimes while keeping the II constant
63
Loop Scheduling (cont.)
• Clustering
– Why clustering is needed
• A practical solution to high register demands, as opposed to a multiported register file or bypassing logic
– Multiports are expensive and scale poorly
• A clustered architecture divides the machine into separate clusters
• Each cluster has its own register bank and functional units
• In general, explicit intercluster copy operations are needed
– The compiler’s new role
• Minimize intercluster moves and balance the clusters
64
Loop Scheduling (cont.)
• Clustering
– Preassignment techniques
• In general, clustering happens before scheduling
• Two techniques
– Bottom-up greedy (BUG)
» Two phases: traversal from exit to entry, then assignment
– Partial-component clustering (PCC)
» Reduces complexity by constructing macronodes
– Clustering overheads
• Two clusters: 15–20% lost cycles; four clusters: 25–30%
65
3. Register Allocation
• Register allocation
– Memory is much larger than the register space
– An NP-hard problem
– Old and well known
• Standard technique: coloring of the interference graph
• Recent: nonstandard register allocation techniques
– Much faster than graph coloring
– Linear-scan allocators
» Of interest for JITs and dynamic translation
• Trade-offs between compile time and run time
– Feasible today because of faster machines
66
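Linear-scan allocation, mentioned above as the fast alternative to graph coloring, can be sketched as follows. This is a simplified version of the usual formulation (Poletto and Sarkar style): the intervals, register count, and spill heuristic (evict the interval that ends furthest away) are illustrative.

```python
def linear_scan(intervals, num_regs):
    """Allocate registers by walking live intervals in start order,
    expiring finished intervals and spilling when registers run out.

    `intervals` is a list of (name, start, end) live ranges.
    Returns (assignment, spills).
    """
    intervals = sorted(intervals, key=lambda iv: iv[1])
    free = list(range(num_regs))
    active = []                          # [(end, name, reg)], sorted by end
    assignment, spills = {}, []
    for name, start, end in intervals:
        # expire intervals that ended before this one starts
        while active and active[0][0] <= start:
            free.append(active.pop(0)[2])
        if free:
            reg = free.pop(0)
            assignment[name] = reg
            active.append((end, name, reg))
            active.sort()
        else:
            # spill whichever live interval ends furthest in the future
            last_end, last_name, reg = active[-1]
            if end < last_end:
                spills.append(last_name)
                del assignment[last_name]
                assignment[name] = reg
                active[-1] = (end, name, reg)
                active.sort()
            else:
                spills.append(name)
    return assignment, spills

intervals = [("a", 0, 8), ("b", 1, 3), ("c", 2, 9), ("d", 4, 6)]
assignment, spills = linear_scan(intervals, num_regs=2)
print(assignment, spills)   # {'a': 0, 'b': 1, 'd': 1} ['c']
```

A single pass over sorted intervals replaces building and coloring the whole interference graph, which is why this style suits JITs and dynamic translators.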
Phase-ordering Issues
• Phase ordering is a hard problem
– Should register allocation be done before, after, or at the same time as scheduling?
– Register allocation and scheduling have conflicting goals
• The register allocator tries to minimize spills and restores, creating sequential constraints through register reuse
• The scheduler tries to fill all parallel units
• How to order them?
– A very tricky problem
67
Phase-ordering Issues
• Scheduling followed by Register Allocation followed by Post-scheduling
– The most popular choice (common for modern RISC)
– Favors ILP over efficient register utilization
• Assumes enough registers are available
– The post-scheduler rearranges the code after spill insertion
68
Scheduling: without regard for the number of physical registers actually available
Register allocation: no allocation might exist that keeps the schedule legal, so insert spills/restores
Post-scheduling: after inserting spills/restores, fix up the schedule, making it legal with the fewest possible added cycles
Phase-ordering Issues
• Register Allocation followed by Scheduling
– Favors register use over exploiting ILP
– Works well with few GPRs (e.g. x86)
– But the register allocator introduces additional dependences every time it reuses a register
69
Register allocation: produces code with all registers assigned
Scheduling: though not very effectively, because register allocation has inserted many false dependences
Phase-ordering Issues
• Combined Register Allocation and Scheduling
– Potentially very powerful, but very complex
– A list-scheduling algorithm may not converge
• Cooperative Approaches
– The scheduler monitors register resources and estimates pressure in its heuristics
70
Scheduling and register allocation done together: difficult engineering, and it is difficult to ensure that scheduling will ever terminate
4. Speculation and Predication
• Speculation and Predication
– Remove and transform control dependences
– Usually they are independent techniques, and in a given situation one is much more appropriate than the other
– Note that predication is important in software pipelining
71
Control and Data Speculation
• Control and Data Speculation
– Recall exception behavior in the recovery model
• Nonexcepting parts and sentinel (checking) parts
• From the compiler’s perspective
– Supporting nonexcepting loads is complicated because of recovery-code handling
– Speculative code motion (or code hoisting)
• Removes actual control dependences, unlike predication
• The compiler must consider the supported exception model and speculative memory operations
72
73
[Figure: speculative code motion example; the load operation becomes a speculative load (load.s)]
Predicated Execution
• Compiler techniques for predication
– Examples: if-conversion, logical reduction of predicates, reverse if-conversion, and hyperblock-based scheduling
– If-conversion
• Translates control dependences into data dependences
• Converts an acyclic subset of the CFG from unpredicated code into straight-line code with predication
• Also tries to minimize the number of predicate values
– Logical reduction of predicates
74
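If-conversion can be sketched symbolically: each arm of a CFG diamond becomes a straight-line sequence guarded by a predicate p or its complement. The operation strings and the predicate naming below are purely illustrative, not any real ISA's syntax.

```python
def if_convert(cond_op, then_ops, else_ops):
    """Fold both arms of an if/else diamond into one predicated sequence.

    Each op becomes a (predicate, op) pair: the compare defines p,
    then-arm ops are guarded by p, else-arm ops by !p.  The branch and
    join disappear: control dependence becomes data dependence on p.
    """
    code = [("true", cond_op + " -> p")]
    code += [("p", op) for op in then_ops]
    code += [("!p", op) for op in else_ops]
    return code

# if (x < 0) y = -x; else y = x;
for pred, op in if_convert("cmplt x, 0", ["neg y, x"], ["mov y, x"]):
    print(f"({pred}) {op}")
```

The resulting region has a single entry and no internal branches, which is what lets a hyperblock scheduler treat it as straight-line code.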
Predicated Execution (cont.)
• Compiler techniques for predication
– Reverse if-conversion
• Removes predicates, returning to unpredicated code
• It may be worthwhile to if-convert aggressively, then, when predicate registers run short, selectively reverse if-convert
– Hyperblock-based scheduling
• A unified framework for both speculation and predication
• First choose a hyperblock region, then apply if-conversion
– Gives the schedule constructor much more freedom to schedule, and removes speculative constraints
75
76
Example of predicated code (the unpredicated operations are always executed)
Predicated Execution (cont.)
• Case studies in embedded systems
– Usually not fully predicated like the IPF architecture
– ARM includes a 4-bit predicate in every operation
• Looks like everything is predicated
• But the predicate register is the usual set of condition-code flags, not an index into general predicate registers
– TI C6x supports full predication
• Five of the GPRs can be specified as condition registers
77
Prefetching
• Memory prefetching
– A form of speculation, invisible to programs
– Compiler-assisted prefetching beats pure hardware prefetching in many cases
• Compiler assistance in prefetching
– The ISA includes a prefetch instruction
• It is only a hint to the hardware
– Automatic insertion requires understanding loop behavior
– Unneeded prefetches waste resources
78
Other Topics
• Data Layout Methods
– Increase locality by considering cache lines
• Static and Hybrid Branch Prediction
– Profiles are used to set static branch predictions
– A more sophisticated approach
• Hybrid method: each branch is predicted statically or dynamically
– e.g. IPF includes four branch-encoding hints
• static taken, static not-taken, dynamic taken, and dynamic not-taken
79
5. Instruction Selection
• Instruction Selection
– Translates from a tree-structured, linguistically oriented IR to an operation- and machine-oriented IR
– Especially important for complex instruction sets
– Recent technique
• Cost-based pattern-matching rewriting systems
– “Match” or “cover” the parse tree produced by the front end using a minimum-cost set of operation subtrees
• e.g. BURS (bottom-up rewriting systems)
– 1st pass: labels each node in the parse tree
– 2nd pass: reads the labels and generates target machine operations
80