TRANSCRIPT
Ten Hardware Features That Affect Optimization
COMP 512, Rice University, Houston, Texas
Fall 2003
Copyright 2003, Keith D. Cooper & Linda Torczon, all rights reserved.
Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use.
Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.
COMP 512, Fall 2003 2
Hardware Features Affect Optimization
• Target machine defines cost of each operation
• Target machine defines set of available resources
• Target machine may provide unusual opportunities
  Load multiple, predication, branches in delay slots, …
Compiler Writers/Designers must understand hardware features
• Make good use of features that help
• Avoid downside impact of features that hurt (branch to register)
Ten Hardware Features That Affect Optimization
The list for today’s lecture
1. Register windows
2. Partitioned register sets
3. Itanium’s rolling registers
4. x86 floating-point register stack
5. Predicated Execution
6. Autoincrement & autodecrement
7. On-chip local memory
8. Hints to hardware
9. Branch-delay slots
10. Software-controlled processor speed
Register Windows
Architectural response to procedure call save/restore overhead
• Use hardware renaming to avoid most saves & restores at a call
• Partition register names into sets
  Set shared with caller
  Local set, maybe a global set
  Set shared with callee
• Manipulate the map at a call so that the caller’s output set becomes the callee’s input set
  Intrinsic effect of call/return, or separate operations
• Hardware or software mechanism to handle stack overflow
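The mechanics above can be sketched as a toy simulation. The class, set sizes, and names below are illustrative inventions, not any real ISA; the point is only that a call shifts the window so the caller's output set maps onto the callee's input set.

```python
# Toy register-window file: a call shifts the window so the caller's
# "out" registers become the callee's "in" registers.  Sizes and
# names are hypothetical, not SPARC- or Itanium-accurate.

class WindowedRegisters:
    def __init__(self, n_physical=64, window=8):
        self.phys = [0] * n_physical
        self.window = window            # size of each in/local/out set
        self.n_physical = n_physical
        self.base = 0                   # current window position

    def _index(self, name):
        # name is e.g. "out3" or "in0"; map it into the physical file
        kind, num = name[:-1], int(name[-1])
        offset = {"in": 0, "local": 1, "out": 2}[kind] * self.window + num
        return (self.base + offset) % self.n_physical

    def read(self, name):
        return self.phys[self._index(name)]

    def write(self, name, value):
        self.phys[self._index(name)] = value

    def call(self):
        # shift the window: the caller's outs land where the callee's ins map
        self.base = (self.base + 2 * self.window) % self.n_physical

    def ret(self):
        self.base = (self.base - 2 * self.window) % self.n_physical

regs = WindowedRegisters()
regs.write("out0", 42)            # caller passes an argument in out0
regs.call()
assert regs.read("in0") == 42     # callee sees the same value as in0
```

No save or restore code runs at the call; only the base of the window moves, which is the whole point of the mechanism.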
Register Windows
SPARC
• Save & Restore operations
• 32 GPRs visible at any time
• No window on floats
• Overflow handled by trap
• 40 to 520 physical registers

Window layout:
  r0 to r7     global
  r8 to r15    shared w/callee
  r16 to r23   local
  r24 to r31   shared w/caller
Using Register Windows
• Caller passes args in r8 to r15
• Callee sees them in r24 to r31
• Global set visible to all
• Can use r16 to r31 arbitrarily
• Can use r8 to r15 as scratch between calls
Most save-restore activity now automated in overflow code
• 520 physical registers is a lot
• 40 physical registers is not
Faster for non-recursive code
Register Windows
Itanium
• Variable-size window (GPRs only)
• 32 “global” registers, r0 to r31
• Window of 0 to 96 registers, starting at r32
• Background engine performs fill & spill operations on register-stack overflow
  Stalls on return when a fill is needed & incomplete
• Callee inherits a window of the same size as the caller’s
  An operation (alloc) lets the callee set its window size & the local–out boundary
• ISA includes alloc, flushrs, loadrs, & cover operations
Partitioned Register Sets
• Number of functional units keeps rising
• At some point, the register-FU MUX becomes too deep & slow
• One response is to partition the register set
[Diagram: two clusters, each a register file feeding its own FUs (FU0–FU3 and FU4–FU7), connected by inter-cluster data paths]
• Multiple register files, each with a cluster of FUs
• Inter-cluster xfer mechanism, with limited bandwidth
• Fast access to local register file
Example: TI TMS320C6x
PRS: Cluster Assignment & Scheduling
• Compiler must place each operation & ensure operand availability
May necessitate inter-cluster copy operations
• Adds another complex problem to the back end
Bottom-up Greedy (BUG) algorithm [Ellis 81]
• Separate cluster assignment phase before scheduling
• Inserted all necessary data movement before scheduling
Unified Assign and Schedule (UAS) [Ozer et al. 98]
• Moved assignment into inner loop of backward list scheduler
• Produced better results than bottom-up greedy approach
Commercial practice
• Ad-hoc techniques based on coloring as prelude to scheduling
• Poor utilization of off-critical-path clusters
Cluster Assignment & Scheduling
Jingsong He’s work (MS Thesis)
• Follow pattern of UAS & move assignment into inner loop
• Use forward list scheduler
• Search backward for slots to insert inter-cluster copies
• Use a direct cross-cluster reference as a last resort
• Two versions
  TDF considers clusters in a fixed order
  TDC considers clusters in order by operand count
• Both TDF & TDC outperform BUG & UAS
  Measured by execution cycles, not some static count
Multiplies the complexity by N.
Limit the search to last 10 or 20 cycles if this worries you.
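The core problem — place each operation on a cluster and make its operands available, inserting inter-cluster copies when they are not — can be sketched in a few lines. This is a naive greedy rule for illustration only, not BUG, UAS, or He's TDF/TDC, and the op format is invented.

```python
# Sketch of cluster assignment with explicit copy insertion.
# ops: list of (dest, sources).  Each operation is placed on the
# cluster holding the most of its operands; any operand that lives
# in the other cluster's register file gets an inter-cluster copy.

def assign_clusters(ops, location=None, n_clusters=2):
    location = dict(location or {})      # value name -> cluster holding it
    out = []                             # (cluster, op) in issue order
    for dest, srcs in ops:
        # pick the cluster holding the most operands (ties -> cluster 0)
        votes = [sum(1 for s in srcs if location.get(s) == c)
                 for c in range(n_clusters)]
        c = votes.index(max(votes))
        for s in srcs:                   # copy operands that live elsewhere
            if s in location and location[s] != c:
                out.append((c, ('copy', s, location[s], c)))
                location[s] = c          # (ignores keeping both copies live)
        out.append((c, (dest, srcs)))
        location[dest] = c
    return out

# 'a' lives on cluster 0, 'b' on cluster 1; computing c = f(a, b)
# forces one inter-cluster copy
sched = assign_clusters([('c', ['a', 'b'])], {'a': 0, 'b': 1})
copies = [op for _, op in sched if op[0] == 'copy']
```

Even this toy version shows why the problem is hard: the copy consumes an issue slot and inter-cluster bandwidth, so placement and scheduling interact, which is exactly why UAS and He's work fold assignment into the scheduler's inner loop.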
Itanium’s Rolling Registers
Support for Software Pipelining
• Lam suggested a combination of Modulo Variable Expansion and unrolling to straighten out the flow of values
• Itanium supports a rolling-register set
  Fixed-size portions of the floating-point & predicate register sets:
  PR32 to PR63 and FR32 to FR127
  Code sets the size of the GPR rolling set (above GR32)
• Code uses adjacent registers for same name in successive iterations of the pipelined loop
• Loop-oriented branches adjust the RRB (rotating register base)
  rx+1 becomes rx after br.ctop or br.wtop
  Other loop-counting features simplify epilogue code
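A toy model of the renaming makes the mechanism concrete. The register count and base number below are illustrative, not the real ISA layout; the point is that a logical register name is an offset from the rotating base, and the loop-back branch moves the base.

```python
# Sketch of Itanium-style register rotation: the value written under
# one name this iteration is read under the next-higher name after
# the loop-back branch rotates the base.

class RotatingRegs:
    def __init__(self, size=96, base_name=32):
        self.size, self.base_name = size, base_name
        self.rrb = 0                      # rotating register base
        self.phys = [None] * size

    def _index(self, r):                  # r is a logical number, e.g. 33
        return (r - self.base_name + self.rrb) % self.size

    def write(self, r, value):
        self.phys[self._index(r)] = value

    def read(self, r):
        return self.phys[self._index(r)]

    def br_ctop(self):                    # loop-back branch rotates registers
        self.rrb = (self.rrb - 1) % self.size

regs = RotatingRegs()
regs.write(33, 'x')            # value produced in iteration i as r33
regs.br_ctop()
assert regs.read(34) == 'x'    # consumed in iteration i+1 as r34
```

This is what lets a software-pipelined loop use one register name per in-flight value without unrolling for Modulo Variable Expansion.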
Register Stack
x86 Floating-point registers are organized as a rotating stack
• 8 FP registers
• ST[0] refers to top, ST[7] refers to bottom
• Memory operations always go through ST[0]
Computational model differs from ILOC-like IRs
• Places a premium on code shape (RPN)
Generating code is well understood
• Infix-to-postfix translation is a postorder walk over the expression tree
• Stack optimization was studied in 1970s and 1980s
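The postorder walk can be sketched directly. The node format and mnemonic spellings below are ad hoc; the real x87 opcodes differ in detail, but the shape — emit both children, then an operation that pops two entries and pushes the result — is the point.

```python
# Sketch: generating stack code from an expression tree is a postorder
# walk.  A leaf pushes a value; an interior node emits code for both
# children, then an operation that pops two entries and pushes one.

def gen_stack_code(node, out):
    if isinstance(node, str):           # leaf: a variable in memory
        out.append(('fld', node))       # push it onto the FP stack
    else:
        op, left, right = node
        gen_stack_code(left, out)
        gen_stack_code(right, out)
        out.append((op,))               # pops two operands, pushes result

code = []
gen_stack_code(('faddp', 'a', ('fmulp', 'b', 'c')), code)   # a + b*c
# code == [('fld','a'), ('fld','b'), ('fld','c'), ('fmulp',), ('faddp',)]
```

Notice that no register names appear anywhere in the output — every operand is implicit in the stack depth, which is exactly the property that complicates the post-compilation analysis on the next slide.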
Register Stack
Stack model complicates post-compilation optimization
• Translation from explicit to implicit names loses information
• Implicit names are inherently ambiguous
  ST[i] can refer to FR0, FR1, FR2, …, FR7
• Simple translation from stack to infix code retains ambiguity
Das Gupta built SSA from x86 assembly in his Vizer system
• Model push and pop with a series of register copy operations
• Creates (truly) ugly IR, but captures the effect
• Allows analysis to build accurate SSA and use it
• Copy folding eliminates most of the 7x “extra” copy operations
• Reconstruct stack code on translation out of SSA via treewalk
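The expansion idea can be sketched as follows. This is an illustrative model of the approach, not Vizer's actual IR: a push shifts every slot down by one copy and a pop shifts every slot up, so each stack operation becomes a series of explicit, unambiguous register copies.

```python
# Sketch of modeling implicit stack traffic with explicit copies so
# that SSA construction sees unambiguous names.  Slot names ST0..ST7
# mirror the 8-register x87 stack.

def expand_push(value):
    """A push shifts every slot down one, then defines ST0."""
    ops = [('copy', f'ST{i}', f'ST{i+1}') for i in range(6, -1, -1)]
    ops.append(('def', value, 'ST0'))
    return ops

def expand_pop():
    """A pop shifts every slot up one; ST7 becomes undefined."""
    return [('copy', f'ST{i+1}', f'ST{i}') for i in range(7)]

ops = expand_push('x')   # 7 copies plus one definition
```

The 7 copies per push are the "7x extra" operations mentioned above; they make the IR ugly but explicit, and copy folding in SSA form removes most of them again.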
Predicated Execution
Pervasive predication changes code shape
• Can use if-conversion to avoid branches (EaC, § 7)
  Need to evaluate tradeoffs (path lengths, density of executed ops)
• Subtler impacts abound
  Branches become predicated jumps
  Multiway branches – up to the number of FUs that can branch
  Predicated prologue & epilogue in a software-pipelined loop
  Run-time checks on ambiguous stores & loads
  Test the condition before the loop & only load/store on overlap
  … more will emerge as clever students work with predication
I do not believe that we have seen the killer app for predication
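If-conversion itself is easy to show with a toy interpreter. The op format and predicate handling below are illustrative inventions, not IA-64 semantics; the point is that the branch disappears and both arms execute under guards.

```python
# Sketch of if-conversion: a branch over two short arms becomes
# straight-line code where each operation is guarded by a predicate.

def run_predicated(ops, env):
    """ops: list of (pred_or_None, dest, fn).  Execute in order,
    skipping any op whose guarding predicate is false."""
    for pred, dest, fn in ops:
        if pred is None or env[pred]:
            env[dest] = fn(env)
    return env

# if (a < b) x = a; else x = b;   if-converts to:
ops = [
    (None, 'p',  lambda e: e['a'] < e['b']),   # compare sets p
    (None, 'np', lambda e: not e['p']),        # and its complement
    ('p',  'x',  lambda e: e['a']),            # (p)  x = a
    ('np', 'x',  lambda e: e['b']),            # (!p) x = b
]
env = run_predicated(ops, {'a': 3, 'b': 7})
assert env['x'] == 3
```

Both arms occupy issue slots every time, which is the path-length/op-density tradeoff the slide mentions: if-conversion wins only when the arms are short relative to the misprediction cost.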
Autoincrement & Autodecrement
Many architectures support autoincrement (PDP-11, DSPs, IA 64)
• TMS320C25 relies heavily on indirect addressing
  No address-immediate form
  Code must perform explicit arithmetic or use autoincrement
• Data layout in memory has a significant impact on speed & size
  Want offsets assigned so that successive references differ by an autoincrement or autodecrement (scalar variables)
  Folds address calculation into the addressing hardware
  Eliminates instructions (space & time)
• Single-register problem modeled as a path-covering problem (NP-complete)
• General problem (multiple index registers) is harder
This work may have application on Itanium (autoincrement)
See “Storage Assignment to Decrease Code Size,” S. Liao, S. Devadas, K. Keutzer, S. Tjiang, and A. Wang, TOPLAS 18(3), May 1996, pages 235–253.
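A greedy heuristic in the spirit of the path-covering formulation can be sketched briefly. This is an illustration of the idea, not Liao et al.'s exact algorithm: weight each pair of variables by how often they are accessed back to back, then chain the heaviest pairs into paths; variables adjacent on a path get adjacent offsets, so those transitions use autoincrement/decrement.

```python
# Sketch of greedy offset assignment: pick the heaviest access-graph
# edges that still form simple paths (node degree <= 2, no cycles).

from collections import Counter

def offset_assignment(seq):
    """seq: the access sequence.  Returns variable pairs chosen to be
    adjacent in the final memory layout."""
    weight = Counter(frozenset(p) for p in zip(seq, seq[1:]) if p[0] != p[1])
    degree = Counter()
    comp = {v: v for v in seq}            # union-find to reject cycles
    def find(v):
        while comp[v] != v:
            v = comp[v]
        return v
    chosen = []
    for edge, _ in weight.most_common():  # heaviest transitions first
        a, b = tuple(edge)
        if degree[a] < 2 and degree[b] < 2 and find(a) != find(b):
            chosen.append((a, b))         # a and b get adjacent offsets
            degree[a] += 1
            degree[b] += 1
            comp[find(a)] = find(b)
    return chosen

# a<->b and c<->d alternate often, so both pairs end up adjacent
pairs = offset_assignment(['a', 'b', 'a', 'b', 'c', 'd', 'c', 'd'])
```

Every chosen pair turns an explicit address computation into a free post-increment or post-decrement, which is where the space and time savings come from.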
On-chip Local Memory
Many DSP chips have local memory rather than cache
• Local memory is not mapped or managed (as cache is)
• Takes less space & less power
• Programmer (or compiler) control of contents
How can the compiler manage this memory?
• Tile and copy for large arrays
  Strip mine and interchange to create manageable data sizes
  Copy in & copy out around the inner loop(s)
• Spill memory?
  Harvey showed that a couple of KB is enough
  An interprocedural allocation problem
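The tile-and-copy pattern can be sketched on a one-dimensional example. `LOCAL_SIZE` and the copy steps are illustrative stand-ins for the local store and its DMA/copy code; the structure — strip-mined outer loop, copy in, compute locally, copy out — is the transformation itself.

```python
# Sketch of tile-and-copy for software-managed local memory:
# strip-mine the loop so each tile fits in the local store, copy the
# tile in, run the inner loop out of local memory, copy results out.

LOCAL_SIZE = 4                    # pretend the local memory holds 4 words

def scale(a, k):
    n = len(a)
    for lo in range(0, n, LOCAL_SIZE):       # strip-mined outer loop
        hi = min(lo + LOCAL_SIZE, n)
        local = a[lo:hi]                     # copy-in to local memory
        for i in range(len(local)):          # inner loop runs locally
            local[i] *= k
        a[lo:hi] = local                     # copy-out to main memory
    return a

assert scale([1, 2, 3, 4, 5], 10) == [10, 20, 30, 40, 50]
```

On a real chip the copy-in/copy-out steps would overlap with computation on the previous tile; here they are left sequential for clarity.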
Hints to the Hardware
Intel, in particular, likes this mechanism for compiler-given advice
Itanium has
• Hints to the register-stack engine
  Enforced lazy, eager, load intensive, store intensive
• Hints on loads, stores, & prefetches help cache management
  Temporal, non-temporal L1 (NT-L1), NT-L2, & NT-All
• Hints on branch behavior
  Branch predict (brp) operations
  Hints that govern prediction
    Static not taken, static taken (no prediction resources)
    Dynamic not taken, dynamic taken (use dynamic history)
  Hints about the amount of code to prefetch (few vs. many lines)
  Hints to deallocate prediction resources (keep vs. free)
  Default predictions in the absence of history information
Branch Delay Slots
Many processors expose branch delay slots to scheduling
• SPARC has 1 slot, TMS320C6x has 5
• Bit in branch indicates whether next op is in the delay slot
• Filling delay slots eliminates wasted cycles
Branches in branch delay slots create complex control-flow
• Both SPARC & C6x allow this code
  The SPARC manual actively encourages its use
• Aggressive use can create complex code that is hard to decipher
  Recall the example from the TI compiler …
COMP 512, Fall 2003 18
Unravelling Control-flow (TI TMS320C6x)
      B     .S1   LOOP          ; branch to loop
      B     .S1   LOOP          ; branch to loop
      B     .S1   LOOP          ; branch to loop

      B     .S1   LOOP          ; branch to loop
   || ZERO  .L1   A2            ; zero A side product
   || ZERO  .L2   B2            ; zero B side product

      B     .S1   LOOP          ; branch to loop
   || ZERO  .L1   A3            ; zero A side accumulator
   || ZERO  .L2   B3            ; zero B side accumulator
   || ZERO  .D1   A1            ; zero A side load value
   || ZERO  .D2   B1            ; zero B side load value

LOOP: LDW   .D1   *A4++, A1     ; load a[i] & a[i+1]
   || LDW   .D2   *B4++, B1     ; load b[i] & b[i+1]
   || MPY   .M1X  A1, B1, A2    ; a[i] * b[i]
   || MPYH  .M2X  A1, B1, B2    ; a[i+1] * b[i+1]
   || ADD   .L1   A2, A3, A3    ; ca += a[i] * b[i]
   || ADD   .L2   B2, B3, B3    ; cb += a[i+1] * b[i+1]
   || [B0] SUB .S2 B0, 1, B0    ; decrement loop counter
   || [B0] B   .S1 LOOP         ; branch to loop

      ADD   .L1X  A3, B3, A3    ; c = ca + cb

Stuff four branches into the pipe
Set up the loop
Single-cycle loop ending with another branch
From Peephole Optimization Lecture …
Branch Delay Slots
In the TI example, the code is complex and the loop structure is hidden, but it is fast.
Software-adjustable Processor Speed
Important component of power-aware processors
Code can change the processor’s clock rate
• Slower execution requires less power (an n² effect)
• Can have a significant impact on battery life
• GNU Emacs needs only 300K to 400K OPS
To generate code for this feature
• Compiler must estimate the speed required for appropriate progress
  The hard part is defining “appropriate progress”
• Compiler must insert code to change the speed
  May require a brief delay to let the processor stabilize
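The estimation step can be sketched as a simple selection problem. The frequencies and the workload numbers below are hypothetical; the sketch just shows the compiler's goal of picking the slowest clock rate that still meets the progress requirement.

```python
# Sketch: choose the lowest clock rate that finishes the estimated
# work by its deadline, since slower execution draws less power.

def pick_frequency(cycles_needed, deadline_s, freqs_hz):
    """Return the lowest available frequency that meets the deadline."""
    for f in sorted(freqs_hz):
        if cycles_needed / f <= deadline_s:
            return f
    return max(freqs_hz)          # cannot meet the deadline: run flat out

# 300K operations due within 1 ms: 300 MHz suffices, no need for 1 GHz
f = pick_frequency(300_000, 1e-3, [100e6, 300e6, 600e6, 1e9])
assert f == 300e6
```

The hard part, as noted above, is not this arithmetic but producing the `cycles_needed` and `deadline_s` estimates — that is, defining appropriate progress.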