TRANSCRIPT
Ten Hardware Features That Affect Optimization
COMP 512, Rice University, Houston, Texas
Fall 2003
Copyright 2003, Keith D. Cooper & Linda Torczon, all rights reserved.
Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use.
Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.
COMP 512, Fall 2003 2
Hardware Features Affect Optimization
• Target machine defines cost of each operation
• Target machine defines set of available resources
• Target machine may provide unusual opportunities
  Load multiple, predication, branches in delay slots, …
Compiler Writers/Designers must understand hardware features
• Make good use of features that help
• Avoid downside impact of features that hurt (branch to register)
Ten Hardware Features That Affect Optimization
The list for today’s lecture
1. Register windows
2. Partitioned register sets
3. Itanium’s rolling registers
4. x86 floating-point register stack
5. Predicated Execution
6. Autoincrement & autodecrement
7. On-chip local memory
8. Hints to hardware
9. Branch-delay slots
10. Software-controlled processor speed
Register Windows
Architectural response to procedure call save/restore overhead
• Use hardware renaming to avoid most saves & restores at a call
• Partition register names into sets
  Set shared with caller
  Local set, maybe a global set
  Set shared with callee
• Manipulate the map at a call so that the caller’s output set becomes the callee’s input set
  Intrinsic effect of call/return, or separate operations
• Hardware or software mechanism to handle stack overflow
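The mechanics above can be sketched as a toy simulation. The class, set sizes, and names below are illustrative inventions, not any real ISA; the point is only that a call shifts the window so the caller's output set maps onto the callee's input set.

```python
# Toy register-window file: a call shifts the window so the caller's
# "out" registers become the callee's "in" registers.  Sizes and
# names are hypothetical, not SPARC- or Itanium-accurate.

class WindowedRegisters:
    def __init__(self, n_physical=64, window=8):
        self.phys = [0] * n_physical
        self.window = window            # size of each in/local/out set
        self.n_physical = n_physical
        self.base = 0                   # current window position

    def _index(self, name):
        # name is e.g. "out3" or "in0"; map it into the physical file
        kind, num = name[:-1], int(name[-1])
        offset = {"in": 0, "local": 1, "out": 2}[kind] * self.window + num
        return (self.base + offset) % self.n_physical

    def read(self, name):
        return self.phys[self._index(name)]

    def write(self, name, value):
        self.phys[self._index(name)] = value

    def call(self):
        # shift the window: the caller's outs land where the callee's ins map
        self.base = (self.base + 2 * self.window) % self.n_physical

    def ret(self):
        self.base = (self.base - 2 * self.window) % self.n_physical

regs = WindowedRegisters()
regs.write("out0", 42)            # caller passes an argument in out0
regs.call()
assert regs.read("in0") == 42     # callee sees the same value as in0
```

No save or restore code runs at the call; only the base of the window moves, which is the whole point of the mechanism.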
Register Windows
SPARC
• Save & Restore operations
• 32 GPRs visible at any time
• No window on floats
• Overflow handled by trap
• 40 to 520 physical registers

Window layout:
  r0 to r7     global
  r8 to r15    shared w/callee
  r16 to r23   local
  r24 to r31   shared w/caller
Using Register Windows
• Caller passes args in r8 to r15
• Callee sees them in r24 to r31
• Global set visible to all
• Can use r16 to r31 arbitrarily
• Can use r8 to r15 as scratch between calls
Most save-restore activity now automated in overflow code
• 520 physical registers is a lot
• 40 physical registers is not
Faster for non-recursive code
Register Windows
Itanium
• Variable-size window (GPRs only)
• 32 “global” registers, r0 to r31
• Window of 0 to 96 registers, starting at r32
• Background engine performs fill & spill operations on register-stack overflow
  Stalls on return when a fill is needed & incomplete
• Callee inherits a window of the same size as the caller’s
  An operation (alloc) lets the callee set its window size & the local–out boundary
• ISA includes alloc, flushrs, loadrs, & cover operations
Partitioned Register Sets
• Number of functional units keeps rising
• At some point, the register-FU MUX becomes too deep & slow
• One response is to partition the register set
[Diagram: two clusters, each a register file feeding its own FUs (FU0–FU3 and FU4–FU7), connected by inter-cluster data paths]
• Multiple register files, each with a cluster of FUs
• Inter-cluster xfer mechanism, with limited bandwidth
• Fast access to local register file
Example: TI TMS320C6x
PRS: Cluster Assignment & Scheduling
• Compiler must place each operation & ensure operand availability
May necessitate inter-cluster copy operations
• Adds another complex problem to the back end
Bottom-up Greedy (BUG) algorithm [Ellis 81]
• Separate cluster assignment phase before scheduling
• Inserted all necessary data movement before scheduling
Unified Assign and Schedule (UAS) [Ozer et al. 98]
• Moved assignment into inner loop of backward list scheduler
• Produced better results than bottom-up greedy approach
Commercial practice
• Ad-hoc techniques based on coloring as prelude to scheduling
• Poor utilization of off-critical-path clusters
Cluster Assignment & Scheduling
Jingsong He’s work (MS Thesis)
• Follow pattern of UAS & move assignment into inner loop
• Use forward list scheduler
• Search backward for slots to insert inter-cluster copies
• Use a direct cross-cluster reference as a last resort
• Two versions
  TDF considers clusters in a fixed order
  TDC considers clusters in order by operand count
• Both TDF & TDC outperform BUG & UAS
  Measured by execution cycles, not some static count
Multiplies the complexity by N.
Limit the search to last 10 or 20 cycles if this worries you.
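The core problem — place each operation on a cluster and make its operands available, inserting inter-cluster copies when they are not — can be sketched in a few lines. This is a naive greedy rule for illustration only, not BUG, UAS, or He's TDF/TDC, and the op format is invented.

```python
# Sketch of cluster assignment with explicit copy insertion.
# ops: list of (dest, sources).  Each operation is placed on the
# cluster holding the most of its operands; any operand that lives
# in the other cluster's register file gets an inter-cluster copy.

def assign_clusters(ops, location=None, n_clusters=2):
    location = dict(location or {})      # value name -> cluster holding it
    out = []                             # (cluster, op) in issue order
    for dest, srcs in ops:
        # pick the cluster holding the most operands (ties -> cluster 0)
        votes = [sum(1 for s in srcs if location.get(s) == c)
                 for c in range(n_clusters)]
        c = votes.index(max(votes))
        for s in srcs:                   # copy operands that live elsewhere
            if s in location and location[s] != c:
                out.append((c, ('copy', s, location[s], c)))
                location[s] = c          # (ignores keeping both copies live)
        out.append((c, (dest, srcs)))
        location[dest] = c
    return out

# 'a' lives on cluster 0, 'b' on cluster 1; computing c = f(a, b)
# forces one inter-cluster copy
sched = assign_clusters([('c', ['a', 'b'])], {'a': 0, 'b': 1})
copies = [op for _, op in sched if op[0] == 'copy']
```

Even this toy version shows why the problem is hard: the copy consumes an issue slot and inter-cluster bandwidth, so placement and scheduling interact, which is exactly why UAS and He's work fold assignment into the scheduler's inner loop.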
Itanium’s Rolling Registers
Support for Software Pipelining
• Lam suggested a combination of Modulo Variable Expansion and unrolling to straighten out the flow of values
• Itanium supports a rolling-register set
  Fixed-size portions of the floating-point & predicate register sets:
  PR32 to PR63 and FR32 to FR127
  Code sets the size of the GPR rolling set (above GR32)
• Code uses adjacent registers for same name in successive iterations of the pipelined loop
• Loop-oriented branches adjust the RRB (rotating register base)
  rx+1 becomes rx after br.ctop or br.wtop
  Other loop-counting features simplify epilogue code
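A toy model of the renaming makes the mechanism concrete. The register count and base number below are illustrative, not the real ISA layout; the point is that a logical register name is an offset from the rotating base, and the loop-back branch moves the base.

```python
# Sketch of Itanium-style register rotation: the value written under
# one name this iteration is read under the next-higher name after
# the loop-back branch rotates the base.

class RotatingRegs:
    def __init__(self, size=96, base_name=32):
        self.size, self.base_name = size, base_name
        self.rrb = 0                      # rotating register base
        self.phys = [None] * size

    def _index(self, r):                  # r is a logical number, e.g. 33
        return (r - self.base_name + self.rrb) % self.size

    def write(self, r, value):
        self.phys[self._index(r)] = value

    def read(self, r):
        return self.phys[self._index(r)]

    def br_ctop(self):                    # loop-back branch rotates registers
        self.rrb = (self.rrb - 1) % self.size

regs = RotatingRegs()
regs.write(33, 'x')            # value produced in iteration i as r33
regs.br_ctop()
assert regs.read(34) == 'x'    # consumed in iteration i+1 as r34
```

This is what lets a software-pipelined loop use one register name per in-flight value without unrolling for Modulo Variable Expansion.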
Register Stack
x86 Floating-point registers are organized as a rotating stack
• 8 FP registers
• ST[0] refers to top, ST[7] refers to bottom
• Memory operations always go through ST[0]
Computational model differs from ILOC-like IRs
• Places a premium on code shape (RPN)
Generating code is well understood
• Infix-to-postfix translation is a postorder walk over the expression tree
• Stack optimization was studied in 1970s and 1980s
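The postorder walk can be sketched directly. The node format and mnemonic spellings below are ad hoc; the real x87 opcodes differ in detail, but the shape — emit both children, then an operation that pops two entries and pushes the result — is the point.

```python
# Sketch: generating stack code from an expression tree is a postorder
# walk.  A leaf pushes a value; an interior node emits code for both
# children, then an operation that pops two entries and pushes one.

def gen_stack_code(node, out):
    if isinstance(node, str):           # leaf: a variable in memory
        out.append(('fld', node))       # push it onto the FP stack
    else:
        op, left, right = node
        gen_stack_code(left, out)
        gen_stack_code(right, out)
        out.append((op,))               # pops two operands, pushes result

code = []
gen_stack_code(('faddp', 'a', ('fmulp', 'b', 'c')), code)   # a + b*c
# code == [('fld','a'), ('fld','b'), ('fld','c'), ('fmulp',), ('faddp',)]
```

Notice that no register names appear anywhere in the output — every operand is implicit in the stack depth, which is exactly the property that complicates the post-compilation analysis on the next slide.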
Register Stack
Stack model complicates post-compilation optimization
• Translation from explicit to implicit names loses information
• Implicit names are inherently ambiguous
  ST[i] can refer to FR0, FR1, FR2, …, FR7
• Simple translation from stack to infix code retains ambiguity
Das Gupta built SSA from x86 assembly in his Vizer system
• Model push and pop with a series of register copy operations
• Creates (truly) ugly IR, but captures the effect
• Allows analysis to build accurate SSA and use it
• Copy folding eliminates most of the 7x “extra” copy operations
• Reconstruct stack code on translation out of SSA via treewalk
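The expansion idea can be sketched as follows. This is an illustrative model of the approach, not Vizer's actual IR: a push shifts every slot down by one copy and a pop shifts every slot up, so each stack operation becomes a series of explicit, unambiguous register copies.

```python
# Sketch of modeling implicit stack traffic with explicit copies so
# that SSA construction sees unambiguous names.  Slot names ST0..ST7
# mirror the 8-register x87 stack.

def expand_push(value):
    """A push shifts every slot down one, then defines ST0."""
    ops = [('copy', f'ST{i}', f'ST{i+1}') for i in range(6, -1, -1)]
    ops.append(('def', value, 'ST0'))
    return ops

def expand_pop():
    """A pop shifts every slot up one; ST7 becomes undefined."""
    return [('copy', f'ST{i+1}', f'ST{i}') for i in range(7)]

ops = expand_push('x')   # 7 copies plus one definition
```

The 7 copies per push are the "7x extra" operations mentioned above; they make the IR ugly but explicit, and copy folding in SSA form removes most of them again.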
Predicated Execution
Pervasive predication changes code shape
• Can use if-conversion to avoid branches (EaC, § 7)
  Need to evaluate tradeoffs (path lengths, density of executed ops)
• Subtler impacts abound
  Branches become predicated jumps
  Multiway branches – up to the number of FUs that can branch
  Predicated prologue & epilogue in a software-pipelined loop
  Run-time checks on ambiguous stores & loads
  Test the condition before the loop & only load/store on overlap
  … more will emerge as clever students work with predication
I do not believe that we have seen the killer app for predication
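If-conversion itself is easy to show with a toy interpreter. The op format and predicate handling below are illustrative inventions, not IA-64 semantics; the point is that the branch disappears and both arms execute under guards.

```python
# Sketch of if-conversion: a branch over two short arms becomes
# straight-line code where each operation is guarded by a predicate.

def run_predicated(ops, env):
    """ops: list of (pred_or_None, dest, fn).  Execute in order,
    skipping any op whose guarding predicate is false."""
    for pred, dest, fn in ops:
        if pred is None or env[pred]:
            env[dest] = fn(env)
    return env

# if (a < b) x = a; else x = b;   if-converts to:
ops = [
    (None, 'p',  lambda e: e['a'] < e['b']),   # compare sets p
    (None, 'np', lambda e: not e['p']),        # and its complement
    ('p',  'x',  lambda e: e['a']),            # (p)  x = a
    ('np', 'x',  lambda e: e['b']),            # (!p) x = b
]
env = run_predicated(ops, {'a': 3, 'b': 7})
assert env['x'] == 3
```

Both arms occupy issue slots every time, which is the path-length/op-density tradeoff the slide mentions: if-conversion wins only when the arms are short relative to the misprediction cost.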
Autoincrement & Autodecrement
Many architectures support autoincrement (PDP-11, DSPs, IA 64)
• TMS320C25 relies heavily on indirect addressing
  No address-immediate form
  Code must perform explicit arithmetic or use autoincrement
• Data layout in memory has a significant impact on speed & size
  Want offsets assigned so that successive references differ by an autoincrement or autodecrement (scalar variables)
  Folds address calculation into the addressing hardware
  Eliminates instructions (space & time)
• Single-register problem modeled as a path-covering problem (NP-complete)
• General problem (multiple index registers) is harder
This work may have application on Itanium (autoincrement)
See “Storage Assignment to Decrease Code Size,” S. Liao, S. Devadas, K. Keutzer, S. Tjiang, and A. Wang, TOPLAS 18(3), May 1996, pages 235–253.
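A greedy heuristic in the spirit of the path-covering formulation can be sketched briefly. This is an illustration of the idea, not Liao et al.'s exact algorithm: weight each pair of variables by how often they are accessed back to back, then chain the heaviest pairs into paths; variables adjacent on a path get adjacent offsets, so those transitions use autoincrement/decrement.

```python
# Sketch of greedy offset assignment: pick the heaviest access-graph
# edges that still form simple paths (node degree <= 2, no cycles).

from collections import Counter

def offset_assignment(seq):
    """seq: the access sequence.  Returns variable pairs chosen to be
    adjacent in the final memory layout."""
    weight = Counter(frozenset(p) for p in zip(seq, seq[1:]) if p[0] != p[1])
    degree = Counter()
    comp = {v: v for v in seq}            # union-find to reject cycles
    def find(v):
        while comp[v] != v:
            v = comp[v]
        return v
    chosen = []
    for edge, _ in weight.most_common():  # heaviest transitions first
        a, b = tuple(edge)
        if degree[a] < 2 and degree[b] < 2 and find(a) != find(b):
            chosen.append((a, b))         # a and b get adjacent offsets
            degree[a] += 1
            degree[b] += 1
            comp[find(a)] = find(b)
    return chosen

# a<->b and c<->d alternate often, so both pairs end up adjacent
pairs = offset_assignment(['a', 'b', 'a', 'b', 'c', 'd', 'c', 'd'])
```

Every chosen pair turns an explicit address computation into a free post-increment or post-decrement, which is where the space and time savings come from.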
On-chip Local Memory
Many DSP chips have local memory rather than cache
• Local memory is not mapped or managed (as cache is)
• Takes less space & less power
• Programmer (or compiler) control of contents
How can the compiler manage this memory?
• Tile and copy for large arrays
  Strip mine and interchange to create manageable data sizes
  Copy in & copy out around the inner loop(s)
• Spill memory?
  Harvey showed that a couple of KB is enough
  An interprocedural allocation problem
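The tile-and-copy pattern can be sketched on a one-dimensional example. `LOCAL_SIZE` and the copy steps are illustrative stand-ins for the local store and its DMA/copy code; the structure — strip-mined outer loop, copy in, compute locally, copy out — is the transformation itself.

```python
# Sketch of tile-and-copy for software-managed local memory:
# strip-mine the loop so each tile fits in the local store, copy the
# tile in, run the inner loop out of local memory, copy results out.

LOCAL_SIZE = 4                    # pretend the local memory holds 4 words

def scale(a, k):
    n = len(a)
    for lo in range(0, n, LOCAL_SIZE):       # strip-mined outer loop
        hi = min(lo + LOCAL_SIZE, n)
        local = a[lo:hi]                     # copy-in to local memory
        for i in range(len(local)):          # inner loop runs locally
            local[i] *= k
        a[lo:hi] = local                     # copy-out to main memory
    return a

assert scale([1, 2, 3, 4, 5], 10) == [10, 20, 30, 40, 50]
```

On a real chip the copy-in/copy-out steps would overlap with computation on the previous tile; here they are left sequential for clarity.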
Hints to the Hardware
Intel, in particular, likes this mechanism for compiler-given advice
Itanium has
• Hints to the register-stack engine
  Enforced lazy, eager, load intensive, store intensive
• Hints on loads, stores, & prefetches help cache management
  Temporal, non-temporal L1 (NT-L1), NT-L2, & NT-All
• Hints on branch behavior
  Branch predict (brp) operations
  Hints that govern prediction
    Static not taken, static taken (no prediction resources)
    Dynamic not taken, dynamic taken (use dynamic history)
  Hints about the amount of code to prefetch (few vs. many lines)
  Hints to deallocate prediction resources (keep vs. free)
  Default predictions in the absence of history information
Branch Delay Slots
Many processors expose branch delay slots to scheduling
• SPARC has 1 slot, TMS320C6x has 5
• Bit in branch indicates whether next op is in the delay slot
• Filling delay slots eliminates wasted cycles
Branches in branch delay slots create complex control-flow
• Both SPARC & C6x allow this code
  The SPARC manual actively encourages its use
• Aggressive use can create complex code that is hard to decipher
  Recall the example from the TI compiler …
COMP 512, Fall 2003 18
Unravelling Control-flow (TI TMS320C6x)
      B     .S1   LOOP          ; branch to loop
      B     .S1   LOOP          ; branch to loop
      B     .S1   LOOP          ; branch to loop

      B     .S1   LOOP          ; branch to loop
   || ZERO  .L1   A2            ; zero A side product
   || ZERO  .L2   B2            ; zero B side product

      B     .S1   LOOP          ; branch to loop
   || ZERO  .L1   A3            ; zero A side accumulator
   || ZERO  .L2   B3            ; zero B side accumulator
   || ZERO  .D1   A1            ; zero A side load value
   || ZERO  .D2   B1            ; zero B side load value

LOOP: LDW   .D1   *A4++, A1     ; load a[i] & a[i+1]
   || LDW   .D2   *B4++, B1     ; load b[i] & b[i+1]
   || MPY   .M1X  A1, B1, A2    ; a[i] * b[i]
   || MPYH  .M2X  A1, B1, B2    ; a[i+1] * b[i+1]
   || ADD   .L1   A2, A3, A3    ; ca += a[i] * b[i]
   || ADD   .L2   B2, B3, B3    ; cb += a[i+1] * b[i+1]
   || [B0] SUB .S2 B0, 1, B0    ; decrement loop counter
   || [B0] B   .S1 LOOP         ; branch to loop

      ADD   .L1X  A3, B3, A3    ; c = ca + cb

Stuff four branches into the pipe
Set up the loop
Single-cycle loop ending with another branch
From Peephole Optimization Lecture …
Branch Delay Slots
In the TI example, the code is complex and the loop structure is hidden, but it is fast.
Software-adjustable Processor Speed
Important component of power-aware processors
Code can change the processor’s clock rate
• Slower execution requires less power (an n² effect)
• Can have a significant impact on battery life
• GNU Emacs needs only 300K to 400K OPS
To generate code for this feature
• Compiler must estimate the speed required for appropriate progress
  The hard part is defining “appropriate progress”
• Compiler must insert code to change the speed
  May require a brief delay to let the processor stabilize
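The estimation step can be sketched as a simple selection problem. The frequencies and the workload numbers below are hypothetical; the sketch just shows the compiler's goal of picking the slowest clock rate that still meets the progress requirement.

```python
# Sketch: choose the lowest clock rate that finishes the estimated
# work by its deadline, since slower execution draws less power.

def pick_frequency(cycles_needed, deadline_s, freqs_hz):
    """Return the lowest available frequency that meets the deadline."""
    for f in sorted(freqs_hz):
        if cycles_needed / f <= deadline_s:
            return f
    return max(freqs_hz)          # cannot meet the deadline: run flat out

# 300K operations due within 1 ms: 300 MHz suffices, no need for 1 GHz
f = pick_frequency(300_000, 1e-3, [100e6, 300e6, 600e6, 1e9])
assert f == 300e6
```

The hard part, as noted above, is not this arithmetic but producing the `cycles_needed` and `deadline_s` estimates — that is, defining appropriate progress.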