compiler improvement of register usage part 1 - chapter 8, through section 8.4 anastasia braginsky

88
Compiler Compiler Improvement of Improvement of Register Usage Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Post on 21-Dec-2015

221 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Compiler Compiler Improvement of Improvement of Register UsageRegister Usage

Part 1 - Chapter 8, through Section 8.4

Anastasia Braginsky

Page 2: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Roadmap

Introduction

Scalar Replacement

Unroll-and-Jam

Page 3: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Introduction

Processor cycle time is decreasing; while memory time remains almost the same.

Better usage of registers set (especially for RISC architectures)

No cache discussion here

Page 4: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Register Allocation Algorithms

1. Defining live range per variable.

2. Building interference graph to model which pairs of live ranges can not be assigned to the same register.

3. Using a fast heuristic coloring algorithm.

4. If coloring fails take at least one live range from registers and repeat from step 3.

Page 5: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Register Allocation Algorithms – so what’s the problem? Only non-array variables are assigned to

registers.

Almost no optimization for floating-point registers, typically used to hold temporarily individual elements of array variables.

7 9 4 05 6 1 8 32Array A:Register RFor Array A:

Page 6: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Register Allocation Algorithms – so what’s the problem? Only non-array variables are assigned to

registers.

Almost no optimization for floating-point registers, typically used to hold temporarily individual elements of array variables.

7 9 4 05 6 1 8 32Array A:

Page 7: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Register Allocation Algorithms – so what’s the problem? Only non-array variables are assigned to

registers.

Almost no optimization for floating-point registers, typically used to hold temporarily individual elements of array variables.

7 9 4 05 6 1 8 32Array A:

We want to eliminate the unneeded loads and stores.

Page 8: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Introduction – example:

DO I=1,N

DO J=1,M

A(I)=A(I)+B(J)

ENDDO

ENDDO

1. Load from some memory location A(I) to register RA.

2. Load from some memory location B(J) to register RB.

3. RA = RA + RB

4. Store RA to memory – A(I)

1. Load from some memory location A(I) to register RA.

2. Load from some memory location B(J+1) to register RB.

3. RA = RA + RB

4. Store RA to memory – A(I)

Page 9: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Introduction – example (cont.)

DO I = 1, N

T = A(I)

DO J = 1, M

T = T + B(J)

ENDDO

A(I) = T

ENDDO

All loads and stores to A in the inner loop have been saved

High chance of T being allocated a register by the coloring algorithm

Source-to-source transformation

Page 10: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Data Dependence for Register Reuse Two application of dependence:

To determine the correctness of different transformations

To determine the transformations to improve the performance of particular memory accesses

Page 11: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Types of dependences

A true or flow dependence Assignment to variable and then usage of it

S1: V = … Store the register representing V to V’s

memory location

S2: … = V Load V to register representing V

Load can be saved (store after the usage). Cache miss can be saved

R1 R2

R1

R2

Page 12: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Types of dependences

An antidepencence dependence

Usage of variable and then assignment to it

S1: … = V Load V to register representing V

S2: V = … Store the register representing V to V’s

memory location

Nothing can be done to improve the registers usage (if used once), but cache miss can be saved.

R1 R2

1

Page 13: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Types of dependences

An output dependence

Assignment to variable repeats

S1: V = … Store the register representing V to V’s

memory location

S2: V = … Store the register representing V to V’s

memory location

First store is not needed.

R1 R2

o

Page 14: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Types of dependences - example

S1: A(I) = …

S2: … = A(I)

S3: A(I) = …

Store

Load

Store

True dependence between S1 and S2

Output dependence between S1 and S3

Page 15: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Types of dependences

An input dependence

Usage of variable repeats

S1: … = V Load V to register representing V

S2: … = V Load V to register representing V

Second load is not needed.

R1 R2

i

Page 16: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Types of dependences

Loop-independent dependence

Loop-carried dependence:

Consistent dependence – a loop-carried dependence with a constant dependence distance throughout the loop.

Non consistent dependence – the opposite.

Page 17: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Dependence graph modification

S1: A(I)=…

S2: …=A(I)

S3: …=A(I) load from A(I)

Two True dependences between S1 and S2

)Load can be saved(

and between S1 and S3

)Load can be saved(

Input dependence from S2 to S3

)Load can be saved(

It is not possible to save three memory references the dependence graph should be pruned.

load from A(I)

store to A(I)

Page 18: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Sum reduction example (again)

DO I = 1, N

DO J = 1, M

A(I) = A(I) + B(J)

ENDDO

ENDDO Save second load

…=V

…=V

Input

Save first store

V=…

V=…

Output

Save nothing

…=V

V=…

Anti- dependence

Save load

V = …

… = V

True

1

o

i

1J

J

oJ i

I

iJ

Page 19: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Sum reduction example (again)

DO I = 1, N

DO J = 1, M

A(I) = A(I) + B(J)

ENDDO

ENDDO

1J

J

oJ i

I

Load from A(I) to R1

Load from B(J) to R2

R1 = R1 + R2

Store R1 to A(I)

Load from A(I) to R1

Load from B(J+1) to R2

R1 = R1 + R2

Store R1 to A(I)

…Load from A(I) to R1

iJ

Page 20: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Scalar Register Allocation

DO I = 1, N

T = A(I)

DO J = 1, M

T = T + B(J)

ENDDO

A(I) = T

ENDDO

Page 21: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Roadmap

Introduction

Scalar Replacement

Unroll-and-Jam

Page 22: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Loop Independent Dependence

Simple replacement

DO I=1,N

A(I) = B(I) + C

X(I) = A(I) + Q

ENDDO

DO I=1,N

t = B(I) + C

X(I) = t + Q

A(I) = t

ENDDO

Page 23: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Loop Carried Dependences

Spanning single iteration

DO I=1,N

A(I) = B(I-1)

B(I) = A(I) + C(I)

ENDDO

I=1

A(1) = B(0)

B(1) = A(1)+C(1)

I=2

A(2) = B(1)

B(2) = A(2)+C(2)

I=3

A(3) = B(2)

B(3) = A(3)+C(3)

Loop Carried True dependence

Page 24: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Loop Carried Dependences

Spanning single iteration

DO I=1,N

A(I) = B(I-1)

B(I) = A(I) + C(I)

ENDDO

I=1

A(1) = B(0) or T

B(1) or T = A(1)+C(1)

I=2

A(2) = B(1) or T

B(2) or T = A(2)+C(2)

I=3

A(3) = B(2) or T

B(3) or T = A(3)+C(3)

T = B(0)

DO I=1,N

A(I) = T

T = A(I) + C(I)

B(I) = T

ENDDO

Page 25: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Dependences Spanning Multiple Iterations Distance for the dependence is 2 iterations

DO I=1, N

A(I) = B(I-1) + B(I+1)

ENDDO

Use 3 different scalar temporaries defined as follows: t1 = B(I-1) t2 = B(I) t3 = B(I+1)

Input dependenceiI

Page 26: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Dependences Spanning Multiple IterationsDO I=1, N

A(I) = B(I-1) + B(I+1)

ENDDO

t1 = B(I-1)

t2 = B(I)

t3 = B(I+1)

DO I=1, N t1 = B(I-1)

t2 = B(I)

t3 = B(I+1)

A(I) = t1 + t3

t1 = t2

t2 = t3

ENDDO

B[0]

B[1]

B[2]

B[3]

B[4]

t1

t2

t3

t1

t2

t3

Page 27: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Dependences Spanning Multiple IterationsDO I=1, N

A(I) = B(I-1) + B(I+1)

ENDDO

t1 = B(I-1)

t2 = B(I)

t3 = B(I+1)

t1 = B(0)

t2 = B(1)

DO I=1, N t3 = B(I+1)

A(I) = t1 + t3

t1 = t2

t2 = t3

ENDDO

2 pipeline cycles/slots wasted for each loop!

Page 28: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Dependences Spanning Multiple IterationsDO I=1, N

A(I) = B(I-1) + B(I+1)

ENDDO

t1 = B(I-1)

t2 = B(I)

t3 = B(I+1)

t1 = B(0)

t2 = B(1)

DO I=1, N t3 = B(I+1)

A(I) = t1 + t3

t1 = t2

t2 = t3

ENDDO

2 pipeline cycles/slots wasted for each loop!

B[0]

B[1]

B[2]

B[3]

B[4]

t1

t2

t3

t1

t2

t3

B[6]

B[5]

Page 29: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Dependences Spanning Multiple Iterations t1 = B(0) t2 = B(1) DO I=1, N

t3 = B(I+1) A(I) = t1 + t3 t1 = B(I+2) A(I+1) = + t2 = B(I+3) A(I+2) = t3 + t2

ENDDO

I = 1 A(1)=B(0)+B(2) I = 2 A(2)=B(1)+B(3) I = 3 A(3)=B(2)+B(4) I = 4 A(4)=B(3)+B(5) …

B[0]

B[1]

B[2]

B[3]

B[4]

t1

t2

t3

B[6]

B[5]

t2 t1t1

,3

Page 30: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Eliminating Scalar Copies

t1 = B(0)

t2 = B(1)

mN3 = MOD(N,3)

DO I = 1, mN3

t3 = B(I+1)

A(I) = t1 + t3

t1 = t2

t2 = t3

ENDDO

DO I = mN3 + 1, N, 3

t3 = B(I+1)

A(I) = t1 + t3

t1 = B(I+2)

A(I+1) = t2 + t1

t2 = B(I+3)

A(I+2) = t3 + t2

ENDDO

Pre-loop the same as the previous solution

Main loop - three sums in one iteration

t1 = B(0)

t2 = B(1)

DO I=1, N

t3 = B(I+1)

A(I) = t1 + t3

t1 = t2

t2 = t3

ENDDO

Page 31: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Scalar Replacement

1. Loop-independent dependence – simple replacement

2. Handling loop-carried dependence – spanning one iteration

3. Dependences spanning multiple iterations

4. Eliminating scalar copies

5. Pruning the dependence graph

6. Moderating register pressure

Page 32: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Break?

Page 33: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Pruning the dependence graph

DO I = 1, N

A(I+1) = A(I-1) + B(I-1)

A(I) = A(I) + B(I) + B(I+1)

ENDDO

I = 1A(2)=A(0)+B(0)A(1)=A(1)+B(1)+B(2)

I = 2A(3)=A(1)+B(1)A(2)=A(2)+B(2)+B(3)

I = 3A(4)=A(2)+B(2)A(3)=A(3)+B(3)+B(4)

Save

last

load

…=V

…=V

Input

Save

first

store

V=…

V=…

Output

Save

nothing

…=V

V=…

Anti-dependence

Save

load

V=…

…=V

True

Page 34: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Pruning the dependence graph

Two kinds of edges to be pruned:

True and input dependence edges which are killed by an intervening assignment to the same location.

Input dependence edges that are redundant.

Page 35: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Three-phase algorithm for pruning edges Phase 1: Eliminate killed dependences

Can be true or input dependences

Looking for the store to the location involved in the dependence between the endpoints of the dependence.

Page 36: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Three-phase algorithm for pruning edges Phase 1: Eliminate killed dependences

True dependence

Looking for output dependence from the source to the assignment before the sink (there should be true dependence from the assignment to the sink).

S1: A(I+1) = …

S2: A(I) = …

S3: … = A(I)True

True

Output

Page 37: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Three-phase algorithm for pruning edges Phase 1: Eliminate killed dependences

Input dependence

Looking for anti-dependence from the source to the assignment before the sink (there should be true dependence from the assignment to the sink).

S1: …=A(I+1)

S2: A(I) = …

S3: … = A(I)Input

True

Anti

Page 38: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Three-phase algorithm for pruning edges Phase 2: Identify generators

Generator is a reference with at least one true or input dependence starting

from it

AND

with no input or true dependence into it

Actually if we are looking on the graph represented only by input and true dependences, generators are the sources.

Page 39: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Three-phase algorithm for pruning edges Phase 3: Find name partitions and eliminate

input dependences

A name partition is a set of references that can be replaced by a reference to a single scalar variable.

A(I+2)

A(I)

B(I)

A(I+1)B(I)

A(I)

B(I)

A(I-1)

2 2

Page 40: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Three-phase algorithm for pruning edges Phase 3: Find name partitions and eliminate

input dependences A name partition is a set of references that can be

replaced by a reference to a single scalar variable.

Starting at each generator mark each reference reachable from the generator by a flow or input dependence as part of the name partition for that variable. (similar to the typed fusion problem)

Eliminate input dependences within same name partition, unless source is generator.

Page 41: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Three-phase algorithm for pruning edges Phase 3 and a half: Anti-dependence

Note that anti-dependence can’t directly give a rise to register reuse and are always pruned as the last step in the graph pruning.

Page 42: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Pruning the dependence graph

DO I = 1, N

A(I+1) = A(I-1) + B(I-1)

A(I) = A(I) + B(I) + B(I+1)

ENDDO

Phase 1 Try to eliminate true

dependences

Try to eliminate input dependences

Phase 2 Identify generators

Candidates are sources of true and input dependences

ii

io

i

1

Page 43: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Pruning the dependence graph

DO I = 1, N

A(I+1) = A(I-1) + B(I-1)

A(I) = A(I) + B(I) + B(I+1)

ENDDO

Phase 3 Starting each

generator mark each reference reachable from the generator by input or true dependence as part of the name partition for that variable.

Phase 3.5

i

i

io

1

Page 44: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Pruning the dependence graph

DO I = 1, N

A(I+1) = A(I-1) + B(I-1)

A(I) = A(I) + B(I) + B(I+1)

ENDDO

ii

io

i

1

How to start scalar replacement from here?

1. Generators: A(I+1) A(I) B(I+1)

2. The number of scalars for each generator = how many iterations are spanned by the dependence.

A(I-1) B(I-1) B(I)

Page 45: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Pruning the dependence graph

DO I = 1, N

A(I+1) = A(I-1) + B(I-1)

A(I) = A(I) + B(I) + B(I+1)

ENDDO

i

i

o

References: A(I-1) = tA1

A(I) = tA2

A(I+1) = tA3

B(I-1) = tB1

B(I) = tB2

B(I+1) = tB3

Page 46: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Pruning the dependence graph

DO I = 1, N

A(I+1) = A(I-1) + B(I-1)

A(I) = A(I) + B(I) + B(I+1)

ENDDO

tA1 = A(I-1), tA2=A(I), tA3=A(I+1)

tB1 = B(I-1), tB2=B(I), tB3=B(I+1)

tA1 = A(0)

tA2 = A(1)

tB1 = B(0)

tB2 = B(1)

DO I = 1, N

tA3 = tA1 + tB1

tB3 = B(I+1)

tA1 = tA2 + tB2 + tB3

// Above: tA1=tA2

A(I) = tA1

tA2 = tA3

tB1 = tB2

tB2 = tB3

ENDDO

A(N+1) = tA3

One Load

One Store

Page 47: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Pruning the dependence graph

DO I = 1, N

A(I+1) = A(I-1) + B(I-1)

A(I) = A(I) + B(I) + B(I+1)

ENDDO

1. Load A(I-1) to R1

2. Load B(I-1) to R2 R1 = R1 + R2

3. Store R1 to A(I+1)

4. Load A(I) to R1

5. Load B(I) to R2 R1 = R1 + R2

6. Load B(I+1) to R2 R1 = R1 + R2

7. Store R1 to A(I)

Page 48: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Pruning the dependence graph

Special cases: References in the Loop

DO I=1, N

= B(I) + C(I,J)

C(I,J) = + D(I)

ENDDO

A(J)

A(J)

1

i

otA

tA

A(J) = tA Assumption: N>1

Page 49: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Pruning the dependence graph

Special cases: Forcing stores and loads

DO I = 1, N

A(I) = A(I-1) + B(I)

A(J) = A(J) + A(I)

ENDDO

I=1

A(1) = A(0) + B(1)

A(J) = A(J) + A(1)

I=2 A(2) = A(1) + B(2) A(J) = A(J) + A(2)

I=3…

Page 50: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Pruning the dependence graph

Special cases: Forcing stores and loads

DO I = 1, N

tAI = A(I-1) + B(I)

A(I) = tAI

A(J) = A(J) + tAI

ENDDO

I=1

A(1) = A(0) + B(1)

A(J) = A(J) + A(1)

I=2 A(2) = A(1) + B(2) A(J) = A(J) + A(2)

I=3…

Page 51: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Pruning the dependence graph

Special cases: Forcing stores and loads

tAI = A(0)

DO I = 1, N

tAI = tAI(was A(I-1))+B(I)

A(J) = A(J) + tAI

A(I) = tAI

ENDDO

J ?= I

I=1

A(1) = A(0) + B(1)

A(J) = A(J) + A(1)

I=2 A(2) = A(1) + B(2) A(J) = A(J) + A(2)

I=3…

Page 52: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Special cases: Forcing stores and loads

tAI = A(0)

DO I = 1, N

tAI = tAI + B(I)

A(J) = A(J) + tAI

A(I) = tAI

ENDDO

A(J) ?= A(I)

DO I = 1, N

tAI = A(I-1) + B(I)

A(J) = A(J) + tAI

A(I) = tAI

ENDDO

tAI = A(0)

tAJ = A(J)

JU = MAX(J-1,0)

DO I = 1, JU

tAI = tAI + B(I)

tAJ = tAJ + tAI

A(I) = tAI

ENDDO

// Here I = J

IF(J.GT.0.AND.J.LE.N) THEN

tAI = tAI + B(J)

tAJ = tAJ + tAI

A(J) = tAI

tAI = tAJ

ENDIF

DO I = JU+2, N //I starting from J+1

tAI = tAI + B(I)

tAJ = tAJ + tAI

A(I) = tAI

ENDDO

A(J) = tAJ

Page 53: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Pruning the dependence graph

Special cases: Inconsistent dependence

DO I = 1, N

A(2*I) = A(I) + B(I)

ENDDO

Bag edge in the typed fusion framework above.

Page 54: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Moderating of Register Pressure

Scalar replacement may produce to many scalar quantities.

Have to choose name partitions for scalar replacement to maximize register usage

Page 55: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Moderating of Register Pressure

M name partitions R: {R1, R2, R3,…,Rm}

The value of the name partition: v(R) – the number of memory loads and stores saved by replacing

The cost of the name partition: c(R) – the number of registers needed to hold all the scalar values.

Page 56: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Moderating of Register Pressure

The desired solution, given the limit of register-resident scalars to use - n:

The sub-collection of name partitions:

or {R1, R2,…,Rm} where Ri=0/1

Such that:

And

1

( )m

i ii

c R R n

1

max[ ( ) ]m

i ii

v R R

1 2{ , ,..., }

ki i iR R R

Page 57: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

0/1 knapsack problem solutions

Dynamic programming solution – O(nm)

Heuristic solution:

Order the name partition set in decreasing order of the ratio v(R)/c(R)

Select elements from the beginning of the list until the registers are full.

Page 58: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Scalar replacement algorithm

For the loops that do not include conditional flow of control:

1. Prune dependence graph. (Apply typed fusion.) Get set of name partitions as the part of the result.

2. Select a set of name partitions using register pressure moderation

Page 59: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Scalar replacement algorithm

For the loops that do not include conditional flow of control:

3. For each selected partition, replace it with a reference to scalars:

A) If non-cyclic, replace using set of temporaries

True dependence is replaced by set of temporaries. The number of temporaries is defined by how many iterations are spanned by the dependence.

Output dependence – move stores after the end of loop.

Input dependence – move loads before the beginning of loop.

Page 60: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Scalar replacement algorithm

For the loops that do not include conditional flow of control:

B) If cyclic replace reference with single temporary

C) For each inconsistent dependence

Use index set splitting or insert loads and stores

4. Unroll loop to eliminate scalar copies

Page 61: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Experimental Data

Speedup = running time of the original program divided by running time of the versionafter scalar replacementLL = Livermore loops

Page 62: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Roadmap

Introduction

Scalar Replacement

Unroll-and-Jam

Page 63: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Unroll-and-Jam

Recall introduction example:

DO I=1,N

DO J=1,M

A(I)=A(I)+B(J)

ENDDO

ENDDO

Page 64: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Unroll-and-Jam

After transformation called unroll-and-jam

Assume Nmod2=0

DO I=1,N,2

DO J=1,M

A(I)=A(I)+B(J)

A(I+1)=A(I+1)+B(J)

ENDDO

ENDDO

This is unroll-and-jam to factor 2

Page 65: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Improving the efficiency of pipelined functional unit

DO J=1, 2M

DO I=1, N

A(I,J) = A(I+1,J) + A(I-1,J)

ENDDO

ENDDO

A(1,J) = A(2,J) + A(0,J)

A(2,J) = A(3,J) + A(1,J)

A(3,J) = A(4,J) + A(2,J)

Page 66: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Improving the efficiency of pipelined functional unit

InstructionFetch

InstructionDecode

OrRegister

Fetch

RegisterFile

ExecuteMemoryAccess

WriteBackTo

RegisterFile

Page 67: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Improving the efficiency of pipelined functional unit

InstructionFetch

InstructionDecode

OrRegister

Fetch

RegisterFile

ExecuteMemoryAccess

WriteBackTo

RegisterFile

Page 68: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Improving the efficiency of pipelined functional unit

InstructionFetch

InstructionDecode

OrRegister

Fetch

RegisterFile

ExecuteMemoryAccess

WriteBackTo

RegisterFile

Forward Unit

Page 69: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Improving the efficiency of pipelined functional unit

InstructionFetch

InstructionDecode

OrRegister

Fetch

RegisterFile

MemoryAccess

WriteBackTo

RegisterFile

Forward Unit

Execute

Page 70: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Improving the efficiency of pipelined functional unit

DO J=1, 2M

DO I=1, N

A(I,J) = A(I+1,J) + A(I-1,J)

ENDDO

ENDDO

A(1,J) = A(2,J) + A(0,J)

A(2,J) = A(3,J) + A(1,J)

A(3,J) = A(4,J) + A(2,J)

A(1,J+1) = A(2,J+1) + A(0,J+1)

A(2,J+1) = A(3,J+1) + A(1,J+1)

A(3,J+1) = A(4,J+1) + A(2,J+1)

Page 71: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Improving the efficiency of pipelined functional unitDO J=1, 2M, 2

DO I=1, N

A(I,J) = A(I+1,J) + A(I-1,J)

A(I,J+1)=A(I+1,J+1)+A(I-1,J+1)

ENDDO

ENDDO

Page 72: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Legality of Unroll-and-Jam

DO I=1, 2N

DO J=1, M

A(I+1, J-1) = A(I,J) + B(I,J)

A(I+2, J-1) = A(I+1,J) + B(I+1,J)

ENDDO

ENDDO

I=1, J=1A(2, 0) = A(1, 1) + B(1, 1)I=1, J=2A(2, 1) = A(1, 2) + B(1, 2)I=1, J=3A(2, 2) = A(1, 3) + B(1, 3)…I=2, J=1A(3, 0) = A(2, 1) + B(2, 1)I=2, J=2A(3, 1) = A(2, 2) + B(2, 2)I=2, J=3A(3, 2) = A(2, 3) + B(2, 3)…

,2

Page 73: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Legality of Unroll-and-Jam

I=1, J=1A(2, 0) = A(1, 1) + B(1, 1)I=1, J=2A(2, 1) = A(1, 2) + B(1, 2)I=1, J=3A(2, 2) = A(1, 3) + B(1, 3)…I=2, J=1A(3, 0) = A(2, 1) + B(2, 1)I=2, J=2A(3, 1) = A(2, 2) + B(2, 2)I=2, J=3A(3, 2) = A(2, 3) + B(2, 3)…

Page 74: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Legality of Unroll-and-Jam

I=1, J=1A(2, 0) = A(1, 1) + B(1, 1)I=1, J=2A(2, 1) = A(1, 2) + B(1, 2)I=1, J=3A(2, 2) = A(1, 3) + B(1, 3)…I=2, J=1A(3, 0) = A(2, 1) + B(2, 1)I=2, J=2A(3, 1) = A(2, 2) + B(2, 2)I=2, J=3A(3, 2) = A(2, 3) + B(2, 3)…

Page 75: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Legality of Unroll-and-Jam

DO I=1, 2N

DO J=1, M

A(I+1, J-1) = A(I,J) + B(I,J)

ENDDO

ENDDO

The direction vector of the loop: [<, >]

Swap [>, <] illegal

Should we assume that loop unroll-and-jam is illegal whenever loop interchange is illegal?

Page 76: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Legality of Unroll-and-Jam

DO I=1, 2N

DO J=1, M

A(I+2, J-1) = A(I,J) + B(I,J)

ENDDO

ENDDO

Direction vector is (<,>); still unroll-and-jam possible

Page 77: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Legality of Unroll-and-Jam

Definition:

An unroll-and-jam to factor n consist of:

Unrolling the outer loop n-1 times to create n copies of the inner loop

And fusing those copies together

Page 78: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Unroll-and-Jam to factor n=2

Recall introduction example:

DO I=1,2N

DO J=1,M

A(I)=A(I)+B(J)

ENDDO

ENDDO

Unrolling the outer loop n-1=1 times to create n=2 copies of the inner loop

DO I=1,2N

DO J=1,M

A(I)=A(I)+B(J)

ENDDO

DO J=1,M

A(I)=A(I)+B(J)

ENDDO

ENDDO

Page 79: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Unroll-and-Jam to factor n=2

Unrolling the outer loop n-1=1 times to create n=2 copies of the inner loop

DO I=1,2N, 2

DO J=1,M

A(I)=A(I)+B(J)

ENDDO

DO J=1,M

A(I+1)=A(I+1)+B(J)

ENDDO

ENDDO

Recall introduction example:

DO I=1,2N

DO J=1,M

A(I)=A(I)+B(J)

ENDDO

ENDDO

Page 80: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Unroll-and-Jam to factor n=2

Unrolling the outer loop n-1=1 times to create n=2 copies of the inner loop

DO I=1,2N, 2

DO J=1,M

A(I)=A(I)+B(J)

ENDDO

DO J=1,M

A(I+1)=A(I+1)+B(J)

ENDDO

ENDDO

Page 81: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Unroll-and-Jam to factor n=2

Unrolling the outer loop n-1=1 times to create n=2 copies of the inner loop

DO I=1,2N, 2

DO J=1,M

A(I)=A(I)+B(J)

ENDDO

DO J=1,M

A(I+1)=A(I+1)+B(J)

ENDDO

ENDDO

And fusing those copies together

DO I=1,2N,2

DO J=1,M

A(I)=A(I)+B(J)

A(I+1)=A(I+1)+B(J)

ENDDO

ENDDO

Page 82: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Legality of Unroll-and-Jam

Theorem:

An unroll-and-jam to factor n is legal if and only if there exist no dependence with direction vector (<, >) such that the distance for the outer loop is < n.

To check this note to use the full dependence graph and not the pruned one.

Page 83: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Legality of Unroll-and-Jam

What happens if there exist dependence with direction vector (<, >) but that the distance for the outer loop is n.

Note that an unroll-and-jam was to factor n

Page 84: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Unroll-and-Jam to factor m Algorithm

1. Create preloop

2. Unroll main loop m times

3. Apply typed fusion to loops within the body of the unrolled loop

4. Apply unroll-and-jam recursively to the inner nested loop

Page 85: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Unroll-and-Jam example

DO I = 1, N

DO K = 1, N

A(I) = A(I) + X(I,K)

ENDDO

DO J = 1, M

DO K = 1, N

B(J,K) = B(J,K) + A(I)

ENDDO

ENDDO

DO J = 1, M

C(J,I) = B(J,N)/A(I)

ENDDO

ENDDO

DO I = Nmod2+1, N, 2

DO K = 1, N

A(I) = A(I) + X(I,K)

A(I+1) = A(I+1) + X(I+1,K)

ENDDO

DO J = 1, M

DO K = 1, N

B(J,K) = B(J,K) + A(I)

B(J,K) = B(J,K) + A(I+1)

ENDDO

C(J,I) = B(J,N)/A(I)

C(J,I+1) = B(J,N)/A(I+1)

ENDDO

ENDDO

Page 86: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

DO I = Nmod2+1, N, 2

DO K = 1, N

A(I) = A(I) + X(I,K)

A(I+1) = A(I+1) + X(I+1,K)

ENDDO

DO J = 1, M

DO K = 1, N

B(J,K) = B(J,K) + A(I)

B(J,K) = B(J,K) + A(I+1)

ENDDO

C(J,I) = B(J,N)/A(I)

C(J,I+1) = B(J,N)/A(I+1)

ENDDO

ENDDO

DO I = Nmod2+1, N, 2

DO K = 1, N

A(I) = A(I) + X(I,K)

A(I+1) = A(I+1) + X(I+1,K)

ENDDO

DO J = 1, Mmod2

DO K = 1, N

B(J,K) = B(J,K) + A(I)

B(J,K) = B(J,K) + A(I+1)

ENDDO

C(J,I) = B(J,N)/A(I)

C(J,I+1) = B(J,N)/A(I+1)

ENDDO

DO J = Mmod2+1, M, 2

DO K = 1, N

B(J,K) = B(J,K) + A(I)

B(J,K) = B(J,K) + A(I+1)

B(J+1,K) = B(J+1,K) + A(I)

B(J+1,K) = B(J+1,K) + A(I+1)

ENDDO

C(J,I) = B(J,N)/A(I)

C(J,I+1) = B(J,N)/A(I+1)

C(J+1,I) = B(J+1,N)/A(I)

C(J+1,I+1) = B(J+1,N)/A(I+1)

ENDDO

ENDDO

Page 87: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Effectiveness of Unroll-and-Jam

Page 88: Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

Improving the efficiency of pipelined functional unit

InstructionFetch

InstructionDecode

OrRegister

Fetch

RegisterFile

ExecuteMemoryAccess

WriteBackTo

RegisterFile