Behnam Robatmili, Katherine E. Coons, Kathryn S. McKinley, and Doug Burger

Register Bank Assignment for Spatially Partitioned Processors

LCPC 2008 April 18, 2023

Motivation

• Spatially partitioned processors
  – Technology-scalable substrate
  – Challenging compilation target

• Partitioned register files
  – Spill code
  – Operand routing latency
  – Bank and network link contention

• Conflicting goals
  – Reduce communication distances
  – Avoid contention
  – Avoid spills

Traditionally, spill costs take priority

Now, spatial locality and contention are important

Bank Allocation Example

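[Figure: example assignments of variables v0 through v3 to register banks B0 through B2, with instructions i0 and i1 placed on execution tiles E0 through E2. Legend: variables, instructions, register banks, execution tiles, network links, flow of data. Numeric labels in the figure: 3, 1, 13, 22.]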

Outline

• Motivation

• Background
  – TRIPS
  – Compiling for TRIPS
  – Baseline Register Allocator

• Bank Allocation Algorithm

• Customizing for TRIPS

• Results

• Conclusions

Register Allocation for EDGE ISAs

• Block atomic execution
  – Instruction groups fetch, execute, and commit atomically

• Direct instruction communication
  – Explicitly encode dataflow graph by specifying targets

[Figure: RISC versus EDGE, showing a centralized register file and register banks B0, B1, and B2.]

TRIPS Microarchitecture

[Figure: TRIPS tile array with register file banks R0 through R3, data cache banks D0 through D3, tile G, and execution tiles E0 through E15. Annotation: single cycle communication latency.]

• TRIPS ISA
  – Up to 128 instructions/block
  – Instructions can be placed anywhere

• TRIPS microarchitecture
  – Up to 8 blocks in flight
  – 1 cycle latency per hop

• TRIPS block constraints
  – Max 128 instructions
  – 32 load and store instructions
  – 32 register reads or writes
  – 8 register reads/writes per bank
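
The block constraints listed on this slide can be read as a simple validity check. The following is a minimal sketch, not the TRIPS compiler's code: the `Block` record and `violated_constraints` function are hypothetical names, and the slide does not say whether reads and writes share one budget or have separate ones, so the sketch checks a single list of accesses (call it once per kind if the budgets are separate).

```python
from collections import Counter
from dataclasses import dataclass, field

# Limits taken from the slide.
MAX_INSTRUCTIONS = 128
MAX_LOAD_STORES = 32
MAX_REG_ACCESSES = 32
MAX_REG_ACCESSES_PER_BANK = 8


@dataclass
class Block:
    """Hypothetical summary of one block (an assumption, not a TRIPS data structure)."""
    num_instructions: int
    num_load_stores: int
    # Bank index of each register access made by the block.
    reg_access_banks: list = field(default_factory=list)


def violated_constraints(block):
    """Return the names of the block constraints that `block` exceeds."""
    violations = []
    if block.num_instructions > MAX_INSTRUCTIONS:
        violations.append("instructions")
    if block.num_load_stores > MAX_LOAD_STORES:
        violations.append("load/store")
    if len(block.reg_access_banks) > MAX_REG_ACCESSES:
        violations.append("register accesses")
    per_bank = Counter(block.reg_access_banks)
    if any(count > MAX_REG_ACCESSES_PER_BANK for count in per_bank.values()):
        violations.append("per-bank register accesses")
    return violations
```

A block that fits returns an empty list; a block with nine accesses to one bank is flagged even though its total is well under 32, which is why bank assignment can invalidate an otherwise legal block.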

Compiling for TRIPS

[Figure: source code is compiled into a control flow graph of blocks B1 through B4; each block is a dataflow graph of instructions (read R1, read R2, mul, add, write R1) that is mapped onto the execution substrate.]

Static instruction placement

TRIPS Compiler Back End

• TRIPS block formation
  – If-conversion
  – Loop peeling
  – While loop unrolling
  – Instruction merging
  – Predicate optimizations

• Resource allocation
  – Register allocation
  – Reverse if-conversion & split
  – Load/store ID assignment
  – SSA for constant outputs

• Scheduling
  – Fanout insertion
  – Instruction placement
  – Target form generation

• Output: TRIPS Assembly Language

• Constraints
  – 128 instructions
  – 32 load/store IDs
  – 32 reg. read/writes (4 banks, 8 per bank)

Baseline Register Allocator

• Linear scan register allocator

• Traverse variables using standard priority function (Chow & Hennessy ‘90):

$$\mathrm{Pr}_{DEF}(vr) = \sum_{i \in LR(vr)} \left( D_i \cdot ST\_COST + U_i \cdot LD\_COST \right)$$

• For each variable, find all available architectural registers

• For each candidate architectural register
  – Check for live range conflicts
  – Check max reads/writes per block constraint

• Spill variable if no candidate meets criteria

• If spill code invalidates blocks, split invalidated blocks and re-allocate
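
The steps on this slide can be sketched as a short loop. This is a minimal illustration, not the TRIPS allocator itself. It assumes that D_i and U_i count definitions and uses of the variable in block i of its live range, that higher priority is allocated first, and that `ST_COST` and `LD_COST` are both 1. The names `pr_def`, `allocate`, `conflicts`, and `fits_block_limits` are hypothetical, and the final split-and-re-allocate step is left out.

```python
ST_COST = 1.0  # assumed store cost; the slide gives no value
LD_COST = 1.0  # assumed load cost; the slide gives no value


def pr_def(live_range):
    """Priority of a variable; live_range is a list of (defs, uses) per block."""
    return sum(d * ST_COST + u * LD_COST for d, u in live_range)


def allocate(variables, registers, conflicts, fits_block_limits):
    """Assign registers in priority order and collect variables that must spill.

    variables: name -> live range, as a list of (defs, uses) pairs
    conflicts: name -> set of names whose live ranges overlap
    fits_block_limits: callback(vr, reg, assignment) -> bool, standing in for
        the max reads/writes per block check
    """
    assignment, spilled = {}, []
    order = sorted(variables, key=lambda v: pr_def(variables[v]), reverse=True)
    for vr in order:
        for reg in registers:
            # Skip registers already held by a conflicting live range.
            if any(assignment.get(other) == reg for other in conflicts.get(vr, ())):
                continue
            if not fits_block_limits(vr, reg, assignment):
                continue
            assignment[vr] = reg
            break
        else:
            # No candidate register met both criteria.
            spilled.append(vr)
    return assignment, spilled
```

With three mutually conflicting variables and two registers, the lowest-priority variable is the one that spills.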

Register Dependence Graph

• First introduced by Hiser et al. (HCSB ‘00)

• Nodes represent variables

• Edge weights indicate affinity between variables

• Use RDG to optimize the critical path
  1. Use ideal schedule to estimate execution time
  2. Estimate arrival time of instruction inputs
  3. Set edge weights based on differences between arrival times to instructions in critical path

Intermediate Representation

    mul t0, vr0, vr1
    not t1, t0
    not t2, vr2
    add t3, vr1, t2
    sub t4, t1, t3

[Figure: the dataflow dependence graph for this code, its ideal schedule, and the resulting register dependence graph over vr0, vr1, and vr2 with weighted edges.]
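
The first two steps (ideal schedule and input arrival times) can be illustrated on the five-instruction example. This is a sketch under stated assumptions: the latencies in `LATENCY` are invented for illustration, register values are taken to be available at cycle 0, and communication delay is ignored. The slides do not give the formula that turns arrival-time differences into RDG edge weights, so the sketch stops at the differences.

```python
# Assumed latencies in cycles; the slides do not list them.
LATENCY = {"mul": 3, "not": 1, "add": 1, "sub": 1}

# The example code: (opcode, destination, sources).
CODE = [
    ("mul", "t0", ["vr0", "vr1"]),
    ("not", "t1", ["t0"]),
    ("not", "t2", ["vr2"]),
    ("add", "t3", ["vr1", "t2"]),
    ("sub", "t4", ["t1", "t3"]),
]


def ideal_schedule(code, latency):
    """ASAP schedule with unlimited resources and no communication delay.

    Returns (ready, arrivals): ready[value] is the cycle the value is
    produced, arrivals[dest][src] is the cycle each input of the
    instruction defining dest becomes available.
    """
    ready, arrivals = {}, {}
    for op, dest, srcs in code:
        # Values not defined in the block (registers) arrive at cycle 0.
        arrivals[dest] = {s: ready.get(s, 0) for s in srcs}
        ready[dest] = max(arrivals[dest].values()) + latency[op]
    return ready, arrivals


def arrival_gap(arrivals, dest):
    """Difference between latest and earliest input arrival for one instruction."""
    times = arrivals[dest].values()
    return max(times) - min(times)
```

Under these assumptions the final `sub` waits on `t1`, which makes the `mul`/`not` chain fed by vr0 and vr1 the critical path, while the inputs fed by vr2 arrive with slack.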

Bank Assignment Algorithm

• Traverse variables in priority order:

$$\mathrm{Pr}_{Spatial}(vr) = 10^{LoopNestingDepth} + NumOfEdges(vr, RDG)$$

• For every variable
  – Find cost for placing it in each bank
  – Choose bank with minimum cost
  – Allocate variable to a register in that bank

• Bank cost
  – Number of variables already allocated to that bank
  – Weights of edges in the RDG
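
As a small illustration of the priority order, the sketch below reads the slide's `10LoopNestingDepth` as 10 raised to the loop nesting depth, which is an assumption about a superscript lost in extraction. The function names and the input format are hypothetical, and sorting highest priority first is also assumed.

```python
def pr_spatial(loop_nesting_depth, num_rdg_edges):
    """Spatial priority, assuming the exponent reading 10 ** depth."""
    return 10 ** loop_nesting_depth + num_rdg_edges


def priority_order(variables):
    """Order variable names by priority, highest first.

    variables: name -> (loop_nesting_depth, num_rdg_edges)
    """
    return sorted(variables, key=lambda v: pr_spatial(*variables[v]), reverse=True)
```

Under this reading, loop depth dominates: a variable at depth 2 with one RDG edge outranks a variable at depth 0 with five edges.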

Bank Score Evaluation

• Evaluation function
  – Bank utilization
  – Dependencies among variables

CalculateBankCost(vr, bank)
    return CalculateDependenceCost(vr, bank) + bank.numAssignedVR

CalculateDependenceCost(vr, bank)
    cost = 0
    for each nvr, an RDG neighbor of vr assigned to NeighborBankSet(bank)
        cost = cost + RDGWeight(vr, nvr)
    return cost
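
The pseudocode can be transcribed almost line for line into runnable form. The sketch below is a literal transcription rather than the authors' implementation. The slide does not define what `NeighborBankSet(bank)` contains or how weight magnitudes relate to affinity, so `neighbor_bank_set` is a caller-supplied function and the RDG is a plain nested dictionary; both representations, and the `choose_bank` helper, are assumptions.

```python
def calculate_dependence_cost(vr, bank, rdg, assigned_bank, neighbor_bank_set):
    """Sum RDG weights to neighbors of vr already placed in NeighborBankSet(bank).

    rdg: name -> {neighbor name: edge weight}
    assigned_bank: name -> bank index, for variables placed so far
    neighbor_bank_set: callback(bank) -> set of bank indices (assumed interface)
    """
    cost = 0
    nearby = neighbor_bank_set(bank)
    for nvr, weight in rdg.get(vr, {}).items():
        # Unassigned neighbors map to None, which is never in the set.
        if assigned_bank.get(nvr) in nearby:
            cost += weight
    return cost


def calculate_bank_cost(vr, bank, rdg, assigned_bank, neighbor_bank_set):
    """Dependence cost plus the number of variables already in the bank."""
    num_assigned = sum(1 for b in assigned_bank.values() if b == bank)
    return (
        calculate_dependence_cost(vr, bank, rdg, assigned_bank, neighbor_bank_set)
        + num_assigned
    )


def choose_bank(vr, banks, rdg, assigned_bank, neighbor_bank_set):
    """Pick the minimum-cost bank; ties go to the first bank in `banks`."""
    return min(
        banks,
        key=lambda b: calculate_bank_cost(vr, b, rdg, assigned_bank, neighbor_bank_set),
    )
```

Keeping the utilization term in the same sum as the dependence term is what lets the allocator trade locality against bank contention with a single comparison.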

Customizing for TRIPS

• Fewer register/data cache banks than execution tiles
  – Heavy traffic between registers and execution tiles
  – Heavy traffic between data cache and execution tiles

• Cost function should separate data cache traffic

TieBreaker(vr, bank1, bank2)
    if (vr.affectedCriticalLoads + vr.affectedCriticalStores > 0)
        return min(bank1, bank2)
    else
        return max(bank1, bank2)
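
A runnable version of the tie-breaker, as a minimal sketch: banks are assumed to be integer indices, and the two counters are passed as plain arguments rather than as fields of a variable object. The slide does not spell out how bank numbers relate to distance from the data cache, so the sketch only mirrors the min/max rule.

```python
def tie_breaker(affected_critical_loads, affected_critical_stores, bank1, bank2):
    """Break a cost tie between two banks.

    Variables that affect critical loads or stores take the lower-numbered
    bank; all other variables take the higher-numbered one.
    """
    if affected_critical_loads + affected_critical_stores > 0:
        return min(bank1, bank2)
    return max(bank1, bank2)
```

The effect is to push the two groups of variables toward opposite ends of the register file whenever the cost function alone cannot decide.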

[Figure: register file banks B0 through B3 shown with the data cache.]

Implemented Allocator

• Bank Oblivious
  – Always assign the next available register
  – Fills each bank before switching to the next bank

• Round Robin
  – Selects banks in a round robin fashion

• HCSB
  – Places dependent variables close together
  – No ideal schedule

• Spatial
  – Uses ideal schedule to reason about critical path
  – Customized bank assignment algorithm for TRIPS
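
The two simplest policies differ only in how the n-th allocation maps to a bank. The sketch below assumes the 128 registers are split evenly over 4 banks (32 per bank) and ignores register reuse after a live range ends; the function names are hypothetical.

```python
NUM_BANKS = 4
REGS_PER_BANK = 32  # assumes 128 registers divided evenly over 4 banks


def bank_oblivious(n):
    """Bank of the n-th allocation when each bank is filled before the next."""
    return (n // REGS_PER_BANK) % NUM_BANKS


def round_robin(n):
    """Bank of the n-th allocation when banks are visited in rotation."""
    return n % NUM_BANKS
```

For a program with few live variables, the bank-oblivious policy keeps every access in one bank, which runs into the per-bank read/write limit sooner than round robin does.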

Spill Code Size

Program   Benchmark suite   Bank oblivious   Round robin   HCSB   Spatial
a2time    EEMBC             111              111           30     31
applu     SPEC              528              514           365    382
apsi      SPEC              328              220           183    183
equake    SPEC              30               30            10     10
mgrid     SPEC              44               21            8      12

• Remaining benchmarks never spill
  – TRIPS has 128 registers
  – Register communication converted to intra-block temporaries

EEMBC Results

[Figure: EEMBC results chart; bar labels 1.33 and 1.39.]

Average 5% improvement

Sample Spatial Allocations

Separate memory traffic

[Figure: HCSB and Spatial allocations for fbital, showing variables v0, v1, and v2 with add (+) and store (st) instructions.]

SPEC Results

[Figure: SPEC results chart; bar labels 1.22, 1.22, and 1.23.]

Average 5% improvement

Conclusions

• Spatial locality among registers matters

• Register dependence graph can help
  – Avoids spilling critical registers
  – Flexible tool to incorporate locality information

• Modeling the topology is important
  – Non-uniform distribution of registers/L1 cache banks
  – Separate different types of traffic

• EDGE ISA eases burden on register allocator
  – Spills are rare
  – Spatial locality and contention become first-order constraints

Questions?
