Post on 20-Dec-2015
Behnam Robatmili, Katherine E. Coons,
Kathryn S. McKinley, and Doug Burger
Register Bank Assignment For Spatially Partitioned Processors
LCPC 2008, April 18, 2023
Motivation
• Spatially partitioned processors
  – Technology scalable substrate
  – Challenging compilation target
• Partitioned register files
  – Spill code
  – Operand routing latency
  – Bank and network link contention
• Conflicting goals
  – Reduce communication distances
  – Avoid contention
  – Avoid spills
Traditionally, spill costs take priority
Now, spatial locality and contention are important
Bank Allocation Example
[Figure: variables v0 to v3 and instructions i0, i1 placed on register banks B0-B2 and execution tiles E0-E2. Legend: variables, instructions, register banks, execution tiles, network links, flow of data.]
Outline
• Motivation
• Background
  – TRIPS
  – Compiling for TRIPS
  – Baseline Register Allocator
• Bank Allocation Algorithm
• Customizing for TRIPS
• Results
• Conclusions
Register Allocation for EDGE ISAs
• Block atomic execution
  – Instruction groups fetch, execute, and commit atomically
• Direct instruction communication
  – Explicitly encode dataflow graph by specifying targets
[Figure: RISC vs. EDGE register organization. Labels: centralized register file; register banks B0, B1, B2.]
TRIPS Microarchitecture
[Figure: TRIPS tile layout with 16 execution tiles (E0-E15), register file tiles R0-R3, data cache tiles D0-D3, and tile G. Single cycle communication latency.]
• TRIPS ISA
  – Up to 128 instructions/block
  – Instructions can be placed anywhere
• TRIPS microarchitecture
  – Up to 8 blocks in flight
  – 1 cycle latency per hop
• TRIPS block constraints
  – Max 128 instructions
  – 32 load and store instructions
  – 32 register reads or writes
  – 8 register reads/writes per bank
Compiling for TRIPS
[Figure: compilation flow from source code to a control flow graph (blocks B1-B4), to a dataflow graph (read R1, read R2, mul, add, write R1), to the execution substrate with registers R1 and R2.]
Static instruction placement
TRIPS Compiler Back End
• TRIPS block formation
  – If-conversion
  – Loop peeling
  – While loop unrolling
  – Instruction merging
  – Predicate optimizations
• Resource allocation
  – Register allocation
  – Reverse if-conversion & split
  – Load/store ID assignment
  – SSA for constant outputs
• Scheduling
  – Fanout insertion
  – Instruction placement
  – Target form generation
• Output: TRIPS assembly language
• Constraints
  – 128 instructions
  – 32 load/store IDs
  – 32 reg. read/writes (4 banks, 8 per bank)
Baseline Register Allocator
• Linear scan register allocator
• Traverse variables using standard priority function (Chow & Hennessy '90):
  $$Pr_{DEF}(vr) = \sum_{i \in LR(vr)} \left( D_i \cdot ST\_COST + U_i \cdot LD\_COST \right)$$
• For each variable, find all available architectural registers
• For each candidate architectural register
  – Check for live range conflicts
  – Check max reads/writes per block constraint
• Spill variable if no candidate meets criteria
• If spill code invalidates blocks, split invalidated blocks and re-allocate
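The priority function above can be sketched in Python. This is only an illustration: it assumes D_i and U_i are the numbers of definitions and uses of the variable in block i of its live range LR(vr), that higher priority is allocated first, and that both spill costs are 1. `BlockUse`, `pr_def`, and the sample live ranges are hypothetical names and data, not taken from the TRIPS compiler.

```python
from dataclasses import dataclass

# Illustrative spill costs; the slides do not give the actual values.
ST_COST = 1.0
LD_COST = 1.0

@dataclass
class BlockUse:
    """Definition and use counts of one variable in one block of its live range."""
    defs: int
    uses: int

def pr_def(live_range):
    """Sum D_i * ST_COST + U_i * LD_COST over the blocks i of the live range."""
    return sum(b.defs * ST_COST + b.uses * LD_COST for b in live_range)

# Hypothetical live ranges for two variables.
live_ranges = {
    "vr0": [BlockUse(defs=1, uses=2), BlockUse(defs=0, uses=1)],
    "vr1": [BlockUse(defs=1, uses=1)],
}

# Assumed convention: variables with higher priority are allocated first.
order = sorted(live_ranges, key=lambda v: pr_def(live_ranges[v]), reverse=True)
```

Variables that would be expensive to spill end up at the front of `order`, so they get first pick of the architectural registers.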
Register Dependence Graph
• First introduced by Hiser et al. (HCSB ‘00)
• Nodes represent variables
• Edge weights indicate affinity between variables
• Use RDG to optimize the critical path
  1. Use ideal schedule to estimate execution time
  2. Estimate arrival time of instruction inputs
  3. Set edge weights based on differences between arrival times to instructions in critical path
Register Dependence Graph
Intermediate Representation
  mul t0,vr0,vr1
  not t1,t0
  not t2,vr2
  add t3,vr1,t2
  sub t4,t1,t3
[Figure: dataflow dependence graph and ideal schedule for the code above, and the resulting register dependence graph over vr0, vr1, vr2 with weighted edges.]
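Steps 1 and 2 of the RDG construction (ideal schedule, then input arrival times) can be sketched on the five-instruction example. The sketch assumes unit latency for every instruction, register inputs available at cycle 0, and no routing delay, so its cycle numbers will not necessarily match the slide's figure. The mapping from arrival-time differences to edge weights is not given on the slides and is left out. All function names are hypothetical.

```python
# Each IR line is (opcode, destination, sources), taken from the example slide.
IR = [
    ("mul", "t0", ["vr0", "vr1"]),
    ("not", "t1", ["t0"]),
    ("not", "t2", ["vr2"]),
    ("add", "t3", ["vr1", "t2"]),
    ("sub", "t4", ["t1", "t3"]),
]

LATENCY = 1  # assumed unit latency for every instruction

def ideal_schedule(ir):
    """Return {value: cycle at which it becomes available} for an ASAP schedule.

    Expects the instructions in dependence order, as in the example.
    """
    ready = {}
    for _op, dest, srcs in ir:
        # Register inputs (vr*) are not in `ready` and are assumed available at cycle 0.
        start = max(ready.get(s, 0) for s in srcs)
        ready[dest] = start + LATENCY
    return ready

def arrival_gaps(ir, ready):
    """For each two-input instruction, the gap between its operands' arrival times."""
    gaps = {}
    for _op, dest, srcs in ir:
        if len(srcs) == 2:
            a, b = (ready.get(s, 0) for s in srcs)
            gaps[dest] = abs(a - b)
    return gaps

ready = ideal_schedule(IR)
gaps = arrival_gaps(IR, ready)
```

A gap of zero means both operands arrive together, so delaying either one delays the instruction. A nonzero gap means the earlier operand has slack.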
Bank Assignment Algorithm
• Traverse variables in priority order:
  $$Pr_{Spatial}(vr) = 10^{LoopNestingDepth} + NumOfEdges(vr, RDG)$$
• For every variable
  – Find cost for placing it in each bank
  – Choose bank with minimum cost
  – Allocate variable to a register in that bank
• Bank cost
  – Number of variables already allocated to that bank
  – Weights of edges in the RDG
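A minimal sketch of the traversal order, reading the loop term of the priority function as an exponent (10 raised to the loop nesting depth). That reading is an assumption, since the slide text loses superscripts. The variable data is hypothetical.

```python
def pr_spatial(loop_nesting_depth, num_rdg_edges):
    """Spatial priority; the exponent form of the loop term is an assumption."""
    return 10 ** loop_nesting_depth + num_rdg_edges

# Hypothetical variables: name -> (loop nesting depth, number of RDG edges).
variables = {"vr0": (0, 3), "vr1": (2, 1), "vr2": (1, 2)}

# Visit the highest-priority variables first.
order = sorted(variables, key=lambda v: pr_spatial(*variables[v]), reverse=True)
```

Under this reading, loop depth dominates the ordering and the RDG edge count mostly separates variables at the same depth.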
Bank Score Evaluation
• Evaluation function
  – Bank utilization
  – Dependencies among variables

CalculateBankCost (vr, bank)
    return CalculateDependenceCost(vr, bank) + bank.numAssignedVR

CalculateDependenceCost (vr, bank)
    cost = 0
    for each nvr RDG neighbor of vr assigned to NeighborBankSet(bank)
        cost = cost + RDG Weight(vr, nvr)
    return cost
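The two cost routines can be rendered in Python as a rough sketch, together with the greedy minimum-cost choice. `NeighborBankSet` is not defined on the slides; here it is assumed to be the bank itself plus its immediate neighbors in a row of four banks. The dictionary-based RDG, the function names, and the sample weights are hypothetical.

```python
NUM_BANKS = 4

def neighbor_bank_set(bank):
    """Assumed NeighborBankSet: the bank itself plus adjacent banks in a row."""
    return {b for b in (bank - 1, bank, bank + 1) if 0 <= b < NUM_BANKS}

def dependence_cost(vr, bank, rdg, assignment):
    """Sum RDG weights to neighbors of vr already assigned within the neighbor set."""
    near = neighbor_bank_set(bank)
    return sum(w for nvr, w in rdg.get(vr, {}).items()
               if assignment.get(nvr) in near)

def bank_cost(vr, bank, rdg, assignment):
    """Dependence cost plus the number of variables already in the bank."""
    num_assigned = sum(1 for b in assignment.values() if b == bank)
    return dependence_cost(vr, bank, rdg, assignment) + num_assigned

def assign_banks(order, rdg):
    """Greedy pass: each variable takes the bank with minimum cost."""
    assignment = {}
    for vr in order:
        # min() keeps the lowest-numbered bank on ties; no tie-breaker is modeled here.
        assignment[vr] = min(range(NUM_BANKS),
                             key=lambda b: bank_cost(vr, b, rdg, assignment))
    return assignment

# Hypothetical RDG as an adjacency map: rdg[a][b] is the weight of edge (a, b).
rdg = {"vr0": {"vr1": 2}, "vr1": {"vr0": 2, "vr2": 1}, "vr2": {"vr1": 1}}
assignment = assign_banks(["vr1", "vr0", "vr2"], rdg)
```

With the cost exactly as written on the slide, a positive edge weight makes nearby banks more expensive and a negative weight would make them cheaper. The slides do not state the sign convention of the weights, so the sample values only exercise the arithmetic.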
Customizing for TRIPS
• Fewer register/data cache banks than execution tiles
  – Heavy traffic between registers and execution tiles
  – Heavy traffic between data cache and execution tiles
• Cost function should separate data cache traffic
TieBreaker (vr, bank1, bank2)
    if (vr.affectedCriticalLoads + vr.affectedCriticalStores > 0)
        return min(bank1, bank2)
    else
        return max(bank1, bank2)
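One way the tie-breaker could plug into bank selection, as a sketch only. `Var` and `choose_bank` are hypothetical helpers, and folding the tie-breaker pairwise over all minimum-cost banks is an assumption about how ties among more than two banks would be resolved.

```python
from dataclasses import dataclass
from functools import reduce

@dataclass
class Var:
    """Hypothetical variable record; field names mirror the slide pseudocode."""
    affected_critical_loads: int = 0
    affected_critical_stores: int = 0

def tie_breaker(vr, bank1, bank2):
    """Lower-numbered bank if vr affects critical loads or stores, else higher."""
    if vr.affected_critical_loads + vr.affected_critical_stores > 0:
        return min(bank1, bank2)
    return max(bank1, bank2)

def choose_bank(vr, costs):
    """Pick the minimum-cost bank, resolving ties pairwise with tie_breaker."""
    best = min(costs.values())
    tied = [b for b, c in costs.items() if c == best]
    return reduce(lambda a, b: tie_breaker(vr, a, b), tied)

# Hypothetical per-bank costs in which banks 1, 2 and 3 tie.
costs = {0: 2, 1: 1, 2: 1, 3: 1}
memory_var = Var(affected_critical_loads=1)
other_var = Var()
```

Here `choose_bank(memory_var, costs)` resolves the tie toward the lowest tied bank and `choose_bank(other_var, costs)` toward the highest, which keeps the two kinds of traffic at opposite ends of the register file.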
[Figure: register file banks B0-B3 and the data cache.]
Implemented Allocator
• Bank Oblivious
  – Always assign the next available register
  – Fills each bank before switching to the next bank
• Round Robin
  – Selects banks in a round robin fashion
• HCSB
  – Places dependent variables close together
  – No ideal schedule
• Spatial
  – Uses ideal schedule to reason about critical path
  – Customized bank assignment algorithm for TRIPS
Spill Code Size

Program   Benchmark suite   Bank oblivious   Round robin   HCSB   Spatial
a2time    EEMBC             111              111           30     31
applu     SPEC              528              514           365    382
apsi      SPEC              328              220           183    183
equake    SPEC              30               30            10     10
mgrid     SPEC              44               21            8      12

• Remaining benchmarks never spill
  – TRIPS has 128 registers
  – Register communication converted to intra-block temporaries
EEMBC Results
[Figure: EEMBC results chart; value labels 1.33, 1.39.]
Average 5% improvement
Sample Spatial Allocations
Separate memory traffic
[Figure: bank allocations chosen by HCSB and by Spatial for the fbital benchmark, showing variables v0, v1, v2 feeding add (+) and store (st) instructions.]
SPEC Results
[Figure: SPEC results chart; value labels 1.22, 1.22, 1.23.]
Average 5% improvement
Conclusions
• Spatial locality among registers matters
• Register dependence graph can help
  – Avoids spilling critical registers
  – Flexible tool to incorporate locality information
• Modeling the topology is important
  – Non-uniform distribution of registers/L1 cache banks
  – Separate different types of traffic
• EDGE ISA eases burden on register allocator
  – Spills are rare
  – Spatial locality and contention become first-order constraints
Questions?