Architecture and Compilation for Reconfigurable Processors
Jason Cong, Yiping Fan, Guoling Han, Zhiru Zhang
Computer Science Department
UCLA
Nov 22, 2004

Outline
Motivation
Application-specific instruction set compilation
Register file data bandwidth problem
Architecture extension – shadow registers
Shadow register binding
Conclusions

Reconfigurable Processor Platform
Reconfigurable processor (RP) core + programmable fabric
• RP core supports: basic instruction set + customized instructions
• Programmable fabric implements the customized instructions
• Either runtime reconfigurable or pre-synthesized
Example: Nios / Nios II from Altera
• Stratix version supported by Nios 3.0 system
• 5 extended instruction formats
• Up to 2048 instructions for each format
[Figure: block diagram with the reconfigurable processor core, CPU, and bus]

Motivational Example
Basic instruction set I (*: 2 clock cycles, +: 1 clock cycle):
t1 = a * b;
t2 = b * 2;
t3 = c * 5;
t4 = t1 + t2;
t5 = t2 + t3;
t6 = t5 + t4;
Execution time: 9 clock cycles
Extended instruction set: I ∪ {extop1, extop2}
[Figure: data flow graph of the example (inputs a, b, c; constants 2, 5) with the subgraphs implemented by extop1 and extop2 marked]
t1 = extop1(a, b, 2);
t2 = extop2(b, c, 2, 5);
t3 = t1 + t2;
Execution time: 5 clock cycles
Speedup: 1.8

Problem Statement
Given:
• Application program in CDFG G(V, E)
• A processor with basic instruction set I
• Pattern constraints:
  I. Number of inputs no more than Nin
  II. 1 output
  III. Total area no more than A
Objective:
• Generate a pattern library P
• Map G to the extended instruction set I ∪ P, so that the total execution time is minimized

Proposed ASIP Compilation Flow
Extended instruction candidates generation
• Satisfying I/O constraints
Extended instruction selection
• Select a subset to maximize the potential speedup while satisfying the resource constraint
Code generation
• Graph covering
• Minimize the total execution time
[Figure: compilation flow diagram. Boxes: C, Compilation, CDFG, Pattern Generation, Pattern Selection, Pattern Library, Application Mapping, Mapped CDFG, Simulation, ASIP constraints, Instruction Implementation, ASIP Synthesis, Implementation]

Step 1. Pattern Enumeration
Each pattern is an Nin-feasible cone
Cut enumeration is used to enumerate all the Nin-feasible cones [Cong et al., FPGA'99]
Basic idea: in topological order, merge the cuts of the fan-ins and discard those cuts that are not Nin-feasible
Example, 3-feasible cones:
n1: {a, b}
n2: {b, 2}
n3: {c, 5}
n4: {n1, n2}, {n1, b, 2}, {n2, a, b}, {a, b, 2}
[Figure: data flow graph with inputs a, b, c and constants 2, 5; multiplication nodes n1, n2, n3; addition nodes n4, n5, n6]
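
The cut-merging idea can be sketched in a few lines of Python. This is only an illustration under assumptions, not the implementation from the cited paper: the function name `enumerate_cuts`, the dictionary encoding of the graph, and the wiring of n5 and n6 (taken from the motivational example) are all hypothetical choices. Constants are counted as inputs, as in the 3-feasible cone listing.

```python
from itertools import product

def enumerate_cuts(fanins, order, n_in):
    """Return {node: list of cuts}; each cut is a frozenset of input names.

    fanins maps a node to its fan-in names; nodes without an entry are
    primary inputs or constants. order must be a topological order.
    """
    cuts = {}
    for node in order:
        ins = fanins.get(node, ())
        if not ins:
            # Primary input or constant: only the trivial cut.
            cuts[node] = [frozenset([node])]
            continue
        # Each fan-in offers its own cuts plus the trivial cut {fan-in}.
        choices = [set(cuts[f]) | {frozenset([f])} for f in ins]
        merged = set()
        for combo in product(*choices):
            cut = frozenset().union(*combo)
            if len(cut) <= n_in:  # discard cuts that are not n_in-feasible
                merged.add(cut)
        cuts[node] = sorted(merged, key=lambda c: (len(c), sorted(c)))
    return cuts

# Data flow graph of the motivational example (node names as in the figure).
fanins = {
    "n1": ("a", "b"), "n2": ("b", "2"), "n3": ("c", "5"),
    "n4": ("n1", "n2"), "n5": ("n2", "n3"), "n6": ("n4", "n5"),
}
order = ["a", "b", "c", "2", "5", "n1", "n2", "n3", "n4", "n5", "n6"]
cuts = enumerate_cuts(fanins, order, n_in=3)
```

With Nin = 3, `cuts["n4"]` holds the four cones listed on the slide, while the 4-input cut {b, 2, c, 5} of n5 is discarded.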

Step 2. Pattern Selection
Basic idea: simultaneously consider speedup, occurrence frequency, and area.
Speedup
• Tsw(p) = total execution time with basic instructions
• Thw(p) = length of the critical path of scheduled p
• Speedup(p) = Tsw(p) / Thw(p)
Occurrence
• Some pattern instances may be isomorphic
• Graph isomorphism test [Nauty package]
• For small subgraphs, the isomorphism test is very fast
Gain(p) = Speedup(p) × Occurrence(p)
Selection under the area constraint can be formulated as a 0-1 knapsack problem
Example, pattern *+: Tsw = 3, Thw = 2, Speedup = 1.5
[Figure: data flow graph of the example with nodes n1 to n6, showing instances of the pattern *+]
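
A minimal sketch of the selection step, assuming Gain(p) is the product of speedup and occurrence and that areas are integers. The candidate names, occurrence counts, areas, and the area budget below are made-up illustration values, not data from the talk; only the *+ timing (Tsw = 3, Thw = 2) comes from the slide.

```python
def gain(t_sw, t_hw, occurrence):
    """Gain(p) = Speedup(p) * Occurrence(p), with Speedup(p) = Tsw(p) / Thw(p)."""
    return (t_sw / t_hw) * occurrence

def select_patterns(candidates, area_budget):
    """0-1 knapsack over (name, gain, area) tuples with integer areas.

    Returns (total gain, chosen pattern names) for the given area budget.
    """
    best = [(0.0, [])] * (area_budget + 1)
    for name, g, area in candidates:
        # Go through budgets in descending order so each pattern is used once.
        for a in range(area_budget, area - 1, -1):
            value = best[a - area][0] + g
            if value > best[a][0]:
                best[a] = (value, best[a - area][1] + [name])
    return best[area_budget]

# Hypothetical candidates: (name, gain, area).
candidates = [
    ("mul_add", gain(3, 2, 4), 4),  # the *+ pattern, assumed to occur 4 times
    ("p2", gain(4, 2, 2), 5),
    ("p3", gain(5, 2, 1), 3),
]
total_gain, chosen = select_patterns(candidates, area_budget=8)
```

For these assumed numbers the selection keeps `mul_add` and `p3` (total area 7) and leaves out `p2`, since `mul_add` and `p2` together would exceed the budget.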

Step 3. Application Mapping
Assume execution on an in-order, single-issue processor
Cover each node in G(V, E) with the extended instruction set to minimize the execution time.
• Trivial pattern: software execution time
• Nontrivial pattern: hardware execution time
Total execution time = sum of the execution times of the pattern instances after application mapping
Theorem: The application mapping problem is equivalent to the library-based minimum-area technology mapping problem.
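
The cost model above can be checked on the motivational example. The sketch below only evaluates a given cover; it does not search for the best one. The function name `mapped_time` and the node-to-instruction assignment are assumptions made for illustration; the latencies (2 cycles for *, 1 cycle for +, 2 cycles for each extended instruction) follow the earlier slides.

```python
def mapped_time(nodes, cover):
    """Total execution time of a cover on an in-order, single-issue processor.

    cover is a list of (covered_nodes, latency) pattern instances. Every node
    must be covered at least once; a node may be duplicated across instances.
    """
    covered = set()
    for covered_nodes, _ in cover:
        covered |= set(covered_nodes)
    missing = set(nodes) - covered
    if missing:
        raise ValueError(f"uncovered nodes: {sorted(missing)}")
    return sum(latency for _, latency in cover)

nodes = ["n1", "n2", "n3", "n4", "n5", "n6"]

# Trivial patterns only: three multiplications (2 cycles) and three additions (1 cycle).
software_cover = [({"n1"}, 2), ({"n2"}, 2), ({"n3"}, 2),
                  ({"n4"}, 1), ({"n5"}, 1), ({"n6"}, 1)]

# Assumed mapping: extop1 covers n1, n2, n4; extop2 covers n2, n3, n5 (n2 is duplicated).
extended_cover = [({"n1", "n2", "n4"}, 2), ({"n2", "n3", "n5"}, 2), ({"n6"}, 1)]

t_sw = mapped_time(nodes, software_cover)
t_ext = mapped_time(nodes, extended_cover)
speedup = t_sw / t_ext
```

This reproduces the 9 and 5 clock cycles of the motivational example and the speedup of 1.8.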

Speedup and Resource Overhead on Nios
            # Extended     Speedup              Resource Overhead
Benchmark   Instructions   Estimation   Nios    LE     LE %     Memory   Memory %   DSP Block
fft_br      9              3.28         2.65    408    6.06%    65,536   9.79%      16
iir         7              3.18         3.73    255    3.79%    4,736    0.71%      40
fir         2              2.40         2.14    51     0.76%    1,024    0.15%      8
pr          2              1.57         1.75    71     1.05%    0        0.00%      14
dir         2              3.28         3.02    54     0.80%    0        0.00%      16
mcm         4              4.75         3.22    186    2.76%    0        0.00%      56
Average                    3.08         2.75    -      2.54%    -        1.77%      -

Simulation Environment
SimpleScalar v3.0
Benchmarks
• From MediaBench suite
Machine configuration
• Single-issue in-order processor (ARM-like)
• DL1: 8KB, 4-way, 1 cycle
• IL1: 8KB, direct mapped, 1 cycle
• Unified L2: 256KB, 4-way, 8 cycles
• Functional units: 2 IntAdd, 1 IntMult, 1 FPAdd, 1 FPMult
• Reconfigurable units: critical path latency of the collapsed instructions

Pattern Distribution
Most of the patterns have fewer than 7 nodes
[Figure: occurrence frequency (0 to 0.3) versus # nodes in the pattern (2 to 12), for input = 3 and input = 4]

Ideal Speedup under Different Input Size Constraints
[Figure: speedup (0.00 to 0.30) per benchmark (adpcmc, adpcmd, cjpeg, djpeg, epic, unepic, mesaosdemo), for input = 2, input = 3, and input = 4]

Outline
Motivation
Application-specific instruction set compilation
Register file data bandwidth problem
Architecture extension – shadow registers
Shadow register binding
Conclusions

Register File Bandwidth Problem
Most of the speedup comes from clusters with more than two inputs
2-port register file in embedded processors
Need extra cycles to transfer data for extended instructions with more than 2 inputs
Speedup drop due to communication overhead:
Speedup_drop = (Speedup_ideal - Speedup_reg) / Speedup_ideal
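
As a rough worked example, the speedup drop can be computed for the motivational example, taking the drop as (ideal speedup - speedup with register file limits) / ideal speedup. The sketch assumes that every input beyond the two register file read ports costs one extra move of one cycle, and that constants count as inputs (so extop1 has 3 inputs and extop2 has 4). The helper names are hypothetical.

```python
def extra_moves(num_inputs, read_ports=2):
    """Moves needed to feed an instruction with more inputs than read ports."""
    return max(0, num_inputs - read_ports)

def speedup_drop(speedup_ideal, speedup_reg):
    """Relative loss of speedup caused by the register file bandwidth limit."""
    return (speedup_ideal - speedup_reg) / speedup_ideal

T_SW = 9          # cycles with the basic instruction set (motivational example)
T_IDEAL = 5       # cycles with extop1 and extop2, ignoring operand transfers
MOVE_CYCLES = 1   # assumed cost of one move operation

ext_inputs = {"extop1": 3, "extop2": 4}  # assumed operand counts
moves = sum(extra_moves(k) for k in ext_inputs.values())
t_reg = T_IDEAL + MOVE_CYCLES * moves
drop = speedup_drop(T_SW / T_IDEAL, T_SW / t_reg)
```

Under these assumptions three moves are added, the execution time grows from 5 to 8 cycles, and the speedup falls from 1.8 to 1.125, a drop of roughly 0.375.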

Speedup Drop with Different Input Constraints
[Figure: speedup drop (0 to 0.6) per benchmark (adpcmc, adpcmd, cjpeg, djpeg, epic, unepic, mesaosdemo), for input = 3 and input = 4]
Move operation takes one cycle
46% speedup drop on average

Outline
Motivation
Application-specific instruction set compilation
Register file data bandwidth problem
Architecture extension – shadow registers
Shadow register binding
Conclusions

Architecture Extensions
Existing solutions
Dedicated data link
• Avoids potential resource contention on the bus
• Needs extra cycles for communication
• Employed in MicroBlaze from Xilinx
Multiport register file
• Low utilization when executing basic instructions
• Area and power grow cubically with the number of ports
Register file replication
• Predetermined one-to-one correspondence
• Wastes resources in terms of area and power
• Limits compiler optimization

Our Approach – Shadow Registers
Core registers are augmented by an extra set of shadow registers
• Conditionally written
• Used only by the custom logic
[Figure: block diagram with the processor core (core register file, execution units), data bus controller, shadow registers, custom logic, and local memory]

Shadow Registers
Controlling the shadow register
Instruction subword   00                   01                   10                   11
Operation             Forward the result   Forward the result   Forward the result   Skip
Shadow-reg ID         0                    1                    2                    -
Advantages and limitations
• Cost-efficient for a small number of shadow registers
• Only a few control signals need to be added
• Opportunity for compiler optimization
• Requires extra control bits
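
The control encoding can be illustrated with a small behavioral sketch. It assumes a 2-bit subword in which 00, 01, and 10 forward the result to shadow register 0, 1, or 2 and 11 skips the shadow write, and it further assumes that the result is always written to the core register file as well. The function names and the list-based register model are hypothetical.

```python
SKIP = 0b11  # subword value that leaves all shadow registers untouched

def shadow_target(subword):
    """Decode the 2-bit subword into a shadow register ID, or None for skip."""
    if not 0 <= subword <= 0b11:
        raise ValueError("subword must fit in 2 bits")
    return None if subword == SKIP else subword

def write_back(core_regs, shadow_regs, dest, value, subword):
    """Write a result to the core register file and, conditionally, to a shadow register."""
    core_regs[dest] = value  # assumption: the core register is always written
    target = shadow_target(subword)
    if target is not None:
        shadow_regs[target] = value  # forward the result

core_regs = [0] * 8
shadow_regs = [0] * 3
write_back(core_regs, shadow_regs, dest=4, value=17, subword=0b01)  # forward to shadow register 1
write_back(core_regs, shadow_regs, dest=5, value=23, subword=0b11)  # skip
```

After these two writes, core registers 4 and 5 hold 17 and 23, and only shadow register 1 has been updated.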

Outline
Motivation
Application-specific instruction set compilation
Register file data bandwidth problem
Architecture extension – shadow registers
Shadow register binding
Conclusions

Internal Representation
2-level CDFG representation
• 1st level: control flow graph
• 2nd level: data flow graph
Computation node: latency and scheduled time slot
Data edge: lifetime
Variable lifetime
Example:
i1 = …;
i2 = ext1 (…, i1, …);
i3 = …;
i4 = ext2 (…, i1, …);
i5 = ext3 (…, i3, …);
i6 = ext4 (…, i3, …);
[Figure: scheduled data flow graph of the example over time slots 1 to 6, with data edges e1 to e4]
Lifetime of e1 = [2, 2]
Lifetime of e2 = [2, 4]
Lifetime of i1 = [2, 4]
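
The lifetimes above can be reproduced with a short sketch, assuming that instruction ik is scheduled in time slot k, that every instruction has a latency of one slot, and that e1 and e2 are the edges from i1 to ext1 (i2) and ext2 (i4). The function names and the schedule encoding are illustrative assumptions.

```python
def edge_lifetime(schedule, latency, producer, consumer):
    """Lifetime of a data edge: from the slot the value is ready to the slot it is read."""
    return (schedule[producer] + latency[producer], schedule[consumer])

def variable_lifetime(schedule, latency, producer, consumers):
    """Lifetime of a variable: from the slot it is ready to its last use."""
    start = schedule[producer] + latency[producer]
    return (start, max(schedule[c] for c in consumers))

# Assumed schedule: instruction ik runs in time slot k with unit latency.
schedule = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5, "i6": 6}
latency = {name: 1 for name in schedule}

e1 = edge_lifetime(schedule, latency, "i1", "i2")           # i1 -> ext1
e2 = edge_lifetime(schedule, latency, "i1", "i4")           # i1 -> ext2
i1 = variable_lifetime(schedule, latency, "i1", ["i2", "i4"])
```

Under these assumptions the sketch gives [2, 2] for e1 and [2, 4] for both e2 and i1, matching the slide.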

Observation
2-port register file
3-input extended instruction
Without shadow register: 4 additional moves
Binding for 1 register:
i1 = …;
i2 = ext1 (…, i1, …);
i3 = …;
i4 = ext2 (…, i1, …);
i5 = ext3 (…, i3, …);
i6 = ext4 (…, i3, …);
[Figure: scheduled data flow graph of the example over time slots 1 to 6, with data edges e1 to e4]
Binding 1: either i1 or i3 in the shadow register, saves 2 moves
Binding 2: saves 3 moves

Register Binding
Which operands should be bound?
• Each input could be a candidate
• Binding different candidates leads to different savings
• Unaffordable to try all the combinations
I1. a = …;
I2. b = …;
I3. c = …;
I4. d = …;
I5. e = …;
I6. … = ext1 ( a, b, c);
I7. … = ext2 ( d, e, a);

One Shadow Register Binding Problem
Problem formulation:
Given
• A scheduled DFG and one shadow register
Objective
• Bind variables to the shadow register
• Minimize the number of moves

Algorithm for Binding One Shadow Register
Weighted compatibility graph
• Vertex <-> data edge in the DFG
• Weight <-> # saves if the value is kept in the register
• Edge <-> lifetimes don't overlap
Theorem: The binding problem is equivalent to finding a maximum-weight chain in the compatibility graph
• Can be optimally solved in time O(|V'| + |E'|)
Extension to K shadow registers
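
A sketch of the maximum-weight chain computation for one shadow register, under stated assumptions rather than as the authors' implementation. Vertices are data edges with lifetime intervals; u is compatible with a later v when the lifetime of u ends before the lifetime of v starts. The lifetimes of e3 and e4 and all four weights are assumptions derived from the running example (slot k for instruction ik, unit latency; keeping a value until its second use saves 2 moves). Once the graph is built, the dynamic programming pass visits each vertex and each edge once.

```python
def max_weight_chain(lifetimes, weights):
    """Maximum-weight chain in the compatibility graph of data edges.

    lifetimes: {vertex: (start, end)}, weights: {vertex: saved moves}.
    Returns (total weight, chain as a list of vertices).
    """
    # Sorting by start slot is a topological order: u -> v implies start(u) < start(v).
    order = sorted(lifetimes, key=lambda v: lifetimes[v][0])
    succ = {u: [v for v in order if lifetimes[u][1] < lifetimes[v][0]] for u in order}
    best = {v: weights[v] for v in order}
    prev = {v: None for v in order}
    for u in order:
        for v in succ[u]:
            if best[u] + weights[v] > best[v]:
                best[v] = best[u] + weights[v]
                prev[v] = u
    node = max(order, key=lambda v: best[v])
    total, chain = best[node], []
    while node is not None:
        chain.append(node)
        node = prev[node]
    return total, chain[::-1]

# Assumed data for the running example.
lifetimes = {"e1": (2, 2), "e2": (2, 4), "e3": (4, 5), "e4": (4, 6)}
weights = {"e1": 1, "e2": 2, "e3": 1, "e4": 2}
saved, chain = max_weight_chain(lifetimes, weights)
```

With these assumed weights the best chain is e1 followed by e4: the shadow register holds i1 for ext1 only and then holds i3 for ext3 and ext4, which saves 3 moves, in line with Binding 2 on the Observation slide.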

Experimental Results (1)
[Figure: speedup (0.00 to 0.25) per benchmark (adpcmc, adpcmd, cjpeg, djpeg, epic, unepic, mesaosdemo) for Sreg 0, Sreg 1, Sreg 2, Sreg 3, and Ideal; input = 3]
Speedup under different numbers of shadow registers for 3-input extended instructions

Experimental Results (2)
[Figure: speedup (0.00 to 0.30) per benchmark (adpcmc, adpcmd, cjpeg, djpeg, epic, unepic, mesaosdemo) for Sreg 0, Sreg 1, Sreg 2, Sreg 3, and Ideal; input = 4]
Speedup under different numbers of shadow registers for 4-input extended instructions

Conclusions
Proposed and developed a complete compilation flow
Observed and quantitatively analyzed the data bandwidth problem
Proposed a novel architecture extension and an efficient register binding algorithm
Thank You