Architecture and Compilation for Reconfigurable Processors

Jason Cong, Yiping Fan, Guoling Han, Zhiru Zhang

Computer Science Department

UCLA

Nov 22, 2004

Outline

Motivation

Application-specific instruction set compilation

Register file data bandwidth problem

Architecture extension – shadow registers

Shadow register binding

Conclusions

Reconfigurable Processor Platform

Reconfigurable processor (RP) core + programmable fabric

RP core supports: basic instruction set + customized instructions

Programmable fabric implements the customized instructions

Either runtime reconfigurable or pre-synthesized

Example: Nios / Nios II from Altera
• Stratix version supported by Nios 3.0 system
• 5 extended instruction formats
• Up to 2048 instructions for each format

[Figure: reconfigurable processor core attached to the CPU bus]

Motivational Example

t1 = a * b;

t2 = b * 2;

t3 = c * 5;

t4 = t1 + t2;

t5 = t2 + t3;

t6 = t5 + t4;

Execution time: 9 clock cycles

*: 2 clock cycles, +: 1 clock cycle

Extended instruction set: I ∪ {extop1, extop2}

[Figure: data flow graph with three multiplications and three additions over inputs a, b, c and constants 2, 5; the clusters extop1 and extop2 are marked]

t1 = extop1(a, b, 2);

t2 = extop2(b, c, 2, 5);

t3 = t1 + t2;

Execution time: 5 clock cycles

Speedup: 1.8

Problem Statement

Given:
• Application program in CDFG G(V, E)
• A processor with basic instruction set I
• Pattern constraints:
  I. Number of inputs less than Nin
  II. 1 output
  III. Total area no more than A

Objective:
• Generate a pattern library P
• Map G to the extended instruction set I ∪ P, so that the total execution time is minimized

Proposed ASIP Compilation Flow

Extended Instruction Candidates Generation
• Satisfying I/O constraints

Extended Instruction Selection
• Select a subset to maximize the potential speedup while satisfying the resource constraint

Code Generation
• Graph covering
• Minimize the total execution time

[Figure: flow diagram with the boxes C, Compilation, CDFG, Pattern Generation, Pattern Selection, ASIP constraints, Pattern Library, Application Mapping, Mapped CDFG, Simulation, Instruction Implementation, ASIP Synthesis, Implementation]

Step 1. Pattern Enumeration

Each pattern is an Nin-feasible cone

Cut enumeration is used to enumerate all the Nin-feasible cones [Cong et al., FPGA’99]

Basic idea: in topological order, merge the cuts of the fan-ins and discard those cuts that are not Nin-feasible

3-feasible cones:

n1: {a, b}

n2: {b, 2}

n3: {c, 5}

n4: {n1, n2}, {n1, b, 2}, {n2, a, b}, {a, b, 2}

[Figure: example DFG with multiplications n1, n2, n3 over inputs a, b, c and constants 2, 5, and additions n4, n5, n6]
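The merge-and-discard idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the cited algorithm's implementation: the function name `enumerate_cuts`, the dictionary encoding of the DFG, and the choice to count constants as leaves (as the slide's listing does) are all hypothetical, and a cut is treated as feasible when it has at most `n_in` leaves.

```python
from itertools import product

def enumerate_cuts(fanins, topo_order, n_in):
    """Return {node: set of cuts}; each cut is a frozenset of at most n_in leaves."""
    cuts = {}
    for node in topo_order:
        node_cuts = {frozenset([node])}  # trivial cut: the node itself
        ins = fanins.get(node, [])
        if ins:
            # Pick one cut from every fan-in, merge them, keep feasible results only.
            for combo in product(*(cuts[f] for f in ins)):
                merged = frozenset().union(*combo)
                if len(merged) <= n_in:
                    node_cuts.add(merged)
        cuts[node] = node_cuts
    return cuts

# Example DFG from the slide (assumed encoding): constants "2" and "5" are leaves.
fanins = {
    "n1": ["a", "b"], "n2": ["b", "2"], "n3": ["c", "5"],
    "n4": ["n1", "n2"], "n5": ["n2", "n3"], "n6": ["n4", "n5"],
}
topo = ["a", "b", "c", "2", "5", "n1", "n2", "n3", "n4", "n5", "n6"]
cuts = enumerate_cuts(fanins, topo, n_in=3)

# Non-trivial cuts of n4, i.e. the 3-feasible cones rooted at n4.
n4_cones = cuts["n4"] - {frozenset(["n4"])}
```

For n4 the sketch yields the same four cones as the slide lists: {n1, n2}, {n1, b, 2}, {n2, a, b} and {a, b, 2}.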

Step 2. Pattern Selection

Basic idea: simultaneously consider speedup, occurrence frequency and area.

Speedup
• Tsw(p) = total execution time with basic instructions
• Thw(p) = length of the critical path of scheduled p
• Speedup(p) = Tsw(p) / Thw(p)

Occurrence
• Some pattern instances may be isomorphic
• Graph isomorphism test [Nauty package]
• For small subgraphs, the isomorphism test is very fast

Gain(p) = Speedup(p) × Occurrence(p)

Selection under area constraint can be formulated as a 0-1 knapsack problem

Example: pattern *+ has Tsw = 3, Thw = 2, Speedup = 1.5

[Figure: example DFG with nodes n1 to n6 over inputs a, b, c and constants 2, 5]
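A minimal sketch of the selection step as a 0-1 knapsack, assuming integer area units and taking the gain of a pattern as its speedup times its occurrence count. The function name `select_patterns`, the tuple layout and every number in the example library are hypothetical; only the *+ latencies (Tsw = 3, Thw = 2) come from the slide.

```python
def select_patterns(patterns, area_budget):
    """0-1 knapsack over patterns given as (name, t_sw, t_hw, occurrence, area).

    Returns (best total gain, names of the selected patterns).
    Areas are assumed to be non-negative integers.
    """
    # best[cap] = (gain, chosen names) achievable with total area <= cap
    best = [(0.0, [])] * (area_budget + 1)
    for name, t_sw, t_hw, occurrence, area in patterns:
        gain = (t_sw / t_hw) * occurrence  # Gain(p) = Speedup(p) x Occurrence(p)
        # Walk capacities downwards so each pattern is used at most once.
        for cap in range(area_budget, area - 1, -1):
            candidate = best[cap - area][0] + gain
            if candidate > best[cap][0]:
                best[cap] = (candidate, best[cap - area][1] + [name])
    return best[area_budget]

# Hypothetical pattern library; occurrence counts and areas are made up.
library = [
    ("mul_add", 3, 2, 2, 4),
    ("mul_mul", 4, 2, 1, 3),
    ("add_add", 2, 1, 1, 5),
]
total_gain, chosen = select_patterns(library, area_budget=8)
```

With this made-up library and a budget of 8 area units, the sketch selects `mul_add` and `mul_mul` for a total gain of 5.0.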

Step 3. Application Mapping

Assume execution on an in-order, single-issue processor

Cover each node in G(V, E) with the extended instruction set to minimize the execution time
• Trivial pattern – software execution time
• Nontrivial pattern – hardware execution time

Total execution time = sum of the execution times of the pattern instances after application mapping

Theorem: The application mapping problem is equivalent to the library-based minimum-area technology mapping problem.
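The cost model behind the covering can be checked on the motivational example with a short Python sketch. It only evaluates a given cover; it does not search for the best one. The node names, the helper `total_time`, and the assignment of nodes to extop1 and extop2 are assumptions inferred from the example (note that n2 is covered twice, once by each extended instruction).

```python
# Software latencies from the motivational example: * takes 2 cycles, + takes 1.
LATENCY = {"*": 2, "+": 1}
NODES = {"n1": "*", "n2": "*", "n3": "*", "n4": "+", "n5": "+", "n6": "+"}

def total_time(instances, nodes=NODES):
    """Sum the cycles of all pattern instances after checking that they cover every node."""
    covered = set().union(*(inst["covers"] for inst in instances))
    if covered != set(nodes):
        raise ValueError("mapping does not cover every node exactly over the node set")
    return sum(inst["cycles"] for inst in instances)

# Trivial cover: every node is its own pattern and runs at software latency.
software = [{"covers": {n}, "cycles": LATENCY[op]} for n, op in NODES.items()]

# Cover with the two extended instructions (2 cycles each) plus one basic add.
extended = [
    {"covers": {"n1", "n2", "n4"}, "cycles": 2},  # extop1(a, b, 2)
    {"covers": {"n2", "n3", "n5"}, "cycles": 2},  # extop2(b, c, 2, 5), duplicates n2
    {"covers": {"n6"}, "cycles": 1},              # remaining addition
]

speedup = total_time(software) / total_time(extended)
```

The two covers evaluate to 9 and 5 cycles, which reproduces the speedup of 1.8 from the motivational example.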

Speedup and Resource Overhead on Nios

                             Speedup               Resource overhead
Benchmark   # Extended       Estimation   Nios     LE     LE %     Memory   Memory %   DSP block
            instructions
fft_br      9                3.28         2.65     408    6.06%    65,536   9.79%      16
iir         7                3.18         3.73     255    3.79%    4,736    0.71%      40
fir         2                2.40         2.14     51     0.76%    1,024    0.15%      8
pr          2                1.57         1.75     71     1.05%    0        0.00%      14
dir         2                3.28         3.02     54     0.80%    0        0.00%      16
mcm         4                4.75         3.22     186    2.76%    0        0.00%      56
Average                      3.08         2.75     -      2.54%    -        1.77%      -

Simulation Environment

SimpleScalar v3.0

Benchmarks
• From MediaBench suite

Machine Configuration
• Single-issue in-order processor (ARM-like)
• DL1: 8KB, 4-way, 1 cycle
• IL1: 8KB, direct mapped, 1 cycle
• Unified L2: 256KB, 4-way, 8 cycles
• Functional units: 2 IntAdd, 1 IntMult, 1 FPAdd, 1 FPMult
• Reconfigurable units: critical path latency of the collapsed instructions

Pattern Distribution

Most of the patterns have fewer than 7 nodes

[Figure: occurrence frequency (0 to 0.3) versus number of nodes in the pattern (2 to 12), for input = 3 and input = 4]

Ideal Speedup under Different Input Size Constraints

[Figure: speedup (0.00 to 0.30) per benchmark (adpcmc, adpcmd, cjpeg, djpeg, epic, unepic, mesaosdemo) for input = 2, input = 3 and input = 4]


Register File Bandwidth Problem

Most of the speedup comes from clusters with more than two inputs

2-port register file in embedded processors
• Need extra cycles to transfer data for extended instructions with more than 2 inputs
• Speedup drop due to communication overhead

$\text{Speedup drop} = \dfrac{\text{Speedup}_{ideal} - \text{Speedup}_{reg}}{\text{Speedup}_{ideal}}$
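A small Python sketch of the two quantities on this slide, reading the speedup drop as the relative loss of the register-file-limited speedup against the ideal one, and assuming one move per operand beyond the available read ports. The helper names and the numbers used in the example are hypothetical and are not measurements from the talk.

```python
def extra_moves(num_inputs, read_ports=2):
    """Moves needed before an extended instruction, assuming one move per operand
    that does not fit through the register file read ports."""
    return max(0, num_inputs - read_ports)

def speedup_drop(speedup_ideal, speedup_reg):
    """Relative loss of speedup caused by the register file bandwidth limit."""
    return (speedup_ideal - speedup_reg) / speedup_ideal

# Hypothetical numbers, for illustration only.
moves_for_3_inputs = extra_moves(3)
drop = speedup_drop(speedup_ideal=0.25, speedup_reg=0.125)
```

With these made-up values a 3-input instruction needs one extra move, and halving the speedup corresponds to a drop of 0.5.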

Speedup Drop with Different Input Constraints

Move operation takes one cycle

46% speedup drop on average

[Figure: speedup drop (0 to 0.6) per benchmark (adpcmc, adpcmd, cjpeg, djpeg, epic, unepic, mesaosdemo) for input = 3 and input = 4]


Architecture Extensions

Existing Solutions

Dedicated Data Link
• Avoids potential resource contention on the bus
• Needs extra cycles for communication
• Employed in MicroBlaze from Xilinx

Multiport Register File
• Low utilization when executing basic instructions
• Area and power grow cubically with the number of ports

Register File Replication
• Predetermined one-to-one correspondence
• Resource waste in terms of area and power
• Limits compiler optimization

Our Approach – Shadow Registers

Core registers are augmented by an extra set of shadow registers
• Conditionally written
• Used only by the custom logic

[Figure: processor core (core register file, execution units, data bus controller) together with shadow registers, custom logic and local memory]

Shadow Registers

Controlling the shadow register

Instruction subword   00                   01                   10                   11
Operation             Forward the result   Forward the result   Forward the result   Skip
Shadow-reg ID         0                    1                    2                    -

Advantages and limitations
• Cost-efficient for a small number of shadow registers
• Only a few control signals need to be added
• Opportunity for compiler optimization
• Requires extra control bits
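One way to read the control table is as a 2-bit decode, sketched below in Python. The interpretation is an assumption: subwords 00, 01 and 10 forward the instruction result to shadow register 0, 1 or 2, and 11 leaves the shadow registers untouched. The function name `decode_shadow_control` is hypothetical.

```python
def decode_shadow_control(subword):
    """Decode the 2-bit control subword into (operation, shadow register id)."""
    if not 0 <= subword <= 0b11:
        raise ValueError("control subword must fit in 2 bits")
    if subword == 0b11:
        return ("skip", None)       # result is not copied to any shadow register
    return ("forward", subword)     # result is also written to shadow register 0, 1 or 2

decoded = [decode_shadow_control(s) for s in range(4)]
```

Under this reading, two extra bits per instruction are enough to steer three shadow registers.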


Internal Representation

2-level CDFG representation
• 1st level: control flow graph
• 2nd level: data flow graph

Computation node: latency & scheduled time slot

Data edge: lifetime

Variable lifetime

i1 = …;

i2 = ext1 (…, i1, …);

i3 = …;

i4 = ext2 (…, i1, …);

i5 = ext3 (…, i3, …);

i6 = ext4 (…, i3, …);

[Figure: scheduled DFG over time slots 1 to 6 with data edges e1, e2, e3, e4]

Lifetime e1 = [2, 2]

Lifetime e2 = [2, 4]

Lifetime i1 = [2, 4]
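The three lifetimes listed for the example are consistent with a simple rule, sketched here in Python as an assumption rather than the talk's definition: a data edge lives from the slot after its producer is scheduled until the slot of its consumer, and a variable lives until its last consumer. The helper names are hypothetical.

```python
def edge_lifetime(def_step, use_step):
    """Lifetime of one data edge: from the slot after the definition to the use."""
    return (def_step + 1, use_step)

def variable_lifetime(def_step, use_steps):
    """Lifetime of a variable: union of the lifetimes of its outgoing data edges."""
    return (def_step + 1, max(use_steps))

# i1 is defined in slot 1 and read by ext1 (slot 2) and ext2 (slot 4).
e1 = edge_lifetime(1, 2)
e2 = edge_lifetime(1, 4)
i1 = variable_lifetime(1, [2, 4])
```

Applied to i1, the rule gives e1 = [2, 2], e2 = [2, 4] and i1 = [2, 4], matching the values on the slide.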

Observation

2-port register file

3-input extended instruction

Without shadow register: 4 additional moves

Binding for 1 register

i1 = …;

i2 = ext1 (…, i1, …);

i3 = …;

i4 = ext2 (…, i1, …);

i5 = ext3 (…, i3, …);

i6 = ext4 (…, i3, …);

[Figure: scheduled DFG over time slots 1 to 6 with data edges e1, e2, e3, e4]

Binding 1: either i1 or i3 in shadow register, save 2 moves

Binding 2: save 3 moves

Register Binding

Which operands should be bound?
• Each input could be a candidate
• Binding different candidates leads to different savings
• Unaffordable to try all the combinations

I1. a = …;
I2. b = …;
I3. c = …;
I4. d = …;
I5. e = …;
I6. … = ext1 ( a, b, c);
I7. … = ext2 ( d, e, a);

One Shadow Register Binding Problem

Problem formulation:

Given
• A scheduled DFG and one shadow register

Objective
• Bind variables to the shadow register
• Minimize the number of moves

Algorithm for Binding One Shadow Register

Weighted compatibility graph
• Vertex <-> data edge in the DFG
• Weight <-> # saves if the value is kept in the register
• Edge <-> lifetimes don’t overlap

Theorem: The binding problem is equivalent to finding a maximum weighted chain in the compatibility graph

Can be optimally solved in time O(|V’| + |E’|)

Extension to K shadow registers
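The one-register case can be sketched as a longest-weighted-path computation over lifetimes. This Python sketch is an illustration under assumptions, not the authors' implementation: it checks compatibility pairwise instead of building the graph explicitly (so it is quadratic in the number of data edges rather than linear in the graph size), and the lifetimes of e3, e4 and all weights are guessed by analogy with e1 and e2 from the earlier example. The name `max_weight_chain` is hypothetical.

```python
def max_weight_chain(lifetimes, weights):
    """Find a maximum-weight chain of mutually non-overlapping lifetimes.

    lifetimes: {edge: (start, end)}, weights: {edge: moves saved}.
    Returns (total weight, chain of edges in schedule order).
    """
    # Sorting by start slot gives a topological order of the compatibility DAG:
    # u can precede v only if u ends before v starts.
    order = sorted(lifetimes, key=lambda v: lifetimes[v][0])
    best, pred = {}, {}
    for i, v in enumerate(order):
        best[v], pred[v] = weights[v], None
        for u in order[:i]:
            compatible = lifetimes[u][1] < lifetimes[v][0]
            if compatible and best[u] + weights[v] > best[v]:
                best[v], pred[v] = best[u] + weights[v], u
    # Walk back from the best chain end to recover the chain.
    node = max(order, key=lambda v: best[v])
    total, chain = best[node], []
    while node is not None:
        chain.append(node)
        node = pred[node]
    return total, chain[::-1]

# Assumed data for the running example: e1, e2 leave i1; e3, e4 leave i3.
lifetimes = {"e1": (2, 2), "e2": (2, 4), "e3": (4, 5), "e4": (4, 6)}
weights = {"e1": 1, "e2": 2, "e3": 1, "e4": 2}
result = max_weight_chain(lifetimes, weights)
```

Under these assumed weights the best chain is e1 followed by e4, saving 3 moves, which agrees with the better of the two bindings shown for the one-register example.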

Experimental Results (1)

Speedup under different number of shadow registers for 3-input extended instructions

[Figure: input = 3; speedup (0.00 to 0.25) per benchmark for Sreg 0, Sreg 1, Sreg 2, Sreg 3 and Ideal]

Experimental Results (2)

Speedup under different number of shadow registers for 4-input extended instructions

[Figure: input = 4; speedup (0.00 to 0.30) per benchmark for Sreg 0, Sreg 1, Sreg 2, Sreg 3 and Ideal]


Conclusions

Proposed and developed a complete compilation flow

Observed and quantitatively analyzed the data bandwidth problem

Proposed a novel architecture extension and an efficient register binding algorithm

Thank You
