
University of Michigan
Electrical Engineering and Computer Science

Compiler-directed Synthesis of Programmable Loop Accelerators

Kevin Fan, Hyunchul Park, Scott Mahlke
September 25, 2004
EDCEP Workshop

Loop Accelerators

• Hardware implementation of a critical loop nest
  – Hardwired state machine
  – Digital camera application: 1000x vs. Pentium III
  – Multiple accelerators hooked up in a pipeline

• Loop accelerator vs. customized processor
  – 1 block of code vs. multiple blocks
  – Trivial control flow vs. handling generic branches
  – Traditionally state machine vs. instruction driven

Programmable Loop Accelerators

• Goals
  – Multifunction accelerators: accelerator hardware can handle multiple loops (re-use)
  – Post-programmable: to a degree, allow changes to the application
  – Use compiler as architecture synthesis tool
• But …
  – Don’t build a customized processor
  – Maintain ASIC-level efficiency

NPA (Nonprogrammable Accelerator) Synthesis in PICO

[Figure: NPA synthesis in PICO. Inputs: sequential loop nest and performance requirement. Output: systolic array datapath with systolic arrays, a controller, a coprocessor interface to the external bus, data in and data out ports, and command and timing signals. Inset: systolic processor datapath with load units for y, w, and x.]

PICO Frontend

    for i = 1 to ni
      for j = 1 to nj
        y[i] += w[j] * x[i+j]

    for jt = 1 to 100 step 10
      for t = 0 to 502
        for p = 0 to 1
          (i,j) = function of (t,p)
          if (i>1) W[t][p] = W[t-5][p] else w[jt+j]
          if (i>1 && j<bj) X[t][p] = X[t-4][p+1] else x[i+jt+j]
          Y[t][p] += W[t][p] * X[t][p]

• Goals
  – Exploit loop-level parallelism
  – Map loop to abstract hardware
  – Manage global memory bandwidth

• Steps
  – Tiling
  – Load/store elimination
  – Iteration mapping
  – Iteration scheduling
  – Virtual processor clustering
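The tiling step can be illustrated on the sequential loop nest shown on this slide. What follows is only a minimal C sketch, assuming zero-based indexing and a hypothetical tile size parameter `tile`; the function names `fir_untiled` and `fir_tiled` are illustrative and are not taken from PICO. It shows that splitting the `j` loop into tiles leaves the result unchanged while limiting how much of `w` and `x` each pass touches.

```c
#include <assert.h>
#include <stddef.h>

/* Reference loop nest: y[i] += w[j] * x[i + j] (zero-based here). */
void fir_untiled(int *y, const int *w, const int *x, size_t ni, size_t nj)
{
    for (size_t i = 0; i < ni; i++)
        for (size_t j = 0; j < nj; j++)
            y[i] += w[j] * x[i + j];
}

/* Same computation with the j loop split into tiles of `tile` iterations.
 * The outer jt loop walks tile origins; the inner loops cover one tile. */
void fir_tiled(int *y, const int *w, const int *x,
               size_t ni, size_t nj, size_t tile)
{
    assert(tile > 0);
    for (size_t jt = 0; jt < nj; jt += tile) {
        /* Clamp the last tile when nj is not a multiple of tile. */
        size_t jend = (jt + tile < nj) ? jt + tile : nj;
        for (size_t i = 0; i < ni; i++)
            for (size_t j = jt; j < jend; j++)
                y[i] += w[j] * x[i + j];
    }
}
```

Both functions expect `x` to hold at least `ni + nj - 1` elements and `y` to be initialized by the caller. The other frontend steps (load/store elimination, iteration mapping and scheduling) are not modeled here.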

PICO Backend

• Resource allocation (II, operation graph)
• Synthesize machine description for “fake” fully connected processor with allocated resources

[Figure: backend stages. Fully connected processor with allocated resources (select and load units) → Reduced VLIW processor after modulo scheduling → Data/control-path synthesis → NPA]

PICO Methodology: Why Does It Work?

• Systematic design methodology
  – 1. Parameterized meta-architecture: all NPAs have the same general organization
  – 2. Performance/throughput is input
  – 3. Abstract architecture: we know how to build compilers for this
  – 4. Mapping mechanism: determine architecture specifics from schedule for abstract architecture

Direct Generalization of PICO?

[Figure: datapath of select and load units]

• Programmability would require full interconnect between elements
• Back to the meta-architecture!

• Generalize connectivity to enable post-programmability
• But stylize it

Programmable Loop Accelerator – Design Strategy

• Compile for partially defined architecture
  – Build long distance communication into schedule
  – Limit global communication bandwidth

• Proposed meta-architecture
  – Multi-cluster VLIW
    • Explicit inter-cluster transfers (varying latency/BW)
    • Intra-cluster communication is complete
  – Hardware partially defined: expensive units

Programmable Loop Accelerator Schema

[Figure: programmable loop accelerator schema. Labeled parts: pipeline of tiled or clustered accelerators; accelerator; control unit; stream units; stream buffer; accelerator datapath; intra-cluster communication; shift register; inter-cluster register file.]

Flow Diagram

Assembly code, II → FU Alloc → Partition → Modulo Schedule → Loop Accelerator

• FU Alloc: # clusters, # expensive FUs
• Partition: # cheap FUs, FUs assigned to clusters
• Modulo Schedule: shift register depth, width, porting; intercluster bandwidth
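The parameters that the flow decides can be gathered into one record. The C sketch below is only an illustration under stated assumptions: the type names `pla_cluster` and `pla_config`, the bound `PLA_MAX_CLUSTERS`, and the choice to store shift register parameters per cluster are all hypothetical and do not come from the tool itself.

```c
#include <assert.h>

#define PLA_MAX_CLUSTERS 8  /* arbitrary bound for this sketch */

/* Hypothetical per-cluster record of what the flow decides. */
struct pla_cluster {
    int expensive_fus;   /* MPY, DIV, memory units (from FU allocation) */
    int cheap_fus;       /* ADD, logic units (from partitioning) */
    int sr_depth;        /* shift register depth (from modulo scheduling) */
    int sr_width;        /* shift register width in bits */
    int sr_read_ports;   /* shift register read ports */
};

/* Hypothetical description of one synthesized loop accelerator. */
struct pla_config {
    int ii;                  /* initiation interval (an input to the flow) */
    int num_clusters;        /* from FU allocation */
    int intercluster_bw;     /* inter-cluster moves supported per cycle */
    struct pla_cluster cluster[PLA_MAX_CLUSTERS];
};

/* Total number of function units across all clusters. */
int pla_total_fus(const struct pla_config *cfg)
{
    assert(cfg->num_clusters >= 0 && cfg->num_clusters <= PLA_MAX_CLUSTERS);
    int total = 0;
    for (int c = 0; c < cfg->num_clusters; c++)
        total += cfg->cluster[c].expensive_fus + cfg->cluster[c].cheap_fus;
    return total;
}
```

Each stage would fill in its own fields: FU allocation sets `num_clusters` and `expensive_fus`, partitioning sets `cheap_fus`, and modulo scheduling sets the shift register fields and `intercluster_bw`.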

Sobel Kernel

    for (i = 0; i < N1; i++) {
      for (j = 0; j < N2; j++) {
        int t00, t01, t02, t10, t12, t20, t21, t22;
        int e1, e2, e12, e22, e, tmp;

        /* Load the 3x3 neighborhood; the center pixel is not used. */
        t00 = x[i  ][j  ]; t01 = x[i  ][j+1]; t02 = x[i  ][j+2];
        t10 = x[i+1][j  ];                    t12 = x[i+1][j+2];
        t20 = x[i+2][j  ]; t21 = x[i+2][j+1]; t22 = x[i+2][j+2];

        /* Gradients along the two axes. */
        e1 = ((t00 + t01) + (t01 + t02)) - ((t20 + t21) + (t21 + t22));
        e2 = ((t00 + t10) + (t10 + t20)) - ((t02 + t12) + (t12 + t22));

        /* Squared gradient magnitude, compared against the threshold. */
        e12 = e1*e1;
        e22 = e2*e2;
        e = e12 + e22;
        if (e > threshold) tmp = 1; else tmp = 0;
        edge[i][j] = tmp;
      }
    }

FU Allocation

• Determine number of clusters
• Determine number of expensive FUs
  – MPY, DIV, memory
  – # units of a type = (# ops of that type) / II, rounded up
• Sobel with II=4
  – 41 ops → 3 clusters
  – 2 MPY ops → 1 multiplier
  – 9 memory ops → 3 memory units
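The Sobel numbers can be reproduced with ceiling divisions. This is a minimal C sketch, not the tool's actual allocator. The rule for expensive units (operations of a type divided by II, rounded up) is the standard resource bound for modulo scheduling. The cluster rule is an assumption: it supposes each cluster holds 4 FUs, as in the test-case machines, so that a cluster can issue at most 4 × II operations. The function names are hypothetical.

```c
#include <assert.h>

/* Integer ceiling division for a >= 0 and b > 0. */
int ceil_div(int a, int b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* Units of one expensive type needed to issue `ops` operations
 * of that type every II cycles. */
int expensive_units(int ops, int ii)
{
    return ceil_div(ops, ii);
}

/* Clusters needed if each cluster holds `fus_per_cluster` FUs, i.e. can
 * issue at most fus_per_cluster * II operations (assumed rule). */
int num_clusters(int total_ops, int ii, int fus_per_cluster)
{
    return ceil_div(total_ops, fus_per_cluster * ii);
}
```

With the Sobel inputs, `num_clusters(41, 4, 4)` gives 3, `expensive_units(2, 4)` gives 1 multiplier, and `expensive_units(9, 4)` gives 3 memory units, matching the figures on the slide.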

Partitioning

• Multi-level approach consists of two phases
  – Coarsening
  – Refinement

• Minimize inter-cluster communication
• Load balance
  – Max of 4 × II operations per cluster
• Take FU allocation into account
  – Restricted # of expensive units
  – # of cheap units (ADD, logic) determined from partition

Coarsening

• Group highly related operations together
  – Pair operations together at each step
  – Forces partitioner to consider several operations as a single unit
• Coarsening Sobel subgraph into 2 groups:

[Figure: a Sobel subgraph of ADD (+) and LOAD (L) operations, shown at four successive coarsening steps]
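One pairing step of coarsening can be sketched as a greedy matching on the operation graph. The C code below is a simplified illustration, not the partitioner used in this work: it assumes a symmetric weight matrix in which a larger weight means two operations are more closely related, and it visits operations in index order. The names `coarsen_step` and `MAX_OPS` are hypothetical.

```c
#include <assert.h>

#define MAX_OPS 16  /* arbitrary bound for this sketch */

/* One coarsening step: greedily pair each ungrouped operation with the
 * ungrouped neighbor it shares the heaviest edge with.
 * weight[a][b] > 0 means a and b are connected (matrix is symmetric).
 * On return group[v] is the coarse node of v; returns the group count. */
int coarsen_step(int n, int weight[MAX_OPS][MAX_OPS], int group[MAX_OPS])
{
    assert(n >= 0 && n <= MAX_OPS);
    int n_groups = 0;

    for (int v = 0; v < n; v++)
        group[v] = -1;                  /* -1 marks "not yet grouped" */

    for (int v = 0; v < n; v++) {
        if (group[v] != -1)
            continue;
        int best = -1;
        for (int u = 0; u < n; u++)
            if (u != v && group[u] == -1 && weight[v][u] > 0 &&
                (best == -1 || weight[v][u] > weight[v][best]))
                best = u;
        group[v] = n_groups;
        if (best != -1)
            group[best] = n_groups;     /* v and best become one unit */
        n_groups++;                     /* an unpaired op stays alone */
    }
    return n_groups;
}
```

Repeating the step on the graph of groups, with edge weights summed between groups, would shrink the graph further until the desired number of groups remains. That rebuild is left out here.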

Refinement

• Move operations between clusters
• Good moves:
  – Reduce inter-cluster communication
  – Improve load balance
  – Reduce hardware cost
    • Reduce number of expensive units to meet limit
    • Collect similar bitwidth operations together

[Figure: Sobel subgraph of ADD (+) and LOAD (L) operations]
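Whether a move is good can be scored by how many cut edges it removes. The following C sketch covers only the communication and load-balance criteria, under assumptions: the names `dep`, `move_gain`, and `move_is_legal` are hypothetical, the per-cluster budget is taken to be FUs per cluster times II, and hardware cost (expensive units, bitwidth grouping) is not modeled.

```c
#include <assert.h>
#include <stddef.h>

struct dep { int src; int dst; };   /* producer op -> consumer op */

/* Reduction in cut edges if operation `op` moves to cluster `to`.
 * Positive means less inter-cluster communication after the move. */
int move_gain(const struct dep *deps, size_t n_deps,
              const int *cluster_of, int op, int to)
{
    int from = cluster_of[op];
    int gain = 0;

    if (to == from)
        return 0;                       /* not a move at all */
    for (size_t d = 0; d < n_deps; d++) {
        int other;
        if (deps[d].src == op)
            other = deps[d].dst;
        else if (deps[d].dst == op)
            other = deps[d].src;
        else
            continue;                   /* edge does not touch op */
        if (other == op)
            continue;                   /* a self edge is never cut */
        if (cluster_of[other] == to)
            gain++;                     /* edge stops being cut */
        else if (cluster_of[other] == from)
            gain--;                     /* edge becomes cut */
        /* edges to any third cluster stay cut either way */
    }
    return gain;
}

/* A move is legal only if the target cluster stays within its operation
 * budget, assumed here to be fus_per_cluster * II. */
int move_is_legal(const int *load, int to, int fus_per_cluster, int ii)
{
    assert(fus_per_cluster > 0 && ii > 0);
    return load[to] + 1 <= fus_per_cluster * ii;
}
```

A refinement pass could repeatedly apply the legal move with the highest positive gain until none remains.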

Partitioning Example

• From sobel, II=4
• Place MPYs together
• Place each tree of ADD-LOAD-ADDs together
• Cuts 6 edges
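The cut count of a partition, and the communication rate it implies, can be computed directly. The C sketch below is an illustration with hypothetical names (`edge`, `count_cut_edges`, `moves_per_cycle`). It assumes one inter-cluster move per cut edge per loop iteration, which may overcount when one value feeds several consumers in the same remote cluster.

```c
#include <assert.h>
#include <stddef.h>

struct edge { int src; int dst; };  /* producer op -> consumer op */

/* Count dataflow edges whose endpoints sit in different clusters. */
int count_cut_edges(const struct edge *edges, size_t n_edges,
                    const int *cluster_of)
{
    int cut = 0;
    for (size_t e = 0; e < n_edges; e++)
        if (cluster_of[edges[e].src] != cluster_of[edges[e].dst])
            cut++;
    return cut;
}

/* Average inter-cluster moves per cycle, assuming one move per cut edge
 * per iteration and one iteration started every II cycles. */
double moves_per_cycle(int cut_edges, int ii)
{
    assert(ii > 0);
    return (double)cut_edges / ii;
}
```

Under that assumption, 6 cut edges at II=4 give 6 / 4 = 1.5 moves per cycle, which agrees with the sobel figure reported on the Cross Compile Results slide. The slides do not state that the figure was derived this way.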

Modulo Scheduling

• Determines shift register width, depth, and number of read ports
• Sobel II=4

[Figure: modulo schedule table with columns cycle, FU0, FU1, FU2, FU3; only two ADD entries survive]

FU   Cycle   Max result lifetime   Req’d depth   Req’d ports
0    2       4                     4             1
1    1       2                     4             2
2    4       1                     1             1
3    0       -                     1             1
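One plausible way to size a shift register from a finished schedule is sketched below in C. Every rule in it is an assumption made for illustration and may differ from how the values in the table were derived: depth is taken as the longest result lifetime on an FU, width as the widest result, and read ports as the largest number of reads falling in the same slot of the II-cycle schedule. The names `result`, `sr_size`, `size_shift_register`, `read_ports`, and `MAX_II` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_II 64  /* arbitrary bound for this sketch */

/* One scheduled value produced by an FU (hypothetical record). */
struct result {
    int fu;         /* producing function unit */
    int lifetime;   /* cycles from production to last read */
    int bits;       /* bit width of the value */
};

struct sr_size { int depth; int width; };

/* Size the shift register of `fu`: deep enough for its longest-lived
 * result and wide enough for its widest result (assumed rules). */
struct sr_size size_shift_register(const struct result *r, size_t n, int fu)
{
    struct sr_size s = { 0, 0 };
    for (size_t i = 0; i < n; i++) {
        if (r[i].fu != fu)
            continue;
        assert(r[i].lifetime >= 0 && r[i].bits > 0);
        if (r[i].lifetime > s.depth)
            s.depth = r[i].lifetime;
        if (r[i].bits > s.width)
            s.width = r[i].bits;
    }
    return s;
}

/* Read ports for `fu`: the most reads of its register that land in any
 * one slot (cycle mod II). read_fu[i] and read_cycle[i] describe read i. */
int read_ports(const int *read_fu, const int *read_cycle, size_t n_reads,
               int fu, int ii)
{
    int per_slot[MAX_II] = { 0 };
    int ports = 0;

    assert(ii > 0 && ii <= MAX_II);
    for (size_t i = 0; i < n_reads; i++) {
        if (read_fu[i] != fu)
            continue;
        assert(read_cycle[i] >= 0);
        int slot = read_cycle[i] % ii;
        per_slot[slot]++;
        if (per_slot[slot] > ports)
            ports = per_slot[slot];
    }
    return ports;
}
```

Reads are folded modulo II because the schedule repeats every II cycles, so two reads II cycles apart contend for the same port.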

Test Cases

• Sobel and fsed kernels, II=4 designs
• Each machine has 4 clusters with 4 FUs per cluster

[Figure: FU mix of the clusters; surviving labels: + - + -, + - <<, + & + &]

Cross Compile Results

• Computation is localized
  – sobel: 1.5 moves/cycle
  – fsed: 1 move/cycle

• Cross compile
  – Can still achieve II=4
  – More inter-cluster communication
  – May require more units
  – sobel on fsed machine: ~2 moves/cycle
  – fsed on sobel machine: ~3 moves/cycle

Concluding Remarks

• Programmable loop accelerator design strategy
  – Meta-architecture with stylized interconnect
  – Systematic compiler-directed design flow

• Costs of programmability:
  – Interconnect, inter-cluster communication
  – Control: “micro-instructions” are necessary

• Just scratching the surface of this work
• For more, see the CCCP group webpage
  – http://cccp.eecs.umich.edu
