streamroller: automatic synthesis of prescribed throughput accelerator pipelines

University of Michigan, Electrical Engineering and Computer Science

Streamroller: Automatic Synthesis of Prescribed Throughput Accelerator Pipelines

Manjunath Kudlur, Kevin Fan, Scott Mahlke

Advanced Computer Architecture Lab

University of Michigan


Automated C to Gates Solution

• SoC design
– 10-100 Gops, 200 mW power budget
– Low level tools ineffective
• Automated accelerator synthesis for whole application
– Correct by construction
– Increase designer productivity
– Faster time to market

[Figure: app.c synthesized into a system of loop accelerators (LA)]


Streaming Applications

• Data “streaming” through kernels
• Kernels are tight loops
– FIR, Viterbi, DCT
• Coarse grain dataflow between kernels
– Sub-blocks of images, network packets

[Figure: H.264 Encoder. Blocks: Motion Estimator, Transform Coder, Quantizer, Inverse Quantizer, Inverse Transform, Motion Predictor. Input: Image. Output: Coded Image.]

[Figure: W-CDMA Transmitter. Blocks: CRC, Conv./Turbo, Block Interleaver, OVSF Generator, Spreader/Scrambler, RRC Filter, Baseband Transmitter. Input: Data in. Output: Data out.]


Software Overview

[Figure: tool flow. Labels: Whole Application, Frontend Analyses, Loop Graph (nodes 1 to 4), System Level Synthesis, Accelerator Pipeline, SRAM Buffers]


Input Specification

• Sequential C program
• Kernel specification
– Perfectly nested FOR loop
– Wrapped inside C function
– All data access made explicit

row_trans(char inp[8][8], char out[8][8]) {
    for(i=0; i<8; i++) {
        for(j=0; j<8; j++) {
            . . . = inp[i][j];
            out[i][j] = . . . ;
        }
    }
}

col_trans(char inp[8][8], char out[8][8]);
zigzag_trans(char inp[8][8], char out[8][8]);

• System specification
– Function with main input/output
– Local arrays to pass data
– Sequence of calls to kernels

dct(char inp[8][8], char out[8][8]) {
    char tmp1[8][8], tmp2[8][8];
    row_trans(inp, tmp1);
    col_trans(tmp1, tmp2);
    zigzag_trans(tmp2, out);
}

[Figure: dataflow inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out]


Performance Specification

• High performance DCT
– Process one 1024x768 image every 2 ms
– Given 400 MHz clock
– One image every 800000 cycles
– One block every 64 cycles
• Low performance DCT
– Process one 1024x768 image every 4 ms
– One block every 128 cycles

[Figure: input image (1024 x 768) divided into 8 x 8 blocks; each block is one task passing through inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out to produce the output coeffs]

Performance goal: task throughput, expressed as the number of cycles between tasks
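The cycle budgets on this slide follow from simple arithmetic, which the sketch below reproduces. The function name and parameters are illustrative, not part of Streamroller. The exact quotients come out as 65 and 130 cycles per block; the slide quotes the slightly tighter figures of 64 and 128.

```c
#include <assert.h>

/* Cycles available per task (one block_dim x block_dim block), rounded down.
 * clock_mhz * frame_period_us is the cycle count per frame, because
 * (10^6 cycles/s) * (10^-6 s) = cycles. */
long cycles_per_block(long clock_mhz, long frame_period_us,
                      long width, long height, long block_dim)
{
    assert(block_dim > 0);
    assert(width % block_dim == 0 && height % block_dim == 0);

    long cycles_per_frame = clock_mhz * frame_period_us;
    long blocks_per_frame = (width / block_dim) * (height / block_dim);
    return cycles_per_frame / blocks_per_frame;
}
```

For the high performance case, `cycles_per_block(400, 2000, 1024, 768, 8)` divides 800000 cycles among 12288 blocks.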


Building Blocks

• Multifunction Loop Accelerator [CODES/ISSS ’06]
• SRAM buffers

[Figure: Kernels 1 to 4 mapped onto loop accelerators, with SRAM buffers tmp1, tmp2 and tmp3 holding the intermediate arrays]


System Schema Overview

[Figure: Kernels 1 to 5 mapped onto LA 1, LA 2 and LA 3. A timeline shows successive tasks, each running Kernel 1, K2, K3, Kernel 4 and Kernel 5, overlapped in time; the task throughput is marked on the time axis.]


Cost Components

• Cost of loop accelerator data path
– Cost of FUs, shift registers, muxes, interconnect
• Initiation interval (II)
– Key parameter that decides LA cost: low II → high performance → high cost
– Loop execution time ≈ (trip count) x II
– Appropriate II chosen to satisfy task throughput

[Figure: three loops K1, K2, K3, each with TC=100. High performance: II=1 for every loop, time axis marked 100, 200, 300, throughput = 1 task/100 cycles. Low performance: II=2 for every loop, time axis marked 200, 400, 600, throughput = 1 task/200 cycles.]
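As a rough sketch of how an II might be picked from a throughput target, the code below uses the slide's approximation (trip count x II) and assumes one loop per accelerator. Since a lower II costs more, it returns the largest II that still fits the task interval. Function names are illustrative, and pipeline fill and drain time is ignored.

```c
#include <assert.h>

/* Approximate loop execution time from the slide: trip count x II. */
long loop_cycles(long trip_count, long ii)
{
    assert(trip_count > 0 && ii > 0);
    return trip_count * ii;
}

/* Largest II whose execution time fits within the cycles between tasks.
 * Returns 0 when even II = 1 is too slow for the requested interval. */
long max_ii_for_interval(long trip_count, long task_interval_cycles)
{
    assert(trip_count > 0 && task_interval_cycles >= 0);
    return task_interval_cycles / trip_count;
}
```

With TC=100, a 100-cycle task interval forces II=1, while a 200-cycle interval allows II=2, matching the two cases in the figure.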


Cost Components (Contd.)

• Grouping of loops into a multifunction LA
– More loops in a single LA → LA occupied for longer time in current task

[Figure: four loops K1 to K4, each with TC=100, mapped onto LA 1, LA 2 and LA 3. LA 1 runs two of the loops and is occupied for 200 cycles per task. Time axis marked 100, 200, 300, 400. Throughput = 1 task / 200 cycles.]
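The effect of grouping can be sketched by summing execution times per accelerator. The code below is a simplified model, not the tool's implementation: it assumes each LA runs its loops back to back, that buffering never stalls an LA, and that the busiest LA sets the task interval. The array `la_of` (loop-to-LA assignment) and the `MAX_LAS` limit are hypothetical names.

```c
#include <assert.h>

#define MAX_LAS 16  /* arbitrary limit for this sketch */

/* Cycles between task starts when loop k runs on accelerator la_of[k].
 * Busy time of an LA per task = sum of trip_count * II over its loops;
 * the result is the largest busy time over all LAs. */
long task_interval(const long tc[], const long ii[], const int la_of[],
                   int num_loops, int num_las)
{
    long busy[MAX_LAS] = {0};
    assert(num_las > 0 && num_las <= MAX_LAS);

    for (int k = 0; k < num_loops; k++) {
        assert(la_of[k] >= 0 && la_of[k] < num_las);
        busy[la_of[k]] += tc[k] * ii[k];
    }

    long worst = 0;
    for (int a = 0; a < num_las; a++) {
        if (busy[a] > worst)
            worst = busy[a];
    }
    return worst;
}
```

Placing two TC=100, II=1 loops on the same LA gives that LA 200 busy cycles per task, so the system delivers one task every 200 cycles even though the other LAs are idle half the time.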


Cost Components (Contd.)

• Cost of SRAM buffers for intermediate arrays
• More buffers → more task overlap → high performance

[Figure: K1, K2, K3 (II=1, TC=100 each) on LA 1, LA 2 and LA 3, connected through buffers tmp1 and tmp2. Two timelines with the time axis marked 100, 200, 300. In the first, the tmp1 buffer is in use by LA 2. In the second, adjacent tasks use different buffers.]
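One way to reason about the buffer trade-off is the steady-state model sketched below for a single producer/consumer pair. This model is an assumption made for illustration and does not come from the slides: a buffer copy is held while the producer fills it and the consumer drains it, and the copies are reused in rotation.

```c
#include <assert.h>

/* Assumed model: one copy of the intermediate array is held for
 * producer_cycles + consumer_cycles. With num_buffers copies used in
 * rotation, a copy becomes free every ceil(hold / num_buffers) cycles,
 * and tasks can never start faster than the slower of the two LAs. */
long pair_task_interval(long producer_cycles, long consumer_cycles,
                        long num_buffers)
{
    assert(producer_cycles > 0 && consumer_cycles > 0 && num_buffers > 0);

    long hold = producer_cycles + consumer_cycles;
    long buffer_limit = (hold + num_buffers - 1) / num_buffers;
    long la_limit = producer_cycles > consumer_cycles ? producer_cycles
                                                      : consumer_cycles;
    return buffer_limit > la_limit ? buffer_limit : la_limit;
}
```

Under this model, two 100-cycle loops sharing one buffer start a task every 200 cycles, and a second buffer brings that down to 100 cycles. A third buffer gives no further gain, because the LAs themselves become the limit.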


ILP Formulation

• Variables
– II for each loop
– Which loops are combined into single LA
– Number of buffers for temp array
• Objective function
– Cost of LAs + cost of buffers
• Constraints
– Overall task throughput should be achieved


Non-linear LA Cost

[Figure: relative cost (0 to 1.0) plotted against initiation interval (1 to 20), with IImin and IImax marked on the II axis]

II = 1*II_1 + 2*II_2 + 3*II_3 + . . . + 14*II_14 and 0 ≤ II_i ≤ 1

Cost(II) = C_1*II_1 + C_2*II_2 + C_3*II_3 + . . . + C_14*II_14

IImin ≤ II ≤ IImax
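The two sums above act as a table lookup written in linear form. The sketch below evaluates them for a given selector vector. It assumes the selectors are 0/1 values with exactly one set, which the slide does not state explicitly (it gives only the bounds 0 ≤ II_i ≤ 1). The names `sel`, `selected_ii` and `selected_cost` are illustrative.

```c
#include <assert.h>

#define NUM_II 14  /* the slide's sums run over II_1 .. II_14 */

/* Assumption of this sketch: selectors are 0/1 and exactly one is set. */
static void check_one_hot(const int sel[NUM_II])
{
    int count = 0;
    for (int i = 0; i < NUM_II; i++) {
        assert(sel[i] == 0 || sel[i] == 1);
        count += sel[i];
    }
    assert(count == 1);
}

/* II = 1*II_1 + 2*II_2 + ... + 14*II_14 */
int selected_ii(const int sel[NUM_II])
{
    check_one_hot(sel);
    int ii = 0;
    for (int i = 0; i < NUM_II; i++)
        ii += (i + 1) * sel[i];
    return ii;
}

/* Cost(II) = C_1*II_1 + C_2*II_2 + ... + C_14*II_14 */
double selected_cost(const double c[NUM_II], const int sel[NUM_II])
{
    check_one_hot(sel);
    double cost = 0.0;
    for (int i = 0; i < NUM_II; i++)
        cost += c[i] * sel[i];
    return cost;
}
```

With a one-hot selector, the first sum returns the chosen II and the second returns the matching table entry C_II, so an arbitrary non-linear cost curve can appear in a linear objective.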


Multifunction Accelerator Cost

[Figure: three ways of combining LA 1, LA 2, LA 3 and LA 4 into one multifunction accelerator]

Worst case: no sharing, cost = sum
Realistic case: some sharing, cost between sum and max
Best case: full sharing, cost = max

• Impractical to obtain accurate cost of all combinations
• C_LA = 0.5 * (SUM_CLA + MAX_CLA), where SUM_CLA and MAX_CLA are the sum and the maximum of the individual LA costs
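The midpoint estimate is straightforward to compute. A minimal sketch, with an illustrative function name and an array of individual LA costs as input:

```c
#include <assert.h>

/* Estimated cost of a multifunction LA built from n single-loop LAs:
 * the midpoint between no sharing (sum) and full sharing (max). */
double multifunction_cost(const double la_cost[], int n)
{
    assert(n > 0);

    double sum = 0.0;
    double max = la_cost[0];
    for (int i = 0; i < n; i++) {
        sum += la_cost[i];
        if (la_cost[i] > max)
            max = la_cost[i];
    }
    return 0.5 * (sum + max);
}
```

For individual costs of 1, 2, 3 and 4 units the sum is 10 and the maximum is 4, so the estimate is 7. For a single loop the sum and maximum coincide and the estimate equals that loop's own cost.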


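Case Study: “Simple” benchmark

[Figure: loop graph of eight loops with TC=256, shown with three synthesized systems. First: four LAs (LA 1 to LA 4), every loop at II=1, 512 cycles. Second: two LAs (LA 1, LA 2), loop IIs of 1, 1, 2, 1, 1, 1, 3, 3, with 1792 cycles and 1536 cycles. Third: one LA (LA 1), every loop at II=1, 2048 cycles.]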


Beamformer

• 10 loops
• Memory cost: 60% to 70%

• Up to 20% cost savings due to hardware sharing in multifunction accelerators
• Systems at lower throughput have over-designed LAs
– Not profitable to pick a lower performance LA
• Memory buffer cost significant
– High performance producer consumer better than more buffers


Conclusions

• Automated design realistic for system of loops
• Designers can move up the abstraction hierarchy
• Observations
– Macro level hardware sharing can achieve significant cost savings
– Memory cost is significant; need to simultaneously optimize for datapath and memory cost
• ILP formulation tractable
– Solver took less than 1 minute for systems with 30 loops
