Streamroller: Automatic Synthesis of Prescribed Throughput Accelerator Pipelines




University of Michigan, Electrical Engineering and Computer Science

Manjunath Kudlur, Kevin Fan, Scott Mahlke

Advanced Computer Architecture Lab

University of Michigan

Automated C to Gates Solution

• SoC design
– 10-100 Gops, 200 mW power budget
– Low level tools ineffective
• Automated accelerator synthesis for whole application
– Correct by construction
– Increase designer productivity
– Faster time to market

[Figure: app.c synthesized into four loop accelerators (LA).]

Streaming Applications

• Data “streaming” through kernels
• Kernels are tight loops
– FIR, Viterbi, DCT
• Coarse grain dataflow between kernels
– Sub-blocks of images, network packets

[Figure: H.264 Encoder, taking an image to a coded image. Blocks: Motion Estimator, Transform Coder, Quantizer, Inverse Quantizer, Inverse Transform, Motion Predictor.]

[Figure: W-CDMA Transmitter, from data in to data out. Blocks: CRC, Conv./Turbo, Block Interleaver, OVSF Generator, Spreader/Scrambler, RRC Filter, Baseband Transmitter.]

Software Overview

[Figure: tool flow. Stages shown: Whole Application, Frontend Analyses, Loop Graph (loops 1 to 4), System Level Synthesis, Accelerator Pipeline, SRAM Buffers.]

Input Specification

• Sequential C program
• Kernel specification
– Perfectly nested FOR loop
– Wrapped inside C function
– All data access made explicit

    row_trans(char inp[8][8], char out[8][8]) {
        for(i=0; i<8; i++) {
            for(j=0; j<8; j++) {
                . . . = inp[i][j];
                out[i][j] = . . . ;
            }
        }
    }

    col_trans(char inp[8][8], char out[8][8]);
    zigzag_trans(char inp[8][8], char out[8][8]);

• System specification
– Function with main input/output
– Local arrays to pass data
– Sequence of calls to kernels

    dct(char inp[8][8], char out[8][8]) {
        char tmp1[8][8], tmp2[8][8];
        row_trans(inp, tmp1);
        col_trans(tmp1, tmp2);
        zigzag_trans(tmp2, out);
    }

[Figure: dataflow inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out.]

Performance Specification

• High performance DCT
– Process one 1024x768 image every 2 ms
– Given 400 MHz clock
    • One image every 800000 cycles
    • One block every 64 cycles
• Low performance DCT
– Process one 1024x768 image every 4 ms
– One block every 128 cycles

Performance goal: task throughput, in number of cycles between tasks

[Figure: the input image (1024 x 768) is divided into 8 x 8 blocks. Each block is one task that flows inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out and produces an 8 x 8 block of output coeffs.]
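The cycle budgets on the Performance Specification slide follow from the clock rate, the frame deadline, and the number of 8 x 8 blocks per image. A minimal sketch of that arithmetic is below; the function names are illustrative and not part of Streamroller. The exact quotients are about 65 and 130 cycles per block; the slide quotes 64 and 128, presumably rounded down.

```c
#include <assert.h>

/* Number of block_dim x block_dim blocks in a width x height image.
 * Assumes the image dimensions are multiples of block_dim. */
long blocks_in_image(long width, long height, long block_dim)
{
    assert(block_dim > 0 && width % block_dim == 0 && height % block_dim == 0);
    return (width / block_dim) * (height / block_dim);
}

/* Cycles available per task, rounded down.
 * clock_mhz:       clock frequency in MHz (cycles per microsecond)
 * period_us:       time allowed per image, in microseconds
 * tasks_per_image: number of tasks (blocks) in one image */
long cycles_per_task(long clock_mhz, long period_us, long tasks_per_image)
{
    assert(clock_mhz > 0 && period_us > 0 && tasks_per_image > 0);
    long cycles_per_image = clock_mhz * period_us;
    return cycles_per_image / tasks_per_image;
}
```

For the high performance case, `blocks_in_image(1024, 768, 8)` gives 12288 blocks and `cycles_per_task(400, 2000, 12288)` gives the per-block budget from 800000 cycles per image.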

Building Blocks

[Figure: Kernel 1 to Kernel 4 implemented on a Multifunction Loop Accelerator [CODES/ISSS ’06], with SRAM buffers tmp1, tmp2, tmp3 holding the intermediate arrays.]

System Schema Overview

[Figure: Kernel 1 to Kernel 5 mapped onto three loop accelerators (LA 1, LA 2, LA 3), followed by a timeline in which successive tasks (Kernel 1, K2, K3, Kernel 4, Kernel 5) overlap in time. The spacing between successive tasks is labeled task throughput.]

Cost Components

• Cost of loop accelerator data path
– Cost of FUs, shift registers, muxes, interconnect
• Initiation interval (II)
– Key parameter that decides LA cost
    • Low II → high performance → high cost
– Loop execution time ≈ (trip count) x II
– Appropriate II chosen to satisfy task throughput

[Figure: high performance system. K1, K2, K3 each have TC=100 and II=1. Timeline marks at 100, 200, 300 cycles with Task 1, Task 2, Task 3 overlapped. Throughput = 1 task/100 cycles.]

[Figure: low performance system. K1, K2, K3 each have TC=100 and II=2. Timeline marks at 200, 400, 600 cycles with Task 1 and Task 2 overlapped. Throughput = 1 task/200 cycles.]
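The relation between trip count, II, and the task throughput target can be sketched as follows. This assumes one loop per LA, so that a loop only has to finish within the cycle budget between tasks; the function names are illustrative. Because a lower II costs more hardware, the largest II that still fits the budget is the natural candidate.

```c
#include <assert.h>

/* Approximate loop execution time: trip count x initiation interval. */
long loop_cycles(long trip_count, long ii)
{
    assert(trip_count > 0 && ii > 0);
    return trip_count * ii;
}

/* Largest II whose execution time still fits in the cycle budget.
 * Returns 0 if even II = 1 is too slow for the budget. */
long max_ii_for_budget(long trip_count, long budget_cycles)
{
    assert(trip_count > 0 && budget_cycles >= 0);
    return budget_cycles / trip_count;
}
```

With TC=100, a budget of 100 cycles forces II=1, while a budget of 200 cycles allows II=2, matching the two systems in the figures.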

Cost Components (Contd.)

• Grouping of loops into a multifunction LA
– More loops in a single LA → LA occupied for longer time in current task

[Figure: kernels with TC=100 each mapped onto LA 1, LA 2, LA 3, with two kernels sharing LA 1. LA 1 is occupied for 200 cycles. Timeline marks at 100, 200, 300, 400 cycles. Throughput = 1 task / 200 cycles.]
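The effect of grouping on throughput can be sketched by summing the execution time of the loops that share each LA and taking the busiest LA. This is an illustration under two assumptions that go beyond the slide: loops sharing an LA run back to back, and buffering is sufficient for different LAs to work on different tasks at the same time. The mapping array `la_of` and the limit `MAX_LAS` are hypothetical.

```c
#include <assert.h>

#define MAX_LAS 16  /* arbitrary limit for this sketch */

/* Cycles between successive tasks for a given mapping of loops to LAs.
 * trip_count[i], ii[i]: trip count and initiation interval of loop i
 * la_of[i]:             index of the LA that runs loop i (0 .. num_las-1)
 * An LA is busy for the sum of its loops' execution times, and the
 * busiest LA limits the task throughput. */
long task_interval(const long *trip_count, const long *ii, const int *la_of,
                   int num_loops, int num_las)
{
    long busy[MAX_LAS] = {0};
    long worst = 0;

    assert(num_las > 0 && num_las <= MAX_LAS);
    for (int i = 0; i < num_loops; i++) {
        assert(la_of[i] >= 0 && la_of[i] < num_las);
        busy[la_of[i]] += trip_count[i] * ii[i];
    }
    for (int k = 0; k < num_las; k++) {
        if (busy[k] > worst)
            worst = busy[k];
    }
    return worst;
}
```

With four loops of TC=100 and II=1, placing two of them on the same LA yields an interval of 200 cycles, while giving each loop its own LA yields 100 cycles.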

Cost Components (Contd.)

• Cost of SRAM buffers for intermediate arrays
• More buffers → more task overlap → high performance

[Figure: K1, K2, K3 (II=1, TC=100 each) on LA 1, LA 2, LA 3, connected through buffers tmp1 and tmp2. Two timelines with marks at 100, 200, 300 cycles: in the first, the tmp1 buffer is in use by LA 2; in the second, adjacent tasks use different buffers.]
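A rough way to see why extra buffers buy throughput is to model one producer LA and one consumer LA around a single intermediate array. The sketch below is a simplified model, not the formulation used in the talk: it assumes the consumer starts only after the producer has written the whole array, and that a buffer becomes free only after the consumer has finished reading it.

```c
#include <assert.h>

/* Steady-state cycles between tasks for one producer LA and one consumer LA
 * that communicate through num_buffers copies of an intermediate array.
 * Simplifying assumptions (not from the slides): whole-array handoff, and a
 * buffer is reusable only once the consumer is done with it. */
long producer_consumer_interval(long producer_cycles, long consumer_cycles,
                                int num_buffers)
{
    assert(producer_cycles > 0 && consumer_cycles > 0 && num_buffers >= 1);
    if (num_buffers == 1) {
        /* Single buffer: producer and consumer must take turns. */
        return producer_cycles + consumer_cycles;
    }
    /* Two or more buffers: adjacent tasks use different buffers, so the
     * slower of the two stages sets the pace. */
    return producer_cycles > consumer_cycles ? producer_cycles
                                             : consumer_cycles;
}
```

For two 100-cycle stages this gives 200 cycles between tasks with one buffer and 100 cycles with two, which is the trade between buffer cost and task overlap that the slide describes.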

ILP Formulation

• Variables
– II for each loop
– Which loops are combined into single LA
– Number of buffers for temp array
• Objective function
– Cost of LAs + cost of buffers
• Constraints
– Overall task throughput should be achieved

Non-linear LA Cost

[Figure: relative cost (0 to 1.0) plotted against initiation interval (1 to 20), with IImin and IImax marked on the initiation interval axis.]

$II = 1 \cdot II_1 + 2 \cdot II_2 + 3 \cdot II_3 + \dots + 14 \cdot II_{14}$, with $0 \le II_i \le 1$

$\mathrm{Cost}(II) = C_1 \cdot II_1 + C_2 \cdot II_2 + C_3 \cdot II_3 + \dots + C_{14} \cdot II_{14}$

$II_{min} \le II \le II_{max}$
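The two sums on this slide express a non-linear cost curve as a linear function of selector variables. The sketch below decodes such a selector vector. It assumes the selectors are 0/1 values with exactly one set (the slide only states the bound 0 ≤ IIi ≤ 1), and the cost table passed in is hypothetical data supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Decode a one-hot selector into an initiation interval:
 * II = 1*sel[0] + 2*sel[1] + ... + n*sel[n-1]. */
int decode_ii(const int *sel, size_t n)
{
    int ii = 0, ones = 0;
    for (size_t i = 0; i < n; i++) {
        assert(sel[i] == 0 || sel[i] == 1);
        ii += (int)(i + 1) * sel[i];
        ones += sel[i];
    }
    assert(ones == 1);  /* assumption: exactly one candidate II is chosen */
    return ii;
}

/* Cost(II) = C1*sel[0] + C2*sel[1] + ... + Cn*sel[n-1].
 * The sum is linear in sel even though cost[] is an arbitrary table. */
double selected_cost(const double *cost, const int *sel, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += cost[i] * sel[i];
    return total;
}
```

Because both sums are linear in the selectors, an ILP solver can pick an II and account for its tabulated cost without modeling the shape of the cost curve.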

Multifunction Accelerator Cost

[Figure: three ways of combining LA 1, LA 2, LA 3, LA 4 into one multifunction accelerator.]

– Worst case: no sharing, Cost = Sum
– Realistic case: some sharing, Cost = between Sum and Max
– Best case: full sharing, Cost = Max

• Impractical to obtain accurate cost of all combinations
• $C_{LA} = 0.5 \times (\mathrm{SUM}_{C_{LA}} + \mathrm{MAX}_{C_{LA}})$
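The midpoint estimate for a multifunction LA is simple enough to state directly in code. The sketch below takes the costs of the individual single-loop LAs as input; the function name and the cost units are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Estimated cost of a multifunction LA built from n single-loop LAs:
 * the midpoint between no sharing (sum of the costs) and full sharing
 * (maximum of the costs). */
double multifunction_cost(const double *la_cost, size_t n)
{
    double sum = 0.0, max = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        assert(la_cost[i] >= 0.0);
        sum += la_cost[i];
        if (la_cost[i] > max)
            max = la_cost[i];
    }
    return 0.5 * (sum + max);
}
```

For a single loop the sum and the maximum coincide, so the estimate reduces to that loop's own LA cost.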

Case Study: “Simple” benchmark

[Figure: loop graph of eight loops with TC=256, and three synthesized systems annotated with per-loop II values.
– 512 cycles: LA 1, LA 2, LA 3, LA 4; all eight loops at II=1
– 1792 cycles and 1536 cycles: LA 1, LA 2; II values 1, 1, 2, 1, 1, 1, 3, 3
– 2048 cycles: LA 1; all eight loops at II=1]

Beamformer

• 10 loops
• Memory cost: 60% to 70%
• Up to 20% cost savings due to hardware sharing in multifunction accelerators
• Systems at lower throughput have over-designed LAs
– Not profitable to pick a lower performance LA
• Memory buffer cost significant
– High performance producer consumer better than more buffers

Conclusions

• Automated design realistic for system of loops
• Designers can move up the abstraction hierarchy
• Observations
– Macro level hardware sharing can achieve significant cost savings
– Memory cost is significant; need to simultaneously optimize for datapath and memory cost
• ILP formulation tractable
– Solver took less than 1 minute for systems with 30 loops
