Transcript
Page 1: A Computing Origami: Folding Streams in FPGAs

A Computing Origami: Folding Streams in FPGAs

S. M. Farhad, PhD Student

University of Sydney

DAC 2009, California, USA

Page 2: A Computing Origami: Folding Streams in FPGAs

Outline

Motivation
  Stream programming
  FPGA
  Problem
Stream Folding
Results
Conclusion

Page 3: A Computing Origami: Folding Streams in FPGAs

Stream Programming Paradigm

Programs expressed as stream graphs
  Streams: sequences of data elements
  Actors: functions applied to streams

Independent actors with explicit communication

Regular and repeating computation
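As a small illustration of actors applied to streams, the following Python sketch models each actor as a generator that consumes one stream and produces another. The actor names (`source`, `scale`, `pair_sum`) and the data are hypothetical and do not come from the slides; generators only approximate the paradigm, since real stream actors run concurrently.

```python
from itertools import count, islice

def source():
    """Unbounded stream of integers 0, 1, 2, ... (hypothetical input)."""
    return count()

def scale(stream, k):
    """Actor: pops one element and pushes one element (the input times k)."""
    for x in stream:
        yield k * x

def pair_sum(stream):
    """Actor: pops two elements and pushes one element (their sum)."""
    it = iter(stream)
    for a, b in zip(it, it):
        yield a + b

# Actors are connected only through the streams passed between them.
pipeline = pair_sum(scale(source(), 2))
first_four = list(islice(pipeline, 4))
print(first_four)  # [2, 10, 18, 26]
```

Each actor touches only its own input and output stream, which is the explicit communication the slide refers to.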

[Figure: stream graph with labels "Actor/Filter" and "Streams"]

Page 4: A Computing Origami: Folding Streams in FPGAs

FPGA

FPGAs are widely available as programmable coprocessors

Opportunities to exploit FPGA-based acceleration
  Multimedia, networking, graphics, and security codes

Page 5: A Computing Origami: Folding Streams in FPGAs

Problem

Maximizing throughput subject to
  Area and latency constraints

Resolving bottleneck actors
  The replicated filters do not require resynthesis

Page 6: A Computing Origami: Folding Streams in FPGAs

Motivating Example


Page 7: A Computing Origami: Folding Streams in FPGAs

Motivating Example


Page 8: A Computing Origami: Folding Streams in FPGAs

Motivating Example


Page 9: A Computing Origami: Folding Streams in FPGAs

Outline

Motivation
  Stream programming
  FPGA
  Problem
Stream Folding
Results
Conclusion

Page 10: A Computing Origami: Folding Streams in FPGAs

Area/Throughput Design Folding

foreach Filter f in S do
    workFactor[f] = f.latency * S.runs(f);
    designPointArea += f.area * workFactor[f];
scaleLimit = min over f with f.hasState of (1 / workFactor[f]);
scaling = min(AREA / designPointArea, scaleLimit);
foreach Filter f in S do
    replication[f] = workFactor[f] * scaling;
while area(replication) > AREA do
    replication = reduceThroughput(replication);
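The folding procedure on this slide can be sketched in Python as follows. This is a minimal sketch under several assumptions that the slide does not state: `area(replication)` is modelled as a linear sum of per-copy areas, replication factors are kept as real numbers, `reduce_throughput` is a hypothetical stand-in that scales every factor by 0.9, and the `Filter` fields and example values are invented for illustration.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Filter:
    name: str
    latency: float          # time per firing (assumed unit)
    area: float             # area of one copy, e.g. LUTs (assumed unit)
    runs: int               # firings per steady state, stands in for S.runs(f)
    has_state: bool = False

def total_area(filters, replication):
    # Assumption: area grows linearly with the replication factor.
    return sum(f.area * replication[f.name] for f in filters)

def reduce_throughput(replication, step=0.9):
    # Hypothetical stand-in: uniformly shrink every replication factor.
    return {name: r * step for name, r in replication.items()}

def fold(filters, area_budget):
    work = {f.name: f.latency * f.runs for f in filters}
    design_point_area = sum(f.area * work[f.name] for f in filters)
    # Stateful filters cap the scaling so their replication stays at most 1.
    scale_limit = min((1.0 / work[f.name] for f in filters if f.has_state),
                      default=math.inf)
    scaling = min(area_budget / design_point_area, scale_limit)
    replication = {f.name: work[f.name] * scaling for f in filters}
    # Under the linear area model this loop is effectively a safety net;
    # it matters when area() comes from a less regular estimate.
    while total_area(filters, replication) > area_budget:
        replication = reduce_throughput(replication)
    return replication

filters = [Filter("A", latency=2, area=10, runs=1),
           Filter("B", latency=4, area=5, runs=1)]
print(fold(filters, area_budget=20))  # {'A': 1.0, 'B': 2.0}
```

In this example the slower filter `B` receives twice the replication of `A`, so both sustain the same rate within the area budget of 20.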


Page 11: A Computing Origami: Folding Streams in FPGAs

Calculating Throughput

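$$t_{out}(i) = \frac{push(F_i)}{latency(F_i)}$$

$$t^{P}_{out}(i) = t_{out}(i) \cdot \prod_{i < j \le n} \frac{push(F_j)}{latency(F_j)}$$

$$t^{P}_{out} = \min_{1 \le i \le n} t^{P}_{out}(i)$$

$$t^{SJ}_{out} = \min_{1 \le i \le n} \left( t_{out}(i) \cdot \frac{\sum_{j=1}^{n} w_j}{w_i} \right)$$

$$t^{S}_{out} = \min_{i \in S} t(i)$$

$$t^{S}_{out} = \min_{i \in S} \left( C_i \cdot t_{out}(i) \right)$$

$$t^{S}_{out} = \min_{i \in S} \left( r_i \cdot C_i \cdot t_{out}(i) \right)$$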
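As a rough numerical illustration of the per-filter rate push/latency and of taking the minimum over all filters in a graph, consider the Python sketch below. It treats r_i and C_i as plain per-filter multipliers, because the slide does not define them here; the filter names and all numbers are hypothetical.

```python
def filter_throughput(push, latency):
    """Items produced per unit time by one filter: push / latency."""
    return push / latency

def graph_throughput(filters):
    """Return the smallest scaled rate and the name of the bottleneck filter.

    `filters` maps a name to (push, latency, r, c), where r and c are the
    per-filter multipliers (assumed given, not derived here).
    """
    rates = {name: r * c * filter_throughput(push, latency)
             for name, (push, latency, r, c) in filters.items()}
    bottleneck = min(rates, key=rates.get)
    return rates[bottleneck], bottleneck

filters = {
    "A": (2, 4, 1, 1.0),  # 1 * 1.0 * 2/4 = 0.5
    "B": (1, 8, 2, 1.0),  # 2 * 1.0 * 1/8 = 0.25
    "C": (4, 4, 1, 0.5),  # 1 * 0.5 * 4/4 = 0.5
}
rate, name = graph_throughput(filters)
print(rate, name)  # 0.25 B
```

The filter that attains the minimum is the bottleneck, which is the actor that replication would target first.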

Page 12: A Computing Origami: Folding Streams in FPGAs

Calculating Latency

FPGAs that are coupled to host processors
  Initiation interval (DMA)

Replication improves throughput, but it often increases the latency!

Major factors for latency variation
  Non-periodic data arrival
  Data-token reordering
  Local congestion

Page 13: A Computing Origami: Folding Streams in FPGAs

Latency constrained design folding

latConf = null; T = ∞;
while throughput(thrConf) ≤ T do
    if feasibleImprovement(thrConf) then
        candidates = simAnnealing(thrConf, T);
        foreach candidate in candidates do
            if throughput(candidate) < T then
                latConf = candidate;
                T = throughput(latConf);
    thrConf = reduceThroughput(thrConf);
return latConf
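The control flow of this search can be sketched in Python as below. Two readings are assumptions rather than facts from the slide: `throughput` is treated as a cost where lower is better (for example an initiation interval), since T starts at ∞ and candidates are accepted when they fall below it; and the simulated-annealing step is replaced by a caller-supplied `propose` function. The toy model (`interval`, `latency`, and the constants) is entirely hypothetical.

```python
import math

def latency_constrained_fold(thr_conf, cost, feasible_improvement,
                             propose, reduce_throughput):
    """Skeleton of the search loop; all callables are supplied by the caller."""
    lat_conf, best = None, math.inf
    while cost(thr_conf) <= best:
        if feasible_improvement(thr_conf):
            # propose() stands in for simAnnealing(thrConf, T).
            for candidate in propose(thr_conf, best):
                if cost(candidate) < best:
                    lat_conf, best = candidate, cost(candidate)
        thr_conf = reduce_throughput(thr_conf)
    return lat_conf

# Toy model: a configuration is just a replication count r.
WORK = 12

def interval(r):
    return WORK / r        # fewer copies means a longer interval

def latency(r):
    return 100 + 10 * r    # more copies means a longer latency

result = latency_constrained_fold(
    thr_conf=6,                                        # start from the fastest design
    cost=interval,
    feasible_improvement=lambda r: latency(r) <= 130,  # latency constraint
    propose=lambda r, best: [r],                       # trivial candidate set
    reduce_throughput=lambda r: r - 1,
)
print(result)  # 3
```

In the toy run the search walks down from six copies until the latency bound of 130 is met at three copies, then stops because every slower configuration has a worse interval than the best one found.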


Page 14: A Computing Origami: Folding Streams in FPGAs

Results

            Minimum area         Best throughput      Constrained design
Benchmark      LUTs Latency  II     LUTs Latency  II     LUTs Latency  II  Constraint     Run time
MatrixMult     1498     480  19     7618     185   3     4558     175   7  Latency ≤ 175  1.14 s
Serpent        3028    1027   4     3878     773   2     3053     901   4  Latency ≤ 910  0.73 s
FFT2          37610    1199   3    43370     764   2    39530     868   7  AREA ≤ 40000   34.7 s
FMRadio       37458     371  39    87564     371  13    62511     371  20  AREA ≤ 65000   1.01 s
DCT           45752     349   3   137256     349   1    91504     349   2  AREA ≤ 120000  0.73 s
BitonicSort   43920    1042   3   131760    1042   1    47400    1282   2  AREA ≤ 50000   18.3 s
Synthetic       350     309 135    15990     504   2     1490     309  47  AREA ≤ 1500    0.43 s
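To put the area/throughput trade-off in the table into numbers, the short Python sketch below compares the minimum-area and best-throughput designs. It assumes II is the initiation interval, so throughput scales as 1/II; the figures are copied from the table, while the helper name `tradeoff` is hypothetical.

```python
# (min-area LUTs, min-area II, best-throughput LUTs, best-throughput II)
rows = {
    "MatrixMult":  (1498, 19, 7618, 3),
    "Serpent":     (3028, 4, 3878, 2),
    "FFT2":        (37610, 3, 43370, 2),
    "FMRadio":     (37458, 39, 87564, 13),
    "DCT":         (45752, 3, 137256, 1),
    "BitonicSort": (43920, 3, 131760, 1),
    "Synthetic":   (350, 135, 15990, 2),
}

def tradeoff(min_luts, min_ii, best_luts, best_ii):
    """Return (throughput gain, area growth) of best-throughput vs minimum-area."""
    return min_ii / best_ii, best_luts / min_luts

for name, row in rows.items():
    gain, growth = tradeoff(*row)
    print(f"{name:<12} throughput x{gain:.2f}  area x{growth:.2f}")
```

For DCT and BitonicSort both ratios come out at exactly 3, that is, three times the throughput for three times the area.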


Page 15: A Computing Origami: Folding Streams in FPGAs

Questions?

