Flexible Filters for High Performance Embedded Computing

Rebecca Collins and Luca Carloni
Department of Computer Science, Columbia University


Post on 23-Mar-2016


Page 1: Flexible Filters for  High Performance  Embedded Computing

Flexible Filters for High Performance Embedded Computing

Rebecca Collins and Luca Carloni
Department of Computer Science
Columbia University

Page 2: Flexible Filters for  High Performance  Embedded Computing

Motivation

Page 3: Flexible Filters for  High Performance  Embedded Computing

Stream Programming

• Stream Programming model
  – filter: a piece of sequential programming
  – channels: how filters communicate
  – token: an indivisible unit of data for a filter

• Examples: signal processing, image processing, embedded applications

[Figure: three filters A, B, and C connected in a pipeline]
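A minimal sketch of the model in Python, assuming FIFO queues for channels and plain integers for tokens. The three lambda filters and the helper `run_filter` are hypothetical and only illustrate the roles of filter, channel, and token; they are not taken from the talk.

```python
from collections import deque

def run_filter(work, inp, out):
    """Fire a filter until its input channel is empty: pop one token, push one result."""
    while inp:
        out.append(work(inp.popleft()))

# Channels are FIFO queues; tokens here are plain integers (an illustrative choice).
source = deque([1, 2, 3, 4])
a_to_b, b_to_c, sink = deque(), deque(), deque()

# Three hypothetical filters A, B, C, each a piece of sequential code.
run_filter(lambda t: t + 1, source, a_to_b)   # A
run_filter(lambda t: t * t, a_to_b, b_to_c)   # B
run_filter(lambda t: -t, b_to_c, sink)        # C

result = list(sink)
```

Here the filters run one after another on a single thread. On a multi-core chip each filter could run concurrently, with the channels carrying tokens between cores.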

Page 4: Flexible Filters for  High Performance  Embedded Computing

Example: Constant False-Alarm Rate (CFAR) Detection

[Figure: a row of cells marking the guard cells and the cell under test]

compare to [symbols lost in extraction], with additional factors
• number of gates
• threshold (μ)
• number of guard cells
• rows and other dimensional data

Source: HPEC Challenge, http://www.ll.mit.edu/HPECchallenge/
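A one-dimensional sketch of a CFAR test in Python. The slide does not spell out the comparison, so the rule used here (flag a cell when its power divided by the mean power of the neighbouring window cells exceeds μ) is an assumption based on the usual cell-averaging form of CFAR. The function name, parameter names, and edge handling are hypothetical.

```python
def cfar_targets(power, num_window, num_guard, mu):
    """Return indices i where power[i] / T(i) > mu.

    T(i) is the mean power of up to num_window cells on each side of cell i,
    skipping num_guard guard cells next to it. Windows are truncated at the
    edges of the array (an assumption of this sketch).
    """
    targets = []
    for i in range(len(power)):
        left = power[max(0, i - num_guard - num_window): max(0, i - num_guard)]
        right = power[i + num_guard + 1: i + num_guard + 1 + num_window]
        cells = left + right
        if not cells:
            continue
        noise = sum(cells) / len(cells)
        if noise > 0 and power[i] / noise > mu:
            targets.append(i)
    return targets

samples = [1.0] * 15
samples[7] = 50.0          # one strong return in otherwise flat noise
found = cfar_targets(samples, num_window=3, num_guard=1, mu=10.0)
```

The inputs are assumed to be power values already. A cell that is flagged as a target is where extra, data-dependent work would be triggered.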

Page 5: Flexible Filters for  High Performance  Embedded Computing

CFAR Benchmark

[Figure: the CFAR pipeline of filters: uInt to float, square, left window, right window, add, align data, find targets]

Data Stream: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Page 6: Flexible Filters for  High Performance  Embedded Computing

CFAR Benchmark

[Figure: animation frame of the CFAR pipeline, with token 1 of the data stream 1 to 15 in the pipeline]

Page 7: Flexible Filters for  High Performance  Embedded Computing

CFAR Benchmark

[Figure: animation frame of the CFAR pipeline, with tokens 1 and 2 of the data stream 1 to 15 in the pipeline]

Page 8: Flexible Filters for  High Performance  Embedded Computing

CFAR Benchmark

[Figure: animation frame of the CFAR pipeline, with tokens 1, 2, and 3 of the data stream 1 to 15 in the pipeline]

Page 9: Flexible Filters for  High Performance  Embedded Computing

CFAR Benchmark

[Figure: animation frame of the CFAR pipeline, marking the cell under test, the left window, and the right window within the data stream 1 to 15]

Page 10: Flexible Filters for  High Performance  Embedded Computing

Mapping

Throughput: rate of processing data tokens

[Figure: a multi-core chip with filter A on core 1, filter B on core 2, and filter C on core 3]

Page 11: Flexible Filters for  High Performance  Embedded Computing

Mapping

Throughput: rate of processing data tokens

[Figure: a multi-core chip with filters A and B sharing core 1 and filter C on core 3]

Page 12: Flexible Filters for  High Performance  Embedded Computing

Unbalanced Flow

• Bottlenecks reduce throughput
  – Caused by backpressure
    • inherent algorithmic imbalances
    • data-dependent computational spikes

[Figure: filters A, B, and C on cores 1, 2, and 3, with a “wait” label marking a stalled filter]

Page 13: Flexible Filters for  High Performance  Embedded Computing

Data Dependent Execution Time: CFAR

• Using Set 1 from the HPEC CFAR Kernel Benchmark

• Over a block of about 100 cells

• Extra workload of 32 microseconds per target

[Figure: three histograms of execution time (microseconds) against percent, for 7.3 % targets, 1.3 % targets, and 0.3 % targets]
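A back-of-the-envelope calculation in Python, using the figures quoted on this slide (about 100 cells per block, 32 microseconds of extra work per target). It gives only the expected extra time per block; it does not reproduce the spread shown in the histograms. The function name is hypothetical.

```python
def expected_extra_time_us(cells_per_block, target_fraction, extra_us_per_target):
    """Expected extra execution time of one block, in microseconds."""
    return cells_per_block * target_fraction * extra_us_per_target

# Target percentages from the three histograms on this slide.
extra = {pct: expected_extra_time_us(100, pct / 100, 32) for pct in (0.3, 1.3, 7.3)}
```

Under these assumptions the mean extra time grows from roughly 10 microseconds per block at 0.3 % targets to roughly 234 microseconds at 7.3 % targets.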

Page 14: Flexible Filters for  High Performance  Embedded Computing

Data Dependent Execution Time

Some other examples:
• Bloom Filters (Financial, Spam detection)
• Compression (Image processing)

[Figure: histogram of execution time (s), from 0.000 to 0.005, against percent]

Page 15: Flexible Filters for  High Performance  Embedded Computing

Our Solution: Flexible Filters

• Unused cycles on B are filled by working ahead on filter C with the data already present on B

• Push stream flow upstream of a bottleneck
• Semantic Preservation

[Figure: filters A, B, and C on cores 1, 2, and 3, with a second copy of C placed between a flex-split and a flex-merge]

Page 16: Flexible Filters for  High Performance  Embedded Computing

Related Works

• Static Compiler Optimizations
  – StreamIt [Gordon et al., 2006]
• Dynamic Runtime Load Balancing
  – Work Stealing/Filter Migration
    • [Kakulavarapu et al., 2001]
    • Cilk [Frigo et al., 1998]
    • Flux [Shah et al., 2003]
    • Borealis [Xing et al., 2005]
  – Queue based load balancing
    • Diamond [Huston et al., 2005]: distributed search, queue based load balancing, filter re-ordering
• Combination Static+Dynamic
  – FlexStream [Hormati et al., 2009]: multiple competing programs

Page 17: Flexible Filters for  High Performance  Embedded Computing

Outline

• Introduction
• Design Flow of a Stream Program with Flexibility
• Performance
• Implementation of Flexible Filters
• Experiments
  – CFAR Case Study

Page 18: Flexible Filters for  High Performance  Embedded Computing

Design Flow of a Stream Program with Flexible Filters

[Figure: design flow supported by compiler/design tools, with the steps Design stream algorithm; Mapping (Filters, Memory); Add Flexibility to Bottlenecks; Profile]

[Figure: Profile of CFAR filters on Cell. Bar chart of average execution time (microseconds) per 100 tokens, on a scale of 0 to 5, for find targets, add, right window, and uIntToFloat]

Page 19: Flexible Filters for  High Performance  Embedded Computing

Design Considerations

[Figure: the design flow, with the steps Design stream algorithm; Mapping (Filters, Memory); Add Flexibility to Bottlenecks; Profile]

Page 20: Flexible Filters for  High Performance  Embedded Computing

Adding Flexibility

[Figure: the design flow, with the steps Design stream algorithm; Mapping (Filters, Memory); Add Flexibility to Bottlenecks; Profile]

Page 21: Flexible Filters for  High Performance  Embedded Computing

Outline

• Introduction
• Design Flow of a Stream Program with Flexibility
• Performance
• Implementation of Flexible Filters
• Experiments
  – CFAR Case Study

Page 22: Flexible Filters for  High Performance  Embedded Computing

Mapping Stream Programs to Multi-Core Platforms

[Figure: three ways to map filters A, B, and C onto cores 1 to 3: pipeline mapping, SPMD mapping, and sharing a core]

Page 23: Flexible Filters for  High Performance  Embedded Computing

Throughput: SPMD Mapping

Suppose EA = 2, EB = 2, EC = 3 (timesteps per token for filters A, B, and C)

[Figure: SPMD mapping. Each of cores 1 to 3 runs A, B, and C in sequence on its own token (A1 B1 C1, A2 B2 C2, A3 B3 C3) over timesteps t0 to t6]

3 tokens processed in 7 timesteps, ideal throughput = 3/7 = 0.429

Page 24: Flexible Filters for  High Performance  Embedded Computing

Throughput: Pipeline Mapping

[Figure: pipeline mapping with A on core 1 (latency 2), B on core 2 (latency 2), and C on core 3 (latency 3); schedule of tokens over timesteps t0 to t9]

throughput = 1/3 = 0.333 < 0.429
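The two throughput figures can be checked with a short Python sketch. It assumes the ideal case from the slides: no communication cost, steady state for the pipeline, and one token per core for SPMD. The function names are hypothetical.

```python
from fractions import Fraction

def spmd_throughput(exec_times, num_cores):
    """Each core runs every filter on its own token: num_cores tokens per sum(E) timesteps."""
    return Fraction(num_cores, sum(exec_times))

def pipeline_throughput(exec_times):
    """One filter per core: the slowest filter sets the steady-state rate."""
    return Fraction(1, max(exec_times))

E = [2, 2, 3]  # EA, EB, EC from the slides
spmd = spmd_throughput(E, num_cores=3)   # 3/7
pipe = pipeline_throughput(E)            # 1/3
```

The pipeline is limited by filter C alone, which is why it falls below the SPMD figure even though all three cores are busy part of the time.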

Page 25: Flexible Filters for  High Performance  Embedded Computing

Data Blocks

[Figure: filters A, B, and C on cores 1, 2, and 3, with a second copy of C]

data block: group of data tokens

Page 26: Flexible Filters for  High Performance  Embedded Computing

Throughput: Pipeline Augmented with Flexibility

[Figure: filters A, B, and C on cores 1, 2, and 3, with a second copy of C; the schedule over timesteps t0 to t9 contains an additional C2 entry]

throughput = 2/5 = 0.4 < 0.429 (but > 0.333)
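One way to arrive at 2/5 is a fractional load-balance bound, sketched below in Python. This is an assumption of this sketch rather than a derivation given in the talk: a share f of C's tokens is assumed to run on the core that hosts B, and f is chosen so that the two cores finish together. The function name is hypothetical.

```python
from fractions import Fraction

def flexible_throughput(e_a, e_b, e_c):
    """Upper bound when a fraction f of C's tokens is processed on B's core."""
    # Balance core 2 (B plus a share f of C) against core 3 (the rest of C).
    f = max(Fraction(0), Fraction(e_c - e_b, 2 * e_c))
    # Average time per token is set by the busiest of the three cores.
    per_token = max(Fraction(e_a), e_b + f * e_c, (1 - f) * e_c)
    return f, 1 / per_token

share, rate = flexible_throughput(2, 2, 3)
```

With EA = 2, EB = 2, EC = 3 the balance point sends one sixth of C's tokens to core 2, so cores 2 and 3 each spend 5/2 timesteps per token on average. That matches the 2/5 on the slide.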

Page 27: Flexible Filters for  High Performance  Embedded Computing

Outline

• Introduction
• Design Flow of a Stream Program with Flexibility
• Performance
• Implementation of Flexible Filters
• Experiments
  – CFAR Case Study

Page 28: Flexible Filters for  High Performance  Embedded Computing

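Flex-Split

Flex-split:
  pop data block b from in
  n0 = available space on out0
  n1 = |b| - n0
  send n0 to out0, n1 to out1
  send n0 0’s, then n1 1’s to select

maintain ordering based on run-time state of queues

[Figure: B on core 2 and C on core 3, with a second copy of C; a flex split (channels in, out0, out1, select) feeds the two copies of C, and a flex merge (channels in0, in1, select, out) recombines their outputs]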
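The flex-split steps can be sketched in Python as follows. Two details are assumptions of this sketch and not stated on the slide: the channel capacity is passed in as `out0_capacity`, and n0 is clamped to the block size so that n1 can never be negative.

```python
from collections import deque

def flex_split(block, out0, out0_capacity, out1, select):
    """Route one data block across out0 and out1; returns (n0, n1)."""
    free = out0_capacity - len(out0)
    n0 = min(len(block), max(free, 0))   # clamp: an assumption of this sketch
    n1 = len(block) - n0
    out0.extend(block[:n0])              # first n0 tokens stay on the normal path
    out1.extend(block[n0:])              # the remaining n1 tokens take the other path
    select.extend([0] * n0 + [1] * n1)   # record the routing order for flex-merge
    return n0, n1

out0, out1, select = deque(["x"]), deque(), deque()
counts = flex_split(["a", "b", "c", "d"], out0, 3, out1, select)
```

In this run out0 has room for two more tokens, so "a" and "b" go to out0, "c" and "d" go to out1, and select records 0, 0, 1, 1.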

Page 29: Flexible Filters for  High Performance  Embedded Computing

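Flex-Merge

Flex-merge:
  pop i from select
  if i is 0, pop token from in0
  if i is 1, pop token from in1
  push token to out

[Figure: B on core 2 and C on core 3, with a second copy of C; a flex split (channels in, out0, out1, select) feeds the two copies of C, and a flex merge (channels in0, in1, select, out) recombines their outputs]

Overhead of Flexibility?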
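A matching Python sketch of flex-merge. The slide pops from select and then from the named channel. This sketch instead peeks at select and stops if the named channel is still empty, which is an assumption made so the example runs without blocking.

```python
from collections import deque

def flex_merge(in0, in1, select, out):
    """Drain select, pulling each token from the channel it names."""
    while select:
        i = select[0]                      # peek at the next routing decision
        source = in0 if i == 0 else in1
        if not source:
            break                          # token not produced yet; retry later
        select.popleft()
        out.append(source.popleft())

in0, in1 = deque(["a", "b"]), deque(["c", "d"])
select, out = deque([0, 0, 1, 1]), deque()
flex_merge(in0, in1, select, out)
merged = list(out)
```

Because tokens leave in the order recorded on select, the output stream keeps the original token order regardless of which copy of the filter finished first.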

Page 30: Flexible Filters for  High Performance  Embedded Computing

Multi-Channel Flex-Split and Flex-Merge

[Figure: a flex split with a select channel, a filter and its flexible copy (filter flex), and one flex merge for each of output channels 1 to n]

Page 31: Flexible Filters for  High Performance  Embedded Computing

Multi-Channel Flex-Split and Flex-Merge

[Figure: a filter with input channels 1 to n]

Page 32: Flexible Filters for  High Performance  Embedded Computing

Multi-Channel Flex-Split and Flex-Merge

[Figure: Centralized. Input channels 1 to n pass through a flex split to a filter and its flexible copy (filter flex), with a select channel and a flex merge]

Page 33: Flexible Filters for  High Performance  Embedded Computing

Multi-Channel Flex-Split and Flex-Merge

[Figure: two designs for a filter with input channels 1 to n and a flexible copy (filter flex). Centralized: a flex split, a select channel, and a flex merge. Distributed: flex splits labelled β flex split, a select channel, and a flex merge]

Page 34: Flexible Filters for  High Performance  Embedded Computing

Outline

• Introduction
• Design Flow of a Stream Program with Flexibility
• Performance
• Implementation of Flexible Filters
• Experiments
  – CFAR Case Study

Page 35: Flexible Filters for  High Performance  Embedded Computing

Cell BE Processor

• Distributed Memory
• Heterogeneous
  – 8 SIMD (SPU) cores
  – 1 PowerPC (PPU)
• Element Interconnect Bus
  – 4 rings
  – 205 GB/s
• Gedae Programming Language
  – Communication Layer

Page 36: Flexible Filters for  High Performance  Embedded Computing

Gedae

• Commercial data-flow language and programming GUI
• Performance analysis tools

Page 37: Flexible Filters for  High Performance  Embedded Computing

CFAR Benchmark

[Figure: the CFAR pipeline of filters: uInt to float, square, left window, right window, add, align data, find targets]

[Figure: Profile of CFAR filters on Cell. Bar chart of average execution time (microseconds) per 100 tokens, on a scale of 0 to 5, for find targets, add, right window, and uIntToFloat]

Page 38: Flexible Filters for  High Performance  Embedded Computing

Data Dependency

• By changing threshold, change % targets
  – 1.3 %
  – 7.3 %
• Additional workload per target
  – 16 µs
  – 32 µs
  – 64 µs

                       % Targets
Additional Workload    1.3     7.3
16 µs                  0.82    1.45
32 µs                  1.06    1.39
64 µs                  1.27    1.47

Page 39: Flexible Filters for  High Performance  Embedded Computing

More Benchmarks

Benchmark       Field                Results
                                     Configuration                    Speedup
Dedup           Information Theory   Rabin block / max chunk size
                                       4096/512                       2.00
JPEG            Image Processing     Image width x height
                                       128x128                        1.31
                                       256x256                        1.16
                                       512x512                        1.25
Value-at-Risk   Finance              stocks/walks/timesteps
                                       16/1024/1024                   0.98
                                       64/1024/1024                   1.56
                                       128/1024/1024                  1.55

Page 40: Flexible Filters for  High Performance  Embedded Computing

Conclusions

• Flexible filters
  – adapt to data dependent bottlenecks
  – distributed load balancing
  – provide speedup without modification to original filters
  – can be implemented on top of general stream languages