flexible filters for high performance embedded computing

Flexible Filters for High Performance Embedded Computing

Rebecca Collins and Luca Carloni
Department of Computer Science
Columbia University

Motivation

Stream Programming

• Stream Programming model
  – filter: a piece of sequential programming
  – channels: how filters communicate
  – token: an indivisible unit of data for a filter

• Examples: Signal processing, image processing, embedded applications
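The model above can be illustrated with a minimal sketch. The `Channel` class, `run_filter`, and the three toy filters are hypothetical names chosen for illustration; they are not taken from any particular stream language.

```python
from collections import deque


class Channel:
    """FIFO channel that carries tokens from one filter to the next."""

    def __init__(self):
        self._q = deque()

    def push(self, token):
        self._q.append(token)

    def pop(self):
        return self._q.popleft()

    def __len__(self):
        return len(self._q)


def run_filter(fn, inp, out):
    """Apply the sequential function fn to every token waiting on inp."""
    while len(inp) > 0:
        out.push(fn(inp.pop()))


def run_pipeline(tokens):
    # Hypothetical three-filter pipeline A -> B -> C.
    a_in, a_to_b, b_to_c, c_out = Channel(), Channel(), Channel(), Channel()
    for t in tokens:
        a_in.push(t)
    run_filter(float, a_in, a_to_b)               # A: convert to float
    run_filter(lambda x: x * x, a_to_b, b_to_c)   # B: square
    run_filter(lambda x: x + 1.0, b_to_c, c_out)  # C: add a constant
    return [c_out.pop() for _ in range(len(c_out))]
```

Here each filter runs to completion in turn; on a real platform the filters would run concurrently on separate cores and the channels would have bounded capacity.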

[Figure: pipeline of filters A → B → C]

Example: Constant False-Alarm Rate (CFAR) Detection

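[Figure: a cell under test surrounded by guard cells]

Compare the cell under test to the cells around it (excluding the guard cells), with additional factors:
• number of gates
• threshold (μ)
• number of guard cells
• rows and other dimensional data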

http://www.ll.mit.edu/HPECchallenge/
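As a rough illustration of the detection step, here is a one-dimensional sketch. It assumes that each cell's power is compared against the mean power of a window of cells on either side, skipping the guard cells, and that a cell is a target when the ratio exceeds the threshold. The function name, the `window_len` parameter, and the one-dimensional layout are assumptions made for this example; this is not the HPEC reference kernel.

```python
def cfar_1d(power, window_len, num_guard, mu):
    """Return indices of cells flagged as targets (illustrative sketch).

    power:      list of cell powers (squared magnitudes)
    window_len: cells averaged on each side of the cell under test (assumed)
    num_guard:  guard cells skipped on each side of the cell under test
    mu:         detection threshold on the power-to-noise ratio
    """
    targets = []
    n = len(power)
    for i in range(n):
        window = []
        # Collect the left and right windows, skipping the guard cells.
        for d in range(num_guard + 1, num_guard + window_len + 1):
            if i - d >= 0:
                window.append(power[i - d])
            if i + d < n:
                window.append(power[i + d])
        if not window:
            continue
        noise = sum(window) / len(window)
        if noise > 0 and power[i] / noise > mu:
            targets.append(i)
    return targets
```

For example, a flat stream of 15 cells with one strong return in the middle yields a single detection at that cell.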

HPEC Challenge

http://www.ll.mit.edu/HPECchallenge/

CFAR Benchmark

Data Stream: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

[Figure: animation of the data stream passing through the CFAR filters uIntToFloat, square, right window, left window, add, align data, and find targets. For a cell under test t, the left window and the right window lie on either side of t.]

Mapping

Throughput: rate of processing data tokens

[Figure: filters A → B → C mapped onto a multi-core chip in two ways: A on core 1, B on core 2, and C on core 3; or A and B together on core 1 and C on core 3.]

Unbalanced Flow

• Bottlenecks reduce throughput
  – Caused by backpressure
    • inherent algorithmic imbalances
    • data-dependent computational spikes

[Figure: pipeline A → B → C on core 1, core 2, core 3, with a filter marked “wait”]

Data Dependent Execution Time: CFAR

• Using Set 1 from the HPEC CFAR Kernel Benchmark

• Over a block of about 100 cells

• Extra workload of 32 microseconds per target
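To get a feel for the numbers above, the following sketch estimates the average extra time per block as (cells per block) × (fraction of targets) × (extra workload per target). The function and parameter names are hypothetical, and the block size of 100 is the approximate figure from the bullet above; individual blocks will vary around this average.

```python
def extra_time_per_block(cells_per_block, target_fraction, extra_us_per_target):
    """Average extra execution time (microseconds) added to one block."""
    return cells_per_block * target_fraction * extra_us_per_target


# Average extra microseconds per block for three target percentages.
estimates = {
    pct: extra_time_per_block(100, pct / 100.0, 32.0)
    for pct in (0.3, 1.3, 7.3)
}
```

With 7.3 % targets the average extra work is roughly 234 microseconds per block, against roughly 10 microseconds with 0.3 % targets, which is why the spread of execution times grows with the target percentage.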

[Figure: three histograms of Percent versus Execution Time (microseconds), one each for 7.3 % targets, 1.3 % targets, and 0.3 % targets]

Data Dependent Execution Time

Some other examples:
• Bloom Filters (Financial, Spam detection)
• Compression (Image processing)

[Figure: histogram of Percent versus Execution Time (s), over the range 0.000 to 0.005]

Our Solution: Flexible Filters

• Unused cycles on B are filled by working ahead on filter C with the data already present on B

• Push stream flow upstream of a bottleneck
• Semantic Preservation

[Figure: pipeline A → B → C on core 1, core 2, core 3, with a second copy of C placed between a flex-split and a flex-merge]

Related Works

• Static Compiler Optimizations
  – StreamIt [Gordon et al., 2006]
• Dynamic Runtime Load Balancing
  – Work Stealing/Filter Migration
    • [Kakulavarapu et al., 2001]
    • Cilk [Frigo et al., 1998]
    • Flux [Shah et al., 2003]
    • Borealis [Xing et al., 2005]
  – Queue based load balancing
    • Diamond [Huston et al., 2005]: distributed search, queue based load balancing, filter re-ordering
• Combination Static+Dynamic
  – FlexStream [Hormati et al., 2009]: multiple competing programs

Outline

• Introduction
• Design Flow of a Stream Program with Flexibility
• Performance
• Implementation of Flexible Filters
• Experiments
  – CFAR Case Study

Design Flow of a Stream Program with Flexible Filters

[Figure: design flow supported by compiler/design tools, with stages Design stream algorithm, Mapping (Filters, Memory), Add Flexibility to Bottlenecks, and Profile]

Profile of CFAR filters on Cell

[Figure: bar chart of Average Execution Time (microseconds) per 100 tokens, on a scale of 0 to 5, for the CFAR filters; labeled bars include find targets, add, right window, and uIntToFloat]

Design Considerations

[Figure: design flow with stages Design stream algorithm, Mapping (Filters, Memory), and Profile]

Adding Flexibility

[Figure: design flow with stages Design stream algorithm, Mapping (Filters, Memory), and Profile]




Mapping Stream Programs to Multi-Core Platforms

[Figure: three ways to map filters A, B, C onto core 1, core 2, core 3: pipeline mapping (one filter per core), SPMD mapping (every core runs A, B, and C), and sharing a core]

Throughput: SPMD Mapping

Suppose EA = 2, EB = 2, EC = 3

[Figure: SPMD mapping timeline over timesteps t0 to t6. Core 1 runs A1, B1, C1; core 2 runs A2, B2, C2; core 3 runs A3, B3, C3.]

3 tokens processed in 7 timesteps, ideal throughput = 3/7 = 0.429

Throughput: Pipeline Mapping

[Figure: pipeline A → B → C on core 1, core 2, core 3 with latency 2, latency 2, latency 3. Timeline over timesteps t0 to t9: core 1 runs A1, A2, A3, A4; core 2 runs B1, B2, B3; core 3 runs C1, C2.]

throughput = 1/3 = 0.333 < 0.429
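The two throughput figures can be reproduced with a short calculation. This is a sketch under the idealized assumptions of the slides (no communication cost, steady state); the function names are hypothetical.

```python
def spmd_throughput(exec_times, num_cores):
    """Every core runs all filters on its own token.

    One token costs sum(exec_times) timesteps on a core, and num_cores
    tokens are processed in parallel.
    """
    return num_cores / sum(exec_times)


def pipeline_throughput(exec_times):
    """One filter per core: the slowest filter sets the steady-state rate."""
    return 1 / max(exec_times)


exec_times = [2, 2, 3]  # execution times of A, B, C (timesteps per token)
spmd = spmd_throughput(exec_times, num_cores=3)   # 3/7
pipe = pipeline_throughput(exec_times)            # 1/3
```

The pipeline figure depends only on the slowest filter, which is why a single bottleneck such as C limits the whole stream.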

Data Blocks

data block: group of data tokens

Throughput: Pipeline Augmented with Flexibility

[Figure: the pipeline timeline over timesteps t0 to t9 for core 1, core 2, core 3, with part of the work on C2 executed on core 2 alongside B]

throughput = 2/5 = 0.4 < 0.429 (but > 0.333)
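One way to arrive at the 2/5 figure is to assume that the bottleneck's work can be divided freely between its own core and the core immediately upstream, so that the two cores together handle both filters. The sketch below encodes that idealized assumption; it is a simplified model with a hypothetical function name, not the analysis used by the authors.

```python
def flex_pipeline_throughput(exec_times, bottleneck):
    """Idealized pipeline throughput when the bottleneck filter is flexible.

    exec_times: per-token execution time of each filter, in pipeline order
    bottleneck: index of the filter whose work is shared with the core
                immediately upstream (assumed to split without overhead)
    """
    if bottleneck < 1:
        raise ValueError("the bottleneck needs an upstream filter to share with")
    # Two cores jointly process the upstream filter and the bottleneck.
    pair_rate = 2 / (exec_times[bottleneck - 1] + exec_times[bottleneck])
    # Every other filter still runs alone on its own core.
    other_rates = [
        1 / e
        for i, e in enumerate(exec_times)
        if i not in (bottleneck - 1, bottleneck)
    ]
    return min([pair_rate] + other_rates)


flex = flex_pipeline_throughput([2, 2, 3], bottleneck=2)  # 2 / (2 + 3)
```

With EA = 2, EB = 2, EC = 3, cores 2 and 3 share five timesteps of work per token, giving 2 tokens every 5 timesteps.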




Flex-Split

Flex-split:
  pop data block b from in
  n0 = available space on out0 (at most |b|)
  n1 = |b| - n0
  send n0 to out0, n1 to out1
  send n0 0’s, then n1 1’s to select

maintain ordering based on run-time state of queues

[Figure: B and a copy of C on core 2, C on core 3. Flex split has input in and outputs out0, out1, and select; flex merge has inputs in0, in1, and select and output out.]
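The flex-split steps can be sketched as follows. Queues are modeled with `collections.deque`, and `out0_capacity` is a hypothetical parameter standing in for the run-time state of the out0 queue; clamping n0 to the block size is an assumption added so that n1 is never negative.

```python
from collections import deque


def flex_split(block, out0, out1, select, out0_capacity):
    """Divide one data block between out0 and out1 (illustrative sketch).

    As many tokens as currently fit go to out0 and the rest go to out1.
    The select queue records, token by token, which output was used so
    that the original order can be restored later.
    """
    space = out0_capacity - len(out0)
    n0 = max(0, min(space, len(block)))  # tokens that fit on out0
    n1 = len(block) - n0                 # remaining tokens
    out0.extend(block[:n0])
    out1.extend(block[n0:])
    select.extend([0] * n0 + [1] * n1)
    return n0, n1


out0, out1, select = deque(), deque(), deque()
counts = flex_split([1, 2, 3, 4, 5], out0, out1, select, out0_capacity=2)
```

In this run out0 has room for two tokens, so tokens 1 and 2 go to out0, tokens 3 to 5 go to out1, and select receives 0, 0, 1, 1, 1.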

Flex-Merge

Flex-merge:
  pop i from select
  if i is 0, pop token from in0
  if i is 1, pop token from in1
  push token to out

[Figure: B and a copy of C on core 2, C on core 3. Flex split has input in and outputs out0, out1, and select; flex merge has inputs in0, in1, and select and output out.]

Overhead of Flexibility?
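The flex-merge steps can be sketched in the same style. The sketch assumes every token named by select has already arrived on its input queue; a real implementation would block until the token is available. The function name and the sample token labels are hypothetical.

```python
from collections import deque


def flex_merge(in0, in1, select):
    """Restore the original token order from two input queues (sketch)."""
    out = []
    while select:
        i = select.popleft()              # which input holds the next token
        source = in0 if i == 0 else in1
        out.append(source.popleft())
    return out


# Results computed by the two copies of C, in the order each produced them.
in0 = deque(["c1", "c2"])
in1 = deque(["c3", "c4", "c5"])
select = deque([0, 0, 1, 1, 1])
merged = flex_merge(in0, in1, select)
```

Because the merge follows select rather than arrival order, the output sequence matches what a single copy of C would have produced, which is how the semantics of the original stream are preserved.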

Multi-Channel Flex-Split and Flex-Merge

[Figure: flex split and flex merge applied to a filter with input channels 1 to n and output channels 1 to n, with a flex merge on each output channel. Two arrangements of the input side are shown: Centralized, with a single flex split and a select channel, and Distributed, with units labeled β flex split and a select channel.]




Cell BE Processor

• Distributed Memory
• Heterogeneous
  – 8 SIMD (SPU) cores
  – 1 PowerPC (PPU)
• Element Interconnect Bus
  – 4 rings
  – 205 GB/s
• Gedae Programming Language
  Communication Layer

Gedae

• Commercial data-flow language and programming GUI
• Performance analysis tools

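CFAR Benchmark

[Figure: CFAR stream graph with filters uIntToFloat, square, right window, left window, add, align data, and find targets]

Profile of CFAR filters on Cell

[Figure: bar chart of Average Execution Time (microseconds) per 100 tokens, on a scale of 0 to 5, for the CFAR filters; labeled bars include find targets, add, right window, and uIntToFloat]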

Data Dependency

• By changing threshold, change % targets
  – 1.3 %
  – 7.3 %
• Additional workload per target
  – 16 μs
  – 32 μs
  – 64 μs

Additional Workload    % Targets: 1.3    % Targets: 7.3
16 μs                  0.82              1.45
32 μs                  1.06              1.39
64 μs                  1.27              1.47

More Benchmarks

Benchmark       Field                Results: configuration                      Results: speedup
Dedup           Information Theory   Rabin block / max chunk size
                                       4096/512                                  2.00
JPEG            Image Processing     Image width x height
                                       128x128                                   1.31
                                       256x256                                   1.16
                                       512x512                                   1.25
Value-at-Risk   Finance              stocks/walks/timesteps
                                       16/1024/1024                              0.98
                                       64/1024/1024                              1.56
                                       128/1024/1024                             1.55

Conclusions

• Flexible filters
  – adapt to data dependent bottlenecks
  – distributed load balancing
  – provide speedup without modification to original filters
  – can be implemented on top of general stream languages
