flexible filters for high performance embedded computing
DESCRIPTION
Flexible Filters for High Performance Embedded Computing. Rebecca Collins and Luca Carloni Department of Computer Science Columbia University. Motivation. Stream Programming. Stream Programming model filter : a piece of sequential programming channels : how filters communicate - PowerPoint PPT PresentationTRANSCRIPT
Flexible Filters for High Performance
Embedded ComputingRebecca Collins and Luca CarloniDepartment of Computer Science
Columbia University
Motivation
Stream Programming
• Stream Programming model– filter: a piece of sequential programming– channels: how filters communicate– token: an indivisible unit of data for a filter
• Examples: Signal processing, image processing, embedded applications
CBA
Example: Constant False-Alarm Rate (CFAR) Detection
guard cells cell under test
compare to with additional factors• number of gates• threshold (μ)• number of guard cells• rows and other dimensional data
http://www.ll.mit.edu/HPECchallenge/
HPEC Challenge
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
uInt tofloat square right
windowleft
window
add
aligndata
findtargets
CFAR Benchmark
Data Stream:
uInt tofloat square right
windowleft
window
add
aligndata
findtargets1
CFAR Benchmark
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
uInt tofloat square right
windowleft
window
add
aligndata
findtargets2 1
CFAR Benchmark
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
uInt tofloat square right
windowleft
window
add
aligndata
findtargets
1
23 1
CFAR Benchmark
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
cell under test
uInt tofloat square right
windowleft
window
add
aligndata
findtargets1
67891011
23910111213
8
CFAR Benchmark
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
t
right window
left window
t
Mapping
Throughput: rate of processing data tokens
multi-core chip
core 1
A
core 2
B
core 3
C
CBA
Mapping
Throughput: rate of processing data tokens
multi-core chip
core 1
AB
core 3
C
CBA
core 1 core 2 core 3
Unbalanced Flow
• Bottlenecks reduce throughput– Caused by backpressure
• inherent algorithmic imbalances• data-dependent computational spikes
CBA
“wait”
Data Dependent Execution Time: CFAR
• Using Set 1 from the HPEC CFAR Kernel Benchmark
• Over a block of about 100 cells
• Extra workload of 32 microseconds per target
7.3 % targets
0
5
10
15
0 96 192 288 384 480 576
Execution Time (microseconds)
Perc
ent
1.3 % targets
05
101520253035
2 128 254 380 506Execution Time (microseconds)
Perc
ent
0.3 % targets
0
20
40
60
80
2 128 255 381 508Execution Time (microseconds)
Perc
ent
Data Dependent Execution Time
Some other examples:• Bloom Filters (Financial, Spam detection)• Compression (Image processing)
0123456789
0.000 0.001 0.002 0.003 0.004 0.005
Execution Time(s)
Perc
ent
Our Solution: Flexible Filters
• Unused cycles on B are filled by working ahead on filter C with the data already present on B
• Push stream flow upstream of a bottleneck• Semantic Preservation
BA C
C
core 1 core 2 core 3
flex-split flex-merge
Related Works• Static Compiler Optimizations
– StreamIt [Gordon et al, 2006]• Dynamic Runtime Load Balancing
– Work Stealing/Filter Migration • [Kakulavarapu et al., 2001]• Cilk [Frigo et al., 1998]• Flux [Shah et al., 2003]• Borealis [Xing et al., 2005]
– Queue based load balancing• Diamond [Huston et al., 2005] distributed search, queue based load
balancing, filter re-ordering
• Combination Static+Dynamic– FlexStream [Hormati et al., 2009] multiple competing programs
Outline• Introduction• Design Flow of a Stream Program with
Flexibility • Performance• Implementation of Flexible Filters• Experiments
– CFAR Case Study
Design Flow of a Stream Program with Flexible Filters
compiler/design tools
Design stream algorithm
Mapping• Filters
• Memory
Add Flexibility to Bottlenecks
Profile
Profile of CFAR filters on Cell
0 1 2 3 4 5
find targets
add
right window
uIntToFloat
Average Execution Time (microseconds) per 100 tokens
Design ConsiderationsDesign stream
algorithm
Mapping• Filters
• Memory
Add Flexibility to Bottlenecks
Profile
Adding Flexibility
Design stream algorithm
Mapping• Filters
• Memory
Add Flexibility to Bottlenecks
Profile
Outline• Introduction• Design Flow of a Stream Program with
Flexibility • Performance• Implementation of Flexible Filters• Experiments
– CFAR Case Study
Mapping Stream Programs to Multi-Core Platforms
BA C
BA C
BA C
BA C
pipeline mapping
SPMD mapping
BA C
sharing a core
core 1
core 1
core 2 core 3
core 3
core 2
core 2
core 1
Throughput: SPMD Mapping
BA C
BA C
BA C
SPMD mapping
t0 t1 t2 t3 t4 t5 t6
core 1
core 2
core 3
A1
A2
A3
B1
B2
B3
C1
C2
C3
3 tokens processed in 7 timesteps,ideal throughput = 3/7 = 0.429
Suppose EA = 2, EB = 2, EC = 3
core 2
core 1
core 3
2
3
1 EA=2 EB=2 EC=3
Throughput: Pipeline Mapping
CBAcore 1 core 2 core 3
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
core 1core 2core 3
latency 2 latency 2 latency 3
1
A1 A2B1
C1B2A3
C2B3A4
12 123 234
throughput = 1/3 = 0.333 < 0.429
Data Blocks
CBAcore 1 core 2 core 3
C
data block: group of data tokens
Throughput: Pipeline Augmented with Flexibility
CBAcore 1 core 2 core 3
C
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
core 1core 2core 3
A1 A2B1
C1B2A3
C2B3A4
C2
throughput = 2/5 = 0.4 < 0.429 (but > 0.333)
1 12 123 234
Outline• Introduction• Design Flow of a Stream Program with
Flexibility • Performance• Implementation of Flexible Filters• Experiments
– CFAR Case Study
Flex-SplitFlex-split
pop data block b from inn0 = available space on out0n1 = |b| - n0send n0 to out0, n1 to out1send n0 0’s, then n1 1’s to
select
CB
core 2 core 3
C
flex split flex mergeC
C
out
out1
in0
in1
select
maintain ordering based on run-time state of queues
out0in
Flex-MergeFlex-merge
pop i from selectif i is 0, pop token from in0if i is 1, pop token from in1push token to out
CB
core 2 core 3
C
flex split flex mergeC
C
out
out1
in0
in1
select
out0in Overhead of Flexibility?
Multi-Channel Flex-Split and Flex-Merge
flex split filter
select
…filterflex
flex merge
flex merge
flex merge
output channel 1
output channel 2
output channel n
output channel 1
output channel 2
output channel n
Multi-Channel Flex-Split and Flex-Merge
filter
…input channel 1
input channel 2
input channel n
Multi-Channel Flex-Split and Flex-Merge
filter
…filterflex
input channel 1
input channel 2
input channel n
flex merge
select
flex split
Centralized
Multi-Channel Flex-Split and Flex-Merge
filter
…filterflex
input channel 1
input channel 2
input channel n
flex merge
select
flex split
filter
… filterflex
input channel 1
input channel 2
input channel n
flex mergeflex split
Centralized
β flex split
β flex split
select
Distributed
Outline• Introduction• Design Flow of a Stream Program with
Flexibility • Performance• Implementation of Flexible Filters• Experiments
– CFAR Case Study
Cell BE Processor• Distributed Memory• Heterogeneous
– 8 SIMD (SPU) cores– 1 PowerPC (PPU)
• Element Interconnect Bus– 4 rings– 205 Gb/s
• Gedae Programming LanguageCommunication Layer
Gedae• Commercial data-flow language and programming GUI • Performance analysis tools
CFAR BenchmarkuInt tofloat square right
windowleft
window
add
aligndata
findtargets
Profile of CFAR filters on Cell
0 1 2 3 4 5
find targets
add
right window
uIntToFloat
Average Execution Time (microseconds) per 100 tokens
Data Dependency• By changing threshold, change % targets
– 1.3 % – 7.3 %
• Additional workload per target– 16 µs– 32 µs– 64 µs
1.3 7.316 μs 0.82 1.4532 μs 1.06 1.3964 μs 1.27 1.47
% Targets
AdditionalWorkload
More BenchmarksBenchmark Field Results
DedupInformation
TheoryRabin block/ max chunk size Speedup
4096/512 2.00
JPEG Image Processing
Image width x height128x128256x256512x512
1.311.161.25
Value-at-Risk
Financestocks/walks/timesteps
16/1024/102464/1024/1024
128/1024/1024
0.981.561.55
Conclusions
• Flexible filters – adapt to data dependent bottlenecks– distributed load balancing– provide speedup without modification to
original filters– can be implemented on top of general
stream languages