RSSI 2008
©2008 SRC Computers, Inc. ALL RIGHTS RESERVED
www.srccomputers.com
Algorithm Optimization Case Study
Edge Detection
David Caliga
SRC Computers, Inc.
Edge Detect Algorithm
Median Filter to remove noise
– 3 x 3 stencil
Prewitt or Sobel Edge Detect
– Uses 3 x 3 stencil for X and Y templates to calculate gradient values
• Prewitt templates

    X Template      Y Template      Pixel data
    -1  0  1         1  1  1        a00 a01 a02
    -1  0  1         0  0  0        a10 a11 a12
    -1  0  1        -1 -1 -1        a20 a21 a22

• Prewitt gradient
– SQRT (X*X + Y*Y) / 4
X = -1*a00 + 1*a02 - 2*a10 + 2*a12 - 1*a20 + 1*a22
Y = 1*a00 + 2*a01 + 1*a02 - 1*a20 - 2*a21 - 1*a22
(As written, the X and Y expansions weight the centre row and column by 2, which is the Sobel form; with the Prewitt templates every weight is 1.)
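A minimal sketch of the Prewitt gradient for a single 3 x 3 neighbourhood, written to mirror the unweighted sums used in the sample code on the next slide. The function names are hypothetical, the nine pixels are passed as a flat array in the order a00..a22, and an integer square root stands in for sqrtf so the sketch has no library dependency. The division by 4 shown on the slide is left out here because the sample code does not apply it.

```c
/* Integer square root (floor); stands in for sqrtf in this sketch. */
static unsigned isqrt_u32(unsigned v)
{
    unsigned r = 0;
    while ((r + 1) * (r + 1) <= v)
        r++;
    return r;
}

/* Prewitt gradient magnitude for one 3x3 neighbourhood.
   a[0..8] holds a00 a01 a02 a10 a11 a12 a20 a21 a22.
   The result is clamped to 255, as the sample code does. */
static unsigned char prewitt_pixel(const int a[9])
{
    /* X template: right column minus left column */
    int x = (a[2] + a[5] + a[8]) - (a[0] + a[3] + a[6]);
    /* Y template: top row minus bottom row */
    int y = (a[0] + a[1] + a[2]) - (a[6] + a[7] + a[8]);
    unsigned g = isqrt_u32((unsigned)(x * x + y * y));
    return g > 255 ? 255 : (unsigned char)g;
}
```

A flat neighbourhood gives 0, and a vertical step from 0 to 10 gives 30 (three rows each contributing 10 to X).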
Sample Code

Median Filter Code

    for (i=0; i<py-2; i++)
      for (j=0; j<px-2; j++) {
        // get 9 input values in 3x3 stencil
        a00 = REF (AL, i, j);
        a01 = REF (AL, i, j+1);
        a02 = REF (AL, i, j+2);
        a10 = REF (AL, i+1, j);
        a11 = REF (AL, i+1, j+1);
        a12 = REF (AL, i+1, j+2);
        a20 = REF (AL, i+2, j);
        a21 = REF (AL, i+2, j+1);
        a22 = REF (AL, i+2, j+2);
        // compute median filter
        median_9 (a00, a01, a02,
                  a10, a11, a12,
                  a20, a21, a22, &pix);
        // write output value to memory
        REF (BL, i, j) = pix;
      }

Edge Detect Code

    for (i=0; i<py-2; i++)
      for (j=0; j<px-2; j++) {
        // get 8 input values in 3x3 stencil
        b00 = REF (BL, i, j);
        b01 = REF (BL, i, j+1);
        b02 = REF (BL, i, j+2);
        b10 = REF (BL, i+1, j);
        b12 = REF (BL, i+1, j+2);
        b20 = REF (BL, i+2, j);
        b21 = REF (BL, i+2, j+1);
        b22 = REF (BL, i+2, j+2);
        // apply template
        hz = (b00 + b01 + b02) -
             (b20 + b21 + b22);
        vt = (b00 + b10 + b20) -
             (b02 + b12 + b22);
        // compute gradient
        pix = sqrtf (hz*hz+vt*vt);
        // write output value to memory
        if ((i>=2) & (j>=2))
          REF (CL, i-2, j-2) = pix>255?255:pix;
      }
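The slides call median_9 without showing it. The sketch below is a plain software stand-in, assuming the call shape seen above (nine inputs, one output pointer) and integer pixel values. It sorts a local copy with insertion sort and returns the middle element. This illustrates the result only and makes no claim about how the Carte routine is implemented in hardware.

```c
/* Median of nine values: sort a local copy and take element 4.
   The signature is assumed from the call sites on the slide. */
static void median_9(int a00, int a01, int a02,
                     int a10, int a11, int a12,
                     int a20, int a21, int a22, int *out)
{
    int v[9] = {a00, a01, a02, a10, a11, a12, a20, a21, a22};
    int k, m;

    /* insertion sort, ascending */
    for (k = 1; k < 9; k++) {
        int key = v[k];
        for (m = k - 1; m >= 0 && v[m] > key; m--)
            v[m + 1] = v[m];
        v[m + 1] = key;
    }
    *out = v[4];   /* middle of nine sorted values */
}
```

A single bright outlier in an otherwise dark neighbourhood is removed, which is the noise-suppression behaviour the first stage relies on.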
Optimization Process
Iterative process
– Maximize computation per loop iteration
– Flatten nested loops
– Do things in parallel
– Overlap DMAs and compute
– Maximize use of communication bandwidth
Keep going until you run out of a resource
– Off-chip Memory (OBM) accesses per clock
– Computational logic
– Internal memory blocks
– Multipliers
– Etc.
Optimization
Loop Firing Rate
Goal: Make computational loops fire every clock
Things that can prevent getting to the goal
– Multiple accesses per clock to memories
• Example of median code loop
– Loop carried scalar problems
• Eg: sxloc1++;
if (sxloc1 == px-1) sxloc1 = 0;
################## INNER LOOP SUMMARY ####################
loop on line 41:
clocks per iteration: 9
multiple reads of 'OBM bank A' required 8 additional clocks
################## INNER LOOP SUMMARY ####################
loop on line 52:
clocks per iteration: 3
loop-carried var 'sxloc1' required 2 additional clocks
######################################################################
Optimization
Flatten Nested Loops
Nested Compute time
– Time = Outer_cnt-1 * ((Inner_cnt-1 + pipeline depth) + outer_work)
– Not optimal if the pipeline depth is “large” relative to the inner trip count
– Not optimal if “outer_work” is large
Flattened Compute time
– Time = (Outer_cnt * Inner_cnt) - 1 + “new pipeline depth”
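The two timing estimates can be compared numerically. The sketch below reads the nested formula as (Outer_cnt - 1) * ((Inner_cnt - 1 + pipeline depth) + outer_work), which is one plausible parenthesization of the slide and is an assumption. The function names, the pipeline depth of 100 clocks and the outer_work of 10 clocks used in the note that follows are hypothetical values chosen for illustration, not measurements.

```c
/* Nested-loop estimate; the parenthesization of the slide's formula
   is an assumption of this sketch. */
static long nested_clocks(long outer_cnt, long inner_cnt,
                          long depth, long outer_work)
{
    return (outer_cnt - 1) * ((inner_cnt - 1 + depth) + outer_work);
}

/* Flattened-loop estimate: one pipeline fill for the whole iteration space. */
static long flattened_clocks(long outer_cnt, long inner_cnt, long new_depth)
{
    return outer_cnt * inner_cnt - 1 + new_depth;
}
```

With a 478 x 638 iteration space (a 640 x 480 image less the stencil border), nested_clocks(478, 638, 100, 10) gives 356319 and flattened_clocks(478, 638, 100) gives 305063. The gap grows with pipeline depth and outer_work, and shrinks as the inner trip count grows.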
Optimization
Eliminate Loop Carried Scalars
Carte™ supplies many functions for users to call in their code
– Accumulators
– Counters
– Bitwise operations
– Min, Max
– Etc.
Optimization
Flattened Loops
Median Filter Code

    for (n=0; n<(px-2)*(py-2); n++) {
      cg_count_ceil_32 (1, 1, n==0, px-2, &i);
      cg_count_ceil_32 (i==1, 1, n==0, 0xffffffff, &j);
      a00 = REF (AL, i, j);
      a01 = REF (AL, i, j+1);
      a02 = REF (AL, i, j+2);
      a10 = REF (AL, i+1, j);
      a11 = REF (AL, i+1, j+1);
      a12 = REF (AL, i+1, j+2);
      a20 = REF (AL, i+2, j);
      a21 = REF (AL, i+2, j+1);
      a22 = REF (AL, i+2, j+2);
      median_9 (a00, a01, a02,
                a10, a11, a12,
                a20, a21, a22,
                &pix);
      REF (BL, i, j) = pix; }

Edge Detect Code

    for (n=0; n<(px-2)*(py-2); n++) {
      cg_count_ceil_32 (1, 1, n==0, px-2, &i);
      cg_count_ceil_32 (i==1, 1, n==0, 0xffffffff, &j);
      b00 = REF (BL, i, j);
      b01 = REF (BL, i, j+1);
      b02 = REF (BL, i, j+2);
      b10 = REF (BL, i+1, j);
      b12 = REF (BL, i+1, j+2);
      b20 = REF (BL, i+2, j);
      b21 = REF (BL, i+2, j+1);
      b22 = REF (BL, i+2, j+2);
      hz = (b00 + b01 + b02) -
           (b20 + b21 + b22);
      vt = (b00 + b10 + b20) -
           (b02 + b12 + b22);
      pix = sqrtf (hz*hz+vt*vt);
      if ((i>=2) & (j>=2))
        REF (CL, i-2, j-2) = pix > 255 ? 255 : pix; }
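To make the flattening concrete, the sketch below walks a two-dimensional iteration space with a single loop and records the (row, column) pair seen at each step. The wrapping counter is a plain C stand-in; the exact semantics of cg_count_ceil_32 are not documented on these slides, so this is not a model of that call, and the function name is hypothetical. In hardware the update of j would itself be a loop-carried scalar, which is why the slides use the supplied counter functions instead.

```c
/* Visit every (i, j) of an outer_cnt x inner_cnt space with ONE loop.
   rows[] and cols[] must hold outer_cnt * inner_cnt entries.
   Returns the number of iterations executed. */
static int flattened_visit(int outer_cnt, int inner_cnt,
                           int *rows, int *cols)
{
    int i = 0, j = 0, n;

    for (n = 0; n < outer_cnt * inner_cnt; n++) {
        rows[n] = i;
        cols[n] = j;
        /* inner counter wraps at inner_cnt; outer counter steps on the wrap */
        if (++j == inner_cnt) {
            j = 0;
            i++;
        }
    }
    return n;
}
```

The sequence of pairs is the same one the original nested loops produce, so the loop body can be moved over unchanged.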
Performance Gains
640 x 480 Image (bar chart; vertical axis 50 to 400)

    Original              1.0
    Opt: Flatten Loops    1.02
Optimization
Remove Multiple Reads to OBM
The major loop slowdown noted by the compiler occurs because the compute loops want 9 or 8 array values from a single memory per loop iteration
Solution: Use the delay queue feature in the Carte™ compiler
Pipelining Stencil Code
Compute Process
Move a window through the image

    Data access input(i)  ->  3 x 3 window  ->  Compute f(x1,x2,..x9)  ->  Data Storage f(x)

        x1 x2 x3
        x4 x5 x6
        x7 x8 x9

9 points in stencil.
– After the window moves one pixel along a row, 6 have been seen before.
– Once the previous rows are remembered as well, 8 have been seen before.
The leading point should be the only data access.
Stencil Data Flow

    Data access input(x)  ->  9 Scalars  ->  Compute f(x1,x2,..x9)  ->  Data Storage f(x)

        x1 x2 x3
        x4 x5 x6
        x7 x8 x9

9 Scalars (16-unit Shift Register, remembers previous row)
Stencil Data Flow

    Data access input(i)  ->  delayq(in,&out)  ->  Compute f(x1,x2,..x9)  ->  Output(i)

(Animation, shown for rows of 16 pixels; one input value i0, i1, i2, ... is read per step.)
– i0 .. i15: inputs shift through the bottom row of the window and fill the first delay queue
– i16 onward: values leaving the first delay queue (starting with i0) enter the middle row and fill the second delay queue
– i32 onward: values leaving the second delay queue (starting with i0) enter the top row
– After i34 is read, the window holds

        i0  i1  i2
        i16 i17 i18
        i32 i33 i34

– After i47 is read, the window holds

        i13 i14 i15
        i29 i30 i31
        i45 i46 i47

  with i33 .. i47 and i17 .. i31 held in the two delay queues
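The data flow in the animation can be reproduced in software. The sketch below keeps nine window scalars plus two row-length FIFOs, so each step reads exactly one new input. The type and function names, and the ROW_MAX bound, are hypothetical; the FIFO is only a stand-in for the behaviour the slides attribute to the Carte delay queue (a value comes out one row after it went in), not a model of that call's actual arguments.

```c
#include <string.h>

#define ROW_MAX 64   /* hypothetical upper bound on row width */

typedef struct {
    int buf[ROW_MAX];
    int len;         /* delay in samples: one image row */
    int pos;
} delay_queue;

/* Push `in`; return the value pushed `len` calls earlier (0 until full). */
static int dq_step(delay_queue *q, int in)
{
    int out = q->buf[q->pos];
    q->buf[q->pos] = in;
    q->pos = (q->pos + 1) % q->len;
    return out;
}

typedef struct {
    int w[3][3];     /* w[0] = oldest row, w[2] = newest row */
    delay_queue q1, q2;
} stencil;

static void stencil_init(stencil *s, int row_width)
{
    memset(s, 0, sizeof *s);
    s->q1.len = row_width;
    s->q2.len = row_width;
}

/* Advance the window by one pixel; `in` is the only new data access. */
static void stencil_push(stencil *s, int in)
{
    int r;
    for (r = 0; r < 3; r++) {          /* shift every row left by one */
        s->w[r][0] = s->w[r][1];
        s->w[r][1] = s->w[r][2];
    }
    s->w[2][2] = in;                               /* leading point      */
    s->w[1][2] = dq_step(&s->q1, s->w[2][2]);      /* one row earlier    */
    s->w[0][2] = dq_step(&s->q2, s->w[1][2]);      /* two rows earlier   */
}
```

Feeding the values 0, 1, 2, ... with a row width of 16 reproduces the frames above: after 34 the window rows are 0 1 2, 16 17 18 and 32 33 34.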
Optimization
Delay Queues
Median Filter Code

    for (n=0; n<(px-2)*(py-2); n++) {
      cg_count_ceil_32 (1, 1, n==0, px-2, &i);
      cg_count_ceil_32 (i==1, 1, n==0, 0xffffffff, &j);
      mvalue = REF (AL, i, j);
      a20 = a21;
      a21 = a22;
      a22 = mvalue;
      a10 = a11;
      a11 = a12;
      delay_queue_8_var (a22,1,n==0,px, &a12);
      a00 = a01;
      a01 = a02;
      delay_queue_8_var (a12,1,n==0,px, &a02);
      median_9 (a00, a01, a02,
                a10, a11, a12,
                a20, a21, a22,
                &pix);

Edge Detect Code (same loop body, continued)

      b20 = b21;
      b21 = b22;
      b22 = pix;
      b10 = b11;
      b11 = b12;
      delay_queue_8_var (b22,1,n==0,px, &b12);
      b00 = b01;
      b01 = b02;
      delay_queue_8_var (b12,1,n==0,px, &b02);
      hz = (b00 + b01 + b02) -
           (b20 + b21 + b22);
      vt = (b00 + b10 + b20) -
           (b02 + b12 + b22);
      pix = sqrtf (hz*hz+vt*vt);
      if ((i>=2) & (j>=2))
        REF (BL, i-2, j-2) = pix > 255 ? 255 : pix;
    }
Performance Gains
(bar chart; vertical axis 50 to 400)

    Original              1.0
    Opt: Flatten Loops    1.02
    Opt: Delay Queues     9.04
Optimization
Overlap DMA and Compute
Use “streams” feature in Carte™
A stream is an efficient communication mechanism between producer and consumer loops
Producer and consumer loops are executing in “parallel”
– Consumer loop will use a value from the producer loop as soon as it is generated
– Loops that are inherently sequential can execute in parallel
DMAs can produce streams that are consumed in a compute loop
Optimization
Use of Streams
Median Filter Code

    #pragma src parallel sections
    {
    #pragma src section
    {
      streamed_dma_cpu (&S0, PORT_TO_STREAM,
                        PATH_0, image_in, 1, nbytes);
    }
    #pragma src section
    {
      for (n=0; n<(px-2)*(py-2); n++) {
        cg_count_ceil_32 (1, 1, n==0, px-2, &i);
        cg_count_ceil_32 (i==1, 1, n==0, 0xffffffff, &j);
        get_stream (&S0, &mvalue);
        a20 = a21; a21 = a22; a22 = mvalue;
        a10 = a11; a11 = a12;
        delay_queue_8_var (a22,1,n==0,px, &a12);
        a00 = a01; a01 = a02;
        delay_queue_8_var (a12,1,n==0,px, &a02);
        median_9 (a00, a01, a02,
                  a10, a11, a12,
                  a20, a21, a22, &pix);

Edge Detect Code (same loop body, continued)

        b20 = b21;
        b21 = b22;
        b22 = pix;
        b10 = b11;
        b11 = b12;
        delay_queue_8_var (b22,1,n==0,px, &b12);
        b00 = b01;
        b01 = b02;
        delay_queue_8_var (b12,1,n==0,px, &b02);
        hz = (b00 + b01 + b02) -
             (b20 + b21 + b22);
        vt = (b00 + b10 + b20) -
             (b02 + b12 + b22);
        pix = sqrtf (hz*hz+vt*vt);
        if ((i>=2) & (j>=2))
          REF (BL, i-2, j-2) = pix > 255 ? 255 : pix;
      }
    } } // end parallel section and region
Performance Gains
(bar chart; vertical axis 50 to 400)

    Original               1.0
    Opt: Flatten Loops     1.02
    Opt: Delay Queues      9.04
    Opt: Streaming DMAs    12.05
Optimization
Loop unrolling
Do more work during a loop iteration
Take advantage of the fact that the pixels are 8-bit data packed into 64-bit values
Unroll by 8 will reduce the compute time by 8x
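Unrolling by 8 depends on unpacking one 64-bit word into eight pixels and packing eight results back. The sketch below shows that step in portable C. The helper names split_u64 and comb_u64 are hypothetical and deliberately differ from the Carte calls used on the following slides, and the choice of the least significant byte as element 0 is an assumption about byte order.

```c
#include <stdint.h>

/* Unpack eight 8-bit pixels from one 64-bit word.
   b[0] is the least significant byte (an assumption of this sketch). */
static void split_u64(uint64_t w, uint8_t b[8])
{
    int k;
    for (k = 0; k < 8; k++)
        b[k] = (uint8_t)(w >> (8 * k));
}

/* Pack eight 8-bit pixels back into one 64-bit word, same byte order. */
static uint64_t comb_u64(const uint8_t b[8])
{
    uint64_t w = 0;
    int k;
    for (k = 0; k < 8; k++)
        w |= (uint64_t)b[k] << (8 * k);
    return w;
}
```

Packing the output of split_u64 with comb_u64 returns the original word, so a loop that reads one word, processes eight pixels and writes one word moves the same data as eight single-pixel iterations.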
Optimization
Loop Unrolling by 8
Median Filter Code

    #pragma src section
    {
      px8 = px/8;
      for (n=0; n<((px8)*(py-2)); n++) {
        cg_count_ceil_32 (1, 1, n==0, px8, &i);
        cg_count_ceil_32 (i==1, 1, n==0, py, &j);
        get_stream (&S0, &w1);
        /*  |   row
            ||  word
            ||| byte
            vvv
           v111 */
        a210=v0; a211=v1; a212=v2; a213=v3; a214=v4; a215=v5; a216=v6; a217=v7;
        split_64to8 (w1, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
        a218=v0; a219=v1;
        a110=v0; a111=v1; a112=v2; a113=v3; a114=v4; a115=v5; a116=v6; a117=v7;
        delay_queue_64_var (w1,1,n==0,px8, &w2);
        split_64to8 (w2, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
        a118=v0; a119=v1;
        a010=v0; a011=v1; a012=v2; a013=v3; a014=v4; a015=v5; a016=v6; a017=v7;
        delay_queue_64_var (w2,1,n==0, px8, &w3);
        split_64to8 (w3, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
        a018=v0; a019=v1;
Optimization
Loop Unrolling by 8
Median Filter Code
median_8_9 (a010, a011, a012, a110, a111, a112, a210, a211, a212, &p0);
median_8_9 (a011, a012, a013, a111, a112, a113, a211, a212, a213, &p1);
median_8_9 (a012, a013, a014, a112, a113, a114, a212, a213, a214, &p2);
median_8_9 (a013, a014, a015, a113, a114, a115, a213, a214, a215, &p3);
median_8_9 (a014, a015, a016, a114, a115, a116, a214, a215, a216, &p4);
median_8_9 (a015, a016, a017, a115, a116, a117, a215, a216, a217, &p5);
median_8_9 (a016, a017, a018, a116, a117, a118, a216, a217, a218, &p6);
median_8_9 (a017, a018, a019, a117, a118, a119, a217, a218, a219, &p7);
comb_8to64 (p7,p6,p5,p4,p3,p2,p1,p0,&b1);
put_stream (&S1, b1, 1);
} // end for
} // end parallel section
// continued into edge detect
Optimization
Loop Unrolling by 8
Edge Detect Code

    #pragma src section
    {
      px8 = px/8;
      for (n=0; n<((px8)*(py-2)); n++) {
        cg_count_ceil_32 (1,0,n==0,px8,&j);
        cg_count_ceil_32 (j==0,0,n==0,py,&i);
        get_stream (&S1, &w1);
        /*  |   row
            ||  word
            ||| byte
            vvv
           v111 */
        a210=v0; a211=v1; a212=v2; a213=v3; a214=v4; a215=v5; a216=v6; a217=v7;
        split_64to8 (w1, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
        a218=v0; a219=v1;
        a110=v0; a111=v1; a112=v2; a113=v3; a114=v4; a115=v5; a116=v6; a117=v7;
        delay_queue_64_var (w1, 1, n==0, px8, &w2);
        split_64to8 (w2, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
        a118=v0; a119=v1;
        a010=v0; a011=v1; a012=v2; a013=v3; a014=v4; a015=v5; a016=v6; a017=v7;
        delay_queue_64_var (w2, 1, n==0, px8, &w3);
        split_64to8 (w3, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
        a018=v0; a019=v1;
Optimization
Used Inlining of edge_detect_8
Edge Detect Code

    edge_detect_8 (a010, a011, a012, a110, a111, a112, a210, a211, a212, &p0);
    edge_detect_8 (a011, a012, a013, a111, a112, a113, a211, a212, a213, &p1);
    edge_detect_8 (a012, a013, a014, a112, a113, a114, a212, a213, a214, &p2);
    edge_detect_8 (a013, a014, a015, a113, a114, a115, a213, a214, a215, &p3);
    edge_detect_8 (a014, a015, a016, a114, a115, a116, a214, a215, a216, &p4);
    edge_detect_8 (a015, a016, a017, a115, a116, a117, a215, a216, a217, &p5);
    edge_detect_8 (a016, a017, a018, a116, a117, a118, a216, a217, a218, &p6);
    edge_detect_8 (a017, a018, a019, a117, a118, a119, a217, a218, a219, &p7);
    comb_8to64 (p7,p6,p5,p4,p3,p2,p1,p0,&b1);
    ix = (i-4)*px8 + j-2;
    if ((i>=4) & (j>=2)) DL[ix] = b1;
    } // end for
    } // end parallel section
    } // end parallel region
Performance Gains
(bar chart; vertical axis 50 to 400)

    Original               1.0
    Opt: Flatten Loops     1.02
    Opt: Delay Queues      9.04
    Opt: Streaming DMAs    12.05
    Opt: Unroll by 8       92.9
Optimization
Maximize Input DMA Bandwidth
Use streaming DMA to bring in two 64-bit values every clock
Unroll by 16
Two output DMA examples
– DMA to microprocessor after compute
– Streaming DMA to Global Common Memory
Optimization
Stream Input / Bulk DMA Output
Median Filter Code

    #pragma src parallel sections
    {
    #pragma src section
    {
      streamed_dma_gcm_128 (&S0, PORT_TO_STREAM, PATH_0,
                            image_in, 1, nbytes);
    }
    #pragma src section
    {
      px16 = px/16;
      for (n=0; n<((px16)*(py-2)); n++) {
        cg_count_ceil_32 (1, 1, n==0, px16, &i);
        cg_count_ceil_32 (i==1, 1, n==0, py, &j);
        get_stream_128 (&S0, &w1, &w2);
        // 16 median_8_9 calls
        put_stream_128 (&S1, b1, b2, 1);
      } // end for
    } // end parallel section

Edge Detect Code

    #pragma src section
    {
      px16 = px/16;
      for (n=0; n<((px16)*(py-2)); n++) {
        cg_count_ceil_32 (1, 0, n==0, px16, &j);
        cg_count_ceil_32 (j==0, 0, n==0, py, &i);
        get_stream_128 (&S1, &w1, &w2);
        // 16 edge_detect calls
        ix = (i-4)*px16 + j-2;
        if ((i>=4) & (j>=2)) {
          CL[ix] = b11;
          DL[ix] = b12;
        }
      } // end for
    } // end parallel section
    } // end parallel region
Performance Gains
(bar chart; vertical axis 50 to 400)

    Original                                            1.0
    Opt: Flatten Loops                                  1.02
    Opt: Delay Queues                                   9.04
    Opt: Streaming DMAs                                 12.05
    Opt: Unroll by 8                                    92.9
    Opt: Unroll by 16, Stream DMA in, Bulk DMA out      158
Optimization
Stream Input / Stream Output
Median Filter Code

    #pragma src parallel sections
    {
    #pragma src section
    {
      streamed_dma_gcm_128 (&S0, PORT_TO_STREAM, PATH_0,
                            image_in, 1, nbytes);
    }
    #pragma src section
    {
      px16 = px/16;
      for (n=0; n<((px16)*(py-2)); n++) {
        cg_count_ceil_32 (1, 1, n==0, px16, &i);
        cg_count_ceil_32 (i==1, 1, n==0, py, &j);
        get_stream_128 (&S0, &w1, &w2);
        // 16 median_8_9 calls
        put_stream_128 (&S1, b1, b2, 1);
      } // end for
    } // end parallel section

Edge Detect Code

    #pragma src section
    {
      px16 = px/16;
      for (n=0; n<((px16)*(py-2)); n++) {
        cg_count_ceil_32 (1, 0, n==0, px16, &j);
        cg_count_ceil_32 (j==0, 0, n==0, py, &i);
        get_stream_128 (&S1, &w1, &w2);
        // 16 edge_detect calls
        iput = ((i>=4) & (j>=2)) ? 1 : 0 ;
        put_stream_128 (&S2, b11, b12, iput);
      } // end for
    } // end parallel section
    #pragma src section
    {
      streamed_dma_gcm_128 (&S2, STREAM_TO_PORT,
                            PATH_0, image_out, 1, nbytes);
    }
    } // end parallel region
Performance Gains
(bar chart; vertical axis 50 to 400)

    Original                                            1.0
    Opt: Flatten Loops                                  1.02
    Opt: Delay Queues                                   9.04
    Opt: Streaming DMAs                                 12.05
    Opt: Unroll by 8                                    92.9
    Opt: Unroll by 16, Stream DMA in, Bulk DMA out      158
    Opt: Unroll by 16, Stream DMA in, Stream DMA out    277
Optimization
Maximize Input DMA Bandwidth
Use streaming DMA to bring in roughly four 64-bit values every clock
Unroll by 32
Optimization
Stream Input / Bulk DMA Output
Median Filter Code

    #pragma src parallel sections
    {
    #pragma src section
    {
      streamed_dma_gcm_256 (&S0, PORT_TO_STREAM, PATH_0,
                            image_in, 1, nbytes);
    }
    #pragma src section
    {
      px32 = px/32;
      for (n=0; n<((px32)*(py-2)); n++) {
        cg_count_ceil_32 (1, 1, n==0, px32, &i);
        cg_count_ceil_32 (i==1, 1, n==0, py, &j);
        get_stream_256 (&S0, &w1, &w2, &w3, &w4);
        // 32 median_8_9 calls
        put_stream_256 (&S1, b1, b2, b3, b4, 1);
      } // end for
    } // end parallel section

Edge Detect Code

    #pragma src section
    {
      px32 = px/32;
      for (n=0; n<((px32)*(py-2)); n++) {
        cg_count_ceil_32 (1, 0, n==0, px32, &j);
        cg_count_ceil_32 (j==0, 0, n==0, py, &i);
        get_stream_256 (&S1, &w1, &w2, &w3, &w4);
        // 32 edge_detect calls
        ix = (i-4)*px32 + j-2;
        if ((i>=4) & (j>=2)) {
          CL[ix] = b11;
          DL[ix] = b12;
          EL[ix] = b13;
          FL[ix] = b14;
        }
      } // end for
    } // end parallel section
    } // end parallel region
Performance Gains
(bar chart; vertical axis 50 to 400)

    Original                                            1.0
    Opt: Flatten Loops                                  1.02
    Opt: Delay Queues                                   9.04
    Opt: Streaming DMAs                                 12.05
    Opt: Unroll by 8                                    92.9
    Opt: Unroll by 16, Stream DMA in, Bulk DMA out      158
    Opt: Unroll by 16, Stream DMA in, Stream DMA out    277
    Opt: Unroll by 32                                   345
How does Microprocessor Performance* Compare
* Intel IPPLIB 5.1 running on 3GHz Xeon

    Optimization            MAP Speedup    MAP Speedup
                            640 x 480      1024 x 1024
    Original                .56            .69
    Opt: Flattened Loops    .56            .70
    Opt: Delay Queues       5.0            6.2
    Opt: Streaming DMAs     6.7            8.3
    Opt: Unroll by 8        51.6           56.7
    Opt: Unroll by 16       154            195
    Opt: Unroll by 32       191            256
Logic Utilization

    Optimization Level       Percentage Logic Utilization
    Original                 16
    Opt: Flattened Loops     16
    Opt: Delay Queues        16
    Opt: Streaming DMAs      16
    Opt: Unrolling by 8      37
    Opt: Unrolling by 16     47
    Opt: Unrolling by 32     75
Performance is for the taking
Standard code optimization techniques work
Getting massive compute parallelism is straightforward
Easy to “dial” the amount of DMA bandwidth to match compute parallelism requirements