
RSSI 2008

©2008 SRC Computers, Inc. ALL RIGHTS RESERVED

www.srccomputers.com

Algorithm Optimization Case Study

Edge Detection

David Caliga

SRC Computers, Inc.




Edge Detect Algorithm

Median filter to remove noise
– 3 x 3 stencil

Prewitt or Sobel edge detect
– Uses 3 x 3 stencils for the X and Y templates to calculate gradient values

• Prewitt templates

X Template    Y Template    Pixel data
-1  0  1       1  1  1      a00 a01 a02
-1  0  1       0  0  0      a10 a11 a12
-1  0  1      -1 -1 -1      a20 a21 a22

• Prewitt gradient
– SQRT (X*X + Y*Y) / 4

The sums below weight the center row and column by 2, which is the Sobel form; the Prewitt templates above use weights of 1 throughout.

X = -1*a00 + 1*a02 - 2*a10 + 2*a12 - 1*a20 + 1*a22
Y = 1*a00 + 2*a01 + 1*a02 - 1*a20 - 2*a21 - 1*a22
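The template arithmetic above can be sketched in plain C for a single 3 x 3 neighbourhood. This is a minimal illustration, not SRC's code: the names `prewitt_3x3` and `isqrt_u32` are hypothetical, the pixels are assumed to be stored row-major as a00..a22, an integer square root stands in for `sqrtf`, and the result is clamped to 255 as in the deck's sample code rather than divided by 4.

```c
#include <stdint.h>

/* Integer floor square root; hypothetical helper used instead of sqrtf. */
static uint32_t isqrt_u32(uint32_t v)
{
    uint32_t r = 0;
    while ((r + 1) * (r + 1) <= v)
        r++;
    return r;
}

/* Prewitt gradient magnitude for one 3x3 neighbourhood.
   a[] is row-major: a[0]=a00, a[1]=a01, ..., a[8]=a22. */
uint8_t prewitt_3x3(const uint8_t *a)
{
    /* X template: right column minus left column. */
    int x = (a[2] + a[5] + a[8]) - (a[0] + a[3] + a[6]);
    /* Y template: top row minus bottom row. */
    int y = (a[0] + a[1] + a[2]) - (a[6] + a[7] + a[8]);
    uint32_t g = isqrt_u32((uint32_t)(x * x + y * y));
    return g > 255 ? 255 : (uint8_t)g;   /* clamp to the 8-bit pixel range */
}
```

A flat patch gives 0, and a patch whose right column is much brighter than its left saturates at 255.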




Sample Code

Median Filter Code

for (i=0; i<py-2; i++)
  for (j=0; j<px-2; j++) {
    // get 9 input values in 3x3 stencil
    a00 = REF (AL, i, j);
    a01 = REF (AL, i, j+1);
    a02 = REF (AL, i, j+2);
    a10 = REF (AL, i+1, j);
    a11 = REF (AL, i+1, j+1);
    a12 = REF (AL, i+1, j+2);
    a20 = REF (AL, i+2, j);
    a21 = REF (AL, i+2, j+1);
    a22 = REF (AL, i+2, j+2);
    // compute median filter
    median_9 (a00, a01, a02,
              a10, a11, a12,
              a20, a21, a22, &px);
    // write output value to memory
    REF (BL, i, j) = px;
  }

Edge Detect Code

for (i=0; i<py-2; i++)
  for (j=0; j<px-2; j++) {
    // get 8 input values in 3x3 stencil
    b00 = REF (BL, i, j);
    b01 = REF (BL, i, j+1);
    b02 = REF (BL, i, j+2);
    b10 = REF (BL, i+1, j);
    b12 = REF (BL, i+1, j+2);
    b20 = REF (BL, i+2, j);
    b21 = REF (BL, i+2, j+1);
    b22 = REF (BL, i+2, j+2);
    // apply template
    hz = (b00 + b01 + b02) -
         (b20 + b21 + b22);
    vt = (b00 + b10 + b20) -
         (b02 + b12 + b22);
    // compute gradient
    px = sqrtf (hz*hz+vt*vt);
    // write output value to memory
    if ((i>=2) & (j>=2))
      REF (CL, i-2, j-2) = px>255?255:px;
  }
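The deck calls `median_9` without showing its body. As a software stand-in only (the name `median9` and the insertion-sort approach are assumptions, not the Carte implementation), the median of nine 8-bit values can be sketched as:

```c
#include <stdint.h>

/* Median of nine 8-bit values: insertion-sort a local copy and
   return the middle element. */
uint8_t median9(const uint8_t *v)
{
    uint8_t s[9];
    for (int i = 0; i < 9; i++) {
        uint8_t x = v[i];
        int k = i;
        /* Shift larger entries up to make room for x. */
        while (k > 0 && s[k - 1] > x) {
            s[k] = s[k - 1];
            k--;
        }
        s[k] = x;
    }
    return s[4];
}
```

A hardware version would more likely use a fixed compare-and-swap network so that every call has the same latency; the sort here only pins down the expected result.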




Optimization Process

Iterative process
– Maximize computation per loop iteration
– Flatten nested loops
– Do things in parallel
– Overlap DMAs and compute
– Maximize use of communication bandwidth

Keep going until you run out of a resource
– Off-chip Memory (OBM) accesses per clock
– Computational logic
– Internal memory blocks
– Multipliers
– Etc.



Optimization

Loop Firing Rate

Goal: Make computational loops fire every clock

Things that can prevent getting to the goal
– Multiple accesses per clock to memories
  • Example: median code loop
– Loop-carried scalar problems
  • E.g.: sxloc1++;
          if (sxloc1 == px-1) sxloc1 = 0;

################## INNER LOOP SUMMARY ####################
loop on line 41:
  clocks per iteration: 9
  multiple reads of 'OBM bank A' required 8 additional clocks
################## INNER LOOP SUMMARY ####################
loop on line 52:
  clocks per iteration: 3
  loop-carried var 'sxloc1' required 2 additional clocks
######################################################################



Optimization

Flatten Nested Loops

Nested Compute time

– Time = Outer_cnt-1 * ((Inner_cnt-1 + pipeline depth) + outer_work)
– Not optimal if the pipeline depth is “large” relative to the inner trip count
– Not optimal if “outer_work” is large

Flattened Compute time
– Time = (Outer_cnt * Inner_cnt) - 1 + “new pipeline depth”
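The two expressions can be compared numerically with a small C sketch. The function names are hypothetical, and the nested form assumes the slide's "Outer_cnt-1" is meant as the grouped factor (Outer_cnt - 1); all counts are in clocks.

```c
/* Clock estimate for the nested loop, reading the slide's expression as
   (Outer_cnt - 1) * ((Inner_cnt - 1 + pipeline depth) + outer_work). */
long nested_clocks(long outer_cnt, long inner_cnt,
                   long pipe_depth, long outer_work)
{
    return (outer_cnt - 1) * ((inner_cnt - 1 + pipe_depth) + outer_work);
}

/* Clock estimate for the flattened loop:
   (Outer_cnt * Inner_cnt) - 1 + new pipeline depth. */
long flattened_clocks(long outer_cnt, long inner_cnt, long new_pipe_depth)
{
    return outer_cnt * inner_cnt - 1 + new_pipe_depth;
}
```

With a short inner loop (10 outer trips, 4 inner trips, pipeline depth 20, 2 clocks of outer work) the nested estimate is 225 clocks against 59 flattened, because the nested form pays the pipeline depth on every outer trip.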




Optimization

Eliminate Loop Carried Scalars

Carte™ supplies many functions that users can call in their code

– Accumulators

– Counters

– Bitwise operations

– Min, Max

– Etc.




Optimization

Flattened Loops


// median filter loop
for (n=0; n<(px-2)*(py-2); n++) {
  cg_count_ceil_32 (1, 1, n==0, px-2, &i);
  cg_count_ceil_32 (i==1, 1, n==0, 0xffffffff, &j);
  a00 = REF (AL, i, j);
  a01 = REF (AL, i, j+1);
  a02 = REF (AL, i, j+2);
  a10 = REF (AL, i+1, j);
  a11 = REF (AL, i+1, j+1);
  a12 = REF (AL, i+1, j+2);
  a20 = REF (AL, i+2, j);
  a21 = REF (AL, i+2, j+1);
  a22 = REF (AL, i+2, j+2);
  median_9 (a00, a01, a02,
            a10, a11, a12,
            a20, a21, a22,
            &px);
  REF (BL, i, j) = px;
}

// edge detect loop
for (n=0; n<(px-2)*(py-2); n++) {
  cg_count_ceil_32 (1, 1, n==0, px-2, &i);
  cg_count_ceil_32 (i==1, 1, n==0, 0xffffffff, &j);
  b00 = REF (BL, i, j);
  b01 = REF (BL, i, j+1);
  b02 = REF (BL, i, j+2);
  b10 = REF (BL, i+1, j);
  b12 = REF (BL, i+1, j+2);
  b20 = REF (BL, i+2, j);
  b21 = REF (BL, i+2, j+1);
  b22 = REF (BL, i+2, j+2);
  hz = (b00 + b01 + b02) -
       (b20 + b21 + b22);
  vt = (b00 + b10 + b20) -
       (b02 + b12 + b22);
  px = sqrtf (hz*hz+vt*vt);
  if ((i>=2) & (j>=2))
    REF (CL, i-2, j-2) = px > 255 ? 255 : px;
}
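The index bookkeeping behind a flattened loop can be checked in plain C. This sketch uses a hypothetical `flat_visit` helper with ordinary wrapping counters; it does not model the semantics of `cg_count_ceil_32`, and on the MAP these counters would be the Carte counter functions rather than loop-carried scalars.

```c
#include <stddef.h>

/* Visit a rows x cols grid with one flat loop.  Records the (row, col)
   pair seen on each iteration and returns the iteration count. */
size_t flat_visit(int rows, int cols, int *out_row, int *out_col)
{
    int i = 0;   /* row counter */
    int j = 0;   /* column counter */
    size_t total = (size_t)rows * (size_t)cols;
    size_t n;
    for (n = 0; n < total; n++) {
        out_row[n] = i;
        out_col[n] = j;
        /* Column counter wraps at cols; the row counter steps on the wrap. */
        j++;
        if (j == cols) {
            j = 0;
            i++;
        }
    }
    return n;
}
```

The pairs come out in the same order as the nested `for (i...) for (j...)` form, so iteration n corresponds to row n / cols and column n % cols.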




Performance Gains

640 x 480 Image

[Bar chart; vertical axis ticks 50 to 400; bar labels and values:]

Original             1.0
Opt: Flatten Loops   1.02



Optimization

Remove Multiple Reads to OBM

The major loop slowdown noted by the compiler occurs because the compute loops want 9 or 8 array values from a single memory per loop iteration

Solution: use the delay queue feature in the Carte™ compiler



Pipelining Stencil Code

9 points in stencil.

Compute process: move a window through the image.

[Diagram, four frames: data access input(i) -> 3 x 3 window -> compute f(x1,x2,..x9) -> data storage f(x)]

Window layout:
x1 x2 x3
x4 x5 x6
x7 x8 x9

Frame 2: 6 have been seen before.
Frames 3 and 4: 8 have been seen before.

The leading point should be the only data access.



Stencil Data Flow

[Diagram: data access input(x) -> 9 scalars x1..x9 -> compute f(x1,x2,..x9) -> data storage f(x)]

9 scalars
(16-unit shift register, remembers previous row)



Stencil Data Flow

[Animation frames: data access input(i) -> delay queue -> compute f(x1,x2,..x9) -> Output(i); delayq(in,&out);]

Inputs i0, i1, i2, ... enter one per frame. Each new input becomes the leading stencil point and is also pushed into a 16-entry delay queue, so i0 emerges from the queue as i16 arrives. The three stencil rows therefore hold inputs that are 16 apart.

Stencil contents in the final frame (row width 16):
i13 i14 i15
i29 i30 i31
i45 i46 i47
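The data flow in these frames can be modelled in plain C: nine scalars form the window and two row-length delay queues supply the points from the previous two rows, so each step reads only the leading point. This is a software sketch under stated assumptions, not the Carte `delayq` or `delay_queue_8_var` implementation; `delayq_t`, `window_t`, `window_init`, `window_push` and `MAX_W` are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

#define MAX_W 64  /* hypothetical upper bound on the row width */

/* Fixed-length delay queue: returns the value pushed `len` calls earlier. */
typedef struct {
    uint8_t buf[MAX_W];
    int len;
    int pos;
} delayq_t;

static void delayq_init(delayq_t *q, int len)
{
    memset(q->buf, 0, sizeof q->buf);
    q->len = len;
    q->pos = 0;
}

static uint8_t delayq_step(delayq_t *q, uint8_t in)
{
    uint8_t out = q->buf[q->pos];   /* oldest entry; zero until the queue fills */
    q->buf[q->pos] = in;
    q->pos = (q->pos + 1) % q->len;
    return out;
}

/* 3x3 window held in nine scalars plus two row-length delay queues. */
typedef struct {
    uint8_t a[3][3];   /* a[row][col]; a[2][2] is the leading point */
    delayq_t q1, q2;
} window_t;

/* width must be between 1 and MAX_W. */
void window_init(window_t *w, int width)
{
    memset(w->a, 0, sizeof w->a);
    delayq_init(&w->q1, width);
    delayq_init(&w->q2, width);
}

/* Accept one new pixel; this is the only access to the input image. */
void window_push(window_t *w, uint8_t in)
{
    for (int r = 0; r < 3; r++) {   /* shift every row left by one */
        w->a[r][0] = w->a[r][1];
        w->a[r][1] = w->a[r][2];
    }
    w->a[2][2] = in;
    w->a[1][2] = delayq_step(&w->q1, in);          /* same column, one row up */
    w->a[0][2] = delayq_step(&w->q2, w->a[1][2]);  /* same column, two rows up */
}
```

Pushing the values 0 through 47 with a row width of 16 leaves the window holding 13 14 15 / 29 30 31 / 45 46 47, the same contents as the final animation frame.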




Optimization

Delay Queues

for (n=0; n<(py-2)*(py-2); n++) {

  // Median Filter Code
  mvalue = REF (AL, i, j);
  a20 = a21;
  a21 = a22;
  a22 = mvalue;
  a10 = a11;
  a11 = a12;
  delay_queue_8_var (a22,1,n==0,px, &a12);
  a00 = a01;
  a01 = a02;
  delay_queue_8_var (a12,1,n==0,px, &a02);

  median_9 (a00, a01, a02,
            a10, a11, a12,
            a20, a21, a22,
            &px);

  // Edge Detect Code
  b20 = b21;
  b21 = b22;
  b22 = px;
  b10 = b11;
  b11 = b12;
  delay_queue_8_var (b22,1,n==0,px, &b12);
  b00 = b01;
  b01 = b02;
  delay_queue_8_var (b12,1,n==0,px, &b02);

  hz = (b00 + b01 + b02) -
       (b20 + b21 + b22);
  vt = (b00 + b10 + b20) -
       (b02 + b12 + b22);
  px = sqrtf (hz*hz+vt*vt);

  if ((i>=2) & (j>=2))
    REF (BL, i-2, j-2) = px > 255 ? 255 : px;
}



Performance Gains

[Bar chart; vertical axis ticks 50 to 400; bar labels and values:]

Original             1.0
Opt: Flatten Loops   1.02
Opt: Delay Queues    9.04



Optimization

Overlap DMA and Compute

Use the “streams” feature in Carte™

A stream is an efficient communication mechanism between producer and consumer loops

Producer and consumer loops execute in “parallel”
– The consumer loop uses a value from the producer loop as soon as it is generated
– Loops that are inherently sequential can execute in parallel

DMAs can produce streams that are consumed in a compute loop
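The benefit of overlapping a producer and a consumer can be illustrated with a single-threaded, clock-by-clock simulation in C. This is only a sketch: `fifo_t`, `FIFO_CAP` and `run_overlapped` are hypothetical names, the stream is modelled as a small FIFO, and each side is assumed to handle at most one value per clock.

```c
#include <stddef.h>

#define FIFO_CAP 4  /* hypothetical stream depth */

typedef struct {
    int data[FIFO_CAP];
    size_t head;
    size_t count;
} fifo_t;

/* Returns 1 if the value was queued, 0 if the stream is full. */
static int fifo_put(fifo_t *f, int v)
{
    if (f->count == FIFO_CAP)
        return 0;
    f->data[(f->head + f->count) % FIFO_CAP] = v;
    f->count++;
    return 1;
}

/* Returns 1 and stores the oldest value in *v, or 0 if the stream is empty. */
static int fifo_get(fifo_t *f, int *v)
{
    if (f->count == 0)
        return 0;
    *v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return 1;
}

/* Run producer and consumer side by side, one step of each per clock.
   The consumer sums the values into *sum; returns the clocks used. */
long run_overlapped(const int *src, size_t n, long *sum)
{
    fifo_t f = {{0}, 0, 0};
    size_t produced = 0, consumed = 0;
    long clocks = 0;
    *sum = 0;
    while (consumed < n) {
        int v;
        /* Consumer: use a value as soon as one is available. */
        if (fifo_get(&f, &v)) {
            *sum += v;
            consumed++;
        }
        /* Producer: emit the next value if the stream has room. */
        if (produced < n && fifo_put(&f, src[produced]))
            produced++;
        clocks++;
    }
    return clocks;
}
```

In this model n values finish in n + 1 clocks, whereas running the producer to completion and then the consumer would take about 2n.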




Optimization

Use of Streams


#pragma src parallel sections {

#pragma src section {
  streamed_dma_cpu (&S0, PORT_TO_STREAM,
                    PATH_0, image_in, 1, nbytes); }

#pragma src section {
  for (n=0; n<(py-2)*(py-2); n++) {

    get_stream (&S0, &mvalue);
    a20 = a21; a21 = a22; a22 = mvalue;
    a10 = a11; a11 = a12;
    delay_queue_8_var (a22,1,n==0,px, &a12);
    a00 = a01; a01 = a02;
    delay_queue_8_var (a12,1,n==0,px, &a02);

    median_9 (a00, a01, a02,
              a10, a11, a12,
              a20, a21, a22, &px);

    b20 = b21;
    b21 = b22;
    b22 = px;
    b10 = b11;
    b11 = b12;
    delay_queue_8_var (b22,1,n==0,px, &b12);
    b00 = b01;
    b01 = b02;
    delay_queue_8_var (b12,1,n==0,px, &b02);

    hz = (b00 + b01 + b02) -
         (b20 + b21 + b22);
    vt = (b00 + b10 + b20) -
         (b02 + b12 + b22);
    px = sqrtf (hz*hz+vt*vt);

    if ((i>=2) & (j>=2))
      REF (BL, i-2, j-2) = px > 255 ? 255 : px;
  }
} } // end parallel section and region



Performance Gains

[Bar chart; vertical axis ticks 50 to 400; bar labels and values:]

Original              1.0
Opt: Flatten Loops    1.02
Opt: Delay Queues     9.04
Opt: Streaming DMAs   12.05



Optimization

Loop unrolling

Do more work during a loop iteration

Take advantage of the fact that the pixels are 8-bit data packed into 64-bit values

Unrolling by 8 will reduce the compute time by 8x
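Packing eight 8-bit pixels into one 64-bit word can be sketched in portable C with shifts. The helpers `unpack8` and `pack8` are hypothetical software analogues of splitting and combining a word; they are not the Carte `split_64to8` and `comb_8to64` macros, and placing pixel 0 in the least significant byte is an assumption.

```c
#include <stdint.h>

/* Split a 64-bit word into eight 8-bit pixels.
   out[0] is taken from the least significant byte (assumed ordering). */
void unpack8(uint64_t w, uint8_t *out)
{
    for (int k = 0; k < 8; k++)
        out[k] = (uint8_t)(w >> (8 * k));
}

/* Combine eight 8-bit pixels back into one 64-bit word, same ordering. */
uint64_t pack8(const uint8_t *in)
{
    uint64_t w = 0;
    for (int k = 0; k < 8; k++)
        w |= (uint64_t)in[k] << (8 * k);
    return w;
}
```

One loop iteration can then read a single word, process all eight pixels, and write a single word back, which is where the factor of 8 comes from.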




Optimization

Loop Unrolling by 8

Median Filter Code

#pragma src section {
  px8 = px/8;
  for (n=0; n<((px8)*(py-2)); n++) {
    cg_count_ceil_32 (1, 1, n==0, px8, &i);
    cg_count_ceil_32 (i==1, 1, n==0, py, &j);

    get_stream (&S0, &w1);

    /* |   row
       ||  word
       ||| byte
       vvv
      v111 */

    a210=v0; a211=v1; a212=v2; a213=v3; a214=v4; a215=v5; a216=v6; a217=v7;
    split_64to8 (w1, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
    a218=v0; a219=v1;

    a110=v0; a111=v1; a112=v2; a113=v3; a114=v4; a115=v5; a116=v6; a117=v7;
    delay_queue_64_var (w1,1,n==0,px8, &w2);

    a010=v0; a011=v1; a012=v2; a013=v3; a014=v4; a015=v5; a016=v6; a017=v7;
    delay_queue_64_var (w2,1,n==0, px8, &w3);



Optimization

Loop Unrolling by 8

Median Filter Code

    median_8_9 (a010, a011, a012, a110, a111, a112, a210, a211, a212, &p0);
    // ... through p7

    comb_8to64 (p7,p6,p5,p4,p3,p2,p1,p0,&b1);
    put_stream (&S1, b1, 1);
} // end parallel section

// continued into edge detect



Optimization

Loop Unrolling by 8

Edge Detect Code

#pragma src section {
  px8 = px/8;
  for (n=0; n<((px8)*(py-2)); n++) {
    cg_count_ceil_32 (1,0,n==0,px8,&j);
    cg_count_ceil_32 (j==0,0,n==0,py,&i);

    get_stream (&S1, &w1);

    /* |   row
       ||  word
       ||| byte
       vvv
      v111 */

    delay_queue_64_var (w1, 1, n==0, px8, &w2);

    a010=v0; a011=v1; a012=v2; a013=v3; a014=v4; a015=v5; a016=v6; a017=v7;
    delay_queue_64_var (w2, 1, n==0, px8, &w3);
    split_64to8 (w3, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
    a018=v0; a019=v1;



Optimization

Used Inlining of edge_detect_8

Edge Detect Code

    edge_detect_8 (a010, a011, a012, a110, a111, a112, a210, a211, a212, &p0);
    edge_detect_8 (a011, a012, a013, a111, a112, a113, a211, a212, a213, &p1);
    // ... through p7

    comb_8to64 (p7,p6,p5,p4,p3,p2,p1,p0,&b1);

    ix = (i-4)*px8 + j-2;
    if ((i>=4) & (j>=2)) DL[ix] = b1;

} // end parallel region



Performance Gains

[Bar chart; vertical axis ticks 50 to 400; bar labels and values:]

Original              1.0
Opt: Flatten Loops    1.02
Opt: Delay Queues     9.04
Opt: Streaming DMAs   12.05
Opt: Unroll by 8      92.9



Optimization

Maximize Input DMA Bandwidth

Use streaming DMA to bring in two 64-bit values every clock

Unroll by 16

Two output DMA examples
– DMA to microprocessor after compute
– Streaming DMA to Global Common Memory



Optimization

Stream Input / Bulk DMA Output

Median Filter Code

#pragma src parallel sections {

#pragma src section {
  streamed_dma_gcm_128 (&S0, PORT_TO_STREAM, PATH_0,
                        image_in, 1, nbytes); }

#pragma src section
{
  px16 = px/16;
  for (n=0; n<((px16)*(py-2)); n++) {
    cg_count_ceil_32 (1, 1, n==0, px16, &i);
    cg_count_ceil_32 (i==1, 1, n==0, py, &j);
    get_stream_128 (&S0, &w1, &w2);
    // 16 median_8_9 calls
    put_stream_128 (&S1, b1, b2, 1);
  }
}

Edge Detect Code

#pragma src section
{
  px16 = px/16;
  for (n=0; n<((px16)*(py-2)); n++) {
    cg_count_ceil_32 (1, 0, n==0, px16, &j);
    cg_count_ceil_32 (j==0, 0, n==0, py, &i);
    get_stream_128 (&S1, &w1, &w2);
    // 16 edge_detect calls
    ix = (i-4)*px16 + j-2;
    if ((i>=4) & (j>=2)) {
      CL[ix] = b11;
      DL[ix] = b12;
    }
  }
}
} // end parallel sections



Performance Gains

[Bar chart; vertical axis ticks 50 to 400; bar labels and values:]

Original                                          1.0
Opt: Flatten Loops                                1.02
Opt: Delay Queues                                 9.04
Opt: Streaming DMAs                               12.05
Opt: Unroll by 8                                  92.9
Opt: Unroll by 16, Stream DMA in, Bulk DMA out    158



Optimization

Stream Input / Stream Output





#pragma src section
{
  px16 = px/16;
  for (n=0; n<((px16)*(py-2)); n++) {
    get_stream_128 (&S0, &w1, &w2);
    put_stream_128 (&S1, b1, b2, 1);
  }
}

#pragma src section
{
  px16 = px/16;
  for (n=0; n<((px16)*(py-2)); n++) {
    get_stream_128 (&S1, &w1, &w2);
    // 16 edge_detect calls
    iput = ((i>=4) & (j>=2)) ? 1 : 0;
    put_stream_128 (&S2, b11, b12, iput);
  }
}

#pragma src section {
  streamed_dma_gcm_128 (&S2, STREAM_TO_PORT,
                        PATH_0, image_out, 1, nbytes); }



Performance Gains

[Bar chart; vertical axis ticks 50 to 400; bar labels and values:]

Original                                            1.0
Opt: Flatten Loops                                  1.02
Opt: Delay Queues                                   9.04
Opt: Streaming DMAs                                 12.05
Opt: Unroll by 8                                    92.9
Opt: Unroll by 16, Stream DMA in, Bulk DMA out      158
Opt: Unroll by 16, Stream DMA in, Stream DMA out    277



Optimization

Maximize Input DMA Bandwidth

Use streaming DMA to bring in roughly four 64-bit values every clock

Unroll by 32



Optimization

Stream Input / Bulk DMA Output





#pragma src section
{
  px32 = px/32;
  for (n=0; n<((px32)*(py-2)); n++) {
    get_stream_256 (&S0, &w1, &w2, &w3, &w4);
    put_stream_256 (&S1, b1, b2, b3, b4, 1);
  }
}

#pragma src section
{
  px32 = px/32;
  for (n=0; n<((px32)*(py-2)); n++) {
    get_stream_256 (&S1, &w1, &w2, &w3, &w4);
    // 32 edge_detect calls
    ix = (i-4)*px32 + j-2;
    if ((i>=4) & (j>=2)) {
      CL[ix] = b11;
      DL[ix] = b12;
      EL[ix] = b13;
      FL[ix] = b14;
    }
  }
} // end parallel section



Performance Gains

[Bar chart; vertical axis ticks 50 to 400; bar labels and values:]

Original                                            1.0
Opt: Flatten Loops                                  1.02
Opt: Delay Queues                                   9.04
Opt: Streaming DMAs                                 12.05
Opt: Unroll by 8                                    92.9
Opt: Unroll by 16, Stream DMA in, Bulk DMA out      158
Opt: Unroll by 16, Stream DMA in, Stream DMA out    277
Opt: Unroll by 32                                   345



How does Microprocessor Performance* Compare?

* Intel IPPLIB 5.1 running on 3GHz Xeon

Optimization            MAP Speedup    MAP Speedup
                        640 x 480      1024 x 1024
Original                0.56           0.69
Opt: Flattened Loops    0.56           0.70
Opt: Delay Queues       5.0            6.2
Opt: Streaming DMAs     6.7            8.3
Opt: Unroll by 8        51.6           56.7
Opt: Unroll by 16       154            195
Opt: Unroll by 32       191            256



Logic Utilization

Optimization Level      Percentage Logic Utilization
Original                16
Opt: Flattened Loops    16
Opt: Delay Queues       16
Opt: Streaming DMAs     16
Opt: Unrolling by 8     37



Performance is for the taking

Standard code optimization techniques work

Getting massive compute parallelism is straightforward

Easy to “dial” the amount of DMA bandwidth to match compute parallelism requirements
