
RSSI 2008

www.srccomputers.com

Algorithm Optimization Case

Edge Detection

David Caliga

SRC Computers, Inc.


Edge Detect Algorithm

Median Filter to remove noise
– 3 x 3 stencil

Prewitt or Sobel Edge Detect
– Uses 3 x 3 stencil for X and Y templates to calculate gradient values

• Prewitt templates

X Template     Y Template     Pixel data
-1  0  1        1  1  1       a00 a01 a02
-1  0  1        0  0  0       a10 a11 a12
-1  0  1       -1 -1 -1       a20 a21 a22

• Prewitt gradient
– SQRT (X*X + Y*Y) / 4

With the Prewitt templates (all weights 1):

X = -1*a00 + 1*a02 - 1*a10 + 1*a12 - 1*a20 + 1*a22

Y = 1*a00 + 1*a01 + 1*a02 - 1*a20 - 1*a21 - 1*a22

With Sobel weights (center row and column weighted by 2):

X = -1*a00 + 1*a02 - 2*a10 + 2*a12 - 1*a20 + 1*a22

Y = 1*a00 + 2*a01 + 1*a02 - 1*a20 - 2*a21 - 1*a22
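The template arithmetic above can be sketched in portable C for a single 3 x 3 window. This is only an illustration, not the Carte™ code from the talk: the name `prewitt_gradient`, the row-major `w[9]` layout and the helper `isqrt_int` are assumptions, the Prewitt weights (all 1) are used, an integer square root stands in for SQRT, and the result is saturated to 255 rather than divided by 4.

```c
#include <assert.h>
#include <stddef.h>

/* Integer square root by simple search (hypothetical helper). */
static int isqrt_int(long v)
{
    int r = 0;
    while ((long)(r + 1) * (r + 1) <= v)
        r++;
    return r;
}

/* Prewitt gradient magnitude for one 3x3 window.
   w holds a00..a22 in row-major order (assumed layout). */
static int prewitt_gradient(const unsigned char w[9])
{
    assert(w != NULL);
    /* X template: right column minus left column */
    int x = (w[2] + w[5] + w[8]) - (w[0] + w[3] + w[6]);
    /* Y template: top row minus bottom row */
    int y = (w[0] + w[1] + w[2]) - (w[6] + w[7] + w[8]);
    int g = isqrt_int((long)x * x + (long)y * y);
    return g > 255 ? 255 : g;   /* saturate to 8 bits */
}
```

For a window whose right column is 10 and whose other pixels are 0, X = 30 and Y = 0, so the function returns 30; a flat window returns 0.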


Sample Code

Median Filter Code

for (i=0; i<py-2; i++)
  for (j=0; j<px-2; j++) {
    // get 9 input values in 3x3 stencil
    a00 = REF (AL, i, j);
    a01 = REF (AL, i, j+1);
    a02 = REF (AL, i, j+2);
    a10 = REF (AL, i+1, j);
    a11 = REF (AL, i+1, j+1);
    a12 = REF (AL, i+1, j+2);
    a20 = REF (AL, i+2, j);
    a21 = REF (AL, i+2, j+1);
    a22 = REF (AL, i+2, j+2);
    // compute median filter
    median_9 (a00, a01, a02,
              a10, a11, a12,
              a20, a21, a22, &px);
    // write output value to memory
    REF (BL, i, j) = px; }

Edge Detect Code

for (i=0; i<py-2; i++)
  for (j=0; j<px-2; j++) {
    // get 8 input values in 3x3 stencil (center b11 is not used)
    b00 = REF (BL, i, j);
    b01 = REF (BL, i, j+1);
    b02 = REF (BL, i, j+2);
    b10 = REF (BL, i+1, j);
    b12 = REF (BL, i+1, j+2);
    b20 = REF (BL, i+2, j);
    b21 = REF (BL, i+2, j+1);
    b22 = REF (BL, i+2, j+2);
    // apply template
    hz = (b00 + b01 + b02) -
         (b20 + b21 + b22);
    vt = (b00 + b10 + b20) -
         (b02 + b12 + b22);
    // compute gradient
    px = sqrtf (hz*hz+vt*vt);
    // write output value to memory
    if ((i>=2) & (j>=2))
      REF (CL, i-2, j-2) = px>255?255:px; }
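The slides call `median_9` without showing its body. A minimal software reference, assuming nothing about how Carte™ implements it, is an insertion sort over the nine values followed by picking the middle one. The name `median_9_ref` and the array-based interface are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Median of nine 8-bit values: insertion-sort a local copy,
   then return the fifth smallest element. */
static unsigned char median_9_ref(const unsigned char in[9])
{
    unsigned char v[9];
    assert(in != NULL);
    for (int k = 0; k < 9; k++) {
        unsigned char key = in[k];
        int m = k - 1;
        /* shift larger elements right to make room for key */
        while (m >= 0 && v[m] > key) {
            v[m + 1] = v[m];
            m--;
        }
        v[m + 1] = key;
    }
    return v[4];
}
```

A hardware version would more likely use a fixed compare-and-swap network so that it pipelines; the sort here is only meant as a checkable reference.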


Optimization Process

Iterative process
– Maximize computation per loop iteration
– Flatten nested loops
– Do things in parallel
– Overlap DMAs and compute
– Maximize use of communication bandwidth

Keep going until you run out of a resource
– Off-chip Memory (OBM) accesses per clock
– Computational logic
– Internal memory blocks
– Multipliers
– Etc.


Optimization

Loop Firing Rate

Goal: Make computational loops fire every clock

Things that can prevent getting to the goal

– Multiple accesses per clock to memories

• Example of median code loop

– Loop carried scalar problems

• Eg: sxloc1++;

if (sxloc1 == px-1) sxloc1 = 0;

################## INNER LOOP SUMMARY ####################

loop on line 41:

clocks per iteration: 9

multiple reads of 'OBM bank A' required 8 additional clocks

################## INNER LOOP SUMMARY ####################

loop on line 52:

clocks per iteration: 3

loop-carried var 'sxloc1' required 2 additional clocks

######################################################################


Optimization

Flatten Nested Loops

Nested Compute time
– Time = Outer_cnt-1 * ((Inner_cnt-1 + pipeline depth) + outer_work)
– Not optimal if the pipeline depth is “large” relative to inner trip count
– Not optimal if “outer_work” is large

Flattened Compute time
– Time = (Outer_cnt * Inner_cnt) - 1 + “new pipeline depth”
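The two timing formulas above can be compared with a small clock-count model. It rests on assumptions: a pipelined loop of n iterations that fires every clock is taken to cost n - 1 + depth clocks, and the nested case charges that cost plus `outer_work` once per outer iteration, which is only one reading of the slide's outer factor. The function names and the pipeline depths in the usage note are hypothetical.

```c
#include <assert.h>

/* Clocks for a pipelined loop of n iterations firing every clock. */
static long pipelined_clocks(long n, long depth)
{
    assert(n > 0 && depth >= 0);
    return n - 1 + depth;
}

/* Nested loops: the inner pipeline fills and drains on every outer trip. */
static long nested_clocks(long outer, long inner, long depth, long outer_work)
{
    assert(outer > 0 && outer_work >= 0);
    return outer * (pipelined_clocks(inner, depth) + outer_work);
}

/* Flattened loop: one pipeline fill for the whole iteration space. */
static long flattened_clocks(long outer, long inner, long new_depth)
{
    assert(outer > 0 && inner > 0);
    return pipelined_clocks(outer * inner, new_depth);
}
```

With a 478 x 638 iteration space, an assumed inner pipeline depth of 50, 10 clocks of outer work and an assumed flattened depth of 60, the model gives 333166 clocks nested against 305023 flattened. The saving grows as the pipeline depth approaches the inner trip count.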


Optimization

Eliminate Loop Carried Scalars

Carte™ supplies many functions for users to implement in their codes
– Accumulators
– Counters
– Bitwise operations
– Min, Max
– Etc.


Optimization

Flattened Loops

for (n=0; n<(px-2)*(py-2); n++) {
  cg_count_ceil_32 (1, 1, n==0, px-2, &i);
  cg_count_ceil_32 (i==1, 1, n==0, 0xffffffff, &j);
  a00 = REF (AL, i, j);
  a01 = REF (AL, i, j+1);
  a02 = REF (AL, i, j+2);
  a10 = REF (AL, i+1, j);
  a11 = REF (AL, i+1, j+1);
  a12 = REF (AL, i+1, j+2);
  a20 = REF (AL, i+2, j);
  a21 = REF (AL, i+2, j+1);
  a22 = REF (AL, i+2, j+2);
  median_9 (a00, a01, a02,
            a10, a11, a12,
            a20, a21, a22, &px);
  REF (BL, i, j) = px; }

for (n=0; n<(px-2)*(py-2); n++) {
  cg_count_ceil_32 (1, 1, n==0, px-2, &i);
  cg_count_ceil_32 (i==1, 1, n==0, 0xffffffff, &j);
  b00 = REF (BL, i, j);
  b01 = REF (BL, i, j+1);
  b02 = REF (BL, i, j+2);
  b10 = REF (BL, i+1, j);
  b12 = REF (BL, i+1, j+2);
  b20 = REF (BL, i+2, j);
  b21 = REF (BL, i+2, j+1);
  b22 = REF (BL, i+2, j+2);
  hz = (b00 + b01 + b02) -
       (b20 + b21 + b22);
  vt = (b00 + b10 + b20) -
       (b02 + b12 + b22);
  if ((i>=2) & (j>=2))
    REF (CL, i-2, j-2) = px > 255 ? 255 : px; }
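The index bookkeeping behind a flattened loop can be checked in plain C. The sketch below walks a rows x cols space with one loop and two wrap-around counters, and counts how often the counters disagree with the positions a nested loop would visit. It does not model the exact semantics of `cg_count_ceil_32`; the function name `flat_walk_mismatches` is hypothetical.

```c
#include <assert.h>

/* Walk a rows x cols iteration space with a single flat loop.
   j is the fast counter; i steps each time j wraps.
   Returns how many trips disagree with (n / cols, n % cols). */
static int flat_walk_mismatches(int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    int i = 0, j = 0, bad = 0;
    for (int n = 0; n < rows * cols; n++) {
        if (i != n / cols || j != n % cols)
            bad++;
        /* advance the counters for the next trip */
        j++;
        if (j == cols) {
            j = 0;
            i++;
        }
    }
    return bad;
}
```

On a microprocessor the `j++` / wrap test is harmless. On the MAP the same update written by hand is a loop-carried scalar, which is why the slides use a supplied counter function instead.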

Performance Gains

640 x 480 Image

Original        1.0
Opt: Flatten    1.02


Optimization

Remove Multiple Reads to OBM

Major loop slowdown noted by compiler is because compute loops want 9 or 8 array values from a single memory per loop iteration

Solution: Use delay queue feature in Carte™ compiler

Pipelining Stencil Code

Compute Process: move a window through the image
– Data access input(i)
– Compute f(x1,x2,..x9)
– Data Storage f(x)

9 points in stencil.
– After the window moves one step along a row, 6 have been seen before.
– Once values from earlier rows are also retained, 8 have been seen before.
– The leading point should be the only data access.

Stencil Data Flow

9 Scalars
(16-unit Shift Register, remembers previous row)

Data access input(i) -> Compute f(x1,x2,..x9) -> Data Storage Output(i)

delayq(in,&out);

[Animation frames: each clock one new input (i2, i3, ... i17) enters the current-row scalars, which hold the three most recent values (for example i0 i1 i2, then i1 i2 i3). The shift register fills with i0 ... i15 and then slides forward (i2 ... i16), so values from the previous row come from the register rather than from memory.]

Optimization

Delay Queues

Median Filter Code

for (n=0; n<(px-2)*(py-2); n++) {
  mvalue = REF (AL, i, j);
  a20 = a21;
  a21 = a22;
  a22 = mvalue;
  a10 = a11;
  a11 = a12;
  delay_queue_8_var (a22,1,n==0,px, &a12);
  a00 = a01;
  a01 = a02;
  median_9 (a00, a01, a02,
            a10, a11, a12,
            a20, a21, a22, &px);

Edge Detect Code (same loop, continued)

  b20 = b21;
  b21 = b22;
  b22 = px;
  b10 = b11;
  b11 = b12;
  delay_queue_8_var (b22,1,n==0,px, &b12);
  b00 = b01;
  b01 = b02;
  hz = (b00 + b01 + b02) -
       (b20 + b21 + b22);
  vt = (b00 + b10 + b20) -
       (b02 + b12 + b22);
  if ((i>=2) & (j>=2))
    REF (BL, i-2, j-2) = px > 255 ? 255 : px;
}
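The delay-queue idea can be modeled in ordinary C: each pixel is read from memory exactly once, nine scalars hold the current window, and two row-length queues hand back the pixels from one and two rows earlier. This is a software sketch, not a model of `delay_queue_8_var` itself. The name `stream_3x3_sum`, the `MAXW` limit and the use of a 3 x 3 sum in place of the median or edge function are all assumptions made to keep the result easy to check.

```c
#include <assert.h>
#include <string.h>

#define MAXW 64   /* hypothetical maximum row width for this sketch */

/* Stream a w x h image once. For i >= 2 and j >= 2,
   out[(i-2)*(w-2) + (j-2)] receives the sum of the 3x3 window
   whose bottom-right corner is pixel (i, j). */
static void stream_3x3_sum(const unsigned char *img, int w, int h, int *out)
{
    unsigned char q1[MAXW], q2[MAXW];   /* one-row and two-row delays */
    int a00 = 0, a01 = 0, a02 = 0;
    int a10 = 0, a11 = 0, a12 = 0;
    int a20 = 0, a21 = 0, a22 = 0;
    assert(img != NULL && out != NULL);
    assert(w >= 3 && w <= MAXW && h >= 3);
    memset(q1, 0, sizeof q1);
    memset(q2, 0, sizeof q2);
    for (int n = 0; n < w * h; n++) {
        int i = n / w, j = n % w;
        /* shift the window one column to the left */
        a00 = a01; a01 = a02;
        a10 = a11; a11 = a12;
        a20 = a21; a21 = a22;
        /* new right column: one memory read plus two delayed values */
        a22 = img[n];    /* the only access to the image */
        a12 = q1[j];     /* pixel (i-1, j), delayed by one row */
        a02 = q2[j];     /* pixel (i-2, j), delayed by two rows */
        q2[j] = (unsigned char)a12;
        q1[j] = (unsigned char)a22;
        if (i >= 2 && j >= 2)
            out[(i - 2) * (w - 2) + (j - 2)] =
                a00 + a01 + a02 + a10 + a11 + a12 + a20 + a21 + a22;
    }
}
```

For a 4 x 3 image holding the values 1 to 12 in row-major order, the two valid windows sum to 54 and 63. The guard on `i` and `j` plays the same role as the `(i>=2) & (j>=2)` test in the slide code: it discards outputs produced while the window is still filling or straddling a row boundary.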

Performance Gains

Original        1.0
Opt: Flatten    1.02
Opt: Queues     9.04


Optimization

Overlap DMA and Compute

Use “streams” feature in Carte™

A stream is an efficient communication mechanism between producer and consumer loops

Producer and consumer loops are executing in “parallel”
– Consumer loop will use a value from the producer loop as soon as it is generated
– Loops that are inherently sequential can execute in parallel

DMAs can produce streams that are consumed in a compute loop
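The overlap that a stream provides can be illustrated with a single-threaded software model: a small FIFO sits between a producer and a consumer, and both get one step per simulated clock. This is not the Carte™ stream API. The type `fifo_t`, the depth `FIFO_CAP` and the function names are hypothetical, and real hardware runs the two loops concurrently rather than in turns.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_CAP 4   /* hypothetical stream depth */

typedef struct { int buf[FIFO_CAP]; int head, count; } fifo_t;

static int fifo_put(fifo_t *f, int v)
{
    if (f->count == FIFO_CAP) return 0;          /* full: producer stalls */
    f->buf[(f->head + f->count) % FIFO_CAP] = v;
    f->count++;
    return 1;
}

static int fifo_get(fifo_t *f, int *v)
{
    if (f->count == 0) return 0;                 /* empty: consumer waits */
    *v = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return 1;
}

/* Producer emits 0..n-1; consumer accumulates them as they arrive.
   Returns the sum; *clocks receives the number of simulated clocks. */
static long run_overlapped(int n, int *clocks)
{
    fifo_t f = { {0}, 0, 0 };
    int produced = 0, consumed = 0, t = 0, v = 0;
    long sum = 0;
    assert(n >= 0 && clocks != NULL);
    while (consumed < n) {
        if (fifo_get(&f, &v)) { sum += v; consumed++; }          /* consumer */
        if (produced < n && fifo_put(&f, produced)) produced++;  /* producer */
        t++;
    }
    *clocks = t;
    return sum;
}
```

For n = 5 the model finishes in 6 clocks with a sum of 10, where running the producer to completion before starting the consumer would take 10 steps.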


Optimization

Use of Streams

#pragma src parallel sections
{
#pragma src section
  {
    streamed_dma_cpu (&S0, PORT_TO_STREAM,
                      PATH_0, image_in, 1, nbytes);
  }
#pragma src section
  {
    for (n=0; n<(px-2)*(py-2); n++) {
      get_stream (&S0, &mvalue);
      a20 = a21; a21 = a22; a22 = mvalue;
      a10 = a11; a11 = a12;
      a00 = a01; a01 = a02;
      median_9 (a00, a01, a02,
                a10, a11, a12,
                a20, a21, a22, &px);
      b20 = b21;
      b21 = b22;
      b22 = px;
      b10 = b11;
      b11 = b12;
      b00 = b01;
      b01 = b02;
      hz = (b00 + b01 + b02) -
           (b20 + b21 + b22);
      vt = (b00 + b10 + b20) -
           (b02 + b12 + b22);
      if ((i>=2) & (j>=2))
        REF (BL, i-2, j-2) = px > 255 ? 255 : px;
    }
  }
} // end parallel section and region

Performance Gains

[Bar chart, Original = 1.0: Original 1.0; Opt: Flatten 1.02; Opt: Queues 9.04; Opt: Streaming (bar value not recoverable).]


Optimization

Loop unrolling

Do more work during a loop iteration

Take advantage of the fact that the pixels are 8-bit data packed into 64-bit values

Unroll by 8 will reduce the compute time by 8x
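The packing that makes unroll-by-8 possible can be sketched in portable C. The helpers below are stand-ins for the `split_64to8` and `comb_8to64` calls that appear in the later slides, not their actual implementation: the `_ref` names, the array interface and the choice of `v[0]` as the least significant byte are assumptions. `loop_trips` only restates the loop bound `(px/8)*(py-2)` used in those slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unpack eight 8-bit pixels from one 64-bit word.
   v[0] is the least significant byte (assumed order). */
static void split_64to8_ref(uint64_t w, uint8_t v[8])
{
    assert(v != NULL);
    for (int k = 0; k < 8; k++)
        v[k] = (uint8_t)(w >> (8 * k));
}

/* Pack eight 8-bit pixels back into one 64-bit word. */
static uint64_t comb_8to64_ref(const uint8_t v[8])
{
    uint64_t w = 0;
    assert(v != NULL);
    for (int k = 0; k < 8; k++)
        w |= (uint64_t)v[k] << (8 * k);
    return w;
}

/* Loop trips when 'unroll' pixels are processed per iteration. */
static long loop_trips(long px, long py, long unroll)
{
    assert(unroll > 0 && px % unroll == 0);
    return (px / unroll) * (py - 2);
}
```

For a 640 x 480 image, `loop_trips(640, 480, 8)` is 38240, one eighth of the 305920 trips a one-pixel-per-iteration loop over the same 640 x 478 pixels would need.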


Optimization

Loop Unrolling by 8

Median Filter Code

#pragma src section
{
  px8 = px/8;
  for (n=0; n<((px8)*(py-2)); n++) {
    cg_count_ceil_32 (1, 1, n==0, px8, &i);
    cg_count_ceil_32 (i==1, 1, n==0, py, &j);
    get_stream (&S0, &w1);

    /* variable naming, e.g. a111:
       1st digit = row, 2nd digit = word, 3rd digit = byte */

    a210=v0; a211=v1; a212=v2; a213=v3; a214=v4; a215=v5; a216=v6; a217=v7;
    split_64to8 (w1, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
    a218=v0; a219=v1;

    a110=v0; a111=v1; a112=v2; a113=v3; a114=v4; a115=v5; a116=v6; a117=v7;
    delay_queue_64_var (w1,1,n==0,px8, &w2);

    a010=v0; a011=v1; a012=v2; a013=v3; a014=v4; a015=v5; a016=v6; a017=v7;
    delay_queue_64_var (w2,1,n==0, px8, &w3);


Optimization

Loop Unrolling by 8

Median Filter Code

median_8_9 (a010, a011, a012, a110, a111, a112, a210, a211, a212, &p0);

comb_8to64 (p7,p6,p5,p4,p3,p2,p1,p0,&b1);

put_stream (&S1, b1, 1);

} // end parallel section

// continued into edge detect


Optimization

Loop Unrolling by 8

Edge Detect Code

#pragma src section
{
  px8 = px/8;
  for (n=0; n<((px8)*(py-2)); n++) {
    cg_count_ceil_32 (1,0,n==0,px8,&j);
    cg_count_ceil_32 (j==0,0,n==0,py,&i);
    get_stream (&S1, &w1);

    /* variable naming, e.g. a111:
       1st digit = row, 2nd digit = word, 3rd digit = byte */

    delay_queue_64_var (w1, 1, n==0, px8, &w2);

    a010=v0; a011=v1; a012=v2; a013=v3; a014=v4; a015=v5; a016=v6; a017=v7;
    delay_queue_64_var (w2, 1, n==0, px8, &w3);
    split_64to8 (w3, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
    a018=v0; a019=v1;


Optimization

Used Inlining of edge_detect_8

Edge Detect Code

edge_detect_8 (a010, a011, a012, a110, a111, a112, a210, a211, a212, &p0);

edge_detect_8 (a011, a012, a013, a111, a112, a113, a211, a212, a213, &p1);

comb_8to64 (p7,p6,p5,p4,p3,p2,p1,p0,&b1);

ix = (i-4)*px8 + j-2;

if ((i>=4) & (j>=2)) DL[ix] = b1;

} // end parallel region

Performance Gains

[Bar chart, Original = 1.0: Original 1.0; Opt: Flatten 1.02; Opt: Queues 9.04; Opt: Streaming and Opt: Unroll by 8 (bar values not recoverable).]


Optimization

Maximize Input DMA Bandwidth

Use streaming DMA to bring in two 64-bit values every clock

Unroll by 16

Two output DMA examples
– DMA to microprocessor after compute
– Streaming DMA to Global Common Memory


Optimization

Stream Input / Bulk DMA Output

Median Filter Code

#pragma src parallel sections
{
  streamed_dma_gcm_128 (&S0, PORT_TO_STREAM, PATH_0,
                        image_in, 1, nbytes); }

#pragma src section
  px16 = px/16;
  for (n=0; n<((px16)*(py-2)); n++) {
    cg_count_ceil_32 (1, 1, n==0, px16, &i);
    cg_count_ceil_32 (i==1, 1, n==0, py, &j);
    get_stream_128 (&S0, &w1, &w2);
    // 16 median_8_9 calls
    put_stream_128 (&S1, b1, b2, 1);

Edge Detect Code

#pragma src section
  px16 = px/16;
  for (n=0; n<((px16)*(py-2)); n++) {
    cg_count_ceil_32 (1, 0, n==0, px16, &j);
    cg_count_ceil_32 (j==0, 0, n==0, py, &i);
    get_stream_128 (&S1, &w1, &w2);
    // 16 edge_detect calls
    ix = (i-4)*px16 + j-2;
    if ((i>=4) & (j>=2)) {
      CL[ix] = b11;
      DL[ix] = b12;

Performance Gains

[Bar chart, Original = 1.0: Original 1.0; Opt: Flatten 1.02; Opt: Queues 9.04; Opt: Streaming, Opt: Unroll by 8, and Opt: Unroll by 16 with Stream DMA in / Bulk DMA out (bar values not recoverable).]


Optimization

Stream Input / Stream Output

#pragma src section
  px16 = px/16;
  for (n=0; n<((px16)*(py-2)); n++) {
    get_stream_128 (&S0, &w1, &w2);
    put_stream_128 (&S1, b1, b2, 1);

#pragma src section
  px16 = px/16;
  for (n=0; n<((px16)*(py-2)); n++) {
    get_stream_128 (&S1, &w1, &w2);
    // 16 edge_detect calls
    iput = ((i>=4) & (j>=2)) ? 1 : 0;
    put_stream_128 (&S2, b11, b12, iput);

streamed_dma_gcm_128 (&S2, STREAM_TO_PORT,
                      PATH_0, image_out, 1, nbytes); }

Performance Gains

[Bar chart, Original = 1.0: Original 1.0; Opt: Flatten 1.02; Opt: Queues 9.04; Opt: Streaming, Opt: Unroll by 8, Opt: Unroll by 16 with Stream DMA in / Bulk DMA out, and Opt: Unroll by 16 with Stream DMA in / Stream DMA out (bar values not recoverable).]


Optimization

Maximize Input DMA Bandwidth

Use streaming DMA to bring in roughly four 64-bit values every clock

Unroll by 32


Optimization

Stream Input / Bulk DMA Output

#pragma src section
  px32 = px/32;
  for (n=0; n<((px32)*(py-2)); n++) {
    get_stream_256 (&S0, &w1, &w2, &w3, &w4);
    put_stream_256 (&S1, b1, b2, b3, b4, 1);

#pragma src section
  px32 = px/32;
  for (n=0; n<((px32)*(py-2)); n++) {
    get_stream_256 (&S1, &w1, &w2, &w3, &w4);
    // 32 edge_detect calls
    ix = (i-4)*px32 + j-2;
    if ((i>=4) & (j>=2)) {
      CL[ix] = b11;
      DL[ix] = b12;
      EL[ix] = b13;
      FL[ix] = b14;
} // end parallel section

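Performance Gains

[Bar chart, Original = 1.0: Original 1.0; Opt: Flatten 1.02; Opt: Queues 9.04; Opt: Streaming, Opt: Unroll by 8, Opt: Unroll by 16 with Stream DMA in / Bulk DMA out, Opt: Unroll by 16 with Stream DMA in / Stream DMA out, and Opt: Unroll by 32 (bar values not recoverable).]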

How does Microprocessor Performance* Compare

* Intel IPPLIB 5.1 running on 3GHz Xeon

Optimization            MAP Speedup    MAP Speedup
                        640 x 480      1024 x 1024
Original                .56            .69
Opt: Flattened Loops    .56            .70
Opt: Delay Queues       5.0            6.2
Opt: Streaming DMAs     6.7            8.3
Opt: Unroll by 8        51.6           56.7
Opt: Unroll by 16       154            195
Opt: Unroll by 32       191            256

Logic Utilization

Optimization            Percentage Logic Utilization
Original                16
Opt: Flattened Loops    16
Opt: Delay Queues       16
Opt: Streaming DMAs     16
Opt: Unrolling by 8     37


Performance is for the taking

Standard code optimization techniques work

Ability to get massive compute parallelism is straightforward

Easy to “dial” amount of DMA bandwidth to match compute parallelism requirements
