edge detect algorithm - rssirssi.ncsa.illinois.edu/proceedings/academic/caliga.pdf · edge detect...


RSSI 2008

©2008 SRC Computers, Inc. ALL RIGHTS RESERVED

www.srccomputers.com

Algorithm Optimization Case Study

Edge Detection

David Caliga

SRC Computers, Inc.


Edge Detect Algorithm

Median Filter to remove noise
– 3 x 3 stencil

Prewitt or Sobel Edge Detect
– Uses 3 x 3 stencil for X and Y templates to calculate gradient values

• Prewitt templates

X Template     Y Template     Pixel data
-1  0  1        1  1  1       a00 a01 a02
-1  0  1        0  0  0       a10 a11 a12
-1  0  1       -1 -1 -1       a20 a21 a22

• Prewitt gradient
– SQRT (X*X + Y*Y) / 4

• Gradient sums (the weights of 2 on the center row and column are the Sobel form; with the Prewitt templates above every weight is 1)

X = -1*a00 + 1*a02 - 2*a10 + 2*a12 - 1*a20 + 1*a22

Y = 1*a00 + 2*a01 + 1*a02 - 1*a20 - 2*a21 - 1*a22
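The X and Y sums above can be sketched in portable C. This is only an illustration, not the Carte code from the talk: the names `window3x3`, `isqrt_u32` and `gradient3x3` are hypothetical, an integer square root stands in for `sqrtf`, and the result is clamped to 255 as in the sample code (the "/ 4" scaling from the slide is left out, since the sample code does not apply it).

```c
#include <stdint.h>

/* 3x3 window of pixel values, indexed a[row][col] like a00..a22 on the slide. */
typedef struct { int a[3][3]; } window3x3;

/* Floor of the square root, computed bit by bit (stand-in for sqrtf). */
static uint32_t isqrt_u32(uint32_t v)
{
    uint32_t r = 0;
    uint32_t bit = 1u << 30;
    while (bit > v) bit >>= 2;
    while (bit != 0) {
        if (v >= r + bit) { v -= r + bit; r = (r >> 1) + bit; }
        else r >>= 1;
        bit >>= 2;
    }
    return r;
}

/* Gradient magnitude for one window.
 * center_weight = 1 gives the Prewitt sums, 2 gives the Sobel sums. */
static uint8_t gradient3x3(const window3x3 *w, int center_weight)
{
    const int (*a)[3] = w->a;
    int x = -a[0][0] + a[0][2]
            - center_weight * a[1][0] + center_weight * a[1][2]
            - a[2][0] + a[2][2];
    int y =  a[0][0] + center_weight * a[0][1] + a[0][2]
            - a[2][0] - center_weight * a[2][1] - a[2][2];
    uint32_t g = isqrt_u32((uint32_t)(x * x + y * y));
    return g > 255u ? 255u : (uint8_t)g;   /* clamp to the 8-bit pixel range */
}
```

For a vertical edge whose right-hand column is 10 and whose other pixels are 0, the Prewitt weights give X = 30 and the Sobel weights give X = 40, with Y = 0 in both cases.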


Sample Code

Median Filter Code

for (i=0; i<py-2; i++)
  for (j=0; j<px-2; j++) {
    // get 9 input values in 3x3 stencil
    a00 = REF (AL, i, j);
    a01 = REF (AL, i, j+1);
    a02 = REF (AL, i, j+2);
    a10 = REF (AL, i+1, j);
    a11 = REF (AL, i+1, j+1);
    a12 = REF (AL, i+1, j+2);
    a20 = REF (AL, i+2, j);
    a21 = REF (AL, i+2, j+1);
    a22 = REF (AL, i+2, j+2);
    // compute median filter; result goes to pix so the image width px is not overwritten
    median_9 (a00, a01, a02,
              a10, a11, a12,
              a20, a21, a22, &pix);
    // write output value to memory
    REF (BL, i, j) = pix;
  }

Edge Detect Code

for (i=0; i<py-2; i++)
  for (j=0; j<px-2; j++) {
    // get 8 input values in 3x3 stencil (center b11 is not used)
    b00 = REF (BL, i, j);
    b01 = REF (BL, i, j+1);
    b02 = REF (BL, i, j+2);
    b10 = REF (BL, i+1, j);
    b12 = REF (BL, i+1, j+2);
    b20 = REF (BL, i+2, j);
    b21 = REF (BL, i+2, j+1);
    b22 = REF (BL, i+2, j+2);
    // apply template
    hz = (b00 + b01 + b02) - (b20 + b21 + b22);
    vt = (b00 + b10 + b20) - (b02 + b12 + b22);
    // compute gradient
    pix = sqrtf (hz*hz+vt*vt);
    // write output value to memory
    if ((i>=2) & (j>=2))
      REF (CL, i-2, j-2) = pix>255?255:pix;
  }
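The slides do not show the body of `median_9`. As a rough reference for what the median loop computes, here is a plain C sketch under stated assumptions: `median9_ref` and `median_filter_ref` are hypothetical names, the image is assumed to be row-major 8-bit data, and a small insertion sort stands in for whatever the Carte routine does internally.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for median_9: median of nine 8-bit values,
 * found by insertion-sorting a local copy and taking the middle element. */
static uint8_t median9_ref(const uint8_t in[9])
{
    uint8_t v[9];
    memcpy(v, in, sizeof v);
    for (int i = 1; i < 9; i++) {
        uint8_t key = v[i];
        int k = i - 1;
        while (k >= 0 && v[k] > key) {
            v[k + 1] = v[k];
            k--;
        }
        v[k + 1] = key;
    }
    return v[4];
}

/* Median-filter a row-major width x height image. The output is
 * (width-2) x (height-2), matching the i<py-2, j<px-2 loop bounds. */
static void median_filter_ref(const uint8_t *src, uint8_t *dst,
                              int width, int height)
{
    for (int i = 0; i < height - 2; i++)
        for (int j = 0; j < width - 2; j++) {
            uint8_t win[9];
            for (int r = 0; r < 3; r++)
                for (int c = 0; c < 3; c++)
                    win[3 * r + c] = src[(i + r) * width + (j + c)];
            dst[i * (width - 2) + j] = median9_ref(win);
        }
}
```

A single bright pixel in an otherwise uniform 4 x 4 image is removed by this filter, which is the noise suppression the edge detect stage relies on.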


Optimization Process

Iterative process

– Maximize computation per loop iteration

– Flatten nested loops

– Do things in parallel

– Overlap DMAs and compute

– Maximize use of communication bandwidth

Keep going until you run out of a resource

– Off-chip Memory (OBM) accesses per clock

– Computational logic

– Internal memory blocks

– Multipliers

– Etc.


Optimization

Loop Firing Rate

Goal: Make computational loops fire every clock

Things that can prevent getting to the goal

– Multiple accesses per clock to memories

• Example of median code loop

– Loop carried scalar problems

• Eg: sxloc1++;

if (sxloc1 == px-1) sxloc1 = 0;

################## INNER LOOP SUMMARY ####################

loop on line 41:

clocks per iteration: 9

multiple reads of 'OBM bank A' required 8 additional clocks

################## INNER LOOP SUMMARY ####################

loop on line 52:

clocks per iteration: 3

loop-carried var 'sxloc1' required 2 additional clocks

######################################################################


Optimization

Flatten Nested Loops

Nested Compute time

– Time = Outer_cnt-1 * ((Inner_cnt-1 + pipeline depth) + outer_work)

– Not optimal if the pipeline depth is “large” relative to inner trip count

– Not optimal if “outer_work” is large

Flattened Compute time

– Time = (Outer_cnt * Inner_cnt) – 1 + “new pipeline depth”
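To make the two timing formulas concrete, the sketch below evaluates them in C. It rests on assumptions: the nested formula is read with the grouping (Outer_cnt - 1) * ((Inner_cnt - 1 + pipeline depth) + outer_work), and the pipeline depth of 50 clocks and outer_work of 10 clocks used in the example are made-up values, not measurements from the talk.

```c
#include <stdint.h>

/* Nested-loop clocks, ASSUMING the slide's formula groups as
 * (outer_cnt - 1) * ((inner_cnt - 1 + depth) + outer_work). */
static uint64_t nested_clocks(uint64_t outer_cnt, uint64_t inner_cnt,
                              uint64_t depth, uint64_t outer_work)
{
    return (outer_cnt - 1) * ((inner_cnt - 1 + depth) + outer_work);
}

/* Flattened-loop clocks: (outer_cnt * inner_cnt) - 1 + new pipeline depth. */
static uint64_t flattened_clocks(uint64_t outer_cnt, uint64_t inner_cnt,
                                 uint64_t new_depth)
{
    return outer_cnt * inner_cnt - 1 + new_depth;
}
```

For the 640 x 480 image the loops run 478 by 638 times. With the hypothetical depth and outer_work above, the nested form comes to 332,469 clocks and the flattened form to 305,013 clocks, so the saving is modest whenever the inner trip count is large compared with the pipeline depth.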


Optimization

Eliminate Loop Carried Scalars

Carte™ supplies many functions for users to

implement in their codes

– Accumulators

– Counters

– Bitwise operations

– Min, Max

– Etc.
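As an illustration of what a counter function provides, the following C model recovers row and column indices inside a single flat loop. It is a behavioural sketch only: `wrap_counter`, `wrap_counter_step` and `flat_walk` are hypothetical names, and the argument order and semantics of the Carte counter calls are not modelled here.

```c
#include <stdint.h>

/* Hypothetical wrap-around counter; NOT the Carte counter interface. */
typedef struct { uint32_t value; uint32_t ceiling; } wrap_counter;

/* Advance the counter when enable is nonzero. Returns 1 on the step
 * where the counter wraps from ceiling-1 back to 0, otherwise 0. */
static int wrap_counter_step(wrap_counter *c, int enable)
{
    if (!enable) return 0;
    if (c->value + 1 == c->ceiling) { c->value = 0; return 1; }
    c->value++;
    return 0;
}

/* Walk a rows x cols index space with one flat loop, recording the
 * (row, column) pair seen at each flat index n. */
static void flat_walk(uint32_t rows, uint32_t cols,
                      uint32_t *i_out, uint32_t *j_out)
{
    wrap_counter col = {0, cols};
    wrap_counter row = {0, rows};
    for (uint32_t n = 0; n < rows * cols; n++) {
        i_out[n] = row.value;
        j_out[n] = col.value;
        int wrapped = wrap_counter_step(&col, 1);  /* column counts every step */
        wrap_counter_step(&row, wrapped);          /* row counts on column wrap */
    }
}
```

The row counter is enabled only on the step where the column counter wraps, which is the same chaining idea as using one counter's output to enable the next.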


Optimization

Flattened Loops

Median Filter Code

for (n=0; n<(px-2)*(py-2); n++) {
  cg_count_ceil_32 (1, 1, n==0, px-2, &i);
  cg_count_ceil_32 (i==1, 1, n==0, 0xffffffff, &j);
  a00 = REF (AL, i, j);
  a01 = REF (AL, i, j+1);
  a02 = REF (AL, i, j+2);
  a10 = REF (AL, i+1, j);
  a11 = REF (AL, i+1, j+1);
  a12 = REF (AL, i+1, j+2);
  a20 = REF (AL, i+2, j);
  a21 = REF (AL, i+2, j+1);
  a22 = REF (AL, i+2, j+2);
  // result goes to pix so the image width px is not overwritten
  median_9 (a00, a01, a02,
            a10, a11, a12,
            a20, a21, a22,
            &pix);
  REF (BL, i, j) = pix;
}

Edge Detect Code

for (n=0; n<(px-2)*(py-2); n++) {
  cg_count_ceil_32 (1, 1, n==0, px-2, &i);
  cg_count_ceil_32 (i==1, 1, n==0, 0xffffffff, &j);
  b00 = REF (BL, i, j);
  b01 = REF (BL, i, j+1);
  b02 = REF (BL, i, j+2);
  b10 = REF (BL, i+1, j);
  b12 = REF (BL, i+1, j+2);
  b20 = REF (BL, i+2, j);
  b21 = REF (BL, i+2, j+1);
  b22 = REF (BL, i+2, j+2);
  hz = (b00 + b01 + b02) - (b20 + b21 + b22);
  vt = (b00 + b10 + b20) - (b02 + b12 + b22);
  pix = sqrtf (hz*hz+vt*vt);
  if ((i>=2) & (j>=2))
    REF (CL, i-2, j-2) = pix > 255 ? 255 : pix;
}


Performance Gains

640 x 480 Image

[Bar chart: gain relative to the original code, vertical axis up to 400]

Original                 1.0
Opt: Flatten Loops       1.02


Optimization

Remove Multiple Reads to OBM

The major loop slowdown noted by the compiler occurs because the compute loops want 9 or 8 array values from a single memory per loop iteration

Solution: Use the delay queue feature in the Carte™ compiler


Pipelining Stencil Code

Compute Process

Move a window through the image

[Animation frames: a 3 x 3 window (x1..x9) moves through the image; Data access input(i) feeds Compute f(x1,x2,..x9), which feeds Data Storage f(x)]

Notes shown on successive frames:

– 9 points in stencil.

– 6 have been seen before.

– 8 have been seen before.

– The leading point should be the only data access.


Stencil Data Flow

[Animation frames: Data access input(i) feeds Compute f(x1,x2,..x9), which feeds Output(i) / Data Storage f(x)]

– The window is held in 9 scalars (x1..x9)

– A 16-unit shift register, delayq(in,&out);, remembers the previous row

– The frames step the input through i0, i1, i2, ... for a 16-pixel row; each new input enters the window directly, and the values leaving the two shift registers supply the two earlier rows

– Final frame: input i47, window rows (i13, i14, i15), (i29, i30, i31), (i45, i46, i47); the shift registers hold i32..i47 and i16..i31

delayq(in,&out);
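The data flow in these frames can be modelled in plain C with two row-length delay lines. This is a behavioural sketch, not the Carte delay queue: `delay_line`, `stencil3x3`, their functions and the `MAX_WIDTH` bound are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_WIDTH 64  /* hypothetical upper bound on the row width */

/* Fixed-length delay line: push one value, get back the value
 * that was pushed `len` steps earlier (0 until the line fills). */
typedef struct {
    uint8_t buf[MAX_WIDTH];
    int len;
    int pos;
} delay_line;

static void delay_init(delay_line *d, int len)
{
    assert(len > 0 && len <= MAX_WIDTH);
    memset(d->buf, 0, sizeof d->buf);
    d->len = len;
    d->pos = 0;
}

static uint8_t delay_push(delay_line *d, uint8_t in)
{
    uint8_t out = d->buf[d->pos];
    d->buf[d->pos] = in;
    d->pos = (d->pos + 1) % d->len;
    return out;
}

/* Sliding 3x3 window fed by exactly one new pixel per step. */
typedef struct {
    uint8_t w[3][3];        /* w[0] = oldest row, w[2] = newest row */
    delay_line row1, row2;  /* each delays by one image row */
} stencil3x3;

static void stencil_init(stencil3x3 *s, int width)
{
    memset(s->w, 0, sizeof s->w);
    delay_init(&s->row1, width);
    delay_init(&s->row2, width);
}

/* Shift the window left one column and load the new right-hand column. */
static void stencil_push(stencil3x3 *s, uint8_t in)
{
    uint8_t mid = delay_push(&s->row1, in);   /* pixel one row above `in` */
    uint8_t top = delay_push(&s->row2, mid);  /* pixel two rows above `in` */
    for (int r = 0; r < 3; r++) {
        s->w[r][0] = s->w[r][1];
        s->w[r][1] = s->w[r][2];
    }
    s->w[0][2] = top;
    s->w[1][2] = mid;
    s->w[2][2] = in;
}
```

Pushing the values 0 through 47 with a row width of 16 leaves the window holding 13, 14, 15 / 29, 30, 31 / 45, 46, 47, which matches the final animation frame. Only the leading point is read from memory on each step.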


Optimization

Delay Queues

Median Filter Code

for (n=0; n<(py-2)*(py-2); n++) {
  cg_count_ceil_32 (1, 1, n==0, px-2, &i);
  cg_count_ceil_32 (i==1, 1, n==0, 0xffffffff, &j);
  // only the leading point is read from memory
  mvalue = REF (AL, i, j);
  a20 = a21;
  a21 = a22;
  a22 = mvalue;
  a10 = a11;
  a11 = a12;
  delay_queue_8_var (a22,1,n==0,px, &a12);
  a00 = a01;
  a01 = a02;
  delay_queue_8_var (a12,1,n==0,px, &a02);
  // result goes to pix so the image width px is not overwritten
  median_9 (a00, a01, a02,
            a10, a11, a12,
            a20, a21, a22,
            &pix);

Edge Detect Code (same loop, continued)

  b20 = b21;
  b21 = b22;
  b22 = pix;
  b10 = b11;
  b11 = b12;
  delay_queue_8_var (b22,1,n==0,px, &b12);
  b00 = b01;
  b01 = b02;
  delay_queue_8_var (b12,1,n==0,px, &b02);
  hz = (b00 + b01 + b02) - (b20 + b21 + b22);
  vt = (b00 + b10 + b20) - (b02 + b12 + b22);
  pix = sqrtf (hz*hz+vt*vt);
  if ((i>=2) & (j>=2))
    REF (BL, i-2, j-2) = pix > 255 ? 255 : pix;
}


Performance Gains

[Bar chart: gain relative to the original code, vertical axis up to 400]

Original                 1.0
Opt: Flatten Loops       1.02
Opt: Delay Queues        9.04


Optimization

Overlap DMA and Compute

Use “streams” feature in Carte™

A stream is an efficient communication mechanism between producer and consumer loops

Producer and consumer loops are executing in “parallel”

– Consumer loop will use a value from the producer loop as soon as it is generated

– Loops that are inherently sequential can execute in parallel

DMAs can produce streams that are consumed in a compute loop
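The producer and consumer relationship can be illustrated with a small single-threaded C simulation. It is only a model of the idea: `stream_fifo`, `stream_put`, `stream_get`, `run_overlapped` and the depth of 4 are hypothetical, and nothing here reflects how Carte streams are implemented.

```c
#include <stdint.h>

#define FIFO_DEPTH 4  /* hypothetical stream buffer depth */

typedef struct {
    uint64_t slot[FIFO_DEPTH];
    int head, count;
} stream_fifo;

/* Returns 1 if the value was accepted, 0 if the stream is full. */
static int stream_put(stream_fifo *s, uint64_t v)
{
    if (s->count == FIFO_DEPTH) return 0;
    s->slot[(s->head + s->count) % FIFO_DEPTH] = v;
    s->count++;
    return 1;
}

/* Returns 1 if a value was delivered, 0 if the stream is empty. */
static int stream_get(stream_fifo *s, uint64_t *v)
{
    if (s->count == 0) return 0;
    *v = s->slot[s->head];
    s->head = (s->head + 1) % FIFO_DEPTH;
    s->count--;
    return 1;
}

/* Simulate n values flowing from a producer to a consumer that both
 * step once per clock. Returns the clocks needed for the consumer to
 * sum all n values; the sum is stored in *sum_out. */
static int run_overlapped(int n, uint64_t *sum_out)
{
    stream_fifo s = { {0}, 0, 0 };
    int produced = 0, consumed = 0, clocks = 0;
    uint64_t sum = 0, v;
    while (consumed < n) {
        /* consumer first, so it only sees values from earlier clocks */
        if (stream_get(&s, &v)) { sum += v; consumed++; }
        if (produced < n && stream_put(&s, (uint64_t)produced)) produced++;
        clocks++;
    }
    *sum_out = sum;
    return clocks;
}
```

In this model 8 values finish in 9 clocks, because the consumer trails the producer by one clock. Running the producer to completion first and the consumer afterwards would take 16 clocks at the same one-value-per-clock rate.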


Optimization

Use of Streams

Median Filter Code

#pragma src parallel sections {
#pragma src section {
  streamed_dma_cpu (&S0, PORT_TO_STREAM,
                    PATH_0, image_in, 1, nbytes); }
#pragma src section {
  for (n=0; n<(py-2)*(py-2); n++) {
    cg_count_ceil_32 (1, 1, n==0, px-2, &i);
    cg_count_ceil_32 (i==1, 1, n==0, 0xffffffff, &j);
    // the leading point now comes from the DMA stream
    get_stream (&S0, &mvalue);
    a20 = a21; a21 = a22; a22 = mvalue;
    a10 = a11; a11 = a12;
    delay_queue_8_var (a22,1,n==0,px, &a12);
    a00 = a01; a01 = a02;
    delay_queue_8_var (a12,1,n==0,px, &a02);
    // result goes to pix so the image width px is not overwritten
    median_9 (a00, a01, a02,
              a10, a11, a12,
              a20, a21, a22, &pix);

Edge Detect Code (same loop, continued)

    b20 = b21;
    b21 = b22;
    b22 = pix;
    b10 = b11;
    b11 = b12;
    delay_queue_8_var (b22,1,n==0,px, &b12);
    b00 = b01;
    b01 = b02;
    delay_queue_8_var (b12,1,n==0,px, &b02);
    hz = (b00 + b01 + b02) - (b20 + b21 + b22);
    vt = (b00 + b10 + b20) - (b02 + b12 + b22);
    pix = sqrtf (hz*hz+vt*vt);
    if ((i>=2) & (j>=2))
      REF (BL, i-2, j-2) = pix > 255 ? 255 : pix;
  }
} } // end parallel section and region


Performance Gains

[Bar chart: gain relative to the original code, vertical axis up to 400]

Original                 1.0
Opt: Flatten Loops       1.02
Opt: Delay Queues        9.04
Opt: Streaming DMAs      12.05


Optimization

Loop unrolling

Do more work during a loop iteration

Take advantage of the fact that the pixels are 8-bit data packed into 64b values

Unroll by 8 will reduce the compute time by 8x
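A portable C sketch of working on eight packed pixels at once follows. It rests on assumptions: `unpack8`, `pack8` and `process_word` are hypothetical stand-ins, byte 0 is taken to be the least significant byte (the slides do not state the byte order of the Carte split and combine calls), and the clamp to 200 is only a placeholder for the real per-pixel work.

```c
#include <stdint.h>

/* Hypothetical model of splitting a 64-bit word into 8 pixels.
 * ASSUMPTION: v[0] is the least significant byte. */
static void unpack8(uint64_t word, uint8_t v[8])
{
    for (int k = 0; k < 8; k++)
        v[k] = (uint8_t)(word >> (8 * k));
}

/* Hypothetical model of combining 8 pixels back into one word. */
static uint64_t pack8(const uint8_t v[8])
{
    uint64_t word = 0;
    for (int k = 0; k < 8; k++)
        word |= (uint64_t)v[k] << (8 * k);
    return word;
}

/* One "unrolled by 8" step: apply a per-pixel operation to all 8
 * pixels of a word. The clamp to 200 is a placeholder operation. */
static uint64_t process_word(uint64_t in)
{
    uint8_t v[8];
    unpack8(in, v);
    for (int k = 0; k < 8; k++)
        if (v[k] > 200) v[k] = 200;
    return pack8(v);
}
```

In hardware the eight per-pixel operations run side by side in one loop iteration, whereas this C loop runs them in sequence; the sketch shows only the data layout.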


Optimization

Loop Unrolling by 8

Median Filter Code

#pragma src section {
  px8 = px/8;
  for (n=0; n<((px8)*(py-2)); n++) {
    cg_count_ceil_32 (1, 1, n==0, px8, &i);
    cg_count_ceil_32 (i==1, 1, n==0, py, &j);

    get_stream (&S0, &w1);

    /* |   row
       ||  word
       ||| byte
       vvv
      v111 */
    a210=v0; a211=v1; a212=v2; a213=v3; a214=v4; a215=v5; a216=v6; a217=v7;
    split_64to8 (w1, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
    a218=v0; a219=v1;

    a110=v0; a111=v1; a112=v2; a113=v3; a114=v4; a115=v5; a116=v6; a117=v7;
    delay_queue_64_var (w1,1,n==0,px8, &w2);
    split_64to8 (w2, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
    a118=v0; a119=v1;

    a010=v0; a011=v1; a012=v2; a013=v3; a014=v4; a015=v5; a016=v6; a017=v7;
    delay_queue_64_var (w2,1,n==0, px8, &w3);
    split_64to8 (w3, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
    a018=v0; a019=v1;


median_8_9 (a010, a011, a012, a110, a111, a112, a210, a211, a212, &p0);

median_8_9 (a011, a012, a013, a111, a112, a113, a211, a212, a213, &p1);

median_8_9 (a012, a013, a014, a112, a113, a114, a212, a213, a214, &p2);

median_8_9 (a013, a014, a015, a113, a114, a115, a213, a214, a215, &p3);

median_8_9 (a014, a015, a016, a114, a115, a116, a214, a215, a216, &p4);

median_8_9 (a015, a016, a017, a115, a116, a117, a215, a216, a217, &p5);

median_8_9 (a016, a017, a018, a116, a117, a118, a216, a217, a218, &p6);

median_8_9 (a017, a018, a019, a117, a118, a119, a217, a218, a219, &p7);

comb_8to64 (p7,p6,p5,p4,p3,p2,p1,p0,&b1);

put_stream (&S1, b1, 1);

} // end parallel section

// continued into edge detect


Optimization

Loop Unrolling by 8

Edge Detect Code

#pragma src section {
  px8 = px/8;
  for (n=0; n<((px8)*(py-2)); n++) {
    cg_count_ceil_32 (1,0,n==0,px8,&j);
    cg_count_ceil_32 (j==0,0,n==0,py,&i);

    get_stream (&S1, &w1);

    /* |   row
       ||  word
       ||| byte
       vvv
      v111 */
    a210=v0; a211=v1; a212=v2; a213=v3; a214=v4; a215=v5; a216=v6; a217=v7;
    split_64to8 (w1, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
    a218=v0; a219=v1;

    a110=v0; a111=v1; a112=v2; a113=v3; a114=v4; a115=v5; a116=v6; a117=v7;
    delay_queue_64_var (w1, 1, n==0, px8, &w2);
    split_64to8 (w2, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
    a118=v0; a119=v1;

    a010=v0; a011=v1; a012=v2; a013=v3; a014=v4; a015=v5; a016=v6; a017=v7;
    delay_queue_64_var (w2, 1, n==0, px8, &w3);
    split_64to8 (w3, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
    a018=v0; a019=v1;


Optimization

Used Inlining of edge_detect_8

Edge Detect Code

edge_detect_8 (a010, a011, a012, a110, a111, a112, a210, a211, a212, &p0);

edge_detect_8 (a011, a012, a013, a111, a112, a113, a211, a212, a213, &p1);

edge_detect_8 (a012, a013, a014, a112, a113, a114, a212, a213, a214, &p2);

edge_detect_8 (a013, a014, a015, a113, a114, a115, a213, a214, a215, &p3);

edge_detect_8 (a014, a015, a016, a114, a115, a116, a214, a215, a216, &p4);

edge_detect_8 (a015, a016, a017, a115, a116, a117, a215, a216, a217, &p5);

edge_detect_8 (a016, a017, a018, a116, a117, a118, a216, a217, a218, &p6);

edge_detect_8 (a017, a018, a019, a117, a118, a119, a217, a218, a219, &p7);

comb_8to64 (p7,p6,p5,p4,p3,p2,p1,p0,&b1);

ix = (i-4)*px8 + j-2;

if ((i>=4) & (j>=2)) DL[ix] = b1;

} // end parallel section

} // end parallel region


Performance Gains

[Bar chart: gain relative to the original code, vertical axis up to 400]

Original                 1.0
Opt: Flatten Loops       1.02
Opt: Delay Queues        9.04
Opt: Streaming DMAs      12.05
Opt: Unroll by 8         92.9


Optimization

Maximize Input DMA Bandwidth

Use streaming DMA to bring in two 64b values every clock

Unroll by 16

Two output DMA examples

– DMA to microprocessor after compute

– Streaming DMA to Global Common Memory
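The arithmetic behind matching the unroll factor to the input bandwidth can be written out as a short C sketch. The function names are hypothetical; the trip-count expression follows the loop bound used in the slide code, (px / unroll) * (py - 2).

```c
#include <assert.h>
#include <stdint.h>

/* Pixels delivered per clock when `words_per_clock` 64-bit words
 * arrive each clock and every pixel is 8 bits. */
static uint32_t pixels_per_clock(uint32_t words_per_clock)
{
    return words_per_clock * (64u / 8u);
}

/* Trip count of the flattened loop for a px x py image, following
 * the bound used in the slide code: (px / unroll) * (py - 2). */
static uint32_t flattened_trips(uint32_t px, uint32_t py, uint32_t unroll)
{
    assert(unroll > 0 && px % unroll == 0 && py >= 2);
    return (px / unroll) * (py - 2);
}
```

Two 64b words per clock carry 16 pixels, hence the unroll by 16. For a 640 x 480 image the loop bound gives 38,240 iterations at unroll 8, 19,120 at unroll 16 and 9,560 at unroll 32.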


Optimization

Stream Input / Bulk DMA Output

Median Filter Code

#pragma src parallel sections {
#pragma src section {
  streamed_dma_gcm_128 (&S0, PORT_TO_STREAM, PATH_0,
                        image_in, 1, nbytes); }
#pragma src section
{
  px16 = px/16;
  for (n=0; n<((px16)*(py-2)); n++) {
    cg_count_ceil_32 (1, 1, n==0, px16, &i);
    cg_count_ceil_32 (i==1, 1, n==0, py, &j);
    get_stream_128 (&S0, &w1, &w2);
    // 16 median_8_9 calls
    put_stream_128 (&S1, b1, b2, 1);
  }
} // end parallel section

Edge Detect Code

#pragma src section
{
  px16 = px/16;
  for (n=0; n<((px16)*(py-2)); n++) {
    cg_count_ceil_32 (1, 0, n==0, px16, &j);
    cg_count_ceil_32 (j==0, 0, n==0, py, &i);
    get_stream_128 (&S1, &w1, &w2);
    // 16 edge_detect calls
    ix = (i-4)*px16 + j-2;
    if ((i>=4) & (j>=2)) {
      CL[ix] = b11;
      DL[ix] = b12;
    }
  }
} // end parallel section
} // end parallel region


Performance Gains

[Bar chart: gain relative to the original code, vertical axis up to 400]

Original                                            1.0
Opt: Flatten Loops                                  1.02
Opt: Delay Queues                                   9.04
Opt: Streaming DMAs                                 12.05
Opt: Unroll by 8                                    92.9
Opt: Unroll by 16, Stream DMA in, Bulk DMA out      158


Optimization

Stream Input / Stream Output

Median Filter Code

#pragma src parallel sections {
#pragma src section {
  streamed_dma_gcm_128 (&S0, PORT_TO_STREAM, PATH_0,
                        image_in, 1, nbytes); }
#pragma src section
{
  px16 = px/16;
  for (n=0; n<((px16)*(py-2)); n++) {
    cg_count_ceil_32 (1, 1, n==0, px16, &i);
    cg_count_ceil_32 (i==1, 1, n==0, py, &j);
    get_stream_128 (&S0, &w1, &w2);
    // 16 median_8_9 calls
    put_stream_128 (&S1, b1, b2, 1);
  }
} // end parallel section

Edge Detect Code

#pragma src section
{
  px16 = px/16;
  for (n=0; n<((px16)*(py-2)); n++) {
    cg_count_ceil_32 (1, 0, n==0, px16, &j);
    cg_count_ceil_32 (j==0, 0, n==0, py, &i);
    get_stream_128 (&S1, &w1, &w2);
    // 16 edge_detect calls
    iput = ((i>=4) & (j>=2)) ? 1 : 0 ;
    put_stream_128 (&S2, b11, b12, iput);
  }
} // end parallel section
#pragma src section {
  // output DMA drains S2, the stream written by the edge detect loop
  streamed_dma_gcm_128 (&S2, STREAM_TO_PORT,
                        PATH_0, image_out, 1, nbytes); }
} // end parallel region


Performance Gains

[Bar chart: gain relative to the original code, vertical axis up to 400]

Original                                            1.0
Opt: Flatten Loops                                  1.02
Opt: Delay Queues                                   9.04
Opt: Streaming DMAs                                 12.05
Opt: Unroll by 8                                    92.9
Opt: Unroll by 16, Stream DMA in, Bulk DMA out      158
Opt: Unroll by 16, Stream DMA in, Stream DMA out    277


Optimization

Maximize Input DMA Bandwidth

Use streaming DMA to bring in roughly four 64b values every clock

Unroll by 32


Optimization

Stream Input / Bulk DMA Output

Median Filter Code

#pragma src parallel sections {
#pragma src section {
  streamed_dma_gcm_256 (&S0, PORT_TO_STREAM, PATH_0,
                        image_in, 1, nbytes); }
#pragma src section
{
  px32 = px/32;
  for (n=0; n<((px32)*(py-2)); n++) {
    cg_count_ceil_32 (1, 1, n==0, px32, &i);
    cg_count_ceil_32 (i==1, 1, n==0, py, &j);
    get_stream_256 (&S0, &w1, &w2, &w3, &w4);
    // 32 median_8_9 calls
    put_stream_256 (&S1, b1, b2, b3, b4, 1);
  }
} // end parallel section

Edge Detect Code

#pragma src section
{
  px32 = px/32;
  for (n=0; n<((px32)*(py-2)); n++) {
    cg_count_ceil_32 (1, 0, n==0, px32, &j);
    cg_count_ceil_32 (j==0, 0, n==0, py, &i);
    get_stream_256 (&S1, &w1, &w2, &w3, &w4);
    // 32 edge_detect calls
    ix = (i-4)*px32 + j-2;
    if ((i>=4) & (j>=2)) {
      CL[ix] = b11;
      DL[ix] = b12;
      EL[ix] = b13;
      FL[ix] = b14;
    }
  }
} // end parallel section
} // end parallel region


Performance Gains

[Bar chart: gain relative to the original code, vertical axis up to 400]

Original                                            1.0
Opt: Flatten Loops                                  1.02
Opt: Delay Queues                                   9.04
Opt: Streaming DMAs                                 12.05
Opt: Unroll by 8                                    92.9
Opt: Unroll by 16, Stream DMA in, Bulk DMA out      158
Opt: Unroll by 16, Stream DMA in, Stream DMA out    277
Opt: Unroll by 32                                   345


How does Microprocessor Performance* Compare

* Intel IPPLIB 5.1 running on 3GHz Xeon

Optimization              MAP Speedup    MAP Speedup
                          640 x 480      1024 x 1024
Original                  .56            .69
Opt: Flattened Loops      .56            .70
Opt: Delay Queues         5.0            6.2
Opt: Streaming DMAs       6.7            8.3
Opt: Unroll by 8          51.6           56.7
Opt: Unroll by 16         154            195
Opt: Unroll by 32         191            256


Logic Utilization

Optimization Level        Percentage Logic Utilization
Original                  16
Opt: Flattened Loops      16
Opt: Delay Queues         16
Opt: Streaming DMAs       16
Opt: Unrolling by 8       37
Opt: Unrolling by 16      47
Opt: Unrolling by 32      75


Performance is for the taking

Standard code optimization techniques work

Getting massive compute parallelism is straightforward

Easy to “dial” the amount of DMA bandwidth to match compute parallelism requirements
