Embedded Systems in Silicon TD5102 Data Management (1) Overview Henk Corporaal http://www.ics.ele.tue.nl/~heco/courses/ EmbSystems Technical University Eindhoven DTI / NUS Singapore 2005/2006



H.C. TD5102 2

Data Management Overview

• Motivation

• Example application

• Data Management (DM) steps

• Results

Important note:

– We consider statically declared data structures only

– DM is also called
  - DTSE (Data Transfer and Storage Exploration), or
  - Physical Memory Management

[Figure: design-flow diagram with the stages Concurrent OO spec, Remove OO overhead, Dynamic memory mgmt, Task concurrency mgmt, Physical memory mgmt, Address optimization, SW/HW co-design, SW design flow and HW design flow]

The underlying idea

[Figure: media-processor platform with a VLIW CPU, I$, D$, SDRAM and peripherals (video-in, video-out, audio-in, audio-out, PCI bridge, serial I/O, timers, I2C I/O). The statement B[j] = A[i*4+k] is shown at the SDRAM and at the D$, with the annotations "Data storage bottleneck" and "Data transfer bottleneck".]

    for (i=0; i<n; i++)
      for (j=0; j<3; j++)
        for (k=1; k<7; k++)
          B[j] = A[i*4+k];

Platform architecture model

[Figure: memory hierarchy in four levels (Level-1 to Level-4). Components shown: CPUs and HW accelerators with ICache, DCache and local memories, on-chip busses, L2 cache and bus interface (all on the chip), main memory, a bridge, SCSI busses and disks.]

Platform example: TriMedia

[Figure: TriMedia TM1000 and its cache. Labels: 5 out of 27 processor FUs; 128 x 32b 16-port RegFile; hardware accelerators; 16K 2-port SRAM; 256M 1-port SDRAM; SW cache 8KB; HW cache 8/16KB; CPU; cache bypass; SW controlled; HW controlled.]

Data transfer and storage power

[Figure: three bar charts.
 (1) Relative energy per operation (scale 0 to 6): ADD DX,BX; MOV DB,[BX]; cache miss (per word).
 (2) Power distribution (scale 0 to 40): CPUs, Storage, Interconnect, Clocks, Other.
 (3) Relative energy per operation (scale 0 to 35): 16b carry-select, 16b multiplier, 8x128x16 SRAM (R), 8x128x16 SRAM (W), external IO access, 16b memory transfer.]

Power(memory) / Power(arithmetic) = 33

Positioning in the Y-chart

[Figure: Y-chart with Applications, Architecture Instance, Mapping, Performance Analysis and Performance Numbers. Annotation at the applications side: data transfer and data storage specific rewrites in the application code.]

Current practice

Mapping: easy, but ...

• Given
  – reference C code for the application, e.g. MPEG-4 Motion Estimation
  – platform: SUPERDUPER-LX50

• Task
  – map application on architecture

• But … wait a moment

    me@work> CC -o2 mpeg4_me mpeg4_me.c
    Thank you for running SUPERDUPER-LX50 compiler.
    Your program uses 257321886 bytes memory, 78 Watt, 428798765291 clock cycles

[Figure: code fragment "a=b*5+d; for (...) {..}" with the label "Idea"]

Let’s help the compiler ...

DTSE: data transfer and storage exploration

• DTSE is a methodology to explore data-transfer and data-storage in multi-media applications

– Transforms C-code of the application

– By focusing on multi-dimensional signals (arrays)

– To better exploit platform capabilities

• This overview covers the major steps to improve power, area, performance trade-off

Data Management principles

• Exploit memory hierarchy

• Exploit limited life-time

• Avoid N-port memories within real-time constraints

• Reduce redundant transfers

• Introduce locality

[Figure: processor data paths connected to an L1 cache, an L2 cache made of cache banks (local latch 1 & bank 1 ... local latch N & bank N, combined), and off-chip SDRAM]

DM steps

C-in

Preprocessing

Dataflow transformations

Loop transformations

Data reuse and memory hierarchy layer assignment

Cycle budget distribution

Memory allocation and assignment

Data layout

C-out

Address optimization

The DM steps

• Preprocessing
  – Rewrite code in 3 layers (parts)
  – Selective inlining, single-assignment form, ...

• Data flow transformations
  – Eliminate redundant transfers and storage

• Loop and control flow transformations
  – Improve regularity of accesses and data locality

• Data re-use and memory hierarchy layer assignment
  – Determine when to move which data between memories to meet the cycle budget of the application at low cost
  – Determine in which layer to put the arrays (and copies)

The DM steps

Per memory layer:

• Cycle budget distribution
  – determine memory access constraints for the given cycle budget

• Memory allocation and assignment
  – which memories to use, and where to put the arrays

• Data layout
  – determine how to combine and put arrays into memories

• Address optimization on the final C-code

Application example

• Application domain:
  – Computer Tomography in medical imaging

• Algorithm:
  – Cavity detection in CT-scans
  – Detect dark regions in successive images
  – Indicate cavity in brain

Bad news for owner of brain

Data enters Cavity Detector row-wise

[Figure: scan device, serial scan, Buffer (= image_in), Cavity Detector, GaussBlur loop]

Application

Reference (conceptual) C code for the algorithm

– all functions: image_in[N x M] (time t-1) -> image_out[N x M] (time t)

– new value of pixel depends on its neighbors

– neighbor pixels read from background memory

– approximately 110 lines of C code (ignoring file I/O etc.)

– experiments with N x M = 640 x 400 pixels

– straightforward implementation: 6 image buffers

[Figure: processing functions GaussBlur x, GaussBlur y, ComputeEdges, Reverse, DetectRoots, MaxValue]

Preprocessing: Dividing an application in the 3 layers

LAYER1: Module1a, Module1b, Module2, Module3, Synchronisation
  - testbench call
  - dynamic event behaviour
  - mode selection

LAYER2:

    for (i=0; i<N; i++)
      for (j=0; j<M; j++)
        if (i == 0)
          B[i][j] = 1;
        else
          B[i][j] = func1(A[i][j], A[i-1][j]);

LAYER3:

    int func1(int a, int b)
    {
      return a*b;
    }

Layered code structure

    main()
    { /* Layer 1 code */
      read_image(IN_NAME, image_in);
      cav_detect();
      write_image(image_out);
    }

    void cav_detect()
    { /* Layer 2 code */
      for (x=GB; x<=N-1-GB; ++x) {
        for (y=GB; y<=M-1-GB; ++y) {
          gauss_x_tmp = 0;
          for (k=-GB; k<=GB; ++k) {
            /* image_in is the array filled by read_image() in main() */
            gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
          }
          gauss_x_image[x][y] = foo(gauss_x_tmp);
        }
      }
    }

Data-flow trafo - cavity detection

    /* initialise the whole N x M image */
    for (x=0; x<N; ++x)
      for (y=0; y<M; ++y)
        gauss_x_image[x][y] = 0;

    /* overwrite the (N-2) x (M-2) interior */
    for (x=1; x<=N-2; ++x) {
      for (y=1; y<=M-2; ++y) {
        gauss_x_tmp = 0;
        for (k=-1; k<=1; ++k) {
          gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
        }
        gauss_x_image[x][y] = foo(gauss_x_tmp);
      }
    }

[Figure: N x M image with its (N-2) x (M-2) interior]

#accesses: N * M + (N-2) * (M-2)

Data-flow trafo - cavity detection

[Figure: N x M image with its (N-2) x (M-2) interior]

    for (x=0; x<N; ++x)
      for (y=0; y<M; ++y)
        if ((x>=1 && x<=N-2) && (y>=1 && y<=M-2)) {
          gauss_x_tmp = 0;
          for (k=-1; k<=1; ++k) {
            gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
          }
          gauss_x_image[x][y] = foo(gauss_x_tmp);
        } else {
          /* border pixels are written exactly once */
          gauss_x_image[x][y] = 0;
        }

#accesses: N * M

Gain is approximately 50 %
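The two access counts above can be checked with a small counting sketch. The function names are hypothetical, and only the writes to gauss_x_image are counted, as on the slides.

```c
#include <assert.h>

/* Writes to gauss_x_image when the image is first cleared and then
   filtered in a separate loop nest (version before the transformation). */
long writes_separate(long n, long m)
{
    long count = 0;
    assert(n >= 2 && m >= 2);
    for (long x = 0; x < n; ++x)
        for (long y = 0; y < m; ++y)
            ++count;                  /* gauss_x_image[x][y] = 0 */
    for (long x = 1; x <= n - 2; ++x)
        for (long y = 1; y <= m - 2; ++y)
            ++count;                  /* gauss_x_image[x][y] = foo(...) */
    return count;
}

/* Writes when the initialisation is merged into the filter loop:
   every pixel, border or interior, is written exactly once. */
long writes_merged(long n, long m)
{
    long count = 0;
    assert(n >= 2 && m >= 2);
    for (long x = 0; x < n; ++x)
        for (long y = 0; y < m; ++y)
            ++count;
    return count;
}
```

For the 640 x 400 image used in the experiments this gives 509924 writes before and 256000 after, which matches the gain of roughly 50 %.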


Data-flow transformation

• In total 5 types of data-flow transformations:

– advanced signal substitution and (copy) propagation

– algebraic transformations (associativity, etc.)

– shifting “delay lines”

– re-computation

– transformations to eliminate bottlenecks for subsequent loop transformations

Loop transformations

• Loop transformations
  – improve regularity of accesses
  – improve temporal locality: production → consumption

• Expected influence
  – reduce temporary storage and (anticipated) background storage

Before (storage size N):

    for (j=1; j<=M; j++)
      for (i=1; i<=N; i++)
        A[i] = foo(A[i]);

    for (i=1; i<=N; i++)
      out[i] = A[i];

After (storage size 1):

    for (i=1; i<=N; i++) {
      for (j=1; j<=M; j++) {
        A[i] = foo(A[i]);
      }
      out[i] = A[i];
    }
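A minimal sketch of why the transformed loop needs only one storage location: once each element is finished before the next one starts, the array can be replaced by a scalar. The sizes, the body of foo() and the 0-based indexing are assumptions made for the illustration.

```c
#include <assert.h>

#define N 8   /* hypothetical array length */
#define M 3   /* hypothetical number of passes */

/* Hypothetical stand-in for the foo() on the slide. */
static int foo(int v) { return 2 * v + 1; }

/* Before: all M passes run over the whole array, so N values stay live. */
void version_before(const int in[N], int out[N])
{
    int A[N];
    for (int i = 0; i < N; i++)
        A[i] = in[i];
    for (int j = 0; j < M; j++)
        for (int i = 0; i < N; i++)
            A[i] = foo(A[i]);
    for (int i = 0; i < N; i++)
        out[i] = A[i];
}

/* After interchange and merging: one scalar replaces the array. */
void version_after(const int in[N], int out[N])
{
    for (int i = 0; i < N; i++) {
        int a = in[i];
        for (int j = 0; j < M; j++)
            a = foo(a);
        out[i] = a;
    }
}

/* Returns 1 when both versions produce identical output. */
int versions_agree(void)
{
    int in[N], before[N], after[N];
    for (int i = 0; i < N; i++)
        in[i] = i;
    version_before(in, before);
    version_after(in, after);
    for (int i = 0; i < N; i++)
        if (before[i] != after[i])
            return 0;
    return 1;
}
```

The interchange is legal here because every A[i] depends only on its own previous value.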

Global loop transformation steps applied to cavity detection

• Removal of data-flow bottleneck
  – allows merging of loops
  – done in global data-flow trafo step

• Make all loop dimensions equal

• Regularize loop traversal: Y and X loop interchange
  – follow order of input stream

• Y loop folding and global merging, X loop folding and global merging
  – full, global scope regularity
  – nearly complete locality for main signals

Loop trafo - cavity detection

[Figure: Scanner feeding GaussBlur x through an N x M buffer, with X and Y scan directions. Annotation: X-Y loop interchange, from double buffer to single buffer.]

Loop interchange (Y ↔ X)

    for (x=0; x<N; x++)
      for (y=0; y<M; y++)
        /* filtering code */

becomes

    for (y=0; y<M; y++)
      for (x=0; x<N; x++)
        /* filtering code */

• Single assignment always possible

• For all loops, to maintain regularity

Loop trafo - cavity detection

Repeated fold and loop merge

[Figure: GaussBlur x, GaussBlur y and ComputeEdges connected by line buffers (offset arrays). Annotations: from N x M to N x (2GB+1) buffer size; from N x M to N x (3) buffer size.]

Improve regularity and locality: Loop Merging

    for (y=0; y<M; y++)
      for (x=0; x<N; x++)
        /* 1st filtering code */
    for (y=0; y<M; y++)
      for (x=0; x<N; x++)
        /* 2nd filtering code */

merged into

    for (y=0; y<M; y++) {
      for (x=0; x<N; x++)
        /* 1st filtering code */
      for (x=0; x<N; x++)
        /* 2nd filtering code */
    }

!! Impossible due to dependencies!

Data dependencies between 1st and 2nd loop

    for (y=0; y<M; y++)
      for (x=0; x<N; x++)
        … gauss_x_image[x][y] = …

    for (y=0; y<M; y++)
      for (x=0; x<N; x++)
        …
        for (k=-GB; k<=GB; k++)
          … = … gauss_x_image[x][y+k] …

Enable merging with Loop Folding (bumping)

    for (y=0; y<M; y++)
      for (x=0; x<N; x++)
        … gauss_x_image[x][y] = …

    for (y=0+GB; y<M+GB; y++)
      for (x=0; x<N; x++)
        … y-GB …
        for (k=-GB; k<=GB; k++)
          … gauss_x_image[x][y+k-GB] …

Y-loop merging on 1st and 2nd loop nest

    for (y=0; y<M+GB; y++)
      if (y<M)
        for (x=0; x<N; x++)
          … gauss_x_image[x][y] = …

      if (y>=GB)
        for (x=0; x<N; x++)
          if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
            …
            for (k=-GB; k<=GB; k++)
              … gauss_x_image[x][y-GB+k] …
          else
            …

Simplify conditions in merged loop nest

    for (y=0; y<M+GB; y++)
      for (x=0; x<N; x++)
        if (y<M)
          … gauss_x_image[x][y] = …

      for (x=0; x<N; x++)
        if (y>=GB && x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
          …
          for (k=-GB; k<=GB; k++)
            … gauss_x_image[x][y-GB+k] …
        else if (y>=GB)
          …
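The folding and merging steps can be tried on a one-dimensional sketch. The producer (2 * in[y]), the consumer (a sum over 2*GB+1 neighbours) and the sizes are hypothetical stand-ins for the two filters. The consumer is delayed by GB iterations, so every value it reads has already been produced inside the same merged loop.

```c
#include <assert.h>

#define M  16   /* hypothetical number of rows */
#define GB 1    /* filter half-width, as on the slides */

/* Reference: producer and consumer in two separate loops. */
void separate_loops(const int in[M], int out[M])
{
    int g[M];
    for (int y = 0; y < M; y++)
        g[y] = 2 * in[y];                 /* 1st filter (stand-in) */
    for (int y = 0; y < M; y++) {
        out[y] = 0;
        if (y >= GB && y <= M - 1 - GB)
            for (int k = -GB; k <= GB; k++)
                out[y] += g[y + k];       /* 2nd filter (stand-in) */
    }
}

/* Folded and merged: the consumer body runs GB iterations late. */
void folded_and_merged(const int in[M], int out[M])
{
    int g[M];
    for (int y = 0; y < M + GB; y++) {
        if (y < M)
            g[y] = 2 * in[y];
        if (y >= GB) {
            int yc = y - GB;              /* row handled by the consumer */
            out[yc] = 0;
            if (yc >= GB && yc <= M - 1 - GB)
                for (int k = -GB; k <= GB; k++)
                    out[yc] += g[yc + k]; /* highest index read is y */
        }
    }
}

/* Returns 1 when both versions produce identical output. */
int folding_preserves_result(void)
{
    int in[M], a[M], b[M];
    for (int y = 0; y < M; y++)
        in[y] = 3 * y + 1;
    separate_loops(in, a);
    folded_and_merged(in, b);
    for (int y = 0; y < M; y++)
        if (a[y] != b[y])
            return 0;
    return 1;
}
```

In the merged version only the last 2*GB+1 values of g are still needed at any time, which is what later allows the N x M buffer to shrink to N x (2GB+1).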

Global loop merging/folding steps

1 x ↔ y loop interchange (done)

2 Global y-loop folding/merging: 1st and 2nd nest (done)

3 Global y-loop folding/merging: 1st/2nd and 3rd nest

4 Global y-loop folding/merging: 1st/2nd/3rd and 4th nest

5 Global x-loop folding/merging: 1st and 2nd nest

6 Global x-loop folding/merging: 1st/2nd and 3rd nest

7 Global x-loop folding/merging: 1st/2nd/3rd and 4th nest

End result of global loop trafo

    for (y=0; y<M+GB+2; ++y) {
      for (x=0; x<N+2; ++x) {
        …
        if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
          gauss_xy_compute[x][y-GB][0] = 0;
          for (k=-GB; k<=GB; ++k)
            gauss_xy_compute[x][y-GB][GB+k+1] =
              gauss_xy_compute[x][y-GB][GB+k] +
              gauss_x_image[x][y-GB+k] * Gauss[abs(k)];
          gauss_xy_image[x][y-GB] =
            gauss_xy_compute[x][y-GB][(2*GB)+1]/tot;
        } else if (x<N && (y-GB)>=0 && (y-GB)<M)
          gauss_xy_image[x][y-GB] = 0;
        …

Data re-use & memory hierarchy

• Introduce memory hierarchy
  – reduce number of reads from main memory
  – heavily accessed arrays stored in smaller memories

[Figure: processor data paths and register file, backed by main memory M (P = 1 per access) and two smaller memories M’ (P = 0.1) and M’’ (P = 0.01); access counts 100, 10 and 1]

#A = 100

P (original) = # access x power/access = 100

P (after) = 100 x 0.01 + 10 x 0.1 + 1 x 1 = 3
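The estimate above is a weighted sum over the memory layers. A minimal sketch follows, using the access counts and relative power figures from the slide; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Relative power = sum over layers of (#accesses x power per access). */
double relative_power(const double accesses[],
                      const double power_per_access[],
                      size_t layers)
{
    double total = 0.0;
    assert(accesses != NULL && power_per_access != NULL);
    for (size_t l = 0; l < layers; ++l)
        total += accesses[l] * power_per_access[l];
    return total;
}

/* All 100 accesses go to main memory at relative power 1. */
double power_original(void)
{
    const double accesses[] = { 100.0 };
    const double power[]    = { 1.0 };
    return relative_power(accesses, power, 1);
}

/* 100 accesses to M'', 10 to M', 1 to main memory M. */
double power_after(void)
{
    const double accesses[] = { 100.0, 10.0, 1.0 };
    const double power[]    = { 0.01, 0.1, 1.0 };
    return relative_power(accesses, power, 3);
}
```

With these numbers the hierarchy reduces the relative power from 100 to about 3, because most accesses are served by the cheapest layer.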

Data re-use

• Data flow transformations to introduce extra copies of heavily accessed signals
  – Step 1: figure out data re-use possibilities
  – Step 2: calculate possible gain
  – Step 3: decide on data assignment to memory hierarchy

    int A[2][6];

    for (h=0; h<N; h++)
      for (i=0; i<2; i++)
        for (j=0; j<3; j++)
          for (k=1; k<7; k++)
            B[j] = A[i][k];

[Figure: plot of array index (6 * i + k) against iterations]

Data re-use

[Figure: array index against iterations for frame1, frame2 and frame3, with copy candidates of size 6*2 and 6*1 and the transfer counts N*2*3*6 (to the CPU), N*2*1*6 and 1*2*1*6]
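The transfer counts N*2*3*6 and N*2*1*6 can be reproduced with a sketch that copies one row of A into a small buffer and serves the j loop from that copy. The counters, the row_copy buffer and the 0-based k range are assumptions for the illustration (the slide runs k from 1 to 6).

```c
#include <assert.h>

typedef struct {
    long main_reads;   /* reads from the large array A */
    long copy_reads;   /* reads from the small copy */
} access_counts;

static int A[2][6];    /* array held in the large memory */
static int B[3];

/* Every read goes to A: n * 2 * 3 * 6 reads from the large memory. */
access_counts run_without_copy(long n)
{
    access_counts c = { 0, 0 };
    for (long h = 0; h < n; h++)
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 3; j++)
                for (int k = 0; k < 6; k++) {
                    B[j] = A[i][k];
                    c.main_reads++;
                }
    return c;
}

/* One row is copied once per (h, i) and then re-used three times:
   n * 2 * 1 * 6 reads from A, n * 2 * 3 * 6 reads from the copy. */
access_counts run_with_copy(long n)
{
    access_counts c = { 0, 0 };
    int row_copy[6];
    for (long h = 0; h < n; h++)
        for (int i = 0; i < 2; i++) {
            for (int k = 0; k < 6; k++) {
                row_copy[k] = A[i][k];
                c.main_reads++;
            }
            for (int j = 0; j < 3; j++)
                for (int k = 0; k < 6; k++) {
                    B[j] = row_copy[k];
                    c.copy_reads++;
                }
        }
    return c;
}
```

Whether the copy pays off depends on the relative cost per access of the two memories, which is the gain calculation of Step 2.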

Data re-use tree

[Figure: data re-use trees for image_in, gauss_x, gauss_xy/comp_edge and image_out, each ending at the CPU. Candidate copy sizes shown: N*M, N*1, M*3, 3*3, 3*1, 1*3, 1*1. Transfer counts on the edges: N*M, N*M*3, N*M*8, 0.]

Memory hierarchy assignment

[Figure: assignment of the copies of image_in, gauss_x, gauss_xy, comp_edge and image_out to three layers: L3 (1MB SDRAM), L2 (16KB cache) and L1 (128 B RegFile). Copy sizes shown: N*M, M*3, 3*3, 3*1, 1*1. Transfer counts on the edges: N*M, N*M*3, N*M*8, 0.]

Data-reuse - cavity detection code

Code before reuse transformation

    for (y=0; y<M+3; ++y) {
      for (x=0; x<N+2; ++x) {
        if (x>=1 && x<=N-2 && y>=1 && y<=M-2) {
          gauss_x_tmp = 0;
          for (k=-1; k<=1; ++k)
            gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
          gauss_x_image[x][y] = foo(gauss_x_tmp);
        } else {
          if (x<N && y<M)
            gauss_x_image[x][y] = 0;
        }
        /* Other merged code omitted … */
      }
    }

Data-reuse - cavity detection code

Code after reuse transformation

    for (y=0; y<M+3; ++y) {
      for (x=0; x<N+2; ++x) {
        /* first in_pixel initialized */
        if (x==0 && y>=1 && y<=M-2)
          for (k=0; k<1; ++k)
            in_pixels[(x+k)%3][y%1] = image_in[x+k][y];
        /* copy rest of in_pixel's in row */
        if (x>=0 && x<=N-2 && y>=1 && y<=M-2)
          in_pixels[(x+1)%3][y%1] = image_in[x+1][y];
        if (x>=1 && x<=N-1-1 && y>=1 && y<=M-2) {
          gauss_x_tmp = 0;
          for (k=-1; k<=1; ++k)
            gauss_x_tmp += in_pixels[(x+k)%3][y%1]*Gauss[abs(k)];
          gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
        } else if (x<N && y<M)
          gauss_x_lines[x][y%3] = 0;
        /* Other merged code omitted … */
      }
    }
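The in_pixels buffer above is a three-entry circular buffer addressed with modulo 3. A one-dimensional sketch of the same idea follows; the filter weights and the row width are hypothetical. Each input pixel is read from the large array once and then re-used from the small buffer.

```c
#include <assert.h>
#include <stdlib.h>

#define WIDTH 10                        /* hypothetical row width */

static const int Gauss[2] = { 2, 1 };   /* hypothetical filter weights */

/* Reference: every tap reads the input row directly (3 reads per pixel). */
void blur_direct(const int in[WIDTH], int out[WIDTH])
{
    for (int x = 0; x < WIDTH; x++) {
        out[x] = 0;
        if (x >= 1 && x <= WIDTH - 2)
            for (int k = -1; k <= 1; k++)
                out[x] += in[x + k] * Gauss[abs(k)];
    }
}

/* With re-use: in[x+1] is copied into slot (x+1)%3 once and the three
   taps are served from the buffer. */
void blur_with_pixel_buffer(const int in[WIDTH], int out[WIDTH])
{
    int in_pixels[3];
    in_pixels[0] = in[0];               /* first pixel initialized */
    for (int x = 0; x < WIDTH; x++) {
        if (x <= WIDTH - 2)
            in_pixels[(x + 1) % 3] = in[x + 1];
        out[x] = 0;
        if (x >= 1 && x <= WIDTH - 2)
            for (int k = -1; k <= 1; k++)
                out[x] += in_pixels[(x + k) % 3] * Gauss[abs(k)];
    }
}

/* Returns 1 when both versions produce identical output. */
int pixel_buffer_matches_direct(void)
{
    int in[WIDTH], a[WIDTH], b[WIDTH];
    for (int x = 0; x < WIDTH; x++)
        in[x] = x * x;
    blur_direct(in, a);
    blur_with_pixel_buffer(in, b);
    for (int x = 0; x < WIDTH; x++)
        if (a[x] != b[x])
            return 0;
    return 1;
}
```

The modulo operations introduced here are the kind of address arithmetic that the later address optimization step tries to remove.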

Data layout optimization

• At this point multi-dimensional arrays are to be assigned to physical memories

• Data layout optimization determines exactly where in each memory an array should be placed, to
  – reduce memory size by “in-placing” arrays that do not overlap in time (disjoint lifetimes)
  – avoid cache misses due to conflicts
  – exploit spatial locality of the data in memory to improve performance of e.g. page-mode memory access sequences

In-place mapping

[Figure: arrays A, B, C, D and E drawn as regions in an addresses-versus-time plane, shown before and after mapping. Annotations: inter in-place, intra in-place.]

In-place mapping

• Implements all the “anticipated” memory size savings obtained in previous steps

• Modifies code to introduce one array per “real” memory

• Changes indices to addresses in mem. arrays

    b8 A[100][100];
    b6 B[20][20];
    for (i,j,k,l; …)
      B[i][j] = f(B[j][i], A[i+k][j+l]);

becomes

    b8 mem1[10400];
    for (i,j,k,l; …)
      mem1[10000+i+20*j] = f(mem1[10000+j+20*i], b6(mem1[i+k+100*(j+l)]));

[Figure: memory map of mem1 with A starting at address 0x0, B starting at 0x2710 and the end of the memory at 0x28a0]
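The index-to-address mapping used in mem1 can be written out as two small helper functions. The helper names are hypothetical; the offsets and strides are the ones that appear in the slide code (A[i][j] at i + 100*j, B[i][j] at 10000 + i + 20*j).

```c
#include <assert.h>

enum {
    A_DIM    = 100,                     /* A is 100 x 100 */
    B_DIM    = 20,                      /* B is 20 x 20 */
    B_BASE   = A_DIM * A_DIM,           /* B starts right after A */
    MEM_SIZE = B_BASE + B_DIM * B_DIM   /* size of the single array mem1 */
};

/* Address of A[i][j] inside mem1. */
int addr_A(int i, int j)
{
    assert(i >= 0 && i < A_DIM && j >= 0 && j < A_DIM);
    return i + A_DIM * j;
}

/* Address of B[i][j] inside mem1. */
int addr_B(int i, int j)
{
    assert(i >= 0 && i < B_DIM && j >= 0 && j < B_DIM);
    return B_BASE + i + B_DIM * j;
}
```

The base address of B evaluates to 10000 (0x2710) and the total size to 10400 (0x28a0), the two addresses shown in the memory map.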

In-place mapping

• Input image is partly consumed by the time first results for output image are ready

[Figure: index-versus-time plots for Image_in and Image_out, and the combined address-versus-time plot for a single array Image]

In-place - cavity detection code

Before:

    for (y=0; y<=M+3; ++y) {
      for (x=0; x<N+5; ++x) {
        image_out[x-5][y-3] = …;
        /* code removed */
        … = image_in[x+1][y];
      }
    }

After in-place mapping:

    for (y=0; y<=M+3; ++y) {
      for (x=0; x<N+5; ++x) {
        image[x-5][y-3] = …;
        /* code removed */
        … = image[x+1][y];
      }
    }
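A one-dimensional sketch of why input and output can share one array: the write position trails the read positions, so a result only overwrites input that is no longer needed. The three-point sum and the buffer length are hypothetical stand-ins for the cavity detection code.

```c
#include <assert.h>

#define LEN 16   /* hypothetical buffer length */

/* Reference: separate input and output buffers. */
void sum3_separate(const int in[LEN], int out[LEN - 2])
{
    for (int x = 1; x <= LEN - 2; x++)
        out[x - 1] = in[x - 1] + in[x] + in[x + 1];
}

/* In place: the result for position x is stored at x-1, a location
   that no later iteration reads again. */
void sum3_in_place(int image[LEN])
{
    for (int x = 1; x <= LEN - 2; x++)
        image[x - 1] = image[x - 1] + image[x] + image[x + 1];
}

/* Returns 1 when the in-place version yields the same LEN-2 results. */
int in_place_matches_separate(void)
{
    int in[LEN], out[LEN - 2], image[LEN];
    for (int x = 0; x < LEN; x++) {
        in[x] = 5 * x + 2;
        image[x] = in[x];
    }
    sum3_separate(in, out);
    sum3_in_place(image);
    for (int x = 0; x < LEN - 2; x++)
        if (out[x] != image[x])
            return 0;
    return 1;
}
```

The offset between read and write position plays the role of the x-5 and y-3 shifts in the slide code.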

The last step: ADOPT (Address OPTimization)

• Increased execution time introduced by DTSE
  – Complicated address arithmetic (modulo!)
  – Additional complex control flow

• Multimedia platform not adapted to address calculations

• Additional transformations needed to
  – Simplify control flow
  – Simplify address arithmetic: common sub-expression elimination, modulo expansion, …
  – Match remaining expressions on target machine
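As an illustration of simplifying modulo address arithmetic, the sketch below replaces x % 3 by a counter that wraps around. This is one common way to remove a modulo from a loop and is shown here as an assumption, not as the exact ADOPT transformation; the names are hypothetical.

```c
#include <assert.h>

#define BUF 3   /* size of the circular buffer, as in in_pixels[...%3] */

/* Returns 1 when a wrap-around counter reproduces x % BUF for all
   x in [0, n), so the modulo can be dropped from the loop body. */
int counter_matches_modulo(int n)
{
    int slot = 0;                 /* replaces x % BUF */
    assert(n >= 0);
    for (int x = 0; x < n; x++) {
        if (slot != x % BUF)
            return 0;
        if (++slot == BUF)        /* compare and reset instead of divide */
            slot = 0;
    }
    return 1;
}
```

On processors without a fast divider, the compare-and-reset is typically much cheaper than the modulo it replaces.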

ADOPT principles

[Figure: flow chart with the following boxes]

• Behavioral description

• Extract address expr. code, perform addr. expr. splitting

• Apply transformations:
  - Loop invariant code motion
  - Induction variable analysis
  - Algebraic transformations

• Optimized behavioral descr.

• Map to custom ACU

• Processor specific algebraic transformations

• Optimized behavioral descr. for target processor

• Compile to target processor

ADOPT principles

Example: Full-search Motion Estimation

    for (i=-8; i<=8; i++) {
      for (j=-4; j<=3; j++) {
        for (k=-4; k<=3; k++)
          A[((208+i)*257+8+j)*257+16+i+k] = B[(8+j)*257+16+i+k];
      }
      dist += A[3096] - B[((208+i)*257+4)*257+16+i-4];
    }

Algebraic transformations at word-level:

    for (i=-8; i<=8; i++) {
      for (j=-4; j<=3; j++) {
        for (k=-4; k<=3; k++) {
          Ad[cse5+cse1] = A[cse5+cse3];
        }
      }
      dist += A[3096] - Ad[cse1];
    }

with

    cse1 = (33025*i+6869616)*2;
    cse3 = 1040+i;
    cse4 = j*257+1032;
    cse5 = k+cse4;
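The common sub-expressions above can be checked against the original index expressions by enumerating the loop ranges. In this sketch the cse values are computed at the outermost loop level where they are valid, which is an assumption about where they would be hoisted; the function names are hypothetical.

```c
#include <assert.h>

/* Original index expressions from the motion estimation example. */
long index_write(long i, long j, long k)
{
    return ((208 + i) * 257 + 8 + j) * 257 + 16 + i + k;
}

long index_read(long i, long j, long k)
{
    return (8 + j) * 257 + 16 + i + k;
}

long index_dist(long i)
{
    return ((208 + i) * 257 + 4) * 257 + 16 + i - 4;
}

/* Returns 1 when the cse-based expressions equal the original ones
   for every iteration of the loop nest. */
int cse_matches_original(void)
{
    for (long i = -8; i <= 8; i++) {
        long cse1 = (33025 * i + 6869616) * 2;   /* depends on i only */
        long cse3 = 1040 + i;                    /* depends on i only */
        if (cse1 != index_dist(i))
            return 0;
        for (long j = -4; j <= 3; j++) {
            long cse4 = j * 257 + 1032;          /* depends on j only */
            for (long k = -4; k <= 3; k++) {
                long cse5 = k + cse4;
                if (cse5 + cse1 != index_write(i, j, k))
                    return 0;
                if (cse5 + cse3 != index_read(i, j, k))
                    return 0;
            }
        }
    }
    return 1;
}
```

After the rewrite the innermost loop needs two additions per address instead of several multiplications.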

DMM – results for cavity detection on ASIC

[Figure: bar chart (scale 0 to 600) of accesses, size and cycles after each step: Original, DF trafo, Loop trafo, Data reuse, In-place, Data layout, ADOPT - modulo, ADOPT - rest]

Cavity detection on Pentium-MMX

[Figure: bar chart of # accesses and execution time (scale 0 to 35) for the steps Conventional, Conv + Adopt, Data Flow Trf, Loop Trf, D reuse (line buffer), D reuse (pixel buffer), Inplace, Mem data layout, Modulo red (adopt), Other Adopt. Series: Main Memory Accesses, Local Memory Accesses, Execution Time (sec).]

The Y-chart revisited

[Figure: Y-chart with Applications, Architecture Instance, Mapping, Performance Analysis and Performance Numbers. Annotations: data transfer and data storage specific rewrites in the application code; data transfer and data storage specific platform customization.]

Fixing platform parameters

• Assume configurable on-chip memory hierarchy
  – Trade-off power versus cycle-budget

[Figure: plot of power [mW] (5 to 25) against storage cycle budget (50,000 to 150,000)]

Conclusion

• In multi-media applications, exploring data transfer and storage issues should be done at system level

• DTSE is a methodology for Data Transfer and Storage Exploration based on manual and/or tool-assisted code rewriting
  – Platform independent high-level transformations
  – Platform dependent transformations exploit platform characteristics (optimal use of cache, …)
  – Substantial reduction in power and memory size demonstrated on MPEG-4, OFDM, H.263, ADSL, ...