
Processor Architectures and Program Mapping

5KK70 TU/e

Henk Corporaal

Jef van Meerbergen

Bart Mesman

Data Memory Management

Part b: Loop transformations & Data Reuse

Thanks to the IMEC DTSE experts:

Erik Brockmeyer

IMEC, Leuven, Belgium

and also

Martin Palkovic, Sven Verdoolaege, Tanja van Achteren, Sven Wuytack, Arnout Vandecappelle, Miguel Miranda,

Cedric Ghez, Tycho van Meeuwen, Eddy Degreef, Michel Eyckmans, Francky Catthoor, e.a.

@HC 5KK70 Platform-based Design

DM methodology

Dataflow Transformations

Analysis/Preprocessing

Loop/control-flow transformations

Data Reuse

Storage Cycle Budget Distribution

Memory Allocation and Assignment

Memory Layout organisation

C-out

C-in

Address optimization


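Locality of Reference

for (i=0; i < 8; i++) A[i] = …;
for (i=0; i < 8; i++) B[7-i] = f(A[i]);

[Figure: location versus time for the two separate loops; all of A is produced before any element is consumed]

for (i=0; i < 8; i++) { A[i] = …; B[7-i] = f(A[i]); }

[Figure: location versus time for the merged loop; each A[i] is consumed right after its production]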
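The effect of merging on locality can be made concrete by counting how many elements of A are alive (produced but not yet consumed) at the same time. The sketch below is only an illustration under an assumed timing model, one statement per time step; the helper names `max_live`, `live_separate` and `live_merged` are hypothetical and not part of the slides.

```c
#include <assert.h>

#define LEN 8

/* Largest number of elements that are alive (produced but not yet
   consumed) at the same time step. */
static int max_live(const int prod[], const int cons[], int n, int t_end)
{
    int best = 0;
    assert(n > 0 && t_end > 0);
    for (int t = 0; t < t_end; t++) {
        int live = 0;
        for (int e = 0; e < n; e++)
            if (prod[e] <= t && t < cons[e])
                live++;
        if (live > best)
            best = live;
    }
    return best;
}

/* Two separate loops: A[i] is written at step i and read at step LEN + i. */
int live_separate(void)
{
    int prod[LEN], cons[LEN];
    for (int i = 0; i < LEN; i++) {
        prod[i] = i;
        cons[i] = LEN + i;
    }
    return max_live(prod, cons, LEN, 2 * LEN);
}

/* Merged loop: A[i] is written at step 2*i and read at step 2*i + 1. */
int live_merged(void)
{
    int prod[LEN], cons[LEN];
    for (int i = 0; i < LEN; i++) {
        prod[i] = 2 * i;
        cons[i] = 2 * i + 1;
    }
    return max_live(prod, cons, LEN, 2 * LEN);
}
```

Under this model the separate loops need room for all 8 elements of A, while the merged loop never has more than one element alive.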


Regularity

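for (i=0; i < 8; i++) A[i] = …;
for (i=0; i < 8; i++) B[i] = f(A[7-i]);

[Figure: location versus time; production and consumption of A run in opposite directions]

for (i=0; i < 8; i++) A[i] = …;
for (i=0; i < 8; i++) B[7-i] = f(A[i]);

[Figure: location versus time; production and consumption of A run in the same direction]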


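Enabling Reuse

for (i=0; i < 8; i++) B[i] = f1(A[i]);
for (i=0; i < 8; i++) C[i] = f2(A[i]);

[Figure: location versus time for the two separate loops; the two consumptions of each A[i] are far apart in time]

for (i=0; i < 8; i++) { B[i] = f1(A[i]); C[i] = f2(A[i]); }

[Figure: location versus time for the merged loop; the two consumptions of each A[i] are adjacent in time]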
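Once the two consumptions of A[i] sit in the same iteration, the value can be read from memory once and reused from a register. The following sketch counts the reads of A in both versions; `f1`, `f2`, `read_A` and the counter are hypothetical stand-ins chosen only for this illustration.

```c
#include <assert.h>

#define LEN 8

static int reads_of_A;                  /* counts accesses to array A */
static int A[LEN], B[LEN], C[LEN];

static int f1(int x) { return x + 1; }  /* hypothetical function */
static int f2(int x) { return 2 * x; }  /* hypothetical function */

/* Every read of A goes through this helper so that it can be counted. */
static int read_A(int i)
{
    assert(i >= 0 && i < LEN);
    reads_of_A++;
    return A[i];
}

/* Two separate loops: each A[i] is fetched twice. */
int reads_separate(void)
{
    reads_of_A = 0;
    for (int i = 0; i < LEN; i++) B[i] = f1(read_A(i));
    for (int i = 0; i < LEN; i++) C[i] = f2(read_A(i));
    return reads_of_A;
}

/* Merged loop: A[i] is fetched once and reused from a local variable. */
int reads_merged(void)
{
    reads_of_A = 0;
    for (int i = 0; i < LEN; i++) {
        int a = read_A(i);
        B[i] = f1(a);
        C[i] = f2(a);
    }
    return reads_of_A;
}
```

With 8 elements, the separate loops read A 16 times and the merged loop 8 times.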


How to do these loop transformations automatically?

Requires cost function

Requires technique

Let's introduce some terminology

- iteration spaces

- polytopes

- ordering vector / execution order


Iteration space and polytopes

// assume A[][] exists
for (i=1; i<6; i++) {
  for (j=2; j<6; j++) {
    B[i][j] = g( A[i-1][j-2]);
  }
}

[Figure: grid with j on the horizontal axis (0 to 5) and i on the vertical axis (0 to 5), marking the iteration space, the consumption space, the production space and the dependency vector]
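For this rectangular example the spaces can be found by simply scanning the loop nest and recording the extreme index values. The sketch below does that; the `box` type and the function names are hypothetical, and a bounding box is only an exact description because the spaces here are rectangles.

```c
#include <assert.h>
#include <limits.h>

/* Inclusive bounds of a rectangular 2D space. */
struct box { int i_min, i_max, j_min, j_max; };

static void grow(struct box *b, int i, int j)
{
    if (i < b->i_min) b->i_min = i;
    if (i > b->i_max) b->i_max = i;
    if (j < b->j_min) b->j_min = j;
    if (j > b->j_max) b->j_max = j;
}

/* Scan the loop nest of the example and collect both spaces. */
static void scan(struct box *iter, struct box *cons)
{
    const struct box empty = { INT_MAX, INT_MIN, INT_MAX, INT_MIN };
    assert(iter != 0 && cons != 0);
    *iter = empty;
    *cons = empty;
    for (int i = 1; i < 6; i++)
        for (int j = 2; j < 6; j++) {
            grow(iter, i, j);          /* B[i][j] is produced here      */
            grow(cons, i - 1, j - 2);  /* A[i-1][j-2] is consumed here  */
        }
}

struct box iteration_space(void)
{
    struct box it, co;
    scan(&it, &co);
    return it;
}

struct box consumption_space(void)
{
    struct box it, co;
    scan(&it, &co);
    return co;
}
```

The iteration space is i in 1..5, j in 2..5, and the consumption space is shifted by (1, 2) to i in 0..4, j in 0..3, which is the offset the dependency vector expresses.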


Example with 3 polytopes

Algorithm having 3 loops:

A: for (i=1; i<=N; ++i)
     for (j=1; j<=N-i+1; ++j)
       a[i][j] = in[i][j] + a[i-1][j];

B: for (p=1; p<=N; ++p)
     b[p][1] = f( a[N-p+1][p], a[N-p][p] );

C: for (k=1; k<=N; ++k)
     for (l=1; l<=k; ++l)
       b[k][l+1] = g (b[k][l]);

[Figure: polytopes A (axes i, j), B (axis p) and C (axes k, l)]


Common iteration space

for (i=1; i<=(2*N+1); ++i)
  for (j=1; j<=2*N; ++j) {
    if (i>=1 && i<=N && j>=1 && j<=N-i+1)
      a[i][j] = in[i][j] + a[i-1][j];
    if (i==N+1 && j>=1 && j<=N)
      b[j][1] = f( a[N-j+1][j], a[N-j][j] );
    if (i>=N+2 && i<=2*N+1 && j>=N+1 && j<=i-1)  /* l<=k with k=i-N-1, l=j-N */
      b[i-N-1][j-N+1] = g (b[i-N-1][j-N]);
  }

[Figure: common iteration space with j from 1 to 2*N and i from 1 to 2*N+1, with the ordering vector]

Initial solution having a common iteration space:

- Bad locality
- Bad regularity
- Requires 2N memory locations
- Many dummy iterations
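The "many dummy iterations" of the common iteration space can be quantified by counting how many points of the (2N+1) x 2N space actually execute a statement. The sketch below mirrors the three guards; the guard of C is written as j <= i-1, which assumes k = i-N-1 and l = j-N, and the function names are hypothetical.

```c
#include <assert.h>

/* Number of points in the common iteration space that execute one of the
   statements A, B or C. The three guards are mutually exclusive because
   they cover disjoint ranges of i. */
long useful_iterations(int n)
{
    long useful = 0;
    assert(n > 0);
    for (int i = 1; i <= 2 * n + 1; i++)
        for (int j = 1; j <= 2 * n; j++) {
            if (i >= 1 && i <= n && j >= 1 && j <= n - i + 1)
                useful++;                                   /* statement A */
            if (i == n + 1 && j >= 1 && j <= n)
                useful++;                                   /* statement B */
            if (i >= n + 2 && i <= 2 * n + 1 && j >= n + 1 && j <= i - 1)
                useful++;                                   /* statement C */
        }
    return useful;
}

/* Total number of points that the two enclosing loops visit. */
long total_iterations(int n)
{
    assert(n > 0);
    return (long)(2 * n + 1) * (2 * n);
}
```

For N = 4 only 24 of the 72 iterations do useful work; the remaining 48 are dummy iterations.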


Cost function needed for automation

Regularity
- Equal direction for dependency vectors
- Avoid that dependency vectors cross each other
- Good for storage size

Temporal locality
- Equal length of all dependency vectors
- Good for storage size
- Good for data reuse


Regularity

[Figure: a regular and an irregular pattern of dependency vectors]


Bad regularity limits the ordering freedom

[Figure: common iteration space with j from 1 to 2*N and i from 1 to 2*N+1]

Ordering freedom = 90 degrees


Locality estimates

P = production, C = consumption

[Figure: three examples of a production P followed by several consumptions C, connected by dependency vectors di]

Dependency vector length is a measure for locality.

Q: Which length is the best estimate?

- Sum{di}
- Max{di}
- Spanning tree
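Two of the candidate estimates, Sum{di} and Max{di}, are easy to compute once the production and consumption points are known. The sketch below assumes the Manhattan distance as the length of a dependency vector, which is a choice made only for this illustration; the `point` type, the function names and the example coordinates are hypothetical. The spanning tree estimate is not covered here.

```c
#include <assert.h>
#include <stdlib.h>

struct point { int i, j; };

/* Length of one dependency vector (assumed: Manhattan distance). */
static int dep_length(struct point p, struct point c)
{
    return abs(c.i - p.i) + abs(c.j - p.j);
}

/* Sum{di}: total length of all dependency vectors from production p. */
int sum_estimate(struct point p, const struct point c[], int n)
{
    int sum = 0;
    assert(n > 0);
    for (int k = 0; k < n; k++)
        sum += dep_length(p, c[k]);
    return sum;
}

/* Max{di}: length of the longest dependency vector from production p. */
int max_estimate(struct point p, const struct point c[], int n)
{
    int best = 0;
    assert(n > 0);
    for (int k = 0; k < n; k++)
        if (dep_length(p, c[k]) > best)
            best = dep_length(p, c[k]);
    return best;
}

/* Example: one production followed by four consumptions in a row. */
static const struct point demo_p = { 0, 0 };
static const struct point demo_c[4] = { {0, 1}, {0, 2}, {0, 3}, {0, 4} };

int demo_sum(void) { return sum_estimate(demo_p, demo_c, 4); }
int demo_max(void) { return max_estimate(demo_p, demo_c, 4); }
```

For consumptions in a row, the sum counts the shared part of the path several times, whereas the value only has to stay alive until the last consumption, which is what the maximum captures.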


Three step approach for loop transformation tool

1. Affine loop transformations
   - Only geometric information is available during placement
   - Rotation, skewing, interchange, reverse
2. Polytope placement
   - Only geometric information is available during placement
   - Translation
3. Choose ordering vector

[Figure: a polytope with axes (i, j) is transformed by T and translated by u into the common space with axes (x, y)]

Combined transformation: (x, y) = T (i, j) + u


A: (i: 1..N):: (j: 1 .. N-i+1):: a[i][j] = in[i][j] + a[i-1][j];

C: (k: 1..N):: (l: 1..k):: b[N-k+1][l+1] = g( b[N-k+1][l] );

B: (p: 1..N):: b[p][1] = f( a[N-p+1][p], a[N-p][p] );

[Figure: polytopes A (axes i, j), B (axis p) and C (axes k, l)]

- Affine loop transformations
- Polytope placement
- Choose ordering vector




[Figure: polytopes A (axes i, j), B (axis p) and C (axes k, l) are translated step by step into one common iteration space]

- Affine loop transformations
- Polytope placement = merging loops
- Choose ordering vector


Choose optimal ordering vector

Ordering Vector 1 Ordering Vector 2


From the Polyhedral model back to C

for (j=1; j<=N; ++j) {
  for (i=1; i<=N-j+1; ++i)
    a[i][j] = in[i][j] + a[i-1][j];
  b[j][1] = f( a[N-j+1][j], a[N-j][j] );
  for (l=1; l<=j; ++l)
    b[j][l+1] = g( b[j][l] );
}

[Figure: merged polytopes with axes i, l and j]

Optimized solution having a common iteration space:

- Optimal locality
- Optimal regularity
- Requires 2 memory locations
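A quick way to gain confidence in such a transformation is to run the original three loops and the merged loop nest on the same input and compare the results. The sketch below does this for a small N; the bodies of `f` and `g`, the input pattern and the array dimensions are assumptions made only for this check.

```c
#include <assert.h>
#include <string.h>

#define N 6
#define DIM (N + 2)   /* room for row 0 of a and column N+1 of b */

static int f(int x, int y) { return x - 2 * y; }  /* hypothetical */
static int g(int x) { return 3 * x + 1; }         /* hypothetical */

/* The algorithm as three separate loop nests A, B and C. */
static void original(int in[DIM][DIM], int a[DIM][DIM], int b[DIM][DIM])
{
    for (int i = 1; i <= N; ++i)
        for (int j = 1; j <= N - i + 1; ++j)
            a[i][j] = in[i][j] + a[i - 1][j];
    for (int p = 1; p <= N; ++p)
        b[p][1] = f(a[N - p + 1][p], a[N - p][p]);
    for (int k = 1; k <= N; ++k)
        for (int l = 1; l <= k; ++l)
            b[k][l + 1] = g(b[k][l]);
}

/* The merged loop nest generated from the polyhedral model. */
static void merged(int in[DIM][DIM], int a[DIM][DIM], int b[DIM][DIM])
{
    for (int j = 1; j <= N; ++j) {
        for (int i = 1; i <= N - j + 1; ++i)
            a[i][j] = in[i][j] + a[i - 1][j];
        b[j][1] = f(a[N - j + 1][j], a[N - j][j]);
        for (int l = 1; l <= j; ++l)
            b[j][l + 1] = g(b[j][l]);
    }
}

/* Returns 1 when both versions produce identical arrays a and b. */
int merged_matches_original(void)
{
    static int in[DIM][DIM], a1[DIM][DIM], b1[DIM][DIM];
    static int a2[DIM][DIM], b2[DIM][DIM];
    assert(N >= 1);
    memset(a1, 0, sizeof a1); memset(b1, 0, sizeof b1);
    memset(a2, 0, sizeof a2); memset(b2, 0, sizeof b2);
    for (int i = 0; i < DIM; ++i)
        for (int j = 0; j < DIM; ++j)
            in[i][j] = 7 * i + j;       /* arbitrary test pattern */
    original(in, a1, b1);
    merged(in, a2, b2);
    return memcmp(a1, a2, sizeof a1) == 0 && memcmp(b1, b2, sizeof b1) == 0;
}
```

This check still stores the full arrays; it validates only the loop order, not the reduced storage requirement, which needs an additional in-place mapping step.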


Loop trafo - cavity detection

[Figure: scanner image of N x M pixels processed by the GaussBlur x and GaussBlur y stages, with an N x M buffer between the stages]

X-Y loop interchange: from N x M to N x (2GB+1) buffer size


Loop trafo-cavity (1)

1. Transform: interchange
2. Translate: merge
3. Order

[Figure: the three steps applied to the x-blur filter]





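Comparing different translations

[Figure: step 2 with two alternative translations (Translate 1 and Translate 2), each followed by step 3 (Order)]

Combining (merging) multiple polytopes

[Figure: two polytopes are added together into one merged polytope]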


Result on gauss filter

for (y=0; y<M+GB; ++y) {
  for (x=0; x<N+GB; ++x) {
    if (x>=GB && x<=N-1-GB && y>=GB && y<=M-1-GB) {
      gauss_x_compute = 0;
      for (k=-GB; k<=GB; ++k)
        gauss_x_compute += image_in[x+k][y]*Gauss[abs(k)];
      gauss_x_image[x][y] = gauss_x_compute/tot;
    } else if (x<N && y<M)
      gauss_x_image[x][y] = 0;

    if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
      gauss_xy_compute = 0;
      for (k=-GB; k<=GB; ++k)
        gauss_xy_compute += gauss_x_image[x][y-GB+k]*Gauss[abs(k)];
      gauss_xy_image[x][y-GB] = gauss_xy_compute/tot;
    } else if (x<N && (y-GB)>=0 && (y-GB)<M)
      gauss_xy_image[x][y-GB] = 0;
  }
}


Intermezzo

Before we continue with data reuse, have a look at other loop transformations




Memory hierarchy and Data reuse

[Figure: memory hierarchy with Layer 1, Layer 2, Layer 3 and the datapaths]

1. Determine reuse candidates
2. Combine reuse candidates into reuse chains
3. If there are multiple access statements per array, combine the chains into reuse trees
4. Determine number of layers (if architecture is not fixed)
5. Select candidates and assign to memory layers
6. Add extra transfers between the different memory layers (for scratchpad RAM; not for caches)


TI C55@200MHz example platform

TMS320vc5510@200MHz, Vdd = 1.5 V, P = unknown

[Figure: memory organisation of the platform, with the following blocks]

- Register file + core
- 8x 4Kx16 dual: total 64Kb, 2 elem in 1 cycle
- 32x 4Kx16 sing: total 256Kb, 1 elem in 1 cycle
- 16Kx16 ROM (data/program/DMA): first 3 cycles, next 2 cycles. It seems this can be in parallel with the 256Kb memory
- Offchip, MAX: 8MBx16, SRAM/EPROM/SDRAM/SBSRAM

Partitions:

- Processor partition: bandwidth 4.8 Gwords/s, size 2x16 registers
- ROM partition: bandwidth 100M words/s, size 32kB
- Variable size RAM partition: bandwidth 400M words/s, size 320kB
- Fixed size RAM partition: bandwidth 50M words/s, size 16 MB

Layers: L0, L1 (BW: 400M words/s, dual port), L2 (BW: 50M words/s, single port)


Exploiting Memory Hierarchy for reduced Power: principle

[Figure: processor data paths with register files read array A from memory M; P = 1 per access, #A = 100%]

P total (before) = 100%

[Figure: one copy level; A (P = 1) receives 5% of the accesses and the copy A' (P = 0.3) receives 100%]

[Figure: two copy levels; A (P = 1) receives 1% of the accesses, A' (P = 0.1) receives 10% and A'' (P = 0.01) receives 100%]

P total (after) = 100% x 0.01 + 10% x 0.1 + 1% x 1 = 3%
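The total power of a hierarchy is the sum over all levels of the fraction of accesses that reach the level times the relative power per access of that level. The sketch below reproduces the three-level figure of 3%; the `level` type and the function names are hypothetical.

```c
#include <assert.h>

/* One memory level: fraction of the original accesses that reach it and
   the relative power of one access (largest memory = 1). */
struct level {
    double access_fraction;
    double power_per_access;
};

/* Relative total power: sum of fraction times power over all levels. */
double total_power(const struct level lv[], int n)
{
    double total = 0.0;
    assert(n > 0);
    for (int k = 0; k < n; k++)
        total += lv[k].access_fraction * lv[k].power_per_access;
    return total;
}

/* The three-level example: A'' takes 100% of the accesses at P = 0.01,
   A' takes 10% at P = 0.1 and A takes 1% at P = 1. */
double three_level_example(void)
{
    const struct level lv[3] = {
        { 1.00, 0.01 },   /* A''  */
        { 0.10, 0.10 },   /* A'   */
        { 0.01, 1.00 },   /* A    */
    };
    return total_power(lv, 3);
}
```

Each level contributes 1% in this example, so the gain depends as much on how quickly the access count drops from level to level as on how cheap the small memories are.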


Data reuse decision and memory hierarchy: principle

[Figure: arrays A and B in memory M, copies A' and A'' in intermediate layers, register files at the datapaths, and customized connections between the layers]

Customized connections in the memory subsystem bypass the memory hierarchy and avoid the overhead.


Step 1: identify arrays with data reuse potential

for (i=0; i<4; i++)
  for (j=0; j<3; j++)
    for (k=0; k<6; k++)
      … = A[i*4+k];

[Figure: array index versus time; copy1 to copy4 are accessed in time frames 1 to 4, showing intra-copy reuse and inter-copy reuse]


Importance of high level cost estimate




[Figure: array index versus time; each copy has size Mk = 6]

Array copies are stored in-place!


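Step 1: determine gains

Intra-copy reuse factor

[Figure: array index versus time; copy size Mk = 6, each copy is read 3 times]

intra-copy reuse factor = 3

The j iterator is not present in the index expression, so there is intra-copy reuse.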


Step 1: determine gains

Inter-copy reuse factor

for (i=0; i<n; i++)
  for (j=0; j<3; j++)
    for (k=0; k<6; k++)
      … = A[i*4+k];

[Figure: array index versus time; copy size Mk = 6, consecutive copies overlap]

inter-copy reuse factor = 1/(1-1/3) = 3/2

The i iterator has a smaller weight (4) than the k range (6), so there is inter-copy reuse.
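Both reuse factors can be checked by simulating the loop nest and counting transfers. The sketch below counts the reads served by the copy, the elements loaded when every copy is refilled completely, and the elements loaded when only elements not yet present are fetched. This transfer model and all names in the code are assumptions made for the illustration.

```c
#include <assert.h>

#define COPY 6   /* k range: number of elements in one copy        */
#define STEP 4   /* weight of i in the index expression i*4 + k    */
#define JREP 3   /* j range: how often each copy is read            */

struct counts {
    long reads;            /* reads served by the copy                  */
    long loads_full;       /* loads when each copy is refilled entirely */
    long loads_new_only;   /* loads when only new elements are fetched  */
};

/* Simulate n iterations of the i loop. */
struct counts simulate(int n)
{
    struct counts c = { 0, 0, 0 };
    int highest = -1;      /* highest index of A already in the copy */
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        /* The copy for this i covers A[i*STEP] .. A[i*STEP + COPY - 1]. */
        c.loads_full += COPY;
        for (int idx = i * STEP; idx < i * STEP + COPY; idx++)
            if (idx > highest) {
                c.loads_new_only++;
                highest = idx;
            }
        for (int j = 0; j < JREP; j++)
            for (int k = 0; k < COPY; k++)
                c.reads++;             /* ... = copy of A[i*STEP + k] */
    }
    return c;
}
```

For n = 1000 this gives 18000 reads, 6000 loads with full refills (factor 3, the intra-copy reuse) and 4002 loads with overlap exploited, so the ratio 6000/4002 approaches the inter-copy reuse factor 3/2.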


Possibility for multi-level hierarchy

for (i=0; i<10; i++)
  for (j=0; j<2; j++)
    for (k=0; k<3; k++)
      for (l=0; l<3; l++)
        for (m=0; m<5; m++)
          … = A[i*15+k*5+m];

[Figure: array index versus time at two levels: copies of size Mk = 15 in time frame 1, time frame 2, ..., and copies of size Mm = 5 in time frames tf 1.1 to tf 1.6, tf 2.1, ...]


Step 2: determine data reuse chains for each memory access

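[Figure: candidate reuse chains for the read R1(A): A -> A' -> R1(A) (two variants with a different A'), A -> A' -> A'' -> R1(A), and A -> R1(A) without copies]

- Many reuse possibilities
- Cost estimate needed
- Prune for promising ones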


Cost function needs both size and number of accesses to intermediate array


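[Figure: copy candidates of 15 elements (k range) and 5 elements (m range)]

Estimate #misses from different levels for one iteration of i:

- R1(A): 2*3*3*5 = 90
- A' (15 elements): 3*5 = 15
- A' (5 elements): 2*3*5 = 30

Estimate size

[Figure: #misses (0 to 100) versus #elements (0 to 20)]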
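The miss counts for one iteration of i can be reproduced with a small simulation of the access A[i*15+k*5+m]. The sketch assumes that a copy holds one aligned window of A and is reloaded as a whole whenever an access falls outside it; this replacement model and the function names are assumptions for the illustration.

```c
#include <assert.h>

/* Elements transferred into a copy of `size` elements during one
   iteration of i (here i = 0) of the five-deep loop nest. */
int transfers_per_i(int size)
{
    const int i = 0;
    int base = -1;         /* first index of the window held by the copy */
    int transfers = 0;
    assert(size > 0);
    for (int j = 0; j < 2; j++)
        for (int k = 0; k < 3; k++)
            for (int l = 0; l < 3; l++)
                for (int m = 0; m < 5; m++) {
                    int idx = i * 15 + k * 5 + m;
                    int window = idx / size * size;
                    if (window != base) {   /* miss: reload whole window */
                        base = window;
                        transfers += size;
                    }
                }
    return transfers;
}

/* Accesses issued by the read itself during one iteration of i. */
int accesses_per_i(void)
{
    return 2 * 3 * 3 * 5;
}
```

A copy of 15 elements is filled once per iteration of i (15 transfers), while a copy of 5 elements is refilled for every (j, k) pair (30 transfers), against 90 accesses by the read itself.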


Very simplistic power and area estimation for different data-reuse versions

[Figure: energy estimate (0 to 150) versus area estimate (140 to 180) for the candidate reuse chains of R1(A); each chain node is annotated with its accesses, size and energy]


Step 3: determine data reuse trees for multiple accesses

for (x=0; x<8; x++)
  for (y=0; y<5; y++)
    … = A[i*5+y];

[Figure: the reuse chain A -> A' -> R2(A) of this second read is combined with the chain A -> A' -> A'' -> R1(A) into one reuse tree rooted at A]


Assign all data reuse trees (multiple arrays) to memory hierarchy

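[Figure: the reuse trees of A (reads R1(A) and R2(A), copies A' and A'') and of B (read R1(B), copies B', B'' and B''') are mapped onto Layer 1, Layer 2 and Layer 3; in the assigned version the copy B'' is no longer present]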


Step 4: Determine number of layers

[Figure: data reuse trees A (A -> A' -> FG) and B (B -> B' -> FG) next to the hierarchy layers Layer 1, Layer 2, Layer 3 and the foreground memory (FG) at the datapath]


Step 5: Select and assign reuse candidates

[Figure: the data reuse tree A -> A' -> FG and five possible assignments (1 to 5) of A and A' to the hierarchy layers, followed by all assignments combined]


Step 5: All freedom in array to memory hierarchy

[Figure: data reuse trees A (A -> A' -> FG) and B (B -> B' -> FG) with all possible assignments to the hierarchy layers]


Step 5: Prune reuse graph (platform independent)

[Figure: hierarchy layers with full freedom versus the pruned version]

Quite a few solutions never make sense.


Step 5: Prune reuse graph further (platform dependent)

[Figure: the pruned hierarchy layers are reduced to the final solution for a 4 layer platform, holding A, B, B', A' and FG]


Introducing 1D reuse buffer

int in[H][W+8], out[H][W];
const int coef[] = {1,0,1,2,2,1,0,1};  /* renamed from c[]: c is also the column iterator */
for (r=0; r < H; r++)
  for (c=0; c < W; c++)
    for (dc=0; dc < 8; dc++)
      out[r][c] += in[r][c+dc]*coef[dc];

int in[H][W+8], out[H][W], buf[8];     /* intermediate level declaration */
const int coef[] = {1,0,1,2,2,1,0,1};
for (r=0; r < H; r++) {
  for (i=0; i<7; i++)
    buf[i] = in[r][i];                 /* initial copy */
  for (c=0; c < W; c++) {
    buf[(c+7)%8] = in[r][c+7];         /* additional copy */
    for (dc=0; dc < 8; dc++)
      out[r][c] += buf[(c+dc)%8]*coef[dc];  /* reread from buffer */
  }
}

Reuse Factor = 7
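The buffered version can be validated against the direct version by running both on the same input and counting the reads from `in`. The sketch below is a self-contained check under assumed image dimensions and an arbitrary input pattern; the coefficient array is named `coef` here to avoid a clash with the column iterator, and the function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define H 4    /* assumed number of rows    */
#define W 16   /* assumed number of columns */

static const int coef[8] = {1, 0, 1, 2, 2, 1, 0, 1};
static int in[H][W + 8];
static int out_direct[H][W], out_buffered[H][W];

static void fill_input(void)
{
    for (int r = 0; r < H; r++)
        for (int x = 0; x < W + 8; x++)
            in[r][x] = (r * 31 + x * 7) % 13;   /* arbitrary pattern */
}

/* Direct version; returns the number of reads from `in`. */
long run_direct(void)
{
    long reads = 0;
    fill_input();
    for (int r = 0; r < H; r++)
        for (int c = 0; c < W; c++) {
            out_direct[r][c] = 0;
            for (int dc = 0; dc < 8; dc++) {
                out_direct[r][c] += in[r][c + dc] * coef[dc];
                reads++;
            }
        }
    return reads;
}

/* Version with an 8-element circular buffer; returns reads from `in`. */
long run_buffered(void)
{
    long reads = 0;
    int buf[8];
    assert(sizeof coef / sizeof coef[0] == 8);
    fill_input();
    for (int r = 0; r < H; r++) {
        for (int i = 0; i < 7; i++) {           /* initial copy */
            buf[i] = in[r][i];
            reads++;
        }
        for (int c = 0; c < W; c++) {
            buf[(c + 7) % 8] = in[r][c + 7];    /* additional copy */
            reads++;
            out_buffered[r][c] = 0;
            for (int dc = 0; dc < 8; dc++)      /* reread from buffer */
                out_buffered[r][c] += buf[(c + dc) % 8] * coef[dc];
        }
    }
    return reads;
}

/* Returns 1 when both versions compute the same output image. */
int buffered_matches_direct(void)
{
    run_direct();
    run_buffered();
    return memcmp(out_direct, out_buffered, sizeof out_direct) == 0;
}
```

With these dimensions the direct version reads `in` 512 times and the buffered version 92 times, because each row needs only 7 initial reads plus one new element per column.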


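Data Reuse on 1D horizontal convolution

How to make explicit copies?

[Figure: image NxM, traversed in row order; the buffer is first initialised (init buffer), after which each step reuses part of the data (reuse data) and adds new data (new data)]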


Introducing line buffers for vertical filtering

[Figure: whole image of size [N][M] versus a set of lines of size [2GB+1][N]]

Why keep the whole image in that case?


Simplified “reuse script”

1. Identify arrays with sufficient reuse potential
2. Determine reuse chains and prune these (for every array read)
3. Determine reuse trees and prune these (for every array)
4. Determine reuse graph including bypasses and prune (for entire application)
5. Determine memory hierarchy layout assignment incorporating given background memory restrictions (layers) and real-time constraints
6. Introduce copies in code: init, update, use code
   - For scratchpad memories only
   - For caches we need a different approach


Data re-use trees: cavity detector

[Figure: data reuse trees for the arrays image_in, gauss_x and gauss_xy/comp_edge (array reads) and image_out (array write); copy candidates range from N*M down to N*3, N*1, 3*3, 1*3, 3*1 and 1*1 elements, and the edges towards the CPU carry access counts between N*M and N*M*8]


Memory hierarchy assignment: cavity detector

[Figure: assignment of image_in, gauss_x, gauss_xy, comp_edge and image_out and their copies (N*M, N*3, 3*3, 3*1, 1*1) to three layers: L3 (1MB SDRAM), L2 (16KB cache) and L1 (128 B register file), with access counts between N*M and N*M*8 on the edges]


Data reuse & memory hierarchy

[Figure: bar chart (scale 0 to 600) of accesses, size and cycles for the versions Original, DF trafo, Loop trafo and Data reuse]
