processor architectures and program mapping


Post on 19-Jan-2016


Page 1: Processor Architectures and Program Mapping

Processor Architectures and Program Mapping

5KK70 TU/e

Henk Corporaal

Bart Mesman

Data Memory Management

Part b: Loop transformations & Data Reuse

Page 2: Thanks to the IMEC DTSE experts

Erik Brockmeyer, IMEC, Leuven, Belgium

and also Martin Palkovic, Sven Verdoolaege, Tanja van Achteren, Sven Wuytack, Arnout Vandecappelle, Miguel Miranda, Cedric Ghez, Tycho van Meeuwen, Eddy Degreef, Michel Eyckmans, Francky Catthoor, et al.

Page 3: DM methodology

@HC 5KK70 Platform-based Design 3

Steps of the flow (input: C-in, output: C-out):

- Dataflow Transformations
- Analysis/Preprocessing
- Loop/control-flow transformations
- Data Reuse
- Storage Cycle Budget Distribution
- Memory Allocation and Assignment
- Memory Layout organisation
- Address optimization

Page 4: Locality of Reference

Separate loops (all of A is produced before any of it is consumed):

for (i=0; i < 8; i++) A[i] = …;
for (i=0; i < 8; i++) B[7-i] = f(A[i]);

Merged loop (each A[i] is consumed right after it is produced):

for (i=0; i < 8; i++) { A[i] = …; B[7-i] = f(A[i]); }

[Figures: location versus time for both versions, marking the production and consumption of each element]

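Page 5: Regularity

Version 1 (A is consumed in the reverse of its production order, so the lifetimes of the elements differ):

for (i=0; i < 8; i++) A[i] = …;
for (i=0; i < 8; i++) B[i] = f(A[7-i]);

Version 2 (A is consumed in its production order, so every element has the same lifetime):

for (i=0; i < 8; i++) A[i] = …;
for (i=0; i < 8; i++) B[7-i] = f(A[i]);

[Figures: location versus time for both versions, marking production and consumption]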

Page 6: Enabling Reuse

Separate loops (A[i] is consumed twice, far apart in time):

for (i=0; i < 8; i++) B[i] = f1(A[i]);
for (i=0; i < 8; i++) C[i] = f2(A[i]);

Merged loop (both consumptions of A[i] are adjacent, which enables reuse):

for (i=0; i < 8; i++) { B[i] = f1(A[i]); C[i] = f2(A[i]); }

[Figures: location versus time for both versions, marking the two consumptions of each element]

Page 7: How to do these loop transformations automatically?

Requires a cost function

Requires a technique

Let's introduce some terminology

- iteration spaces

- polytopes

- ordering vector, which determines the execution order

Page 8: Iteration space and polytopes

// assume A[][] exists
for (i=1; i<6; i++) {
  for (j=2; j<6; j++) {
    B[i][j] = g( A[i-1][j-2]);
  }
}

[Figure: grid with axes i and j (0 to 5) showing the iteration space, the consumption space, the production space and a dependency vector]

Page 9: Example with 3 polytopes

Algorithm having 3 loops:

A: for (i=1; i<=N; ++i)
     for (j=1; j<=N-i+1; ++j)
       a[i][j] = in[i][j] + a[i-1][j];

B: for (p=1; p<=N; ++p)
     b[p][1] = f( a[N-p+1][p], a[N-p][p] );

C: for (k=1; k<=N; ++k)
     for (l=1; l<=k; ++l)
       b[k][l+1] = g (b[k][l]);

[Figure: the polytopes of A (axes i, j), B (axis p) and C (axes k, l)]

Page 10: Common iteration space

for (i=1; i<=(2*N+1); ++i)
  for (j=1; j<=2*N; ++j) {
    if (i>=1 && i<=N && j>=1 && j<=N-i+1)
      a[i][j] = in[i][j] + a[i-1][j];
    if (i==N+1 && j>=1 && j<=N)
      b[j][1] = f( a[N-j+1][j], a[N-j][j] );
    if (i>=N+2 && i<=2*N+1 && j>=N+1 && j<=i-1)   /* j<=N+k with k = i-N-1 */
      b[i-N-1][j-N+1] = g (b[i-N-1][j-N]);
  }

[Figure: common iteration space with i from 1 to 2*N+1 and j from 1 to 2*N, with the ordering vector]

Initial solution having a common iteration space:

- Bad locality
- Bad regularity
- Requires 2N memory locations
- Many dummy iterations

Page 11: Cost function needed for automation

Regularity
- Equal direction for dependency vectors
- Dependency vectors should not cross each other
- Good for storage size

Temporal locality
- Equal length of all dependency vectors
- Good for storage size
- Good for data reuse

Page 12: Regularity

[Figures: a regular and an irregular set of dependency vectors]

Page 13: Bad regularity limits the ordering freedom

[Figure: the common iteration space (i from 1 to 2*N+1, j from 1 to 2*N) with its dependency vectors]

Ordering freedom = 90 degrees

Page 14: Locality estimates: a few options

P = production, C = consumption

Dependency vector length is a measure for locality. Q: Which length is the best estimate?

- Sum{di}
- Max{di}
- Spanning tree

[Figures: one production P with several consumptions C and the dependency vectors di, one drawing per option]

Page 15: Three step approach for loop transformation tool

1. Affine loop transformations
   - Rotation, skewing, interchange, reversal
   - Only geometric information is needed
2. Polytope placement
   - Translation
   - Only geometric information is needed
3. Choose ordering vector
   - Generate the code

Combined transformation: $\begin{pmatrix} x \\ y \end{pmatrix} = T \begin{pmatrix} i \\ j \end{pmatrix} + u$

Page 16: Three step approach for loop transformation tool

A: (i: 1..N):: (j: 1 .. N-i+1):: a[i][j] = in[i][j] + a[i-1][j];

B: (p: 1..N):: b[p][1] = f( a[N-p+1][p], a[N-p][p] );

C: (k: 1..N):: (l: 1..k):: b[N-k+1][l+1] = g( b[N-k+1][l] );

[Figure: the polytopes of A (axes i, j), B (axis p) and C (axes k, l)]

1. Affine loop transformations
2. Polytope placement
3. Choose ordering vector

Page 17: Three step approach for loop transformation tool

[Figure: the polytopes of statements A, B and C from page 16 (axes i, j, p, k, l), drawn twice: before and after transformation]

1. Affine loop transformations
2. Polytope placement
3. Choose ordering vector

Page 18: Three step approach for loop transformation tool

[Figure: the polytopes of statements A, B and C from page 16 (axes i, j, p, k, l), drawn separately and placed together in one iteration space]

1. Affine loop transformations
2. Polytope placement = merging loops
3. Choose ordering vector

Page 19: Choose optimal ordering vector

[Figures: the merged polytopes traversed with Ordering Vector 1 and with Ordering Vector 2]

Page 20: From the Polyhedral model back to C

for (j=1; j<=N; ++j) {
  for (i=1; i<=N-j+1; ++i)
    a[i][j] = in[i][j] + a[i-1][j];
  b[j][1] = f( a[N-j+1][j], a[N-j][j] );
  for (l=1; l<=j; ++l)
    b[j][l+1] = g( b[j][l] );
}

[Figure: the merged polytopes with axes i, l and j]

1. Affine loop transformations
2. Polytope placement
3. Choose ordering vector

Optimized solution having a common iteration space:

- Optimal locality
- Optimal regularity
- Requires 2 memory locations

Page 21: Loop trafo - cavity detection

[Figure: Scanner, GaussBlur x and GaussBlur y stages working on N x M images, with the X and Y scan directions]

X-Y loop interchange: from N x M to N x (2GB+1) buffer size

Page 22: Loop trafo-cavity (1)

1. Transform: interchange
2. Translate: merge
3. Order

[Figures: the polytopes at each step]

Page 23: Loop trafo-cavity (2)

x-blur filter:

1. Transform: interchange
2. Translate: merge
3. Choose order

[Figures: the polytopes at each step]

Page 24: Loop trafo - cavity detection

[Figure: Scanner, GaussBlur x and GaussBlur y stages working on N x M images, with the X and Y scan directions]

X-Y loop interchange: from N x M to N x (2GB+1) buffer size

Page 25: Loop trafo-cavity (3)

Comparing different translations (step 2):

- Translate 1
- Translate 2

[Figures: the polytope placement resulting from each translation, followed by step 3]

Page 26: Loop trafo-cavity (4)

Step 3: Order

Combining (merging) multiple polytopes

[Figure: two polytopes added together into one merged polytope]

Page 27: Result on gauss filter

for (y=0; y<M+GB; ++y) {
  for (x=0; x<N+GB; ++x) {
    if (x>=GB && x<=N-1-GB && y>=GB && y<=M-1-GB) {
      gauss_x_compute = 0;
      for (k=-GB; k<=GB; ++k)
        gauss_x_compute += image_in[x+k][y]*Gauss[abs(k)];
      gauss_x_image[x][y] = gauss_x_compute/tot;
    } else if (x<N && y<M)
      gauss_x_image[x][y] = 0;

    if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
      gauss_xy_compute = 0;
      for (k=-GB; k<=GB; ++k)
        gauss_xy_compute += gauss_x_image[x][y-GB+k]*Gauss[abs(k)];
      gauss_xy_image[x][y-GB] = gauss_xy_compute/tot;
    } else if (x<N && (y-GB)>=0 && (y-GB)<M)
      gauss_xy_image[x][y-GB] = 0;
  }
}

Page 28: Intermezzo

Before we continue with data reuse, have a look at other loop transformations: check the discussed slides!

Page 29: DM methodology

[Figure: the DM methodology flow of page 3, shown again before continuing with the Data Reuse step]

Page 30: Memory hierarchy and Data reuse

[Figure: memory hierarchy with Layer 1, Layer 2 and Layer 3 in front of the datapaths]

1. Determine reuse candidates
2. Combine reuse candidates into reuse chains
3. If there are multiple access statements per array, combine the chains into reuse trees
4. Determine number of layers (if architecture is not fixed)
5. Select candidates and assign to memory layers
6. Add extra transfers between the different memory layers (for scratchpad RAM; not for caches)

Page 31: TI C55@200MHz example platform

TMS320vc5510 @ 200 MHz, Vdd = 1.5 V, P = unknown

- Processor partition (register file + core): bandwidth 4.8 Gwords/s, size 2x16 registers
- On-chip dual-access RAM: 8 x (4K x 16, dual), total 64Kb, 2 elements in 1 cycle; BW 400M words/s, dual port
- On-chip single-access RAM: 32 x (4K x 16, single), total 256Kb, 1 element in 1 cycle
- ROM (data/program/DMA): 16K x 16; first 3 cycles, next 2 cycles; it seems this can be in parallel with the 256Kb memory; bandwidth 100M words/s
- Off-chip: max 8M x 16, SRAM/EPROM/SDRAM/SBSRAM; BW 50M words/s, single port

Partitions in the figure:

- ROM partition: size 32 kB
- Variable size RAM partition: size 320 kB
- Fixed size RAM partition: bandwidth 50M words/s, size 16 MB
- Processor partition: bandwidth 4.8 Gwords/s, size 2x16 registers

[Figure: block diagram of the platform with hierarchy layers L0, L1 and L2]

Page 32: Exploiting Memory Hierarchy for reduced Power: principle

[Figure: processor data paths and register file reading array A directly from main memory M]

Before: all accesses go to A in M (P = 1, #A = 100%).

P total (before) = 100%

Page 33: Exploiting Memory Hierarchy for reduced Power: principle

P total (before) = 100%

[Figure: one copy level A' (P = 0.3) between A in M (P = 1) and the register file; access percentages 100% and 5%]

[Figure: two copy levels A' (P = 0.1) and A'' (P = 0.01) between A in M (P = 1) and the register file; access percentages 100%, 10% and 1%]

P total (after) = 100% x 0.01 + 10% x 0.1 + 1% x 1 = 3%

Page 34: Data reuse decision and memory hierarchy: principle

[Figure: arrays A and B in main memory M, copies A' and A'' closer to the register file and processor data paths, and customized connections between the layers]

Customized connections in the memory subsystem to bypass the memory hierarchy and avoid the overhead.

Page 35: Step 1: identify arrays with data reuse potential

for (i=0; i<4; i++)
  for (j=0; j<3; j++)
    for (k=0; k<6; k++)
      … = A[i*4+k];

[Figure: array index versus time; copy1 to copy4 in time frames 1 to 4, showing intra-copy reuse within a copy and inter-copy reuse between consecutive copies]

Page 36: Importance of high level cost estimate

for (i=0; i<4; i++)
  for (j=0; j<3; j++)
    for (k=0; k<6; k++)
      … = A[i*4+k];

[Figure: array index versus time; copy1 to copy4 in time frames 1 to 4, each copy holding 6 elements (Mk)]

Array copies are stored in-place!

Page 37: Step 1: determine gains: intra-copy reuse factor

for (i=0; i<4; i++)
  for (j=0; j<3; j++)
    for (k=0; k<6; k++)
      … = A[i*4+k];

[Figure: array index versus time; copy1 to copy4 in time frames 1 to 4, each copy holding 6 elements (Mk) and read 3 times]

The j iterator is not present in the index expression, so there is intra-copy reuse: intra-copy reuse factor = 3.

Page 38: Step 1: determine gains: inter-copy reuse factor

for (i=0; i<n; i++)
  for (j=0; j<3; j++)
    for (k=0; k<6; k++)
      … = A[i*4+k];

for (i=0; i<4; i++)
  for (j=0; j<3; j++)
    for (k=0; k<6; k++)
      … = A[i*4+k];

[Figure: array index versus time; copy1 to copy4 in time frames 1 to 4, each copy holding 6 elements (Mk)]

The i iterator has a smaller weight (4) than the k range (6), so consecutive copies overlap in 2 of their 6 elements and there is inter-copy reuse: inter-copy reuse factor = 1/(1-1/3) = 3/2.

Page 39: Possibility for multi-level hierarchy

for (i=0; i<10; i++)
  for (j=0; j<2; j++)
    for (k=0; k<3; k++)
      for (l=0; l<3; l++)
        for (m=0; m<5; m++)
          … = A[i*15+k*5+m];

[Figure: array index versus time at two levels: copies of 15 elements (Mk) per time frame (time frame 1, time frame 2, ...), subdivided into copies of 5 elements (Mm) per sub time frame (tf 1.1, tf 1.2, ...)]

Page 40: Step 2: determine data reuse chains for each memory access

[Figure: candidate reuse chains for the read R1(A): A only; A -> A' (two alternatives); A -> A' -> A'']

- Many reuse possibilities
- Cost estimate needed
- Prune for promising ones

Page 41: Cost function needs both size and number of accesses to intermediate array

for (i=0; i<10; i++)
  for (j=0; j<2; j++)
    for (k=0; k<3; k++)
      for (l=0; l<3; l++)
        for (m=0; m<5; m++)
          … = A[i*15+k*5+m];

Estimate #misses from different levels for one iteration of i:

- R1(A), no copy: 2*3*3*5 = 90
- A' of 5 elements: 2*3*5 = 30
- A' of 15 elements: 3*5 = 15

Estimate size:

[Figure: #misses (0 to 100) plotted against #elements (0 to 20) for the copy candidates]

Page 42: Very simplistic power and area estimation for different data-reuse versions

[Figure: the four reuse chains for R1(A) (A only; A -> A', two alternatives; A -> A' -> A''), each level annotated with accesses, size and energy, next to a plot of energy estimate (0 to 150) versus area estimate (140 to 180)]

Points in the plot (area estimate, energy estimate): (150, 135), (155, 51), (165, 38), (170, 35)

Page 43: Step 3: determine data reuse trees for multiple accesses

Access R1(A), with reuse chain A -> A' -> A'':

for (i=0; i<10; i++)
  for (j=0; j<2; j++)
    for (k=0; k<3; k++)
      for (l=0; l<3; l++)
        for (m=0; m<5; m++)
          … = A[i*15+k*5+m];

Access R2(A), with reuse chain A -> A':

for (x=0; x<8; x++)
  for (y=0; y<5; y++)
    … = A[i*5+y];

Page 44: Step 3: determine data reuse trees for multiple accesses

[Figure: the reuse chains of R1(A) (A -> A' -> A'') and R2(A) (A -> A') combined into one reuse tree rooted at A]

Page 45: Step 4: Determine number of layers

[Figure: data reuse trees of A (A -> A' -> FG) and B (B -> B' -> FG) placed next to the hierarchy layers Layer 1, Layer 2 and Layer 3, followed by the foreground memory (FG) and the datapath]

Page 46: Step 5: Select and assign reuse candidates

[Figure: the data reuse tree A -> A' -> FG next to the hierarchy layers, and the five numbered hierarchy assignments (1 to 5) of A and A' to the layers above the foreground memory (FG); the last drawing shows all assignments together]

Page 47: Step 5: All freedom in array to memory hierarchy

[Figure: data reuse trees of A (A -> A' -> FG) and B (B -> B' -> FG) with all possible assignments to the hierarchy layers]

Page 48: Step 5: Prune reuse graph (platform independent)

[Figures: reuse graph over the hierarchy layers with full freedom, and the pruned graph]

Many solutions never make sense.

Page 49: Step 5: Prune reuse graph further (platform dependent)

[Figures: the pruned reuse graph over the hierarchy layers, and the final solution for a 4 layer platform with A, A', B, B' and the foreground memory (FG)]

Page 50: Assign all data reuse trees (multiple arrays) to memory hierarchy

[Figure: the reuse trees of A (R1(A): A -> A' -> A''; R2(A): A -> A') and B (R1(B): B -> B' -> B'' -> B''') before and after assignment to Layer 1, Layer 2 and Layer 3; after assignment the copy B'' is no longer present]

Page 51: Introducing 1D reuse buffer

Original (the coefficient array is named coef here, since c is also the column iterator):

int in[H][W+8], out[H][W];
const int coef[] = {1,0,1,2,2,1,0,1};
for (r=0; r < H; r++)
  for (c=0; c < W; c++)
    for (dc=0; dc < 8; dc++)
      out[r][c] += in[r][c+dc]*coef[dc];

With reuse buffer:

int in[H][W+8], out[H][W], buf[8];              /* intermediate level decl. */
const int coef[] = {1,0,1,2,2,1,0,1};
for (r=0; r < H; r++) {
  for (i=0; i<7; i++) buf[i]=in[r][i];          /* initial copy */
  for (c=0; c < W; c++) {
    buf[(c+7)%8] = in[r][c+7];                  /* additional copy */
    for (dc=0; dc < 8; dc++)
      out[r][c] += buf[(c+dc)%8]*coef[dc];      /* reread from buffer */
  }
}

Reuse Factor = 7

Page 52: Data Reuse on 1D horizontal convolution

How to make explicit copies?

- init buffer
- reuse data
- new data

Image NxM, traversed in row order

[Figure: the buffer sliding over one image row]

Page 53: Introducing line buffers for vertical filtering

- whole image: size [N][M]
- set of lines: [2GB+1] lines of [N] elements

Why keep the whole image in that case?

Page 54: Simplified “reuse script”

1. Identify arrays with sufficient reuse potential
2. Determine reuse chains and prune these (for every array read)
3. Determine reuse trees and prune these (for every array)
4. Determine reuse graph including bypasses and prune (for entire application)
5. Determine memory hierarchy layout assignment incorporating given background memory restrictions (layers) and real-time constraints
6. Introduce copies in code: init, update, use code
   - For scratchpad memories only
   - For caches we need a different approach

Page 55: Data re-use trees: cavity detector

[Figure: data re-use trees for the arrays image_in, gauss_x, gauss_xy/comp_edge and image_out, from the N*M arrays down to the CPU. Copy candidates have sizes N*1, N*3, 3*1, 1*3, 3*3 and 1*1; edges are annotated with access counts (N*M, N*M*3, N*M*8, 0); the legend distinguishes array reads from array writes]

Page 56: Memory hierarchy assignment: cavity detector

[Figure: the arrays image_in, gauss_x, gauss_xy, comp_edge and image_out with their copies (N*M, N*3, 3*1, 3*3, 1*1) assigned to the hierarchy layers L1, L2 and L3; edges are annotated with access counts (N*M, N*M*3, N*M*8, 0)]

Memories in the figure: 1MB SDRAM, 16KB Cache, 128 B RegFile

Page 57: Data reuse & memory hierarchy

[Figure: bar chart (scale 0 to 600) comparing accesses, size and cycles for four versions: Original, DF trafo, Loop trafo, Data reuse]