data-flow analysis in the memory management of real-time multimedia processing systems

38
Data-Flow Analysis Data-Flow Analysis in the Memory Management of in the Memory Management of Real-Time Multimedia Processing Real-Time Multimedia Processing Systems Systems Florin Balasa Florin Balasa University University of Illinois at Chicago of Illinois at Chicago

Upload: jaime-tyler

Post on 03-Jan-2016

34 views

Category:

Documents


1 download

DESCRIPTION

Data-Flow Analysis in the Memory Management of Real-Time Multimedia Processing Systems. Florin Balasa University of Illinois at Chicago. Introduction. Real-time multimedia processing systems. (video and image processing, real-time 3D rendering, - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Data-Flow Analysis Data-Flow Analysis in the Memory Management of Real-Time in the Memory Management of Real-Time

Multimedia Processing SystemsMultimedia Processing Systems

Florin BalasaFlorin Balasa

UniversityUniversity of Illinois at Chicagoof Illinois at Chicago

Page 2: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Introduction

Real-time multimedia processing systemsReal-time multimedia processing systems

(video and image processing, real-time 3D rendering,(video and image processing, real-time 3D rendering, audio and speech coding, medical imaging, etc.)audio and speech coding, medical imaging, etc.)

A large part of power dissipation is due toA large part of power dissipation is due to

data transfer and data storagedata transfer and data storage

Fetching operands from an off-chip memory for additionFetching operands from an off-chip memory for addition consumes 33 times more power than the computation consumes 33 times more power than the computation

[ Catthoor 98 ][ Catthoor 98 ]

Area cost often largely dominated by memoriesArea cost often largely dominated by memories

Page 3: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Introduction

In the early years of high-level synthesisIn the early years of high-level synthesis

memory management tasks tackled at memory management tasks tackled at scalar levelscalar level

Algebraic techniquesAlgebraic techniques -- similar to those used -- similar to those used

in modern compilers -- in modern compilers -- allow to handle allow to handle

memory management atmemory management at non-scalar levelnon-scalar level

Requirement: addressing the entire class of affine specificationsRequirement: addressing the entire class of affine specifications

multidimensional signals with (complex) affine indexesmultidimensional signals with (complex) affine indexes loop nests having as boundaries affine iterator functionsloop nests having as boundaries affine iterator functions conditions – relational and / or logical operators of affine fct.conditions – relational and / or logical operators of affine fct.

Page 4: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Outline

Memory size computation using dataMemory size computation using data

dependence analysisdependence analysis Hierarchical memory allocation Hierarchical memory allocation

based on data reuse analysisbased on data reuse analysis Data-flow driven data partitioningData-flow driven data partitioning

for on/off- chip memoriesfor on/off- chip memories ConclusionsConclusions

Page 5: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Computation of array reference size

for (i=0; i<=511; i++)for (i=0; i<=511; i++)

for (j=0; j<=511; j++)for (j=0; j<=511; j++)

for (k=0; k<=511; k++)for (k=0; k<=511; k++)

… … B [i+k] [j+k]B [i+k] [j+k] … …

How many memory locations are necessaryHow many memory locations are necessary to store the array references to store the array references A [2i+3j+1] [5i+j+2] [ 4i+6j+3] & B [i+k] [j+k]A [2i+3j+1] [5i+j+2] [ 4i+6j+3] & B [i+k] [j+k]

… … A [2i+3j+1] [5i+j+2] [4i+6j+3]A [2i+3j+1] [5i+j+2] [4i+6j+3] … …

Page 6: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Computation of array reference size

Number of iterator triplets (i,j,k), that is 512Number of iterator triplets (i,j,k), that is 51233 ?? ??

(i,j,k)=(0,1,1)(i,j,k)=(0,1,1)

(i,j,k)=(1,2,0)(i,j,k)=(1,2,0)

B [1] [2]B [1] [2]No !!No !!

for (i=0; i<=511; i++)for (i=0; i<=511; i++)

for (j=0; j<=511; j++)for (j=0; j<=511; j++)

for (k=0; k<=511; k++)for (k=0; k<=511; k++)

… … B [i+k] [j+k]B [i+k] [j+k] … …

……

Page 7: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Computation of array reference size

Number of index values (i+k,j+k), that is 1023Number of index values (i+k,j+k), that is 102322 ?? ??

B [0] [512]B [0] [512]No !!No !!

(since 0 <= i+k , j+k <= 1022)(since 0 <= i+k , j+k <= 1022)

any (i,j,k) any (i,j,k)

for (i=0; i<=511; i++)for (i=0; i<=511; i++)

for (j=0; j<=511; j++)for (j=0; j<=511; j++)

for (k=0; k<=511; k++)for (k=0; k<=511; k++)

… … B [i+k] [j+k]B [i+k] [j+k] … …

……

Page 8: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Computation of array reference size

for (i=0; i<=511; i++)for (i=0; i<=511; i++)

for (j=0; j<=511; j++)for (j=0; j<=511; j++)

… … A [2i+3j+1] [5i+j+2] [4i+6j+3]A [2i+3j+1] [5i+j+2] [4i+6j+3] … …

jj

ii

x=2i+3j+1x=2i+3j+1

y=5i+j+2y=5i+j+2

z=4i+6j+3z=4i+6j+3A[x][y][z]A[x][y][z]

Iterator spaceIterator space Index spaceIndex space

Page 9: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Computation of array reference size

… … A [2i+3j+1] [5i+j+2] [4i+6j+3]A [2i+3j+1] [5i+j+2] [4i+6j+3] … …

jj

A[x][y][z]A[x][y][z]

Iterator spaceIterator space Index spaceIndex space

ii

Page 10: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Computation of array reference size

Remark Remark The iterator space may have ``holes’’ tooThe iterator space may have ``holes’’ too

for (i=4; i<=8; i++)for (i=4; i<=8; i++)

for (j=i-2; j<=i+2; j+=2)for (j=i-2; j<=i+2; j+=2) … … C[i+j] …C[i+j] …

4 6 8 4 6 8

ii

jj

22

44

66

for (i=4; i<=8; i++)for (i=4; i<=8; i++)

for (j=0; j<=2; j++)for (j=0; j<=2; j++) … … C[2i+2j-2] …C[2i+2j-2] …

4 5 6 7 84 5 6 7 8

ii

jj

1122

00

normalizationnormalization

88

Page 11: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Computation of array reference size

for (i=0; i<=511; i++)for (i=0; i<=511; i++)

for (j=0; j<=511; j++)for (j=0; j<=511; j++)

… … A [2i+3j+1] [5i+j+2] [4i+6j+3]A [2i+3j+1] [5i+j+2] [4i+6j+3] … …

22 3355 1144 66

11

2233

xx

yyzz

ii

jj++==

Iterator spaceIterator spaceIndex spaceIndex spaceaffineaffine

mappingmapping

0 <= i , j <= 5110 <= i , j <= 511

Page 12: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Computation of array reference size

ii

jj

x=i+kx=i+k

y=j+ky=j+kB[x][y]B[x][y]

Iterator spaceIterator space Index spaceIndex space

kk

for (i=0; i<=511; i++)for (i=0; i<=511; i++)

for (j=0; j<=511; j++)for (j=0; j<=511; j++)

for (k=0; k<=511; k++)for (k=0; k<=511; k++)

… … B [i+k] [j+k]B [i+k] [j+k] … …

……

Page 13: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Computation of array reference size

for (i=0; i<=511; i++)for (i=0; i<=511; i++)

for (j=0; j<=511; j++)for (j=0; j<=511; j++)

for (k=0; k<=511; k++)for (k=0; k<=511; k++)

… … B [i+k] [j+k]B [i+k] [j+k] … …

……

11 0000 11

ii

jjkk

xxyy

00

00++==

Iterator spaceIterator spaceIndex spaceIndex spaceaffineaffine

mappingmapping

0 <= i , j , k <= 5110 <= i , j , k <= 511

1111

Page 14: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Computation of array reference size

Any array reference can be modeledAny array reference can be modeled as a linearly bounded lattice (LBL)as a linearly bounded lattice (LBL)

LBL = { LBL = { xx = = TT··ii + + uu | | AA··ii >= >= bb } }

Iterator spaceIterator space

- - scope of nested loops, andscope of nested loops, and

- iterator-dependent conditionsiterator-dependent conditions

Affine mappingAffine mapping

PolytopePolytopeLBLLBLaffineaffine

mappingmapping

Page 15: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Computation of array reference size

The size of the array referenceThe size of the array reference is the size of its index space – an LBL !!is the size of its index space – an LBL !!

LBL = { LBL = { xx = = TT··ii + + uu | | AA··ii >= >= bb } }

f : f : ZZn n ZZmm f(f(ii) = ) = TT··ii + + uu

Is function f a one-to-one mapping ??Is function f a one-to-one mapping ??

Size(index space) = Size(iterator space)Size(index space) = Size(iterator space)

If YESIf YES

Page 16: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Computation of array reference size

f : f : ZZn n ZZmm f(f(ii) = ) = TT··ii + + uu

HHGG

PP·T·S ·T·S ==0000

[Minoux 86][Minoux 86]

SS - unimodular matrix - unimodular matrix

PP - row permutation - row permutation

HH - nonsingular lower-triangular matrix - nonsingular lower-triangular matrix

When rank(When rank(HH)=m <= n , )=m <= n , HH is the Hermite Normal Form is the Hermite Normal Form

Page 17: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

1313

Computation of array reference size

rank(rank(HH)=n)=n function f is a one-to-one mappingfunction f is a one-to-one mapping

for (i=0; i<=511; i++)for (i=0; i<=511; i++)

for (j=0; j<=511; j++)for (j=0; j<=511; j++) … … A [2i+3j+1] [5i+j+2] [4i+6j+3]A [2i+3j+1] [5i+j+2] [4i+6j+3] … …

22 3355 1144 66

11

2233

xx

yyzz

ii

jj++==

PP··TT·S·S == II33

22 33

55 1144 66

-1-1

1133

-2-2==

11 00

-4-4

22 00- - - -- - - -

HH

GG

Nr. locations Nr. locations A[ ][ ][ ]A[ ][ ][ ] = size ( 0 <= i,j <= 511 ) = 512 x 512 = size ( 0 <= i,j <= 511 ) = 512 x 512

Case 1Case 1

Page 18: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Computation of array reference size

rank(rank(HH)<n)<nCase 2Case 2

for (i=0; i<=511; i++)for (i=0; i<=511; i++)

for (j=0; j<=511; j++) …for (j=0; j<=511; j++) …

for (k=0; k<=511; k++)for (k=0; k<=511; k++) … … B [i+k] [j+k]B [i+k] [j+k] … …

PP··TT·S·S == II2211 0000 11

11

0000

1111

00

1100

-1-1

-1-111

11 0000 11

0000

==

HH 00

{ 0 <= i , j , k <= 511 }{ 0 <= i , j , k <= 511 } { 0 <= I-K , J-K , K <= 511 }{ 0 <= I-K , J-K , K <= 511 }

| B[i+k][j+k] || B[i+k][j+k] | = size ( 0<=I,J<=1022 , I-511<=J<=I+511 ) = 784,897 = size ( 0<=I,J<=1022 , I-511<=J<=I+511 ) = 784,897

Page 19: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Computation of array reference size

Loop

bounds

Size

iterator space

Size index space

(# storage locations)

|index space| /

|iterator space|

77 512512 169169 33 %33 %

1515 4,0964,096 721721 17 %17 %

3131 32,76832,768 2,9772,977 9.0 %9.0 %

6363 262,144262,144 12,09712,097 4.6 %4.6 %

127127 2,097,1522,097,152 48,76948,769 2.3 %2.3 %

255255 16,777,21616,777,216 195,841195,841 1.1 %1.1 %

511511 134,217,728134,217,728 784,897784,897 0.5 %0.5 %

Array reference B [i+k] [j+k]Array reference B [i+k] [j+k]

Page 20: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Computation of array reference size

Computation of the size of an integer polytopeComputation of the size of an integer polytope

The Fourier-Motzkin eliminationThe Fourier-Motzkin elimination

aaikikxxk k >= b >= bkk

1. x1. xn n >= D>= Di i (x(x11,…,x,…,xn-1n-1))

3. 0 3. 0 <= F<= Fk k (x(x11,…,x,…,xn-1n-1))

DDi i (x(x11,…,x,…,xn-1n-1)) <= E<= Ej j (x(x11,…,x,…,xn-1n-1))

0 0 <= F<= Fk k (x(x11,…,x,…,xn-1n-1))

n-dim polytopen-dim polytope

(n-1)-dim polytope(n-1)-dim polytope

2. x2. xn n <= E<= Ejj (x(x11,…,x,…,xn-1n-1))

1-dim polytope1-dim polytope

Range of xRange of x11

for each value of xfor each value of x11

add size (n-1)-dim polytopeadd size (n-1)-dim polytope

Page 21: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Memory size computation

for ( j=0; j<n ; j++ )for ( j=0; j<n ; j++ ){ A [ j ] [ 0 ] = in0;{ A [ j ] [ 0 ] = in0; for ( i=0; i<n ; i++ )for ( i=0; i<n ; i++ ) A [ j ] [ i+1 ] = A [ j ] [ i ] + 1;A [ j ] [ i+1 ] = A [ j ] [ i ] + 1;}}

for ( i=0; i<n ; i++ )for ( i=0; i<n ; i++ ){ alpha [ i ] = A [ i ] [ n+i ] ;{ alpha [ i ] = A [ i ] [ n+i ] ; for ( j=0; j<n ; j++ )for ( j=0; j<n ; j++ ) A [ j ] [ n+i+1 ] = A [ j ] [ n+i+1 ] = j < i ? A [ j ] [ n+i ] :j < i ? A [ j ] [ n+i ] : alpha [ i ] + A [ j ] [ n+i ] ;alpha [ i ] + A [ j ] [ n+i ] ;}}for ( j=0; j<n ; j++ ) B [ j ] = A [ j ] [ 2*n ];for ( j=0; j<n ; j++ ) B [ j ] = A [ j ] [ 2*n ];

# define n 6# define n 6

Page 22: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Memory size computation

Decompose the LBL’s of the array refs.Decompose the LBL’s of the array refs.into non-overlapping pieces !!into non-overlapping pieces !!

LBLLBL11 LBLLBL22UU

LBLLBL

LBLLBL11 = { = { xx = = TT11··ii11 + + uu11 | | AA11··ii11 >= >= bb11 } }

LBLLBL22 = { = { xx = = TT22··ii22 + + uu22 | | AA22··ii22 >= >= bb22 } }

TT11··ii11 + + uu11 = = TT22··ii22 + + uu22

Diophantine system of eqs.Diophantine system of eqs.

{ { AA11··ii11 >= >= bb11 , , AA22··ii22 >= >= bb22 } }

New polytopeNew polytope

Page 23: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Memory size computation

Keeping minimal the set of inequalities in the LBL intersectionKeeping minimal the set of inequalities in the LBL intersection

for ( i=0; i<n ; i++ )for ( i=0; i<n ; i++ ) for ( j=0; j<n ; j++ )for ( j=0; j<n ; j++ ) A [ j ] [ n+i+1 ] = A [ j ] [ n+i+1 ] = j < i ? j < i ? A [ j ] [ n+i ]A [ j ] [ n+i ] : alpha [ i ] + A [ j ] [ n+i ] ; : alpha [ i ] + A [ j ] [ n+i ] ;

Iterator spaceIterator space { 0 <= i , j <= n-1 , j+1 <= i }{ 0 <= i , j <= n-1 , j+1 <= i }

ii

jj

11 n-1n-1

n-1n-1{ 0 <= j , i <= n-1 , j+1 <= i }{ 0 <= j , i <= n-1 , j+1 <= i }

(5 ineq.)(5 ineq.)

(3 ineq.)(3 ineq.)

Page 24: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Memory size computation

Polyhedron = { Polyhedron = { xx | | CC··xx = = dd , , AA··xx >= >= bb } }

Keeping minimal the set of inequalities in the LBL intersectionKeeping minimal the set of inequalities in the LBL intersection

The decomposition theorem of polyhedraThe decomposition theorem of polyhedra

Polyhedron = { Polyhedron = { xx | | xx = = VV·· + + LL·· + + RR·· } }

ii

[ Motzkin 1953 ] [ Motzkin 1953 ]

Page 25: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Memory size computation

LBL’s of signal A (illustrative example)LBL’s of signal A (illustrative example)

Page 26: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Polyhedral data-dependence graphsPolyhedral data-dependence graphs

GranularityGranularitylevel = 0level = 0

GranularityGranularitylevel = 1level = 1

Page 27: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Scalar-level data-dependence graphScalar-level data-dependence graph

GranularityGranularitylevel = 2level = 2

Page 28: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

PolyhedralPolyhedraldata-dependencedata-dependence

graphgraph

motion detectionmotion detectionalgorithmalgorithm

[Chan 93][Chan 93]

# scalars

# dependencies

Page 29: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Memory size computation

Memory size variation during the motion detection alg.Memory size variation during the motion detection alg.

Page 30: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Memory size computation

To handle high throughput applicationsTo handle high throughput applications

Extract the (largely hidden) parallelismExtract the (largely hidden) parallelism from the initially specified codefrom the initially specified code

Find the lowest degree of parallelismFind the lowest degree of parallelism to meet the throughput/hardware requirementsto meet the throughput/hardware requirements

Perform memory size computationPerform memory size computation for code with explicit parallelism instructionsfor code with explicit parallelism instructions

Page 31: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Hierarchical memory allocation

A large part of power dissipation A large part of power dissipation in data-dominated applications is due toin data-dominated applications is due to

data transfers and data storagedata transfers and data storage

Power cost reduction Power cost reduction memory hierarchymemory hierarchy

exploiting temporal locality exploiting temporal locality in the data accessesin the data accesses

Power dissipation Power dissipation = = f f ( ( memory size memory size , , access frequencyaccess frequency ) )

Page 32: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Hierarchical memory allocation

Power dissipation Power dissipation = = f f ( ( memory size memory size , , access freq.access freq. ) )

heavily used heavily used datadata

Layer ofLayer ofsmall memoriessmall memories

Layer ofLayer oflarge memorieslarge memories

Page 33: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Hierarchical memory allocation

HierarchicalHierarchicaldistributiondistribution

Non-hierarchicalNon-hierarchicaldistributiondistribution

trade-offstrade-offs

Lower power consumption by accessing from smaller memoriesLower power consumption by accessing from smaller memories

Higher power consumption due to additional transfersHigher power consumption due to additional transfers

Larger areaLarger area to store copies of datato store copies of data

additional area overheadadditional area overhead (addressing logic) (addressing logic)

Page 34: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Hierarchical memory allocation

Synthesis of multilevel memory architectureSynthesis of multilevel memory architectureoptimized for area and / or poweroptimized for area and / or power

subject to performance constraintssubject to performance constraints

1. Data reuse exploration1. Data reuse exploration

Which intermediate copies of data are necessaryWhich intermediate copies of data are necessary for accessing data in a power- and area- efficient way for accessing data in a power- and area- efficient way

2. Memory allocation & assignment2. Memory allocation & assignment

Distributed (hierarchical) memory architectureDistributed (hierarchical) memory architecture ( memory layers, memory size/ports/address-logic ,( memory layers, memory size/ports/address-logic , signal-to-memory & signal-to-port assignment )signal-to-memory & signal-to-port assignment )

Page 35: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Hierarchical memory allocation

Synthesis of multilevel memory architectureSynthesis of multilevel memory architectureoptimized for area and / or poweroptimized for area and / or power

subject to performance constraintssubject to performance constraints

1. Data reuse exploration1. Data reuse exploration

Array partitions to be considered as copy candidates: Array partitions to be considered as copy candidates:

the LBL’s from the recursive intersection of array refs. the LBL’s from the recursive intersection of array refs.

2. Memory allocation & assignment2. Memory allocation & assignment

CostCost = = · · PPread / write read / write ( N( N bits bits , N , N words words , f , f read / write read / write ) )

+ + · · Area Area ( N( N bits bits , N , N words words , N, Nports ports , technology, technology ) )

Page 36: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Partitioning for on/off- chip memories

CPUCPU

CacheCacheDRAMDRAM

off-chip off-chip

SRAMSRAM on-chipon-chip

1 cycle1 cycle

1 cycle1 cycle

10-2010-20

cyclescycles

MemoryMemoryaddressaddressspacespace

Optimal data mapping to the SRAM / DRAM Optimal data mapping to the SRAM / DRAM to maximize the performance of the applicationto maximize the performance of the application

Page 37: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Partitioning for on/off- chip memories

Total conflict factorTotal conflict factor

Total number of array accessesTotal number of array accessesexposed to cache conflictsexposed to cache conflicts

The importance of mappingThe importance of mappingto the on-chip SRAMto the on-chip SRAM

Using the polyhedral data-dependence graphUsing the polyhedral data-dependence graph

Precise info about the relative lifetimes of the different parts of arraysPrecise info about the relative lifetimes of the different parts of arrays

Page 38: Data-Flow Analysis  in the Memory Management of Real-Time Multimedia Processing Systems

Conclusions

Algebraic techniques are powerful non-scalar instrumentsAlgebraic techniques are powerful non-scalar instruments

in the memory management of multimedia signal processingin the memory management of multimedia signal processing

Data-dependence analysis at polyhedral levelData-dependence analysis at polyhedral level

useful in many memory management tasksuseful in many memory management tasks

memory size computation for behavioral specifications memory size computation for behavioral specifications

hierarchical memory allocationhierarchical memory allocation

data partitioning between on- and off- chip memories data partitioning between on- and off- chip memories