Data-Flow Analysis in the Memory Management of Real-Time Multimedia Processing Systems

Florin Balasa
University of Illinois at Chicago
Introduction
Real-time multimedia processing systems
(video and image processing, real-time 3D rendering, audio and speech coding, medical imaging, etc.)

A large part of power dissipation is due to data transfer and data storage.

Fetching operands from an off-chip memory for addition consumes 33 times more power than the computation itself [Catthoor 98].

Area cost is often largely dominated by memories.
Introduction
In the early years of high-level synthesis, memory management tasks were tackled at the scalar level.

Algebraic techniques, similar to those used in modern compilers, allow memory management to be handled at the non-scalar level.

Requirement: addressing the entire class of affine specifications
- multidimensional signals with (complex) affine indexes
- loop nests whose boundaries are affine functions of the iterators
- conditions: relational and/or logical operators applied to affine functions
Outline
Memory size computation using data-dependence analysis
Hierarchical memory allocation based on data reuse analysis
Data-flow driven data partitioning for on/off-chip memories
Conclusions
Computation of array reference size
for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    … A [2i+3j+1] [5i+j+2] [4i+6j+3] …

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B [i+k] [j+k] …

How many memory locations are necessary to store the array references A [2i+3j+1] [5i+j+2] [4i+6j+3] and B [i+k] [j+k] ?
Computation of array reference size
Is it the number of iterator triplets (i,j,k), that is, 512^3 ?  No !
Both (i,j,k) = (0,1,1) and (i,j,k) = (1,2,0) access the same location B[1][2]:

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B [i+k] [j+k] …
Computation of array reference size
Is it the number of index values (i+k, j+k), that is, 1023^2 (since 0 <= i+k, j+k <= 1022) ?  No !
B[0][512] is accessed by no (i,j,k):

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B [i+k] [j+k] …
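Both wrong guesses can be checked by brute force on a scaled-down version of the loop nest; using a bound of 7 instead of 511 is my choice for tractability, everything else follows the slide:

```python
# Scaled-down stand-in for the 512x512x512 loop nest around B[i+k][j+k].
bound = 7
iterations = (bound + 1) ** 3       # number of iterator triplets (i,j,k)
box = (2 * bound + 1) ** 2          # index pairs allowed by 0 <= i+k, j+k <= 2*bound

# The set of index pairs actually touched:
touched = {(i + k, j + k)
           for i in range(bound + 1)
           for j in range(bound + 1)
           for k in range(bound + 1)}

print(iterations, box, len(touched))   # 512 225 169
```

The true count (169) is smaller than both the iteration count (some triplets collide) and the bounding box of index values (some pairs are unreachable).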
Computation of array reference size
for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    … A [2i+3j+1] [5i+j+2] [4i+6j+3] …

The iterator space (i,j) is mapped to the index space of A[x][y][z] by
x = 2i+3j+1 ,  y = 5i+j+2 ,  z = 4i+6j+3
Computation of array reference size
Remark: the iterator space may have ``holes'' too.

for (i=4; i<=8; i++)
  for (j=i-2; j<=i+2; j+=2)
    … C[i+j] …

After loop normalization (substituting j = i-2+2j', with 0 <= j' <= 2):

for (i=4; i<=8; i++)
  for (j=0; j<=2; j++)
    … C[2i+2j-2] …
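The normalization step can be sanity-checked by comparing the sets of C elements touched by the two loop nests (a quick check of mine; the loops are exactly those on the slide):

```python
# Original loop: for i in 4..8, j runs over i-2, i, i+2, accessing C[i+j].
original = {i + j for i in range(4, 9) for j in range(i - 2, i + 3, 2)}

# Normalized loop: for i in 4..8, j in 0..2, accessing C[2i+2j-2].
normalized = {2 * i + 2 * j - 2 for i in range(4, 9) for j in range(3)}

print(sorted(original))        # [6, 8, 10, 12, 14, 16, 18]
print(original == normalized)  # True
```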
Computation of array reference size
for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    … A [2i+3j+1] [5i+j+2] [4i+6j+3] …

Affine mapping from iterator space to index space:

(x, y, z)^T = T·(i, j)^T + u ,  T = [ 2 3 ; 5 1 ; 4 6 ] ,  u = (1, 2, 3)^T

where 0 <= i, j <= 511.
Computation of array reference size
for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B [i+k] [j+k] …

The iterator space (i,j,k) is mapped to the index space of B[x][y] by
x = i+k ,  y = j+k
Computation of array reference size
for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B [i+k] [j+k] …

Affine mapping from iterator space to index space:

(x, y)^T = T·(i, j, k)^T + u ,  T = [ 1 0 1 ; 0 1 1 ] ,  u = (0, 0)^T

where 0 <= i, j, k <= 511.
Computation of array reference size
Any array reference can be modeled as a linearly bounded lattice (LBL):

LBL = { x = T·i + u | A·i >= b }

Iterator space: a polytope defined by
- the scope of the nested loops, and
- the iterator-dependent conditions

Polytope -> (affine mapping x = T·i + u) -> LBL
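As a concrete (if naive) reading of the definition, the points of a small LBL can be enumerated by scanning the iterator polytope and applying the mapping. The helper below is a sketch of mine, not part of the slides; it is applied to the normalized C[2i+2j-2] example from earlier:

```python
from itertools import product

def lbl_points(T, u, A, b, box):
    """Enumerate LBL = { x = T·i + u | A·i >= b } by scanning a bounding
    box of the iterator space (only sensible for small examples)."""
    points = set()
    for i in product(*(range(lo, hi + 1) for lo, hi in box)):
        # keep the iterator point if it satisfies A·i >= b ...
        if all(sum(row[k] * i[k] for k in range(len(i))) >= rhs
               for row, rhs in zip(A, b)):
            # ... and map it through x = T·i + u
            points.add(tuple(sum(row[k] * i[k] for k in range(len(i))) + off
                             for row, off in zip(T, u)))
    return points

# C[2i+2j-2] with 4 <= i <= 8, 0 <= j <= 2:
pts = lbl_points(T=[[2, 2]], u=[-2],
                 A=[[1, 0], [-1, 0], [0, 1], [0, -1]], b=[4, -8, 0, -2],
                 box=[(4, 8), (0, 2)])
print(sorted(x for (x,) in pts))   # [6, 8, 10, 12, 14, 16, 18]
```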
Computation of array reference size
The size of the array reference is the size of its index space, which is an LBL!

LBL = { x = T·i + u | A·i >= b } ,  f : Z^n -> Z^m ,  f(i) = T·i + u

Is the function f a one-to-one mapping?

If YES: size(index space) = size(iterator space)
Computation of array reference size
f : Z^n -> Z^m ,  f(i) = T·i + u

P·T·S = [ H 0 ; G 0 ]        [Minoux 86]

S - unimodular matrix
P - row permutation
H - nonsingular lower-triangular matrix

When rank(H) = m <= n, H is the Hermite Normal Form.
Computation of array reference size
Case 1: rank(H) = n  =>  the function f is a one-to-one mapping

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    … A [2i+3j+1] [5i+j+2] [4i+6j+3] …

T = [ 2 3 ; 5 1 ; 4 6 ] ,  P = I_3 ,  S = [ -1 3 ; 1 -2 ]  (unimodular)

P·T·S = [ 1 0 ; -4 13 ; 2 0 ] = [ H ; G ] ,  H = [ 1 0 ; -4 13 ] ,  G = [ 2 0 ]

rank(H) = 2 = n, so f is one-to-one:

Nr. locations A[ ][ ][ ] = size( 0 <= i,j <= 511 ) = 512 x 512
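The Case 1 decomposition can be verified numerically. T comes from A[2i+3j+1][5i+j+2][4i+6j+3]; the particular unimodular S used below is my reconstruction (any unimodular S yielding a nonsingular lower-triangular top block works):

```python
T = [[2, 3], [5, 1], [4, 6]]   # mapping of A[2i+3j+1][5i+j+2][4i+6j+3]
S = [[-1, 3], [1, -2]]         # candidate unimodular matrix (reconstruction)

det_S = S[0][0] * S[1][1] - S[0][1] * S[1][0]
TS = [[sum(T[r][k] * S[k][c] for k in range(2)) for c in range(2)]
      for r in range(3)]

print(det_S)   # -1  (so S is unimodular)
print(TS)      # [[1, 0], [-4, 13], [2, 0]]: the top block H = [[1,0],[-4,13]]
               # is nonsingular lower-triangular, hence rank(H) = 2 = n
```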
Computation of array reference size
Case 2: rank(H) < n

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B [i+k] [j+k] …

T = [ 1 0 1 ; 0 1 1 ] ,  P = I_2 ,  S = [ 1 0 -1 ; 0 1 -1 ; 0 0 1 ]  (unimodular)

P·T·S = [ 1 0 0 ; 0 1 0 ] = [ H 0 ] ,  H = I_2

In the new coordinates I = i+k, J = j+k, K = k:
{ 0 <= i , j , k <= 511 }  becomes  { 0 <= I-K , J-K , K <= 511 }

| B[i+k][j+k] | = size( 0 <= I,J <= 1022 , I-511 <= J <= I+511 ) = 784,897
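Case 2 can be checked the same way. The unimodular S below is my reconstruction; it realizes the change of coordinates I = i+k, J = j+k, K = k:

```python
T = [[1, 0, 1], [0, 1, 1]]                 # mapping of B[i+k][j+k]
S = [[1, 0, -1], [0, 1, -1], [0, 0, 1]]    # candidate unimodular matrix

TS = [[sum(T[r][k] * S[k][c] for k in range(3)) for c in range(3)]
      for r in range(2)]
print(TS)   # [[1, 0, 0], [0, 1, 0]] == [H 0] with H = I_2, rank(H) = 2 < n = 3
```

Since the third column of T·S is zero, the index depends only on (I, J): the mapping is not one-to-one, and the index-space size must be counted separately.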
Computation of array reference size
Array reference B [i+k] [j+k]

Loop bounds | Size iterator space | Size index space (# storage locations) | |index space| / |iterator space|
        7   |             512     |      169                               |   33 %
       15   |           4,096     |      721                               |   17 %
       31   |          32,768     |    2,977                               |  9.0 %
       63   |         262,144     |   12,097                               |  4.6 %
      127   |       2,097,152     |   48,769                               |  2.3 %
      255   |      16,777,216     |  195,841                               |  1.1 %
      511   |     134,217,728     |  784,897                               |  0.5 %
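The storage-location column of the table above follows from one inclusion-exclusion count: for loop bound b, the index points (I,J) = (i+k, j+k) fill the box 0 <= I,J <= 2b minus the two corner triangles where |I-J| > b. The closed form is my derivation, checked against the slide's numbers:

```python
def b_locations(b):
    """Distinct (i+k, j+k) with 0 <= i, j, k <= b:
    (2b+1)^2 box points minus 2 * (1 + 2 + ... + b) corner points."""
    return (2 * b + 1) ** 2 - b * (b + 1)

for b in (7, 15, 31, 63, 127, 255, 511):
    iters = (b + 1) ** 3
    locs = b_locations(b)   # reproduces the storage-location column
    print(f"{b:4d} {iters:12,d} {locs:8,d} {100 * locs / iters:6.2f} %")
```

The location counts match the table exactly; the slide's percentages appear truncated rather than rounded (e.g. 784,897 / 134,217,728 is 0.58 %).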
Computation of array reference size
Computation of the size of an integer polytope: the Fourier-Motzkin elimination

Each constraint Σ_k a_ik·x_k >= b_i is rewritten, with respect to x_n, as one of:
1. x_n >= D_i (x_1, …, x_{n-1})
2. x_n <= E_j (x_1, …, x_{n-1})
3. 0 <= F_k (x_1, …, x_{n-1})

n-dim polytope  ->  (n-1)-dim polytope:
{ D_i (x_1, …, x_{n-1}) <= E_j (x_1, …, x_{n-1}) ,  0 <= F_k (x_1, …, x_{n-1}) }

Eliminating variables down to a 1-dim polytope yields the range of x_1; then, for each value of x_1, add the size of the corresponding (n-1)-dim polytope.
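The projection-and-summation scheme above can be sketched compactly. This is my own minimal implementation (exact but exponential; it assumes a bounded, low-dimensional polytope, whereas production tools use far smarter counting):

```python
import math
from fractions import Fraction

def eliminate(cons, j, n):
    """One Fourier-Motzkin step: project out x_j from constraints a·x >= b
    (each constraint is a pair (a, b) with a a length-n tuple)."""
    lower = [(a, b) for a, b in cons if a[j] > 0]   # give x_j >= D(...)
    upper = [(a, b) for a, b in cons if a[j] < 0]   # give x_j <= E(...)
    new = [(a, b) for a, b in cons if a[j] == 0]    # already 0 <= F(...)
    for al, bl in lower:
        for au, bu in upper:
            # positive combination cancelling x_j, i.e. D(...) <= E(...)
            a = tuple(al[j] * au[k] - au[j] * al[k] for k in range(n))
            new.append((a, al[j] * bu - au[j] * bl))
    return new

def range_of_first(cons, n):
    """Integer range of x_0 after eliminating x_1 .. x_{n-1}."""
    for j in range(1, n):
        cons = eliminate(cons, j, n)
    lo = max(math.ceil(Fraction(b, a[0])) for a, b in cons if a[0] > 0)
    hi = min(math.floor(Fraction(b, a[0])) for a, b in cons if a[0] < 0)
    return lo, hi

def size(cons, n):
    """Number of integer points of {x in Z^n | a·x >= b}:
    fix x_0 value by value and add the sizes of the slices."""
    live = []
    for a, b in cons:
        if any(a):
            live.append((a, b))
        elif b > 0:                 # constraint 0 >= b with b > 0: empty
            return 0
    if n == 0:
        return 1
    lo, hi = range_of_first(live, n)
    return sum(size([(a[1:], b - a[0] * v) for a, b in live], n - 1)
               for v in range(lo, hi + 1))

# The triangle { 0 <= j , j+1 <= i , i <= 5 }, variables ordered (i, j):
tri = [((0, 1), 0), ((1, -1), 1), ((-1, 0), -5)]
print(size(tri, 2))   # 15  (= 1+2+3+4+5)
```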
Memory size computation
#define n 6

for (j=0; j<n; j++) {
  A[j][0] = in0;
  for (i=0; i<n; i++)
    A[j][i+1] = A[j][i] + 1;
}
for (i=0; i<n; i++) {
  alpha[i] = A[i][n+i];
  for (j=0; j<n; j++)
    A[j][n+i+1] = j < i ? A[j][n+i] : alpha[i] + A[j][n+i];
}
for (j=0; j<n; j++)
  B[j] = A[j][2*n];
Memory size computation
Decompose the LBLs of the array references into non-overlapping pieces!

Intersecting two LBLs

LBL_1 = { x = T_1·i_1 + u_1 | A_1·i_1 >= b_1 }
LBL_2 = { x = T_2·i_2 + u_2 | A_2·i_2 >= b_2 }

amounts to solving the Diophantine system of equations

T_1·i_1 + u_1 = T_2·i_2 + u_2

whose solutions, restricted by { A_1·i_1 >= b_1 , A_2·i_2 >= b_2 }, form a new polytope.
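A one-dimensional illustration of the Diophantine step (the two LBLs below are hypothetical, chosen for readability): intersecting {x = 3i+1 | 0 <= i <= 10} with {x = 5j+2 | 0 <= j <= 10} means solving 3i - 5j = 1, then restricting the parametric solution by both iterator polytopes:

```python
def ext_gcd(a, b):
    """Extended Euclid: returns (g, x, y) with a*x + b*y = g = gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = ext_gcd(b, a % b)
    return g, y, x - (a // b) * y

# Solve 3*i - 5*j = 1 (i.e. 3*i + 5*(-j) = 1):
g, x, y = ext_gcd(3, 5)    # 3*2 + 5*(-1) = 1, so g, x, y = 1, 2, -1
i0, j0 = x, -y             # particular solution: i0 = 2, j0 = 1

# General solution: i = i0 + 5*t, j = j0 + 3*t. The new polytope is the
# t-range where both 0 <= i <= 10 and 0 <= j <= 10 hold.
common = sorted(3 * (i0 + 5 * t) + 1
                for t in range(-5, 6)
                if 0 <= i0 + 5 * t <= 10 and 0 <= j0 + 3 * t <= 10)
print(common)   # [7, 22]
```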
Memory size computation
Keeping minimal the set of inequalities in the LBL intersection

for (i=0; i<n; i++)
  for (j=0; j<n; j++)
    A[j][n+i+1] = j < i ? A[j][n+i] : alpha[i] + A[j][n+i];

Iterator space of the first branch (j < i):
{ 0 <= i , j <= n-1 , j+1 <= i }   (5 ineq.)
reduces to
{ 0 <= j , i <= n-1 , j+1 <= i }   (3 ineq.)
Memory size computation
Keeping minimal the set of inequalities in the LBL intersection

The decomposition theorem of polyhedra [Motzkin 1953]:

Polyhedron = { x | C·x = d , A·x >= b }
           = { x | x = V·λ + L·μ + R·ρ }

with V the matrix of vertices (λ_i >= 0, Σ_i λ_i = 1), L the lines, and R the extreme rays (ρ_i >= 0).
Memory size computation
Polyhedral data-dependence graphs

LBL's of signal A (illustrative example), at several granularity levels:
- granularity level = 0: scalar-level data-dependence graph
- granularity level = 1 and level = 2: polyhedral data-dependence graphs

(Figure: number of scalars and number of dependencies of the data-dependence graph of a motion detection algorithm [Chan 93], at each granularity level.)
Memory size computation
Memory size variation during the motion detection algorithm
Memory size computation
To handle high-throughput applications:
- extract the (largely hidden) parallelism from the initially specified code
- find the lowest degree of parallelism that meets the throughput/hardware requirements
- perform memory size computation for code with explicit parallelism instructions
Hierarchical memory allocation
A large part of power dissipation in data-dominated applications is due to data transfers and data storage.

Power cost reduction: a memory hierarchy exploiting the temporal locality in the data accesses.

Power dissipation = f( memory size , access frequency )
Hierarchical memory allocation
Power dissipation = f( memory size , access frequency )

Heavily used data is kept in a layer of small memories, backed by a layer of large memories.
Hierarchical memory allocation
Hierarchical vs. non-hierarchical distribution: trade-offs

+ lower power consumption by accessing data from smaller memories
- higher power consumption due to additional transfers
- larger area to store copies of data
- additional area overhead (addressing logic)
Hierarchical memory allocation
Synthesis of a multilevel memory architecture optimized for area and/or power, subject to performance constraints:

1. Data reuse exploration
Which intermediate copies of data are necessary for accessing data in a power- and area-efficient way?

2. Memory allocation & assignment
Distributed (hierarchical) memory architecture: memory layers, memory size/ports/address-logic, signal-to-memory & signal-to-port assignment.
Hierarchical memory allocation
Synthesis of a multilevel memory architecture optimized for area and/or power, subject to performance constraints:

1. Data reuse exploration
Array partitions to be considered as copy candidates: the LBLs from the recursive intersection of the array references.

2. Memory allocation & assignment

Cost = α · P_read/write ( N_bits , N_words , f_read/write )
     + β · Area ( N_bits , N_words , N_ports , technology )

(α, β: weighting coefficients)
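To make the power term of the cost function concrete, here is a toy evaluation of adding one copy layer. The per-access energy model (growing like sqrt of the memory size), the memory sizes, and the access counts are all illustrative assumptions of mine, not data from the slides:

```python
import math

def access_energy(n_words):
    # Hypothetical model: per-access energy grows roughly with the
    # bit-line length, ~ sqrt(number of words).
    return math.sqrt(n_words)

BIG, SMALL = 1_000_000, 100                    # words: main memory vs. copy layer
total_accesses, hot_accesses = 10_000, 9_000   # 90% hit a 100-word partition

# Flat architecture: every access pays the big-memory energy.
flat = total_accesses * access_energy(BIG)

# Hierarchical: fill the copy once, serve the hot accesses from the
# small layer, send the rest to the big memory.
hier = (SMALL * (access_energy(BIG) + access_energy(SMALL))
        + hot_accesses * access_energy(SMALL)
        + (total_accesses - hot_accesses) * access_energy(BIG))

print(f"flat = {flat:,.0f}   hierarchical = {hier:,.0f}")
# flat = 10,000,000   hierarchical = 1,191,000
```

Under these numbers the copy pays for itself by an order of magnitude, which is exactly the temporal-locality argument behind data reuse exploration; with a cold access pattern the extra transfers and copy area would instead dominate.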
Partitioning for on/off- chip memories
The CPU reaches the on-chip SRAM and the cache in 1 cycle, while accesses going to the off-chip DRAM take 10-20 cycles; the memory address space is partitioned between the two paths.

Goal: optimal data mapping to the SRAM / DRAM to maximize the performance of the application.
Partitioning for on/off- chip memories
Total conflict factor: the total number of array accesses exposed to cache conflicts; it quantifies the importance of mapping data to the on-chip SRAM.

Using the polyhedral data-dependence graph gives precise information about the relative lifetimes of the different parts of arrays.
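One simple way to act on the total conflict factor is to treat SRAM mapping as a knapsack problem: choose the arrays to place on-chip so that the conflict-exposed accesses removed from the cache/DRAM path are maximal within the SRAM capacity. The arrays, sizes, and factors below are hypothetical, and the 0/1-knapsack DP is my sketch rather than the slides' algorithm:

```python
def map_to_sram(arrays, capacity):
    """0/1 knapsack: arrays is a list of (name, size_in_words, conflict_factor);
    returns (best total conflict factor, names mapped to the on-chip SRAM)."""
    best = [(0, frozenset())] * (capacity + 1)
    for name, size, tcf in arrays:
        # scan capacities downwards so each array is used at most once
        for cap in range(capacity, size - 1, -1):
            value, chosen = best[cap - size]
            if value + tcf > best[cap][0]:
                best[cap] = (value + tcf, chosen | {name})
    return best[capacity]

# hypothetical arrays: (name, words, total conflict factor)
arrays = [("A", 400, 90), ("B", 300, 60), ("C", 500, 100), ("D", 200, 30)]
print(map_to_sram(arrays, 1000))   # best value 190, with A and C on-chip
```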
Conclusions
Algebraic techniques are powerful non-scalar instruments in the memory management of multimedia signal processing.

Data-dependence analysis at the polyhedral level is useful in many memory management tasks:
- memory size computation for behavioral specifications
- hierarchical memory allocation
- data partitioning between on- and off-chip memories