Data-Flow Analysis in the Memory Management of Real-Time Multimedia Processing Systems

Florin Balasa
University of Illinois at Chicago
Introduction
Real-time multimedia processing systems
(video and image processing, real-time 3D rendering, audio and speech coding, medical imaging, etc.)

A large part of power dissipation is due to data transfer and data storage.

Fetching operands from an off-chip memory for addition consumes 33 times more power than the computation itself [Catthoor 98].

Area cost is often largely dominated by memories.
Introduction
In the early years of high-level synthesis, memory management tasks were tackled at the scalar level.

Algebraic techniques, similar to those used in modern compilers, allow memory management to be handled at the non-scalar level.

Requirement: addressing the entire class of affine specifications
- multidimensional signals with (complex) affine indexes
- loop nests whose boundaries are affine functions of the iterators
- conditions: relational and/or logical operators applied to affine functions
Outline
Memory size computation using data-dependence analysis
Hierarchical memory allocation based on data reuse analysis
Data-flow driven data partitioning for on/off-chip memories
Conclusions
Computation of array reference size
for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    … A [2i+3j+1] [5i+j+2] [4i+6j+3] …

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B [i+k] [j+k] …

How many memory locations are necessary to store the array references A [2i+3j+1] [5i+j+2] [4i+6j+3] and B [i+k] [j+k] ?
Computation of array reference size
Is it the number of iterator triplets (i,j,k), that is, 512^3 ?  No !
Both (i,j,k) = (0,1,1) and (i,j,k) = (1,2,0) access the same location B[1][2]:

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B [i+k] [j+k] …
Computation of array reference size
Is it the number of index values (i+k, j+k), that is, 1023^2 (since 0 <= i+k, j+k <= 1022) ?  No !
B[0][512] is accessed by no (i,j,k):

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B [i+k] [j+k] …
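Both wrong guesses can be checked by brute force on a scaled-down version of the loop nest; using a bound of 7 instead of 511 is my choice for tractability, everything else follows the slide:

```python
# Scaled-down stand-in for the 512x512x512 loop nest around B[i+k][j+k].
bound = 7
iterations = (bound + 1) ** 3       # number of iterator triplets (i,j,k)
box = (2 * bound + 1) ** 2          # index pairs allowed by 0 <= i+k, j+k <= 2*bound

# The set of index pairs actually touched:
touched = {(i + k, j + k)
           for i in range(bound + 1)
           for j in range(bound + 1)
           for k in range(bound + 1)}

print(iterations, box, len(touched))   # 512 225 169
```

The true count (169) is smaller than both the iteration count (some triplets collide) and the bounding box of index values (some pairs are unreachable).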
Computation of array reference size
for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    … A [2i+3j+1] [5i+j+2] [4i+6j+3] …

The iterator space (i,j) is mapped to the index space of A[x][y][z] by
x = 2i+3j+1 ,  y = 5i+j+2 ,  z = 4i+6j+3
Computation of array reference size
Remark: the iterator space may have ``holes'' too.

for (i=4; i<=8; i++)
  for (j=i-2; j<=i+2; j+=2)
    … C[i+j] …

After loop normalization (substituting j = i-2+2j', with 0 <= j' <= 2):

for (i=4; i<=8; i++)
  for (j=0; j<=2; j++)
    … C[2i+2j-2] …
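The normalization step can be sanity-checked by comparing the sets of C elements touched by the two loop nests (a quick check of mine; the loops are exactly those on the slide):

```python
# Original loop: for i in 4..8, j runs over i-2, i, i+2, accessing C[i+j].
original = {i + j for i in range(4, 9) for j in range(i - 2, i + 3, 2)}

# Normalized loop: for i in 4..8, j in 0..2, accessing C[2i+2j-2].
normalized = {2 * i + 2 * j - 2 for i in range(4, 9) for j in range(3)}

print(sorted(original))        # [6, 8, 10, 12, 14, 16, 18]
print(original == normalized)  # True
```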
Computation of array reference size
for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    … A [2i+3j+1] [5i+j+2] [4i+6j+3] …

Affine mapping from iterator space to index space:

(x, y, z)^T = T·(i, j)^T + u ,  T = [ 2 3 ; 5 1 ; 4 6 ] ,  u = (1, 2, 3)^T

where 0 <= i, j <= 511.
Computation of array reference size
for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B [i+k] [j+k] …

The iterator space (i,j,k) is mapped to the index space of B[x][y] by
x = i+k ,  y = j+k
Computation of array reference size
for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B [i+k] [j+k] …

Affine mapping from iterator space to index space:

(x, y)^T = T·(i, j, k)^T + u ,  T = [ 1 0 1 ; 0 1 1 ] ,  u = (0, 0)^T

where 0 <= i, j, k <= 511.
Computation of array reference size
Any array reference can be modeled as a linearly bounded lattice (LBL):

LBL = { x = T·i + u | A·i >= b }

Iterator space: a polytope defined by
- the scope of the nested loops, and
- the iterator-dependent conditions

Polytope -> (affine mapping x = T·i + u) -> LBL
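As a concrete (if naive) reading of the definition, the points of a small LBL can be enumerated by scanning the iterator polytope and applying the mapping. The helper below is a sketch of mine, not part of the slides; it is applied to the normalized C[2i+2j-2] example from earlier:

```python
from itertools import product

def lbl_points(T, u, A, b, box):
    """Enumerate LBL = { x = T·i + u | A·i >= b } by scanning a bounding
    box of the iterator space (only sensible for small examples)."""
    points = set()
    for i in product(*(range(lo, hi + 1) for lo, hi in box)):
        # keep the iterator point if it satisfies A·i >= b ...
        if all(sum(row[k] * i[k] for k in range(len(i))) >= rhs
               for row, rhs in zip(A, b)):
            # ... and map it through x = T·i + u
            points.add(tuple(sum(row[k] * i[k] for k in range(len(i))) + off
                             for row, off in zip(T, u)))
    return points

# C[2i+2j-2] with 4 <= i <= 8, 0 <= j <= 2:
pts = lbl_points(T=[[2, 2]], u=[-2],
                 A=[[1, 0], [-1, 0], [0, 1], [0, -1]], b=[4, -8, 0, -2],
                 box=[(4, 8), (0, 2)])
print(sorted(x for (x,) in pts))   # [6, 8, 10, 12, 14, 16, 18]
```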
Computation of array reference size
The size of the array reference is the size of its index space, which is an LBL!

LBL = { x = T·i + u | A·i >= b } ,  f : Z^n -> Z^m ,  f(i) = T·i + u

Is the function f a one-to-one mapping?

If YES: size(index space) = size(iterator space)
Computation of array reference size
f : Z^n -> Z^m ,  f(i) = T·i + u

P·T·S = [ H 0 ; G 0 ]        [Minoux 86]

S - unimodular matrix
P - row permutation
H - nonsingular lower-triangular matrix

When rank(H) = m <= n, H is the Hermite Normal Form.
Computation of array reference size
Case 1: rank(H) = n  =>  the function f is a one-to-one mapping

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    … A [2i+3j+1] [5i+j+2] [4i+6j+3] …

T = [ 2 3 ; 5 1 ; 4 6 ] ,  P = I_3 ,  S = [ -1 3 ; 1 -2 ]  (unimodular)

P·T·S = [ 1 0 ; -4 13 ; 2 0 ] = [ H ; G ] ,  H = [ 1 0 ; -4 13 ] ,  G = [ 2 0 ]

rank(H) = 2 = n, so f is one-to-one:

Nr. locations A[ ][ ][ ] = size( 0 <= i,j <= 511 ) = 512 x 512
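The Case 1 decomposition can be verified numerically. T comes from A[2i+3j+1][5i+j+2][4i+6j+3]; the particular unimodular S used below is my reconstruction (any unimodular S yielding a nonsingular lower-triangular top block works):

```python
T = [[2, 3], [5, 1], [4, 6]]   # mapping of A[2i+3j+1][5i+j+2][4i+6j+3]
S = [[-1, 3], [1, -2]]         # candidate unimodular matrix (reconstruction)

det_S = S[0][0] * S[1][1] - S[0][1] * S[1][0]
TS = [[sum(T[r][k] * S[k][c] for k in range(2)) for c in range(2)]
      for r in range(3)]

print(det_S)   # -1  (so S is unimodular)
print(TS)      # [[1, 0], [-4, 13], [2, 0]]: the top block H = [[1,0],[-4,13]]
               # is nonsingular lower-triangular, hence rank(H) = 2 = n
```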
Computation of array reference size
Case 2: rank(H) < n

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B [i+k] [j+k] …

T = [ 1 0 1 ; 0 1 1 ] ,  P = I_2 ,  S = [ 1 0 -1 ; 0 1 -1 ; 0 0 1 ]  (unimodular)

P·T·S = [ 1 0 0 ; 0 1 0 ] = [ H 0 ] ,  H = I_2

In the new coordinates I = i+k, J = j+k, K = k:
{ 0 <= i , j , k <= 511 }  becomes  { 0 <= I-K , J-K , K <= 511 }

| B[i+k][j+k] | = size( 0 <= I,J <= 1022 , I-511 <= J <= I+511 ) = 784,897
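Case 2 can be checked the same way. The unimodular S below is my reconstruction; it realizes the change of coordinates I = i+k, J = j+k, K = k:

```python
T = [[1, 0, 1], [0, 1, 1]]                 # mapping of B[i+k][j+k]
S = [[1, 0, -1], [0, 1, -1], [0, 0, 1]]    # candidate unimodular matrix

TS = [[sum(T[r][k] * S[k][c] for k in range(3)) for c in range(3)]
      for r in range(2)]
print(TS)   # [[1, 0, 0], [0, 1, 0]] == [H 0] with H = I_2, rank(H) = 2 < n = 3
```

Since the third column of T·S is zero, the index depends only on (I, J): the mapping is not one-to-one, and the index-space size must be counted separately.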
Computation of array reference size
Array reference B [i+k] [j+k]

Loop bounds | Size iterator space | Size index space (# storage locations) | |index space| / |iterator space|
        7   |             512     |      169                               |   33 %
       15   |           4,096     |      721                               |   17 %
       31   |          32,768     |    2,977                               |  9.0 %
       63   |         262,144     |   12,097                               |  4.6 %
      127   |       2,097,152     |   48,769                               |  2.3 %
      255   |      16,777,216     |  195,841                               |  1.1 %
      511   |     134,217,728     |  784,897                               |  0.5 %
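The storage-location column of the table above follows from one inclusion-exclusion count: for loop bound b, the index points (I,J) = (i+k, j+k) fill the box 0 <= I,J <= 2b minus the two corner triangles where |I-J| > b. The closed form is my derivation, checked against the slide's numbers:

```python
def b_locations(b):
    """Distinct (i+k, j+k) with 0 <= i, j, k <= b:
    (2b+1)^2 box points minus 2 * (1 + 2 + ... + b) corner points."""
    return (2 * b + 1) ** 2 - b * (b + 1)

for b in (7, 15, 31, 63, 127, 255, 511):
    iters = (b + 1) ** 3
    locs = b_locations(b)   # reproduces the storage-location column
    print(f"{b:4d} {iters:12,d} {locs:8,d} {100 * locs / iters:6.2f} %")
```

The location counts match the table exactly; the slide's percentages appear truncated rather than rounded (e.g. 784,897 / 134,217,728 is 0.58 %).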
Computation of array reference size
Computation of the size of an integer polytope: the Fourier-Motzkin elimination

Each constraint Σ_k a_ik·x_k >= b_i is rewritten, with respect to x_n, as one of:
1. x_n >= D_i (x_1, …, x_{n-1})
2. x_n <= E_j (x_1, …, x_{n-1})
3. 0 <= F_k (x_1, …, x_{n-1})

n-dim polytope  ->  (n-1)-dim polytope:
{ D_i (x_1, …, x_{n-1}) <= E_j (x_1, …, x_{n-1}) ,  0 <= F_k (x_1, …, x_{n-1}) }

Eliminating variables down to a 1-dim polytope yields the range of x_1; then, for each value of x_1, add the size of the corresponding (n-1)-dim polytope.
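The projection-and-summation scheme above can be sketched compactly. This is my own minimal implementation (exact but exponential; it assumes a bounded, low-dimensional polytope, whereas production tools use far smarter counting):

```python
import math
from fractions import Fraction

def eliminate(cons, j, n):
    """One Fourier-Motzkin step: project out x_j from constraints a·x >= b
    (each constraint is a pair (a, b) with a a length-n tuple)."""
    lower = [(a, b) for a, b in cons if a[j] > 0]   # give x_j >= D(...)
    upper = [(a, b) for a, b in cons if a[j] < 0]   # give x_j <= E(...)
    new = [(a, b) for a, b in cons if a[j] == 0]    # already 0 <= F(...)
    for al, bl in lower:
        for au, bu in upper:
            # positive combination cancelling x_j, i.e. D(...) <= E(...)
            a = tuple(al[j] * au[k] - au[j] * al[k] for k in range(n))
            new.append((a, al[j] * bu - au[j] * bl))
    return new

def range_of_first(cons, n):
    """Integer range of x_0 after eliminating x_1 .. x_{n-1}."""
    for j in range(1, n):
        cons = eliminate(cons, j, n)
    lo = max(math.ceil(Fraction(b, a[0])) for a, b in cons if a[0] > 0)
    hi = min(math.floor(Fraction(b, a[0])) for a, b in cons if a[0] < 0)
    return lo, hi

def size(cons, n):
    """Number of integer points of {x in Z^n | a·x >= b}:
    fix x_0 value by value and add the sizes of the slices."""
    live = []
    for a, b in cons:
        if any(a):
            live.append((a, b))
        elif b > 0:                 # constraint 0 >= b with b > 0: empty
            return 0
    if n == 0:
        return 1
    lo, hi = range_of_first(live, n)
    return sum(size([(a[1:], b - a[0] * v) for a, b in live], n - 1)
               for v in range(lo, hi + 1))

# The triangle { 0 <= j , j+1 <= i , i <= 5 }, variables ordered (i, j):
tri = [((0, 1), 0), ((1, -1), 1), ((-1, 0), -5)]
print(size(tri, 2))   # 15  (= 1+2+3+4+5)
```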
Memory size computation
#define n 6

for (j=0; j<n; j++) {
  A[j][0] = in0;
  for (i=0; i<n; i++)
    A[j][i+1] = A[j][i] + 1;
}
for (i=0; i<n; i++) {
  alpha[i] = A[i][n+i];
  for (j=0; j<n; j++)
    A[j][n+i+1] = j < i ? A[j][n+i] : alpha[i] + A[j][n+i];
}
for (j=0; j<n; j++)
  B[j] = A[j][2*n];
Memory size computation
Decompose the LBLs of the array references into non-overlapping pieces!

Intersecting two LBLs

LBL_1 = { x = T_1·i_1 + u_1 | A_1·i_1 >= b_1 }
LBL_2 = { x = T_2·i_2 + u_2 | A_2·i_2 >= b_2 }

amounts to solving the Diophantine system of equations

T_1·i_1 + u_1 = T_2·i_2 + u_2

whose solutions, restricted by { A_1·i_1 >= b_1 , A_2·i_2 >= b_2 }, form a new polytope.
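A one-dimensional illustration of the Diophantine step (the two LBLs below are hypothetical, chosen for readability): intersecting {x = 3i+1 | 0 <= i <= 10} with {x = 5j+2 | 0 <= j <= 10} means solving 3i - 5j = 1, then restricting the parametric solution by both iterator polytopes:

```python
def ext_gcd(a, b):
    """Extended Euclid: returns (g, x, y) with a*x + b*y = g = gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = ext_gcd(b, a % b)
    return g, y, x - (a // b) * y

# Solve 3*i - 5*j = 1 (i.e. 3*i + 5*(-j) = 1):
g, x, y = ext_gcd(3, 5)    # 3*2 + 5*(-1) = 1, so g, x, y = 1, 2, -1
i0, j0 = x, -y             # particular solution: i0 = 2, j0 = 1

# General solution: i = i0 + 5*t, j = j0 + 3*t. The new polytope is the
# t-range where both 0 <= i <= 10 and 0 <= j <= 10 hold.
common = sorted(3 * (i0 + 5 * t) + 1
                for t in range(-5, 6)
                if 0 <= i0 + 5 * t <= 10 and 0 <= j0 + 3 * t <= 10)
print(common)   # [7, 22]
```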
Memory size computation
Keeping minimal the set of inequalities in the LBL intersection

for (i=0; i<n; i++)
  for (j=0; j<n; j++)
    A[j][n+i+1] = j < i ? A[j][n+i] : alpha[i] + A[j][n+i];

Iterator space of the first branch (j < i):
{ 0 <= i , j <= n-1 , j+1 <= i }   (5 ineq.)
reduces to
{ 0 <= j , i <= n-1 , j+1 <= i }   (3 ineq.)
Memory size computation
Keeping minimal the set of inequalities in the LBL intersection

The decomposition theorem of polyhedra [Motzkin 1953]:

Polyhedron = { x | C·x = d , A·x >= b }
           = { x | x = V·λ + L·μ + R·ρ }

with V the matrix of vertices (λ_i >= 0, Σ_i λ_i = 1), L the lines, and R the extreme rays (ρ_i >= 0).
Memory size computation
Polyhedral data-dependence graphs

LBL's of signal A (illustrative example), at several granularity levels:
- granularity level = 0: scalar-level data-dependence graph
- granularity level = 1 and level = 2: polyhedral data-dependence graphs

(Figure: number of scalars and number of dependencies of the data-dependence graph of a motion detection algorithm [Chan 93], at each granularity level.)
Memory size computation
Memory size variation during the motion detection algorithm
Memory size computation
To handle high-throughput applications:
- extract the (largely hidden) parallelism from the initially specified code
- find the lowest degree of parallelism that meets the throughput/hardware requirements
- perform memory size computation for code with explicit parallelism instructions
Hierarchical memory allocation
A large part of power dissipation in data-dominated applications is due to data transfers and data storage.

Power cost reduction: a memory hierarchy exploiting the temporal locality in the data accesses.

Power dissipation = f( memory size , access frequency )
Hierarchical memory allocation
Power dissipation = f( memory size , access frequency )

Heavily used data is kept in a layer of small memories, backed by a layer of large memories.
Hierarchical memory allocation
Hierarchical vs. non-hierarchical distribution: trade-offs

+ lower power consumption by accessing data from smaller memories
- higher power consumption due to additional transfers
- larger area to store copies of data
- additional area overhead (addressing logic)
Hierarchical memory allocation
Synthesis of a multilevel memory architecture optimized for area and/or power, subject to performance constraints:

1. Data reuse exploration
Which intermediate copies of data are necessary for accessing data in a power- and area-efficient way?

2. Memory allocation & assignment
Distributed (hierarchical) memory architecture: memory layers, memory size/ports/address-logic, signal-to-memory & signal-to-port assignment.
Hierarchical memory allocation
Synthesis of a multilevel memory architecture optimized for area and/or power, subject to performance constraints:

1. Data reuse exploration
Array partitions to be considered as copy candidates: the LBLs from the recursive intersection of the array references.

2. Memory allocation & assignment

Cost = α · P_read/write ( N_bits , N_words , f_read/write )
     + β · Area ( N_bits , N_words , N_ports , technology )

(α, β: weighting coefficients)
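To make the power term of the cost function concrete, here is a toy evaluation of adding one copy layer. The per-access energy model (growing like sqrt of the memory size), the memory sizes, and the access counts are all illustrative assumptions of mine, not data from the slides:

```python
import math

def access_energy(n_words):
    # Hypothetical model: per-access energy grows roughly with the
    # bit-line length, ~ sqrt(number of words).
    return math.sqrt(n_words)

BIG, SMALL = 1_000_000, 100                    # words: main memory vs. copy layer
total_accesses, hot_accesses = 10_000, 9_000   # 90% hit a 100-word partition

# Flat architecture: every access pays the big-memory energy.
flat = total_accesses * access_energy(BIG)

# Hierarchical: fill the copy once, serve the hot accesses from the
# small layer, send the rest to the big memory.
hier = (SMALL * (access_energy(BIG) + access_energy(SMALL))
        + hot_accesses * access_energy(SMALL)
        + (total_accesses - hot_accesses) * access_energy(BIG))

print(f"flat = {flat:,.0f}   hierarchical = {hier:,.0f}")
# flat = 10,000,000   hierarchical = 1,191,000
```

Under these numbers the copy pays for itself by an order of magnitude, which is exactly the temporal-locality argument behind data reuse exploration; with a cold access pattern the extra transfers and copy area would instead dominate.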
Partitioning for on/off- chip memories
The CPU reaches the on-chip SRAM and the cache in 1 cycle, while accesses going to the off-chip DRAM take 10-20 cycles; the memory address space is partitioned between the two paths.

Goal: optimal data mapping to the SRAM / DRAM to maximize the performance of the application.
Partitioning for on/off- chip memories
Total conflict factor: the total number of array accesses exposed to cache conflicts; it quantifies the importance of mapping data to the on-chip SRAM.

Using the polyhedral data-dependence graph gives precise information about the relative lifetimes of the different parts of arrays.
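One simple way to act on the total conflict factor is to treat SRAM mapping as a knapsack problem: choose the arrays to place on-chip so that the conflict-exposed accesses removed from the cache/DRAM path are maximal within the SRAM capacity. The arrays, sizes, and factors below are hypothetical, and the 0/1-knapsack DP is my sketch rather than the slides' algorithm:

```python
def map_to_sram(arrays, capacity):
    """0/1 knapsack: arrays is a list of (name, size_in_words, conflict_factor);
    returns (best total conflict factor, names mapped to the on-chip SRAM)."""
    best = [(0, frozenset())] * (capacity + 1)
    for name, size, tcf in arrays:
        # scan capacities downwards so each array is used at most once
        for cap in range(capacity, size - 1, -1):
            value, chosen = best[cap - size]
            if value + tcf > best[cap][0]:
                best[cap] = (value + tcf, chosen | {name})
    return best[capacity]

# hypothetical arrays: (name, words, total conflict factor)
arrays = [("A", 400, 90), ("B", 300, 60), ("C", 500, 100), ("D", 200, 30)]
print(map_to_sram(arrays, 1000))   # best value 190, with A and C on-chip
```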
Conclusions
Algebraic techniques are powerful non-scalar instruments in the memory management of multimedia signal processing.

Data-dependence analysis at the polyhedral level is useful in many memory management tasks:
- memory size computation for behavioral specifications
- hierarchical memory allocation
- data partitioning between on- and off-chip memories