![Page 1: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/1.jpg)
fakultät für informatikinformatik 12
technische universität dortmund
Optimizations
Peter MarwedelTU DortmundInformatik 12
Germany
2009/01/10
Gra
phic
s: ©
Ale
xand
ra N
olte
, Ges
ine
Mar
wed
el, 2
003
![Page 2: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/2.jpg)
- 2 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Structure of this course
2: Specifications
3: Embedded System HW
4: Standard Software, Real-Time Operating Systems
5: Scheduling,HW/SW-Partitioning, Applications to MP-Mapping
6: Evaluationand Validation
8: Testing
7: Optimization of Embedded Systems
App
licat
ion
Kno
wle
dge
![Page 3: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/3.jpg)
- 3 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Transformation “Loop nest splitting”
Example: Separation of margin handling
+
many if-statements for margin-checking
no checking,efficient
only few margin elements to be processed
![Page 4: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/4.jpg)
- 4 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
if (x>=10||y>=14) for (; y<49; y++) for (k=0; k<9; k++) for (l=0; l<9;l++ ) for (i=0; i<4; i++) for (j=0; j<4;j++) { then_block_1; then_block_2}else {y1=4*y; for (k=0; k<9; k++) {x2=x1+k-4; for (l=0; l<9; ) {y2=y1+l-4; for (i=0; i<4; i++) {x3=x1+i; x4=x2+i; for (j=0; j<4;j++) {y3=y1+j; y4=y2+j; if (0 || 35<x3 ||0 || 48<y3) then-block-1; else else-block-1; if (x4<0|| 35<x4||y4<0||48<y4) then_block_2; else else_block_2;}}}}}}
Loop nest splitting at University of DortmundLoop nest from MPEG-4 full search motion estimation
for (z=0; z<20; z++) for (x=0; x<36; x++) {x1=4*x; for (y=0; y<49; y++) {y1=4*y; for (k=0; k<9; k++) {x2=x1+k-4; for (l=0; l<9; ) {y2=y1+l-4; for (i=0; i<4; i++) {x3=x1+i; x4=x2+i; for (j=0; j<4;j++) {y3=y1+j; y4=y2+j; if (x3<0 || 35<x3||y3<0||48<y3) then_block_1; else else_block_1; if (x4<0|| 35<x4||y4<0||48<y4) then_block_2; else else_block_2;}}}}}}
for (z=0; z<20; z++) for (x=0; x<36; x++) {x1=4*x; for (y=0; y<49; y++)
analysis of polyhedral domains, selection with genetic algorithm
[H. Falk et al., Inf 12, UniDo, 2002]
![Page 5: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/5.jpg)
- 5 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Results for loop nest splitting- Execution times -
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
110%Cavity Motion Estimation QSDPCM
[H. Falk et al., Inf 12, UniDo, 2002]
![Page 6: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/6.jpg)
- 6 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Results for loop nest splitting- Code sizes -
[Falk, 2002]
0%
20%
40%
60%
80%
100%
120%
140%
160%
180%
200%Cavity Motion Estimation QSDPCM
![Page 7: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/7.jpg)
- 7 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Array folding
Initial arrays
![Page 8: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/8.jpg)
- 8 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Array folding
Unfolded arrays
Unfolded arrays
![Page 9: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/9.jpg)
- 9 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Inter-array folding
Intra-array folding
![Page 10: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/10.jpg)
- 10 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Application
Array folding is implemented in the DTSE optimization proposed by IMEC. Array folding adds div and mod ops. Optimizations required to remove these costly operations.
At IMEC, ADOPT address optimizations perform this task.For example, modulo operations are replaced by pointers (indexes) which are incremented and reset.
![Page 11: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/11.jpg)
- 11 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Results (Mcycles for cavity benchmark)
ADOPT&DTSE required to achieve real benefit
[C.Ghez et al.: Systematic high-level Address Code Transformations for Piece-wise Linear Indexing: Illustration on a Medical Imaging Algorithm, IEEE WS on Signal Processing System: design & implementation, 2000, pp. 623-632]
![Page 12: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/12.jpg)
fakultät für informatikinformatik 12
technische universität dortmund
Compilation for Embedded Processors
Peter MarwedelTU DortmundInformatik 12
Germany
Gra
phic
s: ©
Ale
xand
ra N
olte
, Ges
ine
Mar
wed
el, 2
003
![Page 13: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/13.jpg)
- 13 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Compilers for embedded systems:Why are compilers an issue?
Many reports about low efficiency of standard compilers
- Special features of embedded processors have to be exploited.
- High levels of optimization more important than compilation speed.
- Compilers can help to reduce the energy consumption.
- Compilers could help to meet real-time constraints.
Less legacy problems than for PCs.
- There is a large variety of instruction sets.
- Design space exploration for optimized processors makes sense
![Page 14: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/14.jpg)
- 14 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
(Lack of) performance of C-compilers for DSPs
ADI-21011.01.5
2.0
2.5
3.0
3.5
4.0
4.5
5.0
TI-C51DSP56001
Data memory overhead [× N]DSPStone (Zivojnovic et al.). Example: ADPCM
Cycle overhead [× n]
DSP56001TI-C51 ADI-2101
1.0
2.0
3.0
4.0
5.0
8.0
7.0
6.0
![Page 15: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/15.jpg)
- 15 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
3 key problems for future memory systems
Energy
Access times
1. (Average) Speed2. Energy/Power3. Predictability/WCET
![Page 16: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/16.jpg)
- 16 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Optimization for low-energy the same as optimization for high performance?
int a[1000];c = a;for (i = 1; i < 100; i++) { b += *c; b += *(c+7); c += 1;}
int a[1000];c = a;for (i = 1; i < 100; i++) { b += *c; b += *(c+7); c += 1;}
LDR r3, [r2, #0]ADD r3,r0,r3MOV r0,#28LDR r0, [r2, r0]ADD r0,r3,r0ADD r2,r2,#4ADD r1,r1,#1CMP r1,#100BLT LL3
ADD r3,r0,r2MOV r0,#28MOV r2,r12MOV r12,r11MOV r11,rr10MOV r0,r9MOV r9,r8MOV r8,r1LDR r1, [r4, r0]ADD r0,r3,r1ADD r4,r4,#4ADD r5,r5,#1CMP r5,#100BLT LL3
2231 cycles16.47 µJ
2096 cycles19.92 µJ
No !• High-performance if available memory bandwidth fully used; low-energy consumption if memories are at stand-by mode• Reduced energy if more values are kept in registers
![Page 17: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/17.jpg)
- 17 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Compiler optimizationsfor improving energy efficiency
Energy-aware scheduling Energy-aware instruction selection Operator strength reduction: e.g. replace * by + and << Minimize the bitwidth of loads and stores Standard compiler optimizations with energy as a cost
function
E.g.: Register pipelining:
for i:= 0 to 10 do C:= 2 * a[i] + a[i-1];
R2:=a[0];for i:= 1 to 10 do begin R1:= a[i]; C:= 2 * R1 + R2; R2 := R1; end;
Exploitation of the memory hierarchy Exploitation of the memory hierarchy
![Page 18: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/18.jpg)
- 18 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Hierarchical memoriesusing scratch pad memories (SPM)
Address space ARM7TDMI cores, well-known for low power consumption
main
SPM
processor
HierarchyHierarchy
ExampleExample
scratch pad memory
0
FFF..
no tag memory
![Page 19: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/19.jpg)
- 19 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Very limited support in ARMcc-based tool flows
1. Use pragma in C-source to allocate to specific section:For example:#pragma arm section rwdata = "foo", rodata = "bar" int x2 = 5; // in foo (data part of region)int const z2[3] = {1,2,3}; // in bar
2. Input scatter loading file to linker for allocating section to specific address range
http://www.arm.com/documentation/ Software_Development_Tools/index.html
![Page 20: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/20.jpg)
- 20 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Migration of data and instructions, global optimization model (U. Dortmund)
Which memory object (array,loop, etc.) to be stored in SPM?
Non-overlaying (“Static”) allocation:
Gain gk and size sk for each segment k. Maximise gain G = gk, respecting size of SPM SSP sk.
Solution: knapsack algorithm.
Overlaying (“dynamic”) allocation:
Moving objects back and forth
Processor
Scratch pad memory,capacity SSP
main memory
?
For i .{ }
for j ..{ }
while ...
Repeat
call ...
Array ...
Int ...
Array
Example:
![Page 21: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/21.jpg)
- 21 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
IP representation- migrating functions and variables-
Symbols:S(vark ) = size of variable knk = number of accesses to variable ke(vark ) = energy saved per variable access, if vark is migratedE(vark ) = energy saved if variable vark is migrated (= e(vark ) n(vark ))x(vark ) = decision variable, =1 if variable k is migrated to SPM, =0 otherwiseK = set of variables
Similar for functions I
Integer programming formulation:Maximize kK x(vark ) E(vark ) + iI x(Fi ) E(Fi )Subject to the constraintk K S (vark ) x(vark ) + i I S (Fi ) x(Fi ) SSP
![Page 22: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/22.jpg)
- 22 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Reduction in energy and average run-time
Multi_sort (mix of sort algorithms)
Cyc
les
[x1
00
] E
ner
gy
[µJ]
Feasible with standard compiler & postpass optimization
Measured processor / external memory energy + CACTI values for SPM (combined model)
Numbers will change with technology, algorithms remain unchanged.
![Page 23: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/23.jpg)
- 23 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Allocation of basic blocks
Fine-grained granularity smoothens dependency on the size of the scratch pad.
Requires additional jump instructions to return to "main" memory.
Fine-grained granularity smoothens dependency on the size of the scratch pad.
Requires additional jump instructions to return to "main" memory.
Main memory
BB1
BB2
Jump1
Jump2
Jump4
Jump3
For consecutive basic blocks
Statically 2 jumps,but only one is taken
![Page 24: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/24.jpg)
- 24 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Allocation of basic blocks, sets of adjacent basic blocks and the stack
Requires generation of additional jumps (special compiler)
Cyc
les
[x1
00
] E
ner
gy
[µJ]
![Page 25: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/25.jpg)
- 25 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Savings for memory system energy alone
Combined model for memories
![Page 26: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/26.jpg)
- 26 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Timing predictability
aiT: WCET analysis tool support for scratchpad memories by specifying different
memory access times also features experimental cache analysis for ARM7
aiT: WCET analysis tool support for scratchpad memories by specifying different
memory access times also features experimental cache analysis for ARM7
![Page 27: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/27.jpg)
- 27 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Architectures considered
ARM7TDMI with 3 different memory architectures:1. Main memory
LDR-cycles: (CPU,IF,DF)=(3,2,2)STR-cycles: (2,2,2)* = (1,2,0)
2. Main memory + unified cacheLDR-cycles: (CPU,IF,DF)=(3,12,6)STR-cycles: (2,12,3)* = (1,12,0)
3. Main memory + scratch padLDR-cycles: (CPU,IF,DF)=(3,0,2)STR-cycles: (2,0,0)* = (1,0,0)
![Page 28: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/28.jpg)
- 28 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Results for G.721
References:• Wehmeyer, Marwedel: Influence of Onchip Scratchpad Memories on WCET: 4th Intl Workshop on worst-case execution time (WCET) analysis, Catania, Sicily, Italy, June 29, 2004• Second paper on SP/Cache and WCET at DATE, March 2005
Using Scratchpad: Using Unified Cache:
![Page 29: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/29.jpg)
- 29 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Multiple scratch pads
![Page 30: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/30.jpg)
- 30 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Optimization for multiple scratch pads
i
iijj
j nxeC ,Minimize
With ej: energy per access to memory j,and xj,i= 1 if object i is mapped to memory j, =0 otherwise,and ni: number of accesses to memory object i,subject to the constraints:
i
jiij SSPSxj ,:
j
ijxi 1: ,
With Si: size of memory object i,SSPj: size of memory j.
![Page 31: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/31.jpg)
- 31 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Considered partitions
![Page 32: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/32.jpg)
- 32 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Results for parts of GSM coder/decoder
A key advantage of partitioned scratchpads for multiple applications is their ability to adapt to the size of the current working set.
„Working set“
![Page 33: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/33.jpg)
- 33 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Dynamic replacement within scratch pad
Effectively results in a kind of compiler-controlled segmentation/ paging for SPM
Address assignment within SPM required(paging or segmentation-like)
Reference: Verma, Marwedel: Dynamic Overlay of Scratchpad Memory for Energy Minimization, ISSS 2004
CPU
Memory
Memory
SPM
![Page 34: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/34.jpg)
- 34 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Dynamic replacement of data within scratch pad: based on liveness analysis
MO = {A, T1, T2, T3, T4}SP Size = |A| = |T1| …= |T4|
Solution:A SP & T3 SP
Solution:A SP & T3 SP
SPILL_STORE(A);SPILL_LOAD(T3);
SPILL_STORE(A);SPILL_LOAD(T3);
SPILL_LOAD(A);SPILL_LOAD(A);
T3
DEF A
USE A
USE A
MOD A USE T3
USE T3
B1
B2
B3
B4
B5
B6
B7
B8
B9
B10
![Page 35: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/35.jpg)
- 35 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Dynamic replacement within scratch pad- Results for edge detection relative to static allocation -
![Page 36: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/36.jpg)
- 36 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Saving/Restoring Context Switch
Saving Context Switch (Saving) Utilizes SPM as a common region
shared all processes Contents of processes are copied
on/off the SPM at context switch Good for small scratchpads
P1
P2
P3
Scratchpad
Process P3Process P1Process P2
Saving/Restoring at context switch
Saving/Restoring at context switch
![Page 37: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/37.jpg)
- 37 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Non-Saving Context Switch
Process P1
Process P3
Process P2
Scratchpad
Process P1
Non-Saving Context Switch Partitions SPM into disjoint regions Each process is assigned a SPM
region Copies contents during initialization Good for large scratchpads
Process P2
Process P3
P1
P2
P3
![Page 38: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/38.jpg)
- 38 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Hybrid Context Switch
Hybrid Context Switch (Hybrid) Disjoint + Shared SPM regions Good for all scratchpads Analysis is similar to Non-Saving
Approach Runtime: O(nM3)Scratchpad
Process P1,P2, P3
Process P1
Process P2
Process P3
Process P1Process P2Process P3
P1
P2
P3Saving/Restoring at context switch
Saving/Restoring at context switch
![Page 39: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/39.jpg)
- 39 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Multi-process Scratchpad Allocation: Results
80
90
100
110
120
130
140
150
160
64 128 256 512 1024 2048 4096
Scratchpad Size (bytes)
En
erg
y C
on
su
mp
tio
n (
mJ
) Energy (SPA) Energy (Non-Saving)
Energy (Saving) CopyEnergy (Saving)
Energy (Hybrid) CopyEnergy (Hybrid)
27%
For small SPMs (64B-512B) Saving is better For large SPMs (1kB- 4kB) Non-Saving is better Hybrid is the best for all SPM sizes. Energy reduction @ 4kB SPM is 27% for Hybrid approach
edge detection,
adpcm, g721, mpeg
SPA: Single Process Approach
![Page 40: fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:](https://reader035.vdocuments.us/reader035/viewer/2022062512/55204d6349795902118b8e55/html5/thumbnails/40.jpg)
- 40 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Summary
High-level transformations Loop nest splitting Array folding
Impact of memory architecture on execution times & energy.
The SPM provides Runtime efficiency Energy efficiency Timing predictability
Achieved savings are sometimes dramatic, for example: savings of ~ 95% of the memory system energy