Embedded Systems in Silicon (TD5102)
Introduction and overview
Henk Corporaal
http://www.ics.ele.tue.nl/~heco/courses/EmbSystems
Technical University Eindhoven
DTI / NUS Singapore
2005/2006
H.C. TD5102 2
Contents
• Trends
• Platforms
• Application mapping
• Design flow
• Summary

Observation 1: The 3 Cs
• Convergence of the 3 Cs: computers, communications and consumer electronics
• The computer enters its 3rd phase: computing power, networking, intelligent processing
• The world is one network: all information and communication available wherever and whenever
We get a smart environment

Observation 2: Current design practice
[Figure: Y-chart (Gajski-Kuhn) with the axes Behaviour, Structure and Physical, and the abstraction levels System, Algorithm, R/T, Logic and Circuit]
• A design flow is a path in the Y-chart
• Down to RT level the flow is largely manual

Observation 3: Informal system specification
[Figure: a paper spec is split into tasks for system people, software people (C, ASM) and hardware people (VHDL, Verilog); the tasks meet again at integration]

Observation 4: Design productivity
• Yes, we can fabricate the ICs, but …
• Can we design them?
• Can we program them?
[Figure: complexity (log scale, 10^1 to 10^3) versus year (4 to 16), with a widening HW gap and SW gap]
Growth rates shown in the figure:
• Process technology: +58%
• HW design productivity: +21%
• SW productivity: +8%

Observation 5: More dynamic applications
[Figure: relative CPU load for 15 fps (0% to 1200%) for Video and 3D; the load varies by an order of magnitude]
[Figure: load (0% to 100%) per frame (IPPP ..., frames 0 to 300); the load varies by a factor of 2. Sequence: weather, VO1, binary shape, 10 Hz, 112 kbit/s, QCIF *]
* P. Kuhn, G. Diebel, “Complexity Analysis of the MPEG-4 VM 8.0,” ISO/IEC JTC1/SC29/WG11/MPEG97/m2862, Fribourg, October 1997

Observation 6: Memory problem
[Figure: performance (log scale, 1 to 1000) versus time (1980 to 2000). CPU (µProc): 55%/year (“Moore’s Law”). DRAM: 7%/year. The processor-memory performance gap grows 50%/year. Source: Patterson]

What do we learn from these observations?
We need:
• Short time-to-market
  – reuse
  – short design time
• Flexible solution
  – programmability
  – reconfigurability
• Scalability
• Low power
• Low cost
• QoS control
At sufficient performance!

Solution?
1. Platforms
  – HW and SW IP reuse
  – Standardization (interfaces)
  – QoS (quality of service) hooks
2. Advanced design flow for platforms
  – Raise abstraction level
  – Tool support
  – Modeling of power, cost, performance
  – Predictable design

Lecture 1: Introduction
• Trends
• Platforms
• Application mapping
• Design flow
• Summary

What is a platform?
A platform is a generic, but domain-specific, information processing (sub-)system.
In the future it will be available as a single chip (SoC) or package (SiP).

What is a platform?
• HW properties:
  – One or more programmable processors
  – Advanced memory organization
  – Programmable communication network
  – I/O (highly domain dependent)
• Possible extra HW features:
  – Reconfigurable logic
  – Domain-specific accelerators

What is a platform?
• SW components:
  – Standardized RTOS
  – Proper tooling for platform system design: compilers, models, exploration, debugging, simulation, …
• Possible extra SW features:
  – Middleware layer on top of the OS for features like:
    • QoS
    • Domain-specific protocols
    • Domain-specific SW interfaces
    • Control of reconfigurable logic
    • Library components
    • Distributed / active network processing
    • Billing
    • Security

Example platform: Philips Nexperia
Available in the billion-transistor era:
  – E.g. TI OMAP, Sony Cell, Philips Nexperia, TRIPS, Xilinx Virtex-4 Pro, …
[Figure: Philips Nexperia]

Future platforms
Example: Smart Networked Devices
[Figure: device stack. Hardware: radio, programmable hardware, reconfigurable hardware, accelerator hardware. Software: OS, library, virtual machine, protocols, network, multimedia (MPEG 21)]

Future platform: architecture concept
[Figure: hierarchy of levels 0, 1, …, N. Each level combines CPUs, accelerators and reconfigurable HW blocks with memory, I/O and a communication network]

Future platforms: NoC realization
[Figure: IP cores attached through network interfaces to an on-chip network]
IP isles:
• 32 RISC microprocessor ~ 20 Kgates
• MPEG decoding ~ 100 Kgates
• Wavelet filtering ~ 40 Kgates
• SRAM
• DRAM
• FPGA block

Lecture 1: Introduction
• Trends
• Platforms
• Application mapping
• Design flow
• Summary

Platform and platform design
[Figure: Applications, Platform and Enabling technologies, linked by design technology: SDT (system design technology) and PDT (platform design technology)]

What is the system designer's problem?
[Figure: Idea, Specification, Implementation]
For an application (idea/specification), find an efficient mapping/implementation on a given realization space, under given constraints (cost, P, E, T, E*D, throughput, #pins, ...)

A (single) processor: what does it look like inside?
[Figure: processor datapath (register file r0, r1, r2, …; function unit(s); load-store unit; data memory) and processor control (instruction memory, instruction register, decode logic)]

Mapping: placing operations in space and time
    d = a * b;
    e = a + d;
    f = 2 * b + d;
    r = f - e;
    x = z + y;
[Figure: Data Dependence Graph (DDG) of this code. Inputs: a, b, 2, z, y. Two multiplications, three additions and one subtraction produce d, e, f, r and x]

How to map these operations?
Architecture 1:
• One function unit
• All operations have single-cycle latency
[Figure: the DDG next to its schedule: one operation per cycle (*, *, +, +, -, +), cycles 1 to 6]

How to map these operations?
Architecture 2:
• One Add-Sub unit and one Mul unit
• All operations have single-cycle latency
[Figure: the DDG next to its schedule in two columns (Mul, Add-sub) on a cycle axis from 1 to 6]

How to map these operations?
Architecture 3:
• One Add-Sub unit and one Mul unit
• Add/Sub takes 1 cycle, Mul takes 2 cycles
[Figure: the DDG next to its schedule in two columns (Mul, Add-sub) on a cycle axis from 1 to 6]
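
The three mappings can be reproduced with a small greedy list scheduler. The following is only a sketch under stated assumptions: the names `Op`, `Machine`, `example_ddg` and `schedule_length` are hypothetical, operations are tried in program order, and function units are assumed not to be pipelined (the slides do not say whether the 2-cycle multiplier is).

```cpp
#include <algorithm>
#include <array>
#include <vector>

enum class Kind { Mul, AddSub };

struct Op {
    Kind kind;
    std::vector<int> preds;  // operations whose results this one consumes
};

// The running example, with t standing for the intermediate product 2 * b.
std::vector<Op> example_ddg() {
    return {
        {Kind::Mul, {}},         // 0: d = a * b
        {Kind::Mul, {}},         // 1: t = 2 * b
        {Kind::AddSub, {0}},     // 2: e = a + d
        {Kind::AddSub, {1, 0}},  // 3: f = t + d
        {Kind::AddSub, {3, 2}},  // 4: r = f - e
        {Kind::AddSub, {}},      // 5: x = z + y
    };
}

struct Machine {
    bool shared_unit;  // true: a single unit executes every operation
    int mul_latency;   // cycles a multiplication occupies its unit
    int add_latency;   // cycles an addition/subtraction occupies its unit
};

// Greedy list scheduling on non-pipelined units (assumption).
// Returns the schedule length in cycles.
int schedule_length(const std::vector<Op>& ops, const Machine& m) {
    const int n = static_cast<int>(ops.size());
    std::vector<int> finish(n, -1);      // first cycle the result is usable; -1 = not scheduled
    std::array<int, 2> unit_free{0, 0};  // first cycle each unit is free again
    int scheduled = 0, length = 0;
    for (int cycle = 0; scheduled < n; ++cycle) {
        for (int i = 0; i < n; ++i) {
            if (finish[i] >= 0) continue;
            const Op& op = ops[i];
            const bool ready = std::all_of(op.preds.begin(), op.preds.end(),
                [&](int p) { return finish[p] >= 0 && finish[p] <= cycle; });
            const int unit = (m.shared_unit || op.kind == Kind::Mul) ? 0 : 1;
            if (!ready || unit_free[unit] > cycle) continue;
            const int lat = (op.kind == Kind::Mul) ? m.mul_latency : m.add_latency;
            finish[i] = cycle + lat;
            unit_free[unit] = finish[i];
            ++scheduled;
            length = std::max(length, finish[i]);
        }
    }
    return length;
}
```

Under these assumptions, `schedule_length(example_ddg(), Machine{true, 1, 1})` yields 6 cycles for Architecture 1, `Machine{false, 1, 1}` yields 4 for Architecture 2, and `Machine{false, 2, 1}` yields 6 for Architecture 3. A pipelined multiplier or a different priority order could give other results.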

There are many mapping solutions
[Figure: scatter plot of solutions in the plane of cost versus execution time (T_execution). Each point x is a specific architecture and code schedule; the Pareto curve bounds the solution space]
Let $S$ be the solution space containing solutions $x = (x_i)$. Then:
$x$ is a Pareto point $\iff x \in S \;\wedge\; \forall y \in S,\ y \neq x:\ \exists i:\ x_i < y_i$
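
The Pareto definition above can be turned into a direct filter. This sketch assumes the definition is read as "x is a Pareto point if, compared with every other solution y, x is strictly better (smaller) in at least one objective"; the names `Point`, `better_somewhere` and `pareto_points` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

using Point = std::vector<double>;  // one value per objective, smaller is better

// True if x is strictly better than y in at least one objective.
bool better_somewhere(const Point& x, const Point& y) {
    for (std::size_t i = 0; i < x.size(); ++i)
        if (x[i] < y[i]) return true;
    return false;
}

// Indices of the Pareto points of s under the reading stated above.
// Quadratic in the number of solutions; fine for a sketch.
std::vector<std::size_t> pareto_points(const std::vector<Point>& s) {
    std::vector<std::size_t> result;
    for (std::size_t a = 0; a < s.size(); ++a) {
        bool pareto = true;
        for (std::size_t b = 0; b < s.size() && pareto; ++b)
            if (a != b && !better_somewhere(s[a], s[b])) pareto = false;
        if (pareto) result.push_back(a);
    }
    return result;
}
```

For the (cost, time) points (1,5), (2,3), (3,3), (4,1), (2,6), the filter keeps the first, second and fourth: (3,3) is no better than (2,3) in any objective, and (2,6) is no better than (1,5).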

Can we do better?
Much better!!
• transforming the specification
• a different architecture
• a different mapping
• speculative execution
• … be creative …

Transforming the specification (1)
Example: tree height reduction
Based on the associativity of the + operation: a + (b + c) = (a + b) + c
[Figure: a chain of three additions rebalanced into a tree of three additions with smaller height]
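
As a sketch of what tree height reduction buys, the two functions below add the same values as a left-to-right chain and as a balanced tree, and report the number of additions on the longest dependence path. The names `Reduced`, `sum_chain` and `sum_tree` are hypothetical, and integer operands are assumed so that reassociation cannot change the result.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Reduced {
    long long sum;
    int depth;  // additions on the longest dependence path
};

// Chain: ((v0 + v1) + v2) + ... ; every addition waits for the previous one.
Reduced sum_chain(const std::vector<long long>& v) {
    Reduced r{0, 0};
    for (std::size_t i = 0; i < v.size(); ++i) {
        r.sum = (i == 0) ? v[0] : r.sum + v[i];
        if (i > 0) ++r.depth;
    }
    return r;
}

// Balanced tree over v[lo, hi): (v0 + v1) + (v2 + v3) ...
// The two halves are independent and could execute in parallel.
Reduced sum_tree(const std::vector<long long>& v, std::size_t lo, std::size_t hi) {
    if (hi == lo) return {0, 0};
    if (hi - lo == 1) return {v[lo], 0};
    const std::size_t mid = lo + (hi - lo) / 2;
    const Reduced l = sum_tree(v, lo, mid);
    const Reduced r = sum_tree(v, mid, hi);
    return {l.sum + r.sum, 1 + std::max(l.depth, r.depth)};
}
```

For four operands the chain has depth 3 and the tree depth 2. The number of additions is the same, so the gain only materializes when enough adders are available.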

Transforming the specification (2)
Original:
    d = a * b;
    e = a + d;
    f = 2 * b + d;
    r = f - e;
    x = z + y;
Transformed:
    r = f - e = 2*b + d - (a + d) = 2*b - a;
    x = z + y;
[Figure: resulting DDG: r = (b << 1) - a and x = z + y]
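
The algebraic simplification can be checked with a short sketch. The function names and the `Outputs` struct are hypothetical; the shift form assumes a non-negative `b`, since left-shifting a negative `int` is only well defined from C++20 on.

```cpp
struct Outputs {
    int r;
    int x;
};

// Original specification: two multiplications, three additions/subtractions for r.
Outputs original_spec(int a, int b, int z, int y) {
    const int d = a * b;
    const int e = a + d;
    const int f = 2 * b + d;
    return {f - e, z + y};
}

// Transformed specification: d cancels out and 2*b becomes a shift.
// Assumes b >= 0 (see the note above).
Outputs transformed_spec(int a, int b, int z, int y) {
    return {(b << 1) - a, z + y};
}
```

Both versions compute the same `r` and `x` as long as the intermediate product `a * b` does not overflow in the original; the transformed version no longer needs a multiplier for `r` at all.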

Changing the architecture: adding more complex units
[Figure: a tree of three 2-input additions replaced by a single 4-input adder]
Why is this faster?

Changing the architecture: adding more complex units
In the extreme case, put everything into one unit!
Spatial mapping: no control flow

More complex control flow
Program part:
    -a- ;
    if cond
    then -b-
    else -c- ;
    -d- ;
[Figure: Control Flow Graph (CFG): -a- ends in the test cond?, which leads to either -b- or -c-; both continue with -d-]

Mapping the CFG example: three options. Which is best?
Option 1:
    -a-
    br c
    -b-
    jmp d
    -c-
    -d-
Option 2:
    -a-
    br b
    -c-
    jmp d
    -b-
    -d-
Option 3:
    -a-
    br c
    -c-
    jmp d
    -b-
    -d-

Why not remove the control flow?

If-conversion shortens the schedule
Before:
    -a-
    br c
    -b-
    jmp d
    -c-
    -d-
After:
    -a-
    cond: -b-
    !cond: -c-
    -d-
Using guarded instructions like:
    r3: add r1,r2,r5
    !r3: mul r4,r5,#3
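
C++ has no guarded instructions, so if-conversion can only be approximated at source level. The sketch below is an illustration, not the slide's instruction set: both arms are computed and a select picks the result, which a compiler may turn into a conditional move or predicated code. The function names and operand names are hypothetical.

```cpp
// Branching form: if cond then -b- else -c-.
int with_branch(int r1, int r2, int r4, bool cond) {
    int result;
    if (cond) {
        result = r1 + r2;   // -b-
    } else {
        result = r4 * 3;    // -c-
    }
    return result;
}

// If-converted form: no branch in the source, the guard only selects a value.
// This is safe here because neither arm has side effects.
int if_converted(int r1, int r2, int r4, bool cond) {
    const int b = r1 + r2;  // corresponds to the operation guarded by cond
    const int c = r4 * 3;   // corresponds to the operation guarded by !cond
    return cond ? b : c;
}
```

The trade-off is that both arms now consume issue slots and energy, so the conversion pays off mainly when the arms are short and the branch is hard to predict.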

Speculative execution makes it even shorter!
Before:
    -a-
    br c
    -b-
    jmp d
    -c-
    -d-
After: -a-, -b- and -c- execute in parallel, followed by -d-
Why not execute -d- in parallel as well?

However: real life is much more complex
E.g. MPEG-4 (multimedia)
Huge requirements:
• > 10 GOP/s
• > 6 GB/s
• > 10 MB storage
Software specification:
– more than 200 000 lines of C
– hundreds of files
– written by approx. 80 teams

Can we handle this?
Today's implementations:
– small images
– decoding only
– not real-time
– several W
– single task
– limited dynamism
Wanted features:
– large images (HDTV)
– encoding and decoding
– real-time
– 100 mW (mobile)
– multiple tasks
– dealing with dynamism

Lecture 1: Introduction
• Trends
• Platforms
• Application mapping
• Design flow
• Summary

Embedded system design
How to map your application graph AG(L,A,D) to the hardware graph RG(L,N,C):
• L: design level (e.g. architecture, implementation or realization level)
• A: application components (e.g. tasks, operations, data structures)
• D: dependences between application components
• N: hardware components (e.g. processors, ASICs, FPGAs, memories)
• C: connections between hardware components

Abstraction levels
System specification level, inter-level transformation and level specification languages:
• Idea
• Level 0: Requirements (English)
  – is modeled by
• Level 1: Architecture (ES/RT-UML, Esterel, SDL)
  – is implemented by
• Level 2: Implementation (C++, Java, C, VHDL, SystemC)
  – compiles into
• Level 3: Realization (machine code, hardware modules)
[Figure: the levels are drawn with an exploration search area]

Design space exploration
[Figure: a design point at level n-1 is mapped by the level transformation LT(n-1,n) into the realization space at level n. Exploration at level n applies design transformations within an exploration search area; a cost curve marks the global optimum]

Design space exploration framework: another Y-chart
[Figure: the software description AG(L,A,D) and the hardware description RG(L,N,C), each subject to design transformations, feed the Mapper & Scheduler, which produces a design point. Analysis of the design point yields statistics for Exploration, which steers the design transformations and the mapping]

Design flow steps and constraints
[Figure: refinement steps (transformations) lead from the idea at a high abstraction level to the realization at a low abstraction level, subject to architecture / platform constraints]

In which order should we perform the steps?
[Figure: decision trees for the two orders: step n before step n+1, and step n+1 before step n]

Well-known phase ordering examples
• Concurrency versus data management
  – e.g. loop partitioning versus array partitioning for a multiprocessor
• Scheduling versus register allocation
• Logic synthesis versus placement and routing

Rule of thumb!
• Perform the steps with the biggest impact first
• Biggest impact:
  – depends on your interest (= cost function)
  – min. E, P, E*D, D, area, #pins, ...

Phase ordering example: why fix data storage/transfer before concurrency management issues?
Recursive image processing algorithm on local neighborhoods:
    (for i : 0 .. I-1) ::
      (for j : 0 .. J-1) ::
        img[i][j] = f(img[i][j-k], old_img[i][j]);
[Figure: image of I rows by J columns]

Why fix data storage/transfer before concurrency management issues?
Unrolling the outer loop (i) M times requires:
• M J-word FIFOs (image lines)
• M data paths
[Figure: image of I rows by J columns; chip area 14.4 mm² (0.7 µm)]

Why fix data storage/transfer before concurrency management issues?
Unrolling the inner loop (j), limited by k, requires only M - 1 buffer registers:
    (i : 0 .. I-1) ::
      (j : 0 .. (J div 2)-1) ::
        img[i][2j]   = f(img[i][2j-k],   old_img[i][2j]);
        img[i][2j+1] = f(img[i][2j+1-k], old_img[i][2j+1]);
[Figure: image of I rows by J columns]

Proposed System Design Methodology
[Figure: System Specification, then system-level exploration and refinement, then optimized algorithms (C/C++ specification), then SW/HW partitioning/exploration. The SW branch runs through traditional (parallelizing) compiler steps to code per (parallel) processor; the HW branch runs from the architecture through HW synthesis steps to structural VHDL code]

[Figure: stages of the design flow]
• Concurrent OO spec
• Remove OO overhead
• Dynamic memory mgmt
• Task concurrency mgmt
• Static memory mgmt
• Address optimization
• SW/HW co-design
• SW design flow and HW design flow

Object-based versus object-oriented
Object-oriented:
[Figure: classes Switch (get_state(), update()), Switchable (switch_on(), switch_off()), Button (get_state()) and Lamp (switch_on(), switch_off()); Switch reaches its target through the member switchable]
• calls through function pointer
• cannot be inlined
Object-based:
[Figure: classes Button (get_state(), update()) and Lamp (switch_on(), switch_off()); Button reaches its target through the member lamp]
• direct calls
• can be inlined
=> OO is good for specification, not for implementation
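
The difference between the two styles can be sketched in C++ as follows. The class names follow the slide, but the `update(bool)` signature, the `lit` member and the name `PlainLamp` are assumptions made for illustration.

```cpp
// Object-oriented variant: the target is reached through an abstract interface.
struct Switchable {
    virtual ~Switchable() = default;
    virtual void switch_on() = 0;
    virtual void switch_off() = 0;
};

struct Lamp : Switchable {
    bool lit = false;
    void switch_on() override { lit = true; }
    void switch_off() override { lit = false; }
};

class Switch {
public:
    explicit Switch(Switchable& target) : switchable_(target) {}
    // Virtual dispatch: the callee is selected at run time through the vtable.
    void update(bool pressed) {
        if (pressed) switchable_.switch_on(); else switchable_.switch_off();
    }
private:
    Switchable& switchable_;
};

// Object-based variant: the concrete type is fixed at compile time.
struct PlainLamp {
    bool lit = false;
    void switch_on() { lit = true; }
    void switch_off() { lit = false; }
};

class Button {
public:
    explicit Button(PlainLamp& lamp) : lamp_(lamp) {}
    // Direct call: the compiler sees the callee and may inline it.
    void update(bool pressed) {
        if (pressed) lamp_.switch_on(); else lamp_.switch_off();
    }
private:
    PlainLamp& lamp_;
};
```

Both variants behave identically for a single lamp. The object-oriented one stays open to other `Switchable` targets, at the price of an indirect call on every `update`.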

Whole-system optimization techniques
• Aggressive use of traditional inter-procedural techniques
  – in the embedded world you often know the whole application!
• OO-specific optimization
• Data allocation optimization

Example: data inlining
class A (holds a pointer to a B; class C derives from class B):
    B *b;
    A() { b = new C; }
    ~A() { delete b; }
    void f() { b->g(); }
class A’ (holds the C object directly):
    C b;
    A(): b() {}
    ~A() {}
    void f() { b.g(); }
Eliminate:
• dynamic allocation
• pointer de-reference
• polymorphic calls

Example: dynamic allocation removal
Before:
    void teq(…, short size, …) {
      float* Ryy;
      Ryy = new float[size];
      … teq computation …
      delete[] Ryy;   // array form of delete, matching new[]
    }
    teq(…, 64, …);
    …
    teq(…, 64, …);
    …
After (every call passes size 64):
    void teq(…, …) {
      float Ryy[64];
      … teq computation …
    }
    teq(…, …);
    …
    teq(…, …);
    …
• Eliminate dynamic allocation
• Re-use stack memory already needed for other call tree branches

ADSL result: footprint -33%
Total memory footprint (code + data); chart axis marks 200 kB and 400 kB:
    Unoptimized                               106%
    ARM C++ optimized (-O2 -Ospace)           100%
    Inlining, dead code, constant prop.        83%
    Virtual call elimination                   82%
    Data alloc. optim.                         67%

• Data type refinement
• Virtual memory management

Data type refinement
    ATM_cell * Data_In;
    Association_Table * Routing_Table;

    Routing_Table = new Association_Table();
    Data_In = new ATM_cell();

    if ( Routing_Table->Lookup(Data_In) ) ...
[Figure: implementation alternatives compared on a power function and an area function, both on log scales from 10^0 to 10^4]

Data type refinement: Array
    ATM_cell * Data_In;
    Array * Routing_Table;

    Routing_Table = new Array ();
    Data_In = new ATM_cell();

    if ( Routing_Table->Lookup(Data_In) ) ...
[Figure: implementation alternatives compared on a power function and an area function, both on log scales from 10^0 to 10^4. Alternative shown: Array (AR) of data entries]

Data type refinement: Linked List
    ATM_cell * Data_In;
    Linked_List * Routing_Table;

    Routing_Table = new Linked_List ();
    Data_In = new ATM_cell();

    if ( Routing_Table->Lookup(Data_In) ) ...
[Figure: implementation alternatives compared on a power function and an area function, both on log scales from 10^0 to 10^4. Alternatives shown: Array (AR) of data entries; Linked List (LL) of key/data nodes]

Data type refinement: Binary Tree
    ATM_cell * Data_In;
    Binary_Tree * Routing_Table;

    Routing_Table = new Binary_Tree ();
    Data_In = new ATM_cell();

    if ( Routing_Table->Lookup(Data_In) ) ...
[Figure: implementation alternatives compared on a power function and an area function, both on log scales from 10^0 to 10^4. Alternatives shown: Array (AR) of data entries; Linked List (LL) of key/data nodes; Binary Tree (BT) of key/data nodes]

Going from specification concurrency to implementation concurrency

Modelling MTG*

TCM transformations
Why transformations?
– shift existing Pareto curves
– create new points on the Pareto curves
– improve available task-level parallelism
[Figure: two plots of power versus cycle budget showing Pareto curves]

TCM Transformations
[Figure: shared memory area (MA) versus cycle budget for tasks T1 to T6 under different mappings:
• independent, dynamic tasks assigned to 1 processor
• tasks freely assigned to 2 processors (P1, P2)
• task order constrained to reduce memory requirements (partial order constraints, a ‘conflict’ on HW1, less memory)]

DTSE: data transfer and storage exploration

Static data memory management (DMM)
[Figure: chip with processor data paths, L1 cache, L2 cache, cache & bank recombine, local latch 1 + bank 1 … local latch N + bank N, and off-chip SDRAM]
1. Reduce redundant transfers
2. Introduce locality
3. Exploit memory hierarchy
4. Avoid N-port memories
5. Meet real-time constraints
6. Exploit limited life-time and data layout freedom

DMM: how to improve locality?
[Figure: chip with processor data paths, L1 cache, L2 cache, cache & bank recombine, local latch 1 + bank 1 … local latch N + bank N, and off-chip SDRAM]
Introduce locality
Before:
    FOR i:=1 TO N DO B[i]:=f(A[i]);
    FOR i:=1 TO N DO C[i]:=g(B[i]);
After:
    FOR i:=1 TO N DO {
      B[i]:=f(A[i]);
      C[i]:=g(B[i]);
    }
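
The loop merging above can be sketched in C++. The bodies of `f` and `g` are placeholders invented for the example, and the fused version goes one step further than the slide by keeping the intermediate value in a scalar, which assumes that nothing else reads `B`.

```cpp
#include <cstddef>
#include <vector>

int f(int a) { return a + 1; }  // placeholder for the slide's f
int g(int b) { return b * 2; }  // placeholder for the slide's g

// Two loops: all of B is written to memory before any element is read back.
std::vector<int> separate_loops(const std::vector<int>& A) {
    std::vector<int> B(A.size()), C(A.size());
    for (std::size_t i = 0; i < A.size(); ++i) B[i] = f(A[i]);
    for (std::size_t i = 0; i < A.size(); ++i) C[i] = g(B[i]);
    return C;
}

// Fused loop: each intermediate value is consumed right after it is produced,
// so it can stay in a register instead of travelling through array B.
std::vector<int> fused_loop(const std::vector<int>& A) {
    std::vector<int> C(A.size());
    for (std::size_t i = 0; i < A.size(); ++i) {
        const int b = f(A[i]);
        C[i] = g(b);
    }
    return C;
}
```

Both functions return the same `C`. For large N, the first version writes every `B[i]` to memory and reads it back much later, which is the poor locality the transformation removes.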

Exploiting Memory Hierarchy
[Figure: processor data paths with a register file and three memories M''. Relative power per access: P = 0.01, P = 0.1 and P = 1. Share of accesses: #A = 100%, #A = 10% and #A = 1%]
P (before) = 100%
P (after) = 100% * 0.01 + 10% * 0.1 + 1% * 1 = 3%
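
Written out, the power estimate is a sum over the memory layers of the share of accesses times the relative power per access. The reading below is an assumption: before the hierarchy is introduced, all accesses go to the largest memory (P = 1); afterwards, the smallest layer serves 100% of the accesses, the middle layer 10% and the largest 1%. The index k over layers is introduced here for illustration.

```latex
\[
P_{\text{before}} = 100\% \cdot 1 = 100\%
\]
\[
P_{\text{after}} = \sum_{k} \#A_k \cdot P_k
  = 100\% \cdot 0.01 + 10\% \cdot 0.1 + 1\% \cdot 1
  = 1\% + 1\% + 1\% = 3\%
\]
```

Each layer contributes the same 1% here, so the estimate is balanced: reducing the traffic to any single layer further would improve the total by at most one third.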

How to avoid N-port memories?
[Figure: the accesses R(A), R(B) and W(C) scheduled for different memory assignments: A, B and C together in one memory, versus A and C in one memory and B in another]

Algebraic Transformations and Aggressive Code Hoisting for Expression Elimination
Initial:
    for (y=0..9; y++) {
      for (x=0..99; x++) {
        if (x>1) A[ (y%3)*3 + (x-2)%3 ] = ...
        if (x>4) ... = A[ (y%3)*3 + (x-5)%3 ];
      }
    }
Optimised-1st (3x lower cost):
    for (y=0..9; y++) {
      v_y = (y%3)*3;
      for (x=0..99; x++) {
        v_yx = (x-2)%3 + v_y;
        if (x>1) A[v_yx] = …;
        if (x>4) … = A[v_yx];
      }
    }

Modulo substitution for piece-wise linear addressing
Optimised-1st:
    for (y=0..9; y++) {
      v_y = (y%3)*3;
      for (x=0..99; x++) {
        v_yx = (x-2)%3 + v_y;
        if (x>1) A[v_yx] = …;
        if (x>4) … = A[v_yx];
      }
    }
Optimised-2nd (2x lower cost):
    for (p_y=0, y=0..9; y++) {
      if (p_y>=9) p_y -= 9;
      for (p_x=1, x=0..99; x++) {
        if (p_x>=3) p_x -= 3;
        v_yx = p_x + p_y;
        if (x>1) A[v_yx] = …;
        if (x>4) … = A[v_yx];
        p_x++;
      }
      p_y += 3;
    }
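
That the two optimisation steps preserve the addresses can be checked by brute force. The sketch below assumes that the slide's loop bounds `0..9` and `0..99` are inclusive; the function name `count_mismatches` is hypothetical. It compares the original index expressions with the hoisted form and with the modulo-free incremental form, only where the slide's guards allow an access.

```cpp
// Returns the number of iterations in which the three index computations disagree.
int count_mismatches() {
    int mismatches = 0;
    int p_y = 0;                          // tracks (y % 3) * 3 without a modulo
    for (int y = 0; y <= 9; ++y) {
        if (p_y >= 9) p_y -= 9;
        const int v_y = (y % 3) * 3;      // first step: hoisted out of the x loop
        int p_x = 1;                      // tracks (x - 2) mod 3, which is 1 at x = 0
        for (int x = 0; x <= 99; ++x) {
            if (p_x >= 3) p_x -= 3;
            const int incremental = p_x + p_y;                   // second step
            if (x > 1) {
                const int write_idx = (y % 3) * 3 + (x - 2) % 3; // initial code
                const int hoisted = (x - 2) % 3 + v_y;           // first step
                if (write_idx != hoisted || write_idx != incremental) ++mismatches;
            }
            if (x > 4) {
                const int read_idx = (y % 3) * 3 + (x - 5) % 3;  // initial code
                if (read_idx != incremental) ++mismatches;
            }
            ++p_x;
        }
        p_y += 3;
    }
    return mismatches;
}
```

The read and write indices coincide because `x - 5` and `x - 2` differ by exactly 3, so a single `v_yx` serves both accesses. The incremental form replaces every `%` by a compare and a subtract, which is the cost reduction the slide refers to.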

What do we gain? Running example: cavity detection
• Application domain:
  – Computed tomography (CT) in medical imaging
• Algorithm:
  – Cavity detection in CT scans
  – Detect dark regions in successive images
  – Indicate cavity in brain
Bad news for the owner of the brain

Starting point
Reference (conceptual) C code for the algorithm:
– all functions: image_in[N x M] at time t-1 -> image_out[N x M] at time t
– new value of a pixel depends on its neighbors
– neighbor pixels are read from background memory
– approximately 110 lines of C code (ignoring file I/O etc.)
– experiments with N x M = 640 x 400 pixels
– straightforward implementation: 6 image buffers
[Figure: processing stages: Gauss Blur x, Gauss Blur y, Compute Edges, Reverse, Detect Roots, Max Value]

Cavity Detector Results
[Figure: bar chart (scale 0 to 600) of accesses, size and cycles for each version: Original, DF trafo, Loop trafo, Data reuse, In-place, Data layout, ADOPT - modulo, ADOPT - rest]

Lecture 1: Introduction
• Trends
• Platforms
• Application mapping
• Design flow
• Summary

Summary
• Billions of embedded systems, everywhere!!!
• Multimedia applications become extremely complex and dynamic
• Time-to-market pressure
• Solution:
  – Platforms as design target (raise abstraction level)
  – Advanced embedded system design flow needed

Traditional Design Methodology
[Figure: System Specification, then SW/HW partitioning/exploration. The SW branch runs through (SW system exploration) to an optimized SW spec (C specification), then through traditional (parallelizing) compiler steps to code per (parallel) processor. The HW branch runs through HW system exploration to an optimized HW spec (VHDL specification), then from the architecture through HW synthesis steps to structural VHDL code]

Proposed System Design Methodology
[Figure: System Specification, then system-level exploration and refinement (marked “Our main focus”), then optimized algorithms (C/C++ specification), then SW/HW partitioning/exploration. The SW branch runs through traditional (parallelizing) compiler steps to code per (parallel) processor; the HW branch runs from the architecture through HW synthesis steps to structural VHDL code]