EE382V: Embedded Sys Dsgn and Modeling
Lecture 10
© 2014 A. Gerstlauer 1
EE382V: Embedded System Design and Modeling
Andreas Gerstlauer
Electrical and Computer Engineering
University of Texas at Austin
gerstl@ece.utexas.edu
Lecture 10 – Computation Modeling & Refinement
Lecture 10: Outline
• Processor layers
• Application
• Task/OS
• Firmware
• Hardware
• Processor synthesis
• Software synthesis
• Hardware synthesis
System-On-Chip Environment (SCE)
[Figure: SCE design flow. A specification enters system design (specify-explore-refine), which draws on PE/CE/bus models and design decisions to produce architecture and TLM system models. Hardware synthesis (RTL DB) and software synthesis (SW DB) then yield the implementation model: RTL and Verilog for hardware (HWn.v) and ISS-executed binaries for software (CPUn.bin). Example platform: CPU and DSP with OS, drivers, HAL and ISRs, custom HW, IP, memory, and a bridge between the CPU bus and the DSP bus.]
Computation modeling and refinement
General Processor Micro-Architecture
• Basic computation component is a processor (PE)
• Programmable, general-purpose software processor (CPU)
• Programmable special-purpose processor (e.g. DSPs)
• Application-specific instruction set processor (ASIP)
• Custom hardware processor
Functionality and timing (and power and …)
[Figure: PE consisting of a controller and a datapath, linked by control signals and status lines, with a bus interface and a clock (CLK, ∆t).]
Computation Modeling (1)
• Structural RTL models
Sub-cycle accurate
[Figure: structural RTL of a hardware processor (controller with state, next-state logic and output logic; datapath with register file, memory, functional unit FU1 and bus interface) and of a software processor (CPU controller with PC, IR, fetch and decode; datapath with register file, ALU, load/store unit, and data & program memory). Both are clocked by CLK.]
Computation Modeling (2)
• Behavioral RTL models (FSMD)
• Instruction-set simulation (ISS) models
• Purely functional or micro-architectural
Cycle or timing accurate
[Figure: instruction-set simulation (ISS) of a CPU, clocked by CPU_CLK, executing a binary made up of application, RTOS and HAL; FSMD model of a HW block, clocked by HW_CLK.]
Computation Modeling (3)
• Host-compiled models
• Source-level application model
• Back-annotate timing and other metrics
• Abstract OS and processor models
• Transaction-level model (TLM) backplane
• C-based discrete-event simulation kernel [SpecC, SystemC]
Fast and accurate full-system simulation
Source: A. Gerstlauer. "Host-Compiled Simulation of Multi-Core Platforms," RSP10.
Host-Compiled Computation Layers
• Application
• Process execution (C code)
• Execution timing
• OS & processor
• Operating system
– Real-time multi-tasking (RTOS model)
– Bus drivers (C code)
• Hardware abstraction layer (HAL)
– Interrupt handlers
– Media accesses
• Processor hardware
– Bus interfaces (I/O state machines)
– Interrupt suspension and timing
[Figure: processes P1 and P2 running on top of OS, drivers (Drv), HAL and ISR inside the CPU model, connected to the bus and to interrupts.]
Process B1() {
  …
  waitfor(15000);
  …
  waitfor(25000);
  …
};
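The effect of back-annotated waitfor() calls on logical time can be pictured with a small host-side sketch. This is Python, not SpecC: the `Kernel` class, `spawn`, and the second process `b2` are hypothetical names made up for illustration, and each `yield n` stands in for `waitfor(n)` in a toy discrete-event kernel.

```python
import heapq

class Kernel:
    """Toy discrete-event kernel: each process is a generator that yields delays."""
    def __init__(self):
        self.now = 0
        self._queue = []   # entries: (wake_time, seq, name, generator)
        self._seq = 0      # tie-breaker so generators are never compared
        self.log = []      # (time, message) recorded when a process ends

    def spawn(self, name, gen):
        heapq.heappush(self._queue, (self.now, self._seq, name, gen))
        self._seq += 1

    def run(self):
        while self._queue:
            self.now, _, name, gen = heapq.heappop(self._queue)
            try:
                delay = next(gen)            # run natively up to the next waitfor()
            except StopIteration:
                self.log.append((self.now, name + " done"))
                continue
            heapq.heappush(self._queue, (self.now + delay, self._seq, name, gen))
            self._seq += 1
        return self.now

def b1():
    # functional code of the first block would execute here at host speed
    yield 15000    # waitfor(15000)
    # functional code of the second block
    yield 25000    # waitfor(25000)

def b2():
    yield 30000    # hypothetical second process, for illustration only

kernel = Kernel()
kernel.spawn("B1", b1())
kernel.spawn("B2", b2())
end_time = kernel.run()
```

Both processes advance concurrently in logical time here, which corresponds to the unscheduled application model; serializing them on one CPU is the job of the OS layer.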
Application Layer
• High-level, abstract programming model
• Hierarchical process graph
– ANSI C leaf processes
– Parallel-serial composition
• Abstract, typed inter-process communication
– Channels
– Shared variables
Timed simulation of application functionality (SLDL)
• Back-annotate timing
– Estimation or measurement (trace, ISS)
– Function or basic block level granularity
• Execute natively on simulation host
– Discrete event simulator
– Fast, native compiled simulation
[Figure: processes B1, B2 and B3 with channels C1 and C2 mapped to the CPU, plotted over logical time.]
...
void f() {
  waitfor(5);
  ...
}
...
Retargetable Back-Annotation
• Back-annotation flow
• Intermediate representation (IR)
– Frontend optimizations [gcc]
– IR to C conversion
• Target binary matching
– Cross-compiler backend [gcc]
– Control-flow graph matching
• Timing and power estimation
– Micro-architecture description language (uADL) or RTL
– Cycle-accurate timing
– Reference power model [McPAT]
• Back-annotation into IR
– Basic block level
[Figure: back-annotation flow. C source code passes through the gcc frontend optimizations to the intermediate representation (IR). IR-to-C conversion produces compilable intermediate code, and the backend produces the binary. Graph matching between IR and binary yields a mapping table; basic block timing and energy characterization (uADL ISS, McPAT) augments it; the timing and energy back-annotator then produces the host-compiled (HC) model.]
C source code:
a=b=c=0;
if(a<=0) { a=1; c=2; }
……
printf(…);
Compilable intermediate code:
bb_2: a = 1; b = 0; c = 2; goto bb_7;
bb_3: …..
bb_7: printf(…);
Host-compiled (HC) model:
bb_2: a = 1; b = 0; c = 2; incrDelay(15); incrEnergy(2); bb = BB_2; goto bb_7;
bb_3: ….. incrDelay(delay[bb][BB_3]); incrEnergy(energy[bb][BB_3]); bb = BB_3;
…..
Source: S. Chakravarty, Z. Zhao, A. Gerstlauer. “Automated, Retargetable Back-Annotation for Host-Compiled Performance and Power Modeling," CODES+ISSS’13.
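The last step of this flow, inserting delay and energy calls into the IR at basic block level, can be sketched as follows. This is only an illustration in Python, not the tool from the cited paper: the `annotate` function, the list-of-statements IR format and the table layout are assumptions, and the `printf` argument is a stand-in for the elided one on the slide.

```python
def annotate(blocks, table):
    """Insert incrDelay()/incrEnergy() calls into each basic block.

    blocks: list of (label, [statement strings]) in IR order (assumed format).
    table:  {label: (delay_cycles, energy_units)} from characterization.
    Calls are placed before a trailing 'goto' so they always execute.
    """
    out = []
    for label, stmts in blocks:
        body = list(stmts)
        tail = []
        if body and body[-1].startswith("goto "):
            tail = [body.pop()]          # keep the jump as the last statement
        if label in table:
            delay, energy = table[label]
            body += [f"incrDelay({delay});", f"incrEnergy({energy});"]
        out.append(f"{label}: " + " ".join(body + tail))
    return "\n".join(out)

blocks = [
    ("bb_2", ["a = 1;", "b = 0;", "c = 2;", "goto bb_7;"]),
    ("bb_7", ['printf("done");']),       # placeholder argument
]
table = {"bb_2": (15, 2)}                # values taken from the slide's bb_2 example
annotated = annotate(blocks, table)
```

Blocks without a table entry are left untouched in this sketch; a real flow would need an entry for every matched block.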
Binary-to-Source/IR Mapping
• Compiler optimizations
• Frontend
– Control flow optimizations
• Backend
– Instruction scheduling/percolation
Mismatches
– Capture frontend by annotating at IR, not source
– Establish binary-IR mapping for back-annotation
Graph matching heuristic
• Synchronized, recursive depth-first traversal
– Compatibility: loop and branch nesting levels
– Cost: sum of unmatched nodes in subgraphs rooted at node
– Return least-cost mapping between all successors (incl. skips)
• Resolve ambiguities using debug information
Timing/Energy Characterization
• Basic block characterization
• Execution depends on state
– Pipeline stalls in case of hazards
– Pipeline overlaps in multi-issue
• Pairwise characterization
– Over all immediate predecessors
– Across function hierarchy
• Timing & energy
– First-to-last instruction fetch time
– Resource utilization statistics
• Back-annotation into IR
• Path-dependent metrics
– Capture static branch prediction
Annotated IR:
bb_2: a = 1; b = 0; c = 2;
  wait(15); energy(2);
  goto bb_7;
bb_3: …..
  if (prev_bb == 3) { wait(25); energy(5); }
  else if (prev_bb == 1) { wait(30); energy(6); }
  …..
bb_7: printf(…);
[Figure: BB3 reached from BB1 (exec flow 1) or from BB2 (exec flow 2), with system state SS = A or SS = B; the system state (SS) covers registers, memory and pipeline.]
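Pairwise characterization amounts to a lookup keyed by the (predecessor, block) pair. The Python sketch below illustrates that idea only; the class name `PathAnnotator`, the fallback table, and the numbers for block 1 are made up, while the (25, 5) and (30, 6) entries for block 3 mirror the annotated IR above.

```python
class PathAnnotator:
    """Accumulates path-dependent delay and energy per executed basic block."""
    def __init__(self, pairwise, fallback):
        self.pairwise = pairwise    # {(prev_bb, bb): (cycles, energy)}
        self.fallback = fallback    # {bb: (cycles, energy)} if the pair is unknown
        self.prev_bb = None
        self.cycles = 0
        self.energy = 0

    def enter(self, bb):
        cycles, energy = self.pairwise.get((self.prev_bb, bb), self.fallback[bb])
        self.cycles += cycles
        self.energy += energy
        self.prev_bb = bb           # remembered for the next block's lookup

pairwise = {(3, 3): (25, 5), (1, 3): (30, 6)}
fallback = {1: (10, 1), 2: (15, 2), 3: (30, 6)}   # block 1 values are hypothetical
sim = PathAnnotator(pairwise, fallback)
for bb in (1, 3, 3):                # execution path BB1 -> BB3 -> BB3
    sim.enter(bb)
```

The same block 3 costs 30 cycles when entered from block 1 and 25 cycles when it loops back onto itself, which is how the annotation captures pipeline state carried across block boundaries.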
Source-Level Simulation: Speed
[Figure: simulation speed in MIPS (0 to 5000) for SHA, ADPCM and CRC32 (small and large data sets) and Sieve, comparing host-compiled, IR and source models.]
• Automatic timing and energy back-annotation
• Telecom & security applications [MiBench]
– SHA, ADPCM, CRC32 & custom Sieve of Eratosthenes
– Small and large data sets, 10 to 700 million instr.
• One-time back-annotation
– 3 min. to 3 s BA runtime
Back-annotated source vs. traditional ISS
2000 MIPS vs. 0.8 MIPS
Close to native source-level speeds
[Figure: runtime (0 s to 800 s) per benchmark, HC+BA runtime vs. ISS+McPAT runtime.]
Source-Level Simulation: Accuracy
• Source-level power and performance simulation
• Single- (z4-like) and dual-issue (z6-like) e200 PowerPC
– No cache, static branch prediction
• Compare against cycle-accurate reference [ISS+McPAT]
>99% average timing and energy accuracy @ 2000 MIPS
Integrate back-annotation of other metrics
Performance, energy, reliability, power, thermal (PERPT)
[Figure: Timing Accuracy. Error [%] on a log scale (0.000001 to 10) for SHA, ADPCM and CRC32 (small and large data sets) and Sieve, on z4 and z6.]
[Figure: Energy Accuracy. Error [%] on a log scale (0.000001 to 10) for the same benchmarks, on z4 and z6.]
OS Modeling
• High-level RTOS abstraction
• Specification is fast but inaccurate
– Native execution, truly concurrent model
• Traditional ISS-based validation infeasible
– Accurate but slow (esp. in multi-processor context), requires full binary
Model of operating system (task interleaving in time)
High accuracy but small overhead at early stages
Focus on key effects, abstract unnecessary implementation details
Model all concepts: Multi-tasking, scheduling, preemption, interrupts, IPC
[Figure: design flow from specification through system-level model to implementation.]
Source: A. Gerstlauer, H. Yu, D. Gajski. "RTOS Modeling for System-Level Design," DATE03.
[Figure: application tasks T1 and T2 communicating over channels, running on the RTOS model on top of the SLDL.]
Operating System Layer
• Scheduling
• Group processes into tasks
– Static scheduling
• Schedule tasks
– Dynamic scheduling, multitasking
– Preemption, interrupt handling
– Task communication (IPC)
Scheduling refinement
• Flatten hierarchy
• Reorder behaviors
OS refinement
• Insert OS model
• Task refinement
• IPC refinement
[Figure: application processes P1 and P2 on the OS layer, on top of the SLDL.]
Abstract RTOS Model
• Emulate the sequential execution of concurrent tasks
• Task scheduler
– Maintain task queues, determine task(s) to run & perform context switch
• Timing model
– Simulate back-annotated task delays, call scheduler to allow for preemptions
RTOS Model Implementation
• RTOS model
• OS, task, event management
– Descriptors & queues
• Context switching
– Block all but active task on SLDL level
• Scheduling
– Select and dispatch task based on algorithm
• Preemption
– Allow rescheduling at simulation time increases
• Event handling
– Remove task temporarily from OS while waiting for SLDL event
RTOS model library
• RTOS models for different scheduling strategies
– Round robin, priority based
• Parametrizable
– Task parameters (priorities)
channel OS implements OSAPI {
  Task current = 0;
  os_queue rdyq;

  void dispatch(void) {
    current = schedule();       /* select the next task to run */
    notify(current.event);      /* release it on the SLDL level */
  }
  void yield() {
    Task task = current;
    dispatch();
    wait(task.event);           /* block until scheduled again */
  }
  void time_wait(time t) {
    waitfor(t);                 /* advance simulated time */
    yield();                    /* preemption point */
  }
  Task pre_wait(void) {
    Task t = rdyq.get(current); /* remove task from OS while it waits */
    dispatch();
    return t;
  }
  void post_wait(Task t) {
    rdyq.put(t);                /* make task ready again */
    wait(t.event);
  }
};
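How such an OS model serializes tasks and reschedules only at time_wait() calls can be tried out in a few lines of Python. This is a hedged sketch, not the SpecC channel above: `OSModel`, the `release` parameter and the task bodies are assumptions, tasks are generators whose `yield n` plays the role of `os.time_wait(n)`, and a lower `prio` number means higher priority.

```python
class OSModel:
    """Toy RTOS model: runs one task at a time, reschedules at every time_wait()."""
    def __init__(self):
        self.now = 0
        self.tasks = []     # dicts with name, prio, release time, generator
        self.trace = []     # (start_time, task_name, duration) of executed blocks

    def task_create(self, name, prio, release, body):
        self.tasks.append({"name": name, "prio": prio,
                           "release": release, "gen": body()})

    def _schedule(self):
        ready = [t for t in self.tasks if t["release"] <= self.now]
        return min(ready, key=lambda t: t["prio"]) if ready else None

    def start(self):
        while self.tasks:
            task = self._schedule()
            if task is None:                  # idle until the next release
                self.now = min(t["release"] for t in self.tasks)
                continue
            try:
                delay = next(task["gen"])     # task runs up to its next time_wait()
            except StopIteration:
                self.tasks.remove(task)       # task terminated
                continue
            self.trace.append((self.now, task["name"], delay))
            self.now += delay                 # waitfor(delay), then reschedule
        return self.now

def low():
    yield 100   # os.time_wait(100)
    yield 100

def high():
    yield 20

rtos = OSModel()
rtos.task_create("low", prio=2, release=0, body=low)
rtos.task_create("high", prio=1, release=30, body=high)
end_time = rtos.start()
```

The high-priority task is released at time 30 but only starts at time 100, when the low-priority task reaches its next preemption point. That delay is the inaccuracy introduced by the timing granularity of the back-annotation.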
RTOS Model Interface
interface OSAPI {
  /* OS management */
  void init();
  void start(int sched_alg);
  void interrupt_return();

  /* Task management */
  Task task_create(char *name, int type, sim_time period);
  void task_terminate();
  void task_sleep();
  void task_activate(Task t);
  void task_endcycle();
  void task_kill(Task t);
  Task par_start();
  void par_end(Task t);

  /* Event handling */
  Task pre_wait();
  void post_wait(Task t);

  /* Delay modeling */
  void time_wait(sim_time nsec);
};
• Canonical, target-independent API
Task Refinement
• Convert processes into tasks
• Task initialization
– Register task with OS model
• Task activation
– Wait for task start trigger from OS
• Replace delay model
– Trigger rescheduling in OS
Preemption points
• Communication and synchronization
– Wrap around SLDL event handling
process task_B2(OSAPI os) {
  Task h;
  void task_B2(void) {
    h = os.task_create("B2", APERIODIC, 0);   /* task initialization */
  }
  void main(void) {
    os.task_activate(h);                      /* task activation */
    ...
    /* model execution delay */
    os.time_wait(BLOCK1_DELAY);               /* replaces waitfor(BLOCK1_DELAY) */
    ...
    send();
    /* model execution delay */
    os.time_wait(BLOCK2_DELAY);               /* replaces waitfor(BLOCK2_DELAY) */
    ...
    os.task_terminate(h);
  }
  void send() {
    t = os.pre_wait();                        /* wrap SLDL event handling */
    wait(ack);
    os.post_wait(t);
  }
};
Simulated Dynamic Behavior
[Figure: logical-time traces from t0 to t8 for processes P1, P2 and P3, an ISR, channel C1 (c1.send(), c1.recv()), S1 and bus.recv(). Unscheduled model: processes advance concurrently through waitfor() calls. Scheduled model: tasks P2 and P3 are serialized and advance through time_wait() calls.]
Inaccuracy due to timing granularity
OS Modeling Results
• Configurable, generic and flexible OS model
• Configurable scheduling strategies and parameters
– Round-robin or priority-based scheduling
Scheduling exploration
– Artificial periodic task sets, uniformly distributed periods & utilizations
– Back-annotation at 1 µs, 10 µs, 100 µs, or 1000 µs granularity
– Dual-core MIPS Malta reference platform w/ Linux 2.6 SMP kernel [OVP]
Granularity   Avg. speed per core   Avg. err.
1 µs          140 MIPS              0.4 %
10 µs         1500 MIPS             0.4 %
100 µs        9000 MIPS             1.0 %
1000 µs       29000 MIPS            8.0 %
Speed and Accuracy Tradeoffs
• Errors in discrete preemption models
• Potentially large preemption errors
– Not bounded by simulation granularity
[Figure: timeline of tasks Thigh and Tlow alternating between run and idle, with markers rl, rh, fh and fl and the resulting preemption error.]
Automatic Timing Granularity Adjustment (ATGA)
• Observe system state to predict preemption points
• Dynamically and optimally control timing model
• Transparently integrated into OS model
Eliminate preemption errors
Source: P. Razaghi, A. Gerstlauer. "Predictive OS Modeling for Host-Compiled Simulation of Periodic Real-Time Task Sets," Emb. Sys. Letters '12.
Source: P. Razaghi, A. Gerstlauer. "Automatic Timing Granularity Adjustment for Host-Compiled Software Simulation," ASPDAC'12.
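The simplest form of a preemption error, and the effect of adjusting the timing granularity, can be shown with a small Python sketch. It is an illustration under strong assumptions rather than the ATGA algorithm from the cited papers: the function names are made up, the release time of the higher-priority task is assumed to be known in advance, and only a single delayed block is considered (cascading effects that make the error unbounded are not modeled).

```python
def conventional_wait(now, delay):
    """One waitfor() for the whole block: rescheduling happens only at its end."""
    return [(now, delay)]

def predictive_wait(now, delay, next_release):
    """Split the block at a predicted release so the scheduler runs exactly then."""
    end = now + delay
    if next_release is not None and now < next_release < end:
        return [(now, next_release - now), (next_release, end - next_release)]
    return [(now, delay)]

def preemption_error(segments, release):
    """Time from the release until the next rescheduling point (a segment end)."""
    for start, length in segments:
        boundary = start + length
        if boundary >= release:
            return boundary - release
    return None  # the release lies after the modeled block

coarse = preemption_error(conventional_wait(0, 100), release=30)
adjusted = preemption_error(predictive_wait(0, 100, next_release=30), release=30)
```

With a single 100-unit block the released task waits 70 units before the model notices it; splitting the block at the predicted release removes that error while keeping coarse blocks everywhere else.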
ATGA Model Execution Example
[Figure: ATGA model execution example over t0 to t7 for tasks TL, TM and TH and an interrupt task TIntr, showing task states (ready, idle, wait, sleep), the markers rTH,1 to rTH,3 and fTH,1 to fTH,2, and the OS mode switching between predictive and fall-back.]
ATGA Results
• ATGA OS model
• Artificial periodic task sets
Vs. conventional model at varying granularity
• Reference platform [OVP]
• MIPS-Malta
• Linux 2.6
Optimally navigate speed vs. accuracy tradeoff
As fast as coarse grain (100 µs)
As accurate as fine grain (1 µs) simulation
Operating System Layer
OS model
• On top of standard SLDL
• Wrap around SLDL primitives, replace event handling
– Block all but active task
– Select and dispatch tasks
• Target-independent, canonical API
– Task management
– Channel communication
– Timing and all events
[Figure: application tasks P2 and P3 running on the OS model, on top of the SLDL.]
Hardware Abstraction Layer (HAL)
• External communication
• Software Drivers
– Presentation, session, network communication layers
– Synchronization (interrupts)
• Hardware/software boundary
– Low-level HW access
– Bus drivers and interrupt handlers
– Canonical HW/SW interface
• External interface
– Bus transactions (TLM)
– Interrupt trigger
App.:
sample.send(v1);
Driver:
void send(…) {
  intr.receive();
  bus.masterWrite(0xA000, &tmp, len);
}
Hardware Layer (1)
• Processor TLM
• HW interrupt handling
– Interrupt logic
» Suspend user code
– Interrupt scheduling
» Priority, nesting
• Peripherals
– Interrupt controller
– Timers
• TLM bus model
– Bus transactions
[Figure: HAL and hardware layers of the processor TLM.]
Hardware Layer (2)
• Cache modeling
• Pure behavioral modeling
– Tag state
– Hits/misses
– Replacement policy
• Integrated into back-annotation
– Called with accessed address trace
– Update cache state
– Return delay penalties
Implemented as SpecC channel
– < 200 lines of code
[Figure: processor model layers (App, OS, HAL, HW) with tasks P2 and P3, P1, channels C1 and C2, the OS model, user interrupts (UsrInt1, UsrInt2), interrupt sources IntA to IntD (INTA to INTD), HW interrupt logic and the bus TLM; the cache model receives addresses and returns delays.]
Source: A. Pedram, D. Craven, T. Amimeur, A. Gerstlauer. “Modeling Cache Effects at the Transaction Level," IESS 2009.
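A purely behavioral cache model of this kind keeps only tag state and returns a delay penalty for a list of accessed addresses. The Python sketch below is an assumption-laden illustration, not the SpecC channel from the cited paper: the class `CacheModel`, its parameters, the LRU replacement policy and the example constants `A_BASE` and `A_WID` are all chosen here for demonstration.

```python
from collections import OrderedDict

class CacheModel:
    """Set-associative cache with LRU replacement; tracks tags only, no data."""
    def __init__(self, size=1024, line=32, ways=2, miss_penalty=20):
        self.line = line
        self.ways = ways
        self.sets = size // (line * ways)
        self.miss_penalty = miss_penalty
        self.tags = [OrderedDict() for _ in range(self.sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        block = addr // self.line
        index = block % self.sets
        tag = block // self.sets
        lines = self.tags[index]
        if tag in lines:
            lines.move_to_end(tag)        # mark as most recently used
            self.hits += 1
            return 0
        if len(lines) >= self.ways:
            lines.popitem(last=False)     # evict the least recently used line
        lines[tag] = True
        self.misses += 1
        return self.miss_penalty

    def update(self, addresses):
        """Commit a list of accesses and return the total delay penalty."""
        return sum(self.access(a) for a in addresses)

A_BASE, A_WID = 0x1000, 8                 # hypothetical array layout
row0 = [A_BASE + 4 * (0 * A_WID + j) for j in range(A_WID)]
cache = CacheModel(size=256, line=16, ways=2, miss_penalty=10)
first_pass = cache.update(row0)           # cold cache: two 16-byte lines are fetched
second_pass = cache.update(row0)          # warm cache: every access hits
```

The returned penalty would be added to the statically back-annotated block delay, which gives a hybrid of static and dynamic timing.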
Cache-Aware Back-Annotation
• Host-compiled functional model
• Hybrid timing model
• Static + dynamic
– Runtime cache model
[Figure: frontend optimizations produce intermediate code and a retargetable backend produces binary code. Block-level characterization uses a micro-architecture description; address layout and memory accesses are extracted for the back-annotated TLM, which sits on the OS API, driver API and cache model.]
Back-annotated code (static delays only):
void f(void) {
BB1: ...
  os.wait(BB1_DELAY);
  if (c) goto BB3;
BB2: a[i][j] += sum;
  ...
  os.wait(BB2_DELAY);
BB3: ...
  os.wait(BB3_DELAY);
  drv.write(res);
}
void main(void) {
  os.task_create(&f, "Task 1", PRIO0);
}
Back-annotated code with runtime cache model:
void f(void) {
BB1: ...
  os.wait(BB1_DELAY);
  if (c) goto BB3;
BB2: a[i][j] += sum;
  __alist[__idx] = A_BASE + 4*(i*A_WID+j);
  ...
  miss = cache.upd(__alist, __idx);
  os.wait(BB2_DELAY + miss);
BB3: ...
  os.wait(BB3_DELAY);
  drv.write(res);
}
void main(void) {
  os.task_create(&f, "Task 1", PRIO0);
}
Hardware Layer (3)
• Bus-functional model (BFM)
• Pin-accurate processor model
– Timing-accurate bus and interrupt protocols
• Bus model
– Pin- and cycle-accurate
– Driving and sampling of bus wires
Processor Models
• Processor layers
• Application
– Native, host-compiled C
– Back-annotation
• OS
– OS model
– Middleware, drivers
• HAL
– Firmware
• Processor hardware
– Bus interfaces
– Interrupts
– Cache
[Figure: processor model stack with layer labels OS, HAL, HW-TLM, HW-BFM and BFM-ISS.]
Source: G. Schirner, A. Gerstlauer, R. Doemer. “Fast and Accurate Processor Models for Efficient MPSoC Design," TODAES, 2009.
Processor Model Example
• Voice encoding and decoding
• Motorola DSP 56600
– Encoding & decoding tasks
– Custom OS
• 4 custom I/O blocks
• 1 custom HW co-processor
– Codebook search
• Processor models
• Perfect timing
– Back-annotated from ISS
• Priority-based OS model
– EDF: Decoder > Encoder
• HW interrupt scheduling
– 4 non-preempted priority levels
• Reference
• Motorola proprietary ISS
[Figure: DSP 56600 running encoder and decoder tasks, connected through DSP Port A to a custom HW codebook search co-processor and four custom HW I/O blocks (encoder input and output, decoder input and output), with interrupts INTA to INTD.]
Processor Model Results
• Execute on Sun Fire V240 (1.5 GHz)
• 163 speech frames
• Speed vs. accuracy
[Figure: speed vs. accuracy of the processor models, with regions labeled "OS model (Appl, Task)" and "Interrupts (FW, TLM)".]
1800x speed w/ 3% error (vs. cycle-accurate ISS)
Multi-Core Models
• Multi-core OS model
• SMP scheduler model
– Global or partitioned queue
• Configurable parameters
– Number of cores
– FIFO, round-robin, priority-based scheduling policies
– Priorities, affinity, time slice (for round-robin)
• Multi-core processor model
• Multi-core interrupt handling chain models
– Interrupt handlers & tasks
– Configurable generic interrupt controller (GIC) model
• TLM bus interfaces
Source: P. Razaghi, A. Gerstlauer. "Host-Compiled Multi-Core System Simulation for Early Real-Time Performance Evaluation," ACM TECS ‘14.
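[Figure: multi-core model layers on the SLDL simulation kernel: application tasks T1 to T3 with a channel (CH) and interrupt tasks; OS with multi-core scheduler, dispatch and global ready queue; HAL with I/O drivers and interrupt handlers; TLM with I/O and interrupt interfaces.]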
Multi-Core OS Model
• Global or partitioned SMP scheduling
• Replicated or shared Ready, Idle, Sleep & Wait queues
• Processor suspension and interrupt handling
• Interrupt handlers as highest-priority OS-internal tasks
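The difference between global and partitioned SMP scheduling comes down to how many ready queues the dispatcher consults. The following Python sketch illustrates only that distinction; the function names and the `(priority, name)` task tuples are assumptions, a lower number means higher priority, and affinity, time slices and interrupts are left out.

```python
def dispatch_global(ready, num_cores):
    """One shared ready queue: the num_cores highest-priority tasks run."""
    chosen = sorted(ready)[:num_cores]
    return {core: name for core, (_, name) in enumerate(chosen)}

def dispatch_partitioned(ready_per_core):
    """One ready queue per core: each core runs its own highest-priority task."""
    return {core: min(queue)[1]
            for core, queue in enumerate(ready_per_core) if queue}

tasks = [(3, "T3"), (1, "T1"), (2, "T2")]
global_map = dispatch_global(tasks, num_cores=2)
partitioned_map = dispatch_partitioned([[(1, "T1"), (2, "T2")], [(3, "T3")]])
```

With a global queue T1 and T2 run; with T1 and T2 both pinned to core 0, the partitioned scheduler runs T1 and T3 instead, leaving T2 waiting although its priority is higher than T3's.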
[Figure: interrupt handling chain with ISR and interrupt task (bottom half).]
Multi-Core Processor Model
• Emulate the processor hardware/software interface
• OS & hardware abstraction layers
– I/O drivers, interrupt suspension
• Hardware layer
– TLM bus interface, interrupt routing logic & interrupt controller models
[Figure: multi-core processor model with drivers (Drv), TLM adapter and MAC.]
Dual-Core Processor Model Example
• Errors in preemption model due to discrete timing
Integrate multi-core ATGA approach
[Figure: task execution on Core 0 and Core 1 over time.]
Multi-Core ATGA Model
• Enhanced fallback mode check
• Only fall back when ext. event triggers interrupt task with higher priority than current task
– Potential task switch
– Allow for delayed interrupts otherwise
• Model inter-core interrupt notifications
• Adjust predicted times or switch to fallback
Accurate interrupt response times while maintaining speed
But: high-priority interrupt-driven tasks degrade performance
[Figure: average error [%] vs. simulation time [s], both on log scales, for ATGA and for the conventional model (at 10 ms, 100 µs and 1 µs granularity), each with high-, medium- and low-priority interrupts and without interrupts.]
Multi-Core Cache Model
• Application model
• Per core memory access list
– Address, mode, time stamp
• Cache interface
• Hardware layer of processor model
• Generic cache model
• Emulate cache state
– Only tags, no values
– Return hit & miss info
• Parameterizable
– Cache size, line size, associativity, replacement & write-back policy
Source: P. Razaghi, A. Gerstlauer. “Multi-Core Cache Modeling for Host-Compiled Performance Simulation," ESLSyn ‘13.
Multi-Core Cache Simulation
• Directly committing accesses in simulation order
Globally out-of-order in discrete timing model
Multi-Core Cache Simulation
• Delayed reordering of aggregated requests
Multi-Core Out-of-Order Cache (MOOC) model
100% accurate results @ coarse-grain speeds
[Figure: per-core access streams with safe-to-commit points.]
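Delayed reordering can be sketched as a buffer that holds time-stamped accesses from all cores and commits them in time-stamp order once no core can still produce an earlier one. The Python below is an illustrative assumption, not the MOOC implementation from the cited paper: `ReorderBuffer`, `submit` and the safe-to-commit rule (minimum local time over all cores) are names and choices made here.

```python
import heapq

class ReorderBuffer:
    """Collects per-core accesses and commits them in global time-stamp order."""
    def __init__(self, num_cores):
        self.pending = []                  # heap of (timestamp, core, addr)
        self.core_time = [0] * num_cores   # local time each core has reached
        self.committed = []

    def submit(self, core, accesses, new_time):
        """A core finished a timing block: queue its (timestamp, addr) pairs."""
        for ts, addr in accesses:
            heapq.heappush(self.pending, (ts, core, addr))
        self.core_time[core] = new_time
        self._commit()

    def _commit(self):
        safe = min(self.core_time)         # no core can still emit accesses before this
        while self.pending and self.pending[0][0] < safe:
            self.committed.append(heapq.heappop(self.pending))

buf = ReorderBuffer(num_cores=2)
buf.submit(0, [(10, 0xA0), (60, 0xB0)], new_time=100)   # core 0 simulated first
after_first = list(buf.committed)                        # nothing is safe yet
buf.submit(1, [(30, 0xC0), (80, 0xD0)], new_time=100)
order = [addr for _, _, addr in buf.committed]
```

Committing directly in simulation order would have applied both accesses of core 0 before those of core 1; the buffer instead interleaves them by time stamp before they reach the shared cache state.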
MPCSoC Platform Simulation
• Cellphone baseband MPCSoC
• Design space exploration: mapping & scheduling
Full-system simulation in close to real time
• 1400 MIPS at > 99% timing accuracy
MPCSoC Exploration Results
[Figure: average frame delay (ms) and average frame error (%, log scale) for MP3 and JPEG on three configurations: single-core, dual-core with core-attached interrupt, and dual-core with task-attached interrupt. Series: HCSim.TLM and HCSim.TLM.no_Intr, each with its error.]
Lecture 10: Outline
Processor layers
Application
Task/OS
Firmware
Hardware
• Processor synthesis
• Software synthesis
• Hardware synthesis
Software Synthesis
Automatically generate target binaries from TLM
Generate code for application (tasks and IPC)
Synthesize firmware (drivers, interrupt handlers)
OS wrappers and HAL implementations from DB
Compile and link against target RTOS and libraries
[Figure: ISS running the generated software stack (application, drivers, RTOS, HAL), connected to a MAC.]
Source: G. Schirner, A. Gerstlauer, R. Doemer. “Automatic Generation of Hardware dependent Software for MPSoCs from Abstract System Specifications,” ASPDAC08
Processor Implementation Models
• Software C model
• Generated application C code
– Flat standard ANSI C code
• Firmware and hardware models
– RTOS model, HAL model
– Low-level & hardware interrupt handling
– External bus communication protocol/TLM
• Software ISS model
• Reintegrated processor ISS
– Bus-functional ISS wrapper
• Running generated binary
– Application, RTOS, drivers, HAL
[Figure: bus-functional model wrapping a hardware shell and core ISS (ISS API library, nIRQ and nFIQ interrupt pins, bus protocol); the ISS runs CPU_1.bin, consisting of SW application, drivers, RTOS, RAL, interrupt handling and HAL.]
Lecture 10: Outline
Processor layers
Application
Task/OS
Firmware
Hardware
• Processor synthesis
Software synthesis
• Hardware synthesis
Hardware Synthesis
• C-to-RTL high-level synthesis (HLS)
• Allocation, scheduling, binding
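[Figure: high-level synthesis flow inside a HW bus-functional model (BFM). Scheduling turns the C code into behavioral RTL (HW_FSMD) with states s1: y=3*x, s2: i=0, s3: t=y*i, s4: d+=t, s5: i++, s6: h=2*d. Binding and netlisting turn it into structural RTL (HW_RTL) with a controller (states s1 to s6, ctrl=10…10) and a datapath with register file (RF), functional unit (FU), buses b1 to b3 and bus interface, clocked by CLK.]
C source:
……
y = 3*x;
i = 0;
do {
  d += y * i;
  i++;
} while (i < 10);
h = d + d;
……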
Source: D. Shin, A. Gerstlauer, R. Doemer, D. Gajski. “An Interactive Design Environment for C-based High-level Synthesis of RTL Processors," TVLSI, 2008.
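The scheduled FSMD from this example can be stepped through one state per clock cycle to check it against the C source. The Python sketch below is a hand-written illustration, not output of the SCE tools: it assumes `d` starts at 0 (the source elides its initialization) and that the loop test is folded into state s5, which the figure does not show explicitly.

```python
def run_fsmd(x, d=0):
    """Execute the scheduled FSMD, one state per clock cycle."""
    y = i = t = h = 0
    state, cycles = "s1", 0
    while state != "done":
        cycles += 1
        if state == "s1":
            y = 3 * x
            state = "s2"
        elif state == "s2":
            i = 0
            state = "s3"
        elif state == "s3":
            t = y * i
            state = "s4"
        elif state == "s4":
            d += t
            state = "s5"
        elif state == "s5":
            i += 1
            state = "s3" if i < 10 else "s6"   # assumed placement of the loop test
        elif state == "s6":
            h = 2 * d
            state = "done"
    return h, cycles

def reference(x, d=0):
    """Direct transcription of the C source for comparison."""
    y = 3 * x
    i = 0
    while True:
        d += y * i
        i += 1
        if not i < 10:
            break
    return d + d

h, cycles = run_fsmd(1)
```

Two setup states, ten loop iterations of three states each and one final state add up to 33 cycles in this schedule; allocating more functional units would let a scheduler merge states and shorten it.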
SCE Interactive RTL Synthesis
RTL Allocation
RTL Scheduling & Binding
Modeling of Hardware in SoC Design
• RTL Modeling
• State modeling: Accellera RTL Semantics Standard
– Style 1: unmapped
» a = b * c;
– Style 2: storage mapped
» R1 = R1 * RF2[4];
– Style 3: function mapped
» R1 = ALU1(MULT, R1, RF2[4]);
– Style 4: connection mapped
» Bus1 = R1;
» Bus2 = RF2[4];
» Bus3 = ALU1(MULT, Bus1, Bus2);
– Style 5: exposed control
» ALU_CTRL = 011001b;
» RF2_CTRL = 010b;
» …
http://www.eda.org/alc-cwg/cwg-open.pdf
Source: R. Doemer
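The styles describe the same register transfer at increasing levels of structural detail, so styles 2 to 4 must compute the same value. A small Python sketch makes that equivalence concrete; the function names and the operand values are made up for illustration, and `ALU1` is modeled as a plain function rather than a hardware unit.

```python
def ALU1(op, a, b):
    """Hypothetical functional unit supporting two operations."""
    if op == "MULT":
        return a * b
    if op == "ADD":
        return a + b
    raise ValueError(op)

def style2(R1, RF2):
    """Storage mapped: operands are named registers, operator still abstract."""
    return R1 * RF2[4]

def style3(R1, RF2):
    """Function mapped: the operation is bound to a functional unit."""
    return ALU1("MULT", R1, RF2[4])

def style4(R1, RF2):
    """Connection mapped: every transfer goes over an explicit bus."""
    bus1 = R1
    bus2 = RF2[4]
    bus3 = ALU1("MULT", bus1, bus2)
    return bus3

R1, RF2 = 6, [0, 0, 0, 0, 7]
results = [style2(R1, RF2), style3(R1, RF2), style4(R1, RF2)]
```

Style 5 would replace the function call with the control word that selects the same operation, which is why it needs the controller encoding and is not modeled here.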
SpecC RTL Modeling
behavior FSMD_Example(
  signal in  bool      CLK,      // system clock
  signal in  bool      RST,      // system reset
  signal in  bit[31:0] Inport,   // input ports
  signal in  bit[1]    Start,
  signal out bit[31:0] Outport,  // output ports
  signal out bit[1]    Done)
{
  void main(void)
  {
    fsmd(CLK)                    // clock + sensitivity
    {
      bit[32] a, b, c, d, e;     // local variables

      { Outport = 0;             // default
        Done = 0b;               // assignments
      }
      if (RST) { goto S0;        // reset actions
      }
      S0 : { if (Start) goto S1;
             else goto S0;
           }
      S1 : { a = b + c;          // state actions
             d = Inport * e;     // (register transfers)
             Outport = a;
             goto S2;
           }
      ...
    }
  }
};
RTL Modeling Example
Source: R. Doemer
SpecC RTL Modeling
Mapped RTL Example (same FSMD_Example behavior; only the changed declaration and state S1 are shown)
bit[32] a, b, c, d, e;           // unmapped variables
S1 : { a = b + c;                // Accellera style 1
       d = Inport * e;           // (unmapped)
       Outport = a;
       goto S2;
     }
Source: R. Doemer
SpecC RTL Modeling
Mapped RTL Example (same FSMD_Example behavior; only the changed declaration and state S1 are shown)
buffered[CLK] bit[32] RF[5];     // register file
S1 : { RF[0] = RF[1] + RF[2];    // Accellera style 2
       RF[3] = Inport * RF[4];   // (storage mapped)
       Outport = RF[0];
       goto S2;
     }
Source: R. Doemer
SpecC RTL Modeling
Mapped RTL Example (same FSMD_Example behavior; only the changed declaration and state S1 are shown)
buffered[CLK] bit[32] RF[5];     // register file
S1 : { RF[0] =                   // Accellera style 3
         ADD0(RF[1], RF[2]);     // (function mapped)
       RF[3] =
         MUL0(Inport, RF[4]);
       Outport = RF[0];
       goto S2;
     }
Source: R. Doemer
SpecC RTL Modeling
Mapped RTL Example (same FSMD_Example behavior; only the changed declarations and state S1 are shown)
buffered[CLK] bit[32] RF[5];     // register file
bit[32] BUS0, BUS1, BUS2;        // busses
S1 : { BUS0 = RF[1];             // Accellera style 4
       BUS1 = RF[2];             // (connection mapped)
       BUS2 = ADD0(BUS0, BUS1);
       RF[0] = BUS2;
       ...
       goto S2;
     }
Source: R. Doemer
SpecC RTL Modeling
Mapped RTL Example (same FSMD_Example behavior; only the changed declarations and state S1 are shown)
signal bit[5:0] RF_CTRL;         // control wires
signal bit[1:0] ADD0_CTRL, MUL0_CTRL;
S1 : { RF_CTRL = 011000b;        // Accellera style 5
       ADD0_CTRL = 01b;          // (exposed control)
       MUL0_CTRL = 11b;
       ...
       goto S2;
     }
Source: R. Doemer
Lecture 10: Summary
• Host-compiled computation modeling
• Model of software running in execution environment
– Timed application, OS, bus drivers, interrupt handlers
– Processor hardware model, suspension, bus interfaces
Virtual platform prototype
Embedded software development and validation
Viable complement to ISS-based validation
• Backend processor synthesis
• Software synthesis
– Code generation, RTOS targeting, cross-compilation & linking
– Fully automatic final target binary generation
• Hardware synthesis
– High-level/behavioral synthesis: allocation, scheduling, binding
– Interactive C-to-RTL synthesis flow