reducing time to point of interest with accelerated os boot · pdf file ·...
TRANSCRIPT
Reducing Time to Point of Interest with Accelerated OS Boot
Frank Schirrmeister, Cadence®
Robert Kaye, ARM®
Agenda
Design Challenge
The Hybrid Emulation – Virtual Approach– Enabling technology in Palladium XP and Virtual System Platform
– Fast Models from ARM
ARM Example
Industry Examples and Outlook– NVIDIA, CSR
– Improvement considerations
Design Challenge
Example SoC and System
LPDDRDRAM NAND
FLASH
NAND
FLASH
Cellular
Modem
WiFiLLI
DigRF
LP
DD
R 2
eM
MC
4.5
UF
S
LP
DD
R 3
SD
3.0
SD
4.0
UF
S
SLIMbus
DSI
CSI2
CSI3
Bluetooth
SDIO
FM
Receiver
GPS
Receiver
RF
FE
SL
IMb
us
Motion
SensorscJTAG
GBT
SP
MI
Power
Control
Multimedia
Processor
I2C
US
B 2
.0
Memory
Card
HDMI 1.4
Touch Screen
Controller
Display
Driver
Audio
Interface
Camera
Interface
USB 3.0 OTG
OCP 2.0
OCP 3.0
System on Printed Circuit Board (PCB)
Application Specific Components
SoC Interconnect Fabric
ARM CPU Subsystem
3D
GFX
DSP
A/V
High speed, wired interface peripherals
D
D
R
3
P
H
Y
Other peripherals
SAT
A
MIPI
HD
MI
WL
AN
LTELow-speed peripheral
subsystem
Low speed peripherals
PM
U
MI
PI
JT
AG
IN
TC
I2
C
SP
ITi
me
r
GP
IO
Display
UA
RT
Apps
Accel
Modem
ARM®
Cortex®
A15
L2 cache
USB
3.0
3
.
0P
H
Y
2
.
0P
H
Y
PCIe
Gen
2,3
PHY
Et
h
er
n
etP
H
Y
ARM®
Cortex®
A15
ARM®
Cortex®
A7
L2 cache
ARM®
Cortex®
A7
Cache Coherent Fabric
System on Chip (SOC)
Software
Bare
Meta
l
So
ftw
are
DS
P S
oft
ware
Bare
Meta
l
So
fwta
re RTOS
Drivers
Communications L2
Communications L1
Firmware / HAL
Communications L3
Modem
Comms
Application
Software
Bare Metal Software
Operating Systems (OS)
Drivers
Applications
Middleware
Firmware / HAL
An Example Project Timeline
IP
Sub-System
System on Chip
SpecPost SiNetlist to GDSII
RTL-Design & IP Integration & VerificationFab
Only
small
gate
level
changes
and
ECO’s
RTL
Becomes
stable
Idea to
specProduction
Post silicon
Validation
IP Qualification
Spec to GDSII: 49 - 83 wks
8-12 wks
Design & Integration & Verification: 35 – 63 wks
Netlist to GDSII: 21 - 32 wks
14 wks
11 - 17 wks
14 - 18 wks
Source: Cadence, IBS
Timeline for System Critical Bugs
RTL
Becomes
stable
Only
small
gate
level
changes
and
ECO’sIP
Sub-System
System on Chip
SoC in System
Idea to
specProduction
Post silicon
Validation
SpecPost SiNetlist to GDSII
RTL-Design & IP Integration & VerificationFabIP Qualification
Time for critical bugs in
System Environment to be removed
Source: Cadence, IBS
Software is Key to verification
Applications
(Basic to Angry
Birds)
IP
Sub-System
Bare metal SW
OS & Drivers
(Linux, Android)
System on Chip
Middleware
(Graphics, Audio)
SoC in System
Only
small
gate
level
changes
and
ECO’s
RTL
Becomes
stable
Idea to
specProduction
Post silicon
Validation
SpecPost SiNetlist to GDSII
RTL-Design & IP Integration & VerificationFabIP Qualification
Time for critical bugs in
System Environment to be removed
SW
Development
on ChipM
ay h
old
final ta
pe
ou
t if bu
g to
o c
ritica
l
Source: Cadence, IBS
Platforms - There is No ‘One-Fits-All’
SDKOS Sim
•Highest speed
•Earliest in the flow
•Ignore hardware
Virtual Platform
•Almost at speed
•Less accurate (or slower)
•Before RTL
•Great to debug (but less detail)
•Easy replication
RTL Simulation
•KHz range
•Accurate
•Excellent HW debug
•Little SW execution
AccelerationEmulation
•MHz Range
•RTL accurate
•After RTL is available
•Good to debug with full detail
•Expensive to replicate
FPGA Prototype
•10’s of MHz
•RTL accurate
•After stable RTL is available
•OK to debug
•More expensive than software to replicate
Prototyping Board
•Real time speed
•Fully accurate
•Post Silicon
•Difficult to debug
•Sometimes hard to replicate
Choosing the Right Engine
Chip
Virtual
Prototyping
FPGA based
Prototyping
Acceleration
Emulation
RTL
Simulation
SW Development
SW Development
HW/SW Validation
HW/SW Verification
HW Verification
CONCEPT PRODUCT
TLM
Sim RTL
SimA&E
FPGA
Hardware Debug &
Turn-around-time
Early Software
Development
System Speed
Software Debug
-+
-+
+-
+ -
Software
Hardware
Hardware Accuracy ++-
HW/SW Concurrency Gap
Tapeout Silicon Samples Product Ships
SW
HW
System
Legend
HW Development & Verification
Continuous System Validation
Next Generation SW-Driven SoC Flow
HW Development & Verification
SW Dev
On model
HW Development & Verification
System Validation
SW Dev and Bringup On real HW design, Silicon
SW-Enhanced SoC Flow
Traditional SoC then SW Flow
Continuous SW Development & Bringup
SW Dev and Bringup on Silicon
System Validation
Enabled By
Virtual Platform
FPGA Prototype
Emulation
Powered By
Platform Hybrids
Emulation + Virtual Platform + FPGA
TLM Virtual Platform – VSP Emulation – Palladium® XPI/II
Early SW Execution on Palladium
- Up to 100MHz
- Early Availability for SW Developers
- Advanced SW Debug
- Fast SW Turnaround Time
- Up to 4MHz
- From early-RTL to full-SoC Validation
- Advanced HW Debug
- Fast HW Turnaround Time
Hybrid Solution with SW Integrator
.- Boot Complex OS at 48MHz
- Speed UP SW-Driven tests 1-10X
over emulation
- Early Availability for SW Developers
- Advanced HW + SW Debug
- Fast HW and SW Turnaround Time
VSP Execution Engines Palladium
Palladium/VSP Hybrid Solution
Architected for SW Performance
− High-speed virtual platform
− Asynchronous HW/SW Execution with Interrupt driven sync
− High-Speed Multi-Domain Memory Coherency
Designed to integrate HW and SW flows
− Does not require changes to HW or SW stacks
− Virtual connections into SW Engineer’s environments
− Seamless hybrid execution for both HW and SW users
Proven Methodology, Unique Expertise
− Cross-platform and design integration expertise
− Exclusive hybrid methodology delivers performance and repeatability
− Proven during successful application to SW-rich SoCs
Smart
Memory
Virtualized
CPU
Sub-system
CPU
Bridges
Customer
Virtual Models
VSP Virtual
Models UART, eMMC, USB
Integration
APIs
GPU IP
Memory
ControllerIP IP
RTL Fabric
DDR
AMBA®, interrupts,
resets
Customer Design in Palladium®
AVIP
SW Integrator
Customer RTL
RTL
TLM
Mem I/F
Component Color Key
SoC Interconnect Fabric
DDR3 Display
INTC
Timer
CSIDSI
UART
GPUMemory
ControllerSATAUSB3
…
System
Boot
Peripheral Fabric
USB2
Ethernet
SW Integrator
UARTsTimers
Fast
Processor
Model
A15 x 4 A7 x 2
AXI4 or ACE-Lite Interrupts
Smart DDR
MMP model
eMMC
Interrupt
Manager
TLM
/ RTL
Bridge
Reconfigurable Interconnect
CPU Sub-system RTL I/F
Reset
Manager
TLM
MemorySmart
DDR
Resets
VSP
Palladium®
XP
AV
IPValidate SoC + OS at 5-10 MHz on PXPHigh-performance memory coherency
Execute SW at 100MHzWith standard or custom processor models
Shorten SoC DebugSystem Messages
HW / SW Debuggers
Plug and Play Integration with RTLSoC-specific transactors and RTL I/F
Hybrid Example
Hybrid Performance Compared to an All-RTL in Emulation Configuration
Target Application- Large, compute intensive SoCs
Target Users: - HW-Dependent SW engineers, - System validation engineers
Accuracy SW: Delivers programmers-view accuracy- HW: Full accuracy except for timing between virtual CPU and SoC fabric- Memory: in fast mode, memory transactions are performed back-door.
Thus, hybrid models not recommended for power or performance estimation
Metric All RTL in
Palladium*
Hybrid** Increase Effective
SW exec.
speed
Linux boot (minutes) 30 0.5 60X 48MHz
Android boot (minutes) 900 15 60X 48MHz
Windows RT boot (min.) 1800 30 60X 48MHz
512x512 2D test (min)*** 30 2 15X 11MHz
# Emulation gates used 70 Million 40 Million 0.6X
* 70 million gate application processor, all blocks in Palladium®
** Virtualized CPU sub-system with register model of L1 & L2 caches. All other SoC blocks in Palladium.
*** Includes Linux boot, data preparation, image processing by HW engine and result checking. 1.3 million memory transactions.
All boot numbers are full production images. Linux includes all drivers. Android and Win RT with SW rendering
Fast Models from ARM
SoC Simulation Views
Detailed Abstraction level Abstract
Slo
wS
imu
lation s
pe
ed
Fa
st
Loosely Timed (LT)
Cycle Accurate (CA)
Approximately Timed (AT)
1-20 KIPS
50-200 KIPS
50-200 MIPS
Programmer’s View• SW Development
• SW Profiling
• Architecture Compliance
System Performance View• High-level performance analysis
Component Performance View
• Architecture exploration• Performance analysis• Benchmarking
System Validation View• HW/SW Co-Verification
Component Validation View• HW Validation• Driver Development• HW/SW Co-Verification
Model Requirements for SW Development
Simulation Speed– Model must be fast enough for
software developers
Model Fidelity– Models must be complete and
faithful to the target implementation
Solution Flexibility– Enable broad range of applications
with varied requirements
Simulation
Speed
Model
Fidelity
Solution
Flexibility
Fixed Virtual Platforms
Foundation Platform for ARMv8– Simple, entry level platform for Linux application
developers– Debug via GDB server
“ARM® Versatile™ Express” (VE) FVP– Versatile Express memory map– Debugger and Trace API
“Base” FVP– VE + power management, system control
Software alignment with ARM, Linaro
Fixed Virtual Platform
ARM®
Cortex®-A57
Fast Model
ARM®
Cortex®-A53
Fast Model
CCI/CCN Fast Model
Peripheral
model
GIC
Fast Model
SMMU
Fast Model
Peripheral
model
Virtual I/O
I/O Accesses
Virtual Platforms Based on Fast Models
ARM Fast Models– De-facto standard for virtual prototyping of
ARM-based SoCs
– Fast, Flexible, Fidelity
– Accurate software representation of the CPUs and fabric IP
– Open APIs to software debuggers and EDA tool
Virtual Platform
ARM®
Cortex®-A57
Fast Model
ARM®
Cortex®-A53
Fast Model
CCI/CCN Fast Model
Custom
IP model
Peripheral
model
Virtual I/O
I/O accesses
GIC
Fast Model
SMMU
Fast Model
Custom
IP model
Custom
IP model
Hybrid Virtual Platforms
ARM Fast Models– Open APIs to
software debuggers and EDA tools
– Standardized TLM 2.0 bridges for AMBA
– Mix and match models at different levels of abstraction
– Concurrent debug of hardware and software
Custom Virtual
Platform
ARM®
Cortex®-A57
Fast Model
ARM®
Cortex®-A53
Fast Model
CCI/CCN Fast Model
Custom
IP model
Peripheral
model
Virtual I/O
I/O accesses
HW
emulators
AMBA
Transactions
GIC
Fast Model
SMMU
Fast Model
Custom
IP model
Custom
IP model
Use Case 1: No PV Model Available
Graphics & Video difficult to model in PV
ARM® Mali™
cores on Emulator
Compute Sub-System on ARM Fast Models
CoreLink™ CCI-400r1
Quad core Cortex-A53
I/O
Coherent
devices
Mali Graphics
ADB-400 ADB-400
MMU-500
ACE
ACE ACE-Lite + DVM
ACE-LiteACE-LiteACE-Lite
ACE-Lite
NIC-400
Other
Slaves
Other
Slaves
Dual core
Cortex-A57
ACE
ACE
AXI4
Configurable: AXI4/AXI3/AHB/APB
GIC-400
ACE-Lite + DVM ACE-Lite + DVM
ADB-400 ADB-400
DMC-400 / 3rd
Party DMCACE-LiteACE-Lite
ETM ETM
STM Trace Bus
DAP
TMC
SWD/JTAG
TPIU
ACE-LiteACE-Lite
DDR3/2
LPDDR2/3
PHY
DDR3/2
LPDDR2/3
PHY
NIC-400
AXI4
AXI4
Displays
MMU-500
Mali
V500
Video
AXI4
ACE-Lite
Non
Coherent
devices
MMU-500
TZC-400
Transactor Virtual Platform Emulation Content
Use Case 2: Component Analysis
Analyse traffic in CCI-400
CCI-400 on emulator
Upstream components in ARM Fast Models
Downstream in Fast Models or stubbed
CoreLink™ CCI-400r1
Quad core Cortex-A53
I/O
Coherent
devices
Mali Graphics
ADB-400 ADB-400
MMU-500
ACE
ACE ACE-Lite + DVM
ACE-LiteACE-LiteACE-Lite
ACE-Lite
NIC-400
Other
Slaves
Other
Slaves
Dual core
Cortex-A57
ACE
ACE
AXI4
Configurable: AXI4/AXI3/AHB/APB
GIC-400
ACE-Lite + DVM ACE-Lite + DVM
ADB-400 ADB-400
DMC-400 / 3rd
Party DMCACE-LiteACE-Lite
ETM ETM
STM Trace Bus
DAP
TMC
SWD/JTAG
TPIU
ACE-LiteACE-Lite
DDR3/2
LPDDR2/3
PHY
DDR3/2
LPDDR2/3
PHY
NIC-400
AXI4
AXI4
Displays
MMU-500
Mali
V500
Video
AXI4
ACE-Lite
Non
Coherent
devices
MMU-500
TZC-400
Transactor Virtual Platform Emulation Content
Use Case 3: Verification speed-up
Use Fast Processor Models to speed simulation
Complete System in Emulator other than CPU
CoreLink™ CCI-400r1
Quad core Cortex-A53
I/O
Coherent
devices
Mali Graphics
ADB-400 ADB-400
MMU-500
ACE
ACE ACE-Lite + DVM
ACE-LiteACE-LiteACE-Lite
ACE-Lite
NIC-400
Other
Slaves
Other
Slaves
Dual core
Cortex-A57
ACE
ACE
AXI4
Configurable: AXI4/AXI3/AHB/APB
GIC-400
ACE-Lite + DVM ACE-Lite + DVM
ADB-400 ADB-400
DMC-400 / 3rd
Party DMCACE-LiteACE-Lite
ETM ETM
STM Trace Bus
DAP
TMC
SWD/JTAG
TPIU
ACE-LiteACE-Lite
DDR3/2
LPDDR2/3
PHY
DDR3/2
LPDDR2/3
PHY
NIC-400
AXI4
AXI4
Displays
MMU-500
Mali
V500
Video
AXI4
ACE-Lite
Non
Coherent
devices
MMU-500
TZC-400
Transactor Virtual Platform Emulation Content
Typical Virtual Platform Flow
Start with example Fast Model project or FVP– Sufficient for OS boot and initial software development
Develop virtual platform features to support software program– Incrementally build out platform capabilities
– Easy deployment as new features completed
Export Fast Model subsystem to SystemC to extend platform and integrate with commercial EDA/ESL tools– e.g. VSP, Palladium
Majority of software features and integration completed ahead of silicon– Fast platform execution – software teams efficient
– Advanced debug and visibility capabilities
Final optimisation and tuning against silicon– Power/performance tuning often the final stages of the project
ARM Example
VSP SW Integrator Palladium
Hybrid Palladium / VSP System
• ARM SysBench RTL instance as provided
• Able to boot Linux and run GPU benchmarks on Palladium
GPU Periph.
Memory
ControllerIP IP
RTL Fabric
RTL Design in Palladium (SysBench instance)
CPU
wrapper
ARM
CPU
DDR
GIC
VSP SW Integrator Palladium
Hybrid Palladium / VSP System (AQUA)
• Virtual-only system
• Able to boot Linux image to
prompt
• No specialized IP or DUT
• May comprise virtualized
components such as virtual
Ethernet, SD card, USB
• Full visibility and debuggabilityVirtualized
CPU
Simple
Memory
Peripherals
GIC
VSP SW Integrator Palladium
Hybrid Palladium / VSP System
Hybridization of Sysbench RTL instance– Wrappers for CPU and GIC
– Replace DDR memory
GPU Periph.
Memory
ControllerIP IP
RTL Fabric
RTL Design in Palladium (SysBench instance)
CPU
wrapper
SmartMem
GIC
VSP SW Integrator Palladium
Hybrid Palladium / VSP System
Full hybrid system
– AMBA® ACE bridge connects virtual core to RTL
– Interrupts are forwarded to virtual GIC
– Smart memory provides coherent memory between RTL and VSP
Via mapping on the virtual side, Peripherals and IP in the RTL can be shadowed
Virtualized
CPU
CPU
Bridges
GPU Periph.
Memory
ControllerIP IP
RTL Fabric
AMBA, IRQs,
resets
RTL Design in Palladium (SysBench instance)
CPU
wrapper
Smart
MemorySmartMem
Peripherals GIC
CoherencyGIC
What was Measured?
Wall clock time for:– Linux boot
– Driver loading & setup
– Benchmark execution for 1st frame
– Benchmark execution for all frames
Platform: Palladium only, Palladium/VSP Hybrid (6 domains)– Measured with frame dumping active and inactive
Results: 50X speedup in Linux Boot
0
50
100
150
200
250
300
350
Linux boot + driver load Boot to 1st frame
Samurai
Egypt
Taiji
0
2
4
6
8
10
12
14
GPU test (all Frames) Boot to test end
Samurai
Egypt
Taiji
Industry Examples
PALLADIUM/VSP ARM V8 TEGRA HYBRID PROJECT #2
- Pre-silicon Android Validation
- Open GL Graphics Testing
Vikramjeet Singh
Sr System SW Manager
Mobile Devices
Pre-silicon goals for SW
Co-develop and Co-verify pre-silicon HW designs
High quality SW on silicon arrival
No silly bugs
Focus on performance and power optimizations
Cut down time to production
Bring-up production software stack
Android, OpenGL..
Eliminate interface bugs
Early visibility into “eco-system” readiness
OS
Use
Ca
ses
OG
L T
ests
Arch. MODS
Sim Front
Custom Nvidia
SW
Kernel
NVIDIA Validation with Palladium/VSP Hybrid
Te
st G
en3
Directed HW Arch Tests
Directed Kernel Tests
Kernel Use cases
NVidia OGL Tests
OS boot end to end
OS use cases
Test &
Metrics DB
SA Host fiber
VSP Hybrid
BootLoader
BootROM
Kern
el
Test
Case
s
OS Validation
Staged 64b’ development
64b’ Kernel with 32b’ User space
Fix interface issues (no silly bugs)
Demo capabilities to partners
32b’
64b’
Android
Boot Times
Kernel = 2 mins
Android = 90 mins
10x faster than PD
SW Validation Results
Eliminated reliance on other pre-silicon platforms
SW problems found prior to Silicon return
SW race conditions
Code completeness
64b’ <> 32b’ interface bugs
After silicon return
Contributed to smoother bring-up
SW Ready to demo product at SOL
Less bugs resulted in focused effort to tune perf/w
Using Palladium-VSP Hybrid to Accelerate SW DevelopmentMoshe BerkovichJune-2014
X200
39Confidential © Cambridge Silicon Radio Limited 2014
• Complex SoCs need massive SW development
• Meeting TTM relies on stable working SW
• HW/SW Co-sim is critical for verification
SW development requirements:
1. High frequency platform
2. Debug capabilities
− JTAG/UART interfaces for SW debug
− Full waveform visibility
− Activity logs (like ARM’s tarmac)
3. Fast bring-up
4. Quick turnaround cycle
5. Accuracy
Challenges in pre-silicon SW development
40Confidential © Cambridge Silicon Radio Limited 2014
Hybrid Platform Exploration
Smart DDR
Minimize DDR access penalty
TLM units often accesses by CPU
DDR
CPU
SoC Interconnect Fabric
ROM
DISPLAY GPU
DDR
RTL
TLM
Mem I/F
Component Color Key
UART
TIMERS
41Confidential © Cambridge Silicon Radio Limited 2014
• Runtime comparisons
− Palladium compiled at 1.5Mhz, CAKE 1X
Results
0
2000
4000
13501086
3670
16 104 18
PXP
HYBRID
Decompressed
Linux boot
Video test Compressed
Linux boot
x200x84 x10
NVIDIA Results
Android & AnTuTu on PXP
Multiple customers booting Android on PXP– Brought up OS before tapeout– Validated SW with Full SoC, including GPU rendered desktop– Ran test applications
Observed Android boot times– In-Circuit configuration: 13 hours– Hybrid configuration: 1 hour
Several customers have run AnTuTu on PXP– Characterized SOC performance– Optimized SW stack– AnTuTu run time: 24 hours in ICE configuraiton
Outlook
Outlook: Enhancements
Further Virtualization– PCI-e for Hybrid configurations
Hybrid Assembly– Hybrid Model Library
– Graphical User Interface
Smart Memory– User configuration
– Cache Coherency
Embedded Software Debug
Record/Playback at Virtual/RTL boundary– To support replay on standalone virtual platform
– Swap between RTL and Virtual