High Performance Embedded Computing
Arnon Friedmann
Texas Instruments
Design is a strategic asset
Overview
• What is embedded?
• How did we get here?
  – Shannon DSP
• Brief history of TI DSP for HPC
• What makes a DSP?
• Where are we now
  – Benchmarks
  – Hawking DSP, Brown Dwarf, Moonshot
• It's about the software
• Where are we headed
• Summary
Embedded Markets
• Radar and communications computing
• Video and audio infrastructure
• Wireless and networking
• DVR/NVR and smart cameras
• Test and measurement
• Industrial control
• High-performance and cloud computing
• Home AVR and automotive audio
• Portable mobile radio
• Medical imaging
• Mission-critical systems
• Media processing and industrial electronics
• Industrial imaging
• Analytics
DSP Roadmap (stages: Concept → Development → Sampling → Production; tiers: DSP Low, DSP Mid, DSP High; timeline 2013-2015)

In production:
• C6657: 1x/2x C66x, 1.25 GHz, 3 MB L2; PCIe, USB, GigE, SRIO; 21x21 mm
• C6678/4/2/1: 1x/2x/4x/8x C66x, 1.25 GHz, 8 MB L2; PCIe, GigE, SRIO; 24x24 mm
• OMAP-L138: 1x ARM9, 456 MHz; 1x C674x, 456 MHz; EMAC, USB2, TDM; 13x13 mm or 16x16 mm
• C6748: 1x C674x, 456 MHz; EMAC, USB2, McASP; 13x13 mm or 16x16 mm

Sampling:
• 66AK2H12/06: 2x/4x ARM A15, 1.4 GHz; 4x/8x C66x, 1.2 GHz; up to 8 MB L2; PCIe, USB, GigE, SRIO; 40x40 mm
• AM5K2H04: 4x ARM A15, 1.4 GHz; up to 8 MB L2; PCIe, USB, GigE, SRIO

In development:
• Next mid-range: multicore ARM and DSP; industrial control and communications
• Next high-end multicore: high-performance multicore ARM + DSP; large L2, 2x DDR4; high-speed serial I/O
• Next DSP low: multicore ARM and DSP devices; industrial, audio, and communications
Technology for Video Security Analytics, from the Core to the Edge

Cameras (the edge):
• Analog camera: DM33x (ISP + ARM), connected over coax cable
• Smart analytics IP camera: DMVA2, DMVA3
• Advanced analytics IP camera: DM644x, DM812x
• Mainstream IP camera: DM36x, DM385

Recording and serving (the core), reached over IP and 3G/EDGE (DVR: Digital Video Recorder; NVR: Network Video Recorder; DVS: Digital Video Server):
• Analog capture: TVP5154/TVP5158 video decoders feeding DM64x and DM6467
• DVR/NVR: DM810x, DM814x, DM816x
• Additional processing with C665x multicore: DM812x + C665x, C665x + DM385, C667x multicore

TI's DSP and vision solutions:
• All DM81xx DVR solutions with embedded analytics capabilities
• Analytics at the edge with DMVAx and DM812x
Unleashing TI multicore DSPs @ SC'11
• Innovative new DSP core
• Most powerful multicore DSPs
• Lowest power per MHz / GMAC / GFLOP

Highest-performance fixed- and floating-point DSP:
• 40 GMACs / 20 GFLOPS per core
• 320 GMACs / 160 GFLOPS total
• >10 GFLOPS/W (a quick arithmetic check follows)
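As a consistency check on these peak figures (assuming the 1.25 GHz C66x clock quoted elsewhere in this deck):

\[
20\ \tfrac{\text{GFLOPS}}{\text{core}} \div 1.25\ \text{GHz} = 16\ \tfrac{\text{SP FLOP}}{\text{cycle}\cdot\text{core}},
\qquad
8\ \text{cores} \times 20\ \tfrac{\text{GFLOPS}}{\text{core}} = 160\ \text{GFLOPS}.
\]

At the 10 W device power quoted for the C6678 later in the deck, 160 GFLOPS peak works out to 16 GFLOPS/W, comfortably above the hedged ">10 GFLOPS/W" claim.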
Evolution of the C66x

• C64x core (C64xx): industry's lowest-power fixed-point DSP core; industry's highest-performance DSP core; the current base for the multicore product line
• C67x core (C67xx): industry's lowest-power floating-point DSP core; high precision and wide dynamic range; easy and flexible programming
• New multicore C66x DSP core: fixed point and floating point in one core — the most power-efficient scientific computing engine in the industry

Supporting software: TI Optimizing Compiler (GCC support, C/C++), scientific computing libraries, multicore tools, and the Code Composer Studio IDE.
Shannon (TMS320C6678) – Block Diagram

Eight C66x DSP CorePacs, each with its own L1 and L2, connected through the TeraNet switch fabric and the Multicore Navigator. Memory subsystem: Multicore Shared Memory Controller (MSMC) with 4 MB shared memory and a 64-bit DDR3 interface. Network coprocessors: packet accelerator, crypto engine, and a GbE switch with dual SGMII IP interfaces. Peripherals and I/O: SRIO x4, PCIe x2, EMIF16, 2x TSIP, I2C, SPI, UART. System elements: power management, debug, EDMA, SysMon.
• Multicore KeyStone SoC with fixed/floating-point CorePacs
  – 8 CorePacs @ 1.25 GHz; 0.5 MB L2 per core, 4.0 MB shared L2
  – 320 GMACs, 160 GFLOPS, 60 GFLOPS double precision; 10 W
• Navigator: hardware queue manager with DMA
• Multicore Shared Memory Controller: low-latency, high-bandwidth memory access
• Network coprocessor: IPv4/IPv6 network interface solution; IPSec, SRTP, and encryption fully offloaded
• HyperLink: 50 GBaud expansion port, transparent to software

Telecom ATCA blade:
• 1 TFLOPS double precision per blade
• 240 W (board power)
• 256 GB/s memory bandwidth; 20 GB memory
• 100 Gbit/s total interconnect bandwidth; dual 10 Gbit/s Ethernet uplinks
• 20 devices, 8 cores each; 50 Gbit/s links pairing devices
Quad/Octal-Shannon PCIe Cards
• Quad: 512 GFLOPS, 50 W
• Octal: ~1 TFLOPS, 110 W, 16 GB DDR3
High-Level Comparison
• TI Quad Shannon PCIe, ~50 W (2011): ~12.8 GFLOPS/W SP, ~3.2 GFLOPS/W DP
• Nvidia Kepler, ~250 W (2012): ~12 GFLOPS/W SP, ~4 GFLOPS/W DP
  – Dominates acceleration today; powers the #2 supercomputer (Titan)
• Intel Xeon Phi (MIC), ~250 W (2012): ~8 GFLOPS/W SP, ~4 GFLOPS/W DP
  – Unveiled at SC'12; powers the #1 supercomputer (Tianhe-2)
QCDSP: A Teraflop Scale Massively Parallel Supercomputer

We discuss the work of the QCDSP collaboration to build an inexpensive Teraflop scale massively parallel computer suitable for computations in Quantum Chromodynamics (QCD). The computer is a collection of nodes connected in a four-dimensional toroidal grid with nearest-neighbor bit-serial communications. A node is composed of a Texas Instruments Digital Signal Processor (DSP), memory, and a custom-made communications and memory controller chip. An 8192-node computer with a peak speed of 0.4 Teraflops is being constructed at Columbia University for a cost of $1.8 Million. A 12,288-node machine with a peak speed of 0.6 Teraflops is being constructed for the RIKEN Brookhaven Research Center. Other computers have been built, including a 50 Gigaflop version for Florida State University. Keywords: parallel, supercomputer, digital signal processor, QCD. Introduction: The atoms and nuclei of everyday matter are now known to be made up of still tinier particles known as quarks and leptons.

Researchers at Brookhaven developed this DSP-based system in the mid-to-late 90s.
TI then forgot all about this...
Linpack Results from KTH
• Data from a study performed at the KTH supercomputing center
• LINPACK running on the C6678 achieves 25.6 GFLOPS at ~2.1 GFLOPS/W (implying roughly 12 W)
• Single-precision performance is ~4x better, at ~8 GFLOPS/W
Comparison – Algorithm Level
• GPU benchmark: Nvidia Tesla C1060 (Bisceglie '10, ref [2])
  – Core clock 1.296 GHz; 240 processor cores; 4 GB memory @ 800 MHz
  – Test algorithm: range-azimuth algorithm, FFT size 4096
• FPGA: Xilinx Virtex-5 (Pfitzner '11, ref [3])
• Comparison (time per pixel):

  DSP   23.0 ns/pixel
  GPU   14.9 ns/pixel
  FPGA  53.3 ns/pixel

  DSP > 20x better in power/performance
Running on OpenMP today

Video Analytics Comparison between KeyStone I and x86 Processors

[Chart: "Watts consumed / Cost per channel (QVGA)" — QVGA watts per channel and QVGA cost (USD) per channel, compared across i7-2600, Xeon E5620, dual E5645, Xeon X5675, single Shannon, quad Shannon, and octal Shannon.]
FINALLY SOME DETAILS...
C66x Core Overview

The C66x DSP core pairs two datapaths, register file A and register file B, each with four functional units (L, M, S, D) and 64-bit buses; instructions flow through fetch, dispatch, and execute stages, with prefetch on the instruction side. Around the core sit an interrupt controller, emulation and embedded debug, and power management. The memory system provides 32 KB of L1P SRAM/cache, 32 KB of L1D SRAM/cache, and 1 MB of L2 SRAM/cache, all accessible by DMA.
C6x High Performance at Low Power
• The C6x architecture is designed to provide the highest-performance DSP processing
• The DSP is capable of executing 8 instructions per clock cycle
• The VLIW engine works in concert with compiler technology to deliver superscalar-class performance without the power overhead of general-purpose superscalar CPUs

GPP vs. C6x:
• GPP: instruction dispatch, instruction scheduler, reservation stations, re-order buffers, and register allocation are all hardware in front of the ALUs
• C6x: the C6x compiler performs scheduling and register allocation at build time, so the C6x VLIW engine needs only instruction dispatch in hardware in front of the ALUs
C6x VLIW Power Optimization
• Traditional exposed-pipeline VLIW machines have some drawbacks with respect to power
• Instruction RAM usage is high because:
  – There is no instruction scheduler: if an ALU (or other functional unit) is not used, a NOP must be issued to it
  – Loops must be unrolled by the compiler

Pure VLIW machine code for low-IPC code — 6 useful instructions encoded in 24 instruction words:

  NOP ADD NOP NOP
  MPY NOP NOP NOP
  SUB NOP NOP NOP
  MPY NOP NOP NOP
  NOP NOP LD  NOP
  MPY NOP NOP NOP
C6x VLIW Instruction Dispatch
• To reduce instruction fetch power and code size, a simple instruction dispatch unit is introduced
• Pure VLIW machine: the instruction RAM feeds one decoder per ALU directly
• C6x VLIW: the instruction RAM feeds a single instruction dispatch unit, which routes instructions to the per-ALU decoders
C6x VLIW Instruction Dispatch (2)
• Execution unit and parallelism are encoded in the machine code
  – A simplified dispatch unit "unpacks" the machine code

C6x machine code (6 words):

  ADD MPY SUB MPY LD MPY

Inside the C6x VLIW core, the dispatch unit expands this to the equivalent of the 24-word pure-VLIW encoding shown above (each useful instruction padded with NOPs for the unused units).
C6x VLIW Loop Unrolling
• Traditional exposed-pipeline VLIW machines have additional instruction overhead:
  – Loops are unrolled by the compiler
  – A loop that really has only 4 unique instructions can easily need 12-16 instructions after unrolling
• The C64x+ generation introduced a loop construct that unrolls the loop within the CPU:
  – Code-size reduction for loops
  – Power savings in the CPU instruction pipeline

C6x VLIW Loop Unrolling (2)
• A "while" loop in C is shown below, with the resulting assembly for traditional VLIW and for C6x VLIW

C pseudo-source:

  while (A1--) A8 += *A0++;

Traditional VLIW (unrolled, software-pipelined prolog plus kernel):

          B     LOOP
          LDW   *A0++, A7
  ||      B     LOOP
          LDW   *A0++, A7
  ||      B     LOOP
          LDW   *A0++, A7
  ||      B     LOOP
          LDW   *A0++, A7
  ||      B     LOOP
          LDW   *A0++, A7
  ||      B     LOOP
  LOOP:   LDW   *A0++, A7
  || [A1] SUB   A1, 1, A1
  ||      ADD   A7, A8, A8
  || [A1] B     LOOP

C6x VLIW with the software-pipelined loop unroller:

          MVC    A1, ILC
  ||      SPLOOP 1
          LDW    *A0++, A7
          NOP    4
          ADD    A7, A8, A8
  ||      SPKERNEL

• 20% overall dynamic power reduction
• Same performance using less than half the instructions
C6x VLIW plus SIMD
• Strategy in evolving from the C674x core to the C66x core (see the C sketch below):
  – Increase datapath width; leave the "overhead" the same
  – C66x widened most 32-bit instructions into 64-bit SIMD versions of the same instructions
  – Instruction decode overhead is the same, but processing power goes up by 2x
  – Overall energy consumption is lower for a given benchmark
  – When only 32 bits of a unit are required, clock gating eliminates the dynamic power of the unused 32 bits
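To give a feel for how that SIMD width is reached from C, here is a minimal sketch in the style TI's compiler documentation encourages: restrict qualifiers and a trip-count pragma free the optimizer to software-pipeline the loop and pack the 16-bit operations into wider SIMD instructions. The function name and trip-count assumptions are illustrative, not from the deck.

  /* 16-bit vector add: with restrict and a known trip-count multiple,
     the C6000 compiler can fetch 64 bits at a time and use SIMD adds
     instead of one scalar add per element. */
  void vec_add16(const short *restrict a, const short *restrict b,
                 short *restrict out, int n)
  {
      int i;
      /* TI code-generation pragma: promise n >= 8 and n % 8 == 0. */
      #pragma MUST_ITERATE(8, , 8)
      for (i = 0; i < n; i++)
          out[i] = a[i] + b[i];
  }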
WHERE ARE WE NOW
KeyStone Innovation
• Lowers development effort
• Speeds time to market
• Leverages TI's investment
• Optimal software reuse

Five generations of multicore (stages: Concept → Development → Sampling → Production):
• Janus, 130 nm (2003): 6-core DSP
• Faraday, 65 nm (2006): C64x+, wireless accelerators
• KeyStone, 40 nm (2011): C66x fixed and floating point (FPi, VSPi); Network and Security AccelerationPacs; ARM A8
• KeyStone II, 28 nm (2012-2013/2014): ARM A15; 10G networking; multicore cache coherency
• KeyStone III, 20 nm (2014/2015): 64-bit ARMv8; C66x+; 40G networking
K2H Platform: 66AK2H12/06 Functional Diagram

28 nm, 40 mm x 40 mm package. Block diagram: 4x/8x C66x DSP cores (1 MB L2 each) and 2x/4x ARM A15 cores (4 MB shared) connected over TeraNet with the Multicore Navigator and a 6 MB MSMC; Network AccelerationPacs (packet accelerator, security accelerator, 5-port 1 GbE switch); system elements (power manager, debug, EDMA, SysMon); EMIF and I/O (2x 64/72-bit DDR3, 16-bit EMIF, 2x UART, 3x SPI, 3x I2C, USB3); high-speed SERDES (4x 1GbE, 4x SRIO, 2x HyperLink with 8 lanes, 2x PCIe).

C66x fixed- or floating-point DSP:
• 4x/8x C66x DSP cores up to 1.4 GHz
• 2x/4x Cortex-A15 ARM cores
• 1 MB of local L2 cache RAM per C66x DSP core
• 4 MB shared across all ARM cores

Large on-chip and off-chip memory:
• Multicore Shared Memory Controller provides low-latency, high-bandwidth memory access
• 6 MB shared L2 on-chip
• 2x 72-bit DDR3 (with ECC), 16 GB total addressable, DIMM support (4 ranks total)

KeyStone multicore architecture and acceleration:
• Multicore Navigator, TeraNet, HyperLink
• 1GbE network coprocessor (IPv4/IPv6)
• Crypto engine (IPSec, SRTP)

Peripherals:
• 4-port 1G Layer 2 Ethernet switch
• 2x PCIe, 1x4 SRIO 2.1, EMIF16, USB 3.0, 2x UART, SPI, I2C
• 15-25 W depending on DSP cores, speed, temperature, and other factors
HP Moonshot – KeyStone II software-defined server
"The essential foundation for the new style of IT"
• 45 hot-plug cartridges: compute, storage, or a combination; x86, ARM, or accelerator
  – Single-server cartridge = 45 servers per chassis
  – Quad-server cartridge = 180 servers per chassis (future capability)
• Dual low-latency switches: HP Moonshot-45G switch module (180 x 1 Gb downlinks)
TI KeyStone II HPC System
• ATCA-based
• RapidIO switching
• 8 TFLOPS per blade
• 100 GB per blade
• Up to 14 blades per chassis
SOFTWARE AND TOOLS
Multicore Software Vision — Multicore ARM + Multicore DSP

Multicore ARM ("same user experience as x86 devices"):
• Mainline SMP Linux, standard Linux tools and distributions, MPI
• Augmented with TI differentiation

Multicore DSP ("make the native DSP experience easy"):
• OpenMP, libraries, Navigator run-time, RTOS and drivers, IPC, and more
• Leverage standard accelerator models: OpenCL, OpenMP Accelerator Model

Extensive toolbox for advanced programmers. Development environment:
• GDB and similar application debug environments for ARM and DSP
• Eclipse-based embedded development and debug environment
• Instrumentation and trace leveraging embedded hardware capability
Optimized C compiler:
• Multicore parallel programming models

Productive IDE – Code Composer Studio™:
• Eclipse-based; hosts TI and 3rd-party tools for easy debug
• Advanced analysis and visualization speeds software development

Efficient Multicore Software Development Kit:
• Available on both DSP and ARM with free source code
• HLOS/RTOS, optimized libraries, algorithms and drivers, multicore runtime, protocol stacks, and application demos
Fast, Effective, Open Tools — parallel computing strategies with TI DSPs (discrete to integrated: getting from here to there)

1. Get started quickly with:
   • Optimized libraries — TI libs (BLAS, DSPLIB, FFT), 3rd-party libs (VSIPL), user libs, TI tools (a minimal library-call sketch follows this list)
   • A simple host/DSP interface
2. Offload code simply with:
   • Directive-based programming
   • The OpenMP Accelerator Model (an x86 or ARM host dispatching user libs and custom functions to the DSP)
3. Create optimized functions using:
   • Standard programming
   • Vector programming
   • OpenMP programming

Hawking, 28 nm
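As a taste of step 1, here is a minimal sketch of calling a standard BLAS routine through the CBLAS interface. Whether TI's KeyStone BLAS exposes exactly this header and link name is my assumption; the point is that a DSP-offloaded BLAS can sit behind an unchanged host call.

  /* Single-precision matrix multiply, C = alpha*A*B + beta*C, via CBLAS.
     On a KeyStone-style system the same call can be backed by a
     DSP-accelerated BLAS without changing this host code. */
  #include <cblas.h>

  void gemm_example(const float *A, const float *B, float *C, int n)
  {
      cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  n, n, n,        /* M, N, K       */
                  1.0f, A, n,     /* alpha, A, lda */
                  B, n,           /* B, ldb        */
                  0.0f, C, n);    /* beta, C, ldc  */
  }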
There's No Single Answer

Different parallelism models:
• Task-parallel or data-parallel
• Coarse- or fine-grained parallelism
• Large or small data sets

Variety of system baselines:
• Current multicore systems
• A variety of methods of expressing and managing parallelism
• Already partitioned, or easily partitioned
Hierarchy of Multicore Engagement Options

From increasing control and performance at the bottom to increasing abstraction and productivity at the top:
• MPI
• Multicore libraries
• OpenMP (Accelerator Model)
• OpenCL for accelerators
• OpenMP (homogeneous)
• Explicit IPC

Development approach:
• Engage at the most abstract level
• Incrementally optimize to achieve the required performance or power efficiency
Cooperative Parallel Programming (a brief history of expression APIs/languages)

Stage 1: Node 0 through Node N, connected by MPI communication APIs.
Stage 2: each node (Node 0 through Node N) runs OpenMP threads across its CPUs, and MPI communication APIs connect the nodes. A minimal hybrid sketch follows.
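A minimal MPI + OpenMP hybrid sketch of this stage (node and thread counts are illustrative, not from the deck); build with an MPI compiler wrapper and OpenMP enabled:

  /* Hybrid parallelism: MPI ranks across nodes, OpenMP threads
     within each node. */
  #include <mpi.h>
  #include <omp.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, nranks;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nranks);

      /* Each rank fans work out to its local cores. */
      #pragma omp parallel
      printf("rank %d of %d, thread %d of %d\n",
             rank, nranks, omp_get_thread_num(), omp_get_num_threads());

      MPI_Finalize();
      return 0;
  }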
Cooperative Parallel Programming (a brief history of expression APIs/languages)

Stage 3: each node adds a GPU programmed with CUDA/OpenCL alongside the OpenMP threads on its CPUs; MPI communication APIs still connect the nodes.
Cooperative Parallel Programming on KeyStone II, as an example: each node runs OpenMP threads on its ARM CPUs and dispatches to the DSP with OpenCL; MPI communication APIs connect the nodes.
Cooperative Parallel Programming on KeyStone II, an alternative example: each node is programmed entirely through OpenCL over its CPUs (and, through it, the DSPs); MPI communication APIs connect the nodes. A host-side OpenCL sketch follows.
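To make the OpenCL rung concrete, here is a minimal host-side sketch. It uses only the standard OpenCL 1.1 C API; exposing the DSP as CL_DEVICE_TYPE_ACCELERATOR is my assumption about how such a device would appear, not something this deck states.

  /* Minimal OpenCL host: find an accelerator (e.g., a DSP), build a
     kernel, launch it, and read the result. Error checks trimmed. */
  #include <CL/cl.h>
  #include <stdio.h>

  static const char *src =
      "__kernel void scale(__global float *x, float a) {"
      "    int i = get_global_id(0);"
      "    x[i] *= a;"
      "}";

  int main(void)
  {
      cl_platform_id plat;
      cl_device_id dev;
      clGetPlatformIDs(1, &plat, NULL);
      /* Assumption: the DSP is exposed as an ACCELERATOR device. */
      clGetDeviceIDs(plat, CL_DEVICE_TYPE_ACCELERATOR, 1, &dev, NULL);

      cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
      cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);
      cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
      clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
      cl_kernel k = clCreateKernel(prog, "scale", NULL);

      float data[1024];
      for (int i = 0; i < 1024; i++) data[i] = 1.0f;
      cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                  sizeof data, data, NULL);
      float a = 2.0f;
      size_t n = 1024;
      clSetKernelArg(k, 0, sizeof buf, &buf);
      clSetKernelArg(k, 1, sizeof a, &a);
      clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
      clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof data, data, 0, NULL, NULL);
      printf("data[0] = %f\n", data[0]);   /* 2.0 */
      return 0;
  }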
Another alternative on KeyStone II: each node combines OpenCL for accelerator dispatch with OpenMP across its CPUs; MPI communication APIs connect the nodes.
A further alternative on KeyStone II: each node uses the OpenMP Accelerator Model alone over its CPUs and DSPs; MPI communication APIs connect the nodes.
OpenMP Accelerator Model: Target Construct

A pragma-based model to dispatch computation from the host to an accelerator (on K2H, from the ARMs to the DSPs).

Extends OpenMP by adding:
• A 'target' construct to indicate regions to be dispatched
• A 'map' clause to indicate data transfer between host and accelerator
  – Does not have to be a copy (e.g., shared memory)
• Clauses to indicate that variables/functions reside on the host, the device, or both
• Target regions can contain OpenMP constructs

  void foo(int *in1, int *in2, int *out1, int count)
  {
      #pragma omp target map(to: in1[0:count], in2[0:count], count) \
                         map(from: out1[0:count])
      {
          #pragma omp parallel shared(in1, in2, out1)
          {
              int i;
              #pragma omp for
              for (i = 0; i < count; i++)
                  out1[i] = in1[i] + in2[i];
          }
      }
  }

TI co-chairs the OpenMP accelerator model sub-committee and played a significant role in the spec definition.
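On the host side, dispatch is just a function call; a hypothetical caller (sizes and values are illustrative) might look like:

  #include <stdio.h>

  /* foo() is the target-offload function from the slide above. */
  void foo(int *in1, int *in2, int *out1, int count);

  int main(void)
  {
      int a[256], b[256], c[256];
      for (int i = 0; i < 256; i++) { a[i] = i; b[i] = 2 * i; }
      foo(a, b, c, 256);              /* runs on the DSPs when offload is enabled */
      printf("c[10] = %d\n", c[10]);  /* 30 */
      return 0;
  }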
EPCC Microbenchmark Data

Parallel-for overheads (cycles, by thread count):

  Threads              1      2      3      4      5      6      7      8
  OpenMP Runtime 1.2   6506   9519   10587  11600  12695  13857  15079  16423
  OpenMP Runtime 2.0   900    5788   6035   6161   6250   6368   6554   6804

Barrier construct overheads (cycles, by thread count):

  Threads              1      2      3      4      5      6      7      8
  OpenMP Runtime 1.2   2573   4461   4919   5431   6117   6619   7177   7842
  OpenMP Runtime 2.0   1667   1840   2009   2170   2366   2539   2733   2948
OpenMP Runtime 2.0:
• Significantly reduces (~2.5x) the overhead of OpenMP constructs such as parallel-for and barrier, making it feasible to use OpenMP for parallel regions with smaller granularity, i.e., fewer cycles (see the sketch below)
• An optimized OpenMP runtime built on OpenEM and libgomp (the GCC OpenMP library)
• Does not require BIOS/IPC/XDC; however, the runtime will co-exist with BIOS etc. if present in the user application
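To see why construct overhead matters, consider a region whose total work is only a few thousand cycles; a sketch (the loop body and sizes are illustrative, not from the deck):

  /* With ~16k cycles of parallel-for overhead (Runtime 1.2 at 8 threads),
     parallelizing this small loop costs more than it saves; at ~6.8k
     cycles (Runtime 2.0) it can begin to pay off. */
  void small_region(float *x, int n)   /* n on the order of a few thousand */
  {
      #pragma omp parallel for
      for (int i = 0; i < n; i++)
          x[i] = x[i] * 0.5f + 1.0f;   /* a handful of cycles per element */
  }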
WHERE ARE WE GOING
High Performance Compute – Moving to Mainstream

Application domains: oil and gas exploration, bioscience, big-data mining, weather forecasting, financial trading, electronics design automation, defense.

What these systems need:
• Compute: heterogeneous processing; a high level of parallelism; more computation capacity
• Data movement: reducing memory and IO bottlenecks; more memory and memory bandwidth
• Connectivity: efficient networking; higher IO bandwidth; more networking and IO capability
• Performance/W: increasing power efficiency; less power consumption
HPC System and Architecture Evolution: reliability, real time, safety, scalability, high performance, power efficiency.

TI Continues to Invest in DSP:
• 1982 – C1x: first 16-bit commercial DSP
• 1995 – C5000: 16-bit fixed point, ultra low power
• 1997 – C6000: 32-bit fixed and/or floating point
• 2010 – C66xx: 12.8 GFLOPS/W
• Next-gen DSP: continued DSP leadership innovation
High Performance Memory Interfaces
Hybrid Memory Cube (HMC):
• High-bandwidth serialized interface
• Large DRAM memory space
• Suitable for networking and other latency-tolerant applications
• Lower mW/Gbps

High Bandwidth Memory (HBM):
• Memory stacked into the SoC package via interposer/TSV
• Wide interface to the SoC cores
• Suitable for core-centric access requiring large bandwidth and low latency
• Higher mW/Gbps
High-Bandwidth IO and Network on Chip

Blocks: Multicore Navigator (packet DMA), HyperLink, packet accelerator, Ethernet switch, security accelerator, and other IO (PCIe, JESD204B, SRIO, USB, …), all connected over TeraNet.

• The Multicore Navigator enables zero-copy data movement and a common multicore programming model (see the sketch below)
• Modular, scalable networking solution
• HyperLink enables 50 Gbps throughput with minimal latency and software overhead
• TeraNet provides a high-throughput, non-blocking network on chip
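A tiny sketch of the zero-copy idea: cores hand off descriptors (pointers) through hardware queues instead of copying payloads. Everything here is hypothetical pseudocode modeling the concept in plain C, not TI's actual Navigator API.

  #include <stddef.h>

  /* Conceptual model of a Navigator-style hardware queue: a ring of
     descriptor pointers. In silicon this is a memory-mapped queue;
     here it is plain C so the sketch compiles and runs. */
  typedef struct { void *payload; int len; } descriptor_t;

  #define QDEPTH 64
  static descriptor_t *ring[QDEPTH];
  static unsigned head, tail;

  static void queue_push(descriptor_t *d) { ring[head++ % QDEPTH] = d; }
  static descriptor_t *queue_pop(void)
  {
      return (tail == head) ? NULL : ring[tail++ % QDEPTH];
  }

  /* Producer core: hand off the buffer by pointer -- no payload copy. */
  void send_packet(descriptor_t *d) { queue_push(d); }

  /* Consumer core: receives the very buffers the producer filled. */
  int bytes_received(void)
  {
      int total = 0;
      descriptor_t *d;
      while ((d = queue_pop()) != NULL)
          total += d->len;
      return total;
  }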
Holistic Power Optimization — From Board to Transistor Level

Board level:
• Memory integration (e.g., HMC, HBM)
• Interposer/TSV
• Signal transport — on-die SerDes

Device level:
• Low-voltage operation
• DVFS, retention, bias
• In-package voltage regulation

Transistor level:
• FinFET
• Significant leakage-current reduction with lower Vdd

[Diagram: ASIC 1 and ASIC 2 mounted on a Si interposer.]
Overcoming Integration Complexities

Industry-standard ecosystem:
• In-house and 3rd-party IP — interface IP, core IP, and soft IP

Dynamic power management:
• IO voltages, core-logic AVS and DVFS domains, SRAM supplies

Static power management:
• Power domains on processor cores, accelerators, and I/O

Board-level feature integration:
• Asynchronous clocking, scalable clocks, fixed-frequency clocks
• A/D and D/A converters, RF integration, voltage regulation

System management:
• Reset, clocking, DFT, interrupts, interconnect fabric
Thank you
Texas Instruments
Design is a strategic asset