multi-processor systems-on-a-chip: trends and...
TRANSCRIPT
Multi-Processor Systems-on-a-Chip:Trends and Technologies
Pierre Paulin, DirectorSoC Platform Automation, AST
STMicroelectronics, Ottawa, Canada
2
Outline
�MP-SoC Trends�Key market trends�Key technology trends
�MP-SoC Technology Objectives�ST’s MP-SoC Technologies�Application to platform mapping
�Applications: Multimedia, Networking�R&D Outlook
3
4
Emb. Systems Prog. Survey 2005:Number of Processors per chip
Source: Embedded Systems Programming Magazine, 2005
�Nearly 50% of chips use multiple processors�Over 100 projects used >10 processors
1
2
3-5
6-10
>10
5
Processor Heterogeneity
�60% of systems are heterogeneous MP
Source: Embedded Systems Programming Magazine, 2005
Multipleidentical
Multipledifferent
Board Multipleidentical
Board Multipledifferent
6
Processor Selection Criteria
Source: Embedded Systems Programming Magazine, 2005
�Quality of software tools sells the processor
S/W toolsPerformance
Price
7
Market Trends Summary
�Continued reduction in time-to market�Fast changing specs, requirements�Need for high-level capture and mapping�Increasing cost of developing SoC platforms� 10M$ to 80 M$ for today’s 90nm SoC’s�Need to increase time-in-market
Implies higher flexibility is needed�Rising proportion of SoC function in eS/W�Currently 50% to 75% of design costs�Majority of all SoC’s have 2 or more
processors� Leading edge (20%) have 6-10 processors
eFPGAeFPGA
8
Outline
�MP-SoC Trends�Key market trends�Key technology trends�MP-SoC Technology Objectives�ST’s MP-SoC Technologies�Applications�R&D Outlook
9
0
100
200
300
400
500
600
700
800
900
1000
ARM7TDMI MIPS M4K ARM1176 MIPS 24K
MHz
MIPS
mw/GIPS
Power Scaling (Core Only)Commercial Cores
Video codectarget power
(100 mW)
Source: ARM, MIPS
10
Power Scaling (w 16KB/16KB D/I Caches)
0100200300400500600700800
ARM920 T ARM1022E ARM1176
Processor
MHz
MIPS
mW/GIPS
Power Scaling (Cores + Caches)
Video codectarget power
(100 mW)
Source: ARM
11
Voltage and Frequency Scaling(ADI Blackfin DSP)
Source: ADI
12
mW/GIPS
0
50
100
150
200
100 200 300 400 500 600
MHz
Voltage and Frequency Scaling(ADI Blackfin DSP)
0.8V1.0V
1.2V
0.9V
Source: ADI
13
DSP
Technology Trend: Latencies
Out
NoC
In SRAM
DSPDSPH/W-MT
RISCH/W
Proc. Element$
GPP
I$D$ I$
Increasing gap memory & processor
speeds(2x / 2 years)
Increasing gap interconnect &
gate delays(multi-clock
intra-chip delay)
More parallel processing
(lower-power, higher-perf./mm2) implies more IPC
14
Technology Trends Summary�Embedded processor speed wall ~1GHz�More function requires more parallelism
�Parallelism favors lower power solutions�Single 625 MHZ MIPS processor
consumes 4x more power than 3x 225 MHz processors�Need to efficiently program
these parallel cores though …
�Three types of rising latency�Growing gap in interconnect v.s logic delays�Growing gap in memory v.s. logic delays�MP implies inter-processor communication
vs.
3 x 32mW = 96 mW 365mW
15
Outline
�MP-SoC Trends�Key market trends�Key technology trends
�MP-SoC Technology Objectives�ST’s MP-SoC Technologies�Applications�R&D Outlook
16
What is Next SoC Paradigm Shift?
Architecture Platforms� Hard-disk drive platform� Set-top box, DVD, HDTV� Mobile multimedia � Image processing platform
System Applications� Audio codecs� Video codecs� Still image processing� Communication stacks
Component IP� Value-add cores, IP� Commodity cores:
GPP, DSP, MCU, Bus� Libraries: Cells, memories, I/O
2 yrs
year
� Algm./system expert� S/W developers� Platform programming
� Domain-specific platform expertise
� Fast platform derivatives
1995: RTL, ISA " Insulation"
2005: Platform " Insulation"
Systemfunction
RTL to Layout
H/W-S/W architecture
� Component expertise� Process differentiators(BiCMOS, eDRAM,eFlash, eFPGA)
1985: Cell Library " Insulation"
months
Prog. Models
NoC
I/O Mem
Proc. Proc.
H/W
17
Configurability Trade-offs
RTL
PhysicalDesign
Cell Library
RTL
FPGA Fabric
CProcessor Platform
(ISA)
Abstraction
Intrinsic efficiency
Year Year- 3 months
Month
Deepsubmicron
effects
18
SMP
Platform Programming Models
DSOC
Flexible SoC Platform Objectives
Abstraction
sAsic FPGARTL
WeeksISA C
Weeks
<3 months
RTL
Weeks(reuse)
Cell Library
Connectivity Platform
RTL
19
Outline
�MP-SoC Trends�Key market trends�Key technology trends
�MP-SoC Technology Objectives�ST’s MP-SoC Technologies�Application to platform mapping
�Applications�R&D Outlook
20
System-on-Chip Platform AutomationOttawa, Canada
� Info:[email protected]+1 (613) 768-9069
� Focus on use of flexible SoC platforms� Multi-processor, Network-on-Chip, Configurable H/W
� Applications� Multimedia (A/V/I), Networking, 3G basestation
� Leverage ST’s other system design technologies
21
HW/SW Platform MappingMultiFlexMultiFlex
MultiFlexMP-SoCTools
Multi-processor SoC Platform MultiFlex SoC Tools
FPGAFPGAFPGA FPGAH/W PE H/W PE
I/O
I/O
I/O . . .eMEMeMEM
. . .eFPGAeFPGARISC DSP
� Lightweight mapping tools� Supported by H/W RTOS
thread-level parallelism� Interoperable with
general-purpose O/S� MP analysis tools
System Application
Parallel programming models
Object1 Object2
Object4Object3
SharedMemory
T1T2
T3
- Temporal constraints- Spatial constraints
Msg.passing
� Proven, established programming models(msg. passing, SMP)� User-defined
parallelism
� Simplify the use ofheterogeneous HW/SWMP-SoC architectures
22
Processors integrated:
MicroBlaze®
PowerPC®
FlexMP Demo Virtual Platform
FPGAFPGAFPGA FPGAH/W PE(sASIC)
H/W PE(eFPGA)
. . .
Processor 1Processor 1 Processor NProcessor N
eMEMMEM
NetworkNetwork--onon--ChipChip
Proc. MEM
AS-IS
P1Pn
Proc. MEM
AS-IS
P1PnCoProc
I/O
App-specI/O
MemI/O
Gen-purpI/O
�Multi-threaded, multi-processor platform� Popular processor models w. config. extensions:
H/W multithreading and pipeline depth
ASIC
ConfigWMem
ProcessorPacketi-zation
H/W schedulers
(OCP(OCP--IP, IP, STBusSTBus))
MPU I/O
Host GPP
D$I$
23
Thread 1
Thread 2
Thread 3
Thread 4
Co
re 1
Hiding latency with H/W MT
C Mem C Mem
IPCC C Mem
Mem C IPC C
C Co-proc C MemMem
Mem
IPCC
C
C Mem
Co-proc
C IPC
C Mem
C Mem
C Mem
C
Mem
Thread 1
Thread 2
Thread 3
Thread 4C
ore
2
24
� Two parallelProgramming Models�DSOC: Object-Oriented
Message passing�SMP: Shared memory
I/O
I/O
I/O
MultiFlex MP-SoC Platform Tools
Application to platform mapping
. . .
NoC
. . .eFPGARISC
mem
FPGAFPGAH/WeMEMeMEM
Fine grain parallelism
Parallel System-Level Application Code
eFPGADSP
mem
H/W MP-O/S scheduleracceleratorsH/W O/S
Schedulers
H/W message passing,IP Plug and Play
25
FPGAFPGAeSoG/eFPGA
I/O
I/O
I/OeMEMeMEM
Impact of H/W Accellerators
. . .
NoC
H/W O/SSchedulers
Parallel System-Level Application Code
. . .eFPGA
RISC
mem eFPGADSP
mem
� Dynamic management fine-grain parallelism(~100 instrns. or more)� Fault tolerant
[Power, QoS in future]
Msg passing rate:<15 instructions
(15M msg/s @200 MHz)
Fork-join (for N forks): 10 instructions+ 12 cycles/thread
Additional cost(5 processor system): +0.3 mm2 (@90 nm)
26
1~8 2~6
Exploitable Parallelism
GP O/SThread-LevelParallelism
ILPVLIW, SIMD
1
10 000’sInstructions
Min parallel grain size (instrns.)
Exploitable taskparallelism
//m overhead
1~100
MultiFlexThread-Level
Parallelism
100’s10’s
1000’s
Overheads:1. Context switching2. Message Passing3. Scheduling
27
Application to Platform Mapping (1)
DSC
Video
SMP
T1 T2 T3Audio1
Audio2
ControlControl
SMP
T1 T2 T3
Control
� High-level communicatingparallel application objects�Platform independent�Map onto host GP O/S
to validate HL function and parallelismHost (Linux/Unix)
28
Out
NoC
In DSOC H/W scheduler
SMP H/W schedulerMEM
DSPDSPHwMT
RISCH/WProc. Element
DSPDSPDSP,
ASIP$Host GPP
D$I$ I$ I$hw
MEM MEMMEMMEM
hw
Control
Application to Platform Mapping (2)
DSC
Video
SMP
T1 T2 T3Audio1
Audio2
ControlControl
SMP
T1 T2 T3
GP O/SLinux/Symbian/WinCE
Control
29
Outline
�MP-SoC Trends�MP-SoC Technology Objectives�ST’s MP-SoC Technologies�Application to platform mapping
�Applications�Multimedia, Networking, 3G
�R&D Outlook
30
3G
MultiFlex MP Applications�Packet processing� 2.5/10 Gbps IPv4 forwarding, traffic manager�High-throughput platforms
�Nomadik mobile multimedia�Heterogeneous platform, O/S interoperability� Impact: Hundreds of programmers�Video codec exploration�VGA (30fps): Small grain, mixed SMP/DSOC�Digital Still Camera�Array access optimization�3G basestation�Heterogeneous MP (ARM11, 3X DSP, 3X ASIP)
31
�Results:� 85-92%
PE utilization�Msg
passing code <20%
DSOC+SMP Traffic Manager(2.5Gb/s)
ParameterizableParameterizable NetworkNetwork--onon--Chip (latency LChip (latency Li, ji, j ))
. . .
Pipe1
PipeP
RISC 2
…..T1 Tm
Data$
ingSPI egrSPIextSRAM
SMPqueMgr
extDRAM
dataMgr
Pipe1
PipeP
RISC 1
…..T1 Tm
Data$
Pipe1
PipeP
RISC N-1
…..T1 Tm
Data$
Pipe1
PipeP
RISC N
…..T1 Tm
Data$
ing
SP
I
egrS
PI
dataMgr
queMgr
ing
Pkt
sch
Pkt
egrP
kt
shP
ort
Sh
are
mem
Sh
are
mem
SRAMSRAMSRAM
#proc. = 9-10 (@ 500 MHz)
#threads = 4-16NoC Latency =
0, 40, 80 Clks(+/-25% jitter)
DSOCscheduler
32
Traffic Manager Performance
0
0,5
1
1,5
2
2,5
3
3,5
4
4,5
4 8 16
#Threads/Processor
Th
ou
gh
pu
t (G
b/s
)
Latency 0
Latency 40
Latency 80
# Ports = 256# Class = 32# Proc. = 10
(500 MHz)
NoC Latency(clock cycles)
# of H/W Threads per Processor
Th
rou
gh
pu
t (G
b/s
ec)
Traffic Manager Performance
33
5% offunctionality
MPEG4 Video Codec (VGA, 30fps)
SAD
DCT/iDCT
Badd/Bdiff
Bq/Biq
Bsum
H/Whdl
STBusSTBusC
95% offunctionality
Parallelization (SMP, DSOC), mapping in <2 monthsFlexibility in S/W, Performance in H/W
Pipe1
PipeP
ARM 1
…..T1 Tm
Data$
Pipe1
PipeP
ARM 1
…..T1 Tm
Data$
Pipe1
PipeP
ARM 1
…..T1 Tm
Data$
Pipe1
PipeP
ARM 1
…..T1 Tm
Data$
Pipe1
Pipe4
ARM 1..5
…..T1 T8
I$
S/W// C
ReferenceC code(4.1 GIPS)
80% ofperformance(3.2 GIPS)
20% ofperformance(0.9 GIPS)
34
Local Local STBusSTBus
DDR ExternalMemory
VideoIn
VideoOut
SMPDSOCscheduler
MPEG4 Video CodecArchitecture
…
Pipe1
PipeP
ARM 1
…..T1 Tm
Data$
Program Cache #1
Pipe1
PipeP
ARM 5
…..T1 Tm
Data$
System System STBusSTBus
Program Cache #N
< 200 MHz (DDR rate)
CLIP
DIV
ABS
SIGN
CLIP
DIV
ABS
SIGN
L1 data Cache #N
L1 data Cache #1
L1 Data Bank
(Stack,Data)
Interleaver
L1 Data Bank
(Stack,Data)
L1 Data Bank
(Stack,Data)
L2 Data Bank
(Data)
SAD
DCT/iDCT
Badd/Bdiff
Bq/Biq
Bsum
BieBzigzagL1
Data
L1 Data Bank
(Stack,Data)
35
Execution speed upMPEG4 VGA Video Codec Results
0
5
10
15
20
25
30
35
2 3 4 5 6Number of ARMs
Fra
me/
Sec
8 threads Theoretical 2 threads4 threads Latency = 0
Theoretical upper bound
MultiFlexresult (STBus)
MultiFlex(0 latency bus)
1 thread
8 threads:4X speedup,1.5X area
36
� Two parallelProgramming Models�DSOC: Object-Oriented
Message passing�SMP: Shared memory
I/O
I/O
I/O
MultiFlex Summary
Application to platform mapping
. . .
NoC
. . .eFPGARISC
mem
FPGAFPGAH/WeMEMeMEM
Parallel System-Level Application Code
eFPGADSP
mem
H/W MP-O/S scheduleraccelerators
H/W O/SSchedulers
H/W message passing,IP Plug and Play
Programmability + High-performance
37
� Platformindependent S/W� S/W reuse across
multiple products
� Platform scalability� Exploit fine-grain
parallelismArchitecture explorationHigh-performanceand flexibilityProgrammingproductivity
MultiFlexValue-add
FPGAFPGAH/W PE
I/O
I/O
I/O eMEMMEM
DSPI/O
MP-SoCPlatform N
RISC
FPGAFPGAH/W PE eMEMMEM
I/O GPP
MP-SoCPlatform 1
Object1 Object2
Object4Object3
SharedMemory
T1T2
T3Msg.
passing
System Application
38
Outline
�MP-SoC Trends�Key market trends�Key technology trends
�MP-SoC Technology Objectives�ST’s MP-SoC Technologies�Applications�R&D Outlook
39
Key Challenges
�Many architecture parameters to explore�HW-S/W partition�Number and specialization of processors
Memory architecture�Virtual platform simulation �Good observability, controllability, debug�High level of abstraction
Too slow�Verification, temporal correctness
Need formal parallelism analysis tools