TRANSCRIPT
Uri Weiser, EE, Technion
Alternative Computing Day February 9th, 2009
Asymmetric Chip Multi-Core: The Future Chip Multiprocessor
1
Sailing Basics
[Diagram: wind direction and buoy]
2
Sailing – wind shift
[Diagram: wind direction and buoy]
3
25th America's Cup – September 23rd, 1983: Liberty against Australia II
Seven-race series; score: 3:0 to Liberty, 4th race started
4
Sailing competition – getting there first
[Diagram: start line, wind, and buoy; Boat 1 and Boat 2 racing to the buoy]
- What is the Strategy of Boat 1?
- What is the Strategy of Boat 2?
Do not follow, Invent
5
Outline
I selected the other way, though not as far as alternative computing
Today's environment
– On die: Cores; Platform: NUMA
The application environment
Cores implications
The memory wall
NUMA implications
Scheduling
Conclusions
6
Cores implications
CMP has been ubiquitous since Paul Otellini announced the Core 2 Duo in June 2006
Reasons: Power wall
– Single-core performance/power trend
– Process technology
Open questions
– How many cores? 10s? 1000s? Implications!
– Cache architecture?
– Are the cores Symmetric (aka Homogeneous) or Asymmetric?
– If Asymmetric
– Same ISA
– Different ISA extensions
– Application specific accelerators
– Task/threads scheduling
7
NUMA implications
The second boat chose NUMA in 2007
Reasons: Integration and Performance
Implications
– Synchronize Memory Controllers (MC)
– Future Multiple MC on die
Open issues/opportunities
– All memory traffic runs through the CPU
– Task/Threads assignments and scheduling
[Diagram: two nodes, each with four cores, a cache ($), and local memory (Mem), connected to each other and to the outside world]
8
Outline
Today's environment
– On die: Cores; Platform: NUMA
The application environment
Cores implications
The memory wall
NUMA implications
Scheduling
Conclusions
9
The Computer runs multiple programs
[Diagram: CPUs & caches, memory, and mass storage ("computation and scratch pad") connected to the "outside" world:
– Fast I/O, e.g. camera, accelerators
– Slow I/O, e.g. human interface
– Display
– Communication
– Computation engines, e.g. graphics, media, and communication engines]
10
Flying machines – are they all the same?
11
Flying machines – are they all the same?
12
Flying machines – are they all the same?
13
Requirements and program behavior
QoS
– Application's response: latency + performance
– Differential services in a multiple-applications environment
– Priority of programs
– Priority of data
– Lossless or loss-tolerant (e.g. music, video, communication…)
– Power constraints
– Reliability
14
Requirements and program behavior
Parallelism
– Thread-level parallelism: many/huge number of threads, 10s or 1000s
– Data-level parallelism: vector/SIMD
– Instruction-level parallelism: ILP
– Pipelining (one dimension/few dimensions: graphics/systolic algorithms)
Remember Amdahl's law (see the worked example below)
15
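As a reminder, with a parallel fraction f running on n processors, the achievable speedup is bounded by the serial fraction; the numbers below are illustrative, not figures from the talk.

```latex
% Amdahl's law (illustrative numbers):
\mathrm{Speedup}(n) = \frac{1}{(1-f) + f/n},
\qquad \text{e.g. } f = 0.95,\ n = 1000:\quad
\frac{1}{0.05 + 0.95/1000} \approx 19.6 \;<\; \frac{1}{1-f} = 20
```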
Requirements and program behavior (cont.)
– Memory BW
– Computation vs. memory access, e.g. Bytes/Flop (e.g. 0.05 Byte/Flop)
– Streaming, e.g. media, communication, stock-exchange data
– Computation behavior: FP, integer, branches, loops
– Locality of program/data: impact on caches, TLBs, memory
– Sharing
– Footprint: program/data
16
Outline
Today's environment
– On die: Cores; Platform: NUMA
The application environment
Cores implications
Memory Wall
NUMA implications
Scheduling
Conclusions
17
CMP implication: Same ISA / ISA extensions / Application-specific accelerators
Analog Circuit Paradigm
GBWP = Gain-Bandwidth Product = constant at a given technology
[Plot: gain vs. frequency; two operating points (Gain1, BW1) and (Gain2, BW2) on the same GBWP curve]
e.g. Gain1 × BW1 = Gain2 × BW2 (a numeric instance follows below)
18
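A quick numeric instance of the paradigm (illustrative values, not from the slide): with the gain-bandwidth product fixed, gain can be traded for bandwidth; the next slides draw the analogous tradeoff between performance and application range for compute engines.

```latex
% Constant gain-bandwidth product (illustrative values):
\mathrm{Gain}_1 \cdot \mathrm{BW}_1 = \mathrm{Gain}_2 \cdot \mathrm{BW}_2,
\qquad \text{e.g. } 100 \times 1\,\mathrm{MHz} = 10 \times 10\,\mathrm{MHz} = 100\,\mathrm{MHz}
```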
Analog Circuit Paradigm (cont.)
[Plot: gain vs. frequency; a range of operating points from BW1 to BWn along the constant-GBWP curve]
19
Application domain – Same ISA / ISA extensions / Application-specific accelerators
[Plot: performance vs. application range; general-purpose engines cover a wide range at lower performance, application-specific accelerators (not our generic ISA) cover a narrow range at higher performance]
Application-specific engines achieve higher performance/power than general-purpose engines
20
Application domain – Same ISA / ISA extensions / Application-specific accelerators
[Plot: performance vs. application range; the original ISA covers a wide range, while general-purpose engines with ISA extensions (e.g. ISA extension A) raise performance over parts of the range]
21
The memory Wall
Many execution engines → more performance → more memory accesses
How to reduce the performance impact of memory latency?
– Caches
– Threads
22
The memory Wall – the Cache solution
The "MultiCore" machines' solution: use the cache as a memory-access "filter" (see the sketch below)
Implications:
– Number of cores: cache area limits the number of cores
– BW: reduces memory-access bandwidth
– Coherency: in a flat memory structure the cache needs a coherency mechanism
– Cache access time (see example: Nahalal)
23
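A minimal back-of-the-envelope sketch of the "filter" effect, assuming illustrative per-core reference rates and miss rates (not figures from the talk):

```python
# The cache acts as a "filter": only misses generate off-die memory traffic.

def offdie_bandwidth(cores, refs_per_core_gbs, miss_rate):
    """Off-die traffic from `cores` cores, each issuing `refs_per_core_gbs` GB/s
    of memory references, with the given cache miss rate."""
    return cores * refs_per_core_gbs * miss_rate

demand   = offdie_bandwidth(cores=16, refs_per_core_gbs=4.0, miss_rate=1.0)   # no cache: 64 GB/s
filtered = offdie_bandwidth(cores=16, refs_per_core_gbs=4.0, miss_rate=0.05)  # 5% misses: 3.2 GB/s

print(f"off-die demand without cache: {demand:.1f} GB/s, with cache: {filtered:.1f} GB/s")
```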
Example: Nahalal – Shared $ vs. Private $
[Diagram: a multicore die with cores, a memory controller (MC), an interconnect, and cache ($)]
[Photo: aerial view of the Nahalal cooperative village, whose circular layout inspired the cache organization]
[Overview of the Nahalal cache organization: eight CPUs (CPU0–CPU7) arranged around a central shared cache (S$), each with its private cache (P$) on the outer ring; a back-of-the-envelope illustration of the benefit follows below]
24
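A back-of-the-envelope illustration of why the layout can help (all latencies and the shared-access fraction are made-up numbers, not results from the Nahalal work): data shared by many cores sits in the central shared cache, close to every core, while private data stays in the nearby private cache.

```python
# Illustrative average L2 hit time: shared-in-the-middle (Nahalal-style) vs. a
# layout where shared lines end up in distant banks. Numbers are invented.

def avg_hit_time(frac_shared, t_shared, t_private):
    return frac_shared * t_shared + (1 - frac_shared) * t_private

baseline = avg_hit_time(frac_shared=0.3, t_shared=40, t_private=20)  # shared lines far away
nahalal  = avg_hit_time(frac_shared=0.3, t_shared=10, t_private=20)  # shared lines in the center

print(f"baseline: {baseline:.1f} cycles, Nahalal-style: {nahalal:.1f} cycles")
```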
Nahalal Cache Performance vs. CIM
27.41% improvement in average cache access time; 41.1% in apache
[Bar chart: average L2 hit access time (cycles) for CIM vs. NAHALAL on equake, fma3d, barnes, water, apache, zeus, and specjbb; per-benchmark improvements range from 3.9% to 41.1%]
[Diagram: CIM layout, with CPU0–CPU7 surrounding cache banks Bank0–Bank7]
25
The memory Wall – the Threads solution
The "MultiThreaded" machines' solution:
Hide memory latency behind many, many threads (see the sketch below)
When 1000s of threads exist, NO CACHE is valuable/needed
Implications:
– Applications: 1000s of concurrent threads
– BW: major impact on memory-access bandwidth
– Copies: need to keep many threads' register files and states
– Exception handling: complex structure
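A rough sketch of how many ready threads a core needs so that memory latency is hidden by switching among them; the latency and compute figures are illustrative, not from the talk.

```python
# While one thread waits on memory, the core runs other threads; each of those
# runs for some compute cycles before it too misses.

def threads_to_hide_latency(mem_latency_cycles, compute_cycles_between_misses):
    full = -(-mem_latency_cycles // compute_cycles_between_misses)  # ceiling division
    return 1 + full  # the waiting thread plus enough others to cover its wait

print(threads_to_hide_latency(mem_latency_cycles=400, compute_cycles_between_misses=20))  # -> 21
```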
Outline
Today's environment
– On die: Cores; Platform: NUMA
The application environment
Cores implications
The memory wall
NUMA implications
Scheduling
Conclusions
27
NAHALAL Cache structure / Asymmetric NUMA Architecture
[Diagram, NAHALAL cache structure: CPUs, each with a private cache, arranged around a shared cache]
[Diagram, asymmetric NUMA architecture: CPU clusters, each with its private memory, connected through a bridge to a "shared memory"]
28
NUMA implications – a platform view
[Diagram: CPUs with caches and memory (NUMA) on one side of a bridge; I/Os, accelerators, communication, storage, and external devices on the other]
29
Platform NUMA
[Diagram: CPUs and memory behind a bridge; I/Os, accelerators, communication, storage, and external devices with their own memory (aka M3), forming a platform-level NUMA]
• Can we improve platform data movements?
• Reduce unnecessary traffic
• Efficient on-loading
30
Outline
Today's environment
– On die: Cores; Platform: NUMA
The application environment
Cores implications
NUMA implications
Scheduling
Conclusions
31
Task Scheduling
Current platform task scheduling is performed by the OS based on:
– Available resources
– Ready tasks
Are there other dimensions?
– Scheduling based on thread’s essence
– Can we impact scheduling based on “The data content”?*
*Commex Technologies, patent pending
32
33
ACCMP - Asymmetric Cluster CMP
Big cores and small cores, same ISA
Serial phases execute on the large core
Parallel phases execute on all cores
[Diagram: one large core (β) and many small cores (a); serial and parallel execution phases]
34
ACCMP - Analysis
ACCMP Performance vs. Power
[Plot: relative performance vs. relative power, comparing the symmetric upper bound with Asymmetric (a=1, β=4) and Asymmetric (a=0.33, β=6)]
ACCMP delivers more performance per unit power! (a toy model of this tradeoff is sketched below)
35
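To make the comparison concrete, here is a minimal sketch of the kind of Amdahl-style model behind such plots. The serial fraction, the power proxy (sum of per-core relative performance), and the specific configurations are illustrative assumptions, not the talk's exact equations.

```python
# Asymmetric cluster: one large core of relative performance `beta`, plus `n`
# small cores of relative performance `a` each. Serial code runs on the large
# core; parallel code runs on all cores.

def accmp_performance(f_parallel, a, beta, n):
    serial_time = (1.0 - f_parallel) / beta
    parallel_time = f_parallel / (beta + n * a)
    return 1.0 / (serial_time + parallel_time)

def accmp_power(a, beta, n):
    return beta + n * a  # crude power proxy (assumption), just for the comparison

for a, beta, n in [(1.0, 1.0, 15), (1.0, 4.0, 12), (0.33, 6.0, 30)]:
    perf = accmp_performance(f_parallel=0.9, a=a, beta=beta, n=n)
    power = accmp_power(a, beta, n)
    print(f"a={a}, beta={beta}, n={n}: perf={perf:.2f}, perf/power={perf/power:.3f}")
```

Under these assumptions the asymmetric configurations deliver more performance per unit power than the symmetric one, which is the qualitative point of the slide.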
Multiple Applications Scheduling
[Timeline diagram: applications App-A and App-B with serial and parallel phases over time t]
Observation: Serial threads are the bottleneck of applications
Current OSes grant serial threads the same priority as parallel threads, resulting in lower system throughput and unfairness
Solution: Boost the priority of serial threads (see the sketch below)
ACCMP: serial threads will be granted the more powerful cores
Benchmark Throughput
36
[Chart: benchmark throughput as a function of the serial-to-parallel ratio of synthetic benchmarks "A" and "B", each swept from ∞:1 to 1:∞]
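A minimal sketch of the scheduling idea: identify which threads are serial bottlenecks, boost their priority, and steer them to the powerful cores. The Thread class, the is_serial flag, and the core lists are made up for this sketch; it is not the talk's or any OS's actual scheduler interface.

```python
from dataclasses import dataclass

@dataclass
class Thread:
    name: str
    is_serial: bool   # this thread is a serial phase / bottleneck of its application
    priority: int = 0

def schedule(threads, big_cores, small_cores, boost=10):
    """Boost serial threads and place them on big cores first."""
    assignment = {}
    for t in sorted(threads, key=lambda t: t.is_serial, reverse=True):
        if t.is_serial:
            t.priority += boost
            core = big_cores.pop(0) if big_cores else small_cores.pop(0)
        else:
            core = small_cores.pop(0) if small_cores else (big_cores.pop(0) if big_cores else None)
        assignment[t.name] = core
    return assignment

threads = [Thread("A-serial", True), Thread("B-serial", True),
           Thread("A-par-0", False), Thread("B-par-0", False)]
print(schedule(threads, big_cores=["big0"], small_cores=["small0", "small1", "small2"]))
```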
Applications' data requirements
Differential services
– Some data is latency-intolerant (e.g. banking transactions)
– Some data is BW-intolerant (e.g. video)
– Some data is more "valuable"…
Why not handle data according to its "value" (in addition to its processing requirements)?
37
Data Content Aware (DCA) Architecture*
Task assignment based on Data Content
[Diagram: system without DCA vs. system with DCA]
*Commex Technologies, patent pending
38
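A toy sketch of the idea of task assignment based on data content (illustrative only; not the patented Commex mechanism). The classification rules and queue names are invented.

```python
def classify(item):
    """Assign a service class based on the data's content."""
    if item.get("type") == "transaction":
        return "latency_critical"      # e.g. banking transactions
    if item.get("type") == "video":
        return "bandwidth_critical"
    return "best_effort"

queues = {"latency_critical": [], "bandwidth_critical": [], "best_effort": []}

for item in [{"type": "transaction", "id": 1}, {"type": "video", "id": 2}, {"type": "log", "id": 3}]:
    queues[classify(item)].append(item)  # steer each item to the queue/engine suited to it

print({k: [i["id"] for i in v] for k, v in queues.items()})
```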
[Chart: latency (µsec) vs. load (PPS), comparing "Commex Classification" against "Symmetric"; annotation: "No service w/o Content Aware"; measured latencies range roughly from 580 to 2740 µsec at loads up to 2,000,000 PPS]
*Test does not include UDP zero copy
39
Latency vs. load
40
Conclusions
The environment is changing
– Many open issues → interesting research topics
– Core homogeneity?
– Computation assignment and scheduling?
– Thread-essence based?
– Data-content based?
– To cache or not to cache?
– Cache structure?
– …
41
Thanks: This research is related to the work of Zvika Guz, Tomer Morad, Leonid Azriel, and Commex Technologies
Thank you
42