KAIST Computer Architecture Lab.
The Effect of Multi-core on HPC Applications in Virtualized Systems
Jaeung Han¹, Jeongseob Ahn¹, Changdae Kim¹, Youngjin Kwon¹, Young-ri Choi², and Jaehyuk Huh¹
¹ KAIST (Korea Advanced Institute of Science and Technology)
² KISTI (Korea Institute of Science and Technology Information)
Outline
• Virtualization for HPC
• Virtualization on Multi-core
• Virtualization for HPC on Multi-core
• Methodology
• PARSEC – shared memory model
• NPB – MPI model
• Conclusion
2
Outline
• Virtualization for HPC
• Virtualization on Multi-core
• Virtualization for HPC on Multi-core
• Methodology
• PARSEC – shared memory model
• NPB – MPI model
• Conclusion
3
Benefits of Virtualization
4-9
• Improve system utilization by consolidation
• Support for multiple types of OSes on a system
• Fault isolation
• Flexible resource management
• Cloud computing
[Figure, built up across slides 4-9: Windows, Linux, and Solaris VMs running on a Virtual Machine Monitor over shared hardware; the cloud computing step spans multiple such machines]
Virtualization for HPC
• Benefits of virtualization
– Improve system utilization by consolidation
– Support for multiple types of OSes on a system
– Fault isolation
– Flexible resource management
– Cloud computing
• HPC is performance- and resource-sensitive
• Virtualization can help HPC workloads
10
Outline
• Virtualization for HPC
• Virtualization on Multi-core
• Virtualization for HPC on Multi-core
• Methodology
• PARSEC – shared memory model
• NPB – MPI model
• Conclusion
11
Virtualization on Multi-core
12
• More VMs on a physical machine
• More complex memory hierarchy (NUCA, NUMA)
[Figure: eight cores, each hosting multiple VMs, split across two shared caches, each cache backed by its own memory]
Challenges
• VM management cost
• Semantic gaps
– vCPU scheduling, NUMA
13
[Figure: the Virtual Machine Monitor multiplexes many VMs over a two-socket NUMA machine, handling scheduling, memory, communication, and I/O multiplexing; contrasted with a single OS running directly on a multi-core machine]
Outline
• Virtualization for HPC
• Virtualization on Multi-core
• Virtualization for HPC on Multi-core
• Methodology
• PARSEC – shared memory model
• NPB – MPI model
• Conclusion
14
Virtualization for HPC on Multi-core
• Virtualization may help HPC
• Virtualization on multi-core may have some overheads
• For servers, improving system utilization is a key factor
• For HPC, performance is the key factor
15
How much overhead is there?
Where does it come from?
Outline
• Virtualization for HPC
• Virtualization on Multi-core
• Virtualization for HPC on Multi-core
• Methodology
• PARSEC – shared memory model
• NPB – MPI model
• Conclusion
16
Machines
• Single socket system
– 12-core AMD processor
– Uniform memory access latency
– Two 6MB L3 caches, each shared by 6 cores
• Dual socket system
– 2x 4-core Intel processors
– Non-uniform memory access latency
– Two 8MB L3 caches, each shared by 4 cores
(a topology-inspection sketch follows this slide)
17
[Figure: single socket, 12-core CPU with two L3 caches (6 cores each) and one memory; dual socket, 2x 4-core CPUs, each socket with its own L3 cache and memory]
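The two testbeds differ mainly in memory topology, which drives everything that follows. As a quick way to inspect what a given host looks like, here is a minimal libnuma sketch (my addition, not part of the talk); on the dual socket machine it should report two nodes:

```c
#include <numa.h>
#include <stdio.h>

/* Print the NUMA layout of the host (illustrative sketch, not from the
   talk). Build with: gcc topo.c -lnuma */
int main(void) {
    if (numa_available() < 0) {
        puts("NUMA not supported on this kernel");
        return 1;
    }
    int nodes = numa_num_configured_nodes();
    printf("%d NUMA node(s)\n", nodes);
    for (int n = 0; n < nodes; n++) {
        long long freemem;
        long long size = numa_node_size64(n, &freemem);
        printf("node %d: %lld MB total, %lld MB free\n",
               n, size >> 20, freemem >> 20);
    }
    return 0;
}
```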
Workloads
• PARSEC
– Shared memory model
– Input: native
– On one machine (single and dual socket)
– Fixed: one VM
– Varied: 1, 4, 8 vCPUs
• NAS Parallel Benchmark
– MPI model
– Input: class C
– On two machines (dual socket), connected by a 1Gb Ethernet switch
– Fixed: 16 vCPUs
– Varied: 2 ~ 16 VMs
18
[Figure: PARSEC runs in one VM on a single NUMA machine (semantic gaps); NPB runs across two machines, each hosting several VMs on a VMM (VM management cost)]
Outline
• Virtualization for HPC
• Virtualization on Multi-core
• Virtualization for HPC on Multi-core
• Methodology
• PARSEC – shared memory model
• NPB – MPI model
• Conclusion
19
PARSEC – Single Socket
• Single socket
• No NUMA effect
• Very low virtualization overheads
20
[Chart: execution times normalized to native runs for blackscholes, canneal, ferret, fluidanimate, freqmine, streamcluster, swaptions, x264, and the average, each with 1, 4, and 8 vCPUs; overheads are only 2~4%]
PARSEC – Single Socket
• Single socket + pin each vCPU to a pCPU
• Reduce semantic gaps by preventing vCPU migration
• vCPU migration has a negligible effect (a native pinning sketch follows this slide)
21
[Chart: same PARSEC benchmarks with 1, 4, and 8 pinned vCPUs; execution times normalized to native runs are similar to the unpinned case]
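In Xen, this pinning is done from the toolstack (e.g., `xl vcpu-pin <domain> <vcpu> <pcpu>`; the talk does not show the exact commands used). As a native analog of the same idea, a minimal pthreads sketch that fixes one worker thread per core so the scheduler can no longer migrate it:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Native analog of vCPU pinning (illustrative, not the paper's setup):
   each worker is bound to one core, eliminating migrations. */
#define NTHREADS 4

static void *worker(void *arg) {
    long core = (long)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);                     /* exactly one allowed core */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    /* ... the compute phase would run here, migration-free ... */
    printf("worker bound to core %ld\n", core);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```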
PARSEC – Dual Socket
• Dual socket, unpinned vCPUs
• The NUMA effect widens the semantic gap
• Significant increase in overheads
22
[Chart: same PARSEC benchmarks with 1, 4, and 8 vCPUs on the dual socket machine; execution times normalized to native runs show 16~37% overhead]
PARSEC – Dual Socket
• Dual socket, pinned vCPUs
• Pinning may also reduce the NUMA effect
• Reduced overheads with 1 and 4 vCPUs
23
[Chart: same PARSEC benchmarks with 1, 4, and 8 pinned vCPUs on the dual socket machine; overheads shrink for 1 and 4 vCPUs]
Xen and NUMA Machines
• Memory allocation policy
– Allocate up to a 4GB chunk on one socket
• Scheduling policy
– Pin to the socket where memory was allocated
– Nothing more
• Pinning 1 ~ 4 vCPUs on the socket where memory is allocated is possible
• Impossible with 8 vCPUs
(a user-level libnuma analog follows this slide)
24
[Figure: a two-socket machine with four VMs; each VM's memory is allocated on one socket, VM0 and VM1 on the first, VM2 and VM3 on the second]
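The "memory on one socket, vCPUs pinned to that socket" policy has a direct user-level analog in libnuma. A minimal sketch of the idea (my illustration, assuming a two-node machine; this is not Xen code):

```c
#include <numa.h>
#include <stdio.h>
#include <string.h>

/* Keep computation and its memory on the same socket, mirroring the
   Xen policy above. Illustrative only; build with: gcc local.c -lnuma */
int main(void) {
    if (numa_available() < 0) {
        puts("NUMA not supported");
        return 1;
    }
    int node = 0;
    size_t size = 64UL << 20;                  /* 64 MB */
    char *buf = numa_alloc_onnode(size, node); /* memory on node 0 */
    if (!buf) {
        puts("allocation failed");
        return 1;
    }
    numa_run_on_node(node);                    /* and run on node 0 */
    memset(buf, 0, size);                      /* all accesses stay local */
    numa_free(buf, size);
    return 0;
}
```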
Mitigating NUMA Effects
• Range pinning
– Pin all vCPUs of a VM to one socket
– Works only if the # of vCPUs does not exceed the # of cores on a socket
– Range-pinned (best): memory of the VM on the same socket
– Range-pinned (worst): memory of the VM on the other socket
• NUMA-first scheduler (sketched in code after this slide)
– If there is an idle core on the socket where memory is allocated, pick it
– If not, pick any core in the machine
– Not all vCPUs are active all the time (synchronization or I/O)
25
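The NUMA-first policy is simple enough to state in a few lines of code. A self-contained toy sketch (hypothetical data structures, not the actual Xen scheduler): try the socket that holds the VM's memory first, and fall back to any idle core only when the home socket is full:

```c
#include <stdio.h>

/* Toy model of NUMA-first vCPU placement (not the Xen implementation). */
#define NCORES 8
#define CORES_PER_SOCKET 4

static int core_busy[NCORES];               /* 1 if a vCPU is running */

static int socket_of(int core) { return core / CORES_PER_SOCKET; }

/* Pick a core for a vCPU whose VM's memory lives on mem_socket. */
static int numa_first_pick(int mem_socket) {
    for (int c = 0; c < NCORES; c++)        /* pass 1: home socket only */
        if (socket_of(c) == mem_socket && !core_busy[c])
            return c;
    for (int c = 0; c < NCORES; c++)        /* pass 2: any idle core */
        if (!core_busy[c])
            return c;
    return -1;                              /* nothing idle */
}

int main(void) {
    /* Socket 0 owns cores 0-3; three of them are already busy. */
    core_busy[0] = core_busy[1] = core_busy[2] = 1;
    for (int i = 0; i < 2; i++) {
        int c = numa_first_pick(0);         /* VM's memory is on socket 0 */
        if (c < 0) break;
        core_busy[c] = 1;
        printf("vCPU %d -> core %d (%s)\n", i, c,
               socket_of(c) == 0 ? "local" : "remote");
    }
    return 0;
}
```

Because not all vCPUs are active at once, the first pass succeeds most of the time, which is why the policy can help even when a VM has more vCPUs than one socket has cores.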
Range Pinning
• For the 4 vCPUs case
• Range-pinned (best) ≈ Pinned
26
[Chart: PARSEC execution times normalized to native runs with 4 vCPUs, comparing Unpinned, Range-pinned (worst), Range-pinned (best), and Pinned]
NUMA-first Scheduler
• For the 8 vCPUs case
• Significant improvement with the NUMA-first scheduler
27
[Chart: PARSEC execution times normalized to native runs with 8 vCPUs, comparing Unpinned, Pinned, and NUMA-first]
Outline
• Virtualization for HPC
• Virtualization on Multi-core
• Virtualization for HPC on Multi-core
• Methodology
• PARSEC – shared memory model
• NPB – MPI model
• Conclusion
28
VM Granularity for the MPI Model
• Fine-grained VMs
– Few processes per VM
– Small VMs: few vCPUs, little memory
– Fault isolation among processes in different VMs
– Many VMs on a machine
– MPI communication mostly through the VMM
• Coarse-grained VMs
– Many processes per VM
– Large VMs: many vCPUs, much memory
– Single failure point for the processes in a VM
– Few VMs on a machine
– MPI communication mostly within a VM
(a ping-pong sketch after this slide illustrates the communication cost)
29
[Figure: fine-grained setup with many small VMs per VMM versus coarse-grained setup with a few large VMs per VMM, each spanning two machines]
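Whether a message stays inside a VM or crosses the VMM is exactly what a point-to-point microbenchmark exposes. A standard MPI ping-pong sketch (my illustration, not the paper's code); running ranks 0 and 1 in the same VM and then in different VMs should show the gap:

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal ping-pong between ranks 0 and 1; latency rises when the two
   ranks sit in different VMs, since every message crosses the VMM. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char buf = 0;
    const int iters = 10000;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(&buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("round trip: %.2f us\n", (t1 - t0) / iters * 1e6);
    MPI_Finalize();
    return 0;
}
```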
NPB – VM Granularity
• The total work is the same at every granularity
• 2 VMs: each VM has 8 vCPUs and 8 MPI processes
• 16 VMs: each VM has 1 vCPU and 1 MPI process
30
[Chart: execution times normalized to native runs for NPB benchmarks BT, CG, EP, FT, IS, LU, MG, SP and the average, with 2, 4, 8, and 16 VMs; overheads range from 11% (coarse) to 54% (fine)]
NPB – VM Granularity
• Fine-grained VMs: significant overheads (avg. 54%)
– MPI communication mostly through the VMM
• Worst in CG, which has a high communication ratio
– Small memory per VM
– VM management costs in the VMM
• Coarse-grained VMs: much lower overheads (avg. 11%)
– Still dual socket, but less overhead than the shared memory model: the bottleneck moves to communication
– MPI communication largely within a VM
– Large memory per VM
31
Outline
• Virtualization for HPC
• Virtualization on Multi-core
• Virtualization for HPC on Multi-core
• Methodology
• PARSEC – shared memory model
• NPB – MPI model
• Conclusion
32
Conclusion
• Questions on virtualization for HPC on multi-core systems
– How much overhead is there?
– Where does it come from?
• For the shared memory model
– Without NUMA: little overhead
– With NUMA: large overheads from semantic gaps
• For the MPI model
– Less NUMA effect; communication matters more
– Fine-grained VMs have large overheads
• Communication mostly through the VMM
• Small memory per VM / VM management costs
• Future work
– NUMA-aware VMM scheduler
– Optimize communication among VMs within a machine
33
34
Thank you!
35
Backup slides
PARSEC CPU Usage
• Environment: native Linux with only 8 cores enabled (8-thread mode)
• CPU usage is sampled every second, then averaged (a sampling sketch follows this slide)
• All workloads stay below 800% (fully parallel), so NUMA-first can work
36
[Chart: average CPU usage per workload (scale 0~800%) for blackscholes, canneal, ferret, fluidanimate, freqmine, streamcluster, swaptions, x264, and the average]
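For reproducibility, the per-second sampling can be done straight from /proc/stat. A minimal sketch of the measurement loop described above (the talk does not name the actual tool; with 100% per fully busy core, 8 saturated cores read as 800%):

```c
#include <stdio.h>
#include <unistd.h>

/* Sample aggregate CPU usage once per second and keep a running average
   (illustrative sketch of the methodology, not the original tooling). */
static int read_stat(long long *busy, long long *total) {
    long long u = 0, n = 0, s = 0, idle = 0, io = 0,
              irq = 0, sirq = 0, steal = 0;
    FILE *f = fopen("/proc/stat", "r");
    if (!f) return -1;
    fscanf(f, "cpu %lld %lld %lld %lld %lld %lld %lld %lld",
           &u, &n, &s, &idle, &io, &irq, &sirq, &steal);
    fclose(f);
    *total = u + n + s + idle + io + irq + sirq + steal;
    *busy = *total - idle - io;             /* count iowait as idle */
    return 0;
}

int main(void) {
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    long long b0, t0, b1, t1;
    double sum = 0.0;
    if (read_stat(&b0, &t0) < 0) return 1;
    for (int sec = 1; sec <= 60; sec++) {   /* one sample per second */
        sleep(1);
        if (read_stat(&b1, &t1) < 0) return 1;
        double usage = 100.0 * ncpu * (double)(b1 - b0) / (double)(t1 - t0);
        sum += usage;
        printf("now %6.1f%%  avg %6.1f%%\n", usage, sum / sec);
        b0 = b1; t0 = t1;
    }
    return 0;
}
```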