Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
SC2011 @ Seattle, Nov. 15, 2011
Ryousei Takano
Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology (AIST),
Japan
Outline
• What is HPC Cloud?
• Performance tuning methods for HPC Cloud
  – PCI passthrough
  – NUMA affinity
  – VMM noise reduction
• Performance evaluation
HPC Cloud
HPC Cloud applies cloud resources to High Performance Computing (HPC) applications.
• Users request resources according to their needs.
• The provider allocates each user a dedicated virtual cluster on demand.
[Diagram: a physical cluster partitioned into per-user virtualized clusters]
HPC Cloud (cont’d)
• Pros:
  – User side: easy deployment
  – Provider side: high resource utilization
• Cons:
  – Performance degradation?
Methods for performance tuning in a virtualized environment are not yet established.
Current HPC Cloud: performance is poor and unstable.
“True” HPC Cloud: performance approaches that of bare metal.
Toward a practical HPC Cloud
• Use PCI passthrough: reduces the overhead of interrupt virtualization
• Set NUMA affinity
• Reduce VMM noise (not completed): disable unnecessary services on the host OS (e.g., ksmd); a host-side sketch follows below
[Diagram: a VM (QEMU process) runs a guest OS and its threads on VCPU threads; the Linux kernel with KVM schedules them onto the physical CPUs of a socket, with the NIC attached through the physical driver]
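As a concrete example of the VMM-noise item above, here is a minimal host-side sketch of disabling KSM (the ksmd page-merging daemon) through its standard sysfs control file. The talk only names ksmd as one unnecessary service; treating it as the sole noise source is an assumption for illustration.

```python
#!/usr/bin/env python3
"""Disable KSM (ksmd) on a KVM host to reduce VMM noise.

Minimal sketch: writing 0 to /sys/kernel/mm/ksm/run stops the ksmd
scanning daemon. Run as root on the host, not inside the guest.
"""
KSM_RUN = "/sys/kernel/mm/ksm/run"

def disable_ksm() -> None:
    # 0 = stop ksmd, 1 = run ksmd, 2 = stop and unmerge all shared pages
    with open(KSM_RUN, "w") as f:
        f.write("0")

if __name__ == "__main__":
    disable_ksm()
    print("ksmd disabled:", open(KSM_RUN).read().strip() == "0")
```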
PCI passthrough
[Diagram: three I/O virtualization models]
• I/O emulation: the guest runs a guest driver; I/O passes through the VMM (vSwitch and physical driver) before reaching the NIC. The NIC can be shared among VMs, but performance is low.
• PCI passthrough: the guest runs the physical driver and accesses the NIC directly, bypassing the VMM. Performance is high, but the device is dedicated to a single VM. (A host-side setup sketch follows below.)
• SR-IOV: the NIC exposes virtual functions and an embedded switch (VEB), so multiple VMs can share it with near-passthrough performance.
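To illustrate the host-side setup that PCI passthrough requires, the sketch below detaches a device from its host driver and hands it to the pci-stub driver so KVM/QEMU can claim it. The PCI address and vendor/device IDs are placeholders, and the stub driver (pci-stub vs. vfio-pci) depends on the kernel generation; this is an assumption, not the exact procedure used in the talk.

```python
#!/usr/bin/env python3
"""Prepare a PCI device (e.g., an InfiniBand HCA) for KVM passthrough.

Minimal sketch using the standard sysfs interface: register the device
ID with pci-stub, unbind the device from its host driver, then bind it
to pci-stub. Addresses and IDs below are placeholders (see `lspci -nn`).
Run as root on the host.
"""
import os

BDF = "0000:05:00.0"          # placeholder: PCI address of the NIC/HCA
VENDOR_DEVICE = "15b3 673c"   # placeholder: vendor/device ID pair

def assign_to_pci_stub(bdf: str, ids: str) -> None:
    # 1. Let pci-stub accept this vendor/device ID.
    with open("/sys/bus/pci/drivers/pci-stub/new_id", "w") as f:
        f.write(ids)
    # 2. Detach the device from its current host driver, if any.
    driver_link = f"/sys/bus/pci/devices/{bdf}/driver"
    if os.path.islink(driver_link):
        with open(os.path.join(driver_link, "unbind"), "w") as f:
            f.write(bdf)
    # 3. Attach it to pci-stub so the host no longer uses it.
    with open("/sys/bus/pci/drivers/pci-stub/bind", "w") as f:
        f.write(bdf)

if __name__ == "__main__":
    assign_to_pci_stub(BDF, VENDOR_DEVICE)
    # The device can now be handed to a guest via qemu-kvm/libvirt
    # device assignment.
```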
Virtual CPU scheduling
• KVM: a VM is a QEMU process. Its VCPU threads (V0–V3) and the guest OS threads are scheduled onto the physical CPUs (P0–P3) of a CPU socket by the host Linux process scheduler.
• Xen: a VM is a DomU; its VCPUs are scheduled onto the physical CPUs by the Xen hypervisor's domain scheduler, alongside Dom0.
• A guest OS cannot run numactl (see the sketch after this slide).
[Diagram: Bare Metal vs. KVM vs. Xen stacks; Virtual Machine / Virtual Machine Monitor (VMM) / Hardware]
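To make the numactl limitation concrete, here is a small sketch that inspects the NUMA topology visible to an OS via sysfs. Run inside a typical guest it reports a single node (unless the VMM exposes a virtual NUMA topology), which is why affinity has to be set from the host. This is an illustrative check, not a tool from the talk.

```python
#!/usr/bin/env python3
"""List the NUMA nodes and their CPUs as seen by the running OS.

Inside a typical KVM guest this usually reports only node0, so numactl
has nothing meaningful to bind to; on the host it reports the real
sockets.
"""
import glob
import os

def numa_nodes() -> dict:
    nodes = {}
    for path in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
        with open(os.path.join(path, "cpulist")) as f:
            nodes[os.path.basename(path)] = f.read().strip()
    return nodes

if __name__ == "__main__":
    for node, cpus in numa_nodes().items():
        print(f"{node}: CPUs {cpus}")
```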
NUMA affinity
• Bare metal (Linux): numactl binds application threads and their memory to a CPU socket.
• KVM: two steps are needed (host-side sketch below):
  – On the host, pin each VCPU thread of the QEMU process to a physical CPU with taskset (Vn = Pn).
  – Inside the guest, bind the application threads to the resulting virtual socket with numactl.
[Diagram: Bare Metal vs. KVM; application threads, VCPU threads (V0–V3), the process scheduler, physical CPUs (P0–P3), CPU sockets, and memory]
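Below is a minimal host-side sketch of the "pin vCPU to CPU (Vn = Pn)" step, using Python's os.sched_setaffinity (equivalent to taskset) on the QEMU process's thread IDs. The PID, CPU list, and thread ordering are placeholders: in practice the VCPU thread IDs come from the QEMU monitor or virsh vcpupin, so treat this as an assumption rather than the commands used in the talk.

```python
#!/usr/bin/env python3
"""Pin the VCPU threads of a QEMU process to physical CPUs (Vn = Pn).

Minimal sketch equivalent to `taskset -pc Pn <tid>` for each VCPU
thread. QEMU_PID and PHYSICAL_CPUS are placeholders; identifying which
tasks are the VCPU threads normally goes through the QEMU monitor or
`virsh vcpupin`, not a blind /proc walk.
"""
import os

QEMU_PID = 12345              # placeholder: PID of the VM's QEMU process
PHYSICAL_CPUS = [0, 1, 2, 3]  # placeholder: CPUs of one NUMA socket

def pin_vcpu_threads(pid: int, cpus: list) -> None:
    tids = sorted(int(t) for t in os.listdir(f"/proc/{pid}/task"))
    # Assume the first len(cpus) threads are the VCPU threads (V0..Vn)
    # and bind each one to its matching physical CPU (Pn).
    for tid, cpu in zip(tids[:len(cpus)], cpus):
        os.sched_setaffinity(tid, {cpu})
        print(f"pinned thread {tid} -> CPU {cpu}")

if __name__ == "__main__":
    pin_vcpu_threads(QEMU_PID, PHYSICAL_CPUS)
```

Inside the guest, the application threads can then be bound to the (now fixed) virtual socket with numactl, just as on bare metal.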
Evaluation
Evaluation of HPC applications on a 16-node cluster (part of the AIST Green Cloud Cluster)

Compute node (Dell PowerEdge M610)
  CPU: Intel quad-core Xeon E5540 / 2.53 GHz x 2
  Chipset: Intel 5520
  Memory: 48 GB DDR3
  InfiniBand: Mellanox ConnectX (MT26428)
Blade switch
  InfiniBand: Mellanox M3601Q (QDR, 16 ports)
Host machine environment
  OS: Debian 6.0.1
  Linux kernel: 2.6.32-5-amd64
  KVM: 0.12.50
  Compiler: gcc/gfortran 4.4.5
  MPI: Open MPI 1.4.2
VM environment
  VCPU: 8
  Memory: 45 GB
MPI Point-to-Point communication performance
(higher is better; Bare Metal = non-virtualized cluster)
[Plot: bandwidth [MB/sec] (1 to 10,000) vs. message size [byte] (1 B to 1 GB) for Bare Metal and KVM]
With PCI passthrough, MPI communication throughput comes close to that of bare metal machines.
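For reference, here is a minimal ping-pong bandwidth test of the kind plotted above, written with mpi4py. The talk does not name its benchmark, so this is an illustrative sketch only, not the measurement code behind the figure.

```python
#!/usr/bin/env python3
"""Minimal MPI point-to-point bandwidth test (ping-pong between 2 ranks).

Run with, e.g.: mpirun -np 2 python pingpong.py
"""
import time
import numpy as np
from mpi4py import MPI

REPS = 100

def pingpong(size: int) -> float:
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    buf = np.zeros(size, dtype=np.uint8)
    comm.Barrier()
    t0 = time.time()
    for _ in range(REPS):
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        else:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=0)
    elapsed = time.time() - t0
    # Each iteration moves the message twice (there and back).
    return 2 * REPS * size / elapsed / 1e6  # MB/s

if __name__ == "__main__":
    if MPI.COMM_WORLD.Get_size() != 2:
        raise SystemExit("run with exactly 2 ranks")
    for exp in range(0, 27, 3):           # 1 B ... 16 MB
        size = 1 << exp
        bw = pingpong(size)
        if MPI.COMM_WORLD.Get_rank() == 0:
            print(f"{size:>10d} bytes : {bw:8.1f} MB/s")
```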
NUMA affinity
Execution time on a single node: NPB multi-zone (Computational Fluid Dynamics) and Bloss (non-linear eigensolver)

               SP-MZ [sec]    BT-MZ [sec]    Bloss [min]
Bare Metal      94.41 (1.00)  138.01 (1.00)  21.02 (1.00)
KVM            104.57 (1.11)  141.69 (1.03)  22.12 (1.05)
KVM (w/ bind)   96.14 (1.02)  139.32 (1.01)  21.28 (1.01)
NUMA affinity is an important performance factor not only on bare metal machines but also on virtual machines.
NPB BT-MZ: Parallel efficiency
(higher is better)
[Plot: performance [Gop/s total] and parallel efficiency (PE) [%] vs. number of nodes (1, 2, 4, 8, 16) for Bare Metal, KVM, and Amazon EC2]
Degradation of PE: KVM: 2%, EC2: 14%
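For clarity, here is a small sketch of how parallel efficiency figures like these are typically derived from measured performance, assuming the conventional definition PE(N) = perf(N) / (N x perf(1)); the Gop/s values below are placeholders, not the measurements behind the plot.

```python
"""Compute parallel efficiency from total performance measurements.

Assumes the conventional definition PE(N) = perf(N) / (N * perf(1)).
The Gop/s values in __main__ are placeholders for illustration only.
"""

def parallel_efficiency(perf_by_nodes: dict) -> dict:
    base = perf_by_nodes[1]  # single-node performance
    return {n: 100.0 * p / (n * base) for n, p in perf_by_nodes.items()}

if __name__ == "__main__":
    # Placeholder total performance in Gop/s for 1..16 nodes.
    measured = {1: 18.0, 2: 35.0, 4: 69.0, 8: 135.0, 16: 262.0}
    for n, pe in parallel_efficiency(measured).items():
        print(f"{n:>2d} nodes: PE = {pe:5.1f} %")
```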
Bloss: Parallel efficiency
[Plot: parallel efficiency [%] vs. number of nodes (1, 2, 4, 8, 16) for Bare Metal, KVM, Amazon EC2, and the ideal case]
Degradation of PE: KVM: 8%, EC2: 22%
Bloss: non-linear internal eigensolver; a hierarchical parallel program using MPI and OpenMP
Overhead of communication and virtualization
Summary
HPC Cloud is promising!
• The performance of coarse-grained parallel applications is comparable to that of bare metal machines.
• We plan to operate a private cloud service, “AIST Cloud”, for HPC users.
• Open issues:
  – VMM noise reduction
  – VMM-bypass device-aware VM scheduling
  – Live migration with VMM-bypass devices
LINPACK Efficiency (TOP500 June 2011)
※ Efficiency = (maximum LINPACK performance: Rmax) / (theoretical peak performance: Rpeak)
[Plot: efficiency (%) vs. TOP500 rank, grouped by interconnect; GPGPU machines and #451, the Amazon EC2 cluster compute instances, are marked]
  InfiniBand: 79%
  Gigabit Ethernet: 54%
  10 Gigabit Ethernet: 74%
Virtualization causes the performance degradation!
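As a one-line illustration of the efficiency formula above, the sketch below divides Rmax by Rpeak; the TFlop/s figures are placeholders, not values from the June 2011 list.

```python
"""LINPACK efficiency: Rmax (measured) divided by Rpeak (theoretical peak).

The Rmax/Rpeak values below are placeholders for illustration only.
"""

def linpack_efficiency(rmax_tflops: float, rpeak_tflops: float) -> float:
    return 100.0 * rmax_tflops / rpeak_tflops

if __name__ == "__main__":
    # Hypothetical system: Rmax = 40 TFlop/s, Rpeak = 50 TFlop/s -> 80 %
    print(f"Efficiency = {linpack_efficiency(40.0, 50.0):.0f} %")
```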
Bloss: Parallel efficiency
[Plot: parallel efficiency [%] vs. number of nodes (1, 2, 4, 8, 16) for Bare Metal, KVM, KVM (w/ bind), Amazon EC2, and the ideal case]
Binding threads to physical CPUs can make the application sensitive to VMM noise and degrade performance.
Bloss: non-linear internal eigensolver; a hierarchical parallel program using MPI and OpenMP