Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
SC2011 @ Seattle, Nov. 15, 2011
Ryousei Takano
Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology (AIST),
Japan
Outline
• What is HPC Cloud?
• Performance tuning methods for HPC Cloud
  – PCI passthrough
  – NUMA affinity
  – VMM noise reduction
• Performance evaluation
HPC Cloud
HPC Cloud applies cloud resources to High Performance Computing (HPC) applications.
• Users request resources according to their needs.
• The provider allocates each user a dedicated virtual cluster on demand.
[Diagram: a physical cluster partitioned into per-user virtualized clusters]
HPC Cloud (cont’d)
• Pros:
  – User side: easy deployment
  – Provider side: high resource utilization
• Cons:
  – Performance degradation?
Methods for performance tuning in a virtualized environment are not yet established.
Current HPC Cloud: performance is poor and unstable.
“True” HPC Cloud: performance approaches that of bare metal.
Toward a practical HPC Cloud
• Use PCI passthrough: reduces the overhead of interrupt virtualization
• Set NUMA affinity
• Reduce VMM noise (not completed): disable unnecessary services on the host OS (e.g., ksmd); a host-side sketch follows below
[Diagram: a VM (QEMU process) runs a guest OS and its threads on VCPU threads; the Linux kernel with KVM schedules them onto the physical CPUs of a socket, with the NIC attached through the physical driver]
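As a concrete example of the VMM-noise item above, here is a minimal host-side sketch of disabling KSM (the ksmd page-merging daemon) through its standard sysfs control file. The talk only names ksmd as one unnecessary service; treating it as the sole noise source is an assumption for illustration.

```python
#!/usr/bin/env python3
"""Disable KSM (ksmd) on a KVM host to reduce VMM noise.

Minimal sketch: writing 0 to /sys/kernel/mm/ksm/run stops the ksmd
scanning daemon. Run as root on the host, not inside the guest.
"""
KSM_RUN = "/sys/kernel/mm/ksm/run"

def disable_ksm() -> None:
    # 0 = stop ksmd, 1 = run ksmd, 2 = stop and unmerge all shared pages
    with open(KSM_RUN, "w") as f:
        f.write("0")

if __name__ == "__main__":
    disable_ksm()
    print("ksmd disabled:", open(KSM_RUN).read().strip() == "0")
```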
PCI passthrough
[Diagram: three I/O virtualization models]
• I/O emulation: the guest runs a guest driver; I/O passes through the VMM (vSwitch and physical driver) before reaching the NIC. The NIC can be shared among VMs, but performance is low.
• PCI passthrough: the guest runs the physical driver and accesses the NIC directly, bypassing the VMM. Performance is high, but the device is dedicated to a single VM. (A host-side setup sketch follows below.)
• SR-IOV: the NIC exposes virtual functions and an embedded switch (VEB), so multiple VMs can share it with near-passthrough performance.
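To illustrate the host-side setup that PCI passthrough requires, the sketch below detaches a device from its host driver and hands it to the pci-stub driver so KVM/QEMU can claim it. The PCI address and vendor/device IDs are placeholders, and the stub driver (pci-stub vs. vfio-pci) depends on the kernel generation; this is an assumption, not the exact procedure used in the talk.

```python
#!/usr/bin/env python3
"""Prepare a PCI device (e.g., an InfiniBand HCA) for KVM passthrough.

Minimal sketch using the standard sysfs interface: register the device
ID with pci-stub, unbind the device from its host driver, then bind it
to pci-stub. Addresses and IDs below are placeholders (see `lspci -nn`).
Run as root on the host.
"""
import os

BDF = "0000:05:00.0"          # placeholder: PCI address of the NIC/HCA
VENDOR_DEVICE = "15b3 673c"   # placeholder: vendor/device ID pair

def assign_to_pci_stub(bdf: str, ids: str) -> None:
    # 1. Let pci-stub accept this vendor/device ID.
    with open("/sys/bus/pci/drivers/pci-stub/new_id", "w") as f:
        f.write(ids)
    # 2. Detach the device from its current host driver, if any.
    driver_link = f"/sys/bus/pci/devices/{bdf}/driver"
    if os.path.islink(driver_link):
        with open(os.path.join(driver_link, "unbind"), "w") as f:
            f.write(bdf)
    # 3. Attach it to pci-stub so the host no longer uses it.
    with open("/sys/bus/pci/drivers/pci-stub/bind", "w") as f:
        f.write(bdf)

if __name__ == "__main__":
    assign_to_pci_stub(BDF, VENDOR_DEVICE)
    # The device can now be handed to a guest via qemu-kvm/libvirt
    # device assignment.
```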
Virtual CPU scheduling
• KVM: a VM is a QEMU process. Its VCPU threads (V0–V3) and the guest OS threads are scheduled onto the physical CPUs (P0–P3) of a CPU socket by the host Linux process scheduler.
• Xen: a VM is a DomU; its VCPUs are scheduled onto the physical CPUs by the Xen hypervisor's domain scheduler, alongside Dom0.
• A guest OS cannot run numactl (see the sketch after this slide).
[Diagram: Bare Metal vs. KVM vs. Xen stacks; Virtual Machine / Virtual Machine Monitor (VMM) / Hardware]
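To make the numactl limitation concrete, here is a small sketch that inspects the NUMA topology visible to an OS via sysfs. Run inside a typical guest it reports a single node (unless the VMM exposes a virtual NUMA topology), which is why affinity has to be set from the host. This is an illustrative check, not a tool from the talk.

```python
#!/usr/bin/env python3
"""List the NUMA nodes and their CPUs as seen by the running OS.

Inside a typical KVM guest this usually reports only node0, so numactl
has nothing meaningful to bind to; on the host it reports the real
sockets.
"""
import glob
import os

def numa_nodes() -> dict:
    nodes = {}
    for path in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
        with open(os.path.join(path, "cpulist")) as f:
            nodes[os.path.basename(path)] = f.read().strip()
    return nodes

if __name__ == "__main__":
    for node, cpus in numa_nodes().items():
        print(f"{node}: CPUs {cpus}")
```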
NUMA affinity
• Bare metal (Linux): numactl binds application threads and their memory to a CPU socket.
• KVM: two steps are needed (host-side sketch below):
  – On the host, pin each VCPU thread of the QEMU process to a physical CPU with taskset (Vn = Pn).
  – Inside the guest, bind the application threads to the resulting virtual socket with numactl.
[Diagram: Bare Metal vs. KVM; application threads, VCPU threads (V0–V3), the process scheduler, physical CPUs (P0–P3), CPU sockets, and memory]
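Below is a minimal host-side sketch of the "pin vCPU to CPU (Vn = Pn)" step, using Python's os.sched_setaffinity (equivalent to taskset) on the QEMU process's thread IDs. The PID, CPU list, and thread ordering are placeholders: in practice the VCPU thread IDs come from the QEMU monitor or virsh vcpupin, so treat this as an assumption rather than the commands used in the talk.

```python
#!/usr/bin/env python3
"""Pin the VCPU threads of a QEMU process to physical CPUs (Vn = Pn).

Minimal sketch equivalent to `taskset -pc Pn <tid>` for each VCPU
thread. QEMU_PID and PHYSICAL_CPUS are placeholders; identifying which
tasks are the VCPU threads normally goes through the QEMU monitor or
`virsh vcpupin`, not a blind /proc walk.
"""
import os

QEMU_PID = 12345              # placeholder: PID of the VM's QEMU process
PHYSICAL_CPUS = [0, 1, 2, 3]  # placeholder: CPUs of one NUMA socket

def pin_vcpu_threads(pid: int, cpus: list) -> None:
    tids = sorted(int(t) for t in os.listdir(f"/proc/{pid}/task"))
    # Assume the first len(cpus) threads are the VCPU threads (V0..Vn)
    # and bind each one to its matching physical CPU (Pn).
    for tid, cpu in zip(tids[:len(cpus)], cpus):
        os.sched_setaffinity(tid, {cpu})
        print(f"pinned thread {tid} -> CPU {cpu}")

if __name__ == "__main__":
    pin_vcpu_threads(QEMU_PID, PHYSICAL_CPUS)
```

Inside the guest, the application threads can then be bound to the (now fixed) virtual socket with numactl, just as on bare metal.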
Evaluation
Evaluation of HPC applications on a 16-node cluster (part of the AIST Green Cloud Cluster)

Compute node (Dell PowerEdge M610)
  CPU: Intel quad-core Xeon E5540 / 2.53 GHz x 2
  Chipset: Intel 5520
  Memory: 48 GB DDR3
  InfiniBand: Mellanox ConnectX (MT26428)
Blade switch
  InfiniBand: Mellanox M3601Q (QDR, 16 ports)
Host machine environment
  OS: Debian 6.0.1
  Linux kernel: 2.6.32-5-amd64
  KVM: 0.12.50
  Compiler: gcc/gfortran 4.4.5
  MPI: Open MPI 1.4.2
VM environment
  VCPU: 8
  Memory: 45 GB
MPI Point-to-Point communication performance
(higher is better; Bare Metal = non-virtualized cluster)
[Plot: bandwidth [MB/sec] (1 to 10,000) vs. message size [byte] (1 B to 1 GB) for Bare Metal and KVM]
With PCI passthrough, MPI communication throughput comes close to that of bare metal machines.
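For reference, here is a minimal ping-pong bandwidth test of the kind plotted above, written with mpi4py. The talk does not name its benchmark, so this is an illustrative sketch only, not the measurement code behind the figure.

```python
#!/usr/bin/env python3
"""Minimal MPI point-to-point bandwidth test (ping-pong between 2 ranks).

Run with, e.g.: mpirun -np 2 python pingpong.py
"""
import time
import numpy as np
from mpi4py import MPI

REPS = 100

def pingpong(size: int) -> float:
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    buf = np.zeros(size, dtype=np.uint8)
    comm.Barrier()
    t0 = time.time()
    for _ in range(REPS):
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        else:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=0)
    elapsed = time.time() - t0
    # Each iteration moves the message twice (there and back).
    return 2 * REPS * size / elapsed / 1e6  # MB/s

if __name__ == "__main__":
    if MPI.COMM_WORLD.Get_size() != 2:
        raise SystemExit("run with exactly 2 ranks")
    for exp in range(0, 27, 3):           # 1 B ... 16 MB
        size = 1 << exp
        bw = pingpong(size)
        if MPI.COMM_WORLD.Get_rank() == 0:
            print(f"{size:>10d} bytes : {bw:8.1f} MB/s")
```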
NUMA affinity
Execution time on a single node: NPB multi-zone (Computational Fluid Dynamics) and Bloss (non-linear eigensolver)

               SP-MZ [sec]    BT-MZ [sec]    Bloss [min]
Bare Metal      94.41 (1.00)  138.01 (1.00)  21.02 (1.00)
KVM            104.57 (1.11)  141.69 (1.03)  22.12 (1.05)
KVM (w/ bind)   96.14 (1.02)  139.32 (1.01)  21.28 (1.01)
NUMA affinity is an important performance factor not only on bare metal machines but also on virtual machines.
NPB BT-MZ: Parallel efficiency
(higher is better)
[Plot: performance [Gop/s total] and parallel efficiency (PE) [%] vs. number of nodes (1, 2, 4, 8, 16) for Bare Metal, KVM, and Amazon EC2]
Degradation of PE: KVM: 2%, EC2: 14%
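For clarity, here is a small sketch of how parallel efficiency figures like these are typically derived from measured performance, assuming the conventional definition PE(N) = perf(N) / (N x perf(1)); the Gop/s values below are placeholders, not the measurements behind the plot.

```python
"""Compute parallel efficiency from total performance measurements.

Assumes the conventional definition PE(N) = perf(N) / (N * perf(1)).
The Gop/s values in __main__ are placeholders for illustration only.
"""

def parallel_efficiency(perf_by_nodes: dict) -> dict:
    base = perf_by_nodes[1]  # single-node performance
    return {n: 100.0 * p / (n * base) for n, p in perf_by_nodes.items()}

if __name__ == "__main__":
    # Placeholder total performance in Gop/s for 1..16 nodes.
    measured = {1: 18.0, 2: 35.0, 4: 69.0, 8: 135.0, 16: 262.0}
    for n, pe in parallel_efficiency(measured).items():
        print(f"{n:>2d} nodes: PE = {pe:5.1f} %")
```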
Bloss: Parallel efficiency
[Plot: parallel efficiency [%] vs. number of nodes (1, 2, 4, 8, 16) for Bare Metal, KVM, Amazon EC2, and the ideal case]
Degradation of PE: KVM: 8%, EC2: 22%
Bloss: non-linear internal eigensolver; a hierarchical parallel program using MPI and OpenMP
Overhead of communication and virtualization
Summary
HPC Cloud is promising!
• The performance of coarse-grained parallel applications is comparable to that of bare metal machines.
• We plan to operate a private cloud service, “AIST Cloud”, for HPC users.
• Open issues:
  – VMM noise reduction
  – VMM-bypass device-aware VM scheduling
  – Live migration with VMM-bypass devices
LINPACK Efficiency (TOP500 June 2011)
※ Efficiency = (maximum LINPACK performance: Rmax) / (theoretical peak performance: Rpeak)
[Plot: efficiency (%) vs. TOP500 rank, grouped by interconnect; GPGPU machines and #451, the Amazon EC2 cluster compute instances, are marked]
  InfiniBand: 79%
  Gigabit Ethernet: 54%
  10 Gigabit Ethernet: 74%
Virtualization causes the performance degradation!
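As a one-line illustration of the efficiency formula above, the sketch below divides Rmax by Rpeak; the TFlop/s figures are placeholders, not values from the June 2011 list.

```python
"""LINPACK efficiency: Rmax (measured) divided by Rpeak (theoretical peak).

The Rmax/Rpeak values below are placeholders for illustration only.
"""

def linpack_efficiency(rmax_tflops: float, rpeak_tflops: float) -> float:
    return 100.0 * rmax_tflops / rpeak_tflops

if __name__ == "__main__":
    # Hypothetical system: Rmax = 40 TFlop/s, Rpeak = 50 TFlop/s -> 80 %
    print(f"Efficiency = {linpack_efficiency(40.0, 50.0):.0f} %")
```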
Bloss: Parallel efficiency
[Plot: parallel efficiency [%] vs. number of nodes (1, 2, 4, 8, 16) for Bare Metal, KVM, KVM (w/ bind), Amazon EC2, and the ideal case]
Binding threads to physical CPUs can make the application sensitive to VMM noise and degrade performance.
Bloss: non-linear internal eigensolver; a hierarchical parallel program using MPI and OpenMP