Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters
Jie Zhang
Dr. Dhabaleswar K. Panda (Advisor)
Department of Computer Science & Engineering, The Ohio State University
SC 2017 Doctoral Showcase, Network-Based Computing Laboratory
Outline
• Introduction
• Problem Statement
• Detailed Designs and Results
• Impact on HPC Community
• Contributions
Cloud Computing and Virtualization
• Cloud computing focuses on maximizing the effectiveness of the shared resources
• Virtualization is the key technology for resource sharing in the cloud
• Widely adopted in industry computing environments
• IDC forecasts worldwide public IT cloud services spending will reach $195 billion by 2020
(Courtesy: http://www.idc.com/getdoc.jsp?containerId=prUS41669516)
Drivers of Modern HPC Cluster and Cloud Architecture
• Multi-core/many-core technologies
• Accelerators (GPUs/co-processors)
• Large memory nodes (up to 2 TB)
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Single Root I/O Virtualization (SR-IOV)
• High-performance interconnects such as InfiniBand (with SR-IOV): <1 usec latency, 200 Gbps bandwidth
(Example clouds: SDSC Comet, TACC Stampede)
Single Root I/O Virtualization (SR-IOV)
• SR-IOV provides new opportunities to design HPC clouds with very low overhead
• Allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
• VFs are designed based on the existing non-virtualized PFs, so no driver change is needed
• Each VF can be dedicated to a single VM through PCI pass-through
4. Performance comparisons between IVShmem-backed and native-mode MPI libraries, using HPC applications
The evaluation results indicate that IVShmem can improve point-to-point and collective operations by up to 193% and 91%, respectively. The application execution time can be decreased by up to 96%, compared to SR-IOV. The results further show that IVShmem brings only small overheads, compared with the native environment.
The rest of the paper is organized as follows. Section 2 provides an overview of IVShmem, SR-IOV, and InfiniBand. Section 3 describes our prototype design and evaluation methodology. Section 4 presents the performance analysis results using micro-benchmarks and applications, scalability results, and comparison with native mode. We discuss the related work in Section 5, and conclude in Section 6.
2 Background
Inter-VM Shared Memory (IVShmem) (e.g., Nahanni) [15] provides zero-copy access to data on shared memory of co-resident VMs on the KVM platform. IVShmem is designed and implemented mainly in the system call layer, and its interfaces are visible to user-space applications as well. As shown in Figure 2(a), IVShmem contains three components: the guest kernel driver, the modified QEMU supporting the PCI device, and the POSIX shared memory region on the host OS. The shared memory region is allocated by host POSIX operations and mapped into the QEMU process address space. The mapped memory in QEMU can be used by guest applications by being remapped to user space in guest VMs. Evaluation results illustrate that both micro-benchmarks and HPC applications can achieve better performance with IVShmem support.
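The host-side building block here is an ordinary named POSIX shared-memory region (visible under /dev/shm/<name>) that multiple processes map and access without copies. The following is a minimal conceptual stand-in using Python's `multiprocessing.shared_memory`; it is not the actual QEMU/guest mapping path, only an illustration of the zero-copy sharing step.

```python
# Host allocates a named shared-memory region (backed by /dev/shm on Linux);
# a second process can attach to it by name and see writes without any copy.
from multiprocessing import shared_memory

# "Host" side: create a named region.
region = shared_memory.SharedMemory(create=True, size=4096)
region.buf[:5] = b"hello"

# "Peer" side: attach to the same region by name (zero-copy view).
peer = shared_memory.SharedMemory(name=region.name)
data = bytes(peer.buf[:5])

peer.close()
region.close()
region.unlink()  # remove the backing /dev/shm object
```

In IVShmem the same region is additionally mapped into QEMU's address space and exposed to the guest as a PCI BAR, which the guest driver then remaps to user space.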
Fig. 2. Overview of Inter-VM Shmem and SR-IOV Communication Mechanisms: (a) Inter-VM Shmem Mechanism [15]; (b) SR-IOV Mechanism [22]
Single Root I/O Virtualization (SR-IOV) is a PCI Express (PCIe) standard which specifies the native I/O virtualization capabilities in PCIe adapters. As shown in Figure 2(b), SR-IOV allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs). Each virtual device can be dedicated to a single VM through PCI pass-through, which allows each VM to directly access the corresponding VF. Hence, SR-IOV is a hardware-based approach to realize I/O virtualization.
Problem Statement
• Can MPI runtime be redesigned to provide virtualization support for virtual machines and containers when building HPC clouds?
• What kinds of benefits can be achieved on HPC clouds with the redesigned MPI runtime for scientific kernels and applications?
• Can fault-tolerance/resilience (live migration) be supported on SR-IOV-enabled HPC clouds?
• Can we co-design with resource management and scheduling systems to enable HPC clouds on modern HPC systems?
Research Framework
MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for virtualization (MVAPICH2-Virt), available since 2015
– Support for energy-awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
– Used by more than 2,825 organizations in 85 countries
– More than 428,000 (>0.4 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Jul '17 ranking):
  • 1st-ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
  • 15th-ranked 241,108-core cluster (Pleiades) at NASA
  • 20th-ranked 522,080-core cluster (Stampede) at TACC
  • 44th-ranked 74,520-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
– Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
Locality-aware MPI Communication with SR-IOV and IVShmem
[Figure: MPI library architecture (Application, MPI layer, ADI3 layer) in native mode with SMP and network channels over shared memory and the InfiniBand API, versus virtualized mode with SMP, IVShmem, and SR-IOV channels driven by a locality detector and communication coordinator over virtualized hardware]
• MPI library runs in both native and virtualization environments
• In the virtualized environment:
 - Supports shared-memory channels (SMP, IVShmem) and the SR-IOV channel
 - Locality detection
 - Communication coordination
 - Communication optimizations on different channels (SMP, IVShmem, SR-IOV; RC, UD)
J. Zhang, X. Lu, J. Jose and D. K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, The International Conference on High Performance Computing (HiPC '14), Dec 2014
Virtual Machine Locality Detection
[Figure: host environment with hypervisor and PF driver; each MPI process (ranks 0, 1, 4, 5) in its VM, with VF driver and PCI device, writes its membership bit into a shared VM List held in the host IVShmem region under /dev/shm/]
• Create a VM List structure on the IVShmem region of each host
• Each MPI process writes its own membership information into the shared VM List structure according to its global rank
• One byte each, lock-free, O(N)
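The detection scheme above can be sketched as follows. This is a simplified model, not the MVAPICH2-Virt implementation: a plain bytearray stands in for the IVShmem-backed region, and VM ids and ranks are illustrative.

```python
# Sketch of VM-locality detection via a shared per-host "VM list":
# each MPI process writes one byte, indexed by its global rank, identifying
# the VM it runs in. Every rank writes only its own slot, so no lock is
# needed; finding co-resident peers is a single O(N) scan.

def publish(vm_list, vm_id, rank):
    """A process marks its own slot with the id of its VM (single writer)."""
    vm_list[rank] = vm_id

def co_resident_ranks(vm_list, my_vm_id, my_rank):
    """Scan the shared table for other ranks living in the same VM."""
    return [r for r, vm in enumerate(vm_list) if vm == my_vm_id and r != my_rank]

# Example: 6 ranks on one host, two ranks per VM (VMs 0..2).
nranks = 6
vm_list = bytearray(nranks)          # stands in for the IVShmem region
for rank in range(nranks):
    publish(vm_list, vm_id=rank // 2, rank=rank)

peers = co_resident_ranks(vm_list, my_vm_id=0, my_rank=0)  # -> [1]
```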
Communication Coordination
[Figure: host environment with hypervisor, PF driver, and InfiniBand adapter; the communication coordinator in each MPI process (ranks 1 and 4 in Guest 1 and Guest 2) consults the shared VM locality bitmap in IVShmem (/dev/shm) and steers traffic to either the IVShmem channel or the SR-IOV channel through its VF]
• Retrieve VM locality detection information
• Schedule communication channels based on VM locality information
• Fast index, light-weight
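The coordinator's scheduling decision can be sketched as a small policy function. This is a hedged illustration, not the actual runtime logic; the channel names follow the slides, while the function signature is invented for the example.

```python
# Channel scheduling based on locality: ranks inside the same VM use the
# regular SMP shared-memory channel, ranks in co-resident VMs on the same
# host use the IVShmem channel, and everything else goes over the SR-IOV VF.

def select_channel(my_vm, my_host, peer_vm, peer_host):
    if peer_host != my_host:
        return "SR-IOV"      # inter-node: InfiniBand through the VF
    if peer_vm == my_vm:
        return "SMP"         # same VM: intra-VM shared memory
    return "IVShmem"         # co-resident VMs: inter-VM shared memory

channel = select_channel("vm0", "hostA", "vm1", "hostA")  # "IVShmem"
```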
Application Performance (NAS & P3DFFT)
• Proposed design delivers up to 43% (IS) improvement for NAS
• Proposed design brings 29%, 33%, 29%, and 20% improvement for INVERSE, RAND, SINE, and SPEC
[Charts: execution time (s), SR-IOV vs. proposed design; NAS Class B (FT, LU, CG, MG, IS) and P3DFFT 512x512x512 (SPEC, INVERSE, RAND, SINE), each on 32 VMs (8 VMs per node)]
SR-IOV-Enabled VM Migration Support on HPC Clouds
High-Performance SR-IOV-Enabled VM Migration Framework
[Figure: an external migration controller (network suspend trigger, ready-to-migrate detector, network reactive notifier, migration trigger, migration done detector) coordinating source and target hosts, each running MPI guest VMs with SR-IOV VFs, IVShmem, and Ethernet adapters over the IB adapter]
J. Zhang, X. Lu, D. K. Panda, High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV enabled InfiniBand Clusters, IPDPS 2017
• Consists of an SR-IOV-enabled IB cluster and an external migration controller
• Detachment/re-attachment of virtualized devices: multiple parallel libraries coordinate the VM during migration (detach/reattach SR-IOV/IVShmem, migrate VMs, track migration status)
• IB connection: MPI runtime handles IB connection suspending and reactivating
• Proposes Progress Engine (PE)-based and Migration Thread (MT)-based designs to optimize VM migration and MPI application performance
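The control flow above can be sketched as a linear sequence driven by the controller; all object and method names here are illustrative stand-ins, not the framework's actual interfaces.

```python
# Sketch of the migration sequence: suspend IB channels, detach the
# virtualized devices (VF, IVShmem), live-migrate the VM, re-attach the
# devices on the target, and reactivate the channels.

def migrate_vm(vm, runtime, hypervisor, target_host):
    log = []
    runtime.suspend_channels(vm);            log.append("suspended")
    hypervisor.detach(vm, "VF")
    hypervisor.detach(vm, "IVShmem");        log.append("detached")
    hypervisor.live_migrate(vm, target_host); log.append("migrated")
    hypervisor.attach(vm, "VF")
    hypervisor.attach(vm, "IVShmem");        log.append("attached")
    runtime.reactivate_channels(vm);         log.append("reactivated")
    return log

class Stub:
    """Placeholder runtime/hypervisor whose methods all succeed silently."""
    def __getattr__(self, name):
        return lambda *args, **kwargs: None

steps = migrate_vm("vm1", Stub(), Stub(), "hostB")
```

The PE-based and MT-based designs differ in who executes this sequence inside the MPI process: the communication progress engine itself, or a dedicated migration thread that lets computation overlap the migration.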
Sequence Diagram of VM Migration
[Figure: sequence diagram across the login-node controller, migration source, and migration target. Pre-migration: suspend trigger and channel suspension (Phase 1); ready-to-migrate detection (Phase 2); detach VF and IVSHMEM, VM migration, attach VF and IVSHMEM (Phases 3.a-3.c); migration done (Phase 4); post-migration: channel reactivation (Phase 5)]
Proposed Design of MPI Runtime
[Figure: timelines for process P0 comparing no-migration, the progress-engine (PE)-based design, and the migration-thread (MT)-based design in worst and typical scenarios, across the pre-migration, ready-to-migrate, migration, and post-migration stages. Legend: S = suspend channel, R = reactivate channel, control messages, lock/unlock communication, Comp = computation, migration-signal detection, and down-time for VM live migration]
Application Performance
• 8 VMs in total; 1 VM carries out migration during the application run
• Compared with NM (no migration), MT-worst and PE incur some overhead
• MT-typical allows migration to be completely overlapped with computation
[Charts: execution time (s) of PE, MT-worst, MT-typical, and NM for NAS (LU.B, EP.B, MG.B, CG.B, FT.B) and Graph500 (scales 20,10 through 22,10)]
High-Performance MPI Communication for Nested Virtualization
[Figure: two NUMA nodes (cores 0-15, QPI-connected memory controllers) hosting VM 0 and VM 1, each running two containers; the Two-Layer Locality Detector (container locality detector, VM locality detector, nested locality combiner) feeds the Two-Layer NUMA-Aware Communication Coordinator, which selects among the CMA, shared memory (SHM), and network (HCA) channels]
• Two-Layer Locality Detector: dynamically detects MPI processes in the co-resident containers inside one VM as well as the ones in the co-resident VMs
• Two-Layer NUMA-Aware Communication Coordinator: leverages nested locality info, NUMA architecture info, and message characteristics to select the appropriate communication channel
J. Zhang, X. Lu and D. K. Panda, Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand, The 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '17), April 2017
Two-Layer NUMA-Aware Communication Coordinator
[Figure: the coordinator combines a NUMA loader, a nested locality loader, and a message parser on top of the Two-Layer Locality Detector, dispatching to the CMA, shared memory (SHM), or network (HCA) channel]
• Nested Locality Loader reads locality info of the destination process from the Two-Layer Locality Detector
• NUMA Loader reads info of VM/container placements to decide on which NUMA node the destination process is pinned
• Message Parser obtains message attributes, e.g., message type and message size
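Putting the three inputs together, the coordinator's decision can be sketched as below. The threshold value and function signature are hypothetical, chosen only to illustrate how nested locality, NUMA placement, and message size combine into a channel choice.

```python
# Illustrative two-layer, NUMA-aware channel choice: processes in the same
# container can use CMA for large transfers, co-resident containers/VMs use
# the shared-memory channel, and remote peers use the network (HCA). A large
# cross-NUMA message may still prefer the network channel.

def coordinate(same_container, same_host, same_numa, msg_size,
               shm_threshold=8192):   # hypothetical eager threshold (bytes)
    if not same_host:
        return "HCA"                  # inter-node: InfiniBand HCA
    if same_container:
        # same OS instance: kernel-assisted copy pays off for large messages
        return "CMA" if msg_size > shm_threshold else "SHM"
    if not same_numa and msg_size > shm_threshold:
        return "HCA"                  # large cross-NUMA: avoid remote-memory copies
    return "SHM"
```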
Hybrid Design for NUMA-Aware Communication
• Basic Hybrid design: transfers small messages over the shared-memory channel and large messages through the network channel
• Enhanced Hybrid design: small and control (RTS, CTS) messages go over the shared-memory channel; ONLY the data payload of large messages goes through the network channel
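The two policies can be contrasted in a few lines. The threshold is a hypothetical eager limit, and the message kinds follow the RTS/CTS rendezvous handshake named above; this is a sketch of the routing decision, not the runtime's code.

```python
# Basic vs. enhanced hybrid routing for a rendezvous transfer between
# containers on co-resident VMs.

SMALL = 8192  # hypothetical eager threshold in bytes

def basic_hybrid(msg_kind, transfer_size):
    # the whole exchange, control messages included, picks one channel
    # based on the size of the transfer
    return "SHM" if transfer_size <= SMALL else "network"

def enhanced_hybrid(msg_kind, transfer_size):
    # control messages always take the low-latency shared-memory path;
    # only the bulk payload of a large message crosses the network
    if msg_kind in ("RTS", "CTS") or transfer_size <= SMALL:
        return "SHM"
    return "network"
```

The payoff of the enhanced design is that handshake latency stays on the IVSHMEM path even when the payload must use the NIC.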
[Figure: two VMs, each with two containers holding MPI ranks, on one host. Basic design: RTS, CTS, and data all traverse the NIC and network; enhanced design: RTS and CTS go through IVSHMEM, and only the data payload traverses the NIC]
Applications Performance
• 256 processes across 64 containers on 16 nodes
• Compared with Default, the enhanced-hybrid design reduces execution time by up to 16% (28,16) and 10% (LU) for Graph500 and NAS, respectively
• Compared with the 1Layer case, the enhanced-hybrid design also brings up to 12% (28,16) and 6% (LU) performance benefit
[Charts: execution time of Default, 1Layer, and 2Layer-Enhanced-Hybrid for Class D NAS (IS, MG, EP, FT, CG, LU) and Graph500 BFS (scales 22,20 through 28,16)]
Typical Usage Scenarios
[Figure: compute nodes running MPI processes inside VMs under three allocation patterns]
• Exclusive Allocations, Sequential Jobs (EASJ)
• Shared-host Allocations, Concurrent Jobs (SACJ)
• Exclusive Allocations, Concurrent Jobs (EACJ)
Slurm-V Architecture Overview
[Figure: users submit a job (sbatch file plus VM configuration file) to SLURMctld, which sends physical resource requests to SLURMd daemons on the physical nodes from the node list; SLURMd launches/reclaims VMs through libvirtd, each VM receiving an SR-IOV VF and an IVSHMEM device, with images served from a Lustre image pool. VM launching steps: 1. SR-IOV virtual function, 2. IVSHMEM device, 3. network setting, 4. image management, 5. launching VMs and checking availability, 6. mounting global storage, etc.]
Alternative Designs of Slurm-V
• Slurm SPANK plugin-based design
 – Utilize the SPANK plugin to read the VM configuration and launch/reclaim VMs
 – File-based lock to detect occupied VFs and exclusively allocate free VFs
 – Assign a unique ID to each IVSHMEM device and dynamically attach it to each VM
 – Inherits advantages from Slurm: coordination, scalability, security
• Slurm SPANK plugin over OpenStack-based design
 – Offload VM launch/reclaim to the underlying OpenStack framework
 – PCI whitelist to pass through free VFs to VMs
 – Extend Nova to enable IVSHMEM when launching VMs
 – Inherits advantages from both OpenStack and Slurm: component optimization, performance
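The file-based VF lock in the SPANK plugin design can be illustrated as follows. This is a hedged sketch under assumed conventions (one lock file per VF in a shared directory, `flock` for mutual exclusion); paths and VF names are invented for the example.

```python
# Sketch of file-lock-based VF allocation: each VF gets a lock file, and a
# job that wins flock() on that file owns the VF exclusively until release.
import fcntl
import os
import tempfile

LOCK_DIR = tempfile.mkdtemp()  # stands in for a shared lock directory

def try_acquire_vf(vf_name):
    """Return an open fd holding the VF's lock, or None if it is occupied."""
    fd = os.open(os.path.join(LOCK_DIR, vf_name), os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # non-blocking try
        return fd
    except BlockingIOError:
        os.close(fd)
        return None

def allocate_free_vf(vfs):
    """Scan the VF list and grab the first free one."""
    for vf in vfs:
        fd = try_acquire_vf(vf)
        if fd is not None:
            return vf, fd
    return None, None

def release_vf(fd):
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

Because the lock lives in the filesystem, any plugin instance on the node can detect an occupied VF without extra coordination state.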
Applications Performance
• 32 VMs across 8 nodes, 6 cores/VM
• EASJ: compared to Native, less than 4% overhead
• SACJ, EACJ: less than 9% overhead when running NAS as a concurrent job with 64 processes
[Charts: Graph500 BFS execution time (ms), VM vs. Native, with 64 processes across 8 nodes on Chameleon; EASJ (scales 24,16 through 26,10, up to 6% overhead), SACJ (scales 22,10 through 22,20, up to 4%), EACJ (scales 22,10 through 22,20, up to 9%)]
Impact on HPC and Cloud Communities
• Designs available through the MVAPICH2-Virt library: http://mvapich.cse.ohio-state.edu/download/mvapich/virt/mvapich2-virt-2.2-1.el7.centos.x86_64.rpm
• Complex appliances available on Chameleon Cloud:
 – MPI bare-metal cluster: https://www.chameleoncloud.org/appliances/29/
 – MPI + SR-IOV KVM cluster: https://www.chameleoncloud.org/appliances/28/
• Enables users to easily and quickly deploy HPC clouds and run jobs with high performance
• Enables administrators to efficiently manage and schedule cluster resources
Contributions
• Addresses key issues in building efficient HPC clouds
• Optimizes MPI communication on various HPC clouds
• Presents designs of live migration to provide fault-tolerance on HPC clouds
• Presents co-designs with resource management and scheduling systems
• Demonstrates the corresponding benefits on modern HPC clusters
• Broader outreach through MVAPICH2-Virt public releases and complex appliances on the Chameleon Cloud testbed
Thank You! & Questions?
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page
http://mvapich.cse.ohio-state.edu/