Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters
Jie Zhang
Dr. Dhabaleswar K. Panda (Advisor)
Department of Computer Science & Engineering, The Ohio State University
SC 2017 Doctoral Showcase, Network-Based Computing Laboratory
Outline
• Introduction
• Problem Statement
• Detailed Designs and Results
• Impact on HPC Community
• Contributions
Cloud Computing and Virtualization
• Cloud computing focuses on maximizing the effectiveness of the shared resources
• Virtualization is the key technology for resource sharing in the cloud
• Widely adopted in industry computing environments
• IDC forecasts worldwide public IT cloud services spending will reach $195 billion by 2020
(Courtesy: http://www.idc.com/getdoc.jsp?containerId=prUS41669516)
Drivers of Modern HPC Cluster and Cloud Architecture
• Multi-core/many-core technologies
• Accelerators (GPUs/co-processors)
• Large memory nodes (up to 2 TB)
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Single Root I/O Virtualization (SR-IOV)
• High-performance interconnects such as InfiniBand (with SR-IOV): <1 usec latency, 200 Gbps bandwidth
(Example clouds: SDSC Comet, TACC Stampede)
Single Root I/O Virtualization (SR-IOV)
• SR-IOV provides new opportunities to design HPC clouds with very low overhead
• Allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
• VFs are designed based on the existing non-virtualized PFs, so no driver change is needed
• Each VF can be dedicated to a single VM through PCI pass-through
4. Performance comparisons between IVShmem-backed and native-mode MPI libraries, using HPC applications
The evaluation results indicate that IVShmem can improve point-to-point and collective operations by up to 193% and 91%, respectively. The application execution time can be decreased by up to 96%, compared to SR-IOV. The results further show that IVShmem brings only small overheads, compared with the native environment.
The rest of the paper is organized as follows. Section 2 provides an overview of IVShmem, SR-IOV, and InfiniBand. Section 3 describes our prototype design and evaluation methodology. Section 4 presents the performance analysis results using micro-benchmarks and applications, scalability results, and comparison with native mode. We discuss the related work in Section 5, and conclude in Section 6.
2 Background
Inter-VM Shared Memory (IVShmem) (e.g., Nahanni) [15] provides zero-copy access to data on shared memory of co-resident VMs on the KVM platform. IVShmem is designed and implemented mainly in the system call layer, and its interfaces are visible to user-space applications as well. As shown in Figure 2(a), IVShmem contains three components: the guest kernel driver, the modified QEMU supporting the PCI device, and the POSIX shared memory region on the host OS. The shared memory region is allocated by host POSIX operations and mapped into the QEMU process address space. The mapped memory in QEMU can be used by guest applications by being remapped to user space in guest VMs. Evaluation results illustrate that both micro-benchmarks and HPC applications can achieve better performance with IVShmem support.
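The host-side building block here is an ordinary named POSIX shared-memory region (visible under /dev/shm/<name>) that multiple processes map and access without copies. The following is a minimal conceptual stand-in using Python's `multiprocessing.shared_memory`; it is not the actual QEMU/guest mapping path, only an illustration of the zero-copy sharing step.

```python
# Host allocates a named shared-memory region (backed by /dev/shm on Linux);
# a second process can attach to it by name and see writes without any copy.
from multiprocessing import shared_memory

# "Host" side: create a named region.
region = shared_memory.SharedMemory(create=True, size=4096)
region.buf[:5] = b"hello"

# "Peer" side: attach to the same region by name (zero-copy view).
peer = shared_memory.SharedMemory(name=region.name)
data = bytes(peer.buf[:5])

peer.close()
region.close()
region.unlink()  # remove the backing /dev/shm object
```

In IVShmem the same region is additionally mapped into QEMU's address space and exposed to the guest as a PCI BAR, which the guest driver then remaps to user space.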
Fig. 2. Overview of Inter-VM Shmem and SR-IOV Communication Mechanisms: (a) Inter-VM Shmem Mechanism [15]; (b) SR-IOV Mechanism [22]
Single Root I/O Virtualization (SR-IOV) is a PCI Express (PCIe) standard which specifies the native I/O virtualization capabilities in PCIe adapters. As shown in Figure 2(b), SR-IOV allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs). Each virtual device can be dedicated to a single VM through PCI pass-through, which allows each VM to directly access the corresponding VF. Hence, SR-IOV is a hardware-based approach to realize I/O virtualization.
Problem Statement
• Can MPI runtime be redesigned to provide virtualization support for virtual machines and containers when building HPC clouds?
• What kinds of benefits can be achieved on HPC clouds with the redesigned MPI runtime for scientific kernels and applications?
• Can fault-tolerance/resilience (live migration) be supported on SR-IOV-enabled HPC clouds?
• Can we co-design with resource management and scheduling systems to enable HPC clouds on modern HPC systems?
Research Framework
MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for virtualization (MVAPICH2-Virt), available since 2015
– Support for energy-awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
– Used by more than 2,825 organizations in 85 countries
– More than 428,000 (>0.4 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Jul '17 ranking):
  • 1st-ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
  • 15th-ranked 241,108-core cluster (Pleiades) at NASA
  • 20th-ranked 522,080-core cluster (Stampede) at TACC
  • 44th-ranked 74,520-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
– Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
Locality-aware MPI Communication with SR-IOV and IVShmem
[Figure: MPI library architecture (Application, MPI layer, ADI3 layer) in native mode with SMP and network channels over shared memory and the InfiniBand API, versus virtualized mode with SMP, IVShmem, and SR-IOV channels driven by a locality detector and communication coordinator over virtualized hardware]
• MPI library runs in both native and virtualization environments
• In the virtualized environment:
 - Supports shared-memory channels (SMP, IVShmem) and the SR-IOV channel
 - Locality detection
 - Communication coordination
 - Communication optimizations on different channels (SMP, IVShmem, SR-IOV; RC, UD)
J. Zhang, X. Lu, J. Jose and D. K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, The International Conference on High Performance Computing (HiPC '14), Dec 2014
Virtual Machine Locality Detection
[Figure: host environment with hypervisor and PF driver; each MPI process (ranks 0, 1, 4, 5) in its VM, with VF driver and PCI device, writes its membership bit into a shared VM List held in the host IVShmem region under /dev/shm/]
• Create a VM List structure on the IVShmem region of each host
• Each MPI process writes its own membership information into the shared VM List structure according to its global rank
• One byte each, lock-free, O(N)
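The detection scheme above can be sketched as follows. This is a simplified model, not the MVAPICH2-Virt implementation: a plain bytearray stands in for the IVShmem-backed region, and VM ids and ranks are illustrative.

```python
# Sketch of VM-locality detection via a shared per-host "VM list":
# each MPI process writes one byte, indexed by its global rank, identifying
# the VM it runs in. Every rank writes only its own slot, so no lock is
# needed; finding co-resident peers is a single O(N) scan.

def publish(vm_list, vm_id, rank):
    """A process marks its own slot with the id of its VM (single writer)."""
    vm_list[rank] = vm_id

def co_resident_ranks(vm_list, my_vm_id, my_rank):
    """Scan the shared table for other ranks living in the same VM."""
    return [r for r, vm in enumerate(vm_list) if vm == my_vm_id and r != my_rank]

# Example: 6 ranks on one host, two ranks per VM (VMs 0..2).
nranks = 6
vm_list = bytearray(nranks)          # stands in for the IVShmem region
for rank in range(nranks):
    publish(vm_list, vm_id=rank // 2, rank=rank)

peers = co_resident_ranks(vm_list, my_vm_id=0, my_rank=0)  # -> [1]
```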
Communication Coordination
[Figure: host environment with hypervisor, PF driver, and InfiniBand adapter; the communication coordinator in each MPI process (ranks 1 and 4 in Guest 1 and Guest 2) consults the shared VM locality bitmap in IVShmem (/dev/shm) and steers traffic to either the IVShmem channel or the SR-IOV channel through its VF]
• Retrieve VM locality detection information
• Schedule communication channels based on VM locality information
• Fast index, light-weight
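The coordinator's scheduling decision can be sketched as a small policy function. This is a hedged illustration, not the actual runtime logic; the channel names follow the slides, while the function signature is invented for the example.

```python
# Channel scheduling based on locality: ranks inside the same VM use the
# regular SMP shared-memory channel, ranks in co-resident VMs on the same
# host use the IVShmem channel, and everything else goes over the SR-IOV VF.

def select_channel(my_vm, my_host, peer_vm, peer_host):
    if peer_host != my_host:
        return "SR-IOV"      # inter-node: InfiniBand through the VF
    if peer_vm == my_vm:
        return "SMP"         # same VM: intra-VM shared memory
    return "IVShmem"         # co-resident VMs: inter-VM shared memory

channel = select_channel("vm0", "hostA", "vm1", "hostA")  # "IVShmem"
```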
Application Performance (NAS & P3DFFT)
• Proposed design delivers up to 43% (IS) improvement for NAS
• Proposed design brings 29%, 33%, 29%, and 20% improvement for INVERSE, RAND, SINE, and SPEC
[Charts: execution time (s), SR-IOV vs. proposed design; NAS Class B (FT, LU, CG, MG, IS) and P3DFFT 512x512x512 (SPEC, INVERSE, RAND, SINE), each on 32 VMs (8 VMs per node)]
SR-IOV-Enabled VM Migration Support on HPC Clouds
High-Performance SR-IOV-Enabled VM Migration Framework
[Figure: an external migration controller (network suspend trigger, ready-to-migrate detector, network reactive notifier, migration trigger, migration done detector) coordinating source and target hosts, each running MPI guest VMs with SR-IOV VFs, IVShmem, and Ethernet adapters over the IB adapter]
J. Zhang, X. Lu, D. K. Panda, High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV enabled InfiniBand Clusters, IPDPS 2017
• Consists of an SR-IOV-enabled IB cluster and an external migration controller
• Detachment/re-attachment of virtualized devices: multiple parallel libraries coordinate the VM during migration (detach/reattach SR-IOV/IVShmem, migrate VMs, track migration status)
• IB connection: MPI runtime handles IB connection suspending and reactivating
• Proposes Progress Engine (PE)-based and Migration Thread (MT)-based designs to optimize VM migration and MPI application performance
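The control flow above can be sketched as a linear sequence driven by the controller; all object and method names here are illustrative stand-ins, not the framework's actual interfaces.

```python
# Sketch of the migration sequence: suspend IB channels, detach the
# virtualized devices (VF, IVShmem), live-migrate the VM, re-attach the
# devices on the target, and reactivate the channels.

def migrate_vm(vm, runtime, hypervisor, target_host):
    log = []
    runtime.suspend_channels(vm);            log.append("suspended")
    hypervisor.detach(vm, "VF")
    hypervisor.detach(vm, "IVShmem");        log.append("detached")
    hypervisor.live_migrate(vm, target_host); log.append("migrated")
    hypervisor.attach(vm, "VF")
    hypervisor.attach(vm, "IVShmem");        log.append("attached")
    runtime.reactivate_channels(vm);         log.append("reactivated")
    return log

class Stub:
    """Placeholder runtime/hypervisor whose methods all succeed silently."""
    def __getattr__(self, name):
        return lambda *args, **kwargs: None

steps = migrate_vm("vm1", Stub(), Stub(), "hostB")
```

The PE-based and MT-based designs differ in who executes this sequence inside the MPI process: the communication progress engine itself, or a dedicated migration thread that lets computation overlap the migration.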
Sequence Diagram of VM Migration
[Figure: sequence diagram across the login-node controller, migration source, and migration target. Pre-migration: suspend trigger and channel suspension (Phase 1); ready-to-migrate detection (Phase 2); detach VF and IVSHMEM, VM migration, attach VF and IVSHMEM (Phases 3.a-3.c); migration done (Phase 4); post-migration: channel reactivation (Phase 5)]
Proposed Design of MPI Runtime
[Figure: timelines for process P0 comparing no-migration, the progress-engine (PE)-based design, and the migration-thread (MT)-based design in worst and typical scenarios, across the pre-migration, ready-to-migrate, migration, and post-migration stages. Legend: S = suspend channel, R = reactivate channel, control messages, lock/unlock communication, Comp = computation, migration-signal detection, and down-time for VM live migration]
Application Performance
• 8 VMs in total; 1 VM carries out migration during the application run
• Compared with NM (no migration), MT-worst and PE incur some overhead
• MT-typical allows migration to be completely overlapped with computation
[Charts: execution time (s) of PE, MT-worst, MT-typical, and NM for NAS (LU.B, EP.B, MG.B, CG.B, FT.B) and Graph500 (scales 20,10 through 22,10)]
High-Performance MPI Communication for Nested Virtualization
[Figure: two NUMA nodes (cores 0-15, QPI-connected memory controllers) hosting VM 0 and VM 1, each running two containers; the Two-Layer Locality Detector (container locality detector, VM locality detector, nested locality combiner) feeds the Two-Layer NUMA-Aware Communication Coordinator, which selects among the CMA, shared memory (SHM), and network (HCA) channels]
• Two-Layer Locality Detector: dynamically detects MPI processes in the co-resident containers inside one VM as well as the ones in the co-resident VMs
• Two-Layer NUMA-Aware Communication Coordinator: leverages nested locality info, NUMA architecture info, and message characteristics to select the appropriate communication channel
J. Zhang, X. Lu and D. K. Panda, Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand, The 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '17), April 2017
Two-Layer NUMA-Aware Communication Coordinator
[Figure: the coordinator combines a NUMA loader, a nested locality loader, and a message parser on top of the Two-Layer Locality Detector, dispatching to the CMA, shared memory (SHM), or network (HCA) channel]
• Nested Locality Loader reads locality info of the destination process from the Two-Layer Locality Detector
• NUMA Loader reads info of VM/container placements to decide on which NUMA node the destination process is pinned
• Message Parser obtains message attributes, e.g., message type and message size
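Putting the three inputs together, the coordinator's decision can be sketched as below. The threshold value and function signature are hypothetical, chosen only to illustrate how nested locality, NUMA placement, and message size combine into a channel choice.

```python
# Illustrative two-layer, NUMA-aware channel choice: processes in the same
# container can use CMA for large transfers, co-resident containers/VMs use
# the shared-memory channel, and remote peers use the network (HCA). A large
# cross-NUMA message may still prefer the network channel.

def coordinate(same_container, same_host, same_numa, msg_size,
               shm_threshold=8192):   # hypothetical eager threshold (bytes)
    if not same_host:
        return "HCA"                  # inter-node: InfiniBand HCA
    if same_container:
        # same OS instance: kernel-assisted copy pays off for large messages
        return "CMA" if msg_size > shm_threshold else "SHM"
    if not same_numa and msg_size > shm_threshold:
        return "HCA"                  # large cross-NUMA: avoid remote-memory copies
    return "SHM"
```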
Hybrid Design for NUMA-Aware Communication
• Basic Hybrid design: transfers small messages over the shared-memory channel and large messages through the network channel
• Enhanced Hybrid design: small and control (RTS, CTS) messages go over the shared-memory channel; ONLY the data payload of large messages goes through the network channel
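The two policies can be contrasted in a few lines. The threshold is a hypothetical eager limit, and the message kinds follow the RTS/CTS rendezvous handshake named above; this is a sketch of the routing decision, not the runtime's code.

```python
# Basic vs. enhanced hybrid routing for a rendezvous transfer between
# containers on co-resident VMs.

SMALL = 8192  # hypothetical eager threshold in bytes

def basic_hybrid(msg_kind, transfer_size):
    # the whole exchange, control messages included, picks one channel
    # based on the size of the transfer
    return "SHM" if transfer_size <= SMALL else "network"

def enhanced_hybrid(msg_kind, transfer_size):
    # control messages always take the low-latency shared-memory path;
    # only the bulk payload of a large message crosses the network
    if msg_kind in ("RTS", "CTS") or transfer_size <= SMALL:
        return "SHM"
    return "network"
```

The payoff of the enhanced design is that handshake latency stays on the IVSHMEM path even when the payload must use the NIC.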
[Figure: two VMs, each with two containers holding MPI ranks, on one host. Basic design: RTS, CTS, and data all traverse the NIC and network; enhanced design: RTS and CTS go through IVSHMEM, and only the data payload traverses the NIC]
Applications Performance
• 256 processes across 64 containers on 16 nodes
• Compared with Default, the enhanced-hybrid design reduces execution time by up to 16% (28,16) and 10% (LU) for Graph500 and NAS, respectively
• Compared with the 1Layer case, the enhanced-hybrid design also brings up to 12% (28,16) and 6% (LU) performance benefit
[Charts: execution time of Default, 1Layer, and 2Layer-Enhanced-Hybrid for Class D NAS (IS, MG, EP, FT, CG, LU) and Graph500 BFS (scales 22,20 through 28,16)]
Typical Usage Scenarios
[Figure: compute nodes running MPI processes inside VMs under three allocation patterns]
• Exclusive Allocations, Sequential Jobs (EASJ)
• Shared-host Allocations, Concurrent Jobs (SACJ)
• Exclusive Allocations, Concurrent Jobs (EACJ)
Slurm-V Architecture Overview
[Figure: users submit a job (sbatch file plus VM configuration file) to SLURMctld, which sends physical resource requests to SLURMd daemons on the physical nodes from the node list; SLURMd launches/reclaims VMs through libvirtd, each VM receiving an SR-IOV VF and an IVSHMEM device, with images served from a Lustre image pool. VM launching steps: 1. SR-IOV virtual function, 2. IVSHMEM device, 3. network setting, 4. image management, 5. launching VMs and checking availability, 6. mounting global storage, etc.]
Alternative Designs of Slurm-V
• Slurm SPANK plugin-based design
 – Utilize the SPANK plugin to read the VM configuration and launch/reclaim VMs
 – File-based lock to detect occupied VFs and exclusively allocate free VFs
 – Assign a unique ID to each IVSHMEM device and dynamically attach it to each VM
 – Inherits advantages from Slurm: coordination, scalability, security
• Slurm SPANK plugin over OpenStack-based design
 – Offload VM launch/reclaim to the underlying OpenStack framework
 – PCI whitelist to pass through free VFs to VMs
 – Extend Nova to enable IVSHMEM when launching VMs
 – Inherits advantages from both OpenStack and Slurm: component optimization, performance
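The file-based VF lock in the SPANK plugin design can be illustrated as follows. This is a hedged sketch under assumed conventions (one lock file per VF in a shared directory, `flock` for mutual exclusion); paths and VF names are invented for the example.

```python
# Sketch of file-lock-based VF allocation: each VF gets a lock file, and a
# job that wins flock() on that file owns the VF exclusively until release.
import fcntl
import os
import tempfile

LOCK_DIR = tempfile.mkdtemp()  # stands in for a shared lock directory

def try_acquire_vf(vf_name):
    """Return an open fd holding the VF's lock, or None if it is occupied."""
    fd = os.open(os.path.join(LOCK_DIR, vf_name), os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # non-blocking try
        return fd
    except BlockingIOError:
        os.close(fd)
        return None

def allocate_free_vf(vfs):
    """Scan the VF list and grab the first free one."""
    for vf in vfs:
        fd = try_acquire_vf(vf)
        if fd is not None:
            return vf, fd
    return None, None

def release_vf(fd):
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

Because the lock lives in the filesystem, any plugin instance on the node can detect an occupied VF without extra coordination state.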
Applications Performance
• 32 VMs across 8 nodes, 6 cores/VM
• EASJ: compared to Native, less than 4% overhead
• SACJ, EACJ: less than 9% overhead when running NAS as a concurrent job with 64 processes
[Charts: Graph500 BFS execution time (ms), VM vs. Native, with 64 processes across 8 nodes on Chameleon; EASJ (scales 24,16 through 26,10, up to 6% overhead), SACJ (scales 22,10 through 22,20, up to 4%), EACJ (scales 22,10 through 22,20, up to 9%)]
Impact on HPC and Cloud Communities
• Designs available through the MVAPICH2-Virt library: http://mvapich.cse.ohio-state.edu/download/mvapich/virt/mvapich2-virt-2.2-1.el7.centos.x86_64.rpm
• Complex appliances available on Chameleon Cloud:
 – MPI bare-metal cluster: https://www.chameleoncloud.org/appliances/29/
 – MPI + SR-IOV KVM cluster: https://www.chameleoncloud.org/appliances/28/
• Enables users to easily and quickly deploy HPC clouds and run jobs with high performance
• Enables administrators to efficiently manage and schedule cluster resources
Contributions
• Addresses key issues in building efficient HPC clouds
• Optimizes MPI communication on various HPC clouds
• Presents designs of live migration to provide fault-tolerance on HPC clouds
• Presents co-designs with resource management and scheduling systems
• Demonstrates the corresponding benefits on modern HPC clusters
• Broader outreach through MVAPICH2-Virt public releases and complex appliances on the Chameleon Cloud testbed
Thank You! & Questions?
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page
http://mvapich.cse.ohio-state.edu/