TRANSCRIPT
New Developments in Message Passing, PGAS and Big Data
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at HPC Advisory Council China Workshop (2013)
Welcome (欢迎)
HPC Advisory Council China Workshop, Oct '13
• Growth of High Performance Computing
(HPC)
– Growth in processor performance
• Chip density doubles every 18 months
– Growth in commodity networking
• Increase in speed/features + reducing cost
HPC Advisory Council China Workshop, Oct '13
Current and Next Generation HPC Systems and Applications
3
• Scientific Computing
– Message Passing Interface (MPI), including MPI + OpenMP, is the
Dominant Programming Model
– Many discussions towards Partitioned Global Address Space
(PGAS)
• UPC, OpenSHMEM, CAF, etc.
– Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)
• Big Data/Enterprise/Commercial Computing
– Focuses on large data and data analysis
– Hadoop (HDFS, HBase, MapReduce) environment is gaining a lot of
momentum
– Memcached is also used for Web 2.0
4
Two Major Categories of Applications
HPC Advisory Council China Workshop, Oct '13
High-End Computing (HEC): PetaFlop to ExaFlop
HPC Advisory Council China Workshop, Oct '13 5
Expected to have an ExaFlop system in 2019-2022!
(Projected: 100 PFlops in 2015, 1 EFlops in 2018)
6
Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)
[Chart: Timeline on the x-axis; Number of Clusters (0-500) and Percentage of Clusters (0-100%) for commodity clusters in the Top 500 list]
HPC Advisory Council China Workshop, Oct '13
Drivers of Modern HPC Cluster Architectures
• Multi-core processors are ubiquitous
• InfiniBand very popular in HPC clusters
• Accelerators/Coprocessors becoming common in high-end systems
• Pushing the envelope for Exascale computing
Accelerators/Coprocessors: high compute density, high performance/watt, >1 TFlop DP on a chip
High Performance Interconnects - InfiniBand: <1 usec latency, >100 Gbps bandwidth
Examples: Tianhe-2 (#1), Titan (#2), Stampede (#6), Tianhe-1A (#10)
7
Multi-core Processors
HPC Advisory Council China Workshop, Oct '13
Exascale System Targets
Systems: 2010 → 2018 (difference between 2010 and 2018)
System peak: 2 PFlop/s → 1 EFlop/s (O(1,000))
Power: 6 MW → ~20 MW (goal)
System memory: 0.3 PB → 32-64 PB (O(100))
Node performance: 125 GF → 1.2 or 15 TF (O(10)-O(100))
Node memory BW: 25 GB/s → 2-4 TB/s (O(100))
Node concurrency: 12 → O(1k) or O(10k) (O(100)-O(1,000))
Total node interconnect BW: 3.5 GB/s → 200-400 GB/s (1:4 or 1:8 from memory BW) (O(100))
System size (nodes): 18,700 → O(100,000) or O(1M) (O(10)-O(100))
Total concurrency: 225,000 → O(billion) + [O(10) to O(100) for latency hiding] (O(10,000))
Storage capacity: 15 PB → 500-1000 PB (>10x system memory is min) (O(10)-O(100))
IO rates: 0.2 TB/s → 60 TB/s (O(100))
MTTI: days → O(1 day) (-O(10))
Courtesy: DOE Exascale Study and Prof. Jack Dongarra
8 HPC Advisory Council China Workshop, Oct '13
• DARPA Exascale Report – Peter Kogge, Editor and Lead
• Energy and Power Challenge
– Hard to solve power requirements for data movement
• Memory and Storage Challenge
– Hard to achieve high capacity and high data rate
• Concurrency and Locality Challenge
– Management of very large amount of concurrency (billion threads)
• Resiliency Challenge
– Low voltage devices (for low power) introduce more faults
HPC Advisory Council China Workshop, Oct '13 9
Basic Design Challenges for Exascale Systems
Designing Software Libraries for Multi-Petaflop and Exaflop Systems: Challenges
Programming Models MPI, PGAS (UPC, Global Arrays, OpenSHMEM),
CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.
Application Kernels/Applications
Networking Technologies (InfiniBand, 40/100GigE,
Aries, BlueGene)
Multi/Many-core Architectures
Point-to-point Communication (two-sided & one-sided)
Collective Communication
Synchronization & Locks
I/O & File Systems
Fault Tolerance
Communication Library or Runtime for Programming Models
10
Accelerators (NVIDIA and MIC)
Middleware
HPC Advisory Council China Workshop, Oct '13
Co-Design Opportunities and Challenges across Various Layers
Performance, Scalability, Fault-Resilience
• Message Passing Interface (MPI)
• Partitioned Global Address Space (PGAS)
– Hybrid Programming: MPI + PGAS
• Big Data
– Hadoop
– Memcached (Web 2.0)
11
New Developments and Challenges
HPC Advisory Council China Workshop, Oct '13
HPC Advisory Council China Workshop, Oct '13 12
Parallel Programming Models Overview
[Figure: three abstract machine models]
Shared Memory Model (e.g., SHMEM, DSM): P1, P2, P3 operate on one shared memory
Distributed Memory Model (e.g., MPI - Message Passing Interface): P1, P2, P3 each with its own private memory
Partitioned Global Address Space (PGAS) (e.g., Global Arrays, UPC, Chapel, X10, CAF, …): private memories combined into a logical shared memory
• Programming models provide abstract machine models
• Models can be mapped on different types of systems
– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• In this presentation, we concentrate on MPI first, then on
PGAS and Hybrid MPI+PGAS
• Message Passing Library standardized by MPI Forum
– C and Fortran
• Goal: portable, efficient and flexible standard for writing
parallel applications
• Not an IEEE or ISO standard, but widely considered the "industry standard" for HPC applications
• Evolution of MPI
– MPI-1: 1994
– MPI-2: 1996
– MPI-3.0: 2008 – 2012, standardized before SC ‘12
– Next plans for MPI 3.1, 3.2, ….
13
MPI Overview and History
HPC Advisory Council China Workshop, Oct '13
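For readers new to MPI, a minimal program (added here for illustration, not from the slides) looks like the following; it would typically be compiled with mpicc and launched with mpirun or mpirun_rsh:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                 /* initialize the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}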
• Point-to-point Two-sided Communication
• Collective Communication
• One-sided Communication
• Job Startup
• Parallel I/O
Major MPI Features
14 HPC Advisory Council China Workshop, Oct '13
• Power required for data movement operations is one of
the main challenges
• Non-blocking collectives
– Overlap computation and communication
• Much improved One-sided interface
– Reduce synchronization of sender/receiver
• Manage concurrency
– Improved interoperability with PGAS (e.g. UPC, Global Arrays,
OpenSHMEM)
• Resiliency
– New interface for detecting failures
HPC Advisory Council China Workshop, Oct '13 15
How does MPI Plan to Meet Exascale Challenges?
• Major features
– Non-blocking Collectives
– Improved One-Sided (RMA) Model
– MPI Tools Interface
• Specification is available from: http://www.mpi-
forum.org/docs/mpi-3.0/mpi30-report.pdf
HPC Advisory Council China Workshop, Oct '13 16
Major New Features in MPI-3
• Enables overlap of computation with communication
• Removes synchronization effects of collective operations
(exception of barrier)
• Non-blocking calls do not match blocking collective calls
– MPI implementation may use different algorithms for blocking and
non-blocking collectives
– Blocking collectives: optimized for latency
– Non-blocking collectives: optimized for overlap
• User must call collectives in same order on all ranks
• Progress rules are same as those for point-to-point
• Example new calls: MPI_Ibarrier, MPI_Iallreduce, …
HPC Advisory Council China Workshop, Oct '13 17
Non-blocking Collective Operations
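A minimal sketch (added for illustration, not from the slide) of overlapping independent computation with an MPI-3 non-blocking collective:

#include <mpi.h>

/* Sketch: start an allreduce, do unrelated local work, then complete it. */
void overlapped_allreduce(double local_sum, double *global_sum) {
    MPI_Request req;
    MPI_Iallreduce(&local_sum, global_sum, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);   /* starts the collective */
    /* ... computation that does not depend on global_sum ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);      /* global_sum is now valid */
}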
HPC Advisory Council China Workshop, Oct '13 18
Improved One-sided (RMA) Model
[Figure: four processes (0-3), each exposing part of its memory as an RMA window; Process 0 shown with public and private window copies, incoming RMA operations, synchronization, and local memory operations on the window]
• New RMA proposal has major improvements
• Easy to express irregular communication pattern
• Better overlap of communication & computation
• MPI-2: public and private windows
– Synchronization of windows explicit
• MPI-2: works for non-cache coherent systems
• MPI-3: two types of windows
– Unified and Separate
– Unified window leverages hardware cache coherence
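A minimal sketch (added for illustration, not from the slide) of the MPI-3 RMA model with a window and passive-target synchronization:

#include <mpi.h>

/* Sketch: each rank puts one double into the window of the next rank. */
void rma_example(int rank, int size) {
    double *base;
    MPI_Win win;
    /* allocate local memory and expose it as an RMA window */
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    *base = 0.0;
    double value = (double) rank;
    int target = (rank + 1) % size;

    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);  /* passive-target sync */
    MPI_Put(&value, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
    MPI_Win_unlock(target, win);                       /* put is complete here */

    MPI_Win_free(&win);
}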
HPC Advisory Council China Workshop, Oct '13 19
MPI Tools Interface
• Extended tools support in MPI-3, beyond the PMPI interface
• Provide standardized interface (MPI_T) to access MPI internal information
– Configuration and control information
• Eager limit, buffer sizes, . . .
– Performance information
• Time spent in blocking, memory usage, . . .
– Debugging information
• Packet counters, thresholds, . . .
• External tools can build on top of this standard interface
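A minimal sketch (added for illustration, not from the slide) that uses the MPI_T interface to enumerate the control variables an implementation exposes (e.g., eager limits and buffer sizes):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, num_cvars;
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_cvar_get_num(&num_cvars);
    for (int i = 0; i < num_cvars; i++) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, bind, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;
        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                            &enumtype, desc, &desc_len, &bind, &scope);
        printf("cvar %d: %s\n", i, name);   /* implementation-specific names */
    }
    MPI_T_finalize();
    return 0;
}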
20
Partitioned Global Address Space (PGAS) Models
HPC Advisory Council China Workshop, Oct '13
• Key features
- Simple shared memory abstractions
- Light weight one-sided communication
- Easier to express irregular communication
• Different approaches to PGAS
- Languages
• Unified Parallel C (UPC)
• Co-Array Fortran (CAF)
• X10
• Chapel
- Libraries
• OpenSHMEM
• Global Arrays
• UPC: a parallel extension to the C standard
• UPC Specifications and Standards:
– Introduction to UPC and Language Specification, 1999
– UPC Language Specifications, v1.0, Feb 2001
– UPC Language Specifications, v1.1.1, Sep 2004
– UPC Language Specifications, v1.2, 2005
– UPC Language Specifications, v1.3, In Progress - Draft Available
• UPC Consortium
– Academic Institutions: GWU, MTU, UCB, U. Florida, U. Houston, U. Maryland…
– Government Institutions: ARSC, IDA, LBNL, SNL, US DOE…
– Commercial Institutions: HP, Cray, Intrepid Technology, IBM, …
• Supported by several UPC compilers
– Vendor-based commercial UPC compilers: HP UPC, Cray UPC, SGI UPC
– Open-source UPC compilers: Berkeley UPC, GCC UPC, Michigan Tech MuPC
• Aims for: high performance, coding efficiency, irregular applications, …
HPC Advisory Council China Workshop, Oct '13 21
Compiler-based: Unified Parallel C
OpenSHMEM
• SHMEM implementations
– Cray SHMEM, SGI SHMEM, Quadrics SHMEM, HP SHMEM, GSHMEM
• Subtle differences in API across versions, for example:
– Initialization: SGI SHMEM start_pes(0); Quadrics SHMEM shmem_init; Cray SHMEM start_pes
– Process ID: SGI SHMEM _my_pe; Quadrics SHMEM my_pe; Cray SHMEM shmem_my_pe
• These differences made application codes non-portable
• OpenSHMEM is an effort to address this:
"A new, open specification to consolidate the various extant SHMEM versions into a widely accepted standard." – OpenSHMEM Specification v1.0
by University of Houston and Oak Ridge National Lab
SGI SHMEM is the baseline
HPC Advisory Council China Workshop, Oct '13 22
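For illustration (not from the slide), a minimal OpenSHMEM program in the v1.0 style inherited from SGI SHMEM looks like this; each PE puts its id into the next PE's symmetric variable:

#include <shmem.h>
#include <stdio.h>

long dest = 0;   /* symmetric variable: one copy exists on every PE */

int main(void) {
    start_pes(0);                          /* OpenSHMEM 1.0 initialization */
    int me   = _my_pe();
    int npes = _num_pes();
    long src = (long) me;
    shmem_long_put(&dest, &src, 1, (me + 1) % npes);  /* one-sided put */
    shmem_barrier_all();                   /* ensure all puts have completed */
    printf("PE %d received %ld\n", me, dest);
    return 0;
}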
• Hierarchical architectures with multiple address spaces
• (MPI + PGAS) Model
– MPI across address spaces
– PGAS within an address space
• MPI is good at moving data between address spaces
• Within an address space, MPI can interoperate with other shared
memory programming models
• Applications can have kernels with different communication patterns
• Can benefit from different models
• Re-writing complete applications can be a huge effort
• Port critical kernels to the desired model instead
HPC Advisory Council China Workshop, Oct '13 23
MPI+PGAS for Exascale Architectures and Applications
Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI/PGAS based
on communication characteristics
• Benefits:
– Best of Distributed Computing Model
– Best of Shared Memory Computing Model
• Exascale Roadmap*:
– “Hybrid Programming is a practical way to
program exascale systems”
* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420
[Figure: an HPC application composed of kernels 1..N, with Kernel 1 and Kernel 3 kept in MPI and Kernel 2 and Kernel N re-written in PGAS]
HPC Advisory Council China Workshop, Oct '13 24
• Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided)
– Extremely small memory footprint
• Balancing intra-node and inter-node communication for next generation multi-core (128-1024 cores/node)
– Multiple end-points per node
• Support for efficient multi-threading
• Support for GPGPUs and Accelerators
• Scalable Collective communication – Offload
– Non-blocking
– Topology-aware
– Power-aware
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, …)
Challenges in Designing (MPI+X) at Exascale
25 HPC Advisory Council China Workshop, Oct '13
• High Performance open-source MPI Library for InfiniBand, 10Gig/iWARP, and RDMA
over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
– MVAPICH2-X (MPI + PGAS), available since 2012
– Used by more than 2,085 organizations (HPC Centers, Industry and Universities) in 71
countries
– More than 193,000 downloads from OSU site directly
– Empowering many TOP500 clusters
• 6th ranked 462,462-core cluster (Stampede) at TACC
• 19th ranked 125,980-core cluster (Pleiades) at NASA
• 21st ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology and many others
– Available with software stacks of many IB, HSE, and server vendors including
Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede System
MVAPICH2/MVAPICH2-X Software
26 HPC Advisory Council China Workshop, Oct '13
• Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided)
– Extremely small memory footprint
• Collective communication – Multicore-aware and Hardware-multicast-based
– Offload and Non-blocking
– Topology-aware
– Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with Unified Runtime
Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
27 HPC Advisory Council China Workshop, Oct '13
HPC Advisory Council China Workshop, Oct '13 28
One-way Latency: MPI over IB
[Charts: Small Message Latency and Large Message Latency (us) vs. message size (bytes) for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR, MVAPICH-ConnectX3-PCIe3-FDR and MVAPICH2-Mellanox-ConnectIB-DualFDR; best small-message latencies across the six configurations: 1.66, 1.56, 1.64, 1.82, 0.99 and 1.09 us]
DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (Sandybridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.6 GHz Octa-core (Sandybridge) Intel PCI Gen3 with IB switch
HPC Advisory Council China Workshop, Oct '13 29
Bandwidth: MPI over IB
[Charts: Unidirectional Bandwidth and Bidirectional Bandwidth (MBytes/sec) vs. message size (bytes) for the same six MPI/adapter configurations; peak unidirectional bandwidths of 1706, 1917, 3280, 3385, 6343 and 12485 MBytes/sec, and peak bidirectional bandwidths of 3341, 3704, 4407, 6521, 11643 and 21025 MBytes/sec]
DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (Sandybridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.6 GHz Octa-core (Sandybridge) Intel PCI Gen3 with IB switch
[Chart: intra-node two-sided latency (us) vs. message size (0-1K bytes), intra-socket and inter-socket]
MVAPICH2 Two-Sided Intra-Node Performance (Shared memory and Kernel-based Zero-copy Support (LiMIC and CMA))
30
Latest MVAPICH2 2.0a
Intel Sandy-bridge
Intra-socket latency: 0.19 us; inter-socket latency: 0.45 us
[Charts: inter-socket and intra-socket bandwidth (MB/s) vs. message size for the CMA, Shmem and LiMIC channels; both reach about 12,000 MB/s]
HPC Advisory Council China Workshop, Oct '13
• Memory usage for 32K processes with 8-cores per node can be 54 MB/process (for connections)
• NAMD performance improves when there is frequent communication to many peers
HPC Advisory Council China Workshop, Oct '13
eXtended Reliable Connection (XRC) and Hybrid Mode Memory Usage Performance on NAMD (1024 cores)
31
• Both UD and RC/XRC have benefits
• Hybrid for the best of both
• Available since MVAPICH2 1.7 as integrated interface
• Runtime Parameters: RC - default;
• UD - MV2_USE_ONLY_UD=1
• Hybrid - MV2_HYBRID_ENABLE_THRESHOLD=1
M. Koop, J. Sridhar and D. K. Panda, “Scalable MPI Design over InfiniBand using eXtended Reliable
Connection,” Cluster ‘08
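Usage note (illustrative, not from the slide): with MVAPICH2's mpirun_rsh launcher these parameters are passed as environment variables on the command line, e.g. mpirun_rsh -np 1024 -hostfile hosts MV2_HYBRID_ENABLE_THRESHOLD=1 ./app, where the hostfile and ./app are placeholders for a real job.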
[Charts: (1) memory (MB/process) vs. number of connections for MVAPICH2-RC and MVAPICH2-XRC; (2) normalized NAMD execution time for the apoa1, er-gre, f1atpase and jac datasets with MVAPICH2-RC and MVAPICH2-XRC; (3) time (us) for 128-1024 processes with UD, Hybrid and RC transports, with improvements of 26%, 40%, 30% and 38%]
• Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided)
– Extremely small memory footprint
• Collective communication – Multicore-aware and Hardware-multicast-based
– Offload and Non-blocking
– Topology-aware
– Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with Unified Runtime
Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
32 HPC Advisory Council China Workshop, Oct '13
HPC Advisory Council China Workshop, Oct '13 33
Hardware Multicast-aware MPI_Bcast on Stampede
[Charts: MPI_Bcast latency (us) with Default vs. hardware Multicast designs - small messages (2-512 bytes) and large messages (2K-128K bytes) at 102,400 cores, plus 16-byte and 32-KByte message latency vs. number of nodes]
ConnectX-3-FDR (54 Gbps): 2.7 GHz Dual Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch
Shared-memory Aware Collectives
[Charts: MPI_Reduce and MPI_Allreduce latency (us) vs. message size at 4096 cores, Original vs. Shared-memory designs; Pt2Pt vs. Shared-Mem Barrier latency for 128-1024 processes]
• MVAPICH2 Reduce/Allreduce with 4K cores on TACC Ranger (AMD Barcelona, SDR IB)
• MVAPICH2 Barrier with 1K Intel Westmere cores, QDR IB
Runtime parameters: MV2_USE_SHMEM_REDUCE=0/1, MV2_USE_SHMEM_ALLREDUCE=0/1, MV2_USE_SHMEM_BARRIER=0/1
[Charts: application run-time (s) vs. data size (512-800), and run-time (s) vs. number of processes (64-512), for PCG-Default vs. Modified-PCG-Offload]
Application benefits with Non-Blocking Collectives based on CX-2 Collective Offload
35
Modified P3DFFT with Offload-Alltoall does up to 17% better than default version (128 Processes)
K. Kandalla, et. al.. High-Performance and Scalable Non-Blocking
All-to-All with Collective Offload on InfiniBand Clusters: A Study
with Parallel 3D FFT, ISC 2011
HPC Advisory Council China Workshop, Oct '13
[Chart: normalized HPL performance vs. HPL problem size (N) as % of total memory, for HPL-Offload, HPL-1ring and HPL-Host; up to 4.5% gain]
Modified HPL with Offload-Bcast does up to 4.5% better than default version (512 Processes)
Modified Pre-Conjugate Gradient Solver with Offload-Allreduce does up to 21.8% better than default version
K. Kandalla, et. al, Designing Non-blocking Broadcast with Collective
Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
K. Kandalla, et. al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS ’12
21.8%
Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS’ 12
HPC Advisory Council China Workshop, Oct '13 36
Network-Topology-Aware Placement of Processes
• Can we design a highly scalable network topology detection service for IB?
• How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
• What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?
[Charts: overall performance and split-up of physical communication for MILC on Ranger - performance for varying system sizes, and Default vs. Topology-Aware placement for a 2048-core run; 15% improvement]
H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable
InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC'12 . BEST Paper and BEST STUDENT Paper
Finalist
• Reduce network topology discovery time from O(N^2) to O(N) in the number of hosts
• 15% improvement in MILC execution time @ 2048 cores
• 15% improvement in Hypre execution time @ 1024 cores
37
Power and Energy Savings with Power-Aware Collectives
[Charts: performance and power comparison for MPI_Alltoall with 64 processes on 8 nodes]
K. Kandalla, E. P. Mancini, S. Sur and D. K. Panda, “Designing Power Aware Collective Communication Algorithms for
Infiniband Clusters”, ICPP ‘10
[Charts: CPMD application performance and power savings (5% and 32%)]
HPC Advisory Council China Workshop, Oct '13
• Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided)
– Extremely small memory footprint
• Collective communication – Multicore-aware and Hardware-multicast-based
– Offload and Non-blocking
– Topology-aware
– Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with Unified Runtime
Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
38 HPC Advisory Council China Workshop, Oct '13
• Collaboration between Mellanox and
NVIDIA to converge on one memory
registration technique
• Both devices register a common
host buffer
– GPU copies data to this buffer, and the network
adapter can directly read from this buffer (or
vice-versa)
• Note that GPU-Direct does not allow you to
bypass host memory
• Latest GPUDirect RDMA achieves this
HPC Advisory Council China Workshop, Oct '13 39
GPU-Direct
[Figure: GPU, CPU and NIC connected over PCIe and a switch; GPU-Direct uses a common registered host buffer]
At Sender: cudaMemcpy(sbuf, sdev, . . .);
MPI_Send(sbuf, size, . . .);
At Receiver: MPI_Recv(rbuf, size, . . .);
cudaMemcpy(rdev, rbuf, . . .);
Sample Code - Without MPI integration
•Naïve implementation with standard MPI and CUDA
• High Productivity and Poor Performance
40 HPC Advisory Council China Workshop, Oct '13
At Sender:
for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(sbuf + j * blksz, sdev + j * blksz, . . .);
for (j = 0; j < pipeline_len; j++) {
    while (result != cudaSuccess) {
        result = cudaStreamQuery(…);
        if (j > 0) MPI_Test(…);
    }
    MPI_Isend(sbuf + j * blksz, blksz, . . .);
}
MPI_Waitall(…);
Sample Code – User Optimized Code
• Pipelining at user level with non-blocking MPI and CUDA interfaces
• Code at Sender side (and repeated at Receiver side)
• User-level copying may not match with internal MPI design
• High Performance and Poor Productivity
HPC Advisory Council China Workshop, Oct '13 41
At Sender: MPI_Send(s_device, size, …);
At Receiver: MPI_Recv(r_device, size, …);
(pipelining is handled inside MVAPICH2)
Sample Code – MVAPICH2-GPU
• MVAPICH2-GPU: standard MPI interfaces used
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
• High Performance and High Productivity
42 HPC Advisory Council China Workshop, Oct '13
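For reference, the same pattern as a complete program is sketched below (added for illustration; it assumes a CUDA-aware MPI build such as MVAPICH2-GPU and a two-rank job):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    int rank;
    size_t size = 4 * 1024 * 1024;          /* 4 MB message */
    void *dev_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc(&dev_buf, size);              /* buffer lives in GPU memory */

    if (rank == 0)                           /* device pointer passed directly to MPI */
        MPI_Send(dev_buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(dev_buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(dev_buf);
    MPI_Finalize();
    return 0;
}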
MPI-Level Two-sided Communication
• 45% improvement compared with a naïve user-level implementation
(Memcpy+Send), for 4MB messages
• 24% improvement compared with an advanced user-level implementation
(MemcpyAsync+Isend), for 4MB messages
[Chart: time (us) vs. message size (32K-4M bytes) for Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU]
H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters, ISC ‘11
43 HPC Advisory Council China Workshop, Oct '13
MVAPICH2 1.8, 1.9 and 2.0a Releases
• Support for MPI communication from NVIDIA GPU device memory
• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
44 HPC Advisory Council China Workshop, Oct '13
HPC Advisory Council China Workshop, Oct '13 45
Applications-Level Evaluation
• LBM-CUDA (Courtesy: Carlos Rosale, TACC)
– Lattice Boltzmann Method for multiphase flows with large density ratios
– One process/GPU per node, 16 nodes
• AWP-ODC (Courtesy: Yifeng Cui, SDSC)
– A seismic modeling code, Gordon Bell Prize finalist at SC 2010
– 128x256x512 data grid per process, 8 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory
[Charts: LBM-CUDA step time (s) for domain sizes 256x256x256 to 512x512x512, MPI vs. MPI-GPU, with improvements of 13.7%, 12.0%, 11.8% and 9.4%; AWP-ODC total execution time (s) with 1 GPU/process per node and 2 GPUs/processes per node, MPI vs. MPI-GPU, with improvements of 11.1% and 7.9%]
• Fastest possible communication
between GPU and other PCI-E
devices
• Network adapter can directly read
data from GPU device memory
• Avoids copies through the host
• Allows for better asynchronous
communication
HPC Advisory Council China Workshop, Oct '13 46
GPU-Direct RDMA with CUDA 5
[Figure: GPU with GPU memory and CPU with system memory connected through the chipset and an InfiniBand adapter]
MVAPICH2 with GDR support under development
MVAPICH2-GDR Alpha is available for users
• Peer2Peer (P2P) bottlenecks on
Sandy Bridge
• Design of MVAPICH2
– Hybrid design
– Takes advantage of GPU-Direct-
RDMA for writes to GPU
– Uses host-based buffered design in
current MVAPICH2 for reads
– Works around the bottlenecks
transparently
47
Initial Design of OSU-MVAPICH2 with GPU-Direct-RDMA
HPC Advisory Council China Workshop, Oct '13
[Figure: IB adapter, CPU, chipset, system memory, GPU and GPU memory; on SNB E5-2670, P2P write: 5.2 GB/s, P2P read: < 1.0 GB/s]
[Chart: GPU-GPU internode small message latency (us) vs. message size (1 byte-4K) for MVAPICH2-1.9 and MVAPICH2-1.9-GDR]
48
Performance of MVAPICH2 with GPU-Direct-RDMA
Based on MVAPICH2-1.9; Intel Sandy Bridge (E5-2670) node with 16 cores
NVIDIA Tesla K20c GPU, Mellanox ConnectX-3 FDR HCA, CUDA 5.5, OFED 1.5.4.1 with GPU-Direct-RDMA patch
GPU-GPU Internode MPI Latency
[Chart: GPU-GPU internode large message latency (us) vs. message size (16K-4M bytes) for MVAPICH2-1.9 and MVAPICH2-1.9-GDR; GDR reduces small-message latency from 19.1 us to 6.2 us, a 67.5% improvement]
HPC Advisory Council China Workshop, Oct '13
[Charts: large message bandwidth (MB/s, 8K-2M bytes) and small message bandwidth (MB/s, 1 byte-4K) for MVAPICH2-1.9 and MVAPICH2-1.9-GDR]
49
Performance of MVAPICH2 with GPU-Direct-RDMA
Based on MVAPICH2-1.9; Intel Sandy Bridge (E5-2670) node with 16 cores
NVIDIA Tesla K20c GPU, Mellanox ConnectX-3 FDR HCA, CUDA 5.5, OFED 1.5.4.1 with GPU-Direct-RDMA patch
GPU-GPU Internode MPI Uni-Directional Bandwidth
Improvements of 2.8x and 33% in uni-directional bandwidth with GDR
HPC Advisory Council China Workshop, Oct '13
[Charts: large message bi-directional bandwidth (MB/s, 8K-2M bytes) and small message bi-directional bandwidth (MB/s, 1 byte-4K) for MVAPICH2-1.9 and MVAPICH2-1.9-GDR]
50
Performance of MVAPICH2 with GPU-Direct-RDMA
Based on MVAPICH2-1.9; Intel Sandy Bridge (E5-2670) node with 16 cores
NVIDIA Tesla K20c GPU, Mellanox ConnectX-3 FDR HCA, CUDA 5.5, OFED 1.5.4.1 with GPU-Direct-RDMA patch
GPU-GPU Internode MPI Bi-directional Bandwidth
Improvements of 54% and 3x in bi-directional bandwidth with GDR
HPC Advisory Council China Workshop, Oct '13
Getting Started with GDR Experimentation?
• Two modules are needed
– GPUDirect RDMA (GDR) driver from Mellanox
– MVAPICH2-GDR from OSU
• Send a note to [email protected]
• You will get alpha versions of GDR driver and MVAPICH2-GDR (based on MVAPICH2 2.0a release)
• You can get started with this version
• MVAPICH2 team is working on multiple enhancements (collectives, datatypes, one-sided) to exploit the advantages of GDR
• As GDR driver matures, successive versions of MVAPICH2-GDR with enhancements will be made available to the community
51 HPC Advisory Council China Workshop, Oct '13
• Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided)
– Extremely small memory footprint
• Collective communication – Multicore-aware and Hardware-multicast-based
– Offload and Non-blocking
– Topology-aware
– Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with Unified Runtime
Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
52 HPC Advisory Council China Workshop, Oct '13
MPI Applications on MIC Clusters
[Figure: spectrum from multi-core centric (Xeon) to many-core centric (Xeon Phi) execution modes for MPI programs - Host-only, Offload (/reverse Offload), Symmetric, and Coprocessor-only]
• Flexibility in launching MPI jobs on clusters with Xeon Phi
53 HPC Advisory Council China Workshop, Oct '13
Data Movement on Intel Xeon Phi Clusters
[Figure: two nodes, each with two CPUs connected by QPI, multiple MICs attached over PCIe, and an IB adapter; MPI processes run on both CPUs and MICs]
Communication paths: 1. Intra-Socket; 2. Inter-Socket; 3. Inter-Node; 4. Intra-MIC; 5. Intra-Socket MIC-MIC; 6. Inter-Socket MIC-MIC; 7. Inter-Node MIC-MIC; 8. Intra-Socket MIC-Host; 9. Inter-Socket MIC-Host; 10. Inter-Node MIC-Host; 11. Inter-Node MIC-MIC with IB adapter on remote socket; and more . . .
• Critical for runtimes to optimize data movement, hiding the complexity
• Connected as PCIe devices – Flexibility but Complexity
HPC Advisory Council China Workshop, Oct '13
55
MVAPICH2-MIC Design for Clusters with IB and MIC
• Offload Mode
• Intranode Communication
• Coprocessor-only Mode
• Symmetric Mode
• Internode Communication
• Coprocessors-only
• Symmetric Mode
• Multi-MIC Node Configurations
• Comparison with Intel’s MPI Library
HPC Advisory Council China Workshop, Oct '13
MIC-RemoteHost (Closer to HCA) Point-to-Point Communication
56 HPC Advisory Council China Workshop, Oct '13
osu_latency (small and large), osu_bw, osu_bibw
[Charts: latency (usec) and bandwidth (MB/sec) vs. message size for MV2 and MV2-MIC; peak bandwidths of 5681 MB/sec (osu_bw) and 8790 MB/sec (osu_bibw)]
Inter-Node Symmetric Mode Collective Communication
osu_scatter (small and large), osu_allgather (small and large)
8 Nodes with 8 Procs/Host and 8 Procs/MIC
[Charts: latency (usec) vs. message size for MV2 and MV2-MIC]
InterNode – Symmetric Mode - P3DFFT Performance
58 HPC Advisory Council China Workshop, Oct '13
• To explore different threading levels for host and Xeon Phi
3DStencil Comm. Kernel and P3DFFT (512x512x4K)
[Charts: P3DFFT time per step (sec) for 4+4 and 8+8 processes on Host+MIC (32 nodes), and 3DStencil communication time per step (sec) for 2+2 and 2+8 processes on Host+MIC with 8 threads/proc (8 nodes); MV2-MIC improves over MV2-INTRA-OPT by up to 50.2% and 16.2%]
59
MVAPICH2 and Intel MPI – Intra-MIC
[Charts: small message latency, large message latency, bandwidth, and bi-directional bandwidth vs. message size for Intel MPI and MVAPICH2-MIC]
HPC Advisory Council China Workshop, Oct '13
60
MVAPICH2 and Intel MPI – Intranode – Host-MIC
[Charts: small message latency, large message latency, bandwidth, and bi-directional bandwidth vs. message size for Intel MPI and MVAPICH2-MIC]
HPC Advisory Council China Workshop, Oct '13
61
MVAPICH2 and Intel MPI – Internode MIC-MIC
[Charts: small message latency, large message latency, bandwidth, and bi-directional bandwidth vs. message size for Intel MPI and MVAPICH2-MIC]
HPC Advisory Council China Workshop, Oct '13
• Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided)
– Extremely small memory footprint
• Collective communication – Multicore-aware and Hardware-multicast-based
– Offload and Non-blocking
– Topology-aware
– Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with Unified Runtime
Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
62 HPC Advisory Council China Workshop, Oct '13
MVAPICH2-X for Hybrid MPI + PGAS Applications
HPC Advisory Council China Workshop, Oct '13 63
[Figure: MPI applications, OpenSHMEM applications, UPC applications, and hybrid (MPI + PGAS) applications all running on the Unified MVAPICH2-X Runtime over InfiniBand, RoCE, or iWARP; OpenSHMEM calls, MPI calls and UPC calls share the same runtime]
• Unified communication runtime for MPI, UPC, OpenSHMEM available with
MVAPICH2-X 1.9 onwards!
– http://mvapich.cse.ohio-state.edu
• Feature Highlights
– Supports MPI(+OpenMP), OpenSHMEM, UPC, MPI(+OpenMP) + OpenSHMEM,
MPI(+OpenMP) + UPC
– MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard
compliant
– Scalable Inter-node and intra-node communication – point-to-point and collectives
HPC Advisory Council China Workshop, Oct '13 64
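To make the hybrid model concrete, a minimal sketch is shown below (added for illustration; it assumes a unified runtime such as MVAPICH2-X, where MPI ranks and OpenSHMEM PEs refer to the same processes):

#include <mpi.h>
#include <shmem.h>

int counter = 0;   /* symmetric variable, one copy per PE */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    start_pes(0);                          /* OpenSHMEM side of the same job */

    /* one-sided: atomically increment a counter on PE 0 */
    shmem_int_add(&counter, 1, 0);
    shmem_barrier_all();

    /* two-sided/collective: broadcast the result with MPI */
    MPI_Bcast(&counter, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}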
Micro-Benchmark Performance (OpenSHMEM)
[Charts: atomic operation latency (fadd, finc, cswap, swap, add, inc) for OpenSHMEM-GASNet vs. OpenSHMEM-OSU, with 41% and 16% improvements; Broadcast, Collect and Reduce latency at 256 processes for OpenSHMEM-OSU-1.9 vs. OpenSHMEM-OSU-2.0a]
HPC Advisory Council China Workshop, Oct '13 65
Hybrid MPI+OpenSHMEM Graph500 Design - Execution Time and Scalability
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models,
International Supercomputing Conference (ISC’13), June 2013
[Charts: billions of traversed edges per second (TEPS) vs. scale (26-29, weak scalability) and vs. number of processes (1024-8192, strong scalability) for MPI-Simple, MPI-CSC, MPI-CSR and Hybrid (MPI+OpenSHMEM)]
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance
Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
[Chart: Graph500 execution time (s) at 4K, 8K and 16K processes for MPI-Simple, MPI-CSC, MPI-CSR and Hybrid (MPI+OpenSHMEM); up to 13X and 7.6X improvements over MPI-Simple]
• Performance of Hybrid (MPI+OpenSHMEM) Graph500 Design
– 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
– 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple
66
Hybrid MPI+OpenSHMEM Sort Application
Execution Time
Strong Scalability
• Performance of Hybrid (MPI+OpenSHMEM) Sort Application
• Execution Time
- 4TB Input size at 4,096 cores: MPI – 2408 seconds, Hybrid: 1172 seconds
- 51% improvement over MPI-based design
• Strong Scalability (configuration: constant input size of 500GB)
- At 4,096 cores: MPI – 0.16 TB/min, Hybrid – 0.36 TB/min
- 55% improvement over MPI based design
[Charts: sort rate (TB/min) vs. number of processes (512-4,096) and execution time (seconds) vs. input size/number of processes (500GB-512 to 4TB-4K) for MPI vs. Hybrid; 51% and 55% improvements]
HPC Advisory Council China Workshop, Oct '13
Hybrid MPI+UPC NAS-FT
• Modified NAS FT UPC all-to-all pattern using MPI_Alltoall
• Truly hybrid program
• For FT (Class C, 128 processes)
• 34% improvement over UPC-GASNet
• 30% improvement over UPC-OSU
67 HPC Advisory Council China Workshop, Oct '13
[Chart: time (s) for NAS FT problem sizes B-64, C-64, B-128 and C-128 with UPC-GASNet, UPC-OSU and Hybrid-OSU; 34% improvement]
J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth
Conference on Partitioned Global Address Space Programming Model (PGAS ’10), October 2010
• Performance and memory scalability toward 500K-1M cores
– Dynamically Connected Transport (DCT) service with Connect-IB
• Enhanced optimization for GPU support and accelerators
– Extending the GPGPU support (GPU-Direct RDMA) with CUDA 6.0 and beyond
– Support for Intel MIC (Knights Landing)
• Taking advantage of Collective Offload framework
– Including support for non-blocking collectives (MPI 3.0)
• RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Power-aware collectives
• Support for MPI Tools Interface (as in MPI 3.0)
• Checkpoint-Restart and migration support with in-memory checkpointing
• Hybrid MPI+PGAS programming support with GPGPUs and Accelerators
MVAPICH2/MVAPICH2-X – Plans for Exascale
68 HPC Advisory Council China Workshop, Oct '13
• Scientific Computing
– Message Passing Interface (MPI), including MPI + OpenMP, is the
Dominant Programming Model
– Many discussions towards Partitioned Global Address Space
(PGAS)
• UPC, OpenSHMEM, CAF, etc.
– Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)
• Big Data/Enterprise/Commercial Computing
– Focuses on large data and data analysis
– Hadoop (HDFS, HBase, MapReduce) environment is gaining a lot of
momentum
– Memcached is also used for Web 2.0
69
Two Major Categories of Applications
HPC Advisory Council China Workshop, Oct '13
Big Data Applications and Analytics
• Big Data has become the one of the most
important elements of business analytics
• Provides groundbreaking opportunities for
enterprise information management and
decision making
• The amount of data is exploding; companies are capturing and digitizing more information than ever
• The rate of information growth appears to be
exceeding Moore’s Law
• 35 Zettabytes of data will be generated and
consumed by the end of this decade
70 HPC Advisory Council China Workshop, Oct '13
• The meaning of Big Data - 3 V's: Big Volume, Big Velocity, Big Variety
(Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf)
Can High-Performance Interconnects Benefit Big Data Processing?
• Beginning to draw interest from the enterprise domain
– Oracle, IBM, Google are working along these directions
• Performance in the enterprise domain remains a concern
• Where do the bottlenecks lie?
• Can these bottlenecks be alleviated with new designs?
71 HPC Advisory Council China Workshop, Oct '13
Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?
72
[Figure: three software stacks - Current Design: Application over Sockets over a 1/10 GigE network; Enhanced Designs: Application over Accelerated Sockets (Verbs / Hardware Offload) over 10 GigE or InfiniBand; Our Approach: Application over OSU Design over the Verbs Interface on 10 GigE or InfiniBand]
• Sockets not designed for high-performance
– Stream semantics often mismatch for upper layers (Memcached, HBase, Hadoop)
– Zero-copy not available for non-blocking sockets
HPC Advisory Council China Workshop, Oct '13
Big Data Processing with Hadoop
• The open-source implementation of MapReduce programming model for Big Data Analytics
• Framework includes: MapReduce, HDFS, and HBase
• Underlying Hadoop Distributed File System (HDFS) used by MapReduce and HBase
• Model scales but high amount of communication during intermediate phases
73 HPC Advisory Council China Workshop, Oct '13
• High-Performance Design of Hadoop over RDMA-enabled Interconnects
– High performance design with native InfiniBand support at the verbs-level for
HDFS, MapReduce, and RPC components
– Easily configurable for both native InfiniBand and the traditional sockets-based
support (Ethernet and InfiniBand with IPoIB)
• Current release: 0.9.1
– Based on Apache Hadoop 0.20.2
– A version based on Apache Hadoop 1.2.1 will be released soon
– Compliant with Apache Hadoop 0.20.2 APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR and FDR)
• Various multi-core platforms
• Different file systems with disks and SSDs
– http://hadoop-rdma.cse.ohio-state.edu
Hadoop-RDMA Project
74 HPC Advisory Council China Workshop, Oct '13
Reducing Communication Time with HDFS-RDMA Design
• Cluster with HDD DataNodes
– 30% improvement in communication time over IPoIB (QDR)
– 56% improvement in communication time over 10GigE
• Similar improvements are obtained for SSD DataNodes
Reduced by 30%
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda ,
High Performance RDMA-Based Design of HDFS over InfiniBand , Supercomputing (SC), Nov 2012
75 HPC Advisory Council China Workshop, Oct '13
[Chart: communication time (s) vs. file size (2GB-10GB) for 10GigE, IPoIB (QDR) and OSU-IB (QDR)]
Evaluations using TestDFSIO for HDFS-RDMA Design
• Cluster with 8 HDD DataNodes, single disk per node
– 24% improvement over IPoIB (QDR) for 20GB file size
• Cluster with 4 SSD DataNodes, single SSD per node
– 61% improvement over IPoIB (QDR) for 20GB file size
Cluster with 8 HDD Nodes, TestDFSIO with 8 maps Cluster with 4 SSD Nodes, TestDFSIO with 4 maps
Increased by 24%
Increased by 61%
76 HPC Advisory Council China Workshop, Oct '13
[Charts: total throughput (MBps) vs. file size (5-20 GB) for the HDD and SSD clusters, comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR)]
Performance of Sort using Map-Reduce-RDMA with HDD
• With 4 HDD DataNodes for 20GB sort
– 26% improvement over IPoIB (QDR)
– 38% improvement over Hadoop-A-IB
(QDR)
4 HDD DataNodes 8 HDD DataNodes
[Charts: job execution time (sec) vs. sort size for 4 HDD DataNodes (5-20 GB) and 8 HDD DataNodes (25-40 GB), comparing 1GigE, IPoIB (QDR), Hadoop-A-IB (QDR) and OSU-IB (QDR)]
• With 8 HDD DataNodes for 40GB sort
– 24% improvement over IPoIB (QDR)
– 32% improvement over Hadoop-A-IB
(QDR)
77 HPC Advisory Council China Workshop, Oct '13
M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramon, H. Wang, and D. K. Panda, High-Performance RDMA-
based Design of Hadoop MapReduce over InfiniBand, HPDIC Workshop, held in conjunction with IPDPS, May
2013.
TeraSort Performance with Map-Reduce-RDMA
• 100GB TeraSort with 12 DataNodes and 200GB TeraSort in 24 DataNodes
– 41% improvement over IPoIB (QDR)
– 7% improvement over Hadoop-A-IB (QDR)
[Chart: job execution time (sec) for a 100 GB sort on 12 nodes and a 200 GB sort on 24 nodes, comparing 1GigE, IPoIB (QDR), HadoopA-IB (QDR) and OSU-IB (QDR)]
HPC Advisory Council China Workshop, Oct '13 78
• 46% improvement in Adjacency List over IPoIB (32Gbps) for 30 GB data size
• 33% improvement in Sequence Count over IPoIB (32 Gbps) for 50 GB data
size
Evaluations using PUMA Workload
[Chart: execution time (sec) for the PUMA workloads Adjacency List (30 GB), Self Join (30 GB), Word Count (50 GB), Sequence Count (50 GB) and Inverted Index (50 GB), comparing IPoIB (QDR) and OSU-IB (QDR)]
HPC Advisory Council China Workshop, Oct '13 79
Memcached Architecture
• Distributed Caching Layer
– Allows to aggregate spare memory from multiple nodes
– General purpose
• Typically used to cache database queries, results of API calls
• Scalable model, but typical usage very network intensive
• Native IB-verbs-level Design and evaluation with
– Memcached Server: 1.4.9
– Memcached Client: (libmemcached) 0.52
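For reference, a minimal client sketch with the libmemcached C API is shown below (added for illustration; the server hostname, port and key/value strings are placeholders):

#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    memcached_return_t rc;
    memcached_st *memc = memcached_create(NULL);
    memcached_server_list_st servers =
        memcached_server_list_append(NULL, "memcached-host", 11211, &rc);
    memcached_server_push(memc, servers);            /* register the server */

    const char *key = "user:42", *value = "cached-profile-data";
    memcached_set(memc, key, strlen(key), value, strlen(value),
                  (time_t)0, (uint32_t)0);            /* store in the cache */

    size_t len; uint32_t flags;
    char *result = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (result) { printf("got: %.*s\n", (int)len, result); free(result); }

    memcached_server_list_free(servers);
    memcached_free(memc);
    return 0;
}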
80
[Figure: Internet-facing web frontend servers (Memcached clients) connected over high performance networks to Memcached servers and database servers, each with main memory, CPUs, SSDs and HDDs]
HPC Advisory Council China Workshop, Oct '13
81
• Memcached Get latency
– 4 bytes RC/UD – DDR: 6.82/7.55 us; QDR: 4.28/4.86 us
– 2K bytes RC/UD – DDR: 12.31/12.78 us; QDR: 8.19/8.46 us
• Almost factor of four improvement over 10GE (TOE) for 2K bytes on
the DDR cluster
Intel Clovertown Cluster (IB: DDR) Intel Westmere Cluster (IB: QDR)
[Charts: Memcached Get latency (us) vs. message size (1 byte-2K) on the DDR and QDR clusters, comparing 1GigE, 10GigE, SDP, IPoIB, OSU-RC-IB and OSU-UD-IB]
Memcached Get Latency (Small Message)
HPC Advisory Council China Workshop, Oct '13
HPC Advisory Council China Workshop, Oct '13 82
Memcached Get TPS (4byte)
• Memcached Get transactions per second for 4 bytes
– On IB QDR 1.4M/s (RC), 1.3 M/s (UD) for 8 clients
• Significant improvement with native IB QDR compared to SDP and IPoIB
[Charts: thousands of Memcached Get transactions per second (TPS) for 4-byte messages vs. number of clients, comparing SDP, IPoIB, OSU-RC-IB, 1GigE and OSU-UD-IB]
HPC Advisory Council China Workshop, Oct '13 83
Application Level Evaluation – Workload from Yahoo
• Real application workload: RC – 302 ms, UD – 318 ms, Hybrid – 314 ms for 1024 clients
• 12X better than IPoIB for 8 clients
• Hybrid design achieves comparable performance to that of pure RC design
[Charts: response time (ms) vs. number of clients (1-8 and 64-1024), comparing SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB]
J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K.
Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP’11
J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached
design for InfiniBand Clusters using Hybrid Transport, CCGrid’12
• Keynote Talk, Accelerating Big Data Processing with Hadoop-
RDMA
– Forum of Big Data Computing (Oct 30th, 2:00-2:30pm)
• Keynote Talk, Challenges in Benchmarking and Evaluating Big
Data Processing Middleware on Modern Clusters
– Big Data Benchmarks, Performance Optimization and Emerging Hardware
(BPOE) Workshop (Oct 31st, 2:05-2:45pm)
84
More Details on Big Data and Benchmarking will be covered ..
HPC Advisory Council China Workshop, Oct '13
• InfiniBand with its RDMA feature is gaining momentum in HPC and Big Data systems, delivering the best performance and seeing growing usage
• As the HPC community moves to Exascale, new solutions are needed in the
MPI+PGAS stacks for supporting GPUs and Accelerators
• Demonstrated how such solutions can be designed with
MVAPICH2/MVAPICH2-X and their performance benefits
• New solutions are also needed to re-design software libraries for Big Data
environments to take advantage of InfiniBand and RDMA
• Such designs will allow application scientists and engineers to take advantage
of upcoming exascale systems
85
Concluding Remarks
HPC Advisory Council China Workshop, Oct '13
HPC Advisory Council China Workshop, Oct '13
Funding Acknowledgments
Funding Support by
Equipment Support by
86
HPC Advisory Council China Workshop, Oct '13
Personnel Acknowledgments
Current Students
– N. Islam (Ph.D.)
– J. Jose (Ph.D.)
– M. Li (Ph.D.)
– S. Potluri (Ph.D.)
– R. Rajachandrasekhar (Ph.D.)
Past Students
– P. Balaji (Ph.D.)
– D. Buntinas (Ph.D.)
– S. Bhagvat (M.S.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
87
Past Research Scientist – S. Sur
Current Post-Docs
– X. Lu
– M. Luo
– K. Hamidouche
Current Programmers
– M. Arnold
– J. Perkins
Past Post-Docs – H. Wang
– X. Besseron
– H.-W. Jin
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– M. Rahman (Ph.D.)
– R. Shir (Ph.D.)
– A. Venkatesh (Ph.D.)
– J. Zhang (Ph.D.)
– E. Mancini
– S. Marcarelli
– J. Vienne
Current Senior Research Associate
– H. Subramoni
Past Programmers
– D. Bureddy
HPC Advisory Council China Workshop, Oct '13
Pointers
http://nowlab.cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
88
http://mvapich.cse.ohio-state.edu http://hadoop-rdma.cse.ohio-state.edu
• Looking for Bright and Enthusiastic Personnel to join as
– Post-Doctoral Researchers (博士后)
– PhD Students (博士研究生)
– MPI Programmer/Software Engineer (MPI软件开发工程师)
– Hadoop/Big Data Programmer/Software Engineer (大数据软件开发工程
师)
• If interested, please contact me at this conference and/or send an e-mail to [email protected] or [email protected]
89
Multiple Positions Available in My Group
HPC Advisory Council China Workshop, Oct '13
90
Thanks, Q&A?
谢谢,请提问?
HPC Advisory Council China Workshop, Oct '13