MVAPICH2 Project Update and Big Data Acceleration
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Presentation at HPC Advisory Council European Conference 2012
by
Hari Subramoni
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~subramon
• MVAPICH Project Update
– 1.8 Release
– Sample Performance Numbers
– Upcoming Features and Performance Numbers
• Big Data Acceleration
– Motivation
– Solutions
• Memcached
• HBase
• HDFS
HPC Advisory - Hamburg 2
Presentation Outline
• High Performance open-source MPI Library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2), Available since 2002
– Used by more than 1,930 organizations (HPC Centers, Industry and Universities) in
68 countries
– More than 114,000 downloads from OSU site directly
– Empowering many TOP500 clusters
• 5th ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
• 7th ranked 111,104-core cluster (Pleiades) at NASA
• 25th ranked 62,976-core cluster (Ranger) at TACC
• and many others
– Available with software stacks of many InfiniBand, High-speed Ethernet and server
vendors including Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Partner in the upcoming U.S. NSF-TACC Stampede (10-15 PFlop) System
MVAPICH2 Software
HPC Advisory - Hamburg 3
• Released on 04/30/12
• Major features (Compared to MVAPICH2 1.7)
– Support for MPI communication from NVIDIA GPU device memory (a minimal usage sketch follows this slide)
• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available in CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
– New shared memory design for enhanced intra-node small message performance
– Support suspend/resume functionality with mpirun_rsh
– Enhanced support for CPU binding with socket and numanode level granularity
– Checkpoint-Restart and run-through stabilization support in OFA-IB-Nemesis interface
– Enhancing OFA-IB-Nemesis interface to handle IB errors gracefully
– Enhanced integration with SLURM and PBS
– Reduced memory footprint of the library
– Enhanced one-sided communication design with reduced memory requirement
– Enhancements and tuned collectives (Bcast and Alltoallv)
– Support iWARP interoperability between Intel NE020 and Chelsio T4 Adapters
– The environment variable to enable RoCE has been renamed from MV2_USE_RDMAOE to MV2_USE_RoCE
– Update to hwloc v1.4
Latest MVAPICH2 1.8 – Features
HPC Advisory - Hamburg 4
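The GPU support listed above lets MPI calls take device pointers directly, so staging through host memory happens inside the library. Below is a minimal sketch of that usage pattern; it assumes a CUDA-aware MVAPICH2 build with two ranks, and the buffer size and tag are illustrative only.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t n = 1 << 20;          /* 1M doubles, illustrative size */
    double *d_buf = NULL;
    cudaMalloc((void **)&d_buf, n * sizeof(double));

    if (rank == 0) {
        /* The device pointer is passed straight to MPI_Send; a
           CUDA-aware MPI (such as MVAPICH2 1.8) moves the data
           without an explicit cudaMemcpy to a host buffer. */
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}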
HPC Advisory - Hamburg 5
One-way Latency: MPI over IB
[Figure: "Small Message Latency" and "Large Message Latency" panels - latency (us) vs. message size (bytes) for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR and MVAPICH-ConnectX3-PCIe3-FDR; annotated small-message latencies of 1.66, 1.56, 1.64, 1.82 and 0.99 us.]
DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (Sandybridge) Intel PCI Gen3 with IB switch
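One-way latency figures like these are typically measured with a two-rank ping-pong loop. The sketch below is an illustrative version of such a measurement (not the exact benchmark behind the figure); message size, warm-up and iteration counts are arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { SIZE = 4, WARMUP = 100, ITERS = 10000 };
    char buf[SIZE];

    /* Warm-up plus timed iterations of a 2-rank ping-pong. */
    double start = 0.0;
    for (int i = 0; i < WARMUP + ITERS; i++) {
        if (i == WARMUP)
            start = MPI_Wtime();
        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;

    /* One-way latency is half the average round-trip time. */
    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               elapsed * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}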
HPC Advisory - Hamburg 6
Bandwidth: MPI over IB
[Figure: "Unidirectional Bandwidth" and "Bidirectional Bandwidth" panels - bandwidth (MBytes/sec) vs. message size (bytes) for the same five MVAPICH configurations; annotated peaks of 3280, 3385, 1917, 1706 and 6343 MBytes/sec (unidirectional) and 3341, 3704, 4407, 11643 and 6521 MBytes/sec (bidirectional).]
DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (Sandybridge) Intel PCI Gen3 with IB switch
• Supports all different modes: RC/XRC, UD, UD+RC or UD+XRC
• Available in MVAPICH (since 1.1)
• Available in MVAPICH2 (since 1.7) with more advanced, tuned and adaptive
designs
• Provides flexibility to tune performance of critical applications on large-scale
clusters with various modes
Hybrid Transport Design (UD-RC/XRC)
M. Koop, T. Jones and D. K. Panda, MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand, IPDPS '08
HPC Advisory - Hamburg 7
• Experimental Platform : LLNL Hyperion Cluster (Intel Harpertown + QDR IB)
• RC: Default Parameters
• UD: MV2_HYBRID_ENABLE_THRESHOLD=1,
MV2_HYBRID_MAX_RC_CONNECTIONS=0
• Hybrid: MV2_HYBRID_ENABLE_THRESHOLD=1
HPCC Random Ring Latency with MVAPICH2 (UD/RC/Hybrid)
HPC Advisory - Hamburg 8
[Figure: HPCC Random Ring latency (us) vs. number of processes (128, 256, 512, 1024) for UD, Hybrid and RC transports; annotated improvements of 38%, 26%, 40% and 30%.]
Shared-memory Aware Collectives
[Figure: MPI_Reduce and MPI_Allreduce latency (us) vs. message size (bytes) at 4,096 cores, and MPI_Barrier latency (us) vs. number of processes (128-1024), each comparing the original point-to-point designs with the shared-memory designs.]
• MVAPICH2 Reduce/Allreduce with 4K cores on TACC Ranger (AMD Barcelona, SDR IB); toggled with MV2_USE_SHMEM_REDUCE=0/1 and MV2_USE_SHMEM_ALLREDUCE=0/1 (a minimal usage sketch follows this slide)
• MVAPICH2 Barrier with 1K Intel Westmere cores, QDR IB; toggled with MV2_USE_SHMEM_BARRIER=0/1
HPC Advisory - Hamburg 9
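The MV2_USE_SHMEM_* variables above switch the shared-memory collective designs on and off at run time, so the comparison in the figure requires no source changes. A minimal MPI_Allreduce sketch follows; the vector length is illustrative, and the mpirun_rsh command in the comment is only one way of setting the variable.

#include <mpi.h>
#include <stdio.h>

/* Run e.g.:  mpirun_rsh -np 4096 -hostfile hosts \
 *            MV2_USE_SHMEM_ALLREDUCE=1 ./allreduce_test
 * and again with MV2_USE_SHMEM_ALLREDUCE=0 to compare designs. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { N = 64 };                 /* illustrative vector length */
    double in[N], out[N];
    for (int i = 0; i < N; i++)
        in[i] = (double)rank + i;

    /* The collective call itself is unchanged; only the runtime
       parameter selects the shared-memory-aware algorithm. */
    MPI_Allreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("out[0] = %f\n", out[0]);

    MPI_Finalize();
    return 0;
}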
HPC Advisory - Hamburg 10
MVAPICH2 vs. OpenMPI (GPU Device - GPU Device, Inter-node)
[Figure: latency (us), bandwidth (MB/s) and bi-directional bandwidth (MB/s) vs. message size (1 byte - 1 MB), comparing MVAPICH2 and OpenMPI.]
MVAPICH2 1.8rc1 and OpenMPI (trunk nightly snapshot on Mar 26, 2012); Westmere with ConnectX-2 QDR HCA, NVIDIA Tesla C2075 GPU and CUDA Toolkit 4.1
HPC Advisory - Hamburg 11
Applications-Level Evaluation (Lattice Boltzmann Method (LBM))
• LBM-CUDA (Courtesy: Carlos Rosale, TACC) is a parallel distributed CUDA implementation of a Lattice Boltzmann Method for multiphase flows with large density ratios
• Node specification on Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory
• CUDA 4.1 and MVAPICH2 1.8rc1
• 16 nodes with one MPI process per node/GPU
[Figure: LBM step time (s) vs. domain size X*Y*Z (256*256*256 to 512*512*512) for MPI and MPI-GPU versions; annotated improvements of 13.7%, 12.0%, 11.8% and 9.4%.]
• Performance and Memory scalability toward 500K-1M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF …)
• Enhanced Optimization for GPU Support and Accelerators
– Extending the GPGPU support
– Support for Intel MIC
• UD-Multicast-Aware Collectives
• Taking advantage of Collective Offload framework
– Including support for non-blocking collectives (MPI 3.0)
• Extended topology-aware collectives
• Power-aware collectives
• Enhanced Multi-rail Designs
• Automatic optimization of collectives
– LiMIC2, XRC, Hybrid (UD-RC/XRC) and Multi-rail
• Support for MPI Tools Interface
• Checkpoint-Restart and migration support with incremental checkpointing
• Fault-tolerance with run-through stabilization (being discussed in MPI 3.0)
• QoS-aware I/O and checkpointing
• Automatic tuning and self-adaptation for different systems and applications
MVAPICH2 – Plans for Exascale
HPC Advisory - Hamburg 12
HPC Advisory - Hamburg 13
Scalable OpenSHMEM and Hybrid (MPI and OpenSHMEM) designs
Hybrid MPI+OpenSHMEM
• Based on OpenSHMEM Reference Implementation http://openshmem.org/
• Provides a design over GASNet
• Does not take advantage of all OFED features
• Design scalable and high-performance OpenSHMEM over OFED
• Designing a Hybrid MPI+OpenSHMEM Model
• Current Model – Separate runtimes for OpenSHMEM and MPI
• Possible deadlock if both runtimes are not progressed
• Consumes more network resources
• Our Approach – Single runtime for MPI and OpenSHMEM (a small usage sketch follows this slide)
[Diagram: a Hybrid (OpenSHMEM+MPI) application issues OpenSHMEM calls and MPI calls through OpenSHMEM and MPI interfaces into a single OSU design layered over the InfiniBand network.]
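As an illustration of the hybrid model, the sketch below mixes an OpenSHMEM one-sided put with an MPI collective in one program. It is a minimal example under the assumption that OpenSHMEM initialization (start_pes) and MPI_Init can coexist, as in the single-runtime OSU design; the array size and put pattern are arbitrary.

#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Single-runtime hybrid: both models are initialized once and
       are assumed to share the underlying communication runtime. */
    start_pes(0);
    MPI_Init(&argc, &argv);

    int me = _my_pe();
    int npes = _num_pes();

    /* Symmetric heap allocation, addressable by remote puts. */
    int *dst = (int *)shmalloc(sizeof(int));
    int src = me;

    /* One-sided put of my PE id into the next PE's symmetric buffer. */
    shmem_int_put(dst, &src, 1, (me + 1) % npes);
    shmem_barrier_all();

    /* An MPI collective over the same set of processes. */
    int sum = 0;
    MPI_Allreduce(&src, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (me == 0)
        printf("received %d, sum of PE ids = %d\n", *dst, sum);

    shfree(dst);
    MPI_Finalize();
    return 0;
}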
HPC Advisory - Hamburg 14
Micro-Benchmark Performance (OpenSHMEM)
[Figure: OpenSHMEM micro-benchmarks comparing OpenSHMEM-GASNet and OpenSHMEM-OSU - atomic operation latency (fadd, finc, cswap, swap, add, inc) in us, and Broadcast, Collect and Reduce times (256 bytes) in ms for 32-256 processes; annotated improvements range from 16% to 41%.]
HPC Advisory - Hamburg 15
Performance of Hybrid (OpenSHMEM+MPI) Applications
[Figure: Hybrid-GASNet vs. Hybrid-OSU - 2D Heat Transfer Modeling and Graph-500 execution time (s) vs. number of processes, and network resource utilization (MB) vs. number of processes.]
• Improved performance for hybrid applications
• 34% improvement for 2D Heat Transfer Modeling with 512 processes
• 45% improvement for Graph500 with 256 processes
• Our approach with a single runtime consumes 27% less network resources
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and
Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
Will be available with MVAPICH2 1.9
Many Integrated Core (MIC) Architecture
• Intel’s Many Integrated Core (MIC) architecture geared for HPC
• Critical part of their solution for exascale computing
• Knights Ferry (KNF) and Knights Corner (KNC)
• Many low-power x86 processor cores with hardware threads and
wider vector units
• Applications and libraries can run with minor or no modifications
• Using MVAPICH2 for communication within a KNF coprocessor
– out-of-the box (Default)
– shared memory designs tuned for KNF (Optimized V1)
– designs enhanced using lower level API (Optimized V2)
HPC Advisory - Hamburg 16
Point-to-Point Communication
Knights Ferry - D0 1.20GHz card - 32 cores; Alpha 9 MIC software stack with an additional pre-alpha patch
[Figure: intra-MIC latency and bandwidth, normalized to the Default design, vs. message size for Default, Optimized V1 and Optimized V2; annotated improvements of 13% and 17% (small-message latency), 70% and 80% (large-message latency), and 15%, 4X and 9.5X (bandwidth).]
HPC Advisory - Hamburg 17
Collective Communication
[Figure: intra-MIC Broadcast and Scatter latency, normalized to the Default design, vs. message size for Default, Optimized V1 and Optimized V2; annotated improvements of 17% (small-message Broadcast), 20% and 26% (large-message Broadcast), and 72% and 80% (Scatter).]
S. Potluri, K. Tomko, D. Bureddy and D. K. Panda, Intra-MIC MPI Communication using MVAPICH2: Early Experience, TACC-Intel Highly-Parallel Computing Symposium, April 2012 – Best Student Paper Award
HPC Advisory - Hamburg 18
HPC Advisory - Hamburg 19
Hardware Multicast-aware MPI_Bcast
[Figure: MPI_Bcast latency (us) for Default vs. hardware Multicast designs - 16-byte and 64 KB broadcasts vs. number of processes (32-1024), and small (1 byte - 1 KB) and large (2 KB - 512 KB) messages at 1,024 cores.]
Initial support was available in MVAPICH; enhanced support will be available in MVAPICH2 1.9 (a minimal MPI_Bcast sketch follows)
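For reference, the broadcast measured above is a plain MPI_Bcast; the multicast-aware path is selected inside the library rather than through a different API. A minimal sketch is below; the payload size is arbitrary, and the MV2_USE_MCAST parameter named in the comment is an assumption about how the feature is enabled, not something stated on this slide.

#include <mpi.h>
#include <string.h>

/* Plain MPI_Bcast; with a multicast-capable MVAPICH2 build the
 * hardware-multicast algorithm is (assumed to be) enabled through a
 * runtime parameter such as MV2_USE_MCAST=1, with no code changes. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char msg[64];                      /* illustrative payload */
    if (rank == 0)
        strcpy(msg, "broadcast payload");

    MPI_Bcast(msg, sizeof(msg), MPI_CHAR, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}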
Application Benefits with Non-Blocking Collectives based on CX-2 Collective Offload
• Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes); an overlap sketch follows this slide
• Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
• Modified Pre-Conjugate Gradient Solver with Offload-Allreduce does up to 21.8% better than the default version
[Figure: normalized HPL performance for HPL-Offload, HPL-1ring and HPL-Host as the HPL problem size (N) grows from 10% to 70% of total memory.]
• K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, S. Sur and D. K. Panda, High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
• K. Kandalla, H. Subramoni, J. Vienne, K. Tomko, S. Sur and D. K. Panda, Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
• K. Kandalla, U. Yang, J. Keasler, T. Kolev, A. Moody, H. Subramoni, K. Tomko, J. Vienne and D. K. Panda, Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12
HPC Advisory - Hamburg 20
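Offloaded non-blocking collectives pay off when communication is overlapped with independent computation, as in the modified P3DFFT above. Below is a minimal MPI-3 style sketch of that overlap pattern using MPI_Ialltoall; it is a generic illustration, not the P3DFFT code, and the buffer sizes and do_local_work placeholder are invented for the example.

#include <mpi.h>
#include <stdlib.h>

/* Placeholder for computation that is independent of the exchange. */
static void do_local_work(double *data, int n)
{
    for (int i = 0; i < n; i++)
        data[i] *= 2.0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int per_rank = 1024;                    /* illustrative */
    double *sendbuf = malloc(size * per_rank * sizeof(double));
    double *recvbuf = malloc(size * per_rank * sizeof(double));
    double *local   = malloc(per_rank * sizeof(double));
    for (int i = 0; i < size * per_rank; i++) sendbuf[i] = i;
    for (int i = 0; i < per_rank; i++) local[i] = i;

    /* Start the all-to-all, compute while it progresses (possibly
       offloaded to the HCA), then complete it. */
    MPI_Request req;
    MPI_Ialltoall(sendbuf, per_rank, MPI_DOUBLE,
                  recvbuf, per_rank, MPI_DOUBLE, MPI_COMM_WORLD, &req);

    do_local_work(local, per_rank);

    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(sendbuf); free(recvbuf); free(local);
    MPI_Finalize();
    return 0;
}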
Topology-Aware Collectives
Default (Binomial) vs. Topology-Aware Algorithms with 296 Processes
[Figure: Scatter and Gather latency (msec) vs. message size (2 KB - 512 KB) for Default and Topo-Aware algorithms; annotated improvements of 22% and 54%.]
K. Kandalla, H. Subramoni, A. Vishnu and D. K. Panda, Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather, CAC '10
Impact of Network-Topology Aware Algorithms on Broadcast Performance
[Figure: topology-/speed-aware broadcast algorithms improve broadcast performance by 14%.]
H. Subramoni, K. Kandalla, J. Vienne, S. Sur, B. Barth, K. Tomko, R. McLay, K. Schulz, and D. K. Panda, Design and Evaluation of Network Topology-/Speed-Aware Broadcast Algorithms for InfiniBand Clusters, Cluster '11
HPC Advisory - Hamburg 21
Power and Energy Savings with Power-Aware Collectives
Performance and Power Comparison: MPI_Alltoall with 64 processes on 8 nodes
CPMD Application Performance and Power Savings: annotated values of 5% and 32%
K. Kandalla, E. Mancini, Sayantan Sur and D. K. Panda, Designing Power Aware Collective Communication Algorithms for InfiniBand Clusters, ICPP '10
HPC Advisory - Hamburg 22
• MVAPICH Project Update
– 1.8 Release
– Sample Performance Numbers
– Upcoming Features and Performance Numbers
• Big Data Acceleration
– Motivation
– Solutions
• Memcached
• HBase
• HDFS
HPC Advisory - Hamburg 23
Presentation Outline
Can High-Performance Interconnects Benefit Big Data Acceleration?
• Beginning to draw interest from the enterprise domain
– Oracle, IBM, Google are working along these directions
• Performance in the enterprise domain remains a concern
• Where do the bottlenecks lie?
• Can these bottlenecks be alleviated with new designs?
HPC Advisory - Hamburg 24
Common Protocols using Open Fabrics
[Diagram: protocol options between the application and the network, through either a sockets or a verbs interface - TCP/IP over 1/10/40 GigE Ethernet, IPoIB over InfiniBand, hardware-offloaded 10/40 GigE-TOE, SDP, user-space RDMA over native IB verbs, RoCE and iWARP - each path paired with its corresponding adapter and switch.]
ISC '12 25
Can New Data Analysis and Management Systems be designed with High-Performance Networks and Protocols?
[Diagram: Current Design - Application over Sockets over a 1/10 GigE network; Enhanced Designs - Application over Accelerated Sockets (Verbs / hardware offload) over 10 GigE or InfiniBand; Our Approach - Application over the OSU Design using the Verbs interface over 10 GigE or InfiniBand.]
• Sockets not designed for high performance
– Stream semantics often mismatch for upper layers (Memcached, HBase, Hadoop)
– Zero-copy not available for non-blocking sockets
HPC Advisory - Hamburg 26
Memcached Architecture
• Distributed Caching Layer
– Allows aggregation of spare memory from multiple nodes
– General purpose
• Typically used to cache database queries, results of API calls
• Scalable model, but typical usage very network intensive
• Native IB-verbs-level design and evaluation with
– Memcached Server: 1.4.9
– Memcached Client: (libmemcached) 0.52 (a minimal client sketch follows this slide)
[Diagram: Internet-facing web frontend servers (Memcached clients) connect through high performance networks to Memcached servers (main memory, CPUs, SSD, HDD), which sit in front of database servers.]
ISC '12 27
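To make the client/server roles concrete, here is a minimal libmemcached-based set/get sketch of the kind exercised in the evaluations that follow; the server hostname, port and key/value strings are placeholders, and error handling is reduced to a return-code check.

#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_return_t rc;

    /* Connect to one memcached server (placeholder host/port). */
    memcached_st *memc = memcached_create(NULL);
    memcached_server_list_st servers =
        memcached_server_list_append(NULL, "memcached-host", 11211, &rc);
    memcached_server_push(memc, servers);
    memcached_server_list_free(servers);

    /* Set a key, then get it back -- the operation pair whose
       latency and throughput are reported on the next slides. */
    const char *key = "example-key";
    const char *val = "example-value";
    rc = memcached_set(memc, key, strlen(key), val, strlen(val),
                       (time_t)0, (uint32_t)0);
    if (rc != MEMCACHED_SUCCESS)
        fprintf(stderr, "set failed: %s\n", memcached_strerror(memc, rc));

    size_t len = 0;
    uint32_t flags = 0;
    char *got = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (rc == MEMCACHED_SUCCESS) {
        printf("got %.*s\n", (int)len, got);
        free(got);
    }

    memcached_free(memc);
    return 0;
}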
Memcached Get Latency (Small Message)
• Memcached Get latency
– 4 bytes RC/UD – DDR: 6.82/7.55 us; QDR: 4.28/4.86 us
– 2K bytes RC/UD – DDR: 12.31/12.78 us; QDR: 8.19/8.46 us
• Almost a factor of four improvement over 10GigE (TOE) for 2K bytes on the DDR cluster
[Figure: Memcached Get latency (us) vs. message size (1 byte - 2 KB) on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR), comparing SDP, IPoIB, 1GigE, 10GigE, OSU-RC-IB and OSU-UD-IB.]
HPC Advisory - Hamburg 28
HPC Advisory - Hamburg 29
Memcached Get TPS (4 bytes)
• Memcached Get transactions per second for 4 bytes
– On IB QDR: 1.4M TPS (RC), 1.3M TPS (UD) for 8 clients
• Significant improvement with native IB QDR compared to SDP and IPoIB
[Figure: thousands of Memcached Get transactions per second (TPS) vs. number of clients (1 to 1K), comparing SDP, IPoIB, OSU-RC-IB, 1GigE and OSU-UD-IB; a second panel zooms in on 4 and 8 clients.]
HPC Advisory - Hamburg 30
Application Level Evaluation – Real Application Workloads
• Real application workload – RC: 302 ms, UD: 318 ms, Hybrid: 314 ms for 1024 clients
• 12X better than IPoIB for 8 clients
• Hybrid design achieves performance comparable to the pure RC design
[Figure: execution time (ms) vs. number of clients for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB - one panel for 1-8 clients and one for 64-1024 clients.]
J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K.
Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP’11
J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Memcached Design
on High Performance RDMA Capable Interconnects, CCGrid’12
Overview of HBase Architecture
• An open source database project based on Hadoop framework for hosting very large tables
• Major components: HBaseMaster, HRegionServer and HBaseClient
• HBase and HDFS are deployed in the same cluster to get better data locality
HPC Advisory - Hamburg 31
HPC Advisory - Hamburg 32
HBase Put/Get – Detailed Analysis
• HBase 1KB Put
– Communication time: 8.9 us
– A factor of 6X improvement over 10GigE for communication time
• HBase 1KB Get
– Communication time: 8.9 us
– A factor of 6X improvement over 10GigE for communication time
[Figure: time breakdown (us) of HBase 1KB Put and Get for 1GigE, IPoIB, 10GigE and OSU-IB, split into Communication, Communication Preparation, Server Processing, Server Serialization, Client Processing and Client Serialization.]
W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, ISPASS’12
HPC Advisory - Hamburg 33
HBase Single Server-Multi-Client Results
• HBase Get latency
– 4 clients: 104.5 us; 16 clients: 296.1 us
• HBase Get throughput
– 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec
• 27% improvement in throughput for 16 clients over 10GE
[Figure: HBase Get latency (us) and throughput (Ops/sec) vs. number of clients (1-16) for IPoIB, OSU-IB, 1GigE and 10GigE.]
HPC Advisory - Hamburg 34
HBase – YCSB Read-Write Workload
• HBase Get latency (Yahoo! Cloud Serving Benchmark)
– 64 clients: 2.0 ms; 128 clients: 3.5 ms
– 42% improvement over IPoIB for 128 clients
• HBase Put latency
– 64 clients: 1.9 ms; 128 clients: 3.5 ms
– 40% improvement over IPoIB for 128 clients
[Figure: YCSB read and write latency (us) vs. number of clients (8-128) for IPoIB, OSU-IB, 1GigE and 10GigE.]
J. Huang, X. Ouyang, J. Jose, W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS’12
Hadoop Architecture
• Underlying Hadoop Distributed File System (HDFS)
• Fault-tolerance by replicating data blocks
• NameNode: stores information on data blocks
• DataNodes: store blocks and host Map-reduce computation
• JobTracker: track jobs and detect failure
• Model scales, but there is a high volume of communication during intermediate phases
ISC '12 35
HPC Advisory - Hamburg 36
RDMA-based Design for Native HDFS – Preliminary Results
• HDFS (Hadoop-0.20.2) File Write Experiment using four data nodes on IB-QDR Cluster
• HDFS File Write Time
– 2 GB – 7.78s
– For 2 GB File Size - 26% improvement over IPoIB
• In communication, for 3 GB file size: 36% improvement over IPoIB, 87% improvement over 1GigE (a minimal HDFS write sketch follows this slide)
[Figure: HDFS communication time (s) and file write time with SSD (s) vs. file size (1-5 GB) for 1GigE, IPoIB and the OSU design.]
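The file-write experiment above goes through the standard HDFS client path; a rough C sketch of such a write using the libhdfs API shipped with Hadoop is below. It assumes a reachable, default-configured HDFS deployment, and the path, buffer size and write loop are illustrative only.

#include <hdfs.h>      /* libhdfs, the C API shipped with Hadoop */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Connect to the default-configured NameNode. */
    hdfsFS fs = hdfsConnect("default", 0);
    if (!fs) {
        fprintf(stderr, "hdfsConnect failed\n");
        return 1;
    }

    /* Create a file and stream data into it; buffer size, replication
       and block size of 0 mean "use the configured defaults". */
    const char *path = "/tmp/write_test.dat";   /* illustrative path */
    hdfsFile out = hdfsOpenFile(fs, path, O_WRONLY, 0, 0, 0);
    if (!out) {
        fprintf(stderr, "hdfsOpenFile failed\n");
        return 1;
    }

    char buf[4096];
    memset(buf, 'x', sizeof(buf));
    for (int i = 0; i < 1024; i++)              /* ~4 MB in this sketch */
        hdfsWrite(fs, out, buf, sizeof(buf));

    hdfsFlush(fs, out);
    hdfsCloseFile(fs, out);
    hdfsDisconnect(fs);
    return 0;
}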
Web Pointers
http://www.cse.ohio-state.edu/~panda
http://www.cse.ohio-state.edu/~subramon
http://nowlab.cse.ohio-state.edu
MVAPICH Web Page
http://mvapich.cse.ohio-state.edu
37