Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects
Keynote Talk at ADMS 2014
by
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]   http://www.cse.ohio-state.edu/~panda
Introduction to Big Data Applications and Analytics
• Big Data has become one of the most important elements of business analytics
• Provides groundbreaking opportunities for enterprise information management and decision making
• The amount of data is exploding; companies are capturing and digitizing more information than ever
• The rate of information growth appears to be exceeding Moore’s Law
ADMS '14 2
4V Characteristics of Big Data
• Commonly accepted 3V's of Big Data – Volume, Velocity, Variety
  (Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf)
• 4/5V's of Big Data – 3V + Veracity, Value
ADMS '14 3
Velocity of Big Data – How Much Data Is Generated Every Minute on the Internet?
The global Internet population grew 6.59% from 2010 to 2011 and now represents 2.1 Billion People.
http://www.domo.com/blog/2012/06/how-much-data-is-created-every-minute
ADMS '14 4
Data Management and Processing in Modern Datacenters
• Substantial impact on designing and utilizing modern data management and processing systems in multiple tiers
  – Front-end data accessing and serving (Online)
    • Memcached + DB (e.g., MySQL), HBase
  – Back-end data analytics (Offline)
    • HDFS, MapReduce, Spark
ADMS '14 5
Overview of Web 2.0 Architecture and Memcached
• Three-layer architecture of Web 2.0 – Web Servers, Memcached Servers, Database Servers
• Memcached is a core component of the Web 2.0 architecture
ADMS '14 6
Memcached Architecture
• Distributed Caching Layer – Allows aggregating spare memory from multiple nodes – General purpose
• Typically used to cache database queries and results of API calls
• Scalable model, but typical usage is very network intensive
ADMS '14 7
HBase Overview
• Apache Hadoop Database (http://hbase.apache.org/)
• Semi-structured database, which is highly scalable
• Integral part of many datacenter applications – e.g., Facebook Social Inbox
• Developed in Java for platform-independence and portability
• Uses sockets for communication!
(HBase Architecture diagram)
ADMS '14 8
Hadoop Distributed File System (HDFS)
• Primary storage of Hadoop; highly reliable and fault-tolerant
• Adopted by many reputed organizations – e.g., Facebook, Yahoo!
• NameNode: stores the file system namespace
• DataNode: stores data blocks
• Developed in Java for platform-independence and portability
• Uses sockets for communication!
(HDFS Architecture diagram: the Client talks to the NameNode and DataNodes over RPC)
ADMS '14 9
Data Movement in Hadoop MapReduce
(Diagram: disk operations and bulk data transfer in the MapReduce data path)
• Map and Reduce Tasks carry out the total job execution
  – Map tasks read from HDFS, operate on the data, and write the intermediate data to local disk
  – Reduce tasks get these data by shuffle from the TaskTrackers, operate on them, and write to HDFS
• Communication in the shuffle phase uses HTTP over Java sockets
ADMS '14 10
Spark Architecture Overview
• An in-memory data-processing framework
  – Iterative machine learning jobs
  – Interactive data analytics
  – Scala-based implementation
  – Standalone, YARN, Mesos
• Scalable and communication intensive
  – Wide dependencies between Resilient Distributed Datasets (RDDs)
  – MapReduce-like shuffle operations to repartition RDDs
  – Sockets-based communication
http://spark.apache.org
ADMS '14 11
Presentation Outline
• Overview of Modern Clusters, Interconnects and Protocols
• Challenges for Accelerating Data Management and Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Memcached and HBase
  – RDMA-based Memcached
  – Case study with OLTP
  – SSD-assisted hybrid Memcached
  – RDMA-based HBase
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – RDMA-based MapReduce on HPC Clusters with Lustre
• Ongoing and Future Activities
• Conclusion and Q&A
ADMS '14 12
High-End Computing (HEC): PetaFlop to ExaFlop
• 100 PFlops expected in 2015
• 1 EFlops in 2018?
• Expected to have an ExaFlop system in 2020-2022!
ADMS '14 13
Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)
(Chart: number of clusters and percentage of clusters in the Top 500 over time)
ADMS '14 14
High End Computing (HEC)
• High End Computing (HEC) grows dramatically
  – High Performance Computing
  – Big Data Computing
• Technology Advancement
  – Multi-core/many-core technologies and accelerators
  – Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
  – Solid State Drives (SSDs) and Non-Volatile Random-Access Memory (NVRAM)
  – Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
(Pictured systems: Tianhe-2 (#1), Titan (#2), Stampede (#6), Tianhe-1A (#10))
ADMS '14 15
Overview of High Performance Interconnects
• High-Performance Computing (HPC) has adopted advanced interconnects and protocols
  – InfiniBand
  – 10 Gigabit Ethernet/iWARP
  – RDMA over Converged Enhanced Ethernet (RoCE)
• Very good performance
  – Low latency (a few microseconds)
  – High bandwidth (100 Gb/s with dual FDR InfiniBand)
  – Low CPU overhead (5-10%)
• The OpenFabrics software stack (www.openfabrics.org) with IB, iWARP and RoCE interfaces is driving HPC systems
• Many such systems in the Top500 list
ADMS '14 16
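As a small illustration of the verbs interface exposed by the OpenFabrics stack, the hedged sketch below (plain libibverbs, not part of the software discussed in this talk) enumerates the RDMA-capable adapters visible to a process; MVAPICH2 and the RDMA-based Big Data designs discussed later build on this same API.

```c
/* Minimal sketch (assumption: standard libibverbs API) that enumerates the
 * RDMA-capable devices (InfiniBand, RoCE or iWARP adapters) managed by the
 * OpenFabrics verbs interface. Link with -libverbs. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);  /* all OFED-managed adapters */
    if (!devs) { perror("ibv_get_device_list"); return 1; }

    for (int i = 0; i < num; i++)
        printf("device %d: %s\n", i, ibv_get_device_name(devs[i]));

    ibv_free_device_list(devs);
    return 0;
}
```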
All interconnects and protocols in the OpenFabrics Stack (Application / Middleware at the top; Interface, Protocol, Adapter and Switch below):
• 1/10/40 GigE: Sockets -> TCP/IP (Ethernet driver, kernel space) -> Ethernet Adapter -> Ethernet Switch
• 10/40 GigE-TOE: Sockets -> TCP/IP with hardware offload -> Ethernet Adapter -> Ethernet Switch
• IPoIB: Sockets -> IPoIB (kernel space) -> InfiniBand Adapter -> InfiniBand Switch
• SDP: Sockets -> SDP -> InfiniBand Adapter -> InfiniBand Switch
• RSockets: Sockets -> RSockets (user space) -> InfiniBand Adapter -> InfiniBand Switch
• iWARP: Verbs -> iWARP (TCP/IP, user space) -> iWARP Adapter -> Ethernet Switch
• RoCE: Verbs -> RDMA (user space) -> RoCE Adapter -> Ethernet Switch
• IB Native: Verbs -> RDMA (user space) -> InfiniBand Adapter -> InfiniBand Switch
ADMS '14 17
Trends of Networking Technologies in TOP500 Systems
• Percentage share of InfiniBand is steadily increasing
(Chart: Interconnect Family – Systems Share)
ADMS '14 18
Large-scale InfiniBand Installations
• 223 IB Clusters (44.3%) in the June 2014 Top500 list (http://www.top500.org)
• Installations in the Top 50 (25 systems):
  – 519,640 cores (Stampede) at TACC (7th)
  – 62,640 cores (HPC2) in Italy (11th)
  – 147,456 cores (SuperMUC) in Germany (12th)
  – 76,032 cores (Tsubame 2.5) at Japan/GSIC (13th)
  – 194,616 cores (Cascade) at PNNL (15th)
  – 110,400 cores (Pangea) at France/Total (16th)
  – 96,192 cores (Pleiades) at NASA/Ames (21st)
  – 73,584 cores (Spirit) at USA/Air Force (24th)
  – 77,184 cores (Curie thin nodes) at France/CEA (26th)
  – 65,320 cores (iDataPlex DX360M4) at Germany/Max-Planck (27th)
  – 120,640 cores (Nebulae) at China/NSCS (28th)
  – 72,288 cores (Yellowstone) at NCAR (29th)
  – 70,560 cores (Helios) at Japan/IFERC (30th)
  – 138,368 cores (Tera-100) at France/CEA (35th)
  – 222,072 cores (QUARTETTO) in Japan (37th)
  – 53,504 cores (PRIMERGY) in Australia (38th)
  – 77,520 cores (Conte) at Purdue University (39th)
  – 44,520 cores (Spruce A) at AWE in UK (40th)
  – 48,896 cores (MareNostrum) at Spain/BSC (41st)
  – and many more!
ADMS '14 19
Open Standard InfiniBand Networking Technology
• Introduced in Oct 2000
• High Performance Data Transfer
  – Interprocessor communication and I/O
  – Low latency (<1.0 microsec), high bandwidth (up to 12.5 GigaBytes/sec -> 100 Gbps), and low CPU utilization (5-10%)
• Flexibility for LAN and WAN communication
• Multiple Transport Services
  – Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram
  – Provides flexibility to develop upper layers
• Multiple Operations
  – Send/Recv
  – RDMA Read/Write
  – Atomic Operations (very unique)
    • Enable high performance and scalable implementations of distributed locks, semaphores, and collective communication operations
• Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, ...
ADMS '14 20
Communication in the Channel Semantics (Send/Receive Model)
(Diagram: two hosts, each with a Processor, Memory with memory segments, a Queue Pair (QP) with Send and Recv queues, a Completion Queue (CQ), and an InfiniBand device; a hardware ACK flows between the devices)
• A Send WQE contains information about the send buffer (multiple non-contiguous segments)
• A Receive WQE contains information on the receive buffer (multiple non-contiguous segments); incoming messages have to be matched to a receive WQE to know where to place the data
• The processor is involved only to:
  1. Post receive WQEs
  2. Post send WQEs
  3. Pull completed CQEs from the CQ
ADMS '14 21
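To make the three processor-side steps above concrete, here is a minimal libibverbs sketch (an illustration, not the exact code used in the designs presented in this talk). It assumes a connected RC queue pair, completion queue and registered memory region have already been set up; only the WQE posting and CQE polling that the slide describes are shown.

```c
/* Sketch of the processor-side steps in the send/receive model, using the
 * standard libibverbs API. Assumes a connected RC queue pair (qp), a
 * completion queue (cq) and a registered memory region (mr) already exist
 * (connection management is omitted for brevity). */
#include <stdint.h>
#include <stddef.h>
#include <infiniband/verbs.h>

/* 1. Post a receive WQE describing where an incoming send may be placed. */
static int post_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = (uint32_t)len, .lkey = mr->lkey };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad = NULL;
    return ibv_post_recv(qp, &wr, &bad);
}

/* 2. Post a send WQE describing the local send buffer. */
static int post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = (uint32_t)len, .lkey = mr->lkey };
    struct ibv_send_wr wr = {
        .wr_id = 2, .sg_list = &sge, .num_sge = 1,
        .opcode = IBV_WR_SEND, .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad = NULL;
    return ibv_post_send(qp, &wr, &bad);
}

/* 3. Pull completed CQEs from the CQ; the HCA does the actual data movement. */
static int wait_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;                                   /* busy-poll for one completion */
    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}
```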
Communication in the Memory Semantics (RDMA Model)
(Diagram: two hosts, each with a Processor, Memory with memory segments, a QP with Send and Recv queues, a CQ, and an InfiniBand device; a hardware ACK flows between the devices)
• The Send WQE contains information about the send buffer (multiple segments) and the receive buffer (single segment)
• The initiator processor is involved only to:
  1. Post the send WQE
  2. Pull the completed CQE from the send CQ
• No involvement from the target processor
ADMS '14 22
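A corresponding sketch for the memory semantics: a one-sided RDMA Write posted with libibverbs. The remote virtual address and rkey are assumed to have been exchanged during connection setup; as the slide notes, the target processor takes no part in the transfer.

```c
/* Sketch of a one-sided RDMA Write using libibverbs. Assumes a connected RC
 * queue pair, a registered local buffer, and that the target's virtual
 * address (remote_addr) and rkey were exchanged out of band. */
#include <stdint.h>
#include <stddef.h>
#include <infiniband/verbs.h>

static int rdma_write(struct ibv_qp *qp, struct ibv_cq *cq,
                      struct ibv_mr *mr, void *local_buf, size_t len,
                      uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)local_buf, .length = (uint32_t)len, .lkey = mr->lkey };
    struct ibv_send_wr wr = {
        .wr_id = 42, .sg_list = &sge, .num_sge = 1,
        .opcode = IBV_WR_RDMA_WRITE,            /* one-sided write into remote memory */
        .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad = NULL;
    struct ibv_wc wc;

    wr.wr.rdma.remote_addr = remote_addr;       /* where to write on the target */
    wr.wr.rdma.rkey = rkey;                     /* permission key for the target MR */

    if (ibv_post_send(qp, &wr, &bad))           /* 1. post the send WQE */
        return -1;
    while (ibv_poll_cq(cq, 1, &wc) == 0)        /* 2. pull the completed CQE */
        ;
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```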
Presentation Outline
• Overview of Modern Clusters, Interconnects and Protocols
• Challenges for Accelerating Data Management and Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Memcached and HBase
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – RDMA-based MapReduce on HPC Clusters with Lustre
• Ongoing and Future Activities
• Conclusion and Q&A
ADMS '14 23
Wide Adoption of RDMA Technology
• Message Passing Interface (MPI) for HPC
• Parallel File Systems
  – Lustre
  – GPFS
• Delivering excellent performance (latency, bandwidth and CPU utilization)
• Delivering excellent scalability
ADMS '14 24
MVAPICH2/MVAPICH2-X Software
• High Performance open-source MPI Library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC)
  – Used by more than 2,200 organizations in 73 countries
  – More than 221,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 7th-ranked 519,640-core cluster (Stampede) at TACC
    • 13th-ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
    • 23rd-ranked 96,192-core cluster (Pleiades) at NASA
    • many others ...
  – Available with software stacks of many IB, HSE, and server vendors including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede system
ADMS '14 25
One-way Latency: MPI over IB with MVAPICH2
(Charts: small-message and large-message latency vs. message size for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR and Ivy-ConnectIB-DualFDR; labeled small-message latencies: 1.66, 1.56, 1.64, 1.82, 0.99, 1.09 and 1.12 us)
• Platforms: DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ADMS '14 26
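The latencies above are obtained with micro-benchmarks; the sketch below is a simplified stand-in (not the actual OSU osu_latency source) showing how such a one-way latency number is typically measured with an MPI library such as MVAPICH2: half of the averaged ping-pong round-trip time.

```c
/* Minimal MPI ping-pong sketch illustrating one-way small-message latency
 * measurement. Assumption: generic MPI code, not the OSU benchmark itself.
 * Compile with mpicc and run with 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, iters = 10000, size = 4;      /* 4-byte messages */
    char buf[4] = {0};
    double start, end;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    start = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    end = MPI_Wtime();

    if (rank == 0)  /* one-way latency = half of the round-trip time */
        printf("avg one-way latency: %.2f us\n",
               (end - start) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    return 0;
}
```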
Bandwidth: MPI over IB with MVAPICH2
(Charts: unidirectional and bidirectional bandwidth vs. message size for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR and Ivy-ConnectIB-DualFDR)
• Labeled peak unidirectional bandwidths: 3280, 3385, 1917, 1706, 6343, 12485 and 12810 MBytes/sec
• Labeled peak bidirectional bandwidths: 3341, 3704, 4407, 11643, 6521, 21025 and 24727 MBytes/sec
• Platforms: DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ADMS '14 27
Can High-Performance Interconnects Benefit Data Management and Processing?
• Most of the current Big Data systems use an Ethernet infrastructure with sockets
• Concerns for performance and scalability
• Usage of high-performance networks is beginning to draw interest – Oracle, IBM, Google and Intel are working along these directions
• What are the challenges?
• Where do the bottlenecks lie?
• Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)?
• Can HPC clusters with high-performance networks be used for Big Data applications using Hadoop and Memcached?
ADMS '14 28
Designing Communication and I/O Libraries for Big Data Systems: Challenges
(Diagram: Applications on top of Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached); underneath, a Communication and I/O Library with components for Point-to-Point Communication, Threaded Models and Synchronization, Virtualization, I/O and File Systems, QoS, Fault-Tolerance and Benchmarks; built on Programming Models (Sockets) and possibly Other Protocols; running over Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs), Commodity Computing System Architectures (multi-/many-core architectures and accelerators), and Storage Technologies (HDD and SSD))
ADMS '14 29
Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?
• Current Design: Application -> Sockets -> 1/10 GigE Network
• Enhanced Designs: Application -> Accelerated Sockets (Verbs / Hardware Offload) -> 10 GigE or InfiniBand
• Our Approach: Application -> OSU Design (Verbs Interface) -> 10 GigE or InfiniBand
• Sockets are not designed for high performance
  – Stream semantics often mismatch the upper layers (Memcached, HBase, Hadoop)
  – Zero-copy not available for non-blocking sockets
ADMS '14 30
Presentation Outline
• Overview of Modern Clusters, Interconnects and Protocols
• Challenges for Accelerating Data Management and Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Memcached and HBase
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – RDMA-based MapReduce on HPC Clusters with Lustre
• Ongoing and Future Activities
• Conclusion and Q&A
ADMS '14 31
The High-Performance Big Data (HiBD) Project
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• RDMA for Apache HBase and Spark
• OSU HiBD-Benchmarks (OHB)
• http://hibd.cse.ohio-state.edu
ADMS '14 32
RDMA for Memcached Distribution
• High-Performance Design of Memcached over RDMA-enabled Interconnects
  – High-performance design with native InfiniBand and RoCE support at the verbs level for the Memcached and libMemcached components
  – Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
• Current release: 0.9.1
  – Based on Memcached 1.4.20 and libMemcached 1.0.18
  – Compliant with Memcached APIs and applications
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
  – http://hibd.cse.ohio-state.edu
ADMS '14 33
RDMA for Apache Hadoop 1.x/2.x Distributions
• High-Performance Design of Hadoop over RDMA-enabled Interconnects
  – High-performance design with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components
  – Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
• Current releases: 0.9.9 / 0.9.1
  – Based on Apache Hadoop 1.2.1 / 2.4.1
  – Compliant with Apache Hadoop 1.2.1 / 2.4.1 APIs and applications
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • Different file systems with disks and SSDs
  – http://hibd.cse.ohio-state.edu
ADMS '14 34
OSU HiBD Micro-Benchmark (OHB) Suite - Memcached
• Released in OHB 0.7.1 (ohb_memlat)
• Evaluates the performance of stand-alone Memcached
• Three different micro-benchmarks
  – SET Micro-benchmark: micro-benchmark for memcached set operations
  – GET Micro-benchmark: micro-benchmark for memcached get operations
  – MIX Micro-benchmark: micro-benchmark for a mix of memcached set/get operations (read:write ratio is 90:10)
• Calculates average latency of Memcached operations
• Can measure throughput in transactions per second
ADMS '14 35
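The sketch below illustrates the general structure of such a SET/GET latency micro-benchmark using the standard libmemcached C API; it is an illustration only, not the actual ohb_memlat source, and the server address and iteration count are assumptions.

```c
/* Hedged sketch of a SET/GET latency micro-benchmark in the spirit of
 * ohb_memlat (names and parameters are illustrative).
 * Build: gcc memlat_sketch.c -lmemcached */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <libmemcached/memcached.h>

#define ITERS 10000

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void)
{
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "localhost", 11211);   /* assumed server location */

    const char key[] = "ohb_key";
    char value[4] = "abc";                             /* 4-byte payload */
    size_t vlen; uint32_t flags; memcached_return_t rc;

    double start = now_us();
    for (int i = 0; i < ITERS; i++) {                  /* SET latency loop */
        rc = memcached_set(memc, key, strlen(key), value, sizeof(value), 0, 0);
        if (rc != MEMCACHED_SUCCESS) { fprintf(stderr, "set failed\n"); return 1; }
    }
    printf("avg SET latency: %.2f us\n", (now_us() - start) / ITERS);

    start = now_us();
    for (int i = 0; i < ITERS; i++) {                  /* GET latency loop */
        char *res = memcached_get(memc, key, strlen(key), &vlen, &flags, &rc);
        if (res) free(res);
    }
    printf("avg GET latency: %.2f us\n", (now_us() - start) / ITERS);

    memcached_free(memc);
    return 0;
}
```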
Presentation Outline
• Overview of Modern Clusters, Interconnects and Protocols
• Challenges for Accelerating Data Management and Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Memcached and HBase
  – RDMA-based Memcached
  – Case study with OLTP
  – SSD-assisted hybrid Memcached
  – RDMA-based HBase
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – RDMA-based MapReduce on HPC Clusters with Lustre
• Ongoing and Future Activities
• Conclusion and Q&A
ADMS '14 36
Design Challenges of RDMA-based Memcached
• Can Memcached be re-designed from the ground up to utilize RDMA capable networks?
• How efficiently can we utilize RDMA for Memcached operations?
• Can we leverage the best features of RC and UD to deliver both high performance and scalability to Memcached?
• Memcached applications need not be modified; the middleware uses verbs interface if available
ADMS '14 37
Memcached-RDMA Design
(Diagram: Sockets Clients and RDMA Clients connect to a Master Thread, which hands them off to Sockets Worker Threads or Verbs Worker Threads; all worker threads operate on shared data – memory slabs and items)
• Server and client perform a negotiation protocol
  – The master thread assigns clients to the appropriate worker thread
• Once a client is assigned a verbs worker thread, it can communicate directly and is "bound" to that thread
• All other Memcached data structures are shared among RDMA and sockets worker threads
• Native IB-verbs-level design and evaluation with
  – Server: Memcached (http://memcached.org)
  – Client: libmemcached (http://libmemcached.org)
  – Different networks and protocols: 10GigE, IPoIB, native IB (RC, UD)
ADMS '14 38
Memcached Get Latency (Small Messages)
(Charts: Memcached Get latency vs. message size (1 byte to 2K) on an Intel Clovertown cluster (IB: DDR) and an Intel Westmere cluster (IB: QDR), comparing IPoIB, 10GigE, OSU-RC-IB and OSU-UD-IB)
• Memcached Get latency
  – 4 bytes RC/UD – DDR: 6.82/7.55 us; QDR: 4.28/4.86 us
  – 2K bytes RC/UD – DDR: 12.31/12.78 us; QDR: 8.19/8.46 us
• Almost a factor of four improvement over 10GigE (TOE) for 2K bytes on the DDR cluster
ADMS '14 39
Memcached Get TPS (4 bytes)
(Charts: thousands of transactions per second vs. number of clients, comparing IPoIB (QDR), OSU-RC-IB (QDR) and OSU-UD-IB (QDR))
• Memcached Get transactions per second for 4 bytes
  – On IB QDR: 1.4 M TPS (RC) and 1.3 M TPS (UD) for 8 clients
• Significant improvement with native IB QDR compared to IPoIB
ADMS '14 40
Memcached Performance (FDR Interconnect)
(Charts: Memcached GET latency vs. message size and throughput vs. number of clients, comparing OSU-IB (FDR) and IPoIB (FDR); experiments on TACC Stampede (Intel SandyBridge cluster, IB: FDR))
• Memcached Get latency
  – 4 bytes: OSU-IB 2.84 us; IPoIB 75.53 us
  – 2K bytes: OSU-IB 4.49 us; IPoIB 123.42 us
  – Latency reduced by nearly 20X
• Memcached throughput (4 bytes)
  – 4080 clients: OSU-IB 556 Kops/sec; IPoIB 233 Kops/sec
  – Nearly 2X improvement in throughput
ADMS '14 41
Application Level Evaluation – Olio Benchmark
(Charts: execution time vs. number of clients, comparing IPoIB (QDR), OSU-RC-IB (QDR), OSU-UD-IB (QDR) and OSU-Hybrid-IB (QDR))
• Olio Benchmark – RC: 1.6 sec, UD: 1.9 sec, Hybrid: 1.7 sec for 1024 clients
• 4X better than IPoIB for 8 clients
• The hybrid design achieves performance comparable to the pure RC design
ADMS '14 42
Application Level Evaluation – Real Application Workloads
(Charts: execution time vs. number of clients, comparing IPoIB (QDR), OSU-RC-IB (QDR), OSU-UD-IB (QDR) and OSU-Hybrid-IB (QDR))
• Real application workload – RC: 302 ms, UD: 318 ms, Hybrid: 314 ms for 1024 clients
• 12X better than IPoIB for 8 clients
• The hybrid design achieves performance comparable to the pure RC design
J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP '11
J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached Design for InfiniBand Clusters Using Hybrid Transport, CCGrid '12
ADMS '14 43
Memcached-RDMA – Case Studies with OLTP Workloads
• Design of scalable architectures using Memcached and MySQL
• Case study to evaluate the benefits of using RDMA-Memcached for traditional OLTP workloads
• Illustration with a Read-Cache-Read access pattern using a modified mysqlslap load-testing tool
• Up to 40 nodes, 10 concurrent threads per node, 20 queries per client
• Memcached-RDMA can
  – improve query latency by up to 66% over IPoIB (32Gbps)
  – improve throughput by up to 69% over IPoIB (32Gbps)
ADMS '14 45
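For illustration, a hedged sketch of the Read-Cache-Read pattern using the MySQL C API together with libmemcached is shown below; the actual study used a modified mysqlslap, and the table schema, key names and connection parameters here are purely illustrative.

```c
/* Illustrative Read-Cache-Read sketch: look up Memcached first, fall back to
 * MySQL on a miss, then populate the cache. Not the mysqlslap code used in
 * the study. Build: gcc read_cache_read.c -lmemcached -lmysqlclient */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mysql/mysql.h>
#include <libmemcached/memcached.h>

static char *cached_lookup(memcached_st *memc, MYSQL *db, const char *key)
{
    size_t len; uint32_t flags; memcached_return_t rc;

    char *val = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (val)                                    /* cache hit: served without touching the DB */
        return val;

    char query[256];                            /* cache miss: go to the back-end database */
    snprintf(query, sizeof(query),
             "SELECT value FROM kv_table WHERE k='%s'", key);   /* illustrative schema */
    if (mysql_query(db, query))
        return NULL;

    MYSQL_RES *res = mysql_store_result(db);
    MYSQL_ROW row = res ? mysql_fetch_row(res) : NULL;
    if (row && row[0]) {
        val = strdup(row[0]);
        /* populate the cache so subsequent reads are served from Memcached */
        memcached_set(memc, key, strlen(key), val, strlen(val), 0, 0);
    }
    if (res) mysql_free_result(res);
    return val;
}

int main(void)
{
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "localhost", 11211);   /* assumed cache server */

    MYSQL *db = mysql_init(NULL);                      /* assumed DB credentials */
    if (!mysql_real_connect(db, "localhost", "user", "pass", "testdb", 0, NULL, 0))
        return 1;

    char *v = cached_lookup(memc, db, "some_key");
    if (v) { printf("value: %s\n", v); free(v); }

    mysql_close(db);
    memcached_free(memc);
    return 0;
}
```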
Evaluation with Traditional OLTP Workloads - mysqlslap
(Charts: query latency (sec) and throughput (q/ms) vs. number of clients (48-400), comparing Memcached-IPoIB (32Gbps) and Memcached-RDMA (32Gbps))
ADMS '14 46
Traditional Memcached Deployments
• Existing Memcached deployment
  – Low hit ratio at the Memcached server due to limited main memory size
  – Frequent access to the back-end DB
• Mmap() SSD into virtual memory
  – Higher hit ratio due to SSD-mapped virtual memory
  – Significant overhead in virtual memory management
ADMS '14 47
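The "mmap() SSD into virtual memory" approach can be sketched with a few lines of standard POSIX code (below); the page faults and kernel-managed writeback it incurs are exactly the virtual-memory overhead that motivates the SSD-assisted hybrid design on the next slide. The file path and sizes are illustrative.

```c
/* Minimal sketch of mapping an SSD-backed file into the process address
 * space so the cache can exceed RAM; the kernel's virtual memory system
 * pages data in and out. Path and size are assumptions. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    size_t size = 1UL << 30;                       /* 1 GB region backed by the SSD */
    int fd = open("/mnt/ssd/memcached_ext.dat", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, size) != 0) { perror("open/ftruncate"); return 1; }

    /* Loads/stores to 'region' may trigger page faults and kernel-managed
     * writeback, which adds latency on every miss. */
    char *region = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    memcpy(region + 4096, "cached-object", 14);    /* store an item beyond RAM limits */
    printf("read back: %s\n", region + 4096);

    munmap(region, size);
    close(fd);
    return 0;
}
```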
SSD-Assisted Hybrid Memory for Memcached
• SSD-assisted hybrid memory to expand the effective memory size
• A user-level library that bypasses the kernel-based virtual-memory management overhead

Get latency (us):
                                   IB Verbs   IPoIB   10GigE   1GigE
  MySQL                            N/A        10763   10724    11220
  Memcached (in RAM)               10         60      40       150
  Memcached (SSD virtual memory)   347        387     362      455

SSD basic performance (us, Fusion-io ioDrive): random read 68, random write 70
ADMS '14 48
Evaluation on SSD-Assisted Hybrid Memory for Memcached - Latency
(Charts: Memcached Get and Put latency; Get improves from 347 us to 93 us (3.6X), Put from 48 us to 16 us (3X))
• Memcached with InfiniBand DDR transport (16Gbps)
• 30GB of data on SSD, 256 MB read/write buffer
• Get/Put of a random object
X. Ouyang, N. S. Islam, R. Rajachandrasekar, J. Jose, M. Luo, H. Wang and D. K. Panda, SSD-Assisted Hybrid Memory to Accelerate Memcached over High Performance Networks, ICPP '12
ADMS '14 49
Evaluation on SSD-Assisted Hybrid Memory for Memcached - Throughput
(Chart: transactions/sec (thousands) for 1, 2, 4 and 8 client processes, comparing Native Hmem (upper bound), Memcached+IB+Hmem and Memcached+IB+VMS; up to 5.3X improvement)
• Memcached with InfiniBand DDR transport (16Gbps)
• 30GB of data on SSD, object size = 1KB, 256 MB read/write buffer
• 1, 2, 4 and 8 client processes perform random get()
ADMS '14
Motivation – Detailed Analysis on HBase Put/Get
(Charts: time breakdown (communication, communication preparation, server processing, server serialization, client processing, client serialization) for 1KB HBase Put and Get with IPoIB (DDR), 10GigE and OSU-IB (DDR))
• HBase 1KB Put – communication time: 8.9 us – a factor of 6X improvement over 10GigE for communication time
• HBase 1KB Get – communication time: 8.9 us – a factor of 6X improvement over 10GigE for communication time
M. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, Poster Paper, ISPASS '12
ADMS '14 50
HBase-RDMA Design Overview
(Diagram: Applications -> HBase -> either the Java Socket Interface over 1/10 GigE / IPoIB networks, or the Java Native Interface (JNI) -> OSU-IB Design -> IB Verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE, ...))
• The JNI layer bridges Java-based HBase with the communication library written in native code
• Enables high-performance RDMA communication, while supporting the traditional socket interface
ADMS '14 51
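A hedged sketch of how such a JNI bridge can look is shown below; the class, method and helper names are assumptions, not the actual OSU-IB sources. The Java side would declare a native method such as `private native int rdmaSend(ByteBuffer buf, int len);`, and the C side obtains the direct buffer's address so the native verbs-based library can transmit it without copying through the Java socket stack.

```c
/* Illustrative JNI bridge between Java-based middleware and a native
 * communication library (all names are assumptions). Build as a shared
 * library, e.g.: gcc -shared -fPIC -I$JAVA_HOME/include ... */
#include <jni.h>
#include <stddef.h>

/* Stub standing in for the verbs-based transport (e.g., a wrapper around
 * ibv_post_send); here it just reports the number of bytes "sent". */
static int osu_ib_send(void *buf, size_t len) { (void)buf; return (int)len; }

JNIEXPORT jint JNICALL
Java_org_example_OsuIbTransport_rdmaSend(JNIEnv *env, jobject self,
                                         jobject byte_buffer, jint len)
{
    /* A direct ByteBuffer lets the native layer access the data zero-copy. */
    void *buf = (*env)->GetDirectBufferAddress(env, byte_buffer);
    if (buf == NULL)
        return -1;                      /* not a direct buffer */
    return (jint)osu_ib_send(buf, (size_t)len);
}
```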
HBase-RDMA Design - Communication Flow
(Diagram: on the HBase client, the Application passes through the HBase Client and a Network Selector to either Sockets over a 1/10G network or the JNI interface and OSU-IB design over IB Verbs; on the HBase server, IB Reader, Responder, Handler, Helper and Reader threads service the Call Queue and Response Queue)
• RDMA design components – Network Selector, IB Reader, Helper threads, JNI Adaptive Interface
ADMS '14 52
HBase Micro-benchmark (Single-Server-Multi-Client) Results
(Charts: HBase Get latency and throughput vs. number of clients (1-16), comparing 10GigE, IPoIB (DDR) and OSU-IB (DDR))
• HBase Get latency
  – 4 clients: 104.5 us; 16 clients: 296.1 us
• HBase Get throughput
  – 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec
• 27% improvement in throughput for 16 clients over 10GigE
J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS '12
ADMS '14 53
HBase – YCSB Read-Write Workload
(Charts: read and write latency vs. number of clients (8-128), comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR))
• HBase Get latency (Yahoo! Cloud Serving Benchmark)
  – 64 clients: 2.0 ms; 128 clients: 3.5 ms
  – 42% improvement over IPoIB for 128 clients
• HBase Put latency
  – 64 clients: 1.9 ms; 128 clients: 3.5 ms
  – 40% improvement over IPoIB for 128 clients
ADMS '14 54
Presentation Outline
• Overview of Modern Clusters, Interconnects and Protocols
• Challenges for Accelerating Data Management and Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Memcached and HBase
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – RDMA-based MapReduce on HPC Clusters with Lustre
• Ongoing and Future Activities
• Conclusion and Q&A
ADMS '14 55
Acceleration Case Studies and In-Depth Performance Evaluation
• RDMA-based Designs and Performance Evaluation
  – HDFS
  – MapReduce
  – Spark
ADMS '14 56
Design Overview of HDFS with RDMA
(Diagram: Applications -> HDFS -> either the Java Socket Interface over 1/10 GigE / IPoIB networks (other operations), or the Java Native Interface (JNI) -> OSU Design -> Verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE, ...) for the Write path)
• Enables high-performance RDMA communication, while supporting the traditional socket interface
• The JNI layer bridges Java-based HDFS with the communication library written in native code
• Design features
  – RDMA-based HDFS write
  – RDMA-based HDFS replication
  – Parallel replication support
  – On-demand connection setup
  – InfiniBand/RoCE support
ADMS '14 57
Communication Times in HDFS
(Chart: communication time (s) vs. file size (2-10 GB), comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR); reduced by 30%)
• Cluster with HDD DataNodes
  – 30% improvement in communication time over IPoIB (QDR)
  – 56% improvement in communication time over 10GigE
• Similar improvements are obtained for SSD DataNodes
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012
ADMS '14 58
Evaluations using the OHB HDFS Micro-benchmark
(Charts: write latency (s) and write throughput with 2 clients (MBps) vs. file size (2-10 GB), comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR); latency reduced by 25%, throughput increased by 50%)
• Cluster with 4 HDD DataNodes, single disk per node
  – 25% improvement in latency over IPoIB (QDR) for 10GB file size
  – 50% improvement in throughput over IPoIB (QDR) for 10GB file size
N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Int'l Workshop on Big Data Benchmarking (WBDB '12), December 2012
ADMS '14 59
Evaluations using TestDFSIO
(Charts: total throughput (MBps) vs. file size (5-20 GB); left: cluster with 8 HDD nodes, TestDFSIO with 8 maps; right: cluster with 4 SSD nodes, TestDFSIO with 4 maps; comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR))
• Cluster with 8 HDD DataNodes, single disk per node
  – 24% improvement over IPoIB (QDR) for 20GB file size
• Cluster with 4 SSD DataNodes, single SSD per node
  – 61% improvement over IPoIB (QDR) for 20GB file size
ADMS '14 60
Evaluations using Enhanced DFSIO of Intel HiBench on TACC Stampede
(Charts: aggregated throughput (MBps) and execution time (s) vs. file size (64-256 GB), comparing IPoIB (FDR) and OSU-IB (FDR); throughput increased by 64%, latency reduced by 37%)
• Cluster with 64 DataNodes, single HDD per node
  – 64% improvement in throughput over IPoIB (FDR) for 256GB file size
  – 37% improvement in latency over IPoIB (FDR) for 256GB file size
ADMS '14 61
Evaluations using YCSB (32 Region Servers: 100% Update)
(Charts: Put throughput (Kops/sec) and average Put latency (us) vs. number of records (120K-480K, record size = 1KB), comparing IPoIB (QDR) and OSU-IB (QDR))
• HBase using TCP/IP, running over HDFS-IB
• HBase Put latency for 480K records
  – 201 us for the OSU design; 272 us for IPoIB (32Gbps)
• HBase Put throughput for 480K records
  – 4.42 Kops/sec for the OSU design; 3.63 Kops/sec for IPoIB (32Gbps)
• 26% improvement in average latency; 24% improvement in throughput
ADMS '14 62
HDFS and HBase Integration over IB (OSU-IB)
(Charts: Put latency (us) and throughput (Kops/sec) vs. number of records (120K-480K, record size = 1KB), comparing IPoIB (QDR), OSU-IB (HDFS) with IPoIB (HBase), and OSU-IB (HDFS) with OSU-IB (HBase))
• YCSB evaluation with 4 RegionServers (100% update)
• HBase Put latency and throughput for 360K records
  – 37% improvement over IPoIB (32Gbps)
  – 18% improvement over OSU-IB HDFS only
ADMS '14 63
Acceleration Case Studies and In-Depth Performance Evaluation
• RDMA-based Designs and Performance Evaluation
  – HDFS
  – MapReduce
  – Spark
ADMS '14 64
Design Overview of MapReduce with RDMA
(Diagram: Applications -> MapReduce (Job Tracker, Task Tracker, Map, Reduce) -> either the Java Socket Interface over 1/10 GigE / IPoIB networks, or the Java Native Interface (JNI) -> OSU Design -> Verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE, ...))
• Enables high-performance RDMA communication, while supporting the traditional socket interface
• The JNI layer bridges Java-based MapReduce with the communication library written in native code
• Design features
  – RDMA-based shuffle
  – Prefetching and caching of map output
  – Efficient shuffle algorithms
  – In-memory merge
  – On-demand shuffle adjustment
  – Advanced overlapping
    • map, shuffle, and merge
    • shuffle, merge, and reduce
  – On-demand connection setup
  – InfiniBand/RoCE support
ADMS '14 65
Advanced Overlapping among Different Phases
(Diagrams: default architecture, enhanced overlapping, and advanced overlapping of the map, shuffle, merge and reduce phases)
• A hybrid approach to achieve the maximum possible overlapping in MapReduce across all phases compared to other approaches
  – Efficient shuffle algorithms
  – Dynamic and efficient switching
  – On-demand shuffle adjustment
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014
ADMS '14 66
Evaluations using the OHB MapReduce Micro-benchmark
(Charts: latency for 8 slave nodes and 16 slave nodes, comparing IPoIB (FDR) and OSU-IB (FDR))
• Stand-alone MapReduce micro-benchmark (MR-AVG)
• 1 KB key/value pair size
• For 8 slave nodes, RDMA has up to 30% improvement over IPoIB (56Gbps)
• For 16 slave nodes, RDMA has up to 28% improvement over IPoIB (56Gbps)
D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, BPOE-5 (2014)
ADMS '14 67
Evaluations using Sort (HDD, SSD)
(Charts: job execution time (sec) vs. data size (25-40 GB) with 8 HDD DataNodes and 8 SSD DataNodes, comparing 10GigE, IPoIB (QDR), UDA-IB (QDR) and OSU-IB (QDR))
• With 8 HDD DataNodes for 40GB sort
  – 43% improvement over IPoIB (QDR)
  – 44% improvement over UDA-IB (QDR)
• With 8 SSD DataNodes for 40GB sort
  – 52% improvement over IPoIB (QDR)
  – 45% improvement over UDA-IB (QDR)
ADMS '14 68
Evaluations with TeraSort
(Chart: job execution time (sec) vs. data size (60-100 GB), comparing 10GigE, IPoIB (QDR), UDA-IB (QDR) and OSU-IB (QDR))
• 100 GB TeraSort with 8 DataNodes with 2 HDDs per node
  – 49% benefit compared to UDA-IB (QDR)
  – 54% benefit compared to IPoIB (QDR)
  – 56% benefit compared to 10GigE
ADMS '14 69
Performance Evaluation on Larger Clusters
(Charts: Sort on the OSU cluster (60/120/240 GB on 16/32/64 nodes, comparing IPoIB (QDR), UDA-IB (QDR) and OSU-IB (QDR)) and TeraSort on TACC Stampede (80/160/320 GB on 16/32/64 nodes, comparing IPoIB (FDR), UDA-IB (FDR) and OSU-IB (FDR)))
• For 240GB Sort in 64 nodes
  – 40% improvement over IPoIB (QDR) with HDD used for HDFS
• For 320GB TeraSort in 64 nodes
  – 38% improvement over IPoIB (FDR) with HDD used for HDFS
ADMS '14 70
Evaluations using the PUMA Workload
(Chart: normalized execution time for AdjList (30GB), SelfJoin (80GB), SeqCount (30GB), WordCount (30GB) and InvertIndex (30GB), comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR))
• 50% improvement in SelfJoin over IPoIB (QDR) for 80 GB data size
• 49% improvement in SeqCount over IPoIB (QDR) for 30 GB data size
ADMS '14 71
Evaluations using SWIM
(Chart: execution time (sec) per job for 50 jobs, comparing IPoIB (QDR) and OSU-IB (QDR))
• 50 small MapReduce jobs executed on a cluster of size 4
• Maximum performance benefit: 24% over IPoIB (QDR)
• Average performance benefit: 13% over IPoIB (QDR)
ADMS '14 72
Acceleration Case Studies and In-Depth Performance Evaluation
• RDMA-based Designs and Performance Evaluation
  – HDFS
  – MapReduce
  – Spark
ADMS '14 73
Design Overview of Spark with RDMA
• Enables high-performance RDMA communication, while supporting the traditional socket interface
• The JNI layer bridges Scala-based Spark with the communication library written in native code
• Design features
  – RDMA-based shuffle
  – SEDA-based plugins
  – Dynamic connection management and sharing
  – Non-blocking and out-of-order data transfer
  – Off-JVM-heap buffer management
  – InfiniBand/RoCE support
X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI '14), August 2014
ADMS '14 74
Preliminary Results of the Spark-RDMA Design - GroupBy
(Charts: GroupBy time (sec) vs. data size; left: cluster with 4 HDD nodes, GroupBy with 32 cores; right: cluster with 8 HDD nodes, GroupBy with 64 cores; comparing 10GigE, IPoIB and RDMA)
• Cluster with 4 HDD nodes, single disk per node, 32 concurrent tasks
  – 18% improvement over IPoIB (QDR) for 10GB data size
• Cluster with 8 HDD nodes, single disk per node, 64 concurrent tasks
  – 20% improvement over IPoIB (QDR) for 20GB data size
ADMS '14 75
Presentation Outline
• Overview of Modern Clusters, Interconnects and Protocols
• Challenges for Accelerating Data Management and Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Memcached and HBase
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – RDMA-based MapReduce on HPC Clusters with Lustre
• Ongoing and Future Activities
• Conclusion and Q&A
ADMS '14 76
Optimize Apache Hadoop over Parallel File Systems
(Diagram: compute nodes running TaskTracker, Map, Reduce and the Lustre Client, connected to MetaData Servers and Object Storage Servers in the Lustre setup)
• HPC cluster deployment
  – Hybrid topological solution of a Beowulf architecture with separate I/O nodes
  – Lean compute nodes with a light OS; more memory space; small local storage
  – Sub-cluster of dedicated I/O nodes with parallel file systems, such as Lustre
• MapReduce over Lustre
  – Local disk is used as the intermediate data directory
  – Lustre is used as the intermediate data directory
ADMS '14 77
Case Study - Performance Improvement of MapReduce over Lustre on TACC Stampede
• Local disk is used as the intermediate data directory
(Charts: job execution time (sec) for Sort at 300-500 GB on 64 nodes, and for 20-640 GB on clusters of 4-128 nodes, comparing IPoIB (FDR) and OSU-IB (FDR))
• For 500GB Sort in 64 nodes – 44% improvement over IPoIB (FDR)
• For 640GB Sort in 128 nodes – 48% improvement over IPoIB (FDR)
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014
ADMS '14 78
Case Study - Performance Improvement of MapReduce over Lustre on TACC Stampede
• Lustre is used as the intermediate data directory
(Charts: job execution time (sec) for Sort at 80-160 GB, and for 80-320 GB on clusters of 8-32 nodes, comparing IPoIB (FDR) and OSU-IB (FDR))
• For 160GB Sort in 16 nodes – 35% improvement over IPoIB (FDR)
• For 320GB Sort in 32 nodes – 33% improvement over IPoIB (FDR)
• Can more optimizations be achieved by leveraging more features of Lustre?
ADMS '14 79
Presentation Outline
• Overview of Modern Clusters, Interconnects and Protocols
• Challenges for Accelerating Data Management and Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Memcached and HBase
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – RDMA-based MapReduce on HPC Clusters with Lustre
• Ongoing and Future Activities
• Conclusion and Q&A
ADMS '14 80
Ongoing and Future Activities for Hadoop Acceleration
• Some other existing Hadoop solutions
  – Hadoop-A (UDA)
    • RDMA-based implementation of the Hadoop MapReduce shuffle engine
    • Uses a plug-in based solution: UDA (Unstructured Data Accelerator)
    • http://www.mellanox.com/page/products_dyn?product_family=144
  – Cloudera Distributions of Hadoop
    • CDH: open source distribution
    • Cloudera Standard: CDH with automated cluster management
    • Cloudera Enterprise: Cloudera Standard + enhanced management capabilities and support
    • Ethernet/IPoIB
    • http://www.cloudera.com/content/cloudera/en/products.html
  – Hortonworks Data Platform
    • HDP: open source distribution
    • Developed as projects through the Apache Software Foundation (ASF), no proprietary extensions or add-ons
    • Functional areas: Data Management, Data Access, Data Governance and Integration, Security, and Operations
    • Ethernet/IPoIB
    • http://hortonworks.com/hdp
ADMS '14 81
Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges
(Diagram: the same layered architecture as before – Applications, Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached), and a Communication and I/O Library with Point-to-Point Communication, Threaded Models and Synchronization, Virtualization, I/O and File Systems, QoS, Fault-Tolerance and Benchmarks – now with an RDMA Protocol alongside Sockets, over the networking, compute and storage technologies; open question: upper-level changes?)
ADMS '14 82
More Challenges
• Multi-threading and Synchronization
  – Multi-threaded model exploration
  – Fine-grained synchronization and lock-free design
  – Unified helper threads for different components
  – Multi-endpoint design to support multi-threaded communication
• QoS and Virtualization
  – Network virtualization and locality-aware communication for Big Data middleware
  – Hardware-level virtualization support for end-to-end QoS
  – I/O scheduling and storage virtualization
  – Live migration
ADMS '14 83
More Challenges (Cont'd)
• Support for Accelerators
  – Efficient designs for Big Data middleware to take advantage of NVIDIA GPGPUs and Intel MICs
  – Offload computation-intensive workloads to accelerators
  – Explore maximum overlapping between communication and offloaded computation
• Fault Tolerance Enhancements
  – Exploration of light-weight fault tolerance mechanisms for Big Data
• Support for Parallel File Systems
  – Optimize Big Data middleware over parallel file systems (e.g., Lustre) on modern HPC clusters
• Big Data Benchmarking
ADMS '14 84
Are the Current Benchmarks Sufficient for Big Data Management and Processing?
• The current benchmarks provide some performance behavior
• However, they do not provide any information to the designer/developer on:
  – What is happening at the lower layer?
  – Where are the benefits coming from?
  – Which design is leading to benefits or bottlenecks?
  – Which component in the design needs to be changed, and what will be its impact?
  – Can performance gain/loss at the lower layer be correlated to the performance gain/loss observed at the upper layer?
ADMS '14 85
Challenges in Benchmarking of RDMA-based Designs
(Diagram: the same layered architecture – Applications, Big Data Middleware, Communication and I/O Library over Sockets/RDMA Protocols, networking, compute and storage technologies – annotated to show that current benchmarks exist only at the application level, no benchmarks exist for the lower layers, and the correlation between the two is an open question)
ADMS '14 86
OSU MPI Micro-Benchmarks (OMB) Suite
• A comprehensive suite of benchmarks to
  – Compare performance of different MPI libraries on various networks and systems
  – Validate low-level functionalities
  – Provide insights into the underlying MPI-level designs
• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth
• Extended later to
  – MPI-2 one-sided
  – Collectives
  – GPU-aware data movement
  – OpenSHMEM (point-to-point and collectives)
  – UPC
• Has become an industry standard
• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries, and even in procurement of large-scale systems
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Available in an integrated manner with the MVAPICH2 stack
ADMS '14 87
Iterative Process – Requires Deeper Investigation and Design for Benchmarking Next-Generation Big Data Systems and Applications
(Diagram: the layered Big Data architecture with applications-level benchmarks at the top and micro-benchmarks at the lower layers, iterating between the two)
ADMS '14 88
Future Plans of the OSU High Performance Big Data Project
• Upcoming releases of RDMA-enhanced packages will support
  – Hadoop 2.x MapReduce & RPC
  – Spark
  – HBase
• Upcoming releases of OSU HiBD Micro-Benchmarks (OHB) will support
  – HDFS
  – MapReduce
  – RPC
• Exploration of other components (threading models, QoS, virtualization, accelerators, etc.)
• Advanced designs with upper-level changes and optimizations
ADMS '14 89
Presentation Outline
• Overview of Modern Clusters, Interconnects and Protocols
• Challenges for Accelerating Data Management and Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Memcached and HBase
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – RDMA-based MapReduce on HPC Clusters with Lustre
• Ongoing and Future Activities
• Conclusion and Q&A
ADMS '14 90
Concluding Remarks
• Presented an overview of data management and processing middleware in different tiers
• Provided an overview of modern networking technologies
• Discussed challenges in accelerating Big Data middleware
• Presented initial designs to take advantage of InfiniBand/RDMA for Memcached, HBase, Hadoop and Spark
• Results are promising
• Many other open issues need to be solved
• Will enable the Big Data management and processing community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner
ADMS '14 91
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
ADMS '14 92