TRANSCRIPT
High Performance Big Data (HiBD): Accelerating Hadoop, Spark and Memcached on Modern Clusters
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at Mellanox Theatre (SC ’15)
Data Processing and Management on Modern Clusters
• Substantial impact on designing and utilizing modern data management and processing systems in multiple tiers
– Front-end data accessing and serving (online)
• Memcached + DB (e.g., MySQL), HBase
– Back-end data analytics (offline)
• HDFS, MapReduce, Spark
Spark - An Example of Back-end Data Processing Middleware
[Figure: Spark architecture — a Driver hosting the SparkContext coordinates with a Master (and ZooKeeper) to run tasks on Worker nodes backed by HDFS]
• An in-memory data-processing framework
– Iterative machine learning jobs
– Interactive data analytics
– Scala-based implementation
– Runs standalone or on YARN and Mesos
• Scalable, communication- and I/O-intensive
– Wide dependencies between Resilient Distributed Datasets (RDDs)
– MapReduce-like shuffle operations to repartition RDDs (see the sketch below)
– Sockets-based communication
http://spark.apache.org
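To make the shuffle bullet concrete, here is a minimal Spark job using the standard Java API (plain Spark usage, not the OSU design itself): reduceByKey creates a wide dependency, so its input must be repartitioned through a shuffle — exactly the sockets-based data path that the RDMA designs discussed later replace.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ShuffleExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ShuffleExample").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> words = sc.parallelize(Arrays.asList("spark", "rdma", "spark", "hdfs"));

        // reduceByKey introduces a wide dependency: every partition of the
        // child RDD may need data from every partition of the parent, so the
        // records are repartitioned via a MapReduce-like shuffle. This is the
        // traffic an RDMA-based shuffle layer accelerates.
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKey(Integer::sum);

        counts.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        sc.stop();
    }
}
```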
Memcached - An Example of Front-end Data Management Middleware
• Three-layer architecture of Web 2.0
– Web servers, Memcached servers, database servers
• Distributed caching layer
– Allows aggregating spare memory from multiple nodes
– General purpose
• Typically used to cache database queries and the results of API calls (a cache-aside sketch follows below)
• Scalable model, but typical usage is very network intensive
[Figure: web frontend servers (Memcached clients), reached over the Internet, connect through high-performance networks to the Memcached servers and then to the database servers; each server node has CPUs, main memory, SSD, and HDD]
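As an illustration of the caching layer, here is a minimal cache-aside sketch using the open-source spymemcached Java client (the talk's own evaluation uses the C libmemcached client; the host name and the queryDatabase helper are hypothetical):

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class CacheAside {
    // Hypothetical helper standing in for a real MySQL lookup.
    static String queryDatabase(String userId) { return "profile-of-" + userId; }

    public static void main(String[] args) throws Exception {
        // One client, many possible servers: the caching layer aggregates
        // spare memory from every node listed here.
        MemcachedClient mc = new MemcachedClient(
                new InetSocketAddress("memcached-host", 11211));

        String key = "user:42";
        Object cached = mc.get(key);        // network round trip to the cache
        if (cached == null) {               // miss: fall back to the database
            String value = queryDatabase("42");
            mc.set(key, 300, value);        // cache the result for 300 seconds
            cached = value;
        }
        System.out.println(cached);
        mc.shutdown();
    }
}
```

Every get/set is a network round trip, which is why this layer is so sensitive to interconnect latency.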
Drivers for Modern HPC Clusters
• High-End Computing (HEC) is growing dramatically
– High-performance computing
– Big Data computing
• Technology advancement
– Multi-core/many-core technologies
– Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
– Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
– Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
[Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A]
Interconnects and Protocols in the OpenFabrics Stack

| Option | Application/Middleware Interface | Protocol | Adapter | Switch |
|---|---|---|---|---|
| 1/10/40 GigE | Sockets | TCP/IP (kernel space, Ethernet driver) | Ethernet Adapter | Ethernet Switch |
| 10/40 GigE-TOE | Sockets | TCP/IP (hardware offload) | Ethernet Adapter | Ethernet Switch |
| IPoIB | Sockets | IPoIB (kernel space) | InfiniBand Adapter | InfiniBand Switch |
| RSockets | Sockets | RSockets (user space) | InfiniBand Adapter | InfiniBand Switch |
| SDP | Sockets | SDP (user space) | InfiniBand Adapter | InfiniBand Switch |
| iWARP | Verbs | RDMA over TCP/IP (user space) | iWARP Adapter | Ethernet Switch |
| RoCE | Verbs | RDMA (user space) | RoCE Adapter | Ethernet Switch |
| IB Native | Verbs | RDMA (user space) | InfiniBand Adapter | InfiniBand Switch |
Presentation Outline
• Challenges for Accelerating Big Data Processing
• Accelerating Big Data Processing on RDMA-enabled High-Performance Interconnects
– RDMA-enhanced Designs for HDFS, MapReduce, Spark, and Memcached
• Accelerating Big Data Processing on High-Performance Storage
– Enhanced Designs for HDFS and MapReduce over Lustre
– SSD-assisted Hybrid Memory for RDMA-based Memcached
• The High-Performance Big Data (HiBD) Project
• Conclusion and Q&A
How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?
• What are the major bottlenecks in current Big Data processing middleware (e.g., Hadoop, Spark, and Memcached)?
• Can the bottlenecks be alleviated with new designs that take advantage of HPC technologies?
• Can RDMA-enabled high-performance interconnects benefit Big Data processing?
• Can HPC clusters with high-performance storage systems (e.g., SSD, parallel file systems) benefit Big Data applications?
• How much performance benefit can be achieved through enhanced designs?
• How should benchmarks be designed for evaluating the performance of Big Data middleware on HPC clusters?
Bring HPC and Big Data processing into a "convergent trajectory"!
Challenges in Designing Communication and I/O Libraries for Big Data Systems
[Figure: layered stack — Applications; Big Data Middleware (HDFS, MapReduce, HBase, Spark, and Memcached); Programming Models (Sockets; other protocols?); a Communication and I/O Library covering point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault tolerance, and benchmarks over an RDMA protocol, with possible upper-level changes; all running on commodity computing system architectures (multi-/many-core architectures and accelerators), networking technologies (InfiniBand, 1/10/40GigE, and intelligent NICs), and storage technologies (HDD and SSD)]
Presentation Outline
• Challenges for Accelerating Big Data Processing
• Accelerating Big Data Processing on RDMA-enabled High-Performance Interconnects
– RDMA-enhanced Designs for HDFS, MapReduce, Spark, and Memcached
• Accelerating Big Data Processing on High-Performance Storage
– Enhanced Designs for HDFS and MapReduce over Lustre
– SSD-assisted Hybrid Memory for RDMA-based Memcached
• The High-Performance Big Data (HiBD) Project
• Conclusion and Q&A
Design Overview of HDFS with RDMA
[Figure: Applications issue HDFS write and other operations; the default path goes through the Java Socket Interface to 1/10 GigE or IPoIB networks, while the OSU design goes through the Java Native Interface (JNI) to Verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE, ...)]
• Design Features
– RDMA-based HDFS write
– RDMA-based HDFS replication
– Parallel replication support
– On-demand connection setup
– InfiniBand/RoCE support
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov. 2012
N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014
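Because the OSU design replaces the transport underneath HDFS rather than its API, an ordinary HDFS write is unchanged from the application's point of view. A minimal sketch using the standard Hadoop FileSystem API (the path is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // A plain HDFS write: the client streams packets to the first
        // DataNode, which pipelines them to the replicas. In the RDMA-based
        // design the same create()/write() calls are carried by the RDMA
        // write and replication paths underneath, with no application change.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/rdma-demo.txt"))) {
            out.writeBytes("written over whichever transport HDFS is configured with\n");
        }
        fs.close();
    }
}
```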
Evaluations using Enhanced DFSIO of Intel HiBench on TACC-Stampede
• Cluster with 64 DataNodes, single HDD per node
– 64% improvement in throughput over IPoIB (FDR) for 256 GB file size
– 37% improvement in latency over IPoIB (FDR) for 256 GB file size
[Charts: aggregated throughput (MBps) and execution time (s) vs. file size (64/128/256 GB), IPoIB (FDR) vs. OSU-IB (FDR); throughput increased by 64% and execution time reduced by 37%]
Design Overview of MapReduce with RDMA
[Figure: Applications run MapReduce (JobTracker, TaskTracker, Map, and Reduce components); the default path goes through the Java Socket Interface to 1/10 GigE or IPoIB networks, while the OSU design goes through the Java Native Interface (JNI) to Verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE, ...)]
• Design Features
– RDMA-based shuffle (see the job sketch below)
– Prefetching and caching of map output
– Efficient shuffle algorithms
– In-memory merge
– On-demand shuffle adjustment
– Hybrid overlapping
• map, shuffle, and merge
• shuffle, merge, and reduce
– On-demand connection setup
– InfiniBand/RoCE support
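For reference, a standard MapReduce job showing where these features act: everything between the mapper's context.write() and the reducer's input crosses the shuffle, which is the path that the RDMA-based shuffle, in-memory merge, and hybrid overlapping optimize. This is plain Hadoop API code, not the OSU design itself:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // Each write here becomes map output that is shuffled to reducers.
            for (String w : value.toString().split("\\s+"))
                ctx.write(new Text(w), ONE);
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            // Values arrive merged/sorted by key after the shuffle.
            int sum = 0;
            for (IntWritable v : vals) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```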
Evaluations using PUMA Workload
• 50% improvement in Self Join over IPoIB (QDR) for 80 GB data size
• 49% improvement in Sequence Count over IPoIB (QDR) for 30 GB data size
[Chart: normalized execution time for AdjList (30 GB), SelfJoin (80 GB), SeqCount (30 GB), WordCount (30 GB), and InvertIndex (30 GB), comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR)]
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014.
Design Overview of Spark with RDMA
• Design Features
– RDMA-based shuffle
– SEDA-based plugins
– Dynamic connection management and sharing
– Non-blocking and out-of-order data transfer
– Off-JVM-heap buffer management
– InfiniBand/RoCE support
• Enables high-performance RDMA communication while supporting the traditional socket interface
• A JNI layer bridges Scala-based Spark with the communication library written in native code (see the sketch below)
X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI '14), August 2014
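A hypothetical sketch of what such a JNI bridge can look like on the JVM side; the class, method, and library names below are illustrative, not the actual OSU implementation:

```java
import java.nio.ByteBuffer;

// Hypothetical JNI bridge, sketched to show the structure only.
public final class RdmaShuffleBridge {
    static {
        System.loadLibrary("rdmashuffle"); // native communication library (C code)
    }

    // Registers an off-JVM-heap (direct) buffer with the NIC so the native
    // side can issue zero-copy RDMA transfers that the garbage collector
    // never moves.
    public static native long registerBuffer(ByteBuffer directBuffer);

    // Posts a non-blocking send of one shuffle block; completions may arrive
    // out of order and are matched up by the caller.
    public static native int postSend(long connectionId, long bufferHandle,
                                      int offset, int length);

    // Polls the completion queue; returns the id of a finished transfer or -1.
    public static native long pollCompletion();
}
```

The split mirrors the slide: Scala/Java code stays on one side of the native boundary, while connection management and verbs calls live in the C library loaded above.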
Performance Evaluation on TACC Stampede - SortByTest
• Intel SandyBridge + FDR, 16 worker nodes, 256 cores (256M 256R)
• RDMA-based design for Spark 1.4.0
• RDMA vs. IPoIB with 256 concurrent tasks, single disk per node and RAMDisk; for the SortByKey test:
– Shuffle time reduced by up to 77% over IPoIB (56 Gbps)
– Total time reduced by up to 58% over IPoIB (56 Gbps)
[Charts: SortByTest shuffle time and total time (sec) vs. data size (16/32/64 GB) on 16 worker nodes, IPoIB vs. RDMA, showing the 77% and 58% reductions]
Memcached-RDMA Design
• Server and client perform a negotiation protocol
– The master thread assigns clients to the appropriate worker thread
• Once a client is assigned to a verbs worker thread, it communicates directly with that thread and is "bound" to it (a dispatch sketch follows below)
• All other Memcached data structures are shared among the RDMA and sockets worker threads
• Native IB-verbs-level design and evaluation with
– Server: Memcached (http://memcached.org)
– Client: libmemcached (http://libmemcached.org)
– Different networks and protocols: 10GigE, IPoIB, native IB (RC, UD)
[Figure: sockets clients and RDMA clients first contact the master thread, which binds each to a sockets or verbs worker thread; all workers share the same memory slabs and items]
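A hypothetical Java sketch of the dispatch scheme described above (the real server is C code inside Memcached, so every name here is illustrative): the master hands each client to a transport-matched worker, and all workers share the item store.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class DispatchSketch {
    enum Transport { SOCKETS, VERBS }

    // Shared data structures (slabs/items in the real server) are visible
    // to sockets and verbs workers alike.
    static final ConcurrentHashMap<String, String> sharedItems = new ConcurrentHashMap<>();

    static class Worker extends Thread {
        final BlockingQueue<Integer> boundClients = new LinkedBlockingQueue<>();
        final Transport transport;
        Worker(Transport t) { this.transport = t; }
        @Override public void run() {
            try {
                while (true) {
                    int clientId = boundClients.take();
                    // Once bound, this worker talks to the client directly
                    // (verbs queue pair or socket) for its whole lifetime.
                    System.out.println(transport + " worker serving client " + clientId);
                }
            } catch (InterruptedException e) { /* shut down */ }
        }
    }

    public static void main(String[] args) {
        Worker sockets = new Worker(Transport.SOCKETS);
        Worker verbs = new Worker(Transport.VERBS);
        sockets.start(); verbs.start();

        // Master thread's role: run the negotiation, then hand each client
        // to the worker that matches its transport.
        verbs.boundClients.add(1);   // an RDMA-capable client
        sockets.boundClients.add(2); // a legacy sockets client
    }
}
```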
Memcached Performance (TACC Stampede)
• Experiments on TACC Stampede (Intel SandyBridge cluster, IB: FDR)
• Memcached GET latency
– 4 bytes - OSU-IB: 2.84 us; IPoIB: 75.53 us
– 2K bytes - OSU-IB: 4.49 us; IPoIB: 123.42 us
– Latency reduced by nearly 20x
• Memcached throughput (4 bytes)
– 4080 clients - OSU-IB: 556 Kops/sec; IPoIB: 233 Kops/sec
– Nearly 2x improvement in throughput
[Charts: GET latency (us) vs. message size (1 B-4 KB) and throughput (thousands of transactions per second, TPS) vs. number of clients (16-4080), OSU-IB (FDR) vs. IPoIB (FDR)]
Presentation Outline
• Challenges for Accelerating Big Data Processing
• Accelerating Big Data Processing on RDMA-enabled High-Performance Interconnects
– RDMA-enhanced Designs for HDFS, MapReduce, Spark, and Memcached
• Accelerating Big Data Processing on High-Performance Storage
– Enhanced Designs for HDFS and MapReduce over Lustre
– SSD-assisted Hybrid Memory for RDMA-based Memcached
• The High-Performance Big Data (HiBD) Project
• Conclusion and Q&A
Optimize I/O Performance for Big Data Applications with High-Performance Storage on HPC Clusters
[Figure: compute nodes (App Master, Map/Reduce tasks, RAM, local disk, Lustre client) connected to a Lustre setup of metadata servers and object storage servers]
• HPC cluster deployment
– Hybrid topological solution of the Beowulf architecture with separate I/O nodes
– Lean compute nodes with a light OS, more memory space, and small local storage
– Sub-cluster of dedicated I/O nodes with parallel file systems such as Lustre
• HDFS over heterogeneous storage
– RAMDisk, SSD, HDD, and Lustre as data directories
• MapReduce over Lustre
– Local disk used as the intermediate data directory
– Lustre used as the intermediate data directory
• Hybrid Memcached with RAM + SSD
Enhanced HDFS with In-memory and Heterogeneous Storage (Triple-H)
[Figure: applications run Map/Reduce over Triple-H, which applies data placement policies, hybrid replication, and eviction/promotion across RAM Disk, SSD, HDD, and Lustre]
• Design Features
– Three modes: Default (HHH), In-Memory (HHH-M), Lustre-Integrated (HHH-L)
– Policies to efficiently utilize the heterogeneous storage devices (RAM, SSD, HDD, Lustre)
– Eviction/promotion based on data usage pattern (a sketch follows after this slide)
– Hybrid replication
– Lustre-Integrated mode: Lustre-based fault tolerance
N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015
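A toy sketch of usage-driven eviction/promotion across tiers, under the stated RAM > SSD > HDD > Lustre hierarchy; the real Triple-H policies are more elaborate, and every name and threshold here is made up:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class TieredPlacement {
    enum Tier { RAM_DISK, SSD, HDD, LUSTRE }

    static final int RAM_CAPACITY = 2;               // blocks; tiny for the demo
    final Map<String, Tier> location = new HashMap<>();
    final Deque<String> ramLru = new ArrayDeque<>(); // most recent at the head

    // An access promotes a hot block to the RAM tier; when RAM is full,
    // the coldest block is demoted one tier down.
    void access(String block) {
        ramLru.remove(block);
        ramLru.addFirst(block);
        location.put(block, Tier.RAM_DISK);
        if (ramLru.size() > RAM_CAPACITY) {
            String cold = ramLru.removeLast();       // least recently used
            location.put(cold, Tier.SSD);            // demote one tier
            System.out.println("evicted " + cold + " to SSD");
        }
    }

    public static void main(String[] args) {
        TieredPlacement p = new TieredPlacement();
        p.access("blk_1"); p.access("blk_2"); p.access("blk_3"); // evicts blk_1
        System.out.println(p.location);
    }
}
```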
Performance Improvement on TACC Stampede (HHH)
• For 160 GB TestDFSIO on 32 nodes
– Write throughput: 7x improvement over IPoIB (FDR)
– Read throughput: 2x improvement over IPoIB (FDR)
• For 120 GB RandomWriter on 32 nodes
– 3x improvement over IPoIB (FDR)
[Charts: TestDFSIO total throughput (MBps) for write and read, and RandomWriter execution time (s) vs. cluster size:data size (8:30, 16:60, 32:120), IPoIB (FDR) vs. OSU-IB (FDR); write throughput increased 7x, read throughput 2x, RandomWriter time reduced 3x]
Performance Improvement on SDSC Gordon (HHH vs. Tachyon)
• RandomWriter: 200 GB on 32 SSD nodes (QDR)
– HHH reduces the execution time by 47% over Tachyon-WTC (IPoIB) and 56% over HDFS (IPoIB)
• Sort: 200 GB on 32 SSD nodes (QDR)
– HHH reduces the execution time by 19% over Tachyon-WTC (IPoIB) and 31% over HDFS (IPoIB)
[Charts: RandomWriter and Sort execution time (s) vs. cluster size:data size (8:50, 16:100, 32:200 GB) for HDFS, Tachyon-WTC, Tachyon-C, and HHH]
N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData '15, October 2015
Design Overview of Shuffle Strategies for MapReduce over Lustre
[Figure: map tasks write to Lustre as the intermediate data directory; reduce tasks fetch map output either via Lustre read or via RDMA, then perform in-memory merge/sort and reduce]
• Design Features
– Two shuffle approaches: Lustre-read-based shuffle and RDMA-based shuffle
– A hybrid shuffle algorithm takes the benefit of both approaches (see the sketch below)
– Dynamically adapts to the better shuffle approach for each shuffle request, based on profiling values for each Lustre read operation
– In-memory merge and the overlapping of different phases are kept similar to the RDMA-enhanced MapReduce design
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015.
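A hypothetical sketch of the per-request decision between the two shuffle paths; the profiling-based cost model below (an exponentially weighted bandwidth estimate) is illustrative, not the published algorithm:

```java
public class HybridShuffleChooser {
    enum Path { LUSTRE_READ, RDMA }

    // Exponentially weighted estimate of observed Lustre read bandwidth,
    // updated from the profiling values of each completed Lustre read.
    private double lustreBytesPerSec = 0;

    void recordLustreRead(long bytes, double seconds) {
        double observed = bytes / seconds;
        lustreBytesPerSec = (lustreBytesPerSec == 0)
                ? observed : 0.8 * lustreBytesPerSec + 0.2 * observed;
    }

    // Pick whichever path is expected to deliver this shuffle request sooner.
    Path choose(long requestBytes, double rdmaBytesPerSec) {
        double lustreCost = requestBytes / Math.max(lustreBytesPerSec, 1);
        double rdmaCost = requestBytes / rdmaBytesPerSec;
        return lustreCost < rdmaCost ? Path.LUSTRE_READ : Path.RDMA;
    }

    public static void main(String[] args) {
        HybridShuffleChooser c = new HybridShuffleChooser();
        c.recordLustreRead(64L << 20, 0.05);       // a 64 MB read at ~1.3 GB/s
        System.out.println(c.choose(32L << 20, 3.0e9)); // RDMA wins here
    }
}
```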
Performance Improvement of MapReduce over Lustre on TACC-Stampede
• Local disk is used as the intermediate data directory
• For 500 GB Sort on 64 nodes: 44% improvement over IPoIB (FDR)
• For 640 GB Sort on 128 nodes: 48% improvement over IPoIB (FDR)
[Charts: job execution time (sec) vs. data size (300/400/500 GB) on 64 nodes, and scaled runs from 20 GB on 4 nodes up to 640 GB on 128 nodes, IPoIB (FDR) vs. OSU-IB (FDR)]
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014.
Case Study - Performance Improvement of MapReduce over Lustre on SDSC-Gordon
• Lustre is used as the intermediate data directory
• For 80 GB Sort on 8 nodes: 34% improvement over IPoIB (QDR)
• For 120 GB TeraSort on 16 nodes: 25% improvement over IPoIB (QDR)
[Charts: job execution time (sec) vs. data size for Sort (40/60/80 GB) and TeraSort (40/80/120 GB), comparing IPoIB (QDR), OSU-Lustre-Read (QDR), OSU-RDMA-IB (QDR), and OSU-Hybrid-IB (QDR)]
Overview of SSD-Assisted Hybrid RDMA-Memcached Design
• Design Features
– Hybrid slab allocation and management for higher data retention
– Log-structured sequence of blocks flushed to the SSD
– Fast SSD random reads to achieve low-latency object access
– Uses LRU to evict data to the SSD (a toy sketch follows below)
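A toy Java sketch of the RAM+SSD idea: LRU-evicted values are appended to a log-structured file on the SSD, and a later get() needs only one fast random read to fetch them back. All names and sizes are illustrative:

```java
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class HybridStore {
    private final RandomAccessFile ssdLog;                  // log-structured: append only
    private final Map<String, long[]> onSsd = new HashMap<>(); // key -> {offset, length}

    // Access-ordered map acts as the in-RAM LRU; the eldest entry is
    // spilled to the SSD log whenever the tiny RAM quota is exceeded.
    private final LinkedHashMap<String, byte[]> ram =
            new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, byte[]> e) {
                    if (size() <= 4) return false;          // RAM quota for the demo
                    spill(e.getKey(), e.getValue());        // LRU victim goes to SSD
                    return true;
                }
            };

    HybridStore(String path) throws Exception { ssdLog = new RandomAccessFile(path, "rw"); }

    void spill(String key, byte[] value) {
        try {
            long off = ssdLog.length();
            ssdLog.seek(off);
            ssdLog.write(value);                            // sequential append
            onSsd.put(key, new long[]{off, value.length});
        } catch (Exception ex) { throw new RuntimeException(ex); }
    }

    void set(String key, byte[] value) { ram.put(key, value); }

    byte[] get(String key) throws Exception {
        byte[] v = ram.get(key);
        if (v != null) return v;                            // RAM hit
        long[] loc = onSsd.get(key);
        if (loc == null) return null;                       // miss
        byte[] buf = new byte[(int) loc[1]];
        ssdLog.seek(loc[0]);                                // one random read
        ssdLog.readFully(buf);
        return buf;
    }

    public static void main(String[] args) throws Exception {
        HybridStore store = new HybridStore("/tmp/hybrid.log");
        for (int i = 0; i < 6; i++) store.set("k" + i, ("v" + i).getBytes());
        System.out.println(new String(store.get("k0")));    // served from the SSD log
    }
}
```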
Performance Evaluation on SDSC Comet (IB FDR + SATA/NVMe SSDs)
• Memcached latency test with Zipf distribution; server with 1 GB of memory; 32 KB key-value pair size; total data accessed is 1 GB (when the data fits in memory) and 1.5 GB (when it does not)
• When the data fits in memory: RDMA-Mem/Hybrid gives a 5x improvement over IPoIB-Mem
• When the data does not fit in memory: RDMA-Hybrid gives a 2x-2.5x improvement over IPoIB/RDMA-Mem
[Chart: Set/Get latency (us) for IPoIB-Mem, RDMA-Mem, RDMA-Hybrid-SATA, and RDMA-Hybrid-NVMe, in both the data-fits and data-does-not-fit cases; latency is broken down into slab allocation (SSD write), cache check+load (SSD read), cache update, server response, client wait, and miss penalty]
Presentation Outline
• Challenges for Accelerating Big Data Processing
• Accelerating Big Data Processing on RDMA-enabled High-Performance Interconnects
– RDMA-enhanced Designs for HDFS, MapReduce, Spark, and Memcached
• Accelerating Big Data Processing on High-Performance Storage
– Enhanced Designs for HDFS and MapReduce over Lustre
– SSD-assisted Hybrid Memory for RDMA-based Memcached
• The High-Performance Big Data (HiBD) Project
• Conclusion and Q&A
The High-Performance Big Data (HiBD) Project
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache and HDP Hadoop distributions
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• RDMA for Memcached (RDMA-Memcached)
• OSU HiBD-Benchmarks (OHB)
– HDFS and Memcached micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 135 organizations from 20 countries
• More than 13,300 downloads from the project site
Different Modes of RDMA for Apache Hadoop 2.x
• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation for better fault tolerance as well as performance. This mode is enabled by default in the package (a hypothetical mode-selection sketch follows after this list).
• HHH-M: A high-performance in-memory setup, introduced in this package, can be utilized to perform all I/O operations in memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre installation available in the cluster.
• MapReduce over Lustre, with/without local disks: Besides the HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Two different modes are introduced here: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in the different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).
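For illustration only, mode selection would be a configuration choice along these lines; the property keys below are invented placeholders, not the real parameter names, which are documented in the RDMA for Apache Hadoop 2.x user guide at http://hibd.cse.ohio-state.edu:

```java
import org.apache.hadoop.conf.Configuration;

public class ModeSelection {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // HHH is the default; an in-memory or Lustre-integrated cluster
        // would flip a switch along these lines before starting HDFS.
        conf.set("hdfs.placement.mode", "HHH-L");          // placeholder key
        conf.set("lustre.mount.point", "/scratch/lustre"); // placeholder key
        System.out.println("mode = " + conf.get("hdfs.placement.mode"));
    }
}
```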
Future Plans of the OSU High-Performance Big Data Project
• Upcoming releases of RDMA-enhanced packages will support
– CDH plugin
– Spark
– HBase
• Upcoming releases of OSU HiBD Micro-Benchmarks (OHB) will support
– MapReduce
– RPC
• Exploration of other components (threading models, QoS, virtualization, accelerators, etc.)
• Advanced designs with upper-level changes and optimizations
Concluding Remarks
• Discussed communication and I/O challenges in accelerating Big Data middleware
• Presented initial designs that take advantage of InfiniBand/RDMA and high-performance storage architectures for Hadoop, Spark, and Memcached
• Presented challenges in designing benchmarks
• Results are promising
• Many other open issues still need to be solved
• These designs will enable the Big Data processing community to take advantage of modern HPC technologies to carry out analytics in a fast and scalable manner
An Additional Talk
• Tomorrow, Thursday (11:00-11:30 am): Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR

Thank You!
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/