TRANSCRIPT
Big Data: Hadoop and Memcached
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at HPC Advisory Council Switzerland Conference (2014)
Introduction to Big Data Applications and Analytics
• Big Data has become one of the most
important elements of business analytics
• Provides groundbreaking opportunities
for enterprise information management
and decision making
• The amount of data is exploding;
companies are capturing and digitizing
more information than ever
• The rate of information growth appears
to be exceeding Moore’s Law
2 HPC Advisory Council Switzerland Conference, Apr '14
• Commonly accepted 3V’s of Big Data: Volume, Velocity, Variety
– Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf
• 5V’s of Big Data – 3V + Value, Veracity
• Scientific Data Management, Analysis, and Visualization
• Applications examples
– Climate modeling
– Combustion
– Fusion
– Astrophysics
– Bioinformatics
• Data-Intensive Tasks
– Run large-scale simulations on supercomputers
– Dump data on parallel storage systems
– Collect experimental / observational data
– Move experimental / observational data to analysis sites
– Visual analytics – help understand data visually
Not Only in Internet Services - Big Data in Scientific Domains
• Hadoop: http://hadoop.apache.org
– The most popular framework for Big Data Analytics
– HDFS, MapReduce, HBase, RPC, Hive, Pig, ZooKeeper, Mahout, etc.
• Spark: http://spark-project.org
– Provides primitives for in-memory cluster computing; jobs can load data into memory and
query it repeatedly
• Shark: http://shark.cs.berkeley.edu
– A fully Apache Hive-compatible data warehousing system
• Storm: http://storm-project.net
– A distributed real-time computation system for real-time analytics, online machine learning,
continuous computation, etc.
• S4: http://incubator.apache.org/s4
– A distributed system for processing continuous unbounded streams of data
• Web 2.0: RDBMS + Memcached (http://memcached.org)
– Memcached: A high-performance, distributed memory object caching system
Typical Solutions or Architectures for Big Data Applications and Analytics
• Focuses on large data and data analysis
• Hadoop (e.g. HDFS, MapReduce, HBase, RPC) environment
is gaining a lot of momentum
• http://wiki.apache.org/hadoop/PoweredBy
Who Is Using Hadoop?
• Scale: 42,000 nodes, 365 PB HDFS, 10M daily slot hours
– http://developer.yahoo.com/blogs/hadoop/apache-hbase-yahoo-multi-tenancy-helm-again-171710422.html
• A new Jim Gray Sort record: 1.42 TB/min over 2,100 nodes
– Hadoop 0.23.7, an early branch of Hadoop 2.X
– http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-sets-gray-sort-record-yellow-elephant-092704088.html
Hadoop at Yahoo!
http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-more-ever-54421.html
• Overview of Hadoop and Memcached
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop and Memcached
• Designs and Case Studies – Hadoop
• HDFS
• MapReduce
• HBase
• RPC
• Combination of components
– Hadoop Map-Reduce over Lustre
– Memcached
• Open Challenges
• Conclusion and Q&A
Presentation Outline
Big Data Processing with Hadoop
• The open-source implementation of the MapReduce programming model for Big Data Analytics
• Major components
– HDFS
– MapReduce
• Underlying Hadoop Distributed File System (HDFS) used by both MapReduce and HBase
• Model scales but high amount of communication during intermediate phases can be further optimized
(Hadoop framework: User Applications run over MapReduce and HBase, which sit on HDFS and Hadoop Common (RPC))
Hadoop MapReduce Architecture & Operations
• Map and Reduce tasks carry out the entire job execution
• Communication in shuffle phase uses HTTP over Java Sockets
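As a toy illustration (plain Python, not Hadoop code), the map → shuffle → reduce flow described above can be sketched as follows; in real Hadoop the shuffle grouping happens across nodes, by default over HTTP on Java sockets:

```python
from collections import defaultdict

def map_phase(records):
    # Map tasks emit intermediate (key, value) pairs, e.g. word counts
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # The shuffle phase groups intermediate values by key; this is the
    # communication-intensive step between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Each reduce task aggregates the values for its keys
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big hadoop"])))
print(counts)  # {'big': 2, 'data': 1, 'hadoop': 1}
```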
Hadoop Distributed File System (HDFS)
• Primary storage of Hadoop;
highly reliable and fault-tolerant
• Adopted by many reputed
organizations
– e.g., Facebook, Yahoo!
• NameNode: stores the file system namespace
• DataNode: stores data blocks
• Developed in Java for platform-
independence and portability
• Uses sockets for communication!
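To make the NameNode/DataNode split concrete, here is a minimal sketch (illustrative Python, not HDFS code; the 128 MB block size and 3-way replication are HDFS defaults assumed here, and the round-robin placement is a simplification of HDFS's rack-aware policy):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size (assumed)
REPLICATION = 3                 # default replication factor (assumed)

def split_into_blocks(file_size):
    # The NameNode tracks the namespace: which blocks make up a file
    return [min(BLOCK_SIZE, file_size - off)
            for off in range(0, file_size, BLOCK_SIZE)]

def place_replicas(num_blocks, datanodes):
    # Each block is stored on REPLICATION distinct DataNodes
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(REPLICATION)]
    return placement

blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file -> 3 blocks
print(len(blocks), place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```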
(HDFS Architecture: the client talks to the NameNode and DataNodes over RPC)
System-Level Interaction Between Clients and Data Nodes in HDFS
(HDFS clients connect to HDFS DataNodes (HDD/SSD) over high-performance networks)
HBase
• Apache Hadoop Database
(http://hbase.apache.org/)
• Semi-structured database,
which is highly scalable
• Integral part of many
datacenter applications
– e.g., Facebook Social Inbox
• Developed in Java for platform-
independence and portability
• Uses sockets for
communication!
(HBase Architecture: HBase clients talk to ZooKeeper, the HMaster, and HRegion servers, which store data in HDFS)
System-Level Interaction Among HBase Clients, Region Servers, and Data Nodes
(HBase clients connect to HRegion servers, which connect to HDFS DataNodes (HDD/SSD), in both cases over high-performance networks)
Memcached Architecture
• Integral part of Web 2.0 architecture
• Distributed Caching Layer
– Allows aggregation of spare memory from multiple nodes
– General purpose
• Typically used to cache database queries, results of API calls
• Scalable model, but typical usage very network intensive
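The caching layer works because every client maps a key to the same server. A minimal sketch of that client-side key-to-server mapping (illustrative Python, not a real Memcached client; the server names are made up, and production clients typically use consistent hashing rather than this simple modulo scheme):

```python
import hashlib

SERVERS = ["mc1:11211", "mc2:11211", "mc3:11211"]  # hypothetical server pool

def server_for(key, servers=SERVERS):
    # Clients pick the server by hashing the key, so the cache is
    # partitioned across the aggregate spare memory of all nodes
    digest = hashlib.md5(key.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

# A cached database query result always maps to the same server,
# so any web frontend can find it without coordination
assert server_for("user:42:profile") == server_for("user:42:profile")
print(server_for("user:42:profile"))
```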
(Internet → web frontend servers (Memcached clients) → Memcached servers → database servers, connected by high-performance networks; each server node has CPUs, main memory, SSD, and HDD)
Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)
(Figure: number and percentage of clusters in the Top500 over time; both have grown steadily)
HPC and Advanced Interconnects
• High-Performance Computing (HPC) has adopted advanced interconnects and protocols
– InfiniBand
– 10 Gigabit Ethernet/iWARP
– RDMA over Converged Enhanced Ethernet (RoCE)
• Very Good Performance
– Low latency (few microseconds)
– High Bandwidth (100 Gb/s with dual FDR InfiniBand)
– Low CPU overhead (5-10%)
• The OpenFabrics software stack with IB, iWARP, and RoCE interfaces is driving HPC systems
• Many such systems in Top500 list
Trends of Networking Technologies in TOP500 Systems
(Figure: interconnect family – systems share; the percentage share of InfiniBand is steadily increasing)
• Message Passing Interface (MPI)
– Most commonly used programming model for HPC applications
• Parallel File Systems
• Storage Systems
• Almost 13 years of Research and Development since
InfiniBand was introduced in October 2000
• Other Programming Models are emerging to take
advantage of High-Performance Networks
– UPC
– OpenSHMEM
– Hybrid MPI+PGAS (OpenSHMEM and UPC)
Use of High-Performance Networks for Scientific Computing
One-way Latency: MPI over IB with MVAPICH2
(Figure: small-message latency of 0.99–1.82 us and large-message latency up to ~250 us across Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR)
DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
Bandwidth: MPI over IB with MVAPICH2
(Figure: unidirectional bandwidth of 1706–12810 MBytes/sec and bidirectional bandwidth of 3341–24727 MBytes/sec across the same set of adapters)
DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
Can High-Performance Interconnects Benefit Hadoop?
• Most of the current Hadoop environments use Ethernet infrastructure with Sockets
• Concerns for performance and scalability
• Usage of high-performance networks is beginning to draw interest from many companies
• What are the challenges?
• Where do the bottlenecks lie?
• Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)?
• Can HPC Clusters with high-performance networks be used for Big Data applications using Hadoop?
Designing Communication and I/O Libraries for Big Data Systems: Challenges
(Figure: applications run on Big Data middleware (HDFS, MapReduce, HBase and Memcached), which today uses sockets as its programming model — can other protocols be used? A communication and I/O library must address point-to-point communication, QoS, threaded models and synchronization, fault-tolerance, I/O and file systems, virtualization, and benchmarks, on top of commodity multi-/many-core systems with accelerators, networking technologies (InfiniBand, 1/10/40GigE and intelligent NICs), and storage technologies (HDD and SSD))
Common Protocols using OpenFabrics
(Figure: applications/middleware can use the sockets interface via kernel-space TCP/IP — over 1/10/40 GigE with an Ethernet driver, over hardware-offloaded 10/40 GigE-TOE, or over InfiniBand with IPoIB — or via user-space RSockets or SDP with RDMA on InfiniBand; alternatively, the verbs interface runs in user space with RDMA over iWARP, RoCE, or native InfiniBand adapters and switches)
Can Hadoop be Accelerated with High-Performance Networks and Protocols?
(Figure: current design — application → sockets → 1/10 GigE network; enhanced designs — application → accelerated sockets → verbs / hardware offload → 10 GigE or InfiniBand)
• Sockets not designed for high-performance
– Stream semantics often mismatch for upper layers (Hadoop and HBase)
– Zero-copy not available for non-blocking sockets
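The sockets overhead is easy to observe even without special hardware: a minimal TCP ping-pong over loopback (an illustrative Python sketch, not one of the benchmarks in this talk) already shows per-message software-stack costs far above the few-microsecond verbs latencies quoted earlier; the absolute number depends on the machine:

```python
import socket, threading, time

def echo_server(sock):
    # Accept one client and echo every 64-byte message back
    conn, _ = sock.accept()
    with conn:
        while data := conn.recv(64):
            conn.sendall(data)

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
threading.Thread(target=echo_server, args=(srv,), daemon=True).start()

cli = socket.create_connection(srv.getsockname())
cli.sendall(b"warmup"); cli.recv(64)            # warm up the connection
start = time.perf_counter()
for _ in range(1000):
    cli.sendall(b"x" * 64)                      # one round trip per message
    cli.recv(64)
rtt_us = (time.perf_counter() - start) / 1000 * 1e6
print(f"avg TCP loopback round trip: {rtt_us:.1f} us")
cli.close()
```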
(Our approach: application → OSU design → verbs interface → 10 GigE or InfiniBand)
Two Broad Approaches
• All components of Hadoop (HDFS, MapReduce, RPC, HBase, etc.) working on native IB verbs
• Hadoop Map-Reduce over verbs for existing HPC clusters with Lustre
• High-Performance Design of Hadoop over RDMA-enabled Interconnects
– High performance design with native InfiniBand and RoCE support at the verbs-
level for HDFS, MapReduce, and RPC components
– Easily configurable for native InfiniBand, RoCE and the traditional sockets-
based support (Ethernet and InfiniBand with IPoIB)
• Current release: 0.9.9 (03/31/14)
– Based on Apache Hadoop 1.2.1
– Compliant with Apache Hadoop 1.2.1 APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR and FDR)
• RoCE
• Various multi-core platforms
• Different file systems with disks and SSDs
– http://hadoop-rdma.cse.ohio-state.edu
RDMA for Apache Hadoop Project
Design Overview of HDFS with RDMA
• Enables high performance RDMA communication, while supporting traditional socket interface
• JNI Layer bridges Java based HDFS with communication library written in native code
(Figure: applications → HDFS; writes go through the OSU design — JNI → verbs → RDMA-capable networks (IB, 10GE/iWARP, RoCE) — while other operations use the Java socket interface over a 1/10 GigE or IPoIB network)
• Design Features
– RDMA-based HDFS write
– RDMA-based HDFS replication
– Parallel replication support
– On-demand connection setup
– InfiniBand/RoCE support
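A back-of-the-envelope model of the parallel-replication feature (illustrative Python with made-up numbers, not the actual design): pipelined replication forwards a block along a chain of DataNodes, adding a per-hop delay, while parallel replication lets the client push all replicas concurrently if its NIC can drive the fan-out:

```python
def pipeline_time(block_time, replicas):
    # Pipelined: client -> dn1 -> dn2 -> dn3; streaming overlaps the copies,
    # but each extra hop adds a forwarding delay (assumed 10% of the block
    # transfer time -- an invented number, purely for illustration)
    per_hop_delay = 0.1 * block_time
    return block_time + (replicas - 1) * per_hop_delay

def parallel_time(block_time, replicas, fanout_capacity=3):
    # Parallel: the client pushes all replicas concurrently; completion is
    # bounded by how many streams the NIC can drive at full rate (assumed)
    concurrent = min(replicas, fanout_capacity)
    return block_time * replicas / concurrent

print(pipeline_time(1.0, 3), parallel_time(1.0, 3))  # 1.2 1.0
```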
Communication Time in HDFS
• Cluster with HDD DataNodes
– 30% improvement in communication time over IPoIB (QDR)
– 56% improvement in communication time over 10GigE
• Similar improvements are obtained for SSD DataNodes
Reduced by 30%
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda ,
High Performance RDMA-Based Design of HDFS over InfiniBand , Supercomputing (SC), Nov 2012
(Figure: communication time (s) for 2–10 GB file sizes, comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR))
Evaluations using TestDFSIO
• Cluster with 8 HDD DataNodes, single disk per node
– 24% improvement over IPoIB (QDR) for 20GB file size
• Cluster with 4 SSD DataNodes, single SSD per node
– 61% improvement over IPoIB (QDR) for 20GB file size
(Figure: total TestDFSIO throughput (MBps) for 5–20 GB file sizes — left: cluster with 8 HDD nodes, 8 maps, IPoIB (QDR) vs OSU-IB (QDR); right: cluster with 4 SSD nodes, 4 maps, 10GigE vs IPoIB (QDR) vs OSU-IB (QDR))
Evaluations using TestDFSIO
• Cluster with 4 DataNodes, 1 HDD per node
– 24% improvement over IPoIB (QDR) for 20GB file size
• Cluster with 4 DataNodes, 2 HDD per node
– 31% improvement over IPoIB (QDR) for 20GB file size
• 2 HDD vs 1 HDD
– 76% improvement for OSU-IB (QDR)
– 66% improvement for IPoIB (QDR)
(Figure: total throughput (MBps) for 5–20 GB file sizes with 1 or 2 HDDs per node, comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR))
Evaluations using Enhanced DFSIO of Intel HiBench on SDSC-Gordon
• Cluster with 32 DataNodes, single SSD per node
– 47% improvement in throughput over IPoIB (QDR) for 128GB file size
– 16% improvement in latency over IPoIB (QDR) for 128GB file size
(Figure: aggregated throughput (MBps) and execution time (s) for 32–128 GB file sizes, IPoIB (QDR) vs OSU-IB (QDR))
Evaluations using Enhanced DFSIO of Intel HiBench on TACC-Stampede
• Cluster with 64 DataNodes, single HDD per node
– 64% improvement in throughput over IPoIB (FDR) for 256GB file size
– 37% improvement in latency over IPoIB (FDR) for 256GB file size
(Figure: aggregated throughput (MBps) and execution time (s) for 64–256 GB file sizes, IPoIB (FDR) vs OSU-IB (FDR))
Design Overview of MapReduce with RDMA
(Figure: applications → MapReduce (JobTracker, TaskTracker, Map, Reduce); the OSU design routes communication through JNI → verbs → RDMA-capable networks (IB, 10GE/iWARP, RoCE), while the default path uses the Java socket interface over a 1/10 GigE or IPoIB network)
• Enables high performance RDMA communication, while supporting traditional socket interface
• JNI Layer bridges Java based MapReduce with communication library written in native code
• Design Features
– RDMA-based shuffle
– Prefetching and caching of map outputs
– In-memory merge
– Advanced optimization in overlapping
• map, shuffle, and merge
• shuffle, merge, and reduce
– On-demand connection setup
– InfiniBand/RoCE support
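The in-memory merge step can be pictured with Python's heapq.merge (an illustrative stand-in for the design's native-code merge, with made-up data): sorted map-output segments fetched during the shuffle are merged into one sorted stream without spilling to disk, which is what lets merge overlap with shuffle and reduce:

```python
import heapq

# Each map task ships a sorted run of (key, value) pairs to the reducer
segment_a = [("apple", 1), ("cherry", 2)]
segment_b = [("banana", 3), ("cherry", 4)]

# In-memory merge: combine the sorted runs as they arrive, so the reduce
# phase can start consuming keys before all segments have landed on disk
merged = list(heapq.merge(segment_a, segment_b))
print(merged)
# [('apple', 1), ('banana', 3), ('cherry', 2), ('cherry', 4)]
```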
Evaluations using Sort
8 HDD DataNodes
• With 8 HDD DataNodes for 40GB sort
– 41% improvement over IPoIB (QDR)
– 39% improvement over UDA-IB (QDR)
M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC Workshop, held in conjunction with IPDPS, May 2013.
8 SSD DataNodes
• With 8 SSD DataNodes for 40GB sort
– 52% improvement over IPoIB (QDR)
– 44% improvement over UDA-IB
(QDR)
(Figure: job execution time (sec) for 25–40 GB sorts on the 8-HDD and 8-SSD clusters, comparing 10GigE, IPoIB (QDR), UDA-IB (QDR), and OSU-IB (QDR))
Evaluations using TeraSort
• 100 GB TeraSort with 8 DataNodes with 2 HDD per node
– 51% improvement over IPoIB (QDR)
– 41% improvement over UDA-IB (QDR)
(Figure: job execution time (sec) for 60–100 GB TeraSort, comparing 10GigE, IPoIB (QDR), UDA-IB (QDR), and OSU-IB (QDR))
• For 240GB Sort in 64 nodes
– 40% improvement over IPoIB (QDR)
with HDD used for HDFS
Performance Evaluation on Larger Clusters
(Figure: Sort in the OSU cluster — job execution time (sec) for 60/120/240 GB on 16/32/64 nodes, comparing IPoIB (QDR), UDA-IB (QDR), and OSU-IB (QDR); TeraSort in TACC Stampede — 80/160/320 GB on 16/32/64 nodes, comparing IPoIB (FDR), UDA-IB (FDR), and OSU-IB (FDR))
• For 320GB TeraSort in 64 nodes
– 38% improvement over IPoIB
(FDR) with HDD used for HDFS
• 50% improvement in Self Join over IPoIB (QDR) for 80 GB data size
• 49% improvement in Sequence Count over IPoIB (QDR) for 30 GB data size
Evaluations using PUMA Workload
(Figure: normalized execution time for AdjList (30GB), SelfJoin (80GB), SeqCount (30GB), WordCount (30GB), and InvertIndex (30GB), comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR))
• 50 small MapReduce jobs executed in a cluster size of 4
• Maximum performance benefit 24% over IPoIB (QDR)
• Average performance benefit 13% over IPoIB (QDR)
Evaluations using SWIM
(Figure: per-job execution time (sec) for the 50 SWIM jobs, IPoIB (QDR) vs OSU-IB (QDR))
Motivation – Detailed Analysis on HBase Put/Get
• HBase 1KB Put – communication time: 8.9 us
– A factor of 6X improvement over 10GE for communication time
• HBase 1KB Get – communication time: 8.9 us
– A factor of 6X improvement over 10GE for communication time
M. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, ISPASS’12
(Figure: time (us) for HBase 1KB Put and Get on IPoIB (DDR), 10GigE, and OSU-IB (DDR), broken down into communication, communication preparation, server processing, server serialization, client processing, and client serialization)
HBase Single-Server-Multi-Client Results
• HBase Get latency
– 4 clients: 104.5 us; 16 clients: 296.1 us
• HBase Get throughput
– 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec
• 27% improvement in throughput for 16 clients over 10GE
(Figure: Get latency (us) and throughput (ops/sec) for 1–16 clients, comparing 10GigE, IPoIB (DDR), and OSU-IB (DDR))
J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS’12
HBase – YCSB Read-Write Workload
• HBase Get latency (Yahoo! Cloud Service Benchmark)
– 64 clients: 2.0 ms; 128 clients: 3.5 ms
– 42% improvement over IPoIB for 128 clients
• HBase Put latency
– 64 clients: 1.9 ms; 128 clients: 3.5 ms
– 40% improvement over IPoIB for 128 clients
(Figure: YCSB read and write latency (us) for 8–128 clients, comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR))
Design Overview of Hadoop RPC with RDMA
(Figure: applications → Hadoop RPC; the default path uses the Java socket interface over a 1/10 GigE or IPoIB network, while the OSU design goes through JNI → verbs → RDMA-capable networks (IB, 10GE/iWARP, RoCE))
• Design Features
– JVM-bypassed buffer management
– RDMA or send/recv based adaptive communication
– Intelligent buffer allocation and adjustment for serialization
– On-demand connection setup
– InfiniBand/RoCE support
• Enables high performance RDMA communication, while supporting traditional socket interface
• JNI Layer bridges Java based RPC with communication library written in native code
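The "RDMA or send/recv based adaptive communication" feature follows a pattern common in MPI libraries: eager send/recv for small messages, rendezvous with RDMA for large ones. A hedged sketch of that selection logic (the 16 KB cutoff is invented for illustration, not OSU's actual threshold):

```python
EAGER_THRESHOLD = 16 * 1024  # bytes; illustrative cutoff, not the real value

def choose_protocol(msg_size):
    # Small payloads: copy into a pre-registered buffer and use send/recv
    # (eager) to avoid the handshake cost of setting up an RDMA transfer.
    # Large payloads: register the user buffer and let the NIC perform a
    # zero-copy RDMA read/write (rendezvous).
    return "eager-send/recv" if msg_size <= EAGER_THRESHOLD else "rendezvous-RDMA"

print(choose_protocol(512), choose_protocol(1 << 20))
# eager-send/recv rendezvous-RDMA
```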
Hadoop RPC with RDMA - Gain in Latency and Throughput
• Hadoop RPC with RDMA PingPong Latency
– 1 byte: 30 us; 4 KB: 38 us
– 59%-62% and 60%-63% improvements compared with the performance on 10 GigE
and IPoIB (32Gbps), respectively
• Hadoop RPC with RDMA Throughput
– 512 bytes & 56 clients: 170.63 Kops/sec
– 129% and 107% improvements compared with the peak performance on 10 GigE
and IPoIB (32Gbps), respectively
(Figure: PingPong latency (us) for 1 byte–4 KB payloads and throughput (Kops/sec) for 8–64 clients, comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR))
• Testbeds
– OSU RI Cluster (HDD/SSD, QDR, 10GigE)
– TACC-Stampede (HDD, FDR)
– SDSC-Gordon (SSD, QDR)
• Benchmarks
– TestDFSIO
– RandomWriter & Sort
– TeraGen & TeraSort
• Benefits of New Features
– Parallel Replication vs. Pipeline Replication
Performance Benefits of RDMA for Apache Hadoop 0.9.9
(Figure: total throughput (MBps) for 5–20 GB file sizes, comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR))
• For Latency,
– 13% improvement over IPoIB, 18%
improvement over 10GigE for 5 GB file size
– 27% improvement over IPoIB, 30%
improvement over 10GigE for 20 GB file size
Performance Benefits – TestDFSIO (SSD)
Cluster with 4 Nodes, TestDFSIO with a total of 16 maps
(Figure: job execution time (sec) for 5–20 GB file sizes, comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR))
• For Throughput,
– 47% improvement over IPoIB, 64%
improvement over 10GigE for 5 GB file size
– 43% improvement over IPoIB, 58%
improvement over 10GigE for 20 GB file size
Reduced by 27% Increased by 43%
• For Latency,
– 36-38% improvement over
IPoIB for 80-120 GB file size
Performance Benefits – TestDFSIO in TACC Stampede
Cluster with 32 Nodes with HDD, TestDFSIO with a total 128 maps
(Figure: job execution time (sec) and total throughput (MBps) for 80–120 GB file sizes, IPoIB (FDR) vs OSU-IB (FDR))
• For Throughput,
– 60% improvement over
IPoIB for 80-120 GB file size
Reduced by 36%
Increased by 60%
(Figure: job execution time (sec) for RandomWriter and Sort with 50/100/200/300 GB on clusters of 16/32/48/64 nodes, IPoIB (QDR) vs OSU-IB (QDR))
Performance Benefits – RandomWriter & Sort in SDSC-Gordon
• RandomWriter in a cluster of single SSD per
node
– 16% improvement over IPoIB for 50GB in
a cluster of 16
– 20% improvement over IPoIB for 300GB in
a cluster of 64
Reduced by 20% Reduced by 36%
• Sort in a cluster of single SSD per node
– 20% improvement over IPoIB for
50GB in a cluster of 16
– 36% improvement over IPoIB for
300GB in a cluster of 64
(Figure: job execution time (sec) for 80–120 GB runs, IPoIB (FDR) vs OSU-IB (FDR))
Performance Benefits – TeraGen & TeraSort in TACC-Stampede
Cluster with 16 Nodes with a total of 64 maps and 32 reduces
• TeraGen with single HDD per
node
– 17% improvement over IPoIB
for 120GB data
• TeraSort with single HDD per
node
– 38% improvement over IPoIB
for 120GB data
Reduced by 17% Reduced by 38%
(Figure: total throughput (MBps) and job execution time (sec) for 20–60 GB file sizes, comparing 10GigE, IPoIB (QDR), OSU-IB-Pipeline (QDR), and OSU-IB-Parallel (QDR))
• For Latency,
– 11% improvement over OSU-IB-Pipeline for
20 GB file size
– 8% improvement over OSU-IB-Pipeline for 60
GB file size
Benefits of Parallel Replication – TestDFSIO
Cluster with 8 Nodes, TestDFSIO with a total of 32 maps
• For Throughput,
– 12% improvement over OSU-IB-Pipeline for
20 GB file size
– 10% improvement over OSU-IB-Pipeline for
60 GB file size
Reduced by 8%
Increased by 10%
• For 500GB Sort in 64 nodes
– 44% improvement over IPoIB (FDR)
Performance Improvement Potential of MapReduce over Lustre on HPC Clusters – An Initial Study
Sort in TACC Stampede Sort in SDSC Gordon
• For 100GB Sort in 16 nodes
– 33% improvement over IPoIB (QDR)
(Figure: Sort job execution time (sec) — TACC Stampede: 300–500 GB, IPoIB (FDR) vs OSU-IB (FDR); SDSC Gordon: 60–100 GB, IPoIB (QDR) vs OSU-IB (QDR))
• Can more optimizations be achieved by leveraging more features of Lustre?
Memcached-RDMA Design
• Server and client perform a negotiation protocol
– Master thread assigns clients to appropriate worker thread
• Once a client is assigned a verbs worker thread, it can communicate directly and is “bound” to that thread
• All other Memcached data structures are shared among RDMA and Sockets worker threads
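The client-to-worker binding can be sketched as follows (a simplified single-process model in illustrative Python, not the actual Memcached-RDMA code; "verbs" here just labels a worker pool, and the round-robin choice is an assumption):

```python
import queue

class WorkerPool:
    def __init__(self, kinds=("sockets", "verbs"), workers_per_kind=2):
        # One queue per worker; the master thread only routes, never serves
        self.workers = {k: [queue.Queue() for _ in range(workers_per_kind)]
                        for k in kinds}
        self.next_idx = {k: 0 for k in kinds}

    def assign(self, client_id, transport):
        # Master thread: bind each new client to one worker of the matching
        # transport (round-robin); all later requests from that client go
        # directly to its bound worker, bypassing the master
        idx = self.next_idx[transport]
        self.next_idx[transport] = (idx + 1) % len(self.workers[transport])
        self.workers[transport][idx].put(client_id)
        return idx

pool = WorkerPool()
print(pool.assign("c1", "verbs"), pool.assign("c2", "verbs"))  # 0 1
```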
(Figure: sockets clients and RDMA clients contact the master thread, which binds them to sockets worker threads or verbs worker threads respectively; the memory slabs and items are shared data across all workers)
• Memcached Get latency
– 4 bytes RC/UD – DDR: 6.82/7.55 us; QDR: 4.28/4.86 us
– 2K bytes RC/UD – DDR: 12.31/12.78 us; QDR: 8.19/8.46 us
• Almost factor of four improvement over 10GE (TOE) for 2K bytes on
the DDR cluster
Intel Clovertown Cluster (IB: DDR) Intel Westmere Cluster (IB: QDR)
0
10
20
30
40
50
60
70
80
90
100
1 2 4 8 16 32 64 128 256 512 1K 2K
Tim
e (
us)
Message Size
0
10
20
30
40
50
60
70
80
90
100
1 2 4 8 16 32 64 128 256 512 1K 2K
Tim
e (
us)
Message Size
IPoIB (QDR)
10GigE (QDR)
OSU-RC-IB (QDR)
OSU-UD-IB (QDR)
Memcached Get Latency (Small Message)
HPC Advisory Council Switzerland Conference, Apr '14
Tim
e%(us)%
Message%Size%
Tim
e%(us)%
Message%Size%
Tim
e%(us)%
Message%Size%
Time%(us)%
Message%Size%
Tim
e%(us)%
Message%Size%
Tim
e%(us)%
Message%Size%
Tim
e%(us)%
Message%Size%
Time%(us)%
Message%Size%
IPoIB (DDR)
10GigE (DDR)
OSU-RC-IB (DDR)
OSU-UD-IB (DDR)
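Latency numbers like the ones above are typically obtained by timing many repeated GETs after a warm-up phase and averaging. A minimal Python sketch of such a harness; the helper name `measure_get_latency` is hypothetical, and a plain dict stands in for a real memcached client (e.g., pymemcache's `Client.get`).

```python
import statistics
import time

def measure_get_latency(get_fn, key, iters=1000, warmup=100):
    """Average latency of get_fn(key) in microseconds, after a warm-up."""
    for _ in range(warmup):
        get_fn(key)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        get_fn(key)
        samples.append((time.perf_counter() - t0) * 1e6)
    return statistics.mean(samples)

# A dict stands in for a real client; against a live server one would pass
# something like pymemcache's Client(("host", 11211)).get instead.
store = {"k": b"v" * 4}                 # 4-byte value, as in the experiments
lat_us = measure_get_latency(store.get, "k")
```

With an in-process dict the reported latency reflects only the measurement loop; against a real server it captures the full client-network-server round trip that the slides compare across transports.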
HPC Advisory Council Switzerland Conference, Apr '14 62
Memcached Get TPS (4byte)
• Memcached Get transactions per second for 4 bytes
– On IB QDR 1.4M/s (RC), 1.3 M/s (UD) for 8 clients
• Significant improvement with native IB QDR compared to SDP and IPoIB
0
200
400
600
800
1000
1200
1400
1600
1 2 4 8 16 32 64 128 256 512 800 1K
Tho
usa
nd
s o
f Tr
ansa
ctio
ns
pe
r se
con
d (
TPS)
No. of Clients
0
200
400
600
800
1000
1200
1400
1600
4 8
Tho
usa
nd
s o
f Tr
ansa
ctio
ns
pe
r se
con
d (
TPS)
No. of Clients
Tim
e%(us)%
Message%Size%
Tim
e%(us)%
Message%Size%
Time%(us)%
Message%Size%
IPoIB (QDR)
OSU-RC-IB (QDR)
OSU-UD-IB (QDR)
Memcached Performance (FDR Interconnect)
• Memcached Get latency
– 4 bytes – OSU-IB: 2.84 us; IPoIB: 75.53 us
– 2K bytes – OSU-IB: 4.49 us; IPoIB: 123.42 us
– Latency reduced by nearly 20X
• Memcached throughput (4 bytes)
– 4080 clients – OSU-IB: 556 Kops/sec; IPoIB: 233 Kops/sec
– Nearly 2X improvement in throughput
• Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR)
[Figures: Memcached GET latency – Time (us, log scale) vs. Message Size (1 byte – 4K); Memcached throughput – Thousands of Transactions per Second (TPS) vs. No. of Clients (16 – 4080); series: OSU-IB (FDR) and IPoIB (FDR)]
Application Level Evaluation – Olio Benchmark
• Olio Benchmark
– RC: 1.6 sec, UD: 1.9 sec, Hybrid: 1.7 sec for 1024 clients
• 4X better than IPoIB for 8 clients
• Hybrid design achieves performance comparable to the pure RC design
[Figures: Time (ms) vs. No. of Clients (1 – 1024); series: IPoIB (QDR), OSU-RC-IB (QDR), OSU-UD-IB (QDR), OSU-Hybrid-IB (QDR)]
Application Level Evaluation – Real Application Workloads
• Real Application Workload
– RC: 302 ms, UD: 318 ms, Hybrid: 314 ms for 1024 clients
• 12X better than IPoIB for 8 clients
• Hybrid design achieves performance comparable to the pure RC design
[Figures: Time (ms) vs. No. of Clients (1 – 1024); series: IPoIB (QDR), OSU-RC-IB (QDR), OSU-UD-IB (QDR), OSU-Hybrid-IB (QDR)]
J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP '11
J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached Design for InfiniBand Clusters using Hybrid Transport, CCGrid '12
Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges
[Diagram: layered stack – Applications; Big Data Middleware (HDFS, MapReduce, HBase and Memcached); Programming Models (Sockets; other protocols?); Communication and I/O Library (Point-to-Point Communication, QoS, Threaded Models and Synchronization, Fault-Tolerance, I/O and File Systems, Virtualization, Benchmarks, RDMA Protocol); Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs); Commodity Computing System Architectures (multi- and many-core architectures and accelerators); Storage Technologies (HDD and SSD). Annotation: upper-level changes?]
More Challenges
• Multi-threading and Synchronization
– Multi-threaded model exploration
– Fine-grained synchronization and lock-free designs
– Unified helper threads for different components
– Multi-endpoint designs to support multi-threaded communication
• QoS and Virtualization
– Network virtualization and locality-aware communication for Big Data middleware
– Hardware-level virtualization support for end-to-end QoS
– I/O scheduling and storage virtualization
– Live migration
More Challenges (Cont'd)
• Support for Accelerators
– Efficient designs for Big Data middleware to take advantage of NVIDIA GPGPUs and Intel MICs
– Offload computation-intensive workloads to accelerators
– Explore maximum overlap between communication and offloaded computation
• Fault Tolerance Enhancements
– Novel data replication schemes for high-efficiency fault tolerance
– Exploration of lightweight fault tolerance mechanisms for Big Data processing
• Support for Parallel File Systems
– Optimize Big Data middleware over parallel file systems (e.g., Lustre) on modern HPC clusters
Are the Current Benchmarks Sufficient for Big Data?
• The current benchmarks capture some performance behavior
• However, they do not tell the designer/developer:
– What is happening at the lower layer?
– Where are the benefits coming from?
– Which design leads to benefits or bottlenecks?
– Which component in the design needs to be changed, and what will its impact be?
– Can performance gains/losses at the lower layer be correlated with the gains/losses observed at the upper layer?
Challenges in Benchmarking of RDMA-based Designs
[Diagram: the same layered stack (Applications down to networking and storage technologies), annotated to show that current benchmarks exist only at the application level, no benchmarks exist at the communication/RDMA-protocol level, and the correlation between the two layers is an open question]
OSU MPI Micro-Benchmarks (OMB) Suite
• A comprehensive suite of benchmarks to
– Compare performance of different MPI libraries on various networks and systems
– Validate low-level functionality
– Provide insights into the underlying MPI-level designs
• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth
• Extended later to
– MPI-2 one-sided operations
– Collectives
– GPU-aware data movement
– OpenSHMEM (point-to-point and collectives)
– UPC
• Has become an industry standard
• Extensively used for the design and development of MPI libraries, performance comparison of MPI libraries, and even in procurement of large-scale systems
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Also available in integrated form with the MVAPICH2 stack
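The send-recv latency test mentioned above follows the classic ping-pong pattern: after a warm-up, one side sends a message of a given size, the other echoes it back, and half the averaged round-trip time is reported as the one-way latency. A rough Python sketch of that pattern over a local socket pair (illustrative only; the real osu_latency uses MPI_Send/MPI_Recv between two ranks):

```python
import socket
import threading
import time

def _recv_exact(sock, size):
    """Receive exactly `size` bytes from a stream socket."""
    buf = b""
    while len(buf) < size:
        buf += sock.recv(size - len(buf))
    return buf

def ping_pong_latency(size, iters=1000, warmup=100):
    """One-way latency in microseconds for a `size`-byte message."""
    left, right = socket.socketpair()
    msg = b"x" * size

    def echo():                          # the "other rank": bounce back
        for _ in range(iters + warmup):
            right.sendall(_recv_exact(right, size))

    t = threading.Thread(target=echo)
    t.start()
    for _ in range(warmup):              # warm up before timing
        left.sendall(msg)
        _recv_exact(left, size)
    t0 = time.perf_counter()
    for _ in range(iters):
        left.sendall(msg)
        _recv_exact(left, size)
    elapsed = time.perf_counter() - t0
    t.join()
    left.close()
    right.close()
    return elapsed / iters / 2 * 1e6     # half the averaged round trip

if __name__ == "__main__":
    for size in (1, 1024):
        print(f"{size:5d} bytes: {ping_pong_latency(size):8.2f} us")
```

Sweeping the message size, as OMB does, exposes where a transport switches from latency-bound to bandwidth-bound behavior.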
Iterative Process – Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications
[Diagram: the same layered stack, with an iterative loop between applications-level benchmarks at the top and micro-benchmarks at the communication/RDMA-protocol level]
Future Plans of OSU Big Data Project
• Upcoming RDMA for Apache Hadoop releases will support
– HBase
– Hadoop 2.0.x
• Memcached-RDMA
• Exploration of other components (threading models, QoS, virtualization, accelerators, etc.)
• Advanced designs with upper-level changes and optimizations
• Comprehensive set of benchmarks
• More details at http://hadoop-rdma.cse.ohio-state.edu
Concluding Remarks
• Presented an overview of Big Data, Hadoop and Memcached
• Provided an overview of networking technologies
• Discussed challenges in accelerating Hadoop
• Presented initial designs that take advantage of InfiniBand/RDMA for HDFS, MapReduce, RPC, HBase and Memcached
• Results are promising
• Many other open issues still need to be solved
• This work will enable the Big Data community to take advantage of modern HPC technologies to carry out analytics in a fast and scalable manner
Personnel Acknowledgments
Current Students: S. Chakraborthy (Ph.D.), N. Islam (Ph.D.), J. Jose (Ph.D.), M. Li (Ph.D.), S. Potluri (Ph.D.), R. Rajachandrasekhar (Ph.D.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), R. Shir (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)
Past Students: P. Balaji (Ph.D.), D. Buntinas (Ph.D.), S. Bhagvat (M.S.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), S. Kini (M.S.), M. Koop (Ph.D.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), K. Kandalla (Ph.D.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), H. Subramoni (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Current Post-Docs: X. Lu, K. Hamidouche
Past Post-Docs: H. Wang, X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne
Current Programmers: M. Arnold, J. Perkins
Past Programmers: D. Bureddy
Past Research Scientist: S. Sur
Current Senior Research Associate: H. Subramoni
Pointers
http://nowlab.cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
http://mvapich.cse.ohio-state.edu
http://hadoop-rdma.cse.ohio-state.edu