Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for
Modern HPC Clusters
Dipti Shankar
Dr. Dhabaleswar K. Panda (Advisor), Dr. Xiaoyi Lu (Co-Advisor)
Department of Computer Science & Engineering, The Ohio State University
Outline
• Introduction and Problem Statement
• Research Highlights and Results
• Broader Impact
• Conclusion & Future Avenues
Key-Value Storage in HPC and Data Centers
• General-purpose distributed memory-centric storage
– Allows aggregating spare memory from multiple nodes (e.g., Memcached)
• Accelerating online and offline analytics in High-Performance Computing (HPC) environments
• Our basis: current high-performance and hybrid key-value stores for modern HPC clusters
– High-performance network interconnects (e.g., InfiniBand): low end-to-end latencies with IP-over-InfiniBand (IPoIB) and Remote Direct Memory Access (RDMA)
– 'DRAM+SSD' hybrid memory designs: extend storage capacity beyond DRAM using high-speed SSDs
[Figure: Typical deployment in which web frontend servers (Memcached clients) connect over high-performance networks to Memcached servers, which in turn front the database servers; target use cases are online analytical workloads (OLTP/NoSQL query cache) and offline analytical workloads (software-assisted burst buffer).]
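For context, the query-cache use case above follows the standard look-aside pattern. Below is a minimal sketch of that pattern using the stock (blocking) Libmemcached API; query_database() and the cached key are hypothetical placeholders, not part of the presented work.

```c
/*
 * Minimal look-aside cache sketch with stock Libmemcached (blocking APIs).
 * query_database() is a hypothetical stand-in for the backend database call.
 */
#include <libmemcached/memcached.h>
#include <string.h>

char *query_database(const char *key);   /* hypothetical backend access */

char *cached_lookup(memcached_st *memc, const char *key)
{
    size_t value_len = 0;
    uint32_t flags = 0;
    memcached_return_t rc;

    /* 1. Try the distributed key-value cache first. */
    char *value = memcached_get(memc, key, strlen(key),
                                &value_len, &flags, &rc);
    if (rc == MEMCACHED_SUCCESS)
        return value;                     /* cache hit */

    /* 2. Cache miss: fetch from the database and populate the cache. */
    value = query_database(key);
    memcached_set(memc, key, strlen(key), value, strlen(value),
                  (time_t)0, (uint32_t)0);
    return value;
}
```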
Drivers of Modern HPC Cluster Architectures
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (e.g., PCIe/NVMe-SSDs), NVRAM (e.g., PCM, 3D XPoint), Parallel Filesystems (e.g., Lustre)
• Accelerators (e.g., NVIDIA GPGPUs)
• Production-scale HPC Clusters: SDSC Comet, TACC Stampede, OSC Owens, etc.
[Figure: Modern HPC node architecture: multi-core processors with vectorization support + accelerators (GPUs), large-memory nodes (DRAM), SSD/NVMe-SSD/NVRAM, and high-performance interconnects.]
Holistic Approach for HPC-Centric KVS
• Our focus: a holistic approach to designing high-performance and resilient key-value storage for current and emerging HPC systems
– RDMA-enabled networking
– Hybrid NVM storage-awareness: High-speed NVMe-/PCIe-SSDs and upcoming
NVRAM technologies
– Heterogeneous compute devices (e.g., SIMD capabilities of CPUs and GPUs)
• Goals: Maximize end-to-end performance to
– Optimally exploit available compute/storage/network capabilities
– Enable data-intensive Big Data applications to leverage memory-centric key-value
storage to improve their overall performance
Research Framework
• Application workloads
– Online (e.g., SQL/NoSQL query cache, LinkBench and TAO graph KV workloads)
– Offline (e.g., burst buffer over PFS for Hadoop MapReduce/Spark)
• High-performance, hybrid, and resilient key-value storage
– Non-blocking RDMA-aware API extensions
– Fast online erasure coding (reliability)
– NVRAM-aware communication protocols (persistence)
– Accelerations on SIMD-based (e.g., GPU) architectures (scalability)
– Heterogeneity-aware (DRAM/NVRAM/NVMe/PFS) key-value storage engine
• Modern HPC system architecture
– Volatile/non-volatile memory technologies (DRAM, NVRAM)
– Multi-core nodes (w/ SIMD units) and GPUs
– RDMA-capable networks (InfiniBand, 40GbE RoCE)
– Parallel file system (Lustre) and local storage (PCIe-SSD/NVMe-SSD)
High-Performance Non-Blocking API Semantics
• Heterogeneous storage-aware key-value stores (e.g., 'DRAM + PCIe/NVMe-SSD')
– Higher data retention at the cost of SSD I/O; suitable for out-of-memory scenarios
– Performance limited by blocking API semantics
• Goal: achieve near in-memory speeds while still exploiting hybrid memory
• Approach: novel non-blocking API semantics extending the RDMA-Libmemcached library
– memcached_(iset/iget/bset/bget) APIs for SET/GET
– memcached_(test/wait) APIs for progressing communication
D. Shankar, X. Lu, N. Islam, M. W. Rahman, and D. K. Panda, "High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits", IPDPS 2016
[Figure: Blocking vs. non-blocking API flows between a client (Libmemcached library + RDMA-enhanced communication library) and a server (RDMA-enhanced communication library + hybrid slab manager (RAM+SSD)); the non-blocking variants (NonB-i, NonB-b) complete requests and replies via Test/Wait calls.]
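To make the proposed semantics concrete, here is a minimal sketch of how a client might overlap work with a hybrid-store SET. Only the function names (memcached_iset/iget/bset/bget, memcached_test/wait) come from the slide; the request-handle type and the exact signatures below are assumptions for illustration.

```c
/*
 * Hypothetical usage sketch of the proposed non-blocking RDMA-Libmemcached
 * extensions. memcached_request_st and the argument lists are assumed; only
 * the API names are taken from the presentation.
 */
#include <libmemcached/memcached.h>
#include <string.h>

void do_other_work(void);                  /* application-side computation */

void overlapped_set(memcached_st *memc)
{
    const char *key = "row:42";
    const char *val = "cached-query-result";
    memcached_request_st req;              /* assumed request handle */

    /* Issue the SET and return immediately, without waiting for the
     * RDMA transfer or any SSD write triggered by an out-of-memory slab. */
    memcached_iset(memc, key, strlen(key), val, strlen(val),
                   (time_t)0, (uint32_t)0, &req);

    do_other_work();                        /* overlap computation with I/O */

    /* Progress and complete the request before reusing its buffers;
     * memcached_wait() would block instead of polling. */
    while (memcached_test(memc, &req) != MEMCACHED_SUCCESS)
        ;
}
```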
High-Performance Non-Blocking API Semantics
• Set/Get latency with non-blocking APIs: up to 8x improvement in overall latency vs. blocking API semantics over the RDMA+SSD hybrid design
• Up to 2.5x throughput gain observed at the client; the ability to overlap request and response phases hides SSD I/O overheads
[Figure: Average Set/Get latency (us) for IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b, broken down into miss penalty (backend DB access overhead), client wait, server response, cache update, cache check + load (memory and/or SSD read), and slab allocation (w/ SSD write on out-of-memory); the non-blocking designs overlap SSD I/O access and approach in-memory latency.]
Fast Online Erasure Coding with RDMA
• Erasure Coding (EC): a low storage-overhead alternative to replication
• Bottlenecks: (1) encode/decode compute overheads; (2) communication overhead of scattering/gathering distributed data/parity chunks
• Goal: make online EC viable for key-value stores
• Approach: non-blocking RDMA-aware semantics to enable compute/communication overlap
• Encode/decode offload capabilities integrated into the Memcached client (CE/CD) and server (SE/SD)
[Figure: EC performance vs. memory-overhead trade-off: replication is high-performance but memory-heavy, EC is memory-efficient, and the ideal design combines high performance with low memory overhead (no fault tolerance occupies the remaining corner). RS(3,2) incurs 66% storage overhead vs. 200% for replication factor 3.]
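As a quick sanity check on these storage-overhead numbers (simple arithmetic, not taken from the slides): RS(k=3, m=2) stores m = 2 parity chunks for every k = 3 data chunks, so the extra space is m/k = 2/3, i.e., about 66% of the original data, while tolerating the loss of any 2 chunks. Replication with factor 3 stores 2 extra full copies, i.e., 200% extra space, for the same tolerance of 2 failures.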
Fast Online Erasure Coding with RDMA
D. Shankar, X. Lu, and D. K. Panda, "High-Performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads", 37th International Conference on Distributed Computing Systems (ICDCS 2017)
[Figures: (1) Total Set latency (us) vs. value size (512B to 1M) on an Intel Westmere + QDR-IB cluster, comparing Sync-Rep=3, Async-Rep=3, Era(3,2)-CE-CD, Era(3,2)-SE-SD, and Era(3,2)-SE-CD (Rep=3 vs. RS(3,2)); online EC is roughly 1.6x to 2.8x faster than replication. (2) Aggregate throughput (Kops/sec) for YCSB-A and YCSB-B with 4K/16K/32K values, comparing Memc-RDMA-NoRep, Async-Rep=3, Era(3,2)-CE-CD, and Era(3,2)-SE-CD, with gains of roughly 1.34x to 1.5x.]
• Experiments with YCSB comparing online EC vs. asynchronous replication:
– 150 clients on 10 nodes of the SDSC Comet cluster (IB FDR + 24-core Intel Haswell) over a 5-node RDMA-Memcached cluster
– (1) CE-CD gains ~1.34x for update-heavy workloads, with SE-CD on par; (2) CE-CD and SE-CD are on par for read-heavy workloads
Co-Designing Key-Value Store-based Burst Buffer over PFS
• Offline data analytics use case: hybrid and resilient key-value store-based burst-buffer system over Lustre (Boldio)
– Overcomes local storage limitations on HPC nodes while retaining the performance of data locality
– Lightweight, transparent interface to Hadoop/Spark applications
• Accelerating I/O-intensive Big Data workloads
– Non-blocking RDMA-Libmemcached APIs to maximize overlap
– Client-based replication or Online Erasure Coding with RDMA for resilience
– Asynchronous persistence to Lustre parallel file system at RDMA-Memcached Servers
D. Shankar, X. Lu, and D. K. Panda, "Boldio: A Hybrid and Resilient Burst-Buffer over Lustre for Accelerating Big Data I/O", IEEE International Conference on Big Data 2016 (Short Paper)
Co-Designing Key-Value Store-based Burst Buffer over PFS
• TestDFSIO on the SDSC Gordon cluster (16-core Intel Sandy Bridge, IB QDR) with a 16-node MapReduce cluster + 4-node Boldio cluster
• Boldio sustains 3x and 6.7x gains in read and write throughput, respectively, over stand-alone Lustre
[Figures: (1) Aggregate TestDFSIO throughput (MBps) for 60 GB and 100 GB write/read workloads, Lustre-Direct vs. Boldio, showing the 6.7x write and 3x read gains. (2) DFSIO aggregate throughput (MBps) for write/read, comparing Lustre-Direct, Alluxio-Remote, Boldio_Async-Rep=3, and Boldio_Online-EC; online EC adds no visible overhead (Rep=3 vs. RS(3,2)).]
• TestDFSIO on Intel Westmere Cluster (8-core Intel Sandy Bridge and IB QDR); 8-node MapReduce Cluster + 5-node Boldio Cluster over Lustre
• Performance gains over designs like Alluxio (formerly Tachyon) in HPC environments with no local storage
Exploring Opportunities with NVRAM and RDMA
[Figure: Initiator and target nodes (cores with L1/L2, shared L3, memory controller, DRAM/NVRAM, I/O controller, and HCA on the PCIe bus); the durability path for RDMA writes into and reads from NVRAM ends at the target's local durability point.]
• Emerging non-volatile memory technologies (NVRAM)
• Potential: Byte-addressable and persistent; capable of RDMA
• Observation: RDMA writes into NVRAM need to guarantee remote durability
• Appliance Method: Hardware-Assisted remote direct persistence
• General Purpose Server Method: Server-Assisted software-based persistence
• Opportunities: designing high-performance RDMA-based persistence protocols (e.g., a persistent in-memory KVS over NVRAM)
D. Shankar, X. Lu, and D. K. Panda, "RDMP-KVS: Remote Direct Memory Persistence Aware Key-Value Store for NVRAM Systems" (Under Submission)
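As an illustration of the general-purpose server method (server-assisted, software-based persistence), the sketch below shows what a target-side completion handler might do after an RDMA write lands in an NVRAM-backed buffer. It uses PMDK's libpmem for the flush step; the surrounding handler and send_durability_ack() are hypothetical placeholders, and this is an assumption-based sketch rather than the RDMP-KVS design itself.

```c
/*
 * Sketch of server-assisted remote persistence on the target node.
 * Assumes the NVRAM region was mapped with pmem_map_file() (PMDK libpmem);
 * send_durability_ack() is a hypothetical placeholder for the server's
 * RDMA transport layer.
 */
#include <libpmem.h>
#include <stddef.h>

void send_durability_ack(void *addr, size_t len);   /* hypothetical */

void on_rdma_write_completion(void *nvram_dst, size_t len, int is_pmem)
{
    /* The RDMA write has reached the target's memory but may still sit in
     * CPU caches; flush it to the persistence domain before acknowledging. */
    if (is_pmem)
        pmem_persist(nvram_dst, len);      /* CLWB/CLFLUSHOPT + fence */
    else
        (void)pmem_msync(nvram_dst, len);  /* fall back to msync() */

    /* Only after the flush is the value durable across power failure,
     * so only now report completion back to the initiator. */
    send_durability_ack(nvram_dst, len);
}
```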
Broader Impact: The High-Performance Big Data (HiBD) Project
• RDMA for Apache Spark (RDMA-Spark), RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x), RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
– RDMA-aware `DRAM+SSD’ hybrid Memcached server design
– Non-Blocking RDMA-based Client API designs (RDMA-Libmemcached)
– Based on Memcached 1.5.3 and Libmemcached client 1.0.18
– Available for InfiniBand and RoCE
• OSU HiBD-Benchmarks (OHB)
– Memcached Set/Get micro-benchmarks for blocking and non-blocking APIs, and hybrid Memcached designs
– YCSB plugin for RDMA-Memcached
– Also includes HDFS, HBase, Spark Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 290 organizations in 34 countries; 27,950 downloads
Conclusion & Future Avenues
• Holistic approach to designing key-value storage by exploiting the capabilities of HPC clusters for (1) performance, (2) scalability, and (3) data resilience/availability
• RDMA-capable networks: (1) proposed non-blocking RDMA-based Libmemcached APIs; (2) fast online EC-based RDMA-Memcached designs
• Heterogeneous storage-awareness: (1) leveraged 'RDMA+SSD' hybrid designs; (2) 'RDMA+NVRAM' persistent key-value storage
• Application co-design: memory-centric data-intensive applications on HPC clusters
– Online (e.g., SQL query cache, YCSB) and offline data analytics (e.g., Boldio burst buffer for Hadoop I/O)
• Future work: ongoing work in this thesis direction
– Heterogeneous compute capabilities of CPUs/GPUs: end-to-end SIMD-aware KVS designs
– Exploring co-design of (1) read-intensive graph-based workloads (e.g., LinkBench, RedisGraph) and (2) a key-value storage engine for ML parameter server frameworks
http://www.cse.ohio-state.edu/~shankard
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/