
Poster: acmsocc.github.io/2015/posters/socc15posters-final46.pdf

MemcachedGPU: Scaling-up Scale-out Key-value Stores

Tayler H. Hetherington ([email protected]), The University of British Columbia
Mike O’Connor ([email protected]), NVIDIA / UT-Austin
Tor M. Aamodt ([email protected]), The University of British Columbia

GPU Network Offload Manager (GNoM)

MemcachedGPU

Evaluation

Key Size                     16 B         64 B        128 B
Tesla drops @ server         0.002%       0.003%      0.006%
Tesla drops @ client         0.428%       0.043%      0.053%
Tesla MRPS / Gbps            12.9 / 9.9   8.7 / 10    6 / 10
Maxwell-NGD drops @ server   0.47%        0.05%       0.02%
Maxwell-NGD MRPS / Gbps      12.9 / 9.9   8.7 / 10    6 / 10
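As a rough sanity check on why the 16 B configuration saturates the 10 GbE link, one can assume minimum Ethernet/IP/UDP framing plus the 8 B Memcached UDP frame header around each GET request (these framing overheads are an assumption, not stated on the poster):

```latex
% Assumed wire footprint per 16 B-key GET request:
% 20 B preamble/IFG + 14 B Ethernet + 20 B IP + 8 B UDP
% + 8 B Memcached UDP header + 4 B "get " + 16 B key + 2 B CRLF + 4 B FCS
% = 96 B on the wire.
12.9\ \mathrm{MRPS} \times 96\ \mathrm{B/req} \times 8\ \mathrm{bits/B}
  \approx 9.9\ \mathrm{Gb/s}
```

So at the smallest key size the server is limited by the 10 GbE link rather than by GPU compute, consistent with the peak GET throughput analysis later on the poster.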

[Figure: GNoM architecture and GPU request-processing pipeline. RX/TX paths connect the NIC, CPU, and GPU; the GPUDirect RX path delivers packet data straight into GPU receive buffers (GRXBs), while RX packet metadata travels through the CPU. CPU (GNoM-host): GNoM-KM and GNoM-ND in the OS, with GNoM-pre, GNoM-post, and user-level Memcached (item values and metadata) above them. GPU (GNoM-dev): GRXBs, hash/lock tables, response buffers, and TXBs. The on-GPU pipeline stages are: load RX packets, UDP processing, network service processing, network service populate response & complete response packet header, UDP response packet generation, and store TX packets, divided between main warps and helper warps operating out of global and shared memory. The legend distinguishes packet data, packet metadata, packet data + metadata, and control flow (__syncthreads).]

GNoM:
• Kernel- and user-level software framework to efficiently manage interactions between the CPU, GPU, and NIC.
• Transfers RX packet data directly to GPU memory using GPUDirect.
• RX packet metadata is transferred through the CPU.
• TX packets are populated from both the GPU and the CPU.

MemcachedGPU:
• Exploits request-level parallelism through fine-grained batching.
• Uses multiple warps to exploit task-level parallelism within a request to reduce latency.
• Performs UDP network processing in-line with application processing.
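The pipeline above can be condensed into a heavily simplified CUDA sketch, assuming one thread per request, fixed 16 B keys, and hypothetical type, constant, and function names; it illustrates the batched GET lookup only and is not the GNoM/MemcachedGPU source:

```cuda
// Illustrative sketch only, not the GNoM/MemcachedGPU source: one thread per
// request, processing a 512-request batch of RX packets that (in the real
// system) GPUDirect has already placed in GPU memory. All names here are
// assumptions made for clarity.
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

#define BATCH_SIZE 512   // requests per batch (matches the poster's 512 requests/batch)
#define KEY_BYTES  16    // assume fixed 16 B keys, as in the first table column
#define WAYS       16    // 16-way set-associative hash table

struct rx_pkt_t     { uint8_t raw[128]; };                            // raw UDP GET request
struct hash_entry_t { uint8_t key[KEY_BYTES]; uint64_t value_ptr; };  // one way of a set
struct hash_set_t   { hash_entry_t way[WAYS]; };

// Simple FNV-1a hash as a stand-in for whatever hash the real code uses.
__device__ uint32_t fnv1a(const uint8_t* key, int len) {
    uint32_t h = 2166136261u;
    for (int i = 0; i < len; ++i) { h ^= key[i]; h *= 16777619u; }
    return h;
}

__global__ void memcached_get_batch(const rx_pkt_t* __restrict__ rx,
                                    const hash_set_t* __restrict__ table,
                                    uint32_t num_sets,
                                    uint64_t* __restrict__ resp_value_ptr) {
    int req = blockIdx.x * blockDim.x + threadIdx.x;
    if (req >= BATCH_SIZE) return;

    // 1. UDP processing: skip Ethernet(14) + IP(20) + UDP(8) headers, the
    //    8 B Memcached UDP frame header, and "get " (4 B) to reach the key.
    const uint8_t* key = rx[req].raw + 14 + 20 + 8 + 8 + 4;

    // 2. Network service processing: hash the key and select a set.
    uint32_t set = fnv1a(key, KEY_BYTES) % num_sets;

    // 3. Probe every way of the set looking for a matching key.
    uint64_t hit = 0;
    for (int w = 0; w < WAYS; ++w) {
        bool match = true;
        for (int i = 0; i < KEY_BYTES; ++i)
            if (table[set].way[w].key[i] != key[i]) { match = false; break; }
        if (match) hit = table[set].way[w].value_ptr;
    }

    // 4. Populate the response; in GNoM, helper warps would then generate the
    //    UDP response packet headers and store the TX packets.
    resp_value_ptr[req] = hit;
}

int main() {
    const uint32_t num_sets = 1u << 18;
    rx_pkt_t* d_rx;  hash_set_t* d_table;  uint64_t* d_resp;
    cudaMalloc(&d_rx,    BATCH_SIZE * sizeof(rx_pkt_t));
    cudaMalloc(&d_table, num_sets   * sizeof(hash_set_t));
    cudaMalloc(&d_resp,  BATCH_SIZE * sizeof(uint64_t));
    cudaMemset(d_rx, 0, BATCH_SIZE * sizeof(rx_pkt_t));
    cudaMemset(d_table, 0, num_sets * sizeof(hash_set_t));

    memcached_get_batch<<<(BATCH_SIZE + 127) / 128, 128>>>(d_rx, d_table, num_sets, d_resp);
    cudaDeviceSynchronize();
    printf("processed one batch of %d requests\n", BATCH_SIZE);

    cudaFree(d_rx); cudaFree(d_table); cudaFree(d_resp);
    return 0;
}
```

In the real design the RX buffers are filled by the NIC over GPUDirect rather than by the host, and helper warps overlap response-packet generation with the lookup shown here.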

Hash Table Analysis

[Figure: Miss rate (%) vs. hash table size (millions of entries, 2-20) for direct-mapped and 2/4/8/16/32-way set-associative tables compared against hash chaining; the request trace working set is marked.]

Miss-rate vs. hash table associativity and size compared to hash chaining for a request trace with a working set of 10 million requests following a Zipfian distribution. 16-ways closely matches hash chaining.

RTT Latency Analysis

[Figure: Mean and 95-percentile round-trip-time (RTT) latency (us) vs. average MRPS for Tesla-NGD, Maxwell-NGD, and Tesla-GNoM.]

MemcachedGPU mean and 95-percentile round-trip-time (RTT) latency with 512 requests/batch. GNoM reduces mean latency vs. NGD by 75%-96%.

Offline Analysis

NVIDIA GPU                   Tesla K20c   GTX 750Ti
Throughput (MRPS)            27.5         28.3
Avg. Latency (us)            353.4        263.6
Energy-efficiency (KRPS/W)   100          127.3

MemcachedGPU offline, in-memory limit-study without network transfers. Results show promise for even lower power integrated GPUs in the data center.

Main Memcached Modifications

[Figure: Set-associative hash table / key storage (Set 0, Set 1, ..., Set N) and lock table in GPU memory; value storage in CPU memory.]
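A minimal data-layout sketch of the partitioning shown above, with hypothetical field names and sizes (the actual MemcachedGPU structures differ): keys, per-set locks, and LRU metadata live in GPU memory, while the larger item values remain in CPU memory and are referenced by offset.

```cuda
// Illustrative sketch only, not the MemcachedGPU source. All names and
// sizes are assumptions chosen for clarity.
#include <cstdio>
#include <cstdint>

#define WAYS      16     // 16-way set associativity (closely matches hash chaining)
#define KEY_BYTES 128    // keys stored on the GPU

// One way of a set, resident in GPU memory.
struct gpu_entry_t {
    uint8_t  key[KEY_BYTES];
    uint16_t key_len;
    uint32_t last_access;    // per-set LRU timestamp (replaces a global LRU list)
    uint64_t cpu_value_off;  // offset of the item value in CPU memory
    uint32_t value_len;
};

// A set: WAYS entries plus one per-set shared/exclusive lock word
// (replaces Memcached's global locking).
struct gpu_set_t {
    gpu_entry_t way[WAYS];
    uint32_t    lock;
};

// Victim selection for eviction only needs the timestamps within one set.
__device__ int lru_victim(const gpu_set_t* set) {
    int victim = 0;
    uint32_t oldest = set->way[0].last_access;
    for (int w = 1; w < WAYS; ++w) {
        if (set->way[w].last_access < oldest) {
            oldest = set->way[w].last_access;
            victim = w;
        }
    }
    return victim;
}

int main() {
    // Rough footprint check: a few million sets of GPU-resident keys and
    // metadata fit within the 2-5 GB of GPU memory in the hardware table below.
    printf("bytes per set: %zu\n", sizeof(gpu_set_t));
    return 0;
}
```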

NVIDIA GPU        Tesla K20c          GTX 750Ti
Architecture      Kepler              Maxwell
# Cores / Freq.   2496 / 706 MHz      640 / 1020 MHz
Mem size / BW     5 GB / 208 GB/s     2 GB / 86.4 GB/s
TDP               225 W               60 W
Cost              $2,700              $150
RX mode           GPUDirect (GNoM)    Non-GPUDirect (NGD)

Main Goals and Contributions

• Use GPUs as flexible, energy-efficient accelerators for general network services in the data center.
• Exploit request-level parallelism through request batching on the massively parallel GPU architecture.
  • Small batches (e.g., 512 requests) to improve latency.
  • Concurrent batches to improve throughput (sketched below).
• Perform both UDP network processing and application processing on the GPU.
• GNoM: Achieve high-throughput, low-latency, and energy-efficient UDP network processing on commodity Ethernet and GPU hardware.
• MemcachedGPU: Design and evaluate a popular in-memory key-value store application, Memcached, on GPUs.
  • Distributed look-aside cache to alleviate database load.
  • Requests: GET (read), SET/UPDATE/DELETE (modify).
  • Goal: Scale up the GET performance of a single server.
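A sketch of how small, concurrent batches trade latency for throughput, under assumed structure (this is not the GNoM launch path): each 512-request batch runs as its own kernel on one of several CUDA streams, so batches overlap on the GPU while each individual batch stays small.

```cuda
// Illustrative sketch only: concurrent 512-request batch kernels on
// independent CUDA streams. process_batch is a stand-in for the
// MemcachedGPU GET kernel; all names are hypothetical.
#include <cstdio>
#include <cuda_runtime.h>

#define BATCH_SIZE  512   // small batches keep per-request latency low
#define NUM_STREAMS 4     // concurrent batches keep the GPU busy

__global__ void process_batch(const int* reqs, int* resps) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < BATCH_SIZE) resps[i] = reqs[i] + 1;   // placeholder for GET processing
}

int main() {
    int *d_reqs, *d_resps;
    cudaMalloc(&d_reqs,  NUM_STREAMS * BATCH_SIZE * sizeof(int));
    cudaMalloc(&d_resps, NUM_STREAMS * BATCH_SIZE * sizeof(int));
    cudaMemset(d_reqs, 0, NUM_STREAMS * BATCH_SIZE * sizeof(int));

    cudaStream_t streams[NUM_STREAMS];
    for (int s = 0; s < NUM_STREAMS; ++s) cudaStreamCreate(&streams[s]);

    // Dispatch incoming batches to streams round-robin; batches overlap on the
    // GPU while each kernel still processes only 512 requests.
    for (int b = 0; b < 16; ++b) {
        int s = b % NUM_STREAMS;
        process_batch<<<(BATCH_SIZE + 255) / 256, 256, 0, streams[s]>>>(
            d_reqs + s * BATCH_SIZE, d_resps + s * BATCH_SIZE);
    }
    cudaDeviceSynchronize();
    printf("processed 16 batches of %d requests\n", BATCH_SIZE);

    for (int s = 0; s < NUM_STREAMS; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_reqs); cudaFree(d_resps);
    return 0;
}
```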

Problem & Motivation

• Data centers consume significant amounts of power to operate (e.g., tens of megawatts).
• There is a continuously growing demand for higher performance in the data center.
  • Cannot easily trade performance for power.
• Data center hardware needs to be general to support constantly changing workloads and requirements.
• Ideal properties for a data center accelerator:
  • High performance.
  • High energy-efficiency.
  • High generality and programmability.
  • Low-cost commodity hardware.

Paper: ece.ubc.ca/~taylerh/doc/MemcachedGPU_SoCC15.pdf

Code: github.com/tayler-hetherington/MemcachedGPU  

• Focus on accelerating GET requests on the GPU; the majority of SET request processing remains on the CPU.
• Many changes were required to the core Memcached data structures and operations to improve performance and scalability (see the locking sketch after this list):
  • Partition key storage (GPU) and value storage (CPU).
  • Hash table: dynamic hash chaining → static set-associative.
  • Global LRU replacement → local per-set LRU replacement.
  • Global locking → per-set shared/exclusive locking.
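A minimal sketch of what per-set shared/exclusive locking could look like on the GPU, assuming a single 32-bit lock word per set and hypothetical function names (an illustration of the idea, not the MemcachedGPU implementation): GET lookups take a set's lock in shared mode, while the modify path would take it exclusively.

```cuda
// Illustrative sketch only, not the MemcachedGPU source.
// One 32-bit lock word per hash-table set: low bits count readers,
// the top bit marks an exclusive (writer) holder.
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

#define EXCLUSIVE_BIT 0x80000000u

// Optimistic shared (reader) acquire: bump the reader count, back off if a
// writer holds the set. GET processing would retry or defer the request on failure.
__device__ bool try_lock_shared(uint32_t* lock) {
    uint32_t old = atomicAdd(lock, 1u);
    if (old & EXCLUSIVE_BIT) { atomicSub(lock, 1u); return false; }
    return true;
}

__device__ void unlock_shared(uint32_t* lock) { atomicSub(lock, 1u); }

// Exclusive (writer) acquire: succeeds only when no readers or writer are present.
__device__ bool try_lock_exclusive(uint32_t* lock) {
    return atomicCAS(lock, 0u, EXCLUSIVE_BIT) == 0u;
}

__device__ void unlock_exclusive(uint32_t* lock) { atomicAnd(lock, ~EXCLUSIVE_BIT); }

// Demo: each thread takes the shared lock on its own set and releases it.
__global__ void demo(uint32_t* set_locks, int num_sets) {
    int set = blockIdx.x * blockDim.x + threadIdx.x;
    if (set >= num_sets) return;
    if (try_lock_shared(&set_locks[set])) {
        // ... probe the 16 ways of this set under the shared lock ...
        unlock_shared(&set_locks[set]);
    }
}

int main() {
    const int num_sets = 1 << 16;
    uint32_t* d_locks;
    cudaMalloc(&d_locks, num_sets * sizeof(uint32_t));
    cudaMemset(d_locks, 0, num_sets * sizeof(uint32_t));
    demo<<<(num_sets + 255) / 256, 256>>>(d_locks, num_sets);
    cudaDeviceSynchronize();
    printf("per-set shared locks exercised on %d sets\n", num_sets);
    cudaFree(d_locks);
    return 0;
}
```

Because only GETs are processed on the GPU, readers dominate and contention on any single set is rare, which is what makes this fine-grained scheme cheaper than the global locking it replaces.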

Potential Solutions & Tradeoffs

"Wimpy" cores
+ Lowers power consumption.
- Reduces performance.

ASICs
+ Very high performance and energy-efficiency.
- Lacks generality.

FPGAs
+ High performance and energy-efficiency.
+ Generality through reprogrammable hardware.
+ Used in Microsoft's data centers for Bing (Catapult).
- More difficult to program than general-purpose processors.
  • Programmability is improving with high-level synthesis, but this may trade off the quality of results.
- Reprogramming times may limit the potential for fine-grained task switching to support multiple concurrent workloads.

GP-GPUs
+ High performance and energy-efficiency.
  • Improves throughput at the cost of latency.
+ General-purpose and highly programmable.
+ Positive impact on performance and energy-efficiency in supercomputing.
+ Used in Google's data centers for machine learning.
- SIMD architecture limits potential applications.
- Smaller main memory than CPUs.
  • Integrated GPUs may remove this limitation.

Peak GET Throughput Analysis

GNoM and MemcachedGPU achieve ~10 GbE processing at all key-value sizes. With varying key/value lengths, MemcachedGPU becomes network bound before compute bound.

Power and Energy-Efficiency Analysis

System wall power and energy-efficiency of MemcachedGPU. The Tesla GPU only consumes 32% of peak TDP (underutilizing GPU resources). There are opportunities to further improve total system energy-efficiency through additional GPU I/O and system software support.
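To put the 32% figure in absolute terms, a simple derivation from the TDP listed in the hardware table above:

```latex
0.32 \times 225\ \mathrm{W} \approx 72\ \mathrm{W}
\quad \text{average Tesla GPU power implied by the 32\% figure}
```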

GPUs

Evaluated a high-performance GPU (Tesla K20c) and a low-power GPU (GTX 750Ti). The low-power GPU achieves comparable performance with higher energy-efficiency.

Workload Consolidation Analysis

Impact on RTT when running a low-priority background task (BGT) on the same GPU with MemcachedGPU at 4 MRPS. The BGT is split into smaller kernels with fewer CTAs (256 CTAs total). 16 CTAs per BGT kernel reduces the max client RTT by 18X, while increasing the BGT execution time by 50%.

[Figure: Average client RTT (ms) over time (ms) while the background task runs, for the BGT split into 1, 2, 4, 8, or 16 background kernels (256 CTAs total).]
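A minimal sketch of the kernel-splitting idea under stated assumptions: the background task's 256-CTA grid is launched as several smaller kernels on a low-priority stream so that MemcachedGPU batches (which would run on a separate, higher-priority stream, not shown) can be scheduled in between. The stream-priority usage and all names are illustrative, not taken from the MemcachedGPU code.

```cuda
// Illustrative sketch only, not the MemcachedGPU source: splitting a
// long-running low-priority background task (BGT) of 256 CTAs into several
// smaller kernel launches to reduce head-of-line blocking of latency-critical
// MemcachedGPU batches.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void background_chunk(float* data, int cta_offset, int n) {
    // Each launch covers a slice of the original 256-CTA grid.
    int cta = cta_offset + blockIdx.x;
    int idx = cta * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] = data[idx] * 0.5f + 1.0f;  // stand-in for real BGT work
}

int main() {
    const int total_ctas      = 256;  // original BGT grid size (from the poster)
    const int ctas_per_launch = 16;   // 16 CTAs/kernel gave the best RTT trade-off
    const int threads         = 256;
    const int n               = total_ctas * threads;

    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // A low-priority stream for the BGT; MemcachedGPU batches would run in a
    // separate, higher-priority stream (not shown).
    int lo, hi;
    cudaDeviceGetStreamPriorityRange(&lo, &hi);
    cudaStream_t bgt_stream;
    cudaStreamCreateWithPriority(&bgt_stream, cudaStreamNonBlocking, lo);

    for (int off = 0; off < total_ctas; off += ctas_per_launch)
        background_chunk<<<ctas_per_launch, threads, 0, bgt_stream>>>(d_data, off, n);

    cudaStreamSynchronize(bgt_stream);
    printf("background task finished in %d launches\n", total_ctas / ctas_per_launch);
    cudaStreamDestroy(bgt_stream);
    cudaFree(d_data);
    return 0;
}
```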

[Figure (Power and Energy-Efficiency Analysis): Average system power (W) and energy-efficiency (KRPS/W) vs. average MRPS (1-13), showing average Tesla system power, average Tesla GPU power, Tesla system energy-efficiency, and Maxwell system energy-efficiency at peak throughput.]