Page 1: DCS: A Fast and Scalable Device-Centric Server Architecture · 2015-12-17 · DCS:A Fast and Scalable Device-Centric Server Architecture Jaehyung Ahn, Dongup Kwon, Youngsok Kim, Mohammadamin

DCS: A Fast and Scalable Device-Centric Server Architecture

Jaehyung Ahn, Dongup Kwon, Youngsok Kim, Mohammadamin Ajdari, Jaewon Lee, and Jangwoo Kim

{jh2ekd, nankdu7, elixir, majdari, spiegel0, jangwoo}@postech.ac.kr

High Performance Computing Lab
Pohang University of Science and Technology (POSTECH)

Inefficient device utilization

• Host-centric device management
− Host manages every device invocation
− Frequent host-involved layer crossings
  → Increases latency and management cost

[Figure: host-centric system stack. Userspace: Application. Kernel: a separate driver and kernel stack for each of Devices A, B, and C. Hardware: Device A, Device B, Device C. Legend: datapath, metadata/command path.]

Latency: High software overhead

• Single sendfile: Storage read & NIC send
− Faster devices, more software overhead

[Figure: latency decomposition (normalized, 0% to 100%) into software, storage, and NIC. Software overhead share:
  HDD + 10Gb NIC: 7%
  NVMe + 10Gb NIC: 50%
  PCM + 10Gb NIC: 77%
  PCM + 100Gb NIC: 82%]
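
For readers unfamiliar with the operation being measured, a minimal sketch of a single host-centric sendfile follows. It assumes Python's standard `socket.sendfile()`, which uses the `sendfile(2)` system call where available; the helper names `send_file` and `demo`, the 4 KB payload, and the local socket pair standing in for a NIC-backed connection are illustrative assumptions, not part of the presented system.

```python
import os
import socket
import tempfile


def send_file(sock: socket.socket, path: str) -> int:
    """Send the whole file at `path` over a connected stream socket.

    socket.sendfile() uses sendfile(2) where available and falls back to
    plain send() otherwise. It returns the number of bytes sent.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)


def demo() -> bytes:
    """Push a small temporary file through a local socket pair (hypothetical setup)."""
    payload = b"x" * 4096  # small enough to fit in the socket buffer
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    # The socket pair is a stand-in for a real network connection.
    sender, receiver = socket.socketpair()
    try:
        sent = send_file(sender, tmp.name)
        sender.close()
        received = b""
        while len(received) < sent:
            chunk = receiver.recv(65536)
            if not chunk:
                break
            received += chunk
        return received
    finally:
        sender.close()
        receiver.close()
        os.unlink(tmp.name)


if __name__ == "__main__":
    print(len(demo()), "bytes delivered")
```

Even in this small case the host issues the call, and the kernel's storage and network stacks each run on the host CPU, which is the software share the chart attributes to the host-centric path.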

Cost: High host resource demand

• Sendfile under host resource (CPU) contention
− Faster devices, more host resource consumption

[Figure: sendfile bandwidth and sendfile CPU usage under host CPU contention (measured from NVMe SSD / 10Gb NIC).
  No contention: sendfile bandwidth 100%, sendfile CPU usage 34%
  High contention: sendfile bandwidth 14%, sendfile CPU usage 6%]

Index

• Inefficient device utilization
• Limitations of existing solutions
• DCS: Device-Centric Server architecture
• Experimental results
• Conclusion

Limitations of existing work

• Single-device optimization
− Does not address inter-device communication
  e.g., Moneta (SSD), DCA (NIC), mTCP (NIC), Arrakis (Generic)

• Inter-device communication
− Not applicable to unsupported devices
  e.g., GPUNet (GPU-NIC), GPUDirect RDMA (GPU-Infiniband)

• Integrating devices
− Custom devices and protocols, limited applicability
  e.g., QuickSAN (SSD+NIC), BlueDBM (Accelerator – SSD+NIC)

Need for fast, scalable, and generic inter-device communication

Index

• Inefficient device utilization
• Limitations of existing solutions
• DCS: Device-Centric Server architecture
− Key idea and benefits
− Design considerations
• Experimental results
• Conclusion

DCS: Key idea

• Minimize host involvement & data movement

Single command → Optimized multi-device invocation

[Figure: DCS system stack. Userspace: Application, DCS Library. Kernel: DCS Driver, device drivers & kernel stacks. Hardware: DCS Engine, Device A, Device B, Device C. Legend: datapath, metadata/command path.]

DCS: Benefits

• Better device performance
− Faster data delivery, lower total operation latency

• Better host performance/efficiency
− Resources/time spent on device management are now available for other applications

• High applicability
− Relies on existing drivers / kernel support / interfaces
− Easy to extend and cover more devices

Index

• Inefficient device utilization
• Limitations of existing solutions
• DCS: Device-Centric Server architecture
− Key idea and benefits
− Design considerations
  By discussing implementation details
• Experimental results
• Conclusion

DCS: Architecture overview

[Figure: architecture overview.
  Userspace: Application, DCS Library (sendfile(), encrypted sendfile())
  Kernel: DCS Driver (command generator, kernel communicator); drivers & kernel stack (existing system)
  Hardware: DCS Engine (on NetFPGA NIC) with command queue, command interpreter, and per-device manager; NVMe SSD, GPU, and NetFPGA NIC behind a PCIe switch]

Fully compatible with existing system
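
The figure names a command queue, a command interpreter, and per-device managers inside the DCS Engine. One way to picture how such pieces could cooperate is sketched below. Everything in it is hypothetical: the `Command` and `Engine` classes, the `stages` field, and the string-returning managers are illustration only, since the slides do not give the actual command format or engine logic.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List, Tuple


@dataclass(frozen=True)
class Command:
    """One multi-device request (hypothetical layout, not the real format)."""
    name: str
    stages: Tuple[str, ...]  # ordered device names the data passes through


class Engine:
    """Toy stand-in for an engine with a queue, an interpreter, and per-device managers."""

    def __init__(self, managers: Dict[str, Callable[[str], str]]) -> None:
        self.queue: Deque[Command] = deque()
        self.managers = managers

    def submit(self, cmd: Command) -> None:
        """The host side enqueues a single command for the whole operation."""
        self.queue.append(cmd)

    def run(self) -> List[str]:
        """Interpret queued commands, invoking one manager per stage, in order."""
        log: List[str] = []
        while self.queue:
            cmd = self.queue.popleft()
            for device in cmd.stages:
                log.append(self.managers[device](cmd.name))
        return log


def make_manager(device: str) -> Callable[[str], str]:
    """Build a placeholder manager that only records which device was invoked."""
    return lambda op: f"{device}:{op}"


if __name__ == "__main__":
    engine = Engine({d: make_manager(d) for d in ("ssd", "gpu", "nic")})
    engine.submit(Command("encrypted_sendfile", ("ssd", "gpu", "nic")))
    print(engine.run())
```

The point of the sketch is the shape of the interaction: the host submits once, and the per-device steps are sequenced below the host rather than by repeated host-side driver calls.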

Communicating with storage

[Figure: storage path. Userspace: Application, DCS Library. Kernel: DCS Driver, (virtual) filesystem, VFS cache. Hardware: DCS Engine, NVMe SSD, source and target devices. Labeled flows: file descriptor passed by hook / API call; block address (in device) or buffer address (cached) resolved through the filesystem and VFS cache; transfer between source device and target device.]

Data consistency guaranteed

Communicating with network interface

[Figure: network path. Userspace: Application, DCS Library. Kernel: DCS Driver, network stack. Hardware: DCS Engine, NetFPGA NIC with HW PacketGen. Labeled flows: socket descriptor passed by hook / API call; connection information obtained from the network stack; data buffer; packet generation & send.]

HW-assisted packet generation

Communicating with accelerator

[Figure: accelerator path. Userspace: Application, DCS Library, GPU user library. Kernel: DCS Driver, GPU kernel driver. Hardware: DCS Engine, GPU, GPU memory, source device. Labeled steps: memory allocation; get memory mapping; call DCS library; DMA / NVMe transfer from the source device; kernel invocation; process data (kernel launch).]

Direct data loading without memcpy

Index

• Inefficient device utilization
• Limitations of existing solutions
• DCS: Device-Centric Server architecture
• Experimental results
• Conclusion

Experimental setup

• Host: Power-efficient system
− Core 2 Duo @ 2.00GHz, 2MB LLC
− 2GB DDR2 DRAM

• Device: Off-the-shelf emerging devices
− Storage: Samsung XS1715 NVMe SSD
− NIC: NetFPGA with Xilinx Virtex 5 (up to 1Gb bandwidth)
− Accelerator: NVIDIA Tesla K20m
− Device interconnect: Cyclone Microsystems PCIe2-2707 (Gen 2 switch, 5 slots, up to 80Gbps)

DCS prototype implementation

• Our 4-node DCS prototype
− Can support many devices per host

[Figure: DCS prototype with labeled NVMe SSD, NetFPGA NIC, GPU, and PCIe switch.]

Reducing device utilization latency

• Single sendfile: Storage read & NIC send
− Host-centric: Per-device layer crossings
− DCS: Batch management in HW layer

[Figure: latency (µs), host-centric vs. DCS.
  Host-centric: HW 75, SW 79
  DCS: HW 75, DCS 39]


2x latency improvement (with low-latency devices)

[Figure: latency with low-latency devices, host-centric vs. DCS.]
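
The relation between the measured bars and the 2x figure can be checked with a few lines of arithmetic. The sketch below assumes the chart labels read as 75 µs of hardware time in both designs, 79 µs of host software time, and 39 µs of DCS management time; that reading of the chart, and the function names, are assumptions made for illustration.

```python
def total_latency(hw_us: float, mgmt_us: float) -> float:
    """End-to-end latency as hardware time plus management (software or DCS) time."""
    return hw_us + mgmt_us


def improvement(hw_us: float, host_sw_us: float, dcs_us: float) -> float:
    """Host-centric latency divided by DCS latency for the same hardware time."""
    return total_latency(hw_us, host_sw_us) / total_latency(hw_us, dcs_us)


# Values as read from the chart labels (an assumption about the figure).
HW_US, HOST_SW_US, DCS_US = 75.0, 79.0, 39.0

if __name__ == "__main__":
    measured = improvement(HW_US, HOST_SW_US, DCS_US)
    reduction = 1 - 1 / measured
    print(f"measured devices: {measured:.2f}x, {reduction:.0%} lower latency")
    # Shrinking hardware time models faster devices (hypothetical sweep).
    for hw in (75.0, 25.0, 5.0, 0.0):
        print(f"hw = {hw:5.1f} us -> {improvement(hw, HOST_SW_US, DCS_US):.2f}x")
```

Under these assumptions the measured case comes out to about 1.35x, a reduction of roughly 26%, close to the ~25% quoted in the conclusion. As hardware time shrinks toward zero the ratio approaches 79/39, about 2.03x, which is consistent with the 2x claim for low-latency devices.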

Host-independent performance

• Sendfile under host resource (CPU) contention
− Host-centric: host-dependent, high management cost
− DCS: host-independent, low management cost

[Figure: sendfile bandwidth and CPU busy time, host-centric vs. DCS, under no contention and high contention. Annotations: 71% BW / CPU 11% busy; 100% BW / CPU 29% busy; 100% BW / CPU 70% busy; 13% BW / CPU 10% busy.]

High performance even on weak hosts

Multi-device invocation

• Encrypted sendfile (SSD → GPU → NIC, 512MB)
− DCS provides much more efficient data movement to the GPU
− Current bottleneck is NIC (1Gbps)

[Figure: normalized processing time, host-centric vs. DCS, broken down into GPU data loading, GPU processing, network send, and NVIDIA driver.]

Network send (1Gb): 14% reduction


Network send (10Gb): 38% reduction

Real-world workload: Hadoop-grep

• Hadoop-grep (10GB)
− Faster input delivery & smaller host resource consumption

[Figure: map and reduce progress (0 to 100%) over time (axis 0 to 150), host-centric vs. DCS.]

38% faster processing

Scalability: More devices per host

• Doubling # of devices in a single host

[Figure: CPU utilization and total device throughput (normalized), host-centric vs. DCS, for SSD + NIC and SSD x2 + NIC x2.
  CPU utilization, host-centric: 60% (SSD, NIC), 100% (SSD x2, NIC x2)
  CPU utilization, DCS: 22% (SSD, NIC), 37% (SSD x2, NIC x2)
  Throughput annotations after doubling: 2x and 1.3x]

Scalable many-device support

Conclusion

• Device-Centric Server architecture
− Manages emerging devices on behalf of host
− Optimized data transfer and device control
− Easily extensible modularized design

• Real hardware prototype evaluation
− Device latency reduction: ~25%
− Host resource savings: ~61%
− Hadoop-grep speed improvement: ~38%

Thank you!

High Performance Computing Lab
Pohang University of Science and Technology (POSTECH)

Device latency reduction ~25%
Host resource savings ~61%
Hadoop-grep speed improvement ~38%

[Figure: DCS prototype with labeled NVMe SSD, NetFPGA NIC, GPU, and PCIe switch.]