DCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture

Dongup Kwon¹, Jaehyung Ahn², Dongju Chae², Mohammadamin Ajdari², Jaewon Lee¹, Suheon Bae¹, Youngsok Kim¹, and Jangwoo Kim¹

¹ Dept. of Electrical and Computer Engineering, Seoul National University
² Dept. of Computer Science and Engineering, POSTECH

Conventional Server Architecture
• Primarily rely on “CPU and memory”
  − CPU-centric computing & in-memory storage
  − Slow and low-bandwidth peripheral devices
[Figure: host- & CPU-centric server: CPU, storage, network, compute.]

Device-centric Server Architecture
• Exploit “fast & high-bandwidth devices”
  − Data processing accelerators (e.g., GPU, FPGA)
  − Storage (e.g., SSD), network (e.g., 100GbE), PCIe Gen3
[Figure: host- & CPU-centric server (CPU, storage, network, compute) next to a device-centric server, where the CPU, accelerators (GPU, FPGA), storage (NVM), and network (NIC) are attached to PCIe.]

Index
• Existing approaches
• DCS-ctrl: HW-based device-control mechanism
• Experimental results
• Conclusion

Existing Approaches
• Software optimization
  − Memory mgmt. optimization, user-level device interface
  − Do not address multi-device tasks
• P2P communication
  − Transfer data directly through PCI Express ⇒ D2D comm.
• Device integration
  − Integrate heterogeneous devices ⇒ D2D comm.

Limitations of Existing D2D Comm.
• P2P communication
  − Direct data transfers through PCI Express ⇒ D2D comm.
  − Slow and high-overhead control path
[Figure: data path and control path among Dev A, Dev B, Dev C, and the CPU.]
[Figure: latency (us) for SW opt and P2P, broken down into control, data copy, and kernel.]
[Figure: CPU utilization (%) for SW opt and P2P, broken down into others, control, and kernel.]

Limitations of Existing D2D Comm.
• Integrated devices
  − Integrating heterogeneous devices ⇒ D2D comm.
  − Fast data & control transfers
  − Fixed and inflexible aggregate implementation
[Figure: the CPU with Dev A, Dev B, Dev C, and their controllers combined into a new device (“NewDev $$$”).]

Limited Performance Potential

    while (true) {
        rc_recv = recv(fd_sock, buffer, recv_size, 0);
        if (rc_recv <= 0) break;
        processing(&md_ctx, buffer, recv_size);
        rc_write = write(fd_file, buffer, recv_size);
        …
    }

• “Intermediate” processing between device ops
  − Prevent applications from using direct D2D comm.
  − Cause host-side resource contention (CPU and memory)
[Figure: data moving between Dev A and Dev B through the CPU.]
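The host-mediated pattern on this slide can be made concrete with a small runnable sketch. It is an illustration under stated assumptions, not the authors' code: it uses `read()` instead of `recv()` so it works on any file descriptor, the `processing()` body is a toy byte sum standing in for the slide's hash step, and `relay()` is a hypothetical name. Every byte crosses host memory twice and the CPU touches each one, which is the contention the slide points at.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Toy stand-in for the slide's processing(): a running byte sum (assumption). */
static void processing(uint32_t *ctx, const unsigned char *buf, size_t len) {
    for (size_t i = 0; i < len; i++)
        *ctx += buf[i];
}

/* Copy fd_in to fd_out through a host buffer, processing each chunk.
 * Returns the number of bytes relayed, or -1 on an I/O error.
 * EINTR handling is omitted to keep the sketch short. */
long relay(int fd_in, int fd_out, uint32_t *ctx) {
    unsigned char buffer[4096];
    long total = 0;

    assert(ctx != NULL);
    for (;;) {
        ssize_t n = read(fd_in, buffer, sizeof buffer);
        if (n == 0) break;          /* end of stream */
        if (n < 0) return -1;       /* read error */

        /* Use the byte count actually received, not the requested size. */
        processing(ctx, buffer, (size_t)n);

        /* write() may accept fewer bytes than asked; loop until all are out. */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(fd_out, buffer + off, (size_t)(n - off));
            if (w <= 0) return -1;
            off += w;
        }
        total += n;
    }
    return total;
}
```

Calling `relay(fd_sock, fd_file, &ctx)` on a connected socket and an open file reproduces the slide's loop, with the short-read and short-write cases handled.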

Design Goals
• Performance & scalability
  − Faster inter-device data & control communication
  − More scalable with CPU-efficient device operations
• Flexibility
  − Support any type of off-the-shelf device
• Applicability
  − Increase the opportunities to apply D2D comm.

Index
• Existing approaches
• DCS-ctrl: HW-based device-control mechanism
  − Key ideas and benefits
  − Architecture
• Experimental results
• Conclusion

DCS-ctrl: Key Ideas & Benefits
• DCS-ctrl: PCIe P2P + “HDC”
  − Hardware-based device-control (HDC) mechanism
  − HDC Engine: “FPGA-based” device orchestrator + “near-device” processing unit
    § Performance & scalability ⇒ HDC, device orchestrator
    § Flexibility ⇒ FPGA-based, low-cost device controller
    § Applicability ⇒ near-device processing unit

HDC Engine: Overview
[Figure: two panels. SW-controlled P2P: application, device drivers A, B, and C, devices A, B, and C. DCS-ctrl (HW): application, HDC Engine (FPGA) containing device controllers A, B, and C and an NDP unit, devices A, B, and C.]

DCS-ctrl: Key Ideas & Benefits

    void ssd_to_nic() {
        get_from_ssd(&data);
        process_in_HDC(&data);
        write_to_nic(&data);
    }

• Optimized dev. control ⇒ Faster & scalable communication
• Generic dev. interfaces ⇒ Higher flexibility
• Near-device processing ⇒ Higher applicability
[Figure: CPU, Dev A, Dev B, Dev C, a new device, and the HDC with its device controller; legend: data path, control path.]

Key Idea #1: Device Orchestrator
• Perform multi-device tasks w/o CPU involvement
  − Offload a multi-device task to HDC Engine
  − Manage all device operations and their dependencies

Scoreboard
Dev  R/W    Src          Dst          Aux   State
A    Read   Addr(DevA)   Addr(NDP-A)  -     Done
-    -      Addr(NDP-A)  Addr(NDP-B)  Hash  Issue
B    Write  Addr(NDP-B)  Addr(DevB)   -     Ready

[Figure: a multi-device task spanning Dev A, NDP, and Dev B.]
Fast hardware-level device control
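A minimal sketch of how a scoreboard like the one on this slide could sequence a multi-device task. It assumes a purely linear dependency chain, where each entry waits for the one before it; the slides do not say how dependencies are encoded. All type and function names (`sb_entry`, `sb_issue_next`, `sb_complete`, `sb_all_done`) are hypothetical, and only the columns and the Ready/Issue/Done states come from the slide.

```c
#include <assert.h>
#include <stddef.h>

/* States taken from the slide's State column. */
enum sb_state { SB_READY, SB_ISSUE, SB_DONE };
/* SB_NDP marks the row whose Dev and R/W columns are "-" on the slide. */
enum sb_op { SB_READ, SB_WRITE, SB_NDP };

struct sb_entry {
    char dev;              /* 'A', 'B', or '-' for a near-device processing step */
    enum sb_op op;
    unsigned long src;     /* placeholder addresses, not real device addresses */
    unsigned long dst;
    const char *aux;       /* e.g. "Hash", or NULL when unused */
    enum sb_state state;
};

/* Issue the first Ready entry whose predecessor is Done (assumed linear chain).
 * Returns the index of the issued entry, or -1 if nothing can be issued. */
int sb_issue_next(struct sb_entry *sb, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int prev_done = (i == 0) || (sb[i - 1].state == SB_DONE);
        if (sb[i].state == SB_READY && prev_done) {
            sb[i].state = SB_ISSUE;
            return (int)i;
        }
    }
    return -1;
}

/* Record that the device or NDP unit finished entry i. */
void sb_complete(struct sb_entry *sb, size_t n, size_t i) {
    assert(i < n && sb[i].state == SB_ISSUE);
    sb[i].state = SB_DONE;
}

/* Returns 1 when every entry of the task is Done, 0 otherwise. */
int sb_all_done(const struct sb_entry *sb, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (sb[i].state != SB_DONE)
            return 0;
    return 1;
}
```

In this model the three rows of the slide's table correspond to a snapshot taken after the read from Dev A has completed and the hash step has been issued; the write to Dev B stays Ready until the hash step is marked Done.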

Key Idea #2: Device Controller
• Provide interfaces between HDC Engine & devices
  − Include submission & completion queues
  − Build standard & vendor-specific device commands
[Figure: a device controller with a submission queue, a completion queue, and doorbell registers, connected to the device through the PCIe switch.]
Flexible & low-cost device control
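The queue-and-doorbell interface can be sketched in software as a pair of ring buffers. This is a hedged illustration loosely modeled on NVMe-style queue pairs, not the HDC Engine's actual logic: the command layout, the queue depth, and every name (`queue_pair`, `qp_submit`, `dev_process_one`, `qp_poll`) are assumptions, and a plain struct field stands in for a memory-mapped doorbell register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QDEPTH 8   /* hypothetical queue depth; one slot is kept empty */

struct dev_cmd { uint8_t opcode; uint64_t src, dst; uint32_t len; uint16_t id; };
struct dev_cpl { uint16_t id; uint16_t status; };

struct queue_pair {
    struct dev_cmd sq[QDEPTH];      /* submission queue */
    struct dev_cpl cq[QDEPTH];      /* completion queue */
    uint32_t sq_tail, sq_head;      /* controller produces, device consumes */
    uint32_t cq_tail, cq_head;      /* device produces, controller consumes */
    volatile uint32_t sq_doorbell;  /* stand-in for a device doorbell register */
};

/* Controller side: enqueue a command and ring the doorbell.
 * Returns 0 on success, -1 if the submission queue is full. */
int qp_submit(struct queue_pair *qp, struct dev_cmd cmd) {
    assert(qp != NULL);
    uint32_t next = (qp->sq_tail + 1) % QDEPTH;
    if (next == qp->sq_head) return -1;
    qp->sq[qp->sq_tail] = cmd;
    qp->sq_tail = next;
    qp->sq_doorbell = next;         /* tell the device how far it may read */
    return 0;
}

/* Device-side model: consume one command and post its completion.
 * Returns 1 if a command was processed, 0 if there was nothing to do. */
int dev_process_one(struct queue_pair *qp) {
    uint32_t next_cq = (qp->cq_tail + 1) % QDEPTH;
    if (qp->sq_head == qp->sq_doorbell || next_cq == qp->cq_head) return 0;
    struct dev_cmd cmd = qp->sq[qp->sq_head];
    qp->sq_head = (qp->sq_head + 1) % QDEPTH;
    qp->cq[qp->cq_tail] = (struct dev_cpl){ .id = cmd.id, .status = 0 };
    qp->cq_tail = next_cq;
    return 1;
}

/* Controller side: fetch one completion if available (1) or report none (0). */
int qp_poll(struct queue_pair *qp, struct dev_cpl *out) {
    if (qp->cq_head == qp->cq_tail) return 0;
    *out = qp->cq[qp->cq_head];
    qp->cq_head = (qp->cq_head + 1) % QDEPTH;
    return 1;
}
```

Keeping the command format behind one struct is what would let the same queue logic carry either standard or vendor-specific commands, as the slide describes; only the code that fills in `dev_cmd` would differ per device.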

Key Idea #3: Near-device Processing
• Near-device processing units
  − Execute intermediate processing between device ops
  − Scale-out storage app ⇒ hash, encryption, compression

Processing unit  LUTs   Registers  Applications
MD5              3.0%   0.69%      Swift
AES256           3.52%  0.99%      HDFS, Swift
GZIP             5.36%  2.09%      HDFS

Easy to extend to other devices & applications
Highly applicable to existing applications

Baseline Architecture
• Software-controlled P2P
  − P2P comm. + indirect device-control path
[Figure: SW side: application and device drivers; HW side: Dev A, Dev B, and Dev C behind a PCIe switch.]

DCS-ctrl: HW-based Device Control (1/3)
• Offload device-control path to HDC Engine
  − Scoreboard: schedule device operations in a multi-dev task
[Figure: SW side: application; HW side: FPGA-based HDC Engine with a scoreboard (columns: Dev, r/w, Src, Dst; rows: A, B, C), and Dev A, Dev B, and Dev C behind a PCIe switch.]

DCS-ctrl: Low-cost Integration (2/3)
• Implement an FPGA-based device controller
  − Device controller: directly control devices using P2P
[Figure: SW side: application; HW side: FPGA-based HDC Engine with a scoreboard and a device controller, and Dev A, Dev B, Dev C, and a new device behind a PCIe switch.]

DCS-ctrl: Near-device Processing (3/3)
• Provide units for intermediate processing
  − NDP unit: perform data processing on a data path
[Figure: SW side: application; HW side: FPGA-based HDC Engine with a scoreboard, a device controller, near-device processing units, and intermediate buffers, and Dev A, Dev B, Dev C, and a new device behind a PCIe switch.]

DCS-ctrl Prototype
• HDC Engine implemented on Xilinx Virtex-7 VC707
• Supports off-the-shelf devices: Intel 750 SSDs, Broadcom 10GbE NICs, NVIDIA GPUs

Reducing Device Control Latency
• encrypted_sendfile(): SSD → hash → NIC
  − SW opt (+P2P): frequent boundary crossings, complex software
  − DCS-ctrl: fewer crossings, hardware-based device control
[Figure: latency (us) in two panels. Without processing: SW opt vs. DCS-ctrl, broken down into HW, kernel, and dev ctrl. With processing (AES256): SW opt, SW opt + P2P, and DCS-ctrl, broken down into HW, kernel, data copy, and dev ctrl. Annotated reductions: 42% and 72%.]

Reducing CPU Utilization
• Swift & HDFS workloads
  − Offload device control & data transfers to hardware
[Figure: normalized CPU utilization in two panels. Swift: SW opt, SW opt + P2P, and DCS-ctrl, broken down into kernel (GET), kernel (PUT), GPU control, and others. HDFS: send and recv for SW opt, SW opt + P2P, and DCS-ctrl, broken down into kernel (sender), kernel (receiver), GPU control, and others. Annotations: 50%, 52%, 49%.]

Scalability: More Devices
• Swift & HDFS workloads
  − More CPU-efficient ⇒ support more high-performance devices
[Figure: CPU utilization (# cores) versus throughput (Gbps) for SW opt, SW opt + P2P, and DCS-ctrl, in two panels: Swift and HDFS.]

Conclusion
• Fast & flexible device-control mechanism
  − Hardware-based device-control (HDC) mechanism
  − FPGA-based standard device controllers
  − Near-device data processing (NDP) units
• Real hardware prototype evaluation
  − 72% faster inter-device communication
  − 50% lower CPU utilization for Swift & HDFS

Thank you!

We will release our IP & tools soon!