DCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture
Dongup Kwon¹, Jaehyung Ahn², Dongju Chae², Mohammadamin Ajdari², Jaewon Lee¹, Suheon Bae¹, Youngsok Kim¹, and Jangwoo Kim¹
¹Dept. of Electrical and Computer Engineering, Seoul National University
²Dept. of Computer Science and Engineering, POSTECH
Conventional Server Architecture
• Primarily rely on “CPU and memory”
  − CPU-centric computing & in-memory storage
  − Slow and low-bandwidth peripheral devices
[Figure: host- & CPU-centric server, with the CPU connected to storage, network, and compute]
Device-centric Server Architecture
• Exploit “fast & high-bandwidth devices”
  − Data processing accelerators (e.g., GPU, FPGA)
  − Storage (e.g., SSD), network (e.g., 100GbE), PCIe Gen3
[Figure: host- & CPU-centric design (CPU with storage, network, compute) next to the device-centric design (CPU, storage (NVM), network (NIC), and accelerators (GPU, FPGA) attached to PCIe)]
Index
• Existing approaches
• DCS-ctrl: HW-based device-control mechanism
• Experimental results
• Conclusion
Existing Approaches
• Software optimization
  − Memory mgmt. optimization, user-level device interface
  − Do not address multi-device tasks
• P2P communication
  − Transfer data directly through PCI Express → D2D comm.
• Device integration
  − Integrate heterogeneous devices → D2D comm.
Limitations of Existing D2D Comm.
• P2P communication
  − Direct data transfers through PCI Express → D2D comm.
  − Slow and high-overhead control path
[Figure: data path and control path among DevA, DevB, DevC, and the CPU]
[Chart: latency (us) for SW opt and P2P, broken down into SW, kernel, data copy, and control]
[Chart: CPU util. (%) for SW opt and P2P, broken down into kernel, control, and others]
Limitations of Existing D2D Comm.
• Integrated devices
  − Integrating heterogeneous devices → D2D comm.
  − Fast data & control transfers
  − Fixed and inflexible aggregate implementation
[Figure: CPU with DevA, DevB, DevC, and their controllers; NewDev is marked “$$$”]
Limited Performance Potential
    while (true) {
        /* Receive the next chunk from the network socket. */
        rc_recv = recv(fd_sock, buffer, recv_size, 0);
        if (rc_recv <= 0) break;
        /* Intermediate processing on the host CPU. */
        processing(&md_ctx, buffer, rc_recv);
        /* Write only the bytes that were actually received. */
        rc_write = write(fd_file, buffer, rc_recv);
        …
    }
• “Intermediate” processing between device ops
  − Prevents applications from using direct D2D comm.
  − Causes host-side resource contention (CPU and memory)
[Figure: DevA, DevB, and the CPU]
Design Goals
• Performance & scalability
  − Faster inter-device data & control communication
  − More scalable with CPU-efficient device operations
• Flexibility
  − Support any type of off-the-shelf device
• Applicability
  − Increase the opportunities for applying D2D comm.
Index
• Existing approaches
• DCS-ctrl: HW-based device-control mechanism
  − Key ideas and benefits
  − Architecture
• Experimental results
• Conclusion
DCS-ctrl: Key Ideas & Benefits
• DCS-ctrl: PCIe P2P + “HDC”
  − Hardware-based device-control (HDC) mechanism
  − HDC Engine: “FPGA-based” device orchestrator + “near-device” processing unit
    ▪ Performance & scalability → HDC, device orchestrator
    ▪ Flexibility → FPGA-based, low-cost device controller
    ▪ Applicability → near-device processing unit
HDC Engine: Overview
[Figure: SW-controlled P2P (application, device drivers A, B, and C, then Dev A, Dev B, Dev C) next to DCS-ctrl (HW) (application, HDC Engine on an FPGA with device ctrl A, B, and C plus NDP, then Dev A, Dev B, Dev C)]
DCS-ctrl: Key Ideas & Benefits
    void ssd_to_nic() {
        get_from_ssd(&data);     /* read from the SSD */
        process_in_HDC(&data);   /* intermediate processing inside the HDC */
        write_to_nic(&data);     /* send through the NIC */
    }
Optimized dev. control ⇒ Faster & scalable communication
Generic dev. interfaces ⇒ Higher flexibility
Near-device processing ⇒ Higher applicability
[Figure: panels showing the CPU, DevA, DevB, DevC, NewDev, and the HDC with its device controller, data path, and control path]
Key Idea #1: Device Orchestrator
Scoreboard
    Dev  R/W    Src          Dst          Aux   State
    A    Read   Addr(DevA)   Addr(NDP-A)  -     Done
    -    -      Addr(NDP-A)  Addr(NDP-B)  Hash  Issue
    B    Write  Addr(NDP-B)  Addr(DevB)   -     Ready
• Perform multi-device tasks w/o CPU involvement
  − Offload a multi-device task to HDC Engine
  − Manage all device operations and their dependencies
[Figure: a multi-device task flowing from Dev A through NDP to Dev B]
Fast hardware-level device control
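The scoreboard on this slide can be modeled in software to make the dependency handling concrete. The sketch below is only an illustration: the field names follow the slide's columns, but the state meanings (Ready = waiting, Issue = in flight, Done = completed), the rule that each entry depends on the one before it, and all identifiers (`sb_entry`, `sb_issue_next`, `sb_complete`) are assumptions, not the authors' hardware design.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one scoreboard row; columns follow the slide
 * (Dev, R/W, Src, Dst, Aux, State), everything else is assumed. */
enum sb_op    { SB_READ, SB_WRITE, SB_NDP };
enum sb_state { SB_READY, SB_ISSUE, SB_DONE };

struct sb_entry {
    char          dev;    /* 'A', 'B', or '-' for an NDP-only step */
    enum sb_op    op;
    const char   *src;
    const char   *dst;
    const char   *aux;    /* e.g. "Hash", or "-" when unused */
    enum sb_state state;
};

/* Assumed dependency rule: entry i may be issued only after entries
 * 0..i-1 are DONE. Returns the index issued, or -1 if nothing can be
 * issued (an operation is still in flight, or the task is finished). */
int sb_issue_next(struct sb_entry *sb, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (sb[i].state == SB_DONE)
            continue;             /* already completed, look further */
        if (sb[i].state == SB_ISSUE)
            return -1;            /* predecessor still in flight */
        sb[i].state = SB_ISSUE;   /* first READY entry after a DONE prefix */
        return (int)i;
    }
    return -1;                    /* every entry is DONE */
}

/* Mark an in-flight entry as completed; in hardware this would be
 * triggered by a device or NDP completion event. */
void sb_complete(struct sb_entry *sb, size_t n, size_t i)
{
    assert(i < n && sb[i].state == SB_ISSUE);
    sb[i].state = SB_DONE;
}
```

Starting from the three rows shown on the slide (Done, Issue, Ready), `sb_issue_next` issues nothing until the hash step completes, after which it issues the write to Dev B. A linear chain is the simplest dependency model; a real orchestrator may track more general dependencies.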
Key Idea #2: Device Controller
• Provide interfaces between HDC Engine & devices
  − Include submission & completion queues
  − Build standard & vendor-specific device commands
[Figure: device controller with a submission queue, a completion queue, and doorbell registers, connected to the device through the PCIe switch]
Flexible & low-cost device control
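To show how a submission queue, a completion queue, and a doorbell register fit together, here is a minimal software sketch of a queue pair. It is a simplified illustration under stated assumptions: the command and completion layouts, the queue depth, and the names `dev_queue_pair`, `dev_submit`, and `dev_poll` are hypothetical, and the completion check uses a tail index passed in by the caller rather than any mechanism a real device protocol defines.

```c
#include <assert.h>
#include <stdint.h>

#define QUEUE_DEPTH 4u   /* assumed depth, kept tiny for illustration */

/* Hypothetical command and completion formats. */
struct dev_cmd { uint8_t opcode; uint64_t src; uint64_t dst; uint32_t len; };
struct dev_cpl { uint16_t cmd_id; uint16_t status; };

struct dev_queue_pair {
    struct dev_cmd sq[QUEUE_DEPTH];   /* submission queue: controller to device */
    struct dev_cpl cq[QUEUE_DEPTH];   /* completion queue: device to controller */
    uint32_t sq_tail;                 /* next free submission slot */
    uint32_t cq_head;                 /* next completion to consume */
    volatile uint32_t sq_doorbell;    /* stands in for a memory-mapped doorbell */
};

/* Place a command in the submission queue, then ring the doorbell with
 * the new tail. Returns the slot index, used here as the command id.
 * Full-queue handling is omitted in this sketch. */
uint16_t dev_submit(struct dev_queue_pair *qp, struct dev_cmd cmd)
{
    uint32_t slot = qp->sq_tail;
    qp->sq[slot] = cmd;
    qp->sq_tail = (slot + 1u) % QUEUE_DEPTH;
    qp->sq_doorbell = qp->sq_tail;    /* signals that new work is queued */
    return (uint16_t)slot;
}

/* Consume one completion if any is pending. `cq_tail` is the slot the
 * device would write next. Returns 1 and fills *out, or 0 if empty. */
int dev_poll(struct dev_queue_pair *qp, uint32_t cq_tail, struct dev_cpl *out)
{
    assert(cq_tail < QUEUE_DEPTH);
    if (qp->cq_head == cq_tail)
        return 0;                     /* no new completion */
    *out = qp->cq[qp->cq_head];
    qp->cq_head = (qp->cq_head + 1u) % QUEUE_DEPTH;
    return 1;
}
```

In the deck's design this logic lives in FPGA hardware and the doorbell write travels over PCIe P2P; the sketch only mirrors the ordering (fill the queue slot first, ring the doorbell second) that such queue-based interfaces rely on.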
Key Idea #3: Near-device Processing
• Near-device processing units
  − Execute intermediate processing between device ops
  − Scale-out storage app → hash, encryption, compression
Easy to extend to support other devices & applications
    Processing unit  LUTs   Registers  Applications
    MD5              3.0%   0.69%      Swift
    AES256           3.52%  0.99%      HDFS, Swift
    GZIP             5.36%  2.09%      HDFS
Highly applicable to existing applications
Index
• Existing approaches
• DCS-ctrl: HW-based device-control mechanism
  − Key ideas and benefits
  − Architecture
• Experimental results
• Conclusion
Baseline Architecture
• Software-controlled P2P
  − P2P comm. + indirect device-control path
[Figure: SW side with the application and device drivers; HW side with DevA, DevB, and DevC behind a PCIe switch]
DCS-ctrl: HW-based Device Control (1/3)
• Offload device-control path to HDC Engine
  − Scoreboard: schedule device operations in a multi-dev task
[Figure: SW side with the application; HW side with the FPGA-based HDC Engine holding a scoreboard (columns Dev, r/w, Src, Dst; rows A, B, C), plus DevA, DevB, and DevC behind a PCIe switch]
DCS-ctrl: Low-cost Integration (2/3)
• Implement an FPGA-based device controller
  − Device controller: directly control devices using P2P
[Figure: SW side with the application; HW side with the FPGA-based HDC Engine (scoreboard and device controller), plus DevA, DevB, DevC, and NewDev behind a PCIe switch]
DCS-ctrl: Near-device Processing (3/3)
• Provide units for intermediate processing
  − NDP unit: perform data processing on a data path
[Figure: SW side with the application; HW side with the FPGA-based HDC Engine (scoreboard, device controller, near-device processing, and intermediate buffers), plus DevA, DevB, DevC, and NewDev behind a PCIe switch]
DCS-ctrl Prototype
HDC Engine implemented on Xilinx Virtex-7 VC707
Supports off-the-shelf devices: Intel 750 SSDs, Broadcom 10GbE NICs, NVIDIA GPUs
Index
• Existing approaches
• DCS-ctrl: HW-based device-control mechanism
• Experimental results
• Conclusion
Reducing Device Control Latency
• encrypted_sendfile(): SSD → hash → NIC
  − SW opt (+P2P): frequent boundary crossings, complex software
  − DCS-ctrl: fewer crossings, hardware-based device control
[Chart: latency (us) without processing for SW opt and DCS-ctrl, broken down into SW, kernel, dev ctrl, and HW]
[Chart: latency (us) with processing (AES256) for SW opt, SW opt + P2P, and DCS-ctrl, broken down into SW, kernel, data copy, dev ctrl, and HW]
[Chart annotations: 42%, 72%]
Reducing CPU Utilization
• Swift & HDFS workloads
  − Offload device control & data transfers to hardware
[Chart: Swift, normalized CPU utilization for SW opt, SW opt + P2P, and DCS-ctrl, broken down into kernel (GET), kernel (PUT), GPU control, and others]
[Chart: HDFS, normalized CPU utilization (send and recv) for SW opt, SW opt + P2P, and DCS-ctrl, broken down into kernel (sender), kernel (receiver), GPU control, and others]
[Chart annotations: 50%, 52%, 49%]
Scalability: More Devices
• Swift & HDFS workloads
  − More CPU-efficient → support more high-performance devices
[Chart: Swift, CPU utilization (# cores, 0 to 6) vs. throughput (Gbps, 0 to 40) for SW opt, SW opt + P2P, and DCS-ctrl]
[Chart: HDFS, CPU utilization (# cores, 0 to 6) vs. throughput (Gbps, 0 to 40) for SW opt, SW opt + P2P, and DCS-ctrl]
Conclusion
• Fast & flexible device-control mechanism
  − Hardware-based device-control (HDC) mechanism
  − FPGA-based standard device controllers
  − Near-device data processing (NDP) units
• Real hardware prototype evaluation
  − 72% faster inter-device communication
  − 50% lower CPU utilization for Swift & HDFS