RMI: High Performance User-level Multi-threading on SPDK/DPDK
Munenori MAEDA†, Shinji KOBAYASHI†, Jun KATO†, Mitsuru SATO†, Hiroki OHTSUJI‡
† Fujitsu Limited  ‡ Fujitsu Laboratories Ltd.
RMI Development Team
Munenori MAEDA • Fujitsu Limited • TL, Engineer, Ph.D.
Shinji KOBAYASHI • Fujitsu Limited • Engineer
Jun KATO • Fujitsu Limited • Engineer
Mitsuru SATO • Fujitsu Limited • Senior Manager, Ph.D.
Hiroki OHTSUJI (Speaker) • Fujitsu Laboratories Ltd. • Researcher, Ph.D.
Agenda
Background
Why Thread?
RMI Architecture
RMI Design and Features
ARCHITECTURE
RMI Threading Environment
Framework Overview
Multicore Optimizations
Benchmark Result
RMI Internode Communication
Design Overview
Multicore Optimizations
Performance Evaluation
Summary
Background
Background
Enterprise AFA (All-Flash Array) Requirements
Capacity
Performance
Reliability and Availability
Scalability
Rich Storage Services
• Deduplication
• Erasure Coding
• Cloud native functions
• … and so on
HW Technology Trends
Multicore CPUs with two or more sockets
High-speed interconnects with RDMA (RoCE)
Low-latency storage devices (NVMe, 3D XPoint)
SPDK Framework
Kernel-bypass architecture that exploits the HW performance of the latest devices
Has the potential to become the de facto standard
Our research has been ongoing since 2016
Background
What problems did our development face? Early evaluation revealed:
Development Expense and Speed
• Developers are unfamiliar with asynchronous programming
• Debugging takes a long time
Effective Performance and Scalability
• Multicore parallelism sometimes degrades performance
• No load balancer to level CPU utilizations
• Internode Communication impacts I/O latency and Performance
Missing pieces of SPDK
Programmability
RDMA based Internode Communication
What we need for Programming Framework
Better programmability for Industrial Applications
Compact
Comprehensive
Reliable
Integration with SPDK
SPDK already provides basic storage services and HW drivers, including the NVMe driver
Use them “as is” to shorten the development period
Multi-threading Framework
Thread vs. Event
A long-standing debate in the academic and industrial communities
Both have their pros and cons
Yes, SPDK/DPDK is on the “Event” side
Why Thread?
Developers are quite familiar with Blocking API
• Traditional POSIX programming: file, socket
Why Thread? (Contd.)
Multi-threading has had great success in some domains:
• Apache web server and web services with stateful sessions
Well suited to AFA storage services, which have:
Complicated and large code
Many states to handle HW or SW failures
Internode Communication
Various communication purposes:
Delegate I/O request to LUN Node
Update Metadata
Maintain Data Redundancy
Check Sanity/Soundness for HA
Data Redundancy constrains Scalability
Mixed use of short messages and long RDMA transfers (8 KB or more)
[Diagram: I/O stack (SCSI, Extent Lock, Cache, Deduplication, Log-Structured FS, RAID) spanning Node #m, the responsible Node #p, and pair Node #n; mirroring is maintained with synchronous RDMA writes, alongside cross access and reference-count updates between nodes]
RMI Architecture
RMI Design and Features
Our Framework: RMI
An abbreviation of Rocket Messaging Infrastructure
Supports Multi-threading
Lightweight user-level threads
Various synchronizing primitives, similar to Java and Go
SPDK compliant
No changes to the SPDK NVMe driver or 3rd-party drivers
SPDK event-driven programming is still effective
Provides Dynamic Load Balancing
Levels multicore utilization to exploit full performance
ARCHITECTURE
[Architecture diagram. Legend: RMI, SPDK, 3rd party.
RMI components: RMI Threading Environment, Synchronizing Primitives, Thread Pool / Load Balancer, Synchronous Multi-thread Support, RMI BDEV Bridge API, RDMA (Verbs) Driver, Intel QuickAssist Technology Driver.
SPDK components (unchanged): Storage Services; Storage Protocols (SCSI, iSCSI Target, NVMe-oF Target, NVMe); Block Device Abstraction (BDEV); Drivers (NVMe, NVMe-oF Initiator, Intel QuickData Technology (I/OAT)).
3rd party: vendor implementations.
Hardware: NVMe PCIe devices, accelerators, internode devices.]
This Talk …
The two main topics we present are:
RMI Threading Environment
RMI Internode Communication
Meanwhile, the following topics are omitted or only outlined:
RMI BDEV Bridge API
RMI Load Balancing Algorithm
RMI Synchronizing Primitives
RMI QuickAssist Driver Glue
RMI Threading Environment
RMI Threading Environment
Based on Lthread of DPDK 16.11
User-level: N OS native threads handle M RMI threads
“Cooperative”, or Non-preemptive: running threads voluntarily yield control
Synchronizing Primitives
i-structure (equivalent to Haskell’s IVar or Java’s Future)
q-structure (equivalent to Java’s BlockingQueue or Go’s channel)
RCU, Semaphore, and so on.
Thread Pool / Load Balancer
Enables fast and efficient allocation of Threads
Provides a “matching place” between idle threads and tasks
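To make the i-structure semantics concrete, here is a minimal, self-contained C11 sketch of a write-once cell with blocking reads. It is illustrative only: the names are ours, not the RMI API, and where RMI suspends the user-level thread and later resumes it from the scheduler, this sketch simply yields the OS thread.

#include <stdatomic.h>
#include <stdint.h>
#include <sched.h>

/* Illustrative i-structure: a write-once cell (like Haskell's IVar
 * or Java's Future). Readers block until a value has been written. */
typedef struct {
    atomic_int filled;   /* 0 = empty, 1 = value is available */
    uint64_t   value;
} istr_u64;

static void istr_u64_init(istr_u64 *s) {
    atomic_init(&s->filled, 0);
}

/* Write exactly once; the release store publishes the value. */
static void istr_u64_write(istr_u64 *s, uint64_t v) {
    s->value = v;
    atomic_store_explicit(&s->filled, 1, memory_order_release);
}

/* Read blocks (here: yields the CPU) until the cell is filled.
 * RMI instead suspends the user-level thread and resumes it from
 * the main loop when the value is written. */
static uint64_t istr_u64_read(istr_u64 *s) {
    while (!atomic_load_explicit(&s->filled, memory_order_acquire))
        sched_yield();
    return s->value;
}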
RMI Threading Environment: Thread Pool
Design Considerations
Practical Limitation of Memory Footprint
• How many available threads?
• It depends on the total stack size across all threads
Locality-aware Task Scheduling
• Existing task scheduling methods are:
(1) Tasks activate Idle Threads
(2) Active Threads pull Waiting Tasks
• Common Problems
• Locality Unawareness
• Unnecessary Context Switches
RMI Threading Environment: Thread Pool
RMI Approach
Fixed memory footprint: 2 GB
• The total stack is the product of 16K threads × a 128 KB fixed stack each
Hybrid Task Scheduling
• Hybrid of existing (1) and (2) methods
• Under some conditions, both idle threads and waiting tasks are non-empty
• Lazy binding of Task and Thread
• Locality-aware
• Minimized Context Switching
[Diagram: the RMI Thread Pool holds idle threads, active threads, and waiting tasks; a submitted task is lazily matched with a thread]
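A simplified sketch of the hybrid matching logic. All names are illustrative, wake_with_task() and current_thread() stand in for scheduler hooks, and the real pool uses the concurrent queues described later rather than plain arrays:

struct task   { void (*fn)(void *); void *arg; };
struct thread;                                  /* user-level thread handle */

/* Assumed scheduler hooks (illustrative, not the RMI API): */
void wake_with_task(struct thread *t, struct task tk);   /* context switch */
struct thread *current_thread(void);

static struct task    waiting_tasks[16384]; static int n_tasks;
static struct thread *idle_threads[16384];  static int n_idle;

/* (1) A submitted task activates an idle thread if one exists;
 *     otherwise it stays unbound in the pool (lazy binding). */
void task_submit(struct task tk) {
    if (n_idle > 0)
        wake_with_task(idle_threads[--n_idle], tk);
    else
        waiting_tasks[n_tasks++] = tk;
}

/* (2) A thread that finishes its task pulls the next waiting task
 *     and runs it directly: no context switch, warm caches. */
void thread_finished(void) {
    while (n_tasks > 0) {
        struct task tk = waiting_tasks[--n_tasks];
        tk.fn(tk.arg);
    }
    idle_threads[n_idle++] = current_thread();
    /* ...then yield back to the RMI scheduler... */
}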
RMI Threading Environment: Example
A computational entity is defined as a Task:

static int number_stream(void) {
    … // Task declaration
    struct rmi_exec_task gen_task;
    // Content of Task is defined
    rmi_exec_task_init(&gen_task);
    gen_task.fn = gen;
    gen_task.fn_arg = &ga;
    // Submit the Task into the Thread Pool
    rmi_task_submit_thrpool(&gen_task);
    …
}

static int gen(void *arg) {
    struct gen_arg args = *(struct gen_arg *)arg;
    for (uint64_t i = 2; i < args.max; ++i) {
        // Use of an RMI synchronizing primitive:
        // write an integer to a bounded buffer
        rmi_bbstr_u64_write(args.out, i);
    }
    rmi_bbstr_u64_write(args.out, 0UL); // terminating 0
    return 0;
}
RMI Threading Environment: Load Balancing
Design Considerations
Possible migration targets are:
• Active Running Threads, or
• Runnable Tasks in the Thread Pool
Active Running Threads
• [Pros] LB may directly level core utilizations
• [Cons] C programs and libraries are constrained in their use of Thread-Local Storage (TLS)
• Usually supported at the language level, as in Go
Runnable Tasks ← the only target RMI supports
• [Pros] No limitations on C programs, including OSS, because TLS has not yet been accessed
• [Cons] Because it works by rearranging tasks, LB levels core utilizations only gradually
Note: native threads may access TLS safely with OS assistance; cooperative threads, however, have no such OS assist.
RMI and SPDK Integration
The RMI thread scheduler takes over the SPDK reactor running on each core
SPDK runs under the RMI scheduler context
[Diagram: the RMI main loop wraps a single SPDK reactor iteration, used as-is with no change (poll HW and SW queues via the poller and timer poller, manage timers), then runs prioritized events, runs events, and switches to a thread context]
Multicore Performance Issue
SW Queues in RMI Main Loop are Frequently Accessed
SW queues form the RMI hub infrastructure, designed as concurrent queues:
RMI task queue
Ready queue
SPDK event queue
Concurrent Queue may impact Multicore Performance
Mutexes are NEVER used in the RMI main loop, on principle
Most Non-blocking Concurrent Queues use HW Locks (atomic instructions) instead
HW locks restrict Multicore Scalability
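By “HW locks” we mean atomic read-modify-write instructions such as compare-and-swap. A generic C11 sketch (not SPDK’s actual queue) of where the cost comes from:

#include <stdatomic.h>

/* Generic non-blocking push onto a shared LIFO list (Treiber style).
 * The CAS loop is lock-free, but each CAS is an atomic RMW: under
 * contention, cores take turns owning the cache line of "head",
 * which is what restricts multicore scalability. */
struct node { struct node *next; };

void lifo_push(struct node *_Atomic *head, struct node *n) {
    struct node *old = atomic_load_explicit(head, memory_order_relaxed);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 head, &old, n,
                 memory_order_release, memory_order_relaxed));
}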
Preliminary Concurrent Queue Evaluation
Round-trip Test for Concurrent Queue
Single queue per core, as “Multi-Producer, Single-Consumer”
Measure Elapsed time from (1) to (4)
All-to-all
Comparison
rte_ring of DPDK 16.11
Vyukov’s queue in Lthread of DPDK 16.11
[Diagram: core #P pushes an element (1) into core #Q’s queue; core #Q pops it (2) and pushes a reply (3); core #P pops the returned element (4)]
Preliminary Result
Result / rte_ring
Round-trip time (us) on a Xeon E5-2697 v4 (Broadwell) 2.3 GHz, 2 sockets.
Single socket: all cores in a single socket. Dual sockets: cores divided in half (4+4, 8+8, 16+16).

# of cores     2        4        8        16        32
Single socket  0.6369   1.5076   3.84201  11.50558  -
Dual sockets   1.4308   3.3266   8.40633  22.94589  68.56327

• Long latency: the 32-core RTT is 68.5 us.
• Poor scalability: RTT grows 48x when the number of cores grows 16x.
Preliminary Result (Contd.)
Result / Vyukov’s queue

Round-trip time (us) of Vyukov’s queue in Lthread (DPDK 16.11), same hardware (Xeon E5-2697 v4 2.3 GHz, 2 sockets); single socket vs. dual sockets (4+4, 8+8, 16+16):

# of cores     2        4        8        16       32
Single socket  0.3856   0.4994   0.96111  2.25464  -
Dual sockets   0.5477   1.15905  2.29205  5.47565  10.40527

6.6x faster than rte_ring
• Acceptable latency? The 32-core RTT is 10 us.
• Good scalability: RTT grows 18x when the number of cores grows 16x.
RMI Multicore Optimizations
Observations
HW locking between cores on different sockets is quite heavy
• High performance requires reducing HW locks
Improving concurrent-queue latency is still necessary
• Vyukov’s queue is faster, but its latency is still relatively high (>5 us each way)
RMI Approach
Optimizations to reduce HW locks, focusing on dedicated uses:
• RMI task queue for Load Balancing
• Lockless i-structure for HW completions
• RMI Internode Communication (←presented later)
RMI task queue
Consists of N fast SPSC queues with no HW locks, plus one slow MPSC queue
Enables fast push/pop unless an SPSC queue overflows
Preserves FIFO order even when overflowed
[Diagram: N SPSC queues, similar to an SPSC rte_ring, one per producer; a push falls back to an MPSC rte_ring when its SPSC queue overflows]
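The fast path works because a single-producer/single-consumer ring needs no atomic read-modify-write at all: each index has exactly one writer, so acquire/release loads and stores suffice. A minimal C11 sketch of such a ring (illustrative, not the RMI implementation):

#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 256                /* power of two */

struct spsc_ring {
    void *slot[RING_SIZE];
    _Atomic unsigned head;           /* written only by the consumer */
    _Atomic unsigned tail;           /* written only by the producer */
};

/* Producer side: no lock prefix, no CAS; just an acquire load of the
 * consumer's index and a release store of our own. */
bool spsc_push(struct spsc_ring *r, void *e) {
    unsigned t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (t - h == RING_SIZE)
        return false;                /* full: caller falls back to MPSC */
    r->slot[t % RING_SIZE] = e;
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return true;
}

bool spsc_pop(struct spsc_ring *r, void **e) {
    unsigned h = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h == t)
        return false;                /* empty */
    *e = r->slot[h % RING_SIZE];
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return true;
}

A push that finds its SPSC ring full falls back to the shared MPSC rte_ring, which does use atomic RMW but is rarely touched.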
RTT Benchmark Result
Result / RMI task queue
Round-trip time (us) of rmi_task_queue, same hardware (Xeon E5-2697 v4 2.3 GHz, 2 sockets), with Vyukov’s queue for comparison; single socket vs. dual sockets (4+4, 8+8, 16+16):

# of cores                      2         4         8         16        32
rmi_task_queue, single socket   0.39825   0.5881    0.97185   1.6565    -
rmi_task_queue, dual sockets    0.8531    1.2832    1.90447   2.75045   4.75891
Vyukov’s queue, dual sockets    0.5477    1.15905   2.29205   5.47565   10.40527

• Low latency: the 32-core RTT is 4.8 us.
• Super scalability: RTT grows only 12x when the number of cores grows 16x.
Prime Benchmark
Description
“Prime” sieves N natural numbers (N = 10M) through 10K filter threads, one per prime (2, 3, 5, …)
Filter threads are connected by channels
The number of actual threads stays below 16K, the RMI Thread Pool capacity limit
Compare with a native Go program
Both use compatible synchronization primitives
The Go program creates threads directly, while the RMI program submits tasks to the Thread Pool
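A sketch of one filter stage in the style of the slide-17 example. rmi_bbstr_u64_read() and the rmi_bbstr_u64 type are our assumed counterparts to the rmi_bbstr_u64_write() primitive shown earlier; the remaining names are illustrative:

// One sieve stage: the first number read is this filter's prime;
// every later number not divisible by it is forwarded downstream.
struct filter_arg { rmi_bbstr_u64 *in, *out; };  // assumed buffer type

static int prime_filter(void *arg) {
    struct filter_arg args = *(struct filter_arg *)arg;
    uint64_t p = rmi_bbstr_u64_read(args.in);    // assumed read primitive
    uint64_t n;
    while ((n = rmi_bbstr_u64_read(args.in)) != 0UL) {
        if (n % p != 0)
            rmi_bbstr_u64_write(args.out, n);    // blocks if buffer is full
    }
    rmi_bbstr_u64_write(args.out, 0UL);          // propagate the terminator
    return 0;
}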
Prime Benchmark Result
Good Performance, superior to Go
Further analysis is needed for the scalability decrease at 32 cores
Execution time (sec) and speed-up ratio on SUPERMICRO 2028U-TN24R4T / Xeon E5-2697 v4 2.3 GHz, 36 cores / 512 GB:

# of cores              1       2      4      8      16     32
Go, exec. time (sec)    146.9   84.0   40.5   21.6   12.6   8.4
RMI, exec. time (sec)   40.0    21.3   11.8   5.9    3.6    3.7
Go, speed-up ratio      1.0     1.7    3.6    6.8    11.7   17.5
RMI, speed-up ratio     1.0     1.9    3.4    6.8    11.3   10.8
RMI Internode Communication
Design of Communication Primitives
The Verbs API is commonly used for RDMA over Converged Ethernet (RoCE).
We adopt RoCE for low latency and low CPU usage.
Communication primitives for RMI
Using native Verbs API?
• Verbs API provides asynchronous posting and polling style primitives
• E.g. ibv_post_send() and ibv_poll_cq()
• Polling in every RMI thread is complicated and inefficient
RMI Approach
• Provides simple blocking style primitives implemented using Verbs APIs
• E.g. mmi_send(), mmi_recv(), mmi_read_rdma(), mmi_write_rdma()
Implementation of Blocking Primitives
Send/RDMA Read/RDMA Write
i-structure is used to synchronize with the I/O completion.
[Diagram: an RMI thread cooperates with the RMI main loop (a single SPDK reactor iteration: poll HW and SW queues, manage timers, switch to a thread context) through an i-structure:
1. The RMI thread posts the Send.
2. The RMI thread reads the i-structure and blocks.
3. The main loop polls the Send completion.
4. The main loop writes the i-structure.
5. The RMI thread resumes execution.]
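A hedged sketch of how such a blocking send can be built from the Verbs API plus the i-structure, following the five steps above. The ibv_* calls are the real Verbs API; the rmi_istr_* functions and the wr_id plumbing are our illustrative rendering, not necessarily RMI’s exact code:

#include <stdint.h>
#include <infiniband/verbs.h>

/* Assumed i-structure API, per the RMI primitives described earlier. */
typedef struct rmi_istr_int { int value; /* plus scheduler state */ } rmi_istr_int;
void rmi_istr_int_init(rmi_istr_int *istr);
int  rmi_istr_int_read(rmi_istr_int *istr);          /* blocks the RMI thread */
void rmi_istr_int_write(rmi_istr_int *istr, int v);  /* resumes the reader    */

int mmi_send(struct ibv_qp *qp, void *buf, uint32_t len, uint32_t lkey) {
    rmi_istr_int done;
    rmi_istr_int_init(&done);

    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = lkey };
    struct ibv_send_wr wr = {
        .wr_id      = (uintptr_t)&done,  /* lets the poller find the i-structure */
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad;
    if (ibv_post_send(qp, &wr, &bad))    /* step 1: post Send */
        return -1;
    return rmi_istr_int_read(&done);     /* step 2: block; step 5: resume */
}

/* Steps 3 and 4, run from the RMI main loop. */
void poll_send_completions(struct ibv_cq *cq) {
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) > 0)
        rmi_istr_int_write((rmi_istr_int *)(uintptr_t)wc.wr_id,
                           wc.status == IBV_WC_SUCCESS ? 0 : -1);
}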
Implementation of Blocking Primitives (Contd.)
Receive
q-structure is used for synchronization as multiple messages may arrive.
[Diagram: the receive path synchronizes through a q-structure:
1. The RMI thread reads the q-structure and blocks.
2. The main loop polls for receive completions.
3. The main loop writes the q-structure.
4. The RMI thread resumes execution.]
Multicore Performance Issue of Internode Comm.
Spin Locks in the communication library impact performance
Performance degrades when many cores access a Queue Pair (QP)
• A QP is the communication endpoint in the Verbs API; one QP is needed for each communication target
[Diagram: cores 0 through N all contending for one shared QP]
Dedicated QP for Each Core
Provide a dedicated QP for each destination node / RoCE port / core.
Each core communicates with the corresponding core on the destination node.
The diagram assumes a single RoCE port for simplicity.
[Diagram: Nodes 1, 2, and 3; on each node, cores 0 through N each own dedicated QPs, paired one-to-one with the QPs of the corresponding cores on the other nodes]
Dedicated QP for Each Core (Contd.)
Pros
No simultaneous access to any QP.
• Spin lock costs are greatly reduced.
• Specifying an appropriate thread model through Mellanox’s experimental Verbs API further reduces the CPU time wasted on unnecessary HW locks.
Cons
Many QPs are needed.
• Modern NICs support more than ~100K QPs.
Number of QPs needed = max # of destination nodes × # of RoCE ports × # of cores = 15 × 2 × 40 = 1200
We have enough QPs for our case.
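The per-(node, port, core) layout makes QP selection a plain array lookup with no locking. A tiny illustrative sketch; the names and sizes are ours, taken from the arithmetic above:

#include <infiniband/verbs.h>

#define MAX_NODES 15   /* max destination nodes           */
#define NUM_PORTS 2    /* RoCE ports                      */
#define NUM_CORES 40   /* cores (15 * 2 * 40 = 1200 QPs)  */

/* One dedicated QP per (destination node, port, core): since each
 * core only ever touches its own entries, no lock is needed. */
static struct ibv_qp *qp_table[MAX_NODES][NUM_PORTS][NUM_CORES];

static inline struct ibv_qp *qp_for(int dst_node, int port, int core) {
    return qp_table[dst_node][port][core];
}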
Performance Evaluation Environment
Software
Latency benchmark using pairs of RMI threads exchanging a message or performing RDMA operations in ping-pong style.
Hardware
Processors: 2× Intel Xeon Gold 6148 (Skylake)
• 38 of the 40 cores are used for the experiments.
• HyperThreading is disabled.
NIC: Mellanox ConnectX-4 EN
• Dual-port 40GbE.
Network: 40GbE switch
[Diagram: Node 1 and Node 2 connected through a 40 GbE switch with VLANs; thread i on Node 1 is paired with thread i on Node 2, for i = 1 … N]
Performance Evaluation Results
Latency of Dedicated QP is ~1/10 of that of Shared QP when many threads are communicating.
For Shared QP, the optimization of acquiring a lock before polling is adopted.
[Chart: Latency of RDMA 512B (us) vs. # of RMI threads (1 to 256); series: Dedicated QP RDMA Read, Dedicated QP RDMA Write, Shared QP RDMA Read, Shared QP RDMA Write; a “Minimum 2 × 40GbE” line marks the wire-limited floor]
[Chart: Latency of Message Send/Receive (us) vs. # of RMI threads (1 to 256); series: Dedicated QP, Shared QP]
Performance Evaluation Results (Contd.)
CPU Usage of pthread_spin_lock() is reduced if Exp. Verbs is used.
We use IBV_EXP_THREAD_UNSAFE in ibv_exp_create_res_domain().
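A sketch of the resource-domain setup, based on the MLNX_OFED experimental Verbs API as we understand it. ibv_exp_create_res_domain() and IBV_EXP_THREAD_UNSAFE are named on this slide; the other attribute and field names are our assumptions and should be checked against the installed headers:

#include <infiniband/verbs_exp.h>

/* Declare that objects in this resource domain are used by a single
 * thread, so the library can skip its internal spin locks. */
struct ibv_exp_res_domain *make_unsafe_domain(struct ibv_context *ctx) {
    struct ibv_exp_res_domain_init_attr attr = {
        .comp_mask    = IBV_EXP_RES_DOMAIN_THREAD_MODEL,
        .thread_model = IBV_EXP_THREAD_UNSAFE,
    };
    return ibv_exp_create_res_domain(ctx, &attr);
}

/* The returned domain is then passed via the res_domain field of the
 * experimental CQ/QP creation attributes, one domain per core. */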
[Chart: pthread_spin_lock CPU usage (%) for RDMA 512B vs. # of RMI threads (1 to 256); series: Read/Write with and without Exp. Verbs; the “with Exp. Verbs” series are reduced]
[Chart: pthread_spin_lock CPU usage (%) for Message vs. # of RMI threads (1 to 256); series: with and without Exp. Verbs; the “with Exp. Verbs” series is reduced]
Summary
Designed and Implemented RMI on SPDK/DPDK for Enterprise AFA
Better programmability
• Light-weight user-level threads
• Rich synchronizing primitives
• Blocking style communication primitives
SPDK compliant
• Various libraries and HW drivers can be used “as is.”
High Performance
• Load balancing with fast concurrent queue
• Lockless i-structure
• Low latency RDMA
Questions?
E-mail:
Bdev Extension
SPDK design
Bdev fn_table may define:
• What to handle
• submit_request
• get_io_channel
• destruct
• How to dispatch
• Function call
• Event call
spdk_bdev_io_submit/_complete
• Invoke the Bdev fn_table, like a vtable
RMI design
Bdev fn_table Extension:
• Adds RMI’s dispatch manner
• Task submit
rmi_bdev_io_submit/_complete
• Determine whether the transfer direction is from SPDK to RMI, or vice versa
• Select dispatch mechanism
• [RMI→SPDK] event call
• [SPDK→RMI] task submit
• [*→*] Function call
• Invoke extended Bdev fn_table
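An illustrative sketch of that dispatch selection. The spdk_event_allocate()/spdk_event_call() pair is the real SPDK event API; rmi_task_submit_thrpool() follows the RMI API shown earlier, and the direction enum and dispatch() helper are ours:

#include "spdk/event.h"

struct rmi_exec_task;                          /* from the RMI API shown earlier */
void rmi_task_submit_thrpool(struct rmi_exec_task *t);

enum xfer_dir { RMI_TO_SPDK, SPDK_TO_RMI, SAME_LAYER };

/* Sketch: pick the dispatch mechanism by transfer direction. */
static void dispatch(enum xfer_dir dir, uint32_t core,
                     spdk_event_fn fn, void *arg,
                     struct rmi_exec_task *task) {
    switch (dir) {
    case RMI_TO_SPDK:   /* event call into the SPDK reactor */
        spdk_event_call(spdk_event_allocate(core, fn, arg, NULL));
        break;
    case SPDK_TO_RMI:   /* task submit into the RMI thread pool */
        rmi_task_submit_thrpool(task);
        break;
    case SAME_LAYER:    /* plain function call */
        fn(arg, NULL);
        break;
    }
}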
Control Transfer between RMI and SPDK
[Diagram: the SPDK layer (spdk_event_queue, reactor ready_queue) and the RMI layer (rmi_task_queue, RMI Thread Pool, RMI Scheduler, RMI threads) exchange control through event calls, task submits, thread resumes, i-structure writes, etc.]
Dedicated asynchronous primitives may be called anytime, anywhere, to transfer control between layers.
Lockless i-structure
The HW lock in the i-structure can be safely removed if the Send is posted AFTER the RMI thread has been blocked.
[Diagram: as on slide 31, but with the Send reordered:
1. The RMI thread reads the lockless i-structure and blocks.
2. The Send is posted, from the scheduler context, after the thread is blocked.
3. The main loop polls the Send completion.
4. The main loop writes the i-structure.
5. The RMI thread resumes execution.]
Lockless i-structure (Contd.)
Generic i-structure
Post the Send before reading the i-structure.
The HW lock in the i-structure is mandatory, as another core may write to it while the read is in progress.
Lockless i-structure
Post the Send in a pending function that is executed after the RMI thread has been blocked by reading the i-structure.
The HW lock in the i-structure is not necessary, since the Send cannot complete before the reader is blocked.
// Generic i-structure: the Send is posted first, so its completion
// may race with the read; the i-structure needs its HW lock.
…
post_send(…);
int ret = rmi_istr_int_read(&istr);
// Blocks until the Send is completed.

// Lockless i-structure: the Send is posted only after the thread
// has blocked, so no other core can race with the read.
static void pending_send(rmi_istr_int *istr, void *arg) {
    // Executed in the scheduler context just after the RMI
    // thread is blocked.
    post_send(…);
}
…
int ret = rmi_istr_lockless_int_read(&istr, pending_send, &args);
// Blocks until the Send is completed.
Performance Evaluation Results (Contd.)
Latency of RDMA 8KB is limited by the 40GbE throughput.
[Chart: Latency of RDMA 8KB (us) vs. # of RMI threads (1 to 256); series: Dedicated QP RDMA Read/Write, Shared QP RDMA Read/Write; a “Minimum 2 × 40GbE” line marks the throughput-limited floor]
Performance Evaluation Results (Contd.)
CPU Usage of pthread_spin_lock() is reduced if Exp. Verbs is used.
We use IBV_EXP_THREAD_UNSAFE in ibv_exp_create_res_domain().
[Chart: pthread_spin_lock CPU usage (%) for RDMA 8KB vs. # of RMI threads (1 to 256); series: Read/Write with and without Exp. Verbs; the “with Exp. Verbs” series are reduced]
Performance Evaluation Results (Contd.)
Shared QP without the lock optimization shows severe performance degradation.
[Chart: Latency of RDMA 512B (us) vs. # of RMI threads (1 to 256); series: Dedicated QP RDMA Read/Write, Shared QP RDMA Read/Write, Shared QP w/o Lock Opt. RDMA Read/Write; “Minimum 2 × 40GbE” floor]
[Chart: Latency of Message Send/Receive (us) vs. # of RMI threads (1 to 256); series: Dedicated QP, Shared QP, Shared QP w/o Lock Opt.]
Performance Evaluation Results (Contd.)
Shared QP without the lock optimization shows severe performance degradation.
[Chart: Latency of RDMA 8KB (us) vs. # of RMI threads (1 to 256); series: Dedicated QP RDMA Read/Write, Shared QP RDMA Read/Write, Shared QP w/o Lock Opt. RDMA Read/Write; “Minimum 2 × 40GbE” floor]