FlashShare: Punching Through Server
Storage Stack from Kernel to Firmware
for Ultra-Low Latency SSDs
Jie Zhang, Miryeong Kwon, Donghyun Gouk, Sungjoon Koh,
Changlim Lee, Mohammad Alian, Myoungjun Chun, Mahmut
Kandemir, Nam Sung Kim, Jihong Kim and Myoungsoo Jung
Executive Summary
FlashShare punches through the performance barriers:
• Reduces average turnaround response times by 22%;
• Reduces 99th-percentile turnaround response times by 31%.
[Slide diagram: in the datacenter, latency-critical and throughput applications expect memory-like performance. The storage stack (file system, block layer, NVMe driver, flash firmware) forms a barrier: it is unaware of the ULL-SSD and unaware of latency-critical applications, so interference turns the device's ultra-low latency into longer I/O latency.]
Motivation: applications in datacenter
Datacenters execute a wide range of latency-critical workloads.
• Driven by the market for social media and web services;
• Required to satisfy a certain level of service-level agreement (SLA);
• Sensitive to latency (i.e., turnaround response time).
A typical example: Apache
[Slide diagram: an HTTP request reaches the Apache server; a monitor and the TCP/IP service queue the request; worker threads fetch the data object and send the response. A key metric: user experience.]
Motivation: applications in datacenter
• Latency-critical applications exhibit varying loads during a day.
• Datacenters overprovision server resources to meet the SLA.
• However, this results in low utilization and low energy efficiency.
[Figure 1. Example diurnal pattern in queries per second for a Web Search cluster (x-axis: hour of the day); loads vary over the day. Source: Power Management of Online Data-Intensive Services.]
[Figure 2. CPU utilization analysis of a Google server cluster (x-axis: CPU utilization; y-axis: fraction of time); average utilization is about 30%. Source: The Datacenter as a Computer.]
Motivation: applications in datacenter
Popular solution: co-locating latency-critical and throughput workloads (e.g., MICRO'11, ISCA'15, EuroSys'14).
Challenge: applications in datacenter
Applications:
• Apache – online latency-critical application;
• PageRank – offline throughput application;
Server configuration:
Experiment: Apache+PageRank vs. Apache only
Performance metrics:
• SSD device latency;
• Response time of the latency-critical application.
Challenge: applications in datacenter
Experiment: Apache+PageRank vs. Apache only
• The throughput-oriented application drastically increases the I/O access latency of the latency-critical application.
• This latency increase deteriorates the turnaround response time of the latency-critical application.
[Fig 1: Apache SSD latency increases due to PageRank. Fig 2: Apache response time increases due to PageRank.]
Challenge: ULL-SSD
There are emerging Ultra-Low-Latency SSD (ULL-SSD) technologies, which can be used for faster I/O services in the datacenter.

            Z-NAND          XL-Flash        Optane            nvNitro
Technique   New NAND flash  New NAND flash  Phase-change RAM  MRAM
Vendor      Samsung         Toshiba         Intel             Everspin
Read        3us             N/A             10us              6us
Write       100us           N/A             10us              6us
Challenge: ULL-SSD
In this work, we use an engineering sample of Z-SSD.
Z-NAND [1]:
• Technology: SLC-based 3D NAND, 48 stacked word-line layers;
• Capacity: 64Gb/die;
• Page size: 2KB/page.
A Z-NAND-based SSD is branded a "Z-SSD".
[1] Cheong, Wooseong, et al. "A flash memory controller for 15μs ultra-low-latency SSD using high-speed 3D NAND flash with 3μs read time." 2018 IEEE International Solid-State Circuits Conference (ISSCC), 2018.
Challenge: Datacenter server with ULL-SSD
Applications:
• Apache – online latency-critical application;
• PageRank – offline throughput application;
Server configuration:
[Figure: device latency analysis (42x; 36us vs. 28us).]
Unfortunately, the short-latency characteristic of the ULL-SSD is not exposed to users (in particular, to the latency-critical applications).
Challenge: Datacenter server with ULL-SSD
The storage stack is unaware of the characteristics of both the latency-critical workload and the ULL-SSD.
[Slide diagram: storage stack, top to bottom: applications, caching layer, filesystem, blkmq, NVMe driver, ULL-SSD.]
The current design of the blkmq layer, NVMe driver, and SSD firmware can hurt the performance of latency-critical applications. The ULL-SSD fails to deliver its short latency because of the storage stack.
Blkmq layer: challenge
[Slide diagram: on the I/O submission path, incoming requests enter the blkmq software queue, where they are queued and merged, before moving to the hardware queue.]
Software queue: holds latency-critical I/O requests for a long time.
Blkmq layer: challenge
[Slide diagram: the blkmq hardware queue dispatches a request only when a token is available (Token=1 vs. Token=0).]
• Software queue: holds latency-critical I/O requests for a long time;
• Hardware queue: dispatches an I/O request without knowledge of its latency-criticality.
Blkmq layer: optimization
Our solution: bypass.
[Slide diagram: incoming latency-critical requests (LatReq) bypass the software and hardware queues; throughput requests (ThrReq) still queue and merge in blkmq.]
• Latency-critical I/Os: bypass blkmq (no merge, no I/O scheduling) for a faster response; the penalty is little, and dispatch ordering is addressed in the NVMe layer;
• Throughput I/Os: merge in blkmq for higher storage bandwidth.
See the sketch below.
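As a minimal sketch (not the authors' code) of the bypass decision, modeled in plain C; io_request, nvme_dispatch, and blkmq_enqueue_and_merge are hypothetical stand-ins, not real kernel APIs:

```c
#include <stdio.h>

/* Hypothetical request model; the real blk-mq structures differ. */
enum wl_attr { WL_THROUGHPUT, WL_LATENCY_CRITICAL };

struct io_request {
    enum wl_attr attr;   /* responsiveness hint inherited from the task */
    long sector;
};

/* Stubs standing in for the NVMe driver and the blkmq queues. */
static void nvme_dispatch(struct io_request *rq)
{
    printf("sector %ld: dispatched straight to the NVMe SQ\n", rq->sector);
}

static void blkmq_enqueue_and_merge(struct io_request *rq)
{
    printf("sector %ld: queued in blkmq for merging\n", rq->sector);
}

/* Responsiveness-aware submission: latency-critical I/O skips the
 * software/hardware queues (no merge, no scheduling); throughput I/O
 * keeps the merge path for bandwidth. */
static void submit_request(struct io_request *rq)
{
    if (rq->attr == WL_LATENCY_CRITICAL)
        nvme_dispatch(rq);           /* bypass blkmq entirely */
    else
        blkmq_enqueue_and_merge(rq);
}

int main(void)
{
    struct io_request lat = { WL_LATENCY_CRITICAL, 100 };
    struct io_request thr = { WL_THROUGHPUT, 200 };
    submit_request(&lat);
    submit_request(&thr);
    return 0;
}
```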
NVMe SQ: challenge (bypass alone is not enough)
[Slide diagram: per-core NVMe submission queues (SQ) and completion queues (CQ). The host rings the SQ doorbell register, and the NVMe controller fetches commands from the head of the SQ in order, so a newly submitted request waits behind earlier entries.]
NVMe protocol-level queue: a latency-critical I/O request can be blocked by prior I/O requests. With two requests ahead of it: Time Cost = Tfetch,self + 2 x Tfetch, i.e., more than 200% overhead.
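As a worked reading of this cost (an illustration reconstructed from the slide, assuming every fetch takes the same time):

$$ T_{\text{wait}} = T_{\text{fetch,self}} + k \cdot T_{\text{fetch}}, \qquad \text{overhead} = \frac{k \cdot T_{\text{fetch}}}{T_{\text{fetch,self}}} $$

With $k = 2$ prior entries and equal fetch times, the added waiting is $2 \cdot T_{\text{fetch}} / T_{\text{fetch,self}} = 200\%$ of the request's own fetch cost, consistent with the slide's ">200% overhead".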
NVMe SQ: optimization
Target: design a responsiveness-aware NVMe submission.
Key insight:
• Conventional NVMe controllers allow the standard arbitration strategy to be customized for different NVMe protocol-level queue accesses.
• Thus, we can make the NVMe controller decide which NVMe command to fetch by sharing a hint about the I/O urgency.
NVMe SQ: optimization
Our solution:
1. Double the SQs (one Lat-SQ for latency-critical I/Os, another Thr-SQ for throughput I/Os);
2. Double the SQ doorbell registers;
3. New arbitration strategy: give the highest priority to the Lat-SQ.
[Slide diagram: each core rings either the Lat-SQ or the Thr-SQ doorbell; the NVMe controller fetches Lat-SQ commands immediately and postpones Thr-SQ commands.]
A controller-side sketch of this arbitration follows.
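A minimal controller-side sketch of the priority arbitration, assuming per-core Lat-SQ/Thr-SQ pairs; the ring model and all names are hypothetical simplifications of real NVMe firmware:

```c
#define NCORES 3
#define QDEPTH 64

/* Hypothetical ring model of an NVMe submission queue. */
struct sq {
    int head, tail;
    int cmds[QDEPTH];
};

static int sq_empty(const struct sq *q) { return q->head == q->tail; }

static int sq_pop(struct sq *q)
{
    int cmd = q->cmds[q->head];
    q->head = (q->head + 1) % QDEPTH;
    return cmd;
}

/* Arbitration: drain every Lat-SQ before touching any Thr-SQ, so a
 * latency-critical command is never blocked behind throughput commands
 * submitted earlier. Returns the next command to fetch, or -1 if all
 * queues are empty. */
int arbitrate(struct sq lat_sq[NCORES], struct sq thr_sq[NCORES])
{
    for (int c = 0; c < NCORES; c++)
        if (!sq_empty(&lat_sq[c]))
            return sq_pop(&lat_sq[c]);   /* immediate fetch */

    for (int c = 0; c < NCORES; c++)     /* postponed path */
        if (!sq_empty(&thr_sq[c]))
            return sq_pop(&thr_sq[c]);

    return -1;
}
```

One design note: strict priority can delay the Thr-SQs under a heavy latency-critical load, which throughput I/Os tolerate far better than latency-critical ones tolerate the reverse.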
SSD firmware: challenge
The embedded cache cannot protect latency-critical I/O from eviction, and it can be polluted by throughput requests.
[Slide diagram: inside the ULL-SSD, the NVMe controller sits on top of the caching layer, the FTL, and NAND flash. A request that hits the embedded cache costs TCL + TCACHE; a miss costs TCL + TFTL + TNAND + TCACHE. As incoming requests (Req@0x01, Req@0x05, Req@0x04, Req@0x08, Req@0x0b) fill the cache ways, later requests evict earlier lines.]
The embedded cache provides the fastest response (DRAM service), but a throughput request can evict the cached data of a latency-critical request.
SSD firmware: optimization
Our design splits the internal cache space to protect latency-critical I/O requests.
[Slide diagram: the cache ways are partitioned into a protection region for latency-critical data and a normal region; incoming throughput requests can evict lines only outside the protection region.]
A minimal sketch of such a split cache follows.
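A minimal sketch of a way-partitioned cache with a protection region for latency-critical data; the fixed split and all structure names are illustrative assumptions (a dynamic split scheme is mentioned under "More optimizations" below):

```c
#include <stdbool.h>

#define NWAYS 8
#define PROTECTED_WAYS 4   /* assumed fixed split, for illustration only */

struct cache_line {
    bool valid;
    bool latency_critical;
    unsigned long addr;
};

static struct cache_line ways[NWAYS];

/* Insert a line: latency-critical data goes to the protection region
 * [0, PROTECTED_WAYS); throughput data goes to the remaining ways, so
 * throughput traffic can never evict a latency-critical line. */
void cache_insert(unsigned long addr, bool lat_critical)
{
    int lo = lat_critical ? 0 : PROTECTED_WAYS;
    int hi = lat_critical ? PROTECTED_WAYS : NWAYS;
    int victim = lo;

    for (int w = lo; w < hi; w++) {
        if (!ways[w].valid) { victim = w; break; }
        /* a real design would pick an LRU victim within the region */
    }
    ways[victim] = (struct cache_line){ true, lat_critical, addr };
}
```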
NVMe CQ: challenge
NVMe completion: MSI overhead for each I/O request.
[Slide diagram: on the I/O completion path, the NVMe controller writes the completion message to the CQ and raises an MSI. The interrupt controller interrupts the CPU, which context-switches into the interrupt service routine (cost: TISR) and then context-switches again into the blkmq layer; each context switch costs TCS. The host then rings the CQ doorbell.]
Total cost: 2 x TCS + TISR per I/O completion.
NVMe CQ: optimization
Key insight: state-of-the-art Linux supports a poll mechanism.
[Slide diagram: instead of the MSI, the interrupt service routine, and two context switches, the blkmq layer polls the CQ for the completion message directly, saving 2 x TCS + TISR per I/O completion.]
NVMe CQ: optimization
The poll mechanism can bring benefits to a fast storage device.
[Figure: average latency (us) of interrupt vs. polling on the ULL-SSD for 4KB-32KB reads and writes. Polling decreases latency by 7.5% for reads and 13.2% for writes.]
NVMe CQ: optimization
However, poll-based I/O services consume most of the host resources.
[Figure: with polling, CPU utilization stays near 100% over time, far above the interrupt-driven case; the memory-bound fraction (%) for 4KB-32KB requests is also higher with polling than with interrupts.]
NVMe CQ: optimization
Our solution: selective interrupt service routine (Select-ISR).
[Slide diagram: for a latency-critical request (LatReq), the blkmq layer polls the CQ message directly; for a throughput request (ThrReq), the core sleeps and the completion goes through the MSI, the interrupt controller, the CPU, the interrupt service routine, and a context switch.]
A minimal sketch of this selective completion follows.
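A minimal sketch of the selective completion idea, assuming a per-command criticality flag; the names are hypothetical and the real kernel completion path differs:

```c
#include <stdbool.h>
#include <sched.h>   /* sched_yield() */

/* Hypothetical completion-queue entry. */
struct cq_entry {
    volatile bool done;      /* set by the device on completion */
    bool latency_critical;
};

/* Select-ISR: busy-poll completions only for latency-critical commands
 * (saving 2 x TCS + TISR); let throughput commands complete through the
 * normal MSI/ISR path so CPU cycles are not burned on them. */
void wait_for_completion(struct cq_entry *cqe)
{
    if (cqe->latency_critical) {
        while (!cqe->done)
            ;                /* spin on the CQ message */
    } else {
        while (!cqe->done)
            sched_yield();   /* stand-in for sleeping until the ISR
                                wakes the submitting task */
    }
}
```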
Design: Responsiveness Awareness
Key insight: users have better knowledge of I/O responsiveness (i.e., latency-critical vs. throughput).
Our approach:
• Open a set of APIs to users, which pass the workload attribute to the Linux PCB:
  1. Call a new utility: chworkload_attr;
  2. The utility invokes a new system call;
  3. The system call modifies the Linux PCB data structure.
A usage sketch follows.
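A hedged sketch of what the user-facing side could look like; the syscall number, wrapper, and attribute constants below are assumptions for illustration, not the actual interface:

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

/* Assumed syscall number for the new system call; hypothetical. */
#define SYS_chworkload_attr 436

enum wl_attr { WL_THROUGHPUT = 0, WL_LATENCY_CRITICAL = 1 };

/* Tag the process identified by pid with a workload attribute that the
 * kernel records in the task_struct (the Linux PCB). */
static long chworkload_attr(pid_t pid, int attr)
{
    return syscall(SYS_chworkload_attr, pid, attr);
}

int main(void)
{
    /* Mark this process as latency-critical so every storage-stack
     * layer can prioritize its I/O. */
    return chworkload_attr(getpid(), WL_LATENCY_CRITICAL) ? 1 : 0;
}
```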
Design: Responsiveness Awareness
Key insight: users have better knowledge of I/O responsiveness (i.e., latency-critical vs. throughput).
Our approach:
• Open a set of APIs to users, which pass the workload attribute to the Linux PCB;
• Deliver the workload attribute to each layer of the storage stack.
[Slide diagram: the workload attribute flows from the user process's task_struct through the page cache (address_space), the file system's BIO, the block layer's (blk-mq) request, and the rsvd2 field of the NVMe driver's nvme_rw_command, down to the NVMe controller's tag array and the embedded cache.]
A sketch of this threading appears below.
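A minimal sketch of threading the attribute down to the NVMe command; the struct layouts are simplified stand-ins, though the carriers themselves (task_struct, BIO, request, nvme_rw_command's rsvd2 field) are the ones named on the slide:

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel structures named on the slide. */
struct task        { uint8_t wl_attr; };   /* task_struct (Linux PCB) */
struct bio_s       { uint8_t wl_attr; };   /* file-system BIO         */
struct request_s   { uint8_t wl_attr; };   /* blk-mq request          */
struct nvme_rw_cmd { uint16_t rsvd2; };    /* NVMe read/write command */

/* Copy the attribute at each hand-off so the NVMe controller can read
 * it out of the otherwise-reserved rsvd2 field and steer its tag array
 * and embedded cache accordingly. */
void thread_attribute(const struct task *t, struct bio_s *bio,
                      struct request_s *rq, struct nvme_rw_cmd *cmd)
{
    bio->wl_attr = t->wl_attr;    /* task_struct -> BIO         */
    rq->wl_attr  = bio->wl_attr;  /* BIO -> blk-mq request      */
    cmd->rsvd2   = rq->wl_attr;   /* request -> nvme_rw_command */
}
```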
More optimizations
Advanced caching-layer designs:
• Dynamic cache split scheme: maximizes cache hits across varying request patterns;
• Read prefetching: better utilizes SSD internal parallelism;
• Adjustable read prefetching with a ghost cache: adapts to different request patterns.
Hardware accelerator designs:
• Offload simple but time-consuming tasks such as I/O polling and I/O merging;
• Simplify the design of blkmq and the NVMe driver.
Experiment Setup
Test environment: http://simplessd.org
System configurations:
• Vanilla – a vanilla Linux-based computer system running on Z-SSD;
• CacheOpt – compared to Vanilla, optimizes the cache layer of the SSD firmware;
• KernelOpt – optimizes the blkmq layer and NVMe I/O submission;
• SelectISR – compared to KernelOpt, adds the selective-ISR optimization.
Evaluation: latency breakdown
[Figure: latency breakdown showing a 46% reduction in the blkmq layer, a 38% reduction in SSD access delays, and a 5us reduction in I/O completion.]
• KernelOpt reduces the time cost of the blkmq layer by 46%, thanks to eliminating queuing time;
• As latency-critical I/Os are fetched by the NVMe controller immediately, KernelOpt drastically reduces the waiting time;
• CacheOpt better utilizes the embedded cache layer and reduces SSD access delays by 38%;
• By selectively using the polling mechanism, SelectISR reduces the I/O completion time by 5us.
Evaluation: online I/O access
• CacheOpt reduces the average I/O service latency, but it cannot eliminate the long tails;
• KernelOpt removes the long tails, because it avoids long queuing times and prevents throughput I/Os from blocking latency-critical I/Os;
• SelectISR reduces the average latency further, thanks to selectively using the poll mechanism.
Conclusion
Observation: the ultra-low latency of new memory-based SSDs is not exposed to latency-critical applications, so users see no benefit from the user-experience angle.
Challenge: piecemeal reformations of the current storage stack won't work, due to multiple barriers; the storage stack is unaware of the behaviors of both the ULL-SSD and latency-critical applications.
Our solution: FlashShare exposes different levels of I/O responsiveness to the key components of the current storage stack and optimizes the corresponding system layers to make ULL visible to users (latency-critical applications).
Major results:
• Reduces average turnaround response times by 22%;
• Reduces 99th-percentile turnaround response times by 31%.
Thank you