FlashShare: Punching Through Server
Storage Stack from Kernel to Firmware
for Ultra-Low Latency SSDs
Jie Zhang, Miryeong Kwon, Donghyun Gouk, Sungjoon Koh,
Changlim Lee, Mohammad Alian, Myoungjun Chun, Mahmut
Kandemir, Nam Sung Kim, Jihong Kim and Myoungsoo Jung
Executive Summary
FlashShare punches through the performance barriers:
• Reduces average turnaround response times by 22%;
• Reduces 99th-percentile turnaround response times by 31%.
[Slide diagram: in the datacenter, latency-critical and throughput applications expect memory-like performance. The storage stack (file system, block layer, NVMe driver, flash firmware) forms a barrier: it is unaware of the ULL-SSD and unaware of latency-critical applications, so interference turns the device's ultra-low latency into longer I/O latency.]
Motivation: applications in datacenter
Datacenters execute a wide range of latency-critical workloads.
• Driven by the market for social media and web services;
• Required to satisfy a certain level of service-level agreement (SLA);
• Sensitive to latency (i.e., turnaround response time).
A typical example: Apache
[Slide diagram: an HTTP request reaches the Apache server; a monitor and the TCP/IP service queue the request; worker threads fetch the data object and send the response. A key metric: user experience.]
Motivation: applications in datacenter
• Latency-critical applications exhibit varying loads during a day.
• Datacenters overprovision server resources to meet the SLA.
• However, this results in low utilization and low energy efficiency.
[Figure 1. Example diurnal pattern in queries per second for a Web Search cluster (x-axis: hour of the day); loads vary over the day. Source: Power Management of Online Data-Intensive Services.]
[Figure 2. CPU utilization analysis of a Google server cluster (x-axis: CPU utilization; y-axis: fraction of time); average utilization is about 30%. Source: The Datacenter as a Computer.]
Motivation: applications in datacenter
Popular solution: co-locating latency-critical and throughput workloads (e.g., MICRO'11, ISCA'15, EuroSys'14).
Challenge: applications in datacenter
Applications:
• Apache – online latency-critical application;
• PageRank – offline throughput application;
Server configuration:
Experiment: Apache+PageRank vs. Apache only
Performance metrics:
• SSD device latency;
• Response time of the latency-critical application.
Challenge: applications in datacenter
Experiment: Apache+PageRank vs. Apache only
• The throughput-oriented application drastically increases the I/O access latency of the latency-critical application.
• This latency increase deteriorates the turnaround response time of the latency-critical application.
[Fig 1: Apache SSD latency increases due to PageRank. Fig 2: Apache response time increases due to PageRank.]
Challenge: ULL-SSD
There are emerging Ultra-Low-Latency SSD (ULL-SSD) technologies, which can be used for faster I/O services in the datacenter.

            Z-NAND          XL-Flash        Optane            nvNitro
Technique   New NAND flash  New NAND flash  Phase-change RAM  MRAM
Vendor      Samsung         Toshiba         Intel             Everspin
Read        3us             N/A             10us              6us
Write       100us           N/A             10us              6us
Challenge: ULL-SSD
In this work, we use an engineering sample of Z-SSD.
Z-NAND [1]:
• Technology: SLC-based 3D NAND, 48 stacked word-line layers;
• Capacity: 64Gb/die;
• Page size: 2KB/page.
A Z-NAND-based SSD is branded a "Z-SSD".
[1] Cheong, Wooseong, et al. "A flash memory controller for 15μs ultra-low-latency SSD using high-speed 3D NAND flash with 3μs read time." 2018 IEEE International Solid-State Circuits Conference (ISSCC), 2018.
Challenge: Datacenter server with ULL-SSD
Applications:
• Apache – online latency-critical application;
• PageRank – offline throughput application;
Server configuration:
[Figure: device latency analysis (42x; 36us vs. 28us).]
Unfortunately, the short-latency characteristic of the ULL-SSD is not exposed to users (in particular, to the latency-critical applications).
Challenge: Datacenter server with ULL-SSD
The storage stack is unaware of the characteristics of both the latency-critical workload and the ULL-SSD.
[Slide diagram: storage stack, top to bottom: applications, caching layer, filesystem, blkmq, NVMe driver, ULL-SSD.]
The current design of the blkmq layer, NVMe driver, and SSD firmware can hurt the performance of latency-critical applications. The ULL-SSD fails to deliver its short latency because of the storage stack.
Blkmq layer: challenge
[Slide diagram: on the I/O submission path, incoming requests enter the blkmq software queue, where they are queued and merged, before moving to the hardware queue.]
Software queue: holds latency-critical I/O requests for a long time.
Blkmq layer: challenge
[Slide diagram: the blkmq hardware queue dispatches a request only when a token is available (Token=1 vs. Token=0).]
• Software queue: holds latency-critical I/O requests for a long time;
• Hardware queue: dispatches an I/O request without knowledge of its latency-criticality.
Blkmq layer: optimization
Our solution: bypass.
[Slide diagram: incoming latency-critical requests (LatReq) bypass the software and hardware queues; throughput requests (ThrReq) still queue and merge in blkmq.]
• Latency-critical I/Os: bypass blkmq (no merge, no I/O scheduling) for a faster response; the penalty is little, and dispatch ordering is addressed in the NVMe layer;
• Throughput I/Os: merge in blkmq for higher storage bandwidth.
See the sketch below.
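As a minimal sketch (not the authors' code) of the bypass decision, modeled in plain C; io_request, nvme_dispatch, and blkmq_enqueue_and_merge are hypothetical stand-ins, not real kernel APIs:

```c
#include <stdio.h>

/* Hypothetical request model; the real blk-mq structures differ. */
enum wl_attr { WL_THROUGHPUT, WL_LATENCY_CRITICAL };

struct io_request {
    enum wl_attr attr;   /* responsiveness hint inherited from the task */
    long sector;
};

/* Stubs standing in for the NVMe driver and the blkmq queues. */
static void nvme_dispatch(struct io_request *rq)
{
    printf("sector %ld: dispatched straight to the NVMe SQ\n", rq->sector);
}

static void blkmq_enqueue_and_merge(struct io_request *rq)
{
    printf("sector %ld: queued in blkmq for merging\n", rq->sector);
}

/* Responsiveness-aware submission: latency-critical I/O skips the
 * software/hardware queues (no merge, no scheduling); throughput I/O
 * keeps the merge path for bandwidth. */
static void submit_request(struct io_request *rq)
{
    if (rq->attr == WL_LATENCY_CRITICAL)
        nvme_dispatch(rq);           /* bypass blkmq entirely */
    else
        blkmq_enqueue_and_merge(rq);
}

int main(void)
{
    struct io_request lat = { WL_LATENCY_CRITICAL, 100 };
    struct io_request thr = { WL_THROUGHPUT, 200 };
    submit_request(&lat);
    submit_request(&thr);
    return 0;
}
```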
NVMe SQ: challenge (bypass alone is not enough)
[Slide diagram: per-core NVMe submission queues (SQ) and completion queues (CQ). The host rings the SQ doorbell register, and the NVMe controller fetches commands from the head of the SQ in order, so a newly submitted request waits behind earlier entries.]
NVMe protocol-level queue: a latency-critical I/O request can be blocked by prior I/O requests. With two requests ahead of it: Time Cost = Tfetch,self + 2 x Tfetch, i.e., more than 200% overhead.
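As a worked reading of this cost (an illustration reconstructed from the slide, assuming every fetch takes the same time):

$$ T_{\text{wait}} = T_{\text{fetch,self}} + k \cdot T_{\text{fetch}}, \qquad \text{overhead} = \frac{k \cdot T_{\text{fetch}}}{T_{\text{fetch,self}}} $$

With $k = 2$ prior entries and equal fetch times, the added waiting is $2 \cdot T_{\text{fetch}} / T_{\text{fetch,self}} = 200\%$ of the request's own fetch cost, consistent with the slide's ">200% overhead".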
NVMe SQ: optimization
Target: design a responsiveness-aware NVMe submission.
Key insight:
• Conventional NVMe controllers allow the standard arbitration strategy to be customized for different NVMe protocol-level queue accesses.
• Thus, we can make the NVMe controller decide which NVMe command to fetch by sharing a hint about the I/O urgency.
NVMe SQ: optimization
Our solution:
1. Double the SQs (one Lat-SQ for latency-critical I/Os, another Thr-SQ for throughput I/Os);
2. Double the SQ doorbell registers;
3. New arbitration strategy: give the highest priority to the Lat-SQ.
[Slide diagram: each core rings either the Lat-SQ or the Thr-SQ doorbell; the NVMe controller fetches Lat-SQ commands immediately and postpones Thr-SQ commands.]
A controller-side sketch of this arbitration follows.
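A minimal controller-side sketch of the priority arbitration, assuming per-core Lat-SQ/Thr-SQ pairs; the ring model and all names are hypothetical simplifications of real NVMe firmware:

```c
#define NCORES 3
#define QDEPTH 64

/* Hypothetical ring model of an NVMe submission queue. */
struct sq {
    int head, tail;
    int cmds[QDEPTH];
};

static int sq_empty(const struct sq *q) { return q->head == q->tail; }

static int sq_pop(struct sq *q)
{
    int cmd = q->cmds[q->head];
    q->head = (q->head + 1) % QDEPTH;
    return cmd;
}

/* Arbitration: drain every Lat-SQ before touching any Thr-SQ, so a
 * latency-critical command is never blocked behind throughput commands
 * submitted earlier. Returns the next command to fetch, or -1 if all
 * queues are empty. */
int arbitrate(struct sq lat_sq[NCORES], struct sq thr_sq[NCORES])
{
    for (int c = 0; c < NCORES; c++)
        if (!sq_empty(&lat_sq[c]))
            return sq_pop(&lat_sq[c]);   /* immediate fetch */

    for (int c = 0; c < NCORES; c++)     /* postponed path */
        if (!sq_empty(&thr_sq[c]))
            return sq_pop(&thr_sq[c]);

    return -1;
}
```

One design note: strict priority can delay the Thr-SQs under a heavy latency-critical load, which throughput I/Os tolerate far better than latency-critical ones tolerate the reverse.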
SSD firmware: challenge
The embedded cache cannot protect latency-critical I/O from eviction, and it can be polluted by throughput requests.
[Slide diagram: inside the ULL-SSD, the NVMe controller sits on top of the caching layer, the FTL, and NAND flash. A request that hits the embedded cache costs TCL + TCACHE; a miss costs TCL + TFTL + TNAND + TCACHE. As incoming requests (Req@0x01, Req@0x05, Req@0x04, Req@0x08, Req@0x0b) fill the cache ways, later requests evict earlier lines.]
The embedded cache provides the fastest response (DRAM service), but a throughput request can evict the cached data of a latency-critical request.
SSD firmware: optimization
Our design splits the internal cache space to protect latency-critical I/O requests.
[Slide diagram: the cache ways are partitioned into a protection region for latency-critical data and a normal region; incoming throughput requests can evict lines only outside the protection region.]
A minimal sketch of such a split cache follows.
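A minimal sketch of a way-partitioned cache with a protection region for latency-critical data; the fixed split and all structure names are illustrative assumptions (a dynamic split scheme is mentioned under "More optimizations" below):

```c
#include <stdbool.h>

#define NWAYS 8
#define PROTECTED_WAYS 4   /* assumed fixed split, for illustration only */

struct cache_line {
    bool valid;
    bool latency_critical;
    unsigned long addr;
};

static struct cache_line ways[NWAYS];

/* Insert a line: latency-critical data goes to the protection region
 * [0, PROTECTED_WAYS); throughput data goes to the remaining ways, so
 * throughput traffic can never evict a latency-critical line. */
void cache_insert(unsigned long addr, bool lat_critical)
{
    int lo = lat_critical ? 0 : PROTECTED_WAYS;
    int hi = lat_critical ? PROTECTED_WAYS : NWAYS;
    int victim = lo;

    for (int w = lo; w < hi; w++) {
        if (!ways[w].valid) { victim = w; break; }
        /* a real design would pick an LRU victim within the region */
    }
    ways[victim] = (struct cache_line){ true, lat_critical, addr };
}
```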
NVMe CQ: challenge
NVMe completion: MSI overhead for each I/O request.
[Slide diagram: on the I/O completion path, the NVMe controller writes the completion message to the CQ and raises an MSI. The interrupt controller interrupts the CPU, which context-switches into the interrupt service routine (cost: TISR) and then context-switches again into the blkmq layer; each context switch costs TCS. The host then rings the CQ doorbell.]
Total cost: 2 x TCS + TISR per I/O completion.
NVMe CQ: optimization
Key insight: state-of-the-art Linux supports a poll mechanism.
[Slide diagram: instead of the MSI, the interrupt service routine, and two context switches, the blkmq layer polls the CQ for the completion message directly, saving 2 x TCS + TISR per I/O completion.]
NVMe CQ: optimization
The poll mechanism can bring benefits to a fast storage device.
[Figure: average latency (us) of interrupt vs. polling on the ULL-SSD for 4KB-32KB reads and writes. Polling decreases latency by 7.5% for reads and 13.2% for writes.]
NVMe CQ: optimization
However, poll-based I/O services consume most of the host resources.
[Figure: with polling, CPU utilization stays near 100% over time, far above the interrupt-driven case; the memory-bound fraction (%) for 4KB-32KB requests is also higher with polling than with interrupts.]
NVMe CQ: optimization
Our solution: selective interrupt service routine (Select-ISR).
[Slide diagram: for a latency-critical request (LatReq), the blkmq layer polls the CQ message directly; for a throughput request (ThrReq), the core sleeps and the completion goes through the MSI, the interrupt controller, the CPU, the interrupt service routine, and a context switch.]
A minimal sketch of this selective completion follows.
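A minimal sketch of the selective completion idea, assuming a per-command criticality flag; the names are hypothetical and the real kernel completion path differs:

```c
#include <stdbool.h>
#include <sched.h>   /* sched_yield() */

/* Hypothetical completion-queue entry. */
struct cq_entry {
    volatile bool done;      /* set by the device on completion */
    bool latency_critical;
};

/* Select-ISR: busy-poll completions only for latency-critical commands
 * (saving 2 x TCS + TISR); let throughput commands complete through the
 * normal MSI/ISR path so CPU cycles are not burned on them. */
void wait_for_completion(struct cq_entry *cqe)
{
    if (cqe->latency_critical) {
        while (!cqe->done)
            ;                /* spin on the CQ message */
    } else {
        while (!cqe->done)
            sched_yield();   /* stand-in for sleeping until the ISR
                                wakes the submitting task */
    }
}
```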
Design: Responsiveness Awareness
Key insight: users have better knowledge of I/O responsiveness (i.e., latency-critical vs. throughput).
Our approach:
• Open a set of APIs to users, which pass the workload attribute to the Linux PCB:
  1. Call a new utility: chworkload_attr;
  2. The utility invokes a new system call;
  3. The system call modifies the Linux PCB data structure.
A usage sketch follows.
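A hedged sketch of what the user-facing side could look like; the syscall number, wrapper, and attribute constants below are assumptions for illustration, not the actual interface:

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

/* Assumed syscall number for the new system call; hypothetical. */
#define SYS_chworkload_attr 436

enum wl_attr { WL_THROUGHPUT = 0, WL_LATENCY_CRITICAL = 1 };

/* Tag the process identified by pid with a workload attribute that the
 * kernel records in the task_struct (the Linux PCB). */
static long chworkload_attr(pid_t pid, int attr)
{
    return syscall(SYS_chworkload_attr, pid, attr);
}

int main(void)
{
    /* Mark this process as latency-critical so every storage-stack
     * layer can prioritize its I/O. */
    return chworkload_attr(getpid(), WL_LATENCY_CRITICAL) ? 1 : 0;
}
```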
Design: Responsiveness Awareness
Key insight: users have better knowledge of I/O responsiveness (i.e., latency-critical vs. throughput).
Our approach:
• Open a set of APIs to users, which pass the workload attribute to the Linux PCB;
• Deliver the workload attribute to each layer of the storage stack.
[Slide diagram: the workload attribute flows from the user process's task_struct through the page cache (address_space), the file system's BIO, the block layer's (blk-mq) request, and the rsvd2 field of the NVMe driver's nvme_rw_command, down to the NVMe controller's tag array and the embedded cache.]
A sketch of this threading appears below.
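A minimal sketch of threading the attribute down to the NVMe command; the struct layouts are simplified stand-ins, though the carriers themselves (task_struct, BIO, request, nvme_rw_command's rsvd2 field) are the ones named on the slide:

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel structures named on the slide. */
struct task        { uint8_t wl_attr; };   /* task_struct (Linux PCB) */
struct bio_s       { uint8_t wl_attr; };   /* file-system BIO         */
struct request_s   { uint8_t wl_attr; };   /* blk-mq request          */
struct nvme_rw_cmd { uint16_t rsvd2; };    /* NVMe read/write command */

/* Copy the attribute at each hand-off so the NVMe controller can read
 * it out of the otherwise-reserved rsvd2 field and steer its tag array
 * and embedded cache accordingly. */
void thread_attribute(const struct task *t, struct bio_s *bio,
                      struct request_s *rq, struct nvme_rw_cmd *cmd)
{
    bio->wl_attr = t->wl_attr;    /* task_struct -> BIO         */
    rq->wl_attr  = bio->wl_attr;  /* BIO -> blk-mq request      */
    cmd->rsvd2   = rq->wl_attr;   /* request -> nvme_rw_command */
}
```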
More optimizations
Advanced caching-layer designs:
• Dynamic cache split scheme: maximizes cache hits across varying request patterns;
• Read prefetching: better utilizes SSD internal parallelism;
• Adjustable read prefetching with a ghost cache: adapts to different request patterns.
Hardware accelerator designs:
• Offload simple but time-consuming tasks such as I/O polling and I/O merging;
• Simplify the design of blkmq and the NVMe driver.
Experiment Setup
Test environment: http://simplessd.org
System configurations:
• Vanilla – a vanilla Linux-based computer system running on Z-SSD;
• CacheOpt – compared to Vanilla, optimizes the cache layer of the SSD firmware;
• KernelOpt – optimizes the blkmq layer and NVMe I/O submission;
• SelectISR – compared to KernelOpt, adds the selective-ISR optimization.
Evaluation: latency breakdown
[Figure: latency breakdown showing a 46% reduction in the blkmq layer, a 38% reduction in SSD access delays, and a 5us reduction in I/O completion.]
• KernelOpt reduces the time cost of the blkmq layer by 46%, thanks to eliminating queuing time;
• As latency-critical I/Os are fetched by the NVMe controller immediately, KernelOpt drastically reduces the waiting time;
• CacheOpt better utilizes the embedded cache layer and reduces SSD access delays by 38%;
• By selectively using the polling mechanism, SelectISR reduces the I/O completion time by 5us.
Evaluation: online I/O access
• CacheOpt reduces the average I/O service latency, but it cannot eliminate the long tails;
• KernelOpt removes the long tails, because it avoids long queuing times and prevents throughput I/Os from blocking latency-critical I/Os;
• SelectISR reduces the average latency further, thanks to selectively using the poll mechanism.
Conclusion
Observation: the ultra-low latency of new memory-based SSDs is not exposed to latency-critical applications, so users see no benefit from the user-experience angle.
Challenge: piecemeal reformations of the current storage stack won't work, due to multiple barriers; the storage stack is unaware of the behaviors of both the ULL-SSD and latency-critical applications.
Our solution: FlashShare exposes different levels of I/O responsiveness to the key components of the current storage stack and optimizes the corresponding system layers to make ULL visible to users (latency-critical applications).
Major results:
• Reduces average turnaround response times by 22%;
• Reduces 99th-percentile turnaround response times by 31%.
Thank you