Flash Storage Disaggregation

Ana Klimovic 1, Christos Kozyrakis 1,4, Eno Thereska 3,5, Binu John 2, and Sanjeev Kumar 2



Flash is underutilized

• Flash provides higher throughput and lower latency than disk

• Flash is underutilized in datacenters due to imbalanced resource requirements

PCIe Flash:
– 100,000s of IOPS
– 10s of µs latency

Datacenter Flash Use-Case

[Diagram: clients on app servers (application tier) send get(k) and put(k, val) requests over TCP/IP to a datastore service (datastore tier). Each datastore server runs a key-value store (software) on CPU, RAM, Flash, and NIC (hardware).]

Imbalanced Resource Utilization

• Sample utilization of Facebook servers hosting a Flash-based key-value store over 6 months

• Flash capacity and IOPS are underutilized for long periods of time

• CPU and Flash utilization vary with separate trends

[Figure: utilization over the 6-month sample]

Local Flash Architecture

[Diagram: same layout as the datacenter Flash use-case. Clients on app servers send get(k) and put(k, val) over TCP/IP to datastore servers that each hold CPU, RAM, Flash, and NIC.]

Provision Flash and CPU in a dependent manner.

Disaggregated Flash Architecture

[Diagram: clients on app servers (application tier) send get(k) and put(k, val) over TCP/IP to the datastore tier, which runs the key-value store on CPU, RAM, and NIC. The datastore tier issues read(blk) and write(blk, data) over a remote block protocol (iSCSI) to the Flash tier, which runs a remote block service on CPU, RAM, Flash, and NIC.]
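To make the layering in the diagram concrete, here is a minimal sketch of a key-value store that reaches storage only through read(blk) and write(blk, data), so the same code could sit on local or remote Flash. The class names, the 4096-byte block size, and the one-value-per-block layout are assumptions for illustration only; the datastore discussed in this talk is RocksDB, which is organized very differently.

```python
import struct

BLOCK_SIZE = 4096  # assumed block size, not taken from the talk


class InMemoryBlockDevice:
    """Hypothetical stand-in for a local or remote (e.g. iSCSI) block device."""

    def __init__(self, num_blocks):
        self.blocks = [bytes(BLOCK_SIZE)] * num_blocks

    def read(self, blk):
        return self.blocks[blk]

    def write(self, blk, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        self.blocks[blk] = bytes(data)


class ToyKeyValueStore:
    """Toy store: one value per block, with an in-memory key -> block index."""

    def __init__(self, device):
        self.device = device
        self.index = {}
        self.next_free = 0

    def put(self, key, val):
        if len(val) > BLOCK_SIZE - 4:
            raise ValueError("value too large for one block")
        # Length-prefix the value and pad it out to a full block.
        payload = struct.pack("<I", len(val)) + val
        payload = payload.ljust(BLOCK_SIZE, b"\x00")
        blk = self.index.get(key)
        if blk is None:
            # First write for this key: claim the next unused block.
            blk = self.next_free
            self.next_free += 1
            self.index[key] = blk
        self.device.write(blk, payload)

    def get(self, key):
        block = self.device.read(self.index[key])
        (length,) = struct.unpack("<I", block[:4])
        return block[4:4 + length]


store = ToyKeyValueStore(InMemoryBlockDevice(num_blocks=16))
store.put("k", b"val")
print(store.get("k"))  # b'val'
```

Because the store depends only on the two block calls, swapping the in-memory device for one backed by a network block protocol would not change the key-value code.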

Contributions

For real applications at Facebook, we analyze:

1. What is the performance overhead of remote Flash using existing protocols?

2. What optimizations improve performance?

3. When does disaggregating Flash lead to resource efficiency benefits?

Flash Workloads at Facebook

• Analyze IO patterns of real Flash-based Facebook applications

• Applications use RocksDB, a key-value store with a log-structured merge tree architecture

         IOPS/TB      IO size
Read     2K – 10K     10KB – 50KB     Lots of random reads
Write    100 – 1K     500KB – 2MB     Large, bursty writes
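As a rough worked example, the IOPS/TB and IO size ranges above can be multiplied to estimate the bandwidth each terabyte of Flash has to sustain. This sketch assumes decimal units and that the low ends and high ends of the two ranges coincide, which the slide does not state, so treat the results as loose bounds.

```python
KB = 1000
MB = 1000 * KB

# (low, high) ranges copied from the workload table
workload = {
    "read": {"iops_per_tb": (2_000, 10_000), "io_size": (10 * KB, 50 * KB)},
    "write": {"iops_per_tb": (100, 1_000), "io_size": (500 * KB, 2 * MB)},
}


def bandwidth_range_mb_per_s(entry):
    """Return (low, high) bandwidth per TB in MB/s, pairing low with low and high with high."""
    lo = entry["iops_per_tb"][0] * entry["io_size"][0] / MB
    hi = entry["iops_per_tb"][1] * entry["io_size"][1] / MB
    return lo, hi


for op, entry in workload.items():
    lo, hi = bandwidth_range_mb_per_s(entry)
    print(f"{op}: {lo:.0f} to {hi:.0f} MB/s per TB")
```

Under these assumptions reads come to 20 to 500 MB/s per TB and writes to 50 to 2000 MB/s per TB.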

Workload Analysis

[Diagram: a mutilate load generator (application tier) sends requests over TCP/IP to an SSDB server wrapper running on RocksDB (datastore tier; CPU, RAM, NIC). The datastore tier reaches the remote block service in the Flash tier (Flash, NIC) over a storage protocol, here iSCSI.]

iSCSI is a standard network storage protocol that transports block storage commands over TCP/IP


√ Transparent to application
√ Runs on commodity network
√ Scales datacenter-wide

[Diagram: experimental setup. Clients running the mutilate load generator (application tier) send requests over TCP/IP to the SSDB server wrapper and RocksDB (datastore tier), which access the remote block service (Flash tier) over iSCSI. Hardware labels: 6 cores, 4GB RAM, 10GbE NICs, Intel P3600 PCIe Flash.]

Measure round-trip latency

Unloaded Latency

• Remote access with iSCSI adds 260 µs to p95 latency, tolerable for our target application (latency SLO ~5 ms)
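To put the 260 µs figure in context, the sketch below computes a p95 with the nearest-rank method and compares the added overhead with the ~5 ms SLO. The latency samples are hypothetical, and adding the overhead directly to the local p95 is a simplification, since percentiles do not add in general.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


REMOTE_OVERHEAD_US = 260  # from the slide
SLO_US = 5_000            # ~5 ms latency SLO from the slide

# Hypothetical local-Flash latencies in microseconds (illustrative only).
local_us = [90, 95, 100, 105, 110, 120, 130, 150, 200, 400]
local_p95 = percentile_nearest_rank(local_us, 95)
remote_p95_estimate = local_p95 + REMOTE_OVERHEAD_US

print(f"overhead is {REMOTE_OVERHEAD_US / SLO_US:.1%} of the SLO")
print(f"estimated remote p95: {remote_p95_estimate} us, "
      f"within SLO: {remote_p95_estimate <= SLO_US}")
```

The overhead alone is about 5% of the SLO budget, which is in line with the slide calling it tolerable.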

Application Throughput

• 45% throughput drop with "out of the box" iSCSI Flash

• Need to optimize remote Flash server for higher throughput

[Figure: client latency (ms) vs. QPS (thousands). Series: local Flash, iSCSI baseline (8 processes). Annotation: 45% drop.]

Multi-process iSCSI

• Vary number of iSCSI processes that issue IO

• Want enough parallelism, avoid scheduling interference

[Figure: client latency (ms) vs. QPS (thousands). Series: local Flash, 6 iSCSI processes (optimal), 8 iSCSI processes (default), 1 iSCSI process. Annotation: 12%.]

NIC offloads

• Enable NIC offloads for TCP segmentation (TSO/LRO) to reduce CPU load on Flash server and datastore server

[Figure: client latency (ms) vs. QPS (thousands). Series: local Flash, NIC offload, iSCSI with 6 processes, iSCSI baseline (8 processes). Annotation: 8%.]

Jumbo Frames

• Jumbo frames further reduce overhead by reducing segmentation altogether (max MTU 9kB)

[Figure: client latency (ms) vs. QPS (thousands). Series: local Flash, jumbo frame, NIC offload, iSCSI with 6 processes, iSCSI baseline (8 processes). Annotation: 10%.]
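A quick way to see why a larger MTU helps is to count how many frames one read needs. The sketch below assumes 40 bytes of IPv4 and TCP headers per frame with no options, ignores iSCSI's own headers, and uses the 10KB and 50KB read sizes from the workload table in decimal units.

```python
import math

HEADER_BYTES = 40  # assumed: 20-byte IPv4 header + 20-byte TCP header, no options


def frames_needed(io_bytes, mtu):
    """Number of TCP segments needed to carry io_bytes of payload at a given MTU."""
    payload_per_frame = mtu - HEADER_BYTES
    return math.ceil(io_bytes / payload_per_frame)


for io_kb in (10, 50):
    io_bytes = io_kb * 1000
    standard = frames_needed(io_bytes, mtu=1500)
    jumbo = frames_needed(io_bytes, mtu=9000)
    print(f"{io_kb} KB read: {standard} frames at MTU 1500, {jumbo} at MTU 9000")
```

With these assumptions a 50 KB read shrinks from 35 frames to 6, so there is far less per-packet work for the CPU on both ends.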

Interrupt Affinity Tuning

• Steer NIC interrupts to core handling TCP connection and Flash interrupts to cores issuing IO commands

[Figure: client latency (ms) vs. QPS (thousands). Series: local Flash, interrupt affinity, jumbo frame, NIC offload, iSCSI with 6 processes, iSCSI baseline (8 processes). Annotation: 4%.]
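On Linux, interrupt steering of this kind is commonly done by writing a hexadecimal CPU bitmask to /proc/irq/<irq>/smp_affinity. The sketch below only computes and prints the masks; the IRQ numbers and the core layout are made up for illustration, and real NICs and NVMe devices typically expose several IRQs (one per queue) rather than one each.

```python
def cpu_mask(cores):
    """Hex bitmask with one bit set per core index, in the form smp_affinity files use."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core index must be non-negative")
        mask |= 1 << core
    return format(mask, "x")


# Hypothetical layout on a 6-core server (IRQ numbers are invented).
plan = {
    "nic": {"irq": 64, "cores": [0]},                # core handling the TCP connection
    "flash": {"irq": 72, "cores": [1, 2, 3, 4, 5]},  # cores issuing IO commands
}

for name, entry in plan.items():
    path = f"/proc/irq/{entry['irq']}/smp_affinity"
    print(f"{name}: write {cpu_mask(entry['cores'])} to {path}")
```

Applying the masks requires root privileges, and a running irqbalance daemon may rewrite them unless it is configured to leave those IRQs alone.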

Optimized Application Throughput

[Figure: client latency (ms) vs. QPS (thousands). Series: local Flash, interrupt affinity, jumbo frame, NIC offload, iSCSI with 6 processes, iSCSI baseline (8 processes). Annotation: 42%.]
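The 45% baseline drop and the 42% tuning gain can be combined as a sanity check. This is a back-of-the-envelope sketch that assumes the 42% is measured relative to the untuned iSCSI baseline; the slides do not say so explicitly.

```python
baseline_drop = 0.45  # "out of the box" iSCSI vs. local Flash
tuning_gain = 0.42    # improvement annotated on the optimized curve

# Assumption: the 42% gain is relative to the untuned iSCSI baseline.
baseline_fraction = 1.0 - baseline_drop
tuned_fraction = baseline_fraction * (1.0 + tuning_gain)
remaining_drop = 1.0 - tuned_fraction

print(f"tuned remote Flash reaches {tuned_fraction:.0%} of local throughput")
print(f"remaining drop: {remaining_drop:.0%}")
```

Under that assumption the tuned setup lands at roughly 78% of local throughput, close to the roughly 20% drop reported for the optimized configuration.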

Application Throughput

• 20% drop in application throughput, on average

• At the tail, overhead of remote access is masked by other factors like write interference on Flash

[Figure: client latency (ms) vs. QPS (thousands). Series: local_avg, remote_avg, local_p95, remote_p95. Annotations: 20% drop, 10% drop.]

Sharing Remote Flash

• Sharing Flash among 2 or more tenants leads to more write interference → degrades tail performance

[Figure: client latency (ms) vs. QPS (thousands). Series: local_avg, remote_avg, local_p95, remote_p95. Annotations: 20% drop on avg, 25% drop @ tail.]

Disaggregation Benefits

• Make up for throughput loss by cost-effectively scaling resources with disaggregation

• Improve overall resource utilization

• Formulate cost model to quantify benefits
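The trade-off can be illustrated with a deliberately simple model. This is not the cost model from the talk, whose details are not in this transcript: the function names, the unit costs, and the 10% per-Flash-server overhead are all assumptions, and only the 20% throughput drop comes from the slides. Demand is expressed in "servers' worth" of compute and storage.

```python
def local_cost(compute, storage, cpu_cost=1.0, flash_cost=1.0):
    """Local Flash: every server carries both CPU and Flash, so buy for the larger demand."""
    servers = max(compute, storage)
    return servers * (cpu_cost + flash_cost)


def disaggregated_cost(compute, storage, cpu_cost=1.0, flash_cost=1.0,
                       throughput_drop=0.20, flash_server_extra=0.10):
    """Disaggregated: size compute and Flash separately, paying for remote-access overhead."""
    # Extra compute servers make up for the throughput lost to remote access.
    compute_servers = compute / (1.0 - throughput_drop)
    # Each Flash server also needs its own (assumed small) CPU/NIC budget.
    return compute_servers * cpu_cost + storage * (flash_cost + flash_server_extra)


def savings(compute, storage):
    """Fractional cost benefit of disaggregation (negative means it costs more)."""
    local = local_cost(compute, storage)
    return (local - disaggregated_cost(compute, storage)) / local


for compute, storage in [(1, 1), (1, 4), (4, 1)]:
    print(f"compute x{compute}, storage x{storage}: {savings(compute, storage):+.1%}")
```

Even in this toy form, balanced demand shows a net cost for disaggregation, while demand that grows unevenly in either dimension shows a saving, because the local design must buy CPU and Flash together.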

Resource Savings

• Resource savings of disaggregated vs. local Flash architecture as app requirements scale

• When storage scales at a higher rate than compute, save resources by deploying Flash without as much CPU

• When compute and storage demands remain balanced, no benefit with disaggregation

[Figure: % cost benefit of disaggregation (scale from -10% to 40%) as a function of storage capacity scaling factor and compute intensity scaling factor. Annotated regions: "Balanced CPU & Flash utilization" and "Deploy more Flash servers than compute".]

Implications for System Design

• Dataplane:
– Reduce compute overhead of network (storage) stack
  • Optimize TCP/IP processing
  • Use a light-weight protocol
– Provide isolation mechanisms for shared remote Flash

• Control plane:
– Policies for allocating and sharing remote Flash
  • Important to consider write IO patterns of applications


Conclusion

• Disaggregating Flash is beneficial because it allows us to cost-effectively scale resources:
– Improve overall resource efficiency
– Compensate for 20% throughput overhead by independently deploying application resources

• System tuning improves performance by ~40%, with more opportunities if the software stack is redesigned

Backup

Remote Flash IOPS

IO-intensive benchmark: 4kB random reads

[Figure: IOPS (thousands, 0 to 250) for 1 tenant, 3 tenants, and 6 tenants. Series: IRQ affinity, jumbo frame, NIC offload, multi-thread, baseline, multi-process. Reference: local Flash IOPS.]

Cost Model

Related Work

• Disaggregated disk storage:
– Petal [ASPLOS’96], Parallax [HotOS’05], Blizzard [NSDI’14]

• Disaggregated Flash as distributed shared log:
– CORFU [NSDI’12], FAWN [SOSP’09]

• Disaggregated memory:
– Memory blade servers (Lim et al.) [ISCA’09]

• Rack-scale disaggregation:
– Pelican [OSDI’14], HP Moonshot, Intel Rack-Scale