flash storage disaggregation - stanford...
Post on 06-Apr-2018
Flash Storage Disaggregation
Ana Klimovic1, Christos Kozyrakis1,4, Eno Thereska3,5, Binu John2 and Sanjeev Kumar2
Flash is underutilized
• Flash provides higher throughput and lower latency than disk
• Flash is underutilized in datacenters due to imbalanced resource requirements
PCIe Flash:
– 100,000s of IOPS
– 10s of µs latency
Datacenter Flash Use-Case
[Diagram: clients on app servers (application tier) send get(k) and put(k, val) requests over TCP/IP to the datastore service (datastore tier). Each datastore server runs a key-value store (software) on CPU, RAM, Flash, and a NIC (hardware).]
Imbalanced Resource Utilization
• Sample utilization of Facebook servers hosting a Flash-based key-value store over 6 months
• Flash capacity and IOPS are underutilized for long periods of time
[Figure: utilization over the 6-month sample.]
Local Flash Architecture
[Diagram: clients on app servers (application tier) send get(k) and put(k, val) requests over TCP/IP to the datastore service (datastore tier). Each datastore server runs the key-value store (software) on its own CPU, RAM, Flash, and NIC (hardware).]
Provision Flash and CPU in a dependent manner.
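Disaggregated Flash Architecture
[Diagram: three tiers. Application tier: clients on app servers send get(k) and put(k, val) requests over TCP/IP. Datastore tier: the datastore service runs the key-value store (software) on CPU, RAM, and a NIC (hardware), with no local Flash. Flash tier: a remote block service (software) runs on CPU, RAM, Flash, and a NIC (hardware). The datastore tier reaches the Flash tier with read(blk) and write(blk, data) over a network storage protocol (iSCSI).]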
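To make the layering in the diagram concrete, here is a minimal sketch of a key-value service that stores its data through a block interface offering read(blk) and write(blk, data). The class names `InMemoryBlockService` and `TinyKVStore`, the 4 KB block size, and the one-value-per-block layout are assumptions made for illustration; the slides use RocksDB over iSCSI, not this toy.

```python
BLOCK_SIZE = 4096  # assumed block size, not taken from the slides


class InMemoryBlockService:
    """Hypothetical stand-in for the remote block service (a dict replaces Flash + iSCSI)."""

    def __init__(self):
        self._blocks = {}

    def read(self, blk):
        # Blocks that were never written read back as zeros.
        return self._blocks.get(blk, bytes(BLOCK_SIZE))

    def write(self, blk, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        self._blocks[blk] = bytes(data)


class TinyKVStore:
    """Toy datastore-tier service: each key maps to one block on the block service."""

    def __init__(self, block_service):
        self._dev = block_service
        self._index = {}      # key -> block number, kept in datastore RAM
        self._next_blk = 0

    def put(self, k, val):
        # Layout: 2-byte length prefix, the value, zero padding to a full block.
        if len(val) > BLOCK_SIZE - 2:
            raise ValueError("value too large for one block")
        if k not in self._index:
            self._index[k] = self._next_blk
            self._next_blk += 1
        payload = len(val).to_bytes(2, "big") + val
        self._dev.write(self._index[k], payload.ljust(BLOCK_SIZE, b"\0"))

    def get(self, k):
        blk = self._index.get(k)
        if blk is None:
            return None
        raw = self._dev.read(blk)
        length = int.from_bytes(raw[:2], "big")
        return raw[2:2 + length]


store = TinyKVStore(InMemoryBlockService())
store.put("user:1", b"hello")
print(store.get("user:1"))
```

The point of the split is that `TinyKVStore` never touches storage hardware directly, so the block service could sit on another machine without changing the get/put interface seen by clients.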
Contributions
For real applications at Facebook, we analyze:
1. What is the performance overhead of remote Flash using existing protocols?
2. What optimizations improve performance?
3. When does disaggregating Flash lead to resource efficiency benefits?
Flash Workloads at Facebook
• Analyze IO patterns of real Flash-based Facebook applications
• Applications use RocksDB, a key-value store with a log-structured merge tree architecture

         IOPS/TB      IO size          Pattern
Read     2K – 10K     10KB – 50KB      Lots of random reads
Write    100 – 1K     500KB – 2MB      Large, bursty writes
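As a rough reading of the table, multiplying IOPS per TB by IO size bounds the bandwidth each TB of Flash has to serve. The sketch below assumes decimal units (1 KB = 1000 bytes) and simply pairs the low ends and the high ends of each range; the slides do not say whether high IOPS and large IO sizes actually occur together, so these are outer bounds, not measurements.

```python
KB = 1000          # decimal units assumed
MB = 1000 * KB

# (min, max) ranges copied from the workload table
workloads = {
    "read": {"iops_per_tb": (2_000, 10_000), "io_size": (10 * KB, 50 * KB)},
    "write": {"iops_per_tb": (100, 1_000), "io_size": (500 * KB, 2 * MB)},
}


def bandwidth_bounds_mb_per_s(workload):
    """Return (low, high) bandwidth per TB in MB/s as IOPS times IO size."""
    low = workload["iops_per_tb"][0] * workload["io_size"][0] / MB
    high = workload["iops_per_tb"][1] * workload["io_size"][1] / MB
    return low, high


bounds = {name: bandwidth_bounds_mb_per_s(w) for name, w in workloads.items()}
for name, (low, high) in bounds.items():
    print(f"{name}: {low:.0f} to {high:.0f} MB/s per TB")
```

Under these assumptions reads fall between 20 and 500 MB/s per TB and writes between 50 and 2000 MB/s per TB.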
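Workload Analysis
[Diagram: experimental setup. Application tier: the mutilate load generator sends requests over TCP/IP. Datastore tier: an SSDB server wrapper and RocksDB (software) run on CPU, RAM, and a NIC (hardware). Flash tier: a remote block service (software) runs on Flash and a NIC (hardware). The datastore tier reaches the Flash tier over a network storage protocol.]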
iSCSI is a standard network storage protocol that transports block storage commands over TCP/IP.
✓ Transparent to application
✓ Runs on commodity network
✓ Scales datacenter-wide
Workload Analysis
[Diagram: hardware used in the experiments. Application tier: clients running the mutilate load generator, connected over TCP/IP. Datastore tier: SSDB server wrapper and RocksDB on 6 cores, 4 GB RAM, and 10 GbE. Flash tier: remote block service on an Intel P3600 PCIe Flash device with 10 GbE, reached over iSCSI.]
Measure round-trip latency
Unloaded Latency
• Remote access with iSCSI adds 260 µs to p95 latency, tolerable for our target application (latency SLO ~5 ms)
[Figure: unloaded latency, local vs. remote Flash; annotated 260 µs.]
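A quick way to see why 260 µs is tolerable is to compare it with the SLO. The sketch below uses the two numbers from the slide; the `meets_slo` helper and the local p95 values passed to it are hypothetical, and adding the overhead directly to a local p95 is a simplification.

```python
ADDED_P95_US = 260   # iSCSI overhead on p95 latency, from the slide
SLO_US = 5_000       # latency SLO of about 5 ms, from the slide

# Share of the latency budget consumed by remote access.
fraction_of_slo = ADDED_P95_US / SLO_US


def meets_slo(local_p95_us):
    """Simplified check: local p95 plus the remote overhead must fit in the SLO."""
    return local_p95_us + ADDED_P95_US <= SLO_US


print(f"remote access uses {fraction_of_slo:.1%} of the SLO budget")
print(meets_slo(1_000))   # hypothetical local p95 of 1 ms
```

The overhead is about 5% of the budget, so it only matters for requests that are already close to the SLO when served from local Flash.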
Application Throughput
• 45% throughput drop with "out of the box" iSCSI Flash
• Need to optimize remote Flash server for higher throughput
[Figure: client latency (ms, 0 to 2) vs. QPS (thousands, 0 to 80). Series: local Flash; iSCSI baseline (8 processes). Annotated: 45% drop.]
Multi-process iSCSI
• Vary number of iSCSI processes that issue IO
• Want enough parallelism, avoid scheduling interference
[Figure: client latency (ms, 0 to 2) vs. QPS (thousands, 0 to 80). Series: local Flash; 6 iSCSI processes (optimal); 8 iSCSI processes (default); 1 iSCSI process. Annotated: 12%.]
NIC offloads
• Enable NIC offloads for TCP segmentation (TSO/LRO) to reduce CPU load on Flash server and datastore server
[Figure: client latency (ms, 0 to 2) vs. QPS (thousands, 0 to 80). Series: local Flash; NIC offload; iSCSI with 6 processes; iSCSI baseline (8 processes). Annotated: 8%.]
Jumbo Frames
• Jumbo frames further reduce overhead by reducing segmentation altogether (max MTU 9 kB)
[Figure: client latency (ms, 0 to 2) vs. QPS (thousands, 0 to 80). Series: local Flash; jumbo frame; NIC offload; iSCSI with 6 processes; iSCSI baseline (8 processes). Annotated: 10%.]
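The effect of a larger MTU on segmentation can be estimated by counting TCP segments per IO. This sketch assumes IPv4 and TCP headers without options (40 bytes in total), a 1500-byte standard MTU, a 9000-byte jumbo MTU, and a 50 KB read (the upper end of the read sizes quoted earlier, taken as 50 × 1024 bytes). It ignores iSCSI PDU headers, so the counts are approximate.

```python
import math

HEADER_BYTES = 40  # IPv4 (20) + TCP (20), no options assumed


def segments_for_io(io_bytes, mtu):
    """Number of TCP segments needed to carry io_bytes of payload at a given MTU."""
    mss = mtu - HEADER_BYTES  # maximum segment size: payload bytes per packet
    return math.ceil(io_bytes / mss)


io_bytes = 50 * 1024
standard = segments_for_io(io_bytes, 1500)
jumbo = segments_for_io(io_bytes, 9000)
print(standard, jumbo)  # 36 6
```

Under these assumptions one 50 KB IO needs 36 packets at the standard MTU and 6 with jumbo frames, which is the per-packet work the slide refers to.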
Interrupt Affinity Tuning
• Steer NIC interrupts to the core handling the TCP connection and Flash interrupts to the cores issuing IO commands
[Figure: client latency (ms, 0 to 2) vs. QPS (thousands, 0 to 80). Series: local Flash; interrupt affinity; jumbo frame; NIC offload; iSCSI with 6 processes; iSCSI baseline (8 processes). Annotated: 4%.]
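On Linux, interrupt steering of this kind is typically done by writing a hexadecimal CPU bitmask to `/proc/irq/<irq>/smp_affinity`. The sketch below only computes the masks and the target paths; it writes nothing. The split of core 0 for the NIC and cores 1 to 5 for Flash, and the IRQ number 42, are hypothetical values chosen for a 6-core server, not settings reported in the slides.

```python
def cpu_mask(cores):
    """Hex bitmask with one bit set per core, as expected by smp_affinity."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return format(mask, "x")


def affinity_path(irq):
    """procfs file that holds the CPU affinity mask of one IRQ."""
    return f"/proc/irq/{irq}/smp_affinity"


nic_mask = cpu_mask([0])                 # core handling the TCP connection (assumed)
flash_mask = cpu_mask([1, 2, 3, 4, 5])   # cores issuing IO commands (assumed)

print(affinity_path(42), nic_mask)       # hypothetical NIC IRQ number
print(flash_mask)                        # 3e
```

Applying a mask means writing the string to the path as root; the irqbalance daemon, if running, may overwrite manual settings.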
Optimized Application Throughput
[Figure: client latency (ms, 0 to 2) vs. QPS (thousands, 0 to 80). Series: local Flash; interrupt affinity; jumbo frame; NIC offload; iSCSI with 6 processes; iSCSI baseline (8 processes). Annotated: 42%.]
Application Throughput
• 20% drop in application throughput, on average
[Figure: client latency (ms, 0 to 2) vs. QPS (thousands, 0 to 80). Series: local_avg, remote_avg, local_p95, remote_p95. Annotated: 20% drop.]
Application Throughput
• At the tail, overhead of remote access is masked by other factors like write interference on Flash
[Figure: client latency (ms, 0 to 2) vs. QPS (thousands, 0 to 80). Series: local_avg, remote_avg, local_p95, remote_p95. Annotated: 20% drop (average) and 10% drop (p95).]
Sharing Remote Flash
• Sharing Flash among 2 or more tenants leads to more write interference → degrades tail performance
[Figure: client latency (ms, 0 to 4) vs. QPS (thousands, 0 to 140). Series: local_avg, remote_avg, local_p95, remote_p95. Annotated: 20% drop on average, 25% drop at the tail.]
Disaggregation Benefits
• Make up for throughput loss by cost-effectively scaling resources with disaggregation
• Improve overall resource utilization
• Formulate cost model to quantify benefits
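The slides do not spell out the cost model, so the following is only an illustrative sketch of how such a comparison could be set up, not the authors' formulation. It assumes that a local-Flash server bundles one unit of compute with one unit of Flash, that disaggregation needs extra datastore compute to make up for the 20% throughput loss, and that each Flash server carries a 10% overhead for its own CPU and NIC. The unit costs and the overhead figure are hypothetical.

```python
def local_cost(compute_units, flash_units, cpu_cost=1.0, flash_cost=1.0):
    """Local Flash: CPU and Flash are bundled, so the larger demand sets the server count."""
    servers = max(compute_units, flash_units)
    return servers * (cpu_cost + flash_cost)


def disaggregated_cost(compute_units, flash_units, cpu_cost=1.0, flash_cost=1.0,
                       throughput_loss=0.20, flash_server_overhead=0.10):
    """Disaggregated: compute and Flash scale independently.

    Compute is inflated to compensate for the throughput loss, and each Flash
    unit pays an assumed overhead for the Flash server's own CPU and NIC.
    """
    datastore_cost = compute_units / (1.0 - throughput_loss) * cpu_cost
    flash_tier_cost = flash_units * flash_cost * (1.0 + flash_server_overhead)
    return datastore_cost + flash_tier_cost


def savings(compute_units, flash_units):
    """Fractional cost benefit of disaggregation (negative means it costs more)."""
    local = local_cost(compute_units, flash_units)
    return 1.0 - disaggregated_cost(compute_units, flash_units) / local


for flash_units in (1, 2, 4):
    print(f"compute=1, flash={flash_units}: {savings(1, flash_units):+.1%}")
```

With these made-up parameters, balanced demand yields a negative benefit, while storage demand growing faster than compute yields increasing savings, which is the qualitative shape one would expect from the bullets above.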
Resource Savings
• Resource savings of disaggregated vs. local Flash architecture as app requirements scale
[Figure: heat map of % cost benefit of disaggregation (scale from -10% to 40%). Axes: storage capacity scaling factor and compute intensity scaling factor. Annotated region: balanced CPU & Flash utilization.]
Resource Savings
• When storage scales at a higher rate than compute, save resources by deploying Flash without as much CPU
[Figure: heat map of % cost benefit of disaggregation (scale from -10% to 40%). Axes: storage capacity scaling factor and compute intensity scaling factor. Annotated regions: balanced CPU & Flash utilization; deploy more Flash servers than compute.]
Resource Savings
• When compute and storage demands remain balanced, no benefit with disaggregation
[Figure: heat map of % cost benefit of disaggregation (scale from -10% to 40%). Axes: storage capacity scaling factor and compute intensity scaling factor. Annotated region: balanced CPU & Flash utilization.]
Implications for System Design
• Dataplane:
  – Reduce compute overhead of network (storage) stack
    • Optimize TCP/IP processing
    • Use a light-weight protocol
  – Provide isolation mechanisms for shared remote Flash
• Control plane:
  – Policies for allocating and sharing remote Flash
    • Important to consider write IO patterns of applications
Conclusion
• Disaggregating Flash is beneficial because it allows us to cost-effectively scale resources:
  – Improve overall resource efficiency
  – Compensate for 20% throughput overhead by independently deploying application resources
• System tuning improves performance ~40%; more opportunities if we redesign the software stack
Remote Flash IOPS
IO-intensive benchmark: 4 kB random reads
[Figure: IOPS (thousands, 0 to 250) for 1, 3, and 6 tenants. Series: baseline; multi-thread; multi-process; NIC offload; jumbo frame; IRQ affinity. Reference line: local Flash IOPS.]
Related Work
• Disaggregated disk storage:
  – Petal [ASPLOS'96], Parallax [HotOS'05], Blizzard [NSDI'14]
• Disaggregated Flash as distributed shared log:
  – CORFU [NSDI'12], FAWN [SOSP'09]
• Disaggregated memory:
  – Memory blade servers (Lim et al.) [ISCA'09]
• Rack-scale disaggregation:
  – Pelican [OSDI'14], HP Moonshot, Intel Rack-Scale