Source: homes.cs.washington.edu/~tom/talks/os.pdf
TRANSCRIPT
High Performance Data Center Operating Systems
Tom Anderson, Antoine Kaufmann, Youngjin Kwon, Naveen Kr. Sharma, Arvind Krishnamurthy, Simon Peter, Mothy Roscoe, and Emmett Witchel
An OS for the Data Center
- Server I/O performance matters: key-value stores, web & file servers, databases, mail servers, machine learning, ...
- Can we deliver performance close to hardware?
- Example system: Dell PowerEdge R520
  - Intel X520 10G NIC (2 us / 1 KB packet; a 40G NIC: 500 ns / 1 KB packet)
  - Intel RS3 RAID with 1 GB flash-backed cache (25 us / 1 KB write; NVDIMM: 500 ns / 1 KB write)
  - Sandy Bridge CPU, 6 cores, 2.2 GHz
- Today's I/O devices are fast and getting faster
Can’t we just use Linux?
Linux I/O Performance

Percent of 1 KB Redis request time spent, by layer:
- GET: hardware 18%, kernel 62%, application 20%
- SET: hardware 13%, kernel 84%, application 3%
The kernel data path covers the API, multiplexing, naming, resource limits, access control, I/O scheduling, I/O processing, copying, and protection.
- 10G NIC: 2 us / 1 KB packet at the device, but 9 us through the kernel path
- RAID storage: 25 us / 1 KB write at the device, but 163 us through the kernel path

Kernel mediation is too heavyweight.
Arrakis: Separate the OS control and data plane
How to skip the kernel?

Today the kernel sits between the application (e.g., Redis) and the I/O devices, handling naming, resource limits, access control, the API, multiplexing, I/O scheduling, I/O processing, copying, and protection. Arrakis keeps only the control plane (naming, resource limits, access control) in the kernel and moves the data path out, so the application accesses the I/O devices directly.
Arrakis I/O Architecture
- Control plane: the kernel (naming, resource limits, access control)
- Data plane: the application (e.g., Redis) performs API handling, multiplexing, I/O scheduling, I/O processing, and protection on a direct path to the I/O devices
Design Goals

- Streamline network and storage I/O
  - Eliminate OS mediation in the common case
  - Application-specific customization vs. kernel one-size-fits-all
- Keep OS functionality
  - Process (container) isolation and protection
  - Resource arbitration, enforceable resource limits
  - Global naming, sharing semantics
- POSIX compatibility at the application level
  - Additional performance gains from rewriting the API
This Talk
- Arrakis (OSDI '14): OS architecture that separates the control and data plane, for both networking and storage
- Strata (SOSP '17): file system design for low-latency persistence (NVM) and multi-tier storage (NVM, SSD, HDD)
- TCP as a Service / FlexNIC / Floem (ASPLOS '15, OSDI '18): OS, NIC, and app library support for fast, agile, secure protocol processing
Storage diversification
Better performance at the top, higher capacity at the bottom:

  Device   Latency   Throughput   $/GB
  DRAM     80 ns     200 GB/s     10.8
  NVDIMM   200 ns    20 GB/s      2
  SSD      10 us     2.4 GB/s     0.26
  HDD      10 ms     0.25 GB/s    0.02

NVM (NVDIMM) is byte-addressable: cache-line-granularity I/O, direct access with load/store instructions. SSDs have large erase blocks: hardware GC overhead, and random writes cause a 5-6x slowdown from GC.
Let’s Build a Fast Server
Key-value store, database, file server, mail server, ...
Requirements:
- Small updates dominate
- Data set scales up to many terabytes
- Updates must be crash consistent
A fast server on today’s file system
- Small updates (1 KB) dominate
- Data set scales up to 100 TB
- Updates must be crash consistent

Small, random I/O is slow! Even with an optimized kernel file system (NOVA [FAST '16, SOSP '17]), NVM is so fast that the kernel is the bottleneck: for a 1 KB write, 91% of the I/O latency is kernel code rather than the write to the device.
A fast server on today’s file system
- Small updates (1 KB) dominate
- Data set scales up to 100 TB
- Updates must be crash consistent

Using only NVM is too expensive: about $200K for 100 TB. To save cost, we need a way to use multiple device types: NVM, SSD, HDD.
A fast server on today’s file system
- Small updates (1 KB) dominate
- Data set scales up to 10 TB
- Updates must be crash consistent

Block-level caching across NVM, SSD, and HDD manages data in blocks, but NVM is byte-addressable: a 1 KB write through the block cache takes far longer than a byte-addressed NVM write. For low-cost capacity with high performance, we must leverage multiple device types.
A fast server on today’s file system
- Small updates (1 KB) dominate
- Data set scales up to 10 TB
- Updates must be crash consistent

Applications struggle for crash consistency: Pillai et al. (OSDI 2014) counted multiple crash vulnerabilities each in SQLite, ZooKeeper, HSQLDB, and Git.
Today’s file systems: Limited by old design assumptions
- The kernel mediates every operation, but NVM is so fast that the kernel becomes the bottleneck
- Tied to a single type of device, but low-cost capacity with high performance requires multiple device types (NVM, SSD, HDD)
- Aggressive caching in DRAM, writing to the device only when forced (fsync), leaves applications struggling for crash consistency
Strata: A Cross Media File System
- Performance, especially for small, random I/O: fast user-level device access
- Capacity: leverage NVM, SSD & HDD for low cost; transparent data migration across different media; efficiently handle device I/O properties
- Simplicity: intuitive crash consistency model; in-order, synchronous I/O; no fsync() required
Strata: main design principle
Performance and simplicity: LibFS logs operations to NVM at user level
- Fast user-level access; kernel bypass, but the log is private
- In-order, synchronous I/O; intuitive crash consistency

Capacity: KernelFS digests and migrates data in the kernel
- Applies log operations to shared data; asynchronous digest
- Transparent data migration
- Shared file access; coordinates multi-process accesses
Log operations to NVM at user-level
- Fast writes: directly access fast NVM
- Sequentially append data: cache-line granularity, blind writes
- Crash consistency: on crash, the kernel replays the log

The application's LibFS bypasses the kernel and appends to a private operation log in NVM. Each file operation on data and metadata made by a single system call (creat, rename, ...) becomes one log record.
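The logging scheme above can be sketched as a toy Python model. This is an illustration, not Strata's actual code: the class name `OpLog`, the record format, and the replay rules are all assumptions made for the example.

```python
# Toy sketch of LibFS-style user-level logging: each file operation is
# appended, in order, as one record to a private log, so a crash can be
# recovered by replaying the log. Illustrative, not Strata's API.

class OpLog:
    def __init__(self, capacity=1024):
        self.capacity = capacity      # logs are limited in size; a digest frees space
        self.records = []             # stands in for an append-only NVM region

    def append(self, op, **args):
        if len(self.records) >= self.capacity:
            raise RuntimeError("log full: digest required")
        # In Strata the append is persisted at cache-line granularity before
        # the operation returns, making each operation synchronously durable.
        self.records.append({"op": op, **args})

    def replay(self, state):
        # On crash, the kernel replays the log against the shared state.
        for r in self.records:
            if r["op"] == "create":
                state[r["path"]] = b""
            elif r["op"] == "write":
                state[r["path"]] = state.get(r["path"], b"") + r["data"]
            elif r["op"] == "unlink":
                state.pop(r["path"], None)
        return state

log = OpLog()
log.append("create", path="/db/journal")
log.append("write", path="/db/journal", data=b"hello")
state = log.replay({})
```

Because every record is durable before the call returns, replaying a prefix of the log always yields a state the application actually reached.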
Intuitive crash consistency
Synchronous I/O: when each system call returns,
- Data and metadata are durable
- Updates are in order
- Writes are atomic
- Size is limited (by the log size)

Fast synchronous I/O comes from NVM plus kernel bypass; fsync() is a no-op.
Digest data in kernel

KernelFS applies the application's private operation log to the shared area in NVM:
- Visibility: make the private log visible to other applications
- Data layout: turn the write-optimized log into a read-optimized format
- Large, batched I/O; coalesce the log
Digest optimization: Log coalescing

SQLite and mail servers make crash-consistent updates using write-ahead logging:
1. Create journal file
2. Write data to journal file
3. Write data to database file
4. Delete journal file

Digest eliminates the unneeded work: only "write data to database file" reaches the shared area; the temporary durable writes are removed. Throughput optimization: log coalescing saves I/O while digesting.
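The coalescing idea above can be shown with a toy Python sketch. The `coalesce` function and its cancellation rule (drop everything touching files that are both created and deleted within the log window) are a simplified illustration of the digest, not Strata's implementation.

```python
# Toy sketch of digest-time log coalescing: writes to a file that is created
# and then deleted within the same log window never need to reach the shared
# area. Record format is illustrative.

def coalesce(records):
    created = {r["path"] for r in records if r["op"] == "create"}
    deleted = {r["path"] for r in records if r["op"] == "unlink"}
    temporary = created & deleted          # e.g., a write-ahead journal file
    return [r for r in records if r["path"] not in temporary]

wal = [
    {"op": "create", "path": "journal"},
    {"op": "write",  "path": "journal", "data": b"page1"},
    {"op": "write",  "path": "db",      "data": b"page1"},
    {"op": "unlink", "path": "journal"},
]
digested = coalesce(wal)
# Only the database write survives digestion.
```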
Digest and migrate data in kernel

KernelFS digests each application's private operation log into the NVM shared area, and migrates data to lower layers (SSD shared area, HDD shared area) for low-cost capacity. The structure resembles a log-structured merge (LSM) tree.
Handle device I/O properties

SSDs prefer large sequential I/O. Writing 1 GB sequentially from NVM to SSD, measured SSD throughput (MB/s) falls as SSD utilization rises, with the drop depending on write size (64 MB to 1024 MB measured). Strata uses the NVM layer as a persistent write buffer, avoiding SSD garbage collection overhead.
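A toy Python sketch of the write-buffering idea: small writes are absorbed by a fast persistent layer and flushed to the SSD in large sequential chunks. The class name, the 64 KB chunk size (scaled down from the tens-of-MB sizes on the slide), and the bookkeeping are all illustrative assumptions.

```python
# Toy sketch: use a fast persistent layer (NVM) as a write buffer and flush
# to the "SSD" in large sequential chunks, since SSDs prefer large
# sequential I/O. Sizes and names are illustrative.

FLUSH_CHUNK = 64 * 1024                 # scaled-down flush unit for the example

class TieredWriter:
    def __init__(self):
        self.nvm_buffer = bytearray()   # persistent write buffer (NVM layer)
        self.ssd_writes = []            # record of (offset, size) SSD writes
        self.ssd_offset = 0

    def write(self, data):
        self.nvm_buffer += data         # small random writes absorbed by NVM
        while len(self.nvm_buffer) >= FLUSH_CHUNK:
            self._flush(FLUSH_CHUNK)

    def _flush(self, size):
        # One large sequential SSD write instead of many small random ones,
        # which avoids SSD garbage-collection overhead.
        self.ssd_writes.append((self.ssd_offset, size))
        self.ssd_offset += size
        del self.nvm_buffer[:size]

w = TieredWriter()
for _ in range(200):
    w.write(b"x" * 1024)                # 200 small 1 KB writes
```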
Read: hierarchical search
On a read, Strata searches hierarchically and returns the first hit:
1. Private operation log (LibFS)
2. NVM shared area
3. SSD shared area
4. HDD shared area
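The four-step search order can be sketched in a few lines of Python. The dictionaries standing in for the log and the three shared areas, and their contents, are invented for the example.

```python
# Toy sketch of Strata-style hierarchical read: check the private log first,
# then each shared layer from fastest to slowest. Layers are plain dicts here.

def read(path, op_log, nvm_area, ssd_area, hdd_area):
    for layer in (op_log, nvm_area, ssd_area, hdd_area):   # search order 1..4
        if path in layer:
            return layer[path]
    raise FileNotFoundError(path)

op_log   = {"a": b"newest"}                 # not yet digested
nvm_area = {"a": b"stale", "b": b"warm"}    # older digested version of "a"
ssd_area = {"c": b"cool"}
hdd_area = {"d": b"cold"}
```

Because the private log is searched first, a reader always sees its own most recent, not-yet-digested write.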
Shared file access
- Leases grant access rights to applications [SOSP '89]
- Required for files and directories
- Function like locks, but revocable
- Exclusive writer, shared readers
- On revocation, LibFS digests the leased data
- Leases serialize concurrent updates
Shared file access
Example: concurrent writes to the same file A.
1. Application 1's LibFS requests a write lease for file A and writes file A's data to its private operation log (op log 1).
2. Application 2's LibFS requests a write lease for file A.
3. KernelFS revokes Application 1's lease; Application 1 digests its data to the shared area.
4. Application 2 receives the lease and writes file A's data to its own log (op log 2).
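The revocation protocol above can be sketched as a toy Python lease manager. The class names, the `digest` callback, and the single-threaded structure are assumptions made for the illustration; real revocation involves the kernel and concurrent processes.

```python
# Toy sketch of lease-based coordination: an exclusive write lease is
# revocable, and revocation forces the holder to digest its leased updates
# to the shared area before another writer proceeds.

class App:
    def __init__(self, name):
        self.name = name
        self.digested = []          # paths this app was forced to digest

    def digest(self, path):
        self.digested.append(path)

class LeaseManager:
    def __init__(self):
        self.writer = {}            # path -> app holding the write lease

    def acquire_write(self, app, path):
        holder = self.writer.get(path)
        if holder is not None and holder is not app:
            # Revoke: the current holder digests its private log first,
            # so the new writer sees the digested data in the shared area.
            holder.digest(path)
        self.writer[path] = app

mgr = LeaseManager()
app1, app2 = App("app1"), App("app2")
mgr.acquire_write(app1, "fileA")    # app1 writes fileA in its private log
mgr.acquire_write(app2, "fileA")    # forces app1 to digest, then grants app2
```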
Experimental setup
- 2x Intel Xeon E5-2640 CPU, 64 GB DRAM
- 400 GB NVMe SSD, 1 TB HDD
- Ubuntu 16.04 LTS, Linux kernel 4.8.12
- Emulated NVM: 40 GB of DRAM, performance model [Y. Zhang et al., MSST 2015], latency and throughput throttled in software
- Compare Strata vs. PMFS, NOVA, ext4-DAX (NVM kernel file systems)
  - NOVA: atomic update, in-order synchronous I/O
  - PMFS, ext4-DAX: no atomic write
Latency: LevelDB
LevelDB on NVM: key size 16 B, value size 1 KB, 300,000 objects. Latency in us (lower is better) for synchronous write, random write, and random delete, comparing Strata, PMFS, NOVA, and EXT4-DAX:
- Random write: Strata 25% better than PMFS (fast user-level logging)
- Overwrite: Strata 17% better than PMFS
- Level compaction causes asynchronous digests
Throughput: Varmail
Mail server workload from Filebench:
- Using only NVM; 10,000 files; read/write ratio 1:1; write-ahead logging
- Throughput (ops/s, higher is better): Strata 29% better than NOVA
- Log coalescing eliminates 86% of log records, saving 14 GB of I/O
This Talk
- Arrakis (OSDI '14): OS architecture that separates the control and data plane, for both networking and storage
- Strata (SOSP '17): file system design for low-latency persistence (NVM) and multi-tier storage (NVM, SSD, HDD)
- TCP as a Service / FlexNIC / Floem (ASPLOS '15, OSDI '18): OS, NIC, and app library support for fast, agile, secure protocol processing
Let’s Build a Fast Server
Key-value store, database, mail server, ML, ...
Requirements:
- Mostly small RPCs over TCP
- 40 Gbps network links (100+ Gbps soon)
- Enforceable resource sharing (multi-tenant)
- Agile protocol development: kernel and app
- Tail latency, cost-efficient hardware
Let’s Build a Fast Server
- Small RPCs dominate
- Enforceable resource sharing
- Agile protocol development
- Cost-efficient hardware

Running the application over the Linux network stack on a 40 Gbps NIC, kernel mediation is too slow: overhead is 97%.
Hardware I/O Virtualization

- Direct access to the device at user level
- Multiplexing: SR-IOV provides virtual PCI devices with their own registers, queues, and interrupts
- Protection: the IOMMU restricts DMA to/from app virtual memory; packet filters (e.g., legal source IP header); rate limiters
- The SR-IOV NIC exposes user-level VNICs (VNIC 1, VNIC 2) to the network
- mTCP: 2-3x faster than Linux, but who enforces congestion control?
Remote DMA (RDMA)
Programming model: read/write to a (limited) region of remote server memory.
- The model dates to the '80s (Alfred Spector)
- The HPC community revived it for communication within a rack
- Extended to the data center over Ethernet (RoCE); commercially available 100G NICs
- No CPU involvement on the remote node: fast if the app can use the programming model

Limitations:
- What if you need remote application computation (RPC)?
- The lossless model is performance-fragile
Smart NICs (Cavium, …)
A NIC with an array of low-end CPU cores (Cavium, ...): if we compute on the NIC, maybe we don't need to go to the CPU?
- Applications in high-speed trading
- We've been here before: the "wheel of reinvention"
- Hardware is relatively expensive, and apps often run slower on the NIC than on the CPU (cf. Floem)
Step 1
Build a faster kernel TCP in software, with no change in isolation, resource allocation, or API.
Q: Why is RPC over Linux TCP so slow?
OS - Hardware Interface
- Highly optimized code path; buffer descriptor queues
- No interrupt in the common case; maximize concurrency
- Each core (1-4) has its own Tx/Rx descriptor queue pair between the operating system and the network interface card
OS Transmit Packet Processing
- TCP layer: move from socket buffer to IP queue
  - Lock the socket
  - Apply the congestion/flow control limit
  - Fill in the TCP header, calculate the checksum
  - Copy data
  - Arm the retransmission timeout
- IP layer: firewall, routing, ARP, traffic shaping
- Driver: move from IP queue to NIC queue; allocate and free packet buffers
Sidebar: Tail Latency
On Linux with a 40 Gbps link and 400 outbound TCP flows sending RPCs, with no congestion, what is the minimum rate across all flows?
Kernel and Socket Overhead
A typical server event loop makes multiple synchronous kernel transitions per request:

    events = poll(...)
    for e in events:
        if e.socket != listen_sock:
            receive(e.socket, ...)
            ...
            send(e.socket, ...)

Each transition costs parameter checks and copies, cache pollution, and pipeline stalls.
TCP Acceleration as a Service (TaS)
- TCP as a user-level OS service
  - SR-IOV to dedicated cores
  - Scale the number of cores up or down to match demand
  - Optimized data plane for common-case operations
- The application uses its own dedicated cores
  - Avoids polluting the application-level cache
  - Fewer cores => better performance scaling
- To the application: per-socket tx/rx queues with doorbells, analogous to hardware device tx/rx queues
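The per-socket queue-with-doorbell interface can be sketched as a toy Python ring buffer. The class name, ring size, and single-threaded polling are illustrative assumptions; a real implementation uses shared memory between the app and the fast-path cores.

```python
# Toy sketch of a per-socket tx queue with a doorbell: the application hands
# buffers to the fast path without a system call by writing ring entries and
# bumping a doorbell counter that the consumer polls.

class TxQueue:
    def __init__(self, size=8):
        self.ring = [None] * size
        self.head = 0               # consumer position (fast-path core)
        self.tail = 0               # producer position (application)
        self.doorbell = 0           # producer "rings" by advancing this counter

    def send(self, payload):
        if self.tail - self.head == len(self.ring):
            raise RuntimeError("queue full")
        self.ring[self.tail % len(self.ring)] = payload
        self.tail += 1
        self.doorbell = self.tail   # a plain store, not a kernel transition

    def poll(self):
        # Fast-path core consumes everything announced by the doorbell.
        out = []
        while self.head < self.doorbell:
            out.append(self.ring[self.head % len(self.ring)])
            self.head += 1
        return out

q = TxQueue()
q.send(b"rpc-1")
q.send(b"rpc-2")
sent = q.poll()
```

The same shape works in reverse for the rx queue, which is why the slide calls it analogous to hardware device tx/rx queues.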
Streamline the common-case data path
- Remove unneeded computation from the data path: congestion control and timeouts per RTT, not per packet
- Minimize per-flow TCP state: prefetch 2 cache lines on packet arrival
- Linearized code: better branch prediction, super-scalar execution
- Enforce IP-level access control on the control plane at connection setup
Small RPC Microbenchmark
- Configurations: 1 app core + 1 fast-path core; 4 app cores + 7 fast-path cores; 8 app cores + 11 fast-path cores; up to 4.9x higher throughput
- IX: a fast kernel TCP with syscall batching and a non-socket API
- Latency: Linux is 7.3x TaS; IX is 2.9x TaS
TAS is workload proportional
- Setup: 4 clients starting every 10 seconds, then stopping incrementally
- Measured over time: latency (us) stays low while the number of fast-path cores scales between 1 and 9 to track the offered load
Step 2
TCP as a Service can saturate a 40 Gbps link with small RPCs, but what about 100 Gbps or 400 Gbps links? Network link speeds are scaling up faster than cores.

What NIC hardware do we need for fast data center communication? The TCP-as-a-Service data plane can be efficiently built in hardware.
FlexNIC Design Principles
- RPCs are the common case: kernel bypass to application logic
- Enforceable per-flow resource sharing: data plane in hardware, policy in the kernel
- Agile protocol development: protocol agnostic (e.g., Timely, DCTCP, and RDMA); offload both kernel and app packet handling
- Cost-efficient: minimal instruction set for packet processing
FlexNIC Hardware Model

A programmable parser turns the packet stream into parsed headers (Eth, IPv4, TCP/UDP, RPC) that flow through match-action stages into egress queues:
- TCAM for arbitrary wildcard matches
- SRAM for exact and longest-prefix-match lookups
- Stateful memory for counters
- ALUs for modifying headers and registers

Example match-action program:
1. p = lookup(eth.dst_mac)
2. pkt.egress_port = p
3. counter[ipv4.src_ip]++

Barefoot Networks' RMT switch runs at ~6 Tbps (with parallel streams).
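The three-step match-action program on this slide can be mimicked in a few lines of Python. The table contents, field names as dictionary keys, and packet representation are invented for the illustration; real hardware does this per packet in the pipeline stages listed above.

```python
# Toy sketch of one match-action stage: match a packet field against a
# forwarding table, then run actions that set the egress port and bump a
# stateful counter, mirroring the 3-line program on the slide.

from collections import defaultdict

fwd_table = {"aa:bb": 1, "cc:dd": 2}          # exact-match (SRAM-style) lookup
counters = defaultdict(int)                   # stateful memory

def process(pkt):
    port = fwd_table.get(pkt["eth.dst_mac"])  # 1. p = lookup(eth.dst_mac)
    pkt["egress_port"] = port                 # 2. pkt.egress_port = p
    counters[pkt["ipv4.src_ip"]] += 1         # 3. counter[ipv4.src_ip]++
    return pkt

pkt = process({"eth.dst_mac": "aa:bb", "ipv4.src_ip": "10.0.0.1"})
```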
FlexNIC Hardware Model
- Transform packets for efficient processing in software
- DMA directly into and out of application data structures
- Send acknowledgements on the NIC
- A queue manager implements rate limits
- Improve locality by steering to cores based on app criteria

Pipelines: an Rx pipeline (from the network), a doorbell pipeline (from PCIe doorbells), the queue manager, a Tx pipeline (to the network), and a DMA pipeline (PCIe DMA).
FlexTCP: H/W Accelerated TCP
- The fast path is simple enough for the FlexNIC model; applications directly access the NIC for RX/TX
- Similar interface to TCP as a Service: in-memory queues; a software slow path manages NIC state
- Streamlines NIC processing: ACKs consumed and generated in the NIC reduce PCIe traffic; no descriptor queues with dependent DMA reads
- Evaluation: software FlexNIC emulator
FlexTCP Performance
- Latency: 7.8x better than Linux
- Throughput: 10.7x vs. Linux and 4.1x vs. mTCP in one configuration; 7.2x vs. Linux and 2.2x vs. mTCP in another
Streamlining App RPC Processing
- How do we further reduce CPU overhead? Integrate app logic with the network code; remove the socket abstraction and streamline app code
- But fundamental app bottlenecks remain: data structure synchronization, cache utilization, and copies between app data structures and RPC buffers
- Idea: leverage FlexNIC to streamline the app; FlexNIC is protocol agnostic and can parse the app-level protocol
Example: Key-Value Store
With receive-side scaling, the NIC steers by connection: core = hash(connection) % N. Clients requesting overlapping keys (Client 1: K=3,4; Client 2: K=1,4,7; Client 3: K=1,7,8) land on different cores, so both cores access the shared hash table:
- Lock contention
- Coherence traffic
- Poor cache utilization
Optimizing Reads: Key-based Steering
Instead, steer by key:

Match: IF udp.port == kvs
Action: core = HASH(kvs.key) % 2; DMA hash, kvs TO Cores[core]

Each core now serves a fixed partition of the hash table:
- No locks needed
- No coherence traffic
- Better cache utilization
- 40% higher CPU efficiency
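The contrast between connection-based and key-based steering can be checked with a toy Python sketch, using the client/key sets from the slide. The integer connection IDs and the use of Python's `hash` stand in for the NIC's hash function; they are illustrative assumptions.

```python
# Toy sketch contrasting receive-side scaling (steer by connection) with
# key-based steering (steer by application key). With key steering, all
# requests for a key land on one core, so each core can own a hash-table
# partition without locks.

NCORES = 2

def rss_core(conn_id):
    return hash(conn_id) % NCORES     # steer by connection

def key_core(key):
    return hash(key) % NCORES         # steer by key, as in the Match/Action rule

# (connection id, requested key), matching the slide's clients:
# client 1: K=3,4; client 2: K=1,4,7; client 3: K=1,7,8
requests = [(1, 3), (1, 4), (2, 1), (2, 4), (2, 7), (3, 1), (3, 7), (3, 8)]

def owners(steer):
    per_key = {}
    for conn, key in requests:
        per_key.setdefault(key, set()).add(steer(conn, key))
    return per_key

rss_owners = owners(lambda conn, key: rss_core(conn))
key_owners = owners(lambda conn, key: key_core(key))

rss_shared = [k for k, cores in rss_owners.items() if len(cores) > 1]
key_shared = [k for k, cores in key_owners.items() if len(cores) > 1]
# With RSS, keys 1, 4, and 7 are touched by multiple cores (locks, coherence
# traffic); with key steering, every key has exactly one owning core.
```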
Summary
- Arrakis (OSDI '14): OS architecture that separates the control and data plane, for both networking and storage
- Strata (SOSP '17): file system design for low-latency persistence (NVM) and multi-tier storage (NVM, SSD, HDD)
- TCP as a Service / FlexNIC / Floem (ASPLOS '15, OSDI '18): OS, NIC, and app library support for fast, agile, secure protocol processing
Biography
- College: physics -> psychology -> philosophy; took three CS classes as a senior
- After college: developed an OS for a Z80; after the project shipped, the project got cancelled; applied to grad school, and seven out of eight turned me down
- Grad school: learned a lot; my dissertation had zero commercial impact for decades
- As faculty: I learn a lot from the people I work with, and try to pick topics that matter and where I get to learn
What about the CPU overhead?
12x
Where is hardware going?
The Example of Bitcoin
Bitcoin is a protocol for maintaining a Byzantine fault-tolerant public ledger.
- A cryptographically signed ledger of a sequence of transfers of digital currency, but it could be anything
- Provides an incentive to participate in the fault-tolerant protocol
- Proof of work: solve a cryptographic puzzle, get a reward; hard to monopolize
- Extremely low bandwidth: a few transactions per second; energy cost roughly the same as Ireland's
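The "cryptographic puzzle" can be illustrated with a toy proof-of-work in Python. The difficulty (12 zero bits), the 8-byte little-endian nonce encoding, and single SHA-256 are choices made for the example; Bitcoin itself uses double SHA-256 over a block header at vastly higher difficulty.

```python
# Toy proof-of-work puzzle in the spirit of Bitcoin's: find a nonce so that
# SHA-256(data + nonce) falls below a target, i.e., starts with a given
# number of zero bits. Verification is one hash; solving takes ~2^zero_bits.

import hashlib

def pow_hash(data: bytes, nonce: int) -> int:
    digest = hashlib.sha256(data + nonce.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "big")

def solve(data: bytes, zero_bits: int) -> int:
    target = 1 << (256 - zero_bits)       # hash must be below this value
    nonce = 0
    while pow_hash(data, nonce) >= target:
        nonce += 1                        # brute-force search: the "work"
    return nonce

def verify(data: bytes, nonce: int, zero_bits: int) -> bool:
    return pow_hash(data, nonce) < (1 << (256 - zero_bits))

nonce = solve(b"ledger-entry", 12)        # 12 zero bits: cheap to solve here
```

The asymmetry (expensive to solve, one hash to verify) is what makes the puzzle hard to monopolize and cheap to check.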
Bitcoin Hardware Progression
CPU -> GPU -> FPGA -> low-tech VLSI -> high-tech VLSI
Market price: 1 petahash of SHA-256 = $0.01
Price and Difficulty
FPGAs, GPUs, and >55 nm VLSI are out of business.