Source: homes.cs.washington.edu/~tom/talks/os.pdf
TRANSCRIPT
High Performance Data Center Operating Systems
Tom Anderson, Antoine Kaufmann, Youngjin Kwon, Naveen Kr. Sharma, Arvind Krishnamurthy, Simon Peter, Mothy Roscoe, and Emmett Witchel
An OS for the Data Center
- Server I/O performance matters: key-value stores, web & file servers, databases, mail servers, machine learning, ...
- Can we deliver performance close to hardware?
- Example system: Dell PowerEdge R520
  - Intel X520 10G NIC (2 us / 1 KB packet; a 40G NIC: 500 ns / 1 KB packet)
  - Intel RS3 RAID with 1 GB flash-backed cache (25 us / 1 KB write; NVDIMM: 500 ns / 1 KB write)
  - Sandy Bridge CPU, 6 cores, 2.2 GHz
- Today's I/O devices are fast and getting faster
Can’t we just use Linux?
Linux I/O Performance

Percent of 1 KB Redis request time spent, by layer:
- GET: hardware 18%, kernel 62%, application 20%
- SET: hardware 13%, kernel 84%, application 3%
The kernel data path covers the API, multiplexing, naming, resource limits, access control, I/O scheduling, I/O processing, copying, and protection.
- 10G NIC: 2 us / 1 KB packet at the device, but 9 us through the kernel path
- RAID storage: 25 us / 1 KB write at the device, but 163 us through the kernel path

Kernel mediation is too heavyweight.
Arrakis: Separate the OS control and data plane
How to skip the kernel?

Today the kernel sits between the application (e.g., Redis) and the I/O devices, handling naming, resource limits, access control, the API, multiplexing, I/O scheduling, I/O processing, copying, and protection. Arrakis keeps only the control plane (naming, resource limits, access control) in the kernel and moves the data path out, so the application accesses the I/O devices directly.
Arrakis I/O Architecture
- Control plane: the kernel (naming, resource limits, access control)
- Data plane: the application (e.g., Redis) performs API handling, multiplexing, I/O scheduling, I/O processing, and protection on a direct path to the I/O devices
Design Goals

- Streamline network and storage I/O
  - Eliminate OS mediation in the common case
  - Application-specific customization vs. kernel one-size-fits-all
- Keep OS functionality
  - Process (container) isolation and protection
  - Resource arbitration, enforceable resource limits
  - Global naming, sharing semantics
- POSIX compatibility at the application level
  - Additional performance gains from rewriting the API
This Talk
- Arrakis (OSDI '14): OS architecture that separates the control and data plane, for both networking and storage
- Strata (SOSP '17): file system design for low-latency persistence (NVM) and multi-tier storage (NVM, SSD, HDD)
- TCP as a Service / FlexNIC / Floem (ASPLOS '15, OSDI '18): OS, NIC, and app library support for fast, agile, secure protocol processing
Storage diversification
Better performance at the top, higher capacity at the bottom:

  Device   Latency   Throughput   $/GB
  DRAM     80 ns     200 GB/s     10.8
  NVDIMM   200 ns    20 GB/s      2
  SSD      10 us     2.4 GB/s     0.26
  HDD      10 ms     0.25 GB/s    0.02

NVM (NVDIMM) is byte-addressable: cache-line-granularity I/O, direct access with load/store instructions. SSDs have large erase blocks: hardware GC overhead, and random writes cause a 5-6x slowdown from GC.
Let’s Build a Fast Server
Key-value store, database, file server, mail server, ...
Requirements:
- Small updates dominate
- Data set scales up to many terabytes
- Updates must be crash consistent
A fast server on today’s file system
- Small updates (1 KB) dominate
- Data set scales up to 100 TB
- Updates must be crash consistent

Small, random I/O is slow! Even with an optimized kernel file system (NOVA [FAST '16, SOSP '17]), NVM is so fast that the kernel is the bottleneck: for a 1 KB write, 91% of the I/O latency is kernel code rather than the write to the device.
A fast server on today’s file system
- Small updates (1 KB) dominate
- Data set scales up to 100 TB
- Updates must be crash consistent

Using only NVM is too expensive: about $200K for 100 TB. To save cost, we need a way to use multiple device types: NVM, SSD, HDD.
A fast server on today’s file system
- Small updates (1 KB) dominate
- Data set scales up to 10 TB
- Updates must be crash consistent

Block-level caching across NVM, SSD, and HDD manages data in blocks, but NVM is byte-addressable: a 1 KB write through the block cache takes far longer than a byte-addressed NVM write. For low-cost capacity with high performance, we must leverage multiple device types.
A fast server on today’s file system
- Small updates (1 KB) dominate
- Data set scales up to 10 TB
- Updates must be crash consistent

Applications struggle for crash consistency: Pillai et al. (OSDI 2014) counted multiple crash vulnerabilities each in SQLite, ZooKeeper, HSQLDB, and Git.
Today’s file systems: Limited by old design assumptions
- The kernel mediates every operation, but NVM is so fast that the kernel becomes the bottleneck
- Tied to a single type of device, but low-cost capacity with high performance requires multiple device types (NVM, SSD, HDD)
- Aggressive caching in DRAM, writing to the device only when forced (fsync), leaves applications struggling for crash consistency
Strata: A Cross Media File System
- Performance, especially for small, random I/O: fast user-level device access
- Capacity: leverage NVM, SSD & HDD for low cost; transparent data migration across different media; efficiently handle device I/O properties
- Simplicity: intuitive crash consistency model; in-order, synchronous I/O; no fsync() required
Strata: main design principle
Performance and simplicity: LibFS logs operations to NVM at user level
- Fast user-level access; kernel bypass, but the log is private
- In-order, synchronous I/O; intuitive crash consistency

Capacity: KernelFS digests and migrates data in the kernel
- Applies log operations to shared data; asynchronous digest
- Transparent data migration
- Shared file access; coordinates multi-process accesses
Log operations to NVM at user-level
- Fast writes: directly access fast NVM
- Sequentially append data: cache-line granularity, blind writes
- Crash consistency: on crash, the kernel replays the log

The application's LibFS bypasses the kernel and appends to a private operation log in NVM. Each file operation on data and metadata made by a single system call (creat, rename, ...) becomes one log record.
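The logging scheme above can be sketched as a toy Python model. This is an illustration, not Strata's actual code: the class name `OpLog`, the record format, and the replay rules are all assumptions made for the example.

```python
# Toy sketch of LibFS-style user-level logging: each file operation is
# appended, in order, as one record to a private log, so a crash can be
# recovered by replaying the log. Illustrative, not Strata's API.

class OpLog:
    def __init__(self, capacity=1024):
        self.capacity = capacity      # logs are limited in size; a digest frees space
        self.records = []             # stands in for an append-only NVM region

    def append(self, op, **args):
        if len(self.records) >= self.capacity:
            raise RuntimeError("log full: digest required")
        # In Strata the append is persisted at cache-line granularity before
        # the operation returns, making each operation synchronously durable.
        self.records.append({"op": op, **args})

    def replay(self, state):
        # On crash, the kernel replays the log against the shared state.
        for r in self.records:
            if r["op"] == "create":
                state[r["path"]] = b""
            elif r["op"] == "write":
                state[r["path"]] = state.get(r["path"], b"") + r["data"]
            elif r["op"] == "unlink":
                state.pop(r["path"], None)
        return state

log = OpLog()
log.append("create", path="/db/journal")
log.append("write", path="/db/journal", data=b"hello")
state = log.replay({})
```

Because every record is durable before the call returns, replaying a prefix of the log always yields a state the application actually reached.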
Intuitive crash consistency
Synchronous I/O: when each system call returns,
- Data and metadata are durable
- Updates are in order
- Writes are atomic
- Size is limited (by the log size)

Fast synchronous I/O comes from NVM plus kernel bypass; fsync() is a no-op.
Digest data in kernel

KernelFS applies the application's private operation log to the shared area in NVM:
- Visibility: make the private log visible to other applications
- Data layout: turn the write-optimized log into a read-optimized format
- Large, batched I/O; coalesce the log
Digest optimization: Log coalescing

SQLite and mail servers make crash-consistent updates using write-ahead logging:
1. Create journal file
2. Write data to journal file
3. Write data to database file
4. Delete journal file

Digest eliminates the unneeded work: only "write data to database file" reaches the shared area; the temporary durable writes are removed. Throughput optimization: log coalescing saves I/O while digesting.
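The coalescing idea above can be shown with a toy Python sketch. The `coalesce` function and its cancellation rule (drop everything touching files that are both created and deleted within the log window) are a simplified illustration of the digest, not Strata's implementation.

```python
# Toy sketch of digest-time log coalescing: writes to a file that is created
# and then deleted within the same log window never need to reach the shared
# area. Record format is illustrative.

def coalesce(records):
    created = {r["path"] for r in records if r["op"] == "create"}
    deleted = {r["path"] for r in records if r["op"] == "unlink"}
    temporary = created & deleted          # e.g., a write-ahead journal file
    return [r for r in records if r["path"] not in temporary]

wal = [
    {"op": "create", "path": "journal"},
    {"op": "write",  "path": "journal", "data": b"page1"},
    {"op": "write",  "path": "db",      "data": b"page1"},
    {"op": "unlink", "path": "journal"},
]
digested = coalesce(wal)
# Only the database write survives digestion.
```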
Digest and migrate data in kernel

KernelFS digests each application's private operation log into the NVM shared area, and migrates data to lower layers (SSD shared area, HDD shared area) for low-cost capacity. The structure resembles a log-structured merge (LSM) tree.
Handle device I/O properties

SSDs prefer large sequential I/O. Writing 1 GB sequentially from NVM to SSD, measured SSD throughput (MB/s) falls as SSD utilization rises, with the drop depending on write size (64 MB to 1024 MB measured). Strata uses the NVM layer as a persistent write buffer, avoiding SSD garbage collection overhead.
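A toy Python sketch of the write-buffering idea: small writes are absorbed by a fast persistent layer and flushed to the SSD in large sequential chunks. The class name, the 64 KB chunk size (scaled down from the tens-of-MB sizes on the slide), and the bookkeeping are all illustrative assumptions.

```python
# Toy sketch: use a fast persistent layer (NVM) as a write buffer and flush
# to the "SSD" in large sequential chunks, since SSDs prefer large
# sequential I/O. Sizes and names are illustrative.

FLUSH_CHUNK = 64 * 1024                 # scaled-down flush unit for the example

class TieredWriter:
    def __init__(self):
        self.nvm_buffer = bytearray()   # persistent write buffer (NVM layer)
        self.ssd_writes = []            # record of (offset, size) SSD writes
        self.ssd_offset = 0

    def write(self, data):
        self.nvm_buffer += data         # small random writes absorbed by NVM
        while len(self.nvm_buffer) >= FLUSH_CHUNK:
            self._flush(FLUSH_CHUNK)

    def _flush(self, size):
        # One large sequential SSD write instead of many small random ones,
        # which avoids SSD garbage-collection overhead.
        self.ssd_writes.append((self.ssd_offset, size))
        self.ssd_offset += size
        del self.nvm_buffer[:size]

w = TieredWriter()
for _ in range(200):
    w.write(b"x" * 1024)                # 200 small 1 KB writes
```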
Read: hierarchical search
On a read, Strata searches hierarchically and returns the first hit:
1. Private operation log (LibFS)
2. NVM shared area
3. SSD shared area
4. HDD shared area
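The four-step search order can be sketched in a few lines of Python. The dictionaries standing in for the log and the three shared areas, and their contents, are invented for the example.

```python
# Toy sketch of Strata-style hierarchical read: check the private log first,
# then each shared layer from fastest to slowest. Layers are plain dicts here.

def read(path, op_log, nvm_area, ssd_area, hdd_area):
    for layer in (op_log, nvm_area, ssd_area, hdd_area):   # search order 1..4
        if path in layer:
            return layer[path]
    raise FileNotFoundError(path)

op_log   = {"a": b"newest"}                 # not yet digested
nvm_area = {"a": b"stale", "b": b"warm"}    # older digested version of "a"
ssd_area = {"c": b"cool"}
hdd_area = {"d": b"cold"}
```

Because the private log is searched first, a reader always sees its own most recent, not-yet-digested write.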
Shared file access
- Leases grant access rights to applications [SOSP '89]
- Required for files and directories
- Function like locks, but revocable
- Exclusive writer, shared readers
- On revocation, LibFS digests the leased data
- Leases serialize concurrent updates
Shared file access
Example: concurrent writes to the same file A.
1. Application 1's LibFS requests a write lease for file A and writes file A's data to its private operation log (op log 1).
2. Application 2's LibFS requests a write lease for file A.
3. KernelFS revokes Application 1's lease; Application 1 digests its data to the shared area.
4. Application 2 receives the lease and writes file A's data to its own log (op log 2).
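The revocation protocol above can be sketched as a toy Python lease manager. The class names, the `digest` callback, and the single-threaded structure are assumptions made for the illustration; real revocation involves the kernel and concurrent processes.

```python
# Toy sketch of lease-based coordination: an exclusive write lease is
# revocable, and revocation forces the holder to digest its leased updates
# to the shared area before another writer proceeds.

class App:
    def __init__(self, name):
        self.name = name
        self.digested = []          # paths this app was forced to digest

    def digest(self, path):
        self.digested.append(path)

class LeaseManager:
    def __init__(self):
        self.writer = {}            # path -> app holding the write lease

    def acquire_write(self, app, path):
        holder = self.writer.get(path)
        if holder is not None and holder is not app:
            # Revoke: the current holder digests its private log first,
            # so the new writer sees the digested data in the shared area.
            holder.digest(path)
        self.writer[path] = app

mgr = LeaseManager()
app1, app2 = App("app1"), App("app2")
mgr.acquire_write(app1, "fileA")    # app1 writes fileA in its private log
mgr.acquire_write(app2, "fileA")    # forces app1 to digest, then grants app2
```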
Experimental setup
- 2x Intel Xeon E5-2640 CPU, 64 GB DRAM
- 400 GB NVMe SSD, 1 TB HDD
- Ubuntu 16.04 LTS, Linux kernel 4.8.12
- Emulated NVM: 40 GB of DRAM, performance model [Y. Zhang et al., MSST 2015], latency and throughput throttled in software
- Compare Strata vs. PMFS, NOVA, ext4-DAX (NVM kernel file systems)
  - NOVA: atomic update, in-order synchronous I/O
  - PMFS, ext4-DAX: no atomic write
Latency: LevelDB
LevelDB on NVM: key size 16 B, value size 1 KB, 300,000 objects. Latency in us (lower is better) for synchronous write, random write, and random delete, comparing Strata, PMFS, NOVA, and EXT4-DAX:
- Random write: Strata 25% better than PMFS (fast user-level logging)
- Overwrite: Strata 17% better than PMFS
- Level compaction causes asynchronous digests
Throughput: Varmail
Mail server workload from Filebench:
- Using only NVM; 10,000 files; read/write ratio 1:1; write-ahead logging
- Throughput (ops/s, higher is better): Strata 29% better than NOVA
- Log coalescing eliminates 86% of log records, saving 14 GB of I/O
This Talk
- Arrakis (OSDI '14): OS architecture that separates the control and data plane, for both networking and storage
- Strata (SOSP '17): file system design for low-latency persistence (NVM) and multi-tier storage (NVM, SSD, HDD)
- TCP as a Service / FlexNIC / Floem (ASPLOS '15, OSDI '18): OS, NIC, and app library support for fast, agile, secure protocol processing
Let’s Build a Fast Server
Key-value store, database, mail server, ML, ...
Requirements:
- Mostly small RPCs over TCP
- 40 Gbps network links (100+ Gbps soon)
- Enforceable resource sharing (multi-tenant)
- Agile protocol development: kernel and app
- Tail latency, cost-efficient hardware
Let’s Build a Fast Server
- Small RPCs dominate
- Enforceable resource sharing
- Agile protocol development
- Cost-efficient hardware

Running the application over the Linux network stack on a 40 Gbps NIC, kernel mediation is too slow: overhead is 97%.
Hardware I/O Virtualization

- Direct access to the device at user level
- Multiplexing: SR-IOV provides virtual PCI devices with their own registers, queues, and interrupts
- Protection: the IOMMU restricts DMA to/from app virtual memory; packet filters (e.g., legal source IP header); rate limiters
- The SR-IOV NIC exposes user-level VNICs (VNIC 1, VNIC 2) to the network
- mTCP: 2-3x faster than Linux, but who enforces congestion control?
Remote DMA (RDMA)
Programming model: read/write to a (limited) region of remote server memory.
- The model dates to the '80s (Alfred Spector)
- The HPC community revived it for communication within a rack
- Extended to the data center over Ethernet (RoCE); commercially available 100G NICs
- No CPU involvement on the remote node: fast if the app can use the programming model

Limitations:
- What if you need remote application computation (RPC)?
- The lossless model is performance-fragile
Smart NICs (Cavium, …)
A NIC with an array of low-end CPU cores (Cavium, ...): if we compute on the NIC, maybe we don't need to go to the CPU?
- Applications in high-speed trading
- We've been here before: the "wheel of reinvention"
- Hardware is relatively expensive, and apps often run slower on the NIC than on the CPU (cf. Floem)
Step 1
Build a faster kernel TCP in software, with no change in isolation, resource allocation, or API.
Q: Why is RPC over Linux TCP so slow?
OS - Hardware Interface
- Highly optimized code path; buffer descriptor queues
- No interrupt in the common case; maximize concurrency
- Each core (1-4) has its own Tx/Rx descriptor queue pair between the operating system and the network interface card
OS Transmit Packet Processing
- TCP layer: move from socket buffer to IP queue
  - Lock the socket
  - Apply the congestion/flow control limit
  - Fill in the TCP header, calculate the checksum
  - Copy data
  - Arm the retransmission timeout
- IP layer: firewall, routing, ARP, traffic shaping
- Driver: move from IP queue to NIC queue; allocate and free packet buffers
Sidebar: Tail Latency
On Linux with a 40 Gbps link and 400 outbound TCP flows sending RPCs, with no congestion, what is the minimum rate across all flows?
Kernel and Socket Overhead
A typical server event loop makes multiple synchronous kernel transitions per request:

    events = poll(...)
    for e in events:
        if e.socket != listen_sock:
            receive(e.socket, ...)
            ...
            send(e.socket, ...)

Each transition costs parameter checks and copies, cache pollution, and pipeline stalls.
TCP Acceleration as a Service (TaS)
- TCP as a user-level OS service
  - SR-IOV to dedicated cores
  - Scale the number of cores up or down to match demand
  - Optimized data plane for common-case operations
- The application uses its own dedicated cores
  - Avoids polluting the application-level cache
  - Fewer cores => better performance scaling
- To the application: per-socket tx/rx queues with doorbells, analogous to hardware device tx/rx queues
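The per-socket queue-with-doorbell interface can be sketched as a toy Python ring buffer. The class name, ring size, and single-threaded polling are illustrative assumptions; a real implementation uses shared memory between the app and the fast-path cores.

```python
# Toy sketch of a per-socket tx queue with a doorbell: the application hands
# buffers to the fast path without a system call by writing ring entries and
# bumping a doorbell counter that the consumer polls.

class TxQueue:
    def __init__(self, size=8):
        self.ring = [None] * size
        self.head = 0               # consumer position (fast-path core)
        self.tail = 0               # producer position (application)
        self.doorbell = 0           # producer "rings" by advancing this counter

    def send(self, payload):
        if self.tail - self.head == len(self.ring):
            raise RuntimeError("queue full")
        self.ring[self.tail % len(self.ring)] = payload
        self.tail += 1
        self.doorbell = self.tail   # a plain store, not a kernel transition

    def poll(self):
        # Fast-path core consumes everything announced by the doorbell.
        out = []
        while self.head < self.doorbell:
            out.append(self.ring[self.head % len(self.ring)])
            self.head += 1
        return out

q = TxQueue()
q.send(b"rpc-1")
q.send(b"rpc-2")
sent = q.poll()
```

The same shape works in reverse for the rx queue, which is why the slide calls it analogous to hardware device tx/rx queues.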
Streamline the common-case data path
- Remove unneeded computation from the data path: congestion control and timeouts per RTT, not per packet
- Minimize per-flow TCP state: prefetch 2 cache lines on packet arrival
- Linearized code: better branch prediction, super-scalar execution
- Enforce IP-level access control on the control plane at connection setup
Small RPC Microbenchmark
- Configurations: 1 app core + 1 fast-path core; 4 app cores + 7 fast-path cores; 8 app cores + 11 fast-path cores; up to 4.9x higher throughput
- IX: a fast kernel TCP with syscall batching and a non-socket API
- Latency: Linux is 7.3x TaS; IX is 2.9x TaS
TAS is workload proportional
- Setup: 4 clients starting every 10 seconds, then stopping incrementally
- Measured over time: latency (us) stays low while the number of fast-path cores scales between 1 and 9 to track the offered load
Step 2
TCP as a Service can saturate a 40 Gbps link with small RPCs, but what about 100 Gbps or 400 Gbps links? Network link speeds are scaling up faster than cores.

What NIC hardware do we need for fast data center communication? The TCP-as-a-Service data plane can be efficiently built in hardware.
FlexNIC Design Principles
- RPCs are the common case: kernel bypass to application logic
- Enforceable per-flow resource sharing: data plane in hardware, policy in the kernel
- Agile protocol development: protocol agnostic (e.g., Timely, DCTCP, and RDMA); offload both kernel and app packet handling
- Cost-efficient: minimal instruction set for packet processing
FlexNIC Hardware Model

A programmable parser turns the packet stream into parsed headers (Eth, IPv4, TCP/UDP, RPC) that flow through match-action stages into egress queues:
- TCAM for arbitrary wildcard matches
- SRAM for exact and longest-prefix-match lookups
- Stateful memory for counters
- ALUs for modifying headers and registers

Example match-action program:
1. p = lookup(eth.dst_mac)
2. pkt.egress_port = p
3. counter[ipv4.src_ip]++

Barefoot Networks' RMT switch runs at ~6 Tbps (with parallel streams).
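The three-step match-action program on this slide can be mimicked in a few lines of Python. The table contents, field names as dictionary keys, and packet representation are invented for the illustration; real hardware does this per packet in the pipeline stages listed above.

```python
# Toy sketch of one match-action stage: match a packet field against a
# forwarding table, then run actions that set the egress port and bump a
# stateful counter, mirroring the 3-line program on the slide.

from collections import defaultdict

fwd_table = {"aa:bb": 1, "cc:dd": 2}          # exact-match (SRAM-style) lookup
counters = defaultdict(int)                   # stateful memory

def process(pkt):
    port = fwd_table.get(pkt["eth.dst_mac"])  # 1. p = lookup(eth.dst_mac)
    pkt["egress_port"] = port                 # 2. pkt.egress_port = p
    counters[pkt["ipv4.src_ip"]] += 1         # 3. counter[ipv4.src_ip]++
    return pkt

pkt = process({"eth.dst_mac": "aa:bb", "ipv4.src_ip": "10.0.0.1"})
```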
FlexNIC Hardware Model
- Transform packets for efficient processing in software
- DMA directly into and out of application data structures
- Send acknowledgements on the NIC
- A queue manager implements rate limits
- Improve locality by steering to cores based on app criteria

Pipelines: an Rx pipeline (from the network), a doorbell pipeline (from PCIe doorbells), the queue manager, a Tx pipeline (to the network), and a DMA pipeline (PCIe DMA).
FlexTCP: H/W Accelerated TCP
- The fast path is simple enough for the FlexNIC model; applications directly access the NIC for RX/TX
- Similar interface to TCP as a Service: in-memory queues; a software slow path manages NIC state
- Streamlines NIC processing: ACKs consumed and generated in the NIC reduce PCIe traffic; no descriptor queues with dependent DMA reads
- Evaluation: software FlexNIC emulator
FlexTCP Performance
- Latency: 7.8x better than Linux
- Throughput: 10.7x vs. Linux and 4.1x vs. mTCP in one configuration; 7.2x vs. Linux and 2.2x vs. mTCP in another
Streamlining App RPC Processing
- How do we further reduce CPU overhead? Integrate app logic with the network code; remove the socket abstraction and streamline app code
- But fundamental app bottlenecks remain: data structure synchronization, cache utilization, and copies between app data structures and RPC buffers
- Idea: leverage FlexNIC to streamline the app; FlexNIC is protocol agnostic and can parse the app-level protocol
Example: Key-Value Store
With receive-side scaling, the NIC steers by connection: core = hash(connection) % N. Clients requesting overlapping keys (Client 1: K=3,4; Client 2: K=1,4,7; Client 3: K=1,7,8) land on different cores, so both cores access the shared hash table:
- Lock contention
- Coherence traffic
- Poor cache utilization
Optimizing Reads: Key-based Steering
Instead, steer by key:

Match: IF udp.port == kvs
Action: core = HASH(kvs.key) % 2; DMA hash, kvs TO Cores[core]

Each core now serves a fixed partition of the hash table:
- No locks needed
- No coherence traffic
- Better cache utilization
- 40% higher CPU efficiency
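The contrast between connection-based and key-based steering can be checked with a toy Python sketch, using the client/key sets from the slide. The integer connection IDs and the use of Python's `hash` stand in for the NIC's hash function; they are illustrative assumptions.

```python
# Toy sketch contrasting receive-side scaling (steer by connection) with
# key-based steering (steer by application key). With key steering, all
# requests for a key land on one core, so each core can own a hash-table
# partition without locks.

NCORES = 2

def rss_core(conn_id):
    return hash(conn_id) % NCORES     # steer by connection

def key_core(key):
    return hash(key) % NCORES         # steer by key, as in the Match/Action rule

# (connection id, requested key), matching the slide's clients:
# client 1: K=3,4; client 2: K=1,4,7; client 3: K=1,7,8
requests = [(1, 3), (1, 4), (2, 1), (2, 4), (2, 7), (3, 1), (3, 7), (3, 8)]

def owners(steer):
    per_key = {}
    for conn, key in requests:
        per_key.setdefault(key, set()).add(steer(conn, key))
    return per_key

rss_owners = owners(lambda conn, key: rss_core(conn))
key_owners = owners(lambda conn, key: key_core(key))

rss_shared = [k for k, cores in rss_owners.items() if len(cores) > 1]
key_shared = [k for k, cores in key_owners.items() if len(cores) > 1]
# With RSS, keys 1, 4, and 7 are touched by multiple cores (locks, coherence
# traffic); with key steering, every key has exactly one owning core.
```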
Summary
- Arrakis (OSDI '14): OS architecture that separates the control and data plane, for both networking and storage
- Strata (SOSP '17): file system design for low-latency persistence (NVM) and multi-tier storage (NVM, SSD, HDD)
- TCP as a Service / FlexNIC / Floem (ASPLOS '15, OSDI '18): OS, NIC, and app library support for fast, agile, secure protocol processing
Biography
- College: physics -> psychology -> philosophy; took three CS classes as a senior
- After college: developed an OS for a Z80; after the project shipped, the project got cancelled; applied to grad school, and seven out of eight turned me down
- Grad school: learned a lot; my dissertation had zero commercial impact for decades
- As faculty: I learn a lot from the people I work with, and try to pick topics that matter and where I get to learn
What about the CPU overhead?
12x
Where is hardware going?
The Example of Bitcoin
Bitcoin is a protocol for maintaining a Byzantine fault-tolerant public ledger.
- A cryptographically signed ledger of a sequence of transfers of digital currency, but it could be anything
- Provides an incentive to participate in the fault-tolerant protocol
- Proof of work: solve a cryptographic puzzle, get a reward; hard to monopolize
- Extremely low bandwidth: a few transactions per second; energy cost roughly the same as Ireland's
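The "cryptographic puzzle" can be illustrated with a toy proof-of-work in Python. The difficulty (12 zero bits), the 8-byte little-endian nonce encoding, and single SHA-256 are choices made for the example; Bitcoin itself uses double SHA-256 over a block header at vastly higher difficulty.

```python
# Toy proof-of-work puzzle in the spirit of Bitcoin's: find a nonce so that
# SHA-256(data + nonce) falls below a target, i.e., starts with a given
# number of zero bits. Verification is one hash; solving takes ~2^zero_bits.

import hashlib

def pow_hash(data: bytes, nonce: int) -> int:
    digest = hashlib.sha256(data + nonce.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "big")

def solve(data: bytes, zero_bits: int) -> int:
    target = 1 << (256 - zero_bits)       # hash must be below this value
    nonce = 0
    while pow_hash(data, nonce) >= target:
        nonce += 1                        # brute-force search: the "work"
    return nonce

def verify(data: bytes, nonce: int, zero_bits: int) -> bool:
    return pow_hash(data, nonce) < (1 << (256 - zero_bits))

nonce = solve(b"ledger-entry", 12)        # 12 zero bits: cheap to solve here
```

The asymmetry (expensive to solve, one hash to verify) is what makes the puzzle hard to monopolize and cheap to check.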
Bitcoin Hardware Progression
CPU -> GPU -> FPGA -> low-tech VLSI -> high-tech VLSI
Market price: 1 petahash of SHA-256 = $0.01
Price and Difficulty
FPGAs, GPUs, and >55 nm VLSI are out of business.