consensus services for loosely- coupled distributed...

Post on 12-Sep-2019

6 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

ConsensusServicesforLoosely-coupleddistributedSystems

ZookeeperandChubby

Mo<va<on

•  Distributedsystemsneedaservicethatprovides–  Synchroniza<on(whoistheleader?)– Membership(whoareac<venodesinmyservice?)–  Configura<onmetadata(whatareenvironmentinfo?)

•  Distributedsystemsneedaservicethatis–  reliability–  availability–  easy-to-understandseman<cs–  performance,throughput,latencyonlysecondary

BeforeChubbyCameAbout…

•  Lotsofdistributedsystemswithclientsinthe10,000s

•  Howtodoprimaryelec<on?–  Adhoc(noharmfromduplicatedwork)–  Operatorinterven<on(correctnessessen<al)

•  Unprincipled•  Disorganized•  Costly•  Lowavailability

Whatisthispaperabout?

“BuildingChubbywasanengineeringeffort…itwasnotresearch.Weclaimnonewalgorithmsortechniques.Thepurposeofthispaperistodescribewhatwedidandwhy,ratherthantoadvocateit.”

•  Designbasedonwell-knownideas– distributedconsensus,caching,no<fica<ons,file-systeminterface

LifeBeforeCubby

•  Distributedsystemsdevelopers..–  ImplementRaR(wellactuallyPaxos)•  Applica<onmustbewriUenasastatemachine•  Poten<alperformanceproblems

–  Quorumon5iseasieroverquorumof10Knodes

– Sharedcri<calregions(Exclusivelocks)•  Hardtocode/understand

–  Peoplethinktheycan…buttheycan’t!

ChubbyDesign

DesignDecisions:Mo<va<ngLocks?

•  Lockservicevs.consensus(RaR/Paxos)library

•  Advantages:– Noneedtorewritecode

•  Maintainprogramstructure,communica<onpaUerns•  Cansupportno<fica<onmechanism

–  Smaller#ofnodes(servers)neededtomakeprogress

•  Advisoryinsteadofmandatorylocks(why?):– HoldingalockcalledFneitherisnecessarytoaccessthefileF,norpreventsotherclientsfromdoingso

DesignDecisions:LockTypes•  Coarsevs.fine-grainedlocks–  Fine-grained:grablockbeforeeveryevent–  Coarse-grained:grablockforlargegroupofevents

•  Advantagesofcoarse-grainedlocks–  Lessloadonlockserver–  Lessdelaywhenlockserverfails–  Lesslockserversandavailabilityrequired

•  Advantagesoffine-grainedlocks– Morelockserverload–  Ifneeded,couldbeimplementedonclientside

SystemStructure

•  Chubbycell:asmallnumberofreplicas(e.g.,5)

•  Masterisselectedusingaconsensusprotocol(e.g.,RaR)

SystemStructure

•  Clients–  Sendreads/writesonlytothemaster

–  Communicateswithmasterviaachubbylibrary

•  Everyreplicaserver–  IslistedinDNS–  Directclientstomaster–  Maintaincopiesofasimpledatabase

ReadandWrites

•  Write– Masterpropagateswritetoreplica

–  RepliesaRerthewritereachesamajority(e.g.,quorum)

•  Read– Masterrepliesdirectly,asithasmostuptodatestate

–  Readsmusts<llgotothemaster

ChubbyAPIandLocks

SimpleUNIX-likeFileSystemInterface

•  Barebonefile&directorystructure

•  /ls/foo/wombat/pouch

Lock service; common to all names

Cell name

Name within cell

SimpleUNIX-likeFileSystemInterface

•  Barebonefile&directorystructure

•  /ls/foo/wombat/pouch

•  Doesnotsupport,maintain,orreveal– Movingfiles– Path-dependentpermissionseman<cs– Directorymodified<mes,fileslast-access<mes

Nodes•  Node:afileordirectory– Anynodecanactasanadvisoryreader/writerlock

•  Anodemaybeeitherpermanentorephemeral– Ephemeralusedastemporaryfiles,e.g.,indicateaclientisalive

•  Metadata– ThreenamesofACLs(R/W/changeACLname)•  Authen<ca<onbuildintoROC

– 64-bitfilecontentchecksum

Locks•  Any node can act as lock (shared or exclusive)

•  Advisory (vs. mandatory) – Protect resources at remote services – No value in extra guards by mandatory locks

•  Write permission needed to acquire – Prevents unprivileged reader blocking progress

LocksandSequences•  Poten<allockproblemsindistributedsystems

–  AholdsalockL,issuesrequestW,thenfails–  BacquiresL(becauseAfails),performsac<ons–  Warrives(out-of-order)aRerB’sac<ons

•  Solu<on1:backwardcompa<ble–  Lockserverwillpreventotherclientsfromgehngthelockifalockbecome

inaccessibleortheholderhasfailed–  Lock-delayperiodcanbespecifiedbyclients

LocksandSequences•  Poten<allockproblemsindistributedsystems

–  AholdsalockL,issuesrequestW,thenfails–  BacquiresL(becauseAfails),performsac<ons–  Warrives(out-of-order)aRerB’sac<ons

•  Solu<on2:sequencer– AlockholdercanobtainasequencerfromChubby–  ItaUachesthesequencertoanyrequeststhatitsendstootherservers

– Theotherserverscanverifythesequencerinforma<on

Design:Events•  Clientsubscribeswhencrea<nghandle

•  Deliveredasyncviaup-callfromclientlibrary

•  Eventtypes– Filecontentsmodified– Childnodeadded/removed/modified– Chubbymasterfailedover– Handle/lockhavebecomeinvalid– Lockacquired/conflic<nglockrequest(rarelyused)

Design:API

•  Open() (only call using named node) – how handle will be used (access checks here) – events to subscribe to –  lock-delay – whether new file/dir should be created

•  Close() vs. Poison()

•  Other ops: –  GetContentsAndStat(), SetContents(), Delete(), Acquire(), TryAcquire(), Release(),

GetSequencer(), SetSequencer(), CheckSequencer()

Chubby:ProvidingPerformance

Whyscalingisimportant?•  Clientsconnecttoasingleinstanceofmasterinacell

–  Muchmoreclientprocessesthatnumberofmachines–  Note:MasterandclienthavesameCPU/Memory

•  Exis<ngmechanisms:–  Par$$on:MoreChubbycells(consistenthashing)–  Increaselease$me:from12sto60storeduceKeepAlivemessages(dominantrequestsinexperiments)

–  Clientcaches:reducesreads/keepalivenotwrites–  Addnewtypeofservers:AddProxyservers

Caching•  Client caches file data, node meta-data

–  Write-through held in memory

•  masterkeepslistofwhatclientsmayhavecached

•  Strictconsistency:easytounderstand–  Leasebased–  Masterwillinvalidatecachedcopiesuponawriterequest–  donotwanttoalterpreexis<ngcomm.protocols

•  Handles and locks cached as well –  Event informs client of conflicting lock request

CachingandInvalida<on–  writesblock,mastersendsinvalida<ons–  clientsflushchangeddata,ack.withKeepAlive–  datauncachableun<linvalida<onacked

•  allowsreadstohappenwithoutdelay

NewMechanisms

•  Proxies:– HandleKeepAliveandreadrequests,passwriterequeststothemaster

– Reducetrafficbutreduceavailability

•  Par<<oning:par<<onnamespace– Amasterhandlesnodeswithhash(name)modN==id

– Limitedcross-par<<onmessages

Scaling:Proxies•  Proxiespassrequestsfromclientstocell

–  Alayerofmiddlemanagementbetweencellandclients

•  CanhandleKeepAlivesandreadsNOTWRITES–  Notwrites,buttheyare<<1%ofworkload

•  KeepAlivetrafficbyfarmostdominant

•  Disadvantages:–  addi<onalRPCforwrites/first<mereads–  increasedunavailabilityprobability–  fail-overstrategynotideal(willcomebacktothis)

Scaling:Par<<oning•  Namespacepar<<onedbetweenservers– Npar<<ons,eachwithmasterandreplicas

•  NodeD/CstoredonP(D/C)=hash(D)modN– meta-dataforDmaybeondifferentpar<<on

•  LiUlecross-par<<oncomm.desirable–  permissionchecks–  directorydele<on–  cachinghelpsmi<gatethis

Chubby:ProvidingAvailability

SessionsandKeep-Alives•  Aclientsendskeep-aliverequeststoamaster

•  Amasterrespondsbyakeep-aliveresponse

•  ImmediatelyaRergehngthekeep-aliveresponse,theclientsendsanotherrequestforextension

•  Themasterwillblockkeep-alivesun<lclosetheexpira<onofasession

•  Extensionisdefaultto12s

Whenthingsfail…

•  Failure==missingkeepalivemsgs

•  Server– Deleteephemeralfilesw/oopenhandlesaReraninterval

– Deleteassociatedcache

•  Client:discardinmemorystate– Sessions,handles,locks,cacheddata…

LocksandAsynchFailures•  Asynchfailure:delayed,orderedorlostmessages

•  Twosolu<ons–  Sequence#s:getLockandaSequence-ID(seq-#)

•  SubmitrequestswithLock-Seq-ID•  LeadercheckersLock-seq-IDtomakesurerequestiscurrent•  Lock-seq-IDisincrementedwhenanewlockisacquired

–  LockDelay:delaygivingoutnewlocksaRerthelockholderdies•  Outstandingrequestsshouldarrivewithinthat<meperiod

Chubby:Prac<cal,warstories

UseandObserva<ons•  Manyfilesfornaming

•  Config,ACL,meta-datacommon

•  10clientsuseeachcachedfile,onavg.

•  Fewlocksheld,nosharedlocks

•  KeepAlivesdominateRPCtraffic

Use:Outages•  Sampleofcells–  61outagesoverfewweeks(700cell-days)–  duetonetworkconges<on,maintenance,overload,errorsinsoRware,hardware,operators

•  52outagesunder30s –  applica<onsnotsignificantlyaffected

•  Fewdozencell-yearsofopera<on–  dataloston6occasions(bugs&operatorerror)

Today•  Google’sChubby– Mo<va<on– Designchoices–  Scaling/Performance– Availability

•  Yahoo’sZooKeeper(NowApache’sZookeeper)– DifferenceswithChubby

•  Summary

Zookeeper=Chubbywithoutlocks

•  Filesystem-basedAPI– Similartypes:Ephemeral,persistent,sequen<al– DifferentAPIcalls

•  Performance– Cachingandwatches(ZK’sinvalida<on)•  Similarlyoneshot!

•  Availability– Keepalive,leaderelec<on

FileTypesChubby ZooKeeper

Filetypes Ephemeral Yes Yes

Permanent Yes Yes

Sequen<al NO Yes

FileAPI Hierarchical Yes Yes

Predefinedpath Yes NO/

services

users

apps

locks

servers

YaView

s-1

morestupidity

stupidname

EphemeralscreatedbySessionX

Sequenceappendedoncreate

•  Ephemeral:theznodewillbedeletedwhenthesessionthatcreatedit<mesoutoritisexplicitlydeleted

•  Permanent:explicitlydeletedbyclient

•  Sequence:thepathnamewillhaveamonotonicallyincreasingcounterrela<vetotheparentappended

FilesystemAPI

/ls/foo/wombat/pouch

Lock service; common to all names

Cell name

Name within cell

/

services

users

apps

locks

servers

YaView

read-1

morestupidity

stupidname

Read/WriteInterac<ons•  Writes

–  Allgothroughleader–  Requiresquorum

•  Reads–  ZK:gotoanynode

•  Higherperformance–  Chubby:gotoleader

•  Performancelimita<on–  Both:Clientscacheandserverinvalidatescache

•  Chubby:invalida<onisnon-op<onal•  ZK:mustexplicitlyregisterforinvalida<onrequests

top related