passive realtimedatacenter fault detection and localization · load balanced traffic simplifies...

Post on 15-Jul-2020

2 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Passiverealtime datacenterfaultdetectionandlocalization

ArjunRoy,JamesHongyi Zeng*,JasmeetBagga*,andAlexC.SnoerenUniversityofCalifornia,SanDiegoFacebook*

1

“Itwouldbeniceifwecouldfigureoutwhichlinkwascausingtheseretransmits.”

- Ranjeeth Dasineni,Facebook(paraphrased)

2

Contemporarydatacenternetwork

However:faultsmaybepartial/intermittent.3

Partialfaults:Afewexamples

• Netpilot (Sigcomm 2011):Framecheckerror,unequalECMPhashing,etc.Wu,Xin,etal."Netpilot:automatingdatacenternetworkfailuremitigation." ACMSIGCOMMComputerCommunicationReview 42.4(2012):419-430.

• Everflow (Sigcomm 2015):TCAMbiterrors,silentpacketdrops.Zhu,Yibo,etal."Packet-LevelTelemetryinLargeDatacenterNetworks.”SIGCOMM,2015.

• Pingmesh (Sigcomm 2015):“fiberFCS…errors,switchingASICdefects,switchfabricflaw,switchsoftwarebug,NICconfigurationissue,networkcongestions,etc.Wehaveseenallthesetypesofissuesinourproductionnetworks.”

Guo,Chuanxiong,etal."Pingmesh:ALarge-ScaleSystemforDataCenterNetworkLatencyMeasurementandAnalysis.” SIGCOMM,2015.

4

Vastbodyofpriorwork(justasmallsample…)• Applicationinstrumentation:variousproductionsystems

• Activeprobing:Pingmesh (SIGCOMM’15),NetNorad (Facebook),ATPG(CoNEXT ‘12),Everflow (SIGCOMM‘15)

• Machinelearning:NetPoirot (SIGCOMM’16)

• Graphalgorithms:Gestalt(Usenix ATC‘14),SCORE(NSDI‘05)

• Pathtracing: Everflow (SIGCOMM‘15),NetNorad (Facebook),NetSight (NSDI‘14),TinyPacketPrograms(SIGCOMM‘14)

• Networkinstrumentation:FlowRadar (NSDI’16),Planck(SIGCOMM‘14),NetPilot (SIGCOMM‘11)

5

Weexploit:highlyregularloadbalancedtraffic

Sourceracktrafficmagnitude

Destinationracktrafficmagnitude

6

ArjunRoy,Hongyi Zeng,JasmeetBagga,GeorgePorter,andAlexC.Snoeren.InsidetheSocialNetwork's(Datacenter)Network. ACMSIGCOMM'15,London,England.

Loadbalancedtrafficsimplifiesfaulthandling

• Evenlyloadedpathsmeansperpathperformanceissimilarifnoerrors.• Networkfaultsleadtooutlierpaths.• Ifflownetworkpathknown,cancorrelateflowperformancewithpath.

• Approachallowsustofindandlocalizefaults:• Inanapplicationagnosticmanner• Incurringnoadditionalprobingoverhead• Morerapidlythanpriorpublishedworks

7

Facebookdatacentertopology

8

AlexeyAndreyev.Introducingdatacenterfabric,thenext-generationFacebookdatacenternetwork.https://code.facebook.com/posts/360346274145943/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/

FindingpathinformationatFacebook

ToR ToRCoreCore

Core

CoreCore

Core

CoreCore

Core

CoreCore

Core

Sourcehost

DestinationhostAgg

Agg

Agg

Agg Agg

Agg

Agg

Agg9

FindingpathinformationatFacebook

ToR ToRCoreCore

Core

CoreCore

Core

CoreCore

Core

CoreCore

Core

Sourcehost

DestinationhostAgg

Agg

Agg

Agg Agg

Agg

Agg

Agg10

FindingpathinformationatFacebook

ToR ToRCoreCore

Core

CoreCore

Core

CoreCore

Core

CoreCore

Core

Sourcehost

DestinationhostAgg

Agg

Agg

Agg Agg

Agg

Agg

Agg11

FindingpathinformationatFacebook

ToR ToRCoreCore

Core

CoreCore

Core

CoreCore

Core

CoreCore

Core

Sourcehost

DestinationhostAgg

Agg

Agg

Agg Agg

Agg

Agg

Agg12

FindingpathinformationatFacebook

ToR ToRCoreCore

Core

CoreCore

Core

CoreCore

Core

CoreCore

Core

Sourcehost

DestinationhostAgg

Agg

Agg

Agg Agg

Agg

Agg

Agg13

FindingpathinformationatFacebook

ToR ToRCoreCore

Core

CoreCore

Core

CoreCore

Core

CoreCore

Core

Sourcehost

DestinationhostAgg

Agg

Agg

Agg Agg

Agg

Agg

Agg14

FindingpathinformationatFacebook

Core

Core

Core

Agg

Agg

AggToR ToR

Agg

Agg

Core

Core

Sourcehost

Destinationhost

Agg

Agg

Agg

15

FindingpathinformationatFacebook

Core

Core

Core

Agg

Agg

AggToR ToR

Agg

Agg

Core

Core

Sourcehost

Destinationhost

Agg

Agg

Agg

16

FindingpathinformationatFacebook

Core

Core

Core

Agg

Agg

AggToR ToR

Agg

Agg

Core

Core

Sourcehost

Destinationhost

Agg

Agg

Solution:aggregationswitchmarkspacketsbasedoncoredownlinktraversed.

Agg

17

Howdoweusepathinformation?

• Inprinciple:cancompareflowperformancebypath.1. Combinatorialdisaster:O(10,000)pathsfromsinglehosttoremoteracks.2. Nolocalization:doesn’ttelluswhichlink/switchisatfault.

• But:forthistrafficpattern,ECMProutinggivesusevenbytes/link.

• Solution:Justcomparelinks!

Create“EquivalenceSets”:setsoflinkshandlingsimilarload

andexhibitingsimilarperformance,intheabsenceoffaults

18

Equivalencesets:1. Reducesnumberofcomparisonsneeded.

2. Pinpointsfaulttospecificlocation.

EquivalencesetsinFacebooktopologyCoreCoreCore

CoreCoreCore

CoreCoreCore

Sourcehost

Agg

Agg

Agg

ToRCoreCoreCoreAgg

Equivalenceset:4uplinksfromeachToR

topodAgg layer

…eachhasclosetoidenticalperformancedistribution

inabsenceoferrors19

CoreCoreCore

CoreCoreCore

CoreCoreCore

Sourcehost

Agg

Agg

Agg

ToRCoreCoreCoreAgg

…eachhasclosetoidenticalperformancedistribution

inabsenceoferrors

Equivalenceset:NuplinksfrompodAgg layertocorelayer

EquivalencesetsinFacebooktopology

20

Outlieranalysiswithapplicationagnosticmetrics

Hostsalreadytrackmetricsforcongestioncontrolorperformancemonitoring:

TCPCongestionwindow:Affectedbypacketloss.TCPRetransmits:Affectedbypacketloss.SmoothedRoundtriptime:Affectedbylatencyspikes.Systemcalllatency: Affectedbypacketloss.

Caveat:Canbedifficulttodetermineifanaffectisduetoafaultylink,overloadedhosts,applicationvariance,etc.

Withequivalencesetbasedgrouping,wecancomparedistributionsbylink.

Onlylinkfaultscausevariancebetweenlinks.

21

DemonstratingequivalencesetsfromAgg toToR

(1)ToR markspacketDSCP

perinboundlink

(2)HostaggregatesTCPmetricsbylink(3b)Host drops0.5%ofpacketstraversinglink

(3a)Wesimulateerroronthislink:

22

Host ToRAgg 2

Agg 3

Agg 4

Agg 1

TCPCongestionwindowinAgg toToR equivalenceset

Cacheserver 23

Congestionwindowsignalisapplicationagnostic

Cacheserver Webserver24

Weuse:TCPretransmitsinourwork

Cacheserver Webserver25

Detectingfaultsinproduction

• Monitoredtrafficthroughpodaggregationswitch.1. Nofaultsinjected.2. CollectedTCPmetricdataon30webserverhosts.3. Equivalenceset:fourlinecards connectingtocorelayer

(eachlinecard hasequalshareofuplinks).

• OnJanuary25th,asinglelinecard hadasoftwarefault.1. Linecard controllersoftwarehung.2. BGProutestimedout,productiontrafficthroughlinecard routedaway.3. Afewminuteslater,NetNORAD flaggedunresponsivelinecard.

26

Faultvisibletoourapproachin30seconds

27

Classifyingfaultylinks

• “Doesthislinkhavemoreretransmitsperflowthantheotherlinks?”

• “Dotwodistributionshavethesamemean,orisonegreater?”

28

Classifier:compareeachlinktootherlinkswithonesampleStudent’sT-Test.

OnlinefaultmonitoringwithT-Testalone

• Inprinciple:cansetupasystemthatusesendhostT-Testresulttotelluswhichnetworklinksarefaulty.

• However:byitselfthisissusceptibletoFalsePositives.

• Can’taffordfalsepositivesinnetworkwithO(10,000)links!

29

Accountingforfalsepositives

• However,twocharacteristicsaidus:1. Per-hostfalsepositivesevenlydistributedperlinkovertime.2. Datacenterhasaplethoraofhostsforwhichthisistrue.

• Thus,we’renottryingtoseeif agivenlinkismarkedfaultybyhosts.

• Instead,weonceagainperformoutlieranalysis.1. “Areallthelinksbeingmarkedfaultybyhostsatsimilarrates?”2. “Arehostsflaggingaparticularsubsetoflinksasfaultyathigherrates?”

30

Chi-squaredtest:determinesifanylinksareoutliers.

P-Value≈ 1:“Yes,allthelinksbeingmarkedfaultybyhostsatsimilarrates.”

P-Value≈ 0: “No,asubsethasacomparativelyhighpercentageofhostsclaimingfault.”

Evaluationinthedatacenter

• Smalldetectionsurface;didnotdetectany‘organic’partialfaults.

• Approach:inject‘simulated’faultstoevaluateapproach.

• Inducedavarietyoffaultscenariostochallengeoursystem.

31

Evaluationinthedatacenter:faultscenarios

• Minisculefaults:faultscanhaveverylowdroprates.

• Concurrentfaults:multiplefaultscanoccursimultaneously.

• Maskedfaults:largerfaultcanmaskeffectofminisculefault.

• Correlatedfaults:hardwarefaultcanimpactmultiplenearbylinks,confoundingoutlieranalysis.

32

Evaluationinthedatacenter:faultscenarios

• Minisculefaults:faultscanhavevery lowdroprates.

• Concurrentfaults:multiplefaultscanoccursimultaneously.

• Maskedfaults:largerfaultcanmaskeffectofminisculefault.

• Correlatedfaults:hardwarefaultcanimpactmultiplenearbylinks,confoundingoutlieranalysis.

33

CoreCoreCore

CoreCoreCore

CoreCoreCore

HostHostHost

HostHostHost

HostHostHost

Agg

Agg

Agg

ToR

ToR

ToR

Findingminisculefaults:experimentsetup

Core1

Core2

CoreN

Agg

…Core3

34

CoreCoreCore

CoreCoreCore

CoreCoreCore

HostHostHost

HostHostHost

HostHostHost

Agg

Agg

Agg

ToR

ToR

ToR

Findingminisculefaults:experimentsetup

Core1

Core2

CoreN

Agg

…Core3

35

CoreCoreCore

CoreCoreCore

CoreCoreCore

HostHostHost

HostHostHost

HostHostHost

Agg

Agg

Agg

ToR

ToR

ToR

Findingminisculefaults:experimentsetup

Core1

Core2

CoreN

Agg

…Core3

Equivalenceset:NuplinksfrompodAgg layertocorelayer

36

CoreCoreCore

CoreCoreCore

CoreCoreCore

HostHostHost

HostHostHost

HostHostHost

Agg

Agg

Agg

ToR

ToR

ToR

Findingminisculefaults:experimentsetup

Core1

Core2

CoreN

Agg

…Core3

Partialfaultinducedonsingle

CoretoAggdownlink.

37

Faultdetectionratevsdroprate

38

Minisculefaults:choosingbetweendetectionspeedandsensitivity

39

Minisculefaults:choosingbetweendetectionspeedandsensitivity

40

Minisculefaults:choosingbetweendetectionspeedandsensitivity

41

Minisculefaults:choosingbetweendetectionspeedandsensitivity

42

Minisculefaults:choosingbetweendetectionspeedandsensitivity

43

Minisculefaults:choosingbetweendetectionspeedandsensitivity

44

“Itwouldbeniceifwecouldfigureoutwhichlinkwascausingtheseretransmits.”

Ranjeeth Dasineni,Facebook(paraphrased)

45

46

InterpretingtheT-Test

1. T-Statistic:“Doesthislinkhavemoreorlessretransmitsthanaverage?”

• Positive T-statisticmeanslargerthanaverage.• Negative T-statisticmeanssmallerthanaverage.

2. P-Value:“Isthedifferenceinmeanbigenoughtoconcernus?”

• Closeto0meansthislinkcouldbeanoutlier.• Closeto1meanswearenotconcerned.

47

InterpretingtheT-Test

P-value0,t-stat>0

P-value1,t-stat≈0

48

top related