
Copyright © 2016 Splunk Inc.

Alex Cruise, Sr. Dev. Manager/Architect, Splunk
Fred Zhang, Sr. Data Scientist, Splunk

Machine Learning and Anomaly Detection in Splunk IT Service Intelligence

Disclaimer

During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.

Agenda

Introductions / History
Axioms – Problem Domain
Axioms – Solution Domain
Time Series Feature Engineering
Spatial vs. Temporal Analysis
Other Approaches
MAD Service Engineering
ITSI Context

Introductions / History

Key team members
– Shang
– Mihai
– Jacob
– Iman
– Touf

Presenters
– Fred – Data scientist
– Alex – Architect / Dev Manager

Axioms – Problem Domain

[Figure: the universe of data, zooming in ("ENHANCE!") on the small subset that is time-series data]

Detecting anomalies in this narrow subset of the universe of data:

Time series
– Numeric variables that change over time

Regular time series
– The new values arrive on a regular interval (e.g. every five seconds)

Dense, regular time series
– New values are fairly likely to arrive and not be null

[Figure: value x against increasing time, with a regular interval between points and few gaps/nulls/NaNs]
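One way to check whether a series meets the "dense, regular" criteria is sketched below. This is an illustration only: the function name `describe_series` and the 10% `tolerance` on the interval are assumptions, not anything defined in the talk.

```python
import math
from statistics import median


def describe_series(timestamps, values, tolerance=0.1):
    """Summarise how regular and how dense a time series is.

    timestamps: sorted epoch seconds; values: floats or None, same length.
    tolerance: allowed relative deviation from the median interval (assumed).
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    interval = median(gaps)
    # Fraction of arrivals that land close to the typical interval.
    regular = sum(abs(g - interval) <= tolerance * interval for g in gaps) / len(gaps)
    # Fraction of values that are neither None nor NaN.
    present = sum(v is not None and not math.isnan(v) for v in values) / len(values)
    return {"interval": interval, "regular_fraction": regular, "dense_fraction": present}
```

A series with `regular_fraction` and `dense_fraction` both close to 1 fits the problem domain described above; anything much lower probably needs feature engineering first.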

Axioms – Solution Domain

Unsupervised
– No labelled anomalies
– What's normal is learned from observing the data itself, not defined by an expert

Non-Parametric
– We make no assumptions about the probability distribution of the values (e.g. Gaussian or stationary)

Robust
– Outliers are detected as anomalies, but don't cause distortions in our expectations

Streaming
– No separate training/test periods
– Anomalies are detected and reported in (near-)real time

Adaptive
– No static thresholds, discover normal behaviour patterns automatically
– Adapt to behavioural changes without end-user feedback
– What was normal last week might be worrisome today

Domain-agnostic
– Purely numeric
– No information about underlying subjects or causes of the behaviour stream

[Examples shown: memory/CPU usage; unicorns per second]
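To make the robust, streaming and adaptive properties concrete, here is a minimal sketch of a detector built on a rolling median and median absolute deviation. It is not the algorithm shipped in ITSI, which the talk describes as proprietary; the class name, window size, threshold and warm-up length are all assumptions chosen for illustration. (The median absolute deviation is a standard statistic and is unrelated to the MAD acronym used later in the deck.)

```python
from collections import deque
from statistics import median


class RollingMedianDetector:
    """Illustrative detector: flags points far from the recent median."""

    def __init__(self, window=50, threshold=5.0, warmup=10):
        self.history = deque(maxlen=window)  # sliding window: old behaviour ages out
        self.threshold = threshold           # assumed multiple of the spread
        self.warmup = warmup                 # points needed before judging

    def update(self, x):
        """Consume one value; return True if it looks anomalous."""
        anomalous = False
        if len(self.history) >= self.warmup:
            centre = median(self.history)
            spread = median(abs(v - centre) for v in self.history)
            anomalous = abs(x - centre) > self.threshold * max(spread, 1e-9)
        # The value is kept either way; a single outlier barely moves a median.
        self.history.append(x)
        return anomalous
```

Each value is scored as it arrives, so there is no separate training phase, and because the centre and spread are medians, one extreme value is reported without shifting what counts as normal for the next point.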

Time Series Feature Engineering: Getting Data In

If you already have dense, regular, numeric time series (aka "metrics" or "KPIs") you're good to go
If you have something else, now you have a time series feature engineering problem
There are inescapable tradeoffs between density and precision
Increased precision implies sparser time series
– Also increased memory and bandwidth usage!

TSFE requires dealing with Time, Space and Values

Time
– How frequently do new values arrive?
– How regularly do new values arrive?
– How precisely do we want to be able to record the time when the measurement was taken?
  – Finer time resolution increases sparsity: the probability that any event occurred during a particular time window is decreased
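The link between time resolution and sparsity can be seen by bucketing the same events at different widths. The sketch below is illustrative; `empty_bucket_fraction` is a made-up helper, and it assumes the range divides evenly into buckets.

```python
def empty_bucket_fraction(event_times, start, end, bucket_seconds):
    """Fraction of fixed-width buckets in [start, end) that contain no events."""
    n_buckets = (end - start) // bucket_seconds
    occupied = {
        int((t - start) // bucket_seconds)
        for t in event_times
        if start <= t < end
    }
    return 1 - len(occupied) / n_buckets
```

For the same six events over five minutes, 60-second buckets might leave one bucket in five empty, while 10-second buckets leave most of them empty: the finer series is more precise but much sparser.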

Space: how precisely do we want to be able to relate time series back to the underlying event stream?
– How many dimensions? e.g. IP address, geo. coordinates, MIME type, HTTP response code
  – Adding dimensions increases precision, but also magnifies the likelihood of sparsity
– Within a dimension, how precise do we need to be?
  – Full IP address or /24? Distinguish 400, 401, 403, 404 or just 4xx?
  – Country, state/province, city, neighbourhood, building, …?
  – Extra precision increases the likelihood of sparsity
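The same tradeoff shows up when choosing how finely to split events into separate series. The following sketch assumes events are `(ip, status)` pairs; the function name and parameters are hypothetical.

```python
from collections import Counter


def series_keys(events, ip_prefix_octets=4, status_class_only=False):
    """Count events per (ip, status) series key at the chosen precision.

    ip_prefix_octets=3 groups addresses by /24; status_class_only=True
    collapses 400, 401, 403, 404 into "4xx".
    """
    counts = Counter()
    for ip, status in events:
        ip_key = ".".join(ip.split(".")[:ip_prefix_octets])
        status_key = f"{status // 100}xx" if status_class_only else str(status)
        counts[(ip_key, status_key)] += 1
    return counts
```

At full precision every event may end up in its own series with a count of one; grouping by /24 and status class yields fewer, denser series at the cost of detail.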

Values
– How do we generate a number?
  – Get a numeric field as-is (i.e. a "gauge")
  – Increment a counter
– How do we aggregate multiple values?
  – Min, max, mean, etc.
– How should we handle missing values?
  – "Replace null with zero" only makes sense for something we know is a counter
  – "Take the previous value" might make sense
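The two missing-value strategies named above can be sketched as follows. The function and the `"counter"` / `"gauge"` labels are illustrative assumptions; the point is only that the right fill depends on what kind of number the series holds.

```python
def fill_missing(values, kind):
    """Fill None entries: zero for a counter, previous value for a gauge."""
    if kind not in ("counter", "gauge"):
        raise ValueError("kind must be 'counter' or 'gauge'")
    filled = []
    last_seen = None
    for v in values:
        if v is None:
            # No events in a counter bucket means a count of zero;
            # a gauge most plausibly still holds its last reading.
            v = 0 if kind == "counter" else last_seen
        else:
            last_seen = v
        filled.append(v)
    return filled
```

Note that a gauge with no earlier reading stays `None`, since there is no previous value to carry forward.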

Metric Anomaly Detection Algorithms

Proprietary! Not open source or off-the-shelf.
Spatial and temporal algorithms
– What do we mean by "spatial" and "temporal"?
– Completely orthogonal, irreducible distinction
  – One cannot substitute for the other
  – Neither is always applicable to every time series

Temporal Analysis (aka "Trending" algorithm)

Analyze one time series at a time (embarrassingly parallel)
Alerting when present behaviour is surprising compared to past behaviour

[Figure: value x against increasing time, comparing "now" with the past]

Trending Algorithm Constraints

Good results only when there is a history of recurring patterns in the underlying event stream
– Not necessarily periodic, just recurring

How much history?
– Preliminary (usually bad) results after ~2000 points
  – e.g. 1.5 days at 1-minute resolution
– Great results after a "full period" has been observed (e.g. 7 days)
– More is better! (modulo memory, storage…)
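The history figures above translate between points and wall-clock time with simple arithmetic. The helpers below are hypothetical and only restate the slide's numbers for other resolutions.

```python
SECONDS_PER_DAY = 86400


def history_needed(points, resolution_seconds):
    """Days of data needed to collect `points` samples at a fixed resolution."""
    return points * resolution_seconds / SECONDS_PER_DAY


def points_in(days, resolution_seconds):
    """Number of samples collected over `days` at a fixed resolution."""
    return int(days * SECONDS_PER_DAY // resolution_seconds)
```

At 1-minute resolution, 2000 points take about 1.4 days to collect, in line with the rough 1.5 days quoted, and a 7-day full period is 10080 points. At 5-minute resolution the same 2000 points would take roughly a week.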

Spatial ("Cohesive") Algorithm

Compare present behaviour of multiple metrics

[Figure: several series x against increasing time, compared with each other at "now"]

Cohesive Algorithm Constraints

Given a set* of time series that are expected† to behave similarly‡, detect when one or more of them departs from their peers

* set: >= 3 members
† expected: by a human analyst or interesting ML process
‡ similarly: roughly the same shape; scale and magnitude invariant
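The idea of comparing peers in a scale-invariant way can be sketched as below. This is not the Cohesive algorithm itself, which is proprietary; it is a simple stand-in that rescales each series by its own median and spread, then compares the latest values. All names and the threshold are assumptions.

```python
from statistics import median


def normalise(series):
    """Rescale a series by its own median and median absolute deviation."""
    centre = median(series)
    spread = median(abs(v - centre) for v in series) or 1.0  # avoid divide-by-zero
    return [(v - centre) / spread for v in series]


def departing_members(group, threshold=3.0):
    """Return names of series whose latest value departs from their peers.

    group: dict of name -> list of recent values (at least 3 members).
    """
    if len(group) < 3:
        raise ValueError("need at least 3 series to compare against peers")
    latest = {name: normalise(values)[-1] for name, values in group.items()}
    departed = []
    for name, value in latest.items():
        peers = [v for other, v in latest.items() if other != name]
        if abs(value - median(peers)) > threshold:
            departed.append(name)
    return departed
```

Because each series is normalised against itself, a server handling ten times the traffic of its peers is not flagged as long as its shape matches theirs.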

Cohesive Algorithm Characteristics

No periodicity required
History improves scale/magnitude invariance
Performance relies on similarity within group
– What if the group isn't inherently cohesive?
  – Lots of alerts early on
  – Then, the algorithm adapts to the chaos
  – If the group returns to cohesion, the algorithm will automatically adapt to the "new normal".

Cohesive Algorithm: Example Use Case #1

A cluster of servers performing a similar role for the same application, behind the same load balancer
Assuming the load balancer is operating nominally, many server metrics should be roughly correlated, e.g.:
– CPU usage (user, system, idle)
– Disk usage (reads, writes, IOPS)
– Network usage (bandwidth, # active sockets)
– Application-specific metrics (requests handled per second, 500 errors, authentication failures, active sessions)

Cohesive Algorithm: Example Use Case #2

Imagine some wind turbines on the same hill
We can't predict wind direction and speed very well (yet?)
But we expect every turbine should be roughly cohesive in several metrics:
– rotation speed
– power generation rate
– vibration
– direction
  – *actually, because this is a periodic metric (359° ≈ 1°), we don't support it well right now

If any metric for any turbine differs significantly from its peers, we should be notified, and maybe send a team to investigate
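The direction caveat comes from treating a compass bearing as an ordinary number: 359° and 1° are numerically 358 apart but physically 2° apart. A small sketch of the wrap-around distance, offered only to illustrate the problem and not as a description of how the product handles it:

```python
def angular_difference(a_degrees, b_degrees):
    """Smallest absolute difference between two compass directions, in degrees."""
    diff = abs(a_degrees - b_degrees) % 360
    return min(diff, 360 - diff)
```

Two turbines pointing at 359° and 1° are 2° apart under this measure, whereas a plain subtraction would make them look like extreme outliers relative to each other.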

Other approaches we have tried

3-sigma
Kolmogorov-Smirnov test over sliding windows
Time-series forecasting methods
– Holt-Winters (previous version of ITSI AD is based on its non-parametric version)
– ARIMA, etc.
One-class SVM
Clustering methods
– DBSCAN, K-means, etc.
Various R, Python packages
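As an example of one of the approaches listed, the sketch below computes a two-sample Kolmogorov-Smirnov statistic between adjacent sliding windows using only the standard library. How the presenters applied the test is not stated in the slides; the window layout and function names here are assumptions.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def sliding_ks(series, window):
    """KS statistic between each window and the window immediately before it."""
    return [
        ks_statistic(series[i - 2 * window:i - window], series[i - window:i])
        for i in range(2 * window, len(series) + 1)
    ]
```

A value near 0 means the recent window is distributed like the one before it; a value near 1 means the two barely overlap, which suggests a change in behaviour.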

MAD Service Engineering

MAD = "Metric Anomaly Detection" (the first builds of this slide read "Metafor Anomaly Detection")

Written in Scala
– using Akka for concurrency

Uses Search Command Protocol v2, the Chunked External Command feature (available since Splunk 6.3)
– Runs forever, doesn't get restarted every 50k events
– Receives data soon after it arrives at an indexer, no polling

Fast!
Designed for general-purpose use, no coupling to ITSI runtime
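A long-running chunked command reads framed messages from standard input and writes framed replies to standard output, instead of being relaunched per batch. The sketch below shows the general shape of such a read/write pair in Python (MAD itself is written in Scala). It assumes the framing is a `chunked 1.0,<metadata length>,<body length>` header line followed by a JSON metadata block and a CSV body; treat that as an assumption and check the Splunk documentation before relying on it.

```python
import io
import json
import re

# Assumed header format: "chunked 1.0,<metadata bytes>,<body bytes>\n"
HEADER = re.compile(rb"chunked 1\.0,(\d+),(\d+)\n")


def read_chunk(stream):
    """Read one chunk from a binary stream; return (metadata, body) or None at EOF."""
    header = stream.readline()
    if not header:
        return None
    match = HEADER.fullmatch(header)
    if match is None:
        raise ValueError(f"unexpected chunk header: {header!r}")
    metadata_length, body_length = int(match.group(1)), int(match.group(2))
    metadata = json.loads(stream.read(metadata_length)) if metadata_length else {}
    body = stream.read(body_length)
    return metadata, body


def write_chunk(stream, metadata, body=b""):
    """Write one chunk (JSON metadata plus raw body bytes) to a binary stream."""
    encoded = json.dumps(metadata).encode("utf-8")
    stream.write(b"chunked 1.0,%d,%d\n" % (len(encoded), len(body)))
    stream.write(encoded)
    stream.write(body)
```

A process built this way can loop on `read_chunk` for as long as the search runs, keeping its model state in memory between chunks, which is what allows it to avoid being restarted for every batch of events.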

How to get it

ITSI-AD

ITSI 2.3 "Batman" (July 2016)
– ITSI Anomaly Detection replaced with Trending algorithm

ITSI 2.4 "Catwoman" (.conf 2016)
– Cohesive algorithm added
– Compares entities within a KPI


THANK YOU
