machine learning and anomaly detection in splunkit service ......machine learning and anomaly...

46
Copyright © 2016 Splunk Inc. Alex Cruise Sr. Dev. Manager/Architect, Splunk Fred Zhang Sr. Data Scientist, Splunk Machine Learning and Anomaly Detection in Splunk IT Service Intelligence

Upload: others

Post on 21-May-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Copyright©2016Splunk Inc.

AlexCruiseSr.Dev.Manager/Architect,SplunkFredZhangSr.DataScientist,Splunk

MachineLearningandAnomalyDetectioninSplunk ITServiceIntelligence

Page 2: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Disclaimer

2

Duringthecourseofthispresentation,wemaymakeforwardlookingstatementsregardingfutureeventsortheexpectedperformanceofthecompany.Wecautionyouthatsuchstatementsreflectourcurrentexpectationsandestimatesbasedonfactorscurrentlyknowntousandthatactualeventsorresultscoulddiffermaterially.Forimportantfactorsthatmaycauseactualresultstodifferfromthose

containedinourforward-lookingstatements,pleasereviewourfilingswiththeSEC.Theforward-lookingstatementsmadeinthethispresentationarebeingmadeasofthetimeanddateofitslivepresentation.Ifreviewedafteritslivepresentation,thispresentationmaynotcontaincurrentoraccurateinformation.Wedonotassumeanyobligationtoupdateanyforwardlookingstatementswemaymake.Inaddition,anyinformationaboutourroadmapoutlinesourgeneralproductdirectionandissubjecttochangeatanytimewithoutnotice.Itisforinformationalpurposesonlyandshallnot,beincorporatedintoanycontractorothercommitment.Splunkundertakesnoobligationeithertodevelopthefeaturesor

functionalitydescribedortoincludeanysuchfeatureorfunctionalityinafuturerelease.

Page 3: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Agenda

Introductions/HistoryAxioms– ProblemDomainAxioms– SolutionDomainTimeSeriesFeatureEngineeringSpatialvs.TemporalAnalysisOtherApproachesMADServiceEngineeringITSIContext

3

Page 4: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Introductions/History

Keyteammembers– Shang– Mihai– Jacob– Iman– Touf

Presenters– Fred– Datascientist– Alex– Architect/DevManager

4

Page 5: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– ProblemDomain

5

THEUNIVERSEOFDATA

Time-seriesdata

Page 6: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– ProblemDomain

6

THEUNIVERSEOFDATA

ENHANCE!

Time-seriesdata

Page 7: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– ProblemDomain

7

Detectinganomaliesinthisnarrowsubsetoftheuniverseofdata:TimeseriesNumericvariablesthatchangeovertime

IncreasingTimeà

x

Page 8: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– ProblemDomain

8

Detectinganomaliesinthisnarrowsubsetoftheuniverseofdata:TimeseriesNumericvariablesthatchangeovertime

Regular timeseriesThenewvaluesarriveonaregularinterval(e.g.everyfiveseconds)

IncreasingTimeà

x

regularinterval

Page 9: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– ProblemDomain

9

Detectinganomaliesinthisnarrowsubsetoftheuniverseofdata:TimeseriesNumericvariablesthatchangeovertime

Regular timeseriesThenewvaluesarriveonaregularinterval(e.g.everyfiveseconds)

Dense,RegulartimeseriesNewvaluesarefairlylikelytoarriveandnotbenull

IncreasingTimeà

x

regularinterval

fewgaps/nulls/NaNs

Page 10: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– SolutionDomain

10

UnsupervisedNon-ParametricRobustStreamingAdaptiveDomain-agnostic

Page 11: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– SolutionDomain

11

Unsupervised– Nolabelledanomalies– What’snormalislearnedfromobservingthedataitself,notdefinedbyan

expertNon-ParametricRobustStreamingAdaptiveDomain-agnostic

Page 12: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– SolutionDomain

12

UnsupervisedNon-Parametric– Wemakenoassumptionsabouttheprobabilitydistributionofthevalues

(e.g.Gaussianorstationary)

RobustStreamingAdaptiveDomain-agnostic

Page 13: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– SolutionDomain

13

UnsupervisedNon-ParametricRobust– Outliersaredetectedasanomalies,butdon’tcausedistortionsinour

expectations

StreamingAdaptiveDomain-agnostic

Page 14: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– SolutionDomain

14

UnsupervisedNon-ParametricRobustStreaming– Noseparatetraining/testperiods– Anomaliesaredetectedandreportedin(near-)realtime

AdaptiveDomain-agnostic

Page 15: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– SolutionDomain

15

UnsupervisedNon-ParametricRobustStreamingAdaptive– Nostaticthresholds,discovernormalbehaviourpatternsautomatically– Adapttobehavioralchangeswithoutend-userfeedback– WhatwasnormallastweekmightbeworrisometodayDomain-agnostic

Page 16: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– SolutionDomain

16

UnsupervisedNon-ParametricRobustStreamingAdaptiveDomain-agnostic– Purelynumeric– Noinformationaboutunderlyingsubjectsorcausesofthebehaviourstream

Memory/CPUusage

Page 17: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– SolutionDomain

17

UnsupervisedNon-ParametricRobustStreamingAdaptiveDomain-agnostic– Purelynumeric– Noinformationaboutunderlyingsubjectsorcausesofthebehaviourstream

Unicornspersecond

Page 18: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

GettingDataIn

18

Ifyoualreadyhavedense,regular,numerictimeseries(aka“metrics”or“KPIs”)you’regoodtogo

TimeSeriesFeatureEngineering

Page 19: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

GettingDataIn

19

Ifyoualreadyhavedense,regular,numerictimeseries(aka“metrics”or“KPIs”)you’regoodtogoIfyouhavesomethingelse,nowyouhaveatimeseriesfeatureengineeringproblem

TimeSeriesFeatureEngineering

Page 20: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

GettingDataIn

20

Ifyoualreadyhavedense,regular,numerictimeseries(aka“metrics”or“KPIs”)you’regoodtogoIfyouhavesomethingelse,nowyouhaveatimeseriesfeatureengineeringproblemThereareinescapabletradeoffsbetweendensity andprecision

TimeSeriesFeatureEngineering

Page 21: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

GettingDataIn

21

Ifyoualreadyhavedense,regular,numerictimeseries(aka“metrics”or“KPIs”)you’regoodtogoIfyouhavesomethingelse,nowyouhaveatimeseriesfeatureengineeringproblemThereareinescapable tradeoffsbetweendensity andprecisionIncreasedprecisionimpliessparsertimeseries– Alsoincreasedmemoryandbandwidthusage!

TimeSeriesFeatureEngineering

Page 22: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

GettingDataIn

22

Ifyoualreadyhavedense,regular,numerictimeseries(aka“metrics”or“KPIs”)you’regoodtogoIfyouhavesomethingelse,nowyouhaveatimeseriesfeatureengineeringproblemThereareinescapable tradeoffsbetweendensity andprecisionIncreasedprecisionimpliessparsertimeseries– Alsoincreasedmemoryandbandwidthusage!

TSFErequiresdealingwithTime,Space andValues

TimeSeriesFeatureEngineering

Page 23: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

GettingDataIn

23

Time– Howfrequently donewvaluesarrive?– Howregularly donewvaluesarrive?– Howprecisely dowewanttobeabletorecordthetimewhenthe

measurementwastaken?ê Finertimeresolutionincreasessparsity:theprobabilitythatanyeventoccurredduringaparticulartimewindowisdecreased

SpaceValues

TimeSeriesFeatureEngineering

Page 24: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

GettingDataIn

24

TimeSpace- howprecisely dowewanttobeabletorelatetimeseriesbacktotheunderlyingeventstream?

ê Howmanydimensions?e.g.IPaddress,geo.coordinates,MIMEtype,HTTPresponsecode– Addingdimensionsincreasesprecision,butalsomagnifiesthelikelihoodofsparsity

ê Withinadimension,howprecisedoweneedtobe?– FullIPaddressor/24?Distinguish400,401,403,404orjust4xx?– Country,state/province,city,neighbourhood,building,…?– Extraprecisionincreasesthelikelihoodofsparsity

Values

TimeSeriesFeatureEngineering

Page 25: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

GettingDataIn

25

TimeSpaceValues– Howdowegenerateanumber?

ê Getanumericfieldas-is(i.e.a“gauge”)ê Incrementacounter

– Howdoweaggregatemultiplevalues?ê Min,max,mean,etc.

– Howshouldwehandlemissingvalues?ê ”Replacenullwithzero”onlymakessenseforsomethingweknowisacounterê “Takethepreviousvalue”mightmakesense

TimeSeriesFeatureEngineering

Page 26: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MetricAnomalyDetectionAlgorithms

26

Proprietary!Notopensourceoroff-the-shelf.Spatialandtemporalalgorithms– Whatdowemeanby“spatial”and“temporal”?– Completelyorthogonal,irreducibledistinction

ê Onecannotsubstitutefortheotherê Neitherisalwaysapplicabletoeverytimeseries

Page 27: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MetricAnomalyDetectionAlgorithms

27

Analyzeonetimeseriesatatime(embarrassinglyparallel)Alertingwhenpresentbehaviourissurprisingcomparedtopastbehaviour

TemporalAnalysis(aka“Trending”algorithm)

IncreasingTimeà

xnowß past

Page 28: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MetricAnomalyDetectionAlgorithms

28

Goodresultsonlywhenthereisahistoryofrecurringpatternsintheunderlyingeventstream– Notnecessarilyperiodic,justrecurring

Howmuchhistory?– Preliminary(usuallybad)resultsafter~2000points

ê e.g.1.5days at1-minuteresolution– Greatresultsaftera“fullperiod”hasbeenobserved(e.g.7days)– Moreisbetter!(modulomemory,storage…)

TrendingAlgorithmConstraints

Page 29: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MetricAnomalyDetectionAlgorithms

29

Comparepresent behaviourofmultiplemetrics

Spatial(“Cohesive”)Algorithm

IncreasingTimeà

x now

Page 30: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MetricAnomalyDetectionAlgorithms

30

Givenaset*oftimeseriesthatareexpected†tobehavesimilarly‡,detectwhenoneormoreofthemdepartsfromtheirpeers

*set>=3members

†expectedbyahumananalystorinterestingMLprocess

‡similarlyRoughlythesameshapeScaleandmagnitudeinvariant

CohesiveAlgorithmConstraints

Page 31: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MetricAnomalyDetectionAlgorithms

31

NoperiodicityrequiredHistoryimprovesscale/magnitudeinvariancePerformancereliesonsimilaritywithingroup– Whatifthegroupisn’tinherentlycohesive?

ê Lotsofalertsearlyonê Then,thealgorithmadaptstothechaosê Ifthegroupreturnstocohesion,thealgorithmwillautomaticallyadapttothe“newnormal”.

CohesiveAlgorithmCharacteristics

Page 32: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MetricAnomalyDetectionAlgorithms

32

Aclusterofserversperformingasimilarroleforthesameapplication,behindthesameloadbalancerAssumingtheloadbalancerisoperatingnominally,manyservermetricsshouldberoughlycorrelated,e.g.:– CPUusage(user,system,idle)– Diskusage(reads,writes,IOPS)– Networkusage(bandwidth,#activesockets)– Application-specificmetrics(requestshandledpersecond,500errors,

authenticationfailures,activesessions)

CohesiveAlgorithm:ExampleUseCase#1

Page 33: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MetricAnomalyDetectionAlgorithms

33

ImaginesomewindturbinesonthesamehillWecan’tpredictwinddirectionandspeedverywell(yet?)Butweexpecteveryturbineshouldberoughlycohesiveinseveralmetrics:– rotationspeed– powergenerationrate– vibration– direction

ê *actually,becausethisisaperiodicmetric(359° ≈1°),wedon’tsupportitwellrightnow

Ifanymetricforanyturbinedifferssignificantlyfromitspeers,weshouldbenotified,andmaybesendateamtoinvestigate

CohesiveAlgorithm:ExampleUseCase#2

Page 34: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Otherapproacheswehavetried

34

3-sigmaKolmogorov-SmirnovtestoverslidingwindowsTime-seriesforecastingmethods– Holt-Winters(previousversionofITSIADisbasedonitsnon-parametricversion)– ARIMA,etc

One-classSVMClusteringmethods– DBSCAN,K-means,etcVariousR,Pythonpackages

Page 35: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MADServiceEngineering

35

MAD=“MetaforAnomalyDetection”

Page 36: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MADServiceEngineering

36

MAD=“Metafor AnomalyDetection”

Page 37: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MADServiceEngineering

37

MAD=“Metric AnomalyDetection”

Page 38: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MADServiceEngineering

38

MAD=“Metric AnomalyDetection”WritteninScala– usingAkkaforconcurrency

Page 39: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MADServiceEngineering

39

MAD=“Metric AnomalyDetection”WritteninScala– usingAkkaforconcurrency

UsesSearchCommandProtocolv2(availablesinceSplunk6.3)– Runsforever,doesn’tgetrestartedevery50kevents– Receivesdatasoonafteritarrivesatanindexer,nopolling

Page 40: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MADServiceEngineering

40

MAD=“Metric AnomalyDetection”WritteninScala– usingAkkaforconcurrency

UsesnewChunkedExternalCommandfeatureofSplunk6.3– Runsforever,doesn’tgetrestartedevery50kevents– Receivesdatasoonafteritarrivesatanindexer,nopolling

Fast!

Page 41: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MADServiceEngineering

41

MAD=“Metric AnomalyDetection”WritteninScala– usingAkkaforconcurrency

UsesnewChunkedExternalCommandfeatureofSplunk6.3– Runsforever,doesn’tgetrestartedevery50kevents– Receivesdatasoonafteritarrivesatanindexer,nopolling

Fast!Designedforgeneral-purposeuse,nocouplingtoITSIruntime

Page 42: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Howtogetit

42

ITSI-AD

ITSI2.3“Batman”(July2016)– ITSIAnomalyDetectionreplacedwithTrendingalgorithm

ITSI2.4“Catwoman”(.conf 2016)– Cohesivealgorithmadded– ComparesentitieswithinaKPI

Page 43: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Howtogetit

43

ITSI-AD

Page 44: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Howtogetit

44

ITSI-AD

Page 45: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Howtogetit

45

ITSI-AD

Page 46: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

THANKYOU