machine learning and anomaly detection in splunkit service ......machine learning and anomaly...
TRANSCRIPT
Copyright©2016Splunk Inc.
AlexCruiseSr.Dev.Manager/Architect,SplunkFredZhangSr.DataScientist,Splunk
MachineLearningandAnomalyDetectioninSplunk ITServiceIntelligence
Disclaimer
2
Duringthecourseofthispresentation,wemaymakeforwardlookingstatementsregardingfutureeventsortheexpectedperformanceofthecompany.Wecautionyouthatsuchstatementsreflectourcurrentexpectationsandestimatesbasedonfactorscurrentlyknowntousandthatactualeventsorresultscoulddiffermaterially.Forimportantfactorsthatmaycauseactualresultstodifferfromthose
containedinourforward-lookingstatements,pleasereviewourfilingswiththeSEC.Theforward-lookingstatementsmadeinthethispresentationarebeingmadeasofthetimeanddateofitslivepresentation.Ifreviewedafteritslivepresentation,thispresentationmaynotcontaincurrentoraccurateinformation.Wedonotassumeanyobligationtoupdateanyforwardlookingstatementswemaymake.Inaddition,anyinformationaboutourroadmapoutlinesourgeneralproductdirectionandissubjecttochangeatanytimewithoutnotice.Itisforinformationalpurposesonlyandshallnot,beincorporatedintoanycontractorothercommitment.Splunkundertakesnoobligationeithertodevelopthefeaturesor
functionalitydescribedortoincludeanysuchfeatureorfunctionalityinafuturerelease.
Agenda
Introductions/HistoryAxioms– ProblemDomainAxioms– SolutionDomainTimeSeriesFeatureEngineeringSpatialvs.TemporalAnalysisOtherApproachesMADServiceEngineeringITSIContext
3
Introductions/History
Keyteammembers– Shang– Mihai– Jacob– Iman– Touf
Presenters– Fred– Datascientist– Alex– Architect/DevManager
4
Axioms– ProblemDomain
5
THEUNIVERSEOFDATA
Time-seriesdata
Axioms– ProblemDomain
6
THEUNIVERSEOFDATA
ENHANCE!
Time-seriesdata
Axioms– ProblemDomain
7
Detectinganomaliesinthisnarrowsubsetoftheuniverseofdata:TimeseriesNumericvariablesthatchangeovertime
IncreasingTimeà
x
Axioms– ProblemDomain
8
Detectinganomaliesinthisnarrowsubsetoftheuniverseofdata:TimeseriesNumericvariablesthatchangeovertime
Regular timeseriesThenewvaluesarriveonaregularinterval(e.g.everyfiveseconds)
IncreasingTimeà
x
regularinterval
Axioms– ProblemDomain
9
Detectinganomaliesinthisnarrowsubsetoftheuniverseofdata:TimeseriesNumericvariablesthatchangeovertime
Regular timeseriesThenewvaluesarriveonaregularinterval(e.g.everyfiveseconds)
Dense,RegulartimeseriesNewvaluesarefairlylikelytoarriveandnotbenull
IncreasingTimeà
x
regularinterval
fewgaps/nulls/NaNs
Axioms– SolutionDomain
10
UnsupervisedNon-ParametricRobustStreamingAdaptiveDomain-agnostic
Axioms– SolutionDomain
11
Unsupervised– Nolabelledanomalies– What’snormalislearnedfromobservingthedataitself,notdefinedbyan
expertNon-ParametricRobustStreamingAdaptiveDomain-agnostic
Axioms– SolutionDomain
12
UnsupervisedNon-Parametric– Wemakenoassumptionsabouttheprobabilitydistributionofthevalues
(e.g.Gaussianorstationary)
RobustStreamingAdaptiveDomain-agnostic
Axioms– SolutionDomain
13
UnsupervisedNon-ParametricRobust– Outliersaredetectedasanomalies,butdon’tcausedistortionsinour
expectations
StreamingAdaptiveDomain-agnostic
Axioms– SolutionDomain
14
UnsupervisedNon-ParametricRobustStreaming– Noseparatetraining/testperiods– Anomaliesaredetectedandreportedin(near-)realtime
AdaptiveDomain-agnostic
Axioms– SolutionDomain
15
UnsupervisedNon-ParametricRobustStreamingAdaptive– Nostaticthresholds,discovernormalbehaviourpatternsautomatically– Adapttobehavioralchangeswithoutend-userfeedback– WhatwasnormallastweekmightbeworrisometodayDomain-agnostic
Axioms– SolutionDomain
16
UnsupervisedNon-ParametricRobustStreamingAdaptiveDomain-agnostic– Purelynumeric– Noinformationaboutunderlyingsubjectsorcausesofthebehaviourstream
Memory/CPUusage
Axioms– SolutionDomain
17
UnsupervisedNon-ParametricRobustStreamingAdaptiveDomain-agnostic– Purelynumeric– Noinformationaboutunderlyingsubjectsorcausesofthebehaviourstream
Unicornspersecond
GettingDataIn
18
Ifyoualreadyhavedense,regular,numerictimeseries(aka“metrics”or“KPIs”)you’regoodtogo
TimeSeriesFeatureEngineering
GettingDataIn
19
Ifyoualreadyhavedense,regular,numerictimeseries(aka“metrics”or“KPIs”)you’regoodtogoIfyouhavesomethingelse,nowyouhaveatimeseriesfeatureengineeringproblem
TimeSeriesFeatureEngineering
GettingDataIn
20
Ifyoualreadyhavedense,regular,numerictimeseries(aka“metrics”or“KPIs”)you’regoodtogoIfyouhavesomethingelse,nowyouhaveatimeseriesfeatureengineeringproblemThereareinescapabletradeoffsbetweendensity andprecision
TimeSeriesFeatureEngineering
GettingDataIn
21
Ifyoualreadyhavedense,regular,numerictimeseries(aka“metrics”or“KPIs”)you’regoodtogoIfyouhavesomethingelse,nowyouhaveatimeseriesfeatureengineeringproblemThereareinescapable tradeoffsbetweendensity andprecisionIncreasedprecisionimpliessparsertimeseries– Alsoincreasedmemoryandbandwidthusage!
TimeSeriesFeatureEngineering
GettingDataIn
22
Ifyoualreadyhavedense,regular,numerictimeseries(aka“metrics”or“KPIs”)you’regoodtogoIfyouhavesomethingelse,nowyouhaveatimeseriesfeatureengineeringproblemThereareinescapable tradeoffsbetweendensity andprecisionIncreasedprecisionimpliessparsertimeseries– Alsoincreasedmemoryandbandwidthusage!
TSFErequiresdealingwithTime,Space andValues
TimeSeriesFeatureEngineering
GettingDataIn
23
Time– Howfrequently donewvaluesarrive?– Howregularly donewvaluesarrive?– Howprecisely dowewanttobeabletorecordthetimewhenthe
measurementwastaken?ê Finertimeresolutionincreasessparsity:theprobabilitythatanyeventoccurredduringaparticulartimewindowisdecreased
SpaceValues
TimeSeriesFeatureEngineering
GettingDataIn
24
TimeSpace- howprecisely dowewanttobeabletorelatetimeseriesbacktotheunderlyingeventstream?
ê Howmanydimensions?e.g.IPaddress,geo.coordinates,MIMEtype,HTTPresponsecode– Addingdimensionsincreasesprecision,butalsomagnifiesthelikelihoodofsparsity
ê Withinadimension,howprecisedoweneedtobe?– FullIPaddressor/24?Distinguish400,401,403,404orjust4xx?– Country,state/province,city,neighbourhood,building,…?– Extraprecisionincreasesthelikelihoodofsparsity
Values
TimeSeriesFeatureEngineering
GettingDataIn
25
TimeSpaceValues– Howdowegenerateanumber?
ê Getanumericfieldas-is(i.e.a“gauge”)ê Incrementacounter
– Howdoweaggregatemultiplevalues?ê Min,max,mean,etc.
– Howshouldwehandlemissingvalues?ê ”Replacenullwithzero”onlymakessenseforsomethingweknowisacounterê “Takethepreviousvalue”mightmakesense
TimeSeriesFeatureEngineering
MetricAnomalyDetectionAlgorithms
26
Proprietary!Notopensourceoroff-the-shelf.Spatialandtemporalalgorithms– Whatdowemeanby“spatial”and“temporal”?– Completelyorthogonal,irreducibledistinction
ê Onecannotsubstitutefortheotherê Neitherisalwaysapplicabletoeverytimeseries
MetricAnomalyDetectionAlgorithms
27
Analyzeonetimeseriesatatime(embarrassinglyparallel)Alertingwhenpresentbehaviourissurprisingcomparedtopastbehaviour
TemporalAnalysis(aka“Trending”algorithm)
IncreasingTimeà
xnowß past
MetricAnomalyDetectionAlgorithms
28
Goodresultsonlywhenthereisahistoryofrecurringpatternsintheunderlyingeventstream– Notnecessarilyperiodic,justrecurring
Howmuchhistory?– Preliminary(usuallybad)resultsafter~2000points
ê e.g.1.5days at1-minuteresolution– Greatresultsaftera“fullperiod”hasbeenobserved(e.g.7days)– Moreisbetter!(modulomemory,storage…)
TrendingAlgorithmConstraints
MetricAnomalyDetectionAlgorithms
29
Comparepresent behaviourofmultiplemetrics
Spatial(“Cohesive”)Algorithm
IncreasingTimeà
x now
MetricAnomalyDetectionAlgorithms
30
Givenaset*oftimeseriesthatareexpected†tobehavesimilarly‡,detectwhenoneormoreofthemdepartsfromtheirpeers
*set>=3members
†expectedbyahumananalystorinterestingMLprocess
‡similarlyRoughlythesameshapeScaleandmagnitudeinvariant
CohesiveAlgorithmConstraints
MetricAnomalyDetectionAlgorithms
31
NoperiodicityrequiredHistoryimprovesscale/magnitudeinvariancePerformancereliesonsimilaritywithingroup– Whatifthegroupisn’tinherentlycohesive?
ê Lotsofalertsearlyonê Then,thealgorithmadaptstothechaosê Ifthegroupreturnstocohesion,thealgorithmwillautomaticallyadapttothe“newnormal”.
CohesiveAlgorithmCharacteristics
MetricAnomalyDetectionAlgorithms
32
Aclusterofserversperformingasimilarroleforthesameapplication,behindthesameloadbalancerAssumingtheloadbalancerisoperatingnominally,manyservermetricsshouldberoughlycorrelated,e.g.:– CPUusage(user,system,idle)– Diskusage(reads,writes,IOPS)– Networkusage(bandwidth,#activesockets)– Application-specificmetrics(requestshandledpersecond,500errors,
authenticationfailures,activesessions)
CohesiveAlgorithm:ExampleUseCase#1
MetricAnomalyDetectionAlgorithms
33
ImaginesomewindturbinesonthesamehillWecan’tpredictwinddirectionandspeedverywell(yet?)Butweexpecteveryturbineshouldberoughlycohesiveinseveralmetrics:– rotationspeed– powergenerationrate– vibration– direction
ê *actually,becausethisisaperiodicmetric(359° ≈1°),wedon’tsupportitwellrightnow
Ifanymetricforanyturbinedifferssignificantlyfromitspeers,weshouldbenotified,andmaybesendateamtoinvestigate
CohesiveAlgorithm:ExampleUseCase#2
Otherapproacheswehavetried
34
3-sigmaKolmogorov-SmirnovtestoverslidingwindowsTime-seriesforecastingmethods– Holt-Winters(previousversionofITSIADisbasedonitsnon-parametricversion)– ARIMA,etc
One-classSVMClusteringmethods– DBSCAN,K-means,etcVariousR,Pythonpackages
MADServiceEngineering
35
MAD=“MetaforAnomalyDetection”
MADServiceEngineering
36
MAD=“Metafor AnomalyDetection”
MADServiceEngineering
37
MAD=“Metric AnomalyDetection”
MADServiceEngineering
38
MAD=“Metric AnomalyDetection”WritteninScala– usingAkkaforconcurrency
MADServiceEngineering
39
MAD=“Metric AnomalyDetection”WritteninScala– usingAkkaforconcurrency
UsesSearchCommandProtocolv2(availablesinceSplunk6.3)– Runsforever,doesn’tgetrestartedevery50kevents– Receivesdatasoonafteritarrivesatanindexer,nopolling
MADServiceEngineering
40
MAD=“Metric AnomalyDetection”WritteninScala– usingAkkaforconcurrency
UsesnewChunkedExternalCommandfeatureofSplunk6.3– Runsforever,doesn’tgetrestartedevery50kevents– Receivesdatasoonafteritarrivesatanindexer,nopolling
Fast!
MADServiceEngineering
41
MAD=“Metric AnomalyDetection”WritteninScala– usingAkkaforconcurrency
UsesnewChunkedExternalCommandfeatureofSplunk6.3– Runsforever,doesn’tgetrestartedevery50kevents– Receivesdatasoonafteritarrivesatanindexer,nopolling
Fast!Designedforgeneral-purposeuse,nocouplingtoITSIruntime
Howtogetit
42
ITSI-AD
ITSI2.3“Batman”(July2016)– ITSIAnomalyDetectionreplacedwithTrendingalgorithm
ITSI2.4“Catwoman”(.conf 2016)– Cohesivealgorithmadded– ComparesentitieswithinaKPI
Howtogetit
43
ITSI-AD
Howtogetit
44
ITSI-AD
Howtogetit
45
ITSI-AD
THANKYOU