anomaly (outlier) detection - new mexico state...
TRANSCRIPT
Anomaly (outlier) detection
Huiping Cao, Anomaly 1
Outline
• General concepts
  – What are outliers
  – Types of outliers
  – Causes of anomalies
• Challenges of outlier detection
• Outlier detection approaches
What are outliers
• The set of data points that are significantly different from the rest of the objects
• Assumption
  – There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data
• Applications
  – Fraud detection (credit card usage)
  – Intrusion detection (computer systems, computer networks)
  – Ecosystem disturbances
  – Public health
  – Medicine
• Related
  – Novelty detection
Types of outliers
• Global: deviate significantly from the rest of the data set
  – Also called point anomalies
  – Most outlier detection methods are designed to find such outliers
• Example
  – Intrusion detection in network traffic
Types of outliers
• Contextual (conditional) outliers
  – An object is an outlier in one context, but may be normal in another context
  – Contextual attributes: define the object’s context.
    • date, location
  – Behavior attributes: define the object’s characteristics, and are used to evaluate whether the object is an outlier in the context.
    • temperature
  – A generalization of local outliers, defined in density-based analysis.
  – Background information is needed to determine contextual attributes, etc.
Types of outliers
• Collective: a subset of data objects forms a collective outlier if the objects as a whole deviate significantly from the entire data set
  – The individual data objects may not be outliers
  – Applications: supply-chain, web visiting, network (denial-of-service)
  – Need background information to model the relationships among objects
Causes of anomalies
• Data from different classes
  – Hawkins’ definition of an outlier: an outlier is an observation that differs so much from other observations as to arouse suspicion that it was generated by a different mechanism.
• Natural variation
  – Anomalies that represent extreme or unlikely variations (e.g., an extremely tall person)
• Data measurement and collection errors
  – Removing such anomalies is the focus of data preprocessing (data cleaning)
• Others: several sources
Challenges of outlier detection
• Model normal/outlier objects
  – Hard to model complete normal behavior
  – Some methods assign “normal” or “abnormal”
  – Some methods assign a score measuring the “outlier-ness” of the object.
• Universal outlier detection: hard to develop
  – Similarity and distance definition is application-dependent
• Common issues: noise
• Understandability
  – Understand why the detected objects are outliers
  – Provide justification of the detection
Outline
• General concepts
  – What are outliers
  – Types of outliers
• Challenges of outlier detection
• Outlier detection approaches
  – Statistical methods
  – Proximity-based methods
  – Clustering-based methods
Outlier detection methods
• Methods are categorized by whether the data for analysis are labeled with “normal” or “abnormal” by domain experts.
• Supervised methods
  – Can be modeled as a classification problem
  – Special aspect to consider: the classes are imbalanced (far more normal points than abnormal points)
  – Measures: recall on the outlier class is more meaningful than overall accuracy
• Unsupervised methods
  – Largely utilize clustering methods
• Semi-supervised methods
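The point about imbalanced classes can be illustrated with a small sketch. The labels below are made up for illustration (95 normal objects, 5 abnormal ones), and the "classifier" is a hypothetical one that never raises an alarm; it still reaches high accuracy while detecting no outliers at all.

```python
def confusion_counts(y_true, y_pred, positive="abnormal"):
    """Count true/false positives and negatives for the outlier class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

# Hypothetical, heavily imbalanced labels: 95 normal, 5 abnormal.
y_true = ["normal"] * 95 + ["abnormal"] * 5
# Hypothetical trivial classifier: predicts "normal" for every object.
y_pred = ["normal"] * 100

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)   # fraction of all objects labeled correctly
recall = tp / (tp + fn)              # fraction of true outliers that were found

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Accuracy rewards the majority class, so it hides the fact that every outlier was missed; recall on the outlier class exposes it.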
Outlier detection methods
• Outlier detection algorithms make assumptions about outliers versus the rest of the data.
• Categories according to the assumptions made
  – Statistical methods (or model based)
    • Normal data follow a statistical (stochastic) model
    • Outliers do not follow the model
  – Proximity-based methods
    • The proximity of outliers to their neighbors is different from the proximity of most other objects to their neighbors
    • Distance-based, density-based
  – Clustering-based methods
    • Normal objects belong to large and dense clusters
    • Outliers belong to small or sparse clusters, or belong to no cluster
Statistical approaches
• Probabilistic definition of an outlier: an outlier is an object that has a low probability with respect to a probability distribution model of the data.
  – Normal objects are generated by a stochastic process and occur in regions of high probability for the stochastic model
  – Outliers occur in regions of low probability
• Approach steps
  – Learn a generative model fitting the given data
  – Identify the objects in low-probability regions of the model
• Categories
  – Parametric method (univariate, multivariate)
  – Nonparametric method
Parametric: univariate normal distribution
• Normal distribution, maximum likelihood estimation (MLE)
  – Standard normal distribution, N(0,1)
  – Non-standard normal distribution, N(μ,σ²), z-score
  – Use MLE to estimate μ and σ²
Parametric: univariate normal distribution
• prob(|x| ≥ c) = α for N(0,1)
  – Mark an object as an outlier if it is more than 3σ away from the estimated mean μ, where σ is the standard deviation (the μ ± 3σ region contains 99.73% of the data)
• (c, α) pairs for N(0,1)
c α for N(0,1)
1.0 0.3173
1.5 0.1336
2.0 0.0455
2.5 0.0124
3.0 0.0027
3.5 0.0005
4.0 0.0001
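The (c, α) pairs in the table can be reproduced numerically. A minimal sketch, using the identity P(|x| ≥ c) = erfc(c/√2) for a standard normal variable; the function name `two_sided_tail` is only an illustrative choice.

```python
import math

def two_sided_tail(c: float) -> float:
    """Return alpha = P(|x| >= c) for x drawn from N(0, 1)."""
    return math.erfc(c / math.sqrt(2.0))

# Print the same c values as the table, rounded to four decimals.
for c in (1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0):
    print(f"{c:.1f}  {two_sided_tail(c):.4f}")
```

For a non-standard normal distribution the same function applies after converting a value to its z-score, (x - μ)/σ.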
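Parametric: univariate normal distribution
• Example
• A city’s average temperature values in 10 years: 24, 28.9, 28.9, 29, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4
  – μ = 28.61
  – σ² ≈ 2.38 (MLE, dividing by n), σ = sqrt(2.38) ≈ 1.54
  – Is 24 an outlier?
    • z-score = |24 - 28.61| / 1.54 ≈ 2.99
    • ≈ 3: 24 lies essentially on the 3σ boundary (α ≈ 0.003), so it is treated as an outlier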
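The worked example is easy to check numerically. The sketch below recomputes the estimates from the ten listed values, assuming the MLE variance (dividing by n, not n - 1). It gives μ = 28.61, σ² ≈ 2.38 and z ≈ 2.99; because the score sits so close to 3, small rounding differences in a hand calculation can move it across the threshold.

```python
import math

# Temperature values copied from the example.
temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]

n = len(temps)
mu = sum(temps) / n                               # MLE of the mean
var_mle = sum((x - mu) ** 2 for x in temps) / n   # MLE of the variance (divide by n)
sigma = math.sqrt(var_mle)

z = abs(24.0 - mu) / sigma                        # z-score of the suspicious value

print(f"mu={mu:.2f} var={var_mle:.2f} sigma={sigma:.2f} z={z:.2f}")
```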
Parametric: other univariate outlier detection approaches (S.S.)
• Boxplot method
• Grubbs’ test (maximum normed residual test)
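As a rough illustration of the boxplot method, the sketch below flags values that fall outside the whiskers. The 1.5 × IQR whisker length is the usual boxplot convention and is an assumption here, since the slide does not state it. The data are the temperature values from the earlier example, and `boxplot_outliers` is an illustrative name.

```python
import statistics

def boxplot_outliers(values, whisker=1.5):
    """Return the values lying outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
print(boxplot_outliers(temps))
```

Unlike the 3σ rule, the quartiles are barely affected by the extreme value itself, which is one reason the boxplot rule is often preferred for small samples.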
Parametric: multivariate
• Multivariate
  – Convert the problem to a univariate outlier detection problem
  – Use the Mahalanobis distance from object o to the mean μ
  – Use the χ² statistic
    • oi: the value of o on the i-th dimension
    • Ei: the mean of the i-th dimension over all objects
    • n: the number of objects
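The formulas themselves did not survive in the transcript, so the sketch below assumes the commonly used forms: the squared Mahalanobis distance (o - μ)ᵀ S⁻¹ (o - μ) with S the covariance matrix, and the χ² statistic Σᵢ (oᵢ - Eᵢ)² / Eᵢ summed over dimensions. It is restricted to two dimensions so the matrix inverse can be written out by hand, and the data points are made up.

```python
def mean_vector(points):
    """Per-dimension mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def covariance_2d(points, mu):
    """MLE covariance entries (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    sxx = sum((p[0] - mu[0]) ** 2 for p in points) / n
    syy = sum((p[1] - mu[1]) ** 2 for p in points) / n
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points) / n
    return sxx, sxy, syy

def mahalanobis_sq(o, mu, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = o[0] - mu[0], o[1] - mu[1]
    return (dx * dx * syy - 2.0 * dx * dy * sxy + dy * dy * sxx) / det

def chi_square_stat(o, mu):
    """Assumed form: sum over dimensions of (o_i - E_i)^2 / E_i."""
    return sum((oi - ei) ** 2 / ei for oi, ei in zip(o, mu))

# Hypothetical data: five points along a line plus one point far off it.
points = [(2.0, 2.0), (2.5, 2.4), (3.0, 3.1), (3.5, 3.4), (4.0, 4.1), (3.0, 9.0)]
mu = mean_vector(points)
cov = covariance_2d(points, mu)
scores = [mahalanobis_sq(p, mu, cov) for p in points]
print(max(zip(scores, points)))   # the object with the largest distance
```

Either score reduces each multivariate object to a single number, which can then be handled with a univariate test.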
Nonparametric
• Nonparametric methods make fewer assumptions about the data distribution, and thus are applicable in more scenarios
• Histogram approach
  – Construct histograms (choices: equal width or equal depth, number of bins, or size of each bin)
  – Outliers: objects that fall in no bin or in bins with small size
  – Drawback: hard to decide the bin size
• Others: kernel function (discussed more in machine learning)
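The histogram approach can be sketched with an equal-width histogram. The number of bins and the minimum bin count below are arbitrary illustrative parameters, which is exactly the drawback noted above; the data are the temperature values from the earlier example.

```python
def histogram_outliers(values, num_bins, min_count):
    """Flag values that fall in equal-width bins holding fewer than min_count values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:                     # all values identical: nothing stands out
        return []
    counts = [0] * num_bins
    bin_of = []
    for v in values:
        # Clamp so that the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
        bin_of.append(idx)
    return [v for v, b in zip(values, bin_of) if counts[b] < min_count]

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
print(histogram_outliers(temps, num_bins=5, min_count=2))
```

With bins that are too wide the value 24 would share a bin with normal values and go unnoticed; with bins that are too narrow many normal values would sit alone in a bin and be flagged.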
Proximity-based approaches
• Data is represented as a vector of features
• Based on the neighborhood
• Major approaches
  – Distance based
  – Density based
Distance-based approach
• Anomaly: an object that is distant from most points.
• Distance to k-nearest neighbor: the outlier score of an object is given by the distance to its k-th nearest neighbor.
• Outliers: objects whose score exceeds a threshold
• Problem: hard to decide k (see next slides)
• Improvement: average of the distances to the first k nearest neighbors
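Both scores described above (distance to the k-th nearest neighbor, and the average distance to the first k nearest neighbors) can be computed with a brute-force sketch. The points are made up: a small cluster plus two isolated points that sit close to each other, which shows why the choice of k matters.

```python
import math

def knn_scores(points, k):
    """Return, per point, (distance to k-th nearest neighbor, mean distance to first k)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((dists[k - 1], sum(dists[:k]) / k))
    return scores

# Hypothetical data: five clustered points and a pair of far-away points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
          (5.0, 5.0), (5.2, 5.0)]

for k in (1, 2):
    kth = [round(s[0], 2) for s in knn_scores(points, k)]
    print(f"k={k}: {kth}")
```

With k=1 the two isolated points get the lowest scores because each has the other as a close neighbor; with k=2 they get the highest scores.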
[Figure: scatter plots (both axes from 0 to 2.5) illustrating the effect of k. Panel titles: “k=1, outlier is O”; “k=5, all points at the right upper corner are outliers”]
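Distance-based outlier detection
• Given a data set D with n data points and a distance threshold r
• r-neighborhood of an object o: the objects within distance r of o
• Object o is a DB(r,π)-outlier if the fraction of objects in D that lie in its r-neighborhood is at most π
• Approach:
  – Compute the distance between every pair of data points
  – O(n²)
  – Practically, O(n)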
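A nested-loop sketch of DB(r,π)-outlier detection follows. It assumes the usual definition, in which o is an outlier when at most a fraction π of the n objects lie within distance r of it, and it assumes that o itself is not counted. The inner loop stops as soon as enough neighbors are found, which is why the running time is often close to linear in practice even though the worst case is O(n²). The data and parameter values are made up.

```python
import math

def db_outliers(points, r, pi):
    """Return the DB(r, pi)-outliers of points using a nested loop with early exit."""
    n = len(points)
    limit = pi * n                      # maximum neighbor count for an outlier
    outliers = []
    for i, o in enumerate(points):
        count = 0
        for j, q in enumerate(points):
            if j != i and math.dist(o, q) <= r:
                count += 1
                if count > limit:       # enough neighbors: o cannot be an outlier
                    break
        if count <= limit:
            outliers.append(o)
    return outliers

# Hypothetical data: five clustered points and two isolated ones.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
          (5.0, 5.0), (5.2, 5.0)]
print(db_outliers(points, r=1.5, pi=0.2))
```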
A grid-based method implementation
• Cell diagonal length: r/2
• Cell edge length: $\frac{r}{2\sqrt{d}}$, where d is the number of dimensions
• Level-1 cell
  – Direct neighbor cells of a cell C
  – For a point o in C, any point o’ in such cells has dist(o,o’) ≤ r
• Level-2 cell
  – One or two cells away from a cell C
  – Any point o’ with dist(o,o’) > r must be in a level-2 cell
A grid-based method implementation
• Pruning
  – n0: total number of objects in a cell C
  – n1: total number of objects in C’s level-1 cells
  – n2: total number of objects in C’s level-2 cells
• Level-1 cell pruning:
  – If (n0 + n1) > πn, every object o in C is NOT an outlier
• Level-2 cell pruning:
  – If (n0 + n1 + n2) < πn + 1, all the points in C are outliers
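The two pruning rules can be written as a small decision function. This is only a sketch of the per-cell decision: it assumes the counts n0, n1 and n2 have already been gathered from the grid, and the function names and return strings are illustrative. The helper also shows the cell edge length r/(2√d), which makes the cell diagonal equal to r/2.

```python
import math

def cell_edge_length(r, d):
    """Edge length of a grid cell whose diagonal is r/2 in d dimensions."""
    return r / (2.0 * math.sqrt(d))

def prune_cell(n0, n1, n2, pi, n):
    """Decide what can be said about all objects of a cell C without distance computations."""
    threshold = pi * n
    if n0 + n1 > threshold:
        return "no outliers"            # level-1 rule: every object has enough close neighbors
    if n0 + n1 + n2 < threshold + 1:
        return "all outliers"           # level-2 rule: too few objects can be within r
    return "check individually"         # neither rule applies: fall back to distance checks

print(cell_edge_length(2.0, 2))
print(prune_cell(3, 4, 10, pi=0.05, n=100))
print(prune_cell(1, 1, 2, pi=0.05, n=100))
print(prune_cell(1, 2, 10, pi=0.05, n=100))
```

Only cells that end up in the third case need object-by-object distance computations, which is where the savings over the nested loop come from.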
Distance-based outlier detection
• Global outliers: cannot handle data sets with regions of different densities
[Figure: scatter plot with marked points p1 and p2]
Density-based outlier detection
• Local proximity-based outlier
• Compare the density around one object with the density around its local neighbors
[Figure: scatter plot with marked points p1 and p2]
Density based
• D: a set of objects
• Nearest neighbor of o
  – d(o,D) = min{d(o,o’) | o’ in D}
• Local outliers: relative to their local neighborhoods, particularly with respect to the densities of the neighborhoods.
• Density based outlier: the outlier score of an object is the inverse of the density around the object.
Concepts
• k-distance of an object o, dk(o): measures the relative density of an object o.
• Formally, dk(o) = d(o,p) for an object p in D such that
  – at least k objects o’ in D \ {o} have d(o,o’) ≤ d(o,p)
  – at most k-1 objects o’ in D \ {o} have d(o,o’) < d(o,p)
• k-distance neighborhood of an object o
  – Nk(o) = {o’ | o’ in D, d(o,o’) ≤ dk(o)}
  – Nk(o) may contain more than k objects
• Measure local density: average distance from o to Nk(o)
  – Problem: fluctuations
Concepts
• Reachability distance
  – reachdist(o’→o) = max{dk(o), d(o,o’)}
  – Alleviates fluctuations
  – Not symmetric: reachdist(o’→o) ≠ reachdist(o→o’)
• Local density of o: inverse of the average reachability distance from o to the objects in Nk(o)
  $\mathrm{density}_k(o) = \frac{|N_k(o)|}{\sum_{o' \in N_k(o)} \mathrm{reachdist}(o \to o')} = \frac{|N_k(o)|}{\sum_{o' \in N_k(o)} \max\{d_k(o'),\, d(o,o')\}}$
• Different from the density definition in density-based clustering
  – Global/local
Example
• k = 2, use Euclidean distance
• Distance from o to o’s 2NN is 1
• dk(o) = 1
• Nk(o) = {p1, p2, p3}
  – dk(p1) = sqrt(0.64+1.0) = 1.28, dist(o,p1) = 0.8
  – dk(p2) = sqrt(2) = 1.41, dist(o,p2) = 1
  – dk(p3) = sqrt(0.32) = 0.57, dist(o,p3) = 1
  – reachdist(o→p1) = 1.28
  – reachdist(o→p2) = 1.41
  – reachdist(o→p3) = 1
• densityk(o) = 3/(1.28+1.41+1) = 0.813
[Figure: scatter plot with axes x and y showing the points O, p1, p2 and p3]
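The arithmetic of this example can be replayed directly from the listed numbers. The sketch below does not reconstruct the point coordinates (the figure is not available); it only takes each neighbor's k-distance and its distance from o, as given in the example, and applies reachdist(o→p) = max{dk(p), d(o,p)}.

```python
# (k-distance of the neighbor, distance from o to the neighbor), copied from the example.
neighbors = {
    "p1": (1.28, 0.8),
    "p2": (1.41, 1.0),
    "p3": (0.57, 1.0),
}

def reach_dist(dk_neighbor, dist_to_neighbor):
    """Reachability distance from o to a neighbor p: max{dk(p), d(o, p)}."""
    return max(dk_neighbor, dist_to_neighbor)

reach = {name: reach_dist(dk, d) for name, (dk, d) in neighbors.items()}

# Local density: number of neighbors divided by the sum of reachability distances.
density_o = len(reach) / sum(reach.values())

print(reach)
print(round(density_o, 3))
```

Note how p3, whose own k-distance (0.57) is smaller than its distance from o, contributes its actual distance, while p1 and p2 contribute their larger k-distances; this smoothing is what alleviates the fluctuations.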
• Local outlier factor (LOF) (or average relative density of o)
  – Average ratio of the local reachability density of each of the k-nearest neighbors of o to the local reachability density of o
  – The lower densityk(o) and the higher densityk(o’) ⇒ higher LOF → higher probability of being an outlier
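Putting the concepts together, a brute-force LOF computation might look like the sketch below. The LOF formula is not shown in the transcript, so the standard form is assumed: the average over o’ in Nk(o) of densityk(o’)/densityk(o). The sketch also assumes there are no duplicate points (otherwise a reachability sum can be zero), and the data set is made up.

```python
import math

def k_distance_and_neighbors(points, i, k):
    """Return (dk, Nk) for point i; Nk may hold more than k indices when distances tie."""
    dists = sorted((math.dist(points[i], q), j)
                   for j, q in enumerate(points) if j != i)
    dk = dists[k - 1][0]
    return dk, [j for d, j in dists if d <= dk]

def lof_scores(points, k):
    """Local outlier factor of every point, computed by brute force."""
    n = len(points)
    info = [k_distance_and_neighbors(points, i, k) for i in range(n)]

    def density(i):
        # |Nk(o)| divided by the sum of reachability distances max{dk(o'), d(o, o')}.
        nk = info[i][1]
        total = sum(max(info[j][0], math.dist(points[i], points[j])) for j in nk)
        return len(nk) / total

    dens = [density(i) for i in range(n)]
    return [sum(dens[j] for j in info[i][1]) / (len(info[i][1]) * dens[i])
            for i in range(n)]

# Hypothetical data: a tight group of five points and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (4.0, 4.0)]
scores = lof_scores(points, k=2)
for p, s in zip(points, scores):
    print(p, round(s, 2))
```

Points inside the group get scores near 1, because their density matches that of their neighbors, while the distant point gets a score well above 1.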
Example (continued)
• Then, calculate densityk(p1), densityk(p2), densityk(p3)
Clustering-based
• Clustering-based outlier: an object is a cluster-based outlier if the object does not strongly belong to any cluster.
• An outlier
  – an object belonging to a small and remote cluster
  – or not belonging to any cluster
Clustering-based
• Basic steps:
  – Cluster the data into groups of different density
• Three general approaches
  – An object does not belong to any cluster → outlier object
  – There is a large distance between an object and the cluster to which it is closest → outlier
  – The object is part of a small and sparse cluster → all the objects in that cluster are outliers
Approach 2
• There is a large distance between an object and the cluster to which it is closest → outlier
• Calculate the ratio below; the larger the ratio, the farther away o is from its closest cluster Co (with center co)
  $\mathrm{ratio} = \frac{d(o, c_o)}{\frac{1}{|C_o|} \sum_{o' \in C_o} d(o', c_o)}$
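The ratio can be computed once a clustering is available. The sketch below assumes the clusters are already given (for example as the output of some clustering algorithm, which is not shown), takes the cluster center to be the centroid, and does not add o to the cluster before averaging. The data and the function names are illustrative.

```python
import math

def centroid(cluster):
    """Per-dimension mean of the points in a cluster."""
    dim = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(dim))

def distance_ratio(o, clusters):
    """d(o, c_o) divided by the average distance of the closest cluster's members to c_o."""
    centers = [centroid(c) for c in clusters]
    idx = min(range(len(clusters)), key=lambda i: math.dist(o, centers[i]))
    members = clusters[idx]
    avg = sum(math.dist(p, centers[idx]) for p in members) / len(members)
    return math.dist(o, centers[idx]) / avg

# Hypothetical clustering result: two square-shaped clusters.
clusters = [
    [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)],
    [(10.0, 10.0), (10.0, 12.0), (12.0, 10.0), (12.0, 12.0)],
]

print(round(distance_ratio((1.0, 2.0), clusters), 3))   # object close to the first cluster
print(round(distance_ratio((5.0, 5.0), clusters), 3))   # object between the clusters
```

A ratio near or below 1 means the object is about as close to the center as a typical member; a threshold on the ratio, chosen per application, then separates outliers.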
Outliers in lower dimensional projection
• In high-dimensional space, data is sparse and the notion of proximity becomes meaningless
  – Every point is an almost equally good outlier from the perspective of proximity-based definitions
• Lower-dimensional projection methods
  – A point is an outlier if, in some lower-dimensional projection, it is present in a local region of abnormally low density
R packages
• https://cran.r-project.org/web/packages/outliers/outliers.pdf
• R parallel implementation of Local Outlier Factor (LOF), which uses multiple CPUs to significantly speed up the LOF computation for large data sets. https://cran.r-project.org/web/packages/Rlof/Rlof.pdf
• Python LOF implementation: http://shahramabyari.com/2015/12/30/my-first-attempt-with-local-outlier-factorlof-identifying-density-based-local-outliers/