
Unsupervised anomaly detection using Bayesian Networks and Gaussian Mixture Models

Antonio Cansado, Álvaro Soto

e-mail: [acansado,asoto]@ing.puc.cl

Departamento de Ciencia de la Computación

Casilla 306 - Santiago 22 - CHILE

June 5, 2005

Abstract

Working with large datasets makes manual classification infeasible. To address this problem, many knowledge discovery techniques have been introduced, several of them based on Bayesian Network modelling, which requires a high computational cost to find a suitable structure. A new algorithm called Bayesian Network Outlier Detector (BNOD) manages to find Bayesian Networks efficiently by exploiting properties of Gaussian Mixture Models together with advanced caching, making it highly suitable for outlier detection in massive databases. The main goal is to find a subset of objects that are potentially strange and to provide a brief explanation of what makes them different. This is of special interest for databases with high-dimensional continuous variables and thousands of objects, such as those found in astronomy.

1 Introduction

As storage costs have fallen to new levels, huge databases have become more and more common. The time and complexity required to manually analyse thousands of objects grow to such levels that it becomes infeasible to spend expert knowledge on a brute-force search. Therefore, many knowledge discovery techniques have been introduced in the past to find a subset of elements of greater interest. A particular application, for example, is searching for anomalies within a database in which only a few objects are unique.

There are many difficulties in deciding which objects are strange. In the real world, there are intrinsic measurement errors in the variables and many unknown relations within the data, making traditional boolean logic modelling useless, or at least impractical. The huge number of rules required and the deep domain knowledge needed would make any such system too complex to implement; therefore, more flexible models are needed. One excellent mathematical model for such domains is the probabilistic modelling of variables. Thanks to the probabilistic approach, we are relieved from classic boolean logic, where everything must be either true or false. Instead, we can make some situations very improbable, but not impossible, letting even unexpected values take place. This approach allows simpler and more flexible models without thousands of rules. Bayes' theorem has been used as a mathematical aid to simplify the joint probability by exploiting the conditional independencies between variables. If we establish all conditional independencies, both evidence and hypotheses can easily help us update our belief in an event, even when the data are uncertain. Bayesian Networks (BN) are graphical representations of such independencies. The key benefit of a BN is the factorization into smaller and simpler relations between variables without any loss of information: the joint probability can easily be recovered.

The most common objects receive a high probability and rare objects a low one. If we have the joint probability of the study domain, we can find strange objects simply by evaluating the probability of each object and selecting the lowest-scoring ones. Therefore, BN bring solid mathematical support to anomaly detection and, thanks to the factors, we can explain which set of variables has strange values. This makes BN much more attractive for rare object detection, because they do not work as a black box that merely selects a few objects; it is also possible to perform a mathematical analysis of each variable.

To make inference using Bayesian Networks, we need to know the network structure and all the local probabilities. An expert could give us the real, or at least an approximate, structure and the conditional probabilities that lead to our goal. Problems arise when the complexity of the database does not allow this approach or when an expert is not available. Fortunately, finding a structure can be reduced to an optimization problem that maximizes a given score. This is known to be an NP-hard problem (Cooper, 1987), and thus requires a heuristic search of the candidate space.

Due to the huge datasets found in astronomy, a highly efficient heuristic is needed. Instead of using a traditional Greedy Hill Climbing (GHC) search on the BN, where local changes such as adding, deleting, or reversing arcs require evaluating O(n^2) possible changes to each network, n being the number of variables, we efficiently shrink the search space by statistically selecting the most probable parents for each variable, similarly to the Sparse Candidate algorithm (Friedman, Nachman, and Peer, 2000). As astronomy variables are mostly continuous, we use Gaussian Mixture Models (GMM) to model the probability distributions.

A new algorithm called Bayesian Network Outlier Detector (BNOD) manages to find BN efficiently by exploiting GMM properties and advanced caching, making it highly suitable for outlier detection in massive databases. The main goal is to find a subset of objects that are potentially strange and to provide a brief explanation of what makes them different.

This paper is organized as follows: in Section 2 we provide background information on GMM and BN. In Section 3, we discuss BN learning techniques and describe the BNOD implementation. In Section 4 we show experimental results, and Section 5 presents the final discussion and related future work.

2 Background

2.1 Gaussian Mixture Model (GMM)

Given a set of d continuous variables X_i, 1 ≤ i ≤ d, finding the joint probability distribution p(\vec{X}) can be a challenge because, in most cases, the empirical data cannot be modelled by a simple function. Fortunately, it has been proved that such a distribution can be approximated with arbitrary accuracy using a sum of Gaussians p_h(\vec{X} \mid \vec{\mu}_h, \Sigma_h), each weighted by a membership probability w_h:

p(\vec{X}) = \sum_{h=1}^{g} w_h \, p_h(\vec{X} \mid \vec{\mu}_h, \Sigma_h), \qquad \vec{X} \in \mathbb{R}^d, \qquad \sum_{h=1}^{g} w_h = 1    (1)

p_h(\vec{X} \mid \vec{\mu}_h, \Sigma_h) = \frac{1}{\sqrt{(2\pi)^d \, |\Sigma_h|}} \exp\left\{ -\tfrac{1}{2} (\vec{x} - \vec{\mu}_h)^T \Sigma_h^{-1} (\vec{x} - \vec{\mu}_h) \right\}    (2)

The parameters of each Gaussian are its mean (\vec{\mu}_h), its covariance matrix (\Sigma_h), and its membership probability (w_h). These can be found using the Expectation Maximization (EM) algorithm. The computational cost is O(g · m · n^2) per iteration, where g is the number of Gaussians, m is the number of instances, and n is the number of dimensions. The EM implementation follows the work of (Zavala, 2005) on advanced caching strategies, similar to those described in (Moore, 1998; Bradley, Fayyad, and Reina, 1998), which reduces the complexity to manageable levels.
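As a concrete illustration, the following is a minimal sketch (in Python with NumPy; not the authors' implementation) of how a GMM density is evaluated according to Equations (1) and (2). The names `weights`, `means`, and `covs` are illustrative.

```python
import numpy as np

def gmm_density(x, weights, means, covs):
    """Evaluate p(x) = sum_h w_h * N(x | mu_h, Sigma_h), as in Equations (1)-(2)."""
    d = x.shape[0]
    total = 0.0
    for w, mu, cov in zip(weights, means, covs):
        diff = x - mu
        # Normalization constant 1 / sqrt((2*pi)^d |Sigma_h|) of Equation (2)
        norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
        expo = -0.5 * diff @ np.linalg.solve(cov, diff)
        total += w * norm * np.exp(expo)
    return total
```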

2.2 Bayesian Networks

A Bayesian Network can be defined as a directed acyclic graph G with a vertex V_i for each variable X_i in the domain. The joint probability p(\vec{X}) can be factorized as \prod_i p(X_i \mid Pa(X_i)), where Pa(X_i) is the parent set of X_i. Given the values of Pa(X_i) and no other information, X_i is conditionally independent of all variables that are not descendants of V_i in the graph G. Therefore, the edges define the independencies between variables. We denote by \langle G, \theta \rangle the BN with parameters \theta, where \theta includes all the edges and conditional probabilities needed.
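For example, in a three-variable chain X_1 \rightarrow X_2 \rightarrow X_3 the factorization reads p(X_1, X_2, X_3) = p(X_1) \, p(X_2 \mid X_1) \, p(X_3 \mid X_2), so two low-dimensional conditional models replace one three-dimensional joint model.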

2.3 Bayesian Network structure learning

Let \vec{X} = \{X_1, X_2, \ldots, X_n\} be a variable set and D = \{D_1, D_2, \ldots, D_m\}, D_i \in \mathbb{R}^n, a set of instances of \vec{X}; we want to find an optimum BN, defined as the BN which maximizes a scoring function. Continuous distributions need to be modelled with mathematical functions, and GMM can be chosen because they allow an efficient calculation of conditional probabilities. Given a GMM for p(X, \vec{Y}), the conditional p(X \mid \vec{Y}) is modelled with the same number of Gaussians, and \mu_{X|\vec{Y}} and \Sigma_{X|\vec{Y}} can be found as follows:

p(X \mid \vec{Y}) = \frac{p(X, \vec{Y})}{p(\vec{Y})} = \frac{1}{p(\vec{Y})} \sum_h w_h \, p_h(X, \vec{Y})    (3)

p(X \mid \vec{Y}) = \frac{1}{p(\vec{Y})} \sum_h w_h \, p_h(\vec{Y}) \, p_h(X \mid \vec{Y})    (4)

with \mu_{X|\vec{Y},h} = \mu_{X,h} + \Sigma_{X\vec{Y},h} \Sigma_{\vec{Y}\vec{Y},h}^{-1} (\vec{Y} - \mu_{\vec{Y},h}) and \Sigma_{X|\vec{Y},h} = \Sigma_{XX,h} - \Sigma_{X\vec{Y},h} \Sigma_{\vec{Y}\vec{Y},h}^{-1} \Sigma_{\vec{Y}X,h}.

These statistics are taken from partitions of the joint probability p(X, \vec{Y}). This procedure is similar to the one used in (Davies and Moore, 2000).
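Conditioning one Gaussian component on observed values reduces to the block-partition formulas above. A minimal sketch follows (Python/NumPy; the index arrays `x_idx` and `y_idx` marking the X and Y blocks are illustrative):

```python
import numpy as np

def conditional_component(mu, cov, y, x_idx, y_idx):
    """Condition one component N(mu, cov) of a joint GMM p(X, Y) on Y = y,
    returning mu_{X|Y,h} and Sigma_{X|Y,h} from the partition formulas above."""
    mu_x, mu_y = mu[x_idx], mu[y_idx]
    s_xx = cov[np.ix_(x_idx, x_idx)]
    s_xy = cov[np.ix_(x_idx, y_idx)]
    s_yy = cov[np.ix_(y_idx, y_idx)]
    gain = s_xy @ np.linalg.inv(s_yy)     # Sigma_{XY} Sigma_{YY}^{-1}
    mu_cond = mu_x + gain @ (y - mu_y)    # conditional mean
    cov_cond = s_xx - gain @ s_xy.T       # conditional covariance (Sigma_{YX} = Sigma_{XY}^T)
    return mu_cond, cov_cond
```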

To measure the fitness of the BN model, scoring functions are usually employed. The most used criteria are the likelihood and penalty-based Bayesian scoring metrics such as BIC (Cooper and Herskovits, 1992). The likelihood is defined as L(G, \theta : D) = \prod_m p(x_1^m, \ldots, x_n^m), where x_i^m is the value of attribute X_i in instance m.


2.4 Other definitions

A well-known function to measure the difference between two probability distributions is the Kullback-Leibler divergence D_{KL} (Kullback and Leibler, 1951), defined as:

D_{KL}(p(x) \,\|\, q(x)) = \sum_x p(x) \log \frac{p(x)}{q(x)}    (5)

This definition is useful for testing independence between variables. It can also be seen as a generalization of the mutual information I(x, y):

D_{KL}(p(x, y) \,\|\, p(x)p(y)) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)} = I(x, y)    (6)

The discrepancy between two variables X_i and X_j is defined as:

MDisc(X_i, X_j \mid B) = D_{KL}(P(X_i, X_j) \,\|\, P_B(X_i, X_j))    (7)

where P(X_i, X_j) is the empirical probability and P_B(X_i, X_j) the theoretical probability of X_i and X_j given the BN model B.
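Since BNOD later computes this discrepancy over discretized samples (Section 3.2), the following is a minimal sketch, assuming both distributions are given as t x t joint frequency tables:

```python
import numpy as np

def mdisc(empirical, model, eps=1e-12):
    """MDisc(Xi, Xj | B) = D_KL(P || P_B) over discretized bins (Equations 5 and 7).
    `empirical` and `model` are t x t joint frequency tables for (Xi, Xj)."""
    p = empirical / empirical.sum()
    q = model / model.sum()
    mask = p > 0    # bins with p(x) = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))
```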

3 BN structure learning algorithms

3.1 Previous work

Modern research in heuristic BN structure learning, such as the Sparse Candidate algorithm (Friedman, Nachman, and Peer, 2000), uses a GHC exploration based on a scoring function. As there are many independencies between the variables, exploring the whole set of possibilities would be infeasible. To solve this issue, the Sparse Candidate algorithm uses statistically motivated scores and knowledge of the current model to restrict the possible parent set of each variable. This way, there are only O(p · n) possible one-arc additions, where p is a user-defined maximum candidate parent set size. It greedily keeps the best one-arc addition and iterates again until a high-scoring network is found. In each iteration, new knowledge of the BN allows a refinement of the candidate parent sets. This is a two-phase algorithm: Restrict and Maximize.

Restricting the parent set can be tricky. If the wrong parent set is chosen, low-scoring networks will probably be found or many unneeded edges will be added. On the other hand, if a huge parent set is selected to ensure that all correct edges will be analysed, the computation costs will stall the algorithm. The discrepancy can be used to find which relations are not being represented by the network. Fortunately, this definition provides higher accuracy than traditional mutual information, which is strictly pairwise.

To find P_B(X_i, X_j), M samples are taken from the data and X_i is resampled from the BN given Pa(X_i). It is very interesting to see how this definition allows even weak parents to be found. This can be easily explained with a simple example. Let R be a three-variable network with X_1 \rightarrow X_2 \rightarrow X_3. Then p(X_3 \mid X_1, X_2) = p(X_3 \mid X_2), because X_1 and X_3 are conditionally independent given X_2. If the edge X_2 \rightarrow X_3 has been found, the empirical and theoretical probabilities of p(X_1, X_3) should be close enough, and thus MDisc(X_1, X_3 \mid B) should be low. Using this strategy, many edges can be skipped, allowing weaker parents to be considered. In the example, it is important to find X_2 before X_1 as a parent of X_3.

In the Maximization phase, GHC is used. We know GHC finds local maxima, so the final network is not necessarily optimal. One problem with GHC is that it propagates the errors of misleading choices, limiting the search space to lower-scoring networks. Also, the BN restriction to acyclic graphs makes it difficult to find correct networks from arbitrary starting points. There are strategies to avoid local maxima, such as Random Restart and TABU Search, which succeed in finding higher-scoring networks at a much higher computational cost, making them unattractive when dealing with huge databases.

The Sparse Candidate paper's emphasis is mainly on restricting to the correct parent set, and it does not investigate GHC improvements further. It is also applied to discrete variables; therefore, for astronomy applications where the variables are continuous, it is better to use a different model based on GMM, as described in Mixnets (Davies and Moore, 2000). In that work, both discrete and continuous variables are modelled together using GMM and discrete frequency tables. This requires high computational costs; therefore, the mutual information between all pairs of variables is computed only once. Then l (a user parameter) arcs are added for each variable, considering the variables with the highest mutual information as candidates, and this is repeated until convergence. Very few EM calls are needed thanks to the properties described in 2.3, but only very strong pairwise relations are found, due to the use of the mutual information criterion. Although it is highly scalable, it only manages to find networks at low local maxima; therefore, it is not suitable for strange object detection, where we need higher accuracy. Instead, we are interested in an algorithm with better accuracy that is still highly scalable. With such an algorithm, we should be able to detect strange objects in huge datasets.

3.2 BNOD implementation

BNOD is based on a Sparse Candidate implementation using GMM to model the continuous variables. Scoring functions such as the likelihood or BIC can be used; for testing purposes, we implement BNOD using a complexity-penalized likelihood criterion similar to BIC.

As described in (Silverman, 1986; Scott, 1992), any probability distribution can be approximated with arbitrary accuracy by a GMM. The problem is how many clusters (Gaussians) to use. Fortunately, this number can be found using techniques similar to those described in (Zavala, 2005). Considering the low dimensionality needed, thanks to the BN factorization into local conditional probabilities, the application to massive datasets becomes practical.

Suppose B is a BN \langle G, \theta \rangle that is initially empty (all variables are considered independent). All variables are continuous and are modelled with GMM using the accelerated versions of EM described in 2.1. The algorithm then iterates the following two phases (a compact sketch of the loop is given after the list):

• Restrict: Let S be an empty set of relations. For each X_i \in \vec{X}, compute MDisc(X_i, X_j \mid B) = D_{KL}(P(X_i, X_j) \,\|\, P_B(X_i, X_j)) for every X_j \in \vec{X} - Pa(X_i) - \{X_i\}. Select the p (a user-defined parameter) highest-scoring variables and add the relations X_j \rightarrow X_i to S. To find P_B(X_i, X_j), M samples are taken from the data and X_i is resampled from the BN given Pa(X_i). A discretization of the samples into t (a user parameter) bins is needed to compute the discrepancy efficiently.

• Maximize: For each edge X_j \rightarrow X_i in S, compute Score(B' : D), where B' is the BN \langle G \cup \{X_j \rightarrow X_i\}, \theta' \rangle. \theta' has the same parameters as \theta, except for the variable X_i, which has the additional parent X_j. The highest-scoring network is used as B in the next iteration. If Score(B' : D) - Score(B : D) \leq \delta, the algorithm stops; \delta is a user-defined convergence parameter.
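The sketch below condenses the two phases into a Python loop. It is schematic, not the authors' implementation: `score` and `mdisc` stand in for the GMM-based scoring and discrepancy routines described above, and the acyclicity check is spelled out explicitly.

```python
def creates_cycle(parents, xj, xi):
    """True if adding Xj -> Xi would close a cycle, i.e. Xi is an ancestor of Xj."""
    stack, seen = [xj], set()
    while stack:
        v = stack.pop()
        if v == xi:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False

def bnod(variables, p, delta, score, mdisc):
    """Schematic Restrict/Maximize loop of BNOD."""
    parents = {x: set() for x in variables}    # B starts with no edges
    best = score(parents)
    while True:
        # Restrict: keep the p most discrepant candidate parents per variable
        candidates = []
        for xi in variables:
            others = [xj for xj in variables
                      if xj != xi and xj not in parents[xi]]
            others.sort(key=lambda xj: mdisc(xi, xj, parents), reverse=True)
            candidates += [(xj, xi) for xj in others[:p]]
        # Maximize: greedily keep the single best arc addition Xj -> Xi
        gain, best_arc = delta, None
        for xj, xi in candidates:
            if creates_cycle(parents, xj, xi):
                continue
            parents[xi].add(xj)
            improvement = score(parents) - best
            parents[xi].remove(xj)
            if improvement > gain:
                gain, best_arc = improvement, (xj, xi)
        if best_arc is None:                   # no addition improves by more than delta
            return parents
        xj, xi = best_arc
        parents[xi].add(xj)
        best += gain
```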

We need to discretize the samples to allow a fast calculation of the discrepancy. O(n^2 · m) operations are required to collect enough statistics, but it is important to notice that a single pass over the whole database is sufficient to compute any pairwise frequency in constant time afterwards. Equally sized bins were used in the discretization as an approximation to the real values of the discrepancy. Further research is needed to find an optimum binning method, which could possibly enhance the edge prediction.

The parent set size p determines the computation cost of the Maximization phase and should therefore be set as low as possible. The network quality depends strongly on the capacity of MDisc to find the correct parent set. We can still use weak MDisc estimations if we set p high enough, but this is not recommended for high-dimensionality databases due to the computation costs.

As a scoring function, BNOD uses the likelihood and/or BIC. The likelihood is biased towards complex models, so BIC or another complexity-penalized criterion is the preferred way to go. It is important to choose a score which can be decomposed using the BN factorization. This way, we can reuse most of the partial computations: Score(G, \theta : D) = \sum_i Score(X_i \mid Pa(X_i), G, \theta). In the case of the likelihood, the detailed mathematical deduction is:

L(G, \theta : D) = \prod_m p(\vec{X}^m \mid G, \theta) = \prod_m \prod_i p(X_i^m \mid Pa^m(X_i), G, \theta)    (8)

= \prod_i \prod_m p(X_i^m \mid Pa^m(X_i), G, \theta) = \prod_i L(X_i \mid Pa(X_i))    (9)

Using the factorization shown above, only the variables whose parents have changed need to be recomputed; in the case of BNOD, only one. All partial likelihoods L(X_i \mid Pa(X_i)) can be cached for later use, avoiding O(a · m) operations for each reused calculation, where a is the number of attributes needed to compute the partial likelihood.
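As a sketch of this caching scheme (the `partial_loglik` routine, which would evaluate L(X_i | Pa(X_i)) from the GMM, is assumed):

```python
def make_scorer(partial_loglik):
    """Decomposed score sum_i L(Xi | Pa(Xi)) (Equation 9) with memoized terms:
    after a one-arc change, only the modified variable's term is recomputed."""
    cache = {}
    def score(parents):
        total = 0.0
        for xi, pa in parents.items():
            key = (xi, frozenset(pa))
            if key not in cache:    # a miss costs O(a * m); hits are free
                cache[key] = partial_loglik(xi, pa)
            total += cache[key]
        return total
    return score
```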

The detection of strange objects is done by evaluating each object's probability and selecting the \alpha (a user-defined parameter) worst ones. If the BN training was successful, most objects should be accurately modelled, and only a few will show up as low-probability objects. It is possible to have objects with very low probability that are of no interest; therefore, if higher accuracy is needed, additional clustering techniques can be used; see (Pichara, 2005) for further details. Deciding the correct value of \alpha depends directly on how well the BN fits the data.

Learning a BN from data hardly ever yields an optimum network, and many edges might not have a scientific explanation. Despite the differences between the real and the learned structures, if high-scoring networks are found, both should be adequate for strange object detection, because most objects will still be fitted by the model, leaving the rare ones apart. As said in Section 1, thanks to the BN factorization, each variable can give us some explanation of why an object was considered strange.
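A minimal sketch of this selection step, assuming the per-variable log-likelihood terms have already been evaluated into an m x n array:

```python
import numpy as np

def strangest_objects(loglik_terms, alpha):
    """Return the alpha lowest-likelihood objects plus, for each, the index of
    the variable with the worst partial likelihood (its 'explanation')."""
    totals = loglik_terms.sum(axis=1)                    # log p of each object
    worst = np.argsort(totals)[:alpha]                   # alpha least probable objects
    explanations = loglik_terms[worst].argmin(axis=1)    # most anomalous variable of each
    return worst, explanations
```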

3.3 BNOD2 extension

As seen in 2.3, when modelling with GMM we have these convenient properties available; however, BNOD does not use them to their full extent.

The size of the search space analysed is the bottleneck in the search for higher-scoring networks. As EM consumes approximately 90% of the computation cost, we cannot enlarge the search space by testing new edges that require new EM calls. Fortunately, there are many network configurations that can still be explored using the properties shown in 2.3. If p(A, B, C, D) was previously used to compute p(A \mid B, C, D), we can easily find p(B \mid A, C, D) without the computational cost of a new EM call. This can be seen as an edge-inversion process. There are a few restrictions. First, we need to compute the likelihood (or any score) for the new configuration; this requires at least O(a · m) operations, because it is a new partial likelihood calculation. Second, in the same example, if B had previous parents, those edges have to be removed, or a new expensive call to EM will be needed. Last, the BN acyclicity restriction must be guaranteed; therefore, other edges might be conflicting. We created the algorithm BNOD2, which is an extension of BNOD that includes these properties.

The Restrict phase is identical to BNOD's. For each edge X_j \rightarrow X_i in the candidate space, we have previously tested p(X_i \mid X_j, Pa(X_i)), where Pa(X_i) are the old parents of X_i. Therefore, we can test all the other combinations mentioned above:

p(X_j \mid X_i, Pa(X_i)), \qquad p(X_k \mid X_i, X_j, Pa_k(X_i))    (10)

where \{X_k\} \cup Pa_k(X_i) = Pa(X_i), X_k \notin Pa_k(X_i). The algorithm is as follows:

• Remove the parents of X_i.

• To compute p(X_j \mid X_i, Pa(X_i)), remove all parents of X_j and add \{X_i\} \cup Pa(X_i) as its new parents. If the network becomes cyclic, roll back. Otherwise, compute the Score and roll back.

• To compute p(X_k \mid X_i, X_j, Pa_k(X_i)), remove all parents of X_k and add \{X_i\} \cup \{X_j\} \cup Pa_k(X_i) as its new parents. If the network becomes cyclic, roll back. Otherwise, compute the Score and roll back.

Finally, we need to compare these new BN to the best one found by BNOD and keep the highest-scoring network. The normal iteration then proceeds as described in 3.1. This algorithm can be seen as a particular edge-reversing strategy that exploits GMM properties (a minimal sketch of one inversion test is given below). The computation costs should be somewhat greater than BNOD's, but the search space is wider without any additional EM call. The algorithm might require more iterations than BNOD because we actually have fewer edges after the inverting procedure.
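Under the same assumptions as the earlier BNOD sketch, one inversion test for the second bullet could look as follows: Xi's parents are removed, Xj receives {Xi} ∪ Pa(Xi), and every branch rolls back.

```python
def has_cycle(parents):
    """Detect a directed cycle by repeatedly peeling off parentless variables."""
    remaining = {x: set(pa) for x, pa in parents.items()}
    while remaining:
        roots = [x for x, pa in remaining.items() if not pa]
        if not roots:
            return True    # every remaining variable keeps a parent: cycle
        for r in roots:
            del remaining[r]
        for pa in remaining.values():
            pa.difference_update(roots)
    return False

def try_inversion(parents, xi, xj, score):
    """Score p(Xj | Xi, Pa(Xi)) by reusing the joint GMM already fitted for
    p(Xi | Xj, Pa(Xi)); returns None if the configuration would be cyclic."""
    old_pa_i, old_pa_j = set(parents[xi]), set(parents[xj])
    parents[xi] = set()              # remove the parents of Xi
    parents[xj] = {xi} | old_pa_i    # Xj conditioned on Xi and Xi's old parents
    s = None if has_cycle(parents) else score(parents)
    parents[xi], parents[xj] = old_pa_i, old_pa_j    # roll back in every case
    return s
```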


4 Experimental Results

We tested the BNOD and BNOD2 algorithms on both synthetic and real databases. The tests show that M = 5000 samples and t = 10 bins are good enough for most cases. More samples did not produce major changes in the ordering of the variables and are therefore of no significance. Fewer than 5 bins tend to increase the error, and 15 or more did not give any benefit. The memory needed to create the frequency table seems to be quite irrelevant, being approximately 16MB for a configuration with 200 variables and 10 bins each, and 256MB for 400 variables and 20 bins, using a simple table. All caches of EM calculations and Score functions were kept on disk for implementation simplicity.

4.1 Synthetic databases

Synthetic datasets were created using the following strategy (a sketch is given after the table). Given k and a set of n variables X_i, 1 ≤ i ≤ n:

• For 1 ≤ h ≤ k, create an orthogonal basis of n vectors which represents the covariance matrix \Sigma_h. This matrix is positive definite by construction.

• For 1 ≤ h ≤ k, create arbitrary mean values \vec{\mu}_h for the variables.

• Create an arbitrary membership probability vector \vec{w}, with \sum_{h=1}^{k} w_h = 1.

• Let G be an empty graph. For each X_i \in \vec{X}, add an arbitrary number of edges X_i \rightarrow X_j, 1 ≤ j < i, to G.

• For each X_i \in \vec{X}, create a conditional Gaussian with the restrictions imposed by G, using the \vec{w}, \mu and \Sigma statistics, similarly to 2.1.

Algorithm    EM Calls   Score      4 Frauds (%)*   8 Frauds (%)*
BNOD         572        -422.317   90 / 9.34       90 / 0.16
GHC          850        -425.075   90 / 9.25       90 / 0.29
Best-Case    -          -458.023   80 / 10.01      90 / 0.89

Table 1: Comparing different implementations (* detected / false positives)
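A sketch of the mixture part of this strategy (Python/NumPy; the eigenvalue scales drawn in `scales` are an illustrative way to make each Σ_h positive definite, and the BN edges and conditional Gaussians would then be derived as in 2.3):

```python
import numpy as np

def synthetic_mixture(n, k, m, seed=0):
    """Build k Gaussians with orthogonal-basis covariances, arbitrary means and
    membership weights, and draw m samples from the resulting mixture."""
    rng = np.random.default_rng(seed)
    weights = rng.dirichlet(np.ones(k))                  # memberships, sum to 1
    means = [rng.uniform(-5.0, 5.0, n) for _ in range(k)]
    covs = []
    for _ in range(k):
        q, _ = np.linalg.qr(rng.normal(size=(n, n)))     # orthogonal basis
        scales = rng.uniform(0.5, 2.0, n)                # positive eigenvalues
        covs.append(q @ np.diag(scales) @ q.T)           # positive definite by construction
    comp = rng.choice(k, size=m, p=weights)              # component of each instance
    data = np.stack([rng.multivariate_normal(means[h], covs[h]) for h in comp])
    return data, weights, means, covs
```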

This algorithm creates an acyclic BN with all conditional probabilities consistent. No cycles are possible due to the restriction on the edge order, and consistency is guaranteed because all conditional probabilities are taken from partitions of the same joint probability modelled with the GMM.

BNOD was tested with databases created as above. To test its capacity to detect strange objects, f frauds were introduced. Each fraud consisted of a valid instance in which c variables were given arbitrary (but within the normal range) values. This way, each fraud had n - c variables with values generated by the BN and c non-BN ones. The results can be seen in Table 1. The BN created consisted of 20 variables with 10 Gaussians. Frauds were introduced into 4 and 8 arbitrary variables.

It is interesting to see that 90% of the frauds were properly detected with less than 1% of false positives when c = 8. Additional information could be found by analysing the partial likelihood of each strange object, which made it quite clear which variables had been changed in each instance. An analysis of the false positives showed that their low-scoring variables were mostly at a noticeable distance from the expected value and were, therefore, genuinely very improbable.


Algorithm   # variables   EM calls   Score
BNOD        11            177        -219.545
BNOD2       11            197        -221.815
BNOD        20            572        -422.317
BNOD2       20            592        -426.852

Table 2: Comparing BNOD and BNOD2 on two databases

Another test used the BNOD algorithm with the real GMM parameters (Best-Case). These results show neither a better network score nor a better classification accuracy, due to the simpler models used in the GMM: the joint probabilities found using EM were over-fitted and therefore had a higher likelihood.

A traditional GHC implementation using GMM was also tested. The result was a high computational cost without any increase in BN quality. The classification accuracy was slightly worse, but within the estimation accuracy, showing that the restrictions on the candidate space do not have an important impact on network quality. The extra computation needed does not justify such similar results. Therefore, it can be said that BNOD was not able to achieve better results because of the limitations of the add-only edge strategy and the quality of the GMM parameters found by EM, not because of badly chosen candidate parents.

Finally, BNOD was tested against BNOD2. The results, shown in Table 2, are interesting: despite a wider search space, the algorithm did not find higher-scoring networks. The accuracy ratio is closer to GHC's and the time spent is similar to BNOD's. These results reflect that the algorithm found suboptimal networks during its iterations and stopped at low local maxima. It can be seen that the computation costs stayed similar to BNOD's thanks to the properties shown in 3.3.

4.2 Real databases

4.2.1 Astronomy strange object detection

An astronomical database from the CFHT was used to test the algorithm. The database consists of 79 variables and 104,386 objects, corresponding to galaxy colors and shapes. Only 15 variables were considered, because the others were not relevant to our study; this assertion was made by the astronomers who provided the database. The idea is to find unusual shapes and colors which could describe interesting new galaxies.

A fraud detection scheme similar to the one described in 4.1 was used, modifying 4 variables to simulate each fraud. After modelling, 80.7% of the frauds were detected with 0.97% false positives, corresponding to 84/104 and 1013/104,282 objects respectively. These false positives might be real strange objects within the database that are therefore assigned low probabilities. These rare objects are currently being analysed by experts, and further results will be published whenever possible. Once again, both the total and the partial likelihoods of the synthetic frauds were useful as an explanation of why these objects were considered different from the others.

4.2.2 Flaw detection

We tested our algorithm on flaw detection using pattern recognition, as shown in (Mery, da Silva, Caloba, and Rebello, 2003). Pictures of a metallic piece are taken using X-rays and converted to binary information, then processed using segmentation techniques, generating 405 characteristics for each region. Only 28 of them were used, previously selected by (Mery, da Silva, Caloba, and Rebello, 2003), and a total of 22,936 regions, of which 60 were visually classified as flaws, were selected. We blindly used BNOD and BNOD2 without any classification aid. The experimental results can be seen in Table 3, together with Mery's supervised algorithm.

Algorithm   EM calls   True Positives    False Positives
BNOD        234        52/60 (86.7%)     448/22,876 (1.96%)
BNOD2       307        56/60 (93.3%)     444/22,876 (1.94%)
Mery*       -          57/60 (95%)       230/22,876 (1%)

Table 3: Flaw detection using the 500 lowest-scoring objects (* supervised)

It is very interesting to see that the results found by the unsupervised and the supervised algorithms are very close. We did not apply any additional filtering or variable selection with BNOD and BNOD2, although that might enhance the classification accuracy. Setting the threshold to the lowest 1000 objects, we were able to keep all the flaws within our study set; therefore, almost 95% of the objects can be filtered out without missing any flaw.

Despite the lower accuracy found in 4.1, BNOD2 was able to achieve better results than BNOD, showing that it may have hidden strengths in certain domains. Both algorithms finished in less than half an hour on an Athlon 3200+ CPU. An increase in EM calls similar to the one seen in the synthetic domain was found in flaw detection, confirming that BNOD2 does increase the computation costs but explores a larger number of network structures.


5 Conclusions and future work

5.1 Conclusions

This paper's contributions are: presenting an efficient algorithm for anomaly detection without compromising network quality, and showing how BN with GMM can be used with high-dimensionality databases and thousands of objects.

The strength of the GMM representation of continuous attributes, together with optimized versions of EM and caching strategies throughout the whole algorithm, allows an efficient implementation of BN structure learning. Due to the BN's capacity for modelling noisy variables, it is possible to use a simple, yet powerful, model that represents most objects with high accuracy. As Bayesian inference is very fast once the BN is trained, this algorithm can effectively be used as a real-time filter.

This paper succeeded in showing how GMM properties can be exploited to achieve high-scoring networks. It is clear that BNOD's speed depends strongly on the EM implementation used to find the GMM parameters. Accelerated versions allow much faster computation and can therefore be used to analyse bigger datasets.

The use of statistically aided heuristics over the search space has a great impact on performance. Using incremental BN knowledge about the data, we can estimate better relations and spend more resources on the interesting parts of the search space, thus achieving higher-scoring networks. The factorization of the joint probability into each variable given its parents is of great use for rare object detection and explanation, showing that BN are well suited for such applications.


New algorithms exploring GMM properties can indeed achieve higher-scoring networks without a huge computational expense. BNOD2 was not able to find better networks on the synthetic datasets, but it was successful in the flaw detection application. One important note is that we never designed BNOD to work as a flaw detection algorithm, only as an outlier detector; however, it was still useful as a filter for such applications. The main reason is that only a few objects were indeed flaws; therefore, BNOD does not consider them when creating a model for the whole database. It rather spends most of its resources (structure parameters) trying to fit as many objects as it can, leaving the flaws badly fitted.

5.2 Future work

As the likelihood can be decomposed into factors, perhaps a different definition of strange object could be used: any object with a very low partial likelihood is strange. This way, even high-probability objects could be found, but all of them would have something special. Further research is needed to test this hypothesis.

We need to take special care with the variables set as roots. As they do not have any parents, the probability modelled by the GMM is very simple: it only considers their membership probability, mean, and variance. Therefore, it does not take into account any additional knowledge about the data, giving roots a lower score than other attributes.

In 3.3, a new methodology was shown. It is still an early implementation of the idea of efficiently using GMM properties for a more effective maximization. Additional research is recommended to explore new networks using the current cache of computations, such as scoring values and GMM/EM calls.

Despite the many optimizations in the GMM, when the database grows to millions of objects, the algorithm's performance can be severely degraded by the cost of the scoring function. Recent studies on adaptive sampling (Domingo, Gavalda, and Watanabe, 1999) are able to scale up knowledge discovery algorithms. Instead of using a fixed sample size, the algorithm dynamically decides the sample size needed to estimate a confidence interval defined by a user parameter. If we are not in the worst case, just a small portion of the data is needed to estimate the values correctly, thus avoiding huge costs in the score computation.
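As a minimal sketch of this idea (a Hoeffding-style stopping rule for a quantity bounded in [0, 1]; this follows the general scheme rather than the exact algorithm of (Domingo, Gavalda, and Watanabe, 1999)):

```python
import math

def adaptive_mean(draw, epsilon, delta, batch=100, max_n=10**6):
    """Grow the sample until the Hoeffding half-width sqrt(ln(2/delta) / (2n))
    drops below epsilon, instead of fixing the sample size in advance.
    (A rigorous version would union-bound over the repeated checks.)"""
    values = []
    while len(values) < max_n:
        values.extend(draw() for _ in range(batch))
        n = len(values)
        if math.sqrt(math.log(2.0 / delta) / (2.0 * n)) <= epsilon:
            break    # confident enough: stop sampling early
    return sum(values) / len(values)
```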

We are confident that it is possible to achieve a much better classification accuracy if BNOD is used together with supervised algorithms such as those in (Pichara, 2005). Flaw detection should be greatly improved, achieving more true positives and fewer false positives, which is very interesting for industrial applications where false positives must be minimized.

References

Bradley, P., U. Fayyad, and C. Reina (1998, August). Scaling EM (Expectation Maximization) clustering to large databases. Technical report, Microsoft Research.

Cooper, G. and E. Herskovits (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9, 309-347.

Cooper, G. F. (1987). Probabilistic inference using belief networks is NP-hard. Technical Report KSL-87-27.

Davies, S. and A. Moore (2000, April). Mix-nets: Factored mixtures of Gaussians in Bayesian networks with mixed continuous and discrete variables. pp. 168-175.

Domingo, C., R. Gavalda, and O. Watanabe (1999). Adaptive sampling methods for scaling up knowledge discovery algorithms. In Discovery Science, pp. 172-183.

Friedman, N., I. Nachman, and D. Peer (2000). Learning Bayesian network structure from massive datasets: The "Sparse Candidate" algorithm.

Kullback, S. and R. Leibler (1951). On information and sufficiency. Annals of Mathematical Statistics 22, 79-86.

Mery, D., R. da Silva, L. Caloba, and J. Rebello (2003, 2-6 June). Flaw detection in castings using pattern recognition. In 3rd Pan-American Conference for Nondestructive Testing (PANNDT 2003), Rio de Janeiro.

Moore, A. (1998). Very fast EM-based mixture model clustering using multiresolution kd-trees.

Pichara, K. (2005). Novedosa aplicación de técnicas de aprendizaje activo en la detección de anomalías en bases de datos (Novel application of active learning techniques to anomaly detection in databases). Engineering Thesis, Pontificia Universidad Católica.

Scott, D. (1992, August). Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley.

Silverman, B. (1986). Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability. London: Chapman and Hall.

Zavala, F. (2005). Density estimation in huge datasets: accelerating EM-GMM. Master's Thesis, Pontificia Universidad Católica.
