
Communicated by John Platt

A Probabilistic Resource Allocating Network for Novelty Detection

Stephen Roberts and Lionel Tarassenko
Neural Network Research Group, Department of Engineering Science, University of Oxford, Oxford, UK

The detection of novel or abnormal input vectors is of importance in many monitoring tasks, such as fault detection in complex systems and detection of abnormal patterns in medical diagnostics. We have developed a robust method for novelty detection, which aims to minimize the number of heuristically chosen thresholds in the novelty decision process. We achieve this by growing a gaussian mixture model to form a representation of a training set of "normal" system states. When previously unseen data are to be screened for novelty we use the same threshold as was used during training to define a novelty decision boundary. We show on a sample problem of medical signal processing that this method is capable of providing robust novelty decision boundaries and apply the technique to the detection of epileptic seizures within a data record.

1 Introduction

The detection of novelty (or abnormality) is a very important task in many diagnostic or monitoring systems. Sensors distributed around a plant or an engine, for example, will be used to monitor its performance. In safety-critical applications, it will be essential to detect the occurrence of an unexpected event as quickly as possible. This can best be done by a continuous on-line assessment of the novelty of successive sets of sensor readings.

If we have a training set of data, $\mathcal{T}$ say, which represents the states of some system, we can ask a simple question when a previously unseen data vector is presented to us: does the vector coincide with any of the data points in $\mathcal{T}$? If not, we can decide that the new vector is novel. This simplistic argument is, of course, flawed. The data set $\mathcal{T}$ would have to encode the entirety of "normal" system states and all data would have to be noiseless. Nature does not provide us with such luxuries and "real-world" data sets are incomplete and noisy. We wish, therefore, to form some representation of our data set before we can decide on the novelty, or otherwise, of new input vectors.

Neural Computation 6, 270-284 (1994) © 1994 Massachusetts Institute of Technology


The most appropriate representation is an estimate of the probability density function (PDF) of the data set we are given. For most problems, we have no a priori statistical information regarding this data set, i.e., we neither know the number of generators within the data set nor their underlying functions. Parametric estimation methods, therefore, cannot be applied. Nonparametric techniques, such as Parzen windows (Parzen 1962), require the application of a windowing, or kernel, function sited at every sample in the training set and are thus computationally expensive for large data sets with the added drawback that they may also model any noise that is in the training data (Ripley 1992; Bishop 1991). A more parsimonious representation of the training data is hence required.
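To make the computational cost of the nonparametric approach concrete, the following Python/NumPy sketch (ours, not from the paper; the function name and the single isotropic bandwidth h are illustrative assumptions) evaluates a Parzen-window estimate by placing one gaussian kernel on every training sample, so each query scales with the full size of the training set:

```python
import numpy as np

def parzen_density(x, training_set, h=0.5):
    """Parzen-window estimate of p(x): one isotropic gaussian kernel per
    training sample (assumed bandwidth h), so every evaluation touches all N samples."""
    n, d = training_set.shape
    sq_dist = np.sum((training_set - x) ** 2, axis=1)   # squared distances to all samples
    norm = (2.0 * np.pi * h ** 2) ** (-d / 2.0)
    return np.mean(norm * np.exp(-0.5 * sq_dist / h ** 2))
```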

Semiparametric methods using kernel estimation (where the number of kernel functions is less than the number of $\mathbf{x} \in \mathcal{T}$ but still large compared to the probable number of generators within the data) offer most of the advantages of nonparametric methods along with computational economy (Travén 1991; Ripley 1992).

1.1 Semiparametric Estimation. Semiparametric methods assume that the data set can be encoded, generally, as a parameterization of the statistical moments of the data (typically the first and second moments only). It should be noted that semiparametric estimation may be regarded as a clustering or partitioning of the data set in terms of a set of cluster means (first moments) and covariance matrices (second moments). The K-means algorithm clusters using a model where data points are "hard-partitioned" into subgroups, that is, membership functions¹ are either 1 or 0. Gath and Geva (1989) reported on a method for fuzzy partitioning of data using a variant of the maximum likelihood approach. Such an approach is more realistic in that membership functions are continuous variables and has much in common with statistical techniques for parameter estimation in a semiparametric model.²

We assume that the underlying PDF may be represented as a mixture of a finite number of component densities. Much work has, historically, concentrated on the use of (nonorthogonal) gaussian functions as model generators for the component densities. The choice of gaussians is not an arbitrary one, indeed it is possible to show that a finite sum of gaussian responses can approximate, with arbitrarily small error, any PDF (the property of universal approximation) (Park and Sandberg 1991).

2 Theory

We consider some finite training set of data, $\mathcal{T}$, consisting of a sequence of $d$-dimensional feature vectors, $\mathbf{x} = [x_1, \ldots, x_d]^T$, such that $\mathbf{x} \in \mathcal{T} \subset \mathbb{R}^d$.

¹The degree of association between a data vector and some data cluster.
²In the framework of neural networks, there is an intimate link between kernel-based PDF estimation (or fuzzy data partitioning) and the formulation of the hidden layer of a radial-basis-function (RBF) network (Broomhead and Lowe 1988).


Let the number of members of $\mathcal{T}$ be $N$ and the number of gaussian kernels at a given time be $K$. Each gaussian kernel has two sets of degrees of freedom, its vector location in input space (the first moment) $\mathbf{m}$, and a smoothing function (here generalized as a covariance matrix, the second moment) $\mathbf{F}$. The response of the $j$th such kernel function to an input feature vector, $\mathbf{x}$, is denoted as $\phi(\mathbf{x}; \mathbf{m}_j, \mathbf{F}_j)$ or, more simply, as $\phi_j(\mathbf{x})$.

Bayes' rule specifies that the mixture density may be written as

$$p(\mathbf{x}) = \sum_{k=1}^{K} p(k)\, p(\mathbf{x} \mid k) \tag{2.1}$$

where $p(k)$ is the prior probability of selecting the $k$th kernel function and $p(\mathbf{x} \mid k)$ is the conditional density of $\mathbf{x}$ on the $k$th kernel. If a gaussian mixture model is used then $p(\mathbf{x} \mid k)$ is simply the response of the $k$th gaussian function in the mixture and equation 2.1 can be rewritten as

$$p(\mathbf{x}) = \sum_{k=1}^{K} p(k)\, \phi(\mathbf{x}; \mathbf{m}_k, \mathbf{F}_k) \tag{2.2}$$

The set of priors is subject to the constraints $\sum_{k=1}^{K} p(k) = 1$ and $0 \le p(k) \le 1$ for each $k$. Bayes' theorem for densities specifies that the posterior probability of selecting the $j$th kernel function given feature vector $\mathbf{x}$ can be written as

$$p(j \mid \mathbf{x}) = \frac{p(j)\, \phi(\mathbf{x}; \mathbf{m}_j, \mathbf{F}_j)}{\sum_{k=1}^{K} p(k)\, \phi(\mathbf{x}; \mathbf{m}_k, \mathbf{F}_k)} \tag{2.3}$$
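As a concrete illustration of equations 2.1-2.3, a minimal Python/NumPy sketch follows (our own, with assumed function names); it evaluates the full gaussian responses, the mixture density of equation 2.2, and the posteriors of equation 2.3 for one input vector:

```python
import numpy as np

def gaussian_response(x, m, F):
    """Full gaussian component density phi(x; m, F), cf. equation 2.6."""
    d = len(m)
    diff = x - m
    expo = -0.5 * diff @ np.linalg.inv(F) @ diff
    norm = (2.0 * np.pi) ** (-d / 2.0) * np.linalg.det(F) ** (-0.5)
    return norm * np.exp(expo)

def mixture_density_and_posteriors(x, priors, means, covs):
    """p(x) from equation 2.2 and the posteriors p(j | x) from equation 2.3."""
    responses = np.array([gaussian_response(x, m, F) for m, F in zip(means, covs)])
    joint = priors * responses          # p(k) * phi_k(x)
    px = joint.sum()                    # equation 2.2
    return px, joint / px               # equation 2.3
```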

The conventional approach for defining the free parameters of each gaussian function is to maximize the log-likelihood over all $\mathbf{x} \in \mathcal{T}$, namely³

$$\sum_{i=1}^{N} \log p(\mathbf{x}_i) \tag{2.4}$$

Upon substitution from equation 2.2, solutions that maximize 2.4 can be sought, subject to the constraints imposed on the priors, using Lagrange multipliers. Details may be found in several papers (Dempster et al. 1977; Travén 1991) and we quote the results here. The form of the free parameters of each $\phi_j$ may be specified by seeking $[\mathbf{m}_j, \mathbf{F}_j]$ that satisfy

$$\frac{\partial}{\partial[\mathbf{m}_j, \mathbf{F}_j]} \sum_{i=1}^{N} \log p(\mathbf{x}_i) = \sum_{i=1}^{N} p(j \mid \mathbf{x}_i)\, \frac{\partial}{\partial[\mathbf{m}_j, \mathbf{F}_j]} \log \phi(\mathbf{x}_i; \mathbf{m}_j, \mathbf{F}_j) = 0 \tag{2.5}$$

The solutions of equation 2.5, specifying a gaussian component density of the form

$$\phi(\mathbf{x}; \mathbf{m}_j, \mathbf{F}_j) = (2\pi)^{-d/2}\, |\mathbf{F}_j|^{-1/2} \exp\!\left[-\tfrac{1}{2}(\mathbf{x} - \mathbf{m}_j)^T \mathbf{F}_j^{-1} (\mathbf{x} - \mathbf{m}_j)\right] \tag{2.6}$$

³Assuming independence between data samples.


are given by

$$\mathbf{m}_j = \frac{\sum_{i=1}^{N} p(j \mid \mathbf{x}_i)\, \mathbf{x}_i}{\sum_{i=1}^{N} p(j \mid \mathbf{x}_i)} \tag{2.7}$$

and

$$\mathbf{F}_j = \frac{\sum_{i=1}^{N} p(j \mid \mathbf{x}_i)\, (\mathbf{x}_i - \mathbf{m}_j)(\mathbf{x}_i - \mathbf{m}_j)^T}{\sum_{i=1}^{N} p(j \mid \mathbf{x}_i)} \tag{2.8}$$
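A batch re-estimation step corresponding to equations 2.7 and 2.8 can be written, for a single kernel $j$, as in the following sketch (ours; the array-shaped interface is an assumption):

```python
import numpy as np

def reestimate_kernel(X, post_j):
    """Batch ML re-estimation of one kernel.
    X      : (N, d) array of training vectors x_i
    post_j : (N,) array of posteriors p(j | x_i) from equation 2.3
    Returns the updated mean (equation 2.7) and covariance (equation 2.8)."""
    w = post_j / post_j.sum()
    m_j = w @ X                               # equation 2.7: posterior-weighted mean
    diffs = X - m_j
    F_j = (w[:, None] * diffs).T @ diffs      # equation 2.8: posterior-weighted covariance
    return m_j, F_j
```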

Solutions to equations 2.7 and 2.8 require a recursive method for nonlinear optimization, for example the expectation-maximization (EM) algorithm (Dempster et al. 1977). An iterative version of this nonlinear optimization scheme, which resembles reinforcement learning, can also be formulated (Lowe 1991; Travén 1991; Neal and Hinton 1993).

With the method described in this paper, we also solve equations 2.7 and 2.8 by reinforcement learning, using an adaption, or learning, parameter (defining the "cooling curve" of the network), which is gradually reduced during training. Defining this parameter as $0 \le \alpha_t < 1$, the iterative equations are as follows:

$$\mathbf{m}_{j,t+1} = \frac{\mathbf{m}_{j,t} + \alpha_t\left[p(j \mid \mathbf{x}_t)\, \mathbf{x}_t - \mathbf{m}_{j,t}\right]}{(1 - \alpha_t) + \alpha_t\, p(j \mid \mathbf{x}_t)} \tag{2.9}$$

$$\mathbf{F}_{j,t+1} = \frac{\mathbf{F}_{j,t} + \alpha_t\left[p(j \mid \mathbf{x}_t)\, (\mathbf{x}_t - \mathbf{m}_{j,t})(\mathbf{x}_t - \mathbf{m}_{j,t})^T - \mathbf{F}_{j,t}\right]}{(1 - \alpha_t) + \alpha_t\, p(j \mid \mathbf{x}_t)} \tag{2.10}$$

where $\mathbf{x}_t$ is the data vector randomly selected from $\mathcal{T}$ at the $t$th iteration. Convergence of equations 2.9 and 2.10 to equations 2.7 and 2.8 as $\alpha_t \to 0$ may be confirmed analytically (see Appendix).
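In code, one pass of the iterative scheme of equations 2.9 and 2.10 for a single kernel might look as follows (a sketch under our own naming assumptions; the posterior p(j | x_t) is computed from equation 2.3):

```python
import numpy as np

def online_update(x_t, m_j, F_j, post_j, alpha_t):
    """One iterative update of kernel j (equations 2.9 and 2.10).
    post_j  : posterior p(j | x_t) for the randomly selected vector x_t
    alpha_t : adaption (learning) parameter, 0 <= alpha_t < 1"""
    denom = (1.0 - alpha_t) + alpha_t * post_j
    m_new = (m_j + alpha_t * (post_j * x_t - m_j)) / denom                    # equation 2.9
    diff = x_t - m_j
    F_new = (F_j + alpha_t * (post_j * np.outer(diff, diff) - F_j)) / denom   # equation 2.10
    return m_new, F_new
```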

2.1 Network Growth. The major problem with any form of novelty detection is the choice of an appropriate "novelty threshold." We have attempted to minimize the number of heuristically determined thresholds and parameters, and the main point of our paper is that once the network is fully trained, new data may be tested for novelty using the same threshold as was used to determine network growth during training.

We define a test parameter, $X(\mathbf{x}_t)$, such that

$$X(\mathbf{x}_t) = \max_j \left\{\hat{\phi}(\mathbf{x}_t; \mathbf{m}_j, \mathbf{F}_j)\right\} \tag{2.11}$$

where $\mathbf{x}_t$ is the input vector presented at time $t$ during training and $\hat{\phi}(\cdot)$


is the response of a gaussian kernel, such that $\hat{\phi}(\mathbf{x} = \mathbf{m}) = 1$, namely

$$\hat{\phi}(\mathbf{x}; \mathbf{m}, \mathbf{F}) = \exp\!\left[-\tfrac{1}{2}(\mathbf{x} - \mathbf{m})^T \mathbf{F}^{-1} (\mathbf{x} - \mathbf{m})\right] \tag{2.12}$$

We monitor the value of $X(\mathbf{x}_t)$ at each data presentation and use it to decide whether the network should grow by one further gaussian unit, based on the following criterion:

$$X(\mathbf{x}_t) \begin{cases} \le \varepsilon_t & \rightarrow \text{growth} \\ > \varepsilon_t & \rightarrow \text{no growth} \end{cases}$$

Taking natural logarithms of equation 2.12 leads to a reformulated growth criterion of

$$\min_j \left\{(\mathbf{x}_t - \mathbf{m}_j)^T \mathbf{F}_j^{-1} (\mathbf{x}_t - \mathbf{m}_j)\right\} \ge \Theta_t \;\rightarrow\; \text{growth}$$

where $\Theta_t = 2\ln(1/\varepsilon_t)$. The growth decision is thus based on monitoring the smallest Mahalanobis distance between $\mathbf{x}_t$ and each $\mathbf{m}$ within the network. Network growth is thus similar to that proposed by other authors, being based on some distance metric (Sebestyen 1962; Platt 1991). Note that two data vectors may have identical minimum Euclidean distances, but differing minimum Mahalanobis distances depending on the statistics of the data distribution (Fig. 1).
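The growth test can be sketched directly from equations 2.11 and 2.12; the code below (ours, with assumed names) computes X(x) as the largest normalized kernel response and grows when it does not exceed the current threshold, which is equivalent to the Mahalanobis-distance form above:

```python
import numpy as np

def test_parameter(x, means, covs):
    """X(x) = max_j phi_hat(x; m_j, F_j), with phi_hat normalized so phi_hat(m_j) = 1
    (equations 2.11 and 2.12)."""
    best = 0.0
    for m, F in zip(means, covs):
        diff = x - m
        mahal = diff @ np.linalg.inv(F) @ diff     # squared Mahalanobis distance
        best = max(best, np.exp(-0.5 * mahal))
    return best

def should_grow(x, means, covs, eps_t):
    """Growth decision: add a kernel when X(x) <= eps_t, i.e. when the smallest
    Mahalanobis distance is at least Theta_t = 2 ln(1 / eps_t)."""
    return test_parameter(x, means, covs) <= eps_t
```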

There are no kernel centers when the network is initialized. The first presentation of data causes the addition of a single kernel function, positioned at the site of the first feature vector. As usual, the training set is presented in random order and the first kernel function is adapted according to equations 2.9 and 2.10 until some $X(\mathbf{x})$ becomes equal to, or falls below, the $\varepsilon_t$ threshold. At this point a new kernel function is generated.

The growth threshold, $0 \le \varepsilon_t \le 1$, is initially set as $\varepsilon_0 = 0$; that is, growth will occur only if some $\mathbf{x}_t$ has a 0% chance of having been generated by the network at that time. The magnitude of $\varepsilon_t$ increases linearly with time according to

$$\varepsilon_t = \varepsilon_{\max}\, \min\!\left(\frac{t}{\tau_\varepsilon},\, 1\right)$$

The network thus starts fitting kernel functions to model the statistics of $\mathcal{T}$ coarsely; as $\varepsilon_t$ increases, the precision of the fit becomes progressively better. The upper limit, $\varepsilon_{\max}$, specifies the final precision to which the statistics of $\mathcal{T}$ are represented by the kernel functions. Note that if $\varepsilon_{\max}$ is set to unity, then the network will grow until a kernel function is allocated to every data vector in $\mathcal{T}$.⁴

⁴It should be noted that such a system could result in ill-conditioned covariance matrices.


Figure 1: Growth based on a weighted distance criterion. There are three clear clusters in this artificial 2-D problem, indicated by the covariance ellipses around each cluster center (filled square); a new data vector falls in the "growth" region relative to the ellipses.

When a new kernel function, index $n$ say, is generated at time $t$, its position vector is set as

$$\mathbf{m}_n = \mathbf{x}_t$$

The initial estimate for the covariance matrix, $\mathbf{F}_n$, is uniform and symmetric, with each component of $\mathbf{F}_n$ defined in terms of $\mathbf{C} = (\mathbf{m}_n - \mathbf{m}_l)(\mathbf{m}_n - \mathbf{m}_l)^T$, in which $l$ is the index of the kernel function that, prior to growth, had the largest posterior $p(l \mid \mathbf{x}_t)$. To avoid having to make any assumptions about the distribution of the data in the training set, we take all the priors, $p(j)$, to be equal, their value being updated after growth such that

$$p(j) = \frac{1}{K_t}$$

for all $j \le n$, where $K_t$ is the current number of kernel functions.
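Allocation of a new kernel can be sketched as below (ours). The paper defines the initial covariance componentwise from C = (m_n − m_l)(m_n − m_l)^T; the isotropic stand-in (tr(C)/d)·I used here is an assumption for illustration only, as is the list-based interface:

```python
import numpy as np

def allocate_kernel(x_t, means, covs, posteriors):
    """Grow the network by one kernel centred on x_t and reset all priors to 1/K_t."""
    l = int(np.argmax(posteriors))         # kernel with the largest posterior before growth
    d = len(x_t)
    diff = x_t - means[l]
    C = np.outer(diff, diff)
    F_n = (np.trace(C) / d) * np.eye(d)    # assumed isotropic initialization derived from C
    means.append(x_t.copy())
    covs.append(F_n)
    K_t = len(means)
    priors = np.full(K_t, 1.0 / K_t)       # equal priors after growth
    return means, covs, priors
```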


2.1.1 Local "Cooling". There is one remaining problem with such network growth, especially if it occurs late during the training process. If $\alpha_t$, the adaption gain, is a scalar (i.e., equal and decreasing globally for all kernel functions) then a new kernel function may not be sufficiently "plastic" to adapt adequately to the statistics of the region of input space to which it is sensitive. This problem can be avoided by allowing both $\alpha_t$ and the time index $t$ to be vectors, with components for each kernel function. The dimensions of $\boldsymbol{\alpha}$ and $\mathbf{t}$ thus grow with the number of kernel functions. As detailed in the Appendix, the adaption gain for the $j$th kernel function is

$$\alpha_{j,t} = \frac{\alpha_0}{t_j + \tau_\alpha}$$

where $t_j$ is the component of the time vector for kernel function $j$. When a new kernel function, index $n$, is added, we set $t_n$ to zero. This ensures that the new kernel function is "plastic" ($\alpha_n$ is large) and its free parameters can converge with minimal disturbance to all other kernel functions (remember that, in the intervals between new kernel functions being generated, all existing kernels continue to be adapted using equations 2.9 and 2.10). Training ends when $\min\{t_j\}$ reaches some predefined limit, $t_l$ say, where $t_l \gg N$, $N$ being the number of $\mathbf{x} \in \mathcal{T}$, such that we may assume that every gaussian has seen every member of $\mathcal{T}$ at least once.
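A minimal sketch of the per-kernel gain (ours; the functional form alpha_0 / (t_j + tau_alpha) is taken from the Appendix, the default values from Section 2.1.2):

```python
def adaption_gain(t_j, alpha_0=0.7, tau_alpha=1.0):
    """Per-kernel adaption gain alpha_{j,t} = alpha_0 / (t_j + tau_alpha); t_j is reset
    to zero when kernel j is created, so a newly grown kernel stays plastic."""
    return alpha_0 / (t_j + tau_alpha)
```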

2.1.2 Choice of Parameters. In implementing the algorithm we have imposed several model constraints. In addition to making all kernel priors equal, we let the covariance matrices, $\mathbf{F}_j$, be diagonal, without loss of generality. This leads to a considerable reduction in computing overheads when the dimensionality of input space is large, since matrix inversion then becomes a simple task.

The choice of the parameters $\tau_\varepsilon$, $\tau_\alpha$, and $\alpha_0$ is not critical, save that we require the rate of increase of $\varepsilon_t$ to be small compared to the initial rate of decrease of $\alpha_t$. We choose $\tau_\varepsilon$ to be equal to $N$, the number of $\mathbf{x} \in \mathcal{T}$, so that $\varepsilon_t$ reaches its maximum value after one iteration through the training data, and $\tau_\alpha$ is set to unity. In all results presented here $\alpha_0 = 0.7$.

2.2 Novelty Detection. Once training is complete we know that no member $\mathbf{x}_i$ of $\mathcal{T}$ has $X(\mathbf{x}_i) \le \varepsilon_{\max}$. On presentation of some previously unseen test data, $\mathbf{u}$ say, we may calculate $X(\mathbf{u})$ via equations 2.11 and 2.12. Our novelty criterion makes use of the threshold $\varepsilon_{\max}$, such that

$$X(\mathbf{u}) \begin{cases} \le \varepsilon_{\max} & \rightarrow \mathbf{u} \text{ is a novel vector} \\ > \varepsilon_{\max} & \rightarrow \mathbf{u} \text{ is not novel} \end{cases}$$
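Screening an unseen vector u therefore reuses the training-time quantities unchanged; a self-contained sketch (ours) of the decision rule is:

```python
import numpy as np

def is_novel(u, means, covs, eps_max):
    """Novelty decision: u is novel iff X(u) <= eps_max, the same threshold that
    controlled network growth during training."""
    best = 0.0
    for m, F in zip(means, covs):
        diff = u - m
        best = max(best, np.exp(-0.5 * diff @ np.linalg.inv(F) @ diff))
    return best <= eps_max
```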


3 An Example from Medical Signal Processing

In earlier papers (Roberts and Tarassenko 1992a,b), we have described a method for analyzing the human electroencephalogram (EEG) during sleep using a radial-basis-function (RBF) network. Our results have shown that it is possible to classify sleep states from patterns of electrical activity recorded during sleep from the cortex using scalp electrodes. There are many patients, however, whose EEG records contain abnormal signal features (epileptics, for example). With these patients, unexpected events, not represented in the training set, are likely to occur during the patient's EEG recording. It is very important clinically to identify these novel input vectors as they occur.

To validate the method of novelty detection described in this paper for such an application, we have constructed an artificial test problem using our EEG data, in which all the input vectors recorded during times of wakefulness⁵ are deliberately excluded from the training database. Such a database, consisting of 3644 10-dimensional input vectors,⁶ was assembled and a representation of this database was formed using test thresholds of $\varepsilon_{\max} = 0.1$, which led to the growth of a total of 97 gaussian functions, and $\varepsilon_{\max} = 0.2$, which gave 258 gaussian functions.

Figure 2 shows the time course of $K_t$, the number of gaussian functions during training, for the two different values of $\varepsilon_{\max}$. The trained networks were then presented with patterns from a previously unseen EEG recording (Fig. 3a and b). At the beginning of this recording, the patient is not asleep and this is clearly indicated by the value of $X(\mathbf{x}_t)$ falling below the novelty thresholds in each case, save for three short occurrences of drowsy sleep. As the subject falls asleep ($t = 50$), the data are no longer identified as being novel with respect to the training data. There are subsequent episodes during which $X(\mathbf{x}_t)$ decreases to lower values, but these correspond either to body movements (during which the subject's EEG state is that of wakefulness, for very short periods of time) or to a recording dropout (end of record). The training data did not contain any dropouts, hence its identification as a "novel" feature. It is important to make the point that novelty detection does not specify a particular class of input vectors, merely the fact that these vectors are novel with respect to the training database. We see from Figure 3 that, as long as enough gaussian functions are grown to represent the complexity of the problem, the novelty decision is robust when the value of $\varepsilon_{\max}$, and hence the number of kernel functions, is altered.

⁵As defined by the consensus of three human scorers working from a rule-based scoring system.
⁶As described in Roberts and Tarassenko (1992a,b), the EEG signal is parameterized on a time scale of 1 sec using a 10th-order Kalman filter.


Figure 2: Growth of gaussian functions during training on the nonwake EEG database: (a) $\varepsilon_{\max} = 0.1$ and (b) $\varepsilon_{\max} = 0.2$.

3.1 Detection of Epileptic Seizures. We have now carried out a pilot study of our novelty detection algorithm on an EEG record known to contain epileptic seizures (i.e., "abnormal events"). From the 20 min of available data, a training database of 1000 "normal" EEG segments was constructed. The algorithm described in this paper was used to form a gaussian mixture representation of these data, using a threshold of $\varepsilon_{\max} = 0.1$. A total of 195 gaussian kernels was grown by the algorithm. Figure 4 shows the time course of $X(\mathbf{x}_t)$ (upper trace) along with the error term from the 10th-order Kalman filter, the coefficients of which are used as the input representation to the network (lower trace). The four major peaks (A, B, C, and D) in the novelty trace correspond to epileptiform activity, which is also shown up by the Kalman error term. There are two revealing areas of discrepancy between the two plots, however:

1. The peak at E in the Kalman error term ($t \approx 950$ sec) corresponds to high-frequency muscle artifact. This type of artifact is present elsewhere in the training database and the novelty threshold is therefore not crossed (the reason for the discontinuity in the Kalman error term is the exceptionally large amplitude of the artifact).

Figure 3: Time course of $X(\mathbf{x})$ during presentation of unseen test data including data novel to the system (sections of wake-state EEG): (a) 97 gaussian units, $\varepsilon_{\max} = 0.1$ and (b) 258 gaussian units, $\varepsilon_{\max} = 0.2$. The novelty decision threshold is shown in each case.

2. Between the first peak at A ($t \approx 75$ sec) and the second one at B ($t \approx 400$ sec), there are four smaller peaks (1, 2, 3, and 4) that can be identified from the novelty trace but not from the Kalman filter error term. On returning to the original EEG record, it is quite clear that these short bursts of "novel" activity correspond to short sections of signal that do exhibit seizure-type waveforms.

The above represents only our first results from a short pilot study but the detection of novel events between the first two seizures does represent a promising beginning.



Figure 4: Time course of $X(\mathbf{x})$ during presentation of an EEG record with epileptic seizures (upper trace) and Kalman filter error term from the same record (lower trace).

4 Conclusions

In this paper, we have introduced a method for the detection of novelty, based on a gaussian mixture model. The main advantage of our method lies in the fact that the detection of novelty is robust to the choice of threshold, provided that a sufficient number of kernel functions is used to build up the representation of the training data set. This was demonstrated on a medical signal processing problem, specifically constructed from real data for the purposes of this paper. We are now using our method as a screening tool for the detection of unexpected abnormalities in EEG recordings. We are also extending its use to other monitoring systems such as plasma diagnostics (Bishop et al. 1993).

In the implementation of the model presented here we have assumed that all kernel priors are equal. We may, however, allow the priors to be free parameters of the model and attempt to evaluate them using nonlinear optimization. The kernel priors may be regarded as an unbiased


estimate of kernel posteriors, given all the data set (Travén 1991), namely

$$p(j) = \frac{1}{N} \sum_{i=1}^{N} p(j \mid \mathbf{x}_i) \tag{4.1}$$

a solution for which may be sought by reinforcement learning (see Appendix):

$$p(j)_{t+1} = p(j)_t + \alpha_t\left[p(j \mid \mathbf{x}_t) - p(j)_t\right] \tag{4.2}$$
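In code, the prior update of equation 4.2 is a single relaxation step toward the current posterior (a sketch, with our own names):

```python
import numpy as np

def update_priors(priors, posteriors, alpha_t):
    """Reinforcement-learning update of the kernel priors (equation 4.2):
    p(j)_{t+1} = p(j)_t + alpha_t * [p(j | x_t) - p(j)_t]."""
    return priors + alpha_t * (posteriors - priors)
```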

If priors are allowed to be free parameters it is possible to prune the trained network by removal of kernel functions with small priors, since these kernels typically correspond to outliers within $\mathcal{T}$. The removal of outliers from a training set, however, is a complex problem in pattern recognition (Ripley 1992; Beckman and Cook 1983) and this strategy would have to be investigated in detail.

The algorithm described in this paper does have some similarities with previous work by others, for example, the pioneering work of Sebestyen (1962), who describes a method to build a representation of data in input space using gaussian kernels of uniform width centered on a subset of the training patterns. New input patterns are then classified according to their Euclidean distances from the cluster means, each of which has a class label attached to it. We believe that there are two weaknesses associated with this approach: first, the use of the same width for all clusters and of the Euclidean metric removes any information about the way the data are distributed around each center. Second, an algorithm that builds up a representation of data according to its density in input space is unlikely to be optimal for classification when one is more interested in the boundaries between classes rather than the distribution of data in input space per se. Class labels represent important a priori information that must be built into the data encoding by the network to achieve optimal classification performance.

Our algorithm can be adapted, however, to provide a validation parameter for the outputs of an RBF network if the gaussian mixture representation is used as the hidden layer of the network. In addition to the a posteriori class membership probabilities that can be estimated at the output of the RBF network, we can then decide whether the input vector lies within the hidden-layer representation of the training data. Neural networks, like all function approximation methods, cannot extrapolate, and the $X(\mathbf{x}_t)$ parameter can therefore serve as validation of the network outputs.


5 Appendix

We consider a stochastic gradient descent in some parameter $\theta$, of the form

$$\theta_{t+1} = \frac{\theta_t + \alpha_t(\lambda_t i_t - \theta_t)}{(1 - \alpha_t) + \alpha_t \lambda_t} \tag{5.1}$$


where $\alpha_t$ is a monotonically decreasing parameter, $0 \le \alpha_t < 1$, and $i_t$ is some input parameter specified at time $t$. The stable point is when

$$\langle \Delta\theta \rangle = \langle \theta_{t+1} - \theta_t \rangle = 0$$

where $\langle \cdot \rangle$ denotes an expectation value. Upon rearranging equation 5.1 we obtain

$$\Delta\theta = \theta_{t+1} - \theta_t = \frac{\alpha_t(\lambda_t i_t - \lambda_t \theta_t)}{(1 - \alpha_t) + \alpha_t \lambda_t} \tag{5.2}$$

The stable point of this system is hence when

$$\langle \lambda_t i_t - \lambda_t \theta_t \rangle = 0$$

and at this point a limit value of $\theta_t = \theta_L$ is obtained, hence

$$\langle \lambda_t i_t \rangle - \theta_L \langle \lambda_t \rangle = 0$$

or

$$\theta_L = \frac{\sum_{k=1}^{N} \lambda_k i_k}{\sum_{k=1}^{N} \lambda_k} \tag{5.3}$$

where $N$ is the total number of available $i_t$. Inspection of equations 2.9 and 2.10 shows that they are in the form of equation 5.1, where $\lambda_t = p(j \mid \mathbf{x}_t)$, $\theta$ is $\mathbf{m}_j$ or $\mathbf{F}_j$, and $i_t$ is $\mathbf{x}_t$ or $(\mathbf{x}_t - \mathbf{m}_{j,t})(\mathbf{x}_t - \mathbf{m}_{j,t})^T$, respectively. Equation 5.3 thus corresponds to a limit convergence to equations 2.7 and 2.8. Note that for $\lambda = 1$ equation 5.3 reduces to

$$\theta_L = \frac{1}{N} \sum_{k=1}^{N} i_k$$

which is of the form of equation 4.1 if $i_t = p(j \mid \mathbf{x}_t)$. We must, however, show that the stochastic algorithm reaches this

fixed point. Equation 5.2 is of the form $\Delta\theta = \xi_t f(\theta_t)$, stochastic approximation of which may be achieved by means of the Robbins-Monro algorithm (Duda and Hart 1973). The step size of the algorithm is given here by

$$\xi_t = \frac{\alpha_t \lambda_t}{(1 - \alpha_t) + \alpha_t \lambda_t}$$

The convergence conditions of the Robbins-Monro algorithm are satisfied, for bounded training data, by

$$\xi_t > 0 \quad \forall t$$


and

$$\sum_{t=1}^{\infty} \xi_t = \infty \qquad \text{and} \qquad \sum_{t=1}^{\infty} \xi_t^2 < \infty$$

If we set

$$\alpha_t = \frac{\alpha_0}{t + \tau_\alpha}$$

then

$$\xi_t = \frac{\alpha_0 \lambda_t}{t + \tau_\alpha + \alpha_0(\lambda_t - 1)}$$

The first of these conditions is met for $0 < \alpha_0 < 1$, $0 \le \lambda_t \le 1$, and $\tau_\alpha \ge 1$, as $\tau_\alpha + \alpha_0(\lambda_t - 1) > 0$ for all $t$. The other two conditions are also met, as

$$\sum_{t=1}^{\infty} \frac{\alpha_0 \lambda_t}{t + \tau_\alpha + \alpha_0(\lambda_t - 1)} = \infty$$

and

$$\sum_{t=1}^{\infty} \left[\frac{\alpha_0 \lambda_t}{t + \tau_\alpha + \alpha_0(\lambda_t - 1)}\right]^2 \le \sum_{t=1}^{\infty} \frac{\alpha_0^2}{\left[t + \tau_\alpha + \alpha_0(\lambda_t - 1)\right]^2} < \infty$$
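The limit behaviour of the recursion can also be checked numerically; the following short script (ours, purely illustrative) iterates equation 5.1 with alpha_t = alpha_0/(t + tau_alpha) on random weights and inputs and compares the result with the weighted mean of equation 5.3:

```python
import numpy as np

rng = np.random.default_rng(0)
i_vals = rng.normal(size=1000)             # input parameters i_t
lam = rng.uniform(0.1, 1.0, size=1000)     # weights lambda_t

alpha_0, tau_alpha = 0.7, 1.0
theta = 0.0
for t in range(100000):
    k = rng.integers(len(i_vals))
    a = alpha_0 / (t + tau_alpha)
    theta = (theta + a * (lam[k] * i_vals[k] - theta)) / ((1.0 - a) + a * lam[k])

target = np.sum(lam * i_vals) / np.sum(lam)    # limit value theta_L of equation 5.3
print(theta, target)                           # the two values should agree approximately
```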

Acknowledgments

The authors would like to thank Dr. J. Stradling (Churchill Hospital, Oxford) for his continuing support of this work and all the members of the Neural Network Research Group for valuable discussions. We acknowledge support of one of the authors by the Wellcome Trust and wish to thank the two referees of this paper for very helpful comments.

References

Beckman, R. J., and Cook, R. D. 1983. Outlier..........s. Technometrics 25(2), 119-149.

Bishop, C., Strachan, I., O'Rourke, J., Maddison, G., and Thomas, P. 1993. Reconstruction of Tokamak density profiles using feedforward networks. Neural Comp. Appl. 1, 4-16.

Broomhead, D., and Lowe, D. 1988. Multivariable function interpolation and adaptive networks. Complex Syst. 2, 321-355.

Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B 39(1), 1-38.

Duda, R., and Hart, P. 1973. Pattern Classification and Scene Analysis. John Wiley, New York.

Gath, I., and Geva, A. B. 1989. Unsupervised optimal fuzzy clustering. IEEE Trans. Pattern Anal. Machine Intell. 11(7), 773-781.

Lowe, D. 1991. On the iterative inversion of RBF networks: A statistical interpretation. Proc. 2nd IEE Int. Conf. Artificial Neural Networks, 29-33.


Neal, R. M., and Hinton, G. E. 1993. A new view of the EM algorithm that justifies incremental and other variants. Biometrika, submitted.

Park, J., and Sandberg, I. W. 1991. Universal approximation using radial-basis- function networks. Neural Comp. 3(2), 246-257.

Parzen, E. 1962. On estimation of a probability density function and mode. Ann. Math. Stat. 33, 1065-1076.

Platt, J. 1991. A resource-allocating network for function interpolation. Neural Comp. 3, 213-225.

Ripley, B. D. 1992. Statistical aspects of neural networks. Proc. SemStat, Denmark, April 1992.

Roberts, S., and Tarassenko, L. 1992a. A new method of automated sleep quantification. Med. Biol. Eng. Comput. 30(5), 509-517.

Roberts, S., and Tarassenko, L. 1992b. The analysis of the sleep EEG using a multi-layer network with spatial organisation. IEE Proc.-F 139(6), 420-425.

Sebestyen, G. S. 1962. Pattern recognition by an adaptive process of sample set construction. IRE Trans. Info. Theory IT-8, S82-S91.

Travén, H. G. C. 1991. A neural network approach to statistical pattern classification by "semiparametric" estimation of probability density functions. IEEE Trans. Neural Networks 2(3), 366-377.

Received November 16, 1992; accepted June 2, 1993.