
Research Article

Deep Recurrent Neural Network-Based Autoencoders for Acoustic Novelty Detection

Erik Marchi,1,2,3 Fabio Vesperini,4 Stefano Squartini,4 and Björn Schuller2,3,5

1 Machine Intelligence & Signal Processing Group, Technische Universität München, Munich, Germany
2 audEERING GmbH, Gilching, Germany
3 Chair of Complex & Intelligent Systems, University of Passau, Passau, Germany
4 A3LAB, Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy
5 Department of Computing, Imperial College London, London, UK

Correspondence should be addressed to Erik Marchi; erik.marchi@tum.de

Received 12 July 2016; Accepted 25 September 2016; Published 15 January 2017

Hindawi, Computational Intelligence and Neuroscience, Volume 2017, Article ID 4694860, 14 pages. https://doi.org/10.1155/2017/4694860

Academic Editor: Stefan Haufe

Copyright © 2017 Erik Marchi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In the emerging field of acoustic novelty detection, most research efforts are devoted to probabilistic approaches such as mixture models or state-space models. Only recent studies introduced (pseudo-)generative models for acoustic novelty detection with recurrent neural networks in the form of an autoencoder. In these approaches, auditory spectral features of the next short-term frame are predicted from the previous frames by means of Long Short-Term Memory recurrent denoising autoencoders. The reconstruction error between the input and the output of the autoencoder is used as activation signal to detect novel events. There is no evidence of studies focused on comparing previous efforts to automatically recognize novel events from audio signals and giving a broad and in-depth evaluation of recurrent neural network-based autoencoders. The present contribution aims to consistently evaluate our recent novel approaches to fill this white spot in the literature and provide insight by extensive evaluations carried out on three databases: A3Novelty, PASCAL CHiME, and PROMETHEUS. Besides providing an extensive analysis of novel and state-of-the-art methods, the article shows how RNN-based autoencoders outperform statistical approaches by up to an absolute improvement of 16.4% average F-measure over the three databases.

1. Introduction

Novelty detection aims at recognizing situations in which unusual events occur. The challenging task of novelty detection is usually considered as a single-class classification task. The "normal" data traditionally comprises a very large set, which allows for accurate modelling. Acoustic events not included in the "normal" data are treated as novel events. Novel patterns are tested by comparing them with the normal class model, resulting in a novelty score. Then, the score is processed by a decision logic, typically a threshold, to decide whether the test sample is novel or normal.

A plethora of approaches have been proposed due to the practical relevance of novelty detection, especially for medical diagnosis [1–3], damage inspection [4, 5], physiological condition monitoring [6], electronic IT security [7], and video surveillance systems [8].

According to [9, 10], novelty detection techniques can be grouped into two macro categories: (i) statistical and (ii) neural network-based approaches. Extensive studies have been made in the category of statistical and probabilistic approaches, which are evidently the most widely used in the field of novelty detection. The approaches in this category model the data based on its statistical properties and exploit this information to determine whether an unknown test sample belongs to the learnt distribution or not. Statistical approaches have been applied to a number of applications [9], ranging from data stream mining [11], outlier detection of underwater targets [12], the recognition of cancer [1], nondestructive inspection for the analysis of mechanical components [13], and audio segmentation [14], to many others. In 1999, support vector machines (SVMs) were introduced in the field of novelty detection [15] and subsequently applied to time-series [16, 17], jet engine vibration analysis [18], failure detection in jet engines [19], patient vital-sign monitoring [20], fMRI analysis [21], and damage detection of a gearbox wheel [22].

Neural network-based approaches, also named reconstruction-based approaches [23], have gained interest in recent years, along with the evident success of neural networks in several other fields. In the past decade, several works focusing on the application of a neural network in the form of an autoencoder (AE) have been presented [10], given the huge impact and effectiveness of neural networks. The autoencoder-based approaches involve building a regression model using the "normal" data. The test data are processed by analysing the reconstruction error between the regression target and the encoded value: when the reconstruction error shows a high score, the test data are considered novel. Examples of applications include the detection of abnormal CPU usage data [24, 25], outlier detection [26–29], and damage classification under changing environmental conditions [30].

In these scenarios, very few studies have been conducted in the field of acoustic novelty detection. Recently, we observed a growing research interest in application domains involving surveillance and homeland security, to monitor public places or supervise private environments where people may live alone. Driven by the increasing requirement of security, public places such as, but not limited to, stores, banks, subway trains, and airports have been equipped with various sensors like cameras or microphones. As a consequence, unsupervised monitoring systems have gained much attention in the research community, to investigate new and efficient signal processing approaches. The research in the area of surveillance systems mainly focuses on detecting abnormal events relying on video information [8]. However, it has to be noted that several advantages can be obtained by relying on acoustic information. In fact, acoustic signals, as opposed to video information, require low computational costs and are invariant to illumination conditions, possible occlusions, and abrupt events (e.g., a shotgun or explosions). Specifically, in the field of acoustic novelty detection, studies focused only on statistical approaches, applying hidden Markov models (HMMs) and Gaussian mixture models (GMMs) to the acoustic surveillance of abnormal situations [31–33] and to automatic space monitoring [34]. Despite the number of studies exploring statistical and probabilistic approaches, the use of neural network-based approaches for acoustic novelty detection has only been introduced recently [35, 36].

Contribution. Only in the last two years has the use of neural networks for acoustic novelty detection gained interest in the research community. In fact, a few recent studies proposed a (pseudo-)generative model in the form of a denoising autoencoder with recurrent neural networks (RNNs). In particular, the use of Long Short-Term Memory (LSTM) RNNs as a generative model [37] was investigated in the fields of text generation [38], handwriting [38], and music [39]. However, the use of LSTM as a model for audio generation was only introduced in our recent works [35, 36].

This article provides a broad and extensive evaluation of state-of-the-art methods, with a particular focus on novel and recent unsupervised approaches based on RNN-based autoencoders. We significantly extended the studies conducted in [35, 36] by evaluating further approaches, such as one-class SVMs (OCSVMs) and multilayer perceptrons (MLPs), and, most importantly, we conducted a broad and in-depth evaluation on three different datasets, for a total number of 160,153 experiments, making this article the first to present such a complete evaluation in the field of acoustic novelty detection.

We evaluate and compare all these methods on three different databases: A3Novelty, PASCAL CHiME, and PROMETHEUS. We provide evidence that RNN-based autoencoders significantly outperform other methods, beating statistical approaches by up to an absolute improvement of 16.4% average F-measure over the three databases.

The remainder of this contribution is structured as follows. First, a basic description of the different statistical methods is given in Section 2. Then, the feed-forward and LSTM RNNs, together with the autoencoder-based schemes for acoustic novelty detection, are described (Sections 3 and 4). Next, the thresholding strategy and the features employed in the experiments are given in Section 5. The used databases are introduced in Section 6, and the experimental set-up is discussed in Section 7, before discussing the evaluation of the obtained results in Section 8. Section 9 finally presents our conclusions.

2. Statistical Methods

In this section, we introduce statistical approaches such as the GMM, HMM, and OCSVM. We formally define the input vector $x \in \mathbb{R}^n$, where $n$ is the number of acoustic features (cf. Section 5).

2.1. Gaussian Mixture Models. GMMs estimate the probability density of the "normal" class, given training data, using a number of Gaussian components. The training phase of a GMM exploits the $k$-means algorithm (or other suited training algorithms) together with the Expectation-Maximisation (EM) algorithm [40]: the former initializes the parameters, while $L$ iterations of the EM algorithm lead to the final model. Given a predefined threshold (defined in Section 5), if the probability produced by the GMM for a test sample is lower than the threshold, the sample is detected as a novel event.
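For illustration, the following is a minimal sketch of this scheme using scikit-learn (an assumption; the paper itself trains GMMs with the Torch toolkit, cf. Section 7). The data arrays and the threshold value are hypothetical placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 54))  # "normal" background frames (54 ASF coefficients)
X_test = rng.normal(size=(1000, 54))   # unseen frames, possibly containing novel events

# k-means initialization followed by EM iterations.
gmm = GaussianMixture(n_components=8, covariance_type="diag",
                      init_params="kmeans", max_iter=100)
gmm.fit(X_train)

# Frames whose probability falls below the threshold are detected as novel.
log_likelihood = gmm.score_samples(X_test)
threshold = np.median(gmm.score_samples(X_train)) - 10.0  # hypothetical threshold
novel = log_likelihood < threshold
```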

2.2. Hidden Markov Models. A further statistical model is the HMM [41]. HMMs differ from GMMs in terms of input temporal evolution. Indeed, while a diagonal GMM tends to approximate the whole training data probability distribution by means of a number of Gaussian components, an HMM models the variations of the input signal through its hidden states. The HMM topology employed in this work is left-right, and it is trained by means of the Baum-Welch algorithm [41]; regarding the novelty detection phase, the decision is based on the sequence paradigm. Considering a left-right HMM having $N_s$ hidden states, a sequence is a set of $N_s$ feature vectors $X = \{x_1, \ldots, x_{N_s}\}$. The emission probabilities of these observable events are determined by a probability distribution, one for each state [9]. We trained an HMM on what we call "normal" material and exploited the log-likelihoods as novelty scores. In the testing phase, the unseen signal is segmented into fixed lengths, depending on the number of states of the HMM, and if the log-likelihood is higher than the defined threshold (cf. Section 5), the segment is detected as novel.
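A sketch of this procedure, assuming the hmmlearn package (again, the paper uses Torch); the left-right constraint is imposed by zeroing the disallowed transitions, which Baum-Welch re-estimation then preserves. Data and threshold are placeholders, and the sign convention follows the text as written.

```python
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 54))
X_test = rng.normal(size=(400, 54))

n_states = 4
model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                        init_params="mc")  # keep our start/transition matrices
# Left-right topology: start in state 0; transitions only to self or next state.
model.startprob_ = np.eye(n_states)[0]
model.transmat_ = np.triu(np.tril(np.ones((n_states, n_states)), k=1))
model.transmat_ /= model.transmat_.sum(axis=1, keepdims=True)
model.fit(X_train)  # Baum-Welch on the "normal" material

# Segment the test signal into fixed-length chunks of n_states frames and
# threshold the per-segment log-likelihood.
threshold = -100.0  # hypothetical
for segment in np.array_split(X_test, len(X_test) // n_states):
    is_novel = model.score(segment) > threshold
```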

2.3. One-Class Support Vector Machines. An OCSVM [42] maps an input example onto a high-dimensional feature space and iteratively searches for the hyperplane that maximises the distance between the training examples and the origin. In this constellation, the OCSVM can be seen as a two-class SVM where the origin is the unique member of the second class, whereas the training examples belong to the first class. Given the training data $x_1, \ldots, x_l \in X$, where $l$ is the number of observations, the class separation is performed by solving the following:

$$\min_{w,\,\xi,\,\rho}\;\; \frac{1}{2}\|w\|^{2} + \frac{1}{\nu l}\sum_{i}\xi_{i} - \rho \qquad \text{subject to}\quad \bigl(w \cdot \Phi(x_{i})\bigr) \geq \rho - \xi_{i},\;\; \xi_{i} \geq 0 \tag{1}$$

where $w$ is the support vector, $\xi_i$ are slack variables, $\rho$ is the offset, and $\Phi$ maps $X$ into a dot product space $F$, such that the dot product in the image of $\Phi$ can be computed by evaluating a certain kernel function, such as a linear or Gaussian radial basis function:

$$k(x, y) = \exp\left(-\frac{\|x - y\|^{2}}{2\sigma^{2}}\right) \tag{2}$$

The parameter $\nu$ sets an upper bound on the fraction of the outliers, defined to be the data lying outside the estimated region of normality. Thus, the decision values are obtained with the following function:

$$f(x) = w \cdot \Phi(x) - \rho \tag{3}$$

We trained an OCSVM on what we call "normal" material and used the decision values as novelty scores. During testing, the OCSVM provides a decision value for the unseen pattern, and if the decision value is higher than the defined threshold (cf. Section 5), the segment is detected as novel.
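A corresponding sketch with scikit-learn's OneClassSVM (the paper uses LIBSVM, which this wraps); $\nu$ and $\gamma$ are taken from the grid in Section 7, and the threshold is a placeholder.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 54))
X_test = rng.normal(size=(1000, 54))

# nu bounds the fraction of outliers; the RBF kernel corresponds to (2).
ocsvm = OneClassSVM(kernel="rbf", gamma=0.01, nu=0.1)
ocsvm.fit(X_train)  # "normal" material only

# Decision values f(x) = w . Phi(x) - rho, cf. (3), used as novelty scores
# and compared against the adaptive threshold of Section 5.
scores = ocsvm.decision_function(X_test)
threshold = 0.0  # hypothetical; sign convention as stated in the text
novel = scores > threshold
```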

3. Feed-Forward and Recurrent Neural Networks

This section introduces the MLP and the LSTM RNNs employed in our acoustic novelty detectors.

The first neural network type we used is the multilayer perceptron [43]. In an MLP, the units are arranged in layers, with feed-forward connections from one layer to the next. Each node outputs an activation function applied over the weighted sum of its inputs. The activation function can be linear, a hyperbolic tangent function (tanh), or the sigmoid function. Input examples are fed to the input layer, and the resulting output is propagated via the hidden layers towards the output layer. This process is known as the forward pass of the network. This type of neural network relies only on the current input and not on any past or future inputs.

Figure 1: LSTM unit comprising the input, output, and forget gates and the memory cell.

The second neural network type we employed is the LSTM RNN [44]. Compared to a conventional RNN, the hidden units are replaced by so-called memory blocks. These memory blocks can store information in the "cell variable" $c_t$. In this way, the network can exploit long-range temporal context. Each memory block consists of a memory cell and three gates: the input gate, output gate, and forget gate, as depicted in Figure 1.

The memory cell is controlled by the input, output, and forget gates. The stored cell variable $c_t$ can be reset by the forget gate, while the functions responsible for reading input from $x_t$ and writing output to $h_t$ are controlled by the input and output gates, respectively:

$$c_t = f_t \otimes c_{t-1} + i_t \otimes \tanh\bigl(W_{xc} x_t + W_{hc} h_{t-1} + b_c\bigr), \qquad h_t = o_t \otimes \tanh(c_t) \tag{4}$$

where $\tanh$ and $\otimes$ stand for element-wise hyperbolic tangent and element-wise multiplication, respectively. The output of the input gate is denoted by the variable $i_t$, while the outputs of the output and forget gates are indicated by $o_t$ and $f_t$, respectively. The variables $W$ denote weight matrices, and $b_c$ indicates a bias term.

Each LSTM unit is a separate and independent block. In fact, the size of $h_t$ is the same as that of $i_t$, $o_t$, $f_t$, and $c_t$; the size corresponds to the number of LSTM units in the hidden layer. In order to have the gates depend only on the memory cell within the same LSTM unit, the matrices of the weights from the cells to the gates are diagonal.
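To make the unit concrete, here is a minimal numpy sketch of a single LSTM step. Only (4) is given explicitly above; the gate activations follow the standard formulation with the diagonal cell-to-gate weights just described, realised here as element-wise products. All parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p maps names to weights (full matrices W*,
    diagonal cell-to-gate weights w_c* stored as vectors, biases b*)."""
    i_t = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["wci"] * c_prev + p["bi"])
    f_t = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["wcf"] * c_prev + p["bf"])
    # Cell update, cf. (4): c_t = f_t (x) c_{t-1} + i_t (x) tanh(Wxc x_t + Whc h_{t-1} + b_c)
    c_t = f_t * c_prev + i_t * np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
    o_t = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["wco"] * c_t + p["bo"])
    h_t = o_t * np.tanh(c_t)  # output, cf. (4)
    return h_t, c_t

n_in, n_hid = 54, 128
rng = np.random.default_rng(0)
p = {k: rng.normal(0, 0.1, (n_hid, n_in)) for k in ["Wxi", "Wxf", "Wxc", "Wxo"]}
p.update({k: rng.normal(0, 0.1, (n_hid, n_hid)) for k in ["Whi", "Whf", "Whc", "Who"]})
p.update({k: rng.normal(0, 0.1, n_hid) for k in ["wci", "wcf", "wco", "bi", "bf", "bc", "bo"]})
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), p)
```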

Furthermore, we employed bidirectional RNNs (BRNNs) [45], which are capable of learning the context in both temporal directions. In fact, a BRNN contains two distinct hidden layers, which process the input vector in each direction. The output layer is then connected to both hidden layers. A more complex architecture can be obtained by combining an LSTM unit with a BRNN, which is referred to as bidirectional LSTM (BLSTM) [46]. BLSTM exploits context from both temporal directions. Note that in the case of BLSTM it is not possible to perform online processing, as a short look-ahead buffer is required.


When the layout of a neural network comprises more hidden layers, it is defined as a deep neural network (DNN) [47]. An incrementally higher-level representation of the input data is provided when multiple hidden layers are stacked on each other (deep learning).

In the case of multiple layers, the output of a BRNN is computed as

$$y_t = W_{\overrightarrow{h}^N y}\,\overrightarrow{h}^N_t + W_{\overleftarrow{h}^N y}\,\overleftarrow{h}^N_t + b_y \tag{5}$$

where the forward and backward activations of the $N$th (last) hidden layer are denoted by $\overrightarrow{h}^N_t$ and $\overleftarrow{h}^N_t$, respectively.

The reconstructed signal is generated by using an identity activation function at the output. The best network layout was obtained by conducting a number of preliminary evaluations. Several configurations were evaluated by changing the size and the number of hidden layers (i.e., the number of LSTM units for each layer).

The training procedure was iterated up to a maximum of 100 epochs. Standard gradient descent with backpropagation of the sum of squared errors was used to recursively update the network weights. Those were initialized from a random Gaussian distribution with mean 0 and standard deviation 0.1, as this usually provides an acceptable initialization in our experience.

4. Autoencoders for Acoustic Novelty Detection

This section introduces the concept of autoencoders and describes the basic autoencoder, compression autoencoder, denoising autoencoder, and nonlinear predictive autoencoder [36].

4.1. Basic Autoencoder. A basic autoencoder is a neural network trained to set the target values equal to the inputs. Its structure typically consists of only one hidden layer, while the input and the output layers have the same size. The training set $\mathcal{X}_{tr}$ consists of background environmental sounds, while the test set $\mathcal{X}_{te}$ is composed of recordings containing abnormal sounds. The autoencoder is used to find a common data representation from the input [48, 49]. Formally, in response to an input example $x \in \mathbb{R}^n$, the hidden representation $h(x) \in \mathbb{R}^m$ is

$$h(x) = f(W_1 x + b_1) \tag{6}$$

where $f(z)$ is a nonlinear activation function, typically a logistic sigmoid function $f(z) = 1/(1 + \exp(-z))$ applied componentwise, $W_1 \in \mathbb{R}^{m \times n}$ is a weight matrix, and $b_1 \in \mathbb{R}^m$ is a bias vector.

The network output maps the hidden representation $h$ back to a reconstruction $\tilde{x} \in \mathbb{R}^n$:

$$\tilde{x} = f(W_2 h(x) + b_2) \tag{7}$$

where $W_2 \in \mathbb{R}^{n \times m}$ is a weight matrix and $b_2 \in \mathbb{R}^n$ is a bias vector.

Given an input set of examples $\mathcal{X}$, AE training consists in finding parameters $\theta = \{W_1, W_2, b_1, b_2\}$ that minimise the reconstruction error, which corresponds to minimising the following objective function:

$$\mathcal{J}(\theta) = \sum_{x \in \mathcal{X}} \|x - \tilde{x}\|^{2} \tag{8}$$

A well-known approach to minimise the objective function is stochastic gradient descent with error backpropagation. The layout of the AE is shown in Figure 2(a).
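A minimal numpy sketch of the forward pass and objective, following (6)-(8); training the weights by stochastic gradient descent with backpropagation is omitted. Sizes and initialization follow the text (54 inputs, Gaussian initialization with standard deviation 0.1).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, m = 54, 54  # input/output and hidden sizes; m < n yields the compression AE below
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (m, n)), np.zeros(m)
W2, b2 = rng.normal(0, 0.1, (n, m)), np.zeros(n)

def reconstruct(x):
    h = sigmoid(W1 @ x + b1)     # hidden representation, cf. (6)
    return sigmoid(W2 @ h + b2)  # reconstruction, cf. (7)

def objective(X):
    # Reconstruction error J(theta) over the example set X, cf. (8)
    return sum(np.sum((x - reconstruct(x)) ** 2) for x in X)
```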

4.2. Compression Autoencoder. The compression autoencoder (CAE) learns a compressed representation of the input when the number of hidden units $m$ is smaller than the number of input units $n$. For example, if some of the input features are correlated, these correlations are learnt and reconstructed by the CAE. The structure of the CAE is given in Figure 2(b).

4.3. Denoising Autoencoder. In the denoising AE (DAE) [50] configuration, the network is trained to reconstruct the original input from a corrupted version of it. The initial input $x$ is corrupted by means of additive isotropic Gaussian noise in order to obtain $x' \mid x \sim N(x, \sigma^2 I)$. The corrupted input $x'$ is then mapped, as with the AE, to a hidden representation

$$h(x') = f(W'_1 x' + b'_1) \tag{9}$$

forcing the hidden layer to retrieve more robust features and preventing it from simply learning the identity. Thus, the original signal is reconstructed as follows:

$$\tilde{x}' = f(W'_2\, h(x') + b'_2) \tag{10}$$

The structure of the denoising autoencoder is shown in Figure 2(c). In the training phase, the set of network weights and biases $\theta' = \{W'_1, W'_2, b'_1, b'_2\}$ is updated in order to have $\tilde{x}'$ as close as possible to the uncorrupted input $x$. This procedure corresponds to minimising the reconstruction error objective function (8). In our approach, to corrupt the initial input $x_t$, we make use of additive isotropic Gaussian noise in order to obtain $x' \mid x \sim N(x, \sigma^2 I)$.

4.4. Nonlinear Predictive Autoencoder. The basic idea of a nonlinear predictive (NP) AE is to train the AE to predict the current frame from previous observations. Formally, the input up to a given time frame $x_t$ is mapped to a hidden representation $h$:

$$h(x_t) = f(W^*_1, b^*_1, x_{1:t}) \tag{11}$$

where $W^*$ and $b^*$ denote weights and biases, respectively. From this, we reconstruct an approximation of the original signal as follows:

$$\tilde{x}_{t+k} = f(W^*_2, b^*_2, h_{1:t}) \tag{12}$$

where $k$ is the prediction delay and $h_i = h(x_i)$. A prediction delay of $k = 1$ corresponds to a shift of 10 ms in the audio signal in our setting (cf. Section 5).


Figure 2: Block diagram of the proposed acoustic novelty detector with different autoencoder structures. Features are extracted from the input signal, and the reconstruction error between the input and the reconstructed features is then processed by a thresholding block, which detects the novel or nonnovel event. Structure of the (a) basic autoencoder, (b) compression autoencoder, and (c) denoising autoencoder, on the training set $\mathcal{X}_{tr}$ or testing set $\mathcal{X}_{te}$; $\mathcal{X}_{tr}$ contains data of nonnovel acoustic events, and $\mathcal{X}_{te}$ consists of novel and nonnovel acoustic events.

The training of the parameters is performed by minimising the objective function (8); the difference is that $\tilde{x}$ is now based on nonlinear prediction according to (11) and (12). Thus, the parameters $\theta^* = \{W^*_1, W^*_2, b^*_1, b^*_2\}$ are trained to minimise the average reconstruction error over the training set, that is, to have $\tilde{x}_{t+k}$ as close as possible to $x_{t+k}$. The resulting structure of the nonlinear predictive denoising autoencoder (NP-DAE) is similar to the one depicted in Figure 2(c), but with input and output updated as described above.
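In practice, the corruption and the prediction delay reduce to a simple re-pairing of the training frames; a sketch (variable names hypothetical):

```python
import numpy as np

def np_dae_pairs(features, k=1, sigma=0.1, seed=0):
    """Build (input, target) pairs for a nonlinear predictive denoising AE.

    features: (T, 54) array of ASF frames; k: prediction delay in frames
    (k = 1 corresponds to 10 ms here); sigma: std of the corrupting noise.
    """
    rng = np.random.default_rng(seed)
    x = features[:-k] if k > 0 else features
    x_corrupted = x + rng.normal(0.0, sigma, x.shape)  # x' | x ~ N(x, sigma^2 I)
    targets = features[k:] if k > 0 else features      # the AE must predict frame t+k
    return x_corrupted, targets
```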

5. Thresholding and Features

This section describes the thresholding decision strategy and the features employed in our experiments.

5.1. Thresholding. The auditory spectral features (ASF; cf. Section 5.2) used in this work comprise 54 coefficients, which means that the input and output layers of the network have 54 units each. The trained AE reconstructs each sample, and novel events are identified by processing the reconstruction error signal with an adaptive threshold. The input audio signal $x$ is segmented into sequences of 30 seconds of length. In the testing phase, we compute, on a frame basis, the average Euclidean distance between the networks' outputs and each standardized input feature value. In order to compress the reconstruction error to a single value, the distances are summed up and divided by the number of coefficients. Then, we apply a threshold $\theta_{th}$ to obtain a binary signal, shifting from the median of the error signal $e_0$ of a sequence by a multiplicative coefficient $\beta$. The coefficient ranges from $\beta_{min} = 1$ to $\beta_{max} = 2$:

$$\theta_{th} = \beta \cdot \operatorname{median}\bigl(e_0(1), \ldots, e_0(N)\bigr) \tag{13}$$
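A sketch of the decision logic; the exact error aggregation is paraphrased from the text (per-frame distances summed and divided by the number of coefficients), so the implementation below is one plausible reading.

```python
import numpy as np

def novelty_signal(x_in, x_rec, beta=1.5):
    """x_in, x_rec: (N, 54) standardized input features and reconstructions
    for one 30-second sequence; returns a binary per-frame novelty signal."""
    # Reconstruction error compressed to one value per frame.
    e0 = np.sum(np.abs(x_in - x_rec), axis=1) / x_in.shape[1]
    theta_th = beta * np.median(e0)  # adaptive threshold, cf. (13)
    return e0 > theta_th
```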

Figure 3 shows the reconstruction error for a given sequence. The figure clearly depicts a low reconstruction error in reproducing normal input, such as talking, television sounds, and other normal environmental sounds.

5.2. Acoustic Features. An efficient representation of the audio signal can be achieved by extracting the auditory spectral features (ASF) [51]. The audio signal is split into frames with a size of 30 ms and a frame step of 10 ms, and then the ASF are obtained by applying the Short-Time Fourier Transform (STFT), which yields the power spectrogram of the frame.

Figure 3: Spectrogram of a 30-second sequence from (a) the A3Novelty Corpus containing a novel siren event (top), (b) the PASCAL CHiME database containing three novel events, a siren and two screams (top), and (c) the PROMETHEUS database containing a novel event composed of an alarm and people screaming together (top). Reconstruction error signal of the related sequence obtained with a BLSTM-DAE (middle). Ground-truth binary novelty signal: novel (1) and not-novel (0) (bottom).

Mel spectrograms $M^{30}(n,m)$ (with $n$ being the frame index and $m$ the frequency bin index) are calculated by converting the power spectrogram to the Mel-frequency scale using a filter-bank with 26 triangular filters. A logarithmic scaling is chosen to match the human perception of loudness:

$$M^{30}_{\log}(n,m) = \log\bigl(M^{30}(n,m) + 1.0\bigr) \tag{14}$$

In addition, the positive first-order differences $D^{30}(n,m)$ are calculated from each Mel spectrogram following

$$D^{30}(n,m) = M^{30}_{\log}(n,m) - M^{30}_{\log}(n-1,m) \tag{15}$$

Furthermore, the frame energy and its derivative are also included as features, ending up in a total number of 54 coefficients. For better reproducibility, the feature extraction process is computed with our open-source audio analysis toolkit openSMILE [52].
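The authors compute the features with openSMILE; the following librosa-based sketch only approximates the pipeline described above (the sample rate, the exact energy definition, and the scaling constants are assumptions).

```python
import numpy as np
import librosa

def asf_features(wav_path, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)
    n_fft, hop = int(0.03 * sr), int(0.01 * sr)  # 30 ms frames, 10 ms step
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    mel = librosa.feature.melspectrogram(S=S, sr=sr, n_mels=26)
    mel_log = np.log(mel + 1.0)                               # cf. (14)
    delta = np.diff(mel_log, axis=1, prepend=mel_log[:, :1])  # cf. (15)
    energy = np.log(S.sum(axis=0) + 1.0)                      # frame energy (assumed form)
    d_energy = np.diff(energy, prepend=energy[:1])
    # 26 + 26 + 1 + 1 = 54 coefficients per frame
    return np.vstack([mel_log, delta, energy, d_energy]).T
```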

6. Databases

This section describes the three databases evaluated in our experiments: A3Novelty, PASCAL CHiME, and PROMETHEUS.

6.1. A3Novelty. The A3Novelty Corpus (http://www.a3lab.dii.univpm.it/research/a3novelty) includes around 56 hours of recordings acquired in a laboratory of the Università Politecnica delle Marche. These recordings were performed during different day and night hours, so very different acoustic conditions are available. A variety of novel events were randomly played back by a speaker (e.g., scream, fall, alarm, or breakage of objects) during the recordings.

Eight microphones were used in the recording room for the acquisitions: four Behringer B-5 microphones with cardioid pattern and an array of four AKG C400BL microphones spaced by 4 cm. A MOTU 8pre sound card and the NU-Tech software were utilised to record the microphone signals. The sampling rate was equal to 48 kHz.

The abnormal event sounds (cf. Table 1) can be grouped into four categories, and they are freely available to download from http://www.freesound.org:

(i) Sirens: three different types of sirens or alarm sounds.
(ii) Falls: two occurrences of a person or an object falling to the ground.
(iii) Breakage of objects: noise produced by the breakage of an object after the impact with the ground.
(iv) Screams: four different human screams, produced either by a single person or by a group of people.

The A3Novelty Corpus is composed of two types of recordings: background, which contains only background sounds such as human speech, technical tool noise, and environmental sounds, and background with novelty, which contains, in addition to the background, the artificially generated novelty events.

In the original A3Novelty database, the recordings are segmented into sequences of 30 seconds. In order to limit the size of the training data, we randomly selected 300 sequences from the background partition to compose the training material (150 minutes) and 180 sequences from the background with novelty partition to compose the testing set (90 minutes). The test set contains 13 novelty occurrences.

For reproducibility, the list of randomly selected recordings and the train and test sets are made available (http://www.a3lab.dii.univpm.it/research/a3novelty).

6.2. PASCAL CHiME. The original dataset is composed of around 7 hours of recordings of a home environment, taken from the PASCAL CHiME speech separation and recognition challenge [53].

Table 1: Acoustic novel events in the test set. Each cell gives the number of events, followed by the total duration in seconds and, in parentheses, the average duration per event. The last column indicates the total number of events and the total duration across the databases. The last line indicates the total duration in seconds of the test set, including normal and novel events, per database. ATM, Corridor, Outdoor, and Smart-room are the PROMETHEUS scenarios.

| Event     | A3Novelty       | PASCAL CHiME      | ATM           | Corridor        | Outdoor          | Smart-room       | Total        |
| Alarm     | —               | 76: 435.8 (6.0)   | —             | 6: 84.0 (14.0)  | —                | 3: 9.0 (3.0)     | 85: 528.8    |
| Anger     | —               | —                 | —             | —               | 6: 293.0 (48.8)  | —                | 6: 293.0     |
| Fall      | 3: 4.2 (2.1)    | 48: 89.5 (1.8)    | —             | 3: 3.0 (1.0)    | —                | 2: 2.0 (1.0)     | 55: 98.7     |
| Fracture  | 1: 2.2          | 32: 70.4 (2.2)    | —             | —               | —                | —                | 33: 72.6     |
| Pain      | —               | —                 | —             | 2: 8.0 (4.0)    | —                | 5: 67.0 (13.4)   | 7: 75.0      |
| Scream    | 6: 10.4 (1.7)   | 111: 214.6 (1.9)  | 5: 30.0 (6.0) | 25: 228.0 (9.1) | 4: 48.0 (12.0)   | 10: 234.0 (23.4) | 159: 762.2   |
| Siren     | 3: 20.4 (6.8)   | —                 | —             | —               | —                | —                | 3: 18.1      |
| Total     | 13: 38.1 (2.9)  | 267: 810.3 (3.1)  | 5: 30.0 (5.0) | 36: 323.0 (9.0) | 10: 341.0 (34.1) | 20: 312.0 (15.6) | 348: 1848.4  |
| Test time | 5400.0          | 4188.0            | 750.0         | 960.0           | 1620.0           | 1020.0           | 13938.0      |

It consists of a typical in-home scenario (a living room), recorded during different days and times, while the inhabitants (two adults and two children) perform common actions such as talking, watching television, playing, or eating. The dataset was recorded with a binaural microphone at a sample rate of 16 kHz. In the original PASCAL CHiME database, the recordings are segmented into sequences of 5 minutes' duration. In order to limit the size of the training data, we randomly selected sequences to compose 100 minutes of background for the training set and around 70 minutes for the testing set. For reproducibility, the list of randomly selected recordings and the train and test sets are made available (http://a3lab.dii.univpm.it/webdav/audio/Novelty_Detection_Dataset.tar.gz). The test set was generated by adding different types of sounds (taken from http://www.freesound.org), such as screams, alarms, falls, and fractures (cf. Table 1), after their normalization to the volume of the background recordings. The events in the test set were added at random positions; thus, the distance between one event and another is not fixed.

6.3. PROMETHEUS. The PROMETHEUS database [31] contains recordings of various scenarios designed to serve a wide range of real-world applications. The database includes (1) a smart-room indoor home environment, including phases where a user is interacting with an automated speech-driven home assistant; (2) an outdoor public space consisting of (a) interaction of people with an ATM and (b) an outdoor security scenario in which people are waiting in front of a counter; and (3) an indoor office corridor scenario for security monitoring in a standard indoor space. These scenarios substantially differ in terms of acoustic environment. The indoor scenarios were recorded under quiet acoustic conditions, whereas the outdoor recordings were conducted in an open-air public area and contain nonstationary background noise. The smart-home scenario contains recordings of five professional actors performing five single-person and 14 multiple-person action scripts. The main activities include human-machine interaction with a virtual home agent and a number of alternating normal and abnormal activities specifically designed to monitor and interpret human behaviour. The single-person and multiple-person actions include abnormal events such as falls, alarms followed by panic, atypical vocal reactions (pain, fear, and anger), or fractures. Examples are walking to the couch, sitting, or interacting with the smart environment to turn the TV on, open the windows, or decrease the temperature. The scenarios were recorded three to five times, changing the actors and their roles in the action scripts. Table 1 provides details on the number of abnormal events per scenario, including average time duration.

7. Experimental Set-Up

The networks were trained with the steepest gradient descent algorithm on the sum of squared errors (SSE). In the case of all the LSTM and BLSTM networks, we used a constant learning rate of $l = 10^{-6}$, since it showed better performance in our previous works [35], whereas different values of $l = \{10^{-8}, 10^{-9}\}$ were used for the MLP networks. Different noise sigma values $\sigma = \{0.01, 0.1, 0.25\}$ were applied to the DAE. No Gaussian noise was applied to the basic AE and to the CAE, following the architectures described in Section 4. The prediction delay was applied for different values $k = \{1, 2, 3, 4, 5, 10\}$. The AEs were trained using our open-source CUDA RecurREnt Neural Network Toolkit (CURRENNT) [54], ensuring reproducibility. As evaluation metric, we used the F-measure, in order to compare the results with previous works [35, 36]. We evaluated several topologies for the nonlinear predictive DAE, ranging from 54-128-54 to 216-216-216, and from 54-30-54 to 54-54-54 in the case of the CAE and the basic AE, respectively. Every network topology was evaluated for each of the 100 epochs of training. In order to compare our results with our previous studies, we kept the same optimisation procedure as applied in [35, 36]. We further employed three state-of-the-art approaches based on OCSVM, GMM, and HMM.

Table 2: Comparison of methods over the three databases by percentage of F-measure (F1). Each cell gives the prediction (D)elay $k$ and the F1 in %; the last column gives the average F1 weighted by the number of instances per database. ATM, Corridor, Outdoor, and Smart-room are the PROMETHEUS scenarios. Reported approaches are GMM, HMM, and OCSVM; the compression autoencoder with MLP (MLP-CAE), BLSTM (BLSTM-CAE), and LSTM (LSTM-CAE); the denoising autoencoder with MLP (MLP-DAE), BLSTM (BLSTM-DAE), and LSTM (LSTM-DAE); and the related versions of the nonlinear predictive autoencoders, NP-MLP-CAE/AE/DAE and NP-(B)LSTM-CAE/AE/DAE.

| Method        | A3Novelty | PASCAL CHiME | ATM      | Corridor | Outdoor  | Smart-room | Weighted avg. |
| OCSVM         | — / 91.8  | — / 63.4     | — / 60.2 | — / 65.3 | — / 57.3 | — / 57.4   | 73.3 |
| GMM           | — / 89.4  | — / 89.4     | — / 50.2 | — / 49.4 | — / 56.4 | — / 59.1   | 78.7 |
| HMM           | — / 88.2  | — / 91.4     | — / 52.0 | — / 49.6 | — / 56.0 | — / 59.1   | 78.9 |
| MLP-CAE       | 0 / 97.6  | 0 / 85.2     | 0 / 76.1 | 0 / 76.2 | 0 / 64.8 | 0 / 61.2   | 84.8 |
| LSTM-CAE      | 0 / 97.7  | 0 / 89.1     | 0 / 78.5 | 0 / 77.8 | 0 / 62.6 | 0 / 64.0   | 86.2 |
| BLSTM-CAE     | 0 / 98.7  | 0 / 91.3     | 0 / 78.4 | 0 / 78.4 | 0 / 62.1 | 0 / 63.7   | 87.3 |
| MLP-AE        | 0 / 97.2  | 0 / 85.0     | 0 / 76.1 | 0 / 77.0 | 0 / 65.1 | 0 / 61.8   | 84.8 |
| LSTM-AE       | 0 / 97.7  | 0 / 89.1     | 0 / 78.7 | 0 / 77.9 | 0 / 61.6 | 0 / 61.4   | 86.0 |
| BLSTM-AE      | 0 / 97.8  | 0 / 89.4     | 0 / 78.5 | 0 / 77.6 | 0 / 63.1 | 0 / 63.4   | 86.4 |
| MLP-DAE       | 0 / 97.3  | 0 / 87.3     | 0 / 77.5 | 0 / 78.5 | 0 / 65.8 | 0 / 64.6   | 86.0 |
| LSTM-DAE      | 0 / 97.9  | 0 / 92.4     | 0 / 79.5 | 0 / 78.7 | 0 / 68.0 | 0 / 65.0   | 88.1 |
| BLSTM-DAE     | 0 / 98.4  | 0 / 93.4     | 0 / 78.7 | 0 / 79.8 | 0 / 68.5 | 0 / 65.1   | 88.7 |
| NP-MLP-CAE    | 4 / 98.5  | 5 / 88.3     | 1 / 78.8 | 2 / 75.0 | 1 / 65.2 | 5 / 64.0   | 86.4 |
| NP-LSTM-CAE   | 5 / 98.8  | 1 / 92.5     | 1 / 78.7 | 2 / 74.4 | 2 / 64.7 | 3 / 63.8   | 87.7 |
| NP-BLSTM-CAE  | 4 / 99.2  | 3 / 92.8     | 2 / 78.3 | 1 / 75.2 | 2 / 65.7 | 2 / 63.2   | 88.1 |
| NP-MLP-AE     | 4 / 98.5  | 5 / 85.9     | 1 / 79.0 | 2 / 74.6 | 1 / 64.7 | 2 / 62.8   | 85.5 |
| NP-LSTM-AE    | 5 / 98.7  | 1 / 92.1     | 1 / 78.1 | 1 / 75.0 | 1 / 65.0 | 3 / 64.4   | 87.6 |
| NP-BLSTM-AE   | 4 / 99.2  | 2 / 94.1     | 3 / 78.4 | 2 / 75.6 | 1 / 65.6 | 1 / 63.6   | 88.5 |
| NP-MLP-DAE    | 5 / 98.9  | 5 / 88.8     | 1 / 81.6 | 1 / 77.5 | 1 / 67.0 | 4 / 64.3   | 87.3 |
| NP-LSTM-DAE   | 5 / 99.1  | 1 / 94.2     | 4 / 80.4 | 1 / 76.4 | 2 / 66.2 | 1 / 65.2   | 88.8 |
| NP-BLSTM-DAE  | 5 / 99.4  | 3 / 94.4     | 2 / 80.7 | 3 / 78.5 | 2 / 66.7 | 1 / 65.6   | 89.3 |

In the case of the OCSVM, we trained models at different complexity values $C = \{0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001\}$. Radial basis function kernels were used with different gamma values $\gamma = \{0.01, 0.001\}$, and we controlled the fraction of outliers in the training phase with different values $\nu = \{0.1, 0.01\}$. The OCSVM was trained via the LIBSVM library [55]. In the case of the GMM, models were trained at different numbers of Gaussian components $2^n$, with $n = 1, 2, \ldots, 8$, whereas left-right HMMs were trained with different numbers of states $s = \{3, 4, 5\}$ and $2^n$ Gaussian components, with $n = 1, 2, \ldots, 7$. GMMs and HMMs were trained using the Torch [56] toolkit. The decision values produced as output of the OCSVM and the probability estimates produced as output of the probabilistic models were postprocessed with a similar thresholding algorithm (cf. Section 5), in order to fairly compare the performance among the different methods. For all the experiments and settings, we maintained the same feature set.
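To give a sense of the search space, a sketch enumerating the grids listed above (training and F-measure evaluation are placeholders):

```python
from itertools import product

ocsvm_grid = list(product(
    [0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001],  # complexity C
    [0.01, 0.001],                               # RBF gamma
    [0.1, 0.01]))                                # nu (fraction of outliers)
gmm_components = [2 ** n for n in range(1, 9)]
hmm_grid = list(product([3, 4, 5], [2 ** n for n in range(1, 8)]))  # states x components
np_dae_grid = list(product([0.01, 0.1, 0.25],     # noise sigma
                           [1, 2, 3, 4, 5, 10]))  # prediction delay k

for C, gamma, nu in ocsvm_grid:
    pass  # train OCSVM, threshold decision values, compute F-measure
```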

8. Results

In this section, we present and comment on the results obtained in our evaluation across the three databases.

8.1. A3Novelty. Evaluations on the A3Novelty Corpus are reported in the A3Novelty column of Table 2. On this dataset, GMMs and HMMs perform similarly; however, they are outperformed by the OCSVM, with a maximum improvement of 3.6% absolute F-measure. The autoencoder-based approaches significantly boost the performance, up to 98.7%. We observe a vast absolute improvement of up to 6.9% against the probabilistic approaches. Among the three configurations, CAE, AE, and DAE, we observe that compression and denoising layouts with BLSTM units perform closely to each other, at up to 98.7% in the case of the BLSTM-CAE. This can be due to the fact that the dataset contains fewer variations in the background material used for training, and the feature selection operated internally by the AE increases the sensitivity of the reconstruction error.

The nonlinear predictive results are shown in the last part of Table 2. We provide performance in the three named configurations and with the three named unit types. In concordance with what we found on the PASCAL database, the NP-BLSTM-DAE method provided the best performance, in terms of F-measure, of up to 99.4%. A significant absolute improvement (one-tailed z-test [57], p < .001; in the rest of the manuscript, we report as "significant" the improvements with at least p < .001 under the one-tailed z-test [57]) of 10.0% F-measure is observed against the GMM-based approach, while an absolute improvement of 7.6% F-measure is exhibited with respect to the OCSVM method. We observe an overall improvement of ≈1% between the "ordinary" and the "predictive" architectures.

The performance obtained by progressively increasing the prediction delay ($k$) values (from 0 up to 10) is reported in Figure 4.

Figure 4: A3Novelty performances obtained with NP-Autoencoders by progressively increasing the prediction delay.

We evaluated the compression autoencoder (CAE), the basic autoencoder (AE), and the denoising autoencoder (DAE) with MLP, LSTM, and BLSTM units, and we applied different layouts (cf. Section 7) per network type. However, for the sake of brevity, we only show the best configurations. The best results across all three unit types are 99.4% and 99.1% F-measure, for the NP-BLSTM-DAE and NP-LSTM-DAE networks, respectively. These are obtained with a prediction delay of 5 frames, which translates into an overall delay of 50 ms. In general, the best performances are achieved with $k = 4$ or $k = 5$. Increasing the prediction delay up to 10 frames produces a heavy decrease in performance, down to 97.8% F-measure.

8.2. PASCAL CHiME. In the PASCAL CHiME column of Table 2, we report the performance obtained on the PASCAL dataset using different approaches. Parts of the results obtained on this database were also presented in [36]. Here, we conducted additional experiments to evaluate the OCSVM and MLP approaches. The one-class SVM shows lower performance compared to probabilistic approaches such as GMM and HMM, which seem to work reasonably well, at up to 91.4% F-measure. The low OCSVM performance can be due to the fact that the dataset was generated artificially and the abnormal sound dynamics were normalized with respect to the "normal" material, making the margin maximisation more complex and less effective. Next, we evaluated AE-based approaches in the three configurations: compression (CAE), basic (AE), and denoising (DAE). We also evaluated the MLP, LSTM, and BLSTM unit types. Among the three configurations, we observe that the denoising ones perform better than the others, independently of the type of unit. In particular, the best performance is obtained with the denoising autoencoder realised as a BLSTM RNN, showing up to 93.4% F-measure. The last three groups of rows in Table 2 show results of the NP approach, again in the three configurations and with the three unit types.

The NP-BLSTM-DAE achieved the best result, of up to 94.4% F-measure.

Figure 5: PASCAL CHiME performances obtained with NP-Autoencoders by progressively increasing the prediction delay.

Significant absolute improvements of 4.0%, 3.0%, and 1.0% are observed over the GMM, HMM, and "ordinary" BLSTM-DAE approaches, respectively.

Interestingly, applying the nonlinear prediction scheme to the compression autoencoders, NP-(B)LSTM-CAE (92.8% F-measure), also increased the performance in comparison with the (B)LSTM-CAE (91.3% F-measure). In fact, in a previous work [35], the compression learning process alone showed scarce results. However, here, the CAE with the nonlinear prediction encodes information on the input more effectively.

Figure 5 depicts results for increasing values of the prediction delay ($k$), ranging from 0 to 10. We evaluated the CAE, AE, and DAE with MLP, LSTM, and BLSTM neural networks, with different layouts (cf. Section 7) per network type. However, due to space restrictions, we only report the best performances. Here, the best performances are obtained with a prediction delay of 3 frames (30 ms) for the NP-BLSTM-DAE network (94.4% F-measure) and of one frame in the case of the NP-LSTM-DAE (94.2% F-measure). As in the A3Novelty database, we observe a similar decrease in performance, down to 86.2% F-measure, when the prediction delay increases up to 10 frames, which corresponds to 100 ms. In fact, applying a higher prediction delay (e.g., 100 ms) induces higher values of the reconstruction error in the presence of fast periodic events, which subsequently leads to an increased false detection rate.

8.3. PROMETHEUS. This subsection elaborates on the results obtained on the four subsets present in the PROMETHEUS database.

8.3.1. ATM. The ATM scenario evaluations are shown in the ATM column of Table 2. The GMM and HMM perform similarly, at chance level: we observe an F-measure of 50.2% and 52.0% for GMMs and HMMs, respectively. The one-class SVM shows slightly better performance, of up to 60.2%.

Figure 6: PROMETHEUS-ATM performances obtained with NP-Autoencoders by progressively increasing the prediction delay.

On the other hand, AE-based approaches in the three configurations, compression (CAE), basic (AE), and denoising (DAE), show a significant improvement in performance, of up to 19.3% absolute F-measure against the OCSVM. Among the three configurations, we observe that the DAE performs better, independently of the type of network. In particular, the best performance considering the ordinary (without nonlinear prediction) approach is obtained with the DAE with an LSTM network, leading to an F-measure of 79.5%.

The last three groups of rows in Table 2 show results of the nonlinear predictive approach (NP). The nonlinear predictive denoising autoencoder performs best, at up to 81.6% F-measure. Surprisingly, the best performance is obtained using MLP units, suggesting that for long events, as those contained in the ATM scenario (with an average duration of 6.0 s, cf. Table 1), memory-enhanced units such as (B)LSTM are not as effective as for shorter events.

A significant absolute improvement of 21.4% F-measure is observed against the OCSVM approach, while an absolute improvement of 31.4% F-measure is exhibited with respect to the GMM-based method. Among the two autoencoder-based approaches, we report an absolute improvement of 1.0% between the "ordinary" and "predictive" structures. It must be observed that the performance presented in [31] is higher than the one provided in this article, since the tolerance window used in that study was set to 1 s, whereas here we aimed at a higher temporal resolution, with a tolerance window of 200 ms, which is suitable also for abrupt events.

Figure 6 depicts performance for progressive values of the prediction delay ($k$), ranging from 0 to 10, applying a CAE, AE, and DAE with MLP, LSTM, and BLSTM networks. Several layouts (cf. Section 7) were evaluated per network type; however, we report only the best configurations. Setting a prediction delay of 1 frame, which corresponds to a total prediction delay of 10 ms, leads to the best performance, of up to 81.6% F-measure, with the NP-MLP-DAE network. In the case of the NP-BLSTM-DAE, we observe better performance with a delay of 2 frames, up to 80.7% F-measure. In general, we do not observe a consistent trend when increasing the prediction delay, corroborating the fact that for long events, as those contained in the ATM scenario, memory-enhanced units and a nonlinear predictive approach are not as effective as for shorter events.

8.3.2. Corridor. The evaluations on the corridor subset are shown in the Corridor column of Table 2. The GMM and HMM perform similarly, at chance level: we observe an F-measure of 49.4% and 49.6% for the GMM and HMM, respectively. The OCSVM shows better performance, up to 65.3%. As observed in the ATM scenario, again a significant improvement in performance, up to 16.5% absolute F-measure, is observed using the autoencoder-based approaches in the three configurations (CAE, AE, and DAE) with respect to the OCSVM. Among the three configurations, we observe that the denoising autoencoder performs better than the others. The best performance is obtained with the denoising autoencoder with BLSTM units, of up to 79.8% F-measure.

The "predictive" approach is reported in the last three groups of rows in Table 2. Interestingly, the nonlinear predictive autoencoders do not improve the performance as we have seen in the other scenarios. A plausible explanation can be found in the nature of the novelty events present in the subset. In fact, the subset contains very long events, with an average duration of up to 14.0 s per event. With such long events, the generative model does not introduce a more sensitive reconstruction error. However, the delta in performance between the BLSTM-DAE and the NP-BLSTM-DAE is rather small (1.3% F-measure), in favour of the "ordinary" approach. The best performance (79.8%) is obtained using BLSTM units, confirming that memory-enhanced units are more effective in the presence of short events. In fact, this scenario, besides very long events, also contains short fall and pain events, with an average duration of 1.0 s and 3.0 s, respectively.

A significant absolute improvement of up to 16.5% F-measure is observed against the OCSVM approach, while being even higher with respect to the GMM and HMM.

8.3.3. Outdoor. The evaluations on the outdoor subset are shown in the Outdoor column of Table 2. The OCSVM, GMM, and HMM perform better in this scenario, as opposed to ATM and corridor: we observe an F-measure of 57.3%, 56.4%, and 56.0% for the OCSVM, GMM, and HMM, respectively. In this scenario, the improvement brought by the autoencoder is not as vast as in the previous subsets, but still significant. We report an absolute improvement of 11.2% F-measure between the OCSVM and the BLSTM-DAE. Again, the denoising autoencoder performs better than the other configurations. In particular, the best performance, obtained with the BLSTM-DAE, is 68.5% F-measure.

As observed in the corridor scenario, the nonlinear predictive autoencoders (last three groups of rows in Table 2) do not improve the performance. These results corroborate our previous explanation that the long duration of the novelty events present in the subset affects the sensitivity of the reconstruction error in the generative model. However, the delta in performance between the BLSTM-DAE and the NP-BLSTM-DAE is rather small (1.3% F-measure).

It must be observed that the performance in this scenario is rather low compared to the one obtained on the other datasets. We believe that the presence of anger novel sounds introduces a higher degree of complexity for our autoencoder-based approach. In fact, anger novel events may contain different levels of aroused content, which could be acoustically similar to the neutral spoken content present in the training material. Under this condition, the generative model shows a low reconstruction error. This issue could be solved by setting the novel events to contain only the aroused segments or, considering anger as a long-term speaker state, by increasing the temporal resolution of our system.

8.3.4. Smart-Room. The smart-room scenario evaluations are shown in the Smart-room column of Table 2. The OCSVM, GMM, and HMM perform better in this scenario, as opposed to ATM, corridor, and outdoor: we observe an F-measure of 57.4%, 59.1%, and 59.1% for the OCSVM, GMM, and HMM, respectively. In this scenario, the improvement brought about by the autoencoder is still significant. We report an absolute improvement of 6.0% F-measure between the GMM/HMM and the BLSTM-DAE. Again, the denoising autoencoder performs better than the other configurations. In particular, the best performance in the ordinary approach is obtained with the BLSTM-DAE, of up to 65.1% F-measure.

The last three groups of rows in Table 2 show results of the nonlinear predictive approach (NP). The NP-BLSTM-DAE performs best, at up to 65.6% F-measure.

As in the outdoor subset, we report a low performance in the smart-room subset as well. In fact, the subset contains several long novel events related to spoken content expressing pain and fear. As commented for the outdoor scenario, under this condition the generative model may be able to reconstruct the novel event without producing a high reconstruction error.

8.4. Overall. Overall, the experimental results proved that the DAE methods achieve superior performance compared to the CAE/AE schemes. This is due to the combination of the two learning processes of a denoising autoencoder: encoding the input by preserving the information about the input itself, while simultaneously reversing the effect of a corruption process applied to the input of the autoencoder.

In particular, the predictive approach with (B)LSTM units showed the best performance, of up to 89.3% average F-measure over all six different datasets, weighted by the number of instances per database (cf. Table 2).

To better understand the improvement brought by the RNN-based approaches, we provide in Figure 7 the comparison between state-of-the-art methods in terms of weighted average F-measure computed across the A3Novelty Corpus, PASCAL CHiME, and PROMETHEUS. In general, we observe that the recently proposed NP-BLSTM-DAE method provided the best performance, in terms of average F-measure, of up to 89.3%. A significant absolute improvement of 16.0% average F-measure is observed against the OCSVM approach, while absolute improvements of 10.6% and 10.4% average F-measure are exhibited with respect to the GMM- and HMM-based methods. An absolute improvement of 0.6% is observed over the "ordinary" BLSTM-DAE.

Figure 7: Average F-measure computed over the A3Novelty Corpus, PASCAL CHiME, and PROMETHEUS, weighted by the number of instances per database. Extended comparison of methods.

It has to be noted that the average F-measure is computed including the PROMETHEUS database, for which the performance has been shown to be lower, because it contains long-term events and a lower label resolution (1 s).

The RNN-based schemes also bring an evident benefit when applied to the "normal" autoencoders (i.e., with no denoising or compression): in fact, the NP-BLSTM-AE achieves an F-measure of 88.5%. Furthermore, when we applied the nonlinear prediction scheme to a denoising autoencoder, the performance achieved with LSTM units was in this case comparable with BLSTM units and also outperformed the state-of-the-art approaches.

In conclusion, the combination of the nonlinear prediction paradigm and the various (B)LSTM autoencoders proved to be effective, significantly outperforming the other state-of-the-art methods. Additionally, there is evidence that memory-enhanced units such as LSTM and BLSTM outperformed the memoryless MLP, showing that knowledge of the temporal context can improve the novelty detector's abilities.

9. Conclusions and Outlook

We presented a broad and extensive evaluation of state-of-the-art methods, with a particular focus on novel and recent unsupervised approaches based on RNN autoencoders. We significantly extended the studies conducted in [35, 36] by evaluating further approaches, such as one-class support vector machines (OCSVMs) and multilayer perceptrons (MLPs), and, most importantly, we conducted a broad evaluation on three different datasets for a total number of 160,153 experiments, making this article the first to present such a complete evaluation in the field of acoustic novelty detection. We showed that RNN-based autoencoders significantly outperform the other methods, achieving up to 89.3% weighted average F-measure on the three databases, with a significant absolute improvement of 10.4% against the best performance obtained with statistical approaches (HMM). Overall, a significant increase in performance was achieved by combining the (B)LSTM autoencoder-based architecture with the nonlinear prediction scheme.

Future work will focus on using multiresolution features [51, 58], which are likely more suitable to deal with different event durations, in order to address the issues encountered in the


PROMETHEUS database. Further studies will be conducted on other architectures of RNN-based autoencoders, ranging from dynamic Bayesian networks [59] to convolutional neural networks [60].

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007–2013) under Grant Agreement no. 338164 (ERC Starting Grant iHEARu), no. 645378 (RIA ARIA-VALUSPA), and no. 688835 (RIA DE-ENIGMA).

References

[1] L. Tarassenko, P. Hayton, N. Cerneaz, and M. Brady, "Novelty detection for the identification of masses in mammograms," in Proceedings of the 4th International Conference on Artificial Neural Networks, pp. 442–447, June 1995.

[2] J. Quinn and C. Williams, "Known unknowns: novelty detection in condition monitoring," in Pattern Recognition and Image Analysis, J. Martí, J. Benedí, A. Mendonça, and J. Serrat, Eds., vol. 4477 of Lecture Notes in Computer Science, pp. 1–6, Springer, Berlin, Germany, 2007.

[3] L. Clifton, D. A. Clifton, P. J. Watkinson, and L. Tarassenko, "Identification of patient deterioration in vital-sign data using one-class support vector machines," in Proceedings of the Federated Conference on Computer Science and Information Systems (FedCSIS '11), pp. 125–131, September 2011.

[4] Y.-H. Liu, Y.-C. Liu, and Y.-J. Chen, "Fast support vector data descriptions for novelty detection," IEEE Transactions on Neural Networks, vol. 21, no. 8, pp. 1296–1313, 2010.

[5] C. Surace and K. Worden, "Novelty detection in a changing environment: a negative selection approach," Mechanical Systems and Signal Processing, vol. 24, no. 4, pp. 1114–1128, 2010.

[6] J. A. Quinn, C. K. I. Williams, and N. McIntosh, "Factorial switching linear dynamical systems applied to physiological condition monitoring," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 9, pp. 1537–1551, 2009.

[7] A. Patcha and J.-M. Park, "An overview of anomaly detection techniques: existing solutions and latest technological trends," Computer Networks, vol. 51, no. 12, pp. 3448–3470, 2007.

[8] M. Markou and S. Singh, "A neural network-based novelty detector for image sequence analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1664–1677, 2006.

[9] M. Markou and S. Singh, "Novelty detection: a review, part 1: statistical approaches," Signal Processing, vol. 83, no. 12, pp. 2481–2497, 2003.

[10] M. Markou and S. Singh, "Novelty detection: a review, part 2: neural network based approaches," Signal Processing, vol. 83, no. 12, pp. 2499–2521, 2003.

[11] E. R. de Faria, I. R. Goncalves, J. Gama, and A. C. Carvalho, "Evaluation of multiclass novelty detection algorithms for data streams," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 11, pp. 2961–2973, 2015.

[12] C. Satheesh Chandran, S. Kamal, A. Mujeeb, and M. H. Supriya, "Novel class detection of underwater targets using self-organizing neural networks," in Proceedings of the IEEE Underwater Technology (UT '15), Chennai, India, February 2015.

[13] K. Worden, G. Manson, and D. Allman, "Experimental validation of a structural health monitoring methodology: part I. Novelty detection on a laboratory structure," Journal of Sound and Vibration, vol. 259, no. 2, pp. 323–343, 2003.

[14] J. Foote, "Automatic audio segmentation using a measure of audio novelty," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '00), vol. 1, pp. 452–455, August 2000.

[15] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Proceedings of Neural Information Processing Systems (NIPS '99), vol. 12, pp. 582–588, Denver, Colo, USA, 1999.

[16] J. Ma and S. Perkins, "Time-series novelty detection using one-class support vector machines," in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '03), vol. 3, pp. 1741–1745, Portland, Ore, USA, July 2003.

[17] J. Ma and S. Perkins, "Online novelty detection on temporal sequences," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), pp. 613–618, ACM, Washington, DC, USA, August 2003.

[18] P. Hayton, B. Schölkopf, L. Tarassenko, and P. Anuzis, "Support vector novelty detection applied to jet engine vibration spectra," in Proceedings of the 14th Annual Neural Information Processing Systems Conference (NIPS '00), pp. 946–952, December 2000.

[19] L. Tarassenko, A. Nairac, N. Townsend, and P. Cowley, "Novelty detection in jet engines," in Proceedings of the IEE Colloquium on Condition Monitoring: Machinery, External Structures and Health, pp. 41–45, Birmingham, UK, April 1999.

[20] L. Clifton, D. A. Clifton, Y. Zhang, P. Watkinson, L. Tarassenko, and H. Yin, "Probabilistic novelty detection with support vector machines," IEEE Transactions on Reliability, vol. 63, no. 2, pp. 455–467, 2014.

[21] D. R. Hardoon and L. M. Manevitz, "One-class machine learning approach for fMRI analysis," in Proceedings of the Postgraduate Research Conference in Electronics, Photonics, Communications and Networks, and Computer Science, Lancaster, UK, April 2005.

[22] M. Davy, F. Desobry, A. Gretton, and C. Doncarli, "An online support vector machine for abnormal events detection," Signal Processing, vol. 86, no. 8, pp. 2009–2025, 2006.

[23] M. A. F. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko, "A review of novelty detection," Signal Processing, vol. 99, pp. 215–249, 2014.

[24] N. Japkowicz, C. Myers, M. Gluck et al., "A novelty detection approach to classification," in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI '95), vol. 1, pp. 518–523, Montreal, Canada, 1995.

[25] B. B. Thompson, R. J. Marks, J. J. Choi, M. A. El-Sharkawi, M.-Y. Huang, and C. Bunje, "Implicit learning in autoencoder novelty assessment," in Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN '02), vol. 3, pp. 2878–2883, IEEE, Honolulu, Hawaii, USA, 2002.

[26] S. Hawkins, H. He, G. Williams, and R. Baxter, "Outlier detection using replicator neural networks," in Data Warehousing and Knowledge Discovery, pp. 170–180, Springer, Berlin, Germany, 2002.


[27] N. Japkowicz, "Supervised versus unsupervised binary-learning by feedforward neural networks," Machine Learning, vol. 42, no. 1-2, pp. 97–122, 2001.

[28] L. Manevitz and M. Yousef, "One-class document classification via neural networks," Neurocomputing, vol. 70, no. 7–9, pp. 1466–1481, 2007.

[29] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, IEEE Computer Society, December 2002.

[30] H. Sohn, K. Worden, and C. R. Farrar, "Statistical damage classification under changing environmental and operational conditions," Journal of Intelligent Material Systems and Structures, vol. 13, no. 9, pp. 561–574, 2002.

[31] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "Probabilistic novelty detection for acoustic surveillance under real-world conditions," IEEE Transactions on Multimedia, vol. 13, no. 4, pp. 713–719, 2011.

[32] E. Principi, S. Squartini, R. Bonfigli, G. Ferroni, and F. Piazza, "An integrated system for voice command recognition and emergency detection based on audio signals," Expert Systems with Applications, vol. 42, no. 13, pp. 5668–5683, 2015.

[33] A. Ito, A. Aiba, M. Ito, and S. Makino, "Evaluation of abnormal sound detection using multi-stage GMM in various environments," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH '11), pp. 301–304, Florence, Italy, August 2011.

[34] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "On acoustic surveillance of hazardous situations," in Proceedings of the 34th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 165–168, April 2009.

[35] E. Marchi, F. Vesperini, F. Eyben, S. Squartini, and B. Schuller, "A novel approach for automatic acoustic novelty detection using a denoising autoencoder with bidirectional LSTM neural networks," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), 5 pages, IEEE, Brisbane, Australia, April 2015.

[36] E. Marchi, F. Vesperini, F. Weninger, F. Eyben, S. Squartini, and B. Schuller, "Non-linear prediction with LSTM recurrent neural networks for acoustic novelty detection," in Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN '15), p. 5, IEEE, Killarney, Ireland, July 2015.

[37] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, "Learning precise timing with LSTM recurrent networks," The Journal of Machine Learning Research, vol. 3, no. 1, pp. 115–143, 2003.

[38] A. Graves, "Generating sequences with recurrent neural networks," https://arxiv.org/abs/1308.0850.

[39] D. Eck and J. Schmidhuber, "Finding temporal structure in music: blues improvisation with LSTM recurrent networks," in Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pp. 747–756, IEEE, Martigny, Switzerland, September 2002.

[40] J. A. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," Tech. Rep. 97-021, International Computer Science Institute, Berkeley, Calif, USA, 1998.

[41] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[42] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Advances in Neural Information Processing Systems 12, S. Solla, T. Leen, and K. Müller, Eds., pp. 582–588, MIT Press, 2000.

[43] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, 1986.

[44] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[45] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[46] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5-6, pp. 602–610, 2005.

[47] G. Hinton, L. Deng, D. Yu et al., "Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.

[48] I. Goodfellow, H. Lee, Q. V. Le, A. Saxe, and A. Y. Ng, "Measuring invariances in deep networks," in Advances in Neural Information Processing Systems, Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, Eds., vol. 22, pp. 646–654, Curran & Associates, Inc., Red Hook, NY, USA, 2009.

[49] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds., pp. 153–160, MIT Press, 2007.

[50] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.

[51] E. Marchi, G. Ferroni, F. Eyben, L. Gabrielli, S. Squartini, and B. Schuller, "Multi-resolution linear prediction based features for audio onset detection with bidirectional LSTM neural networks," in Proceedings of the 39th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 2183–2187, IEEE, Florence, Italy, May 2014.

[52] F. Eyben, F. Weninger, F. Gross, and B. Schuller, "Recent developments in openSMILE, the Munich open-source multimedia feature extractor," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 835–838, ACM, Barcelona, Spain, October 2013.

[53] J. Barker, E. Vincent, N. Ma, H. Christensen, and P. Green, "The PASCAL CHiME speech separation and recognition challenge," Computer Speech & Language, vol. 27, no. 3, pp. 621–633, 2013.

[54] F. Weninger, J. Bergmann, and B. Schuller, "Introducing CURRENNT: the Munich open-source CUDA RecurREnt neural network toolkit," Journal of Machine Learning Research, vol. 16, pp. 547–551, 2015.

[55] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, article 27, 2011.

[56] R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: a Matlab-like environment for machine learning," in BigLearn, NIPS Workshop, no. EPFL-CONF-192376, 2011.

[57] M. D. Smucker, J. Allan, and B. Carterette, "A comparison of statistical significance tests for information retrieval evaluation," in Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM '07), pp. 623–632, ACM, Lisbon, Portugal, November 2007.


[58] E. Marchi, G. Ferroni, F. Eyben, S. Squartini, and B. Schuller, "Audio onset detection: a wavelet packet based approach with recurrent neural networks," in Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN '14), pp. 3585–3591, Beijing, China, July 2014.

[59] N. Friedman, K. Murphy, and S. Russell, "Learning the structure of dynamic probabilistic networks," in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI '98), pp. 139–147, Morgan Kaufmann, Madison, Wisc, USA, July 1998.

[60] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, "Face recognition: a convolutional neural-network approach," IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98–113, 1997.



[18], failure detection in jet engines [19], patient vital-sign monitoring [20], fMRI analysis [21], and damage detection of a gearbox wheel [22].

Neural network-based approaches, also named reconstruction-based [23], have gained interest in recent years, along with the evident success of neural networks in several other fields. In the past decade, several works focusing on the application of a neural network in the form of an autoencoder (AE) have been presented [10], given the huge impact and effectiveness of neural networks. The autoencoder-based approaches involve building a regression model using the "normal" data. The test data are processed by analysing the reconstruction error between the regression target and the encoded value: when the reconstruction error shows a high score, the test data are considered novel. Examples of applications include detecting abnormal CPU data usage [24, 25], detecting outliers [26–29], and damage classification under changing environmental conditions [30].

In these scenarios, very few studies have been conducted in the field of acoustic novelty detection. Recently, we observed a growing research interest in application domains involving surveillance and homeland security, to monitor public places or supervise private environments where people may live alone. Driven by the increasing requirement of security, public places such as, but not limited to, stores, banks, subway trains, and airports have been equipped with various sensors like cameras or microphones. As a consequence, unsupervised monitoring systems have gained much attention in the research community, prompting the investigation of new and efficient signal processing approaches. The research in the area of surveillance systems mainly focusses on detecting abnormal events relying on video information [8]. However, it has to be noted that several advantages can be obtained by relying on acoustic information: acoustic signals, as opposed to video information, need low computational costs and are invariant to illumination conditions, possible occlusions, and abrupt events (e.g., a shotgun and explosions). Specifically, in the field of acoustic novelty detection, studies have focused only on statistical approaches, by applying hidden Markov models (HMM) and Gaussian mixture models (GMM) to the acoustic surveillance of abnormal situations [31–33] and to automatic space monitoring [34]. Despite the number of studies exploring statistical and probabilistic approaches, the use of neural network-based approaches for acoustic novelty detection has only been introduced recently [35, 36].

Contribution. Only in the last two years has the use of neural networks for acoustic novelty detection gained interest in the research community. In fact, a few recent studies proposed a (pseudo-)generative model in the form of a denoising autoencoder with recurrent neural networks (RNNs). In particular, the use of Long Short-Term Memory (LSTM) RNNs as a generative model [37] was investigated in the fields of text generation [38], handwriting [38], and music [39]. However, the use of LSTM as a model for audio generation was only introduced in our recent works [35, 36].

This article provides a broad and extensive evaluation of state-of-the-art methods, with a particular focus on novel and recent unsupervised approaches based on RNN autoencoders. We significantly extended the studies conducted in [35, 36] by evaluating further approaches, such as one-class SVMs (OCSVMs) and multilayer perceptrons (MLPs), and, most importantly, we conducted a broad and in-depth evaluation on three different datasets for a total number of 160,153 experiments, making this article the first to present such a complete evaluation in the field of acoustic novelty detection.

We evaluate and compare all these methods on three different databases: A3Novelty, PASCAL CHiME, and PROMETHEUS. We provide evidence that RNN-based autoencoders significantly outperform the other methods, surpassing statistical approaches by up to an absolute improvement of 16.4% average F-measure over the three databases.

The remainder of this contribution is structured as follows. First, a basic description of the different statistical methods is given in Section 2. Then, the feed-forward and LSTM RNNs, together with the autoencoder-based schemes for acoustic novelty detection, are described (Sections 3 and 4). Next, the thresholding strategy and the features employed in the experiments are given in Section 5. The databases used are introduced in Section 6, and the experimental set-up is discussed in Section 7, before the evaluation of the obtained results is presented in Section 8. Section 9 finally presents our conclusions.

2. Statistical Methods

In this section, we introduce the statistical approaches GMM, HMM, and OCSVM. We formally define the input vector $x \in \mathbb{R}^n$, where $n$ is the number of acoustic features (cf. Section 5).

2.1. Gaussian Mixture Models. GMMs estimate the probability density of the "normal" class, given training data, using a number of Gaussian components. The training phase of a GMM exploits the $k$-means algorithm, or another suited initialization algorithm, and the Expectation-Maximisation (EM) algorithm [40]: the former initializes the parameters, while $L$ iterations of the EM algorithm lead to the final model. Given a predefined threshold (defined in Section 5), if the probability produced by the GMM for a test sample is lower than the threshold, the sample is detected as a novel event.
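For illustration, the detection rule just described can be sketched as follows with scikit-learn (the experiments in Section 7 used the Torch toolkit instead); the component count and the threshold offset are placeholders, not the tuned values:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# X_train: feature frames of "normal" background audio, shape (n_frames, 54)
X_train = np.random.randn(5000, 54)   # placeholder for real ASF features
X_test = np.random.randn(1000, 54)

# Fit a GMM on "normal" material only (k-means initialization and
# EM refinement are scikit-learn's defaults)
gmm = GaussianMixture(n_components=8, covariance_type='diag').fit(X_train)

# Frame-wise log-likelihood under the "normal" model
log_lik = gmm.score_samples(X_test)

# Frames whose likelihood falls below the threshold are flagged as novel
theta = np.median(log_lik) - 3.0      # illustrative threshold, not the paper's
novel = log_lik < theta
```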

2.2. Hidden Markov Models. A further statistical model is the HMM [41]. HMMs differ from GMMs in the modelling of the input's temporal evolution: while a diagonal GMM tends to approximate the whole training data probability distribution by means of a number of Gaussian components, an HMM models the variations of the input signal through its hidden states. The HMM topology employed in this work is left-right, and it is trained by means of the Baum-Welch algorithm [41]. Regarding the novelty detection phase, the decision is based on the sequence paradigm: considering a left-right HMM having $N_s$ hidden states, a sequence is a set of $N_s$ feature vectors $\{x_1, \ldots, x_{N_s}\}$. The emission probabilities of these observable events are determined by a probability distribution, one for each state [9]. We trained an HMM on what we call "normal" material and exploited


the log-likelihoods as novelty scores. In the testing phase, the unseen signal is segmented into fixed lengths, depending on the number of states of the HMM, and if the log-likelihood is higher than the defined threshold (cf. Section 5), the segment is detected as novel.

2.3. One-Class Support Vector Machines. An OCSVM [42] maps an input example onto a high-dimensional feature space and iteratively searches for the hyperplane that maximises the distance of the training examples from the origin. In this constellation, the OCSVM can be seen as a two-class SVM where the origin is the unique member of the second class, whereas the training examples belong to the first class. Given the training data $x_1, \ldots, x_l \in X$, where $l$ is the number of observations, the class separation is performed by solving the following:

$$\min_{w, \xi, \rho} \quad \frac{1}{2}\|w\|^2 + \frac{1}{\nu l}\sum_{i}\xi_i - \rho$$
$$\text{subject to} \quad (w \cdot \Phi(x_i)) \ge \rho - \xi_i, \quad \xi_i \ge 0, \qquad (1)$$

where $w$ is the support vector, $\xi_i$ are slack variables, $\rho$ is the offset, and $\Phi$ maps $X$ into a dot product space $F$, such that the dot product in the image of $\Phi$ can be computed by evaluating a kernel function, such as a linear or a Gaussian radial basis function:

$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right). \qquad (2)$$

The parameter $\nu$ sets an upper bound on the fraction of outliers, defined as the data lying outside the estimated region of normality. The decision values are thus obtained with the following function:

$$f(x) = w \cdot \Phi(x) - \rho. \qquad (3)$$

We trained an OCSVM on what we call "normal" material and used the decision values as novelty scores. During testing, the OCSVM provides a decision value for the unseen pattern, and if the decision value is higher than the defined threshold (cf. Section 5), the segment is detected as novel.
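The same scheme can be sketched with scikit-learn's OneClassSVM, a wrapper around the LIBSVM library used in Section 7; the ν and γ values below are single points from the grid reported there:

```python
import numpy as np
from sklearn.svm import OneClassSVM

X_train = np.random.randn(5000, 54)   # "normal" ASF frames (placeholder)
X_test = np.random.randn(1000, 54)

# RBF kernel as in (2); nu bounds the fraction of outliers as described above
ocsvm = OneClassSVM(kernel='rbf', gamma=0.01, nu=0.1).fit(X_train)

# Signed decision values f(x) from (3), used as frame-wise novelty scores
scores = ocsvm.decision_function(X_test)
```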

3. Feed-Forward and Recurrent Neural Networks

This section introduces the MLPs and the LSTM RNNs employed in our acoustic novelty detectors.

The first neural network type we used is the multilayer perceptron [43]. In an MLP, the units are arranged in layers, with feed-forward connections from one layer to the next. Each node outputs an activation function applied over the weighted sum of its inputs. The activation function can be linear, a hyperbolic function (tanh), or the sigmoid function. Input examples are fed to the input layer, and the resulting output is propagated via the hidden layers towards the output layer. This process is known as the forward pass of the network. This type of neural network relies only on the current input and not on any past or future inputs.

Figure 1: LSTM unit comprising the input, output, and forget gates and the memory cell.

The second neural network type we employed is the LSTM RNN [44]. Compared to a conventional RNN, the hidden units are replaced by so-called memory blocks, which can store information in the "cell variable" $c_t$. In this way, the network can exploit long-range temporal context. Each memory block consists of a memory cell and three gates, the input gate, the output gate, and the forget gate, as depicted in Figure 1.

The memory cell is controlled by the input, output, and forget gates.

The stored cell variable $c_t$ can be reset by the forget gate, while the functions responsible for reading input from $x_t$ and writing output to $h_t$ are controlled by the input and output gates, respectively:

$$c_t = f_t \otimes c_{t-1} + i_t \otimes \tanh\left(W_{xc}x_t + W_{hc}h_{t-1} + b_c\right),$$
$$h_t = o_t \otimes \tanh(c_t), \qquad (4)$$

where $\tanh$ and $\otimes$ stand for the element-wise hyperbolic tangent and the element-wise multiplication, respectively. The output of the input gate is denoted by the variable $i_t$, while the outputs of the output and forget gates are indicated by $o_t$ and $f_t$, respectively. The variable $W$ denotes a weight matrix, and $b_c$ indicates a bias term.

Each LSTM unit is a separate and independent block: the size of $h_t$ is the same as that of $i_t$, $o_t$, $f_t$, and $c_t$, and it corresponds to the number of LSTM units in the hidden layer. In order to make the gates depend only on the memory cell within the same LSTM unit, the matrices of the weights from the cells to the gates are diagonal.
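To make the recurrence concrete, the following NumPy sketch computes one LSTM step: the cell update and output follow (4), the gate activations follow the standard peephole formulation of [37, 44] with diagonal cell-to-gate weights, and all names and initializations are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step for m units; cf. (4).

    W['x*'] are input weights (m, n), W['h*'] recurrent weights (m, m), and
    W['c*'] the diagonal cell-to-gate (peephole) weights, stored as vectors.
    """
    i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + W['ci'] * c_prev + b['i'])
    f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] * c_prev + b['f'])
    c_t = f_t * c_prev + i_t * np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])
    o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] * c_t + b['o'])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Gaussian initialization with standard deviation 0.1, as used in this work
rng = np.random.default_rng(0)
m, n = 8, 54
W = {k: rng.normal(0, 0.1, (m, n)) for k in ('xi', 'xf', 'xc', 'xo')}
W.update({k: rng.normal(0, 0.1, (m, m)) for k in ('hi', 'hf', 'hc', 'ho')})
W.update({k: rng.normal(0, 0.1, m) for k in ('ci', 'cf', 'co')})
b = {k: np.zeros(m) for k in ('i', 'f', 'c', 'o')}
h, c = lstm_step(rng.normal(size=n), np.zeros(m), np.zeros(m), W, b)
```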

Furthermore, we employed bidirectional RNNs (BRNNs) [45], which are capable of learning the context in both temporal directions. A BRNN contains two distinct hidden layers, which process the input vector in opposite directions, and the output layer is connected to both hidden layers. A more complex architecture can be obtained by combining LSTM units with a BRNN, which is referred to as bidirectional LSTM (BLSTM) [46]. A BLSTM exploits context from both temporal directions. Note that, in the case of BLSTM, it is not possible to perform fully online processing, as a short look-ahead buffer is required.


When the layout of a neural network comprises more hidden layers, it is defined as a deep neural network (DNN) [47]. An incrementally higher-level representation of the input data is provided when multiple hidden layers are stacked on each other (deep learning).

In the case of multiple layers, the output of a BRNN is computed as

$$y_t = W_{\overrightarrow{h}^N y}\,\overrightarrow{h}_t^N + W_{\overleftarrow{h}^N y}\,\overleftarrow{h}_t^N + b_y, \qquad (5)$$

where the forward and backward activations of the $N$th (last) hidden layer are denoted by $\overrightarrow{h}_t^N$ and $\overleftarrow{h}_t^N$, respectively.

The reconstructed signal is generated by using an identity activation function at the output. The best network layout was obtained by conducting a number of preliminary evaluations: several configurations were evaluated by changing the size and the number of hidden layers (i.e., the number of LSTM units per layer).

The training procedure was iterated up to a maximum of 100 epochs. Standard gradient descent with backpropagation of the sum of squared errors was used to recursively update the network weights. The weights were initialized from a random Gaussian distribution with mean 0 and standard deviation 0.1, as this usually provides an acceptable initialization in our experience.

4. Autoencoders for Acoustic Novelty Detection

This section introduces the concept of autoencoders and describes the basic autoencoder, the compression autoencoder, the denoising autoencoder, and the nonlinear predictive autoencoder [36].

4.1. Basic Autoencoder. A basic autoencoder is a neural network trained to set the target values equal to the inputs. Its structure typically consists of only one hidden layer, while the input and the output layers have the same size. The training set $X_{tr}$ consists of background environmental sounds, while the test set $X_{te}$ is composed of recordings containing abnormal sounds. The autoencoder is used to find a common data representation from the input [48, 49]. Formally, in response to an input example $x \in \mathbb{R}^n$, the hidden representation $h(x) \in \mathbb{R}^m$ is

$$h(x) = f(W_1 x + b_1), \qquad (6)$$

where $f(z)$ is a nonlinear activation function, typically a logistic sigmoid function $f(z) = 1/(1 + \exp(-z))$ applied component-wise, $W_1 \in \mathbb{R}^{m \times n}$ is a weight matrix, and $b_1 \in \mathbb{R}^m$ is a bias vector.

The network output maps the hidden representation $h$ back to a reconstruction $\tilde{x} \in \mathbb{R}^n$:

$$\tilde{x} = f(W_2 h(x) + b_2), \qquad (7)$$

where $W_2 \in \mathbb{R}^{n \times m}$ is a weight matrix and $b_2 \in \mathbb{R}^n$ is a bias vector.

Given an input set of examples $X$, AE training consists in finding the parameters $\theta = \{W_1, W_2, b_1, b_2\}$ that minimise the reconstruction error, which corresponds to minimising the following objective function:

$$J(\theta) = \sum_{x \in X} \|x - \tilde{x}\|^2. \qquad (8)$$

A well-known approach to minimise the objective function is stochastic gradient descent with error backpropagation. The layout of the AE is shown in Figure 2(a).
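As a minimal illustration of (6)-(8), the PyTorch sketch below trains a single-hidden-layer autoencoder on 54-dimensional frames; it is a simplified stand-in for the CURRENNT-based setup of Section 7 and uses an identity output activation, as described in Section 3:

```python
import torch
import torch.nn as nn

class BasicAE(nn.Module):
    """One hidden layer; input and output have the same size, cf. (6)-(7)."""
    def __init__(self, n_in=54, n_hidden=54):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
        self.dec = nn.Linear(n_hidden, n_in)   # identity output activation

    def forward(self, x):
        return self.dec(self.enc(x))

ae = BasicAE()
opt = torch.optim.SGD(ae.parameters(), lr=1e-3)
sse = nn.MSELoss(reduction='sum')              # sum of squared errors, cf. (8)

X_tr = torch.randn(5000, 54)                   # placeholder "normal" frames
for epoch in range(10):
    opt.zero_grad()
    loss = sse(ae(X_tr), X_tr)                 # reconstruction error J(theta)
    loss.backward()
    opt.step()
```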

4.2. Compression Autoencoder. The compression autoencoder (CAE) learns a compressed representation of the input when the number of hidden units $m$ is smaller than the number of input units $n$. For example, if some of the input features are correlated, these correlations are learnt by the CAE and exploited for the reconstruction. The structure of the CAE is given in Figure 2(b).

4.3. Denoising Autoencoder. In the denoising AE (DAE) [50] configuration, the network is trained to reconstruct the original input from a corrupted version of it. The initial input $x$ is corrupted by means of additive isotropic Gaussian noise, in order to obtain $x' \mid x \sim \mathcal{N}(x, \sigma^2 I)$. The corrupted input $x'$ is then mapped, as with the AE, to a hidden representation

$$h(x') = f(W_1' x' + b_1'), \qquad (9)$$

forcing the hidden layer to retrieve more robust features and preventing it from simply learning the identity. The original signal is thus reconstructed as follows:

$$\tilde{x}' = f(W_2' h(x') + b_2'). \qquad (10)$$

The structure of the denoising autoencoder is shown in Figure 2(c). In the training phase, the set of network weights and biases $\theta' = \{W_1', W_2', b_1', b_2'\}$ is updated in order to have $\tilde{x}'$ as close as possible to the uncorrupted input $x$; this procedure corresponds to minimising the reconstruction error objective function (8). In our approach, the initial input $x_t$ is corrupted with the additive isotropic Gaussian noise described above.

4.4. Nonlinear Predictive Autoencoder. The basic idea of the nonlinear predictive (NP) AE is to train the AE to predict the current frame from previous observations. Formally, the input up to a given time frame $x_t$ is mapped to a hidden representation $h$:

$$h(x_t) = f(W_1^*, b_1^*, x_{1 \ldots t}), \qquad (11)$$

where $W^*$ and $b^*$ denote weights and biases, respectively. From this, we reconstruct an approximation of the original signal as follows:

$$\tilde{x}_{t+k} = f(W_2^*, b_2^*, h_{1 \ldots t}), \qquad (12)$$

where $k$ is the prediction delay and $h_i = h(x_i)$. A prediction delay of $k = 1$ corresponds to a shift of 10 ms in the audio signal in our setting (cf. Section 5). The training of


Figure 2: Block diagram of the proposed acoustic novelty detector with different autoencoder structures. Features are extracted from the input signal, and the reconstruction error between the input and the reconstructed features is then processed by a thresholding block, which detects the novel or nonnovel event. Structure of the (a) basic autoencoder, (b) compression autoencoder, and (c) denoising autoencoder, on the training set $X_{tr}$ or the testing set $X_{te}$; $X_{tr}$ contains data of nonnovel acoustic events, while $X_{te}$ consists of novel and nonnovel acoustic events.

the parameters is performed by minimising the objective function (8); the difference is that $\tilde{x}$ is now based on nonlinear prediction, according to (11) and (12). Thus, the parameters $\theta^* = \{W_1^*, W_2^*, b_1^*, b_2^*\}$ are trained to minimise the average reconstruction error over the training set, that is, to have $\tilde{x}_{t+k}$ as close as possible to the frame at the prediction delay, $x_{t+k}$. The resulting structure of the nonlinear predictive denoising autoencoder (NP-DAE) is similar to the one depicted in Figure 2(c), but with input and output updated as described above.
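The essential data preparation for such a predictive scheme can be sketched as follows; the helper function is hypothetical, but the role of the delay k and the 10 ms frame shift match the text:

```python
import numpy as np

def make_predictive_pairs(features, k=1):
    """Pair frame t (input) with frame t+k (target), so the autoencoder
    must predict k frames, i.e. k * 10 ms, ahead."""
    return features[:-k], features[k:]

frames = np.random.randn(3000, 54)         # placeholder ASF sequence
X, Y = make_predictive_pairs(frames, k=3)  # 30 ms prediction delay
```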

5. Thresholding and Features

This section describes the thresholding decision strategy and the features employed in our experiments.

5.1. Thresholding. The auditory spectral features (ASF, cf. Section 5.2) used in this work are composed of 54 coefficients, which means that the input and output layers of the network have 54 units each. The trained AE reconstructs each sample, and novel events are identified by processing the reconstruction error signal with an adaptive threshold. The input audio signal $x$ is segmented into sequences of 30 seconds of length. In the testing phase, we compute, on a

frame basis, the average Euclidean distance between the network's outputs and each standardized input feature value. In order to compress the reconstruction error to a single value, the distances are summed up and divided by the number of coefficients. Then, we apply a threshold $\theta_{th}$ to obtain a binary signal, shifting from the median of the error signal $e_0$ of a sequence by a multiplicative coefficient $\beta$, which ranges from $\beta_{min} = 1$ to $\beta_{max} = 2$:

$$\theta_{th} = \beta \cdot \mathrm{median}\left(e_0(1), \ldots, e_0(N)\right). \qquad (13)$$
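A direct transcription of this procedure might look as follows, with β chosen from the stated range [1, 2]:

```python
import numpy as np

def detect_novelty(x, x_rec, beta=1.5):
    """Frame-wise novelty decision from the reconstruction error, cf. (13).

    x, x_rec: (n_frames, 54) standardized input and reconstructed features.
    Returns a boolean array that is True where a frame is flagged as novel.
    """
    # Per-frame distances, summed over and divided by the 54 coefficients
    e0 = np.mean(np.abs(x - x_rec), axis=1)
    theta_th = beta * np.median(e0)   # adaptive threshold (13)
    return e0 > theta_th
```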

Figure 3 shows the reconstruction error for a given sequence. The figure clearly depicts a low reconstruction error in reproducing normal input, such as talking, television sounds, and other normal environmental sounds.

5.2. Acoustic Features. An efficient representation of the audio signal can be achieved by extracting the auditory spectral features (ASF) [51]. The audio signal is split into frames with a size of 30 ms and a frame step of 10 ms, and the ASF are then obtained by applying the Short Time Fourier Transform (STFT), which yields the power spectrogram of each frame. Mel spectrograms $M_{30}(n, m)$


Figure 3: Spectrogram of a 30-second sequence from (a) the A3Novelty Corpus, containing a novel siren event, (b) the PASCAL CHiME database, containing three novel events (a siren and two screams), and (c) the PROMETHEUS database, containing a novel event composed of an alarm and people screaming together (top). Reconstruction error signal of the related sequence, obtained with a BLSTM-DAE (middle). Ground-truth binary novelty signal: novel (1) and not novel (0) (bottom).

(with $n$ being the frame index and $m$ the frequency bin index) are calculated by converting the power spectrogram to the Mel-frequency scale, using a filter-bank with 26 triangular filters. A logarithmic scaling is chosen to match the human perception of loudness:

$$M_{30}^{\log}(n, m) = \log\left(M_{30}(n, m) + 1.0\right). \qquad (14)$$

In addition, the positive first-order differences $D_{30}(n, m)$ are calculated from each Mel spectrogram, following

$$D_{30}(n, m) = M_{30}^{\log}(n, m) - M_{30}^{\log}(n - 1, m). \qquad (15)$$

Furthermore, the frame energy and its derivative are also included as features, resulting in a total number of 54 coefficients. For better reproducibility, the feature extraction process is computed with our open-source audio analysis toolkit openSMILE [52].
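For readers without openSMILE at hand, a rough librosa-based approximation of these 54 coefficients is sketched below; it is an illustrative stand-in, not the exact openSMILE configuration:

```python
import numpy as np
import librosa

def asf_features(wav_path):
    """Approximate ASF: 26 log-Mel bands + differences + energy + delta."""
    y, sr = librosa.load(wav_path, sr=16000)
    # 30 ms frames, 10 ms step, 26 triangular Mel filters
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=int(0.03 * sr), hop_length=int(0.01 * sr), n_mels=26)
    log_mel = np.log(mel + 1.0)                                # cf. (14)
    diff = np.diff(log_mel, axis=1, prepend=log_mel[:, :1])    # cf. (15)
    energy = np.log(mel.sum(axis=0, keepdims=True) + 1.0)
    d_energy = np.diff(energy, axis=1, prepend=energy[:, :1])
    return np.vstack([log_mel, diff, energy, d_energy]).T      # (n_frames, 54)
```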

6. Databases

This section describes the three databases evaluated in our experiments: A3Novelty, PASCAL CHiME, and PROMETHEUS.

6.1. A3Novelty. The A3Novelty Corpus (http://www.a3lab.dii.univpm.it/research/a3novelty) includes around 56 hours of recordings acquired in a laboratory of the Università Politecnica delle Marche. The recordings were performed during different day and night hours, so very different acoustic conditions are available. A variety of novel events (e.g., screams, falls, alarms, or breakages of objects) were randomly played back through a loudspeaker during the recordings.

Eight microphones were used in the recording room for the acquisitions: four Behringer B-5 microphones with cardioid pattern and an array of four AKG C400BL microphones spaced by 4 cm. A MOTU 8pre sound card and the NU-Tech software were then used to record the microphone signals. The sampling rate was equal to 48 kHz.

The abnormal event sounds (cf. Table 1) can be grouped into four categories, and they are freely available for download from http://www.freesound.org:

(i) Sirens: three different types of sirens or alarm sounds.
(ii) Falls: two occurrences of a person or an object falling to the ground.
(iii) Breakage of objects: noise produced by the breakage of an object after the impact with the ground.
(iv) Screams: four different human screams, produced both by a single person and by a group of people.

The A3Novelty Corpus is composed of two types of recordings: background, which contains only background sounds such as human speech, technical tool noise, and environmental sounds, and background with novelty, which contains, in addition to the background, the artificially generated novelty events.

In the original A3Novelty database, the recordings are segmented into sequences of 30 seconds. In order to limit the size of the training data, we randomly selected 300 sequences from the background partition to compose the training material (150 minutes) and 180 sequences from the background with novelty partition to compose the testing set (90 minutes). The test set contains 13 novelty occurrences.

For reproducibility, the list of randomly selected recordings and the train and test sets are made available (http://www.a3lab.dii.univpm.it/research/a3novelty).

6.2. PASCAL CHiME. The original dataset is composed of around 7 hours of recordings of a home environment, taken from the PASCAL CHiME speech separation and recognition


Table 1: Acoustic novel events in the test set. Shown are the number of different events per database, the total duration in seconds, and the average duration per event type; cells read "events: total time (average time)". ATM, Corridor, Outdoor, and Smart-room are the PROMETHEUS subsets. The last column indicates the total number of events and the total duration across the databases. The last line indicates the total duration in seconds of the test set, including normal and novel events, per database.

Event     | A3Novelty      | PASCAL CHiME     | ATM           | Corridor        | Outdoor          | Smart-room       | Total
Alarm     | --             | 76: 435.8 (6.0)  | --            | 6: 84.0 (14.0)  | --               | 3: 9.0 (3.0)     | 85: 528.8
Anger     | --             | --               | --            | --              | 6: 293.0 (48.8)  | --               | 6: 293.0
Fall      | 3: 4.2 (2.1)   | 48: 89.5 (1.8)   | --            | 3: 3.0 (1.0)    | --               | 2: 2.0 (1.0)     | 55: 98.7
Fracture  | 1: 2.2         | 32: 70.4 (2.2)   | --            | --              | --               | --               | 33: 72.6
Pain      | --             | --               | --            | 2: 8.0 (4.0)    | --               | 5: 67.0 (13.4)   | 7: 75.0
Scream    | 6: 10.4 (1.7)  | 111: 214.6 (1.9) | 5: 30.0 (6.0) | 25: 228.0 (9.1) | 4: 48.0 (12.0)   | 10: 234.0 (23.4) | 159: 762.2
Siren     | 3: 20.4 (6.8)  | --               | --            | --              | --               | --               | 3: 18.1
Total     | 13: 38.1 (2.9) | 267: 810.3 (3.1) | 5: 30.0 (6.0) | 36: 323.0 (9.0) | 10: 341.0 (34.1) | 20: 312.0 (15.6) | 348: 1848.4
Test time | 5400.0         | 4188.0           | 750.0         | 960.0           | 1620.0           | 1020.0           | 13938.0

challenge [53]. It consists of a typical in-home scenario (a living room), recorded during different days and times, while the inhabitants (two adults and two children) perform common actions, such as talking, watching television, playing, or eating. The dataset was recorded with a binaural microphone at a sample rate of 16 kHz. In the original PASCAL CHiME database, the recordings are segmented into sequences of 5 minutes' duration. In order to limit the size of the training data, we randomly selected sequences to compose 100 minutes of background for the training set and around 70 minutes for the testing set. For reproducibility, the list of randomly selected recordings and the train and test sets are made available (http://a3lab.dii.univpm.it/webdav/audio/NoveltyDetection_Dataset.tar.gz). The test set was generated by adding different types of sounds (taken from http://www.freesound.org), such as screams, alarms, falls, and fractures (cf. Table 1), after their normalization to the volume of the background recordings. The events in the test set were added at random positions; thus, the distance between one event and another is not fixed.

6.3. PROMETHEUS. The PROMETHEUS database [31] contains recordings of various scenarios designed to serve a wide range of real-world applications. The database includes (1) a smart-room indoor home environment, including phases where a user is interacting with an automated speech-driven home assistant, (2) an outdoor public space, consisting of (a) interaction of people with an ATM and (b) an outdoor security scenario in which people are waiting in front of a counter, and (3) an indoor office corridor scenario for security monitoring in a standard indoor space. These scenarios differ substantially in terms of acoustic environment: the indoor scenarios were recorded under quiet acoustic conditions, whereas the outdoor recordings were conducted in an open-air public area and contain nonstationary background noise. The smart-home scenario contains recordings of five professional actors performing five single-person and 14 multiple-person action scripts. The main activities include human-machine interaction with a virtual home agent and a number of alternating normal and abnormal activities specifically designed to

monitor and interpret human behaviour. The single-person and multiple-person actions include abnormal events such as falls, alarms followed by panic, atypical vocalic reactions (pain, fear, and anger), or fractures. Examples are walking to the couch, sitting, or interacting with the smart environment to turn the TV on, open the windows, or decrease the temperature. The scenarios were recorded three to five times, changing the actors and their roles in the action scripts. Table 1 provides details on the number of abnormal events per scenario, including the average time duration.

7. Experimental Set-Up

The networks were trained with the gradient steepest descent algorithm on the sum of squared errors (SSE). In the case of all the LSTM and BLSTM networks, we used a constant learning rate $l = 10^{-6}$, since it showed better performance in our previous works [35], whereas different values $l = \{10^{-8}, 10^{-9}\}$ were used for the MLP networks. Different noise sigma values $\sigma = \{0.01, 0.1, 0.25\}$ were applied to the DAE. No Gaussian noise was applied to the basic AE or to the CAE, following the architectures described in Section 4. The prediction delay was applied with different values $k = \{1, 2, 3, 4, 5, 10\}$. The AEs were trained using our open-source CUDA RecurREnt Neural Network Toolkit (CURRENNT) [54], ensuring reproducibility. As evaluation metric, we used the F-measure, in order to compare the results with previous works [35, 36]. We evaluated several topologies for the nonlinear predictive DAE, ranging from 54-128-54 to 216-216-216, and from 54-30-54 to 54-54-54 in the cases of the CAE and the basic AE, respectively. Every network topology was evaluated for each of the 100 epochs of training. In order to compare our results with our previous studies, we kept the same optimisation procedure as applied in [35, 36]. We further employed three state-of-the-art approaches based on OCSVM, GMM, and HMM. In the case of OCSVM, we trained models at different complexity values $C = \{0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001\}$. Radial basis function kernels were used with different gamma values $\gamma = \{0.01, 0.001\}$, and we controlled the fraction of outliers in


Table 2: Comparison of the methods over the three databases by percentage of F-measure (F1). Indicated are the prediction (D)elay k and the average F1 weighted by the number of instances per database. Reported approaches are GMM, HMM, and OCSVM; the compression autoencoder with MLP (MLP-CAE), LSTM (LSTM-CAE), and BLSTM (BLSTM-CAE) units; the basic (AE) and denoising (DAE) autoencoders with the same unit types; and the related nonlinear predictive versions NP-MLP-CAE/AE/DAE and NP-(B)LSTM-CAE/AE/DAE. ATM, Corridor, Outdoor, and Smart-room are the PROMETHEUS subsets.

Method        | A3Novelty | PASCAL CHiME | ATM       | Corridor  | Outdoor   | Smart-room | Avg. F1
              | D(k)  F1  | D(k)  F1     | D(k)  F1  | D(k)  F1  | D(k)  F1  | D(k)  F1   |
OCSVM         | --   91.8 | --   63.4    | --   60.2 | --   65.3 | --   57.3 | --   57.4  | 73.3
GMM           | --   89.4 | --   89.4    | --   50.2 | --   49.4 | --   56.4 | --   59.1  | 78.7
HMM           | --   88.2 | --   91.4    | --   52.0 | --   49.6 | --   56.0 | --   59.1  | 78.9
MLP-CAE       | 0    97.6 | 0    85.2    | 0    76.1 | 0    76.2 | 0    64.8 | 0    61.2  | 84.8
LSTM-CAE      | 0    97.7 | 0    89.1    | 0    78.5 | 0    77.8 | 0    62.6 | 0    64.0  | 86.2
BLSTM-CAE     | 0    98.7 | 0    91.3    | 0    78.4 | 0    78.4 | 0    62.1 | 0    63.7  | 87.3
MLP-AE        | 0    97.2 | 0    85.0    | 0    76.1 | 0    77.0 | 0    65.1 | 0    61.8  | 84.8
LSTM-AE       | 0    97.7 | 0    89.1    | 0    78.7 | 0    77.9 | 0    61.6 | 0    61.4  | 86.0
BLSTM-AE      | 0    97.8 | 0    89.4    | 0    78.5 | 0    77.6 | 0    63.1 | 0    63.4  | 86.4
MLP-DAE       | 0    97.3 | 0    87.3    | 0    77.5 | 0    78.5 | 0    65.8 | 0    64.6  | 86.0
LSTM-DAE      | 0    97.9 | 0    92.4    | 0    79.5 | 0    78.7 | 0    68.0 | 0    65.0  | 88.1
BLSTM-DAE     | 0    98.4 | 0    93.4    | 0    78.7 | 0    79.8 | 0    68.5 | 0    65.1  | 88.7
NP-MLP-CAE    | 4    98.5 | 5    88.3    | 1    78.8 | 2    75.0 | 1    65.2 | 5    64.0  | 86.4
NP-LSTM-CAE   | 5    98.8 | 1    92.5    | 1    78.7 | 2    74.4 | 2    64.7 | 3    63.8  | 87.7
NP-BLSTM-CAE  | 4    99.2 | 3    92.8    | 2    78.3 | 1    75.2 | 2    65.7 | 2    63.2  | 88.1
NP-MLP-AE     | 4    98.5 | 5    85.9    | 1    79.0 | 2    74.6 | 1    64.7 | 2    62.8  | 85.5
NP-LSTM-AE    | 5    98.7 | 1    92.1    | 1    78.1 | 1    75.0 | 1    65.0 | 3    64.4  | 87.6
NP-BLSTM-AE   | 4    99.2 | 2    94.1    | 3    78.4 | 2    75.6 | 1    65.6 | 1    63.6  | 88.5
NP-MLP-DAE    | 5    98.9 | 5    88.8    | 1    81.6 | 1    77.5 | 1    67.0 | 4    64.3  | 87.3
NP-LSTM-DAE   | 5    99.1 | 1    94.2    | 4    80.4 | 1    76.4 | 2    66.2 | 1    65.2  | 88.8
NP-BLSTM-DAE  | 5    99.4 | 3    94.4    | 2    80.7 | 3    78.5 | 2    66.7 | 1    65.6  | 89.3

the training phase with different values $\nu = \{0.1, 0.01\}$. The OCSVM was trained via the LIBSVM library [55]. In the case of GMM, models were trained with different numbers of Gaussian components, $2^n$ with $n = 1, 2, \ldots, 8$, whereas left-right HMMs were trained with different numbers of states, $s = \{3, 4, 5\}$, and $2^n$ Gaussian components with $n = 1, 2, \ldots, 7$. GMMs and HMMs were trained using the Torch toolkit [56]. The decision values produced as output of the OCSVM and the probability estimates produced as output of the probabilistic models were postprocessed with a similar thresholding algorithm (cf. Section 5), in order to compare the performance of the different methods fairly. For all experiments and settings, we maintained the same feature set.
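For convenience, the hyperparameter grid evaluated in this section can be written down as a plain configuration dictionary (purely illustrative; the values are those listed above):

```python
# Hyperparameter grid evaluated in the experiments (cf. Section 7)
grid = {
    'lstm_blstm': {'learning_rate': [1e-6]},
    'mlp':        {'learning_rate': [1e-8, 1e-9]},
    'dae':        {'noise_sigma': [0.01, 0.1, 0.25]},
    'np':         {'prediction_delay': [1, 2, 3, 4, 5, 10]},
    'ocsvm':      {'C': [0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001],
                   'gamma': [0.01, 0.001], 'nu': [0.1, 0.01]},
    'gmm':        {'n_components': [2 ** n for n in range(1, 9)]},
    'hmm':        {'states': [3, 4, 5],
                   'n_components': [2 ** n for n in range(1, 8)]},
}
```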

8. Results

In this section, we present and comment on the results obtained in our evaluation across the three databases.

8.1. A3Novelty. Evaluations on the A3Novelty Corpus are reported in the second column of Table 2. On this dataset, GMMs and HMMs perform similarly; however, they are outperformed by the OCSVM, with a maximum improvement of 3.6% absolute F-measure. The autoencoder-based approaches significantly boost the performance, up

to 98.7%. We observe a vast absolute improvement, of up to 6.9%, against the probabilistic approaches. Among the three configurations (CAE, AE, and DAE), we observe that the compression and denoising layouts with BLSTM units perform closely to each other, at up to 98.7% in the case of the BLSTM-CAE. This can be due to the fact that the dataset contains fewer variations in the background material used for training, and the feature selection operated internally by the AE increases the sensitivity of the reconstruction error.

The nonlinear predictive results are shown in the last part of Table 2. We provide the performance in the three named configurations and with the three named unit types. In concordance with what we found on the PASCAL database, the NP-BLSTM-DAE method provided the best performance, in terms of F-measure, of up to 99.4%. A significant absolute improvement (one-tailed z-test [57], p < 0.01; in the rest of the manuscript, we report as "significant" improvements with at least p < 0.01 under the one-tailed z-test [57]) of 10.0% F-measure is observed against the GMM-based approach, while an absolute improvement of 7.6% F-measure is exhibited with respect to the OCSVM method. We observe an overall improvement of approximately 1% between the "ordinary" and the "predictive" architectures.

The performance obtained by progressively increasing the prediction delay $k$ (from 0 up to 10) is reported in Figure 4. We evaluated the compression autoencoder (CAE),


Figure 4: A3Novelty performances obtained with the NP-autoencoders by progressively increasing the prediction delay.

the basic autoencoder (AE), and the denoising autoencoder (DAE), with MLP, LSTM, and BLSTM units, and we applied different layouts (cf. Section 7) per network type. However, for the sake of brevity, we only show the best configurations. The best results across all three unit types are 99.4% and 99.1% F-measure, for the NP-BLSTM-DAE and NP-LSTM-DAE networks, respectively; these are obtained with a prediction delay of 5 frames, which translates into an overall delay of 50 ms. In general, the best performances are achieved with $k = 4$ or $k = 5$. Increasing the prediction delay up to 10 frames produces a heavy decrease in performance, down to 97.8% F-measure.

8.2. PASCAL CHiME. In the first column of Table 2, we report the performance obtained on the PASCAL dataset using the different approaches. Parts of the results obtained on this database were also presented in [36]. Here, we conducted additional experiments to evaluate the OCSVM and MLP approaches. The one-class SVM shows lower performance compared to probabilistic approaches such as GMM and HMM, which seem to work reasonably well, at up to 91.4% F-measure. The low OCSVM performance can be due to the fact that the dataset was generated artificially and the abnormal sound dynamics were normalized with respect to the "normal" material, making the margin maximisation more complex and less effective. Next, we evaluated the AE-based approaches in the three configurations: compression (CAE), basic (AE), and denoising (DAE). We also evaluated the MLP, LSTM, and BLSTM unit types. Among the three configurations, we observe that the denoising ones perform better than the others, independently of the type of unit. In particular, the best performance is obtained with the denoising autoencoder realised as a BLSTM RNN, showing up to 93.4% F-measure. The last three groups of rows in Table 2 show results of the NP approach, again in the three configurations and with the three unit types.

The NP-BLSTM-DAE achieved the best result, of up to 94.4% F-measure. Significant absolute improvements of

Figure 5: PASCAL CHiME performances obtained with the NP-autoencoders by progressively increasing the prediction delay.

4.0%, 3.0%, and 1.0% are observed over the GMM, HMM, and "ordinary" BLSTM-DAE approaches, respectively.

Interestingly, applying the nonlinear prediction scheme to the compression autoencoders, NP-(B)LSTM-CAE (92.8% F-measure), also increased the performance in comparison with the (B)LSTM-CAE (91.3% F-measure). In fact, in a previous work [35], the compression learning process alone showed scarce results; however, here, the CAE with nonlinear prediction encodes information on the input more effectively.

Figure 5 depicts results for increasing values of the prediction delay $k$, ranging from 0 to 10. We evaluated the CAE, AE, and DAE with MLP, LSTM, and BLSTM neural networks, with different layouts (cf. Section 7) per network type. However, due to space restrictions, we only report the best performances. Here, the best performances are obtained with a prediction delay of 3 frames (30 ms) for the NP-BLSTM-DAE network (94.4% F-measure), and of one frame in the case of the NP-LSTM-DAE (94.2% F-measure). As on the A3Novelty database, we observe a similar decrease in performance, down to 86.2% F-measure, when the prediction delay increases up to 10 frames, which corresponds to 100 ms. In fact, applying a higher prediction delay (e.g., 100 ms) induces higher values of the reconstruction error in the presence of fast periodic events, which subsequently leads to an increased false detection rate.

8.3. PROMETHEUS. This subsection elaborates on the results obtained on the four subsets of the PROMETHEUS database.

8.3.1. ATM. The ATM scenario evaluations are shown in the third column of Table 2. The GMM and HMM perform similarly, at chance level: in fact, we observe an F-measure of 50.2% and 52.0% for GMMs and HMMs, respectively. The one-class SVM shows slightly better performance, of up to 60.2%.

Figure 6: PROMETHEUS-ATM performances obtained with NP-Autoencoders by progressively increasing the prediction delay (x-axis: prediction delay, 0 to 10; y-axis: F-measure (%)).

On the other hand, the AE-based approaches in the three configurations, compression (CAE), traditional (AE), and denoising (DAE), show a significant improvement in performance, up to 19.3% absolute F-measure, against the OCSVM. Among the three configurations, we observe that the DAE performs better independently of the type of network. In particular, the best performance with the ordinary (without nonlinear prediction) approach is obtained by the DAE with an LSTM network, leading to an F-measure of 79.5%.

The last three groups of rows in Table 2 show results of the nonlinear predictive approach (NP). The nonlinear predictive denoising autoencoder performs best, at up to 81.6% F-measure. Surprisingly, the best performance is obtained using MLP units, suggesting that for long events, such as those contained in the ATM scenario (with an average duration of 60 s, cf. Table 1), memory-enhanced units such as (B)LSTM are not as effective as for shorter events.

A significant absolute improvement of 21.4% F-measure is observed against the OCSVM approach, while an absolute improvement of 31.4% F-measure is exhibited with respect to the GMM-based method. Between the two autoencoder-based approaches, we report an absolute improvement of 1.0% from the "ordinary" to the "predictive" structure. It must be observed that the performance presented in [31] is higher than the one provided in this article, since the tolerance window used in that study was set to 1 s, whereas here we aimed at a higher temporal resolution, with a tolerance window of 200 ms, which is suitable also for abrupt events.
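The exact matching rule behind the tolerance window is not spelled out here, so the following sketch is only one plausible frame-level reading (an assumption of ours), in which errors falling within the window around an event boundary are forgiven. At the 10 ms frame shift, 200 ms corresponds to tol = 20 frames.

    import numpy as np

    def tolerant_counts(pred, truth, tol=20):
        # pred, truth: boolean (N,) frame-level novelty indicators.
        n = len(pred)

        def dilate(sig):
            # widen every positive frame by +/- tol frames
            out = np.zeros(n, dtype=bool)
            for i in np.flatnonzero(sig):
                out[max(0, i - tol):i + tol + 1] = True
            return out

        tp = np.sum(pred & dilate(truth))
        fp = np.sum(pred & ~dilate(truth))
        fn = np.sum(truth & ~dilate(pred))
        return tp, fp, fn   # F-measure = 2*tp / (2*tp + fp + fn)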

Figure 6 depicts performance for progressive values of the prediction delay (k), ranging from 0 to 10, applying the CAE, AE, and DAE with MLP, LSTM, and BLSTM networks. Several layouts (cf. Section 7) were evaluated per network type; however, we report only the best configurations. Setting a prediction delay of 1 frame, which corresponds to a total prediction delay of 10 ms, leads to the best performance, of up to 81.6% F-measure with the NP-MLP-DAE network. In the case of the NP-BLSTM-DAE, we observe better performance with a delay of 2 frames, up to 80.7% F-measure. In general, we do not observe a consistent trend when increasing the prediction delay, corroborating the fact that for long events, as those contained in the ATM scenario, memory-enhanced units and a nonlinear predictive approach are not as effective as for shorter events.

8.3.2. Corridor. The evaluations on the corridor subset are shown in the fourth column of Table 2. The GMM and HMM perform similarly, at chance level: we observe an F-measure of 49.4% and 49.6% for GMM and HMM, respectively. The OCSVM shows better performance, up to 65.3%. As observed in the ATM scenario, a significant improvement in performance, up to 16.5% absolute F-measure, is again obtained using the autoencoder-based approaches in the three configurations (CAE, AE, and DAE) with respect to the OCSVM. Among the three configurations, we observe that the denoising autoencoder performs better than the others. The best performance is obtained with the denoising autoencoder with BLSTM units, of up to 79.8% F-measure.

The "predictive" approach is reported in the last three groups of rows in Table 2. Interestingly, the nonlinear predictive autoencoders do not improve the performance as we have seen in the other scenarios. A plausible explanation lies in the nature of the novelty events present in the subset: it contains very long events, with an average duration of up to 140 s per event, and with such long events the generative model does not introduce a more sensitive reconstruction error. However, the delta in performance between the BLSTM-DAE and the NP-BLSTM-DAE is rather small (1.3% F-measure), in favour of the "ordinary" approach. The best performance (79.8%) is obtained using BLSTM units, confirming that memory-enhanced units are more effective in the presence of short events. In fact, this scenario, besides very long events, also contains short fall and pain events, with an average duration of 10 s and 30 s, respectively.

A significant absolute improvement of up to 16.5% F-measure is observed against the OCSVM approach, and it is even higher with respect to the GMM and HMM.

8.3.3. Outdoor. The evaluations on the outdoor subset are shown in the fifth column of Table 2. The OCSVM, GMM, and HMM perform better in this scenario as opposed to ATM and corridor: we observe an F-measure of 57.3%, 56.4%, and 56.0% for OCSVM, GMM, and HMM, respectively. In this scenario, the improvement brought by the autoencoder is not as vast as in the previous subsets, but it is still significant: we report an absolute improvement of 11.2% F-measure between the OCSVM and the BLSTM-DAE. Again, the denoising autoencoder performs better than the other configurations. In particular, the best performance, obtained with the BLSTM-DAE, is 68.5% F-measure.

As observed in the corridor scenario, the nonlinear predictive autoencoders (last three groups of rows in Table 2) do not improve the performance. These results corroborate our previous explanation that the long duration of the novelty events present in the subset affects the sensitivity of the reconstruction error in the generative model. However, the delta in performance between the BLSTM-DAE and the NP-BLSTM-DAE is rather small (1.3% F-measure).

It must be observed that the performance in this scenario is rather low compared to the one obtained on the other datasets. We believe that the presence of anger novel sounds introduces a higher degree of complexity for our autoencoder-based approach. In fact, anger novel events may contain different levels of aroused content, which can be acoustically similar to the neutral spoken content present in the training material; under this condition, the generative model shows a low reconstruction error. This issue could be addressed by restricting the novel events to contain only the aroused segments or, considering anger as a long-term speaker state, by increasing the temporal resolution of our system.

8.3.4. Smart-Room. The smart-room scenario evaluations are shown in the sixth column of Table 2. The OCSVM, GMM, and HMM perform better in this scenario as opposed to ATM, corridor, and outdoor: we observe an F-measure of 57.4%, 59.1%, and 59.1% for OCSVM, GMM, and HMM, respectively. In this scenario, the improvement brought about by the autoencoder is still significant: we report an absolute improvement of 6.0% F-measure between the GMM/HMM and the BLSTM-DAE. Again, the denoising autoencoder performs better than the other configurations. In particular, the best performance in the ordinary approach is obtained with the BLSTM-DAE, of up to 65.1% F-measure.

The last three groups of rows in Table 2 show results of the nonlinear predictive approach (NP). The NP-BLSTM-DAE performs best, at up to 65.6% F-measure.

As in the outdoor subset, we report a low performance in the smart-room subset as well. In fact, the subset contains several long novel events related to spoken content expressing pain and fear. As commented for the outdoor scenario, under this condition the generative model may be able to reconstruct the novel event without producing a high reconstruction error.

8.4. Overall. Overall, the experimental results proved that the DAE methods achieve superior performance compared to the CAE/AE schemes. This is due to the combination of the two learning processes of a denoising autoencoder: encoding the input while preserving the information about the input itself and, simultaneously, reversing the effect of a corruption process applied to the input of the autoencoder.
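The corruption side of this pairing is simple to state; a minimal sketch (ours), using the additive isotropic Gaussian noise and the sigma values explored in Section 7:

    import numpy as np

    def corrupt(x, sigma=0.1, rng=None):
        # x: clean standardized features; the DAE is fed x + noise at the
        # input but trained against the clean x as target, so it must
        # encode the input and undo the corruption at the same time.
        rng = rng or np.random.default_rng(0)
        return x + rng.normal(0.0, sigma, size=x.shape)

    x = np.random.randn(100, 54)
    x_noisy = corrupt(x, sigma=0.25)   # sigma in {0.01, 0.1, 0.25}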

In particular, the predictive approach with (B)LSTM units showed the best performance, of up to 89.3% average F-measure over the six different datasets, weighted by the number of instances per database (cf. Table 2).
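Written out (notation ours), with n_d the number of test instances of dataset d and F1,d the corresponding F-measure over the D = 6 evaluation sets, the weighted average is

    F1(avg) = (n_1 F1,1 + ... + n_D F1,D) / (n_1 + ... + n_D),

so the comparatively large A3Novelty and PASCAL CHiME test sets carry the most weight in the figure.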

To better understand the improvement brought by the RNN-based approaches, we provide in Figure 7 a comparison of the state-of-the-art methods in terms of weighted average F-measure computed across the A3Novelty Corpus, PASCAL CHiME, and PROMETHEUS. In general, we observe that the recently proposed NP-BLSTM-DAE method provides the best performance, with an average F-measure of up to 89.3%. A significant absolute improvement of 16.0% average F-measure is observed against the OCSVM approach, while absolute improvements of 10.6% and 10.4% average F-measure are exhibited with respect to the GMM- and HMM-based methods. An absolute improvement of 0.6% is observed over the "ordinary" BLSTM-DAE.

Figure 7: Average F-measure computed over the A3Novelty Corpus, PASCAL, and PROMETHEUS, weighted by the number of instances per database; extended comparison of methods (OCSVM, GMM, HMM, and the MLP/LSTM/BLSTM variants of CAE, AE, and DAE, with and without nonlinear prediction).

It has to be noted that the average F-measure is computed including the PROMETHEUS database, for which the performance has been shown to be lower, because it contains long-term events and a lower resolution in the labels (1 s).

The RNN-based schemes also bring an evident benefit when applied to the "normal" autoencoders (i.e., with no denoising or compression): in fact, the NP-BLSTM-AE achieves an F-measure of 88.5%. Furthermore, when we applied the nonlinear prediction scheme to a denoising autoencoder, the performance achieved with LSTM units was comparable with BLSTM units and likewise outperformed the state-of-the-art approaches.

In conclusion, the combination of the nonlinear prediction paradigm and the various (B)LSTM autoencoders proved to be effective, significantly outperforming the other state-of-the-art methods. Additionally, there is evidence that memory-enhanced units such as LSTM and BLSTM outperform MLPs without memory, showing that knowledge of the temporal context can improve the novelty detector's abilities.

9. Conclusions and Outlook

We presented a broad and extensive evaluation of state-of-the-art methods, with a particular focus on novel and recent unsupervised approaches based on RNN autoencoders. We significantly extended the studies conducted in [35, 36] by evaluating further approaches, such as one-class support vector machines (OCSVMs) and multilayer perceptrons (MLPs), and, most importantly, we conducted a broad evaluation on three different datasets, for a total number of 160,153 experiments, making this article the first to present such a complete evaluation in the field of acoustic novelty detection. We showed evidently that RNN-based autoencoders significantly outperform the other methods, achieving up to 89.3% weighted average F-measure on the three databases, with a significant absolute improvement of 10.4% against the best performance obtained with statistical approaches (HMM). Overall, a significant increase in performance was achieved by combining the (B)LSTM autoencoder-based architecture with the nonlinear prediction scheme.

Future work will focus on using multiresolution features [51, 58], likely more suitable to deal with different event durations, in order to face the issues encountered with the PROMETHEUS database. Further studies will be conducted on other architectures of RNN-based autoencoders, ranging from dynamic Bayesian networks [59] to convolutional neural networks [60].

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007–2013) under Grant Agreement no. 338164 (ERC Starting Grant iHEARu), no. 645378 (RIA ARIA-VALUSPA), and no. 688835 (RIA DE-ENIGMA).

References

[1] L. Tarassenko, P. Hayton, N. Cerneaz, and M. Brady, "Novelty detection for the identification of masses in mammograms," in Proceedings of the 4th International Conference on Artificial Neural Networks, pp. 442–447, June 1995.

[2] J. Quinn and C. Williams, "Known unknowns: novelty detection in condition monitoring," in Pattern Recognition and Image Analysis, J. Martí, J. Benedí, A. Mendonça, and J. Serrat, Eds., vol. 4477 of Lecture Notes in Computer Science, pp. 1–6, Springer, Berlin, Germany, 2007.

[3] L. Clifton, D. A. Clifton, P. J. Watkinson, and L. Tarassenko, "Identification of patient deterioration in vital-sign data using one-class support vector machines," in Proceedings of the Federated Conference on Computer Science and Information Systems (FedCSIS '11), pp. 125–131, September 2011.

[4] Y.-H. Liu, Y.-C. Liu, and Y.-J. Chen, "Fast support vector data descriptions for novelty detection," IEEE Transactions on Neural Networks, vol. 21, no. 8, pp. 1296–1313, 2010.

[5] C. Surace and K. Worden, "Novelty detection in a changing environment: a negative selection approach," Mechanical Systems and Signal Processing, vol. 24, no. 4, pp. 1114–1128, 2010.

[6] J. A. Quinn, C. K. I. Williams, and N. McIntosh, "Factorial switching linear dynamical systems applied to physiological condition monitoring," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 9, pp. 1537–1551, 2009.

[7] A. Patcha and J.-M. Park, "An overview of anomaly detection techniques: existing solutions and latest technological trends," Computer Networks, vol. 51, no. 12, pp. 3448–3470, 2007.

[8] M. Markou and S. Singh, "A neural network-based novelty detector for image sequence analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1664–1677, 2006.

[9] M. Markou and S. Singh, "Novelty detection: a review, part 1: statistical approaches," Signal Processing, vol. 83, no. 12, pp. 2481–2497, 2003.

[10] M. Markou and S. Singh, "Novelty detection: a review, part 2: neural network based approaches," Signal Processing, vol. 83, no. 12, pp. 2499–2521, 2003.

[11] E. R. de Faria, I. R. Gonçalves, J. Gama, and A. C. Carvalho, "Evaluation of multiclass novelty detection algorithms for data streams," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 11, pp. 2961–2973, 2015.

[12] C. Satheesh Chandran, S. Kamal, A. Mujeeb, and M. H. Supriya, "Novel class detection of underwater targets using self-organizing neural networks," in Proceedings of the IEEE Underwater Technology (UT '15), Chennai, India, February 2015.

[13] K. Worden, G. Manson, and D. Allman, "Experimental validation of a structural health monitoring methodology: part I. Novelty detection on a laboratory structure," Journal of Sound and Vibration, vol. 259, no. 2, pp. 323–343, 2003.

[14] J. Foote, "Automatic audio segmentation using a measure of audio novelty," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '00), vol. 1, pp. 452–455, August 2000.

[15] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Proceedings of Neural Information Processing Systems (NIPS '99), vol. 12, pp. 582–588, Denver, Colo, USA, 1999.

[16] J. Ma and S. Perkins, "Time-series novelty detection using one-class support vector machines," in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '03), vol. 3, pp. 1741–1745, Portland, Ore, USA, July 2003.

[17] J. Ma and S. Perkins, "Online novelty detection on temporal sequences," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), pp. 613–618, ACM, Washington, DC, USA, August 2003.

[18] P. Hayton, B. Schölkopf, L. Tarassenko, and P. Anuzis, "Support vector novelty detection applied to jet engine vibration spectra," in Proceedings of the 14th Annual Neural Information Processing Systems Conference (NIPS '00), pp. 946–952, December 2000.

[19] L. Tarassenko, A. Nairac, N. Townsend, and P. Cowley, "Novelty detection in jet engines," in Proceedings of the IEE Colloquium on Condition Monitoring: Machinery, External Structures and Health, pp. 4/1–4/5, Birmingham, UK, April 1999.

[20] L. Clifton, D. A. Clifton, Y. Zhang, P. Watkinson, L. Tarassenko, and H. Yin, "Probabilistic novelty detection with support vector machines," IEEE Transactions on Reliability, vol. 63, no. 2, pp. 455–467, 2014.

[21] D. R. Hardoon and L. M. Manevitz, "One-class machine learning approach for fMRI analysis," in Proceedings of the Postgraduate Research Conference in Electronics, Photonics, Communications and Networks, and Computer Science, Lancaster, UK, April 2005.

[22] M. Davy, F. Desobry, A. Gretton, and C. Doncarli, "An online support vector machine for abnormal events detection," Signal Processing, vol. 86, no. 8, pp. 2009–2025, 2006.

[23] M. A. F. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko, "A review of novelty detection," Signal Processing, vol. 99, pp. 215–249, 2014.

[24] N. Japkowicz, C. Myers, M. Gluck, et al., "A novelty detection approach to classification," in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI '95), vol. 1, pp. 518–523, Montreal, Canada, 1995.

[25] B. B. Thompson, R. J. Marks, J. J. Choi, M. A. El-Sharkawi, M.-Y. Huang, and C. Bunje, "Implicit learning in autoencoder novelty assessment," in Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN '02), vol. 3, pp. 2878–2883, IEEE, Honolulu, Hawaii, USA, 2002.

[26] S. Hawkins, H. He, G. Williams, and R. Baxter, "Outlier detection using replicator neural networks," in Data Warehousing and Knowledge Discovery, pp. 170–180, Springer, Berlin, Germany, 2002.


[27] N. Japkowicz, "Supervised versus unsupervised binary-learning by feedforward neural networks," Machine Learning, vol. 42, no. 1-2, pp. 97–122, 2001.

[28] L. Manevitz and M. Yousef, "One-class document classification via neural networks," Neurocomputing, vol. 70, no. 7–9, pp. 1466–1481, 2007.

[29] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, IEEE Computer Society, December 2002.

[30] H. Sohn, K. Worden, and C. R. Farrar, "Statistical damage classification under changing environmental and operational conditions," Journal of Intelligent Material Systems and Structures, vol. 13, no. 9, pp. 561–574, 2002.

[31] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "Probabilistic novelty detection for acoustic surveillance under real-world conditions," IEEE Transactions on Multimedia, vol. 13, no. 4, pp. 713–719, 2011.

[32] E. Principi, S. Squartini, R. Bonfigli, G. Ferroni, and F. Piazza, "An integrated system for voice command recognition and emergency detection based on audio signals," Expert Systems with Applications, vol. 42, no. 13, pp. 5668–5683, 2015.

[33] A. Ito, A. Aiba, M. Ito, and S. Makino, "Evaluation of abnormal sound detection using multi-stage GMM in various environments," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH '11), pp. 301–304, Florence, Italy, August 2011.

[34] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "On acoustic surveillance of hazardous situations," in Proceedings of the 34th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 165–168, April 2009.

[35] E. Marchi, F. Vesperini, F. Eyben, S. Squartini, and B. Schuller, "A novel approach for automatic acoustic novelty detection using a denoising autoencoder with bidirectional LSTM neural networks," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), 5 pages, IEEE, Brisbane, Australia, April 2015.

[36] E. Marchi, F. Vesperini, F. Weninger, F. Eyben, S. Squartini, and B. Schuller, "Non-linear prediction with LSTM recurrent neural networks for acoustic novelty detection," in Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN '15), p. 5, IEEE, Killarney, Ireland, July 2015.

[37] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, "Learning precise timing with LSTM recurrent networks," The Journal of Machine Learning Research, vol. 3, no. 1, pp. 115–143, 2003.

[38] A. Graves, "Generating sequences with recurrent neural networks," https://arxiv.org/abs/1308.0850.

[39] D. Eck and J. Schmidhuber, "Finding temporal structure in music: blues improvisation with LSTM recurrent networks," in Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pp. 747–756, IEEE, Martigny, Switzerland, September 2002.

[40] J. A. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," Tech. Rep. 97-021, International Computer Science Institute, Berkeley, Calif, USA, 1998.

[41] L. R. Rabiner, "Tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[42] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Advances in Neural Information Processing Systems 12, S. Solla, T. Leen, and K. Müller, Eds., pp. 582–588, MIT Press, 2000.

[43] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, 1986.

[44] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[45] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[46] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5-6, pp. 602–610, 2005.

[47] G. Hinton, L. Deng, D. Yu, et al., "Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.

[48] I. Goodfellow, H. Lee, Q. V. Le, A. Saxe, and A. Y. Ng, "Measuring invariances in deep networks," in Advances in Neural Information Processing Systems, Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, Eds., vol. 22, pp. 646–654, Curran & Associates Inc., Red Hook, NY, USA, 2009.

[49] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds., pp. 153–160, MIT Press, 2007.

[50] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.

[51] E. Marchi, G. Ferroni, F. Eyben, L. Gabrielli, S. Squartini, and B. Schuller, "Multi-resolution linear prediction based features for audio onset detection with bidirectional LSTM neural networks," in Proceedings of the 39th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 2183–2187, IEEE, Florence, Italy, May 2014.

[52] F. Eyben, F. Weninger, F. Gross, and B. Schuller, "Recent developments in openSMILE, the Munich open-source multimedia feature extractor," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 835–838, ACM, Barcelona, Spain, October 2013.

[53] J. Barker, E. Vincent, N. Ma, H. Christensen, and P. Green, "The PASCAL CHiME speech separation and recognition challenge," Computer Speech & Language, vol. 27, no. 3, pp. 621–633, 2013.

[54] F. Weninger, J. Bergmann, and B. Schuller, "Introducing CURRENNT: the Munich open-source CUDA RecurREnt neural network toolkit," Journal of Machine Learning Research, vol. 16, pp. 547–551, 2015.

[55] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, article 27, 2011.

[56] R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: a Matlab-like environment for machine learning," in BigLearn, NIPS Workshop, no. EPFL-CONF-192376, 2011.

[57] M. D. Smucker, J. Allan, and B. Carterette, "A comparison of statistical significance tests for information retrieval evaluation," in Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM '07), pp. 623–632, ACM, Lisbon, Portugal, November 2007.


[58] E. Marchi, G. Ferroni, F. Eyben, S. Squartini, and B. Schuller, "Audio onset detection: a wavelet packet based approach with recurrent neural networks," in Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN '14), pp. 3585–3591, Beijing, China, July 2014.

[59] N. Friedman, K. Murphy, and S. Russell, "Learning the structure of dynamic probabilistic networks," in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI '98), pp. 139–147, Morgan Kaufmann, Madison, Wisc, USA, July 1998.

[60] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, "Face recognition: a convolutional neural-network approach," IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98–113, 1997.


Computational Intelligence and Neuroscience 3

the log-likelihoods as novelty scores In the testing phase theunseen signal is segmented into a fixed length depending onthe number of states of the HMM and if the log-likelihood ishigher than the defined threshold (cf Section 5) the segmentis detected as novel

23 One-Class Support Vector Machines A OCSVM [42]maps an input example onto a high-dimensional feature spaceand iteratively searches for the hyperplane that maximisesthe distance between the training examples from the originIn this constellation the OCSVM can be seen as a two-classSVM where the origin is the unique member of the secondclass whereas the training examples belong to the first classGiven the training data 1199091 119909119897 isin 119883 where 119897 is the numberof observations the class separation is performed by solvingthe following

min119908120585120588

12 1199082 1]119897sum119894 120585119894 minus 120588subject to (119908 sdot Φ (119909119894)) ge 120588 minus 120585119894 120585119894 ge 0

(1)

where 119908 is the support vector 120585119894 are slack variables 120588 is theoffset andΦmaps119883 into a dot product space 119865 such that thedot product in the image ofΦ can be computed by evaluatinga certain kernel function such as a linear or Gaussian radialbase function

119896 (119909 119910) = exp(minus1003817100381710038171003817119909 minus 1199101003817100381710038171003817221205902 ) (2)

The parameter ] sets an upper bound on the fraction of theoutliers defined to be the data being outside the estimatedregion of normality Thus the decision values are obtainedwith the following function

119891 (119909) = 119908 sdot Φ (119909) minus 120588 (3)

We trained a OCSVM on what we call ldquonormalrdquo material andused the decision values as novelty scores During testing theOCSVMprovides a decision value for the unseen pattern andif the decision value is higher than the defined threshold (cfSection 5) the segment is detected as novel

3 Feed-Forward and RecurrentNeural Networks

This section introduces the MLP and the LSTM RNNsemployed in our acoustic novelty detectors

The first neural network type we used is a multilayerperceptron [43] In a MLP the units are arranged in layerswith feed-forward connections from one layer to the nextEach node outputs an activation function applied over theweighted sum of its inputs The activation function can belinear a hyperbolic function (tanh) or the sigmoid functionInput examples are fed to the input layer and the resultingoutput is propagated via the hidden layers towards the outputlayer This process is known as the forward pass of thenetwork This type of neural networks only relies on thecurrent input and not on any past or future inputs

Cell

Forget

gate

gate

gate

Output

Input

119841t

119839t

119857t

119841tminus1

119857t

119857t

119841tminus1

119841tminus1

119857t

119841tminus1

119835c

119835i

119835f

119835o

119848t

119836t

TT

T

119842t

Figure 1 LSTM unit comprising the input output and forget gatesand the memory cell

The second neural network type we employed is theLSTM RNN [44] Compared to a conventional RNN thehidden units are replaced by so-called memory blocksThesememory blocks can store information in the ldquocell variablerdquoc119905 In this way the network can exploit long-range temporalcontext Each memory block consists of a memory cell andthree gates the input gate output gate and forget gate asdepicted in Figure 1

The memory cell is controlled by the input output andforget gates

The stored cell variable c119905 can be reset by the forget gatewhile the functions responsible for reading input from x119905 andwriting output to h119905 are controlled by the input and outputgates respectively

c119905 = f119905 otimes c119905minus1 + i119905 otimes tanh (W119909119888x119905 +Wℎ119888h119905minus1 + b119888) h119905 = o119905 otimes tanh (c119905) (4)

where tanh and otimes stand for element-wise hyperbolic tangentand element-wise multiplication respectively The output ofthe input gates is denoted by the variable i119905 while the outputof the output and forget gates are indicated by o119905 and f119905respectively The variableW denotes a weight matrix and b119888indicates a bias term

Each LSTM unit is a separate and independent block Infact the size of h119905 is the same as i119905 o119905 f119905 and c119905 The sizecorresponds to the number of LSTMunits in the hidden layerIn order to have the gates being dependent uniquely from thememory cell within the same LSTM unit the matrices of theweights from the cells to the gates are diagonal

Furthermore we employed bidirectional RNN (BRNN)[45] which are capable of learning the context in bothtemporal directions In fact a BRNN contains two distincthidden layers which are processing the input vector in eachdirection The output layer is then connected to both hiddenlayers A more complex architecture can be obtained bycombining a LSTM unit with a BRNN which is referredto as bidirectional LSTM (BLSTM) [46] BLSTM exploitscontext from both temporal directions Note that in the caseof BLSTM it is not possible to perform online processing asa short buffer to look ahead is required

4 Computational Intelligence and Neuroscience

When the layout of a neural network comprises morehidden layers it is defined as deep neural network (DNN)[47] An incrementally higher level representation of theinput data is provided when multiple hidden layers arestacked on each other (deep learning)

In the case of multiple layers the output of a BRNN iscomputed as

y119905 = W997888rarrh119873119910

997888rarrh119873119905 +Wlarr997888

h119873119910

larr997888h119873119905 + b119910 (5)

where the forward and backward activation of the 119873th(last) hidden layer are denoted by

997888rarrh119873119905 and

larr997888h119873119905 respectively

The reconstructed signal is generated by using an identityactivation function at the outputThe best network layout wasobtained by conducting a number of preliminary evaluationsSeveral configurations were evaluated by changing the sizeand the number of hidden layers (ie the number of LSTMunits for each layer)

The training procedure was iterated up to a maximum of100 epochsThe standard gradient descent with backpropaga-tion of the sum squared error was used to recursively updatethe network weights Those were initialized with a randomGaussian distribution with mean 0 and standard deviation01 as it usually provides an acceptable initialization in ourexperience

4 Autoencoders for AcousticNovelty Detection

This section introduces the concepts of autoencoders anddescribes the basic autoencoder compression autoencoderdenoising autoencoder and nonlinear predictive autoen-coder [36]

41 Basic Autoencoder A basic autoencoder is a neuralnetwork trained to set the target values equal to the inputs Itsstructure typically consists of only one hidden layer while theinput and the output layers have the same size The trainingsetXtr consists of background environmental sounds whiletest set Xte is composed of recordings containing abnormalsounds It is used to find common data representation fromthe input [48 49] Formally in response to an input example119909 isin R119899 the hidden representation ℎ(119909) isin R119898 is

ℎ (119909) = 119891 (1198821119909 + 1198871) (6)

where 119891(119911) is a nonlinear activation function typically alogistic sigmoid function 119891(119911) = 1(1 + exp(minus119911)) appliedcomponentwisely1198821 isin R119898times119899 is a weight matrix and 1198871 isin R119898is a bias vector

The network output maps the hidden representation ℎback to a reconstruction isin R119899

= 119891 (1198822ℎ (119909) + 1198872) (7)

where 1198822 isin R119899times119898 is a weight matrix and 1198872 isin R119899 is a biasvector

Given an input set of examples X AE training consistsin finding parameters 120579 = 11988211198822 1198871 1198872 that minimise the

reconstruction error which corresponds to minimising thefollowing objective function

J (120579) = sum119909isinX

119909 minus 2 (8)

A well-known approach tominimise the objective function isthe stochastic gradient descent with error backpropagationThe layout of the AE is shown in Figure 2(a)

42 Compression Autoencoder The compression autoen-coder (CAE) learns a compressed representation of the inputwhen the number of hidden units 119898 is smaller than thenumber of input units 119899 For example if some of the inputfeatures are correlated these correlations are learnt andreconstructed by the CAE The structure of the CAE is givenin Figure 2(b)

43 Denoising Autoencoder In the denoising AE (DAE)[50] configuration the network is trained to reconstruct theoriginal input from a corrupted version of itThe initial input119909 is corrupted by means of additive isotropic Gaussian noisein order to obtain 1199091015840 | 119909 sim 119873(119909 1205902119868) The corrupted input 1199091015840is then mapped as with the AE to a hidden representation

ℎ (1199091015840) = 119891 (119882101584011199091015840 + 11988710158401) (9)

forcing the hidden layer to retrieve more robust features andprevent it from simply learning the identityThus the originalsignal is reconstructed as follows

1015840 = 119891 (11988210158402119909 + 11988710158402) (10)

The structure of the denoising autoencoder is shown inFigure 2(c) In the training phase the set of network weightsand biases 1205791015840 = 11988210158401 11988210158402 11988710158401 11988710158402 are updated in order tohave 1015840 as close as possible to the uncorrupted input 119909 Thisprocedure corresponds to minimising the reconstructionerror objective function (8) In our approach to corrupt theinitial input 119909119905 we make use of additive isotropic Gaussiannoise in order to obtain 1199091015840 | 119909 sim 119873(119909 1205902119868)44 Nonlinear Predictive Autoencoder The basic idea of anonlinear predictive (NP) AE is to train the AE in orderto predict the current frame from a previous observationFormally the input up to a given time frame 119909119905 is mappedto a hidden representation ℎ

ℎ (119909119905) = 119891 (119882lowast1 119887lowast1 1199091119905) (11)

where 119882 and 119887 denote weights and bias respectively Fromthis we reconstruct an approximation of the original signalas follows

119905+119896 = 119891 (119882lowast2 119887lowast2 ℎ1119905) (12)

where 119896 is the prediction delay and ℎ119894 = ℎ(119909119894) A predictiondelay of 119896 = 1 corresponds to a shift of 10ms in theaudio signal in our setting (cf Section 5) The training of

Computational Intelligence and Neuroscience 5

(a) (b) (c)

Novelty

Output layer

Hidden layer

Input layer

Input

Thresholding

minuse0

Stochastic mapping

Feature extraction

x998400 | x sim N(x 1205902I)

x x isin 119987tr 119987te x x isin 119987tr 119987te x998400 x998400 isin 119987tr 119987te

x x isin 119987tr 119987te x x isin 119987tr 119987te x998400 x998400 isin 119987tr 119987te

Figure 2 Block diagram of the proposed acoustic novelty detector with different autoencoder structures Features are extracted from theinput signal and the reconstruction error between the input and the reconstructed features is then processed by a thresholding block whichdetects the novel or nonnovel event Structure of the (a) basic autoencoder (b) compression autoencoder and (c) denoising autoencoder onthe training setXtr or testing setXteXtr contains data of nonnovel acoustic eventsXte consists of novel and nonnovel acoustic events

the parameters is performed by minimising the objectivefunction (8)mdashthe difference is that is now based onnonlinear prediction according to (11) and (12) Thus theparameters 120579lowast = 119882lowast1 119882lowast2 119887lowast1 119887lowast2 are trained to minimisethe average reconstruction error over the training set to have119905+119896 as close as possible to the prediction delay The resultingstructure of the nonlinear predictive denoising autoencoder(NP-DAE) is similar to the one depicted in Figure 2(c) butwith input and output updated as described above

5 Thresholding and Features

This section describes the thresholding decision strategy andthe features employed in our experiments

51 Thresholding Auditory spectral features (ASF) inSection 52 used in this work are composed by 54 coefficientswhich means that the input and output layer of the networkhave 54 units each The trained AE reconstructs eachsample and novel events are identified by processing thereconstruction error signal with an adaptive threshold Theinput audio signal 119909 is segmented into sequences of 30seconds of length In the testing phase we computemdashon a

frame basismdashthe average Euclidean distance between thenetworksrsquo outputs and each standardized input feature valueIn order to compress the reconstruction error to a singlevalue the distances are summed up and divided by thenumber of coefficients Then we apply a threshold 120579th toobtain a binary signal shifting from the median of the errorsignal of a sequence 1198900 by a multiplicative coefficient 120573 Thecoefficient ranges from 120573min = 1 to 120573max = 2

120579th = 120573 lowastmedian (1198900 (1) 1198900 (119873)) (13)

Figure 3 shows the reconstruction error for a givensequence The figure clearly depicts a low reconstructionerror in reproducing normal input such as talking televisionsounds and other normal environmental sounds

52 Acoustic Features An efficient representation of theaudio signal can be achieved by extracting the auditoryspectral features (ASF) [51] The audio signal is split intoframes with the size equal to 30ms and a frame step of10ms and then the ASF are obtained by applying ShortTime Fourier Transform (STFT) which yields the powerspectrogram of the frame Mel spectrograms11987230(119899119898) (with

6 Computational Intelligence and Neuroscience

02468

Freq

uenc

y (H

z)

5 10 15 20 25

Time (s)

times103

0

1

2

123

LRe

cons

truc

tion

erro

rab

el

500 1000 1500 2000 2500

Frames

500 1000 1500 2000 2500

Frames(a)

02468

Freq

uenc

y (H

z)

5 10 15 20 25

Time (s)

times103

0

1

2

Labe

l

123

Reco

nstr

uctio

ner

ror

500 1000 1500 2000 2500

Frames

500 1000 1500 2000 2500

Frames(b)

02468

Freq

uenc

y (H

z)

5 10 15 20 25

Time (s)

times103

0

1

2

Labe

l

123

Reco

nstr

uctio

ner

ror

500 1000 1500 2000 2500

Frames

500 1000 1500 2000 2500

Frames(c)

Figure 3 Spectrogram of a 30-second sequence from (a) the A3Novelty Corpus containing a novel siren event (top) (b) the PASCALCHiMEdatabase containing three novel events such as a siren and two screams and (c) the PROMETHEUS database containing a novel eventcomposed by an alarm and people screaming together (top) Reconstruction error signal of the related sequence obtained with a BLSTM-DAE (middle) Ground-truth binary novelty signal novel (0) and not-novel (0) (bottom)

119899 being the frame index and 119898 the frequency bin index)are calculated converting the power spectrogram to the Mel-frequency scale using a filter-bankwith 26 triangular filters Alogarithmic scaling is chosen to match the human perceptionof loudness

11987230log (119899119898) = log (11987230 (119899119898) + 10) (14)

In addition the positive first-order differences 11986330(119899119898) arecalculated from each Mel spectrogram following

11986330 (119899119898) = 11987230log (119899119898) minus11987230log (119899 minus 1119898) (15)

Furthermore the frame energy and its derivative arealso included as feature ending up in a total number of 54coefficients For better reproducibility the features extractionprocess is computed with our open-source audio analysistoolkit openSMILE [52]

6 Databases

This section describes the three databases evaluated in ourexperiments A3Novelty PASCAL CHiME and PROME-THEUS

61 A3Novelty The A3Novelty Corpus (httpwwwa3labdiiunivpmitresearcha3novelty) includes around 56 hoursof recording acquired in a laboratory of the UniversitaPolitecnica delle Marche These recordings were performedduring different day and night hours so very differentacoustic conditions are available A variety of novel eventswere randomly played back by a speaker (eg scream fallalarm or breakage of objects) during the recordings

Eight microphones were used in the recording room forthe acquisitions four Behringer B-5 microphones with car-dioid pattern and an array of fourAKGC400BLmicrophones

spaced by 4 cm and then A MOTU 8pre sound card and theNU-Tech software were utilised to record the microphonesignals The sampling rate was equal to 48 kHz

The abnormal event sounds (cf Table 1) can be groupedinto four categories and they are freely available to downloadfrom httpwwwfreesoundorg

(i) Sirens three different types of sirens or alarm sounds(ii) Falls two occurrences of a person or an object falling

to the ground(iii) Breakage of objects noise produced by the breakage of

an object after the impact with the ground(iv) Screams four different human screams both pro-

duced by a single person or by a group of people

The A3Novelty Corpus is composed of two types ofrecordings background which contains only backgroundsounds such as human speech technical tools noise andenvironmental sounds and background with novelty whichcontains in addition to the background the artificially gen-erated novelty events

In the original A3Novelty database the recordings aresegmented in sequences of 30 seconds In order to limit thesize of training data we randomly selected 300 sequencesfrom the background partition to compose training material(150 minutes) and 180 sequences from the background withnovelty partition to compose the testing set (90 minutes)Thetest set contains 13 novelty occurrences

For reproducibility the list of randomly selected record-ings and the train and test set are made available (httpwwwa3labdiiunivpmitresearcha3novelty)

62 PASCAL CHiME The original dataset is composed ofaround 7 hours of recordings of a home environment takenfrom the PASCALCHiME speech separation and recognition

Computational Intelligence and Neuroscience 7

Table 1 Acoustic novel events in the test set Shown are the number of different events per database the average duration and the totalduration in seconds per event typeThe last column indicates the total number of events and total duration across the databases The last lineindicates the total duration in seconds of the test set including normal and novel events per database

EventsA3Novelty PASCAL CHiME PROMETHEUS Total

ATM Corridor Outdoor Smart-room Time (avg) Time (avg) Time (avg) Time (avg) Time (avg) Time (avg) time

Alarm mdash mdash 76 4358 (60) mdash mdash 6 840 (140) mdash mdash 3 90 (30) 85 5288Anger mdash mdash mdash mdash mdash mdash mdash mdash 6 2930 (488) mdash mdash 6 2930Fall 3 42 (21) 48 895 (18) mdash mdash 3 30 (10) mdash mdash 2 20 (10) 55 987Fracture 1 22 32 704 (22) mdash mdash mdash mdash mdash mdash mdash mdash 33 726Pain mdash mdash mdash mdash mdash mdash 2 80 (40) mdash mdash 5 670 (134) 7 750Scream 6 104 (17) 111 2146 (19) 5 300 (60) 25 2280 (91) 4 480 (120) 10 2340 (234) 159 7622Siren 3 204 (68) mdash mdash mdash mdash mdash mdash mdash mdash mdash mdash 3 181Total 13 381 (29) 267 8103 (31) 5 300 (50) 36 3230 (90) 10 3410 (341) 20 3120 (156) 348 18484Test time mdash 54000 mdash 41880 mdash 7500 mdash 9600 mdash 16200 mdash 10200 mdash 139380

challenge [53] It consists of a typical in-home scenario (aliving room) recorded during different days and times whilethe inhabitants (two adults and two children) perform com-mon actions such as talking watching television playing oreatingThe dataset was recorded with a binaural microphoneand a sample-rate of 16 kHz In the original PASCAL CHiMEdatabase the recordings are segmented in sequences of 5minutesrsquo duration In order to limit the size of training datawe randomly selected sequences to compose 100 minutesof background for the training set and around 70 minutesfor the testing set For reproducibility the list of randomlyselected recordings and the train and test set are madeavailable (httpa3labdiiunivpmitwebdavaudioNoveltyDetection Datasettargz) The test set was generated addingdifferent types of sounds (taken from httpwwwfreesoundfreesoundorg) such as screams alarms falls and fractures(cf Table 1) after their normalization to the volume of thebackground recordings The events in the test set were addedat random position thus the distance between one event andanother is not fixed

63 PROMETHEUS ThePROMETHEUS database [31] con-tains recordings of various scenarios designed to serve a widerange of real-world applications The database includes (1)a smart-room indoor home environment including phaseswhere a user is interacting with an automated speech-drivenhome assistant and (2) an outdoor public space consisting of(a) interaction of people with anATM (b) an outdoor securityscenario in which people are waiting in front of a counter and(3) an indoor office corridor scenario for security monitoringin standard indoor space These scenarios substantially differin terms of acoustic environment The indoor scenarioswere recorded under quiet acoustic conditions whereas theoutdoor recordings were conducted in an open-air publicarea and contain nonstationary backgroundnoiseThe smart-home scenario contains recordings of five professional actorsperforming five single-person and 14 multiple-person actionscripts The main activities include human-machine interac-tion with a virtual home agent and a number of alternat-ing normal and abnormal activities specifically designed to

monitor and interpret human behaviour The single-personand multiple-person actions include abnormal events suchas falls alarm followed by panic atypical vocalic reactions(pain fear and anger) or fractures Examples are walking tothe couch sitting or interacting with the smart environmentto turn the TV on open the windows or decrease thetemperatureThe scenarios were recorded three to five timesby changing the actors and their roles in the action scriptsTable 1 provides details on the number of abnormal events perscenario including average time duration

7 Experimental Set-Up

The networks were trained with the gradient steepest descentalgorithm on the sum of squared errors (SSE) In the caseof all the LSTM and BLSTM networks we used a constantvalue of learning rate 119897 = 1119890minus6 since it showed betterperformances in our previous works [35] whereas differentvalues of 119897 = 1119890minus8 1119890minus9 were used for MLP networksDifferent noise sigma values 120590 = 001 01 025were appliedto the DAE No Gaussian noise was applied to the basicAE and to the CAE following the architectures describedin Section 4 The prediction delay was applied for differentvalues 119896 = 1 2 3 4 5 10 The AEs were trained usingour open-source CUDA RecurREnt Neural Network Toolkit(CURRENNT) [54] ensuring reproducibility As evaluationmetrics we used 119865-measure in order to compare the resultswith previous works [35 36]We evaluated several topologiesfor the nonlinear predictive DAE ranging from 54-128-54to 216-216-216 and from 54-30-54 to 54-54-54 in the caseof CAE and basic AE respectively Every network topologywas evaluated for each of the 100 epochs of training Inorder to compare our results with our previous studies wekept the same optimisation procedure as applied in [3536] We employed further three state-of-the-art approachesbased on OCSVM GMM and HMM In the case ofOCSVM we trained models at different complexity values119862 = 005 001 0005 0001 00005 00001 Radial basisfunction kernels were used with different gamma values 120574 =001 0001 and we controlled the fraction of outliers in

8 Computational Intelligence and Neuroscience

Table 2 Comparison over the three databases of methods by percentage of 119865-measure (1198651) Indicated prediction (D)elay and average 1198651weighted by the of instances per database Reported approaches are GMM HMM OCSVM compression autoencoder with MLP (MLP-CAE) BLSTM (BLSTM-CAE) and LSTM (LSTM-CAE) denoising autoencoder withMLP (MLP-DAE) BLSTM (BLSTM-DAE) and LSTM(LSTM-DAE) and related versions of nonlinear predictive autoencoders NP-MLP-CAEAEDAE and NP-(B)LSTM-CAEAEDAE

MethodA3Novelty PASCAL CHiME PROMETHEUS Weighted average

ATM Corridor Outdoor Smart-roomD(119896) 1198651 D(119896) 1198651 () D(119896) 1198651 () D(119896) 1198651 () D(119896) 1198651 () D(119896) 1198651 () 1198651 ()

OCSVM mdash 918 mdash 634 mdash 602 mdash 653 mdash 573 mdash 574 733GMM mdash 894 mdash 894 mdash 502 mdash 494 mdash 564 mdash 591 787HMM mdash 882 mdash 914 mdash 520 mdash 496 mdash 560 mdash 591 789MLP-CAE 0 976 0 852 0 761 0 762 0 648 0 612 848LSTM-CAE 0 977 0 891 0 785 0 778 0 626 0 640 862BLSTM-CAE 0 987 0 913 0 784 0 784 0 621 0 637 873MLP-AE 0 972 0 850 0 761 0 770 0 651 0 618 848LSTM-AE 0 977 0 891 0 787 0 779 0 616 0 614 860BLSTM-AE 0 978 0 894 0 785 0 776 0 631 0 634 864MLP-DAE 0 973 0 873 0 775 0 785 0 658 0 646 860LSTM-DAE 0 979 0 924 0 795 0 787 0 680 0 650 881BLSTM-DAE 0 984 0 934 0 787 0 798 0 685 0 651 887NP-MLP-CAE 4 985 5 883 1 788 2 750 1 652 5 640 864NP-LSTM-CAE 5 988 1 925 1 787 2 744 2 647 3 638 877NP-BLSTM-CAE 4 992 3 928 2 783 1 752 2 657 2 632 881NP-MLP-AE 4 985 5 859 1 790 2 746 1 647 2 628 855NP-LSTM-AE 5 987 1 921 1 781 1 750 1 650 3 644 876NP-BLSTM-AE 4 992 2 941 3 784 2 756 1 656 1 636 885NP-MLP-DAE 5 989 5 888 1 816 1 775 1 670 4 643 873NP-LSTM-DAE 5 991 1 942 4 804 1 764 2 662 1 652 888NP-BLSTM-DAE 5 994 3 944 2 807 3 785 2 667 1 656 893

the training phase with different values ] = 01 001 TheOCSVM was trained via the LIBSVM library [55] In thecase of GMM models were trained at different numbers ofGaussian components 2119899 with 119899 = 1 2 8 whereasleft-right HMMs were trained with different numbers ofstates 119904 = 3 4 5 and 2119899 Gaussian components with 119899 =1 2 7 GMMs and HMMs were trained using the Torch[56] toolkit The decision values produced as output of theOCSVM and the probability estimates produced as output ofthe probabilistic models were postprocessed with a similarthresholding algorithm (cf Section 5) in order to fairlycompare the performance among the different methods Forall the experiments and settings we maintained the samefeature set

8 Results

In this section we present and comment on the resultsobtained in our evaluation across the three databases

81 A3Novelty Evaluations on the A3Novelty Corpus arereported in the second column of Table 2 In this datasetGMMs and HMMs perform similarly however they areoutperformed by the OCSVM with a maximum improve-ment of 36 absolute 119865-measure The autoencoder-basedapproaches are significantly boosting the performance up

to 987 We observe a vast absolute improvement by upto 69 against the probabilistic approaches Among thethree CAE AE and DAE we observe that compressionand denoising layouts with BLSTM units perform closely toeach other at up to 987 in the case of the BLSTM-CAEThis can be due to the fact that the dataset contains fewervariations in the background material used for training andthe feature selection operated internally by the AE increasesthe sensitivity of the reconstruction error

The nonlinear predictive results are shown in the lastpart of Table 2 We provide performance in the three namedconfigurations and with the three named unit types Inconcordance to what we found in the PASCAL database theNP-BLSTM-DAE method provided the best performance interms of 119865-measure of up to 994 A significant absoluteimprovement (one-tailed z-test [57] 119901 lt 001 (in the restof the manuscript we reported as ldquosignificantrdquo the improve-ments with at least 119901 lt 001 under the one-tailed z-test[57])) of 100 119865-measure is observed against the GMM-based approach while an absolute improvement of 76 119865-measure is exhibited with respect to the OCSVM methodWe observe an overall improvement of asymp1 between theldquoordinaryrdquo and the ldquopredictiverdquo architectures

Theperformance obtained by progressively increasing theprediction delay (119896) values (from 0 up to 10) is reported inFigure 4 We evaluated the compression autoencoder (CAE)

Computational Intelligence and Neuroscience 9

970

975

980

985

990

995

F-m

easu

re (

)

1 2 3 4 50 10

Prediction delay

NP-BLSTM-CAENP-LSTM-CAENP-MLP-CAE

NP-BLSTM-AENP-LSTM-AENP-MLP-AE

NP-BLSTM-DAENP-LSTM-DAENP-MLP-DAE

Figure 4 A3Novelty performances obtained with NP-Autoencod-ers by progressively increasing the prediction delay

the basic autoencoder (AE) and the denoising autoencoder(DAE) with MLP LSTM and BLSTM units and we applieddifferent layouts (cf Section 7) per network type Howeverfor the sake of brevity we only show the best configurationsThe best results across all the three unit types are 994and 991 119865-measure for the NP-BLSTM-DAE and NP-LSTM-DAE networks respectivelyThese are obtained with aprediction delay of 5 frames which translates into an overalldelay of 50ms In general the best performances are achievedwith 119896 = 4 or 119896 = 5 Increasing the prediction delay up to 10frames produces a heavy decrease in performance down to978 119865-measure

8.2. PASCAL CHiME. In the first column of Table 2, we report the performance obtained on the PASCAL dataset using different approaches. Parts of the results obtained on this database were also presented in [36]. Here, we conducted additional experiments to evaluate the OCSVM and MLP approaches. The one-class SVM shows lower performance compared to probabilistic approaches such as GMM and HMM, which seem to work reasonably well, up to 91.4% F-measure. The low OCSVM performance can be due to the fact that the dataset was generated artificially and the abnormal sound dynamics were normalized with respect to the "normal" material, making the margin maximisation more complex and less effective. Next, we evaluated AE-based approaches in the three configurations: compression (CAE), basic (AE), and denoising (DAE). We also evaluated MLP, LSTM, and BLSTM unit types. Among the three configurations, we observe that the denoising ones perform better than the others, independently of the type of unit. In particular, the best performance is obtained with the denoising autoencoder realised as a BLSTM RNN, showing up to 93.4% F-measure. The last three groups of rows in Table 2 show results of the NP approach, again in the three configurations and with the three unit types.

The NP-BLSTM-DAE achieved the best result, of up to 94.4% F-measure. Significant absolute improvements of 4.0%, 3.0%, and 1.0% are observed over the GMM, HMM, and "ordinary" BLSTM-DAE approaches, respectively.

[Figure 5: PASCAL CHiME performances obtained with NP-Autoencoders by progressively increasing the prediction delay. Axes: prediction delay in frames (0-10) vs. F-measure (%); one curve per NP-{MLP, LSTM, BLSTM}-{CAE, AE, DAE} network.]

Interestingly, applying the nonlinear prediction scheme to the compression autoencoders, NP-(B)LSTM-CAE (92.8% F-measure), also increased the performance in comparison with the (B)LSTM-CAE (91.3% F-measure). In fact, in a previous work [35], the compression learning process alone showed scarce results; here, however, the CAE with the nonlinear prediction encodes information on the input more effectively.

Figure 5 depicts results for increasing values of the prediction delay k, ranging from 0 to 10. We evaluated CAE, AE, and DAE with MLP, LSTM, and BLSTM neural networks, with different layouts (cf. Section 7) per network type; however, due to space restrictions, we only report the best performances. Here, the best performances are obtained with a prediction delay of 3 frames (30 ms) for the NP-BLSTM-DAE network (94.4% F-measure) and of one frame in the case of the NP-LSTM-DAE (94.2% F-measure). As in the A3Novelty database, we observe a similar decrease in performance, down to 86.2% F-measure, when the prediction delay increases up to 10 frames, which corresponds to 100 ms. In fact, applying a higher prediction delay (e.g., 100 ms) induces higher values of the reconstruction error in the presence of fast periodic events, which subsequently leads to an increased false detection rate.
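The false-detection mechanism is easiest to see at the thresholding stage: the frame-wise reconstruction error is compared against the median-based adaptive threshold of Section 5, so any systematic inflation of the error (as induced by large delays on fast periodic events) pushes normal frames over the threshold. Below is a minimal sketch under the simplifying assumption that the per-frame error is the coefficient-averaged distance between input and reconstruction; the helper name and the synthetic data are ours.

```python
# Median-based adaptive thresholding of the reconstruction error (cf. Section 5):
# theta = beta * median(e_0), with beta swept between 1 and 2 per 30 s sequence.
import numpy as np

def novelty_signal(x, x_hat, beta=1.5):
    e0 = np.abs(x - x_hat).mean(axis=1)       # one error value per frame
    theta = beta * np.median(e0)              # sequence-adaptive threshold
    return e0 > theta                         # binary novelty decisions

x = np.random.randn(3000, 54)                 # standardized input features
x_hat = x + 0.05 * np.random.randn(3000, 54)  # stand-in for the AE output
print(novelty_signal(x, x_hat).sum(), "frames flagged as novel")
```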

8.3. PROMETHEUS. This subsection elaborates on the results obtained on the four subsets present in the PROMETHEUS database.

8.3.1. ATM. The ATM scenario evaluations are shown in the third column of Table 2. The GMM and HMM perform similarly, at chance level: in fact, we observe an F-measure of 50.2% and 52.0% for GMMs and HMMs, respectively. The one-class SVM shows slightly better performance, of up to 60.2%. On the other hand, AE-based approaches in the three configurations (compression (CAE), traditional (AE), and denoising (DAE)) show a significant improvement in performance, up to 19.3% absolute F-measure against the OCSVM. Among the three configurations, we observe that the DAE performs better independently of the type of network. In particular, the best performance considering the ordinary (without nonlinear prediction) approach is obtained with the DAE with an LSTM network, leading to an F-measure of 79.5%.

[Figure 6: PROMETHEUS-ATM performances obtained with NP-Autoencoders by progressively increasing the prediction delay. Axes: prediction delay in frames (0-10) vs. F-measure (%); one curve per NP-{MLP, LSTM, BLSTM}-{CAE, AE, DAE} network.]

The last three groups of rows in Table 2 show results of the nonlinear predictive approach (NP). The nonlinear predictive denoising autoencoder performs best, at up to 81.6% F-measure. Surprisingly, the best performance is obtained using MLP units, suggesting that for long events, such as those contained in the ATM scenario (with an average duration of 60 s, cf. Table 1), memory-enhanced units such as (B)LSTM are not as effective as for shorter events.

A significant absolute improvement of 21.4% F-measure is observed against the OCSVM approach, while an absolute improvement of 31.4% F-measure is exhibited with respect to the GMM-based method. Between the two autoencoder-based approaches, we report an absolute improvement of 1.0% from the "ordinary" to the "predictive" structures. It must be observed that the performance presented in [31] is higher than the one provided in this article, since the tolerance window used in that study was set to 1 s, whereas here we aimed at a higher temporal resolution, with a tolerance window of 200 ms, which is suitable also for abrupt events.
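To make the tolerance-window comparison concrete, the toy check below scores a detected onset as correct only if it lies within the window around a ground-truth onset. The helper and the timestamps are hypothetical; this is only meant to show why a 200 ms window is a stricter criterion than the 1 s window of [31], not to reproduce the exact scoring protocol.

```python
# Toy illustration of tolerance-window scoring (hypothetical helper/timestamps).
def hit(detected_s, truth_onsets_s, tol_s):
    return any(abs(detected_s - t) <= tol_s for t in truth_onsets_s)

truth = [3.0, 7.4]                  # ground-truth novelty onsets in seconds
print(hit(3.5, truth, tol_s=1.0))   # True under the looser 1 s window of [31]
print(hit(3.5, truth, tol_s=0.2))   # False under the stricter 200 ms window
```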

Figure 6 depicts performance for progressive values of the prediction delay k, ranging from 0 to 10, applying a CAE, AE, and DAE with MLP, LSTM, and BLSTM networks. Several layouts (cf. Section 7) were evaluated per network type; however, we report only the best configurations. Setting a prediction delay of 1 frame, which corresponds to a total prediction delay of 10 ms, leads to the best performance, of up to 81.6% F-measure, in the NP-MLP-DAE network. In the case of the NP-BLSTM-DAE, we observe better performance with a delay of 2 frames, up to 80.7% F-measure. In general, we do not observe a consistent trend when increasing the prediction delay, corroborating the fact that for long events, as those contained in the ATM scenario, memory-enhanced units and a nonlinear predictive approach are not as effective as for shorter events.

8.3.2. Corridor. The evaluations on the corridor subset are shown in the fourth column of Table 2. The GMM and HMM perform similarly, at chance level: we observe an F-measure of 49.4% and 49.6% for GMM and HMM, respectively. The OCSVM shows better performance, up to 65.3%. As observed in the ATM scenario, again a significant improvement in performance, up to 16.5% absolute F-measure, is obtained using the autoencoder-based approaches in the three configurations (CAE, AE, and DAE) with respect to the OCSVM. Among the three configurations, we observe that the denoising autoencoder performs better than the others. The best performance is obtained with the denoising autoencoder with BLSTM units, of up to 79.8% F-measure.

The "predictive" approach is reported in the last three groups of rows in Table 2. Interestingly, the nonlinear predictive autoencoders do not improve the performance as we have seen in the other scenarios. A plausible explanation lies in the nature of the novelty events present in the subset: it contains very long events, with an average duration of up to 140 s per event, and with such long events the generative model does not introduce a more sensitive reconstruction error. However, the delta in performance between the BLSTM-DAE and the NP-BLSTM-DAE is rather small (1.3% F-measure), in favour of the "ordinary" approach. The best performance (79.8%) is obtained using BLSTM units, confirming that memory-enhanced units are more effective in the presence of short events. In fact, this scenario, besides very long events, also contains short fall and pain events, with an average duration of 10 s and 30 s, respectively.

A significant absolute improvement of up to 16.5% F-measure is observed against the OCSVM approach, and it is even higher with respect to the GMM and HMM.

8.3.3. Outdoor. The evaluations on the outdoor subset are shown in the fifth column of Table 2. The OCSVM, GMM, and HMM perform better in this scenario, as opposed to ATM and corridor: we observe an F-measure of 57.3%, 56.4%, and 56.0% for OCSVM, GMM, and HMM, respectively. In this scenario, the improvement brought by the autoencoder is not as vast as in the previous subsets, but it is still significant: we report an absolute improvement of 11.2% F-measure between the OCSVM and the BLSTM-DAE. Again, the denoising autoencoder performs better than the other configurations. In particular, the best performance, obtained with the BLSTM-DAE, is 68.5% F-measure.

As observed in the corridor scenario, the nonlinear predictive autoencoders (last three groups of rows in Table 2) do not improve the performance. These results corroborate our previous explanation that the long duration of the novelty events present in the subset affects the sensitivity of the reconstruction error in the generative model. However, the delta in performance between the BLSTM-DAE and the NP-BLSTM-DAE is rather small (1.3% F-measure).

It must be observed that the performance in this scenario is rather low compared to the one obtained in the other datasets. We believe that the presence of anger novel sounds introduces a higher degree of complexity for our autoencoder-based approach. In fact, anger novel events may contain different levels of aroused content, which can be acoustically similar to the neutral spoken content present in the training material; under this condition, the generative model shows a low reconstruction error. This issue could be addressed by trimming the novel events to contain only the aroused segments or, considering anger as a long-term speaker state, by increasing the temporal resolution of our system.

8.3.4. Smart-Room. The smart-room scenario evaluations are shown in the sixth column of Table 2. The OCSVM, GMM, and HMM perform better in this scenario, as opposed to ATM, corridor, and outdoor: we observe an F-measure of 57.4%, 59.1%, and 59.1% for OCSVM, GMM, and HMM, respectively. In this scenario, the improvement brought about by the autoencoder is still significant: we report an absolute improvement of 6.0% F-measure between GMM/HMM and the BLSTM-DAE. Again, the denoising autoencoder performs better than the other configurations. In particular, the best performance in the ordinary approach is obtained with the BLSTM-DAE, of up to 65.1% F-measure.

The last three groups of rows in Table 2 show results of the nonlinear predictive approach (NP). The NP-BLSTM-DAE performs best, at up to 65.6% F-measure.

As in the outdoor subset, we report a low performance in the smart-room subset as well. In fact, the subset contains several long novel events related to spoken content expressing pain and fear. As commented for the outdoor scenario, under this condition the generative model may be able to reconstruct the novel event without producing a high reconstruction error.

8.4. Overall. Overall, the experimental results proved that the DAE methods achieve superior performance compared to the CAE/AE schemes. This is due to the combination of the two learning processes of a denoising autoencoder: encoding the input while preserving the information about the input itself and, simultaneously, reversing the effect of a corruption process applied to the input of the autoencoder.
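A minimal sketch of the corresponding training-pair construction is given below, assuming the Gaussian corruption described for the DAE; the σ value and the helper name are ours, chosen for illustration.

```python
# Sketch of a denoising-autoencoder training pair: the input is a
# Gaussian-corrupted copy of the frame, the target is the clean frame.
import numpy as np

def dae_pair(x_clean, sigma=0.1, rng=np.random.default_rng(0)):
    x_noisy = x_clean + rng.normal(0.0, sigma, size=x_clean.shape)
    return x_noisy, x_clean        # (network input, training target)

x = np.random.randn(32, 54)        # a batch of 54-dim feature frames
x_in, x_tgt = dae_pair(x)
```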

In particular, the predictive approach with (B)LSTM units showed the best performance, of up to 89.3% average F-measure among all six datasets, weighted by the number of instances per database (cf. Table 2).
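The weighting is a plain instance-weighted mean. The sketch below reproduces the computation using the NP-BLSTM-DAE column of Table 2, with the Table 1 event counts standing in for the per-database instance counts (which the text does not enumerate), so the printed value only approximates the reported 89.3%.

```python
# Instance-weighted average F-measure (weights are a stand-in, see above).
f1 = {"A3Novelty": 99.4, "PASCAL CHiME": 94.4, "ATM": 80.7,
      "Corridor": 78.5, "Outdoor": 66.7, "Smart-room": 65.6}
weights = {"A3Novelty": 13, "PASCAL CHiME": 267, "ATM": 5,
           "Corridor": 36, "Outdoor": 10, "Smart-room": 20}
avg = sum(f1[d] * w for d, w in weights.items()) / sum(weights.values())
print(f"{avg:.1f}%")   # ~90.3% with these stand-in weights
```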

To better understand the improvement brought by the RNN-based approaches, we provide in Figure 7 the comparison between state-of-the-art methods in terms of weighted average F-measure computed across the A3Novelty Corpus, PASCAL CHiME, and PROMETHEUS. In general, we observe that the recently proposed NP-BLSTM-DAE method provided the best performance in terms of average F-measure, of up to 89.3%. A significant absolute improvement of 16.0% average F-measure is observed against the OCSVM approach, while absolute improvements of 10.6% and 10.4% average F-measure are exhibited with respect to the GMM- and HMM-based methods. An absolute improvement of 0.6% is observed over the "ordinary" BLSTM-DAE. It has to be noted that the average F-measure is computed including the PROMETHEUS database, for which the performance has been shown to be lower, because it contains long-term events and a lower resolution in the labels (1 s).

[Figure 7: Average F-measure computed over the A3Novelty Corpus, PASCAL, and PROMETHEUS, weighted by the number of instances per database; extended comparison of methods. The per-method bars correspond to the weighted-average column of Table 2, ranging from 73.3% (OCSVM) to 89.3% (NP-BLSTM-DAE).]

The RNN-based schemes also bring an evident benefit when applied to the "normal" autoencoders (i.e., with no denoising or compression): in fact, the NP-BLSTM-AE achieves an F-measure of 88.5%. Furthermore, when we applied the nonlinear prediction scheme to a denoising autoencoder, the performance achieved with LSTM units was comparable with that of BLSTM units and also outperformed state-of-the-art approaches.

In conclusion, the combination of the nonlinear prediction paradigm and the various (B)LSTM autoencoders proved to be effective, significantly outperforming other state-of-the-art methods. Additionally, there is evidence that memory-enhanced units such as LSTM and BLSTM outperform an MLP without memory, showing that knowledge of the temporal context can improve the novelty detector's abilities.

9. Conclusions and Outlook

We presented a broad and extensive evaluation of state-of-the-art methods, with a particular focus on novel and recent unsupervised approaches based on RNN-based autoencoders. We significantly extended the studies conducted in [35, 36] by evaluating further approaches, such as one-class support vector machines (OCSVMs) and multilayer perceptrons (MLPs), and, most importantly, we conducted a broad evaluation on three different datasets, for a total number of 160,153 experiments, making this article the first to present such a complete evaluation in the field of acoustic novelty detection. We showed that RNN-based autoencoders significantly outperform other methods, achieving up to 89.3% weighted average F-measure on the three databases, with a significant absolute improvement of 10.4% against the best performance obtained with statistical approaches (HMM). Overall, a significant increase in performance was achieved by combining the (B)LSTM autoencoder-based architecture with the nonlinear prediction scheme.

Future work will focus on using multiresolution features [51, 58], likely more suitable to deal with different event durations, in order to face the issues encountered in the PROMETHEUS database. Further studies will be conducted on other architectures of RNN-based autoencoders, ranging from dynamic Bayesian networks [59] to convolutional neural networks [60].

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreements no. 338164 (ERC Starting Grant iHEARu), no. 645378 (RIA ARIA-VALUSPA), and no. 688835 (RIA DE-ENIGMA).

References

[1] L. Tarassenko, P. Hayton, N. Cerneaz, and M. Brady, "Novelty detection for the identification of masses in mammograms," in Proceedings of the 4th International Conference on Artificial Neural Networks, pp. 442-447, June 1995.
[2] J. Quinn and C. Williams, "Known unknowns: novelty detection in condition monitoring," in Pattern Recognition and Image Analysis, J. Mart, J. Bened, A. Mendona, and J. Serrat, Eds., vol. 4477 of Lecture Notes in Computer Science, pp. 1-6, Springer, Berlin, Germany, 2007.
[3] L. Clifton, D. A. Clifton, P. J. Watkinson, and L. Tarassenko, "Identification of patient deterioration in vital-sign data using one-class support vector machines," in Proceedings of the Federated Conference on Computer Science and Information Systems (FedCSIS '11), pp. 125-131, September 2011.
[4] Y.-H. Liu, Y.-C. Liu, and Y.-J. Chen, "Fast support vector data descriptions for novelty detection," IEEE Transactions on Neural Networks, vol. 21, no. 8, pp. 1296-1313, 2010.
[5] C. Surace and K. Worden, "Novelty detection in a changing environment: a negative selection approach," Mechanical Systems and Signal Processing, vol. 24, no. 4, pp. 1114-1128, 2010.
[6] J. A. Quinn, C. K. I. Williams, and N. McIntosh, "Factorial switching linear dynamical systems applied to physiological condition monitoring," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 9, pp. 1537-1551, 2009.
[7] A. Patcha and J.-M. Park, "An overview of anomaly detection techniques: existing solutions and latest technological trends," Computer Networks, vol. 51, no. 12, pp. 3448-3470, 2007.
[8] M. Markou and S. Singh, "A neural network-based novelty detector for image sequence analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1664-1677, 2006.
[9] M. Markou and S. Singh, "Novelty detection: a review - part 1: statistical approaches," Signal Processing, vol. 83, no. 12, pp. 2481-2497, 2003.
[10] M. Markou and S. Singh, "Novelty detection: a review - part 2: neural network based approaches," Signal Processing, vol. 83, no. 12, pp. 2499-2521, 2003.
[11] E. R. de Faria, I. R. Goncalves, J. Gama, and A. C. Carvalho, "Evaluation of multiclass novelty detection algorithms for data streams," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 11, pp. 2961-2973, 2015.
[12] C. Satheesh Chandran, S. Kamal, A. Mujeeb, and M. H. Supriya, "Novel class detection of underwater targets using self-organizing neural networks," in Proceedings of the IEEE Underwater Technology (UT '15), Chennai, India, February 2015.
[13] K. Worden, G. Manson, and D. Allman, "Experimental validation of a structural health monitoring methodology: part I. Novelty detection on a laboratory structure," Journal of Sound and Vibration, vol. 259, no. 2, pp. 323-343, 2003.
[14] J. Foote, "Automatic audio segmentation using a measure of audio novelty," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '00), vol. 1, pp. 452-455, August 2000.
[15] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Proceedings of the Neural Information Processing Systems (NIPS '99), vol. 12, pp. 582-588, Denver, Colo, USA, 1999.
[16] J. Ma and S. Perkins, "Time-series novelty detection using one-class support vector machines," in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '03), vol. 3, pp. 1741-1745, Portland, Ore, USA, July 2003.
[17] J. Ma and S. Perkins, "Online novelty detection on temporal sequences," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), pp. 613-618, ACM, Washington, DC, USA, August 2003.
[18] P. Hayton, B. Schölkopf, L. Tarassenko, and P. Anuzis, "Support vector novelty detection applied to jet engine vibration spectra," in Proceedings of the 14th Annual Neural Information Processing Systems Conference (NIPS '00), pp. 946-952, December 2000.
[19] L. Tarassenko, A. Nairac, N. Townsend, and P. Cowley, "Novelty detection in jet engines," in Proceedings of the IEE Colloquium on Condition Monitoring: Machinery, External Structures and Health, pp. 41-45, Birmingham, UK, April 1999.
[20] L. Clifton, D. A. Clifton, Y. Zhang, P. Watkinson, L. Tarassenko, and H. Yin, "Probabilistic novelty detection with support vector machines," IEEE Transactions on Reliability, vol. 63, no. 2, pp. 455-467, 2014.
[21] D. R. Hardoon and L. M. Manevitz, "One-class machine learning approach for fMRI analysis," in Proceedings of the Postgraduate Research Conference in Electronics, Photonics, Communications and Networks, and Computer Science, Lancaster, UK, April 2005.
[22] M. Davy, F. Desobry, A. Gretton, and C. Doncarli, "An online support vector machine for abnormal events detection," Signal Processing, vol. 86, no. 8, pp. 2009-2025, 2006.
[23] M. A. F. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko, "A review of novelty detection," Signal Processing, vol. 99, pp. 215-249, 2014.
[24] N. Japkowicz, C. Myers, M. Gluck et al., "A novelty detection approach to classification," in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI '95), vol. 1, pp. 518-523, Montreal, Canada, 1995.
[25] B. B. Thompson, R. J. Marks, J. J. Choi, M. A. El-Sharkawi, M.-Y. Huang, and C. Bunje, "Implicit learning in autoencoder novelty assessment," in Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN '02), vol. 3, pp. 2878-2883, IEEE, Honolulu, Hawaii, USA, 2002.
[26] S. Hawkins, H. He, G. Williams, and R. Baxter, "Outlier detection using replicator neural networks," in Data Warehousing and Knowledge Discovery, pp. 170-180, Springer, Berlin, Germany, 2002.
[27] N. Japkowicz, "Supervised versus unsupervised binary-learning by feedforward neural networks," Machine Learning, vol. 42, no. 1-2, pp. 97-122, 2001.
[28] L. Manevitz and M. Yousef, "One-class document classification via neural networks," Neurocomputing, vol. 70, no. 7-9, pp. 1466-1481, 2007.
[29] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM '02), pp. 709-712, IEEE Computer Society, December 2002.
[30] H. Sohn, K. Worden, and C. R. Farrar, "Statistical damage classification under changing environmental and operational conditions," Journal of Intelligent Material Systems and Structures, vol. 13, no. 9, pp. 561-574, 2002.
[31] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "Probabilistic novelty detection for acoustic surveillance under real-world conditions," IEEE Transactions on Multimedia, vol. 13, no. 4, pp. 713-719, 2011.
[32] E. Principi, S. Squartini, R. Bonfigli, G. Ferroni, and F. Piazza, "An integrated system for voice command recognition and emergency detection based on audio signals," Expert Systems with Applications, vol. 42, no. 13, pp. 5668-5683, 2015.
[33] A. Ito, A. Aiba, M. Ito, and S. Makino, "Evaluation of abnormal sound detection using multi-stage GMM in various environments," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH '11), pp. 301-304, Florence, Italy, August 2011.
[34] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "On acoustic surveillance of hazardous situations," in Proceedings of the 34th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 165-168, April 2009.
[35] E. Marchi, F. Vesperini, F. Eyben, S. Squartini, and B. Schuller, "A novel approach for automatic acoustic novelty detection using a denoising autoencoder with bidirectional LSTM neural networks," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), 5 pages, IEEE, Brisbane, Australia, April 2015.
[36] E. Marchi, F. Vesperini, F. Weninger, F. Eyben, S. Squartini, and B. Schuller, "Non-linear prediction with LSTM recurrent neural networks for acoustic novelty detection," in Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN '15), p. 5, IEEE, Killarney, Ireland, July 2015.
[37] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, "Learning precise timing with LSTM recurrent networks," The Journal of Machine Learning Research, vol. 3, no. 1, pp. 115-143, 2003.
[38] A. Graves, "Generating sequences with recurrent neural networks," https://arxiv.org/abs/1308.0850.
[39] D. Eck and J. Schmidhuber, "Finding temporal structure in music: blues improvisation with LSTM recurrent networks," in Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pp. 747-756, IEEE, Martigny, Switzerland, September 2002.
[40] J. A. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," Tech. Rep. 97-021, International Computer Science Institute, Berkeley, Calif, USA, 1998.
[41] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[42] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Advances in Neural Information Processing Systems 12, S. Solla, T. Leen, and K. Muller, Eds., pp. 582-588, MIT Press, 2000.
[43] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986.
[44] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[45] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673-2681, 1997.
[46] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5-6, pp. 602-610, 2005.
[47] G. Hinton, L. Deng, D. Yu et al., "Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.
[48] I. Goodfellow, H. Lee, Q. V. Le, A. Saxe, and A. Y. Ng, "Measuring invariances in deep networks," in Advances in Neural Information Processing Systems, Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, Eds., vol. 22, pp. 646-654, Curran & Associates, Inc., Red Hook, NY, USA, 2009.
[49] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds., pp. 153-160, MIT Press, 2007.
[50] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, pp. 3371-3408, 2010.
[51] E. Marchi, G. Ferroni, F. Eyben, L. Gabrielli, S. Squartini, and B. Schuller, "Multi-resolution linear prediction based features for audio onset detection with bidirectional LSTM neural networks," in Proceedings of the 39th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 2183-2187, IEEE, Florence, Italy, May 2014.
[52] F. Eyben, F. Weninger, F. Gross, and B. Schuller, "Recent developments in openSMILE, the Munich open-source multimedia feature extractor," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 835-838, ACM, Barcelona, Spain, October 2013.
[53] J. Barker, E. Vincent, N. Ma, H. Christensen, and P. Green, "The PASCAL CHiME speech separation and recognition challenge," Computer Speech & Language, vol. 27, no. 3, pp. 621-633, 2013.
[54] F. Weninger, J. Bergmann, and B. Schuller, "Introducing CURRENNT: the Munich open-source CUDA RecurREnt Neural Network Toolkit," Journal of Machine Learning Research, vol. 16, pp. 547-551, 2015.
[55] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, article 27, 2011.
[56] R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: a Matlab-like environment for machine learning," BigLearn, NIPS Workshop, no. EPFL-CONF-192376, 2011.
[57] M. D. Smucker, J. Allan, and B. Carterette, "A comparison of statistical significance tests for information retrieval evaluation," in Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM '07), pp. 623-632, ACM, Lisbon, Portugal, November 2007.
[58] E. Marchi, G. Ferroni, F. Eyben, S. Squartini, and B. Schuller, "Audio onset detection: a wavelet packet based approach with recurrent neural networks," in Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN '14), pp. 3585-3591, Beijing, China, July 2014.
[59] N. Friedman, K. Murphy, and S. Russell, "Learning the structure of dynamic probabilistic networks," in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI '98), pp. 139-147, Morgan Kaufmann, Madison, Wisc, USA, July 1998.
[60] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, "Face recognition: a convolutional neural-network approach," IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98-113, 1997.


Page 4: Deep Recurrent Neural Network-Based Autoencoders for ...downloads.hindawi.com/journals/cin/2017/4694860.pdf · ResearchArticle Deep Recurrent Neural Network-Based Autoencoders for

4 Computational Intelligence and Neuroscience

When the layout of a neural network comprises morehidden layers it is defined as deep neural network (DNN)[47] An incrementally higher level representation of theinput data is provided when multiple hidden layers arestacked on each other (deep learning)

In the case of multiple layers the output of a BRNN iscomputed as

y119905 = W997888rarrh119873119910

997888rarrh119873119905 +Wlarr997888

h119873119910

larr997888h119873119905 + b119910 (5)

where the forward and backward activation of the 119873th(last) hidden layer are denoted by

997888rarrh119873119905 and

larr997888h119873119905 respectively

The reconstructed signal is generated by using an identityactivation function at the outputThe best network layout wasobtained by conducting a number of preliminary evaluationsSeveral configurations were evaluated by changing the sizeand the number of hidden layers (ie the number of LSTMunits for each layer)

The training procedure was iterated up to a maximum of100 epochsThe standard gradient descent with backpropaga-tion of the sum squared error was used to recursively updatethe network weights Those were initialized with a randomGaussian distribution with mean 0 and standard deviation01 as it usually provides an acceptable initialization in ourexperience

4 Autoencoders for AcousticNovelty Detection

This section introduces the concepts of autoencoders anddescribes the basic autoencoder compression autoencoderdenoising autoencoder and nonlinear predictive autoen-coder [36]

41 Basic Autoencoder A basic autoencoder is a neuralnetwork trained to set the target values equal to the inputs Itsstructure typically consists of only one hidden layer while theinput and the output layers have the same size The trainingsetXtr consists of background environmental sounds whiletest set Xte is composed of recordings containing abnormalsounds It is used to find common data representation fromthe input [48 49] Formally in response to an input example119909 isin R119899 the hidden representation ℎ(119909) isin R119898 is

ℎ (119909) = 119891 (1198821119909 + 1198871) (6)

where 119891(119911) is a nonlinear activation function typically alogistic sigmoid function 119891(119911) = 1(1 + exp(minus119911)) appliedcomponentwisely1198821 isin R119898times119899 is a weight matrix and 1198871 isin R119898is a bias vector

The network output maps the hidden representation ℎback to a reconstruction isin R119899

= 119891 (1198822ℎ (119909) + 1198872) (7)

where 1198822 isin R119899times119898 is a weight matrix and 1198872 isin R119899 is a biasvector

Given an input set of examples X AE training consistsin finding parameters 120579 = 11988211198822 1198871 1198872 that minimise the

reconstruction error which corresponds to minimising thefollowing objective function

J (120579) = sum119909isinX

119909 minus 2 (8)

A well-known approach tominimise the objective function isthe stochastic gradient descent with error backpropagationThe layout of the AE is shown in Figure 2(a)

42 Compression Autoencoder The compression autoen-coder (CAE) learns a compressed representation of the inputwhen the number of hidden units 119898 is smaller than thenumber of input units 119899 For example if some of the inputfeatures are correlated these correlations are learnt andreconstructed by the CAE The structure of the CAE is givenin Figure 2(b)

43 Denoising Autoencoder In the denoising AE (DAE)[50] configuration the network is trained to reconstruct theoriginal input from a corrupted version of itThe initial input119909 is corrupted by means of additive isotropic Gaussian noisein order to obtain 1199091015840 | 119909 sim 119873(119909 1205902119868) The corrupted input 1199091015840is then mapped as with the AE to a hidden representation

ℎ (1199091015840) = 119891 (119882101584011199091015840 + 11988710158401) (9)

forcing the hidden layer to retrieve more robust features andprevent it from simply learning the identityThus the originalsignal is reconstructed as follows

1015840 = 119891 (11988210158402119909 + 11988710158402) (10)

The structure of the denoising autoencoder is shown inFigure 2(c) In the training phase the set of network weightsand biases 1205791015840 = 11988210158401 11988210158402 11988710158401 11988710158402 are updated in order tohave 1015840 as close as possible to the uncorrupted input 119909 Thisprocedure corresponds to minimising the reconstructionerror objective function (8) In our approach to corrupt theinitial input 119909119905 we make use of additive isotropic Gaussiannoise in order to obtain 1199091015840 | 119909 sim 119873(119909 1205902119868)44 Nonlinear Predictive Autoencoder The basic idea of anonlinear predictive (NP) AE is to train the AE in orderto predict the current frame from a previous observationFormally the input up to a given time frame 119909119905 is mappedto a hidden representation ℎ

ℎ (119909119905) = 119891 (119882lowast1 119887lowast1 1199091119905) (11)

where 119882 and 119887 denote weights and bias respectively Fromthis we reconstruct an approximation of the original signalas follows

119905+119896 = 119891 (119882lowast2 119887lowast2 ℎ1119905) (12)

where 119896 is the prediction delay and ℎ119894 = ℎ(119909119894) A predictiondelay of 119896 = 1 corresponds to a shift of 10ms in theaudio signal in our setting (cf Section 5) The training of

Computational Intelligence and Neuroscience 5

(a) (b) (c)

Novelty

Output layer

Hidden layer

Input layer

Input

Thresholding

minuse0

Stochastic mapping

Feature extraction

x998400 | x sim N(x 1205902I)

x x isin 119987tr 119987te x x isin 119987tr 119987te x998400 x998400 isin 119987tr 119987te

x x isin 119987tr 119987te x x isin 119987tr 119987te x998400 x998400 isin 119987tr 119987te

Figure 2 Block diagram of the proposed acoustic novelty detector with different autoencoder structures Features are extracted from theinput signal and the reconstruction error between the input and the reconstructed features is then processed by a thresholding block whichdetects the novel or nonnovel event Structure of the (a) basic autoencoder (b) compression autoencoder and (c) denoising autoencoder onthe training setXtr or testing setXteXtr contains data of nonnovel acoustic eventsXte consists of novel and nonnovel acoustic events

the parameters is performed by minimising the objectivefunction (8)mdashthe difference is that is now based onnonlinear prediction according to (11) and (12) Thus theparameters 120579lowast = 119882lowast1 119882lowast2 119887lowast1 119887lowast2 are trained to minimisethe average reconstruction error over the training set to have119905+119896 as close as possible to the prediction delay The resultingstructure of the nonlinear predictive denoising autoencoder(NP-DAE) is similar to the one depicted in Figure 2(c) butwith input and output updated as described above

5 Thresholding and Features

This section describes the thresholding decision strategy andthe features employed in our experiments

51 Thresholding Auditory spectral features (ASF) inSection 52 used in this work are composed by 54 coefficientswhich means that the input and output layer of the networkhave 54 units each The trained AE reconstructs eachsample and novel events are identified by processing thereconstruction error signal with an adaptive threshold Theinput audio signal 119909 is segmented into sequences of 30seconds of length In the testing phase we computemdashon a

frame basismdashthe average Euclidean distance between thenetworksrsquo outputs and each standardized input feature valueIn order to compress the reconstruction error to a singlevalue the distances are summed up and divided by thenumber of coefficients Then we apply a threshold 120579th toobtain a binary signal shifting from the median of the errorsignal of a sequence 1198900 by a multiplicative coefficient 120573 Thecoefficient ranges from 120573min = 1 to 120573max = 2

120579th = 120573 lowastmedian (1198900 (1) 1198900 (119873)) (13)

Figure 3 shows the reconstruction error for a givensequence The figure clearly depicts a low reconstructionerror in reproducing normal input such as talking televisionsounds and other normal environmental sounds

52 Acoustic Features An efficient representation of theaudio signal can be achieved by extracting the auditoryspectral features (ASF) [51] The audio signal is split intoframes with the size equal to 30ms and a frame step of10ms and then the ASF are obtained by applying ShortTime Fourier Transform (STFT) which yields the powerspectrogram of the frame Mel spectrograms11987230(119899119898) (with

6 Computational Intelligence and Neuroscience

02468

Freq

uenc

y (H

z)

5 10 15 20 25

Time (s)

times103

0

1

2

123

LRe

cons

truc

tion

erro

rab

el

500 1000 1500 2000 2500

Frames

500 1000 1500 2000 2500

Frames(a)

02468

Freq

uenc

y (H

z)

5 10 15 20 25

Time (s)

times103

0

1

2

Labe

l

123

Reco

nstr

uctio

ner

ror

500 1000 1500 2000 2500

Frames

500 1000 1500 2000 2500

Frames(b)

02468

Freq

uenc

y (H

z)

5 10 15 20 25

Time (s)

times103

0

1

2

Labe

l

123

Reco

nstr

uctio

ner

ror

500 1000 1500 2000 2500

Frames

500 1000 1500 2000 2500

Frames(c)

Figure 3 Spectrogram of a 30-second sequence from (a) the A3Novelty Corpus containing a novel siren event (top) (b) the PASCALCHiMEdatabase containing three novel events such as a siren and two screams and (c) the PROMETHEUS database containing a novel eventcomposed by an alarm and people screaming together (top) Reconstruction error signal of the related sequence obtained with a BLSTM-DAE (middle) Ground-truth binary novelty signal novel (0) and not-novel (0) (bottom)

119899 being the frame index and 119898 the frequency bin index)are calculated converting the power spectrogram to the Mel-frequency scale using a filter-bankwith 26 triangular filters Alogarithmic scaling is chosen to match the human perceptionof loudness

11987230log (119899119898) = log (11987230 (119899119898) + 10) (14)

In addition the positive first-order differences 11986330(119899119898) arecalculated from each Mel spectrogram following

11986330 (119899119898) = 11987230log (119899119898) minus11987230log (119899 minus 1119898) (15)

Furthermore the frame energy and its derivative arealso included as feature ending up in a total number of 54coefficients For better reproducibility the features extractionprocess is computed with our open-source audio analysistoolkit openSMILE [52]

6 Databases

This section describes the three databases evaluated in ourexperiments A3Novelty PASCAL CHiME and PROME-THEUS

61 A3Novelty The A3Novelty Corpus (httpwwwa3labdiiunivpmitresearcha3novelty) includes around 56 hoursof recording acquired in a laboratory of the UniversitaPolitecnica delle Marche These recordings were performedduring different day and night hours so very differentacoustic conditions are available A variety of novel eventswere randomly played back by a speaker (eg scream fallalarm or breakage of objects) during the recordings

Eight microphones were used in the recording room forthe acquisitions four Behringer B-5 microphones with car-dioid pattern and an array of fourAKGC400BLmicrophones

spaced by 4 cm and then A MOTU 8pre sound card and theNU-Tech software were utilised to record the microphonesignals The sampling rate was equal to 48 kHz

The abnormal event sounds (cf Table 1) can be groupedinto four categories and they are freely available to downloadfrom httpwwwfreesoundorg

(i) Sirens three different types of sirens or alarm sounds(ii) Falls two occurrences of a person or an object falling

to the ground(iii) Breakage of objects noise produced by the breakage of

an object after the impact with the ground(iv) Screams four different human screams both pro-

duced by a single person or by a group of people

The A3Novelty Corpus is composed of two types ofrecordings background which contains only backgroundsounds such as human speech technical tools noise andenvironmental sounds and background with novelty whichcontains in addition to the background the artificially gen-erated novelty events

In the original A3Novelty database the recordings aresegmented in sequences of 30 seconds In order to limit thesize of training data we randomly selected 300 sequencesfrom the background partition to compose training material(150 minutes) and 180 sequences from the background withnovelty partition to compose the testing set (90 minutes)Thetest set contains 13 novelty occurrences

For reproducibility the list of randomly selected record-ings and the train and test set are made available (httpwwwa3labdiiunivpmitresearcha3novelty)

62 PASCAL CHiME The original dataset is composed ofaround 7 hours of recordings of a home environment takenfrom the PASCALCHiME speech separation and recognition

Computational Intelligence and Neuroscience 7

Table 1 Acoustic novel events in the test set Shown are the number of different events per database the average duration and the totalduration in seconds per event typeThe last column indicates the total number of events and total duration across the databases The last lineindicates the total duration in seconds of the test set including normal and novel events per database

EventsA3Novelty PASCAL CHiME PROMETHEUS Total

ATM Corridor Outdoor Smart-room Time (avg) Time (avg) Time (avg) Time (avg) Time (avg) Time (avg) time

Alarm mdash mdash 76 4358 (60) mdash mdash 6 840 (140) mdash mdash 3 90 (30) 85 5288Anger mdash mdash mdash mdash mdash mdash mdash mdash 6 2930 (488) mdash mdash 6 2930Fall 3 42 (21) 48 895 (18) mdash mdash 3 30 (10) mdash mdash 2 20 (10) 55 987Fracture 1 22 32 704 (22) mdash mdash mdash mdash mdash mdash mdash mdash 33 726Pain mdash mdash mdash mdash mdash mdash 2 80 (40) mdash mdash 5 670 (134) 7 750Scream 6 104 (17) 111 2146 (19) 5 300 (60) 25 2280 (91) 4 480 (120) 10 2340 (234) 159 7622Siren 3 204 (68) mdash mdash mdash mdash mdash mdash mdash mdash mdash mdash 3 181Total 13 381 (29) 267 8103 (31) 5 300 (50) 36 3230 (90) 10 3410 (341) 20 3120 (156) 348 18484Test time mdash 54000 mdash 41880 mdash 7500 mdash 9600 mdash 16200 mdash 10200 mdash 139380

challenge [53] It consists of a typical in-home scenario (aliving room) recorded during different days and times whilethe inhabitants (two adults and two children) perform com-mon actions such as talking watching television playing oreatingThe dataset was recorded with a binaural microphoneand a sample-rate of 16 kHz In the original PASCAL CHiMEdatabase the recordings are segmented in sequences of 5minutesrsquo duration In order to limit the size of training datawe randomly selected sequences to compose 100 minutesof background for the training set and around 70 minutesfor the testing set For reproducibility the list of randomlyselected recordings and the train and test set are madeavailable (httpa3labdiiunivpmitwebdavaudioNoveltyDetection Datasettargz) The test set was generated addingdifferent types of sounds (taken from httpwwwfreesoundfreesoundorg) such as screams alarms falls and fractures(cf Table 1) after their normalization to the volume of thebackground recordings The events in the test set were addedat random position thus the distance between one event andanother is not fixed

63 PROMETHEUS ThePROMETHEUS database [31] con-tains recordings of various scenarios designed to serve a widerange of real-world applications The database includes (1)a smart-room indoor home environment including phaseswhere a user is interacting with an automated speech-drivenhome assistant and (2) an outdoor public space consisting of(a) interaction of people with anATM (b) an outdoor securityscenario in which people are waiting in front of a counter and(3) an indoor office corridor scenario for security monitoringin standard indoor space These scenarios substantially differin terms of acoustic environment The indoor scenarioswere recorded under quiet acoustic conditions whereas theoutdoor recordings were conducted in an open-air publicarea and contain nonstationary backgroundnoiseThe smart-home scenario contains recordings of five professional actorsperforming five single-person and 14 multiple-person actionscripts The main activities include human-machine interac-tion with a virtual home agent and a number of alternat-ing normal and abnormal activities specifically designed to

monitor and interpret human behaviour The single-personand multiple-person actions include abnormal events suchas falls alarm followed by panic atypical vocalic reactions(pain fear and anger) or fractures Examples are walking tothe couch sitting or interacting with the smart environmentto turn the TV on open the windows or decrease thetemperatureThe scenarios were recorded three to five timesby changing the actors and their roles in the action scriptsTable 1 provides details on the number of abnormal events perscenario including average time duration

7 Experimental Set-Up

The networks were trained with the gradient steepest descentalgorithm on the sum of squared errors (SSE) In the caseof all the LSTM and BLSTM networks we used a constantvalue of learning rate 119897 = 1119890minus6 since it showed betterperformances in our previous works [35] whereas differentvalues of 119897 = 1119890minus8 1119890minus9 were used for MLP networksDifferent noise sigma values 120590 = 001 01 025were appliedto the DAE No Gaussian noise was applied to the basicAE and to the CAE following the architectures describedin Section 4 The prediction delay was applied for differentvalues 119896 = 1 2 3 4 5 10 The AEs were trained usingour open-source CUDA RecurREnt Neural Network Toolkit(CURRENNT) [54] ensuring reproducibility As evaluationmetrics we used 119865-measure in order to compare the resultswith previous works [35 36]We evaluated several topologiesfor the nonlinear predictive DAE ranging from 54-128-54to 216-216-216 and from 54-30-54 to 54-54-54 in the caseof CAE and basic AE respectively Every network topologywas evaluated for each of the 100 epochs of training Inorder to compare our results with our previous studies wekept the same optimisation procedure as applied in [3536] We employed further three state-of-the-art approachesbased on OCSVM GMM and HMM In the case ofOCSVM we trained models at different complexity values119862 = 005 001 0005 0001 00005 00001 Radial basisfunction kernels were used with different gamma values 120574 =001 0001 and we controlled the fraction of outliers in

8 Computational Intelligence and Neuroscience

Table 2 Comparison over the three databases of methods by percentage of 119865-measure (1198651) Indicated prediction (D)elay and average 1198651weighted by the of instances per database Reported approaches are GMM HMM OCSVM compression autoencoder with MLP (MLP-CAE) BLSTM (BLSTM-CAE) and LSTM (LSTM-CAE) denoising autoencoder withMLP (MLP-DAE) BLSTM (BLSTM-DAE) and LSTM(LSTM-DAE) and related versions of nonlinear predictive autoencoders NP-MLP-CAEAEDAE and NP-(B)LSTM-CAEAEDAE

MethodA3Novelty PASCAL CHiME PROMETHEUS Weighted average

ATM Corridor Outdoor Smart-roomD(119896) 1198651 D(119896) 1198651 () D(119896) 1198651 () D(119896) 1198651 () D(119896) 1198651 () D(119896) 1198651 () 1198651 ()

OCSVM mdash 918 mdash 634 mdash 602 mdash 653 mdash 573 mdash 574 733GMM mdash 894 mdash 894 mdash 502 mdash 494 mdash 564 mdash 591 787HMM mdash 882 mdash 914 mdash 520 mdash 496 mdash 560 mdash 591 789MLP-CAE 0 976 0 852 0 761 0 762 0 648 0 612 848LSTM-CAE 0 977 0 891 0 785 0 778 0 626 0 640 862BLSTM-CAE 0 987 0 913 0 784 0 784 0 621 0 637 873MLP-AE 0 972 0 850 0 761 0 770 0 651 0 618 848LSTM-AE 0 977 0 891 0 787 0 779 0 616 0 614 860BLSTM-AE 0 978 0 894 0 785 0 776 0 631 0 634 864MLP-DAE 0 973 0 873 0 775 0 785 0 658 0 646 860LSTM-DAE 0 979 0 924 0 795 0 787 0 680 0 650 881BLSTM-DAE 0 984 0 934 0 787 0 798 0 685 0 651 887NP-MLP-CAE 4 985 5 883 1 788 2 750 1 652 5 640 864NP-LSTM-CAE 5 988 1 925 1 787 2 744 2 647 3 638 877NP-BLSTM-CAE 4 992 3 928 2 783 1 752 2 657 2 632 881NP-MLP-AE 4 985 5 859 1 790 2 746 1 647 2 628 855NP-LSTM-AE 5 987 1 921 1 781 1 750 1 650 3 644 876NP-BLSTM-AE 4 992 2 941 3 784 2 756 1 656 1 636 885NP-MLP-DAE 5 989 5 888 1 816 1 775 1 670 4 643 873NP-LSTM-DAE 5 991 1 942 4 804 1 764 2 662 1 652 888NP-BLSTM-DAE 5 994 3 944 2 807 3 785 2 667 1 656 893

the training phase with different values ] = 01 001 TheOCSVM was trained via the LIBSVM library [55] In thecase of GMM models were trained at different numbers ofGaussian components 2119899 with 119899 = 1 2 8 whereasleft-right HMMs were trained with different numbers ofstates 119904 = 3 4 5 and 2119899 Gaussian components with 119899 =1 2 7 GMMs and HMMs were trained using the Torch[56] toolkit The decision values produced as output of theOCSVM and the probability estimates produced as output ofthe probabilistic models were postprocessed with a similarthresholding algorithm (cf Section 5) in order to fairlycompare the performance among the different methods Forall the experiments and settings we maintained the samefeature set

8 Results

In this section we present and comment on the resultsobtained in our evaluation across the three databases

81 A3Novelty Evaluations on the A3Novelty Corpus arereported in the second column of Table 2 In this datasetGMMs and HMMs perform similarly however they areoutperformed by the OCSVM with a maximum improve-ment of 36 absolute 119865-measure The autoencoder-basedapproaches are significantly boosting the performance up

to 987 We observe a vast absolute improvement by upto 69 against the probabilistic approaches Among thethree CAE AE and DAE we observe that compressionand denoising layouts with BLSTM units perform closely toeach other at up to 987 in the case of the BLSTM-CAEThis can be due to the fact that the dataset contains fewervariations in the background material used for training andthe feature selection operated internally by the AE increasesthe sensitivity of the reconstruction error

The nonlinear predictive results are shown in the lastpart of Table 2 We provide performance in the three namedconfigurations and with the three named unit types Inconcordance to what we found in the PASCAL database theNP-BLSTM-DAE method provided the best performance interms of 119865-measure of up to 994 A significant absoluteimprovement (one-tailed z-test [57] 119901 lt 001 (in the restof the manuscript we reported as ldquosignificantrdquo the improve-ments with at least 119901 lt 001 under the one-tailed z-test[57])) of 100 119865-measure is observed against the GMM-based approach while an absolute improvement of 76 119865-measure is exhibited with respect to the OCSVM methodWe observe an overall improvement of asymp1 between theldquoordinaryrdquo and the ldquopredictiverdquo architectures

Theperformance obtained by progressively increasing theprediction delay (119896) values (from 0 up to 10) is reported inFigure 4 We evaluated the compression autoencoder (CAE)

Computational Intelligence and Neuroscience 9

970

975

980

985

990

995

F-m

easu

re (

)

1 2 3 4 50 10

Prediction delay

NP-BLSTM-CAENP-LSTM-CAENP-MLP-CAE

NP-BLSTM-AENP-LSTM-AENP-MLP-AE

NP-BLSTM-DAENP-LSTM-DAENP-MLP-DAE

Figure 4 A3Novelty performances obtained with NP-Autoencod-ers by progressively increasing the prediction delay

the basic autoencoder (AE) and the denoising autoencoder(DAE) with MLP LSTM and BLSTM units and we applieddifferent layouts (cf Section 7) per network type Howeverfor the sake of brevity we only show the best configurationsThe best results across all the three unit types are 994and 991 119865-measure for the NP-BLSTM-DAE and NP-LSTM-DAE networks respectivelyThese are obtained with aprediction delay of 5 frames which translates into an overalldelay of 50ms In general the best performances are achievedwith 119896 = 4 or 119896 = 5 Increasing the prediction delay up to 10frames produces a heavy decrease in performance down to978 119865-measure

82 PASCAL CHiME In the first column of Table 2 wereport the performance obtained on the PASCAL datasetusing different approaches Parts of the results obtained onthis database were also presented in [36] Here we con-ducted additional experiments to evaluate OCSVM andMLPapproaches The one-class SVM shows lower performancecompared to probabilistic approaches such as GMM andHMM which seems to work reasonably well up to 914 119865-measure The OCSVM low performance can be due tothe fact that the dataset was generated artificially and theabnormal sound dynamics were normalized with respectto the ldquonormalrdquo material making the margin maximisationmore complex and less effectiveNext we evaluatedAE-basedapproaches in the three configurations compression (CAE)basic (AE) and denoising (DAE) We also evaluated MLPLSTM and BLSTM unit types Among the three configura-tions we observe that denoising ones perform better than theothers independently of the type of unit In particular thebest performance is obtained with the denoising autoencoderrealised as BLSTM RNN showing up to 934 119865-measureThe last three groups of rows in Table 2 show results of theNP approach again in the three configurations and with thethree unit types

The NP-BLSTM-DAE achieved the best result of upto 944 119865-measure Significant absolute improvements of

1 2 3 4 50 10

Prediction delay

NP-BLSTM-CAENP-LSTM-CAENP-MLP-CAE

NP-BLSTM-AENP-LSTM-AENP-MLP-AE

NP-BLSTM-DAENP-LSTM-DAENP-MLP-DAE

610

650

690

730

770

810

850

890

930

970

F-m

easu

re (

)

Figure 5 PASCAL CHiME performances obtained with NP-Autoencoders by progressively increasing the prediction delay

40 30 and 1 are observed over GMM HMM andldquoordinaryrdquo BLSTM-DAE approaches respectively

Interestingly, applying the nonlinear prediction scheme to the compression autoencoders, NP-(B)LSTM-CAE (92.8% F-measure), also increased the performance in comparison with the (B)LSTM-CAE (91.3% F-measure). In fact, in a previous work [35] the compression learning process alone showed scarce results; here, however, the CAE with the nonlinear prediction encodes information on the input more effectively.

Figure 5 depicts results for increasing values of the prediction delay (k), ranging from 0 to 10. We evaluated CAE, AE, and DAE with MLP, LSTM, and BLSTM neural networks with different layouts (cf. Section 7) per network type; however, due to space restrictions, we only report the best performances. Here, the best performances are obtained with a prediction delay of 3 frames (30 ms) for the NP-BLSTM-DAE network (94.4% F-measure) and of one frame in the case of the NP-LSTM-DAE (94.2% F-measure). As on the A3Novelty database, we observe a similar decrease in performance, down to 86.2% F-measure, when the prediction delay increases up to 10 frames, which corresponds to 100 ms. In fact, applying a higher prediction delay (e.g., 100 ms) induces higher values of the reconstruction error in the presence of fast periodic events, which subsequently leads to an increased false detection rate.

8.3. PROMETHEUS

This subsection elaborates on the results obtained on the four subsets present in the PROMETHEUS database.

8.3.1. ATM

The ATM scenario evaluations are shown in the third column of Table 2. The GMM and HMM perform similarly, at chance level: in fact, we observe an F-measure of 50.2% and 52.0% for GMMs and HMMs, respectively. The one-class SVM shows slightly better performance, of up to 60.2%. On the other hand, AE-based approaches in the three configurations, that is, compression (CAE), traditional (AE), and denoising (DAE), show a significant improvement in performance, up to 19.3% absolute F-measure against the OCSVM. Among the three configurations, we observe that the DAE performs better independently of the type of network. In particular, the best performance considering the ordinary (without nonlinear prediction) approach is obtained with the DAE with an LSTM network, leading to an F-measure of 79.5%.

Figure 6: PROMETHEUS-ATM performances obtained with NP-Autoencoders by progressively increasing the prediction delay.

The last three groups of rows in Table 2 show results of the nonlinear predictive approach (NP). The nonlinear predictive denoising autoencoder performs best, up to 81.6% F-measure. Surprisingly, the best performance is obtained using MLP units, suggesting that for long events, such as those contained in the ATM scenario (with an average duration of 60 s, cf. Table 1), memory-enhanced units such as (B)LSTM are not as effective as for shorter events.

A significant absolute improvement of 21.4% F-measure is observed against the OCSVM approach, while an absolute improvement of 31.4% F-measure is exhibited with respect to the GMM-based method. Between the two autoencoder-based approaches, namely, the "ordinary" and the "predictive" structures, we report an absolute improvement of 1.0%. It must be observed that the performances presented in [31] are higher than the ones provided in this article, since the tolerance window used in that study was set to 1 s, whereas here we aimed at a higher temporal resolution with a tolerance window of 200 ms, which is suitable also for abrupt events.
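To illustrate how the tolerance window tightens the evaluation, the snippet below sketches a simplified greedy matching of detected onsets against ground-truth onsets. This is our own illustrative scorer, not the evaluation code of [31] or of this article, and the event times are hypothetical.

    def match_onsets(detected, reference, tol_s=0.2):
        """Count detected onsets that fall within tol_s seconds of an
        unmatched reference onset (simplified greedy matching)."""
        used, hits = set(), 0
        for d in sorted(detected):
            for i, r in enumerate(sorted(reference)):
                if i not in used and abs(d - r) <= tol_s:
                    used.add(i)
                    hits += 1
                    break
        return hits

    ref = [3.1, 12.4, 20.0]             # hypothetical novel-event onsets (s)
    det = [3.2, 12.9, 25.0]
    print(match_onsets(det, ref, 0.2))  # 1 hit with a 200 ms window
    print(match_onsets(det, ref, 1.0))  # 2 hits with a 1 s window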

Figure 6 depicts performance for progressive values of the prediction delay (k), ranging from 0 to 10, applying a CAE, AE, and DAE with MLP, LSTM, and BLSTM networks. Several layouts (cf. Section 7) were evaluated per network type; however, we report only the best configurations. Setting a prediction delay of 1 frame, which corresponds to a total prediction delay of 10 ms, leads to the best performance, of up to 81.6% F-measure with the NP-MLP-DAE network. In the case of the NP-BLSTM-DAE, we observe better performance with a delay of 2 frames, up to 80.7% F-measure. In general, we do not observe a consistent trend when increasing the prediction delay, corroborating the fact that for long events, such as those contained in the ATM scenario, memory-enhanced units and a nonlinear predictive approach are not as effective as for shorter events.

8.3.2. Corridor

The evaluations on the corridor subset are shown in the fourth column of Table 2. The GMM and HMM perform similarly, at chance level: we observe an F-measure of 49.4% and 49.6% for GMM and HMM, respectively. The OCSVM shows better performance, up to 65.3%. As observed in the ATM scenario, again a significant improvement in performance, up to 16.5% absolute F-measure, is obtained using the autoencoder-based approaches in the three configurations (CAE, AE, and DAE) with respect to the OCSVM. Among the three configurations, we observe that the denoising autoencoder performs better than the others. The best performance is obtained with the denoising autoencoder with BLSTM units, of up to 79.8% F-measure.

The "predictive" approach is reported in the last three groups of rows in Table 2. Interestingly, the nonlinear predictive autoencoders do not improve the performance as we have seen in the other scenarios. A plausible explanation can be found in the nature of the novelty events present in the subset: it contains very long events, with an average duration of up to 140 s per event, and with such long events the generative model does not introduce a more sensitive reconstruction error. However, the delta in performance between the BLSTM-DAE and the NP-BLSTM-DAE is rather small (1.3% F-measure), in favour of the "ordinary" approach. The best performance (79.8%) is obtained using BLSTM units, confirming that memory-enhanced units are more effective in the presence of short events: in fact, this scenario, besides very long events, also contains short fall and pain events, with an average duration of 10 s and 30 s, respectively.

A significant absolute improvement of up to 16.5% F-measure is observed against the OCSVM approach, and the improvement is even higher with respect to the GMM and HMM.

8.3.3. Outdoor

The evaluations on the outdoor subset are shown in the fifth column of Table 2. The OCSVM, GMM, and HMM perform better in this scenario as opposed to ATM and corridor: we observe an F-measure of 57.3%, 56.4%, and 56.0% for OCSVM, GMM, and HMM, respectively. In this scenario, the improvement brought by the autoencoder is not as vast as in the previous subsets, but it is still significant: we report an absolute improvement of 11.2% F-measure between the OCSVM and the BLSTM-DAE. Again, the denoising autoencoder performs better than the other configurations; in particular, the best performance, obtained with the BLSTM-DAE, is 68.5% F-measure.

As observed in the corridor scenario, the nonlinear predictive autoencoders (last three groups of rows in Table 2) do not improve the performance. These results corroborate our previous explanation that the long duration of the novelty events present in the subset affects the sensitivity of the reconstruction error in the generative model. However, the delta in performance between the BLSTM-DAE and the NP-BLSTM-DAE is rather small (1.3% F-measure).

It must be observed that the performance in this scenario is rather low compared to the one obtained on the other datasets. We believe that the presence of anger novel sounds introduces a higher degree of complexity for our autoencoder-based approach: anger novel events may contain different levels of aroused content, which can be acoustically similar to the neutral spoken content present in the training material. Under this condition, the generative model shows a low reconstruction error. This issue could be addressed by trimming the novel events to contain only the aroused segments or, considering anger as a long-term speaker state, by increasing the temporal resolution of our system.

8.3.4. Smart-Room

The smart-room scenario evaluations are shown in the sixth column of Table 2. The OCSVM, GMM, and HMM perform better in this scenario as opposed to ATM, corridor, and outdoor: we observe an F-measure of 57.4%, 59.1%, and 59.1% for OCSVM, GMM, and HMM, respectively. In this scenario, the improvement brought about by the autoencoder is still significant: we report an absolute improvement of 6.0% F-measure between the GMM/HMM and the BLSTM-DAE. Again, the denoising autoencoder performs better than the other configurations; in particular, the best performance in the ordinary approach is obtained with the BLSTM-DAE, of up to 65.1% F-measure.

The last three groups of rows in Table 2 show results of the nonlinear predictive approach (NP). The NP-BLSTM-DAE performs best, at up to 65.6% F-measure.

As in the outdoor subset, we report a low performance in the smart-room subset as well. In fact, the subset contains several long novel events related to spoken content expressing pain and fear. As commented for the outdoor scenario, under this condition the generative model may be able to reconstruct the novel event without producing a high reconstruction error.

8.4. Overall

Overall, the experimental results proved that the DAE methods achieve superior performance compared to the CAE/AE schemes. This is due to the combination of the two learning processes of a denoising autoencoder: encoding the input while preserving the information about the input itself, and simultaneously reversing the effect of a corruption process applied to the input of the autoencoder.
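The corruption process mentioned above can be made concrete with a short sketch: clean feature frames are perturbed by additive Gaussian noise (the stochastic mapping x' ~ N(x, σ²I) used for the DAE in this work), and the network is trained to map the noisy frame back to the clean one. The array shape and σ value below are illustrative; σ = 0.1 is one of the values listed in Section 7.

    import numpy as np

    def corrupt(x: np.ndarray, sigma: float = 0.1) -> np.ndarray:
        """Stochastic corruption x' ~ N(x, sigma^2 * I): additive
        Gaussian noise on each standardized feature coefficient."""
        return x + np.random.normal(0.0, sigma, size=x.shape)

    # Denoising training pair: the network maps corrupt(x) back to x,
    # so the loss is computed against the clean frame.
    clean = np.random.randn(54)          # one stand-in ASF frame
    noisy = corrupt(clean, sigma=0.1)
    reconstruction_target = clean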

In particular, the predictive approach with (B)LSTM units showed the best performance, of up to 89.3% average F-measure over all six evaluation sets, weighted by the number of instances per database (cf. Table 2).
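For clarity, this weighted average is simply the instance-weighted mean of the per-set F-measures. A minimal sketch follows; the instance counts below are placeholders for illustration, not the actual counts behind Table 2.

    def weighted_f_measure(f1_scores, instance_counts):
        """Average F-measures across datasets, weighting each dataset
        by its number of test instances."""
        total = sum(instance_counts)
        return sum(f * n for f, n in zip(f1_scores, instance_counts)) / total

    f1 = [0.994, 0.944, 0.807, 0.785, 0.667, 0.656]  # NP-BLSTM-DAE row
    n = [1800, 1400, 250, 320, 540, 340]             # placeholder counts
    print(round(weighted_f_measure(f1, n), 3))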

To better understand the improvement brought by the RNN-based approaches, we provide in Figure 7 the comparison between state-of-the-art methods in terms of weighted average F-measure computed across the A3Novelty Corpus, PASCAL CHiME, and PROMETHEUS. In general, we observe that the recently proposed NP-BLSTM-DAE method provided the best performance in terms of average F-measure, of up to 89.3%. A significant absolute improvement of 16.0% average F-measure is observed against the OCSVM approach, while absolute improvements of 10.6% and 10.4% average F-measure are exhibited with respect to the GMM- and HMM-based methods, respectively. An absolute improvement of 0.6% is observed over the "ordinary" BLSTM-DAE. It has to be noted that the average F-measure is computed including the PROMETHEUS database, for which the performance has been shown to be lower because it contains long-term events and a lower resolution in the labels (1 s).

Figure 7: Average F-measure computed over the A3Novelty Corpus, PASCAL, and PROMETHEUS, weighted by the number of instances per database. Extended comparison of methods (values correspond to the weighted-average column of Table 2).

The RNN-based schemes also bring an evident benefit when applied to the "normal" autoencoders (i.e., with no denoising or compression): the NP-BLSTM-AE achieves an F-measure of 88.5%. Furthermore, when we applied the nonlinear prediction scheme to a denoising autoencoder, the performance achieved with LSTM units was in this case comparable with BLSTM units and also outperformed state-of-the-art approaches.

In conclusion, the combination of the nonlinear prediction paradigm and the various (B)LSTM autoencoders proved to be effective, significantly outperforming other state-of-the-art methods. Additionally, there is evidence that memory-enhanced units such as LSTM and BLSTM outperform MLPs without memory, showing that knowledge of the temporal context can improve the novelty detector's abilities.

9. Conclusions and Outlook

We presented a broad and extensive evaluation of state-of-the-art methods, with a particular focus on novel and recent unsupervised approaches based on RNN-based autoencoders. We significantly extended the studies conducted in [35, 36] by evaluating further approaches, such as one-class support vector machines (OCSVMs) and multilayer perceptrons (MLPs), and, most importantly, we conducted a broad evaluation on three different datasets for a total number of 160,153 experiments, making this article the first to present such a complete evaluation in the field of acoustic novelty detection. We clearly show that RNN-based autoencoders significantly outperform other methods by achieving up to 89.3% weighted average F-measure on the three databases, with a significant absolute improvement of 10.4% against the best performance obtained with statistical approaches (HMM). Overall, a significant increase in performance was achieved by combining the (B)LSTM autoencoder-based architecture with the nonlinear prediction scheme.

Future work will focus on using multiresolution features [51, 58], likely more suitable to deal with different event durations, in order to face the issues encountered on the PROMETHEUS database. Further studies will be conducted on other architectures of RNN-based autoencoders, ranging from dynamic Bayesian networks [59] to convolutional neural networks [60].

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007–2013) under Grant Agreement no. 338164 (ERC Starting Grant iHEARu), no. 645378 (RIA ARIA-VALUSPA), and no. 688835 (RIA DE-ENIGMA).

References

[1] L. Tarassenko, P. Hayton, N. Cerneaz, and M. Brady, "Novelty detection for the identification of masses in mammograms," in Proceedings of the 4th International Conference on Artificial Neural Networks, pp. 442–447, June 1995.

[2] J. Quinn and C. Williams, "Known unknowns: novelty detection in condition monitoring," in Pattern Recognition and Image Analysis, J. Martí, J. Benedí, A. Mendonça, and J. Serrat, Eds., vol. 4477 of Lecture Notes in Computer Science, pp. 1–6, Springer, Berlin, Germany, 2007.

[3] L. Clifton, D. A. Clifton, P. J. Watkinson, and L. Tarassenko, "Identification of patient deterioration in vital-sign data using one-class support vector machines," in Proceedings of the Federated Conference on Computer Science and Information Systems (FedCSIS '11), pp. 125–131, September 2011.

[4] Y.-H. Liu, Y.-C. Liu, and Y.-J. Chen, "Fast support vector data descriptions for novelty detection," IEEE Transactions on Neural Networks, vol. 21, no. 8, pp. 1296–1313, 2010.

[5] C. Surace and K. Worden, "Novelty detection in a changing environment: a negative selection approach," Mechanical Systems and Signal Processing, vol. 24, no. 4, pp. 1114–1128, 2010.

[6] J. A. Quinn, C. K. I. Williams, and N. McIntosh, "Factorial switching linear dynamical systems applied to physiological condition monitoring," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 9, pp. 1537–1551, 2009.

[7] A. Patcha and J.-M. Park, "An overview of anomaly detection techniques: existing solutions and latest technological trends," Computer Networks, vol. 51, no. 12, pp. 3448–3470, 2007.

[8] M. Markou and S. Singh, "A neural network-based novelty detector for image sequence analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1664–1677, 2006.

[9] M. Markou and S. Singh, "Novelty detection: a review – part 1: statistical approaches," Signal Processing, vol. 83, no. 12, pp. 2481–2497, 2003.

[10] M. Markou and S. Singh, "Novelty detection: a review – part 2: neural network based approaches," Signal Processing, vol. 83, no. 12, pp. 2499–2521, 2003.

[11] E. R. de Faria, I. R. Goncalves, J. Gama, and A. C. Carvalho, "Evaluation of multiclass novelty detection algorithms for data streams," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 11, pp. 2961–2973, 2015.

[12] C. Satheesh Chandran, S. Kamal, A. Mujeeb, and M. H. Supriya, "Novel class detection of underwater targets using self-organizing neural networks," in Proceedings of the IEEE Underwater Technology (UT '15), Chennai, India, February 2015.

[13] K. Worden, G. Manson, and D. Allman, "Experimental validation of a structural health monitoring methodology: part I. Novelty detection on a laboratory structure," Journal of Sound and Vibration, vol. 259, no. 2, pp. 323–343, 2003.

[14] J. Foote, "Automatic audio segmentation using a measure of audio novelty," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '00), vol. 1, pp. 452–455, August 2000.

[15] B. Scholkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Proceedings of Neural Information Processing Systems (NIPS '99), vol. 12, pp. 582–588, Denver, Colo, USA, 1999.

[16] J. Ma and S. Perkins, "Time-series novelty detection using one-class support vector machines," in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '03), vol. 3, pp. 1741–1745, Portland, Ore, USA, July 2003.

[17] J. Ma and S. Perkins, "Online novelty detection on temporal sequences," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), pp. 613–618, ACM, Washington, DC, USA, August 2003.

[18] P. Hayton, B. Scholkopf, L. Tarassenko, and P. Anuzis, "Support vector novelty detection applied to jet engine vibration spectra," in Proceedings of the 14th Annual Neural Information Processing Systems Conference (NIPS '00), pp. 946–952, December 2000.

[19] L. Tarassenko, A. Nairac, N. Townsend, and P. Cowley, "Novelty detection in jet engines," in Proceedings of the IEE Colloquium on Condition Monitoring: Machinery, External Structures and Health, pp. 4/1–4/5, Birmingham, UK, April 1999.

[20] L. Clifton, D. A. Clifton, Y. Zhang, P. Watkinson, L. Tarassenko, and H. Yin, "Probabilistic novelty detection with support vector machines," IEEE Transactions on Reliability, vol. 63, no. 2, pp. 455–467, 2014.

[21] D. R. Hardoon and L. M. Manevitz, "One-class machine learning approach for fMRI analysis," in Proceedings of the Postgraduate Research Conference in Electronics, Photonics, Communications and Networks, and Computer Science, Lancaster, UK, April 2005.

[22] M. Davy, F. Desobry, A. Gretton, and C. Doncarli, "An online support vector machine for abnormal events detection," Signal Processing, vol. 86, no. 8, pp. 2009–2025, 2006.

[23] M. A. F. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko, "A review of novelty detection," Signal Processing, vol. 99, pp. 215–249, 2014.

[24] N. Japkowicz, C. Myers, M. Gluck et al., "A novelty detection approach to classification," in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI '95), vol. 1, pp. 518–523, Montreal, Canada, 1995.

[25] B. B. Thompson, R. J. Marks, J. J. Choi, M. A. El-Sharkawi, M.-Y. Huang, and C. Bunje, "Implicit learning in autoencoder novelty assessment," in Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN '02), vol. 3, pp. 2878–2883, IEEE, Honolulu, Hawaii, USA, 2002.

[26] S. Hawkins, H. He, G. Williams, and R. Baxter, "Outlier detection using replicator neural networks," in Data Warehousing and Knowledge Discovery, pp. 170–180, Springer, Berlin, Germany, 2002.

[27] N. Japkowicz, "Supervised versus unsupervised binary-learning by feedforward neural networks," Machine Learning, vol. 42, no. 1-2, pp. 97–122, 2001.

[28] L. Manevitz and M. Yousef, "One-class document classification via neural networks," Neurocomputing, vol. 70, no. 7–9, pp. 1466–1481, 2007.

[29] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, IEEE Computer Society, December 2002.

[30] H. Sohn, K. Worden, and C. R. Farrar, "Statistical damage classification under changing environmental and operational conditions," Journal of Intelligent Material Systems and Structures, vol. 13, no. 9, pp. 561–574, 2002.

[31] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "Probabilistic novelty detection for acoustic surveillance under real-world conditions," IEEE Transactions on Multimedia, vol. 13, no. 4, pp. 713–719, 2011.

[32] E. Principi, S. Squartini, R. Bonfigli, G. Ferroni, and F. Piazza, "An integrated system for voice command recognition and emergency detection based on audio signals," Expert Systems with Applications, vol. 42, no. 13, pp. 5668–5683, 2015.

[33] A. Ito, A. Aiba, M. Ito, and S. Makino, "Evaluation of abnormal sound detection using multi-stage GMM in various environments," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH '11), pp. 301–304, Florence, Italy, August 2011.

[34] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "On acoustic surveillance of hazardous situations," in Proceedings of the 34th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 165–168, April 2009.

[35] E. Marchi, F. Vesperini, F. Eyben, S. Squartini, and B. Schuller, "A novel approach for automatic acoustic novelty detection using a denoising autoencoder with bidirectional LSTM neural networks," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), 5 pages, IEEE, Brisbane, Australia, April 2015.

[36] E. Marchi, F. Vesperini, F. Weninger, F. Eyben, S. Squartini, and B. Schuller, "Non-linear prediction with LSTM recurrent neural networks for acoustic novelty detection," in Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN '15), p. 5, IEEE, Killarney, Ireland, July 2015.

[37] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, "Learning precise timing with LSTM recurrent networks," The Journal of Machine Learning Research, vol. 3, no. 1, pp. 115–143, 2003.

[38] A. Graves, "Generating sequences with recurrent neural networks," https://arxiv.org/abs/1308.0850.

[39] D. Eck and J. Schmidhuber, "Finding temporal structure in music: blues improvisation with LSTM recurrent networks," in Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pp. 747–756, IEEE, Martigny, Switzerland, September 2002.

[40] J. A. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," Tech. Rep. 97-021, International Computer Science Institute, Berkeley, Calif, USA, 1998.

[41] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[42] B. Scholkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Advances in Neural Information Processing Systems 12, S. Solla, T. Leen, and K. Muller, Eds., pp. 582–588, MIT Press, 2000.

[43] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, 1986.

[44] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[45] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[46] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5-6, pp. 602–610, 2005.

[47] G. Hinton, L. Deng, D. Yu et al., "Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.

[48] I. Goodfellow, H. Lee, Q. V. Le, A. Saxe, and A. Y. Ng, "Measuring invariances in deep networks," in Advances in Neural Information Processing Systems, Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, Eds., vol. 22, pp. 646–654, Curran & Associates Inc., Red Hook, NY, USA, 2009.

[49] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems 19, B. Scholkopf, J. Platt, and T. Hoffman, Eds., pp. 153–160, MIT Press, 2007.

[50] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.

[51] E. Marchi, G. Ferroni, F. Eyben, L. Gabrielli, S. Squartini, and B. Schuller, "Multi-resolution linear prediction based features for audio onset detection with bidirectional LSTM neural networks," in Proceedings of the 39th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 2183–2187, IEEE, Florence, Italy, May 2014.

[52] F. Eyben, F. Weninger, F. Gross, and B. Schuller, "Recent developments in openSMILE, the Munich open-source multimedia feature extractor," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 835–838, ACM, Barcelona, Spain, October 2013.

[53] J. Barker, E. Vincent, N. Ma, H. Christensen, and P. Green, "The PASCAL CHiME speech separation and recognition challenge," Computer Speech & Language, vol. 27, no. 3, pp. 621–633, 2013.

[54] F. Weninger, J. Bergmann, and B. Schuller, "Introducing CURRENNT: the Munich open-source CUDA RecurREnt Neural Network Toolkit," Journal of Machine Learning Research, vol. 16, pp. 547–551, 2015.

[55] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, article 27, 2011.

[56] R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: a Matlab-like environment for machine learning," in BigLearn, NIPS Workshop, no. EPFL-CONF-192376, 2011.

[57] M. D. Smucker, J. Allan, and B. Carterette, "A comparison of statistical significance tests for information retrieval evaluation," in Proceedings of the 16th ACM Conference on Conference on Information and Knowledge Management (CIKM '07), pp. 623–632, ACM, Lisbon, Portugal, November 2007.

[58] E. Marchi, G. Ferroni, F. Eyben, S. Squartini, and B. Schuller, "Audio onset detection: a wavelet packet based approach with recurrent neural networks," in Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN '14), pp. 3585–3591, Beijing, China, July 2014.

[59] N. Friedman, K. Murphy, and S. Russell, "Learning the structure of dynamic probabilistic networks," in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI '98), pp. 139–147, Morgan Kaufmann, Madison, Wisc, USA, July 1998.

[60] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, "Face recognition: a convolutional neural-network approach," IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98–113, 1997.




)

787

873

733

862848

789

860848

887 881864

860

881 877864 855

885876

893888873

Figure 7 Average 119865-measure computed over A3Novelty CorpusPASCAL and PROMETHEUS weighted by the of instance perdatabase Extended comparison of methods

noted that the average 119865-measure is computed includingthe PROMETHEUS database for which the performance hasbeen shown to be lower because it contains long-term eventsand a lower resolution in the labels (1 s)

The RNN-based schemes also bring an evident bene-fit when applied to the ldquonormalrdquo autoencoders (ie withno denoising or compression) in fact the NP-BLSTM-AEachieves an 119865-measure of 885 Furthermore when weapplied the nonlinear prediction scheme to a denoisingautoencoder the performance achieved with LSTM was inthis case comparable with BLSTM units and also outper-formed state-of-the-art approaches

In conclusion the combination of the nonlinear pre-diction paradigm and the various (B)LSTM autoencodersproved to be effective outperforming significantly otherstate-of-the-art methods Additionally there is evidence thatmemory-enhanced units such as LSTM and BLSTM outper-formed MLP without memory showing that the knowledgeof the temporal context can improve the novelty detectorabilities

9 Conclusions and Outlook

We presented a broad and extensive evaluation of state-of-the-art methods with a particular focus on novel and recentunsupervised approaches based on RNN-based autoen-coders We significantly extended the studies conductedin [35 36] by evaluating further approaches such as one-class support vector machines (OCSVMs) and multilayerperceptron (MLP) and most importantly we conducted abroad evaluation on three different datasets for a total numberof 160153 experiments making this article the first to presentsuch a complete evaluation in the field of acoustic noveltydetection We show evidently that RNN-based autoencoderssignificantly outperform other methods by achieving up to893 weighted average 119865-measure on the three databaseswith a significant absolute improvement of 104 againstthe best performance obtained with statistical approaches(HMM) Overall a significant increase in performance wasachieved by combining the (B)LSTM autoencoder-basedarchitecture with the nonlinear prediction scheme

Future works will focus on using multiresolution features[51 58] likely more suitable to deal with different eventdurations in order to face the issues encountered in the

12 Computational Intelligence and Neuroscience

PROMETHEUS database Further studies will be conductedon other architectures of RNN-based autoencoders rangingfrom dynamic Bayesian networks [59] to convolutional neu-ral networks [60]

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The research leading to these results has received fund-ing from the European Communityrsquos Seventh FrameworkProgramme (FP72007ndash2013) under Grant Agreement no338164 (ERC StartingGrant iHEARu) no 645378 (RIAARIAVALUSPA) and no 688835 (RIA DE-ENIGMA)

References

[1] L Tarassenko P Hayton N Cerneaz and M Brady ldquoNoveltydetection for the identification of masses in mammogramsrdquoin Proceedings of the 4th International Conference on ArtificialNeural Networks pp 442ndash447 June 1995

[2] J Quinn and C Williams ldquoKnown unknowns novelty detec-tion in conditionmonitoringrdquo in Pattern Recognition and ImageAnalysis J Mart J Bened A Mendona and J Serrat Eds vol4477 of Lecture Notes in Computer Science pp 1ndash6 SpringerBerlin Germany 2007

[3] L Clifton D A Clifton P J Watkinson and L TarassenkoldquoIdentification of patient deterioration in vital-sign data usingone-class support vector machinesrdquo in Proceedings of the Feder-ated Conference on Computer Science and Information Systems(FedCSIS rsquo11) pp 125ndash131 September 2011

[4] Y-H Liu Y-C Liu and Y-J Chen ldquoFast support vector datadescriptions for novelty detectionrdquo IEEETransactions onNeuralNetworks vol 21 no 8 pp 1296ndash1313 2010

[5] C Surace and K Worden ldquoNovelty detection in a changingenvironment a negative selection approachrdquo Mechanical Sys-tems and Signal Processing vol 24 no 4 pp 1114ndash1128 2010

[6] J A Quinn C K I Williams and N McIntosh ldquoFactorialswitching linear dynamical systems applied to physiologicalcondition monitoringrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 31 no 9 pp 1537ndash1551 2009

[7] A Patcha and J-M Park ldquoAn overview of anomaly detectiontechniques existing solutions and latest technological trendsrdquoComputer Networks vol 51 no 12 pp 3448ndash3470 2007

[8] M Markou and S Singh ldquoA neural network-based noveltydetector for image sequence analysisrdquo IEEE Transactions onPattern Analysis and Machine Intelligence vol 28 no 10 pp1664ndash1677 2006

[9] M Markou and S Singh ldquoNovelty detection a reviewmdashpart1 statistical approachesrdquo Signal Processing vol 83 no 12 pp2481ndash2497 2003

[10] M Markou and S Singh ldquoNovelty detection a reviewmdashpart 2neural network based approachesrdquo Signal Processing vol 83 no12 pp 2499ndash2521 2003

[11] E R de Faria I R Goncalves J a Gama and A C CarvalholdquoEvaluation of multiclass novelty detection algorithms for datastreamsrdquo IEEE Transactions on Knowledge and Data Engineer-ing vol 27 no 11 pp 2961ndash2973 2015

[12] C Satheesh Chandran S Kamal A Mujeeb and M HSupriya ldquoNovel class detection of underwater targets usingself-organizing neural networksrdquo in Proceedings of the IEEEUnderwater Technology (UT rsquo15) Chennai India February 2015

[13] K Worden G Manson and D Allman ldquoExperimental vali-dation of a structural health monitoring methodology part INovelty detection on a laboratory structurerdquo Journal of Soundand Vibration vol 259 no 2 pp 323ndash343 2003

[14] J Foote ldquoAutomatic audio segmentation using a measure ofaudio noveltyrdquo in Proceedings of the IEEE Internatinal Confer-ence on Multimedia and Expo (ICME rsquo00) vol 1 pp 452ndash455August 2000

[15] B Scholkopf R C Williamson A J Smola J Shawe-Taylorand J C Platt ldquoSupport vectormethod for novelty detectionrdquo inProceedings of the Neural Information Processing Systems (NIPSrsquo99) vol 12 pp 582ndash588 Denver Colo USA 1999

[16] J Ma and S Perkins ldquoTime-series novelty detection usingone-class support vector machinesrdquo in Proceedings of the IEEEInternational Joint Conference on Neural Networks (IJCNN rsquo03)vol 3 pp 1741ndash1745 Portland Ore USA July 2003

[17] J Ma and S Perkins ldquoOnline novelty detection on temporalsequencesrdquo in Proceedings of the 9th ACM SIGKDD Interna-tional Conference on Knowledge Discovery and Data Mining(KDD rsquo03) pp 613ndash618 ACM Washington DC USA August2003

[18] P Hayton B Scholkopf L Tarassenko and P Anuzis ldquoSupportvector novelty detection applied to jet engine vibration spectrardquoin Proceedings of the 14th Annual Neural Information ProcessingSystems Conference (NIPS rsquo00) pp 946ndash952 December 2000

[19] L Tarassenko A Nairac N Townsend and P Cowley ldquoNoveltydetection in jet enginesrdquo in Proceedings of the IEE Colloquiumon Condition Monitoring Machinery External Structures andHealth pp 41ndash45 Birmingham UK April 1999

[20] L Clifton D A Clifton Y Zhang P Watkinson L TarassenkoandH Yin ldquoProbabilistic novelty detection with support vectormachinesrdquo IEEE Transactions on Reliability vol 63 no 2 pp455ndash467 2014

[21] D R Hardoon and L M Manevitz ldquoOne-class machinelearning approach for fmri analysisrdquo in Proceedings of the Post-graduate Research Conference in Electronics Photonics Commu-nications and Networks and Computer Science Lancaster UKApril 2005

[22] M Davy F Desobry A Gretton and C Doncarli ldquoAn onlinesupport vector machine for abnormal events detectionrdquo SignalProcessing vol 86 no 8 pp 2009ndash2025 2006

[23] M A F Pimentel D A Clifton L Clifton and L TarassenkoldquoA review of novelty detectionrdquo Signal Processing vol 99 pp215ndash249 2014

[24] N Japkowicz C Myers M Gluck et al ldquoA novelty detectionapproach to classificationrdquo in Proceedings of the 14th interna-tional joint conference on Artificial intelligence (IJCAI rsquo95) vol1 pp 518ndash523 Montreal Canada 1995

[25] B B Thompson R J Marks J J Choi M A El-SharkawiM-Y Huang and C Bunje ldquoImplicit learning in autoencodernovelty assessmentrdquo in Proceedings of the 2002 InternationalJoint Conference on Neural Networks (IJCNN rsquo02) vol 3 pp2878ndash2883 IEEE Honolulu Hawaii USA 2002

[26] S Hawkins H He G Williams and R Baxter ldquoOutlier detec-tion using replicator neural networksrdquo inDataWarehousing andKnowledge Discovery pp 170ndash180 Springer Berlin Germany2002

Computational Intelligence and Neuroscience 13

[27] N Japkowicz ldquoSupervised versus unsupervised binary-learningby feedforward neural networksrdquoMachine Learning vol 42 no1-2 pp 97ndash122 2001

[28] L Manevitz and M Yousef ldquoOne-class document classificationvia Neural Networksrdquo Neurocomputing vol 70 no 7ndash9 pp1466ndash1481 2007

[29] G Williams R Baxter H He S Hawkins and L Gu ldquoAcomparative study of RNN for outlier detection in dataminingrdquoin Proceedings of the 2nd IEEE International Conference onData Mining (ICDM rsquo02) pp 709ndash712 IEEE Computer SocietyDecember 2002

[30] H Sohn K Worden and C R Farrar ldquoStatistical damageclassification under changing environmental and operationalconditionsrdquo Journal of Intelligent Material Systems and Struc-tures vol 13 no 9 pp 561ndash574 2002

[31] S Ntalampiras I Potamitis and N Fakotakis ldquoProbabilisticnovelty detection for acoustic surveillance under real-worldconditionsrdquo IEEE Transactions onMultimedia vol 13 no 4 pp713ndash719 2011

[32] E Principi S Squartini R Bonfigli G Ferroni and F PiazzaldquoAn integrated system for voice command recognition andemergency detection based on audio signalsrdquo Expert Systemswith Applications vol 42 no 13 pp 5668ndash5683 2015

[33] A Ito A Aiba M Ito and S Makino ldquoEvaluation of abnormalsound detection using multi-stage GMM in various environ-mentsrdquo in Proceedings of the 12th Annual Conference of the Inter-national Speech Communication Association (INTERSPEECHrsquo11) pp 301ndash304 Florence Italy August 2011

[34] S Ntalampiras I Potamitis and N Fakotakis ldquoOn acousticsurveillance of hazardous situationsrdquo in Proceedings of the 34thIEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo09) pp 165ndash168 April 2009

[35] EMarchi F Vesperini F Eyben S Squartini andB Schuller ldquoAnovel approach for automatic acoustic novelty detection usinga denoising autoencoder with bidirectional LSTM neural net-worksrdquo in Proceedings of the 40th IEEE International ConferenceonAcoustics Speech and Signal Processing (ICASSP rsquo15) 5 pagesIEEE Brisbane Australia April 2015

[36] E Marchi F Vesperini F Weninger F Eyben S Squartini andB Schuller ldquoNon-linear prediction with LSTM recurrent neuralnetworks for acoustic novelty detectionrdquo in Proceedings of the2015 International Joint Conference on Neural Networks (IJCNNrsquo15) p 5 IEEE Killarney Ireland July 2015

[37] F A Gers N N Schraudolph and J Schmidhuber ldquoLearningprecise timing with LSTM recurrent networksrdquo The Journal ofMachine Learning Research vol 3 no 1 pp 115ndash143 2003

[38] A Graves ldquoGenerating sequences with recurrent neural net-worksrdquo httpsarxivorgabs13080850

[39] D Eck and J Schmidhuber ldquoFinding temporal structure inmusic blues improvisation with lstm recurrent networksrdquo inProceedings of the 12th IEEE Workshop on Neural Networks forSignal Processing pp 747ndash756 IEEE Martigny SwitzerlandSeptember 2002

[40] J A Bilmes ldquoA gentle tutorial of the EM algorithm andits application to parameter estimation for gaussian mixtureand hidden Markov modelsrdquo Tech Rep 97-021 InternationalComputer Science Institute Berkeley Calif USA 1998

[41] L R Rabiner ldquoTutorial on hiddenMarkov models and selectedapplications in speech recognitionrdquo Proceedings of the IEEE vol77 no 2 pp 257ndash286 1989

[42] B Scholkopf R C Williamson A J Smola J Shawe-Taylorand J C Platt ldquoSupport vectormethod for novelty detectionrdquo in

Advances in Neural Information Processing Systems 12 S SollaT Leen and K Muller Eds pp 582ndash588 MIT Press 2000

[43] D E Rumelhart G E Hinton and R J Williams ldquoLearningrepresentations by back-propagating errorsrdquo Nature vol 323no 6088 pp 533ndash536 1986

[44] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquoNeural Computation vol 9 no 8 pp 1735ndash1780 1997

[45] M Schuster and K K Paliwal ldquoBidirectional recurrent neuralnetworksrdquo IEEE Transactions on Signal Processing vol 45 no11 pp 2673ndash2681 1997

[46] A Graves and J Schmidhuber ldquoFramewise phoneme classi-fication with bidirectional LSTM and other neural networkarchitecturesrdquo Neural Networks vol 18 no 5-6 pp 602ndash6102005

[47] G Hinton L Deng D Yu et al ldquoDeep neural networks foracoustic modeling in speech recognition the shared views offour research groupsrdquo IEEE Signal Processing Magazine vol 29no 6 pp 82ndash97 2012

[48] I Goodfellow H Lee Q V Le A Saxe and A Y Ng ldquoMea-suring invariances in deep networksrdquo in Advances in NeuralInformation Processing Systems Y Bengio D Schuurmans JLafferty CWilliams and A Culotta Eds vol 22 pp 646ndash654Curran amp Associates Inc Red Hook NY USA 2009

[49] Y Bengio P Lamblin D Popovici and H Larochelle ldquoGreedylayer-wise training of deep networksrdquo in Advances in NeuralInformation Processing Systems 19 B Scholkopf J Platt and THoffman Eds pp 153ndash160 MIT Press 2007

[50] P Vincent H Larochelle I Lajoie Y Bengio and P-AManzagol ldquoStacked denoising autoencoders learning usefulrepresentations in a deep network with a local denoisingcriterionrdquo Journal of Machine Learning Research vol 11 pp3371ndash3408 2010

[51] E Marchi G Ferroni F Eyben L Gabrielli S Squartini andB Schuller ldquoMulti-resolution linear prediction based featuresfor audio onset detection with bidirectional LSTM neural net-worksrdquo in Proceedings of the 39th IEEE International Conferenceon Acoustics Speech and Signal Processing (ICASSP rsquo14) pp2183ndash2187 IEEE Florence Italy May 2014

[52] F Eyben F Weninger F Gross and B Schuller ldquoRecentdevelopments in openSMILE the munich open-source mul-timedia feature extractorrdquo in Proceedings of the 21st ACMInternational Conference on Multimedia (MM rsquo13) pp 835ndash838ACM Barcelona Spain October 2013

[53] J Barker E Vincent N Ma H Christensen and P Green ldquoThepascal chime speech separation and recognition challengerdquoComputer Speech amp Language vol 27 no 3 pp 621ndash633 2013

[54] F Weninger J Bergmann and B Schuller ldquoIntroducing CUR-RENNT the Munich open-source CUDA RecurREnt neuralnetwork toolkitrdquo Journal of Machine Learning Research vol 16pp 547ndash551 2015

[55] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[56] R Collobert K Kavukcuoglu and C Farabet ldquoTorch7 amatlab-like environment for machine learningrdquo BigLearnNIPS Workshop no EPFL-CONF-192376 2011

[57] MD Smucker J Allan and B Carterette ldquoA comparison of sta-tistical significance tests for information retrieval evaluationrdquoin Proceedings of the 16th ACM Conference on Conference onInformation and Knowledge Management (CIKM rsquo07) pp 623ndash632 ACM Lisbon Portugal November 2007

14 Computational Intelligence and Neuroscience

[58] E Marchi G Ferroni F Eyben S Squartini and B SchullerldquoAudio onset detection a wavelet packet based approach withrecurrent neural networksrdquo in Proceedings of the 2014 Inter-national Joint Conference on Neural Networks (IJCNN rsquo14) pp3585ndash3591 Beijing China July 2014

[59] N Friedman KMurphy and S Russell ldquoLearning the structureof dynamic probabilistic networksrdquo in Proceedings of the 14thConference on Uncertainty in Artificial Intelligence (UAI rsquo98) pp139ndash147 Morgan Kaufmann Madison Wisc USA July 1998

[60] S Lawrence C L Giles A C Tsoi and A D Back ldquoFacerecognition a convolutional neural-network approachrdquo IEEETransactions on Neural Networks vol 8 no 1 pp 98ndash113 1997

Submit your manuscripts athttpswwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 6: Deep Recurrent Neural Network-Based Autoencoders for ...downloads.hindawi.com/journals/cin/2017/4694860.pdf · ResearchArticle Deep Recurrent Neural Network-Based Autoencoders for

6 Computational Intelligence and Neuroscience

02468

Freq

uenc

y (H

z)

5 10 15 20 25

Time (s)

times103

0

1

2

123

LRe

cons

truc

tion

erro

rab

el

500 1000 1500 2000 2500

Frames

500 1000 1500 2000 2500

Frames(a)

02468

Freq

uenc

y (H

z)

5 10 15 20 25

Time (s)

times103

0

1

2

Labe

l

123

Reco

nstr

uctio

ner

ror

500 1000 1500 2000 2500

Frames

500 1000 1500 2000 2500

Frames(b)

02468

Freq

uenc

y (H

z)

5 10 15 20 25

Time (s)

times103

0

1

2

Labe

l

123

Reco

nstr

uctio

ner

ror

500 1000 1500 2000 2500

Frames

500 1000 1500 2000 2500

Frames(c)

Figure 3: Spectrogram of a 30-second sequence from (a) the A3Novelty Corpus containing a novel siren event, (b) the PASCAL CHiME database containing three novel events (a siren and two screams), and (c) the PROMETHEUS database containing a novel event composed of an alarm and people screaming together (top); reconstruction error signal of the related sequence obtained with a BLSTM-DAE (middle); ground-truth binary novelty signal, novel (1) and not-novel (0) (bottom).

The Mel spectrograms $M_{30}(n,m)$ ($n$ being the frame index and $m$ the frequency bin index) are calculated by converting the power spectrogram to the Mel-frequency scale using a filter-bank with 26 triangular filters. A logarithmic scaling is chosen to match the human perception of loudness:

$$ M_{30}^{\log}(n,m) = \log\left(M_{30}(n,m) + 1.0\right) \quad (14) $$

In addition, the positive first-order differences $D_{30}(n,m)$ are calculated from each Mel spectrogram as follows:

$$ D_{30}(n,m) = M_{30}^{\log}(n,m) - M_{30}^{\log}(n-1,m) \quad (15) $$

Furthermore, the frame energy and its derivative are also included as features, leading to a total of 54 coefficients. For better reproducibility, the feature extraction process is computed with our open-source audio analysis toolkit openSMILE [52].
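As a worked illustration of this front-end, the sketch below recomputes the 54 coefficients with librosa instead of openSMILE; the 30 ms window, 10 ms hop, the frame-energy definition, and the half-wave rectification implied by "positive" differences are our assumptions, not details confirmed by the article:

```python
import numpy as np
import librosa  # illustration only; the article extracts features with openSMILE

def extract_features(wav_path, sr=16000, n_mels=26, win_s=0.03, hop_s=0.01):
    """Sketch of the 54-dimensional front-end: 26 log Mel bands (Eq. 14),
    their positive first-order differences (Eq. 15), the frame energy,
    and its difference. Window/hop and energy definition are assumptions."""
    y, _ = librosa.load(wav_path, sr=sr)
    n_fft, hop = int(win_s * sr), int(hop_s * sr)
    # Mel power spectrogram M_30(n, m) with 26 triangular filters
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    mel_log = np.log(mel + 1.0)                               # Eq. (14)
    diff = np.diff(mel_log, axis=1, prepend=mel_log[:, :1])
    diff = np.maximum(diff, 0.0)                              # Eq. (15), positive part
    energy = np.log(mel.sum(axis=0, keepdims=True) + 1.0)    # frame energy (assumption)
    d_energy = np.diff(energy, axis=1, prepend=energy[:, :1])
    return np.vstack([mel_log, diff, energy, d_energy]).T     # shape: (frames, 54)
```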

6. Databases

This section describes the three databases evaluated in our experiments: A3Novelty, PASCAL CHiME, and PROMETHEUS.

6.1. A3Novelty. The A3Novelty Corpus (http://www.a3lab.dii.univpm.it/research/a3novelty) includes around 56 hours of recordings acquired in a laboratory of the Università Politecnica delle Marche. These recordings were performed during different day and night hours, so very different acoustic conditions are available. A variety of novel events were randomly played back by a speaker (e.g., scream, fall, alarm, or breakage of objects) during the recordings.

Eight microphones were used in the recording room for the acquisitions: four Behringer B-5 microphones with cardioid pattern and an array of four AKG C400BL microphones spaced by 4 cm. A MOTU 8pre sound card and the NU-Tech software were utilised to record the microphone signals. The sampling rate was equal to 48 kHz.

The abnormal event sounds (cf. Table 1) can be grouped into four categories, and they are freely available for download from http://www.freesound.org:

(i) Sirens: three different types of sirens or alarm sounds.
(ii) Falls: two occurrences of a person or an object falling to the ground.
(iii) Breakage of objects: noise produced by the breakage of an object after the impact with the ground.
(iv) Screams: four different human screams, produced both by a single person and by a group of people.

The A3Novelty Corpus is composed of two types of recordings: background, which contains only background sounds such as human speech, technical tools noise, and environmental sounds, and background with novelty, which contains, in addition to the background, the artificially generated novelty events.

In the original A3Novelty database, the recordings are segmented in sequences of 30 seconds. In order to limit the size of the training data, we randomly selected 300 sequences from the background partition to compose the training material (150 minutes) and 180 sequences from the background with novelty partition to compose the testing set (90 minutes). The test set contains 13 novelty occurrences.

For reproducibility, the list of randomly selected recordings and the train and test sets are made available (http://www.a3lab.dii.univpm.it/research/a3novelty).

6.2. PASCAL CHiME. The original dataset is composed of around 7 hours of recordings of a home environment, taken from the PASCAL CHiME speech separation and recognition challenge [53]. It consists of a typical in-home scenario (a living room), recorded during different days and times while the inhabitants (two adults and two children) perform common actions such as talking, watching television, playing, or eating. The dataset was recorded with a binaural microphone at a sample rate of 16 kHz. In the original PASCAL CHiME database, the recordings are segmented in sequences of 5 minutes' duration. In order to limit the size of the training data, we randomly selected sequences to compose 100 minutes of background for the training set and around 70 minutes for the testing set. For reproducibility, the list of randomly selected recordings and the train and test sets are made available (http://a3lab.dii.univpm.it/webdav/audio/NoveltyDetection_Dataset.tar.gz). The test set was generated by adding different types of sounds (taken from http://www.freesound.org) such as screams, alarms, falls, and fractures (cf. Table 1), after their normalization to the volume of the background recordings. The events in the test set were added at random positions; thus, the distance between one event and another is not fixed.

Table 1: Acoustic novel events in the test set. Shown are, per event type, the number of events per database and the total and average duration in seconds. The last column indicates the total number of events and total duration across the databases. The last line indicates the total duration in seconds of the test set, including normal and novel events, per database. Cell format: number of events, total time (average time).

| Event | A3Novelty | PASCAL CHiME | PROM. ATM | PROM. Corridor | PROM. Outdoor | PROM. Smart-room | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Alarm | — | 76, 435.8 (6.0) | — | 6, 84.0 (14.0) | — | 3, 9.0 (3.0) | 85, 528.8 |
| Anger | — | — | — | — | 6, 293.0 (48.8) | — | 6, 293.0 |
| Fall | 3, 4.2 (2.1) | 48, 89.5 (1.8) | — | 3, 3.0 (1.0) | — | 2, 2.0 (1.0) | 55, 98.7 |
| Fracture | 1, 2.2 | 32, 70.4 (2.2) | — | — | — | — | 33, 72.6 |
| Pain | — | — | — | 2, 8.0 (4.0) | — | 5, 67.0 (13.4) | 7, 75.0 |
| Scream | 6, 10.4 (1.7) | 111, 214.6 (1.9) | 5, 30.0 (6.0) | 25, 228.0 (9.1) | 4, 48.0 (12.0) | 10, 234.0 (23.4) | 159, 762.2 |
| Siren | 3, 20.4 (6.8) | — | — | — | — | — | 3, 18.1 |
| Total | 13, 38.1 (2.9) | 267, 810.3 (3.1) | 5, 30.0 (6.0) | 36, 323.0 (9.0) | 10, 341.0 (34.1) | 20, 312.0 (15.6) | 348, 1848.4 |
| Test time | 5400.0 | 4188.0 | 750.0 | 960.0 | 1620.0 | 1020.0 | 13938.0 |

6.3. PROMETHEUS. The PROMETHEUS database [31] contains recordings of various scenarios designed to serve a wide range of real-world applications. The database includes (1) a smart-room indoor home environment, including phases where a user is interacting with an automated speech-driven home assistant; (2) an outdoor public space consisting of (a) interaction of people with an ATM and (b) an outdoor security scenario in which people are waiting in front of a counter; and (3) an indoor office corridor scenario for security monitoring in a standard indoor space. These scenarios substantially differ in terms of acoustic environment. The indoor scenarios were recorded under quiet acoustic conditions, whereas the outdoor recordings were conducted in an open-air public area and contain nonstationary background noise. The smart-home scenario contains recordings of five professional actors performing five single-person and 14 multiple-person action scripts. The main activities include human-machine interaction with a virtual home agent and a number of alternating normal and abnormal activities specifically designed to monitor and interpret human behaviour. The single-person and multiple-person actions include abnormal events such as falls, alarms followed by panic, atypical vocalic reactions (pain, fear, and anger), or fractures. Examples are walking to the couch, sitting, or interacting with the smart environment to turn the TV on, open the windows, or decrease the temperature. The scenarios were recorded three to five times by changing the actors and their roles in the action scripts. Table 1 provides details on the number of abnormal events per scenario, including average time duration.

7. Experimental Set-Up

The networks were trained with the gradient steepest descent algorithm on the sum of squared errors (SSE). In the case of all the LSTM and BLSTM networks, we used a constant learning rate l = 1e-6, since it showed better performance in our previous works [35], whereas different values l = {1e-8, 1e-9} were used for the MLP networks. Different noise sigma values σ = {0.01, 0.1, 0.25} were applied to the DAE. No Gaussian noise was applied to the basic AE and to the CAE, following the architectures described in Section 4. The prediction delay was applied with different values k = {1, 2, 3, 4, 5, 10}. The AEs were trained using our open-source CUDA RecurREnt Neural Network Toolkit (CURRENNT) [54], ensuring reproducibility. As evaluation metric we used the F-measure, in order to compare the results with previous works [35, 36]. We evaluated several topologies for the nonlinear predictive DAE, ranging from 54-128-54 to 216-216-216, and from 54-30-54 to 54-54-54 in the case of the CAE and basic AE, respectively. Every network topology was evaluated for each of the 100 epochs of training. In order to compare our results with our previous studies, we kept the same optimisation procedure as applied in [35, 36]. We further employed three state-of-the-art approaches based on OCSVM, GMM, and HMM. In the case of the OCSVM, we trained models at different complexity values C = {0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001}. Radial basis function kernels were used with different gamma values γ = {0.01, 0.001}, and we controlled the fraction of outliers in the training phase with different values ν = {0.1, 0.01}. The OCSVM was trained via the LIBSVM library [55]. In the case of the GMM, models were trained at different numbers of Gaussian components 2^n with n = 1, 2, ..., 8, whereas left-right HMMs were trained with different numbers of states s = {3, 4, 5} and 2^n Gaussian components with n = 1, 2, ..., 7. GMMs and HMMs were trained using the Torch [56] toolkit. The decision values produced as output of the OCSVM and the probability estimates produced as output of the probabilistic models were postprocessed with a similar thresholding algorithm (cf. Section 5), in order to fairly compare the performance among the different methods. For all the experiments and settings, we maintained the same feature set.

Table 2: Comparison over the three databases of methods by percentage of F-measure (F1). Indicated are the prediction (D)elay k and the average F1 weighted by the number of instances per database. Reported approaches are GMM, HMM, OCSVM, the compression autoencoder with MLP (MLP-CAE), LSTM (LSTM-CAE), and BLSTM (BLSTM-CAE) units, the basic autoencoder (MLP-AE, LSTM-AE, BLSTM-AE), the denoising autoencoder with MLP (MLP-DAE), LSTM (LSTM-DAE), and BLSTM (BLSTM-DAE) units, and the related versions of nonlinear predictive autoencoders NP-MLP-CAE/AE/DAE and NP-(B)LSTM-CAE/AE/DAE. Cell format: D(k) / F1 (%).

| Method | A3Novelty | PASCAL CHiME | ATM | Corridor | Outdoor | Smart-room | Weighted avg. F1 (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OCSVM | — / 91.8 | — / 63.4 | — / 60.2 | — / 65.3 | — / 57.3 | — / 57.4 | 73.3 |
| GMM | — / 89.4 | — / 89.4 | — / 50.2 | — / 49.4 | — / 56.4 | — / 59.1 | 78.7 |
| HMM | — / 88.2 | — / 91.4 | — / 52.0 | — / 49.6 | — / 56.0 | — / 59.1 | 78.9 |
| MLP-CAE | 0 / 97.6 | 0 / 85.2 | 0 / 76.1 | 0 / 76.2 | 0 / 64.8 | 0 / 61.2 | 84.8 |
| LSTM-CAE | 0 / 97.7 | 0 / 89.1 | 0 / 78.5 | 0 / 77.8 | 0 / 62.6 | 0 / 64.0 | 86.2 |
| BLSTM-CAE | 0 / 98.7 | 0 / 91.3 | 0 / 78.4 | 0 / 78.4 | 0 / 62.1 | 0 / 63.7 | 87.3 |
| MLP-AE | 0 / 97.2 | 0 / 85.0 | 0 / 76.1 | 0 / 77.0 | 0 / 65.1 | 0 / 61.8 | 84.8 |
| LSTM-AE | 0 / 97.7 | 0 / 89.1 | 0 / 78.7 | 0 / 77.9 | 0 / 61.6 | 0 / 61.4 | 86.0 |
| BLSTM-AE | 0 / 97.8 | 0 / 89.4 | 0 / 78.5 | 0 / 77.6 | 0 / 63.1 | 0 / 63.4 | 86.4 |
| MLP-DAE | 0 / 97.3 | 0 / 87.3 | 0 / 77.5 | 0 / 78.5 | 0 / 65.8 | 0 / 64.6 | 86.0 |
| LSTM-DAE | 0 / 97.9 | 0 / 92.4 | 0 / 79.5 | 0 / 78.7 | 0 / 68.0 | 0 / 65.0 | 88.1 |
| BLSTM-DAE | 0 / 98.4 | 0 / 93.4 | 0 / 78.7 | 0 / 79.8 | 0 / 68.5 | 0 / 65.1 | 88.7 |
| NP-MLP-CAE | 4 / 98.5 | 5 / 88.3 | 1 / 78.8 | 2 / 75.0 | 1 / 65.2 | 5 / 64.0 | 86.4 |
| NP-LSTM-CAE | 5 / 98.8 | 1 / 92.5 | 1 / 78.7 | 2 / 74.4 | 2 / 64.7 | 3 / 63.8 | 87.7 |
| NP-BLSTM-CAE | 4 / 99.2 | 3 / 92.8 | 2 / 78.3 | 1 / 75.2 | 2 / 65.7 | 2 / 63.2 | 88.1 |
| NP-MLP-AE | 4 / 98.5 | 5 / 85.9 | 1 / 79.0 | 2 / 74.6 | 1 / 64.7 | 2 / 62.8 | 85.5 |
| NP-LSTM-AE | 5 / 98.7 | 1 / 92.1 | 1 / 78.1 | 1 / 75.0 | 1 / 65.0 | 3 / 64.4 | 87.6 |
| NP-BLSTM-AE | 4 / 99.2 | 2 / 94.1 | 3 / 78.4 | 2 / 75.6 | 1 / 65.6 | 1 / 63.6 | 88.5 |
| NP-MLP-DAE | 5 / 98.9 | 5 / 88.8 | 1 / 81.6 | 1 / 77.5 | 1 / 67.0 | 4 / 64.3 | 87.3 |
| NP-LSTM-DAE | 5 / 99.1 | 1 / 94.2 | 4 / 80.4 | 1 / 76.4 | 2 / 66.2 | 1 / 65.2 | 88.8 |
| NP-BLSTM-DAE | 5 / 99.4 | 3 / 94.4 | 2 / 80.7 | 3 / 78.5 | 2 / 66.7 | 1 / 65.6 | 89.3 |
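To make the training recipe concrete, the following is a minimal PyTorch sketch of the nonlinear-predictive denoising autoencoder idea (the actual experiments used CURRENNT [54]; the 54-128-54 LSTM layout, SSE objective, Gaussian corruption σ, and prediction delay k follow the description above, while the class and function names are our own):

```python
import torch
import torch.nn as nn

class NPDenoisingAE(nn.Module):
    """Sketch of an LSTM denoising autoencoder in the 54-128-54 layout."""
    def __init__(self, n_in=54, n_hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(n_in, n_hidden, batch_first=True)
        self.out = nn.Linear(n_hidden, n_in)

    def forward(self, x):                # x: (batch, frames, 54)
        h, _ = self.rnn(x)
        return self.out(h)

def train_step(model, optimizer, feats, k=1, sigma=0.1):
    """One step on the SSE objective: reconstruct the clean frame at t + k
    from the Gaussian-corrupted frame at t (nonlinear prediction, k >= 1)."""
    noisy = feats + sigma * torch.randn_like(feats)
    loss = ((model(noisy[:, :-k, :]) - feats[:, k:, :]) ** 2).sum()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def novelty_score(model, feats, k=1):
    """Frame-wise reconstruction error, used as the novelty activation signal;
    a decision would then threshold it, standing in for Section 5's logic."""
    with torch.no_grad():
        err = (model(feats[:, :-k, :]) - feats[:, k:, :]) ** 2
    return err.mean(dim=-1)
```

With the plain gradient descent named above, the optimizer would be, for example, torch.optim.SGD(model.parameters(), lr=1e-6).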

8. Results

In this section, we present and comment on the results obtained in our evaluation across the three databases.

8.1. A3Novelty. Evaluations on the A3Novelty Corpus are reported in the second column of Table 2. In this dataset, GMMs and HMMs perform similarly; however, they are outperformed by the OCSVM, with a maximum improvement of 3.6% absolute F-measure. The autoencoder-based approaches significantly boost the performance, up to 98.7%. We observe a vast absolute improvement of up to 6.9% against the probabilistic approaches. Among the three configurations, CAE, AE, and DAE, we observe that compression and denoising layouts with BLSTM units perform closely to each other, at up to 98.7% in the case of the BLSTM-CAE. This can be due to the fact that the dataset contains fewer variations in the background material used for training, and the feature selection operated internally by the AE increases the sensitivity of the reconstruction error.

The nonlinear predictive results are shown in the last part of Table 2. We provide performance in the three named configurations and with the three named unit types. In concordance with what we found on the PASCAL database, the NP-BLSTM-DAE method provided the best performance in terms of F-measure, of up to 99.4%. A significant absolute improvement (one-tailed z-test [57], p < 0.01; in the rest of the manuscript, we report as "significant" the improvements with at least p < 0.01 under the one-tailed z-test [57]) of 10.0% F-measure is observed against the GMM-based approach, while an absolute improvement of 7.6% F-measure is exhibited with respect to the OCSVM method. We observe an overall improvement of about 1% between the "ordinary" and the "predictive" architectures.
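For reference, a pooled two-proportion one-tailed z-test can be sketched as follows; whether this exact variant matches the test discussed in [57] is an assumption made for illustration, and the instance counts n_a and n_b are hypothetical inputs:

```python
import math
from scipy.stats import norm

def one_tailed_z_test(f1_a, f1_b, n_a, n_b):
    """Pooled two-proportion one-tailed z-test, treating the two F-measures
    as proportions over n_a and n_b scored instances (an illustrative
    simplification; the article's exact test follows [57])."""
    p = (f1_a * n_a + f1_b * n_b) / (n_a + n_b)         # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))   # pooled standard error
    z = (f1_a - f1_b) / se
    return norm.sf(z)                                    # one-tailed p-value
```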

The performance obtained by progressively increasing the prediction delay (k) values (from 0 up to 10) is reported in Figure 4. We evaluated the compression autoencoder (CAE), the basic autoencoder (AE), and the denoising autoencoder (DAE) with MLP, LSTM, and BLSTM units, and we applied different layouts (cf. Section 7) per network type; however, for the sake of brevity, we only show the best configurations. The best results across all the three unit types are 99.4% and 99.1% F-measure for the NP-BLSTM-DAE and NP-LSTM-DAE networks, respectively. These are obtained with a prediction delay of 5 frames, which translates into an overall delay of 50 ms. In general, the best performances are achieved with k = 4 or k = 5. Increasing the prediction delay up to 10 frames produces a heavy decrease in performance, down to 97.8% F-measure.

Figure 4: A3Novelty performances obtained with NP-autoencoders by progressively increasing the prediction delay.

8.2. PASCAL CHiME. In the first column of Table 2, we report the performance obtained on the PASCAL dataset using different approaches. Parts of the results obtained on this database were also presented in [36]; here, we conducted additional experiments to evaluate the OCSVM and MLP approaches. The one-class SVM shows lower performance compared to probabilistic approaches such as GMM and HMM, which seem to work reasonably well, up to 91.4% F-measure. The OCSVM's low performance can be due to the fact that the dataset was generated artificially and the abnormal sound dynamics were normalized with respect to the "normal" material, making the margin maximisation more complex and less effective. Next, we evaluated AE-based approaches in the three configurations: compression (CAE), basic (AE), and denoising (DAE). We also evaluated MLP, LSTM, and BLSTM unit types. Among the three configurations, we observe that the denoising ones perform better than the others, independently of the type of unit. In particular, the best performance is obtained with the denoising autoencoder realised as a BLSTM RNN, showing up to 93.4% F-measure. The last three groups of rows in Table 2 show results of the NP approach, again in the three configurations and with the three unit types.

The NP-BLSTM-DAE achieved the best result, of up to 94.4% F-measure. Significant absolute improvements of 4.0%, 3.0%, and 1.0% are observed over the GMM, HMM, and "ordinary" BLSTM-DAE approaches, respectively.

Figure 5: PASCAL CHiME performances obtained with NP-autoencoders by progressively increasing the prediction delay.

Interestingly, applying the nonlinear prediction scheme to the compression autoencoders, NP-(B)LSTM-CAE (92.8% F-measure), also increased the performance in comparison with the (B)LSTM-CAE (91.3% F-measure). In fact, in a previous work [35], the compression learning process alone showed scarce results; however, here the CAE with nonlinear prediction encodes information on the input more effectively.

Figure 5 depicts results for increasing values of the prediction delay (k), ranging from 0 to 10. We evaluated the CAE, AE, and DAE with MLP, LSTM, and BLSTM neural networks, with different layouts (cf. Section 7) per network type; however, due to space restrictions, we only report the best performances. Here, the best performances are obtained with a prediction delay of 3 frames (30 ms) for the NP-BLSTM-DAE network (94.4% F-measure) and of one frame in the case of the NP-LSTM-DAE (94.2% F-measure). As on the A3Novelty database, we observe a similar decrease in performance, down to 86.2% F-measure, when the prediction delay increases up to 10 frames, which corresponds to 100 ms. In fact, applying a higher prediction delay (e.g., 100 ms) induces higher values of the reconstruction error in the presence of fast periodic events, which subsequently leads to an increased false detection rate.

8.3. PROMETHEUS. This subsection elaborates on the results obtained on the four subsets present in the PROMETHEUS database.

8.3.1. ATM. The ATM scenario evaluations are shown in the third column of Table 2. The GMM and HMM perform similarly, at chance level: in fact, we observe an F-measure of 50.2% and 52.0% for GMMs and HMMs, respectively. The one-class SVM shows slightly better performance, of up to 60.2%. On the other hand, AE-based approaches in the three configurations, compression (CAE), traditional (AE), and denoising (DAE), show a significant improvement in performance, up to 19.3% absolute F-measure against the OCSVM. Among the three configurations, we observe that the DAE performs better independently of the type of network. In particular, the best performance considering the ordinary (without nonlinear prediction) approach is obtained with the DAE with an LSTM network, leading to an F-measure of 79.5%.

Figure 6: PROMETHEUS-ATM performances obtained with NP-autoencoders by progressively increasing the prediction delay.

The last three groups of rows in Table 2 show results of the nonlinear predictive approach (NP). The nonlinear predictive denoising autoencoder performs best, at up to 81.6% F-measure. Surprisingly, the best performance is obtained using MLP units, suggesting that for long events, as those contained in the ATM scenario (with an average duration of 6.0 s, cf. Table 1), memory-enhanced units such as (B)LSTM are not as effective as for shorter events.

A significant absolute improvement of 21.4% F-measure is observed against the OCSVM approach, while an absolute improvement of 31.4% F-measure is exhibited with respect to the GMM-based method. Among the two autoencoder-based approaches, we report an absolute improvement of 1.0% between the so-called "ordinary" and "predictive" structures. It must be observed that the performance presented in [31] is higher than the one provided in this article, since the tolerance window used in that study was set to 1 s, whereas here we aimed at a higher temporal resolution, with a tolerance window of 200 ms, which is suitable also for abrupt events.

Figure 6 depicts performance for progressive values of the prediction delay (k), ranging from 0 to 10, applying a CAE, AE, and DAE with MLP, LSTM, and BLSTM networks. Several layouts (cf. Section 7) were evaluated per network type; however, we report only the best configurations. Setting a prediction delay of 1 frame, which corresponds to a total prediction delay of 10 ms, leads to the best performance, of up to 81.6% F-measure in the NP-MLP-DAE network. In the case of the NP-BLSTM-DAE, we observe better performance with a delay of 2 frames, up to 80.7% F-measure. In general, we do not observe a consistent trend when increasing the prediction delay, corroborating the fact that for long events, as those contained in the ATM scenario, memory-enhanced units and a nonlinear predictive approach are not as effective as for shorter events.

8.3.2. Corridor. The evaluations on the corridor subset are shown in the fourth column of Table 2. The GMM and HMM perform similarly, at chance level: we observe an F-measure of 49.4% and 49.6% for GMM and HMM, respectively. The OCSVM shows better performance, up to 65.3%. As observed in the ATM scenario, again a significant improvement in performance, up to 16.5% absolute F-measure, is observed using the autoencoder-based approaches in the three configurations (CAE, AE, and DAE) with respect to the OCSVM. Among the three configurations, we observe that the denoising autoencoder performs better than the others. The best performance is obtained with the denoising autoencoder with BLSTM units, of up to 79.8% F-measure.

The "predictive" approach is reported in the last three groups of rows in Table 2. Interestingly, the nonlinear predictive autoencoders do not improve the performance as we have seen in the other scenarios. A plausible explanation can be found in the nature of the novelty events present in the subset. In fact, the subset contains very long events, with an average duration of up to 14.0 s per event. With such long events, the generative model does not introduce a more sensitive reconstruction error. However, the delta in performance between the BLSTM-DAE and the NP-BLSTM-DAE is rather small (1.3% F-measure), in favour of the "ordinary" approach. The best performance (79.8%) is obtained using BLSTM units, confirming that memory-enhanced units are more effective in the presence of short events. In fact, this scenario, besides very long events, also contains short fall and pain events, with an average duration of 1.0 s and 3.0 s, respectively.

A significant absolute improvement of up to 16.5% F-measure is observed against the OCSVM approach, while being even higher with respect to the GMM and HMM.

8.3.3. Outdoor. The evaluations on the outdoor subset are shown in the fifth column of Table 2. The OCSVM, GMM, and HMM perform better in this scenario, as opposed to ATM and corridor: we observe an F-measure of 57.3%, 56.4%, and 56.0% for OCSVM, GMM, and HMM, respectively. In this scenario, the improvement brought by the autoencoder is not as vast as in the previous subsets, but still significant: we report an absolute improvement of 11.2% F-measure between the OCSVM and the BLSTM-DAE. Again, the denoising autoencoder performs better than the other configurations. In particular, the best performance, obtained with the BLSTM-DAE, is 68.5% F-measure.

As observed in the corridor scenario, the nonlinear predictive autoencoders (last three groups of rows in Table 2) do not improve the performance. These results corroborate our previous explanation that the long duration of the novelty events present in the subset affects the sensitivity of the reconstruction error in the generative model. However, the delta in performance between the BLSTM-DAE and the NP-BLSTM-DAE is rather small (1.3% F-measure).

It must be observed that the performance in this scenario is rather low compared to the one obtained on the other datasets. We believe that the presence of anger novel sounds introduces a higher degree of complexity for our autoencoder-based approach. In fact, anger novel events may contain different levels of aroused content, which could be acoustically similar to neutral spoken content present in the training material. Under this condition, the generative model shows a low reconstruction error. This issue could be solved by setting the novel events to contain only the aroused segments or, considering anger as a long-term speaker state, by increasing the temporal resolution of our system.

8.3.4. Smart-Room. The smart-room scenario evaluations are shown in the sixth column of Table 2. The OCSVM, GMM, and HMM perform better in this scenario, as opposed to ATM, corridor, and outdoor: we observe an F-measure of 57.4%, 59.1%, and 59.1% for OCSVM, GMM, and HMM, respectively. In this scenario, the improvement brought about by the autoencoder is still significant: we report an absolute improvement of 6.0% F-measure between the GMM/HMM and the BLSTM-DAE. Again, the denoising autoencoder performs better than the other configurations. In particular, the best performance in the ordinary approach is obtained with the BLSTM-DAE, of up to 65.1% F-measure.

The last three groups of rows in Table 2 show results of the nonlinear predictive approach (NP). The NP-BLSTM-DAE performs best, at up to 65.6% F-measure.

As in the outdoor subset, we report a low performance in the smart-room subset as well. In fact, the subset contains several long novel events related to spoken content expressing pain and fear. As commented for the outdoor scenario, under this condition the generative model may be able to reconstruct the novel event without producing a high reconstruction error.

8.4. Overall. Overall, the experimental results proved that the DAE methods achieve superior performance compared to the CAE/AE schemes. This is due to the combination of the two learning processes of a denoising autoencoder: encoding the input while preserving the information about the input itself, and simultaneously reversing the effect of a corruption process applied to the input of the autoencoder.

In particular, the predictive approach with (B)LSTM units showed the best performance, of up to 89.3% average F-measure across all six datasets, weighted by the number of instances per database (cf. Table 2).
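Spelled out, this weighted average is the instance-weighted mean over the six (sub)datasets, where $N_d$ denotes the number of test instances of dataset $d$ and $F_{1,d}$ the F-measure achieved on it (a standard definition, stated here for completeness in our own notation):

$$ \bar{F}_1 = \frac{\sum_{d=1}^{6} N_d \, F_{1,d}}{\sum_{d=1}^{6} N_d} $$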

To better understand the improvement brought by the RNN-based approaches, we provide in Figure 7 the comparison between state-of-the-art methods in terms of weighted average F-measure computed across the A3Novelty Corpus, PASCAL CHiME, and PROMETHEUS. In general, we observe that the recently proposed NP-BLSTM-DAE method provided the best performance in terms of average F-measure, of up to 89.3%. A significant absolute improvement of 16.0% average F-measure is observed against the OCSVM approach, while absolute improvements of 10.6% and 10.4% average F-measure are exhibited with respect to the GMM- and HMM-based methods. An absolute improvement of 0.6% is observed over the "ordinary" BLSTM-DAE. It has to be noted that the average F-measure is computed including the PROMETHEUS database, for which the performance has been shown to be lower, because it contains long-term events and a lower resolution in the labels (1 s).

Figure 7: Average F-measure computed over the A3Novelty Corpus, PASCAL CHiME, and PROMETHEUS, weighted by the number of instances per database. Extended comparison of methods.

The RNN-based schemes also bring an evident benefit when applied to the "normal" autoencoders (i.e., with no denoising or compression): in fact, the NP-BLSTM-AE achieves an F-measure of 88.5%. Furthermore, when we applied the nonlinear prediction scheme to a denoising autoencoder, the performance achieved with LSTM units was in this case comparable with BLSTM units, and it also outperformed the state-of-the-art approaches.

In conclusion, the combination of the nonlinear prediction paradigm and the various (B)LSTM autoencoders proved to be effective, significantly outperforming other state-of-the-art methods. Additionally, there is evidence that memory-enhanced units such as LSTM and BLSTM outperformed the memoryless MLP, showing that knowledge of the temporal context can improve the novelty detector's abilities.

9. Conclusions and Outlook

We presented a broad and extensive evaluation of state-of-the-art methods, with a particular focus on novel and recent unsupervised approaches based on RNN-based autoencoders. We significantly extended the studies conducted in [35, 36] by evaluating further approaches, such as one-class support vector machines (OCSVMs) and multilayer perceptrons (MLPs), and, most importantly, we conducted a broad evaluation on three different datasets, for a total number of 160,153 experiments, making this article the first to present such a complete evaluation in the field of acoustic novelty detection. We showed evidently that RNN-based autoencoders significantly outperform other methods, achieving up to 89.3% weighted average F-measure on the three databases, with a significant absolute improvement of 10.4% against the best performance obtained with statistical approaches (HMM). Overall, a significant increase in performance was achieved by combining the (B)LSTM autoencoder-based architecture with the nonlinear prediction scheme.

Future work will focus on using multiresolution features [51, 58], likely more suitable to deal with different event durations, in order to face the issues encountered on the PROMETHEUS database. Further studies will be conducted on other architectures of RNN-based autoencoders, ranging from dynamic Bayesian networks [59] to convolutional neural networks [60].

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007–2013) under Grant Agreement no. 338164 (ERC Starting Grant iHEARu), no. 645378 (RIA ARIA-VALUSPA), and no. 688835 (RIA DE-ENIGMA).

References

[1] L. Tarassenko, P. Hayton, N. Cerneaz, and M. Brady, "Novelty detection for the identification of masses in mammograms," in Proceedings of the 4th International Conference on Artificial Neural Networks, pp. 442–447, June 1995.

[2] J. Quinn and C. Williams, "Known unknowns: novelty detection in condition monitoring," in Pattern Recognition and Image Analysis, J. Martí, J. Benedí, A. Mendonça, and J. Serrat, Eds., vol. 4477 of Lecture Notes in Computer Science, pp. 1–6, Springer, Berlin, Germany, 2007.

[3] L. Clifton, D. A. Clifton, P. J. Watkinson, and L. Tarassenko, "Identification of patient deterioration in vital-sign data using one-class support vector machines," in Proceedings of the Federated Conference on Computer Science and Information Systems (FedCSIS '11), pp. 125–131, September 2011.

[4] Y.-H. Liu, Y.-C. Liu, and Y.-J. Chen, "Fast support vector data descriptions for novelty detection," IEEE Transactions on Neural Networks, vol. 21, no. 8, pp. 1296–1313, 2010.

[5] C. Surace and K. Worden, "Novelty detection in a changing environment: a negative selection approach," Mechanical Systems and Signal Processing, vol. 24, no. 4, pp. 1114–1128, 2010.

[6] J. A. Quinn, C. K. I. Williams, and N. McIntosh, "Factorial switching linear dynamical systems applied to physiological condition monitoring," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 9, pp. 1537–1551, 2009.

[7] A. Patcha and J.-M. Park, "An overview of anomaly detection techniques: existing solutions and latest technological trends," Computer Networks, vol. 51, no. 12, pp. 3448–3470, 2007.

[8] M. Markou and S. Singh, "A neural network-based novelty detector for image sequence analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1664–1677, 2006.

[9] M. Markou and S. Singh, "Novelty detection: a review, part 1: statistical approaches," Signal Processing, vol. 83, no. 12, pp. 2481–2497, 2003.

[10] M. Markou and S. Singh, "Novelty detection: a review, part 2: neural network based approaches," Signal Processing, vol. 83, no. 12, pp. 2499–2521, 2003.

[11] E. R. de Faria, I. R. Goncalves, J. Gama, and A. C. Carvalho, "Evaluation of multiclass novelty detection algorithms for data streams," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 11, pp. 2961–2973, 2015.

[12] C. Satheesh Chandran, S. Kamal, A. Mujeeb, and M. H. Supriya, "Novel class detection of underwater targets using self-organizing neural networks," in Proceedings of the IEEE Underwater Technology (UT '15), Chennai, India, February 2015.

[13] K. Worden, G. Manson, and D. Allman, "Experimental validation of a structural health monitoring methodology: part I. Novelty detection on a laboratory structure," Journal of Sound and Vibration, vol. 259, no. 2, pp. 323–343, 2003.

[14] J. Foote, "Automatic audio segmentation using a measure of audio novelty," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '00), vol. 1, pp. 452–455, August 2000.

[15] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Proceedings of the Neural Information Processing Systems (NIPS '99), vol. 12, pp. 582–588, Denver, Colo, USA, 1999.

[16] J. Ma and S. Perkins, "Time-series novelty detection using one-class support vector machines," in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '03), vol. 3, pp. 1741–1745, Portland, Ore, USA, July 2003.

[17] J. Ma and S. Perkins, "Online novelty detection on temporal sequences," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), pp. 613–618, ACM, Washington, DC, USA, August 2003.

[18] P. Hayton, B. Schölkopf, L. Tarassenko, and P. Anuzis, "Support vector novelty detection applied to jet engine vibration spectra," in Proceedings of the 14th Annual Neural Information Processing Systems Conference (NIPS '00), pp. 946–952, December 2000.

[19] L. Tarassenko, A. Nairac, N. Townsend, and P. Cowley, "Novelty detection in jet engines," in Proceedings of the IEE Colloquium on Condition Monitoring: Machinery, External Structures and Health, pp. 41–45, Birmingham, UK, April 1999.

[20] L. Clifton, D. A. Clifton, Y. Zhang, P. Watkinson, L. Tarassenko, and H. Yin, "Probabilistic novelty detection with support vector machines," IEEE Transactions on Reliability, vol. 63, no. 2, pp. 455–467, 2014.

[21] D. R. Hardoon and L. M. Manevitz, "One-class machine learning approach for fMRI analysis," in Proceedings of the Postgraduate Research Conference in Electronics, Photonics, Communications and Networks, and Computer Science, Lancaster, UK, April 2005.

[22] M. Davy, F. Desobry, A. Gretton, and C. Doncarli, "An online support vector machine for abnormal events detection," Signal Processing, vol. 86, no. 8, pp. 2009–2025, 2006.

[23] M. A. F. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko, "A review of novelty detection," Signal Processing, vol. 99, pp. 215–249, 2014.

[24] N. Japkowicz, C. Myers, M. Gluck et al., "A novelty detection approach to classification," in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI '95), vol. 1, pp. 518–523, Montreal, Canada, 1995.

[25] B. B. Thompson, R. J. Marks, J. J. Choi, M. A. El-Sharkawi, M.-Y. Huang, and C. Bunje, "Implicit learning in autoencoder novelty assessment," in Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN '02), vol. 3, pp. 2878–2883, IEEE, Honolulu, Hawaii, USA, 2002.

[26] S. Hawkins, H. He, G. Williams, and R. Baxter, "Outlier detection using replicator neural networks," in Data Warehousing and Knowledge Discovery, pp. 170–180, Springer, Berlin, Germany, 2002.

[27] N. Japkowicz, "Supervised versus unsupervised binary-learning by feedforward neural networks," Machine Learning, vol. 42, no. 1-2, pp. 97–122, 2001.

[28] L. Manevitz and M. Yousef, "One-class document classification via neural networks," Neurocomputing, vol. 70, no. 7–9, pp. 1466–1481, 2007.

[29] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, IEEE Computer Society, December 2002.

[30] H. Sohn, K. Worden, and C. R. Farrar, "Statistical damage classification under changing environmental and operational conditions," Journal of Intelligent Material Systems and Structures, vol. 13, no. 9, pp. 561–574, 2002.

[31] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "Probabilistic novelty detection for acoustic surveillance under real-world conditions," IEEE Transactions on Multimedia, vol. 13, no. 4, pp. 713–719, 2011.

[32] E. Principi, S. Squartini, R. Bonfigli, G. Ferroni, and F. Piazza, "An integrated system for voice command recognition and emergency detection based on audio signals," Expert Systems with Applications, vol. 42, no. 13, pp. 5668–5683, 2015.

[33] A. Ito, A. Aiba, M. Ito, and S. Makino, "Evaluation of abnormal sound detection using multi-stage GMM in various environments," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH '11), pp. 301–304, Florence, Italy, August 2011.

[34] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "On acoustic surveillance of hazardous situations," in Proceedings of the 34th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 165–168, April 2009.

[35] E. Marchi, F. Vesperini, F. Eyben, S. Squartini, and B. Schuller, "A novel approach for automatic acoustic novelty detection using a denoising autoencoder with bidirectional LSTM neural networks," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), 5 pages, IEEE, Brisbane, Australia, April 2015.

[36] E. Marchi, F. Vesperini, F. Weninger, F. Eyben, S. Squartini, and B. Schuller, "Non-linear prediction with LSTM recurrent neural networks for acoustic novelty detection," in Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN '15), p. 5, IEEE, Killarney, Ireland, July 2015.

[37] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, "Learning precise timing with LSTM recurrent networks," The Journal of Machine Learning Research, vol. 3, no. 1, pp. 115–143, 2003.

[38] A. Graves, "Generating sequences with recurrent neural networks," https://arxiv.org/abs/1308.0850.

[39] D. Eck and J. Schmidhuber, "Finding temporal structure in music: blues improvisation with LSTM recurrent networks," in Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pp. 747–756, IEEE, Martigny, Switzerland, September 2002.

[40] J. A. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," Tech. Rep. 97-021, International Computer Science Institute, Berkeley, Calif, USA, 1998.

[41] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[42] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Advances in Neural Information Processing Systems 12, S. Solla, T. Leen, and K. Müller, Eds., pp. 582–588, MIT Press, 2000.

[43] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, 1986.

[44] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[45] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[46] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5-6, pp. 602–610, 2005.

[47] G. Hinton, L. Deng, D. Yu et al., "Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.

[48] I. Goodfellow, H. Lee, Q. V. Le, A. Saxe, and A. Y. Ng, "Measuring invariances in deep networks," in Advances in Neural Information Processing Systems, Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, Eds., vol. 22, pp. 646–654, Curran & Associates Inc., Red Hook, NY, USA, 2009.

[49] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds., pp. 153–160, MIT Press, 2007.

[50] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.

[51] E. Marchi, G. Ferroni, F. Eyben, L. Gabrielli, S. Squartini, and B. Schuller, "Multi-resolution linear prediction based features for audio onset detection with bidirectional LSTM neural networks," in Proceedings of the 39th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 2183–2187, IEEE, Florence, Italy, May 2014.

[52] F. Eyben, F. Weninger, F. Gross, and B. Schuller, "Recent developments in openSMILE, the Munich open-source multimedia feature extractor," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 835–838, ACM, Barcelona, Spain, October 2013.

[53] J. Barker, E. Vincent, N. Ma, H. Christensen, and P. Green, "The PASCAL CHiME speech separation and recognition challenge," Computer Speech & Language, vol. 27, no. 3, pp. 621–633, 2013.

[54] F. Weninger, J. Bergmann, and B. Schuller, "Introducing CURRENNT: the Munich open-source CUDA RecurREnt Neural Network Toolkit," Journal of Machine Learning Research, vol. 16, pp. 547–551, 2015.

[55] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, article 27, 2011.

[56] R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: a Matlab-like environment for machine learning," in BigLearn, NIPS Workshop, no. EPFL-CONF-192376, 2011.

[57] M. D. Smucker, J. Allan, and B. Carterette, "A comparison of statistical significance tests for information retrieval evaluation," in Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM '07), pp. 623–632, ACM, Lisbon, Portugal, November 2007.

[58] E. Marchi, G. Ferroni, F. Eyben, S. Squartini, and B. Schuller, "Audio onset detection: a wavelet packet based approach with recurrent neural networks," in Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN '14), pp. 3585–3591, Beijing, China, July 2014.

[59] N. Friedman, K. Murphy, and S. Russell, "Learning the structure of dynamic probabilistic networks," in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI '98), pp. 139–147, Morgan Kaufmann, Madison, Wisc, USA, July 1998.

[60] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, "Face recognition: a convolutional neural-network approach," IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98–113, 1997.

[35] EMarchi F Vesperini F Eyben S Squartini andB Schuller ldquoAnovel approach for automatic acoustic novelty detection usinga denoising autoencoder with bidirectional LSTM neural net-worksrdquo in Proceedings of the 40th IEEE International ConferenceonAcoustics Speech and Signal Processing (ICASSP rsquo15) 5 pagesIEEE Brisbane Australia April 2015

[36] E Marchi F Vesperini F Weninger F Eyben S Squartini andB Schuller ldquoNon-linear prediction with LSTM recurrent neuralnetworks for acoustic novelty detectionrdquo in Proceedings of the2015 International Joint Conference on Neural Networks (IJCNNrsquo15) p 5 IEEE Killarney Ireland July 2015

[37] F A Gers N N Schraudolph and J Schmidhuber ldquoLearningprecise timing with LSTM recurrent networksrdquo The Journal ofMachine Learning Research vol 3 no 1 pp 115ndash143 2003

[38] A Graves ldquoGenerating sequences with recurrent neural net-worksrdquo httpsarxivorgabs13080850

[39] D Eck and J Schmidhuber ldquoFinding temporal structure inmusic blues improvisation with lstm recurrent networksrdquo inProceedings of the 12th IEEE Workshop on Neural Networks forSignal Processing pp 747ndash756 IEEE Martigny SwitzerlandSeptember 2002

[40] J A Bilmes ldquoA gentle tutorial of the EM algorithm andits application to parameter estimation for gaussian mixtureand hidden Markov modelsrdquo Tech Rep 97-021 InternationalComputer Science Institute Berkeley Calif USA 1998

[41] L R Rabiner ldquoTutorial on hiddenMarkov models and selectedapplications in speech recognitionrdquo Proceedings of the IEEE vol77 no 2 pp 257ndash286 1989

[42] B Scholkopf R C Williamson A J Smola J Shawe-Taylorand J C Platt ldquoSupport vectormethod for novelty detectionrdquo in

Advances in Neural Information Processing Systems 12 S SollaT Leen and K Muller Eds pp 582ndash588 MIT Press 2000

[43] D E Rumelhart G E Hinton and R J Williams ldquoLearningrepresentations by back-propagating errorsrdquo Nature vol 323no 6088 pp 533ndash536 1986

[44] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquoNeural Computation vol 9 no 8 pp 1735ndash1780 1997

[45] M Schuster and K K Paliwal ldquoBidirectional recurrent neuralnetworksrdquo IEEE Transactions on Signal Processing vol 45 no11 pp 2673ndash2681 1997

[46] A Graves and J Schmidhuber ldquoFramewise phoneme classi-fication with bidirectional LSTM and other neural networkarchitecturesrdquo Neural Networks vol 18 no 5-6 pp 602ndash6102005

[47] G Hinton L Deng D Yu et al ldquoDeep neural networks foracoustic modeling in speech recognition the shared views offour research groupsrdquo IEEE Signal Processing Magazine vol 29no 6 pp 82ndash97 2012

[48] I Goodfellow H Lee Q V Le A Saxe and A Y Ng ldquoMea-suring invariances in deep networksrdquo in Advances in NeuralInformation Processing Systems Y Bengio D Schuurmans JLafferty CWilliams and A Culotta Eds vol 22 pp 646ndash654Curran amp Associates Inc Red Hook NY USA 2009

[49] Y Bengio P Lamblin D Popovici and H Larochelle ldquoGreedylayer-wise training of deep networksrdquo in Advances in NeuralInformation Processing Systems 19 B Scholkopf J Platt and THoffman Eds pp 153ndash160 MIT Press 2007

[50] P Vincent H Larochelle I Lajoie Y Bengio and P-AManzagol ldquoStacked denoising autoencoders learning usefulrepresentations in a deep network with a local denoisingcriterionrdquo Journal of Machine Learning Research vol 11 pp3371ndash3408 2010

[51] E Marchi G Ferroni F Eyben L Gabrielli S Squartini andB Schuller ldquoMulti-resolution linear prediction based featuresfor audio onset detection with bidirectional LSTM neural net-worksrdquo in Proceedings of the 39th IEEE International Conferenceon Acoustics Speech and Signal Processing (ICASSP rsquo14) pp2183ndash2187 IEEE Florence Italy May 2014

[52] F Eyben F Weninger F Gross and B Schuller ldquoRecentdevelopments in openSMILE the munich open-source mul-timedia feature extractorrdquo in Proceedings of the 21st ACMInternational Conference on Multimedia (MM rsquo13) pp 835ndash838ACM Barcelona Spain October 2013

[53] J Barker E Vincent N Ma H Christensen and P Green ldquoThepascal chime speech separation and recognition challengerdquoComputer Speech amp Language vol 27 no 3 pp 621ndash633 2013

[54] F Weninger J Bergmann and B Schuller ldquoIntroducing CUR-RENNT the Munich open-source CUDA RecurREnt neuralnetwork toolkitrdquo Journal of Machine Learning Research vol 16pp 547ndash551 2015

[55] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[56] R Collobert K Kavukcuoglu and C Farabet ldquoTorch7 amatlab-like environment for machine learningrdquo BigLearnNIPS Workshop no EPFL-CONF-192376 2011

[57] MD Smucker J Allan and B Carterette ldquoA comparison of sta-tistical significance tests for information retrieval evaluationrdquoin Proceedings of the 16th ACM Conference on Conference onInformation and Knowledge Management (CIKM rsquo07) pp 623ndash632 ACM Lisbon Portugal November 2007

14 Computational Intelligence and Neuroscience

[58] E Marchi G Ferroni F Eyben S Squartini and B SchullerldquoAudio onset detection a wavelet packet based approach withrecurrent neural networksrdquo in Proceedings of the 2014 Inter-national Joint Conference on Neural Networks (IJCNN rsquo14) pp3585ndash3591 Beijing China July 2014

[59] N Friedman KMurphy and S Russell ldquoLearning the structureof dynamic probabilistic networksrdquo in Proceedings of the 14thConference on Uncertainty in Artificial Intelligence (UAI rsquo98) pp139ndash147 Morgan Kaufmann Madison Wisc USA July 1998

[60] S Lawrence C L Giles A C Tsoi and A D Back ldquoFacerecognition a convolutional neural-network approachrdquo IEEETransactions on Neural Networks vol 8 no 1 pp 98ndash113 1997


Table 1: Acoustic novel events in the test set. Shown are the number of different events per database, the average duration, and the total duration in seconds per event type. The last column indicates the total number of events and total duration across the databases. The last line indicates the total duration in seconds of the test set, including normal and novel events, per database. (ATM, Corridor, Outdoor, and Smart-room are the PROMETHEUS subsets; all times are in seconds.)

Event     | A3Novelty       | PASCAL CHiME     | ATM           | Corridor        | Outdoor          | Smart-room       | Total
          | #  Time (avg)   | #  Time (avg)    | #  Time (avg) | #  Time (avg)   | #  Time (avg)    | #  Time (avg)    | #  Time
Alarm     | —               | 76  435.8 (6.0)  | —             | 6  84.0 (14.0)  | —                | 3  9.0 (3.0)     | 85  528.8
Anger     | —               | —                | —             | —               | 6  293.0 (48.8)  | —                | 6  293.0
Fall      | 3  4.2 (2.1)    | 48  89.5 (1.8)   | —             | 3  3.0 (1.0)    | —                | 2  2.0 (1.0)     | 55  98.7
Fracture  | 1  2.2          | 32  70.4 (2.2)   | —             | —               | —                | —                | 33  72.6
Pain      | —               | —                | —             | 2  8.0 (4.0)    | —                | 5  67.0 (13.4)   | 7  75.0
Scream    | 6  10.4 (1.7)   | 111  214.6 (1.9) | 5  30.0 (6.0) | 25  228.0 (9.1) | 4  48.0 (12.0)   | 10  234.0 (23.4) | 159  762.2
Siren     | 3  20.4 (6.8)   | —                | —             | —               | —                | —                | 3  18.1
Total     | 13  38.1 (2.9)  | 267  810.3 (3.1) | 5  30.0 (5.0) | 36  323.0 (9.0) | 10  341.0 (34.1) | 20  312.0 (15.6) | 348  1848.4
Test time | 5400.0          | 4188.0           | 750.0         | 960.0           | 1620.0           | 1020.0           | 13938.0

challenge [53]. It consists of a typical in-home scenario (a living room) recorded during different days and times while the inhabitants (two adults and two children) perform common actions such as talking, watching television, playing, or eating. The dataset was recorded with a binaural microphone at a sample rate of 16 kHz. In the original PASCAL CHiME database, the recordings are segmented in sequences of 5 minutes' duration. In order to limit the size of the training data, we randomly selected sequences to compose 100 minutes of background for the training set and around 70 minutes for the testing set. For reproducibility, the list of randomly selected recordings and the train and test sets are made available (http://a3lab.dii.univpm.it/webdav/audio/NoveltyDetection_Dataset.tar.gz). The test set was generated by adding different types of sounds (taken from http://www.freesound.org) such as screams, alarms, falls, and fractures (cf. Table 1), after their normalization to the volume of the background recordings. The events in the test set were added at random positions; thus, the distance between one event and another is not fixed.
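Since the synthesis procedure just described is fully specified (volume normalization of the event, then insertion at a random offset), a minimal sketch of how such a test set can be assembled is given below. The file names, the RMS-based notion of "volume", and the frame labelling are our assumptions for illustration, not taken from the released scripts.

```python
import numpy as np
import soundfile as sf  # any WAV I/O library would do

def mix_event_into_background(background, event, sr=16000, rng=None):
    """Insert one novel event into a background recording at a random
    offset, after normalising the event to the background volume
    (here: matching RMS levels, one plausible reading of the paper)."""
    rng = rng or np.random.default_rng()
    rms = lambda x: np.sqrt(np.mean(x ** 2) + 1e-12)
    event = event * (rms(background) / rms(event))
    # Random insertion point: event spacing is therefore not fixed.
    start = int(rng.integers(0, len(background) - len(event)))
    mixed = background.copy()
    mixed[start:start + len(event)] += event
    # Frame-level labels (10 ms hop at 16 kHz): 1 = novel, 0 = normal.
    hop = sr // 100
    labels = np.zeros(len(background) // hop, dtype=int)
    labels[start // hop:(start + len(event)) // hop] = 1
    return mixed, labels

background, sr = sf.read("background_5min.wav")  # hypothetical file names
scream, _ = sf.read("scream.wav")
test_signal, frame_labels = mix_event_into_background(background, scream, sr)
```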

6.3. PROMETHEUS. The PROMETHEUS database [31] contains recordings of various scenarios designed to serve a wide range of real-world applications. The database includes (1) a smart-room indoor home environment, including phases where a user is interacting with an automated speech-driven home assistant; (2) an outdoor public space, consisting of (a) interaction of people with an ATM and (b) an outdoor security scenario in which people are waiting in front of a counter; and (3) an indoor office corridor scenario for security monitoring in a standard indoor space. These scenarios substantially differ in terms of acoustic environment: the indoor scenarios were recorded under quiet acoustic conditions, whereas the outdoor recordings were conducted in an open-air public area and contain nonstationary background noise. The smart-home scenario contains recordings of five professional actors performing five single-person and 14 multiple-person action scripts. The main activities include human-machine interaction with a virtual home agent and a number of alternating normal and abnormal activities specifically designed to monitor and interpret human behaviour. The single-person and multiple-person actions include abnormal events such as falls, alarms followed by panic, atypical vocal reactions (pain, fear, and anger), or fractures. Examples are walking to the couch, sitting, or interacting with the smart environment to turn the TV on, open the windows, or decrease the temperature. The scenarios were recorded three to five times by changing the actors and their roles in the action scripts. Table 1 provides details on the number of abnormal events per scenario, including the average time duration.

7. Experimental Set-Up

The networks were trained with the gradient steepest descent algorithm on the sum of squared errors (SSE). In the case of all the LSTM and BLSTM networks, we used a constant learning rate l = 1e-6, since it showed better performance in our previous works [35], whereas different values of l = {1e-8, 1e-9} were used for the MLP networks. Different noise sigma values σ = {0.01, 0.1, 0.25} were applied to the DAE. No Gaussian noise was applied to the basic AE and to the CAE, following the architectures described in Section 4. The prediction delay was applied for different values k = {1, 2, 3, 4, 5, 10}. The AEs were trained using our open-source CUDA RecurREnt Neural Network Toolkit (CURRENNT) [54], ensuring reproducibility. As evaluation metric we used the F-measure, in order to compare the results with previous works [35, 36]. We evaluated several topologies for the nonlinear predictive DAE, ranging from 54-128-54 to 216-216-216, and from 54-30-54 to 54-54-54 in the case of the CAE and the basic AE, respectively. Every network topology was evaluated for each of the 100 epochs of training. In order to compare our results with our previous studies, we kept the same optimisation procedure as applied in [35, 36]. We further employed three state-of-the-art approaches based on OCSVM, GMM, and HMM. In the case of OCSVM, we trained models at different complexity values C = {0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001}. Radial basis function kernels were used with different gamma values γ = {0.01, 0.001}, and we controlled the fraction of outliers in the training phase with different values ν = {0.1, 0.01}. The OCSVM was trained via the LIBSVM library [55]. In the case of GMM, models were trained at different numbers of Gaussian components 2^n with n = 1, 2, ..., 8, whereas left-right HMMs were trained with different numbers of states s = {3, 4, 5} and 2^n Gaussian components with n = 1, 2, ..., 7. GMMs and HMMs were trained using the Torch toolkit [56]. The decision values produced as output of the OCSVM and the probability estimates produced as output of the probabilistic models were postprocessed with a similar thresholding algorithm (cf. Section 5), in order to fairly compare the performance among the different methods. For all the experiments and settings, we maintained the same feature set.
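For concreteness, the following is a minimal sketch of the nonlinear predictive denoising autoencoder training described above, written in PyTorch rather than the CURRENNT toolkit actually used. The 54-128-54 layout, the Gaussian corruption, the prediction delay k, and the SSE objective with steepest descent follow Section 7; everything else (class and variable names, the dummy batch) is illustrative only.

```python
import torch
import torch.nn as nn

FEAT_DIM, HIDDEN, SIGMA, DELAY_K, LR = 54, 128, 0.1, 5, 1e-6

class NPDenoisingAE(nn.Module):
    """(B)LSTM denoising autoencoder with nonlinear prediction: it maps the
    Gaussian-corrupted frame at time t to a reconstruction of the clean
    frame at time t+k (54-128-54 is one layout from the grid in Section 7)."""
    def __init__(self, bidirectional=True):
        super().__init__()
        self.rnn = nn.LSTM(FEAT_DIM, HIDDEN, batch_first=True,
                           bidirectional=bidirectional)
        self.out = nn.Linear(HIDDEN * (2 if bidirectional else 1), FEAT_DIM)

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.out(h)

def train_step(model, optimiser, clean):
    """clean: (batch, frames, 54) auditory spectral features."""
    noisy = clean + SIGMA * torch.randn_like(clean)            # denoising corruption
    inputs, targets = noisy[:, :-DELAY_K], clean[:, DELAY_K:]  # predict frame t+k
    loss = ((model(inputs) - targets) ** 2).sum()              # sum of squared errors
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()

model = NPDenoisingAE()
optimiser = torch.optim.SGD(model.parameters(), lr=LR)  # steepest descent on SSE
batch = torch.randn(8, 100, FEAT_DIM)                   # stand-in for real features
print(train_step(model, optimiser, batch))
```

At detection time, the frame-wise reconstruction error of such a trained network serves as the novelty score that is then thresholded (cf. Section 5).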

Table 2: Comparison over the three databases of methods by percentage of F-measure (F1). Indicated: prediction (D)elay and average F1 weighted by the number of instances per database. Reported approaches are GMM, HMM, and OCSVM; compression autoencoders with MLP (MLP-CAE), BLSTM (BLSTM-CAE), and LSTM (LSTM-CAE); denoising autoencoders with MLP (MLP-DAE), BLSTM (BLSTM-DAE), and LSTM (LSTM-DAE); and the related versions of nonlinear predictive autoencoders, NP-MLP-CAE/AE/DAE and NP-(B)LSTM-CAE/AE/DAE. (ATM, Corridor, Outdoor, and Smart-room are the PROMETHEUS subsets.)

Method        | A3Novelty    | PASCAL CHiME | ATM          | Corridor     | Outdoor      | Smart-room   | Wtd. avg.
              | D(k)  F1 (%) | D(k)  F1 (%) | D(k)  F1 (%) | D(k)  F1 (%) | D(k)  F1 (%) | D(k)  F1 (%) | F1 (%)
OCSVM         | —     91.8   | —     63.4   | —     60.2   | —     65.3   | —     57.3   | —     57.4   | 73.3
GMM           | —     89.4   | —     89.4   | —     50.2   | —     49.4   | —     56.4   | —     59.1   | 78.7
HMM           | —     88.2   | —     91.4   | —     52.0   | —     49.6   | —     56.0   | —     59.1   | 78.9
MLP-CAE       | 0     97.6   | 0     85.2   | 0     76.1   | 0     76.2   | 0     64.8   | 0     61.2   | 84.8
LSTM-CAE      | 0     97.7   | 0     89.1   | 0     78.5   | 0     77.8   | 0     62.6   | 0     64.0   | 86.2
BLSTM-CAE     | 0     98.7   | 0     91.3   | 0     78.4   | 0     78.4   | 0     62.1   | 0     63.7   | 87.3
MLP-AE        | 0     97.2   | 0     85.0   | 0     76.1   | 0     77.0   | 0     65.1   | 0     61.8   | 84.8
LSTM-AE       | 0     97.7   | 0     89.1   | 0     78.7   | 0     77.9   | 0     61.6   | 0     61.4   | 86.0
BLSTM-AE      | 0     97.8   | 0     89.4   | 0     78.5   | 0     77.6   | 0     63.1   | 0     63.4   | 86.4
MLP-DAE       | 0     97.3   | 0     87.3   | 0     77.5   | 0     78.5   | 0     65.8   | 0     64.6   | 86.0
LSTM-DAE      | 0     97.9   | 0     92.4   | 0     79.5   | 0     78.7   | 0     68.0   | 0     65.0   | 88.1
BLSTM-DAE     | 0     98.4   | 0     93.4   | 0     78.7   | 0     79.8   | 0     68.5   | 0     65.1   | 88.7
NP-MLP-CAE    | 4     98.5   | 5     88.3   | 1     78.8   | 2     75.0   | 1     65.2   | 5     64.0   | 86.4
NP-LSTM-CAE   | 5     98.8   | 1     92.5   | 1     78.7   | 2     74.4   | 2     64.7   | 3     63.8   | 87.7
NP-BLSTM-CAE  | 4     99.2   | 3     92.8   | 2     78.3   | 1     75.2   | 2     65.7   | 2     63.2   | 88.1
NP-MLP-AE     | 4     98.5   | 5     85.9   | 1     79.0   | 2     74.6   | 1     64.7   | 2     62.8   | 85.5
NP-LSTM-AE    | 5     98.7   | 1     92.1   | 1     78.1   | 1     75.0   | 1     65.0   | 3     64.4   | 87.6
NP-BLSTM-AE   | 4     99.2   | 2     94.1   | 3     78.4   | 2     75.6   | 1     65.6   | 1     63.6   | 88.5
NP-MLP-DAE    | 5     98.9   | 5     88.8   | 1     81.6   | 1     77.5   | 1     67.0   | 4     64.3   | 87.3
NP-LSTM-DAE   | 5     99.1   | 1     94.2   | 4     80.4   | 1     76.4   | 2     66.2   | 1     65.2   | 88.8
NP-BLSTM-DAE  | 5     99.4   | 3     94.4   | 2     80.7   | 3     78.5   | 2     66.7   | 1     65.6   | 89.3
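The baseline grids of Section 7 translate almost directly into scikit-learn, which we use here purely for illustration (the paper trained OCSVMs with LIBSVM and GMMs/HMMs with Torch). The complexity values C belong to the LIBSVM setup and have no direct counterpart in this sketch, the grids are shortened to keep it quick, and the final threshold beta merely stands in for the thresholding algorithm of Section 5.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.mixture import GaussianMixture

train = np.random.randn(5000, 54)  # stand-in for "normal" training features
test = np.random.randn(2000, 54)

# OCSVM grid roughly mirroring Section 7 (nu controls the outlier fraction).
for nu in (0.1, 0.01):
    for gamma in (0.01, 0.001):
        ocsvm = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(train)
        score = -ocsvm.decision_function(test)  # higher = more novel
        # ...keep the grid point that scores best on a validation set.

# GMMs with 2^n components; low log-likelihood flags novel frames.
for n in (1, 2, 3):  # the paper sweeps n = 1..8
    gmm = GaussianMixture(n_components=2 ** n).fit(train)
    novelty = -gmm.score_samples(test)

# Decision logic: a threshold on the novelty score (illustrative only).
beta = np.median(novelty) + 2 * novelty.std()
is_novel = novelty > beta
```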

8. Results

In this section, we present and comment on the results obtained in our evaluation across the three databases.

8.1. A3Novelty. Evaluations on the A3Novelty Corpus are reported in the first column of Table 2. In this dataset, GMMs and HMMs perform similarly; however, they are outperformed by the OCSVM, with a maximum improvement of 3.6% absolute F-measure. The autoencoder-based approaches significantly boost the performance, up to 98.7%: we observe a vast absolute improvement of up to 6.9% against the probabilistic approaches. Among the three configurations (CAE, AE, and DAE), we observe that compression and denoising layouts with BLSTM units perform closely to each other, at up to 98.7% in the case of the BLSTM-CAE. This can be due to the fact that the dataset contains fewer variations in the background material used for training, and the feature selection operated internally by the AE increases the sensitivity of the reconstruction error.

The nonlinear predictive results are shown in the last part of Table 2. We provide performance in the three named configurations and with the three named unit types. In concordance with what we found on the PASCAL database, the NP-BLSTM-DAE method provided the best performance in terms of F-measure, up to 99.4%. A significant absolute improvement (one-tailed z-test [57], p < 0.01; in the rest of the manuscript, we report as "significant" the improvements with at least p < 0.01 under the one-tailed z-test [57]) of 10.0% F-measure is observed against the GMM-based approach, while an absolute improvement of 7.6% F-measure is exhibited with respect to the OCSVM method. We observe an overall improvement of ≈1% between the "ordinary" and the "predictive" architectures.
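As a rough illustration of the significance test used throughout this section, the sketch below implements a one-tailed two-proportion z-test. This is only one plausible reading of the test discussed in [57]; the exact statistic and the number of scored instances n are assumptions here, not taken from the paper.

```python
from math import erfc, sqrt

def one_tailed_z_test(f1_a, f1_b, n):
    """One-tailed two-proportion z-test: p-value for the hypothesis that
    system A's F-measure exceeds system B's over n scored instances each.
    A simplified stand-in for the test of [57], for illustration only."""
    p = (f1_a + f1_b) / 2.0                    # pooled proportion
    z = (f1_a - f1_b) / sqrt(p * (1 - p) * 2.0 / n)
    return 0.5 * erfc(z / sqrt(2.0))           # P(Z >= z) under H0

# E.g., NP-BLSTM-DAE vs. GMM on A3Novelty (99.4% vs. 89.4%):
print(one_tailed_z_test(0.994, 0.894, n=5000))  # n is illustrative only
```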

The performance obtained by progressively increasing the prediction delay k (from 0 up to 10 frames) is reported in Figure 4. We evaluated the compression autoencoder (CAE), the basic autoencoder (AE), and the denoising autoencoder (DAE) with MLP, LSTM, and BLSTM units, and we applied different layouts (cf. Section 7) per network type; however, for the sake of brevity, we only show the best configurations. The best results across all the three unit types are 99.4% and 99.1% F-measure for the NP-BLSTM-DAE and NP-LSTM-DAE networks, respectively. These are obtained with a prediction delay of 5 frames, which translates into an overall delay of 50 ms. In general, the best performances are achieved with k = 4 or k = 5. Increasing the prediction delay up to 10 frames produces a heavy decrease in performance, down to 97.8% F-measure.

Figure 4: A3Novelty: performances obtained with NP-autoencoders by progressively increasing the prediction delay (F-measure, %, versus prediction delay in frames).

8.2. PASCAL CHiME. In the second column of Table 2, we report the performance obtained on the PASCAL dataset using the different approaches. Parts of the results obtained on this database were also presented in [36]; here, we conducted additional experiments to evaluate the OCSVM and MLP approaches. The one-class SVM shows lower performance compared to probabilistic approaches such as GMM and HMM, which seem to work reasonably well, up to 91.4% F-measure. The low OCSVM performance can be due to the fact that the dataset was generated artificially and the abnormal sound dynamics were normalized with respect to the "normal" material, making the margin maximisation more complex and less effective. Next, we evaluated AE-based approaches in the three configurations: compression (CAE), basic (AE), and denoising (DAE). We also evaluated MLP, LSTM, and BLSTM unit types. Among the three configurations, we observe that the denoising ones perform better than the others, independently of the type of unit. In particular, the best performance is obtained with the denoising autoencoder realised as a BLSTM RNN, showing up to 93.4% F-measure. The last three groups of rows in Table 2 show results of the NP approach, again in the three configurations and with the three unit types.

The NP-BLSTM-DAE achieved the best result, up to 94.4% F-measure. Significant absolute improvements of 4.0%, 3.0%, and 1.0% are observed over the GMM, HMM, and "ordinary" BLSTM-DAE approaches, respectively.

Figure 5: PASCAL CHiME: performances obtained with NP-autoencoders by progressively increasing the prediction delay (F-measure, %, versus prediction delay in frames).

Interestingly, applying the nonlinear prediction scheme to the compression autoencoders, NP-(B)LSTM-CAE (92.8% F-measure), also increased the performance in comparison with the (B)LSTM-CAE (91.3% F-measure). In fact, in a previous work [35], the compression learning process alone showed poor results; however, here the CAE with the nonlinear prediction encodes information on the input more effectively.

Figure 5 depicts results for increasing values of the prediction delay k, ranging from 0 to 10. We evaluated the CAE, AE, and DAE with MLP, LSTM, and BLSTM neural networks with different layouts (cf. Section 7) per network type; however, due to space restrictions, we only report the best performances. Here, the best performances are obtained with a prediction delay of 3 frames (30 ms) for the NP-BLSTM-DAE network (94.4% F-measure) and of one frame in the case of the NP-LSTM-DAE (94.2% F-measure). As in the A3Novelty database, we observe a similar decrease in performance, down to 86.2% F-measure, when the prediction delay increases up to 10 frames, which corresponds to 100 ms. In fact, applying a higher prediction delay (e.g., 100 ms) induces higher values of the reconstruction error in the presence of fast periodic events, which subsequently leads to an increased false detection rate.

8.3. PROMETHEUS. This subsection elaborates on the results obtained on the four subsets of the PROMETHEUS database.

8.3.1. ATM. The ATM scenario evaluations are shown in the third column of Table 2. The GMM and HMM perform similarly, at chance level: in fact, we observe an F-measure of 50.2% and 52.0% for GMMs and HMMs, respectively. The one-class SVM shows slightly better performance, up to 60.2%. On the other hand, AE-based approaches in the three configurations, compression (CAE), traditional (AE), and denoising (DAE), show a significant improvement in performance, up to 19.3% absolute F-measure against the OCSVM. Among the three configurations, we observe that the DAE performs better independently of the type of network. In particular, the best performance considering the ordinary (without nonlinear prediction) approach is obtained with the DAE with an LSTM network, leading to an F-measure of 79.5%.

Figure 6: PROMETHEUS-ATM: performances obtained with NP-autoencoders by progressively increasing the prediction delay (F-measure, %, versus prediction delay in frames).

The last three groups of rows in Table 2 show results of the nonlinear predictive approach (NP). The nonlinear predictive denoising autoencoder performs best, up to 81.6% F-measure. Surprisingly, the best performance is obtained using MLP units, suggesting that for long events, as those contained in the ATM scenario (with an average duration of 6.0 s, cf. Table 1), memory-enhanced units such as (B)LSTM are not as effective as for shorter events.

A significant absolute improvement of 21.4% F-measure is observed against the OCSVM approach, while an absolute improvement of 31.4% F-measure is exhibited with respect to the GMM-based method. Among the two autoencoder-based approaches, we report an absolute improvement of 1.0% between the so-called "ordinary" and "predictive" structures. It must be observed that the performances presented in [31] are higher than those provided in this article, since the tolerance window used in that study was set to 1 s, whereas here we aimed at a higher temporal resolution, with a tolerance window of 200 ms, which is suitable also for abrupt events.

Figure 6 depicts performance for progressive values of the prediction delay k, ranging from 0 to 10, applying a CAE, AE, and DAE with MLP, LSTM, and BLSTM networks. Several layouts (cf. Section 7) were evaluated per network type; however, we report only the best configurations. Setting a prediction delay of 1 frame, which corresponds to a total prediction delay of 10 ms, leads to the best performance, up to 81.6% F-measure with the NP-MLP-DAE network. In the case of the NP-BLSTM-DAE, we observe better performance with a delay of 2 frames, up to 80.7% F-measure. In general, we do not observe a consistent trend when increasing the prediction delay, corroborating the fact that, for long events as those contained in the ATM scenario, memory-enhanced units and a nonlinear predictive approach are not as effective as for shorter events.

8.3.2. Corridor. The evaluations on the corridor subset are shown in the fourth column of Table 2. The GMM and HMM perform similarly, at chance level: we observe an F-measure of 49.4% and 49.6% for GMM and HMM, respectively. The OCSVM shows better performance, up to 65.3%. As observed in the ATM scenario, again a significant improvement in performance, up to 16.5% absolute F-measure, is obtained using the autoencoder-based approaches in the three configurations (CAE, AE, and DAE) with respect to the OCSVM. Among the three configurations, we observe that the denoising autoencoder performs better than the others. The best performance is obtained with the denoising autoencoder with BLSTM units, up to 79.8% F-measure.

The "predictive" approach is reported in the last three groups of rows in Table 2. Interestingly, the nonlinear predictive autoencoders do not improve the performance as we have seen in the other scenarios. A plausible explanation can be found in the nature of the novelty events present in the subset. In fact, the subset contains very long events, with an average duration of up to 14.0 s per event. With such long events, the generative model does not introduce a more sensitive reconstruction error. However, the delta in performance between the BLSTM-DAE and the NP-BLSTM-DAE is rather small (1.3% F-measure), in favour of the "ordinary" approach. The best performance (79.8%) is obtained using BLSTM units, confirming that memory-enhanced units are more effective in the presence of short events. In fact, this scenario, besides very long events, also contains short fall and pain events, with an average duration of 1.0 s and 3.0 s, respectively.

A significant absolute improvement of up to 16.5% F-measure is observed against the OCSVM approach, while being even higher with respect to the GMM and HMM.

8.3.3. Outdoor. The evaluations on the outdoor subset are shown in the fifth column of Table 2. The OCSVM, GMM, and HMM perform better in this scenario, as opposed to ATM and corridor: we observe an F-measure of 57.3%, 56.4%, and 56.0% for OCSVM, GMM, and HMM, respectively. In this scenario, the improvement brought by the autoencoder is not as vast as in the previous subsets, but it is still significant: we report an absolute improvement of 11.2% F-measure between the OCSVM and the BLSTM-DAE. Again, the denoising autoencoder performs better than the other configurations. In particular, the best performance, obtained with the BLSTM-DAE, is 68.5% F-measure.

As observed in the corridor scenario, the nonlinear predictive autoencoders (last three groups of rows in Table 2) do not improve the performance. These results corroborate our previous explanation that the long duration of the novelty events present in the subset affects the sensitivity of the reconstruction error in the generative model. However, the delta in performance between the BLSTM-DAE and the NP-BLSTM-DAE is rather small (1.3% F-measure).

It must be observed that the performance in this scenario is rather low compared to the one obtained on the other datasets. We believe that the presence of anger novel sounds introduces a higher degree of complexity for our autoencoder-based approach. In fact, anger novel events may contain different levels of aroused content, which could be acoustically similar to neutral spoken content present in the training material; under this condition, the generative model shows a low reconstruction error. This issue could be solved by setting the novel events to only contain the aroused segments or, considering anger as a long-term speaker state, by increasing the temporal resolution of our system.

8.3.4. Smart-Room. The smart-room scenario evaluations are shown in the sixth column of Table 2. The OCSVM, GMM, and HMM perform better in this scenario, as opposed to ATM, corridor, and outdoor: we observe an F-measure of 57.4%, 59.1%, and 59.1% for OCSVM, GMM, and HMM, respectively. In this scenario, the improvement brought about by the autoencoder is still significant: we report an absolute improvement of 6.0% F-measure between the GMM/HMM and the BLSTM-DAE. Again, the denoising autoencoder performs better than the other configurations. In particular, the best performance in the ordinary approach is obtained with the BLSTM-DAE, up to 65.1% F-measure.

The last three groups of rows in Table 2 show results of the nonlinear predictive approach (NP). The NP-BLSTM-DAE performs best, at up to 65.6% F-measure.

As in the outdoor subset, we report a low performance in the smart-room subset as well. In fact, the subset contains several long novel events related to spoken content expressing pain and fear. As commented for the outdoor scenario, under this condition the generative model may be able to reconstruct the novel event without producing a high reconstruction error.

8.4. Overall. Overall, the experimental results proved that the DAE methods achieve superior performance compared to the CAE/AE schemes. This is due to the combination of the two learning processes of a denoising autoencoder: encoding the input by preserving the information about the input itself and, simultaneously, reversing the effect of a corruption process applied to the input of the autoencoder.

In particular, the predictive approach with (B)LSTM units showed the best performance, up to 89.3% average F-measure among all the six different datasets, weighted by the number of instances per database (cf. Table 2).

To better understand the improvement brought by the RNN-based approaches, we provide in Figure 7 the comparison between state-of-the-art methods in terms of weighted average F-measure computed across the A3Novelty Corpus, PASCAL CHiME, and PROMETHEUS. In general, we observe that the recently proposed NP-BLSTM-DAE method provided the best performance in terms of average F-measure, up to 89.3%. A significant absolute improvement of 16.0% average F-measure is observed against the OCSVM approach, while absolute improvements of 10.6% and 10.4% average F-measure are exhibited with respect to the GMM- and HMM-based methods. An absolute improvement of 0.6% is observed over the "ordinary" BLSTM-DAE. It has to be noted that the average F-measure is computed including the PROMETHEUS database, for which the performance has been shown to be lower, because it contains long-term events and a lower resolution in the labels (1 s).

Figure 7: Average F-measure computed over the A3Novelty Corpus, PASCAL CHiME, and PROMETHEUS, weighted by the number of instances per database: extended comparison of the methods of Table 2, ranging from 73.3% (OCSVM) to 89.3% (NP-BLSTM-DAE).

The RNN-based schemes also bring an evident benefit when applied to the "normal" autoencoders (i.e., with no denoising or compression): in fact, the NP-BLSTM-AE achieves an F-measure of 88.5%. Furthermore, when we applied the nonlinear prediction scheme to a denoising autoencoder, the performance achieved with LSTM units was in this case comparable with BLSTM units and also outperformed the state-of-the-art approaches.

In conclusion, the combination of the nonlinear prediction paradigm and the various (B)LSTM autoencoders proved to be effective, significantly outperforming other state-of-the-art methods. Additionally, there is evidence that memory-enhanced units such as LSTM and BLSTM outperform an MLP without memory, showing that knowledge of the temporal context can improve the abilities of a novelty detector.

9. Conclusions and Outlook

We presented a broad and extensive evaluation of state-of-the-art methods, with a particular focus on novel and recent unsupervised approaches based on RNN-based autoencoders. We significantly extended the studies conducted in [35, 36] by evaluating further approaches, such as one-class support vector machines (OCSVMs) and multilayer perceptrons (MLPs), and, most importantly, we conducted a broad evaluation on three different datasets, for a total number of 160,153 experiments, making this article the first to present such a complete evaluation in the field of acoustic novelty detection. We show evidently that RNN-based autoencoders significantly outperform the other methods by achieving up to 89.3% weighted average F-measure on the three databases, with a significant absolute improvement of 10.4% against the best performance obtained with statistical approaches (HMM). Overall, a significant increase in performance was achieved by combining the (B)LSTM autoencoder-based architecture with the nonlinear prediction scheme.

Future works will focus on using multiresolution features [51, 58], likely more suitable to deal with different event durations, in order to face the issues encountered in the PROMETHEUS database. Further studies will be conducted on other architectures of RNN-based autoencoders, ranging from dynamic Bayesian networks [59] to convolutional neural networks [60].

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007–2013) under Grant Agreement no. 338164 (ERC Starting Grant iHEARu), no. 645378 (RIA ARIA-VALUSPA), and no. 688835 (RIA DE-ENIGMA).

References

[1] L. Tarassenko, P. Hayton, N. Cerneaz, and M. Brady, "Novelty detection for the identification of masses in mammograms," in Proceedings of the 4th International Conference on Artificial Neural Networks, pp. 442–447, June 1995.
[2] J. Quinn and C. Williams, "Known unknowns: novelty detection in condition monitoring," in Pattern Recognition and Image Analysis, J. Martí, J. Benedí, A. Mendonça, and J. Serrat, Eds., vol. 4477 of Lecture Notes in Computer Science, pp. 1–6, Springer, Berlin, Germany, 2007.
[3] L. Clifton, D. A. Clifton, P. J. Watkinson, and L. Tarassenko, "Identification of patient deterioration in vital-sign data using one-class support vector machines," in Proceedings of the Federated Conference on Computer Science and Information Systems (FedCSIS '11), pp. 125–131, September 2011.
[4] Y.-H. Liu, Y.-C. Liu, and Y.-J. Chen, "Fast support vector data descriptions for novelty detection," IEEE Transactions on Neural Networks, vol. 21, no. 8, pp. 1296–1313, 2010.
[5] C. Surace and K. Worden, "Novelty detection in a changing environment: a negative selection approach," Mechanical Systems and Signal Processing, vol. 24, no. 4, pp. 1114–1128, 2010.
[6] J. A. Quinn, C. K. I. Williams, and N. McIntosh, "Factorial switching linear dynamical systems applied to physiological condition monitoring," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 9, pp. 1537–1551, 2009.
[7] A. Patcha and J.-M. Park, "An overview of anomaly detection techniques: existing solutions and latest technological trends," Computer Networks, vol. 51, no. 12, pp. 3448–3470, 2007.
[8] M. Markou and S. Singh, "A neural network-based novelty detector for image sequence analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1664–1677, 2006.
[9] M. Markou and S. Singh, "Novelty detection: a review, part 1: statistical approaches," Signal Processing, vol. 83, no. 12, pp. 2481–2497, 2003.
[10] M. Markou and S. Singh, "Novelty detection: a review, part 2: neural network based approaches," Signal Processing, vol. 83, no. 12, pp. 2499–2521, 2003.
[11] E. R. de Faria, I. R. Gonçalves, J. Gama, and A. C. Carvalho, "Evaluation of multiclass novelty detection algorithms for data streams," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 11, pp. 2961–2973, 2015.
[12] C. Satheesh Chandran, S. Kamal, A. Mujeeb, and M. H. Supriya, "Novel class detection of underwater targets using self-organizing neural networks," in Proceedings of the IEEE Underwater Technology (UT '15), Chennai, India, February 2015.
[13] K. Worden, G. Manson, and D. Allman, "Experimental validation of a structural health monitoring methodology: part I. Novelty detection on a laboratory structure," Journal of Sound and Vibration, vol. 259, no. 2, pp. 323–343, 2003.
[14] J. Foote, "Automatic audio segmentation using a measure of audio novelty," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '00), vol. 1, pp. 452–455, August 2000.
[15] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Proceedings of the Neural Information Processing Systems (NIPS '99), vol. 12, pp. 582–588, Denver, Colo, USA, 1999.
[16] J. Ma and S. Perkins, "Time-series novelty detection using one-class support vector machines," in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '03), vol. 3, pp. 1741–1745, Portland, Ore, USA, July 2003.
[17] J. Ma and S. Perkins, "Online novelty detection on temporal sequences," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), pp. 613–618, ACM, Washington, DC, USA, August 2003.
[18] P. Hayton, B. Schölkopf, L. Tarassenko, and P. Anuzis, "Support vector novelty detection applied to jet engine vibration spectra," in Proceedings of the 14th Annual Neural Information Processing Systems Conference (NIPS '00), pp. 946–952, December 2000.
[19] L. Tarassenko, A. Nairac, N. Townsend, and P. Cowley, "Novelty detection in jet engines," in Proceedings of the IEE Colloquium on Condition Monitoring: Machinery, External Structures and Health, pp. 41–45, Birmingham, UK, April 1999.
[20] L. Clifton, D. A. Clifton, Y. Zhang, P. Watkinson, L. Tarassenko, and H. Yin, "Probabilistic novelty detection with support vector machines," IEEE Transactions on Reliability, vol. 63, no. 2, pp. 455–467, 2014.
[21] D. R. Hardoon and L. M. Manevitz, "One-class machine learning approach for fMRI analysis," in Proceedings of the Postgraduate Research Conference in Electronics, Photonics, Communications and Networks, and Computer Science, Lancaster, UK, April 2005.
[22] M. Davy, F. Desobry, A. Gretton, and C. Doncarli, "An online support vector machine for abnormal events detection," Signal Processing, vol. 86, no. 8, pp. 2009–2025, 2006.
[23] M. A. F. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko, "A review of novelty detection," Signal Processing, vol. 99, pp. 215–249, 2014.
[24] N. Japkowicz, C. Myers, M. Gluck et al., "A novelty detection approach to classification," in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI '95), vol. 1, pp. 518–523, Montreal, Canada, 1995.
[25] B. B. Thompson, R. J. Marks, J. J. Choi, M. A. El-Sharkawi, M.-Y. Huang, and C. Bunje, "Implicit learning in autoencoder novelty assessment," in Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN '02), vol. 3, pp. 2878–2883, IEEE, Honolulu, Hawaii, USA, 2002.
[26] S. Hawkins, H. He, G. Williams, and R. Baxter, "Outlier detection using replicator neural networks," in Data Warehousing and Knowledge Discovery, pp. 170–180, Springer, Berlin, Germany, 2002.
[27] N. Japkowicz, "Supervised versus unsupervised binary-learning by feedforward neural networks," Machine Learning, vol. 42, no. 1-2, pp. 97–122, 2001.
[28] L. Manevitz and M. Yousef, "One-class document classification via neural networks," Neurocomputing, vol. 70, no. 7–9, pp. 1466–1481, 2007.
[29] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, IEEE Computer Society, December 2002.
[30] H. Sohn, K. Worden, and C. R. Farrar, "Statistical damage classification under changing environmental and operational conditions," Journal of Intelligent Material Systems and Structures, vol. 13, no. 9, pp. 561–574, 2002.
[31] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "Probabilistic novelty detection for acoustic surveillance under real-world conditions," IEEE Transactions on Multimedia, vol. 13, no. 4, pp. 713–719, 2011.
[32] E. Principi, S. Squartini, R. Bonfigli, G. Ferroni, and F. Piazza, "An integrated system for voice command recognition and emergency detection based on audio signals," Expert Systems with Applications, vol. 42, no. 13, pp. 5668–5683, 2015.
[33] A. Ito, A. Aiba, M. Ito, and S. Makino, "Evaluation of abnormal sound detection using multi-stage GMM in various environments," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH '11), pp. 301–304, Florence, Italy, August 2011.
[34] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "On acoustic surveillance of hazardous situations," in Proceedings of the 34th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '09), pp. 165–168, April 2009.
[35] E. Marchi, F. Vesperini, F. Eyben, S. Squartini, and B. Schuller, "A novel approach for automatic acoustic novelty detection using a denoising autoencoder with bidirectional LSTM neural networks," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '15), 5 pages, IEEE, Brisbane, Australia, April 2015.
[36] E. Marchi, F. Vesperini, F. Weninger, F. Eyben, S. Squartini, and B. Schuller, "Non-linear prediction with LSTM recurrent neural networks for acoustic novelty detection," in Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN '15), p. 5, IEEE, Killarney, Ireland, July 2015.
[37] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, "Learning precise timing with LSTM recurrent networks," The Journal of Machine Learning Research, vol. 3, no. 1, pp. 115–143, 2003.
[38] A. Graves, "Generating sequences with recurrent neural networks," https://arxiv.org/abs/1308.0850.
[39] D. Eck and J. Schmidhuber, "Finding temporal structure in music: blues improvisation with LSTM recurrent networks," in Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pp. 747–756, IEEE, Martigny, Switzerland, September 2002.
[40] J. A. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," Tech. Rep. 97-021, International Computer Science Institute, Berkeley, Calif, USA, 1998.
[41] L. R. Rabiner, "Tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[42] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Advances in Neural Information Processing Systems 12, S. Solla, T. Leen, and K. Müller, Eds., pp. 582–588, MIT Press, 2000.
[43] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, 1986.
[44] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[45] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[46] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5-6, pp. 602–610, 2005.
[47] G. Hinton, L. Deng, D. Yu et al., "Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[48] I. Goodfellow, H. Lee, Q. V. Le, A. Saxe, and A. Y. Ng, "Measuring invariances in deep networks," in Advances in Neural Information Processing Systems, Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, Eds., vol. 22, pp. 646–654, Curran & Associates Inc., Red Hook, NY, USA, 2009.
[49] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds., pp. 153–160, MIT Press, 2007.
[50] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
[51] E. Marchi, G. Ferroni, F. Eyben, L. Gabrielli, S. Squartini, and B. Schuller, "Multi-resolution linear prediction based features for audio onset detection with bidirectional LSTM neural networks," in Proceedings of the 39th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '14), pp. 2183–2187, IEEE, Florence, Italy, May 2014.
[52] F. Eyben, F. Weninger, F. Gross, and B. Schuller, "Recent developments in openSMILE, the Munich open-source multimedia feature extractor," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 835–838, ACM, Barcelona, Spain, October 2013.
[53] J. Barker, E. Vincent, N. Ma, H. Christensen, and P. Green, "The PASCAL CHiME speech separation and recognition challenge," Computer Speech & Language, vol. 27, no. 3, pp. 621–633, 2013.
[54] F. Weninger, J. Bergmann, and B. Schuller, "Introducing CURRENNT: the Munich open-source CUDA RecurREnt Neural Network Toolkit," Journal of Machine Learning Research, vol. 16, pp. 547–551, 2015.
[55] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, article 27, 2011.
[56] R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: a Matlab-like environment for machine learning," in BigLearn, NIPS Workshop, no. EPFL-CONF-192376, 2011.
[57] M. D. Smucker, J. Allan, and B. Carterette, "A comparison of statistical significance tests for information retrieval evaluation," in Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM '07), pp. 623–632, ACM, Lisbon, Portugal, November 2007.
[58] E. Marchi, G. Ferroni, F. Eyben, S. Squartini, and B. Schuller, "Audio onset detection: a wavelet packet based approach with recurrent neural networks," in Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN '14), pp. 3585–3591, Beijing, China, July 2014.
[59] N. Friedman, K. Murphy, and S. Russell, "Learning the structure of dynamic probabilistic networks," in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI '98), pp. 139–147, Morgan Kaufmann, Madison, Wisc, USA, July 1998.
[60] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, "Face recognition: a convolutional neural-network approach," IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98–113, 1997.

Submit your manuscripts athttpswwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 8: Deep Recurrent Neural Network-Based Autoencoders for ...downloads.hindawi.com/journals/cin/2017/4694860.pdf · ResearchArticle Deep Recurrent Neural Network-Based Autoencoders for

8 Computational Intelligence and Neuroscience

Table 2 Comparison over the three databases of methods by percentage of 119865-measure (1198651) Indicated prediction (D)elay and average 1198651weighted by the of instances per database Reported approaches are GMM HMM OCSVM compression autoencoder with MLP (MLP-CAE) BLSTM (BLSTM-CAE) and LSTM (LSTM-CAE) denoising autoencoder withMLP (MLP-DAE) BLSTM (BLSTM-DAE) and LSTM(LSTM-DAE) and related versions of nonlinear predictive autoencoders NP-MLP-CAEAEDAE and NP-(B)LSTM-CAEAEDAE

MethodA3Novelty PASCAL CHiME PROMETHEUS Weighted average

ATM Corridor Outdoor Smart-roomD(119896) 1198651 D(119896) 1198651 () D(119896) 1198651 () D(119896) 1198651 () D(119896) 1198651 () D(119896) 1198651 () 1198651 ()

OCSVM mdash 918 mdash 634 mdash 602 mdash 653 mdash 573 mdash 574 733GMM mdash 894 mdash 894 mdash 502 mdash 494 mdash 564 mdash 591 787HMM mdash 882 mdash 914 mdash 520 mdash 496 mdash 560 mdash 591 789MLP-CAE 0 976 0 852 0 761 0 762 0 648 0 612 848LSTM-CAE 0 977 0 891 0 785 0 778 0 626 0 640 862BLSTM-CAE 0 987 0 913 0 784 0 784 0 621 0 637 873MLP-AE 0 972 0 850 0 761 0 770 0 651 0 618 848LSTM-AE 0 977 0 891 0 787 0 779 0 616 0 614 860BLSTM-AE 0 978 0 894 0 785 0 776 0 631 0 634 864MLP-DAE 0 973 0 873 0 775 0 785 0 658 0 646 860LSTM-DAE 0 979 0 924 0 795 0 787 0 680 0 650 881BLSTM-DAE 0 984 0 934 0 787 0 798 0 685 0 651 887NP-MLP-CAE 4 985 5 883 1 788 2 750 1 652 5 640 864NP-LSTM-CAE 5 988 1 925 1 787 2 744 2 647 3 638 877NP-BLSTM-CAE 4 992 3 928 2 783 1 752 2 657 2 632 881NP-MLP-AE 4 985 5 859 1 790 2 746 1 647 2 628 855NP-LSTM-AE 5 987 1 921 1 781 1 750 1 650 3 644 876NP-BLSTM-AE 4 992 2 941 3 784 2 756 1 656 1 636 885NP-MLP-DAE 5 989 5 888 1 816 1 775 1 670 4 643 873NP-LSTM-DAE 5 991 1 942 4 804 1 764 2 662 1 652 888NP-BLSTM-DAE 5 994 3 944 2 807 3 785 2 667 1 656 893

the training phase with different values ] = 01 001 TheOCSVM was trained via the LIBSVM library [55] In thecase of GMM models were trained at different numbers ofGaussian components 2119899 with 119899 = 1 2 8 whereasleft-right HMMs were trained with different numbers ofstates 119904 = 3 4 5 and 2119899 Gaussian components with 119899 =1 2 7 GMMs and HMMs were trained using the Torch[56] toolkit The decision values produced as output of theOCSVM and the probability estimates produced as output ofthe probabilistic models were postprocessed with a similarthresholding algorithm (cf Section 5) in order to fairlycompare the performance among the different methods Forall the experiments and settings we maintained the samefeature set

8 Results

In this section we present and comment on the resultsobtained in our evaluation across the three databases

8.1. A3Novelty. Evaluations on the A3Novelty Corpus are reported in the first column of Table 2. In this dataset, GMMs and HMMs perform similarly; however, they are outperformed by the OCSVM, with a maximum improvement of 3.6% absolute F-measure. The autoencoder-based approaches significantly boost the performance, up to 98.7%. We observe a vast absolute improvement of up to 6.9% against the probabilistic approaches. Among the three configurations (CAE, AE, and DAE), we observe that the compression and denoising layouts with BLSTM units perform closely to each other, at up to 98.7% in the case of the BLSTM-CAE. This can be due to the fact that the dataset contains fewer variations in the background material used for training, and the feature selection operated internally by the AE increases the sensitivity of the reconstruction error.
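For reference, the frame-wise decision rule underlying all autoencoder results reported here is a threshold on the reconstruction error; a minimal sketch, in which the model, the features, and the threshold are placeholders:

    import numpy as np

    def novelty_decisions(model, features, threshold):
        """Flag frames whose reconstruction error exceeds a threshold.
        features: (T, D) array; model(x) returns the reconstruction of x."""
        recon = model(features)
        error = np.mean((features - recon) ** 2, axis=1)  # per-frame error
        return error > threshold  # True = frame detected as novel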

The nonlinear predictive results are shown in the last part of Table 2. We provide performance in the three named configurations and with the three named unit types. In concordance with what we found on the PASCAL database, the NP-BLSTM-DAE method provided the best performance in terms of F-measure, up to 99.4%. A significant absolute improvement (one-tailed z-test [57], p < 0.01; in the rest of the manuscript we report as "significant" those improvements with at least p < 0.01 under the one-tailed z-test [57]) of 10.0% F-measure is observed against the GMM-based approach, while an absolute improvement of 7.6% F-measure is exhibited with respect to the OCSVM method. We observe an overall improvement of approximately 1% between the "ordinary" and the "predictive" architectures.
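The significance criterion can be reproduced with a standard one-tailed two-proportion z-test; a sketch under the assumption that per-instance correctness proportions are compared (in the spirit of [57]), with hypothetical instance counts:

    import math
    from scipy.stats import norm

    def one_tailed_z_test(p1, n1, p2, n2):
        """One-tailed two-proportion z-test; H1: system 1 beats system 2."""
        pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
        se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
        z = (p1 - p2) / se
        return norm.sf(z)  # one-tailed p-value

    # e.g., one_tailed_z_test(0.994, 5000, 0.894, 5000) < 0.01 -> "significant"
    # (the counts 5000 are illustrative, not the actual database sizes)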

The performance obtained by progressively increasing the prediction delay (k) values (from 0 up to 10) is reported in Figure 4.

[Figure 4: A3Novelty performances obtained with NP-Autoencoders by progressively increasing the prediction delay. Axes: F-measure (%) versus prediction delay; curves: NP-{MLP, LSTM, BLSTM}-{CAE, AE, DAE}.]

We evaluated the compression autoencoder (CAE), the basic autoencoder (AE), and the denoising autoencoder (DAE) with MLP, LSTM, and BLSTM units, and we applied different layouts (cf. Section 7) per network type. However, for the sake of brevity, we only show the best configurations. The best results across all three unit types are 99.4% and 99.1% F-measure for the NP-BLSTM-DAE and NP-LSTM-DAE networks, respectively. These are obtained with a prediction delay of 5 frames, which translates into an overall delay of 50 ms. In general, the best performances are achieved with k = 4 or k = 5. Increasing the prediction delay up to 10 frames produces a heavy decrease in performance, down to 97.8% F-measure.
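To make the prediction delay concrete, a minimal sketch of how input/target frame pairs could be built for a delay of k frames (with 10 ms per frame, as above, k = 5 corresponds to the 50 ms overall delay):

    import numpy as np

    def delayed_pairs(features, k):
        """Input/target sequences for nonlinear prediction with delay k.
        features: (T, D) array of frame-level features."""
        if k == 0:
            return features, features  # degenerates to an ordinary autoencoder
        return features[:-k], features[k:]  # input frame t predicts frame t + k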

8.2. PASCAL CHiME. In the second column of Table 2 we report the performance obtained on the PASCAL dataset using different approaches. Parts of the results obtained on this database were also presented in [36]. Here we conducted additional experiments to evaluate the OCSVM and MLP approaches. The one-class SVM shows lower performance compared to probabilistic approaches such as GMM and HMM, which work reasonably well, up to 91.4% F-measure. The low OCSVM performance can be due to the fact that the dataset was generated artificially and the abnormal sound dynamics were normalized with respect to the "normal" material, making the margin maximisation more complex and less effective. Next, we evaluated AE-based approaches in the three configurations: compression (CAE), basic (AE), and denoising (DAE). We also evaluated MLP, LSTM, and BLSTM unit types. Among the three configurations, we observe that the denoising ones perform better than the others, independently of the type of unit. In particular, the best performance is obtained with the denoising autoencoder realised as a BLSTM RNN, showing up to 93.4% F-measure. The last three groups of rows in Table 2 show results of the NP approach, again in the three configurations and with the three unit types.

The NP-BLSTM-DAE achieved the best result, of up to 94.4% F-measure.

[Figure 5: PASCAL CHiME performances obtained with NP-Autoencoders by progressively increasing the prediction delay. Axes: F-measure (%) versus prediction delay; curves: NP-{MLP, LSTM, BLSTM}-{CAE, AE, DAE}.]

Significant absolute improvements of 5.0%, 3.0%, and 1.0% F-measure are observed over the GMM, HMM, and "ordinary" BLSTM-DAE approaches, respectively.

Interestingly, applying the nonlinear prediction scheme to the compression autoencoders, NP-(B)LSTM-CAE (92.8% F-measure), also increased the performance in comparison with the (B)LSTM-CAE (91.3% F-measure). In fact, in a previous work [35] the compression learning process alone showed scarce results. However, here the CAE with the nonlinear prediction encodes information on the input more effectively.

Figure 5 depicts results for increasing values of the prediction delay (k), ranging from 0 to 10. We evaluated CAE, AE, and DAE with MLP, LSTM, and BLSTM neural networks with different layouts (cf. Section 7) per network type. However, due to space restrictions, we only report the best performances. Here the best performances are obtained with a prediction delay of 3 frames (30 ms) for the NP-BLSTM-DAE network (94.4% F-measure) and of one frame in the case of the NP-LSTM-DAE (94.2% F-measure). As in the A3Novelty database, we observe a similar decrease in performance, down to 86.2% F-measure, when the prediction delay increases up to 10 frames, which corresponds to 100 ms. In fact, applying a higher prediction delay (e.g., 100 ms) induces higher values of the reconstruction error in the presence of fast periodic events, which subsequently leads to an increased false detection rate.

8.3. PROMETHEUS. This subsection elaborates on the results obtained on the four subsets present in the PROMETHEUS database.

8.3.1. ATM. The ATM scenario evaluations are shown in the third column of Table 2. The GMM and HMM perform similarly, at chance level. In fact, we observe an F-measure of 50.2% and 52.0% for GMMs and HMMs, respectively. The one-class SVM shows slightly better performance, of up to 60.2%.

[Figure 6: PROMETHEUS-ATM performances obtained with NP-Autoencoders by progressively increasing the prediction delay. Axes: F-measure (%) versus prediction delay; curves: NP-{MLP, LSTM, BLSTM}-{CAE, AE, DAE}.]

On the other hand, AE-based approaches in the three configurations (compression (CAE), traditional (AE), and denoising (DAE)) show a significant improvement in performance, up to 19.3% absolute F-measure, against the OCSVM. Among the three configurations, we observe that the DAE performs better, independently of the type of network. In particular, the best performance with the ordinary (without nonlinear prediction) approach is obtained with the DAE with an LSTM network, leading to an F-measure of 79.5%.

The last three groups of rows in Table 2 show results of the nonlinear predictive approach (NP). The nonlinear predictive denoising autoencoder performs best, up to 81.6% F-measure. Surprisingly, the best performance is obtained using MLP units, suggesting that for long events, as those contained in the ATM scenario (with an average duration of 60 s, cf. Table 1), memory-enhanced units such as (B)LSTM are not as effective as for shorter events.

A significant absolute improvement of 21.4% F-measure is observed against the OCSVM approach, while an absolute improvement of 31.4% F-measure is exhibited with respect to the GMM-based method. Among the two autoencoder-based approaches, we report an absolute improvement of 2.1% between the best "ordinary" and "predictive" structures. It must be observed that the performance presented in [31] is higher than the one provided in this article, since the tolerance window used in that study was set to 1 s, whereas here we aimed at a higher temporal resolution with a tolerance window of 200 ms, which is suitable also for abrupt events.
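The 200 ms tolerance window amounts to an onset-matching criterion when scoring detections; the following is a simplified sketch of one plausible greedy matching, not the exact evaluation code used here:

    def true_positives(detected, reference, tol=0.2):
        """Count detected event onsets (in seconds) that match a reference
        onset within +/- tol; each reference onset is matched at most once."""
        used = set()
        tp = 0
        for d in detected:
            for i, r in enumerate(reference):
                if i not in used and abs(d - r) <= tol:
                    used.add(i)
                    tp += 1
                    break
        return tp

    # Precision, recall, and F-measure then follow from tp, len(detected),
    # and len(reference).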

Figure 6 depicts performance for progressive values of the prediction delay (k), ranging from 0 to 10, applying a CAE, AE, and DAE with MLP, LSTM, and BLSTM networks. Several layouts (cf. Section 7) were evaluated per network type; however, we report only the best configurations. Setting a prediction delay of 1 frame, which corresponds to a total prediction delay of 10 ms, leads to the best performance, of up to 81.6% F-measure, with the NP-MLP-DAE network. In the case of the NP-BLSTM-DAE, we observe better performance with a delay of 2 frames, up to 80.7% F-measure. In general, we do not observe a consistent trend by increasing the prediction delay, corroborating the fact that for long events, as those contained in the ATM scenario, memory-enhanced units and a nonlinear predictive approach are not as effective as for shorter events.

8.3.2. Corridor. The evaluations on the corridor subset are shown in the fourth column of Table 2. The GMM and HMM perform similarly, at chance level. We observe an F-measure of 49.4% and 49.6% for GMM and HMM, respectively. The OCSVM shows better performance, up to 65.3%. As observed in the ATM scenario, again a significant improvement in performance, up to 14.5% absolute F-measure, is observed using the autoencoder-based approaches in the three configurations (CAE, AE, and DAE) with respect to the OCSVM. Among the three configurations, we observe that the denoising autoencoder performs better than the others. The best performance is obtained with the denoising autoencoder with BLSTM units, of up to 79.8% F-measure.

The "predictive" approach is reported in the last three groups of rows in Table 2. Interestingly, the nonlinear predictive autoencoders do not improve the performance as we have seen in the other scenarios. A plausible explanation can be found in the nature of the novelty events present in the subset. In fact, the subset contains very long events, with an average duration of up to 140 s per event. With such long events, the generative model does not introduce a more sensitive reconstruction error. However, the delta in performance between the BLSTM-DAE and the NP-BLSTM-DAE is rather small (1.3% F-measure), in favour of the "ordinary" approach. The best performance (79.8%) is obtained using BLSTM units, confirming that memory-enhanced units are more effective in the presence of short events. In fact, this scenario, besides very long events, also contains short fall and pain events, with an average duration of 10 s and 30 s, respectively.

A significant absolute improvement of up to 14.5% F-measure is observed against the OCSVM approach, while being even higher with respect to the GMM and HMM.

8.3.3. Outdoor. The evaluations on the outdoor subset are shown in the fifth column of Table 2. The OCSVM, GMM, and HMM perform better in this scenario as opposed to ATM and corridor. We observe an F-measure of 57.3%, 56.4%, and 56.0% for OCSVM, GMM, and HMM, respectively. In this scenario, the improvement brought by the autoencoder is not as vast as in the previous subsets, but it is still significant. We report an absolute improvement of 11.2% F-measure between the OCSVM and the BLSTM-DAE. Again, the denoising autoencoder performs better than the other configurations. In particular, the best performance, obtained with the BLSTM-DAE, is 68.5% F-measure.

As observed in the corridor scenario, the nonlinear predictive autoencoders (last three groups of rows in Table 2) do not improve the performance. These results corroborate our previous explanation that the long duration of the novelty events present in the subset affects the sensitivity of the reconstruction error in the generative model. However, the delta in performance between the BLSTM-DAE and the NP-BLSTM-DAE is rather small (1.8% F-measure).

It must be observed that the performance in this scenario is rather low compared to the one obtained on the other datasets. We believe that the presence of anger novel sounds introduces a higher degree of complexity for our autoencoder-based approach. In fact, anger novel events may contain different levels of aroused content, which could be acoustically similar to neutral spoken content present in the training material. Under this condition, the generative model shows a low reconstruction error. This issue could be solved by setting the novel events to only contain the aroused segments or, considering anger as a long-term speaker state, by increasing the temporal resolution of our system.

8.3.4. Smart-Room. The smart-room scenario evaluations are shown in the sixth column of Table 2. The OCSVM, GMM, and HMM perform better in this scenario as opposed to ATM, corridor, and outdoor. We observe an F-measure of 57.4%, 59.1%, and 59.1% for OCSVM, GMM, and HMM, respectively. In this scenario, the improvement brought about by the autoencoder is still significant. We report an absolute improvement of 6.0% F-measure between GMM/HMM and the BLSTM-DAE. Again, the denoising autoencoder performs better than the other configurations. In particular, the best performance with the ordinary approach is obtained with the BLSTM-DAE, of up to 65.1% F-measure.

The last three groups of rows in Table 2 show results of the nonlinear predictive approach (NP). The NP-BLSTM-DAE performs best, at up to 65.6% F-measure.

As in the outdoor subset, we report a low performance in the smart-room subset as well. In fact, the subset contains several long novel events related to spoken content expressing pain and fear. As commented for the outdoor scenario, under this condition the generative model may be able to reconstruct the novel event without producing a high reconstruction error.

8.4. Overall. Overall, the experimental results proved that the DAE methods achieve superior performance compared to the CAE/AE schemes. This is due to the combination of the two learning processes of a denoising autoencoder: encoding the input while preserving the information about the input itself, and simultaneously reversing the effect of a corruption process applied to the input of the autoencoder.
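A single denoising training step can be sketched as follows (PyTorch-style; additive Gaussian corruption is assumed here purely for illustration of the corruption process, and the model is any autoencoder network):

    import torch

    def dae_step(model, x, optimizer, noise_std=0.1):
        """One denoising-autoencoder update: encode a corrupted input and
        score the reconstruction against the *clean* input."""
        x_noisy = x + noise_std * torch.randn_like(x)  # corruption (assumed Gaussian)
        loss = torch.mean((model(x_noisy) - x) ** 2)   # reconstruct the clean frames
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()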

In particular, the predictive approach with (B)LSTM units showed the best performance, of up to 89.3% average F-measure among all six datasets, weighted by the number of instances per database (cf. Table 2).
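The weighted average used here and in Table 2 is simply the instance-weighted mean of the per-dataset F-measures:

    def weighted_average_f1(f1_per_dataset, n_instances):
        """Average F-measure weighted by the number of instances per dataset."""
        total = sum(n_instances)
        return sum(f * n for f, n in zip(f1_per_dataset, n_instances)) / total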

To better understand the improvement brought by the RNN-based approaches, we provide in Figure 7 the comparison between state-of-the-art methods in terms of weighted average F-measure computed across the A3Novelty Corpus, PASCAL CHiME, and PROMETHEUS. In general, we observe that the recently proposed NP-BLSTM-DAE method provided the best performance in terms of average F-measure, of up to 89.3%. A significant absolute improvement of 16.0% average F-measure is observed against the OCSVM approach, while absolute improvements of 10.6% and 10.4% average F-measure are exhibited with respect to the GMM- and HMM-based methods. An absolute improvement of 0.6% is observed over the "ordinary" BLSTM-DAE.

[Figure 7: Average F-measure (%) computed over the A3Novelty Corpus, PASCAL CHiME, and PROMETHEUS, weighted by the number of instances per database; extended comparison of methods. The plotted values correspond to the last column of Table 2.]

It has to be noted that the average F-measure is computed including the PROMETHEUS database, for which the performance has been shown to be lower because it contains long-term events and a lower resolution in the labels (1 s).

The RNN-based schemes also bring an evident benefit when applied to the "normal" autoencoders (i.e., with no denoising or compression); in fact, the NP-BLSTM-AE achieves an F-measure of 88.5%. Furthermore, when we applied the nonlinear prediction scheme to a denoising autoencoder, the performance achieved with LSTM units was in this case comparable with BLSTM units, and it also outperformed state-of-the-art approaches.

In conclusion, the combination of the nonlinear prediction paradigm and the various (B)LSTM autoencoders proved to be effective, significantly outperforming other state-of-the-art methods. Additionally, there is evidence that memory-enhanced units such as LSTM and BLSTM outperform MLPs without memory, showing that knowledge of the temporal context can improve the novelty detector's abilities.

9. Conclusions and Outlook

We presented a broad and extensive evaluation of state-of-the-art methods, with a particular focus on novel and recent unsupervised approaches based on RNN-based autoencoders. We significantly extended the studies conducted in [35, 36] by evaluating further approaches, such as one-class support vector machines (OCSVMs) and multilayer perceptrons (MLPs), and, most importantly, we conducted a broad evaluation on three different datasets for a total number of 160,153 experiments, making this article the first to present such a complete evaluation in the field of acoustic novelty detection. We showed evidently that RNN-based autoencoders significantly outperform other methods, achieving up to 89.3% weighted average F-measure on the three databases, with a significant absolute improvement of 10.4% against the best performance obtained with statistical approaches (HMM). Overall, a significant increase in performance was achieved by combining the (B)LSTM autoencoder-based architecture with the nonlinear prediction scheme.

Future work will focus on using multiresolution features [51, 58], likely more suitable to deal with different event durations, in order to face the issues encountered on the PROMETHEUS database. Further studies will be conducted on other architectures of RNN-based autoencoders, ranging from dynamic Bayesian networks [59] to convolutional neural networks [60].

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007–2013) under Grant Agreements no. 338164 (ERC Starting Grant iHEARu), no. 645378 (RIA ARIA-VALUSPA), and no. 688835 (RIA DE-ENIGMA).

References

[1] L. Tarassenko, P. Hayton, N. Cerneaz, and M. Brady, "Novelty detection for the identification of masses in mammograms," in Proceedings of the 4th International Conference on Artificial Neural Networks, pp. 442–447, June 1995.

[2] J. Quinn and C. Williams, "Known unknowns: novelty detection in condition monitoring," in Pattern Recognition and Image Analysis, J. Mart, J. Bened, A. Mendona, and J. Serrat, Eds., vol. 4477 of Lecture Notes in Computer Science, pp. 1–6, Springer, Berlin, Germany, 2007.

[3] L. Clifton, D. A. Clifton, P. J. Watkinson, and L. Tarassenko, "Identification of patient deterioration in vital-sign data using one-class support vector machines," in Proceedings of the Federated Conference on Computer Science and Information Systems (FedCSIS '11), pp. 125–131, September 2011.

[4] Y.-H. Liu, Y.-C. Liu, and Y.-J. Chen, "Fast support vector data descriptions for novelty detection," IEEE Transactions on Neural Networks, vol. 21, no. 8, pp. 1296–1313, 2010.

[5] C. Surace and K. Worden, "Novelty detection in a changing environment: a negative selection approach," Mechanical Systems and Signal Processing, vol. 24, no. 4, pp. 1114–1128, 2010.

[6] J. A. Quinn, C. K. I. Williams, and N. McIntosh, "Factorial switching linear dynamical systems applied to physiological condition monitoring," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 9, pp. 1537–1551, 2009.

[7] A. Patcha and J.-M. Park, "An overview of anomaly detection techniques: existing solutions and latest technological trends," Computer Networks, vol. 51, no. 12, pp. 3448–3470, 2007.

[8] M. Markou and S. Singh, "A neural network-based novelty detector for image sequence analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1664–1677, 2006.

[9] M. Markou and S. Singh, "Novelty detection: a review, part 1: statistical approaches," Signal Processing, vol. 83, no. 12, pp. 2481–2497, 2003.

[10] M. Markou and S. Singh, "Novelty detection: a review, part 2: neural network based approaches," Signal Processing, vol. 83, no. 12, pp. 2499–2521, 2003.

[11] E. R. de Faria, I. R. Goncalves, J. Gama, and A. C. Carvalho, "Evaluation of multiclass novelty detection algorithms for data streams," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 11, pp. 2961–2973, 2015.

[12] C. Satheesh Chandran, S. Kamal, A. Mujeeb, and M. H. Supriya, "Novel class detection of underwater targets using self-organizing neural networks," in Proceedings of the IEEE Underwater Technology (UT '15), Chennai, India, February 2015.

[13] K. Worden, G. Manson, and D. Allman, "Experimental validation of a structural health monitoring methodology: part I. Novelty detection on a laboratory structure," Journal of Sound and Vibration, vol. 259, no. 2, pp. 323–343, 2003.

[14] J. Foote, "Automatic audio segmentation using a measure of audio novelty," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '00), vol. 1, pp. 452–455, August 2000.

[15] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Proceedings of the Neural Information Processing Systems (NIPS '99), vol. 12, pp. 582–588, Denver, Colo, USA, 1999.

[16] J. Ma and S. Perkins, "Time-series novelty detection using one-class support vector machines," in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '03), vol. 3, pp. 1741–1745, Portland, Ore, USA, July 2003.

[17] J. Ma and S. Perkins, "Online novelty detection on temporal sequences," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), pp. 613–618, ACM, Washington, DC, USA, August 2003.

[18] P. Hayton, B. Schölkopf, L. Tarassenko, and P. Anuzis, "Support vector novelty detection applied to jet engine vibration spectra," in Proceedings of the 14th Annual Neural Information Processing Systems Conference (NIPS '00), pp. 946–952, December 2000.

[19] L. Tarassenko, A. Nairac, N. Townsend, and P. Cowley, "Novelty detection in jet engines," in Proceedings of the IEE Colloquium on Condition Monitoring: Machinery, External Structures and Health, pp. 41–45, Birmingham, UK, April 1999.

[20] L. Clifton, D. A. Clifton, Y. Zhang, P. Watkinson, L. Tarassenko, and H. Yin, "Probabilistic novelty detection with support vector machines," IEEE Transactions on Reliability, vol. 63, no. 2, pp. 455–467, 2014.

[21] D. R. Hardoon and L. M. Manevitz, "One-class machine learning approach for fMRI analysis," in Proceedings of the Postgraduate Research Conference in Electronics, Photonics, Communications and Networks, and Computer Science, Lancaster, UK, April 2005.

[22] M. Davy, F. Desobry, A. Gretton, and C. Doncarli, "An online support vector machine for abnormal events detection," Signal Processing, vol. 86, no. 8, pp. 2009–2025, 2006.

[23] M. A. F. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko, "A review of novelty detection," Signal Processing, vol. 99, pp. 215–249, 2014.

[24] N. Japkowicz, C. Myers, M. Gluck, et al., "A novelty detection approach to classification," in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI '95), vol. 1, pp. 518–523, Montreal, Canada, 1995.

[25] B. B. Thompson, R. J. Marks, J. J. Choi, M. A. El-Sharkawi, M.-Y. Huang, and C. Bunje, "Implicit learning in autoencoder novelty assessment," in Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN '02), vol. 3, pp. 2878–2883, IEEE, Honolulu, Hawaii, USA, 2002.

[26] S. Hawkins, H. He, G. Williams, and R. Baxter, "Outlier detection using replicator neural networks," in Data Warehousing and Knowledge Discovery, pp. 170–180, Springer, Berlin, Germany, 2002.

[27] N. Japkowicz, "Supervised versus unsupervised binary-learning by feedforward neural networks," Machine Learning, vol. 42, no. 1-2, pp. 97–122, 2001.

[28] L. Manevitz and M. Yousef, "One-class document classification via neural networks," Neurocomputing, vol. 70, no. 7–9, pp. 1466–1481, 2007.

[29] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, IEEE Computer Society, December 2002.

[30] H. Sohn, K. Worden, and C. R. Farrar, "Statistical damage classification under changing environmental and operational conditions," Journal of Intelligent Material Systems and Structures, vol. 13, no. 9, pp. 561–574, 2002.

[31] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "Probabilistic novelty detection for acoustic surveillance under real-world conditions," IEEE Transactions on Multimedia, vol. 13, no. 4, pp. 713–719, 2011.

[32] E. Principi, S. Squartini, R. Bonfigli, G. Ferroni, and F. Piazza, "An integrated system for voice command recognition and emergency detection based on audio signals," Expert Systems with Applications, vol. 42, no. 13, pp. 5668–5683, 2015.

[33] A. Ito, A. Aiba, M. Ito, and S. Makino, "Evaluation of abnormal sound detection using multi-stage GMM in various environments," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH '11), pp. 301–304, Florence, Italy, August 2011.

[34] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "On acoustic surveillance of hazardous situations," in Proceedings of the 34th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 165–168, April 2009.

[35] E. Marchi, F. Vesperini, F. Eyben, S. Squartini, and B. Schuller, "A novel approach for automatic acoustic novelty detection using a denoising autoencoder with bidirectional LSTM neural networks," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), 5 pages, IEEE, Brisbane, Australia, April 2015.

[36] E. Marchi, F. Vesperini, F. Weninger, F. Eyben, S. Squartini, and B. Schuller, "Non-linear prediction with LSTM recurrent neural networks for acoustic novelty detection," in Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN '15), p. 5, IEEE, Killarney, Ireland, July 2015.

[37] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, "Learning precise timing with LSTM recurrent networks," The Journal of Machine Learning Research, vol. 3, no. 1, pp. 115–143, 2003.

[38] A. Graves, "Generating sequences with recurrent neural networks," https://arxiv.org/abs/1308.0850.

[39] D. Eck and J. Schmidhuber, "Finding temporal structure in music: blues improvisation with LSTM recurrent networks," in Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pp. 747–756, IEEE, Martigny, Switzerland, September 2002.

[40] J. A. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," Tech. Rep. 97-021, International Computer Science Institute, Berkeley, Calif, USA, 1998.

[41] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[42] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Advances in Neural Information Processing Systems 12, S. Solla, T. Leen, and K. Müller, Eds., pp. 582–588, MIT Press, 2000.

[43] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, 1986.

[44] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[45] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[46] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5-6, pp. 602–610, 2005.

[47] G. Hinton, L. Deng, D. Yu, et al., "Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.

[48] I. Goodfellow, H. Lee, Q. V. Le, A. Saxe, and A. Y. Ng, "Measuring invariances in deep networks," in Advances in Neural Information Processing Systems, Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, Eds., vol. 22, pp. 646–654, Curran & Associates Inc., Red Hook, NY, USA, 2009.

[49] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds., pp. 153–160, MIT Press, 2007.

[50] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.

[51] E. Marchi, G. Ferroni, F. Eyben, L. Gabrielli, S. Squartini, and B. Schuller, "Multi-resolution linear prediction based features for audio onset detection with bidirectional LSTM neural networks," in Proceedings of the 39th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 2183–2187, IEEE, Florence, Italy, May 2014.

[52] F. Eyben, F. Weninger, F. Gross, and B. Schuller, "Recent developments in openSMILE, the Munich open-source multimedia feature extractor," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 835–838, ACM, Barcelona, Spain, October 2013.

[53] J. Barker, E. Vincent, N. Ma, H. Christensen, and P. Green, "The PASCAL CHiME speech separation and recognition challenge," Computer Speech & Language, vol. 27, no. 3, pp. 621–633, 2013.

[54] F. Weninger, J. Bergmann, and B. Schuller, "Introducing CURRENNT: the Munich open-source CUDA RecurREnt Neural Network Toolkit," Journal of Machine Learning Research, vol. 16, pp. 547–551, 2015.

[55] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, article 27, 2011.

[56] R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: a Matlab-like environment for machine learning," BigLearn, NIPS Workshop, no. EPFL-CONF-192376, 2011.

[57] M. D. Smucker, J. Allan, and B. Carterette, "A comparison of statistical significance tests for information retrieval evaluation," in Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM '07), pp. 623–632, ACM, Lisbon, Portugal, November 2007.

[58] E. Marchi, G. Ferroni, F. Eyben, S. Squartini, and B. Schuller, "Audio onset detection: a wavelet packet based approach with recurrent neural networks," in Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN '14), pp. 3585–3591, Beijing, China, July 2014.

[59] N. Friedman, K. Murphy, and S. Russell, "Learning the structure of dynamic probabilistic networks," in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI '98), pp. 139–147, Morgan Kaufmann, Madison, Wisc, USA, July 1998.

[60] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, "Face recognition: a convolutional neural-network approach," IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98–113, 1997.


Page 9: Deep Recurrent Neural Network-Based Autoencoders for ...downloads.hindawi.com/journals/cin/2017/4694860.pdf · ResearchArticle Deep Recurrent Neural Network-Based Autoencoders for

Computational Intelligence and Neuroscience 9

970

975

980

985

990

995

F-m

easu

re (

)

1 2 3 4 50 10

Prediction delay

NP-BLSTM-CAENP-LSTM-CAENP-MLP-CAE

NP-BLSTM-AENP-LSTM-AENP-MLP-AE

NP-BLSTM-DAENP-LSTM-DAENP-MLP-DAE

Figure 4 A3Novelty performances obtained with NP-Autoencod-ers by progressively increasing the prediction delay

the basic autoencoder (AE) and the denoising autoencoder(DAE) with MLP LSTM and BLSTM units and we applieddifferent layouts (cf Section 7) per network type Howeverfor the sake of brevity we only show the best configurationsThe best results across all the three unit types are 994and 991 119865-measure for the NP-BLSTM-DAE and NP-LSTM-DAE networks respectivelyThese are obtained with aprediction delay of 5 frames which translates into an overalldelay of 50ms In general the best performances are achievedwith 119896 = 4 or 119896 = 5 Increasing the prediction delay up to 10frames produces a heavy decrease in performance down to978 119865-measure

82 PASCAL CHiME In the first column of Table 2 wereport the performance obtained on the PASCAL datasetusing different approaches Parts of the results obtained onthis database were also presented in [36] Here we con-ducted additional experiments to evaluate OCSVM andMLPapproaches The one-class SVM shows lower performancecompared to probabilistic approaches such as GMM andHMM which seems to work reasonably well up to 914 119865-measure The OCSVM low performance can be due tothe fact that the dataset was generated artificially and theabnormal sound dynamics were normalized with respectto the ldquonormalrdquo material making the margin maximisationmore complex and less effectiveNext we evaluatedAE-basedapproaches in the three configurations compression (CAE)basic (AE) and denoising (DAE) We also evaluated MLPLSTM and BLSTM unit types Among the three configura-tions we observe that denoising ones perform better than theothers independently of the type of unit In particular thebest performance is obtained with the denoising autoencoderrealised as BLSTM RNN showing up to 934 119865-measureThe last three groups of rows in Table 2 show results of theNP approach again in the three configurations and with thethree unit types

The NP-BLSTM-DAE achieved the best result of upto 944 119865-measure Significant absolute improvements of

1 2 3 4 50 10

Prediction delay

NP-BLSTM-CAENP-LSTM-CAENP-MLP-CAE

NP-BLSTM-AENP-LSTM-AENP-MLP-AE

NP-BLSTM-DAENP-LSTM-DAENP-MLP-DAE

610

650

690

730

770

810

850

890

930

970

F-m

easu

re (

)

Figure 5 PASCAL CHiME performances obtained with NP-Autoencoders by progressively increasing the prediction delay

40 30 and 1 are observed over GMM HMM andldquoordinaryrdquo BLSTM-DAE approaches respectively

Interestingly applying the nonlinear prediction schemeto the compression autoencoders NP-(B)LSTM-CAE(928 119865-measure) also increased the performances incomparison with the (B)LSTM-CAE (913 119865-measure)In fact in a previous work [35] the compression learningprocess alone showed scarce results However here the CAEwith the nonlinear prediction encodes information on theinput more effectively

Figure 5 depicts results for increasing values of theprediction delay (119896) ranging from 0 to 10 We evaluatedCAE AE and DAE with MLP LSTM and BLSTM neuralnetworks with different layouts (cf Section 7) per networktype However due to space restrictions we only report thebest performances Here the best performances are obtainedwith a prediction delay of 3 frames (30ms) for the NP-BLSTM-DAE network (944 119865-measure) and of one framein the case of NP-LSTM-DAE (942 119865-measure) As inthe A3Novelty database we observe a similar decrease inperformance down to 862 119865-measure when the predictiondelay increases up to 10 which corresponds to 100ms Infact applying a higher prediction delay (eg 100ms) induceshigher values of the reconstruction error in the presence offast periodic events which subsequently leads to an increasedfalse detection rate

83 PROMETHEUS This subsection elaborates on theresults obtained on the four subsets present in the PROME-THEUS database

831 ATM The ATM scenario evaluations are shown inthe third column of Table 2 The GMM and HMM performsimilarly at chance level In fact we observe an 119865-measureof 502 and 520 for GMMs and HMMs respectivelyThe one-class SVM shows slightly better performance of upto 602 On the other hand AE-based approaches in thethree configurationsmdashcompression (CAE) traditional (AE)

10 Computational Intelligence and Neuroscience

1 2 3 4 50 10

Prediction delay

NP-BLSTM-CAENP-LSTM-CAENP-MLP-CAE

NP-BLSTM-AENP-LSTM-AENP-MLP-AE

NP-BLSTM-DAENP-LSTM-DAENP-MLP-DAE

720

740

760

780

800

820

F-m

easu

re (

)

Figure 6 PROMETHEUS-ATM performances obtained with NP-Autoencoders by progressively increasing the prediction delay

and denoising (DAE)mdashshow a significant improvement inperformance up to 193 absolute 119865-measure against theOCSVM Among the three configurations we observe thatDAE performs better independently of the type of networkIn particular the best performance considering the ordinary(without nonlinear prediction) approach is obtained with theDAEwith a LSTMnetwork leading to an119865-measure of 795

The last three groups of rows in Table 2 show results of thenonlinear predictive approach (NP)Thenonlinear predictivedenoising autoencoder performs best up to 816 119865-measureSurprisingly the best performance is obtained using MLPunits suggesting that for long eventsmdashas those containedin the ATM scenario (with an average duration of 60 s cfTable 1)mdashmemory-enhanced units such as (B)LSTM are notas effective as for shorter events

A significant absolute improvement of 214 119865-measureis observed against the OCSVM approach while an absoluteimprovement of 314 119865-measure is exhibited with respectto the GMM-based method Among the two autoencoder-based approaches we report an absolute improvement of 10between the namely ldquoordinaryrdquo and ldquopredictiverdquo structuresItmust be observed that the performance presented in [31] arehigher than the one provided in this article since the tolerancewindow used in that study was set to 1 s whereas herewe aimed at a higher temporal resolution with a tolerancewindow of 200ms which is suitable also for abrupt events

Figure 6 depicts performance for progressive values ofthe prediction delay (119896) ranging from 0 to 10 applying aCAE AE and DAE with MLP LSTM and BLSTM networksSeveral layouts (cf Section 7) were evaluated per networktype however we report only the best configurations Settinga prediction delay of 1 frame which corresponds to a totalprediction delay of 10ms leads to the best performance ofup to 816 119865-measure in the NP-MLP-DAE network In thecase of the NP-BLSTM-DAE we observe better performancewith a delay of 2 frames up to 807 119865-measure In generalwe do not observe a consistent trend by increasing theprediction delay corroborating the fact that for long eventsas those contained in the ATM scenario memory-enhanced

units and a nonlinear predictive approach are not as effectiveas for shorter events

832 Corridor The evaluations on the corridor subset areshown in the fourth column of Table 2TheGMM andHMMperform similarly at chance level We observe an 119865-measureof 494 and 496 for GMM and HMM respectivelyThe OCSVM shows better performance up to 653 Asobserved in the ATM scenario again a significant improve-ment in performance up to 165 absolute 119865-measure isobserved using the autoencoder-based approaches in thethree configurations (CAE AE and DAE) with respect tothe OCSVM Among the three configurations we observethat the denoising autoencoder performs better than theothers The best performance is obtained with the denoisingautoencoder with a BLSTM unit of up to 798 119865-measure

The ldquopredictiverdquo approach is reported in the last threegroups of rows in Table 2 Interestingly the nonlinear pre-dictive autoencoders do not improve the performance as wehave seen in the other scenarios A plausible explanationcan be found based on the nature of the novelty eventspresent in the subset In fact the subset contains verylong events with an average duration of up to 140 s perevent With such long events the generative model does notintroduce a more sensitive reconstruction error Howeverthe delta in performance between the BLSTM-DAE and theNP-BLSTM-DAE is rather small (13 119865-measure) in favourof the ldquoordinaryrdquo approach The best performance (798)is obtained using BLSTM units confirming that memory-enhanced units are more effective in the presence of shortevents In fact this scenariomdashbesides very long eventsmdashalsocontains fall and pain short events with an average durationof 10 s and 30 s respectively

A significant absolute improvement up to 165 119865-measure is observed against the OCSVM approach whilebeing even higher with respect to the GMM and HMM

833 Outdoor The evaluations on the outdoor subset areshown in the fifth column of Table 2 The OCSVM GMMand HMM perform better in this scenario as opposed toATMand corridorWeobserve an119865-measure of 573 564and 560 for OCSVM GMM and HMM respectively Inthis scenario the improvement brought by the autoencoderis not as vast as in the previous subsets but still significantWe report an absolute improvement of 112 119865-measurebetween OCSVM and BLSTM-DAE Again the denoisingautoencoder performs better than the other configurationsIn particular the best performance obtained with BLSTM-DAE is 685 119865-measure

As observed in the corridor scenario the nonlinearpredictive autoencoders (last three groups of rows in Table 2)do not improve the performance These results corroborateour previous explanation that the long duration nature of thenovelty events present in the subset affects the sensitivity ofthe reconstruction error in the generative model Howeverthe delta in performance between the BLSTM-DAE and NP-BLSTM-DAE is rather small (13 119865-measure)

It must be observed that the performance in this scenariois rather low compared to the one obtained in the other

Computational Intelligence and Neuroscience 11

datasets We believe that the presence of anger novel soundsintroduces a higher degree of complexity in our autoencoder-based approach In fact anger novel events may containdifferent level of aroused content which could be acousticallysimilar to neutral spoken content present in the trainingmaterial Under this condition the generative model shows alow reconstruction errorThis issue could be solved by settingthe novel events to only contain the aroused segments ormdashconsidering anger as a long-term speaker statemdashincreasingthe temporal resolution of our system

834 Smart-Room Thesmart-room scenario evaluations areshown in the sixth column of Table 2 The OCSVM GMMand HMM perform better in this scenario as opposed toATM corridor and outdoor We observe an 119865-measure of574 591 and 591 for OCSVM GMM and HMMrespectively In this scenario the improvement brought aboutby the autoencoder is still significant We report an absoluteimprovement of 60 119865-measure between GMMHMM andthe BLSTM-DAE Again the denoising autoencoder per-forms better than the other configurations In particular thebest performance in the ordinary approach is obtained withthe BLSTM-DAE of up to 651 119865-measure

The last three groups of rows in Table 2 show results of thenonlinear predictive approach (NP) The NP-BLSTM-DAEperforms best at up to 656 119865-measure

As in the outdoor subset we report a low performancein the smart-room subset as well In fact the subset containsseveral long novel events related to spoken content expressingpain and fear As commented in the outdoor scenariounder this condition the generative model may be ableto reconstruct the novel event without producing a highreconstruction error

84 Overall Overall the experimental results proved thattheDAEmethods achieved superior performances comparedto the CAEAE schemes This is due to the combination oftwo leaning processes of a denoising autoencoder such asthe process of encoding of the input by preserving the infor-mation about the input itself and simultaneously reversingthe effect of a corruption process applied to the input of theautoencoder

In particular the predictive approachwith (B)LSTMunitsshowed the best performance of up to 893 average 119865-measure among all the six different datasets weighted by thenumber of instances per database (cf Table 2)

To better understand the improvement brought by theRNN-based approaches we provide in Figure 7 the compar-ison between state-of-the-art methods in terms of weightedaverage 119865-measure computed across the A3Novelty Cor-pus PASCAL CHiME and PROMETHEUS In general weobserve that the recently proposedNP-BLSTM-DAEmethodprovided the best performance in terms of average119865-measureof up to 893 A significant absolute improvement of160 average 119865-measure is observed against the OCSVMapproach while an absolute improvement of 106 and 104average119865-measure is exhibitedwith respect to theGMM- andHMM-based methods An absolute improvement of 06is observed over the ldquoordinaryrdquo BLSTM-DAE It has to be

OCS

VM

GM

MH

MM

MLP

-CA

ELS

TM-C

AE

BLST

M-C

AE

MLP

-AE

LSTM

-AE

BLST

M-A

EM

LP-D

AE

LSTM

-DA

EBL

STM

-DA

EN

P-M

LP-C

AE

NP-

LSTM

-CA

EN

P-BL

STM

-CA

EN

P-M

LP-A

EN

P-LS

TM-A

EN

P-BL

STM

-AE

NP-

MLP

-DA

EN

P-LS

TM-D

AE

NP-

BLST

M-D

AE720

740760780800820840860880900

F-m

easu

re (

)

787

873

733

862848

789

860848

887 881864

860

881 877864 855

885876

893888873

Figure 7 Average 119865-measure computed over A3Novelty CorpusPASCAL and PROMETHEUS weighted by the of instance perdatabase Extended comparison of methods

noted that the average 119865-measure is computed includingthe PROMETHEUS database for which the performance hasbeen shown to be lower because it contains long-term eventsand a lower resolution in the labels (1 s)

The RNN-based schemes also bring an evident bene-fit when applied to the ldquonormalrdquo autoencoders (ie withno denoising or compression) in fact the NP-BLSTM-AEachieves an 119865-measure of 885 Furthermore when weapplied the nonlinear prediction scheme to a denoisingautoencoder the performance achieved with LSTM was inthis case comparable with BLSTM units and also outper-formed state-of-the-art approaches

In conclusion the combination of the nonlinear pre-diction paradigm and the various (B)LSTM autoencodersproved to be effective outperforming significantly otherstate-of-the-art methods Additionally there is evidence thatmemory-enhanced units such as LSTM and BLSTM outper-formed MLP without memory showing that the knowledgeof the temporal context can improve the novelty detectorabilities

9 Conclusions and Outlook

We presented a broad and extensive evaluation of state-of-the-art methods with a particular focus on novel and recentunsupervised approaches based on RNN-based autoen-coders We significantly extended the studies conductedin [35 36] by evaluating further approaches such as one-class support vector machines (OCSVMs) and multilayerperceptron (MLP) and most importantly we conducted abroad evaluation on three different datasets for a total numberof 160153 experiments making this article the first to presentsuch a complete evaluation in the field of acoustic noveltydetection We show evidently that RNN-based autoencoderssignificantly outperform other methods by achieving up to893 weighted average 119865-measure on the three databaseswith a significant absolute improvement of 104 againstthe best performance obtained with statistical approaches(HMM) Overall a significant increase in performance wasachieved by combining the (B)LSTM autoencoder-basedarchitecture with the nonlinear prediction scheme

Future works will focus on using multiresolution features[51 58] likely more suitable to deal with different eventdurations in order to face the issues encountered in the

12 Computational Intelligence and Neuroscience

PROMETHEUS database Further studies will be conductedon other architectures of RNN-based autoencoders rangingfrom dynamic Bayesian networks [59] to convolutional neu-ral networks [60]

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The research leading to these results has received fund-ing from the European Communityrsquos Seventh FrameworkProgramme (FP72007ndash2013) under Grant Agreement no338164 (ERC StartingGrant iHEARu) no 645378 (RIAARIAVALUSPA) and no 688835 (RIA DE-ENIGMA)

References

[1] L Tarassenko P Hayton N Cerneaz and M Brady ldquoNoveltydetection for the identification of masses in mammogramsrdquoin Proceedings of the 4th International Conference on ArtificialNeural Networks pp 442ndash447 June 1995

[2] J Quinn and C Williams ldquoKnown unknowns novelty detec-tion in conditionmonitoringrdquo in Pattern Recognition and ImageAnalysis J Mart J Bened A Mendona and J Serrat Eds vol4477 of Lecture Notes in Computer Science pp 1ndash6 SpringerBerlin Germany 2007

[3] L Clifton D A Clifton P J Watkinson and L TarassenkoldquoIdentification of patient deterioration in vital-sign data usingone-class support vector machinesrdquo in Proceedings of the Feder-ated Conference on Computer Science and Information Systems(FedCSIS rsquo11) pp 125ndash131 September 2011

[4] Y-H Liu Y-C Liu and Y-J Chen ldquoFast support vector datadescriptions for novelty detectionrdquo IEEETransactions onNeuralNetworks vol 21 no 8 pp 1296ndash1313 2010

[5] C Surace and K Worden ldquoNovelty detection in a changingenvironment a negative selection approachrdquo Mechanical Sys-tems and Signal Processing vol 24 no 4 pp 1114ndash1128 2010

[6] J A Quinn C K I Williams and N McIntosh ldquoFactorialswitching linear dynamical systems applied to physiologicalcondition monitoringrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 31 no 9 pp 1537ndash1551 2009

[7] A Patcha and J-M Park ldquoAn overview of anomaly detectiontechniques existing solutions and latest technological trendsrdquoComputer Networks vol 51 no 12 pp 3448ndash3470 2007

[8] M Markou and S Singh ldquoA neural network-based noveltydetector for image sequence analysisrdquo IEEE Transactions onPattern Analysis and Machine Intelligence vol 28 no 10 pp1664ndash1677 2006

[9] M Markou and S Singh ldquoNovelty detection a reviewmdashpart1 statistical approachesrdquo Signal Processing vol 83 no 12 pp2481ndash2497 2003

[10] M Markou and S Singh ldquoNovelty detection a reviewmdashpart 2neural network based approachesrdquo Signal Processing vol 83 no12 pp 2499ndash2521 2003

[11] E R de Faria I R Goncalves J a Gama and A C CarvalholdquoEvaluation of multiclass novelty detection algorithms for datastreamsrdquo IEEE Transactions on Knowledge and Data Engineer-ing vol 27 no 11 pp 2961ndash2973 2015

[12] C Satheesh Chandran S Kamal A Mujeeb and M HSupriya ldquoNovel class detection of underwater targets usingself-organizing neural networksrdquo in Proceedings of the IEEEUnderwater Technology (UT rsquo15) Chennai India February 2015

[13] K Worden G Manson and D Allman ldquoExperimental vali-dation of a structural health monitoring methodology part INovelty detection on a laboratory structurerdquo Journal of Soundand Vibration vol 259 no 2 pp 323ndash343 2003

[14] J Foote ldquoAutomatic audio segmentation using a measure ofaudio noveltyrdquo in Proceedings of the IEEE Internatinal Confer-ence on Multimedia and Expo (ICME rsquo00) vol 1 pp 452ndash455August 2000

[15] B Scholkopf R C Williamson A J Smola J Shawe-Taylorand J C Platt ldquoSupport vectormethod for novelty detectionrdquo inProceedings of the Neural Information Processing Systems (NIPSrsquo99) vol 12 pp 582ndash588 Denver Colo USA 1999

[16] J Ma and S Perkins ldquoTime-series novelty detection usingone-class support vector machinesrdquo in Proceedings of the IEEEInternational Joint Conference on Neural Networks (IJCNN rsquo03)vol 3 pp 1741ndash1745 Portland Ore USA July 2003

[17] J Ma and S Perkins ldquoOnline novelty detection on temporalsequencesrdquo in Proceedings of the 9th ACM SIGKDD Interna-tional Conference on Knowledge Discovery and Data Mining(KDD rsquo03) pp 613ndash618 ACM Washington DC USA August2003

[18] P Hayton B Scholkopf L Tarassenko and P Anuzis ldquoSupportvector novelty detection applied to jet engine vibration spectrardquoin Proceedings of the 14th Annual Neural Information ProcessingSystems Conference (NIPS rsquo00) pp 946ndash952 December 2000

[19] L Tarassenko A Nairac N Townsend and P Cowley ldquoNoveltydetection in jet enginesrdquo in Proceedings of the IEE Colloquiumon Condition Monitoring Machinery External Structures andHealth pp 41ndash45 Birmingham UK April 1999

[20] L Clifton D A Clifton Y Zhang P Watkinson L TarassenkoandH Yin ldquoProbabilistic novelty detection with support vectormachinesrdquo IEEE Transactions on Reliability vol 63 no 2 pp455ndash467 2014

[21] D R Hardoon and L M Manevitz ldquoOne-class machinelearning approach for fmri analysisrdquo in Proceedings of the Post-graduate Research Conference in Electronics Photonics Commu-nications and Networks and Computer Science Lancaster UKApril 2005

[22] M Davy F Desobry A Gretton and C Doncarli ldquoAn onlinesupport vector machine for abnormal events detectionrdquo SignalProcessing vol 86 no 8 pp 2009ndash2025 2006

[23] M A F Pimentel D A Clifton L Clifton and L TarassenkoldquoA review of novelty detectionrdquo Signal Processing vol 99 pp215ndash249 2014

[24] N Japkowicz C Myers M Gluck et al ldquoA novelty detectionapproach to classificationrdquo in Proceedings of the 14th interna-tional joint conference on Artificial intelligence (IJCAI rsquo95) vol1 pp 518ndash523 Montreal Canada 1995

[25] B B Thompson R J Marks J J Choi M A El-SharkawiM-Y Huang and C Bunje ldquoImplicit learning in autoencodernovelty assessmentrdquo in Proceedings of the 2002 InternationalJoint Conference on Neural Networks (IJCNN rsquo02) vol 3 pp2878ndash2883 IEEE Honolulu Hawaii USA 2002

[26] S Hawkins H He G Williams and R Baxter ldquoOutlier detec-tion using replicator neural networksrdquo inDataWarehousing andKnowledge Discovery pp 170ndash180 Springer Berlin Germany2002

Computational Intelligence and Neuroscience 13

[27] N. Japkowicz, "Supervised versus unsupervised binary-learning by feedforward neural networks," Machine Learning, vol. 42, no. 1-2, pp. 97-122, 2001.

[28] L. Manevitz and M. Yousef, "One-class document classification via neural networks," Neurocomputing, vol. 70, no. 7-9, pp. 1466-1481, 2007.

[29] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM '02), pp. 709-712, IEEE Computer Society, December 2002.

[30] H. Sohn, K. Worden, and C. R. Farrar, "Statistical damage classification under changing environmental and operational conditions," Journal of Intelligent Material Systems and Structures, vol. 13, no. 9, pp. 561-574, 2002.

[31] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "Probabilistic novelty detection for acoustic surveillance under real-world conditions," IEEE Transactions on Multimedia, vol. 13, no. 4, pp. 713-719, 2011.

[32] E. Principi, S. Squartini, R. Bonfigli, G. Ferroni, and F. Piazza, "An integrated system for voice command recognition and emergency detection based on audio signals," Expert Systems with Applications, vol. 42, no. 13, pp. 5668-5683, 2015.

[33] A. Ito, A. Aiba, M. Ito, and S. Makino, "Evaluation of abnormal sound detection using multi-stage GMM in various environments," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH '11), pp. 301-304, Florence, Italy, August 2011.

[34] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "On acoustic surveillance of hazardous situations," in Proceedings of the 34th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 165-168, April 2009.

[35] E. Marchi, F. Vesperini, F. Eyben, S. Squartini, and B. Schuller, "A novel approach for automatic acoustic novelty detection using a denoising autoencoder with bidirectional LSTM neural networks," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), 5 pages, IEEE, Brisbane, Australia, April 2015.

[36] E. Marchi, F. Vesperini, F. Weninger, F. Eyben, S. Squartini, and B. Schuller, "Non-linear prediction with LSTM recurrent neural networks for acoustic novelty detection," in Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN '15), p. 5, IEEE, Killarney, Ireland, July 2015.

[37] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, "Learning precise timing with LSTM recurrent networks," The Journal of Machine Learning Research, vol. 3, no. 1, pp. 115-143, 2003.

[38] A. Graves, "Generating sequences with recurrent neural networks," https://arxiv.org/abs/1308.0850.

[39] D. Eck and J. Schmidhuber, "Finding temporal structure in music: blues improvisation with LSTM recurrent networks," in Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pp. 747-756, IEEE, Martigny, Switzerland, September 2002.

[40] J. A. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," Tech. Rep. 97-021, International Computer Science Institute, Berkeley, Calif, USA, 1998.

[41] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.

[42] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Advances in Neural Information Processing Systems 12, S. Solla, T. Leen, and K. Müller, Eds., pp. 582-588, MIT Press, 2000.

[43] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986.

[44] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

[45] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673-2681, 1997.

[46] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5-6, pp. 602-610, 2005.

[47] G. Hinton, L. Deng, D. Yu et al., "Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.

[48] I. Goodfellow, H. Lee, Q. V. Le, A. Saxe, and A. Y. Ng, "Measuring invariances in deep networks," in Advances in Neural Information Processing Systems, Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, Eds., vol. 22, pp. 646-654, Curran & Associates, Inc., Red Hook, NY, USA, 2009.

[49] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds., pp. 153-160, MIT Press, 2007.

[50] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, pp. 3371-3408, 2010.

[51] E. Marchi, G. Ferroni, F. Eyben, L. Gabrielli, S. Squartini, and B. Schuller, "Multi-resolution linear prediction based features for audio onset detection with bidirectional LSTM neural networks," in Proceedings of the 39th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 2183-2187, IEEE, Florence, Italy, May 2014.

[52] F. Eyben, F. Weninger, F. Gross, and B. Schuller, "Recent developments in openSMILE, the Munich open-source multimedia feature extractor," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 835-838, ACM, Barcelona, Spain, October 2013.

[53] J. Barker, E. Vincent, N. Ma, H. Christensen, and P. Green, "The PASCAL CHiME speech separation and recognition challenge," Computer Speech & Language, vol. 27, no. 3, pp. 621-633, 2013.

[54] F. Weninger, J. Bergmann, and B. Schuller, "Introducing CURRENNT: the Munich open-source CUDA RecurREnt Neural Network Toolkit," Journal of Machine Learning Research, vol. 16, pp. 547-551, 2015.

[55] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, article 27, 2011.

[56] R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: a Matlab-like environment for machine learning," in BigLearn, NIPS Workshop, no. EPFL-CONF-192376, 2011.

[57] M. D. Smucker, J. Allan, and B. Carterette, "A comparison of statistical significance tests for information retrieval evaluation," in Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM '07), pp. 623-632, ACM, Lisbon, Portugal, November 2007.


[58] E. Marchi, G. Ferroni, F. Eyben, S. Squartini, and B. Schuller, "Audio onset detection: a wavelet packet based approach with recurrent neural networks," in Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN '14), pp. 3585-3591, Beijing, China, July 2014.

[59] N. Friedman, K. Murphy, and S. Russell, "Learning the structure of dynamic probabilistic networks," in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI '98), pp. 139-147, Morgan Kaufmann, Madison, Wisc, USA, July 1998.

[60] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, "Face recognition: a convolutional neural-network approach," IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98-113, 1997.


Figure 6: PROMETHEUS-ATM performances obtained with the NP autoencoders (NP-MLP, NP-LSTM, and NP-BLSTM in the CAE, AE, and DAE configurations) by progressively increasing the prediction delay from 0 to 10 frames; the F-measure (%) on the y-axis ranges from 72.0 to 82.0.

The autoencoder-based approaches in the three configurations, compression (CAE), ordinary (AE), and denoising (DAE), show a significant improvement in performance, up to 19.3% absolute F-measure, against the OCSVM. Among the three configurations, we observe that the DAE performs better independently of the type of network. In particular, the best performance with the ordinary (without nonlinear prediction) approach is obtained with the DAE with an LSTM network, leading to an F-measure of 79.5%.
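
All of these autoencoder variants are compared on the same footing: the frame-wise reconstruction error serves as the novelty activation signal and is thresholded into novel/normal decisions. The following is a minimal illustrative sketch of that decision step only; the Euclidean error and the function name are our assumptions, not the exact implementation evaluated here.

    import numpy as np

    def novelty_decisions(frames, reconstructions, threshold):
        # frames, reconstructions: (T, D) arrays of input features and
        # the corresponding autoencoder outputs.
        # Frame-wise reconstruction error acts as the activation signal.
        errors = np.linalg.norm(frames - reconstructions, axis=1)
        # Simple decision logic: frames whose error exceeds the
        # threshold are flagged as novel, the rest as normal.
        return errors > threshold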

The last three groups of rows in Table 2 show results of the nonlinear predictive approach (NP). The nonlinear predictive denoising autoencoder performs best, up to 81.6% F-measure. Surprisingly, the best performance is obtained using MLP units, suggesting that for long events, such as those contained in the ATM scenario (with an average duration of 60 s, cf. Table 1), memory-enhanced units such as (B)LSTM are not as effective as for shorter events.

A significant absolute improvement of 21.4% F-measure is observed against the OCSVM approach, while an absolute improvement of 31.4% F-measure is exhibited with respect to the GMM-based method. Among the two autoencoder-based approaches, we report an absolute improvement of 1.0% between the so-called "ordinary" and "predictive" structures. It must be observed that the performance presented in [31] is higher than the one provided in this article, since the tolerance window used in that study was set to 1 s, whereas here we aimed at a higher temporal resolution with a tolerance window of 200 ms, which is suitable also for abrupt events.
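
To make the role of the tolerance window concrete, the sketch below counts a detection as correct when it falls within +/- 200 ms of a ground-truth event onset; this is a simplified reading of such a scoring protocol, not the exact evaluation script used in our experiments.

    def matched_events(ref_onsets, det_onsets, tol=0.2):
        # Greedily match each reference onset (in seconds) to at most
        # one detection lying inside the tolerance window (0.2 s here).
        used = set()
        hits = 0
        for r in ref_onsets:
            for i, d in enumerate(det_onsets):
                if i not in used and abs(d - r) <= tol:
                    used.add(i)
                    hits += 1
                    break
        # Precision = hits / len(det_onsets), recall = hits / len(ref_onsets);
        # the F-measure is their harmonic mean.
        return hits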

Figure 6 depicts the performance for progressive values of the prediction delay (k), ranging from 0 to 10, applying a CAE, AE, and DAE with MLP, LSTM, and BLSTM networks. Several layouts (cf. Section 7) were evaluated per network type; however, we report only the best configurations. Setting a prediction delay of 1 frame, which corresponds to a total prediction delay of 10 ms, leads to the best performance of up to 81.6% F-measure with the NP-MLP-DAE network. In the case of the NP-BLSTM-DAE, we observe better performance with a delay of 2 frames, up to 80.7% F-measure. In general, we do not observe a consistent trend when increasing the prediction delay, corroborating the fact that for long events, as those contained in the ATM scenario, memory-enhanced units and a nonlinear predictive approach are not as effective as for shorter events.
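
For reference, the prediction delay can be made explicit with a small sketch: with a 10 ms frame step, a delay of k frames means the network is fed frame t and trained to output frame t + k. Variable names are illustrative.

    import numpy as np

    def prediction_pairs(features, k=1):
        # features: (T, D) matrix of frame-wise auditory spectral features.
        # k = 0 reduces to plain autoencoding (reconstruct the input);
        # k >= 1 turns the autoencoder into a nonlinear predictor whose
        # reconstruction error reflects how predictable the signal is.
        if k == 0:
            return features, features
        return features[:-k], features[k:]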

8.3.2. Corridor. The evaluations on the corridor subset are shown in the fourth column of Table 2. The GMM and HMM perform similarly, at chance level: we observe an F-measure of 49.4% and 49.6% for GMM and HMM, respectively. The OCSVM shows better performance, up to 65.3%. As observed in the ATM scenario, again a significant improvement in performance, up to 16.5% absolute F-measure, is obtained using the autoencoder-based approaches in the three configurations (CAE, AE, and DAE) with respect to the OCSVM. Among the three configurations, we observe that the denoising autoencoder performs better than the others. The best performance is obtained with the denoising autoencoder with BLSTM units, of up to 79.8% F-measure.

The "predictive" approach is reported in the last three groups of rows in Table 2. Interestingly, the nonlinear predictive autoencoders do not improve the performance as we have seen in the other scenarios. A plausible explanation can be found in the nature of the novelty events present in this subset. In fact, the subset contains very long events, with an average duration of up to 140 s per event. With such long events, the generative model does not introduce a more sensitive reconstruction error. However, the delta in performance between the BLSTM-DAE and the NP-BLSTM-DAE is rather small (1.3% F-measure), in favour of the "ordinary" approach. The best performance (79.8%) is obtained using BLSTM units, confirming that memory-enhanced units are more effective in the presence of short events. In fact, this scenario, besides very long events, also contains short fall and pain events, with an average duration of 10 s and 30 s, respectively.

A significant absolute improvement of up to 16.5% F-measure is observed against the OCSVM approach, and the improvement is even higher with respect to the GMM and HMM.

8.3.3. Outdoor. The evaluations on the outdoor subset are shown in the fifth column of Table 2. The OCSVM, GMM, and HMM perform better in this scenario, as opposed to ATM and corridor: we observe an F-measure of 57.3%, 56.4%, and 56.0% for OCSVM, GMM, and HMM, respectively. In this scenario, the improvement brought by the autoencoder is not as vast as in the previous subsets, but it is still significant. We report an absolute improvement of 11.2% F-measure between the OCSVM and the BLSTM-DAE. Again, the denoising autoencoder performs better than the other configurations. In particular, the best performance, obtained with the BLSTM-DAE, is 68.5% F-measure.

As observed in the corridor scenario, the nonlinear predictive autoencoders (last three groups of rows in Table 2) do not improve the performance. These results corroborate our previous explanation that the long duration of the novelty events present in the subset affects the sensitivity of the reconstruction error in the generative model. However, the delta in performance between the BLSTM-DAE and the NP-BLSTM-DAE is rather small (1.3% F-measure).

It must be observed that the performance in this scenario is rather low compared to the one obtained on the other datasets. We believe that the presence of anger novel sounds introduces a higher degree of complexity for our autoencoder-based approach. In fact, anger novel events may contain different levels of aroused content, which could be acoustically similar to the neutral spoken content present in the training material. Under this condition, the generative model shows a low reconstruction error. This issue could be solved by setting the novel events to contain only the aroused segments or, considering anger as a long-term speaker state, by increasing the temporal resolution of our system.

8.3.4. Smart-Room. The smart-room scenario evaluations are shown in the sixth column of Table 2. The OCSVM, GMM, and HMM perform better in this scenario, as opposed to ATM, corridor, and outdoor: we observe an F-measure of 57.4%, 59.1%, and 59.1% for OCSVM, GMM, and HMM, respectively. In this scenario, the improvement brought about by the autoencoder is still significant: we report an absolute improvement of 6.0% F-measure between GMM/HMM and the BLSTM-DAE. Again, the denoising autoencoder performs better than the other configurations. In particular, the best performance in the ordinary approach is obtained with the BLSTM-DAE, of up to 65.1% F-measure.

The last three groups of rows in Table 2 show results of the nonlinear predictive approach (NP). The NP-BLSTM-DAE performs best, at up to 65.6% F-measure.

As in the outdoor subset, we report a low performance in the smart-room subset as well. In fact, the subset contains several long novel events related to spoken content expressing pain and fear. As discussed for the outdoor scenario, under this condition the generative model may be able to reconstruct the novel event without producing a high reconstruction error.

8.4. Overall. Overall, the experimental results proved that the DAE methods achieve superior performance compared to the CAE/AE schemes. This is due to the combination of the two learning processes of a denoising autoencoder: encoding the input while preserving the information about the input itself and, simultaneously, reversing the effect of a corruption process applied to the input of the autoencoder.
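
A minimal sketch of this denoising objective may help; additive Gaussian corruption is assumed here purely for illustration, and encode and decode stand for any of the network types compared above.

    import numpy as np

    rng = np.random.default_rng(0)

    def denoising_loss(x, encode, decode, noise_std=0.1):
        # Corrupt the input frame ...
        x_noisy = x + rng.normal(0.0, noise_std, size=x.shape)
        # ... encode and decode the corrupted version ...
        x_hat = decode(encode(x_noisy))
        # ... but measure the error against the clean frame, so the
        # model must encode the input and undo the corruption at once.
        return float(np.mean((x_hat - x) ** 2))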

In particular, the predictive approach with (B)LSTM units showed the best performance, of up to 89.3% average F-measure over all six different datasets, weighted by the number of instances per database (cf. Table 2).
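
The weighting scheme itself is straightforward; the helper below spells it out on hypothetical per-database figures (the numbers in the example are illustrative, not the ones reported here).

    def weighted_average_f(f_measures, n_instances):
        # Instance-weighted mean of per-database F-measures (%).
        total = sum(n_instances)
        return sum(f * n for f, n in zip(f_measures, n_instances)) / total

    # Hypothetical example: three databases with different test-set sizes.
    print(weighted_average_f([89.0, 91.0, 66.0], [5000, 3000, 1000]))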

To better understand the improvement brought by the RNN-based approaches, we provide in Figure 7 a comparison of state-of-the-art methods in terms of weighted average F-measure computed across the A3Novelty Corpus, PASCAL CHiME, and PROMETHEUS. In general, we observe that the recently proposed NP-BLSTM-DAE method provides the best performance, with an average F-measure of up to 89.3%. A significant absolute improvement of 16.0% average F-measure is observed against the OCSVM approach, while absolute improvements of 10.6% and 10.4% average F-measure are exhibited with respect to the GMM- and HMM-based methods. An absolute improvement of 0.6% is observed over the "ordinary" BLSTM-DAE.

Figure 7: Average F-measure (%) computed over the A3Novelty Corpus, PASCAL CHiME, and PROMETHEUS, weighted by the number of instances per database; extended comparison of all methods, ranging from 73.3% (OCSVM) to 89.3% (NP-BLSTM-DAE).

It has to be noted that the average F-measure is computed including the PROMETHEUS database, for which the performance has been shown to be lower because it contains long-term events and a lower resolution of the labels (1 s).

The RNN-based schemes also bring an evident benefit when applied to the "normal" autoencoders (i.e., with no denoising or compression); in fact, the NP-BLSTM-AE achieves an F-measure of 88.5%. Furthermore, when we applied the nonlinear prediction scheme to a denoising autoencoder, the performance achieved with LSTM units was in this case comparable with BLSTM units and also outperformed the state-of-the-art approaches.

In conclusion, the combination of the nonlinear prediction paradigm and the various (B)LSTM autoencoders proved to be effective, significantly outperforming other state-of-the-art methods. Additionally, there is evidence that memory-enhanced units such as LSTM and BLSTM outperform the memoryless MLP, showing that knowledge of the temporal context can improve the novelty detector's abilities.

9. Conclusions and Outlook

We presented a broad and extensive evaluation of state-of-the-art methods, with a particular focus on novel and recent unsupervised approaches based on RNN-based autoencoders. We significantly extended the studies conducted in [35, 36] by evaluating further approaches, such as one-class support vector machines (OCSVMs) and multilayer perceptrons (MLPs), and, most importantly, we conducted a broad evaluation on three different datasets, for a total number of 160,153 experiments, making this article the first to present such a complete evaluation in the field of acoustic novelty detection. We clearly showed that RNN-based autoencoders significantly outperform the other methods by achieving up to 89.3% weighted average F-measure on the three databases, with a significant absolute improvement of 10.4% against the best performance obtained with statistical approaches (HMM). Overall, a significant increase in performance was achieved by combining the (B)LSTM autoencoder-based architecture with the nonlinear prediction scheme.

Future work will focus on using multiresolution features [51, 58], likely more suitable to deal with different event durations, in order to address the issues encountered with the PROMETHEUS database. Further studies will be conducted on other architectures of RNN-based autoencoders, ranging from dynamic Bayesian networks [59] to convolutional neural networks [60].

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreements no. 338164 (ERC Starting Grant iHEARu), no. 645378 (RIA ARIA-VALUSPA), and no. 688835 (RIA DE-ENIGMA).


Submit your manuscripts athttpswwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 11: Deep Recurrent Neural Network-Based Autoencoders for ...downloads.hindawi.com/journals/cin/2017/4694860.pdf · ResearchArticle Deep Recurrent Neural Network-Based Autoencoders for

Computational Intelligence and Neuroscience 11

datasets We believe that the presence of anger novel soundsintroduces a higher degree of complexity in our autoencoder-based approach In fact anger novel events may containdifferent level of aroused content which could be acousticallysimilar to neutral spoken content present in the trainingmaterial Under this condition the generative model shows alow reconstruction errorThis issue could be solved by settingthe novel events to only contain the aroused segments ormdashconsidering anger as a long-term speaker statemdashincreasingthe temporal resolution of our system

834 Smart-Room Thesmart-room scenario evaluations areshown in the sixth column of Table 2 The OCSVM GMMand HMM perform better in this scenario as opposed toATM corridor and outdoor We observe an 119865-measure of574 591 and 591 for OCSVM GMM and HMMrespectively In this scenario the improvement brought aboutby the autoencoder is still significant We report an absoluteimprovement of 60 119865-measure between GMMHMM andthe BLSTM-DAE Again the denoising autoencoder per-forms better than the other configurations In particular thebest performance in the ordinary approach is obtained withthe BLSTM-DAE of up to 651 119865-measure

The last three groups of rows in Table 2 show results of thenonlinear predictive approach (NP) The NP-BLSTM-DAEperforms best at up to 656 119865-measure

As in the outdoor subset we report a low performancein the smart-room subset as well In fact the subset containsseveral long novel events related to spoken content expressingpain and fear As commented in the outdoor scenariounder this condition the generative model may be ableto reconstruct the novel event without producing a highreconstruction error

84 Overall Overall the experimental results proved thattheDAEmethods achieved superior performances comparedto the CAEAE schemes This is due to the combination oftwo leaning processes of a denoising autoencoder such asthe process of encoding of the input by preserving the infor-mation about the input itself and simultaneously reversingthe effect of a corruption process applied to the input of theautoencoder

In particular the predictive approachwith (B)LSTMunitsshowed the best performance of up to 893 average 119865-measure among all the six different datasets weighted by thenumber of instances per database (cf Table 2)

To better understand the improvement brought by theRNN-based approaches we provide in Figure 7 the compar-ison between state-of-the-art methods in terms of weightedaverage 119865-measure computed across the A3Novelty Cor-pus PASCAL CHiME and PROMETHEUS In general weobserve that the recently proposedNP-BLSTM-DAEmethodprovided the best performance in terms of average119865-measureof up to 893 A significant absolute improvement of160 average 119865-measure is observed against the OCSVMapproach while an absolute improvement of 106 and 104average119865-measure is exhibitedwith respect to theGMM- andHMM-based methods An absolute improvement of 06is observed over the ldquoordinaryrdquo BLSTM-DAE It has to be

OCS

VM

GM

MH

MM

MLP

-CA

ELS

TM-C

AE

BLST

M-C

AE

MLP

-AE

LSTM

-AE

BLST

M-A

EM

LP-D

AE

LSTM

-DA

EBL

STM

-DA

EN

P-M

LP-C

AE

NP-

LSTM

-CA

EN

P-BL

STM

-CA

EN

P-M

LP-A

EN

P-LS

TM-A

EN

P-BL

STM

-AE

NP-

MLP

-DA

EN

P-LS

TM-D

AE

NP-

BLST

M-D

AE720

740760780800820840860880900

F-m

easu

re (

)

787

873

733

862848

789

860848

887 881864

860

881 877864 855

885876

893888873

Figure 7 Average 119865-measure computed over A3Novelty CorpusPASCAL and PROMETHEUS weighted by the of instance perdatabase Extended comparison of methods

noted that the average 119865-measure is computed includingthe PROMETHEUS database for which the performance hasbeen shown to be lower because it contains long-term eventsand a lower resolution in the labels (1 s)

The RNN-based schemes also bring an evident bene-fit when applied to the ldquonormalrdquo autoencoders (ie withno denoising or compression) in fact the NP-BLSTM-AEachieves an 119865-measure of 885 Furthermore when weapplied the nonlinear prediction scheme to a denoisingautoencoder the performance achieved with LSTM was inthis case comparable with BLSTM units and also outper-formed state-of-the-art approaches

In conclusion the combination of the nonlinear pre-diction paradigm and the various (B)LSTM autoencodersproved to be effective outperforming significantly otherstate-of-the-art methods Additionally there is evidence thatmemory-enhanced units such as LSTM and BLSTM outper-formed MLP without memory showing that the knowledgeof the temporal context can improve the novelty detectorabilities

9 Conclusions and Outlook

We presented a broad and extensive evaluation of state-of-the-art methods with a particular focus on novel and recentunsupervised approaches based on RNN-based autoen-coders We significantly extended the studies conductedin [35 36] by evaluating further approaches such as one-class support vector machines (OCSVMs) and multilayerperceptron (MLP) and most importantly we conducted abroad evaluation on three different datasets for a total numberof 160153 experiments making this article the first to presentsuch a complete evaluation in the field of acoustic noveltydetection We show evidently that RNN-based autoencoderssignificantly outperform other methods by achieving up to893 weighted average 119865-measure on the three databaseswith a significant absolute improvement of 104 againstthe best performance obtained with statistical approaches(HMM) Overall a significant increase in performance wasachieved by combining the (B)LSTM autoencoder-basedarchitecture with the nonlinear prediction scheme

Future works will focus on using multiresolution features[51 58] likely more suitable to deal with different eventdurations in order to face the issues encountered in the

12 Computational Intelligence and Neuroscience

PROMETHEUS database Further studies will be conductedon other architectures of RNN-based autoencoders rangingfrom dynamic Bayesian networks [59] to convolutional neu-ral networks [60]

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The research leading to these results has received fund-ing from the European Communityrsquos Seventh FrameworkProgramme (FP72007ndash2013) under Grant Agreement no338164 (ERC StartingGrant iHEARu) no 645378 (RIAARIAVALUSPA) and no 688835 (RIA DE-ENIGMA)

References

[1] L Tarassenko P Hayton N Cerneaz and M Brady ldquoNoveltydetection for the identification of masses in mammogramsrdquoin Proceedings of the 4th International Conference on ArtificialNeural Networks pp 442ndash447 June 1995

[2] J Quinn and C Williams ldquoKnown unknowns novelty detec-tion in conditionmonitoringrdquo in Pattern Recognition and ImageAnalysis J Mart J Bened A Mendona and J Serrat Eds vol4477 of Lecture Notes in Computer Science pp 1ndash6 SpringerBerlin Germany 2007

[3] L Clifton D A Clifton P J Watkinson and L TarassenkoldquoIdentification of patient deterioration in vital-sign data usingone-class support vector machinesrdquo in Proceedings of the Feder-ated Conference on Computer Science and Information Systems(FedCSIS rsquo11) pp 125ndash131 September 2011

[4] Y-H Liu Y-C Liu and Y-J Chen ldquoFast support vector datadescriptions for novelty detectionrdquo IEEETransactions onNeuralNetworks vol 21 no 8 pp 1296ndash1313 2010

[5] C Surace and K Worden ldquoNovelty detection in a changingenvironment a negative selection approachrdquo Mechanical Sys-tems and Signal Processing vol 24 no 4 pp 1114ndash1128 2010

[6] J A Quinn C K I Williams and N McIntosh ldquoFactorialswitching linear dynamical systems applied to physiologicalcondition monitoringrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 31 no 9 pp 1537ndash1551 2009

[7] A Patcha and J-M Park ldquoAn overview of anomaly detectiontechniques existing solutions and latest technological trendsrdquoComputer Networks vol 51 no 12 pp 3448ndash3470 2007

[8] M Markou and S Singh ldquoA neural network-based noveltydetector for image sequence analysisrdquo IEEE Transactions onPattern Analysis and Machine Intelligence vol 28 no 10 pp1664ndash1677 2006

[9] M Markou and S Singh ldquoNovelty detection a reviewmdashpart1 statistical approachesrdquo Signal Processing vol 83 no 12 pp2481ndash2497 2003

[10] M Markou and S Singh ldquoNovelty detection a reviewmdashpart 2neural network based approachesrdquo Signal Processing vol 83 no12 pp 2499ndash2521 2003

[11] E R de Faria I R Goncalves J a Gama and A C CarvalholdquoEvaluation of multiclass novelty detection algorithms for datastreamsrdquo IEEE Transactions on Knowledge and Data Engineer-ing vol 27 no 11 pp 2961ndash2973 2015

[12] C Satheesh Chandran S Kamal A Mujeeb and M HSupriya ldquoNovel class detection of underwater targets usingself-organizing neural networksrdquo in Proceedings of the IEEEUnderwater Technology (UT rsquo15) Chennai India February 2015

[13] K Worden G Manson and D Allman ldquoExperimental vali-dation of a structural health monitoring methodology part INovelty detection on a laboratory structurerdquo Journal of Soundand Vibration vol 259 no 2 pp 323ndash343 2003

[14] J Foote ldquoAutomatic audio segmentation using a measure ofaudio noveltyrdquo in Proceedings of the IEEE Internatinal Confer-ence on Multimedia and Expo (ICME rsquo00) vol 1 pp 452ndash455August 2000

[15] B Scholkopf R C Williamson A J Smola J Shawe-Taylorand J C Platt ldquoSupport vectormethod for novelty detectionrdquo inProceedings of the Neural Information Processing Systems (NIPSrsquo99) vol 12 pp 582ndash588 Denver Colo USA 1999

[16] J Ma and S Perkins ldquoTime-series novelty detection usingone-class support vector machinesrdquo in Proceedings of the IEEEInternational Joint Conference on Neural Networks (IJCNN rsquo03)vol 3 pp 1741ndash1745 Portland Ore USA July 2003

[17] J Ma and S Perkins ldquoOnline novelty detection on temporalsequencesrdquo in Proceedings of the 9th ACM SIGKDD Interna-tional Conference on Knowledge Discovery and Data Mining(KDD rsquo03) pp 613ndash618 ACM Washington DC USA August2003

[18] P Hayton B Scholkopf L Tarassenko and P Anuzis ldquoSupportvector novelty detection applied to jet engine vibration spectrardquoin Proceedings of the 14th Annual Neural Information ProcessingSystems Conference (NIPS rsquo00) pp 946ndash952 December 2000

[19] L Tarassenko A Nairac N Townsend and P Cowley ldquoNoveltydetection in jet enginesrdquo in Proceedings of the IEE Colloquiumon Condition Monitoring Machinery External Structures andHealth pp 41ndash45 Birmingham UK April 1999

[20] L Clifton D A Clifton Y Zhang P Watkinson L TarassenkoandH Yin ldquoProbabilistic novelty detection with support vectormachinesrdquo IEEE Transactions on Reliability vol 63 no 2 pp455ndash467 2014

[21] D R Hardoon and L M Manevitz ldquoOne-class machinelearning approach for fmri analysisrdquo in Proceedings of the Post-graduate Research Conference in Electronics Photonics Commu-nications and Networks and Computer Science Lancaster UKApril 2005

[22] M Davy F Desobry A Gretton and C Doncarli ldquoAn onlinesupport vector machine for abnormal events detectionrdquo SignalProcessing vol 86 no 8 pp 2009ndash2025 2006

[23] M A F Pimentel D A Clifton L Clifton and L TarassenkoldquoA review of novelty detectionrdquo Signal Processing vol 99 pp215ndash249 2014

[24] N Japkowicz C Myers M Gluck et al ldquoA novelty detectionapproach to classificationrdquo in Proceedings of the 14th interna-tional joint conference on Artificial intelligence (IJCAI rsquo95) vol1 pp 518ndash523 Montreal Canada 1995

[25] B B Thompson R J Marks J J Choi M A El-SharkawiM-Y Huang and C Bunje ldquoImplicit learning in autoencodernovelty assessmentrdquo in Proceedings of the 2002 InternationalJoint Conference on Neural Networks (IJCNN rsquo02) vol 3 pp2878ndash2883 IEEE Honolulu Hawaii USA 2002

[26] S Hawkins H He G Williams and R Baxter ldquoOutlier detec-tion using replicator neural networksrdquo inDataWarehousing andKnowledge Discovery pp 170ndash180 Springer Berlin Germany2002

Computational Intelligence and Neuroscience 13

[27] N Japkowicz ldquoSupervised versus unsupervised binary-learningby feedforward neural networksrdquoMachine Learning vol 42 no1-2 pp 97ndash122 2001

[28] L Manevitz and M Yousef ldquoOne-class document classificationvia Neural Networksrdquo Neurocomputing vol 70 no 7ndash9 pp1466ndash1481 2007

[29] G Williams R Baxter H He S Hawkins and L Gu ldquoAcomparative study of RNN for outlier detection in dataminingrdquoin Proceedings of the 2nd IEEE International Conference onData Mining (ICDM rsquo02) pp 709ndash712 IEEE Computer SocietyDecember 2002

[30] H Sohn K Worden and C R Farrar ldquoStatistical damageclassification under changing environmental and operationalconditionsrdquo Journal of Intelligent Material Systems and Struc-tures vol 13 no 9 pp 561ndash574 2002

[31] S Ntalampiras I Potamitis and N Fakotakis ldquoProbabilisticnovelty detection for acoustic surveillance under real-worldconditionsrdquo IEEE Transactions onMultimedia vol 13 no 4 pp713ndash719 2011

[32] E Principi S Squartini R Bonfigli G Ferroni and F PiazzaldquoAn integrated system for voice command recognition andemergency detection based on audio signalsrdquo Expert Systemswith Applications vol 42 no 13 pp 5668ndash5683 2015

[33] A Ito A Aiba M Ito and S Makino ldquoEvaluation of abnormalsound detection using multi-stage GMM in various environ-mentsrdquo in Proceedings of the 12th Annual Conference of the Inter-national Speech Communication Association (INTERSPEECHrsquo11) pp 301ndash304 Florence Italy August 2011

[34] S Ntalampiras I Potamitis and N Fakotakis ldquoOn acousticsurveillance of hazardous situationsrdquo in Proceedings of the 34thIEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo09) pp 165ndash168 April 2009

[35] EMarchi F Vesperini F Eyben S Squartini andB Schuller ldquoAnovel approach for automatic acoustic novelty detection usinga denoising autoencoder with bidirectional LSTM neural net-worksrdquo in Proceedings of the 40th IEEE International ConferenceonAcoustics Speech and Signal Processing (ICASSP rsquo15) 5 pagesIEEE Brisbane Australia April 2015

[36] E Marchi F Vesperini F Weninger F Eyben S Squartini andB Schuller ldquoNon-linear prediction with LSTM recurrent neuralnetworks for acoustic novelty detectionrdquo in Proceedings of the2015 International Joint Conference on Neural Networks (IJCNNrsquo15) p 5 IEEE Killarney Ireland July 2015

[37] F A Gers N N Schraudolph and J Schmidhuber ldquoLearningprecise timing with LSTM recurrent networksrdquo The Journal ofMachine Learning Research vol 3 no 1 pp 115ndash143 2003

[38] A Graves ldquoGenerating sequences with recurrent neural net-worksrdquo httpsarxivorgabs13080850

[39] D Eck and J Schmidhuber ldquoFinding temporal structure inmusic blues improvisation with lstm recurrent networksrdquo inProceedings of the 12th IEEE Workshop on Neural Networks forSignal Processing pp 747ndash756 IEEE Martigny SwitzerlandSeptember 2002

[40] J A Bilmes ldquoA gentle tutorial of the EM algorithm andits application to parameter estimation for gaussian mixtureand hidden Markov modelsrdquo Tech Rep 97-021 InternationalComputer Science Institute Berkeley Calif USA 1998

[41] L R Rabiner ldquoTutorial on hiddenMarkov models and selectedapplications in speech recognitionrdquo Proceedings of the IEEE vol77 no 2 pp 257ndash286 1989

[42] B Scholkopf R C Williamson A J Smola J Shawe-Taylorand J C Platt ldquoSupport vectormethod for novelty detectionrdquo in

Advances in Neural Information Processing Systems 12 S SollaT Leen and K Muller Eds pp 582ndash588 MIT Press 2000

[43] D E Rumelhart G E Hinton and R J Williams ldquoLearningrepresentations by back-propagating errorsrdquo Nature vol 323no 6088 pp 533ndash536 1986

[44] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquoNeural Computation vol 9 no 8 pp 1735ndash1780 1997

[45] M Schuster and K K Paliwal ldquoBidirectional recurrent neuralnetworksrdquo IEEE Transactions on Signal Processing vol 45 no11 pp 2673ndash2681 1997

[46] A Graves and J Schmidhuber ldquoFramewise phoneme classi-fication with bidirectional LSTM and other neural networkarchitecturesrdquo Neural Networks vol 18 no 5-6 pp 602ndash6102005

[47] G Hinton L Deng D Yu et al ldquoDeep neural networks foracoustic modeling in speech recognition the shared views offour research groupsrdquo IEEE Signal Processing Magazine vol 29no 6 pp 82ndash97 2012


PROMETHEUS database. Further studies will be conducted on other architectures of RNN-based autoencoders, ranging from dynamic Bayesian networks [59] to convolutional neural networks [60].

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007–2013) under Grant Agreement no. 338164 (ERC Starting Grant iHEARu), no. 645378 (RIA ARIA-VALUSPA), and no. 688835 (RIA DE-ENIGMA).

References

[1] L. Tarassenko, P. Hayton, N. Cerneaz, and M. Brady, "Novelty detection for the identification of masses in mammograms," in Proceedings of the 4th International Conference on Artificial Neural Networks, pp. 442–447, June 1995.

[2] J. Quinn and C. Williams, "Known unknowns: novelty detection in condition monitoring," in Pattern Recognition and Image Analysis, J. Martí, J. Benedí, A. Mendonça, and J. Serrat, Eds., vol. 4477 of Lecture Notes in Computer Science, pp. 1–6, Springer, Berlin, Germany, 2007.

[3] L. Clifton, D. A. Clifton, P. J. Watkinson, and L. Tarassenko, "Identification of patient deterioration in vital-sign data using one-class support vector machines," in Proceedings of the Federated Conference on Computer Science and Information Systems (FedCSIS '11), pp. 125–131, September 2011.

[4] Y.-H. Liu, Y.-C. Liu, and Y.-J. Chen, "Fast support vector data descriptions for novelty detection," IEEE Transactions on Neural Networks, vol. 21, no. 8, pp. 1296–1313, 2010.

[5] C. Surace and K. Worden, "Novelty detection in a changing environment: a negative selection approach," Mechanical Systems and Signal Processing, vol. 24, no. 4, pp. 1114–1128, 2010.

[6] J. A. Quinn, C. K. I. Williams, and N. McIntosh, "Factorial switching linear dynamical systems applied to physiological condition monitoring," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 9, pp. 1537–1551, 2009.

[7] A. Patcha and J.-M. Park, "An overview of anomaly detection techniques: existing solutions and latest technological trends," Computer Networks, vol. 51, no. 12, pp. 3448–3470, 2007.

[8] M. Markou and S. Singh, "A neural network-based novelty detector for image sequence analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1664–1677, 2006.

[9] M. Markou and S. Singh, "Novelty detection: a review, part 1: statistical approaches," Signal Processing, vol. 83, no. 12, pp. 2481–2497, 2003.

[10] M. Markou and S. Singh, "Novelty detection: a review, part 2: neural network based approaches," Signal Processing, vol. 83, no. 12, pp. 2499–2521, 2003.

[11] E. R. de Faria, I. R. Gonçalves, J. Gama, and A. C. Carvalho, "Evaluation of multiclass novelty detection algorithms for data streams," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 11, pp. 2961–2973, 2015.

[12] C. Satheesh Chandran, S. Kamal, A. Mujeeb, and M. H. Supriya, "Novel class detection of underwater targets using self-organizing neural networks," in Proceedings of the IEEE Underwater Technology (UT '15), Chennai, India, February 2015.

[13] K. Worden, G. Manson, and D. Allman, "Experimental validation of a structural health monitoring methodology: part I. Novelty detection on a laboratory structure," Journal of Sound and Vibration, vol. 259, no. 2, pp. 323–343, 2003.

[14] J. Foote, "Automatic audio segmentation using a measure of audio novelty," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '00), vol. 1, pp. 452–455, August 2000.

[15] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Proceedings of the Neural Information Processing Systems (NIPS '99), vol. 12, pp. 582–588, Denver, Colo, USA, 1999.

[16] J. Ma and S. Perkins, "Time-series novelty detection using one-class support vector machines," in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '03), vol. 3, pp. 1741–1745, Portland, Ore, USA, July 2003.

[17] J. Ma and S. Perkins, "Online novelty detection on temporal sequences," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), pp. 613–618, ACM, Washington, DC, USA, August 2003.

[18] P. Hayton, B. Schölkopf, L. Tarassenko, and P. Anuzis, "Support vector novelty detection applied to jet engine vibration spectra," in Proceedings of the 14th Annual Neural Information Processing Systems Conference (NIPS '00), pp. 946–952, December 2000.

[19] L. Tarassenko, A. Nairac, N. Townsend, and P. Cowley, "Novelty detection in jet engines," in Proceedings of the IEE Colloquium on Condition Monitoring: Machinery, External Structures and Health, pp. 41–45, Birmingham, UK, April 1999.

[20] L. Clifton, D. A. Clifton, Y. Zhang, P. Watkinson, L. Tarassenko, and H. Yin, "Probabilistic novelty detection with support vector machines," IEEE Transactions on Reliability, vol. 63, no. 2, pp. 455–467, 2014.

[21] D. R. Hardoon and L. M. Manevitz, "One-class machine learning approach for fMRI analysis," in Proceedings of the Postgraduate Research Conference in Electronics, Photonics, Communications and Networks, and Computer Science, Lancaster, UK, April 2005.

[22] M. Davy, F. Desobry, A. Gretton, and C. Doncarli, "An online support vector machine for abnormal events detection," Signal Processing, vol. 86, no. 8, pp. 2009–2025, 2006.

[23] M. A. F. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko, "A review of novelty detection," Signal Processing, vol. 99, pp. 215–249, 2014.

[24] N. Japkowicz, C. Myers, M. Gluck et al., "A novelty detection approach to classification," in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI '95), vol. 1, pp. 518–523, Montreal, Canada, 1995.

[25] B. B. Thompson, R. J. Marks, J. J. Choi, M. A. El-Sharkawi, M.-Y. Huang, and C. Bunje, "Implicit learning in autoencoder novelty assessment," in Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN '02), vol. 3, pp. 2878–2883, IEEE, Honolulu, Hawaii, USA, 2002.

[26] S. Hawkins, H. He, G. Williams, and R. Baxter, "Outlier detection using replicator neural networks," in Data Warehousing and Knowledge Discovery, pp. 170–180, Springer, Berlin, Germany, 2002.


[27] N. Japkowicz, "Supervised versus unsupervised binary-learning by feedforward neural networks," Machine Learning, vol. 42, no. 1-2, pp. 97–122, 2001.

[28] L. Manevitz and M. Yousef, "One-class document classification via neural networks," Neurocomputing, vol. 70, no. 7–9, pp. 1466–1481, 2007.

[29] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, IEEE Computer Society, December 2002.

[30] H. Sohn, K. Worden, and C. R. Farrar, "Statistical damage classification under changing environmental and operational conditions," Journal of Intelligent Material Systems and Structures, vol. 13, no. 9, pp. 561–574, 2002.

[31] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "Probabilistic novelty detection for acoustic surveillance under real-world conditions," IEEE Transactions on Multimedia, vol. 13, no. 4, pp. 713–719, 2011.

[32] E. Principi, S. Squartini, R. Bonfigli, G. Ferroni, and F. Piazza, "An integrated system for voice command recognition and emergency detection based on audio signals," Expert Systems with Applications, vol. 42, no. 13, pp. 5668–5683, 2015.

[33] A. Ito, A. Aiba, M. Ito, and S. Makino, "Evaluation of abnormal sound detection using multi-stage GMM in various environments," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH '11), pp. 301–304, Florence, Italy, August 2011.

[34] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "On acoustic surveillance of hazardous situations," in Proceedings of the 34th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 165–168, April 2009.

[35] E. Marchi, F. Vesperini, F. Eyben, S. Squartini, and B. Schuller, "A novel approach for automatic acoustic novelty detection using a denoising autoencoder with bidirectional LSTM neural networks," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), 5 pages, IEEE, Brisbane, Australia, April 2015.

[36] E. Marchi, F. Vesperini, F. Weninger, F. Eyben, S. Squartini, and B. Schuller, "Non-linear prediction with LSTM recurrent neural networks for acoustic novelty detection," in Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN '15), p. 5, IEEE, Killarney, Ireland, July 2015.

[37] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, "Learning precise timing with LSTM recurrent networks," The Journal of Machine Learning Research, vol. 3, no. 1, pp. 115–143, 2003.

[38] A. Graves, "Generating sequences with recurrent neural networks," https://arxiv.org/abs/1308.0850.

[39] D. Eck and J. Schmidhuber, "Finding temporal structure in music: blues improvisation with LSTM recurrent networks," in Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pp. 747–756, IEEE, Martigny, Switzerland, September 2002.

[40] J. A. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," Tech. Rep. 97-021, International Computer Science Institute, Berkeley, Calif, USA, 1998.

[41] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[42] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Advances in Neural Information Processing Systems 12, S. Solla, T. Leen, and K. Müller, Eds., pp. 582–588, MIT Press, 2000.

[43] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, 1986.

[44] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[45] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[46] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5-6, pp. 602–610, 2005.

[47] G. Hinton, L. Deng, D. Yu et al., "Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.

[48] I. Goodfellow, H. Lee, Q. V. Le, A. Saxe, and A. Y. Ng, "Measuring invariances in deep networks," in Advances in Neural Information Processing Systems, Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, Eds., vol. 22, pp. 646–654, Curran & Associates, Inc., Red Hook, NY, USA, 2009.

[49] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds., pp. 153–160, MIT Press, 2007.

[50] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.

[51] E. Marchi, G. Ferroni, F. Eyben, L. Gabrielli, S. Squartini, and B. Schuller, "Multi-resolution linear prediction based features for audio onset detection with bidirectional LSTM neural networks," in Proceedings of the 39th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 2183–2187, IEEE, Florence, Italy, May 2014.

[52] F. Eyben, F. Weninger, F. Gross, and B. Schuller, "Recent developments in openSMILE, the Munich open-source multimedia feature extractor," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 835–838, ACM, Barcelona, Spain, October 2013.

[53] J. Barker, E. Vincent, N. Ma, H. Christensen, and P. Green, "The PASCAL CHiME speech separation and recognition challenge," Computer Speech & Language, vol. 27, no. 3, pp. 621–633, 2013.

[54] F. Weninger, J. Bergmann, and B. Schuller, "Introducing CURRENNT: the Munich open-source CUDA RecurREnt neural network toolkit," Journal of Machine Learning Research, vol. 16, pp. 547–551, 2015.

[55] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, article 27, 2011.

[56] R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: a Matlab-like environment for machine learning," in BigLearn, NIPS Workshop, no. EPFL-CONF-192376, 2011.

[57] M. D. Smucker, J. Allan, and B. Carterette, "A comparison of statistical significance tests for information retrieval evaluation," in Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM '07), pp. 623–632, ACM, Lisbon, Portugal, November 2007.


[58] E. Marchi, G. Ferroni, F. Eyben, S. Squartini, and B. Schuller, "Audio onset detection: a wavelet packet based approach with recurrent neural networks," in Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN '14), pp. 3585–3591, Beijing, China, July 2014.

[59] N. Friedman, K. Murphy, and S. Russell, "Learning the structure of dynamic probabilistic networks," in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI '98), pp. 139–147, Morgan Kaufmann, Madison, Wisc, USA, July 1998.

[60] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, "Face recognition: a convolutional neural-network approach," IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98–113, 1997.
