
Unsupervised Deep Learning for Anomaly Detection and Explanation in Sequential Data

by

Chandripal Budnarain

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science

Graduate Department of Mechanical and Industrial Engineering
University of Toronto

© Copyright 2020 by Chandripal Budnarain


Abstract

Unsupervised Deep Learning for Anomaly Detection and Explanation in Sequential Data

Chandripal Budnarain
Master of Applied Science

Graduate Department of Mechanical and Industrial Engineering
University of Toronto

2020

With recent successes of recurrent neural networks (RNNs) for machine translation and handwriting recognition tasks, we hypothesize that RNN approaches might be best suited for unsupervised anomaly detection in time series. In this thesis, we first contribute a comprehensive comparative evaluation of RNN-based deep learning methods for anomaly detection across a wide array of popular deep neural network architectures. In our second major contribution, we observe that a key gap of deep learning based anomaly detection methods is the inability to identify the portions of the data that led to the detected anomaly. To address this, we propose a novel explainability approach that aims to pinpoint regions of an input that lead to the detected anomaly. In sum, this thesis not only advances the state of the art in deep learning based anomaly detection for time series data but also contributes novel methods for producing explanations and evaluating the explanation quality of anomaly detectors.


Acknowledgements

First and foremost, I would like to thank my supervisor, Professor Scott Sanner, for his continuous support, guidance, and sincere patience throughout my Masters. Scott has been a wonderful mentor and a source of inspiration.

I would also like to thank Professor J. Christopher Beck, with whom I had the opportunity to work during my early beginnings in research in my final year of undergraduate studies. I owe a debt of gratitude to him for his advice, inspiration, and kindness.

I have had the privilege of sharing an office with many talented colleagues and friends. Thanks go to all D3M members for their friendship and insightful conversations.

I would like to thank my mother for her unconditional love, support, and patience. Without her sacrifices I would not be where I am today. Finally, I would like to thank Janki for her love, her support for everything, and her sincere kindness.


Contents

1 Introduction
    1.1 Motivation
    1.2 Contributions
    1.3 Outline

2 Background
    2.1 Principal Component Analysis (PCA)
    2.2 Deep Learning
        2.2.1 Notation
        2.2.2 Fully Connected Neural Network
        2.2.3 Recurrent Neural Networks
            Stacked RNN
            Bidirectional RNN
            Long-Short Term Memory RNN
        2.2.4 Encoder-Decoder framework
        2.2.5 Attentional Encoder-Decoder
        2.2.6 Generative Adversarial Networks (GANs)
    2.3 Probabilistic Graphical Models
        2.3.1 Bayesian Models
        2.3.2 Naive Bayes
        2.3.3 Hidden Markov Models

3 Anomaly Detection
    3.1 Supervised Deep Anomaly Detection
    3.2 GAN-based Anomaly Detection
    3.3 Autoencoding-based Unsupervised Anomaly Detection
        3.3.1 PCA
        3.3.2 Fully-Connected Neural Network (FCNN)
        3.3.3 Recurrent Neural Network
    3.4 Probabilistic Model-based Anomaly Detection
        3.4.1 Naive Bayes
        3.4.2 Bayesian Networks
        3.4.3 Hidden Markov Models
    3.5 Explainable Artificial Intelligence
        3.5.1 Explaining Predictions
        3.5.2 Classic Shapley Value Estimation
            Shapley regression values
            Shapley sampling values
        3.5.3 Local interpretable model-agnostic explanations (LIME)
    3.6 Summary

4 Deep Learning Approaches for Unsupervised Anomaly Detection in Time Series
    4.1 Introduction
    4.2 Notation
    4.3 Methodology - From reconstruction to anomaly detection
    4.4 Evaluation procedure
        Precision@k
        Average Precision@k
    4.5 Data
        4.5.1 Synthetic Data
        4.5.2 Yahoo Data
        4.5.3 Rank
        4.5.4 CICIDS2017
    4.6 Experiments
        Data splitting
        Parameter tuning
        Models
        Model training
        Evaluation
    4.7 Results
        4.7.1 Sequential models and autoencoders
        4.7.2 Principal Component Analysis: Linear vs Non-Linear
        4.7.3 Attentional component and explainability
    4.8 Discussion
        4.8.1 Comparison to other methods
        4.8.2 Time series approach to modeling
        4.8.3 Operational deployment
    4.9 Conclusion

5 Explaining Sequential Anomalies Detected by Autoencoders
    5.1 Motivation for Explanation within Anomaly Detection
    5.2 Explanation through Reconstruction Difference
    5.3 Explanation through the Closest non-anomaly (CNA)
        5.3.1 Evaluation Methodology
            Boolean Metrics: Precision, Recall, Accuracy, F1 Score, Hamming distance, Jaccard similarity
        5.3.2 Behavior of metrics under realistic testing scenarios
    5.4 Experimental Setup
        5.4.1 Discrete Data
        5.4.2 Continuous Sine Wave Data
        5.4.3 Setup
    5.5 Results
    5.6 Discussion
    5.7 Conclusion

6 Conclusion
    6.1 Future Directions

Bibliography


List of Tables

4.1 A summary of data and constructed data sets. #instances is the total number of data points prior to any preprocessing, D is the dimensionality of the feature space, T is the length of a data point, N is the total number of data samples for training, and #malicious is the number of malicious data points in the test set.

4.2 Precision at recall for all models and data sets. Bi stands for bidirectional and S for stacked. Values represent the average across 10-fold validation with standard deviation after the ± sign. For CICIDS data we only ran the models once.

4.3 Average precision at recall for all models and data sets. Bi stands for bidirectional and S for stacked.

5.1 Results for discrete data.

5.2 Results for continuous data where anomalous data points have random Gaussian noise inserted randomly per data point.

5.3 Results for continuous data where anomalous data points are square waves.

5.4 Results for continuous data where anomalous data points are triangle waves.

1 Extracted features for Rank data. Related event indicates the underlying event for a subset of features, and the included column indicates whether the feature is included in the final construction of a data set. Certain features do not carry any information and are therefore removed.

2 Extracted features from CICIDS2017 PCAP files. f is the transformation function applied to shift the feature distribution into a more uniform range.

3 Parameter values tested in the validation procedure for each model. Bi/S indicates all three combinations of bidirectional and stacked variants of LSTM. The learning rate α is the coefficient responsible for the magnitude of gradient updates during optimization. The number of layers L is how many layers are present in the network; for the autoencoder this value is doubled, while LSTM variants in this study only have L = 1 or L = 2 (for the stacked variant). The number of neurons n represents the number of processing units within a single layer.

4 Chosen values of hyperparameters for each model and data set. α is the learning rate, L the number of layers, n the number of neurons inside each layer, and 'b.n. n' the number of neurons inside the bottleneck layer for the autoencoder. The standard LSTM model only has a single layer, while stacked variants of LSTM contain two layers.


List of Figures

2.1 Fully connected neural network as an autoencoder with two layers, L = 2. Blue nodes are the input x^{(i)} and output y^{(i)} layers. Teal nodes are hidden layers h_1, h_2, …, h_{2L}.

2.2 A Recurrent Neural Network architecture. Weights W are recurrent connections that allow for the handling of sequential data.

2.3 Sequential encoder-decoder RNN architecture. The encoder takes as input the entire input sequence prior to the decoder attempting to reconstruct it using only the embedded output h^e_T. In this model, we use two LSTMs and illustrate a single prediction instance where the outputs of past time steps are used as inputs at later time steps during the decoding phase. This illustration shows the encoder and decoder LSTMs with the same number of hidden neurons, though in practice this can vary.

2.4 Decoding phase at time step t with attention mechanism. The attentional vector α_tk is multiplied by its corresponding encoder output h^e_k, then these products are summed to output a context vector r_t. F(h^e_k, h^d_{t−1}) represents a single-layer network that computes the alignment α_tk between the previous decoder output h^d_{t−1} and each encoder output h^e_k.

2.5 A Hidden Markov Model where X_1, …, X_3 is a sequence of states that are unobserved, but each state's output is observed.

3.1 t_2 is a contextual anomaly in this time series of average high temperatures over a year in Florida, USA. It is important to note that the temperature at time t_1 is equal to the temperature at t_2 but occurs in a different context: the average high temperature in September in Florida is not 74 Fahrenheit.

3.2 The red data points collectively are anomalous in this simulated human electrocardiogram plot. In this case, the sequence of red data points is collectively an anomaly, but each red data point by itself is not anomalous. The main cause of this anomaly is the collective occurrence of 0's sequentially despite the existence of other 0-valued pressure points.

3.3 We show PCA as a linear autoencoder. By setting the weight matrices W_1 = U^T and W_2 = U we can create a linear autoencoder using PCA.

4.1 Data set construction process from a sequence of D-dimensional features. The example in the figure is for D = 3 and T = 4.


4.2 Precision at recall for all models and data sets. Bi stands for bidirectional, S for stacked, ED for Encoder-Decoder, and Attn.ED for Attentional Encoder-Decoder. The shaded region represents one standard deviation from the mean; this information is absent for CICIDS data.

4.3 Average precision at recall for all models and data sets. Bi stands for bidirectional and S for stacked. Standard deviation is omitted for clarity.

4.4 Attention map for an input sequence containing several anomalous time steps (red). White colors indicate higher values (maximum of 1). Numbers on both axes indicate time steps starting at 0; the top-left square is input/encoder step 0 and output/decoder step 0. The attention component does not have any restrictions in terms of focus; that is, each decoding step can use all timesteps from the encoder's end.

5.1 In this figure we walk through the steps taken to pinpoint anomalous regions of a time series. We show two approaches: the first is the reconstruction difference approach, and the second is an approach we contribute, coined closest non-anomaly (CNA), which we describe in greater detail in the next section. In the first plot we show an anomalous time series (dotted red), its reconstruction (green), and the closest non-anomalous signal (blue) to the anomalous time series, which is what CNA proposes. Our hypothesis is that CNA is better for pinpointing anomalous regions within an anomalous time series than the reconstruction difference approach, because the reconstruction of an anomalous time series often contains major deviations across a majority of time steps. Given the signals shown in the first plot, the next step is to take the absolute difference between the anomalous signal and its reconstruction, and again between the anomalous signal and the CNA signal, and to introduce a threshold. We show this step in the second plot. The final step is to map each difference value for each signal shown in the second plot to 0 if the value is less than the threshold and 1 if it is greater than the threshold. The result is the two binary sequence signals shown in the final plot. We reason that the time steps or regions containing 1 represent anomalous regions within the original anomalous signal. In this example, the reconstruction pinpoints such large regions that it gives us little insight, because the reconstruction of a time series often incurs major deviations due to the downstream effect of an earlier cause. On the contrary, with our proposed CNA method we find the closest non-anomalous signal to the anomalous signal and can pinpoint specific regions of dissimilarity, which we hypothesize are the anomalous regions. We also contribute a novel evaluation method that leverages Boolean metrics such as Precision, Recall, Accuracy, Hamming distance, Jaccard similarity, and F1 score. Each of these metrics provides a useful yet different way to evaluate explanations, which we explain in greater detail in a later section. With these metrics, together with our explanations or explainable regions shown in the third plot for CNA and reconstruction error respectively, and with ground truth anomalous regions, we can evaluate performance by comparing these explanations against their ground truth, where reconstruction error represents a trivial baseline. The key contribution of our evaluation methodology is that the explanation we provide each metric is a region, and given ground truth labels we can measure explanation quality via these metrics. Moreover, we leverage the fact that our explanations are binary labels on a time series, which allows such evaluation metrics to be computed.

5.1 Confusion Matrix showing TP, FP, TN, and FN.

5.2 We choose 5 examples of explanations and a ground truth explanation to understand the different properties of each Boolean metric used in our evaluation methodology. Note that the anomaly occurs at time steps 2, 3, 4, 6, 7, 8.


5.3 The blue plot shows a sine wave with Gaussian noise inserted randomly. The red and green plots respectively show a sine wave with stepped square and triangle wave segments inserted randomly throughout. The areas with the noise or the stepped square and triangle wave segments are the anomalous regions within the time series. The purple plot shows a sine wave representing normal data. In our experiments we focus on 24-time-step windows of these sine wave signals that represent a single data stream, and so we include in orange a 24-time-step stream of the shaded region shown above to give better insight into what these 24-time-step streams look like.

5.4 We show 3 anomalous data instances (ground truth) from our discrete data set. For each we report in blue the original anomalous data instance, its corresponding reconstruction, and CNA, and in red we report the reconstruction, CNA, and ground truth explanations respectively. Note that each sequence represents a Boolean list where red and blue squares represent 1 labels and white squares represent 0 labels.

5.5 We show an anomalous sample where the anomaly is Gaussian noise. We report in blue the original anomalous data instance, its corresponding reconstruction, and CNA, and in red we report each reconstruction, CNA, and ground truth explanation respectively.

5.6 We show an anomalous sample where the anomaly is stepped square segments. We report in blue the original anomalous data instance, its corresponding reconstruction, and CNA, and in red we report each reconstruction, CNA, and ground truth explanation respectively.

5.7 We show an anomalous sample where the anomaly is triangle wave segments. We report in blue the original anomalous data instance, its corresponding reconstruction, and CNA, and in red we report each reconstruction, CNA, and ground truth explanation respectively.


Chapter 1

Introduction

1.1 Motivation

Cybersecurity has been recognized as an established area of study for computer system experts and network security operatives for decades, and it will continue to grow as the number of cyber attacks continues to rise (Cybersecurity Ventures, 2017). It is estimated that global spending will need to grow from 6 trillion dollars as of 2017 to 30 trillion dollars by 2030 to enhance cybersecurity practices (Cybersecurity Ventures, 2017; Atlantic Council, 2017). Specifically, cybersecurity is an area that focuses on protecting information stored on computer-based systems from adversaries that may have malicious intents (Buczak and Guven, 2016). The most powerful malicious attacks on computer-based systems involve unauthorized access, modification, or deletion of data (Buczak and Guven, 2016). A primary solution for defending against such malicious attacks and protecting a computer system is the employment of an intrusion detection system (IDS) (Liao et al., 2013).

At a high level, an IDS is software that specializes in analyzing activities that have taken place on a computer system and aims to uncover indications of whether the system has been maliciously misused (Liao et al., 2013). The two main classes of IDSs are anomaly-based and signature-based (Liao et al., 2013). Signature-based systems rely on an understanding of manually identified patterns for which detection rules can be defined. A major pitfall of these systems is that they require the specification of many rules for how to handle known patterns, which creates a lot of manual work and can become infeasible in practice (Liao et al., 2013). In contrast, anomaly-based systems attempt to automatically identify anomalies from examples of normal data and perhaps abnormal data (Garcia-Teodoro et al., 2009). If abnormal data is present, this is supervised learning; if only normal data is present, this is unsupervised learning. A major advantage of anomaly-based systems is that they aim to generalize beyond specific examples, and with unsupervised learning they can generalize to never-before-observed anomalies as well.

To date there have been multiple contributions of machine learning and statistical approaches to anomaly-based detection tasks within the cybersecurity domain, both supervised and unsupervised. Supervised approaches include support vector machines (SVMs) (Mulay et al., 2010), Bayesian networks (Jemili et al., 2007), naive Bayes (Panda and Patra, 2007), and hidden Markov models (Hu et al., 2009). Unsupervised approaches include clustering-based algorithms (Syarif et al., 2012; Duan et al., 2009; Jiang and An, 2008), principal component analysis (Sakurada and Yairi, 2014), local outlier factor (Amer and Goldstein, 2012), and autoencoders (Zhou and Paffenroth, 2017; Xu et al., 2018).

Largely, while many machine learning and statistical strategies have been used for cybersecurity, there exists a large gap in the literature on how deep learning approaches can be applied to this domain. We motivate the need for deep learning approaches for the task of anomaly detection next.

The development of machine learning in the form of deep learning has gained incredible traction, becoming a state-of-the-art approach for many tasks. Within deep learning, recurrent neural networks (RNNs) have become a go-to approach for machine translation, speech recognition, and handwriting recognition tasks (Bahdanau et al., 2014). RNNs thrive due to their unique ability to capture long-term dependencies embedded within time series data. At the same time, a majority of the literature on unsupervised anomaly detection involves statistical modelling, and, more importantly, there does not exist an exploration of deep unsupervised approaches driven by RNNs for this particular task.

This leads us to the broad aim of this thesis. We examine an array of RNN-based architectures for the task of unsupervised anomaly detection in time series and evaluate their performance. The goal is to understand which prominent RNN architectures are best suited for time-series-based anomaly detection.

The two open questions we aim to answer in this thesis are the following:

(1) Among popular RNN architectures, which ones are most promising for the task of unsupervised anomaly detection in time series?

(2) Given that we can identify anomalous time series using an RNN-based model, how can we pinpointanomalous regions as a means of explaining anomalies that are detected?

In the following sections we formally introduce our contributions and outline the remainder of this thesis.

1.2 Contributions

The goal of this thesis is to explore RNN-based deep learning approaches for unsupervised anomalydetection in time series data. We make the following contributions:

(1) We start by conducting a comprehensive comparative evaluation of RNN-based deep learning methods for anomaly detection using popular types of recurrent neural networks across a selection of time series data sets. The RNN-based anomaly detectors we compare are reconstruction-based: we use an RNN to model the benign behaviour of a particular system represented as a time series, and then, using the reconstruction error of other time series propagated through our RNN-based anomaly detector together with rank-based metrics, we evaluate and compare performance.

(2) Next, given an identified anomalous time series, we propose an approach that is capable of pinpointing the specific areas of the time series that correspond to the anomaly. This is motivated by the need to explain to an end-user the cause of an anomaly; specifically, we refer to this as providing an explanation to an end-user. While reconstruction error could be used to pinpoint such regions, we show in our experiments why it is not a strong method for doing so, and we use it as our baseline approach. We evaluate our approach on discrete and continuous data sets and compare its performance against our reconstruction-error-driven baseline.


(3) Our final contribution is a novel methodology for the automatic evaluation of time series anomaly explanations. We employ this evaluation methodology for the experiments mentioned in (2).

1.3 Outline

The thesis proceeds as follows. In Chapter 2 we present the background material relevant to our contributions. We start by introducing the relevant machine learning, deep learning, and probabilistic graphical model background and end with details on the reconstruction error method for anomaly detection.

In Chapter 3, we start by outlining both supervised and unsupervised deep anomaly detection. Next we delve into relevant anomaly detection background and provide an overview of deep and non-deep learning approaches for anomaly detection. Following this, we provide background on explainable artificial intelligence and motivate the importance of explaining predictions of machine learning models.

In Chapter 4, we present a comprehensive comparative evaluation of RNN-based deep learning techniques for anomaly detection using popular types of recurrent neural networks. We begin by providing relevant notation, describe our unsupervised reconstruction-based RNN anomaly detector, and then proceed to describe our methodology for detecting anomalous time series. Next, we describe the data sets used for our experiments. This chapter ends with a description of the evaluation metrics used for measuring performance and a discussion of our experimental results.

In Chapter 5, we begin by motivating the need for explainability techniques in neural network based anomaly detection. Next, we formally describe how reconstruction error is used to pinpoint anomalous regions of time series and discuss why it is not a strong approach. Using this as our motivation, we then propose our novel explainability approach and discuss our experimental setup. Next, we propose an evaluation methodology that is able to evaluate the quality of explanations. This chapter ends with a synopsis of our experimental results and a discussion of key findings.


Chapter 2

Background

In this chapter we provide a comprehensive review of relevant knowledge pertaining to this thesis. We start with the machine learning background required for this thesis and touch on relevant machine learning approaches for anomaly detection discussed in Chapter 3. Subsequently, we describe important deep neural network architectures that are relevant to the scope of this thesis. Following this, we provide an introduction to anomaly detection and discuss its scope as it pertains to this thesis. Together, these parts are important background material as they lead into Chapter 3. Finally, we provide relevant background on explainable artificial intelligence as it motivates our final chapter on explanation.

2.1 Principal Component Analysis (PCA)

We start by providing background on a commonly used linear approach called principal component analysis (PCA), which we explain in greater detail below. Mainly, the basis for unsupervised anomaly detection in this thesis is the autoencoder framework, which we detail in Chapter 4, and PCA is a simple type of linear autoencoder. Later we outline how we work with non-linear autoencoders in the form of deep neural networks, including sequential RNN-based autoencoders. Nonetheless, PCA represents the simplest of all autoencoders (linear, non-sequential) and hence is used as a baseline. In Section 3.3.1 we review literature on how PCA has been used in anomaly detection.

Principal component analysis is a historical contribution by Karl Pearson within linear algebra that has gained traction in modern-day data analysis (Pearson, 1901; Wold et al., 1987). It has become a robust approach for feature extraction in data-driven areas such as machine learning and other artificial intelligence domains. At a high level, PCA extracts linearly uncorrelated variables called principal components from a data set, where the first principal component is the direction of the largest variability within the data set and each subsequent component captures the next highest variance across the data set. One distinct feature of the extracted principal components is that they respect the constraint that each is orthogonal to the others. In other words, the set of principal components extracted from a data set forms an uncorrelated orthogonal basis (Shlens, 2003; Pearson, 1901). It is important to note that PCA makes some clear assumptions; namely, PCA is driven by the notion that the principal components with the largest variances are the most important.

The formal intuition of PCA is as follows. Let our training data be N vectors {x_n}_{n=1}^N of dimensionality D, so x_n ∈ R^D. The goal is to reduce the dimensionality of the data, and we achieve this by linearly projecting it to a lower-dimensional space. That is,

x ≈ Uz + a, (2.1)

where U is a D × M matrix and z is an M-dimensional vector. From here we want to find the orthogonal directions in the space that contain the highest variance and then project the data onto this subspace. To achieve this we must find the principal components. The algorithm to do so is as follows.

(1) Start by centering the data by subtracting the mean from each variable.

(2) Calculate the covariance matrix:

C = (1/N) ∑_{n=1}^{N} (x^{(n)} − x̄)(x^{(n)} − x̄)^T,    (2.2)

where x̄ is the mean.

(3) Using C, extract the principal components: select the top M eigenvectors of C, where C = UΣU^T ≈ U_{1:M} Σ_{1:M} U_{1:M}^T. Here U is orthogonal with unit-length eigenvector columns (U^T U = UU^T = I), and Σ is a diagonal matrix of eigenvalues, each representing the variance in the direction of its eigenvector.

(4) Assemble these found eigenvectors into a D × M matrix U of principal components.

(5) Now we can express D-dimensional vectors x by projecting them to M-dimensional z, where z = U^T x.

(6) Next, to project back into the original space we simply compute UU^T x. This step is key as it allows us to use PCA as a linear autoencoder.
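To make steps (1)-(6) concrete, here is a minimal NumPy sketch of PCA used as a linear autoencoder; the data matrix X, the choice of M, and the per-point squared-error score are illustrative assumptions rather than the exact setup used later in this thesis.

```python
# Minimal sketch: PCA as a linear autoencoder (illustrative assumptions).
import numpy as np

def pca_autoencoder(X, M):
    """Project X (N x D) onto the top-M principal components and
    reconstruct it, returning the reconstruction and per-point error."""
    mean = X.mean(axis=0)                    # step (1): center the data
    Xc = X - mean
    C = (Xc.T @ Xc) / X.shape[0]             # step (2): covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # step (3): eigendecomposition
    order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
    U = eigvecs[:, order[:M]]                # step (4): D x M matrix U
    Z = Xc @ U                               # step (5): encode, z = U^T x
    X_hat = Z @ U.T + mean                   # step (6): decode, U U^T x
    return X_hat, np.sum((X - X_hat) ** 2, axis=1)

# Points far from the principal subspace get high reconstruction error,
# which is the basis for flagging candidate anomalies.
X = np.random.randn(100, 5)
X_hat, errors = pca_autoencoder(X, M=2)
```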

In the next section we provide an overview of deep learning knowledge as it pertains to the rest of this dissertation.

2.2 Deep Learning

In this section, we provide a comprehensive overview of all neural network models pertaining to the rest of this thesis. For each model we outline its architecture and the respective equations. We note that the scope of this work is geared towards unsupervised anomaly detection in time series via deep learning architectures, and so we cater our notation and explanation of architectures to respect this. In addition, the use of neural networks in this thesis focuses on the framework of autoencoders: a neural network that aims to learn a compressed representation of an input (Rumelhart et al., 1988). With this in mind, we cover each neural network as an autoencoder, because autoencoders form the backbone of our RNN-based anomaly detection approach, which we detail in Section 3.3.


2.2.1 Notation

While most notation is introduced in place, we briefly summarize some common notation used in this thesis for ease of reference. We represent a time series data set as a sequence of observations [x_1, x_2, x_3, …], where each x_t = [x_t^1, x_t^2, …, x_t^D] ∈ R^D is a D-dimensional vector.

2.2.2 Fully Connected Neural Network

The fully connected neural network, also called a deep feedforward network or multilayer perceptron (MLP), has been a backbone of deep learning ever since Bishop first described it in 1995 (Bishop et al., 1995). The underlying goal of a fully connected neural network is to approximate an arbitrary function f*. For instance, with a logistic regression classifier, given an input x we map it via a function to a single class y, that is, y = f(x). Similarly, a fully connected neural network specifies a mapping y = f(x, θ) but also learns the values of the parameters θ that result in the optimal function approximation. Specifically, a fully connected neural network encapsulates a series of layers of neurons where adjacent layers are interconnected. We often refer to these connections as weights; each weight sits between two adjacent layers. The input layer consumes the feature vector x^{(i)} from our data set, and the output layer's target is some output vector y^{(i)}. All other layers are known as hidden layers that perform transformations on an arbitrary feature vector x^{(i)}. Given an input feature vector x^{(i)}, we feed it through the network until it reaches the output layer; we call this a forward pass. Using the backpropagation algorithm by Rumelhart et al., we train the neural network (Rumelhart et al., 1988). A neuron is a processing unit that transforms a given input and sends its output further down to the next layer. These transformations, also known as activations, are most commonly non-linear but can be linear. For instance, we can specify sigmoid, hyperbolic tangent, or rectified linear unit activations that transform a given input in a non-linear manner (Han and Moraga, 1995). In general, the ability to specify non-linear activation functions is the underlying reason why neural networks get their non-linear representational capabilities and ultimately encourage generalization. We show in Figure 2.1 a fully connected neural network with 2L layers, where we process an input feature vector from left to right with a series of transformations as follows.

f(x, θ) but also learns the value of the parameters θ that result in the optimal function approximation.Specifically, a fully connected neural network encapsulates a series of layers of neurons where each layerindividually are interconnected. We often refer to these connections as weights. Each weight is betweentwo adjacent layers. The input layer consumes the feature vector x(i) from our data set, and the outputlayer’s target is some output vector y(i). Formally, every other layer are known as hidden layers thatperform transformations on an arbitrary feature vector x(i). Given an input feature vector x(i) we feed itthrough our fully connected neural network until it reaches the output layer, we call this a forward-pass.Using the backpropagation algorithm by Rumelhart et al., we perform training on our neural network(Rumelhart et al., 1988). A neuron is a processing unit that transforms a given input and sends it’soutput further down to the next layer. These transformations, also known as activation’s are mostcommonly non-linear activation’s but can be linear. For instance, we can specify a sigmoid, hyperbolictangent or rectified linear unit activation’s that will transform a given input in a non-linear manner (Hanand Moraga, 1995). In general, the ability to specify non-linear activation functions for a given neuralnetwork is the underlying reason why neural nets get their non-linear representational capabilities andultimately encourage generalization. Below we show in Figure 2.3 a fully connected neural network with2L layers where we process an input feature vector from left to right with a series of transformations asfollows.

h_0 = x^{(i)}
a_k = h_{k−1} · W_k,    k = 1, …, 2L
h_k = f(a_k),    k = 1, …, 2L
y^{(i)} = h_{2L}

We note that the output of the model M(x^{(i)}) is the activation of the output layer, h_{2L}. We define a loss function L = F(x^{(i)}, y^{(i)}); for instance, if we choose squared error, then F = ∑_{i=1}^{N} ||x^{(i)} − y^{(i)}||^2, and we can then treat all the weights W_1, …, W_{2L} that connect the layers as a set of parameters to be optimized with respect to this function. Note that we omit the bias connection b_k for each layer to focus on the key relationships between layers. The architecture shown in Figure 2.1 is specifically known as an autoencoder because the inputs and outputs are equivalent, x^{(i)} = y^{(i)}. To provide clarity, the first portion of an autoencoder is known as the encoder and the second portion is the decoder.


Figure 2.1: Fully connected neural network as an autoencoder with two layers, L = 2. Blue nodes are the input x^{(i)} and output y^{(i)} layers. Teal nodes are hidden layers h_1, h_2, …, h_{2L}.

The encoder is the first L layers, including the bottleneck layer. The primary goal of the encoder is to provide a lower-dimensional representation, or embedded feature vector, of the input feature vector, that is, x → h_L. The decoder consists of the layers following the bottleneck layer up to and including the output layer. The primary goal of the decoder is to take the embedded feature vector as input and reconstruct the original input feature vector, which we formally denote as h_L → x. In general, the specification of the encoder is mirrored in reverse fashion in the decoder.
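As a concrete illustration of the encoder/decoder split, the following is a minimal PyTorch sketch of a fully connected autoencoder in the spirit of Figure 2.1; the layer sizes, tanh activations, and batch shape are illustrative assumptions, not the configurations used in our experiments.

```python
# Minimal sketch: fully connected autoencoder (illustrative assumptions).
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, d_in=24, d_hidden=16, d_bottleneck=4):
        super().__init__()
        # Encoder: first L layers, ending in the bottleneck h_L.
        self.encoder = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.Tanh(),
            nn.Linear(d_hidden, d_bottleneck), nn.Tanh())
        # Decoder: mirrors the encoder back to the input dimension.
        self.decoder = nn.Sequential(
            nn.Linear(d_bottleneck, d_hidden), nn.Tanh(),
            nn.Linear(d_hidden, d_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.randn(32, 24)             # a batch of 32 feature vectors
loss = ((model(x) - x) ** 2).sum()  # squared-error loss F
```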

2.2.3 Recurrent Neural Networks

In this subsection we cover recurrent neural networks (RNNs). It is important to note that in all experiments throughout this thesis we use an RNN as our anomaly detector, and so this subsection is critical to precisely understanding the remainder of this thesis.

Recurrent neural networks (RNNs) have been a prominent deep neural network architecture largely due to their ability to generalize powerfully over sequential data x_1, …, x_T (Jordan, 1997; Elman, 1990). The underlying reason why RNNs thrive at handling sequential data is that they leverage parameter sharing, an early concept within machine learning and statistics from the 1980s. Parameter sharing in this context allows models to apply to sequences of different lengths and to generalize across them. When a specific piece of datum can occur at multiple positions within a given sequence, parameter sharing becomes important. Moreover, if we had separate parameters for each value of a given time index, we could not generalize to sequences of variable lengths that were not in our training set. An RNN has extra connections between its hidden layers, and these correspond to a form of internal memory that is able to represent the current state of the sequence up to a particular time step t. Each state is recursively updated as each input is processed. To be clear, the update step is computed using the following equations:

h_t = f(W h_{t−1} + U x_t)    (2.3)
y_t = g(V h_t)


where h_t represents the state. Compared to the fully connected network, h_t is the output of layer t. Both f and g are activation functions; depending on the problem domain, various activation functions such as linear, sigmoidal, or softmax might be better suited. In general, f is the hyperbolic tangent.

Thus far, we have described an RNN in its simplest form, though it can be further improved upon depending on the problem instance. Specifically, if desired we can introduce additional processing units and add multiple layers (stacked RNN), or we can reverse the order of inputs, x_T → x_{T−1} → … → x_1 (bidirectional RNN).
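To make the recursive update in Eq. (2.3) concrete, a minimal NumPy sketch follows; the dimensions, random weight scales, and identity output activation g are illustrative assumptions.

```python
# Minimal sketch: the RNN update of Eq. (2.3) (illustrative assumptions).
import numpy as np

D, H = 3, 8                        # input and hidden dimensionality
W = np.random.randn(H, H) * 0.1    # recurrent weights
U = np.random.randn(H, D) * 0.1    # input weights
V = np.random.randn(D, H) * 0.1    # output weights

h = np.zeros(H)                    # zero initial state
for x_t in np.random.randn(10, D): # a 10-step input sequence
    h = np.tanh(W @ h + U @ x_t)   # h_t = f(W h_{t-1} + U x_t), f = tanh
    y_t = V @ h                    # y_t = g(V h_t), g = identity here
```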

Stacked RNN

We can simply define a stacked RNN as an RNN with multiple layers. In practice, we generally use only a few layers, because having too many layers often yields diminishing returns in performance. Furthermore, the more layers an RNN has, the more computationally expensive training becomes, and the model becomes more complex with each additional layer. Choosing how many layers an RNN should have is typically problem dependent. The intuitive motivation for having more than one layer is so that the neural network can increasingly grasp intrinsic details embedded in the training data. For instance, in natural language processing (NLP) we often turn to an RNN to learn text, and an RNN learns text first by learning letters, then words, then sentences, and so on at successive layers. In other words, each layer plays a particular role in the learning process. Below, we extend the RNN equations to account for L layers, and show an RNN architecture in Figure 2.2.


Figure 2.2: A Recurrent Neural Network architecture. Weights W are recurrent connections that allow for the handling of sequential data.

h^{(1)}_t = f(W^{(1)} h^{(1)}_{t−1} + U^{(1)} x_t)    (2.4)
h^{(2)}_t = f(W^{(2)} h^{(2)}_{t−1} + U^{(2)} h^{(1)}_t)
⋮
h^{(L)}_t = f(W^{(L)} h^{(L)}_{t−1} + U^{(L)} h^{(L−1)}_t)
y_t = g(V h^{(L)}_t)


The set of parameters to optimize are the weight matrices W^{(j)}, U^{(j)} for j ∈ {1, …, L}, and V.

Bidirectional RNN

The Bidirectional RNN (BiRNN) (Graves and Schmidhuber, 2005; Bahdanau et al., 2015) enjoys a more powerful information processing model based on the following premise: an event at step t in a sequence relates both to its historical context up to step t−1 and to its succeeding context from t+1 up to T. That is, how does the current state h_t affect the future? For example, in Natural Language Processing (NLP), a word is usually defined in the context of its neighboring words (both before and after), which has inspired several word-embedding representations (Mikolov et al., 2013; Pennington et al., 2014) that are now a common preprocessing step for NLP tasks.

A BiRNN consists of two RNNs, forward and backward. The forward RNN reads the input sequence as it is ordered (from 1 to T) and calculates a sequence of forward hidden states (→h_1, …, →h_T). The backward RNN reads the sequence in reverse order (from T to 1), producing a sequence of backward hidden states (←h_1, …, ←h_T). Finally, the hidden state in the BiRNN is defined as a concatenation of these two sequences at each step, that is, h_t = [→h_t; ←h_t]. The update equations remain the same as for the RNN described previously.
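A minimal NumPy sketch of the bidirectional state construction follows, reusing the update of Eq. (2.3); the separate forward/backward weights, the realignment of the backward pass, and all dimensions are illustrative assumptions.

```python
# Minimal sketch: BiRNN hidden states (illustrative assumptions).
import numpy as np

def rnn_states(xs, W, U):
    """Run the Eq. (2.3) update over xs and return all hidden states."""
    h, states = np.zeros(W.shape[0]), []
    for x_t in xs:
        h = np.tanh(W @ h + U @ x_t)
        states.append(h)
    return np.stack(states)

D, H = 3, 8
Wf, Uf = np.random.randn(H, H) * 0.1, np.random.randn(H, D) * 0.1
Wb, Ub = np.random.randn(H, H) * 0.1, np.random.randn(H, D) * 0.1

xs = np.random.randn(10, D)
fwd = rnn_states(xs, Wf, Uf)              # forward states, steps 1..T
bwd = rnn_states(xs[::-1], Wb, Ub)[::-1]  # backward states, realigned
h = np.concatenate([fwd, bwd], axis=1)    # h_t = [fwd_t ; bwd_t]
```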

Long-Short Term Memory RNN

One challenge with training a simple RNN architecture is that long-term dependencies heavily threaten an RNN's ability to learn effectively. The underlying reason is that the gradients propagated through the network over time during training tend to either vanish or explode (Hochreiter, 1998); these are respectively known as the vanishing gradient problem and the exploding gradient problem. The vanishing gradient problem has been widely studied and many authors have provided various solutions to overcome it, but we focus on a specific type of network, the Long-Short Term Memory (LSTM) cell network (Hochreiter, 1998; Doya, 1999; Bengio et al., 1994). The goal of the LSTM cell is to allow for a constant flow of information via gates. Gates are constructs in a deep neural network that dictate what information will be processed. An LSTM cell has three gates: an input, a forget, and an output gate. The input gate dictates what information should pass through and be considered during the learning process. The forget gate provides a mechanism that prevents various parts of the cell state from being updated, controlling which parts should be modified and stored for later. Lastly, the output gate dictates what information is pushed out of the cell as output. Below we write the general update rules for an LSTM cell.

i_t = σ(W_i x_t + V_i h_{t−1})    (2.5)
f_t = σ(W_f x_t + V_f h_{t−1})
o_t = σ(W_o x_t + V_o h_{t−1})
c̃_t = tanh(W_c x_t + V_c h_{t−1})
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

Here i_t, f_t, and o_t denote the input, forget, and output gates, σ is the sigmoid activation function, c̃_t represents the candidate cell state, and h_t represents the output of the cell at time step t. Common practice is to initialize the model with a zero initial state, that is, c_0 = 0. During training, the parameters we optimize are the W and V matrices.
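To make the gating in Eq. (2.5) concrete, here is a minimal NumPy sketch of a single LSTM step; the weight shapes are illustrative assumptions, and biases are omitted as in the equations above.

```python
# Minimal sketch: one LSTM step per Eq. (2.5) (illustrative assumptions).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, V):
    """W and V hold input/recurrent weights for the input (i),
    forget (f), output (o), and candidate (c) paths."""
    i = sigmoid(W['i'] @ x_t + V['i'] @ h_prev)        # input gate
    f = sigmoid(W['f'] @ x_t + V['f'] @ h_prev)        # forget gate
    o = sigmoid(W['o'] @ x_t + V['o'] @ h_prev)        # output gate
    c_tilde = np.tanh(W['c'] @ x_t + V['c'] @ h_prev)  # candidate state
    c = f * c_prev + i * c_tilde                       # new cell state
    h = o * np.tanh(c)                                 # cell output
    return h, c

D, H = 3, 8
W = {k: np.random.randn(H, D) * 0.1 for k in 'ifoc'}
V = {k: np.random.randn(H, H) * 0.1 for k in 'ifoc'}
h, c = np.zeros(H), np.zeros(H)                        # zero initial state
for x_t in np.random.randn(10, D):
    h, c = lstm_step(x_t, h, c, W, V)
```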


Figure 2.3: Sequential encoder-decoder RNN architecture. The encoder takes as input the entire input sequence prior to the decoder attempting to reconstruct it using only the embedded output h^e_T. In this model, we use two LSTMs and illustrate a single prediction instance where the outputs of past time steps are used as inputs at later time steps during the decoding phase. This illustration shows the encoder and decoder LSTMs with the same number of hidden neurons, though in practice this can vary.

2.2.4 Encoder-Decoder framework

We have seen various RNNs that are capable of mapping an input sequence to a fixed-size vector representing the sequence. In this subsection, we discuss how an RNN can be extended to map an input sequence to an output sequence of variable length. This is particularly useful in numerous applications such as machine translation, speech recognition, and question-answering AIs. The earliest contribution of an RNN architecture for mapping an input sequence to a variable-length output was by Cho et al. and Sutskever et al., who independently developed the architecture (Cho et al., 2014; Sutskever et al., 2014). This architecture is the Encoder-Decoder model, or sequence-to-sequence model, which we illustrate in Figure 2.3. The encoder-decoder architecture has also shown strong promise at sequential anomaly detection (Malhotra et al., 2016). The Encoder-Decoder model is quite simple: it involves two RNNs, where an input sequence is processed with one RNN, known as the encoder, to produce an encoded state h^e_T, and the second RNN, known as the decoder, aims to decode the original input sequence given the state h^e_T. To be clear, the decoder is initialized with the last state of the encoder, that is, c^d_0 = F(h^e_T). This type of RNN uses the same update rules as the simple RNN; the main difference lies in how the network is trained. One particular advantage of this network is that during training the decoder can receive the correct inputs, which can force the network to learn effectively. This is known as teacher forcing (Williams and Zipser, 1989). At test time, the decoder reuses its previous step's output as its next time step's input. That is, if the output at time step t is h^d_t, then at time step t+1 the input used is x^d_{t+1} = h^d_t, and so on until the complete sequence is decoded. In addition, we note that teacher forcing could also be leveraged at test time to encourage better predictions (Williams and Zipser, 1989).
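The following minimal PyTorch sketch mirrors Figure 2.3: one LSTM encodes the window into its final state, and a second LSTM decodes the reconstruction step by step, feeding each output back as the next input (no teacher forcing). The dimensions, the zero first decoder input, and the per-window reconstruction-error score are illustrative assumptions.

```python
# Minimal sketch: sequential encoder-decoder autoencoder (assumptions noted above).
import torch
import torch.nn as nn

class Seq2SeqAutoencoder(nn.Module):
    def __init__(self, d_in=1, d_hidden=32):
        super().__init__()
        self.encoder = nn.LSTM(d_in, d_hidden, batch_first=True)
        self.decoder = nn.LSTM(d_in, d_hidden, batch_first=True)
        self.out = nn.Linear(d_hidden, d_in)

    def forward(self, x):                            # x: (batch, T, d_in)
        _, state = self.encoder(x)                   # encode to final (h, c)
        step = torch.zeros(x.size(0), 1, x.size(2))  # first decoder input
        outputs = []
        for _ in range(x.size(1)):                   # decode T steps
            dec, state = self.decoder(step, state)
            step = self.out(dec)                     # feed output back in
            outputs.append(step)
        return torch.cat(outputs, dim=1)             # reconstruction of x

model = Seq2SeqAutoencoder()
x = torch.randn(8, 24, 1)                            # 8 windows of 24 steps
recon_error = ((model(x) - x) ** 2).mean(dim=(1, 2))  # per-window score
```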


2.2.5 Attentional Encoder-Decoder

An attention mechanism has been shown to be a promising addition to the existing encoder-decoder model in tasks involving text or image processing (Bahdanau et al., 2015; Luong et al., 2015; Xu et al., 2015). It is motivated by a notion from human perception: humans generally do not process a visual scene in its entirety; instead they focus their attention on selective sub-areas of the scene to gain information, and they combine information gained from various sub-areas over time to develop a representation of the scene in their minds. In general, humans accomplish this by performing guiding eye movements, and ultimately this feature of human perception aids in various decision-making tasks. Guidance by eye movements has been widely studied within the neuroscience and cognitive domains and has been a fundamental driver for the attention mechanism in neural networks (Mnih et al., 2014). The attention mechanism comes into play during decoding and encourages the decoder to focus on important states of the encoder instead of relying on the final state alone. Intuitively, when we produce an output at time step t, it is plausible that the state of the encoder at time step t or t−1 can be useful when processing the input sequence.

With respect to interpretation, because neural networks have predominantly been seen as black boxes, the attention mechanism is one method to produce a clearer interpretation of an RNN's output based on its respective input. Furthermore, it has been shown that the attention mechanism can improve prediction performance in comparison to non-attentional models (Bahdanau et al., 2015). In the realm of anomaly detection, specifically in the cybersecurity domain, the attention mechanism has been widely studied because it can yield powerful neural networks. If our input features naturally lend themselves to an interpretation, then the attention mechanism can direct us to the features accountable for the produced output, which can further help explain the cause of detected anomalies. Below we denote the equations for the LSTM decoder with this modification.

i_t = σ(W_i x^d_t + V_i h^d_{t−1} + C_i r_t)    (2.6)
f_t = σ(W_f x^d_t + V_f h^d_{t−1} + C_f r_t)
o_t = σ(W_o x^d_t + V_o h^d_{t−1} + C_o r_t)
c̃_t = tanh(W_c x^d_t + V_c h^d_{t−1} + C_c r_t)

r_t is the context vector that is the output from the encoder. We denote this below:

r_t = ∑_{k=1}^{T} α_{tk} h^e_k    (2.7)

Figure 2.4: Decoding phase at time step t with attention mechanism. The attentional vector α_tk is multiplied by its corresponding encoder output h^e_k, then these products are summed to output a context vector r_t. F(h^e_k, h^d_{t−1}) represents a single-layer network that computes the alignment α_tk between the previous decoder output h^d_{t−1} and each encoder output h^e_k.

Page 22: Unsupervised Deep Learning for Anomaly Detection and ......Acknowledgements First and foremost I would like to thank my supervisor, Professor Scott Sanner for his continuous support

Figure 2.4: Decoding phase at time step $t$ with the attention mechanism. The attention weight $\alpha_{tk}$ is multiplied by its corresponding encoder output $h^e_k$, and these products are summed to output a context vector $r_t$. $F(h^e_k, h^d_{t-1})$ represents a single-layer network that computes the alignment $\alpha_{tk}$ between the previous decoder output $h^d_{t-1}$ and each encoder output $h^e_k$.

$\alpha_{tk} = \frac{\exp(e_{tk})}{\sum_{j=1}^{T} \exp(e_{tj})}, \quad k \in \{1, \ldots, T\}$ (2.8)

$e_{tj} = v_a^\top \tanh(W_a h^d_{t-1} + U_a h^e_j), \quad j \in \{1, \ldots, T\}$

where $e_{tj}$ are energies relating the previous decoder output $h^d_{t-1}$ and encoder output $h^e_j$, computed using a single-layer network with weights $W_a$, $U_a$ and $v_a$. The normalized coefficients $\alpha_t = [\alpha_{t1}, \ldots, \alpha_{tT}]$, $t = 1, \ldots, T$, are computed at every step $t$ and offer a quick way to inspect the most significant contributions from the input sequence. The $C_*$ matrices in Eq. (2.6) are an additional set of weights being optimized that incorporate information from the context vector $r_t$. A single decoding step of this architecture is shown in Fig. 2.4. For the experiments in this thesis, we use a decoder based on a GRU-type cell (Bahdanau et al., 2015), which outperformed an LSTM-based decoder and has the modified update rules discussed previously. Nonetheless, the idea of a weighted context vector $r_t$ remains the same.
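To make Eqs. (2.7) and (2.8) concrete, the following is a minimal sketch of a single attention step in PyTorch. The tensor names, sizes and random initializations are our own illustrative assumptions, not the configuration used in our experiments.

```python
import torch

# Illustrative sizes (assumptions): T encoder steps, hidden size H
T, H = 10, 32
h_enc = torch.randn(T, H)          # encoder outputs h^e_1 .. h^e_T
h_dec_prev = torch.randn(H)        # previous decoder state h^d_{t-1}

# Single-layer alignment network F(h^e_j, h^d_{t-1}) with weights W_a, U_a, v_a
W_a = torch.randn(H, H)
U_a = torch.randn(H, H)
v_a = torch.randn(H)

# Energies e_{tj} = v_a^T tanh(W_a h^d_{t-1} + U_a h^e_j)  -- Eq. (2.8)
e = torch.tanh(h_dec_prev @ W_a.T + h_enc @ U_a.T) @ v_a   # shape (T,)

# Normalized attention coefficients alpha_{tk} via softmax
alpha = torch.softmax(e, dim=0)

# Context vector r_t = sum_k alpha_{tk} h^e_k  -- Eq. (2.7)
r_t = (alpha.unsqueeze(1) * h_enc).sum(dim=0)  # shape (H,)
```

The context vector `r_t` would then enter the decoder gates through the $C_*$ weight matrices of Eq. (2.6).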

2.2.6 Generative Adversarial Networks (GANs)

A GAN consists of two neural network components, a generator and a discriminator (Goodfellow et al., 2014). The purpose of the generator is to learn a distribution $p_g$ over data $x$ by first defining a prior on noise variables $p_z(z)$, and then representing a mapping to the data space as $G(z; \theta_g)$, where $G$ is a function with parameters $\theta_g$. The discriminator $D(x; \theta_d)$ outputs a single scalar: $D(x)$ represents the probability that $x$ came from the data rather than from $p_g$. $D$ is trained to maximize the probability of assigning correct labels


to training instances and instances generated from $G$. Simultaneously, $G$ is trained to minimize $\log(1 - D(G(z)))$. Specifically, $D$ and $G$ engage in the following two-player minimax game with value function $V(G, D)$:

$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].$ (2.9)

Intuitively, early on during training $G$ performs poorly, because $D$ can reject generated instances with high confidence as not coming from the training data. However, over time $G$ will converge and become a good estimator of $p_{\mathrm{data}}$.
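To illustrate how the minimax game of Eq. (2.9) translates into alternating gradient updates, below is a minimal, hypothetical PyTorch training loop; the architectures, data distribution and hyperparameters are stand-ins for illustration, not those of Goodfellow et al. (2014).

```python
import torch
import torch.nn as nn

# Minimal G and D (illustrative architectures, chosen for this sketch only)
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
eps = 1e-8  # numerical stability inside the logs

for step in range(1000):
    x = torch.randn(64, 2) * 0.5 + 1.0   # stand-in for samples from p_data
    z = torch.randn(64, 8)               # noise z ~ p_z

    # D step: maximize log D(x) + log(1 - D(G(z)))
    opt_d.zero_grad()
    d_loss = -(torch.log(D(x) + eps)
               + torch.log(1 - D(G(z).detach()) + eps)).mean()
    d_loss.backward()
    opt_d.step()

    # G step: minimize log(1 - D(G(z)))
    opt_g.zero_grad()
    g_loss = torch.log(1 - D(G(z)) + eps).mean()
    g_loss.backward()
    opt_g.step()
```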

2.3 Probabilistic Graphical Models

Probabilistic graphical models (PGMs) have been around for decades in artificial intelligence (AI) and have been the backbone of many recent machine learning contributions (Bishop, 2006). At a high level, a probabilistic graphical model is a graph that compactly expresses the conditional dependence structure between random variables (Frey and Jojic, 2005). In the context of anomaly detection, PGMs can succinctly model a normally operating computer network to create components of an IDS. In general, probabilistic graphical models thrive because of their ability to describe the dependencies among the components that make up a complex probability model and to succinctly represent assumptions (Frey and Jojic, 2005). Furthermore, an important advantage of graphical models has been their ability to achieve exponential speed-ups in decision making (Frey and Jojic, 2005). It is important to recognize that graphical models have also been heavily used by the artificial intelligence community as a key approach to planning under uncertainty (Cassandra et al., 1994). There are 3 main types of graphical models, namely Bayesian networks (BNs), Markov random fields (MRFs) and factor graphs (FGs) (Frey and Jojic, 2005). With respect to this dissertation, while we do not use PGMs in our experiments or as part of our contributions, the importance of including them here is twofold. First, PGMs have laid much of the groundwork in the domain of anomaly detection, which we want to acknowledge. Second, PGMs have tackled anomaly detection in time series via HMMs, and to date HMMs have been outperformed by RNN-based approaches on all sequential tasks; hence our contribution of RNN-based anomaly detection in time series explores RNNs as a replacement for HMMs.

2.3.1 Bayesian Models

Bayesian networks in particular have been an incredibly useful model that compactly represents some aspect of the world quite naturally via a directed graph. To be precise, by compact we are referring to their ability to represent a joint probability distribution compactly (Koller and Friedman, 2009). Next we formally describe the Bayesian network representation.

A Bayesian network represents a joint probability distribution over a set of random variables compactly (Koller and Friedman, 2009; Guo and Hsu, 2002; Cansado and Soto, 2008). Formally, given a set of random variables $x_1, \ldots, x_n$, we can represent the joint probability distribution within the Bayesian network using the chain rule as follows:

$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Parents}(x_i))$ (2.10)


A few important properties that must hold in a Bayesian network are the following (Koller and Friedman, 2009; Guo and Hsu, 2002):

• Every node within the network corresponds to a random variable; therefore a set of random variables makes up the nodes of the BN.

• A set of directed links connects pairs of nodes. Intuitively, if there is a link from node X to node Y then this can be understood as X having a direct influence on Y.

• Each node within a BN has its respective conditional probability table (CPT) that captures the effects of its parents. The parents of a node X are all nodes that have a link pointing to node X.

• A Bayesian network has no directed cycles; a BN is a directed acyclic graph (DAG).

We can understand a BN as a probabilistic expert that contains all the probabilistic knowledge represented by the structure of the BN and the CPT at each node.
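As a concrete illustration of the chain rule in Eq. (2.10), the following Python sketch computes a joint probability from CPTs for a hypothetical two-node network; the structure and probability values are made up for illustration.

```python
# Hypothetical rain -> wet-grass network; CPT values are invented for this sketch.
P_rain = {True: 0.2, False: 0.8}
P_wet_given_rain = {True: {True: 0.9, False: 0.1},
                    False: {True: 0.2, False: 0.8}}

def joint(rain: bool, wet: bool) -> float:
    # Chain rule, Eq. (2.10): P(rain, wet) = P(rain) * P(wet | Parents(wet))
    return P_rain[rain] * P_wet_given_rain[rain][wet]

print(joint(True, True))   # 0.2 * 0.9 = 0.18
```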

2.3.2 Naive Bayes

Naive Bayes has been a popular probabilistic classifier in machine learning for decades. This classifier has been desirable for many of its qualities: it is easy to implement, it runs quickly, it is likely to thrive when independence holds, it scales outstandingly well, it requires few training data samples, and it disregards irrelevant features. Next we formally define the Naive Bayes classifier. The fundamental principle of the Naive Bayes classifier is Bayes' theorem. That is, given random variables A and B, Bayes' theorem states the following:

$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$ (2.11)

From Bayes' theorem we can compute the probability of A given B, where B is our evidence and A is what we want to determine. Moreover, the main assumption here is that the input features are conditionally independent given a specific class label; from this we can deduce that no single feature has an effect on another. With this in mind, given a vector $\mathbf{x} = (x_1, \ldots, x_n)$ representing a data point to be classified with $n$ independent features, using the principles of Naive Bayes we can formally write the conditional distribution $p(\mathbf{x} \mid y = c)$ as a product of distinct conditional probabilities. Note that $c$ represents a single class.

$p(\mathbf{x} \mid y = c) = \prod_{i=1}^{n} p(x_i \mid y = c)$ (2.12)

From this it is important to recognize the nature of the data that Naive Bayes classifies and how this shapes the underlying model. Specifically, given discrete (binary) data that we want to classify using Naive Bayes, we can compute the likelihood of such variables using a Bernoulli event model:

$p(\mathbf{x} \mid y = c) = \prod_{i=1}^{n} p_c^{x_i} (1 - p_c)^{1 - x_i},$ (2.13)

where $c$ is the class and $p_c$ is the probability with which class $c$ generates feature value $x_i$. In the case where we are working with continuous data, we can instead use a Gaussian event model given a single


data point x. We formally denote this below.

$p(x \mid y = c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} \, e^{-\frac{(x - \mu_c)^2}{2\sigma_c^2}}$ (2.14)
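To illustrate Eqs. (2.11)–(2.14) together, the following is a minimal Python sketch of Naive Bayes classification under a Bernoulli event model; the priors and per-class parameters are illustrative assumptions, not learned from any data set.

```python
import numpy as np

def bernoulli_likelihood(x, p):
    # Eq. (2.13): product over features of p^{x_i} (1 - p)^{1 - x_i}
    x = np.asarray(x, dtype=float)
    return float(np.prod(p ** x * (1 - p) ** (1 - x)))

def gaussian_likelihood(x, mu, sigma2):
    # Eq. (2.14): univariate Gaussian density for a class c
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Posterior is proportional to prior * likelihood (Bayes' theorem, Eq. (2.11));
# all numbers below are hypothetical.
prior = {"c0": 0.7, "c1": 0.3}
p_feat = {"c0": 0.2, "c1": 0.8}    # per-class Bernoulli parameter p_c
x = [1, 0, 1]
scores = {c: prior[c] * bernoulli_likelihood(x, p_feat[c]) for c in prior}
print(max(scores, key=scores.get))  # predicted class
```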

2.3.3 Hidden Markov Models

Hidden Markov Models (HMMs) have been a go-to approach for statistically modelling problems involving time series or sequences and have been widely applied to various speech recognition domains for decades (Eddy, 1996). HMMs have also been used for computational sequence analysis (Churchill, 1989) and protein structural modelling (Stultz et al., 1993; White et al., 1994). As previously mentioned, HMMs have been outperformed by RNNs on every sequential task to date, and so we cover HMMs because the historical literature uses them for time series. We can formally define an HMM as follows.

Let $X_n$ and $Y_n$ be discrete stochastic processes with $n \geq 1$. The tuple $(X_n, Y_n)$ is a hidden Markov model if

• $X_n$ is a Markov process and the states of the process $X_n$ are not observed (they are hidden states).

• $P(Y_n \in A \mid X_1 = x_1, \ldots, X_n = x_n) = P(Y_n \in A \mid X_n = x_n)$ for all $x_1, \ldots, x_n$ with $n \geq 1$, given a set $A$.

Specifically, a Markov process is a stochastic model representing a sequence of events in which future events depend only on the current state, and not on the events that occurred before it. This is formally known as the Markov property.

We illustrate an HMM below.

Figure 2.5: A Hidden Markov Model, where $X_1, \ldots, X_3$ is a sequence of hidden (unobserved) states and $Y_1, \ldots, Y_3$ are the corresponding observed outputs.

Notice that the Markov process $X_n$ contains a sequence of states that are unobserved, but the output per state is observed.

Finally, we show the factorization of the joint distribution over a sequence of states and observations. This is important because it allows us to compute the probability of a sequence of observations occurring.


We can query an HMM using Bayes net factorization as follows.

$P(X_{1:T}, Y_{1:T}) = P(X_1)\, P(Y_1 \mid X_1) \prod_{t=2}^{T} P(X_t \mid X_{t-1})\, P(Y_t \mid X_t),$ (2.15)

where the notation $X_{1:T}$ means $X_1, \ldots, X_T$. This is the factorization of the joint probability distribution shown in Figure 2.5. Using this factorization we can compute the probability of a sequence. We note that the remainder of this thesis does not focus on performing inference in an HMM, and so we omit these details.
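As a concrete illustration of Eq. (2.15), the sketch below evaluates the joint probability of a state and observation sequence for a toy two-state HMM; all parameter values are made-up assumptions.

```python
import numpy as np

# Toy two-state HMM with two observation symbols (parameters are illustrative)
pi = np.array([0.6, 0.4])                  # P(X_1)
A = np.array([[0.7, 0.3], [0.2, 0.8]])     # P(X_t | X_{t-1})
B = np.array([[0.9, 0.1], [0.3, 0.7]])     # P(Y_t | X_t)

def joint_prob(states, obs):
    # Eq. (2.15): P(X_1) P(Y_1|X_1) * prod_t P(X_t|X_{t-1}) P(Y_t|X_t)
    p = pi[states[0]] * B[states[0], obs[0]]
    for t in range(1, len(states)):
        p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
    return p

print(joint_prob([0, 0, 1], [0, 0, 1]))
```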


Chapter 3

Anomaly Detection

Anomaly detection has been an area of interest for numerous years across an array of diverse research areas and applied domains. Anomaly detection, also known as outlier detection, aims to identify data points that deviate from some expected behaviour (Zimek and Schubert, 2017). With this in mind, an anomaly can be understood as a specific data point that does not conform to some precise behaviour (Islam et al., 2017). In particular, Hawkins defined an anomaly as any observation that deviates so significantly from other observations as to arouse suspicion that it was generated by an alternate mechanism (Hawkins, 1980). With this in mind, we can quite easily understand how detecting outliers translates to real-world crises such as bank fraud, cyber attacks on computer networks, and detection of bone fractures (Hodge and Austin, 2004).

We can categorize anomalies into three main classes, namely time series anomalies, contextual anomalies, and collective anomalies:

(1) Time series anomalies: If a data point can be considered anomalous with respect to the rest of the data, then we denote this data point as a time series anomaly (Chandola et al., 2009).

(2) Contextual anomalies: If a data point, given some context and behavioural attributes, can be considered normal in one context but not in all contexts, then we denote this as a contextual anomaly (Chandola et al., 2009).

(3) Collective anomalies: If a series of data points is collectively anomalous with respect to the data set, even though the data points individually may not be anomalous, we denote this as a collective anomaly (Chandola et al., 2009).

We illustrate this with an example.


Figure 3.1: $t_2$ is a contextual anomaly in this time series of average high temperatures over a year in Florida, USA. It is important to note that the temperature at time $t_1$ is equal to the temperature at $t_2$ but occurs in a different context: the average high temperature in September in Florida is not 74°F.

Figure 3.2: The red data points are collectively anomalous in this simulated human electrocardiogram plot. In this case, the sequence of red data points collectively is an anomaly, but each red data point by itself is not anomalous. The main cause of this anomaly is the collective occurrence of 0's sequentially, despite the existence of other 0-valued pressure points elsewhere in the series.

We have now outlined the three main categories of anomalies. In this thesis our approach to anomaly detection is an RNN-based autoencoder that can detect aspects of all three categories discussed above. In short, since RNNs are sequential, our RNN-based autoencoder can naturally detect time series anomalies and can further pick up on patterns of anomalous time series


with respect to each other and their surrounding context, which also supports contextual and collective anomalies.

We organize the following sections as follows. We start by addressing supervised anomaly detection. Following this, we address a popular GAN-based anomaly detection method. Next, we delve into autoencoding-based unsupervised anomaly detection. Lastly, we cover probabilistic model-based anomaly detection and conclude with explainable artificial intelligence.

3.1 Supervised Deep Anomaly Detection

The primary goal of supervised deep anomaly detection is to train a deep neural network on both normal and anomalous data points to perform classification (Chalapathy and Chawla, 2019). This results in a neural network that can classify new data points as normal or anomalous within some margin of confidence. While this approach can work well in practice, it is difficult to employ due to the lack of labeled anomalous training samples; it is rarely possible to gather a large pool of anomalous training samples. Thus, this approach is likely to suffer from class imbalance, where the total number of normal samples is far greater than the total number of anomalous samples (Chalapathy and Chawla, 2019).

3.2 GAN-based Anomaly Detection

A widely recognized unsupervised approach for anomaly detection in images is AnoGAN (Schlegl et al., 2017). This approach trains a deep convolutional generative adversarial network (GAN) (Goodfellow et al., 2014) to distinguish generated from normal images. We mention that while AnoGAN is beyond the scope of this thesis, it is a notable contribution to the anomaly detection literature, which is our basis for including it here. Further, we note that we do not compare against AnoGAN because it does not handle sequential data.

Recall the GAN formulation described in Section 2.2.6. AnoGAN trains a GAN on a set of medical images of healthy anatomy with the aim of identifying images that contain unhealthy anatomy, which are representative of an anomaly in this particular use case. After training is complete, the generator has learnt a mapping $G(z) \mapsto x$ from latent space representations $z$ to normal images $x$. Given a new query image $x$, the authors use a sampling-based technique to find a point $z$ in the latent space that corresponds to an image $G(z)$ that is visually most similar to the query image $x$ (Schlegl et al., 2017). Using this technique they define a loss function for the mapping of new images to the latent space that has two components: a residual loss and a discrimination loss. The residual loss $R(x)$ ensures visual similarity between the generated image $G(z)$ and the query image $x$. The discrimination loss $D(x)$ enforces that the generated image $G(z)$ lies on the learned manifold of normal images.

Anomaly identification on new data points is performed via an anomaly score, which quantifies the fit of a query image $x$ to the learnt distribution of normal images. (Schlegl et al., 2017) define the anomaly score function as follows:

$A(x) = (1 - \lambda) \cdot R(x) + \lambda \cdot D(x),$ (3.1)

where R(x) is the residual score and D(x) is the discrimination score.


3.3 Autoencoding-based Unsupervised Anomaly Detection

The primary goal of autoencoder-based unsupervised anomaly detection is to train an autoencoder solely on normal time series data. This approach leverages the abundance of labels for a single non-anomalous class: by training an autoencoder on this single class, outliers can be separated based on the learnt properties of that class (Chalapathy and Chawla, 2019; Wulsin et al., 2010). Specifically, this approach leverages reconstruction error as the underlying mechanism for anomaly detection, which we explain next.

The reconstruction error approach is based on the autoencoder framework. We define the mathematical formulation of this approach as it pertains to the three types of models seen thus far that are used in our experiments, namely PCA, the fully-connected neural network and the recurrent neural network. We begin as follows.

Let $\mathbf{x}$ be a vector representing a single data point, and let its reconstruction be $\hat{\mathbf{x}} = AE(\mathbf{x})$, where $AE$ is some autoencoder. Then let the reconstruction error be $E(\mathbf{x}, AE) = ||\mathbf{x} - AE(\mathbf{x})||_2^2$; a data point $\mathbf{x}$ is declared an anomaly when $E(\mathbf{x}, AE) > t$, where $t$ is some threshold. This inequality ultimately defines whether a data point $\mathbf{x}$ is an anomaly or not.

Intuitively, we can understand reconstruction error as measuring how well the autoencoder is able to recreate the data point itself. If the autoencoder is able to perfectly recreate the data point then $E(\mathbf{x}, AE) = 0$. With respect to anomaly detection, this approach trains an autoencoder on "normal" data only; that is, the autoencoder learns the distribution of "normal" data. With this in mind, the higher the reconstruction error for an unseen data point is, the more likely the data point does not belong to the training data distribution. Thus, such data points become suspicious and are classified as anomalous.
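The following is a minimal sketch of this thresholded reconstruction-error rule; the "autoencoder" here is a stand-in lambda, as any trained model exposing a reconstruction function would do in its place.

```python
import numpy as np

def reconstruction_error(x, ae):
    # E(x, AE) = ||x - AE(x)||_2^2
    return float(np.sum((x - ae(x)) ** 2))

def is_anomaly(x, ae, t):
    # Flag x as anomalous when its reconstruction error exceeds threshold t
    return reconstruction_error(x, ae) > t

# Usage with a stand-in "autoencoder" (identity plus small noise, for illustration)
ae = lambda x: x + 0.01 * np.random.randn(*x.shape)
x = np.random.randn(16)
print(is_anomaly(x, ae, t=0.5))
```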

With reconstruction error explained, we next describe how PCA, fully-connected neural networks and recurrent neural networks are used as autoencoders for reconstruction-based anomaly detection.

3.3.1 PCA

In this subsection we describe the autoencoder function as it pertains to the reconstruction error approach for unsupervised anomaly detection previously described.

In Section 2.1 we outlined the steps for extracting principal components. From these steps, recall the matrices $U$ and $U^T$. Below we show a diagram representing a simple autoencoder, and show that by letting $W_1 = U^T$ and $W_2 = U$ we obtain a linear autoencoder. With this in mind, the optimal weights for a linear autoencoder using PCA are given by the inclusion of all principal components.

Figure 3.3: We show PCA as a linear autoencoder. By setting the weight matrices $W_1 = U^T$ and $W_2 = U$ we can create a linear autoencoder using PCA.


With this explained, we can use PCA as an autoencoder as part of the autoencoding-based anomaly detection method.
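The sketch below illustrates this construction, assuming we keep only the top-k principal components (so the reconstruction is lossy and its error can serve as an anomaly score); the data, the SVD-based component extraction and the component count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))        # stand-in "normal" training data

# Principal components of the mean-centered data via SVD
mu = X.mean(axis=0)
Xc = X - mu
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
U = Vt[:3].T                          # keep top-3 components; U has shape (D, k)

def pca_autoencode(x):
    # Encoder W1 = U^T projects; decoder W2 = U reconstructs
    return (U @ (U.T @ (x - mu))) + mu

x_new = rng.normal(size=10)
err = np.sum((x_new - pca_autoencode(x_new)) ** 2)   # reconstruction anomaly score
print(err)
```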

Next, we acknowledge popular PCA-driven approaches to anomaly detection that use the reconstruction error previously described. (Huang et al., 2007) contributed a PCA approach for anomaly detection in large distributed systems, uncovering anomalous data points by continuously tracking the projection of the data onto a residual subspace. (Shyu et al., 2003) also used PCA to train a classifier for intrusion detection problems in the unsupervised setting, where only normal or benign data points are available. (Brauckhoff et al., 2009) studied why spatial PCA in the network anomaly detection domain is sensitive to calibration and provided reasoning for such discrepancies.

3.3.2 Fully-Connected Neural Network (FCNN)

In this subsection we describe how the fully-connected neural network is used as an autoencoder for reconstruction-based anomaly detection.

Recall the fully connected neural network as an autoencoder shown in Figure 2.1. As mentioned earlier, this architecture contains both an encoder and a decoder component with a bottleneck layer between them, where the main objective is to have inputs and outputs be equivalent. The goal of the encoder is to provide a lower-dimensional representation of the input. In contrast, the purpose of the decoder is to take as input this lower-dimensional representation output by the encoder and reconstruct the original input. To sum up, we can use FCNNs as autoencoders as part of the autoencoding-based anomaly detection method.

3.3.3 Recurrent Neural Network

In this subsection we describe how the recurrent neural network is used as an autoencoder for reconstruction-based anomaly detection.

Recall the recurrent neural network in Figure 2.2 and the many variants of RNN architectures discussed previously, such as stacked, bidirectional and encoder-decoder RNNs. All of these RNNs can be adapted into the form of an autoencoder by training on data where the corresponding label output is the same as the input. This achieves an RNN architecture where the inputs and outputs are equivalent, satisfying the purpose of an autoencoder. With this in mind, we can use RNNs as autoencoders as part of the autoencoding-based anomaly detection method explained previously.

Next we cover relevant background on how previous literature has used RNNs for anomaly detection. (Munir et al., 2018) motivated the need for capturing periodic behaviour in time series for anomaly detection, noting that traditional distance- and density-based anomaly detection techniques fail to learn periodic behaviours embedded within data. The authors reasoned that an RNN-based approach would capture periodic behaviour within time series, and presented a deep learning-based anomaly detection approach (DeepAnT) for detecting a range of time series anomalies, i.e., point anomalies and contextual anomalies. DeepAnT contains two components, a time series predictor and an anomaly detector. The time series predictor is a deep neural network that predicts the next value in a time series given some stream of data. The predicted value from this neural network is then fed into the anomaly detector, which labels it as normal or anomalous. (Munir et al., 2018) conclude that this approach can easily be incorporated in practice since it is unsupervised and does not rely on learning


from anomalously labelled data points.

(Filonov et al., 2017) presented a reconstruction-based RNN approach for time series anomaly detection,

where the data is representative of a Tennessee Eastman Process (TEP). The authors train an LSTM-based RNN and a stacked RNN on normal time series data points to perform reconstruction. They train their RNNs on synthetically generated time series data and test their performance on a hold-out set containing anomalous data points. The authors used the NAB metric (Lavin and Ahmad, 2015) to compare their reconstruction-based RNN models against each other. While the results in this paper are convincing and show strong promise for RNNs for anomaly detection on time series data, one critical concern is that the synthetically generated data used in their experiments may not translate seamlessly to real data.

Many security researchers have shown the vulnerability of automobiles to hacking. A car's controller area network (CAN) can be accessed by exploiting vulnerabilities in the car's external interfaces such as Wi-Fi, Bluetooth and other physical connections (Taylor et al., 2016). With access to a CAN bus, commands can be sent to control a car; for instance, commands to activate the brakes of the vehicle or turn off its engine can be deployed via the CAN bus (Taylor et al., 2016). While approaches exist to mitigate threats to such car interfaces, a critical area of work focuses on detecting malicious behaviour on the CAN bus itself. This motivates the work done by (Taylor et al., 2016), who proposed an anomaly detector powered by LSTM-based RNNs that is trained to detect CAN bus attacks. The RNN they train learns to predict the next command from a sender on the bus, and actual next commands that are highly surprising under the model are flagged as anomalies. The authors train on normal CAN bus command data that has been synthetically generated, and evaluate their detector's performance on modified CAN bus data. Their analysis concludes that their RNN-based model is able to detect anomalies with low false alarm rates.

3.4 Probabilistic Model-based Anomaly Detection

In this section, we reference relevant anomaly detection approaches driven by Bayesian networks, HMMs and Naive Bayes.

3.4.1 Naive Bayes

There have been many contributions that use Naive Bayes classifiers to perform anomaly detection; some popular ones are as follows. (Panda and Patra, 2007) contributed a Naive Bayes technique for network intrusion detection on the KDD Cup '99 data set and outlined the criteria under which it is capable of outperforming a feed-forward neural network-based approach. (Mukherjee and Sharma, 2012) used various feature selection techniques to extract important features and then trained a Naive Bayes classifier on this reduced set of features to perform network intrusion detection. To add to this, (Amor et al., 2004) showcased how a simple Naive Bayes structure can yield competitive results on the KDD Cup '99 data set and evaluated their approach against tree-based approaches for intrusion detection. Evidently, Naive Bayes has been the backbone of numerous anomaly detection algorithms.


3.4.2 Bayesian Networks

Bayesian networks have been applied in numerous anomaly detection and surveillance applications (Wong et al., 2003; Cansado and Soto, 2008).

In the health care sector, early detection of disease outbreaks and identification of anomalies have been tackled using Bayesian networks (Wong et al., 2003). In brief, one prominent approach for this specific type of anomaly detection problem was to represent the normal symptoms of a disease with a Bayesian network and query the network in a way that exposed the attributes responsible for certain trends (Wong et al., 2003). Other work leveraged Bayesian networks to detect anomalies in large databases (Cansado and Soto, 2008). The primary goal of this approach was to effectively detect anomalous records in a database, aided by a systematic way to select a subset of attributes that explains what makes a record anomalous. To start, they followed a probabilistic approach to model the joint probability distribution of the attributes of each record in the database. From this, they developed a method to rank records according to how anomalous they are: highly common records in the database are well explained and should receive a high likelihood, whereas anomalous records are poorly explained and receive a low likelihood. They used a Bayesian network to represent this joint probability distribution and were able to efficiently scale their approach to large databases due to a BN's property of being compact (Cansado and Soto, 2008).

In this next paragraph, we give concrete insight into how Bayesian networks have traditionally been leveraged to perform anomaly detection (Mascaro et al., 2014). In a Bayesian network anomaly detection framework, we can intuitively understand anomalies as events that are highly unlikely under normal circumstances, and with such an understanding we can compute $P(e \mid m)$, where $e$ represents an event (or evidence of an event) and $m$ is a model. With this in mind, an event $e$ can come either from a normally working system or from a malicious action, and so we can introduce a threshold $t$ that becomes the deciding factor between a normal and an anomalous event. That is, if $P(e \mid m) < t$ then $e$ is classified as an anomalous event, and as normal otherwise. Alternatively, to handle sequential data or a series of sequential events, we can aggregate probabilities over time: if $\frac{1}{N} \sum_i P(e_i \mid m) < t$ over $N$ time steps, then the series of events $e_i$, $i \leq N$, is an anomalous sequence. At a high level, this is the typical approach for how Bayesian networks can power anomaly detection in time series data. Helldin and Riveiro used BNs in this manner to detect anomalies in sequential vessel traffic data (Helldin and Riveiro, 2009). Also, Johansson and Falkman used the constraint-based PC algorithm (Spirtes et al., 2000) to learn a BN that captured how a normally behaving vessel operates and flagged vessels on routes that deviated significantly from normal routes (Johansson and Falkman, 2007). Lastly, Lane et al. also used Bayesian networks in this manner to detect time series anomalies in maritime data (Lane et al., 2010). Specifically, they defined various categories of anomalous ship behaviour; for instance, they considered deviation from standard routes, unexpected vessel activity, unexpected port arrival, and near approach and zone entry to be 5 types of anomalies. Using the data they had, they produced a tree of conditional probabilities represented as a Bayesian network to measure the probability of each type of anomalous behaviour.
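A minimal sketch of this thresholding scheme follows; the per-event likelihoods would come from a BN of normal behaviour, and the numbers here are hypothetical.

```python
def is_anomalous_event(p_e_given_m, t):
    # Single event: anomalous if P(e|m) < t under the normal-behaviour model m
    return p_e_given_m < t

def is_anomalous_sequence(probs, t):
    # Sequence of N events: anomalous if (1/N) * sum_i P(e_i|m) < t
    return sum(probs) / len(probs) < t

# Usage with hypothetical per-event likelihoods
print(is_anomalous_event(0.003, t=0.01))                 # True
print(is_anomalous_sequence([0.4, 0.02, 0.01], t=0.2))   # True
```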


3.4.3 Hidden Markov Models

With respect to anomaly detection, there has been an abundance of time series anomaly detection approaches powered by HMMs. For instance, (Limkar and Jha, 2012) contributed a novel HMM approach that is an effective defense mechanism against DDoS attacks. (Jia and Yang, 2007) created an intrusion detection system powered by HMMs and showcased its performance on several data sets, and (Cao et al., 2013) leveraged HMMs to detect stock price manipulation, which can be interpreted as detecting abnormal behaviour.

3.5 Explainable Artificial Intelligence

As previously mentioned, machine translation, speech recognition, image classification, and object detection, to name a few, have been problem domains in which deep neural networks have provided state-of-the-art results. Nevertheless, deep neural networks have been understood in industry and academia as black-box models. Once a neural network has been trained on training data, we feed the trained neural network new data and expect an output. With this in mind, a trained neural network can be used with no understanding of the details of its training or the specification of its architecture, and this is the underlying reason why neural networks are widely understood as black-box models (Gunning, 2017).

In practice, a major issue arises when trying to deploy a system where a neural network governs predictions for a particular business model: stakeholders become hesitant to employ such a system because explaining why its predictions are what they are is difficult (Gunning, 2017). Not to mention, the technicality of neural networks has been difficult for practitioners and business entities to grasp (Gunning, 2017). Relatedly, many researchers in academia who work on neural networks have struggled with understanding which areas of an input had the most influence on the network's output. This has led to a recent surge of work that focuses on explainable artificial intelligence (XAI) (Gunning, 2017).

XAI tackles the underlying problem that machine learning algorithms are not able to provide coherent insights into their behaviour or decision-making process (Gilpin et al., 2018). Often the creators of such machine learning models are not able to reason about the model's final prediction, which has created a lot of controversy in industry when deciding whether or not to deploy machine learning-driven algorithms in business models. A few important points to note are that providing such explanations ultimately aids in ensuring algorithmic fairness, identifying potential bias in the training data, and ultimately ensuring the algorithm performs as it should (Gilpin et al., 2018).

An important motivation for XAI is that it aims to answer a social construct within the regulation of algorithms: the right to an explanation. The regulation of algorithms, specifically machine learning algorithms, states that there exists a right to an explanation of an algorithm's output. We motivate this with a brief example. If a person applies for a loan from a credit bureau and gets denied, then they are entitled to an explanation. However, if the methodology that dictates whether a person gets a loan is primarily governed by a machine learning algorithm, for instance a neural network, then providing such an explanation becomes extremely difficult. In essence, an individual has the right to request an explanation if the decision significantly affects them in any way, especially legally or financially.

This leads us to the importance of explaining predictions which we cover next.


3.5.1 Explaining Predictions

In this subsection, we go over a few commonly used approaches for explaining individual predictions of classifiers. The background material in this subsection is important because in Chapter 5 we contribute our own approach that explains individual predictions, but in a different way. While these approaches tackle the same goal of providing better insight into why a prediction is what it is, our approach focuses on uncovering the responsible inputs in a time series that contribute to a prediction. We note that neither Shapley value estimation nor LIME focuses on this type of explanation, nor has there been any literature to date that focuses on it.

Below we go over Shapley regression values (Lipovetsky and Conklin, 2001), Shapley sampling values (Štrumbelj and Kononenko, 2014), and local interpretable model-agnostic explanations (LIME) (Ribeiro et al., 2016). We explicitly mention that these are model-agnostic frameworks, meaning that these explanation approaches can be used on any black-box machine learning model.

3.5.2 Classic Shapley Value Estimation

These approaches aim to explain predictions from a given model. We briefly describe them below.

Shapley regression values

This approach requires retraining a model on all feature subsets $S \subseteq F \setminus \{i\}$, where $F$ is the set of all features. Essentially, an importance value is assigned to each feature that represents the effect on the model prediction of including that feature. To compute the effect of including feature $i$, a model $f_{S \cup \{i\}}$ is trained with that feature present and another model $f_S$ is trained with that feature excluded. From here, predictions from the two models can be compared on the current input: $f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S)$, where $x_S$ represents the values of the input features in $S$. Since the effect of excluding a feature depends on the other features in the trained model, the preceding differences are computed for all possible subsets $S \subseteq F \setminus \{i\}$. With this in mind, the Shapley values are computed and used as feature attributions; they are a weighted average of all possible differences:

$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right]$ (3.2)

Shapley regression values map each input to 1 or 0 in the original input space, where 1 represents that the input is included in the model and 0 that it was excluded.

Shapley sampling values

This approach is similar to Shapley regression values but uses sampling techniques for approximation. The primary goals of Shapley sampling values are to (1) apply sampling approximations to Equation 3.2 and (2) approximate the effect of removing a variable from the model. (2) is accomplished by integrating over samples from the training data set.
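Below is a hedged Monte Carlo sketch of this sampling idea: a feature's contribution is approximated by averaging its marginal contributions over random feature permutations, with "removed" features filled in from training samples. The function name and exact sampling scheme are our own illustrative choices, not the reference implementation.

```python
import numpy as np

def shapley_sampling(f, x, X_train, i, n_samples=200, rng=None):
    """Monte Carlo estimate of feature i's Shapley value for prediction f(x)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    d = len(x)
    phi = 0.0
    for _ in range(n_samples):
        perm = rng.permutation(d)
        z = X_train[rng.integers(len(X_train))]   # background training sample
        pos = np.where(perm == i)[0][0]
        x_with, x_without = z.copy(), z.copy()
        # Features up to (and including) i in the permutation take values from x
        x_with[perm[:pos + 1]] = x[perm[:pos + 1]]
        x_without[perm[:pos]] = x[perm[:pos]]
        phi += f(x_with) - f(x_without)           # marginal contribution of i
    return phi / n_samples

# Usage on a toy linear model: the estimate approaches w_i * (x_i - E[x_i])
w = np.array([1.0, -2.0, 0.5])
f = lambda v: float(v @ w)
X_train = np.random.default_rng(1).normal(size=(100, 3))
x = np.array([1.0, 1.0, 1.0])
print(shapley_sampling(f, x, X_train, i=1))       # roughly -2
```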


3.5.3 Local interpretable model-agnostic explanations (LIME)

Local interpretable model-agnostic explanations (LIME) has been a staple approach for explaining the predictions of machine learning models, which are often referred to as black-box models. LIME has been recognized as an easy-to-implement, go-to approach for making machine learning models more interpretable. At a high level, LIME is a technique that is able to explain the predictions of any classifier in an easy-to-comprehend, trustworthy manner by learning an interpretable model locally around predictions (Ribeiro et al., 2016). We briefly review this technique.

Formally, LIME provides local explanations of predictions from a given classifier $f$ by learning a simpler interpretable model $g$ locally around a data point $x$ for which we desire an explanation. The interpretable model $g$ is learnt using an interpretable representation of the original data space. Consider a vector of gray-scale pixel values in an image, i.e., let $x \in \mathbb{R}^d$. An interpretable representation of $x$ could be $x' \in \{0, 1\}^{d'}$, a vector of binary values that represent the absence or presence of pixels. From this, the LIME explanation $g$ can be obtained by solving the following optimization problem.

$g = \arg\min_{g \in G} L(f, g, \pi_x) + \Omega(g),$ (3.3)

where $G$ is the family of explanation models, $L$ is a loss function, $\pi_x$ defines the local space around data point $x$, and $\Omega$ is a complexity penalty.

Typically, $G$ is taken to be the set of linear regression models, where $\Omega$ is used to restrict the number of explanatory features that can have non-zero regression weights, although other types of explanation models can be used. We define the loss function below.

$L(f, g, \pi_x) = \sum_i \pi_x(z_i) \left( f(z_i) - g(z'_i) \right)^2,$ (3.4)

where the summation is over a set of perturbed data points sampled around $x$, $(z_i, z'_i)$, $i = 1, \ldots, m$, where $z_i$ is a perturbed data point from the original data space and $z'_i$ is its corresponding interpretable representation. Finally, $\pi_x(z_i)$ weights each sample based on its similarity to $x$, which is the point whose classification result is being explained (Ribeiro et al., 2016).

A few advantages of LIME are that, given a change of the underlying (black-box) machine learning model, the trained local surrogate model can still be leveraged for explanation, and that the features used when training the local surrogate model can differ from the original features used to train the underlying machine learning model. This latter advantage is powerful because it creates a use case for interpreting different features. In contrast to these advantages, a critical hurdle of this technique is that there exists no succinct definition of what the size of a neighbourhood should be. In addition, LIME can often be unstable: as mentioned in the original paper, explanations for two data points that occurred close to one another varied across different simulated settings. This is a critical pitfall of LIME because instability of explanations ultimately diminishes trust in a machine learning model and its predictions.
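The following is a minimal sketch of the optimization in Eqs. (3.3)–(3.4), using Gaussian perturbations of $x$ as the samples $z_i$, an exponential similarity kernel as $\pi_x$, and a ridge regression from scikit-learn as the interpretable family $G$ (the L2 penalty playing the role of $\Omega$). The kernel width, perturbation scale and sample count are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(f, x, n_samples=500, width=1.0, rng=None):
    """Fit a local linear surrogate g around x for black-box f (sketch of Eq. 3.3)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    Z = x + rng.normal(scale=0.5, size=(n_samples, len(x)))  # perturbations z_i
    y = np.array([f(z) for z in Z])                          # black-box outputs f(z_i)
    # pi_x(z_i): exponential kernel weighting samples by proximity to x (Eq. 3.4)
    weights = np.exp(-np.sum((Z - x) ** 2, axis=1) / width ** 2)
    g = Ridge(alpha=1.0)                                     # Omega via the L2 penalty
    g.fit(Z, y, sample_weight=weights)
    return g.coef_                                           # local feature effects

# Usage on a toy nonlinear black box
f = lambda v: float(np.sin(v[0]) + v[1] ** 2)
print(lime_explain(f, np.array([0.0, 1.0])))  # roughly the local gradient [1, 2]
```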

3.6 Summary

In this chapter, we have provided a clear overview of existing approaches to anomaly detection prior to deep learning. We started by providing background on relevant probabilistic graphical model


and statistical machine learning topics that have been applied to many anomaly detection tasks. We provided an overview of Bayesian models, Hidden Markov Models, Principal Component Analysis, and Naive Bayes, to name a few. We then delved into deep learning background, focusing on recurrent neural networks, since this thesis aims to explore popular recurrent neural network architectures for the task of unsupervised anomaly detection in time series. After this, we proceeded to discuss anomaly detection, referencing relevant deep learning and non-deep learning anomaly detection approaches as they pertain to the scope of this dissertation. Lastly, we concluded with an overview of explainable artificial intelligence, motivated the importance of explaining the predictions of black-box machine learning models, and outlined relevant approaches in the literature for explaining the predictions of a given machine learning model.

We emphasize that the remainder of this thesis and the experiments in future chapters reference the neural network architectures covered in Section 2.2 on deep neural networks. Furthermore, the explanation overview in Section 3.5 is important because in Chapter 5 we contribute a novel method, differing from these approaches, for pinpointing the areas of an input that contribute to a prediction.


Chapter 4

Deep Learning Approaches for Unsupervised Anomaly Detection in Time Series

Thus far, we have seen that most literature on unsupervised anomaly detection has been driven by statistical or linear approaches. While many of these approaches have been successful in their own right, there exists no thorough exploration of RNNs for the particular task of unsupervised anomaly detection in time series data. We have touched upon the successes of RNNs in domains that involve sequential data, i.e., machine translation, forecasting and handwriting recognition. For example, machine translation involves language data sets where a data point is typically a sentence in which the placement of each word critically matters when performing translation. Given the success of RNNs on sequential data, we hypothesize that among deep neural network architectures, RNNs are plausibly best suited for unsupervised anomaly detection in time series data. We explore this direction in the remainder of this chapter.

In this chapter we provide an extensive evaluation of popular deep neural architectures for the cybersecurity problem of intrusion detection in sequential data streams. In intrusion detection the goal is to identify whether an event or series of events poses a potential threat to an environment. We empirically evaluate a spectrum of deep learning architectures for this task; the class of architectures includes a fully connected autoencoder and various recurrent neural network-based models. Our goal here is to understand which model lends itself well to the detection of attacks in sequential data streams. We take an unsupervised learning approach to this anomaly detection task: our detection approach first models a benign sequential data distribution using trained neural networks, and then computes an anomaly score of a future sequence under the trained model to scope out potential intrusions. Our evaluation methodology focuses on ranking-based metrics, namely precision@k and average precision@k. Using ranking-based metrics in this context allows us to better understand which deep neural architectures are better than others at identifying malicious sequences. Specifically, the higher the rank of a sequence, the more likely it is malicious, which ultimately becomes a useful aid for network security experts whose line of work focuses on diagnosing potential intrusion attacks in a priority-queue manner. Lastly, we leverage the attention mechanism in a recurrent


neural network in an effort to provide explanations for its respective predictions. Again, this information becomes particularly useful to operators who must resolve potential intrusions.

4.1 Introduction

This leads to the broad aim of this work, which is to explore several deep neural network architectures in an unsupervised manner for the task of anomaly detection in sequential data streams, also known as time series data. Specifically, we compare a wide array of deep neural networks that lend themselves particularly well to sequential data. These networks are: the fully connected neural network and various types of recurrent neural networks (RNNs), namely the bidirectional recurrent neural network, stacked recurrent neural network, LSTM-based recurrent neural network, encoder-decoder recurrent neural network, and an attention-based encoder-decoder neural network. Research has shown that RNN-based models are successful at handling sequential data to accomplish tasks such as speech recognition and machine translation. In contrast, fully connected neural networks in the form of autoencoders have gained a significant amount of success in various other applications. With this in mind, we use the fully connected autoencoder as our baseline, because RNNs naturally have architectural components that lend themselves to sequential data whereas autoencoders do not, and ultimately the autoencoder can be seen as a deep neural network in its most simplistic form. Lastly, we are optimistic that this work will equip cybersecurity practitioners and other anomaly detection experts with a deeper understanding of the various deep neural network approaches and algorithms that cater particularly well to unsupervised anomaly detection on sequential data.

4.2 Notation

Before we proceed to our methodology, we introduce key notation pertaining to the rest of this chapter. We represent a single sequential data point as a sequence of observations $[\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \ldots]$, where each $\mathbf{x}_t = [x^1_t, x^2_t, \ldots, x^D_t] \in \mathbb{R}^D$ is $D$-dimensional. By observation we are referring to various types of information observable from a computer network. For instance, this could be packet-level information in a computer network, the byte size of a packet, or various other information that feeds through a computer network. Each feature vector within a data point corresponds to a particular time instance, and so a single data point captures information across a span of time. With this in mind, for our experiments we limit each sequence to a maximum length $T$, so that a single data point $\mathbf{x}^{(i)}$ consists of a series of $T$ feature vectors $\mathbf{x}^{(i)} = [\mathbf{x}_i, \mathbf{x}_{i+1}, \ldots, \mathbf{x}_{i+T-1}]$. Together we can represent our data set $X$ as consecutive length-$T$ vectors, which is the result of a sliding window of size $T$ over a full series of observations. As previously discussed, our approach is a reconstruction-based model, so the target vectors are the inputs themselves. Formally, given a model $M$ and a data point $\mathbf{x}^{(i)}$, feeding it through our model produces an output $\mathbf{y}^{(i)} = M(\mathbf{x}^{(i)})$, and we desire $\mathbf{y}^{(i)} \approx \mathbf{x}^{(i)}$, which intuitively means we want the output to be as close as possible to its respective input.
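A minimal sketch of this sliding-window construction is shown below; the series shape and window length are illustrative.

```python
import numpy as np

def sliding_windows(series, T):
    """Turn a (num_steps, D) series into consecutive length-T data points x^(i)."""
    return np.stack([series[i:i + T] for i in range(len(series) - T + 1)])

# Usage: 1000 observations with D = 4 features, window length T = 60
series = np.random.randn(1000, 4)
X = sliding_windows(series, T=60)      # shape (941, 60, 4)
```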

4.3 Methodology - From reconstruction to anomaly detection

While identification of anomalous samples is definitely of utmost importance, there are cases where the most dangerous threat needs to be addressed immediately to enforce necessary preventive measures. In the case of binary classification, with only two possible outcomes, such countermeasures cannot be


realized due to the lack of anomaly scores or a ranked list. For this reason, it is important to have an IDS that provides scores that enable us to rank all new test cases. This ranked list can then be examined with the premise that the top-most ranked items are highly suspicious.

A simple but effective way to rank the data samples is by their reconstruction error – how well is the model able to replicate the sample itself? In the case of perfect replication, the error is zero: $E(\mathbf{x}, M(\mathbf{x})) = 0$. The higher the error, the more likely the sample does not belong to the training data distribution representing normal data, and hence the more suspicious it becomes. We define the reconstruction error for a sample $\mathbf{x}^{(i)}$ as $E^{(i)} = ||\mathbf{x}^{(i)} - M(\mathbf{x}^{(i)})||_2^2$.

Finally, we define a framework for rank-based anomaly detection:

1. Train a reconstruction model $M$ only on benign data $X$.

2. Compute reconstruction errors $\{E^{(i)}\}_{\mathbf{x}^{(i)} \in X_{test}}$ for new samples.

3. Construct a ranked list $\Psi = [E^{(1)}, E^{(2)}, \ldots]$ where $E^{(j)} \geq E^{(k)}$ for $j < k \leq |X_{test}|$.

The list $\Psi$ can be examined element-by-element, from the most suspicious (anomalous) cases to the least suspicious cases. However, a key question remains: how do we compare two models $M_1$ and $M_2$ and their respective ranked lists $\Psi_1$ and $\Psi_2$ on the test set? In Section 4.4, we describe several measures that provide an answer as to which ranked list is more appropriate for the end user.
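A minimal sketch of step 3 of this framework is given below; the error values are hypothetical.

```python
import numpy as np

def ranked_list(errors):
    # Sort reconstruction errors in descending order to form the ranked list Psi
    order = np.argsort(errors)[::-1]
    return order, np.asarray(errors)[order]

# Usage: indices of the most suspicious test samples come first
order, psi = ranked_list([0.1, 2.3, 0.7])
print(order)   # [1 2 0]
```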

4.4 Evaluation procedure

As motivated and described in Section 3.3, in this thesis we use the reconstruction error derived from any of the aforementioned deep autoencoder architectures to assign an anomaly score to an observed data stream. As part of our evaluation procedure, we construct the ranked list by ordering (in descending order) new test samples according to their reconstruction error. Since network security analysts are inherently time-constrained in how many anomalies they can investigate in a fixed amount of time, our aim is to only examine and evaluate the top-k highest-ranked cases. When considering how to evaluate such top-ranked results, we remark that the setting is analogous to the evaluation of ranked search engine results. In the intrusion detection context, anomalousness indicates relevancy (what we want to be top-ranked), and retrieving a sample in the top-k means that the model labels it as anomalous. Hence, we restrict our evaluations to the top-k variants of standard ranking metrics, namely precision@k and average precision@k, which we formally define next.

Precision@k. In brief, this metric computes the fraction of the top-k ranked items that are actually malicious according to the ground-truth maliciousness labels known in the test data:

$\mathrm{Prec}(k) = \frac{|\mathrm{malicious}(\Psi[:k]) \, \cap \, \mathrm{labeled\ malicious}(\Psi[:k])|}{|\mathrm{labeled\ malicious}(\Psi[:k])|}.$

If there are a total of R malicious items in the ground-truth labeled test data, we can let k = R and compute Prec(R) (a.k.a. Precision at Recall). One reason for doing this is that it guarantees that a perfect ranking (all R malicious examples ranked in the top-k) achieves a score of 1.0, so that we can easily compare Prec(R) across different datasets with widely varying values of R.


Average Precision@k. One well-known caveat of Prec(k) is that it provides the same evaluation for a ranked list no matter how the relevant items are permuted among the top-k items, since Prec(k) only measures the fraction of top-k items that are known to be malicious. To address this caveat, Average Precision@k is another metric that builds on the Prec(k) definition to provide a higher score to ranked lists that place malicious items higher in the top-k compared to ranked lists that place the same malicious items lower in the top-k. Formally, letting malicious(i) take the value 1 if the i-th ranked item is malicious and 0 otherwise, the definition of Average Precision@k (AP@k) is

$AP(k) = \frac{1}{k} \sum_{i=1}^{k} \mathrm{Prec}(i) \cdot \mathrm{malicious}(i).$
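Both metrics can be implemented directly from their definitions; a minimal sketch follows, where the ranked ground-truth labels are hypothetical.

```python
import numpy as np

def precision_at_k(labels, k):
    # labels: ground-truth maliciousness down the ranked list, 1 = malicious
    return float(np.sum(labels[:k])) / k

def average_precision_at_k(labels, k):
    # AP(k) = (1/k) * sum_{i=1..k} Prec(i) * malicious(i)
    labels = np.asarray(labels, dtype=float)
    precs = [precision_at_k(labels, i) * labels[i - 1] for i in range(1, k + 1)]
    return float(np.sum(precs)) / k

ranked_labels = np.array([1, 0, 1, 1, 0])   # hypothetical ground truth
print(precision_at_k(ranked_labels, k=4))           # 0.75
print(average_precision_at_k(ranked_labels, k=4))   # ~0.604
```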

4.5 Data

Our reconstruction-based approach to anomaly detection leverages sequential data. Sequential data naturally has a temporal aspect that is a critical factor for an IDS-based anomaly detector. To elaborate, a single event may not be anomalous, but a series of the same event occurring in the context of a sequence of other events may be. In our experiments, we use several time series data sets, each of which we describe in greater detail below.

• Synthetically generated data that allows us to control sequential aspects of anomalies.

• Yahoo! data set for anomaly detection.

• CICIDS2017 flow-based data developed at the University of New Brunswick.

• Real-world data collected by Rank Software Inc.

Every data set mentioned above, excluding the synthetic data, was constructed by collecting statistical features of data streams from a fixed time window or fixed time steps and aggregating them. We go into further detail on the construction of this data in the Setup section below.

data set       #instances   D    T     #data points (N)   #malicious
Synthetic      N/A          1    100   1000               50
Yahoo! A1-51   1500         1    24    1477               204
Yahoo! A1-56   1524         1    24    1501               204
Rank           N/A          31   12    7032               46
CICIDS2017     N/A          33   60    146532             26122

Table 4.1: A summary of the data and constructed data sets. #instances is the total number of data points prior to any preprocessing, D is the dimensionality of the feature space, T is the length of a data point, N is the total number of data samples for training, and #malicious is the number of malicious data points in the test set.

4.5.1 Synthetic Data

We introduce an artificial time series which allows us to control the sequential aspects of anomalies in order to test the generalization of the standard non-sequential autoencoding approach. For this purpose, we created


sequences of binary values (0 and 1) of length 100 that contain a certain pattern. Samples with this specific pattern are considered to be normal/benign samples, while samples without the pattern are put into the anomalous group. The pattern for normal cases is a separation of 1s by a predefined number of 0s, which can be 9, 10 or 11. However, both normal and anomalous cases have exactly the same number of ones and zeros – 10 and 90 respectively. The distribution across time remains the same, but the order of the values plays an important role.
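The sketch below shows one way to generate such samples under our reading of this construction; the resampling step that keeps the pattern within 100 steps is our own implementation choice, not necessarily the exact generator used.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_sample():
    # 10 ones whose consecutive separations are 9, 10 or 11 zeros;
    # resample gap sizes until the pattern fits in 100 steps.
    while True:
        gaps = rng.integers(9, 12, size=9)
        if 10 + gaps.sum() <= 100:
            break
    seq = []
    for g in gaps:
        seq.append(1)
        seq.extend([0] * int(g))
    seq.append(1)
    seq.extend([0] * (100 - len(seq)))   # pad with trailing zeros to length 100
    return np.array(seq)

def anomalous_sample():
    # Same counts of ones (10) and zeros (90), but with the ordering destroyed
    seq = np.array([1] * 10 + [0] * 90)
    rng.shuffle(seq)
    return seq
```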

4.5.2 Yahoo Data

This dataset¹ is provided as part of the Yahoo! Webscope program. There are several time series provided as part of the package, with all personal information removed, as well as real-data properties and GEOs. We use the A1Benchmark set of time series, which is based on real production traffic to some of the Yahoo! properties. It is hourly data with alarm labels. Out of the more than 50 time series provided, we only use A1-51 and A1-56, which are challenging enough to require a learning system rather than visual inspection. Since this data is preprocessed and 1-dimensional (D = 1), we construct the complete data set by sliding a window of size T = 24 to represent 1 day's worth of traffic.

4.5.3 Rank

This data set [2] was provided by Rank Software Inc. It contains a month's worth of events collected from the enterprise network of a 50-employee software development organization. Events are produced by monitoring both the network traffic, using Zeek, and the behavior of processes on individual hosts, using Sysmon [3]. Events from different sources are correlated by their occurrence time. The following events have been extracted: file creation time modification, process creation, process termination, and network flows. We constructed features based on 300-second intervals from the event logs, summarizing both network flows and host events. The list of all extracted features is given in the Appendix.

No known malicious activity occurred in the network during the data set's time interval, so labeled malicious activity was added to the data set. The only attack type introduced was the installation of Nmap and the execution of network scans over two time intervals (30-minute and 2-hour intervals). The malicious activity was not performed in the real network. Instead, the activity was done on virtual machines in an isolated virtual network which was monitored in the same manner as the real network. The identifiers and times of the resulting events were modified to match the identifiers and time interval in the rest of the data set.

This approach allows using the actual tools that an attacker might use, without the risk of disrupting or damaging the real network. Network scanning alone has a low risk of damaging a network, but further attack types are planned. Network scanning behavior should also be relatively easy to detect, as it involves an unusual increase in the number of attempted and connected network flows from one process on one host to distinct host IPs and ports.

[1] https://webscope.sandbox.yahoo.com/catalog.php?datatype=s&did=70&guccounter=1
[2] Available by request
[3] https://docs.microsoft.com/en-us/sysinternals/downloads/sysmon


4.5.4 CICIDS2017

This data set (Sharafaldin et al., 2018) has been generated to resemble true real-world data based on several criteria considered necessary for building a reliable benchmark data set (Gharib et al., 2016). It is based on the abstract behavior of human interactions that constitutes benign background traffic. The complete simulation covers 5 working days (Monday through Friday), and common attack types have been generated alongside benign traffic during the last four days.

For the purposes of this thesis, we used a modified CICFlowMeter tool to fetch the information of interest and construct sequences of fixed-time features, as opposed to the flow-based features that are native to CICFlowMeter. This modified version is run on raw PCAP files, and we calculate the overall statistics of all the flows within a 1-second interval. For example, we compute the total number of flows started, the number of stopped flows and the number of different flow flags, to name a few. A complete list of all extracted features for this data is provided in the Appendix. The complete data generation contains around 40 hours' worth of traffic flow, and we use T = 60 to represent a 60-second window. This window size is chosen to reflect that the majority of attack types introduced to the system are rather long-term in nature (reaching up to 2 minutes in duration) as opposed to the overall short-term nature of flows (less than a second).

4.6 Experiments

The primary experimental question we would like to evaluate is whether the sequential nature of data plays an important role when choosing the best anomaly detection model and, if so, which deep autoencoder architecture for scoring anomalies according to reconstruction error performs best. Given that fully connected autoencoders are known to be a powerful tool for anomaly detection, we experimentally assess the utility of RNN-based autoencoder models and their variations in comparison to a fully connected baseline model, referred to simply as “Autoencoder” in the experiments.

A secondary question is whether the attentional encoder-decoder model is able to provide guidance towards the underlying causes of maliciousness. Because the attentional component is only tied to the temporal dimension, we are interested in seeing whether the malicious segments within the T-length window can be identified from the attention they receive during the decoding phase.

All the experiments are performed on an NVIDIA GTX1080Ti GPU-based machine using the TensorFlow (https://www.tensorflow.org/) and Keras (https://keras.io/) libraries.

Sequential data is transformed into 3D tensors of size [samples, time steps, features] = [N, T, D], which in the case of D = 1 simply yields a matrix (i.e., a 2D tensor). For fully connected autoencoders we do not consider the temporal dimension, and instead use a matrix of size [N, T · D]. The specific values for N, T and D in each data set are given in Table 4.1.
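
As a small illustration of the two input layouts (the random array below is a placeholder for real preprocessed windows, here with the Rank dimensions from Table 4.1):

import numpy as np

N, T, D = 7032, 12, 31                 # e.g., the Rank data set in Table 4.1
windows = np.random.rand(N, T, D)      # placeholder for preprocessed windows

rnn_input = windows                    # RNN models consume [N, T, D]
ae_input = windows.reshape(N, T * D)   # fully connected autoencoder: [N, T*D]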

Data splitting. Each data set is comprised of both benign B and malicious A samples. Set A is only used at test time, while the benign set B is divided into a training part Btrain and a test part Btest. A sample is malicious if any segment along the temporal dimension t contains traces of maliciousness; that is, if any segment within a sample contains a time frame where an attack occurs, the complete sample is considered malicious. Otherwise, if none of the segments in the sample contain traces of attacks, it is considered benign.


Figure 4.1: Data set construction process from a sequence of D-dimensional features. The example in the figure is for D = 3 and T = 4.

Fig. 4.1 depicts this data creation process alongside the benign/malicious labeling of samples.
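
A sketch of this windowing and labeling step, assuming per-time-step attack labels are available (the function and variable names are ours):

import numpy as np

def make_windows(series, step_labels, T):
    # Slide a length-T window over a [time, D] series; a window is malicious
    # if any of its time steps overlaps a labeled attack, as in Fig. 4.1.
    X, y = [], []
    for start in range(len(series) - T + 1):
        X.append(series[start:start + T])
        y.append(int(step_labels[start:start + T].any()))
    return np.stack(X), np.array(y)

# X, y = make_windows(series, step_labels, T=24)   # e.g., Yahoo! daily windows
# B, A = X[y == 0], X[y == 1]                      # benign vs. malicious samples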

Parameter tuning. Hyper-parameters for the RNN models (number of neurons, learning rate) are chosen via a 10-fold cross-validation procedure using only the benign training data Btrain, with (lower) reconstruction error as the validation criterion for selecting hyper-parameters. For most of the data sets, the number of samples is rather small compared to contemporary deep learning benchmark data sets, and we resort to architecturally rather shallow networks. The exception is the CICIDS2017 data, for which we chose a reasonable set of values that provided quick learning, but still used a stopping condition by monitoring the validation loss. We tuned the number of principal components for our PCA model using 10-fold cross-validation on the benign training data only, choosing the number of principal components that yielded the lowest reconstruction error.

Models: We first list all models used in our experiments and give rationale for why we chose them.

1. Feed-Forward Autoencoder

2. Vanilla Recurrent Neural Network

3. Bidirectional Recurrent Neural Network

4. Stacked Recurrent Neural Network

5. Bidirectional-Stacked Recurrent Neural Network

6. Encoder-Decoder Recurrent Neural Network

7. Attention Encoder-Decoder

8. Principal Component Analysis


Recall that we motivated this chapter with the purpose of exploring many types of RNNs for the task of unsupervised anomaly detection. We chose the feed-forward autoencoder model as one of our baselines with the intent to determine whether an LSTM cell encourages stronger performance on time series data than a model without it. The underlying difference between a feed-forward autoencoder and an RNN is that the former only processes signals directly from input to output, whereas the latter also allows signals to travel across time steps, with the addition of memory retention via LSTM cells.

We chose the PCA-driven model because we wanted to determine whether this simple linear model could outperform non-linear models, i.e., RNNs, the feed-forward autoencoder, etc. This baseline is important because it answers the question of whether non-linear models are a direction worth exploring for this task.

Since the purpose of this chapter and the scope of this thesis revolve around exploring RNNs for unsupervised anomaly detection in time series, we wanted to provide a thorough exploration of common RNN-based models. Thus we explored the Vanilla RNN, Bi-RNN, S-RNN, Bi-S-RNN, and Encoder-Decoder RNN. Intuitively, we include the RNN in its simplest architecture, namely the Vanilla RNN, and include other RNN-based architectures that add more layers (stacked), a bidirectional component, and a neural architecture that uses two separate RNN components (the encoder-decoder RNN). Lastly, we include an Attention Encoder-Decoder architecture as a means to experiment with the attention mechanism, to determine whether it could help point out potential outliers or anomalies, or encourage better decoding and reconstruction. A sketch of one such architecture is given below.
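As one concrete instance of these architectures, a minimal Keras sketch of an LSTM encoder-decoder autoencoder follows; the layer sizes are illustrative assumptions, not our tuned values.

from tensorflow.keras import layers, models

T, D, H = 24, 3, 32  # window length, feature dimension, hidden units (assumed)

inputs = layers.Input(shape=(T, D))
encoded = layers.LSTM(H)(inputs)                   # encode the window to a vector
repeated = layers.RepeatVector(T)(encoded)         # feed it to every decoder step
decoded = layers.LSTM(H, return_sequences=True)(repeated)
outputs = layers.TimeDistributed(layers.Dense(D))(decoded)

autoencoder = models.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(B_train, B_train, ...)           # reconstruct benign windows
# anomaly score = per-sample reconstruction error on B_test and A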

Model training. Once validation is done, the best performing combination of hyper-parameters is selected as the final set of parameters for a specific model. Training is done using 80% of the benign data, while the other 20% is used for monitoring the learning phase and early stopping. Both parts are based on the Btrain set.

Evaluation. With the trained model, we compute the metrics on the held-out benign data (not used during the previous two phases) and all malicious data, that is, Btest ∪ A. We report Prec(k), Prec(R), AP(k) and AP(R) for all the models and data sets. R is the number of malicious samples given in Table 4.1.

4.7 Results

4.7.1 Sequential models and autoencoders

Tables 4.2 and 4.3 show Prec(R) and AP(R) for all data sets, respectively, while Figs. 4.2 and 4.3 show AP(R) from different perspectives for easier comparison. If a ranking method is perfect (i.e., all R anomalies are ranked in the top-k), then both Prec(R) and AP(R) will be 1.0. An overall initial impression from the results is that sequence-based models are more suitable for the task at hand, with only marginal differences between them. The pure fully-connected Autoencoder still appears to be a reasonable approach in terms of computed metrics, but overall lags behind most sequential methods, outperforming only the Attentional Encoder-Decoder. Somewhat surprisingly, the Attentional Encoder-Decoder model underperforms compared to its non-attentional variant, and completely fails in two cases. We conjecture that despite its explanatory promise, the attentional aspect adds complexity to these models and seems to hinder the training process. The pure Encoder-Decoder models remain a reasonable choice, as they offer comparable performance to standard LSTM-based models and perform best on one data set (Synthetic).


Model                          Synthetic     A1-51         A1-56         Rank          CICIDS2017
Autoencoder                    .890 ± .033   .642 ± .000   .535 ± .004   .774 ± .022   .525
LSTM                           .898 ± .100   .644 ± .004   .556 ± .007   .883 ± .032   .764
Bi-LSTM                        .862 ± .131   .645 ± .004   .561 ± .005   .861 ± .030   .785
S-LSTM                         .820 ± .115   .647 ± .006   .555 ± .009   .835 ± .041   .807
Bi-S-LSTM                      .812 ± .107   .645 ± .003   .557 ± .013   .876 ± .026   .729
Encoder-Decoder                .900 ± .116   .642 ± .000   .553 ± .008   .846 ± .031   .796
Attention ED                   .464 ± .179   .643 ± .001   .538 ± .006   .802 ± .030   .499
Principal Component Analysis   .274 ± .025   .352 ± .001   .411 ± .002   .305 ± .004   .263

Table 4.2: Precision at recall for all models and data sets. Bi stands for bidirectional, and S for stacked. Values represent the average across 10-fold validation with the standard deviation after the ± sign. For CICIDS data we only ran the models once.

Model                          Synthetic     A1-51         A1-56         Rank          CICIDS2017
Autoencoder                    .862 ± .050   .348 ± .001   .273 ± .003   .704 ± .024   .380
LSTM                           .895 ± .105   .365 ± .007   .287 ± .004   .864 ± .037   .674
Bi-LSTM                        .843 ± .157   .364 ± .007   .289 ± .004   .827 ± .045   .719
S-LSTM                         .817 ± .117   .368 ± .007   .286 ± .005   .753 ± .056   .728
Bi-S-LSTM                      .798 ± .111   .356 ± .005   .287 ± .009   .827 ± .042   .642
Encoder-Decoder                .895 ± .125   .354 ± .005   .284 ± .005   .789 ± .045   .715
Attention ED                   .308 ± .209   .351 ± .002   .276 ± .005   .756 ± .041   .349
Principal Component Analysis   .062 ± .031   .124 ± .001   .153 ± .001   .065 ± .020   .192

Table 4.3: Average precision at recall for all models and data sets. Bi stands for bidirectional, and S for stacked.


For both Yahoo! data sets, A1-51 and A1-56, all the models have difficulties distinguishing between benign and anomalous samples, which is indicated by rather low R-precision scores (around 0.64 and 0.55). This means that roughly 36% and 45% of the first R samples are normal/benign cases. Considering that the other data sets have higher values for both Prec(R) and AP(R), it appears that normal and anomalous cases in the Yahoo! data are very similar to each other. With a smaller number of samples for training and the low dimensionality (D = 1) of both Yahoo! data sets, all models end up capturing a distribution that also covers a significant portion of the malicious cases. Performance is stronger across all models on the Synthetic data, where the normal and anomalous dynamics are manually controlled. For the Rank and CICIDS data, there is enough information in the D > 30 features to allow the models to distinguish between the two cases.

Overall, the two most promising models based on the results are the Stacked and Bidirectional networks, with the Stacked version being slightly better. The combined version (Bidirectional-Stacked-LSTM) does not further improve performance, suggesting that the data complexity is fully captured with either an additional layer (stacked) or reversed inputs (bidirectional). On the other hand, if we exclude the CICIDS data, the pure LSTM model offers the most stable results, i.e., it provides the best performance on the Rank data while being very close to the top-performing models on the remaining data sets.


Figure 4.2: Precision at recall for all models and data sets. Bi stands for bidirectional, S for stacked, ED for Encoder-Decoder and Attn.ED for Attentional Encoder-Decoder. The shaded region represents one standard deviation from the mean; this information is absent for CICIDS data.

Figure 4.3: Average precision at recall for all models and data sets. Bi stands for bidirectional and S for stacked. Standard deviation is omitted for clarity.


4.7.2 Principal Component Analysis: Linear vs Non-Linear

PCA has been a dominant linear approach for anomaly detection (Zhang et al., 2009), and while it has proven successful in its own right, we wanted to show empirically whether a simple linear model (PCA) can outperform a simple non-linear model (a feed-forward neural network). Our hypothesis is that since most data is non-linear, a non-linear model would intuitively fit this data better than a linear model. With this in mind, our conjecture was that although PCA has proven successful for some tasks, for the specific task of unsupervised anomaly detection in time series, non-linear models outperform linear models.

Across all data sets we notice that PCA yields the poorest results and is not competitive with any neural network approach. PCA is linear and would yield strong results only if the features within a data set are linearly correlated. However, real-world data is often complex and non-linear, and in practice PCA performs poorly largely due to its limited linear capabilities. On the contrary, neural networks are able to model non-linear data via activation functions such as ReLU and create complex functions that can generalize remarkably well over real data. Hence, within the scope of our experiments, we confirm that our conjecture is justified.
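
For reference, a minimal sketch of such a PCA reconstruction-error baseline; the component count and the placeholder data are illustrative assumptions, not our tuned configuration.

import numpy as np
from sklearn.decomposition import PCA

B_train_flat = np.random.rand(800, 72)   # placeholder benign windows, [N, T*D]

pca = PCA(n_components=5)                # illustrative, not the tuned value
pca.fit(B_train_flat)

def pca_anomaly_score(X):
    # Squared error between each window and its low-rank PCA reconstruction.
    X_hat = pca.inverse_transform(pca.transform(X))
    return np.sum((X - X_hat) ** 2, axis=1)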

4.7.3 Attentional component and explainability

Based on the tables and figures, the attentional version of the Encoder-Decoder provides surprisingly poor results. In this particular scenario, we conjecture that the attentional component acts more as a constraint that is difficult to satisfy rather than augmenting the decoding phase with additional information. The difference between successful applications of attention in the literature and our use case here is the reconstruction problem we are adopting. This is exemplified by the trivial explanation shown in Fig. 4.4. The figure shows an attention map for one specific anomalous case in the Rank data. Unsurprisingly, the best time step to use on the decoder's side is the same time step from the encoder's side. From a reconstruction perspective, this is a logical outcome, but from an explainability perspective, we do not gain any useful information that would benefit network security operators. In short, the attentional component focuses on information (time steps) that is beneficial for the current decoding step, rather than pointing out which steps are potential outliers worthy of further inspection.

4.8 Discussion

4.8.1 Comparison to other methods

A direct comparison to other approaches in the literature is not possible, since almost all published results focus on using labeled data and (supervised) classification-based metrics. In this thesis, we argue that such an approach cannot be fully adopted in an operating environment, and we suggest alternative metrics for evaluating the performance of models. Operators and analysts have access to a substantial palette of different types of data, and most of the time the data is unlabeled, noisy, and sequential but otherwise unstructured – characteristics that we tried to mimic in our evaluation.


Figure 4.4: Attention map for an input sequence containing several anomalous time steps (red). White color indicates higher values (maximum of 1). Numbers on both axes indicate time steps starting at 0. The top-left square is input/encoder step 0 and output/decoder step 0. The attention component does not have any restrictions in terms of focus; that is, each decoding step can use all time steps from the encoder's end.

4.8.2 Time series approach to modeling

One potential downside of our methodology is the break-up of data by time windows of a specific (fixed) length. When a malicious event occurs in the data, it is split across several, if not hundreds of, samples. The models can in turn put all these samples at the top of the ranked list, thereby inflating the ranking scores. To avoid such pathological cases, we would instead need some way to split the data based on events and aggregate it in a way that includes all information relevant to predicting the anomalousness of the event. However, how to do this is not always obvious. For example, the CICIDS data consists of network-related flows of packets, which would have to be accounted for on an individual basis. Similarly, the Rank data is a collection of heterogeneous observations originating from several different monitoring sensors (including network flows, file-related events and process-related events) where aggregation by event is not obvious.

While our autoencoding methodology can in principle incorporate any form of data, a common denominator in all network data is the time dimension, and it seems natural to aggregate over this dimension. In an actual operating environment, one would choose the time window pertinent to that particular environment and based on the security analyst's needs: how often should an IDS return reports about the current state of the system? The one-second window specified for the CICIDS data in our experiment may prove to be inappropriate in practice, but given the volume of data in this particular case, this value was chosen to capture the overall short-term nature of network flows.


4.8.3 Operational deployment

Until now, our primary focus has been on investigating different sequential deep neural network architectures and providing a model enabling insight into its decisions, while an actual operating implementation was of secondary concern. However, an important question for all of the models evaluated in this thesis is whether they can be deployed in a real operating environment, and the resulting time and space complexity of doing so. While the volumes of modern network data combined with the computational expense of deep learning may suggest these models could be impractical in deployed settings, the success of deep learning has led to new hardware solutions designed to meet the computational demands of such models: new GPU models optimized for highly parallel data processing, as well as cloud-based services offering resources on demand. A potential issue that still arises in an operating environment, however, is that data is not fixed-length, but comes in streams, requiring constant updates or re-training of models. This online scenario has already been touched upon in (Tuor et al., 2017), which suggests exposing each sample only once to non-sequential models, and augmenting sequential RNNs with auxiliary structures having their own update policies.

4.9 Conclusion

In this chapter, we presented an experimental comparison of different deep learning architectures for the purpose of detecting anomalous behavior in a cyber-security setting. Our methodology is oriented around autoencoder reconstruction modeling, where we aim to capture the underlying normal data distribution and then detect future deviations from this distribution. The driving assumption behind this approach is that a model trained on normal data (network traffic) should have difficulties reconstructing anomalous samples coming from a different distribution. Adopting this approach, we are able to avoid the pitfalls arising from (supervised) classification-based strategies: 1) labeled data is difficult to acquire in many cases, 2) the classification approach can only distinguish between classes it was trained on, and 3) binary classification outcomes do not offer additional insight into the most dangerous threats in the system.

In order to assess reconstruction quality and the detection rate for anomalous cases, we focused on ranking metrics that provide higher scores to systems that place true anomalies at higher ranks – an important consideration for end-user network security analysts. Based on our analysis, sequential models prove to be more suitable for this ranking task than the arguably more popular fully-connected autoencoder approaches. Out of several tested variants, the most promising deep architecture for anomaly detection is a stacked recurrent network, where additional recurrent layers offer improved distribution modeling in comparison to bidirectional connections and the encoder-decoder framework. However, both the encoder-decoder framework and the vanilla RNN perform competitively against the stacked recurrent network, and so these three architectures are in general strong choices for the task of reconstruction-based anomaly detection in time series.

We can draw three conclusions from our experimental results discussed previously:

1. Sequential deep learning models such as RNNs prove to be effective for anomaly detection in time series over the popular fully-connected autoencoder and PCA-driven approaches.

2. Ranking metrics are important metrics for end-user network analysts.

3. Through our experiments, the most promising deep architectures are the stacked recurrent network, the encoder-decoder RNN and the vanilla RNN.


Finally, in an effort to be critical of this work: while sequential deep learning models prove effective at anomaly detection, an open question remains how to explain these anomalies to end users. The attentional model has the appealing property of providing explanations at each time step; however, the particular anomaly detection model we used proved difficult to train for the reconstruction task and was not able to improve upon its non-attentional encoder-decoder variant. Future work should investigate improved training of such models, as well as attentional components that explicitly focus on identifying anomalous segments of input data to pinpoint sources of anomalies for end users.


Chapter 5

Explaining Sequential Anomalies Detected by Autoencoders

5.1 Motivation for Explanation within Anomaly Detection

In Section 2.5 we motivated the need for explaining the predictions output by machine learning models. In this section we delve into the importance of explaining predictions within the scope of unsupervised anomaly detection. In Chapter 4 we contributed an unsupervised anomaly detection framework that is capable of flagging anomalous time series. Our framework is unique for its ability to rank time series data from most suspicious to least suspicious, which is useful because operators can choose to attend to potential threats in an intelligent manner governed by importance. While our framework is robust in its ability to flag anomalous time series and provide a systematic method for operators to attend to them, a useful extension would be determining what areas of a time series are malicious or caused our anomaly detector to flag it in the first place. In greater detail, being able to direct operators to the areas of interest within a time series that are most malicious could save them time, and corporations millions of dollars, because cyber attacks often take months or years before their cause is determined.

This leads to the prime focus of this chapter, which is pinpointing anomalous regions of time series in an effort to explain why an anomalously classified time series has been flagged as anomalous. For the remainder of this chapter we use "explain" or "explaining time series" interchangeably with pinpointing anomalous regions of a time series. We first present an outline for the remainder of this chapter as follows. We begin by introducing how reconstruction difference could be used to pinpoint anomalies, and explain why it may fail. Next, we introduce our first contribution in this chapter, a novel approach for pinpointing malicious areas of flagged data points; we motivate this approach as a potential remedy to the reconstruction difference approach. Following this, we present our experimental setup, and propose our second contribution, a novel evaluation methodology. Lastly, we conclude with a discussion of our results and findings.

5.2 Explanation through Reconstruction Difference

In this subsection, we begin by formalizing the reconstruction difference approach for explaining anomalous time series. Specifically, the intent of this approach is to pinpoint anomalous regions of a time series.


Figure 5.1: In this figure we walk through the steps taken to pinpoint anomalous regions of a time series, using two approaches: the reconstruction difference approach and an approach we contribute, coined closest non-anomaly (CNA), which we detail in the next section. The first plot shows an anomalous time series (dotted red), its reconstruction (green), and the closest non-anomalous signal (blue) to the anomalous time series, which is what CNA proposes. Our hypothesis is that CNA is better than the reconstruction difference approach for pinpointing anomalous regions within an anomalous time series, because the reconstruction of an anomalous time series often contains major deviations across a majority of time steps. Given the signals in the first plot, the next step is to take the absolute difference between the anomalous signal and its reconstruction, and likewise between the anomalous signal and the CNA signal, and introduce a threshold; we show this step in the second plot. The final step is to map each difference value for each signal in the second plot to 0 if the value is less than the threshold and 1 if it is greater. The result is the two binary sequence signals shown in the final plot. We reason that the time steps or regions containing 1 represent anomalous regions within the original anomalous signal. In this example, the reconstruction pinpoints large regions which give us little insight, because the reconstruction of a time series often incurs major deviations as a downstream effect of an earlier cause. On the contrary, with our proposed CNA method we find the closest non-anomalous signal to the anomalous signal and can pinpoint specific regions of dissimilarity, which we hypothesize are the anomalous regions. We also contribute a novel method for evaluation that leverages boolean metrics such as precision, recall, accuracy, Hamming distance, Jaccard similarity, and F1 score. Each of these metrics provides a useful yet different way to evaluate explanations, which we explain in greater detail in a later section. With these metrics, together with the explanations or explainable regions shown in the third plot for CNA and reconstruction error respectively, and with ground truth anomalous regions, we can evaluate performance by comparing these explanations against the ground truth, where reconstruction error represents a trivial baseline. The key contribution of our evaluation methodology is that the explanation we provide to each metric is a region, and given ground truth labels we can measure explanation quality via these metrics. Moreover, we leverage the fact that our explanations are binary labels on a time series, which allows such evaluation metrics to be computed.

Below we explain the details of this approach mathematically and discuss its major weaknesses.

Let M be an autoencoder that is trained on benign data. Let X be an anomalous (non-benign) data point, and let X̂ be the reconstruction of X when fed through M, that is, M(X) = X̂. The reconstruction difference approach for pinpointing anomalous regions of X is to compare X and X̂ at each respective time step. Simply, the regions where X̂ is dissimilar to X are the estimated pinpointed regions of the anomalous data point X.

While highlighting regions with high reconstruction error intuitively seems a natural place to start when explaining anomalous time series, it turns out not to perform well in many cases. One major reason is that the reconstruction of a time series often incurs major deviations that are downstream effects of an earlier cause, resulting in highly inaccurate regions being pinpointed. Ultimately, reconstruction error incorrectly classifies many non-anomalous regions as anomalous, producing noisy predictions that provide little insight.

In Figure 5.1 we show an example of how reconstruction difference is used to pinpoint anomalous regions of a time series, and show why it is highly inaccurate. In the first plot of this figure, we have an anomalous signal (dashed red), where the anomalous behaviour is a deviation from a sine wave (normal behaviour) occurring at time steps 4, 5, 6, 7, 8 and 9, and its reconstruction (green). We first take the absolute difference between the anomalous signal and its reconstruction, that is |X − X̂|, and introduce a threshold at t = 0.1. In the second plot in Figure 5.1 we show this difference in green. Next, to extract the explainable regions (time steps) where an anomaly occurs, we map each difference value to 0 if it is below the threshold t and 1 if it is greater. The time steps with a 1 or a 0 indicate anomalous and non-anomalous regions, respectively. In the fourth plot in Figure 5.1, we show the extracted explainable regions using the reconstruction error signal and compare them to the ground truth anomalous regions (shown in the last plot in Figure 5.1). We notice that this approach predicted almost all true anomalous regions, but it has almost equally often classified non-anomalous regions as anomalous ones, which makes it less useful for operators in practice.
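
A sketch of this baseline, assuming a trained autoencoder M and a flagged window x (the function name and threshold default are ours):

import numpy as np

def rd_explanation(x, x_hat, t=0.1):
    # Flag every time step whose absolute reconstruction error exceeds t.
    return (np.abs(x - x_hat) >= t).astype(int)   # 1 = predicted anomalous step

# x_hat = M.predict(x[None])[0]
# explanation = rd_explanation(x, x_hat)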

5.3 Explanation through the Closest non-anomaly (CNA)

In the previous subsection we explained the naive approach of using reconstruction difference to pinpoint anomalous regions of an anomalous time series. As mentioned, there exist a few critical pitfalls to this approach, and so we propose our approach as a potential remedy.

In our approach, we first try to find the closest non-anomalous variation of the observed anomalous data stream, which we will refer to as the closest non-anomaly (CNA). Our intuition is that the regions where the CNA data stream disagrees with the anomalous data stream are more likely to provide better explanations for the anomaly, since it is precisely these regions whose variation has led to the anomalous classification. We formalize our approach below in parallel with the example provided in Figure 5.1 to reinforce our idea.

We begin by defining some notation. Let X be an anomalous time series data stream. Let M be our reconstruction-based anomaly detector as described in Chapter 4. Let D(A,B) = |A − B| be a function that computes the element-wise absolute difference between A and B, where both A and B are of equal size.

1. We start by finding an X∗ as follows:

\arg\min_{X^*} \|X^* - M(X^*)\|_2^2 + \lambda \|X - X^*\|_2^2 \quad (5.1)


We minimize both ‖X∗ − M(X∗)‖ and ‖X − X∗‖ to obtain X∗. Minimizing ‖X∗ − M(X∗)‖ keeps X∗ non-anomalous (it reconstructs well), while minimizing ‖X − X∗‖ ensures that X∗ is the non-anomalous data point most similar to the anomalous signal X. This is important because our hypothesis is that the regions where X∗ disagrees with X contain the anomaly and the underlying reason our model flagged it. With this in mind, it is critical to find the closest non-anomalous data point, since this ensures that the regions remaining after we take the difference and threshold the values are precise, without many false positives. In other words, the further off X∗ is from X, the greater the likelihood of incorrectly classifying non-anomalous regions.

In Figure 5.1, we show X∗ and X (the anomalous signal). Notice that X∗ is similar to X from time steps 0 to 4 and again from 9 onwards, but different in the remaining regions.

2. Next, we want to extract the dissimilar regions of X∗ to X within X. To do this, we compute Z = D(X∗, X), where each element Zi ∈ Z with i ∈ {1, ..., T} corresponds to the value at time step i, and T is the maximum time step. After computing Z, we introduce a threshold t so we can map each value of Z to 0 if and only if it is less than t, and 1 otherwise. We denote this below.

Formally we start by defining an indicator function f that performs this mapping step.

f(Z_i) = \begin{cases} 1 & Z_i \geq t \\ 0 & Z_i < t \end{cases} \quad (5.2)

Then, let Z^* = [f(Z_1), f(Z_2), \ldots, f(Z_T)].

We illustrate what has been explained thus far in the example in Figure 5.1. In the plot titled "difference", we show Z (blue), and following this we show a visualization of Z∗ (blue), which represents our thresholded explanation.

3. Lastly, our approach concludes that the pinpointed anomalous regions are the areas of the binary sequence Z∗ that contain a 1. Essentially, our explanations are binary labels, which easily allow us to compare against ground truth labels. In our example in Figure 5.1, focusing on the last plot, CNA's thresholded explanation (blue) indicates that time steps t = 4 to t = 9 contain the anomaly, which perfectly aligns with the ground truth (red). A code sketch of the full procedure follows below.
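
The text does not fix an optimization procedure for Eq. 5.1; the following TensorFlow sketch uses plain gradient descent with Adam as one plausible choice, and the values of λ, the step count, the learning rate and the threshold are illustrative assumptions.

import numpy as np
import tensorflow as tf

def cna_explanation(M, x, lam=1.0, t=0.1, steps=500, lr=0.01):
    # Step 1: minimize Eq. 5.1 by gradient descent to find X*.
    x = tf.constant(x, dtype=tf.float32)
    x_star = tf.Variable(x)                        # initialize X* at X
    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            recon = M(x_star[None])[0]             # M(X*)
            loss = (tf.reduce_sum((x_star - recon) ** 2)       # X* reconstructs well
                    + lam * tf.reduce_sum((x - x_star) ** 2))  # X* stays close to X
        grads = tape.gradient(loss, [x_star])
        opt.apply_gradients(zip(grads, [x_star]))
    # Steps 2-3: threshold Z = D(X*, X) into the binary explanation Z*.
    # For D > 1, one could additionally aggregate z over the feature axis.
    z = np.abs((x - x_star).numpy())
    return (z >= t).astype(int)                    # 1 marks anomalous time steps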

In the next section, we present our evaluation methodology, which focuses on evaluating the quality of these explainable or pinpointed regions.

5.3.1 Evaluation Methodology

In this subsection we contribute our novel evaluation methodology for measuring the quality of our pinpointed anomalous regions, as shown in Figure 5.1.

Recall the thresholded explanations from our approach explained previously. The key contribution of our novel evaluation methodology is that we reduce an anomaly explanation to the subset of regions most likely responsible for the anomaly. When our time series is discrete, these regions are indicated by a sequence of binary labels that allow us to leverage standard boolean evaluation metrics for the purpose of evaluating explanation quality. Additionally, we note that many of these evaluation metrics have natural continuous-time extensions, as discussed later. Further, we note that this approach, to the best of our knowledge, has not been used in the literature for evaluating explanations of anomalies in time series.

With this in mind, we report 6 metrics that each measure in their own way how similar the predicted explainable regions are to the ground truth regions. We use both boolean and distance metrics. The metrics we report are precision, accuracy, recall, F1 score, Hamming distance, and Jaccard similarity. Using Figure 5.2, we explain the properties that each metric focuses on in this particular setting.

Boolean Metrics:

Since our explanation evaluation leverages boolean metrics, we now review relevant background on binary classifiers and commonly used boolean evaluation metrics. Binary classifiers are trained to classify data instances into either positive or negative classes. Below we define true positives, false positives, true negatives and false negatives in an effort to convey the boolean metrics succinctly.

1. True positive (TP): Data points correctly predicted as the positive class.

2. False positives (FP): Data points incorrectly predicted as the positive class.

3. True negatives (TN): Data points correctly predicted as the negative class.

4. False negatives (FN): Data points incorrectly predicted as the negative class.

                         Predicted Class
                         P                      N
Actual Class    P        True Positive (TP)     False Negative (FN)
                N        False Positive (FP)    True Negative (TN)

Figure 5.1: Confusion Matrix showing TP, FP, TN, and FN


                 Precision   Accuracy   Recall   F1 Score   Hamming Distance   Jaccard Similarity
Explanation 1    1.0         0.70       0.25     0.40       3.0                0.25
Explanation 2    0           0          0        0          10.0               0
Explanation 3    0.40        0.40       1.0      0.57       6.0                1.0
Explanation 4    0           0.60       0        0          4.0                0
Explanation 5    0.50        0.59       0.25     0.33       4.0                0.20

Figure 5.2: We choose 5 example explanations and a ground truth explanation to understand the different properties of each boolean metric used in our evaluation methodology. Note that the anomaly occurs at time steps 2, 3, 4, 6, 7, 8.

Precision: Precision measures the accuracy of positive predictions; it is the only metric that focuses specifically on how precisely our explanations label TP regions. In Figure 5.2 we see that explanation 1 achieves a precision of 1.0 even though it did not label all TP regions when compared to the ground truth. This is because precision only measures how precise the positively labelled regions are: in explanation 1, all regions labelled positive are true positives, with no incorrectly labelled FP regions. Specifically, this metric is useful when we care about FP labels. Below we show the formal definition of precision.

\mathrm{Precision} = \frac{TP}{TP + FP} \quad (5.3)

As a final note, this metric as stated caters to the discrete case, but it can easily be extended to the continuous case by interpreting the size of a set as its total length rather than its count.

We briefly explain precision for the continuous case. Let E and G be continuous intervals of time points representing a given explanation and the ground truth explanation, respectively. Next, let ∩ be the intersection of two sets of intervals. Lastly, we denote by | · | the total length of the intervals. Then, in the continuous case,

\mathrm{Precision} = \frac{|E \cap G|}{|E|} \quad (5.4)

Recall: Recall measures the proportion of true positives that were classified correctly. In Figure 5.2 we see that explanation 3 achieves a recall of 1.0 with precision and accuracy scores of 0.40, because it captures all TP regions of the ground truth sequence. In explanations 2 and 4, no TP region is labelled correctly, thus yielding a recall score of 0 for both explanations. Overall, this metric is useful when FN labels are important. Below we show the formal definition of recall.

\mathrm{Recall} = \frac{TP}{TP + FN} \quad (5.5)

Similar to precision, this metric can be extended to the continuous case as follows.

\mathrm{Recall} = \frac{|E \cap G|}{|G|} \quad (5.6)

Accuracy: Accuracy measures the fraction of correct predictions. In Figure 5.2 we see that explanation 4 achieves precision, recall and F1 scores of 0, but an accuracy score of 0.60. This is an important metric because it is the only one among precision, recall, and F1 score that takes TN predictions into account; it therefore includes a measure of how accurately our explanations label TN regions, which the others do not capture. In particular, this metric is useful when all predicted classes are equally important. Below we show the formal definition of accuracy.

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (5.7)

In the continuous case, accuracy is defined as follows.

\mathrm{Accuracy} = \frac{|E \cap G|}{|G| + |E|} \quad (5.8)

F1 Score: The F1 score measures predictive accuracy as the weighted harmonic mean of precision and recall. This metric is particularly useful when there exist few anomalous regions or imbalanced class labels in the ground truth sequence. For instance, in Figure 5.2, explanation 3 predicts all time steps as anomalous, yielding a large class imbalance; because the F1 score focuses on FP and FN labels, it is particularly informative in such imbalanced settings. Since accuracy considers all classification labels, it performs poorly when class distributions are imbalanced, which is why in explanation 3 accuracy is over 15% lower than the F1 score.

With precision and recall scores it is possible to achieve a perfect recall score at the expense of a poor precision score, and vice versa; explanations 3 and 1 showcase this behaviour. By leveraging the F1 score, we can prevent highly conflicting precision and recall scores, because the F1 score gives equal weight to both. Similar to the F1 score, its extensions, the Fα and Fβ metrics, may also be useful when we want to adjust the weight placed on recall or precision (Manning et al., 2008). The F1 score is the most commonly used metric that balances recall and precision. Below, we show the formal definition of the F1 score, which incorporates both the precision and recall equations.

\mathrm{F1\ Score} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \quad (5.9)

In the continuous case, F1-score can be extended as follows.

\mathrm{F1\ Score} = 2 \cdot \frac{\frac{|E \cap G|}{|E|} \cdot \frac{|E \cap G|}{|G|}}{\frac{|E \cap G|}{|E|} + \frac{|E \cap G|}{|G|}} \quad (5.10)

Hamming distance: Given two vectors, the Hamming distance measures the number of positions at which the corresponding vectors differ. This metric is by nature a distance metric, which disregards the specific breakdown into TP, TN, FP and FN predictions. In explanation 2, every predicted label must be corrected when compared against the ground truth, yielding a Hamming distance of 10.0, whereas in explanation 1 only 3 labels must be corrected. With this in mind, unlike the previous metrics where a higher score is better, with Hamming distance a lower score is better. Overall, this metric is useful when trying to capture how close our predictions are to the ground truth, independent of the particular frequencies of TP, TN, FP and FN labels.

In the continuous case, Hamming distance is defined as follows.

|E \G|+ |G \ E| (5.11)

Jaccard similarity: Jaccard similarity measures the similarity between two vectors or finite sets. This metric disregards any temporal aspect of the binary explanation sequence and instead treats the sequence as a finite set, counting the overlap of binary labels between the ground truth and predicted explanations. Specifically, it computes the intersection of the ground truth and prediction over their union. In Figure 5.2, explanation 5 receives a Jaccard similarity of 0.20 while explanation 3 receives 1.0, because Jaccard similarity focuses on how well the ground truth and prediction align holistically; this is one reason why a low precision and recall score in explanation 3 can coexist with a high Jaccard score of 1.0. Similarly, it is possible to get good precision and recall scores as in explanation 1 with a low Jaccard score of 0.25, because Jaccard penalizes heavily when the prediction and ground truth do not align.

J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} \quad (5.12)


In the continuous case, Jaccard similarity is defined as follows.

J(E,G) = \frac{|E \cap G|}{|E \cup G|} = \frac{|E \cap G|}{|E| + |G| - |E \cap G|} \quad (5.13)
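
A sketch of the discrete computation using the standard set-based definitions above; following Figure 5.2, the Hamming distance is reported as a raw count of differing steps, and the function name is ours.

import numpy as np

def explanation_metrics(pred, truth):
    # pred and truth are binary vectors over time steps; 1 = anomalous region.
    pred, truth = np.asarray(pred), np.asarray(truth)
    tp = np.sum((pred == 1) & (truth == 1))
    fp = np.sum((pred == 1) & (truth == 0))
    tn = np.sum((pred == 0) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    union = tp + fp + fn
    return {"precision": precision,
            "recall": recall,
            "accuracy": (tp + tn) / pred.size,
            "f1": f1,
            "hamming": int(np.sum(pred != truth)),
            "jaccard": tp / union if union else 0.0}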

5.3.2 Behavior of metrics under realistic testing scenarios

In practice, most data will not contain anomalies, which alters the efficacy of the metrics discussed previously. Under a realistic testing scenario, a significant fraction of the time we will be observing non-anomalous time series. In this subsection we explain what happens to the metrics in such a realistic setting. To start, the saturation of non-anomalous time series will substantially reduce both precision and accuracy, because there will be more false positives, which increase the denominator but not the numerator of both the precision and accuracy equations. On the other hand, recall will be unaffected, because both its numerator (true positives) and denominator (ground truth positives) are entirely unaffected by non-anomalous time series, which contribute only ground truth negatives and false positives. Furthermore, the F1 score will likely sit closer to precision than to recall, because precision decreases substantially while recall remains unchanged; by definition, the harmonic mean of precision and recall is always less than the arithmetic or geometric mean, and often close to the minimum of the two (Manning et al., 2008). Lastly, regarding Hamming distance and Jaccard similarity, we can expect the Hamming distance to grow very large as false positives increase, and the Jaccard similarity's denominator to increase, resulting in a very small score.

Overall, it is clear that all metrics except recall will become very small in this realistic test scenario. This does not mean that these metrics cannot be used to relatively compare different explanation methods, but in general it is not clear-cut whether any explanation method will achieve a high enough value on any metric to deem it useful in practice. With respect to our testing scenario, these metrics are well aligned, because it is sensible that a user has zeroed in on a small anomalous region and requests an explanation in that region, rather than simply generating all explanations over all data in an untargeted manner.

5.4 Experimental Setup

We perform an empirical evaluation using our method of explainability on both continuous and discrete data sets, in an effort to showcase the versatility of our underlying explainability approach and evaluation methodology. We experiment across 3 different reconstruction-based architectures: the feed-forward autoencoder, the encoder-decoder recurrent neural network, and the vanilla recurrent neural network. We have chosen these architectures because, through the experiments discussed in Chapter 4, they were the three superior architectures among all surveyed.

5.4.1 Discrete Data

The synthetic data generated has been described in Section 4.5.1; we briefly go over the details again. Each normal or benign data point generated is a 100-length binary sequence (0 and 1) with 10 occurrences of the digit 1, separated by 9, 10, or 11 positions within the sequence. The remaining entries within the sequences are 0 digits. We note that both normal and anomalous data points have exactly the same number of ones and zeros – 10 and 90 respectively. Most importantly, we emphasize that the distribution across time remains the same, but the order of values plays an important role.

We illustrate a normal data instance and an anomalous data instance below.

Notice that both sequences are 100-length binary sequences, where the only difference is that the normal sequence (blue squares) has a separation of 1s every 10 positions, representative of normal behaviour, whereas the anomalous sequence (red squares) does not maintain this separation of every 9, 10 or 11 positions: notice, for example, a 1 digit at time step 36 and at time step 40, breaking the separation rule.

5.4.2 Continuous Sine Wave Data

In this subsection we use a single notion of normal behaviour and three different types of anomalous behaviours.

Normal Data: Each normal data point is a 24-time-step slice of a sine wave, with temporal frequencies varying in increments of 0.25.

Anomalous Data: Each anomalous data point is a 24-time-step slice of a normally generated sine wave that contains a variant of noise inserted at random time steps. We consider three noise variants. The first set of anomalous data points has noise generated from a Gaussian inserted at randomly generated continuous time steps. The second set has a subset of a stepped square wave inserted at randomly generated continuous time steps. The last set is similar to the second, the only difference being that we insert triangle waves instead of stepped square waves. In essence, we deem the region where we insert the noise, or the subset of the stepped square or triangle wave, to be the anomalous region.

We illustrate all three anomalous behaviours and a normal signal below. We note that because a data stream is a 24-time-step slice of a sine wave, these streams, when plotted, will often be perceived as linear, losing the typical sine wave shape. In the last two plots of Figure 5.3 we first show a normal sine wave with a highlighted boxed region of a 24-time-step window, and in the following plot we show how this 24-time-step window looks visually.


Figure 5.3: The blue plot shows a sine wave with Gaussian noise inserted randomly. The red and green plots respectively show a sine wave with stepped square and triangle wave segments inserted randomly throughout. The areas containing the noise or the subset of the stepped square or triangle wave are the anomalous regions within the time series. The purple plot shows a sine wave representing normal data. In our experiments we focus on 24-time-step windows of these sine wave signals that represent a single data stream, and so we include, in orange, a 24-time-step stream of the shaded region shown above to give better insight into how these streams look visually.
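
A sketch of this generation process under our assumptions (the exact frequencies, noise scale and segment widths are not specified here, so the values below are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def normal_window(T=24):
    freq = 0.25 * rng.integers(1, 5)          # frequencies vary in 0.25 steps
    return np.sin(freq * np.arange(T))

def anomalous_window(kind, T=24, width=6):
    x = normal_window(T)
    start = int(rng.integers(0, T - width))   # random insertion point
    if kind == "gaussian":
        x[start:start + width] += rng.normal(0.0, 0.5, size=width)
    elif kind == "square":
        x[start:start + width] = np.sign(x[start:start + width])    # stepped square
    else:                                     # "triangle"
        x[start:start + width] = np.abs(np.linspace(-1, 1, width))  # triangle segment
    labels = np.zeros(T, dtype=int)
    labels[start:start + width] = 1           # ground-truth anomalous region
    return x, labels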

5.4.3 Setup

Above we have described 4 types of anomalous data points across the discrete and continuous data. As mentioned in Section 5.3, at a high level our approach finds a perturbed, non-anomalous variant of a flagged data point and compares it against the anomalous time series at each respective time step. We have also mentioned in Section 5.2 the trivial reconstruction difference approach for uncovering anomalous regions within a time series; we use this reconstruction-based difference as our baseline approach.

Thus far we have presented two approaches for pinpointing anomalous regions – the reconstruction difference and our closest non-anomaly approach – and an evaluation methodology for measuring the quality of an explanation. In the next section we describe our results and conclude with a discussion.

5.5 Results

In this section we report the precision, recall, accuracy, F1 score, Hamming distance and Jaccard similarity across the feed-forward neural network, vanilla recurrent neural network and encoder-decoder recurrent neural network, for both the continuous and discrete data sets of anomalous data points.

For each data set we use 50 data points. We aggregate across all 50 data points per metric and report these results below. Note that CNA is our closest non-anomaly approach presented in Section 5.3, and RD is our reconstruction difference baseline approach presented in Section 5.2.

Figure 5.4: We show 3 anomalous data instances (ground truth) from our discrete data set. For each, we report in blue the original anomalous data instance, its corresponding reconstruction, and CNA, and in red we report the reconstruction, CNA and ground truth explanations, respectively. Note that each sequence represents a boolean list where red and blue squares represent 1 labels and white squares represent 0 labels.


Figure 5.5: We show an anomalous sample where the anomaly is Gaussian noise. We report in blue the original anomalous data instance, its corresponding reconstruction, and CNA, and in red we report each reconstruction, CNA and ground truth explanation respectively.


Figure 5.6: We show an anomalous sample where the anomaly is stepped square segments. We report in blue the original anomalous data instance, its corresponding reconstruction, and CNA, and in red we report each reconstruction, CNA and ground truth explanation respectively.


Figure 5.7: We show an anomalous sample where the anomaly is triangle wave segments. We report in blue the original anomalous data instance, its corresponding reconstruction, and CNA, and in red we report each reconstruction, CNA and ground truth explanation respectively.

Table 5.1: Results for discrete data

                     FC-Autoencoder     Vanilla RNN       Enc-Dec RNN
Metrics              CNA      RD        CNA      RD       CNA      RD
Precision            0.89     0.24      0.95     0.23     0.99*    0.25
Accuracy             0.83     0.14      0.90     0.02     0.93*    0.18
Recall               0.77     0.10      0.85     0.13     0.87*    0.14
F1 Score             0.66     0.17      0.74*    0.15     0.73     0.15
Hamming Distance     0.09     0.16      0.07*    0.19     0.08     0.19
Jaccard Similarity   0.85     0.22      0.88*    0.13     0.87     0.17


Table 5.2: Results for continuous data where anomalous data points have random Gaussian noise inserted randomly per data point

                     FC-Autoencoder     Vanilla RNN       Enc-Dec RNN
Metrics              CNA      RD        CNA      RD       CNA      RD
Precision            0.93     0.47      0.97     0.40     0.99*    0.49
Accuracy             0.79     0.61      0.83     0.54     0.86*    0.63
Recall               0.55     0.69      0.58     0.62     0.61     0.71*
F1 Score             0.67     0.55      0.71     0.47     0.74*    0.57
Hamming Distance     0.07*    0.38      0.11     0.45     0.13     0.36
Jaccard Similarity   0.79     0.61      0.83     0.54     0.86*    0.63

Table 5.3: Results for continuous data where anomalous data points are square waves

                     FC-Autoencoder     Vanilla RNN       Enc-Dec RNN
Metrics              CNA      RD        CNA      RD       CNA      RD
Precision            0.67*    0.56      0.61     0.54     0.60     0.50
Accuracy             0.62     0.54      0.67*    0.50     0.59     0.49
Recall               0.51     0.72      0.58     0.79     0.53     0.81*
F1 Score             0.46*    0.38      0.45     0.29     0.42     0.34
Hamming Distance     0.32*    0.58      0.34     0.68     0.39     0.67
Jaccard Similarity   0.59     0.54      0.65*    0.52     0.45     0.47

Table 5.4: Results for continuous data where anomalous data points are triangle waves

                     FC-Autoencoder     Vanilla RNN       Enc-Dec RNN
Metrics              CNA      RD        CNA      RD       CNA      RD
Precision            0.56     0.51      0.78     0.49     0.81*    0.56
Accuracy             0.41     0.47      0.76     0.47     0.78*    0.52
Recall               0.68     0.75      0.72     0.81*    0.75     0.74
F1 Score             0.51     0.63      0.70     0.57     0.72*    0.34
Hamming Distance     0.19     0.39      0.16     0.42     0.13*    0.43
Jaccard Similarity   0.36     0.33      0.65     0.29     0.76*    0.36

Note that in Tables 5.1, 5.2, 5.3, and 5.4 we use * to indicate the best value per row (highest for all metrics except Hamming distance, where lower is better).

5.6 Discussion

In this section we present an overview of the results above and discuss our findings. We start by analyzing CNA performance against the reconstruction difference for explaining anomalies, using our previously described evaluation methodology, across each data set. Following this, we discuss findings around architecture performance for this particular explanation task. In Figures 5.4, 5.5, 5.6 and 5.7 we show a single anomalous time series along with its reconstruction explanation and CNA explanation from each of the discrete and continuous data sets, which we will refer to as support in our analysis of results.

In Table 5.1 we present the results for our discrete data. The reconstruction distance approach (RD) performed poorly across every model in comparison to our closest non-anomaly (CNA) approach. Focusing on accuracy, CNA was able to better pinpoint anomalous and non-anomalous regions, with an accuracy score of 0.93 using the encoder-decoder RNN. We notice that, across all three neural network architectures, CNA has a greater precision score than accuracy score, which indicates that we pinpoint the anomalous regions better than we label non-anomalous regions; this is because accuracy rewards correctly labelled non-anomalous regions whereas precision does not capture this. For the Hamming distance metric, CNA performs better than RD across all architectures, indicating that our approach requires the fewest swaps when CNA explanations are compared against ground truth anomalous regions.

In Figure 5.4 we see that, in general, the reconstruction explanation is significantly noisy compared to the ground truth across all three time series instances, whereas CNA is less noisy and captures the ground truth anomalous regions relatively better, yielding explanations that either correctly align with the ground truth or are often only a few time steps off.

In terms of architecture performance, we see that for CNA the encoder-decoder RNN outperforms the rest of the architectures in precision, accuracy and recall, but the differences in scores are within 0.10, which indicates that all architectures are roughly as strong for this particular approach, with a slight edge in favour of the encoder-decoder RNN. In general we see that CNA outperforms RD significantly across all metrics, independent of model choice. Additionally, among these architectures, CNA with an encoder-decoder RNN performs slightly better than with the Vanilla RNN, and more so than with the FC-Autoencoder, for this discrete data set.

In Table 5.2 we present the results for our first continuous data set, where anomalous time series have random Gaussian noise inserted randomly. From these results we see that CNA outperforms the RD baseline in precision, accuracy, F1 score, Hamming distance and Jaccard similarity. Both precision and accuracy scores for CNA, across each neural network architecture, are far greater than those of the RD approach, indicating that we can pinpoint anomalous regions more precisely than RD. Focusing on recall, this is the only metric for which RD outperforms CNA. We reason that this is because the reconstruction is often inaccurate and includes major deviations, resulting in the majority of time steps being classified as anomalous, which trivially covers the true anomalous regions. In other words, while the RD approach outperforms the CNA approach on recall, it is not as precise, because the RD approach classifies nearly all time steps as anomalous regions.

In Figure 5.5 we see how the reconstruction of the anomalous time series contains a major deviation, resulting in an explanation that covers a large portion of the time series. From this we also see how the CNA approach is able to pinpoint precise anomalous regions that align with the ground truth. Further, this supports the notion that the reconstruction can deviate substantially from the original time series, causing a poor explanation.

For this particular data set, we notice that the superior choice of architecture when using CNA is the encoder-decoder RNN, though not significantly so: the Vanilla RNN scores at most 0.03 lower in precision, accuracy, and Jaccard similarity in comparison. Overall, for CNA the choice of architecture does not incur significant gains over another; however, in general the encoder-decoder RNN outperforms the FC-Autoencoder and Vanilla RNN across the majority of metrics. Furthermore, we see an overall trend that CNA outperforms the RD approach for this particular continuous data set.

In Table 5.3, for our second continuous data set, we see that CNA outperforms the RD approach on each metric except recall. While our approach outperforms the baseline, the scores across precision, accuracy, recall, and F1 score are on the lower end. Comparing the scores for both approaches, they do not deviate significantly from one another. For instance, considering the feed-forward autoencoder model, the Jaccard similarity score of our CNA approach is 0.59 against the RD approach's 0.54. This metric, along with the previously mentioned metrics, suggests that for this particular anomalous behaviour neither approach is far superior to the other.

In Figure 5.6 we see that RD is not able to provide any explanations for that particular anomalous time series, whereas CNA is able to capture all true positive anomalous regions with a few false positive regions. For this particular data stream it is clear that CNA has provided more meaningful explanations and has pinpointed precise regions in comparison to RD.

In terms of model performance, for CNA the Vanilla RNN outperforms the rest of the architectures in accuracy, whereas the feed-forward autoencoder outperforms the rest of the models on the CNA scores for precision, F1 score, and Hamming distance. We notice that the encoder-decoder RNN performs the worst in comparison to the rest of the models, but its precision score is competitive with the Vanilla RNN's precision score for the CNA approach. Since both the CNA and RD approaches perform relatively weakly here, the choice of model becomes more important. In general, we see that CNA has outperformed RD across the majority of metrics, and that for this particular data set the superior architecture choice is the feed-forward autoencoder.

In Table 5.4 we present the results for our final continuous data set, where anomalous time series contain random triangle waves. We notice that for precision, accuracy, F1 score, Hamming distance and Jaccard similarity, CNA outperforms RD, mainly with our encoder-decoder RNN model. We notice that for the feed-forward autoencoder, CNA performs poorly compared to its performance with the encoder-decoder RNN or the Vanilla RNN. Furthermore, CNA performance is relatively similar to RD performance when using the feed-forward autoencoder, comparing precision, accuracy, Jaccard similarity, and F1 scores against each other.

In Figure 5.7 we see that for this particular anomalous time series the reconstruction explanation predicts many false positive time steps, whereas CNA does not. Further, we see that CNA is more precise at pinpointing anomalous regions, yielding zero false positive regions. Notably, while CNA does not predict all ground truth anomalous regions correctly, it aligns with these regions better than the RD explanations do.

Overall, we notice that for this particular data set CNA outperforms RD across all metrics. Further, we notice that CNA performance with the Vanilla RNN and encoder-decoder RNN is strong, whereas performance with the FC-Autoencoder decreases significantly. This suggests that the choice of model is important.

5.7 Conclusion

In this chapter we have contributed a novel approach for explaining anomalously classified data points. We have provided a comprehensive evaluation across synthetic data sets and overall shown strong performance of our CNA approach over the reconstruction distance baseline. We have also contributed a novel methodology for evaluating the explanation quality of different anomaly detectors.

We can draw three key conclusions from our experimental results discussed previously:

1. CNA methods almost always dominate reconstruction methods across all metrics.

2. The RNN always performs better than, or at least as well as, the FC-Autoencoder. Moreover, not only does the RNN produce the best anomaly detection as shown in Chapter 4, but it also facilitates the best explanation.

3. The Encoder-Decoder and Vanilla RNN perform comparably for explanation; however, the Encoder-Decoder RNN appears to slightly outperform in 3 out of 4 experiments.

Finally, in an effort to be critical of this work, we do acknowledge that all experiments here have been conducted on synthetic data sets, and so future work aims to test our approach on real time series data sets. We conclude that in this chapter we have contributed a novel explainability approach for pinpointing anomalous regions of an anomalously classified data point and have shown its success as a proof-of-concept on a variety of synthetic data sets.


Chapter 6

Conclusion

In this dissertation we have motivated the importance of anomaly detection in the cybersecurity domain and provided an overview of reconstruction-based anomaly detectors where the underlying model is a recurrent neural network. Furthermore, we have contributed a novel approach for pinpointing sources of anomalies and a unique methodology for evaluating the quality of these explanations.

Overall, the central thesis of this dissertation was to explore RNN-based deep learning methods for anomaly detection over popular RNN architectures, and to contribute a new method for explaining data points that have been flagged as anomalies. Each chapter made the following contributions:

• In Chapter 2, we provided an overview of machine learning formalisms that have laid the groundwork for anomaly detection. We acknowledged relevant statistical machine learning, deep learning and probabilistic graphical model approaches to anomaly detection.

• In Chapter 3, we provided an overview of anomaly detection, describing what supervised and unsupervised deep anomaly detection are, and acknowledged relevant literature on deep and non-deep learning based anomaly detection approaches.

• In Chapter 4, we contributed a comprehensive comparative evaluation of RNN-based deep learning techniques for anomaly detection across many widely used deep neural network architectures. We outlined superior models for this anomaly detection task and thoroughly evaluated performance on numerous synthetic and real data sets. A key conclusion drawn from this chapter is that sequential deep learning models such as RNNs prove effective for anomaly detection in time series. Further, based on our analysis we have seen that ranking metrics are important for end-user network analysts. We have also seen that sequential models prove more suitable for the task of reconstruction-based anomaly detection than the popular fully-connected autoencoder and PCA-driven approaches. Finally, through our experiments, the most promising deep architectures are the encoder-decoder RNN, vanilla RNN and stacked recurrent network.

• In Chapter 5, we motivated the need for explaining anomalies and showed how the lack of explanation has created a gap in neural network based anomaly detectors. In this chapter we contributed a novel explainability approach called CNA, which shows strong merit in pinpointing anomalous regions of time series and strong promise at producing explanations for anomalies. Our key conclusions are that CNA dominates reconstruction methods for explainability, and that overall, across our experiments, an RNN always outperforms the FC-Autoencoder and facilitates the best explanation. Lastly, we have contributed a unique methodology that evaluates the quality of explanations by reducing an anomaly explanation to the subset of regions most likely responsible for the anomaly.

6.1 Future Directions

Overall, while this thesis contributes a reconstruction-based comparative evaluation of RNN-based anomaly detectors, a novel approach for pinpointing and explaining anomalous regions of a data point, and a unique methodology for measuring the quality of explanations, there exist some important unexplored areas that we believe have potential for future research.

• In Chapter 4, we explored many RNN architectures for reconstruction-based anomaly detection. We saw that the attentional model has the appealing property of providing explanations at each time step; however, the attentional model used proved difficult to train for the reconstruction-based anomaly detection task and ultimately was not successful in this setting. Regarding this, one direction of future work is to investigate how the attentional components embedded within these models can be used to identify anomalous segments of input data as a means to pinpoint sources of anomalies for users. This would complement the work contributed in Chapter 5.

• In Chapter 5, we motivated the need for explaining anomalies and contributed a novel explanation approach that pinpoints anomalous regions of time series. Another direction of future work is to investigate how LIME, the model-agnostic explanation framework discussed earlier, can be used to provide useful explanations using training data. While we explored reconstruction error as a model-agnostic method for explanation in Chapter 5, LIME offers a potentially better model-agnostic approach because it creates a local decision boundary that naturally yields an explanation of regions, whereas reconstruction error can lead to scattered explanations that are difficult to interpret, as seen in Chapter 5.

• Throughout this dissertation, we have discussed an approach to reconstruction-based anomaly detection that trains a model to learn a distribution of some "normal" data pattern. However, in practice it can take a significant amount of time to train such a model. To combat this, providing a real-time anomaly detection model for incoming streams of time series, without human intervention and domain knowledge, is desirable. With this in mind, future work aims to study how reconstruction-based anomaly detection as shown throughout this dissertation could be extended to an online environment, and how these RNN-based autoencoders could be used to determine whether or not an incoming time series is a sign that an anomaly is likely forthcoming.


Bibliography

Amer, M. and Goldstein, M. (2012). Nearest-neighbor and clustering based anomaly detection algorithms for rapidminer. In Proc. of the 3rd RapidMiner Community Meeting and Conference (RCOMM 2012), pages 1–12.

Amor, N. B., Benferhat, S., and Elouedi, Z. (2004). Naive bayes vs decision trees in intrusion detection systems. In Proceedings of the 2004 ACM Symposium on Applied Computing, pages 420–424. ACM.

Atlantic Council (2017). http://www.publications.atlanticcouncil.org/cyberrisks/. Accessed 16 July 2018.

Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Bengio, Y., Simard, P., Frasconi, P., et al. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.

Bishop, C. M. et al. (1995). Neural networks for pattern recognition. Oxford University Press.

Brauckhoff, D., Salamatian, K., and May, M. (2009). Applying pca for traffic anomaly detection: Problems and solutions. In IEEE INFOCOM 2009, pages 2866–2870. IEEE.

Buczak, A. L. and Guven, E. (2016). A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Communications Surveys & Tutorials, 18(2):1153–1176.

Cansado, A. and Soto, A. (2008). Unsupervised anomaly detection in large databases using bayesian networks. Applied Artificial Intelligence, 22(4):309–330.

Cao, Y., Li, Y., Coleman, S., Belatreche, A., and McGinnity, T. M. (2013). A hidden markov model with abnormal states for detecting stock price manipulation. In 2013 IEEE International Conference on Systems, Man, and Cybernetics, pages 3014–3019. IEEE.

Cassandra, A. R., Kaelbling, L. P., and Littman, M. L. (1994). Acting optimally in partially observable stochastic domains. In AAAI, volume 94, pages 1023–1028.

Chalapathy, R. and Chawla, S. (2019). Deep learning for anomaly detection: A survey. arXiv preprint arXiv:1901.03407.


Chandola, V., Banerjee, A., and Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734. Association for Computational Linguistics.

Churchill, G. A. (1989). Stochastic models for heterogeneous dna sequences. Bulletin of Mathematical Biology, 51(1):79–94.

Cybersecurity Ventures (2017). 2017 cybercrime report. https://cybersecurityventures.com/2015-wp/wp-content/uploads/2017/10/2017-Cybercrime-Report.pdf. Accessed 17 July 2018.

Doya, K. (1999). What are the computations of the cerebellum, the basal ganglia and the cerebral cortex? Neural Networks, 12(7-8):961–974.

Duan, L., Xu, L., Liu, Y., and Lee, J. (2009). Cluster-based outlier detection. Annals of Operations Research, 168(1):151–168.

Eddy, S. R. (1996). Hidden markov models. Current Opinion in Structural Biology, 6(3):361–365.

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2):179–211.

Filonov, P., Kitashov, F., and Lavrentyev, A. (2017). Rnn-based early cyber-attack detection for the tennessee eastman process. arXiv preprint arXiv:1709.02232.

Frey, B. J. and Jojic, N. (2005). A comparison of algorithms for inference and learning in probabilistic graphical models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(9):1392–1416.

Garcia-Teodoro, P., Diaz-Verdejo, J., Maciá-Fernández, G., and Vázquez, E. (2009). Anomaly-based network intrusion detection: Techniques, systems and challenges. Computers & Security, 28(1-2):18–28.

Gharib, A., Sharafaldin, I., Lashkari, A. H., and Ghorbani, A. A. (2016). An evaluation framework for intrusion detection dataset. In 2016 International Conference on Information Science and Security (ICISS), pages 1–6, Pattaya, Thailand.

Gilpin, L. H., Bau, D., Yuan, B. Z., Bajwa, A., Specter, M., and Kagal, L. (2018). Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 80–89. IEEE.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks, 18(5):602–610.

Gunning, D. (2017). Explainable artificial intelligence (xai). Defense Advanced Research Projects Agency (DARPA), nd Web, 2.


Guo, H. and Hsu, W. (2002). A survey of algorithms for real-time bayesian network inference. In Joint Workshop on Real Time Decision Support and Diagnosis Systems.

Han, J. and Moraga, C. (1995). The influence of the sigmoid function parameters on the speed of backpropagation learning. In International Workshop on Artificial Neural Networks, pages 195–201. Springer.

Hawkins, D. M. (1980). Identification of outliers, volume 11. Springer.

Helldin, T. and Riveiro, M. (2009). Explanation methods for bayesian networks: review and application to a maritime scenario. In Proceedings of the 3rd Annual Skövde Workshop on Information Fusion Topics (SWIFT 2009), pages 11–16.

Hochreiter, S. (1998). The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116.

Hodge, V. and Austin, J. (2004). A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85–126.

Hu, J., Yu, X., Qiu, D., and Chen, H.-H. (2009). A simple and efficient hidden markov model scheme for host-based anomaly intrusion detection. IEEE Network, 23(1):42–47.

Huang, L., Nguyen, X., Garofalakis, M., Jordan, M. I., Joseph, A., and Taft, N. (2007). In-network pca and anomaly detection. In Advances in Neural Information Processing Systems, pages 617–624.

Islam, M. R., Sultana, N., Moni, M. A., Sarkar, P. C., and Rahman, B. (2017). A comprehensive survey of time series anomaly detection in online social network data. International Journal of Computer Applications, 180(3):13–22.

Jemili, F., Zaghdoud, M., and Ahmed, M. B. (2007). A framework for an adaptive intrusion detection system using bayesian network. In 2007 IEEE Intelligence and Security Informatics, pages 66–70. IEEE.

Jia, C. and Yang, F. (2007). An intrusion detection method based on hierarchical hidden markov models. Wuhan University Journal of Natural Sciences, 12(1):135–138.

Jiang, S.-y. and An, Q.-b. (2008). Clustering-based outlier detection method. In 2008 Fifth International Conference on Fuzzy Systems and Knowledge Discovery, volume 2, pages 429–433. IEEE.

Johansson, F. and Falkman, G. (2007). Detection of vessel anomalies - a bayesian network approach. In 2007 3rd International Conference on Intelligent Sensors, Sensor Networks and Information, pages 395–400. IEEE.

Jordan, M. I. (1997). Chapter 25 - serial order: A parallel distributed processing approach. In Donahoe, J. W. and Dorsel, V. P., editors, Neural-Network Models of Cognition, volume 121 of Advances in Psychology, pages 471–495. North-Holland.

Koller, D. and Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT Press.


Lane, R. O., Nevell, D. A., Hayward, S. D., and Beaney, T. W. (2010). Maritime anomaly detection and threat assessment. In 2010 13th International Conference on Information Fusion, pages 1–8. IEEE.

Lavin, A. and Ahmad, S. (2015). Evaluating real-time anomaly detection algorithms – the numenta anomaly benchmark. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), pages 38–44. IEEE.

Liao, H.-J., Lin, C.-H. R., Lin, Y.-C., and Tung, K.-Y. (2013). Intrusion detection system: A comprehensive review. Journal of Network and Computer Applications, 36(1):16–24.

Limkar, S. and Jha, R. K. (2012). An effective defence mechanism for detection of ddos attack on application layer based on hidden markov model. In Proceedings of the International Conference on Information Systems Design and Intelligent Applications 2012 (INDIA 2012) held in Visakhapatnam, India, January 2012, pages 943–950. Springer.

Lipovetsky, S. and Conklin, M. (2001). Analysis of regression in game theory approach. Applied Stochastic Models in Business and Industry, 17(4):319–330.

Luong, T., Pham, H., and Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Malhotra, P., Ramakrishnan, A., Anand, G., Vig, L., Agarwal, P., and Shroff, G. (2016). Lstm-based encoder-decoder for multi-sensor anomaly detection. arXiv preprint arXiv:1607.00148.

Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.

Mascaro, S., Nicholso, A. E., and Korb, K. B. (2014). Anomaly detection in vessel tracks using bayesian networks. International Journal of Approximate Reasoning, 55(1):84–98.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Mnih, V., Heess, N., Graves, A., et al. (2014). Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204–2212.

Mukherjee, S. and Sharma, N. (2012). Intrusion detection using naive bayes classifier with feature reduction. Procedia Technology, 4:119–128.

Mulay, S. A., Devale, P., and Garje, G. (2010). Intrusion detection system using support vector machine and decision tree. International Journal of Computer Applications, 3(3):40–43.

Munir, M., Siddiqui, S. A., Dengel, A., and Ahmed, S. (2018). Deepant: A deep learning approach for unsupervised anomaly detection in time series. IEEE Access, 7:1991–2005.

Panda, M. and Patra, M. R. (2007). Network intrusion detection using naive bayes. International Journal of Computer Science and Network Security, 7(12):258–263.

Pearson, K. (1901). Principal components analysis. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 6(2):559.


Pennington, J., Socher, R., and Manning, C. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar.

Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM.

Rumelhart, D. E., Hinton, G. E., Williams, R. J., et al. (1988). Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1.

Sakurada, M. and Yairi, T. (2014). Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, pages 4–11.

Schlegl, T., Seeböck, P., Waldstein, S. M., Schmidt-Erfurth, U., and Langs, G. (2017). Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pages 146–157. Springer.

Sharafaldin, I., Lashkari, A. H., and Ghorbani, A. A. (2018). Toward generating a new intrusion detection dataset and intrusion traffic characterization. In Proceedings of the 4th International Conference on Information Systems Security and Privacy - Volume 1: ICISSP, pages 108–116, Funchal, Madeira, Portugal.

Shlens, J. (2003). A tutorial on principal component analysis: Derivation, discussion and singular value decomposition. Available at: www.snl.salk.edu/~shlens/pca.pdf.

Shyu, M.-L., Chen, S.-C., Sarinnapakorn, K., and Chang, L. (2003). A novel anomaly detection scheme based on principal component classifier. Technical report, University of Miami, Coral Gables, FL, Dept. of Electrical and Computer Engineering.

Spirtes, P., Glymour, C. N., Scheines, R., Heckerman, D., Meek, C., Cooper, G., and Richardson, T. (2000). Causation, prediction, and search. MIT Press.

Štrumbelj, E. and Kononenko, I. (2014). Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3):647–665.

Stultz, C. M., White, J. V., and Smith, T. F. (1993). Structural analysis based on state-space modeling. Protein Science, 2(3):305–314.

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Syarif, I., Prugel-Bennett, A., and Wills, G. (2012). Unsupervised clustering approach for network anomaly detection. In International Conference on Networked Digital Technologies, pages 135–145. Springer.

Taylor, A., Leblanc, S., and Japkowicz, N. (2016). Anomaly detection in automobile control network data with long short-term memory networks. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 130–139. IEEE.


Tuor, A., Kaplan, S., Hutchinson, B., Nichols, N., and Robinson, S. (2017). Deep learning for unsupervised insider threat detection in structured cybersecurity data streams. In Workshop on Artificial Intelligence and Cyber Security, pages 224–231. AAAI.

White, J. V., Stultz, C. M., and Smith, T. F. (1994). Protein classification by stochastic modeling and optimal filtering of amino-acid sequences. Mathematical Biosciences, 119(1):35–75.

Williams, R. J. and Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280.

Wold, S., Esbensen, K., and Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3):37–52.

Wong, W.-K., Moore, A. W., Cooper, G. F., and Wagner, M. M. (2003). Bayesian network anomaly pattern detection for disease outbreaks. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 808–815.

Wulsin, D., Blanco, J., Mani, R., and Litt, B. (2010). Semi-supervised anomaly detection for eeg waveforms using deep belief nets. In 2010 Ninth International Conference on Machine Learning and Applications, pages 436–441. IEEE.

Xu, H., Chen, W., Zhao, N., Li, Z., Bu, J., Li, Z., Liu, Y., Zhao, Y., Pei, D., Feng, Y., et al. (2018). Unsupervised anomaly detection via variational auto-encoder for seasonal kpis in web applications. In Proceedings of the 2018 World Wide Web Conference, pages 187–196.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, pages 2048–2057, Lille, France.

Zhang, W., Yang, Q., and Geng, Y. (2009). A survey of anomaly detection methods in networks. In 2009 International Symposium on Computer Network and Multimedia Technology, pages 1–3. IEEE.

Zhou, C. and Paffenroth, R. C. (2017). Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 665–674.

Zimek, A. and Schubert, E. (2017). Outlier detection. Encyclopedia of Database Systems, pages 1–5.


Appendix A. Input based attention

We can introduce the inputs to the computation of the context vector r_t in Eq. (2.7) in the following manner for x_t = [x_t^1, \dots, x_t^D]:

\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T} \exp(e_{tk})}, \quad j \in 1, \dots, T \qquad (1)

e_{tkl} = v_a^\top \tanh\left( W_a h^d_{t-1} + U_a h^e_k + Z_a C(x^l_k) \right), \quad l \in 1, \dots, D \qquad (2)

e_{tk} = \sum_{l=1}^{D} e_{tkl} \qquad (3)

where C is an expansion operator that transforms a scalar value into a vector of that scalar. Then, for every decoding step, we have a complete matrix measuring the influence of every step and of every feature within those steps. Compared to Eqs. (2.8), Z_a is a matrix of new learnable weights connecting a single feature to the attention vector.
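As a reading aid, the following is a minimal NumPy sketch of Eqs. (1)-(3); the shapes, argument names, and the softmax stabilisation are our own assumptions rather than the thesis implementation.

import numpy as np

def input_attention_scores(h_dec_prev, H_enc, X, Wa, Ua, Za, va):
    """Sketch of Eqs. (1)-(3): per-feature attention energies e_{tkl},
    summed over features and softmax-normalised over encoder steps.
    Assumed shapes: H_enc (T, d_enc), X (T, D), h_dec_prev (d_dec,)."""
    T, D = X.shape
    e = np.zeros(T)
    for k in range(T):
        for l in range(D):
            # C(.) expands the scalar feature x_k^l into a vector of that scalar
            c = np.full(Za.shape[1], X[k, l])
            e[k] += va @ np.tanh(Wa @ h_dec_prev + Ua @ H_enc[k] + Za @ c)
    alpha = np.exp(e - e.max())        # numerically stable softmax
    return alpha / alpha.sum()         # attention weights over the T encoder steps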

Appendix B. Information Retrieval metrics

In Information Retrieval, for a given query we are interested in returning all the relevant items/documents. Any machine learning algorithm for that purpose returns a subset of all items in the domain. These retrieved items are compared against all relevant items to compute both precision and recall. Both of these metrics are computed for an unordered set of retrieved items and are defined as follows:

\text{Prec} = \frac{\#\,\text{retrieved relevant items}}{\#\,\text{retrieved items}}, \qquad (4)

\text{Recall} = \frac{\#\,\text{retrieved relevant items}}{\#\,\text{relevant items}}. \qquad (5)

That is, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant items that are retrieved.

For ranked retrieval, both precision and recall need to be modified to incorporate the order of retrieved items. This ordering is naturally given by the top-k retrieved items defined by the inherent scoring of the machine learning algorithm, for example, probabilities of belonging to the relevant class. One method of measuring performance is the precision-recall graph, which plots precision values at predefined recall levels. The whole graph can be further aggregated to provide a single number as a performance measure. Another measure (the one we adopt for our experiments) is the Average Precision (AP), which has been shown to have good discrimination and stability. Average Precision is the average of the precision values obtained for the set of top-k items existing after each relevant item is retrieved and is defined as:

AP = \frac{1}{|R|} \sum_{i=1}^{N} \text{Prec}(i) \cdot \text{Relevant}(i), \qquad (6)

with R being the total number of relevant samples, Prec(i) the precision computed for the first i items in the set, and Relevant(i) a binary function indicating whether item i is relevant or not. N includes all the items in the domain, but once all R relevant items have been encountered, Relevant(·) is zero for the remaining items and we can stop the summation.

For our purposes in the intrusion detection context, anomalousness indicates relevancy (what we want to be highly ranked), and retrieving a sample in the top-k means that the model labels it as anomalous. With this interpretation, recall is not very informative, as it will always be constant no matter the ranking. Thus, we omit it from our evaluation and focus only on precision-related metrics.

Since network security analysts are inherently time-constrained in how many anomalies they can investigate in a fixed amount of time, our aim is to examine only the top-k highly ranked cases. This is analogous to the number of retrieved items a user is willing to investigate when performing web queries – they will be interested in only a couple of cases among the top 10 or 20 retrieved items. Thus, we restrict our evaluations to the top-k variants of the metrics – precision@k and average precision@k – which are defined as:

\text{Prec}(k) = \frac{\#\,\text{retrieved relevant items in the top-}k\text{ items}}{k} \qquad (7)

AP(k) = \frac{1}{k} \sum_{i=1}^{k} \text{Prec}(i) \cdot \text{Relevant}(i). \qquad (8)

For detecting malicious cases, we have to map the notions of relevant and retrieved items to malicious and labelled-malicious samples, respectively. Armed with this conversion to the security context, we adopt Prec(k) and AP(k) as our evaluation metrics in Section 4.4.

The downside of Prec(k) is that it does not average well across different queries, since the size of the relevant set will influence the outcome (consider cases where the relevant set size is smaller or larger than k). To overcome this issue, what is commonly used is Prec(R), or R-precision, that is, precision computed on the first R items, where R is the number of relevant items. For R-precision, the ideal score equals 1 no matter the size of the relevant item set.
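The following is a small sketch of these ranking metrics under the security interpretation above, directly implementing Eqs. (7) and (8) and R-precision; the 0/1-list input format and function names are illustrative assumptions.

def precision_at_k(ranked_relevance, k):
    """Prec(k): fraction of the top-k ranked items that are relevant (Eq. 7).
    ranked_relevance is a 0/1 list sorted by descending anomaly score."""
    return sum(ranked_relevance[:k]) / k

def average_precision_at_k(ranked_relevance, k):
    """AP(k) per Eq. (8): precision at each rank i <= k, counted only when
    the item at rank i is relevant, averaged over k."""
    return sum(precision_at_k(ranked_relevance, i) * ranked_relevance[i - 1]
               for i in range(1, k + 1)) / k

def r_precision(ranked_relevance):
    """Prec(R): precision over the first R items, R = number of relevant items."""
    r = sum(ranked_relevance)
    return precision_at_k(ranked_relevance, r) if r else 0.0

# Example: a ranking that places malicious samples 1st and 3rd out of 5
# gives precision_at_k([1, 0, 1, 0, 0], 3) == 2/3 and r_precision(...) == 0.5.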

Appendix C. Feature lists

Tables 1 and 2 show which features were extracted for the Rank data and the CICIDS2017 data respectively. Not all computed features carried useful information: some of them were either constant (at zero value) or heavily skewed towards 0. These were removed prior to further preprocessing steps.

Related event         Feature calculation                                                    Included

process creation      number of created processes                                            true
                      number of unique source ips                                            false
                      total number of parent processes                                       true
                      number of unique parents processes                                     true
                      number of unique directories when creating processes                   true
                      number of unique commands when creating processes                      true
                      total number of users                                                  true
                      number of unique users                                                 true
                      total number of parentImage_paths                                      true
                      number of unique parentImage_paths                                     true
                      total number of images files                                           true
                      number of unique images files                                          true
                      total number of images directories                                     true
                      number of unique images directories                                    true

process termination   number of processes                                                    false
                      number of unique processes                                             false
                      number of parents processes                                            false
                      number of unique parents processes                                     false
                      number of unique directories when terminating processes                false
                      number of unique commands when terminating processes                   false
                      total number of users                                                  true
                      number of unique users                                                 true

file related          number of created files                                                true
                      number of unique files                                                 true
                      number of unique source ips                                            false
                      number of processes involved in this creation                          true
                      number of unique categories                                            true
                      number of unique account types                                         false
                      number of unique hostnames                                             true
                      minimal difference between creation time and previous creation time    true
                      maximal difference between creation time and previous creation time    true
                      average difference between creation time and previous creation time    false

sysmon related        total number of source ips                                             false
                      number of unique source ips                                            false
                      total number of source ports                                           false
                      number of unique source ports                                          false
                      total number of destination ips                                        false
                      number of unique destination ips                                       true
                      total number of destination ports                                      false
                      total number of processes involved                                     false
                      number of unique processes involved                                    false
                      total number of account types                                          true
                      hashtag of all source ports                                            true
                      hashtag of all destination ports                                       true

Table 1: Extracted features for Rank data. Related event indicates the underlying event for a subset of features, and the column Included indicates if the feature is included in the final construction of a data set. Certain features do not carry any information and are therefore removed.

Feature extracted                           Included    f(x)

number of unique source ips                 true        x
number of unique source ports               true        x
number of unique destination ips            true        x
number of unique destination ports          true        x
number of started flows                     true        ln(x+1)
number of stopped flows                     true        log10(x+1)
number of idle flows                        true        log10(x+1)
number of started tcp flows                 true
number of stopped tcp flows                 true
number of idle tcp flows                    true        log10(x+1)
number of started udp flows                 true        log10(x+1)
number of stopped udp flows                 false
number of idle udp flows                    true        log10(x+1)
number of started other flows               false
number of stopped other flows               false
number of idle other flows                  false
minimum payload size                        false
maximum payload size                        true        x
mean payload size                           true        x
payload size deviation                      true        x
minimum header size                         true        x
maximum header size                         true        x
mean header size                            true        x
header size deviation                       true        x
number of fin flags found                   true        log10(x+1)
number of psh flags found                   true        log10(x+1)
number of urg flags found                   false
number of ece flags found                   false
number of syn flags found                   true        log10(x+1)
number of ack flags found                   true        log10(x+1)
number of cwr flags found                   false
number of rst flags found                   false
minimum flow idle-active time               true        log10(x+1)
maximum flow idle-active time               true        log10(x+1)
mean flow idle-active time                  true        log10(x+1)
flow idle-active time deviation             true        log10(x+1)
minimum flow activity                       false
maximum flow activity                       false
mean flow activity                          false
flow activity deviation                     false
minimum flow idle time                      false
maximum flow idle time                      false
mean flow idle time                         false
flow idle time deviation                    false
forward flow payload                        false
forward flow header                         false
number of forward flow psh flags            false
number of forward flow urg flags            false
minimum forward flow idle-active time       false
maximum forward flow idle-active time       false
mean forward flow idle-active time          false
forward flow idle-active time deviation     false
backward flow payload                       true        log10(x+1)
backward flow header                        true        log10(x+1)
number of backward flow psh flags           true        log10(x+1)
number of backward flow urg flags           true        log10(x+1)
minimum backward flow idle-active time      false
maximum backward flow idle-active time      true        log10(x+1)
mean backward flow idle-active time         true        log10(x+1)
backward flow idle-active time deviation    true        log10(x+1)

Table 2: Extracted features from CICIDS2017 PCAP files. f is the transformation function applied to shift the feature distribution into a more uniform range.
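To illustrate how the per-feature transforms f(x) from Table 2 could be applied, here is a minimal sketch; the dictionary structure and function names are our own, not the thesis preprocessing code, and only a few representative features are shown.

import numpy as np

# Heavy-tailed count features are compressed with log transforms before scaling;
# features with f(x) = x are passed through unchanged.
TRANSFORMS = {
    "number of started flows":   lambda x: np.log(x + 1.0),    # ln(x+1)
    "number of fin flags found": lambda x: np.log10(x + 1.0),  # log10(x+1)
    "mean payload size":         lambda x: x,                  # identity
}

def transform_features(row):
    """Apply each feature's f(x) where one is defined, else keep the raw value."""
    return {name: TRANSFORMS.get(name, lambda x: x)(value)
            for name, value in row.items()}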


Appendix D. Hyper-parameters of models

Model                         Parameter                   Values

Autoencoder                   learning rate α             0.01, 0.001
                              number of layers L          0, 1, 2
                              number of neurons n         100, 200
                              bottle-neck layer neurons   10, 20

LSTM                          learning rate α             0.01, 0.001
                              number of neurons n         8, 16, 32

Bi/S-LSTM                     learning rate α             0.01, 0.001
                              number of neurons n         4, 8, 16

(Attention) Encoder-Decoder   learning rate α             0.01, 0.001
                              number of neurons n         8, 16, 32

Table 3: Parameter values tested in the validation procedure for each model. Bi/S indicates all three combinations of bidirectional and stacked variants of LSTM. Learning rate α is the coefficient responsible for the magnitude of gradient updates during optimization. Number of layers L is how many layers are present in the network. For the autoencoder this value is doubled, while LSTM variants in this study only have L = 1 or L = 2 (for the stacked variant). Number of neurons n represents the number of processing units within a single layer.
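For reference, the validation grid of Table 3 could be enumerated as follows; the key names and the enumeration helper are our own sketch, not the thesis code.

from itertools import product

param_grid = {
    "Autoencoder": {"learning_rate": [0.01, 0.001], "num_layers": [0, 1, 2],
                    "num_neurons": [100, 200], "bottleneck_neurons": [10, 20]},
    "LSTM":        {"learning_rate": [0.01, 0.001], "num_neurons": [8, 16, 32]},
    "Bi/S-LSTM":   {"learning_rate": [0.01, 0.001], "num_neurons": [4, 8, 16]},
    "Enc-Dec":     {"learning_rate": [0.01, 0.001], "num_neurons": [8, 16, 32]},
}

def grid_configs(grid):
    """Enumerate every hyper-parameter combination for one model's grid."""
    names, values = zip(*grid.items())
    for combo in product(*values):
        yield dict(zip(names, combo))

# e.g. list(grid_configs(param_grid["LSTM"])) yields 2 x 3 = 6 configurations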


Data set       Model             α       L   n     b.n. n

Synthetic      Autoencoder       0.001   1   200   20
               LSTM              0.01    1   8
               Bi-LSTM           0.01    1   8
               S-LSTM            0.01    2   16
               Bi-S-LSTM         0.01    2   8
               Encoder-Decoder   0.01    1   8
               Attention ED      0.01    1   32

Yahoo! A1-51   Autoencoder       0.001   1   200   20
               LSTM              0.01    1   32
               Bi-LSTM           0.01    1   16
               S-LSTM            0.01    2   16
               Bi-S-LSTM         0.01    2   16
               Encoder-Decoder   0.01    1   32
               Attention ED      0.01    1   16

Yahoo! A1-56   Autoencoder       0.001   1   200   20
               LSTM              0.01    1   32
               Bi-LSTM           0.01    1   16
               S-LSTM            0.01    2   8
               Bi-S-LSTM         0.01    2   16
               Encoder-Decoder   0.01    1   32
               Attention ED      0.01    1   16

Rank           Autoencoder       0.001   1   100   20
               LSTM              0.01    1   32
               Bi-LSTM           0.001   1   16
               S-LSTM            0.01    2   16
               Bi-S-LSTM         0.001   2   16
               Encoder-Decoder   0.001   1   32
               Attention ED      0.001   1   32

CICIDS         Autoencoder       0.001   2   400   50
               LSTM              0.01    1   20
               Bi-LSTM           0.01    1   20
               S-LSTM            0.01    2   20
               Bi-S-LSTM         0.01    2   20
               Encoder-Decoder   0.01    1   20
               Attention ED      0.01    1   20

Table 4: Chosen values of hyperparameters for each model and data set. α is the learning rate, L the number of layers, n the number of neurons inside each layer, and 'b.n. n' the number of neurons inside the bottle-neck layer for the autoencoder. The standard LSTM model only has a single layer, while stacked variants of LSTM contain two layers.
