
Explaining and Aggregating Anomalies to Detect Insider Threats

Henry G. Goldberg, William T. Young, Alex Memory, Ted E. Senator

Leidos, Inc., Arlington, VA, USA

{goldberghg,youngwil,memoryac,senatort}@leidos.com

Abstract—Anomalies in computer usage data may be indicative of insider threats. Distinguishing actual malicious activities from unusual but justifiable activities requires not only a sophisticated anomaly detection system but also the expertise of human analysts with access to additional data sources. Because any anomaly detection system for extremely rare events will generate many false positives, human analysts must decide which anomalies are worth their time and effort for follow-up investigations. Providing a ranked or scored list of users – the typical output of an anomaly detection system – is necessary but far from sufficient for this purpose. Anomalies indicative of insider threats can be distinguished from those that arise from legitimate activity by explaining why they are anomalous, and high-risk users may be identified by their repeated appearance near the top of the ranked and scored lists. This paper describes results of experiments that show the utility of these techniques of explaining and aggregating anomalies to detect insider threats with greater accuracy than is achieved solely with anomaly detection methods.

Keywords-Anomaly Detection; Outlier Detection; Explanation; Temporal Aggregation

I. INTRODUCTION

An automated insider threat detection system that detects anomalies in a single data source such as computer usage data generates leads for investigation by human analysts. These leads consist of ranked or scored lists of users whose activities on particular days may warrant further investigation. The next step in the investigative process is for a counter-intelligence analyst to review these leads and determine which leads are worth his/her time to investigate further.

We propose and evaluate two complementary methods to help determine which anomalies should be investigated further and which should not. (1) Explaining anomalies by providing more details than the simple risk score can help an analyst understand why the system considered a particular user's activity on a particular day unusual and provides insight as to whether such activity requires further investigation or has a legitimate explanation. (2) Because insider threat scenarios typically are executed over multiple days, and because malicious users may engage in repeated improper activities, aggregating anomalous user-days by user can help identify users who are likely to be engaging in improper activity. Further, such aggregation allows an analyst to focus on malicious users, who are the ultimate target of real investigations, rather than just on malicious activities. This paper describes experiments and analyses we have performed using these two techniques of anomaly explanation and anomaly aggregation to more accurately discriminate between improper malicious activities and unusual but innocent activities.

We conduct our research using an anomaly detection system called PRODIGAL, which is described in Section II. The structure of this paper is as follows. First, we describe the PRODIGAL system, including the type of explanations it computes and how they are used in the analyst interface. Next, we describe our experiments, methods, results, and analyses of PRODIGAL's explanation generation ability involving combinations of features from individual detection algorithms. The next section of the paper discusses our experiments about how best to aggregate single-day anomaly scores.

II. THE PRODIGAL SYSTEM

PRODIGAL comprises data processing and anomaly detection components that are described in reference [1]. PRODIGAL has been configured to explore methods for unsupervised and semi-supervised anomaly detection as the first step in a multistage detection process for insider threats [2]. As such, it represents one of several approaches to the problem. (See [3] for a comprehensive survey of methods for anomaly detection, while [4] surveys approaches to insider threat analysis and prediction.) PRODIGAL uses an ensemble technique to combine results from multiple diverse detectors to identify anomalous user-days in a database of real computer usage activity [5]. PRODIGAL has been tested and evaluated against realistic independent red-team inserted scenarios. PRODIGAL's unsupervised anomaly detection ensemble combines scores from multiple diverse detectors into single user-day scores each month, resulting in a ranked and scored list of user-days ordered by the degree of anomalousness. This technique consistently achieves a level of performance on unknown inserted scenarios comparable to the performance of its best component detector as determined after the answer key has been provided, as described in reference [5]. This gives us confidence in PRODIGAL's ability to detect not only known and suspected insider threat scenarios but also variants and combinations of such scenarios and, more important, previously unknown scenarios.
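PRODIGAL's actual ensemble method is described in [5]. Purely as an illustration of score fusion, the sketch below (NumPy assumed; detector scores are hypothetical) averages normalized per-detector ranks into a single user-day score, one simple way to combine detectors whose raw scores live on different scales:

```python
import numpy as np

def rank_ensemble(score_matrix):
    """Combine detector scores into one anomaly score per user-day.

    score_matrix: (n_user_days, n_detectors), larger = more anomalous.
    Each column is converted to a normalized rank in [0, 1] so detectors
    with different score scales can be averaged directly.
    """
    n = score_matrix.shape[0]
    # argsort of argsort yields 0-based ranks; normalize to [0, 1]
    ranks = score_matrix.argsort(axis=0).argsort(axis=0) / (n - 1)
    return ranks.mean(axis=1)

# Hypothetical scores from three diverse detectors for five user-days:
scores = np.array([[0.1, 2.0, 5.0],
                   [0.9, 9.5, 40.],
                   [0.2, 1.0, 7.0],
                   [0.3, 2.5, 6.0],
                   [0.2, 3.0, 9.0]])
print(rank_ensemble(scores))  # the second user-day ranks highest overall
```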




A. Background: Insider Threat

Malicious insider activity on real computer networks is carried out by a small number of authorized users and, more important, represents only a small fraction of their overall activities on their computers. Our anomaly detectors, which we distinguish from algorithms, comprise not only the algorithms but also the specifications of the entity-extents whose activity is being examined and the contexts against which their activities are being compared. For example, an entity-extent may be an individual user or a group of users, defined by common projects, locations, organizations, or job functions, by relationships such as communications patterns or shared resource usage patterns, or by community membership, where communities are identified by combining aspects of these definitions. These entity-extents may be defined over different time periods as well. The context against which activities by entity-extents are compared includes various choices for peer groups and community memberships as well as various choices for time periods. For example, an individual user's activities on a given day may be compared with his/her activities on all days in a month or longer; they may be compared with activities of other users or groups of users on the same day; with activities of other users or groups of users during a month; or any variant thereof. Explanations, therefore, must provide to a human analyst not only the activity of the entity-extent being examined, but also the context of the activities of other entity-extents with which it is being compared.
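To make the entity-extent/context distinction concrete, the following sketch (hypothetical values, not PRODIGAL code) scores the same feature value for one user-day against two different contexts, the user's own history and same-day peers, and can reach very different conclusions:

```python
import numpy as np

def context_score(value, context_values):
    """Score one feature value against a context population (z-score)."""
    mu, sigma = np.mean(context_values), np.std(context_values)
    return (value - mu) / sigma if sigma > 0 else 0.0

# Hypothetical: a user's file-copy count today, compared against
# (a) the user's own history this month and (b) peers on the same day.
today = 120.0
own_history = np.array([10, 14, 9, 12, 11, 13, 8])    # same user, other days
peers_today = np.array([95, 130, 110, 105, 120, 98])  # peer group, same day
print(context_score(today, own_history))  # extreme vs. own baseline
print(context_score(today, peers_today))  # unremarkable vs. peers
```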

B. Using Explanations to Support the Analyst

Analysts need explanations that illustrate, in terms relating to user activity, why such activities are anomalous. Analysts are not interested in the algorithmic mechanisms used to identify anomalies. (See [6] for a similar approach to explaining document classification, which is also performed in a high-dimensional feature space.) While analysts would benefit from explanations of users' overall plans and intentions, this is infeasible because of the inaccessibility of data needed to explain such hypotheses. Furthermore, and perhaps more important, it is infeasible due to the unbounded amount of diverse domain knowledge that would be required to explain even a portion of the full set of plans that a computer user might be undertaking at any given time.

Three levels of explanation meet these needs. The first level consists of pre-computed single-feature outlier detection scores that are available for examination by analysts. Section III-A describes their computation. This provides the ability to examine the actual activity data of a user in the context of similar activities of other users over similar time periods. The second level is a collection of features or sets of features that contribute most to the anomaly score from an individual detector for an individual user. The third level combines the contributions from diverse detectors that are incorporated into the ensemble computation of the overall anomaly scores.
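One way to picture the three levels is as a small record per user-day; the schema below is our own illustration (the paper does not specify one), with each level carrying (feature, weight) pairs:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FeatureScore:
    feature: str   # e.g., "distinct_files_count" (hypothetical name)
    score: float   # normalized outlier score or drop-out weight

@dataclass
class Explanation:
    user: str
    day: str
    level1: List[FeatureScore] = field(default_factory=list)  # single-feature outlier scores
    level2: List[FeatureScore] = field(default_factory=list)  # top features per individual detector
    level3: List[FeatureScore] = field(default_factory=list)  # contributions across the ensemble
```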

C. PRODIGAL Analyst Interface

This section presents examples of the use of single-feature outlier scores in the PRODIGAL Analyst Interface (AI). Analysis starts with a list of entity-extents (user-days) sorted by highest ensemble anomaly score. The analyst compares these scores with others for the date or the entire month using the display shown in Figure 1. The AI presents user-day scores using a box plot, with the upper whisker representing the top 5 percent of scores in order to highlight the most anomalous behaviors. The black squares represent the selected user's scores for every day. User 410400 has scores (as shown in Figure 1) that are in the top 5% for several days in the month. The day highlighted by the green box is the highest-ranked day for the entire month.

Figure 1. Overall user-day scores compared to all users over the month. User #410400's scores are compared to the baseline population for the month. The black squares represent the user's scores. The green box highlights the day in question for the user, which is September 9, 2014.

For a given day of interest, the AI enables the analyst to view the individual features associated with the ensemble score, allowing the analyst to focus on specific anomalous behaviors while investigating a particular scored entity, as shown in Figure 2. The AI lists the data type associated with the feature score (e.g., file, email, URL, printer, logon), a summary description of the feature name, and the normalized score for that feature.

Figure 2. The individual feature scores for user 410400 on September 9, 2014. The user is in the less than 0.0001% percentile of all users on this day with respect to multiple file features.

Finally, drill-down to the underlying user-computer transactions is included to let the analyst view the behavior from which features, and ultimately anomaly scores, were computed (Figure 3). (Note: to preserve privacy in the research database, numerical hash keys replace unique user names, file names, domain names, and email addresses. A live implementation of PRODIGAL would present these to the analyst.) Inclusion of the underlying observations associated with the feature scores enables the analyst to visually inspect the data and assess whether the user's unusual behavior is concerning and merits further exploration (outside of PRODIGAL).

Figure 3. Detailed drill-down of events. Individual observations and select attributes associated with the top-ranked individual feature score, Distinct Files Count.



III. EXPLANATIONS OF ANOMALIES

An important motivation for creating and refining ways of explaining anomaly detection is that explanations are needed to support an analyst in distinguishing malicious from non-malicious user-days. Recently, studies such as [7] have confirmed the benefits of explanations for analysts. In lieu of direct utility judgments by analysts (which would be costly), our domain expert labeled possibly malicious activities and prepared ground truth for a test of feature utility.

Specifically, he considered the known malicious activities inserted by a red team into two months of live computer usage data, totaling 81 user-days. He labeled each user-day as either containing or not containing each of 12 activities of concern that he would be likely to cite to another analyst to explain why a particular user-day is worth further investigation (see Table I).

Table I. LABELS OF ACTIVITIES OF CONCERN TO AN ANALYST

No.  Description
 1   Copies lots of files
 2   Copies to removable
 3   Other file activity
 4   Searches networked drives
 5   Unusual web upload
 6   Other unusual browsing activity
 7   Excessive email attachments
 8   Unusual email send activity
 9   Unusual email received activity
10   Prints a lot (jobs and pages)
11   Unusual printer activity
12   Unusual logons (events and distinct WS)

A. Single Feature Outlier Scores

This section explains how we calculate single-feature outlier scores. Each entity receiving a score in PRODIGAL has associated with it a large number of feature values, V(U, D, F), where U is the user ID, D is the date, and F is the feature ID. PRODIGAL computes a statistical outlier score for each value by comparing it against all other users for that date, the comparison population being V(x, D, F). We compute an outlier score using the cumulative distribution function (CDF) of the logistic distribution with the mean and variance of this population. This score is normalized to [0, 1] and is easily compared with other features' scores. The score is a "marginal" explanation of the anomaly of the user-day, because it estimates the likelihood that the feature value is greater than those of other users from the base population. An explanation of a scored entity is a list of features plus outlier scores.
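A minimal sketch of this computation, assuming NumPy/SciPy and hypothetical data: fit the logistic distribution by the population's mean and variance (method of moments) and evaluate its CDF at each value.

```python
import numpy as np
from scipy.stats import logistic

def single_feature_outlier_scores(values):
    """Outlier scores for one feature over the same-day population V(x, D, F).

    Fits a logistic distribution from the mean and variance of the
    population and scores each value by the CDF, so a score near 1
    means the value exceeds almost all peers' values.
    """
    mu = np.mean(values)
    # variance of logistic(loc, scale) is (pi * scale)**2 / 3
    scale = np.sqrt(3.0 * np.var(values)) / np.pi
    return logistic.cdf(values, loc=mu, scale=scale)

# Example: one user uploads far more than the rest of the population.
scores = single_feature_outlier_scores(np.array([2, 3, 1, 4, 2, 95]))
print(scores.round(4))  # the last score is close to 1.0
```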

Pre-computed features have been selected from a wide range of user behaviors identified by counter-intelligence analysts. Some examples include: URL upload count, email event count, average recipient count per email sent, fixed drive file event count, and upload/distinct URL domain ratio. A sample of the values and outlier scores computed for the last example is shown in Figure 4.

Figure 4. Histograms of values (left) and Logit outlier scores (right) for the feature: Uploads / Distinct URL domain ratio



For these examples of pre-computed features, the first histogram shows the raw feature values with the density (red) and CDF (green) of the fitted logistic distribution. The second histogram shows the resulting outlier scores. Other distributions may fit this data better, such as the Gamma distribution (shown in Figure 5).

Figure 5. Histograms of values (left) and Gamma outlier scores (right) for the feature: Uploads / Distinct URL domain ratio
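As a concrete sketch of choosing between candidate distributions (SciPy assumed; the data are hypothetical stand-ins, not the paper's), one can fit both and compare Kolmogorov-Smirnov statistics:

```python
import numpy as np
from scipy import stats

# Hypothetical non-negative, right-skewed feature values (e.g., an
# uploads / distinct-URL-domain ratio); not the paper's actual data.
rng = np.random.default_rng(0)
values = rng.gamma(shape=2.0, scale=1.5, size=5000)

# Fit both candidate distributions and compare goodness of fit with a
# Kolmogorov-Smirnov statistic (smaller is better).
loc, scale = stats.logistic.fit(values)
ks_logistic = stats.kstest(values, 'logistic', args=(loc, scale)).statistic

a, loc_g, scale_g = stats.gamma.fit(values, floc=0)  # fix location at 0
ks_gamma = stats.kstest(values, 'gamma', args=(a, loc_g, scale_g)).statistic

print(f"KS logistic={ks_logistic:.3f}  gamma={ks_gamma:.3f}")
# Outlier scores under the better-fitting distribution:
scores = stats.gamma.cdf(values, a, loc=loc_g, scale=scale_g)
```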

B. Drop-Out Explanations

In addition to the single-feature outlier scores, we have modified several general-purpose anomaly detection algorithms, e.g., [8], to produce explanations based on a "drop-out" method of sensitivity analysis [9]. These explanations also comprise a set of the same features as previously described, plus weights representing how much impact removing each feature has on the score the detector computes for any user-day. Other methods of generating explanations are possible; in [10], Dang et al. propose a technique for finding anomalies and explanations simultaneously.
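The exact drop-out procedure is not given here; the sketch below is one simple realization of the idea, in which "removing" a feature is approximated by masking it with its population mean, using scikit-learn's IsolationForest as the detector:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def dropout_explanation(model, x, feature_means):
    """A minimal drop-out sensitivity sketch (not PRODIGAL's exact method).

    'Removes' one feature at a time by replacing it with the population
    mean and measures how much the anomaly score drops; a larger weight
    means the feature contributed more to the anomaly.
    """
    # score_samples returns higher values for normal points, so negate
    base = -model.score_samples(x.reshape(1, -1))[0]
    weights = np.empty(len(x))
    for i in range(len(x)):
        masked = x.copy()
        masked[i] = feature_means[i]   # neutralize feature i
        weights[i] = base - (-model.score_samples(masked.reshape(1, -1))[0])
    return weights

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))             # stand-in user-day feature vectors
model = IsolationForest(random_state=0).fit(X)
x = np.array([0.1, 8.0, -0.2, 0.0, 0.3])   # anomalous in the second feature
print(dropout_explanation(model, x, X.mean(axis=0)).round(3))
```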

C. Evaluating Explanations

To evaluate explanations in terms of the 100 PRODIGAL features against a ground truth vector of 12 labels, we tried two approaches: direct prediction of analyst-assigned labels by the feature vectors, and transforming the feature vectors into label vectors for comparison with a metric such as cosine similarity. To measure how well any particular single-feature outlier score predicts a specific label, we treat the set of outlier scores as a detector over the collection of labeled user-days and compute the AUC (Area Under the ROC Curve). This results in a measure of how well each feature distinguishes user-days with and without the label.
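A sketch of this per-feature, per-label evaluation (the 100-feature by 12-label shapes follow the paper; the data here are random stand-ins):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical shapes: 131 labeled user-days, 100 single-feature
# outlier scores, 12 binary analyst labels (as in Table I).
rng = np.random.default_rng(2)
features = rng.random((131, 100))          # single-feature outlier scores
labels = rng.integers(0, 2, (131, 12))     # ground-truth label vectors

# AUC of each feature, treated as a detector, at predicting each label.
auc = np.full((100, 12), np.nan)
for i in range(100):
    for j in range(12):
        if labels[:, j].min() != labels[:, j].max():  # need both classes
            auc[i, j] = roc_auc_score(labels[:, j], features[:, i])
# auc can now be rendered as a color map like Figure 6.
```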

The results of this experiment are presented in Figure 6 as a color map. Rows are individual features, but we have labeled groups of features that address aspects of various types of computer usage. Columns are the analyst's labels. We can see a number of areas of the grid where features involving particular activities (e.g., file access of various types) predict labels involving the same activities well. However, there are other places which surprise us. For example, several file activity features appear to predict Unusual Printer Activity. This may be a result of the way the sensor measuring file activities (on network drives) picks up movement of print jobs. A more puzzling case is where Unusual Email Send/Received labels are predicted by ratios derived from web activity, such as the ratio of uploads to distinct URLs. This may be due to the fact that features are derived over an entire day, and users in the test and training set tend to perform both types of activity.

Figure 6. Performance (measured by AUC) of specific features (vertical axis) at predicting each label (horizontal axis). Since features and labels are derived from a full day of user activity, correlation between several activities can produce non-intuitive predictions, such as printer features predicting file activity or web features predicting excessive email attachments.

To allow direct comparison of our feature-based explanations (both single-feature and multi-feature drop-out) to the ground truth labels, we learn a transformation from feature space to label space. This is done by deriving the correlation matrix from a training set of user-days. A cell of this matrix, M(i, j), contains the Pearson correlation coefficient of all values of feature i (in [0, 1]) with all values of label j (0 or 1). Multiplying a feature vector by this matrix produces a predicted label vector, which we then compare to the ground truth for that user-day. Figure 7 displays the correlation coefficients of a matrix derived from single-feature outlier scores over all labeled user-days. We avoid over-fitting via repeated random sub-sampling validation (RRSSV), in which we select some user-days to test and derive a correlation matrix from the remainder. We compute two metrics, cosine similarity and Euclidean distance, and find the average of each metric over all user-days in the test sample.

Figure 7. Transformation Matrix (using Pearson Correlation) of Features to Labels. The matrix shown was derived from 131 labeled user-days. The fact that the matrix is relatively sparse, with many low-correlation cells, suggests that our feature set is relatively well aligned with the labels. Columns with no strong correlations point out areas where we need additional features.
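A compact sketch of the feature-to-label transformation and one RRSSV round (dimensions taken from the paper; random stand-in data; a small epsilon guards against constant columns):

```python
import numpy as np

def feature_to_label_matrix(F, L):
    """Pearson correlation M(i, j) between feature i and label j.

    F: (n_days, n_features) outlier scores in [0, 1]
    L: (n_days, n_labels) binary ground-truth labels
    """
    Fc = F - F.mean(axis=0)
    Lc = L - L.mean(axis=0)
    denom = np.outer(Fc.std(axis=0), Lc.std(axis=0)) * len(F) + 1e-12
    return (Fc.T @ Lc) / denom

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# One round of repeated random sub-sampling validation (RRSSV):
rng = np.random.default_rng(3)
F = rng.random((131, 100))
L = rng.integers(0, 2, (131, 12)).astype(float)
test = rng.choice(131, size=20, replace=False)
train = np.setdiff1d(np.arange(131), test)
M = feature_to_label_matrix(F[train], L[train])
pred = F[test] @ M                      # predicted label vectors
sims = [cosine(p, l) for p, l in zip(pred, L[test])]
print(np.mean(sims))                    # compare with cos(41 deg) ~ 0.75
```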

D. Results and Analysis

The chart in Figure 8 shows the results of running 100 iterations where 20 cases are used for testing and 111 to derive the matrix. We see that, using a correlation matrix learned from relatively few labeled cases, we can derive analyst's labels that are similar to ground truth by cos(41°).

Figure 8. Cross-Validation Results of Label Vectors Derived from Features (error bars show 95% confidence intervals).

We also tested explanation weights generated at random and found that they performed nearly as well as single-feature explanations. This is likely due to the ability of the learned matrix to capture a model of prior likelihoods of label occurrence, especially over a small sample size.

Finally, the drop-out explanations derived from our IForest anomaly detector perform roughly as well as the single-feature explanations, whether using a matrix learned from the single-feature explanations or from the IForest explanation weights themselves, yielding similarity scores of cos(45°) and cos(39°), respectively. We would have expected drop-out explanations to do better. Single-feature explanations are independent of one another, and their score, or explanation weight, is a comparison to other instances of the same feature from other users. In contrast, the drop-out weights depend on the entire anomaly score, which is derived from the full feature vectors. One possible explanation is inter-feature correlation. In the case of two essential features that are highly correlated with one another, neither would receive a high drop-out score, since the method tests them individually. This suggests a path to improving the drop-out methods by learning inter-feature correlations and testing entire groups for sensitivity.

IV. TEMPORAL AGGREGATION OF ANOMALIES

A. Background and Introduction

PRODIGAL scores user-days; however, we do not systematically apply the user-day scores to find users who repeatedly display the most unusual behavior. For example, ranking user-days does not identify users who had multiple high-scoring days in a time period, whereas an analyst, visually reviewing the output from PRODIGAL's ensemble, would likely recognize patterns (e.g., a user who exhibits anomalous behavior on consecutive days, or a week apart on the same day). Our goals in temporal aggregation experimentation were to develop a detector, D, that (1) used output from the ensemble (user ID, rank, and day) to find the most unusual users in a time period and (2) could serve as another detector in the PRODIGAL system.




Our experiments differ from research focused on temporal aggregation techniques in the context of time series analysis. Traditionally, this research has focused on topics in economics and finance such as modeling interest and exchange rates [11], [12], [13], [14], [15], [16]. Recent research has extended previous work on temporal aggregation in time series analysis to agronomy and meteorology [17], [18] and some in the social sciences (e.g., traffic patterns) [19], [20].

B. Methodology

1) Designing the temporal aggregation model: The PRODIGAL system consists of over one hundred features and detectors, whose output is combined into a single ensemble detector score. We used the date and rank position of these scores in a family of temporal aggregation detectors parametrized by rank cutoff (τ1) and the number of days (τ2) that a user has at a given rank cutoff. Table II below describes how we specified these parameters.

Table II. PARAMETERS USED FOR TEMPORAL AGGREGATION.

Name  Definition                               Possible values
τ1    The rank of a user-day score,            The number of user-day ranks in the top 5, 10,
      interpreted as a cutoff point            20, 50, 100, 200, 500, 1000, 5000, and 10000;
                                               10 values in total
τ2    The count of the number of user-days     The number of days in the time period that a
      a specific user has at rank r            user has at or higher than a given rank cutoff
                                               point; for a month, 1-31

We selected several values of τ1 that correspond to what analysts expect to see in operational environments. Detector parameter τ2, the number of times that a user has a day at a specific rank, covers the time period of our operational surveillance. Thus, in a month, there are between 280 and 310 possible detectors.
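A quick enumeration reproduces that count (the τ1 values are from Table II; month lengths 28-31 give the range):

```python
tau1_values = [5, 10, 20, 50, 100, 200, 500, 1000, 5000, 10000]
for days_in_month in (28, 29, 30, 31):
    tau2_values = range(1, days_in_month + 1)  # tau2 = 1 .. days in month
    print(days_in_month, len(tau1_values) * len(tau2_values))
# 28 -> 280, 29 -> 290, 30 -> 300, 31 -> 310: between 280 and 310 detectors
```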

2) Specifying the detector: Figure 9 depicts our temporal aggregation model development methodology. For each month, we (1) obtain the count of all distinct user IDs (including RT users), ranks, and days from the ensemble detector. Using those inputs, we (2) find the count of all users at each value of τ1 and (3) the count of days for all users at each rank cutoff point τ2. We (4) form combinations of each parameter and develop the detectors (D) for the period of analysis; examples of D include: Top 5, 1 Day; Top 10, 3 Days; and Top 50, 4 Days. We denote a detector with thresholds τ1 and τ2 as D(τ1, τ2).

Figure 9. Our temporal aggregation model development methodology.

3) Data: We used 21 months of test data (approximately 165,000 total user-day scores per month) from September 2012 to July 2014 to populate the model. We aggregated all user behavior for the month and did not distinguish between a Red Team user's days with and without inserted events. In an operational context, an analyst could discover even low-signal malicious behavior by starting with a higher-ranked day or a pattern of lower-ranked days. Ensemble ranks by user-day were the inputs for temporal aggregation, and the Red Team's answer key allows calculation of lift.

4) Metrics and evaluation: We used lift as the value of the detector. Lift characterizes the improvement offered by a classifier over random choice and is an appropriate method to apply in our temporal aggregation. As lift measures the amount of data enrichment offered by a classifier, it enables us to assess the improvement in detecting malicious insiders by looking at focused subsets (e.g., the number of users who have a rank at or above 50 three days in the month) of the overall population. In our experiments, we define lift at thresholds τ1 and τ2 as

    L(τ1, τ2) = [nR(τ1, τ2) / n(τ1, τ2)] / [NR / N]    (1)

where
• nR(τ1, τ2) is the number of RT users at D(τ1, τ2),
• n(τ1, τ2) is the total number of users at D(τ1, τ2),
• NR is the total number of RT users in the data set, and
• N is the total number of all users.
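A sketch of D(τ1, τ2) and Eq. (1) over one month of ensemble ranks follows (pandas assumed; the column names and answer-key format are our own illustration):

```python
import pandas as pd

def lift(ranks_df, rt_users, tau1, tau2):
    """Lift L(tau1, tau2) for detector D(tau1, tau2), per Eq. (1).

    ranks_df:  columns ['user', 'day', 'rank'] of ensemble user-day ranks
    rt_users:  set of Red Team user IDs (the answer key)
    """
    # Days each user spent at or above the rank cutoff tau1 ...
    days_at_rank = (ranks_df[ranks_df['rank'] <= tau1]
                    .groupby('user')['day'].nunique())
    # ... and the users flagged by D(tau1, tau2): at least tau2 such days.
    flagged = set(days_at_rank[days_at_rank >= tau2].index)

    N = ranks_df['user'].nunique()        # all users
    NR = len(rt_users)                    # all RT users
    n = len(flagged)                      # users at D
    nR = len(flagged & rt_users)          # RT users at D
    return (nR / n) / (NR / N) if n else float('nan')

# Hypothetical usage: lift of the "Top 10, 4 Days" detector for a month.
# print(lift(month_ranks, answer_key_users, tau1=10, tau2=4))
```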

We evaluated the temporal aggregation detectors' performance in two ways: (1) average lift across all months and (2) average lift by specific Red Team scenario. In the first approach, we calculated the lift by data month (across multiple and different scenarios) and averaged the lift of each classifier across all months in the set (i.e., the 21 months from September 2012 through July 2014). In the second approach, we calculated lift by each scenario type, averaging the lift of each classifier across scenario instances. There are 36 scenarios and 74 distinct data sets in the data range. For example, we averaged lift by classifier for the five instances of the Snowed In scenario, spanning multiple data months (July and October 2013 and July 2014).

5) Experiment results: Table III shows the final results of the model across all months. Table IV shows the top ten most frequently occurring temporal aggregation classifiers across all months.

Table III. FINAL RESULTS OF THE TEMPORAL AGGREGATION MODEL ACROSS ALL MONTHS.

Days    Top5   Top10   Top20   Top50  Top100  Top200  Top500 Top1000 Top5000 Top10000
 31                                                                                 0
 30                                                                 0       0       0
 29                                                 0       0       0       0       0
 28                                         0       0       0       0       0       0
 27                                         0       0       0       0       0       0
 26                                         0       0       0       0       0       0
 25                                         0       0       0       0       0       0
 24                                         0       0       0       0       0       0
 23                                         0       0       0       0    2.03    3.03
 22                                         0       0       0    4.39    1.72    1.61
 21                                 0       0       0       0    3.99    1.62    1.97
 20                                 0       0       0       0    4.80    1.31    1.45
 19                                 0       0       0       0    2.69    1.00    1.85
 18                                 0       0       0       0    1.88    0.88    3.58
 17                                 0       0       0       0    1.57    0.72    2.87
 16                                 0       0       0       0    1.31    1.52    2.54
 15                                 0       0       0       0    1.16    1.42    2.46
 14                                 0       0       0       0    0.89    3.02    2.98
 13                                 0       0       0    7.39    4.00    2.62    2.45
 12                                 0       0       0    5.21    5.20    3.43    2.23
 11                                 0   20.83   11.46    6.83    4.75    3.10    2.05
 10                             31.24   17.62    8.49    6.83    3.58    2.87    1.86
  9                             20.83   15.80    7.05    6.55    3.14    2.53    2.06
  8                      0      20.83   11.46    7.05    8.07    2.51    2.83    1.99
  7                  42.96      18.33    8.33    6.64    5.73    1.89    2.97    2.01
  6                  42.96      13.89    6.94   12.91    4.03    1.46    2.71    1.95
  5            0     34.37       8.81    6.64   11.90    3.05    1.94    2.88    2.35
  4       114.56     20.83       7.64    6.64    7.78    2.03    3.08    2.75    2.29
  3  91.64  29.88    14.32       6.64    4.98    3.60    1.57    2.97    2.70    2.12
  2  37.15  12.27    10.11       5.05    4.12    1.86    2.55    4.30    2.34    1.77
  1   6.84   4.10     4.24       2.02    3.44    1.92    3.23    3.10    1.88    1.56



Table IV. THE TOP TEN MOST FREQUENTLY OCCURRING TEMPORAL AGGREGATION CLASSIFIERS ACROSS ALL MONTHS.

Rank  Lift    Detector           Rank  Lift   Detector
1     114.50  Top 10, 4 Days     6     34.37  Top 20, 5 Days
2      91.64  Top 5, 3 Days      7     31.24  Top 50, 10 Days
3      42.96  Top 20, 7 Days     8     29.84  Top 10, 3 Days
4      42.96  Top 20, 6 Days     9     20.83  Top 50, 9 Days
5      31.15  Top 5, 2 Days      10    20.83  Top 50, 8 Days

A review of the most frequently occurring temporal aggregation classifiers suggests that analysts focus on users who are often highly unusual within a given time period. Table V shows the performance of the temporal aggregation detector by scenario, with lift averaged across the distinct instances within a scenario.

Table V. THE PERFORMANCE OF THE TEMPORAL AGGREGATION DETECTOR BY SCENARIO, WITH LIFT AVERAGED ACROSS THE DISTINCT INSTANCES WITHIN A SCENARIO.

Scenario name                 Count of scenario   Average lift of D across
                              instances           scenario instances
Snowed In                     5                   1374.67
Anomalous Encryption          2                    147.84
Exfil. Prior to Termination   2                    146.84
Selling Login Credentials     1                    109.44
Czech Mate                    1                     51.47
Manning Up Redux              1                     36.04
Byte Me                       2                     30.21
Breaking the Stovepipe        3                     25.66
Credit Czech                  1                     23.89
Blinded Me With Science       1                     23.39
Survivor's Burden             3                     21.52
Job Hunter                    1                     21.44
What's the Big Deal           1                     16.03
The Big Goodbye               1                     12.47
Insider Startup               7                     11.49
Bona Fides                    2                      9.93
Conspiracy Theory             2                      9.57
Bollywood Breakdown           1                      7.92
Layoff Logic Bomb             2                      7.72
Parting Shot                  1                      6.91
Masquerading 2                2                      6.49
Circumventing Sureview        2                      6.05
Strategic Tee Time            1                      4.75
Indecent RFP 2                2                      4.08
Indecent RFP                  1                      4.08
Passed Over                   4                      3.76
Exfil...Using Screenshots     3                      2.88
Gift Card Bonanza             1                      2.82
Byte Me Middleman             2                      2.63
Naughty by Proxy              4                      2.45
Outsourcer's Apprentice       3                      2.45
Hiding Undue Affluence        2                      2.22
Parting Shot 2                1                      2.22
From Belarus With Love        2                      2.21
Manning Up                    2                      1.99
Panic Attack                  2                      0.86

Table VI lists the temporal aggregation detectors that produced the highest and lowest lift values by scenario and relates the number of Red Team and all users for each detector and the number of all users in the month of the best detector.

Table VI. THE TEMPORAL AGGREGATION DETECTORS THAT PRODUCED THE HIGHEST AND LOWEST LIFT VALUES BY SCENARIO, THE NUMBER OF RED TEAM AND ALL USERS FOR EACH DETECTOR, AND THE NUMBER OF ALL USERS IN THE MONTH OF THE BEST DETECTOR.

Scenario name                       Average lift,   Best detector (D)    Num. RT     Num. all    Num. all users
                                    best detector   for this scenario    users at D  users at D  for the month
Snowed In                           1347.67         Top 5, 3 days        1           1           4124
Anomalous Encryption                 147.84         Top 1000, 7 days     1           19          5618
Exfiltration Prior to Termination    146.84         Top 20, 1 day        1           13          4372
Selling Login Credentials            109.44         Top 10000, 23 days   1           4           5691
Czech Mate                            51.47         Top 500, 2 days      1           83          4272
Hiding Undue Affluence                 2.22         Top 5000, 2 days     1           867         5721
Parting Shot 2 - Deadly Aim            2.22         Top 5000, 2 days     1           635         4230
From Belarus With Love                 2.21         Top 10000, 5 days    1           723         4392
Manning Up                             1.99         Top 10000, 4 days    1           959         5729
Panic Attack                           0.86         Top 10000, 5 days    1           700         4286

C. Discussion

We have implemented a temporal aggregation filter in our analyst interface (AI) and intend to present highly anomalous users identified by the temporal aggregation to counter-intelligence analysts from the data provider and determine the number of those users whose actions are of interest. Also, in reviewing our results in the data laboratory, we noticed that a high percentage of frequently anomalous users (e.g., users who have multiple days in the top 20 user-days) appear to perform tasks associated with job roles and functions categorized as high-risk for insider threat (e.g., system administrators).

V. CONCLUSIONS AND ONGOING RESEARCH

Our results to date suggest that analysts will find explanations useful to discriminate between malicious and legitimate activities with similarly high anomaly detection scores, and that useful explanations can be generated from vectors of single-feature outlier scores. We are pursuing several lines of research involving improvements to explanation generation that take into account inter-feature dependence, as well as the use of explanations generated by individual anomaly detectors to guide the ensemble process that produces PRODIGAL's overall scores [9], [21].

Our experiments with temporal aggregation also suggest simple approaches to detect users who are likely to warrant further investigation by finding users who frequently appear towards the top of the anomaly detection score list on multiple days. As we configure PRODIGAL to operate over various time periods, we will refine these approaches to fit the operational requirements of specific insider threat surveillance enterprises.




ACKNOWLEDGMENT

The authors wish to thank the researchers and engineers of the PRODIGAL team. Funding was provided by the U.S. Army Research Office (ARO) and the Defense Advanced Research Projects Agency (DARPA) under Contract Number W911NF-11-C-0088. The content of the information in this document does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

REFERENCES

[1] T. E. Senator, H. G. Goldberg, A. Memory, W. T. Young, B. Rees, R. Pierce, D. Huang, M. Reardon, D. A. Bader, E. Chow et al., "Detecting insider threats in a real corporate database of computer usage activity," in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013, pp. 1393-1401.

[2] T. E. Senator, "Multi-stage classification," in Data Mining, Fifth IEEE International Conference on. IEEE, 2005, 8 pp.

[3] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Computing Surveys, vol. 41, no. 3, pp. 15:1-15:58, 2009.

[4] A. Azaria, A. Richardson, S. Kraus, and V. S. Subrahmanian, "Behavioral analysis of insider threat: A survey and bootstrapped prediction in imbalanced data," IEEE Transactions on Computational Social Systems, vol. 1, no. 2, pp. 135-153, 2014.

[5] W. T. Young, A. Memory, H. G. Goldberg, and T. E. Senator, "Detecting unknown insider threat scenarios," in Security and Privacy Workshops (SPW), 2014 IEEE. IEEE, 2014, pp. 277-288.

[6] D. Martens and F. Provost, "Explaining data-driven document classifications," MIS Quarterly, vol. 38, no. 1, pp. 73-99, 2014.

[7] K. L. Wagstaff, N. L. Lanza, D. R. Thompson, T. G. Dietterich, and M. S. Gilmore, "Guiding scientific discovery with explanations using DEMUD," in AAAI, 2013.

[8] F. T. Liu, K. M. Ting, and Z.-H. Zhou, "Isolation forest," in Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on. IEEE, 2008, pp. 413-422.

[9] M. A. Siddiqui, A. Fern, T. G. Dietterich, and W.-K. Wong, "Sequential feature explanations for anomaly detection," in KDD15 Digital Proceedings, Outlier Definition, Detection, and Description (ODDx3) Workshop [to be published 2015].

[10] X. H. Dang, I. Assent, R. T. Ng, A. Zimek, and E. Schubert, "Discriminative features for identifying and interpreting outliers," in Data Engineering (ICDE), 2014 IEEE 30th International Conference on. IEEE, 2014, pp. 88-99.

[11] F. C. Drost and T. E. Nijman, "Temporal aggregation of GARCH processes," Econometrica: Journal of the Econometric Society, pp. 909-927, 1993.

[12] D. Geltner, "Temporal aggregation in real estate return indices," Real Estate Economics, vol. 21, no. 2, pp. 141-166, 1993.

[13] M. Marcellino, "Some consequences of temporal aggregation in empirical analysis," Journal of Business & Economic Statistics, vol. 17, no. 1, pp. 129-136, 1999.

[14] R. J. Rossana and J. J. Seater, "Temporal aggregation and economic time series," Journal of Business & Economic Statistics, vol. 13, no. 4, pp. 441-451, 1995.

[15] W. W.-S. Wei, Time Series Analysis. Addison-Wesley, 1994.

[16] A. A. Weiss, "Systematic sampling and temporal aggregation in time series models," Journal of Econometrics, vol. 26, no. 3, pp. 271-281, 1984.

[17] T. A. Buishand, M. V. Shabalova, and T. Brandsma, "On the choice of the temporal aggregation level for statistical downscaling of precipitation," Journal of Climate, vol. 17, no. 9, pp. 1816-1827, 2004.

[18] L. Van Bussel, C. Muller, H. Van Keulen, F. Ewert, and P. Leffelaar, "The effect of temporal aggregation of weather input data on crop growth models results," Agricultural and Forest Meteorology, vol. 151, no. 5, pp. 607-619, 2011.

[19] E. Vlahogianni and M. Karlaftis, "Temporal aggregation in traffic data: implications for statistical characteristics and model choice," Transportation Letters, vol. 3, no. 1, pp. 37-49, 2011.

[20] T. Usman, L. Fu, and L. Miranda-Moreno, "Accident prediction models for winter road safety: Does temporal aggregation of data matter?" Transportation Research Record: Journal of the Transportation Research Board, no. 2237, pp. 144-151, 2011.

[21] A. Memory and T. Senator, "Towards robust anomaly detection ensembles using explanations," in KDD15 Digital Proceedings, Outlier Definition, Detection, and Description (ODDx3) Workshop [to be published 2015].
