User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Guido Cervone, Ken Kaufman, Ryszard Michalski
Machine Learning and Inference Laboratory, School of Computational Sciences
George Mason University, Fairfax, VA, USA
{cervone, kaufman, michalski}@gmu.edu
http://www.mli.gmu.edu


Page 1: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Guido Cervone, Ken Kaufman, Ryszard Michalski

Machine Learning and Inference Laboratory, School of Computational Sciences

George Mason University, Fairfax, VA, USA

{cervone, kaufman, michalski}@gmu.edu

http://www.mli.gmu.edu

Page 2: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Research Objectives

The main objectives of this research are:

(1) To develop a new methodology for user modeling, called LUS (Learning User Style)

(2) To test and evaluate LUS on datasets consisting of real user activity

(3) To implement an experimental computer intrusion detection system based on the LUS methodology

Page 3: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Main Features of LUS

(1) User models are created automatically through a process of symbolic inductive learning from training data sets characterizing users’ interaction with computers

(2) Models are in the form of symbolic descriptions based on attributional calculus, a representation system that combines elements of propositional logic, first-order predicate logic, and multiple-valued logic

(3) Generated user models are easy to interpret by human experts, and can thus be modified or adjusted manually

(4) Generated user models are evaluated automatically on testing data sets using an episode classifier

Page 4: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Terminology

An event is a description of an entity (e.g., a user activity) at a given time or during a given time period; here, it is a vector of attribute values characterizing a user's use of the computer at a specific time.

A session is a sequence of events characterizing a user's interaction with the computer from logon to logoff.

An episode is a sequence of user states extracted from one or more sessions and used for training or for testing/execution of user models; it may contain consecutive states or selected states.

In the training phase (during which user models are learned) it is generally desirable to use long episodes, as this helps to generate more accurate and complete user models. In the testing (or execution) phase it is desirable to be able to use short episodes, so that a legitimate or illegitimate user can be identified from as little information as possible.
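As a rough illustration of this terminology, the sketch below (Python; the class and attribute names are hypothetical and not part of LUS) shows one way events, sessions, and episodes could be represented:

from dataclasses import dataclass
from typing import List

@dataclass
class Event:
    # One observation of the user's activity at a given time,
    # described by a vector of attribute values (here just 'mode').
    timestamp: float
    mode: str                 # e.g. "compiler", "print", "web"

@dataclass
class Session:
    # All events of one user from logon to logoff.
    user_id: int
    events: List[Event]

# An episode: a sequence of user states (here, modes) taken from one or
# more sessions and used for training or for testing user models.
def episode_from_session(session: Session) -> List[str]:
    # Simplest case: keep every state of the session, in order.
    return [e.mode for e in session.events]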

Page 5: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Approach

The system polls active processes every half-second and logs information about the processes and the users responsible for them

Data extracted from the logs takes the form of vectors of values of nominal, temporal and structured attributes

Initial experiments concentrated on one attribute, mode, a derived attribute based on the class of process that was running (e.g., compiler)

Data from successive records are combined into n-grams, e.g., <compiler, print, web, print>

Sets of n-grams comprising an episode are passed to the AQ20 learner
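As a rough illustration of the n-gram construction above, a minimal sketch (Python; the function name and example mode values are only illustrative):

def make_ngrams(modes, n=4):
    # Combine successive mode values into overlapping n-grams,
    # e.g. ["compiler", "print", "web", "print"] -> one 4-gram.
    return [tuple(modes[i:i + n]) for i in range(len(modes) - n + 1)]

modes = ["compiler", "print", "web", "print", "web"]
for gram in make_ngrams(modes, n=4):
    print(gram)
# ('compiler', 'print', 'web', 'print')
# ('print', 'web', 'print', 'web')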

Page 6: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

AQ20 Algorithm Application

Each training n-gram is used as an example of the class representing the user whose activity it reflects.

To learn a user’s profile, AQ20 divides the n-grams into positive examples (examples representing the user whose profile is being learned) and negative examples (examples representing other users’ activities)

AQ20 searches for maximal conjunctive rules that cover positive examples, but not negative ones, and selects the best ones according to user-specified criteria

The rule "[User = 1] if [mode1 = compiler] and [mode2 = print] and [mode4 = print]" will be returned in the form: [User = 1] <= <compiler, print, *, print>

Rules and conditions may be annotated with weights (e.g., p, n, u)
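The sketch below (Python; a simplified illustration, not the AQ20 algorithm itself) shows how such an n-gram pattern can be interpreted: each position is either a set of admissible modes or a wildcard "*", and an n-gram matches only if every non-wildcard position is satisfied.

def matches(ngram, pattern):
    # An n-gram satisfies a pattern if every non-wildcard position of the
    # pattern admits the mode found at that position of the n-gram.
    return all(cond == "*" or mode in cond
               for mode, cond in zip(ngram, pattern))

# [User = 1] <= <compiler, print, *, print> written as a pattern:
rule_user1 = ({"compiler"}, {"print"}, "*", {"print"})

print(matches(("compiler", "print", "web", "print"), rule_user1))  # True
print(matches(("compiler", "web", "web", "print"), rule_user1))    # False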

Page 7: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

EPICn: Episode Classification and User Identification by Matching Episodes with n-gram Patterns

EPICn matches episodes with n-gram-based patterns of different users’ behavior and computes a degree of match for each user

EPIC employs the ATEST program for matching individual events with patterns

The results from ATEST for each n-gram in the episode are aggregated to give overall episode scores for each class (profile)

EPIC allows flexible classification: all classes whose scores are both above the episode threshold and within the episode tolerance of the best achieved score are returned as classifications
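A minimal sketch of this aggregation and threshold/tolerance decision (Python; a simple average is used as the aggregation function here, whereas the actual EPICn/ATEST scoring is more elaborate):

def classify_episode(ngram_scores, threshold=0.5, tolerance=0.05):
    # ngram_scores: one dict per n-gram in the episode, mapping each
    # class (user profile) to its degree of match for that n-gram.
    classes = ngram_scores[0].keys()
    episode_score = {c: sum(s[c] for s in ngram_scores) / len(ngram_scores)
                     for c in classes}
    best = max(episode_score.values())
    # Return every class above the episode threshold and within the
    # episode tolerance of the best score.
    return [c for c, v in episode_score.items()
            if v >= threshold and best - v <= tolerance]

scores = [{"user0": 0.9, "user1": 0.4, "user2": 0.85},
          {"user0": 0.8, "user1": 0.5, "user2": 0.80}]
print(classify_episode(scores))  # ['user0', 'user2']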

Page 8: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Experiments

Two sets of preliminary experiments were performed for different training and testing data sizes:

Small: first 7 users (SD)

Large: all 23 users (LD)

Rules were learned with AQ19 and AQ20 using different control parameters (TF and PD modes, and three different LEFs each for SD and LD)

EPICn was used to test the learned hypotheses.

Page 9: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Data Used in the Experiments

24 users, for a total of 4,808,024 4-grams. Each user has a different number of sessions, each varying in length. The data contains many repetitions. This is by far the largest dataset AQ20 has been applied to.

Page 10: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Distribution of the Sessions for Each User

[Chart: number of sessions for each user; x-axis: users 0-23; y-axis: sessions, 0-100]

Page 11: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

4-grams for Each User

[Chart: total number of 4-grams for each user; x-axis: users 0-23; y-axis: 4-grams, 0-1,600,000]

[Chart: average number of 4-grams for each user; x-axis: users 0-23; y-axis: 4-grams, 0-30,000]

Page 12: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Experiment 1: A Sample of Results from AQ20 (7 Users)

[user = 0] <{explorer,web,office,sql,rundll32,system,time,install},
 {explorer,web,logon,rundll32,system,time,install},
 {explorer,web,office,logon,printing,rundll32,system,time,install},
 {web,office,rundll32,system,time,install,multimedia}>
 : pd=171, nd=52, ud=27, pt=2721, nt=710, ut=160, qd=0.372459, qt=0.60304

[user = 1] <{netscape,msie,telnet,explorer,web,acrobat,logon,system,welcome,help},
 {netscape,msie,telnet,explorer,web,acrobat,logon,rundll32,welcome,help},
 {netscape,msie,telnet,explorer,web,acrobat,logon,printing,welcome,dos,help},
 {netscape,msie,telnet,explorer,web,acrobat,logon,welcome,dos,help}>
 : pd=260, nd=54, ud=28, pt=20713, nt=132, ut=2019, qd=0.610064, qt=0.986564

...

Page 13: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Distribution of Positive and Negative Events in the Training Set for Each User

(80% of the total data; the remaining 20% constituted the testing dataset)

User   Distinct +   Distinct -   Total +    Total -
0          345          5236        3573     616828
1          348          5214       20858     671154
2          784          4497       19477     570508
3          226          5253        9351     627480
4         3006          2012       92626     545656
5          142          5537       59524     647063
6          865          4413      506532      84895

Page 14: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Predictive Accuracy of User Models Generated Using PD Mode and LEF1 (MaxNewPositives,0; MinNumSelectors,0)

Confidence matrix for 100% of the training data and 100% of the testing data, PD mode, default LEF (entries are degrees of match):

         User 0   User 1   User 2   User 3   User 4   User 5   User 6
User 0    0.17     0.11     0.13     0.12     0.13     0.16     0.11
User 1    0.14     0.2      0.08     0.11     0.1      0.09     0.11
User 2    0.16     0.14     0.24     0.16     0.16     0.15     0.14
User 3    0.12     0.1      0.12     0.18     0.14     0.14     0.12
User 4    0.15     0.17     0.23     0.17     0.23     0.15     0.19
User 5    0.13     0.13     0.09     0.12     0.08     0.17     0.1
User 6    0.13     0.16     0.12     0.14     0.16     0.14     0.24

Page 15: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Predictive Accuracy of User Models Generated Using PD Mode and LEF2 (MaxQ,0; MaxNewPositives,0; MinNumSelectors,0)

Confidence matrix for 100% of the training data and 100% of the testing data, PD mode, MaxQ (entries are degrees of match):

         User 0   User 1   User 2   User 3   User 4   User 5   User 6
User 0    0.18     0.11     0.13     0.12     0.13     0.16     0.11
User 1    0.14     0.2      0.08     0.11     0.1      0.1      0.11
User 2    0.17     0.14     0.24     0.16     0.16     0.15     0.14
User 3    0.11     0.09     0.11     0.19     0.14     0.14     0.12
User 4    0.15     0.17     0.23     0.17     0.23     0.14     0.19
User 5    0.13     0.12     0.09     0.1      0.08     0.18     0.09
User 6    0.13     0.16     0.12     0.14     0.16     0.13     0.24

Page 16: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Predictive Accuracy of User Models Generated Using PD Mode and LEF3 (MaxTotQ,0; MaxNewPositives,0; MinNumSelectors,0)

Confidence matrix for 100% of the training data and 100% of the testing data, PD mode, MaxTotalQ (entries are degrees of match):

         User 0   User 1   User 2   User 3   User 4   User 5   User 6
User 0    0.18     0.11     0.13     0.12     0.13     0.16     0.11
User 1    0.14     0.2      0.09     0.11     0.1      0.1      0.11
User 2    0.17     0.14     0.24     0.16     0.16     0.15     0.14
User 3    0.12     0.09     0.11     0.19     0.14     0.14     0.12
User 4    0.15     0.17     0.23     0.17     0.23     0.14     0.19
User 5    0.12     0.12     0.09     0.1      0.07     0.18     0.09
User 6    0.13     0.16     0.12     0.14     0.16     0.13     0.24

Page 17: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Sample Results Using TF

Confidence matrix for 100% of the training data and 100% of the testing data, TF mode, default LEF (entries are degrees of match):

         User 0   User 1   User 2   User 3   User 4   User 5   User 6
User 0    0.17     0.11     0.13     0.12     0.13     0.15     0.11
User 1    0.15     0.2      0.09     0.11     0.1      0.09     0.11
User 2    0.16     0.14     0.24     0.16     0.16     0.15     0.14
User 3    0.12     0.1      0.12     0.2      0.13     0.15     0.12
User 4    0.15     0.17     0.22     0.18     0.24     0.14     0.19
User 5    0.12     0.12     0.09     0.1      0.08     0.19     0.09
User 6    0.13     0.16     0.12     0.14     0.16     0.13     0.24

Page 18: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Sample Rules for User 0 (PD mode, LEF1)

# -- This learning took:
# -- System time 10.45
# -- User time   10
# -- Number of stars generated = 46
# -- Number of rules for this class = 42
# -- Average number of rules kept from each stars = 1

# -- Size of the training events in the target class:          345
# -- Size of the training events in the other class(es):       5236
# -- Size of the total training events in the target class:    3573
# -- Size of the total training in the other class(es):        616828

[User = 0] <{mail,office,printing,rundll32,system,time,install},
 {web,rundll32,system,time,install},
 {explorer,web,mail,office,logon,rundll32,system,install,multimedia},
 {explorer,web,office,logon,sql,rundll32,system,help,install,multimedia}>
 : pd=149, nd=20, ud=22, pt=2490, nt=75, ut=37, qd=0.377406, qt=0.676398

<{explorer,web,office,sql,rundll32,system,time,install,multimedia},
 {explorer,web,office,logon,sql,rundll32,system,time,install},
 {web,office,logon,sql,printing,rundll32,system,time,install},
 {web,rundll32,system,time,install}>
 : pd=136, nd=30, ud=8, pt=2481, nt=1148, ut=14, qd=0.318267, qt=0.473443

<{explorer,web,rundll32,system,multimedia},
 {explorer,system,time,install},
 {explorer,rundll32,system,time,install,multimedia},
 {explorer,rundll32,system,time,install,multimedia}>
 : pd=107, nd=21, ud=32, pt=2453, nt=930, ut=474, qd=0.255909, qt=0.496713

Page 19: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Experiment 2

In this experiment hypotheses were generated to describe the behavior of all 24 users

The training set consisted of approximately 4 million 4-grams

The testing set consisted of approximately 1 million 4-grams

Page 20: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Description of Experiment 2

Experiments were performed using 20% and 100% of the training set (the training set itself constituting 80% of the available sessions)

Experiments were performed in PD and TF modes

Three different LEFs were used:

LEF1 (TF mode): <MaxNewPositives,0; MinNumSelectors,0>

LEF2 (TF mode): <MaxEstimatedPositives,0; MinEstimatedNegatives,0; MaxNewPositives,0; MinNumSelectors,0>

LEF3 (PD mode): <MaxQ,0; MaxNewPositives,0; MinNumSelectors,0>

Page 21: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Experiment 2

When combining all of a user's testing data into a single long episode, out of the 24 users:

20 users were classified correctly

3 users could not be classified because the degrees of match of the best-scoring users were insufficiently separated

1 user was classified incorrectly

Page 22: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Users 0-2

[Chart: predictive accuracy for 100% testing, 100% training; users 0-2; y-axis: degree of match, 0-0.08]

Page 23: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Users 3-5

[Chart: predictive accuracy for 100% testing, 100% training; users 3-5; y-axis: degree of match, 0-0.07]

Page 24: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Users 6-8

[Chart: predictive accuracy for 100% testing, 100% training; users 6-8; y-axis: degree of match, 0-0.08]

Page 25: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Users 9-11

[Chart: predictive accuracy for 100% testing, 100% training; users 9-11; y-axis: degree of match, 0-0.07]

Page 26: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Users 12-14

[Chart: predictive accuracy for 100% testing, 100% training; users 12-14; y-axis: degree of match, 0-0.08]

Page 27: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Users 15-17

[Chart: predictive accuracy for 100% testing, 100% training; users 15-17; y-axis: degree of match, 0-0.07]

Page 28: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Users 18-20

[Chart: predictive accuracy for 100% testing, 100% training; users 18-20; y-axis: degree of match, 0-0.08]

Page 29: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Users 21-23

[Chart: predictive accuracy for 100% testing, 100% training; users 21-23; y-axis: degree of match, 0-0.1]

Page 30: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Sample Rules for User 0

# -- This learning took:
# -- System time 767.15
# -- User time 768
# -- Number of stars generated = 57
# -- Number of rules for this class = 52
# -- Average number of rules kept from each stars = 1

# -- Size of the training events in the target class:          346
# -- Size of the training events in the other class(es):       71931
# -- Size of the total training events in the target class:    1826
# -- Size of the total training in the other class(es):        3750169

[user=0] <- <explorer,install,multimedia,system,time> <multimedia,system> <explorer,install,system> <explorer,install,multimedia,system> : pd=64,nd=31,ud=8,pt=916,nt=404,ut=11,qd=0.124322,qt=0.348035 # 18648

<- <explorer,install,office,rundll32,system,time> <multimedia,system> <install,multimedia,rundll32,system,time> <explorer,install,rundll32,system,time> : pd=68,nd=42,ud=9,pt=919,nt=73,ut=11,qd=0.121131,qt=0.466232 # 24747

<- <explorer,help,install,mail,multimedia,rundll32,system,time,web> <help,install,logon,mail,office,rundll32,system,time,web> <help,install,mail,office,printing,rundll32,system,time,web> <help,install,rundll32,system,time> : pd=140,nd=343,ud=41,pt=1316,nt=701,ut=66,qd=0.1159,qt=0.470102 # 5068

<- <install,office,printing,system> <install,rundll32,time> <install,multimedia,office,sql,system,web> <explorer,install,multimedia,rundll32,system,web> : pd=43,nd=4,ud=2,pt=397,nt=4,ut=2,qd=0.11365,qt=0.215245 # 7642

Page 31: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Best Rule for User 23

# -- This learning took:
# -- System time -39.9073
# -- User time 4256
# -- Number of stars generated = 658
# -- Number of rules for this class = 533
# -- Average number of rules kept from each stars = 1

# -- Size of the training events in the target class:          9712
# -- Size of the training events in the other class(es):       40602
# -- Size of the total training events in the target class:    1337548
# -- Size of the total training in the other class(es):        2063808

[user=23] <- <ControlPanel,activesync,id,mail,multimedia,netscape,network,spreadsheet,system,wordprocessing>
 <ControlPanel,activesync,explorer,id,logon,mail,msie,multimedia,netscape,network,printing,spreadsheet,web,wordprocessing>
 <ControlPanel,activesync,mail,multimedia,netscape,printing,spreadsheet,wordprocessing>
 <ControlPanel,activesync,mail,multimedia,netscape,spreadsheet,web,wordprocessing>
 : pd=4685, nd=878, ud=975, pt=1296647, nt=1166, ut=34254, qd=0.388046, qt=0.967985 # 3524022

Page 32: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Experiments with Smaller Test Episodes

In experiments with 150 session-sized testing episodes, some performed with traditional “best matching” and others with threshold-tolerance matching, identification accuracy was as follows:

Traditional ATEST (Rform) scoring, threshold-tolerance matching: 169 classifications, 75 correct, 84 incorrect

Traditional scoring, best-only matching: 71 (47.3%) correct

Simple scoring, threshold-tolerance matching: 165 classifications, 117 correct, 48 incorrect

Simple scoring, best-only matching: 112 (74.7%) correct

Page 33: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Prediction-Based Approach

In the prediction-based approach, events characterizing a user are pairs

<predecessor, successor>,

where:

predecessor is a sequence of lb states of the user (in the experiments, modes) that directly precede a given time instance t, and successor is a sequence of lf states of the user (in the experiments, modes) that occur immediately after t.

Parameters lb and lf, called look-back and look-forward respectively, are determined experimentally.
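A small sketch (Python; the function name is hypothetical) of how <predecessor, successor> pairs with look-back lb and look-forward lf could be extracted from a sequence of user states:

def prediction_pairs(states, lb=3, lf=1):
    # For each time instant t with enough context, pair the lb states
    # directly preceding t with the lf states that follow immediately after.
    pairs = []
    for t in range(lb, len(states) - lf + 1):
        pairs.append((tuple(states[t - lb:t]), tuple(states[t:t + lf])))
    return pairs

modes = ["web", "mail", "compiler", "print", "web"]
for predecessor, successor in prediction_pairs(modes, lb=3, lf=1):
    print(predecessor, "->", successor)
# ('web', 'mail', 'compiler') -> ('print',)
# ('mail', 'compiler', 'print') -> ('web',)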

Page 34: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

An Initial Small Experiment

Rules were learned using the decomposition model with look-backs of 1, 2, 3, and 4. The results provided by EPICp were as follows:

CONFUSION MATRIX

          Data-1   Data-2   Data-3
User 1:     374       86       66
User 2:     202      141      130
User 3:     176       97      557

Page 35: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Topics for Further Research

1. Comparative study of the n-gram-based methodology for currently available datasets using different control parameters

2. Study performance degradation on reduced session size

3. Annotate process tables with window information

4. Testing the ability to identify unknown users

5. Development and implementation of a prediction-based approach using a dedicated sequential pattern discovery program (SPARCum)

6. Employment of multivariate representation, e.g., <mode, process name, time>

7. Improving the representational space through constructive induction

8. Handling drift and shift of user models

9. Coping with incremental growth and change in the user population

Page 36: User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Conclusions

LUS methodology uses symbolic learning to generate user signatures

Unlike traditional classifiers, EPICn classifies based on episodes rather than individual events

Initial experiments have been promising, but several real world situations have yet to be addressed in full

Multistrategy approaches may lead to further performance improvement