predicting zero-day software vulnerabilities through data mining --second presentation


Post on 25-Feb-2016


Page 1: Predicting zero-day software vulnerabilities through data mining --Second Presentation

PREDICTING ZERO-DAY SOFTWARE VULNERABILITIES THROUGH DATA MINING -- SECOND PRESENTATION

Su Zhang

Page 2: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Outline

Quick Review
Data Source – NVD
Six Most Popular/Vulnerable Vendors For Our Experiments
Why The Six Vendors Are Chosen
Data Preprocessing
Functions Available For Our Approach
Statistical Results
Plan For Next Phase

Page 3: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Quick Review

Page 4: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Source Database – NVD

National Vulnerability Database
◦ U.S. government repository of standards-based vulnerability management data.
◦ Data included in each NVD entry:
  Published Date Time
  Vulnerable software's CPE Specification
◦ Derived data (source field: derived attribute):
  Published Date Time: Month
  Published Date Time: Day
  Two adjacent vulnerabilities' CPE diff (v1, v2): Version diff
  CPE Specification: Software Name
  Adjacent different Published Date Time: ttpv
  Adjacent different Published Date Time: ttnv

Page 5: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Six Most Vulnerable/Popular Vendors

Linux: 56925 instances
Sun: 24726 instances
Cisco: 20120 instances
Mozilla: 19965 instances
Microsoft: 16703 instances
Apple: 14809 instances

Page 6: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Why We Only Choose Instances Of Popular Vendors: Instances Table

[Bar chart "Instances Table": number of instances per vendor (axis 0 to 60000) for rest, Adobe, IBM, PHP, Apple, Microsoft, Mozilla, Cisco, Sun, Linux.]

Page 7: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Why We Only Choose Instances Of Popular Vendors: Vulnerability Table

[Bar chart "Vulnerability Table": number of vulnerabilities (Vul_Num, axis 0 to 2500) per vendor for rest, HP, Linux, Mozilla, Cisco, Oracle, IBM, Apple, Sun, Microsoft.]

Page 8: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Why We Only Choose Instances Of Popular Vendors

A huge number of nominal values (vendors and software) would cause a scalability issue.

The top six vendors account for 43.4% of all instances.

NVD contains too many vendors (10411).

The seventh most popular/vulnerable vendor has far fewer instances than the sixth.

Vendors are treated independently in our approach.

Page 9: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Data Preprocessing

NVD data: training/testing dataset
◦ Start from 2005, since the data before that looks unstable.
◦ Correct some obvious errors in NVD (e.g. "cpe:/o:linux:linux_kernel:390").

Attributes
◦ Published time: only the month and day are used.
◦ Version diff: a normalized difference between two versions.
◦ Vendor: removed.
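As a rough illustration of the attribute extraction described on this slide, the sketch below splits a CPE 2.2 URI into part, vendor, product and version, and keeps only the month and day of a publication timestamp. The names `Cpe`, `parse_cpe_uri` and `month_and_day` are hypothetical, and the `YYYY-MM-DD...` timestamp layout is an assumption; the slides do not show the actual preprocessing code.

```python
from typing import NamedTuple, Optional, Tuple


class Cpe(NamedTuple):
    part: str                # "a" application, "o" operating system, "h" hardware
    vendor: str
    product: str
    version: Optional[str]   # None when the URI stops before the version field


def parse_cpe_uri(uri: str) -> Cpe:
    """Split a CPE 2.2 URI (cpe:/part:vendor:product:version:...) into fields."""
    prefix = "cpe:/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a CPE 2.2 URI: {uri!r}")
    fields = uri[len(prefix):].split(":")
    if len(fields) < 3:
        raise ValueError(f"missing vendor or product: {uri!r}")
    version = fields[3] if len(fields) > 3 and fields[3] else None
    return Cpe(fields[0], fields[1], fields[2], version)


def month_and_day(published: str) -> Tuple[int, int]:
    """Keep only month and day; assumes the string starts with 'YYYY-MM-DD'."""
    _, month, day = published[:10].split("-")
    return int(month), int(day)


entry = parse_cpe_uri("cpe:/o:linux:linux_kernel:390")
print(entry.vendor, entry.product, entry.version)
print(month_and_day("2007-05-02T10:19:00"))
```

Parsing alone does not decide whether a version such as `390` is an error; that correction step would still need a rule or a manual check.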

Page 10: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Data Preprocessing (cont.)

Attributes
◦ "Group" vulnerabilities published on the same day, so that ttnv/ttpv are guaranteed to be non-zero.
◦ ttnv is the predicted attribute.

For each software
◦ Delete its first group of instances (no earlier vulnerability, so ttpv is undefined).
◦ Delete its last group of instances (no later vulnerability, so ttnv is undefined).
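The grouping and trimming steps on this slide can be sketched as follows. This assumes ttpv and ttnv are the gaps in days to the previous and next distinct publication day of the same software; the function name `time_gaps` and the sample dates are hypothetical and chosen only to produce gaps of 70, 5 and 281 days.

```python
from datetime import date, timedelta
from typing import List, Tuple


def time_gaps(published: List[date]) -> List[Tuple[date, int, int]]:
    """Return (day, ttpv, ttnv) per distinct publication day of one software.

    Same-day vulnerabilities collapse into one group, so every gap is at
    least one day. The first and last groups are dropped because they have
    no previous or no next neighbour.
    """
    days = sorted(set(published))
    rows = []
    for i in range(1, len(days) - 1):
        ttpv = (days[i] - days[i - 1]).days
        ttnv = (days[i + 1] - days[i]).days
        rows.append((days[i], ttpv, ttnv))
    return rows


# Hypothetical publication days; the duplicate models two same-day entries.
d = date(2007, 5, 2)
sample = [d - timedelta(days=70), d, d, d + timedelta(days=5), d + timedelta(days=286)]
for day, ttpv, ttnv in time_gaps(sample):
    print(day.month, day.day, ttpv, ttnv)
```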

Page 11: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Version diff Calculation

v1 = 3.6.4; v2 = 3.6; MaxVersionLength = 4
v1 = expand(v1, 4) = 3.6.4.0
v2 = expand(v2, 4) = 3.6.0.0
diff(v1, v2) = (3-3) * 100^{0} + (6-6) * 100^{-1} + (4-0) * 100^{-2} + (0-0) * 100^{-3} = 4E-4
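The calculation above can be sketched in a few lines. The helper names `expand` and `version_diff` follow the slide's wording but are otherwise hypothetical, and the sketch assumes every version component is a plain integer (a component such as "1a" would raise `ValueError`).

```python
from typing import List


def expand(version: str, length: int) -> List[int]:
    """Pad a dotted version with trailing zeros up to `length` components."""
    parts = [int(p) for p in version.split(".")]
    if len(parts) > length:
        raise ValueError(f"{version!r} has more than {length} components")
    return parts + [0] * (length - len(parts))


def version_diff(v1: str, v2: str, max_len: int = 4, base: int = 100) -> float:
    """Weighted component difference: sum of (a_i - b_i) * base**(-i)."""
    a = expand(v1, max_len)
    b = expand(v2, max_len)
    return sum((x - y) / base ** i for i, (x, y) in enumerate(zip(a, b)))


print(version_diff("3.6.4", "3.6"))
```

A base of 100 keeps components from overlapping only while each component difference stays below 100 in magnitude.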

Page 12: Predicting zero-day software vulnerabilities through data mining --Second Presentation

An Example

Vendor, soft, version, month, day, vdiff, ttpv, ttnv
linux, kernel, 2.6.18, 05, 02, 0, 70, 5
linux, kernel, 2.6.19.2, 05, 07, 1.02E-4, 5, 281

Page 13: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Functions Available For Our Approach On Weka

Least Mean Square
Linear Regression
Multilayer Perceptron
SMOreg
RBF Network
Gaussian Processes

Page 14: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Several Statistical Results

Function: Linear Regression
Training Dataset: 66% of Linux instances (randomly picked, since 2005)
Test Dataset: the remaining 34%
Test Result:
◦ Correlation coefficient: 0.5127
◦ Mean absolute error: 11.2358
◦ Root mean squared error: 25.4037
◦ Relative absolute error: 107.629 %
◦ Root relative squared error: 86.0388 %
◦ Total Number of Instances: 17967
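For readers who want to reproduce figures of this kind outside Weka, here is a minimal sketch of a random 66%/34% split and of the five reported measures. It is not Weka's code: `split_66_34` and `evaluate` are hypothetical names, and the baseline used for the two relative errors is passed in explicitly because the slides do not say which mean Weka uses.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple


def split_66_34(rows: Sequence, seed: int = 0) -> Tuple[List, List]:
    """Shuffle rows with a fixed seed and cut them 66% / 34%."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * 0.66)
    return shuffled[:cut], shuffled[cut:]


def evaluate(actual: Sequence[float], predicted: Sequence[float],
             baseline: float) -> Dict[str, float]:
    """Compute correlation, MAE, RMSE, RAE (%) and RRSE (%).

    `baseline` is the constant prediction the relative errors compare
    against (assumption: a mean of the target values). Constant inputs
    would cause a division by zero.
    """
    n = len(actual)
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    base_abs = sum(abs(baseline - a) for a in actual)
    base_sq = sum((baseline - a) ** 2 for a in actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "correlation": cov / math.sqrt(var_a * var_p),
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae_percent": 100.0 * abs_err / base_abs,
        "rrse_percent": 100.0 * math.sqrt(sq_err / base_sq),
    }


train, test = split_66_34(range(100))
print(len(train), len(test))
print(evaluate([1, 2, 3, 4], [2, 3, 4, 5], baseline=2.5))
```

A relative absolute error above 100 %, as reported for Linear Regression, means the model's absolute error is larger than that of the constant baseline.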

Page 15: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Correlation Coefficient
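The formula on this slide did not survive in the transcript. Assuming the slide showed the usual Pearson correlation between predicted values $p_i$ and actual values $a_i$ over $n$ test instances (with means $\bar{p}$ and $\bar{a}$), it would read:

```latex
r = \frac{\sum_{i=1}^{n} (p_i - \bar{p})(a_i - \bar{a})}
         {\sqrt{\sum_{i=1}^{n} (p_i - \bar{p})^2 \, \sum_{i=1}^{n} (a_i - \bar{a})^2}}
```

Values near 1 indicate that predictions rise and fall with the actual ttnv, values near 0 indicate no linear relationship, and negative values indicate an inverse relationship.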

Page 16: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Several Definitions About "Error"

Mean absolute error:

Root mean squared error:
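The two formulas are missing from the transcript. Under the standard definitions, with $p_i$ the predicted value, $a_i$ the actual value and $n$ the number of test instances, they would be:

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert p_i - a_i \rvert
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (p_i - a_i)^2}
```

Both are expressed in the unit of the predicted attribute. RMSE weights large errors more heavily, so it is never smaller than MAE.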

Page 17: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Several Definitions About "Error" (cont.)

Relative absolute error:

Root relative squared error:
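These formulas are also missing from the transcript. Under the standard definitions, both compare the model against a baseline that always predicts the mean $\bar{a}$ of the actual values (which sample that mean is taken over is an assumption here), with $p_i$ the predicted and $a_i$ the actual value:

```latex
\mathrm{RAE} = \frac{\sum_{i=1}^{n} \lvert p_i - a_i \rvert}
                    {\sum_{i=1}^{n} \lvert \bar{a} - a_i \rvert}
\qquad
\mathrm{RRSE} = \sqrt{\frac{\sum_{i=1}^{n} (p_i - a_i)^2}
                           {\sum_{i=1}^{n} (\bar{a} - a_i)^2}}
```

Both are usually reported as percentages; a value of 100 % means the model does no better than the mean baseline.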

Page 18: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Several Statistical Results

Function: Least Mean Square
Training Dataset: 66% of Linux instances (randomly picked, since 2005)
Test Dataset: the remaining 34%
Test Result:
◦ Correlation coefficient: -0.1501
◦ Mean absolute error: 7.6676
◦ Root mean squared error: 30.6038
◦ Relative absolute error: 73.449 %
◦ Root relative squared error: 103.6507 %
◦ Total Number of Instances: 17967

Page 19: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Several Statistical Results

Function: Multilayer Perceptron
Training Dataset: 66% of Linux instances (randomly picked, since 2005)
Test Dataset: the remaining 34%
Test Result:
◦ Correlation coefficient: 0.9886
◦ Mean absolute error: 0.4068
◦ Root mean squared error: 4.6905
◦ Relative absolute error: 3.7802 %
◦ Root relative squared error: 15.1644 %
◦ Total Number of Instances: 17967

Page 20: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Several Statistical Results

Function: RBF Network
Training Dataset: 66% of Linux instances (randomly picked, since 2005)
Test Dataset: the remaining 34%
Test Result:
◦ Linear Regression Model: ttnv = -15.3206 * pCluster_0_1 + 21.6205
◦ Correlation coefficient: 0.1822
◦ Mean absolute error: 10.5857
◦ Root mean squared error: 29.048
◦ Relative absolute error: 101.4023 %
◦ Root relative squared error: 98.3814 %
◦ Total Number of Instances: 17967

Page 21: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Summary Of Current Results

Linear Regression: not accurate enough, but looks promising (correlation coefficient: 0.5127).

Least Mean Square: probably not suitable for our approach (negative correlation coefficient).

Multilayer Perceptron: looks good, but it cannot provide us with a linear model.

Page 22: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Summary Of Current Results (cont.)

SMOreg: for most vendors, it takes too long to finish (usually more than 80 hours).

RBF Network: not very accurate.

Gaussian Processes: runs out of heap memory in most of our experiments.

Page 23: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Possible Ways To Improve The Accuracy Of Our Models

Add CVSS metrics as predictive attributes.

Discretize (bin) our predictive attributes (e.g. divide ttnv/ttpv into several categories).

Use regression SVM with multiple kernels.
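Dividing ttnv/ttpv into categories could look like the sketch below. The cut points (7, 30 and 90 days) and the names `make_labels` and `bin_label` are purely illustrative assumptions; the slides do not specify any category boundaries.

```python
from bisect import bisect_left
from typing import List, Sequence

# Hypothetical upper bounds (in days) of each category except the last.
DEFAULT_BOUNDS = (7, 30, 90)


def make_labels(bounds: Sequence[int]) -> List[str]:
    """Build labels such as '1-7', '8-30', '31-90', '>90'."""
    labels = []
    low = 1
    for upper in bounds:
        labels.append(f"{low}-{upper}")
        low = upper + 1
    labels.append(f">{bounds[-1]}")
    return labels


def bin_label(days: int, bounds: Sequence[int] = DEFAULT_BOUNDS) -> str:
    """Map a gap in days to its category; each upper bound is inclusive."""
    return make_labels(bounds)[bisect_left(bounds, days)]


for gap in (5, 70, 281):
    print(gap, bin_label(gap))
```

With a nominal target of this kind, the task becomes classification rather than regression, so a different set of Weka functions would apply.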

Page 24: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Plan For Next Phase

Try to find an optimal model for our prediction.

If we get a good model, investigate how to apply it with MulVAL. Otherwise, find out why it is not accurate enough.

Page 25: Predicting zero-day software vulnerabilities through data mining --Second Presentation

Thank you!