Page 1: Introduction to Machine Learning - San Jose State Universitystamp/ML/files/zz_figures.pdf · Introduction to Machine Learning with Applications in Information Security Mark Stamp

Introduction to Machine Learning with Applications in Information Security

Mark Stamp

April 27, 2017

Chapter 2

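[Diagram: Markov chain of hidden states $X_0, X_1, X_2, \ldots, X_{T-1}$ with transitions labeled $A$; each state $X_t$ is linked to its observation $\mathcal{O}_t$ by an edge labeled $B$.]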

Figure 2.1: Hidden Markov model

H   .06   .0448   .003136   .002822
C   .28   .0336   .014112   .000847

Figure 2.2: Dynamic programming
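The values in Figure 2.2 can be reproduced with a short dynamic program. The sketch below is illustrative only: the matrices `A`, `B`, the initial distribution `pi`, and the observation sequence `obs` are assumptions (they are not given in this figure listing), chosen because they yield the same numbers as the figure. At each step, the entry for a state is the highest probability of any state path ending in that state.

```python
# Assumed model parameters (not stated in this figure listing).
A = [[0.7, 0.3], [0.4, 0.6]]            # transition probabilities, states ordered H, C
B = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]  # emission probabilities for observations 0, 1, 2
pi = [0.6, 0.4]                         # initial state distribution
obs = [0, 1, 0, 2]                      # assumed observation sequence


def best_path_probs(A, B, pi, obs):
    """Return, for each time step, the best path probability ending in each state."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    table = [delta]
    for o in obs[1:]:
        # Extend the best path into state j from the most probable predecessor i.
        delta = [max(delta[i] * A[i][j] for i in range(n)) * B[j][o] for j in range(n)]
        table.append(delta)
    return table


table = best_path_probs(A, B, pi, obs)
for t, (h, c) in enumerate(table):
    print(f"t={t}  H={h:.6f}  C={c:.6f}")
```

Under these assumed parameters, the printed H and C columns match the eight values listed for Figure 2.2 after rounding.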


Problem 2.15

I L I K E K I L L I N G P E O P L

E B E C A U S E I T I S S O M U C

H F U N I T I S M O R E F U N T H

A N K I L L I N G W I L D G A M E

I N T H E F O R R E S T B E C A U

S E M A N I S T H E M O S T D A N

G E R O U E A N A M A L O F A L L

T O K I L L S O M E T H I N G G I

V E S M E T H E M O S T T H R I L

L I N G E X P E R E N C E I T I S

E V E N B E T T E R T H A N G E T

T I N G Y O U R R O C K S O F F W

I T H A G I R L T H E B E S T P A

R T O F I T I S T H A E W H E N I

D I E I W I L L B E R E B O R N I

N P A R A D I C E A N D A L L T H

E I H A V E K I L L E D W I L L B

E C O M E M Y S L A V E S I W I L

L N O T G I V E Y O U M Y N A M E

B E C A U S E Y O U W I L L T R Y

T O S L O I D O W N O R A T O P M

Y C O L L E C T I O G O F S L A V

E S F O R M Y A F T E R L I F E E

B E O R I E T E M E T H H P I T I

Problem 2.16


Chapter 3

begin 𝑀1 𝑀2 𝑀3 𝑀4 end

Figure 3.1: PHMM without gaps

begin 𝑀1 𝑀2 𝑀3 𝑀4 end

𝐼0 𝐼1 𝐼2 𝐼3 𝐼4

Figure 3.2: PHMM with insertions

begin 𝑀1 𝑀2 𝑀3 𝑀4 end

𝐷1 𝐷2 𝐷3 𝐷4

Figure 3.3: PHMM with deletions

begin 𝑀1 𝑀2 𝑀3 𝑀4 end

𝐼0 𝐼1 𝐼2 𝐼3 𝐼4

𝐷1 𝐷2 𝐷3 𝐷4

Figure 3.4: Profile hidden Markov model


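[Graph: nodes numbered 1 to 10 joined by edges with weights 79, 105, 85, 79, 94, 85, 84, 78, 81; the edge layout is not recoverable from the extracted text.]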

Figure 3.5: Minimum spanning tree

[Diagram: PHMM with match states $M_0$ to $M_3$, insert states $I_0$ to $I_2$, and delete states $D_1$, $D_2$; each edge is labeled with the numbers (1 to 13) of the paths that use it. The edge layout is not recoverable from the extracted text.]

Figure 3.6: PHMM with 𝑁 = 2 illustrating paths in Table 3.12


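[Diagram: dependency graph of the forward variables, starting at $F^M_0(0)$, passing through terms such as $F^M_1(1)$, $F^I_0(1)$, $F^D_1(0)$, $F^M_2(2)$, $F^I_1(2)$, $F^D_2(1)$, $F^M_1(2)$, $F^I_0(2)$, $F^D_1(1)$, $F^M_2(1)$, $F^I_1(1)$, $F^D_2(0)$, and ending at $F^M_N(L)$, $F^I_N(L)$, $F^D_N(L)$.]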

Figure 3.7: Forward algorithm recursion

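[Diagram: $F^M_{N+1}(L)$ is reached from $F^M_N(L)$, $F^I_N(L)$, and $F^D_N(L)$ along edges labeled $a_{M_N E}$, $a_{I_N E}$, and $a_{D_N E}$, respectively.]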

Figure 3.8: Final score


Chapter 4

[Plot labels: vectors $x$ and $Ax$.]

Figure 4.1: Matrix multiplication example

[Plot labels: vectors $x$ and $Ax$.]

Figure 4.2: Eigenvector example

(a) Experimental results (b) Direction of maximum variance

Figure 4.3: PCA and maximum variance


Figure 4.4: A better basis

[Diagram label: angle $\theta$.]

Figure 4.5: Ferris wheel data

Figure 4.6: Non-orthogonal data


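[Diagram: the transformation $M$ shown alongside its factors $V^T$, $S$ (with singular values $\sigma_1$ and $\sigma_2$), and $U$.]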

Figure 4.7: Matrix transformation using SVD


Chapter 5

Figure 5.1: Scatterplot of training data

Figure 5.2: Separating hyperplanes

Figure 5.3: Maximizing the margin


Figure 5.4: Not linearly separable

[Diagram: data before and after applying the transformation $\varphi$.]

Figure 5.5: Transformation to linearly separable

[Diagram: data before and after applying the transformation $\varphi$.]

Figure 5.6: Transformation from 2-d to 3-d


[3-d surface plot; axes: $x$ (−4 to 4), $y$ (−5 to 5), $f(x, y)$ (−20 to 20).]

Figure 5.7: Graph of $f(x, y) = 16 - (x^2 + y^2)$

[Two 3-d plots; axis ranges: −4 to 4, −5 to 5, and −20 to 20.]

(a) Intersection (b) Feasible region

Figure 5.8: Constrained optimization example

[Diagram label: $m$.]

Figure 5.9: Linearly separable example


[Diagram label: support vectors.]

Figure 5.10: Support vectors

Figure 5.11: Errors and soft margin


Problem 5.15


Chapter 6

𝐸 𝐴 𝑂 𝐼 𝑈 𝑇 𝑁 𝑆 𝑅

Figure 6.1: Dendrogram

[Diagram: points $A$ and $B$ with two paths between them, labeled "Euclidean distance" and "Manhattan distance".]

Figure 6.2: Euclidean vs Manhattan distance
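As a small illustration of the two distances named in Figure 6.2, the sketch below computes both for a pair of points. The coordinates are hypothetical and chosen only to give round numbers; they are not taken from the figure.

```python
import math


def euclidean(a, b):
    """Straight-line distance between points a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences between points a and b."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical coordinates for the points labeled A and B.
A = (0.0, 0.0)
B = (3.0, 4.0)

print(euclidean(A, B))  # 5.0
print(manhattan(A, B))  # 7.0
```

The Manhattan distance is never smaller than the Euclidean distance, since the straight line is the shortest path between two points.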

Figure 6.3: Distortion


(a) Suitable for clustering (b) Not as well-suited for clustering

Figure 6.4: Clusterability

(a) Correlation $r_{XY} = 1$ (b) Correlation $r_{XY} = -1$

(c) Correlation $0 < r_{XY} < 1$ (d) Correlation $-1 < r_{XY} < 0$

Figure 6.5: Correlation coefficient and regression line examples


(a) Correlation 𝑟𝐴𝐷 = −0.8652 (b) Correlation 𝑟𝐴𝐷 = −0.5347

Figure 6.6: Correlation coefficient examples

 X    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23
 1 0.00 0.36 0.70 0.85 0.75 1.61 0.95 1.41 2.17 2.25 1.71 2.49 2.81 3.28 3.50 3.82 4.46 4.46 3.96 3.20 2.97 3.86 4.23
 2 0.36 0.00 0.56 0.49 0.51 1.25 0.78 1.10 2.23 2.42 1.77 2.55 2.94 3.16 3.37 3.67 4.25 4.25 3.75 3.01 2.75 3.62 4.00
 3 0.70 0.56 0.00 0.62 1.03 1.27 1.31 1.44 1.72 2.01 1.28 2.03 2.48 2.61 2.83 3.13 3.76 3.76 3.26 2.50 2.27 3.16 3.53
 4 0.85 0.49 0.62 0.00 0.61 0.76 0.89 0.83 2.31 2.63 1.88 2.61 3.09 2.97 3.16 3.42 3.92 3.92 3.42 2.72 2.42 3.25 3.65
 5 0.75 0.51 1.03 0.61 0.00 1.14 0.29 0.70 2.73 2.93 2.28 3.05 3.46 3.56 3.75 4.02 4.52 4.52 4.03 3.32 3.03 3.85 4.26
 6 1.61 1.25 1.27 0.76 1.14 0.00 1.31 0.76 2.75 3.21 2.38 3.02 3.59 3.01 3.14 3.32 3.64 3.64 3.16 2.57 2.21 2.92 3.35
 7 0.95 0.78 1.31 0.89 0.29 1.31 0.00 0.71 3.01 3.18 2.55 3.33 3.72 3.85 4.04 4.30 4.79 4.79 4.29 3.60 3.30 4.11 4.52
 8 1.41 1.10 1.44 0.83 0.70 0.76 0.71 0.00 3.13 3.45 2.71 3.43 3.92 3.66 3.82 4.02 4.39 4.39 3.91 3.29 2.94 3.68 4.10
 9 2.17 2.23 1.72 2.31 2.73 2.75 3.01 3.13 0.00 0.74 0.46 0.32 0.85 1.63 1.90 2.30 3.21 3.21 2.79 2.05 2.11 2.90 3.11
10 2.25 2.42 2.01 2.63 2.93 3.21 3.18 3.45 0.74 0.00 0.86 0.83 0.63 2.33 2.60 3.00 3.92 3.92 3.52 2.79 2.85 3.64 3.84
11 1.71 1.77 1.28 1.88 2.28 2.38 2.55 2.71 0.46 0.86 0.00 0.78 1.22 1.88 2.15 2.53 3.38 3.38 2.93 2.15 2.13 2.98 3.24
12 2.49 2.55 2.03 2.61 3.05 3.02 3.33 3.43 0.32 0.83 0.78 0.00 0.68 1.51 1.78 2.18 3.11 3.11 2.72 2.03 2.15 2.87 3.04
13 2.81 2.94 2.48 3.09 3.46 3.59 3.72 3.92 0.85 0.63 1.22 0.68 0.00 2.09 2.33 2.73 3.68 3.68 3.33 2.68 2.82 3.51 3.65
14 3.28 3.16 2.61 2.97 3.56 3.01 3.85 3.66 1.63 2.33 1.88 1.51 2.09 0.00 0.27 0.67 1.60 1.60 1.25 0.75 1.08 1.48 1.56
15 3.50 3.37 2.83 3.16 3.75 3.14 4.04 3.82 1.90 2.60 2.15 1.78 2.33 0.27 0.00 0.40 1.35 1.35 1.03 0.70 1.07 1.30 1.33
16 3.82 3.67 3.13 3.42 4.02 3.32 4.30 4.02 2.30 3.00 2.53 2.18 2.73 0.67 0.40 0.00 0.96 0.96 0.72 0.75 1.13 1.05 0.98
17 4.46 4.25 3.76 3.92 4.52 3.64 4.79 4.39 3.21 3.92 3.38 3.11 3.68 1.60 1.35 0.96 0.00 0.00 0.50 1.27 1.50 0.74 0.32
18 4.46 4.25 3.76 3.92 4.52 3.64 4.79 4.39 3.21 3.92 3.38 3.11 3.68 1.60 1.35 0.96 0.00 0.00 0.50 1.27 1.50 0.74 0.32
19 3.96 3.75 3.26 3.42 4.03 3.16 4.29 3.91 2.79 3.52 2.93 2.72 3.33 1.25 1.03 0.72 0.50 0.50 0.00 0.79 1.00 0.38 0.32
20 3.20 3.01 2.50 2.72 3.32 2.57 3.60 3.29 2.05 2.79 2.15 2.03 2.68 0.75 0.70 0.75 1.27 1.27 0.79 0.00 0.39 0.85 1.10
21 2.97 2.75 2.27 2.42 3.03 2.21 3.30 2.94 2.11 2.85 2.13 2.15 2.82 1.08 1.07 1.13 1.50 1.50 1.00 0.39 0.00 0.90 1.26
22 3.86 3.62 3.16 3.25 3.85 2.92 4.11 3.68 2.90 3.64 2.98 2.87 3.51 1.48 1.30 1.05 0.74 0.74 0.38 0.85 0.90 0.00 0.43
23 4.23 4.00 3.53 3.65 4.26 3.35 4.52 4.10 3.11 3.84 3.24 3.04 3.65 1.56 1.33 0.98 0.32 0.32 0.32 1.10 1.26 0.43 0.00

(a) Heatmap corresponding to Figure 6.6 (a)

 X    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23
 1 0.00 0.36 0.45 0.85 0.75 1.53 0.95 1.41 0.74 1.01 0.80 1.32 1.56 1.46 1.70 2.05 2.11 2.42 1.66 1.10 1.23 1.66 2.05
 2 0.36 0.00 0.54 0.49 0.51 1.17 0.78 1.10 0.79 1.23 0.69 1.31 1.70 1.35 1.55 1.86 1.82 2.12 1.35 0.79 0.89 1.31 1.70
 3 0.45 0.54 0.00 0.86 1.05 1.55 1.30 1.61 0.29 0.70 0.43 0.87 1.17 1.05 1.30 1.68 1.84 2.16 1.45 0.93 1.35 1.57 1.96
 4 0.85 0.49 0.86 0.00 0.61 0.70 0.89 0.83 1.00 1.55 0.74 1.36 1.90 1.24 1.37 1.60 1.41 1.69 0.91 0.41 0.51 0.82 1.20
 5 0.75 0.51 1.05 0.61 0.00 1.01 0.29 0.70 1.30 1.72 1.17 1.80 2.21 1.78 1.95 2.20 2.00 2.26 1.50 1.02 0.64 1.31 1.66
 6 1.53 1.17 1.55 0.70 1.01 0.00 1.17 0.61 1.66 2.23 1.35 1.89 2.51 1.63 1.66 1.73 1.25 1.43 0.82 0.76 0.38 0.43 0.67
 7 0.95 0.78 1.30 0.89 0.29 1.17 0.00 0.71 1.57 1.95 1.45 2.09 2.47 2.07 2.24 2.48 2.25 2.50 1.75 1.30 0.79 1.52 1.83
 8 1.41 1.10 1.61 0.83 0.70 0.61 0.71 0.00 1.81 2.31 1.57 2.19 2.72 2.03 2.12 2.26 1.85 2.04 1.39 1.13 0.35 1.03 1.25
 9 0.74 0.79 0.29 1.00 1.30 1.66 1.57 1.81 0.00 0.59 0.34 0.59 0.92 0.81 1.08 1.47 1.74 2.06 1.40 0.95 1.51 1.60 1.97
10 1.01 1.23 0.70 1.55 1.72 2.23 1.95 2.31 0.59 0.00 0.92 0.83 0.63 1.22 1.48 1.89 2.27 2.58 1.97 1.53 2.05 2.18 2.55
11 0.80 0.69 0.43 0.74 1.17 1.35 1.45 1.57 0.34 0.92 0.00 0.64 1.17 0.67 0.90 1.26 1.43 1.75 1.07 0.62 1.25 1.26 1.63
12 1.32 1.31 0.87 1.36 1.80 1.89 2.09 2.19 0.59 0.83 0.64 0.00 0.68 0.43 0.67 1.07 1.54 1.84 1.37 1.13 1.86 1.70 2.02
13 1.56 1.70 1.17 1.90 2.21 2.51 2.47 2.72 0.92 0.63 1.17 0.68 0.00 1.10 1.30 1.67 2.21 2.51 2.05 1.75 2.41 2.36 2.69
14 1.46 1.35 1.05 1.24 1.78 1.63 2.07 2.03 0.81 1.22 0.67 0.43 1.10 0.00 0.27 0.67 1.12 1.42 1.00 0.90 1.68 1.37 1.65
15 1.70 1.55 1.30 1.37 1.95 1.66 2.24 2.12 1.08 1.48 0.90 0.67 1.30 0.27 0.00 0.40 0.93 1.21 0.93 0.99 1.77 1.35 1.57
16 2.05 1.86 1.68 1.60 2.20 1.73 2.48 2.26 1.47 1.89 1.26 1.07 1.67 0.67 0.40 0.00 0.71 0.91 0.92 1.19 1.91 1.35 1.48
17 2.11 1.82 1.84 1.41 2.00 1.25 2.25 1.85 1.74 2.27 1.43 1.54 2.21 1.12 0.93 0.71 0.00 0.32 0.50 1.03 1.54 0.83 0.82
18 2.42 2.12 2.16 1.69 2.26 1.43 2.50 2.04 2.06 2.58 1.75 1.84 2.51 1.42 1.21 0.91 0.32 0.00 0.78 1.33 1.76 1.01 0.87
19 1.66 1.35 1.45 0.91 1.50 0.82 1.75 1.39 1.40 1.97 1.07 1.37 2.05 1.00 0.93 0.92 0.50 0.78 0.00 0.56 1.06 0.43 0.65
20 1.10 0.79 0.93 0.41 1.02 0.76 1.30 1.13 0.95 1.53 0.62 1.13 1.75 0.90 0.99 1.19 1.03 1.33 0.56 0.00 0.78 0.65 1.03
21 1.23 0.89 1.35 0.51 0.64 0.38 0.79 0.35 1.51 2.05 1.25 1.86 2.41 1.68 1.77 1.91 1.54 1.76 1.06 0.78 0.00 0.75 1.04
22 1.66 1.31 1.57 0.82 1.31 0.43 1.52 1.03 1.60 2.18 1.26 1.70 2.36 1.37 1.35 1.35 0.83 1.01 0.43 0.65 0.75 0.00 0.39
23 2.05 1.70 1.96 1.20 1.66 0.67 1.83 1.25 1.97 2.55 1.63 2.02 2.69 1.65 1.57 1.48 0.82 0.87 0.65 1.03 1.04 0.39 0.00

(b) Heatmap corresponding to Figure 6.6 (b)

Figure 6.7: Heatmaps


[Diagram: a point $x_i$ and three clusters $C_1$, $C_2$, $C_3$.]

Figure 6.8: Silhouette coefficient example

(a) 𝐸 = 0.7632 and 𝑈 = 0.7272 (b) 𝐸 = 1.0280 and 𝑈 = 0.4545

Figure 6.9: Entropy and purity examples

[Diagram labels: Cluster 1, Cluster 2, Cluster 3.]

Figure 6.10: Three clusters


[Stacked chart; categories: Cluster 1, Cluster 2, Cluster 3; axis: Number (0 to 10); legend: Ovals, Circles, Diamonds.]

Figure 6.11: Stacked column chart for clusters in Figure 6.10

[Stacked chart; categories: Cluster 1 through Cluster 10; axis: Number (0 to 1800); legend: Zeroaccess, Zbot, Winwebsec, Benign.]

Figure 6.12: Stacked column chart (4-d model with 10 clusters)

[Plot; axes: Duration (1.50 to 5.50), Waiting time (40 to 90).]

Figure 6.13: EM clustering of Old Faithful eruption data


[Plot; axes: Duration (1.50 to 5.50), Waiting time (40 to 90).]

Figure 6.14: Old Faithful data for Gaussian mixture example

[Plot; axes: Duration (1.50 to 5.50), Waiting time (40 to 90).]

Figure 6.15: EM clusters for Old Faithful data


Problem 6.4

Problem 6.16


Chapter 7

Figure 7.1: Labeled training data

[Diagram labels: $b$ and $X$ in each panel, with radii $r_1$ and $r_2$ in one panel.]

(a) 1-nearest neighbor (1-NN) (b) 3-nearest neighbor (3-NN)

Figure 7.2: 𝑘-NN examples


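[Network diagram: input layer with $X_1$, $X_2$; 1st hidden layer with three nodes labeled $f_1$; 2nd hidden layer with four nodes labeled $f_2$; output layer with three nodes labeled $g$; outputs $Y_1$, $Y_2$, $Y_3$.]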

Figure 7.3: MLP with two hidden layers

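[Decision tree: the root tests file size (large or small); each branch then tests entropy (high or low); the four leaves are labeled benign, malware, benign, benign. The assignment of leaves to branches is not recoverable from the extracted text.]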

Figure 7.4: Decision tree example


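[Decision tree: the root tests entropy (high or low); each branch then tests file size (large or small); the four leaves are labeled benign, malware, benign, benign. The assignment of leaves to branches is not recoverable from the extracted text.]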

Figure 7.5: Features in different order

(a) Separating with LDA (b) Separating with QDA

Figure 7.6: LDA vs QDA

(a) A projection (b) A better projection

Figure 7.7: Projecting onto hyperplanes


[Diagram labels: $\mu_x$ and $\mu_y$ in each panel.]

(a) Means widely separated (b) Means closer together

Figure 7.8: Projecting the means

(a) Largest eigenvalue (b) Smallest eigenvalue

Figure 7.9: Projections of data in Table 7.4

[Number line: $\ldots, -3, -2, -1, 0, 1, 2, 3, \ldots$, with a half-open interval $[\ )$ marked around each integer.]

Figure 7.10: Rounding as VQ
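Figure 7.10 presents rounding as a form of vector quantization: each integer acts as a codeword, and every real number in the half-open interval around it is mapped to that integer. A minimal sketch follows, assuming the interval for integer n is [n − 0.5, n + 0.5); the function name `vq_round` is hypothetical. Python's built-in `round` is avoided here because it rounds ties to the nearest even integer.

```python
import math


def vq_round(x):
    """Map x to the integer codeword n with n - 0.5 <= x < n + 0.5 (assumed intervals)."""
    return math.floor(x + 0.5)


for x in (-1.2, -0.5, 0.49, 0.5, 2.49, 2.5):
    print(x, "->", vq_round(x))
```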

(a) House size vs price (b) Linear regression

Figure 7.11: Regression line


(a) Linear regression (b) Piecewise linear

Figure 7.12: Regression examples

[Plot; horizontal axis from −6 to 6, vertical axis ticks at 0.25, 0.50, 0.75, 1.00.]

Figure 7.13: Logistic function
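For reference, the curve in Figure 7.13 can be tabulated numerically. The sketch below assumes the standard logistic function $1 / (1 + e^{-x})$, since the caption does not state the formula, and evaluates it over the plotted range from −6 to 6.

```python
import math


def logistic(x):
    """Standard logistic function (assumed form): 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


# Evaluate at the tick positions of the horizontal axis.
for x in range(-6, 7, 2):
    print(f"{x:>3}  {logistic(x):.4f}")
```

The function equals 0.5 at $x = 0$ and approaches 0 and 1 at the two ends of the plotted range.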

𝒪0 𝒪1 𝒪2 · · · 𝒪𝑇−1

𝑋0 𝑋1 𝑋2 · · · 𝑋𝑇−1

Figure 7.14: Graph structure of HMM


𝒪0 𝒪1 𝒪2 · · · 𝒪𝑇−1

𝑋0 𝑋1 𝑋2 · · · 𝑋𝑇−1

Figure 7.15: Linear chain CRF

Discriminative: Logistic regression | Linear chain CRF | Conditional random field

Generative: Naïve Bayes | Hidden Markov model | Generative directed model

Figure 7.16: Generative-discriminative pairs


Problem 7.16


Chapter 8

[Scatterplot; axes: Experiment, Score; legend: Match scores, Nomatch scores.]

Figure 8.1: Scatterplot of scores

Figure 8.2: Thresholding is easy . . . sometimes

             Malware   Not malware
High score     TP          FP
Low score      FN          TN

Figure 8.3: Confusion matrix


[Plot; axes: FPR and TPR, each from 0 to 1.]

Figure 8.4: Scatterplot and a point on the ROC curve

[Plot; axes: FPR and TPR, each from 0 to 1.]

Figure 8.5: ROC curve example

[Two plots; axes: FPR and TPR, each from 0 to 1.]

Figure 8.6: Area under ROC curve (AUC) and AUC𝑝


             Malware    Not malware
High score   TP = 99    FP = 1998
Low score    FN = 1     TN = 97,902

Figure 8.7: Confusion matrix
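The counts in Figure 8.7 illustrate how a classifier can have a high true positive rate and high accuracy while most of its positive results are wrong. The sketch below computes the usual rates from those four counts using their standard definitions; the variable names are illustrative.

```python
# Counts taken from Figure 8.7.
TP, FP, FN, TN = 99, 1998, 1, 97902

total = TP + FP + FN + TN
tpr = TP / (TP + FN)          # true positive rate (recall)
fpr = FP / (FP + TN)          # false positive rate
precision = TP / (TP + FP)    # fraction of positive results that are correct
accuracy = (TP + TN) / total

print(f"TPR (recall) = {tpr:.4f}")
print(f"FPR          = {fpr:.4f}")
print(f"Precision    = {precision:.4f}")
print(f"Accuracy     = {accuracy:.4f}")
```

With only 100 malware samples among 100,000, a false positive rate of 2% still produces 1998 false alarms, so precision falls below 5% even though recall is 99%.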

[Plot; axes: Recall and Precision, each from 0 to 1.]

Figure 8.8: Scatterplot and a point on the PR curve

[Plot; axes: Recall and Precision, each from 0 to 1.]

Figure 8.9: PR curve example


Problem 8.7

[Scatterplot; axes: Experiment, Score; legend: Match scores, Nomatch scores.]


Chapter 9

[Scatterplot; axes: Sample (0 to 200), HMM score (−140 to 0); legend: Malware, Benign.]

Figure 9.1: NGVCK vs benign

     A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P    Q    R    S    T    U    V    W    X    Y    Z
A 0.02 0.19 0.38 0.33 0.01 0.10 0.19 0.04 0.29 0.01 0.09 0.85 0.29 1.59 0.02 0.17 0.00 0.93 0.82 1.19 0.10 0.18 0.08 0.02 0.26 0.01
B 0.15 0.01 0.00 0.00 0.48 0.00 0.00 0.00 0.10 0.01 0.00 0.19 0.00 0.00 0.18 0.00 0.00 0.09 0.03 0.01 0.17 0.00 0.00 0.00 0.13 0.00
C 0.43 0.01 0.06 0.01 0.49 0.00 0.00 0.52 0.21 0.00 0.12 0.12 0.01 0.00 0.64 0.01 0.00 0.12 0.03 0.30 0.09 0.00 0.01 0.00 0.03 0.00
D 0.39 0.14 0.08 0.08 0.63 0.09 0.06 0.10 0.47 0.02 0.01 0.07 0.11 0.06 0.30 0.07 0.01 0.13 0.23 0.38 0.12 0.03 0.11 0.00 0.06 0.00
E 1.00 0.22 0.61 1.02 0.46 0.30 0.18 0.19 0.39 0.03 0.06 0.53 0.48 1.25 0.33 0.36 0.04 1.75 1.36 0.77 0.09 0.24 0.36 0.14 0.15 0.01
F 0.23 0.02 0.05 0.02 0.19 0.13 0.02 0.04 0.25 0.01 0.00 0.06 0.04 0.02 0.42 0.03 0.00 0.18 0.05 0.36 0.08 0.00 0.03 0.00 0.01 0.00
G 0.21 0.02 0.03 0.02 0.31 0.03 0.03 0.22 0.18 0.00 0.00 0.06 0.03 0.06 0.19 0.02 0.00 0.19 0.07 0.15 0.06 0.00 0.03 0.00 0.01 0.00
H 0.84 0.02 0.04 0.01 2.42 0.02 0.01 0.03 0.66 0.00 0.00 0.02 0.03 0.04 0.48 0.02 0.00 0.10 0.05 0.21 0.08 0.01 0.04 0.00 0.03 0.00
I 0.23 0.08 0.56 0.27 0.30 0.13 0.21 0.01 0.01 0.00 0.04 0.40 0.22 1.81 0.56 0.07 0.01 0.26 0.94 0.89 0.01 0.22 0.01 0.01 0.00 0.05
J 0.03 0.00 0.00 0.00 0.03 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.06 0.00 0.00 0.00 0.00 0.00 0.06 0.00 0.00 0.00 0.00 0.00
K 0.04 0.01 0.01 0.01 0.21 0.01 0.00 0.02 0.09 0.00 0.00 0.02 0.01 0.04 0.04 0.01 0.00 0.01 0.06 0.03 0.00 0.00 0.02 0.00 0.01 0.00
L 0.50 0.06 0.06 0.27 0.70 0.07 0.02 0.03 0.54 0.00 0.02 0.58 0.06 0.02 0.33 0.06 0.00 0.04 0.17 0.15 0.10 0.03 0.04 0.00 0.36 0.00
M 0.50 0.11 0.02 0.01 0.63 0.01 0.00 0.02 0.30 0.00 0.00 0.01 0.11 0.01 0.32 0.17 0.00 0.07 0.09 0.07 0.11 0.00 0.02 0.00 0.04 0.00
N 0.53 0.08 0.38 1.03 0.64 0.11 0.80 0.10 0.44 0.02 0.05 0.09 0.09 0.13 0.49 0.07 0.01 0.05 0.50 1.20 0.09 0.05 0.12 0.00 0.10 0.00
O 0.14 0.14 0.16 0.18 0.06 0.86 0.09 0.07 0.10 0.01 0.06 0.34 0.49 1.36 0.22 0.22 0.00 1.04 0.31 0.46 0.70 0.17 0.30 0.01 0.04 0.00
P 0.27 0.01 0.00 0.00 0.35 0.01 0.00 0.07 0.12 0.00 0.00 0.20 0.02 0.00 0.28 0.11 0.00 0.36 0.05 0.09 0.08 0.00 0.01 0.00 0.01 0.00
Q 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00
R 0.65 0.06 0.16 0.21 1.43 0.08 0.10 0.08 0.64 0.01 0.10 0.13 0.19 0.17 0.64 0.09 0.00 0.12 0.50 0.51 0.13 0.06 0.08 0.00 0.20 0.00
S 0.66 0.14 0.24 0.09 0.73 0.13 0.05 0.36 0.65 0.02 0.05 0.12 0.16 0.10 0.56 0.26 0.01 0.07 0.48 1.33 0.24 0.02 0.21 0.00 0.05 0.00
T 0.63 0.09 0.11 0.05 0.97 0.08 0.03 2.89 1.09 0.01 0.01 0.14 0.10 0.05 1.03 0.07 0.00 0.35 0.40 0.49 0.20 0.01 0.19 0.00 0.20 0.00
U 0.09 0.08 0.13 0.08 0.11 0.02 0.10 0.00 0.07 0.00 0.00 0.26 0.10 0.37 0.01 0.11 0.00 0.38 0.37 0.34 0.00 0.00 0.00 0.00 0.01 0.00
V 0.09 0.00 0.00 0.00 0.65 0.00 0.00 0.00 0.22 0.00 0.00 0.00 0.00 0.00 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
W 0.32 0.01 0.01 0.01 0.31 0.01 0.00 0.32 0.33 0.00 0.00 0.01 0.02 0.06 0.21 0.01 0.00 0.03 0.04 0.03 0.00 0.00 0.01 0.00 0.02 0.00
X 0.02 0.00 0.02 0.00 0.01 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.01 0.05 0.00 0.00 0.00 0.03 0.00 0.00 0.00 0.00 0.00 0.00
Y 0.19 0.07 0.07 0.05 0.14 0.06 0.02 0.07 0.12 0.01 0.01 0.04 0.07 0.04 0.18 0.05 0.00 0.04 0.17 0.20 0.01 0.01 0.09 0.00 0.01 0.00
Z 0.02 0.00 0.00 0.00 0.04 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01

Figure 9.2: English digraph relative frequencies (as percentages)


[Plot; axes: Ciphertext length (200 to 1000), Accuracy as a percentage (0 to 100); legend: key, data.]

Figure 9.3: Jakobsen’s algorithm

[Plot; axes: Ciphertext length (200 to 1200), Accuracy (0 to 100); legend: 1 start, 10 restarts, $10^2$ restarts, $10^3$ restarts, $10^4$ restarts, $10^5$ restarts.]

Figure 9.4: Accuracy vs data size


[Plot; axes: Ciphertext length (200 to 1000), Accuracy (0 to 100); legend: HMM ($10^5$ restarts), Jakobsen’s.]

Figure 9.5: Jakobsen’s algorithm vs HMM

[3-d plot; axes: Ciphertext length (200 to 1200), Restarts ($10^1$ to $10^5$), Accuracy (0 to 100).]

Figure 9.6: Accuracy vs data size vs restarts (200 iterations)


[3-d plot; axes: Restarts ($10^1$ to $10^5$), Ciphertext length (400 to 1200), Accuracy (0.00 to 1.00).]

Figure 9.7: Accuracy vs restarts vs data size (200 iterations)


Chapter 10

[ROC plot; axes: False positive rate and True positive rate, each from 0.00 to 1.00; legend: 1-gram, 2-gram, 3-gram, HMM.]

Figure 10.1: Comparison of HMM and weighted 𝑛-grams

[ROC plot; axes: False positive rate and True positive rate, each from 0.00 to 1.00; legend: 4 sequences, 5 sequences, 10 sequences, 20 sequences, 50 sequences.]

Figure 10.2: Detection results for PHMM


[ROC plot; axes: False positive rate and True positive rate, each from 0.00 to 1.00; legend: PHMM, HMM, 3-gram.]

Figure 10.3: Comparison of PHMM, HMM, and 3-gram scores

[ROC plot; axes: False positive rate and True positive rate, each from 0.00 to 1.00; legend: PHMM, HMM.]

Figure 10.4: HMM vs PHMM based on simulated data


[ROC plot; axes: False positive rate and True positive rate, each from 0.00 to 1.00; legend: PHMM 200, PHMM 400, PHMM 800, HMM 200, HMM 400, HMM 800.]

Figure 10.5: HMM vs PHMM with limited training data

[Bar chart; categories: 200 commands, 400 commands, 800 commands; axis from 0.2 to 1.0; legend: PHMM AUC$_{0.1}$, PHMM AUC, HMM AUC$_{0.1}$, HMM AUC.]

Figure 10.6: Results based on limited synthetic data


[Two scatterplots; horizontal axis from 0 to 50, Score from −30 to 0; legend: Malware, Benign.]

(a) Scatterplot for static case (b) Scatterplot for dynamic case

[Two ROC plots; axes: False Positive Rate and True Positive Rate, each from 0.00 to 1.00.]

(c) ROC curve for static case (d) ROC curve for dynamic case

Figure 10.7: Security Shield HMM results

[Bar chart; categories: Cridex, Harebot, Security Shield, Smart HDD, Winwebsec, Zbot, ZeroAccess; axis: AUC (0.2 to 1.0); legend: PHMM, HMM (dynamic), HMM (static).]

Figure 10.8: PHMM vs HMMs


Chapter 11

Figure 11.1: Training images

Figure 11.2: Eigenfaces of images in Figure 11.1


[Plot; axes: Padding ratio (1.0 to 4.0), AUC (0.95 to 1.00).]

Figure 11.3: Graph of AUC for MWOR

[Plot; axes: Padding ratio (1.0 to 4.0), AUC (0.50 to 1.00); legend: SVD, HMM, SSD.]

Figure 11.4: AUC comparison for MWOR


Figure 11.5: Image spam

Figure 11.6: Spam images from standard dataset


Figure 11.7: Projections onto eigenspace for images in Figure 11.6

[Plot; axes: Number of eigenvalues (ticks at 1, 10, 100, 500), AUC (0.75 to 1.00).]

Figure 11.8: AUC for different numbers of eigenvalues (standard dataset)

Figure 11.9: Examples of improved spam images


Chapter 12

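[Directed graph on the opcodes ADD, CALL, JMP, NOP, SUB; edges are labeled with the values 1/6, 1/5, 1/4, 1/3, 2/5, and 1/2. The assignment of labels to edges is not recoverable from the extracted text.]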

Figure 12.1: Opcode graph

[Four scatterplots; horizontal axis from 0 to 40; Score from −60 to 0 in panel (a) and from 0.0 to 1.0 in panels (b), (c), and (d); legend: Malware, Benign.]

(a) HMM (b) OGS

(c) SSD (d) SVM

Figure 12.2: NGVCK score scatterplots


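[Bar chart; categories: Linear, Polynomial, Neural, Radial; axis: AUC (0.0 to 1.0); bar labels in extraction order: 0.92, 0.86, 0.85, 1.00. The pairing of labels with kernels is not certain from the extracted text.]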

Figure 12.3: Comparison of SVM kernels (NGVCK at 80% morphing)

[Plot; axes: Morphing percentage (0 to 120), AUC (0.50 to 1.00); legend: HMM, OGS, SSD, SVM.]

Figure 12.4: AUC at various morphing percentages (NGVCK)


[Bar chart; categories: Harebot, Security Shield, Smart HDD, Winwebsec, Zbot, Zeroaccess; axis: AUC (0.0 to 1.0); legend: HMM, OGS, SSD, SVM.]

Figure 12.5: AUC comparisons for Malicia families


[Six plots; axes: Morphing percentage (0 to 140), AUC (0.10 to 1.00); legend: HMM, OGS, SSD, SVM.]

(a) Winwebsec (b) Zeroaccess

(c) Zbot (d) Harebot

(e) Security Shield (f) Smart HDD

Figure 12.6: AUC comparison for morphed Malicia families


(a) Ham image (b) Grayscale

(c) Canny edges (d) HOG

Figure 12.7: Features of a ham image

(a) Spam image (b) Grayscale

(c) Canny edges (d) HOG

Figure 12.8: Spam image feature extraction


[Figure: four panels plotting Frequency for Ham and Spam images]

(a) Signal to noise ratio: SNR from 0 to 10, Frequency from 0 to 600
(b) Compression ratio: compression ratio from 0 to 200, Frequency from 0 to 35
(c) LBP: entropy of LBP from 1 to 5, Frequency from 0 to 300
(d) Edge count: edge count from 0 to 50000, Frequency from 0 to 400

Figure 12.9: Ham and spam distributions for standard dataset

[Figure: AUC (0.0 to 1.0) for each of 21 features]

Features: Comp, Aspect, Edges, Edgelen, SNR, Noise, LBP, Color, HOG, Mean1, Mean2, Mean3, Variance1, Variance2, Variance3, Skew1, Skew2, Skew3, Kurtosis1, Kurtosis2, Kurtosis3

Bar labels: 0.95, 0.77, 0.87, 0.65, 0.95, 0.63, 0.85, 0.58, 0.81, 0.50, 0.50, 0.50, 0.96, 0.98, 0.97, 0.91, 0.95, 0.94, 0.92, 0.94, 0.94

Figure 12.10: AUC for individual features

[Figure: AUC (0.94 to 1.00) and Accuracy (0.92 to 0.98) against number of features selected (1 to 21); legend: AUC, Accuracy]

Figure 12.11: RFE results for standard dataset

[Figure: SVM weight (0.0 to 1.0) for each of 21 features: Comp, Aspect, Edges, Edgelen, SNR, Noise, LBP, Color, HOG, Mean1, Mean2, Mean3, Variance1, Variance2, Variance3, Skew1, Skew2, Skew3, Kurtosis1, Kurtosis2, Kurtosis3]

Figure 12.12: Linear SVM weights for standard dataset

[Figure: AUC (0.94 to 1.00) against number of features (1 to 21); legend: RFE features, Ranked features]

Figure 12.13: RFE vs ranked features

[Figure: four panels plotting Frequency for Ham and Spam images]

(a) Signal to noise ratio: SNR from 0.5 to 4.5, Frequency from 0 to 14
(b) Compression ratio: compression ratio from 20 to 100, Frequency from 0 to 18
(c) LBP: entropy of LBP from 2.0 to 4.0, Frequency from 0 to 16
(d) Edge count: edge count from 3000 to 18000, Frequency from 0 to 14

Figure 12.14: Ham and spam distributions for improved dataset

[Figure: AUC (0.0 to 1.0) for each of 21 features: Comp, Aspect, Edges, Edgelen, SNR, Noise, LBP, Color, HOG, Mean1, Mean2, Mean3, Variance1, Variance2, Variance3, Skew1, Skew2, Skew3, Kurtosis1, Kurtosis2, Kurtosis3; legend: Standard dataset, Improved dataset]

Figure 12.15: Comparison of standard and improved datasets

Chapter 13

Combination  k=2     k=3     k=4     k=5     k=6     k=7     k=8     k=9     k=10    k=11    k=12    k=13    k=14    k=15
0000001      0.6714  0.6779  0.6822  0.6954  0.6959  0.6965  0.7001  0.7169  0.7245  0.7245  0.7219  0.7264  0.7304  0.7304
0000011      0.6132  0.6399  0.6559  0.6569  0.6613  0.6814  0.6823  0.6863  0.6863  0.7020  0.7054  0.7218  0.7306  0.7306
0000010      0.5650  0.6523  0.6849  0.6849  0.6958  0.6809  0.7160  0.7155  0.7250  0.7248  0.7256  0.7337  0.7450  0.7538
0000110      0.5591  0.5748  0.5745  0.5930  0.5961  0.5962  0.5962  0.5967  0.5967  0.5991  0.6036  0.6036  0.6011  0.6044
0000111      0.5657  0.6808  0.6874  0.6878  0.6803  0.7002  0.7015  0.7172  0.7172  0.7248  0.7255  0.7969  0.7359  0.8020
0000101      0.5598  0.5604  0.6157  0.5607  0.6387  0.6389  0.6444  0.6359  0.6430  0.6427  0.6530  0.6498  0.6610  0.6664
0000100      0.5656  0.6618  0.6668  0.6604  0.6768  0.6900  0.7017  0.7018  0.7213  0.7263  0.7018  0.7190  0.7336  0.7120
0001100      0.6769  0.6981  0.7165  0.7041  0.7715  0.7606  0.7683  0.7600  0.7636  0.7768  0.7718  0.7707  0.7703  0.7766
0001101      0.6767  0.6841  0.6836  0.7818  0.7774  0.7775  0.7753  0.7779  0.7759  0.7357  0.7755  0.7357  0.7510  0.7383
0001111      0.6779  0.6965  0.6842  0.7783  0.7110  0.7115  0.7686  0.7047  0.7712  0.7710  0.7703  0.7757  0.7794  0.7837
0001110      0.6769  0.6838  0.6836  0.7575  0.7101  0.7844  0.7689  0.7773  0.7683  0.7823  0.7719  0.7824  0.7727  0.7830
0001010      0.6769  0.6844  0.6788  0.7061  0.7737  0.7058  0.7733  0.7747  0.7727  0.7782  0.7746  0.7746  0.7702  0.7689
0001011      0.6768  0.6844  0.7637  0.7042  0.7838  0.7047  0.7837  0.7834  0.7834  0.7784  0.7823  0.7774  0.7812  0.7800
0001001      0.6777  0.6854  0.6858  0.7120  0.7879  0.7061  0.7818  0.7091  0.7842  0.7848  0.7850  0.7865  0.7798  0.7851
0001000      0.6763  0.6836  0.7644  0.6850  0.7871  0.7058  0.7852  0.7101  0.7862  0.7873  0.7833  0.7875  0.7724  0.7875
0011000      0.5591  0.5809  0.5606  0.5604  0.6394  0.6401  0.6009  0.6103  0.6103  0.6532  0.6531  0.6155  0.6535  0.6531
0011001      0.6768  0.6774  0.6837  0.6915  0.7047  0.7049  0.7115  0.7134  0.7068  0.7079  0.7356  0.7347  0.7052  0.7346
0011011      0.5589  0.6125  0.6073  0.5590  0.6446  0.6448  0.6467  0.6182  0.6398  0.7019  0.6269  0.6478  0.6928  0.7633
0011010      0.5650  0.6630  0.6694  0.6806  0.6899  0.6900  0.7005  0.7006  0.7032  0.7005  0.7224  0.7224  0.7068  0.7213
0011110      0.5591  0.5718  0.5626  0.5630  0.5830  0.5834  0.5834  0.5936  0.5936  0.5936  0.6117  0.5931  0.6009  0.6022
0011111      0.5657  0.6796  0.6894  0.6927  0.7052  0.7138  0.7102  0.7149  0.7156  0.7329  0.7073  0.7166  0.7382  0.7360
0011101      0.5589  0.5602  0.6034  0.5871  0.5872  0.6415  0.6415  0.6467  0.6515  0.6515  0.7011  0.6472  0.7020  0.7019
0011100      0.5656  0.6640  0.6705  0.6700  0.6918  0.6920  0.6920  0.7031  0.6958  0.7095  0.7093  0.7105  0.7146  0.7106
0010100      0.6768  0.6842  0.7036  0.7109  0.7798  0.6995  0.7710  0.7727  0.7712  0.7787  0.7771  0.7734  0.7784  0.7770
0010101      0.6768  0.6844  0.7652  0.7049  0.7850  0.7042  0.7841  0.7150  0.7841  0.7096  0.7779  0.7201  0.7792  0.7129
0010111      0.6777  0.6851  0.6772  0.7120  0.7884  0.7056  0.7815  0.7087  0.7847  0.7801  0.7846  0.7844  0.7887  0.7844
0010110      0.6759  0.6827  0.7657  0.6849  0.7873  0.7050  0.7851  0.7088  0.7847  0.7161  0.7862  0.7741  0.7784  0.7879
0010010      0.6768  0.6774  0.6847  0.7043  0.7001  0.7996  0.7000  0.7847  0.7141  0.7839  0.7939  0.7999  0.7939  0.7759
0010011      0.6767  0.6840  0.6845  0.7045  0.7092  0.7924  0.7001  0.7879  0.7128  0.7873  0.7119  0.8015  0.7834  0.7791
0010001      0.6773  0.6779  0.6854  0.7068  0.7068  0.8048  0.7008  0.7905  0.7084  0.7916  0.8038  0.8102  0.8032  0.7511
0010000      0.6754  0.6820  0.6820  0.7050  0.7058  0.7878  0.7059  0.7880  0.7174  0.7917  0.7178  0.7917  0.7879  0.7812
0110000      0.5658  0.6808  0.6940  0.7125  0.7125  0.7149  0.7104  0.7142  0.7142  0.7011  0.7156  0.7141  0.7014  0.7152
0110001      0.5658  0.6800  0.6837  0.6961  0.7195  0.7166  0.7181  0.7202  0.7322  0.7309  0.7373  0.7319  0.7279  0.7301
0110011      0.5656  0.6574  0.6847  0.6805  0.7046  0.7047  0.7077  0.7084  0.7120  0.7242  0.7231  0.7254  0.7266  0.7370
0110010      0.5658  0.6797  0.6849  0.6849  0.7069  0.7070  0.7143  0.7170  0.7160  0.7137  0.7301  0.7282  0.7346  0.7286
0110110      0.5658  0.6888  0.6895  0.7027  0.7169  0.7169  0.7160  0.7166  0.7168  0.7359  0.7168  0.7359  0.7154  0.7174
0110111      0.5658  0.6878  0.6886  0.6937  0.7022  0.7020  0.7156  0.7010  0.7155  0.7237  0.7445  0.7197  0.7154  0.7448
0110101      0.5658  0.6610  0.6617  0.6817  0.6888  0.6878  0.7005  0.7024  0.7105  0.7105  0.7184  0.7336  0.7177  0.7161
0110100      0.5658  0.6799  0.6805  0.6809  0.7017  0.7017  0.7018  0.7018  0.7187  0.7106  0.7264  0.7268  0.7346  0.7291
0111100      0.6765  0.6837  0.7099  0.7780  0.7730  0.7728  0.7741  0.7354  0.7741  0.7377  0.7797  0.7359  0.7794  0.7514
0111101      0.6767  0.6840  0.6837  0.7507  0.7346  0.7511  0.7331  0.7612  0.7374  0.7776  0.7373  0.7796  0.7315  0.7779
0111111      0.6774  0.6846  0.6837  0.7848  0.7105  0.7783  0.7727  0.7815  0.7755  0.7814  0.7753  0.7826  0.7773  0.7829
0111110      0.6760  0.6833  0.6799  0.6928  0.7055  0.7143  0.7339  0.7333  0.7332  0.7764  0.7377  0.7826  0.7429  0.7824
0111010      0.6765  0.6837  0.6900  0.7010  0.7716  0.7002  0.7703  0.7161  0.7755  0.7190  0.7730  0.7751  0.7757  0.7861
0111011      0.6767  0.6835  0.7551  0.7045  0.7727  0.7029  0.7732  0.7193  0.7778  0.7555  0.7857  0.7860  0.7815  0.7860
0111001      0.6770  0.6842  0.6837  0.7065  0.7821  0.7058  0.7820  0.7158  0.7839  0.7839  0.7839  0.7853  0.7778  0.7837
0111000      0.5656  0.6824  0.6840  0.7047  0.7055  0.7056  0.7059  0.7161  0.7803  0.7723  0.7719  0.7776  0.7234  0.7709
0101000      0.5658  0.6904  0.6911  0.7101  0.7102  0.7158  0.7019  0.7155  0.7173  0.7152  0.7166  0.7106  0.7106  0.7127
0101001      0.5658  0.6883  0.6891  0.6809  0.7015  0.7140  0.7140  0.7014  0.7140  0.7111  0.7143  0.7087  0.7223  0.7300
0101011      0.5658  0.6760  0.6767  0.6992  0.6992  0.6981  0.7001  0.7022  0.7081  0.7081  0.7102  0.7261  0.7166  0.7283
0101010      0.5658  0.6796  0.6796  0.6808  0.7006  0.7095  0.7008  0.7008  0.7142  0.7090  0.7209  0.7190  0.7213  0.7266
0101110      0.5659  0.6870  0.6941  0.7115  0.7116  0.7106  0.7168  0.7169  0.7169  0.7593  0.7169  0.7165  0.7587  0.7158
0101111      0.5658  0.6886  0.6891  0.6933  0.7149  0.7129  0.7154  0.7220  0.7222  0.7387  0.7183  0.7387  0.7388  0.7457
0101101      0.5658  0.6719  0.6787  0.7002  0.7002  0.7002  0.7004  0.7010  0.7095  0.7086  0.7109  0.7129  0.7255  0.7175
0101100      0.5658  0.6800  0.6808  0.6822  0.7018  0.7017  0.7161  0.7027  0.7110  0.7257  0.7114  0.7257  0.7275  0.7278
0100100      0.6765  0.6847  0.6786  0.7040  0.7796  0.6999  0.7747  0.7064  0.7751  0.7059  0.7924  0.7123  0.7798  0.7224
0100101      0.6763  0.6835  0.6920  0.7040  0.7812  0.7026  0.7814  0.7164  0.7807  0.7152  0.7794  0.7155  0.7815  0.7714
0100111      0.6770  0.6842  0.6837  0.7116  0.7869  0.7059  0.7820  0.7115  0.7852  0.7122  0.7851  0.7106  0.7816  0.7424
0100110      0.6753  0.6804  0.6836  0.7038  0.7042  0.7046  0.7727  0.7068  0.7735  0.7169  0.7816  0.7734  0.7789  0.7761
0100010      0.5658  0.6850  0.6855  0.7047  0.6991  0.7114  0.6996  0.7785  0.7134  0.7816  0.7143  0.7843  0.7952  0.7773
0100011      0.5656  0.6844  0.6849  0.7046  0.7147  0.7835  0.7004  0.7837  0.7015  0.7979  0.7127  0.7943  0.7961  0.7874
0100001      0.5658  0.6833  0.6840  0.7063  0.7008  0.7159  0.7006  0.7787  0.7083  0.7825  0.7149  0.7885  0.7848  0.7202
0100000      0.5656  0.6804  0.6811  0.7040  0.7004  0.7794  0.7011  0.7793  0.7150  0.7853  0.7168  0.7917  0.7844  0.7759
1100000      0.6767  0.6838  0.7002  0.7764  0.7721  0.7669  0.7643  0.7953  0.7964  0.7950  0.7988  0.7983  0.7803  0.7807
1100001      0.6759  0.6832  0.6799  0.7615  0.7784  0.7816  0.7624  0.7682  0.7650  0.7694  0.7603  0.7696  0.7605  0.7706
1100011      0.6713  0.6750  0.6836  0.6845  0.7078  0.7791  0.7753  0.7753  0.7739  0.7828  0.7741  0.7766  0.7769  0.7792
1100010      0.6722  0.6796  0.6838  0.6799  0.7041  0.7818  0.7812  0.7636  0.7619  0.7706  0.7703  0.7732  0.7793  0.7720
1100110      0.6767  0.6850  0.6856  0.7051  0.6996  0.7969  0.7001  0.7848  0.7688  0.7825  0.8015  0.7805  0.8005  0.7833
1100111      0.5656  0.6814  0.6805  0.6849  0.7042  0.7014  0.7058  0.7974  0.7055  0.7965  0.7134  0.8037  0.7956  0.7693
1100101      0.5658  0.6769  0.6776  0.7000  0.7054  0.7054  0.7031  0.8070  0.7837  0.7735  0.8029  0.7891  0.8212  0.7693
1100100      0.5656  0.6796  0.6842  0.6849  0.7834  0.7022  0.7069  0.7756  0.7058  0.7797  0.7168  0.7800  0.7806  0.7701
1101100      0.6767  0.6838  0.7102  0.7123  0.7829  0.7759  0.7753  0.7756  0.7756  0.7824  0.7888  0.7706  0.7934  0.7803
1101101      0.6768  0.6841  0.6837  0.7847  0.7814  0.7821  0.7768  0.7766  0.7771  0.7697  0.7773  0.7701  0.7816  0.7734
1101111      0.6774  0.6947  0.6837  0.7887  0.7861  0.7054  0.7796  0.7783  0.7853  0.7784  0.7943  0.7962  0.7762  0.8012
1101110      0.6772  0.6844  0.6836  0.7924  0.7830  0.7830  0.7765  0.7837  0.7760  0.7801  0.7798  0.7847  0.7889  0.7806
1101010      0.6767  0.6842  0.7052  0.7834  0.7278  0.7824  0.7762  0.7734  0.7750  0.7851  0.7818  0.7761  0.7748  0.7784
1101011      0.6768  0.6844  0.6851  0.7045  0.7867  0.7842  0.7883  0.7737  0.7725  0.7803  0.7723  0.7802  0.7757  0.7655
1101001      0.6774  0.6847  0.6837  0.7120  0.7928  0.7842  0.7839  0.7867  0.7894  0.7797  0.7843  0.7793  0.7842  0.7782
1101000      0.6772  0.6844  0.6806  0.7065  0.7885  0.7859  0.7871  0.7851  0.7764  0.7814  0.7780  0.7885  0.7928  0.7859
1111000      0.6776  0.6850  0.6855  0.7056  0.7073  0.7002  0.7032  0.7956  0.7689  0.7984  0.7009  0.7984  0.6997  0.7809
1111001      0.6738  0.6815  0.6801  0.7029  0.7009  0.7008  0.7989  0.7925  0.7983  0.7983  0.7675  0.7967  0.7104  0.7741
1111011      0.6714  0.6782  0.6788  0.7013  0.7050  0.7050  0.8012  0.7095  0.8044  0.8062  0.7084  0.7223  0.7155  0.7218
1111010      0.5652  0.6795  0.6801  0.6808  0.7029  0.7037  0.7873  0.7812  0.7924  0.7917  0.7700  0.7911  0.7714  0.7869
1111110      0.5658  0.6776  0.6849  0.7059  0.6992  0.6993  0.7008  0.7152  0.8160  0.7005  0.8152  0.8152  0.8019  0.8157
1111111      0.5656  0.6799  0.6804  0.6846  0.7145  0.8040  0.7013  0.7152  0.8114  0.7149  0.8112  0.7278  0.8112  0.8111
1111101      0.5658  0.6722  0.6796  0.7010  0.7010  0.7002  0.7059  0.7047  0.7997  0.7042  0.8023  0.7181  0.7125  0.8062
1111100      0.5656  0.6797  0.6804  0.6847  0.7059  0.7948  0.7014  0.7019  0.7142  0.7069  0.8033  0.7150  0.7967  0.8011
1110100      0.6767  0.6841  0.7052  0.7869  0.7074  0.7815  0.7768  0.7732  0.7774  0.7770  0.7796  0.7794  0.7673  0.7792
1110101      0.6768  0.6844  0.6847  0.7045  0.7871  0.7844  0.7888  0.7843  0.7885  0.7778  0.7775  0.7778  0.7773  0.7771
1110111      0.6774  0.6846  0.6837  0.7120  0.7124  0.7834  0.7775  0.7843  0.7811  0.7810  0.7811  0.7814  0.7810  0.7812
1110110      0.6770  0.6841  0.6799  0.7060  0.7885  0.7855  0.7894  0.7848  0.7887  0.7768  0.7814  0.7848  0.7885  0.7842
1110010      0.6767  0.6841  0.6846  0.7049  0.7942  0.6997  0.7938  0.6996  0.7802  0.7802  0.7796  0.7802  0.7770  0.7796
1110011      0.6767  0.6844  0.7747  0.7045  0.7891  0.7001  0.7819  0.7055  0.7860  0.7848  0.7800  0.7810  0.7776  0.7805
1110001      0.6774  0.6849  0.6855  0.7068  0.7898  0.7060  0.7883  0.6996  0.7893  0.7996  0.7848  0.7864  0.7801  0.7884
1110000      0.6768  0.6841  0.6847  0.7061  0.7941  0.7055  0.7879  0.7027  0.7855  0.7888  0.7802  0.7888  0.7843  0.7853
1010000      0.6776  0.6844  0.7049  0.7702  0.7598  0.7735  0.7591  0.7721  0.7624  0.7489  0.7807  0.7538  0.7792  0.7524
1010001      0.6750  0.6809  0.6814  0.7020  0.7020  0.7571  0.7589  0.7574  0.7614  0.7614  0.7611  0.7637  0.7551  0.7637
1010011      0.6717  0.6788  0.6836  0.7046  0.7045  0.7086  0.7105  0.7136  0.7710  0.7643  0.7588  0.7756  0.7596  0.7671
1010010      0.6723  0.6794  0.6795  0.6799  0.7005  0.7074  0.7511  0.7628  0.7568  0.7574  0.7674  0.7560  0.7686  0.7696
1010110      0.5658  0.6841  0.7524  0.7055  0.7125  0.7047  0.7136  0.7721  0.7687  0.7748  0.7213  0.7698  0.7545  0.7747
1010111      0.5658  0.6801  0.6803  0.7041  0.7036  0.7061  0.7067  0.7538  0.7815  0.7815  0.7175  0.7818  0.7175  0.7687
1010101      0.5658  0.6788  0.6840  0.7006  0.7058  0.7055  0.7020  0.7775  0.7101  0.7718  0.7609  0.7807  0.7915  0.7694
1010100      0.5658  0.6794  0.6813  0.6815  0.7033  0.7061  0.7058  0.7058  0.7105  0.7138  0.7140  0.7128  0.7191  0.7742
1011100      0.6765  0.6838  0.7050  0.7775  0.7721  0.7733  0.7725  0.7747  0.7719  0.7891  0.7769  0.7730  0.7724  0.7974
1011101      0.6767  0.6840  0.7042  0.7794  0.7691  0.7787  0.7729  0.7820  0.7770  0.7814  0.7770  0.7783  0.7776  0.7534
1011111      0.6774  0.6846  0.6846  0.7867  0.7843  0.7787  0.7743  0.7759  0.7782  0.7774  0.7783  0.7551  0.7958  0.7956
1011110      0.6770  0.6841  0.6837  0.7814  0.7756  0.7756  0.7756  0.7748  0.7768  0.7773  0.7785  0.7855  0.7762  0.7818
1011010      0.6767  0.6838  0.6800  0.7004  0.7773  0.7798  0.7805  0.7738  0.7785  0.7732  0.7743  0.7766  0.7784  0.7816
1011011      0.6767  0.6838  0.7574  0.7040  0.7824  0.7010  0.7856  0.7721  0.7847  0.7733  0.7756  0.7138  0.7756  0.7618
1011001      0.6774  0.6846  0.6837  0.7061  0.7844  0.7807  0.7828  0.7805  0.7788  0.7773  0.7819  0.7652  0.7823  0.7669
1011000      0.6768  0.6841  0.7559  0.7055  0.7851  0.7825  0.7724  0.7814  0.7793  0.7101  0.7803  0.7702  0.7857  0.7759
1001000      0.5658  0.6837  0.6783  0.7050  0.7125  0.7038  0.7050  0.7210  0.7765  0.7869  0.7727  0.7768  0.7091  0.7727
1001001      0.5658  0.6794  0.6817  0.7018  0.7029  0.7063  0.7830  0.7054  0.7823  0.7820  0.7714  0.7824  0.7170  0.7715
1001011      0.5658  0.6790  0.6842  0.7005  0.6990  0.7059  0.7844  0.7092  0.7097  0.7885  0.7727  0.7885  0.7646  0.7727
1001010      0.5658  0.6795  0.6818  0.7004  0.7033  0.7061  0.7768  0.7056  0.7159  0.7159  0.7718  0.7161  0.7748  0.7779
1001110      0.5658  0.6833  0.6838  0.7049  0.7001  0.7987  0.7008  0.7143  0.7277  0.7010  0.8158  0.7175  0.8011  0.7138
1001111      0.5658  0.6799  0.6804  0.7026  0.7024  0.7065  0.7068  0.7068  0.7058  0.7145  0.8025  0.7265  0.8114  0.7355
1001101      0.5658  0.6790  0.6796  0.7009  0.7001  0.7938  0.7010  0.7042  0.7110  0.7096  0.7213  0.7179  0.7186  0.7170
1001100      0.5658  0.6795  0.6801  0.6800  0.7010  0.7037  0.7011  0.7069  0.7102  0.7197  0.7915  0.7049  0.7997  0.7165
1000100      0.6767  0.6838  0.7001  0.7766  0.7019  0.7803  0.7748  0.7727  0.7789  0.7759  0.7810  0.7773  0.7787  0.7821
1000101      0.6767  0.6838  0.6859  0.7043  0.7832  0.7812  0.7823  0.7741  0.7856  0.7759  0.7771  0.7756  0.7771  0.7789
1000111      0.6774  0.6846  0.6837  0.7061  0.7841  0.7824  0.7826  0.7818  0.7802  0.7766  0.7830  0.7797  0.7832  0.7843
1000110      0.6768  0.6833  0.6799  0.7056  0.7860  0.7829  0.7834  0.7820  0.7873  0.7757  0.7861  0.7839  0.7869  0.7684
1000010      0.6767  0.6850  0.6854  0.7050  0.7825  0.7046  0.7770  0.6997  0.7818  0.7771  0.7801  0.7821  0.7756  0.7785
1000011      0.6767  0.6837  0.7666  0.7045  0.7834  0.6995  0.7800  0.7105  0.7776  0.7785  0.7778  0.7787  0.7800  0.7782
1000001      0.6770  0.6842  0.7647  0.7067  0.7866  0.7051  0.7847  0.7024  0.7832  0.7791  0.7824  0.7869  0.7760  0.7906
1000000      0.5656  0.6833  0.7655  0.7056  0.7054  0.7018  0.7009  0.7079  0.7819  0.7824  0.7033  0.7824  0.7104  0.7775

Figure 13.1: Heatmap of HMM scores (Gray code order)
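
The row labels of this heatmap (0000001, 0000011, 0000010, 0000110, ... , 1000000) are consistent with the standard binary-reflected Gray code on seven bits, with the all-zero combination omitted. The sketch below generates that sequence under this assumption; the function name `gray_code_labels` is hypothetical and not from the book.

```python
def gray_code_labels(n_bits=7):
    """Return n-bit reflected Gray code labels for i = 1 .. 2**n_bits - 1.

    The i-th Gray codeword is i XOR (i >> 1); consecutive codewords
    differ in exactly one bit position.
    """
    return [format(i ^ (i >> 1), "0{}b".format(n_bits))
            for i in range(1, 2 ** n_bits)]


labels = gray_code_labels()
print(len(labels))   # number of non-empty 7-bit combinations
print(labels[:4])    # first few labels in Gray code order
```

Because neighboring rows differ in a single bit, adjacent rows of the heatmap correspond to combinations that differ by including or excluding exactly one component.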

[Figure: four panels, each plotting purity score (0.00 to 1.00) against number of clusters (2 to 10); legend: 𝐾-means clustering, EM clustering]

(a) 2-dimensional (b) 3-dimensional (c) 4-dimensional (d) 5-dimensional

Figure 13.2: Purity scores for EM and 𝐾-means clustering
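
As a reminder of what a purity score measures, here is a minimal sketch assuming the common definition: each cluster is credited with its most frequent true label, and purity is the fraction of all samples that carry their cluster's majority label. The function name `purity` and the toy labels are illustrative assumptions; the book's exact definition may differ in detail.

```python
from collections import Counter


def purity(cluster_ids, true_labels):
    """Fraction of samples that match the majority label of their cluster."""
    per_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    # Sum the size of the largest label group within each cluster.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(true_labels)


# Toy example: cluster 0 is pure, cluster 1 is split between two labels.
score = purity([0, 0, 1, 1], ["a", "a", "b", "a"])
print(score)
```

Under this definition a score of 1.0 means every cluster contains samples of a single class, and the score tends to rise as the number of clusters grows, since smaller clusters are easier to keep pure.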

[Figure: two stem plots of purity score (0.60 to 1.00) against dimensions (2 to 5) and clusters (2 to 10)]

(a) 𝐾-means clustering (b) EM clustering

Figure 13.3: Clustering stem plots