TRANSCRIPT
UNIVERSITY OF ZAGREB FACULTY OF ELECTRICAL ENGINEERING AND COMPUTING
(FER)
GRADUATE THESIS no. 1216
MALWARE DETECTION SOFTWARE USING
A SUPPORT VECTOR MACHINE AS A CLASSIFIER
Nicole Bilić
Zagreb, July 2016.
Table of contents
1. Introduction
2. Malware
2.1. Adware
2.2. Spyware
2.3. Virus
2.4. Worm
2.5. Trojan
2.6. Rootkit
2.7. Backdoors
2.7.1. Object code backdoors
2.7.2. Asymmetric backdoors
2.7.3. Compiler backdoors
2.8. Keyloggers
2.9. Rogue security software
2.10. Ransomware
2.11. Browser Hijacker
3. Machine learning and classification
3.1. Supervised machine learning
3.1.1. General steps
3.2. Model selection
3.3. Evaluation of binary classifiers
3.4. Support vector machine, SVM
3.4.1. Maximal margin problem
3.5. Principal component analysis, PCA
3.6. Linear discriminant analysis, LDA
4. Dataset processing
5. Malware detection software
6. Results of the classifiers evaluation
6.1. Results obtained for the polynomial kernel, training data
6.2. Results obtained for the polynomial kernel, validation data
7. Conclusion
8. Literature
9. Abstract
10. Sažetak
1. Introduction
Malware detection software is inevitable today, as with the Internet era the number of types of malware, as well as the total number of known malware samples, has drastically increased. The Internet is used as a transport medium to deliver malware to end users, most of whom are standard users without deep IT knowledge, making them an easy target for attackers. Each malware has its purpose, ranging from merely annoying ads popping up in the user's browser to encrypting all of the user's documents on the infected computer and asking for a ransom. Therefore, malware detection (and protection) software is an obligatory accessory to the operating system.
Machine learning has proved itself a useful and good solution to optimization problems that either have no standard solution or would require a lot of computational power to be solved in a standard manner. Standard malware detection software has a database containing all the known malware and has to be updated regularly, as new malware is delivered and discovered every day.
The idea to combine machine learning with malware detection is based on the following facts:
a) The dataset, in this case, is almost infinite: all malware and virus-free software can be used. Taking into consideration that this dataset is expanding every day, it is possible to obtain a very large dataset, giving a better opportunity to find the correct decision function.
b) Once fitted, the classifier would be able to correctly predict whether a given file is malware or virus-free, i.e. new malware would be detected by this software even though it has never been seen before.
As most malware comes in an executable form, the analysis here is reduced to static analysis of the disassembled executable, assuming that there is a correlation between the data and the results of the classification, i.e. there is a decision function that correctly separates malware from virus-free software.
In this paper, the whole process from dataset processing to the final product, i.e. the malware detection software, is described, as well as the theoretical background. The paper starts with a chapter dedicated to malware in general, giving the general definition of malware. As the full list of malware types is very long, the rest of the paper focuses on those which make up the majority, i.e. over 99%, of all malware, with emphasis on the properties specific to each malware type. The third chapter is dedicated to machine learning and classification. It explains and defines some important general terms, such as machine learning itself, supervised machine learning, model selection, evaluation of binary classifiers, the support vector machine (SVM), etc., giving a good theoretical background for understanding how the final product works. While the emphasis in this chapter is on the idea behind the SVM, including the maximal margin problem and its dual form, the Lagrange multipliers method, etc., two dimension reduction algorithms also found their place in the third chapter: Principal component analysis (PCA) and Linear discriminant analysis (LDA), as a possibility of removing the noise from the data and eventually improving the results of the classification. Furthermore, the fourth chapter gives an overview and explanation of the dataset processing. The process is divided into five steps which thoroughly explain the way the dataset is processed, from an executable file to a matrix which, later on, is an input for the classifier. Also, an overview of the dataset division is given, with the exact percentages of the dataset that belong to the train, validation and test sets. The fifth chapter describes the final product and the practical part of this thesis, the malware detection software. It contains general guidelines for the end user, as well as a step-by-step explanation of what is happening in the background.
2. Malware
Malware or malicious software is any software used to gather sensitive information, gain access to private computer systems, display unwanted advertising or in any way disrupt computer operations [1]. It should not be confused with software that causes unintentional damage as a side effect of some deficiency. Software is classified as malicious for its malicious intention towards users or their computers, which draws the line between malware and badware.
Figure 2.1. Malware by categories [1]
Since the list of all malware types is long and practically endless, I will focus on the several most common types, classified by the general categories of infection.
2.1. Adware
Adware is short for advertising-supported software. This type of malware automatically delivers advertisements. A typical example of this type of malware often comes as a part of free versions of software or applications, sponsored by advertisers as a source of income. Pop-up ads on various websites belong to this type of malware as well. Although the intention of this malware type is solely advertising and it is considered to be the least dangerous type, in some cases adware comes bundled with spyware, which can lead to a breach of privacy of user data and similar.
2.2. Spyware
As its name suggests, spyware is a type of malware that uses different techniques to spy on the user's activity without their knowledge. However, the term spying, in this case, extends well beyond simple monitoring. Spyware can collect almost any type of data, ranging from personal information, such as internet surfing habits, to user logins and bank/credit account information. Furthermore, spyware is also capable of interfering with the user's control of a computer by installing additional software or redirecting web browsers. The most advanced spyware is even capable of changing computer settings, resulting in slow connection speed, unauthorized changes in browser settings etc.
Spyware spreads either by exploiting software vulnerabilities, by bundling itself with legitimate software, or in Trojans. The most common purpose of spyware is collecting the user's data in order to determine "targeted" advertisement impressions.
2.3. Virus
The main characteristic of this type of malware is its capability of copying itself and spreading to other computers, doing so without user consent. When executed, this type of malware replicates itself by inserting possibly modified copies of itself into other computer programs, data files or even into the boot sector of the hard drive. After the successful replication, viruses often perform harmful activities on the infected host, such as stealing hard disk space or CPU time, accessing private information, corrupting data, displaying political or humorous messages on the user's screen, spamming their contacts, logging their keystrokes or even rendering the computer useless [20]. However, most of the computer viruses known today target systems running the Microsoft Windows OS, using stealth strategies to avoid antivirus protection software. Taking into consideration the fact that computer viruses cause billions of Euros worth of damage every year [3] by wasting computer resources, causing system failures and corrupting data, the motives for creating this type of malware are clear: personal data theft, profit, sending political messages, amusement, sabotage and similar. Luckily, a lot of antivirus software is freely available today. However, none of it can detect all viruses with 100% accuracy, due to the constant development of new viruses.
2.4. Worm
A worm is a standalone malware computer program that replicates itself in order to spread to other computers using a computer network, exploiting security failures on the target computer to access it. It is often confused with a computer virus, however there is a big difference: a worm does not need to attach itself to an existing program. Furthermore, worms almost always cause some damage to the network, at least by consuming bandwidth, whereas viruses almost always corrupt or modify files on a targeted computer [21]. Even if their only goal is to spread through the network, not changing anything in the systems they pass through, even "payload free" worms can cause a lot of damage by increasing network traffic or causing major disruption. A "payload" is code in the worm that, other than spreading, does other harmful things such as data deletion, data encryption, sending documents via email etc. One of the common purposes of a payload worm is to install a backdoor in the infected computer to allow the creation of a "zombie" computer under the control of the worm author [21]. Other malicious purposes are money extortion in a so-called "ransomware attack" or blackmailing companies by threatening them with a DoS attack.
2.5. Trojan
A Trojan horse is a type of malware that misrepresents itself to appear useful, routine or interesting in order to persuade a victim to install it [22]. A Trojan's payload usually acts like a backdoor and is not easily detectable. However, it causes changes in the computer's behaviour in a way that it becomes slow due to heavy processor or network usage. The main difference from worms and viruses is the fact that Trojans usually do not try to inject themselves into other files or propagate themselves further in any other way. The purposes and uses can be divided into the following categories [22]:
a) Destructive: crashing the computer or device, modification or deletion of files, data corruption, formatting disks, destroying all contents, spreading malware across the network, spying on user activities
b) Use of resources or identity: use of the machine as part of a botnet, using computer resources for mining cryptocurrencies, using the infected computer as a proxy for illegal activities and/or attacks on other computers, infecting other connected devices on the network
c) Money theft, ransom: electronic money theft
d) Data theft: industrial espionage, user passwords or payment card information, user personally identifiable information, trade secrets
e) Spying, surveillance or stalking: keystroke logging, watching the user's screen, viewing the user's webcam, controlling the computer system remotely
2.6. Rootkit
A rootkit is a collection of usually malicious computer software designed to enable access to a computer or areas of its software that would not otherwise be allowed, while at the same time masking its existence or the existence of other software [23]. Usually the first step is obtaining root or Administrator access, which is done by exploiting a known vulnerability or a password (obtained through cracking or social engineering). Rootkits are usually capable of hiding their intrusion by subverting the software intended to find them, while maintaining privileged access. Therefore, the removal can be practically impossible, especially when the kernel is infected by a rootkit or when dealing with firmware rootkits. Often, a complete reinstallation of the operating system is required, or even hardware replacement in the case of firmware rootkits. However, modern rootkits are used to add stealth capabilities in order to make the payload of other software undetectable, rather than to elevate access. Malicious rootkits and their payloads can have one of the following uses:
1. Provide an attacker with full access via a backdoor, permitting unauthorized access to steal or falsify documents.
2. Conceal other malware, for example password-stealing keyloggers and computer viruses.
3. Use the compromised machine as a zombie computer for attacks on other computers.
2.7. Backdoors
A backdoor in a computer system is a method of bypassing normal authentication and securing remote access to a computer while attempting to remain undetected. It can take the form of an installed program or could be a modification to an existing program or hardware device [4]. The threat of backdoors surfaced when multi-user and networked operating systems became widely adopted. The most common types of backdoors are:
2.7.1. Object code backdoors
Object code backdoors involve modifying the object code instead of the source code, which makes them hard to detect. Object code is not in a human-readable form but rather in a machine-readable one, which makes it hard to inspect. However, this type of backdoor is easily detectable by checking for differences, notably in length or checksum, and in some cases by disassembling the object code. Furthermore, object code backdoors are easy to remove by simply recompiling from source.
2.7.2. Asymmetric backdoors
As opposed to traditional symmetric backdoors, which allow anyone who finds the backdoor to use it, asymmetric backdoors enable only the attacker who planted them to use them.
2.7.3. Compiler backdoors
Compiler backdoors are a form of black box backdoor, where not only is a compiler subverted (to insert a backdoor into some other program), but it is further modified to detect when it is compiling itself and then inserts both the backdoor insertion code and the code modifying self-compilation, similar to the mechanism by which retroviruses infect their host. This can be done by modifying the source code, and the resulting compromised compiler can compile the original source code and insert itself: the exploit has been bootstrapped [24].
2.8. Keyloggers
Keylogging is the action of recording the keys struck on a keyboard, typically covertly, so that the person using the keyboard is unaware that their actions are being monitored. Numerous keylogging methods exist: they range from hardware and software-based approaches to acoustic analysis. Software-based keyloggers can, from a technical perspective, be divided into several categories [25]:
1. Hypervisor-based: the keylogger resides in a malware hypervisor underneath the OS, and therefore remains untouched
2. Kernel-based: a program on the machine obtains root access to hide itself in the OS and intercepts keystrokes that pass through the kernel
3. API-based: hooks the keyboard APIs inside a running application. The keylogger registers keystroke events as if it were a normal piece of the application instead of malware
4. Form grabbing based: logs web form submissions by recording the web browsing on submit events, after the user fills in the form and submits it
5. Memory injection based: performs its logging function by altering the memory tables associated with the browser and other system functions
6. Packet analyzers: capture network traffic associated with HTTP POST requests to retrieve unencrypted passwords
7. Remote access: local software keyloggers with an added feature that allows access to locally recorded data from a remote location.
2.9. Rogue security software
Rogue security software is a form of malicious software and internet fraud that misleads users into believing there is a virus on their computer and manipulates them into paying money for a fake malware removal tool (that possibly introduces malware to the computer) [26]. The attackers in this case mostly rely on some form of social engineering to manipulate their victim into buying or installing the antivirus software, which usually contains a Trojan horse component. After the installation, the rogue security software may try to persuade the victim into purchasing a service or additional software in various ways: displaying an animation simulating a system crash and reboot, disabling parts of the system, disabling automatic system software updates, blocking access to antimalware vendors' web pages, altering system registries and/or security settings etc.
2.10. Ransomware
Ransomware is a kind of malware that restricts access to the infected computer system in some way and demands that the user pay a ransom to the malware operators to remove the restriction [27]. Some ransomware encrypts the user's hard drive to force the user to pay the ransom in order to retrieve their data, since it is almost impossible to decrypt it on their own. Usually, ransomware uses Trojans to propagate itself and enter the victim's system, where the payload is run. Payloads may display fake warnings (for example, purportedly issued by a law enforcement agency), lock the system until the ransom is paid, or encrypt the files in such a way that only the attacker has the needed decryption key.
2.11. Browser Hijacker
Browser hijacking is a form of malware that modifies a web browser's settings without the user's permission, to inject unwanted advertising into the user's browser [28]. The average browser hijacker changes the user's home page, error page or preferred search engine in order to increase the number of hits to a particular website and so increase its advertising revenue. Usually, they come bundled with some other software as "offers", without uninstall instructions or documentation on what they do, in order to trick the user into installing them too. Some browser hijackers are not dangerous but rather annoying, and are easy to detect and uninstall. However, some of them are capable of permanently damaging the registry on Windows systems and are not so easily detectable or uninstallable.
3. Machine learning and classification
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data [5]. The way learning is done can be divided into three broad categories:
1. Supervised learning: the computer is given two sample values, one representing the input and the other representing its desired output. The computer's task is to find a general rule that maps inputs into desired outputs.
2. Unsupervised learning: the output vector is not given and the computer has to find the structure in its input alone.
3. Reinforcement learning: a computer program interacts with a dynamic environment in which it must perform a certain goal, without a teacher explicitly telling it whether it has come close to its goal [29].
Between supervised and unsupervised learning there is also the semi-supervised learning approach, where some (usually most) output values are unknown. However, in this thesis the focus will be on supervised learning, as the program learns from a dataset containing all the necessary input and expected output values.
3.1 Supervised machine learning
Supervised machine learning is the machine learning task of inferring a function from labeled training data [13]. It can be used for solving two types of tasks: classification and regression. The difference between the two is that classification associates an input with its class, while regression associates an input with some continuous value, i.e. the major difference is whether the target variable is discrete or continuous.
The set of training examples consists of pairs of instances and their labels, which can be denoted as:
$\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$
N = total number of examples in the training set
Each example x can be represented as an n-dimensional vector of features,
$x = (x_1, x_2, x_3, \ldots, x_n)$
which can be interpreted as a point in an n-dimensional vector space, the so-called input or instance space. The assumption of all machine learning algorithms is that all the examples from an input space are independent and identically distributed (i.i.d.), which means that each random variable has the same probability distribution as the others and all are mutually independent. The learning set consists of tuples of examples and their labels and can be represented as a table:
Table 3.1. Learning set as a table [6]
x_1      x_2      ...  x_n      y
x_1^(1)  x_2^(1)  ...  x_n^(1)  y^(1)
x_1^(2)  x_2^(2)  ...  x_n^(2)  y^(2)
...      ...      ...  ...      ...
x_1^(N)  x_2^(N)  ...  x_n^(N)  y^(N)
Binary or binomial classification is the task of classifying the elements of a given set into two groups on the basis of a classification rule [30]. When performing binary classification, the goal of a classification algorithm is to determine whether an input example x belongs to a class C or not, which we denote as:
h(x) = 1, x is an instance of class C
h(x) = 0, x is not an instance of class C
The set of all possible hypotheses is called the model. Learning is then the process of searching through the model in order to find an optimal hypothesis by minimizing the empirical or training error. The empirical error on the training set equals the ratio of the number of examples which were incorrectly classified to the total number of examples in that training set, N:
$E(h|D) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{h(x^{(i)}) \neq y^{(i)}\} = \frac{1}{N} \sum_{i=1}^{N} |h(x^{(i)}) - y^{(i)}|$ [6]
where $\mathbf{1}_P$ is the indicator function whose value is 1 if $P$ is true, and 0 otherwise.
In machine learning, the aim is to construct algorithms that are able to learn to predict a certain target output, so the learning algorithm is presented with some training examples that demonstrate the intended relation of input and output values. Then the learner is supposed to approximate the correct output, even for the examples that have not been shown during training. The set of assumptions that the learner uses to predict outputs when given inputs that it has not encountered is called the inductive or learning bias [31]. The inductive bias can be:
1. Restriction or language bias: we choose the model and in that way restrict the set of hypotheses that can be represented by the chosen model
2. Preference or search bias: we define the method of hypothesis search within the model, in that way giving an advantage to some hypotheses over the others
3.1.1 General steps in solving a supervised machine learning problem [13]:
1. Determine the type of training examples. Before doing anything else, the
user should decide what kind of data is to be used as a training set. In the
case of handwriting analysis, for example, this might be a single
handwritten character, an entire handwritten word, or an entire line of
handwriting.
2. Gather a training set. The training set needs to be representative of the
real-world use of the function. Thus, a set of input objects is gathered and
corresponding outputs are also gathered, either from human experts or
from measurements.
3. Determine the input feature representation of the learned function. The
accuracy of the learned function depends strongly on how the input object
is represented. Typically, the input object is transformed into a feature
vector, which contains a number of features that are descriptive of the
object. The number of features should not be too large, because of the
curse of dimensionality; but should contain enough information to
accurately predict the output.
4. Determine the structure of the learned function and corresponding
learning algorithm. For example, the engineer may choose to use support
vector machines or decision trees.
5. Complete the design. Run the learning algorithm on the gathered training
set. Some supervised learning algorithms require the user to determine
certain control parameters. These parameters may be adjusted by
optimizing performance on a subset (called a validation set) of the
training set, or via cross-validation.
6. Evaluate the accuracy of the learned function. After parameter
adjustment and learning, the performance of the resulting function should
be measured on a test set that is separate from the training set.
3.2 Model selection
Model selection is the task of selecting a statistical model from a set of candidate models, given data [14]. The selection is actually an optimization of the hyperparameters of some fixed model. However, the aim of the model is not only to perform good classification of the learning dataset examples, but to be able to generalize and perform well on previously unseen examples, i.e. to predict. The criteria for model selection are as follows [6]:
- A simple, but not too simple, model is capable of better generalization
- A simple model is easier to use and it has lower computational complexity
- A simple model is easier to train, because complex models have a lot of parameters to optimize
- A simple model is easier to understand and to extract knowledge from (for example, rules)
The aim is to choose a model whose complexity fits the complexity of the function we are trying to determine. Considering the complexity of the chosen model, there are two extremes:
1. Overfitting: if the chosen model is too complex compared to the real function, the hypotheses are too flexible. Those hypotheses will assume more information than there actually is in the given data, which leads to a loss of the generalization property. Furthermore, these hypotheses will perform well on the training set because they will adjust even to the noise in the data. However, when they need to predict the output for an input they never encountered in the learning set, their performance will be very low.
2. Underfitting: if the chosen model is too simple compared to the real function, the hypotheses will not be able to adjust to the data from the learning set and therefore will not perform well even on the learning dataset.
Figure 3.1. Underfitted vs. well-fitted vs. overfitted function [7]
Figure 3.1. shows three possible situations when it comes to model complexity:
1. Left picture, underfitting: for the given dataset represented by 2D points, the function of degree 1 is not complex enough to fit the data properly.
2. Middle picture: displays a function of degree 4, which for the given dataset is the optimal complexity, and this function fits the data best. This function is the goal function we are looking for.
3. Right picture, overfitting: the function of degree 15 displayed in this picture is too complex for the data and it adjusts to the noise in the data, which leads to low generalization power on previously unseen data.
In order to check whether the model is over- or underfitted, there is a technique called cross-validation. Cross-validation is performed by splitting the dataset into a training set, a validation set and a test set. The training set is used to train the chosen models and the validation set is used to check the generalization capability of the model in question. After the optimal model is selected, based on the validation results, the validation set and train set are united, the model is trained on both of them, and its generalization capability is tested on the test set.
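As a minimal sketch of this split (assuming scikit-learn, which is used later in this thesis; the arrays X and y are illustrative placeholders, not the thesis dataset), the train/validation/test division could look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: X holds feature vectors, y holds class labels.
X = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200)

# First split off the test set, then split the remainder into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=1 / 3, random_state=0)

# Model selection: fit candidate models on the train set and compare them on the
# validation set. After the best model is chosen, the train and validation sets are
# united, the model is refitted on both, and only then evaluated on the test set.
```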
3.3 Evaluation of binary classifiers
In the field of machine learning, a confusion (error) matrix is a specific matrix that enables visualization of the performance of a supervised learning algorithm. To evaluate a classifier, one compares its output to another reference classification (ideally a perfect classification, but in practice the output of another gold standard test) and cross-tabulates the data into a 2×2 contingency table¹, comparing the two classifications. One then evaluates the classifier relative to the gold standard by computing summary statistics of these four numbers. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class [15]. In predictive analysis, a confusion matrix is a matrix whose rows and columns represent the number of true positives, false positives, true negatives and false negatives. A false positive error is a result that indicates a given condition has been fulfilled, when it actually has not. On the other hand, a false negative error is where a test result indicates that a condition failed, when it actually did not.
Figure 3.2. Confusion matrix
¹ A contingency table is a type of table in a matrix format that displays the frequency distribution of the variables [16]
The following measures are used to evaluate a binary classifier:
a) Accuracy: the proportion of true results (both true positives and true negatives) among the total number of cases examined. An accuracy of 100% means that the measured values are exactly the same as the given values [17].
$accuracy = \frac{TP + TN}{TP + FP + TN + FN}$
b) Recall: measures the proportion of positives that are correctly identified as such [32].
$recall = \frac{TP}{TP + FN}$
c) Precision: the proportion of the true positives against all the positive results (both true positives and false positives).
$precision = \frac{TP}{TP + FP}$
d) F-beta measure: a measure that combines precision and recall, with the constraint $\beta > 0$. It measures the effectiveness of retrieval with respect to a factor $\beta$ which gives more or less importance to precision or recall.
$F_\beta = (1 + \beta^2) \cdot \frac{precision \cdot recall}{\beta^2 \cdot precision + recall}$
For $\beta = 1$ the measure is approximately the average of precision and recall when they are close, and is more generally the square of the geometric mean divided by the arithmetic mean. In this case, precision and recall are equally weighted.
$F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}$
For $\beta = 2$ the F-measure weights recall higher than precision:
$F_2 = \frac{5 \cdot precision \cdot recall}{4 \cdot precision + recall}$
For $\beta = 0.5$ the F-measure weights precision higher than recall:
$F_{0.5} = \frac{1.25 \cdot precision \cdot recall}{0.25 \cdot precision + recall}$
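To make these measures concrete, here is a small illustrative sketch (assuming scikit-learn; the label vectors are made up) that computes them from a confusion matrix:

```python
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, fbeta_score)

# Illustrative ground-truth labels and classifier outputs (1 = positive class).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP=%d FP=%d TN=%d FN=%d" % (tp, fp, tn, fn))

# The same formulas as above.
print("accuracy :", accuracy_score(y_true, y_pred))           # (TP+TN)/(TP+FP+TN+FN)
print("recall   :", recall_score(y_true, y_pred))             # TP/(TP+FN)
print("precision:", precision_score(y_true, y_pred))          # TP/(TP+FP)
print("F1       :", fbeta_score(y_true, y_pred, beta=1))      # precision and recall equally weighted
print("F2       :", fbeta_score(y_true, y_pred, beta=2))      # weights recall higher
print("F0.5     :", fbeta_score(y_true, y_pred, beta=0.5))    # weights precision higher
```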
3.4 Support vector machine, SVM
A support vector machine is a discriminative model, which means it models the conditional probability $p(y|x)$ directly, as opposed to generative models which model the joint probability distribution $p(x, y)$.
Example
Let's assume that we have the following data in the form (x, y):
(3,4), (3,1), (4,1), (4,1)
Table 3.2. The joint probability
p(x, y)   y = 1   y = 4
x = 3     1/4     1/4
x = 4     1/2     0
Table 3.3. The conditional probability
p(y|x)    y = 1   y = 4
x = 3     1/2     1/2
x = 4     1       0
Generative algorithms model how the data was generated in order to classify it and, based on those generation assumptions, try to determine which class was more likely to generate the given example. On the other hand, discriminative algorithms do not take into account how the data was generated, but rather use the data to create a decision boundary and then try to determine on which side of that decision boundary the given example lies.
Every machine learning algorithm is defined by a model, an error function and an optimization procedure. In the case of support vector machines, a data point is viewed as a p-dimensional vector (a list of p numbers), and we want to know whether we can separate such points with a (p−1)-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So we choose the hyperplane such that the distance from it to the nearest data point on each side is maximized [35]. The chosen hyperplane will also have the minimal generalization error.
3.4.1 Maximal margin problem
Model:
$h(x) = w^T x + w_0$
Class labels:
$y \in \{-1, +1\}$
Boundary between classes:
$y = \mathrm{sgn}(h(x))$
We assume that the examples from the dataset $\chi$ are linearly separable, meaning that there exist $w, w_0$ such that the following applies:
$h(x^{(i)}) \geq 0, \ \forall y^{(i)} = +1$
$h(x^{(i)}) < 0, \ \forall y^{(i)} = -1$
Or, shorter:
$\forall (x^{(i)}, y^{(i)}) \in \chi . \ y^{(i)} h(x^{(i)}) \geq 0$
It is obvious that the version space is infinite, i.e. there is an infinite number of solutions. However, due to the inductive bias, i.e. the desired solution being the one giving the maximum margin, there is only one solution of interest to us. Maximizing the margin gives, as a side effect, the hyperplane passing exactly halfway between the two closest examples.
The distance from the hyperplane:
$d = \frac{h(x)}{\|w\|}$
Since only the hyperplanes which correctly classify the examples are of interest, the absolute distance to the hyperplane can be denoted as:
$\frac{y^{(i)} h(x^{(i)})}{\|w\|} = \frac{y^{(i)} (w^T x^{(i)} + w_0)}{\|w\|}$
According to the definition, the distance between the hyperplane and the closest example is:
$\frac{1}{\|w\|} \min_i y^{(i)} (w^T x^{(i)} + w_0)$
and it is that distance we want to maximize:
$\mathrm{argmax}_{w, w_0} \frac{1}{\|w\|} \min_i y^{(i)} (w^T x^{(i)} + w_0)$
The last equation is the optimization problem that needs to be solved. However, since in the given form it is difficult to optimize (there is a min function within a max function), several steps need to be performed to simplify it.
First, for the example $x^{(i)}$ closest to the margin it is possible to define:
$y^{(i)} (w^T x^{(i)} + w_0) = 1$
As a consequence, for all the other examples the following will hold:
$y^{(i)} (w^T x^{(i)} + w_0) \geq 1$
With the above definition, the optimization problem becomes:
$\mathrm{argmax}_{w, w_0} \frac{1}{\|w\|} \min_i y^{(i)} (w^T x^{(i)} + w_0) = \mathrm{argmax}_{w, w_0} \frac{1}{\|w\|} \cdot 1 = \mathrm{argmax}_{w, w_0} \frac{1}{\|w\|}$
with the constraint:
$y^{(i)} (w^T x^{(i)} + w_0) \geq 1, \quad i = 1, \ldots, N$
Furthermore, maximizing $\frac{1}{\|w\|}$ is equivalent to minimizing $\|w\| = \sqrt{w^T w}$, which is in turn equivalent to minimizing $\|w\|^2$. To simplify the further steps, the last expression is multiplied by $\frac{1}{2}$, which yields the final formulation of the optimization problem:
$\mathrm{argmin}_{w, w_0} \frac{1}{2} \|w\|^2$
with the constraint:
$y^{(i)} (w^T x^{(i)} + w_0) \geq 1, \quad i = 1, \ldots, N$
Finally, the optimization problem became a typical convex optimization problem² with constraints, which can be defined in its standard form as follows:
Minimize $f(x)$
with constraints: $g_i(x) \leq 0, \ i = 1, \ldots, m$
$a_i^T x - b_i = 0, \ i = 1, \ldots, p$
Lagrange multipliers method
The Lagrange multipliers method is used to reformulate an optimization problem with constraints in such a way that the constraints are directly built into the target function. The aforementioned convex optimization problem with constraints can be transformed to:
$L(x, \alpha, \beta) = f(x) + \sum_{i=1}^{m} \alpha_i g_i(x) + \sum_{i=1}^{p} \beta_i h_i(x)$
where $\alpha_i$ and $\beta_i$ are the Lagrange multipliers for the inequality and equality constraints, respectively. Furthermore, for the multipliers $\alpha_i$ the so-called Karush-Kuhn-Tucker (KKT) conditions apply:
$\alpha_i \geq 0, \ i = 1, \ldots, m$
$\alpha_i g_i(x) = 0, \ i = 1, \ldots, m$
² Convex minimization, a subfield of optimization, studies the problem of minimizing convex functions over convex sets. The convexity property can make optimization in some sense "easier" than the general case: for example, any local minimum must be a global minimum [19].
Maximal margin problem's dual form
The aforementioned maximal margin problem was defined as follows:
$\mathrm{argmin}_{w, w_0} \frac{1}{2} \|w\|^2$
$y^{(i)} (w^T x^{(i)} + w_0) \geq 1, \quad i = 1, \ldots, N$
In terms of the Lagrange dual function, we have a function to optimize ($f(x)$) and a constraint with an inequality, leading to the following Lagrange function:
$L(w, w_0, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{N} \alpha_i \left( y^{(i)} (w^T x^{(i)} + w_0) - 1 \right)$
$\alpha = (\alpha_1, \ldots, \alpha_N)$ is a vector of Lagrange multipliers, one for each constraint. By choosing to optimize the dual form, the optimization problem is simplified, as the optimization comes down to the optimization of just one variable ($\alpha$).
Differentiating with respect to $w$ and $w_0$ and setting the derivatives to zero:
$w = \sum_{i=1}^{N} \alpha_i y^{(i)} x^{(i)} \qquad 0 = \sum_{i=1}^{N} \alpha_i y^{(i)}$
The dual Lagrange function takes the form:
$\hat{L}(\alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{N} \alpha_i \left( y^{(i)} (w^T x^{(i)} + w_0) - 1 \right)$
$= \frac{1}{2} \|w\|^2 - \sum_{i=1}^{N} \alpha_i y^{(i)} w^T x^{(i)} - w_0 \sum_{i=1}^{N} \alpha_i y^{(i)} + \sum_{i=1}^{N} \alpha_i$
$= \frac{1}{2} \sum_{i=1}^{N} \alpha_i y^{(i)} (x^{(i)})^T \sum_{j=1}^{N} \alpha_j y^{(j)} x^{(j)} - \sum_{i=1}^{N} \alpha_i y^{(i)} (x^{(i)})^T \sum_{j=1}^{N} \alpha_j y^{(j)} x^{(j)} + \sum_{i=1}^{N} \alpha_i$
$= \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)}$
Consequently, the dual optimization problem becomes: maximize the expression
$\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)}$
with the constraints $\alpha_i \geq 0, \ i = 1, \ldots, N$ and $\sum_{i=1}^{N} \alpha_i y^{(i)} = 0$.
By transforming the optimization problem into its dual form, the complexity of the algorithm is reduced in the case N << n, because of the number of variables in:
Primal problem: $n + 1$ variables
Dual problem: $N$ variables
Model
Previously it was calculated that:
$w = \sum_{i=1}^{N} \alpha_i y^{(i)} x^{(i)}$
Replacing w with the above in the model function, we obtain:
$h(x) = w^T x + w_0$ in the primal form
$h(x) = \sum_{i=1}^{N} \alpha_i y^{(i)} (x^{(i)})^T x + w_0$ in the dual form
To classify an example x, we calculate the scalar product of the example x and all the other examples in the dataset $\chi$, multiplied by the weight $\alpha_i$ and the label $y^{(i)}$. Instead of storing the weights $w$, we need to store the examples; instead of $h(x|w)$ we have $h(x|\alpha, \chi)$, meaning that the complexity of the model grows with the number of examples, i.e. the model is nonparametric. If the SVM is trained by solving the primal problem, it is a parametric model. On the other hand, if the SVM is trained by solving the dual problem, it is a nonparametric model.
Support vectors
From the KKT condition:
$\alpha_i (y^{(i)} h(x^{(i)}) - 1) = 0$
it is possible to conclude that for every example $x^{(i)}$ from $\chi$ either
$\alpha_i = 0$ or $y^{(i)} h(x^{(i)}) = 1$
holds, i.e. only the vectors lying precisely on the maximum margin appear in the expression, and those vectors are called the support vectors. All the other vectors, for which $\alpha_i = 0$ holds, do not affect the output of the model at all and can be disregarded when prediction is performed. In the dual form, the hyperplane of the primal problem is defined by a linear combination of the support vectors.
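In practice the dual problem is not solved by hand but by a library solver. As a minimal sketch (assuming scikit-learn, with made-up 2D points; a linear kernel with a large C approximates the hard-margin classifier described above), a maximal margin SVM can be fitted and its support vectors inspected like this:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes in 2D (illustrative data).
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

# Large C makes the soft-margin solver behave like the hard-margin maximal margin problem.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)   # the x^(i) with alpha_i > 0
print("dual coefficients:", clf.dual_coef_)         # alpha_i * y^(i) for the support vectors
print("w =", clf.coef_, " w0 =", clf.intercept_)    # hyperplane recovered from the dual solution
print("prediction for (3, 3):", clf.predict([[3, 3]]))
```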
3.5 Principal component analysis, PCA
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation³ to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components [33]. The basic idea behind it is that if the data lives in a subspace, it is going to look very flat when viewed from the full space, e.g.
Figure 3.3. Data in a subspace [11]
If we fit a Gaussian to the data, the equiprobability contours are going to be ellipsoids. If y is Gaussian with covariance $\Sigma$, the equiprobability contours will be ellipses whose principal components $\Phi_i$ are the eigenvectors of $\Sigma$, and whose principal lengths $\lambda_i$ are the eigenvalues of $\Sigma$.
³ In linear algebra, an orthogonal transformation is a linear transformation T : V → V on a real inner product space V that preserves the inner product.
Figure 3.4. Equiprobability contour [11]
By computing the eigenvalues we know whether the data is flat:
1. Data is flat if $\lambda_1 >> \lambda_2$
Figure 3.5. Flat data [11]
2. Data is not flat if $\lambda_1 \approx \lambda_2$
Figure 3.6. Not flat data [11]
As previously mentioned, PCA finds the principal components of the data. The principal components are the underlying structure in the data: they tell us the directions in which there is the most variance, i.e. the directions in which the data is most spread out.
Example 1
Let the following picture represent the data points in the dataset:
Figure 3.7. Dataset
In order to find the direction of most variance, we need to find a straight line along which the data is most spread out when projected onto it.
a) A vertical straight line
Figure 3.8. Data projected on the vertical line
b) A horizontal straight line
Figure 3.9. Data projected on the horizontal line
By visual inspection of the above pictures, it is obvious that the data is more spread out when projected onto the horizontal line. Any line other than the horizontal one would, in this case, have smaller variance, and therefore, in this example, the horizontal line is the principal component.
Every set of data points can be decomposed into eigenvectors and eigenvalues. An eigenvector gives a direction, and its eigenvalue is a number describing how spread out the data is along that direction. The eigenvector with the highest eigenvalue is the principal component. The number of eigenvector/eigenvalue pairs in a data set equals the number of dimensions that the data set has. If, for example, there are 2 variables, meaning that the data set is two-dimensional, there are 2 eigenvectors/eigenvalues.
Example 2
Let us assume that we have the following data:
Figure 3.10. Example data
As mentioned above, since the data is represented by 2 variables, there
are two eigenvectors. One of them, from the previous example, is the line
splitting the ellipses longways. The other eigenvector is perpendicular to the
principal component. The eigenvectors have to be able to span the whole x-y area, and to optimally satisfy that condition, the two eigenvector directions have to be orthogonal to one another. The two eigenvectors would look like this:
Figure 3.11. Eigenvectors
These eigenvectors actually create a much more useful set of axes to frame the data:
Figure 3.12. New axis
The data itself was not changed in any way; it is just observed from a different perspective. The directions of the new axes are the directions in which there is the most variation in the data, which is where there is the most information.
Now let us observe how PCA is used for dimension reduction on another example.
Example 3
Let us assume we have a data set in 3 dimensions:
Figure 3.13. 3D data set [12]
As the data is represented by 3 variables, there are 3 eigenvectors that need to be found. Since the data in the picture is all lying in a 2D plane, one of the three eigenvectors will have an eigenvalue of zero, while the other two will have large eigenvalues.
Figure 3.14. Eigenvectors [12]
Since the eigenvalue of the eigenvector ev3 equals zero, that eigenvector is pretty useless, and the data can be represented by the other two eigenvectors, which means it can be represented in 2D instead of 3D as before. The 2D representation is shown in Figure 3.15.
It is important to mention that the dimension can be reduced even if none of the eigenvalues is equal to zero, as long as the eigenvalue of the discarded eigenvector is much smaller than the eigenvalues of the other eigenvectors.
Figure 3.15. 2D representation [12]
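A minimal sketch of such a reduction (assuming scikit-learn and NumPy; the 3D points are made up but lie close to a plane, as in the example above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Illustrative 3D data that effectively lives in a 2D plane plus a little noise.
plane = rng.rand(100, 2)
X = np.column_stack([plane[:, 0], plane[:, 1], plane[:, 0] + plane[:, 1]])
X += 0.01 * rng.randn(*X.shape)

pca = PCA(n_components=3)
pca.fit(X)
print("eigenvalues (explained variance):", pca.explained_variance_)
print("eigenvectors (principal axes):\n", pca.components_)

# The third eigenvalue is close to zero, so the data can be projected to 2D
# with almost no loss of information.
X_2d = PCA(n_components=2).fit_transform(X)
print("reduced shape:", X_2d.shape)
```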
3.6 Linear discriminant analysis, LDA
Linear discriminant analysis (LDA) is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events [34]. The objective of LDA is to perform dimensionality reduction while also preserving as much of the class discriminatory information as possible. In PCA, the main idea was to re-express the available dataset by extracting the relevant information, reducing the redundancy and minimizing the noise. However, the discrimination between classes was not taken into account; we could only project the dataset onto a lower-dimensional space with a more powerful data representation.
Let us assume we have a pattern classification problem where there are C classes. Each class $\omega_i$ has $N_i$ m-dimensional samples, where $i = 1, 2, \ldots, C$; hence we have a set of m-dimensional samples $x_1, x_2, \ldots, x_{N_i}$ belonging to class $\omega_i$. All of those samples are put into a matrix X in such a way that each column represents one sample. The idea is to obtain a transformation of X to Y by projecting the samples in X onto a hyperplane with dimension $C - 1$. In order to find a good projection vector, i.e. the one that maximizes the separability of the projected scalars, we need to define a measure of separation between the projections.
Example [13]
Compute the linear discriminant projection for the following two-dimensional dataset:
Samples for class $\omega_1$: $X_1 = (x_1, x_2) = \{(4,2), (2,4), (2,3), (3,6), (4,4)\}$
Samples for class $\omega_2$: $X_2 = (x_1, x_2) = \{(9,10), (6,8), (9,5), (8,7), (10,8)\}$
Figure 3.16. 2D dataset [13]
The first step is to calculate the class means:
$\mu_1 = \frac{1}{N_1} \sum_{x \in \omega_1} x = \frac{1}{5} [(4,2) + (2,4) + (2,3) + (3,6) + (4,4)] = (3, 3.8)$
$\mu_2 = \frac{1}{N_2} \sum_{x \in \omega_2} x = \frac{1}{5} [(9,10) + (6,8) + (9,5) + (8,7) + (10,8)] = (8.4, 7.6)$
In the second step we need to calculate the covariance matrices of the classes. We could instead have chosen the distance between the projected means as our objective function; however, it would not be a good measure, as it does not take into account the standard deviation within the classes, as shown in Figure 3.17.
Figure 3.17. Means distance between classes [13]
The solution to this issue was proposed by Fisher: maximize a function that represents the difference between the means, normalized by a measure of the within-class variability, i.e. the scatter.
The covariance matrix for the first class:
The covariance matrix for the second class:
The within-class scatter matrix is:
$S_W = S_1 + S_2$
The between-class scatter matrix is:
$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$
The LDA projection is then obtained as the solution of the generalized eigenvalue problem
$S_W^{-1} S_B w = \lambda w$
Furthermore:
$w^* = S_W^{-1} (\mu_1 - \mu_2)$
where $w^*$ is known as Fisher's linear discriminant, which is rather a specific choice of direction for the projection of the data down to one dimension.
Figure 3.18. Solution [13]
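The intermediate matrices of this example are not reproduced in this transcript, but they can be recomputed from the given samples. The following sketch (assuming NumPy, and assuming the sample covariance with the 1/(N−1) normalization, which is one possible convention) reproduces the whole example numerically:

```python
import numpy as np

# Samples for the two classes from the example above.
X1 = np.array([[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)    # (3, 3.8) and (8.4, 7.6)

S1 = np.cov(X1, rowvar=False)                  # covariance matrix of the first class
S2 = np.cov(X2, rowvar=False)                  # covariance matrix of the second class
SW = S1 + S2                                   # within-class scatter matrix
SB = np.outer(mu1 - mu2, mu1 - mu2)            # between-class scatter matrix

# Fisher's linear discriminant: the dominant solution of SW^-1 SB w = lambda w,
# which for two classes is proportional to SW^-1 (mu1 - mu2).
w_star = np.linalg.inv(SW).dot(mu1 - mu2)
w_star /= np.linalg.norm(w_star)

print("mu1 =", mu1, "mu2 =", mu2)
print("S1 =\n", S1, "\nS2 =\n", S2)
print("SW =\n", SW, "\nSB =\n", SB)
print("w* =", w_star)
```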
4. Dataset processing
The dataset used for training the SVM classifier consists of the data from executable files (.exe or .dll) containing either malware or virus-free software. The processing of the dataset consists of the following steps:
1. Objdump: a Linux terminal command that is used to display information about one or more object files. In this paper the command objdump -d executableFile was used to disassemble the executables, preparing the dataset for further processing. A shell script was used to automate the process: it reads in all the available .dll and .exe files, calls the aforementioned command on each file and saves the output to text files, naming them in numerical order (e.g. 0.txt, 1.txt, 2.txt etc.) to simplify the further processing.
2. Dictionary creation: a dictionary holding, as keys, all the possible assembly commands returned by the objdump command, taken from the Intel 64 and IA-32 Architectures manual [9]. The values in the dictionary represent the number of occurrences of each command in the output of objdump.
3. Regular expression matching: the regular expression from Figure 4.1. is used to match and extract only the commands from the objdump output.
Figure 4.1. Regular expression for obtaining the commands
4. Counting occurrences: for every command obtained in step 3, the number of occurrences for that command in the dictionary is increased by one.
5. Matrix creation: the final step of the process consists of creating the matrix that will be the input to the SVM classifier training algorithm. The columns of the matrix represent each of the commands from the dictionary. Each row of the matrix represents one file from the dataset, while element (i, j) of the matrix represents the number of occurrences of the j-th command in the i-th file.
Table 4.1. The input matrix for the classifier
files/commands  ADD  JMP  MOV  AND  OR   ...
0.txt           123  456  19   22   17   ...
1.txt           98   123  11   54   22   ...
...             ...  ...  ...  ...  ...  ...
Figure 4.2. Code for parsing the data set
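The original Figure 4.2 is not reproduced in this transcript. The following is only a hedged sketch of how such parsing could look (the regular expression and the command list are simplified illustrations, not the exact ones from Figures 4.1 and 4.2):

```python
import re
import subprocess
from collections import OrderedDict

# Shortened, illustrative list of x86 mnemonics; the thesis uses the full list
# from the Intel manual as the dictionary keys.
COMMANDS = ["add", "and", "call", "cmp", "jmp", "lea", "mov",
            "or", "pop", "push", "ret", "sub", "xor"]

# Simplified stand-in for the regular expression from Figure 4.1:
# match the mnemonic column of an objdump disassembly line.
MNEMONIC_RE = re.compile(r"^\s*[0-9a-f]+:\s+(?:[0-9a-f]{2}\s)+\s*([a-z]+)", re.MULTILINE)

def count_commands(executable_path):
    """Disassemble one file with objdump and count mnemonic occurrences (steps 1-4)."""
    disassembly = subprocess.check_output(["objdump", "-d", executable_path])
    disassembly = disassembly.decode("utf-8", "replace")
    counts = OrderedDict((cmd, 0) for cmd in COMMANDS)
    for mnemonic in MNEMONIC_RE.findall(disassembly):
        if mnemonic in counts:
            counts[mnemonic] += 1
    return list(counts.values())   # one row of the input matrix from Table 4.1

# Step 5, matrix creation: one row per file.
# matrix = [count_commands(path) for path in ["0.exe", "1.dll"]]
```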
The second argument to an SVM classifier, in the case of supervised learning, is the corresponding vector Y containing the correct class for each of the rows (files) from the matrix. Vector Y, in this paper, contains n elements with values of either 1 or -1, marking the file as malware or clean software, respectively. Furthermore, the dataset is divided into three groups:
1. Train set: contains 50% of all the files in the dataset. It is used to determine the structure of the learned function and the corresponding learning algorithm.
2. Validation set: contains 25% of all the files in the dataset. In order to avoid overfitting, whenever any classification parameter needs to be adjusted, the validation set is used.
3. Test set: contains the last 25% of the files in the dataset. After choosing the best fitting model for the problem and adjusting its parameters on the validation set, the test set is used to analyze the model's generalization, i.e. precision, recall etc.
Figure 4.3. shows a part of the Python script used to both process the dataset and calculate the precision and F1-score of SVM classifiers initialized with the provided parameters. After the creation of the classifier with the given parameters, the classifier is fitted to the train set and tested on the validation set, giving the precision and F1-score as output in order to make the search for the best possible classifier easier.
Figure 4.3. Code for precision calculation
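Since the original Figure 4.3 is not reproduced in this transcript, here is a hedged sketch of what such an evaluation script could look like (assuming scikit-learn; the file names x.txt and y.txt follow Chapter 5, and the kernel parameters are only example values):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import precision_score, f1_score

# Load the input matrix and class vector produced by the dataset processing steps.
X = np.loadtxt("x.txt")
y = np.loadtxt("y.txt")   # 1 = malware, -1 = clean software

# 50% train, 25% validation (the remaining 25% is kept aside as the test set).
n = len(y)
X_train, y_train = X[: n // 2], y[: n // 2]
X_val, y_val = X[n // 2: 3 * n // 4], y[n // 2: 3 * n // 4]

# Example parameters: a polynomial kernel of a given degree.
clf = SVC(kernel="poly", degree=4, C=1.0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_val)
print("precision:", precision_score(y_val, y_pred))
print("F1-score :", f1_score(y_val, y_pred))
```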
5. Malware detection software
The malware detection software is based on the previously selected SVM classifier. The aforementioned Python script saves the input matrix into a .txt file. Once the software is started, it first initializes the SVM classifier and calls the .fit function on the matrix read in from the file. As Figure 5.1. shows, a graphical interface is presented to the user with an "Open" button which opens a file selector window where they choose the executable file they want to run the analysis on. Once the file is selected, the user can start the analysis by clicking the "Start" button. The other option is for the user to type in the full path of the file they want to analyze.
Figure 5.1. Application window
The results are obtained following these steps:
1. The program calls the aforementioned objdump Linux command on the selected executable file
2. Using the same regular expression as when training the classifier, only the commands from the objdump output are selected
3. Using the same list of assembler commands, the dictionary is created
4. Parsing the commands extracted in step 2, the program counts the number of occurrences of each of the commands in the dictionary
5. Once the input vector from step 4 is created, the program calls the classifier.predict function on that input vector and as a result gets either -1 or 1, meaning the software is clean or malware, respectively.
6. Based on the result of the classification from step 5, the user is presented with a window containing the message either from Figure 5.3. or 5.4.
Figure 5.2. Code containing some of the aforementioned steps above
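The original Figure 5.2 is not reproduced in this transcript. A hedged sketch of the prediction steps, assuming the hypothetical count_commands helper sketched in Chapter 4 and an already fitted classifier clf, could look like this:

```python
import numpy as np

def analyze(path, clf, count_commands):
    """Steps 1-6: disassemble, extract and count commands, classify, build the message."""
    # Steps 1-4: objdump + regular expression + dictionary of occurrence counts,
    # exactly as when the training matrix was built.
    feature_vector = np.array(count_commands(path)).reshape(1, -1)

    # Step 5: the fitted SVM decides; -1 means clean software, 1 means malware.
    label = clf.predict(feature_vector)[0]

    # Step 6: the returned message is shown to the user in a window
    # such as the ones in Figures 5.3 and 5.4.
    return "Virus detected!" if label == 1 else "Virus free software"
```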
Figure 5.3. Virus detected window
Figure 5.4. Virus free software
The malware detection software is written in Python and the graphical interface is created using Glade⁴. Furthermore, for machine learning purposes, the program uses the open-source scikit-learn library, which is a Python library containing simple and efficient tools for data mining and data analysis, built on NumPy⁵, SciPy⁶ and matplotlib⁷.
⁴ Glade is a RAD tool to enable quick & easy development of user interfaces for the GTK+ toolkit and the GNOME desktop environment. The user interfaces designed in Glade are saved as XML, and by using the GtkBuilder GTK+ object these can be loaded by applications dynamically as needed. By using GtkBuilder, Glade XML files can be used in numerous programming languages including C, C++, C#, Vala, Java, Perl, Python, and others.
⁵ NumPy is the fundamental package for scientific computing with Python. It contains, among other things, a powerful N-dimensional array object and sophisticated (broadcasting) functions [10].
⁶ SciPy (pronounced "Sigh Pie") is an open source Python library used by scientists, analysts, and engineers doing scientific computing and technical computing [18].
⁷ Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like wxPython, Qt, or GTK+ [2].
Both the Python script from Chapter 4 of this paper and the malware detection software from this chapter are programmed in a plug/unplug fashion, giving the opportunity to easily modify several important parameters (a short sketch follows the list below):
1. Dataset used to train the classifier: as the classifier's fit function is trained with the data from two text files called "x.txt" and "y.txt", those files can easily be replaced with a better or updated version of the input matrix and the corresponding class vector. The only limitation is the format of those files, where "x.txt" needs to be in the form of a matrix, and "y.txt" in the form of a 1D array.
2. Parameters of the SVM classifier: the classifier is initialized in an init function when the program is run. The parameters can easily be changed by altering the current values in the constructor call. This gives the opportunity to easily switch the program to a better classifier once better results are obtained from the script in Chapter 4.
3. Classifier: the SVM classifier can be replaced by any other available classifier from the scikit-learn library if that would yield better results.
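For illustration, the plug/unplug idea boils down to a few lines of initialization code; a sketch (the file names follow the text above, while the parameter values are only examples):

```python
import numpy as np
from sklearn.svm import SVC

# 1. Dataset: swap x.txt / y.txt for an updated matrix and class vector.
X = np.loadtxt("x.txt")    # 2D matrix, one row per file
y = np.loadtxt("y.txt")    # 1D array of labels (1 = malware, -1 = clean)

# 2. Parameters: change the values in this constructor call.
# 3. Classifier: replace SVC with any other scikit-learn classifier.
classifier = SVC(kernel="poly", degree=4, C=1.0)
classifier.fit(X, y)
```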
If we take into consideration that the Python script from Chapter 4 is easily modifiable as well, the combination of these two programs is perfect for testing and investigation purposes. First, with the Python script it is possible to get the results for different classifiers, as well as to process different datasets and create different input matrices and class vectors, saving them to a file. Then, in the next step, once it is determined which classifier yielded better results or which dataset trained the classifier better, this data can be forwarded to the malware classification software, which will then use it for better performance.
Furthermore, the graphical interface itself is created separately from all the other Python code. Therefore, replacing the current simple graphical interface with a more advanced one should not require any additional changes in the source code itself. When replacing the graphical interface, it is important to notice that there are two methods called from the Python code:
1. When the "Open" button is pressed, the "name_of_the_function" method creates a file selector window and displays it to the user
2. When the "Start" button is pressed, the button_start method is called, which starts the process explained above in 6 steps.
Therefore, in order to "plug in" a new graphical user interface, it is important to connect the button click events to these two functions, while the source code itself stays unchanged.
6. Results of the classifiers evaluation
Before selecting the appropriate parameters of the SVM classifier with a polynomial kernel, several different polynomial degrees were tested. Testing a classifier is a process of training it on the train set and calculating all the evaluation measures explained in chapter 3.3 of this paper: precision, recall, accuracy and the F1-macro measure. Furthermore, to get an insight into the classifier's generalization power, it is trained on the training set and then tested on previously unseen data.
6.1. Results obtained for the polynomial kernel, training data
The SVM classifier was trained on the train set and the results from Table 6.1. were obtained. It is obvious that the more complex the polynomial function, the better the results. However, after reaching approximately 60% precision, the growth starts to slow down, resulting in the classifier evaluation values growing more slowly. Furthermore, the best result here is obtained for the SVM classifier with degree 6, with an accuracy of 72%.
Table 6.1. Results of the testing train set
degree F1macro Precision Recall Accuracy
1 0.346153846 0.26470588 0.5 0.32
2 0.452830188 0.35294117 0.63157895 0.42
3 0.5 0.419354838 0.6190476 0.48
4 0.666666666 0.6 0.75 0.64
5 0.7037037 0.678571428 0.73076923 0.68
6 0.74074074 0.769230769 0.7142857 0.72
6.2. Results obtained for the polynomial kernel, validation data
After training the classifier with a specific degree, it is presented with a new set of previously unseen data. As expected, the performance of the classifiers for each of the degrees has decreased. When tested on the validation set, the best results were yielded by the SVM classifier of degree 4, as opposed to the one of degree 6 from Table 6.1., which gave better results on the train set.
Table 6.2. Results of testing validation set
degree F1macro Precision Recall Accuracy
1 0.142857 0.114754098 0.189189 0.16
2 0.2708333 0.22807017 0.3333333 0.3
3 0.4175824 0.365384615 0.487179487 0.47
4 0.5531914 0.5909091 0.52 0.58
5 0.5454 0.551020408 0.54 0.55
6 0.51020408 0.53191489 0.490196 0.52
From the above two tables it is possible to conclude that the classifier with degree 4 is the optimal option of all the tested classifiers, as it gave the best values on the validation set. Although the polynomial of degree 6 gave much better results on the train set, once presented with the new, unseen set of data, its results decreased. This behavior could be a signal of overfitting and therefore the polynomial of degree 4 is chosen.
7. Conclusion
The results of the classification with the chosen classifier show that there
is room for improvement. Considering that the number of features for every
example in the dataset is rather large, i.e. the number of occurrences of every
assembly command, it is possible that using one of the aforementioned
dimensionality reduction algorithms would yield better results. Also, a look at the
matrix created from the given dataset shows that many elements have the value
0, when a command does not appear in the example at all, while frequently
appearing commands set the value of an element to numbers greater than
10 000. These oscillations in the data might be a sign of “noise” that could affect
the results of the classifier, as could features that do not carry any useful
information. Therefore, applying either the PCA or the LDA algorithm might
remove the noise from the data and also reduce the dimensionality by
disregarding irrelevant features.
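As one possible direction, and only as a sketch under the assumptions noted in the comments, such a reduction could be placed in front of the SVM using scikit-learn; the number of components is an illustrative value, not a result from this thesis.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

model = make_pipeline(
    StandardScaler(),             # evens out the 0 vs. >10 000 scale differences
    PCA(n_components=50),         # illustrative choice: keep the 50 strongest directions
    SVC(kernel="poly", degree=4),
)
# model.fit(X_train, y_train) would then be evaluated on the validation set
# in the same way as in chapter 6.

LDA (sklearn.discriminant_analysis.LinearDiscriminantAnalysis) could be substituted for PCA in the same pipeline, with the difference that it uses the class labels when choosing the projection.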
Furthermore, adding features beyond those obtained from the static
analysis (for example, also performing a dynamic analysis of the changes in the
operating system while the program is run) could also improve the obtained
results, as there is always a possibility that the features in the dataset do not
carry enough information relevant for the classification. In other words, if there is
no correlation between the data in the dataset and the results of the
classification, the outcome will be poor no matter which statistical approach is
chosen.
However, I consider the approximately 60% accuracy obtained in the
practical part of this paper sufficient to justify continued work on improvements.
Once again, the machine learning approach has shown itself useful and
applicable to a problem that seems to be outside its scope. If the results of the
classification improved to over 90%, especially when tested on newly created
malware, it would mean that the created software could detect malware (already
known or yet unknown) based on what it has learnt from historical data, without
any need to update a database of known malware.
Abstract
Malware detection software based on a Support Vector Machine (SVM)
classifier. The software detects malware in executable files (such as .exe, .dll).
The dataset, divided into training, validation and test sets, is assembly code
obtained by disassembling the executables. The assembly is then processed
and the frequency of each command is calculated. Using the validation set to
minimize the total error, dimensionality reduction is performed with techniques
such as Principal Component Analysis (PCA) or Linear Discriminant Analysis
(LDA). Furthermore, the SVM kernel is chosen based on the validation set as
well. The SVM with the optimal parameters is tested on the test set, and
precision, recall and the F1-measure are calculated, as well as the number of
false positives, false negatives and correctly classified executables.
Keywords:
malware, SVM, support vector machine, PCA, principal component analysis,
LDA, linear discriminant analysis, dimensionality reduction, dataset, kernel,
precision, recall, F1-measure, false positive, false negative, classification
Sažetak
A program for the detection of malicious software (malware) built on
classification with a support vector machine (SVM). The program detects
malware in executable files (e.g. .exe, .dll). The dataset for training, validation
and testing is assembly code obtained by disassembling the executable files,
after which the frequency of occurrence of each command is computed; these
frequencies are the input to the SVM. Using the validation set to minimize the
total error, the number of dimensions (dimensionality reduction) and the kernel
of the SVM are determined. For dimensionality reduction, algorithms such as
PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis) and
similar can be used. The selected SVM is tested on the test set, and from the
results the precision, recall and the F1 score are computed, as well as the
number of false positives, false negatives and correctly classified samples
(executable files).
Keywords:
malware, classification, SVM, support vector machine, dimensionality reduction,
kernel, PCA, LDA, precision, recall, F1 score, false positives, false negatives.