TRANSCRIPT
UNIVERSITY OF ZAGREB FACULTY OF ELECTRICAL ENGINEERING AND COMPUTING
(FER)
GRADUATE THESIS no. 1216
MALWARE DETECTION SOFTWARE USING
A SUPPORT VECTOR MACHINE AS A CLASSIFIER
Nicole Bilić
Zagreb, July 2016.
Table of contents
1. Introduction
2. Malware
2.1. Adware
2.2. Spyware
2.3. Virus
2.4. Worm
2.5. Trojan
2.6. Rootkit
2.7. Backdoors
2.7.1. Object code backdoors
2.7.2. Asymmetric backdoors
2.7.3. Compiler backdoors
2.8. Keyloggers
2.9. Rogue security software
2.10. Ransomware
2.11. Browser Hijacker
3. Machine learning and classification
3.1. Supervised machine learning
3.1.1. General steps
3.2. Model selection
3.3. Evaluation of binary classifiers
3.4. Support vector machine, SVM
3.4.1. Maximal margin problem
3.5. Principal component analysis, PCA
3.6. Linear discriminant analysis, LDA
4. Dataset processing
5. Malware detection software
6. Results of the classifiers evaluation
6.1. Results obtained for the polynomial kernel, training data
6.2. Results obtained for the polynomial kernel, validation data
7. Conclusion
8. Literature
9. Abstract
10. Sažetak
1. Introduction
Malware detection software is inevitable today, as with the Internet era the number of types of malware, as well as the total number of known malware samples, has drastically increased. The Internet is used as a transport medium to deliver malware to end users, most of whom are standard users without deep IT knowledge, making them an easy target for attackers. Each malware has its purpose, ranging from merely annoying ads popping up in the user's browser to encrypting all of the user's documents on the infected computer and asking for a ransom. Therefore, malware detection (and protection) software is an obligatory accessory to the operating system.
Machine learning has proved itself a useful and good solution to optimization problems that either have no standard solution or would require a lot of computational power to be solved in a standard manner. Standard malware detection software has a database containing all the known malware and has to be updated regularly, as new malware is delivered and discovered every day.
The idea to combine machine learning with malware detection is based on the following facts:
a) The dataset, in this case, is almost infinite: all malware and virus-free software can be used. Taking into consideration that this dataset is expanding every day, it is possible to obtain a very large dataset, giving a better opportunity to find the correct decision function.
b) Once fitted, the classifier would be able to correctly predict whether a given file is malware or virus-free, i.e. new malware would be detected by this software even though it has never been seen before.
As most malware comes in an executable form, the analysis here is reduced to static analysis of the disassembled executable, assuming that there is a correlation between the data and the results of the classification, i.e. there is a decision function that correctly separates malware from virus-free software.
In this paper, the whole process from dataset processing to the final product, i.e. the malware detection software, is described, as well as the theoretical background. The paper starts with a chapter dedicated to malware in general, giving the general definition of malware. As the full list of malware types is very long, the rest of the paper focuses on those which make up the majority, i.e. over 99%, of all malware, with emphasis on the properties specific to each malware type. The third chapter is dedicated to machine learning and classification. It explains and defines some important general terms, such as machine learning itself, supervised machine learning, model selection, evaluation of binary classifiers, the support vector machine (SVM), etc., giving a good theoretical background for understanding how the final product works. While the emphasis in this chapter is on the idea behind the SVM, including the maximal margin problem and its dual form, the Lagrange multipliers method, etc., two dimension reduction algorithms also found their place in the third chapter: Principal component analysis (PCA) and Linear discriminant analysis (LDA), as a possibility of removing the noise from the data and eventually improving the results of the classification. Furthermore, the fourth chapter gives an overview and explanation of the dataset processing. The process is divided into five steps which thoroughly explain the way the dataset is processed, from an executable file to a matrix which, later on, is an input for the classifier. Also, an overview of the dataset division is given, with the exact percentages of the dataset that belong to the train, validation and test sets. The fifth chapter describes the final product and the practical part of this thesis, the malware detection software. It contains general guidelines for the end user, as well as a step-by-step explanation of what is happening in the background.
2. Malware
Malware or malicious software is any software used to gather sensitive information, gain access to private computer systems, display unwanted advertising or in any way disrupt computer operations [1]. It should not be confused with software that causes unintentional damage as a side effect of some deficiency. Software is classified as malicious for its malicious intention towards users or their computers, which draws the line between malware and badware.
Figure 2.1. Malware by categories [1]
Since the list of all malware types is long and practically endless, I will focus on the several most common types, classified by the general categories of infection.
2.1. Adware
Adware is short for advertising-supported software. This type of malware automatically delivers advertisements. A typical example of this type of malware often comes as a part of free versions of software or applications, sponsored by advertisers as a source of income. Pop-up ads on various websites belong to this type of malware as well. Although the intention of this malware type is solely advertising and it is considered to be the least dangerous type, in some cases adware comes bundled with spyware, which can lead to a breach of privacy of user data and similar.
2.2. Spyware
As its name suggests, spyware is a type of malware that uses different techniques to spy on the user's activity without their knowledge. However, the term spying, in this case, extends well beyond simple monitoring. Spyware can collect almost any type of data, ranging from personal information, such as internet surfing habits, to user logins and bank/credit account information. Furthermore, spyware is also capable of interfering with the user's control of a computer by installing additional software or redirecting web browsers. The most advanced spyware is even capable of changing computer settings, resulting in slow connection speed, unauthorized changes in browser settings etc.
Spyware spreads either by exploiting software vulnerabilities, by bundling itself with legitimate software, or in Trojans. The most common purpose of spyware is collecting the user's data in order to determine "targeted" advertisement impressions.
2.3. Virus
The main characteristic of this type of malware is its capability of copying itself and spreading to other computers, doing so without user consent. When executed, this type of malware replicates itself by inserting possibly modified copies of itself into other computer programs, data files or even into the boot sector of the hard drive. After the successful replication, viruses often perform harmful activities on the infected host, such as stealing hard disk space or CPU time, accessing private information, corrupting data, displaying political or humorous messages on the user's screen, spamming their contacts, logging their keystrokes or even rendering the computer useless [20]. However, most of the computer viruses known today target systems running the Microsoft Windows OS, using stealth strategies to avoid antivirus protection software. Taking into consideration the fact that computer viruses cause billions of Euros worth of damage every year [3] by wasting computer resources, causing system failures and corrupting data, the motives for creating this type of malware are clear: personal data theft, profit, sending political messages, amusement, sabotage and similar. Luckily, a lot of antivirus software is freely available today. However, none of it can detect all viruses with 100% accuracy, due to the constant development of new viruses.
2.4. Worm
A worm is a standalone malware computer program that replicates itself in order to spread to other computers using a computer network, exploiting security failures on the target computer to access it. It is often confused with a computer virus, however there is a big difference: a worm does not need to attach itself to an existing program. Furthermore, worms almost always cause some damage to the network, at least by consuming bandwidth, whereas viruses almost always corrupt or modify files on a targeted computer [21]. Even if their only goal is to spread through the network, not changing anything in the systems they pass through, even "payload free" worms can cause a lot of damage by increasing network traffic or causing major disruption. A "payload" is code in the worm that, other than spreading, does other harmful things such as data deletion, data encryption, sending documents via email etc. One of the common purposes of a payload worm is to install a backdoor in the infected computer to allow the creation of a "zombie" computer under the control of the worm author [21]. Other malicious purposes are money extortion in a so-called "ransomware attack" or blackmailing companies by threatening them with a DoS attack.
2.5. Trojan
A Trojan horse is a type of malware that misrepresents itself to appear useful, routine or interesting in order to persuade a victim to install it [22]. A Trojan's payload usually acts like a backdoor and is not easily detectable. However, it causes changes in the computer's behaviour in a way that it becomes slow due to heavy processor or network usage. The main difference from worms and viruses is the fact that Trojans usually do not try to inject themselves into other files or propagate themselves further in any other way. The purposes and uses can be divided into the following categories [22]:
a) Destructive: crashing the computer or device, modification or deletion of files, data corruption, formatting disks, destroying all contents, spreading malware across the network, spying on user activities
b) Use of resources or identity: use of the machine as part of a botnet, using computer resources for mining cryptocurrencies, using the infected computer as a proxy for illegal activities and/or attacks on other computers, infecting other connected devices on the network
c) Money theft, ransom: electronic money theft
d) Data theft: industrial espionage, user passwords or payment card information, user personally identifiable information, trade secrets
e) Spying, surveillance or stalking: keystroke logging, watching the user's screen, viewing the user's webcam, controlling the computer system remotely
2.6. Rootkit
A rootkit is a collection of usually malicious computer software designed to enable access to a computer or areas of its software that would not otherwise be allowed, while at the same time masking its existence or the existence of other software [23]. Usually the first step is obtaining root or Administrator access, which is done by exploiting a known vulnerability or a password (obtained through cracking or social engineering). Rootkits are usually capable of hiding their intrusion by subverting the software intended to find them, while maintaining privileged access. Therefore, the removal can be practically impossible, especially when the kernel is infected by a rootkit or when dealing with firmware rootkits. Often, a complete reinstallation of the operating system is required, or even hardware replacement in the case of firmware rootkits. However, modern rootkits are used to add stealth capabilities in order to make the payload of other software undetectable, rather than to elevate access. Malicious rootkits and their payloads can have one of the following uses:
1. Provide an attacker with full access via a backdoor, permitting unauthorized access to steal or falsify documents.
2. Conceal other malware, for example password-stealing keyloggers and computer viruses.
3. Use the compromised machine as a zombie computer for attacks on other computers.
2.7. Backdoors
A backdoor in a computer system is a method of bypassing normal authentication and securing remote access to a computer while attempting to remain undetected. It can take the form of an installed program or could be a modification to an existing program or hardware device [4]. The threat of backdoors surfaced when multi-user and networked operating systems became widely adopted. The most common types of backdoors are:
2.7.1. Object code backdoors
Object code backdoors involve modifying the object code instead of the source code, which makes them hard to detect. Object code is not in a human-readable form but rather in a machine-readable one, which makes it hard to inspect. However, this type of backdoor is easily detectable by checking for differences, notably in length or checksum, and in some cases by disassembling the object code. Furthermore, object code backdoors are easy to remove by simply recompiling from source.
2.7.2. Asymmetric backdoors
As opposed to traditional symmetric backdoors, which allow anyone who finds the backdoor to use it, asymmetric backdoors enable only the attacker who planted them to use them.
2.7.3. Compiler backdoors
Compiler backdoors are a form of black box backdoor, where not only is a compiler subverted (to insert a backdoor into some other program), but it is further modified to detect when it is compiling itself and then inserts both the backdoor insertion code and the code modifying self-compilation, similar to the mechanism by which retroviruses infect their host. This can be done by modifying the source code, and the resulting compromised compiler can compile the original source code and insert itself: the exploit has been bootstrapped [24].
2.8. Keyloggers
Keylogging is the action of recording the keys struck on a keyboard, typically covertly, so that the person using the keyboard is unaware that their actions are being monitored. Numerous keylogging methods exist: they range from hardware and software-based approaches to acoustic analysis. Software-based keyloggers can, from a technical perspective, be divided into several categories [25]:
1. Hypervisor-based: the keylogger resides in a malware hypervisor underneath the OS, and therefore remains untouched
2. Kernel-based: a program on the machine obtains root access to hide itself in the OS and intercepts keystrokes that pass through the kernel
3. API-based: hooks the keyboard APIs inside a running application. The keylogger registers keystroke events as if it were a normal piece of the application instead of malware
4. Form grabbing based: logs web form submissions by recording the web browsing on submit events, after the user fills in the form and submits it
5. Memory injection based: performs its logging function by altering the memory tables associated with the browser and other system functions
6. Packet analyzers: capture network traffic associated with HTTP POST requests to retrieve unencrypted passwords
7. Remote access: local software keyloggers with an added feature that allows access to locally recorded data from a remote location.
2.9. Rogue security software
Rogue security software is a form of malicious software and internet fraud that misleads users into believing there is a virus on their computer and manipulates them into paying money for a fake malware removal tool (that possibly introduces malware to the computer) [26]. The attackers in this case mostly rely on some form of social engineering to manipulate their victim into buying or installing the antivirus software, which usually contains a Trojan horse component. After the installation, the rogue security software may try to persuade the victim into purchasing a service or additional software in various ways: displaying an animation simulating a system crash and reboot, disabling parts of the system, disabling automatic system software updates, blocking access to antimalware vendors' web pages, altering system registries and/or security settings etc.
2.10. Ransomware
Ransomware is a kind of malware that restricts access to the infected computer system in some way and demands that the user pay a ransom to the malware operators to remove the restriction [27]. Some ransomware encrypts the user's hard drive to force the user to pay the ransom in order to retrieve their data, since it is almost impossible to decrypt it on their own. Usually, ransomware uses Trojans to propagate itself and enter the victim's system, where the payload is run. Payloads may display fake warnings (for example, purportedly issued by a law enforcement agency), lock the system until the ransom is paid, or encrypt the files in such a way that only the attacker has the needed decryption key.
2.11. Browser Hijacker
Browser hijacking is a form of malware that modifies a web browser's settings without the user's permission, to inject unwanted advertising into the user's browser [28]. The average browser hijacker changes the user's home page, error page or preferred search engine in order to increase the number of hits to a particular website and so increase its advertising revenue. Usually, they come bundled with some other software as "offers", without uninstall instructions or documentation on what they do, in order to trick the user into installing them too. Some browser hijackers are not dangerous but rather annoying, and are easy to detect and uninstall. However, some of them are capable of permanently damaging the registry on Windows systems and are not so easily detectable or uninstallable.
3. Machine learning and classification
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data [5]. The way learning is done can be divided into three broad categories:
1. Supervised learning: the computer is given two sample values, one representing the input and the other representing its desired output. The computer's task is to find a general rule that maps inputs into desired outputs.
2. Unsupervised learning: the output vector is not given and the computer has to find the structure in its input alone.
3. Reinforcement learning: a computer program interacts with a dynamic environment in which it must perform a certain goal, without a teacher explicitly telling it whether it has come close to its goal [29].
Between supervised and unsupervised learning there is also the semi-supervised learning approach, where some (usually most) output values are unknown. However, in this thesis the focus will be on supervised learning, as the program learns from a dataset containing all the necessary input and expected output values.
3.1 Supervised machine learning
Supervised machine learning is the machine learning task of inferring a function from labeled training data [13]. It can be used for solving two types of tasks: classification and regression. The difference between the two is that classification associates an input with its class, while regression associates an input with some continuous value, i.e. the major difference is whether the target variable is discrete or continuous.
The set of training examples consists of pairs of instances and their labels, which can be denoted as:
$\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$
N = total number of examples in the training set
Each example x can be represented as an n-dimensional vector of features,
$x = (x_1, x_2, x_3, \ldots, x_n)$
which can be interpreted as a point in an n-dimensional vector space, the so-called input or instance space. The assumption of all machine learning algorithms is that all the examples from an input space are independent and identically distributed (i.i.d.), which means that each random variable has the same probability distribution as the others and all are mutually independent. The learning set consists of tuples of examples and their labels and can be represented as a table:
Table 3.1. Learning set as a table [6]
x_1      x_2      ...  x_n      y
x_1^(1)  x_2^(1)  ...  x_n^(1)  y^(1)
x_1^(2)  x_2^(2)  ...  x_n^(2)  y^(2)
...      ...      ...  ...      ...
x_1^(N)  x_2^(N)  ...  x_n^(N)  y^(N)
Binary or binomial classification is the task of classifying the elements of a given set into two groups on the basis of a classification rule [30]. When performing binary classification, the goal of a classification algorithm is to determine whether an input example x belongs to a class C or not, which we denote as:
h(x) = 1, x is an instance of class C
h(x) = 0, x is not an instance of class C
The set of all possible hypotheses is called the model. Learning is then the process of searching through the model in order to find an optimal hypothesis by minimizing the empirical or training error. The empirical error on the training set equals the ratio of the number of examples which were incorrectly classified to the total number of examples in that training set, N:
$E(h|D) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{h(x^{(i)}) \neq y^{(i)}\} = \frac{1}{N} \sum_{i=1}^{N} |h(x^{(i)}) - y^{(i)}|$ [6]
where $\mathbf{1}_P$ is the indicator function whose value is 1 if $P$ is true, and 0 otherwise.
In machine learning, the aim is to construct algorithms that are able to learn to predict a certain target output, so the learning algorithm is presented with some training examples that demonstrate the intended relation of input and output values. Then the learner is supposed to approximate the correct output, even for the examples that have not been shown during training. The set of assumptions that the learner uses to predict outputs when given inputs that it has not encountered is called the inductive or learning bias [31]. The inductive bias can be:
1. Restriction or language bias: we choose the model and in that way restrict the set of hypotheses that can be represented by the chosen model
2. Preference or search bias: we define the method of hypothesis search within the model, in that way giving an advantage to some hypotheses over the others
3.1.1 General steps in solving a supervised machine learning problem [13]:
1. Determine the type of training examples. Before doing anything else, the
user should decide what kind of data is to be used as a training set. In the
case of handwriting analysis, for example, this might be a single
handwritten character, an entire handwritten word, or an entire line of
handwriting.
2. Gather a training set. The training set needs to be representative of the
real-world use of the function. Thus, a set of input objects is gathered and
corresponding outputs are also gathered, either from human experts or
from measurements.
3. Determine the input feature representation of the learned function. The
accuracy of the learned function depends strongly on how the input object
is represented. Typically, the input object is transformed into a feature
vector, which contains a number of features that are descriptive of the
object. The number of features should not be too large, because of the
curse of dimensionality; but should contain enough information to
accurately predict the output.
4. Determine the structure of the learned function and corresponding
learning algorithm. For example, the engineer may choose to use support
vector machines or decision trees.
5. Complete the design. Run the learning algorithm on the gathered training
set. Some supervised learning algorithms require the user to determine
certain control parameters. These parameters may be adjusted by
optimizing performance on a subset (called a validation set) of the
training set, or via cross-validation.
6. Evaluate the accuracy of the learned function. After parameter
adjustment and learning, the performance of the resulting function should
be measured on a test set that is separate from the training set.
3.2 Model selection
Model selection is the task of selecting a statistical model from a set of candidate models, given data [14]. The selection is actually an optimization of the hyperparameters of some fixed model. However, the aim of the model is not only to perform good classification of the learning dataset examples, but to be able to generalize and perform well on previously unseen examples, i.e. to predict. The criteria for model selection are as follows [6]:
- A simple, but not too simple, model is capable of better generalization
- A simple model is easier to use and it has lower computational complexity
- A simple model is easier to train, because complex models have a lot of parameters to optimize
- A simple model is easier to understand and to extract knowledge from (for example, rules)
The aim is to choose a model whose complexity fits the complexity of the function we are trying to determine. Considering the complexity of the chosen model, there are two extremes:
1. Overfitting: if the chosen model is too complex compared to the real function, the hypotheses are too flexible. Those hypotheses will assume more information than there actually is in the given data, which leads to a loss of the generalization property. Furthermore, these hypotheses will perform well on the training set because they will adjust even to the noise in the data. However, when they need to predict the output for an input they never encountered in the learning set, their performance will be very low.
2. Underfitting: if the chosen model is too simple compared to the real function, the hypotheses will not be able to adjust to the data from the learning set and therefore will not perform well even on the learning dataset.
Figure 3.1. Underfitted vs. well-fitted vs. overfitted function [7]
Figure 3.1. shows three possible situations when it comes to model complexity:
1. Left picture, underfitting: for the given dataset represented by 2D points, the function of degree 1 is not complex enough to fit the data properly.
2. Middle picture: displays a function of degree 4, which for the given dataset is the optimal complexity, and this function fits the data best. This function is the goal function we are looking for.
3. Right picture, overfitting: the function of degree 15 displayed in this picture is too complex for the data and it adjusts to the noise in the data, which leads to low generalization power on previously unseen data.
In order to check whether the model is over- or underfitted, there is a technique called cross-validation. Cross-validation is performed by splitting the dataset into a training set, a validation set and a test set. The training set is used to train the chosen models and the validation set is used to check the generalization capability of the model in question. After the optimal model is selected, based on the validation results, the validation set and train set are united, the model is trained on both of them, and its generalization capability is tested on the test set.
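As a minimal sketch of this split (assuming scikit-learn, which is used later in this thesis; the arrays X and y are illustrative placeholders, not the thesis dataset), the train/validation/test division could look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: X holds feature vectors, y holds class labels.
X = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200)

# First split off the test set, then split the remainder into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=1 / 3, random_state=0)

# Model selection: fit candidate models on the train set and compare them on the
# validation set. After the best model is chosen, the train and validation sets are
# united, the model is refitted on both, and only then evaluated on the test set.
```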
3.3 Evaluation of binary classifiers
In the field of machine learning, a confusion (error) matrix is a specific matrix that enables visualization of the performance of a supervised learning algorithm. To evaluate a classifier, one compares its output to another reference classification (ideally a perfect classification, but in practice the output of another gold standard test) and cross-tabulates the data into a 2×2 contingency table¹, comparing the two classifications. One then evaluates the classifier relative to the gold standard by computing summary statistics of these four numbers. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class [15]. In predictive analysis, a confusion matrix is a matrix whose rows and columns represent the number of true positives, false positives, true negatives and false negatives. A false positive error is a result that indicates a given condition has been fulfilled, when it actually has not. On the other hand, a false negative error is where a test result indicates that a condition failed, when it actually did not.
Figure 3.2. Confusion matrix
¹ A contingency table is a type of table in a matrix format that displays the frequency distribution of the variables [16]
The following measures are used to evaluate a binary classifier:
a) Accuracy: the proportion of true results (both true positives and true negatives) among the total number of cases examined. An accuracy of 100% means that the measured values are exactly the same as the given values [17].
$accuracy = \frac{TP + TN}{TP + FP + TN + FN}$
b) Recall: measures the proportion of positives that are correctly identified as such [32].
$recall = \frac{TP}{TP + FN}$
c) Precision: the proportion of the true positives against all the positive results (both true positives and false positives).
$precision = \frac{TP}{TP + FP}$
d) F-beta measure: a measure that combines precision and recall, with the constraint $\beta > 0$. It measures the effectiveness of retrieval with respect to a factor $\beta$ which gives more or less importance to precision or recall.
$F_\beta = (1 + \beta^2) \cdot \frac{precision \cdot recall}{\beta^2 \cdot precision + recall}$
For $\beta = 1$ the measure is approximately the average of precision and recall when they are close, and is more generally the square of the geometric mean divided by the arithmetic mean. In this case, precision and recall are equally weighted.
$F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}$
For $\beta = 2$ the F-measure weights recall higher than precision:
$F_2 = \frac{5 \cdot precision \cdot recall}{4 \cdot precision + recall}$
For $\beta = 0.5$ the F-measure weights precision higher than recall:
$F_{0.5} = \frac{1.25 \cdot precision \cdot recall}{0.25 \cdot precision + recall}$
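To make these measures concrete, here is a small illustrative sketch (assuming scikit-learn; the label vectors are made up) that computes them from a confusion matrix:

```python
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, fbeta_score)

# Illustrative ground-truth labels and classifier outputs (1 = positive class).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP=%d FP=%d TN=%d FN=%d" % (tp, fp, tn, fn))

# The same formulas as above.
print("accuracy :", accuracy_score(y_true, y_pred))           # (TP+TN)/(TP+FP+TN+FN)
print("recall   :", recall_score(y_true, y_pred))             # TP/(TP+FN)
print("precision:", precision_score(y_true, y_pred))          # TP/(TP+FP)
print("F1       :", fbeta_score(y_true, y_pred, beta=1))      # precision and recall equally weighted
print("F2       :", fbeta_score(y_true, y_pred, beta=2))      # weights recall higher
print("F0.5     :", fbeta_score(y_true, y_pred, beta=0.5))    # weights precision higher
```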
3.4 Support vector machine, SVM
A support vector machine is a discriminative model, which means it models the conditional probability $p(y|x)$ directly, as opposed to generative models which model the joint probability distribution $p(x, y)$.
Example
Let's assume that we have the following data in the form (x, y):
(3,4), (3,1), (4,1), (4,1)
Table 3.2. The joint probability
p(x, y)   y = 1   y = 4
x = 3     1/4     1/4
x = 4     1/2     0
Table 3.3. The conditional probability
p(y|x)    y = 1   y = 4
x = 3     1/2     1/2
x = 4     1       0
Generative algorithms model how the data was generated in order to classify it and, based on those generation assumptions, try to determine which class was more likely to generate the given example. On the other hand, discriminative algorithms do not take into account how the data was generated, but rather use the data to create a decision boundary and then try to determine on which side of that decision boundary the given example lies.
Every machine learning algorithm is defined by a model, an error function and an optimization procedure. In the case of support vector machines, a data point is viewed as a p-dimensional vector (a list of p numbers), and we want to know whether we can separate such points with a (p−1)-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So we choose the hyperplane such that the distance from it to the nearest data point on each side is maximized [35]. The chosen hyperplane will also have the minimal generalization error.
3.4.1 Maximal margin problem
Model:
$h(x) = w^T x + w_0$
Class labels:
$y \in \{-1, +1\}$
Boundary between classes:
$y = \mathrm{sgn}(h(x))$
We assume that the examples from the dataset $\chi$ are linearly separable, meaning that there exist $w, w_0$ such that the following applies:
$h(x^{(i)}) \geq 0, \ \forall y^{(i)} = +1$
$h(x^{(i)}) < 0, \ \forall y^{(i)} = -1$
Or, shorter:
$\forall (x^{(i)}, y^{(i)}) \in \chi . \ y^{(i)} h(x^{(i)}) \geq 0$
It is obvious that the version space is infinite, i.e. there is an infinite number of solutions. However, due to the inductive bias, i.e. the desired solution being the one giving the maximum margin, there is only one solution of interest to us. Maximizing the margin gives, as a side effect, the hyperplane passing exactly halfway between the two closest examples.
The distance from the hyperplane:
$d = \frac{h(x)}{\|w\|}$
Since only the hyperplanes which correctly classify the examples are of interest, the absolute distance to the hyperplane can be denoted as:
$\frac{y^{(i)} h(x^{(i)})}{\|w\|} = \frac{y^{(i)} (w^T x^{(i)} + w_0)}{\|w\|}$
According to the definition, the distance between the hyperplane and the closest example is:
$\frac{1}{\|w\|} \min_i y^{(i)} (w^T x^{(i)} + w_0)$
and it is that distance we want to maximize:
$\mathrm{argmax}_{w, w_0} \frac{1}{\|w\|} \min_i y^{(i)} (w^T x^{(i)} + w_0)$
The last equation is the optimization problem that needs to be solved. However, since in the given form it is difficult to optimize (there is a min function within a max function), several steps need to be performed to simplify it.
First, for the example $x^{(i)}$ closest to the margin it is possible to define:
$y^{(i)} (w^T x^{(i)} + w_0) = 1$
As a consequence, for all the other examples the following will hold:
$y^{(i)} (w^T x^{(i)} + w_0) \geq 1$
With the above definition, the optimization problem becomes:
$\mathrm{argmax}_{w, w_0} \frac{1}{\|w\|} \min_i y^{(i)} (w^T x^{(i)} + w_0) = \mathrm{argmax}_{w, w_0} \frac{1}{\|w\|} \cdot 1 = \mathrm{argmax}_{w, w_0} \frac{1}{\|w\|}$
with the constraint:
$y^{(i)} (w^T x^{(i)} + w_0) \geq 1, \quad i = 1, \ldots, N$
Furthermore, maximizing $\frac{1}{\|w\|}$ is equivalent to minimizing $\|w\| = \sqrt{w^T w}$, which is in turn equivalent to minimizing $\|w\|^2$. To simplify the further steps, the last expression is multiplied by $\frac{1}{2}$, which yields the final formulation of the optimization problem:
$\mathrm{argmin}_{w, w_0} \frac{1}{2} \|w\|^2$
with the constraint:
$y^{(i)} (w^T x^{(i)} + w_0) \geq 1, \quad i = 1, \ldots, N$
Finally, the optimization problem became a typical convex optimization problem² with constraints, which can be defined in its standard form as follows:
Minimize $f(x)$
with constraints: $g_i(x) \leq 0, \ i = 1, \ldots, m$
$a_i^T x - b_i = 0, \ i = 1, \ldots, p$
Lagrange multipliers method
The Lagrange multipliers method is used to reformulate an optimization problem with constraints in such a way that the constraints are directly built into the target function. The aforementioned convex optimization problem with constraints can be transformed to:
$L(x, \alpha, \beta) = f(x) + \sum_{i=1}^{m} \alpha_i g_i(x) + \sum_{i=1}^{p} \beta_i h_i(x)$
where $\alpha_i$ and $\beta_i$ are the Lagrange multipliers for the inequality and equality constraints, respectively. Furthermore, for the multipliers $\alpha_i$ the so-called Karush-Kuhn-Tucker (KKT) conditions apply:
$\alpha_i \geq 0, \ i = 1, \ldots, m$
$\alpha_i g_i(x) = 0, \ i = 1, \ldots, m$
² Convex minimization, a subfield of optimization, studies the problem of minimizing convex functions over convex sets. The convexity property can make optimization in some sense "easier" than the general case: for example, any local minimum must be a global minimum [19].
Maximal margin problem's dual form
The aforementioned maximal margin problem was defined as follows:
$\mathrm{argmin}_{w, w_0} \frac{1}{2} \|w\|^2$
$y^{(i)} (w^T x^{(i)} + w_0) \geq 1, \quad i = 1, \ldots, N$
In terms of the Lagrange dual function, we have a function to optimize ($f(x)$) and a constraint with an inequality, leading to the following Lagrange function:
$L(w, w_0, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{N} \alpha_i \left( y^{(i)} (w^T x^{(i)} + w_0) - 1 \right)$
$\alpha = (\alpha_1, \ldots, \alpha_N)$ is a vector of Lagrange multipliers, one for each constraint. By choosing to optimize the dual form, the optimization problem is simplified, as the optimization comes down to the optimization of just one variable ($\alpha$).
Differentiating with respect to $w$ and $w_0$ and setting the derivatives to zero:
$w = \sum_{i=1}^{N} \alpha_i y^{(i)} x^{(i)} \qquad 0 = \sum_{i=1}^{N} \alpha_i y^{(i)}$
The dual Lagrange function takes the form:
$\hat{L}(\alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{N} \alpha_i \left( y^{(i)} (w^T x^{(i)} + w_0) - 1 \right)$
$= \frac{1}{2} \|w\|^2 - \sum_{i=1}^{N} \alpha_i y^{(i)} w^T x^{(i)} - w_0 \sum_{i=1}^{N} \alpha_i y^{(i)} + \sum_{i=1}^{N} \alpha_i$
$= \frac{1}{2} \sum_{i=1}^{N} \alpha_i y^{(i)} (x^{(i)})^T \sum_{j=1}^{N} \alpha_j y^{(j)} x^{(j)} - \sum_{i=1}^{N} \alpha_i y^{(i)} (x^{(i)})^T \sum_{j=1}^{N} \alpha_j y^{(j)} x^{(j)} + \sum_{i=1}^{N} \alpha_i$
$= \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)}$
Consequently, the dual optimization problem becomes: maximize the expression
$\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)}$
with the constraints $\alpha_i \geq 0, \ i = 1, \ldots, N$ and $\sum_{i=1}^{N} \alpha_i y^{(i)} = 0$.
By transforming the optimization problem into its dual form, the complexity of the algorithm is reduced in the case N << n, because of the number of variables in:
Primal problem: $n + 1$ variables
Dual problem: $N$ variables
Model
Previously it was calculated that:
$w = \sum_{i=1}^{N} \alpha_i y^{(i)} x^{(i)}$
Replacing w with the above in the model function, we obtain:
$h(x) = w^T x + w_0$ in the primal form
$h(x) = \sum_{i=1}^{N} \alpha_i y^{(i)} (x^{(i)})^T x + w_0$ in the dual form
To classify an example x, we calculate the scalar product of the example x and all the other examples in the dataset $\chi$, multiplied by the weight $\alpha_i$ and the label $y^{(i)}$. Instead of storing the weights $w$, we need to store the examples; instead of $h(x|w)$ we have $h(x|\alpha, \chi)$, meaning that the complexity of the model grows with the number of examples, i.e. the model is nonparametric. If the SVM is trained by solving the primal problem, it is a parametric model. On the other hand, if the SVM is trained by solving the dual problem, it is a nonparametric model.
Support vectors
From the KKT condition:
$\alpha_i (y^{(i)} h(x^{(i)}) - 1) = 0$
it is possible to conclude that for every example $x^{(i)}$ from $\chi$ either
$\alpha_i = 0$ or $y^{(i)} h(x^{(i)}) = 1$
holds, i.e. only the vectors lying precisely on the maximum margin appear in the expression, and those vectors are called the support vectors. All the other vectors, for which $\alpha_i = 0$ holds, do not affect the output of the model at all and can be disregarded when prediction is performed. In the dual form, the hyperplane of the primal problem is defined by a linear combination of the support vectors.
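In practice the dual problem is not solved by hand but by a library solver. As a minimal sketch (assuming scikit-learn, with made-up 2D points; a linear kernel with a large C approximates the hard-margin classifier described above), a maximal margin SVM can be fitted and its support vectors inspected like this:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes in 2D (illustrative data).
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

# Large C makes the soft-margin solver behave like the hard-margin maximal margin problem.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)   # the x^(i) with alpha_i > 0
print("dual coefficients:", clf.dual_coef_)         # alpha_i * y^(i) for the support vectors
print("w =", clf.coef_, " w0 =", clf.intercept_)    # hyperplane recovered from the dual solution
print("prediction for (3, 3):", clf.predict([[3, 3]]))
```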
3.5 Principal component analysis, PCA
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation³ to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components [33]. The basic idea behind it is that if the data lives in a subspace, it is going to look very flat when viewed from the full space, e.g.
Figure 3.3. Data in a subspace [11]
If we fit a Gaussian to the data, the equiprobability contours are going to be ellipsoids. If y is Gaussian with covariance $\Sigma$, the equiprobability contours will be ellipses whose principal components $\Phi_i$ are the eigenvectors of $\Sigma$, and whose principal lengths $\lambda_i$ are the eigenvalues of $\Sigma$.
³ In linear algebra, an orthogonal transformation is a linear transformation T : V → V on a real inner product space V that preserves the inner product.
Figure 3.4. Equiprobability contour [11]
By computing the eigenvalues we know whether the data is flat:
1. Data is flat if $\lambda_1 >> \lambda_2$
Figure 3.5. Flat data [11]
2. Data is not flat if $\lambda_1 \approx \lambda_2$
Figure 3.6. Not flat data [11]
As previously mentioned, PCA finds the principal components of the data. The principal components are the underlying structure in the data: they tell us the directions in which there is the most variance, i.e. the directions in which the data is most spread out.
Example 1
Let the following picture represent the data points in the dataset:
Figure 3.7. Dataset
In order to find the direction of most variance, we need to find a straight line along which the data is most spread out when projected onto it.
a) A vertical straight line
Figure 3.8. Data projected on the vertical line
b) A horizontal straight line
Figure 3.9. Data projected on the horizontal line
By visual inspection of the above pictures, it is obvious that the data is more spread out when projected onto the horizontal line. Any line other than the horizontal one would, in this case, have smaller variance, and therefore, in this example, the horizontal line is the principal component.
Every set of data points can be decomposed into eigenvectors and eigenvalues. An eigenvector gives a direction, and its eigenvalue is a number describing how spread out the data is along that direction. The eigenvector with the highest eigenvalue is the principal component. The number of eigenvector/eigenvalue pairs in a data set equals the number of dimensions that the data set has. If, for example, there are 2 variables, meaning that the data set is two-dimensional, there are 2 eigenvectors/eigenvalues.
Example 2
Let us assume that we have the following data:
Figure 3.10. Example data
As mentioned above, since the data is represented by 2 variables, there
are two eigenvectors. One of them, from the previous example, is the line
splitting the ellipses longways. The other eigenvector is perpendicular to the
principal component. The eigenvectors have to be able to span the whole x-y area, and to optimally satisfy that condition, the two eigenvector directions have to be orthogonal to one another. The two eigenvectors would look like this:
Figure 3.11. Eigenvectors
These eigenvectors actually create a much more useful set of axes to frame the data:
Figure 3.12. New axis
The data itself was not changed in any way; it is just observed from a different perspective. The directions of the new axes are the directions in which there is the most variation in the data, which is where there is the most information.
Now let us observe how PCA is used for dimension reduction on another example.
Example 3
Let us assume we have a data set in 3 dimensions:
Figure 3.13. 3D data set [12]
As the data is represented by 3 variables, there are 3 eigenvectors that need to be found. Since the data in the picture is all lying in a 2D plane, one of the three eigenvectors will have an eigenvalue of zero, while the other two will have large eigenvalues.
Figure 3.14. Eigenvectors [12]
Since the eigenvalue of the eigenvector ev3 equals zero, that eigenvector is pretty useless, and the data can be represented by the other two eigenvectors, which means it can be represented in 2D instead of 3D as before. The 2D representation is shown in Figure 3.15.
It is important to mention that the dimension can be reduced even if none of the eigenvalues is equal to zero, as long as the eigenvalue of the discarded eigenvector is much smaller than the eigenvalues of the other eigenvectors.
Figure 3.15. 2D representation [12]
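A minimal sketch of such a reduction (assuming scikit-learn and NumPy; the 3D points are made up but lie close to a plane, as in the example above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Illustrative 3D data that effectively lives in a 2D plane plus a little noise.
plane = rng.rand(100, 2)
X = np.column_stack([plane[:, 0], plane[:, 1], plane[:, 0] + plane[:, 1]])
X += 0.01 * rng.randn(*X.shape)

pca = PCA(n_components=3)
pca.fit(X)
print("eigenvalues (explained variance):", pca.explained_variance_)
print("eigenvectors (principal axes):\n", pca.components_)

# The third eigenvalue is close to zero, so the data can be projected to 2D
# with almost no loss of information.
X_2d = PCA(n_components=2).fit_transform(X)
print("reduced shape:", X_2d.shape)
```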
3.6 Linear discriminant analysis, LDA
Linear discriminant analysis (LDA) is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events [34]. The objective of LDA is to perform dimensionality reduction while also preserving as much of the class discriminatory information as possible. In PCA, the main idea was to re-express the available dataset by extracting the relevant information, reducing the redundancy and minimizing the noise. However, the discrimination between classes was not taken into account; we could only project the dataset onto a lower-dimensional space with a more powerful data representation.
Let us assume we have a pattern classification problem where there are C classes. Each class $\omega_i$ has $N_i$ m-dimensional samples, where $i = 1, 2, \ldots, C$; hence we have a set of m-dimensional samples $x_1, x_2, \ldots, x_{N_i}$ belonging to class $\omega_i$. All of those samples are put into a matrix X in such a way that each column represents one sample. The idea is to obtain a transformation of X to Y by projecting the samples in X onto a hyperplane with dimension $C - 1$. In order to find a good projection vector, i.e. the one that maximizes the separability of the projected scalars, we need to define a measure of separation between the projections.
Example [13]
Compute the linear discriminant projection for the following two-dimensional dataset:
Samples for class $\omega_1$: $X_1 = (x_1, x_2) = \{(4,2), (2,4), (2,3), (3,6), (4,4)\}$
Samples for class $\omega_2$: $X_2 = (x_1, x_2) = \{(9,10), (6,8), (9,5), (8,7), (10,8)\}$
Figure 3.16. 2D dataset [13]
The first step is to calculate the class means:
$\mu_1 = \frac{1}{N_1} \sum_{x \in \omega_1} x = \frac{1}{5} [(4,2) + (2,4) + (2,3) + (3,6) + (4,4)] = (3, 3.8)$
$\mu_2 = \frac{1}{N_2} \sum_{x \in \omega_2} x = \frac{1}{5} [(9,10) + (6,8) + (9,5) + (8,7) + (10,8)] = (8.4, 7.6)$
In the second step we need to calculate the covariance matrices of the classes. We could instead have chosen the distance between the projected means as our objective function; however, it would not be a good measure, as it does not take into account the standard deviation within the classes, as shown in Figure 3.17.
Figure 3.17. Means distance between classes [13]
The solution to this issue was proposed by Fisher: maximize a function that represents the difference between the means, normalized by a measure of the within-class variability, i.e. the scatter.
The covariance matrix for the first class:
The covariance matrix for the second class:
The within-class scatter matrix is:
$S_W = S_1 + S_2$
The between-class scatter matrix is:
$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$
The LDA projection is then obtained as the solution of the generalized eigenvalue problem
$S_W^{-1} S_B w = \lambda w$
Furthermore:
$w^* = S_W^{-1} (\mu_1 - \mu_2)$
where $w^*$ is known as Fisher's linear discriminant, which is rather a specific choice of direction for the projection of the data down to one dimension.
Figure 3.18. Solution [13]
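The intermediate matrices of this example are not reproduced in this transcript, but they can be recomputed from the given samples. The following sketch (assuming NumPy, and assuming the sample covariance with the 1/(N−1) normalization, which is one possible convention) reproduces the whole example numerically:

```python
import numpy as np

# Samples for the two classes from the example above.
X1 = np.array([[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)    # (3, 3.8) and (8.4, 7.6)

S1 = np.cov(X1, rowvar=False)                  # covariance matrix of the first class
S2 = np.cov(X2, rowvar=False)                  # covariance matrix of the second class
SW = S1 + S2                                   # within-class scatter matrix
SB = np.outer(mu1 - mu2, mu1 - mu2)            # between-class scatter matrix

# Fisher's linear discriminant: the dominant solution of SW^-1 SB w = lambda w,
# which for two classes is proportional to SW^-1 (mu1 - mu2).
w_star = np.linalg.inv(SW).dot(mu1 - mu2)
w_star /= np.linalg.norm(w_star)

print("mu1 =", mu1, "mu2 =", mu2)
print("S1 =\n", S1, "\nS2 =\n", S2)
print("SW =\n", SW, "\nSB =\n", SB)
print("w* =", w_star)
```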
4. Dataset processing
The dataset used for training the SVM classifier consists of the data from executable files (.exe or .dll) containing either malware or virus-free software. The processing of the dataset consists of the following steps:
1. Objdump: a Linux terminal command that is used to display information about one or more object files. In this paper the command objdump -d executableFile was used to disassemble the executables, preparing the dataset for further processing. A shell script was used to automate the process: it reads in all the available .dll and .exe files, calls the aforementioned command on each file and saves the output to text files, naming them in numerical order (e.g. 0.txt, 1.txt, 2.txt etc.) to simplify the further processing.
2. Dictionary creation: a dictionary holding, as keys, all the possible assembly commands returned by the objdump command, taken from the Intel 64 and IA-32 Architectures manual [9]. The values in the dictionary represent the number of occurrences of each command in the output of objdump.
3. Regular expression matching: the regular expression from Figure 4.1. is used to match and extract only the commands from the objdump output.
Figure 4.1. Regular expression for obtaining the commands
4. Counting occurrences: for every command obtained in step 3, the number of occurrences for that command in the dictionary is increased by one.
5. Matrix creation: the final step of the process consists of creating the matrix that will be the input to the SVM classifier training algorithm. The columns of the matrix represent each of the commands from the dictionary. Each row of the matrix represents one file from the dataset, while element (i, j) of the matrix represents the number of occurrences of the j-th command in the i-th file.
Table 4.1. The input matrix for the classifier
files/commands  ADD  JMP  MOV  AND  OR   ...
0.txt           123  456  19   22   17   ...
1.txt           98   123  11   54   22   ...
...             ...  ...  ...  ...  ...  ...
Figure 4.2. Code for parsing the data set
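The original Figure 4.2 is not reproduced in this transcript. The following is only a hedged sketch of how such parsing could look (the regular expression and the command list are simplified illustrations, not the exact ones from Figures 4.1 and 4.2):

```python
import re
import subprocess
from collections import OrderedDict

# Shortened, illustrative list of x86 mnemonics; the thesis uses the full list
# from the Intel manual as the dictionary keys.
COMMANDS = ["add", "and", "call", "cmp", "jmp", "lea", "mov",
            "or", "pop", "push", "ret", "sub", "xor"]

# Simplified stand-in for the regular expression from Figure 4.1:
# match the mnemonic column of an objdump disassembly line.
MNEMONIC_RE = re.compile(r"^\s*[0-9a-f]+:\s+(?:[0-9a-f]{2}\s)+\s*([a-z]+)", re.MULTILINE)

def count_commands(executable_path):
    """Disassemble one file with objdump and count mnemonic occurrences (steps 1-4)."""
    disassembly = subprocess.check_output(["objdump", "-d", executable_path])
    disassembly = disassembly.decode("utf-8", "replace")
    counts = OrderedDict((cmd, 0) for cmd in COMMANDS)
    for mnemonic in MNEMONIC_RE.findall(disassembly):
        if mnemonic in counts:
            counts[mnemonic] += 1
    return list(counts.values())   # one row of the input matrix from Table 4.1

# Step 5, matrix creation: one row per file.
# matrix = [count_commands(path) for path in ["0.exe", "1.dll"]]
```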
The second argument to an SVM classifier, in the case of supervised learning, is the corresponding vector Y containing the correct class for each of the rows (files) from the matrix. Vector Y, in this paper, contains n elements with values of either 1 or -1, marking the file as malware or clean software, respectively. Furthermore, the dataset is divided into three groups:
1. Train set: contains 50% of all the files in the dataset. It is used to determine the structure of the learned function and the corresponding learning algorithm.
2. Validation set: contains 25% of all the files in the dataset. In order to avoid overfitting, whenever any classification parameter needs to be adjusted, the validation set is used.
3. Test set: contains the last 25% of the files in the dataset. After choosing the best fitting model for the problem and adjusting its parameters on the validation set, the test set is used to analyze the model's generalization, i.e. precision, recall etc.
Figure 4.3. shows a part of the Python script used to both process the dataset and calculate the precision and F1-score of SVM classifiers initialized with the provided parameters. After the creation of the classifier with the given parameters, the classifier is fitted to the train set and tested on the validation set, giving the precision and F1-score as output in order to make the search for the best possible classifier easier.
Figure 4.3. Code for precision calculation
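Since the original Figure 4.3 is not reproduced in this transcript, here is a hedged sketch of what such an evaluation script could look like (assuming scikit-learn; the file names x.txt and y.txt follow Chapter 5, and the kernel parameters are only example values):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import precision_score, f1_score

# Load the input matrix and class vector produced by the dataset processing steps.
X = np.loadtxt("x.txt")
y = np.loadtxt("y.txt")   # 1 = malware, -1 = clean software

# 50% train, 25% validation (the remaining 25% is kept aside as the test set).
n = len(y)
X_train, y_train = X[: n // 2], y[: n // 2]
X_val, y_val = X[n // 2: 3 * n // 4], y[n // 2: 3 * n // 4]

# Example parameters: a polynomial kernel of a given degree.
clf = SVC(kernel="poly", degree=4, C=1.0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_val)
print("precision:", precision_score(y_val, y_pred))
print("F1-score :", f1_score(y_val, y_pred))
```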
5. Malware detection software
The malware detection software is based on the previously selected SVM classifier. The aforementioned Python script saves the input matrix into a .txt file. Once the software is started, it first initializes the SVM classifier and calls the .fit function on the matrix read in from the file. As Figure 5.1. shows, a graphical interface is presented to the user with an "Open" button which opens a file selector window where they choose the executable file they want to run the analysis on. Once the file is selected, the user can start the analysis by clicking the "Start" button. The other option is for the user to type in the full path of the file they want to analyze.
Figure 5.1. Application window
The results are obtained following these steps:
1. The program calls the aforementioned objdump Linux command on the selected executable file
2. Using the same regular expression as when training the classifier, only the commands from the objdump output are selected
3. Using the same list of assembler commands, the dictionary is created
4. Parsing the commands extracted in step 2, the program counts the number of occurrences of each of the commands in the dictionary
5. Once the input vector from step 4 is created, the program calls the classifier.predict function on that input vector and as a result gets either -1 or 1, meaning the software is clean or malware, respectively.
6. Based on the result of the classification from step 5, the user is presented with a window containing the message either from Figure 5.3. or 5.4.
Figure 5.2. Code containing some of the aforementioned steps above
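The original Figure 5.2 is not reproduced in this transcript. A hedged sketch of the prediction steps, assuming the hypothetical count_commands helper sketched in Chapter 4 and an already fitted classifier clf, could look like this:

```python
import numpy as np

def analyze(path, clf, count_commands):
    """Steps 1-6: disassemble, extract and count commands, classify, build the message."""
    # Steps 1-4: objdump + regular expression + dictionary of occurrence counts,
    # exactly as when the training matrix was built.
    feature_vector = np.array(count_commands(path)).reshape(1, -1)

    # Step 5: the fitted SVM decides; -1 means clean software, 1 means malware.
    label = clf.predict(feature_vector)[0]

    # Step 6: the returned message is shown to the user in a window
    # such as the ones in Figures 5.3 and 5.4.
    return "Virus detected!" if label == 1 else "Virus free software"
```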
Figure 5.3. Virus detected window
Figure 5.4. Virus free software
The malware detection software is written in Python and the graphical interface is created using Glade⁴. Furthermore, for machine learning purposes, the program uses the open-source scikit-learn library, which is a Python library containing simple and efficient tools for data mining and data analysis, built on NumPy⁵, SciPy⁶ and matplotlib⁷.
⁴ Glade is a RAD tool to enable quick & easy development of user interfaces for the GTK+ toolkit and the GNOME desktop environment. The user interfaces designed in Glade are saved as XML, and by using the GtkBuilder GTK+ object these can be loaded by applications dynamically as needed. By using GtkBuilder, Glade XML files can be used in numerous programming languages including C, C++, C#, Vala, Java, Perl, Python, and others.
⁵ NumPy is the fundamental package for scientific computing with Python. It contains, among other things, a powerful N-dimensional array object and sophisticated (broadcasting) functions [10].
⁶ SciPy (pronounced "Sigh Pie") is an open source Python library used by scientists, analysts, and engineers doing scientific computing and technical computing [18].
⁷ Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like wxPython, Qt, or GTK+ [2].
Both the Python script from Chapter 4 of this paper and the malware detection software from this chapter are programmed in a plug/unplug fashion, giving the opportunity to easily modify several important parameters (a short sketch follows the list below):
1. Dataset used to train the classifier: as the classifier's fit function is trained with the data from two text files called "x.txt" and "y.txt", those files can easily be replaced with a better or updated version of the input matrix and the corresponding class vector. The only limitation is the format of those files, where "x.txt" needs to be in the form of a matrix, and "y.txt" in the form of a 1D array.
2. Parameters of the SVM classifier: the classifier is initialized in an init function when the program is run. The parameters can easily be changed by altering the current values in the constructor call. This gives the opportunity to easily switch the program to a better classifier once better results are obtained from the script in Chapter 4.
3. Classifier: the SVM classifier can be replaced by any other available classifier from the scikit-learn library if that would yield better results.
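For illustration, the plug/unplug idea boils down to a few lines of initialization code; a sketch (the file names follow the text above, while the parameter values are only examples):

```python
import numpy as np
from sklearn.svm import SVC

# 1. Dataset: swap x.txt / y.txt for an updated matrix and class vector.
X = np.loadtxt("x.txt")    # 2D matrix, one row per file
y = np.loadtxt("y.txt")    # 1D array of labels (1 = malware, -1 = clean)

# 2. Parameters: change the values in this constructor call.
# 3. Classifier: replace SVC with any other scikit-learn classifier.
classifier = SVC(kernel="poly", degree=4, C=1.0)
classifier.fit(X, y)
```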
If we take into consideration that the Python script from Chapter 4 is easily modifiable as well, the combination of these two programs is perfect for testing and investigation purposes. First, with the Python script it is possible to get the results for different classifiers, as well as to process different datasets and create different input matrices and class vectors, saving them to a file. Then, in the next step, once it is determined which classifier yielded better results or which dataset trained the classifier better, this data can be forwarded to the malware classification software, which will then use it for better performance.
Furthermore, the graphical interface itself is created separately from all the other Python code. Therefore, replacing the current simple graphical interface with a more advanced one should not require any additional changes in the source code itself. When replacing the graphical interface, it is important to notice that there are two methods called from the Python code:
1. When the "Open" button is pressed, the "name_of_the_function" method creates a file selector window and displays it to the user
2. When the "Start" button is pressed, the button_start method is called, which starts the process explained above in 6 steps.
Therefore, in order to "plug in" a new graphical user interface, it is important to connect the button click events to these two functions, while the source code itself stays unchanged.
6. Results of the classifiers evaluation
Before selecting the appropriate parameters of the SVM classifier with a polynomial kernel, several different polynomial degrees were tested. Testing a classifier is a process of training it on the train set and calculating all the evaluation measures explained in chapter 3.3 of this paper: precision, recall, accuracy and the F1-macro measure. Furthermore, to get an insight into the classifier's generalization power, it is trained on the training set and then tested on previously unseen data.
6.1. Results obtained for the polynomial kernel, training data
The SVM classifier was trained on the train set and the results from Table 6.1. were obtained. It is obvious that the more complex the polynomial function, the better the results. However, after reaching approximately 60% precision, the growth starts to slow down, resulting in the classifier evaluation values growing more slowly. Furthermore, the best result here is obtained for the SVM classifier with degree 6, with an accuracy of 72%.
Table 6.1. Results of the testing train set
degree F1macro Precision Recall Accuracy
1 0.346153846 0.26470588 0.5 0.32
2 0.452830188 0.35294117 0.63157895 0.42
3 0.5 0.419354838 0.6190476 0.48
4 0.666666666 0.6 0.75 0.64
5 0.7037037 0.678571428 0.73076923 0.68
6 0.74074074 0.769230769 0.7142857 0.72
6.2. Results obtained for the polynomial kernel, validation data
After training the classifier with a specific degree, it is presented with a new set of previously unseen data. As expected, the performance of the classifiers for each of the degrees has decreased. When tested on the validation set, the best results were yielded by the SVM classifier of degree 4, as opposed to the one of degree 6 from Table 6.1., which gave better results on the train set.
Table 6.2. Results of testing validation set
degree F1macro Precision Recall Accuracy
1 0.142857 0.114754098 0.189189 0.16
2 0.2708333 0.22807017 0.3333333 0.3
3 0.4175824 0.365384615 0.487179487 0.47
4 0.5531914 0.5909091 0.52 0.58
5 0.5454 0.551020408 0.54 0.55
6 0.51020408 0.53191489 0.490196 0.52
From the above two tables it is possible to conclude that the classifier with degree 4 is the optimal option of all the tested classifiers, as it gave the best values on the validation set. Although the polynomial of degree 6 gave much better results on the train set, once presented with the new, unseen set of data, its results decreased. This behavior could be a signal of overfitting and therefore the polynomial of degree 4 is chosen.
7. Conclusion
The results of the classification with the chosen classifier show that there
is room for improvement. Considering that the number of features for every
example in the dataset is rather large, i.e. the number of occurrences of every
assembly command, it is possible that using one of the aforementioned
dimensionality reduction algorithms would yield better results. Also, a look at the
matrix created from the given dataset shows that many elements have the value
0, when a command does not appear in the example at all, while frequently
appearing commands set the value of an element to numbers greater than
10 000. These oscillations in the data might be a sign of “noise” that could affect
the results of the classifier, as could features that do not carry any useful
information. Therefore, applying either the PCA or the LDA algorithm might
remove the noise from the data and also reduce the dimensionality by
disregarding irrelevant features.
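As one possible direction, and only as a sketch under the assumptions noted in the comments, such a reduction could be placed in front of the SVM using scikit-learn; the number of components is an illustrative value, not a result from this thesis.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

model = make_pipeline(
    StandardScaler(),             # evens out the 0 vs. >10 000 scale differences
    PCA(n_components=50),         # illustrative choice: keep the 50 strongest directions
    SVC(kernel="poly", degree=4),
)
# model.fit(X_train, y_train) would then be evaluated on the validation set
# in the same way as in chapter 6.

LDA (sklearn.discriminant_analysis.LinearDiscriminantAnalysis) could be substituted for PCA in the same pipeline, with the difference that it uses the class labels when choosing the projection.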
Furthermore, adding features beyond those obtained from the static
analysis (for example, also performing a dynamic analysis of the changes in the
operating system while the program is run) could also improve the obtained
results, as there is always a possibility that the features in the dataset do not
carry enough information relevant for the classification. In other words, if there is
no correlation between the data in the dataset and the results of the
classification, the outcome will be poor no matter which statistical approach is
chosen.
However, I consider the approximately 60% accuracy obtained in the
practical part of this paper sufficient to justify continued work on improvements.
Once again, the machine learning approach has shown itself useful and
applicable to a problem that seems to be outside its scope. If the results of the
classification improved to over 90%, especially when tested on newly created
malware, it would mean that the created software could detect malware (already
known or yet unknown) based on what it has learnt from historical data, without
any need to update a database of known malware.
Abstract
Malware detection software based on a Support Vector Machine (SVM)
classifier. The software detects malware in executable files (such as .exe, .dll).
The dataset, divided into training, validation and test sets, is assembly code
obtained by disassembling the executables. The assembly is then processed
and the frequency of each command is calculated. Using the validation set to
minimize the total error, dimensionality reduction is performed with techniques
such as Principal Component Analysis (PCA) or Linear Discriminant Analysis
(LDA). Furthermore, the SVM kernel is chosen based on the validation set as
well. The SVM with the optimal parameters is tested on the test set, and
precision, recall and the F1-measure are calculated, as well as the number of
false positives, false negatives and correctly classified executables.
Keywords:
malware, SVM, support vector machine, PCA, principal component analysis,
LDA, linear discriminant analysis, dimensionality reduction, dataset, kernel,
precision, recall, F1-measure, false positive, false negative, classification
Sažetak
A program for the detection of malicious software (malware) built on
classification with a support vector machine (SVM). The program detects
malware in executable files (e.g. .exe, .dll). The dataset for training, validation
and testing is assembly code obtained by disassembling the executable files,
after which the frequency of occurrence of each command is computed; these
frequencies are the input to the SVM. Using the validation set to minimize the
total error, the number of dimensions (dimensionality reduction) and the kernel
of the SVM are determined. For dimensionality reduction, algorithms such as
PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis) and
similar can be used. The selected SVM is tested on the test set, and from the
results the precision, recall and the F1 score are computed, as well as the
number of false positives, false negatives and correctly classified samples
(executable files).
Keywords:
malware, classification, SVM, support vector machine, dimensionality reduction,
kernel, PCA, LDA, precision, recall, F1 score, false positives, false negatives.