NETWORK INTRUSION DETECTION SYSTEMS
USING RANDOM FORESTS ALGORITHM
by
Jiong Zhang
A thesis submitted to the
School of Computing
in conformity with the requirements for
the degree of Master of Science
Queen’s University
Kingston, Ontario, Canada
December 2005
Copyright © Jiong Zhang, 2005
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Library and Archives Canada
Published Heritage Branch
395 Wellington Street, Ottawa ON K1A 0N4, Canada
ISBN: 978-0-494-15337-6

NOTICE: The author has granted a non-exclusive license allowing Library and Archives Canada to reproduce, publish, archive, preserve, conserve, communicate to the public by telecommunication or on the Internet, loan, distribute and sell theses worldwide, for commercial or non-commercial purposes, in microform, paper, electronic and/or any other formats.

The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

In compliance with the Canadian Privacy Act some supporting forms may have been removed from this thesis. While these forms may be included in the document page count, their removal does not represent any loss of content from the thesis.
Abstract
With the tremendous growth of network-based services and sensitive information on networks, the number and severity of network-based computer attacks have increased significantly. Security technologies alone cannot completely prevent breaches of security. Therefore, intrusion detection is an important component of network security. However, many current intrusion detection systems are rule-based systems, which have limited ability to detect novel intrusions. Moreover, encoding rules is time-consuming and depends heavily on the system builder's deep understanding of known intrusions.

In this thesis, we propose new systematic frameworks that apply a data mining algorithm called random forests in misuse, anomaly, and hybrid network-based intrusion detection systems. In misuse detection, patterns of intrusions are built automatically by random forests over training data. Intrusions are then detected by matching network activities against the patterns. In anomaly detection, novel intrusions are detected using the outlier detection mechanism of random forests. After building the patterns of network services by random forests, outliers related to the patterns are determined by the outlier detection algorithm. The hybrid detection system improves detection performance by combining misuse and anomaly detection. Misuse detection can detect known intrusions with a high detection rate and a low false positive rate. The
anomaly detection can detect some unknown intrusions. The hybrid system combines the advantages of both techniques.

We evaluate our approaches on the KDD'99 dataset. The experimental results show that the performance of our misuse approach is better than the best KDD'99 result. The results also indicate that our anomaly detection approach achieves a higher detection rate at low false positive rates than other reported unsupervised anomaly detection approaches. The evaluation demonstrates that the hybrid system can improve the overall performance of the above-mentioned intrusion detection systems.
Acknowledgments
I would like to express my gratitude to all those who made it possible for me to complete this thesis.

In particular, I would like to thank my supervisor, Dr. Mohammad Zulkernine, for his supervision and encouragement throughout my research. Without his guidance and help, the work presented in this thesis would not have been possible. I have learned a lot from him and have been highly impressed with his hard work and dedication.

I would like to thank Dr. David Skillicorn for his helpful suggestions on my research and comments on my paper.

I want to thank my family, especially my wife, for their support and encouragement.

I want to thank all members of the Queen's Reliable Software Technology (QRST) research group for the great time working together.

I also want to thank the faculty, staff, and my classmates in the School of Computing for their help.

This research has been supported and funded by Bell Canada through Bell University Laboratories (BUL), and Mathematics of Information Technology and Complex Systems (MITACS).
Contents

Abstract
Acknowledgments
Contents
List of Tables
List of Figures

1 Introduction
  1.1 Motivation
  1.2 Overview
  1.3 Summary of contributions
  1.4 Thesis organization

2 Background and related work
  2.1 Random forests
  2.2 Intrusion detection
    2.2.1 Misuse detection
    2.2.2 Anomaly detection
    2.2.3 Hybrid detection
  2.3 Data mining based detection
    2.3.1 ADAM
    2.3.2 MADAM ID
    2.3.3 JAM
  2.4 Datasets
    2.4.1 DARPA dataset
    2.4.2 KDD'99 dataset

3 Misuse detection
  3.1 Mining patterns of intrusions
    3.1.1 Overview of the framework
    3.1.2 Optimization for random forests
    3.1.3 Imbalanced intrusions
    3.1.4 Feature selection
  3.2 Experiments and results
    3.2.1 Dataset and preprocessing
    3.2.2 Performance comparison on balanced and imbalanced dataset
    3.2.3 Selection of important features
    3.2.4 Parameter optimization for random forests
    3.2.5 Distribution of error rates
    3.2.6 Speed performance of detection
    3.2.7 Evaluation and discussion
    3.2.8 Implementation
  3.3 Summary

4 Anomaly detection
  4.1 Detecting outliers
    4.1.1 Overview of the framework
    4.1.2 Mining patterns of network services
    4.1.3 Unsupervised outlier detection
  4.2 Experiments and results
    4.2.1 Dataset and preprocessing
    4.2.2 Evaluation and discussion
    4.2.3 Experiments on the detection performance over different datasets
    4.2.4 Experiment on the detection performance over minority intrusions
    4.2.5 Implementation
  4.3 Summary

5 Combination of misuse and anomaly detection
  5.1 Misuse detection versus anomaly detection
  5.2 Approaches to combine misuse and anomaly detection
  5.3 Architecture of the hybrid system
  5.4 Experiments and results
    5.4.1 Dataset and preprocessing
    5.4.2 Evaluation and discussion
    5.4.3 Implementation
  5.5 Summary

6 Conclusion and future work
  6.1 Conclusion
  6.2 Limitations and future work

Bibliography

List of Tables

2.1 Intrusions in the 1998 DARPA dataset
2.2 The features in the KDD'99 dataset
3.1 Numbering of the attack categories
3.2 Performance on the balanced dataset compared to the original dataset
3.3 Cost matrix
3.4 Performance comparison on the KDD'99 dataset
4.1 The oob error rates for parameter optimization in the anomaly detection experiments
4.2 The performance of each algorithm over the KDD'99 dataset
4.3 The optimal parameters of random forests
5.1 The oob error rates for parameter optimization in the hybrid approach experiment

List of Figures

2.1 An example of a decision tree
2.2 The training phase of ADAM
2.3 Discovering intrusions with ADAM
3.1 Architecture of the misuse based NIDS
3.2 Variable importance of the features in the misuse approach experiment
3.3 Performance with different values for parameter Mtry of random forests
3.4 Distribution of the oob error rate
3.5 Average oob error rate for different Mtry
3.6 Speed measurement of detection
4.1 The framework of the unsupervised anomaly NIDS
4.2 The outlier-ness of the 1% attack dataset
4.3 The ROC curve for the 1% attack dataset
4.4 The outlier-ness of the 2% attack dataset
4.5 The outlier-ness of the 5% attack dataset
4.6 The outlier-ness of the 10% attack dataset
4.7 The ROC curves for the different datasets
4.8 The outlier-ness of the minority attack dataset
4.9 The ROC curve for the minority attack dataset
5.1 Framework of anomaly detection followed by misuse detection
5.2 Framework of the parallel approach
5.3 Framework of misuse detection followed by anomaly detection
5.4 Architecture of the hybrid system
5.5 Variable importance of the features in the hybrid approach experiment
5.6 Outlier-ness of the anomaly test set
Chapter 1
Introduction
1.1 Motivation
Computer networks provide people with news, email, online shopping, and online banking. More and more sensitive information, such as credit card details and personal information, is stored on computer networks. With the tremendous growth of network-based services and sensitive information on networks, network security is more important than ever. Although a wide range of security technologies such as information encryption, access control, and intrusion prevention are used to protect network-based systems, many intrusions still go undetected. For example, firewalls cannot prevent internal attacks. According to the CSI/FBI Computer Crime and Security Survey, total losses for 2004 were $141,496,560 [1]. Moreover, most of the losses caused by intrusions are not reported. Intrusion Detection Systems (IDSs) can detect intrusions automatically by monitoring the activities of networks or systems, instead of relying on security experts to analyze those activities. Thus, intrusion detection systems play a vital role in network security.
Currently, many NIDSs (Network Intrusion Detection Systems) such as Snort [4] are rule-based systems, which employ misuse detection techniques and have limited extensibility for novel attacks. Their performance relies heavily on the rules identified by security experts. In rule-based systems, security experts analyze traffic data and develop rules to specify intrusions. However, the amount of network traffic is huge, and some intrusions are very difficult to specify using rules. Therefore, the process of encoding rules is expensive and slow. Another problem of rule-based systems is high maintenance cost. Security staff have to modify existing rules or deploy new rules manually using a specific rule-driven language, and if the rules are deployed in different kinds of systems, different rule-driven languages are needed. To overcome the limitations of rule-based systems, a number of IDSs employ data mining techniques. Data mining is the analysis of (often large) observational data sets to find patterns or models that are both understandable and useful to the data owner [18]. Data mining can efficiently extract patterns of intrusions for misuse detection, establish profiles of normal network activities for anomaly detection, and build classifiers to detect attacks, especially from vast amounts of audit data. Data mining-based systems are more flexible and easier to deploy: security experts only need to label the audit data to indicate intrusions instead of hand-coding rules for them. Over the past several years, a growing number of research projects have applied data mining to intrusion detection with different algorithms [19, 8, 6]. For instance, MADAM ID [19] and ADAM [8] employ an association rules algorithm.

There are two major intrusion detection techniques: misuse detection and anomaly detection. Misuse detection discovers attacks based on patterns extracted from known intrusions [9]. Anomaly detection identifies attacks based on significant deviations
from the established profiles of normal activities [16]. Misuse detection has a low false positive rate but cannot detect novel attacks. Anomaly detection can detect unknown attacks but usually has a high false positive rate. To combine the advantages of both misuse detection and anomaly detection, many hybrid approaches have been proposed [8, 33, 7]. The major challenge of a hybrid system is to build a framework that can effectively incorporate both anomaly and misuse detection.
1.2 Overview
To address the problems associated with existing approaches in network intrusion detection, this thesis proposes new systematic frameworks that apply the random forests algorithm in misuse detection, anomaly detection, and hybrid detection (a combination of misuse and anomaly detection).

The random forests algorithm is an ensemble classification and regression approach whose accuracy is unsurpassed among current data mining algorithms [12]. The random forests algorithm has been used extensively in different applications. For instance, it has been applied to prediction [17, 28], probability estimation [35], and pattern analysis in multimedia information retrieval and bioinformatics [36]. However, to the best of our knowledge, the random forests algorithm has not been applied to automatic intrusion detection.

Accuracy is critical to developing effective NIDSs, since a high false positive rate or a low detection rate will make an NIDS unusable. To improve detection performance, we also propose methods to address the issues of imbalanced intrusions and feature selection in the mining process, as discussed below.
One of the challenges in intrusion detection systems is feature selection. Many
algorithms are sensitive to the number of features. Hence, feature selection is essential for improving the detection rate. Moreover, the raw data of network traffic is usually audited in tcpdump format, which is not suitable for detection. IDSs must construct features from the raw data, and feature construction from tcpdump-format data involves a lot of computation. Thus, feature selection can help reduce the computational cost of feature construction by reducing the number of features. However, in many current data mining-based IDSs, feature selection is based on domain knowledge or intuition. We use the feature selection capability of the random forests algorithm, because the algorithm can estimate which features are important in the classification.
Another challenge of intrusion detection is imbalanced intrusions. Some intrusions such as Denial of Service (DoS) [25] generate many more connections than others (e.g., User to Root). Most data mining algorithms try to minimize the overall error rate, but this increases the error rate on minority intrusions. However, in real-world network environments, the minority attacks are more dangerous than the majority attacks. Therefore, we need to improve the detection performance for the minority intrusions.
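One simple family of sampling techniques for this problem can be sketched as follows: down-sample over-represented classes and over-sample (with replacement) under-represented ones until all classes have equal size. This is a generic illustration of the idea, with invented data, not the specific sampling scheme used later in the thesis.

```python
import random
from collections import Counter

def balance_by_sampling(rows, labels, seed=0):
    """Equalise class sizes: down-sample each majority class and
    over-sample (with replacement) each minority class to a common size."""
    rng = random.Random(seed)
    by_class = {}
    for r, y in zip(rows, labels):
        by_class.setdefault(y, []).append(r)
    # target size = average class size
    target = int(sum(len(v) for v in by_class.values()) / len(by_class))
    out_rows, out_labels = [], []
    for y, group in by_class.items():
        if len(group) >= target:
            sample = rng.sample(group, target)                    # down-sample
        else:
            sample = [rng.choice(group) for _ in range(target)]   # over-sample
        out_rows.extend(sample)
        out_labels.extend([y] * target)
    return out_rows, out_labels

# Toy example: 90 DoS connections vs. 10 User-to-Root connections.
rows = [[i] for i in range(100)]
labels = ["dos"] * 90 + ["u2r"] * 10
balanced_rows, balanced_labels = balance_by_sampling(rows, labels)
```

After balancing, a classifier no longer minimizes overall error by simply ignoring the rare class.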
Anomaly detection is a critical issue in Network Intrusion Detection Systems (NIDSs). Many NIDSs employ misuse detection techniques, which have limited extensibility for novel attacks. To detect novel attacks, many anomaly detection systems have been developed. Most of them are based on supervised approaches [8, 26, 34]. For instance, ADAM [8] employs an association rules algorithm for intrusion detection. ADAM builds a profile of normal activities over attack-free training data, and then detects
attacks with the previously built profile. The problem with ADAM is its high dependency on training data for normal activities. However, attack-free training data is difficult to come by, since there is no guarantee that all attacks can be prevented in real-world networks. In fact, one of the most popular ways to undermine anomaly-based IDSs is to incorporate some intrusive activities into the training data [32]. An IDS trained on data containing intrusions will lose the ability to detect those kinds of intrusions. Another problem of supervised anomaly-based IDSs is a high false positive rate when the network environment or services change. Since training data only contains historical activities, the profile of normal activities can only include historical patterns of normal behavior. Therefore, new activities due to changes in the network environment or services will deviate from the previously built profile and be detected as attacks, increasing the number of false positives.
To overcome the limitations of supervised anomaly-based systems, a number of IDSs employ unsupervised approaches [16, 31, 21]. Unsupervised anomaly detection does not need attack-free training data. It detects attacks by determining unusual activities in the data under two assumptions [21]:

• The majority of activities are normal.

• Attacks statistically deviate from normal activities.

The unusual activities are outliers that are inconsistent with the remainder of the data set [10]. Thus, outlier detection techniques can be applied to unsupervised anomaly detection. Indeed, outlier detection has been used in a number of practical applications such as credit card fraud detection, voting irregularity analysis, and severe weather prediction [23].
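Under those two assumptions, even a very simple statistical score can separate outliers: measure how far each point lies from the bulk of the data. The toy sketch below uses plain distance-to-mean (not the proximity-based measure that random forests actually provide) purely to illustrate the principle, with invented data.

```python
def outlier_scores(points):
    """Score each point by its Euclidean distance from the sample mean,
    scaled by the mean distance. Because most points are normal
    (assumption 1) and attacks deviate statistically (assumption 2),
    points with large scores are outlier candidates."""
    dim = len(points[0])
    mean = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    dists = [sum((p[i] - mean[i]) ** 2 for i in range(dim)) ** 0.5
             for p in points]
    avg = sum(dists) / len(dists)
    return [d / avg for d in dists]

# Toy example: four clustered "normal" points and one far-away point.
points = [[0, 0], [1, 0], [0, 1], [1, 1], [10, 10]]
scores = outlier_scores(points)
```

The far-away point receives by far the largest score, so thresholding the score flags it as an outlier without any labeled training data.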
We propose an approach that uses the outlier detection technique provided by the random forests algorithm for anomaly intrusion detection. The main challenge of anomaly intrusion detection is to reduce false positives. The outlier detection technique is effective at reducing the false positive rate while maintaining a desirable detection rate.

In hybrid detection, we propose a framework to combine misuse and anomaly detection. The hybrid system therefore not only achieves the high performance of misuse detection, but can also detect novel intrusions.
1.3 Summary of contributions
In this thesis, we apply the random forests algorithm to network intrusion detection. We present approaches to employ and optimize the random forests algorithm in misuse detection, anomaly detection, and hybrid detection. The major contributions of the thesis are as follows:

• Propose new systematic frameworks that employ the random forests algorithm for network intrusion detection. To the best of our knowledge, the random forests algorithm has not been applied in NIDSs, especially for anomaly detection systems [37, 39, 38].

• Apply sampling techniques and a feature selection algorithm in misuse detection to improve the performance of the NIDS. The sampling techniques increase the detection rate of minority intrusions. The feature selection technique improves the overall detection performance [37].

• Employ a new service-based unsupervised outlier detection approach in the anomaly NIDS. The outlier function provided by the random forests algorithm is used
in anomaly detection. By building patterns of network services, the algorithm determines outliers relative to the built patterns. The proposed approach does not need attack-free training data, which is difficult to obtain in real-world network environments [39].

• Combine misuse detection and anomaly detection. Misuse detection has a high detection rate with a low false positive rate, but cannot detect novel intrusions. Anomaly detection can detect novel intrusions. Therefore, the combination of misuse and anomaly detection improves the overall performance of NIDSs [38].
1.4 Thesis organization
The thesis is organized as follows. In Chapter 2, we introduce intrusion detection, the random forests algorithm, and the datasets used in intrusion detection. We also discuss related work, especially data mining-based detection systems.

In Chapter 3, we describe in detail misuse detection using the random forests algorithm. We explain the approaches used to improve the detection performance of the misuse detection system, and we show the experimental results in that chapter.

In Chapter 4, we discuss the framework of the anomaly detection and show how to apply the random forests algorithm in unsupervised anomaly detection. The performance evaluations are also presented.

In Chapter 5, we propose a framework to combine misuse and anomaly detection. The architecture of the proposed hybrid system is explained in detail, and we also evaluate the hybrid system.
Finally, we summarize our work and outline our future research plans in Chapter 6. We also discuss the limitations of the presented approaches.
Chapter 2
Background and related work
2.1 Random forests
A decision tree has a root node connected by successive links to other nodes [13]. These nodes are similarly connected until reaching leaf nodes that have no further connected nodes. An example of a decision tree is shown in Figure 2.1.

Random forests [12] is an ensemble of un-pruned classification or regression trees. The random forests algorithm generates many classification trees. Each tree is constructed from a different bootstrap sample of the original data using a tree classification algorithm, in the following steps:

1. If the number of training cases is N, the algorithm samples N cases at random with replacement from the original data. The chosen cases are used to construct the tree.
2. If there are M features in the training set, the algorithm chooses m features
from them at random at each node. The value of m is held constant by a parameter of the algorithm. At a node, the algorithm evaluates each of the chosen features as a candidate split of the node, and the best feature is selected to split the node in the tree. The best feature is the one that makes the cases reaching the immediate descendant nodes as pure as possible. The process is repeated recursively for each node of the tree.
Figure 2.1: An example of a decision tree [13]
After the forest is formed, a new object that needs to be classified is put down each tree in the forest for classification. Each tree gives a vote that indicates the tree's decision about the class of the object. The forest chooses the class with the most votes for the object.
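The training and voting procedure above can be sketched in miniature. The code below builds a "forest" of one-level trees (decision stumps) with bootstrap sampling, a random feature subset of size m at the split, and Gini impurity as the purity measure, then classifies by majority vote. Real random forests grow full un-pruned trees, so this is only an illustrative toy with invented data.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a label multiset (0 = pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_stump(rows, labels, feature_ids):
    """Among m candidate features, find the (feature, threshold) split
    minimising the weighted Gini impurity of the two children."""
    best, best_score = None, float("inf")
    for f in feature_ids:
        for t in {r[f] for r in rows}:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best_score = score
                # majority class on each side becomes the stump's prediction
                best = (f, t,
                        Counter(left).most_common(1)[0][0] if left else labels[0],
                        Counter(right).most_common(1)[0][0] if right else labels[0])
    return best

def train_forest(rows, labels, n_trees=25, m=1, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample of N cases, drawn with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        boot_rows = [rows[i] for i in idx]
        boot_labels = [labels[i] for i in idx]
        # Step 2: random subset of m features considered at the split.
        feats = rng.sample(range(n_features), m)
        forest.append(best_stump(boot_rows, boot_labels, feats))
    return forest

def predict(forest, row):
    """Each stump votes; the forest returns the majority class."""
    votes = [(lo if row[f] <= t else hi) for f, t, lo, hi in forest]
    return Counter(votes).most_common(1)[0][0]

# Toy one-feature data: low values are normal, high values are attacks.
rows = [[0.0], [1.0], [8.0], [9.0]]
labels = ["normal", "normal", "attack", "attack"]
forest = train_forest(rows, labels, n_trees=25, m=1, seed=0)
```

Individual stumps trained on unlucky bootstrap samples can vote wrongly, but the majority vote across the ensemble is robust, which is the point of the method.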
The main features of the random forests algorithm are as follows [12]:

• Its accuracy is unsurpassed among current data mining algorithms.
• It runs efficiently on large data sets with many features, which makes it well suited for network intrusion detection: the volume of network traffic is huge and network activities are complex, so network traffic datasets are large and have many features.

• It can give estimates of which features are important.

• It has no nominal data problem and does not over-fit.

• It can handle unbalanced datasets.

• It provides an effective approach to estimating missing data and maintains accuracy when a large proportion of the data is missing.
• It can detect outliers using proximities between pairs of cases.
In the random forests algorithm, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test error. Since each tree is constructed from a bootstrap sample, approximately one-third of the cases are left out of each bootstrap sample and not used in training. These cases are called out-of-bag (oob) cases, and they are used to get a running unbiased estimate of the classification error as trees are added to the forest.
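The "one-third" figure comes from the probability that a given case is never drawn in N samples with replacement: (1 - 1/N)^N, which approaches e^-1 ≈ 0.368 for large N. A quick standalone simulation (not thesis code) confirms it:

```python
import random

def oob_fraction(n_cases, n_trees=2000, seed=1):
    """Average fraction of cases left out of a bootstrap sample of size
    n_cases drawn with replacement, i.e. the out-of-bag cases."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        in_bag = {rng.randrange(n_cases) for _ in range(n_cases)}
        total += (n_cases - len(in_bag)) / n_cases
    return total / n_trees

frac = oob_fraction(100)  # close to (1 - 1/100)**100 ≈ 0.366
```

Because each case is out-of-bag for roughly a third of the trees, averaging those trees' votes on it yields an error estimate computed entirely on data the voting trees never saw.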
2.2 Intrusion detection
An Intrusion Detection System (IDS) detects attacks by observing activities on a
variety of system and network sources [15]. There are two main types of intrusion
detection systems: host-based IDS and network-based IDS [8, 6]. Network Intrusion
Detection Systems (NIDSs) detect attacks by observing various network activities,
while Host-based Intrusion Detection Systems (HIDSs) detect intrusions in individual
hosts. An NIDS examines the output of a packet sniffer or network switch. A sniffer is
a program that reads raw packets off a local network segment. An NIDS can monitor
more targets on a network, and can detect some attacks that HIDSs miss. HIDSs do
not see packet headers, so they cannot detect some types of attacks. For example,
many IP-based denial of service (DoS) attacks can only be detected by NIDSs, since
NIDSs can look at the packet headers as they travel across networks. Moreover, NIDSs
do not rely on host operating systems as detection sources, but HIDSs require specific
operating systems to function properly. Some hybrid IDSs use both host-based and
network-based systems to detect intrusions [19].
The techniques used in intrusion detection can also be divided into two major
approaches: misuse detection and anomaly detection [8]. The following subsections
briefly explain the two approaches.
2.2.1 Misuse detection
Misuse detection identifies intrusions by searching for known patterns of attacks. The
current commercial NIDSs employ this strategy. A disadvantage of misuse detection
is that it cannot detect unknown attacks. Different techniques have been used for
misuse detection, such as expert systems, signature analysis, state-transition analysis,
and data mining.
The expert system uses a set of rules to describe intrusions [9]. Audit events
are translated into facts that carry their semantic significance in the expert system.
Then, an inference engine can draw conclusions using these rules and facts.
State transition analysis expresses attacks with a set of goals and transitions based
on state transition diagrams [9]. Any event that triggers an intrusion state will be
detected as an intrusion.
Signature analysis describes attacks using signatures that can be found in the audit
trail [9]. Any activity that matches a signature will be flagged as an attack.
In recent years, much data mining-based research has been proposed for
intrusion detection [9]. Data mining is an effective way to extract useful and previously
unnoticed models or patterns from large data sources. The models or patterns
can be represented in various forms, such as rules, decision trees, instance-based examples,
and neural nets. Many data mining algorithms have been employed in misuse
detection. For example, the association rules algorithm is used by MADAM ID (Mining
Audit Data for Automated Models for Intrusion Detection) [19], ADAM (Audit
Data Analysis and Mining) [8], and IDDM (Intrusion Detection Using Data Mining
Techniques) [6]. Decision trees and fuzzy association rules have also been employed in intrusion
detection [30, 24], and the neural network algorithm has been used to improve the performance
of IDSs [22].
2.2.2 Anomaly detection
Since misuse detection cannot detect unknown attacks, anomaly detection is used to
address this shortcoming. Various anomaly detection approaches have been proposed
and implemented.
Unsupervised anomaly detection in NIDSs, as discussed below, is a new research
area [21]. Eskin et al. [16] investigated three algorithms in unsupervised anomaly
detection: cluster-based estimation, k-nearest neighbor, and one-class SVM (Support
Vector Machine). Other researchers [31, 21] apply clustering approaches in unsupervised
NIDSs. We employ the outlier detection of random forests in unsupervised
anomaly detection.
Supervised anomaly detection has been studied extensively. Supervised anomaly
detection uses attack-free training data to build profiles of normal activities. After
that, it uses the deviation from the profiles to detect intrusions. ADAM [8] builds the
profile of normal behavior from attack-free training data and represents the profile
as a set of association rules. At run-time, ADAM detects suspicious connections
according to the profile. Other supervised approaches have also been applied to anomaly
detection, such as fuzzy data mining and genetic algorithms [26], neural networks
[11, 29], and SVM [34].
Statistical methods and expert systems are also applied in supervised anomaly
detection [9]. Statistical methods build profiles of normal user and system behavior
from a number of samples. Activities are then compared against the profiles, and
deviations are flagged as abnormal. Expert systems describe normal behavior of
users and systems by a set of rules, and then apply the rules to detect anomalous
behaviors.
2.2.3 Hybrid detection
A hybrid detection system combines misuse detection and anomaly detection. It can
detect both known and unknown intrusions.
The Next Generation Intrusion Detection Expert System (NIDES) developed by
SRI [7], is a hybrid intrusion detection system. NIDES performs real-time monitoring
of user activity on multiple target systems connected on a network. It consists of a
misuse detection component as well as an anomaly detection component. The rule-
based misuse component employs expert rules to define known intrusive activities.
The anomaly component is based on a statistical approach, and it flags activities as
attacks if they deviate significantly from the expected behaviors. By combining a
statistical component and an expert system component, NIDES increases the chance
of detecting intrusions that either component alone might miss.
In our proposed hybrid system, the misuse component uses random forests for
classification in intrusion detection. The anomaly component is based on the outlier
detection provided by random forests.
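As a rough illustration of proximity-based outlier scoring: Breiman defines the proximity of two cases as the fraction of trees in which they land in the same leaf, and the raw outlier score of a case as the number of cases divided by the sum of squared proximities to the other cases of its class. The toy proximity matrix below is fabricated, and the normalization step is omitted:

```python
def outlier_scores(prox, labels):
    """Raw outlier score per case: N / sum of squared proximities between
    the case and the other cases of the same class."""
    n = len(labels)
    scores = []
    for i in range(n):
        s = sum(prox[i][j] ** 2 for j in range(n) if j != i and labels[j] == labels[i])
        scores.append(n / s if s else float("inf"))
    return scores

# Toy 4-case proximity matrix (fabricated); case 3 is only weakly
# connected to the rest of its class, so it should stand out
prox = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.1],
    [0.8, 0.7, 1.0, 0.2],
    [0.1, 0.1, 0.2, 1.0],
]
labels = ["normal", "normal", "normal", "normal"]
scores = outlier_scores(prox, labels)
print(max(range(4), key=lambda i: scores[i]))  # case 3 scores highest
```

In a real forest the proximities come from running all training cases down every tree, not from a hand-written matrix.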
2.3 Data mining based detection
2.3.1 ADAM
ADAM (Audit Data Analysis and Mining) [8] is one of the most widely known
projects in the field. It is an on-line network-based IDS. ADAM can detect known
attacks as well as unknown attacks.
ADAM uses association rules in detection. Association rule mining, one of the classic data mining
algorithms, is easy to understand. It searches for all possible frequent associations
among the given set of features, although it usually also generates many useless rules that cannot
effectively describe user and system activities. The goal of association rules is to
gather necessary knowledge about the nature of the audit data. An association rule
is expressed as:
X ⇒ Y [s, c]
• X and Y are sets of attribute-values
• X ∩ Y = ∅
• s (support): the percentage of dataset records that satisfy the conjunction of X and
Y.
• c (confidence): the conditional probability that a record satisfies Y, provided it
satisfies X.
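Support and confidence can be computed directly from their definitions; the records and attribute values below are invented purely for illustration:

```python
def support_confidence(records, X, Y):
    """Support: fraction of records satisfying both X and Y.
    Confidence: fraction of records satisfying X that also satisfy Y.
    X and Y are sets of (attribute, value) pairs; records are dicts."""
    holds = lambda r, items: all(r.get(a) == v for a, v in items)
    n_xy = sum(1 for r in records if holds(r, X) and holds(r, Y))
    n_x = sum(1 for r in records if holds(r, X))
    return n_xy / len(records), (n_xy / n_x if n_x else 0.0)

# Fabricated connection records for illustration
records = [
    {"service": "http", "flag": "SF"},
    {"service": "http", "flag": "S0"},
    {"service": "ftp",  "flag": "SF"},
    {"service": "http", "flag": "SF"},
]
s, c = support_confidence(records, {("service", "http")}, {("flag", "SF")})
print(s, c)  # support 0.5, confidence 2/3
```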
ADAM does not use the packet payload; it only uses the packet header. ADAM uses
TCP connections as the basic item-set. Connections are obtained from the raw packet
data of an audit trail. The item-set is defined as a 6-tuple:
R(Ts; Src:IP; Src:Port; Dst:IP; Dst:Port; FLAG)
• Ts: the beginning time of a connection
• Src:IP: source IP
• Src:Port: source port
• Dst:IP: destination IP
• Dst:Port: destination port
• FLAG: status of a TCP connection
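The item-set maps naturally onto a record type; the Python field names and sample values below are our own illustration, not part of ADAM:

```python
from collections import namedtuple

# The 6-tuple item-set modeled as a named tuple (field names and values
# are hypothetical, chosen for readability)
Connection = namedtuple("Connection",
                        ["ts", "src_ip", "src_port", "dst_ip", "dst_port", "flag"])

conn = Connection(ts=1134500000.0, src_ip="192.168.0.5", src_port=42311,
                  dst_ip="10.0.0.9", dst_port=80, flag="SF")
print(conn.dst_port)  # 80
```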
The framework of ADAM has two phases: a training phase and an on-line phase. In
the training phase, as shown in Figure 2.2, the attack-free training data is fed to a
module that performs off-line association rule discovery. The output
of this module is a rule-based profile of normal activities. After that, the produced
profile is input to another module called “on-line single level and domain-level
mining”. This module runs a dynamic on-line algorithm for association rules.
The training data containing attacks is fed into the module, which then
outputs suspicious hot items. Along with feature selection, the suspicious hot items
are labeled as false alarms or attacks. The labeled data is fed into the classifier builder
to train the classifier.
In the on-line phase, shown in Figure 2.3, the test data is fed into
the system. With the built profile, the on-line single level and domain-level mining
module can find suspicious hot items. These suspicious items are classified as false
alarms, attacks, or unknown attacks by the trained classifier. The unknown attacks
are the suspicious items that cannot be classified as false alarms or known attacks.
Figure 2.2: The training phase of ADAM [8]
Figure 2.3: Discovering intrusions with ADAM [8]
There are some issues that need to be solved in ADAM:
• Threshold tuning. It is important to obtain good thresholds for declaring a
connection suspicious.
• Profile building.
• Dependency on training data. Obtaining training data is not easy.
Our hybrid system has two phases (an on-line phase and an off-line phase), similar to ADAM.
However, our system does not need attack-free data to detect novel intrusions, thanks to
the outlier detection. Attack-free data is critical for ADAM. Because of the high complexity of
the outlier detection, our system detects anomalies in the off-line phase, whereas ADAM can
detect anomalies in the on-line phase. Besides, we use the random forests algorithm instead
of the association rules used by ADAM. Random forests are more accurate and efficient
on large datasets than association rules, although association rules are more understandable.
2.3.2 MADAM ID
MADAM ID (Mining Audit Data for Automated Models for Intrusion Detection)
[19] is one of the best known data mining projects in intrusion detection. It uses
data mining algorithms to compute activity patterns from system audit data and
extracts predictive features from the patterns. It is an off-line IDS that produces anomaly
and misuse intrusion models. Association rules and frequent episodes are applied
in MADAM ID. Association rules are used to find intra-audit record patterns, and
the frequent episodes algorithm is used to find inter-audit record patterns. However,
MADAM ID relies heavily on intrusion detection expert knowledge. Expert knowledge
is used not only to prune the number of rules produced by association and frequent
episode mining, but also to construct features.
Compared to MADAM ID, our system can detect known intrusions in real time,
while MADAM ID can only detect intrusions in off-line mode. We apply the random
forests algorithm in our system instead of the association rules and frequent episodes
algorithms. Although MADAM ID uses data mining techniques, it still relies heavily
on expert knowledge.
2.3.3 JAM
JAM (Java Agents for Meta-learning) [20] is a distributed, scalable, and portable
agent-based data mining system. The main target of JAM is fraud and intrusion
detection in financial information systems. Meta-learning is one of its key techniques,
used to combine and integrate separately learned classifiers or models. Hence, the distributed
agents can exchange models.
Compared to JAM, our system is a centralized system. The system builds patterns
and detects intrusions in a single central location. Thus, there is no need to maintain
separate agents scattered on computers at each location. However, sending all data
to a single place will increase the volume of network traffic. The agents in JAM can
process the data locally and then exchange models.
2.4 Datasets
2.4.1 DARPA dataset
Under the sponsorship of the Defense Advanced Research Projects Agency (DARPA) and
the Air Force Research Laboratory (AFRL), MIT Lincoln Laboratory has collected and
distributed datasets for the evaluation of computer network intrusion detection
systems [25, 3]. The DARPA dataset is the most popular dataset used to test and
evaluate a large number of IDSs. The data can be used for both host-based and
network-based systems. An environment was set up to simulate a typical U.S. Air
Force LAN, and the raw TCP/IP dump data was acquired from this environment.
The DARPA dataset includes three sets: the 1998 DARPA Intrusion Detection Evaluation
Data Sets, the 1999 DARPA Intrusion Detection Evaluation Data Sets, and the 2000
DARPA Intrusion Detection Scenario Specific Data Sets. The 1998 datasets contain
seven weeks of training data and two weeks of test data. The 1999 datasets contain
three weeks of training data and two weeks of test data. The 2000 datasets contain
one day of data to address specific scenarios.
The attacks in the datasets fall into four categories:
• DoS: Denial of Service, e.g., syn flood.
• R2L: Unauthorized access from a remote machine, e.g., guessing password.
Table 2.1: Intrusions in the 1998 DARPA dataset [27]
Attack Class        OS: Solaris      OS: SunOS        OS: Linux
Denial of Service   Apache2          Apache2          Apache2
                    Back             Back             Back
                    Mail bomb        Mail bomb        Mail bomb
                    Neptune          Neptune          Neptune
                    Ping of death    Ping of death    Ping of death
                    Process table    Process table    Process table
                    Smurf            Smurf            Smurf
                    Syslogd          Syslogd          Syslogd
                    UDP storm        UDP storm        UDP storm
Remote to User      Dictionary       Dictionary       Dictionary
                    Ftp-write        Ftp-write        Ftp-write
                    Guest            Guest            Guest
                    Phf              Phf              Imap
                    Xlock            Xlock            Named
                    Xnsnoop          Xnsnoop          Phf
                                                      Sendmail
                                                      Xlock
                                                      Xnsnoop
User to Super-user  Eject            Load module      Perl
                    Ffbconfig        Ps               Xterm
                    Fdformat
Probing             Ps               Ip sweep         Ip sweep
                    Ip sweep         Mscan            Mscan
                    Mscan            Nmap             Nmap
                    Nmap             Saint            Saint
                    Saint            Satan            Satan
                    Satan
• U2R: Unauthorized access to root privileges, e.g., various “buffer overflow”
attacks.
• Probing: surveillance and other probing, e.g., port scanning.
As an example, Table 2.1 lists 32 different intrusions in the 1998 datasets [27].
2.4.2 KDD’99 dataset
The KDD’99 dataset is a subset of the DARPA dataset prepared by Sal Stolfo and Wenke
Lee [14]. The data was preprocessed by extracting 41 features from the tcpdump data
in the 1998 DARPA datasets. The KDD’99 dataset can be used without further time-consuming
preprocessing, and IDSs can be compared with each other by working on
this dataset. The 41 features are listed in Table 2.2 [27].
Table 2.2: The features in the KDD’99 dataset [27]

#   Feature name                 Description
1   duration                     Length (# of seconds) of the connection.
2   protocol_type                Type of the protocol, e.g. tcp, udp, etc.
3   service                      Network service on the destination, e.g., http, telnet, etc.
4   flag                         Normal or error status of the connection.
5   src_bytes                    # of data bytes from source to destination.
6   dst_bytes                    # of data bytes from destination to source.
7   land                         1 if connection is from/to the same host/port; 0 otherwise.
8   wrong_fragment               # of wrong fragments.
9   urgent                       # of urgent packets.
10  hot                          # of hot indicators.
11  num_failed_logins            # of failed login attempts.
12  logged_in                    1 if successfully logged in; 0 otherwise.
13  num_compromised              # of compromised conditions.
14  root_shell                   1 if root shell is obtained; 0 otherwise.
15  su_attempted                 1 if su root command attempted; 0 otherwise.
16  num_root                     # of root accesses.
17  num_file_creations           # of file creation operations.
18  num_shells                   # of shell prompts.
19  num_access_files             # of operations on access control files.
20  num_outbound_cmds            # of outbound commands in an ftp session.
21  is_host_login                1 if the login belongs to the hot list; 0 otherwise.
22  is_guest_login               1 if the login is a guest login; 0 otherwise.
23  count                        # of connections to the same host as the current one during the past two seconds.
24  srv_count                    # of connections to the same service as the current connection during the past two seconds.
25  serror_rate                  % of connections that have SYN errors to the same host during the past two seconds.
26  srv_serror_rate              % of connections that have SYN errors to the same service during the past two seconds.
27  rerror_rate                  % of connections that have REJ errors to the same host during the past two seconds.
28  srv_rerror_rate              % of connections that have REJ errors to the same service during the past two seconds.
29  same_srv_rate                % of connections to the same service during the past two seconds.
30  diff_srv_rate                % of connections to different services during the past two seconds.
31  srv_diff_host_rate           % of connections to different hosts during the past two seconds.
32  dst_host_count               # of connections to the same host as the current connection in the past 100 connections.
33  dst_host_srv_count           # of connections to the same service as the current connection in the past 100 connections.
34  dst_host_same_srv_rate       % of connections to the same service in the past 100 connections.
35  dst_host_diff_srv_rate       % of connections to different services in the past 100 connections.
36  dst_host_same_src_port_rate  % of connections from the same source port in the past 100 connections.
37  dst_host_srv_diff_host_rate  % of connections to different hosts in the past 100 connections.
38  dst_host_serror_rate         % of connections that have SYN errors to the same host in the past 100 connections.
39  dst_host_srv_serror_rate     % of connections that have SYN errors to the same service in the past 100 connections.
40  dst_host_rerror_rate         % of connections that have REJ errors to the same host in the past 100 connections.
41  dst_host_srv_rerror_rate     % of connections that have REJ errors to the same service in the past 100 connections.
The KDD’99 dataset includes the full training set, the 10% training set, and
the test set. The full training set has 4,898,431 connections. In the TCP protocol, a
connection is established before two hosts on a network can communicate with each
other; after the data transfer finishes, the connection is closed. Thus, a TCP connection
consists of multiple packets. For the UDP protocol, each connectionless packet is also treated
as a connection. The 10% training set has 494,020 connections. It contains all the minority
classes (U2R and R2L) of the full training set and part of the majority classes (Normal, DoS,
and Probing). The test set contains 311,029 connections.
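Since each record is one connection ending with a label, class proportions can be checked with a simple tally. A sketch over a few fabricated rows laid out like KDD’99 records (41 comma-separated features plus a trailing label; the feature values are invented for illustration):

```python
import csv
import io
from collections import Counter

row = "0," * 35  # 35 zero-valued filler features (fabricated)
# 6 leading features + 35 fillers + label = 41 features and a label per record
sample = io.StringIO(
    "0,tcp,http,SF,215,45076," + row + "normal.\n" +
    "0,icmp,ecr_i,SF,1032,0," + row + "smurf.\n" +
    "0,icmp,ecr_i,SF,1032,0," + row + "smurf.\n"
)
labels = Counter(r[-1] for r in csv.reader(sample))
print(labels)  # Counter({'smurf.': 2, 'normal.': 1})
```

With the real files, the same tally over the 10% training set would reveal the class imbalance discussed in Chapter 3.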
Chapter 3
Misuse detection
In this chapter, we describe our approach to applying the random forests algorithm in
misuse detection. We first describe the architecture of the proposed misuse-based
NIDS (Network Intrusion Detection System). Then, we illustrate our solutions for
building detection patterns with high performance for intrusion detection. Finally,
we discuss our experimental results.
3.1 Mining patterns of intrusions
In this section, we first describe the architecture of the NIDS, and then illustrate our
solutions for imbalanced intrusions, feature selection, and optimization of the random
forests algorithm.
3.1.1 Overview of the framework
The proposed framework applies data mining techniques to build patterns for network
intrusion detection. The architecture of the proposed NIDS is shown in Figure 3.1.
There are two phases in the framework: an off-line phase and an on-line phase. The
system builds patterns of intrusions in the off-line phase and detects intrusions in the
on-line phase.
In the off-line phase, labeled training data is fed into the off-line preprocessor
module. After preprocessing, feature vectors are stored in a database. The Pattern
Builder module retrieves the training data from the database and builds the patterns
of intrusions. The Pattern Builder module employs the feature selection algorithm,
handles imbalanced intrusions, and builds the patterns with the random forests
algorithm using optimal parameters. After mining the patterns of intrusions, the patterns
are deployed to the Detector module.
Figure 3.1: Architecture of the misuse-based NIDS
In the on-line phase, sensors capture packets from network traffic. A sensor is
installed on each network segment and can capture all traffic on that segment.
The features for each connection are constructed by
the on-line preprocessors from the captured network traffic. The connections are
stored in the database and can be retrieved by the Detector module. There,
the connections are classified as intrusions or normal traffic using
the patterns built in the off-line phase. Finally, the system raises an alert when it
detects any intrusion.
3.1.2 Optimization for random forests
The error rate of a forest depends on the correlation between any two trees and
the strength of each tree in the forest. Increasing the correlation increases the error
rate of the forest. The strength of a tree is determined by the error rate of the
tree: increasing the strength decreases the error rate of the forest. When the forest is
grown, random features are selected out of all the features in the training
data, and the best split on these random features is used to split each node of the tree.
The number of random features (Mtry) is held constant. Reducing (increasing) Mtry
reduces (increases) both the correlation and the strength. The number of features
employed in splitting each node is thus the primary tuning parameter.
To improve the performance of random forests, this parameter should be optimized.
We use the training data to find the optimal value of the parameter Mtry; the
minimum error rate corresponds to the optimal value. Therefore, we use different
values of Mtry to build forests and evaluate their error rates. Then,
we select the value corresponding to the minimum error rate to build the pattern.
There are two ways to evaluate the error rate. One is to split the dataset into
a training part and a test part: we can employ the training part to build the forest,
and then use the test part to calculate the error rate. The other way is to use the oob
(out of bag) error estimate. Because the random forests algorithm calculates the oob
error during the training phase, we do not need to split the training data. We choose
the oob error estimate, since it is more effective to learn from the whole training
dataset.
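The Mtry search is therefore a one-dimensional sweep: build a forest per candidate value and keep the value with the lowest oob error. A sketch with a hypothetical oob_error callback; the error figures below are invented stand-ins for real training runs:

```python
def choose_mtry(oob_error, candidates):
    """Return the Mtry candidate with the lowest oob error.
    `oob_error(mtry)` is assumed to train a forest with that Mtry on the
    training data and return its oob error estimate."""
    return min(candidates, key=oob_error)

# Hypothetical oob errors, standing in for forests trained on real data
fake_errors = {2: 0.051, 4: 0.032, 6: 0.028, 8: 0.030, 12: 0.041}
best = choose_mtry(fake_errors.get, [2, 4, 6, 8, 12])
print(best)  # 6: the candidate with the minimum oob error
```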
3.1.3 Imbalanced intrusions
Intrusions are imbalanced; in other words, some intrusions produce many more connections
than others. The random forests algorithm tries to minimize the overall error
rate by lowering the error rate on majority classes (e.g., majority intrusions) while
increasing the error rate of minority classes (e.g., minority intrusions) [12]. However,
the damage cost of the minority intrusions is much higher than the cost of the majority
intrusions. Thus, for imbalanced intrusions, we need to improve the detection
rate of the minority intrusions while maintaining a reasonable overall detection rate.
There are two solutions to deal with the imbalanced intrusions problem. One is
to set different weights for different intrusions, assigning the minority intrusions
higher weights. Although the overall error rate goes up, the error rate of the minority
intrusions is reduced. The random forests algorithm supports this method through
its weight parameters. The other method is to use sampling techniques:
over-sampling the minority intrusions and down-sampling the majority intrusions.
Since network traffic is huge, down-sampling the majority classes (e.g., normal
traffic and Denial of Service) can speed up building the patterns significantly
by reducing the size of the datasets. Over-sampling minority intrusions (e.g., User
to Root and Remote to Local) can raise their weight and thus decrease their error rate.
Therefore, we combine over-sampling and down-sampling in our NIDS to solve the
imbalanced intrusions problem, instead of using the first solution.
3.1.4 Feature selection
The raw audit data of network traffic is not suitable for intrusion detection. Hence,
feature construction is needed to extract a set of features which can detect intrusions
effectively. Usually, the construction is based on each connection. There are three
types of features for network connection records used in NIDSs [19]:
• Intrinsic features. Intrinsic features describe the basic information of connections,
such as the duration, service, source and destination host, port, and flag.
• Traffic features. These features are based on statistics, such as the number of
connections to the same host as the current connection within a time window.
• Content features. These features are constructed from the payload of traffic
packets instead of the packet headers, such as the number of failed logins, whether
logged in as root, and the number of accesses to control files.
Feature selection is one of the critical steps in building NIDSs. The number of
intrinsic features is fixed, since it depends on the information in the packet
header. However, traffic features and content features can be constructed using different
methods. Hundreds of traffic and content features can be designed, while only
some of them are essential for separating intrusions from normal traffic. Unessential
features not only increase computational cost, but also increase the error rate, especially
for algorithms that are sensitive to the number of features. “Deciding
upon the right set of features is difficult and time consuming” [20]. Currently, features
are designed by security experts. Thus, we need an approach that can automate
feature selection. We employ the variable importance calculated by the random forests
algorithm for feature selection. The features with a higher value of variable importance
have more effect on classification. Therefore, we choose the features with the highest
values of variable importance in the NIDS.
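The idea behind variable importance can be sketched as a permutation test: score a model, randomly shuffle one feature's values across the records, and measure the drop in accuracy (the real algorithm does this per tree on the oob cases and averages the results; here a single hypothetical classifier stands in, with invented feature names and data):

```python
import random

def permutation_importance(predict, X, y, feature, rng):
    """Raw importance of `feature`: accuracy before minus accuracy after
    randomly permuting that feature's values across the records."""
    acc = sum(predict(r) == t for r, t in zip(X, y)) / len(y)
    shuffled = [r[feature] for r in X]
    rng.shuffle(shuffled)
    X_perm = [dict(r, **{feature: v}) for r, v in zip(X, shuffled)]
    acc_perm = sum(predict(r) == t for r, t in zip(X_perm, y)) / len(y)
    return acc - acc_perm

# Hypothetical classifier that only looks at src_bytes
predict = lambda r: "attack" if r["src_bytes"] > 1000 else "normal"
X = [{"src_bytes": v, "duration": 0} for v in (10, 2000, 5, 3000)]
y = ["normal", "attack", "normal", "attack"]
print(permutation_importance(predict, X, y, "duration", random.Random(0)))
# 0.0: shuffling a feature the model ignores costs no accuracy
```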
3.2 Experiments and results
In this section, we summarize our experimental results for building intrusion detection
patterns over the KDD’99 dataset. We first describe the experiments using sampling
techniques for imbalanced intrusions and the random forests algorithm to select
features. Then, we present the experiments on parameter optimization for the random
forests algorithm, the distribution of error rates, and the speed performance of detection.
Finally, we evaluate our approach and compare our results with the best result of the
KDD’99 contest [14].
3.2.1 Dataset and preprocessing
The KDD’99 dataset can be used without further time-consuming preprocessing, and
different NIDSs can be compared with each other by working on the same dataset. Therefore,
we carry out our experiments on the KDD’99 dataset and compare our results
with the best result of the KDD’99 contest.
The 10% training set of the KDD’99 contains all the minority classes, such as U2R
(User to Root) and R2L (Remote to Local), and part of the majority classes of the
full training set. This is effectively a down-sampling of the majority classes such as Normal,
Table 3.1: Numbering of the attack categories [14]
0  Normal
1  Probe
2  DoS
3  U2R
4  R2L
DoS (Denial of Service), and Probing. Hence, we use only the 10% training dataset
in our experiments. The task of the KDD’99 contest was to build a classifier capable
of distinguishing between four kinds of intrusions and normal traffic, numbered as one
of five classes (see Table 3.1).
3.2.2 Performance comparison on balanced and imbalanced datasets
The original dataset (the 10% training set) is imbalanced (e.g., DoS has 391,458
connections but U2R has only 52 connections). To make a balanced training set, we
down-sample the Normal and DoS classes by randomly selecting 10% of the connections
belonging to Normal and DoS from the original dataset. We also over-sample U2R
and R2L by replicating their connections. The resulting balanced training set, with 60,620
connections, is much smaller than the original one.
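The recipe above (keep a random 10% of the Normal and DoS records, replicate the U2R and R2L records) can be sketched as follows; the replication factor of 20 is a hypothetical choice, not a figure from the thesis:

```python
import random

def rebalance(records, down_classes, up_classes, down_frac, up_factor, rng):
    """Down-sample majority classes, keeping each record with probability
    `down_frac`; over-sample minority classes by replicating each record
    `up_factor` times. `records` is a list of (features, label) pairs."""
    out = []
    for rec in records:
        _, label = rec
        if label in down_classes:
            if rng.random() < down_frac:
                out.append(rec)
        elif label in up_classes:
            out.extend([rec] * up_factor)
        else:
            out.append(rec)
    return out

rng = random.Random(0)
data = [({}, "DoS")] * 1000 + [({}, "U2R")] * 5   # fabricated toy data
balanced = rebalance(data, {"Normal", "DoS"}, {"U2R", "R2L"}, 0.10, 20, rng)
dos = sum(1 for _, lbl in balanced if lbl == "DoS")
u2r = sum(1 for _, lbl in balanced if lbl == "U2R")
print(dos, u2r)  # roughly 100 DoS records kept; exactly 100 U2R copies
```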
The first experiment compares the detection performance of the patterns built on the
original training set with those built on the balanced training set obtained by sampling.
The experiment is carried out using the default parameter values of the
random forests algorithm in WEKA (Waikato Environment for Knowledge Analysis)
[5]: 66% of the samples as training data, 34% of the samples as test data, 10 trees in the forest,
and 6 random features to split the nodes. The main objective of the experiment is to
Table 3.2: Performance on the balanced dataset compared to the original dataset

Performance                      Original dataset   Balanced dataset
Overall error rate               1.92%              0.05%
Time to build pattern            1975 seconds       65 seconds
True positive rate (Class 0)     0.948              0.999
True positive rate (Class 1)     0.989              0.994
True positive rate (Class 2)     1                  1
True positive rate (Class 3)     0.862              1
True positive rate (Class 4)     0.83               1
False positive rate (Class 0)    0.011              0
False positive rate (Class 1)    0                  0
False positive rate (Class 2)    0                  0
False positive rate (Class 3)    0                  0
False positive rate (Class 4)    0.01               0
compare the performance differences between the balanced and the original datasets,
not to compare the effect of the parameters. As a result, for the sake of convenience,
we just use the default values of the parameters for both datasets. Table 3.2
lists the overall error rate for classification, the time to build the pattern, the
true positive rate for all classes, and the false positive rate for all classes. As
shown in the table, the sampling techniques can improve the performance, especially
the detection rate (true positive rate) of the minority classes (Class 3 and Class 4),
and can reduce the time to build the patterns dramatically.
3.2.3 Selection of important features
The second experiment is to select the most important features. There are 41 features
in the KDD'99 dataset, numbered from 1 to 41. They cover all three types of features
in NIDSs: intrinsic features, traffic features, and content features. We employ the
feature selection algorithm supported by the random forests algorithm to calculate
the value of variable importance. To estimate the importance of variable m, the
number of votes for the correct class is counted using the oob cases in every tree.
Then, the number of correct votes is counted again after randomly permuting
the values of variable m in the oob cases. The average of the margin between these
two numbers over all the trees in the forest is the raw importance score for variable
m. The raw score is divided by its standard error to get a z-score, which is the value
of variable importance for variable m. Figure 3.2 on the next page plots the values
of variable importance for all five categories, sorted in decreasing order. The
features are listed in Table 2.2 on page 22. The figure shows that the variable
importance values of the last 3 features (Features 7, 20, and 21) are much less than
the other values. Therefore, we select the remaining 38 most important features to build
the patterns for intrusion detection.
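The z-score computation just described can be sketched in plain Java. The per-tree vote counts below are hypothetical, and the helper names are ours, not the thesis code; the raw score is the mean margin over trees, and the z-score divides it by the standard error of that mean.

```java
public class VariableImportance {

    // Raw importance: mean over trees of (oob correct votes before permuting
    // feature m) minus (oob correct votes after permuting feature m).
    public static double rawImportance(int[] before, int[] after) {
        double sum = 0;
        for (int t = 0; t < before.length; t++) sum += before[t] - after[t];
        return sum / before.length;
    }

    // z-score: raw importance divided by the standard error of the mean margin.
    public static double zScore(int[] before, int[] after) {
        int n = before.length;
        double mean = rawImportance(before, after);
        double var = 0;
        for (int t = 0; t < n; t++) {
            double d = before[t] - after[t] - mean;
            var += d * d;
        }
        double se = Math.sqrt(var / (n - 1)) / Math.sqrt(n); // standard error of the mean
        return mean / se;
    }

    public static void main(String[] args) {
        int[] before = {50, 52, 51, 49};  // hypothetical oob correct-vote counts per tree
        int[] after  = {40, 41, 42, 39};  // same counts after permuting one feature
        System.out.printf("raw=%.2f z=%.2f%n",
                rawImportance(before, after), zScore(before, after));
    }
}
```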
Feature 3 (service type, such as http, telnet, and ftp) is the most important feature
for detecting intrusions. It means that the intrusions are sensitive to service type.
Feature 7 (land) indicates whether a connection is from/to the same host. According
to domain knowledge, it is the most discriminating feature for land attacks. However,
land attacks belong to DoS and have much fewer connections than other types of DoS.
After down-sampling the DoS attacks, the land attacks are almost excluded from the
balanced dataset. Therefore, Feature 7 is not important for improving the detection
rate of DoS attacks. Feature 20 (the number of outbound commands in an FTP session)
and Feature 21 (hot login, indicating whether it is a hot login) do not show any
variation for intrusion detection in the training set.
The above analysis suggests that feature selection can help choose features to
detect intrusions without special domain knowledge. However, the method has a high
dependence on training sets.
[Plot: variable importance of each feature, horizontal axis from about -10 to 20]

Figure 3.2: Variable importance of the features in the misuse approach experiment
3.2.4 Parameter optimization for random forests
To improve the detection rate, we optimize the number of random features (Mtry).
We build the forest with different values of Mtry (5, 10, 15, 20, 25, 30, 35, and 38)
over the balanced training set, then plot the oob error rate and the time to build
the pattern corresponding to each value of Mtry. As Figure 3.3 shows, the oob error
rate reaches its minimum when Mtry is 15, 25, or 30. Besides, increasing Mtry increases
the time to build the pattern. Thus, we choose 15 as the optimal value, which reaches
the minimum oob error rate and costs the least time among these three values.
[Plot: oob error rate (left axis) and time to build the pattern in seconds (right axis) against Mtry]

Figure 3.3: Performance with different values for parameter Mtry of random forests
3.2.5 Distribution of error rates
To build patterns of intrusions, the random forests algorithm samples cases at random
with replacement. The features to split the nodes of trees are also selected at random.
Thus, the oob error rate is different at each run, even with the same value of Mtry.
For this reason, we carry out an experiment to analyze the distribution of the oob
error rate.
Listing 3.1: The pseudo-code of the program for the experiment on the distribution
of the error rates

Get the filename of the dataset
Construct instances from the file
Set labels for instances
FOR each Mtry in (5, 10, 15, 20, 25, 30, 35, 38)
    FOR each seed in (1 .. 20)
        Set the number of trees, seed, and Mtry for the random forests module
        Build pattern using the instances
        Calculate the oob error rate
    END FOR
END FOR
[Plot: oob error rate against run number (1 to 20), one curve per Mtry value (5, 10, 15, 20, 25, 30, 35, 38)]
Figure 3.4: Distribution of the oob error rate
We developed a Java program for this experiment. The pseudo-code of the program
is shown in Listing 3.1. The random function in Java only generates pseudo-random
numbers: with the same seed, we will get the same result for each run. Therefore,
we need to set a different seed for the random function at each run.
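The seeding issue can be seen directly in plain Java: the same seed always reproduces the same draw, so independent runs must each get their own seed. The class and method names here are illustrative.

```java
import java.util.Random;

public class SeedDemo {

    // Draw one pseudo-random number from a generator with the given seed.
    public static int firstDraw(long seed) {
        return new Random(seed).nextInt(100); // same seed -> same draw, every time
    }

    public static void main(String[] args) {
        // Identical seeds give identical results, so a run would repeat itself;
        // distinct seeds (e.g., 1 .. 20 in Listing 3.1) give distinct runs.
        System.out.println(firstDraw(1) == firstDraw(1)); // true
    }
}
```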
In this experiment, we use the balanced dataset as training data and set the
number of trees to 50. The experimental result is shown in Figure 3.4 on the previous
page. The figure shows that the oob error rate is different at each run, but the change
is not significant. For example, the error rate stays low when the value of Mtry is 15.
Figure 3.5 shows the average of the oob error rates. The average of the error rates
reaches its minimum when the value of Mtry is 15.
[Plot: average oob error rate against Mtry (5 to 38)]
Figure 3.5: Average oob error rate for different Mtry
3.2.6 Speed performance of detection
Since network traffic is huge, high-speed detection is important for NIDSs.
A large volume of network traffic may overwhelm a NIDS with low processing speed.
The detection speed also determines the deployment of NIDSs. The proposed NIDS
is a centralized system: the captured data comes from sensors installed in different
network segments, so the number of sensors the system can serve depends on the speed
performance of the NIDS.
We carry out an experiment to measure the speed performance of detection. We
developed a Java program for this experiment. The pseudo-code of the program is
shown in Listing 3.2.
Listing 3.2: The pseudo-code of the speed measurement program

Get the filename of the dataset
Construct instances from the file
Set labels for instances
FOR each number of trees in (10, 20, 30, 40, 50)
    FOR each Mtry in (5, 10, 15, 20, 25, 30, 35)
        Set the number of trees and Mtry for the random forests module
        Build patterns using the instances
        Get the starting time of the classification
        FOR each instance
            Classify the instance using the built patterns
        END FOR
        Get the ending time of the classification
        Total time = end time - start time
        Time to process each connection = total time / the number of instances
    END FOR
END FOR
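The timing part of Listing 3.2 can be sketched in self-contained Java. The built forest is replaced here by a trivial stand-in classifier, since the point is only the per-connection timing arithmetic:

```java
import java.util.Random;

public class SpeedMeasure {

    // Stand-in for the built random-forest patterns: any per-connection classifier.
    public interface Classifier { int classify(double[] connection); }

    // Returns the average classification time per connection, in milliseconds.
    public static double msPerConnection(Classifier c, double[][] connections) {
        long start = System.nanoTime();            // starting time of the classification
        for (double[] conn : connections) c.classify(conn);
        long total = System.nanoTime() - start;    // total time = end time - start time
        return (total / 1e6) / connections.length; // time to process each connection
    }

    public static void main(String[] args) {
        Random rnd = new Random(0);
        double[][] data = new double[60620][5];    // same number of rows as the balanced dataset
        for (double[] row : data)
            for (int i = 0; i < row.length; i++) row[i] = rnd.nextDouble();
        Classifier dummy = conn -> conn[0] > 0.5 ? 1 : 0; // trivial classifier, harness only
        System.out.printf("%.6f ms/connection%n", msPerConnection(dummy, data));
    }
}
```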
The run-time experimental environment is as follows: dataset (60,620 instances),
CPU (Pentium 4, 3.00 GHz), RAM (0.99 GB), language (Java), JVM (version 1.4.1).
The experimental result is shown in Figure 3.6 on the next page. The figure shows
that the speed depends more on the number of trees in the forest than on the value of
Mtry. Increasing the number of trees increases the time to process each connection,
while increasing the value of Mtry decreases the time slightly. The average time is
about 0.014 milliseconds per connection. Thus, the Detector Module of the NIDS can
process about 71,655 connections per second in the experimental environment.
[Plot: time to process each connection (ms) against Mtry, one curve per forest size (10, 20, 30, 40, 50 trees)]
Figure 3.6: Speed measurement of detection
3.2.7 Evaluation and discussion
Different misclassifications have different levels of consequences. For example,
misclassifying R2L as Normal is more dangerous than misclassifying DoS as Normal. We
use the cost matrix published in the KDD'99 [14], shown in Table 3.3, to
measure the damage of misclassification. M_ij denotes the number of samples in Class
i misclassified as Class j, and C_ij indicates the corresponding cost in the cost
matrix. Let N be the total number of samples. The cost that indicates the average
Table 3.3: Cost matrix [14]

          Normal   Probe   DoS   U2R   R2L
Normal    0        1       2     2     2
Probe     1        0       2     2     2
DoS       2        1       0     2     2
U2R       3        2       2     0     2
R2L       4        2       2     2     0
Table 3.4: Performance comparison on the KDD'99 dataset

Experiments                            Overall error rate   Cost     Time (seconds)
Best KDD result                        7.29%                0.2331   Not provided
Experiment without feature selection   7.19%                0.2306   491
Experiment with feature selection      7.07%                0.2282   423
damage of misclassification for each connection is computed as:

    cost = Σ_{i,j} M_ij × C_ij / N        (3.1)
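Equation 3.1 is straightforward to implement. The sketch below hard-codes the Table 3.3 cost matrix; the confusion matrix in `main` is a hypothetical example, not the thesis results:

```java
public class MisclassificationCost {

    // KDD'99 cost matrix C[i][j]: cost of classifying a Class-i connection as Class j.
    // Row/column order: Normal, Probe, DoS, U2R, R2L (Table 3.3).
    static final int[][] C = {
        {0, 1, 2, 2, 2},
        {1, 0, 2, 2, 2},
        {2, 1, 0, 2, 2},
        {3, 2, 2, 0, 2},
        {4, 2, 2, 2, 0},
    };

    // cost = sum_ij M[i][j] * C[i][j] / N, where M is the confusion matrix
    // and N is the total number of test connections (equation 3.1).
    public static double averageCost(int[][] m) {
        long weighted = 0, n = 0;
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[i].length; j++) {
                weighted += (long) m[i][j] * C[i][j];
                n += m[i][j];
            }
        return (double) weighted / n;
    }

    public static void main(String[] args) {
        // Hypothetical confusion matrix for a tiny test set of 135 connections.
        int[][] m = {
            {90, 2, 3, 0, 5},
            {1, 9, 0, 0, 0},
            {2, 0, 8, 0, 0},
            {1, 0, 0, 4, 0},
            {3, 0, 0, 0, 7},
        };
        System.out.printf("average cost = %.4f%n", averageCost(m));
    }
}
```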
Similar to the KDD'99 contest, we evaluate our approach with the same test
dataset, which contains 311,029 examples. We carry out our experiment with 50
trees and 15 random features (optimized in the previous experiments). First, we
build patterns on the balanced training set using all 41 features. Then, we build
patterns using the 38 most important features. The evaluation results of the patterns
are reported in Table 3.4, along with the best result of the KDD'99 contest. The
overall error rate is the ratio of misclassified connections to the total connections
in the test set.
There were 24 participants in total in the KDD'99 contest [14]. The best result
of the contest, listed in Table 3.4, was achieved by an ensemble of decision trees.
The experimental results show that our approach provides a lower overall error rate
and cost than the best KDD'99 result, even without feature selection. The results
also show that the overall error rate, cost, and time to build patterns are all
reduced by selecting the most important features. Thus, feature selection can improve
the performance of intrusion detection.
3.2.8 Implementation
In our experiments, we use the random forests algorithm implemented in WEKA
(Waikato Environment for Knowledge Analysis) [5]. WEKA is a Java package which
contains machine learning algorithms for data mining tasks. However, WEKA does
not implement the variable importance function of the random forests algorithm.
Therefore, we also use the FORTRAN 77 program [2] developed by Leo Breiman and
Adele Cutler to calculate the variable importance for feature selection.

We also developed a tool for the experiments on the distribution of error rates
and on the speed performance of detection. Without the tool, we would have to run
WEKA many times for each experiment, and each run involves many manual operations,
such as setting parameters. With the tool, we can carry out these two experiments
automatically instead of manually.
3.3 Summary
In this chapter, we employ the random forests algorithm in misuse-based NIDSs
to improve detection performance. To increase the detection rate of the minority
intrusions, we build a balanced dataset by over-sampling the minority classes and
down-sampling the majority classes. The random forests algorithm can build patterns
more efficiently over the balanced dataset, which is much smaller than the original
one. The experiments have shown that the approach can reduce the time to build
patterns dramatically and increase the detection rate of the minority intrusions.
Instead of selecting features based on domain knowledge, we select features
automatically according to their variable importance calculated by the random forests
algorithm. With the feature selection algorithm, deciding upon the right set of
features becomes easy and automated. Although the approach reduces the dependency
on domain knowledge in feature selection, it increases the dependency on training
sets.
From the experiments on various numbers of random features, we obtain the optimal
value to improve the performance of random forests. The evaluation on the KDD'99
test set shows that the performance of our approach is better than the best result
of the KDD'99 contest, reducing both the overall error rate and the cost of
misclassification.
Chapter 4
Anomaly detection
In this chapter, we apply the random forests algorithm in anomaly detection. We
describe the framework of the anomaly-based NIDS (Network Intrusion Detection
System) and illustrate the approach to detect outliers using the random forests
algorithm. Finally, we discuss our experimental results.
4.1 Detecting outliers
In this section, we first present an overview of the proposed framework of the NIDS.
Then, we describe how to build patterns of network services and discuss our
unsupervised approach to detect outliers using the random forests algorithm. In the
unsupervised approach, there is no need for attack-free training data, while
supervised approaches need attack-free data to build profiles of normal activities.
4.1.1 Overview of the framework
The proposed framework applies the random forests algorithm to detect novel
intrusions. The framework is shown in Figure 4.1. The NIDS captures the network
traffic and constructs the dataset by pre-processing. After that, service-based
patterns are built over the dataset using the random forests algorithm. With the
built patterns, we can find outliers related to each pattern, and the system raises
alerts when outliers are detected. After capturing the network traffic, the
processing is off-line: due to the high computational requirements of the outlier
detection algorithm, on-line processing is not suitable in a real network environment.
Network Traffic → Pre-Processing → Dataset → Pattern Building → Outlier Detection → Alerts
Figure 4.1: The framework of the unsupervised anomaly NIDS
4.1.2 Mining patterns of network services
Network traffic can be categorized by services (e.g., http, telnet, and ftp). Each
network service has its own pattern. Therefore, we can build patterns of network
services using the random forests algorithm. However, the random forests algorithm
is supervised, so we need datasets labeled by network services. Since the information
about network services is in the network packets, network traffic can be labeled by
service automatically instead of by time-consuming manual processing. Actually, many
datasets used to evaluate NIDSs can be labeled by network services with little
effort. For example, one of the features in the KDD'99 dataset is the service type,
which can be used as the label.
Before building the patterns, we need to optimize the parameters of the random
forests algorithm. The number of features employed in splitting each node of each
tree (Mtry) is the primary tuning parameter. To improve the performance of the
random forests algorithm, this parameter should be optimized. Another parameter
is the number of trees in a forest.
We use the dataset to find the optimal value of Mtry and the number of trees.
The minimum error rate corresponds to the optimal values. Therefore, we use
different values of Mtry and different numbers of trees to build forests, and
evaluate the error rate of each forest. Then, we select the values corresponding
to the minimum error rate to build the patterns of the services.
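This selection step amounts to picking the grid point with the lowest error rate. A minimal sketch in Java, using a few hypothetical (trees, Mtry, oob error) triples rather than a full evaluation grid:

```java
public class ParameterSearch {

    // grid[k] = {trees, Mtry}; oob[k] = oob error rate of the forest built with grid[k].
    // Returns the index of the parameter pair with the minimum oob error rate.
    public static int best(int[][] grid, double[] oob) {
        int best = 0;
        for (int k = 1; k < oob.length; k++)
            if (oob[k] < oob[best]) best = k;
        return best;
    }

    public static void main(String[] args) {
        int[][] grid = {{10, 5}, {25, 40}, {50, 40}};  // (trees, Mtry) candidates
        double[] oob = {0.00886, 0.00602, 0.00615};    // hypothetical oob error rates
        int k = best(grid, oob);
        System.out.println(grid[k][0] + " trees, Mtry = " + grid[k][1]); // 25 trees, Mtry = 40
    }
}
```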
4.1.3 Unsupervised outlier detection
We can detect intrusions by finding unusual activities, or outliers. There are two
types of outliers in the proposed NIDS. The first type is an activity that deviates
significantly from the others in the same network service. The second type is an
activity whose pattern belongs to a service other than its own. For instance, if an
http activity is classified as the ftp service, the activity is determined to be an
outlier.
The random forests algorithm uses proximities to find outliers, whose proximities
to all other cases in the entire data are generally small. The proximities are one of
the most useful tools in random forests [12]. After the forest is constructed, all
cases in the dataset are put down each tree in the forest. If cases k and n are in
the same leaf of a tree, their proximity is increased by one. Finally, the proximities
are normalized by dividing by the number of trees.
For a dataset with N cases, the proximities originally form an N×N matrix, so the
complexity of the calculation is N×N. Datasets of network traffic are huge, so the
calculation needs a lot of memory and CPU time. To improve the performance, we
modify the algorithm that calculates the proximities. As mentioned above, if a
service activity is classified as another service, it is determined to be an outlier.
Therefore, we do not care about the proximity between two cases that belong to
different services. Let S_i denote the number of cases in service i. The complexity
is then reduced to Σ_i S_i × S_i after the modification.
With respect to the random forests algorithm, outliers can be defined as the cases
whose proximities to the other cases in the dataset are generally small [12].
Outlier-ness indicates a degree of being an outlier and can be calculated from the
proximities. class(k) = j denotes that case k belongs to class j, and prox(n, k)
denotes the proximity between cases n and k. The average proximity from case n in
class j to case k (the rest of the data in class j) is computed as:

    P(n) = Σ_{class(k)=j} prox²(n, k)        (4.1)

N denotes the number of cases in the dataset. The raw outlier-ness of case n is
defined as:

    N / P(n)        (4.2)

In each class, the median and the absolute deviation of all raw outlier-ness values
are calculated. The median is subtracted from each raw outlier-ness, and the result
of the subtraction is divided by the absolute deviation to get the final
outlier-ness. If the
outlier-ness of a case is large, the proximity is small, and the case is determined as
an outlier.
To detect outliers in a dataset of network traffic, we build patterns of services
over the dataset. Then, we calculate the proximity and outlier-ness for each
activity. An activity that exceeds a specified threshold of outlier-ness is
determined to be an outlier.
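The proximity and raw outlier-ness computations of equations 4.1 and 4.2 can be sketched in plain Java. The leaf assignments below are a toy example, not thesis data, and the class/method names are ours:

```java
public class OutlierScore {

    // prox(n, k): fraction of the trees in which cases n and k land in the same leaf.
    public static double proximity(int[][] leaf, int n, int k) {
        int same = 0;
        for (int[] tree : leaf) if (tree[n] == tree[k]) same++;
        return (double) same / leaf.length;
    }

    // Raw outlier-ness of case n: N divided by the summed squared proximities
    // to the other cases of the same service (equations 4.1 and 4.2).
    public static double rawOutlierness(int[][] leaf, int[] service, int n) {
        int N = service.length;
        double p = 0;
        for (int k = 0; k < N; k++) {
            if (k == n || service[k] != service[n]) continue; // skip other services
            double pr = proximity(leaf, n, k);
            p += pr * pr;
        }
        return N / p;
    }

    public static void main(String[] args) {
        // leaf[t][n] = leaf reached by case n in tree t (a toy 3-tree forest, 4 cases).
        int[][] leaf = {
            {0, 0, 1, 0},
            {2, 2, 0, 2},
            {1, 1, 1, 2},
        };
        int[] service = {0, 0, 0, 0}; // all four cases belong to the same service
        for (int n = 0; n < 4; n++)   // case 2, which shares few leaves, scores highest
            System.out.printf("case %d: %.2f%n", n, rawOutlierness(leaf, service, n));
    }
}
```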
4.2 Experiments and results
In this section, we summarize our experimental results on detecting intrusions using
the unsupervised outlier detection technique. We first describe the datasets used in
the experiments. Then we evaluate our approach in a way similar to that used
in [16, 21]. To evaluate the detection performance of our approach under different
numbers of attacks, we also carry out experiments over different datasets. The
performance of our approach in detecting the minority intrusions is evaluated by
another experiment.
4.2.1 Dataset and preprocessing
The full training set, one of the KDD'99 datasets, has 4,898,431 connections. The
dataset contains attacks and is labeled by attack type. Since our approach is
unsupervised, the dataset does not satisfy the needs of our experiments; we must
remove the labels that indicate the types of attacks from the dataset.

To generate new datasets for our experiments, we first separate the dataset into
two pools according to the labels. One contains normal connections; the other
contains intrusive connections. Then, we remove all the labels from the pools.
However, we need the data labeled by services to build patterns of services, so we
use the service feature in the dataset as the label. As a result, all the data
contains 40 features and is labeled by services.
For our experiments, we choose the five most popular network services: ftp, http,
pop, smtp, and telnet. By selecting the ftp, pop, and telnet normal connections, 5%
of the http normal connections, and 10% of the smtp normal connections, we generate
a dataset called the normal dataset, which contains 47,426 normal connections.
Finally, by injecting anomalies from the pool of attacks into the normal dataset, we
generate four new datasets: the 1%, 2%, 5%, and 10% datasets. The 1% (2%, 5%, and
10%) dataset means that 1% (2%, 5%, and 10%) of the connections in the dataset are
attacks.
4.2.2 Evaluation and discussion
We carry out the first experiment over the 1% attack dataset in a way similar to
that used in [16, 21]. We first optimize the parameters (Mtry and the number of
trees) of the random forests algorithm by feeding the dataset into the NIDS. The
NIDS builds patterns of the network services with different values of the
parameters, and then calculates the oob error rates. The oob error rates are shown
in the third column of Table 4.1. The values of the parameters (the number of trees
and Mtry) corresponding to the lowest oob error rate are the optimized ones.
Table 4.1: The oob error rates for parameter optimization in the anomaly detection
experiments

Trees  Mtry  1% dataset  2% dataset  5% dataset  10% dataset  Minority
10     5     0.00886     0.01323     0.01884     0.03237      0.00187
10     10    0.00745     0.01137     0.01885     0.03242      0.00187
10     15    0.00728     0.01084     0.01889     0.0325       0.00188
10     20    0.00673     0.01038     0.01889     0.03252      0.00188
10     25    0.0065      0.01012     0.0189      0.03255      0.00189
10     30    0.00624     0.01007     0.0189      0.03256      0.00189
10     35    0.00625     0.0099      0.0189      0.03258      0.00189
10     40    0.00631     0.00986     0.01892     0.03263      0.00189
15     5     0.00874     0.01306     0.01895     0.03266      0.0019
15     10    0.00748     0.0112      0.01895     0.03267      0.0019
15     15    0.00718     0.01065     0.01895     0.03269      0.00191
15     20    0.00665     0.01008     0.01896     0.03275      0.00191
15     25    0.00637     0.00997     0.01896     0.03281      0.00191
15     30    0.00615     0.00987     0.01898     0.03282      0.00192
15     35    0.0062      0.00972     0.01899     0.03295      0.00194
15     40    0.00615     0.00978     0.019       0.03299      0.00197
20     5     0.00894     0.01299     0.01901     0.03302      0.00197
20     10    0.00753     0.0112      0.01901     0.03309      0.00199
20     15    0.00714     0.01085     0.01902     0.0331       0.002
20     20    0.0067      0.0101      0.01902     0.0331       0.00202
20     25    0.00647     0.00993     0.01903     0.0332       0.00203
20     30    0.00622     0.00979     0.01906     0.03322      0.00203
20     35    0.00614     0.00967     0.01907     0.03351      0.00205
20     40    0.00612     0.0097      0.01908     0.03356      0.00208
25     5     0.00884     0.01301     0.0191      0.03361      0.00209
25     10    0.00746     0.01121     0.01911     0.03361      0.00209
25     15    0.00701     0.01089     0.01912     0.03363      0.00209
25     20    0.00658     0.01014     0.01912     0.03368      0.00211
25     25    0.00637     0.00988     0.01914     0.03371      0.00212
25     30    0.0061      0.00978     0.0192      0.03374      0.00212
25     35    0.00605     0.00976     0.01921     0.03389      0.00213
25     40    0.00602     0.00968     0.01923     0.03402      0.00224
30     5     0.00898     0.01294     0.01924     0.03406      0.00225
30     10    0.00738     0.01128     0.01953     0.03412      0.00226
30     15    0.00698     0.01091     0.01966     0.03415      0.00226
30     20    0.00652     0.01023     0.01966     0.03422      0.00226
30     25    0.00635     0.00996     0.01969     0.03424      0.00227
30     30    0.00613     0.00979     0.01971     0.03425      0.0023
30     35    0.0061      0.00983     0.01973     0.0343       0.0023
30     40    0.00608     0.00976     0.01979     0.03432      0.00232
35     5     0.00906     0.01297     0.0198      0.03439      0.00236
35     10    0.00741     0.01135     0.01982     0.03441      0.00237
35     15    0.00702     0.01096     0.01997     0.03541      0.00237
35     20    0.00662     0.01025     0.02042     0.03548      0.00239
35     25    0.00643     0.00996     0.02044     0.03551      0.00241
35     30    0.00621     0.0098      0.02045     0.03556      0.00241
35     35    0.00616     0.00987     0.02045     0.0356       0.00242
35     40    0.00616     0.00982     0.02046     0.03562      0.00243
40     5     0.00913     0.01298     0.02049     0.03563      0.00244
40     10    0.00752     0.01138     0.02049     0.03573      0.00247
40     15    0.00704     0.01092     0.0205      0.03585      0.0025
40     20    0.00661     0.01031     0.02053     0.03603      0.00261
40     25    0.00642     0.01002     0.02086     0.03707      0.00268
40     30    0.0062      0.00989     0.02115     0.03753      0.00276
40     35    0.00617     0.0099      0.02122     0.03754      0.00276
40     40    0.00615     0.00986     0.02125     0.0376       0.00276
45     5     0.00907     0.01309     0.02125     0.03763      0.0028
45     10    0.00752     0.0115      0.02126     0.03765      0.00283
45     15    0.00704     0.01086     0.02127     0.03765      0.00283
45     20    0.00658     0.01035     0.02134     0.03766      0.00284
45     25    0.00639     0.01004     0.02136     0.0377       0.00289
45     30    0.00616     0.00989     0.02137     0.0378       0.00305
45     35    0.00615     0.00991     0.02306     0.03985      0.00446
45     40    0.00613     0.00987     0.02324     0.03986      0.00448
50     5     0.00912     0.01306     0.02325     0.04008      0.00449
50     10    0.00753     0.01149     0.02328     0.04021      0.0045
50     15    0.00703     0.01083     0.0233      0.04038      0.00453
50     20    0.00656     0.01034     0.02337     0.0405       0.00458
50     25    0.00638     0.00996     0.0234      0.0405       0.0046
50     30    0.00618     0.00988     0.02343     0.04051      0.00461
50     35    0.00615     0.0099      0.02344     0.04052      0.00471
50     40    0.00615     0.00984     0.02346     0.04065      0.00476
With the optimized parameters (25 trees and Mtry = 40), we build the patterns of the
network services. Over the built patterns, the NIDS calculates the outlier-ness of
each connection. Figure 4.2 plots the outlier-ness of the 1% attack dataset. Since
the attacks are injected at the beginning of the dataset, the figure shows that the
outlier-ness of the attacks is much higher than that of most normal activities. Some
normal activities also have high outlier-ness, which leads to false positives. The
NIDS raises an alert if the outlier-ness of a connection exceeds a specified
threshold.
[Plot: outlier-ness against connection index]

Figure 4.2: The outlier-ness of the 1% attack dataset
We evaluate the performance of our system by the detection rate and the false
positive rate. The detection rate is the number of attacks detected by the system
divided by the number of attacks in the dataset. The false positive rate is the
number of normal connections that are misclassified as attacks divided by the number
of normal connections in the dataset. We can evaluate the performance by varying the
threshold of outlier-ness.
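The two rates at a single alert threshold can be sketched in plain Java; the scores and labels in `main` are hypothetical, not experimental data:

```java
public class RocPoint {

    // Given outlier-ness scores and ground-truth attack flags, compute the
    // detection rate and false positive rate at one alert threshold.
    public static double[] ratesAt(double[] score, boolean[] isAttack, double threshold) {
        int tp = 0, fp = 0, attacks = 0, normals = 0;
        for (int i = 0; i < score.length; i++) {
            if (isAttack[i]) {
                attacks++;
                if (score[i] > threshold) tp++; // attack detected
            } else {
                normals++;
                if (score[i] > threshold) fp++; // normal flagged as attack
            }
        }
        return new double[]{(double) tp / attacks, (double) fp / normals};
    }

    public static void main(String[] args) {
        double[] score = {9.1, 7.4, 0.8, 1.2, 6.0, 0.3}; // hypothetical outlier-ness values
        boolean[] attack = {true, true, false, false, false, false};
        double[] r = ratesAt(score, attack, 5.0);
        // prints "detection rate=1.00, false positive rate=0.25"
        System.out.printf("detection rate=%.2f, false positive rate=%.2f%n", r[0], r[1]);
    }
}
```

Sweeping the threshold and recording one such pair per value traces out the ROC curve discussed next.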
In intrusion detection, the ROC (Receiver Operating Characteristic) curve is often
used to measure the performance of IDSs. The ROC curve is a plot of the detection
rate against the false positive rate. Figure 4.3 plots the ROC curve showing the
relationship between the detection rates and the false positive rates over the
dataset.
[Plot: detection rate (%) against false positive rate (%)]

Figure 4.3: The ROC curve for the 1% attack dataset
The result indicates that our system can achieve a high detection rate with a low
false positive rate. Compared to other unsupervised anomaly-based systems [16, 21],
our system has better performance over the KDD'99 dataset when the false positive
rate is low. Table 4.2 lists some results from Eskin et al. [16], whose experiments
were carried out over a dataset containing 1 to 1.5% attacks and 98.5 to 99% normal
instances. The results from the others show the detection rates
Table 4.2: The performance of each algorithm over the KDD'99 dataset [16]

Algorithm   Detection rate   False positive rate
Cluster     93%              10%
Cluster     66%              2%
Cluster     47%              1%
Cluster     28%              0.5%
K-NN        91%              8%
K-NN        23%              6%
K-NN        11%              4%
K-NN        5%               2%
SVM         98%              10%
SVM         91%              6%
SVM         67%              4%
SVM         5%               3%
are reduced significantly when the false positive rate is low (below 1%). Although
our experiments are carried out under different conditions, Figure 4.3 shows that our
system still maintains relatively high detection rates when the false positive rate
is low. For example, the detection rate is 95% when the false positive rate is 1%.
When the false positive rate is reduced to 0.1%, the detection rate is still over
60%.
4.2.3 Experiments on the detection performance over different datasets
To evaluate our system under different numbers of attacks, we carry out the
experiments over the 1%, 2%, 5%, and 10% attack datasets.
We first optimize the parameters (Mtry and the number of trees) of the random
forests algorithm by feeding the datasets into the NIDS. The NIDS builds patterns of
the network services with different values of the parameters, and then calculates the
Table 4.3: The optimal parameters of random forests

Dataset       Trees   Mtry
1% dataset    25      40
2% dataset    20      35
5% dataset    10      5
10% dataset   10      5
oob error rates, which are shown in Table 4.1. The values corresponding to the
lowest oob error rate are the optimized ones. The optimal parameters for each
dataset are listed in Table 4.3.
With the optimized parameters, we build the patterns of the network services.
Over the built patterns, the NIDS calculates the outlier-ness of each connection.
Figure 4.4, Figure 4.5, and Figure 4.6 plot the outlier-ness of the 2%, 5%, and 10%
attack datasets, and Figure 4.7 plots the ROC curves for each dataset. The result
shows that the performance tends to decrease as the number of attacks increases.
Thus, the performance of anomaly detection depends on the proportion of attacks in
the dataset.
[Plot: outlier-ness against connection index]
Figure 4.4: The outlier-ness of the 2% attack dataset
Figure 4.5: The outlier-ness of the 5% attack dataset

Figure 4.6: The outlier-ness of the 10% attack dataset
Figure 4.7: The ROC curves for the different datasets (1%, 2%, 5%, and 10% attacks; x-axis: false positive rate)
4.2.4 Experiment on the detection performance over minority intrusions
Minority intrusions are more difficult to detect than majority intrusions. Since minority intrusions have far fewer connections, the above experiments cannot show the performance of the NIDS in detecting minority intrusions. Therefore, we carry out an experiment to evaluate the performance of detecting minority intrusions using outlier detection.
By injecting minority intrusions from the pool of attacks into the normal dataset,
we generate the minority attack dataset. We first optimize the parameters by feeding
the dataset into the NIDS. As shown in the seventh column of Table 4.1 on page 48,
the optimal number of trees is 10, and the optimal value of Mtry is 5.
Figure 4.8 plots the outlier-ness of the minority attack dataset. There are 57
intrusions in the dataset. Since the attacks are injected at the beginning of the
dataset, the figure shows that the outlier-ness of some attacks is much higher than
that of most normal activities. Figure 4.9 on the next page plots the ROC curve
showing the relationship between the detection rates and the false positive rates over
the dataset.
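Such an ROC curve can be traced by sweeping a threshold over the outlier-ness ranking: every distinct score is tried as a cut-off, and each cut-off yields one (false positive rate, detection rate) point. The following Python sketch is ours, for illustration only, and is not the thesis's plotting code:

```python
import numpy as np

def roc_points(outlier_ness, is_attack):
    """(false positive rate, detection rate) pairs obtained by using
    each distinct outlier-ness value as a detection threshold."""
    scores = np.asarray(outlier_ness, dtype=float)
    y = np.asarray(is_attack, dtype=bool)
    points = []
    for t in np.unique(scores)[::-1]:      # highest threshold first
        flagged = scores >= t              # connections declared intrusions
        dr = (flagged & y).sum() / y.sum()
        fpr = (flagged & ~y).sum() / (~y).sum()
        points.append((fpr, dr))
    return points
```

Connections with outlier-ness above the threshold are declared intrusions; lowering the threshold trades a higher detection rate for a higher false positive rate.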
Figure 4.8: The outlier-ness of the minority attack dataset
The result indicates that the detection rate of minority intrusions is lower than
in the experiments over the different datasets in the previous subsection. However,
the result is still impressive: the detection rate reaches 65% when the false positive
rate is 1%. In the KDD’99 contest, the detection rate of minority intrusions was
much lower than that of majority intrusions, even using misuse detection [14].
Figure 4.9: The ROC curve for the minority attack dataset
4.2.5 Implementation
We develop a Java program to implement our anomaly detection approach using
WEKA (the Waikato Environment for Knowledge Analysis) [5]. WEKA is an
open-source Java package that contains machine learning algorithms for data mining
tasks. However, WEKA does not implement the outlier detection function of the
random forests algorithm. Therefore, we modify the source code of WEKA to
implement outlier detection.
4.3 Summary
In this chapter, we propose a new framework for an unsupervised anomaly NIDS, based
on the outlier detection technique of the random forests algorithm. The framework
builds the patterns of network services over datasets labeled by service. With the
built patterns, the framework detects attacks in the datasets using the outlier
detection algorithm.
Because the datasets used in NIDSs are large, the process of detecting outliers
is very time-consuming and requires a large amount of memory. To improve the
performance, we modify the original outlier detection algorithm to reduce its
computational complexity, under the assumption that each network service has its
own pattern of normal activities.
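The per-service computation can be sketched as follows. This is an illustrative Python sketch, not the thesis's modified WEKA code: scikit-learn's forest supplies the leaf assignments, proximity between two connections is taken as the fraction of trees in which they share a leaf, and Breiman's outlier-ness formula is applied within each service class only. The function name `outlier_ness` and all parameter values are ours:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def outlier_ness(X, services, n_trees=10):
    """Per-service outlier-ness from random forest proximities.
    Raw outlier-ness of i is n / (sum of squared proximities to
    connections of the same service), then each service's scores are
    normalised by their median and mean absolute deviation."""
    X, services = np.asarray(X), np.asarray(services)
    forest = RandomForestClassifier(n_estimators=n_trees, max_features="sqrt",
                                    random_state=0).fit(X, services)
    leaves = forest.apply(X)                 # (n_samples, n_trees) leaf ids
    n = len(X)
    raw = np.empty(n)
    for i in range(n):
        same = np.flatnonzero(services == services[i])
        same = same[same != i]
        # proximity(i, j) = fraction of trees where i and j share a leaf
        prox = (leaves[same] == leaves[i]).mean(axis=1)
        raw[i] = n / max(float((prox ** 2).sum()), 1e-12)
    out = np.empty(n)
    for s in np.unique(services):            # normalise within each service
        idx = services == s
        med = np.median(raw[idx])
        dev = np.mean(np.abs(raw[idx] - med))
        out[idx] = (raw[idx] - med) / max(dev, 1e-12)
    return out
```

Restricting the proximity sums to connections of the same service is what reduces the cost: each connection is compared only against its own service's connections rather than against the whole dataset.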
Compared to supervised approaches, our approach breaks the dependency on
attack-free training datasets. The experimental results over the KDD’99 dataset
confirm the effectiveness of our approach using the unsupervised detection technique.
The performance of our system is comparable to that of other reported unsupervised
anomaly detection approaches. In particular, our approach achieves a higher
detection rate when the false positive rate is low. This is especially significant for
NIDSs, since a high false positive rate makes an NIDS useless.
The results also show that the performance tends to degrade with an increasing
number of attack connections. That is a general problem of unsupervised systems. The
experiment on minority intrusions indicates that minority intrusions are more difficult
to detect than majority intrusions by the anomaly approach.
Chapter 5
Combination of misuse and anomaly detection
In this chapter, we present a framework that combines the misuse and anomaly detection
approaches described in the previous chapters. We first discuss different ways of
combining misuse and anomaly detection in hybrid systems. Then, we describe the
architecture of the proposed hybrid system. Finally, an experiment is conducted to
evaluate our approach.
5.1 Misuse detection versus anomaly detection
Misuse detection determines intrusions by patterns or signatures that represent
attacks. Thus, misuse-based systems can detect known attacks, much like virus
detection systems, but they cannot detect unknown attacks [33]. Most NIDS (Network
Intrusion Detection System) products depend on misuse detection, since misuse detection
usually has a higher detection rate and a lower false positive rate than anomaly detection.
CHAPTER 5. COMBINATION OF MISUSE AND ANOMALY DETECTION 61
Another advantage of misuse detection is its high detection speed, due to the low
complexity of the detection algorithms. Anomaly detection usually has high
computational complexity, especially for unsupervised approaches such as clustering,
the outlier detection of the random forests algorithm, and Self-Organizing Maps (SOM).
Therefore, misuse detection is more suitable for on-line detection than anomaly detection.
Anomaly detection identifies observed activities that deviate significantly from
normal usage as intrusions. Thus, anomaly detection can detect unknown intrusions,
which cannot be addressed by misuse detection. The critical technique in anomaly
detection is building profiles of normal usage. If the profiles are defined too broadly,
some attacks might not be detected, leading to a low detection rate. On the other
hand, if the profiles are defined too narrowly, some normal activities might be flagged
as intrusions, raising false alarms. Currently, there is no effective way to define
normal profiles that achieve a high detection rate and a low false positive rate at the
same time. Although anomaly detection does not require prior knowledge of intrusions
and can detect new intrusions, it may not be able to describe what an attack is.
5.2 Approaches to combining misuse and anomaly detection
To address the problems of misuse and anomaly detection, many intrusion detection
systems combine both techniques, aiming to reach the accuracy of a misuse detection
system while retaining the ability to deal with new attacks. There are three ways to
combine misuse and anomaly detection:
1. Anomaly detection followed by misuse detection. Suspicious activities are
selected from observed data by anomaly detection. Then, misuse detection is
used to detect intrusions among the suspicious activities.
2. Parallel approach. Misuse and anomaly detection are applied in parallel.
3. Misuse detection followed by anomaly detection. First, misuse detection is
applied. Then, anomaly detection is used to detect intrusions missed by misuse
detection.
Anomaly detection followed by misuse detection
Figure 5.1 shows the framework of the first approach. First, observed activities are
fed into the anomaly detection component, which produces suspicious items that
deviate from the built normal profile. Then, the misuse detection component
identifies intrusions among the suspicious items. Items that match patterns of
attacks are determined to be known attacks. Items that match patterns of false
alarms are determined to be normal activities. The others are determined to be
unknown attacks.
Figure 5.1: Framework of anomaly detection followed by misuse detection
ADAM [8] applies this approach in an NIDS. The on-line single-level and domain-level
mining module in ADAM uses an anomaly detection technique to produce suspicious
items. The classifier module, which uses a misuse technique, classifies the suspicious
items into false alarms, attacks, and unknown attacks. The approach is also used in
ADFSC (Anomaly Detection First Serial Combination) by Elvis et al. in [33].
In this approach, the anomaly detection component should have a high detection
rate, since intrusions it misses cannot be detected by the follow-up misuse detection
component. The misuse detection component should be able to identify false alarms.
The false positive rate can be reduced by excluding the false alarms from the
suspicious items.
Parallel approach
Figure 5.2: Framework of the parallel approach
Figure 5.2 shows the framework of the parallel approach. Observed activities are fed
to the misuse detection component and the anomaly detection component in parallel.
Each component produces a set of suspicious items. The correlation component
analyzes these two sets to detect intrusions. The parallel approach has been used in
NIDES (the Next-Generation Intrusion Detection Expert System) [7].
Misuse detection followed by anomaly detection
Figure 5.3 shows the framework of the third approach. First, observed activities are
fed to the misuse detection component, which detects known attacks by matching the
signatures or patterns of attacks. The other items (uncertain items) that do not
match any signature or pattern are fed to the anomaly detection component to detect
unknown intrusions. The anomaly detection component should have a low false
positive rate; otherwise, the overall false positive rate of the hybrid system will be
high, and a high false positive rate makes the detection system useless.
Figure 5.3: Framework of misuse detection followed by anomaly detection
Rationale for choosing misuse detection followed by anomaly detection
Since our anomaly approach performs better when the false positive rate is low,
the third approach is more suitable than the others. For the first approach, the
anomaly detection component must have a very high detection rate, while a low false
positive rate is not critical, and the misuse detection component needs the ability to
identify false alarms to reduce the overall false positive rate. The high complexity
of our anomaly detection does not match the high-speed detection of our misuse
detection, so combining the two with the parallel approach makes real-time detection
impossible.
Moreover, the experimental results for anomaly detection show that its performance
tends to degrade with an increasing number of attack connections, a general problem
of unsupervised systems. Some attacks, such as DoS (Denial of Service), produce a
large number of connections, which may undermine an unsupervised anomaly
detection system. To overcome this problem, we use the third approach: the misuse
approach detects known attacks, and by removing them, the number of attacks in the
datasets for unsupervised anomaly detection can be reduced significantly.
Another reason to use the third approach is that misuse detection by the random
forests algorithm is fast. Thus, the hybrid system can detect known intrusions in real
time and unknown intrusions off-line. The low speed of the anomaly detection makes
real-time detection impossible using the first and second approaches.
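The chosen "misuse first" pipeline can be sketched end to end. This Python sketch is illustrative only: a scikit-learn RandomForestClassifier plays the misuse stage, and IsolationForest stands in for the thesis's proximity-based outlier detection; the function name `hybrid_detect` and all parameter values are ours.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

def hybrid_detect(X_train, y_train, X, contamination=0.01):
    """Misuse detection followed by anomaly detection (sketch).

    Stage 1: a supervised forest trained on labeled data
    (1 = known intrusion, 0 = normal) flags known attacks.
    Stage 2: connections it passes as normal are re-scored by an
    unsupervised outlier detector. Returns two boolean masks:
    (known attacks, novel attacks)."""
    misuse = RandomForestClassifier(n_estimators=15, random_state=0)
    misuse.fit(X_train, y_train)
    known = misuse.predict(X) == 1            # stage 1: known attacks

    rest_idx = np.flatnonzero(~known)         # stage 2: scan the remainder
    iso = IsolationForest(contamination=contamination, random_state=0)
    novel_rest = iso.fit_predict(X[rest_idx]) == -1

    novel = np.zeros(len(X), dtype=bool)
    novel[rest_idx[novel_rest]] = True
    return known, novel
```

Removing stage-1 detections before stage 2 is exactly the point argued above: the anomaly stage then sees far fewer attack connections, which is the regime where outlier detection works.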
5.3 Architecture of the hybrid system
The proposed hybrid system combines misuse detection and anomaly detection in an
NIDS. The architecture of the system is shown in Figure 5.4 on the next page.
The system consists of two components: the misuse detection component and the
anomaly detection component. The misuse detection component employs the random
forests algorithm for misuse detection. The anomaly detection component implements
detection using the outlier detection provided by the random forests algorithm.
Figure 5.4: Architecture of the hybrid system
There are two phases in the framework: an off-line phase and an on-line phase.
In the off-line phase, the system builds patterns of intrusions for the misuse
detection component and detects unknown intrusions using the anomaly detection
component. In the on-line phase, it detects known intrusions using the misuse
detection component.
In the off-line phase, the Intrusion Pattern Builder module is trained on labeled
data and outputs patterns of intrusions to the Misuse Detector module.
In the on-line phase, network traffic is captured and fed to the Misuse Detector.
The Misuse Detector raises an alarm to the Misuse Alarmer module if any connection
matches an intrusion pattern; the Alarmer module then delivers the alarms to security
analysts. If a connection does not match any intrusion pattern, it is sent to the
Anomaly Database module, which stores data for the anomaly detection component.
In the off-line phase, the system can detect novel intrusions using the anomaly
detection component. First, the Service Pattern Builder module retrieves data from
the anomaly database to build patterns of network services, and outputs the built
patterns to the Outlier Detector module. With the patterns, the Outlier Detector
retrieves the data from the anomaly database and uses the outlier detection technique
to detect attacks. If it detects any attack, it raises alarms to the Anomaly Alarmer
module. The Anomaly Alarmer can deliver the alarms to security analysts. It can
also store newly detected intrusions in the training database, so that new patterns of
these intrusions can be built for misuse detection.
5.4 Experiments and results
5.4.1 Dataset and preprocessing
We evaluate our hybrid approach over the KDD’99 dataset. The full training set
of KDD’99 is labeled by type of intrusion. To calculate the detection rate and
false positive rate easily, we change the label to 1 or 0 (1 if the connection is an
intrusion; 0 otherwise) instead of the types. Then, we choose the five most popular
network services: ftp, http, pop, smtp, and telnet. By selecting the ftp, http, pop,
smtp, and telnet connections, we generate our training set, which contains 16,919
connections after down-sampling the normal connections. The test set of KDD’99 is
processed in the same way, except that normal connections are not down-sampled;
our test set contains 49,838 connections. The training set is used to build patterns of
intrusions for misuse detection. The test set is used to evaluate our hybrid approach.
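With the labels mapped to 1/0, the two evaluation metrics reduce to simple ratios; a minimal helper (ours, for illustration):

```python
def detection_rates(y_true, y_pred):
    """Detection rate = detected intrusions / all intrusions;
    false positive rate = normal connections flagged / all normals.
    Labels and predictions are 1 (intrusion) or 0 (normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return tp / pos, fp / neg
```

This is why the 1/0 relabeling is convenient: both rates become counts over the binary labels, independent of the original attack types.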
5.4.2 Evaluation and discussion
To improve the performance of misuse detection, we employ the feature selection
algorithm to calculate the variable importance values over the training set. The
result is shown in Figure 5.5 on the next page.
The figure shows that the variable importance of the last 7 features (Features 2, 7,
8, 9, 15, 20, and 21) is much smaller than that of the others. Therefore, we select the
34 most important features to build patterns of intrusions.
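This pruning step can be sketched with a forest's variable importance ranking. The sketch below uses scikit-learn's `feature_importances_` rather than the Breiman and Cutler FORTRAN program used in the thesis; the function name and parameters are ours:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_top_features(X, y, keep=34):
    """Rank features by the forest's variable importance and keep the
    `keep` strongest, mirroring the pruning of the 7 weakest of the
    41 KDD'99 features. Returns sorted indices of retained features."""
    forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    order = np.argsort(forest.feature_importances_)[::-1]  # strongest first
    return np.sort(order[:keep])
```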
Figure 5.5: Variable importance of the features in the hybrid approach experiment
We also optimize the parameters (Mtry and the number of trees) of the random
forests algorithm for misuse detection. By building patterns of intrusions with
different values of Mtry (5, 10, 15, 20, 25, 30, and 34) and different numbers of trees
(10, 15, 20, 25, 30, 35, 40, 45, and 50), we get the oob error rates for each pattern,
shown in Table 5.1 on the next page.
Table 5.1: The oob error rates for parameter optimization in the hybrid approach experiment

Trees  Mtry  Oob error (intrusion patterns)  Oob error (service patterns)
10     5     0.00227                         0.00212
10     10    0.00199                         0.00146
10     15    0.00201                         0.00093
10     20    0.00173                         0.00075
10     25    0.00203                         0.00091
10     30    0.00184                         0.00084
10     34    0.00175                         0.00069
15     5     0.00249                         0.00216
15     10    0.00211                         0.00142
15     15    0.00200                         0.00106
15     20    0.00191                         0.00075
15     25    0.00197                         0.00088
15     30    0.00180                         0.00081
15     34    0.00169                         0.00070
20     5     0.00246                         0.00216
20     10    0.00212                         0.00143
20     15    0.00209                         0.00109
20     20    0.00186                         0.00077
20     25    0.00186                         0.00086
20     30    0.00177                         0.00082
20     34    0.00169                         0.00071
25     5     0.00240                         0.00212
25     10    0.00226                         0.00141
25     15    0.00213                         0.00103
25     20    0.00200                         0.00076
25     25    0.00183                         0.00088
25     30    0.00172                         0.00079
25     34    0.00169                         0.00071
30     5     0.00260                         0.00211
30     10    0.00228                         0.00146
30     15    0.00216                         0.00101
30     20    0.00195                         0.00078
30     25    0.00176                         0.00083
30     30    0.00173                         0.00076
30     34    0.00171                         0.00069
35     5     0.00262                         0.00210
35     10    0.00224                         0.00146
35     15    0.00219                         0.00103
35     20    0.00201                         0.00079
35     25    0.00185                         0.00078
35     30    0.00180                         0.00074
35     34    0.00178                         0.00066
40     5     0.00274                         0.00211
40     10    0.00231                         0.00145
40     15    0.00217                         0.00101
40     20    0.00203                         0.00080
40     25    0.00188                         0.00077
40     30    0.00186                         0.00075
40     34    0.00180                         0.00068
45     5     0.00278                         0.00206
45     10    0.00227                         0.00144
45     15    0.00218                         0.00100
45     20    0.00203                         0.00079
45     25    0.00190                         0.00079
45     30    0.00189                         0.00075
45     34    0.00182                         0.00067
50     5     0.00283                         0.00206
50     10    0.00232                         0.00143
50     15    0.00224                         0.00099
50     20    0.00211                         0.00080
50     25    0.00195                         0.00079
50     30    0.00190                         0.00074
50     34    0.00187                         0.00066
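The grid search behind this table can be sketched with scikit-learn's out-of-bag estimate. This is an illustrative sketch with shortened grids and our own function name, not the thesis's optimization tool:

```python
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def tune_by_oob(X, y, mtry_grid=(5, 10, 15), tree_grid=(10, 15, 20)):
    """Pick the (trees, Mtry) pair with the lowest out-of-bag error
    rate (1 - oob_score_), as in the table above."""
    best, best_err = None, np.inf
    for trees, mtry in product(tree_grid, mtry_grid):
        rf = RandomForestClassifier(n_estimators=trees,
                                    max_features=min(mtry, X.shape[1]),
                                    oob_score=True, bootstrap=True,
                                    random_state=0).fit(X, y)
        err = 1.0 - rf.oob_score_   # oob error rate for this setting
        if err < best_err:
            best, best_err = (trees, mtry), err
    return best, best_err
```

The oob estimate needs no held-out set: each tree is evaluated only on the samples left out of its bootstrap, which is why a single pass over the grid suffices.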
In misuse detection, with the optimized parameters (15 trees, Mtry = 34), we build
the patterns of intrusions. With the built patterns,
we use the misuse approach to detect intrusions over the test set. The detection
rate is 94.2%, and the false positive rate is 1.1%.
By excluding the detected intrusions from the test set, we generate the anomaly
test set, which is labeled by service. We optimize the parameters for anomaly
detection over the anomaly test set. By building patterns of the services with
different values of Mtry (5, 10, 15, 20, 25, 30, and 34) and different numbers of
trees (10, 15, 20, 25, 30, 35, 40, 45, and 50), we get the oob error rates shown in
Table 5.1 on page 70.
In anomaly detection, with the optimized parameters (35 trees, Mtry = 34), we build
the patterns of the services. With the built patterns, we use the outlier detection
approach to calculate the outlier-ness of the connections in the anomaly test set.
The connections are sorted by outlier-ness in descending order, and the first one
percent of the connections are determined to be intrusions. We choose one percent
as the threshold so that the false positive rate of the anomaly detection will be
below one percent. 30 connections are identified as new intrusions. The overall
detection rate of the hybrid system is 94.7%, and the overall false positive rate
is 2%. The result shows that the anomaly approach detects some intrusions missed
by the misuse approach; however, most of the intrusions missed by the misuse
approach are also missed by the anomaly approach.
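The one-percent cut-off can be sketched as follows (our helper, for illustration):

```python
import numpy as np

def flag_top_percent(outlier_ness, percent=1.0):
    """Sort connections by outlier-ness (descending) and flag the top
    `percent` as intrusions, which caps the anomaly stage's false
    positive rate at roughly that percentage."""
    scores = np.asarray(outlier_ness, dtype=float)
    k = max(1, int(len(scores) * percent / 100))
    flagged = np.zeros(len(scores), dtype=bool)
    flagged[np.argsort(scores)[::-1][:k]] = True
    return flagged
```

Because at most one percent of all connections can be flagged, normal connections can contribute at most one percent of false positives, whatever the score distribution looks like.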
We carried out further analysis of this result. Figure 5.6 on the next page plots the
outlier-ness. There are 411 attacks in the anomaly test set, injected at the beginning
of the dataset.
Figure 5.6: Outlier-ness of the anomaly test set
As shown in the figure, some intrusive connections have much higher outlier-ness
than the normal connections, but most of them have much lower outlier-ness than
the normal ones. The explanation is that these intrusions are very similar to each
other: they very likely fall into the same leaves of the trees built by the random
forests algorithm, so their mutual proximities are high and their outlier-ness is
very low.
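The same-leaf effect can be observed directly: two identical connections traverse every tree identically, so their proximity (the fraction of trees in which they share a leaf) is exactly 1, and neither raises the other's outlier-ness. A small demonstration (our code, illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Two duplicated "intrusion" rows among distinct normal rows: the
# duplicates land in the same leaf of every tree, so their pairwise
# proximity is 1, which keeps their outlier-ness low.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 3)), [[9, 9, 9], [9, 9, 9]]])
y = np.array([0] * 40 + [1] * 2)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
leaves = forest.apply(X)                         # leaf id per (sample, tree)
prox_dupes = (leaves[40] == leaves[41]).mean()   # fraction of shared leaves
```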
5.4.3 Implementation
In this experiment, we use the FORTRAN 77 program [2] developed by Leo Breiman
and Adele Cutler to calculate the variable importance for feature selection. We build
a tool to optimize the parameters of the random forests algorithm. We develop a Java
program to implement our hybrid detection approach using WEKA (the Waikato
Environment for Knowledge Analysis) [5].
5.5 Summary
In this chapter, we propose a new framework for a hybrid system that combines our
misuse detection and anomaly detection approaches. In the framework, misuse
detection is applied first to detect known intrusions in real time. The connections
that are not determined to be intrusions by the misuse detection are then examined
by the off-line anomaly detection approach.
The misuse detection has a high detection rate for known intrusions and a low
false positive rate. The anomaly detection using outlier detection can detect novel
intrusions and performs well when the false positive rate is low. The proposed hybrid
system combines the advantages of these two detection approaches. Besides, the
misuse detection removes known intrusions from the datasets, so the performance of
the anomaly detection is improved by applying the misuse detection first.
The experimental results show that the proposed hybrid approach can achieve a high
detection rate with a low false positive rate, and can detect novel intrusions.
However, intrusions that are very similar to each other cannot be detected by the
anomaly detection. That is a limitation of the outlier detection provided by random
forests.
Chapter 6
Conclusion and future work
6.1 Conclusion
In this thesis, we present our data mining-based approaches for network intrusion
detection. We apply the random forests algorithm in misuse detection, anomaly
detection, and hybrid detection.
To address the problems of rule-based systems, we employ random forests to build
patterns of intrusions. By learning over training data, the random forests algorithm
can build the patterns automatically instead of relying on manually coded rules. In
our misuse detection framework, patterns of intrusions are built in the off-line phase
and can be deployed automatically. The system can then detect intrusions in real
time with the built patterns. Detection speed is critical for real-time NIDSs; our
experiment on speed measurement shows that the system is fast enough to be used
in real-time network environments. To improve the accuracy of the system, we use
the feature selection algorithm and optimize the parameters of the random forests
algorithm. We also use sampling techniques to increase the detection rate of
CHAPTER 6. CONCLUSION AND FUTURE WORK 76
minority intrusions in the framework.
We evaluate the misuse approach over the KDD’99 dataset. The experimental
result shows that the performance of our approach is better than the best KDD’99
result.
Misuse detection cannot detect novel intrusions, so we propose a new approach to
unsupervised anomaly detection. We apply the outlier detection of the random forests
algorithm to anomaly detection: the outliers detected by the algorithm are determined
to be intrusions. Since the random forests algorithm is a supervised data mining
algorithm, it uses labeled training data to build patterns. Therefore, our approach
builds patterns of network services instead of intrusions. With the built patterns of
services, the approach determines the outliers relative to those patterns to be
intrusions. The approach breaks the dependency on attack-free training data, which
is the major problem of supervised anomaly detection. Detecting outliers in large
datasets is time-consuming and requires a large amount of memory. To improve the
performance of detecting outliers, we modify the original outlier detection algorithm
to reduce its computational complexity.
We evaluate the anomaly approach over the different datasets generated from the
KDD’99 dataset. The results confirm that our approach achieves a higher detection
rate when the false positive rate is low, compared to other reported unsupervised
anomaly detection approaches. The results also show that the detection performance
tends to decrease as the number of attack connections in a dataset increases.
Misuse detection has a high detection rate with a low false positive rate. Anomaly
detection can detect novel intrusions. Therefore, combining misuse and anomaly
detection can improve the overall performance of intrusion detection systems. We
propose a new framework, built on the random forests algorithm, that combines
misuse and anomaly detection. In the framework, misuse detection is applied first
to detect known intrusions. By filtering out the intrusions detected by the misuse
approach, the number of intrusions in the dataset can be reduced significantly;
hence, the detection performance of the anomaly approach is improved. The
evaluation experiment on our hybrid approach indicates that the proposed hybrid
framework can achieve a high detection rate with a low false positive rate compared
with other hybrid systems. However, the results also show that the outlier detection
cannot detect intrusions that are very similar to each other.
6.2 Limitations and future work
We apply the outlier detection of the random forests algorithm to anomaly detection.
The technique has two limitations. First, the intrusions in a dataset must be far
fewer than the normal data; the outlier detection only works when the majority of
the data are normal. We could use the misuse detection to filter out known
intrusions. However, this cannot guarantee that the majority of activities are normal
after removing known intrusions. For example, a new type of intrusion may produce
a large number of connections, which cannot be filtered out by the misuse detection.
This could decrease the performance of the anomaly detection and, moreover, may
undermine the hybrid system. Second, intrusions with a high degree of similarity to
each other cannot be detected as outliers by the anomaly detection. To address these
problems, we suggest that other data mining algorithms, such as clustering
algorithms, be investigated in the future.
The random forests algorithm has been successfully applied in different fields,
especially for prediction. It can find patterns that are suitable for prediction in
large volumes of data. Basically, the techniques used in intrusion detection can be
used in intrusion prediction: the inputs for prediction are past data, and the
outputs are future data.
In intrusion prediction, we can predict a specific intrusion based on symptoms.
Some intrusions have symptoms (predictors); for example, IP scan activity is a
predictor of worm propagation. Therefore, we can analyze datasets and intrusions to
predict a certain intrusion using the random forests algorithm. The plans for this
kind of intrusion prediction are as follows:
• Find predictable intrusions by analyzing datasets and intrusions.
• Extract the features that can serve as predictors of intrusions.
• Apply the random forests algorithm effectively to build the prediction patterns.
We also can predict an intrusion in a more general way. For example, we can pre
dict whether intrusions will happen within a certain period. The currently available
datasets are not suitable for this kind of prediction. We need to find some other pre
dictors which are correlated with intrusions, such as the number of the vulnerabilities
on a network. The plans for this kind of intrusion prediction are listed as follows:
• Find predictors related to intrusions.
• Collect the da ta th a t contains the predictors of intrusions.
• A p p ly t h e r a n d o m fo re s ts a lg o r i th m e ffec tiv e ly to b u ild th e p r e d ic t io n p a t te r n s .
Since this kind of data is difficult to collect, the datasets will contain missing values. We therefore need to use the random forests algorithm to handle the missing value problem.
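One simple way to handle this, along the lines of the "rough fill" strategy Breiman suggests for random forests, is to replace each missing numeric value with its column median before training. The data below is synthetic and the 10% missingness rate is an assumption chosen only to exercise the imputation step:

```python
# Minimal sketch of median "rough fill" imputation before training a
# random forest. All data here is synthetic and illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.random((200, 3))
y = (X[:, 1] > 0.5).astype(int)

# Simulate hard-to-collect predictor data: knock out 10% of the entries.
mask = rng.random(X.shape) < 0.10
X_missing = X.copy()
X_missing[mask] = np.nan

# Rough fill: column medians computed over the observed values only.
medians = np.nanmedian(X_missing, axis=0)
X_filled = np.where(np.isnan(X_missing), medians, X_missing)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_filled, y)
```

Breiman's full procedure refines these rough fills iteratively using the forest's proximity measure; the median fill shown here is just the starting point.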