DDoS datasets: Use of machine learning to analyse intrusion detection performance
Stefanos Kiourkoulis
Information Security, master's level (120 credits)
2020
Luleå University of Technology
Department of Computer Science, Electrical and Space Engineering
Master Thesis Project
DDoS datasets: Use of machine learning to analyse intrusion detection performance
Author: Stefanos Kiourkoulis  E-mail: [email protected]
Supervisor: Dr. Ali Ismail Awad
June 2020
Master of Science in Information Security
Luleå University of Technology
Department of Computer Science, Electrical and Space Engineering
Abstract
Threats of malware, attacks and intrusion have been around since the very conception of
computing. Yet, it was not until the sudden growth of the internet that awareness of security and
digital assets really started to pick up steam. The internet presents a new liability, as the ever-
increasing number of machines on the web provides a new goldmine for those seeking to exploit
vulnerabilities. As access increases, new ways are created for attackers to exploit network systems
and their users. Among various types of attack, DDoS remains the most devastating and severe
due to its potential impact, and this potential keeps on growing, making intrusion detection a
must for network security and defense. As a result, machine learning and artificial intelligence
research has flourished over the last few years, opening new doors for intrusion detection
technologies. However, data availability still greatly limits the success of such technologies, as
research faces a shortage of good quality IDS datasets.
This study bases itself on this persisting issue as it assesses the state-of-the-art of open datasets
and their ability to detect intrusion and harmful network traffic. In particular, this study focuses
on providing a comparison of intrusion detection performance of open DDoS attack datasets.
DDoS attacks are some of the most concerning due to the magnitude of damage that they are
capable of. Literature on open DDoS datasets is fairly scarce in comparison to other forms of
attacks, hence, this study seeks to shed more light on the nature of existing DDoS data in relation
to intrusion detection. The proposed solution sees four DDoS datasets analysed using a set of six
machine learning algorithms, namely, k-NN, SVM, naïve Bayes, decision tree, random forest and
logistic regression. This study aims to assess these datasets and analyse their performance with
regards to classification of network traffic.
The results of this study contribute to a better understanding of the intrusion detection capacity
of open DDoS datasets. The datasets are analysed on the basis of five performance metrics: accuracy,
precision, recall, F-measure and computation time. The results show that voluminous datasets,
such as the CSE-CIC-IDS2018 dataset, can achieve very high performance. In modelling terms,
the results indicate that random forest performs very well over a wide range of datasets, while naïve
Bayes and SVM are less consistent.
Acknowledgements
Throughout the writing of this dissertation, several people contributed in numerous ways, from
advisory to the much-needed support for the long hours required to finalize the writing. I would
like to thank all of them.
At the forefront my professor, Dr Ali Ismail Awad, for providing constant feedback,
recommendations, general guidance and for always being available. Furthermore, I would like to
thank LTU and Sweden’s academic system, for giving me the opportunity to further my academic
knowledge in the subject of security.
In addition, many thanks to my dissertation opponent Alexandros Marinakis, for his valuable time
and assistance during the seminars. Last but not least, Francesca Gauci for her counselling,
support and all-around enforcement of discipline towards the writing of the dissertation.
Table of Contents
1 INTRODUCTION
1.1 OVERVIEW
1.2 PROPOSED SOLUTION AND RESEARCH GOALS
1.3 RESEARCH QUESTIONS
1.4 RESEARCH CONTRIBUTION
1.5 LIMITATIONS
1.6 DOCUMENT STRUCTURE
2 BACKGROUND
2.1 OVERVIEW OF DDOS ATTACKS
2.2 DDOS TAXONOMY
3 LITERATURE REVIEW
3.1 OVERVIEW
3.2 USE OF MACHINE LEARNING IN INTRUSION DETECTION
3.3 AVAILABILITY OF GOOD DATASETS
3.3.1 Issues with Current Datasets
3.3.2 Related Work
3.4 RESEARCH GAP
4 RESEARCH METHODOLOGY
4.1 OVERVIEW
4.2 LIFE CYCLE
4.3 IMPLEMENTING CRISP-DM
5 DDOS DATASETS REVIEW
5.1 CICDDOS2019
5.2 CSE-CIC-IDS2018 ON AWS
5.3 NDSEC-1 (BOTNET)
5.4 CICIDS2017
6 EXPERIMENT IMPLEMENTATION AND DESIGN
6.1 OVERVIEW
6.2 DATA PREPARATION
6.2.1 Data Cleaning and Transformation
6.2.2 Volume and Class Distribution
6.2.3 Splitting Datasets
6.3 MODELLING
6.3.1 Models Selection
6.3.2 Models Used in this Study
6.3.3 Training
6.3.4 Validation
6.3.5 Testing
6.4 EVALUATION
7 RESULTS
7.1 OVERVIEW OF RESULTS
7.2 CICDDOS2019
7.3 CSE-CIC-IDS2018
7.4 NDSEC-1
7.5 CICIDS2017
8 DISCUSSION
8.1 CONTRIBUTIONS OF THIS STUDY
8.2 CONCLUSIONS AND FUTURE WORK
REFERENCES
APPENDIX A – SPECIFICS FOR CICDDOS2019
APPENDIX B – MODELLING SOURCE CODE
AB.1 K-NEAREST NEIGHBOUR
AB.2 SUPPORT VECTOR MACHINE
AB.3 NAÏVE BAYES
AB.4 DECISION TREE
AB.5 RANDOM FOREST
AB.6 LOGISTIC REGRESSION
List of Tables
Table 1: OSI model describing the functionality of each distinctive layer.
Table 2: OS specification and machine IPs for CICDDoS2019. Adapted from DDoS Evaluation Set [47].
Table 3: Specification of tools and duration of DDoS attack for CSE-CIC-IDS2018 on AWS [48].
Table 4: Details of DDoS attack for CICIDS2017.
Table 5: Labelling system for binary classification.
Table 6: Volume of records for the DDoS attack datasets.
Table 7: Pseudocode for the k-NN algorithm [54].
Table 8: Pseudocode for the SVM algorithm [54].
Table 9: Pseudocode for the naïve Bayes algorithm [54].
Table 10: Pseudocode for the decision tree algorithm [54].
Table 11: Pseudocode for the random forest algorithm [54].
Table 12: Pseudocode for the logistic regression algorithm [54].
Table 13: Methods and classifiers from the scikit-learn Python library [57] used for building models.
Table 14: Example of a confusion matrix for a binary classifier [65].
Table 15: Performance metrics for each dataset.
List of Figures
Figure 1: Overview of proposed solution.
Figure 2: The four tiers of CRISP-DM. Reproduced from [46].
Figure 3: The six-phase life cycle of a data mining project. Reproduced from [46].
Figure 4: Workflow of the supervised learning process.
Figure 5: Bar chart showing the distribution of traffic types in the CICDDoS2019 dataset.
Figure 6: Bar chart showing the distribution of traffic types in the CSE-CIC-IDS2018 dataset.
Figure 7: Bar chart showing the distribution of traffic types in the CICIDS2017 dataset.
Figure 8: Bar chart showing the distribution of traffic types in the NDSec-1 dataset.
Figure 9: An example of k-NN classification. When k=3, the new instance is labelled as 0; however, when the parameter is increased to k=5, the same instance is labelled as 1. Adapted from [55].
Figure 10: Illustration of SVM classification.
Figure 11: K-fold cross-validation with 5 folds.
Figure 12: Bar graph of accuracy rates for the CICDDoS2019 dataset.
Figure 13: Bar graph of accuracy rates for the CSE-CIC-IDS2018 dataset.
Figure 14: Bar graph of accuracy rates for the NDSec-1 dataset.
Figure 15: Bar graph of accuracy rates for the CICIDS2017 dataset.
List of Equations
Equation 1: Bayes’ Theorem
Equation 2: Accuracy Ratio
Equation 3: Precision Ratio
Equation 4: Recall Ratio
Equation 5: F-measure Ratio
1 Introduction
1.1 Overview
Cyber security refers to the application of preventive security measures to provide confidentiality,
integrity and availability of data [1]. Cyber security has long been a point of discussion in both
the academic and scientific worlds, and multiple works of literature describe and define it. In
particular, Canongia and Mandarino [2] interpret it as “the art of ensuring the existence and
continuity of the information society of a nation, guaranteeing and protecting, in Cyberspace, its
information, assets and critical infrastructure”. Safeguarding cyberspace and ensuring security is
of utmost significance, as
multiple organisations and operations depend on it, including high-risk ones such as governments
and the military, as well as businesses, financial institutions and civilians that store immense
volumes of data on personal computers and other devices [1], [3]. Consequently, it is
necessary for companies to organise their efforts to ensure protection across their information
systems. Cyber security is composed of different elements, including network security, data
security and mobile security, to name a few [3].
Over the last few years, the usage of the Internet and computer-based applications has
grown exponentially, as these are rapidly becoming an essential component for today’s generation.
With the aggressive growth in the use of computer applications and computer networks, secure
environments are becoming critical [4], [5]. As improvements in technological systems are making
processes easier in all aspects of life, these also create new ways for attackers to exploit these
systems and their users. Attackers can go down various paths in order to cause harm and damage
to users and organisations. These paths present different levels of risk and, accordingly, may or
may not be severe enough to attract attention [6]. The National Institute of Standards and
Technology (NIST) reported that in 2017 alone, American companies experienced losses of up to
65.5 billion dollars due to IT-related attacks and intrusions [6].
Among various attacks, Denial-of-Service (DoS) remains an immense threat to internet-
dependent businesses and organisations. Although security researchers and experts dedicate
continuous efforts to address this issue, DoS attacks are still one of the most difficult security
problems faced by the Internet and the online world today. Of particular concern are Distributed
Denial-of-Service (DDoS) attacks, specifically because the capacity and impact of DDoS attacks
are persistently growing. With little or no prior warning, such an attack can easily and efficiently
cripple the resources of its victims in a short timespan [7].
Consequently, adopting intrusion detection measures is becoming essential. Intrusion
detection is highly significant in network security and defense, as it proactively aims to forewarn
security administrators about malicious behaviours, such as attacks, malware and intrusion.
Having an intrusion detection system (IDS) is considered a “mandatory line of defense” against
the growth of intrusive activities. As a result, research in the IDS domain has gained traction over
the years to come up with better intrusion detection methodologies. The very first IDS was
introduced in 1980, and since then many other systems have been proposed [1]. Yet, many of these
systems still generate a lot of alerts for low risk and non-threatening situations, resulting in a high
false alarm rate. This creates huge security risks as it may cause malicious attacks and malignant
behaviour to be ignored. Therefore, recent research has shifted its focus to reducing false alarms
and generating higher detection rates [2].
This has opened doors to new areas of research, specifically in artificial intelligence, data
mining and machine learning. These fields have become subject to extensive research that
emphasises the improvement of detection accuracy, with the aim of proposing new systems to
tackle novel or zero-day attacks [8]. Machine learning is a subset of artificial intelligence and is
concerned with the automatic discovery of useful patterns from large datasets [2]. Machine learning
algorithms can be generalised enough to detect many variants of an attack; however, the success
of machine learning-based IDS depends highly on the quality of the training data available.
Consequently, numerous researchers working in this domain face an urgent need for quality
datasets to effectively apply machine learning models for intrusion detection. However, getting
suitable training data is a significant challenge in the cyber security domain, research community
and vendors [9]. As a result, the suitability and validity of existing datasets has been scrutinised
and thoroughly questioned for various reasons. Primarily, there is the issue of privacy, which keeps
real datasets from being shared by companies who suffered attacks, as this may expose
vulnerabilities. Other issues relate to anonymity and detachment from current trends, with most
of the current datasets lacking traffic and attack diversity [9], [10]. Additionally, with the
continuous change and improvements in technology and attack strategies, such datasets need to
be periodically updated [9].
Accordingly, the focal point of this study is to analyse the intrusion detection capacity of
a collection of IDS datasets. In particular, this study uses machine learning techniques to analyse
open DDoS attack data, a subject scarcely explored in previous works. This evaluation comprises
the analysis of these datasets on the basis of their performance in detecting intrusion and other
anomalous behaviour. The initial part of this study presents a review of literature on the subject,
focusing on two key areas: the growing use of machine learning and artificial intelligence for
intrusion detection; and, the state-of-the-art of IDS datasets, their characteristics and
shortcomings. This is followed by a CRISP-DM approach for the evaluation of the intrusion
detection performance of four DDoS datasets. The goal of this evaluation is to assess the capacity
of each of the datasets to classify DDoS traffic correctly with different learning models.
1.2 Proposed Solution and Research Goals
Machine learning and artificial intelligence research has flourished in the last few years, opening
new doors for intrusion detection technologies. However, data availability still greatly limits the
success of such technologies, as research faces a shortage of good quality IDS datasets. This study
bases itself on this persisting issue as it assesses the state-of-the-art of open datasets and their
ability to detect intrusion and harmful network traffic. In particular, this study focuses on
providing a comparison of intrusion detection performance of open DDoS attack datasets. DDoS
attacks are some of the most concerning due to the magnitude of damage that they are capable
of. Available literature on open DDoS datasets is fairly scarce in comparison to other forms of
attacks, hence, this study seeks to shed more light on the nature of existing DDoS data. The
proposed solution sees four DDoS datasets analysed using a set of machine learning algorithms.
Figure 1 illustrates the main concepts of this solution. The primary goal of this study is to assess
DDoS datasets and analyse their performance with regards to classification of network traffic
(malicious or benign).
Accordingly, this research seeks to:
● Analyse the current state of existing intrusion detection datasets, including characteristics
and shortcomings.
● Collect and process open DDoS datasets from reliable sources and review them based on
their qualities and features.
● Select the most suitable machine learning algorithms to assess the datasets and build
appropriate training models by labelling training instances according to the type of
network traffic, malicious or benign.
● Train, validate and test each dataset against the machine learning algorithms and generate
results for each.
● Evaluate the results of the supervised learning models using a set of performance metrics.
● Analyse the intrusion detection performance of each dataset based on the achieved results.
Figure 1: Overview of proposed solution.
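The train, validate, test and evaluate steps listed above can be sketched end to end with scikit-learn, the library the study itself uses for modelling (Table 13). The sketch below is illustrative only: the synthetic data stands in for a cleaned DDoS dataset, and the split ratio and default hyperparameters are placeholder assumptions, not the configuration used in the experiments.

```python
# Sketch of the evaluation pipeline: fit each of the six classifiers on a
# labelled dataset, then report the five metrics used in this study
# (accuracy, precision, recall, F-measure, computation time).
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Synthetic stand-in for a cleaned dataset (label: 0 = benign, 1 = malicious).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "k-NN": KNeighborsClassifier(),
    "SVM": SVC(),
    "naive Bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()          # computation-time metric
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f_measure": f1_score(y_test, y_pred),
        "time_s": time.perf_counter() - start,
    }

for name, m in results.items():
    print(f"{name}: acc={m['accuracy']:.3f} f1={m['f_measure']:.3f}")
```

Comparing the per-model rows of `results` across datasets is, in miniature, the comparison the study performs; `time.perf_counter` is used here simply as one reasonable way to capture computation time.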
1.3 Research Questions
In addressing the goal of this study, two research questions have been formed. The questions are
somewhat correlated, each focusing on one of the variables in this study, that is, open DDoS data
(RQ1) and the machine learning models (RQ2).
RQ1: What is the effectiveness of different open DDoS datasets in detecting intrusion and
malicious traffic?
RQ2: How does the performance of different supervised learning models compare with regards to
classification capacity and time efficiency?
1.4 Research Contribution
The main contribution of this work lies in the selection of datasets. While previous works focused
on the analysis and evaluation of a wide range of datasets, this work focuses on a specific type of
security attack, DDoS. Although DDoS attacks feature in previous works related to assessment of
IDS datasets, these were never considered as the main focal point. Previous works, such as that
done by Ring et al. [11], Sahu et al. [8], and Yavanoglu and Aydos [12] consider datasets with
various types of attacks. On the other hand, Thomas et al. [13] and Dhanabal and Shantharajah
[14] focus on one specific dataset, in this case DARPA and NSL-KDD respectively. In contrast,
this work considers multiple datasets to provide a critical review and evaluation of each and be
able to compare the different results achieved.
Moreover, this work takes on a more quantitative approach to evaluation, as opposed to
previous works [8], [11], [15] that provided qualitative analyses of IDS datasets. This study adopts
an analytical machine learning-based approach as the primary source of evaluation. This provides
a less theoretical and more observation-based evaluation, with the main method of assessment
being the comparison of the performance metrics. Previous studies do focus on the use of such
datasets in a machine-learning environment; however, this is done qualitatively, mostly with an
analysis of the features of the dataset.
Overall, this work aims to provide suggestions on the most appropriate algorithms to use
depending on the datasets available, and this, in itself, is another contribution of this study. This
work aims to be a potential guideline for machine learning-based detection of DDoS behaviour.
1.5 Limitations
This research is limited to the analysis of DDoS datasets in the context of machine learning. This
work includes the usage of several machine learning models to assess the detection performance
by these datasets. The datasets are the primary subject of this research, and although efficiency
and operational performance of the algorithms are measured, the optimisation of the algorithms
is not the goal of this work.
1.6 Document Structure
This section introduced the main ideas and concepts that will set the groundwork for this study.
The rest of this document is organised as follows. Section 2 provides some background on DDoS,
and different types and structures of this attack. Section 3 presents a review of literature on the
subject, tackling both the use of machine learning in this domain, as well as the availability and
characteristics of current datasets. Section 4 presents the methodology for this study. Section 5
presents the different datasets used in this study, with a detailed description on each. Section 6
includes a thorough explanation of the design and implementation of the experiment. Section 7
presents the results and main findings of this study. Finally, Section 8 concludes this work with a
retrospective analysis of the contributions of this study and potential future work.
2 Background
2.1 Overview of DDoS Attacks
DDoS stands for ‘distributed denial-of-service’ and refers to a type of DoS (denial-of-service)
attack that originates from multiple sources spread across different network locations. The main
motivation of DoS attacks is to severely slow or shut down a specific resource; one way of
operating is to exploit a system flaw, causing a processing failure or exhaustion of system
resources. Another way of attacking the victim system is by flooding and monopolizing the
network, thereby prohibiting anyone else from using it [16]. The prohibition of access to the
attacked computer or network is what categorises an attack as DoS, while the use of many
computer systems or services indicates a ‘distributed’ attack, known as DDoS. It is important to
note that the attack agents can be any device or resource that supports
the ability to have the attack code installed, including IoT devices, networked computers, servers
and weaponized mobile devices [17].
A typical architecture of a DDoS attack consists of four distinct elements: the real attacker; the
agent-controlling handlers or masters; the attack agents or zombie hosts responsible for packet
generation and forwarding towards the victim; and, finally, the target host victim. A successful
deployment of a DDoS attack can be described in four stages: recruitment, compromise,
communication and
attack. During the recruitment phase, the attacker selects the agents that will be used to carry
out the attack. In the compromise stage, the attack component is planted on the agents while
striving to hide itself from detection and deactivation. In the communication stage, the masters
inform the attacker that the attacking agents are ready to be deployed and carry out the
attack. The attack phase is the last stage and describes the initiation of the attack [7].
2.2 DDoS Taxonomy
There are multiple ways of describing the taxonomy and types of DDoS attacks, as sources from
both academia and industry approach the problem differently. Mirkovic and Reiher [18] divide
the different types into classes based on specific criteria that are
both technical and social, e.g. based on the impact of the attack. Some of the criteria are degree
of automation (DA), exploited weakness to deny service (EW), source address validity (SAV),
attack rate dynamics (ARD), possibility of characterization (PC), persistence of agent set (PAS),
victim type (VT) and impact on the victim (IV) [18].
On the other hand, NIST defines only two types of DDoS attacks: flaw exploitation and
flooding. The main distinction between the two lies in the medium used to perform the attack
and the end-system being attacked. In flaw exploitation, the target is the software of the system,
and the attack attempts to deplete its resources, such as memory, CPU, disk space or memory
buffers. In flooding attacks, the target is the networking capability of the victim: the attack
depletes the network capacity, making the resource inaccessible to legitimate users.
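As a toy illustration of the flooding category, a flood of traffic from compromised sources can be spotted by a simple rate check over a traffic window. The threshold and the example addresses below are arbitrary assumptions for illustration; they are not drawn from this study or from NIST.

```python
from collections import Counter

def flag_flood_sources(packets, threshold=100):
    """Return the set of source IPs whose packet count in one observation
    window exceeds an (arbitrary, illustrative) threshold.

    packets: iterable of source-IP strings seen in the window.
    """
    counts = Counter(packets)
    return {ip for ip, n in counts.items() if n > threshold}

# Example window: one flooding source among otherwise normal traffic.
window = ["10.0.0.5"] * 500 + ["10.0.0.7"] * 20 + ["10.0.0.9"] * 3
print(flag_flood_sources(window))  # only the flooding source is flagged
```

Real flooding attacks evade such naive per-source counting (e.g. via spoofed addresses or many distributed agents), which is precisely why the learning-based detection studied in this thesis is of interest.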
Another approach is to divide attacks based on the layer being targeted. Depending
on the attack vector that a DDoS attack engages, the target can differ with regard to the
components of its network connection. The most common shared framework used to
describe and decompose a single network connection is the OSI model. The different
components that comprise a network connection are called layers, and every layer of the OSI
model has a different purpose and engages in different activities. What follows is a
conceptual breakdown of this model, its layers and how they operate. In total, there
are seven layers, commonly labelled L1-L7 and briefly described in Table 1 with regard to their
functionality and potential attack vectors for a DDoS attack.
Table 1: OSI model describing the functionality of each distinctive layer.

Application Layer (L7): The layer the end-user most commonly interacts with. Many well-known
processes and protocols operate on this layer, such as HTTP(S), DNS, (S)FTP, SSH and email.
Common processes include user authentication and privacy-related functions.

Presentation Layer (L6): Transforms data into the form that the application layer accepts. It
also formats and encrypts data to be sent across a network. It is sometimes called the syntax
layer.

Session Layer (L5): Responsible for establishing, managing and terminating connections between
applications at each end of the communication.

Transport Layer (L4): Provides reliable transfer of data between end systems, performing error
checking and resending data when it has been corrupted, while ensuring quality of service through
end-to-end error recovery and flow control.

Network Layer (L3): Responsible for addressing and routing packets between hosts, including
fragmentation and reassembly of data.

Data Link Layer (L2): Decomposes data into frames and transmits them over the medium,
physical or wireless. Performs basic error detection and correction.

Physical Layer (L1): Converts bits into electrical or radio signals, enabling multiplexing so that
multiple channels can share the same medium, physical or wireless.
Based on this layer-differentiating approach, Cloudflare, the content delivery and DDoS mitigation
services provider, divides attacks into three types: application layer, protocol and volumetric
attacks. Application layer attacks refer to any attack that operates on the 7th layer of the OSI
model. Protocol attacks exploit weaknesses in the 3rd and 4th layers and cause disruption by
attacking resources such as firewalls and load balancers. Finally, volumetric attacks are similar to
NIST's flooding attacks, exhausting the network bandwidth between the attacked system and the
legitimate user by directing massive volumes of requests from compromised agents to the attacked
system.
3 Literature Review
This chapter presents the literature review that was conducted as part of this research. This review
opens with an overview of the research subject. This is followed by a review of previous work
related to the use of machine learning for intrusion detection. This review uncovers the
methodologies, algorithms and results achieved by other researchers. Then the focus shifts to the
datasets available to conduct such experiments. A thorough review is carried out on how different
researchers analysed existing IDS datasets. This section of the literature review pinpoints the
major characteristics and issues of the data available. To conclude, the research gap is identified,
and this sets the groundwork for the rest of this research.
This review is based on peer-reviewed and reliable sources dating from 2000 to 2019.
3.1 Overview
Threats of malware, attacks and intrusion have been around since the very conception of
computing. Yet, it was not until the sudden growth of the internet that awareness of security and
digital assets really started to pick up steam. The internet presented a new liability, as the ever-
increasing number of machines on the web provided a new goldmine for those seeking to exploit
the vulnerabilities. Presently, with global internet usage estimated at 4.4 billion users,
approximately 58% of the world’s population [17, 18], the risk of intrusion has grown
exponentially. The term intrusion refers to “any unauthorized access that attempts to compromise
confidentiality, integrity and availability of information resources” [19]. In general, any form of
malicious use of the internet, computer applications or information systems is labelled as intrusion.
Attackers, or intruders, prey on the vulnerabilities of weak computer systems and networks, with
the potential to cause serious harm to users, organisations and businesses [6].
Among various types of attacks, DDoS attacks are one of the biggest threats to internet sites and
pose a great risk to the security of computer systems, particularly because of their potential
impact. With no prior warning, DDoS attacks can cause devastating damage and rip the resources
of a system apart. The harm caused by these attacks has been thoroughly described in network
security literature [7]. A DDoS attack aims to render a network inoperable by targeting its
bandwidth or connectivity. The attacker achieves this by sending a stream of packets that halt
the processing capabilities of a network [20]. The University of Minnesota is reportedly the first-
ever victim of a large-scale DDoS attack, back in August 1999. This attack shut down the
University’s network for over 2 days [21].
In such an environment, intrusion detection systems (IDS) are an essential measure for network
security and defense. Intrusion detection encompasses all methods of detecting violations and
interruptions of a system's
regular behaviour [7]. In recent years, intrusion detection has expanded to new areas, such as
artificial intelligence and machine learning. Current research heavily invests in these fields as the
new way to detect and prevent network intrusion and keep a safe and secure network for systems
and their users. The following section explores this concept in detail and presents literature that
focuses on the use of Machine Learning as the main technology for intrusion detection and
prevention.
3.2 Use of Machine Learning in Intrusion Detection
Machine learning is becoming increasingly relevant in the field of intrusion detection. This section
gives an overview of how other researchers explored and tackled the issue of intrusion detection
through the application of machine learning. The primary takeaways of this section lie in
understanding how different algorithms are applied to intrusion problems, which algorithms are
most commonly used, and what results have been achieved with such techniques and
methodologies.
Machine learning is applied to cybersecurity primarily in three areas: anomaly detection,
intrusion detection and misuse detection [22]. Sofi et al. [23] analysed how machine learning
methodologies can be used to detect and analyse modern forms of DDoS attack. The researchers
collected a new dataset, comprising 27 features and 5 classes, containing present-day forms of
attack aimed at the application and network layers. The authors applied four machine learning
algorithms, namely decision trees, naïve Bayes, support vector machines (SVM) and multi-layer
perceptron (MLP), to the gathered dataset to categorise DDoS attack forms such as SIDDOS,
HTTP-Flood, Smurf and UDP-Flood. Their research concluded that the MLP classifier attained
the highest precision [23].
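The comparison reported by Sofi et al. [23] can be sketched in a few lines of scikit-learn. The snippet below is an illustrative sketch only: the original 27-feature, 5-class dataset is not reproduced here, so synthetic data from make_classification stands in for it, and the resulting scores are illustrative rather than the figures reported in [23].

```python
# Hedged sketch of a four-classifier comparison on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in shaped like the dataset in [23]: 27 features, 5 classes.
X, y = make_classification(n_samples=2000, n_features=27, n_informative=10,
                           n_classes=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "MLP": MLPClassifier(max_iter=500, random_state=42),
}
# Fit each model and record its held-out accuracy.
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Running the same comparison on the actual dataset would only require swapping the synthetic X, y for the real feature matrix and labels.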
Similarly, Sharma et al. [24] carried out a literature review on how machine learning techniques
can be applied to detect DDoS attacks. They examined common machine learning techniques
used for DDoS detection, including decision trees, SVM, naïve Bayes, artificial neural networks
(ANN), k-means clustering, fuzzy logic and genetic algorithms. They concluded that network
attacks are extremely hazardous and that existing IDS/IPS are insufficient against the modern-day
attacks impinging on networks [24].
On the other hand, Zekri et al. [25] propose a DDoS detection architecture based on the C4.5
algorithm to lessen the threat of DDoS. The researchers selected other machine learning
techniques to validate their proposed system and compared the results obtained. They further
used a naïve Bayes classifier for anomaly detection, while Snort was applied for signature-based
detection [25].
Dewan Md. Farid et al. [26] proposed a learning algorithm that detects anomalies by
differentiating normal patterns from attacks, and also recognises diverse forms of intrusion by
means of a decision tree algorithm. Yi-Chi Wu et al. [27] took a similar approach and formulated
a DDoS-detection system centred on a decision tree. In addition to the detection of an attack, the
authors also linked the locations of the attacker through a traffic-flow pattern-matching method.
The researchers implemented a C4.5 classifier to identify DoS attacks [27]. Furthermore, Andhare
and Patil [28] designed rules by means of a genetic algorithm-based method for the detection of
DoS attacks on the system [28].
In contrast, Aamir and Zaidi [29] implemented a systematic flow of feature engineering and
machine learning for detecting DDoS attacks. The results of the analysis indicated that
considerable feature reduction is possible, making DDoS detection faster and better with only a
trifling performance hit. For their case study of DDoS datasets, the researchers found that the
k-nearest neighbour algorithm generally demonstrates the best performance, followed by
SVM techniques. When applying a random forest algorithm, datasets with fewer dimensions and
discrete features perform better than high-dimensional datasets with numerical
features.
3.3 Availability of Good Datasets
The application of machine learning in IDS demands good quality datasets. This section expands
on this notion by firstly analysing the intricacies of existing datasets. Literature is reviewed in
order to explore the different characteristics of such datasets and how this affects their validity in
a machine learning approach. Secondly, this section presents a review of previous research that
tackled the quality and validity of these datasets, with a focus on work that itself reviews and
surveys such datasets. Accordingly, this work seeks to understand how other researchers tackled
the issue of good datasets and to analyse the methodologies used to do so.
3.3.1 Issues with Current Datasets
The relevance of the results achieved with such techniques relies on the quality of the datasets
employed, as these are vital for a realistic evaluation [30]. The validity of current datasets
has been thoroughly questioned in the cybersecurity space. It is a challenge for many researchers
to find appropriate datasets to validate and test their methods [31], and having a suitable dataset
is a significant challenge in itself [32]. Privacy is a huge setback for the availability of these
datasets, as they contain sensitive information. On the rare occasions that such datasets are made
available, they are heavily anonymised or obsolete. The unavailability of such datasets and the
absence of certain statistical
characteristics remains one of the major challenges for anomaly-based intrusion detection [10],
[32].
The cybersecurity community continuously strives to tackle this problem as numerous intrusion
detection data sets have been published over the last years, such as the UNSW-NB15 data set
[33], published by the Australian Centre for Cyber Security and the CIDDS-001 dataset [34],
published by the University of Coburg. Das and Morris [22] conduct a thorough analysis on the
necessity of data for machine learning methodologies, stating that a researcher needs an in-depth
comprehension of the data set prior to undertaking any form of analysis. They went further to
explain why raw data, including NetFlow, packet capture (PCAP) and other network data, may
not be directly usable for machine learning analysis, because the data has to be processed before
being used in standard machine learning applications. Thus, to use machine learning procedures
on conventional systems, the individual will need to comprehend the data collection methodologies
and the approaches needed for pre-processing the data [22].
Nehinbe [9] expands on previous research conducted by Ghorbani et al. [35] to identify the current
issues with evaluative datasets. Some of the findings are summarised below.
Data privacy issues. The nature of the datasets brings about data privacy issues due to certain
security policies, the sensitivity of the data and the potential risk from disclosing such information.
Moreover, there are other trust factors that inhibit realistic data from being shared among industry
stakeholders and researchers. As a result, organisations often choose to not disclose the outcomes
of computer attacks. Therefore, most researchers do not use realistic data when conducting their
own studies [9].
Getting approval from the data owner. Getting access to real datasets often requires approval
from the owner of the dataset. Some data owners, such as CAIDA (the Cooperative Association
for Internet Data Analysis) [36], require users to sign Acceptable Use Policies (AUP), restricting
the duration of use and the publication of information [35]. In addition, approval from the owner
of the dataset may result in a highly
bureaucratic process, with which the researcher might not gain access in time, as approval
processes are often delayed [9].
Different research objectives. The objectives of a study and the methodologies applied are among
some of the most important factors that influence the researcher when it comes to choosing suitable
datasets to evaluate models designed to investigate intrusion detection. For instance, McHugh
[37] highlights some significant issues with the NSL-KDD dataset [38], which is meant to be an
enhanced version of pre-existing datasets (KDD ’99 [39], KDD ’98 [40]). The author discusses how
the same issues that existed with the previous versions of this dataset persisted with the newer,
supposedly enhanced version [37]. Moreover, researchers often tweak the datasets through data
processing and cleaning to “lessen the challenges in matching data with the objectives of the
studies” [41].
Problem of documentation. Most IDS datasets that are available for the perusal of researchers
lack sufficient documentation. These datasets have little to no information about the network
environment in which they are simulated, the type of intrusions simulated, the goal of the
intruders, the details of the operating systems of both attacking and victim machines, and other
significant information that might impact the study [9].
3.3.2 Related Work
Numerous studies have been carried out with the aim to analyse the relevance and quality of
available IDS datasets. Malowidzki et al. [15] review the current situation with regards to publicly
available IDS datasets and provide suggestions on certain processes that should act as base
principles for a good dataset. The authors also suggest variants for data preparation and highlight
the aspects that result in a high quality, reliable dataset [15]. Koch et al. [31] also provide an
evaluation of IDS datasets, spreading across 13 different data sources. The authors analyse these
datasets on the basis of 8 data attributes. This work also provides a detailed analysis of current
security systems and investigates their shortcomings [31].
Taking a slightly different approach, the work of Thomas et al. [13] analyses one specific dataset,
DARPA [42], and investigates its use in intrusion detection. The authors conclude that the
DARPA dataset has the potential to model attacks that commonly appear on network traffic, and
therefore it can be considered as “the baseline of any research” [13]. Similarly, the work of Dhanabal
and Shantharajah [14] investigates the application of the NSL-KDD dataset in intrusion detection.
The authors study the effectiveness of the NSL-KDD dataset in detecting network traffic
anomalies using various classification algorithms. This work uses J48, SVM and naïve Bayes
algorithms to assess the dataset and concludes that the J48 algorithm produces the best accuracy
results [14].
Sharafaldin et al. [10] present a more exhaustive analysis of IDS datasets when compared to other
dataset studies that focus more on providing a high-level overview. The authors analyse 11 IDS
datasets and compare them with respect to 11 properties. This study also presents a framework
for the creation of new IDS datasets [10]. Bhuyan et al. [43] briefly describe and compare a large
number of network anomaly detection methods and systems. In addition, the authors discuss tools
for network defenders and datasets that researchers in network anomaly detection can use [43].
Similarly, Nisioti et al. [44] discuss 12 IDS datasets and provide a critical evaluation of
unsupervised techniques for intrusion detection.
Yavanoglu and Aydos [12] compare the most common datasets for artificial intelligence and
machine learning techniques [45]. Similarly, Ring et al. [11] take on the analysis of multiple
datasets. This work identifies 15 different attributes to analyse the applicability of individual
datasets for specific evaluation scenarios. Based on these properties, the authors also provide an
overview of existing datasets [11].
3.4 Research Gap
The safeguarding of networks and computer-based applications has been subject to extensive
research throughout the years. With the explosive growth of internet usage, the need for secure
environments has become more and more critical. Intrusion detection is becoming a quintessential
measure for network security and defense. Lately, research has taken a new turn in this respect,
with a surge in artificial intelligence and machine learning research for intrusion detection.
Researchers are investing heavily in this field and consequently require good quality datasets in
order to be able to evaluate their models. The availability of these datasets has also been
thoroughly discussed in literature, with emphasis on the characteristics and shortcomings of these
datasets [9], [22], [35].
Numerous works in literature focused on the analysis and evaluation of a wide range of
datasets. Specifically, the work published by Yavanoglu and Aydos [12] and Ring et al. [11]
considers multiple datasets that are commonly used in machine learning scenarios. In both cases,
the researchers do not focus on a specific type of security attack. On the other hand, Thomas et
al. [13] carry out a narrower, more focused evaluation of the DARPA dataset. The goal of their
work was to assess the potential of the DARPA dataset for intrusion detection. A similar
approach was taken by Dhanabal and Shantharajah [14]. The authors focused on analysing the
applicability of the NSL-KDD dataset in intrusion detection models. The dataset is assessed
against three prominent machine learning algorithms, with best results being achieved with J48.
Most of the previous work takes a rather qualitative approach to assessing IDS datasets,
with some exceptions [14]. A significant portion of the literature in this respect focuses on
assessing the quality of the data from a descriptive standpoint, analysing various criteria, most
of which are in line with the work of Nehinbe [9].
This study, although building on previous works, analyses and evaluates some new
concepts in this field. Firstly, the primary focus of this work is DDoS attacks. Previous research
on IDS datasets seldom focuses on datasets for one specific type of security attack. Although there
were many instances where DDoS attacks were featured in IDS dataset research, these were never
considered the main focal point. Moreover, this work departs from that of Dhanabal and
Shantharajah [14] by taking a more analytical approach to evaluation and using
multiple machine learning algorithms to assess the datasets. Although similar methodology is used,
there are two aspects that distinguish this work from that done by Dhanabal and Shantharajah
[14]. Firstly, multiple datasets are used, as opposed to analysing just one. And secondly, there is
a specific security attack under evaluation.
Overall, this work aims to provide suggestions on the most appropriate algorithms to use
depending on the datasets available, and this in itself, is another contribution of this study. This
work aims to be a potential guideline for machine learning-based detection of DDoS behaviour.
4 Research Methodology
4.1 Overview
Given the nature of the study, CRISP-DM (CRoss Industry Standard Process for Data Mining)
is chosen as the basis for the research methodology. CRISP-DM is widely used in projects that
involve machine learning and data analytics. It is both technology and industry agnostic,
and it defines a systematic way to carry out data mining projects. This framework aims to reduce
the cost of large-scale data projects, while increasing maintainability and efficiency of such
projects. CRISP-DM is a hierarchical model, involving four levels of abstraction: phases, generic
tasks, specialised tasks, and process instances [46]. Figure 2 illustrates this hierarchy.
Figure 2: The four tiers of CRISP-DM. Reproduced from [46].
At the highest tier, the data mining process is split into a number of phases, where each phase
comprises a set of second tier generic tasks. The second tier is a generalised representation of all
the possible solutions to a given data mining problem, where the tasks should be complete and
stable. That is, tasks should be complete in nature to cover the entirety of the data mining process,
and stable enough to account for any unforeseen developments in the process. The third tier puts
the general tasks under the microscope for a more granular view and divides them in specific tasks
that outline the actions for specific scenarios. The fourth tier represents all the actions and
outcomes of a particular data mining project. All process instances in the fourth tier are defined
according to tasks in higher tiers, however, these represent actual events, rather than generalised
ones [46].
4.2 Life Cycle
The hierarchical reference model of the CRISP-DM framework presents the lifecycle of a data
mining project, containing phases, tasks and results. The lifecycle of such projects consists of six
phases, as presented in Figure 3. The data mining process is not a rigid one. The arrows represent
the most common dependencies between phases, however, the sequence in which the phases are
carried out is entirely dependent on the nature of the project and the outcome of each phase [46].
Figure 3: The six-phase life cycle of a data mining project. Reproduced from [46].
Below is a brief explanation of each step of the data mining life cycle [46].
Business Understanding. A data mining project starts with a discovery phase, where the focus is
to define and understand the business problem, requirements and objectives. These are then
translated to a data mining problem and a plan to satisfy the requirements and objectives.
Data Understanding. The second understanding phase involves data collection, followed by a set
of activities and tasks to understand the nature of the data. The aim of these activities is to
familiarise oneself with the data, discover preliminary insights, identify valuable subsets, and
uncover any data quality issues. This phase is closely tied with the business understanding phase,
as the formulation of a plan requires good understanding of the data in question.
Data Preparation. The third phase consists of various tasks that focus on converting the raw
data collection into a final dataset. The nature and order of tasks may vary, and some tasks may
even be performed multiple times, depending on the state of the raw data. Some of the tasks
include data cleaning, feature selection and data transformation.
Modelling. In the fourth phase, the appropriate modelling techniques are chosen and applied to
the data. Typically, the parameters of these models are calibrated to achieve optimal performance.
This phase is closely tied with data preparation, as modelling may uncover new issues with the
data. In addition, the way the data is prepared can lead to the use of different models.
Evaluation. In the evaluation phase, the models applied in the previous phase are thoroughly
evaluated and reviewed. During this phase, the tasks carried out are assessed against the planned
objectives, to ensure that all business requirements have been considered and met. Moreover, the
models are tested for generalisation against unseen data. At the end of the evaluation phase, there
needs to be a clear understanding of how the data mining results should be applied.
Deployment. The resulting knowledge is organised and presented to the end-user. The tasks of the
deployment phase highly depend on the data mining project. The outcome can range from a simple
report of results, to a more complex implementation of a continuous data mining process.
4.3 Implementing CRISP-DM
This section presents an overview of how the CRISP-DM methodology is applied in this study.
The data mining process is described in further detail in Section 6.
Business Understanding. A thorough review of literature is carried out to analyse three key areas:
the use of machine learning in the context of intrusion detection; the state-of-the art of IDS
datasets, their characteristics and shortcomings; and, a review of recent works on the validity of
existing IDS datasets. This lays the groundwork for the problem definition and objective of this
study, that is, to analyse the intrusion detection performance of DDoS datasets.
Data Understanding. A total of four datasets are collected for this study. An in-depth review of
each of these datasets is presented in Section 5. Each of these are analysed to gain familiarity
with the feature set and assess the quality of the data. This analysis is crucial to determine whether
the data is a right fit for the objectives of this study.
Data Preparation. The four datasets are prepared for modelling in a systematic manner. This
phase involves several tasks, including handling missing data, decoding undefined data,
transforming data types as required by the models, and transforming class labels to generate
homogeneous labels across all datasets. The datasets are split into three subsets for training,
validation and testing of the models.
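The preparation steps listed above can be sketched as follows. The column names, label values and split ratios in this snippet are hypothetical stand-ins, since the actual feature sets differ between the four datasets.

```python
# Illustrative sketch of the data preparation phase on a toy frame;
# the feature names and labels here are hypothetical stand-ins.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Flow Duration": [120.0, np.nan, 310.0, 95.0, 40.0, 210.0],
    "Total Fwd Packets": ["4", "7", "2", "9", "3", "5"],  # stored as strings
    "Label": ["BENIGN", "DDoS-SYN", "DDoS-UDP", "BENIGN", "DDoS-SYN", "BENIGN"],
})

# 1. Handle missing data (here: drop rows with missing feature values).
df = df.dropna()

# 2. Transform data types as required by the models.
df["Total Fwd Packets"] = df["Total Fwd Packets"].astype(int)

# 3. Homogenise class labels across datasets (binary benign/attack).
df["Label"] = np.where(df["Label"] == "BENIGN", "Benign", "DDoS")

# 4. Split into training, validation and test subsets (e.g. 60/20/20).
train, rest = train_test_split(df, test_size=0.4, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)
print(len(train), len(val), len(test))
```

The same sequence of steps applies to each dataset, with only the column-specific handling changing.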
Modelling. Six different supervised learning models are selected to analyse the datasets. The
selection of the models is based on specific criteria, namely: having parametric and nonparametric
models; using algorithms from different categories; and, applying models that are also commonly
used in previous works and literature. These criteria are further explained in Section 6.3.1. The
models are trained with each of the datasets.
Evaluation and deployment. The models are validated to ensure that they are generalised for
unseen data. The models are then tested using new data from the testing set and performance
metrics are generated, including rate of accuracy, precision, recall and F-measure. These
performance evaluation metrics are explained in Section 7.1. Models are also evaluated on training
efficiency, that is, the time taken for the model to train. The outcome of this phase is an analysis
of the intrusion detection performance of each dataset.
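As a minimal sketch of this evaluation step, the snippet below trains a single classifier on synthetic binary data and reports the four metrics named above together with the wall-clock training time; the data and model choice are placeholders, not the actual experimental setup.

```python
# Hedged sketch of the evaluation phase: fit time plus the four metrics.
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Synthetic binary data stands in for a benign/DDoS labelled dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0)
start = time.perf_counter()
model.fit(X_train, y_train)          # training efficiency: wall-clock fit time
train_time = time.perf_counter() - start

y_pred = model.predict(X_test)
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
print(metrics, f"fit time: {train_time:.4f}s")
```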
5 DDoS Datasets Review
For this experiment, a total of four datasets were collected and tested, namely CICDDoS2019 [47],
CSE-CICIDS2018 on AWS [48], NDSec-1 [49] and CICIDS2017 [50]. All datasets are based on
simulated data and are dated between 2017 and 2019. Selecting datasets for this study was in
itself a challenge due to the shortage of DDoS-specific datasets, despite DDoS being one of the
most devastating security attacks. Moreover, all the chosen datasets are recently dated, ensuring
that all instances and features are relevant and up to date.
5.1 CICDDoS2019
CICDDoS2019 contains benign traffic and recent DDoS attacks, resembling real-world data
(PCAPs). It also includes the results of network traffic analysis using CICFlowMeter-V3, a
network traffic flow generator and analyser [51], with labelled flows. The B-Profile system [47]
was used to profile the abstract behaviour of human interactions and generate naturalistic benign
background traffic. For this dataset, the abstract behaviour of 25 users was constructed based on
the HTTP, HTTPS, FTP, SSH and email protocols [47]. The dataset includes different modern
reflective DDoS attacks such as Port Map, NetBIOS, LDAP, MSSQL, UDP, UDP-Lag, SYN,
NTP, DNS and SNMP. The capturing period for the training day, January 12th, ran from 10:30
to 17:15, and for the testing day, March 11th, from 09:40 to 17:35. Attacks were executed during
these periods.
Table 2: OS specification and machine IPs for CICDDoS2019. Adapted from the DDoS Evaluation Set [47].

Machine            OS                          IP address
Server             Ubuntu 16.04 (web server)   192.168.50.1 (first day), 192.168.50.4 (second day)
Firewall           Fortinet                    205.174.165.81
PCs (first day)    Win 7                       192.168.50.8
                   Win Vista                   192.168.50.5
                   Win 8.1                     192.168.50.6
                   Win 10                      192.168.50.7
PCs (second day)   Win 7                       192.168.50.9
                   Win Vista                   192.168.50.6
                   Win 8.1                     192.168.50.7
                   Win 10                      192.168.50.8
Refer to Appendix A for a timed breakdown of the attacks.
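As an aside on working with such labelled flow exports, the sketch below loads a few rows shaped like CICFlowMeter output and computes the share of attack flows; the column names and label strings are assumptions for illustration, since the exact CSV layout varies between releases of the dataset.

```python
import io
import pandas as pd

# A tiny stand-in for a CICFlowMeter CSV export; real files hold many
# more columns (around 80 flow features) and far more rows.
csv = io.StringIO(
    "Flow Duration,Protocol,Label\n"
    "1200,6,BENIGN\n"
    "80,17,DrDoS_NTP\n"
    "95,17,DrDoS_DNS\n"
)
flows = pd.read_csv(csv)

# Class balance is the first thing to inspect in a labelled flow file.
print(flows["Label"].value_counts())
attack_share = (flows["Label"] != "BENIGN").mean()
print(f"attack share: {attack_share:.2f}")
```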
5.2 CSE-CIC-IDS2018 on AWS
In the CSE-CIC-IDS2018 dataset, profiles were used to generate data in a systematic manner,
which contained detailed descriptions of intrusions and abstract distribution models for
applications, protocols, or lower level network entities. These profiles can be used by agents or
human operators to generate events on the network. Due to the abstract nature of the generated
profiles, they are applicable to a diverse range of network protocols with different topologies [48].
Profiles can be used together to generate a dataset for specific needs. Two distinct classes of
profiles were built:
B-profiles: Encapsulate the entity behaviours of users using various machine learning and
statistical analysis techniques. The encapsulated features are distributions of packet sizes of a
protocol, number of packets per flow, certain patterns in the payload, size of payload, and request
time distribution of a protocol. The following protocols were simulated: HTTPS, HTTP, SMTP,
POP3, IMAP, SSH, and FTP.
M-Profiles: Attempt to describe an attack scenario in an unambiguous manner. In the simplest
case, humans can interpret these profiles and subsequently carry them out. Ideally,
autonomous agents along with compilers would be employed to interpret and execute these
scenarios.
The datasets comprise various types of attacks, including DoS, Infiltration, DDoS and Brute force.
For the purpose of this study, only DDoS attacks are considered, as described in Table 3.
Table 3: Specification of tools and duration of the DDoS attack for CSE-CIC-IDS2018 on AWS [48].

Tools:    Low Orbit Ion Cannon (LOIC) for UDP, TCP or HTTP requests
Duration: Two days
Attacker: Kali Linux
Victim:   Windows Vista, 7, 8.1, 10 (32-bit) and 10 (64-bit)
5.3 NDSec-1 (Botnet)
The NDSec-1 dataset incorporates traces and log files of cyber-attacks synthesized within the
facilities of the Network and Data Security Group at the University of Applied Sciences in Fulda,
Germany. The need for such a dataset came about as a result of the absence of publicly available
captures containing a broad range of different attack footprints, either to benchmark existing
intrusion detection systems or to support network security research in designing new detection
engines [34]. Using state-of-the-art tools, three distinct attack scenarios were performed, namely
Watering Hole, Bring-Your-Own-Device (BYOD) and Botnet. This study considers the Botnet
attack scenario.
The rental of botnets operated by cyber crews is a lucrative business in the underground economy.
Hence, these illicit infrastructures are increasingly gaining popularity. This trend matters for
enterprises and organizations because essentially any host on a legitimate network may serve as a
bot, and thus has the potential to become part of a criminal act once infected. Citadel 1.3.5.1 was employed in this
scenario, as a revised version of the well-known Zeus botnet. Based on a normal operating network,
three legitimate hosts were infected with Citadel binaries. This task could be performed through
conventional email spam using the recent vulnerabilities CVE-2015-2509 (Windows Media
Center), CVE-2015-5122 (Flash Player), and a rogue download caused by XSS placed on a website
in the simulated Internet [49].
After the infection, all three bots communicated via HTTP to a prepared bot master. Among
several traffic footprints between master and bots, all bots were instructed to download new
commands. These contained hostile payload to perform a DDoS via SYN flooding to a single
destination outside the network. Besides this successful attack, two of the bots stole local
configuration files and transferred them to an external FTP server [49].
5.4 CICIDS2017
The CICIDS2017 dataset was the product of a five-day simulation, starting on a Monday
and finishing on Friday, and includes network traffic in two formats, packet and bidirectional
flow. For each flow, 80 attributes were extracted, which further include extra metadata about
the multiple simulated attacker IP addresses and the attacks. Scripts were used to simulate
default user behaviour within normal bounds. The first day is considered normal and the
traffic included is benign only. The simulated attacks include DDoS data; the attack scenario
is described in Table 4 [50].
Table 4: Details of DDoS Attack for CICIDS2017.
Attack scenario: DDoS (LOIT)
Victim: Ubuntu 16, 205.174.165.68
Attacker IPs: 205.174.165.69, 205.174.165.70, 205.174.165.71
6 Experiment Implementation and Design
This section outlines the details of the design and implementation of the proposed solution. The
solution is implemented in Python 3. Firstly, an overview of the solution is presented, briefly
describing the phases of this implementation. Section 6.2 describes the data preparation process,
including details on data cleaning and transformation, and dataset splitting. Section 6.3 presents
the modelling process, with a detailed account of the training, validation and testing processes.
Section 6.4 concludes with an overview of the evaluation procedure, including a summary of the
performance metrics used to analyse the intrusion detection performance of the DDoS datasets.
6.1 Overview
Figure 4 presents a flow chart of the supervised learning process adopted in this study, as part of
the proposed solution. Firstly, DDoS raw data is collected from open sources. A total of four
datasets are collected. These are described in detail in Section 5. After collection, data is processed
to construct the final datasets for modelling. Data processing includes data cleaning,
transformation of data types, and dataset splitting. The splitting process is described in 6.2.2.
This is followed by the model selection process. Models are selected based on the criteria
highlighted in Section 6.3.1. The models are trained with all four datasets using six different
algorithms; k-nearest neighbour, SVM, naïve Bayes, decision tree, random forest, and logistic
regression. The model is validated using k-fold cross validation and retrained. Finally, the model is tested
with unseen data. The results are evaluated using several performance metrics, as described in
Section 6.4.
Figure 4: Workflow of supervised learning process
6.2 Data Preparation
6.2.1 Data Cleaning and Transformation
Missing data. Handling missing data is vital in machine learning, as it could lead to incorrect
predictions for any model. Accordingly, null values are eliminated by propagating the last valid
observation forward along the column axis. This is implemented using the fillna method from
the pandas library [52], as shown below.
data.fillna(method='ffill', inplace=True)
Undefined Data. The elimination of null values by forward propagation can still leave undefined
data. A null field with no preceding valid observation remains NaN after propagation, since there
is no earlier cell to provide a value. Consequently, these values are decoded into 0. This is also done using the fillna method [52].
data=data.fillna(0)
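Taken together, the two cleaning steps can be sketched on a toy frame. The column names below are hypothetical, purely for illustration; ffill() is the method-call equivalent of the fillna(method='ffill') call shown above.

```python
import numpy as np
import pandas as pd

# Toy flow records with missing values (hypothetical columns for illustration).
data = pd.DataFrame({
    "flow_duration": [np.nan, 12.0, np.nan, 7.5],
    "packet_count":  [3.0, np.nan, np.nan, 9.0],
})

# Step 1: propagate the last valid observation forward down each column
# (equivalent to data.fillna(method='ffill', inplace=True)).
data = data.ffill()

# Step 2: cells with no earlier value to propagate are still NaN; decode them into 0.
data = data.fillna(0)

print(data.values.tolist())  # [[0.0, 3.0], [12.0, 3.0], [12.0, 3.0], [7.5, 9.0]]
```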
Transformation. The format of the collected data might not be suitable for modelling. In such
cases, data and data types need to be transformed so that the data can then be fed into the
models, as described by the CRISP-DM method. Accordingly, some data features were
transformed into numeric or float, since models do not perform well with strings, or do not perform
at all.
Class Labels. Each dataset instance represents a snapshot of the network traffic at a given point
in time. These instances are labelled according to the nature of the traffic, that is, whether the
traffic is benign or malicious. The labels across the four datasets vary, therefore they are encoded
to have homogeneity in the class labelling system. Classification is binary, where benign traffic is
labelled as NORMAL, and malicious traffic is labelled as ATTACK. Table 5 summarises the
classification system.
Table 5: Labelling system for binary classification.
Label Scenario
NORMAL Traffic is benign
ATTACK Traffic is malicious
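A minimal sketch of this encoding, assuming hypothetical raw label strings (the actual strings vary per dataset): only benign traffic is mapped to NORMAL, everything else to ATTACK.

```python
import pandas as pd

# Hypothetical raw labels as they might appear in the source files.
raw = pd.Series(["BENIGN", "DDoS", "Benign", "DrDoS_DNS"])

# Binary encoding per Table 5: benign -> NORMAL, malicious -> ATTACK.
labels = raw.apply(lambda s: "NORMAL" if s.upper() == "BENIGN" else "ATTACK")
print(labels.tolist())  # ['NORMAL', 'ATTACK', 'NORMAL', 'ATTACK']
```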
6.2.2 Volume and Class Distribution
Following the thorough preparation of the data, some descriptive information is generated for each
set, specifically: (1) the volume of records in each set; and (2) the distribution of classes. Table 6
presents the volume of records for each dataset that is used in this study. Further, the sections
below give an account of the class distribution, including amount and percentage.
Table 6: Volume of records for the DDoS attack datasets.
Dataset No. of Records
CICDDoS2019 294,627
CSE-CIC-IDS2018 1,046,845
CICIDS2017 225,745
NDSec-1 5,838
6.2.2.1 CICDDoS2019
In the CICDDoS2019 dataset, there were 121,980 (41.4%) records classified as normal traffic and
172,647 (58.6%) classified as attack traffic.
Figure 5: Bar chart showing the distribution of traffic types in the CICDDoS2019 dataset.
6.2.2.2 CSE-CIC-IDS2018 on AWS
In the CSE-CIC-IDS2018 dataset, there were 360,833 (34.5%) records classified as normal traffic
and 686,012 (65.5%) classified as attack traffic.
Figure 6: Bar chart showing the distribution of traffic types in the CSE-CIC-IDS2018 dataset.
6.2.2.3 CICIDS2017
In the CICIDS2017 dataset, there were 97,718 (43.3%) records classified as normal traffic and
128,027 (56.7%) classified as attack traffic.
Figure 7: Bar chart showing the distribution of traffic types in the CICIDS2017 dataset.
6.2.2.4 NDSec-1
For this study, a subset of the NDSec-1 dataset is considered, that is, DDoS Botnet attack data.
The other instances of attack (BYOD and Watering Hole) are not taken into account for the
evaluation. In this subset, there were 3,508 (60.1%) records classified as normal traffic and 2,330
(39.9%) classified as attack traffic.
Figure 8: Bar chart showing the distribution of traffic types in the NDSec-1 dataset.
6.2.2 Splitting Datasets
A key characteristic of a good learning model is its ability to generalise to new, or unseen, data.
A model which is too close to a particular set of data is described as overfit, and therefore, will
not perform well with unseen data. A generalised model requires exposure to multiple variations
of input samples. Primarily, models require two sets of data, one to train and another to test. The
training data is the set of instances that the model trains on, while the testing data is used to
evaluate the generalisability of the model, that is, the performance of the model with unseen data.
The train/test split can yield good results; however, this approach has some drawbacks. Although
splitting is random, it can happen that the split creates imbalance between the training and the
testing set, where the training set has a large number of instances from only one class. In such
cases, the model fails to generalise and overfits.
To mitigate this, the datasets are split into three subsets; training, validation and testing.
This split is done in a 60:20:20 ratio, for training, validation and testing respectively. The
train_test_split helper method from the scikit-learn library [53] is used for the split, as
presented in the code snippet below. With this approach, training is done in two phases, with the
training and the validation sets. Firstly, the training set is used to train the model. Then, the
validation set is used to estimate the performance of the model on unseen data (data that the
model is not trained on). For the purpose of this study, validation is done using a stratified k-fold
approach. The k-fold validation method is described in Section 6.3.3.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=100)
X_val, X_test, y_val, y_test = train_test_split(
    X_test, y_test, test_size=0.5, random_state=100)
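Applied to a toy array of 100 rows, the two train_test_split calls above yield the 60:20:20 proportions described earlier (the feature values here are arbitrary, for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # 100 arbitrary samples
y = np.array([0, 1] * 50)

# First split: 60% train, 40% held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=100)
# Second split: halve the held-out 40% into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_test, y_test, test_size=0.5, random_state=100)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```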
6.3 Modelling
The classification phase comprises two aspects: (1) the construction of the learning model, and
(2) the generation of the predicted labels. These tasks are implemented using scikit-learn, a Python
library for data mining, data analysis and machine learning.
6.3.1 Models Selection
This study features the testing and training of six different classification methods; namely, k-
nearest neighbour, SVM, naïve Bayes, decision tree, random forest and logistic regression. These
algorithms were selected on the basis of three criteria: (1) to have a mix of parametric and nonparametric
algorithms; (2) to have a range of algorithms from different categories; and (3) to use algorithms
which commonly feature in previous works.
6.3.1.1 Parametric vs Nonparametric algorithms
A parameter can be loosely described as a pre-defined attribute of the data. A parametric
algorithm possesses a fixed number of parameters. While a parametric algorithm is
computationally more efficient, it makes stronger assumptions about the dataset. This would be
ideal if the assumptions are correct. However, parametric algorithms perform poorly with incorrect
assumptions [54]. In this study, the parametric algorithms used are SVM, naïve Bayes and logistic
regression.
In contrast, non-parametric algorithms are more flexible. In nonparametric scenarios, as
the algorithm learns, the number of parameters grows. This type of algorithm performs slower
computations; however, it makes far fewer assumptions about the dataset [54]. The nonparametric
algorithms used in this study are k-nearest neighbour, decision tree and random forest.
6.3.1.2 Categories of Algorithms
This section highlights the different types of algorithms used in this study.
Instance-based. Instance-based learning methods are “conceptually straightforward approaches to
approximating real-valued or discrete-valued target functions” [52, p. 230]. These algorithms learn
by storing the training data that they are presented with. When a new instance is presented, this
is compared to previous instances and classified according to similarity [54]. The k-nearest
neighbour used in this study is an instance-based algorithm.
Kernel method. Kernel methods are based on kernel functions. Given the right conditions of
symmetry, kernel functions essentially define an instance in a high-dimensional space. Using the
kernel method, the original instance is replaced with a kernel to extend algorithms such as SVM
[55], which is the model used in this study.
Bayesian. Bayesian reasoning assumes that “quantities of interest are governed by probability
distributions” [52] and that accurate decisions can be made when adopting these probabilities on
new data. Every instance in a training set can decrease or increase the likelihood that a hypothesis
is correct [54]. This study adopts the naïve Bayes algorithm.
Decision Tree. In decision tree learning discrete values are used to represent target functions which
are themselves represented with a decision tree. It is one of the most popular learning methods
used in inductive inference and has been applied to multiple real-case scenarios ranging from
medical to credit risk [54].
Ensemble Methods. In ensemble methods, several base models are combined with the purpose of
producing one optimal predictive model. Within this machine learning technique, multiple models
are created and then combined to improve the results. It is commonly understood that ensemble
methods produce more accurate solutions relative to the results a single model would
produce [56]. In this study, the ensemble method used is the random forest algorithm.
Regression. In regression-based approaches, data are used to predict, as closely as possible, the
accurate and actual labels of points that are under consideration. Regression-based approaches
are highly common in machine learning with a multitude of applications [55]. This study uses
logistic regression.
6.3.2 Models Used in this Study
6.3.2.1 K-Nearest Neighbour (k-NN)
The k-nearest neighbour is an instance-based classifier. When the k-NN is used, instances within
a dataset are contained in a dimensional space, where a new instance is labelled based on its
similarity with other instances, as shown in Figure 9. These instances are referred to as neighbours.
A new instance is assigned the class that is most common among its neighbouring observations [54].
A distance function is applied to determine the similarity between instances. For the purpose of
this study, the distance function employed is Euclidean. The Euclidean function is a relatively
common method as it reflects the human perception of distance.
Figure 9: An example of a k-NN classification. When k=3, the new instance is labelled as 0. However, when the parameter is increased to k=5, the same instance is labelled as 1. Adapted from [55].
Table 7 describes classification with the k-NN algorithm.
Table 7: Pseudocode for the k-NN Algorithm [54].
Algorithm 1 k-Nearest Neighbour
start
Let S = {a1, a2, …, an}, where S represents the training set and a represents article documents
k ← the desired number of nearest neighbours
Compute the distance d(i, a) between new instance i and all a ∈ S
Select the k closest training samples to i
Classi ← best voted class
end
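The effect of the pseudocode above can be reproduced with the scikit-learn classifier used in this study; the data points below are synthetic and purely illustrative (Euclidean distance is scikit-learn's default metric):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two synthetic clusters, one per class (illustrative data only).
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

# k = 3 nearest neighbours, Euclidean distance (the default).
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
pred = knn.predict([[0.5, 0.5], [5.5, 5.5]])
print(pred)  # [0 1]
```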
6.3.2.2 Support Vector Machine (SVM)
The learning process in SVM is carried out in two steps: firstly, the inputs are plotted in an n-
dimensional space, where n is based on the number of attributes; the coordinates of individual
attributes are referred to as support vectors. Secondly, a hyperplane separates the instances. A
hyperplane is a line that linearly separates a set of data points into two distinct classes. The SVM
selects the hyperplane which best splits the data set.
When dealing with the mapping of complex nonlinear functions, computation issues are highly
probable. In fact, the larger the dimensional space, the bigger the separation problem [55]. Kernel
tricks are used to mitigate this problem. A kernel is able to transform extremely complex functions
into infinitely higher dimensional spaces, then uses predefined labels to split the inputs [54].
Figure 10: Illustration of SVM classification.
Table 8 describes classification with the SVM algorithm.
Table 8: Pseudocode for the SVM Algorithm [54].
Algorithm 2 Support Vector Machine
start
∀ document ∈ training set S:
Create SVM classification objects
Objects → Higher Dimensional Space
Apply a kernel trick to transform f(x) into a linear separable one
A hyperplane is computed ⇒ binary classification
end
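A small sketch of the kernel trick in practice: concentric circles are not linearly separable in the original space, but an RBF-kernel SVM separates them after the implicit mapping. The data is synthetic, and the kernel choice here is for illustration only; the study itself uses a linear SVM (LinearSVC, see Table 13).

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate the classes.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel implicitly maps points into a higher-dimensional space
# in which a separating hyperplane exists.
clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))
```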
6.3.2.3 Naïve Bayes
The naïve Bayes classifier is built on Bayes’ Theorem, where event independence is assumed. In
statistics, two events are said to be independent if the likelihood of one does not impact the other
[54]. Table 9 presents the algorithm of the Bayesian classifier to calculate probability. Let P(B|A)
be the conditional probability of B given A, let P(B) be the probability of B, and let P(A) be the
probability of A. Then P(A|B), the probability of A given B, is formally presented as:
P(A|B) = P(B|A) P(A) / P(B)
Equation 1: Bayes’ Theorem
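A quick numeric check of Equation 1 with invented probabilities, chosen only for illustration: let A be "the traffic is an attack" and B be "the detector raises an alert".

```python
# Hypothetical probabilities, chosen only to illustrate Equation 1.
p_a = 0.01          # P(A): prior probability of an attack
p_b_given_a = 0.90  # P(B|A): probability of an alert given an attack
p_b = 0.05          # P(B): overall probability of an alert

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 2))  # 0.18
```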
Table 9: Pseudo code for the naïve Bayes algorithm [54].
Algorithm 3 Naïve Bayes
start
Let S = {a1, a2, …, an}, where S = training set and a = articles:
Calculate the probability of the classes P(C)
Calculate likelihood of attribute A for each class P(A|C)
Calculate the conditional probability P(C|A)
Assign the class with the highest probability
end
6.3.2.4 Decision Tree
Decision tree classification starts at the root node and classifies observations on the basis of the
values of the respective attributes. Every node represents a single feature, while the branches
represent the values that the node can assume [54].
Starting from the root node, the algorithm works its way down by iteratively computing the
information gain for each feature in the training set. Information gain is used to determine the
level of discrimination imposed by the features towards the target classes. The higher the
information gain, the higher the importance of the attribute in the classification of each
observation [54], [55]. The root node is replaced by the attribute that possesses the highest
information gain, and the algorithm continues splitting the data set by the selected feature to
produce subsets. Table 10 gives an overview of this procedure.
Table 10: Pseudo code for the decision tree algorithm [54].
Algorithm 4 Decision Tree
start
∀ attributes a1, a2, …, an
Find the attribute that best divides the training data using information gain
a_best ← the attribute with highest information gain
Create a decision node that splits on a_best
Recurse on the sub-lists obtained by splitting on a_best and add those nodes as children of node
end
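Information gain can be sketched directly from its definition: the Shannon entropy of the parent set minus the weighted entropy of the subsets produced by the split. The labels below are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy reduction achieved by splitting `parent` into `subsets`."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

# A split that perfectly separates the classes recovers the parent's full
# entropy (1 bit for a balanced binary class distribution).
parent = ["ATTACK"] * 4 + ["NORMAL"] * 4
gain = information_gain(parent, [["ATTACK"] * 4, ["NORMAL"] * 4])
print(gain)  # 1.0
```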
6.3.2.5 Random Forest
The random forest algorithm is an ensemble algorithm that uses a large number of decision trees
for classification. Individual trees are built using the algorithm presented in Table 11. As
previously noted, ensemble algorithms provide higher accuracy due to the combination of multiple
models.
Table 11: Pseudo code for the random forest algorithm [54].
Algorithm 5 Random Forest
Require IDT (a decision tree inducer), T (the number of iterations), S (the training set), µ (the subsample size), N (the number of attributes used in each node)
start
t ← 1
repeat
St ← Sample µ instances from S with replacement.
Build classifier Mt using IDT(N) on St
t++
until t > T
end
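Algorithm 5 can be sketched by hand: draw T bootstrap samples, fit one decision tree per sample, and take a majority vote at prediction time. The one-dimensional data below is a toy example, purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy one-dimensional data: class 0 on the left, class 1 on the right.
X = np.array([[0], [1], [2], [3], [8], [9], [10], [11]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# T iterations: sample mu instances with replacement, build a tree on each.
T, mu = 25, len(X)
trees = []
for _ in range(T):
    idx = rng.integers(0, len(X), size=mu)   # St: sample with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Majority vote over all trees for two new points.
votes = np.array([t.predict([[2.5], [9.5]]) for t in trees])
preds = (votes.mean(axis=0) >= 0.5).astype(int)
print(preds)  # [0 1]
```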
6.3.2.6 Logistic Regression
Logistic regression is a type of predictive analysis and is best suited for analysing scenarios where
the dependent variable is binary. Logistic regression describes the data and explains the
relationship between one dependent binary variable and other non-binary independent variables
[55]. Table 12 presents the algorithm for the logistic regression classifier.
Table 12: Pseudo code for the logistic regression algorithm [54].
Algorithm 6 Logistic Regression
given α, {(xi, yi)}
initialize a = ⟨1, …, 1⟩ᵀ
perform feature scaling on the examples’ attributes
repeat until convergence:
    for each j = 0, …, n:
        a′j ← aj + α Σi (yi − ha(xi)) xij
    for each j = 0, …, n:
        aj ← a′j
output a
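The update rule in Algorithm 6 can be sketched in a few lines of NumPy. This is a toy implementation under simplifying assumptions: one-dimensional separable data, a fixed iteration count in place of a convergence test, and all names illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def logistic_regression(X, y, alpha=0.1, iters=5000):
    """Batch gradient ascent mirroring Algorithm 6; X carries a leading
    column of ones so a[0] acts as the intercept."""
    a = np.ones(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ a)                # h_a(x_i) for every example
        a = a + alpha * (X.T @ (y - h))   # a_j <- a_j + alpha * sum_i (y_i - h) x_ij
    return a

# Toy separable data: x <= 1 is class 0, x >= 3 is class 1.
X = np.array([[1, 0.0], [1, 1.0], [1, 3.0], [1, 4.0]])
y = np.array([0, 0, 1, 1])
a = logistic_regression(X, y)
preds = (sigmoid(X @ a) >= 0.5).astype(int)
print(preds)  # [0 0 1 1]
```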
6.3.2 Training
During the training process, the selected algorithms are provided with training data to learn from
to eventually create machine learning models. Accordingly, the training set is used, as specified in
Section 6.2.2. At this point in the process, the input data source needs to be provided and should
contain the target attribute (class label). The training process involves finding patterns in the
training set that map the input features with the target attribute. Based on the observed patterns,
a model is produced.
In this study, four DDoS datasets are being used as the input data source, where the target
attribute is the type of network traffic, that is, attack or normal. Six algorithms are trained with
each of the four sets. Training is conducted using several methods from the scikit-learn libraries.
Table 13 provides a breakdown of the methods used for each algorithm. Appendix B contains the
source code for the models that were built to analyse the intrusion detection capacity of each
dataset.
Table 13: Methods and classifiers from the scikit-learn Python library [57] used for building models.
Model Scikit-learn Methods & Classifiers
k-NN sklearn.neighbors.KNeighborsClassifier [58]
SVM sklearn.svm.LinearSVC [59]
Naïve Bayes sklearn.naive_bayes.GaussianNB [60]
Decision Tree sklearn.tree.DecisionTreeClassifier [61]
Random Forest sklearn.ensemble.RandomForestClassifier [62]
Logistic Regression sklearn.linear_model.LogisticRegression [63]
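The six classifiers in Table 13 can be instantiated together as a dictionary; the default hyperparameters shown below are an assumption for illustration, as the thesis does not list the exact settings used for each model.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# One entry per model in Table 13; hyperparameters left at their defaults.
models = {
    "k-NN": KNeighborsClassifier(),
    "SVM": LinearSVC(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Logistic Regression": LogisticRegression(),
}

# Each model is then trained the same way:
# for name, clf in models.items():
#     clf.fit(X_train, y_train)
print(len(models))  # 6
```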
6.3.3 Validation
Following the training process, the model is validated using k-fold cross validation. Cross
validation is applied to assess the generalisability of a model. This method aims to reduce the
errors of overfitting that occur when a model fits a range of data instances too closely. Cross
validation is done in iterations, and each iteration involves splitting the dataset into k subsets,
referred to as folds. The model is trained on k-1 folds, and the other fold is held back for testing,
as illustrated in Figure 11. This process is repeated until all folds have served as a test fold. Once
the process is completed, the evaluation metric is summarised by calculating the average value
[54].
Figure 11: K-fold cross validation with 5 folds.
In this study, a stratified k-fold approach is applied to the validation dataset (20% of the global
set). Stratified k-fold is a variation of k-fold cross validation that ensures that the distribution of
classes is the same across all folds. This is implemented using the StratifiedKFold method from
the scikit-learn library [64], with k=5. Below is a code snippet of the stratified k-fold, where
n_splits specifies the number of folds.
from sklearn.model_selection import StratifiedKFold

# `model` stands for any of the six classifiers listed in Table 13.
for tr_in, val_in in StratifiedKFold(shuffle=True, n_splits=5).split(X_val, y_val):
    model.fit(X_val.iloc[tr_in], y_val.iloc[tr_in])
    accuracy.append(model.score(X_val.iloc[val_in], y_val.iloc[val_in]))
6.3.4 Testing
In the last stage of the modelling phase, the models are tested with unseen data. The unseen data
used at this stage is the resulting test set from the data split (20%). Testing is conducted to assess
how a model represents data and how well it will perform in the future. This study ensured that
any tweaks to the models were done prior to testing, so that the testing data is used only once.
Various performance metrics were generated to be able to analyse the performance of the DDoS
datasets, such as accuracy, precision, recall, and F-measure. These are described in the next
section.
6.4 Evaluation
A crucial part of understanding the performance of a model is generating performance metrics. In
this study, various metrics are generated. These are described below.
Accuracy. One of the ways to describe the performance of a classification model is the count of
correctly and incorrectly classified instances. These values are commonly represented in a
confusion matrix. A confusion matrix is a tabulated visualisation of the performance of supervised
learning algorithms. The rows represent the count of instances in an actual class, while the columns
represent the count of instances in a predicted class [65]. Table 14 depicts the confusion matrix
for a binary classification problem.
Table 14: Example of a confusion matrix for a binary classifier [65].
                 Predicted Class 0   Predicted Class 1
Actual Class 0         180                  15
Actual Class 1          20                  90
A confusion matrix provides enough information to determine the performance of a stand-alone
classifier. However, it is more convenient and clearer to distil the elements of the matrix into a
single value [65]. In this study, the matrix is summarised using the accuracy metric, which is
computed as follows:
Accuracy = (Correctly Classified Instances / Total Instances) × 100%
Equation 2: Accuracy Ratio
Precision. Accuracy is often not enough to assess the performance of a learning model. Although
accuracy indicates whether the model is being trained correctly, it does not give detailed
information about the specific application. Consequently, other performance metrics are
employed, such as precision. Precision is defined as the rate of correctly classified positives, or
true positives. There are many scenarios where false positives might have repercussions. In the
case of this study, a high false positive rate means that traffic would be identified as malicious
when in fact it is not. Outside the academic world, this might result in wasted time and effort.
Precision is computed as follows:
Precision = True Positives / (True Positives + False Positives)
Equation 3: Precision Ratio
Recall. Another performance metric is recall. Recall is a measure of how many of the actual
positives were found, or recalled. It is also an important metric, as undetected positives, or false
negatives, might have serious consequences in some areas. For instance, a model
that does not recall all cases of DDoS attack means that malicious network traffic will go
unnoticed, increasing the potentiality of harm to the system and its users.
Recall = True Positives / (True Positives + False Negatives)
Equation 4: Recall Ratio
F-measure. The F-measure is a metric that provides an overall accuracy score for a model by
combining precision and recall. A good F-measure score means that a model has both low false
positives and low false negatives, and therefore correctly identifies threats while raising
minimal false alarms.
F-measure = 2 × (Precision × Recall) / (Precision + Recall)
Equation 5: F-measure Ratio
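Applying these four metrics to the counts in Table 14 (treating Class 1 as the positive class) can be sketched as:

```python
# Counts from Table 14, with Class 1 as the positive class.
tp, fn = 90, 20    # actual Class 1: predicted 1 / predicted 0
fp, tn = 15, 180   # actual Class 0: predicted 1 / predicted 0

accuracy = (tp + tn) / (tp + tn + fp + fn)                 # Equation 2
precision = tp / (tp + fp)                                 # Equation 3
recall = tp / (tp + fn)                                    # Equation 4
f_measure = 2 * precision * recall / (precision + recall)  # Equation 5

print(round(accuracy, 3), round(precision, 3),
      round(recall, 3), round(f_measure, 3))  # 0.885 0.857 0.818 0.837
```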
Computation time. The last performance metric used in this study is the computational time.
This is not related directly to classification, but rather, it describes the training time taken by a
model. This metric gives an indication of the efficiency of the model. The recorded computational
time is based on a Linux system with 8GB RAM and an i5 processor.
7 Results
7.1 Overview of Results
Table 15 presents the evaluation metrics of the machine learning models based on four open DDoS
datasets, including accuracy, precision, recall, f-measure and computation time (see Section 6.4).
All machine learning models were trained, validated and tested using a 60:20:20 split of the global
datasets. The goal of this evaluation is to analyse the performance of the different DDoS datasets
in terms of their capacity to detect intrusion (via a DDoS attack). The results show that
the CSE-CIC-IDS2018 dataset [48] performs best overall, achieving an accuracy rate of 99% across
all models, and an F-measure of 99%, denoting that a model trained with CSE-CIC-IDS2018 as a
data source performs very well, as it correctly predicts threats (precision) and captures all relevant
cases of malicious traffic (recall) at a 99% rate across all models.
From a model point of view, the random forest ensemble model performed best overall, achieving
100% accuracy for the NDSec-1 dataset [49], while achieving a 99% accuracy for the other datasets.
Moreover, random forest also achieved a precision and recall of 100% for the NDSec-1 dataset.
For the other datasets, precision and recall both stand at 99%. On the other hand, the naïve
Bayes algorithm produced the lowest accuracy with the CICDDoS2019 dataset [47], achieving a
low accuracy of 45% with a precision of 66% and a recall of 54%, meaning that almost half the
time, the model fails to identify threats. The second lowest results were produced by the SVM
model for the NDSec-1 dataset, with an accuracy of 68%. In this case, the precision and recall are
not as low, standing at 81% and 79%, respectively. The remaining models were consistent in the
results.
With regards to computation time, all models took longer to train with CSE-CIC-IDS2018
as the data source. Most likely, this is due to the record volume of the dataset, with a total of
1,046,845 rows. Conversely, the computation time for NDSec-1 dataset was the lowest, with all
models taking less than a second. It is important to note that this set had the lowest data volume
of 5,838 records. In terms of models, the k-NN model for the CSE-CIC-IDS2018 took the longest
to train, at 148 seconds. However, when analysing the overall results, the random forest algorithm
had the longest training time across all datasets.
Table 15: Performance metrics for each dataset.
k-NN SVM Naïve Bayes Decision Tree Random Forest Logistic Regression
CICDDoS2019
Accuracy 0.98 0.86 0.45 0.99 0.99 0.98
Precision 0.99 0.86 0.66 0.99 0.99 0.99
Recall 0.99 0.87 0.54 0.99 0.99 0.98
F-measure 0.99 0.85 0.38 0.99 0.99 0.99
Computation time 3.5 seconds 7.29 seconds 1.3 seconds 4.53 seconds 84.2 seconds 5.53 seconds
CSE-CIC-IDS2018
Accuracy 0.99 0.99 0.99 0.99 0.99 0.99
Precision 0.99 0.99 0.99 1 0.99 0.99
Recall 0.99 0.99 0.99 1 0.99 0.99
F-measure 0.99 0.99 0.99 1 0.99 0.99
Computation time 148.2 seconds 16.8 seconds 2.7 seconds 5.3 seconds 120.8 seconds 10.8 seconds
NDSec-1
Accuracy 0.98 0.68 0.99 0.97 1 0.99
Precision 0.99 0.81 1 0.99 1 0.99
Recall 0.99 0.79 1 0.99 1 0.99
F-measure 0.99 0.75 1 0.99 1 0.99
Computation time 0.2 seconds 0.1 seconds 0.2 seconds 0.3 seconds 0.6 seconds 0.2 seconds
CICIDS2017
Accuracy 0.99 0.89 0.8 0.99 0.99 0.98
Precision 0.99 0.93 0.88 0.99 0.99 0.98
Recall 0.99 0.93 0.78 0.99 0.99 0.98
F-measure 0.99 0.93 0.79 0.99 0.99 0.98
Computation time 7.1 seconds 5.5 seconds 1.4 seconds 1.8 seconds 39.6 seconds 1.4 seconds
7.2 CICDDoS2019
Figure 12 illustrates a comparative bar graph for the accuracy rates achieved by models that were
trained with the CICDDoS2019 dataset [47]. From initial observations, it is clear that the naïve
Bayes model performs poorly in comparison to the rest, with an accuracy rate of 45% (see table
15). The F-measure of the same model is also low. Taking a more granular look into this metric,
it shows that both the precision and recall of the model are problematic, with 66% and 54%
respectively. For this dataset, the best performing model was the random forest, achieving an
accuracy of 99%, with a 99% precision and 99% recall. The model also took the longest to train,
with a computation time of 84.2 seconds. Meanwhile, the other models took under 10 seconds to
train.
Figure 12: Bar graph of accuracy rate for the CICDDoS2019 dataset.
7.3 CSE-CIC-IDS2018
The CSE-CIC-IDS2018 dataset [48] performs extremely well as it achieves a 99% accuracy rate
for all machine learning models used in this study, as seen in Figure 13 below. This can be
attributed to the volume of records that the dataset has in comparison with the other datasets
(1,046,845 rows). Due to the record volume, some models took a longer time to train, in particular,
the k-NN and random forest models, with 148.2 and 120.8 seconds respectively (see table 15).
While all models achieve the same accuracy rate, the decision tree model performs best overall
with an F-measure of 100%. The naïve Bayes model takes the least time to train, with a
computation time of 2.7 seconds.
Figure 13: Bar graph of accuracy rate for the CSE-CIC-IDS2018 dataset.
7.4 NDSec-1
Figure 14 presents a bar graph of the accuracy rate achieved by models trained with the NDSec-
1 dataset [49]. This dataset has the lowest volume of records (5,838) and naturally, model training
took much less time in comparison to models trained with the other datasets. In fact, all models
took less than 1 second to train (see table 15). The random forest model achieved a 100% accuracy,
and the highest accuracy rate in this study. The same model also achieved a 100% F-measure
score. In contrast, the SVM model achieved the lowest accuracy for the dataset, with a score of
68%, the second lowest accuracy rate in this study. In addition, the model achieved a precision
score of 81% and slightly lower recall score of 79%, with a combined F-measure score of 75%. The
other models achieved very similar accuracy and F-measure scores, as the bar graph illustrates
(see Fig. 14), where the decision tree model achieved a 97% accuracy, while the k-NN and logistic
regression models both achieved a 99% accuracy score.
Figure 14: Bar graph of accuracy rate for the NDSec-1 dataset.
7.5 CICIDS2017
Figure 15 presents a bar graph of the results achieved by the CICIDS2017 dataset [50]. This
dataset is comparable with the CICDDoS2019 dataset in terms of volume (225,745 rows). It is
interesting to note, however, that when trained with CICIDS2017 data, the random forest model
takes less than half the time (39.6 seconds) it takes to train with CICDDoS2019 data (84.2
seconds). When training a naïve Bayes model, the dataset generates the lowest accuracy, with a
score of 80%, with precision at 88% and recall at 78%. Relative to the rest of the models,
the SVM model also achieves slightly lower accuracy, with a score of 89%. With regards to
precision and recall, the same model achieves a score of 93% in both cases. The rest of the models
achieve very similar results, where the k-NN, decision tree, and random forest models all achieved
99% accuracy, precision and recall, while the logistic regression model achieved a 98% score for
all three metrics. This pattern closely resembles the accuracy results achieved by the models
when trained with the CICDDoS2019 dataset (see Fig. 12).
Figure 15: Bar graph of accuracy rate for the CICIDS2017 dataset.
8 Discussion
8.1 Contributions of this Study
This study explored the behaviour and application of multiple DDoS datasets for machine learning
in the context of intrusion detection. Intrusion detection has become a pressing concern and the
subject of extensive research due to the ever-increasing number of vulnerabilities. Over the last
few years, the Internet has grown exponentially, with thousands of computer-based applications
being generated every day. The internet has rapidly become an essential component of modern life,
and with its aggressive growth, secure network environments are becoming critical. Among the
various types of attacks, DDoS attacks are one of the biggest threats to internet sites and pose
a devastating risk to the security of computer systems, particularly due to their potential
impact. This is why research in this area has flourished, with researchers focusing on new ways
to tackle intrusion detection and prevention. Machine learning and artificial intelligence are
among the latest additions to the list of technologies researched for intrusion detection.
However, many industry stakeholders and researchers still find it difficult to obtain good-quality
datasets for evaluating and assessing their machine learning detection models. This problem was
the main motivation of this study, and the basis for the research questions.
This work starts by reviewing literature in this domain. Firstly, this review presented an
outline of how other researchers explored and tackled the issue of intrusion detection with the
application of machine learning. This gave a better understanding of how different algorithms are
applied in solving intrusion problems. Moreover, it also provided insight into which algorithms
are commonly used to tackle problems in this domain and how the results are interpreted and
analysed. Secondly, the literature review delved deep into the characteristics and issues of current
datasets. Various works were analysed in order to explore the intricacies of these datasets and how
their validity is affected in the context of machine learning methodologies. Multiple issues were
uncovered with regards to existing datasets, including privacy concerns, documentation
availability, accessibility and alignment with research objectives. This was followed by a review
of previous work related to the surveying and comparison of datasets.
This study presented a solution for the analysis of the effectiveness of existing DDoS
datasets to detect intrusion, using CRISP-DM as the core methodology. The primary phases of
CRISP-DM were efficiently and effectively mapped to the research questions and these were
thoroughly followed throughout the whole study.
In the experiment, four open DDoS datasets were used: CICDDoS2019, CICIDS2017,
NDSec-1 and CSE-CIC-IDS2018. The intrusion detection performance of these datasets was
analysed using six machine learning models. The datasets were split in a 60:20:20 ratio for model
training, validation and testing, respectively. The machine learning models were chosen
systematically and carefully to ensure that the experiment is conducted in a proper manner. The
six models include naïve Bayes, SVM, decision tree, k-nearest neighbour, random forest and
logistic regression. The results were analysed using a set of performance metrics, including
accuracy, precision, recall, F-measure and computation time. Below are the findings of this study
according to the research questions:
Assessment of RQ1: What is the effectiveness of different open DDoS datasets in detecting
intrusion and malicious traffic?
Finding 1.1 – The CSE-CIC-IDS2018 dataset [48] exhibits the best intrusion detection performance
overall, where all models achieve 99% accuracy rate with an F-measure score of 99%, denoting
that any of the six models trained with this dataset is able to correctly identify threats (precision)
and capture all relevant cases of malicious traffic (recall) at a 99% rate.
Finding 1.2 – Training with CSE-CIC-IDS2018 [48] as a data source was the most time intensive
overall, possibly due to the large record volume of the dataset (1,046,845 rows). In contrast, the
least time intensive models were the ones trained with the NDSec-1 dataset, which is also the least
voluminous dataset (5,838 rows), where the computation time was under one second for all
models.
Assessment of RQ2: How does the performance of different supervised learning models compare
with regards to classification capacity and time efficiency?
Finding 2.1 – Random forest is the best performing model overall, achieving 100% accuracy and
100% F-measure when trained with the NDSec-1 dataset [49], and a 99% score for accuracy, precision
and recall when trained with any of the other datasets.
Finding 2.2 – The naïve Bayes model performed relatively poorly overall and produced the lowest
accuracy score of this study (45%) when trained with the CICDDoS2019 dataset [47]. For the
same model, precision was 66% and recall was 54%, meaning that almost half the time the model
fails to identify threats.
Finding 2.3 – The single highest training time in this study was recorded for the k-NN model, at
148.2 seconds when trained with the CSE-CIC-IDS2018 dataset. Summed across all four datasets,
however, the random forest model had the highest overall computation time.
Finding 2.4 – The CICIDS2017 and CICDDoS2019 datasets show similar patterns in the results
obtained by the models, with naïve Bayes and SVM producing the lowest and second-lowest
results respectively. All other models trained with either dataset achieved consistently similar
results, ranging from 98% to 99% accuracy rate.
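The 60:20:20 split described at the start of this section can be sketched as follows. The study itself used scikit-learn's train_test_split [53]; this stdlib-only split_60_20_20 helper is purely illustrative of the ratio logic:

```python
import random

def split_60_20_20(records, seed=0):
    """Shuffle the records and split them 60:20:20 into
    training, validation and test subsets."""
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)
    n_train = int(len(records) * 0.6)
    n_val = int(len(records) * 0.2)
    train = [records[i] for i in idx[:n_train]]
    val = [records[i] for i in idx[n_train:n_train + n_val]]
    test = [records[i] for i in idx[n_train + n_val:]]
    return train, val, test

rows = list(range(1000))  # stand-in for dataset rows
train, val, test = split_60_20_20(rows)
print(len(train), len(val), len(test))  # 600 200 200
```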
8.2 Conclusions and Future Work
While the scarcity of datasets was the very focal point of this study, it can also be seen as a
limitation in its own right: had more datasets been available, a potentially more accurate
comparison between them could have been made.
Although the area of IDS is heavily researched, there are many aspects to be investigated further,
especially in the area of machine learning. Specifically, future work could focus on providing an
application or service with which any new dataset could quickly be analysed and put into the
benchmark with algorithms selected by the researcher in the same manner that this study carried
out the analysis of the selected datasets. The application would be able to answer the question
‘which dataset performs better and with which algorithms?’. This would be of great help to
researchers who are in search of a well-performing dataset and also desire a consistent approach
to results, by using the best-performing datasets and algorithms.
With regards to possible future work, another interesting area that could be explored is how IDS-
specific data could be represented in non-structured forms and further analysed with deep
learning using artificial neural networks. This can also be seen as two separate problems, which
a future study could expand on.
References
[1] S. Dua and X. Du, Data Mining and Machine Learning in Cybersecurity. Boca Raton,
Florida: Auerbach Publications, 2016.
[2] C. Canongia and R. Mandarino, “Cybersecurity: The new challenge of the information
society,” in Handbook of Research on Business Social Networking: Organizational,
Managerial, and Technological Dimensions, 2011.
[3] P. Twomey, “Cyber Security Threats.” The Lowy Institute for International Policy, Sydney,
2010.
[4] R. Von Solms and J. Van Niekerk, “From information security to cyber security,” Comput.
Secur., vol. 38, pp. 97–102, 2013.
[5] J. B. Fraley and J. Cannady, “The promise of machine learning in cybersecurity,” in
SouthEastCon 2017, 2017, pp. 1–6.
[6] OWASP, “OWASP Top 10 - 2017 - The Ten Most Critical Web Application Security
Risks,” Top 10 2017, 2017.
[7] C. Douligeris and A. Mitrokotsa, “DDoS attacks and defense mechanisms: Classification
and state-of-the-art,” Comput. Networks, vol. 44, no. 5, pp. 643–666, 2004.
[8] S. K. Sahu, S. Sarangi, and S. K. Jena, “A detail analysis on intrusion detection datasets,”
in Souvenir of the 2014 IEEE International Advance Computing Conference, IACC 2014,
2014, pp. 1348–1353.
[9] J. O. Nehinbe, “A critical evaluation of datasets for investigating IDSs and IPSs researches,”
in Proceedings of 2011, 10th IEEE International Conference on Cybernetic Intelligent
Systems, CIS 2011, 2011, pp. 1–6.
[10] A. Gharib, I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, “An Evaluation Framework
for Intrusion Detection Dataset,” in ICISS 2016 - 2016 International Conference on
Information Science and Security, 2017, pp. 1–6.
[11] M. Ring, S. Wunderlich, D. Scheuring, D. Landes, and A. Hotho, “A survey of network-
based intrusion detection data sets,” Comput. Secur., vol. 86, pp. 147–167, 2019.
[12] O. Yavanoglu and M. Aydos, “A review on cyber security datasets for machine learning
algorithms,” in Proceedings - 2017 IEEE International Conference on Big Data, Big Data
2017, 2017, pp. 2186–2193.
[13] C. Thomas, V. Sharma, and N. Balakrishnan, “Usefulness of DARPA dataset for intrusion
detection system evaluation,” in Data Mining, Intrusion Detection, Information Assurance,
and Data Networks Security 2008, 2008.
[14] L. Dhanabal and S. P. Shantharajah, “A Study on NSL-KDD Dataset for Intrusion
Detection System Based on Classification Algorithms,” Int. J. Adv. Res. Comput. Commun.
Eng., vol. 4, no. 6, 2015.
[15] M. Małowidzki, P. Bereziński, and M. Mazur, “Network Intrusion Detection: Half a Kingdom
for a Good Dataset,” in ECCWS 2017 16th European Conference on Cyber Warfare and
Security, 2017.
[16] R. Bace and P. Mell, “NIST special publication on intrusion detection systems,” Special
Publication (NIST SP), 2001.
[17] Nexus Guard, “Nexusguard Research Shows DNS Amplification Attacks Grew Nearly
4,800% Year-over-Year; Highlighted by Sharp Increase in TCP SYN Flood,” 2019. [Online].
Available: https://www.nexusguard.com/newsroom/press-release/dns-amplification-
attacks-rise-twofold-in-q1-0-0.
[18] J. Mirkovic and P. Reiher, “A taxonomy of DDoS attack and DDoS defense mechanisms,”
Comput. Commun. Rev., vol. 34, no. 2, pp. 39–53, 2004.
[19] K. Scarfone and P. Mell, “Guide to Intrusion Detection and Prevention Systems (IDPS),”
National Institute of Standards and Technology. Special Publication (NIST SP), 2007.
[20] P. Ferguson and D. Senie, “Network Ingress Filtering: Defeating Denial of Service Attacks
which employ IP Source Address Spoofing,” RFC Editor, 2000. [Online]. Available:
https://tools.ietf.org/html/rfc2827.
[21] G. C. Kessler and D. E. Levin, Denial-of-Service Attacks, 4th ed. John Wiley & Sons, 2015.
[22] R. Das and T. H. Morris, “Machine learning and cyber security,” in 2017 International
Conference on Computer, Electrical and Communication Engineering, ICCECE 2017,
2018, pp. 1–7.
[23] I. Sofi, A. Mahajan, and V. Mansotra, “Machine Learning Techniques used for the Detection
and Analysis of Modern Types of DDoS Attacks,” Int. Res. J. Eng. Technol., 2017.
[24] N. Sharma, A. Mahajan, and V. Mansotra, “Machine Learning Techniques Used in
Detection of DOS Attacks: A Literature Review,” Int. J. Adv. Res. Comput. Sci. Softw.
Eng., 2016.
[25] M. Zekri, S. El Kafhali, N. Aboutabit, and Y. Saadi, “DDoS attack detection using machine
learning techniques in cloud computing environments,” in Proceedings of 2017 International
Conference of Cloud Computing Technologies and Applications, CloudTech 2017, 2018.
[26] D. M. Farid, N. Harbi, E. Bahri, M. Z. Rahman, and C. M. Rahman, “Attacks classification
in adaptive intrusion detection using decision tree,” World Acad. Sci. Eng. Technol., pp.
368–372, 2010.
[27] Y. C. Wu, H. R. Tseng, W. Yang, and R. H. Jan, “DDoS detection and traceback with
decision tree and grey relational analysis,” in 3rd International Conference on Multimedia
and Ubiquitous Engineering, MUE 2009, 2009.
[28] A. Andhare, P. Arvind, and B. Patil, “Denial-of-Service Attack Detection Using Genetic-
Based Algorithm,” vol. 2, no. 2, pp. 94–98, 2012.
[29] M. Aamir and S. M. A. Zaidi, “DDoS attack detection with feature engineering and machine
learning: the framework and performance evaluation,” Int. J. Inf. Secur., pp. 1–25, 2019.
[30] A. Khraisat, I. Gondal, P. Vamplew, and J. Kamruzzaman, “Survey of intrusion detection
systems: techniques, datasets and challenges,” Cybersecurity, vol. 2, no. 1, p. 20, 2019.
[31] R. Koch, “Towards next-generation intrusion detection,” in 2011 3rd International
Conference on Cyber Conflict, ICCC 2011 - Proceedings, 2011.
[32] J. O. Nehinbe, “A simple method for improving intrusion detections in corporate networks,”
in Lecture Notes of the Institute for Computer Sciences, Social-Informatics and
Telecommunications Engineering, 2010.
[33] N. Moustafa and J. Slay, “UNSW-NB15: A comprehensive data set for network intrusion
detection systems (UNSW-NB15 network data set),” in 2015 Military Communications and
Information Systems Conference, MilCIS 2015 - Proceedings, 2015.
[34] M. Ring, S. Wunderlich, D. Grüdl, D. Landes, and A. Hotho, “Flow-based benchmark data
sets for intrusion detection,” in European Conference on Information Warfare and Security,
ECCWS, 2017.
[35] A. A. Ghorbani, W. Lu, and M. Tavallaee, Network Intrusion Detection and Prevention.
Springer, 2010.
[36] The Cooperative Association for Internet Data Analysis, “CAIDA - The Cooperative
Association for Internet Data Analysis,” CAIDA. 2010.
[37] J. Mchugh, “Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA
Intrusion Detection System Evaluations as Performed by Lincoln Laboratory,” ACM Trans.
Inf. Syst. Secur., vol. 3, no. 4, pp. 1094–9224, 2000.
[38] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “Detailed Analysis of the KDD CUP
99 Data Set,” Submitted to Second IEEE Symposium on Computational Intelligence for
Security and Defense Applications (CISDA), 2009.
[39] University of California, “KDD-Cup Dataset ’99,” The UCI KDD Archive, 1999.
[40] University of California, “KDD-Cup Dataset ’98,” The UCI KDD Archive, 1998.
[41] J. Heidemann and C. Papadopoulos, “Uses and challenges for network datasets,” in
Proceedings - Cybersecurity Applications and Technology Conference for Homeland
Security, CATCH 2009, 2009.
[42] Defense Advanced Research Projects Agency, “1999 DARPA Intrusion Detection
Evaluation Dataset,” 1999. [Online]. Available: https://www.ll.mit.edu/r-d/datasets/1999-
darpa-intrusion-detection-evaluation-dataset.
[43] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, “Network anomaly detection:
Methods, systems and tools,” IEEE Commun. Surv. Tutorials, vol. 16, no. 1, pp. 303–336,
2014.
[44] A. Nisioti, A. Mylonas, P. D. Yoo, and V. Katos, “From intrusion detection to attacker
attribution: A comprehensive survey of unsupervised methods,” IEEE Commun. Surv.
Tutorials, vol. 20, no. 4, pp. 3369–3388, 2018.
[45] T. H. Morris, Z. Thornton, and I. Turnipseed, “Industrial Control System Simulation and
Data Logging for Intrusion Detection System Research,” Seventh Annu. Southeast. Cyber
Secur. Summit, 2015.
[46] R. Wirth, “CRISP-DM : Towards a Standard Process Model for Data Mining,” Proc. Fourth
Int. Conf. Pract. Appl. Knowl. Discov. Data Min., pp. 29–39, 2000.
[47] University of New Brunswick, “DDoS Evaluation Dataset (CICDDoS2019),” unb.ca, 2019.
[Online]. Available: https://www.unb.ca/cic/datasets/ddos-2019.html.
[48] University of New Brunswick, “CSE-CIC-IDS2018 on AWS,” 2018. [Online]. Available:
https://www.unb.ca/cic/datasets/ids-2018.html.
[49] F. Beer, T. Hofer, D. Karimi, and U. Bühler, “A new attack composition for network
security,” in Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft
fur Informatik (GI), 2017.
[50] Canadian Institute for Cybersecurity, “CICIDS2017,” unb.ca, 2017. [Online]. Available:
https://www.unb.ca/cic/datasets/ids-2017.html.
[51] A. H. Lashkari, Y. Zang, G. Owhuo, M. S. I. Mamun, and G. D. Gil, “CICFlowMeter,”
Github. 2017.
[52] pandas.pydata.org, “pandas.DataFrame.fillna,” Pandas 1.0.3 Documentation, 2014.
[Online]. Available: https://pandas.pydata.org/pandas-
docs/stable/reference/api/pandas.DataFrame.fillna.html.
[53] Scikit-learn, “Train_test_split,” Scikit-learn 0.22.2 Documentation, 2019. [Online].
Available: https://scikit-
learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html.
[54] T. Mitchell, Machine Learning. Burr Ridge, IL: McGraw Hill, 1997.
[55] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning, 2nd ed.
London, England: The MIT Press, 2018.
[56] L. Rokach, “Ensemble-based classifiers,” Artif. Intell. Rev., vol. 33, pp. 1–39, 2010.
[57] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, and B. Thirion, “Scikit-learn:
Machine learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.
[58] Scikit-learn, “KNeighborsClassifier,” scikit-learn.org, 2019. [Online]. Available: https://scikit-
learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html.
[59] Scikit-learn, “LinearSVC,” scikit-learn.org, 2019. [Online]. Available: https://scikit-
learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html.
[60] Scikit-learn, “GaussianNB,” scikit-learn.org, 2019. [Online]. Available: https://scikit-
learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html.
[61] Scikit-learn, “DecisionTreeClassifier,” scikit-learn.org, 2019. [Online]. Available:
https://scikit-
learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html.
[62] Scikit-learn, “RandomForestClassifier,” scikit-learn.org, 2019. [Online]. Available:
https://scikit-
learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
[63] Scikit-learn, “LogisticRegression,” scikit-learn.org, 2019. [Online]. Available:
https://scikit-
learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.
[64] Scikit-learn, “StratifiedKFold,” Scikit-learn 0.22.2 Documentation, 2019.
[65] D. M. Powers, “Evaluation: From Precision, Recall and F-Factor to ROC, Informedness,
Markedness & Correlation,” J. Mach. Learn. Technol., vol. 2, 2007.
Appendix A – Specifics for CICDDoS2019
Day         Attack      Attack Time
First Day   PortMap     9:43 - 9:51
            NetBIOS     10:00 - 10:09
            LDAP        10:21 - 10:30
            MSSQL       10:33 - 10:42
            UDP         10:53 - 11:03
            UDP-Lag     11:14 - 11:24
            SYN         11:28 - 17:35
Second Day  NTP         10:35 - 10:45
            DNS         10:52 - 11:05
            LDAP        11:22 - 11:32
            MSSQL       11:36 - 11:45
            NetBIOS     11:50 - 12:00
            SNMP        12:12 - 12:23
            SSDP        12:27 - 12:37
            UDP         12:45 - 13:09
            UDP-Lag     13:11 - 13:15
            WebDDoS     13:18 - 13:29
            SYN         13:29 - 13:34
            TFTP        13:35 - 17:15
Table A1: Time of Attacks for the CICDDoS2019 dataset [47].
Appendix B – Modelling Source Code
AB.1 K-Nearest Neighbour
import time
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

res1 = time.time()
knn = KNeighborsClassifier()
knn = knn.fit(X_train, y_train)
res2 = time.time()
print('KNN took ', res2 - res1, ' seconds')

# 5-fold stratified cross-validation on the validation set
accuracy = []
for tr_in, val_in in StratifiedKFold(shuffle=True, n_splits=5).split(X_val, y_val):
    knn.fit(X_val.iloc[tr_in], y_val.iloc[tr_in])
    accuracy.append(knn.score(X_val.iloc[val_in], y_val.iloc[val_in]))

y_pred1 = knn.predict(X_test)
print('Accuracy score= {:.8f}'.format(knn.score(X_test, y_test)))

from sklearn.metrics import classification_report, confusion_matrix
print('\n')
print("Precision, Recall, F1")
print('\n')
CR = classification_report(y_test, y_pred1)
print(CR)
print('\n')
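The StratifiedKFold loop above draws five validation folds that each preserve the class balance of the labels. A stdlib-only sketch of that idea (the stratified_folds helper is illustrative, not part of the code used in this study):

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Deal the indices of each class round-robin into k folds so that
    every fold keeps roughly the full data's class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds

# 10 benign (0) and 5 attack (1) labels: each of 5 folds gets 2 benign, 1 attack
folds = stratified_folds([0] * 10 + [1] * 5, k=5)
print([len(f) for f in folds])  # [3, 3, 3, 3, 3]
```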
AB.2 Support Vector Machine
import time
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

res1 = time.time()
svc = LinearSVC(random_state=10, tol=1e-10, max_iter=100)
svc = svc.fit(X_train, y_train)
res2 = time.time()
print('SVM took ', res2 - res1, ' seconds')

# 5-fold stratified cross-validation on the validation set
accuracy = []
for tr_in, val_in in StratifiedKFold(shuffle=True, n_splits=5).split(X_val, y_val):
    svc.fit(X_val.iloc[tr_in], y_val.iloc[tr_in])
    accuracy.append(svc.score(X_val.iloc[val_in], y_val.iloc[val_in]))

y_pred1 = svc.predict(X_test)
print('Accuracy score= {:.8f}'.format(np.mean(accuracy)))

from sklearn.metrics import classification_report, confusion_matrix
print('\n')
print("Precision, Recall, F1")
print('\n')
CR = classification_report(y_test, y_pred1)
print(CR)
print('\n')
AB.3 Naïve Bayes
import time
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

res1 = time.time()
nb = GaussianNB()
nb = nb.fit(X_train, y_train)
res2 = time.time()
print('Naive Bayes took ', res2 - res1, ' seconds')

# 5-fold stratified cross-validation on the validation set
accuracy = []
for tr_in, val_in in StratifiedKFold(shuffle=True, n_splits=5).split(X_val, y_val):
    nb.fit(X_val.iloc[tr_in], y_val.iloc[tr_in])
    accuracy.append(nb.score(X_val.iloc[val_in], y_val.iloc[val_in]))

y_pred1 = nb.predict(X_test)
print('Accuracy score= {:.8f}'.format(np.mean(accuracy)))

from sklearn.metrics import classification_report, confusion_matrix
print('\n')
print("Precision, Recall, F1")
print('\n')
CR = classification_report(y_test, y_pred1)
print(CR)
print('\n')
AB.4 Decision Tree
import time
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

res1 = time.time()
DTC = DecisionTreeClassifier(random_state=10, max_depth=13)
DTC = DTC.fit(X_train, y_train)
res2 = time.time()
print('Decision tree took ', res2 - res1, ' seconds')

# 5-fold stratified cross-validation on the validation set
accuracy = []
for tr_in, val_in in StratifiedKFold(shuffle=True, n_splits=5).split(X_val, y_val):
    DTC.fit(X_val.iloc[tr_in], y_val.iloc[tr_in])
    accuracy.append(DTC.score(X_val.iloc[val_in], y_val.iloc[val_in]))

y_pred1 = DTC.predict(X_test)
print('Accuracy score= {:.8f}'.format(np.mean(accuracy)))

from sklearn.metrics import classification_report, confusion_matrix
print('\n')
print("Precision, Recall, F1")
print('\n')
CR = classification_report(y_test, y_pred1)
print(CR)
print('\n')
AB.5 Random Forest
import time
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

res1 = time.time()
Ran_For = RandomForestClassifier(n_estimators=200, max_depth=35, random_state=200, max_leaf_nodes=200)
Ran_For = Ran_For.fit(X_train, y_train)
res2 = time.time()
print('Random Forest took ', res2 - res1, ' seconds')

# 5-fold stratified cross-validation on the validation set
accuracy = []
for tr_in, val_in in StratifiedKFold(shuffle=True, n_splits=5).split(X_val, y_val):
    Ran_For.fit(X_val.iloc[tr_in], y_val.iloc[tr_in])
    accuracy.append(Ran_For.score(X_val.iloc[val_in], y_val.iloc[val_in]))

y_pred1 = Ran_For.predict(X_test)
print('Accuracy score= {:.8f}'.format(Ran_For.score(X_test, y_test)))

from sklearn.metrics import classification_report, confusion_matrix
print('\n')
print("Precision, Recall, F1")
print('\n')
CR = classification_report(y_test, y_pred1)
print(CR)
print('\n')
AB.6 Logistic Regression
import time
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

res1 = time.time()
LR = LogisticRegression()
LR = LR.fit(X_train, y_train)
res2 = time.time()
print('LogisticRegression took ', res2 - res1, ' seconds')

# 5-fold stratified cross-validation on the validation set
accuracy = []
for tr_in, val_in in StratifiedKFold(shuffle=True, n_splits=5).split(X_val, y_val):
    LR.fit(X_val.iloc[tr_in], y_val.iloc[tr_in])
    accuracy.append(LR.score(X_val.iloc[val_in], y_val.iloc[val_in]))

y_pred1 = LR.predict(X_test)
print('Accuracy score= {:.3f}'.format(LR.score(X_test, y_test)))

from sklearn.metrics import classification_report, confusion_matrix
print('\n')
print("Precision, Recall, F1")
print('\n')
CR = classification_report(y_test, y_pred1)
print(CR)
print('\n')