DDoS datasets: Use of machine learning to analyse intrusion detection performance
Stefanos Kiourkoulis
Information Security, master's level (120 credits)
2020
Luleå University of Technology
Department of Computer Science, Electrical and Space Engineering
Master Thesis Project
DDoS datasets: Use of machine learning to analyse intrusion detection performance
Author: Stefanos Kiourkoulis  E-mail: [email protected]
Supervisor: Dr. Ali Ismail Awad
June 2020
Master of Science in Information Security
Luleå University of Technology
Department of Computer Science, Electrical and Space Engineering
Abstract
Threats of malware, attacks and intrusion have been around since the very conception of
computing. Yet, it was not until the sudden growth of the internet that awareness of security and
digital assets really started to pick up steam. The internet presents a new liability, as the ever-
increasing number of machines on the web provides a new goldmine for those seeking to exploit
vulnerabilities. As access increases, new ways are created for attackers to exploit network systems
and their users. Among various types of attack, DDoS remains the most devastating and severe
due to its potential impact, and this potential keeps on growing, making intrusion detection a
must for network security and defense. As a result, machine learning and artificial intelligence
research has flourished over the last few years, opening new doors for intrusion detection
technologies. However, data availability still greatly limits the success of such technologies, as
research faces a shortage of good quality IDS datasets.
This study bases itself on this persisting issue as it assesses the state-of-the-art of open datasets
and their ability to detect intrusion and harmful network traffic. In particular, this study focuses
on providing a comparison of intrusion detection performance of open DDoS attack datasets.
DDoS attacks are some of the most concerning due to the magnitude of damage that they are
capable of. Literature on open DDoS datasets is fairly scarce in comparison to other forms of
attacks, hence, this study seeks to shed more light on the nature of existing DDoS data in relation
to intrusion detection. The proposed solution sees four DDoS datasets analysed using a set of six
machine learning algorithms, namely, k-NN, SVM, naïve Bayes, decision tree, random forest and
logistic regression. This study aims to assess these datasets and analyse their performance with
regards to classification of network traffic.
The results of this study contribute to a better understanding of the intrusion detection capacity
of open DDoS datasets. The datasets are analysed on the basis of five performance metrics: accuracy,
precision, recall, F-measure and computation time. The results show that voluminous datasets,
such as the CSE-CIC-IDS2018 dataset, can achieve very high performance. In modelling terms,
the results indicate that random forest performs very well over a wide range of datasets, while naïve
Bayes and SVM are less consistent.
Acknowledgements
Throughout the writing of this dissertation, several people contributed in numerous ways, from
advisory to the much-needed support for the long hours required to finalize the writing. I would
like to thank all of them.
At the forefront my professor, Dr Ali Ismail Awad, for providing constant feedback,
recommendations, general guidance and for always being available. Furthermore, I would like to
thank LTU and Sweden’s academic system, for giving me the opportunity to further my academic
knowledge in the subject of security.
In addition, many thanks to my dissertation opponent Alexandros Marinakis, for his valuable time
and assistance during the seminars. Last but not least, Francesca Gauci for her counselling,
support and all-around enforcement of discipline towards the writing of the dissertation.
Table of Contents
1 INTRODUCTION
1.1 OVERVIEW
1.2 PROPOSED SOLUTION AND RESEARCH GOALS
1.3 RESEARCH QUESTIONS
1.4 RESEARCH CONTRIBUTION
1.5 LIMITATIONS
1.6 DOCUMENT STRUCTURE
2 BACKGROUND
2.1 OVERVIEW OF DDOS ATTACKS
2.2 DDOS TAXONOMY
3 LITERATURE REVIEW
3.1 OVERVIEW
3.2 USE OF MACHINE LEARNING IN INTRUSION DETECTION
3.3 AVAILABILITY OF GOOD DATASETS
3.3.1 Issues with Current Datasets
3.3.2 Related Work
3.4 RESEARCH GAP
4 RESEARCH METHODOLOGY
4.1 OVERVIEW
4.2 LIFE CYCLE
4.3 IMPLEMENTING CRISP-DM
5 DDOS DATASETS REVIEW
5.1 CICDDOS2019
5.2 CSE-CIC-IDS2018 ON AWS
5.3 NDSEC-1 (BOTNET)
5.4 CICIDS2017
6 EXPERIMENT IMPLEMENTATION AND DESIGN
6.1 OVERVIEW
6.2 DATA PREPARATION
6.2.1 Data Cleaning and Transformation
6.2.2 Volume and Class Distribution
6.2.3 Splitting Datasets
6.3 MODELLING
6.3.1 Models Selection
6.3.2 Models Used in this Study
6.3.3 Training
6.3.4 Validation
6.3.5 Testing
6.4 EVALUATION
7 RESULTS
7.1 OVERVIEW OF RESULTS
7.2 CICDDOS2019
7.3 CSE-CIC-IDS2018
7.4 NDSEC-1
7.5 CICIDS2017
8 DISCUSSION
8.1 CONTRIBUTIONS OF THIS STUDY
8.2 CONCLUSIONS AND FUTURE WORK
REFERENCES
APPENDIX A – SPECIFICS FOR CICDDOS2019
APPENDIX B – MODELLING SOURCE CODE
AB.1 K-NEAREST NEIGHBOUR
AB.2 SUPPORT VECTOR MACHINE
AB.3 NAÏVE BAYES
AB.4 DECISION TREE
AB.5 RANDOM FOREST
AB.6 LOGISTIC REGRESSION
List of Tables
Table 1: OSI model describing the functionality of each distinctive layer.
Table 2: OS specification and machine IPs for CICDDoS2019. Adapted from DDoS Evaluation Set [47].
Table 3: Specification of tools and duration of DDoS attack for CSE-CIC-IDS2018 on AWS [48].
Table 4: Details of DDoS attack for CICIDS2017.
Table 5: Labelling system for binary classification.
Table 6: Volume of records for the DDoS attack datasets.
Table 7: Pseudocode for the k-NN algorithm [54].
Table 8: Pseudocode for the SVM algorithm [54].
Table 9: Pseudocode for the naïve Bayes algorithm [54].
Table 10: Pseudocode for the decision tree algorithm [54].
Table 11: Pseudocode for the random forest algorithm [54].
Table 12: Pseudocode for the logistic regression algorithm [54].
Table 13: Methods and classifiers from the scikit-learn Python library [57] used for building models.
Table 14: Example of a confusion matrix for a binary classifier [65].
Table 15: Performance metrics for each dataset.
List of Figures
Figure 1: Overview of proposed solution.
Figure 2: The four tiers of CRISP-DM. Reproduced from [46].
Figure 3: The six-phase life cycle of a data mining project. Reproduced from [46].
Figure 4: Workflow of the supervised learning process.
Figure 5: Bar chart showing the distribution of traffic types in the CICDDoS2019 dataset.
Figure 6: Bar chart showing the distribution of traffic types in the CSE-CIC-IDS2018 dataset.
Figure 7: Bar chart showing the distribution of traffic types in the CICIDS2017 dataset.
Figure 8: Bar chart showing the distribution of traffic types in the NDSec-1 dataset.
Figure 9: An example of k-NN classification. When k=3, the new instance is labelled as 0; however, when the parameter is increased to k=5, the same instance is labelled as 1. Adapted from [55].
Figure 10: Illustration of SVM classification.
Figure 11: K-fold cross-validation with 5 folds.
Figure 12: Bar graph of accuracy rates for the CICDDoS2019 dataset.
Figure 13: Bar graph of accuracy rates for the CSE-CIC-IDS2018 dataset.
Figure 14: Bar graph of accuracy rates for the NDSec-1 dataset.
Figure 15: Bar graph of accuracy rates for the CICIDS2017 dataset.
List of Equations
Equation 1: Bayes’ Theorem
Equation 2: Accuracy Ratio
Equation 3: Precision Ratio
Equation 4: Recall Ratio
Equation 5: F-measure Ratio
1 Introduction
1.1 Overview
Cyber security refers to the application of preventive security measures to provide confidentiality,
integrity and availability of data [1]. Cyber security has long been a point of discussion in both
the academic and scientific worlds, and multiple works of literature describe and define it. In
particular, Canongia and Mandarino [2] interpret it as “the art of ensuring the existence and
continuity of the information society of a nation, guaranteeing and protecting, in Cyberspace, its
information, assets and critical infrastructure”. Safeguarding cyberspace and ensuring security is
of utmost significance, as
multiple organisations and operations depend on it, including high-risk ones such as governments
and the military, as well as businesses, financial institutions and civilians that store immense
volumes of data on personal computers and other devices [1], [3]. Consequently, it is
necessary for companies to organise their efforts to ensure protection across their information
systems. Cyber security is composed of different elements, including network security, data
security and mobile security, to name a few [3].
Over the last few years, the usage of the Internet and computer-based applications has
grown exponentially, as these are rapidly becoming an essential component for today’s generation.
With the aggressive growth in the use of computer applications and computer networks, secure
environments are becoming critical [4], [5]. As improvements in technological systems are making
processes easier in all aspects of life, these also create new ways for attackers to exploit these
systems and their users. Attackers can go down various paths in order to cause harm and damage
to users and organisations. These paths present different levels of risk and, accordingly, may or
may not be severe enough to attract attention [6]. The National Institute of Standards and
Technology (NIST) reported that in 2017 alone, American companies experienced losses of up to
65.5 billion dollars due to IT-related attacks and intrusions [6].
Among various attacks, Denial-of-Service (DoS) remains an immense threat to internet-
dependent businesses and organisations. Although security researchers and experts dedicate
continuous efforts to address this issue, DoS attacks are still one of the most difficult security
problems faced by the Internet and the online world today. Of particular concern are Distributed
Denial-of-Service (DDoS) attacks, specifically because the capacity and impact of DDoS attacks
are persistently growing. With little or no prior warning, such an attack can easily and efficiently
cripple the resources of its victims in a short timespan [7].
Consequently, adopting intrusion detection measures is becoming essential. Intrusion
detection is highly significant in network security and defense, as it proactively aims to forewarn
security administrators about malicious behaviours, such as attacks, malware and intrusion.
Having an intrusion detection system (IDS) is considered a “mandatory line of defense” against
the growth of intrusive activities. As a result, research in the IDS domain has gained traction over
the years to come up with better intrusion detection methodologies. The very first IDS was
introduced in 1980, and since then many other systems have been proposed [1]. Yet, many of these
systems still generate a lot of alerts for low risk and non-threatening situations, resulting in a high
false alarm rate. This creates huge security risks as it may cause malicious attacks and malignant
behaviour to be ignored. Therefore, recent research has shifted its focus to reducing false alarms
and generating higher detection rates [2].
This has opened doors to new areas of research, specifically in artificial intelligence, data
mining and machine learning. These fields have become subject to extensive research that
emphasises the improvement of detection accuracy, with the aim of proposing new systems to
tackle novel or zero-day attacks [8]. Machine learning is a subset of artificial intelligence and is
concerned with the automatic discovery of useful patterns from large datasets [2]. Machine learning
algorithms can be generalised enough to detect many variants of an attack; however, the success
of machine learning-based IDS depends highly on the quality of the training data available.
Consequently, numerous researchers working in this domain face an urgent need for quality
datasets to effectively apply machine learning models for intrusion detection. However, getting
suitable training data is a significant challenge in the cyber security domain, research community
and vendors [9]. As a result, the suitability and validity of existing datasets has been scrutinised
and thoroughly questioned for various reasons. Primarily, there is the issue of privacy, which keeps
real datasets from being shared by companies who suffered attacks, as this may expose
vulnerabilities. Other issues relate to anonymity and detachment from current trends, with most
of the current datasets lacking traffic and attack diversity [9], [10]. Additionally, with the
continuous change and improvements in technology and attack strategies, such datasets need to
be periodically updated [9].
Accordingly, the focal point of this study is to analyse the intrusion detection capacity of
a collection of IDS datasets. In particular, this study uses machine learning techniques to analyse
open DDoS attack data, a subject scarcely explored in previous works. This evaluation comprises
the analysis of these datasets on the basis of their performance in detecting intrusion and other
anomalous behaviour. The initial part of this study presents a review of literature on the subject,
focusing on two key areas: the growing use of machine learning and artificial intelligence for
intrusion detection; and, the state-of-the-art of IDS datasets, their characteristics and
shortcomings. This is followed by a CRISP-DM approach for the evaluation of the intrusion
detection performance of four DDoS datasets. The goal of this evaluation is to assess the capacity
of each of the datasets to classify DDoS traffic correctly with different learning models.
1.2 Proposed Solution and Research Goals
Machine learning and artificial intelligence research has flourished in the last few years, opening
new doors for intrusion detection technologies. However, data availability still greatly limits the
success of such technologies, as research faces a shortage of good quality IDS datasets. This study
bases itself on this persisting issue as it assesses the state-of-the-art of open datasets and their
ability to detect intrusion and harmful network traffic. In particular, this study focuses on
providing a comparison of intrusion detection performance of open DDoS attack datasets. DDoS
attacks are some of the most concerning due to the magnitude of damage that they are capable
of. Available literature on open DDoS datasets is fairly scarce in comparison to other forms of
attacks, hence, this study seeks to shed more light on the nature of existing DDoS data. The
proposed solution sees four DDoS datasets analysed using a set of machine learning algorithms.
Figure 1 illustrates the main concepts of this solution. The primary goal of this study is to assess
DDoS datasets and analyse their performance with regards to classification of network traffic
(malicious or benign).
Accordingly, this research seeks to:
● Analyse the current state of existing intrusion detection datasets, including characteristics
and shortcomings.
● Collect and process open DDoS datasets from reliable sources and review them based on
their qualities and features.
● Select the most suitable machine learning algorithms to assess the datasets and build
appropriate training models by labelling training instances according to the type of
network traffic, malicious or benign.
● Train, validate and test each dataset against the machine learning algorithms and generate
results for each.
● Evaluate the results of the supervised learning models using a set of performance metrics.
● Analyse the intrusion detection performance of each dataset based on the achieved results.
Figure 1: Overview of proposed solution.
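The train, validate, test and evaluate steps listed above can be sketched end to end with scikit-learn, the library the study itself uses for modelling (Table 13). The sketch below is illustrative only: the synthetic data stands in for a cleaned DDoS dataset, and the split ratio and default hyperparameters are placeholder assumptions, not the configuration used in the experiments.

```python
# Sketch of the evaluation pipeline: fit each of the six classifiers on a
# labelled dataset, then report the five metrics used in this study
# (accuracy, precision, recall, F-measure, computation time).
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Synthetic stand-in for a cleaned dataset (label: 0 = benign, 1 = malicious).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "k-NN": KNeighborsClassifier(),
    "SVM": SVC(),
    "naive Bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()          # computation-time metric
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f_measure": f1_score(y_test, y_pred),
        "time_s": time.perf_counter() - start,
    }

for name, m in results.items():
    print(f"{name}: acc={m['accuracy']:.3f} f1={m['f_measure']:.3f}")
```

Comparing the per-model rows of `results` across datasets is, in miniature, the comparison the study performs; `time.perf_counter` is used here simply as one reasonable way to capture computation time.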
1.3 Research Questions
In addressing the goal of this study, two research questions have been formed. The questions are
somewhat correlated, each focusing on one of the variables in this study, that is, open DDoS data
(RQ1) and the machine learning models (RQ2).
RQ1: What is the effectiveness of different open DDoS datasets in detecting intrusion and
malicious traffic?
RQ2: How does the performance of different supervised learning models compare with regards to
classification capacity and time efficiency?
1.4 Research Contribution
The main contribution of this work lies in the selection of datasets. While previous works focused
on the analysis and evaluation of a wide range of datasets, this work focuses on a specific type of
security attack, DDoS. Although DDoS attacks feature in previous works related to assessment of
IDS datasets, these were never considered as the main focal point. Previous works, such as that
done by Ring et al. [11], Sahu et al. [8], and Yavanoglu and Aydos [12] consider datasets with
various types of attacks. On the other hand, Thomas et al. [13] and Dhanabal and Shantharajah
[14] focus on one specific dataset, in this case DARPA and NSL-KDD respectively. In contrast,
this work considers multiple datasets to provide a critical review and evaluation of each and be
able to compare the different results achieved.
Moreover, this work takes on a more quantitative approach to evaluation, as opposed to
previous works [8], [11], [15] that provided qualitative analyses of IDS datasets. This study adopts
an analytical machine learning-based approach as the primary source of evaluation. This provides
a less theoretical and more observation-based evaluation, with the main method of assessment
being the comparison of the performance metrics. Previous studies do focus on the use of such
datasets in a machine-learning environment; however, this is done qualitatively, mostly with an
analysis of the features of the dataset.
Overall, this work aims to provide suggestions on the most appropriate algorithms to use
depending on the datasets available, and this, in itself, is another contribution of this study. This
work aims to be a potential guideline for machine learning-based detection of DDoS behaviour.
1.5 Limitations
This research is limited to the analysis of DDoS datasets in the context of machine learning. This
work includes the usage of several machine learning models to assess the detection performance
by these datasets. The datasets are the primary subject of this research, and although efficiency
and operational performance of the algorithms are measured, the optimisation of the algorithms
is not the goal of this work.
1.6 Document Structure
This section introduced the main ideas and concepts that will set the groundwork for this study.
The rest of this document is organised as follows. Section 2 provides some background on DDoS,
and different types and structures of this attack. Section 3 presents a review of literature on the
subject, tackling both the use of machine learning in this domain, as well as the availability and
characteristics of current datasets. Section 4 presents the methodology for this study. Section 5
presents the different datasets used in this study, with a detailed description on each. Section 6
includes a thorough explanation of the design and implementation of the experiment. Section 7
presents the results and main findings of this study. Finally, Section 8 concludes this work with a
retrospective analysis of the contributions of this study and potential future work.
2 Background
2.1 Overview of DDoS Attacks
DDoS stands for ‘distributed denial-of-service’ and refers to a type of DoS (denial-of-service)
attack that originates from multiple sources spread across different network locations. The main
motivation of DoS attacks is to severely slow or shut down a specific resource; one way of
operating is to exploit a system flaw, causing a processing failure or exhaustion of system
resources. Another way of attacking the victim system is by flooding and monopolizing the
network, thereby prohibiting anyone else from using it [16]. The prohibition of access to the
attacked computer or network is what categorises an attack as DoS, while the use of many
computer systems or services indicates a ‘distributed’ attack, known as DDoS. It is important to
note that the attack agents can be any device or resource that supports
the ability to have the attack code installed, including IoT devices, networked computers, servers
and weaponized mobile devices [17].
A typical architecture of a DDoS attack consists of four distinct elements: the real attacker; the
agent-controlling handlers or masters; the attack agents or zombie hosts responsible for packet
generation and forwarding towards the victim; and, finally, the target host victim. A successful
deployment of a DDoS attack can be described in four stages: recruitment, compromise,
communication and
attack. During the recruitment phase, the attacker selects the agents that will be used to carry
out the attack. In the compromise stage, the attack component is planted on the agents while
striving to hide itself from detection and deactivation. In the communication stage, the masters
inform the attacker that the attacking agents are ready to be deployed and carry out the
attack. The attack phase is the last stage and describes the initiation of the attack [7].
2.2 DDoS Taxonomy
There are multiple ways of describing the taxonomy and types of DDoS attacks, as sources from
both academia and industry approach the problem differently. Mirkovic and Reiher [18] divide
the different types into classes based on specific criteria that are
both technical and social, e.g. based on the impact of the attack. Some of the criteria are degree
of automation (DA), exploited weakness to deny service (EW), source address validity (SAV),
attack rate dynamics (ARD), possibility of characterization (PC), persistence of agent set (PAS),
victim type (VT) and impact on the victim (IV) [18].
On the other hand, NIST defines only two types of DDoS attacks: flaw exploitation and
flooding. The main distinction between the two lies in the medium used to perform the attack
and the end-system being attacked. In flaw exploitation, the target is the software of the system,
and the attack attempts to deplete its resources, such as memory, CPU, disk space or memory
buffers. In flooding attacks, the target is the networking capability of the victim: the attack
depletes the network capacity, making the resource inaccessible to legitimate users.
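As a toy illustration of the flooding category, a flood of traffic from compromised sources can be spotted by a simple rate check over a traffic window. The threshold and the example addresses below are arbitrary assumptions for illustration; they are not drawn from this study or from NIST.

```python
from collections import Counter

def flag_flood_sources(packets, threshold=100):
    """Return the set of source IPs whose packet count in one observation
    window exceeds an (arbitrary, illustrative) threshold.

    packets: iterable of source-IP strings seen in the window.
    """
    counts = Counter(packets)
    return {ip for ip, n in counts.items() if n > threshold}

# Example window: one flooding source among otherwise normal traffic.
window = ["10.0.0.5"] * 500 + ["10.0.0.7"] * 20 + ["10.0.0.9"] * 3
print(flag_flood_sources(window))  # only the flooding source is flagged
```

Real flooding attacks evade such naive per-source counting (e.g. via spoofed addresses or many distributed agents), which is precisely why the learning-based detection studied in this thesis is of interest.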
Another approach is to divide attacks based on the layer being targeted. Depending
on the attack vector that a DDoS attack engages, the target can differ with regard to the
components of its network connection. The most common shared framework used to
describe and decompose a single network connection is the OSI model. The different
components that comprise a network connection are called layers, and every layer of the OSI
model has a different purpose and engages in different activities. What follows is a
conceptual breakdown of this model, its layers and how they operate. In total, there
are seven layers, commonly labelled L1-L7 and briefly described in Table 1 with regard to their
functionality and potential attack vectors for a DDoS attack.
Table 1: OSI model describing the functionality of each distinctive layer.

Application Layer (L7): The layer the end-user most commonly interacts with. Many well-known
processes and protocols operate on this layer, such as HTTP(S), DNS, (S)FTP, SSH and email.
Common processes include user authentication and privacy-related functions.

Presentation Layer (L6): Transforms data into the form that the application layer accepts. It
also formats and encrypts data to be sent across a network. It is sometimes called the syntax
layer.

Session Layer (L5): Responsible for establishing, managing and terminating connections between
applications at each end of the communication.

Transport Layer (L4): Provides reliable transfer of data between end systems, performing error
checking and resending data when it has been corrupted, while ensuring quality of service through
end-to-end error recovery and flow control.

Network Layer (L3): Responsible for addressing and routing packets between hosts, including
fragmentation and reassembly of data.

Data Link Layer (L2): Decomposes data into frames and transmits them over the medium,
physical or wireless. Performs basic error detection and correction.

Physical Layer (L1): Converts bits into electrical or radio signals, enabling multiplexing so that
multiple channels can share the same medium, physical or wireless.
Based on this layer-differentiating approach, Cloudflare, the content delivery and DDoS mitigation
services provider, divides attacks into three types: application layer, protocol and volumetric
attacks. Application layer attacks refer to any attack that operates on the 7th layer of the OSI
model. Protocol attacks exploit weaknesses in the 3rd and 4th layers and cause disruption by
attacking resources such as firewalls and load balancers. Finally, volumetric attacks are similar to
NIST's flooding attacks, exhausting the network bandwidth between the attacked system and the
legitimate user by directing massive volumes of requests from compromised agents to the attacked
system.
3 Literature Review
This chapter presents the literature review that was conducted as part of this research. This review
opens with an overview of the research subject. This is followed by a review of previous work
related to the use of machine learning for intrusion detection. This review uncovers the
methodologies, algorithms and results achieved by other researchers. Then the focus shifts to the
datasets available to conduct such experiments. A thorough review is carried out on how different
researchers analysed existing IDS datasets. This section of the literature review pinpoints the
major characteristics and issues of the data available. To conclude, the research gap is identified,
and this sets the groundwork for the rest of this research.
This review is based on peer-reviewed and reliable sources dating from 2000 to 2019.
3.1 Overview
Threats of malware, attacks and intrusion have been around since the very conception of
computing. Yet, it was not until the sudden growth of the internet that awareness of security and
digital assets really started to pick up steam. The internet presented a new liability, as the ever-
increasing number of machines on the web provided a new goldmine for those seeking to exploit
the vulnerabilities. Presently, with global internet usage estimated at 4.4 billion users,
approximately 58% of the world’s population [17, 18], the risk of intrusion has grown
exponentially. The term intrusion refers to “any unauthorized access that attempts to compromise
confidentiality, integrity and availability of information resources” [19]. In general, any form of
malicious use of the internet, computer applications or information systems is labelled as intrusion.
Attackers, or intruders, prey on the vulnerabilities of weak computer systems and networks, with
the potential to cause serious harm to users, organisations and businesses [6].
Among various types of attacks, DDoS attacks are one of the biggest threats to internet sites and
pose a great risk to the security of computer systems, particularly because of their potential
impact. With no prior warning, DDoS attacks can cause devastating damage and rip the resources
of a system apart. The harm caused by these attacks has been thoroughly described in network
security literature [7]. A DDoS attack aims to render a network inoperable by targeting its
bandwidth or connectivity. The attacker achieves this by sending a stream of packets that halt
the processing capabilities of a network [20]. The University of Minnesota is reportedly the first-
ever victim of a large-scale DDoS attack, back in August 1999. This attack shut down the
University’s network for over 2 days [21].
In such an environment, intrusion detection systems (IDS) are an essential measure for network
security and defense. Intrusion detection encompasses all methods of detecting violations and
interruptions of a system's
regular behaviour [7]. In recent years, intrusion detection has expanded to new areas, such as
artificial intelligence and machine learning. Current research heavily invests in these fields as the
new way to detect and prevent network intrusion and keep a safe and secure network for systems
and their users. The following section explores this concept in detail and presents literature that
focuses on the use of Machine Learning as the main technology for intrusion detection and
prevention.
3.2 Use of Machine Learning in Intrusion Detection
Machine learning is becoming increasingly relevant in the field of intrusion detection. This section
gives an overview of how other researchers explored and tackled the issue of intrusion detection
through the application of machine learning. The primary takeaways of this section lie in
understanding how different algorithms are applied to intrusion problems, which algorithms are
most commonly used, and what results have been achieved with such techniques and
methodologies.
Machine learning is applied to cybersecurity primarily in three areas: anomaly detection,
intrusion detection and misuse detection [22]. Sofi et al. [23] analysed how machine learning
methodologies can be used to detect and analyse modern forms of DDoS attack. The researchers
collected a new dataset, comprising 27 features and 5 classes, containing present-day forms of
attack aimed at the application and network layers. The authors applied four machine learning
algorithms, namely decision trees, naïve Bayes, support vector machines (SVM) and multi-layer
perceptron (MLP), to the gathered dataset to categorise DDoS attack forms such as SIDDOS,
HTTP-Flood, Smurf and UDP-Flood. Their research concluded that the MLP classifier attained
the highest precision [23].
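The comparison reported by Sofi et al. [23] can be sketched in a few lines of scikit-learn. The snippet below is an illustrative sketch only: the original 27-feature, 5-class dataset is not reproduced here, so synthetic data from make_classification stands in for it, and the resulting scores are illustrative rather than the figures reported in [23].

```python
# Hedged sketch of a four-classifier comparison on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in shaped like the dataset in [23]: 27 features, 5 classes.
X, y = make_classification(n_samples=2000, n_features=27, n_informative=10,
                           n_classes=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "MLP": MLPClassifier(max_iter=500, random_state=42),
}
# Fit each model and record its held-out accuracy.
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Running the same comparison on the actual dataset would only require swapping the synthetic X, y for the real feature matrix and labels.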
Similarly, Sharma et al. [24] carried out a literature review on how machine learning techniques
can be applied to detect DDoS attacks. They examined common machine learning techniques
used for DDoS detection, including decision trees, SVM, naïve Bayes, artificial neural networks
(ANN), k-means clustering, fuzzy logic and genetic algorithms. They concluded that network
attacks are extremely hazardous and that existing IDS/IPS are insufficient against the modern-day
attacks impinging on networks [24].
On the other hand, Zekri et al. [25] propose a DDoS detection architecture based on the C4.5
algorithm to lessen the threat of DDoS. The researchers selected other machine learning
techniques to validate their proposed system and compared the results obtained. They further
used a naïve Bayes classifier for anomaly detection, while Snort was applied for signature-based
detection [25].
Dewan Md. Farid et al. [26] proposed a learning algorithm that detects anomalies by
differentiating normal patterns from attacks, and also recognises diverse forms of intrusion by
means of a decision tree algorithm. Yi-Chi Wu et al. [27] took a similar approach and formulated
a DDoS-detection system centred on a decision tree. In addition to the detection of an attack, the
authors also linked the locations of the attacker through a traffic-flow pattern-matching method.
The researchers implemented a C4.5 classifier to identify DoS attacks [27]. Furthermore, Andhare
and Patil [28] designed rules by means of a genetic algorithm-based method for the detection of
DoS attacks on the system [28].
In contrast, Aamir and Zaidi [29] implemented a systematic flow of feature engineering and
machine learning for detecting DDoS attacks. The results of the analysis indicated that
considerable feature reduction is possible, making DDoS detection faster and better with only a
trifling performance hit. For their case study of DDoS datasets, the researchers found that the
k-nearest neighbour algorithm generally demonstrates the best performance, followed by
SVM techniques. When applying a random forest algorithm, datasets with fewer dimensions and
discrete features perform better than high-dimensional datasets with numerical
features.
3.3 Availability of Good Datasets
The application of machine learning in IDS demands good quality datasets. This section expands
on this notion by firstly analysing the intricacies of existing datasets. Literature is reviewed in
order to explore the different characteristics of such datasets and how this affects their validity in
a machine learning approach. Secondly, this section presents a review of previous research that
tackled the quality and validity of these datasets, with a focus on work that itself reviews and
surveys such datasets. Accordingly, this work seeks to understand how other researchers tackled
the issue of good datasets and to analyse the methodologies used to do so.
3.3.1 Issues with Current Datasets
The relevance of the results achieved with such techniques relies on the quality of the datasets
employed, as these are vital for a realistic evaluation [30]. The validity of current datasets
has been thoroughly questioned in the cybersecurity space. It is a challenge for many researchers
to find appropriate datasets to validate and test their methods [31], and having a suitable dataset
is a significant challenge in itself [32]. Privacy is a huge setback for the availability of these
datasets, as they contain sensitive information. On the rare occasions that such datasets are made
available, they are heavily anonymised or obsolete. The unavailability of such datasets and the
absence of certain statistical
characteristics remains one of the major challenges for anomaly-based intrusion detection [10],
[32].
The cybersecurity community continuously strives to tackle this problem as numerous intrusion
detection data sets have been published over the last years, such as the UNSW-NB15 data set
[33], published by the Australian Centre for Cyber Security and the CIDDS-001 dataset [34],
published by the University of Coburg. Das and Morris [22] conduct a thorough analysis on the
necessity of data for machine learning methodologies, stating that a researcher needs an in-depth
comprehension of the data set prior to undertaking any form of analysis. They went further to
explain why raw data, including NetFlow, packet capture (PCAP) and other network data, may
not be directly usable for machine learning analysis, because the data has to be processed before
being used in standard machine learning applications. Thus, to use machine learning procedures
on conventional systems, the individual will need to comprehend the data collection methodologies
and the approaches needed for pre-processing the data [22].
Nehinbe [9] expands on previous research conducted by Ghorbani et al. [35] to identify the current
issues with evaluative datasets. Some of the findings are summarised below.
Data privacy issues. The nature of the datasets brings about data privacy issues due to certain
security policies, the sensitivity of the data and the potential risk from disclosing such information.
Moreover, there are other trust factors that inhibit realistic data from being shared among industry
stakeholders and researchers. As a result, organisations often choose to not disclose the outcomes
of computer attacks. Therefore, most researchers do not use realistic data when conducting their
own studies [9].
Getting approval from the data owner. Getting access to real datasets often requires approval
from the owner of the dataset. Some data owners, such as CAIDA (the Cooperative Association
for Internet Data Analysis) [36], require users to sign Acceptable Use Policies (AUP), restricting
the duration of use and the publication of information [35]. In addition, approval from the owner
of the dataset may result in a highly
bureaucratic process, with which the researcher might not gain access in time, as approval
processes are often delayed [9].
Different research objectives. The objectives of a study and the methodologies applied are among
some of the most important factors that influence the researcher when it comes to choosing suitable
datasets to evaluate models designed to investigate intrusion detection. For instance, McHugh
[37] highlights some significant issues with the NSL-KDD dataset [38], which is meant to be an
enhanced version of pre-existing datasets (KDD ’99 [39], KDD ’98 [40]). The author discusses how
the same issues that existed with the previous versions of this dataset persisted with the newer,
supposedly enhanced version [37]. Moreover, researchers often tweak the datasets through data
processing and cleaning to “lessen the challenges in matching data with the objectives of the
studies” [41].
Problem of documentation. Most IDS datasets that are available for the perusal of researchers
lack sufficient documentation. These datasets have little to no information about the network
environment in which they are simulated, the type of intrusions simulated, the goal of the
intruders, the details of the operating systems of both attacking and victim machines, and other
significant information that might impact the study [9].
3.3.2 Related Work
Numerous studies have been carried out with the aim to analyse the relevance and quality of
available IDS datasets. Malowidzki et al. [15] review the current situation with regards to publicly
available IDS datasets and provide suggestions on certain processes that should act as base
principles for a good dataset. The authors also suggest variants for data preparation and highlight
the aspects that result in a high quality, reliable dataset [15]. Koch et al. [31] also provide an
evaluation of IDS datasets, spreading across 13 different data sources. The authors analyse these
datasets on the basis of 8 data attributes. This work also provides a detailed analysis of current
security systems and investigates their shortcomings [31].
Taking a slightly different approach, the work of Thomas et al. [13] analyses one specific dataset,
DARPA [42], and investigates its use in intrusion detection. The authors conclude that the
DARPA dataset has the potential to model attacks that commonly appear on network traffic, and
therefore it can be considered as “the baseline of any research” [13]. Similarly, the work of Dhanabal
and Shantharajah [14] investigates the application of the NSL-KDD dataset in intrusion detection.
The authors study the effectiveness of the NSL-KDD dataset in detecting network traffic
anomalies using various classification algorithms. This work uses J48, SVM and naïve Bayes
algorithms to assess the dataset and concludes that the J48 algorithm produces the best accuracy
results [14].
Sharafaldin et al. [10] present a more exhaustive analysis of IDS datasets when compared to other
dataset studies that focus more on providing a high-level overview. The authors analyse 11 IDS
datasets and compare them with respect to 11 properties. This study also presents a framework
for the creation of new IDS datasets [10]. Bhuyan et al. [43] briefly describe and compare a large
number of network anomaly detection methods and systems. In addition, the authors discuss tools
for network defenders and datasets that researchers in network anomaly detection can use [43].
Similarly, Nisioti et al. [44] discuss 12 IDS datasets and provide a critical evaluation of
unsupervised techniques for intrusion detection.
Yavanoglu and Aydos [12] compare the most common datasets for artificial intelligence and
machine learning techniques [45]. Similarly, Ring et al. [11] take on the analysis of multiple
datasets. This work identifies 15 different attributes to analyse the applicability of individual
datasets for specific evaluation scenarios. Based on these properties, the authors also provide an
overview of existing datasets [11].
3.4 Research Gap
The safeguarding of networks and computer-based applications has been subject to extensive
research throughout the years. With the explosive growth of internet usage, the need for secure
environments has become more and more critical. Intrusion detection is becoming a quintessential
measure for network security and defense. Lately, research has taken a new turn in this respect,
with a surge in artificial intelligence and machine learning research for intrusion detection.
Researchers are investing heavily in this field and consequently require good quality datasets in
order to be able to evaluate their models. The availability of these datasets has also been
thoroughly discussed in literature, with emphasis on the characteristics and shortcomings of these
datasets [9], [22], [35].
Numerous works in literature focused on the analysis and evaluation of a wide range of
datasets. Specifically, the work published by Yavanoglu and Aydos [12] and Ring et al. [11]
considers multiple datasets that are commonly used in machine learning scenarios. In both cases,
the researchers do not focus on a specific type of security attack. On the other hand, Thomas et
al. [13] carry out a narrower, more focused evaluation of the DARPA dataset. The goal of their
work was to assess the potential of the DARPA dataset for intrusion detection. A similar
approach was taken by Dhanabal and Shantharajah [14]. The authors focused on analysing the
applicability of the NSL-KDD dataset in intrusion detection models. The dataset is assessed
against three prominent machine learning algorithms, with best results being achieved with J48.
Most of the previous work takes a rather qualitative approach to assessing IDS datasets,
with some exceptions [14]. A significant portion of the literature in this respect focuses on
assessing the quality of the data from a descriptive standpoint, analysing various criteria, most
of which are in line with the work of Nehinbe [9].
This study, although building on previous works, analyses and evaluates some new
concepts in this field. Firstly, the primary focus of this work is DDoS attacks. Previous research
on IDS datasets seldom focuses on datasets for one specific type of security attack. Although there
were many instances where DDoS attacks were featured in IDS dataset research, these were never
considered the main focal point. Moreover, this work departs from that of Dhanabal and
Shantharajah [14] by taking a more analytical approach to evaluation and using
multiple machine learning algorithms to assess the datasets. Although similar methodology is used,
there are two aspects that distinguish this work from that done by Dhanabal and Shantharajah
[14]. Firstly, multiple datasets are used, as opposed to analysing just one. And secondly, there is
a specific security attack under evaluation.
Overall, this work aims to provide suggestions on the most appropriate algorithms to use
depending on the datasets available, and this in itself, is another contribution of this study. This
work aims to be a potential guideline for machine learning-based detection of DDoS behaviour.
4 Research Methodology
4.1 Overview
Given the nature of the study, CRISP-DM (CRoss Industry Standard Process for Data Mining)
is chosen as the basis for the research methodology. CRISP-DM is widely used in projects that
involve machine learning and data analytics. It is both technology and industry agnostic,
and it defines a systematic way to carry out data mining projects. This framework aims to reduce
the cost of large-scale data projects, while increasing maintainability and efficiency of such
projects. CRISP-DM is a hierarchical model, involving four levels of abstraction: phases, generic
tasks, specialised tasks, and process instances [46]. Figure 2 illustrates this hierarchy.
Figure 2: The four tiers of CRISP-DM. Reproduced from [46].
At the highest tier, the data mining process is split into a number of phases, where each phase
comprises a set of second tier generic tasks. The second tier is a generalised representation of all
the possible solutions to a given data mining problem, where the tasks should be complete and
stable. That is, tasks should be complete in nature to cover the entirety of the data mining process,
and stable enough to account for any unforeseen developments in the process. The third tier puts
the general tasks under the microscope for a more granular view and divides them in specific tasks
that outline the actions for specific scenarios. The fourth tier represents all the actions and
outcomes of a particular data mining project. All process instances in the fourth tier are defined
according to tasks in higher tiers, however, these represent actual events, rather than generalised
ones [46].
4.2 Life Cycle
The hierarchical reference model of the CRISP-DM framework presents the lifecycle of a data
mining project, containing phases, tasks and results. The lifecycle of such projects consists of six
phases, as presented in Figure 3. The data mining process is not a rigid one. The arrows represent
the most common dependencies between phases, however, the sequence in which the phases are
carried out is entirely dependent on the nature of the project and the outcome of each phase [46].
Figure 3: The six-phase life cycle of a data mining project. Reproduced from [46].
Below is a brief explanation of each step of the data mining life cycle [46].
Business Understanding. A data mining project starts with a discovery phase, where the focus is
to define and understand the business problem, requirements and objectives. These are then
translated to a data mining problem and a plan to satisfy the requirements and objectives.
Data Understanding. The second understanding phase involves data collection, followed by a set
of activities and tasks to understand the nature of the data. The aim of these activities is to
familiarise oneself with the data, discover preliminary insights, identify valuable subsets, and
uncover any data quality issues. This phase is closely tied with the business understanding phase,
as the formulation of a plan requires good understanding of the data in question.
Data Preparation. The third phase consists of various tasks that focus on converting the raw
data collection into a final dataset. The nature and order of tasks may vary, and some tasks may
even be performed multiple times, depending on the state of the raw data. Some of the tasks
include data cleaning, feature selection and data transformation.
Modelling. In the fourth phase, the appropriate modelling techniques are chosen and applied to
the data. Typically, the parameters of these models are calibrated to achieve optimal performance.
This phase is closely tied with data preparation, as modelling may uncover new issues with the
data. In addition, the way the data is prepared can lead to the use of different models.
Evaluation. In the evaluation phase, the models applied in the previous phase are thoroughly
evaluated and reviewed. During this phase, the tasks carried out are assessed against the planned
objectives, to ensure that all business requirements have been considered and met. Moreover, the
models are tested for generalisation against unseen data. At the end of the evaluation phase, there
needs to be a clear understanding of how the data mining results should be applied.
Deployment. The resulting knowledge is organised and presented to the end-user. The tasks of the
deployment phase highly depend on the data mining project. The outcome can range from a simple
report of results, to a more complex implementation of a continuous data mining process.
4.3 Implementing CRISP-DM
This section presents an overview of how the CRISP-DM methodology is applied in this study.
The data mining process is described in further detail in Section 6.
Business Understanding. A thorough review of literature is carried out to analyse three key areas:
the use of machine learning in the context of intrusion detection; the state-of-the art of IDS
datasets, their characteristics and shortcomings; and, a review of recent works on the validity of
existing IDS datasets. This lays the groundwork for the problem definition and objective of this
study, that is, to analyse the intrusion detection performance of DDoS datasets.
Data Understanding. A total of four datasets are collected for this study. An in-depth review of
each of these datasets is presented in Section 5. Each of these are analysed to gain familiarity
with the feature set and assess the quality of the data. This analysis is crucial to determine whether
the data is a right fit for the objectives of this study.
Data Preparation. The four datasets are prepared for modelling in a systematic manner. This
phase involves several tasks, including handling missing data, decoding undefined data,
transforming data types as required by the models, and transforming class labels to generate
homogeneous labels across all datasets. The datasets are split into three subsets for training,
validation and testing of the models.
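The preparation steps listed above can be sketched as follows. The column names, label values and split ratios in this snippet are hypothetical stand-ins, since the actual feature sets differ between the four datasets.

```python
# Illustrative sketch of the data preparation phase on a toy frame;
# the feature names and labels here are hypothetical stand-ins.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Flow Duration": [120.0, np.nan, 310.0, 95.0, 40.0, 210.0],
    "Total Fwd Packets": ["4", "7", "2", "9", "3", "5"],  # stored as strings
    "Label": ["BENIGN", "DDoS-SYN", "DDoS-UDP", "BENIGN", "DDoS-SYN", "BENIGN"],
})

# 1. Handle missing data (here: drop rows with missing feature values).
df = df.dropna()

# 2. Transform data types as required by the models.
df["Total Fwd Packets"] = df["Total Fwd Packets"].astype(int)

# 3. Homogenise class labels across datasets (binary benign/attack).
df["Label"] = np.where(df["Label"] == "BENIGN", "Benign", "DDoS")

# 4. Split into training, validation and test subsets (e.g. 60/20/20).
train, rest = train_test_split(df, test_size=0.4, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)
print(len(train), len(val), len(test))
```

The same sequence of steps applies to each dataset, with only the column-specific handling changing.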
Modelling. Six different supervised learning models are selected to analyse the datasets. The
selection of the models is based on specific criteria, namely: having parametric and nonparametric
models; using algorithms from different categories; and, applying models that are also commonly
used in previous works and literature. These criteria are further explained in Section 6.3.1. The
models are trained with each of the datasets.
Evaluation and deployment. The models are validated to ensure that they are generalised for
unseen data. The models are then tested using new data from the testing set and performance
metrics are generated, including rate of accuracy, precision, recall and F-measure. These
performance evaluation metrics are explained in Section 7.1. Models are also evaluated on training
efficiency, that is, the time taken for the model to train. The outcome of this phase is an analysis
of the intrusion detection performance of each dataset.
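As a minimal sketch of this evaluation step, the snippet below trains a single classifier on synthetic binary data and reports the four metrics named above together with the wall-clock training time; the data and model choice are placeholders, not the actual experimental setup.

```python
# Hedged sketch of the evaluation phase: fit time plus the four metrics.
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Synthetic binary data stands in for a benign/DDoS labelled dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0)
start = time.perf_counter()
model.fit(X_train, y_train)          # training efficiency: wall-clock fit time
train_time = time.perf_counter() - start

y_pred = model.predict(X_test)
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
print(metrics, f"fit time: {train_time:.4f}s")
```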
5 DDoS Datasets Review
For this experiment, a total of four datasets were collected and tested, namely CICDDoS2019 [47],
CSE-CICIDS2018 on AWS [48], NDSec-1 [49] and CICIDS2017 [50]. All datasets are based on
simulated data and are dated between 2017 and 2019. Selecting datasets for this study was in
itself a challenge due to the shortage of DDoS-specific datasets, despite DDoS being one of the
most devastating security attacks. Moreover, all the chosen datasets are recently dated, ensuring
that all instances and features are relevant and up to date.
5.1 CICDDoS2019
CICDDoS2019 contains benign traffic and recent DDoS attacks, resembling real-world data
(PCAPs). It also includes the results of network traffic analysis using CICFlowMeter-V3, a
network traffic flow generator and analyser [51], with labelled flows. The B-Profile system [47]
was used to profile the abstract behaviour of human interactions and generate naturalistic benign
background traffic. For this dataset, the abstract behaviour of 25 users was constructed based on
the HTTP, HTTPS, FTP, SSH and email protocols [47]. The dataset includes different modern
reflective DDoS attacks such as Port Map, NetBIOS, LDAP, MSSQL, UDP, UDP-Lag, SYN,
NTP, DNS and SNMP. The capturing period for the training day, January 12th, ran from 10:30
to 17:15, and for the testing day, March 11th, from 09:40 to 17:35. Attacks were executed during
these periods.
Table 2: OS specification and machine IPs for CICDDoS2019. Adapted from the DDoS Evaluation Set [47].

Machine            OS                          IP address
Server             Ubuntu 16.04 (web server)   192.168.50.1 (first day), 192.168.50.4 (second day)
Firewall           Fortinet                    205.174.165.81
PCs (first day)    Win 7                       192.168.50.8
                   Win Vista                   192.168.50.5
                   Win 8.1                     192.168.50.6
                   Win 10                      192.168.50.7
PCs (second day)   Win 7                       192.168.50.9
                   Win Vista                   192.168.50.6
                   Win 8.1                     192.168.50.7
                   Win 10                      192.168.50.8
Refer to Appendix A for a timed breakdown of the attacks.
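As an aside on working with such labelled flow exports, the sketch below loads a few rows shaped like CICFlowMeter output and computes the share of attack flows; the column names and label strings are assumptions for illustration, since the exact CSV layout varies between releases of the dataset.

```python
import io
import pandas as pd

# A tiny stand-in for a CICFlowMeter CSV export; real files hold many
# more columns (around 80 flow features) and far more rows.
csv = io.StringIO(
    "Flow Duration,Protocol,Label\n"
    "1200,6,BENIGN\n"
    "80,17,DrDoS_NTP\n"
    "95,17,DrDoS_DNS\n"
)
flows = pd.read_csv(csv)

# Class balance is the first thing to inspect in a labelled flow file.
print(flows["Label"].value_counts())
attack_share = (flows["Label"] != "BENIGN").mean()
print(f"attack share: {attack_share:.2f}")
```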
5.2 CSE-CIC-IDS2018 on AWS
In the CSE-CIC-IDS2018 dataset, profiles were used to generate data in a systematic manner,
which contained detailed descriptions of intrusions and abstract distribution models for
applications, protocols, or lower level network entities. These profiles can be used by agents or
human operators to generate events on the network. Due to the abstract nature of the generated
profiles, they are applicable to a diverse range of network protocols with different topologies [48].
Profiles can be used together to generate a dataset for specific needs. Two distinct classes of
profiles were built:
B-profiles: Encapsulate the entity behaviours of users using various machine learning and
statistical analysis techniques. The encapsulated features are distributions of packet sizes of a
protocol, number of packets per flow, certain patterns in the payload, size of payload, and request
time distribution of a protocol. The following protocols were simulated: HTTPS, HTTP, SMTP,
POP3, IMAP, SSH, and FTP.
M-Profiles: Attempt to describe an attack scenario in an unambiguous manner. In the simplest
case, humans can interpret these profiles and subsequently carry them out. Ideally,
autonomous agents along with compilers would be employed to interpret and execute these
scenarios.
The datasets comprise various types of attacks, including DoS, Infiltration, DDoS and Brute force.
For the purpose of this study, only DDoS attacks are considered, as described in Table 3.
Table 3: Specification of tools and duration of the DDoS attack for CSE-CIC-IDS2018 on AWS [48].

Tools:    Low Orbit Ion Cannon (LOIC) for UDP, TCP or HTTP requests
Duration: Two days
Attacker: Kali Linux
Victim:   Windows Vista, 7, 8.1, 10 (32-bit) and 10 (64-bit)
5.3 NDSec-1 (Botnet)
The NDSec-1 dataset incorporates traces and log files of cyber-attacks synthesized within the
facilities of the Network and Data Security Group at the University of Applied Sciences in Fulda,
Germany. The need for such a dataset came about as a result of the absence of publicly available
captures containing a broad range of different attack footprints, either to benchmark existing
intrusion detection systems or to support network security research in designing new detection
engines [34]. Using state-of-the-art tools, three distinct attack scenarios were performed, namely
Watering Hole, Bring-Your-Own-Device (BYOD) and Botnet. This study considers the Botnet
attack scenario.
The rental of botnets operated by cyber crews is a lucrative business in the underground economy.
Hence, these illicit infrastructures are increasingly gaining popularity. This trend matters for
enterprises and organizations because essentially any host on a legitimate network may serve as a
bot, and thus has the potential to become part of a criminal act once infected. Citadel 1.3.5.1 was employed in this
scenario, as a revised version of the well-known Zeus botnet. Based on a normal operating network,
three legitimate hosts were infected with Citadel binaries. This task could be performed through
conventional email spam using the recent vulnerabilities CVE-2015-2509 (Windows Media
Center), CVE-2015-5122 (Flash Player), and a rogue download caused by XSS placed on a website
in the simulated Internet [49].
After the infection, all three bots communicated via HTTP to a prepared bot master. Among
several traffic footprints between master and bots, all bots were instructed to download new
commands. These contained hostile payload to perform a DDoS via SYN flooding to a single
destination outside the network. Besides this successful attack, two of the bots stole local
configuration files and transferred them to an external FTP server [49].
5.4 CICIDS2017
The CICIDS2017 dataset was the product of a five-day simulation, starting on a Monday
and finishing on Friday, and includes network traffic in two formats, packet and bidirectional
flow. For each flow, 80 attributes were extracted, which further include extra metadata about
the multiple simulated attacker IP addresses and the attacks. Scripts were used to simulate
default user behaviour within normal bounds. The first day is considered normal and the
traffic included is benign only. The simulated attacks include DDoS data; the attack scenario
is described in Table 4 [50].
Table 4: Details of DDoS Attack for CICIDS2017.
Attack scenario: DDoS (LOIT)
Victim: Ubuntu 16, 205.174.165.68
Attacker IPs: 205.174.165.69, 205.174.165.70, 205.174.165.71
6 Experiment Implementation and Design
This section outlines the details of the design and implementation of the proposed solution. The
solution is implemented in Python 3. Firstly, an overview of the solution is presented, briefly
describing the phases of this implementation. Section 6.2 describes the data preparation process,
including details on data cleaning and transformation, and dataset splitting. Section 6.3 presents
the modelling process, with a detailed account of the training, validation and testing processes.
Section 6.4 concludes with an overview of the evaluation procedure, including a summary of the
performance metrics used to analyse the intrusion detection performance of the DDoS datasets.
6.1 Overview
Figure 4 presents a flow chart of the supervised learning process adopted in this study, as part of
the proposed solution. Firstly, DDoS raw data is collected from open sources. A total of four
datasets are collected. These are described in detail in Section 5. After collection, data is processed
to construct the final datasets for modelling. Data processing includes data cleaning,
transformation of data types, and dataset splitting. The splitting process is described in 6.2.2.
This is followed by the model selection process. Models are selected based on the criteria
highlighted in Section 6.3.1. The models are trained with all four datasets using six different
algorithms; k-nearest neighbour, SVM, naïve Bayes, decision tree, random forest, and logistic
regression. The model is validated using k-fold cross validation and retrained. Finally, the model is tested
with unseen data. The results are evaluated using several performance metrics, as described in
Section 6.4.
Figure 4: Workflow of supervised learning process
6.2 Data Preparation
6.2.1 Data Cleaning and Transformation
Missing data. Handling missing data is vital in machine learning, as it could lead to incorrect
predictions for any model. Accordingly, null values are eliminated by propagating the last valid
observation forward along the column axis. This is implemented using the fillna method from
the pandas library [52], as shown below.
data.fillna(method='ffill', inplace=True)
Undefined Data. The elimination of null values by forward propagation can still leave undefined
data. A null field with no preceding valid observation remains NaN after propagation, since there
is no earlier cell to provide a value. Consequently, these values are decoded into 0. This is also done using the fillna method [52].
data=data.fillna(0)
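Taken together, the two cleaning steps can be sketched on a toy frame. The column names below are hypothetical, purely for illustration; ffill() is the method-call equivalent of the fillna(method='ffill') call shown above.

```python
import numpy as np
import pandas as pd

# Toy flow records with missing values (hypothetical columns for illustration).
data = pd.DataFrame({
    "flow_duration": [np.nan, 12.0, np.nan, 7.5],
    "packet_count":  [3.0, np.nan, np.nan, 9.0],
})

# Step 1: propagate the last valid observation forward down each column
# (equivalent to data.fillna(method='ffill', inplace=True)).
data = data.ffill()

# Step 2: cells with no earlier value to propagate are still NaN; decode them into 0.
data = data.fillna(0)

print(data.values.tolist())  # [[0.0, 3.0], [12.0, 3.0], [12.0, 3.0], [7.5, 9.0]]
```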
Transformation. The format of the collected data might not be suitable for modelling. In such
cases, data and data types need to be transformed so that the data can then be fed into the
models, as described by the CRISP-DM method. Accordingly, some data features were
transformed into numeric or float, since models do not perform well with strings, or do not perform
at all.
Class Labels. Each dataset instance represents a snapshot of the network traffic at a given point
in time. These instances are labelled according to the nature of the traffic, that is, whether the
traffic is benign or malicious. The labels across the four datasets vary, therefore they are encoded
to have homogeneity in the class labelling system. Classification is binary, where benign traffic is
labelled as NORMAL, and malicious traffic is labelled as ATTACK. Table 5 summarises the
classification system.
Table 5: Labelling system for binary classification.
Label Scenario
NORMAL Traffic is benign
ATTACK Traffic is malicious
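A minimal sketch of this encoding, assuming hypothetical raw label strings (the actual strings vary per dataset): only benign traffic is mapped to NORMAL, everything else to ATTACK.

```python
import pandas as pd

# Hypothetical raw labels as they might appear in the source files.
raw = pd.Series(["BENIGN", "DDoS", "Benign", "DrDoS_DNS"])

# Binary encoding per Table 5: benign -> NORMAL, malicious -> ATTACK.
labels = raw.apply(lambda s: "NORMAL" if s.upper() == "BENIGN" else "ATTACK")
print(labels.tolist())  # ['NORMAL', 'ATTACK', 'NORMAL', 'ATTACK']
```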
6.2.2 Volume and Class Distribution
Following the thorough preparation of the data, some descriptive information is generated for each
set, specifically: (1) the volume of records in each set; and (2) the distribution of classes. Table 6
presents the volume of records for each dataset that is used in this study. Further, the sections
below give an account of the class distribution, including amount and percentage.
Table 6: Volume of records for the DDoS attack datasets.
Dataset No. of Records
CICDDoS2019 294,627
CSE-CIC-IDS2018 1,046,845
CICIDS2017 225,745
NDSec-1 5,838
6.2.2.1 CICDDoS2019
In the CICDDoS2019 dataset, there were 121,980 (41.4%) records classified as normal traffic and
172,647 (58.6%) classified as attack traffic.
Figure 5: Bar chart showing the distribution of traffic types in the CICDDoS2019 dataset.
6.2.2.2 CSE-CIC-IDS2018 on AWS
In the CSE-CIC-IDS2018 dataset, there were 360,833 (34.5%) records classified as normal traffic
and 686,012 (65.5%) classified as attack traffic.
Figure 6: Bar chart showing the distribution of traffic types in the CSE-CIC-IDS2018 dataset.
6.2.2.3 CICIDS2017
In the CICIDS2017 dataset, there were 97,718 (43.3%) records classified as normal traffic and
128,027 (56.7%) classified as attack traffic.
Figure 7: Bar chart showing the distribution of traffic types in the CICIDS2017 dataset.
6.2.2.4 NDSec-1
For this study, a subset of the NDSec-1 dataset is considered, that is, DDoS Botnet attack data.
The other instances of attack (BYOD and Watering Hole) are not taken into account for the
evaluation. In this subset, there were 3,508 (60.1%) records classified as normal traffic and 2,330
(39.9%) classified as attack traffic.
Figure 8: Bar chart showing the distribution of traffic types in the NDSec-1 dataset.
6.2.2 Splitting Datasets
A key characteristic of a good learning model is its ability to generalise to new, or unseen, data.
A model which is too close to a particular set of data is described as overfit, and therefore, will
not perform well with unseen data. A generalised model requires exposure to multiple variations
of input samples. Primarily, models require two sets of data, one to train and another to test. The
training data is the set of instances that the model trains on, while the testing data is used to
evaluate the generalisability of the model, that is, the performance of the model with unseen data.
The train/test split can yield good results; however, this approach has some drawbacks. Although
splitting is random, it can happen that the split creates imbalance between the training and the
testing set, where the training set has a large number of instances from only one class. In such
cases, the model fails to generalise and overfits.
To mitigate this, the datasets are split into three subsets; training, validation and testing.
This split is done in a 60:20:20 ratio, for training, validation and testing respectively. The
train_test_split helper method from the scikit-learn library [53] is used for the split, as
presented in the code snippet below. With this approach, training is done in two phases, with the
training and the validation sets. Firstly, the training set is used to train the model. Then, the
validation set is used to estimate the performance of the model on unseen data (data that the
model is not trained on). For the purpose of this study, validation is done using a stratified k-fold
approach. The k-fold validation method is described in Section 6.3.3.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=100)
X_val, X_test, y_val, y_test = train_test_split(
    X_test, y_test, test_size=0.5, random_state=100)
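Applied to a toy array of 100 rows, the two train_test_split calls above yield the 60:20:20 proportions described earlier (the feature values here are arbitrary, for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # 100 arbitrary samples
y = np.array([0, 1] * 50)

# First split: 60% train, 40% held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=100)
# Second split: halve the held-out 40% into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_test, y_test, test_size=0.5, random_state=100)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```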
6.3 Modelling
The classification phase comprises two aspects: (1) the construction of the learning model, and
(2) the generation of the predicted labels. These tasks are implemented using scikit-learn, a Python
library for data mining, data analysis and machine learning.
6.3.1 Models Selection
This study features the testing and training of six different classification methods; namely, k-
nearest neighbour, SVM, naïve Bayes, decision tree, random forest and logistic regression. These
algorithms were selected on the basis of three criteria: (1) to have a mix of parametric and nonparametric
algorithms; (2) to have a range of algorithms from different categories; and (3) to use algorithms
which commonly feature in previous works.
6.3.1.1 Parametric vs Nonparametric algorithms
A parameter can be loosely described as a pre-defined attribute of the data. A parametric
algorithm possesses a fixed number of parameters. While a parametric algorithm is
computationally more efficient, it makes stronger assumptions about the dataset. This would be
ideal if the assumptions are correct. However, parametric algorithms perform poorly with incorrect
assumptions [54]. In this study, the parametric algorithms used are SVM, naïve Bayes and logistic
regression.
In contrast, non-parametric algorithms are more flexible. In nonparametric scenarios, as
the algorithm learns, the number of parameters grows. This type of algorithm performs slower
computations; however, it makes far fewer assumptions about the dataset [54]. The nonparametric
algorithms used in this study are k-nearest neighbour, decision tree and random forest.
6.3.1.2 Categories of Algorithms
This section highlights the different types of algorithms used in this study.
Instance-based. Instance-based learning methods are “conceptually straightforward approaches to
approximating real-valued or discrete-valued target functions” [52, p. 230]. These algorithms learn
by storing the training data that they are presented with. When a new instance is presented, this
is compared to previous instances and classified according to similarity [54]. The k-nearest
neighbour used in this study is an instance-based algorithm.
Kernel method. Kernel methods are based on kernel functions. Given the right conditions of
symmetry, kernel functions essentially define an instance in a high-dimensional space. Using the
kernel method, the original instance is replaced with a kernel to extend algorithms such as SVM
[55], which is the model used in this study.
Bayesian. Bayesian reasoning assumes that “quantities of interest are governed by probability
distributions” [52] and that accurate decisions can be made when adopting these probabilities on
new data. Every instance in a training set can decrease or increase the likelihood that a hypothesis
is correct [54]. This study adopts the naïve Bayes algorithm.
Decision Tree. In decision tree learning discrete values are used to represent target functions which
are themselves represented with a decision tree. It is one of the most popular learning methods
used in inductive inference and has been applied to multiple real-case scenarios ranging from
medical to credit risk [54].
Ensemble Methods. In ensemble methods, several base models are combined with the purpose of
producing one optimal predictive model. Within this machine learning technique, multiple models
are created and then combined to improve the results. It is commonly understood that ensemble
methods produce more accurate solutions relative to the results a single model would
produce [56]. In this study, the ensemble method used is the random forest algorithm.
Regression. In regression-based approaches, data are used to predict, as closely as possible, the
accurate and actual labels of points that are under consideration. Regression-based approaches
are highly common in machine learning with a multitude of applications [55]. This study uses
logistic regression.
6.3.2 Models Used in this Study
6.3.2.1 K-Nearest Neighbour (k-NN)
The k-nearest neighbour is an instance-based classifier. When the k-NN is used, instances within
a dataset are contained in a dimensional space, where a new instance is labelled based on its
similarity with other instances, as shown in Figure 9. These instances are referred to as neighbours.
A new instance is assigned the class that is most common among its neighbouring observations [54].
A distance function is applied to determine the similarity between instances. For the purpose of
this study, the distance function employed is Euclidean. The Euclidean function is a relatively
common method as it reflects the human perception of distance.
Figure 9: An example of a k-NN classification. When k=3, the new instance is labelled as 0. However, when the parameter is increased to k=5, the same instance is labelled as 1. Adapted from [55].
Table 7 describes classification with the k-NN algorithm.
Table 7: Pseudocode for the k-NN Algorithm [54].
Algorithm 1 k-Nearest Neighbour
start
Let S = {a1, a2, …, an}, where S represents the training set and a represents article documents
k ← the desired number of nearest neighbours
Compute the distance d(i, a) between new instance i and all a ∈ S
Select the k closest training samples to i
Classi ← best voted class
end
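The effect of the pseudocode above can be reproduced with the scikit-learn classifier used in this study; the data points below are synthetic and purely illustrative (Euclidean distance is scikit-learn's default metric):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two synthetic clusters, one per class (illustrative data only).
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

# k = 3 nearest neighbours, Euclidean distance (the default).
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
pred = knn.predict([[0.5, 0.5], [5.5, 5.5]])
print(pred)  # [0 1]
```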
6.3.2.2 Support Vector Machine (SVM)
The learning process in SVM is carried out in two steps: firstly, the inputs are plotted in an n-
dimensional space, where n is based on the number of attributes; the coordinates of individual
attributes are referred to as support vectors. Secondly, a hyperplane separates the instances. A
hyperplane is a line that linearly separates a set of data points into two distinct classes. The SVM
selects the hyperplane which best splits the data set.
When dealing with the mapping of complex nonlinear functions, computation issues are highly
probable. In fact, the larger the dimensional space, the bigger the separation problem [55]. Kernel
tricks are used to mitigate this problem. A kernel is able to transform extremely complex functions
into infinitely higher dimensional spaces, then uses predefined labels to split the inputs [54].
Figure 10: Illustration of SVM classification.
Table 8 describes classification with the SVM algorithm.
Table 8: Pseudocode for the SVM Algorithm [54].
Algorithm 2 Support Vector Machine
start
∀ document ∈ training set S:
Create SVM classification objects
Objects → Higher Dimensional Space
Apply a kernel trick to transform f(x) into a linear separable one
A hyperplane is computed ⇒ binary classification
end
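A small sketch of the kernel trick in practice: concentric circles are not linearly separable in the original space, but an RBF-kernel SVM separates them after the implicit mapping. The data is synthetic, and the kernel choice here is for illustration only; the study itself uses a linear SVM (LinearSVC, see Table 13).

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate the classes.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel implicitly maps points into a higher-dimensional space
# in which a separating hyperplane exists.
clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))
```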
6.3.2.3 Naïve Bayes
The naïve Bayes classifier is built on Bayes’ Theorem, where event independence is assumed. In
statistics, two events are said to be independent if the likelihood of one does not impact the other
[54]. Table 9 presents the algorithm of the Bayesian classifier to calculate probability. Let P(B|A)
be the conditional probability of B given A, let P(B) be the probability of B, and let P(A) be the
probability of A. Then P(A|B), the probability of A given B, is formally presented as:
P(A|B) = P(B|A) P(A) / P(B)
Equation 1: Bayes’ Theorem
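A quick numeric check of Equation 1 with invented probabilities, chosen only for illustration: let A be "the traffic is an attack" and B be "the detector raises an alert".

```python
# Hypothetical probabilities, chosen only to illustrate Equation 1.
p_a = 0.01          # P(A): prior probability of an attack
p_b_given_a = 0.90  # P(B|A): probability of an alert given an attack
p_b = 0.05          # P(B): overall probability of an alert

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 2))  # 0.18
```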
Table 9: Pseudo code for the naïve Bayes algorithm [54].
Algorithm 3 Naïve Bayes
start
Let S = {a1, a2, …, an}, where S = training set and a = articles:
Calculate the probability of the classes P(C)
Calculate likelihood of attribute A for each class P(A|C)
Calculate the conditional probability P(C|A)
Assign the class with the highest probability
end
6.3.2.4 Decision Tree
Decision tree classification starts at the root node and classifies observations on the basis of the
values of the respective attributes. Every node represents a single feature, while the branches
represent the values that the node can assume [54].
Starting from the root node, the algorithm works its way down by iteratively computing the
information gain for each feature in the training set. Information gain is used to determine the
level of discrimination imposed by the features towards the target classes. The higher the
information gain, the higher the importance of the attribute in the classification of each
observation [54], [55]. The root node is replaced by the attribute that possesses the highest
information gain, and the algorithm continues splitting the data set by the selected feature to
produce subsets. Table 10 gives an overview of this procedure.
Table 10: Pseudo code for the decision tree algorithm [54].
Algorithm 4 Decision Tree
start
∀ attributes a1, a2, …, an
Find the attribute that best divides the training data using information gain
a_best ← the attribute with highest information gain
Create a decision node that splits on a_best
Recurse on the sub-lists obtained by splitting on a_best and add those nodes as children of node
end
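Information gain can be sketched directly from its definition: the Shannon entropy of the parent set minus the weighted entropy of the subsets produced by the split. The labels below are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy reduction achieved by splitting `parent` into `subsets`."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

# A split that perfectly separates the classes recovers the parent's full
# entropy (1 bit for a balanced binary class distribution).
parent = ["ATTACK"] * 4 + ["NORMAL"] * 4
gain = information_gain(parent, [["ATTACK"] * 4, ["NORMAL"] * 4])
print(gain)  # 1.0
```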
6.3.2.5 Random Forest
The random forest algorithm is an ensemble algorithm that uses a large number of decision trees
for classification. Individual trees are built using the algorithm presented in Table 11. As
previously noted, ensemble algorithms provide higher accuracy due to the combination of multiple
models.
Table 11: Pseudo code for the random forest algorithm [54].
Algorithm 5 Random Forest
Require IDT (a decision tree inducer), T (the number of iterations), S (the training set), µ (the subsample size), N (the number of attributes used in each node)
start
t ← 1
repeat
St ← Sample µ instances from S with replacement.
Build classifier Mt using IDT(N) on St
t++
until t > T
end
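Algorithm 5 can be sketched by hand: draw T bootstrap samples, fit one decision tree per sample, and take a majority vote at prediction time. The one-dimensional data below is a toy example, purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy one-dimensional data: class 0 on the left, class 1 on the right.
X = np.array([[0], [1], [2], [3], [8], [9], [10], [11]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# T iterations: sample mu instances with replacement, build a tree on each.
T, mu = 25, len(X)
trees = []
for _ in range(T):
    idx = rng.integers(0, len(X), size=mu)   # St: sample with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Majority vote over all trees for two new points.
votes = np.array([t.predict([[2.5], [9.5]]) for t in trees])
preds = (votes.mean(axis=0) >= 0.5).astype(int)
print(preds)  # [0 1]
```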
6.3.2.6 Logistic Regression
Logistic regression is a type of predictive analysis and is best suited for analysing scenarios where
the dependent variable is binary. Logistic regression describes the data and explains the
relationship between one dependent binary variable and other non-binary independent variables
[55]. Table 12 presents the algorithm for the logistic regression classifier.
Table 12: Pseudo code for the logistic regression algorithm [54].
Algorithm 6 Logistic Regression
given α, {(xi, yi)}
initialize a = ⟨1, …, 1⟩ᵀ
perform feature scaling on the examples’ attributes
repeat until convergence:
    for each j = 0, …, n:
        a′j ← aj + α Σi (yi − ha(xi)) xij
    for each j = 0, …, n:
        aj ← a′j
output a
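The update rule in Algorithm 6 can be sketched in a few lines of NumPy. This is a toy implementation under simplifying assumptions: one-dimensional separable data, a fixed iteration count in place of a convergence test, and all names illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def logistic_regression(X, y, alpha=0.1, iters=5000):
    """Batch gradient ascent mirroring Algorithm 6; X carries a leading
    column of ones so a[0] acts as the intercept."""
    a = np.ones(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ a)                # h_a(x_i) for every example
        a = a + alpha * (X.T @ (y - h))   # a_j <- a_j + alpha * sum_i (y_i - h) x_ij
    return a

# Toy separable data: x <= 1 is class 0, x >= 3 is class 1.
X = np.array([[1, 0.0], [1, 1.0], [1, 3.0], [1, 4.0]])
y = np.array([0, 0, 1, 1])
a = logistic_regression(X, y)
preds = (sigmoid(X @ a) >= 0.5).astype(int)
print(preds)  # [0 0 1 1]
```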
6.3.2 Training
During the training process, the selected algorithms are provided with training data to learn from
to eventually create machine learning models. Accordingly, the training set is used, as specified in
Section 6.2.2. At this point in the process, the input data source needs to be provided and should
contain the target attribute (class label). The training process involves finding patterns in the
training set that map the input features with the target attribute. Based on the observed patterns,
a model is produced.
In this study, four DDoS datasets are being used as the input data source, where the target
attribute is the type of network traffic, that is, attack or normal. Six algorithms are trained with
each of the four sets. Training is conducted using several methods from the scikit-learn libraries.
Table 13 provides a breakdown of the methods used for each algorithm. Appendix B contains the
source code for the models that were built to analyse the intrusion detection capacity of each
dataset.
Table 13: Methods and classifiers from the scikit-learn Python library [57] used for building models.
Model Scikit-learn Methods & Classifiers
k-NN sklearn.neighbors.KNeighborsClassifier [58]
SVM sklearn.svm.LinearSVC [59]
Naïve Bayes sklearn.naive_bayes.GaussianNB [60]
Decision Tree sklearn.tree.DecisionTreeClassifier [61]
Random Forest sklearn.ensemble.RandomForestClassifier [62]
Logistic Regression sklearn.linear_model.LogisticRegression [63]
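The six classifiers in Table 13 can be instantiated together as a dictionary; the default hyperparameters shown below are an assumption for illustration, as the thesis does not list the exact settings used for each model.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# One entry per model in Table 13; hyperparameters left at their defaults.
models = {
    "k-NN": KNeighborsClassifier(),
    "SVM": LinearSVC(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Logistic Regression": LogisticRegression(),
}

# Each model is then trained the same way:
# for name, clf in models.items():
#     clf.fit(X_train, y_train)
print(len(models))  # 6
```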
6.3.3 Validation
Following the training process, the model is validated using k-fold cross validation. Cross
validation is applied to assess the generalisability of a model. This method aims to reduce the
errors of overfitting that occur when a model fits a range of data instances too closely. Cross
validation is done in iterations, and each iteration involves splitting the dataset into k subsets,
referred to as folds. The model is trained on k-1 folds, and the other fold is held back for testing,
as illustrated in Figure 11. This process is repeated until all folds have served as a test fold. Once
the process is completed, the evaluation metric is summarised by calculating the average value
[54].
Figure 11: K-fold cross validation with 5 folds.
In this study, a stratified k-fold approach is applied to the validation dataset (20% of the global
set). Stratified k-fold is a variation of k-fold cross validation that ensures that the distribution of
classes is the same across all folds. This is implemented using the StratifiedKFold method from
the scikit-learn library [64], with k=5. Below is a code snippet of the stratified k-fold, where
n_splits specifies the number of folds.
from sklearn.model_selection import StratifiedKFold

# `model` stands for any of the six classifiers listed in Table 13.
for tr_in, val_in in StratifiedKFold(shuffle=True, n_splits=5).split(X_val, y_val):
    model.fit(X_val.iloc[tr_in], y_val.iloc[tr_in])
    accuracy.append(model.score(X_val.iloc[val_in], y_val.iloc[val_in]))
6.3.4 Testing
In the last stage of the modelling phase, the models are tested with unseen data. The unseen data
used at this stage is the resulting test set from the data split (20%). Testing is conducted to assess
how a model represents data and how well it will perform in the future. This study ensured that
any tweaks to the models were done prior to testing, so that the testing data is used only once.
Various performance metrics were generated to be able to analyse the performance of the DDoS
datasets, such as accuracy, precision, recall, and F-measure. These are described in the next
section.
6.4 Evaluation
A crucial part of understanding the performance of a model is generating performance metrics. In
this study, various metrics are generated. These are described below.
Accuracy. One of the ways to describe the performance of a classification model is the count of
correctly and incorrectly classified instances. These values are commonly represented in a
confusion matrix. A confusion matrix is a tabulated visualisation of the performance of supervised
learning algorithms. The rows represent the count of instances in an actual class, while the columns
represent the count of instances in a predicted class [65]. Table 14 depicts the confusion matrix
for a binary classification problem.
Table 14: Example of a confusion matrix for a binary classifier [65].
                 Predicted Class 0   Predicted Class 1
Actual Class 0         180                  15
Actual Class 1          20                  90
A confusion matrix provides enough information to determine the performance of a stand-alone
classifier. However, it is more convenient and clearer to distil the elements of the matrix into a
single value [65]. In this study, the matrix is summarised using the accuracy metric, which is
computed as follows:
Accuracy = (Correctly Classified Instances / Total Instances) × 100%
Equation 2: Accuracy Ratio
Precision. Accuracy is often not enough to assess the performance of a learning model. Although
accuracy indicates whether the model is being trained correctly, it does not give detailed
information about the specific application. Consequently, other performance metrics are
employed, such as precision. Precision is defined as the rate of correctly classified positives, or
true positives. There are many scenarios where false positives might have repercussions. In the
case of this study, a high false positive rate means that traffic would be identified as malicious
when in fact it is not. Outside the academic world, this might result in wasted time and effort.
Precision is computed as follows:
Precision = True Positives / (True Positives + False Positives)
Equation 3: Precision Ratio
Recall. Another performance metric is recall. Recall is a measure of how many of the actual
positives were found, or recalled. It is also an important metric, as undetected positives, or false
negatives, might have serious consequences in some areas. For instance, a model
that does not recall all cases of DDoS attack means that malicious network traffic will go
unnoticed, increasing the potentiality of harm to the system and its users.
Recall = True Positives / (True Positives + False Negatives)
Equation 4: Recall Ratio
F-measure. The F-measure is a metric that provides an overall accuracy score for a model by
combining precision and recall. A good F-measure score means that a model has both low false
positives and low false negatives, and therefore correctly identifies threats while raising
minimal false alarms.
F-measure = 2 × (Precision × Recall) / (Precision + Recall)
Equation 5: F-measure Ratio
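Applying these four metrics to the counts in Table 14 (treating Class 1 as the positive class) can be sketched as:

```python
# Counts from Table 14, with Class 1 as the positive class.
tp, fn = 90, 20    # actual Class 1: predicted 1 / predicted 0
fp, tn = 15, 180   # actual Class 0: predicted 1 / predicted 0

accuracy = (tp + tn) / (tp + tn + fp + fn)                 # Equation 2
precision = tp / (tp + fp)                                 # Equation 3
recall = tp / (tp + fn)                                    # Equation 4
f_measure = 2 * precision * recall / (precision + recall)  # Equation 5

print(round(accuracy, 3), round(precision, 3),
      round(recall, 3), round(f_measure, 3))  # 0.885 0.857 0.818 0.837
```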
Computation time. The last performance metric used in this study is the computational time.
This is not related directly to classification, but rather, it describes the training time taken by a
model. This metric gives an indication of the efficiency of the model. The recorded computational
time is based on a Linux system with 8GB RAM and an i5 processor.
7 Results
7.1 Overview of Results
Table 15 presents the evaluation metrics of the machine learning models based on four open DDoS
datasets, including accuracy, precision, recall, f-measure and computation time (see Section 6.4).
All machine learning models were trained, validated and tested using a 60:20:20 split of the global
datasets. The goal of this evaluation is to analyse the performance of the different DDoS datasets
in terms of their capacity to detect intrusion (via a DDoS attack). The results show that
the CSE-CIC-IDS2018 dataset [48] performs best overall, achieving an accuracy rate of 99% across
all models, and an F-measure of 99%, denoting that a model trained with CSE-CIC-IDS2018 as a
data source performs very well, as it correctly predicts threats (precision) and captures all relevant
cases of malicious traffic (recall) at a 99% rate across all models.
From a model point of view, the random forest ensemble model performed best overall, achieving
100% accuracy for the NDSec-1 dataset [49], while achieving a 99% accuracy for the other datasets.
Moreover, random forest also achieved a precision and recall of 100% for the NDSec-1 dataset.
For the other datasets, precision and recall both stand at 99%. On the other hand, the naïve
Bayes algorithm produced the lowest accuracy with the CICDDoS2019 dataset [47], achieving a
low accuracy of 45% with a precision of 66% and a recall of 54%, meaning that almost half the
time, the model fails to identify threats. The second lowest results were produced by the SVM
model for the NDSec-1 dataset, with an accuracy of 68%. In this case, the precision and recall are
not as low, standing at 81% and 79%, respectively. The remaining models were consistent in the
results.
With regards to computation time, all models took longer to train with CSE-CIC-IDS2018
as the data source. Most likely, this is due to the record volume of the dataset, with a total of
1,046,845 rows. Conversely, the computation time for NDSec-1 dataset was the lowest, with all
models taking less than a second. It is important to note that this set had the lowest data volume
of 5,838 records. In terms of models, the k-NN model for the CSE-CIC-IDS2018 took the longest
to train, at 148 seconds. However, when analysing the overall results, the random forest algorithm
had the longest training time across all datasets.
Table 15: Performance metrics for each dataset.
k-NN SVM Naïve Bayes Decision Tree Random Forest Logistic Regression
CICDDoS2019
Accuracy 0.98 0.86 0.45 0.99 0.99 0.98
Precision 0.99 0.86 0.66 0.99 0.99 0.99
Recall 0.99 0.87 0.54 0.99 0.99 0.98
F-measure 0.99 0.85 0.38 0.99 0.99 0.99
Computation time 3.5 seconds 7.29 seconds 1.3 seconds 4.53 seconds 84.2 seconds 5.53 seconds
CSE-CIC-IDS2018
Accuracy 0.99 0.99 0.99 0.99 0.99 0.99
Precision 0.99 0.99 0.99 1 0.99 0.99
Recall 0.99 0.99 0.99 1 0.99 0.99
F-measure 0.99 0.99 0.99 1 0.99 0.99
Computation time 148.2 seconds 16.8 seconds 2.7 seconds 5.3 seconds 120.8 seconds 10.8 seconds
NDSec-1
Accuracy 0.98 0.68 0.99 0.97 1 0.99
Precision 0.99 0.81 1 0.99 1 0.99
Recall 0.99 0.79 1 0.99 1 0.99
F-measure 0.99 0.75 1 0.99 1 0.99
Computation time 0.2 seconds 0.1 seconds 0.2 seconds 0.3 seconds 0.6 seconds 0.2 seconds
CICIDS2017
Accuracy 0.99 0.89 0.8 0.99 0.99 0.98
Precision 0.99 0.93 0.88 0.99 0.99 0.98
Recall 0.99 0.93 0.78 0.99 0.99 0.98
F-measure 0.99 0.93 0.79 0.99 0.99 0.98
Computation time 7.1 seconds 5.5 seconds 1.4 seconds 1.8 seconds 39.6 seconds 1.4 seconds
7.2 CICDDoS2019
Figure 12 illustrates a comparative bar graph for the accuracy rates achieved by models that were
trained with the CICDDoS2019 dataset [47]. From initial observations, it is clear that the naïve
Bayes model performs poorly in comparison to the rest, with an accuracy rate of 45% (see table
15). The F-measure of the same model is also low. Taking a more granular look into this metric,
it shows that both the precision and recall of the model are problematic, with 66% and 54%
respectively. For this dataset, the best performing model was the random forest, achieving an
accuracy of 99%, with a 99% precision and 99% recall. The model also took the longest to train,
with a computation time of 84.2 seconds. Meanwhile, the other models took under 10 seconds to
train.
Figure 12: Bar graph of accuracy rate for the CICDDoS2019 dataset.
7.3 CSE-CIC-IDS2018
The CSE-CIC-IDS2018 dataset [48] performs extremely well as it achieves a 99% accuracy rate
for all machine learning models used in this study, as seen in Figure 13 below. This can be
attributed to the volume of records that the dataset has in comparison with the other datasets
(1,046,845 rows). Due to the record volume, some models took a longer time to train, in particular,
the k-NN and random forest models, with 148.2 and 120.8 seconds respectively (see table 15).
While all models achieve the same accuracy rate, the decision tree model performs best overall
with an F-measure of 100%. The naïve Bayes model takes the least time to train, with a
computation time of 2.7 seconds.
Figure 13: Bar graph of accuracy rate for the CSE-CIC-IDS2018 dataset.
7.4 NDSec-1
Figure 14 presents a bar graph of the accuracy rate achieved by models trained with the NDSec-
1 dataset [49]. This dataset has the lowest volume of records (5,838) and naturally, model training
took much less time in comparison to models trained with the other datasets. In fact, all models
took less than 1 second to train (see table 15). The random forest model achieved a 100% accuracy,
and the highest accuracy rate in this study. The same model also achieved a 100% F-measure
score. In contrast, the SVM model achieved the lowest accuracy for the dataset, with a score of
68%, the second lowest accuracy rate in this study. In addition, the model achieved a precision
score of 81% and slightly lower recall score of 79%, with a combined F-measure score of 75%. The
other models achieved very similar accuracy and F-measure scores, as the bar graph illustrates
(see Fig. 14), where the decision tree model achieved a 97% accuracy, while the k-NN and logistic
regression models both achieved a 99% accuracy score.
Figure 14: Bar graph of accuracy rate for the NDSec-1 dataset.
7.5 CICIDS2017
Figure 15 presents a bar graph of the results achieved by the CICIDS2017 dataset [50]. This
dataset is comparable with the CICDDoS2019 dataset in terms of volume (225,745 rows). It is
interesting to note, however, that when trained with CICIDS2017 data, the random forest model
takes less than half the time (39.6 seconds) it takes to train with CICDDoS2019 data (84.2
seconds). When training a naïve Bayes model, the dataset generates the lowest accuracy, with a
score of 80%, with precision at 88% and recall at 78%. Relative to the rest of the models,
the SVM model also achieves slightly lower accuracy, with a score of 89%. With regards to
precision and recall, the same model achieves a score of 93% in both cases. The rest of the models
achieve very similar results, where the k-NN, decision tree, and random forest models all achieved
99% accuracy, precision and recall, while the logistic regression model achieved a 98% score for
all three metrics. This pattern closely resembles the accuracy results achieved by the models
when trained with the CICDDoS2019 dataset (see Fig. 12).
Figure 15: Bar graph of accuracy rate for the CICIDS2017 dataset.
8 Discussion
8.1 Contributions of this Study
This study explored the behaviour and application of multiple DDoS datasets for machine learning
in the context of intrusion detection. Intrusion detection has become a pressing concern and the
subject of extensive research due to the ever-increasing number of vulnerabilities. Over the last
few years, the Internet has grown exponentially, with thousands of computer-based applications
being generated every day. The internet has rapidly become an essential component of modern life,
and with its aggressive growth, secure network environments are becoming critical. Among the
various types of attacks, DDoS attacks are one of the biggest threats to internet sites and pose
a devastating risk to the security of computer systems, particularly due to their potential
impact. This is why research in this area has flourished, with researchers focusing on new ways
to tackle intrusion detection and prevention. Machine learning and artificial intelligence are
among the latest additions to the list of technologies researched for intrusion detection.
However, many industry stakeholders and researchers still find it difficult to obtain good-quality
datasets for evaluating and assessing their machine learning detection models. This problem was
the main motivation of this study, and the basis for the research questions.
This work starts by reviewing literature in this domain. Firstly, this review presented an
outline of how other researchers explored and tackled the issue of intrusion detection with the
application of machine learning. This gave a better understanding of how different algorithms are
applied in solving intrusion problems. Moreover, it also provided insight into which algorithms
are commonly used to tackle problems in this domain and how the results are interpreted and
analysed. Secondly, the literature review delved deep into the characteristics and issues of current
datasets. Various works were analysed in order to explore the intricacies of these datasets and how
their validity is affected in the context of machine learning methodologies. Multiple issues were
uncovered with regards to existing datasets, including privacy concerns, documentation
availability, accessibility and alignment with research objectives. This was followed by a review
of previous work related to the surveying and comparison of datasets.
This study presented a solution for the analysis of the effectiveness of existing DDoS
datasets to detect intrusion, using CRISP-DM as the core methodology. The primary phases of
CRISP-DM were efficiently and effectively mapped to the research questions and these were
thoroughly followed throughout the whole study.
In the experiment, four open DDoS datasets were used: CICDDoS2019, CICIDS2017,
NDSec-1 and CSE-CIC-IDS2018. The intrusion detection performance of these datasets was
analysed using six machine learning models. The datasets were split in a 60:20:20 ratio for model
training, validation and testing, respectively. The machine learning models were chosen
systematically and carefully to ensure that the experiment is conducted in a proper manner. The
six models include naïve Bayes, SVM, decision tree, k-nearest neighbour, random forest and
logistic regression. The results were analysed using a set of performance metrics, including
accuracy, precision, recall, F-measure and computation time. Below are the findings of this study
according to the research questions:
Assessment of RQ1: What is the effectiveness of different open DDoS datasets in detecting
intrusion and malicious traffic?
Finding 1.1 – The CSE-CIC-IDS2018 dataset [48] exhibits the best intrusion detection performance
overall, where all models achieve 99% accuracy rate with an F-measure score of 99%, denoting
that any of the six models trained with this dataset is able to correctly identify threats (precision)
and capture all relevant cases of malicious traffic (recall) at a 99% rate.
Finding 1.2 – Training with CSE-CIC-IDS2018 [48] as a data source was the most time intensive
overall, possibly due to the large record volume of the dataset (1,046,845 rows). In contrast, the
least time intensive models were the ones trained with the NDSec-1 dataset, which is also the least
voluminous dataset (5,838 rows), where the computation time was under one second for all
models.
Assessment of RQ2: How does the performance of different supervised learning models compare
with regards to classification capacity and time efficiency?
Finding 2.1 – Random forest is the best performing model overall, achieving 100% accuracy and
100% F-measure when trained with the NDSec-1 dataset [49], and a 99% score for accuracy, precision
and recall when trained with any of the other datasets.
Finding 2.2 – The naïve Bayes model performed relatively poorly overall and produced the lowest
accuracy score of this study (45%) when trained with the CICDDoS2019 dataset [47]. For the
same model, precision was 66% and recall was 54%, meaning that almost half the time the model
fails to identify threats.
Finding 2.3 – The single highest training time in this study was recorded for the k-NN model, at
148.2 seconds when trained with the CSE-CIC-IDS2018 dataset. Summed across all four datasets,
however, the random forest model had the highest overall computation time.
Finding 2.4 – The CICIDS2017 and CICDDoS2019 datasets show similar patterns in the results
obtained by the models, with naïve Bayes and SVM producing the lowest and second-lowest
results respectively. All other models trained with either dataset achieved consistently similar
results, ranging from 98% to 99% accuracy rate.
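The 60:20:20 split described at the start of this section can be sketched as follows. The study itself used scikit-learn's train_test_split [53]; this stdlib-only split_60_20_20 helper is purely illustrative of the ratio logic:

```python
import random

def split_60_20_20(records, seed=0):
    """Shuffle the records and split them 60:20:20 into
    training, validation and test subsets."""
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)
    n_train = int(len(records) * 0.6)
    n_val = int(len(records) * 0.2)
    train = [records[i] for i in idx[:n_train]]
    val = [records[i] for i in idx[n_train:n_train + n_val]]
    test = [records[i] for i in idx[n_train + n_val:]]
    return train, val, test

rows = list(range(1000))  # stand-in for dataset rows
train, val, test = split_60_20_20(rows)
print(len(train), len(val), len(test))  # 600 200 200
```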
8.2 Conclusions and Future Work
While the scarcity of datasets was the very focal point of this study, it can also be seen as a
limitation in its own right: had more datasets been available, a potentially more accurate
comparison between them could have been made.
Although the area of IDS is heavily researched, there are many aspects to be investigated further,
especially in the area of machine learning. Specifically, future work could focus on providing an
application or service with which any new dataset could quickly be analysed and put into the
benchmark with algorithms selected by the researcher in the same manner that this study carried
out the analysis of the selected datasets. The application would be able to answer the question
‘which dataset performs better and with which algorithms?’. This would be of great help to
researchers who are in search of a well-performing dataset and also desire a consistent approach
to results, by using the best-performing datasets and algorithms.
With regards to possible future work, another interesting area that could be explored is how IDS-
specific data could be represented in non-structured forms and further analysed with deep
learning using artificial neural networks. This can also be seen as two separate problems, which
a future study could expand on.
References
[1] S. Dua and X. Du, Data Mining and Machine Learning in Cybersecurity. Boca Raton,
Florida: Auerbach Publications, 2016.
[2] C. Canongia and R. Mandarino, “Cybersecurity: The new challenge of the information
society,” in Handbook of Research on Business Social Networking: Organizational,
Managerial, and Technological Dimensions, 2011.
[3] P. Twomey, “Cyber Security Threats.” The Lowy Institute for International Policy, Sydney,
2010.
[4] R. Von Solms and J. Van Niekerk, “From information security to cyber security,” Comput.
Secur., vol. 38, pp. 97–102, 2013.
[5] J. B. Fraley and J. Cannady, “The promise of machine learning in cybersecurity,” in
SouthEastCon 2017, 2017, pp. 1–6.
[6] OWASP, “OWASP Top 10 - 2017 - The Ten Most Critical Web Application Security
Risks,” Top 10 2017, 2017.
[7] C. Douligeris and A. Mitrokotsa, “DDoS attacks and defense mechanisms: Classification
and state-of-the-art,” Comput. Networks, vol. 44, no. 5, pp. 643–666, 2004.
[8] S. K. Sahu, S. Sarangi, and S. K. Jena, “A detail analysis on intrusion detection datasets,”
in Souvenir of the 2014 IEEE International Advance Computing Conference, IACC 2014,
2014, pp. 1348–1353.
[9] J. O. Nehinbe, “A critical evaluation of datasets for investigating IDSs and IPSs researches,”
in Proceedings of 2011, 10th IEEE International Conference on Cybernetic Intelligent
Systems, CIS 2011, 2011, pp. 1–6.
[10] A. Gharib, I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, “An Evaluation Framework
for Intrusion Detection Dataset,” in ICISS 2016 - 2016 International Conference on
Information Science and Security, 2017, pp. 1–6.
[11] M. Ring, S. Wunderlich, D. Scheuring, D. Landes, and A. Hotho, “A survey of network-
based intrusion detection data sets,” Comput. Secur., vol. 86, pp. 147–167, 2019.
[12] O. Yavanoglu and M. Aydos, “A review on cyber security datasets for machine learning
algorithms,” in Proceedings - 2017 IEEE International Conference on Big Data, Big Data
2017, 2017, pp. 2186–2193.
[13] C. Thomas, V. Sharma, and N. Balakrishnan, “Usefulness of DARPA dataset for intrusion
detection system evaluation,” in Data Mining, Intrusion Detection, Information Assurance,
and Data Networks Security 2008, 2008.
[14] L. Dhanabal and S. P. Shantharajah, “A Study on NSL-KDD Dataset for Intrusion
Detection System Based on Classification Algorithms,” Int. J. Adv. Res. Comput. Commun.
Eng., vol. 4, no. 6, 2015.
[15] M. Małowidzki, P. Bereziński, and M. Mazur, “Network Intrusion Detection: Half a Kingdom
for a Good Dataset,” in ECCWS 2017 16th European Conference on Cyber Warfare and
Security, 2017.
[16] R. Bace and P. Mell, “NIST special publication on intrusion detection systems,” Special
Publication (NIST SP), 2001.
[17] Nexus Guard, “Nexusguard Research Shows DNS Amplification Attacks Grew Nearly
4,800% Year-over-Year; Highlighted by Sharp Increase in TCP SYN Flood,” 2019. [Online].
Available: https://www.nexusguard.com/newsroom/press-release/dns-amplification-
attacks-rise-twofold-in-q1-0-0.
[18] J. Mirkovic and P. Reiher, “A taxonomy of DDoS attack and DDoS defense mechanisms,”
Comput. Commun. Rev., vol. 34, no. 2, pp. 39–53, 2004.
[19] K. Scarfone and P. Mell, “Guide to Intrusion Detection and Prevention Systems (IDPS),”
National Institute of Standards and Technology. Special Publication (NIST SP), 2007.
[20] P. Ferguson and D. Senie, “Network Ingress Filtering: Defeating Denial of Service Attacks
which employ IP Source Address Spoofing,” RFC Editor, 2000. [Online]. Available:
https://tools.ietf.org/html/rfc2827.
[21] G. C. Kessler and D. E. Levin, Denial-of-Service Attacks, 4th ed. John Wiley & Sons, 2015.
[22] R. Das and T. H. Morris, “Machine learning and cyber security,” in 2017 International
Conference on Computer, Electrical and Communication Engineering, ICCECE 2017,
2018, pp. 1–7.
[23] I. Sofi, A. Mahajan, and V. Mansotra, “Machine Learning Techniques used for the Detection
and Analysis of Modern Types of DDoS Attacks,” Int. Res. J. Eng. Technol., 2017.
[24] N. Sharma, A. Mahajan, and V. Mansotra, “Machine Learning Techniques Used in
Detection of DOS Attacks: A Literature Review,” Int. J. Adv. Res. Comput. Sci. Softw.
Eng., 2016.
[25] M. Zekri, S. El Kafhali, N. Aboutabit, and Y. Saadi, “DDoS attack detection using machine
learning techniques in cloud computing environments,” in Proceedings of 2017 International
Conference of Cloud Computing Technologies and Applications, CloudTech 2017, 2018.
[26] D. M. Farid, N. Harbi, E. Bahri, M. Z. Rahman, and C. M. Rahman, “Attacks classification
in adaptive intrusion detection using decision tree,” World Acad. Sci. Eng. Technol., pp.
368–372, 2010.
[27] Y. C. Wu, H. R. Tseng, W. Yang, and R. H. Jan, “DDoS detection and traceback with
decision tree and grey relational analysis,” in 3rd International Conference on Multimedia
and Ubiquitous Engineering, MUE 2009, 2009.
[28] A. Andhare, P. Arvind, and B. Patil, “Denial-of-Service Attack Detection Using Genetic-
Based Algorithm,” vol. 2, no. 2, pp. 94–98, 2012.
[29] M. Aamir and S. M. A. Zaidi, “DDoS attack detection with feature engineering and machine
learning: the framework and performance evaluation,” Int. J. Inf. Secur., pp. 1–25, 2019.
[30] A. Khraisat, I. Gondal, P. Vamplew, and J. Kamruzzaman, “Survey of intrusion detection
systems: techniques, datasets and challenges,” Cybersecurity, vol. 2, no. 1, p. 20, 2019.
[31] R. Koch, “Towards next-generation intrusion detection,” in 2011 3rd International
Conference on Cyber Conflict, ICCC 2011 - Proceedings, 2011.
[32] J. O. Nehinbe, “A simple method for improving intrusion detections in corporate networks,”
in Lecture Notes of the Institute for Computer Sciences, Social-Informatics and
Telecommunications Engineering, 2010.
[33] N. Moustafa and J. Slay, “UNSW-NB15: A comprehensive data set for network intrusion
detection systems (UNSW-NB15 network data set),” in 2015 Military Communications and
Information Systems Conference, MilCIS 2015 - Proceedings, 2015.
[34] M. Ring, S. Wunderlich, D. Grüdl, D. Landes, and A. Hotho, “Flow-based benchmark data
sets for intrusion detection,” in European Conference on Information Warfare and Security,
ECCWS, 2017.
[35] A. A. Ghorbani, W. Lu, and M. Tavallaee, Network Intrusion Detection and Prevention.
Springer, 2010.
[36] The Cooperative Association for Internet Data Analysis, “CAIDA - The Cooperative
Association for Internet Data Analysis,” CAIDA. 2010.
[37] J. Mchugh, “Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA
Intrusion Detection System Evaluations as Performed by Lincoln Laboratory,” ACM Trans.
Inf. Syst. Secur., vol. 3, no. 4, pp. 1094–9224, 2000.
[38] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “Detailed Analysis of the KDD CUP
99 Data Set,” Submitted to Second IEEE Symposium on Computational Intelligence for
Security and Defense Applications (CISDA), 2009.
[39] University of California, “KDD-Cup Dataset ’99,” The UCI KDD Archive, 1999.
[40] University of California, “KDD-Cup Dataset ’98,” The UCI KDD Archive, 1998.
[41] J. Heidemann and C. Papadopoulos, “Uses and challenges for network datasets,” in
Proceedings - Cybersecurity Applications and Technology Conference for Homeland
Security, CATCH 2009, 2009.
[42] Defense Advanced Research Projects Agency, “1999 DARPA Intrusion Detection
Evaluation Dataset,” 1999. [Online]. Available: https://www.ll.mit.edu/r-d/datasets/1999-
darpa-intrusion-detection-evaluation-dataset.
[43] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, “Network anomaly detection:
Methods, systems and tools,” IEEE Commun. Surv. Tutorials, vol. 16, no. 1, pp. 303–336,
2014.
[44] A. Nisioti, A. Mylonas, P. D. Yoo, and V. Katos, “From intrusion detection to attacker
attribution: A comprehensive survey of unsupervised methods,” IEEE Commun. Surv.
Tutorials, vol. 20, no. 4, pp. 3369–3388, 2018.
[45] T. H. Morris, Z. Thornton, and I. Turnipseed, “Industrial Control System Simulation and
Data Logging for Intrusion Detection System Research,” Seventh Annu. Southeast. Cyber
Secur. Summit, 2015.
[46] R. Wirth, “CRISP-DM : Towards a Standard Process Model for Data Mining,” Proc. Fourth
Int. Conf. Pract. Appl. Knowl. Discov. Data Min., pp. 29–39, 2000.
[47] University of New Brunswick, “DDoS Evaluation Dataset (CICDDoS2019),” unb.ca, 2019.
[Online]. Available: https://www.unb.ca/cic/datasets/ddos-2019.html.
[48] University of New Brunswick, “CSE-CIC-IDS2018 on AWS,” 2018. [Online]. Available:
https://www.unb.ca/cic/datasets/ids-2018.html.
[49] F. Beer, T. Hofer, D. Karimi, and U. Bühler, “A new attack composition for network
security,” in Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft
fur Informatik (GI), 2017.
[50] Canadian Institute for Cybersecurity, “CICIDS2017,” unb.ca, 2017. [Online]. Available:
https://www.unb.ca/cic/datasets/ids-2017.html.
[51] A. H. Lashkari, Y. Zang, G. Owhuo, M. S. I. Mamun, and G. D. Gil, “CICFlowMeter,”
Github. 2017.
[52] pandas.pydata.org, “pandas.DataFrame.fillna,” Pandas 1.0.3 Documentation, 2014.
[Online]. Available: https://pandas.pydata.org/pandas-
docs/stable/reference/api/pandas.DataFrame.fillna.html.
[53] Scikit-learn, “Train_test_split,” Scikit-learn 0.22.2 Documentation, 2019. [Online].
Available: https://scikit-
learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html.
[54] T. Mitchell, Machine Learning. Burr Ridge, IL: McGraw Hill, 1997.
[55] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning, 2nd ed.
London, England: The MIT Press, 2018.
[56] L. Rokach, “Ensemble-based classifiers,” Artif. Intell. Rev., vol. 33, pp. 1–39, 2010.
[57] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, and B. Thirion, “Scikit-learn:
Machine learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.
[58] Scikit-learn, “KNeighborsClassifier,” scikit-learn.org, 2019. [Online]. Available: https://scikit-
learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html.
[59] Scikit-learn, “LinearSVC,” scikit-learn.org, 2019. [Online]. Available: https://scikit-
learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html.
[60] Scikit-learn, “GaussianNB,” scikit-learn.org, 2019. [Online]. Available: https://scikit-
learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html.
[61] Scikit-learn, “DecisionTreeClassifier,” scikit-learn.org, 2019. [Online]. Available:
https://scikit-
learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html.
[62] Scikit-learn, “RandomForestClassifier,” scikit-learn.org, 2019. [Online]. Available:
https://scikit-
learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
[63] Scikit-learn, “LogisticRegression,” scikit-learn.org, 2019. [Online]. Available:
https://scikit-
learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.
[64] Scikit-learn, “StratifiedKFold,” Scikit-learn 0.22.2 Documentation, 2019.
[65] D. M. Powers, “Evaluation: From Precision, Recall and F-Factor to ROC, Informedness,
Markedness & Correlation,” J. Mach. Learn. Technol., vol. 2, 2007.
Appendix A – Specifics for CICDDoS2019
Day         Attack      Attack Time
First Day   PortMap     9:43 - 9:51
            NetBIOS     10:00 - 10:09
            LDAP        10:21 - 10:30
            MSSQL       10:33 - 10:42
            UDP         10:53 - 11:03
            UDP-Lag     11:14 - 11:24
            SYN         11:28 - 17:35
Second Day  NTP         10:35 - 10:45
            DNS         10:52 - 11:05
            LDAP        11:22 - 11:32
            MSSQL       11:36 - 11:45
            NetBIOS     11:50 - 12:00
            SNMP        12:12 - 12:23
            SSDP        12:27 - 12:37
            UDP         12:45 - 13:09
            UDP-Lag     13:11 - 13:15
            WebDDoS     13:18 - 13:29
            SYN         13:29 - 13:34
            TFTP        13:35 - 17:15
Table A1: Time of Attacks for the CICDDoS2019 dataset [47].
Appendix B – Modelling Source Code
AB.1 K-Nearest Neighbour
import time
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

res1 = time.time()
knn = KNeighborsClassifier()
knn = knn.fit(X_train, y_train)
res2 = time.time()
print('KNN took ', res2 - res1, ' seconds')

# 5-fold stratified cross-validation on the validation set
accuracy = []
for tr_in, val_in in StratifiedKFold(shuffle=True, n_splits=5).split(X_val, y_val):
    knn.fit(X_val.iloc[tr_in], y_val.iloc[tr_in])
    accuracy.append(knn.score(X_val.iloc[val_in], y_val.iloc[val_in]))

y_pred1 = knn.predict(X_test)
print('Accuracy score= {:.8f}'.format(knn.score(X_test, y_test)))

from sklearn.metrics import classification_report, confusion_matrix
print('\n')
print("Precision, Recall, F1")
print('\n')
CR = classification_report(y_test, y_pred1)
print(CR)
print('\n')
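The StratifiedKFold loop above draws five validation folds that each preserve the class balance of the labels. A stdlib-only sketch of that idea (the stratified_folds helper is illustrative, not part of the code used in this study):

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Deal the indices of each class round-robin into k folds so that
    every fold keeps roughly the full data's class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds

# 10 benign (0) and 5 attack (1) labels: each of 5 folds gets 2 benign, 1 attack
folds = stratified_folds([0] * 10 + [1] * 5, k=5)
print([len(f) for f in folds])  # [3, 3, 3, 3, 3]
```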
AB.2 Support Vector Machine
import time
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

res1 = time.time()
svc = LinearSVC(random_state=10, tol=1e-10, max_iter=100)
svc = svc.fit(X_train, y_train)
res2 = time.time()
print('SVM took ', res2 - res1, ' seconds')

# 5-fold stratified cross-validation on the validation set
accuracy = []
for tr_in, val_in in StratifiedKFold(shuffle=True, n_splits=5).split(X_val, y_val):
    svc.fit(X_val.iloc[tr_in], y_val.iloc[tr_in])
    accuracy.append(svc.score(X_val.iloc[val_in], y_val.iloc[val_in]))

y_pred1 = svc.predict(X_test)
print('Accuracy score= {:.8f}'.format(np.mean(accuracy)))

from sklearn.metrics import classification_report, confusion_matrix
print('\n')
print("Precision, Recall, F1")
print('\n')
CR = classification_report(y_test, y_pred1)
print(CR)
print('\n')
AB.3 Naïve Bayes
import time
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

res1 = time.time()
nb = GaussianNB()
nb = nb.fit(X_train, y_train)
res2 = time.time()
print('Naive Bayes took ', res2 - res1, ' seconds')

# 5-fold stratified cross-validation on the validation set
accuracy = []
for tr_in, val_in in StratifiedKFold(shuffle=True, n_splits=5).split(X_val, y_val):
    nb.fit(X_val.iloc[tr_in], y_val.iloc[tr_in])
    accuracy.append(nb.score(X_val.iloc[val_in], y_val.iloc[val_in]))

y_pred1 = nb.predict(X_test)
print('Accuracy score= {:.8f}'.format(np.mean(accuracy)))

from sklearn.metrics import classification_report, confusion_matrix
print('\n')
print("Precision, Recall, F1")
print('\n')
CR = classification_report(y_test, y_pred1)
print(CR)
print('\n')
AB.4 Decision Tree
import time
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

res1 = time.time()
DTC = DecisionTreeClassifier(random_state=10, max_depth=13)
DTC = DTC.fit(X_train, y_train)
res2 = time.time()
print('Decision tree took ', res2 - res1, ' seconds')

# 5-fold stratified cross-validation on the validation set
accuracy = []
for tr_in, val_in in StratifiedKFold(shuffle=True, n_splits=5).split(X_val, y_val):
    DTC.fit(X_val.iloc[tr_in], y_val.iloc[tr_in])
    accuracy.append(DTC.score(X_val.iloc[val_in], y_val.iloc[val_in]))

y_pred1 = DTC.predict(X_test)
print('Accuracy score= {:.8f}'.format(np.mean(accuracy)))

from sklearn.metrics import classification_report, confusion_matrix
print('\n')
print("Precision, Recall, F1")
print('\n')
CR = classification_report(y_test, y_pred1)
print(CR)
print('\n')
AB.5 Random Forest
import time
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

res1 = time.time()
Ran_For = RandomForestClassifier(n_estimators=200, max_depth=35, random_state=200, max_leaf_nodes=200)
Ran_For = Ran_For.fit(X_train, y_train)
res2 = time.time()
print('Random Forest took ', res2 - res1, ' seconds')

# 5-fold stratified cross-validation on the validation set
accuracy = []
for tr_in, val_in in StratifiedKFold(shuffle=True, n_splits=5).split(X_val, y_val):
    Ran_For.fit(X_val.iloc[tr_in], y_val.iloc[tr_in])
    accuracy.append(Ran_For.score(X_val.iloc[val_in], y_val.iloc[val_in]))

y_pred1 = Ran_For.predict(X_test)
print('Accuracy score= {:.8f}'.format(Ran_For.score(X_test, y_test)))

from sklearn.metrics import classification_report, confusion_matrix
print('\n')
print("Precision, Recall, F1")
print('\n')
CR = classification_report(y_test, y_pred1)
print(CR)
print('\n')
AB.6 Logistic Regression
import time
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

res1 = time.time()
LR = LogisticRegression()
LR = LR.fit(X_train, y_train)
res2 = time.time()
print('LogisticRegression took ', res2 - res1, ' seconds')

# 5-fold stratified cross-validation on the validation set
accuracy = []
for tr_in, val_in in StratifiedKFold(shuffle=True, n_splits=5).split(X_val, y_val):
    LR.fit(X_val.iloc[tr_in], y_val.iloc[tr_in])
    accuracy.append(LR.score(X_val.iloc[val_in], y_val.iloc[val_in]))

y_pred1 = LR.predict(X_test)
print('Accuracy score= {:.3f}'.format(LR.score(X_test, y_test)))

from sklearn.metrics import classification_report, confusion_matrix
print('\n')
print("Precision, Recall, F1")
print('\n')
CR = classification_report(y_test, y_pred1)
print(CR)
print('\n')