
INTRUSION DETECTION SYSTEM USING

DATA MINING

A THESIS

Submitted by

SREENATH.M

Register No: 712513405014

In partial fulfillment of the award of the degree

of

MASTER OF ENGINEERING IN

COMPUTER SCIENCE AND ENGINEERING

PPG INSTITUTE OF TECHNOLOGY, COIMBATORE 641035

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

ANNA UNIVERSITY: CHENNAI 600 025

JUNE 2015


ANNA UNIVERSITY, CHENNAI

BONAFIDE CERTIFICATE

Certified that this thesis titled “INTRUSION DETECTION SYSTEM USING

DATA MINING” is the bonafide work of SREENATH.M (Register No:

712513405014) who carried out the work under my supervision. Certified further that to

the best of my knowledge the work reported herein does not form part of any other thesis

or dissertation on the basis of which a degree or award was conferred on an earlier

occasion on this or any other candidate.

Mr. P P JOBY
HEAD OF THE DEPARTMENT
Associate Professor and Head
Department of CSE
PPG Institute of Technology
Coimbatore – 641 035

Mrs. D SUMATHI
SUPERVISOR
Assistant Professor [Senior Grade]
Department of CSE
PPG Institute of Technology
Coimbatore – 641 035

Submitted for the project viva-voce examination held on ___________________

_________________ _______________

Internal Examiner External Examiner


ABSTRACT

For the past few decades, there has been rapid progress in internet-based applications and technology in the area of computer networks. Data is the most important asset of any organization, and organizations require proper protection and management of private and highly sensitive information. Nowadays cyber attacks have become very common, and network security can be provided with Intrusion Detection Systems. An intrusion detection system gathers and analyses information from various areas within a network or computer to identify possible security breaches, which include both misuse and intrusion. Researchers are increasingly interested in applying data mining techniques to intrusion detection. Data mining is the knowledge discovery process of analysing a huge volume of data from various perspectives and summarizing it into useful information; it is used to find hidden patterns in a large data set. Classification is one of the most important applications of data mining. Classification techniques are used to assign data items to predefined class labels. During data mining, the classifier builds classification models from an input data set, which are used to predict future data trends. This work presents an intrusion detection system using Bagging Ensemble Selection.


ABSTRACT (Tamil)

Over the past few decades, computer networks, internet-based applications and technology have progressed rapidly. Data is the most important asset of an organization, and information requires proper protection and management. Nowadays cyber attacks have become very common, and intrusion detection methods can provide network security. Intrusion detection gathers information from various parts of a network or computer to detect misuse and security failures. Researchers are interested in intrusion detection systems that use mining techniques. Data mining is the knowledge discovery process of analysing large volumes of data from various perspectives and summarizing it into useful information. It is used to discover hidden patterns in a large data set. Classification is one of the most important applications of data mining.


ACKNOWLEDGEMENT

First of all, I extend my gratitude from the bottom of my heart to the honorable

Chairman Dr. L P Thangavelu, MS., FAIS, FIAGES, FICS, and to our respected

Correspondent Mrs. Shanthi Thangavelu of PPG Institute of Technology, for

providing me the best platform that satisfies my quest for knowledge throughout my curriculum; it has also paved the way for taking up a challenging project.

I feel elated in recording my heart-felt gratitude to our Principal

Dr. R Prakasam, B.E. (Hons)., M.E., Ph.D., FIE., C.Engg., MISTE, for his

motivation and his ardent attitude in providing all necessary facilities that has helped

me in shaping up this project.

I am highly grateful to Mr. P P Joby., M.Tech., (Ph.D.), Associate Professor

and Head, Department of Computer Science and Engineering, for offering incessant

help in all possible ways towards the execution of this work.

I take immense pleasure in expressing my humble note of gratitude to

Ms. Santhamani V, M.E., Mrs. T Poongodi, M.Tech., (Ph.D.), Project Coordinators,

Department of Computer Science and Engineering, for their valuable suggestions and

remarkable guidance throughout the completion of the project.

I thank my project guide Mrs. D Sumathi, M.E., (Ph.D.), Assistant Professor [Senior Grade], Department of Computer Science and Engineering, who guided me throughout the project and generously shared her enormous knowledge.

I also extend my thankfulness to all our department faculty members, my

parents and friends for their moral support that helped me in carrying out this project.

Above all, I thank the “ETERNITY” for giving me the strength and courage in

accomplishing this project work.

SREENATH.M


TABLE OF CONTENTS

CHAPTER NO. TITLE PAGE NO.

ABSTRACT (English) iii

ABSTRACT (Tamil) iv

LIST OF FIGURES viii

LIST OF ABBREVIATIONS ix

1 INTRODUCTION 1

2 LITERATURE SURVEY 3

2.1 Intrusion Detection System 3

2.2 What is not an IDS? 3

2.3 Taxonomy of attacks and intrusions 4

2.4 Structure and architecture of IDS 7

2.5 Classification of IDSs 10

2.6 Intrusion detection approaches 12

3 SYSTEM ANALYSIS 16

3.1 Existing System 16

3.1.1 Drawbacks 17

3.2 Feasibility Study 17

3.2.1 Operational Feasibility 17

3.2.2 Economical Feasibility 18

3.2.3 Technical Feasibility 18

4 SYSTEM SPECIFICATION 19

4.1 Hardware Requirements 19

4.2 Software Requirements 19

5 SOFTWARE DESCRIPTION 20

5.1 Weka 20

6 PROJECT DESCRIPTION 23


6.1 Data sets 23

6.2 Bagging Ensemble Selection 24

6.3 Algorithms related to Bagging Ensemble Selection 25

6.4 Classifiers for performance comparison 27

6.4.1 OneR 27

6.4.2 HoeffdingTree 27

6.4.3 DecisionStump 28

6.5 Classifier performance measures 28

7 SYSTEM IMPLEMENTATION 30

8 CONCLUSION AND FUTURE ENHANCEMENT 34

8.1 Conclusion 34

8.2 Future Enhancement 34

APPENDICES 35

A.1 Source Code 35

A.2 List of Publications 47

REFERENCES 48


LIST OF FIGURES

FIGURE NO. TITLE PAGE NO.

2.1 A sample IDS 8

2.2 IDS components 8

2.3 AAFID representation of IDS 10

3.1 Pseudo code of EDADT algorithm 16

6.1 Confusion matrix 28

7.1 Weka GUI chooser 30

7.2 Command-line interface of Weka 31

7.3 Performance comparison on NSL-KDD data set 32

7.4 Performance comparison on Honeypot data set 32

7.5 Performance comparison on Ping Flood data set 33

7.6 Performance comparison on UDP Flood data set 33


LIST OF ABBREVIATIONS

IDS Intrusion Detection System

VPN Virtual Private Network

SSL Secure Sockets Layer

TCP Transmission Control Protocol

IP Internet Protocol

UDP User Datagram Protocol

ICMP Internet Control Message Protocol

BES Bagging Ensemble Selection


CHAPTER 1

INTRODUCTION

As network technology expands rapidly, securing that technology is becoming a requirement for an organization's survival. Most organizations rely on the web to communicate with people and systems, and to provide news, online shopping, email and personal data. Because of the rapid growth of technology and the widespread use of the Internet, securing an organization's critical data within or across networks has become a serious problem, since many individuals attempt to attack systems to extract information. An enormous number of attacks have been observed in the last few years. An Intrusion Detection System plays a major role against those attacks by securing the system's critical data. As firewalls and antivirus software are insufficient to provide full protection, organizations need to deploy an Intrusion Detection System to protect their critical data against different kinds of attacks.

Intrusions are activities that attempt to bypass the security mechanisms of computer systems; that is, they are activities that threaten the integrity, availability or confidentiality of a system resource. These properties have the following definitions:

• Confidentiality – data is not made available or disclosed to unauthorized persons, entities or processes;

• Integrity – information has not been altered or destroyed in an unauthorized manner;

• Availability – a system or system resource is accessible and usable on demand by an authorized user.


Intrusion detection is the process of monitoring the events occurring in a computer or network and analysing them for signs of intrusions, such as unauthorized entry, activity or file modification.


CHAPTER 2

LITERATURE SURVEY

A detailed literature study related to the project was carried out, and the essential knowledge gained is presented in the following sections.

2.1 Intrusion Detection System

An Intrusion Detection System (IDS) is a defence system, which detects

hostile activities in a network. The key is then to detect and possibly prevent activities

that may compromise system security, or a hacking attempt in progress, including reconnaissance/data-collection phases that involve, for example, port scans. One key

feature of intrusion detection systems is their ability to provide a view of unusual

activity and issue alerts notifying administrators and/or block a suspected connection.

According to Amoroso, intrusion detection is "a process of identifying and responding to malicious activity targeted at computing and networking resources". In addition, IDS tools are capable of distinguishing between insider attacks, originating from inside the organization (coming from its own employees or customers), and external ones (the threat posed by outside hackers).

2.2 What is not an IDS?

Contrary to popular market belief and terminology employed in the literature

on intrusion detection systems, not everything falls into this category. In particular, the

following security devices are NOT IDS:

• Network logging systems used, for example, to detect complete vulnerability to

any Denial of Service (DoS) attack across a congested network. These are

network traffic monitoring systems.


• Vulnerability assessment tools that check for bugs and flaws in operating

systems and network services (security scanners), for example Cyber Cop

Scanner.

• Anti-virus products designed to detect malicious software such as viruses, Trojan horses, worms, bacteria and logic bombs. Feature by feature, these are very similar to intrusion detection systems and often provide an effective security breach detection tool.

• Firewalls

• Security/cryptographic systems, for example VPN, SSL, Kerberos, Radius etc.

2.3 Taxonomy of attacks and intrusions

Since intrusion detection systems deal with hacking breaches, let us take a

closer look at these dangerous activities. To assist in the discussion of their taxonomy,

some definitions will be helpful although they may vary [10]:

• Intrusion – a series of concatenated activities that pose threat to the safety of IT

resources from unauthorized access to a specific computer or address domain;

• Incident – violation of the system security policy rules that may be identified as

a successful intrusion;

• Attack – a failed attempt to enter the system (no violation committed) [19].

• Modelling of intrusions – a time based modelling of activities that compose an

intrusion. The intruder starts his attack with an introductory action followed by

auxiliary ones (or evasions) to proceed to successful access; in practice, any

attempts undertaken during the attack by any person, for example by the IT

resource manager can be identified as a threat.

Generally, attacks can be categorized in two areas:

• Passive (aimed at gaining access to penetrate the system without compromising

IT resources),


• Active (results in an unauthorized state change of IT resources).

In terms of the relation intruder-victim, attacks are categorized as:

• Internal, coming from own enterprise’s employees or their business partners or

customers,

• External, coming from outside, frequently via the Internet.

Attacks are also identified by the source category, namely those performed

from internal systems (local network), the Internet or from remote dial-in sources [11].

Now, let us see what types of attacks and abuses are detectable (sometimes hardly

detectable) by IDS tools to put them in the ad-hoc categorization. The following types

of attacks can be identified:

• Those related to unauthorized access to the resources (often as introductory

steps toward more sophisticated actions):

• Password cracking and access violation,

• Trojan horses,

• Interceptions; most frequently associated with TCP/IP stealing

and interceptions that often employ additional mechanisms to

compromise operation of attacked systems [16] (for example by

flooding; man in the middle attacks),

• Spoofing (deliberately misleading by impersonating or

masquerading the host identity by placing forged data in the

cache of the named server i.e. DNS spoofing)

• Scanning of ports and services, including ICMP scanning (ping), UDP scanning, and TCP stealth scanning (which takes advantage of a partial TCP connection establishment), etc.

• Remote OS Fingerprinting, for example by testing typical

responses on specific packets, addresses of open ports, standard

application responses (banner checks), IP stack parameters etc.,


• Network packet listening (a passive attack that is difficult to

detect but sometimes possible),

• Stealing information, for example disclosure of proprietary

information,

• Authority abuse; a kind of internal attack, for example,

suspicious access of authorized users having odd attributes (at

unexpected times, coming from unexpected addresses),

• Unauthorized network connections,

• Usage of IT resources for private purposes, for example to

access pornography sites,

• Taking advantage of system weaknesses to gain access to

resources or privileges,

• Unauthorized alteration of resources (after gaining unauthorized access):

• Falsification of identity, for example to get system administrator

rights,

• Information altering and deletion,

• Unauthorized transmission and creation of data (sets), for

example arranging a database of stolen credit card numbers on a

government computer.

• Unauthorized configuration changes to systems and network

services (servers).

• Denial of Service (DoS):

• Flooding – compromising a system by sending huge amounts of useless

information to lock out legitimate traffic and deny services:

• Ping flood (Smurf) – a large number of ICMP packets sent to a

broadcast address,


• Sendmail flood – flooding with hundreds of thousands of messages in a short period of time; also POP and SMTP relaying,

• SYN flood – initiating huge amounts of TCP requests and not

completing handshakes as required by the protocol,

• Distributed Denial of Service (DDoS); coming from a multiple

source,

• Compromising the systems by taking advantage of their vulnerabilities:

• Buffer Overflow, for example Ping of Death – sending a very large ICMP packet (exceeding 64 KB),

• Remote System Shutdown,

• Web Application attacks; attacks that take advantage of application bugs may

cause the same problems as described above.

It is important to remember that most attacks are not a single action but rather a series of individual events developed in a coordinated manner.
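The flooding behaviour described above lends itself to a simple rate-based check. The following sketch is written for illustration and is not taken from this project's implementation; it flags any source that sends more than a threshold number of packets of one kind within a sliding time window (the signature of ping and SYN floods), and the `(timestamp, source, kind)` tuple format is an assumed input representation.

```python
from collections import defaultdict

def detect_flood(packets, window=1.0, threshold=100):
    """Flag source addresses that send more than `threshold` packets
    of one kind within any `window`-second interval.
    `packets` is an iterable of (timestamp, src_ip, kind) tuples."""
    buckets = defaultdict(list)
    for ts, src, kind in packets:
        buckets[(src, kind)].append(ts)
    alerts = set()
    for (src, kind), times in buckets.items():
        times.sort()
        lo = 0
        for hi, t in enumerate(times):
            # Shrink the window from the left until it spans <= `window` seconds.
            while t - times[lo] > window:
                lo += 1
            if hi - lo + 1 > threshold:
                alerts.add((src, kind))
                break
    return alerts
```

A real NIDS would apply the same idea per protocol (ICMP echo, TCP SYN without completed handshakes) on live captures rather than on a prepared list.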

2.4 Structure and architecture of Intrusion Detection System

An intrusion detection system always has its core element - a sensor (an

analysis engine) that is responsible for detecting intrusions. This sensor contains

decision-making mechanisms regarding intrusions [17]. Sensors receive raw data from

three major information sources (Fig.2.1): own IDS knowledge base, syslog and audit

trails. The syslog may include, for example, configuration of file system, user

authorizations etc. This information creates the basis for a further decision-making

process. The arrow width is proportional to the amount of information flowing between

system components.


Fig.2.1: A sample IDS

The sensor is integrated with the component responsible for data collection (Fig.2.2), an event generator [8]. The collection manner is determined by the event generator policy that defines the filtering mode of event notification information. The event generator (operating system, network, application) produces a policy-consistent set of events that may be a log (or audit) of system events, or network packets. This set, along with the policy information, can be stored either in the protected system or outside it. In certain cases no data storage is employed, for example when event data streams are transferred directly to the analyser. This concerns network packets in particular.

Fig.2.2: IDS components


The role of the sensor is to filter information and discard any irrelevant data

obtained from the event set associated with the protected system, thereby detecting

suspicious activities. The analyser uses the detection policy database for this purpose.

The latter comprises the following elements: attack signatures, normal behaviour

profiles, necessary parameters (for example, thresholds). In addition, the database

holds IDS configuration parameters, including modes of communication with the

response module. The sensor also has its own database containing the dynamic history

of potential complex intrusions (composed from multiple actions).
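The event-generator-to-sensor flow described above can be sketched as a minimal pipeline. The log-line format and the plain-substring signature database below are illustrative assumptions, not the format of any particular IDS:

```python
def event_generator(raw_log):
    """Produce a policy-consistent event set: parse raw log lines into
    event dictionaries, discarding lines the filtering policy treats
    as irrelevant (here, comment lines)."""
    for line in raw_log:
        if line.startswith("#"):      # filtering mode: skip non-events
            continue
        ts, src, msg = line.split(" ", 2)
        yield {"ts": float(ts), "src": src, "msg": msg}

def sensor(events, signatures):
    """Analysis engine: match each event against the detection policy
    database (here reduced to plain-text attack signatures) and emit
    alerts for suspicious activity."""
    alerts = []
    for ev in events:
        for sig in signatures:
            if sig in ev["msg"]:
                alerts.append((ev["src"], sig))
    return alerts
```

In a full system, the detection policy database would also hold normal-behaviour profiles and thresholds, and the sensor would feed a separate response module.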

Intrusion detection systems can be arranged as either centralized (for example,

physically integrated within a firewall) or distributed. A distributed IDS consists of

multiple Intrusion Detection Systems (IDS) over a large network, all of which

communicate with each other. More sophisticated systems follow an agent structure

principle where small autonomous modules are organized on a per-host basis across the

protected network. The role of the agent is to monitor and filter all activities within the

protected area (depending on the approach adopted) and make an initial analysis and

even undertake a response action. The cooperative agent network that reports to the

central analysis server is one of the most important components of intrusion detection

systems. DIDS can employ more sophisticated analysis tools, particularly connected

with the detection of distributed attacks. Another separate role of the agent is

associated with its mobility and roaming across multiple physical locations. In

addition, agents can be specifically devoted to detect certain known attack signatures.

This is a decisive factor when introducing protection means associated with new types

of attacks. IDS agent-based solutions also use less sophisticated mechanisms for

response policy updating.

One multi-agent architectural solution, which originated in 1994, is AAFID (Autonomous Agents for Intrusion Detection), shown in Fig.2.3. It uses agents that

monitor a certain aspect of the behaviour of the system they reside on at the time. For

example, an agent can see an abnormal number of telnet sessions within the system it


monitors. An agent has the capacity to issue an alert when detecting a suspicious event.

Agents can be cloned and shifted onto other systems (autonomy feature). Apart from

agents, the system may have transceivers to monitor all operations effected by agents

of a specific host [15]. Transceivers always send the results of their operations to a

unique single monitor. Monitors receive information from a specific network area (not

only from a single host), which means that they can correlate distributed information.

Additionally, some filters may be introduced for data selection and aggregation.

Fig.2.3: AAFID representation of an intrusion detection system employing

autonomous agents.

2.5 Classification of Intrusion Detection Systems

Intrusion Detection Systems are partitioned into the following categories:

host-based (HIDS), network-based (NIDS), and Hybrid Intrusion Detection [4]. A

HIDS demands small programs (agents) to be installed on individual systems to be

supervised. The programs monitor the operating system and write down results to log

files and/or trigger alarms. A NIDS customarily consists of a network application with

a Network Interface Card (NIC) working in promiscuous mode and a separate management interface [19]. Such systems are placed on a boundary or network

segment and observe all traffic on that segment. The prevailing tendency in intrusion


detection is to mix both network based and host based information to develop hybrid

systems that have more efficiency.

• Host-Based Intrusion Detection System (HIDS): A host-based IDS places monitoring agents on network resource nodes to monitor the audit logs generated by the application programs or the network operating system. Audit logs contain records of the activities and events taking place at each network resource. A HIDS can detect attacks that cannot be seen by a NIDS, such as misuse by a trusted insider. It uses a signature rule base derived from the site-specific security policy. A HIDS overcomes some problems of a NIDS by alerting security personnel, who can identify the source of the attack with the help of the site-specific security policy. It can also verify whether an attack was foiled, whether because of an immediate response to an alarm or for any other reason, and it can track user log-on and log-off actions and all activities that generate audit records [1].

• Network-Based Intrusion Detection System (NIDS): A NIDS monitors and analyses network traffic to protect a system from network-based threats. It tries to find malicious activities such as port scans, ping sweeps, denial-of-service (DoS) attacks and packet-sniffing attacks. A NIDS includes one or more servers for management functions, a number of sensors that oversee packet traffic, and one or more management consoles for the human interface. It examines the traffic packet by packet, in real or near-real time, to detect intrusion patterns. The analysis of traffic to detect intrusions is done by agents on the management servers [12]. These network-based procedures are regarded as the active component.

• Hybrid Intrusion Detection: The network-based and host-based Intrusion Detection System solutions each have unique benefits and strengths over one another, which is why the next generation of Intrusion Detection Systems evolves to embrace tightly fused network and host components. A hybrid intrusion detection system increases the security level and promises better flexibility. It reports attacks aimed at the entire network or at particular segments, and combines the Intrusion Detection System agent locations [20].

Each technique has its own methodology for monitoring and securing information, and each category has strengths and weaknesses that should be weighed against the requirements of the target environment. The two kinds of Intrusion Detection System differ significantly from each other but complement each other well. In a proper IDS implementation, it is better to fully integrate the network intrusion detection system so that it channels alarms and warnings to the host-based part of the system in an indistinguishable way, controlled from the same central location. Doing so gives a convenient means of managing and responding to attacks using both kinds of intrusion detection.
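To make the host-based side concrete, the following sketch scans audit-log lines, in the style of an SSH authentication log, for repeated failed logins from a single source — the kind of activity a HIDS sees in audit records but a NIDS on another segment would miss. The log format and threshold are assumptions for illustration, not part of this project's implementation.

```python
import re
from collections import Counter

# Pattern for a typical sshd "Failed password" audit record (illustrative).
FAILED = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+)")

def audit_failed_logins(log_lines, limit=3):
    """Return the set of source addresses with more than `limit`
    failed login attempts in the given audit-log lines."""
    counts = Counter()
    for line in log_lines:
        m = FAILED.search(line)
        if m:
            counts[m.group(2)] += 1   # group(2) is the source address
    return {src for src, n in counts.items() if n > limit}
```

An agent-based HIDS would run such checks continuously on each protected host and forward alerts to the central monitor.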

2.6 Intrusion detection approaches

The desirable elements of an Intrusion Detection System can be achieved through a variety of approaches. There are two popular approaches to intrusion detection: misuse (abuse) detection and anomaly detection [2, 13].

Abuse detection: Systems possessing information on abnormal, unsafe

behaviour (attack signature-based systems) are often used in real-time intrusion

detection systems (because of their low computational complexity).

The misbehaviour signatures fall into two categories [6]:

• Attack signatures – they describe action patterns that may pose a

security threat. Typically, they are presented as a time-dependent

relationship between series of activities that may be interlaced with

neutral ones.


• Selected text strings – signatures to match text strings which look for

suspicious action (for example – calling /etc/passwd).

Any action that is not clearly considered prohibited is allowed. Hence, their

accuracy is very high (low number of false alarms). Typically, they do not achieve

completeness and are not immune to novel attacks.

There are two main approaches associated with signature detection (already mentioned

in the section describing real-time detectors):

• Verification of the pathology of lower layer packets— many types of

attacks (Ping of Death or TCP Stealth Scanning) exploit flaws in IP,

TCP, UDP or ICMP packets. With a very simple verification of flags

set on specific packets it is possible to determine whether a packet is

legitimate or not. Difficulties may be encountered with possible packet

fragmentation and the need for re-assembly. Similarly, some problems

may be associated with the TCP/IP layer of the system being protected.

It is well known that hackers use packet fragmentation to bypass many

IDS tools [7].

• Verification of application layer protocols — many types of attacks

exploit programming flaws, for example, out-of-band data sent to an

established network connection. In order to effectively detect such

attacks, the IDS must have implemented many application layer

protocols.
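The verification of flags set on specific packets, mentioned above, can be illustrated with a small check for flag combinations that no legitimate TCP stack sends — the NULL, Xmas and SYN+FIN patterns used by stealth scans. This is a sketch of the idea, not any particular IDS's rule set; the flag constants follow the TCP header bit layout.

```python
# TCP flag bits as laid out in the TCP header's flags byte.
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20

def illegal_flag_combo(flags):
    """True if the flags byte is a combination no legitimate stack
    sends: NULL (no flags set), Xmas (FIN+PSH+URG) or SYN+FIN."""
    if flags == 0:                                       # NULL scan
        return True
    if flags & (FIN | PSH | URG) == (FIN | PSH | URG):   # Xmas scan
        return True
    if flags & (SYN | FIN) == (SYN | FIN):               # SYN+FIN scan
        return True
    return False
```

Because the check inspects a single byte, it is cheap enough to run on every packet, which is why this style of rule suits real-time signature-based detectors.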

The signature detection methods have the following advantages: a very low false alarm rate, simple algorithms, easy creation of attack signature databases, easy implementation and typically minimal system resource usage.

Some disadvantages:

• Difficulties in updating information on new types of attacks (keeping the attack signature database up to date as appropriate).


• They are inherently unable to detect unknown, novel attacks. A

continuous update of the attack signature database for correlation is a

must.

• Maintenance of IDS is necessarily connected with analysing and

patching of security holes, which is a time-consuming process.

• The attack knowledge is operating environment–dependent, so

misbehaviour signature-based intrusion detection systems must be

configured in strict compliance with the operating system (version,

platform, applications used etc.)

• They seem to have difficulty handling internal attacks. Typically,

abuse of legitimate user privileges is not sensed by the system as a

malicious activity (because of the lack of information on user privileges

and attack signature structure).

Anomaly detection: Normal behaviour patterns are useful in predicting both

user and system behaviour. Here, anomaly detectors construct profiles that represent

normal usage and then use current behaviour data to detect a possible mismatch

between profiles and recognize possible attack attempts [5, 10].

In order to match event profiles, the system is required to produce initial user

profiles to train the system with regard to legitimate user behaviours. There is a

problem associated with profiling: when the system is allowed to “learn” on its own,

experienced intruders (or users) can train the system to the point where previously

intrusive behaviour becomes normal behaviour. An inappropriate profile will not be able to detect all possible intrusive activities. Furthermore, there is an obvious need for profile updating and system training, which is a difficult and time-consuming task. Given a set of

normal behaviour profiles, everything that does not match the stored profile is

considered to be a suspicious action. Hence, these systems are characterized by very

high detection efficiency (they are able to recognize many attacks that are new to the

system), but their tendency to generate false alarms is generally a problem.
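A minimal sketch of such an anomaly detector, assuming numeric feature vectors and a simple per-feature mean/standard-deviation profile — one of many possible profile representations, not the method of any system cited here:

```python
import statistics

def build_profile(training_rows):
    """Normal-behaviour profile: per-feature mean and standard
    deviation learned from (assumed attack-free) training data."""
    cols = list(zip(*training_rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in cols]

def is_anomalous(row, profile, z_max=3.0):
    """Flag the observation if any feature deviates from the profile
    mean by more than `z_max` standard deviations."""
    for x, (mu, sd) in zip(row, profile):
        if sd > 0 and abs(x - mu) / sd > z_max:
            return True
    return False
```

The `z_max` threshold is the knob behind the false-alarm trade-off discussed below: lowering it catches more novel attacks at the cost of more false alarms.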


Advantages of this anomaly detection method are: possibility of detection of

novel attacks as intrusions; anomalies are recognized without getting inside their

causes and characteristics; less dependence of IDSs on operating environment (as

compared with attack signature-based systems); ability to detect abuse of user

privileges.

The biggest disadvantages of this method are:

• A substantial false alarm rate. System usage is not monitored during the profile construction and training phases; hence, any user activity not seen during these phases will later be flagged as illegitimate.

• User behaviours can vary with time, thereby requiring a constant

update of the normal behaviour profile database (this may imply the

need to close the system from time to time and may also be associated

with greater false alarm rates).

• The necessity of retraining the system for changing behaviour makes it blind to anomalies present during the training phase (false negatives).


CHAPTER 3

SYSTEM ANALYSIS

3.1 Existing System

Framework of proposed EDADT (Efficient Data Adapted Decision Tree)

algorithm: The pseudo code of the proposed EDADT algorithm shown in Fig.3.1

utilizes a hybrid PSO technique to identify the local and global best values over n iterations to obtain the optimal solution. The best solution is obtained by

calculating the average value and by finding the exact efficient features from the given

training data set.

Fig.3.1: Pseudo code of EDADT algorithm

For each attribute ‘a’, select all unique values of ‘a’ and find the unique values that belong to the same class label. If n unique values belong to the same class label, split them into ‘m’ intervals, where ‘m’ must be less than ‘n’. If the unique values belong to ‘c’ different class labels, check whether the probability of a value belonging to the same class is high; if so, change the class label of those values to the class label with the highest probability. Split the unique values into ‘c’ intervals, then repeat the check of unique values against the class labels for all values in the data set. Compute the normalized information gain for each attribute; the decision node is formed from the attribute with the highest normalized information gain. Sublists are generated using the best attributes, and those nodes form the child nodes. This process continues until the data set converges.
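The normalized information gain used to select the decision node is commonly computed as a gain ratio. The following sketch is an illustrative reading of the prose above, not the authors' exact code:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Normalized information gain (gain ratio) of splitting the class
    labels by the distinct values of one attribute: information gain
    divided by the split information of the partition."""
    base = entropy(labels)
    n = len(labels)
    cond, split_info = 0.0, 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        p = len(subset) / n
        cond += p * entropy(subset)        # conditional entropy after split
        split_info -= p * math.log2(p)     # penalizes many-valued attributes
    gain = base - cond
    return gain / split_info if split_info > 0 else 0.0
```

An attribute whose values perfectly separate the classes scores 1.0, while an attribute uncorrelated with the classes scores 0.0, so picking the highest gain ratio picks the most discriminative attribute for the decision node.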

3.1.1 Drawbacks of existing system

• The KDD Cup 99 dataset was used for performance evaluation.

• A technique with better performance is needed.

3.2 Feasibility Study

All projects are feasible, given unlimited resources and infinite time. Before going further into the steps of software development, the system analyst has to analyse whether the proposed system will be feasible for the organization and must identify the customer's needs. The main purpose of a feasibility study is to determine whether the problem is worth solving. The success of a system also lies in the amount of feasibility study done on it. Many feasibility studies can be done on a system, but there are three main feasibility tests to be performed: operational feasibility, technical feasibility and economic feasibility.

3.2.1 Operational Feasibility

Operational feasibility is mainly concerned with whether the system will be

used if it is developed and implemented, and whether there will be resistance from

users that will affect the possible application benefits. The essential questions that

help in testing the operational feasibility of a system are the following.

• Does management support the project?

• Are the users unhappy with current business practices? Will the system reduce

operation time considerably?

• Have the users been involved in the planning and development of the project?

• Will the proposed system really benefit the organization? Does the overall

responsiveness increase? Will accessibility of information be lost?

3.2.2 Economic Feasibility

A system is judged economically feasible if the expected benefits equal or

exceed the expected costs. In economic feasibility, a cost-benefit analysis is performed

in which the expected costs and benefits are evaluated; as the name suggests, it is an

analysis of the costs to be incurred by the system and the benefits derivable from it.

This analysis is used to evaluate the effectiveness of the proposed system.

3.2.3 Technical Feasibility

In technical feasibility the following issues are taken into consideration.

• Whether the required technology is available.

• Whether the required resources are available – manpower (programmers,

testers and debuggers), software and hardware.

Once technical feasibility is established, it is important to consider the monetary

factors as well, since developing a particular system may be technically possible yet

require huge investment while yielding few benefits. For this reason, the economic

feasibility of the proposed system is also evaluated.


CHAPTER 4

SYSTEM SPECIFICATION

4.1 Hardware Requirements

Processor : Core i3 1.8 GHz

Hard Disk : 500 GB

RAM : 4 GB

4.2 Software Requirements

Operating System : Windows XP and above

Programming language : Java

Data Mining Workbench : Weka 3.7.11


CHAPTER 5

SOFTWARE DESCRIPTION

5.1 Weka

Weka (Waikato Environment for Knowledge Analysis) is a popular suite of

machine learning software written in Java, developed at the University of Waikato,

New Zealand. Weka is free software available under the GNU General Public License

[21].

The Weka (pronounced Weh-Kuh) workbench contains a collection of

visualization tools and algorithms for data analysis and predictive modeling, together

with graphical user interfaces for easy access to this functionality. The original non-

Java version of Weka was a TCL/TK front-end to (mostly third-party) modeling

algorithms implemented in other programming languages, plus data pre-processing

utilities in C, and a Makefile-based system for running machine learning experiments.

This original version was primarily designed as a tool for analyzing data from

agricultural domains, but the more recent fully Java-based version (Weka 3), for which

development started in 1997, is now used in many different application areas, in

particular for educational purposes and research. Advantages of Weka include:

• free availability under the GNU General Public License

• portability, since it is fully implemented in the Java programming language and

thus runs on almost any modern computing platform

• a comprehensive collection of data pre-processing and modelling techniques

• ease of use due to its graphical user interfaces

Weka supports several standard data mining tasks, more specifically, data pre-

processing, clustering, classification, regression, visualization, and feature selection.

All of Weka's techniques are predicated on the assumption that the data is available as

a single flat file or relation, where each data point is described by a fixed number of


attributes (normally, numeric or nominal attributes, but some other attribute types are

also supported). Weka provides access to SQL databases using Java Database

Connectivity and can process the result returned by a database query. It is not capable

of multi-relational data mining, but there is separate software for converting a

collection of linked database tables into a single table that is suitable for processing

using Weka. Another important area that is currently not covered by the algorithms

included in the Weka distribution is sequence modeling.

Weka's main user interface is the Explorer, but essentially the same

functionality can be accessed through the component-based Knowledge Flow interface

and from the command line. There is also the Experimenter, which allows the

systematic comparison of the predictive performance of Weka's machine learning

algorithms on a collection of datasets.

The Explorer interface features several panels providing access to the main

components of the workbench:

• The Preprocess panel has facilities for importing data from a database, a CSV

file, etc., and for preprocessing this data using a so-called filtering algorithm.

These filters can be used to transform the data (e.g., turning numeric attributes

into discrete ones) and make it possible to delete instances and attributes

according to specific criteria.

• The Classify panel enables the user to apply classification and regression

algorithms (indiscriminately called classifiers in Weka) to the resulting dataset,

to estimate the accuracy of the resulting predictive model, and to visualize

erroneous predictions, ROC curves, etc., or the model itself (if the model is

amenable to visualization like, e.g., a decision tree).

• The Associate panel provides access to association rule learners that attempt to

identify all important interrelationships between attributes in the data.


• The Cluster panel gives access to the clustering techniques in Weka, e.g., the

simple k-means algorithm. There is also an implementation of the expectation

maximization algorithm for learning a mixture of normal distributions.

• The Select attributes panel provides algorithms for identifying the most

predictive attributes in a dataset.

• The Visualize panel shows a scatter plot matrix, where individual scatter plots

can be selected and enlarged, and analyzed further using various selection

operators


CHAPTER 6

PROJECT DESCRIPTION

6.1 Data sets

NSL-KDD data set: The NSL-KDD data set was proposed to resolve a number

of the inherent issues of the KDD CUP'99 data set, the most widely used data set for

anomaly detection. Tavallaee et al. conducted a statistical analysis of KDD CUP'99

and found two important issues that greatly affected the performance of evaluated

systems and resulted in a very poor evaluation of anomaly detection approaches. To

resolve these problems, they proposed a new data set, NSL-KDD, consisting of

selected records of the complete KDD data set [18]. The benefits of NSL-KDD over

the original KDD data set are the following. First, it does not include redundant

records in the train set, so classifiers will not be biased towards more frequent

records. Second, the number of selected records from each difficulty-level group is

inversely proportional to the percentage of records in the original KDD data set. As

a result, the classification rates of distinct machine learning methods vary over a

wider range, which makes an accurate evaluation of different learning techniques

more feasible. Third, the numbers of records in the train and test sets are reasonable,

which makes it affordable to run experiments on the entire set without having to

randomly choose a small portion. Consequently, the evaluation results of different

research works will be consistent and comparable.

HONEYPOT data set: This data set was collected from a honeypot set up at

Tokyo University.

DENIAL OF SERVICE data sets: Ping flood and UDP flood data sets were

collected for the performance testing on individual attacks.


6.2 Bagging Ensemble Selection

The simple forward-model-selection-based ensemble selection algorithm is

superior to many other prominent ensemble learning algorithms, such as bagging

decision trees, stacking with linear regression at the meta-level, and boosting decision

stumps [3]. However, the performance of the final ensemble may sometimes degrade

because ensemble selection overfits the hill-climbing set. [3] explains that the

performance of ensemble selection on the hill-climb set gradually increases as the

number of models in the model library increases, whereas the performance on the test

set does not always increase: it may reach a global or local maximum and then

gradually decline. For certain data sets, the root-mean-squared-error metric may

decline very quickly, as depicted in [3]. The authors of [3] proposed three additions to

the simple forward selection procedure to reduce the chance of overfitting the

hill-climb set: (1) an individual classifier can be selected multiple times, so that some

classifiers receive larger weights than others; (2) the models in the library are sorted

by performance, and the best N models are placed in the initial ensemble, which avoids

starting with an empty ensemble; (3) ensemble selection is done inside each of K bags

of models drawn randomly from the model library; the final ensemble is the union of

the subsets selected from each bag.

Based on the observations from [8], a new ensemble learning algorithm called

bagging ensemble selection was proposed: the simple forward ensemble selection

algorithm is treated as an unstable base classifier, and the bagging idea is applied to it

to construct an ensemble of simple ensemble selection classifiers, resulting in a

technique more robust than an individual ensemble selection classifier. The hill-climb

set can be taken from the out-of-bag samples.

In the Bagging Ensemble Selection algorithm [14], the full bootstrap sample is

used for model generation and the corresponding out-of-bag instances serve as the

hill-climb set. A bootstrap sample is expected to contain about 1 - 1/e ≈ 63.2% of the

unique examples of the training set; hence, for each bagging repetition, the hill-climb

set is expected to contain about 1/e ≈ 36.8% of the unique examples of the training

set. Here the Reduced Error Pruning Tree (REPTree) is used as the base classifier.
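The 63.2% / 36.8% split follows from the probability that a given training instance is never drawn in m samples with replacement, (1 - 1/m)^m → 1/e. The figure can be checked empirically with a small standalone Java sketch (illustrative only, not part of the thesis code):

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Empirically checks that a bootstrap sample of size m contains about 63.2%
// unique training instances, leaving roughly 36.8% out-of-bag.
public class BootstrapFraction {
    public static double uniqueFraction(int m, long seed) {
        Random r = new Random(seed);
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < m; i++) {
            seen.add(r.nextInt(m)); // draw an instance index with replacement
        }
        return (double) seen.size() / m;
    }

    public static void main(String[] args) {
        System.out.printf("unique fraction : %.3f%n", uniqueFraction(100_000, 42));
        System.out.printf("expected 1 - 1/e: %.3f%n", 1.0 - 1.0 / Math.E);
    }
}
```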

The pseudo code for Bagging Ensemble Selection is as follows:

Inputs:

    Training set S, Ensemble Selection classifier E, integer T (number of bootstrap samples)

Basic procedure:

    for i = 1 to T

    {

        Sb  = bootstrap sample from S

        Sob = out-of-bag sample

        Train base classifiers (possibly a diverse model library) in E on Sb

        Ei  = do ensemble selection based on the base classifiers' performance on Sob

    }

6.3 Algorithms related to Bagging Ensemble Selection

The data mining community has always faced the problem of constructing

ensemble classifiers with the best predictive performance on practical problems.

Ensemble methods are more stable and accurate than individual classifiers. The

mathematical formulation used in [9] illustrates the idea of ensemble learning: let y

be an instance and n_i, i = 1...k, a set of base classifiers that output probability

distributions n_i(y, d_j) for each class label d_j, j = 1...n. The final classifier ensemble

output x(y) for instance y is shown in equation 6.1, where w_i is the weight of base

classifier n_i:

    x(y) = argmax_{d_j} Σ_{i=1..k} w_i · n_i(y, d_j)        (6.1)


Many ensemble methods have been proposed since the mid-90's. The bagging

method exploits the instability of base classifiers to improve their predictive

performance. The basic idea is that, given a training set T of size m and a classifier C,

bagging generates n new training sets Ti by sampling with replacement, each of size

m′ ≤ m. Bagging then applies C to each Ti to build n models, and the final output is

determined by simple voting.

AdaBoost (Adaptive Boosting) is a popular ensemble algorithm that uses an

iterative process to improve the simple boosting algorithm [6]. The algorithm gives

more focus to patterns that are difficult to classify. P-AdaBoost, introduced later, is a

distributed version of AdaBoost that works in two phases: in the first phase it runs in

its sequential fashion for a bounded number of steps; in the second phase the

classifiers are trained in parallel using weights estimated from the first phase.

Ensemble selection is the method of constructing ensembles from a library of

base classifiers. Initially, different machine learning algorithms are used to build the

base models. Well-performing subsets of all models are then extracted using a

construction strategy such as forward stepwise selection, guided by some scoring

function. The simple forward model selection procedure proposed in [8] works as

follows: (1) initialize an empty ensemble; (2) add to the ensemble the model in the

library that maximizes the ensemble's performance under the error metric on a

hill-climb set; (3) repeat step 2 until all models have been examined; (4) return the

ensemble subset that achieved maximum performance on the hill-climb set. The

advantage of ensemble selection is that it can be optimized for many common

performance metrics or a combination of metrics.
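As an illustration of steps (1)–(4), the following standalone Java sketch performs greedy forward selection over a toy library of three "models", each represented only by its predictions on a hill-climb set, with RMSE as the error metric (the data are hypothetical; models may be selected multiple times):

```java
import java.util.ArrayList;
import java.util.List;

// Greedy forward ensemble selection on a hill-climb set, minimizing RMSE of
// the averaged ensemble prediction. Models are given only by their predictions.
public class ForwardSelection {
    static double rmse(double[] pred, double[] truth) {
        double s = 0;
        for (int i = 0; i < truth.length; i++) {
            double e = pred[i] - truth[i];
            s += e * e;
        }
        return Math.sqrt(s / truth.length);
    }

    // preds[m][i]: prediction of model m on hill-climb instance i
    public static List<Integer> select(double[][] preds, double[] truth, int steps) {
        List<Integer> chosen = new ArrayList<>();
        double[] sum = new double[truth.length]; // running sum of chosen predictions
        for (int s = 0; s < steps; s++) {
            int best = -1;
            double bestErr = Double.MAX_VALUE;
            for (int m = 0; m < preds.length; m++) {
                double[] avg = new double[truth.length];
                for (int i = 0; i < truth.length; i++)
                    avg[i] = (sum[i] + preds[m][i]) / (chosen.size() + 1);
                double err = rmse(avg, truth);
                if (err < bestErr) { bestErr = err; best = m; }
            }
            chosen.add(best); // the same model may be re-selected
            for (int i = 0; i < truth.length; i++) sum[i] += preds[best][i];
        }
        return chosen;
    }

    public static void main(String[] args) {
        double[] truth = {1.0, 0.0, 1.0, 0.0};
        double[][] preds = {
            {0.9, 0.2, 0.8, 0.1},   // good model
            {0.5, 0.5, 0.5, 0.5},   // uninformative model
            {0.1, 0.9, 0.2, 0.8}    // bad model
        };
        System.out.println(select(preds, truth, 2)); // [0, 0]: the good model wins twice
    }
}
```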


6.4 Classifiers for performance comparison

The following classifier algorithms were implemented for performance

comparison on the data sets described in Section 6.1.

• oneR

• HoeffdingTree

• DecisionStump

6.4.1 oneR

OneR, short for "One Rule", is a simple and accurate classification algorithm

that generates one rule for each predictor in the data and then selects the rule with the

smallest total error as its "one rule". To build a rule for a predictor, OneR constructs a

frequency table of that predictor against the target. It has been shown that OneR

produces rules only slightly less accurate than state-of-the-art classification algorithms

while producing rules that are easy for humans to interpret.
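A minimal standalone sketch of the frequency-table idea for a single nominal predictor (the protocol/label data are hypothetical; the full OneR algorithm builds such a rule for every predictor, discretizes numeric ones, and keeps the rule with the smallest total error):

```java
import java.util.HashMap;
import java.util.Map;

// For one nominal attribute, tabulate attribute value vs. class label and
// predict the majority class for each value.
public class OneRSketch {
    public static Map<String, String> buildRule(String[] attr, String[] cls) {
        Map<String, Map<String, Integer>> freq = new HashMap<>();
        for (int i = 0; i < attr.length; i++) {
            freq.computeIfAbsent(attr[i], k -> new HashMap<>())
                .merge(cls[i], 1, Integer::sum);
        }
        Map<String, String> rule = new HashMap<>();
        freq.forEach((value, counts) -> rule.put(value,
            counts.entrySet().stream()
                  .max(Map.Entry.comparingByValue())
                  .get().getKey()));
        return rule;
    }

    public static void main(String[] args) {
        String[] proto = {"tcp", "tcp", "udp", "udp", "tcp"};
        String[] label = {"attack", "attack", "normal", "normal", "normal"};
        System.out.println(buildRule(proto, label)); // rule: tcp -> attack, udp -> normal
    }
}
```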

6.4.2 HoeffdingTree

A Hoeffding tree [3] is an incremental, anytime decision tree induction

algorithm capable of learning from data streams, assuming that the distribution

generating the examples does not change over time. Hoeffding trees exploit the fact

that a small sample is often sufficient to choose the optimal splitting attribute. This

idea is supported mathematically by the Hoeffding bound, which quantifies the

number of observations required to estimate a statistic within a prescribed precision.

A theoretically appealing feature of Hoeffding trees, not shared by other incremental

decision tree learners, is their sound performance guarantees: using the Hoeffding

bound, one can show that the tree's output is asymptotically nearly identical to that of

a non-incremental learner given infinitely many examples.
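For reference, the Hoeffding bound states that, with probability 1 - δ, the true mean of a random variable with range R lies within ε of its sample mean after n independent observations, where ε = sqrt(R² ln(1/δ) / (2n)). A small illustrative Java sketch (not part of the thesis code):

```java
// Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).
// The required precision improves as the number of observations n grows.
public class HoeffdingBound {
    public static double epsilon(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    public static void main(String[] args) {
        // e.g. a statistic with range R = 1, at 95% confidence (delta = 0.05)
        System.out.println(epsilon(1.0, 0.05, 1000));
        System.out.println(epsilon(1.0, 0.05, 100000)); // shrinks with more data
    }
}
```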


6.4.3 DecisionStump

A decision stump [19] is a machine learning model consisting of a one-level

decision tree: a decision tree whose single internal node (the root) is directly

connected to the terminal nodes (its leaves). A decision stump makes a prediction

based on the value of just one input feature; stumps are also known as 1-rules.

Decision stumps are often used as base learners in machine learning ensemble

techniques such as boosting and bagging. For example, the state-of-the-art

Viola–Jones face detection algorithm employs AdaBoost with decision stumps as

weak learners.
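A minimal standalone sketch of a stump over one numeric feature, choosing the threshold with the fewest misclassifications (the feature values and labels are hypothetical; a full implementation would also handle nominal attributes and instance weights):

```java
// One-level decision tree on a single numeric feature: instances with
// x <= threshold are predicted class 0, the rest class 1.
public class DecisionStump {
    public static double bestThreshold(double[] x, int[] y) {
        double best = x[0];
        int bestErr = Integer.MAX_VALUE;
        for (double t : x) {            // candidate thresholds at observed values
            int err = 0;
            for (int i = 0; i < x.length; i++) {
                int pred = x[i] <= t ? 0 : 1;
                if (pred != y[i]) err++;
            }
            if (err < bestErr) { bestErr = err; best = t; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] pktRate = {1, 2, 3, 10, 12, 15};   // hypothetical feature
        int[] label      = {0, 0, 0,  1,  1,  1};   // 1 = flood attack
        System.out.println(bestThreshold(pktRate, label)); // 3.0
    }
}
```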

6.5 Classifier performance measures

A confusion matrix contains information about the actual and predicted

classifications made by a classification system; the performance of such systems is

commonly evaluated using the data in the matrix. Fig.6.1 shows the confusion

matrix.

Fig.6.1: Confusion matrix

The entries of the confusion matrix have the following meaning in the context

of our study:

• a is the number of correct predictions that an instance is negative,

• b is the number of incorrect predictions that an instance is positive,

• c is the number of incorrect predictions that an instance is negative, and

• d is the number of correct predictions that an instance is positive.

The following metrics are used for evaluation on the data sets:

• Accuracy: the proportion of the total number of predictions that were correct,

determined using the equation:

Accuracy = (a + d) / (a + b + c + d)

• Detection Rate: the proportion of predicted positive cases that were correct,

calculated using the equation:

Detection Rate = d / (b + d)

• False Alarm Rate: the proportion of negative cases incorrectly classified as

positive, calculated using the equation:

False Alarm Rate = b / (a + b)
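The three formulas translate directly into Java (an illustrative helper, not part of the thesis code; the confusion-matrix counts in main are hypothetical):

```java
// Metrics from confusion-matrix counts, using the a/b/c/d convention above:
// a = true negatives, b = false positives, c = false negatives, d = true positives.
public class ClassifierMetrics {
    public static double accuracy(int a, int b, int c, int d) {
        return (double) (a + d) / (a + b + c + d);
    }

    public static double detectionRate(int b, int d) {
        return (double) d / (b + d);
    }

    public static double falseAlarmRate(int a, int b) {
        return (double) b / (a + b);
    }

    public static void main(String[] args) {
        // Hypothetical counts: 90 TN, 10 FP, 5 FN, 95 TP
        System.out.println(accuracy(90, 10, 5, 95));   // 0.925
        System.out.println(detectionRate(10, 95));
        System.out.println(falseAlarmRate(90, 10));    // 0.1
    }
}
```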


CHAPTER 7

SYSTEM IMPLEMENTATION

The classifier algorithms were simulated using Weka. Fig.7.1 shows the

Weka GUI Chooser.

Fig.7.1: Weka GUI Chooser


Fig.7.2 shows the Command Line Interface of Weka, from which the

operations can be initiated.

Fig.7.2: Command Line Interface of Weka


Fig.7.3 shows the performance comparison of the classifier algorithms on the

NSL-KDD data set.

Fig.7.3: Performance comparison on NSL-KDD data set

Fig.7.4 shows the performance comparison of the classifier algorithms on the

Honeypot data set.

Fig.7.4: Performance comparison on Honeypot data set


Fig.7.5 shows the performance comparison of the classifier algorithms on the

ping flood data set.

Fig.7.5: Performance comparison on Ping Flood data set

Fig.7.6 shows the performance comparison of the classifier algorithms on the

UDP flood data set.

Fig.7.6: Performance comparison on UDP Flood data set


CHAPTER 8

CONCLUSION AND FUTURE ENHANCEMENT

8.1 Conclusion

The proposed Bagging Ensemble Selection classifier has been tested using the

data sets described in Section 6.1. A comparative study of the classification measures

Accuracy, Detection Rate and False Alarm Rate was carried out by simulation using

the Weka toolkit. The experimental results show that Bagging Ensemble Selection

gives the best performance in terms of Accuracy, Detection Rate and False Alarm

Rate on the NSL-KDD and Honeypot data sets. On the ping flood data set, Bagging

Ensemble Selection falls behind the Hoeffding tree, which gives the best performance

on those measures. On the UDP flood data set, Bagging Ensemble Selection and the

Hoeffding tree both give the best performance.

8.2 Future Enhancement

As future work, the proposed method may be extended to streaming data. It

may also be tested with more real data sets, and more performance measures may be

taken into consideration.


APPENDICES

A.1 Source Code

The source code of Bagging Ensemble Selection is given below

package weka.classifiers.meta;

import weka.core.*;

import java.util.*;

import weka.classifiers.meta.baggingEnsembleSelection.*;

import weka.classifiers.trees.REPTree;

import weka.classifiers.Classifier;

import weka.classifiers.AbstractClassifier;

import weka.core.AdditionalMeasureProducer;

import weka.core.Capabilities;

import weka.core.Instance;

import weka.core.Instances;

import weka.core.Option;

import weka.core.OptionHandler;

import weka.core.Randomizable;

import weka.core.TechnicalInformation;

import weka.core.TechnicalInformationHandler;

import weka.core.Utils;

import weka.core.WeightedInstancesHandler;

import weka.core.TechnicalInformation.Field;

import weka.core.TechnicalInformation.Type;

import java.util.Enumeration;

import

weka.classifiers.meta.baggingEnsembleSelection.BESHelper.BESBaseClassifier;


import

weka.classifiers.meta.baggingEnsembleSelection.BESHelper.BESBaseClassifierLib;

public class BESTrees

extends AbstractClassifier

implements OptionHandler, Randomizable, WeightedInstancesHandler,

AdditionalMeasureProducer, TechnicalInformationHandler {

private BESoob m_BES = new BESoob();

protected boolean m_UseNNLS = false;

protected int m_NumBags = 10;

protected int m_NumTreesPerBag = 10;

protected int m_NumFeatures = 0;

protected int m_RandomSeed = 1;

protected int m_KValue = 0;

protected int m_FinalEnsembleSize = -1;

protected int m_Slots = 1;

protected int m_MaxDepth = -1;

protected int m_NumFolds = 3;

protected int m_HillclimbMetric = BESHelper.METRIC_RMSE;

public SelectedTag getHillclimbMetricMethod() {

return new SelectedTag(m_HillclimbMetric,

BESDirectHillclimbingES.TAGS_METRIC);

}

public void setHillclimbMetricMethod(SelectedTag method) {

if (method.getTags() == BESDirectHillclimbingES.TAGS_METRIC) {

m_HillclimbMetric = method.getSelectedTag().getID();

} }

public void setNumExecutionSlots(int numSlots) {

m_Slots = numSlots;


}

public int getNumExecutionSlots() {

return m_Slots;

}

public int getNumFeatures() {

return m_NumFeatures;

}

public void setNumFeatures(int newNumFeatures) {

m_NumFeatures = newNumFeatures;

}

public int getMaxDepth() {

return m_MaxDepth;

}

public void setMaxDepth(int value) {

m_MaxDepth = value;

if (m_MaxDepth <= 0) {

m_MaxDepth = -1;

} }

@Override

public Capabilities getCapabilities() {

return new REPTree().getCapabilities();

}

@Override

public void setSeed(int seed) {

m_RandomSeed = seed;

}

@Override

public int getSeed() {


return m_RandomSeed;

}

public int getFinalEnsembleSize() {

if (m_UseNNLS) {

return m_FinalEnsembleSize;

} else {

return m_NumBags;

} }

public void setNumTreesPerBag(int n) {

if (n < 2) {

m_NumTreesPerBag = 2;

} else {

m_NumTreesPerBag = n;

} }

public int getNumTreesPerBag() {

return m_NumTreesPerBag;

}

public int getTotalNumberTrees() {

return m_BES.getTotalNumberTrees();

}

public void setNumBags(int n) {

if (n < 1) {

m_NumBags = 1;

} else {

m_NumBags = n;

} }

public int getNumBags() {

return m_NumBags;


}

public void setUseNNLS(boolean use) {

m_UseNNLS = use;

}

public boolean getUseNNLS() {

return m_UseNNLS;

}

public void setOptions(String[] options) throws Exception {

String selectionString = Utils.getOption('Z', options);

if (selectionString.length() != 0) {

setHillclimbMetricMethod(new SelectedTag(Integer

.parseInt(selectionString),

BESDirectHillclimbingES.TAGS_METRIC));

} else {

setHillclimbMetricMethod(new SelectedTag(BESHelper.METRIC_RMSE,

BESDirectHillclimbingES.TAGS_METRIC));

}

String tmpStr;

tmpStr = Utils.getOption('I', options);

if (tmpStr.length() != 0) {

m_NumBags = Integer.parseInt(tmpStr);

} else {

m_NumBags = 10;

}

tmpStr = Utils.getOption('H', options);

if (tmpStr.length() != 0) {

m_NumTreesPerBag = Integer.parseInt(tmpStr);

} else {


m_NumTreesPerBag = 10;

}

tmpStr = Utils.getOption('K', options);

if (tmpStr.length() != 0) {

m_NumFeatures = Integer.parseInt(tmpStr);

} else {

m_NumFeatures = 0;

}

tmpStr = Utils.getOption('G', options);

if (tmpStr.length() != 0) {

m_Slots = Integer.parseInt(tmpStr);

} else {

m_Slots = 1;

}

tmpStr = Utils.getOption('S', options);

if (tmpStr.length() != 0) {

setSeed(Integer.parseInt(tmpStr));

} else {

setSeed(1);

}

tmpStr = Utils.getOption("depth", options);

if (tmpStr.length() != 0) {

setMaxDepth(Integer.parseInt(tmpStr));

} else {

setMaxDepth(0);

}

setUseNNLS(Utils.getFlag('N', options));

super.setOptions(options);


Utils.checkForRemainingOptions(options);

}

public String[] getOptions() {

Vector result;

String[] options;

int i;

result = new Vector();

result.add("-Z");

result.add("" + getHillclimbMetricMethod().getSelectedTag().getID());

result.add("-I");

result.add("" + getNumBags());

result.add("-H");

result.add("" + getNumTreesPerBag());

result.add("-K");

result.add("" + getNumFeatures());

result.add("-S");

result.add("" + getSeed());

result.add("-G");

result.add("" + this.getNumExecutionSlots());

if (getUseNNLS()) {

result.add("-N");

}

if (getMaxDepth() > 0) {

result.add("-depth");

result.add("" + getMaxDepth());

}

options = super.getOptions();

for (i = 0; i < options.length; i++) {


result.add(options[i]);

}

return (String[]) result.toArray(new String[result.size()]);

}

public void setNumFolds(int n) {

m_NumFolds = n;

}

public void buildClassifier(Instances data) throws Exception {

getCapabilities().testWithFail(data);

if (data.attribute(data.classIndex()).isNominal()) {

m_UseNNLS = false;

}

data = new Instances(data);

data.deleteWithMissingClass();

Random r = new Random(this.m_RandomSeed);

m_KValue = m_NumFeatures;

if (m_KValue < 1) {

m_KValue = (int) Utils.log2(data.numAttributes()) + 1;

}

double attRatio = 1.0 * m_KValue / (data.numAttributes() - 1);

BESHelper.BESBaseClassifierLib lib = new BESHelper.BESBaseClassifierLib();

lib.setDebug(false);

ArrayList<String> algos = new ArrayList<String>();

for (int i = 1; i <= m_NumTreesPerBag; i++) {

algos.add("weka.classifiers.meta.RandomSubSpace -P "

+ attRatio + " -S " + r.nextInt()

+ " -num-slots 1 -I 1 -W weka.classifiers.trees.REPTree -- -M 2 -V 0.0010 -N "

+ m_NumFolds + " -S " + r.nextInt() + " -L " + m_MaxDepth);

}

for (int run = 0; run < algos.size(); run++) {

String algoStr = algos.get(run);

String[] parts = algoStr.split(" ");

int firstSpace = algoStr.indexOf(" ");

String opStr = algoStr.substring(firstSpace + 1, algoStr.length());

try {

String[] ops = weka.core.Utils.splitOptions(opStr);

Classifier model = AbstractClassifier.forName(parts[0], ops);

String threadID = "algo" + run;

BESHelper.BESBaseClassifier bcModel = new

BESHelper.BESBaseClassifier(threadID, model);

lib.add(bcModel);

} catch (Exception e) {

System.err.println(algoStr + ": " + e.toString());

}

}

BESDirectHillclimbingES des = new BESDirectHillclimbingES();

des.setBaseClassifierLibrary(lib);

des.setDebug(false);

des.setInternalHillclimbSetRatio(0.33);

des.setHillclimbMetric(getHillclimbMetricMethod());

des.setHillclimbIterations(lib.size());

BESoob besOOB = new BESoob();

besOOB.setNumExecutionSlots(m_Slots);

besOOB.setNumIterations(m_NumBags);

besOOB.setClassifier(des);


besOOB.setCalcOutOfBag(true);

besOOB.setDebug(m_Debug);

besOOB.setUseNNLS(m_UseNNLS);

besOOB.buildClassifier(data);

m_BES = besOOB;

m_FinalEnsembleSize = m_BES.getFinalEnsembleSize();

}

public double[] distributionForInstance(Instance instance) throws Exception {

return m_BES.distributionForInstance(instance);

}

public TechnicalInformation getTechnicalInformation() {

TechnicalInformation result;

result = new TechnicalInformation(Type.INPROCEEDINGS);

result.setValue(Field.AUTHOR, "Quan Sun and Bernhard Pfahringer");

result.setValue(Field.YEAR, "2012");

result.setValue(Field.TITLE, "Bagging Ensemble Selection for Regression");

result.setValue(Field.JOURNAL, "In Proceedings of the 25th Australasian Joint Conference on Artificial Intelligence (AI'12)");

result.setValue(Field.PAGES, "695-706");

return result;

}

public Enumeration enumerateMeasures() {

Vector newVector = new Vector(2);

newVector.addElement("measureFinalEnsembleSize");

newVector.addElement("measureFinalNumTrees");

return newVector.elements();

}

public double getMeasure(String additionalMeasureName) {


if (additionalMeasureName.equalsIgnoreCase("measureFinalEnsembleSize")) {

return measureFinalEnsembleSize();

} else if (additionalMeasureName.equalsIgnoreCase("measureFinalNumTrees")) {

return measureFinalNumTrees();

} else {

throw new IllegalArgumentException(additionalMeasureName + " not supported (RandomForest)");

} }

public double measureFinalEnsembleSize() {

return getFinalEnsembleSize();

}

public double measureFinalNumTrees() {

return m_BES.getTotalNumberTrees();

}

public Enumeration listOptions() {

Vector newVector = new Vector();

newVector.addElement(new Option(

"\tNumber of trees to build.",

"I", 1, "-I <number of trees>"));

newVector.addElement(new Option(

"\tNumber of features to consider (<1=int(logM+1)).",

"K", 1, "-K <number of features>"));

newVector.addElement(new Option(

"\tSeed for random number generator.\n" + "\t(default 1)",

"S", 1, "-S"));

newVector.addElement(new Option(

"\tThe maximum depth of the trees, 0 for unlimited.\n" + "\t(default 0)",

"depth", 1, "-depth <num>"));


newVector.addElement(

new Option(

"\tSet the target metric"

+ " to use. 0 = Correlation Coefficient, 1 = RMSE, 2 = ROC, "

+ "3 = PRECISION, 4 = Recall, 5 = F score, 6 = All, "

+ "7 = Weighted TP, 8 = Mean Abs Error, 9 = Accuracy \n"

+ "\t(default 1 = RMSE)",

"Z", 2, "-Z <target metric>"));

Enumeration enu = super.listOptions();

while (enu.hasMoreElements()) {

newVector.addElement(enu.nextElement());

}

return newVector.elements();

}

public String toString() {

if (m_BES == null) {

return "BESTrees: No model built yet.";

}

int totalNumTrees = getNumTreesPerBag() * getNumBags();

String result = "The final BESTrees model has " + this.getTotalNumberTrees() + " trees (out of " + totalNumTrees + ") \n";

return result;

}

public static void main(String[] argv) throws Exception {

runClassifier(new BESTrees(), argv);

}

}


A.2 List of publications

1. “Intrusion Detection System using Bagging Ensemble Selection,” 2015 IEEE International Conference on Engineering and Technology (ICETECH), 20th March 2015, Coimbatore, TN, India.

2. “A Comprehensive Review on Intrusion Detection Systems,” CiiT International Journal of Networking and Communication Engineering, Vol 6, No 9, 2014.

3. “Performance comparison of classification algorithms using Weka,” International Journal of Advanced and Innovative Research, Vol 4, No 4, 2015.


REFERENCES

1. A.S. Asmaa & G. Sharad, (2011) “Importance of Intrusion Detection System (IDS),” International Journal of Scientific and Engineering Research, ISSN 2229-5518, Volume 2, Issue 1.

2. J. Anderson, (1995) An Introduction to Neural Networks, MIT Press, Cambridge. [Book]

3. R. Caruana, A. Niculescu-Mizil, G. Crew & A. Ksikes,( 2004) “Ensemble selection from libraries of models.” ICML '04 Proceedings of the twenty-first international conference on machine learning.

4. H. Guang-Bin, H.W Dian & Yuan Lan, (2011) “ Extreme learning machines: a

survey. ” International Journal of Machine Learning and Cybernetics 2, no. 2. 5. D. Herve, D. Marc & W. Andreas, (1999) “ Towards taxonomy of intrusion-

detection systems ” Computer Networks 31, no. 8. 6. B. Hyeran, & L. Seong-Whan, (2002) “Applications of support vector machines for

pattern recognition: A survey.” In Pattern recognition with support vector machines, Springer Berlin Heidelberg, pp. 213-236.

7. R. Jagannathan, L. Teresa , A. Debra, D. Chris, G. Fred, J. Caveh, J. Hal,N. Peter,

T. Ann & V. Alfonso, (1993) “Next-generation intrusion detection expert system (NIDES)”. Technical Report A007/A008/A009/A011/A012/A014, SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025, March.

8. P. Nicholas, C. Mandy, A. Ronald Olsson & M. Biswanath, (1997) “A software platform for testing intrusion detection systems.” IEEE Software, Vol. 14, No. 5.

9. I. Partalas, G. Tsoumakas & I. Vlahavas, (2010) “An ensemble uncertainty aware measure for directed hill climbing ensemble pruning.” Machine Learning, Vol. 81, No. 3.

10. S. Paul, K. Sokratis, G. Dimitris, A. Francois, D. John, G. Claude, K. Dimitris, P. Kess, P. Heiki & S. Thomas, (1994) “SECURENET: A network-oriented intelligent intrusion prevention and detection system”. Network Security Journal, Vol. 1, No. 1, November.


11. P. Porras & A. Valdes, (1998) “Live Traffic Analysis of TCP/IP Gateways”, Proceedings of the 1998 ISOC Symposium on Network and Distributed System Security (NDSS'98), San Diego, CA, March.

12. K. Przemyslaw & D. Piotr, (2003) “Intrusion Detection Systems (IDS) Part I (network intrusions; attack symptoms; IDS tasks; and IDS architecture).” Retrieved April 20

13. B. Rebecca & M. Peter, (2001) NIST Special Publication on Intrusion Detection Systems. Booz-Allen and Hamilton Inc., McLean, VA.

14. S. Quan & P. Bernhard, (2011) “Bagging ensemble selection.” AI'11 Proceedings of the 24th International Conference on Advances in Artificial Intelligence, pp. 251-260.

15. B. Rhodes, J. Mahaffey & J. Cannady, (2000) “Multiple self-organizing maps for intrusion detection”, Proceedings of the 23rd National Information Systems Security Conference, Baltimore, pp. 16-19.

16. K. Sandeep & S. Eugene, (1994) “A pattern matching model for misuse intrusion detection”. In Proceedings of the 17th National Computer Security Conference, pp. 11-21, October.

17. N. Stephen & N. Judy, (2002) Network Intrusion Detection. Sams Publishing. [Book]

18. M. Tavallaee, E. Bagheri, W. Lu & A. Ghorbani, (2009) “A detailed analysis of the KDD CUP'99 data set”, IEEE Int. Conf. Comput. Intell. Security Defense Appl., pp. 53-58.

19. S. William, (2006) Cryptography and Network Security, 4/E. Pearson Education India. [Book]

20. http://www.cerias.purdue.edu/about/history/coast_resources/idcontent/detection.html [Website]

21. http://www.cs.waikato.ac.nz/~ml/weka/ [Website]