CHAPTER 2
LITERATURE SURVEY
A large proportion of research on IDS focuses on developing new system architectures that improve the accuracy and completeness of detection. Several research areas in the IDS domain help move the field towards the set of ideal requirements listed in Table 2.1.
Table 2.1 Ideal requirements of IDS
Accuracy: No false positives
Completeness: No false negatives
Performance: Real-time detection
Fault Tolerance: The IDS must not become a security vulnerability itself
Timeliness: Quick propagation of information in the network to react to potential intrusions
Scalability: Handling large amounts of data
Intrusion detection procedures are classified into three categories that differ in the reference data used for detecting unusual activity. Signature-based or Misuse Detection (MD) considers signatures of known malicious activity. Anomaly Detection (AD) considers a profile of normal system activity, and protocol-based or specification-based detection considers constraints that characterize the normal behavior of a particular protocol or program. The trend is to apply ML to IDS, which offers flexibility for detection and lends itself conveniently to AD. AD operates on the assumption that attacks differ from normal activity and focuses on identifying unusual behavior in a host or a network.
However, it is now common to develop hybrid systems, which may combine misuse and anomaly detectors, host-based and network-based modules, and event correlation and stateless detectors. With increasing research on hybrid IDS, recent work focuses on correlating alerts between the different modules in an efficient manner [3, 4]. Alert aggregation is one such area, in which similar alerts/events are grouped into a single generalized event. With this method, the amount of data that the system administrator must analyze to detect an intrusion is reduced. Event correlation is another research area, which is well established for MD. MD-based IDS often use a set of rules or signatures as the attack model, with each rule usually dedicated to detecting a different attack. Earlier research emphasized that data sets for analysis can be obtained from real traffic, sanitized traffic and simulated traffic [5, 6]. In real time, however, a fast response to external events within an extremely short time is demanded and expected. Therefore, an alternative algorithm that implements real-time learning is imperative for critical applications in fast-changing environments. Even for offline applications speed is still a need, and a real-time learning algorithm that reduces training time and human effort to nearly zero would always be of considerable value. The advent of new technologies has greatly increased the ability to monitor and resolve the details of changes for better analysis, but analyzing large amounts of data is still a challenge. To identify frequently changing trends, the data needs to be analyzed and corrected. In some cases, feature selection may improve the performance of the detection, as it reduces the dimensionality and thereby the complexity of the problem. Researchers have proposed several methods of feature selection to achieve real-time IDS. The major benefit of feature selection is that the amount of data to be processed is significantly reduced without compromising the performance of the detection.
2.1 CURRENT IDS PRODUCTS
IDS can be classified according to many different features [7,8]. Table 2.2 lists some of the currently available IDS and their features.
Table 2.2 Leading IDS products currently available

SNORT: An open source network Intrusion Prevention and Detection System (IDS/IPS). SNORT is developed by Sourcefire and combines the benefits of signature, protocol and anomaly based inspection. SNORT is one of the most widely deployed IDS/IPS technologies worldwide.

COUNTERACT: Delivers an entirely unique approach to preventing network intrusions. The system stops attackers based on their "proven intent" to attack. It does not use signatures, AD or pattern matching of any kind. To launch an attack, an attacker needs knowledge about a network's resources. Prior to the attack, intruders compile vulnerability and configuration information by scanning and probing. This information is used to launch attacks based on the unique structure and characteristics of the targeted network. These characteristics of intruders are used by COUNTERACT to prevent intrusions.

AIRMAGNET: Provides a simple, scalable WLAN monitoring solution that enables an organization to proactively mitigate all types of wireless threats.

BRO-IDS: An open source, Unix based NIDS. Bro passively monitors network traffic and looks for suspicious activity. Bro detects intrusions by first parsing network traffic to extract its application level semantics and then executing event oriented analyzers that compare the activity with patterns deemed to be troublesome.

CISCO INTRUSION PREVENTION SYSTEM (IPS): One of the most widely deployed IPS. It provides protection against more than 30,000 known threats, with timely signature updates and Cisco Global Correlation to dynamically recognize, evaluate, and stop emerging Internet threats. Cisco IPS includes industry leading research and the expertise of Cisco Security Intelligence Operations. It also protects against increasingly sophisticated attacks, including directed attacks, worms, botnets, malware and application abuse. It provides intrusion prevention that stops outbreaks at the network level and supports a wide range of deployment options, with near real time updates for the most recent threats.
JUNIPER NETWORKS INTRUSION DETECTION AND PREVENTION (IDP): Offers comprehensive coverage by leveraging multiple detection mechanisms. Backed by the Juniper Networks Security Lab, signatures for the detection of new attacks are generated on a daily basis. Working very closely with many software vendors to assess new vulnerabilities, it is not uncommon for the IDP Series to be equipped to prevent attacks which have not yet occurred. Such coverage ensures that organizations need not merely react to new attacks but can proactively secure the network against future attacks.

McAFEE HOST INTRUSION PREVENTION FOR SERVER: Defends servers from known and new zero day attacks with McAfee Host Intrusion Prevention. Boosts security at low cost and simplifies compliance by reducing the frequency of patching new signatures.

SOURCEFIRE INTRUSION PREVENTION SYSTEM: Built on the foundation of the award-winning Snort® rules-based detection engine. It uses a powerful combination of vulnerability and AD based inspection methods.

STRATA GUARD IDS/IPS: This award winning high speed IDS/IPS gives real time protection from network attacks and malicious traffic. It prevents malware, spyware, port scans, viruses, and DoS and Distributed DoS (DDoS) attacks.
Over the years, researchers and designers have used many techniques to design IDS, but there have been one or more issues with the existing systems. Current AD methods are mainly classified as statistical anomaly detection, detection based on Neural Networks and detection based on DM. An IDS built for AD should first learn the characteristics of normal and abnormal activities, and then detect traffic that deviates from normal activity. AD tries to determine whether a deviation from established normal usage patterns can be flagged as an intrusion [9]. AD techniques are based on the assumption that misuse or intrusive behavior deviates from normal system procedure [10]. The advantage of AD is that it can detect attacks that have never been seen before, but it is ineffective in detecting insider attacks. Shoubridge [11] developed an IDS that can analyze critical network events and trends. The authors of [12,13] represent the dynamic network as a directed graph and calculate similarity measures that show a change in the trend of the network behavior over time. With the same principle, Pincombe [14] developed an IDS that uses graph distance metrics, such as weight, modality, and diameter, to compute graph similarities. Cumulative summation and minimum mean square errors are then used recursively to detect CP. Although this method is faster than previous methods, it did not provide good results for all graph distance metrics. Hence an open question still remains as to which distance measure is best suited for different types of graphs.
Recently, DM and ML methods for data streams [15-18] have been actively proposed. A data stream is an ordered sequence of objects o1,…,on that must be accessed in the same order and can be read only once or a specified number of times. Hence, it is not possible to maintain all the objects of a data stream in main memory, and each object should be examined only once to analyze the stream. The memory space for data stream analysis should be finitely confined, although new objects are generated infinitely over time. Newly generated objects should be analyzed as quickly as possible to maintain up-to-date results with minimum false alarms. Therefore, reducing false positives is a major area of research. Currently the detection of outliers has gained significant research interest, with the insight that outliers can be the key discovery for a possible new attack.
2.2 OUTLIER DETECTION (OD)
OD refers to the problem of finding interesting patterns in data that are very different from the rest of the data. Such patterns may often contain useful information regarding abnormal behavior of the system and are usually called outliers or noise. OD is an extensively researched area that finds immense use in application domains such as credit card fraud detection, illicit access in computer networks, military surveillance for enemy activities and many others. Outlier detection approaches found in the literature [19-21] have varying scopes and abilities. Due to the lack of prior knowledge about the collected data set, the OD problem falls into the category of unsupervised learning. Another area of research is semi-supervised OD, where some examples of outliers and inliers are available as a training set. Semi-supervised outlier detection methods perform better than unsupervised methods since additional label information is available, but such outlier samples for training are not always available and, if available, may be diverse. Thus learning from known types of outliers is not necessarily useful in detecting unknown types of outliers. OD searches for objects that do not follow the rules and expectations in the data set. The detection of an outlier may be evidence that there are new trends/patterns in the data. Although outliers are often considered noise or errors, they may carry important information. OD depends on the detection methods applied and also on the data structures that are used. Depending on the approach used, OD methodologies can be broadly classified as distance based, density based or soft computing based. Selecting subspaces for OD is a complex and challenging problem [22], and outliers are rare and very hard to collect [23]. Rejecting some dimensions for the sake of easy calculation may lead to a loss of important and interesting knowledge.
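To make the distance based category above concrete, the following minimal sketch (an illustration, not a method from the cited works; the choice of k = 5 and the 98th-percentile cut-off are arbitrary assumptions) scores each point by its mean distance to its k nearest neighbours and flags the most isolated points as outliers.

```python
import numpy as np

def knn_outlier_scores(X, k=5):
    """Score each row of X by its mean distance to its k nearest neighbours."""
    # Pairwise Euclidean distances (O(n^2), fine for a small illustration).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)           # ignore self-distance
    nearest = np.sort(dist, axis=1)[:, :k]   # k smallest distances per point
    return nearest.mean(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:3] += 6.0                                  # three injected outliers
scores = knn_outlier_scores(X, k=5)
threshold = np.quantile(scores, 0.98)         # flag the top 2% as outliers
print("flagged indices:", np.where(scores > threshold)[0])
```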
2.3 STATISTICAL BASED ANOMALY DETECTION
Statistics is a widely used tool to build behavior based IDS [24,25]. The behavior of the system or of the user is measured by a number of variables sampled over time. These include
1. User login and logout time of each session.
2. Duration of the resource usage.
3. Amount of processor and memory consumed during that session etc.
4. Number of commands executed that are sampled over time.
One of the popular IDS is the Intrusion Detection Expert System (IDES), which works on statistical AD. IDES monitors users, remote hosts and target systems with different parameters that include CPU usage, command usage, network activity etc. Vectors are formed with these parameters and statistical profiles are updated to reflect new user behavior. To detect anomalies, IDES processes each new data set and verifies it against the known profile. If any deviations are detected, they are reported as probable intrusions. IDES is not suitable if the parameters have multi-modal distributions. This problem is addressed in the next version of IDES, known as the Next-generation Intrusion Detection Expert System (NIDES) [26]. NIDES stores only statistics such as frequencies, means, variances, and covariances of the profile, since storing the audit data itself is too cumbersome. Given a profile with n measures, NIDES characterizes any point in the n-space of the measures as anomalous if it is sufficiently far from an expected or defined value. NIDES evaluates the total deviation and not just the deviation of each individual measure.
Wisdom and Sense [27] is specifically designed using statistical anomaly detection to analyze the behavior of users. Based on the activities of users over a period of time, the system maintains a set of rules that statistically describe their behavior. Current behavior is then matched against these rules to detect inconsistent behavior, and the rules are regularly updated to capture new usage patterns. One simple method is to model a system that keeps averages of all or any of these variables and detects whether thresholds are exceeded based on the standard deviation. Such a model is too simple to represent the data faithfully, and even comparing the variables of individual users with aggregate group statistics may not yield much improvement. Therefore, a more complex model needs to be developed that compares profiles of long-term and short-term user or system activities. These profiles are periodically updated as user behavior changes, and this model is now used in a number of intrusion detection tools and prototypes.
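The simple threshold model described above can be sketched as follows; the variable names and the 3-sigma rule are illustrative assumptions rather than the exact design of IDES/NIDES or Wisdom and Sense.

```python
import numpy as np

class StatisticalProfile:
    """Per-variable mean/std profile; flags values beyond z_max standard deviations."""

    def __init__(self, z_max=3.0):
        self.z_max = z_max
        self.mean = None
        self.std = None

    def fit(self, sessions):
        # sessions: array of shape (n_sessions, n_variables), e.g. CPU, memory, commands
        self.mean = sessions.mean(axis=0)
        self.std = sessions.std(axis=0) + 1e-9   # avoid division by zero

    def is_anomalous(self, session):
        z = np.abs(session - self.mean) / self.std
        return bool((z > self.z_max).any()), z

normal = np.random.default_rng(1).normal(loc=[40, 300, 25], scale=[5, 30, 4], size=(500, 3))
profile = StatisticalProfile()
profile.fit(normal)
print(profile.is_anomalous(np.array([41, 310, 24])))   # typical session
print(profile.is_anomalous(np.array([95, 900, 120])))  # deviating session
```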
2.4 MACHINE LEARNING FOR ANOMALY DETECTION
AI is the simulation of human intelligence in machines with the ability to make decisions. ML is a branch of AI that is specifically concerned with enabling machines to learn from information. Recent research focuses more on the combination of techniques to improve the detection rates of ML classifiers. For example, Mahbod et al. [28] examined the performance of seven ML algorithms on the KDD Cup'99 dataset. They found that different techniques performed better on different classes of intrusion; by combining the best techniques for each class, the overall performance of the detector was increased. However, there are still discrepancies in the findings reported in the literature as to how well different techniques perform on the different classes of intrusions. ML is an ideal technology for defending against attacks. Knowing that programmers tend to repeat mistakes, it provides defenders with an advantage by detecting flaws before an intrusion happens. Sophisticated IDS may use statistical techniques such as Naïve Bayes [29] to find new vulnerabilities, which enables the defender to capture, mislead or use other countermeasures against the attacker. ML provides an advantage to the defender because it can detect anomalies in general; the attacker would need to hide byte patterns in addition to finding and exploiting vulnerabilities, which adds complexity to bypassing defenses. The IDS learns from each attack, and ML makes the system more intelligent and secure over time. ML is an algorithmic method in which an application automatically learns from the input and uses the feedback to improve its performance over time. Unlike statistical methods, which aim at determining deviations in traffic features, ML based approaches aim at detecting anomalies using their own mechanisms. ML is focused on finding relationships in data, and ML approaches are classified as:
1. Supervised Learning (SL) – Attempts to learn some function with given input vector and actual output.
2. Unsupervised Learning (UL) – Attempts to learn only with given input vector by identifying relationships among data.
3. Reinforcement Learning (RL) [30,31] – Learns with a single bit of information which indicates to the neuron whether the output is good or bad.
These ML techniques can also recognize patterns not presented during the training phase. Some of the ML techniques used to detect attacks are Naïve Bayes, SVM and ANN. Most of the ML algorithms applied to intrusion detection have not considered minimizing false alarms, although the cost associated with a false alarm is higher than that of a missed detection.
2.5 MACHINE LEARNING VERSUS STATISTICAL TECHNIQUES
A wide range of real world applications are discussed in the community
of Statistical Analysis and Data Mining. Statistical techniques usually assume an
underlying distribution of data and require the elimination of data instances
containing noise. Statistical methods, though computationally intense, can be
applied to analyze the data. Statistical methods are widely used to build behavior-
based IDS. The behavior of the system is measured by a number of variables
sampled over time such as the resource usage duration, the amount of processor-
memory-disk resources consumed during that session etc. The model keeps averages
of all the variables and detects whether thresholds are exceeded based on the
standard deviation of the variable.
2.6 INSTANCE BASED LEARNING (IBL)
Researchers have also employed IBL techniques in intrusion detection and event correlation/fault management as a means to obtain a more flexible system compared with most Expert Systems (ES). The drawback of using ES is that knowledge of intrusions must be extracted and coded in the form of rules, which is difficult and time consuming, and managing and updating the rule base dynamically is a difficult task. Another problem is that specific rules cannot detect slight variations of known attacks. IBL addresses these problems by reasoning from solved instances/cases, unlike ES which require previous knowledge to determine specific rules [32]. The knowledge repository of instances/cases can be updated automatically and the system can learn from its own experience during operation. However, IBL is not as efficient as ES in performing event correlation and has high memory requirements, as it is necessary to store a large number of cases/rules. Case Based Reasoning (CBR) may be used to improve the performance in acquiring and representing the knowledge for IDS. Lane [33] developed an IDS to perform anomaly detection by means of IBL. In this system, user profiles are built up from UNIX commands and are used to catch long-term, unconventional behavior as well as misuse. The research focus is on data reduction techniques, addressing the general issue of the high memory requirement of IBL. However, IBL was not able to maintain the characteristics of the users as well as a clustering method; hence, clustering is considered the better alternative.
2.7 CHANGE POINT TECHNIQUE
Large scale computer network intrusions during the final stages can be
identified by observing the abrupt changes/threshold in the network traffic [34].
However, these changes are hard to detect and difficult to distinguish from usual
traffic fluctuations in the early stages. Researchers have developed efficient adaptive
sequential and batch sequential methods for an early detection of attacks/intrusions
that lead to changes in network traffic. These methods employ a statistical analysis
of network traffic to detect very subtle traffic changes. The algorithms are based on
CP detection methods that utilize a threshold to raise an alarm. The CP algorithms are self-learning, allow the detection of attacks with a small average delay, and are computationally simple, and thus can be implemented online. Applications of CP models fall into various categories such as Gaussian observations with varying mean or variance, Poisson processes with a piecewise constant rate, changing linear regression models and Hidden Markov Models (HMM) with time varying transition matrices. CP detection methods can be divided into two categories, posterior and sequential. In posterior tests the entire data set is collected first and the CP is detected off-line based on an analysis of the collected data. In contrast, sequential tests are done on-line and the analysis is made on the fly as the data is collected. In research work on statistical data analysis, detecting changes in the mean of a given data series plays an important role. Some of the approaches for CP detection [35-37] are Chauvenet's Criterion, Peirce's Criterion, CUmulative SUM (CUSUM), the Generalized Likelihood Ratio (GLR) and Direct Density Ratio (DDR) estimation methods such as Kernel Mean Matching, which have been actively explored by the ML community.
2.7.1 Coefficient of Variation (CV)
For certain types of data, variability increases proportionally to the average: when the average shifts upwards by at least 50%, so does the standard deviation [38]. Common examples include filling processes, system measurements and system accuracy. For such processes the CV, which is the ratio of the standard deviation to the average, provides a better characterization.
2.7.2 Chauvenet's Criterion [25,39]
This criterion defines an acceptable band of scatter around the mean value of a given sample of N measurements. All data points that fall within the band around the mean corresponding to a probability of [1-(1/2N)] should be retained. A data point is considered for rejection only if the probability of obtaining its deviation from the mean is less than 1/2N.
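A minimal sketch of Chauvenet's criterion is given below, assuming normally distributed measurements; the sample values are invented for illustration.

```python
import numpy as np
from scipy.special import erfc

def chauvenet_mask(data):
    """Return a boolean mask: True for points retained under Chauvenet's criterion."""
    data = np.asarray(data, dtype=float)
    n = data.size
    z = np.abs(data - data.mean()) / data.std()
    # Two-tailed probability of observing a deviation at least this large.
    prob = erfc(z / np.sqrt(2.0))
    # Reject a point when its expected count n * prob falls below 1/2.
    return prob * n >= 0.5

sample = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 14.7])
print(chauvenet_mask(sample))   # the 14.7 reading is rejected
```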
2.7.3 Peirce's criterion [40]
This technique applies a rigorous method based on probability theory to eliminate data "outliers" or spurious data in a more principled way. Peirce's criterion can be applied more generally to data sets that follow Gaussian distributions. A piecewise segmented function, as proposed by Stephen M. Ross, caters for time dependent data, where the CP is identified as the point between successive segments. A CP may be detected by discovering the point for which the errors of the local model fittings of segments to the data before and after that point are minimized. However, it is computationally expensive to converge to such a point, as the local model fitting must be repeated as many times as there are points between successive candidate points whenever new data is given as input.
2.7.4 CUSUM (CUmulative SUM) [41-43]
CUSUM charts can be used to detect deviations from a given predetermined value. The method computes the deviation of the observed data from the desired process mean and accumulates it over time to obtain the CUSUM at each point. The basic rules for interpreting CUSUM values are: if the data is above the overall average, the CUSUM value increases; if the data is below the overall average, the CUSUM value decreases; and a sudden change in the direction of the CUSUM indicates that the values have shifted. The CUSUM method applies a hypothesis test to distinguish between acceptable and unacceptable (quality) attribute values. CUSUM can also be used to detect a shift in a normal mean based on inferences from the normal distribution. It should be noted that the data provided to CUSUM calculations has to follow the normal distribution. The continuous normal or Gaussian probability distribution is parameterized by the population mean μ and the population variance σ².
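The following is a minimal sketch of a one-sided (tabular) CUSUM detector as described above; the slack value k, the decision threshold h and the synthetic traffic series are illustrative assumptions.

```python
import numpy as np

def cusum(series, target_mean, k=0.5, h=5.0):
    """Tabular CUSUM: accumulate deviations from target_mean.

    k is the allowable slack and h the decision threshold, both expressed in
    the same units as the data (illustrative values; tune per application).
    Returns the index of the first alarm, or -1 if none is raised.
    """
    s_hi = s_lo = 0.0
    for i, x in enumerate(series):
        s_hi = max(0.0, s_hi + (x - target_mean - k))   # upward shift detector
        s_lo = max(0.0, s_lo + (target_mean - x - k))   # downward shift detector
        if s_hi > h or s_lo > h:
            return i
    return -1

rng = np.random.default_rng(2)
traffic = np.concatenate([rng.normal(10, 1, 100),   # normal rate
                          rng.normal(13, 1, 50)])   # rate shift (e.g. flooding)
print("first alarm at sample:", cusum(traffic, target_mean=10.0))
```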
2.7.5 Generalized Likelihood Ratio (GLR) [37, 44]
This is an intuitive approach for handling testing problems based on discrepancy measures. The logarithm of the likelihood ratio between two consecutive intervals in time-series data is monitored for detecting change points. This premise has been extensively explored in the DM community in connection with real world applications. Because of the computational cost of the GLR, nonlinear models such as NN had not been employed, even for off-line analysis. Recent advances in both training algorithms and the speed of computers have made it possible to implement GLR for both off-line and real time applications.
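A simplified sketch of the GLR idea for a single mean shift in Gaussian data is shown below; it assumes a known noise standard deviation and an arbitrary synthetic series, so it illustrates the statistic rather than any particular published detector.

```python
import numpy as np

def glr_mean_change(window, sigma=1.0):
    """Maximum GLR statistic for a single mean shift in a Gaussian window.

    Assumes a known noise standard deviation `sigma` (an illustrative
    simplification of the general GLR test described above).
    Returns (best statistic, best split index).
    """
    n = len(window)
    best_stat, best_t = 0.0, 0
    for t in range(1, n):
        m1, m2 = window[:t].mean(), window[t:].mean()
        # Log likelihood ratio of "two means" vs "one mean" at split t.
        stat = (t * (n - t) / n) * (m1 - m2) ** 2 / (2.0 * sigma ** 2)
        if stat > best_stat:
            best_stat, best_t = stat, t
    return best_stat, best_t

rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(0, 1, 60), rng.normal(2, 1, 40)])
stat, split = glr_mean_change(data)
print(f"GLR statistic {stat:.1f} at index {split}")  # alarm if stat exceeds a chosen threshold
```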
2.7.6 DDR (Direct Density Ratio) [45,46]
Direct density ratio estimation has been actively explored in the ML community. Kernel Mean Matching (KMM) avoids density estimation and directly gives an estimate of the importance at test points. The values of the importance are unknown in practice, so they need to be estimated from the collected sample data. If the training and test densities are estimated separately from the data samples, the importance can be estimated by taking the ratio of the two estimated densities. But this approach suffers from the curse of dimensionality if the data has neither low dimensionality nor a simple distribution. Vapnik [47] suggested that direct density ratio estimation is crucial in statistical learning.
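The naive ratio-of-densities approach described above (which KMM is designed to avoid) can be sketched as follows; the kernel bandwidth and the synthetic training/test samples are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(4)
x_train = rng.normal(0.0, 1.0, size=(500, 1))   # "training" distribution
x_test = rng.normal(0.5, 1.0, size=(500, 1))    # shifted "test" distribution

# Naive importance estimation: fit each density separately, then take the ratio.
kde_train = KernelDensity(bandwidth=0.3).fit(x_train)
kde_test = KernelDensity(bandwidth=0.3).fit(x_test)

points = np.linspace(-3, 3, 7).reshape(-1, 1)
importance = np.exp(kde_test.score_samples(points) - kde_train.score_samples(points))
for p, w in zip(points.ravel(), importance):
    print(f"x = {p:+.1f}  estimated density ratio = {w:.2f}")
```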
2.8 APPLICATION OF DM IN DEVELOPING IDS
Due to the large volume of intrusion detection data sets, researchers have applied many DM and ML algorithms for detecting intrusions. DM with ML can be defined as the process of extracting patterns from large data sets by combining methods from statistics and AI. DM is seen as an increasingly important tool by enterprises to transform data into Business Intelligence (BI), giving an informational advantage. It is also currently used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery [48-50]. The relevance of DM in detecting intrusions is still an open research area in intelligent computing. DM can be used to clean, classify and study large amounts of network data and to correlate violations for intrusion detection. The main reason for using DM techniques for IDS is the enormous volume of existing and newly appearing network data that requires processing; the amount of data accumulated each day by a network is enormous. DM algorithms can be used for misuse detection and Anomaly Detection (AD). Many DM algorithms have already been used for AD, such as DT, Naïve Bayes (NB), Neural Networks (NN), SVM etc.

Earlier work emphasized that data can be obtained in three ways [51]:
i. By using real traffic.
ii. Using sanitized traffic.
iii. Using simulated traffic.
In real time, however, a fast response to external events within an extremely short time is demanded and expected. Therefore, an alternative algorithm that implements real-time learning is imperative for critical applications in fast-changing environments. Even for offline applications speed is still a need, and a real-time learning algorithm that reduces training time and human effort to nearly zero would always be of considerable value. Mining data in real time is a big challenge.
2.8.1 Artificial Neural Networks (ANN)
ANN consists of a collection of processing units called neurons that are interconnected in a given topology. ANN has the ability to learn by example and to generalize from limited, noisy, and incomplete data, and has therefore been successfully employed in a wide range of data intensive applications. ANN contributions to, and performance in, the intrusion detection domain can be classified as follows:
2.8.2 Feed Forward Neural Networks (FFNN)
FFNN is the first and the simplest type of ANN devised. Two types of
FFNN are commonly used in modeling either normal or intrusive patterns.
2.8.2.1 Multi Layered Feed Forward (MLFF) Neural Networks
MLFF uses various learning techniques, the most popular being Back Propagation (MLFFBP). MLFFBP networks were applied to develop IDS primarily for anomaly detection at the user behavior level [52,53]. To distinguish between normal and abnormal behavior, Seth Freeman [54] used a data set consisting of user behavior, while Ryan [55] considered command patterns and their frequency of execution. Recent research interest is in detecting software behavior described by sequences of system calls, since system call sequences are more stable than commands. Ghosh [56] built a model using MLFFBP for the lpr program and the DARPA BSM98 dataset; detailed descriptions of this dataset can be found at http://www.ll.mit.edu/IST/ideval/data/data_index.html. Network traffic is another vital data source, and MLFFBP has been applied to network packets for the detection of misuse. Although the training and test iterations required a day to complete, experiments showed that MLFFBP was successful as a binary classifier in correctly identifying attacks in the test data. MLFFBP can also be used as a Multi Class Classifier (MCC); such a NN has multiple output neurons and is more flexible. Mukkamala, Sung and Ajith [57] compared twelve different learning algorithms on the KDD99 dataset and found that resilient back propagation achieved the best performance in terms of accuracy and training time.
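As an illustration of an MLFFBP network used as a multi-class classifier, the sketch below trains a small feed-forward network with back propagation on synthetic stand-in data; the class structure, feature counts and hyper-parameters are assumptions and do not reproduce the cited KDD99 experiments.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for pre-processed connection records:
# 3 classes (0 = normal, 1 = DoS-like, 2 = probe-like), 20 numeric features.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(300, 20)) for c in (0.0, 1.5, 3.0)])
y = np.repeat([0, 1, 2], 300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_train)

# One hidden layer trained with back propagation (illustrative hyper-parameters).
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(scaler.transform(X_train), y_train)
print("multi-class accuracy:", clf.score(scaler.transform(X_test), y_test))
```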
2.8.2.2 Radial Basis Function Neural Networks (RBFNN)
RBFNN are another popular type of FFNN. Classification is performed by measuring distances between inputs and the centers of the RBFNN hidden neurons. RBFNN are much faster than back propagation and are suitable for problems with large data sets [58]. Many researchers [59, 60] have developed systems using RBFNN that can learn from multiple local clusters for well known attacks and for normal events. A hybrid approach is used that integrates both misuse and anomaly detection in a hierarchical RBF network. The first layer has an RBF anomaly detector that identifies whether an event is normal or not. Anomalous events are then passed through an RBF misuse detector chain, each detector dedicated to a specific type of attack. Anomalous events which could not be classified were saved to a database, and a C-Means clustering algorithm clustered these events into different groups. Later a misuse RBF detector was trained on each group and added to the misuse detector chain. Finally all intrusion events were automatically and adaptively detected and labeled.
Since RBF and MLFF networks are widely used, Jiang and Zhang [61] compared the two for misuse and anomaly detection on the KDD99 dataset. Their experiments showed that for misuse detection BP has slightly better performance than RBF in terms of detection rate and false positive rate, but requires a longer training time. For AD, the RBF network improves performance with a high detection rate and a low false positive rate, and requires less training time. In general RBF networks achieve better performance, which was also concluded by Hofmann et al. [62] using the DARPA98 dataset.
2.8.3 Recurrent Neural Networks (RNN)
It is important but difficult to detect attacks spread over a period of time. The window size defined for predicting user behavior should be adjustable. A large window size is needed to capture deterministic behavior when users perform a particular job; during this time their behavior is stable and predictable. When users switch from one job to another, behavior becomes unstable and unpredictable, and a small window size is required in order to quickly forget meaningless past events. The inclusion of memory in NN led to the invention of the RNN or Elman network [63]. RNN have been used in forecasting applications, where the network predicts the next event given an input sequence; if there is a deviation between the predicted output and the actual event, an alarm is generated. Sheikhan et al. [64] modified the RNN model with three layers. The results showed that the model improved the Classification Rate (CR), Detection Rate (DR) and Cost Per Example (CPE), and the model was compared with similar related works as well as simulated MLP and Elman-based intrusion detectors. Ghosh et al. [65] compared RNN with MLFFBP networks for forecasting system call sequences, and the results showed that RNN achieved the best performance, with a detection accuracy of 77.3% and zero false positives. Cheng et al. [66] developed an RNN to detect network anomalies using the KDD99 dataset and emphasized the importance of payload information in network packets. They showed that discarding the payload leads to an undesirable information loss and indicated that with payload information the RNN based system performed better. Much research work confirms that RNN outperform MLFF networks in detection accuracy and generalization capability. The Cerebellar Model Articulation Controller (CMAC) NN [67] is an additional type of RNN which has the capability of incremental learning; this avoids retraining the NN every time a new intrusion is detected.
2.8.4 Self Organizing Maps (SOM)
SOM and Adaptive Resonance Theory (ART) are two unsupervised Neural Networks based on statistical clustering algorithms. They group objects by a similarity measure and are suitable for intrusion detection tasks: when grouped, normal behavior is densely populated around one or two centers, while abnormal behavior or intrusions appear in sparse regions as outliers. SOM are Single Layer Feed Forward Networks (SLFFN) in which data is clustered in a low dimensional grid [68]. A SOM preserves the topological relationships of input data according to their similarity and is one of the most popular NN. Fox first employed SOM to detect viruses in a multiuser machine in 1990. Researchers [69, 70] used SOMs to learn patterns of normal system activities, which have been used for misuse detection; other classification algorithms, such as FFNN, were then trained on the output from the SOM. Sarasamma et al. [71] proposed a method that calculates the probability that a record mapped to a heterogeneous neuron belongs to a type of attack. A confidence factor was defined to determine the type of attack that dominated the neuron. They showed that different subsets of features were good at detecting different attacks, and the results showed that false positive rates were significantly reduced in hierarchical SOMs as compared to single layer SOMs. Rhodes [72] examined network packets and stated that every network protocol layer has a unique structure and function. Malicious activities aiming at a specific protocol should also be unique, and it is unrealistic to build a single SOM to tackle all these activities; they therefore organized a multilayer SOM in which each layer corresponds to one protocol layer. Zanero [73] analyzed the payload of network packets and proposed a multi layer detection framework in which the K-means algorithm was used to avoid calculating the distance between each neuron, which greatly improved the runtime efficiency of the algorithm.
Several NN techniques have been used in intrusion detection and are described as landmarks in the development of IDS. The aim is to simulate the operation of the human brain and make the system flexible and adaptable to environmental changes. An alternative approach to training ANNs uses Genetic Algorithms (GA) to evolve the weights of the ANN, referred to as an Evolutionary Neural Network (ENN) [74]. Hybrid systems developed using NN and Fuzzy Logic [75] performed well with limited training sets of labeled alerts, and such hybrid systems have provided excellent improvements in solutions for real-world problems.
2.8.5 Bayesian Networks (BN)
BN is a probabilistic model that represents a set of variables and their probabilistic dependencies. BN are directed acyclic graphs with nodes representing variables and edges representing the encoded conditional dependencies between the variables [76]. They have been applied to AD in different ways and have been utilized in the decision process of hybrid systems. Ben et al. [77] developed a two-layer AD system that employed NB, which assumes complete independence between the nodes. BN are utilized in the decision process of hybrid systems because they offer a sophisticated way of dealing with the high false alarm rates that most hybrid systems obtain, which is due to the simplistic approach of combining the outputs of the techniques during the decision phase. A hybrid host based AD system may consist of detection techniques such as analyzing string length, character distribution structure, and identifying learned tokens, in which a BN can be used to decide the final output classification. Generally, in anomaly intrusion detection, the number of possible features is large, but an attacker's activity is usually related to just a few features. Furthermore, the effectiveness of a specific feature mainly depends on the behavior, and for this reason activity can be analyzed using individual features independently. A typical AD method relies on statistical analysis, with the advantage that it can generate a concise profile containing only a statistical summary without maintaining the activities themselves; this can lessen the burden of computation overhead for real time intrusion monitoring. However, when the value of each feature varies widely, the statistical summary fails to provide a concise
profile. Moreover, most conventional classification algorithms [78] do not consider updates to the data set and are not suitable for real time data. Consequently, the concept of updating should be incorporated, and a classification method that considers updates to the data set has been proposed. The basic assumption of conventional classification algorithms is that the data set is fixed and available before classification can be performed. This assumption is valid only when static knowledge embedded in a specific data set is the target of clustering. Therefore, it is very important to identify an appropriate data set that reflects the characteristics of the target application domain very well. Hence conventional classification algorithms pose limitations, as the normal behavior of a user is generally analyzed off-line. Kok-Chin Khor et al. [79] implemented BN by selecting important features using a feature selection algorithm and a filter approach; with respect to performance they concluded that the BN performed equivalently well in detecting network attacks. Mutz [80] extended the work by proposing an application based IDS that considers system call arguments during the analysis of user commands. Most IDS exclude this information, which is one reason for the occurrence of False Negatives (FN), as it is possible to execute intrusions with valid system calls. The authors also focus on parameters like CPU load, since the IDS should not take up too many resources, which may prevent the user from using the computer efficiently. In their work the CPU load remained relatively low, and during stress tests the increase in CPU load was within 20% on average.
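A minimal sketch of the independence-assuming NB classifier mentioned above is given below; the synthetic "normal" and "attack" feature distributions are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a labelled audit data set:
# each feature is modelled independently given the class, as NB assumes.
rng = np.random.default_rng(6)
normal = rng.normal(loc=[0.2, 10, 1.0], scale=[0.1, 3, 0.5], size=(400, 3))
attack = rng.normal(loc=[0.8, 40, 4.0], scale=[0.2, 8, 1.0], size=(100, 3))
X = np.vstack([normal, attack])
y = np.array([0] * 400 + [1] * 100)

nb = GaussianNB()
print("cross-validated accuracy:", cross_val_score(nb, X, y, cv=5).mean())

nb.fit(X, y)
print("P(attack | observation):", nb.predict_proba([[0.7, 35, 3.5]])[0, 1])
```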
2.8.6 Decision Trees (DT)
DTs are popular in IDS, as they yield good performance and offer some benefits over other ML techniques. For example, they learn quickly compared to ANN, and the tree structure built from the training data can be used to produce rules for ES. However, DTs cannot generalize to new attacks in the same manner as certain other ML approaches, and they are not well suited for anomaly detection. New findings demonstrate that DTs are very sensitive to the training data and do not learn well from imbalanced data. DTs have been successfully applied to IDS both standalone and as part of hybrid systems; an example of the success of DTs is an application of a C5.0 DT [81]. A lot of work has been carried out to examine the performance of several ML techniques on the KDD Cup 99 data set, including a C4.5 DT. The DT provided good accuracy but could not perform as well as other techniques on some classes of intrusion; an ANN and k-means clustering obtained higher detection rates and were able to generalize from learned data to new, unseen data. Classification is a method of mapping from a set of attributes to a particular class, and DT induction is one of the classification algorithms in DM. The DT classifies a given data item using the values of its attributes and is constructed from a set of pre-classified data, also known as the training set. The main approach is to select the attributes which best divide the data items into their classes. The major problem is deciding which attribute will best partition the data into the various classes.
The ID3 algorithm uses the Information Gain (IG) approach to solve this problem, using the concept of Entropy, which measures the impurity of data items. DT induction has been implemented in several algorithms; ID3 was later extended to C4.5 and C5.0. C4.5 avoids over-fitting the data and can handle continuous attributes. C4.5 builds the tree from a set of data items by choosing the best attribute to test in order to divide the data into subsets, and then applies the same procedure to each subset recursively. The best attribute for dividing the subset at each stage is selected using the IG of the attributes. Intrusion detection can be considered a classification problem in which each network connection is identified as either an attack or normal based on some existing data. A DT can solve the intrusion detection problem by learning the model from the data set; the DT can then classify new data items into one of the classes specified in the data set. Learning is based on the training data, and the model can predict future data as either attack or normal. DTs work well with large data sets, which is important because large amounts of network data flow across computer networks. The high performance of DTs makes them applicable to real time intrusion detection. Generalization accuracy is another useful property of DTs for an intrusion detection model: new attacks that are small variations of known attacks can also be detected after the model is built, and this ability to detect new intrusions is possible due to the generalization accuracy of DTs.
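The entropy and IG computations that underlie ID3/C4.5 splitting, as described above, can be sketched as follows; the toy connection records are invented for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label vector, the impurity measure used by ID3/C4.5."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, attribute_values):
    """IG of splitting `labels` on a discrete attribute."""
    total = entropy(labels)
    n = len(labels)
    remainder = 0.0
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        remainder += (len(subset) / n) * entropy(subset)
    return total - remainder

# Toy connection records: label (normal/attack) and two discrete attributes.
label = np.array(["normal", "normal", "attack", "attack", "attack", "normal"])
protocol = np.array(["tcp", "udp", "tcp", "tcp", "icmp", "udp"])
flag = np.array(["SF", "SF", "S0", "S0", "S0", "SF"])

print("IG(protocol) =", round(information_gain(label, protocol), 3))
print("IG(flag)     =", round(information_gain(label, flag), 3))   # flag separates the classes best
```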
2.8.7 Support Vector Machines (SVM)
SVM is a supervised learning algorithm that is used increasingly in IDS. The classification performance of the SVM model is better than that of other classification methods such as ANN [82]. A benefit of SVM is that it learns very effectively from high dimensional data. An SVM maps input feature vectors into a higher dimensional feature space through some nonlinear mapping. SVMs can learn a larger set of patterns and are able to scale better, because the classification complexity does not depend on the dimensionality of the feature space. SVMs also have the ability to update the training patterns dynamically whenever a new pattern is detected during classification. The main disadvantage is that a standard SVM can only handle binary classification, whereas intrusion detection requires multi-class classification.
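A minimal sketch of a binary (normal versus attack) SVM with an RBF kernel is shown below; the synthetic features and the values of C and gamma are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Synthetic stand-in for binary (normal vs attack) connection features.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (400, 10)), rng.normal(2, 1, (120, 10))])
y = np.array([0] * 400 + [1] * 120)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# RBF-kernel SVM: the nonlinear mapping is implicit in the kernel.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
print("support vectors retained:", model.named_steps["svc"].n_support_)
```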
The SVM is one of the most successful classification algorithms in the DM area. The training time of SVM is a serious problem in the processing of large data sets, which limits its use in DM applications that require processing of huge data sets; it would normally take years to train an SVM on a data set consisting of one million records. Many researchers have carried out work to enhance SVM in order to increase its training performance [83-85], either through random selection or through approximation of the marginal classifier. These approaches are still not feasible, as multiple scans of the entire data set are required, which is also expensive to perform [86]. Seo [87] applied SVM to host-based AD of masquerades by analyzing sequences of UNIX commands executed by users on a host. Kim applied SVM with a Radial Basis Function (RBF) kernel, analyzing commands over a sliding window, and achieved a detection rate of 80.1%. Seo examined two different kernels, the K-gram and String kernels, which yielded higher detection rates of 89.61% and 97.40%, respectively. The drawback is the same as with the RBF kernel employed by Seo and Cha: the false positive rate is higher. Seo also examined a hybrid of the two kernel methods, which gave nearly identical results to those obtained by Kim. An unsupervised class of SVM was proposed by Dennis [88], which has been adopted in several studies comparing its performance with clustering techniques. SVMs are supervised learning algorithms which have been applied increasingly to misuse detection in the last decade. One of the primary benefits of SVMs is that they learn very effectively from high dimensional data; furthermore, they are trained very quickly. Mukkamala [89] conducted a comparative study of feed forward MLP and SVM for misuse detection; identical detection rates were obtained, and the SVM training time was less than that of the MLP. SVM algorithms are binary classifiers, which is sufficient only for distinguishing between normal and attack. Recent SVM algorithms support multi class learning [90], the approach being to combine several two-class SVMs. Sung and Mukkamala [91] applied SVM to network based intrusion detection with five types of SVM. For each SVM, the training data is partitioned into two classes as
normal or intrusions. The hybrid technique adopted is that the SVM with the highest output value is taken as the final output. Peddabachigari [92] conducted a practical analysis of SVM and DT, used both as standalone detectors and as hybrids; with performance as the parameter, the results indicate that the hybrid method performs better. Due to the magnitude of data involved in network-based intrusion detection, Rung [93] proposed a hybrid which combines SVM with a weighted voting scheme to shorten the training time. A hierarchical clustering algorithm was employed to locate the boundary points in the data that best separate the two classes; these points are then used to train the SVM in an iterative process. During each iteration the support vectors are recalculated and the SVM is tested against a stopping criterion to determine whether a desirable threshold of accuracy has been exceeded. The evaluation was done on the DARPA98 data set and the accuracy was improved, mainly due to correctly classifying more DoS attacks; however, there was an increase in false positive rates. Song [94] proposed a Robust SVM (RSVM) developed to better deal with noise, and the RSVM was applied to host based intrusion detection by Hu [95]. The benefit of using RSVM is that it produces fewer support vectors, which makes it a faster algorithm.
Ganapathy [96] pointed out that SVM can obtain generalization ability
with less training time through simulation experiments on a few artificial and real
benchmark function approximation and classification problems. They have indicated
that SVMs can perform well in text classification problems. Recently a significant
contribution showing the relationship between Extreme Learning Machines (ELM)
and SVM in the context of classification is made [97]. Recently researchers have
made a more in depth exploration of their relationship, and compared the
performance of ELM, SVM, and Least Squares SVM (LSSVM) [98]. ELM provides
a unified learning platform to different applications, such as regression, binary, and
multiclass classifications for the LSSVM, Proximal SVM (PSVM) [99] and other
regularization algorithms. ELM avoids issues involving manual tuning control
parameters like learning rate, learning epochs etc which are difficult to manage in
traditional approaches and reaches good solutions analytically. ELM can be
ELM can be implemented and used easily, and its fast learning speed, response time and ease of implementation are keys to success in the design of IDS. The ELM algorithm tends to achieve similar or better generalization performance at a much faster learning speed than the SVM and LSSVM algorithms. However, several aspects still need further consideration. Recent experimental investigations focus mainly on comparisons of SVM and ELM: both are applied to a variety of examples, but the advantages and disadvantages of applying these methods are still unknown in the real time network intrusion area. Knowing such information may provide more insight into the SVM and ELM algorithms, because the former is based on the Structural Risk Minimization (SRM) principle, which is especially suited to learning from small samples, while the latter is based on the inductive principle known as Empirical Risk Minimization (ERM). The results can strengthen the understanding of the essential relationship between SVM and ELM and serve as complementary knowledge to past experimental and theoretical comparisons between them. SVM algorithms are binary classifiers that are sufficient to distinguish between normal and intrusive data. Recent SVM algorithms support multi class learning; the approach combines several two-class SVMs, and for each SVM the training data is partitioned into two classes so that one represents an original class and the other represents the remaining data. It is also necessary to specify an upper bound parameter C, which has to be determined experimentally. This results in a cross-validation procedure, which is wasteful both in computation and in data.
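The combination of several two-class SVMs with an experimentally chosen upper bound C, as described above, can be sketched as follows; the one-vs-rest scheme, the grid of C values and the synthetic three-class data are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data (e.g. normal, DoS, probe) with 5 features.
rng = np.random.default_rng(8)
X = np.vstack([rng.normal(c, 1.0, (200, 5)) for c in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 200)

# Several two-class SVMs combined one-vs-rest; the upper-bound parameter C
# is chosen experimentally via cross-validation, as described above.
ovr_svm = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale"))
search = GridSearchCV(ovr_svm, param_grid={"estimator__C": [0.1, 1, 10, 100]}, cv=5)
search.fit(X, y)
print("best C:", search.best_params_["estimator__C"])
print("cross-validated accuracy:", round(search.best_score_, 3))
```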
Kernel based ML algorithms are based on mapping data from the original
input feature space to a kernel feature space of higher dimensionality to solve a
linear problem in that space. These methods allow us to interpret and design learning
algorithms geometrically in the kernel space. SVM is one of the several Kernel
based techniques available in the field of ML. The choice of a proper kernel function
plays an important role in SVM based classification/regression. It is difficult to
choose one which gives the best generalization for a given dataset. Many Kernels
have been proposed in the SVM literature. Cheng [100] created a kernel function suitable for the training data using a GA mechanism and showed that this genetic kernel has good generalization ability when compared with the polynomial and RBF kernel functions.
Ye [101] proposed an orthogonal Chebyshev kernel function. Chebyshev
polynomials are first constructed through Chebyshev formulae. Then based on these
polynomials Chebyshev kernels are created satisfying Mercer condition. They
showed that it is possible to reduce the number of support vectors using this kernel.
Wang et al. [102] proposed Weighted Mahalanobis Distance Kernels: they first find the data structure of each class in the input space via agglomerative hierarchical clustering, and then construct the weighted Mahalanobis distance kernels, which are affected by the size of the clusters they reside in. Xu [103] proposed using the weighted Levenshtein distance as a kernel function for strings; they used the UCI splice site recognition dataset for testing their proposed kernel, which achieved the best results on this problem, and they used the boosting paradigm to construct the learned kernel. Their approach is suitable in learning tasks where the test data distribution is different from the training data distribution. Lodhi [104] introduced a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k, where a subsequence is any ordered sequence of k characters occurring in the text, though not necessarily contiguously. These subsequences are weighted by a decay factor of their full length in the text, hence placing more emphasis on contiguous characters. Rieck et al
[105] proposed an algorithm for computation of similarity measures for sequential
data. The algorithm uses suffix trees for efficient calculation of various kernel
functions. Its worst-case run-time is linear in the length of sequences and
independent of the underlying embedding language, which can cover words,
k-grams or all contained subsequences. Experiments with network intrusion
detection, Dynamic Network Analysis (DNA) and text processing applications
demonstrate the utility of distances and similarity coefficients for sequences as
alternatives to classical kernel functions.
Many of the detection results reported to date using ML algorithms such as DT, NN and SVM indicate that attacks involving more features in the data set have substantially lower detection rates. Hence feature relevance analysis is another research area of interest for improving the performance of ML based IDS. The objective is to investigate the relevance of the features with respect to the dataset labels: for normal behavior and for each type of attack, the system should determine the most relevant feature, which best discriminates the given class from the others. To achieve this, IG, the underlying feature selection measure for constructing DTs, can be used; for a given class, the feature with the highest IG is considered the most discriminative. Researchers have proposed several methods of feature selection to achieve real time IDS. The major benefit of feature selection is that the amount of data to be processed is significantly reduced without compromising the performance of the detection; in some cases feature selection may even improve detection performance, as it simplifies the problem by reducing the dimensionality.
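A filter-style selection based on an IG-like measure can be sketched as follows; the mutual-information score is used here as a stand-in for IG, and the synthetic data with three informative features is an assumption for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data set: only the first 3 of 15 features actually carry class information.
rng = np.random.default_rng(9)
y = rng.integers(0, 2, size=600)
informative = y[:, None] + rng.normal(0, 0.3, size=(600, 3))
noise = rng.normal(0, 1.0, size=(600, 12))
X = np.hstack([informative, noise])

# Filter-style selection: rank features by mutual information (an IG-like measure)
# with the class label, independent of any particular classifier.
selector = SelectKBest(score_func=mutual_info_classif, k=3).fit(X, y)
print("selected feature indices:", np.sort(selector.get_support(indices=True)))
X_reduced = selector.transform(X)
print("reduced shape:", X_reduced.shape)
```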
2.9 IMPORTANCE OF FEATURE SELECTION FOR IDS
Data preprocessing is considered an important step in IDS. The amount of data that needs to be examined for the detection of attacks is very large even for a small network, and analysis is very difficult because the number of features available in the data set can make it harder to detect suspicious behavior patterns. As complex relationships exist between the features, it is better to reduce the amount of data to be processed for IDS; this is particularly important if real time intrusion detection is desired. Features can be reduced by filtering out data that is not useful, and data can be grouped or clustered by storing the characteristics of the clusters instead of the individual data points. Feature selection can improve classification performance by reducing the computational complexity and is an important preprocessing technique; it is a key step in building intrusion detection models [106,107]. It also increases the time available for detecting intrusions, but most of the work is still done manually and the feature selection depends strongly on expert domain knowledge. ML techniques provide the wrapper and filter models for automatic feature selection. The major problem that many researchers face is how to choose the optimal set of features, because not all features are relevant to the learning algorithm. Irrelevant and redundant features with noisy data can affect the learning algorithm by severely
degrading the performance with respect to training and testing time. Feature
selection has been shown to have a significant impact on classifier performance; many
researchers, as in [108-110], illustrate that feature selection can reduce the building
and testing time of a classifier. Currently two models are of most importance, namely
the filter model and the wrapper model. In the filter model, the statistical
characteristics of a data set are considered directly, without reference to any learning
algorithm; the relevance of a set of features is computed with measures such as
correlation, consistency, or distance. In contrast, the wrapper model assesses candidate
feature subsets by the performance of a learning algorithm, using the predictive
accuracy of a classifier to evaluate the "goodness" of a feature set. Hence the wrapper
model requires more time [111] and more computational resources to find the best
feature subsets. To increase computational efficiency, the filter method is therefore
usually used to select features from high dimensional data sets. It is well known that
redundant features can reduce the performance of IDS. A major challenge in the IDS
feature selection process is to choose appropriate measures that can precisely
determine the relevance of, and the relationship between, the features of a given data set.
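As an informal illustration of the two models, the sketch below contrasts a filter approach (ranking features by mutual information, independent of any classifier) with a wrapper approach (recursive feature elimination driven by a decision tree) using scikit-learn; the dataset is synthetic and the estimator and parameter choices are arbitrary, not those of any surveyed work.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an IDS dataset: 20 features, only a few informative.
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=5, n_redundant=5, random_state=0)

# Filter model: rank features by a statistical measure (here mutual information),
# without reference to any classifier.
filt = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
print("filter-selected features :", np.where(filt.get_support())[0])

# Wrapper model: recursively eliminate features using an actual classifier,
# which is classifier-aware but more expensive.
wrap = RFE(estimator=DecisionTreeClassifier(random_state=0),
           n_features_to_select=5).fit(X, y)
print("wrapper-selected features:", np.where(wrap.get_support())[0])

The extra cost of the wrapper run comes from repeatedly refitting the classifier, which is the time and resource overhead noted above.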
2.10 RELEVANCE VECTOR MACHINES (RVM)
In spite of good performance on different datasets, SVM still suffers
from shortcomings such as limited visualization/interpretation of the model, kernel
choice and kernel specific parameters. Recently RVM, another kernel based approach,
has been explored for classification and regression problems. RVM, proposed by
Tipping [112], is a sparse ML algorithm that is similar to the SVM in many respects,
and it has attracted interest in the research community because it provides a number
of advantages. The advantages of RVM over SVM are the availability of probabilistic
predictions, the ability to use arbitrary kernel functions, and the absence of a
regularization parameter that must be set. RVM is based on a Bayesian formulation of
a linear model with an appropriate sparse prior over the weights. The sparseness
property enables selection of the proper kernel at each location by pruning all
irrelevant kernels, which results in a sparse data representation. As a result, RVM models can generalize
well and provide inferences at very low computational cost [113]. Through the use
of proper kernels, SVM can also achieve good generalization performance. Its
desirable properties are that it fits functions in high dimensional feature spaces, where
a large space of candidate functions is available, and that it is sparse, meaning that
only a subset of the training data set is retained at runtime, which improves
computational efficiency. Although relatively sparse, SVM makes unnecessary use of
basis functions, as the number of Support Vectors (SV) required typically grows
linearly with the size of the training data set. Moreover, SVM outputs a point estimate
in regression and a binary decision in classification, so it is difficult to estimate the
conditional distribution and capture the uncertainty of a prediction. In SVM the kernel
function must be a continuous symmetric kernel of a positive definite integral operator
in order to satisfy Mercer's condition, whereas RVM places no such restriction on the
kernel. While maintaining comparable classification accuracy, RVM is able to yield a
decision function that is much sparser than that of SVM. This leads to a significant
reduction in the computational complexity of the decision function, making it more
suitable for real time applications.
The RVM produces a function composed of a set of kernel functions,
also known as basis functions, and a set of weights. This function represents a model
of the system presented to the learning process through a set of training data. The
kernels and weights are calculated by the learning process, and the model function is
defined as the weighted sum of the kernels. From the set of training vectors, the RVM
selects a sparse subset of input vectors that are deemed relevant by the probabilistic
learning scheme [114]; these relevance vectors form the basis functions of the model
and are used to build a function that estimates the output of the system from its inputs.
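To make the sparse Bayesian learning behind RVM concrete, a minimal sketch of the type-II maximum likelihood update loop is given below. It is an illustrative toy implementation on synthetic data, assuming an RBF kernel and arbitrarily chosen iteration count, pruning threshold and kernel width; it is a sketch of the general technique, not the implementation used in the surveyed works.

import numpy as np

def rbf_kernel(X, Z, gamma=10.0):
    # Gaussian (RBF) kernel matrix between the rows of X and Z.
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def rvm_regression(X, t, n_iter=200, prune=1e6):
    # Sparse Bayesian (RVM-style) regression: iterate the type-II maximum likelihood
    # updates for the weight precisions alpha and the noise precision beta, pruning
    # basis functions whose precision diverges (i.e. whose weight collapses to zero).
    Phi = rbf_kernel(X, X)                  # one kernel/basis function per training point
    N = Phi.shape[0]
    alpha = np.ones(Phi.shape[1])           # per-weight prior precisions
    beta = 1.0                              # noise precision
    keep = np.arange(Phi.shape[1])          # indices of surviving basis functions
    for _ in range(n_iter):
        Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)   # posterior covariance
        mu = beta * Sigma @ Phi.T @ t                                # posterior mean weights
        gamma_i = 1.0 - alpha * np.diag(Sigma)                       # well-determined factors
        alpha = gamma_i / (mu ** 2 + 1e-12)                          # re-estimate precisions
        resid = ((t - Phi @ mu) ** 2).sum()
        beta = max(N - gamma_i.sum(), 1e-6) / max(resid, 1e-12)
        mask = alpha < prune                                         # prune irrelevant kernels
        Phi, alpha, keep, mu = Phi[:, mask], alpha[mask], keep[mask], mu[mask]
    return keep, mu, beta

# Toy usage: fit a noisy sinc curve; only a handful of relevance vectors survive.
rng = np.random.default_rng(1)
X = np.linspace(-1.0, 1.0, 80).reshape(-1, 1)
t = np.sinc(3 * X[:, 0]) + 0.05 * rng.standard_normal(80)
kept, weights, noise_precision = rvm_regression(X, t)
print(f"{kept.size} relevance vectors retained out of {X.shape[0]} training points")

The pruning step is what yields the sparse data representation discussed above: only the retained indices (the relevance vectors) contribute basis functions to the final model.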
2.11 CURRENT STATE OF IDS
IDS deployments typically consist of security functions such as a firewall,
IPS/IDS and filtering functions like anti-spam, antivirus and URL filtering. A recent
challenge in this area is to develop security software solutions and appliances that
defend against the threats faced by enterprise networks, with the main focus on
systems that work in real time and combine detection, prevention and response [115].
Detection can be done either through static signatures or through anomaly detection.
Newer research focuses on approaches that secure the network by reasoning about
risks; this can take place before an attack occurs and limits exposure to threats. A
general IPS framework uses a trigger based approach to perform reactive network
measurement [116]. NN approaches combine the complexity of some statistical
techniques with the ML objective of imitating human intelligence; however, this is
done at a more "unconscious" level, and there is no accompanying ability to make the
learned concepts transparent to the user. Important problems remain to be solved even
though a variety of security tools incorporating AD functionalities exists. IDS are
continuously evolving with the goal of improving the security and protection of
networks and computer infrastructures, but several research issues remain open.
Some of the most significant challenges in the area are:
1. Low detection efficiency: The high FP rate calls for the exploration
and development of new, accurate processing schemes, as well as
better structured approaches to modeling network systems.
2. Low throughput and high cost: The high data rates that IDS must
handle call for optimization of intrusion detection using grid
techniques and distributed detection paradigms.
3. Absence of appropriate metrics: Due to the lack of a general
framework for evaluating and comparing different techniques,
assessing IDS is a real challenge. Research shows that most IDS
perform poorly in defending themselves from attacks, and significant
effort is needed to improve intrusion detection technology in this
respect.
2.11.1 Intrusion Prevention System (IPS)
The inadequacies inherent in current defenses have driven the
development of a new breed of security products known as IPS [117]. IPS software
has all the capabilities of IDS and can also attempt to stop possible incidents. This
section provides an overview of IPS technologies and describes the key functions and
methodologies that they use; an overview of the major classes of IPS technologies
is also provided in [6, 58]. The purpose of an IPS is not only to detect that an attack
is occurring, but also to stop it. To do so, it can be considered an advanced
combination of a firewall and an IDS. Recent trends in industry show that more and
more companies are choosing IPS-based solutions over IDS-based solutions,
primarily due to the need to actively block worm and hacker attacks instead of
passively monitoring them, as an IDS would do. IPS research took root in IDS
research, and some researchers define IPSs as IDSs with added functionalities.
An IPS can thus be defined as an in-line product that focuses on identifying
and blocking malicious network activity in real time. In general, there are two
categories, namely rate based and content based IPS. The devices often look like
firewalls and often have some basic firewall functionality, but a firewall blocks all
traffic except that which it has a reason to pass, whereas an IPS passes all traffic
except that which it has a reason to block.
2.11.1.1 Rate based IPS
A rate-based IPS blocks traffic based on network load, such as too many
packets in a specified time, too many connections per unit time, or too many generated
errors. When these conditions arise, a rate based IPS kicks in and blocks, throttles or
otherwise mediates the traffic. The most useful rate based IPS combine powerful
configuration options with a range of response technologies; examples include
limiting queries to the DNS server and other simple rules covering bandwidth and
connection limits. A rate-based Intrusion Prevention System can set a threshold on the
maximum amount of traffic directed at a given port or service. If the threshold is
exceeded, the IPS blocks further traffic from the offending source IP only, still
allowing other users to use that service.
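As an informal illustration of this thresholding behaviour, the sketch below keeps a sliding-window count of events per source IP and blocks a source once it exceeds a configured threshold, while other sources continue to be served; the class name, threshold and window size are invented for illustration and do not reflect any particular product.

import time
from collections import defaultdict, deque

class RateBasedBlocker:
    # Block a source IP once it exceeds `max_events` within `window` seconds.

    def __init__(self, max_events=100, window=1.0):
        self.max_events = max_events
        self.window = window
        self.events = defaultdict(deque)   # source IP -> timestamps of recent events
        self.blocked = set()

    def allow(self, src_ip, now=None):
        # Return True if traffic from src_ip should be passed, False if it is blocked.
        now = time.monotonic() if now is None else now
        if src_ip in self.blocked:
            return False
        q = self.events[src_ip]
        q.append(now)
        while q and now - q[0] > self.window:   # drop events outside the window
            q.popleft()
        if len(q) > self.max_events:            # threshold exceeded: block this IP only
            self.blocked.add(src_ip)
            return False
        return True

# Illustrative usage: a burst from one host gets blocked, other hosts stay unaffected.
ips = RateBasedBlocker(max_events=5, window=1.0)
for i in range(8):
    print("10.0.0.1 allowed:", ips.allow("10.0.0.1", now=i * 0.1))
print("10.0.0.2 allowed:", ips.allow("10.0.0.2", now=0.9))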
The major problem in deploying rate based IPS products is deciding what
constitutes an overload. For a rate based IPS to work properly, the network owner
needs to know not only what "normal" traffic levels are but also other network
details, such as how many connections the servers can handle. However, most
commercial products do not yet provide any help in establishing this baseline
behavior; they instead require the services of a "trained", product specific systems
engineer who often spends hours on site setting up the IPS. Because rate based IPS
require frequent tuning and adjustment, they are most useful in very high volume
Web, application and mail server environments.
2.11.1.2 Content based IPS
Content based IPS, also referred to as signature and anomaly based IPS,
blocks traffic based on attack signatures and protocol anomalies and is the natural
evolution of IDS and firewalls. If packets do not comply with TCP/IP, or if any
suspicious behavior is detected, the IPS triggers and blocks future traffic from that
host. Recent content based IPS offer a range of techniques for identifying malicious
content and many options for handling the attacks, ranging from simply dropping bad
packets to dropping all future packets from the same attacker, along with advanced
reporting and alerting strategies. As content based IPS offer intrusion detection like
technology for identifying and blocking threats, they can be used deep inside the
network to complement firewalls and provide security policy enforcement, and they
often require less manual maintenance and fine tuning than the rate based method to
perform a useful function. The major challenge in designing an IPS is the fact that it
works in line, presenting a potential choke point and single point of failure. If a
passive IDS fails, the worst that can happen is that some attempted attacks may go
undetected; if an in-line device fails, it can seriously impact the performance of the
network, latency may rise to an unacceptable value, and a self inflicted DoS condition
may occur. Even if the IPS device does not fail altogether, it still has the potential to
act as a bottleneck, increasing latency and reducing throughput as it struggles to keep
up with a Gigabit or more of network traffic.
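The core of such content inspection can be illustrated with a small sketch that matches packet payloads against signature patterns and blocks the offending host on a match; the signatures and payloads below are toy examples invented for illustration, not real attack signatures.

import re

# Hypothetical toy signatures: each maps a name to a regular expression
# that is matched against the packet payload.
SIGNATURES = {
    "suspicious-shell": re.compile(rb"/bin/sh"),
    "sql-injection":    re.compile(rb"(?i)union\s+select"),
}

blocked_hosts = set()

def inspect(src_ip, payload):
    # Return 'drop' if the payload matches a signature or the host is already blocked.
    if src_ip in blocked_hosts:
        return "drop"
    for name, pattern in SIGNATURES.items():
        if pattern.search(payload):
            blocked_hosts.add(src_ip)      # block future traffic from this host
            return "drop"
    return "pass"

# Toy usage: the second packet matches a signature, so the third is also dropped.
print(inspect("10.0.0.5", b"GET /index.html HTTP/1.1"))
print(inspect("10.0.0.5", b"id=1 UNION SELECT password FROM users"))
print(inspect("10.0.0.5", b"GET /about.html HTTP/1.1"))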
2.11.2 Intrusion Response System (IRS)
The task of most traditional IDSs is to detect intrusions, but once an alert
is generated, human intervention is required, and implementing an automated
response action is certainly a challenge. For a traditional IRS, such a response involves
notifying the central decision core, waiting for its arbitration, and applying the decision.
Current IRSs meet only a subset of these challenges, and none addresses all of them. The general principles followed in the development of IRSs naturally classify them into two categories.
2.11.2.1 Static Decision Making
This class of IRS provides a static mapping from the alert raised by the detector to the response that is to be deployed. The IRS is essentially a look-up table in which the administrator has anticipated all alerts possible in the system and an expert has indicated the response to be taken for each. In some cases, the response site is the same as the site from which the alarm was flagged, as with the responses often bundled with anti-virus products (disallow access to the file that was detected to be infected) or network-based IDS (terminate a network connection that matches a signature for anomalous behavior).
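A minimal sketch of such a static alert-to-response mapping is given below; the alert names and response actions are purely illustrative.

# Hypothetical static alert-to-response table maintained by an administrator.
RESPONSE_TABLE = {
    "infected-file-detected":  "quarantine_file",
    "signature-match-on-flow": "terminate_connection",
    "brute-force-login":       "lock_account",
}

def respond(alert):
    # Look up the pre-configured response; fall back to alerting an operator.
    return RESPONSE_TABLE.get(alert, "notify_administrator")

print(respond("signature-match-on-flow"))   # terminate_connection
print(respond("unknown-alert"))             # notify_administrator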
2.11.2.2 Dynamic Decision Making
This class of IRS reasons about an ongoing attack based on the observed alerts and determines an appropriate response to take. The first step in the reasoning process is to determine which services in the system are likely affected, taking into account the characteristics of the detector, the network topology, and so on. The actual choice of response then depends on a host of factors, such as the amount of evidence about the attack and the severity of the candidate response. The challenges in designing such an IRS are that attacks launched through automated scripts are fast moving, and that the owner of the distributed system does not have knowledge of, or access to, the internals of the different services.
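A highly simplified sketch of this kind of reasoning is shown below: candidate responses are scored by weighing the accumulated evidence of an attack against the severity of each response, and the most effective justified response is chosen. The scoring rule, thresholds and names are invented for illustration and do not correspond to any specific IRS.

from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Response:
    name: str
    severity: float       # cost of deploying the response (0 = harmless, 1 = drastic)
    effectiveness: float  # how strongly it is expected to contain the attack

def choose_response(evidence: float, candidates: list[Response]) -> Response | None:
    # Consider only responses whose severity is justified by the evidence gathered so far;
    # among those, pick the most effective one. Return None to escalate to an operator.
    justified = [r for r in candidates if r.severity <= evidence]
    return max(justified, key=lambda r: r.effectiveness) if justified else None

candidates = [
    Response("rate_limit_host", severity=0.2, effectiveness=0.40),
    Response("block_host",      severity=0.5, effectiveness=0.70),
    Response("isolate_subnet",  severity=0.9, effectiveness=0.95),
]
print(choose_response(0.6, candidates))   # strong evidence -> block_host
print(choose_response(0.1, candidates))   # weak evidence   -> None (escalate)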
2.11.3 Artificial Immune Systems
Artificial Immune Systems (AIS) have been extensively researched in the last decade, mainly for AD. Much research has been conducted on AIS as the model lends itself conveniently to AD. Several researchers came to the conclusion that the model has problems with scalability, limiting its application to real problems. Consequently, some researchers considered alternative models, while others have in recent years proposed enhancements to address scalability [118-120].