
Data Mining Based Intrusion Detection System

Krishna C Surendra Babu

Papers:

A Data Mining Framework for Building Intrusion Detection Models (Wenke Lee, Salvatore J. Stolfo). Research supported in part by grants from DARPA.

Creation and Deployment of Data Mining-Based Intrusion Detection Systems in Oracle Database 10g

Intrusion Detection System:

Intrusion Detection Techniques:
• Anomaly Detection
• Misuse Detection

Attack categories:
• DOS
• Probing
• Unauthorized access to local super user (U2R)
• Unauthorized access from a remote machine (R2L)

Requirements:
• Reliable
• Extensible
• Easy to manage
• Low maintenance cost

Data Mining

Data mining refers to extracting or mining knowledge from large amounts of data.

Data Warehouse

A data warehouse is a repository of information collected from multiple sources.

A Data Mining Framework for Building Intrusion Detection Models

Why Data Mining?
• The dataset is large.
• Constructing an IDS manually is expensive and slow.
• Updates are frequent, since new intrusions occur frequently.

A Data Mining Framework for Building Intrusion Detection Models

Challenges for Data Mining in building IDS

• Develop techniques to automate the processing of knowledge-intensive feature selection.
• Customize the general algorithms to incorporate domain knowledge so that only relevant patterns are reported.
• Compute detection models that are accurate and efficient at run time.

Mining the data

Dataset types:
• Network-based dataset
• Host-based dataset

Build the IDS by mining the records. When an attack is detected, raise an alarm to the administration system.

Framework of Building IDS

• Preprocessing: summarize the raw data.
• Association rule mining.
• Find sequence patterns (frequent episodes) based on the association rules.
• Construct new features based on the sequence patterns.
• Construct classifiers on different sets of features.

Preprocessing

Summarize raw data into high-level events, e.g. network connections described by time, duration, service, host, and destination.

Bro and NFR packet filtering techniques can be used.
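The summarization step can be illustrated with a minimal Python sketch. It assumes packets have already been parsed into dictionaries; the field names (time, src, dst, service, bytes) and the sample values are hypothetical and are not taken from Bro, NFR, or the paper.

```python
from collections import defaultdict

def summarize_connections(packets):
    """Group packet records into one summary record per connection.

    Each packet is a dict with hypothetical keys:
    time, src, dst, service, bytes.
    """
    groups = defaultdict(list)
    for p in packets:
        groups[(p["src"], p["dst"], p["service"])].append(p)

    records = []
    for (src, dst, service), pkts in groups.items():
        times = [p["time"] for p in pkts]
        records.append({
            "start": min(times),
            "duration": max(times) - min(times),
            "service": service,
            "src": src,
            "dst": dst,
            "bytes": sum(p["bytes"] for p in pkts),
        })
    # Order the connection records by start time.
    return sorted(records, key=lambda r: r["start"])

# Made-up packets for illustration only.
packets = [
    {"time": 0.0, "src": "10.0.0.1", "dst": "10.0.0.9", "service": "http", "bytes": 60},
    {"time": 0.5, "src": "10.0.0.1", "dst": "10.0.0.9", "service": "http", "bytes": 1500},
    {"time": 1.0, "src": "10.0.0.2", "dst": "10.0.0.9", "service": "ftp", "bytes": 80},
]
connections = summarize_connections(packets)
```

Each resulting record is one high-level event of the kind the later mining steps operate on.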

Classification

Classify each audit record into one of a discrete set of possible categories: normal or a particular kind of intrusion.

Association rule mining

Searches for interesting relationships among attributes in a given data set, i.e. derives multi-feature (attribute) correlations from a database table.
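As a rough illustration, the two standard measures behind an association rule, support and confidence, can be computed as below. This is a sketch under assumptions: each audit record is represented as a set of "attribute=value" strings, and the sample records are made up.

```python
def support(records, itemset):
    """Fraction of records that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= r for r in records) / len(records)

def confidence(records, lhs, rhs):
    """support(lhs union rhs) divided by support(lhs)."""
    lhs_support = support(records, lhs)
    if lhs_support == 0:
        return 0.0
    return support(records, set(lhs) | set(rhs)) / lhs_support

# Made-up audit records, each a set of attribute=value items.
records = [
    {"service=http", "flag=SF"},
    {"service=http", "flag=SF"},
    {"service=http", "flag=REJ"},
    {"service=ftp", "flag=SF"},
]
rule_support = support(records, {"service=http", "flag=SF"})
rule_confidence = confidence(records, {"service=http"}, {"flag=SF"})
```

Here the rule service=http -> flag=SF holds in half of all records and in two thirds of the http records.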

Sequence Pattern Mining

Frequent episodes: X, Y -> Z, [c, s, w], where c is the confidence, s the support, and w the time window. When itemsets X and Y occur, Z will also occur within time w.
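A simplified sketch of how such a rule could be scored, reading c as confidence, s as support, and w as the window width. It is an assumption-laden approximation, not the paper's algorithm: a window of width w is opened at every event, and item order inside a window is ignored.

```python
def episode_stats(events, lhs, rhs, w):
    """Return (confidence, support) of lhs -> rhs over windows of width w.

    events: list of (timestamp, item) pairs sorted by timestamp.
    One window [t, t + w] is opened at each event time t.
    """
    lhs, rhs = set(lhs), set(rhs)
    windows = [
        {item for t2, item in events if t <= t2 <= t + w}
        for t, _ in events
    ]
    n_lhs = sum(lhs <= win for win in windows)
    n_both = sum((lhs | rhs) <= win for win in windows)
    support = n_both / len(windows)
    confidence = n_both / n_lhs if n_lhs else 0.0
    return confidence, support

# Made-up event stream for illustration only.
events = [(0, "X"), (1, "Y"), (2, "Z"), (10, "X"), (11, "Y"), (20, "Z")]
c, s = episode_stats(events, ["X", "Y"], ["Z"], w=2)
```

In this example X and Y appear together in two windows, and Z joins them in only one, so the confidence is one half.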

Feature Construction

Feature extraction is the process of determining which evidence from raw audit data is most useful for analysis.

Construct new features according to the frequent episodes.
• Some features show a close relationship to each other; combine those features.
• Some frequent episodes may indicate interesting new features.

Build Model (classifier)

Build different classifiers for different attacks.

Experiments

The DARPA data
• 4 GB of compressed tcpdump data from 7 weeks of network traffic.
• Contains 4 main categories of attacks:

DOS: denial of service, e.g., ping-of-death, syn flood

R2L: unauthorized access from a remote machine, e.g., guessing password

U2R: unauthorized access to local super user privileges by a local unprivileged user, e.g., buffer overflow

PROBING: e.g., port-scan, ping-sweep

Results

• Training on the 7 weeks of labeled data, and testing on the 2 weeks of unlabeled data.
• The test data contains 14 attack types that do not exist in the training data.
• Comparing 4 methods:

Columbia: the IDS developed according to the framework introduced above

Groups 1-3: three systems developed by knowledge engineering approaches

Results

Detection rate on new and old attacks.
• Old attacks: attack types that occur in both the training and testing data.
• New attacks: attack types that occur in the testing data only.
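The old/new split can be made concrete with a short sketch. The attack type names and outcomes below are invented for illustration and are not the results reported in the paper.

```python
def detection_rates(test_records, training_attack_types):
    """Return the detection rate for old and new attack types.

    test_records: (attack_type, detected) pairs, attack records only.
    An attack type is "old" if it also appears in the training data.
    """
    counts = {"old": [0, 0], "new": [0, 0]}  # [detected, total]
    for attack_type, detected in test_records:
        group = "old" if attack_type in training_attack_types else "new"
        counts[group][1] += 1
        counts[group][0] += int(detected)
    return {g: (d / n if n else None) for g, (d, n) in counts.items()}

# Hypothetical attack types and detector outcomes.
training_attack_types = {"syn_flood", "port_scan"}
test_records = [
    ("syn_flood", True),
    ("syn_flood", True),
    ("port_scan", True),
    ("port_scan", False),
    ("unseen_attack_a", True),
    ("unseen_attack_b", False),
]
rates = detection_rates(test_records, training_attack_types)
```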

Creation and Deployment of Data Mining-Based Intrusion Detection Systems in Oracle Database 10g

DAID: a database-centric architecture that leverages data mining within the Oracle RDBMS to address the challenges.
• Scheduling capabilities
• Alert infrastructure
• Data analysis tools
• Security
• Scalability
• Reliability

Requirements for a production quality IDS

• Centralized view of the data
• Data transformation capabilities
• Analytic and data mining methods
• Flexible detector deployment, including scheduling that enables periodic model creation and distribution
• Real-time detection and alert infrastructure
• Reporting capabilities
• Distributed processing
• High system availability
• Scalability with system load

• Sensors
• Extraction, transformation and load (ETL)
• Centralized data warehousing
• Automated model generation
• Automated model distribution
• Real-time and offline detection
• Report and analysis
• Automated alerts

Sensors

Collect audit information:
• Network traffic data
• System logs on individual hosts
• System calls made by processes

ETL

Used for preprocessing audit streams and feature extraction.

Use SQL and user-defined functions to extract key pieces of information. Example: a windowing analytic function computes the number of http connections to a given host.
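The windowed count can be sketched in Python as a stand-in for the SQL analytic function. The record fields (time, dst, service), the 2-second window, and the sample data are assumptions made for illustration.

```python
def http_count_in_window(connections, window=2.0):
    """Add an http_count field to each connection record.

    http_count is the number of http connections to the same destination
    host within the preceding `window` seconds, counting the current
    connection if it is itself an http connection.
    """
    out = []
    for c in connections:
        n = sum(
            1
            for o in connections
            if o["service"] == "http"
            and o["dst"] == c["dst"]
            and c["time"] - window <= o["time"] <= c["time"]
        )
        out.append({**c, "http_count": n})
    return out

# Made-up connection records for illustration only.
connections = [
    {"time": 0.0, "dst": "hostA", "service": "http"},
    {"time": 1.0, "dst": "hostA", "service": "http"},
    {"time": 1.5, "dst": "hostB", "service": "http"},
    {"time": 4.0, "dst": "hostA", "service": "http"},
]
counts = [r["http_count"] for r in http_count_in_window(connections)]
```

The quadratic scan keeps the sketch short; a database window function would compute the same count in a single ordered pass per host.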

Model Generation

Popular techniques for misuse and anomaly detection:
• Association rules
• Clustering
• Support vector machines

Supervised learning methods for classification:
• Decision trees

Model build functionality: the Dbms_data_mining PL/SQL package, used to train linear SVM anomaly and misuse detection models.

Test dataset:
• Probing
• Denial of service
• Unauthorized access to a local superuser (U2R)
• Unauthorized access from a remote machine (R2L)

(37 subclasses of attacks under the 4 generic categories)

Misuse Detection Problem

Anomaly Detection Problem

Accuracy of the system 92.1%

Periodic Model Updates as new data is accumulated

Model rebuild when the performance falls below a predefined level
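The rebuild rule amounts to a threshold check on recent performance. A minimal sketch, assuming accuracy on recently labeled audit records is the monitored measure; the 0.9 threshold and the sample labels are hypothetical.

```python
def needs_rebuild(labels, predictions, min_accuracy=0.9):
    """Return True when accuracy on recent labeled data is below min_accuracy."""
    if not labels:
        return False  # nothing to evaluate yet
    correct = sum(l == p for l, p in zip(labels, predictions))
    return correct / len(labels) < min_accuracy

# Made-up recent labels and model predictions.
labels = ["attack", "normal", "normal", "attack"]
predictions = ["attack", "normal", "attack", "attack"]
rebuild = needs_rebuild(labels, predictions)
```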

Model Distribution

Real Application Clusters (RAC)

Detection

Real time / offline

Audit data are classified as attack or not by the misuse detection SVM model.

Functional index on the probability of a case being an attack or not

returns all cases in audit_data with probability greater than 0.5 of being an attack

Combination of multiple models

The query returns all cases where either model1 or model2 indicate an attack with probability higher than 0.4:

In this case, when the anomaly_model classifies a case as an attack with probability greater than 0.5, the misuse_model will attempt to identify the type of attack:
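A minimal Python sketch of both combination rules, written against precomputed model outputs rather than the database. The field names (p_model1, p_model2, p_anomaly, misuse_type) and the sample cases are hypothetical stand-ins for the model scores.

```python
def either_model_flags(cases, threshold=0.4):
    """Ids of cases where model1 OR model2 gives an attack probability above threshold."""
    return [
        c["id"]
        for c in cases
        if c["p_model1"] > threshold or c["p_model2"] > threshold
    ]

def cascade(cases, threshold=0.5):
    """For cases the anomaly model flags, report the misuse model's attack type."""
    return {c["id"]: c["misuse_type"] for c in cases if c["p_anomaly"] > threshold}

# Made-up scores for illustration only.
cases = [
    {"id": 1, "p_model1": 0.45, "p_model2": 0.10, "p_anomaly": 0.30, "misuse_type": "normal"},
    {"id": 2, "p_model1": 0.20, "p_model2": 0.30, "p_anomaly": 0.80, "misuse_type": "probing"},
    {"id": 3, "p_model1": 0.10, "p_model2": 0.90, "p_anomaly": 0.60, "misuse_type": "dos"},
]
flagged = either_model_flags(cases)
typed = cascade(cases)
```

The first rule widens coverage by accepting either model's vote; the second uses the anomaly model as a gate and the misuse model only to name the attack.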

Reports and Analysis

Conclusion

• Data mining techniques are very useful in intrusion detection.
• Manual interpretation and advice are still needed in some processing steps.
• More efficient on known attacks than on unknown attacks.
• Only if the training data contains all normal behavior.
