Data Mining Based Intrusion Detection System
Krishna C Surendra Babu
Papers:
- A Data Mining Framework for Building Intrusion Detection Models (Wenke Lee, Salvatore J. Stolfo). Research supported in part by grants from DARPA.
- Creation and Deployment of Data Mining-Based Intrusion Detection Systems in Oracle Database 10g
Intrusion Detection System:
Intrusion Detection Techniques:
• Anomaly Detection
• Misuse Detection
Attack categories:
• DOS
• Probing
• Unauthorized access to local super user (U2R)
• Unauthorized access from a remote machine (R2L)
Requirements:
• Reliable
• Extensible
• Easy to manage
• Low maintenance cost
Data Mining: Data mining refers to extracting or mining knowledge from large amounts of data.
Data Warehouse: A data warehouse is a repository of information collected from multiple sources.
A Data Mining Framework for Building Intrusion Detection Models
Why Data Mining?
• The dataset is large.
• Constructing an IDS manually is expensive and slow.
• Updates are frequent, since new intrusions occur frequently.
Challenges for Data Mining in Building IDS
• Develop techniques to automate the processing of knowledge-intensive feature selection.
• Customize the general algorithms to incorporate domain knowledge so that only relevant patterns are reported.
• Compute detection models that are accurate and efficient at run time.
Mining the data
Dataset types:
• Network-based dataset
• Host-based dataset
Build the IDS by mining the records. When an attack is detected, raise an alarm to the administration system.
Framework of Building IDS
• Preprocessing: summarize the raw data.
• Association rule mining.
• Find sequence patterns (frequent episodes) based on the association rules.
• Construct new features based on the sequence patterns.
• Construct classifiers on different sets of features.
Preprocessing
Summarize raw data into high-level events, e.g. network connections described by time, duration, service, host, destination.
Packet filtering tools such as Bro and NFR can be used.
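As a rough illustration of the summarization step, the sketch below groups packet-level records into one connection-level record each. The packet tuple layout (timestamp, source, destination, service, bytes) and the function name are assumptions made for this example; they are not taken from the paper or from Bro/NFR.

```python
from collections import defaultdict

def summarize_connections(packets):
    """Collapse packet records into one summary per (src, dst, service).

    `packets` is a hypothetical list of tuples:
    (timestamp, src_host, dst_host, service, n_bytes).
    """
    groups = defaultdict(list)
    for ts, src, dst, service, n_bytes in packets:
        groups[(src, dst, service)].append((ts, n_bytes))

    records = []
    for (src, dst, service), items in groups.items():
        times = [t for t, _ in items]
        records.append({
            "start": min(times),                    # time of first packet
            "duration": max(times) - min(times),    # last minus first packet
            "service": service,
            "src": src,
            "dst": dst,
            "bytes": sum(b for _, b in items),
        })
    # Order the high-level events by start time.
    return sorted(records, key=lambda r: r["start"])
```

A real preprocessor would also split a host pair into separate connections by port and protocol state; this sketch keys only on host pair and service to stay short.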
Classification
Classify each audit record into one of a discrete set of possible categories: normal or a particular kind of intrusion.
Association rule mining
Searches for interesting relationships among attributes in a given data set, i.e. derives multi-feature (attribute) correlations from a database table.
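To make the idea of multi-attribute correlations concrete, here is a minimal sketch that computes the support and confidence of a candidate rule over a table of records. The attribute names and values in the usage note are illustrative assumptions, and this is only the rule-scoring step, not a full mining algorithm such as Apriori.

```python
def support(records, itemset):
    """Fraction of records that contain every attribute=value pair in `itemset`."""
    items = set(itemset.items())
    hits = sum(1 for r in records if items <= set(r.items()))
    return hits / len(records)

def confidence(records, lhs, rhs):
    """support(lhs and rhs) / support(lhs) for the rule lhs -> rhs."""
    s_lhs = support(records, lhs)
    if s_lhs == 0:
        return 0.0
    return support(records, {**lhs, **rhs}) / s_lhs
```

For example, with four connection records of which three use service "http" and two of those have flag "SF", the rule service=http -> flag=SF has support 0.5 and confidence 2/3.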
Sequence Pattern Mining
Frequent episodes: X, Y -> Z, [c, s, w]
When itemsets X and Y occur, Z will also occur within the time window w, with confidence c and support s.
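The sketch below shows one simplified way to score a rule of the form X, Y -> Z, [c, s, w]. It assumes events are (time, item) pairs, opens one window of width w at every event, and checks for the items in order. This window-counting scheme is an assumption chosen for brevity; the paper's own definition of support for frequent episodes is not reproduced here.

```python
def contains_in_order(items, pattern):
    """True if `pattern` appears in `items` as an ordered subsequence."""
    it = iter(items)
    return all(p in it for p in pattern)

def episode_stats(events, lhs, rhs, w):
    """Return (confidence, support) for the serial episode lhs -> rhs.

    events: list of (time, item); lhs, rhs: lists of items; w: window width.
    """
    events = sorted(events)
    n_lhs = n_both = 0
    for i, (t0, _) in enumerate(events):
        # Items of all events that fall within w time units of this event.
        window = [item for t, item in events[i:] if t - t0 <= w]
        if contains_in_order(window, lhs):
            n_lhs += 1
            if contains_in_order(window, lhs + rhs):
                n_both += 1
    support = n_both / len(events) if events else 0.0
    confidence = n_both / n_lhs if n_lhs else 0.0
    return confidence, support
```

Here confidence is the share of windows containing X then Y that also go on to contain Z, and support is the share of all windows that contain the full episode.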
Feature Construction
Feature extraction is the process of determining which evidence from raw audit data is most useful for analysis.
Construct new features according to the frequent episodes:
• Some features show a close relationship to each other; combine those features.
• Some frequent episodes may indicate interesting new features.
Build Model (classifier)
Build different classifiers for different attacks.
Experiments
The DARPA data
• 4 GB of compressed tcpdump data covering 7 weeks of network traffic.
• Contains 4 main categories of attacks:
- DOS: denial of service, e.g., ping-of-death, SYN flood
- R2L: unauthorized access from a remote machine, e.g., guessing a password
- U2R: unauthorized access to local super user privileges by a local unprivileged user, e.g., buffer overflow
- PROBING: e.g., port-scan, ping-sweep
Results
Training on the 7 weeks of labeled data and testing on the 2 weeks of unlabeled data.
The test data contains 14 attack types that do not exist in the training data.
Comparing 4 methods:
• Columbia: the IDS developed according to the framework introduced above
• Group 1-3: three systems developed by knowledge engineering approaches
Results
Detection rate on new and old attacks.
• Old attacks: attack types that occur in both training and testing data.
• New attacks: attack types that occur in testing data only.
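The old/new split can be computed as in the sketch below. The record layout (attack type, detected flag) and the function name are assumptions for illustration; the attack type names in the usage note are placeholders, not labels from the DARPA data.

```python
def detection_rates(test_records, training_attack_types):
    """Detection rate for old and new attacks.

    test_records: list of (attack_type, detected) pairs for attack records
    in the test data. An attack type is "old" if it also appears in training.
    """
    counts = {"old": [0, 0], "new": [0, 0]}  # [detected, total]
    for attack_type, detected in test_records:
        kind = "old" if attack_type in training_attack_types else "new"
        counts[kind][1] += 1
        counts[kind][0] += bool(detected)
    # None marks a category with no test records.
    return {k: (d / n if n else None) for k, (d, n) in counts.items()}
```

For instance, if training contains type_a and type_b, then test records of type_c count toward the "new" rate only.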
Creation and Deployment of Data Mining-Based Intrusion Detection Systems in Oracle Database 10g
DAID: a database-centric architecture that leverages data mining within the Oracle RDBMS to address the challenges.
• Scheduling capabilities
• Alert infrastructure
• Data analysis tools
• Security
• Scalability
• Reliability
Requirements for a production-quality IDS
• Centralized view of the data
• Data transformation capabilities
• Analytic and data mining methods
• Flexible detector deployment, including scheduling that enables periodic model creation and distribution
• Real-time detection and alert infrastructure
• Reporting capabilities
• Distributed processing
• High system availability
• Scalability with system load
• Sensors
• Extraction, transformation and load (ETL)
• Centralized data warehousing
• Automated model generation
• Automated model distribution
• Real-time and offline detection
• Report and analysis
• Automated alerts
Sensors
Collect audit information:
• Network traffic data
• System logs on individual hosts
• System calls made by processes
ETL
Used for preprocessing audit streams and feature extraction.
Use SQL and user-defined functions to extract key pieces of information. Example: a windowing analytic function computes the number of HTTP connections to a given host.
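The slides describe this feature as a SQL windowing function inside the database. As a language-neutral illustration of the same computation, the Python sketch below counts, for each connection, the HTTP connections to the same destination host within a trailing time window. The tuple layout, the function name, and the window width are assumptions for this example.

```python
from collections import deque

def http_count_in_window(connections, window_seconds):
    """For each connection, count http connections to the same destination
    host within the trailing `window_seconds` (the current one included).

    connections: list of (timestamp, dst_host, service) tuples.
    Returns a list of (timestamp, dst_host, service, count) in time order.
    """
    recent = {}  # dst_host -> deque of timestamps of recent http connections
    out = []
    for ts, dst, service in sorted(connections):
        q = recent.setdefault(dst, deque())
        if service == "http":
            q.append(ts)
        # Drop timestamps that have fallen out of the trailing window.
        while q and ts - q[0] > window_seconds:
            q.popleft()
        out.append((ts, dst, service, len(q)))
    return out
```

A sudden jump in this count for one host is the kind of derived feature that a detection model can use.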
Model Generation
Popular techniques for misuse and anomaly detection:
• Association Rules
• Clustering
• Support Vector Machines
Supervised learning methods for Classification
Decision Trees
Model build functionality: the DBMS_DATA_MINING PL/SQL package is used to train linear SVM anomaly and misuse detection models.
Test dataset attack categories:
- Probing
- Denial of service
- Unauthorized access to a local superuser (U2R)
- Unauthorized access from a remote machine (R2L)
(37 subclasses of attacks under the 4 generic categories)
Misuse Detection Problem
Anomaly Detection Problem
Accuracy of the system 92.1%
Periodic model updates as new data is accumulated.
Model rebuild when performance falls below a predefined level.
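A rebuild trigger of this kind might look like the sketch below, which compares model predictions against known labels on recent data. The function name and the use of plain accuracy as the performance measure are assumptions; the slides do not say which metric or threshold is used.

```python
def needs_rebuild(predictions, labels, min_accuracy):
    """Return (rebuild_flag, accuracy) for a batch of labeled recent cases.

    The model should be rebuilt when accuracy drops below `min_accuracy`.
    """
    if not labels:
        raise ValueError("need at least one labeled case")
    correct = sum(p == y for p, y in zip(predictions, labels))
    accuracy = correct / len(labels)
    return accuracy < min_accuracy, accuracy
```

In a scheduled job, a True flag would start model generation on the newly accumulated data.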
Model Distribution
Real Application Clusters (RAC)
Detection
Real time / offline
Audit data are classified as attack or not by the misuse detection SVM model.
A functional index is built on the probability of a case being an attack or not.
The query returns all cases in audit_data with a probability greater than 0.5 of being an attack.
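The original SQL query is not reproduced in this transcript. The Python sketch below only illustrates the thresholding logic it describes; `attack_probability` stands in for the SVM model's scoring function and is a hypothetical callable, not an Oracle API.

```python
def flag_attacks(audit_data, attack_probability, threshold=0.5):
    """Return the cases whose predicted attack probability exceeds `threshold`.

    attack_probability: hypothetical callable mapping a case to a value in [0, 1].
    """
    return [case for case in audit_data if attack_probability(case) > threshold]
```

In the database setting, the functional index on the probability lets the equivalent query avoid rescoring every row at query time.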
Combination of multiple models
The query returns all cases where either model1 or model2 indicates an attack with probability higher than 0.4.
In this case, when the anomaly_model classifies a case as an attack with probability greater than 0.5, the misuse_model will attempt to identify the type of attack.
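The two combination strategies can be sketched as follows. The model arguments are hypothetical callables standing in for scoring with the stored models: the first three return an attack probability, while `misuse_model` in the second function returns an attack type label. The original SQL is not reproduced in this transcript.

```python
def either_model_flags(cases, model1, model2, threshold=0.4):
    """Cases where at least one model gives an attack probability above threshold."""
    return [c for c in cases if model1(c) > threshold or model2(c) > threshold]

def anomaly_then_misuse(cases, anomaly_model, misuse_model, threshold=0.5):
    """Two-stage detection: the anomaly model filters cases, then the
    misuse model is asked to name the attack type for each flagged case."""
    return [(c, misuse_model(c)) for c in cases if anomaly_model(c) > threshold]
```

The first form widens coverage by taking the union of two detectors; the second spends the more specific misuse model only on cases the anomaly model already considers suspicious.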
Reports and Analysis
Conclusion
• Data mining techniques are very useful in intrusion detection.
• Manual interpretation and advice are still needed in some processing steps.
• Detection is more efficient on known attacks than on unknown attacks, and only if the training data contains all normal behavior.