TRANSCRIPT
1
INTRUSION DETECTION USING NEURAL NETWORKS AND SUPPORT VECTOR MACHINE
Srinivas Mukkamala, Guadalupe Janoski, Andrew Sung
Dept. of CS, New Mexico Institute of Mining and Technology
IEEE WCCI IJCNN 2002
World Congress on Computational Intelligence
International Joint Conference on Neural Networks
2
Outline
Approaches to intrusion detection using neural networks and support vector machines
DARPA dataset
Neural Networks
Support Vector Machines
Experiments
Conclusion and Comments
3
Approaches
The key ideas are to discover useful patterns or features that describe user behavior on a system, and to use the set of relevant features to build classifiers that can recognize anomalies and known intrusions.
Neural networks and support vector machines are trained with both normal user activity and attack patterns. Significant deviations from normal behavior are flagged as attacks.
4
DARPA Data for Intrusion Detection
DARPA (Defense Advanced Research Projects Agency): an agency of the US Department of Defense responsible for the development of new technology for use by the military.
The benchmark comes from a KDD (Knowledge Discovery and Data Mining) competition designed by DARPA.
Attacks fall into four main categories:
DOS: denial of service
R2L: unauthorized access from a remote machine
U2R: unauthorized access to local super user (root) privileges
Probing: surveillance and other probing
5
Features
The full list of features is described at: http://kdd.ics.uci.edu/databases/kddcup99/task.html
6
[Figure: a biological neuron. The dendrites gather incoming signals, the soma (the cell's center) combines the signals and decides whether to trigger, and the axon carries the output signal.]
Neural Networks
7
[Figure: a single perceptron. Inputs X1, X2 enter through weights w1, w2 and threshold θ (input → weight → activation → output); the separating line in the plane is w1X1 + w2X2 − θ = 0, with example points A, B, C, D on either side.]
Divide and Conquer
[Figure: three perceptrons N1, N2, N3 with weights of ±1. N1 and N2 each compute one line over the inputs x1, x2, producing out1 and out2; N3 combines out1 and out2 to classify the points A, B, C, D. The table lists each unit's ±1 output for every point.]
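The divide-and-conquer idea can be sketched as three sign-threshold perceptrons. The slide's exact weights are not recoverable from the figure, so the OR-like/AND-like weights below are illustrative assumptions; they show how N3 combines out1 and out2 to separate points that no single line can:

```python
def perceptron(x, w, theta):
    # single unit: sign(<w, x> - theta)
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s - theta > 0 else -1

def network(x1, x2):
    # N1 and N2 each draw one line; N3 combines their outputs
    out1 = perceptron((x1, x2), (1, 1), -1)  # OR-like unit (assumed weights)
    out2 = perceptron((x1, x2), (1, 1), 1)   # AND-like unit (assumed weights)
    return perceptron((out1, out2), (1, -1), 1)

for p in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
    print(p, network(*p))
```

On ±1-encoded inputs this reproduces the XOR pattern, which a single perceptron cannot represent.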
8
[Figure: a 4-layer network, Layer 1 through Layer 4. In Layer 1, neuron N1 receives x0(0), x1(0), x2(0) through weights w01(1), w11(1), w21(1), computes S1(1), and outputs x1(1).]
In general, neuron Nj in layer l computes the cumulated signal
Sj(l) = Σ_{i=0}^{d(l-1)} wij(l) xi(l-1)
and the activated output
xj(l) = tanh(Sj(l))
Hyperbolic function: tanh(S) = (e^S − e^(−S)) / (e^S + e^(−S))
Decide the architecture; determine the weights automatically.
Feed Forward Neural Network (FFNN)
9
[Figure: a feed-forward network of Σ-units mapping input to output; g(x) is the classifier formed by the weights w.]
Training data: {(xn, yn)}, n = 1, …, N
Error function: E(w) = (1/N) Σ_{n=1}^{N} (g(xn) − yn)²
How to minimize E(w)? Stochastic Gradient Descent (SGD):
w starts as random small values; for T iterations:
w_new ← w_old − η ∇w(En), where η is the learning rate
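As a minimal sketch of this loop (a toy example of my own, not from the slides), SGD fitting a one-weight model g(x) = w·x under the squared error above:

```python
import random

random.seed(0)

# toy data generated from the target rule y = 2x; g(x) = w*x is the model
data = [(x, 2.0 * x) for x in (random.uniform(-1, 1) for _ in range(100))]

w = random.gauss(0, 0.01)   # w starts as a random small value
eta = 0.1                    # learning rate
for _ in range(5):           # T iterations over the data
    for xn, yn in data:
        grad = 2 * (w * xn - yn) * xn   # gradient of En = (g(xn) - yn)^2
        w = w - eta * grad              # w_new = w_old - eta * grad(En)

print(round(w, 3))  # converges to the target weight 2.0
```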
10
[Figure: layers 1, 2, …, L−1, L; neuron Nj in layer l with weights wij(l), cumulated signal Sj(l), and output xj(l).]
Forward: for l = 1, 2, …, L compute
Sj(l) = Σ_{i=0}^{d(l-1)} wij(l) xi(l-1)
xj(l) = tanh(Sj(l))
For a single output neuron, the error is
E = (x1(L) − y)² = (tanh(S1(L)) − y)² = (tanh(Σ_i wi1(L) xi(L-1)) − y)²
∂E/∂wi1(L) = (∂E/∂S1(L)) (∂S1(L)/∂wi1(L)) = δ1(L) xi(L-1)
where δ1(L) = 2 (tanh(S1(L)) − y) [1 − tanh²(S1(L))]
In general:
∂E/∂wij(l) = δj(l) xi(l-1)
δi(l-1) = Σ_j δj(l) wij(l) [1 − tanh²(Si(l-1))]
Back Propagation Algorithm
Backward: for l = L, L−1, …, 1 compute δi(l)
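A compact NumPy sketch of the forward and backward passes above, with a numerical gradient check; the network shape and data are my own illustrative choices:

```python
import numpy as np

def forward(ws, x):
    # ws: one weight matrix per layer; row j holds the weights into neuron Nj
    xs = [x]
    for W in ws:
        a = np.concatenate(([1.0], xs[-1]))  # x0 = 1 is the bias input
        xs.append(np.tanh(W @ a))            # xj(l) = tanh(Sj(l))
    return xs

def backward(ws, xs, y):
    # output delta: 2*(x(L) - y)*(1 - tanh^2(S(L))); note tanh(S(l)) = x(l)
    grads = []
    delta = 2 * (xs[-1] - y) * (1 - xs[-1] ** 2)
    for l in range(len(ws) - 1, -1, -1):
        a = np.concatenate(([1.0], xs[l]))
        grads.insert(0, np.outer(delta, a))  # dE/dwij(l) = deltaj(l) * xi(l-1)
        # deltai(l-1) = sum_j deltaj(l) wij(l) (1 - tanh^2(Si(l-1)))
        # (the bias column is dropped; the final input-layer delta is unused)
        delta = (ws[l][:, 1:].T @ delta) * (1 - xs[l] ** 2)
    return grads

rng = np.random.default_rng(1)
ws = [rng.normal(scale=0.5, size=(2, 3)),  # 2 inputs -> 2 hidden neurons
      rng.normal(scale=0.5, size=(1, 3))]  # 2 hidden -> 1 output neuron
x, y = np.array([0.3, -0.7]), np.array([0.5])

xs = forward(ws, x)
grads = backward(ws, xs, y)

# numerical check of one partial derivative by central differences
eps = 1e-6
ws[0][0, 1] += eps
e_plus = ((forward(ws, x)[-1] - y) ** 2).item()
ws[0][0, 1] -= 2 * eps
e_minus = ((forward(ws, x)[-1] - y) ** 2).item()
ws[0][0, 1] += eps
numeric = (e_plus - e_minus) / (2 * eps)
print(abs(numeric - grads[0][0, 1]) < 1e-6)  # True
```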
11
[Figure: feed-forward network; neuron Nj in layer l with weights wij(l), signal Sj(l), and output xj(l).]
Feed Forward NNet
Consists of layers 1, 2, …, L
wij(l) connects neuron i in layer (l−1) to neuron j in layer l
Cumulated signal: Sj(l) = Σ_{i=0}^{d(l-1)} wij(l) xi(l-1)
Activated output: xj(l) = f(Sj(l)), where f is often tanh
Minimize E(w) and determine the weights automatically
SGD (Stochastic Gradient Descent), using ∂E/∂wij(l) = δj(l) xi(l-1):
Forward: compute Sj(l) and xj(l)
Backward: compute δi(l)
w starts as random small values; for T iterations: w_new ← w_old − η ∇w(En)
Stop when the desired error rate is met
12
Support Vector Machine
A supervised learning method
Known as the maximum margin classifier
Finds the max-margin separating hyperplane
13
SVM – hard margin
[Figure: points in the (x1, x2) plane separated by the hyperplane <w, x> − θ = 0, with margin boundaries <w, x> − θ = +1 and <w, x> − θ = −1 and margin width 2/∥w∥.]
max_{w, θ} 2/∥w∥, subject to yn(<w, xn> − θ) ≥ 1
equivalently: argmin_{w, θ} (1/2)<w, w>, subject to yn(<w, xn> − θ) ≥ 1
14
Quadratic programming
Generic QP: argmin_v (1/2) Σ_i Σ_j aij vi vj + Σ_i bi vi, subject to Σ_i rki vi ≥ qk
V* = quadprog(A, b, R, q)
SVM hard margin: argmin_{w, θ} (1/2)<w, w>, subject to yn(<w, xn> − θ) ≥ 1
Let V = [θ, w1, w2, …, wD]
Objective: (1/2) Σ_{d=1}^{D} wd²
Constraints: (−yn) θ + Σ_{d=1}^{D} yn (xn)d wd ≥ 1
Adapt the problem for quadratic programming: find A, b, R, q and put them into the quadratic programming solver
15
Adaptation
V = [θ, w1, w2, …, wD] = [v0, v1, v2, …, vD]
A ((1+D)×(1+D)): a00 = 0, a0j = 0, ai0 = 0; for i, j ≠ 0: aij = 1 if i = j, else 0
b ((1+D)×1): b0 = 0, bi = 0 for i ≠ 0
R (N×(1+D)): rn0 = −yn; rnd = yn (xn)d for d > 0
q (N×1): qn = 1
(one margin constraint per training example, so R and q have N rows)
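The adaptation can be sketched by building A, b, R, q directly; quadprog here stands for any generic QP solver, so this sketch (my own) only constructs and sanity-checks the matrices, with one margin constraint row per training example:

```python
import numpy as np

def hard_margin_qp(X, y):
    """Build A, b, R, q with v = [theta, w1, ..., wD] so the hard-margin SVM
    becomes: argmin (1/2) v^T A v + b^T v  subject to  R v >= q."""
    N, D = X.shape
    A = np.zeros((1 + D, 1 + D))
    A[1:, 1:] = np.eye(D)       # aij = 1 if i == j (i, j != 0), else 0
    b = np.zeros(1 + D)         # no linear term in the objective
    # row n encodes: (-yn) * theta + sum_d yn (xn)d wd >= 1
    R = np.hstack([-y[:, None], y[:, None] * X])
    q = np.ones(N)
    return A, b, R, q

X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
A, b, R, q = hard_margin_qp(X, y)
print(A.shape, R.shape)  # (3, 3) (2, 3)
# v = [theta=0, w=(1, 1)] satisfies every margin constraint on this toy set
print(np.all(R @ np.array([0.0, 1.0, 1.0]) >= q))  # True
```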
16
SVM – soft margin
Allow possible training errors ξn
Tradeoff c:
Large c: thinner margin, cares more about errors
Small c: thicker margin, cares less about errors
argmin_{w, θ} (1/2)<w, w> + c Σ_n ξn, subject to yn(<w, xn> − θ) ≥ 1 − ξn and ξn ≥ 0
(the ξn terms measure the errors; c sets the tradeoff)
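The slides adapt this problem to a QP solver next; as an alternative illustration, the same soft-margin objective can also be minimized directly by stochastic subgradient descent (a Pegasos-style sketch of my own, not the authors' method):

```python
import numpy as np

def soft_margin_sgd(X, y, c=1.0, eta=0.01, epochs=200, seed=0):
    """Minimize (1/2)<w, w> + c * sum_n max(0, 1 - y_n(<w, x_n> - theta))
    by stochastic subgradient descent, one example at a time."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w, theta = np.zeros(D), 0.0
    for _ in range(epochs):
        for n in rng.permutation(N):
            if y[n] * (X[n] @ w - theta) < 1:     # margin violated: xi_n > 0
                w -= eta * (w - c * y[n] * X[n])  # regularizer + hinge subgradient
                theta -= eta * (c * y[n])
            else:
                w -= eta * w                      # only the regularizer pulls
    return w, theta

# toy separable set: two points per class
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, theta = soft_margin_sgd(X, y)
print(np.sign(X @ w - theta))  # matches y on this separable toy set
```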
17
Adaptation
Generic QP: argmin_v (1/2) Σ_i Σ_j aij vi vj + Σ_i bi vi, subject to Σ_i rki vi ≥ qk
V = [θ, w1, w2, …, wD, ξ1, ξ2, …, ξN]
Sizes: A is (1+D+N)×(1+D+N); b is (1+D+N)×1; R is (2N)×(1+D+N); q is (2N)×1
18
Primal form and Dual form
Primal form:
argmin_{w, θ} (1/2)<w, w> + c Σ_n ξn, subject to yn(<w, xn> − θ) ≥ 1 − ξn and ξn ≥ 0
Variables: 1+D+N; constraints: 2N
Dual form:
argmin_α (1/2) Σ_n Σ_m αn yn αm ym <xn, xm> − Σ_n αn, subject to Σ_n yn αn = 0 and 0 ≤ αn ≤ C
Variables: N; constraints: 2N+1
19
Dual form SVM
Find the optimal α*, then use α* to solve for w* and θ
αn = 0: correctly classified, or on the margin boundary
0 < αn < C: on the margin boundary (free support vector)
αn = C: misclassified, or on the boundary
[Figure: points labeled αn = 0, free SV (0 < αn < C), and αn = C; the support vectors define the margin.]
20
Nonlinear SVM
Nonlinear mapping X → Φ(X): e.g. {(x)1, (x)2} in R² → {1, (x)1, (x)2, (x)1², (x)2², (x)1(x)2} in R⁶
Need the kernel trick
argmin_α (1/2) Σ_n Σ_m αn yn αm ym <Φ(xn), Φ(xm)> − Σ_n αn, subject to Σ_n yn αn = 0 and 0 ≤ αn ≤ C
Kernel: <Φ(xn), Φ(xm)> = (1 + <xn, xm>)²
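The kernel identity can be checked numerically. For the degree-2 kernel, <Φ(xn), Φ(xm)> = (1 + <xn, xm>)² holds exactly when the R⁶ feature map carries √2 scalings on the linear and cross terms (the unscaled map on the slide matches only up to those constant factors):

```python
import math

def kernel(x, z):
    # degree-2 polynomial kernel
    return (1 + sum(a * b for a, b in zip(x, z))) ** 2

def phi(x):
    # explicit feature map in R^6; the sqrt(2) scalings make
    # <phi(x), phi(z)> reproduce the kernel exactly
    x1, x2 = x
    r2 = math.sqrt(2)
    return (1, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (0.5, -1.0), (2.0, 0.25)
print(kernel(x, z), dot(phi(x), phi(z)))  # both approximately 3.0625
```

The point of the trick: the left side costs one D-dimensional inner product, while the right side would require computing the higher-dimensional Φ explicitly.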
21
Experiments
Automated parsers process the raw TCP/IP dump data into machine-readable form
7312 training records (different types of attacks and normal data), each with 41 features
6980 testing records to evaluate the classifiers
Pipeline: pre-processing → training → testing

              Support Vector Machines          | Neural Networks
Details:      RBF kernel, C = 1000,            | 3-layer 41-40-40-1 FFNNets,
              204 support vectors (29 free)    | scaled conjugate gradient descent,
                                               | desired error rate = 0.001
Accuracy:     99.5%                            | 99.25%
Time spent:   17.77 sec                        | 18 min
22
Conclusion and Comments
Speed: SVM training time is significantly shorter
SVMs avoid the "curse of dimensionality" through the max-margin principle
Accuracy: both achieve high accuracy
SVMs can only make binary classifications, while IDS requires multiple-class identification
How should the features be determined?