Ch 4: Non Linear Classifiers
TRANSCRIPT
The XOR problem

x1  x2  XOR  Class
0   0   0    B
0   1   1    A
1   0   1    A
1   1   0    B
There is no single line (hyperplane) that separates class A from class B. On the contrary, the AND and OR operations are linearly separable problems.
The Two-Layer Perceptron
For the XOR problem, draw two lines instead of one.
Then class B is located outside the shaded area and class A inside. This is a two-phase design.

• Phase 1: Draw two lines (hyperplanes)
$g_1(x) = g_2(x) = 0$
Each of them is realized by a perceptron. The outputs of the perceptrons will be
$y_i = f(g_i(x)) = 0 \text{ or } 1, \quad i = 1, 2$
depending on the position of x.

• Phase 2: Find the position of x w.r.t. both lines, based on the values of y1, y2.
• Equivalently: The computations of the first phase perform a mapping
$x \to y = [y_1, y_2]^T$

x1  x2  |  1st phase: y1  y2  |  2nd phase
0   0   |  0(−)  0(−)         |  B(0)
0   1   |  1(+)  0(−)         |  A(1)
1   0   |  1(+)  0(−)         |  A(1)
1   1   |  1(+)  1(+)         |  B(0)
The decision is now performed on the transformed data y. This can be done via a second line, $g(y) = 0$, which can also be realized by a perceptron.
The computations of the first phase perform a mapping that transforms the nonlinearly separable problem to a linearly separable one.

The architecture
This is known as the two-layer perceptron, with one hidden and one output layer. The activation function is the step function,
$f(\cdot) = \begin{cases} 1 \\ 0 \end{cases}$
The neurons (nodes) of the figure realize the following lines (hyperplanes):
$g_1(x) = x_1 + x_2 - \tfrac{1}{2} = 0$
$g_2(x) = x_1 + x_2 - \tfrac{3}{2} = 0$
$g(y) = y_1 - y_2 - \tfrac{1}{2} = 0$
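The two-phase design can be sketched directly in code. This is a minimal sketch using the hyperplanes $g_1(x) = x_1 + x_2 - 1/2$, $g_2(x) = x_1 + x_2 - 3/2$ and the output line $g(y) = y_1 - y_2 - 1/2$ given above, with a step activation:

```python
def step(v):
    """Step activation: f(v) = 1 if v > 0, else 0."""
    return 1 if v > 0 else 0

def two_layer_xor(x1, x2):
    # Phase 1: map x onto a vertex of the unit square.
    y1 = step(x1 + x2 - 0.5)    # side of g1(x) = 0
    y2 = step(x1 + x2 - 1.5)    # side of g2(x) = 0
    # Phase 2: a single perceptron separates the transformed data.
    return step(y1 - y2 - 0.5)  # 1 -> class A, 0 -> class B

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, two_layer_xor(*x))  # prints 0, 1, 1, 0 -> B, A, A, B
```

The output reproduces the XOR truth table: class A for (0,1) and (1,0), class B otherwise.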
![Page 9: Ch 4: Non Linear Classifiers](https://reader034.vdocuments.us/reader034/viewer/2022050512/6271d54d5c327a08724a756e/html5/thumbnails/9.jpg)
9
Classification capabilities of the two-layer perceptron
The mapping performed by the first-layer neurons is onto the vertices of the unit-side square, i.e., (0, 0), (0, 1), (1, 0), (1, 1).
In the more general case, the mapping
$x \in R^l \to y = [y_1, \ldots, y_p]^T, \quad y_i \in \{0, 1\}, \quad i = 1, 2, \ldots, p$
performs a mapping of a vector onto the vertices of the unit-side hypercube $H_p$.

The mapping is achieved with p neurons, each realizing a hyperplane. The output of each of these neurons is 0 or 1, depending on the relative position of x w.r.t. the hyperplane.

Intersections of these hyperplanes form regions in the l-dimensional space. Each region corresponds to a vertex of the unit hypercube $H_p$.
For example, the 001 vertex corresponds to the region which is located:
• to the (−) side of $g_1(x) = 0$
• to the (−) side of $g_2(x) = 0$
• to the (+) side of $g_3(x) = 0$
$-y_1 - y_2 + y_3 + 0.5 = 0$
The output neuron subsequently realizes another hyperplane, which separates the hypercube into two parts, having some of its vertices on one side and some on the other.

The output neuron realizes a hyperplane in the transformed space that separates some of the vertices from the others. Thus, the two-layer perceptron has the capability to classify vectors into classes that consist of unions of polyhedral regions. But NOT ANY union: it depends on the relative position of the corresponding vertices.
Three-Layer Perceptrons

The architecture

This is capable of classifying vectors into classes consisting of ANY union of polyhedral regions.

The idea is similar to the XOR problem. It realizes more than one plane in the $y \in R^p$ space.
![Page 14: Ch 4: Non Linear Classifiers](https://reader034.vdocuments.us/reader034/viewer/2022050512/6271d54d5c327a08724a756e/html5/thumbnails/14.jpg)
14
The reasoning
• For each vertex corresponding to class, say, A, construct a hyperplane which leaves THIS vertex on one side (+) and ALL the others on the other side (−).
• The output neuron realizes an OR gate.

Overall:
The first layer of the network forms the hyperplanes, the second layer forms the regions, and the output neuron forms the classes.

Designing Multilayer Perceptrons
One direction is to adopt the above rationale and develop a structure that classifies correctly all the training patterns.
The other direction is to choose a structure and compute the synaptic weights to optimize a cost function.
The Backpropagation Algorithm

This is an algorithmic procedure that computes the synaptic weights iteratively, so that an adopted cost function is minimized (optimized).

In a large number of optimizing procedures, computation of derivatives is involved. Hence, discontinuous activation functions pose a problem, i.e.,
$f(x) = \begin{cases} 1, & x > 0 \\ 0, & x < 0 \end{cases}$
There is always an escape path! The logistic function
$f(x) = \dfrac{1}{1 + \exp(-ax)}$
is an example. Other functions are also possible and in some cases more desirable.
Squashing functions:
$f(x) = \dfrac{1}{1 + \exp(-ax)}$
$f(x) = \dfrac{2}{1 + \exp(-ax)} - 1$
$f(x) = c \, \dfrac{1 - \exp(-ax)}{1 + \exp(-ax)} = c \cdot \tanh\!\left(\dfrac{ax}{2}\right)$
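The three squashing functions above can be evaluated directly; this short sketch also checks the identity $c\,(1 - e^{-ax})/(1 + e^{-ax}) = c \tanh(ax/2)$ numerically (the sample values of x, a, c are arbitrary):

```python
import math

def logistic(x, a=1.0):
    # 1/(1 + exp(-a x)), range (0, 1)
    return 1.0 / (1.0 + math.exp(-a * x))

def bipolar_logistic(x, a=1.0):
    # 2/(1 + exp(-a x)) - 1, range (-1, 1)
    return 2.0 / (1.0 + math.exp(-a * x)) - 1.0

def scaled_tanh(x, a=1.0, c=1.0):
    # c * (1 - exp(-a x)) / (1 + exp(-a x)) = c * tanh(a x / 2)
    return c * (1.0 - math.exp(-a * x)) / (1.0 + math.exp(-a * x))

x, a, c = 0.7, 2.0, 1.5
assert abs(scaled_tanh(x, a, c) - c * math.tanh(a * x / 2)) < 1e-12
```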
![Page 17: Ch 4: Non Linear Classifiers](https://reader034.vdocuments.us/reader034/viewer/2022050512/6271d54d5c327a08724a756e/html5/thumbnails/17.jpg)
17
The steps:
• Adopt an optimizing cost function, e.g.,
– Least Squares Error
– Relative Entropy
between desired responses and actual responses of the network for the available training patterns. That is, from now on we have to live with errors. We only try to minimize them, using certain criteria.
• Adopt an algorithmic procedure for the optimization of the cost function with respect to the synaptic weights, e.g.,
– Gradient descent
– Newton’s algorithm
– Conjugate gradient
The weight vector (including the threshold) of the jth neuron in the rth layer, which is a vector of dimension $k_{r-1} + 1$, is defined as:
$w_j^r = [w_{j0}^r, w_{j1}^r, \ldots, w_{jk_{r-1}}^r]^T$

The task is a nonlinear optimization one. For the gradient descent method:
$w_j^r(\text{new}) = w_j^r(\text{old}) + \Delta w_j^r, \quad \Delta w_j^r = -\mu \dfrac{\partial J}{\partial w_j^r}$
The Procedure:
• Initialize the unknown weights randomly with small values.
• Compute the gradient terms backwards, starting with the weights of the last (3rd) layer and then moving towards the first.
• Update the weights.
• Repeat the procedure until a termination criterion is met.

Two major philosophies:
• Batch mode: The gradients of the last layer are computed once ALL training data have been presented to the algorithm, i.e., by summing up all error terms.
• Pattern mode: The gradients are computed every time a new training data pair appears. Thus gradients are based on successive individual errors.
A major problem: The algorithm may converge to a local minimum
The Cost function choice

Examples:
• The Least Squares:
$J = \sum_{i=1}^{N} E(i)$
$E(i) = \sum_{m=1}^{k_L} e_m^2(i) = \sum_{m=1}^{k_L} \left(\hat{y}_m(i) - y_m(i)\right)^2, \quad i = 1, 2, \ldots, N$
where
$y_m(i)$: desired response of the mth output neuron (1 or 0) for input $x(i)$
$\hat{y}_m(i)$: actual response of the mth output neuron, in the interval [0, 1], for input $x(i)$
For the gradient computations we define
$\delta_j^r(i) = \dfrac{\partial E(i)}{\partial v_j^r(i)}$
where $v_j^r(i) = (w_j^r)^T y^{r-1}(i)$ is the weighted sum (before the activation) of the jth neuron of layer r, and $y^{r-1}(i)$ is the output vector of layer r−1 (with $k_{r-1}$ components).

The computations start from r = L and propagate backward, for r = L−1, L−2, ..., 1.

1. r = L:
$\delta_j^L(i) = e_j(i) \, f'(v_j^L(i))$

2. r < L:
$\delta_j^{r-1}(i) = e_j^{r-1}(i) \, f'(v_j^{r-1}(i))$
with
$e_j^{r-1}(i) = \sum_{k=1}^{k_r} \delta_k^r(i) \, w_{kj}^r$

Hence, the update is
$w_j^r(\text{new}) = w_j^r(\text{old}) + \Delta w_j^r, \quad \Delta w_j^r = -\mu \sum_{i=1}^{N} \delta_j^r(i) \, y^{r-1}(i)$

Demo: nnd11bc nnd11fa nnd11gn

The Backpropagation Algorithm:
1. Initialization
2. Forward computations
3. Backward computations
4. Update the weights
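The delta recursions above can be sketched in batch mode for a two-layer network with logistic activations and a least-squares cost $J = \frac{1}{2}\sum e^2$. The layer sizes, seed, and XOR data below are illustrative choices, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(v):
    """Logistic activation with a = 1."""
    return 1.0 / (1.0 + np.exp(-v))

def forward(W1, W2, X):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])    # +1 input for thresholds
    Y1 = f(X1 @ W1.T)                                # hidden-layer outputs
    Y1a = np.hstack([np.ones((Y1.shape[0], 1)), Y1])
    Y2 = f(Y1a @ W2.T)                               # output-layer responses
    return X1, Y1, Y1a, Y2

def cost(W1, W2, X, T):
    return 0.5 * np.sum((forward(W1, W2, X)[3] - T) ** 2)

def backprop_grads(W1, W2, X, T):
    X1, Y1, Y1a, Y2 = forward(W1, W2, X)
    D2 = (Y2 - T) * Y2 * (1 - Y2)   # delta^L = e * f'(v); f' = f (1 - f)
    E1 = D2 @ W2[:, 1:]             # e^{r-1}_j = sum_k delta^r_k w^r_kj
    D1 = E1 * Y1 * (1 - Y1)         # delta^{r-1}
    return D1.T @ X1, D2.T @ Y1a    # gradients, summed over ALL patterns

# XOR data: 4 training pairs, desired responses 0/1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W1 = rng.normal(0.0, 0.5, (2, 3))   # 2 hidden neurons: 2 inputs + threshold
W2 = rng.normal(0.0, 0.5, (1, 3))   # 1 output neuron: 2 hidden + threshold
G1, G2 = backprop_grads(W1, W2, X, T)
```

A gradient-descent step then is `W -= mu * G` for each layer; the backward pass can be checked against numerical differentiation of `cost`.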
VARIATIONS ON THE BACKPROPAGATION THEME

Use a momentum term:
$\Delta w_j^r(\text{new}) = \alpha \, \Delta w_j^r(\text{old}) - \mu \sum_{i=1}^{N} \delta_j^r(i) \, y^{r-1}(i)$
$w_j^r(\text{new}) = w_j^r(\text{old}) + \Delta w_j^r(\text{new})$

Use an adaptive value for the learning factor μ:
• delta-delta rule
• Conjugate gradient algorithm
• Newton family approaches
• Algorithms based on the Kalman filtering approach
• Levenberg–Marquardt algorithm
• Quickprop scheme

Typical values for the adaptive learning factor scheme: μ = 0.01, α = 0.85, r_i = 1.05, c = 1.05, r_d = 0.7.
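The momentum update blends the previous increment (scaled by α) with the current negative gradient step. A minimal sketch, using a stand-in one-dimensional quadratic cost in place of the network gradient (the quadratic and its minimum at w = 3 are illustrative):

```python
def momentum_descent(grad, w0, mu=0.01, alpha=0.85, steps=500):
    """Gradient descent with momentum:
    dw(new) = alpha * dw(old) - mu * grad(w);  w <- w + dw(new)."""
    w, dw = w0, 0.0
    for _ in range(steps):
        dw = alpha * dw - mu * grad(w)
        w = w + dw
    return w

# Illustrative cost J(w) = (w - 3)^2 with gradient 2(w - 3); minimum at w = 3.
w_star = momentum_descent(lambda w: 2 * (w - 3), w0=0.0)
```

With these values the iterates settle at the minimum; the momentum term smooths successive steps rather than changing the fixed point.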
The cross-entropy:
$J = \sum_{i=1}^{N} E(i)$
$E(i) = -\sum_{m=1}^{k_L} \left[ y_m(i) \ln \hat{y}_m(i) + (1 - y_m(i)) \ln(1 - \hat{y}_m(i)) \right]$

The cross-entropy cost function depends on the relative errors and not on the absolute errors, as its least squares counterpart does; thus it gives the same weight to small and large values. This presupposes an interpretation of y and ŷ as probabilities.

Classification error rate: This is also known as discriminative learning. Most of these techniques use a smoothed version of the classification error.
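The two costs can be compared per pattern. A minimal sketch for a single pattern i, with the desired and actual responses below chosen purely for illustration:

```python
import math

def least_squares(y, y_hat):
    # E(i) = sum_m (y_hat_m - y_m)^2
    return sum((p - t) ** 2 for t, p in zip(y, y_hat))

def cross_entropy(y, y_hat):
    # E(i) = -sum_m [ y_m ln(y_hat_m) + (1 - y_m) ln(1 - y_hat_m) ]
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y, y_hat))

y = [1, 0]                     # desired responses of two output neurons
print(least_squares(y, [0.9, 0.1]), cross_entropy(y, [0.9, 0.1]))
```

Note the cross-entropy requires $\hat{y}_m \in (0, 1)$, consistent with the probabilistic interpretation above.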
Remark 1: A common feature of all the above is the danger of convergence to a local minimum. "Well formed" cost functions guarantee convergence to a "good" solution, that is, one that classifies correctly ALL training patterns, provided such a solution exists. The cross-entropy cost function is a well formed one; the Least Squares is not.

Remark 2: Both the Least Squares and the cross-entropy cost functions lead to output values that approximate optimally the class a-posteriori probabilities:
$\hat{y}_m(i) \approx P(\omega_m \mid x(i))$
That is, the probability of class $\omega_m$ given $x(i)$.

This is a very interesting result. It does not depend on the underlying distributions; it is a characteristic of certain cost functions. How good or bad the approximation is depends on the underlying model. Furthermore, it is only valid at the global minimum.
Choice of the network size: How big can a network be? How many layers and how many neurons per layer?

The number of free parameters (synaptic weights) to be estimated should be:
(a) large enough to learn what makes the feature vectors within each class "similar" and, at the same time, what makes one class different from the other;
(b) small enough, with respect to the number N of training pairs, so as not to be able to learn the underlying differences among the data of the same class.

There are 3 major directions:
1) Analytical methods: This category employs algebraic or statistical techniques to determine the number of free parameters. It is static and takes into consideration neither the cost function used nor the training procedure.
2) Pruning techniques: These techniques start from a large network and then remove weights and/or neurons iteratively, according to a criterion.

— Methods based on parameter sensitivity:
$\delta J = \sum_i g_i \, \delta w_i + \frac{1}{2} \sum_i h_{ii} \, \delta w_i^2 + \frac{1}{2} \sum_i \sum_{j \ne i} h_{ij} \, \delta w_i \, \delta w_j + \text{higher order terms}$
where
$g_i = \dfrac{\partial J}{\partial w_i}, \quad h_{ij} = \dfrac{\partial^2 J}{\partial w_i \, \partial w_j}$
Near a minimum, and assuming that the Hessian matrix is diagonal,
$\delta J \approx \frac{1}{2} \sum_i h_{ii} \, \delta w_i^2$
Pruning is now achieved by the following procedure:
• Train the network using backpropagation for a number of steps.
• Compute the saliencies
$s_i = \dfrac{h_{ii} \, w_i^2}{2}$
• Remove the weights with small $s_i$.
• Repeat the process.

— Methods based on function regularization:
$J = \sum_{i=1}^{N} E(i) + a \, E_p(w)$
The first term is the performance cost function, and it is chosen according to what we have already discussed (e.g., least squares, cross-entropy).
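The saliency-based pruning step can be sketched as follows, assuming the diagonal Hessian entries $h_{ii}$ are available from training; the weight values, Hessian entries, and keep ratio below are illustrative:

```python
def prune_by_saliency(weights, hessian_diag, keep_ratio=0.5):
    """Zero out the fraction of weights with the smallest saliency
    s_i = h_ii * w_i^2 / 2 (diagonal-Hessian approximation)."""
    saliencies = [h * w ** 2 / 2 for w, h in zip(weights, hessian_diag)]
    order = sorted(range(len(weights)), key=lambda i: saliencies[i])
    dropped = set(order[:int(len(weights) * (1 - keep_ratio))])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]

w = [0.9, -0.05, 1.2, 0.01]     # trained weights (illustrative)
h = [1.0, 1.0, 1.0, 1.0]        # diagonal Hessian entries (illustrative)
print(prune_by_saliency(w, h))  # the two small-saliency weights are removed
```

In practice the train/compute-saliency/remove cycle is repeated, as the slide's procedure indicates.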
The second term favours small values for the weights, e.g.,
$E_p(w) = \sum_{k=1}^{K} h(w_k^2), \quad h(w_k^2) = \dfrac{w_k^2}{w_0^2 + w_k^2}$
where $w_0$ is a preselected parameter and h(·) is an appropriately chosen differentiable function. After some training steps, weights with small values are removed.

3) Constructive techniques: They start with a small network and keep increasing it, according to a predetermined procedure and criterion.
Remark: Why not start with a large network and let the algorithm decide which weights are small? This approach is naïve. It overlooks the fact that classifiers must have good generalization properties. A large network can result in small errors for the training set, since it can learn the particular details of the training set; on the other hand, it will not perform well when presented with data unknown to it. The size of the network must be:
• large enough to learn what makes data of the same class similar and data from different classes dissimilar;
• small enough not to be able to learn the underlying differences between data of the same class. Failing this leads to the so-called overfitting.
Example: 400 training samples; variances = 0.08. NN: (2-3-2-1); logistic function with a = 1. (a) Momentum: μ = 0.05, α = 0.85. (b) Adaptive momentum: μ = 0.01, α = 0.85, r_i = 1.05, c = 1.05, r_d = 0.7.
MLP: (2-20-20-1)
Decision curve (a) before pruning and (b) after pruning.
Overtraining is another side of the same coin, i.e., the network adapts to the peculiarities of the training set.
Generalized Linear Classifiers

Remember the XOR problem. The mapping
$x \to y = \begin{bmatrix} f(g_1(x)) \\ f(g_2(x)) \end{bmatrix}$
The activation function $f(\cdot)$ transforms the nonlinear task into a linear one.

In the more general case:
• Let $x \in R^l$ and a nonlinear classification task, with functions
$f_i(\cdot): R^l \to R, \quad i = 1, 2, \ldots, k$
Are there any functions $f_i$ and an appropriate k, so that the mapping
$x \to y = \begin{bmatrix} f_1(x) \\ \vdots \\ f_k(x) \end{bmatrix}$
transforms the task into a linear one in the $R^k$ space?

If this is true, then there exists a hyperplane $w \in R^k$ so that:
if $x \in \omega_1$: $w_0 + w^T y > 0$
if $x \in \omega_2$: $w_0 + w^T y < 0$
In such a case this is equivalent to approximating the nonlinear discriminant function g(x) in terms of the $f_i(x)$, i.e.,
$g(x) = w_0 + \sum_{i=1}^{k} w_i f_i(x) \;\gtrless\; 0$
Given the $f_i(x)$, the task of computing the weights is a linear one.

How sensible is this?
• From the numerical analysis point of view, it is justified if the $f_i(x)$ are interpolation functions.
• From the Pattern Recognition point of view, it is justified by Cover's theorem.
Generalized linear classification: g(x) corresponds to a two-layer network where the nodes of the hidden layer have different activation functions $f_i(\cdot)$, i = 1, 2, ..., k, and
$g(x) = w_0 + \sum_{i=1}^{k} w_i f_i(x) \;\gtrless\; 0$
Capacity of the l-dimensional space in linear dichotomies

Assume N points in $R^l$, assumed to be in general position, that is: no subset of l+1 of them lies on an (l−1)-dimensional hyperplane.
Cover's theorem states: The number of groupings that can be formed by (l−1)-dimensional hyperplanes to separate N points into two classes is
$O(N, l) = 2 \sum_{i=0}^{l} \binom{N-1}{i}, \quad \binom{N-1}{i} = \dfrac{(N-1)!}{(N-1-i)! \, i!}$

Example: N = 4, l = 2: O(4, 2) = 14.
Notice: the total number of possible groupings is $2^4 = 16$.
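Cover's count is a one-line sum; this sketch reproduces the N = 4, l = 2 example above:

```python
from math import comb

def n_linear_dichotomies(N, l):
    """O(N, l) = 2 * sum_{i=0}^{l} C(N-1, i): linearly separable dichotomies
    of N points in general position in R^l."""
    return 2 * sum(comb(N - 1, i) for i in range(min(l, N - 1) + 1))

print(n_linear_dichotomies(4, 2))   # the N=4, l=2 example: 14 of 16 groupings
```

Note that for N ≤ l + 1 the sum reaches $2^{N}$, i.e., every grouping is linearly separable.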
The probability of grouping N points in two linearly separable classes is
$P_N^l = \dfrac{O(N, l)}{2^N}$
with $N = r(l + 1)$.
Thus, the probability of having N points in linearly separable classes tends to 1, for large l, provided N < 2(l + 1).

Hence, by mapping to a higher-dimensional space, we increase the probability of linear separability, provided the space is not too densely populated.

POLYNOMIAL CLASSIFIERS

The function g(x) is approximated in terms of up-to-order-r polynomials of the x components, for large enough r. For the special case of r = 2 we have:
$g(x) = w_0 + \sum_{i=1}^{l} w_i x_i + \sum_{i=1}^{l} \sum_{m=i}^{l} w_{im} x_i x_m$
XOR problem
Radial Basis Function Networks (RBF)

Choose
$f_i(x) = \exp\!\left(-\dfrac{\|x - c_i\|^2}{2\sigma_i^2}\right)$
This is equivalent to a single-layer network with RBF activations and a linear output node.
Example: The XOR problem
• Define:
$f_i(x) = \exp(-\|x - c_i\|^2), \quad i = 1, 2$
$c_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad c_2 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$
• The mapping is
$y(x) = \begin{bmatrix} \exp(-\|x - c_1\|^2) \\ \exp(-\|x - c_2\|^2) \end{bmatrix}$
so that
$\begin{bmatrix} 0 \\ 0 \end{bmatrix} \to \begin{bmatrix} 0.135 \\ 1 \end{bmatrix}, \quad \begin{bmatrix} 1 \\ 1 \end{bmatrix} \to \begin{bmatrix} 1 \\ 0.135 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 1 \end{bmatrix} \to \begin{bmatrix} 0.368 \\ 0.368 \end{bmatrix}, \quad \begin{bmatrix} 1 \\ 0 \end{bmatrix} \to \begin{bmatrix} 0.368 \\ 0.368 \end{bmatrix}$
$g(x) = \exp(-\|x - c_1\|^2) + \exp(-\|x - c_2\|^2) - 1 = 0$
$g(y) = y_1 + y_2 - 1 = 0$
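The RBF mapping above can be verified numerically; which side of $g(y) = y_1 + y_2 - 1$ corresponds to which class follows from the mapped values (the (0,0) and (1,1) patterns land on the positive side):

```python
import math

def rbf_map(x, centers=((1, 1), (0, 0))):
    """f_i(x) = exp(-||x - c_i||^2) for the centers c1 = (1,1), c2 = (0,0)."""
    return [math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)))
            for c in centers]

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y1, y2 = rbf_map(x)
    side = 'B' if y1 + y2 - 1 > 0 else 'A'
    print(x, round(y1, 3), round(y2, 3), side)
```

The printed values reproduce 0.135 = exp(−2), 0.368 = exp(−1), and 1 = exp(0), and the two XOR classes fall on opposite sides of the line.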
Training of the RBF networks

• Fixed centers: Choose the centers randomly among the data points; also fix the σi's. Then
$g(x) = w^T y + w_0$
with
$y = \left[ \exp\!\left(-\dfrac{\|x - c_1\|^2}{2\sigma_1^2}\right), \ldots, \exp\!\left(-\dfrac{\|x - c_k\|^2}{2\sigma_k^2}\right) \right]^T$
is a typical linear classifier design.
• Training of the centers: This is a nonlinear optimization task.
• Combine supervised and unsupervised learning procedures: the unsupervised part reveals the clustering tendencies of the data and assigns the centers to the cluster representatives.
FIGURE 5.5 RBF network trained with K-means and RLS algorithms for distance d = −5. The MSE in part (a) of the figure stands for mean-square error. (b) Testing result.
FIGURE 5.6 RBF network trained with K-means and RLS algorithms for distance d = −6.
Universal Approximators

It has been shown that any nonlinear continuous function can be approximated arbitrarily closely, both by a two-layer perceptron with sigmoid activations and by an RBF network, provided a large enough number of nodes is used.

Multilayer Perceptrons vs. RBF networks
• MLPs involve activations of a global nature: all the points on a plane $w^T x = c$ give the same response.
• RBF networks have activations of a local nature, due to the exponential decrease as one moves away from the centers.
• MLPs learn slower but have better generalization properties.
Support Vector Machines: The nonlinear case

Recall that the probability of having linearly separable classes increases as the dimensionality of the feature vectors increases. Assume the mapping:
$x \in R^l \to y \in R^k, \quad k > l$
Then use SVM in $R^k$.

Recall that in this case the dual problem formulation will be
$\max_{\lambda} \left( \sum_i \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j \, y_i y_j \, y_i^T y_j \right)$
where $y_i \in R^k$ denotes the image of $x_i$ and $y_i$ in the product $y_i y_j$ the class labels.
Also, the classifier will be
$g(y) = w^T y + w_0, \quad w = \sum_{i=1}^{N_s} \lambda_i y_i y_i$
where $x \to y \in R^k$.

Thus, inner products in a high-dimensional space are involved, hence:
• high complexity
Something clever: Compute the inner products in the high-dimensional space as functions of inner products performed in the low-dimensional space!

Is this POSSIBLE? Yes. Here is an example:
Let $x = [x_1, x_2]^T \in R^2$ and
$y = \begin{bmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{bmatrix} \in R^3$
Then, it is easy to show that
$y_i^T y_j = (x_i^T x_j)^2$
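The identity above is easy to verify numerically. This sketch checks that the inner product of the mapped vectors equals the squared inner product in $R^2$ (the sample points are arbitrary):

```python
import math

def phi(x):
    """The mapping R^2 -> R^3: [x1^2, sqrt(2) x1 x2, x2^2]."""
    x1, x2 = x
    return [x1 ** 2, math.sqrt(2) * x1 * x2, x2 ** 2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

xi, xj = (0.5, -1.2), (2.0, 0.3)
assert abs(dot(phi(xi), phi(xj)) - dot(xi, xj) ** 2) < 1e-12
```

The inner product in $R^3$ is thus computed as a function of the inner product in $R^2$, which is the essence of the kernel trick.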
Mercer's Theorem

Let $x \to \phi(x) \in H$, where H is a Hilbert space. Then the inner product in H is represented as
$\langle \phi(x), \phi(z) \rangle = \sum_r \phi_r(x) \phi_r(z) = K(x, z)$
provided that
$\int_C \int_C K(x, z) \, g(x) \, g(z) \, dx \, dz \ge 0$
for any g(x), x such that
$\int_C g^2(x) \, dx < +\infty$
where C is a compact (finite) subset of $R^l$. K(x, z) is a symmetric function known as a kernel.
The opposite is also true: any kernel with the above properties corresponds to an inner product in SOME space!

Examples of kernels:
• Polynomial: $K(x, z) = (x^T z + 1)^q, \quad q > 0$
• Radial Basis Functions: $K(x, z) = \exp\!\left(-\dfrac{\|x - z\|^2}{\sigma^2}\right)$
• Hyperbolic Tangent: $K(x, z) = \tanh(\beta \, x^T z + \gamma)$
for appropriate values of β, γ.
SVM Formulation

• Step 1: Choose an appropriate kernel. This implicitly assumes a mapping to a higher-dimensional (yet not known) space.
• Step 2: Solve
$\max_{\lambda} \left( \sum_i \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j \, y_i y_j \, K(x_i, x_j) \right)$
subject to: $0 \le \lambda_i \le C, \quad i = 1, 2, \ldots, N$, and $\sum_i \lambda_i y_i = 0$
This results in an implicit combination
$w = \sum_{i=1}^{N_s} \lambda_i y_i \, \phi(x_i)$
• Step 3: Assign x to $\omega_1$ ($\omega_2$) if
$g(x) = \sum_{i=1}^{N_s} \lambda_i y_i \, K(x_i, x) + w_0 \; > \; (<) \; 0$

• The SVM Architecture
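The resulting discriminant is a kernel expansion over the support vectors. A minimal sketch with a Gaussian RBF kernel; the support vectors, multipliers, and $w_0$ below are illustrative stand-ins, not a trained model:

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / sigma^2)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / sigma ** 2)

def svm_decision(x, svs, lams, labels, w0, kernel=rbf_kernel):
    """g(x) = sum_i lambda_i y_i K(x_i, x) + w0 over the support vectors."""
    return sum(l * y * kernel(s, x) for s, l, y in zip(svs, lams, labels)) + w0

svs = [(0.0, 0.0), (1.0, 1.0)]     # support vectors (illustrative)
lams = [1.0, 1.0]                  # Lagrange multipliers (illustrative)
labels = [+1, -1]                  # class labels y_i
g = svm_decision((0.1, 0.0), svs, lams, labels, w0=0.0)
print('class 1' if g > 0 else 'class 2')
```

Note that only the kernel evaluations $K(x_i, x)$ in the low-dimensional space are needed; the mapping $\phi$ never appears explicitly.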
Example of a nonlinear SVM classifier for the case of two nonlinearly separable classes. The Gaussian RBF kernel was used. Dotted lines mark the margin and circled points the support vectors.
(a) Training result
(a) Testing result
Decision Trees

This is a family of non-linear classifiers. They are multistage decision systems, in which classes are sequentially rejected until a finally accepted class is reached. To this end:
• The feature space is split into unique regions in a sequential manner.
• Upon the arrival of a feature vector, sequential decisions, assigning features to specific regions, are performed along a path of nodes of an appropriately constructed tree.
• The sequence of decisions is applied to individual features, and the queries performed in each node are of the type:
is feature $x_i \le \alpha$?
where α is a pre-chosen (during training) threshold.
The figures below are such examples. This type of tree is known as an Ordinary Binary Classification Tree (OBCT). The decision hyperplanes, splitting the space into regions, are parallel to the axes of the space. Other types of partition are also possible, yet less popular.
Design elements that define a decision tree:
• Each node, t, is associated with a subset $X_t \subseteq X$, where X is the training set. At each node, $X_t$ is split into two (binary splits) disjoint descendant subsets $X_{t,Y}$ and $X_{t,N}$, where
$X_{t,Y} \cap X_{t,N} = \emptyset$
$X_{t,Y} \cup X_{t,N} = X_t$
$X_{t,Y}$ is the subset of $X_t$ for which the answer to the query at node t is YES; $X_{t,N}$ is the subset corresponding to NO. The split is decided according to an adopted question (query).
• A splitting criterion must be adopted for the best split of $X_t$ into $X_{t,Y}$ and $X_{t,N}$.
• A stop-splitting criterion must be adopted that controls the growth of the tree, at which point a node is declared terminal (a leaf).
• A rule is required that assigns each leaf to a class.

Set of questions: In OBCT trees the set of questions is of the type
is $x_i \le \alpha$?
The choice of the specific $x_i$ and of the value of the threshold α, for each node t, are the results of searching, during training, among the features and a set of possible threshold values. The final combination is the one that results in the best value of a criterion.
Splitting criterion: The main idea behind splitting at each node is that the resulting descendant subsets $X_{t,Y}$ and $X_{t,N}$ should be more class-homogeneous compared to $X_t$. Thus the criterion must be in harmony with such a goal. A commonly used criterion is the node impurity:
$I(t) = -\sum_{i=1}^{M} P(\omega_i \mid t) \log_2 P(\omega_i \mid t)$
where
$P(\omega_i \mid t) = \dfrac{N_t^i}{N_t}$
and $N_t^i$ is the number of data points in $X_t$ that belong to class $\omega_i$. The decrease in node impurity is defined as:
$\Delta I(t) = I(t) - \dfrac{N_{t,Y}}{N_t} I(t_Y) - \dfrac{N_{t,N}}{N_t} I(t_N)$

Note: I(t) is the entropy associated with the subset $X_t$.
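The impurity and its decrease can be computed from per-class counts. A minimal sketch, with illustrative counts (8 points of each of two classes, and a perfect split):

```python
import math

def impurity(counts):
    """Entropy impurity I(t) = -sum_i P(w_i|t) log2 P(w_i|t),
    with P(w_i|t) estimated as N_t^i / N_t."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def impurity_decrease(counts_t, counts_y, counts_n):
    """Delta I(t) = I(t) - (N_tY/N_t) I(t_Y) - (N_tN/N_t) I(t_N)."""
    n_t, n_y, n_n = sum(counts_t), sum(counts_y), sum(counts_n)
    return (impurity(counts_t)
            - (n_y / n_t) * impurity(counts_y)
            - (n_n / n_t) * impurity(counts_n))

# Equiprobable classes give the maximum impurity (1 bit for two classes);
# a perfect split drops it to 0, so the decrease is maximal.
print(impurity([8, 8]), impurity_decrease([8, 8], [8, 0], [0, 8]))
```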
• The goal is to choose, in each node, the parameters (feature and threshold) that result in a split with the highest decrease in impurity.
• Why the highest decrease? Observe that the highest value of I(t) is achieved if all classes are equiprobable, i.e., $X_t$ is the least homogeneous.

Stop-splitting rule: Adopt a threshold T and stop splitting a node (i.e., declare it a leaf) if the impurity decrease is less than T; that is, node t is already "pure enough".

Class assignment rule: Assign a leaf to a class $\omega_j$, where
$j = \arg\max_i P(\omega_i \mid t)$
Summary of an OBCT algorithmic scheme:
Remarks:
• A variety of node impurity measures can be defined.
• A critical factor in the design is the size of the tree. Usually one grows a tree to a large size and then applies various pruning techniques.
• Decision trees belong to the class of unstable classifiers. This can be overcome by a number of "averaging" techniques. Bagging is a popular one: using bootstrap techniques on X, various trees $T_i$, i = 1, 2, ..., B, are constructed from B variants $X_1, X_2, \ldots, X_B$ of the training set. The decision is then taken according to a majority voting rule.
• More general partitions of the feature space, via hyperplanes not parallel to the axes, are possible via questions of the type:
is $\sum_{k=1}^{l} c_k x_k \le \alpha$?
Combining Classifiers
The basic philosophy behind the combination of different classifiers lies in the fact that even the "best" classifier fails on some patterns that other classifiers may classify correctly. Combining classifiers aims at exploiting this complementary information residing in the various classifiers.

Thus, one designs different optimal classifiers and then combines the results with a specific rule.
Assume that each of the, say, L designed classifiers provides at its output the posterior probabilities:

$P(\omega_i \mid x), \; i = 1, 2, \ldots, M$
![Page 77: Ch 4: Non Linear Classifiers](https://reader034.vdocuments.us/reader034/viewer/2022050512/6271d54d5c327a08724a756e/html5/thumbnails/77.jpg)
77
Product Rule: Assign $x$ to the class $\omega_k$:

$k = \arg\max_i \prod_{j=1}^{L} P_j(\omega_i \mid x)$

where $P_j(\omega_i \mid x)$ is the respective posterior probability of the jth classifier.

(Proof by minimizing the average Kullback–Leibler (KL) distance between probabilities, employing Lagrange multiplier optimization.)

Sum Rule: Assign $x$ to the class $\omega_k$:

$k = \arg\max_i \sum_{j=1}^{L} P_j(\omega_i \mid x)$

(Using the alternative KL distance formulation.)
Although the product rule often produces better results than
the sum rule, it may lead to less reliable results when the
outputs of some of the classifiers result in values close to zero.
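A small numerical sketch of the two rules (the posterior matrix is invented for illustration) shows exactly this effect: a single near-zero output from one classifier can veto a class under the product rule.

```python
import numpy as np

# Rows: classifiers j = 1..L; columns: classes w_1..w_M.
# Entry (j, i) holds the posterior P_j(w_i | x) reported by classifier j.
P = np.array([[0.90, 0.10, 0.00],
              [0.80, 0.20, 0.00],
              [0.01, 0.50, 0.49]])

product_rule = np.argmax(np.prod(P, axis=0))  # argmax_i prod_j P_j(w_i|x)
sum_rule = np.argmax(np.sum(P, axis=0))       # argmax_i sum_j P_j(w_i|x)
print(product_rule, sum_rule)  # 1 0: the 0.01 output vetoes w_1 in the product
```

Two of the three classifiers strongly favor the first class, and the sum rule follows them; the product rule is dominated by the third classifier's near-zero vote.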
![Page 78: Ch 4: Non Linear Classifiers](https://reader034.vdocuments.us/reader034/viewer/2022050512/6271d54d5c327a08724a756e/html5/thumbnails/78.jpg)
78
Majority Voting Rule: Assign x to the class for which there is a consensus, or when at least $l_c$ of the classifiers agree on the class label of x, where:

$l_c = \begin{cases} \dfrac{L}{2} + 1, & L \text{ even} \\[4pt] \dfrac{L+1}{2}, & L \text{ odd} \end{cases}$

Otherwise the decision is rejection, that is, no decision is taken.

Thus, a correct decision is made if the majority of the classifiers agree on the correct label, and a wrong decision if the majority agrees on the wrong label.
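The rule above can be sketched in a few lines (function name is illustrative; note that integer division gives $l_c$ for both even and odd L):

```python
from collections import Counter

def majority_vote(labels):
    """Majority voting with rejection: return the winning label only if
    at least l_c classifiers agree, otherwise None (no decision)."""
    L = len(labels)
    lc = L // 2 + 1  # equals L/2 + 1 for even L and (L+1)/2 for odd L
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= lc else None

print(majority_vote(['A', 'A', 'B']))       # A    (2 >= l_c = 2)
print(majority_vote(['A', 'B', 'C', 'B']))  # None (2 <  l_c = 3, reject)
```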
![Page 79: Ch 4: Non Linear Classifiers](https://reader034.vdocuments.us/reader034/viewer/2022050512/6271d54d5c327a08724a756e/html5/thumbnails/79.jpg)
79
Dependent or not Dependent classifiers?
Although there are no general theoretical results, experimental evidence has shown that the more independent in their decisions the classifiers are, the higher the expectation should be for obtaining improved results after combination. However, there is no guarantee that combining classifiers results in better performance compared to the "best" one among the classifiers.
Towards Independence: A number of Scenarios.
Train the individual classifiers using different training data points. To this end, choose among a number of possibilities:

Bootstrapping: This is a popular technique to combine unstable classifiers such as decision trees (Bagging belongs to this category of combination).
![Page 80: Ch 4: Non Linear Classifiers](https://reader034.vdocuments.us/reader034/viewer/2022050512/6271d54d5c327a08724a756e/html5/thumbnails/80.jpg)
80
Stacking: Train the combiner with data points that have been excluded from the set used to train the individual classifiers.

Use different subspaces to train individual classifiers: According to the method, each individual classifier operates in a different feature subspace. That is, use different features for each classifier.
Remarks:
The majority voting and the summation schemes rank among the most popular combination schemes.

Training individual classifiers in different subspaces seems to lead to substantially better improvements compared to classifiers operating in the same subspace.

Besides the above three rules, other alternatives are also possible, such as using the median value of the outputs of the individual classifiers.
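The median alternative can be sketched in the same style as the other rules (the posterior values are invented for illustration); the median is robust to a single outlier classifier:

```python
import numpy as np

# Rows: classifiers; columns: classes. The second classifier is an outlier.
P = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.6, 0.4]])

median_rule = np.argmax(np.median(P, axis=0))  # class medians: [0.6, 0.4]
print(median_rule)  # 0
```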
![Page 81: Ch 4: Non Linear Classifiers](https://reader034.vdocuments.us/reader034/viewer/2022050512/6271d54d5c327a08724a756e/html5/thumbnails/81.jpg)
81
The Boosting Approach

The origins: Is it possible for a weak learning algorithm (one that performs slightly better than random guessing) to be boosted into a strong algorithm? (Valiant, 1984).
The procedure to achieve it:
• Adopt a weak classifier known as the base classifier.
• Employing the base classifier, design a series of classifiers in a hierarchical fashion, each time employing a different weighting of the training samples. Emphasis in the weighting is given to the hardest samples, i.e., the ones that keep "failing".

• Combine the hierarchically designed classifiers by a weighted average procedure.
![Page 82: Ch 4: Non Linear Classifiers](https://reader034.vdocuments.us/reader034/viewer/2022050512/6271d54d5c327a08724a756e/html5/thumbnails/82.jpg)
82
The AdaBoost (Adaptive Boosting) Algorithm. Let the training data be $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ with $y_i \in \{-1, 1\}$, $i = 1, 2, \ldots, N$.
Construct an optimally designed classifier of the form:

$f(x) = \operatorname{sign}(F(x))$

where:

$F(x) = \sum_{k=1}^{K} a_k \phi(x; \theta_k)$

Here $\phi(x; \theta)$ denotes the base classifier, which returns a binary class label: $\phi(x; \theta) \in \{-1, 1\}$. $\theta$ is a parameter vector.
![Page 83: Ch 4: Non Linear Classifiers](https://reader034.vdocuments.us/reader034/viewer/2022050512/6271d54d5c327a08724a756e/html5/thumbnails/83.jpg)
83
The essence of the method. Design the series of classifiers:

$\phi(x; \theta_1), \phi(x; \theta_2), \ldots, \phi(x; \theta_K)$

The parameters

$a_k, \theta_k, \; k = 1, 2, \ldots, K$

are optimally computed so as to minimize the error rate on the training set. Each time, the training samples are re-weighted so that the weight of each sample depends on its history. Hard samples that "insist" on failing to be predicted correctly by the previously designed classifiers are more heavily weighted.

Optimizing (highly complex):

$\arg\min_{a_k;\, \theta_k,\; k: 1, K} \; \sum_{i=1}^{N} \exp\bigl(-y_i F(x_i)\bigr)$

This cost penalizes the samples that are wrongly classified much more heavily than those correctly classified.
![Page 84: Ch 4: Non Linear Classifiers](https://reader034.vdocuments.us/reader034/viewer/2022050512/6271d54d5c327a08724a756e/html5/thumbnails/84.jpg)
84
Updating the weights for each sample $x_i, \; i = 1, 2, \ldots, N$:

$w_i^{(m+1)} = \dfrac{w_i^{(m)} \exp\bigl(-y_i a_m \phi(x_i; \theta_m)\bigr)}{Z_m}$

$Z_m$ is a normalizing factor common to all samples:

$Z_m = \sum_{i=1}^{N} w_i^{(m)} \exp\bigl(-y_i a_m \phi(x_i; \theta_m)\bigr)$

The stage weight is

$a_m = \dfrac{1}{2} \ln \dfrac{1 - P_m}{P_m}$

where $P_m < 0.5$ (by assumption) is the error rate of the optimal classifier at stage m. Thus $a_m > 0$.

The term $\exp\bigl(-y_i a_m \phi(x_i; \theta_m)\bigr)$ takes a large value if $y_i \phi(x_i; \theta_m) < 0$ (wrong classification) and a small value in the case of correct classification ($y_i \phi(x_i; \theta_m) > 0$).

The update equation is of a multiplicative nature. That is, successive large values of weights (hard samples) result in a larger weight for the next iteration.
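Putting the pieces together, here is a minimal AdaBoost sketch using decision stumps as base classifiers (all names are illustrative, and the stump search is brute force for clarity, not efficiency):

```python
import numpy as np

def stump_predict(X, j, thr, sign):
    """Base classifier phi(x; theta): +/-1 according to feature j vs. thr."""
    return sign * np.where(X[:, j] > thr, 1, -1)

def adaboost(X, y, K=10):
    """Train K stages; y must be in {-1, +1}. Returns (a_m, theta_m) pairs."""
    N = len(y)
    w = np.full(N, 1.0 / N)                         # w_i^(1): uniform weights
    stages = []
    for _ in range(K):
        # Pick the stump with the smallest weighted error P_m.
        Pm, j, thr, s = min(
            ((w[stump_predict(X, j, thr, s) != y].sum(), j, thr, s)
             for j in range(X.shape[1])
             for thr in X[:, j]
             for s in (+1, -1)),
            key=lambda t: t[0])
        if Pm >= 0.5:                               # no longer a weak learner
            break
        a = 0.5 * np.log((1 - Pm) / max(Pm, 1e-12))  # a_m = (1/2) ln((1-Pm)/Pm)
        w = w * np.exp(-y * a * stump_predict(X, j, thr, s))  # re-weight
        w /= w.sum()                                # divide by Z_m
        stages.append((a, j, thr, s))
    return stages

def predict(stages, X):
    F = sum(a * stump_predict(X, j, thr, s) for a, j, thr, s in stages)
    return np.sign(F)                               # f(x) = sign(F(x))

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
stages = adaboost(X, y)
print(predict(stages, X))  # [-1. -1.  1.  1.]
```

Misclassified samples keep their weights growing multiplicatively, so later stages concentrate on them, exactly as in the update equation above.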
![Page 85: Ch 4: Non Linear Classifiers](https://reader034.vdocuments.us/reader034/viewer/2022050512/6271d54d5c327a08724a756e/html5/thumbnails/85.jpg)
85
• The algorithm
![Page 86: Ch 4: Non Linear Classifiers](https://reader034.vdocuments.us/reader034/viewer/2022050512/6271d54d5c327a08724a756e/html5/thumbnails/86.jpg)
86
Remarks:
• The training error rate tends to zero after a few iterations. The test error levels off at some value.
• AdaBoost is greedy in reducing the margin that samples leave from the decision surface.
The margin lies in the interval $[-1, 1]$ and is positive if and only if the respective pattern is classified correctly.