Nature Inspired Learning: Classification and Prediction Algorithms
Šarūnas Raudys
Computational Intelligence Group
Department of Informatics
Vilnius University, Lithuania
e-mail: sarunas@raudys.com
Juodkrante, 2009 05 22
2
[Figure: four two-class scatter plots, both axes from -3 to 3]
Nature inspired learning: statics and dynamics.
Accuracy and the relations between sample size and complexity, plus learning rapidity, become very important issues.
Statics: W = S^(-1)(M1 - M2). Dynamics: the perceptron.
3
4
Nature inspired learning
The non-linear single layer perceptron (SLP) - a main element of ANN theory
[Diagram: inputs x1, x2, …, xp → weighted sum → nonlinearity → output y]
5
Nature inspired learning
TRAINING THE SINGLE LAYER PERCEPTRON - OUTLINE
[Figure: a plot of 300 bivariate vectors (dots and pluses) sampled from two Gaussian pattern classes, the linear decision boundary, and the START and FINISH weight positions]
Three tasks:
- CLASSIFICATION,
- CLUSTERING, if target2 = target1,
- minimization of deviations.
6
CLASSIFICATION
The two-category case (the multi-category case will also be discussed).
1. Cost function and training of the SLP used for classification.
2. When to stop training?
3. Seven types of classifiers obtained while training the SLP:
1. Euclidean distance (only the means),
2. Regularized,
3. Fisher, or
4. Fisher with pseudo-inversion of the covariance matrix,
5. Robust,
6. Minimal empirical error,
7. Support vector (maximal margin).
How to train the SLP in the best way?
7
Nature inspired learning
Training the non-linear SLP
[Diagram: inputs X = (x1, x2, …, xp) → weighted sum → nonlinearity → output y]
Training data: N rows of inputs x1, x2, …, xp with the desired output y.
o = f(net), net = V^T X + v0, where f(net) is a non-linear activation function, e.g. a sigmoid function: f(net) = 1/(1 + e^(-net)) = f_sigmoid(net), and v0, V^T = (v1, v2, ..., vp) are the weights of the discriminant function (DF).
STANDARD
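As a minimal sketch of this forward pass (Python/NumPy; the function and variable names are illustrative, not from the talk):

import numpy as np

def slp_output(X, V, v0):
    # Non-linear SLP: o = f(net), net = V^T x + v0, with a sigmoid f
    net = X @ V + v0                     # weighted sum, one value per input row
    return 1.0 / (1.0 + np.exp(-net))    # sigmoid activation f(net)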
8
TRAINING THE SINGLE LAYER PERCEPTRON BASED CLASSIFIER
Training data: N rows of inputs x1, x2, …, xp with the desired output y.
o = f(V^T X + v0), where f(net) is a non-linear activation function, and v0, V^T = (v1, v2, ..., vp) are the weights.
Cost function (Amari, 1967; Tsypkin, 1966):
C = (1/N) Σ_j (y_j - f(V^T X_j + v0))²
Training: V_{t+1} = V_t - η × gradient,
where η is a learning step parameter and y_j is the training signal (desired output).
[Figure: the training trajectory from V(0) to V(FINISH), the minimum of the cost function, vs. a true (unknown) minimum; the optimal stopping rule]
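A sketch of this batch gradient-descent rule with the sigmoid activation (all names illustrative, not from the talk):

import numpy as np

def train_slp(X, y, eta=0.1, epochs=100):
    # Batch gradient descent for C = (1/N) * sum_j (y_j - f(V^T X_j + v0))^2
    N, p = X.shape
    V, v0 = np.zeros(p), 0.0                        # start from zero weights
    for _ in range(epochs):
        o = 1.0 / (1.0 + np.exp(-(X @ V + v0)))     # outputs f(net_j)
        delta = -2.0 / N * (y - o) * o * (1.0 - o)  # dC/dnet_j (sigmoid derivative)
        V -= eta * (X.T @ delta)                    # V_{t+1} = V_t - eta * dC/dV
        v0 -= eta * delta.sum()                     # same rule for the bias weight
    return V, v0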
9
Training the Non-linear Single Layer Perceptron
V_{t+1} = V_t - η × gradient, computed on the training data.
[Figure: the true cost landscape vs. the training-data landscape; the trajectory from start to finish passes near V_ideal - hence optimal stopping]
10
V_{t+1} = V_t - η × gradient. Early stopping vs. late stopping.
V_opt = α_opt · V_start + (1 - α_opt) · V_finish,
where α_opt = σ²_finish / (σ²_start + σ²_finish) (Raudys & Amari, 1998).
A general principle.
[Figure: accuracy vs. stopping point; the majority, who stopped too late, are here]
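A minimal sketch of this weight averaging, assuming the two variances σ²_start and σ²_finish are estimated elsewhere (names illustrative):

def optimally_stopped_weights(V_start, V_finish, var_start, var_finish):
    # V_opt = a * V_start + (1 - a) * V_finish,
    # with a = var_finish / (var_start + var_finish)
    a = var_finish / (var_start + var_finish)
    return a * V_start + (1.0 - a) * V_finish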
11
Where to use early stopping? Knowledge discovery in very large databases.
Nature inspired learning
Data Set 1 → Data Set 2 → Data Set 3 → …
In order to preserve the previously learned information: train, however, stop training early!
12
Standard sum of squares cost function = standard regression
C = (1/N) Σ_j (y_j - f(V^T X_j + v0))².
We assume the data are normalized: mean(X) = 0, mean(y) = 0, standard deviations = 1 (so covariances equal correlations).
Let the correlations between the input variables x1, x2, …, xp be zero.
Then the components of the vector V will be proportional to the correlations between x1, x2, …, xp and y.
We may obtain such a regression after the first iteration of the gradient descent training algorithm V_{t+1} = V_t - η × gradient.
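A sketch illustrating this claim, with a linear output for clarity (synthetic data; all names illustrative): after one batch step from zero weights on normalized data, the weights are proportional to the input-output correlations.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = X @ np.array([0.8, 0.4, 0.0, -0.3, 0.1]) + 0.1 * rng.standard_normal(1000)

X = (X - X.mean(0)) / X.std(0)              # zero means, unit standard deviations
y = (y - y.mean()) / y.std()

eta, V = 0.5, np.zeros(5)                   # start from zero weights
grad = -2.0 / len(y) * X.T @ (y - X @ V)    # gradient of C at V = 0
V1 = V - eta * grad                         # weights after the first iteration

corr = X.T @ y / len(y)                     # correlations between each x_i and y
print(V1 / corr)                            # constant ratio: V1 is proportional to corr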
13
SLP AS SIX REGRESSIONS
START
14
Nature inspired learning. Robust regression
[Figure: the square loss (y_j - V^T X_j)² vs. a saturating "robust" loss, plotted against the residual y_j - V^T X_j]
In order to obtain robust regression, instead of the square function we have to use a "robust function".
Š. Raudys (2000). Evolution and generalization of a single neurone. III. Primitive, regularized, standard, robust and minimax regressions. Neural Networks, 13(3/4), pp. 507-523.
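The slide does not specify which robust function was used; a sketch with the Huber function as one common choice (names illustrative):

import numpy as np

def huber_gradient_step(V, X, y, eta=0.01, c=1.345):
    # One descent step for C = (1/N) * sum_j rho(y_j - V^T X_j), where
    # rho(r) = r^2/2 for |r| <= c and c*|r| - c^2/2 otherwise, so large
    # residuals (outliers) have only a bounded influence on the fit.
    r = y - X @ V
    psi = np.clip(r, -c, c)                # rho'(r) saturates at +/- c
    return V + eta * (X.T @ psi) / len(y)  # V - eta * dC/dV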
15
[Figure: two ECG traces over 5000 samples: the recorded maternal signal containing the fetal component, and the extracted fetal signal]
Mother and fetus ("baby") ECG: two signals. Result: the fetus signal.
A real-world problem: robust regression is used to distinguish the very weak baby signal from the mother's ECG.
Robust regression pays attention to the smallest deviations, not to the largest ones, which are considered outliers.
16
Nature inspired learning. Standard and regularized regression
Use "statistical methods" to perform diverse whitening data transformations, where the input variables x1, x2, …, xp are decorrelated and scaled in order to have the same variances. Then, while training the perceptron in the transformed feature space, we can obtain standard regression after the very first iteration.
X_new = T X_old, T = Λ^(-1/2) Φ^T, where S_XX = Φ Λ Φ^T is a singular value decomposition of the covariance matrix S_XX.
The data are normalized (mean(X) = 0, mean(y) = 0, unit standard deviations) and V_start = 0.
If S_XX → S_XX + λI, we obtain regularized regression. Moreover, we can equalize the eigenvalues and speed up the training process (faster convergence).
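A sketch of this whitening transformation (names illustrative; reg > 0 corresponds to replacing S_XX by S_XX + λI):

import numpy as np

def whitening_transform(X, reg=0.0):
    # T = Lambda^(-1/2) Phi^T from the decomposition S_XX = Phi Lambda Phi^T;
    # assumes S_XX + reg*I is positive definite.
    Xc = X - X.mean(0)                       # zero means
    S = Xc.T @ Xc / len(Xc)                  # covariance matrix S_XX
    lam, Phi = np.linalg.eigh(S + reg * np.eye(S.shape[1]))
    T = np.diag(lam ** -0.5) @ Phi.T         # decorrelate, then scale to unit variance
    return Xc @ T.T, T                       # X_new = T X_old, applied row-wise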
17
SLP AS SEVEN STATISTICAL CLASSIFIERS
START: the simplest classifier.
[Figure: evolution of the decision boundary from small weights to large weights during training]
18
Nature inspired learning
Conditions to obtain the Euclidean distance classifier just after the first iteration: E1) the centre M = (M1 + M2)/2 is moved to the zero point, E2) training begins from zero weights, E3) the target t2 = -t1 N1/N2, E4) total gradient (batch mode) training is used.
When we train further, we have regularized discriminant analysis (RDA):
V_{t+1} = (λ_t I + S)^(-1) (M1 - M2), λ_t = 2/((t-1)η),
where λ_t is the regularization parameter; λ_t → 0 with an increase in the number of training iterations, yielding the Fisher classifier, or the Fisher classifier with pseudo-inversion of the covariance matrix.
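A sketch of this family of weight vectors (illustrative names; λ large ≈ the Euclidean distance classifier, λ → 0 gives the Fisher classifier, and pinv covers the pseudo-inversion case):

import numpy as np

def rda_weights(M1, M2, S, lam):
    # V = (lam*I + S)^(-1) (M1 - M2): the RDA family traced out while training
    p = len(M1)
    return np.linalg.solve(lam * np.eye(p) + S, M1 - M2)

def fisher_weights(M1, M2, S):
    # The lam -> 0 limit; pinv handles a singular covariance estimate
    return np.linalg.pinv(S) @ (M1 - M2)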
19
Nature inspired learning. Standard approach
Use the diversity of "statistical methods and multivariate models" in order to obtain an efficient estimate of the covariance matrix. Then perform whitening data transformations, where the input variables are decorrelated and scaled in order to have the same variances.
While training the perceptron in the transformed feature space, we can obtain the Euclidean distance classifier after the first iteration. In the original feature space this corresponds to the Fisher classifier, or to a modification of it (depending on the method used to estimate the covariance matrix).
[Figure: untransformed vs. transformed data]
The Euclidean classifier in the transformed space = the Fisher classifier in the original space.
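A sketch verifying this equivalence on synthetic data (all names illustrative): the Euclidean distance classifier (difference of class means) computed on whitened data, mapped back, coincides with the Fisher direction S^(-1)(M1 - M2).

import numpy as np

rng = np.random.default_rng(1)
X1 = rng.standard_normal((200, 4)) + 1.0       # class 1 sample
X2 = rng.standard_normal((200, 4)) - 1.0       # class 2 sample

pooled = np.vstack([X1 - X1.mean(0), X2 - X2.mean(0)])
S = pooled.T @ pooled / len(pooled)            # pooled covariance estimate
lam, Phi = np.linalg.eigh(S)
T = np.diag(lam ** -0.5) @ Phi.T               # whitening matrix

V_edc = T @ (X1.mean(0) - X2.mean(0))          # Euclidean classifier in whitened space
V_back = T.T @ V_edc                           # its weights in the original space

V_fisher = np.linalg.solve(S, X1.mean(0) - X2.mean(0))
print(np.allclose(V_back, V_fisher))           # True: the very same classifier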
20
Nature inspired learning
Generalisation errors. EDC, Fisher and Quadratic classifiers
Table 1. Learning quantity - the ratio κ = E{P_N}/P_∞ of the expected to the asymptotic classification error - of the Euclidean distance (EDC), Fisher (LDF) and quadratic (QDF) classifiers versus the training set size N, for dimensionality n = 50 and five values of the distance δ between the classes (from Raudys and Pikelis, 1980). The five columns in each group correspond to δ = 1.68, 2.56, 3.76, 4.65, 5.50, i.e. asymptotic errors P_∞ = 0.2, 0.1, 0.03, 0.01, 0.003.

N     | EDC                      | Fisher LDF               | QDF
8     | 1.82 2.34 3.09 3.66 4.22 |                          |
12    | 1.70 2.03 2.41 2.65 2.87 |                          |
20    | 1.54 1.70 1.84 1.92 1.99 |                          |
30    | 1.43 1.50 1.55 1.58 1.61 | 2.05 3.39 8.40 19.7 52.0 |
50*   | 1.30 1.32 1.33 1.34 1.35 | 1.62 2.15 3.61 5.95 10.6 | 2.21 3.25 7.87 18.3 40.6
100   | 1.18 1.17 1.16 1.16 1.17 | 1.33 1.51 1.93 2.47 3.27 | 2.13 3.12 7.10 13.1 25.1
250   | 1.08 1.07 1.06 1.06 1.06 | 1.14 1.19 1.31 1.44 1.61 | 1.81 2.35 3.23 4.03 5.05
500   | 1.04 1.03 1.03 1.03 1.03 | 1.07 1.09 1.15 1.20 1.27 | 1.58 1.78 2.01 2.18 2.35
1000  | 1.02 1.02 1.02 1.02 1.02 | 1.04 1.05 1.07 1.10 1.13 | 1.37 1.42 1.47 1.51 1.56
2500  | 1.01 1.01 1.01 1.01 1.01 | 1.01 1.02 1.03 1.04 1.05 | 1.18 1.16 1.18 1.18 1.20

*) N = 80 for the QDF.
21
S. Raudys, M. Iwamura. Structures of covariance matrix in handwritten character recognition. Lecture Notes in Computer Science, 3138, pp. 725-733, 2004.
S. Raudys, A. Saudargiene. First-order tree-type dependence between variables and classification performance. IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI-23(2), pp. 233-239, 2001.
A real-world problem (196-dimensional data): dozens of ways are used to estimate the covariance matrix and perform the whitening data transformation. This is "additional information" (if correct) that can be useful in SLP training.
22
Covariance matrices are different.
[Figure: decision boundaries of the EDC, Fisher (F), quadratic (Q) and Anderson-Bahadur (AB) linear discriminant functions; AB and F are different]
If we started with the AB decision boundary instead of the Fisher one, the result would be better. Hence, we have proposed a special method of input data transformation.
S. Raudys (2004). Integration of statistical and neural methods to design classifiers in case of unequal covariance matrices. Lecture Notes in Artificial Intelligence, Springer-Verlag, Vol. 3238, pp. 270-280.
23
Non-linear discrimination. Similarity features. LNCS 3686, pp. 136-145, 2005.
[Figure: panels a-d]
100+100 2-D two-class training vectors (pluses and circles) and the decision boundaries of Kernel Discriminant Analysis (a), the SV classifier (b), and the SLP trained in a 200-D dissimilarity feature space (c). Learning curve (d): the generalization error of the SLP classifier as a function of the number of training epochs, with the optimal stopping point marked.
24
Nature inspired learning. Noise injection
A "coloured" noise is used to form a pseudo-validation set: we add noise in the directions of the closest training vectors, so we almost do not distort the "geometry of the data".
In this technique we use "additional information": the space between neighboring points in a multidimensional feature space is not empty - it is filled by vectors of the same class.
The pseudo-validation data set is used to realize early stopping.
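A sketch of one way to realize such a pseudo-validation set (the exact noise model of the talk may differ; apply the function to each class separately so that neighbours share the class label):

import numpy as np

def pseudo_validation_set(X, noise_scale=0.5, seed=0):
    # "Coloured" noise: shift each vector towards its nearest neighbour,
    # i.e. add noise only along directions where other same-class vectors
    # lie, so the geometry of the data is almost preserved.
    rng = np.random.default_rng(seed)
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    np.fill_diagonal(d, np.inf)                         # exclude self-distance
    nn = X[d.argmin(axis=1)]                            # nearest neighbours
    a = noise_scale * rng.random((len(X), 1))           # random step sizes
    return X + a * (nn - X)                             # move towards neighbour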
25
Nature inspired learning. Multi-category cases
[Figure: two 2-D plots of three pattern classes with pair-wise decision boundaries; regions labelled by the competing class pairs, with reference points A, B, C, O]
Pair-wise classifiers: optimally stopped (+ noise injection) SLPs + H-T fusion. We need to obtain a classifier (SLP) of optimal complexity: early stopping.
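A sketch of the pair-wise decomposition, with simple majority voting standing in for the H-T fusion mentioned on the slide (names illustrative):

import numpy as np

def pairwise_fusion(classifiers, X, n_classes):
    # classifiers[(i, j)] is a trained two-class rule (e.g. an optimally
    # stopped SLP) returning a boolean array: True where class i beats j.
    votes = np.zeros((len(X), n_classes))
    for (i, j), clf in classifiers.items():
        wins_i = clf(X)
        votes[:, i] += wins_i                # vote for class i
        votes[:, j] += ~wins_i               # otherwise vote for class j
    return votes.argmax(axis=1)              # class with the most pair-wise wins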
26
Learning Rapidity. Two Pattern Recognition (PR) tasks
[Figure: the two 2-D two-class PR tasks and the SLP transfer function]
The time to learn the second task is restricted, say to 300 training epochs.
Parameters that affect learning rapidity:
η - the learning step - and the growth of the weights;
s = target1 - target2;
+ regularization: a) a weight decay term, b) a noise injection into the input vectors, c) a corruption of the targets.
The scaling of the starting weights, W_start → k·W_start, also controls learning rapidity.
So the key parameters are η, s, and k.
27
Optimal values of learning parameters
[Figure: the number of epochs needed as a function of the difference between the targets s = target1 - target2 (curves 1, 2, 3, 1a, 2a, 3a), of the learning step η (log scale), and of the weights magnitude]
28
Collective learning. A lengthy sequence of diverse PR tasks
[Figure: a sequence of rotated two-class PR tasks; rotation angle vs. recognition task changes]
The angle and/or the time between two changes vary all the time.
29
The multi-agent system composed of adaptive agents - single layer perceptrons
In order to survive, the agents should learn rapidly.
Unsuccessful agents are replaced by newborn ones. Inside a group the agents help each other.
In a case of emergency, they help the weakest groups.
Genetic learning is combined with adaptive learning.
The moral: a single agent (SLP) cannot learn a very long sequence of PR tasks successfully.
30
The power of the PR task changes and the parameter s as a function of time
[Figure: rotation angle "theta max" vs. PR task changes; stimulation s = t1 - t2 vs. PR task changes]
I tried to learn: s, "emotions", "altruism", the noise intensity, the length of the learning set, etc.
s follows the variation in the power of the changes.
31
Integrating Statistical Methods and Neural Networks. Nature inspired learning
The theory for the equal covariance matrix case.
The theory for unequal covariance matrices and multi-category cases: LNCS 4432, pp. 1-10, 2007; LNCS 4472, pp. 62-71, 2007; LNCS 4142, pp. 47-56, 2006; LNAI 3238, pp. 270-280, 2004.
Regression: Neural Networks, 13(3/4), pp. 507-523, 2000; JMLR; ICNC'08.
32
33