Nature Inspired Learning: Classification and Prediction Algorithms
Šarūnas Raudys
Computational Intelligence Group
Department of Informatics
Vilnius University, Lithuania
e-mail: sarunas@raudys.com
Juodkrante, 2009 05 22
2
[Figure: four two-class scatter plots, both axes from -3 to 3]
Nature inspired learning: statics and dynamics.
Accuracy and the relations between sample size and complexity, plus learning rapidity, become very important issues.
Statics: W = S^(-1)(M1 - M2). Dynamics: the perceptron.
3
4
Nature inspired learning
The non-linear single layer perceptron (SLP) - a main element of ANN theory
[Diagram: inputs x1, x2, …, xp → weighted sum → nonlinearity → output y]
5
Nature inspired learning
TRAINING THE SINGLE LAYER PERCEPTRON - OUTLINE
[Figure: a plot of 300 bivariate vectors (dots and pluses) sampled from two Gaussian pattern classes, the linear decision boundary, and the START and FINISH weight positions]
Three tasks:
- CLASSIFICATION,
- CLUSTERING, if target2 = target1,
- minimization of deviations.
6
CLASSIFICATION
The two-category case (the multi-category case will also be discussed).
1. Cost function and training of the SLP used for classification.
2. When to stop training?
3. Seven types of classifiers obtained while training the SLP:
1. Euclidean distance (only the means),
2. Regularized,
3. Fisher, or
4. Fisher with pseudo-inversion of the covariance matrix,
5. Robust,
6. Minimal empirical error,
7. Support vector (maximal margin).
How to train the SLP in the best way?
7
Nature inspired learning
Training the non-linear SLP
[Diagram: inputs X = (x1, x2, …, xp) → weighted sum → nonlinearity → output y]
Training data: N rows of inputs x1, x2, …, xp with the desired output y.
o = f(net), net = V^T X + v0, where f(net) is a non-linear activation function, e.g. a sigmoid function: f(net) = 1/(1 + e^(-net)) = f_sigmoid(net), and v0, V^T = (v1, v2, ..., vp) are the weights of the discriminant function (DF).
STANDARD
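As a minimal sketch of this forward pass (Python/NumPy; the function and variable names are illustrative, not from the talk):

import numpy as np

def slp_output(X, V, v0):
    # Non-linear SLP: o = f(net), net = V^T x + v0, with a sigmoid f
    net = X @ V + v0                     # weighted sum, one value per input row
    return 1.0 / (1.0 + np.exp(-net))    # sigmoid activation f(net)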
8
TRAINING THE SINGLE LAYER PERCEPTRON BASED CLASSIFIER
Training data: N rows of inputs x1, x2, …, xp with the desired output y.
o = f(V^T X + v0), where f(net) is a non-linear activation function, and v0, V^T = (v1, v2, ..., vp) are the weights.
Cost function (Amari, 1967; Tsypkin, 1966):
C = (1/N) Σ_j (y_j - f(V^T X_j + v0))²
Training: V_{t+1} = V_t - η × gradient,
where η is a learning step parameter and y_j is the training signal (desired output).
[Figure: the training trajectory from V(0) to V(FINISH), the minimum of the cost function, vs. a true (unknown) minimum; the optimal stopping rule]
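A sketch of this batch gradient-descent rule with the sigmoid activation (all names illustrative, not from the talk):

import numpy as np

def train_slp(X, y, eta=0.1, epochs=100):
    # Batch gradient descent for C = (1/N) * sum_j (y_j - f(V^T X_j + v0))^2
    N, p = X.shape
    V, v0 = np.zeros(p), 0.0                        # start from zero weights
    for _ in range(epochs):
        o = 1.0 / (1.0 + np.exp(-(X @ V + v0)))     # outputs f(net_j)
        delta = -2.0 / N * (y - o) * o * (1.0 - o)  # dC/dnet_j (sigmoid derivative)
        V -= eta * (X.T @ delta)                    # V_{t+1} = V_t - eta * dC/dV
        v0 -= eta * delta.sum()                     # same rule for the bias weight
    return V, v0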
9
Training the Non-linear Single Layer Perceptron
V_{t+1} = V_t - η × gradient, computed on the training data.
[Figure: the true cost landscape vs. the training-data landscape; the trajectory from start to finish passes near V_ideal - hence optimal stopping]
10
V_{t+1} = V_t - η × gradient. Early stopping vs. late stopping.
V_opt = α_opt · V_start + (1 - α_opt) · V_finish,
where α_opt = σ²_finish / (σ²_start + σ²_finish) (Raudys & Amari, 1998).
A general principle.
[Figure: accuracy vs. stopping point; the majority, who stopped too late, are here]
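A minimal sketch of this weight averaging, assuming the two variances σ²_start and σ²_finish are estimated elsewhere (names illustrative):

def optimally_stopped_weights(V_start, V_finish, var_start, var_finish):
    # V_opt = a * V_start + (1 - a) * V_finish,
    # with a = var_finish / (var_start + var_finish)
    a = var_finish / (var_start + var_finish)
    return a * V_start + (1.0 - a) * V_finish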
11
Where to use early stopping? Knowledge discovery in very large databases.
Nature inspired learning
Data Set 1 → Data Set 2 → Data Set 3 → …
In order to preserve the previously learned information: train, however, stop training early!
12
Standard sum of squares cost function = standard regression
C = (1/N) Σ_j (y_j - f(V^T X_j + v0))².
We assume the data are normalized: mean(X) = 0, mean(y) = 0, standard deviations = 1 (so covariances equal correlations).
Let the correlations between the input variables x1, x2, …, xp be zero.
Then the components of the vector V will be proportional to the correlations between x1, x2, …, xp and y.
We may obtain such a regression after the first iteration of the gradient descent training algorithm V_{t+1} = V_t - η × gradient.
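A sketch illustrating this claim, with a linear output for clarity (synthetic data; all names illustrative): after one batch step from zero weights on normalized data, the weights are proportional to the input-output correlations.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = X @ np.array([0.8, 0.4, 0.0, -0.3, 0.1]) + 0.1 * rng.standard_normal(1000)

X = (X - X.mean(0)) / X.std(0)              # zero means, unit standard deviations
y = (y - y.mean()) / y.std()

eta, V = 0.5, np.zeros(5)                   # start from zero weights
grad = -2.0 / len(y) * X.T @ (y - X @ V)    # gradient of C at V = 0
V1 = V - eta * grad                         # weights after the first iteration

corr = X.T @ y / len(y)                     # correlations between each x_i and y
print(V1 / corr)                            # constant ratio: V1 is proportional to corr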
13
SLP AS SIX REGRESSIONS
START
14
Nature inspired learning. Robust regression
[Figure: the square loss (y_j - V^T X_j)² vs. a saturating "robust" loss, plotted against the residual y_j - V^T X_j]
In order to obtain robust regression, instead of the square function we have to use a "robust function".
Š. Raudys (2000). Evolution and generalization of a single neurone. III. Primitive, regularized, standard, robust and minimax regressions. Neural Networks, 13(3/4), pp. 507-523.
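The slide does not specify which robust function was used; a sketch with the Huber function as one common choice (names illustrative):

import numpy as np

def huber_gradient_step(V, X, y, eta=0.01, c=1.345):
    # One descent step for C = (1/N) * sum_j rho(y_j - V^T X_j), where
    # rho(r) = r^2/2 for |r| <= c and c*|r| - c^2/2 otherwise, so large
    # residuals (outliers) have only a bounded influence on the fit.
    r = y - X @ V
    psi = np.clip(r, -c, c)                # rho'(r) saturates at +/- c
    return V + eta * (X.T @ psi) / len(y)  # V - eta * dC/dV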
15
[Figure: two ECG traces over 5000 samples: the recorded maternal signal containing the fetal component, and the extracted fetal signal]
Mother and fetus ("baby") ECG: two signals. Result: the fetus signal.
A real-world problem: robust regression is used to distinguish the very weak baby signal from the mother's ECG.
Robust regression pays attention to the smallest deviations, not to the largest ones, which are considered outliers.
16
Nature inspired learning. Standard and regularized regression
Use "statistical methods" to perform diverse whitening data transformations, where the input variables x1, x2, …, xp are decorrelated and scaled in order to have the same variances. Then, while training the perceptron in the transformed feature space, we can obtain standard regression after the very first iteration.
X_new = T X_old, T = Λ^(-1/2) Φ^T, where S_XX = Φ Λ Φ^T is a singular value decomposition of the covariance matrix S_XX.
The data are normalized (mean(X) = 0, mean(y) = 0, unit standard deviations) and V_start = 0.
If S_XX → S_XX + λI, we obtain regularized regression. Moreover, we can equalize the eigenvalues and speed up the training process (faster convergence).
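A sketch of this whitening transformation (names illustrative; reg > 0 corresponds to replacing S_XX by S_XX + λI):

import numpy as np

def whitening_transform(X, reg=0.0):
    # T = Lambda^(-1/2) Phi^T from the decomposition S_XX = Phi Lambda Phi^T;
    # assumes S_XX + reg*I is positive definite.
    Xc = X - X.mean(0)                       # zero means
    S = Xc.T @ Xc / len(Xc)                  # covariance matrix S_XX
    lam, Phi = np.linalg.eigh(S + reg * np.eye(S.shape[1]))
    T = np.diag(lam ** -0.5) @ Phi.T         # decorrelate, then scale to unit variance
    return Xc @ T.T, T                       # X_new = T X_old, applied row-wise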
17
SLP AS SEVEN STATISTICAL CLASSIFIERS
START: the simplest classifier.
[Figure: evolution of the decision boundary from small weights to large weights during training]
18
Nature inspired learning
Conditions to obtain the Euclidean distance classifier just after the first iteration: E1) the centre M = (M1 + M2)/2 is moved to the zero point, E2) training begins from zero weights, E3) the target t2 = -t1 N1/N2, E4) total gradient (batch mode) training is used.
When we train further, we have regularized discriminant analysis (RDA):
V_{t+1} = (λ_t I + S)^(-1) (M1 - M2), λ_t = 2/((t-1)η),
where λ_t is the regularization parameter; λ_t → 0 with an increase in the number of training iterations, yielding the Fisher classifier, or the Fisher classifier with pseudo-inversion of the covariance matrix.
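A sketch of this family of weight vectors (illustrative names; λ large ≈ the Euclidean distance classifier, λ → 0 gives the Fisher classifier, and pinv covers the pseudo-inversion case):

import numpy as np

def rda_weights(M1, M2, S, lam):
    # V = (lam*I + S)^(-1) (M1 - M2): the RDA family traced out while training
    p = len(M1)
    return np.linalg.solve(lam * np.eye(p) + S, M1 - M2)

def fisher_weights(M1, M2, S):
    # The lam -> 0 limit; pinv handles a singular covariance estimate
    return np.linalg.pinv(S) @ (M1 - M2)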
19
Nature inspired learning. Standard approach
Use the diversity of "statistical methods and multivariate models" in order to obtain an efficient estimate of the covariance matrix. Then perform whitening data transformations, where the input variables are decorrelated and scaled in order to have the same variances.
While training the perceptron in the transformed feature space, we can obtain the Euclidean distance classifier after the first iteration. In the original feature space this corresponds to the Fisher classifier, or to a modification of it (depending on the method used to estimate the covariance matrix).
[Figure: untransformed vs. transformed data]
The Euclidean classifier in the transformed space = the Fisher classifier in the original space.
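A sketch verifying this equivalence on synthetic data (all names illustrative): the Euclidean distance classifier (difference of class means) computed on whitened data, mapped back, coincides with the Fisher direction S^(-1)(M1 - M2).

import numpy as np

rng = np.random.default_rng(1)
X1 = rng.standard_normal((200, 4)) + 1.0       # class 1 sample
X2 = rng.standard_normal((200, 4)) - 1.0       # class 2 sample

pooled = np.vstack([X1 - X1.mean(0), X2 - X2.mean(0)])
S = pooled.T @ pooled / len(pooled)            # pooled covariance estimate
lam, Phi = np.linalg.eigh(S)
T = np.diag(lam ** -0.5) @ Phi.T               # whitening matrix

V_edc = T @ (X1.mean(0) - X2.mean(0))          # Euclidean classifier in whitened space
V_back = T.T @ V_edc                           # its weights in the original space

V_fisher = np.linalg.solve(S, X1.mean(0) - X2.mean(0))
print(np.allclose(V_back, V_fisher))           # True: the very same classifier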
20
Nature inspired learning
Generalisation errors. EDC, Fisher and Quadratic classifiers
Table 1. Learning quantity - the ratio κ = E{P_N}/P_∞ of the expected to the asymptotic classification error - of the Euclidean distance (EDC), Fisher (LDF) and quadratic (QDF) classifiers versus the training set size N, for dimensionality n = 50 and five values of the distance δ between the classes (from Raudys and Pikelis, 1980). The five columns in each group correspond to δ = 1.68, 2.56, 3.76, 4.65, 5.50, i.e. asymptotic errors P_∞ = 0.2, 0.1, 0.03, 0.01, 0.003.

N     | EDC                      | Fisher LDF               | QDF
8     | 1.82 2.34 3.09 3.66 4.22 |                          |
12    | 1.70 2.03 2.41 2.65 2.87 |                          |
20    | 1.54 1.70 1.84 1.92 1.99 |                          |
30    | 1.43 1.50 1.55 1.58 1.61 | 2.05 3.39 8.40 19.7 52.0 |
50*   | 1.30 1.32 1.33 1.34 1.35 | 1.62 2.15 3.61 5.95 10.6 | 2.21 3.25 7.87 18.3 40.6
100   | 1.18 1.17 1.16 1.16 1.17 | 1.33 1.51 1.93 2.47 3.27 | 2.13 3.12 7.10 13.1 25.1
250   | 1.08 1.07 1.06 1.06 1.06 | 1.14 1.19 1.31 1.44 1.61 | 1.81 2.35 3.23 4.03 5.05
500   | 1.04 1.03 1.03 1.03 1.03 | 1.07 1.09 1.15 1.20 1.27 | 1.58 1.78 2.01 2.18 2.35
1000  | 1.02 1.02 1.02 1.02 1.02 | 1.04 1.05 1.07 1.10 1.13 | 1.37 1.42 1.47 1.51 1.56
2500  | 1.01 1.01 1.01 1.01 1.01 | 1.01 1.02 1.03 1.04 1.05 | 1.18 1.16 1.18 1.18 1.20

*) N = 80 for the QDF.
21
S. Raudys, M. Iwamura. Structures of covariance matrix in handwritten character recognition. Lecture Notes in Computer Science, 3138, pp. 725-733, 2004.
S. Raudys, A. Saudargiene. First-order tree-type dependence between variables and classification performance. IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI-23(2), pp. 233-239, 2001.
A real-world problem (196-dimensional data): dozens of ways are used to estimate the covariance matrix and perform the whitening data transformation. This is "additional information" (if correct) that can be useful in SLP training.
22
Covariance matrices are different.
[Figure: decision boundaries of the EDC, Fisher (F), quadratic (Q) and Anderson-Bahadur (AB) linear discriminant functions; AB and F are different]
If we started with the AB decision boundary instead of the Fisher one, the result would be better. Hence, we have proposed a special method of input data transformation.
S. Raudys (2004). Integration of statistical and neural methods to design classifiers in case of unequal covariance matrices. Lecture Notes in Artificial Intelligence, Springer-Verlag, Vol. 3238, pp. 270-280.
23
Non-linear discrimination. Similarity features. LNCS 3686, pp. 136-145, 2005.
[Figure: panels a-d]
100+100 2-D two-class training vectors (pluses and circles) and the decision boundaries of Kernel Discriminant Analysis (a), the SV classifier (b), and the SLP trained in a 200-D dissimilarity feature space (c). Learning curve (d): the generalization error of the SLP classifier as a function of the number of training epochs, with the optimal stopping point marked.
24
Nature inspired learning. Noise injection
A "coloured" noise is used to form a pseudo-validation set: we add noise in the directions of the closest training vectors, so we almost do not distort the "geometry of the data".
In this technique we use "additional information": the space between neighboring points in a multidimensional feature space is not empty - it is filled by vectors of the same class.
The pseudo-validation data set is used to realize early stopping.
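A sketch of one way to realize such a pseudo-validation set (the exact noise model of the talk may differ; apply the function to each class separately so that neighbours share the class label):

import numpy as np

def pseudo_validation_set(X, noise_scale=0.5, seed=0):
    # "Coloured" noise: shift each vector towards its nearest neighbour,
    # i.e. add noise only along directions where other same-class vectors
    # lie, so the geometry of the data is almost preserved.
    rng = np.random.default_rng(seed)
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    np.fill_diagonal(d, np.inf)                         # exclude self-distance
    nn = X[d.argmin(axis=1)]                            # nearest neighbours
    a = noise_scale * rng.random((len(X), 1))           # random step sizes
    return X + a * (nn - X)                             # move towards neighbour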
25
Nature inspired learning. Multi-category cases
[Figure: two 2-D plots of three pattern classes with pair-wise decision boundaries; regions labelled by the competing class pairs, with reference points A, B, C, O]
Pair-wise classifiers: optimally stopped (+ noise injection) SLPs + H-T fusion. We need to obtain a classifier (SLP) of optimal complexity: early stopping.
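A sketch of the pair-wise decomposition, with simple majority voting standing in for the H-T fusion mentioned on the slide (names illustrative):

import numpy as np

def pairwise_fusion(classifiers, X, n_classes):
    # classifiers[(i, j)] is a trained two-class rule (e.g. an optimally
    # stopped SLP) returning a boolean array: True where class i beats j.
    votes = np.zeros((len(X), n_classes))
    for (i, j), clf in classifiers.items():
        wins_i = clf(X)
        votes[:, i] += wins_i                # vote for class i
        votes[:, j] += ~wins_i               # otherwise vote for class j
    return votes.argmax(axis=1)              # class with the most pair-wise wins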
26
Learning Rapidity. Two Pattern Recognition (PR) tasks
[Figure: the two 2-D two-class PR tasks and the SLP transfer function]
The time to learn the second task is restricted, say to 300 training epochs.
Parameters that affect learning rapidity:
η - the learning step - and the growth of the weights;
s = target1 - target2;
+ regularization: a) a weight decay term, b) a noise injection into the input vectors, c) a corruption of the targets.
The scaling of the starting weights, W_start → k·W_start, also controls learning rapidity.
So the key parameters are η, s, and k.
27
Optimal values of learning parameters
[Figure: the number of epochs needed as a function of the difference between the targets s = target1 - target2 (curves 1, 2, 3, 1a, 2a, 3a), of the learning step η (log scale), and of the weights magnitude]
28
Collective learning. A lengthy sequence of diverse PR tasks
[Figure: a sequence of rotated two-class PR tasks; rotation angle vs. recognition task changes]
The angle and/or the time between two changes vary all the time.
29
The multi-agent system composed of adaptive agents - single layer perceptrons
In order to survive, the agents should learn rapidly.
Unsuccessful agents are replaced by newborn ones. Inside a group the agents help each other.
In a case of emergency, they help the weakest groups.
Genetic learning is combined with adaptive learning.
The moral: a single agent (SLP) cannot learn a very long sequence of PR tasks successfully.
30
The power of the PR task changes and the parameter s as a function of time
[Figure: rotation angle "theta max" vs. PR task changes; stimulation s = t1 - t2 vs. PR task changes]
I tried to learn: s, "emotions", "altruism", the noise intensity, the length of the learning set, etc.
s follows the variation in the power of the changes.
31
Integrating Statistical Methods and Neural Networks. Nature inspired learning
The theory for the equal covariance matrix case.
The theory for unequal covariance matrices and multi-category cases: LNCS 4432, pp. 1-10, 2007; LNCS 4472, pp. 62-71, 2007; LNCS 4142, pp. 47-56, 2006; LNAI 3238, pp. 270-280, 2004.
Regression: Neural Networks, 13(3/4), pp. 507-523, 2000; JMLR; ICNC'08.
32
33