
Self-organizing systems

Self-organizing

• No supervisor available.

• Instead, try to find order/structure in the environment.

– Clusters (i.e. data is not homogeneously distributed)

– Directions (i.e. projections that carry more information than others)

Finding projections

Auto-encoder MLP

Output = input

Input

The auto-encoder is trained to reproduce its input at the output through a "bottleneck", so it must find an efficient coding. Train using standard training algorithms.

This will lead to (approximately) the principal components.

A 2-1-2 MLP with linear output but nonlinear input, trained to reproduce the input data.

The line shows the direction of w, the weight vector for the hidden unit.
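As a rough illustration of the idea (not the exact network from the slide), here is a minimal sketch of a 2-1-2 autoencoder in Python/NumPy, assuming a tanh hidden unit, a linear output layer, plain gradient descent on the squared reconstruction error, and synthetic 2-D data; all names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data lying roughly along one direction (zero mean).
X = rng.normal(size=(500, 1)) @ np.array([[2.0, 1.0]]) + 0.1 * rng.normal(size=(500, 2))
X -= X.mean(axis=0)

# 2-1-2 autoencoder: nonlinear (tanh) hidden unit, linear output.
W1 = rng.normal(scale=0.1, size=(2, 1))   # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(1, 2))   # hidden -> output weights
eta = 0.01

for epoch in range(2000):
    h = np.tanh(X @ W1)                  # hidden activation, shape (N, 1)
    Y = h @ W2                           # linear reconstruction, shape (N, 2)
    err = Y - X                          # reconstruction error
    # Backpropagate the squared-error loss E = 0.5 * sum(err**2) / N
    grad_W2 = h.T @ err / len(X)
    grad_h = err @ W2.T * (1.0 - h**2)   # tanh derivative
    grad_W1 = X.T @ grad_h / len(X)
    W1 -= eta * grad_W1
    W2 -= eta * grad_W2

# The hidden unit's weight vector w tends to align with the leading
# principal direction of the data.
print("w =", W1.ravel() / np.linalg.norm(W1))
```

After training, the hidden weight vector w tends to line up with the leading principal direction, which is the point the plots on this slide make.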

Principal components (Karhunen-Loève transform)

Task: Find a linear recoding of the data that preserves as much information as possible ("information" = variance in the signal).

⇒ Principal components

Principal components

$$
\mathbf{x} \equiv x_1\mathbf{e}_1 + x_2\mathbf{e}_2 + \cdots + x_D\mathbf{e}_D
= x_1\begin{pmatrix}1\\0\\\vdots\\0\end{pmatrix}
+ x_2\begin{pmatrix}0\\1\\\vdots\\0\end{pmatrix}
+ \cdots
= \begin{pmatrix}x_1\\x_2\\\vdots\\x_D\end{pmatrix}
$$

Find a new ON basis Q with M < D

$$
\mathbf{x} \approx \hat{\mathbf{x}} = z_1\mathbf{q}_1 + z_2\mathbf{q}_2 + \cdots + z_M\mathbf{q}_M
= Q\begin{pmatrix}z_1\\z_2\\\vdots\\z_M\end{pmatrix} = Q\mathbf{z}
$$

such that $\sum_n \big[\mathbf{x}(n) - \hat{\mathbf{x}}(n)\big]^2$ is minimized.

Reminder on change of basis

$$
z_i = \mathbf{q}_i^T\,\mathbf{x}
$$

The coefficient $z_i$ is given by the scalar product of $\mathbf{x}$ and the basis vector $\mathbf{q}_i$.
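For concreteness, a tiny hypothetical example of this change of basis in NumPy (the basis Q here is just a 45-degree rotation, chosen for illustration):

```python
import numpy as np

x = np.array([3.0, 1.0])
# An orthonormal basis Q (columns q1, q2), here a 45-degree rotation.
Q = np.array([[1.0, -1.0],
              [1.0,  1.0]]) / np.sqrt(2.0)

z = Q.T @ x          # z_i = q_i^T x
x_back = Q @ z       # change back: x = sum_i z_i q_i
print(z, x_back)     # x_back equals x
```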

Suppose M = D-1

$$
\sum_n \big[\mathbf{x}(n) - \hat{\mathbf{x}}(n)\big]^2
= \sum_n z_D(n)^2
= \sum_n \big(\mathbf{q}_D^T\mathbf{x}(n)\big)\big(\mathbf{x}(n)^T\mathbf{q}_D\big)
= \mathbf{q}_D^T\left[\sum_n \mathbf{x}(n)\,\mathbf{x}(n)^T\right]\mathbf{q}_D
$$

$$
= \mathbf{q}_D^T
\begin{pmatrix}
\sum_n x_1(n)x_1(n) & \sum_n x_1(n)x_2(n) & \cdots & \sum_n x_1(n)x_D(n)\\
\sum_n x_2(n)x_1(n) & \sum_n x_2(n)x_2(n) & \cdots & \sum_n x_2(n)x_D(n)\\
\vdots & & \ddots & \vdots\\
\sum_n x_D(n)x_1(n) & \sum_n x_D(n)x_2(n) & \cdots & \sum_n x_D(n)x_D(n)
\end{pmatrix}
\mathbf{q}_D
$$

$$
\sum_n \big[\mathbf{x}(n) - \hat{\mathbf{x}}(n)\big]^2 = N\,\mathbf{q}_D^T\,\mathbf{R}\,\mathbf{q}_D
$$

where

$$
\mathbf{R} = \frac{1}{N}\sum_n
\begin{pmatrix}
x_1(n)x_1(n) & x_1(n)x_2(n) & \cdots & x_1(n)x_D(n)\\
x_2(n)x_1(n) & x_2(n)x_2(n) & \cdots & x_2(n)x_D(n)\\
\vdots & & \ddots & \vdots\\
x_D(n)x_1(n) & x_D(n)x_2(n) & \cdots & x_D(n)x_D(n)
\end{pmatrix}
= \frac{1}{N}\sum_n \mathbf{x}(n)\,\mathbf{x}(n)^T
$$

is the correlation matrix.

We can guarantee minimum loss if we choose $\mathbf{q}_D$ to be the eigenvector of $\mathbf{R}$ with minimum eigenvalue:

$$
\mathbf{R}\,\mathbf{q}_D = \lambda_D\,\mathbf{q}_D
\qquad\Rightarrow\qquad
\sum_n \big[\mathbf{x}(n) - \hat{\mathbf{x}}(n)\big]^2 = N\,\mathbf{q}_D^T\,\mathbf{R}\,\mathbf{q}_D = N\lambda_D
$$

A zero eigenvalue means no loss.

Principal components

• Choose new basis Q of ON eigenvectors of the correlation matrix R.

• Discard basis vectors in increasing order of their eigenvalues (i.e. throw away smallest eigenvalues first)

• Can also be done with the eigenvectors of the covariance matrix Σ. (Identical to the correlation matrix if data has zero mean.)
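A minimal sketch of this procedure in Python/NumPy, using the eigenvectors of the covariance matrix as in the last bullet; the function and variable names are illustrative, not from the course material.

```python
import numpy as np

def pca_basis(X):
    """Eigenvalues (descending) and ON eigenvectors (columns) of the covariance of rows of X."""
    Xc = X - X.mean(axis=0)                      # remove the mean
    Sigma = Xc.T @ Xc / (len(X) - 1)             # covariance matrix
    lam, Q = np.linalg.eigh(Sigma)               # eigh: symmetric matrix
    order = np.argsort(lam)[::-1]                # largest eigenvalue first
    return lam[order], Q[:, order]

def pca_project(X, Q, M):
    """Project onto the M leading principal directions and reconstruct."""
    mu = X.mean(axis=0)
    Z = (X - mu) @ Q[:, :M]                      # coordinates z = Q^T (x - mu)
    X_hat = Z @ Q[:, :M].T + mu                  # approximate reconstruction
    return Z, X_hat

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
lam, Q = pca_basis(X)
Z, X_hat = pca_project(X, Q, 1)
print("eigenvalues:", lam)
print("mean squared loss:", np.mean(np.sum((X - X_hat) ** 2, axis=1)))
```

Keeping only the leading eigenvectors and discarding the rest is exactly the "throw away smallest eigenvalues first" step above.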

Covariance matrix

$$
\boldsymbol{\Sigma} =
\begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1D}\\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2D}\\
\vdots & & \ddots & \vdots\\
\sigma_{D1} & \sigma_{D2} & \cdots & \sigma_{DD}
\end{pmatrix},
\qquad
\sigma_{ij} = \frac{1}{N-1}\sum_n \big[x_i(n)-\mu_i\big]\big[x_j(n)-\mu_j\big]
$$

$$
\mu_i = \frac{1}{N}\sum_n x_i(n),
\qquad
\sigma_i^2 = \sigma_{ii} = \frac{1}{N-1}\sum_n \big[x_i(n)-\mu_i\big]^2
$$

Covariance matrix

$$
\sigma_{ij} = \frac{1}{N-1}\sum_n \big[x_i(n)-\mu_i\big]\big[x_j(n)-\mu_j\big]
= \frac{1}{N-1}\sum_n x_i(n)x_j(n) - \frac{2N}{N-1}\mu_i\mu_j + \frac{N}{N-1}\mu_i\mu_j
= \frac{1}{N-1}\sum_n x_i(n)x_j(n) - \frac{N}{N-1}\mu_i\mu_j
$$

$$
\Rightarrow\qquad \boldsymbol{\Sigma} \approx \mathbf{R} - \boldsymbol{\mu}\boldsymbol{\mu}^T
$$

Principal components

Express the data in the new basis Q, with basis vectors that are eigenvectors of the data covariance matrix:

$$
\boldsymbol{\Sigma}\,\mathbf{q}_i = \lambda_i\,\mathbf{q}_i
$$

[Figure: scatter plot of two-dimensional data (axes x1 and x2). The red lines show the directions of the two eigenvectors.]

Variance along eigendirection (zero-mean data)

$$
\sigma_k^2 = \frac{1}{N-1}\sum_n \big[\mathbf{q}_k^T\mathbf{x}(n) - \mu_k\big]^2
= \frac{1}{N-1}\sum_n \big[\mathbf{q}_k^T\mathbf{x}(n)\big]^2
= \frac{N}{N-1}\,\lambda_k
$$

[Figure: scatter plot of the data (axes x1 and x2), illustrating the variance along each eigendirection.]

PCA example: NIR spectra of meat

[Figure, four panels: "20 NIR spectra", "Same 20 demeaned", "...and rescaled", "Grouped in fat%".]

Each curve is a point in 100-dimensional space.

NIR: The 9 leading eigenvectors

[Figure: nine panels showing the 9 leading eigenvectors plotted over the 100 input dimensions.]

λi = 2.4308, 0.9372, 0.0489, 0.0256, 0.0108, 0.0023, 0.0014, 0.0002, 0.0001

NIR reconstruction with PCA

[Figure: one spectrum reconstructed using an increasing number of leading principal components: 1st PC; 1,2 PC; 1,2,3 PC; 1,2,3,4 PC; 1-5 PC; 1-6 PC; 1-7 PC; 1-8 PC.]

z = (2.64, 0.01, 2.35, -9.24, 0.66, -0.23, 0.71, -0.34, 0.06)
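Reconstructions like the ones above can be generated by adding one principal component at a time. A sketch, using synthetic curves as a stand-in for the NIR spectra (which are not included here); all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for the demeaned, rescaled spectra: 20 curves, 100 points each.
t = np.linspace(0.0, 1.0, 100)
X = np.array([np.sin(2 * np.pi * (t + rng.uniform(0, 0.1))) * rng.uniform(0.5, 1.5)
              + 0.05 * rng.normal(size=100) for _ in range(20)])

Xc = X - X.mean(axis=0)
lam, Q = np.linalg.eigh(Xc.T @ Xc / (len(X) - 1))
Q = Q[:, np.argsort(lam)[::-1]]              # leading eigenvectors first

x = Xc[0]                                    # one spectrum to reconstruct
z = Q.T @ x                                  # all PCA coordinates
for M in range(1, 9):
    x_hat = Q[:, :M] @ z[:M]                 # use only the M leading PCs
    print(M, "PCs, squared error:", np.sum((x - x_hat) ** 2))
```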

The first eigenvector for the Lego data covariance matrix.

The line shows the direction of the eigenvector (= the first principal direction).

In this case, the first principal direction is good for doing the classification.

A 2-1-2 MLP autoencoder with linear output but nonlinear input, trained to reproduce the input data.

The line shows the direction of w, the weight vector for the hidden unit.

PCA application: image compression

Original image

PCA (KL) basis estimated from 12x12 patches (144 dim).

Recoded image using 10% of PCA basis for each 12x12 patch.

Recoded image using 50% of PCA basis for each 12x12 patch.

Original image
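A sketch of the patch-based recoding described above, assuming the image is available as a 2-D NumPy array of grayscale values; the 12x12 patch size and the basis fractions follow the captions, everything else (names, the random test array) is illustrative.

```python
import numpy as np

def compress_with_patch_pca(img, patch=12, keep_frac=0.10):
    """Recode a grayscale image by projecting each patch x patch block
    onto the leading PCA basis vectors estimated from all blocks."""
    H, W = img.shape
    H, W = H - H % patch, W - W % patch                  # crop to full patches
    blocks = (img[:H, :W]
              .reshape(H // patch, patch, W // patch, patch)
              .swapaxes(1, 2)
              .reshape(-1, patch * patch))               # one row per patch (144-dim)
    mu = blocks.mean(axis=0)
    lam, Q = np.linalg.eigh(np.cov(blocks - mu, rowvar=False))
    Q = Q[:, np.argsort(lam)[::-1]]                      # leading eigenvectors first
    M = max(1, int(keep_frac * patch * patch))           # e.g. 10% of the basis
    Z = (blocks - mu) @ Q[:, :M]                         # compressed coefficients
    recoded = Z @ Q[:, :M].T + mu                        # reconstruct each patch
    return (recoded.reshape(H // patch, W // patch, patch, patch)
                   .swapaxes(1, 2)
                   .reshape(H, W))

# Example on random "image" data (a real image array would be used instead).
img = np.random.default_rng(0).normal(size=(120, 180))
print(compress_with_patch_pca(img, keep_frac=0.5).shape)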

PCA application: Eigenfaces

• Images are high-dimensional data with high correlation (faces look quite similar after all... the eyes are located above the nose, the mouth below the nose, hair on top... etc.)

• Reduce the dimensionality of the face image database by using PCA.

• Requires that the face is centered in the image and that the individual is looking into the camera (i.e. the same pose all the time).

M. Turk, A. Pentland, "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.

Images from the ORL database (http://www.cam-orl.co.uk/facedatabase.html)

Large λ

Medium λ

Small λ

Eigenvectors ("eigenfaces") when different subsets of 200 face images are used to compute PCA.

• You need only 10-20 eigenfaces to do a reliable identification.

• Compare with dimension of original image.

http://cnx.org/

PCA is not always an optimal projection for classification.

Auto-encoder applications

Output = input

Input

• Induction motor failure detection (Siemens). Input: Power spectrum of electrical current.

• Failure prediction in helicopter gear boxes (US Navy). Input: Vibration spectrum of gear box.

• Bank note rejection (and acceptance) at automatic vending machines (U. Firenze). Input: Reflected and transmitted light along the bank note.

PCA ≠ Autoencoder

The PCA basis can represent data in a subspace that extends infinitely.

The MLP autoencoder reliably represents data in a lower-dimensional subspace and in a limited region. This is due to sigmoid functions that saturate.

Nonlinear autoencoder

Output = input

Has been very difficult to train. Now "solved" by using smart "pretraining" (a "Boltzmann machine").

Matlab code available at http://www.cs.toronto.edu/~hinton/MatlabForSciencePaper.html

Input

Hinton & Salakhutdinov, Science, 313, pp. 504-507, 2006

Nonlinear autoencoder

Original

6-dim nonlin. autoencoder

6-dim lin. autoencoder

6-dim linear PCA

Original

30-dim nonl. autoencoder

30-dim PCA

Hinton & Salakhutdinov, Science, 313, pp. 504-507, 2006

Visualization of newswire stories

2D nonlinear autoencoder vs. 2D latent semantic analysis

Hinton & Salakhutdinov, Science, 313, pp. 504-507, 2006

PCA with kernels (cf. SVM)

Map to a high-dimensional space and compute PCA there.

Can be done with kernels.

Figure from ftp://ftp.research.microsoft.com/users/mtipping/skpca_nips.ps.gz

ICA

• ICA = Independent component analysis

• PCA computes eigenvectors of the covariance matrix (2nd-order statistics).

• ICA looks at higher-order statistics and finds "independent" components.

Clustering

k-means clustering

For K “cluster vectors” minimize

$$
E = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\Lambda_k\big[\mathbf{x}(n)\big]\,\big\|\mathbf{x}(n)-\mathbf{w}_k\big\|^2
$$

The "distortion".

$$
\Lambda_k\big[\mathbf{x}(n)\big] =
\begin{cases}
1 & \text{if } \mathbf{w}_k \text{ is closest to } \mathbf{x}(n)\\
0 & \text{otherwise}
\end{cases}
$$

Λ is an "assignment function".

k-means update

Batch mode:

$$
\mathbf{w}_k(t+1) = \mathbf{w}_k(t) + \eta\sum_{n=1}^{N}\Lambda_k\big[\mathbf{x}(n)\big]\,\big[\mathbf{x}(n)-\mathbf{w}_k(t)\big]
$$

On-line mode:

$$
\mathbf{w}_k(t+1) =
\begin{cases}
(1-\eta)\,\mathbf{w}_k(t) + \eta\,\mathbf{x}(n) & \text{for the closest } \mathbf{w}_k\\
\mathbf{w}_k(t) & \text{otherwise}
\end{cases}
$$

k-means can be done in batch and on-line mode. Often on-line.
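A minimal sketch of the on-line update in Python/NumPy, with the cluster vectors initialized from randomly chosen data points (as suggested a few slides below); the names and toy data are illustrative.

```python
import numpy as np

def kmeans_online(X, K, eta=0.05, epochs=20, rng=None):
    """On-line k-means: move the closest cluster vector a step towards each sample."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Initialize the cluster vectors from randomly chosen data points.
    W = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            k = np.argmin(np.sum((W - x) ** 2, axis=1))   # closest w_k
            W[k] = (1.0 - eta) * W[k] + eta * x           # w_k(t+1) = (1-eta) w_k(t) + eta x(n)
    return W

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in [(0, 0), (3, 0), (0, 3)]])
print(kmeans_online(X, K=3, rng=rng))
```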

TrainE = 0.56%, TestE = 0.80%

But the algorithm wasn't told about red & green.

Takes a long time to get the vectors w to converge into the region of interest.

"Better" to pick initial points randomly from the data.

TrainE = 0.55%, TestE = 0.52%

But the algorithm needs to know about red & green.

k-means problem

• How to select the number of centers?

• Common to minimize the Schwarz criterion:

$$
E = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\Lambda_k\big[\mathbf{x}(n)\big]\,d\big[\mathbf{x}(n),\mathbf{w}_k\big] + \lambda\,K\,D\log(N)
$$

The first term is the distortion, the second the complexity cost.
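A sketch of how the criterion could be used to pick K, assuming SciPy's kmeans2 as a stand-in for the k-means step, a squared-Euclidean distortion, and λ = 1; the choice of λ and the toy data are illustrative.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def schwarz_criterion(X, K, lam=1.0):
    """Distortion (squared-Euclidean) plus a complexity cost lambda * K * D * log(N)."""
    N, D = X.shape
    W, labels = kmeans2(X, K, minit='points')        # cluster vectors + assignments
    distortion = 0.5 * np.sum((X - W[labels]) ** 2)  # 1/2 sum_n ||x(n) - w_closest||^2
    return distortion + lam * K * D * np.log(N)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in [(0, 0), (3, 0), (0, 3)]])
scores = {K: schwarz_criterion(X, K) for K in range(1, 8)}
print(min(scores, key=scores.get), scores)           # K with the smallest criterion
```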

Learning vector quantization

For correctly classified patterns – move closer:

$$
\mathbf{w}_k(t+1) =
\begin{cases}
(1-\eta)\,\mathbf{w}_k(t) + \eta\,\mathbf{x}(n) & \text{for the closest } \mathbf{w}_k\\
\mathbf{w}_k(t) & \text{otherwise}
\end{cases}
$$

For incorrectly classified patterns – move away:

$$
\mathbf{w}_k(t+1) =
\begin{cases}
(1+\eta)\,\mathbf{w}_k(t) - \eta\,\mathbf{x}(n) & \text{for the closest } \mathbf{w}_k\\
\mathbf{w}_k(t) & \text{otherwise}
\end{cases}
$$
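A minimal LVQ1-style sketch of these two updates in Python/NumPy, assuming labelled data and prototypes initialized from the data; names and toy data are illustrative.

```python
import numpy as np

def lvq1(X, y, W, w_class, eta=0.05, epochs=20, rng=None):
    """LVQ1: move the closest prototype towards a correctly classified sample,
    away from an incorrectly classified one."""
    if rng is None:
        rng = np.random.default_rng(0)
    W = W.copy()
    for _ in range(epochs):
        for n in rng.permutation(len(X)):
            k = np.argmin(np.sum((W - X[n]) ** 2, axis=1))   # closest prototype
            if w_class[k] == y[n]:
                W[k] = (1.0 - eta) * W[k] + eta * X[n]       # correct: move closer
            else:
                W[k] = (1.0 + eta) * W[k] - eta * X[n]       # incorrect: move away
    return W

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=(0, 0), size=(100, 2)),
               rng.normal(loc=(3, 3), size=(100, 2))])
y = np.repeat([0, 1], 100)
idx = rng.choice(len(X), size=4, replace=False)
W0, w_class = X[idx].copy(), y[idx]          # prototypes and their class labels
print(lvq1(X, y, W0, w_class, rng=rng))
```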

Self-organizing maps

• Impose a topology among the “neurons”, i.e. define neighborhood relationships.

• Update neighbors along with closest unit.

• Will encode the data in a 2D or 3D submanifold.

A 2D square lattice topology

Every neuron has 4 near neighbors.

A 2D hexagonal lattice topology

Every neuron has 6 near neighbors.

SOM maps

For K “cluster vectors” (neurons) minimize

$$
E = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\Lambda_k\big[\mathbf{x}(n)\big]\,\big\|\mathbf{x}(n)-\mathbf{w}_k\big\|^2
$$

Example of switch (assignment function):

$$
\Lambda_k\big[\mathbf{x}(n)\big] =
\begin{cases}
1 & \text{if } \mathbf{w}_k \text{ is closest to } \mathbf{x}(n), \text{ or if } \mathbf{w}_k \text{ is a neighbor of the closest unit}\\
0 & \text{otherwise}
\end{cases}
$$

SOM update

Let the closest unit to x(n) be called unit j.

$$
\mathbf{w}_k(t+1) = (1-\eta\Lambda_{jk})\,\mathbf{w}_k(t) + \eta\Lambda_{jk}\,\mathbf{x}(n)
$$

$$
\Lambda_{jk} = \exp\!\left[-\frac{d_{jk}^2}{2\sigma^2}\right]
$$

where $d_{jk}$ is the distance in the lattice and σ is decreased with time.

First, a big neighborhood.

Then, a smaller neighborhood.

Then, no neighborhood.
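A minimal SOM sketch in Python/NumPy with a square lattice and a Gaussian neighborhood whose width σ is halved every epoch (one possible schedule, not prescribed by the slides); all parameters and names are illustrative.

```python
import numpy as np

def train_som(X, grid=(10, 10), epochs=20, eta=0.1, sigma0=3.0, rng=None):
    """SOM on a 2-D square lattice: update the winner and its lattice
    neighbours, with the neighbourhood width sigma shrinking over time."""
    if rng is None:
        rng = np.random.default_rng(0)
    K = grid[0] * grid[1]
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])], float)
    W = X[rng.choice(len(X), size=K, replace=False)].copy()   # init from data
    for epoch in range(epochs):
        sigma = sigma0 * (0.5 ** epoch)                       # decrease sigma with time
        for x in X[rng.permutation(len(X))]:
            j = np.argmin(np.sum((W - x) ** 2, axis=1))       # closest unit j
            d2 = np.sum((coords - coords[j]) ** 2, axis=1)    # lattice distances d_jk^2
            h = np.exp(-d2 / (2.0 * sigma ** 2))              # Lambda_jk
            W += eta * h[:, None] * (x - W)                   # (1 - eta*h) w + eta*h*x
    return W, coords

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 2))
W, coords = train_som(X, rng=rng)
print(W.shape)
```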

Initial

5 epochs

10 epochs

15 epochs

20 epochs

SOM only

Hierarchical clustering

• Agglomerative: Start out with all points as individual clusters. Join closest clusters until you’re satisfied.
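Agglomerative clustering is readily available in SciPy; a small sketch below uses it as one possible implementation (the slides do not mandate a particular library or linkage rule).

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
               for c in [(0, 0), (3, 0), (0, 3)]])

# Agglomerative clustering: start from single points, join closest clusters.
Z = linkage(X, method='single')                    # 'single' = distance between closest members
labels = fcluster(Z, t=3, criterion='maxclust')    # cut the tree into 3 clusters
print(labels)

# dendrogram(Z) would plot the clustering order and distances (needs matplotlib).
```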

Hierarchical clustering


Clustering order and distances

Dendrogram

k-means


Metrics

Euclidean:
$$d(\mathbf{x},\mathbf{w}_k) = \|\mathbf{x}-\mathbf{w}_k\| = \left[\sum_i (x_i - w_{ki})^2\right]^{1/2}$$

Minkowski:
$$d(\mathbf{x},\mathbf{w}_k) = \|\mathbf{x}-\mathbf{w}_k\|_p = \left[\sum_i |x_i - w_{ki}|^p\right]^{1/p}$$

Manhattan:
$$d(\mathbf{x},\mathbf{w}_k) = \sum_i |x_i - w_{ki}|$$

Mahalanobis:
$$d(\mathbf{x},\mathbf{w}_k) = (\mathbf{x}-\mathbf{w}_k)^T\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{x}-\mathbf{w}_k)$$

Kernel:
$$d_K(\mathbf{x},\mathbf{w}_k) = K\big(d(\mathbf{x},\mathbf{w}_k)\big)$$

etc... mutations, alignments, ... whatever...
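A small sketch of some of these metrics in Python/NumPy; the Mahalanobis version below returns the squared form, matching the expression above, and the kernel and edit-distance style metrics are omitted.

```python
import numpy as np

def euclidean(x, w):
    return np.sqrt(np.sum((x - w) ** 2))

def minkowski(x, w, p):
    return np.sum(np.abs(x - w) ** p) ** (1.0 / p)

def manhattan(x, w):
    return np.sum(np.abs(x - w))

def mahalanobis(x, w, Sigma):
    d = x - w
    return float(d @ np.linalg.inv(Sigma) @ d)    # squared form, as in the slide

x, w = np.array([1.0, 2.0]), np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
print(euclidean(x, w), minkowski(x, w, 3), manhattan(x, w), mahalanobis(x, w, Sigma))
```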
