
Self-organizing systems

Self-organizing

• No supervisor available.

• Instead, try to find order/structure in the environment.

– Clusters (i.e. data is not homogeneously distributed)

– Directions (i.e. projections that carry more information than others)

Finding projections

Auto-encoder MLP

Output = input

Input

The auto-encoder is trained to reproduce its input at the output through a "bottleneck", so it must find an efficient coding. Train using standard training algorithms.

This will lead to (approximately) the principal components.

A 2-1-2 MLP with linear output but nonlinear input, trained to reproduce the input data.

The line shows the direction of w, the weight vector for the hidden unit.
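As a rough illustration of the idea (not the exact network from the slide), here is a minimal sketch of a 2-1-2 autoencoder in Python/NumPy, assuming a tanh hidden unit, a linear output layer, plain gradient descent on the squared reconstruction error, and synthetic 2-D data; all names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data lying roughly along one direction (zero mean).
X = rng.normal(size=(500, 1)) @ np.array([[2.0, 1.0]]) + 0.1 * rng.normal(size=(500, 2))
X -= X.mean(axis=0)

# 2-1-2 autoencoder: nonlinear (tanh) hidden unit, linear output.
W1 = rng.normal(scale=0.1, size=(2, 1))   # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(1, 2))   # hidden -> output weights
eta = 0.01

for epoch in range(2000):
    h = np.tanh(X @ W1)                  # hidden activation, shape (N, 1)
    Y = h @ W2                           # linear reconstruction, shape (N, 2)
    err = Y - X                          # reconstruction error
    # Backpropagate the squared-error loss E = 0.5 * sum(err**2) / N
    grad_W2 = h.T @ err / len(X)
    grad_h = err @ W2.T * (1.0 - h**2)   # tanh derivative
    grad_W1 = X.T @ grad_h / len(X)
    W1 -= eta * grad_W1
    W2 -= eta * grad_W2

# The hidden unit's weight vector w tends to align with the leading
# principal direction of the data.
print("w =", W1.ravel() / np.linalg.norm(W1))
```

After training, the hidden weight vector w tends to line up with the leading principal direction, which is the point the plots on this slide make.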

Principal components (Karhunen-Loève transform)

Task: Find a linear recoding of the data that preserves as much information as possible ("information" = variance in the signal).

⇒ Principal components

Principal components

$$
\mathbf{x} \equiv x_1\mathbf{e}_1 + x_2\mathbf{e}_2 + \cdots + x_D\mathbf{e}_D
= x_1\begin{pmatrix}1\\0\\\vdots\\0\end{pmatrix}
+ x_2\begin{pmatrix}0\\1\\\vdots\\0\end{pmatrix}
+ \cdots
= \begin{pmatrix}x_1\\x_2\\\vdots\\x_D\end{pmatrix}
$$

Find a new ON basis Q with M < D

$$
\mathbf{x} \approx \hat{\mathbf{x}} = z_1\mathbf{q}_1 + z_2\mathbf{q}_2 + \cdots + z_M\mathbf{q}_M
= Q\begin{pmatrix}z_1\\z_2\\\vdots\\z_M\end{pmatrix} = Q\mathbf{z}
$$

such that $\sum_n \big[\mathbf{x}(n) - \hat{\mathbf{x}}(n)\big]^2$ is minimized.

Reminder on change of basis

$$
z_i = \mathbf{q}_i^T\,\mathbf{x}
$$

The coefficient $z_i$ is given by the scalar product of $\mathbf{x}$ and the basis vector $\mathbf{q}_i$.
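For concreteness, a tiny hypothetical example of this change of basis in NumPy (the basis Q here is just a 45-degree rotation, chosen for illustration):

```python
import numpy as np

x = np.array([3.0, 1.0])
# An orthonormal basis Q (columns q1, q2), here a 45-degree rotation.
Q = np.array([[1.0, -1.0],
              [1.0,  1.0]]) / np.sqrt(2.0)

z = Q.T @ x          # z_i = q_i^T x
x_back = Q @ z       # change back: x = sum_i z_i q_i
print(z, x_back)     # x_back equals x
```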

Suppose M = D-1

$$
\sum_n \big[\mathbf{x}(n) - \hat{\mathbf{x}}(n)\big]^2
= \sum_n z_D(n)^2
= \sum_n \big(\mathbf{q}_D^T\mathbf{x}(n)\big)\big(\mathbf{x}(n)^T\mathbf{q}_D\big)
= \mathbf{q}_D^T\left[\sum_n \mathbf{x}(n)\,\mathbf{x}(n)^T\right]\mathbf{q}_D
$$

$$
= \mathbf{q}_D^T
\begin{pmatrix}
\sum_n x_1(n)x_1(n) & \sum_n x_1(n)x_2(n) & \cdots & \sum_n x_1(n)x_D(n)\\
\sum_n x_2(n)x_1(n) & \sum_n x_2(n)x_2(n) & \cdots & \sum_n x_2(n)x_D(n)\\
\vdots & & \ddots & \vdots\\
\sum_n x_D(n)x_1(n) & \sum_n x_D(n)x_2(n) & \cdots & \sum_n x_D(n)x_D(n)
\end{pmatrix}
\mathbf{q}_D
$$

$$
\sum_n \big[\mathbf{x}(n) - \hat{\mathbf{x}}(n)\big]^2 = N\,\mathbf{q}_D^T\,\mathbf{R}\,\mathbf{q}_D
$$

where

$$
\mathbf{R} = \frac{1}{N}\sum_n
\begin{pmatrix}
x_1(n)x_1(n) & x_1(n)x_2(n) & \cdots & x_1(n)x_D(n)\\
x_2(n)x_1(n) & x_2(n)x_2(n) & \cdots & x_2(n)x_D(n)\\
\vdots & & \ddots & \vdots\\
x_D(n)x_1(n) & x_D(n)x_2(n) & \cdots & x_D(n)x_D(n)
\end{pmatrix}
= \frac{1}{N}\sum_n \mathbf{x}(n)\,\mathbf{x}(n)^T
$$

is the correlation matrix.

We can guarantee minimum loss if we choose $\mathbf{q}_D$ to be the eigenvector of $\mathbf{R}$ with minimum eigenvalue:

$$
\mathbf{R}\,\mathbf{q}_D = \lambda_D\,\mathbf{q}_D
\qquad\Rightarrow\qquad
\sum_n \big[\mathbf{x}(n) - \hat{\mathbf{x}}(n)\big]^2 = N\,\mathbf{q}_D^T\,\mathbf{R}\,\mathbf{q}_D = N\lambda_D
$$

A zero eigenvalue means no loss.

Principal components

• Choose new basis Q of ON eigenvectors of the correlation matrix R.

• Discard basis vectors in increasing order of their eigenvalues (i.e. throw away smallest eigenvalues first)

• Can also be done with the eigenvectors of the covariance matrix Σ. (Identical to the correlation matrix if data has zero mean.)
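A minimal sketch of this procedure in Python/NumPy, using the eigenvectors of the covariance matrix as in the last bullet; the function and variable names are illustrative, not from the course material.

```python
import numpy as np

def pca_basis(X):
    """Eigenvalues (descending) and ON eigenvectors (columns) of the covariance of rows of X."""
    Xc = X - X.mean(axis=0)                      # remove the mean
    Sigma = Xc.T @ Xc / (len(X) - 1)             # covariance matrix
    lam, Q = np.linalg.eigh(Sigma)               # eigh: symmetric matrix
    order = np.argsort(lam)[::-1]                # largest eigenvalue first
    return lam[order], Q[:, order]

def pca_project(X, Q, M):
    """Project onto the M leading principal directions and reconstruct."""
    mu = X.mean(axis=0)
    Z = (X - mu) @ Q[:, :M]                      # coordinates z = Q^T (x - mu)
    X_hat = Z @ Q[:, :M].T + mu                  # approximate reconstruction
    return Z, X_hat

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
lam, Q = pca_basis(X)
Z, X_hat = pca_project(X, Q, 1)
print("eigenvalues:", lam)
print("mean squared loss:", np.mean(np.sum((X - X_hat) ** 2, axis=1)))
```

Keeping only the leading eigenvectors and discarding the rest is exactly the "throw away smallest eigenvalues first" step above.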

Covariance matrix

$$
\boldsymbol{\Sigma} =
\begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1D}\\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2D}\\
\vdots & & \ddots & \vdots\\
\sigma_{D1} & \sigma_{D2} & \cdots & \sigma_{DD}
\end{pmatrix},
\qquad
\sigma_{ij} = \frac{1}{N-1}\sum_n \big[x_i(n)-\mu_i\big]\big[x_j(n)-\mu_j\big]
$$

$$
\mu_i = \frac{1}{N}\sum_n x_i(n),
\qquad
\sigma_i^2 = \sigma_{ii} = \frac{1}{N-1}\sum_n \big[x_i(n)-\mu_i\big]^2
$$

Covariance matrix

$$
\sigma_{ij} = \frac{1}{N-1}\sum_n \big[x_i(n)-\mu_i\big]\big[x_j(n)-\mu_j\big]
= \frac{1}{N-1}\sum_n x_i(n)x_j(n) - \frac{2N}{N-1}\mu_i\mu_j + \frac{N}{N-1}\mu_i\mu_j
= \frac{1}{N-1}\sum_n x_i(n)x_j(n) - \frac{N}{N-1}\mu_i\mu_j
$$

$$
\Rightarrow\qquad \boldsymbol{\Sigma} \approx \mathbf{R} - \boldsymbol{\mu}\boldsymbol{\mu}^T
$$

Principal components

Express the data in the new basis Q, with basis vectors that are eigenvectors of the data covariance matrix:

$$
\boldsymbol{\Sigma}\,\mathbf{q}_i = \lambda_i\,\mathbf{q}_i
$$

[Figure: scatter plot of two-dimensional data (axes x1 and x2). The red lines show the directions of the two eigenvectors.]

Variance along eigendirection (zero-mean data)

$$
\sigma_k^2 = \frac{1}{N-1}\sum_n \big[\mathbf{q}_k^T\mathbf{x}(n) - \mu_k\big]^2
= \frac{1}{N-1}\sum_n \big[\mathbf{q}_k^T\mathbf{x}(n)\big]^2
= \frac{N}{N-1}\,\lambda_k
$$

[Figure: scatter plot of the data (axes x1 and x2), illustrating the variance along each eigendirection.]

PCA example: NIR spectra of meat

[Figure, four panels: "20 NIR spectra", "Same 20 demeaned", "...and rescaled", "Grouped in fat%".]

Each curve is a point in 100-dimensional space.

NIR: The 9 leading eigenvectors

[Figure: nine panels showing the 9 leading eigenvectors plotted over the 100 input dimensions.]

λi = 2.4308, 0.9372, 0.0489, 0.0256, 0.0108, 0.0023, 0.0014, 0.0002, 0.0001

NIR reconstruction with PCA

[Figure: one spectrum reconstructed using an increasing number of leading principal components: 1st PC; 1,2 PC; 1,2,3 PC; 1,2,3,4 PC; 1-5 PC; 1-6 PC; 1-7 PC; 1-8 PC.]

z = (2.64, 0.01, 2.35, -9.24, 0.66, -0.23, 0.71, -0.34, 0.06)
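Reconstructions like the ones above can be generated by adding one principal component at a time. A sketch, using synthetic curves as a stand-in for the NIR spectra (which are not included here); all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for the demeaned, rescaled spectra: 20 curves, 100 points each.
t = np.linspace(0.0, 1.0, 100)
X = np.array([np.sin(2 * np.pi * (t + rng.uniform(0, 0.1))) * rng.uniform(0.5, 1.5)
              + 0.05 * rng.normal(size=100) for _ in range(20)])

Xc = X - X.mean(axis=0)
lam, Q = np.linalg.eigh(Xc.T @ Xc / (len(X) - 1))
Q = Q[:, np.argsort(lam)[::-1]]              # leading eigenvectors first

x = Xc[0]                                    # one spectrum to reconstruct
z = Q.T @ x                                  # all PCA coordinates
for M in range(1, 9):
    x_hat = Q[:, :M] @ z[:M]                 # use only the M leading PCs
    print(M, "PCs, squared error:", np.sum((x - x_hat) ** 2))
```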

The first eigenvector for the Lego data covariance matrix.

The line shows the direction of the eigenvector (= the first principal direction).

In this case, the first principal direction is good for doing the classification.

A 2-1-2 MLP autoencoder with linear output but nonlinear input, trained to reproduce the input data.

The line shows the direction of w, the weight vector for the hidden unit.

PCA application: image compression

Original image

PCA (KL) basis estimated from 12x12 patches (144 dim).

Recoded image using 10% of PCA basis for each 12x12 patch.

Recoded image using 50% of PCA basis for each 12x12 patch.

Original image
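A sketch of the patch-based recoding described above, assuming the image is available as a 2-D NumPy array of grayscale values; the 12x12 patch size and the basis fractions follow the captions, everything else (names, the random test array) is illustrative.

```python
import numpy as np

def compress_with_patch_pca(img, patch=12, keep_frac=0.10):
    """Recode a grayscale image by projecting each patch x patch block
    onto the leading PCA basis vectors estimated from all blocks."""
    H, W = img.shape
    H, W = H - H % patch, W - W % patch                  # crop to full patches
    blocks = (img[:H, :W]
              .reshape(H // patch, patch, W // patch, patch)
              .swapaxes(1, 2)
              .reshape(-1, patch * patch))               # one row per patch (144-dim)
    mu = blocks.mean(axis=0)
    lam, Q = np.linalg.eigh(np.cov(blocks - mu, rowvar=False))
    Q = Q[:, np.argsort(lam)[::-1]]                      # leading eigenvectors first
    M = max(1, int(keep_frac * patch * patch))           # e.g. 10% of the basis
    Z = (blocks - mu) @ Q[:, :M]                         # compressed coefficients
    recoded = Z @ Q[:, :M].T + mu                        # reconstruct each patch
    return (recoded.reshape(H // patch, W // patch, patch, patch)
                   .swapaxes(1, 2)
                   .reshape(H, W))

# Example on random "image" data (a real image array would be used instead).
img = np.random.default_rng(0).normal(size=(120, 180))
print(compress_with_patch_pca(img, keep_frac=0.5).shape)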

PCA application: Eigenfaces

• Images are high-dimensional data with high correlation (faces look quite similar after all... the eyes are located above the nose, the mouth below the nose, hair on top... etc.)

• Reduce the dimensionality of the face image database by using PCA.

• Requires that the face is centered in the image and that the individual is looking into the camera (i.e. the same pose all the time).

M. Turk, A. Pentland, "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.

Images from the ORL database (http://www.cam-orl.co.uk/facedatabase.html)

Large λ

Medium λ

Small λ

Eigenvectors ("eigenfaces") when different subsets of 200 face images are used to compute PCA.

• You need only 10-20 eigenfaces to do a reliable identification.

• Compare with dimension of original image.

http://cnx.org/

PCA is not always an optimal projection for classification.

Auto-encoder applications

Output = input

Input

• Induction motor failure detection (Siemens). Input: Power spectrum of electrical current.

• Failure prediction in helicopter gear boxes (US Navy). Input: Vibration spectrum of gear box.

• Bank note rejection (and acceptance) at automatic vending machines (U. Firenze). Input: Reflected and transmitted light along the bank note.

PCA ≠ Autoencoder

The PCA basis can represent data in a subspace that extends infinitely.

The MLP autoencoder reliably represents data in a lower-dimensional subspace and in a limited region. This is due to sigmoid functions that saturate.

Nonlinear autoencoder

Output = input

Has been very difficult to train. Now "solved" by using smart "pretraining" (a "Boltzmann machine").

Matlab code available at http://www.cs.toronto.edu/~hinton/MatlabForSciencePaper.html

Input

Hinton & Salakhutdinov, Science, 313, pp. 504-507, 2006

Nonlinear autoencoder

Original

6-dim nonlin. autoencoder

6-dim lin. autoencoder

6-dim linear PCA

Original

30-dim nonl. autoencoder

30-dim PCA

Hinton & Salakhutdinov, Science, 313, pp. 504-507, 2006

Visualization of newswire stories

2D nonlinear autoencoder vs. 2D latent semantic analysis

Hinton & Salakhutdinov, Science, 313, pp. 504-507, 2006

PCA with kernels (cf. SVM)

Map to a high-dimensional space and compute PCA there.

Can be done with kernels.

Figure from ftp://ftp.research.microsoft.com/users/mtipping/skpca_nips.ps.gz

ICA

• ICA = Independent component analysis

• PCA computes eigenvectors of the covariance matrix (2nd-order statistics).

• ICA looks at higher-order statistics and finds "independent" components.

Clustering

k-means clustering

For K “cluster vectors” minimize

$$
E = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\Lambda_k\big[\mathbf{x}(n)\big]\,\big\|\mathbf{x}(n)-\mathbf{w}_k\big\|^2
$$

The "distortion".

$$
\Lambda_k\big[\mathbf{x}(n)\big] =
\begin{cases}
1 & \text{if } \mathbf{w}_k \text{ is closest to } \mathbf{x}(n)\\
0 & \text{otherwise}
\end{cases}
$$

Λ is an "assignment function".

k-means update

Batch mode:

$$
\mathbf{w}_k(t+1) = \mathbf{w}_k(t) + \eta\sum_{n=1}^{N}\Lambda_k\big[\mathbf{x}(n)\big]\,\big[\mathbf{x}(n)-\mathbf{w}_k(t)\big]
$$

On-line mode:

$$
\mathbf{w}_k(t+1) =
\begin{cases}
(1-\eta)\,\mathbf{w}_k(t) + \eta\,\mathbf{x}(n) & \text{for the closest } \mathbf{w}_k\\
\mathbf{w}_k(t) & \text{otherwise}
\end{cases}
$$

k-means can be done in batch and on-line mode. Often on-line.
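A minimal sketch of the on-line update in Python/NumPy, with the cluster vectors initialized from randomly chosen data points (as suggested a few slides below); the names and toy data are illustrative.

```python
import numpy as np

def kmeans_online(X, K, eta=0.05, epochs=20, rng=None):
    """On-line k-means: move the closest cluster vector a step towards each sample."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Initialize the cluster vectors from randomly chosen data points.
    W = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            k = np.argmin(np.sum((W - x) ** 2, axis=1))   # closest w_k
            W[k] = (1.0 - eta) * W[k] + eta * x           # w_k(t+1) = (1-eta) w_k(t) + eta x(n)
    return W

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in [(0, 0), (3, 0), (0, 3)]])
print(kmeans_online(X, K=3, rng=rng))
```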

TrainE = 0.56%, TestE = 0.80%

But the algorithm wasn't told about red & green.

Takes a long time to get the vectors w to converge into the region of interest.

"Better" to pick initial points randomly from the data.

TrainE = 0.55%, TestE = 0.52%

But the algorithm needs to know about red & green.

k-means problem

• How to select the number of centers?

• Common to minimize the Schwarz criterion:

$$
E = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\Lambda_k\big[\mathbf{x}(n)\big]\,d\big[\mathbf{x}(n),\mathbf{w}_k\big] + \lambda\,K\,D\log(N)
$$

The first term is the distortion, the second the complexity cost.
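A sketch of how the criterion could be used to pick K, assuming SciPy's kmeans2 as a stand-in for the k-means step, a squared-Euclidean distortion, and λ = 1; the choice of λ and the toy data are illustrative.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def schwarz_criterion(X, K, lam=1.0):
    """Distortion (squared-Euclidean) plus a complexity cost lambda * K * D * log(N)."""
    N, D = X.shape
    W, labels = kmeans2(X, K, minit='points')        # cluster vectors + assignments
    distortion = 0.5 * np.sum((X - W[labels]) ** 2)  # 1/2 sum_n ||x(n) - w_closest||^2
    return distortion + lam * K * D * np.log(N)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in [(0, 0), (3, 0), (0, 3)]])
scores = {K: schwarz_criterion(X, K) for K in range(1, 8)}
print(min(scores, key=scores.get), scores)           # K with the smallest criterion
```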

Learning vector quantization

For correctly classified patterns – move closer:

$$
\mathbf{w}_k(t+1) =
\begin{cases}
(1-\eta)\,\mathbf{w}_k(t) + \eta\,\mathbf{x}(n) & \text{for the closest } \mathbf{w}_k\\
\mathbf{w}_k(t) & \text{otherwise}
\end{cases}
$$

For incorrectly classified patterns – move away:

$$
\mathbf{w}_k(t+1) =
\begin{cases}
(1+\eta)\,\mathbf{w}_k(t) - \eta\,\mathbf{x}(n) & \text{for the closest } \mathbf{w}_k\\
\mathbf{w}_k(t) & \text{otherwise}
\end{cases}
$$
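A minimal LVQ1-style sketch of these two updates in Python/NumPy, assuming labelled data and prototypes initialized from the data; names and toy data are illustrative.

```python
import numpy as np

def lvq1(X, y, W, w_class, eta=0.05, epochs=20, rng=None):
    """LVQ1: move the closest prototype towards a correctly classified sample,
    away from an incorrectly classified one."""
    if rng is None:
        rng = np.random.default_rng(0)
    W = W.copy()
    for _ in range(epochs):
        for n in rng.permutation(len(X)):
            k = np.argmin(np.sum((W - X[n]) ** 2, axis=1))   # closest prototype
            if w_class[k] == y[n]:
                W[k] = (1.0 - eta) * W[k] + eta * X[n]       # correct: move closer
            else:
                W[k] = (1.0 + eta) * W[k] - eta * X[n]       # incorrect: move away
    return W

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=(0, 0), size=(100, 2)),
               rng.normal(loc=(3, 3), size=(100, 2))])
y = np.repeat([0, 1], 100)
idx = rng.choice(len(X), size=4, replace=False)
W0, w_class = X[idx].copy(), y[idx]          # prototypes and their class labels
print(lvq1(X, y, W0, w_class, rng=rng))
```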

Self-organizing maps

• Impose a topology among the “neurons”, i.e. define neighborhood relationships.

• Update neighbors along with closest unit.

• Will encode the data in a 2D or 3D submanifold.

A 2D square lattice topology

Every neuron has 4 near neighbors.

A 2D hexagonal lattice topology

Every neuron has 6 near neighbors.

SOM maps

For K “cluster vectors” (neurons) minimize

$$
E = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\Lambda_k\big[\mathbf{x}(n)\big]\,\big\|\mathbf{x}(n)-\mathbf{w}_k\big\|^2
$$

Example of switch (assignment function):

$$
\Lambda_k\big[\mathbf{x}(n)\big] =
\begin{cases}
1 & \text{if } \mathbf{w}_k \text{ is closest to } \mathbf{x}(n), \text{ or if } \mathbf{w}_k \text{ is a neighbor of the closest unit}\\
0 & \text{otherwise}
\end{cases}
$$

SOM update

Let the closest unit to x(n) be called unit j.

$$
\mathbf{w}_k(t+1) = (1-\eta\Lambda_{jk})\,\mathbf{w}_k(t) + \eta\Lambda_{jk}\,\mathbf{x}(n)
$$

$$
\Lambda_{jk} = \exp\!\left[-\frac{d_{jk}^2}{2\sigma^2}\right]
$$

where $d_{jk}$ is the distance in the lattice and σ is decreased with time.

First, a big neighborhood.

Then, a smaller neighborhood.

Then, no neighborhood.
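A minimal SOM sketch in Python/NumPy with a square lattice and a Gaussian neighborhood whose width σ is halved every epoch (one possible schedule, not prescribed by the slides); all parameters and names are illustrative.

```python
import numpy as np

def train_som(X, grid=(10, 10), epochs=20, eta=0.1, sigma0=3.0, rng=None):
    """SOM on a 2-D square lattice: update the winner and its lattice
    neighbours, with the neighbourhood width sigma shrinking over time."""
    if rng is None:
        rng = np.random.default_rng(0)
    K = grid[0] * grid[1]
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])], float)
    W = X[rng.choice(len(X), size=K, replace=False)].copy()   # init from data
    for epoch in range(epochs):
        sigma = sigma0 * (0.5 ** epoch)                       # decrease sigma with time
        for x in X[rng.permutation(len(X))]:
            j = np.argmin(np.sum((W - x) ** 2, axis=1))       # closest unit j
            d2 = np.sum((coords - coords[j]) ** 2, axis=1)    # lattice distances d_jk^2
            h = np.exp(-d2 / (2.0 * sigma ** 2))              # Lambda_jk
            W += eta * h[:, None] * (x - W)                   # (1 - eta*h) w + eta*h*x
    return W, coords

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 2))
W, coords = train_som(X, rng=rng)
print(W.shape)
```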

Initial

5 epochs

10 epochs

15 epochs

20 epochs

SOM only

Hierarchical clustering

• Agglomerative: Start out with all points as individual clusters. Join closest clusters until you’re satisfied.
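Agglomerative clustering is readily available in SciPy; a small sketch below uses it as one possible implementation (the slides do not mandate a particular library or linkage rule).

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
               for c in [(0, 0), (3, 0), (0, 3)]])

# Agglomerative clustering: start from single points, join closest clusters.
Z = linkage(X, method='single')                    # 'single' = distance between closest members
labels = fcluster(Z, t=3, criterion='maxclust')    # cut the tree into 3 clusters
print(labels)

# dendrogram(Z) would plot the clustering order and distances (needs matplotlib).
```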

Hierarchical clustering


Clustering order and distances

Dendrogram

k-means


Metrics

Euclidean:
$$d(\mathbf{x},\mathbf{w}_k) = \|\mathbf{x}-\mathbf{w}_k\| = \left[\sum_i (x_i - w_{ki})^2\right]^{1/2}$$

Minkowski:
$$d(\mathbf{x},\mathbf{w}_k) = \|\mathbf{x}-\mathbf{w}_k\|_p = \left[\sum_i |x_i - w_{ki}|^p\right]^{1/p}$$

Manhattan:
$$d(\mathbf{x},\mathbf{w}_k) = \sum_i |x_i - w_{ki}|$$

Mahalanobis:
$$d(\mathbf{x},\mathbf{w}_k) = (\mathbf{x}-\mathbf{w}_k)^T\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{x}-\mathbf{w}_k)$$

Kernel:
$$d_K(\mathbf{x},\mathbf{w}_k) = K\big(d(\mathbf{x},\mathbf{w}_k)\big)$$

etc... mutations, alignments, ... whatever...
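A small sketch of some of these metrics in Python/NumPy; the Mahalanobis version below returns the squared form, matching the expression above, and the kernel and edit-distance style metrics are omitted.

```python
import numpy as np

def euclidean(x, w):
    return np.sqrt(np.sum((x - w) ** 2))

def minkowski(x, w, p):
    return np.sum(np.abs(x - w) ** p) ** (1.0 / p)

def manhattan(x, w):
    return np.sum(np.abs(x - w))

def mahalanobis(x, w, Sigma):
    d = x - w
    return float(d @ np.linalg.inv(Sigma) @ d)    # squared form, as in the slide

x, w = np.array([1.0, 2.0]), np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
print(euclidean(x, w), minkowski(x, w, 3), manhattan(x, w), mahalanobis(x, w, Sigma))
```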
