Local Methods - Texas A&M University
TRANSCRIPT
![Page 1: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/1.jpg)
Local Methods
• Olive slides: Alpaydin
• Blue slides: Haykin, clarifications/notations
![Page 2: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/2.jpg)
Introduction 3
Divide the input space into local regions and learn simple (constant/linear) models in each patch
Unsupervised: Competitive, online clustering
Supervised: Radial-basis functions, mixture of experts
![Page 3: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/3.jpg)
Competitive Learning
• X = {xt}t : set of samples (green).
• mi, i = 1, 2, ..., k: cluster centers (red).
• bti = 1 if mi is closest to xt, 0 otherwise.
• Note: t = index for input, i = index for cluster center.
![Page 4: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/4.jpg)
Competitive Learning: k-Means
• Batch: update each cluster center to the mean of the samples currently assigned to it.
• Online: use stochastic gradient descent.
– Note: mi is a vector, with scalar components mij (see next page).
• Both are iterated until convergence is achieved.
![Page 5: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/5.jpg)
Competitive Learning
[Figure: the input x pulls the center m to a new position m′ along the direction x − m.]
• Updating the center.
• Δm = η(x − m)
• m ← m + η(x − m)
• “Move center closer to the current input”
![Page 6: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/6.jpg)
Competitive Learning 4

bti = 1 if ‖xt − mi‖ = minl ‖xt − ml‖, 0 otherwise

Batch k-means:
mi = Σt bti xt / Σt bti

Online k-means:
E({mi}i=1..k | X) = Σt Σi bti ‖xt − mi‖²
Δmij = η bti (xtj − mij)
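Below is a minimal NumPy sketch of the two procedures above; the function names, toy defaults, and hyperparameters (η = 0.1, epoch count) are illustrative assumptions, not from the slides.

```python
import numpy as np

def batch_kmeans_step(X, m):
    """One batch update: each center m_i becomes the mean of its assigned samples."""
    b = np.argmin(np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2), axis=1)
    return np.array([X[b == i].mean(axis=0) if np.any(b == i) else m[i]
                     for i in range(len(m))])

def online_kmeans(X, k, eta=0.1, epochs=10, seed=0):
    """Online k-means: move the winning center toward each sample in turn."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(epochs):
        for x in rng.permutation(X):
            i = np.argmin(np.linalg.norm(x - m, axis=1))  # winner: b_i^t = 1
            m[i] += eta * (x - m[i])                      # Δm_i = η(x^t − m_i)
    return m
```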
![Page 7: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/7.jpg)
Replacing bti, etc.
• We can use lateral inhibition to implement bti in a more biologically plausible manner (see figure in next slide). This needs iteration until the values settle (see the sketch below).
• We can also use the dot product (a similarity measure) instead of Euclidean distance.
• Hebbian learning is usually used for biologically plausible models.
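A toy sketch of bti computed by iterated lateral inhibition rather than an explicit argmax; the self-excitation and inhibition constants are assumptions chosen so the values settle to a near one-hot pattern.

```python
import numpy as np

def winner_take_all(a, self_exc=1.2, inhib=0.2, iters=50):
    """Each unit excites itself and inhibits the others until values settle."""
    a = a.astype(float).copy()
    for _ in range(iters):
        a = np.clip(self_exc * a - inhib * (a.sum() - a), 0.0, 1.0)
    return a  # approximately one-hot: plays the role of b_i

print(winner_take_all(np.array([0.4, 0.9, 0.7])))  # -> [0. 1. 0.]
```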
![Page 8: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/8.jpg)
Winner-take-all network
5
![Page 9: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/9.jpg)
Adaptive Resonance Theory 6

Incremental: a new cluster is added if the input is not covered by an existing center; coverage is defined by the vigilance, ρ.

bi = ‖xt − mi‖ = minl=1..k ‖xt − ml‖

mk+1 ← xt if bi > ρ (add a new cluster)
mi ← mi + η(xt − mi) otherwise

(Carpenter and Grossberg, 1988)
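A short sketch of this incremental rule; the interface and defaults are my assumptions, while the add-versus-update logic follows the slide.

```python
import numpy as np

def art_cluster(X, rho, eta=0.5):
    """Incremental clustering: add a center when the nearest one exceeds vigilance rho."""
    centers = [X[0].astype(float)]                 # first sample seeds the first cluster
    for x in X[1:]:
        d = [np.linalg.norm(x - m) for m in centers]
        i = int(np.argmin(d))                      # nearest center m_i
        if d[i] > rho:
            centers.append(x.astype(float))        # not covered: m_{k+1} <- x
        else:
            centers[i] += eta * (x - centers[i])   # covered: move m_i toward x
    return np.array(centers)
```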
![Page 10: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/10.jpg)
SOM Overview
SOM is based on three principles:
• Competition: each neuron calculates a discriminant function.
The neuron with the highest value is declared the winner.
• Cooperation: Neurons nearby the winner on the lattice get a
chance to adapt.
• Adaptation: The winner and its neighbors increase their
discriminant function value relative to the current input.
Subsequent presentation of the current input should result in
enhanced function value.
Redundancy in the input is needed!
5
![Page 11: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/11.jpg)
Redundancy, etc.
• Unsupervised learning methods such as SOM require redundancy in the
data.
• The following are intimately related:
– Redundancy
– Structure (or organization)
– Information content relative to channel capacity
6
![Page 12: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/12.jpg)
Redundancy, etc. (cont’d)
|                 | Left | Right |
|-----------------|------|-------|
| Structure       | No   | Yes   |
| Redundancy      | No   | Yes   |
| Info < Capacity | No   | Yes   |
Consider each pixel as one random variable.
7
![Page 13: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/13.jpg)
Self-Organizing Map (SOM)
[Figure: a 2D SOM layer over the input; the input x = (x1, x2) feeds every unit i, each with its own weight vector wi = (wi1, wi2).]
Kohonen (1982)
• 1-D, 2-D, or 3-D layout of units.
• One weight vector for each unit.
• Unsupervised learning (no target output).
9
![Page 14: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/14.jpg)
SOM Algorithm
[Figure: the same 2D SOM layer; the BMU and its lattice neighbors are updated toward the input x.]
1. Randomly initialize weight vectors wi
2. Randomly sample input vector x
3. Find Best Matching Unit (BMU):
i(x) = argminj‖x−wj‖
4. Update weight vectors:
wj ← wj + ηh(j, i(x))(x−wj)
η : learning rate
h(j, i(x)) : neighborhood function of BMU.
5. Repeat steps 2 – 4 (see the sketch below).
10
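A compact NumPy sketch of steps 1 to 4; the lattice size, η, σ, and fixed iteration count are illustrative, and in practice η and σ are usually decayed over time.

```python
import numpy as np

def train_som(X, rows=10, cols=10, eta=0.5, sigma=2.0, iters=5000, seed=0):
    """Train a 2-D SOM with a Gaussian neighborhood."""
    rng = np.random.default_rng(seed)
    w = rng.random((rows * cols, X.shape[1]))                 # 1. random weight vectors
    r = np.array([(i // cols, i % cols)
                  for i in range(rows * cols)], dtype=float)  # lattice coordinates r_j
    for _ in range(iters):
        x = X[rng.integers(len(X))]                           # 2. sample an input x
        bmu = np.argmin(np.linalg.norm(x - w, axis=1))        # 3. BMU: argmin_j ||x - w_j||
        h = np.exp(-np.sum((r - r[bmu]) ** 2, axis=1) / (2 * sigma ** 2))
        w += eta * h[:, None] * (x - w)                       # 4. w_j += η h(j, i(x)) (x - w_j)
    return w.reshape(rows, cols, -1)
```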
![Page 15: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/15.jpg)
SOM Learning
[Figure: SOM lattice and input space; each weight vector is plotted in the input space, and the winner wc moves toward the input along (x − wc).]
• Weight vectors can be plotted in the input space.
• Weight vectors move, not according to their proximity to the input
in the input space, but according to their proximity in the lattice.
11
![Page 16: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/16.jpg)
Self-Organizing Maps 7

Units have a neighborhood defined: mi is "between" mi−1 and mi+1, and all are updated together.

One-dim map (Kohonen, 1990):

Δml = η e(l, i)(xt − ml)

e(l, i) = exp(−(l − i)²/2σ²) / (√(2π) σ)
![Page 17: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/17.jpg)
Typical Neighborhood Functions

[Plot: Gaussian neighborhood exp(−(x² + y²)/2) over the lattice.]
• Gaussian: h(j, i(x)) = exp(−‖rj − ri(x)‖2/2σ2)
• Flat: h(j, i(x)) = 1 if ‖rj − ri(x)‖ ≤ σ, and 0 otherwise.
• σ is called the neighborhood radius.
• rj is the location of unit j on the lattice.
13
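The two neighborhood functions written out directly; rj and ri are the lattice coordinate vectors defined above, and the example positions are made up.

```python
import numpy as np

def h_gauss(r_j, r_i, sigma):
    """Gaussian neighborhood: smooth decay with lattice distance."""
    return np.exp(-np.sum((r_j - r_i) ** 2) / (2 * sigma ** 2))

def h_flat(r_j, r_i, sigma):
    """Flat neighborhood: 1 within radius sigma, 0 outside."""
    return 1.0 if np.linalg.norm(r_j - r_i) <= sigma else 0.0

# Units at lattice positions (0, 0) and (1, 2), neighborhood radius 2.
r_i, r_j = np.array([0.0, 0.0]), np.array([1.0, 2.0])
print(h_gauss(r_j, r_i, sigma=2.0), h_flat(r_j, r_i, sigma=2.0))
```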
![Page 18: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/18.jpg)
Radial-Basis Functions 8

Locally-tuned units:

pht = exp(−‖xt − mh‖²/2sh²)

yt = Σh=1..H wh pht + w0
![Page 19: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/19.jpg)
RBF
• Input x to p: the cluster centers mh and radii (spreads) sh are estimated (e.g., by k-means).
• p to output weights w can be calculated in one shot using the pseudo-inverse (output units are usually linear). There are n input vectors (each row of P holds the RBF activation values generated from one input vector), H RBF units, and m output units; the equations below are written for one output.
⎡ p11 p12 ⋯ p1H ⎤ ⎡ w1 ⎤   ⎡ y1 ⎤
⎢ p21 p22 ⋯ p2H ⎥ ⎢ w2 ⎥ = ⎢ y2 ⎥
⎢  ⋮   ⋮   ⋱  ⋮  ⎥ ⎢  ⋮ ⎥   ⎢  ⋮ ⎥
⎣ pn1 pn2 ⋯ pnH ⎦ ⎣ wH ⎦   ⎣ yn ⎦

Pw = y

w = P⁻¹y, if n = H
w = (PᵀP)⁻¹Pᵀy, if n > H
• Note: (PᵀP)⁻¹PᵀP = I by the definition of the inverse. The chain (PᵀP)⁻¹PᵀP = P⁻¹(Pᵀ)⁻¹PᵀP = P⁻¹P = I additionally requires P to be square and invertible.
• Other iterative methods also exist (see next few slides).
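A sketch of the one-shot solution, assuming the Gaussian units of the previous slide and a single linear output; np.linalg.lstsq solves the same least-squares problem as (PᵀP)⁻¹Pᵀy but avoids forming the inverse explicitly, which is numerically safer.

```python
import numpy as np

def rbf_weights(X, y, m, s):
    """Solve Pw = y for the output weights, given fixed centers m and radii s."""
    # P[t, h] = exp(-||x^t - m_h||^2 / (2 s_h^2)): the n x H activation matrix
    d2 = np.sum((X[:, None, :] - m[None, :, :]) ** 2, axis=2)
    P = np.exp(-d2 / (2 * s ** 2))
    w, *_ = np.linalg.lstsq(P, y, rcond=None)   # least-squares solution for w
    return w
```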
![Page 20: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/20.jpg)
Training RBF 10
Hybrid learning:
– First layer centers and spreads: unsupervised k-means
– Second layer weights: supervised gradient descent
Fully supervised: all parameters trained by gradient descent (see the Regression slide)
(Broomhead and Lowe, 1988; Moody and Darken, 1989)
![Page 21: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/21.jpg)
Regression 11

E({mh, sh, wih}h | X) = ½ Σt Σi (rti − yti)²

yti = Σh=1..H wih pht + wi0

Δwih = η Σt (rti − yti) pht

Δmhj = η Σt [Σi (rti − yti) wih] pht (xtj − mhj)/sh²

Δsh = η Σt [Σi (rti − yti) wih] pht ‖xt − mh‖²/sh³
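A sketch of one batch gradient step implementing these updates, simplified to a single output unit; the array shapes and in-place style are my choices, not the slide's.

```python
import numpy as np

def rbf_grad_step(X, r, m, s, w, w0, eta=0.01):
    """One gradient step on E = 1/2 * sum_t (r^t - y^t)^2 (all arrays must be float)."""
    d2 = np.sum((X[:, None, :] - m[None, :, :]) ** 2, axis=2)  # ||x^t - m_h||^2
    p = np.exp(-d2 / (2 * s ** 2))                             # p_h^t, shape (n, H)
    e = r - (p @ w + w0)                                       # r^t - y^t
    g = e[:, None] * w[None, :] * p                            # (r^t - y^t) w_h p_h^t
    w += eta * (p.T @ e)                                       # Δw_h
    w0 += eta * e.sum()                                        # Δw_0
    m += eta * np.sum(g[:, :, None] * (X[:, None, :] - m) / s[None, :, None] ** 2, axis=0)
    s += eta * np.sum(g * d2 / s ** 3, axis=0)                 # Δs_h
    return m, s, w, w0
```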
![Page 22: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/22.jpg)
Learning Vector Quantization 20
mi ← mi + η(xt − mi) if label(xt) = label(mi)
mi ← mi − η(xt − mi) otherwise

H units per class, prelabeled (Kohonen, 1990).

Given x, mi is the closest:

[Figure: input x, its closest center mi, and another center mj.]
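A sketch of this LVQ1-style update; the center array m and its labels are assumed prelabeled as on the slide, and η is illustrative.

```python
import numpy as np

def lvq_step(x, x_label, m, m_labels, eta=0.1):
    """Move the closest center toward x if labels agree, away otherwise."""
    i = np.argmin(np.linalg.norm(x - m, axis=1))  # closest center m_i
    if m_labels[i] == x_label:
        m[i] += eta * (x - m[i])                  # same class: attract
    else:
        m[i] -= eta * (x - m[i])                  # different class: repel
    return m
```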
![Page 23: Local Methods - Texas A&M University](https://reader035.vdocuments.us/reader035/viewer/2022071613/615731fd2f925653a00e3a60/html5/thumbnails/23.jpg)
References