
Internet Engineering
Jacek Mazurkiewicz, PhD

Softcomputing

Part 13: Radial Basis Function Networks

Contents
• Overview
• The Models of Function Approximator
• The Radial Basis Function Networks
• RBFN’s for Function Approximation
• The Projection Matrix
• Learning the Kernels
• Bias-Variance Dilemma
• The Effective Number of Parameters
• Model Selection

RBF

• Linear models have been studied in statistics for about 200 years, and the theory is applicable to RBF networks, which are just one particular type of linear model.

• However, the fashion for neural networks, which started in the mid-1980s, has given rise to new names for concepts already familiar to statisticians.

Typical Applications of NN

• Pattern Classification

• Function Approximation

• Time-Series Forecasting

Pattern classification: l = f(x), x ∈ X ⊆ R^n, l ∈ C = {1, 2, …, N}

Function approximation: y = f(x), x ∈ X ⊆ R^m, y ∈ Y ⊆ R^n

Time-series forecasting: x_t = f(x_{t−1}, x_{t−2}, x_{t−3})

Function Approximation

Unknown function f : X → Y, with X ⊆ R^m and Y ⊆ R^n

Approximator f̂ : X → Y

Linear Models

f(x) = Σ_{i=1}^{m} w_i φ_i(x)

φ_i : fixed basis functions
w_i : weights
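As a concrete sketch of such a linear model, the snippet below (the function names and the polynomial choice of basis are illustrative assumptions, not from the slides) evaluates f(x) = Σ_i w_i φ_i(x) as a matrix-vector product:

```python
import numpy as np

def design_matrix(x, m):
    """Evaluate the fixed basis functions phi_0..phi_{m-1} at each input.
    Here the basis is polynomial: phi_i(x) = x**i (an assumed example)."""
    return np.stack([x**i for i in range(m)], axis=1)   # shape (p, m)

def linear_model(x, w):
    """f(x) = sum_i w_i * phi_i(x), written as a matrix-vector product."""
    return design_matrix(x, len(w)) @ w

x = np.array([0.0, 1.0, 2.0])
w = np.array([1.0, 2.0, 3.0])      # f(x) = 1 + 2x + 3x^2
print(linear_model(x, w))          # [ 1.  6. 17.]
```

Because the basis is fixed, only the weights w enter linearly, which is what makes the later closed-form training possible.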

Weights

Linear Models

Inputs (feature vectors) → Hidden Units → Output Units

• Decomposition
• Feature Extraction
• Transformation

Linearly weighted output:

y = f(x) = Σ_{i=1}^{m} w_i φ_i(x)

[Diagram: inputs x = (x_1, x_2, …, x_n), hidden units φ_1, φ_2, …, φ_m, weights w_1, w_2, …, w_m, output y.]

Example Linear Models

• Polynomial

• Fourier Series

Polynomial: f(x) = Σ_i w_i x^i,  φ_i(x) = x^i, i = 0, 1, 2, …

Fourier series: f(x) = Σ_k w_k exp(j 2π k ω_0 x),  φ_k(x) = exp(j 2π k ω_0 x), k = 0, 1, 2, …


Single-Layer Perceptrons as Universal Approximators

With a sufficient number of sigmoidal hidden units, it can be a universal approximator.

f(x) = Σ_{i=1}^{m} w_i φ_i(x)

[Diagram: inputs x = (x_1, …, x_n), sigmoidal hidden units φ_1, …, φ_m, weights w_1, …, w_m, output y.]

Radial Basis Function Networks as Universal Approximators

With a sufficient number of radial-basis-function units, it can also be a universal approximator.

f(x) = Σ_{i=1}^{m} w_i φ_i(x)

[Diagram: inputs x = (x_1, …, x_n), radial-basis-function hidden units φ_1, …, φ_m, weights w_1, …, w_m, output y.]

Non-Linear Models

f(x) = Σ_{i=1}^{m} w_i φ_i(x)

Basis functions: adjusted by the learning process
w_i : weights

Radial Basis Functions

• Center

• Distance Measure

• Shape

Three parameters for a radial function:

x_i : center
r = ‖x − x_i‖ : distance measure
φ_i(x) = φ(‖x − x_i‖) : shape

Typical Radial Functions

• Gaussian

• Hardy Multiquadric (1971)

• Inverse Multiquadric

Gaussian: φ(r) = exp( −r² / (2σ²) ),  σ > 0, r ≥ 0

Hardy multiquadric: φ(r) = √(r² + c²),  c > 0, r ≥ 0

Inverse multiquadric: φ(r) = c / √(r² + c²),  c > 0, r ≥ 0
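The three radial functions above can be sketched in a few lines (the function names are assumptions for illustration); note that the Gaussian and the inverse multiquadric both equal 1 at r = 0 and decay with distance, while the multiquadric grows:

```python
import numpy as np

# Radial functions of the distance r >= 0; sigma, c > 0 are shape parameters.
def gaussian(r, sigma=1.0):
    return np.exp(-r**2 / (2 * sigma**2))

def multiquadric(r, c=1.0):             # Hardy (1971)
    return np.sqrt(r**2 + c**2)

def inverse_multiquadric(r, c=1.0):
    return c / np.sqrt(r**2 + c**2)     # normalized so phi(0) = 1

r = np.linspace(0.0, 10.0, 5)
print(gaussian(r, sigma=1.5))           # decays toward 0
print(inverse_multiquadric(r, c=3.0))   # starts at 1, decays slowly
```

Larger σ (or c) widens the region over which the kernel responds, matching the σ = 0.5, 1.0, 1.5 comparison on the next slide.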

Gaussian Basis Function (σ = 0.5, 1.0, 1.5)

φ(r) = exp( −r² / (2σ²) ),  σ > 0

[Plot: Gaussian curves for σ = 0.5, 1.0, 1.5.]

Inverse Multiquadric

φ(r) = c / √(r² + c²),  c > 0

[Plot: curves for c = 1, 2, 3, 4, 5 over r ∈ [−10, 10]; all peak at φ(0) = 1.]

Most General RBF

φ_i(x) = φ( (x − μ_i)^T Σ_i^{−1} (x − μ_i) )

[Figure: three kernels with centers μ_1, μ_2, μ_3.]

Basis {φ_i : i = 1, 2, …} is `near' orthogonal.

Properties of RBF’s

• On-Center, Off-Surround

• Analogies with localized receptive fields found in several biological structures, e.g.,
  – visual cortex;
  – ganglion cells

The Topology of RBF

[Diagram: feature vectors x_1, …, x_n enter the inputs; the hidden units perform a projection; the output units y_1, …, y_m perform an interpolation.]

As a function approximator.

The Topology of RBF

[Diagram: feature vectors x_1, …, x_n enter the inputs; the hidden units respond to subclasses; the output units y_1, …, y_m represent classes.]

As a pattern classifier.

• Radial basis function (RBF) networks are feed-forward networks trained using a supervised training algorithm.

• The activation function is selected from a class of functions called basis functions.

• They usually train much faster than BP.

• They are less susceptible to problems with non-stationary inputs.

• Popularized by Broomhead and Lowe (1988), and Moody and Darken (1989), RBF networks have proven to be a useful neural network architecture.

• The major difference between RBF and BP is the behavior of the single hidden layer.

• Rather than using the sigmoidal or S-shaped activation function as in BP, the hidden units in RBF networks use a Gaussian or some other basis kernel function.

The idea

[Plot: an unknown function y(x) to approximate, together with the training data.]

The idea

[Plot: the basis functions (kernels) and the function learned from the training data.]

y = f(x) = Σ_{i=1}^{m} w_i φ_i(x)

The idea

[Plot: the learned function, built from the kernels, evaluated at a non-training sample.]

y = f(x) = Σ_{i=1}^{m} w_i φ_i(x)

Radial Basis Function Networks as Universal Approximators

y = f(x) = Σ_{i=1}^{m} w_i φ_i(x)

[Diagram: inputs x = (x_1, x_2, …, x_n), weights w_1, w_2, …, w_m.]

Training set: T = { ( x^(k), y^(k) ) }_{k=1}^{p}

Goal: y^(k) ≈ f( x^(k) ) for all k

min SSE = Σ_{k=1}^{p} ( y^(k) − f(x^(k)) )²
        = Σ_{k=1}^{p} ( y^(k) − Σ_{i=1}^{m} w_i φ_i(x^(k)) )²

Learn the Optimal Weight Vector

[Diagram: inputs x = (x_1, x_2, …, x_n), weights w_1, w_2, …, w_m.]

Training set: T = { ( x^(k), y^(k) ) }_{k=1}^{p}

Goal: y^(k) ≈ f( x^(k) ) for all k, with y = f(x) = Σ_{i=1}^{m} w_i φ_i(x)

min SSE = Σ_{k=1}^{p} ( y^(k) − f(x^(k)) )² = Σ_{k=1}^{p} ( y^(k) − Σ_{i=1}^{m} w_i φ_i(x^(k)) )²

Regularization

Training set: T = { ( x^(k), y^(k) ) }_{k=1}^{p}

Goal: y^(k) ≈ f( x^(k) ) for all k

C = Σ_{k=1}^{p} ( y^(k) − f(x^(k)) )² + Σ_{i=1}^{m} λ_i w_i²,  λ_i ≥ 0

If regularization is unneeded, set λ_i = 0.

Learn the Optimal Weight Vector

Minimize  C = Σ_{k=1}^{p} ( y^(k) − f(x^(k)) )² + Σ_{i=1}^{m} λ_i w_i²

∂C/∂w_j = Σ_{k=1}^{p} 2 ( y^(k) − f(x^(k)) ) ( −φ_j(x^(k)) ) + 2 λ_j w_j

        = −2 Σ_{k=1}^{p} ( y^(k) − f(x^(k)) ) φ_j(x^(k)) + 2 λ_j w_j = 0

⇒ Σ_{k=1}^{p} f*(x^(k)) φ_j(x^(k)) + λ_j w_j* = Σ_{k=1}^{p} y^(k) φ_j(x^(k))

Learn the Optimal Weight Vector

Define
φ_j = ( φ_j(x^(1)), …, φ_j(x^(p)) )^T
f* = ( f*(x^(1)), …, f*(x^(p)) )^T
y = ( y^(1), …, y^(p) )^T

Then Σ_{k=1}^{p} f*(x^(k)) φ_j(x^(k)) + λ_j w_j* = Σ_{k=1}^{p} y^(k) φ_j(x^(k)) becomes

φ_j^T f* + λ_j w_j* = φ_j^T y,  j = 1, …, m

Learn the Optimal Weight Vector

φ_1^T f* + λ_1 w_1* = φ_1^T y
φ_2^T f* + λ_2 w_2* = φ_2^T y
  ⋮
φ_m^T f* + λ_m w_m* = φ_m^T y

Define
Φ = ( φ_1, φ_2, …, φ_m )
Λ = diag( λ_1, λ_2, …, λ_m )
w* = ( w_1*, w_2*, …, w_m* )^T

⇒ Φ^T f* + Λ w* = Φ^T y

Learn the Optimal Weight Vector

Φ^T f* + Λ w* = Φ^T y

f* = ( f*(x^(1)), f*(x^(2)), …, f*(x^(p)) )^T, where f*(x^(k)) = Σ_{i=1}^{m} w_i* φ_i(x^(k)), i.e.

f* = [ φ_1(x^(1))  φ_2(x^(1))  …  φ_m(x^(1)) ] [ w_1* ]
     [ φ_1(x^(2))  φ_2(x^(2))  …  φ_m(x^(2)) ] [ w_2* ]  = Φ w*
     [      ⋮            ⋮              ⋮     ] [   ⋮  ]
     [ φ_1(x^(p))  φ_2(x^(p))  …  φ_m(x^(p)) ] [ w_m* ]

Substituting f* = Φ w*:

Φ^T Φ w* + Λ w* = Φ^T y
( Φ^T Φ + Λ ) w* = Φ^T y
w* = ( Φ^T Φ + Λ )^{−1} Φ^T y = A^{−1} Φ^T y,  A = Φ^T Φ + Λ

A^{−1} : variance matrix
Φ : design matrix
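The closed form above is a one-liner in practice. The sketch below (the target function, centers, width, and λ are assumed choices for illustration) builds a Gaussian design matrix and solves w* = (ΦᵀΦ + Λ)⁻¹Φᵀy:

```python
import numpy as np

x = np.linspace(-1, 1, 20)                  # training inputs
y = np.sin(np.pi * x)                       # training targets (assumed example)

centers = np.linspace(-1, 1, 5)             # kernel centers mu_i (assumed)
sigma = 0.4                                 # common width (assumed)
# Design matrix: Phi[k, i] = phi_i(x^(k)) for Gaussian kernels.
Phi = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * sigma**2))

lam = 1e-3                                  # Lambda = lam * I (uniform penalty)
A = Phi.T @ Phi + lam * np.eye(len(centers))
w = np.linalg.solve(A, Phi.T @ y)           # w* = A^{-1} Phi^T y

sse = float(np.sum((y - Phi @ w)**2))
print(sse)                                  # small residual on the training set
```

Solving the linear system with `np.linalg.solve` avoids forming A⁻¹ explicitly, which is the numerically preferred route.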

The Empirical-Error Vector

w* = A^{−1} Φ^T y

[Diagram: each training input x^(k) = (x_1^(k), …, x_n^(k)) is fed both to the unknown function, giving y, and through the kernels φ_1, …, φ_m weighted by w_1*, …, w_m*, giving f* = Φ w*.]


Error Vector

e = y − f* = y − Φ w* = y − Φ A^{−1} Φ^T y = ( I_p − Φ A^{−1} Φ^T ) y = P y

Sum-Squared-Error

e = P y,  P = I_p − Φ A^{−1} Φ^T,  A = Φ^T Φ + Λ,  Φ = ( φ_1, φ_2, …, φ_m )

SSE = Σ_{k=1}^{p} ( y^(k) − f*(x^(k)) )² = e^T e = ( P y )^T ( P y ) = y^T P² y

If Λ = 0, the RBFN’s learning algorithm minimizes the SSE (MSE).

The Projection Matrix

e = P y,  P = I_p − Φ A^{−1} Φ^T,  A = Φ^T Φ + Λ,  Φ = ( φ_1, φ_2, …, φ_m )

When Λ = 0, e ⊥ span( φ_1, φ_2, …, φ_m ) and P( P y ) = P e = e = P y, so

SSE = y^T P² y = y^T P y

RBFN’s as Universal Approximators

[Diagram: inputs x_1, …, x_n; kernels φ_1, φ_2, …, φ_m; outputs y_1, …, y_l with weights w_11, w_12, …, w_1m, …, w_l1, w_l2, …, w_lm.]

Training set: T = { ( x^(k), y^(k) ) }_{k=1}^{p}

Kernels: φ_j(x) = exp( −‖x − μ_j‖² / (2σ_j²) )

What to Learn?

• Weights w_ij’s
• Centers μ_j’s of the φ_j’s
• Widths σ_j’s of the φ_j’s
• Number of φ_j’s − model selection

One-Stage Learning

Gradient updates for the Gaussian kernels:

Δw_ij = η_1 ( y_i^(k) − f_i(x^(k)) ) φ_j(x^(k))

Δμ_j = η_2 Σ_{i=1}^{l} w_ij ( y_i^(k) − f_i(x^(k)) ) φ_j(x^(k)) ( x^(k) − μ_j ) / σ_j²

Δσ_j = η_3 Σ_{i=1}^{l} w_ij ( y_i^(k) − f_i(x^(k)) ) φ_j(x^(k)) ‖x^(k) − μ_j‖² / σ_j³

One-Stage Learning

The simultaneous update of all three sets of parameters may be suitable for non-stationary environments or on-line settings.

Two-Stage Training

Step 1: Determine
• Centers μ_j’s of the φ_j’s
• Widths σ_j’s of the φ_j’s
• Number of φ_j’s

Step 2: Determine the w_ij’s, e.g., using batch learning.

Train the Kernels

Unsupervised Training

Methods
• Subset Selection
  – Random Subset Selection
  – Forward Selection
  – Backward Elimination
• Clustering Algorithms
  – K-means
  – LVQ
• Mixture Models
  – GMM
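As a sketch of the two-stage recipe, the snippet below picks centers with a tiny 1-D K-means (stage 1) and then solves the output weights by batch least squares (stage 2); the data, width heuristic, and all names are illustrative assumptions:

```python
import numpy as np

def kmeans_1d(x, k, iters=20):
    """Tiny 1-D K-means: quantile initialization, then alternate
    assignment and center updates (empty clusters keep their center)."""
    centers = np.quantile(x, np.linspace(0, 1, k))
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        centers = np.array([x[labels == j].mean() if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers

x = np.linspace(0, 1, 30)
y = np.cos(2 * np.pi * x)                   # assumed target

mu = kmeans_1d(x, k=6)                      # stage 1: centers from clustering
sigma = 0.15                                # width (assumed heuristic)
Phi = np.exp(-(x[:, None] - mu[None, :])**2 / (2 * sigma**2))
w = np.linalg.lstsq(Phi, y, rcond=None)[0]  # stage 2: batch least squares

err = float(np.max(np.abs(Phi @ w - y)))
print(err)                                  # small fitting error
```

The point of the two-stage split is that once the kernels are frozen, stage 2 is an ordinary linear least-squares problem with a unique global solution.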

Subset Selection

Random Subset Selection

• Randomly choose a subset of points from the training set.

• Sensitive to the initially chosen points.

• Use adaptive techniques to tune
  – Centers
  – Widths
  – #points

Clustering Algorithms

[Figure: four stages (1–4) of a clustering algorithm placing centers among the data.]

Questions - How should the user choose the kernel?

• The problem is similar to selecting features for other learning algorithms.
  ➢ Poor choice: learning is made very difficult.
  ➢ Good choice: even poor learners can succeed.

• The requirement on the user is thus critical.
  – Can this requirement be lessened?
  – Is a more automatic selection of features possible?

Goal Revisit

• Ultimate Goal − Generalization: minimize the prediction error.

• Goal of Our Learning Procedure: minimize the empirical error.

Badness of Fit

• Underfitting
  – A model (e.g., a network) that is not sufficiently complex can fail to fully detect the signal in a complicated data set, leading to underfitting.
  – Produces excessive bias in the outputs.

• Overfitting
  – A model (e.g., a network) that is too complex may fit the noise, not just the signal, leading to overfitting.
  – Produces excessive variance in the outputs.

Underfitting/Overfitting Avoidance

• Model selection
• Jittering
• Early stopping
• Weight decay
  – Regularization
  – Ridge regression
• Bayesian learning
• Combining networks

Best Way to Avoid Overfitting

• Use lots of training data, e.g.,
  – 30 times as many training cases as there are weights in the network;
  – for noise-free data, 5 times as many training cases as weights may be sufficient.

• Don’t arbitrarily reduce the number of weights for fear of underfitting.

Badness of Fit

[Figure: example fits illustrating underfit and overfit.]

Bias-Variance Dilemma

           Underfit         Overfit
Bias       Large bias       Small bias
Variance   Small variance   Large variance

Bias-Variance Dilemma

More on overfitting

• Overfitting easily leads to predictions that are far beyond the range of the training data.

• It can produce wild predictions in multilayer perceptrons, even with noise-free data.

Bias-Variance Dilemma

[Figure: as the model fits more, it moves from underfitting (large bias) toward overfitting (large variance).]

Bias-Variance Dilemma

[Figure: sets of functions, e.g., depending on the number of hidden nodes used; the true model with noise; the bias; and the solutions obtained with training sets 1, 2, and 3. The spread of these solutions is the variance.]


The true model: y(x) = g(x) + ε;  f(x) is the learned approximation.

Bias-Variance Dilemma

E[ ( y − f(x) )² ] = E[ ( g(x) + ε − f(x) )² ]
  = E[ ( g(x) − f(x) )² + 2 ε ( g(x) − f(x) ) + ε² ]
  = E[ ( g(x) − f(x) )² ] + 2 E[ ε ( g(x) − f(x) ) ] + E[ ε² ]

The middle term is 0 and E[ε²] is a constant, so

Goal: min E[ ( y − f(x) )² ]  ⇔  min E[ ( g(x) − f(x) )² ]

Bias-Variance Dilemma

E[ ( g(x) − f(x) )² ] = E[ ( g(x) − E[f(x)] + E[f(x)] − f(x) )² ]

  = E[ ( g(x) − E[f(x)] )² ] + E[ ( E[f(x)] − f(x) )² ]
    − 2 E[ ( g(x) − E[f(x)] ) ( f(x) − E[f(x)] ) ]

The cross term vanishes:

E[ ( g(x) − E[f(x)] ) ( f(x) − E[f(x)] ) ]
  = E[ g(x) f(x) ] − E[ g(x) E[f(x)] ] − E[ E[f(x)] f(x) ] + E[ E[f(x)] E[f(x)] ]
  = g(x) E[f(x)] − g(x) E[f(x)] − E[f(x)] E[f(x)] + E[f(x)] E[f(x)] = 0

Bias-Variance Dilemma

E[ ( y − f(x) )² ] = E[ ε² ] + E[ ( g(x) − f(x) )² ]
  = E[ ε² ] + ( g(x) − E[f(x)] )² + E[ ( f(x) − E[f(x)] )² ]
  =  noise  +  bias²  +  variance

The noise cannot be minimized; minimize both bias² and variance.
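The decomposition can be checked numerically. The sketch below uses a deliberately simple setup (g(x) = x², Gaussian noise, a degree-1 polynomial fit, and a single test point, all assumptions) and estimates each term by Monte Carlo over many training sets:

```python
import numpy as np

rng = np.random.default_rng(1)
g = lambda x: x**2                      # true model g(x) (assumed)
sig_eps = 0.1                           # noise std (assumed)
x_train = np.linspace(-1, 1, 10)
x0 = 0.5                                # test point

# Train a degree-1 polynomial on many independent training sets.
preds = []
for _ in range(2000):
    y_train = g(x_train) + rng.normal(0, sig_eps, x_train.shape)
    coef = np.polyfit(x_train, y_train, deg=1)
    preds.append(np.polyval(coef, x0))
preds = np.array(preds)

bias2 = float((g(x0) - preds.mean())**2)    # (g(x0) - E[f(x0)])^2
variance = float(preds.var())               # E[(f(x0) - E[f(x0)])^2]
# Left-hand side: E[(y - f)^2] with fresh, independent noise draws.
lhs = float(np.mean((g(x0) + rng.normal(0, sig_eps, preds.shape) - preds)**2))
rhs = sig_eps**2 + bias2 + variance
print(abs(lhs - rhs))                       # close to 0
```

The linear model cannot represent x², so bias² stays bounded away from 0 no matter how many training sets are averaged; only a richer model class would reduce it.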

Model Complexity vs. Bias-Variance

E[ ( y − f(x) )² ] = E[ ε² ] + ( g(x) − E[f(x)] )² + E[ ( f(x) − E[f(x)] )² ]
                  =  noise  +  bias²  +  variance

[Figure: as the model complexity (capacity) grows, bias² falls while variance rises.]

Example (Polynomial Fits)

y = x + 2 exp( −16x² )

[Figure: polynomial fits of degree 1, 5, 10, and 15 to noisy samples of this function.]
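The example can be reproduced in a few lines; the sample size and noise level below are assumptions for illustration. Because the models are nested, the training SSE can only shrink as the degree grows, even though the high-degree fits are chasing noise:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-2, 2, 40)
# Noisy samples of the slide's target y = x + 2*exp(-16*x^2).
y = x + 2 * np.exp(-16 * x**2) + rng.normal(0, 0.2, x.shape)

sses = {}
for deg in (1, 5, 10, 15):
    # Polynomial.fit rescales the domain internally, keeping high
    # degrees well conditioned.
    p = np.polynomial.Polynomial.fit(x, y, deg)
    sses[deg] = float(np.sum((p(x) - y)**2))
    print(deg, round(sses[deg], 3))   # training SSE decreases with degree
```

Comparing these training errors with held-out errors (as in the cross-validation slides later) is what exposes the degree-15 fit as overfitting.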

Variance Estimation

Mean: μ
Variance: σ² = (1/p) Σ_{i=1}^{p} ( x_i − μ )²

In general, μ is not available.

Variance Estimation

Mean: μ̂ = x̄ = (1/p) Σ_{i=1}^{p} x_i

Variance: σ̂² = s² = 1/(p−1) Σ_{i=1}^{p} ( x_i − x̄ )²  ← loses 1 degree of freedom

Simple Linear Regression

y = β_0 + β_1 x + ε,  ε ~ N(0, σ²)

[Figure: data points scattered around the true line y = β_0 + β_1 x.]

Simple Linear Regression

ŷ = β̂_0 + β̂_1 x, fitted to data from y = β_0 + β_1 x + ε,  ε ~ N(0, σ²)

[Figure: data points and the fitted line.]

Mean Squared Error (MSE)

ŷ = β̂_0 + β̂_1 x, fitted to data from y = β_0 + β_1 x + ε,  ε ~ N(0, σ²)

σ̂² = MSE = SSE / (p − 2)  ← loses 2 degrees of freedom

Variance Estimation

σ̂² = MSE = SSE / (p − m)  ← loses m degrees of freedom

m: the number of parameters of the model

The Number of Parameters

y = f(x) = Σ_{i=1}^{m} w_i φ_i(x)

[Diagram: inputs x = (x_1, …, x_n), weights w_1, …, w_m.]

#degrees of freedom: m

The Effective Number of Parameters (γ)

y = f(x) = Σ_{i=1}^{m} w_i φ_i(x)

The projection matrix: P = I_p − Φ A^{−1} Φ^T,  A = Φ^T Φ + Λ

γ = p − trace( P );  when Λ = 0, γ = m.

The Effective Number of Parameters (γ)

The projection matrix: P = I_p − Φ A^{−1} Φ^T,  A = Φ^T Φ + Λ

Pf) When Λ = 0:

trace( P ) = trace( I_p − Φ A^{−1} Φ^T )
           = p − trace( Φ ( Φ^T Φ )^{−1} Φ^T )
           = p − trace( ( Φ^T Φ )^{−1} Φ^T Φ )
           = p − trace( I_m )
           = p − m

Hence γ = p − trace( P ) = m.
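The identity γ = p − trace(P) = m for Λ = 0, and the drop below m under regularization, can be verified numerically; the Gaussian design matrix below is an assumed example:

```python
import numpy as np

p, m = 50, 8
x = np.linspace(-1, 1, p)
centers = np.linspace(-1, 1, m)
Phi = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * 0.3**2))

def gamma(lam):
    """Effective number of parameters: gamma = p - trace(P),
    with P = I_p - Phi A^{-1} Phi^T and A = Phi^T Phi + lam*I."""
    A = Phi.T @ Phi + lam * np.eye(m)
    P = np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)
    return float(p - np.trace(P))

print(round(gamma(0.0), 3))   # equals m = 8 when Lambda = 0
print(gamma(1.0) < m)         # a positive penalty pushes gamma below m
```

In spectral terms, γ = Σ_i s_i²/(s_i² + λ) over the singular values s_i of Φ, which makes the shrinkage below m explicit.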

Regularization

Cost = Empirical Error + Model’s Penalty:

C = Σ_{k=1}^{p} ( y^(k) − f(x^(k)) )² + Σ_{i=1}^{m} λ_i w_i²
  = Σ_{k=1}^{p} ( y^(k) − Σ_{i=1}^{m} w_i φ_i(x^(k)) )² + Σ_{i=1}^{m} λ_i w_i²

The first term is the SSE; the second penalizes models with large weights.

Regularization

Without the penalty (λ_i = 0), there are m degrees of freedom to minimize the SSE (cost): the effective number of parameters γ = m.

Regularization

With the penalty (λ_i > 0), the liberty to minimize the SSE is reduced: the effective number of parameters γ < m.

Variance Estimation

σ̂² = MSE = SSE / (p − γ)  ← loses γ degrees of freedom

Variance Estimation

σ̂² = MSE = SSE / trace( P ),  since p − γ = trace( P )

Model Selection

• Goal
  – Choose the fittest model
• Criteria
  – Least prediction error
• Main Tools (estimating model fitness)
  – Cross-validation
  – Projection matrix
• Methods
  – Weight decay (ridge regression)
  – Pruning and growing RBFN’s

Empirical Error vs. Model Fitness

• Ultimate Goal − Generalization: minimize the prediction error (MSE).

• Goal of Our Learning Procedure: minimize the empirical error.

Estimating Prediction Error

• When you have plenty of data, use independent test sets.
  – E.g., use the same training set to train different models, and choose the best model by comparing on the test set.

• When data is scarce, use
  – Cross-Validation
  – Bootstrap

Cross Validation

• Simplest and most widely used method for estimating prediction error.

• Partition the original set in several different ways and compute an average score over the different partitions, e.g.,
  – K-fold Cross-Validation
  – Leave-One-Out Cross-Validation
  – Generalized Cross-Validation
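A minimal K-fold sketch for choosing the number of kernels m (the data, widths, and evenly spaced centers are all assumed settings):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.shape)

def cv_error(m, K=5, sigma=0.2):
    """Average held-out MSE of an RBF fit with m evenly spaced centers."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, K)
    mu = np.linspace(0, 1, m)
    phi = lambda z: np.exp(-(z[:, None] - mu[None, :])**2 / (2 * sigma**2))
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = np.linalg.lstsq(phi(x[train]), y[train], rcond=None)[0]
        errs.append(np.mean((phi(x[test]) @ w - y[test])**2))
    return float(np.mean(errs))

scores = {m: cv_error(m) for m in (2, 5, 10, 20)}
print(min(scores, key=scores.get))   # the m with the lowest held-out error
```

Each candidate model is trained on K−1 folds and scored on the remaining one, so the selected m reflects prediction error rather than empirical error.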

Comparison of RBF and MLP

MLP                              | RBF networks
Global hyperplane                | Local receptive field
EBP                              | LMS
Local minima                     | Serious local minima
Smaller number of hidden neurons | Larger number of hidden neurons
Shorter computation time         | Longer computation time
Longer learning time             | Shorter learning time

Comparison of RBF and MLP

                        | RBF networks                  | MLP
Learning speed          | Very fast                     | Very slow
Convergence             | Almost guaranteed             | Not guaranteed
Response time           | Slow                          | Fast
Memory requirement      | Very large                    | Small
Hardware implementation | IBM ZISC036, Nestor Ni1000 (www-5.ibm.com/fr/cdlab/zisc.html) | Voice Direct 364 (www.sensoryinc.com)
Generalization          | Usually better                | Usually poorer
