
Internet Engineering
Jacek Mazurkiewicz, PhD

Softcomputing

Part 13: Radial Basis Function Networks

Contents
• Overview
• The Models of Function Approximator
• The Radial Basis Function Networks
• RBFN’s for Function Approximation
• The Projection Matrix
• Learning the Kernels
• Bias-Variance Dilemma
• The Effective Number of Parameters
• Model Selection

RBF

• Linear models have been studied in statistics for about 200 years, and the theory is applicable to RBF networks, which are just one particular type of linear model.

• However, the fashion for neural networks, which started in the mid-1980s, has given rise to new names for concepts already familiar to statisticians.

Typical Applications of NN

• Pattern Classification

• Function Approximation

• Time-Series Forecasting

Pattern classification: l = f(x), x ∈ X ⊆ R^n, l ∈ C = {1, 2, …, N}

Function approximation: y = f(x), x ∈ X ⊆ R^m, y ∈ Y ⊆ R^n

Time-series forecasting: x_t = f(x_{t−1}, x_{t−2}, x_{t−3})

Function Approximation

Unknown function f : X → Y, with X ⊆ R^m and Y ⊆ R^n

Approximator f̂ : X → Y

Linear Models

f(x) = Σ_{i=1}^{m} w_i φ_i(x)

φ_i : fixed basis functions
w_i : weights
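As a concrete sketch of such a linear model, the snippet below (the function names and the polynomial choice of basis are illustrative assumptions, not from the slides) evaluates f(x) = Σ_i w_i φ_i(x) as a matrix-vector product:

```python
import numpy as np

def design_matrix(x, m):
    """Evaluate the fixed basis functions phi_0..phi_{m-1} at each input.
    Here the basis is polynomial: phi_i(x) = x**i (an assumed example)."""
    return np.stack([x**i for i in range(m)], axis=1)   # shape (p, m)

def linear_model(x, w):
    """f(x) = sum_i w_i * phi_i(x), written as a matrix-vector product."""
    return design_matrix(x, len(w)) @ w

x = np.array([0.0, 1.0, 2.0])
w = np.array([1.0, 2.0, 3.0])      # f(x) = 1 + 2x + 3x^2
print(linear_model(x, w))          # [ 1.  6. 17.]
```

Because the basis is fixed, only the weights w enter linearly, which is what makes the later closed-form training possible.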

Weights

Linear Models

Inputs (feature vectors) → Hidden Units → Output Units

• Decomposition
• Feature Extraction
• Transformation

Linearly weighted output:

y = f(x) = Σ_{i=1}^{m} w_i φ_i(x)

[Diagram: inputs x = (x_1, x_2, …, x_n), hidden units φ_1, φ_2, …, φ_m, weights w_1, w_2, …, w_m, output y.]

Example Linear Models

• Polynomial

• Fourier Series

Polynomial: f(x) = Σ_i w_i x^i,  φ_i(x) = x^i, i = 0, 1, 2, …

Fourier series: f(x) = Σ_k w_k exp(j 2π k ω_0 x),  φ_k(x) = exp(j 2π k ω_0 x), k = 0, 1, 2, …


Single-Layer Perceptrons as Universal Approximators

With a sufficient number of sigmoidal hidden units, it can be a universal approximator.

f(x) = Σ_{i=1}^{m} w_i φ_i(x)

[Diagram: inputs x = (x_1, …, x_n), sigmoidal hidden units φ_1, …, φ_m, weights w_1, …, w_m, output y.]

Radial Basis Function Networks as Universal Approximators

With a sufficient number of radial-basis-function units, it can also be a universal approximator.

f(x) = Σ_{i=1}^{m} w_i φ_i(x)

[Diagram: inputs x = (x_1, …, x_n), radial-basis-function hidden units φ_1, …, φ_m, weights w_1, …, w_m, output y.]

Non-Linear Models

f(x) = Σ_{i=1}^{m} w_i φ_i(x)

Basis functions: adjusted by the learning process
w_i : weights

Radial Basis Functions

• Center

• Distance Measure

• Shape

Three parameters for a radial function:

x_i : center
r = ‖x − x_i‖ : distance measure
φ_i(x) = φ(‖x − x_i‖) : shape

Typical Radial Functions

• Gaussian

• Hardy Multiquadric (1971)

• Inverse Multiquadric

Gaussian: φ(r) = exp( −r² / (2σ²) ),  σ > 0, r ≥ 0

Hardy multiquadric: φ(r) = √(r² + c²),  c > 0, r ≥ 0

Inverse multiquadric: φ(r) = c / √(r² + c²),  c > 0, r ≥ 0
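The three radial functions above can be sketched in a few lines (the function names are assumptions for illustration); note that the Gaussian and the inverse multiquadric both equal 1 at r = 0 and decay with distance, while the multiquadric grows:

```python
import numpy as np

# Radial functions of the distance r >= 0; sigma, c > 0 are shape parameters.
def gaussian(r, sigma=1.0):
    return np.exp(-r**2 / (2 * sigma**2))

def multiquadric(r, c=1.0):             # Hardy (1971)
    return np.sqrt(r**2 + c**2)

def inverse_multiquadric(r, c=1.0):
    return c / np.sqrt(r**2 + c**2)     # normalized so phi(0) = 1

r = np.linspace(0.0, 10.0, 5)
print(gaussian(r, sigma=1.5))           # decays toward 0
print(inverse_multiquadric(r, c=3.0))   # starts at 1, decays slowly
```

Larger σ (or c) widens the region over which the kernel responds, matching the σ = 0.5, 1.0, 1.5 comparison on the next slide.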

Gaussian Basis Function (σ = 0.5, 1.0, 1.5)

φ(r) = exp( −r² / (2σ²) ),  σ > 0

[Plot: Gaussian curves for σ = 0.5, 1.0, 1.5.]

Inverse Multiquadric

φ(r) = c / √(r² + c²),  c > 0

[Plot: curves for c = 1, 2, 3, 4, 5 over r ∈ [−10, 10]; all peak at φ(0) = 1.]

Most General RBF

φ_i(x) = φ( (x − μ_i)^T Σ_i^{−1} (x − μ_i) )

[Figure: three kernels with centers μ_1, μ_2, μ_3.]

Basis {φ_i : i = 1, 2, …} is `near' orthogonal.

Properties of RBF’s

• On-Center, Off-Surround

• Analogies with localized receptive fields found in several biological structures, e.g.,
  – visual cortex;
  – ganglion cells

The Topology of RBF

[Diagram: feature vectors x_1, …, x_n enter the inputs; the hidden units perform a projection; the output units y_1, …, y_m perform an interpolation.]

As a function approximator.

The Topology of RBF

[Diagram: feature vectors x_1, …, x_n enter the inputs; the hidden units respond to subclasses; the output units y_1, …, y_m represent classes.]

As a pattern classifier.

• Radial basis function (RBF) networks are feed-forward networks trained using a supervised training algorithm.

• The activation function is selected from a class of functions called basis functions.

• They usually train much faster than BP.

• They are less susceptible to problems with non-stationary inputs.

• Popularized by Broomhead and Lowe (1988), and Moody and Darken (1989), RBF networks have proven to be a useful neural network architecture.

• The major difference between RBF and BP is the behavior of the single hidden layer.

• Rather than using the sigmoidal or S-shaped activation function as in BP, the hidden units in RBF networks use a Gaussian or some other basis kernel function.

The idea

[Plot: an unknown function y(x) to approximate, together with the training data.]

The idea

[Plot: the basis functions (kernels) and the function learned from the training data.]

y = f(x) = Σ_{i=1}^{m} w_i φ_i(x)

The idea

[Plot: the learned function, built from the kernels, evaluated at a non-training sample.]

y = f(x) = Σ_{i=1}^{m} w_i φ_i(x)

Radial Basis Function Networks as Universal Approximators

y = f(x) = Σ_{i=1}^{m} w_i φ_i(x)

[Diagram: inputs x = (x_1, x_2, …, x_n), weights w_1, w_2, …, w_m.]

Training set: T = { ( x^(k), y^(k) ) }_{k=1}^{p}

Goal: y^(k) ≈ f( x^(k) ) for all k

min SSE = Σ_{k=1}^{p} ( y^(k) − f(x^(k)) )²
        = Σ_{k=1}^{p} ( y^(k) − Σ_{i=1}^{m} w_i φ_i(x^(k)) )²

Learn the Optimal Weight Vector

[Diagram: inputs x = (x_1, x_2, …, x_n), weights w_1, w_2, …, w_m.]

Training set: T = { ( x^(k), y^(k) ) }_{k=1}^{p}

Goal: y^(k) ≈ f( x^(k) ) for all k, with y = f(x) = Σ_{i=1}^{m} w_i φ_i(x)

min SSE = Σ_{k=1}^{p} ( y^(k) − f(x^(k)) )² = Σ_{k=1}^{p} ( y^(k) − Σ_{i=1}^{m} w_i φ_i(x^(k)) )²

Regularization

Training set: T = { ( x^(k), y^(k) ) }_{k=1}^{p}

Goal: y^(k) ≈ f( x^(k) ) for all k

C = Σ_{k=1}^{p} ( y^(k) − f(x^(k)) )² + Σ_{i=1}^{m} λ_i w_i²,  λ_i ≥ 0

If regularization is unneeded, set λ_i = 0.

Learn the Optimal Weight Vector

Minimize  C = Σ_{k=1}^{p} ( y^(k) − f(x^(k)) )² + Σ_{i=1}^{m} λ_i w_i²

∂C/∂w_j = Σ_{k=1}^{p} 2 ( y^(k) − f(x^(k)) ) ( −φ_j(x^(k)) ) + 2 λ_j w_j

        = −2 Σ_{k=1}^{p} ( y^(k) − f(x^(k)) ) φ_j(x^(k)) + 2 λ_j w_j = 0

⇒ Σ_{k=1}^{p} f*(x^(k)) φ_j(x^(k)) + λ_j w_j* = Σ_{k=1}^{p} y^(k) φ_j(x^(k))

Learn the Optimal Weight Vector

Define
φ_j = ( φ_j(x^(1)), …, φ_j(x^(p)) )^T
f* = ( f*(x^(1)), …, f*(x^(p)) )^T
y = ( y^(1), …, y^(p) )^T

Then Σ_{k=1}^{p} f*(x^(k)) φ_j(x^(k)) + λ_j w_j* = Σ_{k=1}^{p} y^(k) φ_j(x^(k)) becomes

φ_j^T f* + λ_j w_j* = φ_j^T y,  j = 1, …, m

Learn the Optimal Weight Vector

φ_1^T f* + λ_1 w_1* = φ_1^T y
φ_2^T f* + λ_2 w_2* = φ_2^T y
  ⋮
φ_m^T f* + λ_m w_m* = φ_m^T y

Define
Φ = ( φ_1, φ_2, …, φ_m )
Λ = diag( λ_1, λ_2, …, λ_m )
w* = ( w_1*, w_2*, …, w_m* )^T

⇒ Φ^T f* + Λ w* = Φ^T y

Learn the Optimal Weight Vector

Φ^T f* + Λ w* = Φ^T y

f* = ( f*(x^(1)), f*(x^(2)), …, f*(x^(p)) )^T, where f*(x^(k)) = Σ_{i=1}^{m} w_i* φ_i(x^(k)), i.e.

f* = [ φ_1(x^(1))  φ_2(x^(1))  …  φ_m(x^(1)) ] [ w_1* ]
     [ φ_1(x^(2))  φ_2(x^(2))  …  φ_m(x^(2)) ] [ w_2* ]  = Φ w*
     [      ⋮            ⋮              ⋮     ] [   ⋮  ]
     [ φ_1(x^(p))  φ_2(x^(p))  …  φ_m(x^(p)) ] [ w_m* ]

Substituting f* = Φ w*:

Φ^T Φ w* + Λ w* = Φ^T y
( Φ^T Φ + Λ ) w* = Φ^T y
w* = ( Φ^T Φ + Λ )^{−1} Φ^T y = A^{−1} Φ^T y,  A = Φ^T Φ + Λ

A^{−1} : variance matrix
Φ : design matrix
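The closed form above is a one-liner in practice. The sketch below (the target function, centers, width, and λ are assumed choices for illustration) builds a Gaussian design matrix and solves w* = (ΦᵀΦ + Λ)⁻¹Φᵀy:

```python
import numpy as np

x = np.linspace(-1, 1, 20)                  # training inputs
y = np.sin(np.pi * x)                       # training targets (assumed example)

centers = np.linspace(-1, 1, 5)             # kernel centers mu_i (assumed)
sigma = 0.4                                 # common width (assumed)
# Design matrix: Phi[k, i] = phi_i(x^(k)) for Gaussian kernels.
Phi = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * sigma**2))

lam = 1e-3                                  # Lambda = lam * I (uniform penalty)
A = Phi.T @ Phi + lam * np.eye(len(centers))
w = np.linalg.solve(A, Phi.T @ y)           # w* = A^{-1} Phi^T y

sse = float(np.sum((y - Phi @ w)**2))
print(sse)                                  # small residual on the training set
```

Solving the linear system with `np.linalg.solve` avoids forming A⁻¹ explicitly, which is the numerically preferred route.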

The Empirical-Error Vector

w* = A^{−1} Φ^T y

[Diagram: each training input x^(k) = (x_1^(k), …, x_n^(k)) is fed both to the unknown function, giving y, and through the kernels φ_1, …, φ_m weighted by w_1*, …, w_m*, giving f* = Φ w*.]


Error Vector

e = y − f* = y − Φ w* = y − Φ A^{−1} Φ^T y = ( I_p − Φ A^{−1} Φ^T ) y = P y

Sum-Squared-Error

e = P y,  P = I_p − Φ A^{−1} Φ^T,  A = Φ^T Φ + Λ,  Φ = ( φ_1, φ_2, …, φ_m )

SSE = Σ_{k=1}^{p} ( y^(k) − f*(x^(k)) )² = e^T e = ( P y )^T ( P y ) = y^T P² y

If Λ = 0, the RBFN’s learning algorithm minimizes the SSE (MSE).

The Projection Matrix

e = P y,  P = I_p − Φ A^{−1} Φ^T,  A = Φ^T Φ + Λ,  Φ = ( φ_1, φ_2, …, φ_m )

When Λ = 0, e ⊥ span( φ_1, φ_2, …, φ_m ) and P( P y ) = P e = e = P y, so

SSE = y^T P² y = y^T P y

RBFN’s as Universal Approximators

[Diagram: inputs x_1, …, x_n; kernels φ_1, φ_2, …, φ_m; outputs y_1, …, y_l with weights w_11, w_12, …, w_1m, …, w_l1, w_l2, …, w_lm.]

Training set: T = { ( x^(k), y^(k) ) }_{k=1}^{p}

Kernels: φ_j(x) = exp( −‖x − μ_j‖² / (2σ_j²) )

What to Learn?

• Weights w_ij’s
• Centers μ_j’s of the φ_j’s
• Widths σ_j’s of the φ_j’s
• Number of φ_j’s − model selection

One-Stage Learning

Gradient updates for the Gaussian kernels:

Δw_ij = η_1 ( y_i^(k) − f_i(x^(k)) ) φ_j(x^(k))

Δμ_j = η_2 Σ_{i=1}^{l} w_ij ( y_i^(k) − f_i(x^(k)) ) φ_j(x^(k)) ( x^(k) − μ_j ) / σ_j²

Δσ_j = η_3 Σ_{i=1}^{l} w_ij ( y_i^(k) − f_i(x^(k)) ) φ_j(x^(k)) ‖x^(k) − μ_j‖² / σ_j³

One-Stage Learning

The simultaneous update of all three sets of parameters may be suitable for non-stationary environments or on-line settings.

Two-Stage Training

Step 1: Determine
• Centers μ_j’s of the φ_j’s
• Widths σ_j’s of the φ_j’s
• Number of φ_j’s

Step 2: Determine the w_ij’s, e.g., using batch learning.

Train the Kernels

Unsupervised Training

Methods
• Subset Selection
  – Random Subset Selection
  – Forward Selection
  – Backward Elimination
• Clustering Algorithms
  – K-means
  – LVQ
• Mixture Models
  – GMM
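As a sketch of the two-stage recipe, the snippet below picks centers with a tiny 1-D K-means (stage 1) and then solves the output weights by batch least squares (stage 2); the data, width heuristic, and all names are illustrative assumptions:

```python
import numpy as np

def kmeans_1d(x, k, iters=20):
    """Tiny 1-D K-means: quantile initialization, then alternate
    assignment and center updates (empty clusters keep their center)."""
    centers = np.quantile(x, np.linspace(0, 1, k))
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        centers = np.array([x[labels == j].mean() if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers

x = np.linspace(0, 1, 30)
y = np.cos(2 * np.pi * x)                   # assumed target

mu = kmeans_1d(x, k=6)                      # stage 1: centers from clustering
sigma = 0.15                                # width (assumed heuristic)
Phi = np.exp(-(x[:, None] - mu[None, :])**2 / (2 * sigma**2))
w = np.linalg.lstsq(Phi, y, rcond=None)[0]  # stage 2: batch least squares

err = float(np.max(np.abs(Phi @ w - y)))
print(err)                                  # small fitting error
```

The point of the two-stage split is that once the kernels are frozen, stage 2 is an ordinary linear least-squares problem with a unique global solution.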

Subset Selection

Random Subset Selection

• Randomly choose a subset of points from the training set.

• Sensitive to the initially chosen points.

• Use adaptive techniques to tune
  – Centers
  – Widths
  – #points

Clustering Algorithms

[Figure: four stages (1–4) of a clustering algorithm placing centers among the data.]

Questions - How should the user choose the kernel?

• The problem is similar to selecting features for other learning algorithms.
  ➢ Poor choice: learning is made very difficult.
  ➢ Good choice: even poor learners can succeed.

• The requirement on the user is thus critical.
  – Can this requirement be lessened?
  – Is a more automatic selection of features possible?

Goal Revisit

• Ultimate Goal − Generalization: minimize the prediction error.

• Goal of Our Learning Procedure: minimize the empirical error.

Badness of Fit

• Underfitting
  – A model (e.g., a network) that is not sufficiently complex can fail to fully detect the signal in a complicated data set, leading to underfitting.
  – Produces excessive bias in the outputs.

• Overfitting
  – A model (e.g., a network) that is too complex may fit the noise, not just the signal, leading to overfitting.
  – Produces excessive variance in the outputs.

Underfitting/Overfitting Avoidance

• Model selection
• Jittering
• Early stopping
• Weight decay
  – Regularization
  – Ridge regression
• Bayesian learning
• Combining networks

Best Way to Avoid Overfitting

• Use lots of training data, e.g.,
  – 30 times as many training cases as there are weights in the network;
  – for noise-free data, 5 times as many training cases as weights may be sufficient.

• Don’t arbitrarily reduce the number of weights for fear of underfitting.

Badness of Fit

[Figure: example fits illustrating underfit and overfit.]

Bias-Variance Dilemma

           Underfit         Overfit
Bias       Large bias       Small bias
Variance   Small variance   Large variance

Bias-Variance Dilemma

More on overfitting

• Overfitting easily leads to predictions that are far beyond the range of the training data.

• It can produce wild predictions in multilayer perceptrons, even with noise-free data.

Bias-Variance Dilemma

[Figure: as the model fits more, it moves from underfitting (large bias) toward overfitting (large variance).]

Bias-Variance Dilemma

[Figure: sets of functions, e.g., depending on the number of hidden nodes used; the true model with noise; the bias; and the solutions obtained with training sets 1, 2, and 3. The spread of these solutions is the variance.]


The true model: y(x) = g(x) + ε;  f(x) is the learned approximation.

Bias-Variance Dilemma

E[ ( y − f(x) )² ] = E[ ( g(x) + ε − f(x) )² ]
  = E[ ( g(x) − f(x) )² + 2 ε ( g(x) − f(x) ) + ε² ]
  = E[ ( g(x) − f(x) )² ] + 2 E[ ε ( g(x) − f(x) ) ] + E[ ε² ]

The middle term is 0 and E[ε²] is a constant, so

Goal: min E[ ( y − f(x) )² ]  ⇔  min E[ ( g(x) − f(x) )² ]

Bias-Variance Dilemma

E[ ( g(x) − f(x) )² ] = E[ ( g(x) − E[f(x)] + E[f(x)] − f(x) )² ]

  = E[ ( g(x) − E[f(x)] )² ] + E[ ( E[f(x)] − f(x) )² ]
    − 2 E[ ( g(x) − E[f(x)] ) ( f(x) − E[f(x)] ) ]

The cross term vanishes:

E[ ( g(x) − E[f(x)] ) ( f(x) − E[f(x)] ) ]
  = E[ g(x) f(x) ] − E[ g(x) E[f(x)] ] − E[ E[f(x)] f(x) ] + E[ E[f(x)] E[f(x)] ]
  = g(x) E[f(x)] − g(x) E[f(x)] − E[f(x)] E[f(x)] + E[f(x)] E[f(x)] = 0

Bias-Variance Dilemma

E[ ( y − f(x) )² ] = E[ ε² ] + E[ ( g(x) − f(x) )² ]
  = E[ ε² ] + ( g(x) − E[f(x)] )² + E[ ( f(x) − E[f(x)] )² ]
  =  noise  +  bias²  +  variance

The noise cannot be minimized; minimize both bias² and variance.
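The decomposition can be checked numerically. The sketch below uses a deliberately simple setup (g(x) = x², Gaussian noise, a degree-1 polynomial fit, and a single test point, all assumptions) and estimates each term by Monte Carlo over many training sets:

```python
import numpy as np

rng = np.random.default_rng(1)
g = lambda x: x**2                      # true model g(x) (assumed)
sig_eps = 0.1                           # noise std (assumed)
x_train = np.linspace(-1, 1, 10)
x0 = 0.5                                # test point

# Train a degree-1 polynomial on many independent training sets.
preds = []
for _ in range(2000):
    y_train = g(x_train) + rng.normal(0, sig_eps, x_train.shape)
    coef = np.polyfit(x_train, y_train, deg=1)
    preds.append(np.polyval(coef, x0))
preds = np.array(preds)

bias2 = float((g(x0) - preds.mean())**2)    # (g(x0) - E[f(x0)])^2
variance = float(preds.var())               # E[(f(x0) - E[f(x0)])^2]
# Left-hand side: E[(y - f)^2] with fresh, independent noise draws.
lhs = float(np.mean((g(x0) + rng.normal(0, sig_eps, preds.shape) - preds)**2))
rhs = sig_eps**2 + bias2 + variance
print(abs(lhs - rhs))                       # close to 0
```

The linear model cannot represent x², so bias² stays bounded away from 0 no matter how many training sets are averaged; only a richer model class would reduce it.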

Model Complexity vs. Bias-Variance

E[ ( y − f(x) )² ] = E[ ε² ] + ( g(x) − E[f(x)] )² + E[ ( f(x) − E[f(x)] )² ]
                  =  noise  +  bias²  +  variance

[Figure: as the model complexity (capacity) grows, bias² falls while variance rises.]

Example (Polynomial Fits)

y = x + 2 exp( −16x² )

[Figure: polynomial fits of degree 1, 5, 10, and 15 to noisy samples of this function.]
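The example can be reproduced in a few lines; the sample size and noise level below are assumptions for illustration. Because the models are nested, the training SSE can only shrink as the degree grows, even though the high-degree fits are chasing noise:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-2, 2, 40)
# Noisy samples of the slide's target y = x + 2*exp(-16*x^2).
y = x + 2 * np.exp(-16 * x**2) + rng.normal(0, 0.2, x.shape)

sses = {}
for deg in (1, 5, 10, 15):
    # Polynomial.fit rescales the domain internally, keeping high
    # degrees well conditioned.
    p = np.polynomial.Polynomial.fit(x, y, deg)
    sses[deg] = float(np.sum((p(x) - y)**2))
    print(deg, round(sses[deg], 3))   # training SSE decreases with degree
```

Comparing these training errors with held-out errors (as in the cross-validation slides later) is what exposes the degree-15 fit as overfitting.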

Variance Estimation

Mean: μ
Variance: σ² = (1/p) Σ_{i=1}^{p} ( x_i − μ )²

In general, μ is not available.

Variance Estimation

Mean: μ̂ = x̄ = (1/p) Σ_{i=1}^{p} x_i

Variance: σ̂² = s² = 1/(p−1) Σ_{i=1}^{p} ( x_i − x̄ )²  ← loses 1 degree of freedom

Simple Linear Regression

y = β_0 + β_1 x + ε,  ε ~ N(0, σ²)

[Figure: data points scattered around the true line y = β_0 + β_1 x.]

Simple Linear Regression

ŷ = β̂_0 + β̂_1 x, fitted to data from y = β_0 + β_1 x + ε,  ε ~ N(0, σ²)

[Figure: data points and the fitted line.]

Mean Squared Error (MSE)

ŷ = β̂_0 + β̂_1 x, fitted to data from y = β_0 + β_1 x + ε,  ε ~ N(0, σ²)

σ̂² = MSE = SSE / (p − 2)  ← loses 2 degrees of freedom

Variance Estimation

σ̂² = MSE = SSE / (p − m)  ← loses m degrees of freedom

m: the number of parameters of the model

The Number of Parameters

y = f(x) = Σ_{i=1}^{m} w_i φ_i(x)

[Diagram: inputs x = (x_1, …, x_n), weights w_1, …, w_m.]

#degrees of freedom: m

The Effective Number of Parameters (γ)

y = f(x) = Σ_{i=1}^{m} w_i φ_i(x)

The projection matrix: P = I_p − Φ A^{−1} Φ^T,  A = Φ^T Φ + Λ

γ = p − trace( P );  when Λ = 0, γ = m.

The Effective Number of Parameters (γ)

The projection matrix: P = I_p − Φ A^{−1} Φ^T,  A = Φ^T Φ + Λ

Pf) When Λ = 0:

trace( P ) = trace( I_p − Φ A^{−1} Φ^T )
           = p − trace( Φ ( Φ^T Φ )^{−1} Φ^T )
           = p − trace( ( Φ^T Φ )^{−1} Φ^T Φ )
           = p − trace( I_m )
           = p − m

Hence γ = p − trace( P ) = m.
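The identity γ = p − trace(P) = m for Λ = 0, and the drop below m under regularization, can be verified numerically; the Gaussian design matrix below is an assumed example:

```python
import numpy as np

p, m = 50, 8
x = np.linspace(-1, 1, p)
centers = np.linspace(-1, 1, m)
Phi = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * 0.3**2))

def gamma(lam):
    """Effective number of parameters: gamma = p - trace(P),
    with P = I_p - Phi A^{-1} Phi^T and A = Phi^T Phi + lam*I."""
    A = Phi.T @ Phi + lam * np.eye(m)
    P = np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)
    return float(p - np.trace(P))

print(round(gamma(0.0), 3))   # equals m = 8 when Lambda = 0
print(gamma(1.0) < m)         # a positive penalty pushes gamma below m
```

In spectral terms, γ = Σ_i s_i²/(s_i² + λ) over the singular values s_i of Φ, which makes the shrinkage below m explicit.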

Regularization

Cost = Empirical Error + Model’s Penalty:

C = Σ_{k=1}^{p} ( y^(k) − f(x^(k)) )² + Σ_{i=1}^{m} λ_i w_i²
  = Σ_{k=1}^{p} ( y^(k) − Σ_{i=1}^{m} w_i φ_i(x^(k)) )² + Σ_{i=1}^{m} λ_i w_i²

The first term is the SSE; the second penalizes models with large weights.

Regularization

Without the penalty (λ_i = 0), there are m degrees of freedom to minimize the SSE (cost): the effective number of parameters γ = m.

Regularization

With the penalty (λ_i > 0), the liberty to minimize the SSE is reduced: the effective number of parameters γ < m.

Variance Estimation

σ̂² = MSE = SSE / (p − γ)  ← loses γ degrees of freedom

Variance Estimation

σ̂² = MSE = SSE / trace( P ),  since p − γ = trace( P )

Model Selection

• Goal
  – Choose the fittest model
• Criteria
  – Least prediction error
• Main Tools (estimating model fitness)
  – Cross-validation
  – Projection matrix
• Methods
  – Weight decay (ridge regression)
  – Pruning and growing RBFN’s

Empirical Error vs. Model Fitness

• Ultimate Goal − Generalization: minimize the prediction error (MSE).

• Goal of Our Learning Procedure: minimize the empirical error.

Estimating Prediction Error

• When you have plenty of data, use independent test sets.
  – E.g., use the same training set to train different models, and choose the best model by comparing on the test set.

• When data is scarce, use
  – Cross-Validation
  – Bootstrap

Cross Validation

• Simplest and most widely used method for estimating prediction error.

• Partition the original set in several different ways and compute an average score over the different partitions, e.g.,
  – K-fold Cross-Validation
  – Leave-One-Out Cross-Validation
  – Generalized Cross-Validation
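A minimal K-fold sketch for choosing the number of kernels m (the data, widths, and evenly spaced centers are all assumed settings):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.shape)

def cv_error(m, K=5, sigma=0.2):
    """Average held-out MSE of an RBF fit with m evenly spaced centers."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, K)
    mu = np.linspace(0, 1, m)
    phi = lambda z: np.exp(-(z[:, None] - mu[None, :])**2 / (2 * sigma**2))
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = np.linalg.lstsq(phi(x[train]), y[train], rcond=None)[0]
        errs.append(np.mean((phi(x[test]) @ w - y[test])**2))
    return float(np.mean(errs))

scores = {m: cv_error(m) for m in (2, 5, 10, 20)}
print(min(scores, key=scores.get))   # the m with the lowest held-out error
```

Each candidate model is trained on K−1 folds and scored on the remaining one, so the selected m reflects prediction error rather than empirical error.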

Comparison of RBF and MLP

MLP                              | RBF networks
Global hyperplane                | Local receptive field
EBP                              | LMS
Local minima                     | Serious local minima
Smaller number of hidden neurons | Larger number of hidden neurons
Shorter computation time         | Longer computation time
Longer learning time             | Shorter learning time

Comparison of RBF and MLP

                        | RBF networks                  | MLP
Learning speed          | Very fast                     | Very slow
Convergence             | Almost guaranteed             | Not guaranteed
Response time           | Slow                          | Fast
Memory requirement      | Very large                    | Small
Hardware implementation | IBM ZISC036, Nestor Ni1000 (www-5.ibm.com/fr/cdlab/zisc.html) | Voice Direct 364 (www.sensoryinc.com)
Generalization          | Usually better                | Usually poorer
