
Chapter 4 Supervised Learning: Multilayer Networks II

Other Feedforward Networks
• Madaline
– Multiple adalines (of a sort) as hidden nodes
– Weight change follows minimum disturbance principle
• Adaptive multi-layer networks
– Dynamically change the network size (# of hidden nodes)
• Prediction networks
– BP nets for prediction
– Recurrent nets
• Networks of radial basis function (RBF)
– e.g., Gaussian function
– Perform better than sigmoid function in function approximation (e.g., interpolation)
• Some other selected types of layered NN

Madaline
• Architectures
– Hidden layers of adaline nodes
– Output nodes differ
• Learning
– Error driven, but not by gradient descent
– Minimum disturbance: smaller change of weights is preferred, provided it can reduce the error
• Three Madaline models
– Different node functions
– Different learning rules (MR I, II, and III)
– MR I and II developed in the 1960s, MR III much later (1988)

Madaline
MRI net:
– Output nodes with logic function
MRII net:
– Output nodes are adalines
MRIII net:
– Same as MRII, except the nodes use a sigmoid function

Madaline
• MR II rule
– Only change weights associated with nodes which have small $|net_j|$
– Bottom up, layer by layer
• Outline of algorithm
1. At layer h: sort all nodes in order of increasing $|net|$ values, put those with $|net| < \theta$ in S
2. For each $A_j$ in S: if reversing its output (change $x_j$ to $-x_j$) improves the output error, then change the weight vector leading into $A_j$ by LMS of Adaline (or other ways):
   $\Delta w_{j,i} = \eta\,(-x_j - net_j)\, i_{j,i}$
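The two steps of the MR II outline can be sketched for a single hidden layer and a single training sample. This is only an illustration under stated assumptions: the function name `mr2_step`, the threshold `theta`, the rate `eta`, and the normalization of the LMS step by the squared input length are not from the slides, and the output node is taken to be one sign-valued adaline.

```python
import numpy as np

def sign(v):
    """Bipolar step function used by adaline nodes."""
    return np.where(v >= 0, 1.0, -1.0)

def mr2_step(W_h, w_o, inp, target, theta=0.5, eta=0.5):
    """One MR II-style pass over one hidden layer for one sample (hypothetical helper).

    W_h: hidden weights (nodes x inputs), w_o: output adaline weights.
    """
    net = W_h @ inp
    x = sign(net)

    def out_err(hidden):
        return abs(target - sign(w_o @ hidden))

    # Step 1: candidates are nodes with small |net|, visited in increasing order.
    cand = [j for j in np.argsort(np.abs(net)) if abs(net[j]) < theta]
    W_new = W_h.copy()
    for j in cand:
        # Step 2: trial reversal of the output of node j.
        trial = x.copy()
        trial[j] = -trial[j]
        if out_err(trial) < out_err(x):
            # LMS step toward the reversed output -x_j.
            # Dividing by |inp|^2 is an assumption made for this sketch.
            W_new[j] += eta * (-x[j] - net[j]) * inp / (inp @ inp)
            net[j] = W_new[j] @ inp
            x[j] = sign(net[j])
    return W_new
```

Only the node whose reversal helps is touched, which is the minimum disturbance idea: nodes with large $|net|$ are never candidates.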

Madaline
• MR III rule
– Even though node function is sigmoid, do not use gradient descent (do not assume its derivative is known)
– Use trial adaptation
– E: total square error at output nodes
– $E_k$: total square error at output nodes if $net_k$ at node k is increased by $\varepsilon$ (> 0)
– Change weight leading to node k according to
  $\Delta w = -\eta\, i\, (E_k^2 - E^2)/\varepsilon^2$ or $\Delta w = -\eta\, i\, E\,(E_k - E)/\varepsilon$
– Update weight to one node at a time
– It can be shown to be equivalent to BP
– Since it is not explicitly dependent on derivatives, this method can be used for hardware devices that inaccurately implement sigmoid function
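The claim that trial adaptation is equivalent to BP can be checked numerically on a tiny net. The sketch below is an assumption-laden illustration, not the MR III rule itself: it uses the plain forward difference $(E_k - E)/\varepsilon$ of the squared error as the estimate of the error slope with respect to $net_k$, and all function names are hypothetical.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(W_h, w_o, inp, bump=None):
    """Forward pass; `bump` optionally perturbs the hidden net inputs."""
    net = W_h @ inp
    if bump is not None:
        net = net + bump
    hidden = sigmoid(net)
    return sigmoid(w_o @ hidden), hidden

def squared_error(W_h, w_o, inp, d, bump=None):
    o, _ = forward(W_h, w_o, inp, bump)
    return (d - o) ** 2

def trial_gradient(W_h, w_o, inp, d, k, eps=1e-5):
    """Estimate dE/dw for the weights into hidden node k by perturbing net_k."""
    bump = np.zeros(W_h.shape[0])
    bump[k] = eps
    E = squared_error(W_h, w_o, inp, d)
    E_k = squared_error(W_h, w_o, inp, d, bump)
    return inp * (E_k - E) / eps

def bp_gradient(W_h, w_o, inp, d, k):
    """Analytic dE/dw for the same weights, as backpropagation computes it."""
    o, hidden = forward(W_h, w_o, inp)
    delta_o = -2.0 * (d - o) * o * (1.0 - o)
    delta_k = delta_o * w_o[k] * hidden[k] * (1.0 - hidden[k])
    return delta_k * inp
```

The trial estimate never evaluates the derivative of the node function, which is why it still works when the hardware sigmoid is only approximate.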

Adaptive Multilayer Networks
• Smaller nets are often preferred
– Computing is faster
– Generalize better
– Training is faster
  • Fewer weights to be trained
  • Smaller # of training samples needed
• Heuristics for “optimal” net size
– Pruning: start with a large net, then prune it by removing unimportant nodes and associated connections/weights
– Growing: start with a very small net, then continuously increase its size with small increments until the performance becomes satisfactory
– Combining the above two: a cycle of pruning and growing until performance is satisfied and no more pruning is possible

Adaptive Multilayer Networks
• Pruning a network by removing
– Weights with small magnitude (e.g., ≈ 0)
– Nodes with small incoming weights
– Weights whose existence does not significantly affect network output
  • If $\partial o/\partial w$ is negligible
– By examining the second derivative
  $\Delta E \approx E'(w)\,\Delta w + \frac{1}{2} E''(w)\,(\Delta w)^2$, where $E'(w) = \partial E/\partial w$ and $E''(w) = \partial^2 E/\partial w^2$
  when E approaches a local minimum, $\partial E/\partial w \approx 0$, then $\Delta E \approx \frac{1}{2} E''(w)\,(\Delta w)^2$
  effect of removing w is to change it to 0, i.e., $\Delta w = -w$
  whether to remove w depends on if $\frac{1}{2} E''(w)\, w^2$ is sufficiently small
– Input nodes can also be pruned if the resulting change of E is negligible
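The second-derivative test for pruning can be tried on a toy error function. In this sketch the helper names, the finite-difference estimate of $E''$, and the quadratic example error are all assumptions made for illustration; at a local minimum the quantity $\frac{1}{2} E''(w)\, w^2$ approximates the error increase caused by setting that weight to 0.

```python
import numpy as np

def second_derivative(E, w, idx, h=1e-3):
    """Central finite-difference estimate of d^2E/dw_idx^2 (hypothetical helper)."""
    e = np.zeros_like(w)
    e[idx] = h
    return (E(w + e) - 2.0 * E(w) + E(w - e)) / h ** 2

def saliencies(E, w):
    """0.5 * E''(w_i) * w_i^2 for every weight: estimated cost of removing it."""
    return np.array([0.5 * second_derivative(E, w, i) * w[i] ** 2
                     for i in range(len(w))])

# Toy error with its minimum at w = (1.0, 0.2); made up for this example.
def toy_error(w):
    return 3.0 * (w[0] - 1.0) ** 2 + 0.1 * (w[1] - 0.2) ** 2

w_min = np.array([1.0, 0.2])
s = saliencies(toy_error, w_min)
prune_first = int(np.argmin(s))  # the weight whose removal costs least
```

For this toy error the second weight is the pruning candidate, and the estimate matches the true increase in error from zeroing it because the error is exactly quadratic.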

Adaptive Multilayer Networks
• Cascade correlation (example of growing net size)
– Cascade architecture development
  • Start with a net without hidden nodes
  • Each time one hidden node is added between the output nodes and all other nodes
  • The new node is connected to output nodes, and from all other nodes (input and all existing hidden nodes)
  • Not strictly feedforward

Adaptive Multilayer Networks
– Correlation learning: when a new node n is added
  • first train all input weights to node n from all nodes below (maximize covariance with current error of output nodes E)
  • then train all weights to output nodes (minimize E)
  • quickprop is used
  • all other weights to lower hidden nodes are not changed (so it trains fast)

Adaptive Multilayer Networks
– Train $w_{new}$ to maximize covariance
  • covariance between $x_{new}$ and $E_{old}$:
    $S(w_{new}) = \sum_{k=1}^{K} \left| \sum_{p=1}^{P} (x_{new,p} - \bar{x}_{new})(E_{k,p} - \bar{E}_k) \right|$, where
    $x_{new,p}$ is the output of $x_{new}$ for the p-th sample, $\bar{x}_{new}$ is its mean value over all samples,
    $E_{k,p}$ is the error on the k-th output node for the p-th sample with old weights, $\bar{E}_k$ is its mean value over all samples
  • when $S(w_{new})$ is maximized, variance of $x_{new,p}$ from $\bar{x}_{new}$ mirrors that of error $E_{k,p}$ from $\bar{E}_k$
  • $S(w_{new})$ is maximized by gradient ascent:
    $\Delta w_i = \eta\, \partial S/\partial w_i$, with $\partial S/\partial w_i = \sum_{k=1}^{K} \sum_{p=1}^{P} S_k\,(E_{k,p} - \bar{E}_k)\, f'_p\, I_{i,p}$, where
    $S_k$ is the sign of the correlation between $x_{new}$ and $E_k$, $f'_p$ is the derivative of $x_{new}$'s node function for the p-th sample, and $I_{i,p}$ is the i-th input of the p-th sample
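The covariance score for a candidate node and its gradient can be written in a few lines. This is a sketch under assumptions: the candidate node is taken to be a tanh unit, the score sums the absolute covariance over output nodes, and the array names `I` (inputs, one row per sample) and `E` (old output errors, one column per output node) are chosen here for illustration.

```python
import numpy as np

def candidate_output(w, I):
    """Output of the candidate node for all P samples (tanh node is an assumption)."""
    return np.tanh(I @ w)

def covariance_score(w, I, E):
    """S(w): sum over output nodes of |covariance between node output and old error|."""
    x = candidate_output(w, I)
    cov = (x - x.mean()) @ (E - E.mean(axis=0))  # one value per output node
    return np.abs(cov).sum()

def covariance_gradient(w, I, E):
    """dS/dw_i = sum_k sum_p S_k (E_kp - mean E_k) f'_p I_ip."""
    x = candidate_output(w, I)
    Ec = E - E.mean(axis=0)               # centered errors, shape (P, K)
    S_k = np.sign((x - x.mean()) @ Ec)    # sign of each covariance, shape (K,)
    f_prime = 1.0 - x ** 2                # derivative of tanh at each sample
    return I.T @ (f_prime * (Ec @ S_k))

# Gradient ascent on made-up data: the step direction is +gradient.
rng = np.random.default_rng(0)
I = rng.normal(size=(20, 3))
E = rng.normal(size=(20, 2))
w = 0.1 * rng.normal(size=3)
w_next = w + 1e-3 * covariance_gradient(w, I, E)
```

The mean of the node output does not need its own gradient term, because the centered errors sum to zero over the samples.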

Adaptive Multilayer Networks
– Example: corner isolation problem
  • Hidden nodes are with sigmoid function ([-0.5, 0.5])
  • When trained without hidden node: 4 out of 12 patterns are misclassified
  • After adding 1 hidden node, only 2 patterns are misclassified
  • After adding the second hidden node, all 12 patterns are correctly classified
  • At least 4 hidden nodes are required with BP learning

Prediction Networks
• Prediction
– Predict f(t) based on values of f(t – 1), f(t – 2), …
– Two NN models: feedforward and recurrent
• A simple example (section 3.7.3)
– Forecasting commodity price at month t based on its prices at previous months
– Using a BP net with a single hidden layer
  • 1 output node: forecasted price for month t
  • k input nodes (using price of previous k months for prediction)
  • k hidden nodes
  • Training sample: for k = 2: $\{(x_{t-2}, x_{t-1}) \to x_t\}$
  • Raw data: flour prices for 100 consecutive months, 90 for training, 10 for cross validation testing
  • one-lag forecasting: predict $x_t$ based on $x_{t-2}$ and $x_{t-1}$
  • multilag: using predicted values for further forecasting
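Building the training samples and running a multilag forecast can be sketched as follows. The function names are hypothetical, and `predict` stands in for whatever trained net is used; the example passes a simple linear extrapolation only to show how predicted values are fed back as inputs.

```python
import numpy as np

def make_samples(series, k):
    """Turn a time series into (previous k values) -> (next value) samples."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[t - k:t] for t in range(k, len(series))])
    y = series[k:]
    return X, y

def multilag_forecast(predict, history, steps):
    """Forecast several steps ahead, feeding each prediction back as an input."""
    window = list(history)
    out = []
    for _ in range(steps):
        nxt = predict(np.array(window))
        out.append(nxt)
        window = window[1:] + [nxt]  # slide the window over the predicted value
    return out

X, y = make_samples([1, 2, 3, 4, 5], k=2)

# Stand-in predictor (an assumption, not a trained BP net): linear extrapolation.
forecast = multilag_forecast(lambda w: 2 * w[-1] - w[-2], [4.0, 5.0], steps=3)
```

One-lag forecasting corresponds to calling the predictor once per window of true values, while multilag forecasting accumulates the predictor's own errors over the horizon.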

Prediction Networks
• Training:
– 90 input data values
– Last 10 prices for validation test
– Three attempts: k = 2, 4, 6
– Learning rate = 0.3, momentum = 0.6
– 25,000 – 50,000 epochs
– 2-2-1 net with good prediction
– Two larger nets over-trained (with larger prediction errors for validation data)
• Results

Network   Data       MSE
2-2-1     Training   0.0034
          one-lag    0.0044
          multilag   0.0045
4-4-1     Training   0.0034
          one-lag    0.0098
          multilag   0.0100
6-6-1     Training   0.0028
          one-lag    0.0121
          multilag   0.0176

Prediction Networks
• Generic NN model for prediction
– Preprocessor prepares training samples from time series data $x(t)$
– Train predictor using samples (e.g., by BP learning)
• Preprocessor
– In the previous example,
  • Let k = d + 1 (using previous d + 1 data points to predict)
  • input sample at time t: $\mathbf{x}(t) = (x(t-d), \ldots, x(t-1), x(t))$
  • the desired output (e.g., prediction): $x(t+1)$
– More general:
  • $c_i$ is called a kernel function for different memory model (how previous data are remembered)
  • Examples: exponential trace memory; gamma memory (see p.141)

Prediction Networks
• Recurrent NN architecture
– Cycles in the net
  • Output nodes with connections to hidden/input nodes
  • Connections between nodes at the same layer
  • Node may connect to itself
– Each node receives external input as well as input from other nodes
– Each node may be affected by output of every other node
– With a given external input vector, the net often converges to an equilibrium state after a number of iterations (output of every node stops changing)
• An alternative NN model for function approximation
– Fewer nodes, more flexible/complicated connections
– Learning procedure is often more complicated

Prediction Networks
• Approach I: unfolding to a feedforward net
– Each layer represents a time delay of the network evolution
– Weights in different layers are identical
– Cannot directly apply BP learning (because weights in different layers are constrained to be identical)
– How many layers to unfold to? Hard to determine
[Figure: a fully connected net of 3 nodes and its equivalent FF net of k layers]

Prediction Networks
• Approach II: gradient descent
– A more general approach
– Error driven: for a given external input
  $E(t) = \sum_k (d_k(t) - o_k(t))^2 = \sum_k e_k(t)^2$, where k are output nodes (desired outputs are known)
– Weight update
  $w_{i,j}(t+1) = w_{i,j}(t) + \Delta w_{i,j}(t)$
  $\Delta w_{i,j}(t) = -\eta\, \partial E(t)/\partial w_{i,j} = \eta \sum_k (d_k(t) - o_k(t))\, \partial o_k(t)/\partial w_{i,j}$ (constant factor absorbed into $\eta$)
  $\partial o_k(t+1)/\partial w_{i,j} = f'(net_k(t)) \left[ \sum_l w_{k,l}\, \partial z_l(t)/\partial w_{i,j} + \delta_{i,k}\, z_j(t) \right]$, where
  $\delta_{i,k} = 1$ if i = k, 0 otherwise, and $z_l(t)$ is input to node k from either input nodes or other nodes
  $\partial o_k(0)/\partial w_{i,j} = 0$
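The recursive computation of the output sensitivities $\partial o_k/\partial w_{i,j}$ for a recurrent net can be sketched as below. The setup is an assumption made for illustration: a fully connected net of `n` tanh nodes, each node seeing all node outputs plus the external inputs, node outputs starting at zero, and sensitivities starting at zero.

```python
import numpy as np

def rtrl_run(W, ext_inputs, n):
    """Run a fully recurrent net and track p[k, i, j] = d o_k / d w_ij.

    W has shape (n, n + m): weights from the n node outputs and m external inputs.
    """
    o = np.zeros(n)                    # node outputs at time 0 (assumed zero)
    p = np.zeros((n, n, W.shape[1]))   # sensitivities start at zero
    idx = np.arange(n)
    for ext in ext_inputs:
        z = np.concatenate([o, ext])   # inputs seen by every node at this step
        net = W @ z
        f_prime = 1.0 - np.tanh(net) ** 2
        # sum_l w_kl * d z_l / d w_ij ; external inputs do not depend on the weights
        rec = np.einsum('kl,lij->kij', W[:, :n], p)
        # add the direct term delta_ik * z_j
        rec[idx, idx, :] += z
        p = f_prime[:, None, None] * rec
        o = np.tanh(net)
    return o, p

rng = np.random.default_rng(0)
n, m = 3, 2
W = rng.normal(scale=0.5, size=(n, n + m))
ext = rng.normal(size=(4, m))
o, p = rtrl_run(W, ext, n)
```

A weight change would then follow from combining `p` with the output errors, as in the update rule for $\Delta w_{i,j}$; the sketch stops at the sensitivities, which are the part specific to recurrent nets.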

NN of Radial Basis Functions
• Motivations: better performance than sigmoid functions
– For some classification problems
– Function interpolation
• Definition
– A function is radially symmetric (or is an RBF) if its output depends on the distance between the input vector and a stored vector related to that function
  • Distance $u = \|i - \mu\|$, where $i$ is the input vector, $\mu$ is the vector associated with the RBF
  • Output $\phi(u_1) = \phi(u_2)$ whenever $u_1 = u_2$
– NN with RBF node function are called RBF-nets

NN of Radial Basis Functions
• Gaussian function is the most widely used RBF
– $\phi_g(u) = e^{-(u/c)^2}$, a bell-shaped function centered at u = 0
– Continuous and differentiable:
  if $\phi_g(u) = e^{-(u/c)^2}$ then $\phi_g'(u) = e^{-(u/c)^2}\,(-(u/c)^2)' = -2(u/c^2)\,\phi_g(u)$
– Other RBF
  • Inverse quadratic function, hyperspheric function, etc.
  • Inverse quadratic function: $\phi(u) = (c^2 + u^2)^{-\beta}$, for $\beta > 0$
  • Hyperspheric function: $\phi_s(u) = 1$ if $u \le c$, $0$ if $u > c$
• Consider Gaussian function again
– $\mu$ gives the center of the region for activating this unit
– u = 0 gives the max output
– c determines the size of the region
  ex: for $\phi_g(u) = e^{-(u/c)^2} = 0.9$:
  c = 0.1: u = 0.03246
  c = 1.0: u = 0.3246
  c = 10: u = 3.246
[Figures: Gaussian, inverse quadratic, and hyperspheric functions centered at μ; Gaussian with small c vs. large c]
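The worked numbers for the Gaussian can be reproduced by solving $e^{-(u/c)^2} = 0.9$ for u, which gives $u = c\sqrt{-\ln 0.9}$. The function names in this sketch are illustrative only.

```python
import math

def gaussian_rbf(u, c):
    """Gaussian RBF output at distance u from the center, with width parameter c."""
    return math.exp(-(u / c) ** 2)

def radius_for_output(level, c):
    """Distance u at which the Gaussian output drops to `level` (0 < level <= 1)."""
    return c * math.sqrt(-math.log(level))

radii = {c: radius_for_output(0.9, c) for c in (0.1, 1.0, 10.0)}
```

The radius scales linearly with c, which is why c can be read as the size of the region that activates the unit.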

NN of Radial Basis Functions
• Pattern classification
– 4 or 5 sigmoid hidden nodes are required for a good classification
– Only 1 RBF node is required if the function can approximate the circle

NN of Radial Basis Functions
• XOR problem
– 2-2-1 network
  • 2 hidden nodes are RBF:
    $\rho_1(x) = e^{-\|x - t_1\|^2}$, $t_1 = [1, 1]$
    $\rho_2(x) = e^{-\|x - t_2\|^2}$, $t_2 = [0, 0]$
  • Output node can be step or sigmoid
– When input x is applied
  • Hidden node calculates distance $\|x - t_j\|$, then its output
  • All weights to hidden nodes set to 1
  • Weights to output node trained by LMS
  • $t_1$ and $t_2$ can also be trained

x        ρ1(x)    ρ2(x)
(1, 1)   1        0.1353
(0, 1)   0.3679   0.3679
(1, 0)   0.3679   0.3679
(0, 0)   0.1353   1
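The hidden-layer mapping for the XOR example is easy to reproduce. In this sketch the two centers are the ones given in the example; the output weights (-1, -1) and bias 1 are hand-picked assumptions that stand in for weights trained by LMS, just to show that the transformed patterns are linearly separable.

```python
import numpy as np

t1 = np.array([1.0, 1.0])
t2 = np.array([0.0, 0.0])

def hidden(x):
    """Outputs of the two Gaussian hidden nodes centered at t1 and t2."""
    x = np.asarray(x, dtype=float)
    return np.array([np.exp(-np.sum((x - t1) ** 2)),
                     np.exp(-np.sum((x - t2) ** 2))])

def xor_output(x):
    """Step output node with hand-picked weights (-1, -1) and bias 1 (assumption)."""
    rho = hidden(x)
    return int(1.0 - rho.sum() > 0)

results = {x: xor_output(x) for x in [(1, 1), (0, 1), (1, 0), (0, 0)]}
```

Both (0, 1) and (1, 0) map to the same point in the hidden space, which is what makes a single linear boundary sufficient.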

NN of Radial Basis Functions
• Function interpolation
– Suppose you know $f(x_1)$ and $f(x_2)$, to approximate $f(x_0)$ ($x_1 < x_0 < x_2$) by linear interpolation:
  $f(x_0) = f(x_1) + (f(x_2) - f(x_1))(x_0 - x_1)/(x_2 - x_1)$
– Let $D_1 = |x_0 - x_1|$, $D_2 = |x_0 - x_2|$ be the distances of $x_0$ from $x_1$ and $x_2$, then
  $f(x_0) = [f(x_1)(D_1 + D_2) + (f(x_2) - f(x_1))\, D_1]/(D_1 + D_2)$
  $= [f(x_1)\, D_2 + f(x_2)\, D_1]/(D_1 + D_2)$
  $= [D_1^{-1} f(x_1) + D_2^{-1} f(x_2)]/[D_1^{-1} + D_2^{-1}]$
  i.e., sum of function values, weighted and normalized by distances
– Generalized to interpolating by more than 2 known f values
  $f(x_0) = \dfrac{D_1^{-1} f(x_1) + D_2^{-1} f(x_2) + \cdots + D_{P_0}^{-1} f(x_{P_0})}{D_1^{-1} + D_2^{-1} + \cdots + D_{P_0}^{-1}}$
  where $P_0$ is the number of neighbors to $x_0$
  • Only those $f(x_i)$ with small distance to $x_0$ are useful
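Interpolation by inverse-distance weighting over the nearest known samples can be sketched as follows. The function name and the `n_neighbors` parameter are assumptions for illustration; a query that coincides with a known sample simply returns that sample's value.

```python
import numpy as np

def idw_interpolate(x0, xs, fs, n_neighbors=None):
    """Estimate f(x0) as the inverse-distance weighted mean of known values.

    xs: known sample points, fs: their function values.
    n_neighbors: use only this many nearest samples (all samples if None).
    """
    xs = np.asarray(xs, dtype=float)
    fs = np.asarray(fs, dtype=float)
    if xs.ndim == 1:
        xs = xs[:, None]
    d = np.linalg.norm(xs - np.atleast_1d(x0), axis=1)
    if np.any(d == 0):
        return float(fs[np.argmin(d)])
    idx = np.argsort(d)[:n_neighbors]
    w = 1.0 / d[idx]
    return float(w @ fs[idx] / w.sum())

# With two samples this reduces to ordinary linear interpolation.
value = idw_interpolate(0.5, [0.0, 2.0], [1.0, 5.0])
```

Restricting the sum to a few nearest neighbors keeps distant samples, which carry little weight anyway, out of the computation.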

NN of Radial Basis Functions
• Example:
– 8 samples with known function values
– $f(x_0)$ can be interpolated using only 4 nearest neighbors $(x_2, x_3, x_4, x_5)$
  $f(x_0) = \dfrac{D_2^{-1} f(x_2) + D_3^{-1} f(x_3) + D_4^{-1} f(x_4) + D_5^{-1} f(x_5)}{D_2^{-1} + D_3^{-1} + D_4^{-1} + D_5^{-1}}$
  $= \dfrac{8 D_2^{-1} + 9 D_3^{-1} + 3 D_4^{-1} + 8 D_5^{-1}}{D_2^{-1} + D_3^{-1} + D_4^{-1} + D_5^{-1}}$

NN of Radial Basis Functions
• Using RBF node to achieve neighborhood
– One hidden node per sample $x_p$: $\mu_p = x_p$, and $\phi(D) = D^{-1}$
– Network output for approximating $f(x)$ is proportional to
  $net = \sum_{p=1}^{P} w_p\, \phi(\|x - x_p\|)$, with output node weights $w_p = d_p/P$, where $d_p = f(x_p)$
– Hidden RBF nodes: output $\phi(\|x - x_p\|)$

NN of Radial Basis Functions
• Clustering samples
– Too many hidden nodes when # of samples is large
– Grouping similar samples (having similar input and similar desired output) together into N clusters, each with
  • The center: a vector
  • Mean desired output
– Network output: a weighted sum of the outputs of the N hidden RBF nodes
– Suppose we know how to determine N and how to cluster all P samples (not an easy task itself); the centers and the weights can then be determined by learning

NN of Radial Basis Functions
• Learning in RBF net
– Objective: learning $w_i$ and $\mu_i$ to minimize $E$, where
  $\|x_p - \mu_i\|^2 = \sum_{j=1}^{n} (x_{p,j} - \mu_{i,j})^2$, $o_p = \sum_{i=1}^{N} w_i\, \phi(\|x_p - \mu_i\|)$, $E = \sum_{p=1}^{P} E_p = \sum_{p=1}^{P} (d_p - o_p)^2$
– Gradient descent approach (sequential mode)
  Learning $w_i$:
  $\partial E_p/\partial w_i = -2(d_p - o_p)\, \partial o_p/\partial w_i = -2(d_p - o_p)\, \phi(\|x_p - \mu_i\|)$
  $\Delta w_i = \eta_i\, (d_p - o_p)\, \phi(\|x_p - \mu_i\|)$
  Learning $\mu_{i,j}$:
  $\partial E_p/\partial \mu_{i,j} = -2(d_p - o_p)\, \partial o_p/\partial \mu_{i,j} = -2(d_p - o_p)\, w_i\, \partial \phi(\|x_p - \mu_i\|)/\partial \mu_{i,j}$
  $\partial \phi(\|x_p - \mu_i\|)/\partial \mu_{i,j} = R'(\|x_p - \mu_i\|^2)\, \partial (\|x_p - \mu_i\|^2)/\partial \mu_{i,j} = -2 R'(\|x_p - \mu_i\|^2)(x_{p,j} - \mu_{i,j})$
  where function R is defined as $R(D^2) = \phi(D)$
  $\Delta \mu_{i,j} = -\eta_{i,j}\, w_i\, (d_p - o_p)\, R'(\|x_p - \mu_i\|^2)(x_{p,j} - \mu_{i,j})$
– For Gaussian functions: $\phi(D) = \exp(-D^2/\sigma^2)$, $R(D^2) = \exp(-D^2/\sigma^2)$, $R'(D^2) = (-1/\sigma^2)\exp(-D^2/\sigma^2)$
  $\Delta w_i = \eta_i\, (d_p - o_p)\, \exp(-\|x_p - \mu_i\|^2/\sigma^2)$
  $\Delta \mu_{i,j} = \eta_{i,j}\, w_i\, (d_p - o_p)(x_{p,j} - \mu_{i,j})\, \exp(-\|x_p - \mu_i\|^2/\sigma^2)$
  (constant factors are absorbed into the learning rates)
– One can also obtain $\mu_i$ by other clustering techniques, then use GD learning for $w_i$ only
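One sequential gradient descent step for a Gaussian RBF net can be sketched as below, with one output node, weights `W`, and centers `MU` (one row per hidden node). The names, the shared width `sigma`, and the single pair of learning rates are assumptions for illustration; constant factors such as $1/\sigma^2$ are folded into the learning rates.

```python
import numpy as np

def rbf_output(x, W, MU, sigma):
    """Net output o = sum_i w_i * exp(-|x - mu_i|^2 / sigma^2), plus activations."""
    act = np.exp(-np.sum((x - MU) ** 2, axis=1) / sigma ** 2)
    return W @ act, act

def rbf_update(x, d, W, MU, sigma, eta_w=0.1, eta_mu=0.1):
    """One sequential-mode update of the weights and centers for sample (x, d)."""
    o, act = rbf_output(x, W, MU, sigma)
    err = d - o
    # weights: step proportional to error times node activation
    W_new = W + eta_w * err * act
    # centers: move toward x when w_i * err > 0, away from x otherwise
    MU_new = MU + eta_mu * err * (W * act)[:, None] * (x - MU)
    return W_new, MU_new

# Made-up numbers for a two-node net and one training sample.
W = np.array([0.5, -0.3])
MU = np.array([[0.0, 0.0], [1.0, 1.0]])
x, d = np.array([0.2, 0.1]), 1.0
W2, MU2 = rbf_update(x, d, W, MU, sigma=1.0)
```

In this example the first center moves toward the sample and the second, whose weight is negative, moves away, and both changes push the output toward the desired value.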

NN of Radial Basis Functions
• A strategy for learning RBF net
– Start with a single RBF hidden node for a single cluster containing only the first training sample.
– For each of the new training samples x
  • If it is close to any of the existing clusters, do the gradient descent based updates of the w and φ for all existing clusters/hidden nodes
  • Otherwise, add a new hidden node for a cluster containing only x
• RBF networks are universal approximators
– same representational power as BP networks

Polynomial Networks
• Polynomial networks
– Node functions allow direct computing of polynomials of inputs
– Approximating higher order functions with fewer nodes (even without hidden nodes)
– Each node has more connection weights
• Higher-order networks
– # of weights per node: $1 + n + n^2 + \cdots + n^k$
– Can be trained by LMS
– General function approximator

Polynomial Networks
• Sigma-pi networks
– Does not allow terms with higher powers of inputs, so they are not a general function approximator
– # of weights per node: $1 + \binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{k}$
– Can be trained by LMS
• Pi-sigma networks
– One hidden layer with Sigma function
– Output nodes with Pi function
• Product units:
– Node computes product of its inputs raised to powers: $\prod_{i=1}^{n} x_i^{P_{j,i}}$
– Integer power $P_{j,i}$ can be learned
– Often mix with other units (e.g., sigmoid)
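A sigma-pi node and a product unit can be sketched with the standard library alone. The function names and the ordering of the weight vector are assumptions made for this illustration; the sigma-pi node sums weighted products of up to k distinct inputs, so no input appears with a power above 1.

```python
from itertools import combinations
from math import comb, prod

def sigma_pi_terms(n, k):
    """Index tuples for all products of up to k distinct inputs (empty tuple = bias)."""
    return [c for j in range(k + 1) for c in combinations(range(n), j)]

def sigma_pi_node(inputs, weights, k):
    """Weighted sum of products of distinct inputs; one weight per term."""
    terms = sigma_pi_terms(len(inputs), k)
    return sum(w * prod(inputs[i] for i in t) for w, t in zip(weights, terms))

def product_unit(inputs, powers):
    """Product of inputs raised to (integer) powers."""
    return prod(x ** p for x, p in zip(inputs, powers))

n_weights = len(sigma_pi_terms(3, 2))          # terms for n = 3 inputs, order k = 2
value = sigma_pi_node([2, 3], [1, 1, 1, 1], 2)  # 1 + 2 + 3 + 2*3
```

Counting the terms this way gives the sum of binomial coefficients from 0 to k, since each term picks a subset of distinct inputs.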
