On Simple Adaptive Momentum
Dr Richard Mitchell
Cybernetics Intelligence Research Group, Cybernetics, School of Systems Engineering, University of Reading, UK
Presented at CIS 2008 © Dr Richard Mitchell 2008
Overview
Simple Adaptive Momentum (SAM) speeds the training of Multi-Layer Perceptrons (MLPs).
It adapts the normal momentum term depending on the angle between the current and previous changes in the weights of the MLP.
In the original paper, the weight changes of the whole network are used in determining this angle.
This paper considers adapting the momentum term using certain subsets of these weights.
It is inspired by the author’s object oriented approach to programming MLPs, successfully used in teaching.
It is concluded that the angle is best determined using the weight changes in each layer separately.
Nomenclature in a Multi-Layer Net
x_r(i) is the output of node i in layer r; w_r(i,j) is weight i of the link to node j in layer r.
[Figure: an example network with 2 inputs x1(1)..x1(2), 3 hidden nodes x2(1)..x2(3) and 2 outputs x3(1)..x3(2); weights w2(i,j) connect the inputs to the hidden nodes and w3(i,j) connect the hidden nodes to the outputs, with w_r(0,j) the bias weights. Inputs are on the left, outputs on the right.]
$x_r(j) = f\!\left(\sum_{i=0}^{n_r} w_r(i,j)\, x_{r-1}(i)\right)$, where $f(z)$ is the activation function.

Change weights: $\Delta_t w_r(i,j) = \eta\, \delta_r(j)\, x_{r-1}(i) + \alpha\, \Delta_{t-1} w_r(i,j)$
δ is a function of the error and varies with f(z); the error term itself also varies, differing between the output layer and the hidden layers.
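For concreteness, a minimal NumPy sketch of this forward pass and momentum-based weight update follows; the function names and array layout (bias weights in row 0, matching the w_r(0,j) terms) are illustrative assumptions, not the author's code.

```python
import numpy as np

def forward(x_prev, weights, f):
    # x_r(j) = f( sum_{i=0..n_r} w_r(i,j) * x_{r-1}(i) ); index 0 is the
    # bias, so a constant 1 is prepended to the previous layer's outputs.
    z = np.concatenate(([1.0], x_prev)) @ weights   # weights: (n_in + 1, n_out)
    return f(z)

def momentum_update(weights, dW_prev, delta, x_prev, eta, alpha):
    # Delta_t w_r(i,j) = eta * delta_r(j) * x_{r-1}(i) + alpha * Delta_{t-1} w_r(i,j)
    dW = eta * np.outer(np.concatenate(([1.0], x_prev)), delta) + alpha * dW_prev
    return weights + dW, dW   # return updated weights and the change, kept for next step
```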
Simple Adaptive Momentum
Swanston, Bishop & Mitchell (1994), "Simple adaptive momentum: new algorithm for training multilayer perceptrons", Electronics Letters, Vol. 30, No. 18, pp. 1498-1500.
Concept: adapt the momentum term depending on whether the weight change this time is in the same direction as last time.
Direction? The weight changes are held in an array, so they form a vector.
We therefore have two vectors, for the current and previous changes: Δwc and Δwp.
[Figure: two plots of a 2D weight space with axes w1 and w2, showing the previous and current weight-change vectors Δwp and Δwc; the angle θ between the two vectors can be seen.]
Implementing SAM
The simple idea is to replace the momentum constant α by α(1 + cos θ), where θ is the angle between the vectors of current and previous weight changes, Δwc and Δwp.
$\cos\theta = \dfrac{\Delta w_c \cdot \Delta w_p}{\lVert \Delta w_c \rVert \, \lVert \Delta w_p \rVert}$ ; i.e. use vector dot products.
In the original paper, the Δw vectors contain the changes to all the weights in the network.
In this paper, we consider adapting α at the network level, layer level and neuron level.
This is inspired by object-oriented programming of MLPs, which gives students a good example of, and practice with, the properties of OOP, albeit on an old ANN.
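As a sketch (not the paper's code, and assuming NumPy with flattened weight-change vectors), the adapted momentum can be computed with dot products; the eps guard against zero-length vectors is an added assumption for the first training step.

```python
import numpy as np

def sam_momentum(alpha, dw_current, dw_previous, eps=1e-12):
    # cos(theta) = (dw_c . dw_p) / (|dw_c| |dw_p|); the momentum term becomes
    # alpha * (1 + cos(theta)): it doubles when the changes align (theta = 0)
    # and vanishes when they oppose (theta = pi).
    norm = np.linalg.norm(dw_current) * np.linalg.norm(dw_previous)
    if norm < eps:                 # guard: no previous change yet
        return alpha
    cos_theta = np.dot(dw_current, dw_previous) / norm
    return alpha * (1.0 + cos_theta)
```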
OO Approach – Network Layers
One could program an MLP with an object for each neuron. But as each neuron needs inputs from the previous layer and deltas from the next, this requires many pointers, which is problematic for students.
It is easier to have an object for a layer of neurons (all of which share the same inputs): the inputs and weighted deltas are passed in arrays.
The base object is a layer of linearly activated neurons, LinActLayer: a single-layer network of neurons with f(z) = z.
For neurons with sigmoidal activation, only two functions differ: those calculating the output and the delta.
So SigActLayer is an object inheriting from LinActLayer: it uses the existing members and adds the two different ones.
Network For Hidden Layers
A hidden layer needs an enhanced SigActLayer with its own calculate-error function (which uses the weighted deltas of the next layer).
The existing objects are each a whole (single-layer) network.
So SigActHidLayer is a multiple-layer network:
it inherits from SigActLayer but also has a pointer to the next layer.
Most of its functions have two lines: process its own layer, then the next. The hierarchy is sketched in code below.
[Class diagram: LinActLayer is the base class; SigActLayer inherits from LinActLayer; SigActHidLayer inherits from SigActLayer.]
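A minimal Python sketch of this hierarchy, under the assumption that layers hold their weights as NumPy arrays; the method names are illustrative, not the author's actual class members.

```python
import numpy as np

class LinActLayer:
    """A single layer of linearly activated neurons: f(z) = z."""
    def __init__(self, n_inputs, n_neurons):
        self.W = np.random.uniform(-0.5, 0.5, (n_inputs + 1, n_neurons))

    def activation(self, z):
        return z                         # f(z) = z

    def activation_deriv(self, out):
        return np.ones_like(out)         # f'(z) = 1

    def calc_outputs(self, inputs):
        self.inputs = np.concatenate(([1.0], inputs))   # prepend bias input
        self.outputs = self.activation(self.inputs @ self.W)
        return self.outputs

class SigActLayer(LinActLayer):
    """Same layer, but sigmoidal: only the two activation functions differ."""
    def activation(self, z):
        return 1.0 / (1.0 + np.exp(-z))

    def activation_deriv(self, out):
        return out * (1.0 - out)

class SigActHidLayer(SigActLayer):
    """A hidden layer that also holds a pointer to the next layer,
    so a whole multi-layer network is itself one of these objects."""
    def __init__(self, n_inputs, n_neurons, next_layer):
        super().__init__(n_inputs, n_neurons)
        self.next_layer = next_layer

    def calc_outputs(self, inputs):
        # Most functions have two lines: process own layer, then the next.
        own = super().calc_outputs(inputs)
        return self.next_layer.calc_outputs(own)
```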
SAM and Hierarchy
Given this approach, the momentum can be adjusted using the weight changes:
a) over the whole network
b) separately by layer
c) separately for each neuron
For a), calculate the η · delta · inputs terms for all layers, then set α(1 + cos θ) globally.
For b), calculate η · delta · inputs for each layer and set α(1 + cos θ) for each layer separately.
For c) do the same, but for each neuron in each layer.
This works easily in the hierarchy.
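A sketch of the three granularities, reusing the dot-product rule from earlier; the helper names and the list-of-matrices layout (one weight-change matrix per layer, column j holding the weights into neuron j) are assumptions for illustration.

```python
import numpy as np

def sam_factor(alpha, dw_c, dw_p, eps=1e-12):
    # alpha * (1 + cos(theta)) via dot products, guarding the first step
    n = np.linalg.norm(dw_c) * np.linalg.norm(dw_p)
    return alpha if n < eps else alpha * (1.0 + np.dot(dw_c, dw_p) / n)

# dWs_c / dWs_p: lists of per-layer weight-change matrices, shape (n_in + 1, n_out)
def network_alphas(alpha, dWs_c, dWs_p):
    flat = lambda dWs: np.concatenate([d.ravel() for d in dWs])
    a = sam_factor(alpha, flat(dWs_c), flat(dWs_p))
    return [a] * len(dWs_c)                     # a) one factor for the whole net

def layer_alphas(alpha, dWs_c, dWs_p):
    return [sam_factor(alpha, c.ravel(), p.ravel())
            for c, p in zip(dWs_c, dWs_p)]      # b) one factor per layer

def neuron_alphas(alpha, dWs_c, dWs_p):
    return [[sam_factor(alpha, c[:, j], p[:, j]) for j in range(c.shape[1])]
            for c, p in zip(dWs_c, dWs_p)]      # c) one factor per neuron
```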
Experimentation
Three problems; for each there are training, validation and unseen data sets.
Stop training when error on validation set rises
Run 6 times per problem with different initial weights
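A minimal sketch of this stopping rule; train_epoch and sse are hypothetical helpers standing in for one backpropagation pass and the sum-squared error on a data set.

```python
def train_with_early_stopping(net, train_set, valid_set, max_epochs=5000):
    # Stop training when the error on the validation set rises.
    best_valid = float('inf')
    for epoch in range(max_epochs):
        train_epoch(net, train_set)    # hypothetical: one backprop pass
        v = sse(net, valid_set)        # hypothetical: sum-squared error
        if v > best_valid:             # validation error has risen: stop
            return epoch
        best_valid = v
    return max_epochs
```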
Problem 1: 2 inputs, 10 nodes in the hidden layer, 1 output.
SAM mode            None      Neuron    Layer     Network
Mean epochs taken   867       227       202       257

SAM mode    Train SSE    Valid SSE    Unseen SSE
None        0.0081985    0.0065965    0.0092535
Neuron      0.0100445    0.0084395    0.0107985
Layer       0.0103265    0.0086805    0.0106505
Network     0.0077125    0.0071095    0.0084845
Problem 2: 5 inputs, 15 nodes in the hidden layer and 1 output.
SAM mode      None      Neuron    Layer     Network
Mean epochs   1712      315       262       312

SAM mode    Train SSE    Valid SSE    Unseen SSE
None        0.0004725    0.0005625    0.0006665
Neuron      0.0006585    0.0007635    0.0009525
Layer       0.0007685    0.0008745    0.0011055
Network     0.0006215    0.0007655    0.0009505
With SAM the network trained much more quickly, but the SSEs were worse.
There is very little difference between adapting per layer and across the whole network, so ..
Problem 3: 5 inputs, 15 nodes in the hidden layer and 3 outputs.
SAM mode      None      Neuron    Layer     Network
Mean epochs   1133      497       638       977

SAM mode    Train SSE    Valid SSE    Unseen SSE
None        0.0044735    0.0043835    0.0054605
Neuron      0.0048205    0.0045685    0.0057955
Layer       0.0045675    0.0044105    0.0053225
Network     0.0045465    0.0044055    0.0053445
SSEs are averaged over the 3 outputs: here the Layer mode is best.
Conclusions and Further Work
The object-oriented hierarchy works neatly here.
SAM clearly reduces the number of epochs taken to learn, with little extra overhead per epoch.
In one example it increased the sum-squared errors; this needs investigating.
It needs to be tested on other problems, but it looks as if SAM at the layer level may be best (particularly with multiple outputs).
Momentum is used in other learning problems; SAM could be investigated for these too.