
Page 1:

On Simple Adaptive Momentum

Dr Richard Mitchell

Cybernetics Intelligence Research Group, Cybernetics, School of Systems Engineering, University of Reading, UK

[email protected]

Page 2:

Overview

Simple Adaptive Momentum (SAM) speeds the training of Multi-Layer Perceptrons (MLPs).

It adapts the normal momentum term depending on the angle between the current and previous changes in the weights of the MLP.

In the original paper, the weight changes of the whole network are used in determining this angle.

This paper considers adapting the momentum term using certain subsets of these weights.

It is inspired by the author's object-oriented approach to programming MLPs, which has been used successfully in teaching.

It is concluded that the angle is best determined using the weight changes in each layer separately.

Page 3:

Nomenclature in a Multi-Layer Net

$x_r(i)$ is the output of node $i$ in layer $r$; $w_r(i,j)$ is weight $i$ of the link to node $j$ in layer $r$.

[Figure: network diagram with two inputs $x_1(i)$ on the left, a hidden layer of three nodes $x_2(j)$ with weights $w_2(i,j)$, and two outputs $x_3(k)$ on the right with weights $w_3(j,k)$; the weights $w_r(0,j)$ are the biases.]

$$x_r(j) = f\!\left(\sum_{i=0}^{n_r} w_r(i,j)\, x_{r-1}(i)\right) \qquad f(z) \text{ is the activation function}$$

Change weights: $\Delta_t w_r(i,j) = \eta\, \delta_r(j)\, x_{r-1}(i) + \alpha\, \Delta_{t-1} w_r(i,j)$

$\delta$ is a function of the error; it varies with $f(z)$, and the error term itself also varies between layers.
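As a minimal NumPy sketch of these two equations (the function names and array layout here are my assumptions, not from the paper):

```python
import numpy as np

def forward_layer(x_prev, W, f):
    # x_prev: outputs of layer r-1, with x_prev[0] = 1 so that row 0 of W
    # holds the bias weights w_r(0, j).
    # W[i, j]: weight i of the link into node j of layer r.
    z = W.T @ x_prev              # z(j) = sum_i w_r(i,j) * x_{r-1}(i)
    return f(z)                   # x_r(j) = f(z(j))

def weight_change(dW_prev, delta, x_prev, eta, alpha):
    # Delta_t w_r(i,j) = eta * delta_r(j) * x_{r-1}(i) + alpha * Delta_{t-1} w_r(i,j)
    return eta * np.outer(x_prev, delta) + alpha * dW_prev
```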

Page 4:

Simple Adaptive Momentum

Swanston, D.J., Bishop, J.M. and Mitchell, R.J. (1994), "Simple adaptive momentum: new algorithm for training multilayer perceptrons", Electronics Letters, Vol. 30, No. 18, pp. 1498-1500.

Concept: adapt the momentum term depending on whether the weight change this time is in the same direction as last time.

Direction? The weight changes are held in an array, so they form a vector. We have two such vectors, for the current and previous changes, $\Delta w_c$ and $\Delta w_p$.

[Figure: e.g. in 2D, weight space with axes $w_1$ and $w_2$, showing the previous and current weight-change vectors $\Delta w_p$ and $\Delta w_c$; the angle $\theta$ between the two vectors can be seen.]

Page 5:

Implementing SAM

The simple idea is to replace the momentum constant $\alpha$ by $\alpha(1 + \cos\theta)$, where $\theta$ is the angle between the vectors of the current and previous weight changes, $\Delta w_c$ and $\Delta w_p$.

$$\cos(\theta) = \frac{\Delta w_c \cdot \Delta w_p}{\lVert \Delta w_c \rVert\, \lVert \Delta w_p \rVert} \qquad \text{i.e. use vector dot products}$$

In the original paper, the $\Delta w$ vectors comprise all the weights in the network.

In this paper, we consider adapting α at the network level, layer level and neuron level.

Inspired by the object-oriented programming of MLPs, which provides a good example of, and practice with, the properties of OOP for students, albeit on an old ANN.
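A minimal sketch of this adaptation, assuming the two weight-change sets have been flattened into 1-D vectors; the guard against a zero denominator is my addition:

```python
import numpy as np

def sam_alpha(alpha, dwc, dwp, eps=1e-12):
    # cos(theta) = (dwc . dwp) / (||dwc|| * ||dwp||)
    denom = np.linalg.norm(dwc) * np.linalg.norm(dwp)
    cos_theta = float(np.dot(dwc, dwp)) / denom if denom > eps else 0.0
    # The momentum term becomes alpha * (1 + cos(theta)): doubled when the
    # changes agree (theta = 0), zero when they oppose (theta = pi).
    return alpha * (1.0 + cos_theta)
```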

Page 6:

OO Approach – Network Layers

One can program an MLP with an object for each neuron. But as each neuron needs inputs from the previous layer and deltas from the next, this requires many pointers, which is problematic for students.

So it is easier to have an object for a layer of neurons (all with the same inputs): get the inputs and the weighted deltas in an array.

The base object is a layer of linearly activated neurons, LinActLayer: a single-layer network of neurons with f(z) = z.

For neurons with sigmoidal activation, only two functions need to differ: those for calculating the output and the delta.

So we have SigActLayer, an object inheriting from LinActLayer: it uses the existing members and adds the two different ones. A sketch follows below.
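A skeletal Python rendering of this design; the original course code is not shown, so the member names and initialisation here are guesses at the structure the slide describes:

```python
import numpy as np

class LinActLayer:
    """A layer of linearly activated neurons: itself a single-layer network."""
    def __init__(self, n_inputs, n_neurons):
        self.W = np.random.uniform(-0.5, 0.5, (n_inputs + 1, n_neurons))
        self.dW = np.zeros_like(self.W)        # previous changes, for momentum

    def activation(self, z):
        return z                               # linear: f(z) = z

    def act_deriv(self, out):
        return np.ones_like(out)               # f'(z) = 1

    def forward(self, x_prev):
        x = np.concatenate(([1.0], x_prev))    # prepend the bias input
        self.x_prev = x
        self.out = self.activation(self.W.T @ x)
        return self.out

class SigActLayer(LinActLayer):
    """Sigmoidal neurons: inherits all members, overriding only the two
    activation-specific functions, as the slide describes."""
    def activation(self, z):
        return 1.0 / (1.0 + np.exp(-z))

    def act_deriv(self, out):
        return out * (1.0 - out)               # sigma'(z) = sigma * (1 - sigma)
```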

Page 7:

Network For Hidden Layers

We need an enhanced SigActLayer with its own calculate-error function (using the weighted deltas from the next layer).

The existing objects are each a whole (single-layer) network.

So we have SigActHidLayer as a multiple-layer network: it inherits from SigActLayer but also has a pointer to the next layer.

Most of its functions have two lines: process its own layer, then the next (see the sketch below).

[Class diagram: ClassBase → LinActLayer → SigActLayer → SigActHidLayer]
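Continuing the sketch, the hidden-layer class might look like this; the two-line "process own layer, then the next" pattern is from the slide, the rest is assumption:

```python
class SigActHidLayer(SigActLayer):
    """A sigmoidal hidden layer that also holds a pointer to the next layer,
    so a chain of these objects is itself a multiple-layer network."""
    def __init__(self, n_inputs, n_neurons, next_layer):
        super().__init__(n_inputs, n_neurons)
        self.next = next_layer                 # pointer to the next layer

    def forward(self, x_prev):
        out = super().forward(x_prev)          # process own layer ...
        return self.next.forward(out)          # ... then the next
```

For example, a 2-10-1 network like Problem 1's would then be built as `SigActHidLayer(2, 10, SigActLayer(10, 1))`.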

Page 8:

SAM and Hierarchy

Given this approach, we can adjust the momentum using the weight changes:

a) over the whole network

b) separately by layer

c) separately for each neuron

For a), we need to calculate the $\eta\,\delta\,x$ terms for all layers, then set $\alpha(1 + \cos\theta)$ globally.

For b), we calculate the $\eta\,\delta\,x$ terms for each layer and set $\alpha(1 + \cos\theta)$ for each layer separately.

For c), we do the same, but for each neuron in each layer.

This works easily in the hierarchy.
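A sketch of how the three granularities might be implemented over that hierarchy, assuming each layer object holds its current gradient term grad = η δ x and its previous total change dW (sam_alpha is the helper sketched earlier):

```python
import numpy as np

def apply_sam(layers, alpha, mode="layer"):
    if mode == "neuron":
        # c) one angle per neuron: column j of W holds the weights into node j
        for l in layers:
            for j in range(l.W.shape[1]):
                a = sam_alpha(alpha, l.grad[:, j], l.dW[:, j])
                l.dW[:, j] = l.grad[:, j] + a * l.dW[:, j]
                l.W[:, j] += l.dW[:, j]
        return
    if mode == "network":
        # a) one angle over the concatenated changes of the whole network
        dwc = np.concatenate([l.grad.ravel() for l in layers])
        dwp = np.concatenate([l.dW.ravel() for l in layers])
        scales = [sam_alpha(alpha, dwc, dwp)] * len(layers)
    else:
        # b) one angle per layer, from that layer's weights alone
        scales = [sam_alpha(alpha, l.grad.ravel(), l.dW.ravel()) for l in layers]
    for l, a in zip(layers, scales):
        l.dW = l.grad + a * l.dW   # eta*delta*x + alpha(1 + cos theta) * previous change
        l.W += l.dW
```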

Page 9:

Experimentation

3 problems, each with training, validation and unseen data sets.

Stop training when the error on the validation set rises.

Run 6 times per problem with different initial weights.

Problem 1: 2 inputs, 10 nodes in the hidden layer, 1 output

SAM mode            None     Neuron   Layer    Network
Mean epochs taken   867      227      202      257

SAM mode   Train SSE   Valid SSE   Unseen SSE
None       0.0081985   0.0065965   0.0092535
Neuron     0.0100445   0.0084395   0.0107985
Layer      0.0103265   0.0086805   0.0106505
Network    0.0077125   0.0071095   0.0084845

Page 10:

Problem 2: 5 inputs, 15 nodes in the hidden layer and 1 output

SAM mode      None   Neuron   Layer   Network
Mean epochs   1712   315      262     312

SAM mode   Train SSE   Valid SSE   Unseen SSE
None       0.0004725   0.0005625   0.0006665
Neuron     0.0006585   0.0007635   0.0009525
Layer      0.0007685   0.0008745   0.0011055
Network    0.0006215   0.0007655   0.0009505

Trained much more quickly, but the SSEs are worse.

Very little difference between one layer and the whole network, so ...

Page 11:

Problem 3: 5 inputs, 15 nodes in the hidden layer and 3 outputs

SAM mode      None   Neuron   Layer   Network
Mean epochs   1133   497      638     977

SAM mode   Train SSE   Valid SSE   Unseen SSE
None       0.0044735   0.0043835   0.0054605
Neuron     0.0048205   0.0045685   0.0057955
Layer      0.0045675   0.0044105   0.0053225
Network    0.0045465   0.0044055   0.0053445

SSEs are averaged over the 3 outputs: here Layer is best.

Page 12:

Conclusions and Further Work

The object-oriented hierarchy works neatly here.

SAM clearly reduces the number of epochs taken to learn, with little extra overhead per epoch.

In one example it increased the Sum Squared Errors; this needs investigating.

It needs to be tested on other problems, but it looks as if SAM at the layer level may be best (particularly with multiple outputs).

Momentum is used in other learning problems; SAM could be investigated for these.