On Simple Adaptive Momentum
Dr Richard Mitchell
Cybernetics Intelligence Research Group, Cybernetics, School of Systems Engineering, University of Reading, UK
Presented at CIS 2008 © Dr Richard Mitchell 2008
Overview
Simple Adaptive Momentum (SAM) speeds the training of Multi-Layer Perceptrons (MLPs).
It adapts the normal momentum term depending on the angle between the current and previous changes in the weights of the MLP.
In the original paper, the weight changes of the whole network are used in determining this angle.
This paper considers adapting the momentum term using certain subsets of these weights.
It is inspired by the author’s object oriented approach to programming MLPs, successfully used in teaching.
It is concluded that the angle is best determined using the weight changes in each layer separately.
Nomenclature in a Multi-Layer Net
x_r(i) is the output of node i in layer r; w_r(i,j) is weight i of the link to node j in layer r.
[Figure: an example network with 2 inputs x1(1)..x1(2), 3 hidden nodes x2(1)..x2(3) and 2 outputs x3(1)..x3(2); weights w2(i,j) connect the inputs to the hidden nodes and w3(i,j) connect the hidden nodes to the outputs, with w_r(0,j) the bias weights. Inputs are on the left, outputs on the right.]
$x_r(j) = f\!\left(\sum_{i=0}^{n_r} w_r(i,j)\, x_{r-1}(i)\right)$, where $f(z)$ is the activation function.

Change weights: $\Delta_t w_r(i,j) = \eta\, \delta_r(j)\, x_{r-1}(i) + \alpha\, \Delta_{t-1} w_r(i,j)$
δ is a function of the error and varies with f(z); the error term itself also varies, differing between the output layer and the hidden layers.
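For concreteness, a minimal NumPy sketch of this forward pass and momentum-based weight update follows; the function names and array layout (bias weights in row 0, matching the w_r(0,j) terms) are illustrative assumptions, not the author's code.

```python
import numpy as np

def forward(x_prev, weights, f):
    # x_r(j) = f( sum_{i=0..n_r} w_r(i,j) * x_{r-1}(i) ); index 0 is the
    # bias, so a constant 1 is prepended to the previous layer's outputs.
    z = np.concatenate(([1.0], x_prev)) @ weights   # weights: (n_in + 1, n_out)
    return f(z)

def momentum_update(weights, dW_prev, delta, x_prev, eta, alpha):
    # Delta_t w_r(i,j) = eta * delta_r(j) * x_{r-1}(i) + alpha * Delta_{t-1} w_r(i,j)
    dW = eta * np.outer(np.concatenate(([1.0], x_prev)), delta) + alpha * dW_prev
    return weights + dW, dW   # return updated weights and the change, kept for next step
```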
Simple Adaptive Momentum
Swanston, Bishop & Mitchell (1994), "Simple adaptive momentum: new algorithm for training multilayer perceptrons", Electronics Letters, Vol. 30, No. 18, pp. 1498-1500.
Concept: adapt the momentum term depending on whether the weight change this time is in the same direction as last time.
Direction? The weight changes are held in an array, so they form a vector.
We therefore have two vectors, for the current and previous changes: Δwc and Δwp.
[Figure: two plots of a 2D weight space with axes w1 and w2, showing the previous and current weight-change vectors Δwp and Δwc; the angle θ between the two vectors can be seen.]
Implementing SAM
The simple idea is to replace the momentum constant α by α(1 + cos θ), where θ is the angle between the vectors of current and previous weight changes, Δwc and Δwp.
$\cos\theta = \dfrac{\Delta w_c \cdot \Delta w_p}{\lVert \Delta w_c \rVert \, \lVert \Delta w_p \rVert}$ ; i.e. use vector dot products.
In the original paper, the Δw vectors contain the changes to all the weights in the network.
In this paper, we consider adapting α at the network level, layer level and neuron level.
This is inspired by object-oriented programming of MLPs, which gives students a good example of, and practice with, the properties of OOP, albeit on an old ANN.
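As a sketch (not the paper's code, and assuming NumPy with flattened weight-change vectors), the adapted momentum can be computed with dot products; the eps guard against zero-length vectors is an added assumption for the first training step.

```python
import numpy as np

def sam_momentum(alpha, dw_current, dw_previous, eps=1e-12):
    # cos(theta) = (dw_c . dw_p) / (|dw_c| |dw_p|); the momentum term becomes
    # alpha * (1 + cos(theta)): it doubles when the changes align (theta = 0)
    # and vanishes when they oppose (theta = pi).
    norm = np.linalg.norm(dw_current) * np.linalg.norm(dw_previous)
    if norm < eps:                 # guard: no previous change yet
        return alpha
    cos_theta = np.dot(dw_current, dw_previous) / norm
    return alpha * (1.0 + cos_theta)
```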
OO Approach – Network Layers
One could program an MLP with an object for each neuron. But as each neuron needs inputs from the previous layer and deltas from the next, this requires many pointers, which is problematic for students.
It is easier to have an object for a layer of neurons (all of which share the same inputs): the inputs and weighted deltas are passed in arrays.
The base object is a layer of linearly activated neurons, LinActLayer: a single-layer network of neurons with f(z) = z.
For neurons with sigmoidal activation, only two functions differ: those calculating the output and the delta.
So SigActLayer is an object inheriting from LinActLayer: it uses the existing members and adds the two different ones.
Network For Hidden Layers
A hidden layer needs an enhanced SigActLayer with its own calculate-error function (which uses the weighted deltas of the next layer).
The existing objects are each a whole (single-layer) network.
So SigActHidLayer is a multiple-layer network:
it inherits from SigActLayer but also has a pointer to the next layer.
Most of its functions have two lines: process its own layer, then the next. The hierarchy is sketched in code below.
[Class diagram: LinActLayer is the base class; SigActLayer inherits from LinActLayer; SigActHidLayer inherits from SigActLayer.]
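A minimal Python sketch of this hierarchy, under the assumption that layers hold their weights as NumPy arrays; the method names are illustrative, not the author's actual class members.

```python
import numpy as np

class LinActLayer:
    """A single layer of linearly activated neurons: f(z) = z."""
    def __init__(self, n_inputs, n_neurons):
        self.W = np.random.uniform(-0.5, 0.5, (n_inputs + 1, n_neurons))

    def activation(self, z):
        return z                         # f(z) = z

    def activation_deriv(self, out):
        return np.ones_like(out)         # f'(z) = 1

    def calc_outputs(self, inputs):
        self.inputs = np.concatenate(([1.0], inputs))   # prepend bias input
        self.outputs = self.activation(self.inputs @ self.W)
        return self.outputs

class SigActLayer(LinActLayer):
    """Same layer, but sigmoidal: only the two activation functions differ."""
    def activation(self, z):
        return 1.0 / (1.0 + np.exp(-z))

    def activation_deriv(self, out):
        return out * (1.0 - out)

class SigActHidLayer(SigActLayer):
    """A hidden layer that also holds a pointer to the next layer,
    so a whole multi-layer network is itself one of these objects."""
    def __init__(self, n_inputs, n_neurons, next_layer):
        super().__init__(n_inputs, n_neurons)
        self.next_layer = next_layer

    def calc_outputs(self, inputs):
        # Most functions have two lines: process own layer, then the next.
        own = super().calc_outputs(inputs)
        return self.next_layer.calc_outputs(own)
```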
SAM and Hierarchy
Given this approach, the momentum can be adjusted using the weight changes:
a) over the whole network
b) separately by layer
c) separately for each neuron
For a), calculate the η · delta · inputs terms for all layers, then set α(1 + cos θ) globally.
For b), calculate η · delta · inputs for each layer and set α(1 + cos θ) for each layer separately.
For c) do the same, but for each neuron in each layer.
This works easily in the hierarchy.
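A sketch of the three granularities, reusing the dot-product rule from earlier; the helper names and the list-of-matrices layout (one weight-change matrix per layer, column j holding the weights into neuron j) are assumptions for illustration.

```python
import numpy as np

def sam_factor(alpha, dw_c, dw_p, eps=1e-12):
    # alpha * (1 + cos(theta)) via dot products, guarding the first step
    n = np.linalg.norm(dw_c) * np.linalg.norm(dw_p)
    return alpha if n < eps else alpha * (1.0 + np.dot(dw_c, dw_p) / n)

# dWs_c / dWs_p: lists of per-layer weight-change matrices, shape (n_in + 1, n_out)
def network_alphas(alpha, dWs_c, dWs_p):
    flat = lambda dWs: np.concatenate([d.ravel() for d in dWs])
    a = sam_factor(alpha, flat(dWs_c), flat(dWs_p))
    return [a] * len(dWs_c)                     # a) one factor for the whole net

def layer_alphas(alpha, dWs_c, dWs_p):
    return [sam_factor(alpha, c.ravel(), p.ravel())
            for c, p in zip(dWs_c, dWs_p)]      # b) one factor per layer

def neuron_alphas(alpha, dWs_c, dWs_p):
    return [[sam_factor(alpha, c[:, j], p[:, j]) for j in range(c.shape[1])]
            for c, p in zip(dWs_c, dWs_p)]      # c) one factor per neuron
```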
Experimentation
Three problems; for each there are training, validation and unseen data sets.
Stop training when error on validation set rises
Run 6 times per problem with different initial weights
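A minimal sketch of this stopping rule; train_epoch and sse are hypothetical helpers standing in for one backpropagation pass and the sum-squared error on a data set.

```python
def train_with_early_stopping(net, train_set, valid_set, max_epochs=5000):
    # Stop training when the error on the validation set rises.
    best_valid = float('inf')
    for epoch in range(max_epochs):
        train_epoch(net, train_set)    # hypothetical: one backprop pass
        v = sse(net, valid_set)        # hypothetical: sum-squared error
        if v > best_valid:             # validation error has risen: stop
            return epoch
        best_valid = v
    return max_epochs
```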
Problem 1: 2 inputs, 10 nodes in the hidden layer, 1 output.
SAM mode            None      Neuron    Layer     Network
Mean epochs taken   867       227       202       257

SAM mode    Train SSE    Valid SSE    Unseen SSE
None        0.0081985    0.0065965    0.0092535
Neuron      0.0100445    0.0084395    0.0107985
Layer       0.0103265    0.0086805    0.0106505
Network     0.0077125    0.0071095    0.0084845
Problem 2: 5 inputs, 15 nodes in the hidden layer and 1 output.
SAM mode      None      Neuron    Layer     Network
Mean epochs   1712      315       262       312

SAM mode    Train SSE    Valid SSE    Unseen SSE
None        0.0004725    0.0005625    0.0006665
Neuron      0.0006585    0.0007635    0.0009525
Layer       0.0007685    0.0008745    0.0011055
Network     0.0006215    0.0007655    0.0009505
With SAM the network trained much more quickly, but the SSEs were worse.
There is very little difference between adapting per layer and across the whole network, so ..
Problem 3: 5 inputs, 15 nodes in the hidden layer and 3 outputs.
SAM mode      None      Neuron    Layer     Network
Mean epochs   1133      497       638       977

SAM mode    Train SSE    Valid SSE    Unseen SSE
None        0.0044735    0.0043835    0.0054605
Neuron      0.0048205    0.0045685    0.0057955
Layer       0.0045675    0.0044105    0.0053225
Network     0.0045465    0.0044055    0.0053445
SSEs are averaged over the 3 outputs: here the Layer mode is best.
Conclusions and Further Work
The object-oriented hierarchy works neatly here.
SAM clearly reduces the number of epochs taken to learn, with little extra overhead per epoch.
In one example it increased the sum-squared errors; this needs investigating.
It needs to be tested on other problems, but it looks as if SAM at the layer level may be best (particularly with multiple outputs).
Momentum is used in other learning problems; SAM could be investigated for these too.