Backpropagation Training
Michael J. Watts
http://mike.watts.net.nz
Lecture Outline
• Backpropagation training
• Error calculation
• Error surface
• Pattern vs. Batch mode training
• Restrictions of backprop
• Problems with backprop
Backpropagation Training
• Backpropagation of error training
• Also known as backprop or BP
• A supervised learning algorithm
• Outputs of the network are compared to the desired outputs
Backpropagation Training
• Error is calculated
• Error is propagated backwards over the network
• Error is used to calculate weight changes
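The three steps above can be sketched for a single training example. This is a minimal illustration, not the slides' own code: it assumes a small two-layer sigmoid MLP trained on sum-squared error, and all names (`train_step`, `w_hidden`, `w_output`) are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, target, w_hidden, w_output, eta=0.5):
    # Forward pass: hidden and output activations
    h = sigmoid(w_hidden @ x)
    o = sigmoid(w_output @ h)

    # Step 1: error calculated at the output layer
    err_out = (target - o) * o * (1 - o)

    # Step 2: error propagated backwards to the hidden layer
    err_hid = (w_output.T @ err_out) * h * (1 - h)

    # Step 3: error used to calculate weight changes (applied in place)
    w_output += eta * np.outer(err_out, h)
    w_hidden += eta * np.outer(err_hid, x)
    return o
```

Repeating `train_step` on the same example drives the output towards the target, which is the behaviour the following slides analyse.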
Backpropagation Training
• The weight update rule is:

  Δw(t) = −η ∂E/∂w + α Δw(t−1)

• where:
  o Δw (delta) is the weight change
  o η (eta) is the learning rate
  o α (alpha) is the momentum term
Backpropagation Training
• where:
  o Δw is the change to the weight
  o ∂E/∂w is the partial derivative of the error E with respect to the weight w
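The update rule with momentum is one line of code. A sketch (the function name and default parameter values are illustrative, not from the slides):

```python
def weight_update(grad, prev_delta, eta=0.1, alpha=0.9):
    """One application of the backprop update rule with momentum.

    grad       -- the partial derivative dE/dw for this weight
    prev_delta -- the weight change Delta-w applied at the previous step
    """
    # Delta-w(t) = -eta * dE/dw + alpha * Delta-w(t-1)
    return -eta * grad + alpha * prev_delta
```

The new weight is `w + weight_update(...)`, and the returned delta is remembered as `prev_delta` for the next step.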
Error Calculation
• Several different error measures exist
  o applied over individual examples or the complete training set
• Sum Squared Error
  o SSE
• Mean Squared Error
  o MSE
Error Calculation
• Sum Squared Error
  o SSE
  o measured over the entire data set
Error Calculation
• Mean Squared Error
  o measured over individual examples
  o reduces errors over multiple outputs to a single value
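Following the definitions above, SSE sums squared errors over every output of every example in the data set, while MSE averages the squared errors over the outputs of one example. A sketch, assuming targets and outputs are stored as plain lists:

```python
def sse(targets, outputs):
    # Sum Squared Error: summed over the entire data set
    return sum((t - o) ** 2
               for target, output in zip(targets, outputs)
               for t, o in zip(target, output))

def mse(target, output):
    # Mean Squared Error for a single example:
    # reduces the errors over multiple outputs to one value
    errs = [(t - o) ** 2 for t, o in zip(target, output)]
    return sum(errs) / len(errs)
```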
Error Surface
• Plot of network error against weight values
• Consider the network as a function that returns an error
• Each connection weight is one parameter of this function
• Low points in the surface are local minima
• The lowest is the global minimum
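One way to see the "network as a function of its weights" idea is to evaluate the error at a grid of weight values. A sketch, assuming a single two-input sigmoid neuron on a toy OR data set (all names and values here are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data set: the OR function for a two-input sigmoid neuron
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([0.0, 1.0, 1.0, 1.0])

def error_at(w1, w2):
    # The network viewed as a function of its weights: returns SSE
    out = sigmoid(X @ np.array([w1, w2]))
    return np.sum((T - out) ** 2)

# Sample the error surface over a grid of weight values
ws = np.linspace(-5, 5, 50)
surface = np.array([[error_at(w1, w2) for w2 in ws] for w1 in ws])
# Low points in `surface` are minima of the error
```

Plotting `surface` as a 3D mesh gives exactly the kind of error-surface picture these slides describe.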
Error Surface
• At any time t the network is at one point on the error surface
• Movement across the surface from time t to t+1 is driven by the BP rule
• The network moves “downhill” to points of lower error
• The BP rule is like gravity
Error Surface
• The learning rate is like a multiplier on gravity
• Determines how fast the network will move downhill
• The network can get stuck in a dip
  o stuck in a local minimum
  o low local error, but not the lowest global error
Error Surface
• Too low a learning rate = slow learning
• Too high = high chance of getting stuck
Error Surface
• Momentum is like the momentum of a moving mass
• Once in motion, it will keep moving
• Prevents sudden changes in direction
• Momentum can carry the network out of a local minimum
Error Surface
• Not enough momentum = less chance of escaping a local minimum
• Too much momentum means the network can fly out of the global minimum
Pattern vs Batch Training
• Also known as
  o batch and online
  o offline and online
• Pattern mode applies weight deltas after each example
• Batch mode accumulates deltas and applies them all at once
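The difference is only *when* the deltas are applied. A sketch, assuming a one-weight linear unit with error E = 0.5·(t − w·x)² so that the gradient has a closed form (everything here is illustrative):

```python
def gradient(w, x, t):
    # dE/dw for a one-weight linear unit with E = 0.5 * (t - w*x)**2
    return -(t - w * x) * x

def pattern_mode_epoch(w, examples, eta):
    # Pattern (online) mode: apply the weight change after each example
    for x, t in examples:
        w = w - eta * gradient(w, x, t)
    return w

def batch_mode_epoch(w, examples, eta):
    # Batch (offline) mode: accumulate the deltas over the whole
    # epoch and apply them all at once
    total = 0.0
    for x, t in examples:
        total += gradient(w, x, t)
    return w - eta * total
```

On this toy problem (examples consistent with w = 2) both modes converge to the same weight, but they trace different paths across the error surface, which is the point the next slides make.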
Pattern vs Batch Training
• Batch mode is closer to true gradient descent
  o requires a smaller learning rate
• Step size is smaller
• Smooth traversal of the error surface
• Requires many epochs
Pattern vs Batch Training
• Pattern mode is easier to implement
• Requires shuffling of the training set
• Not simple gradient descent
  o might not take a direct downward path
• Requires a small step size (learning rate) to avoid getting stuck
Restrictions on Backprop
• Error and activation functions must be differentiable
• Hard threshold functions cannot be used
  o e.g. the step (Heaviside) function
• Cannot model discontinuous functions
• Mixed activation functions cannot be used
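The differentiability restriction is easy to see by comparing the sigmoid with the step function: the sigmoid has a derivative everywhere, with the convenient closed form s(x)·(1 − s(x)), while the Heaviside step gives backprop no gradient to work with. A sketch:

```python
import math

def sigmoid(x):
    # Differentiable everywhere: usable with backprop
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    # Convenient closed form: s(x) * (1 - s(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def step(x):
    # Heaviside step: derivative is 0 everywhere except x = 0,
    # where it is undefined -- backprop has no gradient to use
    return 1.0 if x >= 0.0 else 0.0
```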
Problems with Backprop
• Time consuming
  o hundreds or thousands of epochs may be required
  o variants of BP, e.g. quickprop, address this
• Selection of parameters is difficult
  o epochs, learning rate and momentum
Problems with Backprop
• Local minima
  o can easily get stuck in a locally minimal error
• Overfitting
  o can over-learn the training data
  o lose the ability to generalise
• Sensitive to the MLP topology
  o number of connections (“free parameters”)
Summary
• Backprop is a supervised learning algorithm
• Gradient descent reduction of errors
• The error surface is a multidimensional surface that describes the performance of the network in relation to the weights
Summary
• Traversal of the error surface is influenced by the learning rate and momentum parameters
• Pattern vs. Batch mode training
  o the difference is when the deltas are applied
• Backprop has some problems
  o local minima
  o overfitting