Neural Networks - Oliver Schulte - CMPT 726
TRANSCRIPT
Neural Networks
Oliver Schulte - CMPT 726
Bishop PRML Ch. 5
Neural Networks
• Neural networks arise from attempts to model human/animal brains
• Many models, many claims of biological plausibility
• We will focus on multi-layer perceptrons
  • Mathematical properties rather than biological plausibility
  • Prof. Hadley: CMPT 418
Uses of Neural Networks
• Pros
  • Good for continuous input variables.
  • General continuous function approximators.
  • Highly non-linear.
  • Learn feature functions.
  • Good to use in continuous domains where you have little prior knowledge:
    • When you don't know good features.
    • When you don't know the form of a good functional model.
• Cons
  • Not interpretable, a "black box".
  • Learning is slow.
  • Good generalization can require many data points.
Applications
There are many, many applications.
• World-Champion Backgammon Player.
• No Hands Across America Tour.
• Digit Recognition with 99.26% accuracy.
• ...
Outline
Feed-forward Networks
Network Training
Error Backpropagation
Applications
Feed-forward Networks
• We have looked at generalized linear models of the form:

  y(x, w) = f( Σ_{j=1}^{M} w_j φ_j(x) )

  for fixed non-linear basis functions φ(·)
• We now extend this model by allowing adaptive basis functions, and learning their parameters
• In feed-forward networks (a.k.a. multi-layer perceptrons) we let each basis function be another non-linear function of a linear combination of the inputs:

  φ_j(x) = f( Σ ... )
Feed-forward Networks
• Starting with input x = (x_1, ..., x_D), construct linear combinations:

  a_j = Σ_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0}

  These a_j are known as activations
• Pass through an activation function h(·) to get output z_j = h(a_j)
• Model of an individual neuron
  [Figure: single-neuron model, from Russell and Norvig, AIMA2e]
Activation Functions
• Can use a variety of activation functions
  • Sigmoidal (S-shaped)
    • Logistic sigmoid 1/(1 + exp(−a)) (useful for binary classification)
    • Hyperbolic tangent tanh
  • Radial basis function z_j = Σ_i (x_i − w_{ji})²
  • Softmax
    • Useful for multi-class classification
  • Identity
    • Useful for regression
  • Threshold
  • ...
• Needs to be differentiable for gradient-based learning (later)
• Can use different activation functions in each unit (a few are sketched in code below)
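For concreteness, here is a minimal numpy sketch (function names are my own, not course code) of a few of these activation functions together with the derivatives that gradient-based learning needs:

```python
import numpy as np

def logistic(a):
    """Logistic sigmoid 1 / (1 + exp(-a)); useful for binary classification."""
    return 1.0 / (1.0 + np.exp(-a))

def logistic_prime(a):
    """Derivative of the logistic sigmoid, needed for gradient-based learning."""
    s = logistic(a)
    return s * (1.0 - s)

def tanh_prime(a):
    """Derivative of the hyperbolic tangent."""
    return 1.0 - np.tanh(a) ** 2

def softmax(a):
    """Softmax over the last axis; useful for multi-class classification."""
    e = np.exp(a - np.max(a, axis=-1, keepdims=True))  # shift for numerical stability
    return e / np.sum(e, axis=-1, keepdims=True)
```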
Feed-forward Networks
[Figure: two-layer feed-forward network with inputs x_0 (bias), x_1, ..., x_D, hidden units z_0 (bias), z_1, ..., z_M, outputs y_1, ..., y_K, first-layer weights w^{(1)} and second-layer weights w^{(2)}]
• Connect together a number of these units into a feed-forward network (a DAG)
• The figure above shows a network with one layer of hidden units
• It implements the function (see the sketch below):

  y_k(x, w) = h( Σ_{j=1}^{M} w^{(2)}_{kj} h( Σ_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0} ) + w^{(2)}_{k0} )
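As a concrete illustration, here is a numpy sketch of this function (my own naming, not course code), with tanh hidden units and identity outputs; W1 and b1 play the role of the first-layer weights w^{(1)} and biases w^{(1)}_{j0}, and W2 and b2 the role of w^{(2)}:

```python
import numpy as np

def forward(x, W1, b1, W2, b2, h=np.tanh, h_out=lambda a: a):
    """One-hidden-layer forward pass:
    y_k = h_out( sum_j W2[k, j] * h( sum_i W1[j, i] * x[i] + b1[j] ) + b2[k] )"""
    a = W1 @ x + b1          # hidden activations a_j
    z = h(a)                 # hidden outputs z_j = h(a_j)
    y = h_out(W2 @ z + b2)   # network outputs y_k
    return y

# Example with D = 3 inputs, M = 4 hidden units, K = 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
y = forward(rng.normal(size=3), W1, b1, W2, b2)
```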
Hidden Units Compute Basis Functions
• [Figure] red dots = network function; dashed lines = hidden unit activation functions; blue dots = data points
Network Training
• Given a specified network structure, how do we set its parameters (weights)?
• As usual, we define a criterion to measure how well our network performs, and optimize against it
• For regression, training data are (x_n, t_n), with t_n ∈ ℝ
  • Squared error naturally arises:

    E(w) = Σ_{n=1}^{N} {y(x_n, w) − t_n}²

• For binary classification, this is another discriminative model; maximum likelihood gives:

  p(t|w) = Π_{n=1}^{N} y_n^{t_n} {1 − y_n}^{1 − t_n}

  E(w) = −Σ_{n=1}^{N} {t_n ln y_n + (1 − t_n) ln(1 − y_n)}
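A small numpy sketch of these two error functions (the names and the clipping constant are my own additions), where Y holds the network outputs y_n and T the targets t_n:

```python
import numpy as np

def squared_error(Y, T):
    """Sum-of-squares error sum_n (y_n - t_n)^2 for regression."""
    return np.sum((Y - T) ** 2)

def cross_entropy(Y, T, eps=1e-12):
    """Binary cross-entropy -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)]."""
    Y = np.clip(Y, eps, 1.0 - eps)  # guard against log(0)
    return -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y))
```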
Parameter Optimization
[Figure: error surface E(w) over two weights w_1, w_2, marking points w_A, w_B, w_C and the gradient ∇E]
• For either of these problems, the error function E(w) is nasty
  • Nasty = non-convex
  • Non-convex = has local minima
Descent Methods
• The typical strategy for optimization problems of this sort is a descent method:

  w^{(τ+1)} = w^{(τ)} + Δw^{(τ)}

• These come in many flavours
  • Gradient descent: step based on the full gradient ∇E(w^{(τ)})
  • Stochastic gradient descent: step based on ∇E_n(w^{(τ)}) for a single example n
  • Newton-Raphson (second order): uses the Hessian ∇²E
• All of these can be used here; stochastic gradient descent is particularly effective (a sketch follows below)
  • Redundancy in training data, escaping local minima
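A sketch of the stochastic variant (the learning rate eta, the per-example gradient helper grad_En, and the epoch count are illustrative assumptions, not part of the slides):

```python
import numpy as np

def sgd(w, grad_En, data, eta=0.01, epochs=10, seed=0):
    """Stochastic gradient descent: w <- w - eta * grad E_n(w),
    updating on one randomly ordered training example at a time."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for n in rng.permutation(len(data)):
            w = w - eta * grad_En(w, data[n])  # grad_En: assumed per-example gradient
    return w
```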
Computing Gradients
• The function y(x_n, w) implemented by a network is complicated
• It isn't obvious how to compute error function derivatives with respect to the weights
• A numerical method for calculating error derivatives is to use finite differences:

  ∂E_n/∂w_{ji} ≈ [ E_n(w_{ji} + ε) − E_n(w_{ji} − ε) ] / (2ε)

• How much computation would this take with W weights in the network?
  • O(W) per derivative, O(W²) total per gradient descent step
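A numpy sketch of this finite-difference estimate (in practice mainly useful as a correctness check for backprop); E_n is assumed to be a function of the whole weight array:

```python
import numpy as np

def numerical_gradient(E_n, w, eps=1e-6):
    """Central finite differences: two evaluations of E_n per weight.
    Each evaluation of E_n costs O(W), so this is O(W) per derivative
    and O(W^2) for the full gradient."""
    g = np.zeros_like(w, dtype=float)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus.flat[i] += eps
        w_minus.flat[i] -= eps
        g.flat[i] = (E_n(w_plus) - E_n(w_minus)) / (2.0 * eps)
    return g
```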
Error Backpropagation
• Backprop is an efficient method for computing the error derivatives ∂E_n/∂w_{ji}
  • O(W) to compute derivatives with respect to all weights
• First, feed training example x_n forward through the network, storing all activations a_j
• Calculating derivatives for weights connected to output nodes is easy
  • e.g. for linear output nodes y_k = Σ_i w_{ki} z_i:

    ∂E_n/∂w_{ki} = ∂/∂w_{ki} [ ½ (y_n − t_n)² ] = (y_n − t_n) z_i

• For hidden layers, propagate the error backwards from the output nodes
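To make the output-layer case concrete, here is a small numpy check (variable names are mine) that the analytic derivative (y_n − t_n) z_i agrees with the finite-difference estimate from the previous slide:

```python
import numpy as np

# Linear output node y = sum_i w[i] * z[i], squared error E_n = 0.5 * (y - t)^2
rng = np.random.default_rng(1)
z, t = rng.normal(size=5), 0.7
w = rng.normal(size=5)

y = w @ z
analytic = (y - t) * z          # dE_n/dw_i = (y - t) * z_i

eps = 1e-6
numeric = np.zeros(5)
for i in range(5):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    numeric[i] = (0.5 * (w_plus @ z - t) ** 2 - 0.5 * (w_minus @ z - t) ** 2) / (2 * eps)

assert np.allclose(analytic, numeric)
```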
Chain Rule for Partial Derivatives
• A "reminder"
• For f(x, y), with f differentiable with respect to x and y, and x and y differentiable with respect to u and v:

  ∂f/∂u = (∂f/∂x)(∂x/∂u) + (∂f/∂y)(∂y/∂u)

  and

  ∂f/∂v = (∂f/∂x)(∂x/∂v) + (∂f/∂y)(∂y/∂v)
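As a quick sanity check of the rule (an added example, not from the slides): take f(x, y) = x y with x = u + v and y = u v. Then

  ∂f/∂u = (∂f/∂x)(∂x/∂u) + (∂f/∂y)(∂y/∂u) = y·1 + x·v = uv + (u + v)v = 2uv + v²,

which matches differentiating f(u, v) = u²v + uv² directly.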
Error Backpropagation
• We can write

  ∂E_n/∂w_{ji} = ∂/∂w_{ji} E_n(a_{j_1}, a_{j_2}, ..., a_{j_m})

  where {j_1, ..., j_m} are the indices of the nodes in the same layer as node j
• Using the chain rule:

  ∂E_n/∂w_{ji} = (∂E_n/∂a_j)(∂a_j/∂w_{ji}) + Σ_k (∂E_n/∂a_k)(∂a_k/∂w_{ji})

  where Σ_k runs over all other nodes k in the same layer as node j
• Since a_k does not depend on w_{ji}, all terms in the summation go to 0:

  ∂E_n/∂w_{ji} = (∂E_n/∂a_j)(∂a_j/∂w_{ji})
Error Backpropagation cont.
• Introduce the error δ_j ≡ ∂E_n/∂a_j, so that

  ∂E_n/∂w_{ji} = δ_j (∂a_j/∂w_{ji})

• The other factor is:

  ∂a_j/∂w_{ji} = ∂/∂w_{ji} Σ_k w_{jk} z_k = z_i
Error Backpropagation cont.
• The error δ_j can also be computed using the chain rule:

  δ_j ≡ ∂E_n/∂a_j = Σ_k (∂E_n/∂a_k)(∂a_k/∂a_j) = Σ_k δ_k (∂a_k/∂a_j)

  where Σ_k runs over all nodes k in the layer after node j
• Eventually:

  δ_j = h′(a_j) Σ_k w_{kj} δ_k

• A weighted sum of the later errors "caused" by this node (put together in the sketch below)
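Putting the pieces together, a hedged numpy sketch of backprop for the one-hidden-layer network sketched earlier (tanh hidden units, linear outputs, squared error); the variable names follow the earlier forward-pass sketch rather than any official course code. It returns the gradients for W1, b1, W2, b2 in that order:

```python
import numpy as np

def backprop(x, t, W1, b1, W2, b2):
    """Gradients of E_n = 0.5 * ||y - t||^2 for a network with tanh hidden
    units and linear outputs, via the delta recursion above."""
    # Forward pass, storing activations
    a = W1 @ x + b1          # hidden activations a_j
    z = np.tanh(a)           # hidden outputs z_j
    y = W2 @ z + b2          # linear outputs y_k

    # Output errors: delta_k = dE_n/da_k = y_k - t_k for linear outputs
    delta_out = y - t
    # Hidden errors: delta_j = h'(a_j) * sum_k w_kj * delta_k, with tanh'(a) = 1 - z^2
    delta_hidden = (1.0 - z ** 2) * (W2.T @ delta_out)

    # dE_n/dw_ji = delta_j * z_i (x_i for the first layer), delta_j for biases
    return np.outer(delta_hidden, x), delta_hidden, np.outer(delta_out, z), delta_out
```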
Applications of Neural Networks
• Many success stories for neural networks
  • Credit card fraud detection
  • Hand-written digit recognition
  • Face detection
  • Autonomous driving (CMU ALVINN)
Hand-written Digit Recognition
• MNIST - standard dataset for hand-written digit recognition
  • 60,000 training images, 10,000 test images
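One common way to load it for experiments (an added convenience, not part of the course material; it assumes TensorFlow/Keras is installed, and the raw files are also available from http://yann.lecun.com/exdb/mnist/):

```python
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)
```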
LeNet-5
[Figure: LeNet-5 architecture. INPUT 32x32 -> C1: 6 feature maps 28x28 (convolutions) -> S2: 6 feature maps 14x14 (subsampling) -> C3: 16 feature maps 10x10 (convolutions) -> S4: 16 feature maps 5x5 (subsampling) -> C5: layer of 120 (full connection) -> F6: layer of 84 (full connection) -> OUTPUT: 10 (Gaussian connections)]
• LeNet developed by Yann LeCun et al.
• Convolutional neural network
  • Local receptive fields (5x5 connectivity)
  • Subsampling (2x2)
  • Shared weights (reuse same 5x5 "filter")
  • Breaking symmetry
• See http://www.codeproject.com/KB/library/NeuralNetRecognition.aspx
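To make "local receptive fields + shared weights + subsampling" concrete, a minimal numpy sketch (not LeCun's implementation) of a single 5x5 feature map followed by 2x2 subsampling:

```python
import numpy as np

def conv_valid(image, kernel):
    """'Valid' 2-D convolution with one shared kernel: every output position
    reuses the same 5x5 weights (weight sharing, local receptive fields)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def subsample(fmap):
    """2x2 subsampling (average pooling), as in the S2/S4 layers."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32))
c1 = conv_valid(image, rng.normal(size=(5, 5)))  # 28x28 feature map
s2 = subsample(c1)                               # 14x14 after subsampling
```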
[Figure: the misclassified test digits, each labelled true digit -> predicted digit (e.g. 4->6, 3->5, 8->2, ...)]
• The 82 errors made by LeNet-5 (0.82% test error rate)
Conclusion
• Readings: Ch. 5.1, 5.2, 5.3
• Feed-forward networks can be used for regression or classification
  • Similar to linear models, except with adaptive non-linear basis functions
  • These allow us to do more than e.g. linear decision boundaries
  • Different error functions
• Learning is more difficult, since the error function is not convex
  • Use stochastic gradient descent, and obtain a (good?) local minimum
  • Backpropagation for efficient gradient computation