Neural Networks - Oliver Schulte - CMPT 726
TRANSCRIPT
Neural Networks
Oliver Schulte - CMPT 726
Bishop PRML Ch. 5
Neural Networks
• Neural networks arise from attempts to model human/animal brains
• Many models, many claims of biological plausibility
• We will focus on multi-layer perceptrons
  • Mathematical properties rather than biological plausibility
  • Prof. Hadley: CMPT 418
Uses of Neural Networks
• Pros
  • Good for continuous input variables.
  • General continuous function approximators.
  • Highly non-linear.
  • Learn feature functions.
  • Good to use in continuous domains where you have little prior knowledge:
    • When you don't know good features.
    • When you don't know the form of a good functional model.
• Cons
  • Not interpretable, a "black box".
  • Learning is slow.
  • Good generalization can require many data points.
Applications
There are many, many applications.
• World-Champion Backgammon Player.
• No Hands Across America Tour.
• Digit Recognition with 99.26% accuracy.
• ...
Outline
Feed-forward Networks
Network Training
Error Backpropagation
Applications
Feed-forward Networks
• We have looked at generalized linear models of the form:

  y(x, w) = f( Σ_{j=1}^{M} w_j φ_j(x) )

  for fixed non-linear basis functions φ(·)
• We now extend this model by allowing adaptive basis functions, and learning their parameters
• In feed-forward networks (a.k.a. multi-layer perceptrons) we let each basis function be another non-linear function of a linear combination of the inputs:

  φ_j(x) = f( Σ ... )
Feed-forward Networks
• Starting with input x = (x_1, ..., x_D), construct linear combinations:

  a_j = Σ_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0}

  These a_j are known as activations
• Pass through an activation function h(·) to get output z_j = h(a_j)
• Model of an individual neuron
  [Figure: single-neuron model, from Russell and Norvig, AIMA2e]
Activation Functions
• Can use a variety of activation functions
  • Sigmoidal (S-shaped)
    • Logistic sigmoid 1/(1 + exp(−a)) (useful for binary classification)
    • Hyperbolic tangent tanh
  • Radial basis function z_j = Σ_i (x_i − w_{ji})²
  • Softmax
    • Useful for multi-class classification
  • Identity
    • Useful for regression
  • Threshold
  • ...
• Needs to be differentiable for gradient-based learning (later)
• Can use different activation functions in each unit (a few are sketched in code below)
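For concreteness, here is a minimal numpy sketch (function names are my own, not course code) of a few of these activation functions together with the derivatives that gradient-based learning needs:

```python
import numpy as np

def logistic(a):
    """Logistic sigmoid 1 / (1 + exp(-a)); useful for binary classification."""
    return 1.0 / (1.0 + np.exp(-a))

def logistic_prime(a):
    """Derivative of the logistic sigmoid, needed for gradient-based learning."""
    s = logistic(a)
    return s * (1.0 - s)

def tanh_prime(a):
    """Derivative of the hyperbolic tangent."""
    return 1.0 - np.tanh(a) ** 2

def softmax(a):
    """Softmax over the last axis; useful for multi-class classification."""
    e = np.exp(a - np.max(a, axis=-1, keepdims=True))  # shift for numerical stability
    return e / np.sum(e, axis=-1, keepdims=True)
```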
Feed-forward Networks
[Figure: two-layer feed-forward network with inputs x_0 (bias), x_1, ..., x_D, hidden units z_0 (bias), z_1, ..., z_M, outputs y_1, ..., y_K, first-layer weights w^{(1)} and second-layer weights w^{(2)}]
• Connect together a number of these units into a feed-forward network (a DAG)
• The figure above shows a network with one layer of hidden units
• It implements the function (see the sketch below):

  y_k(x, w) = h( Σ_{j=1}^{M} w^{(2)}_{kj} h( Σ_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0} ) + w^{(2)}_{k0} )
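As a concrete illustration, here is a numpy sketch of this function (my own naming, not course code), with tanh hidden units and identity outputs; W1 and b1 play the role of the first-layer weights w^{(1)} and biases w^{(1)}_{j0}, and W2 and b2 the role of w^{(2)}:

```python
import numpy as np

def forward(x, W1, b1, W2, b2, h=np.tanh, h_out=lambda a: a):
    """One-hidden-layer forward pass:
    y_k = h_out( sum_j W2[k, j] * h( sum_i W1[j, i] * x[i] + b1[j] ) + b2[k] )"""
    a = W1 @ x + b1          # hidden activations a_j
    z = h(a)                 # hidden outputs z_j = h(a_j)
    y = h_out(W2 @ z + b2)   # network outputs y_k
    return y

# Example with D = 3 inputs, M = 4 hidden units, K = 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
y = forward(rng.normal(size=3), W1, b1, W2, b2)
```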
Hidden Units Compute Basis Functions
• [Figure] red dots = network function; dashed lines = hidden unit activation functions; blue dots = data points
Network Training
• Given a specified network structure, how do we set its parameters (weights)?
• As usual, we define a criterion to measure how well our network performs, and optimize against it
• For regression, training data are (x_n, t_n), with t_n ∈ ℝ
  • Squared error naturally arises:

    E(w) = Σ_{n=1}^{N} {y(x_n, w) − t_n}²

• For binary classification, this is another discriminative model; maximum likelihood gives:

  p(t|w) = Π_{n=1}^{N} y_n^{t_n} {1 − y_n}^{1 − t_n}

  E(w) = −Σ_{n=1}^{N} {t_n ln y_n + (1 − t_n) ln(1 − y_n)}
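A small numpy sketch of these two error functions (the names and the clipping constant are my own additions), where Y holds the network outputs y_n and T the targets t_n:

```python
import numpy as np

def squared_error(Y, T):
    """Sum-of-squares error sum_n (y_n - t_n)^2 for regression."""
    return np.sum((Y - T) ** 2)

def cross_entropy(Y, T, eps=1e-12):
    """Binary cross-entropy -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)]."""
    Y = np.clip(Y, eps, 1.0 - eps)  # guard against log(0)
    return -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y))
```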
Parameter Optimization
[Figure: error surface E(w) over two weights w_1, w_2, marking points w_A, w_B, w_C and the gradient ∇E]
• For either of these problems, the error function E(w) is nasty
  • Nasty = non-convex
  • Non-convex = has local minima
Descent Methods
• The typical strategy for optimization problems of this sort is a descent method:

  w^{(τ+1)} = w^{(τ)} + Δw^{(τ)}

• These come in many flavours
  • Gradient descent: step based on the full gradient ∇E(w^{(τ)})
  • Stochastic gradient descent: step based on ∇E_n(w^{(τ)}) for a single example n
  • Newton-Raphson (second order): uses the Hessian ∇²E
• All of these can be used here; stochastic gradient descent is particularly effective (a sketch follows below)
  • Redundancy in training data, escaping local minima
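A sketch of the stochastic variant (the learning rate eta, the per-example gradient helper grad_En, and the epoch count are illustrative assumptions, not part of the slides):

```python
import numpy as np

def sgd(w, grad_En, data, eta=0.01, epochs=10, seed=0):
    """Stochastic gradient descent: w <- w - eta * grad E_n(w),
    updating on one randomly ordered training example at a time."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for n in rng.permutation(len(data)):
            w = w - eta * grad_En(w, data[n])  # grad_En: assumed per-example gradient
    return w
```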
Computing Gradients
• The function y(x_n, w) implemented by a network is complicated
• It isn't obvious how to compute error function derivatives with respect to the weights
• A numerical method for calculating error derivatives is to use finite differences:

  ∂E_n/∂w_{ji} ≈ [ E_n(w_{ji} + ε) − E_n(w_{ji} − ε) ] / (2ε)

• How much computation would this take with W weights in the network?
  • O(W) per derivative, O(W²) total per gradient descent step
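A numpy sketch of this finite-difference estimate (in practice mainly useful as a correctness check for backprop); E_n is assumed to be a function of the whole weight array:

```python
import numpy as np

def numerical_gradient(E_n, w, eps=1e-6):
    """Central finite differences: two evaluations of E_n per weight.
    Each evaluation of E_n costs O(W), so this is O(W) per derivative
    and O(W^2) for the full gradient."""
    g = np.zeros_like(w, dtype=float)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus.flat[i] += eps
        w_minus.flat[i] -= eps
        g.flat[i] = (E_n(w_plus) - E_n(w_minus)) / (2.0 * eps)
    return g
```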
Error Backpropagation
• Backprop is an efficient method for computing the error derivatives ∂E_n/∂w_{ji}
  • O(W) to compute derivatives with respect to all weights
• First, feed training example x_n forward through the network, storing all activations a_j
• Calculating derivatives for weights connected to output nodes is easy
  • e.g. for linear output nodes y_k = Σ_i w_{ki} z_i:

    ∂E_n/∂w_{ki} = ∂/∂w_{ki} [ ½ (y_n − t_n)² ] = (y_n − t_n) z_i

• For hidden layers, propagate the error backwards from the output nodes
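To make the output-layer case concrete, here is a small numpy check (variable names are mine) that the analytic derivative (y_n − t_n) z_i agrees with the finite-difference estimate from the previous slide:

```python
import numpy as np

# Linear output node y = sum_i w[i] * z[i], squared error E_n = 0.5 * (y - t)^2
rng = np.random.default_rng(1)
z, t = rng.normal(size=5), 0.7
w = rng.normal(size=5)

y = w @ z
analytic = (y - t) * z          # dE_n/dw_i = (y - t) * z_i

eps = 1e-6
numeric = np.zeros(5)
for i in range(5):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    numeric[i] = (0.5 * (w_plus @ z - t) ** 2 - 0.5 * (w_minus @ z - t) ** 2) / (2 * eps)

assert np.allclose(analytic, numeric)
```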
Chain Rule for Partial Derivatives
• A "reminder"
• For f(x, y), with f differentiable with respect to x and y, and x and y differentiable with respect to u and v:

  ∂f/∂u = (∂f/∂x)(∂x/∂u) + (∂f/∂y)(∂y/∂u)

  and

  ∂f/∂v = (∂f/∂x)(∂x/∂v) + (∂f/∂y)(∂y/∂v)
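As a quick sanity check of the rule (an added example, not from the slides): take f(x, y) = x y with x = u + v and y = u v. Then

  ∂f/∂u = (∂f/∂x)(∂x/∂u) + (∂f/∂y)(∂y/∂u) = y·1 + x·v = uv + (u + v)v = 2uv + v²,

which matches differentiating f(u, v) = u²v + uv² directly.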
Error Backpropagation
• We can write

  ∂E_n/∂w_{ji} = ∂/∂w_{ji} E_n(a_{j_1}, a_{j_2}, ..., a_{j_m})

  where {j_1, ..., j_m} are the indices of the nodes in the same layer as node j
• Using the chain rule:

  ∂E_n/∂w_{ji} = (∂E_n/∂a_j)(∂a_j/∂w_{ji}) + Σ_k (∂E_n/∂a_k)(∂a_k/∂w_{ji})

  where Σ_k runs over all other nodes k in the same layer as node j
• Since a_k does not depend on w_{ji}, all terms in the summation go to 0:

  ∂E_n/∂w_{ji} = (∂E_n/∂a_j)(∂a_j/∂w_{ji})
Error Backpropagation cont.
• Introduce the error δ_j ≡ ∂E_n/∂a_j, so that

  ∂E_n/∂w_{ji} = δ_j (∂a_j/∂w_{ji})

• The other factor is:

  ∂a_j/∂w_{ji} = ∂/∂w_{ji} Σ_k w_{jk} z_k = z_i
Error Backpropagation cont.
• The error δ_j can also be computed using the chain rule:

  δ_j ≡ ∂E_n/∂a_j = Σ_k (∂E_n/∂a_k)(∂a_k/∂a_j) = Σ_k δ_k (∂a_k/∂a_j)

  where Σ_k runs over all nodes k in the layer after node j
• Eventually:

  δ_j = h′(a_j) Σ_k w_{kj} δ_k

• A weighted sum of the later errors "caused" by this node (put together in the sketch below)
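Putting the pieces together, a hedged numpy sketch of backprop for the one-hidden-layer network sketched earlier (tanh hidden units, linear outputs, squared error); the variable names follow the earlier forward-pass sketch rather than any official course code. It returns the gradients for W1, b1, W2, b2 in that order:

```python
import numpy as np

def backprop(x, t, W1, b1, W2, b2):
    """Gradients of E_n = 0.5 * ||y - t||^2 for a network with tanh hidden
    units and linear outputs, via the delta recursion above."""
    # Forward pass, storing activations
    a = W1 @ x + b1          # hidden activations a_j
    z = np.tanh(a)           # hidden outputs z_j
    y = W2 @ z + b2          # linear outputs y_k

    # Output errors: delta_k = dE_n/da_k = y_k - t_k for linear outputs
    delta_out = y - t
    # Hidden errors: delta_j = h'(a_j) * sum_k w_kj * delta_k, with tanh'(a) = 1 - z^2
    delta_hidden = (1.0 - z ** 2) * (W2.T @ delta_out)

    # dE_n/dw_ji = delta_j * z_i (x_i for the first layer), delta_j for biases
    return np.outer(delta_hidden, x), delta_hidden, np.outer(delta_out, z), delta_out
```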
Applications of Neural Networks
• Many success stories for neural networks
  • Credit card fraud detection
  • Hand-written digit recognition
  • Face detection
  • Autonomous driving (CMU ALVINN)
Hand-written Digit Recognition
• MNIST - standard dataset for hand-written digit recognition
  • 60,000 training images, 10,000 test images
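One common way to load it for experiments (an added convenience, not part of the course material; it assumes TensorFlow/Keras is installed, and the raw files are also available from http://yann.lecun.com/exdb/mnist/):

```python
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)
```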
LeNet-5
[Figure: LeNet-5 architecture. INPUT 32x32 -> C1: 6 feature maps 28x28 (convolutions) -> S2: 6 feature maps 14x14 (subsampling) -> C3: 16 feature maps 10x10 (convolutions) -> S4: 16 feature maps 5x5 (subsampling) -> C5: layer of 120 (full connection) -> F6: layer of 84 (full connection) -> OUTPUT: 10 (Gaussian connections)]
• LeNet developed by Yann LeCun et al.
• Convolutional neural network
  • Local receptive fields (5x5 connectivity)
  • Subsampling (2x2)
  • Shared weights (reuse same 5x5 "filter")
  • Breaking symmetry
• See http://www.codeproject.com/KB/library/NeuralNetRecognition.aspx
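To make "local receptive fields + shared weights + subsampling" concrete, a minimal numpy sketch (not LeCun's implementation) of a single 5x5 feature map followed by 2x2 subsampling:

```python
import numpy as np

def conv_valid(image, kernel):
    """'Valid' 2-D convolution with one shared kernel: every output position
    reuses the same 5x5 weights (weight sharing, local receptive fields)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def subsample(fmap):
    """2x2 subsampling (average pooling), as in the S2/S4 layers."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32))
c1 = conv_valid(image, rng.normal(size=(5, 5)))  # 28x28 feature map
s2 = subsample(c1)                               # 14x14 after subsampling
```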
[Figure: the misclassified test digits, each labelled true digit -> predicted digit (e.g. 4->6, 3->5, 8->2, ...)]
• The 82 errors made by LeNet-5 (0.82% test error rate)
Conclusion
• Readings: Ch. 5.1, 5.2, 5.3
• Feed-forward networks can be used for regression or classification
  • Similar to linear models, except with adaptive non-linear basis functions
  • These allow us to do more than e.g. linear decision boundaries
  • Different error functions
• Learning is more difficult, since the error function is not convex
  • Use stochastic gradient descent, and obtain a (good?) local minimum
  • Backpropagation for efficient gradient computation