Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks


Page 1: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks


Page 2: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

MultiLayer Perceptron (MLP)

• formulated from loose biological principles
• popularized mid-1980s
▫ Rumelhart, Hinton & Williams 1986; Werbos 1974; Ho 1964
• "learns" the pre-processing stage from data
• layered, feed-forward structure
▫ sigmoidal pre-processing
▫ task-specific output
• non-linear model

Page 3: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

MLP

Input layer; Hidden layers; Output layer

[Figure: a feed-forward MLP with input layer L0 (inputs x1 … xd), hidden layers L1 … LM-1, and output layer LM (outputs y1 … yn).]

Page 4: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

A solution for the XOR problem

x1   x2   x1 XOR x2
-1   -1      -1
-1    1       1
 1   -1       1
 1    1      -1

[Figure: a two-layer network of sign units that realizes XOR; each unit outputs
φ(v) = +1 if v > 0 and -1 if v ≤ 0 (the sign function).]
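A minimal Python sketch of the same idea: a two-layer network of sign units computing XOR on ±1 inputs. The weights and thresholds below are illustrative values chosen by hand, not the exact ones from the slide's figure (which is only partially legible).

    def sign(v):
        """Sign activation: +1 if v > 0, else -1."""
        return 1 if v > 0 else -1

    def xor_net(x1, x2):
        # Hidden unit 1 fires only for (x1, x2) = (+1, -1)
        h1 = sign(1.0 * x1 - 1.0 * x2 - 0.5)
        # Hidden unit 2 fires only for (x1, x2) = (-1, +1)
        h2 = sign(-1.0 * x1 + 1.0 * x2 - 0.5)
        # Output unit acts as an OR of the two hidden units
        return sign(1.0 * h1 + 1.0 * h2 + 0.5)

    for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
        print(x1, x2, '->', xor_net(x1, x2))   # -1, 1, 1, -1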

Page 5: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

MLP

• Hidden layers of computation nodes
• Input propagates in a forward direction, on a layer-by-layer basis
▫ also called Multilayer Feedforward Network (MLP)
• Error back-propagation algorithm
▫ supervised learning algorithm
▫ error-correction learning algorithm
▫ Forward pass: an input vector is applied to the input nodes and its effects propagate through the network layer by layer, with fixed synaptic weights
▫ Backward pass: synaptic weights are adjusted in accordance with an error signal, which propagates backward in a layer-by-layer fashion

Page 6: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

MLP Distinct Characteristics

• Non-linear activation function
▫ differentiable
▫ sigmoidal (logistic) function:  yj = 1 / (1 + exp(-vj))
▫ the nonlinearity prevents reduction to a single-layer perceptron!

[Figure: activation functions — threshold, linear, piece-wise linear, sigmoid.]

Page 7: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

MLP Distinct Characteristics

• One or more layers of hidden neurons
▫ progressively extracting more meaningful features from input patterns
• High degree of connectivity
• Nonlinearity and the high degree of connectivity make theoretical analysis difficult
• Learning process is hard to visualize
• BP is a landmark in NN: computationally efficient training

Page 8: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

NN: Universal Approximator?

• Any desired continuous function can be implemented by a three-layer network, given a sufficient number of hidden units, proper nonlinearities and weights (Kolmogorov)
• Kolmogorov proved that any continuous function g(x) can be represented as

  g(x) = Σ_{j=1}^{2n+1} Ξj( Σ_{i=1}^{d} ψij(xi) )

  for properly chosen functions Ξj and ψij.

(A. N. Kolmogorov. On the representation of continuous functions of several variables by superposition of continuous functions of one variable and addition. Doklady Akademii Nauk SSSR, 114(5):953-956, 1957)

Page 9: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Universal Approximation Property of ANN

Boolean functions
• Every boolean function can be represented by a network with a single hidden layer
• but it might require exponential (in the number of inputs) hidden units

Continuous functions
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989, Hornik 1989]
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

Page 10: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Preliminaries

• Function signal
▫ input signals come in at the input end of the network
▫ propagate forward to the output nodes
• Error signal
▫ originates at the output neurons
▫ propagates backward to the input nodes
• Two computations in training
▫ computation of the function signal
▫ computation of an estimate of the gradient vector: the gradient of the error surface with respect to the weights

Page 11: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

MLP: Nonlinear multilayer networks

[Figure: a two-layer network — inputs xi (x0 … xd), 1st-layer weights wji from input i to hidden unit j, hidden outputs yj, 2nd-layer weights wkj from hidden unit j to output k, outputs zk (z1 … zn).]

Page 12: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Multi-Modal Cost Surface

[Figure: a multi-modal cost surface with a global minimum and a local minimum; which way does the gradient point?]

Page 13: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Heading Downhill

• assume
▫ minimization (e.g. SSE)
▫ analytically intractable
• step the parameters downhill
• w_new = w_old + step in the right direction
• backpropagation (of error)
▫ slow but efficient
• conjugate gradients, Levenberg/Marquardt
▫ for preference

Page 14: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Neuron with Sigmoid Function

[Figure: a neuron with inputs x1 … xn, weights w1 … wn, activation a, and output y.]

a = Σ_{i=1}^{n} wi xi
y = φ(a) = 1/(1 + e^(-a))

Page 15: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Sigmoid Unit

[Figure: a sigmoid unit with bias input x0 = -1 and weight w0, and inputs x1 … xn with weights w1 … wn.]

a = Σ_{i=0}^{n} wi xi
y = φ(a) = 1/(1 + e^(-a))

φ(x) is the sigmoid function: 1/(1 + e^(-x))
dφ(x)/dx = φ(x) (1 - φ(x))

Derive gradient descent rules to train:
• one sigmoid unit:  ∂E/∂wi = -(d - y) y (1 - y) xi

Page 16: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Gradient Descent Rule for Sigmoid Output Function

E[w1,…,wn] = ½ (d - y)²

∂E/∂wi = ∂/∂wi ½ (d - y)²
       = ∂/∂wi ½ (d - φ(Σi wi xi))²
       = (d - y) φ'(Σi wi xi) (-xi)

for y = φ(a) = 1/(1 + e^(-a)):
φ'(a) = e^(-a)/(1 + e^(-a))² = φ(a)(1 - φ(a))

w'i = wi + Δwi = wi + η y(1 - y)(d - y) xi
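A small Python sketch of this update rule for a single sigmoid unit, wi ← wi + η y(1-y)(d-y)xi. The toy OR data, learning rate and iteration count are illustrative assumptions, not part of the slides.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    # Toy data: learn OR of two binary inputs (illustrative choice).
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    d = np.array([0, 1, 1, 1], dtype=float)

    rng = np.random.default_rng(0)
    w = rng.uniform(-0.5, 0.5, size=2)   # weights
    b = 0.0                              # bias weight
    eta = 0.5                            # learning rate (assumed)

    for epoch in range(2000):
        for x, target in zip(X, d):
            y = sigmoid(w @ x + b)
            # delta rule for a sigmoid unit: eta * y(1-y)(d-y) * x
            grad = y * (1.0 - y) * (target - y)
            w += eta * grad * x
            b += eta * grad

    print(w, b, sigmoid(X @ w + b).round(2))   # outputs move toward [0, 1, 1, 1]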

Page 17: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Gradient Descent Learning Rule

Δwji = η yj(1 - yj) (dj - yj) xi

where η is the learning rate, yj(1 - yj) is the derivative of the activation function, (dj - yj) is the error of the post-synaptic neuron, and xi is the activation of the pre-synaptic neuron.

Page 18: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Learning with hidden units

• Networks without hidden units are very limited in the input-output mappings they can model.
▫ More layers of linear units do not help; it is still linear.
▫ Fixed output nonlinearities are not enough.
• We need multiple layers of adaptive nonlinear hidden units. This gives us a universal approximator. But how can we train such nets?
▫ We need an efficient way of adapting all the weights, not just the last layer; this is hard.
▫ Learning the weights going into hidden units is equivalent to learning features.
▫ Nobody is telling us directly what the hidden units should do.

Page 19: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Learning by perturbing weights

• Randomly perturb one weight and see if it improves performance. If so, save the change.
▫ Very inefficient. We need to do multiple forward passes on a representative set of training data just to change one weight.
▫ Towards the end of learning, large weight perturbations will nearly always make things worse.
• We could randomly perturb all the weights in parallel and correlate the performance gain with the weight changes.
▫ Not any better, because we need lots of trials to "see" the effect of changing one weight through the noise created by all the others.
• Learning the hidden-to-output weights is easy. Learning the input-to-hidden weights is hard.

[Figure: a network with input units, hidden units, and output units.]

Page 20: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

The idea behind backpropagation

• We do not know what the hidden units might do, but we can compute how fast the error changes as we change a hidden activity.
▫ Instead of using desired activities to train the hidden units, use error derivatives w.r.t. hidden activities.
▫ Each hidden activity can affect many output units and can therefore have many separate effects on the error. These effects must be combined.
▫ We can compute the error derivatives for all the hidden units efficiently.
▫ Once we have the error derivatives for the hidden activities, it is easy to get the error derivatives for the weights going into a hidden unit.

Page 21: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Multi-Layer Networks

[Figure: output layer (outputs yj), hidden layer (activities xi), input layer, with weights wji between layers.]

Page 22: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Back-Propagation Algorithm (BPA)

• Error signal for neuron j at iteration n:  ej(n) = dj(n) - yj(n)
• Total error energy:  E(n) = ½ Σ_{j∈C} ej²(n)
▫ C is the set of output nodes
• Average squared error energy:  Eav = (1/N) Σ_{n=1}^{N} E(n)
▫ average over all training samples
▫ cost function as a measure of learning performance
• Objective of the learning process
▫ adjust the NN parameters (synaptic weights) to minimize E(n) or Eav
• Weights are updated on a pattern-by-pattern basis until one epoch is completed
▫ complete presentation of the entire training set

Page 23: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

BPA

• Induced local field:  vj(n) = Σ_{i=0}^{m} wji(n) yi(n)
• Output of neuron j:  yj(n) = φj(vj(n))
• Gradient:
  ∂E(n)/∂wji(n) = [∂E(n)/∂ej(n)] [∂ej(n)/∂yj(n)] [∂yj(n)/∂vj(n)] [∂vj(n)/∂wji(n)]
▫ sensitivity factor: determines the direction of search in weight space
▫ according to the chain rule:
  ∂E(n)/∂ej(n) = ej(n)
  ∂ej(n)/∂yj(n) = -1
  ∂yj(n)/∂vj(n) = φj'(vj(n))
  ∂vj(n)/∂wji(n) = yi(n)

Page 24: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Gradient Descent

• Therefore,
  ∂E(n)/∂wji(n) = -ej(n) φj'(vj(n)) yi(n)
• By the delta rule,
  Δwji(n) = -η ∂E(n)/∂wji(n)
  which is gradient descent in weight space.
• Local gradient:
  δj(n) = -∂E(n)/∂vj(n) = -[∂E(n)/∂ej(n)] [∂ej(n)/∂yj(n)] [∂yj(n)/∂vj(n)] = ej(n) φj'(vj(n))
  so that  Δwji(n) = η δj(n) yi(n)

Page 25: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Local Gradient

• Neuron j is an output node:  ej(n) = dj(n) - yj(n)
• Neuron j is a hidden node
▫ credit assignment problem: how to determine its share of the responsibility
▫ δj(n) = -[∂E(n)/∂yj(n)] [∂yj(n)/∂vj(n)] = -[∂E(n)/∂yj(n)] φj'(vj(n))
• For E(n) = ½ Σ_{k∈C} ek²(n), differentiating with respect to the hidden output yj(n):
  ∂E(n)/∂yj(n) = Σk ek(n) [∂ek(n)/∂vk(n)] [∂vk(n)/∂yj(n)]

Page 26: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Local Gradient

• Error at output neuron k:  ek(n) = dk(n) - yk(n) = dk(n) - φk(vk(n))
• Hence  ∂ek(n)/∂vk(n) = -φk'(vk(n))
• Since  vk(n) = Σ_{j=0}^{m} wkj(n) yj(n),  we have  ∂vk(n)/∂yj(n) = wkj(n)
• Desired partial derivative:
  ∂E(n)/∂yj(n) = -Σk ek(n) φk'(vk(n)) wkj(n) = -Σk δk(n) wkj(n)
• Back-propagation formula for hidden neuron j:
  δj(n) = φj'(vj(n)) Σk δk(n) wkj(n)

Page 27: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

BP Summary

Weight correction = (learning-rate parameter η) × (local gradient δj(n)) × (input signal of neuron j, yi(n)):
  Δwji(n) = η δj(n) yi(n)

• forward pass
  vj(n) = Σ_{i=0}^{m} wji(n) yi(n),   yj(n) = φj(vj(n))
• backward pass
▫ recursively compute the local gradients from the output layer towards the input layer
▫ change the synaptic weights by the delta rule
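The forward and backward passes can be written compactly as a short NumPy sketch for a single-hidden-layer MLP with logistic units trained in sequential (pattern-by-pattern) mode. The layer sizes, learning rate, epoch count and XOR training set are illustrative assumptions.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def train_mlp(X, D, n_hidden=2, eta=0.5, epochs=5000, seed=0):
        """Sequential-mode backprop for a one-hidden-layer MLP with logistic units."""
        rng = np.random.default_rng(seed)
        n_in, n_out = X.shape[1], D.shape[1]
        W1 = rng.uniform(-0.5, 0.5, (n_hidden, n_in + 1))   # +1 column for the bias
        W2 = rng.uniform(-0.5, 0.5, (n_out, n_hidden + 1))
        for _ in range(epochs):
            for x, d in zip(X, D):
                # forward pass: v = W y_prev, y = phi(v)
                y0 = np.append(x, 1.0)                   # input plus bias term
                y1 = np.append(sigmoid(W1 @ y0), 1.0)    # hidden outputs plus bias
                y2 = sigmoid(W2 @ y1)                    # network outputs
                # backward pass: local gradients
                e = d - y2
                delta2 = e * y2 * (1.0 - y2)                                  # output layer
                delta1 = (W2[:, :-1].T @ delta2) * y1[:-1] * (1.0 - y1[:-1])  # hidden layer
                # delta rule: weight change = eta * local gradient * input signal
                W2 += eta * np.outer(delta2, y1)
                W1 += eta * np.outer(delta1, y0)
        return W1, W2

    # Example: XOR with 0/1 coding (illustrative)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
    D = np.array([[0], [1], [1], [0]], float)
    W1, W2 = train_mlp(X, D)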

Page 28: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Indeed …

• Back-propagation algorithm
▫ Forward step: function signals
▫ Backward step: error signals
• It adjusts the weights of the NN in order to minimize the average squared error.

Page 29: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Activation Function (logistic function)

• Sigmoidal function:  φ(vj) = 1 / (1 + exp(-a vj)),  with slope parameter a (increasing a sharpens the curve)
• induced local field of neuron j:  vj = Σ_{i=0}^{m} wji yi
• Most common form of activation function
• differentiable (a smooth version of a threshold function)

[Figure: logistic curves over vj ∈ [-10, 10] for increasing a.]

Page 30: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Activation Function (logistic function)

yj(n) = φj(vj(n)) = 1 / (1 + exp(-a vj(n))),  a > 0,  -∞ < vj(n) < ∞

φj'(vj(n)) = a exp(-a vj(n)) / [1 + exp(-a vj(n))]² = a yj(n)[1 - yj(n)]

• local gradient
▫ for an output node:
  δj(n) = ej(n) φj'(vj(n)) = a [dj(n) - yj(n)] yj(n)[1 - yj(n)]
▫ for a hidden node:
  δj(n) = φj'(vj(n)) Σk δk(n) wkj(n) = a yj(n)[1 - yj(n)] Σk δk(n) wkj(n)

Page 31: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Activation Function (hyperbolic tangent function)

yj(n) = φj(vj(n)) = a tanh(b vj(n)),  a, b > 0

φj'(vj(n)) = a b sech²(b vj(n)) = a b (1 - tanh²(b vj(n))) = (b/a) [a - yj(n)][a + yj(n)]

• local gradient
▫ for an output node:
  δj(n) = ej(n) φj'(vj(n)) = (b/a) [dj(n) - yj(n)][a - yj(n)][a + yj(n)]
▫ for a hidden node:
  δj(n) = φj'(vj(n)) Σk δk(n) wkj(n) = (b/a) [a - yj(n)][a + yj(n)] Σk δk(n) wkj(n)

Page 32: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Momentum term

• BP approximates the trajectory of steepest descent
▫ a smaller learning-rate parameter makes a smoother path
• increase the rate of learning yet avoid the danger of instability:
  Δwji(n) = α Δwji(n-1) + η δj(n) yi(n)
  where α is the momentum constant
• equivalently,
  Δwji(n) = -η Σ_{t=0}^{n} α^(n-t) ∂E(t)/∂wji(t) = η Σ_{t=0}^{n} α^(n-t) δj(t) yi(t)
▫ converges if 0 ≤ |α| < 1
▫ if the partial derivative has the same sign on consecutive iterations, the sum grows in magnitude: accelerated descent
▫ opposite signs: the sum shrinks: stabilizing effect
• benefit of preventing the learning process from terminating in a shallow local minimum
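A minimal sketch of the generalized delta rule with momentum; here `grad` stands for ∂E/∂wji at the current iteration, and the parameter values are illustrative.

    import numpy as np

    def momentum_step(w, grad, prev_dw, eta=0.1, alpha=0.8):
        """One momentum update: dw(n) = alpha*dw(n-1) - eta*dE/dw(n)."""
        dw = alpha * prev_dw - eta * grad
        return w + dw, dw

    # usage: carry prev_dw (initially zeros) between iterations
    w = np.zeros(3)
    prev_dw = np.zeros(3)
    grad = np.array([0.2, -0.1, 0.05])     # example gradient (assumed)
    w, prev_dw = momentum_step(w, grad, prev_dw)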

Page 33: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Mode of Training

▫ Epoch: one complete presentation of the training data
▫ randomize the order of presentation for each epoch
• Sequential mode
▫ synaptic weights are updated for each training sample
▫ requires less storage
▫ converges fast, particularly when the training data is redundant
▫ random order makes trapping at a local minimum less likely
• Batch mode
▫ synaptic weights are updated at the end of one epoch:
  Δwji = -η ∂Eav/∂wji = -(η/N) Σ_{n=1}^{N} ej(n) ∂ej(n)/∂wji
▫ may be more robust with outliers
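The two update schedules differ only in when the gradient is applied. A hedged sketch, using a single linear unit with squared error purely for illustration:

    import numpy as np

    def grad_single(w, x, d):
        """Gradient of E = 0.5*(d - w.x)^2 for one sample (linear unit, illustrative)."""
        return -(d - w @ x) * x

    def sequential_epoch(w, X, D, eta=0.1):
        """Sequential (on-line) mode: update after every training sample."""
        for x, d in zip(X, D):
            w = w - eta * grad_single(w, x, d)
        return w

    def batch_epoch(w, X, D, eta=0.1):
        """Batch mode: average the gradient over the epoch, then update once."""
        g = np.mean([grad_single(w, x, d) for x, d in zip(X, D)], axis=0)
        return w - eta * g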

Page 34: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Stopping Criteria

• No well-defined stopping criteria
• Terminate when the gradient vector g(w) = 0
▫ located at a local or global minimum
• Terminate when the error measure is stationary
• Terminate if the NN's generalization performance is adequate

Page 35: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Two-layer networks

[Figure: inputs xk (x0 … xm), 1st-layer weights vjk from input k to hidden unit j, hidden outputs zj, 2nd-layer weights wij from hidden unit j to output i, outputs yi (y1 … yn).]

Page 36: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

An Example

All biases are set to 1 (not drawn for clarity).
Learning rate η = 0.1.

First-layer weights:  v11 = -1, v21 = 0, v12 = 0, v22 = 1
Second-layer weights: w11 = 1, w21 = -1, w12 = 0, w22 = 1

Input [0, 1] with target [1, 1].
Use the identity activation function (i.e. g(a) = a).

Page 37: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

An Example

Forward pass: calculate the hidden unit inputs (including the bias input of 1):

u1 = v11 x1 + v12 x2 + 1 = -1·0 + 0·1 + 1 = 1
u2 = v21 x1 + v22 x2 + 1 = 0·0 + 1·1 + 1 = 2

Page 38: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

An Example

Calculate the activities of the hidden units:

z1 = g(u1) = 1
z2 = g(u2) = 2

Page 39: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

An Example

Calculate the outputs (yi = wi1 z1 + wi2 z2 + bias):

y1 = 2
y2 = 3

Page 40: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

An Example

Backward pass: calculate the error signals of the output units.

Target = [1, 1], so:

Δ1 = t1 - y1 = 1 - 2 = -1
Δ2 = t2 - y2 = 1 - 3 = -2

Page 41: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

An Example

Calculate the error signals of the hidden units:

δj = g'(uj) Σi Δi wij

Contributions: Δ1 w11 = -1, Δ2 w21 = 2, Δ1 w12 = 0, Δ2 w22 = -2

Page 42: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

An Example

With the identity activation (g' = 1):

δ1 = -1 + 2 = 1
δ2 = 0 - 2 = -2

Page 43: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

An Example

Update the weights in the output layer:

wij(t+1) = wij(t) + η Δi(t) zj(t)

with Δ1 z1 = -1, Δ1 z2 = -2, Δ2 z1 = -2, Δ2 z2 = -4.

Page 44: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

An Example

Output layer weight changes:

w11 = 0.9, w21 = -1.2, w12 = -0.2, w22 = 0.6

Page 45: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

An Example

Hidden layer weight changes:

vjk(t+1) = vjk(t) + η δj(t) xk(t)

with δ1 x1 = 0, δ1 x2 = 1, δ2 x1 = 0, δ2 x2 = -2.

Page 46: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Practical considerations in MLP

Data:
• Learning set
• Test set
• Validation set

• Stopping criterion
• Learning curve: the average error per pattern
• Cross-validation
• The total training error is minimized
• It usually decreases monotonically, even though this is not the case for the error on each individual pattern.

Page 47: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

MLP

• When the training set is small, one can generate surrogate training patterns.
• In the absence of problem-specific information, the surrogate patterns can be made by adding Gaussian noise to true training points; the category label should be left unchanged.
• If we know about the sources of variation among patterns, we can manufacture training data.
• The number of hidden units (neurons) should be less than the total number of training points n, say roughly n/10.
• Initializing weights
▫ We cannot initialize all the weights to 0 (why?).
▫ uniform learning => choose weights randomly from a single distribution
▫ Input-to-hidden weights:  -1/√d < wji < +1/√d,  where d is the number of input units
▫ Hidden-to-output weights:  -1/√nH < wkj < +1/√nH,  where nH is the number of hidden units
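A short sketch of a uniform weight initialization in the spirit of these ranges (using the 1/√(fan-in) form; the layer sizes below are illustrative):

    import numpy as np

    def init_mlp_weights(d, n_hidden, n_out, seed=0):
        """Uniform init: input-to-hidden in (-1/sqrt(d), +1/sqrt(d)),
        hidden-to-output in (-1/sqrt(n_hidden), +1/sqrt(n_hidden))."""
        rng = np.random.default_rng(seed)
        w_ji = rng.uniform(-1/np.sqrt(d), 1/np.sqrt(d), size=(n_hidden, d))
        w_kj = rng.uniform(-1/np.sqrt(n_hidden), 1/np.sqrt(n_hidden), size=(n_out, n_hidden))
        return w_ji, w_kj

    W1, W2 = init_mlp_weights(d=64, n_hidden=10, n_out=3)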

Page 48: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Newton's Method to speed up convergence

• The idea is to minimize the quadratic approximation of the cost function E(w) around the current point w(n), using a second-order Taylor series expansion of the cost function around the point w(n):
  ΔE[w(n)] ≈ gᵀ(n) Δw(n) + ½ Δwᵀ(n) H(n) Δw(n)
• g(n) is the m-by-1 gradient vector of the cost function E(w) evaluated at the point w(n). The matrix H(n) is the m-by-m Hessian of E(w) (second derivatives), H = ∇²E(w).

Page 49: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Newton's Method to speed up convergence

• H = ∇²E(w) requires the cost function E(w) to be twice continuously differentiable with respect to the elements of w.
• Differentiating ΔE[w(n)] ≈ gᵀ(n) Δw(n) + ½ Δwᵀ(n) H(n) Δw(n) with respect to Δw, the change ΔE(n) is minimized when
  g(n) + H(n) Δw(n) = 0  ->  Δw(n) = -H⁻¹(n) g(n)
• w(n+1) = w(n) + Δw(n)
• w(n+1) = w(n) - H⁻¹(n) g(n)
• where H⁻¹(n) is the inverse of the Hessian of E(w).
• Newton's method converges quickly asymptotically and does not exhibit zigzagging behavior.
• Newton's method requires the Hessian H(n) to be invertible for all n!

Page 50: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Gauss-Newton Method

• It is applicable to a cost function that is expressed as a sum of error squares.
• E(w) = ½ Σ_{i=1}^{n} e²(i); note that all the error terms are calculated on the basis of a weight vector w that is fixed over the entire observation interval 1 ≤ i ≤ n.
• The error signal e(i) is a function of the adjustable weight vector w. Given an operating point w(n), we linearize the dependence of e(i) on w by writing
  e'(i, w) = e(i) + [∂e(i)/∂w]ᵀ_{w=w(n)} (w - w(n)),  i = 1, 2, ..., n

Page 51: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Gauss-Newton Method

In matrix form, e'(n, w) = e(n) + J(n)(w - w(n)),
where e(n) is the error vector e(n) = [e(1), e(2), ..., e(n)]ᵀ and J(n) is the n-by-m Jacobian matrix of e(n) (the Jacobian J(n) is the transpose of the m-by-n gradient matrix ∇e(n), where ∇e(n) = [∇e(1), ∇e(2), ..., ∇e(n)]).

w(n+1) = arg min_w {½ ||e'(n, w)||²}
       = arg min_w {½ ||e(n)||² + eᵀ(n) J(n)(w - w(n)) + ½ (w - w(n))ᵀ Jᵀ(n) J(n)(w - w(n))}

Differentiating this expression with respect to w and setting the result equal to zero:

Page 52: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Gauss-Newton Method

Jᵀ(n) e(n) + Jᵀ(n) J(n)(w - w(n)) = 0
w(n+1) = w(n) - [Jᵀ(n) J(n)]⁻¹ Jᵀ(n) e(n)

The Gauss-Newton method requires only the Jacobian matrix of the error vector e(n).
For the Gauss-Newton iteration to be computable, the matrix product Jᵀ(n)J(n) must be nonsingular. Jᵀ(n)J(n) is always nonnegative definite; to ensure that it is nonsingular, the Jacobian J(n) must have row rank n. -> add a diagonal matrix δI to the matrix Jᵀ(n)J(n), where the parameter δ is a small positive constant.

Page 53: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Gauss-Newton Method

• Jᵀ(n)J(n) + δI is positive definite for all n.
• -> The Gauss-Newton method is implemented in the following form:
  w(n+1) = w(n) - [Jᵀ(n)J(n) + δI]⁻¹ Jᵀ(n) e(n)
• This is the solution to the modified cost function:
  E(w) = ½ { δ ||w - w(0)||² + Σ_{i=1}^{n} e²(i) }
  where w(0) is the initial value of w.
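A hedged NumPy sketch of this modified (Levenberg-Marquardt-style) Gauss-Newton step; the toy straight-line fitting problem is an illustrative assumption.

    import numpy as np

    def gauss_newton_step(w, e, J, delta=1e-3):
        """One modified Gauss-Newton step:
        w_next = w - (J^T J + delta*I)^(-1) J^T e,
        where e is the error vector and J its Jacobian at w."""
        A = J.T @ J + delta * np.eye(len(w))
        return w - np.linalg.solve(A, J.T @ e)

    # Example: fit y = w0 + w1*x by least squares (illustrative problem)
    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([1.1, 2.9, 5.2, 6.8])
    w = np.zeros(2)
    for _ in range(5):
        e = (w[0] + w[1] * x) - y                    # residuals e(i)
        J = np.column_stack([np.ones_like(x), x])    # Jacobian of the residuals w.r.t. w
        w = gauss_newton_step(w, e, J)
    print(w)   # approaches the least-squares line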

Page 54: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Heuristics for making BP Better

• Training with BP is more an art than a science
• Sequential vs. batch update
• Maximizing information content
▫ use examples with the largest training error
▫ use examples radically different from previous ones
• Randomize the order of presentation
▫ successive examples should rarely belong to the same class
• Activation function
▫ an antisymmetric function learns faster, e.g.  φ(v) = a tanh(b v)

Page 55: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Heuristics for making BP Better

• Normalizing the inputs
▫ preprocess so that the mean value is close to zero
▫ input variables should be uncorrelated (by principal component analysis)
▫ scaled so that the covariances are equal
• Weight Initialization
▫ large weight values => saturation: the local gradient value is small => slow learning
▫ small weight values => operation on a flat area => slow learning
▫ choose somewhere between the two extremes

Page 56: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Heuristics for making BP Better

• Learning from hints
▫ prior information should be included in the learning process
• Learning rates
▫ all the neurons should learn at about the same rate
  the last layer has large local gradients, so the last layer learns fast
  η of the last layer should therefore be assigned a smaller value

Page 57: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Overfitting

• The training data contains information about the regularities in the mapping from input to output. But it also contains noise.
▫ The target values may be unreliable.
▫ There is sampling error. There will be accidental regularities just because of the particular training cases that were chosen.
• When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.
▫ So it fits both kinds of regularity.
▫ If the model is very flexible it can model the sampling error really well. This is a disaster.

Page 58: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

A simple example of overfitting

• Which model do you believe?
▫ The complicated model fits the data better.
▫ But it is not economical.
• A model is convincing when it fits a lot of data surprisingly well.
▫ It is not surprising that a complicated model can fit a small amount of data.

Page 59: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Generalization

• The objective of learning is to achieve good generalization to new cases; otherwise just use a look-up table.
• Generalization can be defined as a mathematical interpolation or regression over a set of training points.

[Figure: a smooth curve f(x) interpolating a set of training points.]

Page 60: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Training Set Size for Generalization

• Generalization is influenced by
▫ the size of the training set
▫ the architecture of the neural network
• Given the architecture, determine the size of the training set needed for good generalization
• Given a set of training samples, determine the best architecture for good generalization

  N = O(W / ε)

Page 61: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Approximation of Functions

• Non-linear input-output mapping
▫ from an m0-dimensional input space to an mL-dimensional output space
• What is the minimum number of hidden layers in an MLP that provides an approximation of any continuous mapping?
• Universal Approximation Theorem
▫ existence of an approximation of an arbitrary continuous function
▫ a single hidden layer is sufficient for an MLP to compute a uniform ε-approximation to a given training set
▫ it does not say a single layer is optimum in the sense of training time, ease of implementation, or generalization
• Bound on the approximation error of a single-hidden-layer NN
▫ the larger the number of hidden nodes, the more accurate the approximation
▫ the smaller the number of hidden nodes, the more accurate the empirical fit, i.e., better generalization

Page 62: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Curse of Dimensionality

• For good generalization, N > W / ε
▫ where W is the total number of synaptic weights
• We need dense sample points to learn the mapping well.
• Dense samples are hard to find in high dimensions
▫ exponential growth in complexity as the number of dimensions increases

Page 63: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Practical Consideration

• Single hidden layer vs. double (multiple) hidden layers
▫ a single-hidden-layer NN is good for approximating any continuous function
▫ a double-hidden-layer NN may be better in some cases
• double (multiple) hidden layers
▫ first hidden layer - local feature detection
▫ second hidden layer - global feature detection

Page 64: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Cross-Validation

• Validate the learned model on different sets to assess generalization performance
▫ guarding against overfitting
• Partition the training set into
▫ estimation subset (or training subset)
▫ validation subset (or test subset)
• cross-validation for
▫ best model selection
▫ determining when to stop training

Page 65: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Model selection

• Choosing the MLP with the best number of free parameters given N training samples
• The issue is to choose r
▫ which determines the split of the training set between estimation set and validation set
▫ to minimize the classification error of the model trained on the estimation set when it is tested with the validation set
• Kearns (1996): qualitative properties of the optimum r
▫ for small-complexity problems (desired response small compared to N), the performance of cross-validation is insensitive to r
▫ a single fixed r is nearly optimal for a wide range of target functions
• suggests r = 0.2
▫ 80% of the training set is the estimation set

Page 66: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Stopping method of training

• Right time to stop training
▫ to avoid overfitting
• Early stopping method
▫ after some training, with the synaptic weights fixed, compute the test error
▫ resume training after computing the test error

[Figure: mean squared error vs. number of epochs for the training sample and the test sample; the early stopping point is where the test-sample error starts to rise.]
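A minimal early-stopping loop in the spirit of the figure; `train_one_epoch` and `validation_error` are placeholder callables, and the patience value is an assumption.

    def train_with_early_stopping(model, train_one_epoch, validation_error,
                                  max_epochs=1000, patience=10):
        """Stop when the validation (test-sample) error has not improved
        for `patience` consecutive epochs; keep the best weights seen."""
        best_err, best_model, wait = float('inf'), model, 0
        for epoch in range(max_epochs):
            model = train_one_epoch(model)     # one pass over the estimation subset
            err = validation_error(model)      # error on the validation subset
            if err < best_err:
                best_err, best_model, wait = err, model, 0
            else:
                wait += 1
                if wait >= patience:
                    break                      # early stopping point
        return best_model, best_err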

Page 67: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Stopping method

• Amari (1996)
▫ for N < W, early stopping improves generalization
▫ for N < 30W, overfitting occurs
  r_opt = 1 - (√(2W - 1) - 1) / (2(W - 1)) ≈ 1 - 1/√(2W) for large W
  example: W = 100 gives r_opt ≈ 0.93: 93% for estimation, 7% for validation
▫ for N > 30W, the improvement from early stopping is small
• Leave-one-out method

Page 68: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Network Pruning

• Minimizing network size improves generalization
▫ less likely to learn idiosyncrasies or noise
• Network growing
• Network pruning
▫ weakening or eliminating synaptic weights
• Complexity regularization
▫ tradeoff between reliability of the training data and goodness of the model
▫ supervised learning by minimizing the risk function
  R(w) = Es(w) + λ Ec(w)
  where Es(w) is the standard performance measure and Ec(w) is the complexity penalty

Page 69: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Complexity regularization

• Weight Decay:  Ec(w) = ||w||² = Σi wi²
▫ some weights are forced to take values near zero
▫ weights in the network are grouped into two categories:
  those of large influence
  those of little or no influence: excess weights
• Weight Elimination:  Ec(w) = Σi (wi / w0)² / [1 + (wi / w0)²]
▫ when wi << w0, the weight is eliminated
• Approximate Smoother
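A small sketch of the regularized risk R(w) = Es(w) + λ Ec(w) with the weight-decay and weight-elimination penalties above; λ, w0 and the example weight vector are illustrative.

    import numpy as np

    def weight_decay_penalty(w):
        """E_c(w) = ||w||^2 = sum_i w_i^2."""
        return np.sum(w ** 2)

    def weight_elimination_penalty(w, w0=1.0):
        """E_c(w) = sum_i (w_i/w0)^2 / (1 + (w_i/w0)^2); near-zero weights are cheap."""
        r = (w / w0) ** 2
        return np.sum(r / (1.0 + r))

    def regularized_risk(E_s, w, lam=0.01, penalty=weight_decay_penalty):
        """R(w) = E_s(w) + lambda * E_c(w)."""
        return E_s + lam * penalty(w)

    w = np.array([0.01, -2.0, 0.5])
    print(regularized_risk(E_s=0.12, w=w, lam=0.01))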

Page 70: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Hessian-based Network Pruning

• Identify the parameters whose deletion will cause the least increase in Eav
• by Taylor series:
  Eav(w + Δw) = Eav(w) + gᵀ(w) Δw + ½ Δwᵀ H Δw + O(||Δw||³)
▫ parameters are deleted after the training process has converged (i.e., g(w) ≈ 0)
▫ quadratic approximation (i.e., higher-order terms ≈ 0):
  ΔEav = Eav(w + Δw) - Eav(w) ≈ ½ Δwᵀ H Δw
• eliminate the weights with small effect
• solve the constrained optimization problem: minimize ½ Δwᵀ H Δw subject to setting weight wi to zero
▫ if [H⁻¹]ii is small, even a small weight is important

Page 71: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Optimal Brain Surgeon

• Saliency of wi:  Si = wi² / (2 [H⁻¹]ii)
▫ represents the increase in the mean-squared error from deleting wi
• OBS procedure
▫ the weight of smallest saliency is deleted
• Optimal Brain Damage
▫ the same idea, with the assumption that the Hessian matrix is diagonal
• Requires computation of the inverse of the Hessian

Accelerated Convergence

Heuristics

1. Every adjustable weight should have its own learning-rate parameter
2. Learning-rate parameters should be allowed to vary from one iteration to the next
3. If the sign of the derivative is the same for several iterations, the learning-rate parameter should be increased
▫ apply the momentum idea even to the learning-rate parameters
4. If the sign of the derivative alternates for several iterations, the learning-rate parameter should be decreased

Page 73: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Network Design and Training Issues

Design:
• Architecture of the network
• Structure of artificial neurons
• Learning rules

Training:
• Ensuring optimum training
• Learning parameters
• Data preparation
• and more ....

Page 74: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Network Design

Architecture of the network: how many nodes?
• Determines the number of network weights
• How many layers?
• How many nodes per layer?
  Input Layer - Hidden Layer - Output Layer
• Automated methods:
▫ augmentation (cascade correlation)
▫ weight pruning and elimination

Page 75: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Network Design

Architecture of the network: connectivity?
• Concept of a model or hypothesis space
• Constraining the number of hypotheses:
▫ selective connectivity
▫ shared weights
▫ recursive connections

Page 76: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Network Design

Structure of artificial neuron nodes
• Choice of input integration:
▫ summed; squared and summed
▫ multiplied
• Choice of activation (transfer) function:
▫ sigmoid (logistic)
▫ hyperbolic tangent
▫ Gaussian
▫ linear
▫ soft-max

Page 77: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Network Design

Selecting a Learning Rule
• Generalized delta rule (steepest descent)
• Momentum descent
• Advanced weight-space search techniques
• The global error function can also vary
  - normal - quadratic - cubic

Page 78: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Network Training

How do you ensure that a network has been well trained?
• Objective: to achieve good generalization accuracy on new examples/cases
• Establish a maximum acceptable error rate
• Train the network using a validation set to tune it
• Validate the trained network against a separate test set, usually referred to as a production test set

Page 79: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Network Training

Approach #1: Large Sample
When the amount of available data is large ...

Divide the available examples randomly: 70% training set (used to develop one ANN model) and 30% production/test set (used to compute the test error, which estimates the generalization error).

Page 80: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Network Training

Approach #2: Cross-validation
When the amount of available data is small ...

Split the available examples into 90% training set and 10% test set; repeat 10 times with different splits, developing 10 different ANN models and accumulating the test errors. The generalization error is estimated by the mean and standard deviation of the test errors.
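A generic sketch of this repeated-split (k-fold) procedure; `train_model` and `test_error` are placeholder callables, and X, D are assumed to be NumPy arrays.

    import numpy as np

    def cross_validate(X, D, train_model, test_error, k=10, seed=0):
        """Split the data into k folds; train on k-1 folds, test on the held-out fold,
        and report the mean and standard deviation of the test errors."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        folds = np.array_split(idx, k)
        errors = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            model = train_model(X[train_idx], D[train_idx])
            errors.append(test_error(model, X[test_idx], D[test_idx]))
        return np.mean(errors), np.std(errors)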

Page 81: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Network Training

How do you select between two ANN designs?
• A statistical test of hypothesis is required to ensure that a significant difference exists between the error rates of the two ANN models
• If the Large Sample method has been used, then apply McNemar's test
• If Cross-validation was used, then use a paired t-test for the difference of two proportions

Page 82: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Network Training

Mastering ANN Parameters

Parameter       Typical    Range
learning rate   0.1        0.01 - 0.99
momentum        0.8        0.1 - 0.9
weight-cost     0.1        0.001 - 0.5

Fine tuning: adjust individual parameters at each node and/or connection weight
▫ automatic adjustment during training

Page 83: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Network Training

Network weight initialization
• Random initial values within +/- some range
• Smaller weight values for nodes with many incoming connections
• Rule of thumb: the initial weight range should be approximately ±1 / (# weights coming into a node)

Page 84: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Network Training

Typical Problems During Training

Would like: a steady, rapid decline in total error E with the number of iterations.

But sometimes the error plateaus or oscillates:
▫ seldom a local minimum - reduce the learning or momentum parameter
▫ reduce the learning parameters - it may indicate the data is not learnable

Page 85: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

An Example

Three-layer network for solving the Exclusive-OR operation.

[Figure: inputs x1 (neuron 1) and x2 (neuron 2) in the input layer; hidden neurons 3 and 4 with weights w13, w23, w14, w24; output neuron 5 with weights w35, w45; neurons 3, 4 and 5 each also have a threshold input fixed at -1.]

Page 86: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

An Example

The effect of the threshold applied to a neuron in the hidden or output layer is represented by its weight, θ, connected to a fixed input equal to -1. The initial weights and threshold levels are set randomly as follows:

w13 = 0.5, w14 = 0.9, w23 = 0.4, w24 = 1.0, w35 = -1.2, w45 = 1.1, θ3 = 0.8, θ4 = -0.1 and θ5 = 0.3.

Page 87: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

An Example

We consider a training example where inputs x1 and x2 are equal to 1 and the desired output yd,5 is 0. The actual outputs of neurons 3 and 4 in the hidden layer are calculated as

y3 = sigmoid(x1 w13 + x2 w23 - θ3) = 1 / [1 + e^-(1·0.5 + 1·0.4 - 1·0.8)] = 0.5250
y4 = sigmoid(x1 w14 + x2 w24 - θ4) = 1 / [1 + e^-(1·0.9 + 1·1.0 + 1·0.1)] = 0.8808

Now the actual output of neuron 5 in the output layer is determined as:

y5 = sigmoid(y3 w35 + y4 w45 - θ5) = 1 / [1 + e^-(-0.5250·1.2 + 0.8808·1.1 - 1·0.3)] = 0.5097

Thus, the following error is obtained:

e = yd,5 - y5 = 0 - 0.5097 = -0.5097

Page 88: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

An Example

The next step is weight training. To update the weights and threshold levels in our network, we propagate the error, e, from the output layer backward to the input layer. First, we calculate the error gradient for neuron 5 in the output layer:

δ5 = y5 (1 - y5) e = 0.5097 · (1 - 0.5097) · (-0.5097) = -0.1274

Then we determine the weight corrections, assuming that the learning-rate parameter, α, is equal to 0.1:

Δw35 = α · y3 · δ5 = 0.1 · 0.5250 · (-0.1274) = -0.0067
Δw45 = α · y4 · δ5 = 0.1 · 0.8808 · (-0.1274) = -0.0112
Δθ5 = α · (-1) · δ5 = 0.1 · (-1) · (-0.1274) = 0.0127

Page 89: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

An Example

Next we calculate the error gradients for neurons 3 and 4 in the hidden layer:

δ3 = y3 (1 - y3) · δ5 · w35 = 0.5250 · (1 - 0.5250) · (-0.1274) · (-1.2) = 0.0381
δ4 = y4 (1 - y4) · δ5 · w45 = 0.8808 · (1 - 0.8808) · (-0.1274) · 1.1 = -0.0147

We then determine the weight corrections:

Δw13 = α · x1 · δ3 = 0.1 · 1 · 0.0381 = 0.0038
Δw23 = α · x2 · δ3 = 0.1 · 1 · 0.0381 = 0.0038
Δθ3 = α · (-1) · δ3 = 0.1 · (-1) · 0.0381 = -0.0038
Δw14 = α · x1 · δ4 = 0.1 · 1 · (-0.0147) = -0.0015
Δw24 = α · x2 · δ4 = 0.1 · 1 · (-0.0147) = -0.0015
Δθ4 = α · (-1) · δ4 = 0.1 · (-1) · (-0.0147) = 0.0015

Page 90: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

An Example

At last, we update all weights and threshold levels:

w13 = 0.5 + 0.0038 = 0.5038
w14 = 0.9 - 0.0015 = 0.8985
w23 = 0.4 + 0.0038 = 0.4038
w24 = 1.0 - 0.0015 = 0.9985
w35 = -1.2 - 0.0067 = -1.2067
w45 = 1.1 - 0.0112 = 1.0888
θ3 = 0.8 - 0.0038 = 0.7962
θ4 = -0.1 + 0.0015 = -0.0985
θ5 = 0.3 + 0.0127 = 0.3127

The training process is repeated until the sum of squared errors is less than 0.001.
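The arithmetic of this worked example can be checked with a few lines of Python; the sketch below reproduces one forward pass and one weight update for the XOR network using the initial weights given above.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    # initial weights and thresholds (from the slides)
    w13, w14, w23, w24 = 0.5, 0.9, 0.4, 1.0
    w35, w45 = -1.2, 1.1
    t3, t4, t5 = 0.8, -0.1, 0.3
    alpha = 0.1                       # learning rate
    x1, x2, yd5 = 1.0, 1.0, 0.0       # training example

    # forward pass
    y3 = sigmoid(x1 * w13 + x2 * w23 - t3)     # 0.5250
    y4 = sigmoid(x1 * w14 + x2 * w24 - t4)     # 0.8808
    y5 = sigmoid(y3 * w35 + y4 * w45 - t5)     # 0.5097
    e = yd5 - y5                               # -0.5097

    # backward pass: error gradients
    d5 = y5 * (1 - y5) * e                     # -0.1274
    d3 = y3 * (1 - y3) * d5 * w35              #  0.0381
    d4 = y4 * (1 - y4) * d5 * w45              # -0.0147

    # weight and threshold updates (threshold input is fixed at -1)
    w13 += alpha * x1 * d3;  w23 += alpha * x2 * d3;  t3 += alpha * (-1) * d3
    w14 += alpha * x1 * d4;  w24 += alpha * x2 * d4;  t4 += alpha * (-1) * d4
    w35 += alpha * y3 * d5;  w45 += alpha * y4 * d5;  t5 += alpha * (-1) * d5

    print(round(w13, 4), round(w35, 4), round(t5, 4))   # 0.5038, -1.2067, 0.3127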

Page 91: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Learning curve for the Exclusive-OR operation

[Figure: sum-squared network error (log scale, 10^1 down to 10^-4) vs. epoch (0 to about 224); training stops after 224 epochs when the sum-squared error falls below the goal.]

Page 92: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Final results of three-layer network learning

Inputs        Desired output   Actual output   Error            Sum of squared
x1    x2      yd               y5              e = yd - y5      errors
1     1       0                0.0155          -0.0155          0.0010
0     1       1                0.9849           0.0151
1     0       1                0.9849           0.0151
0     0       0                0.0175          -0.0175

Page 93: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Software

• Neural Networks for Face Recognition
  http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/faces.html
• SNNS Stuttgart Neural Networks Simulator
  http://www-ra.informatik.uni-tuebingen.de/SNNS
• Neural Networks at your fingertips
  http://www.geocities.com/CapeCanaveral/1624/
• Neural Network Design Demonstrations
  http://ee.okstate.edu/mhagan/nndesign_5.ZIP
• Bishop's network toolbox
• Matlab Neural Network toolbox

Page 94: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

MLP for object recognition from images

• Objective
▫ Identify interesting objects from input images
  Face recognition: locate faces; happy/sad faces; gender; face pose; orientation
  Recognize specific faces: authorization
  Vehicle recognition (traffic control or safe-driving assistant): passenger car, van, pick-up, bus, truck
  Traffic sign detection
• Challenges
▫ Image size (100x100, 10240x10240)
▫ Object size, pose and object orientation
▫ Illumination

Page 95: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Example

Page 96: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Example: Face Detection Challenges

• pose variation
• lighting condition variation
• facial expression variation

Page 97: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Normal procedures

• Training (identify your problem and build a specific model)
▫ Build the training dataset
  Isolate sample images: images containing faces
  Extract regions containing the objects, e.g. regions containing faces
  Normalization (size and illumination): 200x200 etc.
  Select counter-class examples: non-face regions
▫ Determine the Neural Net
  The input layer is determined by the input images: e.g., a 200x200 image requires 40,000 input dimensions, each containing a value between 0-255
  Neural net architecture: a three-layer feed-forward NN (two hidden layers) is common practice
  The output layer is determined by the learning problem: bi-class classification or multi-class classification
▫ Train the Neural Net

Page 98: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Normal procedures

• Test
▫ Given a test image
  Select a small region (considering all possibilities of the object location and size)
  Scan from the top left to the bottom right
  Sample at different scale levels
  Feed each region into the network and determine whether it contains the object or not
  Repeat the above process, which is time consuming
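A skeletal sketch of this multi-scale sliding-window scan; `classify_window` stands for the trained network applied to one normalized region, and the window size, stride and scales are illustrative assumptions.

    import numpy as np

    def scan_image(image, classify_window, win=200, stride=20, scales=(1.0, 0.75, 0.5)):
        """Scan a grayscale image from top-left to bottom-right at several scales;
        return the (scale, row, col) of every window the classifier accepts."""
        detections = []
        for s in scales:
            h, w = int(image.shape[0] * s), int(image.shape[1] * s)
            # nearest-neighbour resize, to keep the sketch dependency-free
            rows = (np.arange(h) / s).astype(int)
            cols = (np.arange(w) / s).astype(int)
            scaled = image[np.ix_(rows, cols)]
            for r in range(0, h - win + 1, stride):
                for c in range(0, w - win + 1, stride):
                    region = scaled[r:r + win, c:c + win]      # candidate region
                    if classify_window(region):                # e.g. MLP says "object"
                        detections.append((s, r, c))
        return detections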

Page 99: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

CMU Neural Nets for Face Pose Recognition

Head pose (1-of-4): 90% accuracy
Face recognition (1-of-20): 90% accuracy

Page 100: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Neural Net Based Face Detection

• Large training set of faces and a small set of non-faces
• The training set of non-faces is built up automatically:
• start from a set of images with no faces
• every 'face' detected in them (a false positive) is added to the non-face training set.

Page 101: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Traffic sign detection

• Demo
▫ http://www.mathworks.com/products/demos/videoimage/traffic_sign/vipwarningsigns.html
• Intelligent traffic light control system
▫ Instead of using loop detectors (like metal detectors), use surveillance video to detect vehicles and bicycles

Page 102: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Vehicle Detection

• Intelligent vehicles aim at improving driving safety by machine vision techniques
  http://www.mobileye.com/visionRange.shtml

Page 103: Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks

Reading

• S. Haykin, Neural Networks: A Comprehensive Foundation, 2007 (Chapter 5).