Instructor: Dr. Benjamin Thompson Lecture 10: 12 February 2009


Page 1:

Instructor: Dr. Benjamin Thompson
Lecture 10: 12 February 2009

Page 2:

Announcements
• Reminder: Homework 3 is due right now
• Homework 1 solutions are available online
• Homework 2 is graded (avg ~75%; minus two outliers it's 88%)
• Homework 4 will be on the web tonight
• Homework 4 will be very, very easy. I promise.
• Seriously.
• No, really. It's one Matlab problem, and you don't have to train or solve anything.

Page 3:

<obscure or slightly humorous synonym for "during the last lecture">:
• The Iris Classification Problem
• The Sumo-Basketball Player-Jockeys-Footballers Problem
• Stock Market Data

Chapter Four!
• The Multilayer Perceptron
• Developing the MLP
• General MLP Structure

Page 4:

<obscure or slightly humorous synonym for "In Today's Lecture">:
• Nomenclature Refresher
• Batch Learning versus Online Learning
• Backpropagation!
• Matlab tip for HW4

Page 5:

Almost half as exciting as the first time you saw it!

Page 6:

We’ll See This Slide A Lot.

This might be described as a 3x4x5x2 Neural Network. The bias is assumed and thus not counted in the dimensions.

Page 7:

Some Nomenclature: Inputs and Layers
• x is the input vector
• x1 is the first input
• xi is the ith input
• "The nth layer" refers to any type of layer
• "The first layer" always means the input layer
• "The first hidden layer" always means the second layer of a proper MLP
• So "the second layer" is equivalent to "the first hidden layer"
• "The last layer" is the output layer, and is not synonymous with "the last hidden layer", which always precedes it by one layer.

Study these next few slides. The development of the learning algorithm for the MLP, while mathematically elegant, is notational hell, and knowing these will make following the next lecture vastly easier.

Page 8:

More Nomenclature: Activations
• a1(1) is the activation potential (aka induced local field, aka neuronal input) of the first neuron of the first hidden layer
• ak(1) is the activation potential of the kth neuron of the first hidden layer
• ak(n) is the activation potential of the kth neuron of the nth hidden layer
• In general, the activation is the weighted sum of all the connected inputs (more on this later)

Haykin uses υ instead of a, but I think a makes more sense contextually speaking.

Page 9:

More Nomenclature: Weights
• w1,1(1) is the weight or synaptic connection between the first input and the first neuron of the first hidden layer.
• wj,k(1) is the weight of the connection between the jth input and the kth neuron of the first hidden layer
• wj,k(n) is the weight of the connection between the output of the jth neuron on the nth layer and the input of the kth neuron on the (n+1)st layer
• W(n) is the weight matrix between layers n and n+1, whose (j,k)th element is wj,k(n)

Generally, subscript j will always refer to something to the left of something with subscript k. So if something occurs in a temporal order, I'll try to replicate this in alphabetical order.

Page 10:

More Nomenclature: Bias
• b(1) refers to the bias vector feeding into the first hidden layer
• b1(1) refers to the weight connecting the bias to the first neuron of the first hidden layer
• bk(1) refers to the weight connecting the bias to the kth neuron of the first hidden layer
• bk(n) refers to the weight connecting the bias to the kth neuron of the nth hidden layer (or the output layer if n is the last layer)
• b(n) refers to the bias vector feeding the nth hidden layer (or the output layer if n is the last layer)

Page 11:

More Nomenclature: Neuron Outputs
• φ(n) is the output vector of the nth hidden layer
• φ1(1) is the output of the 1st node in the 1st hidden layer
• φk(1) is the output of the kth node in the 1st hidden layer
• φk(n) is the output of the kth node in the nth hidden layer
• We may abuse the notation and use φ(0) to refer to the input vector!
• We use φ to signify the activation function.
• When given, we may use the function itself in place of φ:
  • threshold function θ( )
  • sigmoid function σ( )
  • signum function sgn( )

Pay close attention to this subtlety!

Page 12:

More Nomenclature
• Alternately, rather than φ(n), we may use y(n) and yk(n) to refer to the same thing
• This makes it clear in the y = f(x) sense that it's an output of something
• Finally: o refers to the output vector of the whole neural network
• o1 is the first output node
• ok is the kth output of the neural network

Page 13:

A Look at the First Layer
• The input to the first hidden neuron may be written as:

  a1(1) = w1,1(1)·x1 + … + wn,1(1)·xn + b1(1)

• The keen observer will note that this is simply an affine operation wTx + b for some w
• From there, it's easy to see that the activations of the entire first hidden layer can be calculated as:

  a(1) = [w1(1) … wm(1)]T x + b(1)
  a(1) = W(1)T x + b(1)

[Figure: first layer of the network, with input x, weight matrix W(1), and bias b(1)]

Page 14:

A Peek Under the Hood
• Operating the activation function on the activation values produces the output of the first hidden layer:

  y(1) = φ(a(1)) = φ(W(1)T x + b(1))

• And propagating the signal forward, the activation of the second hidden layer becomes:

  a(2) = W(2)T y(1) + b(2)
  a(2) = W(2)T φ(W(1)T x + b(1)) + b(2)

[Figure: the second layer of the network, with activations a(1), weight matrix W(2), and bias b(2)]

Page 15:

…And the Output Layer
• Now we repeat the process for the next layer:

  y(2) = φ(a(2))
  y(2) = φ(W(2)T φ(W(1)T x + b(1)) + b(2))

• Finally, the output layer becomes:

  o = W(3)T y(2) + b(3)
  o = W(3)T φ(W(2)T φ(W(1)T x + b(1)) + b(2)) + b(3)

[Figure: the output layer of the network, with activations a(2), weight matrix W(3), bias b(3), and output o]

Note: generally, there is no good reason to add a nonlinearity to the final output neurons, hence the above form.
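The layer-by-layer composition above can be sketched in a few lines of Python. This is an illustrative sketch, not the course's Matlab code: the tanh activation, the weight values, and the tiny 2x2x1 network are made-up examples. The output layer is left linear, as the note above suggests.

```python
import math

def affine(W, x, b):
    # Computes W^T x + b, where W[j][k] connects input j to neuron k.
    return [sum(W[j][k] * x[j] for j in range(len(x))) + b[k]
            for k in range(len(b))]

def forward(x, layers, phi=math.tanh):
    # layers: list of (W, b) pairs; phi is applied to every layer but the last.
    a = x
    for i, (W, b) in enumerate(layers):
        a = affine(W, a, b)
        if i < len(layers) - 1:   # no nonlinearity on the output layer
            a = [phi(v) for v in a]
    return a

# A tiny 2x2x1 example with hand-picked (purely illustrative) weights:
layers = [
    ([[0.5, -0.2], [0.3, 0.8]], [0.1, -0.1]),  # W(1), b(1)
    ([[1.0], [-1.0]], [0.05]),                  # W(2), b(2)
]
out = forward([1.0, 2.0], layers)
```

Written this way, o = W(2)T φ(W(1)T x + b(1)) + b(2) for the small example, matching the nested form above with one fewer layer.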

Page 16:

As opposed to Learning about Neural Nets, which is the point of this lecture… in theory. Except for you, sleeping-in-the-back-of-the-class-person.

Page 17:

What Are We Learning?
• Recall that the Rosenblatt Perceptron was a function parameterized by w (plus some bias b):
• The result was y = θ(wTx + b)
• Our goal was to determine the best w that separated two classes determined by the training sample {xi, di}
• Similarly, the only free parameters of the multilayer perceptron are the weight matrices and bias vectors
• Recall, the output of our example neural network was given as:

  o = W(3)T φ(W(2)T φ(W(1)T x + b(1)) + b(2)) + b(3)

Page 18:

Our Goal
• So much like the Rosenblatt Perceptron, our goal is to determine the best set of weights (and biases) that approximates the (unknown) relationship between the inputs xi and the corresponding outputs di
• How do we determine "best"?
  • We minimize the error!
    • What error?
      • The error of approximation!
        • How do we calculate that?
          • I'm out of indentations, so I'll tell you on the next slide!

Page 19:

The Error of Approximation
• Recall that our neural network produces an output vector:

  o = W(3)T φ(W(2)T φ(W(1)T x + b(1)) + b(2)) + b(3)

• So for some particular input xp from the training set, the neural network will produce an output op
• We compare this to the desired output from the training set, dp, by calculating the error, ep = dp − op
• Since the desired output may be a vector, we propose the Euclidean distance as our error metric:

  Ep = (1/2)‖ep‖² = (1/2) Σi=1..M ep,i² = (1/2) Σi=1..M (dp,i − op,i)²

The ½ term is only there to make the math a little cleaner.

Page 20:

In Other Words…
• Our goal is to find the weights that minimize the squared error of the neural network output compared to the desired output
• We do this by adjusting the weights in some way
• Since we do this on a pattern-by-pattern basis, or in an iterative fashion, this is considered learning

Page 21:

Learning in Neural Networks
• The process goes like this:
  • Start with a guess at all the weights
  • Present an input xp to the neural network
  • Calculate the output for that input, op
  • Determine the error, ep
  • Adjust the weights (somehow!) according to how bad the error is
  • Go to the next training pattern (input) xp+1 and repeat
  • If our error stops decreasing by some amount, or reaches some desired level, we stop!
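The steps above can be sketched as a generic loop. Everything here is a hypothetical scaffold: `forward` and `update` are placeholders for whatever model and weight-adjustment rule you plug in (the rule itself is the subject of the coming slides):

```python
def online_learn(patterns, weights, forward, update, tol=1e-4, max_epochs=100):
    # patterns: list of (x, d) pairs; update adjusts the weights
    # from a single pattern's error.
    prev_total = float("inf")
    for epoch in range(max_epochs):
        total = 0.0
        for x, d in patterns:                          # present an input
            o = forward(x, weights)                    # calculate the output
            e = [di - oi for di, oi in zip(d, o)]      # determine the error
            weights = update(weights, x, e)            # adjust the weights (somehow!)
            total += 0.5 * sum(ei ** 2 for ei in e)
        if abs(prev_total - total) < tol:              # error stopped decreasing: stop
            break
        prev_total = total
    return weights
```

For instance, plugging in a one-weight linear model with the LMS-style update w ← w + η·e·x drives the weight toward the desired mapping.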

Page 22:

How Is That Learning?
• Painting a picture:
  • You look at an object. This is your input.
  • You try to paint it on canvas. This is your output.
  • You compare that to the desired object. This is the error.
  • You adjust your painting to make it look more like the object.
  • Then you look at another object, and try to paint it.
  • You're learning how to paint!
• Beating this analogy to death, what are the free parameters of this system?

Page 23:

Neural Net Example
• Your input is the last two weeks of stock market numbers.
• Your target output is tomorrow's numbers.
• So you:
  • Initialize an MLP with 10 inputs (5 trading days per week over two weeks) and one output.
  • You show a single input vector to the MLP.
  • It produces its own output.
  • Compare that to your target output.
  • Adjust the weights on the MLP to improve the answer, and show it the next input – LEARNING!

Page 24:

“Online Learning” == Wikipedia? I hope not.

Page 25:

"Online" Learning
• What we've described so far (look-error-adjust-look again) is known as online learning
• That is, we change our weights on a pattern-by-pattern basis
• What we're doing is minimizing the instantaneous error: the error for only a single input/output pair at a time
• Since we adjust the weights before we try another pattern, we don't know how well those weights would have performed over all input/output pairs.

Page 26:

The Ideal Scenario
• Ideally, we would want to calculate the error of all our input/output pairs for our current weights, and use that error information to adjust our weights
• This is known as batch learning
• Rather than calculating Ep, we sum up the errors for all the patterns:

  Ebatch = Σp=1..P (1/2)‖ep‖² = (1/2) Σp=1..P Σi=1..M (dp,i − op,i)²
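The batch error is just the per-pattern error summed over the whole training set. A minimal sketch, where the `forward` function and the data in the check are placeholders:

```python
def batch_error(patterns, weights, forward):
    # E_batch = sum over patterns p of (1/2) * sum_i (d_{p,i} - o_{p,i})^2
    total = 0.0
    for x, d in patterns:
        o = forward(x, weights)
        total += 0.5 * sum((di - oi) ** 2 for di, oi in zip(d, o))
    return total
```

Note the contrast with online learning: here the weights are held fixed while every pattern is evaluated, and only then would they be adjusted.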

Page 27:

Scoreboard: Batch Learning versus Online Learning

Batch Learning
• Pros:
  • Incorporates all knowledge into a single weight change
  • Enables us to get a very good approximation of the gradient, which we can use for gradient descent optimization
  • Fewer iterations required
  • Learning may be parallelized
• Cons:
  • Each iteration takes much longer
  • Large memory requirements

Online Learning
• Pros:
  • Only requires information from a single pattern
  • Each iteration executes very rapidly
  • Able to adapt to changing (time-varying) conditions
• Cons:
  • No guarantee of convergence
  • Error not guaranteed to decrease after each iteration, even for a very small learning rate
  • Requires random shuffling of data to avoid cyclical artifacts

Page 28:

Or, Why Neural Networks Didn’t Fade Into Complete Obscurity Years Ago

Page 29:

The Setup
• Recall that we said "adjust the weights somehow"
• The original "Neural Network Winter" was caused by the lack of an answer to this very problem:
  • For most meaningful neural networks, the number of weights makes a huge search space
  • Our example neural network has ((3+1)×4 + (4+1)×5 + (5+1)×2), or 53 weights
  • This is quite literally a 53-dimensional search space!
• So: we need an intelligent way to search through "weight space" to find the best weights
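The weight count generalizes to any layer sizes; a one-liner confirms the 53, counting the +1 bias per layer exactly as in the arithmetic above:

```python
def mlp_weight_count(layer_sizes):
    # Each pair of adjacent layers contributes (n_in + 1 bias) * n_out weights.
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

n = mlp_weight_count([3, 4, 5, 2])  # the 3x4x5x2 example: (3+1)*4 + (4+1)*5 + (5+1)*2 = 53
```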

Page 30:

The Credit-Assignment Problem
• We want to use the error to adjust the weights
• That is, a big error should result in a big adjustment, and a small error should result in a small adjustment
• The problem: how do we know how much a particular weight contributed to the overall output (and subsequently, the overall error)?
• Let us see how a single weight impacts the overall output of the neural network

Page 31:

Output Layer Weight

[Figure: the network diagram with inputs x1, x2, …, xn, layers of φ neurons, and outputs o1…om; a single output-layer weight is highlighted]

So a weight on the output impacts the single output to which it is connected

Page 32:

Hidden Layer Weight

[Figure: the same network diagram, with a single hidden-layer weight and its downstream neurons highlighted]

So a weight one layer back impacts all the outputs and some of the hidden neurons…

Page 33:

Earlier Weight

[Figure: the same network diagram, with a single earlier-layer weight and its many downstream neurons highlighted]

And further back still, a single weight impacts many more neurons!

Page 34:

The Takeaway from That
• Since the error becomes a nonlinear function of many weights, we need a way to adjust each weight individually

Page 35:

Go With What Works
• How do we pick a direction (∆W) to move from some particular operating point W(n)?
• That is, given a particular error that was produced from a particular input/output pair and a particular set of weights, how do we know by what amount to change our weights to get a (hopefully) better result?
• What methods have we used to accomplish this so far?
  • LMS Algorithm used… ?
  • Rosenblatt Perceptron used… ?
  • "Batch Perceptron" used… ?

Page 36:

Our Old Friend
• I propose that we use gradient descent (steepest descent) to optimize our weights
• Recall the generic form of gradient descent:

  W[n+1] = W[n] − η·g(W[n])

• W[n] is just the set of all the weights on the nth iteration
• g( ) is the gradient: the set of all the derivatives of the error function with respect to each weight
• g(W[n]) is the gradient at the point W[n]
• That is, the new guess is the old guess minus a (tiny) step in the direction opposite the gradient

Be careful here! I'll use ( ) for layer number, and [ ] for iteration number, to avoid confusion.
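One iteration of the update W[n+1] = W[n] − η·g(W[n]) can be sketched for a flat list of weights. The quadratic test function below is a made-up example, not the network's error:

```python
def gd_step(W, g, eta):
    # New guess = old guess minus a small step opposite the gradient at W.
    return [w - eta * gi for w, gi in zip(W, g(W))]

# Example: minimize f(w) = (w - 3)^2, whose gradient is 2*(w - 3).
W = [0.0]
for _ in range(100):
    W = gd_step(W, lambda w: [2 * (w[0] - 3)], eta=0.1)
```

With η = 0.1 each step multiplies the distance to the minimum by 0.8, so the iterate converges to w = 3.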

Page 37:

To Clarify
• We'll discuss what the gradient is in a second. Here is what we are proposing:
  • Make a guess at the weights (i.e., randomly initialize them)
  • Show a pattern to the neural network and calculate the error
  • Evaluate the gradient for the given weights
  • Change the weights by "stepping" in the direction opposite that gradient
  • Show another pattern, and repeat

Page 38:

A Lot of Hand-Waving
• What I haven't said yet: how do you calculate the gradient?
• Recall that we actually have a nice, compact form for our input-output relationship:

  op = W(3)T φ(W(2)T φ(W(1)T xp + b(1)) + b(2)) + b(3)

• And our error function is given by:

  Ep = (1/2) Σi=1..M (dp,i − op,i)²

• We could just brute-force it and calculate ∂E/∂wk for each and every weight in the network (yup, all 53 of 'em!)
• Any takers?

op,i is just the ith element of the output vector calculated above.
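The brute-force option really can be coded directly: perturb each weight in turn and difference the error. A sketch using a central difference (the tiny linear model in the check is a made-up stand-in for the network); it costs two full forward passes per weight, which is exactly why we will want something cleverer:

```python
def numerical_gradient(E, w, h=1e-6):
    # dE/dw_k ~= (E(w + h*e_k) - E(w - h*e_k)) / (2h), one weight at a time.
    grad = []
    for k in range(len(w)):
        wp, wm = list(w), list(w)
        wp[k] += h
        wm[k] -= h
        grad.append((E(wp) - E(wm)) / (2 * h))
    return grad
```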

Page 39:

I Have a Better Idea
• This is one of those rare occasions where some elegant math and a clever eye give us a vastly improved algorithm
• The only tool we'll need: the Chain Rule for Derivatives
• Recall from Calculus I:

  ∂f(h(g(x)))/∂x = (∂f/∂h) · (∂h/∂g) · (∂g/∂x)

As sexy as the FFT? Maybe so…

Page 40:

Start Simply
• Let's start with a weight on the output layer:
• From our example, let's look at a particular wj,k(n), where j=3, k=1, and n=3.
• The error function is (still) given by:

  Ep = (1/2) Σi=1..M (dp,i − op,i)²

• Since o is a function of the weights, we may start our chain rule:

  ∂E/∂wj,k(n) = (∂E/∂op,k) · (∂op,k/∂wj,k(n))

Page 41:

Start Simply
• Let's look at the first term of that chain:

  ∂E/∂op,k = ∂[(1/2)(dp,k − op,k)²]/∂op,k

• Since the sum of the errors is over the index i, all the other terms fall out, and this is just:

  ∂E/∂op,k = −(dp,k − op,k) = −ep,k

• In other words, it's just the negative of the error.

Page 42:

Start Simply
• Now the second term, ∂op,i/∂wj,k(n):
• Since the output neurons are typically linear (that is, they have no activation function), we may write the function as:

  op,i = Σj=1..Q yp,j · wj,i(3)

• and the second term is just:

  ∂op,i/∂wj,k(n) = yp,j

• or just the output of the hidden neuron connected to the output node by the weight of interest!

Page 43:

Putting Those Together
• Remember, we want to solve for the gradient, in order to update our weights:

  W[n+1] = W[n] − η·g(W[n])

• Since this is a vectorized equation, we may write for a single weight:

  wj,k(n)[p+1] = wj,k(n)[p] − η · ∂E/∂wj,k(n)

• and we just showed that:

  ∂E/∂wj,k(n) = (∂E/∂op,k) · (∂op,k/∂wj,k(n)) = −ep,k · yp,j

Page 44:

We Have Our First Rule!
• So for a weight connected to an output node only, and assuming a linear activation function for that node, the weight update equation is just:

  wj,k(n)[p+1] = wj,k(n)[p] + η · ep,k · yp,j

• In words: the new weight connecting hidden node j to output node k is just the old weight plus the learning rate times the error on that output (for the current training pattern) times the output of hidden node j (for the current training pattern).
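The rule can be applied to a whole output weight matrix at once. A sketch under the slide's assumptions (linear output nodes, W[j][k] connecting hidden node j to output node k); the numbers in the example are made up:

```python
def update_output_weights(W, e, y, eta):
    # w_{j,k} <- w_{j,k} + eta * e_{p,k} * y_{p,j}, where e_k = d_k - o_k
    # and y_j is the output of hidden node j for the current pattern.
    return [[W[j][k] + eta * e[k] * y[j] for k in range(len(e))]
            for j in range(len(y))]

# Two hidden nodes feeding one output node, learning rate 0.1:
W_new = update_output_weights([[0.0], [0.5]], e=[1.0], y=[2.0, -1.0], eta=0.1)
```

Note that the correction has the same sign as the error times the hidden output: a positive error through a positive y increases the weight.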

Page 45:

Sneak Preview
• Things get much more complicated past the output layer, but the chain rule will help. We left off at:

  ∂E/∂wj,k(n) = (∂E/∂op,k) · (∂op,k/∂wj,k(n))

• but since weights in the second-to-last layer impact all the outputs (remember the animated slide?), we have to sum up all the partials:

  ∂E/∂wj,k(n−1) = Σk=1..M (∂E/∂op,k) · (∂op,k/∂wj,k(n−1))

• and that second term becomes much trickier, since it's now a nonlinear function of the weights!