Formal Computational Skills Week 4: Differentiation
Post on 21-Dec-2015
TRANSCRIPT
Page 1

Formal Computational Skills
Week 4: Differentiation

Page 2
Overview

Will talk about differentiation, how derivatives are calculated and their uses (especially for optimisation). Will end with how differentiation is used to train neural networks.

By the end you should:
• Know what dy/dx means
• Know how partial differentiation works
• Know how derivatives are calculated
• Be able to differentiate simple functions
• Intuitively understand the chain rule
• Understand the role of differentiation in optimisation
• Be able to use the gradient descent algorithm
Page 3

Derivatives

The derivative of a function y = f(x) is dy/dx (or df/dx, or df(x)/dx). Read as "dy by dx".

It describes how a variable changes with respect to (w.r.t.) other variables, i.e. the rate of change of y w.r.t. x.

It is a function of x, and for any given value of x it gives the gradient at that point, i.e. how much/how fast y changes as x changes.

Examples: if x is space/distance, dy/dx tells us the slope or steepness; if x is time, dy/dx tells us speeds, accelerations etc.

Essential to any calculation in a dynamic world.
Page 4

• If dy/dx > 0 the function is increasing
• If dy/dx < 0 the function is decreasing
• If dy/dx = 0 the function is constant

The bigger the absolute size of dy/dx (written |dy/dx|), the faster the change is happening.

In general, dy/dx changes as x changes.

[Figure: four small y-x plots showing increasing, decreasing, constant, and varying-gradient functions]
Page 5

Partial Differentiation

For functions of 2 variables, e.g. z = x exp(-x² - y²), the partial derivatives ∂z/∂x and ∂z/∂y tell us how z varies w.r.t. each variable.

[Figure: surface plot of z = x exp(-x² - y²); the partial derivatives go along the grid lines of constant y and constant x]

They are calculated by treating the other variables as if they are constants.

e.g. z = xy: to get ∂z/∂x, treat y as if it is a constant, so ∂z/∂x = y, and similarly ∂z/∂y = x.

And for a function of n variables y = f(x1, x2, ..., xn), ∂y/∂xi tells us how y changes if only xi varies.
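Treating the other variables as constant is exactly what a numerical partial derivative does: nudge one argument and hold the rest. A small Python sketch (the helper names are my own, not from the slides), using the surface z = x exp(-x² - y²) from this slide:

```python
import math

def z(x, y):
    # The surface used on the slide: z = x * exp(-x^2 - y^2)
    return x * math.exp(-x**2 - y**2)

def partial(f, args, i, h=1e-6):
    # Numerical partial derivative w.r.t. argument i:
    # nudge only the i-th variable, hold every other variable constant
    args_hi = list(args); args_hi[i] += h
    args_lo = list(args); args_lo[i] -= h
    return (f(*args_hi) - f(*args_lo)) / (2 * h)

# dz/dx at (x, y) = (1, 0): analytically (1 - 2x^2) exp(-x^2 - y^2) = -e^(-1)
print(partial(z, (1.0, 0.0), 0))   # ≈ -0.3679
# dz/dy at (1, 0): analytically -2xy exp(-x^2 - y^2) = 0
print(partial(z, (1.0, 0.0), 1))   # ≈ 0.0
```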
Page 6

Higher Dimensions

In more dimensions, the gradient is a vector called grad (represented by an upside-down triangle, ∇):

∇E = (∂E/∂w1, ∂E/∂w2, ..., ∂E/∂wn)

Its elements are the partial derivatives w.r.t. each variable. Grad tells us steepness and direction.

[Figure: error surface E(w) over weights w1, w2]

e.g. here ∂E/∂w1 = 0 and ∂E/∂w2 = 1, so ∇E = (∂E/∂w1, ∂E/∂w2) = (0, 1).
Page 7

Vector Fields

Since grad is a vector, it is often useful to view it as a vector field, with the vectors drawn as arrows.

Note that grad points in the steepest direction uphill. Similarly, minus grad points in the steepest direction downhill.
Page 8

Calculating the Derivative

dy/dx = rate of change of y = (change in y)/(change in x) (if dy/dx remained constant).

For a straight line, y = mx + c, it is easy: since the gradient remains constant, divide the change in y by the change in x. Thus dy/dx is constant and equals m.

e.g. y = 2x: Δy = 6 - 2 = 4 and Δx = 3 - 1 = 2, so dy/dx = Δy/Δx = 4/2 = 2.

Aside: different sorts of d's:
Δ means "a change in"
δ means "a small change in"
d means "an infinitesimally small change in"
Page 9

However, if the derivative is not constant we want the gradient at a point, i.e. the gradient of a tangent to the curve at that point. (A tangent is a line that touches the curve at that point.)

It is approximated by the gradient of a chord, Δy/Δx (a chord is a straight line from one point on a curve to another).

[Figure: curve with a chord from (x, y) to (x + Δx, y + Δy)]

The smaller the chord, the better the approximation. For an infinitesimally small chord, Δy/Δx = dy/dx and the approximation is exact.
Page 10

e.g. y = x². Now do this process mathematically, with a chord of width h:

dy/dx ≈ Δy/Δx = (y(x+h) - y(x))/h

y(x+h) = (x+h)² = x² + 2xh + h²

so Δy/Δx = (x² + 2xh + h² - x²)/h = (2xh + h²)/h = 2x + h

dy/dx = lim h→0 (2x + h) = 2x

Can get derivatives in this way, but there are rules for most functions.
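The limiting process above can be watched numerically (a Python sketch, not from the slides): the chord gradient Δy/Δx for y = x² at x = 1 approaches the true derivative 2x = 2 as the chord shrinks.

```python
# Chord gradient Δy/Δx for y = x^2 at x = 1; the true dy/dx = 2x = 2.
def chord_gradient(f, x, dx):
    # gradient of the straight line (chord) from (x, f(x)) to (x+dx, f(x+dx))
    return (f(x + dx) - f(x)) / dx

f = lambda x: x**2
for dx in (1.0, 0.1, 0.001):
    print(dx, chord_gradient(f, 1.0, dx))
# 1.0   -> 3.0
# 0.1   -> 2.1
# 0.001 -> 2.001  — i.e. 2x + h, approaching 2 as h -> 0
```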
Page 11

Function | Derivative
y = constant (e.g. y = 3) | dy/dx = 0 (as no change)
y = x^n | dy/dx = n x^(n-1)
y = e^x | dy/dx = e^x
y = ln(x) | dy/dx = 1/x
y = sin(x) | dy/dx = cos(x)
y = cos(x) | dy/dx = -sin(x)

Can prove all, e.g. why is dy/dx = e^x if y = e^x? Using the series e^x = Σi x^i/i!, each term differentiates to i x^(i-1)/i! = x^(i-1)/(i-1)!, so the series reproduces itself.

NB n! is n factorial and means e.g. 5! = 5×4×3×2×1.

Other useful rules:

y = k f(x): dy/dx = k df/dx, e.g. y = 3sin(x), dy/dx = 3cos(x)

y = f(x) + g(x): dy/dx = df/dx + dg/dx, e.g. y = x³ + e^x, dy/dx = 3x² + e^x
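The table can be spot-checked with a central-difference approximation (a Python sketch; the test point x = 0.5 is an arbitrary choice of mine):

```python
import math

def numeric_dydx(f, x, h=1e-6):
    # central-difference approximation to dy/dx
    return (f(x + h) - f(x - h)) / (2 * h)

# Spot-check rows of the table at x = 0.5
checks = [
    (lambda x: x**3,     lambda x: 3 * x**2),      # y = x^n  -> n x^(n-1)
    (math.exp,           math.exp),                # y = e^x  -> e^x
    (math.log,           lambda x: 1 / x),         # y = ln x -> 1/x
    (math.sin,           math.cos),                # y = sin x -> cos x
    (math.cos,           lambda x: -math.sin(x)),  # y = cos x -> -sin x
]
for f, dfdx in checks:
    assert abs(numeric_dydx(f, 0.5) - dfdx(0.5)) < 1e-6
print("all table entries agree")
```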
Page 12

Product Rule: if y = u(x) v(x) then

dy/dx = u dv/dx + v du/dx

e.g. y = x sin(x): u = x, v = sin(x); du/dx = 1x⁰ = 1, dv/dx = cos(x), so

dy/dx = x cos(x) + 1×sin(x) = x cos(x) + sin(x)

Quotient Rule: if y = u(x)/v(x) then

dy/dx = (v du/dx - u dv/dx) / v²
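The product-rule result can be sanity-checked numerically (a Python sketch; the test point x = 1.2 is an arbitrary choice of mine):

```python
import math

# Product rule for y = x sin(x): dy/dx = x cos(x) + sin(x)
def y(x): return x * math.sin(x)
def dydx(x): return x * math.cos(x) + math.sin(x)

x, h = 1.2, 1e-6
numeric = (y(x + h) - y(x - h)) / (2 * h)  # central-difference estimate
print(numeric, dydx(x))  # the two agree
```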
Page 13

Chain Rule

Also known as "function of a function". The one you will see most. EASY IN PRACTICE (honestly).

If y = u(v(x)) then

dy/dx = (du/dv)(dv/dx)

e.g. y = sin(x²): u = sin(v), v = x²; du/dv = cos(v), dv/dx = 2x, so

dy/dx = (du/dv)(dv/dx) = cos(v) × 2x = 2x cos(x²)
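A quick numerical check of this chain-rule result (a Python sketch; the test point x = 0.7 is an arbitrary choice of mine):

```python
import math

# Chain rule for y = sin(x^2): u = sin(v), v = x^2
# dy/dx = du/dv * dv/dx = cos(v) * 2x = 2x cos(x^2)
def dydx(x): return 2 * x * math.cos(x**2)

x, h = 0.7, 1e-6
numeric = (math.sin((x + h)**2) - math.sin((x - h)**2)) / (2 * h)
print(numeric, dydx(x))  # the two agree
```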
Page 14

Some examples:

y1 = 3x1: dy1/dx1 = 3

y1 = w11 x1: dy1/dx1 = w11

y1 = w11 x1 + w12 x2: partial d's, as y1 is a function of 2 variables:
∂y1/∂x1 = w11 + 0 = w11
Also ∂y1/∂w11 = x1

Similarly, if y1 = w11 x1 + w12 x2 + ... + w1n xn then e.g. ∂y1/∂x6 = w16

E = u²: dE/du = 2u.  E = u² + 2u + 20: dE/du = 2u + 2.

Sigmoid function: y = 1/(1 + e^(-a)). Can show dy/da = y(1 - y).
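The claimed sigmoid derivative dy/da = y(1 - y) can be verified numerically (a small Python sketch, not part of the slides; a = 0.3 is an arbitrary test point):

```python
import math

def sigmoid(a):
    return 1 / (1 + math.exp(-a))

# Claimed derivative: dy/da = y (1 - y)
a, h = 0.3, 1e-6
y = sigmoid(a)
numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)  # central difference
print(numeric, y * (1 - y))  # the two agree
```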
Page 15

EG: Networks

[Figure: network with inputs x1, x2, weights w11, w12 and output y1 = w11 x1 + w12 x2]

How does a change in x1 affect the network output?

∂y1/∂x1 = ∂(w11 x1 + w12 x2)/∂x1 = w11

So intuitively, the bigger the weight on the connection, the more effect a change in the input along that connection will make.
![Page 16: Formal Computational Skills Week 4: Differentiation](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d565503460f94a34497/html5/thumbnails/16.jpg)
So intuitively, changing w12 has a larger effect the stronger the signal travelling along it
21212
1111212
1 xww
xwww
y
x1
x2
y1= w11x1 + w12x2
w11
w12
Similarly, w12 affects y1 by:
Note also that by following the path of x2 through the network, you can see which variables affect it
= x2
Page 17

Now suppose we want to train the network so that it outputs a target t1 when the input is x. We need an error function E to evaluate how far the output y1 is from the target t1:

E = (y1 - t1)²

Now, how does E change if the output changes, i.e. what is dE/dy1?

To calculate this we need the chain rule, as E is a function of a function of y1: E = u(v(y1)) where v = y1 - t1 and u = v².

du/dv = 2v = 2(y1 - t1) and dv/dy1 = 1

and so dE/dy1 = (du/dv)(dv/dy1) = 2(y1 - t1) × 1 = 2(y1 - t1)
Page 18

We train the network by adjusting the weights, so we need to know how the error varies if e.g. w12 varies, i.e. ∂E/∂w12.

The chain of events, w12 affecting y1 affecting E, indicates the chain rule, so:

∂E/∂w12 = (∂E/∂y1)(∂y1/∂w12)

To calculate this, note that w12 affects y1 by ∂y1/∂w12 = x2, which in turn affects E by ∂E/∂y1 = 2(y1 - t1). So:

∂E/∂w12 = 2(y1 - t1) x2

And in general (cf backprop):

∂E/∂wij = (∂E/∂yi)(∂yi/∂wij) = 2(yi - ti) xj
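A quick check of this result on a single linear unit (a Python sketch; the weight, input and target values are arbitrary illustrations of mine):

```python
# Single linear unit y1 = w11*x1 + w12*x2 with error E = (y1 - t1)^2.
# The slide's result: dE/dw12 = 2*(y1 - t1)*x2.
w11, w12 = 0.5, -0.3
x1, x2 = 1.0, 2.0
t1 = 1.0

def E(w11, w12):
    y1 = w11 * x1 + w12 * x2
    return (y1 - t1) ** 2

y1 = w11 * x1 + w12 * x2
analytic = 2 * (y1 - t1) * x2       # the slide's formula

h = 1e-6
numeric = (E(w11, w12 + h) - E(w11, w12 - h)) / (2 * h)
print(analytic, numeric)  # both ≈ -4.4
```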
Page 19

Back to 2 (or even N) outputs. Consider the route (or more accurately, the variables) by which a change in w12 affects E:

[Figure: network with inputs x1, x2, weights w11, w21, w12, w22 and outputs y1, y2]

E = Σ(i=1..2) (yi - ti)²

See that w12 still affects E via x2 and y1, so we still have:

∂E/∂w12 = (∂E/∂y1)(∂y1/∂w12) = 2(y1 - t1) x2

and in general ∂E/∂wij = (∂E/∂yi)(∂yi/∂wij) = 2(yi - ti) xj

Visualise the partial derivatives working along the connections.
Page 20

How about how E is changed by x2, i.e. ∂E/∂x2?

Again look at the routes along which x2 travels:

[Figure: same 2-output network; x2 feeds y1 via w12 and y2 via w22]

• x2 changes y1 along w12, which changes E (via ∂y1/∂x2 and then ∂E/∂y1)
• x2 also changes y2 along w22, which changes E (via ∂y2/∂x2 and then ∂E/∂y2)

Therefore we need to somehow combine the contributions along these 2 routes.
Page 21

Chain rule for partial differentiation

If y is a function of u and v, which are both functions of x, then:

dy/dx = (∂y/∂u)(du/dx) + (∂y/∂v)(dv/dx)

Quite intuitive: sum the contributions along each path. So for E = Σ(i=1..2) (yi - ti)²:

∂E/∂x2 = (∂E/∂y1)(∂y1/∂x2) + (∂E/∂y2)(∂y2/∂x2) = 2(y1 - t1) w12 + 2(y2 - t2) w22

In general, summing over the outputs: ∂E/∂xj = Σi 2(yi - ti) wij
Page 22

What about another layer?

[Figure: inputs u1, u2 feed weights v11, v21, v12, v22 into x1, x2, which feed weights w11, w21, w12, w22 into outputs y1, y2]

E = Σ(i=1..2) (yi - ti)²

We need to know how E changes w.r.t. the v's and the w's. The calculation for the w's is the same as before, but what about the v's?

The v's affect the x's, which affect the y's, which affect E. So we need ∂E/∂vji, which involves finding ∂E/∂xj and ∂xj/∂vji. Thus:

∂E/∂vji = (∂E/∂xj)(∂xj/∂vji) = [Σ(k=1..2) (∂E/∂yk)(∂yk/∂xj)] ui

Seems complicated, but can be done via matrix multiplication.
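A sketch of this two-layer calculation in plain Python (all numbers are arbitrary illustrations of mine; a linear net with no activation function, matching the slide's equations):

```python
# Two-layer linear net: x_j = sum_i v_ji u_i,  y_k = sum_j w_kj x_j,
# with error E = sum_k (y_k - t_k)^2.
# Slide's result: dE/dv_ji = [ sum_k 2(y_k - t_k) w_kj ] * u_i
u = [1.0, -0.5]                    # inputs
V = [[0.2, 0.4], [-0.3, 0.1]]      # first-layer weights v_ji
W = [[0.5, -0.2], [0.7, 0.3]]      # second-layer weights w_kj
t = [1.0, 0.0]                     # targets

# Forward pass
x = [sum(V[j][i] * u[i] for i in range(2)) for j in range(2)]
y = [sum(W[k][j] * x[j] for j in range(2)) for k in range(2)]

# Backward pass: propagate the output errors back through W (cf backprop)
dE_dy = [2 * (y[k] - t[k]) for k in range(2)]
dE_dx = [sum(dE_dy[k] * W[k][j] for k in range(2)) for j in range(2)]
dE_dV = [[dE_dx[j] * u[i] for i in range(2)] for j in range(2)]
print(dE_dV)
```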
Page 23

Finding Maxima/Minima (Optima)

At a maximum or minimum (optimum, stationary point etc.) the gradient is flat, i.e. dy/dx = 0.

To find max/min, calculate dy/dx and solve dy/dx = 0.

e.g. y = x²: dy/dx = 2x, so at optima dy/dx = 0 => 2x = 0, i.e. x = 0.

[Figure: curve with local maxima and minima marked; the global minimum is the lowest point and the global maximum the highest point]

Similarly, for a function of n variables y = f(x1, x2, ..., xn), at optima ∂y/∂xi = 0 for i = 1, ..., n.
Page 24

Why do we want to find optima? Error minimisation.

In neural network training we have the error function E = Σ(yi - ti)². At the global minimum of E, the outputs are as close as possible to the targets, so the network is 'trained'.

Therefore, to train a network, 'simply' set dE/dw = 0 and solve.

However, in many cases (especially neural network training) we cannot solve dE/dw = 0 analytically. In such cases we can use gradient descent to change the weights in such a way that the error decreases.
Page 25

Gradient Descent

Analogy: being on a foggy hillside where we can only see the area around us. To get to the valley floor, a good plan is to take a step downhill; to get there quickly, take a step in the steepest direction downhill, i.e. in the direction of -grad. As we can't see very far and only know the gradient locally, we take only a small step.

Algorithm:
1. Start at a random position in weight space, w
2. Calculate ∇E(w)
3. Update w via: wnew = wold - η∇E(w)
4. Repeat from 2

η is the learning rate and governs the size of the step taken. It is set to a low value and is often changed adaptively, e.g. if the landscape seems 'easy' (not jagged or steep), increase the step-size.
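A minimal sketch of the algorithm in Python, on a made-up quadratic error surface (the surface, starting point and learning rate are illustrative choices of mine, not from the slides):

```python
# Gradient descent on E(w1, w2) = (w1 - 1)^2 + (w2 + 2)^2,
# a simple bowl with its minimum at w = (1, -2).
def grad_E(w1, w2):
    # grad(E) = (dE/dw1, dE/dw2)
    return (2 * (w1 - 1), 2 * (w2 + 2))

eta = 0.1            # learning rate: governs the size of each step
w = (0.0, 0.0)       # step 1: start at a (here fixed) position in weight space
for _ in range(200): # steps 2-4: compute grad, update, repeat
    g = grad_E(*w)
    w = (w[0] - eta * g[0], w[1] - eta * g[1])
print(w)  # close to the minimum (1, -2)
```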
Page 26

[Figure: error surface E(w) over weights w1, w2. The original point in weight space, w, moves to a new point w - η∇E(w), i.e. a step in the direction of -∇E.]
Page 27

Problem of local minima: gradient descent goes to the closest local minimum and stays there.

[Figure: error curve E with two starting points; from each "start here" the descent ends up in the nearest "end up here" minimum]

Other problems: the length of time to get a solution, especially if there are flat regions in the search space; and it can be difficult to calculate grad.

Many variations help with these problems, e.g.:
• Introduce more stochasticity to 'escape' local minima
• Use higher order gradients to improve convergence
• Approximate grad or use a proxy, e.g. take a step, see if you are lower down or not and if so stay there (hill-climbing etc.)
Page 28

Higher Order Derivatives

Rates of rates of change. In 1 dimension: dy/dx, d²y/dx², d³y/dx³, ..., dⁿy/dxⁿ. In many dimensions, each is a vector of partial derivatives.

e.g. if x is time, velocity dy/dx is 1st order and acceleration d²y/dx² is 2nd order.

The higher the order, the more information is known about y, and so higher orders are used e.g. to improve gradient descent search.

Also used to tell whether a function (e.g. a mapping produced by a network) is smooth, as the second derivative measures the curvature at a point: |d²y/dx²| is greater for the first mapping than the 2nd.

[Figure: two mappings, the first more wiggly than the second]
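As a rough numerical illustration (a Python sketch; the two example functions are my own, not from the slides): the second derivative can be estimated with a central difference, and a wigglier mapping has a larger curvature |d²y/dx²|.

```python
import math

# Second derivative via central differences:
# d2y/dx2 ≈ (y(x+h) - 2 y(x) + y(x-h)) / h^2
def second_derivative(f, x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

# Curvature comparison at x = 1: sin(3x) wiggles faster than sin(x),
# and its second derivative (-9 sin(3x) vs -sin(x)) is larger in magnitude.
print(abs(second_derivative(math.sin, 1.0)))                   # ≈ |−sin(1)| ≈ 0.84
print(abs(second_derivative(lambda x: math.sin(3 * x), 1.0)))  # ≈ |−9 sin(3)| ≈ 1.27
```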