Formal Computational Skills Week 4: Differentiation
Post on 21-Dec-2015
TRANSCRIPT
Page 1

Formal Computational Skills
Week 4: Differentiation

Page 2
Overview

Will talk about differentiation, how derivatives are calculated and their uses (especially for optimisation). Will end with how differentiation is used to train neural networks.

By the end you should:
• Know what dy/dx means
• Know how partial differentiation works
• Know how derivatives are calculated
• Be able to differentiate simple functions
• Intuitively understand the chain rule
• Understand the role of differentiation in optimisation
• Be able to use the gradient descent algorithm
Page 3

Derivatives

The derivative of a function y = f(x) is dy/dx (or df/dx, or df(x)/dx). Read as "dy by dx".

It describes how a variable changes with respect to (w.r.t.) other variables, i.e. the rate of change of y w.r.t. x.

It is a function of x, and for any given value of x it gives the gradient at that point, i.e. how much/how fast y changes as x changes.

Examples: if x is space/distance, dy/dx tells us the slope or steepness; if x is time, dy/dx tells us speeds, accelerations etc.

Essential to any calculation in a dynamic world.
Page 4

• If dy/dx > 0 the function is increasing
• If dy/dx < 0 the function is decreasing
• If dy/dx = 0 the function is constant

The bigger the absolute size of dy/dx (written |dy/dx|), the faster the change is happening.

In general, dy/dx changes as x changes.

[Figure: four small y-x plots showing increasing, decreasing, constant, and varying-gradient functions]
Page 5

Partial Differentiation

For functions of 2 variables, e.g. z = x exp(-x² - y²), the partial derivatives ∂z/∂x and ∂z/∂y tell us how z varies w.r.t. each variable.

[Figure: surface plot of z = x exp(-x² - y²); the partial derivatives go along the grid lines of constant y and constant x]

They are calculated by treating the other variables as if they are constants.

e.g. z = xy: to get ∂z/∂x, treat y as if it is a constant, so ∂z/∂x = y, and similarly ∂z/∂y = x.

And for a function of n variables y = f(x1, x2, ..., xn), ∂y/∂xi tells us how y changes if only xi varies.
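Treating the other variables as constant is exactly what a numerical partial derivative does: nudge one argument and hold the rest. A small Python sketch (the helper names are my own, not from the slides), using the surface z = x exp(-x² - y²) from this slide:

```python
import math

def z(x, y):
    # The surface used on the slide: z = x * exp(-x^2 - y^2)
    return x * math.exp(-x**2 - y**2)

def partial(f, args, i, h=1e-6):
    # Numerical partial derivative w.r.t. argument i:
    # nudge only the i-th variable, hold every other variable constant
    args_hi = list(args); args_hi[i] += h
    args_lo = list(args); args_lo[i] -= h
    return (f(*args_hi) - f(*args_lo)) / (2 * h)

# dz/dx at (x, y) = (1, 0): analytically (1 - 2x^2) exp(-x^2 - y^2) = -e^(-1)
print(partial(z, (1.0, 0.0), 0))   # ≈ -0.3679
# dz/dy at (1, 0): analytically -2xy exp(-x^2 - y^2) = 0
print(partial(z, (1.0, 0.0), 1))   # ≈ 0.0
```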
Page 6

Higher Dimensions

In more dimensions, the gradient is a vector called grad (represented by an upside-down triangle, ∇):

∇E = (∂E/∂w1, ∂E/∂w2, ..., ∂E/∂wn)

Its elements are the partial derivatives w.r.t. each variable. Grad tells us steepness and direction.

[Figure: error surface E(w) over weights w1, w2]

e.g. here ∂E/∂w1 = 0 and ∂E/∂w2 = 1, so ∇E = (∂E/∂w1, ∂E/∂w2) = (0, 1).
Page 7

Vector Fields

Since grad is a vector, it is often useful to view it as a vector field, with the vectors drawn as arrows.

Note that grad points in the steepest direction uphill. Similarly, minus grad points in the steepest direction downhill.
Page 8

Calculating the Derivative

dy/dx = rate of change of y = (change in y)/(change in x) (if dy/dx remained constant).

For a straight line, y = mx + c, it is easy: since the gradient remains constant, divide the change in y by the change in x. Thus dy/dx is constant and equals m.

e.g. y = 2x: Δy = 6 - 2 = 4 and Δx = 3 - 1 = 2, so dy/dx = Δy/Δx = 4/2 = 2.

Aside: different sorts of d's:
Δ means "a change in"
δ means "a small change in"
d means "an infinitesimally small change in"
Page 9

However, if the derivative is not constant we want the gradient at a point, i.e. the gradient of a tangent to the curve at that point. (A tangent is a line that touches the curve at that point.)

It is approximated by the gradient of a chord, Δy/Δx (a chord is a straight line from one point on a curve to another).

[Figure: curve with a chord from (x, y) to (x + Δx, y + Δy)]

The smaller the chord, the better the approximation. For an infinitesimally small chord, Δy/Δx = dy/dx and the approximation is exact.
Page 10

e.g. y = x². Now do this process mathematically, with a chord of width h:

dy/dx ≈ Δy/Δx = (y(x+h) - y(x))/h

y(x+h) = (x+h)² = x² + 2xh + h²

so Δy/Δx = (x² + 2xh + h² - x²)/h = (2xh + h²)/h = 2x + h

dy/dx = lim h→0 (2x + h) = 2x

Can get derivatives in this way, but there are rules for most functions.
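The limiting process above can be watched numerically (a Python sketch, not from the slides): the chord gradient Δy/Δx for y = x² at x = 1 approaches the true derivative 2x = 2 as the chord shrinks.

```python
# Chord gradient Δy/Δx for y = x^2 at x = 1; the true dy/dx = 2x = 2.
def chord_gradient(f, x, dx):
    # gradient of the straight line (chord) from (x, f(x)) to (x+dx, f(x+dx))
    return (f(x + dx) - f(x)) / dx

f = lambda x: x**2
for dx in (1.0, 0.1, 0.001):
    print(dx, chord_gradient(f, 1.0, dx))
# 1.0   -> 3.0
# 0.1   -> 2.1
# 0.001 -> 2.001  — i.e. 2x + h, approaching 2 as h -> 0
```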
Page 11

Function | Derivative
y = constant (e.g. y = 3) | dy/dx = 0 (as no change)
y = x^n | dy/dx = n x^(n-1)
y = e^x | dy/dx = e^x
y = ln(x) | dy/dx = 1/x
y = sin(x) | dy/dx = cos(x)
y = cos(x) | dy/dx = -sin(x)

Can prove all, e.g. why is dy/dx = e^x if y = e^x? Using the series e^x = Σi x^i/i!, each term differentiates to i x^(i-1)/i! = x^(i-1)/(i-1)!, so the series reproduces itself.

NB n! is n factorial and means e.g. 5! = 5×4×3×2×1.

Other useful rules:

y = k f(x): dy/dx = k df/dx, e.g. y = 3sin(x), dy/dx = 3cos(x)

y = f(x) + g(x): dy/dx = df/dx + dg/dx, e.g. y = x³ + e^x, dy/dx = 3x² + e^x
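The table can be spot-checked with a central-difference approximation (a Python sketch; the test point x = 0.5 is an arbitrary choice of mine):

```python
import math

def numeric_dydx(f, x, h=1e-6):
    # central-difference approximation to dy/dx
    return (f(x + h) - f(x - h)) / (2 * h)

# Spot-check rows of the table at x = 0.5
checks = [
    (lambda x: x**3,     lambda x: 3 * x**2),      # y = x^n  -> n x^(n-1)
    (math.exp,           math.exp),                # y = e^x  -> e^x
    (math.log,           lambda x: 1 / x),         # y = ln x -> 1/x
    (math.sin,           math.cos),                # y = sin x -> cos x
    (math.cos,           lambda x: -math.sin(x)),  # y = cos x -> -sin x
]
for f, dfdx in checks:
    assert abs(numeric_dydx(f, 0.5) - dfdx(0.5)) < 1e-6
print("all table entries agree")
```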
Page 12

Product Rule: if y = u(x) v(x) then

dy/dx = u dv/dx + v du/dx

e.g. y = x sin(x): u = x, v = sin(x); du/dx = 1x⁰ = 1, dv/dx = cos(x), so

dy/dx = x cos(x) + 1×sin(x) = x cos(x) + sin(x)

Quotient Rule: if y = u(x)/v(x) then

dy/dx = (v du/dx - u dv/dx) / v²
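The product-rule result can be sanity-checked numerically (a Python sketch; the test point x = 1.2 is an arbitrary choice of mine):

```python
import math

# Product rule for y = x sin(x): dy/dx = x cos(x) + sin(x)
def y(x): return x * math.sin(x)
def dydx(x): return x * math.cos(x) + math.sin(x)

x, h = 1.2, 1e-6
numeric = (y(x + h) - y(x - h)) / (2 * h)  # central-difference estimate
print(numeric, dydx(x))  # the two agree
```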
Page 13

Chain Rule

Also known as "function of a function". The one you will see most. EASY IN PRACTICE (honestly).

If y = u(v(x)) then

dy/dx = (du/dv)(dv/dx)

e.g. y = sin(x²): u = sin(v), v = x²; du/dv = cos(v), dv/dx = 2x, so

dy/dx = (du/dv)(dv/dx) = cos(v) × 2x = 2x cos(x²)
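A quick numerical check of this chain-rule result (a Python sketch; the test point x = 0.7 is an arbitrary choice of mine):

```python
import math

# Chain rule for y = sin(x^2): u = sin(v), v = x^2
# dy/dx = du/dv * dv/dx = cos(v) * 2x = 2x cos(x^2)
def dydx(x): return 2 * x * math.cos(x**2)

x, h = 0.7, 1e-6
numeric = (math.sin((x + h)**2) - math.sin((x - h)**2)) / (2 * h)
print(numeric, dydx(x))  # the two agree
```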
Page 14

Some examples:

y1 = 3x1: dy1/dx1 = 3

y1 = w11 x1: dy1/dx1 = w11

y1 = w11 x1 + w12 x2: partial d's, as y1 is a function of 2 variables:
∂y1/∂x1 = w11 + 0 = w11
Also ∂y1/∂w11 = x1

Similarly, if y1 = w11 x1 + w12 x2 + ... + w1n xn then e.g. ∂y1/∂x6 = w16

E = u²: dE/du = 2u.  E = u² + 2u + 20: dE/du = 2u + 2.

Sigmoid function: y = 1/(1 + e^(-a)). Can show dy/da = y(1 - y).
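The claimed sigmoid derivative dy/da = y(1 - y) can be verified numerically (a small Python sketch, not part of the slides; a = 0.3 is an arbitrary test point):

```python
import math

def sigmoid(a):
    return 1 / (1 + math.exp(-a))

# Claimed derivative: dy/da = y (1 - y)
a, h = 0.3, 1e-6
y = sigmoid(a)
numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)  # central difference
print(numeric, y * (1 - y))  # the two agree
```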
Page 15

EG: Networks

[Figure: network with inputs x1, x2, weights w11, w12 and output y1 = w11 x1 + w12 x2]

How does a change in x1 affect the network output?

∂y1/∂x1 = ∂(w11 x1 + w12 x2)/∂x1 = w11

So intuitively, the bigger the weight on the connection, the more effect a change in the input along that connection will make.
![Page 16: Formal Computational Skills Week 4: Differentiation](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d565503460f94a34497/html5/thumbnails/16.jpg)
So intuitively, changing w12 has a larger effect the stronger the signal travelling along it
21212
1111212
1 xww
xwww
y
x1
x2
y1= w11x1 + w12x2
w11
w12
Similarly, w12 affects y1 by:
Note also that by following the path of x2 through the network, you can see which variables affect it
= x2
Page 17

Now suppose we want to train the network so that it outputs a target t1 when the input is x. We need an error function E to evaluate how far the output y1 is from the target t1:

E = (y1 - t1)²

Now, how does E change if the output changes, i.e. what is dE/dy1?

To calculate this we need the chain rule, as E is a function of a function of y1: E = u(v(y1)) where v = y1 - t1 and u = v².

du/dv = 2v = 2(y1 - t1) and dv/dy1 = 1

and so dE/dy1 = (du/dv)(dv/dy1) = 2(y1 - t1) × 1 = 2(y1 - t1)
Page 18

We train the network by adjusting the weights, so we need to know how the error varies if e.g. w12 varies, i.e. ∂E/∂w12.

The chain of events, w12 affecting y1 affecting E, indicates the chain rule, so:

∂E/∂w12 = (∂E/∂y1)(∂y1/∂w12)

To calculate this, note that w12 affects y1 by ∂y1/∂w12 = x2, which in turn affects E by ∂E/∂y1 = 2(y1 - t1). So:

∂E/∂w12 = 2(y1 - t1) x2

And in general (cf backprop):

∂E/∂wij = (∂E/∂yi)(∂yi/∂wij) = 2(yi - ti) xj
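A quick check of this result on a single linear unit (a Python sketch; the weight, input and target values are arbitrary illustrations of mine):

```python
# Single linear unit y1 = w11*x1 + w12*x2 with error E = (y1 - t1)^2.
# The slide's result: dE/dw12 = 2*(y1 - t1)*x2.
w11, w12 = 0.5, -0.3
x1, x2 = 1.0, 2.0
t1 = 1.0

def E(w11, w12):
    y1 = w11 * x1 + w12 * x2
    return (y1 - t1) ** 2

y1 = w11 * x1 + w12 * x2
analytic = 2 * (y1 - t1) * x2       # the slide's formula

h = 1e-6
numeric = (E(w11, w12 + h) - E(w11, w12 - h)) / (2 * h)
print(analytic, numeric)  # both ≈ -4.4
```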
Page 19

Back to 2 (or even N) outputs. Consider the route (or more accurately, the variables) by which a change in w12 affects E:

[Figure: network with inputs x1, x2, weights w11, w21, w12, w22 and outputs y1, y2]

E = Σ(i=1..2) (yi - ti)²

See that w12 still affects E via x2 and y1, so we still have:

∂E/∂w12 = (∂E/∂y1)(∂y1/∂w12) = 2(y1 - t1) x2

and in general ∂E/∂wij = (∂E/∂yi)(∂yi/∂wij) = 2(yi - ti) xj

Visualise the partial derivatives working along the connections.
Page 20

How about how E is changed by x2, i.e. ∂E/∂x2?

Again look at the routes along which x2 travels:

[Figure: same 2-output network; x2 feeds y1 via w12 and y2 via w22]

• x2 changes y1 along w12, which changes E (via ∂y1/∂x2 and then ∂E/∂y1)
• x2 also changes y2 along w22, which changes E (via ∂y2/∂x2 and then ∂E/∂y2)

Therefore we need to somehow combine the contributions along these 2 routes.
Page 21

Chain rule for partial differentiation

If y is a function of u and v, which are both functions of x, then:

dy/dx = (∂y/∂u)(du/dx) + (∂y/∂v)(dv/dx)

Quite intuitive: sum the contributions along each path. So for E = Σ(i=1..2) (yi - ti)²:

∂E/∂x2 = (∂E/∂y1)(∂y1/∂x2) + (∂E/∂y2)(∂y2/∂x2) = 2(y1 - t1) w12 + 2(y2 - t2) w22

In general, summing over the outputs: ∂E/∂xj = Σi 2(yi - ti) wij
Page 22

What about another layer?

[Figure: inputs u1, u2 feed weights v11, v21, v12, v22 into x1, x2, which feed weights w11, w21, w12, w22 into outputs y1, y2]

E = Σ(i=1..2) (yi - ti)²

We need to know how E changes w.r.t. the v's and the w's. The calculation for the w's is the same as before, but what about the v's?

The v's affect the x's, which affect the y's, which affect E. So we need ∂E/∂vji, which involves finding ∂E/∂xj and ∂xj/∂vji. Thus:

∂E/∂vji = (∂E/∂xj)(∂xj/∂vji) = [Σ(k=1..2) (∂E/∂yk)(∂yk/∂xj)] ui

Seems complicated, but can be done via matrix multiplication.
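A sketch of this two-layer calculation in plain Python (all numbers are arbitrary illustrations of mine; a linear net with no activation function, matching the slide's equations):

```python
# Two-layer linear net: x_j = sum_i v_ji u_i,  y_k = sum_j w_kj x_j,
# with error E = sum_k (y_k - t_k)^2.
# Slide's result: dE/dv_ji = [ sum_k 2(y_k - t_k) w_kj ] * u_i
u = [1.0, -0.5]                    # inputs
V = [[0.2, 0.4], [-0.3, 0.1]]      # first-layer weights v_ji
W = [[0.5, -0.2], [0.7, 0.3]]      # second-layer weights w_kj
t = [1.0, 0.0]                     # targets

# Forward pass
x = [sum(V[j][i] * u[i] for i in range(2)) for j in range(2)]
y = [sum(W[k][j] * x[j] for j in range(2)) for k in range(2)]

# Backward pass: propagate the output errors back through W (cf backprop)
dE_dy = [2 * (y[k] - t[k]) for k in range(2)]
dE_dx = [sum(dE_dy[k] * W[k][j] for k in range(2)) for j in range(2)]
dE_dV = [[dE_dx[j] * u[i] for i in range(2)] for j in range(2)]
print(dE_dV)
```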
Page 23

Finding Maxima/Minima (Optima)

At a maximum or minimum (optimum, stationary point etc.) the gradient is flat, i.e. dy/dx = 0.

To find max/min, calculate dy/dx and solve dy/dx = 0.

e.g. y = x²: dy/dx = 2x, so at optima dy/dx = 0 => 2x = 0, i.e. x = 0.

[Figure: curve with local maxima and minima marked; the global minimum is the lowest point and the global maximum the highest point]

Similarly, for a function of n variables y = f(x1, x2, ..., xn), at optima ∂y/∂xi = 0 for i = 1, ..., n.
Page 24

Why do we want to find optima? Error minimisation.

In neural network training we have the error function E = Σ(yi - ti)². At the global minimum of E, the outputs are as close as possible to the targets, so the network is 'trained'.

Therefore, to train a network, 'simply' set dE/dw = 0 and solve.

However, in many cases (especially neural network training) we cannot solve dE/dw = 0 analytically. In such cases we can use gradient descent to change the weights in such a way that the error decreases.
Page 25

Gradient Descent

Analogy: being on a foggy hillside where we can only see the area around us. To get to the valley floor, a good plan is to take a step downhill; to get there quickly, take a step in the steepest direction downhill, i.e. in the direction of -grad. As we can't see very far and only know the gradient locally, we take only a small step.

Algorithm:
1. Start at a random position in weight space, w
2. Calculate ∇E(w)
3. Update w via: wnew = wold - η∇E(w)
4. Repeat from 2

η is the learning rate and governs the size of the step taken. It is set to a low value and is often changed adaptively, e.g. if the landscape seems 'easy' (not jagged or steep), increase the step-size.
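A minimal sketch of the algorithm in Python, on a made-up quadratic error surface (the surface, starting point and learning rate are illustrative choices of mine, not from the slides):

```python
# Gradient descent on E(w1, w2) = (w1 - 1)^2 + (w2 + 2)^2,
# a simple bowl with its minimum at w = (1, -2).
def grad_E(w1, w2):
    # grad(E) = (dE/dw1, dE/dw2)
    return (2 * (w1 - 1), 2 * (w2 + 2))

eta = 0.1            # learning rate: governs the size of each step
w = (0.0, 0.0)       # step 1: start at a (here fixed) position in weight space
for _ in range(200): # steps 2-4: compute grad, update, repeat
    g = grad_E(*w)
    w = (w[0] - eta * g[0], w[1] - eta * g[1])
print(w)  # close to the minimum (1, -2)
```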
Page 26

[Figure: error surface E(w) over weights w1, w2. The original point in weight space, w, moves to a new point w - η∇E(w), i.e. a step in the direction of -∇E.]
Page 27

Problem of local minima: gradient descent goes to the closest local minimum and stays there.

[Figure: error curve E with two starting points; from each "start here" the descent ends up in the nearest "end up here" minimum]

Other problems: the length of time to get a solution, especially if there are flat regions in the search space; and it can be difficult to calculate grad.

Many variations help with these problems, e.g.:
• Introduce more stochasticity to 'escape' local minima
• Use higher order gradients to improve convergence
• Approximate grad or use a proxy, e.g. take a step, see if you are lower down or not and if so stay there (hill-climbing etc.)
Page 28

Higher Order Derivatives

Rates of rates of change. In 1 dimension: dy/dx, d²y/dx², d³y/dx³, ..., dⁿy/dxⁿ. In many dimensions, each is a vector of partial derivatives.

e.g. if x is time, velocity dy/dx is 1st order and acceleration d²y/dx² is 2nd order.

The higher the order, the more information is known about y, and so higher orders are used e.g. to improve gradient descent search.

Also used to tell whether a function (e.g. a mapping produced by a network) is smooth, as the second derivative measures the curvature at a point: |d²y/dx²| is greater for the first mapping than the 2nd.

[Figure: two mappings, the first more wiggly than the second]
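As a rough numerical illustration (a Python sketch; the two example functions are my own, not from the slides): the second derivative can be estimated with a central difference, and a wigglier mapping has a larger curvature |d²y/dx²|.

```python
import math

# Second derivative via central differences:
# d2y/dx2 ≈ (y(x+h) - 2 y(x) + y(x-h)) / h^2
def second_derivative(f, x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

# Curvature comparison at x = 1: sin(3x) wiggles faster than sin(x),
# and its second derivative (-9 sin(3x) vs -sin(x)) is larger in magnitude.
print(abs(second_derivative(math.sin, 1.0)))                   # ≈ |−sin(1)| ≈ 0.84
print(abs(second_derivative(lambda x: math.sin(3 * x), 1.0)))  # ≈ |−9 sin(3)| ≈ 1.27
```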