1. Objectives of Lecture 4a 2. Multilayer perceptron 3. Training neural networks
Rapid Introduction to Machine Learning/Deep Learning
Hyeong In Choi
Seoul National University
Lecture 4a: Feedforward neural network
October 30, 2015
Table of contents
1. Objectives of Lecture 4a
2. Multilayer perceptron
2.1. Feedforward data flow
2.2. Back propagation algorithm
3. Training neural networks
3.1. Simple neural network
3.2. Training general feedforward neural network
1. Objectives of Lecture 4a
Objective 1
Learn the basic formalism of multilayer feedforward neural network
Objective 2
Learn the back propagation algorithm
Objective 3
Learn about the basic issues related to training of neural network
Objective 4
Learn some useful tricks for training the neural network
2. Multilayer perceptron
2.1. Feedforward data flow
First layer
$z^1_i$: input (pre-activation) to the $i$th neuron in Layer 1
$$z^1_i = \sum_{j=1}^{d} \omega^1_{ij} x_j + b^1_i$$
$b^1_i$: bias at the $i$th neuron in Layer 1
In vector notation
$$z^1 = W^1 x + b^1$$
$h^1_i$: output of the $i$th neuron in Layer 1
$$h^1_i = \phi^1(z^1_i),$$
in vector notation
$$h^1 = \phi^1(z^1),$$
where
$$\phi^1(z) = \begin{cases} \mathrm{sigm}(z) = \dfrac{1}{1 + e^{-z}} \\ \tanh(z) \\ \mathrm{ReLU}(z) = \max(z, 0) = z^+ \\ \text{etc.} \end{cases}$$
At the $\ell$th layer
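The input (pre-activation) at the $i$th neuron in Layer $\ell$
$$z^\ell_i = \sum_j \omega^\ell_{ij} h^{\ell-1}_j + b^\ell_i,$$
in vector notation
$$z^\ell = W^\ell h^{\ell-1} + b^\ell$$
The output at the $i$th neuron in Layer $\ell$
$$h^\ell_i = \phi^\ell(z^\ell_i),$$
in vector notation
$$h^\ell = \phi^\ell(W^\ell h^{\ell-1} + b^\ell)$$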
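The layer-by-layer data flow can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the layer widths, the random weights, and the helper name `forward` are hypothetical, and the input is treated as $h^0 = x$.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, weights, biases, activations):
    """Return the pre-activations z^l and outputs h^l of every layer, with h^0 = x."""
    h = x
    zs, hs = [], [x]
    for W, b, phi in zip(weights, biases, activations):
        z = W @ h + b      # z^l = W^l h^{l-1} + b^l
        h = phi(z)         # h^l = phi^l(z^l)
        zs.append(z)
        hs.append(h)
    return zs, hs

rng = np.random.default_rng(0)
sizes = [3, 4, 2]  # hypothetical widths: d = 3 inputs, 4 hidden neurons, 2 outputs
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
x = np.array([0.5, -1.0, 2.0])

# ReLU in the hidden layer, tanh in the last layer (an arbitrary choice for the sketch).
zs, hs = forward(x, weights, biases, [relu, np.tanh])
print(hs[-1])
```

Keeping every $z^\ell$ and $h^\ell$ is a deliberate choice: the back propagation formulas reuse them.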
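The output layer: Layer $L$
Pre-activation
$$z^L = W^L h^{L-1} + b^L$$
Output
$$h^L = \phi^L(z^L)$$
In case of multi-class classification ($K$ classes)
$$h^L = \mathrm{softmax}(z^L),$$
i.e.
$$h^L_i = \frac{e^{z^L_i}}{\sum_{k=1}^{K} e^{z^L_k}} = \frac{\exp(W^L_{i\cdot} h^{L-1} + b^L_i)}{\sum_{k=1}^{K} \exp(W^L_{k\cdot} h^{L-1} + b^L_k)}$$
$h^L_i \sim P(Y = i \mid x)$
[Note: $W_{i\cdot}$ denotes the $i$th row of matrix $W$]
In case of regression, $h^L_i \in \mathbb{R}$, $\forall i$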
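A minimal sketch of the softmax output for one input. The example values of $z^L$ are made up, and subtracting the maximum before exponentiating is a standard numerical safeguard that is not part of the slides; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import numpy as np

def softmax(z):
    shifted = z - np.max(z)   # a constant shift cancels in the ratio below
    e = np.exp(shifted)
    return e / e.sum()

z_L = np.array([2.0, 1.0, 0.1])   # hypothetical pre-activations for K = 3 classes
h_L = softmax(z_L)
print(h_L, h_L.sum())
```

The entries of `h_L` are positive and sum to 1, so they can be read as class probabilities.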
The loss (error) function
For real output (regression)
$$E = \sum_k |h^L_k - y_k|^2$$
For discrete output (classification)
Recall: Multivariate Bernoulli
$$P(y) = \mu_1^{y_1} \cdots \mu_K^{y_K}$$
Given data $D = \{(x^{(t)}, y^{(t)})\}_{t=1}^{N}$
Likelihood
$$\prod_t P(y^{(t)} \mid x^{(t)}) = \prod_t \mu_1^{y^{(t)}_1} \cdots \mu_K^{y^{(t)}_K}$$
Log likelihood
$$\sum_t \left\{ y^{(t)}_1 \log \mu_1 + \cdots + y^{(t)}_K \log \mu_K \right\}$$
$h^L_i$ maximizes this log likelihood and is the estimator of $\mu_i$.
The error is defined to be the negative of the log likelihood ($h^L_i$ minimizes this error)
$$E = -\sum_{t=1}^{N} \sum_{k=1}^{K} y^{(t)}_k \log h^L_k = -\sum_{t=1}^{N} \sum_{k=1}^{K} I(y^{(t)} = k) \log \frac{e^{z^L_k}}{\sum_{j=1}^{K} e^{z^L_j}},$$
where $z^L_j = W^L_{j\cdot} h^{L-1} + b^L_j$
[Note: this error is called the softmax error or the cross-entropy error]
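The softmax (cross-entropy) error can be computed directly from the output pre-activations. The sketch below assumes that row $t$ of a hypothetical array `Z` holds $z^L$ for example $t$ and that labels are class indices counted from 0 (the slides count from 1); the log of the softmax is taken in a numerically stable form.

```python
import numpy as np

def log_softmax(Z):
    # Row-wise log of the softmax; the max is subtracted for numerical stability.
    Z = Z - Z.max(axis=1, keepdims=True)
    return Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))

def cross_entropy(Z, labels):
    """E = -sum_t log h^L_k for k = y^(t), where row t of Z is z^L for example t."""
    logp = log_softmax(Z)
    return -logp[np.arange(len(labels)), labels].sum()

Z = np.array([[2.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])   # made-up pre-activations, N = 2, K = 3
labels = np.array([0, 2])         # class indices 0..K-1
E = cross_entropy(Z, labels)
print(E)
```

Picking the entry of the true class in each row is the same as summing $I(y^{(t)} = k) \log h^L_k$ over $k$.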
Remark
For regression, the $\ell_2$-error is typical. But one can use the $\ell_1$-error
$$E = \sum_k |h^L_k - y_k|,$$
or another convex function of $h^L - y$
For classification, the softmax error is typical, but other similar errors can be used
A feedforward network has the property that once the values ($z^\ell$ or $h^\ell$) of all the neurons of Layer $\ell$ are given, the values of the layers that come after Layer $\ell$ are all determined by them (assuming all weights and biases are fixed). Thus one can write the error $E(x, y)$ as $E(h^\ell, y)$ or $E(z^\ell, y)$ for any $\ell = 0, 1, \cdots, L$
2.2. Back propagation algorithm
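Change of $E$ w.r.t. $\omega^\ell_{ij}$
Fix $y$, and treat $E$ as a function of the values ($z^\ell$ or $h^\ell$) of the $\ell$th layer. Then
$$\frac{\partial E}{\partial z^\ell_i} = \frac{dh^\ell_i}{dz^\ell_i} \frac{\partial E}{\partial h^\ell_i}, \qquad (1)$$
where
$$\frac{dh^\ell_i}{dz^\ell_i} = \begin{cases} h^\ell_i (1 - h^\ell_i) & \text{if } \phi^\ell \text{ is sigm} \\ I(z^\ell_i \ge 0) & \text{if } \phi^\ell \text{ is ReLU} \\ \mathrm{sech}^2 z^\ell_i & \text{if } \phi^\ell \text{ is } \tanh \end{cases}$$
Then
$$\frac{\partial E}{\partial h^{\ell-1}_j} = \sum_i \frac{\partial z^\ell_i}{\partial h^{\ell-1}_j} \frac{\partial E}{\partial z^\ell_i}.$$
From
$$z^\ell_i = \sum_j \omega^\ell_{ij} h^{\ell-1}_j + b^\ell_i, \qquad (2)$$
we get
$$\frac{\partial z^\ell_i}{\partial h^{\ell-1}_j} = \omega^\ell_{ij}, \qquad \frac{\partial z^\ell_i}{\partial \omega^\ell_{ij}} = h^{\ell-1}_j$$
Thus
$$\frac{\partial E}{\partial h^{\ell-1}_j} = \sum_i \omega^\ell_{ij} \frac{\partial E}{\partial z^\ell_i} \qquad (3)$$
Using (2), we get
$$\frac{\partial E}{\partial \omega^\ell_{ij}} = \frac{\partial z^\ell_i}{\partial \omega^\ell_{ij}} \frac{\partial E}{\partial z^\ell_i} \qquad (4)$$
Thus
$$\frac{\partial E}{\partial \omega^\ell_{ij}} = h^{\ell-1}_j \frac{\partial E}{\partial z^\ell_i}$$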
Change of $E$ w.r.t. $b^\ell_i$
The weight connecting the bias neuron in Layer $\ell - 1$ to the $i$th neuron in Layer $\ell$ is $b^\ell_i = \omega^\ell_{i0}$
$h^{\ell-1}_0 = 1$ and there is no input (pre-activation) to the bias neuron
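Note from (2)
$$\frac{\partial z^\ell_i}{\partial b^\ell_i} = \frac{\partial z^\ell_i}{\partial \omega^\ell_{i0}} = 1$$
Thus from (4)
$$\frac{\partial E}{\partial b^\ell_i} = \frac{\partial E}{\partial \omega^\ell_{i0}} = \frac{\partial E}{\partial z^\ell_i}$$
Hence we get
$$\frac{\partial E}{\partial \omega^\ell_{ij}} = h^{\ell-1}_j \frac{\partial E}{\partial z^\ell_i}, \qquad \frac{\partial E}{\partial b^\ell_i} = \frac{\partial E}{\partial \omega^\ell_{i0}} = \frac{\partial E}{\partial z^\ell_i} \qquad (5)$$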
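Propagation mechanism
The data flows from Layer $\ell - 1$ to Layer $\ell$, i.e. it moves forward (hence the name "feedforward network")
The error derivatives can be computed in the order
$$\frac{\partial E}{\partial h^\ell_i} \xrightarrow{\text{by (1)}} \frac{\partial E}{\partial z^\ell_i} \xrightarrow{\text{by (3)}} \frac{\partial E}{\partial h^{\ell-1}_i} \xrightarrow{\text{by (1)}} \frac{\partial E}{\partial z^{\ell-1}_i} \to \text{and so on}$$
Namely, the error derivatives are computed starting from the output layer (Layer $L$) and going backward all the way to the input layer (Layer 0) (hence the name "back propagation")
Equation (5) is the basic equation to be used for the gradient descent algorithm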
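The backward pass built from equations (1), (3) and (5) can be sketched as follows. The assumptions are not from the slides: every layer uses the sigmoid, the error is the $\ell_2$-error, the sizes and random weights are hypothetical, and `Ws[l]` maps `hs[l]` to `hs[l + 1]` (so it plays the role of $W^{\ell}$ with $\ell = l + 1$). A finite-difference check on one weight is included as a sanity test.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs):
    hs = [x]                                     # hs[0] = h^0 = x
    for W, b in zip(Ws, bs):
        hs.append(sigm(W @ hs[-1] + b))
    return hs

def error(x, y, Ws, bs):
    return np.sum((forward(x, Ws, bs)[-1] - y) ** 2)

def backprop(x, y, Ws, bs):
    hs = forward(x, Ws, bs)
    dE_dh = 2.0 * (hs[-1] - y)                   # derivative of the l2-error at Layer L
    grads_W, grads_b = [], []
    for l in reversed(range(len(Ws))):
        h = hs[l + 1]
        dE_dz = h * (1.0 - h) * dE_dh            # equation (1), sigmoid case
        grads_W.append(np.outer(dE_dz, hs[l]))   # equation (5): h^{l-1}_j * dE/dz^l_i
        grads_b.append(dE_dz)                    # equation (5): dE/db^l_i = dE/dz^l_i
        dE_dh = Ws[l].T @ dE_dz                  # equation (3): sum over i
    return grads_W[::-1], grads_b[::-1]

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(2, 3))]
bs = [rng.normal(size=3), rng.normal(size=2)]
x, y = np.array([0.3, -0.7]), np.array([1.0, 0.0])
gW, gb = backprop(x, y, Ws, bs)

# Central finite difference for one weight of the first layer.
eps = 1e-6
Ws_plus = [W.copy() for W in Ws]
Ws_plus[0][1, 0] += eps
Ws_minus = [W.copy() for W in Ws]
Ws_minus[0][1, 0] -= eps
numeric = (error(x, y, Ws_plus, bs) - error(x, y, Ws_minus, bs)) / (2 * eps)
print(numeric, gW[0][1, 0])
```

The two printed numbers should agree to several digits; a mismatch usually points to an index error such as summing (3) over the wrong index.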
3. Training neural networks
3.1. Simple neural network
Reference: Hinton’s Coursera Lectures
$$h = \omega_1 x_1 + \omega_2 x_2$$
$\ell_2$-error for a single training example $((x_1, x_2), y)$
$$E = E(\omega_1, \omega_2) = (\omega_1 x_1 + \omega_2 x_2 - y)^2$$
$x_1$, $x_2$, $y$ are fixed
$\omega_1$ and $\omega_2$ are variables
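Gradient
$$\nabla E = \left( \frac{\partial E}{\partial \omega_1}, \frac{\partial E}{\partial \omega_2} \right) = (2x_1(\omega_1 x_1 + \omega_2 x_2 - y), 2x_2(\omega_1 x_1 + \omega_2 x_2 - y)) \sim (x_1, x_2)$$
$\nabla E$ points perpendicularly to the line $\omega_1 x_1 + \omega_2 x_2 - y = 0$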
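A quick numerical check of the perpendicularity claim, with made-up values for $x$, $y$ and $\omega$. The direction along the line $\omega_1 x_1 + \omega_2 x_2 - y = 0$ in the $(\omega_1, \omega_2)$ plane is $(x_2, -x_1)$, so its dot product with the gradient should vanish.

```python
import numpy as np

def grad_E(w, x, y):
    # E(w) = (w . x - y)^2, so grad E = 2 (w . x - y) x
    return 2.0 * (w @ x - y) * x

x = np.array([3.0, 1.0])   # hypothetical fixed input
y = 2.0                    # hypothetical fixed target
w = np.array([0.5, -1.0])  # current weights

g = grad_E(w, x, y)
along_line = np.array([x[1], -x[0]])   # direction of the line w . x = y in weight space
print(g, g @ along_line)
```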
$\ell_2$-error for two data points
$$D = \{((x^1_1, x^1_2), y^1), ((x^2_1, x^2_2), y^2)\}$$
$$E = E(\omega_1, \omega_2) = (\omega_1 x^1_1 + \omega_2 x^1_2 - y^1)^2 + (\omega_1 x^2_1 + \omega_2 x^2_2 - y^2)^2$$
The level sets E = constant are ellipses
Steepest descent: full batch learning
ω(new) = ω(old) − ε∇E(ω(old))
ε: learning rate
Stochastic gradient descent: online learning
Pathological situation
If the two lines $\omega_1 x^t_1 + \omega_2 x^t_2 - y^t = 0$ ($t = 1, 2$) are almost parallel, then the elliptical level sets have a ravine-like shape
Full batch
Online
In either case, ω(new) does not move much in the direction of the minimum [oscillation phenomenon]
If the learning rate is too big, ω(new) diverges
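The ravine behaviour can be reproduced with full-batch steepest descent on a two-point $\ell_2$-error. The data, starting point and learning rates below are hypothetical, chosen so that the two lines are nearly parallel and both pass through $\omega = (1, 1)$. For a quadratic error, steepest descent is stable only when the learning rate is below 2 divided by the largest eigenvalue of the Hessian $2X^T X$, a standard fact used here to pick one rate on each side of that bound.

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 0.9]])      # two nearly parallel lines (hypothetical data)
y = np.array([2.0, 1.9])        # both lines pass through w = (1, 1)

def E(w):
    return np.sum((X @ w - y) ** 2)

def grad_E(w):
    return 2.0 * X.T @ (X @ w - y)

def gradient_descent(w0, lr, steps):
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = w - lr * grad_E(w)   # w(new) = w(old) - eps * grad E(w(old))
    return w

# Curvatures along the two axes of the elliptical level sets (ascending order).
eigs = np.linalg.eigvalsh(2.0 * X.T @ X)

w_small = gradient_descent([3.0, 0.0], lr=0.1, steps=200)   # below 2 / eigs[1]
w_big = gradient_descent([3.0, 0.0], lr=0.3, steps=200)     # above 2 / eigs[1]
print(eigs, w_small, E(w_big))
```

With the smaller rate the error drops quickly across the ravine, yet after 200 steps the weights are still far from $(1, 1)$ along its floor; with the larger rate the iterates blow up.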
3.2. Training general feedforward neural network
Data: $((x_1, x_2), y)$ with $y \in \{0, 1\}$
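Example: error of binary classifier
Error
$$E = -I(y = 1) \log \frac{e^{\omega_1 x_1 + \omega_2 x_2}}{1 + e^{\omega_1 x_1 + \omega_2 x_2}} - I(y = 0) \log \frac{1}{1 + e^{\omega_1 x_1 + \omega_2 x_2}}$$
When $y = 1$, the error is
$$E = -\log \mathrm{sigm}(\omega_1 x_1 + \omega_2 x_2)$$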
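$\mathrm{sigm}(z)$ is concave for $z > 0$
$\log$ is also concave (and increasing)
Thus $E = -\log \mathrm{sigm}(\omega_1 x_1 + \omega_2 x_2)$ is convex when $\omega_1 x_1 + \omega_2 x_2 > 0$
When $y = 0$, the error is
$$E = -\log \frac{1}{1 + e^{\omega_1 x_1 + \omega_2 x_2}}$$
$\dfrac{1}{1 + e^{t}}$ is concave for $t < 0$
$\log$ is also concave (and increasing)
$E$ is convex if $\omega_1 x_1 + \omega_2 x_2 < 0$
Thus this composition argument shows that $E$ is convex in the "correct" region. (For this single unit $E$ is in fact convex everywhere, since $-\log \mathrm{sigm}(z) = \log(1 + e^{-z})$ has second derivative $\mathrm{sigm}(z)(1 - \mathrm{sigm}(z)) > 0$; the argument above just does not show it.)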
Error
The error of a general neural network is a very complicated function of a huge number of variables
The purpose of training (learning) is to find the value of $\omega$ that minimizes $E$.
(Stochastic) gradient descent
The basic workhorse is a variant of gradient descent
Two stages
Initialization: how to find a "good" starting point (configuration of $\omega$)
Algorithm: how to get to a "good" minimum point from the given starting point
Basic issues
(A) Mode
Full-batch learning (gradient descent): use the entire data set
Mini-batch learning (stochastic gradient descent): divide the data set into a family of smaller data sets (mini-batches) and use each mini-batch in turn [Preferred method for a large data set with much redundancy]
Online learning (stochastic gradient descent): every mini-batch consists of a single data point
(B) Input
Whether to use the given data as input or do some transformation of it
(C) Weight
How to set weights initially and change them in the course of training
(D) Learning rate
How to choose learning rate(s) initially and change it (them) in the course of training
(E) Generalization error
How to avoid overfitting
How to estimate the generalization error
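The three modes listed under (A) differ only in the batch size, which the following sketch makes explicit. Everything concrete here is an assumption for illustration: the linear model, the noiseless synthetic data, the learning rate, and the helper name `minibatches`. Setting `batch_size=len(X)` gives full-batch learning and `batch_size=1` gives online learning.

```python
import numpy as np

def minibatches(n_examples, batch_size, rng):
    """Yield index arrays that partition a random shuffle of 0..n_examples-1."""
    order = rng.permutation(n_examples)
    for start in range(0, n_examples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))        # hypothetical inputs
w_true = np.array([1.0, -2.0])
y = X @ w_true                      # noiseless linear targets

w = np.zeros(2)
lr = 0.05
for epoch in range(500):
    # One pass over the data: each mini-batch is used once, in turn.
    for idx in minibatches(len(X), batch_size=4, rng=rng):
        residual = X[idx] @ w - y[idx]
        # Gradient of the mean squared error on this mini-batch.
        w -= lr * 2.0 * X[idx].T @ residual / len(idx)
print(w)
```

Reshuffling at every epoch is one common convention; the slides do not prescribe an order.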
(B) Input
Example (Hinton)
h = ω1x1 + ω2x2
Data ((101,101),2), ((101,99),0)
Subtract 100 ⇒ ((1,1),2), ((1,−1),0)
Example (Hinton)
Data ((0.1,10),2), ((0.1,−10),0)
Divide componentwise by average magnitude ⇒ ((1,1),2), ((1,−1),0)
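The effect of both transformations can be measured on the $\ell_2$-error $E(\omega) = \sum_t (\omega \cdot x^{(t)} - y^{(t)})^2$, whose Hessian is $2X^T X$. The ratio of its largest to smallest eigenvalue indicates how elongated the elliptical level sets are. This measure is an illustrative choice, not something the slides specify.

```python
import numpy as np

def condition_number(X):
    """Ratio of largest to smallest curvature of E(w) = sum_t (w . x^(t) - y^(t))^2."""
    eigs = np.linalg.eigvalsh(2.0 * X.T @ X)   # ascending eigenvalues of the Hessian
    return eigs[-1] / eigs[0]

# First example: shifting.
X_shift_raw = np.array([[101.0, 101.0], [101.0, 99.0]])
X_shift = X_shift_raw - 100.0

# Second example: scaling each component by its average magnitude.
X_scale_raw = np.array([[0.1, 10.0], [0.1, -10.0]])
X_scale = X_scale_raw / np.abs(X_scale_raw).mean(axis=0)

print(condition_number(X_shift_raw), condition_number(X_shift))
print(condition_number(X_scale_raw), condition_number(X_scale))
```

In both examples the transformed inputs give a ratio of 1, i.e. circular level sets, whereas the raw inputs give ratios in the tens of thousands.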
If the inputs are highly correlated, they tend to create "ravines". To alleviate this problem,
decorrelate the input
normalize each coordinate to have similar variability (i.e. variance)
One may use, e.g., PCA or an autoencoder (see the forthcoming lecture)
(C) Weight
Weights are to be determined by the (learning) algorithm
Initialization is still an issue
Random initialization
Breaking symmetry
Fan-in
The fan-in of a neuron is the number of incoming connections to it, i.e. the number of neurons in the previous layer that feed into the neuron in question
A big fan-in may result in a big change in the value of the neuron in the later layer even with a small change in the earlier layer. Thus it is better to initialize the incoming weights of a neuron with big fan-in to be small
With small fan-in, one may initialize the incoming weights bigger (not so small)
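One common way to make "small for big fan-in" concrete is to scale random weights by $1/\sqrt{\text{fan-in}}$. That specific rule is an assumption here (the slides only say the weights should be smaller when the fan-in is bigger), and the layer sizes are hypothetical.

```python
import numpy as np

def init_weights(fan_in, fan_out, rng):
    # Assumed rule: standard normal entries scaled by 1/sqrt(fan_in), so the
    # pre-activation variance does not grow with the number of inputs.
    return rng.normal(size=(fan_out, fan_in)) / np.sqrt(fan_in)

rng = np.random.default_rng(0)
W_small_fan_in = init_weights(10, 500, rng)     # each neuron has 10 inputs
W_big_fan_in = init_weights(1000, 500, rng)     # each neuron has 1000 inputs

# Pre-activations for a unit-variance input stay on the same scale in both cases.
z_small = W_small_fan_in @ rng.normal(size=10)
z_big = W_big_fan_in @ rng.normal(size=1000)
print(W_small_fan_in.std(), W_big_fan_in.std(), z_big.std())
```

Because the entries are drawn independently, no two neurons start with identical incoming weights, which is what breaks the symmetry.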
A general (esp. deep) neural network has many local minima in which the gradient descent process may get trapped. This is one of the central issues in the training of neural networks
Pre-training has the effect of putting the initial position reasonably close to the intended minimum [see the forthcoming lectures]
(D) Learning rate
Big learning rate
learns quickly (error decreases rapidly) in the early stage
and then may start to oscillate, and the error becomes erratic
Small learning rate
takes a long time to learn
may get stuck in a "bad" local minimum
Rule of thumb
If the error oscillates, decrease the learning rate
If the error decreases consistently, increase the learning rate
Using a different learning rate for each weight
The magnitudes of the weights vary greatly, and it may cause problems if a single learning rate is used for all weights
See Hinton's Coursera lecture for this
Momentum method
Idea
In a ravine, oscillatory phenomenon occurs
Gradient descent
ω(t) = ω(t − 1) − ε∇E(ω(t − 1))
ω(t + 1) = ω(t) − ε∇E(ω(t))
ω(t) − ω(t − 1) and ω(t + 1) − ω(t) are nearly opposite to each other. If added up, they nearly cancel each other, and the resulting vector may point roughly toward the bottom of the ravine.
The momentum method came out of this kind of observation
Algorithm
Keep track of the momentum vector v(t)
Given weight ω(t) and momentum v(t), let
v(t + 1) = µv(t) − ε∇E(ω(t))
ω(t + 1) = ω(t) + v(t + 1)
µ ∶ momentum decay coefficient
ε ∶ learning rate
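The update rule above can be tried on a ravine-shaped quadratic error. The two-point data set, starting point, learning rate and step count are hypothetical choices for illustration; setting `mu=0.0` recovers plain gradient descent, which makes the comparison direct.

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 0.9]])      # nearly parallel lines give a ravine (hypothetical data)
y = np.array([2.0, 1.9])        # minimum at w = (1, 1)

def E(w):
    return np.sum((X @ w - y) ** 2)

def grad_E(w):
    return 2.0 * X.T @ (X @ w - y)

def momentum_descent(w0, lr, mu, steps):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)                 # momentum vector v(0) = 0
    for _ in range(steps):
        v = mu * v - lr * grad_E(w)      # v(t+1) = mu v(t) - eps grad E(w(t))
        w = w + v                        # w(t+1) = w(t) + v(t+1)
    return w

w_plain = momentum_descent([3.0, 0.0], lr=0.1, mu=0.0, steps=200)
w_mom = momentum_descent([3.0, 0.0], lr=0.1, mu=0.9, steps=200)
print(E(w_plain), E(w_mom))
```

In this setup the momentum run ends with a clearly smaller error, because the accumulated velocity keeps pushing along the floor of the ravine where single gradient steps are tiny.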
Improved momentum method (Sutskever et al. a la Nesterov)
v(t + 1) = µv(t) − ε∇E(ω(t) + µv(t))
ω(t + 1) = ω(t) + v(t + 1)
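The only change from the plain momentum rule is where the gradient is evaluated: at the look-ahead point ω(t) + µv(t) rather than at ω(t). A minimal sketch on a hypothetical ravine-shaped quadratic error follows; the data, starting point and hyperparameters are made up for illustration.

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 0.9]])      # hypothetical data producing a ravine
y = np.array([2.0, 1.9])        # minimum at w = (1, 1)

def E(w):
    return np.sum((X @ w - y) ** 2)

def grad_E(w):
    return 2.0 * X.T @ (X @ w - y)

def nesterov_descent(w0, lr, mu, steps):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        lookahead = w + mu * v                  # point reached by the momentum step alone
        v = mu * v - lr * grad_E(lookahead)     # v(t+1) = mu v(t) - eps grad E(w(t) + mu v(t))
        w = w + v                               # w(t+1) = w(t) + v(t+1)
    return w

w0 = [3.0, 0.0]
w_no_momentum = nesterov_descent(w0, lr=0.1, mu=0.0, steps=200)   # reduces to gradient descent
w_nesterov = nesterov_descent(w0, lr=0.1, mu=0.9, steps=200)
print(E(w_no_momentum), E(w_nesterov))
```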
In the ravine, the sequence {ω(t)} may look like the following [figure]
At the beginning, set the momentum coefficient small (e.g., 0.5) and eventually increase it (e.g., to 0.9)
Other methods
Separate adaptive learning rate
Rmsprop
[See Hinton]
(E) Generalization error
Neural networks tend to overfit
Some care is needed to control the generalization error
Make use of machine learning techniques
Regularization: add a regularizing term to the error function (standard regularization); keep weights within some prescribed bound; keep the network simpler
Model selection: try many neural networks and choose the best one according to the model selection criterion; try early stopping
Aggregation/Bagging: train many neural networks and use the averaging technique; try bootstrap in conjunction with aggregation
Randomization: apply dropout
etc.
Dropout
Imitation of the random forests idea
Repeat
at the start of each training step, randomly select some neurons
remove them together with the edges connected to them
do the training with the remaining network
put back the removed neurons and edges (with old values)
This forces each edge (weight) to individually adapt to the patterns without the co-operation from other edges (weights)
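Removing a neuron for one training step can be imitated by multiplying its output by 0, which also silences every edge leaving it. The sketch below shows only that masking step for one hidden layer. The layer size, weights, input and the drop probability 0.5 are hypothetical, and `dropout_mask` is an illustrative helper rather than a library function.

```python
import numpy as np

def dropout_mask(n_neurons, drop_prob, rng):
    """1.0 for neurons kept in this training step, 0.0 for neurons temporarily removed."""
    return (rng.random(n_neurons) >= drop_prob).astype(float)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))     # hypothetical hidden-layer weights
b1 = np.zeros(5)
x = np.array([1.0, -0.5, 2.0])

h_full = np.tanh(W1 @ x + b1)                     # hidden outputs of the full network
mask = dropout_mask(5, drop_prob=0.5, rng=rng)    # drawn afresh at every training step
h_dropped = mask * h_full                         # removed neurons contribute nothing downstream
print(mask, h_dropped)
```

Since a masked neuron does not influence the error in that step, the gradient with respect to its weights is zero and they keep their old values, which matches the "put back" step above.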