
Page 1:

Statistical Machine Learning
Hilary Term 2020

Pier Francesco Palamara
Department of Statistics

University of Oxford

Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/SML20.html

February 27, 2020

1

Page 2:

Neural Networks

Neural Networks

2

Page 3:

Neural Networks Introduction

Biological inspiration

- Basic computational elements: neurons.
- Receives signals from other neurons via dendrites.
- Sends processed signals via axons.
- Axon-dendrite interactions at synapses.
- $10^{10}$–$10^{11}$ neurons.
- $10^{14}$–$10^{15}$ synapses.

3

Page 4:

Neural Networks Introduction

Single Neuron Classifier

[Diagram: a single neuron. Inputs $x^{(1)}, x^{(2)}, \ldots, x^{(p)}$ and a constant 1 are weighted by $w_1, w_2, \ldots, w_p$ and the bias $b$, summed into the activation $w^\top x + b$, and passed through $s(\cdot)$ to give the output $P(Y = 1 \mid X = x)$.]

- activation: $w^\top x + b$ (linear in the inputs $x$)
- activation/transfer function $s$ gives the output/activity (potentially nonlinear in $x$)
- $b$ is often called the bias (not to be confused with other biases we discussed!)
- common nonlinear activation function: $s(a) = \frac{1}{1 + e^{-a}}$, i.e. logistic regression
- learn $w$ and $b$ via gradient descent (a minimal sketch follows below)
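As a concrete illustration (not part of the original slides), here is a minimal Python/NumPy sketch of this single-neuron classifier with the logistic activation, trained by batch gradient descent on the log-loss; the function and variable names are made up for this example.

```python
# Minimal single-neuron classifier (logistic regression) trained by batch
# gradient descent on the log-loss. Illustrative sketch only; names and
# hyperparameters are not from the course material.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_single_neuron(X, y, lr=0.1, n_iter=1000):
    """X: (n, p) array of inputs, y: (n,) array of labels in {0, 1}."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        p_hat = sigmoid(X @ w + b)       # P(Y = 1 | X = x_i) for every row
        grad_a = p_hat - y               # derivative of log-loss w.r.t. activation
        w -= lr * (X.T @ grad_a) / n     # averaged gradient step on the weights
        b -= lr * grad_a.mean()          # averaged gradient step on the bias
    return w, b
```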

4

Page 5:

Neural Networks Introduction

Single Neuron Classifier

[Figure: decision boundary of a single neuron classifier on a 2D training set with features $x_{i1}$ and $x_{i2}$; $i$ is the index of a training point.]

5

Page 6:

Neural Networks Introduction

Overfitting

[Figures: fit of the single neuron classifier after 30 and 80 iterations of gradient descent.]

Figures from D. MacKay, Information Theory, Inference and Learning Algorithms

Prevent overfitting by:
- early stopping: just halt the gradient descent (a small sketch follows below)
- regularization: L2-regularization is called weight decay in the neural networks literature.
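As an illustration of the early-stopping idea (not from the slides), here is a small Python sketch that halts gradient descent once a held-out validation loss stops improving; `step_fn` and `val_loss_fn` are hypothetical callables standing in for one update step and the validation loss.

```python
# Illustrative early-stopping loop: keep the parameters with the best
# validation loss and halt once it has not improved for `patience` checks.
# `step_fn` (one gradient descent update) and `val_loss_fn` (loss on held-out
# data) are hypothetical callables supplied by the user.
import numpy as np

def fit_with_early_stopping(step_fn, val_loss_fn, params, patience=10, max_iter=100000):
    best_params, best_loss, since_best = params, np.inf, 0
    for _ in range(max_iter):
        params = step_fn(params)         # one gradient descent iteration
        loss = val_loss_fn(params)       # monitor loss on validation data
        if loss < best_loss:
            best_params, best_loss, since_best = params, loss, 0
        else:
            since_best += 1
            if since_best >= patience:   # validation loss stopped improving
                break
    return best_params
```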

6

Page 7:

Neural Networks Introduction

Overfitting

[Figures: fit after 500 and 3,000 iterations of gradient descent.]

Figures from D. MacKay, Information Theory, Inference and Learning Algorithms

Prevent overfitting by:
- early stopping: just halt the gradient descent
- regularization: L2-regularization is called weight decay in the neural networks literature.

6

Page 8:

Neural Networks Introduction

Overfitting

[Figures: fit after 10,000 and 40,000 iterations of gradient descent.]

Figures from D. MacKay, Information Theory, Inference and Learning Algorithms

Prevent overfitting by:
- early stopping: just halt the gradient descent
- regularization: L2-regularization is called weight decay in the neural networks literature.

6

Page 9:

Neural Networks Introduction

Multilayer Networks

Data vectors $x \in \mathbb{R}^p$, binary labels $y \in \{0, 1\}$.
- inputs $x = [x_1, \ldots, x_p]^\top$
- output $y = P(Y = 1 \mid X = x)$
- hidden unit activities $h_1, \ldots, h_m$

Compute hidden unit activities:
$$h_\ell = s\Big(b^h_\ell + \sum_{j=1}^{p} w^h_{j\ell} x_j\Big)$$

Compute output probability:
$$y = s\Big(b^o + \sum_{\ell=1}^{m} w^o_\ell h_\ell\Big)$$

[Diagram: feed-forward network with an input layer ($x_1$, $x_2$, $x_3$ plus a constant 1), a hidden layer (plus a constant 1), and an output layer producing the output.]
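A minimal Python/NumPy sketch (not course code) of the forward pass defined by the two equations above, assuming logistic activations and the shapes stated in the docstring:

```python
# Forward pass of the single-hidden-layer network defined above
# (Python/NumPy sketch, not course code), with logistic activations.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W_h, b_h, w_o, b_o):
    """x: (p,) input; W_h: (p, m) hidden weights; b_h: (m,) hidden biases;
    w_o: (m,) output weights; b_o: scalar output bias."""
    h = sigmoid(b_h + x @ W_h)       # hidden unit activities h_1, ..., h_m
    y = sigmoid(b_o + h @ w_o)       # output probability P(Y = 1 | X = x)
    return y, h
```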

7

Page 10:

Neural Networks Introduction

Multilayer Networks

[Figure: decision boundary of the multilayer network on a 2D training set with features $x_{i1}$ and $x_{i2}$.]

8

Page 11:

Neural Networks Backpropagation

Training a Neural Network

Objective function: L2-regularized log-loss

$$J = -\sum_{i=1}^{n} \big[\, y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \,\big] + \frac{\lambda}{2} \Big[ \sum_{j,l} (w^h_{jl})^2 + \sum_{l} (w^o_l)^2 \Big]$$

where

$$\hat{y}_i = s\Big(b^o + \sum_{l=1}^{m} w^o_l h_{il}\Big), \qquad h_{il} = s\Big(b^h_l + \sum_{j=1}^{p} w^h_{jl} x_{ij}\Big)$$

Optimize parameters $\theta = \{b^h, w^h, b^o, w^o\}$, where $b^h \in \mathbb{R}^m$, $w^h \in \mathbb{R}^{p \times m}$, $b^o \in \mathbb{R}$, $w^o \in \mathbb{R}^m$, with gradient descent.

$$\frac{\partial J}{\partial w^o_l} = \lambda w^o_l + \sum_{i=1}^{n} \frac{\partial J}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial w^o_l} = \lambda w^o_l + \sum_{i=1}^{n} (\hat{y}_i - y_i)\, h_{il},$$

$$\frac{\partial J}{\partial w^h_{jl}} = \lambda w^h_{jl} + \sum_{i=1}^{n} \frac{\partial J}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial h_{il}} \frac{\partial h_{il}}{\partial w^h_{jl}} = \lambda w^h_{jl} + \sum_{i=1}^{n} (\hat{y}_i - y_i)\, w^o_l\, h_{il}(1 - h_{il})\, x_{ij}.$$

L2-regularization is often called weight decay.
Multiple hidden layers: backpropagation algorithm.
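For concreteness, here is a minimal Python/NumPy sketch (not course code) of batch gradient descent using exactly these gradients; the unregularized bias updates and all hyperparameter values are additions for the example.

```python
# Batch gradient descent for the L2-regularized single-hidden-layer network,
# using the gradients above (Python/NumPy sketch, not course code). The
# unregularized bias updates and all hyperparameter values are additions.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train(X, y, m=10, lam=1e-3, lr=0.1, n_iter=5000, seed=0):
    """X: (n, p) inputs, y: (n,) labels in {0, 1}, m hidden units."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    W_h, b_h = rng.normal(scale=0.1, size=(p, m)), np.zeros(m)
    w_o, b_o = rng.normal(scale=0.1, size=m), 0.0
    for _ in range(n_iter):
        H = sigmoid(b_h + X @ W_h)                       # (n, m) activities h_il
        y_hat = sigmoid(b_o + H @ w_o)                   # (n,) probabilities
        delta_o = y_hat - y                              # (yhat_i - y_i)
        grad_w_o = lam * w_o + H.T @ delta_o             # dJ/dw^o_l
        grad_b_o = delta_o.sum()
        delta_h = np.outer(delta_o, w_o) * H * (1 - H)   # (n, m) hidden "errors"
        grad_W_h = lam * W_h + X.T @ delta_h             # dJ/dw^h_jl
        grad_b_h = delta_h.sum(axis=0)
        # Step size scaled by 1/n to keep the updates stable.
        W_h -= lr / n * grad_W_h
        b_h -= lr / n * grad_b_h
        w_o -= lr / n * grad_w_o
        b_o -= lr / n * grad_b_o
    return W_h, b_h, w_o, b_o
```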

9

Page 12:

Neural Networks Backpropagation

Multiple hidden layers

$$y_i = h^{L+1}_i$$

[Diagram: stacked layers of hidden units $h^1_{i1}, \ldots, h^1_{im}$ up to $h^L_{i1}, \ldots, h^L_{im}$, with inputs $x_{i1} = h^0_{i1}, \ldots, x_{ip} = h^0_{ip}$ at the bottom and the output $y_i = h^{L+1}_i$ at the top.]

$$h^{\ell+1}_i = s\big(W^{\ell+1} h^\ell_i\big)$$

$W^{\ell+1} = (w^{\ell+1}_{jk})_{jk}$: weight matrix at the $(\ell+1)$-th layer; the weight $w^\ell_{jk}$ sits on the edge between $h^{\ell-1}_{ik}$ and $h^\ell_{ij}$.

$s$: entrywise (logistic) transfer function

$$y_i = s\Big(W^{L+1} s\big(W^{L} \big(\cdots s(W^{1} x_i)\big)\big)\Big)$$

Many hidden layers can be used: they are usually thought of as forming ahierarchy from low-level to high-level features.
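A small Python/NumPy sketch (illustrative only) of the composition $y_i = s(W^{L+1} s(\cdots s(W^1 x_i)))$ as a loop over layers; bias terms are omitted here, as in the formula.

```python
# Composition of layers y_i = s(W^{L+1} s( ... s(W^1 x_i))) as a loop
# (Python/NumPy sketch, not course code); bias terms are omitted, as above.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def deep_forward(x, weights):
    """x: (p,) input; weights: list [W_1, ..., W_{L+1}], each of shape
    (units_out, units_in). Returns the activities of every layer."""
    activities = [x]                                     # h^0 = x
    for W in weights:
        activities.append(sigmoid(W @ activities[-1]))   # h^{l+1} = s(W^{l+1} h^l)
    return activities                                    # activities[-1] is the output y_i
```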

10

Page 13:

Neural Networks Backpropagation

Backpropagation

$$y_i = h^{L+1}_i$$

[Diagram: unit $h^\ell_{ij}$ receives $h^{\ell-1}_{ik}$ through the weight $w^\ell_{jk}$ and feeds into the units $h^{\ell+1}_{i1}, \ldots, h^{\ell+1}_{im}$ of the next layer.]

$$J = -\sum_{i=1}^{n} \big[\, y_i \log h^{L+1}_i + (1 - y_i) \log(1 - h^{L+1}_i) \,\big]$$

Gradients with respect to $h^\ell_{ij}$ are computed by recursive applications of the chain rule, and propagated through the network backwards.

$$\frac{\partial J}{\partial h^{L+1}_i} = -\frac{y_i}{h^{L+1}_i} + \frac{1 - y_i}{1 - h^{L+1}_i}$$

$$\frac{\partial J}{\partial h^\ell_{ij}} = \sum_{r=1}^{m} \frac{\partial J}{\partial h^{\ell+1}_{ir}} \frac{\partial h^{\ell+1}_{ir}}{\partial h^\ell_{ij}}$$

$$\frac{\partial J}{\partial w^\ell_{jk}} = \sum_{i=1}^{n} \frac{\partial J}{\partial h^\ell_{ij}} \frac{\partial h^\ell_{ij}}{\partial w^\ell_{jk}}$$
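A per-example backpropagation sketch in Python/NumPy (not course code) applying these chain-rule recursions to the deep logistic network of the previous slide; biases are omitted, and gradients over the data set are obtained by summing over examples.

```python
# Per-example backpropagation for the deep logistic network above
# (Python/NumPy sketch, not course code). `weights` is a list
# [W_1, ..., W_{L+1}] of shape (units_out, units_in); biases are omitted.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_single(x, y, weights):
    """Return [dJ/dW_1, ..., dJ/dW_{L+1}] for one example (x, y)."""
    hs = [x]                                    # forward pass: h^0 = x, ..., h^{L+1}
    for W in weights:
        hs.append(sigmoid(W @ hs[-1]))
    y_hat = hs[-1]
    dJ_dh = -y / y_hat + (1.0 - y) / (1.0 - y_hat)   # dJ/dh^{L+1}
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):   # walk backwards through the layers
        h_out, h_in = hs[l + 1], hs[l]
        delta = dJ_dh * h_out * (1.0 - h_out)   # chain rule through s'(a) = h(1 - h)
        grads[l] = np.outer(delta, h_in)        # dJ/dW_{l+1}
        dJ_dh = weights[l].T @ delta            # propagate dJ/dh^l backwards
    return grads
```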

11

Page 14:

Neural Networks Backpropagation

Neural Networks

[Figures: data in the unit square with features $x_1$ and $x_2$. Left: global solution and local minima, showing decision boundaries for the solution (global minimum) and for local minima 1, 2 and 3. Right: neural network fit with a weight decay of 0.01.]

R package implementing neural networks with a single hidden layer: nnet.

12

Page 15:

Neural Networks Dropout

Dropout Training of Neural Networks

Neural network with single layer of hidden units:

Hidden unit activations:
$$h_{ik} = s\Big(b^h_k + \sum_{j=1}^{p} W^h_{jk} x_{ij}\Big)$$

Output probability:
$$\hat{y}_i = s\Big(b^o + \sum_{k=1}^{m} W^o_k h_{ik}\Big)$$

- Large, overfitted networks often have co-adapted hidden units.
- What each hidden unit learns may in fact be useless, e.g. predicting the negation of predictions from other units.
- Can prevent co-adaptation by randomly dropping out units from the network (a sketch follows the diagram below).

[Diagram: full network with inputs $x_{i1}, \ldots, x_{i4}$, hidden units $h_{i1}, \ldots, h_{i7}$, and output $\hat{y}_i$.]
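A minimal Python/NumPy sketch (not course code) of a dropout forward pass for this single-hidden-layer network: at training time each hidden unit is retained independently with probability `keep_prob`; at test time all units are kept and the hidden contributions are scaled by `keep_prob`. Names and the default rate are illustrative.

```python
# Dropout forward pass for the single-hidden-layer network above
# (Python/NumPy sketch, not course code). During training each hidden unit is
# retained with probability `keep_prob`; at test time all units are kept and
# the hidden contributions are scaled by `keep_prob`. Names are illustrative.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_dropout(x, W_h, b_h, w_o, b_o, keep_prob=0.5, train=True, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    h = sigmoid(b_h + x @ W_h)                   # hidden unit activations
    if train:
        mask = rng.random(h.shape) < keep_prob   # sample the retained units
        return sigmoid(b_o + (h * mask) @ w_o)   # dropped units contribute nothing
    return sigmoid(b_o + keep_prob * (h @ w_o))  # test time: scaled full network
```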

Hinton et al. (2012).

13

Page 16:

Neural Networks Dropout

Dropout Training of Neural Networks

Model as an ensemble of networks (more on ensembles later):

[Diagrams: four subnetworks sampled from the full network, each with inputs $x_{i1}, \ldots, x_{i4}$ and output $\hat{y}_i$, but retaining different subsets of hidden units, e.g. $\{h_{i1}, h_{i2}, h_{i3}, h_{i4}\}$, $\{h_{i1}, h_{i3}, h_{i5}, h_{i7}\}$, $\{h_{i1}, h_{i4}, h_{i6}\}$, $\{h_{i3}, h_{i4}, h_{i5}\}$.]

- Weight-sharing among all networks: each network uses a subset of the parameters of the full network (corresponding to the retained units).
- Training by stochastic gradient descent: at each iteration a network is sampled from the ensemble, and its subset of parameters is updated.
- Biological inspiration: $10^{14}$ weights to be fitted in a lifetime of $10^9$ seconds.

Poisson spikes as a regularization mechanism which prevents co-adaptation: Geoff Hinton on Brains, Sex and Machine Learning.

14

Page 17:

Neural Networks Dropout

Dropout Training of Neural Networks

Classification of phonemes in speech.

Figure from Hinton et al.

15

Page 18:

Neural Networks Variations

Neural Networks – Variations

Other loss functions can be used, e.g. for regression:
$$\sum_{i=1}^{n} |y_i - \hat{y}_i|^2$$

For multiclass classification, use softmax outputs:
$$\hat{y}_{ik} = \frac{\exp\big(b^o_k + \sum_{\ell} w^o_{\ell k} h_{i\ell}\big)}{\sum_{k'=1}^{K} \exp\big(b^o_{k'} + \sum_{\ell} w^o_{\ell k'} h_{i\ell}\big)}, \qquad L(y_i, \hat{y}_i) = -\sum_{k=1}^{K} \mathbb{1}(y_i = k)\, \log \hat{y}_{ik}$$

Other activation functions can be used (see the sketch below):
- rectified linear unit (ReLU): $s(z) = \max(0, z)$
- softplus: $s(z) = \log(1 + \exp(z))$
- tanh: $s(z) = \tanh(z)$
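A small Python/NumPy sketch (not course code) of the softmax output, the multiclass cross-entropy loss, and the three alternative activation functions listed above:

```python
# Softmax output, multiclass cross-entropy loss, and the alternative
# activation functions listed above (Python/NumPy sketch, not course code).
import numpy as np

def softmax(scores):
    """scores: (K,) output activations b^o_k + sum_l w^o_lk h_l."""
    scores = scores - scores.max()        # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum()

def cross_entropy(y, y_hat):
    """y: true class index in {0, ..., K-1}; y_hat: (K,) softmax probabilities."""
    return -np.log(y_hat[y])

def relu(z):
    return np.maximum(0.0, z)

def softplus(z):
    return np.log1p(np.exp(z))            # log(1 + exp(z))

def tanh(z):
    return np.tanh(z)
```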

16

Page 19:

Neural Networks Variations

Deep learning intuition

Source: http://rinuboney.github.io/2015/10/18/theoretical-motivations-deep-learning.html

17

Page 20:

Neural Networks Variations

Deep learning demo

http://playground.tensorflow.org/

18

Page 21:

Neural Networks Variations

Convolutions

Click for animation. Source: https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

19

Page 22:

Neural Networks Variations

Deep Convolutional Neural Networks

Input is a 2D image, $X \in \mathbb{R}^{p \times q}$.

Convolution: detects simple object parts or features. The weights $W$ now correspond to a filter to be learned, typically much smaller than the input, thus encouraging sparse connectivity.

Pooling and sub-sampling: replace the output with a summary statistic of the nearby outputs, e.g. max-pooling (allows invariance to small translations in the input). A small sketch of both operations follows.
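To make the two operations concrete, here is a minimal Python/NumPy sketch (not course code) of a single-filter "valid" 2D convolution and non-overlapping max-pooling; real convolutional networks use many filters, channels, and optimized implementations.

```python
# Single-filter "valid" 2D convolution and non-overlapping max-pooling
# (Python/NumPy sketch, not course code).
import numpy as np

def conv2d(X, W):
    """X: (p, q) image; W: (a, b) filter; returns a (p-a+1, q-b+1) feature map."""
    p, q = X.shape
    a, b = W.shape
    out = np.empty((p - a + 1, q - b + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + a, j:j + b] * W)   # filter applied to a patch
    return out

def max_pool(F, size=2):
    """Replace each non-overlapping size x size block of F with its maximum."""
    p, q = F.shape
    F = F[:p - p % size, :q - q % size]                   # crop to a multiple of size
    return F.reshape(p // size, size, q // size, size).max(axis=(1, 3))
```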

LeCun et al., Krizhevsky et al.

20

Page 23:

Neural Networks Discussion

Neural Networks – Discussion

- Nonlinear hidden units introduce modelling flexibility, hierarchical representations.
- In contrast to user-introduced nonlinearities, features are global, and can be learned to maximize predictive performance.
- Neural networks with a single hidden layer and sufficiently many hidden units can model arbitrarily complex functions.
- Highly flexible framework, with many variations to solve different learning problems and introduce domain knowledge.
- Optimization problem is not convex, and the objective function can have many local optima, plateaus and ridges.
- On large-scale problems, often use stochastic gradient descent, along with a whole host of techniques for optimization, regularization, and initialization.
- Explosion of interest in the field recently and many new developments not covered here, especially by Geoffrey Hinton, Yann LeCun, Yoshua Bengio, Andrew Ng and others. See also http://deeplearning.net/.

21