
Page 1:

Statistical Machine Learning
Hilary Term 2020

Pier Francesco Palamara
Department of Statistics

University of Oxford

Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/SML20.html

February 27, 2020

1

Page 2:

Neural Networks

Neural Networks

2

Page 3:

Neural Networks Introduction

Biological inspiration

- Basic computational elements: neurons.
- Receives signals from other neurons via dendrites.
- Sends processed signals via axons.
- Axon-dendrite interactions at synapses.
- $10^{10}$–$10^{11}$ neurons.
- $10^{14}$–$10^{15}$ synapses.

3

Page 4:

Neural Networks Introduction

Single Neuron Classifier

[Diagram: a single neuron. Inputs $x^{(1)}, x^{(2)}, \ldots, x^{(p)}$ and a constant 1 are weighted by $w_1, w_2, \ldots, w_p$ and the bias $b$, summed into the activation $w^\top x + b$, and passed through $s(\cdot)$ to give the output $P(Y = 1 \mid X = x)$.]

- activation: $w^\top x + b$ (linear in the inputs $x$)
- activation/transfer function $s$ gives the output/activity (potentially nonlinear in $x$)
- $b$ is often called the bias (not to be confused with other biases we discussed!)
- common nonlinear activation function: $s(a) = \frac{1}{1 + e^{-a}}$, i.e. logistic regression
- learn $w$ and $b$ via gradient descent (a minimal sketch follows below)
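As a concrete illustration (not part of the original slides), here is a minimal Python/NumPy sketch of this single-neuron classifier with the logistic activation, trained by batch gradient descent on the log-loss; the function and variable names are made up for this example.

```python
# Minimal single-neuron classifier (logistic regression) trained by batch
# gradient descent on the log-loss. Illustrative sketch only; names and
# hyperparameters are not from the course material.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_single_neuron(X, y, lr=0.1, n_iter=1000):
    """X: (n, p) array of inputs, y: (n,) array of labels in {0, 1}."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        p_hat = sigmoid(X @ w + b)       # P(Y = 1 | X = x_i) for every row
        grad_a = p_hat - y               # derivative of log-loss w.r.t. activation
        w -= lr * (X.T @ grad_a) / n     # averaged gradient step on the weights
        b -= lr * grad_a.mean()          # averaged gradient step on the bias
    return w, b
```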

4

Page 5:

Neural Networks Introduction

Single Neuron Classifier

[Figure: decision boundary of a single neuron classifier on a 2D training set with features $x_{i1}$ and $x_{i2}$; $i$ is the index of a training point.]

5

Page 6:

Neural Networks Introduction

Overfitting

[Figures: fit of the single neuron classifier after 30 and 80 iterations of gradient descent.]

Figures from D. MacKay, Information Theory, Inference and Learning Algorithms

Prevent overfitting by:
- early stopping: just halt the gradient descent (a small sketch follows below)
- regularization: L2-regularization is called weight decay in the neural networks literature.
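As an illustration of the early-stopping idea (not from the slides), here is a small Python sketch that halts gradient descent once a held-out validation loss stops improving; `step_fn` and `val_loss_fn` are hypothetical callables standing in for one update step and the validation loss.

```python
# Illustrative early-stopping loop: keep the parameters with the best
# validation loss and halt once it has not improved for `patience` checks.
# `step_fn` (one gradient descent update) and `val_loss_fn` (loss on held-out
# data) are hypothetical callables supplied by the user.
import numpy as np

def fit_with_early_stopping(step_fn, val_loss_fn, params, patience=10, max_iter=100000):
    best_params, best_loss, since_best = params, np.inf, 0
    for _ in range(max_iter):
        params = step_fn(params)         # one gradient descent iteration
        loss = val_loss_fn(params)       # monitor loss on validation data
        if loss < best_loss:
            best_params, best_loss, since_best = params, loss, 0
        else:
            since_best += 1
            if since_best >= patience:   # validation loss stopped improving
                break
    return best_params
```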

6

Page 7:

Neural Networks Introduction

Overfitting

[Figures: fit after 500 and 3,000 iterations of gradient descent.]

Figures from D. MacKay, Information Theory, Inference and Learning Algorithms

Prevent overfitting by:
- early stopping: just halt the gradient descent
- regularization: L2-regularization is called weight decay in the neural networks literature.

6

Page 8:

Neural Networks Introduction

Overfitting

[Figures: fit after 10,000 and 40,000 iterations of gradient descent.]

Figures from D. MacKay, Information Theory, Inference and Learning Algorithms

Prevent overfitting by:
- early stopping: just halt the gradient descent
- regularization: L2-regularization is called weight decay in the neural networks literature.

6

Page 9:

Neural Networks Introduction

Multilayer Networks

Data vectors $x \in \mathbb{R}^p$, binary labels $y \in \{0, 1\}$.
- inputs $x = [x_1, \ldots, x_p]^\top$
- output $y = P(Y = 1 \mid X = x)$
- hidden unit activities $h_1, \ldots, h_m$

Compute hidden unit activities:
$$h_\ell = s\Big(b^h_\ell + \sum_{j=1}^{p} w^h_{j\ell} x_j\Big)$$

Compute output probability:
$$y = s\Big(b^o + \sum_{\ell=1}^{m} w^o_\ell h_\ell\Big)$$

[Diagram: feed-forward network with an input layer ($x_1$, $x_2$, $x_3$ plus a constant 1), a hidden layer (plus a constant 1), and an output layer producing the output.]
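A minimal Python/NumPy sketch (not course code) of the forward pass defined by the two equations above, assuming logistic activations and the shapes stated in the docstring:

```python
# Forward pass of the single-hidden-layer network defined above
# (Python/NumPy sketch, not course code), with logistic activations.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W_h, b_h, w_o, b_o):
    """x: (p,) input; W_h: (p, m) hidden weights; b_h: (m,) hidden biases;
    w_o: (m,) output weights; b_o: scalar output bias."""
    h = sigmoid(b_h + x @ W_h)       # hidden unit activities h_1, ..., h_m
    y = sigmoid(b_o + h @ w_o)       # output probability P(Y = 1 | X = x)
    return y, h
```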

7

Page 10:

Neural Networks Introduction

Multilayer Networks

[Figure: decision boundary of the multilayer network on a 2D training set with features $x_{i1}$ and $x_{i2}$.]

8

Page 11:

Neural Networks Backpropagation

Training a Neural Network

Objective function: L2-regularized log-loss

$$J = -\sum_{i=1}^{n} \big[\, y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \,\big] + \frac{\lambda}{2} \Big[ \sum_{j,l} (w^h_{jl})^2 + \sum_{l} (w^o_l)^2 \Big]$$

where

$$\hat{y}_i = s\Big(b^o + \sum_{l=1}^{m} w^o_l h_{il}\Big), \qquad h_{il} = s\Big(b^h_l + \sum_{j=1}^{p} w^h_{jl} x_{ij}\Big)$$

Optimize parameters $\theta = \{b^h, w^h, b^o, w^o\}$, where $b^h \in \mathbb{R}^m$, $w^h \in \mathbb{R}^{p \times m}$, $b^o \in \mathbb{R}$, $w^o \in \mathbb{R}^m$, with gradient descent.

$$\frac{\partial J}{\partial w^o_l} = \lambda w^o_l + \sum_{i=1}^{n} \frac{\partial J}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial w^o_l} = \lambda w^o_l + \sum_{i=1}^{n} (\hat{y}_i - y_i)\, h_{il},$$

$$\frac{\partial J}{\partial w^h_{jl}} = \lambda w^h_{jl} + \sum_{i=1}^{n} \frac{\partial J}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial h_{il}} \frac{\partial h_{il}}{\partial w^h_{jl}} = \lambda w^h_{jl} + \sum_{i=1}^{n} (\hat{y}_i - y_i)\, w^o_l\, h_{il}(1 - h_{il})\, x_{ij}.$$

L2-regularization is often called weight decay.
Multiple hidden layers: backpropagation algorithm.
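For concreteness, here is a minimal Python/NumPy sketch (not course code) of batch gradient descent using exactly these gradients; the unregularized bias updates and all hyperparameter values are additions for the example.

```python
# Batch gradient descent for the L2-regularized single-hidden-layer network,
# using the gradients above (Python/NumPy sketch, not course code). The
# unregularized bias updates and all hyperparameter values are additions.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train(X, y, m=10, lam=1e-3, lr=0.1, n_iter=5000, seed=0):
    """X: (n, p) inputs, y: (n,) labels in {0, 1}, m hidden units."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    W_h, b_h = rng.normal(scale=0.1, size=(p, m)), np.zeros(m)
    w_o, b_o = rng.normal(scale=0.1, size=m), 0.0
    for _ in range(n_iter):
        H = sigmoid(b_h + X @ W_h)                       # (n, m) activities h_il
        y_hat = sigmoid(b_o + H @ w_o)                   # (n,) probabilities
        delta_o = y_hat - y                              # (yhat_i - y_i)
        grad_w_o = lam * w_o + H.T @ delta_o             # dJ/dw^o_l
        grad_b_o = delta_o.sum()
        delta_h = np.outer(delta_o, w_o) * H * (1 - H)   # (n, m) hidden "errors"
        grad_W_h = lam * W_h + X.T @ delta_h             # dJ/dw^h_jl
        grad_b_h = delta_h.sum(axis=0)
        # Step size scaled by 1/n to keep the updates stable.
        W_h -= lr / n * grad_W_h
        b_h -= lr / n * grad_b_h
        w_o -= lr / n * grad_w_o
        b_o -= lr / n * grad_b_o
    return W_h, b_h, w_o, b_o
```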

9

Page 12:

Neural Networks Backpropagation

Multiple hidden layers

$$y_i = h^{L+1}_i$$

[Diagram: stacked layers of hidden units $h^1_{i1}, \ldots, h^1_{im}$ up to $h^L_{i1}, \ldots, h^L_{im}$, with inputs $x_{i1} = h^0_{i1}, \ldots, x_{ip} = h^0_{ip}$ at the bottom and the output $y_i = h^{L+1}_i$ at the top.]

$$h^{\ell+1}_i = s\big(W^{\ell+1} h^\ell_i\big)$$

$W^{\ell+1} = (w^{\ell+1}_{jk})_{jk}$: weight matrix at the $(\ell+1)$-th layer; the weight $w^\ell_{jk}$ sits on the edge between $h^{\ell-1}_{ik}$ and $h^\ell_{ij}$.

$s$: entrywise (logistic) transfer function

$$y_i = s\Big(W^{L+1} s\big(W^{L} \big(\cdots s(W^{1} x_i)\big)\big)\Big)$$

Many hidden layers can be used: they are usually thought of as forming ahierarchy from low-level to high-level features.
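A small Python/NumPy sketch (illustrative only) of the composition $y_i = s(W^{L+1} s(\cdots s(W^1 x_i)))$ as a loop over layers; bias terms are omitted here, as in the formula.

```python
# Composition of layers y_i = s(W^{L+1} s( ... s(W^1 x_i))) as a loop
# (Python/NumPy sketch, not course code); bias terms are omitted, as above.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def deep_forward(x, weights):
    """x: (p,) input; weights: list [W_1, ..., W_{L+1}], each of shape
    (units_out, units_in). Returns the activities of every layer."""
    activities = [x]                                     # h^0 = x
    for W in weights:
        activities.append(sigmoid(W @ activities[-1]))   # h^{l+1} = s(W^{l+1} h^l)
    return activities                                    # activities[-1] is the output y_i
```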

10

Page 13:

Neural Networks Backpropagation

Backpropagation

$$y_i = h^{L+1}_i$$

[Diagram: unit $h^\ell_{ij}$ receives $h^{\ell-1}_{ik}$ through the weight $w^\ell_{jk}$ and feeds into the units $h^{\ell+1}_{i1}, \ldots, h^{\ell+1}_{im}$ of the next layer.]

$$J = -\sum_{i=1}^{n} \big[\, y_i \log h^{L+1}_i + (1 - y_i) \log(1 - h^{L+1}_i) \,\big]$$

Gradients with respect to $h^\ell_{ij}$ are computed by recursive applications of the chain rule, and propagated through the network backwards.

$$\frac{\partial J}{\partial h^{L+1}_i} = -\frac{y_i}{h^{L+1}_i} + \frac{1 - y_i}{1 - h^{L+1}_i}$$

$$\frac{\partial J}{\partial h^\ell_{ij}} = \sum_{r=1}^{m} \frac{\partial J}{\partial h^{\ell+1}_{ir}} \frac{\partial h^{\ell+1}_{ir}}{\partial h^\ell_{ij}}$$

$$\frac{\partial J}{\partial w^\ell_{jk}} = \sum_{i=1}^{n} \frac{\partial J}{\partial h^\ell_{ij}} \frac{\partial h^\ell_{ij}}{\partial w^\ell_{jk}}$$
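A per-example backpropagation sketch in Python/NumPy (not course code) applying these chain-rule recursions to the deep logistic network of the previous slide; biases are omitted, and gradients over the data set are obtained by summing over examples.

```python
# Per-example backpropagation for the deep logistic network above
# (Python/NumPy sketch, not course code). `weights` is a list
# [W_1, ..., W_{L+1}] of shape (units_out, units_in); biases are omitted.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_single(x, y, weights):
    """Return [dJ/dW_1, ..., dJ/dW_{L+1}] for one example (x, y)."""
    hs = [x]                                    # forward pass: h^0 = x, ..., h^{L+1}
    for W in weights:
        hs.append(sigmoid(W @ hs[-1]))
    y_hat = hs[-1]
    dJ_dh = -y / y_hat + (1.0 - y) / (1.0 - y_hat)   # dJ/dh^{L+1}
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):   # walk backwards through the layers
        h_out, h_in = hs[l + 1], hs[l]
        delta = dJ_dh * h_out * (1.0 - h_out)   # chain rule through s'(a) = h(1 - h)
        grads[l] = np.outer(delta, h_in)        # dJ/dW_{l+1}
        dJ_dh = weights[l].T @ delta            # propagate dJ/dh^l backwards
    return grads
```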

11

Page 14:

Neural Networks Backpropagation

Neural Networks

[Figures: data in the unit square with features $x_1$ and $x_2$. Left: global solution and local minima, showing decision boundaries for the solution (global minimum) and for local minima 1, 2 and 3. Right: neural network fit with a weight decay of 0.01.]

R package implementing neural networks with a single hidden layer: nnet.

12

Page 15:

Neural Networks Dropout

Dropout Training of Neural Networks

Neural network with single layer of hidden units:

Hidden unit activations:
$$h_{ik} = s\Big(b^h_k + \sum_{j=1}^{p} W^h_{jk} x_{ij}\Big)$$

Output probability:
$$\hat{y}_i = s\Big(b^o + \sum_{k=1}^{m} W^o_k h_{ik}\Big)$$

- Large, overfitted networks often have co-adapted hidden units.
- What each hidden unit learns may in fact be useless, e.g. predicting the negation of predictions from other units.
- Can prevent co-adaptation by randomly dropping out units from the network (a sketch follows the diagram below).

[Diagram: full network with inputs $x_{i1}, \ldots, x_{i4}$, hidden units $h_{i1}, \ldots, h_{i7}$, and output $\hat{y}_i$.]
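A minimal Python/NumPy sketch (not course code) of a dropout forward pass for this single-hidden-layer network: at training time each hidden unit is retained independently with probability `keep_prob`; at test time all units are kept and the hidden contributions are scaled by `keep_prob`. Names and the default rate are illustrative.

```python
# Dropout forward pass for the single-hidden-layer network above
# (Python/NumPy sketch, not course code). During training each hidden unit is
# retained with probability `keep_prob`; at test time all units are kept and
# the hidden contributions are scaled by `keep_prob`. Names are illustrative.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_dropout(x, W_h, b_h, w_o, b_o, keep_prob=0.5, train=True, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    h = sigmoid(b_h + x @ W_h)                   # hidden unit activations
    if train:
        mask = rng.random(h.shape) < keep_prob   # sample the retained units
        return sigmoid(b_o + (h * mask) @ w_o)   # dropped units contribute nothing
    return sigmoid(b_o + keep_prob * (h @ w_o))  # test time: scaled full network
```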

Hinton et al. (2012).

13

Page 16:

Neural Networks Dropout

Dropout Training of Neural Networks

Model as an ensemble of networks (more on ensembles later):

[Diagrams: four subnetworks sampled from the full network, each with inputs $x_{i1}, \ldots, x_{i4}$ and output $\hat{y}_i$, but retaining different subsets of hidden units, e.g. $\{h_{i1}, h_{i2}, h_{i3}, h_{i4}\}$, $\{h_{i1}, h_{i3}, h_{i5}, h_{i7}\}$, $\{h_{i1}, h_{i4}, h_{i6}\}$, $\{h_{i3}, h_{i4}, h_{i5}\}$.]

- Weight-sharing among all networks: each network uses a subset of the parameters of the full network (corresponding to the retained units).
- Training by stochastic gradient descent: at each iteration a network is sampled from the ensemble, and its subset of parameters is updated.
- Biological inspiration: $10^{14}$ weights to be fitted in a lifetime of $10^9$ seconds.

Poisson spikes as a regularization mechanism which prevents co-adaptation: Geoff Hinton on Brains, Sex and Machine Learning.

14

Page 17:

Neural Networks Dropout

Dropout Training of Neural Networks

Classification of phonemes in speech.

Figure from Hinton et al.

15

Page 18:

Neural Networks Variations

Neural Networks – Variations

Other loss functions can be used, e.g. for regression:
$$\sum_{i=1}^{n} |y_i - \hat{y}_i|^2$$

For multiclass classification, use softmax outputs:
$$\hat{y}_{ik} = \frac{\exp\big(b^o_k + \sum_{\ell} w^o_{\ell k} h_{i\ell}\big)}{\sum_{k'=1}^{K} \exp\big(b^o_{k'} + \sum_{\ell} w^o_{\ell k'} h_{i\ell}\big)}, \qquad L(y_i, \hat{y}_i) = -\sum_{k=1}^{K} \mathbb{1}(y_i = k)\, \log \hat{y}_{ik}$$

Other activation functions can be used (see the sketch below):
- rectified linear unit (ReLU): $s(z) = \max(0, z)$
- softplus: $s(z) = \log(1 + \exp(z))$
- tanh: $s(z) = \tanh(z)$
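A small Python/NumPy sketch (not course code) of the softmax output, the multiclass cross-entropy loss, and the three alternative activation functions listed above:

```python
# Softmax output, multiclass cross-entropy loss, and the alternative
# activation functions listed above (Python/NumPy sketch, not course code).
import numpy as np

def softmax(scores):
    """scores: (K,) output activations b^o_k + sum_l w^o_lk h_l."""
    scores = scores - scores.max()        # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum()

def cross_entropy(y, y_hat):
    """y: true class index in {0, ..., K-1}; y_hat: (K,) softmax probabilities."""
    return -np.log(y_hat[y])

def relu(z):
    return np.maximum(0.0, z)

def softplus(z):
    return np.log1p(np.exp(z))            # log(1 + exp(z))

def tanh(z):
    return np.tanh(z)
```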

16

Page 19:

Neural Networks Variations

Deep learning intuition

Source: http://rinuboney.github.io/2015/10/18/theoretical-motivations-deep-learning.html

17

Page 20:

Neural Networks Variations

Deep learning demo

http://playground.tensorflow.org/

18

Page 21:

Neural Networks Variations

Convolutions

Click for animation. Source: https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

19

Page 22:

Neural Networks Variations

Deep Convolutional Neural Networks

Input is a 2D image, $X \in \mathbb{R}^{p \times q}$.

Convolution: detects simple object parts or features. The weights $W$ now correspond to a filter to be learned, typically much smaller than the input, thus encouraging sparse connectivity.

Pooling and sub-sampling: replace the output with a summary statistic of the nearby outputs, e.g. max-pooling (allows invariance to small translations in the input). A small sketch of both operations follows.
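To make the two operations concrete, here is a minimal Python/NumPy sketch (not course code) of a single-filter "valid" 2D convolution and non-overlapping max-pooling; real convolutional networks use many filters, channels, and optimized implementations.

```python
# Single-filter "valid" 2D convolution and non-overlapping max-pooling
# (Python/NumPy sketch, not course code).
import numpy as np

def conv2d(X, W):
    """X: (p, q) image; W: (a, b) filter; returns a (p-a+1, q-b+1) feature map."""
    p, q = X.shape
    a, b = W.shape
    out = np.empty((p - a + 1, q - b + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + a, j:j + b] * W)   # filter applied to a patch
    return out

def max_pool(F, size=2):
    """Replace each non-overlapping size x size block of F with its maximum."""
    p, q = F.shape
    F = F[:p - p % size, :q - q % size]                   # crop to a multiple of size
    return F.reshape(p // size, size, q // size, size).max(axis=(1, 3))
```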

LeCun et al., Krizhevsky et al.

20

Page 23:

Neural Networks Discussion

Neural Networks – Discussion

- Nonlinear hidden units introduce modelling flexibility, hierarchical representations.
- In contrast to user-introduced nonlinearities, features are global, and can be learned to maximize predictive performance.
- Neural networks with a single hidden layer and sufficiently many hidden units can model arbitrarily complex functions.
- Highly flexible framework, with many variations to solve different learning problems and introduce domain knowledge.
- Optimization problem is not convex, and the objective function can have many local optima, plateaus and ridges.
- On large-scale problems, often use stochastic gradient descent, along with a whole host of techniques for optimization, regularization, and initialization.
- Explosion of interest in the field recently and many new developments not covered here, especially by Geoffrey Hinton, Yann LeCun, Yoshua Bengio, Andrew Ng and others. See also http://deeplearning.net/.

21