Introduction to machine learning Recitation 2 MIT - 6.802 / 6.874 / 20.390 / 20.490 / HST.506 - Spring 2019 Sachit Saksena


Page 1:

Introduction to machine learning. Recitation 2. MIT 6.802 / 6.874 / 20.390 / 20.490 / HST.506, Spring 2019. Sachit Saksena

Page 2:

Recap of recitation 1

Google Colaboratory setup

Introduction to TensorFlow

Setting up Google Cloud with Datalab

Everyone set up for PS1?

Local conda and jupyter notebook setup

Page 3:

Basics of machine learning: steps

I. Get data
II. Identify the space of possible solutions
III. Formulate an objective
IV. Choose algorithm
V. Train (loss)
VI. Validate results (metrics)

Page 4:

Basics of machine learning: tasks

Page 5:

Basics of machine learning: loss

Task | Loss
Regression (penalize large errors) | squared error (MSE)
Regression (penalize error linearly) | absolute error (MAE)
Classification (binary) | binary cross-entropy
Classification (multi-class) | categorical cross-entropy
Generative | model-dependent (e.g., negative log-likelihood)
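For reference, the standard forms of these losses, assuming the usual definitions (y_i the target, ŷ_i the prediction, K the number of classes); these are reference formulas, not the exact expressions rendered on the slide:

% squared error, absolute error, and cross-entropy losses
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\qquad
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert

\text{BCE} = -\frac{1}{n}\sum_{i=1}^{n} \bigl[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\bigr]
\qquad
\text{CCE} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik}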

Page 6:

Basics of machine learning: metrics

Page 7:

Basics of machine learning: metrics

Page 8:

Basics of machine learning: ROC

https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
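A minimal scikit-learn sketch of computing an ROC curve and its AUC from labels and scores (the arrays here are made-up examples, not data from the recitation):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 0, 1])                # hypothetical binary labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])   # hypothetical classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points tracing the ROC curve
auc = roc_auc_score(y_true, y_score)               # area under that curve
print(auc)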

Page 9:

Basics of machine learning: data

Train—test split (1-fold)

Cross-validation (6-fold)

https://towardsdatascience.com/cross-validation-code-visualization-kind-of-fun-b9741baea1f8
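A minimal scikit-learn sketch of both ideas, a single hold-out split and 6-fold cross-validation (the random data and parameters are illustrative assumptions):

import numpy as np
from sklearn.model_selection import train_test_split, KFold

X = np.random.rand(100, 5)               # hypothetical features
y = np.random.randint(0, 2, size=100)    # hypothetical labels

# train-test split: hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 6-fold cross-validation: every sample lands in the test fold exactly once
kf = KFold(n_splits=6, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]
    # fit a model on (X_tr, y_tr) and evaluate on (X_te, y_te) here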


Page 12:

Basics of machine learning: non-parametric models

from collections import Counter

# D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}
def knn(k, dist, D_train, D_test):
    X_train, y_train = D_train                  # training points and their labels
    votes = []
    for x_i in D_test:
        # distance from this test point to every training point
        d = [dist(x_i, x_j) for x_j in X_train]
        # indices of the k nearest training points
        nearest = sorted(range(len(d)), key=d.__getitem__)[:k]
        # predict the most common label among those k neighbors
        votes.append(Counter(y_train[j] for j in nearest).most_common(1)[0][0])
    return votes

Page 13:

Basics of machine learning: semi-parametric models

The idea here is that we would like to find a partition of the input space and then fit very simple models to predict the output in each piece. The partition is described using a (typically binary) "decision tree," which recursively splits the space.
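For intuition, a minimal sketch of such a partition (the features, thresholds, and leaf values are made up for illustration):

def tiny_tree_predict(x):
    # each nested if/else is one binary split of the input space;
    # each leaf returns a very simple model, here just a constant
    if x[0] <= 2.5:
        return 0.1 if x[1] <= 1.0 else 0.7
    else:
        return 0.4 if x[1] <= 3.0 else 0.9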

Page 14:

Basics of machine learning: implementations

k-Nearest Neighbors (kNN)

Classification and regression decision trees (CART)
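A minimal scikit-learn sketch of both models, assuming the standard KNeighborsClassifier and DecisionTreeClassifier estimators (not necessarily the exact code used in recitation):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)   # toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)    # kNN
cart = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)   # CART

print(knn.score(X_test, y_test), cart.score(X_test, y_test))       # test accuracy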

Page 15:

Neural networks: perceptrons to neurons

Page 16:

Basics of machine learning: feature representations

Polynomial basis
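A minimal sketch of a polynomial basis for a scalar input x, expanded up to an assumed degree k:

import numpy as np

def polynomial_basis(x, k):
    # map a scalar x to the feature vector [1, x, x^2, ..., x^k]
    return np.array([x**i for i in range(k + 1)])

# scikit-learn's PolynomialFeatures does the same for multi-dimensional inputs:
# from sklearn.preprocessing import PolynomialFeatures
# PolynomialFeatures(degree=k).fit_transform(X)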

Page 17:

Neural networks: perceptrons to neurons

Page 18:

Neural networks: activation functions

Step function

Sigmoid

Rectified linear unit

Hyperbolic tangent

Softmax
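A minimal NumPy sketch of these activations, using their standard definitions (assumed, not copied from the slide):

import numpy as np

def step(z):    return np.where(z > 0, 1.0, 0.0)   # step function
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))    # squashes to (0, 1)
def relu(z):    return np.maximum(0.0, z)          # rectified linear unit
def tanh(z):    return np.tanh(z)                  # squashes to (-1, 1)

def softmax(z):
    # subtract the max for numerical stability; the outputs sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()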

Page 19:

Neural networks: activation functions

Task | Loss | Activation
Regression (penalize large errors) | squared error (MSE) | Linear (ReLU, Leaky ReLU, etc.)
Regression (penalize error linearly) | absolute error (MAE) | Linear (ReLU, Leaky ReLU, etc.)
Classification (binary) | binary cross-entropy | Sigmoid, tanh
Classification (multi-class) | categorical cross-entropy | Softmax
Generative | model-dependent (e.g., negative log-likelihood) | Linear (ReLU, Leaky ReLU, etc.)

Other considerations: gradient magnitude, the computational cost of evaluating the activation, exploding/vanishing gradients, and network depth (a purely linear activation is useless here, since stacked linear layers collapse into a single linear map).

Page 20:

Neural networks: single layer feed-forward NN

Page 21:

Optimization: batch gradient update

def gradient_update(W_init, eta, J, grad_J, eps):
    # full-batch gradient descent; grad_J(W) returns the gradient of J with respect to W
    W_prev = W_init
    W = W_prev - eta * grad_J(W_prev)
    # stop once the objective changes by less than eps between steps
    while abs(J(W_prev) - J(W)) > eps:
        W_prev, W = W, W - eta * grad_J(W)
    return W

Gradient of objective J with respect to parameter vector W

Pseudocode for gradient update algorithm

Batch gradient update

This is an arbitrary stopping criterion.
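In symbols (a reconstruction in the notation above, not the slide's own rendering), the gradient and the batch update are:

% gradient of the objective J with respect to the parameter vector W
\nabla_W J = \Bigl[\tfrac{\partial J}{\partial W_1}, \ldots, \tfrac{\partial J}{\partial W_m}\Bigr]^{\top}
% batch gradient update with learning rate \eta
W_t = W_{t-1} - \eta\, \nabla_W J(W_{t-1})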

Page 22:

Neural networks: multi-layer feed-forward NN

[Figure: a multi-layer feed-forward network. The input X enters layer 1; each layer l has weights Wl and bias bl, computes a pre-activation Zl from the previous layer's output A(l-1) (with A0 = X), and applies fl to get the activation Al. The final activation AL is compared against the target y by the loss.]

Inspired by 6.036 lecture notes (Leslie Kaelbling)
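A minimal NumPy sketch of this forward pass, assuming the affine-then-activation structure shown in the figure:

import numpy as np

def forward(X, weights, biases, activations):
    # weights[l], biases[l], activations[l] describe layer l+1;
    # A holds the running activation, starting from the input X (A_0)
    A = X
    for W, b, f in zip(weights, biases, activations):
        Z = W @ A + b    # pre-activation Z_l
        A = f(Z)         # activation A_l, fed to the next layer
    return A             # A_L, compared against y by the loss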

Page 23:

Neural networks: error backpropagation

[Figure: the same multi-layer feed-forward network, with the loss at the output; backpropagation carries gradients from the loss back through AL, ZL, ..., down to the first layer's weights.]

Inspired by 6.036 lecture notes (Leslie Kaelbling)

Let's work out error backprop on the board!

Page 24:

Neural networks: error backpropagation

Let's use the following shorthand from the previous figure: Zl for the pre-activation of layer l, Al = fl(Zl) for its activation, and A0 = X.

First, let's break down how the loss depends on the final layer. Since ZL is an affine function of WL (with input AL-1), we can re-write the gradient of the loss with respect to WL in terms of AL-1 and the gradient of the loss with respect to ZL.

Now, to propagate through the whole network, we can keep applying the chain rule until the first layer of the network.

If you spend a few minutes looking at matrix dimensions, it becomes clear that this is an informal derivation. Here are the dimensions to think about: since we already have the outputs of every layer from the forward pass, all we need to compute for the gradient of the last layer with respect to its weights is the gradient of the loss with respect to the pre-activation output. Writing the equation with the correct dimensions for matrix multiplication gives the form in the sketch below.

Inspired by 6.036 lecture notes (Leslie Kaelbling)
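A sketch of that derivation, assuming the 6.036-style shorthand Z_l = W_l^T A_{l-1} + b_l and A_l = f_l(Z_l) with A_0 = X (the exact layout on the slide may differ):

% gradient of the loss with respect to the last layer's weights
\frac{\partial \text{loss}}{\partial W_L}
  = \frac{\partial Z_L}{\partial W_L}\,
    \frac{\partial A_L}{\partial Z_L}\,
    \frac{\partial \text{loss}}{\partial A_L}
  = A_{L-1} \Bigl(\frac{\partial \text{loss}}{\partial Z_L}\Bigr)^{\top}
% dimensions: (m_{L-1} x 1) times (1 x m_L) = m_{L-1} x m_L

% propagating back one layer at a time (pointwise activations), repeated down to layer 1
\frac{\partial \text{loss}}{\partial Z_l} = f_l'(Z_l) \odot \frac{\partial \text{loss}}{\partial A_l},
\qquad
\frac{\partial \text{loss}}{\partial A_{l-1}} = W_l\, \frac{\partial \text{loss}}{\partial Z_l}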

Page 25:

Neural networks: automatic differentiation

https://www.robots.ox.ac.uk/~tvg/publications/talks/autodiff.pdf

Forward automatic differentiation

Dual numbers: augment a standard Taylor series (as used in numerical differentiation) with a "dual number" component. Because dual numbers have the (manufactured) property that the dual unit ε squares to zero, the Taylor series simplifies, which recovers the function output as well as the first derivative.

End result: a single forward evaluation yields both the function value and its derivative.
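A sketch of the algebra, using the standard dual-number identities (assumed rather than copied from the slide):

% evaluate the Taylor expansion of f at x on the dual number x + \dot{x}\varepsilon, where \varepsilon^2 = 0
f(x + \dot{x}\varepsilon)
  = f(x) + f'(x)\,\dot{x}\,\varepsilon + \tfrac{1}{2!} f''(x)\,\dot{x}^2 \varepsilon^2 + \cdots
  = f(x) + f'(x)\,\dot{x}\,\varepsilon
% the real part carries the function value; the \varepsilon-coefficient carries the derivative (times the seed \dot{x})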

Page 26:

Neural networks: regularization

Objective function

Objective function with ridge regularization (penalizes large weights)

Dropout

Popular new approach: Batch normalization
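In symbols, a ridge-regularized objective has the standard form below (λ, the regularization strength, is a hyperparameter):

% objective with an added L2 (ridge) penalty on the weights
J_{\text{ridge}}(W) = J(W) + \lambda \lVert W \rVert_2^2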

Page 27:

Optimization: overview

Page 28:

Optimization: stochastic gradient descent

Stochastic gradient update (per randomly sampled training example):

Pseudocode for stochastic gradient update algorithm

import random

def sgd(W_init, eta, grad_Ji, n, T):
    # stochastic gradient descent: one randomly sampled training example per step;
    # eta(t) is a learning-rate schedule, grad_Ji(W, i) the gradient on example i
    W = W_init
    for t in range(1, T + 1):
        i = random.randrange(n)          # pick a training example uniformly at random
        W = W - eta(t) * grad_Ji(W, i)
    return W

Mini-batch gradient descent
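Mini-batch gradient descent replaces the single sampled example with a small batch whose gradients are averaged; a minimal sketch along the same lines (the batch size and the hypothetical grad_Ji helper are assumptions, not from the slide):

import random

def minibatch_gd(W_init, eta, grad_Ji, n, T, batch_size=32):
    W = W_init
    for t in range(1, T + 1):
        batch = random.sample(range(n), batch_size)          # sample a mini-batch of indices
        g = sum(grad_Ji(W, i) for i in batch) / batch_size   # average the per-example gradients
        W = W - eta(t) * g
    return W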

Page 29:

Optimization: momentum

Goh, "Why Momentum Really Works", Distill, 2017. http://doi.org/10.23915/distill.00006

Nesterov momentum? How can we make momentum better?
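The classical momentum update, and the Nesterov variant that evaluates the gradient at a look-ahead point (standard forms, assumed rather than taken from the slide):

% classical momentum: accumulate a velocity v, then step
v_t = \beta v_{t-1} - \eta\, \nabla_W J(W_{t-1}), \qquad W_t = W_{t-1} + v_t
% Nesterov momentum: take the gradient at the look-ahead point W_{t-1} + \beta v_{t-1}
v_t = \beta v_{t-1} - \eta\, \nabla_W J(W_{t-1} + \beta v_{t-1}), \qquad W_t = W_{t-1} + v_t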

Page 30:

Optimization: Adam

https://arxiv.org/pdf/1412.6980.pdf
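The Adam update from the paper linked above, with gradient g_t, moment estimates m_t and v_t, decay rates β1 and β2, and a small ε for numerical stability:

% first- and second-moment estimates
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
% bias-corrected estimates and the parameter update
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad
W_t = W_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}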

Page 31:

Next week

Neural network interpretability

Convolutional neural networks

Recurrent neural networks

Batch normalization