
Introduction to machine learning
Recitation 2
MIT - 6.802 / 6.874 / 20.390 / 20.490 / HST.506 - Spring 2019
Sachit Saksena

Recap of recitation 1

Google Colaboratory setup

Introduction to Tensorflow

Setting up Google Cloud with Datalab

Everyone set up for PS1?

Local conda and jupyter notebook setup

Basics of machine learning: steps

I. Get data
II. Identify the space of possible solutions
III. Formulate an objective
IV. Choose algorithm
V. Train (loss)
VI. Validate results (metrics)
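As a hedged, concrete illustration of these steps (the dataset, model, and split below are placeholders, not from the slides), a minimal scikit-learn workflow might look like:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)                                 # I.  get data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000)                                   # II./IV. solution space + algorithm
model.fit(X_train, y_train)                                                 # III./V. objective (log loss) + train
print(accuracy_score(y_test, model.predict(X_test)))                        # VI. validate with a metric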

Basics of machine learning: tasks

Basics of machine learning: loss

Task → Loss
Regression (penalize large errors) → squared error (MSE)
Regression (penalize error linearly) → absolute error (MAE)
Classification (binary) → binary cross-entropy
Classification (multi-class) → categorical cross-entropy
Generative → model-dependent (e.g., negative log-likelihood)
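As a sketch of what these loss names usually denote (the formulas below are the standard definitions, not copied from the slide), in NumPy:

import numpy as np

def mse(y_true, y_pred):                      # penalizes large errors quadratically
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):                      # penalizes errors linearly
    return np.mean(np.abs(y_true - y_pred))

def binary_cross_entropy(y_true, p):          # p = predicted probability of class 1
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(Y_onehot, P):   # rows: one-hot labels and predicted class probabilities
    return -np.mean(np.sum(Y_onehot * np.log(P), axis=1))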

Basics of machine learning: metrics

Basics of machine learning: ROC

https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
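A minimal sketch of computing an ROC curve and its AUC with scikit-learn; the labels and scores below are made-up placeholders:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                     # true binary labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5])   # predicted scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)               # points on the ROC curve
print(roc_auc_score(y_true, y_score))                           # area under the curve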

Basics of machine learning: data

Train—test split (1-fold)

Cross-validation (6-fold)

https://towardsdatascience.com/cross-validation-code-visualization-kind-of-fun-b9741baea1f8
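A sketch of both splitting strategies with scikit-learn, assuming a toy dataset (X and y are placeholders):

import numpy as np
from sklearn.model_selection import train_test_split, KFold

X = np.arange(24).reshape(12, 2)          # placeholder features
y = np.arange(12) % 2                     # placeholder labels

# train-test split (a single hold-out set)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 6-fold cross-validation: every example is held out exactly once
for train_idx, test_idx in KFold(n_splits=6, shuffle=True, random_state=0).split(X):
    pass                                  # fit on X[train_idx], evaluate on X[test_idx]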

Basics of machine learning: non-parametric models

from collections import Counter

# D = {(x_1, y_1), ..., (x_n, y_n)}, split into a training and a test set
def knn(k, dist, X_train, y_train, X_test):
    predictions = []
    for x in X_test:
        # distance from the test point to every training point
        d = [dist(x, x_j) for x_j in X_train]
        # indices of the k nearest training points
        nearest = sorted(range(len(d)), key=lambda j: d[j])[:k]
        # majority vote over their labels
        votes = Counter(y_train[j] for j in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions
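A usage sketch of the knn function above, assuming Euclidean distance and a toy training set:

import numpy as np

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 1, 1])
euclidean = lambda a, b: np.linalg.norm(a - b)
print(knn(3, euclidean, X_train, y_train, np.array([[4.5, 4.5]])))   # -> [1]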

Basics of machine learning: semi-parametric models

The idea here is that we would like to find a partition of the input space and then fit very simple models to predict the output in each piece. The partition is described using a (typically binary) "decision tree," which recursively splits the space.
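A minimal sketch using scikit-learn's CART-style decision tree (the XOR-like toy data is a placeholder):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])                        # XOR-like labels need two recursive splits
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree.predict([[0.9, 0.1]]))                 # -> [1]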

Basics of machine learning: implementations

k-Nearest Neighbors (kNN)

Classification and regression decision trees (CART)

Neural networks: perceptrons to neurons

Basics of machine learning: feature representations

Polynomial basis
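For example, a degree-2 polynomial basis expansion of a two-feature input, sketched with scikit-learn's PolynomialFeatures:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])
phi = PolynomialFeatures(degree=2).fit_transform(X)
print(phi)   # [[1. 2. 3. 4. 6. 9.]] : 1, x1, x2, x1^2, x1*x2, x2^2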

Neural networks: perceptrons to neurons

Neural networks: activation functions

Step function

Sigmoid

Rectified linear unit

Hyperbolic tangent

Softmax
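A sketch of these activations as NumPy one-liners (standard definitions, assuming a 1-D score vector for softmax):

import numpy as np

step = lambda z: (z > 0).astype(float)                                       # step function
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))                                 # maps to (0, 1)
relu = lambda z: np.maximum(0.0, z)                                          # rectified linear unit
tanh = np.tanh                                                               # hyperbolic tangent, maps to (-1, 1)
softmax = lambda z: np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))    # scores -> probability vector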

Neural networks: activation functions

Task → Loss → Output activation
Regression (penalize large errors) → squared error → Linear (ReLU, Leaky ReLU, etc.)
Regression (penalize error linearly) → absolute error → Linear (ReLU, Leaky ReLU, etc.)
Classification (binary) → binary cross-entropy → Sigmoid, tanh
Classification (multi-class) → categorical cross-entropy → Softmax
Generative → model-dependent → Linear (ReLU, Leaky ReLU, etc.)

Other considerations: gradient magnitude, computational cost of the activation, exploding/vanishing gradients, and network depth (stacking purely linear layers adds no expressive power).

Neural networks: single layer feed-forward NN

Optimization: batch gradient update

def gradient_update(W_init, eta, J, grad_J, eps):
    # Batch gradient descent on the full objective J; grad_J(W) computes the
    # gradient of J with respect to the parameter vector W.
    W = W_init
    prev = float("inf")
    t = 0
    while prev - J(W) > eps:           # stop once the improvement falls below eps
        prev = J(W)
        t = t + 1
        W = W - eta * grad_J(W)        # step of size eta against the gradient
    return W

Gradient of objective J with respect to parameter vector W

Pseudocode for gradient update algorithm

Batch gradient update

This stopping criterion is an arbitrary choice
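A toy usage sketch of the gradient_update routine above, assuming a quadratic objective with its minimum at w = 3:

import numpy as np

J = lambda w: float(np.sum((w - 3.0) ** 2))                    # objective
grad_J = lambda w: 2.0 * (w - 3.0)                             # its gradient with respect to w
print(gradient_update(np.zeros(2), 0.1, J, grad_J, 1e-10))     # converges near [3. 3.]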

Neural networks: multi-layer feed-forward NN

[Figure: a feed-forward network. The input X is A0; each layer l = 1, ..., L computes the pre-activation Zl = Wl Al-1 + bl and the activation Al = fl(Zl); the loss compares the final output AL with the target y.]

Inspired by 6.036 lecture notes (Leslie Kaelbling)
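A sketch of the forward pass in NumPy, assuming the convention Zl = Wl Al-1 + bl with A0 = X (layer sizes and random weights below are illustrative):

import numpy as np

def forward(X, weights, biases, activations):
    # A0 = X; layer l computes Zl = Wl @ Al-1 + bl and Al = fl(Zl)
    A = X
    cache = [A]                           # keep every Al for backpropagation
    for W, b, f in zip(weights, biases, activations):
        Z = W @ A + b                     # pre-activation
        A = f(Z)                          # activation
        cache.append(A)
    return A, cache                       # AL is fed to the loss against y

# e.g. 3 inputs -> 4 hidden units (ReLU) -> 1 output (sigmoid)
rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
weights = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
biases = [np.zeros((4, 1)), np.zeros((1, 1))]
y_hat, cache = forward(rng.normal(size=(3, 1)), weights, biases, [relu, sigmoid])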

Neural networks: error backpropagation

[Figure: the same feed-forward network, annotated for error backpropagation: the error signal flows backward from the loss through AL, ZL, ..., A1, Z1 to update WL, ..., W1.]

Inspired by 6.036 lecture notes (Leslie Kaelbling)

Let's work out error backpropagation on the board!

Neural networks: error backpropagation

So let’s use the following shorthand from the previous figure,

First, let’s break down how the loss depends on the final layer,

Since,

We can re-write the equation as,

Now, to propagate through the whole network, we can keep applying the chain rule until the first layer of the network,

If you spend a few minutes looking at matrix dimensions, it becomes clear that this is an informal derivation. Here are the dimensions to think about:

Since we have the outputs of every layer, all we need to compute for the gradient of the last layer with respect to the weights is the gradient of the loss with respect to the pre-activation output.

The equation with the correct dimensions for matrix multiplication,

Inspired by 6.036 lecture notes (Leslie Kaebling)
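A sketch of the chain of equations, assuming the shorthand Z^l = W^l A^{l-1} + b^l and A^l = f^l(Z^l) with A^0 = X (the matrix layout here is one common convention, not necessarily the slide's):

\[
\begin{aligned}
&Z^{l} = W^{l} A^{l-1} + b^{l}, \qquad A^{l} = f^{l}(Z^{l}), \qquad A^{0} = X, \qquad \mathcal{L} = \mathrm{loss}(A^{L}, y) \\
&\frac{\partial \mathcal{L}}{\partial W^{L}}
  = \frac{\partial Z^{L}}{\partial W^{L}}\,
    \frac{\partial A^{L}}{\partial Z^{L}}\,
    \frac{\partial \mathcal{L}}{\partial A^{L}},
  \qquad \text{and since } \frac{\partial Z^{L}}{\partial W^{L}} = A^{L-1}, \qquad
  \frac{\partial \mathcal{L}}{\partial Z^{L}} = f^{L\prime}(Z^{L}) \odot \frac{\partial \mathcal{L}}{\partial A^{L}} \\
&\text{propagating back to any layer } l: \qquad
  \frac{\partial \mathcal{L}}{\partial Z^{l}} = f^{l\prime}(Z^{l}) \odot \Big( (W^{l+1})^{\top} \frac{\partial \mathcal{L}}{\partial Z^{l+1}} \Big) \\
&\text{with dimensions that line up for matrix multiplication } \big(W^{l} \in \mathbb{R}^{m_l \times m_{l-1}},\ Z^{l}, A^{l} \in \mathbb{R}^{m_l \times 1}\big): \qquad
  \frac{\partial \mathcal{L}}{\partial W^{l}} = \frac{\partial \mathcal{L}}{\partial Z^{l}} \big(A^{l-1}\big)^{\top}
\end{aligned}
\]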

Neural networks: automatic differentiation

https://www.robots.ox.ac.uk/~tvg/publications/talks/autodiff.pdf

Dual numbers

Augment a standard Taylor series (as in numerical differentiation) with a "dual number" ε.

Because "dual numbers" have the (manufactured) property ε² = 0 (with ε ≠ 0),

the Taylor series f(x + ε) = f(x) + f′(x)ε + ½f″(x)ε² + ... simplifies to f(x + ε) = f(x) + f′(x)ε,

which recovers the function output as well as the first derivative.

End result: forward automatic differentiation evaluates f(x) and f′(x) together in a single pass.
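A sketch of forward-mode automatic differentiation with a minimal dual-number class (the class name and the example function are illustrative):

class Dual:
    # represents a + b*eps with eps**2 = 0; b carries the derivative
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a1 + b1*eps)(a2 + b2*eps) = a1*a2 + (a1*b2 + b1*a2)*eps, since eps**2 = 0
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)
    __rmul__ = __mul__

f = lambda x: 3 * x * x + 2 * x            # f(x) = 3x^2 + 2x, so f'(x) = 6x + 2
out = f(Dual(4.0, 1.0))                    # evaluate at x = 4 with derivative seed 1
print(out.a, out.b)                        # 56.0 26.0: f(4) and f'(4)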

Neural networks: regularization

Objective function

Objective function with ridge regularization (penalize large weights)

Dropout

Popular new approach: Batch normalization
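A sketch of these three regularizers in tf.keras, with placeholder layer sizes and rates:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-3)),  # ridge penalty on the weights
    tf.keras.layers.Dropout(0.5),                  # randomly zero 50% of activations during training
    tf.keras.layers.BatchNormalization(),          # normalize activations per mini-batch
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")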

Optimization: overview

Optimization: stochastic gradient descent

Stochastic gradient update (per randomly sampled training example):

Pseudocode for stochastic gradient update algorithm

import random

def sgd(W_init, eta, grad_J_i, n, T):
    # eta(t): step-size schedule; grad_J_i(W, i): gradient of the loss on
    # the single training example i.
    W = W_init
    for t in range(1, T + 1):
        i = random.randrange(n)                 # randomly select i in {1, ..., n}
        W = W - eta(t) * grad_J_i(W, i)
    return W

Mini-batch gradient descent
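A sketch of the mini-batch variant, which averages the per-example gradient over a small random batch instead of using a single example (batch size and helper names are illustrative; assumes batch_size <= n):

import numpy as np

def minibatch_gd(W_init, eta, grad_J_i, n, T, batch_size=32):
    # grad_J_i(W, i) is the gradient of the loss on training example i
    W = W_init
    for t in range(1, T + 1):
        batch = np.random.choice(n, size=batch_size, replace=False)
        g = np.mean([grad_J_i(W, i) for i in batch], axis=0)   # average gradient over the batch
        W = W - eta * g
    return W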

Optimization: momentum

Goh, "Why Momentum Really Works", Distill, 2017. http://doi.org/10.23915/distill.00006

Nesterov momentum? How can we make momentum better?
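A sketch of one common formulation of the momentum update, where the velocity v accumulates a decaying sum of past gradients (eta and beta values are illustrative defaults):

def momentum_update(W, v, grad, eta=0.01, beta=0.9):
    # v_t = beta * v_{t-1} + grad;  W_t = W_{t-1} - eta * v_t
    v = beta * v + grad
    W = W - eta * v
    return W, v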

Optimization: adam

https://arxiv.org/pdf/1412.6980.pdf
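A sketch of the Adam update described in the paper linked above, with its usual default hyperparameters:

import numpy as np

def adam_update(W, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction for the warm-up steps
    v_hat = v / (1 - beta2 ** t)
    W = W - eta * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v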

Next week

Neural network interpretability

Convolutional neural networks
Recurrent neural networks

Batch normalization
