Introduction to machine learning
Recitation 2
MIT 6.802 / 6.874 / 20.390 / 20.490 / HST.506, Spring 2019
Sachit Saksena
Recap of recitation 1
Google Colaboratory setup
Introduction to TensorFlow
Setting up Google Cloud with Datalab
Everyone set up for PS1?
Local conda and Jupyter notebook setup
Basics of machine learning: steps
I. Get data
II. Identify the space of possible solutions
III. Formulate an objective
IV. Choose algorithm
V. Train (loss)
VI. Validate results (metrics)
Basics of machine learning: tasks
Basics of machine learning: loss
Task | Loss
Regression (penalize large errors) | Squared error (L2)
Regression (penalize error linearly) | Absolute error (L1)
Classification (binary) | Binary cross-entropy
Classification (multi-class) | Categorical cross-entropy
Generative | Model-dependent (e.g. negative log-likelihood)
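A minimal sketch of these losses in NumPy (function names and the eps clipping are illustrative, not part of the slides):

import numpy as np

def mse(y_true, y_pred):
    """Squared-error loss for regression (penalizes large errors)."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Absolute-error loss for regression (penalizes error linearly)."""
    return np.mean(np.abs(y_true - y_pred))

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Binary classification loss; p is the predicted probability of class 1."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    """Multi-class classification loss; rows of p are predicted class probabilities."""
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))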
Basics of machine learning: metrics
Basics of machine learning: ROC
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
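A short sketch of computing an ROC curve and its AUC, assuming scikit-learn is available; y_true and y_score are hypothetical binary labels and predicted scores:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0])            # hypothetical labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2])  # hypothetical predicted scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points on the ROC curve
auc = roc_auc_score(y_true, y_score)               # area under the ROC curve
print(fpr, tpr, auc)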
Basics of machine learning: data
Train-test split (1-fold)
Cross-validation (6-fold)
https://towardsdatascience.com/cross-validation-code-visualization-kind-of-fun-b9741baea1f8
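A sketch of both data-splitting strategies, assuming scikit-learn; X and y are hypothetical arrays and the model-fitting step is left as a comment:

import numpy as np
from sklearn.model_selection import train_test_split, KFold

X = np.random.rand(60, 5)   # hypothetical features
y = np.random.rand(60)      # hypothetical targets

# Single train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 6-fold cross-validation: each sample is held out exactly once
kf = KFold(n_splits=6, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]
    # fit and evaluate a model on this fold here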
Basics of machine learning: non-parametric models
D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}

function knn(k, dist, D_train, D_test):
    votes = []
    for i = 1 to len(D_test):
        d = []
        for j = 1 to len(D_train):
            d[j] = dist(x_i, x_j)
        d = argsort(d)
        votes[i] = most_common(labels[d[1:k]])
    return votes
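A runnable version of the pseudocode above; the function name knn_predict and the Euclidean distance are assumptions, and X_train, y_train, X_test are hypothetical NumPy arrays:

import numpy as np
from collections import Counter

def knn_predict(k, dist, X_train, y_train, X_test):
    """Predict a label for each test point by majority vote over its k nearest neighbors."""
    predictions = []
    for x in X_test:
        d = np.array([dist(x, x_tr) for x_tr in X_train])
        nearest = np.argsort(d)[:k]   # indices of the k closest training points
        predictions.append(Counter(y_train[nearest]).most_common(1)[0][0])
    return np.array(predictions)

euclidean = lambda a, b: np.linalg.norm(a - b)
# Example: y_pred = knn_predict(3, euclidean, X_train, y_train, X_test)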
Basics of machine learning: semi-parametric models
The idea here is that we would like to find a partition of the input space and then fit very simple models to predict the output in each piece. The partition is described using a (typically binary) "decision tree," which recursively splits the space.
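A minimal sketch of one such split for regression (a single-level "stump"; a full tree applies this recursively to each piece). The function name and squared-error criterion are illustrative assumptions:

import numpy as np

def build_stump(x, y):
    """One level of a regression tree: pick the split on 1-D input x that minimizes
    squared error, then predict a constant (the mean of y) in each piece."""
    best = None
    for s in np.unique(x):
        left, right = y[x <= s], y[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, s, left.mean(), right.mean())
    return best  # (error, split point, left prediction, right prediction)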
Basics of machine learning: implementations
k-Nearest Neighbors (kNN)
Classification and regression decision trees (CART)
Neural networks: perceptrons to neurons
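All three have off-the-shelf implementations; a minimal sketch assuming scikit-learn (hyperparameters are illustrative, and X_train, y_train, X_test, y_test are hypothetical):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

models = {
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "CART": DecisionTreeClassifier(max_depth=3),
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500),
}
# for name, model in models.items():
#     model.fit(X_train, y_train)
#     print(name, model.score(X_test, y_test))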
Basics of machine learning: feature representations
Polynomial basis
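A polynomial basis maps a scalar input x to the features [1, x, x^2, ..., x^d], so a linear model on these features can fit a nonlinear function. A small sketch (the helper name is an assumption):

import numpy as np

def polynomial_basis(x, degree):
    """Map a 1-D input x to the polynomial features [1, x, x**2, ..., x**degree]."""
    x = np.asarray(x).reshape(-1, 1)
    return np.hstack([x ** d for d in range(degree + 1)])

# Example: features on which a linear model can fit a cubic
X_poly = polynomial_basis(np.linspace(-1, 1, 5), degree=3)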
Neural networks: perceptrons to neurons
Neural networks: activation functions
Step function
Sigmoid
Rectified linear unit
Hyperbolic tangent
Softmax
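NumPy definitions of the activations listed above (a sketch; the max-subtraction in softmax is a standard numerical-stability trick, not something from the slides):

import numpy as np

def step(z):
    return (z > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def tanh(z):
    return np.tanh(z)

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)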
Neural networks: activation functions
Task | Loss | Activation (output layer)
Regression (penalize large errors) | Squared error (L2) | Linear (ReLU, Leaky ReLU, etc.)
Regression (penalize error linearly) | Absolute error (L1) | Linear (ReLU, Leaky ReLU, etc.)
Classification (binary) | Binary cross-entropy | Sigmoid, tanh
Classification (multi-class) | Categorical cross-entropy | Softmax
Generative | Model-dependent (e.g. negative log-likelihood) | Linear (ReLU, Leaky ReLU, etc.)
Other considerations: gradient magnitude, computational cost of the activation, exploding/vanishing gradients, depth of the network (a purely linear network collapses to a single linear layer, so nonlinearities are essential)
Neural networks: single-layer feed-forward NN
Optimization: batch gradient update
function gradient_update(W_init, η, J, ε):
    W_0 = W_init
    t = 0
    while |J(W_t) − J(W_{t-1})| > ε:
        t = t + 1
        W_t = W_{t-1} − η ∇_W J(W_{t-1})
    return W_t
Gradient of objective J with respect to parameter vector W
Pseudocode for gradient update algorithm
Batch gradient update
This is an arbitrary stopping criterion
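A runnable NumPy version of the pseudocode above, assuming the caller supplies the objective J and its gradient grad_J (both hypothetical functions):

import numpy as np

def gradient_update(W_init, eta, J, grad_J, eps):
    """Batch gradient descent: step along the negative gradient of J until the
    change in the objective falls below eps (an arbitrary stopping criterion)."""
    W_prev = np.asarray(W_init, dtype=float)
    while True:
        W = W_prev - eta * grad_J(W_prev)
        if abs(J(W_prev) - J(W)) <= eps:
            return W
        W_prev = W

# Example on a simple quadratic: J(W) = ||W||^2, grad_J(W) = 2W
# W_star = gradient_update(np.ones(3), 0.1, lambda W: W @ W, lambda W: 2 * W, 1e-8)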
Neural networks: multi-layer feed-forward NN
[Figure: forward pass through an L-layer feed-forward network. The input X enters layer 1; each layer l computes the pre-activation Z_l = W_l A_{l-1} + b_l from its weights W_l and bias b_l, applies the activation f_l to get A_l, and the final output A_L is compared against the target y by the loss.]
Inspired by 6.036 lecture notes (Leslie Kaelbling)
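A NumPy sketch of the forward pass the figure describes, caching Z_l and A_l for later use in backpropagation. It assumes column-vector conventions: X has shape (n_features, batch), each W has shape (n_out, n_in), and each b has shape (n_out, 1).

import numpy as np

def forward(X, weights, biases, activations):
    """Forward pass through an L-layer network.
    weights[l], biases[l], activations[l] define layer l; A[0] is the input X."""
    A = [X]
    Z = [None]
    for W, b, f in zip(weights, biases, activations):
        Z.append(W @ A[-1] + b)   # pre-activation Z_l = W_l A_{l-1} + b_l
        A.append(f(Z[-1]))        # activation     A_l = f_l(Z_l)
    return Z, A                   # cached for use in backpropagation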
Neural networks: error backpropagation
Let’s work out error backprop on the board!
Neural networks: error backpropagation
So let’s use the following shorthand from the previous figure,
First, let’s break down how the loss depends on the final layer,
Since,
We can re-write the equation as,
Now, to propagate through the whole network, we can keep applying the chain rule until the first layer of the network,
If you spend a few minutes looking at matrix dimensions, it becomes clear that this is an informal derivation. Here are the dimensions to think about:
Since we have the outputs of every layer, all we need to compute for the gradient of the last layer with respect to the weights is the gradient of the loss with respect to the pre-activation output.
The equation with the correct dimensions for matrix multiplication,
Inspired by 6.036 lecture notes (Leslie Kaelbling)
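The equations referenced above were displayed on the slide; here is a sketch of the standard chain-rule derivation they describe, in the notation of the figure (layer l computes Z^l = W^l A^{l-1} + b^l and A^l = f^l(Z^l)):

\[
\begin{aligned}
&\text{Shorthand: } Z^{l} = W^{l} A^{l-1} + b^{l}, \quad A^{l} = f^{l}(Z^{l}), \quad A^{0} = X.\\[4pt]
&\text{Last layer: } \frac{\partial \text{Loss}}{\partial W^{L}}
  = \frac{\partial Z^{L}}{\partial W^{L}}\,
    \frac{\partial A^{L}}{\partial Z^{L}}\,
    \frac{\partial \text{Loss}}{\partial A^{L}}.\\[4pt]
&\text{Any layer } l \text{ (keep applying the chain rule back to the first layer):}\\
&\qquad \frac{\partial \text{Loss}}{\partial W^{l}}
  = \frac{\partial Z^{l}}{\partial W^{l}}\,
    \frac{\partial A^{l}}{\partial Z^{l}}\,
    \frac{\partial Z^{l+1}}{\partial A^{l}} \cdots
    \frac{\partial A^{L}}{\partial Z^{L}}\,
    \frac{\partial \text{Loss}}{\partial A^{L}}.\\[4pt]
&\text{With dimensions that make the matrix product work } (W^{L} \in \mathbb{R}^{n^{L} \times n^{L-1}},\ A^{L-1} \in \mathbb{R}^{n^{L-1} \times 1}):\\
&\qquad \frac{\partial \text{Loss}}{\partial W^{L}}
  = \frac{\partial \text{Loss}}{\partial Z^{L}}\,\bigl(A^{L-1}\bigr)^{\top},
  \qquad \frac{\partial \text{Loss}}{\partial Z^{L}} \in \mathbb{R}^{n^{L} \times 1}.
\end{aligned}
\]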
Neural networks: automatic differentiation
https://www.robots.ox.ac.uk/~tvg/publications/talks/autodiff.pdf
Dual numbers
Augment a standard Taylor series (as used in numerical differentiation) with a "dual number" ε.
Forward automatic differentiation
Because "dual numbers" have the (manufactured) property ε² = 0, the Taylor series simplifies to f(a + ε) = f(a) + f′(a)ε.
End result: this recovers the function output as well as the first derivative.
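A minimal Python sketch of forward-mode automatic differentiation with dual numbers: each value carries (value, derivative), and the ε² = 0 rule gives the product rule automatically. The class name and the example function are illustrative.

class Dual:
    """A dual number a + b*eps with eps**2 == 0; b carries the derivative."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps, since eps**2 = 0
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

    __radd__, __rmul__ = __add__, __mul__

# f(x) = x*x + 3*x evaluated at x = 2 with derivative seed 1
x = Dual(2.0, 1.0)
y = x * x + 3 * x
print(y.value, y.deriv)   # 10.0 and f'(2) = 2*2 + 3 = 7.0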
Neural networks: regularization
Objective function
Objective function with ridge regularization (penalizes large weights)
Dropout
Popular new approach: Batch normalization
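A sketch of the first two ideas, assuming J is a base loss function and W a weight array (both hypothetical); the dropout shown is the "inverted" variant applied only at training time:

import numpy as np

def ridge_objective(J, W, lam):
    """Objective with ridge (L2) regularization: base loss plus a penalty on large weights."""
    return J(W) + lam * np.sum(W ** 2)

def dropout(A, p_drop, rng=np.random):
    """Inverted dropout: randomly zero activations during training and rescale
    the survivors so the expected activation is unchanged."""
    mask = (rng.rand(*A.shape) > p_drop) / (1.0 - p_drop)
    return A * mask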
Optimization: overview
Optimization: stochastic gradient descent
Stochastic gradient update (per randomly sampled training example):
Pseudocode for stochastic gradient update algorithm
function sgd(W_init, η, J, T, ε):
    W_0 = W_init
    for t = 1 to T:
        randomly select i ∈ {1, 2, ..., n}
        W_t = W_{t-1} − η(t) ∇_W J_i(W_{t-1})
    return W_T
Mini-batch gradient descent
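A NumPy sketch of the mini-batch variant, assuming grad_J(W, X_batch, y_batch) returns the gradient of the loss on that batch and eta(t) is a (possibly decaying) step size; both are hypothetical:

import numpy as np

def minibatch_sgd(W_init, eta, grad_J, X, y, T, batch_size):
    """Mini-batch SGD: estimate the gradient on a random subset of the data
    at each step instead of the full training set."""
    W = np.asarray(W_init, dtype=float)
    n = len(X)
    for t in range(1, T + 1):
        idx = np.random.choice(n, size=batch_size, replace=False)  # random mini-batch
        W = W - eta(t) * grad_J(W, X[idx], y[idx])
    return W

# e.g. a decaying step size: eta = lambda t: 0.1 / np.sqrt(t)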
Optimization: momentum
Goh, "Why Momentum Really Works", Distill, 2017. http://doi.org/10.23915/distill.00006
Nesterov acceleration? How can we make momentum better?
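A sketch of the two momentum updates, assuming grad_J returns the gradient of the objective (hypothetical); classical momentum accumulates a velocity, while Nesterov's variant evaluates the gradient at the look-ahead point:

import numpy as np

def momentum_step(W, v, eta, beta, grad_J):
    """Classical (heavy-ball) momentum: accumulate a velocity and step along it."""
    v = beta * v - eta * grad_J(W)
    return W + v, v

def nesterov_step(W, v, eta, beta, grad_J):
    """Nesterov momentum: evaluate the gradient at the look-ahead point W + beta*v."""
    v = beta * v - eta * grad_J(W + beta * v)
    return W + v, v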
Next week
Neural network interpretability
Convolutional neural networks
Recurrent neural networks
Batch normalization