![Page 1: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/1.jpg)
Introduction To Machine Learning
Portions of this course are from Machine Learning Crash Course
![Page 2: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/2.jpg)
Goals of This Class
● Learn to take a real-life problem and apply machine learning to make predictions.
● Learn to implement machine learning solutions using TensorFlow
● Learn how to evaluate the quality of your solution
● Machine learning is a very broad field -- we only touch upon some of the most common machine learning algorithms
![Page 3: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/3.jpg)
Some Sample Applications of Machine Learning
![Page 4: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/4.jpg)
Sample Applications of Machine Learning
● Medical applications such as disease prediction
● Speech recognition and understanding
● Recommendation systems
● Malware and spam detection
● Image understanding and annotation
● AI for games
● Translating between languages
● Predicting likelihood of earthquakes
● Matching resumes with jobs
![Page 5: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/5.jpg)
Google Products Using Machine Learning
![Page 6: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/6.jpg)
Google Assistant
![Page 7: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/7.jpg)
Google Photos: Searching Images via Text
![Page 8: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/8.jpg)
Gmail: Smart Reply
![Page 9: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/9.jpg)
Google Play Music: Recommending Music
![Page 10: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/10.jpg)
Game Playing: AlphaGo
![Page 11: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/11.jpg)
Combined Vision and Translation
![Page 12: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/12.jpg)
What is Machine Learning?
![Page 13: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/13.jpg)
What is Machine Learning (ML)?
There are many ways to define ML.
● ML systems learn how to combine data to produce useful predictions on never-before-seen data.
● ML algorithms find patterns in data and use these patterns to react correctly to brand new data.
![Page 14: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/14.jpg)
Sample Machine Learning Problem
● Predict cricket chirps/min from the temperature in centigrade
● To the right is a plot showing cricket chirps per minute (y-axis) vs temperature (x-axis)
● What do you observe?
![Page 17: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/17.jpg)
Sample Machine Learning Problem (cont)
● The line shown in blue fits the data well.
● Recall that the equation for a line is y = mx + b with slope m and y-intercept b.
● In this example: y = 7x - 30
● If we pick a temp not in the data, we can use this line to predict the chirps. If x = 17, then y = 7 * 17 - 30 = 89
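The prediction above can be sketched in a few lines of Python. The function name `predict_chirps` is made up for illustration; the slope and intercept are the values read off the slide.

```python
def predict_chirps(temp_c, slope=7.0, intercept=-30.0):
    """Predict cricket chirps per minute from temperature using y = mx + b."""
    return slope * temp_c + intercept


# A temperature that is not in the data set.
print(predict_chirps(17))  # 7 * 17 - 30 = 89.0
```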
![Page 18: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/18.jpg)
Sample Machine Learning Problem (cont)
● The line shown in blue fits the data well.
● Recall that the equation for a line is y = mx + b with slope m and y-intercept b.
● In machine learning we often call this a model since it is what we use to make our predictions.
● We call this model a linear model since we fit the data with a line
![Page 19: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/19.jpg)
Definition: Linear Regression
● When a learning model fits data by a line, we say it is linear
● Regression is a relationship between correlated variables
○ When a +/- change in one variable affects the other variable +/-
● When you have just one input value (feature), linear regression is the problem of fitting a line to the data and using this line to make new predictions.
● By convention in machine learning we write: y’ = b + wx. So instead of m, we use w.
![Page 20: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/20.jpg)
Linear Regression with Many Features
● Often you have multiple features that can be used to predict a value.
○ For example, suppose you want to predict the rent for an apartment.
○ Two relevant features are: the number of bedrooms and the apartment condition from 1 (poor) to 5 (excellent)
● In this case the input x becomes a pair $(x_1, x_2)$ where $x_1$ is the number of bedrooms and $x_2$ is the apartment condition.
● By convention we make x bold since it is a vector.
● Thus our equation to predict the rent becomes:
$y' = b + w_1 x_1 + w_2 x_2$
![Page 21: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/21.jpg)
Group Chat: Applying ML
Imagine the problem of building an ML model to predict apartment rental price.
● What features would you use?
● What would the training labels be?
● Describe some unlabeled examples
● Find a way to describe this as a regression task
![Page 22: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/22.jpg)
Linear Regression Notation
When you have n features the linear model becomes:
$y' = b + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$
● $y'$ is what we use as our predicted value for the label
● $x_k$ is a feature (one of the input values)
● $w_k$ is the weight (importance) associated with feature k
● b is the bias (y-intercept)
Sometimes $w_0$ is used instead of b, with the convention that $x_0 = 1$
The model parameters (which are to be learned) are: $b, w_1, w_2, \dots, w_n$
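As a minimal sketch of this notation, the function below computes the prediction for one example with any number of features. The name `predict` and the rent weights are invented for illustration; they are not learned values from the course.

```python
def predict(features, weights, bias):
    """Return y' = b + w1*x1 + ... + wn*xn for a single example."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical rent model: x1 = number of bedrooms, x2 = condition (1 to 5).
# The weights and bias are made up, not learned from data.
rent = predict(features=[2, 4], weights=[500.0, 100.0], bias=300.0)
print(rent)  # 300 + 500*2 + 100*4 = 1700.0
```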
![Page 23: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/23.jpg)
Loss
● We need a way to say what solution fits the data best
● Loss is a penalty for an incorrect prediction
● Lower loss is generally better
● Choice of loss function guides the model we choose
Do you think the green line or blue line will lead to better predictions?
![Page 24: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/24.jpg)
Evaluating If a Linear Model Is Good
● We will eventually look at mathematical ways to evaluate the quality of a model, but that should not replace more informal yet intuitive ways to evaluate a model
● When you just have one variable, drawing a line on a scatter plot that shows the model predictions is a good tool.
● Let’s look at two different models to predict the price of a car from its engine’s horsepower.
![Page 25: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/25.jpg)
Which Model is Better?
![Page 26: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/26.jpg)
Calibration Plots To Visualize a Model
● We want something that can serve a similar role to a scatter plot when we have more than one input feature.
● For this we can use a calibration plot that shows the prediction (x-axis) versus the target value (y-axis)
● On the next slide we’ll look at two calibration plots from different models trained on the same data set.
![Page 27: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/27.jpg)
Which Model is Better?
![Page 28: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/28.jpg)
Group Chat: Measuring Loss
What are some mathematical ways to measure loss? Remember that
● Loss should become smaller when predictions improve.
● Loss should become larger when predictions get worse.
Lots of ideas are reasonable here; be creative.
![Page 29: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/29.jpg)
A Convenient Loss Function for Regression
● L2 Loss for a given example is also called squared error
= square of the difference between prediction and label
= $(\text{observation} - \text{prediction})^2$
= $(y - y')^2$
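A small sketch of the per-example squared error follows. The label and prediction values are made up; the point is that errors of the same size are penalized equally in either direction, and that larger errors are penalized much more heavily.

```python
def squared_error(label, prediction):
    """L2 loss for one example: (y - y')^2."""
    return (label - prediction) ** 2


print(squared_error(label=92, prediction=89))  # (92 - 89)^2 = 9
print(squared_error(label=86, prediction=89))  # (86 - 89)^2 = 9
print(squared_error(label=99, prediction=89))  # (99 - 89)^2 = 100
```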
![Page 30: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/30.jpg)
Computing Squared Error on a Data Set
For training set D composed of examples $\mathbf{x} = \langle x_0, x_1, x_2, \dots, x_n \rangle$ and correct label y, and the prediction for the current model $\mathbf{w}$ of $y' = \mathbf{w}^T \mathbf{x} = w_0 x_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$, the squared error (L2Loss) is:
$L_2\text{Loss} = \sum_{(\mathbf{x}, y) \in D} (y - y')^2$
We’re summing over all examples in our training set D
Generally we average over all examples, so divide by the number of examples in D (denoted by |D|)
![Page 31: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/31.jpg)
RMSE - Root Mean Squared Error
For training set D composed of examples $\mathbf{x} = \langle x_0, x_1, x_2, \dots, x_n \rangle$ and correct label y, and the prediction for the current model $y'$:
$\text{RMSE} = \sqrt{\frac{1}{|D|} \sum_{(\mathbf{x}, y) \in D} (y - y')^2}$
Our goal is to train a model that minimizes RMSE (which is the same as minimizing the mean squared error).
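The two data-set-level measures can be sketched as below. The function names and the sample numbers are assumptions for illustration. Because the square root is an increasing function, whichever model has the lower mean squared error also has the lower RMSE.

```python
import math


def mean_squared_error(labels, predictions):
    """Average of (y - y')^2 over all examples in the data set."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("need equally sized, non-empty lists")
    total = sum((y - y_hat) ** 2 for y, y_hat in zip(labels, predictions))
    return total / len(labels)


def rmse(labels, predictions):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mean_squared_error(labels, predictions))


labels = [3.0, 5.0, 7.0]        # made-up labels
predictions = [2.0, 5.0, 9.0]   # made-up model outputs
print(mean_squared_error(labels, predictions))  # (1 + 0 + 4) / 3
print(rmse(labels, predictions))
```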
![Page 32: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/32.jpg)
Summary of Linear Regression
● y’ = wx + b
● Minimize mean squared error (or RMSE)
● In general have: $y' = b + w_1 x_1 + \dots + w_n x_n$
● Learn model weights $b, w_1, w_2, \dots, w_n$ from data
w = 129.25, b = 0.382
![Page 33: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/33.jpg)
Learning Model Weights
![Page 34: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/34.jpg)
Gradient Descent: High Level View
● The derivative of $(y - y')^2$ w.r.t. the model weights w tells us how loss changes in the neighborhood of the weights we're currently using.
● A gradient is the generalization of a derivative when you have more than one variable.
● We can take small steps in the negative gradient direction to modify the weights so that the loss on that example is lower.
● This optimization strategy is called Gradient Descent.
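The idea can be sketched from scratch for a one-feature linear model. This is an illustrative sketch, not the course's implementation: it assumes the gradient is averaged over the whole data set (rather than taken on a single example), and the function name, data, learning rate, and step count are all made up.

```python
def train_linear_model(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y' = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error (y' - y) for every example under the current weights.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error w.r.t. w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        # Take a small step in the negative gradient direction.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = train_linear_model(xs, ys)
print(w, b)  # close to 2 and 1
```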
![Page 37: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/37.jpg)
Pictorial View of Gradient Descent
● As a simple illustration assume our model has a single weight and we are using squared loss
● To the right we plot the squared loss in blue
● The yellow dot represents our current model weight
● The red dot is the model weight minimizing squared loss
[Figure: loss function plotted as loss (y-axis) versus value of weight (x-axis), with the current model weight and the optimal model weight marked]
![Page 38: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/38.jpg)
Training: A Gradient Step
[Figure: loss (y-axis) versus value of weight w (x-axis), showing the starting point, the (negative) gradient, and the next point]
![Page 39: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/39.jpg)
Learning Rate: Size of Step to Make
[Figure: loss (y-axis) versus value of weight (x-axis)]
● The sequence of yellow dots shows how the model evolves with the weight updates
● Gradient descent with a good learning rate gives small but noticeable steps towards the optimal value
● We want to end with a model very close to the optimal model as seen here
![Page 40: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/40.jpg)
Using TensorFlow (TF)
Later in the course we will present Gradient Descent in more depth. For now we just present enough detail for you to be able to apply the TF API to obtain a good linear model for a simple real data set.
![Page 41: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/41.jpg)
Using TensorFlow (TF)
import tensorflow as tf

# Define a linear regression model.
# (TensorFlow 1.x API; tf.contrib was removed in TensorFlow 2.x.)
estimator = tf.contrib.learn.LinearRegressor(
    optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.001))

# Fit the model on training data -- minimizes squared loss
estimator.fit(X_train, y_train, steps=10000)

# Use it to predict on new data
predictions = estimator.predict(X_new)
![Page 42: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/42.jpg)
Training a Model with Gradient Descent
In the scatter plot we see the model evolve (from the blue to red line)
The Learning Curve shows how the loss changes over time
![Page 43: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/43.jpg)
Things You Need to Decide
● Learning Rate
○ Very important since this is the size of the step to take. Typically change by powers of 10 until the model is training reasonably well and then fine tune
● Number of Steps to Train
○ Time to train is proportional to this (for a fixed set of features). You want to make this as small as you can while still getting to the minimum loss possible.
● What Features to Use
○ This is very important and will be our next main topic
![Page 44: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/44.jpg)
Learning Rate Way Too Low
● Moving towards optimal model but too slowly
[Figure: loss (y-axis) versus value of weight w (x-axis)]
![Page 45: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/45.jpg)
Learning Rate Too High
Though loss is decreasing, the steps go back and forth, overshooting the optimal weights
[Figure: loss (y-axis) versus value of weight w (x-axis), with steps numbered 1 to 6]
![Page 46: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/46.jpg)
Learning Rate Still Too High
Still overshooting, as seen in the oscillating loss
[Figure: loss (y-axis) versus value of weight w (x-axis), with steps numbered 1 to 6]
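The effect of the learning rate can be reproduced on a toy problem. The sketch below assumes a single weight with loss(w) = w², so the gradient is 2w and the optimal weight is 0; the function name, starting point, and the three learning rates are chosen only for illustration.

```python
def descend(learning_rate, start=10.0, steps=6):
    """Run gradient descent on loss(w) = w**2, whose gradient is 2*w."""
    w = start
    path = [w]
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
        path.append(w)
    return path


# 0.01: creeps towards 0 far too slowly.
# 0.3:  reaches the neighborhood of 0 in a few steps.
# 0.9:  jumps back and forth across 0, overshooting on every step.
for rate in (0.01, 0.3, 0.9):
    print(rate, [round(w, 2) for w in descend(rate)])
```

With these settings the sign of the weight flips on every step at rate 0.9, which is the back-and-forth pattern shown in the "too high" plots.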
![Page 47: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/47.jpg)
Need More Steps (loss still going down)
![Page 48: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/48.jpg)
Good Learning Rate and Number of Steps
![Page 49: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/49.jpg)
Converging to a Poor Model
Though the model appears to have converged, we can see from the calibration plot that it is a poor model. Try a different learning rate or better features. It is very important to evaluate whether the model is any good!
![Page 50: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/50.jpg)
Transforming Features
![Page 51: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/51.jpg)
Start By Exploring Your Data
● Gather Statistics
○ Average values
○ Min and max values
○ Standard deviation
● Visualize
○ Plot histograms
○ Are there missing values?
○ Are there outliers?
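The statistics in this checklist can be gathered with the Python standard library. This is a sketch under assumptions: missing values are represented as `None`, the function name `summarize` is invented, and the horsepower values are made up (the last one is a deliberate outlier).

```python
import statistics


def summarize(values):
    """Basic statistics for one numeric feature; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "min": min(present),
        "max": max(present),
        "stdev": statistics.pstdev(present),  # population standard deviation
    }


horsepower = [95, 110, None, 150, 88, 620]  # made-up values
print(summarize(horsepower))
```

A maximum far from the mean, as in this sample, is a hint to look for outliers before training.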
![Page 52: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/52.jpg)
Feature Engineering
The process of creating features from raw data is Feature Engineering
This process uses domain knowledge and a high-level understanding of the ML model you are using to create good features
Feature Engineering is important and often a very big part of whether an ML system works well in practice.
![Page 53: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/53.jpg)
Why Transform Features
● Linear models can only use feature values by taking linear combinations of them ($y' = b + w_1 x_1 + \dots + w_n x_n$)
● Can learn better with normalized numeric features so all are equally important initially
● We must handle missing data, outliers, etc.
● Convert non-numeric features into numeric values
○ can't multiply a weight times a string
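One common way to turn a string feature into numbers is one-hot encoding, sketched below. The slides do not name a specific encoding here, so treat this as one possible choice; the function name and the vocabulary are invented.

```python
def one_hot(value, vocabulary):
    """Encode a string as a list with a 1 at the position of the matching entry."""
    return [1 if value == item else 0 for item in vocabulary]


colors = ["red", "green", "blue"]  # hypothetical set of allowed values
print(one_hot("green", colors))    # [0, 1, 0]
```

Each position then gets its own weight in the linear model, so the model never has to multiply a weight by a string.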
![Page 54: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/54.jpg)
Why Transform Features (cont)
● Linear models can represent a lot more problems when we add non-linearities to the features
● We can often introduce new features using domain knowledge. For example, given the width and length of a room, we might add the feature: area = width * length.
● Tokenizing or lower-casing of text features
● ...
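Two of these transformations can be sketched briefly. The record layout (a dict with `width` and `length` keys), the function names, and the whitespace tokenizer are assumptions made for illustration.

```python
def add_area(room):
    """Return a copy of the room record with a derived 'area' feature."""
    enriched = dict(room)
    enriched["area"] = room["width"] * room["length"]
    return enriched


def tokenize(text):
    """Lower-case a text feature and split it on whitespace."""
    return text.lower().split()


print(add_area({"width": 3.0, "length": 4.0}))  # area is 12.0
print(tokenize("Sunny Two-Bedroom Apartment"))  # ['sunny', 'two-bedroom', 'apartment']
```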
![Page 55: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/55.jpg)
Transforming Numeric Features
Models tend to converge faster if features are on a similar scale
![Page 57: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/57.jpg)
Transforming Numeric Features (cont)
● As an example:
○ What if you are creating a linear model to predict city mpg of a car from highway mpg ($x_1$) and price ($x_2$)
○ What happens if you directly use these features?
○ The price is a much larger number and so dominates in the linear function $w_1 x_1 + w_2 x_2$, thus making it hard for the model to learn that highway mpg is much more important.
![Page 58: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/58.jpg)
Transforming Numeric Features
Common scaling methods:
● linear: $x' = (x - x_{min}) / (x_{max} - x_{min})$ to bring to [0,1]
● clipping: set min/max values to avoid outliers
● log scaling: $x' = \log(x)$ to handle very skewed distributions
● z-score: $x' = (x - \mu) / \sigma$ to center to mean 0 and stddev σ = 1
As always, experiment to see which works best in practice
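The four scaling methods can be sketched as small standard-library functions. The function names and the sample prices are assumptions for illustration; the z-score example uses the population standard deviation.

```python
import math
import statistics


def linear_scale(x, x_min, x_max):
    """Map x from [x_min, x_max] onto [0, 1]."""
    return (x - x_min) / (x_max - x_min)


def clip(x, low, high):
    """Cap x so that it lies between low and high."""
    return max(low, min(x, high))


def log_scale(x):
    """Natural log of x; only defined for x > 0."""
    return math.log(x)


def z_score(x, mean, stddev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / stddev


prices = [5000.0, 52500.0, 100000.0]  # made-up feature values
print([linear_scale(p, min(prices), max(prices)) for p in prices])  # [0.0, 0.5, 1.0]
print(clip(7.5, 0.0, 4.0))  # 4.0
mu, sigma = statistics.mean(prices), statistics.pstdev(prices)
print([z_score(p, mu, sigma) for p in prices])
```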
![Page 59: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/59.jpg)
Linear Scaling
● Linearly scaling just stretches/shrinks the feature values to fall between 0 and 1 (or sometimes -1 to +1 is used).
● Transformed value x’ = (x - a) / (b - a)
● For example, a feature that ranges from a=5000 to b=100000 gets scaled between 0 and 1 as:
5000 → 0.0
52500 → 0.5
100000 → 1.0
![Page 60: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/60.jpg)
Feature Clipping
● It helps to get rid of extreme values (outliers) by capping all feature values above (or below) some value to a fixed value. This can be done before or after other normalizations.
[Figure: two panels, "Rooms Per Person" and "Capped Rooms Per Person" (same feature, capped to max of 4.0)]
![Page 61: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/61.jpg)
Log Scaling
● Common data distribution: power law shown on the left
○ Most movies have very few ratings (often called the tail)
○ A few have lots of ratings (often called the head)
● Log scaling can improve linear model performance on such data
![Page 62: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/62.jpg)
Z-Score Scaling
● This is a variation of linear scaling that normalizes the data to have mean 0 and standard deviation of 1. It’s useful when there are some outliers, but not so extreme that you need clipping.
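A sketch of z-score scaling, assuming the usual formula x’ = (x - mean) / std with the population standard deviation. The function name and sample data are illustrative only.

```python
import statistics


def z_scores(values):
    """Scale values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


scaled = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
print(scaled)
```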
![Page 63: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/63.jpg)
Add Non-Linearities To a Linear Model
● Linear models can represent a lot more problems when we add non-linearities to the features
● One way to do this is to Bucketize numerical features
![Page 64: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/64.jpg)
Motivation: Bucketizing Numeric Features
● We can try to use a linear model, but it can’t capture the feature’s behavior well.
● RMSE 9.065
● The problem is that there is different behavior in the two ranges of compression ratio
![Page 65: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/65.jpg)
Bucketizing Compression Ratio
● Each binary bucketized feature has its own weight.
● RMSE 6.3
● Without also using the raw feature, we only get a single value (the bias) per bucket.
![Page 66: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/66.jpg)
Using Raw and Bucketized Features
● Use raw feature and bucketized feature
● Linear + step function
● Shared slope, independent biases
● RMSE 5.7
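The "shared slope, independent biases" idea can be sketched as follows. All weights and the bucket boundary are made-up numbers, not values learned from the slide's data set.

```python
def predict(x, slope, boundary, bias_low, bias_high):
    """Linear + step function: one shared slope, one bias per bucket."""
    bias = bias_low if x < boundary else bias_high
    return slope * x + bias


# Hypothetical weights: two buckets split at a compression ratio of 15.
print(predict(8.0, slope=-1.0, boundary=15.0, bias_low=30.0, bias_high=50.0))
print(predict(20.0, slope=-1.0, boundary=15.0, bias_low=30.0, bias_high=50.0))
```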
![Page 67: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/67.jpg)
Selection of Bucket Boundaries
● We want a way to automate the selection of bucket boundaries
● Quantiles: Each bucket has roughly the same number of points.
● Another option is equal width bins.
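The two options above can be compared with a short sketch. The prices are made up, and the quantile rule shown (pick every n/k-th sorted value) is one simple choice among several.

```python
def equal_width_boundaries(values, num_buckets):
    """Boundaries that split [min, max] into equally wide buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    return [lo + i * width for i in range(1, num_buckets)]


def quantile_boundaries(values, num_buckets):
    """Boundaries that put roughly the same number of points in each bucket."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // num_buckets] for i in range(1, num_buckets)]


# Made-up car prices with two expensive outliers.
prices = [5000, 6000, 6500, 7000, 7500, 8000, 9000, 10000, 20000, 40000]
print(quantile_boundaries(prices, 5))
print(equal_width_boundaries(prices, 5))
```

With this data the quantile boundaries cluster where most prices are, while the equal-width boundaries leave most cars in the first bucket.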
![Page 68: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/68.jpg)
Creating Buckets by Quantiles
Each bucket is a Boolean variable with each car price falling into a single bucket.
car_price_under_6649
car_price_6649_to_7349
car_price_7349_to_7999
car_price_7999_to_9095
car_price_9095_to_10295
car_price_10295_to_12440
car_price_12440_to_15250
car_price_15250_to_17199
car_price_17199_to_22470
car_price_22470_and_up
![Page 69: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/69.jpg)
Creating Bins by Quantiles
By using quantiles all “expensive” cars get put in a bucket together allowing more resolution for cars in the price range of 90% of the cars in the data set.
Bucket values for car of price $9000:
car_price_under_6649 = 0
car_price_6649_to_7349 = 0
car_price_7349_to_7999 = 0
car_price_7999_to_9095 = 1
car_price_9095_to_10295 = 0
car_price_10295_to_12440 = 0
car_price_12440_to_15250 = 0
car_price_15250_to_17199 = 0
car_price_17199_to_22470 = 0
car_price_22470_and_up = 0
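The bucket values for the $9000 car can be reproduced with a short sketch that uses the boundaries from the bucket names. Sending a price that equals a boundary to the upper bucket is an assumption made here.

```python
import bisect

# Boundaries taken from the bucket names on the slide.
BOUNDARIES = [6649, 7349, 7999, 9095, 10295, 12440, 15250, 17199, 22470]


def bucketize(price, boundaries=BOUNDARIES):
    """Return one Boolean (0/1) value per bucket, with exactly one set to 1."""
    index = bisect.bisect_right(boundaries, price)
    one_hot = [0] * (len(boundaries) + 1)
    one_hot[index] = 1
    return one_hot


print(bucketize(9000))
```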
![Page 70: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/70.jpg)
Quantiles vs Equal Width Bins
What is the advantage of using quantiles (left) versus equal width bins (right)?
![Page 71: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/71.jpg)
Quantiles vs Equal Width Bins
● Both forms of binning provide non-linear behavior since the feature weight for each bin can be independently set
● Using quantiles gives more resolution in areas where there is more data
● For both techniques, you can adjust the number of bins to vary the amount of resolution versus the number of features introduced.
![Page 72: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/72.jpg)
Representing Categorical Features
How can we represent features such as:
● The day of the week
● A person’s occupation(s)
● The words in an advertisement
● The movies a person has rated
Remember a linear model can only take a weighted combination of features.
![Page 73: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/73.jpg)
Graphical View of a Linear Model
Output: y’ = w1x1 + w2x2 + w3x3 + b
Input: x = (x1, x2, x3)
[Figure: input nodes x1, x2, x3 and the constant 1 at the bottom, joined to the output node at the top by edges with weights w1, w2, w3 and b]
● In the graphical view, inputs are at the bottom and the output is at the top. Each input (or the constant 1 for the bias) is multiplied by the edge weight and they are summed together at the output node.
![Page 74: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/74.jpg)
Transforming Categorical Features
The features can be mapped to numerical values. For example for the days of the week we could use:
Mon Tues Wed Thur Fri Sat Sun
0 1 2 3 4 5 6
If the value we are predicting increases linearly from Mon to Sun then we could directly use this, but typically you want to learn an independent weight for each value.
![Page 75: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/75.jpg)
Categorical Features: One-Hot Encoding
● Introduce a Boolean variable for each feature value
● Independent weight is learned for each feature value.
● Example: For days of the week, introduce 7 Boolean variables each with its own learned weight.
● Sample one-hot encodings (7 Boolean variables for day of the week, in the order Mon Tues Wed Thur Fri Sat Sun):
Encoding Tuesday: 0 1 0 0 0 0 0
Encoding Friday: 0 0 0 0 1 0 0
Encoding Sunday: 0 0 0 0 0 0 1
![Page 77: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/77.jpg)
Efficiently Representing One-Hot Encoding
● Map each feature value to a unique index
● Represent a one-hot encoding as a sparse vector using the index of the feature value that is not zero
● Encode Friday as <4> using the indices shown in the index row.
Dense Boolean one-hot encoding for Friday:
day: Mon Tues Wed Thur Fri Sat Sun
index: 0 1 2 3 4 5 6
value: 0 0 0 0 1 0 0
![Page 78: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/78.jpg)
One-Hot Encoding vs Single Numeric Feature
● Single numeric feature and weight
○ Represent Friday as the integer 4 that is multiplied by a single model weight
○ The model learns a weight and bias
● One-hot encoding with 7 Boolean variables for day of week
○ Represent Friday as the sparse vector <4> meaning the Boolean variable for Friday is 1 and the rest are 0.
○ The model learns one bias per day of week since one input is 1 and the rest are 0
● If the difference is not clear, please ask.
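The difference between the dense and sparse forms, and why one-hot encoding amounts to one learned value per day, can be sketched as below. The per-day weights are made-up numbers.

```python
DAYS = ["Mon", "Tues", "Wed", "Thur", "Fri", "Sat", "Sun"]
INDEX = {day: i for i, day in enumerate(DAYS)}

# Hypothetical learned weights, one per day of the week.
DAY_WEIGHTS = [0.5, 0.1, 0.2, 0.3, 2.0, 3.5, 1.0]


def one_hot(day):
    """Dense encoding: 7 Boolean values with a single 1."""
    vec = [0] * len(DAYS)
    vec[INDEX[day]] = 1
    return vec


def sparse(day):
    """Sparse encoding: only the index of the non-zero entry."""
    return [INDEX[day]]


def predict_dense(day):
    return sum(w * x for w, x in zip(DAY_WEIGHTS, one_hot(day)))


def predict_sparse(day):
    return sum(DAY_WEIGHTS[i] for i in sparse(day))


print(one_hot("Fri"), sparse("Fri"), predict_dense("Fri"), predict_sparse("Fri"))
```

Both predictions simply pick out Friday's own weight, whereas a single numeric feature would force the prediction to be 4 times one shared weight.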
![Page 80: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/80.jpg)
Encoding Features that are Sets/Phrases
Sample ad: “1995 Honda Civic good condition”
Set Boolean variable for each word in ad to 1 and rest to 0 (dense representation):
index word value
...
100 1993 0
101 1994 0
102 1995 1
103 1996 0
104 1997 0
...
150 poor 0
151 good 1
152 excellent 0
...
200 Honda 1
201 Toyota 0
...
225 Civic 1
226 Accord 0
...
350 condition 1
● Sparse representation of ad: <102, 151, 200, 225, 350>
● We call this a sparse encoding since in the dense representation there are only a few 1s. A one-hot encoding is a special case with a single 1.
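The sparse representation of the sample ad can be reproduced with a sketch that uses only the word indices visible on the slide. The function name `encode` is hypothetical, and words outside this partial vocabulary are simply skipped here.

```python
# Word-to-index entries visible on the slide (the full vocabulary is larger).
VOCAB = {
    "1993": 100, "1994": 101, "1995": 102, "1996": 103, "1997": 104,
    "poor": 150, "good": 151, "excellent": 152,
    "Honda": 200, "Toyota": 201,
    "Civic": 225, "Accord": 226,
    "condition": 350,
}


def encode(text):
    """Return the sorted indices of the vocabulary words present in text."""
    return sorted({VOCAB[word] for word in text.split() if word in VOCAB})


print(encode("1995 Honda Civic good condition"))
```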
![Page 81: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/81.jpg)
Vocabulary for Categorical Features
Vocabulary: The mapping from the feature value to its unique index.
● Boolean variable is 1 if the feature has that value and 0 otherwise
● Represent as a one-hot or sparse vector using a set of the non-zero indices
... but sometimes we can't fit all possible feature values into the vocabulary
0 Mon
1 Tues
2 Wed
3 Thur
4 Fri
5 Sat
6 Sun
Vocabulary
![Page 82: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/82.jpg)
Vocabulary - Out of Vocab
Out of Vocabulary (OOV)
● Give unique indices to frequent terms/values
● Add one extra index to the vocabulary, OOV
● For items not in the vocabulary, assign to OOV
0 Super
1 Common
2 Terms
...
N-1 Important
N OOV
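A sketch of a vocabulary with an OOV index, assuming the vocabulary keeps the N most frequent tokens and uses index N for everything else. The token list and function names are illustrative.

```python
from collections import Counter


def build_vocab(tokens, max_size):
    """Give indices 0..N-1 to the most frequent tokens; N is the OOV index."""
    frequent = [token for token, _ in Counter(tokens).most_common(max_size)]
    vocab = {token: i for i, token in enumerate(frequent)}
    return vocab, len(vocab)


def lookup(token, vocab, oov_index):
    """Return the token's index, or the OOV index if it is not in the vocabulary."""
    return vocab.get(token, oov_index)


tokens = ["the", "car", "the", "red", "car", "the", "zebra"]
vocab, oov = build_vocab(tokens, max_size=2)
print(vocab, oov, lookup("zebra", vocab, oov))
```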
![Page 83: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/83.jpg)
Hashing to Define Vocabulary Mapping
Terms: { man, woman, car, pear, castle, these, are, not, frequent, terms }
Hash(string) maps each term to an index in a hash table of size 8:
0 car
1 man
2 not
3 frequent, woman
4 castle
5 are
6 these, terms
7 pear
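A sketch of hashing terms into a table of size 8. The use of CRC32 is an assumption made to get a hash that is stable between runs, so the resulting indices will not match the ones on the slide.

```python
import zlib

TERMS = ["man", "woman", "car", "pear", "castle",
         "these", "are", "not", "frequent", "terms"]


def hash_index(term, table_size=8):
    """Map a string to an index in [0, table_size) with a stable hash."""
    return zlib.crc32(term.encode("utf-8")) % table_size


indices = {term: hash_index(term) for term in TERMS}
print(indices)
```

Since 10 terms share 8 indices, at least two terms must collide, just as "frequent" and "woman" share an index on the slide.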
![Page 84: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/84.jpg)
Categorical vs Numerical Features
● How would you encode a feature such as zipcode?
● Though it could be used directly as a numerical feature, you don’t want to multiply it by a weight.
● Best to treat zipcode as categorical data.
● You could use domain knowledge to group zipcodes that are geographically nearby into a single bucket.
● You need to think about raw features and how to best use them.
![Page 87: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/87.jpg)
Feature Engineering: Missing Features
Often data sets are missing features. If this is extremely rare we could just skip those examples, but otherwise what do we do?
● For non-numerical data?
○ A common solution is to just introduce a feature value for “missing”
● How do we handle this for numerical data?
○ Use the average value (or some common value)
○ Bin the data and introduce a “missing” bin
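Two of the options above can be sketched as follows, assuming missing values are represented as `None`. The function names are hypothetical.

```python
def fill_with_mean(values):
    """Numerical data: replace missing entries with the mean of the present ones."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def with_missing_category(values, missing="missing"):
    """Non-numerical data: introduce an explicit "missing" feature value."""
    return [missing if v is None else v for v in values]


print(fill_with_mean([1.0, None, 3.0]))
print(with_missing_category(["red", None, "blue"]))
```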
![Page 88: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/88.jpg)
Add Non-Linearities To a Linear Model
● Linear models can represent a lot more problems when we add non-linearities to the features
● We’ve already seen one way to do this:
○ Bucketizing Numerical Features
● Other ways to introduce non-linearities:
○ Feature Crosses
○ Adding Artificial Variables
![Page 89: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/89.jpg)
Preview: Feature Crosses
● We will study feature crosses in more depth later.
● In a linear model the contribution captured for each feature is independent of the others and this is often not the case in data.
● Feature Crosses introduce non-linear behavior between a set of two or more features by capturing dependencies between the features.
![Page 90: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/90.jpg)
Feature Cross Example
● There is one weight for each feature value (w1 to w8) and one for each of the 16 possible combinations of these feature values (w9 to w24).
● Color encodes the value of the weight with red being low and green being high
Rainfall weights:
None Low Med High
w1 w2 w3 w4
Temperature weights:
Cold Cool Warm Hot
w5 w6 w7 w8
Rainfall x Temperature weights (rows are rainfall, columns are temperature):
Cold Cool Warm Hot
None w9 w10 w11 w12
Low w13 w14 w15 w16
Med w17 w18 w19 w20
High w21 w22 w23 w24
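One way to see how the 16 crossed weights are indexed is the sketch below. The row-major index formula is an assumption chosen to match the w9 to w24 layout of the grid.

```python
RAINFALL = ["None", "Low", "Med", "High"]
TEMPERATURE = ["Cold", "Cool", "Warm", "Hot"]


def cross_index(rainfall, temperature):
    """Index in [0, 16) of the (rainfall, temperature) combination."""
    return RAINFALL.index(rainfall) * len(TEMPERATURE) + TEMPERATURE.index(temperature)


def cross_one_hot(rainfall, temperature):
    """One Boolean value per combination, with exactly one set to 1."""
    vec = [0] * (len(RAINFALL) * len(TEMPERATURE))
    vec[cross_index(rainfall, temperature)] = 1
    return vec


def weight_name(rainfall, temperature):
    """Name of the crossed weight in the slide's grid (w9 to w24)."""
    return f"w{9 + cross_index(rainfall, temperature)}"


print(weight_name("Low", "Warm"), cross_one_hot("Low", "Warm"))
```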
![Page 91: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/91.jpg)
Feature Crosses in TensorFlow
● Define a new feature, called a crossed column in TF, for the cross [A x B] using either bucketized numerical features or categorical features A and B
● The resulting crosses are often extremely sparse
● Crosses can involve any number of features, such as: [A x B x C x D x E]
![Page 92: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/92.jpg)
Feature Crosses: Some Examples
● Housing market price predictor: [latitude x longitude x num_bedrooms]
● Predictor of pet owner satisfaction with pet: [pet behavior type x time of day]
● Tic-Tac-Toe predictor: [pos1 x pos2 x … x pos9]
![Page 93: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/93.jpg)
Visualization of Weights with a Cross
● There is one weight for each feature value (w1 to w8) and one for each of the 16 possible combinations of these feature values (w9 to w24).
● Color encodes the value of the weight with red being low and green being high
Rainfall weights:
None Low Med High
w1 w2 w3 w4
Temperature weights:
Cold Cool Warm Hot
w5 w6 w7 w8
Rainfall x Temperature weights (rows are rainfall, columns are temperature):
Cold Cool Warm Hot
None w9 w10 w11 w12
Low w13 w14 w15 w16
Med w17 w18 w19 w20
High w21 w22 w23 w24
![Page 94: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/94.jpg)
Feature Crosses: Why would we do this?
● Linear learners scale well to massive data
● But without feature crosses, the expressivity of these models would be limited
● Using feature crosses + massive data is one efficient strategy for learning highly complex models
○ Foreshadowing: Neural nets provide another
● Are there downsides to adding more and more crosses?
![Page 95: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/95.jpg)
Feature Engineering: Be Creative
● We’ve just seen some examples but the key is to think about your data, what a linear model can do and then how to best capture that.
● As one more example, if you wanted to predict the rental price for an apartment and you had a street address, how might you represent that?
● Remember, there’s not one right answer. Sometimes you need to try a few things and see what gives you the best predictions.
![Page 96: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/96.jpg)
Stochastic Gradient Descent: A Closer Look
![Page 97: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/97.jpg)
Review of Overview of Gradient Descent
● The derivative of (y - y’)² w.r.t. the model weights w tells us how the loss changes in the neighborhood of the weights we're currently using.
● A gradient is the generalization of a derivative when you have more than one variable.
● We can take small steps in the negative gradient direction to modify the weights so that the loss on that example is lower.
● This optimization strategy is called Gradient Descent.
![Page 98: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/98.jpg)
Training: Decreasing the Loss
[Figure: the loss function plotted as loss versus value of weight w, with the starting point marked]
![Page 99: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/99.jpg)
Training: A Gradient Step
[Figure: loss versus value of weight w; a step from the starting point in the (negative) gradient direction reaches the next point]
![Page 100: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/100.jpg)
Learning Rate (Size of Step) Too Small
[Figure: the loss function plotted as loss versus value of weight w, with many tiny steps]
If the steps are too small, learning is too slow.
![Page 101: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/101.jpg)
Learning Rate (Size of Step) Too Large
[Figure: loss versus value of weight w, with steps jumping across the minimum]
Overshoots if learning rate is too high
![Page 102: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/102.jpg)
Well-Tuned Learning Rate
[Figure: loss versus value of weight w, with steps reaching the minimum]
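The three learning-rate cases can be tried on a toy problem. This sketch assumes a one-weight loss of (w - 3)², whose minimum is at w = 3; the loss, the starting point and the three learning rates are all made up for illustration.

```python
def gradient_descent(learning_rate, steps=20, w=0.0):
    """Run gradient descent on the toy loss (w - 3)^2 and return the final w."""
    for _ in range(steps):
        gradient = 2 * (w - 3.0)          # derivative of (w - 3)^2 w.r.t. w
        w = w - learning_rate * gradient  # step in the negative gradient direction
    return w


w_small = gradient_descent(learning_rate=0.01)  # too small: still far from 3
w_tuned = gradient_descent(learning_rate=0.3)   # well tuned: very close to 3
w_large = gradient_descent(learning_rate=1.1)   # too large: overshoots and diverges
print(w_small, w_tuned, w_large)
```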
![Page 103: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/103.jpg)
Local Minimum
[Figure: a non-convex loss plotted as loss versus value of weight w, with a local minimum marked]
When the loss function we are optimizing is not convex, we can get stuck in a local minimum.
![Page 104: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/104.jpg)
Mini-Batch Gradient Descent
● Computing gradient over the entire dataset for every step is expensive and unnecessary
● Mini-Batch Gradient Descent: Compute gradient on batches typically of 10-1000 samples
![Page 105: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/105.jpg)
Stochastic Gradient Descent (SGD)
● Mini-Batch Gradient Descent using batches that are random samples of examples
● Random samples make each batch representative of the whole data set, which makes the gradient estimate more accurate.
● Many data sets are arranged in some order (e.g. imagine if the automobile data set was sorted by price), so batches taken in the stored order would not be representative.
● An easy way to get random samples is to randomly “shuffle” the data and then just take the first batch-size examples, then the next batch-size examples, ...
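The shuffle-then-slice approach can be sketched as below. The function name and the fixed seed are assumptions made so the example is repeatable.

```python
import random


def shuffled_batches(examples, batch_size, seed=0):
    """Shuffle a copy of the examples, then cut it into consecutive batches."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + batch_size]
            for i in range(0, len(shuffled), batch_size)]


# Ten examples stored in sorted order, split into batches of 4.
batches = shuffled_batches(range(10), batch_size=4)
print(batches)
```

Every example appears in exactly one batch per pass, and the last batch may be smaller than the rest.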
![Page 106: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/106.jpg)
Update Step in SGD
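● At a high level what happens in each step is: $w \leftarrow w - \alpha \, \frac{\partial\, \text{loss}}{\partial w}$, where $\alpha$ is the learning rate (the size of the step).
● Remember w and b are instance variables that are part of the internal state of the estimator object that are updated via this formula when the fit method is called.
● The term $\frac{\partial\, \text{loss}}{\partial w}$ is the gradient for the weight being updated as estimated from the labeled examples in the batch.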
![Page 109: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/109.jpg)
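Update Step in SGD (cont)
● To make this concrete let’s consider when y’ = wx + b and we are minimizing (y’ - y)². So for w, the derivative $\frac{\partial}{\partial w}(y' - y)^2$ is $2(y' - y)\,x$.
● So at each step: $w \leftarrow w - \alpha \cdot 2(y' - y)\,x$, where $\alpha$ is the learning rate.
● Similarly, at each step: $b \leftarrow b - \alpha \cdot 2(y' - y)$
● The gradient term (highlighted in yellow with a red outline on the slide), $2(y' - y)\,x$ for w and $2(y' - y)$ for b, is computed by taking the average value over the examples in the batch.
● When y’ < y (prediction too small), y’ - y is negative so, for positive x, w gets bigger, bringing the prediction closer to y. If y’ > y, w gets smaller.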
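A single SGD step for y’ = wx + b with squared loss can be sketched as follows. The function name, the learning rate of 0.1 and the two-example batch are made up for illustration.

```python
def sgd_step(w, b, batch, learning_rate):
    """One update of w and b using gradients averaged over the batch."""
    n = len(batch)
    # d/dw (y' - y)^2 = 2 (y' - y) x, averaged over the batch.
    grad_w = sum(2 * ((w * x + b) - y) * x for x, y in batch) / n
    # d/db (y' - y)^2 = 2 (y' - y), averaged over the batch.
    grad_b = sum(2 * ((w * x + b) - y) for x, y in batch) / n
    return w - learning_rate * grad_w, b - learning_rate * grad_b


# Both predictions start at 0, which is too small, so w and b should increase.
w, b = sgd_step(0.0, 0.0, batch=[(1.0, 2.0), (2.0, 4.0)], learning_rate=0.1)
print(w, b)
```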
![Page 110: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/110.jpg)
Generalization
![Page 111: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/111.jpg)
Will Our Model Make Good Predictions?
● Our Key Goal: predict well on new unseen data.
● Problem: We only get a sample of data D for training
● We can measure the loss for our model on the sample D but how can we know if it will predict well on new data?
● Performance Measure: A measure of how well the model predicts on unseen data. This is usually related to the loss function.
![Page 112: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/112.jpg)
Example Performance Measure
For regression a common performance metric is root mean squared error (RMSE)
On this dataset, for the best linear model: RMSE = 22.24
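RMSE is the square root of the average squared difference between predictions and labels. A minimal sketch, with made-up numbers rather than the slide's data set:

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between labels and predictions."""
    squared_errors = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


print(rmse([0.0, 0.0], [3.0, 4.0]))
```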
![Page 113: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/113.jpg)
Non-Linear Model
What if we use a model that is not a line (called non-linear)?
For the model on the right: RMSE = 21.44
This is better than the linear model with RMSE = 22.24
![Page 114: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/114.jpg)
“Perfect” Model
If we make a very complex model then we can perfectly fit the data, so RMSE = 0
Why might you not want to do this?
![Page 115: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/115.jpg)
Measuring Generalization Ability
Here’s the other half of the data.
We will discuss this more but for now, let’s refer to this “other half” as the test data and the original points as the training data since they were used to train the model.
![Page 116: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/116.jpg)
Generalization for Linear Model
We’ll look at our performance using the best linear model.
Training RMSE = 22.24
Test data RMSE = 21.98
Pretty similar.
![Page 117: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/117.jpg)
Generalization for Non-Linear Model
We’ll look at our performance using the non-linear model.
Old (training data) RMSE = 21.44
New (test data) RMSE = 22.74
Still pretty similar.
![Page 118: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/118.jpg)
Generalization for “Perfect” Model
Training data RMSE = 0
Test data RMSE = 32
This did not generalize to new data!
This is called overfitting.
![Page 119: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/119.jpg)
Overfitting
If we make a very complex model that can perfectly (or nearly perfectly) fit the training data, we are just memorizing the data rather than meeting the goal of generalizing.
Remember our goal is to build a system to deal with new data!
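The gap between training and test error can be reproduced with a minimal sketch. Everything here is an assumption for illustration: the data is synthetic (y = 3x plus noise), and the "perfect" model is a stand-in that returns the label of the closest training point, not the model shown in the slides.

```python
import math
import random

def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets))

def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w*x + b for a single feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return lambda x: w * x + b

def fit_memorizer(xs, ys):
    """Hypothetical 'perfect' model: return the label of the closest training point."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

# Synthetic data (an assumption, not the data from the slides).
rng = random.Random(0)

def sample(n):
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [3 * x + rng.gauss(0, 5) for x in xs]
    return xs, ys

train_x, train_y = sample(50)
test_x, test_y = sample(50)

for name, fit in [("linear", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train_x, train_y)
    train_rmse = rmse([model(x) for x in train_x], train_y)
    test_rmse = rmse([model(x) for x in test_x], test_y)
    print(f"{name}: train RMSE = {train_rmse:.2f}, test RMSE = {test_rmse:.2f}")
```

The memorizer reaches a training RMSE of 0 by construction, so its training error says nothing about how it handles new points.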
![Page 120: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/120.jpg)
An Underfit Model
Underfitting happens when you try to use a model that is too simple.
![Page 121: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/121.jpg)
How do we know if our model is good?
● William of Occam (back in the 14th century) argued that simple explanations of nature are better
● Occam’s Razor principle: the less complex a model is, the more likely it is to predict new data well
● How can we define how complex a learning model is?
● How can we measure how well our model generalizes?
![Page 122: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/122.jpg)
Model Complexity
Our goal is to determine what model complexity is most appropriate.
We’ll discuss ways to do this.
![Page 123: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/123.jpg)
How do we know if our model is good?
● In practice, to determine whether our model does well on new data, we evaluate it on a separate sample of data that we call the test set
● Good performance on the test set is a useful indicator of good performance on new data as long as:
■ The test set is large enough
■ The test set is independent of the training set so it truly represents new data
■ We don’t cheat by using the test set over and over
![Page 124: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/124.jpg)
Generalization Curve and Overfitting
A learning curve that is showing overfitting beginning to occur.
![Page 125: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/125.jpg)
Training, Validation, and Test Data Sets
![Page 126: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/126.jpg)
Partitioning Data Sets
● Set aside some of the data as test data
○ Often just done at random
○ Sometimes use the most recent data as test data
○ You need to be very careful here
[Figure: the data set partitioned into Training Data and Test Data]
![Page 127: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/127.jpg)
A Possible Workflow?
Train model on Training Data
Evaluate model on Test Data
Select features, learning rate, batch size, ... according to results on Test Data
Pick the model that does best on Test Data.
Any issues here?
![Page 128: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/128.jpg)
A Solution to “Polluting” Test Data
[Figure: the data set partitioned into Training Data, Validation Data, and Test Data]
● Divide the data we are provided for training our model into two datasets
○ Most of it will be in our Training Data
○ A portion of it (typically 5-10%) will be used as Validation Data.
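One way to carve out the three sets is sketched below. The function name `split_dataset`, its default fractions, and the fixed seed are assumptions for illustration; a purely random split is only appropriate when the examples are independent of each other.

```python
import random

def split_dataset(examples, test_fraction=0.2, validation_fraction=0.1, seed=0):
    """Shuffle once, set aside test data, then take validation data from the rest."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)

    n_test = round(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    remaining = shuffled[n_test:]

    # The validation fraction is applied to the data left for training.
    n_validation = round(len(remaining) * validation_fraction)
    validation = remaining[:n_validation]
    train = remaining[n_validation:]
    return train, validation, test

train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))  # → 72 8 20
```

Because each example lands in exactly one list, nothing seen during training or tuning can leak into the test set through this split.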
![Page 129: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/129.jpg)
Better Workflow: Use Validation Data
Train model on Training Data
Evaluate model on Validation Data
Select features, learning rate, batch size, ... according to results on Validation Data
Pick the model that does best on Validation Data
Check for generalization ability on Test Data
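The workflow can be sketched end to end on toy data. The model here is a hypothetical stand-in (averaging the labels of the nearest training points, with the number of neighbors as the setting being tuned), and the data is synthetic; the point is only that tuning reads the validation data while the test data is touched once at the end.

```python
import math
import random

def rmse(model, xs, ys):
    """Root mean squared error of a model on (xs, ys)."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys))

def fit_neighbors(xs, ys, n_neighbors):
    """Stand-in model: predict the mean label of the closest training points."""
    points = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(points, key=lambda p: abs(p[0] - x))[:n_neighbors]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

rng = random.Random(1)

def sample(n):
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return xs, [3 * x + rng.gauss(0, 5) for x in xs]

train_x, train_y = sample(80)
val_x, val_y = sample(20)
test_x, test_y = sample(20)

# Train each candidate on Training Data and evaluate it on Validation Data.
candidates = [1, 3, 5, 9, 15]
val_scores = {n: rmse(fit_neighbors(train_x, train_y, n), val_x, val_y) for n in candidates}

# Pick the candidate that does best on Validation Data, then check Test Data once.
best = min(val_scores, key=val_scores.get)
final_model = fit_neighbors(train_x, train_y, best)
print("best setting:", best)
print("test RMSE:", round(rmse(final_model, test_x, test_y), 2))
```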
![Page 130: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/130.jpg)
Partitioning Data Sets
● You need to be very careful when partitioning your data into training, validation and test sets.
● You will explore this some in your next lab.
● Let’s look at some real-life examples showing some subtle but significant errors you can make that would lead you to believe you have a really good model when in practice it will not generalize to new data.
![Page 131: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/131.jpg)
k-fold Cross Validation
● Test data must be set aside unless there is no concern about overfitting
● What if we don’t have enough data to set aside enough for validation data?
● For these cases k-fold cross validation is often used
● Basic idea is to divide the data into k pieces of roughly equal size and, in each of k training phases, use 1 piece as validation data and the other k-1 as training data
![Page 132: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/132.jpg)
Illustration of k-fold Cross-Validation
Demonstrate with k=5 where each row represents ⅕ of the data.
[Figure legend: Training Set, Validation Set]
Performance measure m1 computed just on the validation set for this fold
![Page 133: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/133.jpg)
Illustration of k-fold Cross-Validation
In the second training phase, shift which points are used for validation.
[Figure legend: Training Set, Validation Set]
Performance measure m2 computed just on the validation set for this fold
![Page 134: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/134.jpg)
Illustration of k-fold Cross-Validation
Continue shifting which points are used for validation.
[Figure legend: Training Set, Validation Set]
Performance measure m3 computed just on the validation set for this fold
![Page 135: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/135.jpg)
Illustration of k-fold Cross-Validation
[Figure legend: Training Set, Validation Set]
Performance measure m5 computed just on the validation set for this fold
![Page 136: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/136.jpg)
Cross-Validation: Compute Metric
● For each of the k training phases, compute the performance metric (over the validation set for that phase). This gives us m1, m2, ..., mk
● Average m1, m2, ..., mk to get an aggregate performance metric.
● You can also check model stability by comparing the performance across the k runs and also compute standard statistical measures such as standard deviation and error bars over the k folds.
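The per-fold metrics and their aggregate can be sketched as follows. The helper names (`k_fold_splits`, `cross_validate`) and the toy "predict the training mean" model are assumptions for illustration; in practice the examples would be shuffled before the folds are cut.

```python
import math
import statistics

def k_fold_splits(n_examples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_validate(xs, ys, k, fit, metric):
    """Return m1..mk, each computed only on that fold's validation set."""
    scores = []
    for train_idx, val_idx in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        predictions = [model(xs[i]) for i in val_idx]
        scores.append(metric(predictions, [ys[i] for i in val_idx]))
    return scores

# Hypothetical stand-ins: a model that always predicts the training mean, scored by RMSE.
def fit_mean(train_x, train_y):
    mean_y = sum(train_y) / len(train_y)
    return lambda x: mean_y

def rmse(predictions, targets):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets))

xs = list(range(10))
ys = [2.0 * x for x in xs]
scores = cross_validate(xs, ys, k=5, fit=fit_mean, metric=rmse)
print("per-fold:", [round(s, 2) for s in scores])
print("mean:", round(statistics.mean(scores), 2))
print("stdev:", round(statistics.stdev(scores), 2))
```

A large standard deviation across folds relative to the mean is one hint that the model's performance depends heavily on which examples it was trained on.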
![Page 137: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/137.jpg)
Cross-Validation: Train Final Model
● To train the final model, choose the hyperparameter setting that gives the best aggregate performance over the k runs.
● Now run the algorithm with the chosen hyperparameters using all examples (other than those set aside as test data throughout) as the training data to obtain the final model.
● Use the test data, which has not been used during cross validation, to check for any issues with overfitting.
![Page 138: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/138.jpg)
k-fold Cross-Validation: Pros and Cons
● The advantage of k-fold cross validation is that we can train on (k-1)/k of the data each phase and that all points are used for validation.
● The disadvantage is that k different models need to be trained for every set of hyperparameters being considered, which is slow.
● Only use k-fold cross-validation if you don’t have enough labeled data to split into independent train, validate and test sets.
![Page 139: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/139.jpg)
Regularization for Simplicity
![Page 140: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/140.jpg)
Ways to Prevent Overfitting
Early stopping: Use learning curve to detect overfitting. Stop training at somewhere around the red line.
Are there better approaches?
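A minimal sketch of the early-stopping rule, assuming validation loss is recorded after each evaluation step. The `patience` parameter (how many evaluations without improvement to tolerate) is an assumption for illustration, not something the slide specifies.

```python
def early_stopping_step(validation_losses, patience=3):
    """Return the index of the step whose weights we would keep.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive evaluations; the best step seen so far is returned.
    """
    best_step, best_loss = 0, float("inf")
    for step, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_step, best_loss = step, loss
        elif step - best_step >= patience:
            break  # no improvement for `patience` evaluations: stop training
    return best_step

# Hypothetical validation losses: they fall, then rise as overfitting sets in.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.40]
print(early_stopping_step(losses))  # → 3
```

Note that the final value in the list is never reached: training has already stopped, which is the trade-off early stopping makes.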
![Page 141: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/141.jpg)
Penalizing Model Complexity
● We want to avoid model complexity where possible.
○ remember Occam’s razor
● We can introduce this idea into the optimization we do at training time.
● Empirical Risk Minimization:
minimize: Loss(Data|Model)
(aim for low training error)
![Page 142: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/142.jpg)
Penalizing Model Complexity
● We want to avoid model complexity where possible.
○ Occam’s razor
● We can introduce this idea into the optimization we do at training time.
● Structural Risk Minimization:
minimize: Loss(Data|Model) + complexity(Model)
(aim for low training error, but balance against complexity)
![Page 143: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/143.jpg)
Regularization
● How to define complexity(model)?
● One possible "prior belief" is that weights should be normally distributed around 0
● Diverging from this should incur a cost
● Can encode this idea via L2 regularization
○ complexity(model) = sum of squares of the weights
○ a.k.a. square of the L2 norm of the weight vector
○ Encourages small weights centered around zero
![Page 144: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/144.jpg)
L2 Regularization
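minimize: Loss(Data|Model) + λ * (w1^2 + w2^2 + ... + wn^2)
● Loss(Data|Model) is the training loss: aim for low training loss
● λ * (w1^2 + w2^2 + ... + wn^2) is the complexity term: ...but balance against complexity
● Lambda (λ) controls how these are balanced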
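As a small worked example, the L2-regularized objective can be computed directly. The function names and the sample weights are assumptions for illustration; `lam` plays the role of lambda.

```python
def l2_penalty(weights):
    """complexity(model): the sum of the squared weights."""
    return sum(w * w for w in weights)

def regularized_loss(training_loss, weights, lam):
    """Training loss plus lambda times the L2 penalty."""
    return training_loss + lam * l2_penalty(weights)

weights = [0.5, -2.0, 1.0]
print(l2_penalty(weights))                  # → 5.25
print(regularized_loss(1.0, weights, 0.1))  # 1.0 + 0.1 * 5.25
print(regularized_loss(1.0, weights, 0.0))  # lam = 0 recovers the plain training loss
```

With a larger `lam`, the same set of weights costs more, so the optimizer is pushed toward smaller weights even at the expense of some training loss.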
![Page 145: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/145.jpg)
Linear Classifier
![Page 146: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/146.jpg)
Logistic Regression
● Many problems require a probability estimate as output.
● Logistic Regression is focused on such problems.
● Handy because the probability estimates are calibrated.
○ for example, prob(click) * bid = expected revenue
● Also useful for when we need a binary classification
○ click or not click? → prob(click)
○ spam or not spam? → prob(spam)
![Page 147: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/147.jpg)
Logistic Regression For Classification
● Suppose we have a linear classifier to predict if email is spam or not-spam.
● The goal is to modify how we define and train a linear model so that we can estimate the probability that an email is spam rather than just give a classification.
![Page 148: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/148.jpg)
Logistic Regression: Predictions
Pass the linear model through a sigmoid (pictured to the right).
Recall that the linear model is wTx + b.
[Figure: sigmoid curve; x-axis: LogOdds (wTx + b), y-axis: Probability Output]
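The prediction step can be sketched in a few lines using the standard sigmoid, 1 / (1 + e^(-z)), applied to the log-odds z = wTx + b. The function names and sample numbers are assumptions for illustration, and this sketch does not guard against overflow for very large negative z.

```python
import math

def sigmoid(z):
    """Map log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(weights, bias, features):
    """Pass the linear model wTx + b through the sigmoid."""
    log_odds = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(log_odds)

print(sigmoid(0.0))  # → 0.5
# Log-odds here are 1.0*3.0 + (-2.0)*1.0 + 0.5 = 1.5, so the result is about 0.82.
print(predict_probability([1.0, -2.0], 0.5, [3.0, 1.0]))
```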
![Page 149: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/149.jpg)
LogLoss For Predicting Probabilities
Close relationship to Shannon’s Entropy measure from Information Theory
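A minimal sketch of LogLoss for 0/1 labels, using the standard form -y*log(p) - (1-y)*log(1-p) per example. Averaging over examples and clipping probabilities with `eps` are assumptions made here for convenience; the slide's own formula may sum rather than average.

```python
import math

def log_loss(labels, probabilities, eps=1e-15):
    """Average negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

print(log_loss([1, 0], [0.9, 0.1]))  # confident and right: small loss
print(log_loss([1, 0], [0.1, 0.9]))  # confident and wrong: large loss
```

The loss grows without bound as a wrong prediction approaches full confidence, which is why confident mistakes are penalized so heavily.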
![Page 150: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/150.jpg)
Logistic Regression and Regularization
Regularization is super important for logistic regression.
● Remember the asymptotes
● It’ll keep trying to drive loss to 0 in high dimensions
Two strategies are especially useful:
● L2 regularization -- penalizes huge weights.
● Early stopping -- limiting training steps or learning rate.
![Page 151: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/151.jpg)
Prediction Bias
● Logistic Regression predictions should be unbiased, meaning that: average of predictions ≅ average of observations
● Bias is a sign that something is wrong.
○ Zero bias alone does not mean everything is working.
○ But it’s a great sanity check.
● If you have bias, you have a problem (e.g. incomplete feature set, biased training sample, …)
● Don’t fix bias with a calibration layer, fix it in the model.
● Look for bias in slices of data, this can guide improvements.
![Page 152: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/152.jpg)
Calibration Plots show Bucketed Bias
Each dot represents many examples in the same bucketed prediction range
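Both the overall prediction bias and the per-bucket values behind a calibration plot can be sketched as below. The function names and the choice of equal-width buckets are assumptions for illustration; buckets could also be formed by quantiles.

```python
def prediction_bias(predictions, labels):
    """Average prediction minus average observed label; near 0 when unbiased."""
    return sum(predictions) / len(predictions) - sum(labels) / len(labels)

def calibration_buckets(predictions, labels, n_buckets=5):
    """Group examples by predicted probability and compare means per bucket.

    Returns one (mean prediction, mean label, count) row per non-empty bucket.
    """
    buckets = [[] for _ in range(n_buckets)]
    for p, y in zip(predictions, labels):
        index = min(int(p * n_buckets), n_buckets - 1)  # p == 1.0 goes in the last bucket
        buckets[index].append((p, y))
    rows = []
    for bucket in buckets:
        if bucket:
            mean_p = sum(p for p, _ in bucket) / len(bucket)
            mean_y = sum(y for _, y in bucket) / len(bucket)
            rows.append((mean_p, mean_y, len(bucket)))
    return rows

predictions = [0.1, 0.2, 0.8, 0.9]
labels = [0, 0, 1, 1]
print(prediction_bias(predictions, labels))
for mean_p, mean_y, count in calibration_buckets(predictions, labels):
    print(f"mean prediction = {mean_p:.2f}, mean label = {mean_y:.2f}, n = {count}")
```

Each returned row corresponds to one dot on a calibration plot: a well-calibrated model has mean prediction close to mean label in every bucket, not just overall.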
![Page 153: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/153.jpg)
Evaluation Metrics for Linear Classification
![Page 154: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/154.jpg)
Classification via Thresholding
● Sometimes, we use the output from logistic regression to predict the probability an event occurs.
● Another common use for logistic regression is to use it for binary classification by introducing a threshold.
● The choice of threshold is very important, and it can be tuned.
![Page 155: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/155.jpg)
Evaluation Metrics: Accuracy
● How do we evaluate classification models?
● One possible measure: Accuracy
○ the fraction of predictions we got right
![Page 156: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/156.jpg)
Group Chat: Accuracy
● Devise a scenario in which accuracy would be a misleading metric of model quality.
![Page 157: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/157.jpg)
Accuracy Can Be Misleading
● In many cases, accuracy is a poor or misleading metric. Two common situations that cause this are:
○ Class imbalance, when positives or negatives are extremely rare
○ Different kinds of mistakes have different costs
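A small worked example of the class-imbalance case, using made-up data in which 1 email in 100 is spam. A "model" that never predicts spam still scores 99% accuracy.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Hypothetical imbalanced data: 1 spam email (label 1) among 100.
labels = [1] + [0] * 99
always_not_spam = [0] * 100

print(accuracy(always_not_spam, labels))  # → 0.99
```

The score looks excellent even though the model catches none of the spam.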
![Page 158: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/158.jpg)
Confusion Matrix
Useful to separate out different kinds of errors. Let’s use spam detection as an example, where spam is the positive class (1) and not spam is the negative class (0).

| | ML System Says: Spam | ML System Says: Not Spam |
|---|---|---|
| Truth: Spam | True Positive (TP) | False Negative (FN) |
| Truth: Not Spam | False Positive (FP) | True Negative (TN) |

This is called a confusion matrix.
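The four cells can be counted with a short sketch. The function name and the sample labels are assumptions for illustration; labels use 1 for spam and 0 for not spam.

```python
def confusion_counts(predictions, labels):
    """Count TP, FP, FN, TN for binary labels where 1 = spam and 0 = not spam."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for predicted, actual in zip(predictions, labels):
        if predicted == 1 and actual == 1:
            counts["TP"] += 1
        elif predicted == 1 and actual == 0:
            counts["FP"] += 1
        elif predicted == 0 and actual == 1:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

labels      = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0]
print(confusion_counts(predictions, labels))
# → {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 4}
```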
![Page 159: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/159.jpg)
Confusion Matrix for Multi-Class Problems
● The confusion matrix to the right is for the task of digit prediction
● The strong weights along the diagonal indicate this model is very good.
● Two regions outlined in red illustrate 3 and 5 being confused.
![Page 160: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/160.jpg)
Evaluation Metrics: Precision and Recall
● We return to binary classification (True/False labels)
● Precision: (True Positives) / (All Positive Predictions)
○ When model said “positive” class, was it right?
○ Intuition: Did the model classify as “spam” too often?
● Recall: (True Positives) / (All Actual Positives)
○ Out of all the possible positives, how many did the model correctly identify?
○ Intuition: Did it classify as “not spam” too often?
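Both definitions translate directly into code. Returning 0.0 when a denominator is empty is a convention assumed here, not something the slide specifies.

```python
def precision(tp, fp):
    """Of everything the model flagged positive, what fraction really was positive?"""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Of all actual positives, what fraction did the model find?"""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts: 8 spam caught, 2 good emails flagged, 8 spam missed.
print(precision(tp=8, fp=2))  # → 0.8
print(recall(tp=8, fn=8))     # → 0.5
```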
![Page 161: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/161.jpg)
Precision vs Recall Trade-offs
● A system with high precision might let some spam through, but it is very unlikely to flag good email as spam.
● A system with high recall might flag some good email as spam, but it is very unlikely to let spam through.
● The trick is to balance them -- and which kind of error is more problematic depends a lot on the application. What are some examples of this?
![Page 162: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/162.jpg)
Selecting a Threshold
[Figure: examples placed on a line from 0.0 (Not Spam) to 1.0 (Spam), with decision thresholds A and B marked; legend: TP, FP, FN, TN]
● True label shown by color
● Prediction from model shown as location on the line
● What happens with a larger decision threshold (B vs A)?
○ Prec = TP / (TP + FP)
○ Recall = TP / (TP + FN)
![Page 163: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/163.jpg)
Selecting a Threshold
[Figure: the same line from 0.0 (Not Spam) to 1.0 (Spam) with decision threshold A marked; legend: TP, FP, FN, TN; two examples are labeled FP]
● True label shown by color
● Prediction from model shown as location on the line
● What happens with a larger decision threshold (B vs A)?
○ Prec = TP / (TP + FP)
○ Recall = TP / (TP + FN)
![Page 164: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/164.jpg)
Selecting a Threshold
[Figure: the same line from 0.0 (Not Spam) to 1.0 (Spam) with decision threshold B marked; legend: TP, FP, FN, TN; two examples are labeled FN]
● True label shown by color
● Prediction from model shown as location on the line
● What happens with a larger decision threshold (B vs A)?
○ Prec = TP / (TP + FP)
○ Recall = TP / (TP + FN)
![Page 165: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/165.jpg)
Selecting a Threshold
[Figure: examples placed on a line from 0.0 (Not Spam) to 1.0 (Spam), with decision thresholds A and B marked; legend: TP, FP, FN, TN]
● True label shown by color
● Prediction from model shown as location on the line
● What happens with a larger decision threshold (B vs A)?
○ Prec = TP / (TP + FP) -- Increases since there are fewer FP
○ Recall = TP / (TP + FN) -- Decreases since there are more FN
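The effect of moving the threshold can be checked on a toy example. The scores, labels, and the two threshold values standing in for A and B are all made up for illustration.

```python
def precision_recall_at(threshold, scores, labels):
    """Classify score >= threshold as spam (1), then compute precision and recall."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    return prec, rec

# Hypothetical model scores and true labels (1 = spam).
scores = [0.1, 0.3, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9]
labels = [0,   0,   1,   0,    1,   0,   1,   1]

print(precision_recall_at(0.5, scores, labels))   # threshold A → (0.6, 0.75)
print(precision_recall_at(0.75, scores, labels))  # threshold B → (1.0, 0.5)
```

On this data, raising the threshold removes both false positives (precision rises) but turns one true positive into a false negative (recall falls).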
![Page 166: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/166.jpg)
An ROC Curve
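[Figure: ROC curve; x-axis: FP Rate (0 to 1), y-axis: TP Rate (0 to 1). Each point on the curve is the TP vs. FP rate at one decision threshold; a different point gives the TP vs. FP rate at another decision threshold.]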
![Page 167: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/167.jpg)
Sample ROC Curve
[Figure: examples placed on a line from 0.0 (Not Spam) to 1.0 (Spam)]
FPR = 1/19, TPR = 3/7
FPR = 7/19, TPR = 6/7
![Page 168: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/168.jpg)
Sample ROC Curve
[Figure: examples placed on a line from 0.0 (Not Spam) to 1.0 (Spam)]
FPR = 2/19, TPR = 5/7
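The points of an ROC curve can be computed by sweeping the threshold over the model's scores. This is a sketch on made-up scores and labels (not the 7 positives and 19 negatives from the slide), and the function names are assumptions.

```python
def roc_point(threshold, scores, labels):
    """Return (FPR, TPR) when scores >= threshold are classified as positive."""
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return fp / negatives, tp / positives

def roc_curve(scores, labels):
    """One (FPR, TPR) point per distinct score, plus the (0, 0) end point."""
    thresholds = sorted(set(scores), reverse=True)
    return [(0.0, 0.0)] + [roc_point(t, scores, labels) for t in thresholds]

# Hypothetical model scores and true labels (1 = spam).
scores = [0.1, 0.3, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9]
labels = [0,   0,   1,   0,    1,   0,   1,   1]

for fpr, tpr in roc_curve(scores, labels):
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

Lowering the threshold can only add predicted positives, so both rates are non-decreasing along the curve, which ends at (1, 1) when everything is classified as positive.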
![Page 169: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/169.jpg)
Evaluation Metrics: AUC
● AUC: “Area under the ROC Curve”
● Interpretation:
○ If we pick a random positive and a random negative, what’s the probability my model scores them in the correct relative order?
● Intuition: gives a measure of performance aggregated across all possible classification thresholds
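The ranking interpretation of AUC can be computed directly by comparing every positive with every negative. This is a sketch on made-up data; counting ties as half a win is a common convention assumed here, and the pairwise loop is quadratic, so it only suits small examples.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical model scores and true labels (1 = spam).
scores = [0.1, 0.3, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9]
labels = [0,   0,   1,   0,    1,   0,   1,   1]

print(auc(scores, labels))  # → 0.8125 (13 of the 16 positive/negative pairs are ordered correctly)
```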
![Page 170: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/170.jpg)
Group Chat: AUC
What will happen to the AUC if we multiply each of the predictions (the y') for a given model by 2.0?
![Page 171: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/171.jpg)
F-measure
● Another common way to balance recall and precision is the F-measure, which is the harmonic mean of precision and recall (so it ranges from 0 to 1).
F = 2 * (precision * recall) / (precision + recall)
● Observe that if precision = recall = x then the F-measure is x. Just to give a feel, here are a few sample values.
[Table: sample F-measure values for different precision and recall values]
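The F-measure formula, F = 2 * (precision * recall) / (precision + recall), is easy to evaluate for a few inputs. The sample inputs below are made up, and returning 0 when both precision and recall are 0 is a convention assumed here.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f_measure(0.5, 0.5))  # → 0.5
print(f_measure(1.0, 0.0))  # → 0.0
print(f_measure(0.9, 0.1))  # about 0.18
```

Because the harmonic mean is pulled toward the smaller value, a model cannot score well on the F-measure by maximizing only one of precision or recall.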
![Page 172: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/172.jpg)
Confusion Matrix for Multiclass Data
This confusion matrix provides a visualization for the task of digit recognition (so possible labels are 0, 1, …, 9). The bar on the right shows the mapping from color to the probability. The high values along the diagonal indicate a strong model.
![Page 173: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/173.jpg)
Regularization for Sparsity
![Page 174: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/174.jpg)
Model Complexity
● Remember Occam’s Razor: Increasing the model complexity increases the chances of overfitting.
● Start simple and build up the complexity only when needed. Your final solution might be sophisticated, but your first attempt shouldn’t be.
![Page 175: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/175.jpg)
Let’s Go Back to Feature Crosses
Caveat: Sparse feature crosses may significantly increase feature space
Possible issues:
○ Model size may become huge
○ Model requires more data and longer to train
○ Overfitting becomes more of a problem
![Page 176: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/176.jpg)
Reasons to Want Sparsity
● Would be nice to encourage weights to be exactly 0 where possible
○ Saves RAM, may reduce overfitting
○ Also those features can be removed from the model and no longer gathered and stored
![Page 177: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/177.jpg)
L1 Regularization
● L0 regularization: penalize a weight for being non-zero
○ Computationally hard and too expensive to compute
● Relax to L1 regularization:
○ Penalize sum of abs(weights)
○ Convex problem so computationally feasible
○ Encourages sparsity (unlike L2)
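One way to see why L1 encourages exact zeros while L2 does not is to compare a single shrinkage step under each penalty. The update rules below (soft-thresholding for L1, proportional shrinkage for L2) are a swapped-in illustration, not a method taken from the slides, and the weights and step size are made up.

```python
def l1_penalty(weights):
    """Sum of absolute values of the weights."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Sum of squares of the weights."""
    return sum(w * w for w in weights)

def l1_shrink(w, step):
    """Soft-thresholding: move w toward 0 by `step`, stopping exactly at 0."""
    if w > step:
        return w - step
    if w < -step:
        return w + step
    return 0.0

def l2_shrink(w, step):
    """Shrink w proportionally; it gets smaller but never reaches exactly 0."""
    return w / (1.0 + 2.0 * step)

weights = [3.0, -0.05, 0.02, -1.5]
print([l1_shrink(w, 0.1) for w in weights])  # the two small weights become exactly 0.0
print([l2_shrink(w, 0.1) for w in weights])  # every weight stays non-zero
```

Weights that land exactly on 0 correspond to features that can be dropped from the model, which is the sparsity benefit described above.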
![Page 178: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/178.jpg)
Visualizing L1 and L2 Regularization in 2D
[Figure: a point outside the L1 and L2 balls that we want to bring back to the edge of a ball.]
● The L2 ball looks like a circle; the nearest point on it has two non-zero coordinates
● The L1 ball looks like a diamond; the nearest point on it is a corner, which has a coordinate with value exactly 0.0
![Page 179: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/179.jpg)
Intro to Neural Networks
![Page 180: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/180.jpg)
Other Ways to Learn Nonlinearities
● One alternative to feature crosses:
  ○ Structure the model so that features are used together as in feature crosses
  ○ Then combine those combinations to get even more intricate features
● How can we choose these combinations?
  ○ Let the machine learning model discover them
![Page 181: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/181.jpg)
A Linear Model
Input: x = (x1, x2, x3)
Output: y’ = w1x1 + w2x2 + w3x3 + b
[Figure: inputs x1, x2, x3 connected to a single output node by edges weighted w1, w2, w3.]
Note: the bias b is part of the model, though it is not pictured.
![Page 182: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/182.jpg)
Adding Another Linear Layer
[Figure: network with layers, bottom to top: Input, Hidden Layer, Output.]
![Page 183: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/183.jpg)
Adding Another Linear Layer
![Page 184: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/184.jpg)
Adding Another Linear Layer
[Figure: network with layers, bottom to top: Input, Hidden Layer, Output, with weights w1, w2, w3 and bias b labeled.]
![Page 185: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/185.jpg)
Adding Another Linear Layer
[Figure: network with layers, bottom to top: Input, Hidden Layer, Output, with weights w1, w2, w3 and bias b labeled.]
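One well-known consequence of stacking linear layers is that the result is still a single linear function of the input, which is why a non-linearity is needed. The sketch below checks this numerically; the layer sizes and random weights are assumptions made for illustration, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 3 inputs -> 4 hidden units -> 1 output.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

def two_linear_layers(x):
    hidden = W1 @ x + b1          # linear hidden layer, no activation
    return W2 @ hidden + b2       # linear output layer

# The same function written as one linear layer.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2

x = np.array([1.0, -2.0, 0.5])
print(two_linear_layers(x), W_combined @ x + b_combined)
```

Both printed values agree up to floating-point error, so the extra linear layer adds parameters without adding expressive power.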
![Page 186: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/186.jpg)
ReLU: A Simple Non-Linear Function
ReLU (Rectified Linear Unit)
F(x) = max(0, x)
![Page 187: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/187.jpg)
Adding a Non-Linearity
[Figure: network with layers, bottom to top: Input, Hidden Layer (Linear), Non-Linear Transformation Layer (a.k.a. Activation Function), Output.]
By convention, we fold the activation function into the hidden layer, which makes the hidden layer a non-linear function. The network in the figure is therefore drawn with a single non-linear hidden layer.
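As a minimal sketch of a hidden layer with the activation folded in, the code below runs a forward pass through one ReLU hidden layer. The function names and the hand-picked weights are assumptions for illustration; with these weights the network computes |x|, a function no purely linear model can represent.

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def forward(x, W1, b1, w2, b2):
    """One hidden layer with a ReLU activation, then a linear output."""
    hidden = relu(W1 @ x + b1)
    return w2 @ hidden + b2

# Hand-picked weights (assumed) so that the output is relu(x) + relu(-x) = |x|.
W1 = np.array([[1.0], [-1.0]])
b1 = np.zeros(2)
w2 = np.array([1.0, 1.0])
b2 = 0.0

for value in (-3.0, 0.0, 2.0):
    print(value, forward(np.array([value]), W1, b1, w2, b2))
```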
![Page 188: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/188.jpg)
Neural Nets Can Be Arbitrarily Complex
[Figure: network with layers, bottom to top: Input, Hidden1, Hidden2, Output.]
Training is done with the backpropagation algorithm: gradient descent in a highly non-convex space. We will describe this soon.
Demo 1, Demo 2, Demo 3
![Page 189: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/189.jpg)
Training Neural Nets
![Page 190: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/190.jpg)
Intro to Backpropagation
Neural networks are trained with an algorithm called backpropagation, which makes gradient descent feasible for multi-layer networks.
● Backpropagation algorithm visual explanation
![Page 191: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/191.jpg)
Backprop: What You Need To Know
● Gradients are important
  ○ If it’s differentiable, we can probably learn on it
● Gradients can vanish
  ○ Each additional layer can successively reduce signal vs. noise
  ○ ReLUs are useful here
● Gradients can explode
  ○ Learning rates are important here
  ○ Batch normalization (useful knob) can help
● ReLU layers can die
  ○ Lower your learning rates
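The vanishing-gradient point can be illustrated with a simplified sketch that multiplies only the activation derivatives along one path through 20 layers. It ignores the weights entirely, and the pre-activation value 0.5 is an assumption. The sigmoid derivative is at most 0.25, so the product shrinks quickly; the ReLU derivative is 1 for active units and 0 for inactive ones, which is also why a unit stuck in the negative region passes no gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # never larger than 0.25

def relu_grad(z):
    return (z > 0).astype(float)  # 1 for active units, 0 otherwise

def chained_gradient(grad_fn, pre_activations):
    """Product of per-layer activation derivatives along one path."""
    return float(np.prod(grad_fn(pre_activations)))

z = np.full(20, 0.5)              # assumed pre-activation at each of 20 layers
print("sigmoid:", chained_gradient(sigmoid_grad, z))
print("relu:   ", chained_gradient(relu_grad, z))
```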
![Page 192: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/192.jpg)
Review: Normalizing Feature Values
● We’d like our features to have reasonable scales
  ○ Roughly zero-centered, [-1, 1] range often works well
  ○ Helps gradient descent converge; avoids the NaN trap
  ○ Avoiding outlier values can also help
● Can use a few standard methods:
  ○ Linear scaling
  ○ Hard cap (clipping) to max, min
  ○ Log scaling
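The three methods listed above can be sketched in NumPy as follows. The function names and the sample values are assumptions for illustration; `linear_scale` also assumes the feature is not constant, since it divides by the observed range.

```python
import numpy as np

def linear_scale(x, lo=-1.0, hi=1.0):
    """Linearly map the observed [min, max] range onto [lo, hi]."""
    x = np.asarray(x, dtype=float)
    return lo + (x - x.min()) * (hi - lo) / (x.max() - x.min())

def clip(x, min_value, max_value):
    """Hard cap: values outside [min_value, max_value] are moved to the edge."""
    return np.clip(x, min_value, max_value)

def log_scale(x):
    """Compress a long-tailed, non-negative feature with log(1 + x)."""
    return np.log1p(np.asarray(x, dtype=float))

rooms_per_person = np.array([0.5, 1.0, 2.0, 4.0, 55.0])  # made-up values, one outlier
print(linear_scale(rooms_per_person))
print(clip(rooms_per_person, 0.0, 4.0))
print(log_scale(rooms_per_person))
```

Note how the single outlier dominates the linearly scaled values, which is one reason clipping or log scaling is often applied first.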
![Page 193: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/193.jpg)
Dropout Regularization
● Dropout: Another form of regularization, useful for Neural Networks
● Works by randomly “dropping out” unit activations in a network for a single gradient step
● The more you drop out, the stronger the regularization
  ○ 0.0 = no dropout regularization
  ○ 1.0 = drop everything out; the model learns nothing
  ○ Intermediate values are more useful
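A minimal sketch of dropout on a vector of activations is shown below. It uses the common "inverted dropout" variant, which rescales the surviving units so the expected activation stays the same; that rescaling is an assumption about the implementation and is not stated on the slide.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` for one training step."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1); rate = 1.0 would drop everything")
    keep_mask = rng.random(activations.shape) >= rate
    # Inverted dropout: rescale survivors so the expected activation is unchanged.
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(42)
hidden = np.ones(10)
print(dropout(hidden, 0.0, rng))   # rate 0.0: activations pass through unchanged
print(dropout(hidden, 0.5, rng))   # rate 0.5: each unit is either 0.0 or 2.0
```

A fresh random mask is drawn on every call, which matches the idea of dropping units for a single gradient step.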
![Page 194: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/194.jpg)
Multi-Class Neural Nets
![Page 195: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/195.jpg)
More than two classes?
● Logistic regression gives useful probabilities for binary-class problems.
  ○ spam / not-spam
  ○ click / not-click
● What about multi-class problems?
  ○ animal, vegetable, mineral
  ○ red, orange, yellow, green, blue, indigo, violet
  ○ apple, banana, car, cardiologist, ..., walk sign, zebra, zoo
![Page 196: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/196.jpg)
One-Vs-All Multi-Class
● Create a unique output for each possible class
● Train each output on a signal of “my class” vs. “all other classes”
● Typically optimize logistic loss
[Figure: network with layers, bottom to top: hidden, hidden, logits, one-vs-all (sigmoid). One output per class: apple: yes/no? bear: yes/no? candy: yes/no? dog: yes/no? egg: yes/no?]
![Page 197: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/197.jpg)
SoftMax Multi-Class
● Add an additional constraint: the outputs of all nodes must sum to 1.0
  ○ Allows outputs to be interpreted as probabilities
● Typically optimize cross-entropy loss (a measure of how far the predicted distribution is from the target distribution)
[Figure: network with layers, bottom to top: hidden, hidden, logits, softmax. One output per class: apple, bear, candy, dog, egg.]
![Page 198: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/198.jpg)
SoftMax Intuition
Softmax definition, where K is the set of possible labels and $z_k$ is the logit for label $k$. The prediction for label $j$ is:

$$p(y = j \mid x) = \frac{e^{z_j}}{\sum_{k \in K} e^{z_k}}$$

For example, if we take an input of [1, 2, 3, 4, 1, 2, 3], the softmax is [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175].
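The example values on this slide can be checked with a short NumPy sketch. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick that does not change the result; it is an implementation detail assumed here, not something the slide mentions.

```python
import numpy as np

def softmax(logits):
    """Softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float)
    exp_z = np.exp(z - z.max())    # subtracting the max avoids overflow
    return exp_z / exp_z.sum()

probs = softmax([1, 2, 3, 4, 1, 2, 3])
print(np.round(probs, 3))
print(probs.sum())
```

The largest logit (4) receives most of the probability mass, and the outputs sum to 1.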
![Page 199: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/199.jpg)
SoftMax Options
● Full SoftMax
  ○ Brute force: calculates the denominator over all classes. This is only feasible when there is a relatively small number of classes.
● Candidate Sampling
  ○ Calculates the denominator for all the positive classes, but only for a random sample of the negatives. This can scale to millions of classes, since the number of positive classes for any single example is generally small.
![Page 200: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/200.jpg)
Single-Label vs. Multi-Label
Multi-Class, Single-Label Classification:
  ○ An example may be a member of only one class.
  ○ The constraint that classes are mutually exclusive is important if you want to be able to treat the predictions as probabilities.
Multi-Class, Multi-Label Classification:
  ○ An example may be a member of more than one class.
  ○ No additional constraints on class membership to exploit.
  ○ One logistic regression loss for each possible class.
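The difference between the two settings shows up in how the same logits are turned into outputs. In this sketch (the logit values are made up), independent sigmoids give one yes/no probability per class, as in the multi-label case, while softmax produces a single distribution over mutually exclusive classes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.5, -1.0])   # made-up logits for three classes

multi_label = sigmoid(logits)          # one independent yes/no probability per class
single_label = softmax(logits)         # one distribution shared by all classes

print(multi_label, multi_label.sum())    # sum is not constrained to 1
print(single_label, single_label.sum())  # sums to 1
```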
![Page 201: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/201.jpg)
What to use, when?
One-vs-all is less computationally expensive than SoftMax, but at the expense of not having calibration across the labels.
Use One-vs.-All Multi-Class when:
  ○ You want to view the output for each class as a prediction of the probability that the example belongs to the given class
  ○ You do not need to rank the classes (and thus can reduce the computational cost by not calibrating across the classes).
Use SoftMax Multi-Class when:
  ○ You want to rank all classes as to which are the best fit
![Page 202: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/202.jpg)
Embeddings
![Page 203: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/203.jpg)
Motivation From Collaborative Filtering
● Input: 1,000,000 users and which of 500,000 movies the user has chosen to watch
● Task: Recommend movies to users
To solve this problem some method is needed to determine which movies are similar to each other.
![Page 204: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/204.jpg)
Organizing Movies by Similarity (1d)
[Figure: movies placed along a single line, including Shrek; Incredibles; Harry Potter; Star Wars; The Dark Knight Rises; The Triplets of Belleville; Memento; Bleu.]
![Page 205: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/205.jpg)
Organizing Movies by Similarity (2d)
[Figure: the same movies placed in a two-dimensional plane: Shrek; Incredibles; Harry Potter; Star Wars; The Dark Knight Rises; The Triplets of Belleville; Memento; Bleu.]
![Page 206: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/206.jpg)
Two-Dimensional Embedding
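[Figure: two-dimensional embedding with one axis running from Children to Adult and the other from Blockbuster to Arthouse. Movies plotted: Shrek; Incredibles; Harry Potter; Star Wars; The Dark Knight Rises; Crouching Tiger, Hidden Dragon; School of Rock; The Triplets of Belleville; Wallace and Gromit; Waking Life; Memento; Bleu; Hero.]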
![Page 207: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/207.jpg)
Two-Dimensional Embedding
[Figure: the same two-dimensional embedding (axes Children to Adult and Blockbuster to Arthouse), with two of the points annotated with their coordinates: (-1.0, 0.95) and (0.65, -0.2).]
![Page 208: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/208.jpg)
d-Dimensional Embeddings
● Assumes user interest in movies can be roughly explained by d aspects
● Each movie becomes a d-dimensional point, where the value in each dimension represents how much the movie fits the corresponding aspect
● Embeddings can be learned from data
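Once items are points in a d-dimensional space, "similar" can be read as "nearby". The sketch below uses made-up 2-D coordinates (not the values from the slides) and a hypothetical helper `nearest` to find the closest movie by Euclidean distance.

```python
import numpy as np

# Made-up 2-D coordinates for illustration; they are not the values from the slides.
embeddings = {
    "Shrek": np.array([-1.0, 0.9]),
    "Incredibles": np.array([-0.8, 0.8]),
    "Memento": np.array([0.9, -0.3]),
}

def nearest(title, table):
    """Return the other item whose embedding is closest in Euclidean distance."""
    others = [t for t in table if t != title]
    return min(others, key=lambda t: np.linalg.norm(table[t] - table[title]))

print(nearest("Shrek", embeddings))
```

In a learned embedding the coordinates come from training rather than being set by hand, but the lookup works the same way.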
![Page 209: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/209.jpg)
Learning Embeddings in a Deep Network
● No separate training process needed -- the embedding layer is just a hidden layer with one unit per dimension
● Supervised information (e.g. users watched the same two movies) tailors the learned embeddings for the desired task
● Intuitively the hidden units discover how to organize the items in the d-dimensional space in a way to best optimize the final objective
![Page 210: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/210.jpg)
Input Representation
● Each example (a row in this matrix) is a sparse vector of features (movies) that have been watched by the user
● The dense representation of this example, (0, 1, 0, 1, 0, 0, 0, 1), is not efficient in terms of space and time
![Page 211: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/211.jpg)
Input Representation
1. Build a dictionary mapping each feature to an integer from 0, …, # movies - 1
2. Efficiently represent the sparse vector as just the movies the user watched
[Figure: a row of movies indexed 0, 1, 2, 3, ..., 999999, with the watched movies at indices 1, 3, and 999999.]
Represented as: (1, 3, 999999)
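The two steps above can be sketched in plain Python. The 8-movie catalog and the helper names are hypothetical; the catalog is sized so that the dense form matches the (0, 1, 0, 1, 0, 0, 0, 1) example from the previous slide.

```python
def build_vocabulary(all_movies):
    """Map each movie title to an integer index 0 .. len(all_movies) - 1."""
    return {title: index for index, title in enumerate(all_movies)}

def to_sparse(watched, vocabulary):
    """Represent a user as the sorted indices of the movies they watched."""
    return sorted(vocabulary[title] for title in watched)

def to_dense(sparse_indices, num_movies):
    """Expand back to a 0/1 vector; only practical for small catalogs."""
    dense = [0] * num_movies
    for index in sparse_indices:
        dense[index] = 1
    return dense

catalog = ["A", "B", "C", "D", "E", "F", "G", "H"]   # hypothetical 8-movie catalog
vocab = build_vocabulary(catalog)
user = to_sparse({"B", "D", "H"}, vocab)
print(user)
print(to_dense(user, len(catalog)))
```

The sparse form stores three integers regardless of catalog size, whereas the dense form grows with the number of movies.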
![Page 212: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/212.jpg)
An Embedding Layer in a Deep Network
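Regression problem to predict home sales prices:
[Figure: the words in a real estate ad enter as a sparse vector encoding and pass through a 3-dimensional embedding; the embedding is combined with latitude and longitude features; the network predicts Sale Price and is trained with L2 loss.]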
![Page 213: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/213.jpg)
An Embedding Layer in a Deep Network
Multiclass classification to predict a handwritten digit:
[Figure: the raw bitmap of the hand-drawn digit enters as a sparse vector encoding and passes through a 3-dimensional embedding, alongside other features; a logit layer over the digits 0 to 9 is trained with softmax loss against the target class label, given as a sparse “one-hot” target probability distribution.]
![Page 214: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/214.jpg)
An Embedding Layer in a Deep Network
Collaborative filtering to predict movies to recommend:
[Figure: a subset of the user's movies enters as input features in a sparse vector encoding and passes through a 3-dimensional embedding, alongside optional other features; a logit layer is trained with softmax loss against a sparse target probability distribution built from another subset of the user's movies used as “labels”.]
![Page 215: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/215.jpg)
Correspondence to Geometric View
● Each of the hidden units corresponds to a dimension (latent feature)
● Edge weights between a movie and the hidden layer are coordinate values
[Figure: a deep network shown next to the geometric view of a single movie embedding, the point (0.9, 0.2, 0.4) on x, y, z axes.]
![Page 216: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/216.jpg)
Selecting How Many Embedding Dims
● Higher-dimensional embeddings can more accurately represent the relationships between input values
● But more dimensions increase the chance of overfitting and lead to slower training
● Empirical rule-of-thumb:
  ○ A good starting point, but it should be tuned using the validation data.
![Page 217: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/217.jpg)
Embeddings as a Tool
● Embeddings map items (e.g. movies, text,...) to low-dimensional real vectors in a way that similar items are close to each other
● Embeddings can also be applied to dense data (e.g. audio) to create a meaningful similarity metric
● Jointly embedding diverse data types (e.g. text, images, audio, …) defines a similarity between them
![Page 218: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/218.jpg)
Unsupervised Learning: Clustering
![Page 219: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/219.jpg)
What is Clustering?
Given
● a set of items
● a similarity metric on these items
Identify groups of the most similar items
![Page 220: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/220.jpg)
Some Applications of Clustering
● Grouping similar sets of items
● Data compression
● Generate features for ML systems
● Transfer information learned from one setting to another (Transfer Learning)
![Page 221: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/221.jpg)
How “Hard” is Clustering?
[Figure: hard clustering vs. soft clustering.]
![Page 222: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/222.jpg)
How “Hard” is Clustering?
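Rows are items; columns are clusters.

Hard (each item belongs to exactly one cluster):
0    0    1    0
1    0    0    0
.    .    .    .
0    1    0    0

Soft (each item has a degree of membership in every cluster):
0.1  0    0.7  0.2
0.8  0.1  0.1  0
.    .    .    .
0.2  0.6  0.1  0.1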
![Page 223: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/223.jpg)
Is Clustering Supervised?
● No, if similarity is given (rarely the case)
● Otherwise ask “Is similarity learned?”
  ○ Unsupervised: similarity is designed ad hoc from features
  ○ Supervised: similarity is derived from embeddings via a supervised Deep Neural Network
![Page 224: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/224.jpg)
Understand Your Similarity Measure!
● Ad hoc similarity:
  ○ Handle heterogeneous features with care
    ■ Numerical data of different scales, different distributions (e.g. Home Price, Lot size, #rooms). A good generic solution: use quantiles
    ■ Categorical data of different cardinality (e.g. Zip code, Home type)
    ■ Missing data: ignore if rare, otherwise infer via ML
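As a sketch of the quantile idea, the helper below replaces each value with its rank scaled to [0, 1], so features with very different scales and distributions become directly comparable. The function name and the sample values are assumptions for illustration, and ties are broken by position rather than averaged.

```python
import numpy as np

def to_quantiles(values):
    """Replace each value by its rank scaled to [0, 1] (ties broken by order)."""
    values = np.asarray(values, dtype=float)
    ranks = np.argsort(np.argsort(values))
    return ranks / (len(values) - 1)

home_price = np.array([150_000, 200_000, 250_000, 2_000_000])   # made-up values
num_rooms = np.array([2, 3, 4, 12])                             # made-up values
print(to_quantiles(home_price))
print(to_quantiles(num_rooms))
```

Both features map to the same quantile values here, even though one ranges over millions and the other over single digits.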
![Page 225: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/225.jpg)
Clustering in Euclidean Space
● Designed with this in mind
● Applied to this
![Page 226: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/226.jpg)
Quality Analysis of Clustering
● Clustering is unsupervised (given similarity): no ground truth to compare to
● Quality analysis is done by
  ○ Eyeballing
  ○ When clustering is a piece of a larger system, measuring it in the global system
![Page 227: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/227.jpg)
k-Means
● Simple
● Efficient
● Good basic clustering algorithm
![Page 228: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/228.jpg)
k-Means
● Initialization
  ○ Select k cluster centers randomly
![Page 229: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/229.jpg)
k-Means in action
● Initialization
  ○ Select k cluster centers randomly
● E-Step (Expectation)
  ○ Assign each point to the closest cluster
![Page 230: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/230.jpg)
k-Means in action
● Initialization
  ○ Select k cluster centers randomly
● E-Step (Expectation)
  ○ Assign each point to the closest cluster
● M-Step (Maximization)
  ○ Recompute centers as the mean x and y coordinates of the points currently assigned to each cluster
![Page 231: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/231.jpg)
k-Means in action
● Initialization
  ○ Select k cluster centers randomly
● E-Step (Expectation)
  ○ Assign each point to the closest cluster
● M-Step (Maximization)
  ○ Recompute centers as the mean x and y coordinates of the points currently assigned to each cluster
● Repeat the E-Step and M-Step
![Page 232: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/232.jpg)
k-Means in action
● Initialization
  ○ Select k cluster centers randomly
● E-Step (Expectation)
  ○ Assign each point to the closest cluster
● M-Step (Maximization)
  ○ Recompute centers
● Repeat the E-Step and M-Step
![Page 233: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/233.jpg)
k-Means in action
● Repeat until convergence
  ○ Always converges: stop when no item changes cluster in the E-Step
  ○ Minimizes within-cluster distance
● The result depends on initialization
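The initialization, E-step, M-step, and stopping rule described on these slides can be sketched in NumPy as follows. This is a minimal illustration under assumptions: centers are initialized from randomly chosen data points, an empty cluster keeps its old center, and the two synthetic blobs are made up for the demo.

```python
import numpy as np

def k_means(points, k, rng, max_iters=100):
    """Plain k-means: returns (centers, assignments)."""
    # Initialization: pick k distinct data points as the starting centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    assignments = None
    for _ in range(max_iters):
        # E-step: assign each point to its closest center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = distances.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break                                  # no point changed cluster
        assignments = new_assignments
        # M-step: move each center to the mean of its assigned points.
        for j in range(k):
            members = points[assignments == j]
            if len(members) > 0:                   # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers, assignments

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2))
centers, labels = k_means(np.vstack([blob_a, blob_b]), k=2, rng=rng)
print(centers)
```

With well-separated blobs like these the result is stable, but on harder data a different seed can give a different clustering, which is the dependence on initialization noted above.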
![Page 234: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/234.jpg)
k-Means for Different Initializations
![Page 235: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/235.jpg)
Understand Your Data/Similarity Measure!
Why does k-means produce such drastically different results for visually similar data?
![Page 236: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/236.jpg)
k-Means Summary
● Cons
  ○ Need to determine k
  ○ Result depends on the initialization
  ○ “Curse of dimensionality” -- cost grows exponentially with the dimensionality of the data (points)
● Pros
  ○ Simple & efficient
  ○ Guaranteed convergence
  ○ Can be warm-started
  ○ Generalization: soft clusters, non-spherical clusters
![Page 237: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/237.jpg)
Course Review
![Page 238: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/238.jpg)
What is Machine Learning (ML)?
There are many ways to define ML.
● ML systems learn how to combine data to produce useful predictions on never-before-seen data
● ML algorithms find patterns in data and use these patterns to react correctly to brand new data.
![Page 239: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/239.jpg)
Goals of This Class
● Learn to take a real-life problem and apply machine learning to make predictions.
● Learn to implement Machine Learning solutions using TensorFlow
● Learn how to evaluate the quality of your solution
● Machine Learning is a very broad field -- we only just touch upon some of the most common machine learning algorithms
![Page 240: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/240.jpg)
Linear Regression
● y’ = wx + b
● Minimize mean squared error (or RMSE)
● In general have: y’ = b + w1x1 + … + wnxn
● Learn b, w1, w2, …, wn from data
[Figure: fitted line with w = 129.25, b = 0.382]
![Page 241: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/241.jpg)
Learn Model Weights with SGD
Use Learning Curve as shown on right to see when the model loss is no longer improving
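As a rough illustration of how SGD can learn w and b for y’ = wx + b, here is a minimal pure-Python sketch. The toy data, the learning rate, the per-example updates (rather than mini-batches), and the name `sgd_linear_fit` are all assumptions made for illustration and do not come from the course labs. The function records the mean squared error after each epoch, which is the kind of series a learning curve plots.

```python
import random

def sgd_linear_fit(xs, ys, learning_rate=0.05, epochs=500, seed=0):
    """Fit y' = w*x + b with per-example SGD on squared error.

    Returns (w, b, losses), where losses[i] is the mean squared error
    over the full data set after epoch i (a simple learning curve).
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    losses = []
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new random order
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of error**2 with respect to w and b.
            w -= learning_rate * 2 * error * xs[i]
            b -= learning_rate * 2 * error
        mse = sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        losses.append(mse)
    return w, b, losses

# Hypothetical toy data generated from y = 3x + 1 (no noise).
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [3 * x + 1 for x in xs]
w, b, losses = sgd_linear_fit(xs, ys)
```

When the values in `losses` stop decreasing, further steps are unlikely to help on this data.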
![Page 242: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/242.jpg)
Negative Correlation
If a feature is negatively correlated with the label, then its weight will be negative.
w = -772.25
b = 36806.86
![Page 243: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/243.jpg)
RMSE - Root Mean Squared Error
For training set D composed of examples x = <x0 , x1 , x2 , …, xn> with correct label y, and the prediction of the current model y’:
$$\mathrm{RMSE} = \sqrt{\frac{1}{|D|} \sum_{(x,\,y) \in D} (y - y')^2}$$
Our goal is to train a model that minimizes RMSE (which is the same as minimizing the mean squared error).
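A minimal sketch of computing RMSE, the square root of the mean of the squared differences between labels and predictions. The function name `rmse` and the sample numbers are hypothetical. Because the square root is increasing, whichever model has the lowest mean squared error also has the lowest RMSE.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between labels y and predictions y'."""
    squared_errors = [(y - y_pred) ** 2 for y, y_pred in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical labels and predictions: squared errors are 1, 0 and 4.
print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))  # sqrt(5/3), about 1.29
```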
![Page 244: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/244.jpg)
Using TensorFlow to Do the Computation
import tensorflow as tf

# Define a linear regression model.
estimator = tf.contrib.learn.LinearRegressor(
    optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.001))

# Fit the model on training data -- minimizes L2 loss
estimator.fit(X_train, y_train, steps=10000, batch_size=100)

# Use it to predict on new data
predictions = estimator.predict(X_test)
![Page 245: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/245.jpg)
Linear Regressor With Many Features
● We generally have more than 2 features in a linear regressor. Sample TensorFlow code:
linear_regressor = tf.contrib.learn.LinearRegressor(
    feature_columns=[age, education_num, age_buckets, capital_gain_buckets,
                     capital_loss_buckets, gender, race, education, occupation,
                     native_country, workclass, education_x_age_buckets,
                     gender_x_education_x_race],
    optimizer=SGDoptimizer,
    gradient_clip_norm=5.0)
● Calibration plot is a good way to visualize since a scatter plot doesn’t scale well to higher dimensions.
![Page 246: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/246.jpg)
Calibration Plot
A good way to visualize whether a regression model is good (just having the loss converge does not guarantee this!)
![Page 247: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/247.jpg)
From Raw Features to Better Features
● How can we best use numeric features that vary a lot in their range, e.g. the purchase price of a car and the age of the car?
● How can we use numeric features like zipcode that are very different from a feature like the number of bedrooms in an apartment?
● How can we make use of non-numeric features (e.g. street address, apt type)?
![Page 248: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/248.jpg)
Transformations for Real-Valued Features
![Page 249: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/249.jpg)
Ways to Introduce Non-Linear Behavior
● Divide real-valued features into buckets/bins
○ Good strategy is to use quantiles so there are roughly the same number of examples in each bin
● Take the cross product of any set of bucketized and sparse features.
● Numeric features (e.g. zipcode) can be treated as categorical data and converted to sparse features
![Page 250: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/250.jpg)
Reason to Bucketize (Bin) Features
If you want to predict city-mpg from the compression-ratio, a single linear function would not fit well, but you can get a pretty good fit by dividing compression ratio into two buckets and then learning a linear model for each bucket.
![Page 251: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/251.jpg)
Creating Bins by Quantiles
Each bin is a Boolean variable with each car price falling into a single bin.
● So if a car had a price of $9000 then the values of the variables associated with the bins would be:
car_price_under_6649: 0
car_price_6649_to_7349: 0
car_price_7349_to_7999: 0
car_price_7999_to_9095: 1
car_price_9095_to_10295: 0
car_price_10295_to_12440: 0
car_price_12440_to_15250: 0
car_price_15250_to_17199: 0
car_price_17199_to_22470: 0
car_price_22470_and_up: 0
● Each bin becomes an independent feature with a weight learned for it that only applies when the feature is 1
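The bucketing above can be sketched in a few lines of Python. The boundaries are read off the bin names on the slide; the helper name `one_hot_bin` and the choice that a price exactly on a boundary goes to the higher bin are assumptions, since the slide does not say how ties are handled.

```python
from bisect import bisect_right

# Bin boundaries taken from the bin names on the slide.
BOUNDARIES = [6649, 7349, 7999, 9095, 10295, 12440, 15250, 17199, 22470]

def one_hot_bin(price, boundaries=BOUNDARIES):
    """Return a list of 0/1 values with a single 1 marking the price's bin."""
    # Assumption: a price equal to a boundary falls into the higher bin.
    index = bisect_right(boundaries, price)
    return [1 if i == index else 0 for i in range(len(boundaries) + 1)]

print(one_hot_bin(9000))  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```

With quantile boundaries, each of the ten positions would be set to 1 for roughly the same number of training examples.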
![Page 252: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/252.jpg)
Encoding Categorical Data
Sample ad: “1995 Honda Civic good condition”
Representation (vocabulary words, with the value of each word for this ad underneath):
… 1993 1994 1995 1996 1997 … poor good excellent … Honda Toyota … Civic Accord … condition
… 0 0 1 0 0 … 0 1 0 … 1 0 … 1 0 … 1
If I really showed all the words in the vocabulary, very few of them would occur in any given ad. This is what we mean by saying the representation is sparse.
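A small sketch of this encoding, assuming a vocabulary made up only of the words visible on the slide (a real vocabulary would be far larger). The names `sparse_encode` and `to_dense` are hypothetical. The sparse form stores only the positions of the words that occur, which is why it stays small even when the vocabulary is large.

```python
# Hypothetical vocabulary: only the words visible on the slide.
VOCABULARY = ["1993", "1994", "1995", "1996", "1997", "poor", "good",
              "excellent", "Honda", "Toyota", "Civic", "Accord", "condition"]

def sparse_encode(text, vocabulary=VOCABULARY):
    """Return the sorted vocabulary indices of the words present in text."""
    index_of = {word: i for i, word in enumerate(vocabulary)}
    return sorted({index_of[w] for w in text.split() if w in index_of})

def to_dense(indices, size):
    """Expand a list of active indices into a dense 0/1 vector."""
    active = set(indices)
    return [1 if i in active else 0 for i in range(size)]

ad = "1995 Honda Civic good condition"
indices = sparse_encode(ad)
print(indices)                             # [2, 6, 8, 10, 12]
print(to_dense(indices, len(VOCABULARY)))  # [0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1]
```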
![Page 253: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/253.jpg)
Feature Crosses For Correlated Features
Rainfall: None, Low, Med, High
Temperature: Cold, Cool, Warm, Hot
Rainfall x Temperature (rows are Rainfall, columns are Temperature):
Cold Cool Warm Hot
None w1 w2 w3 w4
Low w5 w6 w7 w8
Med w9 w10 w11 w12
High w13 w14 w15 w16
Color encodes the value of the weight with red being low and green being high
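To make the cross concrete, here is a sketch that maps a (rainfall, temperature) pair to the position of its weight in the 4 x 4 table. The row-major numbering matches w1 … w16 as laid out on the slide; the function name `cross_index` is hypothetical.

```python
RAINFALL = ["None", "Low", "Med", "High"]
TEMPERATURE = ["Cold", "Cool", "Warm", "Hot"]

def cross_index(rainfall, temperature):
    """Return the 0-based index of the crossed (rainfall, temperature) feature."""
    row = RAINFALL.index(rainfall)
    col = TEMPERATURE.index(temperature)
    return row * len(TEMPERATURE) + col

# Index 0 corresponds to w1 in the table, index 15 to w16.
print(cross_index("Low", "Warm") + 1)  # 7, i.e. weight w7
```

Each of the 16 combinations gets its own weight, so the model can learn, for example, a different effect of warm weather at each rainfall level.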
![Page 254: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/254.jpg)
TensorFlow - hash bucket size
# Sample of creating a categorical column with known values
race = tf.contrib.layers.sparse_column_with_keys(
    column_name="race",
    keys=["White", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other", "Black"])
# Sample of creating a categorical column with a hash bucket
education = tf.contrib.layers.sparse_column_with_hash_bucket(
    "education", hash_bucket_size=50)
# Sample of creating a cross
gender_x_education_x_race = tf.contrib.layers.crossed_column(
    [gender, education, race], hash_bucket_size=1000)
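The idea behind a hash bucket can be sketched without TensorFlow. This is only an illustration of the concept: it uses SHA-256 from the Python standard library, which is not the hash function TensorFlow uses, and the names `hash_bucket` and `crossed_bucket` as well as the sample values are hypothetical.

```python
import hashlib

def hash_bucket(value, hash_bucket_size):
    """Map a string to a bucket id in [0, hash_bucket_size)."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size

def crossed_bucket(values, hash_bucket_size):
    """Hash the combination of several categorical values into one bucket."""
    return hash_bucket("_X_".join(values), hash_bucket_size)

# Hypothetical feature values, for illustration only.
education_id = hash_bucket("Bachelors", 50)
cross_id = crossed_bucket(["Female", "Bachelors", "Other"], 1000)
```

A smaller `hash_bucket_size` means fewer weights to learn but more distinct values colliding in the same bucket.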
![Page 255: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/255.jpg)
Overfitting
If we make a very complex model, it can perfectly (or near perfectly) fit the training data, but then we are just memorizing the data rather than meeting the goal of generalizing.
Remember our goal is to build a system to deal with new data!
![Page 256: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/256.jpg)
Setting Aside Validation Data
Train model on Training Data
Evaluate model on Validation Data
Select features, learning rate, batch size, ... according to results on Validation Data
Pick model that does best on Validation Data
Check for generalization ability on Test Data
![Page 257: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/257.jpg)
Ensure Validation Data is Representative
This is an example of what happens if you partition data without first randomizing it. Validation data is NOT representative and thus not a good estimate of the classifier/regressor’s performance.
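One way to avoid this problem is to shuffle before partitioning. The sketch below assumes a 60/20/20 split and a function name `shuffled_split`, neither of which comes from the slides.

```python
import random

def shuffled_split(examples, validation_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle examples, then split into (training, validation, test) lists."""
    shuffled = list(examples)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    training = shuffled[n_test + n_validation:]
    return training, validation, test

# Hypothetical data set that is sorted, so an unshuffled split would be skewed.
examples = list(range(100))
training, validation, test = shuffled_split(examples)
```

A fixed seed keeps the split reproducible, so different models are compared on the same validation data.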
![Page 258: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/258.jpg)
Things You Need to Decide
● Learning Rate
○ Very important. Typically change by powers of 10 until the model is training reasonably well, and then fine-tune.
● Number of Steps to Train
○ Time to train is proportional to this (for a fixed set of features), so you want to make it as small as you can, but it is still important that you don’t undertrain.
● Batch Size
○ Not that sensitive; it can be the last thing to vary.
● What features to use, feature normalization, when to introduce buckets and crosses
![Page 259: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/259.jpg)
Learning Rate Too High
![Page 260: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/260.jpg)
Learning Rate Way Too Low
![Page 261: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/261.jpg)
Learning Rate Could Still Be Higher
![Page 262: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/262.jpg)
Good Learning Rate
NOTE: This model is still training and not yet overfitting so increase the number of steps!
![Page 263: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/263.jpg)
Training Curve Showing Overfitting
A model with the same data and learning rate, trained for 500 (versus 50) iterations. Now we see it overfitting.
![Page 264: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/264.jpg)
Things You Need to Decide
● Learning Rate, Steps to Train, Batch Size
● What features to use, feature normalization, when to introduce buckets and crosses
● When the model is more complex you also need to introduce ways to prevent overfitting.
○ Early Stopping, L2-regularization, or dropout
● Ways to Reduce Model Size
○ Smaller Buckets, Fewer Features, L1 Regularization
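Of the options listed above, early stopping is the simplest to sketch: watch the validation loss and keep the model from the step where it was lowest. The `patience` parameter, the function name, and the loss values below are assumptions chosen for illustration.

```python
def early_stopping_step(validation_losses, patience=3):
    """Return the index of the best validation loss, ending the scan once
    the loss has failed to improve for `patience` consecutive checks."""
    best_index = 0
    for i, loss in enumerate(validation_losses):
        if loss < validation_losses[best_index]:
            best_index = i
        elif i - best_index >= patience:
            break  # no improvement for `patience` checks: stop training
    return best_index

# Hypothetical validation losses: improving, then rising as the model overfits.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.40]
print(early_stopping_step(losses))  # 3
```

Note that the final value 0.40 is never reached, because training would already have stopped after three checks without improvement.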
![Page 265: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/265.jpg)
Linear Classifier
[Figure: linear classifier; axes: loan amount (x1) and income (x2)]
Convert Real-Valued to Probability Using:
[Figure: curve mapping LogOdds (wTx + b) to Probability Output, ranging from 0 to 1]
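The conversion from log-odds to a probability is commonly done with the logistic (sigmoid) function. Assuming that is the curve the figure showed, a minimal sketch looks like this; the function names and the sample weights are hypothetical.

```python
import math

def sigmoid(log_odds):
    """Map a real-valued log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def predict_probability(weights, bias, features):
    """Compute sigmoid(w^T x + b) for one example."""
    log_odds = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(log_odds)

print(sigmoid(0.0))  # 0.5
```

A log-odds of 0 sits exactly on the decision boundary, so it maps to a probability of 0.5. Very large negative inputs would overflow `math.exp` in this simple version.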
![Page 266: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/266.jpg)
LinearClassifier vs LinearRegressor
linear_regressor = tf.contrib.learn.LinearRegressor(
    feature_columns=[age, education_num, age_buckets, capital_gain_buckets,
                     capital_loss_buckets, gender, race, education, occupation,
                     native_country, workclass, education_x_age_buckets,
                     gender_x_education_x_race],
    optimizer=SGDoptimizer,
    gradient_clip_norm=5.0)
linear_classifier = tf.contrib.learn.LinearClassifier(
    feature_columns=[age, education_num, age_buckets, capital_gain_buckets,
                     capital_loss_buckets, gender, race, education, occupation,
                     native_country, workclass, education_x_age_buckets,
                     gender_x_education_x_race],
    optimizer=SGDoptimizer,
    gradient_clip_norm=5.0)
● Use Regressor to predict a real-valued label (minimize RMSE)
● Use Classifier to predict a True (1) / False (0) label (minimize log loss)
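For reference, log loss for True/False labels can be computed as below. This is a sketch of the standard definition, not code from the course; the clipping constant `eps` is an assumption added to avoid taking the log of 0.

```python
import math

def log_loss(labels, probabilities, eps=1e-15):
    """Mean log loss for 0/1 labels and predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)  # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# A model that always predicts 0.5 scores ln(2), about 0.693.
print(log_loss([1, 0], [0.5, 0.5]))
```

Confident wrong predictions are penalized heavily: predicting 0.01 for a true example costs far more than predicting 0.4.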
![Page 267: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/267.jpg)
LinearClassifier vs LinearRegressor
● Choices in feature engineering are the same
● Process of selecting and using validation (and test) data is the same
● Tuning learning rate, number of steps, batch size, regularization are the same
● Evaluation metrics change
● Instead of RMSE, we are interested in things like accuracy, ROC curve (trade-off in false positive vs false negative rate), AUC (area under ROC)
● AUC gives the probability that a random + example is predicted with a higher probability than a random - example. So 0.5 is a random guess and 1.0 is a perfect model.
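The pairwise interpretation of AUC can be checked directly on a small example by comparing every positive example with every negative one. This brute-force sketch is for illustration only (it is quadratic in the number of examples); the labels and scores are hypothetical, and counting a tie as half a win is a common convention assumed here.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumption: a tie counts as half a win
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(auc(labels, scores))  # 8 of the 9 pairs are ordered correctly
```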
![Page 268: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/268.jpg)
Sample ROC Curve
1.0 (Spam)
0.0 (Not Spam)
FPR = 1/19, TPR = 3/7
FPR = 7/19, TPR = 6/7
![Page 269: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/269.jpg)
ROC Curves for Models from Lab 3
Model size original: 533
Model size no reg: 429
Model size l2: 429
Model size l1, l2: 119
Model size l1 strong, l2: 70
![Page 270: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/270.jpg)
LinearClassifier With > 2 Classes
linear_classifier = tf.contrib.learn.LinearClassifier(
    feature_columns=feature_columns,
    n_classes=10,
    optimizer=SGDoptimizer,
    gradient_clip_norm=5.0)
● Example using LinearClassifier to learn 10 classes (digits 0, …, 9)
● Here the labels must be 0, …, 9 (or a sparse feature with 10 values).
● Now optimize softmax loss, which is a generalization of log loss when you have a probability distribution over more than just two values
● Again we need to modify the visualizations a bit, looking at a confusion matrix versus an ROC curve.
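A short sketch of softmax and the corresponding loss for a single example, using the standard definitions rather than anything specific to the course labs. Subtracting the largest score before exponentiating is a common numerical-stability step and does not change the result.

```python
import math

def softmax(logits):
    """Turn a list of real-valued scores into a probability distribution."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_loss(logits, label):
    """Negative log probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])

print(softmax([0.0, 0.0]))  # [0.5, 0.5]
```

With 10 equal scores, every class gets probability 0.1 and the loss is ln(10) regardless of the label.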
![Page 271: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/271.jpg)
Confusion Matrix
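A confusion matrix simply counts how often each true class was predicted as each class. The sketch below uses rows for the true class and columns for the predicted class, which is a common convention assumed here; the labels are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Return matrix[t][p] = number of examples of class t predicted as p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

# Hypothetical 3-class example: one example of class 1 is mistaken for class 2.
true_labels = [0, 1, 2, 2, 1, 0]
predicted_labels = [0, 2, 2, 2, 1, 0]
for row in confusion_matrix(true_labels, predicted_labels, 3):
    print(row)
```

Correct predictions land on the diagonal, so off-diagonal entries show which pairs of classes the model confuses.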
![Page 272: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/272.jpg)
DNN: Add a Non-Linearity
Output
Hidden Layer(Linear)
Input
Non-Linear Transformation Layer (a.k.a. Activation Function)
By convention we combine the activation function into the hidden layer, making it a non-linear function. So the network to the left is drawn as:
![Page 273: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/273.jpg)
Non-linearity in a DNN lets it learn to do this
If you want to predict city-mpg from the compression-ratio, a single linear function would not fit well, but you can get a pretty good fit by dividing compression ratio into two buckets and then learning a linear model for each bucket.
![Page 274: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/274.jpg)
Deep Neural Networks -- Add Layers
Output
Hidden2
Hidden1
Input
● Training is done via the BackProp algorithm, which computes the gradients that SGD uses to update the weights
● The hidden layers closer to the output capture higher level features (since they learn over the features from the previous layer)
● For this network in TensorFlow you’d have:
○ hidden_units=[4, 3]
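To show what hidden_units=[4, 3] means structurally, here is a sketch of a forward pass through such a network in plain Python. The number of inputs (2), the single output, the ReLU activation, and the random untrained weights are all assumptions for illustration; the slide does not specify them, and no training is performed.

```python
import random

def relu(values):
    """Element-wise non-linearity: negative values become 0."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One linear layer: weights holds one row of input weights per unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for a layer with n_out units."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(inputs, layers):
    """Apply ReLU after every layer except the last (output) layer."""
    activations = inputs
    for i, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if i < len(layers) - 1:
            activations = relu(activations)
    return activations

rng = random.Random(0)
sizes = [2, 4, 3, 1]  # 2 inputs (assumed), hidden_units=[4, 3], 1 output
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
output = forward([0.5, -1.2], layers)
```

Without the `relu` calls, the stacked layers would collapse into a single linear function of the inputs.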
![Page 275: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/275.jpg)
DNN Classifier or Regressor in TF
DNNclassifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns,
    n_classes=10,
    hidden_units=[50, 25, 10],
    optimizer=optimizer,
    gradient_clip_norm=5.0)
DNNregressor = tf.contrib.learn.DNNRegressor(
    feature_columns=feature_columns,
    hidden_units=[50, 25, 10],
    optimizer=optimizer,
    gradient_clip_norm=5.0)
![Page 276: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/276.jpg)
DNN Reduces Feature Engineering
● It can learn to bucketize real-valued features
● It can learn crosses
● As you add model weights, it takes more data and time to train, and overfitting becomes more of an issue
● Use L2 regularization or dropout to control overfitting
● Along with other hyperparameters (e.g. learning rate, num steps) you need to pick the DNN configuration (how many levels of hidden units and how many units at each level).
![Page 277: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/277.jpg)
Embeddings as a Tool
● Embeddings map items (e.g. movies, text,...) to low-dimensional real vectors in a way that similar items are close to each other
● Embeddings can also be applied to dense data (e.g. audio) to create a meaningful similarity metric
● Jointly embedding diverse data types (e.g. text, images, audio, …) defines a similarity between them
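Once items are embedded as vectors, "similar" can be measured numerically. The sketch below uses cosine similarity, which is one common choice (Euclidean distance is another); the 3-dimensional vectors and item names are made up for illustration, since real embeddings are learned from data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings; real ones would be learned.
EMBEDDINGS = {
    "movie_a": [0.9, 0.1, 0.0],
    "movie_b": [0.8, 0.2, 0.1],
    "movie_c": [-0.7, 0.0, 0.6],
}

def most_similar(item, embeddings=EMBEDDINGS):
    """Return the other item whose embedding is closest to item's embedding."""
    others = [name for name in embeddings if name != item]
    return max(others, key=lambda name: cosine_similarity(embeddings[item],
                                                          embeddings[name]))

print(most_similar("movie_a"))  # movie_b
```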
![Page 278: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/278.jpg)
An Embedding Layer in a DNNRegressor
Regression problem to predict home sales prices:
[Figure: DNNRegressor with an embedding layer]
● Inputs: words in real estate ad (through a 3 dimensional embedding), latitude, longitude
● Input to embedding layer is a sparse vector encoding
● Output: sale price
![Page 279: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine](https://reader030.vdocuments.us/reader030/viewer/2022041014/5ec4619517d06d7cdf35baf4/html5/thumbnails/279.jpg)
An Embedding Layer in a DNNClassifier
Multiclass classification to predict a handwritten digit:
[Figure: DNNClassifier with an embedding layer]
● Inputs: raw bitmap of the hand drawn digit (through a 3 dimensional embedding), other features
● Input to embedding layer is a sparse vector encoding
● Output: predicted probability for the 10 classes (0 1 2 3 4 5 6 7 8 9)
● Target class label: “one-hot” target prob dist. (sparse)