TRANSCRIPT
15-388/688 - Practical Data Science: Linear classification
J. Zico Kolter, Carnegie Mellon University
Fall 2019
Outline
Example: classifying tumors
Classification in machine learning
Example classification algorithms
Libraries for machine learning
Classification tasks
Regression tasks: predicting a real-valued quantity 𝑦 ∈ ℝ
Classification tasks: predicting a discrete-valued quantity 𝑦
Binary classification: 𝑦 ∈ {−1, +1}
Multiclass classification: 𝑦 ∈ {1, 2, …, 𝑘}
Example: breast cancer classification
Well-known classification example: using machine learning to diagnose whether a breast tumor is benign or malignant [Street et al., 1992]
Setting: doctor extracts a sample of fluid from tumor, stains cells, then outlines several of the cells (image processing refines outline)
System computes features for each cell such as area, perimeter, concavity, texture (10 total); computes mean/std/max for all features
Example: breast cancer classification
Plot of two features: mean area vs. mean concave points, for the two classes
Linear classification example
Linear classification ≡ “drawing line separating classes”
Formal setting
Input features: 𝑥^(𝑖) ∈ ℝ^𝑛, 𝑖 = 1, …, 𝑚
E.g.: 𝑥^(𝑖) = [Mean_Area^(𝑖); Mean_Concave_Points^(𝑖); 1]
Outputs: 𝑦^(𝑖) ∈ 𝒴, 𝑖 = 1, …, 𝑚
E.g.: 𝑦^(𝑖) ∈ {−1 (benign), +1 (malignant)}
Model parameters: 𝜃 ∈ ℝ^𝑛
Hypothesis function: ℎ_𝜃 : ℝ^𝑛 → ℝ, aims for the same sign as the output (informally, a measure of confidence in our prediction)
E.g.: ℎ_𝜃(𝑥) = 𝜃^𝑇 𝑥, 𝑦̂ = sign(ℎ_𝜃(𝑥))
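To make this concrete, here is a minimal NumPy sketch of the linear hypothesis and sign prediction above (the parameter and feature values are made up for illustration):

```python
import numpy as np

# made-up parameters and one input with a trailing constant feature
theta = np.array([1.0, 2.0, -0.5])   # model parameters theta
x = np.array([0.8, 0.3, 1.0])        # [mean area, mean concave points, 1]

h = theta.dot(x)                     # hypothesis h_theta(x) = theta^T x
y_hat = np.sign(h)                   # predicted label in {-1, +1}
```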
Understanding linear classification diagrams
Color shows the regions where ℎ_𝜃(𝑥) is positive
Separating boundary is given by the equation ℎ_𝜃(𝑥) = 0
Loss functions for classification
How do we define a loss function ℓ : ℝ × {−1, +1} → ℝ₊?
What about just using squared loss?
[Figure: fitting binary labels 𝑦 ∈ {−1, +1} against a feature 𝑥: a least-squares fit vs. a perfect classifier]
0/1 loss (i.e., error)
The loss we would like to minimize (0/1 loss, or just “error”):
ℓ_{0/1}(ℎ_𝜃(𝑥), 𝑦) = 0 if sign(ℎ_𝜃(𝑥)) = 𝑦, 1 otherwise
                  = 𝟏{𝑦 ⋅ ℎ_𝜃(𝑥) ≤ 0}
Alternative losses
Unfortunately, the 0/1 loss is hard to optimize (it is NP-hard to find the classifier with minimum 0/1 loss; this relates to the fact that the loss is not convex)
A number of alternative losses for classification are typically used instead
ℓ_{0/1} = 𝟏{𝑦 ⋅ ℎ_𝜃(𝑥) ≤ 0}
ℓ_logistic = log(1 + exp(−𝑦 ⋅ ℎ_𝜃(𝑥)))
ℓ_hinge = max{1 − 𝑦 ⋅ ℎ_𝜃(𝑥), 0}
ℓ_exp = exp(−𝑦 ⋅ ℎ_𝜃(𝑥))
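Each of these losses depends on the example only through the margin 𝑦 ⋅ ℎ_𝜃(𝑥), so they are easy to sketch in NumPy (the function names and example margins below are my own, not from the slides):

```python
import numpy as np

# each loss is a function of the margin m = y * h_theta(x)
def loss_01(m):       return (m <= 0).astype(float)
def loss_logistic(m): return np.log(1 + np.exp(-m))
def loss_hinge(m):    return np.maximum(1 - m, 0)
def loss_exp(m):      return np.exp(-m)

margins = np.array([-2.0, 0.0, 0.5, 3.0])  # made-up margins
zero_one = loss_01(margins)                # 1 where margin <= 0, else 0
hinge = loss_hinge(margins)                # linear until margin 1, then 0
```

Note how the exponential loss grows fastest for large negative margins; that is the intuition behind the outlier-sensitivity poll that follows.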
Poll: sensitivity to outliers
How sensitive would you estimate each of the following losses would be to outliers (i.e., points typically heavily misclassified)?
1. 0/1 < Exp < {Hinge, Logistic}
2. Exp < Hinge < Logistic < 0/1
3. Hinge < 0/1 < Logistic < Exp
4. 0/1 < {Hinge,Logistic} < Exp
5. Outliers don’t exist in classification because the output space is bounded
Machine learning optimization
With this notation, the “canonical” machine learning problem is written in exactly the same way:
minimize_𝜃 ∑_{𝑖=1}^{𝑚} ℓ(ℎ_𝜃(𝑥^(𝑖)), 𝑦^(𝑖))
Unlike least squares, there is not an analytical solution to the zero gradient condition for most classification losses
Instead, we solve these optimization problems using gradient descent (or an alternative optimization method, but we’ll only consider gradient descent here)
Repeat: 𝜃 ≔ 𝜃 − 𝛼 ∑_{𝑖=1}^{𝑚} ∇_𝜃 ℓ(ℎ_𝜃(𝑥^(𝑖)), 𝑦^(𝑖))
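This generic loop can be sketched in a few lines (a minimal illustration with a made-up quadratic objective, not course code):

```python
import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, iters=100):
    """Repeat: theta := theta - alpha * grad(theta)."""
    theta = theta0.copy()
    for _ in range(iters):
        theta = theta - alpha * grad(theta)
    return theta

# toy objective f(theta) = 0.5 * ||theta - 1||^2, so grad(theta) = theta - 1
grad = lambda theta: theta - 1.0
theta = gradient_descent(grad, np.zeros(2), alpha=0.5, iters=50)
```

In the classification setting, grad would sum ∇_𝜃 ℓ over all training examples, as in the update above.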
Support vector machine
A (linear) support vector machine (SVM) just solves the canonical machine learning optimization problem using hinge loss and a linear hypothesis, plus an additional regularization term (more on this next lecture):
minimize_𝜃 ∑_{𝑖=1}^{𝑚} max{1 − 𝑦^(𝑖) ⋅ 𝜃^𝑇 𝑥^(𝑖), 0} + (𝜆/2) ‖𝜃‖₂²
Even more precisely, the “standard” SVM doesn’t actually regularize the entry of 𝜃 corresponding to the constant feature, but we’ll ignore this here
Updates using gradient descent:
𝜃 ≔ 𝜃 − 𝛼 ∑_{𝑖=1}^{𝑚} −𝑦^(𝑖) 𝑥^(𝑖) 𝟏{𝑦^(𝑖) ⋅ 𝜃^𝑇 𝑥^(𝑖) ≤ 1} − 𝛼𝜆𝜃
Support vector machine example
Running a support vector machine on the cancer dataset, with a small regularization parameter (effectively zero)
𝜃 = [1.456, 1.848, −0.189]
SVM optimization progress
Optimization objective and error versus gradient descent iteration number
Logistic regression
Logistic regression just solves this problem using logistic loss and a linear hypothesis function:
minimize_𝜃 ∑_{𝑖=1}^{𝑚} log(1 + exp(−𝑦^(𝑖) ⋅ 𝜃^𝑇 𝑥^(𝑖)))
Gradient descent updates (can you derive these?):
𝜃 ≔ 𝜃 − 𝛼 ∑_{𝑖=1}^{𝑚} −𝑦^(𝑖) 𝑥^(𝑖) ⋅ 1/(1 + exp(𝑦^(𝑖) ⋅ 𝜃^𝑇 𝑥^(𝑖)))
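A minimal NumPy sketch of this update (my own illustrative code with a made-up toy dataset, not the course’s reference implementation; labels must be in {−1, +1}):

```python
import numpy as np

def logistic_gd(X, y, alpha=0.1, max_iter=500):
    """Gradient descent on the logistic loss, mirroring the update above."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(max_iter):
        margins = y * X.dot(theta)  # y_i * theta^T x_i
        # gradient: sum_i -y_i * x_i / (1 + exp(y_i * theta^T x_i))
        grad = -(X * y[:, None]).T.dot(1.0 / (1.0 + np.exp(margins)))
        theta -= alpha * grad
    return theta

# toy separable data: label is the sign of the first feature
X = np.array([[ 1.0, 1.0], [ 2.0, 1.0], [-1.0, 1.0], [-2.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
theta = logistic_gd(X, y)
```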
Logistic regression example
Running logistic regression on the cancer dataset, small regularization
Multiclass classification
When the output is in {1, …, 𝑘} (e.g., digit classification), there are a few different approaches
Approach 1: Build 𝑘 different binary classifiers ℎ_{𝜃_𝑖}, each with the goal of predicting class 𝑖 vs. all others, and output predictions as
𝑦̂ = argmax_𝑖 ℎ_{𝜃_𝑖}(𝑥)
Approach 2: Use a hypothesis function ℎ_𝜃 : ℝ^𝑛 → ℝ^𝑘 and define an alternative loss function ℓ : ℝ^𝑘 × {1, …, 𝑘} → ℝ₊
E.g., softmax loss (also called cross entropy loss):
ℓ(ℎ_𝜃(𝑥), 𝑦) = log ∑_{𝑗=1}^{𝑘} exp(ℎ_𝜃(𝑥)_𝑗) − ℎ_𝜃(𝑥)_𝑦
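As a small sketch (my own code, not from the slides), this loss can be computed directly from the vector of class scores:

```python
import numpy as np

def softmax_loss(h, y):
    """Softmax (cross entropy) loss: log(sum_j exp(h_j)) - h_y,
    where h is the score vector h_theta(x) and y is the true class index."""
    return np.log(np.sum(np.exp(h))) - h[y]

h = np.array([2.0, 1.0, -1.0])  # made-up scores for k = 3 classes
low = softmax_loss(h, 0)        # small loss: true class has the highest score
high = softmax_loss(h, 2)       # large loss: true class has the lowest score
```

In practice one typically subtracts max(h) before exponentiating, for numerical stability.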
Support vector machine in scikit-learn
Train a support vector machine:

from sklearn.svm import LinearSVC, SVC
clf = SVC(C=1e4, kernel='linear')
# or: clf = LinearSVC(C=1e4, loss='hinge', max_iter=100000)
clf.fit(X, y)  # don’t include constant features in X

Make predictions:

y_pred = clf.predict(X)

Note: scikit-learn is solving the problem with an inverted regularization term:
minimize_𝜃 𝐶 ∑_{𝑖=1}^{𝑚} max{1 − 𝑦^(𝑖) ⋅ 𝜃^𝑇 𝑥^(𝑖), 0} + (1/2) ‖𝜃‖₂²
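For a runnable end-to-end sketch, scikit-learn ships a version of the Wisconsin breast cancer dataset (note its labels are {0, 1}, and its exact features and preprocessing may differ from the lecture’s data; the scaling step is my addition, since the SVM solver behaves better with standardized features):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X, y = data.data, data.target  # y in {0, 1} here, not {-1, +1}

clf = make_pipeline(StandardScaler(), SVC(C=1e4, kernel='linear'))
clf.fit(X, y)
acc = clf.score(X, y)          # training accuracy
```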
Native Python SVM
It’s pretty easy to write a gradient-descent-based SVM too
For the most part, ML algorithms are very simple; you can easily write them yourself, but it’s fine to use libraries to quickly try many algorithms
But watch out for idiosyncratic differences (e.g., 𝐶 vs. 𝜆, the fact that I’m using 𝑦 ∈ {−1, +1}, not 𝑦 ∈ {0, 1}, etc.)
import numpy as np

def svm_gd(X, y, lam=1e-5, alpha=1e-4, max_iter=5000):
    m, n = X.shape
    theta = np.zeros(n)
    Xy = X * y[:, None]  # row i holds y_i * x_i
    for i in range(max_iter):
        # hinge subgradient -sum_i y_i x_i 1{y_i theta^T x_i <= 1}, plus lam*theta
        theta -= alpha * (-Xy.T.dot(Xy.dot(theta) <= 1) + lam * theta)
    return theta
Logistic regression in scikit-learn
An admittedly very nice element of scikit-learn is that we can easily try out other algorithms
For both this example and the SVM, you can access the resulting parameters using the fields:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(C=10000.0)
clf.fit(X, y)

clf.coef_       # parameters other than the weight on the constant feature
clf.intercept_  # weight on the constant feature