
Page 1: STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance StatisticalComputing

Lecture 9: Classification

Cho-Jui HsiehUC Davis

May 18, 2017

Page 2:

Linear SVM

Page 3:

Support Vector Machines

SVM is a widely used classifier.Given:

Training data points x1, …, xn, where each xi ∈ Rᵈ is a feature vector.
Consider a simple case with two classes: yi ∈ {+1, −1}.

Goal: find a hyperplane that separates the two classes of data:
if yi = +1, then wᵀxi ≥ 1; if yi = −1, then wᵀxi ≤ −1.

Page 4:

Support Vector Machines (hard constraints)

Given training data x1, …, xn ∈ Rᵈ with labels yi ∈ {+1, −1}.
SVM primal problem (with hard constraints):

min_w  (1/2) wᵀw
s.t.  yi(wᵀxi) ≥ 1,  i = 1, …, n.

What if there are outliers?

Page 5:

Support Vector Machines

Given training data x1, …, xn ∈ Rᵈ with labels yi ∈ {+1, −1}.
SVM primal problem:

min_{w,ξ}  (1/2) wᵀw + C ∑_{i=1}^n ξi
s.t.  yi(wᵀxi) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, …, n.
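The soft-margin primal above can be solved directly with scikit-learn's LinearSVC, which wraps a LIBLINEAR-style solver. A minimal sketch on synthetic data (the two-blob dataset here is illustrative, not from the slides); C plays the same role as in the primal: a large C penalizes the slack variables ξi heavily, approaching the hard-constraint formulation.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy 2-class data: two well-separated Gaussian blobs (illustrative only).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + 2, rng.randn(20, 2) - 2])
y = np.array([1] * 20 + [-1] * 20)

# loss="hinge" matches the slides' formulation; dual=True is required for it.
clf = LinearSVC(C=1.0, loss="hinge", dual=True, max_iter=10000)
clf.fit(X, y)
print(clf.score(X, y))
```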

Page 6:

Support Vector Machines

SVM primal problem can be written as

min_w  (1/2)‖w‖² + C ∑_{i=1}^n max(0, 1 − yi wᵀxi)

The first term (1/2)‖w‖² is the L2 regularization; each term max(0, 1 − yi wᵀxi) is the hinge loss.

The objective is non-differentiable when yi wᵀxi = 1 for some i.
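The unconstrained objective above is straightforward to evaluate in NumPy. A small sketch (the function name and test data are illustrative): for a w that classifies every point with margin at least 1, the hinge term vanishes and only the regularization term remains.

```python
import numpy as np

def svm_primal_objective(w, X, y, C=1.0):
    """(1/2)||w||^2 + C * sum_i max(0, 1 - y_i w^T x_i)."""
    margins = y * (X @ w)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * w @ w + C * hinge.sum()

# Both points have margin y_i w^T x_i = 2 >= 1, so the hinge loss is zero
# and the objective reduces to the regularization term (1/2)||w||^2 = 0.5.
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, 0.0])
print(svm_primal_objective(w, X, y))  # 0.5
```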

Page 7:

General Empirical Risk Minimization

Regularized ERM:

min_{w∈Rᵈ}  P(w) := ∑_{i=1}^n ℓi(wᵀxi) + R(w)

ℓi(·): loss function
R(w): regularization

The dual problem may take a different form depending on the choice of loss and regularizer.

Page 8:

Examples

Loss functions:

Regression (squared loss): ℓi(wᵀxi) = (wᵀxi − yi)²
SVM (hinge loss): ℓi(wᵀxi) = max(1 − yi wᵀxi, 0)
Squared hinge loss: ℓi(wᵀxi) = max(1 − yi wᵀxi, 0)²
Logistic regression: ℓi(wᵀxi) = log(1 + e^(−yi wᵀxi))

Page 9:

Examples

Regularizations:

L2-regularization ‖w‖₂²: small but dense solution
L1-regularization ‖w‖₁: sparse solution
Nuclear norm ‖W‖∗: low-rank solution
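The sparse-vs-dense contrast between L1 and L2 is easy to see empirically. A sketch (not from the slides; the synthetic data and alpha values are illustrative) fitting the same regression problem with Lasso (L1) and Ridge (L2) and counting coefficients that are exactly zero:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression with only 3 of 20 features relevant (illustrative).
rng = np.random.RandomState(0)
X = rng.randn(100, 20)
w_true = np.zeros(20)
w_true[:3] = [2.0, -3.0, 1.5]
y = X @ w_true + 0.01 * rng.randn(100)

l1 = Lasso(alpha=0.1).fit(X, y)   # L1: drives irrelevant weights exactly to 0
l2 = Ridge(alpha=0.1).fit(X, y)   # L2: shrinks weights but keeps them nonzero
print(int(np.sum(l1.coef_ == 0)))  # many exact zeros (sparse)
print(int(np.sum(l2.coef_ == 0)))  # typically none (dense)
```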

Page 10:

LIBLINEAR

Implemented in LIBLINEAR:

https://www.csie.ntu.edu.tw/~cjlin/liblinear/

Other functionalities:

Logistic regression (L1 or L2 regularization)
Multi-class SVM
Support vector regression
Cross-validation
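scikit-learn exposes the LIBLINEAR solver directly through LogisticRegression, so the first bullet above can be tried without installing LIBLINEAR separately. A sketch on synthetic data (the dataset and C value are illustrative); penalty="l1" or "l2" selects the regularizer:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem (illustrative only).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# solver="liblinear" uses the LIBLINEAR library under the hood.
clf_l2 = LogisticRegression(solver="liblinear", penalty="l2", C=1.0).fit(X, y)
clf_l1 = LogisticRegression(solver="liblinear", penalty="l1", C=1.0).fit(X, y)
print(clf_l2.score(X, y), clf_l1.score(X, y))
```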

Page 11:

LIBLINEAR

RCV1: 677,399 training samples; 47,236 features; 49,556,258 nonzeros in the whole dataset.

Time vs primal objective function value

Page 12:

LIBLINEAR

RCV1: 677,399 training samples; 47,236 features; 49,556,258 nonzeros in the whole dataset.

Time vs prediction accuracy

Page 13:

Kernel SVM

Page 14:

Non-linearly separable problems

What if the data is not linearly separable?

Solution: map each data point xi to a higher-dimensional (possibly infinite-dimensional) feature space via ϕ(xi), where the classes become linearly separable.

Page 15:

Support Vector Machines (SVM)

SVM primal problem:

min_{w,ξ}  (1/2) wᵀw + C ∑_{i=1}^n ξi
s.t.  yi(wᵀϕ(xi)) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, …, n.

The dual problem for SVM:

min_α  (1/2) αᵀQα − eᵀα
s.t.  0 ≤ αi ≤ C,  i = 1, …, n,

where Qij = yi yj ϕ(xi)ᵀϕ(xj) and e = [1, …, 1]ᵀ.

Kernel trick: define K(xi, xj) = ϕ(xi)ᵀϕ(xj).

At the optimum: w = ∑_{i=1}^n αi yi ϕ(xi).
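The optimality relation w = ∑ αi yi ϕ(xi) can be checked numerically in the linear case (ϕ = identity), where scikit-learn's SVC exposes the products αi yi for the support vectors as dual_coef_. A sketch on synthetic data (the two-blob dataset is illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (illustrative only).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + 2, rng.randn(20, 2) - 2])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Reconstruct w from the dual solution: sum over support vectors of
# (alpha_i * y_i) * x_i; non-support vectors have alpha_i = 0.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))  # True
```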

Page 16:

Various types of kernels

Gaussian kernel: K(xi, xj) = e^(−γ‖xi−xj‖₂²);
Polynomial kernel: K(xi, xj) = (γ xiᵀxj + c)ᵈ.
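Both kernels can be computed for a whole dataset at once with vectorized NumPy (γ, c, d below are arbitrary illustrative hyperparameter values). The Gaussian kernel matrix uses the identity ‖a − b‖² = ‖a‖² + ‖b‖² − 2aᵀb to avoid explicit pairwise loops:

```python
import numpy as np

def gaussian_kernel(X, gamma=0.5):
    # K_ij = exp(-gamma * ||x_i - x_j||^2)
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T  # pairwise squared distances
    return np.exp(-gamma * d2)

def polynomial_kernel(X, gamma=1.0, c=1.0, d=3):
    # K_ij = (gamma * x_i^T x_j + c)^d
    return (gamma * X @ X.T + c) ** d

X = np.random.RandomState(0).randn(5, 3)
K = gaussian_kernel(X)
print(K.shape)                         # (5, 5): one entry per pair of points
print(np.allclose(np.diag(K), 1.0))   # diagonal is exp(0) = 1
```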

Hard to solve: the dual is an n-by-n quadratic minimization problem, requiring at least O(n²) time.

LIBSVM: https://www.csie.ntu.edu.tw/~cjlin/libsvm/

For linear SVM, use LIBLINEAR instead of LIBSVM.

Page 17:

Scikit-learn

Linear SVM: sklearn.svm.LinearSVC

Logistic Regression: sklearn.linear_model.LogisticRegression

Kernel SVM: sklearn.svm.SVC

· · ·

Practice in homework.
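The three classifiers listed above share the same fit/predict/score interface, so they are easy to swap and compare. A minimal sketch on the built-in iris dataset (the dataset choice and hyperparameters are illustrative, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Same interface for all three: fit on the training split, score on the test split.
scores = {}
for clf in (LinearSVC(max_iter=10000),
            LogisticRegression(max_iter=1000),
            SVC(kernel="rbf")):
    clf.fit(X_tr, y_tr)
    scores[type(clf).__name__] = clf.score(X_te, y_te)
print(scores)
```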


Page 19:

Coming up

Classification

Questions?