
Page 1

Mining of Massive Datasets

Jure Leskovec, Anand Rajaraman, Jeff Ullman, Stanford University

http://www.mmds.org

Note to other teachers and users of these slides: We would be delighted if you found our material useful for giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

Page 2

(Course overview diagram:)
- High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction
- Graph data: PageRank, SimRank; Community Detection; Spam Detection
- Infinite data: Filtering data streams; Web advertising; Queries on streams
- Machine learning: SVM; Decision Trees; Perceptron, kNN
- Apps: Recommender systems; Association Rules; Duplicate document detection

Page 3

Example: Spam filtering
- Instance space x ∈ X (|X| = n data points)
  - Binary or real-valued feature vector x of word occurrences
  - d features (words + other things), d ~ 100,000
- Class y ∈ Y
  - y: Spam (+1), Ham (-1)
- Goal: Estimate a function f(x) so that y = f(x)

Page 4

Would like to do prediction: estimate a function f(x) so that y = f(x)
- Where y can be:
  - Real number: Regression
  - Categorical: Classification
  - Complex object: ranking of items, parse tree, etc.
- Data is labeled:
  - Have many pairs {(x, y)}
  - x … vector of binary, categorical, real-valued features
  - y … class ({+1, -1}, or a real number)

Page 5

- Task: Given data (X, Y), build a model f() to predict Y' based on X'
- Strategy: Estimate y = f(x) on (X, Y). Hope that the same f(x) also works to predict the unknown Y'. The "hope" is called generalization.
- Overfitting: If f(x) predicts Y well but is unable to predict Y'
- We want to build a model that generalizes well to unseen data
- But Jure, how can we do well on data we have never seen before?!

(Diagram: training data (X, Y) and test data (X', Y').)

Page 6

- Idea: Pretend we do not know the data/labels we actually do know
  - Build the model f(x) on the training data
  - See how well f(x) does on the test data
  - If it does well, then apply it also to X'
- Refinement: Cross validation
  - Splitting into a training/validation set is brutal
  - Let's split our data (X, Y) into 10 folds (buckets)
  - Take out 1 fold for validation, train on the remaining 9
  - Repeat this 10 times, report average performance (a small sketch follows below)

(Diagram: data (X, Y) split into training set and validation set; X' is the test set.)
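A minimal sketch of the 10-fold procedure above, assuming NumPy is available; train_model and accuracy are hypothetical placeholders for whatever learner and metric are being evaluated:

import numpy as np

def cross_validate(X, Y, train_model, accuracy, k=10, seed=0):
    """Hold out each of k folds once, train on the rest, and average the scores."""
    idx = np.random.default_rng(seed).permutation(len(Y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]                                   # held-out fold
        train = np.concatenate(folds[:i] + folds[i+1:])  # remaining k-1 folds
        model = train_model(X[train], Y[train])
        scores.append(accuracy(model, X[val], Y[val]))
    return np.mean(scores)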

Page 7

Binary classification:
- Input: Vectors x_j and labels y_j
  - Vectors x_j are real valued, normalized so that ‖x_j‖₂ = 1
- Goal: Find a vector w = (w(1), w(2), ..., w(d))
  - Each w(i) is a real number

f(x) = +1 if w(1)·x(1) + w(2)·x(2) + ... + w(d)·x(d) ≥ θ
       -1 otherwise

Note: The decision boundary w·x = 0 is linear; the threshold θ can be absorbed by mapping x → (x, 1) and w → (w, -θ).
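As a concrete illustration of the decision rule above, a tiny NumPy sketch (w, x, and theta are example values, not taken from the slides):

import numpy as np

def predict(w, x, theta=0.0):
    """Return +1 if w.x >= theta, else -1 (linear decision boundary)."""
    return 1 if np.dot(w, x) >= theta else -1

w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.0, 1.0])
print(predict(w, x))   # w.x = 2.5 >= 0, so this prints 1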

Page 8

(Very) loose motivation: Neuron
- Inputs are feature values
- Each feature has a weight w(i)
- Activation is the sum: f(x) = Σ_i w(i)·x(i) = w · x
- If f(x) is:
  - Positive: Predict +1
  - Negative: Predict -1

(Diagram: feature values x(1), ..., x(4), e.g. counts of words such as "viagra" or "nigeria", are multiplied by weights w(1), ..., w(4), summed, and compared against 0; output Spam = +1, Ham = -1. The decision boundary is the hyperplane w·x = 0.)

Page 9

Perceptron: y' = sign(w · x). How to find the parameters w?
- Start with w_0 = 0
- Pick training examples x_t one by one
- Predict the class of x_t using the current w_t: y' = sign(w_t · x_t)
- If y' is correct (i.e., y_t = y'): no change, w_{t+1} = w_t
- If y' is wrong: adjust w_t:
  w_{t+1} = w_t + η · y_t · x_t
  - η is the learning rate parameter
  - x_t is the t-th training example
  - y_t is the true t-th class label ({+1, -1})

(Diagram: a misclassified example (x_t, y_t = +1) pulls w_t toward w_{t+1} = w_t + η·y_t·x_t.)

Note that the Perceptron is a conservative algorithm: it ignores samples that it classifies correctly. (A sketch of the update loop follows below.)
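A minimal sketch of the update rule above, assuming NumPy and a fixed learning rate (the threshold is taken as 0, matching the w·x form of the previous slides):

import numpy as np

def perceptron_train(X, Y, eta=1.0, epochs=10):
    """X: (n, d) array of examples; Y: array of labels in {+1, -1}."""
    n, d = X.shape
    w = np.zeros(d)                        # start with w0 = 0
    for _ in range(epochs):                # several passes over the data
        for xt, yt in zip(X, Y):
            y_pred = 1 if np.dot(w, xt) >= 0 else -1
            if y_pred != yt:               # only misclassified examples change w
                w = w + eta * yt * xt
    return w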

Page 10

- Good: Perceptron convergence theorem: if there exists a set of weights that is consistent (i.e., the data is linearly separable), the Perceptron learning algorithm will converge
- Bad: Never converges: if the data is not separable, the weights dance around indefinitely
- Bad: Mediocre generalization: finds a "barely" separating solution

Page 11

Perceptron will oscillate and won't converge. So, when to stop learning?
- (1) Slowly decrease the learning rate η
  - A classic way: η = c1/(t + c2)
  - But we also need to determine the constants c1 and c2 (see the sketch below)
- (2) Stop when the training error stops changing
- (3) Have a small test dataset and stop when the test set error stops decreasing
- (4) Stop when we have reached some maximum number of passes over the data
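A small sketch of the decay schedule in option (1); c1 and c2 are placeholder constants that would still have to be tuned:

def learning_rate(t, c1=100.0, c2=1000.0):
    """Classic decay: large steps early on, smaller steps as the step counter t grows."""
    return c1 / (t + c2)

for t in [0, 10, 100, 1000, 10000]:
    print(t, round(learning_rate(t), 4))   # the rate shrinks smoothly with t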

Page 12

Page 13

Want to separate "+" from "-" using a line

Data: Training examples (x_1, y_1) … (x_n, y_n)
- Each example i: x_i = (x_i(1), …, x_i(d)), where x_i(j) is real valued and y_i ∈ {-1, +1}
- Inner product: w · x = Σ_{j=1..d} w(j) · x(j)

Which is the best linear separator (defined by w)?

(Diagram: "+" and "-" points in the plane.)

Page 14

- Distance from the separating hyperplane corresponds to the "confidence" of the prediction
- Example: We are more sure about the class of A and B than of C

(Diagram: points A and B lie far from the separating line, while C lies close to it.)

Page 15

Margin γ: Distance of the closest example from the decision line/hyperplane

The reason we define the margin this way is theoretical convenience and the existence of generalization error bounds that depend on the value of the margin.

Page 16

Remember: Dot product
A · B = ‖A‖ · ‖B‖ · cos θ, where ‖A‖ = sqrt(Σ_j A(j)²)

Page 17

- Dot product: A · B = ‖A‖ · ‖B‖ · cos θ
- What are w · x_1 and w · x_2 for different choices of w?

(Diagram: two positive examples x_1 and x_2 projected onto two different weight vectors w; the projections differ in size.)

- So, γ roughly corresponds to the margin: the bigger γ, the bigger the separation

Page 18

Distance from a point to a line

Let:
- Line L: w·x + b = w(1)·x(1) + w(2)·x(2) + b = 0
- w = (w(1), w(2))
- Point A = (x_A(1), x_A(2))
- Point M on the line = (x_M(1), x_M(2))

d(A, L) = |AH|
        = |(A - M) · w|
        = |(x_A(1) - x_M(1))·w(1) + (x_A(2) - x_M(2))·w(2)|
        = x_A(1)·w(1) + x_A(2)·w(2) + b
        = w · A + b

Remember: x_M(1)·w(1) + x_M(2)·w(2) = -b, since M belongs to line L.
Note: we assume ‖w‖ = 1. (A small numeric check follows below.)
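A small numeric check of the formula above, assuming NumPy; w is normalized to unit length so that w·A + b is exactly the signed distance:

import numpy as np

def distance_to_line(w, b, A):
    """Signed distance of point A from the line w.x + b = 0, assuming ||w|| = 1."""
    return np.dot(w, A) + b

# Line x(1) + x(2) - 1 = 0, rescaled so that ||w|| = 1
w = np.array([1.0, 1.0]) / np.sqrt(2.0)
b = -1.0 / np.sqrt(2.0)
A = np.array([2.0, 2.0])
print(distance_to_line(w, b, A))   # (2 + 2 - 1)/sqrt(2), about 2.12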

Page 19

- Prediction = sign(w·x + b)
- "Confidence" = (w·x + b)·y
- For the i-th datapoint: γ_i = (w·x_i + b)·y_i
- Want to solve: max_w min_i γ_i
- Can rewrite as:
  max_{w,γ} γ
  s.t. ∀i, y_i·(w·x_i + b) ≥ γ

Page 20

Maximize the margin:
- Good according to intuition, theory (VC dimension) & practice
- γ is the margin … distance from the separating hyperplane

max_{w,γ} γ
s.t. ∀i, y_i·(w·x_i + b) ≥ γ

(Diagram: separating hyperplane w·x + b = 0 with margin γ on each side.)

Page 21

Page 22

- The separating hyperplane is defined by the support vectors
  - Points on the +/- margin planes from the solution
  - If you knew these points, you could ignore the rest
- Generally, d+1 support vectors (for d-dimensional data)

Page 23

Problem:
- Let (w·x + b)·y = γ; then (2w·x + 2b)·y = 2γ
- Scaling w increases the margin!
Solution:
- Work with a normalized w: γ = (w/‖w‖ · x + b)·y, where ‖w‖ = sqrt(Σ_j w(j)²)
- Let's also require the support vectors x_j to be on the planes defined by: w·x_j + b = ±1

(Diagram: support vectors x_1 and x_2 on the two margin planes, with the unit normal w/‖w‖.)

Page 24

Want to maximize the margin γ!
- What is the relation between x_1 and x_2?
  x_1 = x_2 + 2γ · w/‖w‖
- We also know:
  w · x_1 + b = +1
  w · x_2 + b = -1
- So:
  w · x_1 + b = +1
  w · (x_2 + 2γ · w/‖w‖) + b = +1
  (w · x_2 + b) + 2γ · (w·w)/‖w‖ = +1, and w · x_2 + b = -1
  ⇒ γ = ‖w‖/(w·w) = 1/‖w‖   (Note: w·w = ‖w‖²)

Page 25

- We started with:
  max_{w,γ} γ
  s.t. ∀i, y_i·(w·x_i + b) ≥ γ
  But w can be arbitrarily large!
- We normalized and ...
  arg max γ = arg max 1/‖w‖ = arg min ‖w‖ = arg min ½‖w‖²
- Then:
  min_w ½‖w‖²
  s.t. ∀i, y_i·(w·x_i + b) ≥ 1
  This is called SVM with "hard" constraints

Page 26

If the data is not separable, introduce a penalty:
  min_w ½‖w‖² + C·(# of training mistakes)
  s.t. ∀i, y_i·(w·x_i + b) ≥ 1
- Minimize ‖w‖² plus the number of training mistakes
- Set C using cross validation
- How to penalize mistakes? All mistakes are not equally bad!

(Diagram: a mostly separable dataset with a few "+" points on the "-" side and vice versa.)

Page 27

Introduce slack variables ξ_i:
- If point x_i is on the wrong side of the margin, then it gets penalty ξ_i

  min_{w,b,ξ_i≥0} ½‖w‖² + C·Σ_{i=1..n} ξ_i
  s.t. ∀i, y_i·(w·x_i + b) ≥ 1 - ξ_i

For each data point:
- If margin ≥ 1, don't care
- If margin < 1, pay linear penalty

(Diagram: points on the wrong side of the margin incur slack penalties ξ_i, ξ_j.)

Page 28

What is the role of the slack penalty C?
- C = ∞: Only want w, b that separate the data
- C = 0: Can set ξ_i to anything; then w = 0 (basically ignores the data)

  min_w ½‖w‖² + C·(# of training mistakes)
  s.t. ∀i, y_i·(w·x_i + b) ≥ 1

(Diagram: small C, "good" C, and big C produce increasingly strict separators.)

Page 29

SVM in the "natural" form:

  arg min_{w,b} ½ w·w + C·Σ_{i=1..n} max{0, 1 - y_i·(w·x_i + b)}

The first term is the margin (regularization) term, C is the regularization parameter, and the sum is the empirical loss L (how well we fit the training data).

Equivalently:
  min_{w,b} ½‖w‖² + C·Σ_{i=1..n} ξ_i
  s.t. ∀i, y_i·(w·x_i + b) ≥ 1 - ξ_i

SVM uses the "hinge loss": max{0, 1 - z}, where z = y_i·(w·x_i + b)

(Plot: hinge loss and 0/1 loss as a function of z; a small sketch follows below.)
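A minimal NumPy sketch of the objective in its "natural" form, computing the regularization term plus the hinge losses over a dataset (w, b, C are assumed to be given):

import numpy as np

def hinge_loss(w, b, X, Y):
    """Per-example hinge loss max{0, 1 - y*(w.x + b)}."""
    z = Y * (X @ w + b)
    return np.maximum(0.0, 1.0 - z)

def svm_objective(w, b, X, Y, C=1.0):
    """f(w,b) = 1/2 ||w||^2 + C * sum_i max{0, 1 - y_i*(w.x_i + b)}."""
    return 0.5 * np.dot(w, w) + C * np.sum(hinge_loss(w, b, X, Y))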

Page 30

Page 31

Want to estimate w and b!
- Standard way: Use a solver!
  - Solver: software for finding solutions to "common" optimization problems
- Use a quadratic solver:
  - Minimize a quadratic function
  - Subject to linear constraints
- Problem: Solvers are inefficient for big data!

  min_{w,b} ½ w·w + C·Σ_{i=1..n} ξ_i
  s.t. ∀i, y_i·(x_i·w + b) ≥ 1 - ξ_i

Page 32

Want to estimate w, b!
- Alternative approach: minimize f(w,b):

  f(w,b) = ½ Σ_{j=1..d} (w(j))² + C·Σ_{i=1..n} max{0, 1 - y_i·(Σ_{j=1..d} w(j)·x_i(j) + b)}

Side note: How to minimize a convex function g(z)?
- Use gradient descent: min_z g(z)
- Iterate: z_{t+1} ← z_t - η·∇g(z_t)

(Plot: a convex function g(z).)

Page 33

Want to minimize f(w,b):

  f(w,b) = ½ Σ_{j=1..d} (w(j))² + C·Σ_{i=1..n} max{0, 1 - y_i·(Σ_{j=1..d} w(j)·x_i(j) + b)}

The second term is the empirical loss L(x_i, y_i).

Compute the gradient ∇f(j) w.r.t. w(j):

  ∇f(j) = ∂f(w,b)/∂w(j) = w(j) + C·Σ_{i=1..n} ∂L(x_i, y_i)/∂w(j)

  ∂L(x_i, y_i)/∂w(j) = 0            if y_i·(w·x_i + b) ≥ 1
                     = -y_i·x_i(j)  else

Page 34

Gradient descent:

Iterate until convergence:
- For j = 1 … d
  - Evaluate: ∇f(j) = ∂f(w,b)/∂w(j) = w(j) + C·Σ_{i=1..n} ∂L(x_i, y_i)/∂w(j)
  - Update: w(j) ← w(j) - η·∇f(j)

(η … learning rate parameter; C … regularization parameter)

Problem: Computing ∇f(j) takes O(n) time!
- n … size of the training dataset

(A batch gradient descent sketch follows below.)
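A minimal batch gradient descent sketch for f(w,b), following the gradient from the previous page; the bias b is updated with the analogous derivative, and eta, C, and the iteration count are placeholder choices:

import numpy as np

def svm_gd(X, Y, C=1.0, eta=0.01, iters=1000):
    """Full-batch gradient descent on f(w,b); each step costs O(n*d)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margin = Y * (X @ w + b)
        viol = margin < 1.0                               # examples with active hinge loss
        grad_w = w - C * (Y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * Y[viol].sum()
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b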

Page 35

Stochastic Gradient Descent:
- Instead of evaluating the gradient over all examples, evaluate it for each individual training example:

  ∇f(j)(x_i) = w(j) + C·∂L(x_i, y_i)/∂w(j)

  (We just had: ∇f(j) = w(j) + C·Σ_{i=1..n} ∂L(x_i, y_i)/∂w(j))

Stochastic gradient descent:

Iterate until convergence:
- For i = 1 … n
  - For j = 1 … d
    - Compute: ∇f(j)(x_i)
    - Update: w(j) ← w(j) - η·∇f(j)(x_i)

Notice: no summation over i anymore.
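A minimal sketch of the stochastic update above, processing one example at a time; the shuffling, the bias update, and the fixed η are my own choices rather than part of the slide:

import numpy as np

def svm_sgd(X, Y, C=1.0, eta=0.01, epochs=10, seed=0):
    """SGD for the hinge-loss SVM: one gradient step per training example."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(n):                      # visit examples in random order
            if Y[i] * (np.dot(w, X[i]) + b) < 1.0:        # hinge loss is active
                w -= eta * (w - C * Y[i] * X[i])
                b -= eta * (-C * Y[i])
            else:                                         # only the regularizer contributes
                w -= eta * w
    return w, b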

Page 36

Page 37

Example by Leon Bottou:
- Reuters RCV1 document corpus
  - Predict the category of a document
  - One vs. the rest classification
- n = 781,000 training examples (documents)
- 23,000 test examples
- d = 50,000 features
  - One feature per word
  - Remove stop-words
  - Remove low-frequency words

Page 38

Questions:
- (1) Is SGD successful at minimizing f(w,b)?
- (2) How quickly does SGD find the minimum of f(w,b)?
- (3) What is the error on a test set?

(Table: training time, value of f(w,b), and test error for Standard SVM, "Fast SVM", and SGD-SVM.)

Findings:
- (1) SGD-SVM is successful at minimizing the value of f(w,b)
- (2) SGD-SVM is super fast
- (3) SGD-SVM test set error is comparable

Page 39

Optimization quality: |f(w,b) - f(w_opt, b_opt)|

(Plot: optimization quality vs. training time for a conventional SVM and SGD-SVM.)

For optimizing f(w,b) to within reasonable quality, SGD-SVM is super fast.

Page 40

SGD on the full dataset vs. Conjugate Gradient on a sample of n training examples

Theory says: gradient descent converges in time linear in κ, while conjugate gradient converges in √κ (κ … condition number).

Bottom line: Doing a simple (but fast) SGD update many times is better than doing a complicated (but slow) CG update a few times.

Page 41

Need to choose the learning rate η and t_0:

  w_{t+1} ← w_t - (η/(t + t_0)) · (w_t + C·∂L(x_i, y_i)/∂w)

Leon suggests:
- Choose t_0 so that the expected initial updates are comparable with the expected size of the weights
- Choose η:
  - Select a small subsample
  - Try various rates η (e.g., 10, 1, 0.1, 0.01, …)
  - Pick the one that most reduces the cost (see the sketch below)
  - Use that η for the next 100k iterations on the full dataset
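A rough sketch of the rate-selection heuristic; train_one_epoch and evaluate_cost are hypothetical placeholders standing in for one SGD pass over a small subsample and for evaluating f(w,b) on it:

def pick_learning_rate(train_one_epoch, evaluate_cost, candidates=(10, 1, 0.1, 0.01, 0.001)):
    """Try each candidate rate on a small subsample and keep the one with the lowest cost."""
    best_eta, best_cost = None, float("inf")
    for eta in candidates:
        cost = evaluate_cost(train_one_epoch(eta))   # train briefly with this rate, then score
        if cost < best_cost:
            best_eta, best_cost = eta, cost
    return best_eta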

Page 42

Sparse Linear SVM:
- Feature vector x_i is sparse (contains many zeros)
  - Do not represent x_i densely as [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,…]
  - But represent x_i as a sparse vector x_i = [(4,1), (9,5), …]
- Can we do the SGD update more efficiently?
- The update w ← w - η·(w + C·∂L(x_i, y_i)/∂w) is approximated in 2 steps:
  (1) w ← w - η·C·∂L(x_i, y_i)/∂w   (cheap: x_i is sparse, so only a few coordinates j of w will be updated)
  (2) w ← w·(1 - η)                 (expensive: w is not sparse, all coordinates need to be updated)
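A sketch of the two-step update on one sparse example, where x_i is represented as a dict of {index: value} pairs (my own stand-in for the [(4,1), (9,5), …] format); step (1) only touches the nonzero coordinates, step (2) touches all of w:

import numpy as np

def sparse_sgd_step(w, xi, yi, b, eta, C):
    """Two-step approximation of the SGD update for one sparse example."""
    margin = yi * (sum(v * w[j] for j, v in xi.items()) + b)
    if margin < 1.0:                       # step (1): cheap, only nonzero coordinates of xi
        for j, v in xi.items():
            w[j] += eta * C * yi * v
    w *= (1.0 - eta)                       # step (2): expensive, shrinks every coordinate
    return w

# Example: x_i = [(4,1), (9,5)] as a dict, d = 12 features
w = sparse_sgd_step(np.zeros(12), {4: 1.0, 9: 5.0}, yi=1, b=0.0, eta=0.1, C=1.0)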

Page 43

Two-step update procedure:
  (1) w ← w - η·C·∂L(x_i, y_i)/∂w
  (2) w ← w·(1 - η)

Solution 1: w = s·v
- Represent the vector w as the product of a scalar s and a vector v
- Then the update procedure is:
  (1) v ← v - (η·C/s)·∂L(x_i, y_i)/∂w
  (2) s ← s·(1 - η)
Solution 2:
- Perform only step (1) for each training example
- Perform step (2) with lower frequency and higher η
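A sketch of Solution 1 under the same sparse-dict assumption as above: keeping w = s·v makes the shrinking step (2) a single scalar multiplication instead of a pass over all of w:

import numpy as np

class ScaledVector:
    """Represent w = s * v so that w <- w*(1 - eta) is just s <- s*(1 - eta)."""
    def __init__(self, d):
        self.s = 1.0
        self.v = np.zeros(d)

    def dot_sparse(self, xi):
        return self.s * sum(val * self.v[j] for j, val in xi.items())

    def step1(self, xi, yi, eta, C):
        # gradient step when the hinge loss is active (dL/dw = -yi*xi):
        # w <- w + eta*C*yi*xi  translates to  v <- v + (eta*C*yi/s)*xi
        for j, val in xi.items():
            self.v[j] += eta * C * yi * val / self.s

    def step2(self, eta):
        # w <- w*(1 - eta) costs O(1) in this representation
        self.s *= (1.0 - eta)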

Page 44

Stopping criteria: How many iterations of SGD?
- Early stopping with cross validation:
  - Create a validation set
  - Monitor the cost function on the validation set
  - Stop when the loss stops decreasing
- Early stopping:
  - Extract two disjoint subsamples A and B of the training data
  - Train on A, stop by validating on B
  - The number of epochs is an estimate of k
  - Train for k epochs on the full dataset

Page 45

Idea 1: One against all. Learn 3 classifiers:
- + vs. {o, -}
- - vs. {o, +}
- o vs. {+, -}
Obtain: w_+ b_+, w_- b_-, w_o b_o
How to classify? Return the class c with the highest score:
  arg max_c w_c·x + b_c
(A small sketch follows below.)
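A minimal sketch of the one-against-all decision rule, assuming the per-class (w_c, b_c) pairs have already been trained (the vectors below are made-up example values):

import numpy as np

def predict_one_vs_all(classifiers, x):
    """classifiers: dict mapping class label -> (w, b). Return arg max_c w_c.x + b_c."""
    return max(classifiers, key=lambda c: np.dot(classifiers[c][0], x) + classifiers[c][1])

classifiers = {
    "+": (np.array([1.0, 0.0]), 0.0),
    "-": (np.array([-1.0, 0.0]), 0.0),
    "o": (np.array([0.0, 1.0]), -0.5),
}
print(predict_one_vs_all(classifiers, np.array([0.2, 0.9])))   # "o" scores highest here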

Page 46

Idea 2: Learn the 3 sets of weights simultaneously!
- For each class c estimate w_c, b_c
- Want the correct class of each training example (x_i, y_i) to have the highest margin:
  w_{y_i}·x_i + b_{y_i} ≥ 1 + w_c·x_i + b_c   ∀c ≠ y_i, ∀i

Page 47

Optimization problem:

  min_{w,b} ½ Σ_c ‖w_c‖² + C·Σ_{i=1..n} ξ_i
  s.t. w_{y_i}·x_i + b_{y_i} ≥ w_c·x_i + b_c + 1 - ξ_i,   ∀c ≠ y_i, ∀i
       ξ_i ≥ 0, ∀i

- To obtain the parameters w_c, b_c (for each class c) we can use similar techniques as for the 2-class SVM
- SVM is widely perceived as a very powerful learning algorithm