
Text Classification + Machine Learning Review 3

CS 287


Review: Logistic Regression (Murphy, p 268)

Cons

- Harder to fit than naive Bayes.
- Must fit all classes together.
- Not a good fit for semi-supervised/missing-data cases.

Pros

- Better-calibrated probability estimates.
- Natural handling of feature input.
- Features likely not multinomials.
- (For us) extends naturally to neural networks.


Review: Gradients for Softmax Regression

For multiclass logistic regression:

\frac{\partial L(y, \hat{y})}{\partial z_i}
  = \sum_j \frac{\partial L}{\partial \hat{y}_j} \frac{\partial \hat{y}_j}{\partial z_i},
  \qquad \frac{\partial L}{\partial \hat{y}_j} = -\frac{\mathbf{1}(j = c)}{\hat{y}_j}

\frac{\partial L(y, \hat{y})}{\partial z_i}
  = \begin{cases} -(1 - \hat{y}_i) & i = c \\ \hat{y}_i & \text{o.w.} \end{cases}

Therefore, for parameters θ,

\frac{\partial L}{\partial b_i} = \frac{\partial L}{\partial z_i}
  \qquad
\frac{\partial L}{\partial W_{f,i}} = x_f \, \frac{\partial L}{\partial z_i}

Intuition:

- Nothing happens on a correct, confident classification.
- Weights of true-class features increase based on the probability not given to the true class.
- Weights of false-class features decrease based on the probability given to them.
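As a concrete check (not from the slides), here is a minimal NumPy sketch of these gradients for a single example; the helper names softmax and softmax_grads are mine:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def softmax_grads(x, c, W, b):
    """Cross-entropy gradients for one example.

    x: feature vector of shape (F,), c: true class index,
    W: weight matrix of shape (F, C), b: bias vector of shape (C,).
    """
    x = np.asarray(x, dtype=float)
    z = x @ W + b                   # scores z
    y_hat = softmax(z)              # predicted distribution
    dz = y_hat.copy()
    dz[c] -= 1.0                    # dL/dz_i = y_hat_i - 1(i == c)
    db = dz                         # dL/db_i   = dL/dz_i
    dW = np.outer(x, dz)            # dL/dW_f,i = x_f * dL/dz_i
    return dW, db
```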



Gradient-Based Optimization: SGD

procedure SGD
  while training criterion is not met do
    Sample a training example (x_i, y_i)
    Compute the loss L(y_i, ŷ_i; θ)
    Compute gradients g of L(y_i, ŷ_i; θ) with respect to θ
    θ ← θ − η g
  end while
  return θ
end procedure
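A minimal Python rendering of this loop (illustrative only: grads_fn is any per-example gradient function, e.g. the hypothetical softmax_grads above, and the stopping criterion is simply a fixed number of epochs):

```python
import numpy as np

def sgd(grads_fn, X, y, W, b, eta=1.0, epochs=5, seed=0):
    """Plain SGD: repeatedly pick one example and step against its gradient."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):                  # "training criterion": a fixed number of epochs
        for i in rng.permutation(len(y)):    # sample training examples (one shuffled pass)
            dW, db = grads_fn(X[i], y[i], W, b)
            W = W - eta * dW                 # theta <- theta - eta * g
            b = b - eta * db
    return W, b
```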


Quiz: Softmax Regression

Given bag-of-word features

F = {The, movie, was, terrible, rocked, A}

and two training data points:

Class 1: The movie was terrible

Class 2: The movie rocked

Assume that we start with parameters W = 0 and b = 0, and we train with learning rate η = 1 and λ = 0. What are the loss and the parameters after one pass through the data, in order?


Answer: Softmax Regression (1)

First iteration:

\hat{y}_1 = \begin{bmatrix} 0.5 & 0.5 \end{bmatrix}

L(y_1, \hat{y}_1) = -\log 0.5

W = \begin{bmatrix} 0.5 & 0.5 & 0.5 & 0.5 & 0 & 0 \\ -0.5 & -0.5 & -0.5 & -0.5 & 0 & 0 \end{bmatrix}

b = \begin{bmatrix} 0.5 & -0.5 \end{bmatrix}


Answer: Softmax Regression (2)

Second iteration:

\hat{y}_2 = \mathrm{softmax}(\begin{bmatrix} 1.5 & -1.5 \end{bmatrix}) \approx \begin{bmatrix} 0.95 & 0.05 \end{bmatrix}

L(y_2, \hat{y}_2) = -\log 0.05

W \approx \begin{bmatrix} -0.45 & -0.45 & 0.5 & 0.5 & -0.95 & 0 \\ 0.45 & 0.45 & -0.5 & -0.5 & 0.95 & 0 \end{bmatrix}

b \approx \begin{bmatrix} -0.45 & 0.45 \end{bmatrix}
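These numbers are easy to verify mechanically. The short NumPy check below (illustrative, reusing the softmax and softmax_grads helpers sketched earlier) reproduces them:

```python
import numpy as np

# Feature order: The, movie, was, terrible, rocked, A
X = np.array([[1, 1, 1, 1, 0, 0],    # "The movie was terrible" -> class 0
              [1, 1, 0, 0, 1, 0]])   # "The movie rocked"       -> class 1
y = np.array([0, 1])

W, b = np.zeros((6, 2)), np.zeros(2)
for x_i, y_i in zip(X, y):                    # one in-order pass with eta = 1
    y_hat = softmax(x_i @ W + b)
    print("loss:", -np.log(y_hat[y_i]))       # -log 0.5, then roughly -log 0.05
    dW, db = softmax_grads(x_i, y_i, W, b)
    W -= dW                                   # eta = 1
    b -= db

print(W.T)   # rows are classes, matching the matrices above
print(b)
```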


Today’s Class

So far

- Naive Bayes (Multinomial)
- Multiclass Logistic Regression (SGD)

Today

- Multiclass Hinge-loss
- More about optimization


Contents

Multiclass Hinge-Loss

Gradients

Black-Box Optimization


Other Loss Functions

What if we just try to directly find W and b?

\hat{y} = xW + b

- No longer a probabilistic interpretation.
- Just try to find parameters that fit the training data.


0/1 Loss

Just count the number of training examples we classify incorrectly:

L(\theta) = \sum_{i=1}^{n} L_{0/1}(y, \hat{y})

L_{0/1}(y, \hat{y}) = \mathbf{1}(\arg\max_{c'} \hat{y}_{c'} \neq c)

\frac{\partial L(y, \hat{y})}{\partial \hat{y}_j} =
  \begin{cases} 0 & j = c \\ 0 & \text{o.w.} \end{cases}

The gradient is zero wherever it is defined, so the 0/1 loss gives no signal for gradient-based training.

[Figure: 0/1 loss over a pair of scores [x\ y], i.e. \mathbf{1}(x > y).]
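For completeness, a one-line sketch of this loss over a vector of predicted scores (the function name is mine):

```python
import numpy as np

def zero_one_loss(y_hat, c):
    """1 if the highest-scoring class is not the true class c, else 0."""
    return int(np.argmax(y_hat) != c)
```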



Hinge Loss

L(\theta) = \sum_{i=1}^{n} L_{hinge}(y, \hat{y})

L_{hinge}(y, \hat{y}) = \max\{0,\ 1 - (\hat{y}_c - \hat{y}_{c'})\}

Where

- c is the true class, i.e. y_{i,c} = 1
- c' is the highest-scoring non-true class: c' = \arg\max_{i \in \mathcal{C} \setminus \{c\}} \hat{y}_i

Hinge loss upper-bounds the 0/1 loss, so minimizing it minimizes an upper bound on the training error:

L_{hinge}(y, \hat{y}) \geq L_{0/1}(y, \hat{y})
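A short NumPy sketch of this multiclass hinge loss for one example (names are mine):

```python
import numpy as np

def multiclass_hinge(y_hat, c):
    """max{0, 1 - (y_hat[c] - y_hat[c'])}, with c' the best non-true class."""
    scores = np.asarray(y_hat, dtype=float)
    others = np.where(np.arange(len(scores)) == c, -np.inf, scores)
    c_prime = int(np.argmax(others))          # highest-scoring non-true class
    return max(0.0, 1.0 - (scores[c] - scores[c_prime]))

# e.g. multiclass_hinge([2.0, 1.8, 0.3], c=0) -> 0.8 (margin of only 0.2)
```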



Hinge Loss

[Figure: the hinge loss \max\{0,\ 1 - (y - x)\} compared with the 0/1 indicator \mathbf{1}(x > y) over a pair of scores [x\ y].]


Important Case: Hinge-loss for Binary

L_{hinge}([0\ 1], [x\ y]) = \max\{0,\ 1 - (y - x)\} = \mathrm{ReLU}(1 - (y - x))

Neural network name (rectified linear unit):

\mathrm{ReLU}(t) = \max\{0,\ t\}


Hinge-Loss Properties

Complete objective:

L_{hinge}(\theta) = \sum_{i=1}^{n} \max\{0,\ 1 - (\hat{y}_c - \hat{y}_{c'})\}
                  = \sum_{i=1}^{n} \max\{0,\ 1 - (\hat{y}_c - \max_{c' \in \mathcal{C} \setminus \{c\}} \hat{y}_{c'})\}

- Apply convexity rules: linear \hat{y} is convex, a max of convex functions is convex, linear plus convex is convex, and a sum of convex functions is convex (Boyd and Vandenberghe, 2004, pp. 72-74).
- However, it is non-differentiable because of the max.


Piecewise Linear Objective

L(\theta) = \sum_{i=1}^{n} \max\{0,\ 1 - (\hat{y}_c - \hat{y}_{c'})\}

[Figure: example piecewise-linear objective 10 \max\{0,\ 1 - (y - x)\} + 5 \max\{0,\ 1 - (x - y)\}.]


Objective with Regularization

L(\theta) = \sum_{i=1}^{n} \max\{0,\ 1 - (\hat{y}_c - \hat{y}_{c'})\} + \lambda\lVert\theta\rVert^2

[Figure: example regularized objective 10 \max\{0,\ 1 - (y - x)\} + 5 \max\{0,\ 1 - (x - y)\} + 5\lVert\theta\rVert^2.]


Contents

Multiclass Hinge-Loss

Gradients

Black-Box Optimization


(Sub)Gradient Rule

- Technically, ReLU is non-differentiable.
- Only an issue at 0, generally for "ties".
- We informally use subgradients:

\frac{d\,\mathrm{ReLU}(x)}{dx} =
  \begin{cases} 1 & x > 0 \\ 0 & x < 0 \\ 1 \text{ or } 0 & \text{o.w.} \end{cases}

Generally,

\frac{d}{dx} \max_{v'} f(x, v') = f'(x, v) \quad \text{for any } v \in \arg\max_{v'} f(x, v')


Symbolic Gradients

- Let c be the true class.
- Let c' be the highest-scoring non-true class:

c' = \arg\max_{i \in \mathcal{C} \setminus \{c\}} \hat{y}_i

- Partials of L(y, \hat{y}):

\frac{\partial L(y, \hat{y})}{\partial \hat{y}_j} =
  \begin{cases} 0 & \hat{y}_c - \hat{y}_{c'} > 1 \\ 1 & j = c' \\ -1 & j = c \\ 0 & \text{o.w.} \end{cases}

Intuition: if the prediction is wrong or close to wrong, raise the score of the correct class and lower the score of the closest incorrect class.
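Putting the subgradient rule and these partials together, a hedged NumPy sketch of the resulting per-example gradients (names are mine; it assumes the linear scorer ŷ = xW + b and chain-rules exactly as in the softmax case):

```python
import numpy as np

def hinge_subgradients(x, c, W, b):
    """Subgradients of max{0, 1 - (y_hat[c] - y_hat[c'])} for one example, y_hat = xW + b."""
    x = np.asarray(x, dtype=float)
    y_hat = x @ W + b
    others = np.where(np.arange(len(y_hat)) == c, -np.inf, y_hat)
    c_prime = int(np.argmax(others))          # highest-scoring non-true class
    dW, db = np.zeros_like(W), np.zeros_like(b)
    if y_hat[c] - y_hat[c_prime] <= 1:        # wrong, or inside the margin
        d_yhat = np.zeros_like(b)
        d_yhat[c], d_yhat[c_prime] = -1.0, 1.0
        db = d_yhat                           # dL/db_i   = dL/d y_hat_i
        dW = np.outer(x, d_yhat)              # dL/dW_f,i = x_f * dL/d y_hat_i
    return dW, db
```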


Notes: Hinge Loss: Regularization

- Goes by many different names:
  - margin classifier
  - multiclass hinge
  - linear SVM
- Important to use regularization:

L(\theta) = \sum_{i=1}^{n} L_{hinge}(y, \hat{y}) + \lambda\lVert\theta\rVert_2^2

- Can be much more efficient to train than logistic regression (no partition function to compute).


Results: Longer Reviews

IMDB (longer movie reviews), Subj (subjectivity)

- NBSVM is hinge loss interpolated with Naive Bayes.

[Results table not included in the transcript.]


Contents

Multiclass Hinge-Loss

Gradients

Black-Box Optimization


Optimization Methods

Goal: minimize a function L : \mathbb{R}^{|\theta|} \to \mathbb{R}

First-order methods

L(\theta + \delta) \approx L(\theta) + L'(\theta)^\top \delta

- Require computing L(\theta) and the gradient L'(\theta).

Second-order methods

L(\theta + \delta) \approx L(\theta) + L'(\theta)^\top \delta + \tfrac{1}{2}\,\delta^\top H \delta

- Require computing L(\theta), the gradient L'(\theta), and the Hessian H.

Stochastic methods

- Require computing E[L(\theta)] and the expected gradient.


Gradient Descent

k ← 0
while training criterion is not met do
  g ← 0
  for i = 1 to n do
    Compute the loss L(y_i, ŷ_i; θ)
    Compute gradients g′ of L(y_i, ŷ_i; θ) with respect to θ
    g ← g + (1/n) g′
  end for
  θ_{k+1} ← θ_k − η_k g
  k ← k + 1
end while
return θ


Choosing the Learning Rate


Gradient Descent with Momentum

Standard gradient descent (figure from Boyd):

\theta_{k+1} \leftarrow \theta_k - \eta_k g

With a momentum term:

\theta_{k+1} \leftarrow \theta_k - \eta_k g + \mu_k(\theta_k - \theta_{k-1})

- Also known as: the heavy-ball method.
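A hedged NumPy sketch of the heavy-ball update exactly as written above (illustrative only; grad_fn, eta, and mu are assumptions, with constant step size and momentum):

```python
import numpy as np

def heavy_ball(grad_fn, theta0, eta=0.1, mu=0.9, steps=100):
    """theta_{k+1} = theta_k - eta * g + mu * (theta_k - theta_{k-1})."""
    theta_prev = theta0.copy()
    theta = theta0.copy()
    for _ in range(steps):
        g = grad_fn(theta)
        theta_next = theta - eta * g + mu * (theta - theta_prev)
        theta_prev, theta = theta, theta_next
    return theta

# Example: minimize the quadratic 0.5 * ||theta||^2 (its gradient is theta).
theta_star = heavy_ball(lambda th: th, np.array([5.0, -3.0]))
```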


Second-Order

Requires computing the Hessian H.

The second-order update becomes:

\theta_{k+1} \leftarrow \theta_k - \eta_k H^{-1} g

- Used for strictly convex functions (although there are variants).
- Also known as: Newton's method.
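For concreteness, a minimal Newton's-method sketch on a strictly convex quadratic (everything here, including the example function, is illustrative):

```python
import numpy as np

def newton(grad_fn, hess_fn, theta0, eta=1.0, steps=10):
    """theta_{k+1} = theta_k - eta * H^{-1} g."""
    theta = theta0.copy()
    for _ in range(steps):
        g = grad_fn(theta)
        H = hess_fn(theta)
        theta = theta - eta * np.linalg.solve(H, g)   # solve H d = g rather than inverting H
    return theta

# Quadratic L(theta) = 0.5 * theta^T A theta - b^T theta, minimized at A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
theta_min = newton(lambda th: A @ th - b, lambda th: A, np.zeros(2))
```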


Quasi-Newton Methods

Construct an approximate Hessian from first-order information.

- BFGS
  - constructs an approximate Hessian directly
  - O(|θ|²) space
- L-BFGS
  - limited-memory BFGS: only saves the last m gradients
  - m can often be set to 20 or smaller

Fast implementations are available; the specific details are beyond the scope of this course.
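As a usage sketch (not from the lecture), SciPy ships an L-BFGS implementation; here it minimizes the same illustrative quadratic as above, with the maxcor option playing the role of the memory size m:

```python
import numpy as np
from scipy.optimize import minimize

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

def loss(theta):
    return 0.5 * theta @ A @ theta - b @ theta

def grad(theta):
    return A @ theta - b

result = minimize(loss, x0=np.zeros(2), jac=grad,
                  method="L-BFGS-B",
                  options={"maxcor": 10})   # m: number of stored corrections
print(result.x)                             # approximately A^{-1} b
```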


Stochastic Methods

- Minimize the function L(θ).
- Require computing E[L(θ)] and the expected gradient.
- Typically done by sampling a subset of the data, computing a gradient, and updating.
- Other first-order optimizers (like momentum) can be used.


Gradient-Based Optimization: Minibatch SGD

while training criterion is not met do
  Sample a minibatch of m examples (x_1, y_1), ..., (x_m, y_m)
  g ← 0
  for i = 1 to m do
    Compute the loss L(y_i, ŷ_i; θ)
    Compute gradients g′ of L(y_i, ŷ_i; θ) with respect to θ
    g ← g + (1/m) g′
  end for
  θ ← θ − η_k g
end while
return θ
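A hedged Python sketch of this procedure (illustrative; grads_fn is any per-example gradient function such as the earlier hypothetical softmax_grads or hinge_subgradients, and the batch size and step count are arbitrary):

```python
import numpy as np

def minibatch_sgd(grads_fn, X, y, W, b, m=32, eta=0.1, steps=1000, seed=0):
    """Minibatch SGD: average per-example gradients over a sampled batch, then step."""
    rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(steps):
        batch = rng.choice(n, size=min(m, n), replace=False)   # sample a minibatch
        gW, gb = np.zeros_like(W), np.zeros_like(b)
        for i in batch:
            dW, db = grads_fn(X[i], y[i], W, b)
            gW += dW / len(batch)                              # g <- g + (1/m) g'
            gb += db / len(batch)
        W = W - eta * gW                                       # theta <- theta - eta_k * g
        b = b - eta * gb
    return W, b
```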


Tricks On Using SGD (Bottou 2012)

- It can be crucial to experiment with the learning rate.
- It is often useful to use a development set for stopping.
- Shuffle the data first, then run over it.
- Bottou (2012) also describes "averaged" versions, which work well in practice.


Optimization in NLP

- For convex batch methods:
  - L-BFGS is easy to use and effective.
  - Nice for verifying results.
  - Sometimes even m times the parameters is a lot of memory, though.
- For both convex and, especially, non-convex problems:
  - SGD and its variants are dominant.
  - Trade-off of speed vs. exact optimization.
- Also see the notes on AdaGrad, another popular method (a sketch follows below).
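AdaGrad is not derived in these slides; as a reference point only, here is a minimal sketch of its per-coordinate update (names and constants are mine):

```python
import numpy as np

def adagrad(grad_fn, theta0, eta=0.1, eps=1e-8, steps=1000):
    """AdaGrad: scale each coordinate's step by the root of its accumulated squared gradients."""
    theta = theta0.copy()
    G = np.zeros_like(theta)                 # running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        G += g * g
        theta -= eta * g / (np.sqrt(G) + eps)
    return theta
```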