
Text Classification + Machine Learning Review 3

CS 287


Review: Logistic Regression (Murphy, p 268)

Cons

- Harder to fit than naive Bayes.
- Must fit all classes together.
- Not a good fit for semi-supervised/missing-data cases.

Pros

- Better-calibrated probability estimates.
- Natural handling of feature input.
- Features likely not multinomials.
- (For us) extends naturally to neural networks.


Review: Gradients for Softmax Regression

For multiclass logistic regression:

\frac{\partial L(y, \hat{y})}{\partial z_i}
  = \sum_j \frac{\partial L}{\partial \hat{y}_j} \frac{\partial \hat{y}_j}{\partial z_i},
  \qquad \frac{\partial L}{\partial \hat{y}_j} = -\frac{\mathbf{1}(j = c)}{\hat{y}_j}

\frac{\partial L(y, \hat{y})}{\partial z_i}
  = \begin{cases} -(1 - \hat{y}_i) & i = c \\ \hat{y}_i & \text{o.w.} \end{cases}

Therefore, for parameters θ,

\frac{\partial L}{\partial b_i} = \frac{\partial L}{\partial z_i}
  \qquad
\frac{\partial L}{\partial W_{f,i}} = x_f \, \frac{\partial L}{\partial z_i}

Intuition:

- Nothing happens on a correct, confident classification.
- Weights of true-class features increase based on the probability not given to the true class.
- Weights of false-class features decrease based on the probability given to them.
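As a concrete check (not from the slides), here is a minimal NumPy sketch of these gradients for a single example; the helper names softmax and softmax_grads are mine:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def softmax_grads(x, c, W, b):
    """Cross-entropy gradients for one example.

    x: feature vector of shape (F,), c: true class index,
    W: weight matrix of shape (F, C), b: bias vector of shape (C,).
    """
    x = np.asarray(x, dtype=float)
    z = x @ W + b                   # scores z
    y_hat = softmax(z)              # predicted distribution
    dz = y_hat.copy()
    dz[c] -= 1.0                    # dL/dz_i = y_hat_i - 1(i == c)
    db = dz                         # dL/db_i   = dL/dz_i
    dW = np.outer(x, dz)            # dL/dW_f,i = x_f * dL/dz_i
    return dW, db
```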



Gradient-Based Optimization: SGD

procedure SGD
  while training criterion is not met do
    Sample a training example (x_i, y_i)
    Compute the loss L(y_i, ŷ_i; θ)
    Compute gradients g of L(y_i, ŷ_i; θ) with respect to θ
    θ ← θ − η g
  end while
  return θ
end procedure
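A minimal Python rendering of this loop (illustrative only: grads_fn is any per-example gradient function, e.g. the hypothetical softmax_grads above, and the stopping criterion is simply a fixed number of epochs):

```python
import numpy as np

def sgd(grads_fn, X, y, W, b, eta=1.0, epochs=5, seed=0):
    """Plain SGD: repeatedly pick one example and step against its gradient."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):                  # "training criterion": a fixed number of epochs
        for i in rng.permutation(len(y)):    # sample training examples (one shuffled pass)
            dW, db = grads_fn(X[i], y[i], W, b)
            W = W - eta * dW                 # theta <- theta - eta * g
            b = b - eta * db
    return W, b
```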


Quiz: Softmax Regression

Given bag-of-word features

F = {The, movie, was, terrible, rocked, A}

and two training data points:

Class 1: The movie was terrible

Class 2: The movie rocked

Assume that we start with parameters W = 0 and b = 0, and we train with learning rate η = 1 and λ = 0. What are the loss and the parameters after one pass through the data, in order?


Answer: Softmax Regression (1)

First iteration:

\hat{y}_1 = \begin{bmatrix} 0.5 & 0.5 \end{bmatrix}

L(y_1, \hat{y}_1) = -\log 0.5

W = \begin{bmatrix} 0.5 & 0.5 & 0.5 & 0.5 & 0 & 0 \\ -0.5 & -0.5 & -0.5 & -0.5 & 0 & 0 \end{bmatrix}

b = \begin{bmatrix} 0.5 & -0.5 \end{bmatrix}


Answer: Softmax Regression (2)

Second iteration:

\hat{y}_2 = \mathrm{softmax}(\begin{bmatrix} 1.5 & -1.5 \end{bmatrix}) \approx \begin{bmatrix} 0.95 & 0.05 \end{bmatrix}

L(y_2, \hat{y}_2) = -\log 0.05

W \approx \begin{bmatrix} -0.45 & -0.45 & 0.5 & 0.5 & -0.95 & 0 \\ 0.45 & 0.45 & -0.5 & -0.5 & 0.95 & 0 \end{bmatrix}

b \approx \begin{bmatrix} -0.45 & 0.45 \end{bmatrix}
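These numbers are easy to verify mechanically. The short NumPy check below (illustrative, reusing the softmax and softmax_grads helpers sketched earlier) reproduces them:

```python
import numpy as np

# Feature order: The, movie, was, terrible, rocked, A
X = np.array([[1, 1, 1, 1, 0, 0],    # "The movie was terrible" -> class 0
              [1, 1, 0, 0, 1, 0]])   # "The movie rocked"       -> class 1
y = np.array([0, 1])

W, b = np.zeros((6, 2)), np.zeros(2)
for x_i, y_i in zip(X, y):                    # one in-order pass with eta = 1
    y_hat = softmax(x_i @ W + b)
    print("loss:", -np.log(y_hat[y_i]))       # -log 0.5, then roughly -log 0.05
    dW, db = softmax_grads(x_i, y_i, W, b)
    W -= dW                                   # eta = 1
    b -= db

print(W.T)   # rows are classes, matching the matrices above
print(b)
```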


Today’s Class

So far

- Naive Bayes (Multinomial)
- Multiclass Logistic Regression (SGD)

Today

- Multiclass Hinge-loss
- More about optimization


Contents

Multiclass Hinge-Loss

Gradients

Black-Box Optimization


Other Loss Functions

What if we just try to directly find W and b?

\hat{y} = xW + b

- No longer a probabilistic interpretation.
- Just try to find parameters that fit the training data.


0/1 Loss

Just count the number of training examples we classify incorrectly:

L(\theta) = \sum_{i=1}^{n} L_{0/1}(y, \hat{y})

L_{0/1}(y, \hat{y}) = \mathbf{1}(\arg\max_{c'} \hat{y}_{c'} \neq c)

\frac{\partial L(y, \hat{y})}{\partial \hat{y}_j} =
  \begin{cases} 0 & j = c \\ 0 & \text{o.w.} \end{cases}

The gradient is zero wherever it is defined, so the 0/1 loss gives no signal for gradient-based training.

[Figure: 0/1 loss over a pair of scores [x\ y], i.e. \mathbf{1}(x > y).]
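For completeness, a one-line sketch of this loss over a vector of predicted scores (the function name is mine):

```python
import numpy as np

def zero_one_loss(y_hat, c):
    """1 if the highest-scoring class is not the true class c, else 0."""
    return int(np.argmax(y_hat) != c)
```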



Hinge Loss

L(\theta) = \sum_{i=1}^{n} L_{hinge}(y, \hat{y})

L_{hinge}(y, \hat{y}) = \max\{0,\ 1 - (\hat{y}_c - \hat{y}_{c'})\}

Where

- c is the true class, i.e. y_{i,c} = 1
- c' is the highest-scoring non-true class: c' = \arg\max_{i \in \mathcal{C} \setminus \{c\}} \hat{y}_i

Hinge loss upper-bounds the 0/1 loss, so minimizing it minimizes an upper bound on the training error:

L_{hinge}(y, \hat{y}) \geq L_{0/1}(y, \hat{y})
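A short NumPy sketch of this multiclass hinge loss for one example (names are mine):

```python
import numpy as np

def multiclass_hinge(y_hat, c):
    """max{0, 1 - (y_hat[c] - y_hat[c'])}, with c' the best non-true class."""
    scores = np.asarray(y_hat, dtype=float)
    others = np.where(np.arange(len(scores)) == c, -np.inf, scores)
    c_prime = int(np.argmax(others))          # highest-scoring non-true class
    return max(0.0, 1.0 - (scores[c] - scores[c_prime]))

# e.g. multiclass_hinge([2.0, 1.8, 0.3], c=0) -> 0.8 (margin of only 0.2)
```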



Hinge Loss

[Figure: the hinge loss \max\{0,\ 1 - (y - x)\} compared with the 0/1 indicator \mathbf{1}(x > y) over a pair of scores [x\ y].]


Important Case: Hinge-loss for Binary

L_{hinge}([0\ 1], [x\ y]) = \max\{0,\ 1 - (y - x)\} = \mathrm{ReLU}(1 - (y - x))

Neural network name (rectified linear unit):

\mathrm{ReLU}(t) = \max\{0,\ t\}


Hinge-Loss Properties

Complete objective:

L_{hinge}(\theta) = \sum_{i=1}^{n} \max\{0,\ 1 - (\hat{y}_c - \hat{y}_{c'})\}
                  = \sum_{i=1}^{n} \max\{0,\ 1 - (\hat{y}_c - \max_{c' \in \mathcal{C} \setminus \{c\}} \hat{y}_{c'})\}

- Apply convexity rules: linear \hat{y} is convex, a max of convex functions is convex, linear plus convex is convex, and a sum of convex functions is convex (Boyd and Vandenberghe, 2004, pp. 72-74).
- However, it is non-differentiable because of the max.


Piecewise Linear Objective

L(\theta) = \sum_{i=1}^{n} \max\{0,\ 1 - (\hat{y}_c - \hat{y}_{c'})\}

[Figure: example piecewise-linear objective 10 \max\{0,\ 1 - (y - x)\} + 5 \max\{0,\ 1 - (x - y)\}.]


Objective with Regularization

L(\theta) = \sum_{i=1}^{n} \max\{0,\ 1 - (\hat{y}_c - \hat{y}_{c'})\} + \lambda\lVert\theta\rVert^2

[Figure: example regularized objective 10 \max\{0,\ 1 - (y - x)\} + 5 \max\{0,\ 1 - (x - y)\} + 5\lVert\theta\rVert^2.]


Contents

Multiclass Hinge-Loss

Gradients

Black-Box Optimization


(Sub)Gradient Rule

- Technically, ReLU is non-differentiable.
- Only an issue at 0, generally for "ties".
- We informally use subgradients:

\frac{d\,\mathrm{ReLU}(x)}{dx} =
  \begin{cases} 1 & x > 0 \\ 0 & x < 0 \\ 1 \text{ or } 0 & \text{o.w.} \end{cases}

Generally,

\frac{d}{dx} \max_{v'} f(x, v') = f'(x, v) \quad \text{for any } v \in \arg\max_{v'} f(x, v')


Symbolic Gradients

- Let c be the true class.
- Let c' be the highest-scoring non-true class:

c' = \arg\max_{i \in \mathcal{C} \setminus \{c\}} \hat{y}_i

- Partials of L(y, \hat{y}):

\frac{\partial L(y, \hat{y})}{\partial \hat{y}_j} =
  \begin{cases} 0 & \hat{y}_c - \hat{y}_{c'} > 1 \\ 1 & j = c' \\ -1 & j = c \\ 0 & \text{o.w.} \end{cases}

Intuition: if the prediction is wrong or close to wrong, raise the score of the correct class and lower the score of the closest incorrect class.
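Putting the subgradient rule and these partials together, a hedged NumPy sketch of the resulting per-example gradients (names are mine; it assumes the linear scorer ŷ = xW + b and chain-rules exactly as in the softmax case):

```python
import numpy as np

def hinge_subgradients(x, c, W, b):
    """Subgradients of max{0, 1 - (y_hat[c] - y_hat[c'])} for one example, y_hat = xW + b."""
    x = np.asarray(x, dtype=float)
    y_hat = x @ W + b
    others = np.where(np.arange(len(y_hat)) == c, -np.inf, y_hat)
    c_prime = int(np.argmax(others))          # highest-scoring non-true class
    dW, db = np.zeros_like(W), np.zeros_like(b)
    if y_hat[c] - y_hat[c_prime] <= 1:        # wrong, or inside the margin
        d_yhat = np.zeros_like(b)
        d_yhat[c], d_yhat[c_prime] = -1.0, 1.0
        db = d_yhat                           # dL/db_i   = dL/d y_hat_i
        dW = np.outer(x, d_yhat)              # dL/dW_f,i = x_f * dL/d y_hat_i
    return dW, db
```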


Notes: Hinge Loss: Regularization

- Goes by many different names:
  - margin classifier
  - multiclass hinge
  - linear SVM
- Important to use regularization:

L(\theta) = \sum_{i=1}^{n} L_{hinge}(y, \hat{y}) + \lambda\lVert\theta\rVert_2^2

- Can be much more efficient to train than logistic regression (no partition function to compute).


Results: Longer Reviews

IMDB (longer movie reviews), Subj (subjectivity)

- NBSVM is hinge loss interpolated with Naive Bayes.

[Results table not included in the transcript.]


Contents

Multiclass Hinge-Loss

Gradients

Black-Box Optimization


Optimization Methods

Goal: minimize a function L : \mathbb{R}^{|\theta|} \to \mathbb{R}

First-order methods

L(\theta + \delta) \approx L(\theta) + L'(\theta)^\top \delta

- Require computing L(\theta) and the gradient L'(\theta).

Second-order methods

L(\theta + \delta) \approx L(\theta) + L'(\theta)^\top \delta + \tfrac{1}{2}\,\delta^\top H \delta

- Require computing L(\theta), the gradient L'(\theta), and the Hessian H.

Stochastic methods

- Require computing E[L(\theta)] and the expected gradient.


Gradient Descent

k ← 0
while training criterion is not met do
  g ← 0
  for i = 1 to n do
    Compute the loss L(y_i, ŷ_i; θ)
    Compute gradients g′ of L(y_i, ŷ_i; θ) with respect to θ
    g ← g + (1/n) g′
  end for
  θ_{k+1} ← θ_k − η_k g
  k ← k + 1
end while
return θ


Choosing the Learning Rate


Gradient Descent with Momentum

Standard gradient descent (figure from Boyd):

\theta_{k+1} \leftarrow \theta_k - \eta_k g

With a momentum term:

\theta_{k+1} \leftarrow \theta_k - \eta_k g + \mu_k(\theta_k - \theta_{k-1})

- Also known as: the heavy-ball method.
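A hedged NumPy sketch of the heavy-ball update exactly as written above (illustrative only; grad_fn, eta, and mu are assumptions, with constant step size and momentum):

```python
import numpy as np

def heavy_ball(grad_fn, theta0, eta=0.1, mu=0.9, steps=100):
    """theta_{k+1} = theta_k - eta * g + mu * (theta_k - theta_{k-1})."""
    theta_prev = theta0.copy()
    theta = theta0.copy()
    for _ in range(steps):
        g = grad_fn(theta)
        theta_next = theta - eta * g + mu * (theta - theta_prev)
        theta_prev, theta = theta, theta_next
    return theta

# Example: minimize the quadratic 0.5 * ||theta||^2 (its gradient is theta).
theta_star = heavy_ball(lambda th: th, np.array([5.0, -3.0]))
```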


Second-Order

Requires computing the Hessian H.

The second-order update becomes:

\theta_{k+1} \leftarrow \theta_k - \eta_k H^{-1} g

- Used for strictly convex functions (although there are variants).
- Also known as: Newton's method.
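For concreteness, a minimal Newton's-method sketch on a strictly convex quadratic (everything here, including the example function, is illustrative):

```python
import numpy as np

def newton(grad_fn, hess_fn, theta0, eta=1.0, steps=10):
    """theta_{k+1} = theta_k - eta * H^{-1} g."""
    theta = theta0.copy()
    for _ in range(steps):
        g = grad_fn(theta)
        H = hess_fn(theta)
        theta = theta - eta * np.linalg.solve(H, g)   # solve H d = g rather than inverting H
    return theta

# Quadratic L(theta) = 0.5 * theta^T A theta - b^T theta, minimized at A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
theta_min = newton(lambda th: A @ th - b, lambda th: A, np.zeros(2))
```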


Quasi-Newton Methods

Construct an approximate Hessian from first-order information.

- BFGS
  - constructs an approximate Hessian directly
  - O(|θ|²) space
- L-BFGS
  - limited-memory BFGS: only saves the last m gradients
  - m can often be set to 20 or smaller

Fast implementations are available; the specific details are beyond the scope of this course.
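As a usage sketch (not from the lecture), SciPy ships an L-BFGS implementation; here it minimizes the same illustrative quadratic as above, with the maxcor option playing the role of the memory size m:

```python
import numpy as np
from scipy.optimize import minimize

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

def loss(theta):
    return 0.5 * theta @ A @ theta - b @ theta

def grad(theta):
    return A @ theta - b

result = minimize(loss, x0=np.zeros(2), jac=grad,
                  method="L-BFGS-B",
                  options={"maxcor": 10})   # m: number of stored corrections
print(result.x)                             # approximately A^{-1} b
```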


Stochastic Methods

- Minimize the function L(θ).
- Require computing E[L(θ)] and the expected gradient.
- Typically done by sampling a subset of the data, computing a gradient, and updating.
- Other first-order optimizers (like momentum) can be used.


Gradient-Based Optimization: Minibatch SGD

while training criterion is not met do
  Sample a minibatch of m examples (x_1, y_1), ..., (x_m, y_m)
  g ← 0
  for i = 1 to m do
    Compute the loss L(y_i, ŷ_i; θ)
    Compute gradients g′ of L(y_i, ŷ_i; θ) with respect to θ
    g ← g + (1/m) g′
  end for
  θ ← θ − η_k g
end while
return θ
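A hedged Python sketch of this procedure (illustrative; grads_fn is any per-example gradient function such as the earlier hypothetical softmax_grads or hinge_subgradients, and the batch size and step count are arbitrary):

```python
import numpy as np

def minibatch_sgd(grads_fn, X, y, W, b, m=32, eta=0.1, steps=1000, seed=0):
    """Minibatch SGD: average per-example gradients over a sampled batch, then step."""
    rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(steps):
        batch = rng.choice(n, size=min(m, n), replace=False)   # sample a minibatch
        gW, gb = np.zeros_like(W), np.zeros_like(b)
        for i in batch:
            dW, db = grads_fn(X[i], y[i], W, b)
            gW += dW / len(batch)                              # g <- g + (1/m) g'
            gb += db / len(batch)
        W = W - eta * gW                                       # theta <- theta - eta_k * g
        b = b - eta * gb
    return W, b
```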


Tricks On Using SGD (Bottou 2012)

- It can be crucial to experiment with the learning rate.
- It is often useful to use a development set for stopping.
- Shuffle the data first, then run over it.
- Bottou (2012) also describes "averaged" versions, which work well in practice.


Optimization in NLP

- For convex batch methods:
  - L-BFGS is easy to use and effective.
  - Nice for verifying results.
  - Sometimes even m times the parameters is a lot of memory, though.
- For both convex and, especially, non-convex problems:
  - SGD and its variants are dominant.
  - Trade-off of speed vs. exact optimization.
- Also see the notes on AdaGrad, another popular method (a sketch follows below).
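AdaGrad is not derived in these slides; as a reference point only, here is a minimal sketch of its per-coordinate update (names and constants are mine):

```python
import numpy as np

def adagrad(grad_fn, theta0, eta=0.1, eps=1e-8, steps=1000):
    """AdaGrad: scale each coordinate's step by the root of its accumulated squared gradients."""
    theta = theta0.copy()
    G = np.zeros_like(theta)                 # running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        G += g * g
        theta -= eta * g / (np.sqrt(G) + eps)
    return theta
```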