Text Classification
+
Machine Learning Review 3
CS 287
Review: Logistic Regression (Murphy, p. 268)
Cons
- Harder to fit than naive Bayes.
- Must fit all classes together.
- Not a good fit for semi-supervised/missing-data cases.
Pros
- Better-calibrated probability estimates.
- Natural handling of feature input.
  - Features are likely not multinomials.
- (For us) extends naturally to neural networks.
Review: Gradients for Softmax Regression
For multiclass logistic regression:

∂L(y, ŷ)/∂z_i = ∑_j (∂L/∂ŷ_j)(∂ŷ_j/∂z_i) = ∑_j −(1(j = c)/ŷ_j)(∂ŷ_j/∂z_i)

              = −(1 − ŷ_i)   if i = c
                 ŷ_i         otherwise

Therefore, for parameters θ,

∂L/∂b_i = ∂L/∂z_i          ∂L/∂W_{f,i} = x_f · ∂L/∂z_i

Intuition:
- Nothing happens on correct classification.
- Weights of the true class's features increase based on the probability not given to it.
- Weights of other classes' features decrease based on the probability given to them.
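A minimal NumPy sketch of these gradients (the function name, shapes, and variable names are illustrative assumptions, not from the course code):

import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def softmax_regression_grads(x, c, W, b):
    # x: (F,) bag-of-words counts, c: true class index,
    # W: (F, C) weights, b: (C,) biases
    z = x @ W + b                    # class scores
    y_hat = softmax(z)
    loss = -np.log(y_hat[c])         # cross-entropy loss
    dz = y_hat.copy()                # dL/dz_i = y_hat_i for i != c
    dz[c] -= 1.0                     # dL/dz_c = -(1 - y_hat_c)
    db = dz                          # dL/db_i = dL/dz_i
    dW = np.outer(x, dz)             # dL/dW_{f,i} = x_f * dL/dz_i
    return loss, dW, db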
Gradient-Based Optimization: SGD
procedure SGD
    while training criterion is not met do
        Sample a training example xi, yi
        Compute the loss L(ŷi, yi; θ)
        Compute gradients g of L(ŷi, yi; θ) with respect to θ
        θ ← θ − ηg
    end while
    return θ
end procedure
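A plain-Python sketch of this procedure, reusing softmax_regression_grads from the earlier sketch (the driver function and its arguments are illustrative assumptions):

import numpy as np

def sgd(examples, n_features, n_classes, eta=1.0, epochs=5, seed=0):
    # examples: list of (x, c) pairs with x a (n_features,) vector, c a class index
    rng = np.random.default_rng(seed)
    W = np.zeros((n_features, n_classes))
    b = np.zeros(n_classes)
    for _ in range(epochs):                       # stand-in for "training criterion"
        for i in rng.permutation(len(examples)):  # sample training examples
            x, c = examples[i]
            loss, dW, db = softmax_regression_grads(x, c, W, b)
            W -= eta * dW                         # theta <- theta - eta * g
            b -= eta * db
    return W, b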
Quiz: Softmax Regression
Given bag-of-words features
F = {The, movie, was, terrible, rocked, A}
and two training data points:
Class 1: The movie was terrible
Class 2: The movie rocked
Assume that we start with parameters W = 0 and b = 0, and we train
with learning rate η = 1 and λ = 0. What are the loss and the
parameters after one pass through the data, in order?
Answer: Softmax Regression (1)
First iteration,

ŷ1 = [0.5  0.5]
L(y1, ŷ1) = −log 0.5

W = [ 0.5   0.5   0.5   0.5  0  0
     −0.5  −0.5  −0.5  −0.5  0  0 ]

b = [0.5  −0.5]
Answer: Softmax Regression (2)
Second iteration,

ŷ2 = softmax([1.5  −1.5]) ≈ [0.95  0.05]
L(y2, ŷ2) = −log 0.05

W ≈ [ −0.45  −0.45   0.5   0.5  −0.95  0
       0.45   0.45  −0.5  −0.5   0.95  0 ]

b = [−0.45  0.45]
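The arithmetic above can be checked with a short script using the gradient sketch from earlier (feature order The, movie, was, terrible, rocked, A; classes 0-indexed):

import numpy as np

x1 = np.array([1., 1., 1., 1., 0., 0.]); c1 = 0   # "The movie was terrible", class 1
x2 = np.array([1., 1., 0., 0., 1., 0.]); c2 = 1   # "The movie rocked", class 2

W = np.zeros((6, 2)); b = np.zeros(2); eta = 1.0
for x, c in [(x1, c1), (x2, c2)]:                 # one pass through the data, in order
    loss, dW, db = softmax_regression_grads(x, c, W, b)
    print(loss)                                   # 0.693 = -log 0.5, then about 3.05 = -log 0.047
    W -= eta * dW
    b -= eta * db
print(W.T)  # rows are classes: approx. [[-0.45 -0.45 0.5 0.5 -0.95 0], [0.45 0.45 -0.5 -0.5 0.95 0]]
print(b)    # approx. [-0.45  0.45]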
Today’s Class
So far
- Naive Bayes (Multinomial)
- Multiclass Logistic Regression (SGD)
Today
- Multiclass Hinge-loss
- More about optimization
Contents
Multiclass Hinge-Loss
Gradients
Black-Box Optimization
Other Loss Functions
What if we just try to directly find W and b?
ŷ = xW + b
- No longer a probabilistic interpretation.
- Just try to find parameters that fit the training data.
0/1 Loss
Just count the number of training examples we classify incorrectly,

L(θ) = ∑_{i=1}^n L_{0/1}(y, ŷ)
L_{0/1}(y, ŷ) = 1(arg max_{c'} ŷ_{c'} ≠ c)

∂L(y, ŷ)/∂ŷ_j = 0   if j = c
                0   otherwise

(so the gradient carries no useful information)

[Figure: plot of L_{0/1}([x  y]) = 1(x > y)]
Hinge Loss

L(θ) = ∑_{i=1}^n L_hinge(y, ŷ)
L_hinge(y, ŷ) = max{0, 1 − (ŷ_c − ŷ_{c'})}

Where
- c is the true class, i.e. y_{i,c} = 1
- c' = arg max_{i ∈ C \ {c}} ŷ_i is the highest-scoring non-true class

Minimizing the hinge loss minimizes an upper bound on the 0/1 loss:
L_hinge(y, ŷ) ≥ L_{0/1}(y, ŷ)
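A small NumPy sketch of this loss (y_hat is the score vector xW + b and c the true class index; names are illustrative):

import numpy as np

def multiclass_hinge_loss(y_hat, c):
    scores = y_hat.copy()
    scores[c] = -np.inf                       # exclude the true class
    c_prime = int(np.argmax(scores))          # c' = arg max over C \ {c}
    loss = max(0.0, 1.0 - (y_hat[c] - y_hat[c_prime]))
    return loss, c_prime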
Hinge Loss
[Figure: plot comparing hinge(y) = max{0, 1 − (y − x)} with the 0/1 loss 1(x > y)]
Important Case: Hinge-loss for Binary
L_hinge([0 1], [x  y]) = max{0, 1 − (y − x)} = ReLU(1 − (y − x))
Neural network name (rectified linear unit):
ReLU(t) = max{0, t}
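A one-line check of this identity (assuming scores [x, y] with the second class as the true class; the score values are illustrative):

import numpy as np

relu = lambda t: np.maximum(0.0, t)            # ReLU(t) = max{0, t}

x, y = 1.5, -0.5                               # illustrative scores
binary_hinge = max(0.0, 1.0 - (y - x))         # L_hinge([0 1], [x y])
assert binary_hinge == relu(1.0 - (y - x))     # same value via ReLU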
Hinge-Loss Properties
Complete objective:

L_hinge(θ) = ∑_{i=1}^n max{0, 1 − (ŷ_c − ŷ_{c'})}
           = ∑_{i=1}^n max{0, 1 − (ŷ_c − max_{c' ∈ C \ {c}} ŷ_{c'})}

- Apply convexity rules: linear ŷ is convex, a max of convex functions is convex, linear plus convex is convex, and a sum of convex functions is convex (Boyd and Vandenberghe, 2004, pp. 72-74).
- However, the objective is non-differentiable because of the max.
Piecewise Linear Objective

L(θ) = ∑_{i=1}^n max{0, 1 − (ŷ_c − ŷ_{c'})}

[Figure: example piecewise-linear objective 10 · max{0, 1 − (y − x)} + 5 · max{0, 1 − (x − y)}]
Objective with Regularization

L(θ) = ∑_{i=1}^n max{0, 1 − (ŷ_c − ŷ_{c'})} + λ||θ||²

[Figure: example objective 10 · max{0, 1 − (y − x)} + 5 · max{0, 1 − (x − y)} + 5 · ||θ||²]
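A sketch of this regularized objective over a dataset, reusing multiclass_hinge_loss from above (the value of lambda_ and the choice to fold both W and b into ||θ||² are illustrative assumptions):

import numpy as np

def regularized_hinge_objective(examples, W, b, lambda_=1.0):
    # examples: list of (x, c) pairs; theta is (W, b) taken together
    total = 0.0
    for x, c in examples:
        y_hat = x @ W + b
        loss, _ = multiclass_hinge_loss(y_hat, c)
        total += loss
    total += lambda_ * (np.sum(W ** 2) + np.sum(b ** 2))   # lambda * ||theta||^2
    return total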
Contents
Multiclass Hinge-Loss
Gradients
Black-Box Optimization
(Sub)Gradient Rule
- Technically ReLU is non-differentiable.
- Only an issue at 0, generally for "ties".
- We informally use subgradients,

d ReLU(x)/dx = 1        if x > 0
               0        if x < 0
               1 or 0   otherwise

Generally,

d max_{v'} f(x, v')/dx = f'(x, v)   for any v ∈ arg max_{v'} f(x, v')
Symbolic Gradients
- Let c be defined as the true class.
- Let c' be defined as the highest-scoring non-true class,
  c' = arg max_{i ∈ C \ {c}} ŷ_i
- Partials of L(y, ŷ):

∂L(y, ŷ)/∂ŷ_j =  0   if ŷ_c − ŷ_{c'} > 1
                 1   if j = c'
                −1   if j = c
                 0   otherwise

Intuition: If wrong, or close to wrong, improve the correct class and lower the closest incorrect one.
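A sketch of these partials together with the resulting parameter subgradients for ŷ = xW + b, reusing multiclass_hinge_loss from earlier (the function name is an illustrative assumption):

import numpy as np

def hinge_subgrads(x, c, W, b):
    y_hat = x @ W + b
    loss, c_prime = multiclass_hinge_loss(y_hat, c)
    dW = np.zeros_like(W)
    db = np.zeros_like(b)
    if loss > 0:                          # wrong, or right but within the margin
        db[c], db[c_prime] = -1.0, 1.0    # dL/dy_c = -1, dL/dy_c' = +1
        dW[:, c] = -x                     # raise the correct class,
        dW[:, c_prime] = x                # lower the closest incorrect class
    return loss, dW, db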
Notes: Hinge Loss: Regularization
- Many different names:
  - Margin Classifier
  - Multiclass Hinge
  - Linear SVM
- Important to use regularization.

L(θ) = ∑_{i=1}^n L(y, ŷ) + λ||θ||₂²

- Can be much more efficient to train than LR (no partition function to compute).
Results: Longer Reviews
IMDB (longer movie reviews), Subj (longer subjectivity)
- NBSVM is hinge loss interpolated with Naive Bayes.
Contents
Multiclass Hinge-Loss
Gradients
Black-Box Optimization
Optimization Methods
Goal: Minimize a function L : R^|θ| → R
First-order Methods
L(θ + δ) ≈ L(θ) + L′(θ)ᵀδ
- Require computing L(θ) and the gradient L′(θ).
Second-order Methods
L(θ + δ) ≈ L(θ) + L′(θ)ᵀδ + (1/2)δᵀHδ
- Require computing L(θ), the gradient L′(θ), and the Hessian H.
Stochastic Methods
- Require computing E[L(θ)] and the expected gradient.
Gradient Descent
k ← 0
while training criterion is not met do
    g ← 0
    for i = 1 to n do
        Compute the loss L(ŷi, yi; θ)
        Compute gradients g′ of L(ŷi, yi; θ) with respect to θ
        g ← g + (1/n)g′
    end for
    θk+1 ← θk − ηk g
    k ← k + 1
end while
return θ
Choosing the Learning Rate
Gradient Descent with Momentum
Standard gradient descent (figure from Boyd):
θk+1 ← θk − ηk g
Momentum term:
θk+1 ← θk − ηk g + µk(θk − θk−1)
- Also known as: the heavy-ball method
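A sketch of this heavy-ball update as code (the eta and mu values are illustrative assumptions):

import numpy as np

def momentum_step(theta, grad, velocity, eta=0.1, mu=0.9):
    # velocity tracks theta_k - theta_{k-1}, so this is
    # theta_{k+1} = theta_k - eta * g + mu * (theta_k - theta_{k-1})
    velocity = mu * velocity - eta * grad
    return theta + velocity, velocity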
Second-Order
Requires computing the Hessian H.
The second-order update becomes:
θk+1 ← θk − ηk H⁻¹ g
- Used for strictly convex functions (although there are variants).
- Also known as: Newton's Method
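A sketch of one Newton step, solving the linear system rather than forming H⁻¹ explicitly (function name and default step size are illustrative):

import numpy as np

def newton_step(theta, grad, hessian, eta=1.0):
    # theta_{k+1} = theta_k - eta * H^{-1} g
    return theta - eta * np.linalg.solve(hessian, grad)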
Quasi-Newton Methods
Construct an approximate Hessian from first-order information.
- BFGS
  - constructs the approximate Hessian directly
  - O(|θ|²) space
- L-BFGS
  - limited-memory BFGS, only saves the last m gradients
  - can often set m to 20 or smaller
Fast implementations are available; the specific details are beyond the scope of this course.
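As one example of an off-the-shelf implementation, SciPy exposes L-BFGS through scipy.optimize.minimize; the toy objective below is a placeholder, and maxcor is SciPy's option for the number of saved corrections m:

import numpy as np
from scipy.optimize import minimize

def objective(theta):
    return 0.5 * np.sum(theta ** 2)        # illustrative convex toy objective

def gradient(theta):
    return theta

theta0 = np.ones(100)
result = minimize(objective, theta0, jac=gradient,
                  method="L-BFGS-B",
                  options={"maxcor": 20})  # keep only the last m = 20 corrections
print(result.fun, result.nit)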
Stochastic Methods
- Minimize a function L(θ).
- Require computing E[L(θ)] and its gradient.
- Typically we proceed by sampling a subset of the data, computing a gradient, and updating.
- Other first-order optimizers (like momentum) can be used.
Gradient-Based Optimization: Minibatch SGD
while training criterion is not met do
    Sample a minibatch of m examples (x1, y1), . . . , (xm, ym)
    g ← 0
    for i = 1 to m do
        Compute the loss L(ŷi, yi; θ)
        Compute gradients g′ of L(ŷi, yi; θ) with respect to θ
        g ← g + (1/m)g′
    end for
    θ ← θ − ηk g
end while
return θ
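A sketch of minibatch SGD in NumPy, reusing softmax_regression_grads from earlier (the batch size, step count, and fixed learning rate are illustrative; assumes m is no larger than the number of examples):

import numpy as np

def minibatch_sgd(examples, W, b, eta=0.1, m=32, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(steps):                           # stand-in for "training criterion"
        batch = rng.choice(len(examples), size=m, replace=False)
        gW = np.zeros_like(W)
        gb = np.zeros_like(b)
        for i in batch:
            x, c = examples[i]
            _, dW, db = softmax_regression_grads(x, c, W, b)
            gW += dW / m                             # g <- g + (1/m) g'
            gb += db / m
        W -= eta * gW                                # theta <- theta - eta * g
        b -= eta * gb
    return W, b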
Tricks On Using SGD (Bottou 2012)
- It can be crucial to experiment with the learning rate.
- Often useful to use a development set for stopping.
- Shuffle the data first and then run over it.
- Bottou also describes "averaged" versions which work well in practice.
Optimization in NLP
- For convex batch methods:
  - L-BFGS is easy to use and effective.
  - Nice for verifying results.
  - Sometimes even m times the number of parameters is a lot of memory, though.
- For both convex and, especially, non-convex problems:
  - SGD and its variants are dominant.
  - Trade-off of speed vs. exact optimization.
- Also see the notes on AdaGrad, another popular method.