
Intro: Logistic Regression, Gradient Descent + SGD, AdaGrad

Machine Learning for Big Data, CSE547/STAT548, University of Washington

Emily Fox, January 7th, 2014

©Emily Fox 2014

Case Study 1: Estimating Click Probabilities

Ad Placement Strategies

- Companies bid on ad prices
- Which ad wins? (many simplifications here)
  - Naively:
  - But:
  - Instead:


Key Task: Estimating Click Probabilities

- What is the probability that user i will click on ad j?
- Not important just for ads:
  - Optimize search results
  - Suggest news articles
  - Recommend products
- Methods are much more general, useful for:
  - Classification
  - Regression
  - Density estimation

Learning Problem for Click Prediction

- Prediction task:
- Features:
- Data:
  - Batch:
  - Online:
- Many approaches (e.g., logistic regression, SVMs, naïve Bayes, decision trees, boosting, ...)
  - Focus on logistic regression; it captures the main concepts, and the ideas generalize to other approaches


Logistic Regression

Logistic function (or sigmoid): \sigma(z) = \frac{1}{1 + \exp(-z)}

- Learn P(Y|X) directly
  - Assume a particular functional form
  - Sigmoid applied to a linear function of the data:

    P(Y = 1 \mid x, w) = \frac{1}{1 + \exp\big(-(w_0 + \sum_{i=1}^d w_i x_i)\big)}

- Features can be discrete or continuous!

Very convenient!

- Since the sigmoid is monotonic, P(Y = 1 \mid x, w) \ge 1/2 exactly when w_0 + \sum_{i=1}^d w_i x_i \ge 0, which implies a linear classification rule!
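As a concrete illustration (mine, not from the slides), a minimal Python sketch of the model and the resulting linear decision rule, assuming a weight vector `w` and intercept `w0`:

```python
import numpy as np

def sigmoid(z):
    # logistic function: maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def prob_click(x, w, w0):
    # P(Y = 1 | x, w) for logistic regression
    return sigmoid(w0 + np.dot(w, x))

def predict(x, w, w0):
    # thresholding P(Y=1|x,w) at 1/2 is the same as checking the sign of
    # w0 + w.x, i.e., a linear classification rule
    return 1 if (w0 + np.dot(w, x)) >= 0 else 0
```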


Digression: Logistic regression more generally

- Logistic regression in the more general case, where Y \in \{y_1, \ldots, y_R\}:

  P(Y = y_k \mid x, w) = \frac{\exp(w_{k0} + \sum_{i=1}^d w_{ki} x_i)}{1 + \sum_{m=1}^{R-1} \exp(w_{m0} + \sum_{i=1}^d w_{mi} x_i)} \quad for k < R

  P(Y = y_R \mid x, w) = \frac{1}{1 + \sum_{m=1}^{R-1} \exp(w_{m0} + \sum_{i=1}^d w_{mi} x_i)} \quad for k = R (normalization, so no weights for this class)

- Features can be discrete or continuous!

Loss function: Conditional Likelihood

- Have a bunch of i.i.d. data of the form \{(x^j, y^j)\}_{j=1}^N, with y^j \in \{0, 1\}
- Discriminative (logistic regression) loss function: the conditional data likelihood

  \prod_{j=1}^N P(y^j \mid x^j, w)


Expressing Conditional Log Likelihood

\ell(w) = \sum_j y^j \ln P(Y = 1 \mid x^j, w) + (1 - y^j) \ln P(Y = 0 \mid x^j, w)

       = \sum_j y^j \Big( w_0 + \sum_{i=1}^d w_i x_i^j \Big) - \ln\Big( 1 + \exp\big( w_0 + \sum_{i=1}^d w_i x_i^j \big) \Big)
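A quick numerical rendering of this expression (my own sketch, not course code), assuming the features are rows of a NumPy array `X`, the labels `y` are in {0, 1}, and the intercept is passed separately as `w0`:

```python
import numpy as np

def log_likelihood(w0, w, X, y):
    # l(w) = sum_j [ y^j * (w0 + w.x^j) - ln(1 + exp(w0 + w.x^j)) ]
    z = w0 + X @ w                                  # linear scores, one per example
    return np.sum(y * z - np.logaddexp(0.0, z))     # logaddexp(0, z) = ln(1 + e^z), computed stably
```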

Maximizing Conditional Log Likelihood

Good news: \ell(w) is a concave function of w, so there are no local-optima problems.

Bad news: there is no closed-form solution that maximizes \ell(w).

Good news: concave functions are easy to optimize.



Optimizing concave function – Gradient ascent

- The conditional likelihood for logistic regression is concave. Find the optimum with gradient ascent.
- Gradient ascent is the simplest of optimization approaches
  - e.g., conjugate gradient ascent is much better (see reading)

Gradient: \nabla_w \ell(w) = \Big[ \frac{\partial \ell(w)}{\partial w_0}, \ldots, \frac{\partial \ell(w)}{\partial w_d} \Big]

Step size: η > 0

Update rule: w^{(t+1)} \leftarrow w^{(t)} + \eta\, \nabla_w \ell(w^{(t)})

Gradient Ascent for LR

Gradient ascent algorithm: iterate until change < ε

repeat:

  w_0^{(t+1)} \leftarrow w_0^{(t)} + \eta \sum_j \big[ y^j - P(Y = 1 \mid x^j, w^{(t)}) \big]

  For i = 1, \ldots, d:

    w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \sum_j x_i^j \big[ y^j - P(Y = 1 \mid x^j, w^{(t)}) \big]
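A runnable sketch of this loop in Python (my own illustration, not the course's reference code), assuming `X` is an N×d array and `y` holds labels in {0, 1}:

```python
import numpy as np

def gradient_ascent_lr(X, y, eta=0.1, eps=1e-6, max_iter=1000):
    """Batch gradient ascent for (unregularized) logistic regression."""
    N, d = X.shape
    w0, w = 0.0, np.zeros(d)
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))      # P(Y=1 | x^j, w) for all j
        resid = y - p                                # y^j - P(Y=1 | x^j, w)
        w0_new = w0 + eta * resid.sum()              # intercept update
        w_new = w + eta * (X.T @ resid)              # all coordinate updates at once
        if max(abs(w0_new - w0), np.max(np.abs(w_new - w))) < eps:   # change < eps
            return w0_new, w_new
        w0, w = w0_new, w_new
    return w0, w
```

The vectorized `X.T @ resid` performs the per-coordinate sums from the update rule in one step.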


Regularized Conditional Log Likelihood

- If the data is linearly separable, the weights go to infinity
- Leads to overfitting → penalize large weights
- Add a regularization penalty, e.g., L2:

  \ell(w) = \ln \prod_j P(y^j \mid x^j, w) - \frac{\lambda}{2} ||w||_2^2

- Practical note about w_0: the intercept is typically left unregularized (note the sum over i > 0 on the next slide)

Standard v. Regularized Updates

- Maximum conditional likelihood estimate

  w^* = \arg\max_w \ln \prod_j P(y^j \mid x^j, w)

  w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \sum_j x_i^j \big[ y^j - P(Y = 1 \mid x^j, w^{(t)}) \big]

- Regularized maximum conditional likelihood estimate

  w^* = \arg\max_w \ln \Big[ \prod_j P(y^j \mid x^j, w) \Big] - \frac{\lambda}{2} \sum_{i>0} w_i^2

  w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \Big\{ -\lambda w_i^{(t)} + \sum_j x_i^j \big[ y^j - P(Y = 1 \mid x^j, w^{(t)}) \big] \Big\}
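The regularized step is a small change to the earlier sketch (again my own illustration); here `lam` is λ and the intercept is left unpenalized:

```python
import numpy as np

def regularized_step(w0, w, X, y, eta, lam):
    # one L2-regularized gradient-ascent step; the intercept w0 is not penalized
    p = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))
    resid = y - p
    w0_new = w0 + eta * resid.sum()
    w_new = w + eta * (X.T @ resid - lam * w)    # extra -lam*w term from the penalty
    return w0_new, w_new
```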


Stopping criterion

- Regularized logistic regression is strongly concave
  - Negative second derivative bounded away from zero:

    \ell(w) = \ln \prod_j P(y^j \mid x^j, w) - \frac{\lambda}{2} ||w||_2^2

- Strong concavity (convexity) is super helpful!!
- For example, for strongly concave \ell(w):

  \ell(w^*) - \ell(w) \le \frac{1}{2\lambda} ||\nabla \ell(w)||_2^2
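This bound suggests a natural stopping rule: stop once \|\nabla\ell(w)\|_2^2 / (2\lambda) drops below the target accuracy ε. A minimal sketch of that check (my own illustration):

```python
import numpy as np

def should_stop(grad, lam, eps):
    # for a lambda-strongly-concave objective, l(w*) - l(w) <= ||grad||^2 / (2*lambda),
    # so this check certifies we are within eps of the optimal objective value
    return np.dot(grad, grad) / (2.0 * lam) <= eps
```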

Convergence rates for gradient descent/ascent

- Number of iterations to get to accuracy \ell(w^*) - \ell(w) \le \epsilon
- If the function is Lipschitz: O(1/\epsilon^2)
- If the gradient of the function is Lipschitz: O(1/\epsilon)
- If the function is strongly convex: O(\ln(1/\epsilon))


Challenge 1: Complexity of computing gradients

- What's the cost of a gradient update step for LR?

  w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \Big\{ -\lambda w_i^{(t)} + \sum_{j=1}^N x_i^j \big[ y^j - P(Y = 1 \mid x^j, w^{(t)}) \big] \Big\}

  Each update sums over all N data points for each of the d coordinates, so a full gradient step costs O(Nd).

Challenge 2: Data is streaming

- Assumption thus far: batch data
- But click prediction is a streaming data task:
  - User enters a query, and an ad must be selected:
    - Observe x^j, and must predict y^j
  - User either clicks or doesn't click on the ad:
    - Label y^j is revealed afterwards
  - Google gets a reward if the user clicks on the ad
  - Weights must be updated for next time:


Learning Problems as Expectations

- Minimizing loss on training data:
  - Given dataset: sampled i.i.d. from some distribution p(x) on features
  - Loss function, e.g., hinge loss, logistic loss, ...
  - We often minimize loss on the training data:

    \ell_D(w) = \frac{1}{N} \sum_{j=1}^N \ell(w, x^j)

- However, we should really minimize the expected loss on all data:

  \ell(w) = E_x[\ell(w, x)] = \int p(x)\, \ell(w, x)\, dx

- So, we are approximating the integral by the average on the training data

Gradient ascent in Terms of Expectations

- "True" objective function:

  \ell(w) = E_x[\ell(w, x)] = \int p(x)\, \ell(w, x)\, dx

- Taking the gradient:

  \nabla \ell(w) = E_x[\nabla \ell(w, x)]

- "True" gradient ascent rule:

  w^{(t+1)} \leftarrow w^{(t)} + \eta\, E_x[\nabla \ell(w^{(t)}, x)]

- How do we estimate the expected gradient?


SGD: Stochastic Gradient Ascent (or Descent)

- "True" gradient: \nabla \ell(w) = E_x[\nabla \ell(w, x)]
- Sample-based approximation: \frac{1}{N} \sum_{j=1}^N \nabla \ell(w, x^j)
- What if we estimate the gradient with just one sample?
  - Unbiased estimate of the gradient
  - Very noisy!
  - Called stochastic gradient ascent (or descent), among many other names
  - VERY useful in practice!!!

Stochastic Gradient Ascent: general case

- Given a stochastic function of parameters: \ell(w) = E_x[\ell(w, x)]
  - Want to find the maximum
- Start from w^{(0)}
- Repeat until convergence:
  - Get a sample data point x^t
  - Update parameters: w^{(t+1)} \leftarrow w^{(t)} + \eta_t \nabla \ell(w^{(t)}, x^t)
- Works in the online learning setting!
- Complexity of each gradient step is constant in the number of examples!
- In general, the step size changes with iterations (see the sketch below)
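A minimal generic sketch of this loop (my own illustration; the `grad_loss` callback and the 1/t step-size schedule are assumptions for concreteness, not prescribed by the slides):

```python
import numpy as np

def sgd_ascent(grad_loss, X, w_init, eta0=1.0, n_passes=1, seed=0):
    """Stochastic gradient ascent: one sampled data point per update.
    grad_loss(w, x) should return an unbiased estimate of the gradient at w."""
    rng = np.random.default_rng(seed)
    w, t = w_init.copy(), 1
    for _ in range(n_passes):
        for idx in rng.permutation(len(X)):
            eta_t = eta0 / t                 # decaying step size (one common choice)
            w = w + eta_t * grad_loss(w, X[idx])
            t += 1
    return w
```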


Stochastic Gradient Ascent for Logistic Regression

- Logistic loss as a stochastic function:

  E_x[\ell(w, x)] = E_x\Big[ \ln P(y \mid x, w) - \frac{\lambda}{2} ||w||_2^2 \Big]

- Batch gradient ascent updates:

  w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \Big\{ -\lambda w_i^{(t)} + \frac{1}{N} \sum_{j=1}^N x_i^{(j)} \big[ y^{(j)} - P(Y = 1 \mid x^{(j)}, w^{(t)}) \big] \Big\}

- Stochastic gradient ascent updates (online setting):

  w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \Big\{ -\lambda w_i^{(t)} + x_i^{(t)} \big[ y^{(t)} - P(Y = 1 \mid x^{(t)}, w^{(t)}) \big] \Big\}
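A sketch of this online update in Python (my own illustration), assuming a stream of (x, y) pairs, an L2 penalty strength `lam`, and the intercept folded into `w` via a constant-1 feature:

```python
import numpy as np

def sgd_lr_step(w, x, y, eta_t, lam):
    # single-example stochastic gradient ascent step for L2-regularized logistic regression:
    # w_i <- w_i + eta_t * ( -lam * w_i + x_i * (y - P(Y=1|x,w)) )
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))    # P(Y = 1 | x, w)
    return w + eta_t * (-lam * w + x * (y - p))
```

In practice the intercept coordinate would usually be excluded from the -lam*w term, matching the earlier regularization note about w_0.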

Convergence rate of SGD

- Theorem (see Nemirovski et al. '09 from the readings):
  - Let f be a strongly convex stochastic function
  - Assume the gradient of f is Lipschitz continuous and bounded
  - Then, for step sizes \eta_t \propto 1/t,
  - the expected loss decreases as O(1/t): E[f(w^{(t)})] - f(w^*) = O(1/t)


Convergence rates for gradient descent/ascent versus SGD

- Number of iterations to get to accuracy \ell(w^*) - \ell(w) \le \epsilon
- Gradient descent:
  - If the function is strongly convex: O(\ln(1/\epsilon)) iterations
- Stochastic gradient descent:
  - If the function is strongly convex: O(1/\epsilon) iterations
- Seems exponentially worse, but the story is much more subtle:
  - Total running time, e.g., for logistic regression:
    - Gradient descent: each iteration costs O(Nd), so O(Nd \ln(1/\epsilon)) total
    - SGD: each iteration costs O(d), so O(d/\epsilon) total
    - SGD can win when we have a lot of data
  - And, when analyzing true error, the situation is even more subtle... expected running time is about the same; see readings

Motivating AdaGrad (Duchi, Hazan, Singer 2011)

- Assuming w \in R^d, standard stochastic (sub)gradient descent updates are of the form:

  w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta\, g_{t,i}

- Should all features share the same learning rate?
- Often have high-dimensional feature spaces
  - Many features are irrelevant
  - Rare features are often very informative
- AdaGrad provides a feature-specific adaptive learning rate by incorporating knowledge of the geometry of past observations


Why adapt to geometry?

Example data (from Duchi et al. ISMP 2012 slides, "Why adapt to geometry?"; the figure's two panels are labeled "Hard" and "Nice"):

  y_t | feature 1 | feature 2 | feature 3
   1  |    1      |    0      |    0
  -1  |    .5     |    0      |    1
   1  |   -.5     |    1      |    0
  -1  |    0      |    0      |    0
   1  |    .5     |    0      |    0
  -1  |    1      |    0      |    0
   1  |   -1      |    1      |    0
  -1  |   -.5     |    0      |    1

- Feature 1: frequent, irrelevant
- Feature 2: infrequent, predictive
- Feature 3: infrequent, predictive

Examples from Duchi et al. ISMP 2012 slides

Why Adapt to Geometry?

Not All Features are Created Equal

- Examples (from Duchi et al. ISMP 2012 slides, "Motivation"):
  - Text data: "The most unsung birthday in American business and technological history this year may be the 50th anniversary of the Xerox 914 photocopier." (The Atlantic, July/August 2010)
  - High-dimensional image features
  - Other motivation: selecting advertisements in online advertising, document ranking, problems with parameterizations of many magnitudes...

Images from Duchi et al. ISMP 2012 slides


Projected Gradient

- Brief aside...
- Consider an arbitrary feature space w \in W
- If w \in W, we can use projected gradient for (sub)gradient descent: take the unconstrained step

  w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta\, g_{t,i}

  and then project back onto W:

  w^{(t+1)} = \arg\min_{w \in W} ||w - (w^{(t)} - \eta\, g_t)||_2^2
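A small sketch of one projected-gradient step (my own illustration), taking W to be an L2 ball of radius `r` purely as an example constraint set:

```python
import numpy as np

def project_l2_ball(w, r):
    # Euclidean projection onto W = {w : ||w||_2 <= r}
    norm = np.linalg.norm(w)
    return w if norm <= r else (r / norm) * w

def projected_gradient_step(w, g, eta, r):
    # unconstrained (sub)gradient step, then project back onto W
    return project_l2_ball(w - eta * g, r)
```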

Regret Minimization

- How do we assess the performance of an online algorithm?
- The algorithm iteratively predicts w^{(t)}
- It incurs loss f_t(w^{(t)})
- Regret:

  R(T) = \sum_{t=1}^T f_t(w^{(t)}) - \inf_{w \in W} \sum_{t=1}^T f_t(w)

  What is the total incurred loss of the algorithm relative to the best choice of w that could have been made retrospectively?


Regret Bounds for Standard SGD

- Standard projected gradient stochastic updates:

  w^{(t+1)} = \arg\min_{w \in W} ||w - (w^{(t)} - \eta\, g_t)||_2^2

- Standard regret bound:

  \sum_{t=1}^T f_t(w^{(t)}) - f_t(w^*) \le \frac{1}{2\eta} ||w^{(1)} - w^*||_2^2 + \frac{\eta}{2} \sum_{t=1}^T ||g_t||_2^2

Projected Gradient using Mahalanobis

- Standard projected gradient stochastic updates:

  w^{(t+1)} = \arg\min_{w \in W} ||w - (w^{(t)} - \eta\, g_t)||_2^2

- What if, instead of an L2 metric for the projection, we considered the Mahalanobis norm?

  w^{(t+1)} = \arg\min_{w \in W} ||w - (w^{(t)} - \eta\, A^{-1} g_t)||_A^2


Mahalanobis Regret Bounds

- What A should we choose?

  w^{(t+1)} = \arg\min_{w \in W} ||w - (w^{(t)} - \eta\, A^{-1} g_t)||_A^2

- Regret bound now:

  \sum_{t=1}^T f_t(w^{(t)}) - f_t(w^*) \le \frac{1}{2\eta} ||w^{(1)} - w^*||_A^2 + \frac{\eta}{2} \sum_{t=1}^T ||g_t||_{A^{-1}}^2

- What if we minimize the upper bound on regret w.r.t. A in hindsight?

  \min_A \sum_{t=1}^T \langle g_t, A^{-1} g_t \rangle

Mahalanobis Regret Minimization

- Objective:

  \min_A \sum_{t=1}^T \langle g_t, A^{-1} g_t \rangle \quad \text{subject to } A \succeq 0,\ \mathrm{tr}(A) \le C

- Solution:

  A = c \left( \sum_{t=1}^T g_t g_t^T \right)^{1/2}

  For the proof, see Appendix E, Lemma 15 of Duchi et al. 2011; it uses the "trace trick" and a Lagrangian.

- A defines the norm of the metric space we should be operating in
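For intuition, a small sketch of computing this A from a collection of gradients (my own illustration, taking c = 1 and using an eigendecomposition for the matrix square root):

```python
import numpy as np

def adagrad_full_matrix(gradients, c=1.0):
    # A = c * (sum_t g_t g_t^T)^(1/2), via eigendecomposition of the symmetric PSD outer-product sum
    G = sum(np.outer(g, g) for g in gradients)
    eigvals, eigvecs = np.linalg.eigh(G)
    sqrt_G = eigvecs @ np.diag(np.sqrt(np.clip(eigvals, 0.0, None))) @ eigvecs.T
    return c * sqrt_G
```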


AdaGrad Algorithm

- At time t, estimate the optimal (sub)gradient modification A by

  A_t = \left( \sum_{\tau=1}^t g_\tau g_\tau^T \right)^{1/2}

- For d large, A_t is computationally intensive to compute. Instead, use only its diagonal, \mathrm{diag}(A_t).
- Then the algorithm is a simple modification of the normal updates w^{(t+1)} = \arg\min_{w \in W} ||w - (w^{(t)} - \eta\, A^{-1} g_t)||_A^2:

  w^{(t+1)} = \arg\min_{w \in W} ||w - (w^{(t)} - \eta\, \mathrm{diag}(A_t)^{-1} g_t)||_{\mathrm{diag}(A_t)}^2

AdaGrad in Euclidean Space

- For W = R^d, for each feature dimension:

  w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta_{t,i}\, g_{t,i}

  where

  \eta_{t,i} = \frac{\eta}{\sqrt{\sum_{\tau=1}^t g_{\tau,i}^2}}

- That is,

  w_i^{(t+1)} \leftarrow w_i^{(t)} - \frac{\eta}{\sqrt{\sum_{\tau=1}^t g_{\tau,i}^2}}\, g_{t,i}

- Each feature dimension has its own learning rate!
  - Adapts with t
  - Takes the geometry of past observations into account
  - The primary role of η is determining the rate the first time a feature is encountered
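A minimal sketch of this diagonal AdaGrad update in Python (my own illustration; the small `eps` guard against dividing by zero before a feature is first seen is an implementation detail, not part of the slides):

```python
import numpy as np

class AdaGrad:
    def __init__(self, d, eta=0.1, eps=1e-8):
        self.eta = eta
        self.eps = eps
        self.sum_sq = np.zeros(d)    # running sum of squared gradients, per coordinate

    def step(self, w, g):
        # w_i <- w_i - eta / sqrt(sum_{tau<=t} g_{tau,i}^2) * g_{t,i}
        self.sum_sq += g * g
        return w - self.eta * g / (np.sqrt(self.sum_sq) + self.eps)
```

Rarely active coordinates keep a large effective step size while frequently active ones are damped, which is exactly the behavior motivated by the frequent/infrequent feature example above.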


AdaGrad Theoretical Guarantees

- AdaGrad regret bound:

  \sum_{t=1}^T f_t(w^{(t)}) - f_t(w^*) \le 2 R_\infty \sum_{i=1}^d ||g_{1:T,i}||_2, \qquad R_\infty := \max_t ||w^{(t)} - w^*||_\infty

- So, what does this mean in practice?
- Many cool examples. This really is used in practice!
- Let's just examine one...

AdaGrad Theoretical Example

- Expect AdaGrad to outperform when gradient vectors are sparse
- SVM hinge loss example:

  f_t(w) = [1 - y_t \langle x^t, w \rangle]_+, \qquad x^t \in \{-1, 0, 1\}^d

- If x_j^t \ne 0 with probability \propto j^{-\alpha}, \alpha > 1:

  E\Big[ f\Big( \frac{1}{T} \sum_{t=1}^T w^{(t)} \Big) \Big] - f(w^*) = O\Big( \frac{||w^*||_\infty}{\sqrt{T}} \cdot \max\{\log d,\ d^{1-\alpha/2}\} \Big)

- Previously best known method:

  E\Big[ f\Big( \frac{1}{T} \sum_{t=1}^T w^{(t)} \Big) \Big] - f(w^*) = O\Big( \frac{||w^*||_\infty}{\sqrt{T}} \cdot \sqrt{d} \Big)


Neural Network Learning

- Very non-convex problem, but use SGD methods anyway

From Duchi et al. ISMP 2012 slides ("Neural Network Learning", a wildly non-convex problem):

  f(x; \xi) = \log\Big( 1 + \exp\big( \big\langle [\, p(\langle x_1, \xi_1 \rangle) \cdots p(\langle x_k, \xi_k \rangle) \,],\ \xi_0 \big\rangle \big) \Big), \qquad p(\alpha) = \frac{1}{1 + \exp(\alpha)}

  Idea: use stochastic gradient methods to solve it anyway.

[Figure from Dean et al. 2012: average frame accuracy (%) on the test set versus training time (hours), comparing SGD (GPU), Downpour SGD, Downpour SGD w/AdaGrad, and Sandblaster L-BFGS. Distributed training with d = 1.7 \cdot 10^9 parameters; SGD and AdaGrad use 80 machines (1,000 cores), L-BFGS uses 800 machines (10,000 cores).]

Images from Duchi et al. ISMP 2012 slides

What you should know about Logistic Regression (LR) and Click Prediction

- Click prediction problem:
  - Estimate the probability of clicking
  - Can be modeled as logistic regression
- Logistic regression model: linear model
- Gradient ascent to optimize the conditional likelihood
- Overfitting + regularization
- Regularized optimization
  - Convergence rates and stopping criterion
- Stochastic gradient ascent for large/streaming data
  - Convergence rates of SGD
- AdaGrad motivation, derivation, and algorithm