Large-Scale Optimization

Sanjiv Kumar, Google Research, NY
EECS-6898, Columbia University - Fall 2010

Page 4: Learning and Optimization

In most cases, learning from data reduces to optimizing a function with respect to model parameters.

Given training data $D = \{x_i, y_i\}_{i=1,\dots,n}$, we want to learn a model (prediction) $y = f(x; w)$, e.g., $x \in \mathbb{R}^d$, $y \in \mathbb{R}$ or $y \in \{-1, 1\}$.

$\hat{w} = \arg\min_w J(w) = \arg\min_w \,[\, L(D, w) + \lambda R(w) \,]$   (loss + regularizer)

• Convex vs non-convex
• Constrained vs unconstrained
• Smooth vs non-differentiable

[Figure: plots of $J(w)$ vs $w$ illustrating a convex (twice differentiable) objective with a global minimum, a non-convex objective with local and global minima, a constrained objective, and a non-differentiable objective.]

Page 8: Examples

Linear regression: $y = w^T x + w_0$,   $x \in \mathbb{R}^d$, $y \in \mathbb{R}$
$J(w) = \sum_i (w^T x_i - y_i)^2 + \lambda\, w^T w$   (convex, smooth, unconstrained)

Linear SVM: $y = \mathrm{sgn}(w^T x + w_0)$,   $x \in \mathbb{R}^d$, $y \in \{-1, 1\}$
$J(w) = \sum_i \max(0,\, 1 - y_i w^T x_i) + \lambda\, w^T w$   (convex, non-differentiable, unconstrained)

Logistic regression (probabilistic): $p(y = 1 \mid x) = \sigma(w^T x + w_0)$, with $\sigma(x) = 1/(1 + e^{-x})$,   $x \in \mathbb{R}^d$, $y \in \{-1, 1\}$
$J(w) = -\sum_i \log \sigma(y_i w^T x_i) + \lambda\, w^T w$   (convex, smooth, unconstrained)

Absorb $w_0$ in $w$ by adding a dummy variable $x_0 = 1$ in $x$.

Nonlinear (kernelized) versions: replace $w^T x$ with $\sum_i w_i\, k(x, x_i)$
• The 2-norm regularizer changes to $w^T K w$, where $K$ is the kernel matrix
• Same properties of the functions as their linear versions

[Figures: a regression fit in the $(x, y)$ plane and linear decision boundaries in the $(x_1, x_2)$ plane.]
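
As an illustration (not part of the slides), a minimal NumPy sketch of the three objectives and their (sub)gradients, with the bias absorbed as a dummy feature; the function names and the choice of NumPy are my own.

```python
import numpy as np

def add_bias(X):
    """Absorb w0 by appending a constant feature x0 = 1 to every example."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def linear_regression_obj(w, X, y, lam):
    r = X @ w - y                                    # residuals
    return np.sum(r ** 2) + lam * w @ w, 2 * X.T @ r + 2 * lam * w

def svm_obj(w, X, y, lam):
    margins = 1 - y * (X @ w)                        # hinge margins
    J = np.sum(np.maximum(0.0, margins)) + lam * w @ w
    grad = -(X.T @ (y * (margins > 0))) + 2 * lam * w   # subgradient
    return J, grad

def logistic_obj(w, X, y, lam):
    z = y * (X @ w)
    J = np.sum(np.log1p(np.exp(-z))) + lam * w @ w   # -log sigma(y w^T x)
    grad = -(X.T @ (y * (1 - 1 / (1 + np.exp(-z))))) + 2 * lam * w
    return J, grad
```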

Page 9: Popular Optimization Methods

First Order Methods
– Use the gradient of the function to iteratively find a local minimum
– Easy to compute and run, but slow convergence
– Gradient (steepest) descent
– Coordinate descent
– Conjugate Gradient
– Stochastic Gradient Descent

Second Order Methods
– Use gradients and the Hessian (curvature) iteratively
– Computationally more demanding, but fast convergence
– Newton's method
– Quasi-Newton methods (BFGS and variants)
– Stochastic Quasi-Newton methods

Other than line-search based methods: trust-region methods.

Page 12: Conditions for Local Minimum

• If $\hat{w}$ is a local minimizer, then
– First-order necessary condition (stationary point): the gradient $\frac{\partial J(w)}{\partial w} = \nabla J(\hat{w}) = 0$
– Second-order necessary condition: the Hessian $\frac{\partial^2 J(w)}{\partial w\, \partial w^T} = \nabla^2 J(\hat{w})$ is positive semi-definite (PSD)
– Sufficient conditions: $\nabla J(\hat{w}) = 0$ and $\nabla^2 J(\hat{w})$ is positive definite (strictly PD)

For a smooth convex function, any stationary point is a global minimizer.

[Figure: $J(w)$ vs $w$ showing a local minimum, a local maximum, and a saddle point, all of which are stationary points.]
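
As a quick illustration (not from the slides), the sketch below checks both conditions numerically for a simple quadratic; the matrix and vector are arbitrary examples.

```python
import numpy as np

# J(w) = 0.5 w^T A w - b^T w, with gradient A w - b and Hessian A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

w_hat = np.linalg.solve(A, b)               # stationary point: gradient = 0
grad = A @ w_hat - b
hess_eigs = np.linalg.eigvalsh(A)           # eigenvalues of the Hessian

print("||grad|| =", np.linalg.norm(grad))          # ~0: first-order condition
print("Hessian PD:", bool(np.all(hess_eigs > 0)))  # sufficient condition
```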

Page 15: Gradient Descent

Iteratively estimates new parameter values:
$w_{t+1} = w_t - \eta_t \nabla J(w_t)$

We want $J(w_{t+1}) \le J(w_t)$. But this is not sufficient to reach a local minimum! One also has to make sure that $\nabla J(w_t) \to 0$.

The step-length must satisfy the Wolfe condition of sufficient decrease along the search direction $p_t$:
$J(w_t + \eta_t p_t) \le J(w_t) + c_1 \eta_t \nabla J(w_t)^T p_t$,   $0 < c_1 < 1$

Gradient descent converges to a local minimum if the step size is sufficiently small!

Rate of Convergence: Linear
$\frac{\|w_{t+1} - \hat{w}\|}{\|w_t - \hat{w}\|} \le r$

Painfully slow in practice; many heuristics are used (e.g., momentum).
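
A minimal NumPy sketch (my own, not from the slides) of gradient descent with a backtracking line search that enforces the sufficient-decrease condition above:

```python
import numpy as np

def gradient_descent(J, grad, w0, c1=1e-4, max_iter=500, tol=1e-6):
    """Gradient descent with backtracking until the sufficient-decrease
    (first Wolfe) condition holds for the step length eta."""
    w = w0.copy()
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < tol:          # make sure grad J(w_t) -> 0
            break
        p = -g                               # search direction
        eta = 1.0
        while J(w + eta * p) > J(w) + c1 * eta * (g @ p):
            eta *= 0.5                       # shrink until sufficient decrease
        w = w + eta * p
    return w
```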

Page 17: Coordinate Descent

Updates one coordinate at a time, keeping the others fixed:
$w^i_{t+1} = w^i_t - \eta_t \nabla_i J(w_t)$

• Cycles through all the coordinates sequentially

Properties
• Very simple and easy to implement – reasonable when variables are loosely coupled
• Can be inefficient in practice (may take a long time to converge)
• Convergence not guaranteed, in general

Block Coordinate Descent
• Update a small set of coordinates simultaneously
• Has been used successfully for training large SVMs

[Figure: the zigzag path of coordinate descent iterates on the contours of $J(w)$.]
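
A small sketch of the cyclic variant (illustrative only; the per-coordinate gradient function grad_i is an assumed interface):

```python
import numpy as np

def coordinate_descent(grad_i, w0, eta=0.1, n_epochs=100):
    """Cyclic coordinate descent: sweep the coordinates in order and move each
    one along its negative partial derivative, holding the others fixed."""
    w = w0.copy()
    for _ in range(n_epochs):
        for i in range(w.size):          # cycle through coordinates sequentially
            w[i] -= eta * grad_i(w, i)   # grad_i(w, i) returns dJ/dw_i
    return w
```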

Page 20: Conjugate Gradient

Parameters are searched along conjugate directions.

Conjugacy: two vectors $p_1$ and $p_2$ are conjugate w.r.t. $H$ if $p_1^T H p_2 = 0$.

Starting with $p_0 = -\nabla J(w_0)$:
$w_{t+1} = w_t + \alpha_t p_t$

Exact step length (but usually $\alpha_t$ is picked to satisfy the Wolfe conditions!):
$\alpha_t = -\frac{\nabla J(w_t)^T p_t}{p_t^T H_t\, p_t}$

Successive conjugate directions are a linear combination of the gradient direction and the previous conjugate direction:
$p_t = -\nabla J(w_t) + \beta_t p_{t-1}$,   $t = 1, 2, \dots$

Hestenes-Stiefel form:
$\beta_t = \frac{\nabla J(w_t)^T (\nabla J(w_t) - \nabla J(w_{t-1}))}{p_{t-1}^T (\nabla J(w_t) - \nabla J(w_{t-1}))}$

Properties
• For quadratic functions, guaranteed to return the optimum in at most d iterations
• Use preconditioning to increase the convergence rate: make the condition number of the Hessian as small as possible

[Figure: paths of gradient descent (green) vs conjugate gradient (red) on the contours of a quadratic.]
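
A sketch (mine, not from the slides) of nonlinear conjugate gradient with the Hestenes-Stiefel beta and a simple backtracking step; a production version would enforce the full Wolfe conditions:

```python
import numpy as np

def nonlinear_cg(J, grad, w0, max_iter=200, tol=1e-6, c1=1e-4):
    w = w0.copy()
    g = grad(w)
    p = -g                                    # p_0 = -grad J(w_0)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        alpha = 1.0                           # backtracking step length
        while J(w + alpha * p) > J(w) + c1 * alpha * (g @ p):
            alpha *= 0.5
        w_new = w + alpha * p
        g_new = grad(w_new)
        dg = g_new - g                        # change in gradient
        beta = (g_new @ dg) / (p @ dg)        # Hestenes-Stiefel
        p = -g_new + beta * p
        w, g = w_new, g_new
    return w
```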

Page 23: Newton's Method

Approximates the function using a second-order Taylor (quadratic) expansion:
$J(w_t + p) \approx J(w_t) + \nabla J(w_t)^T p + \frac{1}{2} p^T \nabla^2 J(w_t)\, p$

Minimizing with respect to $p$ gives the Newton step, with Hessian $H_t = \nabla^2 J(w_t)$:
$p = -\big(\nabla^2 J(w_t)\big)^{-1} \nabla J(w_t)$,   i.e.,   $w_{t+1} = w_t - H_t^{-1} \nabla J(w_t)$

– No explicit step-length parameter
– If the Hessian is not positive definite, the Newton step may be undefined or may even diverge
– For quadratic functions, converges in a single step!

Rate of Convergence: Quadratic
$\frac{\|w_{t+1} - \hat{w}\|}{\|w_t - \hat{w}\|^2} \le k$

Fast convergence, but each iteration is slow: O(d²) space and O(d³) time.
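
For concreteness, a bare-bones sketch of the iteration (illustrative; grad and hess are assumed callables returning the gradient vector and the d x d Hessian):

```python
import numpy as np

def newton_method(grad, hess, w0, max_iter=50, tol=1e-8):
    """Plain Newton iteration: w <- w - H^{-1} grad J(w). The linear solve
    costs O(d^3) per step and there is no step-length parameter."""
    w = w0.copy()
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < tol:
            break
        w = w - np.linalg.solve(hess(w), g)   # Newton step
    return w
```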

Page 26: Quasi-Newton Methods

Do not require computation of the Hessian, but still give super-linear convergence.

Key Idea: Approximate the Hessian locally using the change in gradients.

Secant equation (with $B_{t+1}$ the approximate Hessian):
$B_{t+1} s_t = y_t$,   where $s_t = w_{t+1} - w_t$,   $y_t = \nabla J(w_{t+1}) - \nabla J(w_t)$

Wolfe's conditions $\Rightarrow$ $s_t^T y_t > 0$ $\Rightarrow$ $B_t \succ 0$

– Additional conditions are imposed on B, e.g., symmetry or diagonal structure
– The difference between successive B's is low-rank
– BFGS (Broyden-Fletcher-Goldfarb-Shanno): uses rank-2 (inverse) Hessian updates

Writing $\tilde{B}_t = B_t^{-1}$ for the approximate inverse Hessian:
$\tilde{B}_{t+1} = \arg\min_{\tilde{B}} \|\tilde{B} - \tilde{B}_t\|_F$   subject to   $\tilde{B} = \tilde{B}^T$,   $\tilde{B} y_t = s_t$

• The inverse of the approximate Hessian is updated directly, in O(d²) space and time:
$\tilde{B}_{t+1} = (I - \rho_t s_t y_t^T)\, \tilde{B}_t\, (I - \rho_t y_t s_t^T) + \rho_t s_t s_t^T$,   $\rho_t = 1/(y_t^T s_t)$,   $\tilde{B}_0 = I$

O(d²) cost per iteration, superlinear rate of convergence, self-correcting properties.
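
A compact BFGS sketch (my own illustration) using the rank-2 inverse-Hessian update above; a backtracking search stands in for a full Wolfe line search, so the curvature condition is checked explicitly:

```python
import numpy as np

def bfgs(J, grad, w0, max_iter=200, tol=1e-6, c1=1e-4):
    w = w0.copy()
    d = w.size
    B_inv = np.eye(d)                            # tilde B_0 = I
    g = grad(w)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -B_inv @ g                           # quasi-Newton direction
        eta = 1.0
        while J(w + eta * p) > J(w) + c1 * eta * (g @ p):
            eta *= 0.5
        w_new = w + eta * p
        g_new = grad(w_new)
        s, y = w_new - w, g_new - g
        sy = s @ y
        if sy > 1e-12:                           # curvature condition s^T y > 0
            rho = 1.0 / sy
            V = np.eye(d) - rho * np.outer(s, y)
            B_inv = V @ B_inv @ V.T + rho * np.outer(s, s)
        w, g = w_new, g_new
    return w
```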

Page 28: Limited-Memory BFGS (L-BFGS)

Quasi-Newton methods produce dense Hessian approximations even when the true Hessian is sparse, which means high storage cost for large-scale problems.

Key Idea: Represent the approximate Hessian using a few d-dimensional vectors from the most recent iterations: store at most the m most recent pairs $(s_t, y_t)$.

$w_{t+1} = w_t - \alpha_t \tilde{B}_t \nabla J(w_t)$

– Iteratively estimate the product $\tilde{B}_t \nabla J(w_t)$ using the most recent $(s_t, y_t)$ pairs
– Can be achieved efficiently in two loops, in O(md)
– A different initialization is possible for each inner loop, e.g., $\tilde{B}_t^0 = \gamma_t I$ with $\gamma_t = \frac{s_{t-1}^T y_{t-1}}{y_{t-1}^T y_{t-1}}$

Similar to Conjugate Gradient; the rate of convergence is linear instead of superlinear as in BFGS.

How about sparse, sampling-based or diagonal approximations of the Hessian?
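
The standard two-loop recursion computes $\tilde{B}_t \nabla J(w_t)$ without ever forming a d x d matrix; the sketch below (mine, assuming at least one stored pair) follows it directly:

```python
import numpy as np

def lbfgs_direction(grad_w, s_list, y_list):
    """L-BFGS two-loop recursion: returns tilde(B)_t * grad J(w_t) from the m
    most recent (s, y) pairs in O(md) time and memory."""
    q = grad_w.copy()
    alphas, rhos = [], []
    for s, y in reversed(list(zip(s_list, y_list))):     # newest to oldest
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q -= a * y
        alphas.append(a)
        rhos.append(rho)
    s, y = s_list[-1], y_list[-1]
    r = (s @ y) / (y @ y) * q                            # gamma_t * I scaling
    for (s, y), a, rho in zip(zip(s_list, y_list),       # oldest to newest
                              reversed(alphas), reversed(rhos)):
        b = rho * (y @ r)
        r += (a - b) * s
    return r    # use as: w_{t+1} = w_t - alpha_t * r
```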

Page 29: Comparison: Logistic Regression

n = 1500, d = 100 (Minka [5])

[Figure: convergence of the different optimizers on logistic regression, for independent features and for correlated features.]

Hessian-based methods do better. There is a trade-off between the number of iterations and the cost per iteration!

Page 30: Batch vs Online

Most optimization functions use i.i.d. training data as:

Linear Regression:   $J(w) = \sum_i [\,(w^T x_i - y_i)^2 + \lambda\, w^T w\,]$
SVM:                 $J(w) = \sum_i [\,\max(0,\, 1 - y_i w^T x_i) + \lambda\, w^T w\,]$
Logistic Regression: $J(w) = \sum_i [\,-\log(\sigma(y_i w^T x_i)) + \lambda\, w^T w\,]$

Batch methods
– compute gradients using all the training data
– each iteration needs to use the full batch of data, so they are slow for large datasets

Online (Stochastic) Methods
– use gradients from a small subset of data, usually just a single point
– do not decrease the function value monotonically
– typically have worse convergence properties than batch methods
– quite appropriate for large-scale applications, where data is usually redundant
– can converge even before seeing all the data once, i.e., very fast
– popular conjecture: they may also avoid bad local minima

Page 32: Stochastic Gradient Descent (SGD)

At each iteration, use a small training subset to estimate the gradient:
$w_{t+1} = w_t - \eta_t \nabla J(w_t, x_t)$,   where $x_t$ is a single point or a small subset of the training data

Converges to the optimal value $\hat{w}$ if
$\sum_t \eta_t = \infty$   (parameters may move arbitrary distances)
$\sum_t \eta_t^2 < \infty$   (step size decreases fast enough)

Step-length: $\eta_t = \frac{\eta_0 \tau}{\tau + t}$,   where $\eta_0, \tau > 0$ are tuning parameters

Properties
– O(d) space and time per iteration, instead of the O(nd) of gradient descent
– Outperforms batch gradient methods on large datasets
– Slow convergence on ill-conditioned problems
– Stochastic Meta-Descent: adjust a different gain for each parameter
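
A small sketch (not from the slides) of SGD for regularized logistic regression with the decaying step size above; names and constants are illustrative:

```python
import numpy as np

def sgd_logistic(X, y, lam=1e-4, eta0=0.1, tau=100.0, n_epochs=5):
    """One-point-at-a-time SGD with eta_t = eta0 * tau / (tau + t)."""
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(n_epochs):
        for i in np.random.permutation(n):        # single point per update
            t += 1
            eta = eta0 * tau / (tau + t)
            z = y[i] * (X[i] @ w)
            g = -y[i] * X[i] / (1.0 + np.exp(z)) + 2 * lam * w
            w -= eta * g
    return w
```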

Page 33: Experiment - SGD

SVM classification task: d = 47K (~80 nonzero), n = ~800K

[Figure: experimental results from Bottou & Bousquet [8].]

Page 36: SGD for Sparse Data

Suppose the training vectors $\{x_i \in \mathbb{R}^d\}_{i=1,\dots,n}$ are sparse:
– Only a small fraction $s \ll 1$ of the d elements are non-zero
– Common for many applications: text, images, biological data, ...

Objective function:
$J(w) = \sum_{i=1}^n [\, L(y_i, w^T x_i) + \lambda R(w) \,]$

$\nabla J(w_t, x_t) = \nabla L(y_t, w_t^T x_t)\, y_t x_t + \lambda \nabla R(w_t)$
The first (loss) term is sparse, but the regularizer term is dense, giving a dense vector update: O(d) per iteration.

How to make sparse updates, i.e., O(sd) per iteration? Break the update into two parts:
1. Update using the gradient of L(.) alone, ignoring the regularizer: $w_{t+1} = w_t - \eta_t \nabla L(w_t, x_t)$
2. Every kth iteration, adjust w using the gradient of the regularizer: $w_{t+1} \leftarrow w_{t+1} - k\, \eta_{t+1} \nabla R(w_{t+1})$
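
A sketch of this two-part update for an L2 regularizer and logistic loss (my own; the sparse-row format, loss choice, and constants are assumptions):

```python
import numpy as np

def sparse_sgd_l2(X_rows, y, d, lam=1e-4, eta=0.01, k=100):
    """X_rows is a list of (indices, values) pairs for the sparse examples.
    Per-example steps touch only the non-zero coordinates (O(sd)); the dense
    L2 regularizer gradient is applied only every k-th iteration, scaled by k."""
    w = np.zeros(d)
    for t, ((idx, vals), yt) in enumerate(zip(X_rows, y), start=1):
        z = yt * (vals @ w[idx])
        coef = -yt / (1.0 + np.exp(z))     # dL/d(w^T x) for logistic loss
        w[idx] -= eta * coef * vals        # 1. sparse loss-gradient update
        if t % k == 0:
            w -= eta * k * (2 * lam * w)   # 2. periodic dense regularizer step
    return w
```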

Page 38: Perceptron

Originally proposed to learn a linear binary classifier: $f(x) = \mathrm{sgn}(w^T x)$,   $y \in \{-1, 1\}$

Update Rule:
$w_{t+1} = \begin{cases} w_t + y\, x & \text{if } x \text{ is misclassified, i.e., } y \ne f(x) \\ w_t & \text{otherwise} \end{cases}$

Properties
– Stochastic gradient descent on a non-differentiable loss function
– Guaranteed to converge if the data is linearly separable
– What happens for inseparable data? It will not converge, but oscillate!
– Heuristics such as voting over the various parameter vectors are used

[Figure: two classes of points in the $(x_1, x_2)$ plane separated by a linear boundary.]
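
The update rule translates almost directly into code; a minimal sketch (mine, cycling over the data for a fixed number of epochs):

```python
import numpy as np

def perceptron(X, y, n_epochs=10):
    """Classic perceptron: add y*x whenever a point is misclassified,
    leave w unchanged otherwise (labels in {-1, +1})."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w) <= 0:     # misclassified: sgn(w^T x) != y
                w += yi * xi
    return w
```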

Page 40: Natural Gradient Descent

Incorporate a Riemannian metric into stochastic descent:
$w_{t+1} = w_t - \eta_t \tilde{G}_t^{-1} \nabla J(w_t, x_t)$,   $\eta_t = \frac{\eta_0 \tau}{\tau + t}$
$G_t = E_x[\nabla J(w_t, x_t)\, \nabla J(w_t, x_t)^T]$

Metric update (assuming the Hessian to be constant):
$\tilde{G}_{t+1} = \frac{t-1}{t}\, \tilde{G}_t + \frac{1}{t}\, \nabla J(w_t, x_t)\, \nabla J(w_t, x_t)^T$

Properties
– Update the matrix inverse directly: O(d²) space and time per iteration
– More stable; has good theoretical properties
– Still expensive for large-scale problems
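
A sketch of the idea (mine, not from the slides): the running outer-product metric is inverted directly with the Sherman-Morrison identity, which keeps the per-step cost at O(d²); seeding G with eps*I so the first inverses exist is an implementation choice, not part of the slide.

```python
import numpy as np

def natural_gradient_sgd(stoch_grad, w0, n_iter=1000, eta0=0.5, tau=100.0, eps=1e-3):
    """stoch_grad(w) returns a stochastic gradient at w (draws its own sample)."""
    w = w0.copy()
    d = w.size
    G_inv = np.eye(d) / eps                      # inverse of G_0 = eps * I
    for t in range(1, n_iter + 1):
        g = stoch_grad(w)
        # G_{t+1} = (1 - 1/t) G_t + (1/t) g g^T; update the inverse directly:
        A_inv = G_inv * t / (t - 1) if t > 1 else G_inv
        Ag = A_inv @ g
        G_inv = A_inv - np.outer(Ag, Ag) / (t + g @ Ag)   # Sherman-Morrison
        eta = eta0 * tau / (tau + t)
        w -= eta * (G_inv @ g)
    return w
```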

Page 43: Online-BFGS

The change in gradients is estimated using a subset of the training data:
$w_{t+1} = w_t - \eta_t \tilde{B}_t \nabla J(w_t, x_t)$,   with $\tilde{B}_t$ a stochastic approximation of the inverse Hessian

Secant equation: $\tilde{B}_{t+1} y_t = s_t$,   $s_t = w_{t+1} - w_t$,   $y_t = \nabla J(w_{t+1}) - \nabla J(w_t)$
Wolfe's conditions $\Rightarrow$ $s_t^T y_t > 0$ $\Rightarrow$ $B_t \succ 0$

Modifications for the online setting:
1. To maintain positive curvature, compute the gradient difference on the same sample and modify it:
   $y_t = \nabla J(w_{t+1}, x_t) - \nabla J(w_t, x_t) + \lambda s_t$,   $\lambda \ge 0$
2. Initialize $\tilde{B}$ with a small coefficient to restrict the initial parameter update:
   $\tilde{B}_0 = \epsilon I$,   $\epsilon \approx 10^{-10}$
3. Modify the update of $\tilde{B}_{t+1}$:
   $\tilde{B}_{t+1} = (I - \rho_t s_t y_t^T)\, \tilde{B}_t\, (I - \rho_t y_t s_t^T) + c\, \rho_t s_t s_t^T$,   $\rho_t = 1/(y_t^T s_t)$,   $0 < c \le 1$,   $\eta_t \leftarrow \eta_t / c$

– Test of convergence: the last k stochastic gradients have been below a threshold
– An online version of L-BFGS is possible: O(md) cost instead of O(d²)
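
A sketch pulling the three modifications together (my own; the sampling interface, step-size schedule, and constants are assumptions):

```python
import numpy as np

def online_bfgs(stoch_grad, sample_batch, w0, n_iter=1000, eta0=0.1,
                tau=100.0, lam=1.0, eps=1e-10, c=0.1):
    """stoch_grad(w, batch) returns the gradient on the given mini-batch;
    sample_batch() draws a small subset of the training data."""
    w = w0.copy()
    d = w.size
    B = eps * np.eye(d)                              # tilde B_0 = eps * I
    for t in range(1, n_iter + 1):
        batch = sample_batch()
        g = stoch_grad(w, batch)
        eta = (eta0 * tau / (tau + t)) / c           # step eta_t / c
        w_new = w - eta * (B @ g)
        s = w_new - w
        y = stoch_grad(w_new, batch) - g + lam * s   # same sample, plus lam * s
        rho = 1.0 / (y @ s)
        V = np.eye(d) - rho * np.outer(s, y)
        B = V @ B @ V.T + c * rho * np.outer(s, s)   # rank-1 term scaled by c
        w = w_new
    return w
```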

Page 44: Experiments

Synthetic quadratic function; # parameters d = 5, n = ~1M

[Figure: experimental results from Schraudolph et al. [7].]

Page 45: Experiments

Conditional Random Field (CRF) applied to text chunking; # parameters d = 100K, n = ~9K

[Figure: experimental results from Schraudolph et al. [7].]

Page 48: Stochastic Gradient Descent (Quasi-Newton)

Approximate the inverse Hessian by a diagonal matrix (inspired by online-BFGS).

With $s_t = w_{t+1} - w_t$, $y_t = \nabla J(w_{t+1}) - \nabla J(w_t)$ and Wolfe's conditions ($s_t^T y_t > 0$, $B_t \succ 0$) as before, the secant relation with a diagonal $\tilde{B}$ reads
$w_{t+1} - w_t \approx \tilde{B}_{t+1}\, (\nabla J(w_{t+1}, x_t) - \nabla J(w_t, x_t))$,
i.e., per coordinate,
$[w_{t+1} - w_t]_i \approx \tilde{B}_{ii}\, [\nabla J(w_{t+1}, x_t) - \nabla J(w_t, x_t)]_i$

In practice,
– Update the matrix entries only every k iterations, using the per-coordinate ratio
  $r_i = [\nabla J(w_{t+1}, x_t) - \nabla J(w_t, x_t)]_i \,/\, [w_{t+1} - w_t]_i$,   clipped as $r_i \leftarrow \max\{\lambda, \min\{100\lambda, r_i\}\}$
– Do a leaky average of B to get stable updates: $\tilde{B}_{ii} \leftarrow \tilde{B}_{ii} + \frac{1}{k}\,[\, r_i^{-1} - \tilde{B}_{ii}\,]$
– Has the flavor of a partially 'fixed-Hessian' method
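
A rough sketch of this diagonal scaling (mine; everything beyond the skip interval k, the clipping to [lambda, 100*lambda], and the leaky average is an assumption):

```python
import numpy as np

def sgd_diag_qn(stoch_grad, sample_batch, w0, n_iter=1000, eta0=0.1,
                tau=100.0, lam=1e-4, k=10):
    """SGD with a per-coordinate (diagonal) scaling estimated every k steps
    from a gradient difference computed on the same sample."""
    w = w0.copy()
    d = w.size
    B = np.ones(d)                               # diagonal scaling
    for t in range(1, n_iter + 1):
        batch = sample_batch()
        g = stoch_grad(w, batch)
        eta = eta0 * tau / (tau + t)
        w_new = w - eta * B * g                  # coordinate-wise scaled step
        if t % k == 0:                           # update the scaling every k steps
            g_new = stoch_grad(w_new, batch)     # same sample
            dw = w_new - w
            nz = np.abs(dw) > 1e-12
            r = np.full(d, lam)
            r[nz] = (g_new[nz] - g[nz]) / dw[nz]
            r = np.clip(r, lam, 100 * lam)       # r_i <- max{lam, min{100 lam, r_i}}
            B += (1.0 / k) * (1.0 / r - B)       # leaky average toward r_i^{-1}
        w = w_new
    return w
```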

Page 49: Large Scale Machine Learning - Sanjiv K · 2010-10-20 · Large-Scale Optimization Sanjiv Kumar, Google Research, NY EECS-6898, Columbia University - Fall, 2010. Sanjiv Kumar 10/19/2010

EECS6898 – Large Scale Machine Learning 49Sanjiv

Kumar            10/19/2010

Stochastic Gradient Descent (Quasi-Newton)

Bordes et al. [9]

Alpha dataset of Pascal Challenge (2008) # parameters d = 500 (dense), n = ~100K

sparse SGDRCV1: d = 47K, s = 0.0016

Page 50: References

1. H. E. Robbins and S. Monro, "A stochastic approximation method," Annals of Mathematical Statistics, 1951.
2. J. Nocedal and S. J. Wright, Numerical Optimization, Springer Series in Operations Research, 1999.
3. N. Schraudolph, "Local gain adaptation in stochastic gradient descent," Int. Conf. on Artificial Neural Networks, 1999.
4. S. Amari, H. Park, K. Fukumizu, "Adaptive method of realizing natural gradient learning for multilayer perceptrons," Neural Computation, 2000.
5. T. Minka, "A comparison of numerical optimizers for logistic regression," 2003. http://research.microsoft.com/en-us/um/people/minka/papers/logreg/
6. L. Bottou and Y. LeCun, "Online learning for very large datasets," Applied Stochastic Models in Business and Industry, 2005.
7. N. Schraudolph, J. Yu and S. Gunter, "A stochastic quasi-Newton method for online convex optimization," AISTATS, 2007.
8. L. Bottou and O. Bousquet, "Learning using large datasets," Mining Massive DataSets for Security, NATO ASI Workshop Series, IOS Press, Amsterdam, 2008.
9. A. Bordes, L. Bottou and P. Gallinari, "SGD-QN: Careful quasi-Newton stochastic gradient descent," JMLR, 2009.
10. A. Bordes, L. Bottou, P. Gallinari, J. Chang, S. A. Smith, "Erratum: SGD-QN is less careful than expected," 2010. http://jmlr.csail.mit.edu/papers/v11/bordes10a.html
11. J. Yu, S. Vishwanathan, S. Günter, and N. N. Schraudolph, "A quasi-Newton approach to nonsmooth convex optimization problems in machine learning," Journal of Machine Learning Research, 11:1145-1200, 2010.

Page 51: Logistic Regression

Given: a labeled training set $\{x_i, y_i\}_{i=1,\dots,n}$,   $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$
Goal: learn a predictive function $p(y = 1 \mid x) = \sigma(w^T x)$   (absorb $w_0$ in $w$), so that $p(y_i \mid x_i) = \sigma(y_i w^T x_i)$

(Negative) log-posterior = log-likelihood term + log-prior term:
$L(w) = \sum_i \log(1 + \exp(-y_i w^T x_i)) + \lambda\, w^T w$

Convex problem; Newton's method gives updates of the form
$w_{t+1} = (X A X^T + \lambda I)^{-1} X A z$   (A: a diagonal weight matrix, z: the working responses)
For $n \sim O(100M)$ and $d \sim O(100K)$, the matrix multiplication costs $O(nd^2)$ and the inversion $O(d^3)$, versus $O(nd)$ for first-order methods.

[Figure: the sigmoid $p(y = 1 \mid x)$ as a function of $w^T x$, and a linear decision boundary in the $(x_1, x_2)$ plane.]
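
To make the costs concrete, a small Newton-style sketch for this objective (mine, not from the slides); forming the weighted Gram matrix is the O(nd²) step and the solve is O(d³):

```python
import numpy as np

def logreg_newton(X, y, lam=1e-3, n_iter=20):
    """Newton updates for L2-regularized logistic regression, labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        z = X @ w
        sig = 1.0 / (1.0 + np.exp(-z))              # sigma(w^T x_i)
        grad = -X.T @ (y / (1.0 + np.exp(y * z))) + 2 * lam * w
        D = sig * (1.0 - sig)                       # diagonal Newton weights
        H = X.T @ (D[:, None] * X) + 2 * lam * np.eye(d)   # O(n d^2)
        w -= np.linalg.solve(H, grad)               # O(d^3)
    return w
```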