
Page 1: Iterative Reweighted Least Squares

Iterative Reweighted Least Squares

Sargur N. Srihari, University at Buffalo, State University of New York, USA

Page 2: Topics in Linear Classification using Probabilistic Discriminative Models

Topics in Linear Classification using Probabilistic Discriminative Models

• Generative vs Discriminative
1. Fixed basis functions in linear classification
2. Logistic Regression (two-class)
3. Iterative Reweighted Least Squares (IRLS)
4. Multiclass Logistic Regression
5. Probit Regression
6. Canonical Link Functions


Page 3: Topics in IRLS

Topics in IRLS

• Linear and Logistic Regression
• Newton-Raphson Method
• Hessian
• IRLS for Linear Regression
• IRLS for Logistic Regression
• Numerical Example
• Hessian in Deep Learning


Page 4: Recall Linear Regression

Recall Linear Regression

• Assuming Gaussian noise: p(t | x, w, β) = N(t | y(x, w), β−1)
• And a linear model (given below)
• The likelihood and log-likelihood then have the forms shown below
• The log-likelihood is quadratic in w
• Its derivative has the form shown below
• Setting the derivative equal to zero and solving, we get the closed-form solution wML


Linear model:
y(x, w) = Σj=0..M−1 wj φj(x) = wTφ(x)

Likelihood:
p(t | X, w, β) = Πn=1..N N(tn | wTφ(xn), β−1)

Log-likelihood:
ln p(t | w, β) = Σn=1..N ln N(tn | wTφ(xn), β−1) = (N/2) ln β − (N/2) ln 2π − β ED(w)

where
ED(w) = (1/2) Σn=1..N {tn − wTφ(xn)}²

Derivative:
∇ ln p(t | w, β) = β Σn=1..N {tn − wTφ(xn)} φ(xn)T

Setting it to zero and solving:
wML = Φ†t,  where Φ† = (ΦTΦ)−1ΦT

and Φ is the N × M design matrix with elements Φnj = φj(xn)
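As a quick numerical check of the closed-form solution (a sketch, not part of the slides; the toy data, polynomial basis, and variable names are assumptions for illustration):

```python
import numpy as np

# Toy 1-D data set {x_n, t_n}; values are illustrative only
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)

# Design matrix Phi with polynomial basis functions phi_j(x) = x^j, M = 4
M = 4
Phi = np.vander(x, M, increasing=True)          # N x M, row n is phi(x_n)^T

# Maximum-likelihood weights: w_ML = (Phi^T Phi)^{-1} Phi^T t
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
print(w_ml)
```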

Page 5: Linear and Logistic Regression

Linear and Logistic Regression

• In linear regression there is a closed-form maximum-likelihood solution for w, under the Gaussian noise assumption, because the log-likelihood depends quadratically on w
• For logistic regression there is no closed-form maximum-likelihood solution, due to the nonlinearity of the logistic sigmoid
• But the departure from a quadratic form is not substantial
• The error function is convex, i.e., it has a unique minimum


Cross-entropy error for logistic regression:
E(w) = − ln p(t | w) = − Σn=1..N {tn ln yn + (1 − tn) ln(1 − yn)},  where yn = σ(wTφn)

Sum-of-squares error for linear regression:
ED(w) = (1/2) Σn=1..N {tn − wTφ(xn)}²,  with closed-form solution wML = Φ†t

Page 6: What is IRLS?

What is IRLS?

• An iterative method to find the solution w* for linear regression and logistic regression, assuming a least-squares objective
• While simple gradient descent has the form

w(new) = w(old) − η∇E(w)

• IRLS uses the second derivative and has the form

w(new) = w(old) − H−1∇E(w)

• It is derived from the Newton-Raphson method, where H is the Hessian matrix whose elements are the second derivatives of E(w) with respect to w

The gradient vector and Hessian matrix of E(w) are

∇E(w) ≡ [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]

H = ∇∇E(w) = d²E/dw²,  with elements Hij = ∂²E/∂wi∂wj; for two weights,

H = [ ∂²E/∂w1²     ∂²E/∂w1∂w2 ]
    [ ∂²E/∂w2∂w1   ∂²E/∂w2²   ]
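As an illustrative sketch (not from the slides; the quadratic error and the values of A and b are made up), the two update rules can be compared on a simple quadratic, where a single Newton-Raphson step lands exactly on the minimum:

```python
import numpy as np

# Quadratic error E(w) = 0.5 w^T A w - b^T w, with gradient A w - b and constant Hessian A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda w: A @ w - b
H = A

w0 = np.zeros(2)
w_gd = w0 - 0.1 * grad(w0)                  # one gradient-descent step, eta = 0.1
w_nr = w0 - np.linalg.solve(H, grad(w0))    # one Newton-Raphson step: w - H^{-1} grad E
print(w_gd)
print(w_nr, np.linalg.solve(A, b))          # Newton step equals the exact minimizer A^{-1} b
```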

Page 7: Why second derivative?

Why second derivative?

• The derivative of a function E(w) at w is the slope of the tangent at w
• Newton-Raphson is an iterative method for finding the zeros of a function; it repeatedly computes the derivative
• We need to find the value of w at which ∇E(w) = 0, so we need to repeatedly compute the derivative of ∇E(w), i.e., the second derivative

[Figure: Newton-Raphson for solving f(x) = 0, stepping from the current solution along the tangent toward the critical point. Since we are solving for the derivative of E(w), we need the second derivative.]

Page 8: Higher order derivatives example (1-D)

Higher order derivatives example (1-D)

• Illustration of 1st, 2nd and 3rd order derivatives
  – Derivatives of a Gaussian density p(x) ~ N(0, σ)

[Figure: panels showing the first-, second-, and third-order derivatives of the Gaussian density]
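The derivatives plotted in the figure can be reproduced symbolically (a sketch; the use of sympy is an assumption, not part of the slides):

```python
import sympy as sp

x, sigma = sp.symbols('x sigma', positive=True)
p = sp.exp(-x**2 / (2 * sigma**2)) / (sp.sqrt(2 * sp.pi) * sigma)   # N(0, sigma) density

d1 = sp.diff(p, x)       # first-order derivative
d2 = sp.diff(p, x, 2)    # second-order derivative
d3 = sp.diff(p, x, 3)    # third-order derivative
print(sp.simplify(d1), sp.simplify(d2), sp.simplify(d3), sep='\n')
```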

Page 9: Second derivative measures curvature

Second derivative measures curvature

• Derivative of a derivative
• Quadratic functions with different curvatures (figure below)


[Figure: quadratic functions with negative, zero, and positive curvature. The dashed line is the value of the cost function predicted by the gradient alone: with negative curvature the decrease is faster than predicted by gradient descent, with zero curvature the gradient predicts the decrease correctly, and with positive curvature the decrease is slower than expected and the function can actually increase.]

Page 10: Beyond Gradient: Jacobian and Hessian matrices

Beyond Gradient: Jacobian and Hessian matrices

• Sometimes we need to find all derivatives of a function whose input and output are both vectors

• If we have a function f: Rm → Rn, then the matrix of partial derivatives is known as the Jacobian matrix J, defined as


Ji,j = ∂f(x)i / ∂xj
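For instance, a small sketch (the function f and the finite-difference helper are illustrative assumptions) of the Jacobian of a map f: R² → R², computed both analytically and numerically:

```python
import numpy as np

def f(x):
    # f: R^2 -> R^2
    return np.array([x[0] * x[1], np.sin(x[0]) + x[1] ** 2])

def jacobian_fd(f, x, eps=1e-6):
    # Numerical Jacobian: J[i, j] = d f_i / d x_j (central differences)
    n_out, n_in = f(x).shape[0], x.shape[0]
    J = np.zeros((n_out, n_in))
    for j in range(n_in):
        dx = np.zeros(n_in)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

x0 = np.array([1.0, 2.0])
J_exact = np.array([[x0[1], x0[0]],
                    [np.cos(x0[0]), 2 * x0[1]]])   # analytic Jacobian at x0
print(jacobian_fd(f, x0))
print(J_exact)
```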

Page 11: Second derivative

Second derivative

• Derivative of a derivative
• For a function f: Rn → R, the derivative with respect to xi of the derivative of f with respect to xj is denoted as shown below
• In a single dimension we can denote it by f″(x)
• It tells us how the first derivative will change as we vary the input
• This is important because it tells us whether a gradient step will cause as much of an improvement as the gradient alone suggests


∂²f / (∂xi ∂xj),  and in one dimension, ∂²f / ∂x², i.e., f″(x)

Page 12: Hessian

Hessian

• Second derivative with many dimensions
• H(f)(x) is defined as shown below
• The Hessian is the Jacobian of the gradient
• The Hessian matrix is symmetric, i.e., Hi,j = Hj,i, anywhere the second partial derivatives are continuous
  – So the Hessian matrix can be decomposed into a set of real eigenvalues and an orthogonal basis of eigenvectors
• The eigenvalues of H are useful for determining the learning rate, as seen in the next two slides

H(f)(x)i,j = ∂²f(x) / (∂xi ∂xj)

Page 13: Role of eigenvalues of Hessian

Role of eigenvalues of Hessian

• The second derivative in direction d is dTHd
  – If d is a unit eigenvector, the second derivative in that direction is given by its eigenvalue
  – For other directions, it is a weighted average of the eigenvalues (with weights between 0 and 1, eigenvectors at a smaller angle to d receiving more weight)
• The maximum eigenvalue determines the maximum second derivative and the minimum eigenvalue determines the minimum second derivative
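A quick check of this property (toy Hessian values assumed, not from the slides):

```python
import numpy as np

H = np.array([[4.0, 1.0], [1.0, 2.0]])       # a symmetric Hessian
eigvals, eigvecs = np.linalg.eigh(H)          # eigenvalues sorted ascending

d = eigvecs[:, 0]                             # unit eigenvector
print(d @ H @ d, eigvals[0])                  # equal: directional 2nd derivative = eigenvalue

d = np.array([1.0, 1.0]) / np.sqrt(2)         # arbitrary unit direction
val = d @ H @ d
print(eigvals[0] <= val <= eigvals[1])        # True: lies between min and max eigenvalues
```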


Page 14: Learning rate from Hessian

Learning rate from Hessian

• The Taylor series of f(x) around the current point x(0) is shown below, where g is the gradient and H is the Hessian at x(0)
  – If we use learning rate ε, the new point x is given by x(0) − εg; substituting gives the second expression below
• There are three terms: the original value of f, the expected improvement due to the slope, and the correction to be applied due to the curvature
• Solving for the step size that minimizes this approximation gives ε* (below)


f(x) ≈ f(x(0)) + (x − x(0))Tg + (1/2)(x − x(0))TH(x − x(0))

f(x(0) − εg) ≈ f(x(0)) − εgTg + (1/2)ε²gTHg

ε* ≈ gTg / (gTHg)
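A small sketch of the optimal step size along −g (the gradient and Hessian values are made up for illustration):

```python
import numpy as np

H = np.array([[5.0, 0.0], [0.0, 1.0]])   # Hessian at x(0)
g = np.array([2.0, -1.0])                # gradient at x(0)

# Step size minimizing f(x(0)) - eps g^T g + 0.5 eps^2 g^T H g over eps
eps_star = (g @ g) / (g @ H @ g)
print(eps_star)
```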

Page 15: Another 2nd Derivative Method

Another 2nd Derivative Method

• Using the Taylor series of f(x) around the current point x(0) (below), solve for the critical point of this approximation to obtain the update x*
  – When f is a (positive definite) quadratic function, the solution jumps to the minimum of the function directly
  – When f is not quadratic, apply the update iteratively
• This can reach a critical point much faster than gradient descent
  – But it is useful only when the nearby critical point is a minimum


f(x) ≈ f(x(0)) + (x − x(0))T∇x f(x(0)) + (1/2)(x − x(0))TH(f)(x(0))(x − x(0))

x* = x(0) − H(f)(x(0))−1 ∇x f(x(0))
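A 1-D sketch of applying this update iteratively to a non-quadratic function (the function and starting point are illustrative choices, not from the slides):

```python
# f(x) = x^4 - 3x^2 + x, minimized by iterating the Newton update x <- x - f'(x)/f''(x)
f1 = lambda x: 4 * x**3 - 6 * x + 1      # first derivative
f2 = lambda x: 12 * x**2 - 6             # second derivative (the 1-D "Hessian")

x = 2.0                                  # start near a local minimum
for _ in range(6):
    x = x - f1(x) / f2(x)                # Newton-Raphson step
print(x, f1(x))                          # derivative is essentially zero at convergence
```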

Page 16: Two applications of IRLS

Two applications of IRLS

• IRLS is applicable to both Linear Regression and Logistic Regression
• We discuss both; for each we need:
  1. Model
  2. Objective function E(w)
     • Linear Regression: Sum of Squared Errors
     • Logistic Regression: Bernoulli Likelihood Function
  3. Gradient
  4. Hessian
  5. Newton-Raphson update

The common ingredients are

w(new) = w(old) − H−1∇E(w)     (Newton-Raphson update)
∇E(w)                          (gradient)
H = ∇∇E(w)                     (Hessian)

Linear Regression model:  y(x, w) = Σj=0..M−1 wj φj(x) = wTφ(x)
Linear Regression objective:  E(w) = (1/2) Σn=1..N {tn − wTφ(xn)}²

Logistic Regression model:  p(C1 | ϕ) = y(ϕ) = σ(wTϕ)
Logistic Regression likelihood:  p(t | w) = Πn=1..N yn^tn (1 − yn)^(1−tn)

Page 17: IRLS for Linear Regression

IRLS for Linear Regression

1. Model (for the data set X = {xn, tn}, n = 1,…,N):
   y(x, w) = Σj=0..M−1 wj φj(x) = wTφ(x)

2. Error Function (Sum of Squares):
   E(w) = (1/2) Σn=1..N {tn − wTφ(xn)}²

3. Gradient of the Error Function:
   ∇E(w) = Σn=1..N (wTφn − tn) φn = ΦTΦw − ΦTt

4. Hessian:
   H = ∇∇E(w) = Σn=1..N φn φnT = ΦTΦ
   where Φ is the N × M design matrix whose nth row is φnT, with elements Φnj = φj(xn)

5. Newton-Raphson Update:
   w(new) = w(old) − H−1∇E(w)
   Substituting ∇E(w) = ΦTΦw − ΦTt and H = ΦTΦ:
   w(new) = w(old) − (ΦTΦ)−1{ΦTΦw(old) − ΦTt} = (ΦTΦ)−1ΦTt
   which is the standard least-squares solution. Since the Hessian is independent of w, Newton-Raphson gives the exact solution in a single step.
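A numerical sketch (toy data and seed are assumptions) confirming that one Newton-Raphson step on the sum-of-squares error reaches the least-squares solution from any starting point:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 50, 3
Phi = rng.standard_normal((N, M))            # design matrix, rows phi(x_n)^T
t = rng.standard_normal(N)                   # targets

w_old = rng.standard_normal(M)               # arbitrary starting point
grad = Phi.T @ Phi @ w_old - Phi.T @ t       # grad E(w) = Phi^T Phi w - Phi^T t
H = Phi.T @ Phi                              # Hessian, independent of w

w_new = w_old - np.linalg.solve(H, grad)     # one Newton-Raphson step
w_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
print(np.allclose(w_new, w_ls))              # True: exact solution in one step
```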

Page 18: IRLS for Logistic Regression

IRLS for Logistic Regression

1. Model (for the data set {ϕn, tn}, tn ∈ {0,1}, ϕn = ϕ(xn)):
   p(C1 | ϕ) = y(ϕ) = σ(wTϕ)
   Likelihood:
   p(t | w) = Πn=1..N yn^tn (1 − yn)^(1−tn),  where yn = σ(wTϕn)

2. Objective Function (negative log-likelihood, the cross-entropy error function):
   E(w) = − Σn=1..N {tn ln yn + (1 − tn) ln(1 − yn)}

3. Gradient of the Error Function:
   ∇E(w) = Σn=1..N (yn − tn) ϕn = ΦT(y − t)

4. Hessian:
   H = ∇∇E(w) = Σn=1..N yn(1 − yn) ϕn ϕnT = ΦTRΦ
   where Φ is the N × M design matrix whose nth row is ϕnT, and R is the N × N diagonal matrix with elements Rnn = yn(1 − yn) = σ(wTϕn)(1 − σ(wTϕn))
   The Hessian is not constant: it depends on w through R. Since H is positive definite (i.e., uTHu > 0 for arbitrary u), the error function is a convex function of w and so has a unique minimum.

5. Newton-Raphson Update:
   w(new) = w(old) − H−1∇E(w)
   Substituting H = ΦTRΦ and ∇E(w) = ΦT(y − t):
   w(new) = w(old) − (ΦTRΦ)−1ΦT(y − t)
          = (ΦTRΦ)−1{ΦTRΦw(old) − ΦT(y − t)}
          = (ΦTRΦ)−1ΦTRz
   where z is the N-dimensional vector with elements z = Φw(old) − R−1(y − t)
   The update formula is a set of normal equations (a weighted least-squares problem). Since the Hessian depends on w, they are applied iteratively, each time using the new weight vector.
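A compact sketch of the IRLS loop for two-class logistic regression (the synthetic data, seed, and stopping rule are assumptions, not from the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Synthetic 2-class data drawn from a logistic model (illustrative values only)
rng = np.random.default_rng(2)
N, M = 100, 3
Phi = np.hstack([np.ones((N, 1)), rng.standard_normal((N, M - 1))])  # design matrix with bias column
w_true = np.array([-0.5, 2.0, -1.0])
t = (rng.random(N) < sigmoid(Phi @ w_true)).astype(float)            # Bernoulli targets t_n in {0,1}

w = np.zeros(M)
for _ in range(20):
    y = sigmoid(Phi @ w)                 # y_n = sigma(w^T phi_n)
    grad = Phi.T @ (y - t)               # grad E(w) = Phi^T (y - t)
    R = np.diag(y * (1.0 - y))           # R_nn = y_n (1 - y_n)
    H = Phi.T @ R @ Phi                  # H = Phi^T R Phi, depends on w through R
    step = np.linalg.solve(H, grad)
    w = w - step                         # Newton-Raphson / IRLS update
    if np.linalg.norm(step) < 1e-8:      # stop once the update is negligible
        break
print(w)                                 # typically lies near w_true
```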

Page 19: Initial Weight Vector, Gradient and Hessian (2-class)

Initial Weight Vector, Gradient and Hessian (2-class)

[Figure: initial values of the weight vector, gradient, and Hessian for the 2-class example]


Page 20: Final Weight Vector, Gradient and Hessian (2-class)

Final Weight Vector, Gradient and Hessian (2-class)

[Figure: final values of the weight vector, gradient, and Hessian for the 2-class example]


Number of iterations: 10

Error (initial and final): 15.0642, 1.0000e-009

Page 21: Hessian in Deep Learning

Hessian in Deep Learning

• The second derivative can be used to determine whether a critical point is a local minimum, a local maximum, or a saddle point
  – At a critical point, f′(x) = 0
  – When f″(x) > 0, the first derivative f′(x) increases as we move to the right and decreases as we move to the left, so we conclude that x is a local minimum
  – For a local maximum, f′(x) = 0 and f″(x) < 0
  – When f″(x) = 0 the test is inconclusive: x may be a saddle point or part of a flat region


Page 22: Multidimensional Second derivative test

Multidimensional Second derivative test

• In multiple dimensions, we need to examine the second derivatives along all dimensions
• Eigendecomposition generalizes the test: examine the eigenvalues of the Hessian to determine whether the critical point is a local maximum, a local minimum, or a saddle point
• When H is positive definite (all eigenvalues are positive) the point is a local minimum
• Similarly, negative definite implies a local maximum

Page 23: Saddle point

Saddle point

• Contains both positive and negative curvature
• The function is f(x) = x1² − x2²
  – Along the x1 axis, the function curves upwards: this axis is an eigenvector of H with a positive eigenvalue
  – Along the x2 axis, the function curves downwards: its direction is an eigenvector of H with a negative eigenvalue
  – At a saddle point the eigenvalues are both positive and negative
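The claim can be verified directly (a small sketch, not part of the slides):

```python
import numpy as np

# Hessian of f(x) = x1^2 - x2^2 is constant: diag(2, -2)
H = np.array([[2.0, 0.0], [0.0, -2.0]])
eigvals, _ = np.linalg.eigh(H)
print(eigvals)    # [-2.  2.]: one negative and one positive eigenvalue -> saddle point at the origin
```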
