
  • Matthieu R Bloch, Thursday, January 30, 2020

    LOGISTIC REGRESSION, GRADIENT DESCENT, NEWTON

  • LOGISTICS

    TAs and office hours:
    Monday: Mehrdad (TSRB 523a) - 2pm-3:15pm
    Tuesday: TJ (VL C449 Cubicle D) - 1:30pm-2:45pm
    Wednesday: Matthieu (TSRB 423) - 12:00pm-1:15pm
    Thursday: Hossein (VL C449 Cubicle B) - 10:45am-12:00pm
    Friday: Brighton (TSRB 523a) - 12pm-1:15pm

    Homework 1: hard deadline Friday January 31, 2020, 11:59pm EST (Wednesday February 5, 2020 for DL)

    Homework 2 posted: due Wednesday February 7, 2020, 11:59pm EST (Wednesday February 14, 2020 for DL)

  • RECAP: GENERATIVE MODELS

    Quadratic Discriminant Analysis: classes distributed according to $\mathcal{N}(\mu_k, \Sigma_k)$

    Covariance matrices are class dependent, but the decision boundary is not linear anymore

    Generative model rarely accurate

    Number of parameters to estimate: $K-1$ class priors, $Kd$ means, $\frac{d(d+1)}{2}$ elements of the covariance matrix

    Works well if $N \gg d$; works poorly if $N \ll d$ without other tricks (dimensionality reduction, structured covariance)

    Biggest concern: “one should solve the [classification] problem directly and never solve a more general problem as an intermediate step [such as modeling p(x|y)].” (Vapnik, 1998)

    Revisit the binary classifier with LDA (numerical sketch below):

    $$\eta_1(x) = \frac{\pi_1 \phi(x; \mu_1, \Sigma)}{\pi_1 \phi(x; \mu_1, \Sigma) + \pi_0 \phi(x; \mu_0, \Sigma)} = \frac{1}{1 + \exp(-(w^\top x + b))}$$

    We do not need to estimate the full joint distribution!
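    A minimal numerical sketch (not from the slides) of the identity above: the LDA posterior computed directly from the two Gaussian class-conditional densities equals the sigmoid of the linear function $w^\top x + b$. The means, covariance, priors, and test point below are made-up illustrative values.

    import numpy as np
    from scipy.stats import multivariate_normal

    # Made-up shared-covariance (LDA) parameters for two classes
    mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
    Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
    pi0, pi1 = 0.6, 0.4
    x = np.array([1.0, 0.5])

    # Posterior eta_1(x) computed directly from the generative model
    num = pi1 * multivariate_normal.pdf(x, mu1, Sigma)
    eta_direct = num / (num + pi0 * multivariate_normal.pdf(x, mu0, Sigma))

    # Same posterior as a sigmoid of the linear function w^T x + b
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu0)
    b = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu0 @ Sinv @ mu0 + np.log(pi1 / pi0)
    eta_sigmoid = 1.0 / (1.0 + np.exp(-(w @ x + b)))

    print(eta_direct, eta_sigmoid)  # the two values agree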


  • LOGISTIC REGRESSION

    Assume that $\eta(x)$ is of the form $\frac{1}{1+\exp(-(w^\top x + b))}$

    Estimate $\hat{w}$ and $\hat{b}$ from the data directly

    Plug in the result to obtain $\hat{\eta}(x) = \frac{1}{1+\exp(-(\hat{w}^\top x + \hat{b}))}$

    The function $x \mapsto \frac{1}{1+e^{-x}}$ is called the logistic function

    The binary logistic classifier is (linear): $h_{LC}(x) = \mathbf{1}\{\hat{\eta}(x) \geq \tfrac{1}{2}\} = \mathbf{1}\{\hat{w}^\top x + \hat{b} \geq 0\}$ (sketch below)

    How do we estimate $\hat{w}$ and $\hat{b}$?

    From the LDA analysis: $\hat{w} = \hat{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_0)$, $\hat{b} = -\tfrac{1}{2}\hat{\mu}_1^\top \hat{\Sigma}^{-1}\hat{\mu}_1 + \tfrac{1}{2}\hat{\mu}_0^\top \hat{\Sigma}^{-1}\hat{\mu}_0 + \log\frac{\hat{\pi}_1}{\hat{\pi}_0}$

    Direct estimation of $(\hat{w}, \hat{b})$ from maximum likelihood
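    A small sketch of the logistic function and the resulting linear classifier $h_{LC}$; the weights below are placeholders standing in for estimates $(\hat{w}, \hat{b})$ obtained by either of the two routes above.

    import numpy as np

    def logistic(z):
        # The logistic function z -> 1 / (1 + e^{-z})
        return 1.0 / (1.0 + np.exp(-z))

    def eta_hat(X, w, b):
        # Plug-in estimate of P(y = 1 | x) for each row of X
        return logistic(X @ w + b)

    def h_lc(X, w, b):
        # Binary logistic classifier: predict 1 iff eta_hat(x) >= 1/2, i.e. w^T x + b >= 0
        return (X @ w + b >= 0).astype(int)

    # Placeholder estimates (would come from LDA or from maximum likelihood)
    w_hat, b_hat = np.array([1.5, -0.7]), 0.2
    X = np.array([[0.3, 0.1], [-1.0, 2.0]])
    print(eta_hat(X, w_hat, b_hat), h_lc(X, w_hat, b_hat))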

  • MLE FOR LOGISTIC REGRESSION

    We have a parametric density model for $p_\theta(y|x) = \eta(x)$

    Standard trick: $\tilde{x} = [1, x^\top]^\top$ and $\theta = [b, w^\top]^\top$; this allows us to lump in the offset and write

    $$\eta(x) = \frac{1}{1 + \exp(-\theta^\top \tilde{x})}$$

    Given our dataset $\{(\tilde{x}_i, y_i)\}_{i=1}^N$, the likelihood is $L(\theta) \triangleq \prod_{i=1}^N P_\theta(y_i | \tilde{x}_i)$

    For $K = 2$ with $\mathcal{Y} = \{0, 1\}$ we obtain (sketch below)

    $$L(\theta) \triangleq \prod_{i=1}^N \eta(\tilde{x}_i)^{y_i} \left(1 - \eta(\tilde{x}_i)\right)^{1-y_i}$$

    $$\ell(\theta) = \sum_{i=1}^N \left( y_i \log \eta(\tilde{x}_i) + (1 - y_i) \log(1 - \eta(\tilde{x}_i)) \right)$$

    $$\ell(\theta) = \sum_{i=1}^N \left( y_i \theta^\top \tilde{x}_i - \log(1 + e^{\theta^\top \tilde{x}_i}) \right)$$
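    A sketch of the last expression for $\ell(\theta)$, assuming the offset has been lumped into $\theta$ via $\tilde{x} = [1, x^\top]^\top$; the helper names and the tiny dataset are mine, and $\log(1 + e^{z})$ is evaluated with np.logaddexp for numerical stability.

    import numpy as np

    def add_offset(X):
        # Lump in the offset: x_tilde = [1, x^T]^T for every sample
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def log_likelihood(theta, X_tilde, y):
        # ell(theta) = sum_i ( y_i * theta^T x_i - log(1 + exp(theta^T x_i)) )
        z = X_tilde @ theta
        return np.sum(y * z - np.logaddexp(0.0, z))

    # Tiny made-up dataset with labels in {0, 1}
    X, y = np.array([[0.5], [1.5], [-1.0]]), np.array([1, 1, 0])
    print(log_likelihood(np.zeros(2), add_offset(X), y))  # equals -3 log 2 at theta = 0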


  • FINDING THE MLE

    A necessary condition for optimality is $\nabla_\theta \ell(\theta) = 0$

    Here this means (sketch below)

    $$\sum_{i=1}^N \tilde{x}_i \left( y_i - \frac{1}{1 + \exp(-\theta^\top \tilde{x}_i)} \right) = 0$$

    This is a system of $d+1$ nonlinear equations!

    Use a numerical algorithm to find the solution of $\operatorname{argmin}_\theta\, -\ell(\theta)$; provable convergence when $-\ell$ is convex

    We will discuss two techniques: gradient descent and Newton's method. There are many more, especially useful in high dimension
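    The gradient that must vanish at the MLE is straightforward to evaluate; a minimal sketch, reusing the hypothetical convention from the previous sketch that the first column of $\tilde{X}$ is all ones.

    import numpy as np

    def gradient(theta, X_tilde, y):
        # grad ell(theta) = sum_i x_i * ( y_i - 1 / (1 + exp(-theta^T x_i)) )
        residual = y - 1.0 / (1.0 + np.exp(-(X_tilde @ theta)))
        return X_tilde.T @ residual

    # Setting this (d+1)-dimensional vector to zero is the nonlinear system
    # that the numerical methods below solve.
    X_tilde = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, -1.0]])
    y = np.array([1, 1, 0])
    print(gradient(np.zeros(2), X_tilde, y))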

  • WRAPPING UP PLUGIN METHODS

    Naive Bayes, LDA, and logistic regression are all plugin methods that result in linear classifiers

    Naive Bayes: plugin method based on density estimation; scales well to high dimensions and naturally handles mixtures of discrete and continuous features

    Linear discriminant analysis: better if the Gaussianity assumptions are valid

    Logistic regression: models only the distribution $P_{y|x}$, not $P_{y,x}$; valid for a larger class of distributions; fewer parameters to estimate

    Plugin methods can be useful in practice, but ultimately they are very limited: there are always distributions where our assumptions are violated; if our assumptions are wrong, the output is totally unpredictable; it can be hard to verify whether our assumptions are right; they require solving a more difficult problem as an intermediate step

  • GRADIENT DESCENT

    Consider the canonical problem $\min_{x \in \mathbb{R}^d} f(x)$ with $f : \mathbb{R}^d \to \mathbb{R}$

    Find the minimum iteratively by “rolling downhill”:
    Start from a point $x^{(0)}$
    $x^{(1)} = x^{(0)} - \eta \nabla f(x)|_{x=x^{(0)}}$; $\eta$ is the step size
    $x^{(2)} = x^{(1)} - \eta \nabla f(x)|_{x=x^{(1)}}$, and so on

    The choice of step size really matters: too small and convergence takes forever, too big and it might never converge

    Many variants of gradient descent (sketch below):
    Momentum: $v_t = \gamma v_{t-1} + \eta \nabla f(x)|_{x=x^{(t)}}$ and $x^{(t+1)} = x^{(t)} - v_t$
    Accelerated: $v_t = \gamma v_{t-1} + \eta \nabla f(x)|_{x=x^{(t)} - \gamma v_{t-1}}$ and $x^{(t+1)} = x^{(t)} - v_t$

    In practice, the gradient has to be evaluated from data
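    A minimal sketch of plain gradient descent and the momentum variant on a toy quadratic, just to make the updates concrete; the step size $\eta = 0.1$, momentum $\gamma = 0.9$, and the objective are illustrative choices, not values from the lecture.

    import numpy as np

    def gradient_descent(grad, x0, eta=0.1, steps=100):
        # x^{(t+1)} = x^{(t)} - eta * grad f(x^{(t)})
        x = x0.copy()
        for _ in range(steps):
            x = x - eta * grad(x)
        return x

    def momentum_descent(grad, x0, eta=0.1, gamma=0.9, steps=100):
        # v_t = gamma * v_{t-1} + eta * grad f(x^{(t)});  x^{(t+1)} = x^{(t)} - v_t
        x, v = x0.copy(), np.zeros_like(x0)
        for _ in range(steps):
            v = gamma * v + eta * grad(x)
            x = x - v
        return x

    # Toy objective f(x) = ||x - [1, 2]||^2, whose gradient is 2 (x - [1, 2])
    grad_f = lambda x: 2.0 * (x - np.array([1.0, 2.0]))
    print(gradient_descent(grad_f, np.zeros(2)), momentum_descent(grad_f, np.zeros(2)))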

  • NEWTON’S METHOD

    The Newton-Raphson method uses the second derivative to automatically adapt the step size:

    $$x^{(j+1)} = x^{(j)} - \left[ \nabla^2 f(x) \right]^{-1} \nabla f(x) \Big|_{x=x^{(j)}}$$

    Hessian matrix:

    $$\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_d} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_d \partial x_1} & \frac{\partial^2 f}{\partial x_d \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_d^2} \end{bmatrix}$$

    Newton's method is much faster when the dimension $d$ is small but impractical when $d$ is large (sketch below)
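    A sketch of the Newton-Raphson update applied to maximizing the logistic regression log-likelihood, where the gradient is $\sum_i \tilde{x}_i (y_i - \eta(\tilde{x}_i))$ and the Hessian of $-\ell$ is $\sum_i \eta(\tilde{x}_i)(1 - \eta(\tilde{x}_i)) \tilde{x}_i \tilde{x}_i^\top$ (this closed-form Hessian is a standard fact, not stated on the slides); the small dataset is made up.

    import numpy as np

    def newton_logistic(X_tilde, y, steps=10):
        # Newton-Raphson for the logistic regression MLE:
        # theta <- theta - [Hessian of -ell]^{-1} * gradient of -ell
        theta = np.zeros(X_tilde.shape[1])
        for _ in range(steps):
            eta = 1.0 / (1.0 + np.exp(-(X_tilde @ theta)))              # eta(x_i) for all i
            grad = X_tilde.T @ (y - eta)                                 # gradient of ell
            hess = X_tilde.T @ (X_tilde * (eta * (1 - eta))[:, None])    # Hessian of -ell
            theta = theta + np.linalg.solve(hess, grad)
        return theta

    # Made-up 1-D dataset (first column of X_tilde is the lumped offset)
    X_tilde = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.0],
                        [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
    y = np.array([0, 0, 1, 0, 1, 1])
    print(newton_logistic(X_tilde, y))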

  • STOCHASTIC GRADIENT DESCENT

    We often have a loss function of the form $\ell(\theta) = \sum_{i=1}^N \ell_i(\theta)$ where $\ell_i(\theta) = f(x_i, y_i, \theta)$

    The gradient is $\nabla_\theta \ell(\theta) = \sum_{i=1}^N \nabla \ell_i(\theta)$ and the gradient descent update is

    $$\theta^{(j+1)} = \theta^{(j)} - \eta \sum_{i=1}^N \nabla \ell_i(\theta)$$

    This is problematic if the dataset is huge or if not all the data is available; use an iterative technique instead (sketch below):

    $$\theta^{(j+1)} = \theta^{(j)} - \eta \nabla \ell_i(\theta)$$

    There are tons of variations on this principle: batch, minibatch, Adagrad, RMSprop, Adam, etc.
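    A sketch of the stochastic update above, taking $\ell_i$ to be the per-sample negative log-likelihood of logistic regression; sampling one random index per step is one common instantiation of the idea, and the step size and dataset below are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def sgd_logistic(X_tilde, y, step=0.1, iters=1000):
        # Each update uses the gradient of a single ell_i instead of the full sum
        theta = np.zeros(X_tilde.shape[1])
        for _ in range(iters):
            i = rng.integers(len(y))                        # pick one sample at random
            p_i = 1.0 / (1.0 + np.exp(-(X_tilde[i] @ theta)))
            grad_i = -(y[i] - p_i) * X_tilde[i]             # gradient of -ell_i(theta)
            theta = theta - step * grad_i
        return theta

    X_tilde = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.0],
                        [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
    y = np.array([0, 0, 1, 0, 1, 1])
    print(sgd_logistic(X_tilde, y))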
