
  • Matthieu R Bloch, Thursday, January 30, 2020

    LOGISTIC REGRESSION, GRADIENT DESCENT, NEWTON

  • LOGISTICS

    TAs and office hours:
    Monday: Mehrdad (TSRB 523a) - 2pm-3:15pm
    Tuesday: TJ (VL C449 Cubicle D) - 1:30pm-2:45pm
    Wednesday: Matthieu (TSRB 423) - 12:00pm-1:15pm
    Thursday: Hossein (VL C449 Cubicle B) - 10:45am-12:00pm
    Friday: Brighton (TSRB 523a) - 12pm-1:15pm

    Homework 1: hard deadline Friday January 31, 2020, 11:59pm EST (Wednesday February 5, 2020 for DL)

    Homework 2 posted: due Wednesday February 7, 2020, 11:59pm EST (Wednesday February 14, 2020 for DL)

  • RECAP: GENERATIVE MODELS

    Quadratic Discriminant Analysis: classes distributed according to $\mathcal{N}(\mu_k, \Sigma_k)$

    Covariance matrices are class dependent, but the decision boundary is not linear anymore

    Generative model rarely accurate

    Number of parameters to estimate: $K-1$ class priors, $Kd$ means, $\frac{d(d+1)}{2}$ elements of the covariance matrix

    Works well if $N \gg d$; works poorly if $N \ll d$ without other tricks (dimensionality reduction, structured covariance)

    Biggest concern: “one should solve the [classification] problem directly and never solve a more general problem as an intermediate step [such as modeling p(x|y)].” (Vapnik, 1998)

    Revisit the binary classifier with LDA (numerical sketch below):

    $$\eta_1(x) = \frac{\pi_1 \phi(x; \mu_1, \Sigma)}{\pi_1 \phi(x; \mu_1, \Sigma) + \pi_0 \phi(x; \mu_0, \Sigma)} = \frac{1}{1 + \exp(-(w^\top x + b))}$$

    We do not need to estimate the full joint distribution!
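    A minimal numerical sketch (not from the slides) of the identity above: the LDA posterior computed directly from the two Gaussian class-conditional densities equals the sigmoid of the linear function $w^\top x + b$. The means, covariance, priors, and test point below are made-up illustrative values.

    import numpy as np
    from scipy.stats import multivariate_normal

    # Made-up shared-covariance (LDA) parameters for two classes
    mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
    Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
    pi0, pi1 = 0.6, 0.4
    x = np.array([1.0, 0.5])

    # Posterior eta_1(x) computed directly from the generative model
    num = pi1 * multivariate_normal.pdf(x, mu1, Sigma)
    eta_direct = num / (num + pi0 * multivariate_normal.pdf(x, mu0, Sigma))

    # Same posterior as a sigmoid of the linear function w^T x + b
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu0)
    b = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu0 @ Sinv @ mu0 + np.log(pi1 / pi0)
    eta_sigmoid = 1.0 / (1.0 + np.exp(-(w @ x + b)))

    print(eta_direct, eta_sigmoid)  # the two values agree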


  • LOGISTIC REGRESSION

    Assume that $\eta(x)$ is of the form $\frac{1}{1+\exp(-(w^\top x + b))}$

    Estimate $\hat{w}$ and $\hat{b}$ from the data directly

    Plug in the result to obtain $\hat{\eta}(x) = \frac{1}{1+\exp(-(\hat{w}^\top x + \hat{b}))}$

    The function $x \mapsto \frac{1}{1+e^{-x}}$ is called the logistic function

    The binary logistic classifier is (linear): $h_{LC}(x) = \mathbf{1}\{\hat{\eta}(x) \geq \tfrac{1}{2}\} = \mathbf{1}\{\hat{w}^\top x + \hat{b} \geq 0\}$ (sketch below)

    How do we estimate $\hat{w}$ and $\hat{b}$?

    From the LDA analysis: $\hat{w} = \hat{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_0)$, $\hat{b} = -\tfrac{1}{2}\hat{\mu}_1^\top \hat{\Sigma}^{-1}\hat{\mu}_1 + \tfrac{1}{2}\hat{\mu}_0^\top \hat{\Sigma}^{-1}\hat{\mu}_0 + \log\frac{\hat{\pi}_1}{\hat{\pi}_0}$

    Direct estimation of $(\hat{w}, \hat{b})$ from maximum likelihood
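    A small sketch of the logistic function and the resulting linear classifier $h_{LC}$; the weights below are placeholders standing in for estimates $(\hat{w}, \hat{b})$ obtained by either of the two routes above.

    import numpy as np

    def logistic(z):
        # The logistic function z -> 1 / (1 + e^{-z})
        return 1.0 / (1.0 + np.exp(-z))

    def eta_hat(X, w, b):
        # Plug-in estimate of P(y = 1 | x) for each row of X
        return logistic(X @ w + b)

    def h_lc(X, w, b):
        # Binary logistic classifier: predict 1 iff eta_hat(x) >= 1/2, i.e. w^T x + b >= 0
        return (X @ w + b >= 0).astype(int)

    # Placeholder estimates (would come from LDA or from maximum likelihood)
    w_hat, b_hat = np.array([1.5, -0.7]), 0.2
    X = np.array([[0.3, 0.1], [-1.0, 2.0]])
    print(eta_hat(X, w_hat, b_hat), h_lc(X, w_hat, b_hat))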

  • MLE FOR LOGISTIC REGRESSION

    We have a parametric density model for $p_\theta(y|x) = \eta(x)$

    Standard trick: $\tilde{x} = [1, x^\top]^\top$ and $\theta = [b, w^\top]^\top$; this allows us to lump in the offset and write

    $$\eta(x) = \frac{1}{1 + \exp(-\theta^\top \tilde{x})}$$

    Given our dataset $\{(\tilde{x}_i, y_i)\}_{i=1}^N$, the likelihood is $L(\theta) \triangleq \prod_{i=1}^N P_\theta(y_i | \tilde{x}_i)$

    For $K = 2$ with $\mathcal{Y} = \{0, 1\}$ we obtain (sketch below)

    $$L(\theta) \triangleq \prod_{i=1}^N \eta(\tilde{x}_i)^{y_i} \left(1 - \eta(\tilde{x}_i)\right)^{1-y_i}$$

    $$\ell(\theta) = \sum_{i=1}^N \left( y_i \log \eta(\tilde{x}_i) + (1 - y_i) \log(1 - \eta(\tilde{x}_i)) \right)$$

    $$\ell(\theta) = \sum_{i=1}^N \left( y_i \theta^\top \tilde{x}_i - \log(1 + e^{\theta^\top \tilde{x}_i}) \right)$$
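    A sketch of the last expression for $\ell(\theta)$, assuming the offset has been lumped into $\theta$ via $\tilde{x} = [1, x^\top]^\top$; the helper names and the tiny dataset are mine, and $\log(1 + e^{z})$ is evaluated with np.logaddexp for numerical stability.

    import numpy as np

    def add_offset(X):
        # Lump in the offset: x_tilde = [1, x^T]^T for every sample
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def log_likelihood(theta, X_tilde, y):
        # ell(theta) = sum_i ( y_i * theta^T x_i - log(1 + exp(theta^T x_i)) )
        z = X_tilde @ theta
        return np.sum(y * z - np.logaddexp(0.0, z))

    # Tiny made-up dataset with labels in {0, 1}
    X, y = np.array([[0.5], [1.5], [-1.0]]), np.array([1, 1, 0])
    print(log_likelihood(np.zeros(2), add_offset(X), y))  # equals -3 log 2 at theta = 0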


  • FINDING THE MLE

    A necessary condition for optimality is $\nabla_\theta \ell(\theta) = 0$

    Here this means (sketch below)

    $$\sum_{i=1}^N \tilde{x}_i \left( y_i - \frac{1}{1 + \exp(-\theta^\top \tilde{x}_i)} \right) = 0$$

    This is a system of $d+1$ nonlinear equations!

    Use a numerical algorithm to find the solution of $\operatorname{argmin}_\theta\, -\ell(\theta)$; provable convergence when $-\ell$ is convex

    We will discuss two techniques: gradient descent and Newton's method. There are many more, especially useful in high dimension
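    The gradient that must vanish at the MLE is straightforward to evaluate; a minimal sketch, reusing the hypothetical convention from the previous sketch that the first column of $\tilde{X}$ is all ones.

    import numpy as np

    def gradient(theta, X_tilde, y):
        # grad ell(theta) = sum_i x_i * ( y_i - 1 / (1 + exp(-theta^T x_i)) )
        residual = y - 1.0 / (1.0 + np.exp(-(X_tilde @ theta)))
        return X_tilde.T @ residual

    # Setting this (d+1)-dimensional vector to zero is the nonlinear system
    # that the numerical methods below solve.
    X_tilde = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, -1.0]])
    y = np.array([1, 1, 0])
    print(gradient(np.zeros(2), X_tilde, y))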

  • WRAPPING UP PLUGIN METHODS

    Naive Bayes, LDA, and logistic regression are all plugin methods that result in linear classifiers

    Naive Bayes: plugin method based on density estimation; scales well to high dimensions and naturally handles mixtures of discrete and continuous features

    Linear discriminant analysis: better if the Gaussianity assumptions are valid

    Logistic regression: models only the distribution $P_{y|x}$, not $P_{y,x}$; valid for a larger class of distributions; fewer parameters to estimate

    Plugin methods can be useful in practice, but ultimately they are very limited: there are always distributions where our assumptions are violated; if our assumptions are wrong, the output is totally unpredictable; it can be hard to verify whether our assumptions are right; they require solving a more difficult problem as an intermediate step

  • GRADIENT DESCENT

    Consider the canonical problem $\min_{x \in \mathbb{R}^d} f(x)$ with $f : \mathbb{R}^d \to \mathbb{R}$

    Find the minimum iteratively by “rolling downhill”:
    Start from a point $x^{(0)}$
    $x^{(1)} = x^{(0)} - \eta \nabla f(x)|_{x=x^{(0)}}$; $\eta$ is the step size
    $x^{(2)} = x^{(1)} - \eta \nabla f(x)|_{x=x^{(1)}}$, and so on

    The choice of step size really matters: too small and convergence takes forever, too big and it might never converge

    Many variants of gradient descent (sketch below):
    Momentum: $v_t = \gamma v_{t-1} + \eta \nabla f(x)|_{x=x^{(t)}}$ and $x^{(t+1)} = x^{(t)} - v_t$
    Accelerated: $v_t = \gamma v_{t-1} + \eta \nabla f(x)|_{x=x^{(t)} - \gamma v_{t-1}}$ and $x^{(t+1)} = x^{(t)} - v_t$

    In practice, the gradient has to be evaluated from data
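    A minimal sketch of plain gradient descent and the momentum variant on a toy quadratic, just to make the updates concrete; the step size $\eta = 0.1$, momentum $\gamma = 0.9$, and the objective are illustrative choices, not values from the lecture.

    import numpy as np

    def gradient_descent(grad, x0, eta=0.1, steps=100):
        # x^{(t+1)} = x^{(t)} - eta * grad f(x^{(t)})
        x = x0.copy()
        for _ in range(steps):
            x = x - eta * grad(x)
        return x

    def momentum_descent(grad, x0, eta=0.1, gamma=0.9, steps=100):
        # v_t = gamma * v_{t-1} + eta * grad f(x^{(t)});  x^{(t+1)} = x^{(t)} - v_t
        x, v = x0.copy(), np.zeros_like(x0)
        for _ in range(steps):
            v = gamma * v + eta * grad(x)
            x = x - v
        return x

    # Toy objective f(x) = ||x - [1, 2]||^2, whose gradient is 2 (x - [1, 2])
    grad_f = lambda x: 2.0 * (x - np.array([1.0, 2.0]))
    print(gradient_descent(grad_f, np.zeros(2)), momentum_descent(grad_f, np.zeros(2)))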

  • NEWTON’S METHOD

    The Newton-Raphson method uses the second derivative to automatically adapt the step size:

    $$x^{(j+1)} = x^{(j)} - \left[ \nabla^2 f(x) \right]^{-1} \nabla f(x) \Big|_{x=x^{(j)}}$$

    Hessian matrix:

    $$\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_d} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_d \partial x_1} & \frac{\partial^2 f}{\partial x_d \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_d^2} \end{bmatrix}$$

    Newton's method is much faster when the dimension $d$ is small but impractical when $d$ is large (sketch below)
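    A sketch of the Newton-Raphson update applied to maximizing the logistic regression log-likelihood, where the gradient is $\sum_i \tilde{x}_i (y_i - \eta(\tilde{x}_i))$ and the Hessian of $-\ell$ is $\sum_i \eta(\tilde{x}_i)(1 - \eta(\tilde{x}_i)) \tilde{x}_i \tilde{x}_i^\top$ (this closed-form Hessian is a standard fact, not stated on the slides); the small dataset is made up.

    import numpy as np

    def newton_logistic(X_tilde, y, steps=10):
        # Newton-Raphson for the logistic regression MLE:
        # theta <- theta - [Hessian of -ell]^{-1} * gradient of -ell
        theta = np.zeros(X_tilde.shape[1])
        for _ in range(steps):
            eta = 1.0 / (1.0 + np.exp(-(X_tilde @ theta)))              # eta(x_i) for all i
            grad = X_tilde.T @ (y - eta)                                 # gradient of ell
            hess = X_tilde.T @ (X_tilde * (eta * (1 - eta))[:, None])    # Hessian of -ell
            theta = theta + np.linalg.solve(hess, grad)
        return theta

    # Made-up 1-D dataset (first column of X_tilde is the lumped offset)
    X_tilde = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.0],
                        [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
    y = np.array([0, 0, 1, 0, 1, 1])
    print(newton_logistic(X_tilde, y))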

  • STOCHASTIC GRADIENT DESCENT

    We often have a loss function of the form $\ell(\theta) = \sum_{i=1}^N \ell_i(\theta)$ where $\ell_i(\theta) = f(x_i, y_i, \theta)$

    The gradient is $\nabla_\theta \ell(\theta) = \sum_{i=1}^N \nabla \ell_i(\theta)$ and the gradient descent update is

    $$\theta^{(j+1)} = \theta^{(j)} - \eta \sum_{i=1}^N \nabla \ell_i(\theta)$$

    This is problematic if the dataset is huge or if not all the data is available; use an iterative technique instead (sketch below):

    $$\theta^{(j+1)} = \theta^{(j)} - \eta \nabla \ell_i(\theta)$$

    There are tons of variations on this principle: batch, minibatch, Adagrad, RMSprop, Adam, etc.
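    A sketch of the stochastic update above, taking $\ell_i$ to be the per-sample negative log-likelihood of logistic regression; sampling one random index per step is one common instantiation of the idea, and the step size and dataset below are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def sgd_logistic(X_tilde, y, step=0.1, iters=1000):
        # Each update uses the gradient of a single ell_i instead of the full sum
        theta = np.zeros(X_tilde.shape[1])
        for _ in range(iters):
            i = rng.integers(len(y))                        # pick one sample at random
            p_i = 1.0 / (1.0 + np.exp(-(X_tilde[i] @ theta)))
            grad_i = -(y[i] - p_i) * X_tilde[i]             # gradient of -ell_i(theta)
            theta = theta - step * grad_i
        return theta

    X_tilde = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.0],
                        [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
    y = np.array([0, 0, 1, 0, 1, 1])
    print(sgd_logistic(X_tilde, y))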
