
Lecture 5: LDA and Logistic Regression

Hao Helen Zhang

September 22, 2020


Outline

Two Popular Linear Models for Classification

Linear Discriminant Analysis (LDA)
Logistic Regression Models

Take-home message:

Both LDA and logistic regression rely on the linear-odds assumption, directly or indirectly. However, they estimate the coefficients in different manners.


Linear Classifier

Linear methods: the decision boundary is linear. Common linear classification methods:

Linear regression methods (covered in Lecture 3)

Linear log-odds (logit) models
- Linear logistic models
- Linear discriminant analysis (LDA)

Separating hyperplanes (introduced later)
- Perceptron model (Rosenblatt 1958)
- Optimal separating hyperplane (Vapnik 1996): SVMs

From now on, we assume equal costs (by default).


Odds, Logit, and Linear Odds Models

Some terminology

The ratio Pr(Y = 1|X = x) / Pr(Y = 0|X = x) is called the odds.

Its logarithm, log[Pr(Y = 1|X = x) / Pr(Y = 0|X = x)], is called the log-odds, or the logit.

Linear odds models assume: the logit is linear in x, i.e.,

log[Pr(Y = 1|X = x) / Pr(Y = 0|X = x)] = β0 + β1^T x.

Examples: LDA, Logistic regression


Classifier Based on Linear Odds Models

From the linear odds, we can obtain posterior class probabilities

Pr(Y = 1|x) = exp(β0 + β1^T x) / [1 + exp(β0 + β1^T x)],

Pr(Y = 0|x) = 1 / [1 + exp(β0 + β1^T x)].

Assuming equal costs, the decision boundary is given by

{x : Pr(Y = 1|x) = 0.5} = {x : β0 + β1^T x = 0},

which can be interpreted as “zero log-odds.”
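
To make the rule concrete, here is a minimal R sketch (not from the original slides; beta0 and beta1 are assumed to be given rather than estimated):

# Posterior probability and zero log-odds classifier for a given linear logit
posterior1 <- function(x, beta0, beta1) {
  eta <- beta0 + sum(beta1 * x)      # the logit: beta0 + beta1^T x
  exp(eta) / (1 + exp(eta))          # Pr(Y = 1 | x)
}
classify1 <- function(x, beta0, beta1) {
  # classify to "1" iff the log-odds is positive, i.e. Pr(Y = 1 | x) > 0.5
  as.numeric(beta0 + sum(beta1 * x) > 0)
}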


[Figure: left panel, the logit function g(s) = log[s/(1 − s)] plotted over s ∈ (0, 1); right panel, the logistic curve p(t) = exp(t)/(1 + exp(t)) plotted over t ∈ (−4, 4).]
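
A short R sketch (not from the original slides) that reproduces the two panels:

# Left panel: logit function over s in (0, 1)
s <- seq(0.01, 0.99, by = 0.01)
plot(s, log(s / (1 - s)), type = "l", xlab = "s", ylab = "g(s)",
     main = "logit function log[s/(1-s)]")
# Right panel: logistic curve over t in (-4, 4)
t <- seq(-4, 4, by = 0.05)
plot(t, exp(t) / (1 + exp(t)), type = "l", xlab = "t", ylab = "p(t)",
     main = "logistic curve exp(t)/(1+exp(t))")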


Linear Discriminant Analysis (LDA)

LDA assumes:

Each class density is multivariate Gaussian, i.e.,

X | Y = j ∼ N(µj, Σj), j = 0, 1.

Equal covariance assumption

Σj = Σ, j = 0, 1.

In other words,

both classes are Gaussian and they share the same covariance matrix.


Linear Discriminant Function

Under the mixture Gaussian assumption, the log-odds is

log[Pr(Y = 1|X = x) / Pr(Y = 0|X = x)]
= log(π1/π0) − (1/2)(µ1 + µ0)^T Σ^{-1}(µ1 − µ0) + x^T Σ^{-1}(µ1 − µ0).

Under equal costs, LDA classifies to “1” if and only if

[ log(π1/π0) − (1/2)(µ1 + µ0)^T Σ^{-1}(µ1 − µ0) ] + x^T Σ^{-1}(µ1 − µ0) > 0.

It has a linear boundary {x : β0 + x^T β1 = 0}, with

β0 = log(π1/π0) − (1/2)(µ1 + µ0)^T Σ^{-1}(µ1 − µ0),
β1 = Σ^{-1}(µ1 − µ0).
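
These formulas translate directly into code. A small R sketch (an illustration, with the class parameters assumed known):

# LDA boundary coefficients beta0, beta1 from class priors, means, covariance
lda_coef <- function(pi0, pi1, mu0, mu1, Sigma) {
  Sinv  <- solve(Sigma)
  beta1 <- Sinv %*% (mu1 - mu0)
  beta0 <- log(pi1 / pi0) - 0.5 * t(mu1 + mu0) %*% Sinv %*% (mu1 - mu0)
  list(beta0 = as.numeric(beta0), beta1 = as.numeric(beta1))
}
# classify x to "1" iff beta0 + x^T beta1 > 0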


Parameter Estimation in LDA

In practice, π1, π0, µ0, µ1, Σ are unknown.

We estimate the parameters from the training data, using MLE or the moment estimators:

π̂j = nj/n, where nj is the size of class j.

µ̂j = Σ_{Yi=j} xi / nj, for j = 0, 1.

The sample covariance matrix for class j is

Sj = [1/(nj − 1)] Σ_{Yi=j} (xi − µ̂j)(xi − µ̂j)^T.

The (unbiased) pooled sample covariance is a weighted average:

Σ̂ = [(n0 − 1) S0 + (n1 − 1) S1] / [(n0 − 1) + (n1 − 1)]
   = Σ_{j=0}^{1} Σ_{Yi=j} (xi − µ̂j)(xi − µ̂j)^T / (n − 2).
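
A hand-rolled R sketch of these estimators (the names x, an n x d matrix, and y, a 0/1 vector, are assumptions for illustration):

# Moment/MLE estimates for two-class LDA
n  <- nrow(x); n0 <- sum(y == 0); n1 <- sum(y == 1)
pi_hat <- c(n0, n1) / n                                 # class proportions
mu0 <- colMeans(x[y == 0, , drop = FALSE])              # class means
mu1 <- colMeans(x[y == 1, , drop = FALSE])
S0  <- cov(x[y == 0, , drop = FALSE])                   # within-class covariances
S1  <- cov(x[y == 1, , drop = FALSE])
Sigma_hat <- ((n0 - 1) * S0 + (n1 - 1) * S1) / (n - 2)  # pooled covariance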


R code for LDA Fitting (I)

There are two ways to call the function “lda”. The first way is to use a formula and an optional data frame.

library(MASS)

lda(formula, data, subset)

Arguments:

formula: of the form “groups ∼ x1 + x2 + . . .”, where the response is the grouping factor and the right-hand side specifies the (non-factor) discriminators.

data: the data frame from which the variables specified in the formula are taken.

subset: an index vector specifying the cases to be used in the training sample.

Output:

an object of class “lda” with multiple components


R code for LDA Fitting (II)

The second way is to use a matrix and a grouping factor as the first two arguments.

library(MASS)

lda(x, grouping, prior = proportions, CV = FALSE)

Arguments:

x: a matrix or data frame or Matrix containing predictors.

grouping: a factor specifying the class for each observation.

prior: the prior probabilities of class membership. If unspecified, the class proportions of the training set are used.

Output:

If CV = TRUE, the return value is a list with components “class” (the MAP classification, a factor) and “posterior” (posterior probabilities for the classes).


R code for LDA Prediction

We use the “predict” or “predict.lda” function to classify multivariate observations with a fitted “lda” object.

predict(object, newdata, ...)

Arguments:

object: object of class “lda”

newdata: data frame of cases to be classified or, if “object” has a formula, a data frame with columns of the same names as the variables used.

Output:

a list with the components “class” (the MAP classification, a factor) and “posterior” (posterior probabilities for the classes)


Fisher’s Iris Data (a Three-Class Classification Problem)

Fisher (1936), “The use of multiple measurements in taxonomic problems”.

Three species: Iris setosa, Iris versicolor, Iris virginica.

Four features: the length and the width of the sepals and petals, in centimeters.

50 samples from each species.

In the following analysis, we randomly select 50% of the data points as the training set, and the rest as the test set.


Illustration 1

# Build a 150 x 5 data frame from the iris3 array, with species label Sp
Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
                   Sp = rep(c("s", "c", "v"), rep(50, 3)))

# Randomly select 75 of the 150 observations as the training set
train <- sample(1:150, 75)
table(Iris$Sp[train])

# Fit LDA on the training subset, with equal priors
z <- lda(Sp ~ ., Iris, prior = c(1, 1, 1)/3, subset = train)


Training Error and Test Error

#training error rate

ytrain <- predict(z, Iris[train, ])$class

table(ytrain, Iris$Sp[train])

train_err <- mean(ytrain!=Iris$Sp[train])

#test error rate

ytest <- predict(z, Iris[-train, ])$class

table(ytest, Iris$Sp[-train])

test_err <- mean(ytest!=Iris$Sp[-train])


Illustration 2

# 25 random flowers from each species form the training set
tr <- sample(1:50, 25)
train <- rbind(iris3[tr,,1], iris3[tr,,2], iris3[tr,,3])
test  <- rbind(iris3[-tr,,1], iris3[-tr,,2], iris3[-tr,,3])
cl <- factor(c(rep("s", 25), rep("c", 25), rep("v", 25)))

# Matrix-and-grouping-factor interface of lda()
z <- lda(train, cl)
ytest <- predict(z, test)$class


Logistic Regression

Model assumption: the log-odds is linear in x,

log[Pr(Y = 1|X = x) / Pr(Y = 0|X = x)] = β0 + β1^T x.

Define

p(x; β) ≡ exp(β0 + β1^T x) / [1 + exp(β0 + β1^T x)].

Write β = (β0, β1). The posterior class probabilities can be calculated as

p1(x) = Pr(Y = 1|x) = p(x; β),

p0(x) = Pr(Y = 0|x) = 1 − p(x; β).

The classification boundary is given by {x : β0 + β1^T x = 0}.


Model Interpretation

Denote µ = E (Y |X) = P(Y = 1|X)

By assuming

g(µ) = log[µ/(1 − µ)] = β0 + β1^T X,

the logit g connects µ with the linear predictor β0 + β1^T X.

We call g the link function.

Var(Y |X) = µ(1− µ).


Interpretation of βj

In logistic regression,

odds(. . . , Xj = x + 1, . . .) / odds(. . . , Xj = x, . . .) = e^{βj}.

If Xj is binary (0 or 1), then the odds for the group with Xj = 1 are e^{βj} times those for the group with Xj = 0, with the other parameters fixed.

For rare diseases: when incidence is rare, Pr(Y = 0) ≈ 1, so odds ≈ Pr(Y = 1) and

e^{βj} = odds ratio ≈ Pr(. . . , Xj = x + 1, . . .) / Pr(. . . , Xj = x, . . .).

In a cancer study, a log-OR of 5 for smoking would mean that smokers are e^5 ≈ 150 times more likely to develop the cancer.
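
A quick simulated check of this interpretation (a sketch; the data and variable names are hypothetical):

# With a binary predictor, exp(coef) estimates the odds ratio between groups
set.seed(1)
xj  <- rbinom(5000, 1, 0.5)                    # e.g. a smoking indicator
y   <- rbinom(5000, 1, plogis(-2 + 1.5 * xj))  # true log-OR = 1.5
fit <- glm(y ~ xj, family = binomial)
exp(coef(fit)["xj"])                           # close to exp(1.5), about 4.5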


Maximum Likelihood Estimate (MLE) for Logistic Models

The joint conditional log-likelihood of the yi given the xi is

l(β) = Σ_{i=1}^n log p_{yi}(xi; β),

where p_y(x; β) = p(x; β)^y [1 − p(x; β)]^{1−y}.

In detail,

l(β) = Σ_{i=1}^n { yi log p(xi; β) + (1 − yi) log[1 − p(xi; β)] }
     = Σ_{i=1}^n { yi(β0 + β1^T xi) − log[1 + exp(β0 + β1^T xi)] }.


Score Equations

For simplicity, assume each xi has 1 as its first component, so the intercept is absorbed into β. Setting the gradient of l(β) to zero gives the score equations

∂l(β)/∂β = Σ_{i=1}^n xi [yi − p(xi; β)] = 0.

These are (d + 1) nonlinear equations in β. The first equation states

Σ_{i=1}^n yi = Σ_{i=1}^n p(xi; β),

i.e., the expected number of 1’s equals the observed number in the sample.

Notation for the Newton-Raphson update below:

y = [y1, · · · , yn]^T,
p = [p(x1; β^old), · · · , p(xn; β^old)]^T,
W = diag{ p(xi; β^old)(1 − p(xi; β^old)) }.


Newton-Raphson Algorithm

The second-derivative (Hessian) matrix is

∂²l(β)/∂β∂β^T = − Σ_{i=1}^n xi xi^T p(xi; β)[1 − p(xi; β)].

1. Choose an initial value β^(0).

2. Update β by

β^new = β^old − [∂²l(β)/∂β∂β^T]^{-1} ∂l(β)/∂β,

where both derivatives are evaluated at β^old.

Using matrix notation, we have

∂l(β)/∂β = X^T(y − p),

∂²l(β)/∂β∂β^T = −X^T W X.


Iteratively Re-weighted Least Squares (IRLS)

The Newton-Raphson step becomes

β^new = β^old + (X^T W X)^{-1} X^T (y − p)
      = (X^T W X)^{-1} X^T W (X β^old + W^{-1}(y − p))
      = (X^T W X)^{-1} X^T W z,

where we define the adjusted response

z = X β^old + W^{-1}(y − p).

Repeatedly solve the weighted least squares problem until convergence:

β^new = argmin_β (z − Xβ)^T W (z − Xβ).

The weight matrix W, the adjusted response z, and p all change from one iteration to the next.

The algorithm generalizes to the case of K ≥ 3 classes.
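
A minimal IRLS sketch in R (an illustration, assuming X already contains an intercept column and y is a 0/1 vector):

irls_logistic <- function(X, y, tol = 1e-8, maxit = 25) {
  beta <- rep(0, ncol(X))                # beta = 0 as the starting value
  for (it in 1:maxit) {
    p <- as.vector(plogis(X %*% beta))   # p(x_i; beta)
    w <- p * (1 - p)                     # diagonal of W
    z <- X %*% beta + (y - p) / w        # adjusted response
    beta_new <- solve(t(X) %*% (w * X), t(X) %*% (w * z))  # weighted LS
    if (max(abs(beta_new - beta)) < tol) break
    beta <- beta_new
  }
  drop(beta_new)
}
# Should agree with coef(glm(y ~ X - 1, family = binomial))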


Quadratic Approximations

β̂ satisfies a self-consistency relationship: it solves a weighted least squares fit with response

zi = xi^T β̂ + (yi − p̂i) / [p̂i(1 − p̂i)]

and weights wi = p̂i(1 − p̂i).

β = 0 seems to be a good starting value

Typically the algorithm converges, but it is never guaranteed.

The weighted residual sum of squares is the Pearson chi-square statistic

Σ_{i=1}^n (yi − p̂i)² / [p̂i(1 − p̂i)],

a quadratic approximation to the deviance.
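
This statistic is easy to compute from a fitted model; a small sketch (the mtcars example is an illustration, not from the slides):

fit  <- glm(am ~ wt, data = mtcars, family = binomial)  # any binary response
phat <- fitted(fit)
sum((mtcars$am - phat)^2 / (phat * (1 - phat)))  # Pearson chi-square statistic
sum(residuals(fit, type = "pearson")^2)          # same value, via residuals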


Statistical Inference

Using the weighted least squares formulation, we have:

Asymptotic likelihood theory says: if the model is correct, then β̂ is consistent.

By the central limit theorem, the distribution of β̂ converges to N(β, (X^T W X)^{-1}).

Model building is costly because each fit requires iteration; popular shortcuts:

For inclusion of a term, use the Rao score test.
For exclusion of a term, use the Wald test.

Neither test requires iterative fitting; both are based on the maximum likelihood fit of the current model.


R Code for Logistic Regression

logist <- glm(formula, family, data, subset, ...)

predict(logist)

Arguments:

formula: an object of class “formula”

family: a description of the error distribution and link function to be used in the model.

data: an optional data frame, list or environment containing the variables in the model.

subset: an optional vector specifying a subset of observations to be used in the fitting process.

Output:

returns an object of class inheriting from “glm” and “lm”


R Code for Logistic Prediction

logist <- glm(y~x, data, family=binomial(link="logit"))

predict(logist, newdata)

summary(logist)

anova(logist)

The function “predict” gives the predicted values; newdata is a data frame of new cases.

The function “summary” is used to obtain or print a summary of the results.

The function “anova” produces an analysis of variance table.
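
A self-contained worked example (simulated data; the variable names are assumptions for illustration):

set.seed(574)
x1 <- rnorm(100); x2 <- rnorm(100)
y  <- rbinom(100, 1, plogis(0.5 + x1 - x2))    # true logit is linear
dat <- data.frame(y, x1, x2)
logist <- glm(y ~ x1 + x2, data = dat, family = binomial(link = "logit"))
summary(logist)                                # coefficients and tests
phat <- predict(logist, newdata = dat, type = "response")  # Pr(Y = 1 | x)
yhat <- as.numeric(phat > 0.5)                 # classify at the 0.5 cutoff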


Relationship between LDA and Least Squares (LS)

For two-class problems, both LDA and least squares fit a linear boundary β0 + β^T x. Their solutions have the following relationship:

The least squares regression coefficient β̂ is proportional to the LDA direction, i.e.,

β̂ ∝ Σ̂^{-1}(µ̂1 − µ̂0),

where Σ̂ is the pooled sample covariance matrix, and µ̂k is the sample mean of points from class k, k = 0, 1. In other words, their slope coefficients are identical up to a scalar multiple (Exercise 4.2 in the textbook).

The LS intercept β̂0 is generally different from that of LDA, unless n1 = n0.

In general, they have different decision rules (unless n1 = n0).
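
The proportionality is easy to verify numerically; a sketch under assumed simulated data:

library(MASS)
set.seed(1)
n0 <- 60; n1 <- 40
x <- rbind(mvrnorm(n0, c(0, 0), diag(2)), mvrnorm(n1, c(1, 1), diag(2)))
y <- c(rep(0, n0), rep(1, n1))
b_ls <- coef(lm(y ~ x))[-1]            # least squares slope coefficients
mu0 <- colMeans(x[y == 0, ]); mu1 <- colMeans(x[y == 1, ])
Sig <- ((n0 - 1) * cov(x[y == 0, ]) +
        (n1 - 1) * cov(x[y == 1, ])) / (n0 + n1 - 2)  # pooled covariance
b_lda <- solve(Sig, mu1 - mu0)         # LDA direction
b_ls / b_lda                           # the two ratios coincide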


Common Feature of LDA and Logistic Regression Models

For both LDA and logistic regression, the logit has a linear form

log[Pr(Y = 1|x) / Pr(Y = 0|x)] = β0 + β1^T x.

Or equivalently, for both estimators, the posterior class probability can be expressed in the form

Pr(Y = 1|x) = exp(β0 + β1^T x) / [1 + exp(β0 + β1^T x)].

They have exactly the same form. Are they the same estimator?


Major Differences between LDA and Logistic Regression

Main differences between the two estimators include:

Where does the linear logit come from?

What assumptions are made on the data distribution? (difference in data distribution)

How are the linear coefficients estimated? (difference in parameter estimation)

Is any assumption made on the marginal density of X? (flexibility of the model)


Where Does the Linear Logit Come From?

For LDA, the linear logit is a consequence of the equal-covariance Gaussian assumption on the data:

log[Pr(Y = 1|x) / Pr(Y = 0|x)]
= log(π1/π0) − (1/2)(µ1 + µ0)^T Σ^{-1}(µ1 − µ0) + x^T Σ^{-1}(µ1 − µ0)
= β0 + β1^T x.

For the logistic model, the linear logit holds by construction:

log[Pr(Y = 1|x) / Pr(Y = 0|x)] = β0 + β1^T x.


Difference in Marginal Density Assumption

The assumptions on Pr(X):

The logistic model leaves the marginal density of X arbitrary and unspecified.

The LDA model assumes a Gaussian mixture density

Pr(X) = Σ_{j=0}^{1} πj φ(X; µj, Σ).

Conclusion: the logistic model makes fewer assumptions about the data, and hence is more general.


Difference in Parameter Estimation

Logistic regression

Maximizes the conditional likelihood, i.e., the multinomial likelihood with probabilities Pr(Y = k|X).

The marginal density Pr(X) is ignored entirely; implicitly it is treated fully nonparametrically, with the empirical distribution function placing mass 1/n at each observation.

LDA

Maximizes the full log-likelihood based on the joint density

Pr(X, Y = j) = φ(X; µj, Σ) πj.

Standard MLE theory leads to the estimators µ̂j, Σ̂, π̂j.

The marginal density does play a role.


More Comments

LDA is easier to compute than logistic regression.

If the true fk(x)’s are Gaussian, LDA is better.

Logistic regression may lose around 30% efficiency asymptotically in the error rate (Efron 1975).

Robustness?

LDA uses all the points to estimate the covariance matrix: more information, but not robust against outliers.
Logistic regression down-weights points far from the decision boundary (recall the weight is pi(1 − pi)): more robust and safer.

In practice, the two methods often give similar results (for approximately normally distributed data).


Two-dimensional Linear Example

Consider the following scenarios for a two-class problem:

π1 = π0 = 0.5.

Class 1: X ∼ N2((1, 1)T , 4I)

Class 2: X ∼ N2((−1,−1)T , 4I)

The Bayes decision boundary is

{x : x1 + x2 = 0}.

We generate:

a training set of n = 200 or 400 points to fit the classifiers;

a test set of n′ = 2000 points to evaluate their prediction performance (see the sketch below).
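
A sketch of this simulation in R (the seed and implementation details are assumptions; the slides' exact numbers came from the authors' own run):

library(MASS)
set.seed(2020)
gen <- function(m) {                  # sample m points from the mixture
  y  <- rbinom(m, 1, 0.5)
  mu <- ifelse(y == 1, 1, -1)         # mean (1,1) or (-1,-1); covariance 4I
  data.frame(x1 = rnorm(m, mu, 2), x2 = rnorm(m, mu, 2), y = factor(y))
}
tr <- gen(200); te <- gen(2000)
fit_lda <- lda(y ~ x1 + x2, data = tr)
fit_log <- glm(y ~ x1 + x2, data = tr, family = binomial)
mean(predict(fit_lda, te)$class != te$y)                             # LDA test error
mean((predict(fit_log, te, type = "response") > 0.5) != (te$y == 1)) # logistic
mean(((te$x1 + te$x2) > 0) != (te$y == 1))                           # Bayes rule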


[Figure (n = 200), axes x1 and x2: four panels titled “Training Scatter Plot”, “Bayes and Linear Classifiers” (legend: bay, lin), “Logistic and LDA Classifiers” (legend: log, lda), and “All Linear Classifiers” (legend: bay, lin, log, lda).]


Performance of Various Classifiers (n=200)

            Bayes   Linear   Logistic   LDA
Train Error 0.250   0.255    0.255      0.255
Test Error  0.240   0.247    0.242      0.247

Fitted classification boundaries:
Linear & LDA: x2 = −0.48 − 1.73 x1
Logistic:     x2 = −0.32 − 1.80 x1


[Figure, axes x1 and x2: the same four panels (“Training Scatter Plot”, “Bayes and Linear Classifiers”, “Logistic and LDA Classifiers”, “All Linear Classifiers”) for the larger training set.]
