
MS&E 226: “Small” Data
Lecture 12: Logistic regression (v1)

Ramesh Johari ([email protected])

Fall 2015


Regression methods for binary outcomes


Binary outcomes

For the duration of this lecture, suppose the outcome variable Yi ∈ {0, 1} for each i.

(Much of what we cover generalizes to discrete outcomes with more than two levels, but we focus on the binary case.)


A population model for binary outcomes

What does it mean to have a population model for binary outcomes?

Given a covariate vector ~X, the population model specifies the probability that Yi is either 0 or 1.

Given a sample X and Y, our goal is to construct a good approximation of this population model (for prediction, inference, or both).


Linear regression?

Why doesn't linear regression work well for prediction? A picture:

[Figure: binary outcomes Y ∈ {0, 1} plotted against a single covariate X (roughly −20 to 20).]


Logistic regression

At its core, logistic regression is a method that directly addresses this issue with linear regression: it produces fitted values that always lie in [0, 1].

Input: Sample data X and Y.

Output: A fitted model f̂(·), where we interpret f̂(~X) as an estimate of the probability that the corresponding outcome Y is equal to 1.


Logistic regression: Basics


The population model

Logistic regression assumes the following population model:

P(Y = 1 | ~X) = exp(~Xβ) / (1 + exp(~Xβ)) = 1 − P(Y = 0 | ~X).

A plot in the case with only one covariate, and β0 = 0, β1 = 1:

[Figure: the logistic curve P(Y = 1 | X) as a function of X, increasing from near 0 at X = −10 to near 1 at X = 10.]
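For reference, this curve is easy to reproduce in base R, since plogis is exactly the logistic function g−1:

curve(plogis(x), from = -10, to = 10, xlab = "X", ylab = "P(Y = 1 | X)")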


Logistic curve

The function g given by:

g(q) = log( q / (1 − q) )

is called the logit function. It has the following inverse, called the logistic curve:

g−1(z) = exp(z) / (1 + exp(z)).

In terms of g, we can write the population model as:

P(Y = 1 | ~X) = g−1(~Xβ).

(This is one example of a generalized linear model (GLM); for a GLM, g is called the link function.)
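In R, these two functions are available directly as qlogis (the logit) and plogis (the logistic curve); a quick check of the definitions:

logit <- function(q) log(q / (1 - q))
logistic <- function(z) exp(z) / (1 + exp(z))
z <- seq(-5, 5, by = 0.5)
all.equal(logistic(z), plogis(z))   # TRUE: plogis is the logistic curve
all.equal(logit(plogis(z)), z)      # TRUE: the logit inverts it (qlogis does the same)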


Logistic curve

Note that from this curve we see some important characteristics of logistic regression:

- The logistic curve is increasing. Therefore, in logistic regression, larger values of covariates that have positive coefficients will tend to increase the probability that Y = 1.

- When z > 0, then g−1(z) > 1/2; when z < 0, then g−1(z) < 1/2. Therefore, when ~Xβ > 0, Y is more likely to be one than zero; and conversely, when ~Xβ < 0, Y is more likely to be zero than one.


Interpretations: Log odds ratio

In many ways, the choice of a logistic regression model is a matter of practical convenience, rather than any fundamental understanding of the population: it allows us to neatly employ regression techniques for binary data.

One way to interpret the model:

- Note that given a probability 0 < q < 1, q/(1 − q) is called the odds ratio. The odds ratio lies in (0, ∞).

- So logistic regression uses a model that suggests the log odds ratio of Y given ~X is linear in the covariates. Note the log odds ratio lies in (−∞, ∞). (A one-line check follows the list.)
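Concretely, applying the logit to g−1(~Xβ) recovers ~Xβ, so under the model the log odds ratio of Y given ~X equals ~Xβ. A quick numerical check in R, using a stand-in value for ~Xβ:

xb <- 1.7                # stand-in for ~X beta
p  <- plogis(xb)         # model probability P(Y = 1 | X)
log(p / (1 - p))         # log odds ratio: returns 1.7, i.e. linear in the covariates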


Interpretations: Latent variables

Another way to interpret the model is the latent variable approach:

- Suppose that given a vector of covariates ~X, a logistic random variable Z is sampled independently:

P(Z < z) = g−1(z).

- Define Y = 1 if ~Xβ + Z > 0, and Y = 0 otherwise.

- By this definition, Y = 1 if and only if Z > −~Xβ; by the symmetry of the logistic distribution, this event has probability g−1(~Xβ), exactly the logistic regression population model. (A simulation sketch follows the list.)
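A small simulation (with made-up numbers) confirms that the latent-variable recipe reproduces the logistic population model; rlogis draws from the standard logistic distribution:

set.seed(1)
xb <- 0.8                        # a fixed value of ~X beta
z  <- rlogis(100000)             # latent logistic draws, with P(Z < z) = g^(-1)(z)
y  <- as.numeric(xb + z > 0)     # latent-variable definition of Y
mean(y)                          # close to...
plogis(xb)                       # ...the model probability, about 0.69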


Interpretations: Latent variables

The latent variable interpretation is particularly popular in econometrics, where it is a first example of a discrete choice model. For example, in modeling customer choice over whether or not to buy a product, suppose:

- Each customer has a feature vector ~X.

- This customer's utility for purchasing the item is the (random) quantity ~Xβ + Z, and the utility for not purchasing the item is zero.

- The customer purchases when the purchase utility is higher, i.e., when ~Xβ + Z > 0; the probability of this event is exactly the logistic regression model.

This is a very basic example of a random utility model for customer choice.


Fitting through maximum likelihood

The logistic regression parameters are found through maximum likelihood.

The (conditional) likelihood for Y given X and parameters β is:

P(Y | β, X) = ∏i=1..n g−1(Xiβ)^Yi (1 − g−1(Xiβ))^(1−Yi).

Let β̂MLE be the resulting solution; these are the logistic regression coefficients. (For the remainder of this lecture we drop the subscript MLE on the solution.)
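To make this concrete, the log of this likelihood can be maximized numerically; a minimal sketch on simulated data (the variable names are illustrative), with glm as a check:

set.seed(2)
n <- 500
x <- cbind(1, rnorm(n))                      # design matrix: intercept plus one covariate
beta_true <- c(-0.5, 1.2)
y <- rbinom(n, 1, plogis(x %*% beta_true))   # simulate binary outcomes from the model

negloglik <- function(beta) {
  p <- plogis(x %*% beta)
  -sum(y * log(p) + (1 - y) * log(1 - p))    # negative log of the likelihood above
}

optim(c(0, 0), negloglik)$par                # numerical MLE
coef(glm(y ~ x[, 2], family = "binomial"))   # agrees up to optimizer tolerance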


Fitting through maximum likelihood [∗]

Unfortunately, in contrast to our previous examples, maximum likelihood estimation does not have a closed form solution in the case of logistic regression.

However, it turns out that there are reasonably efficient iterative methods for algorithmically computing the MLE solution.

One example is an algorithm inspired by weighted least squares, called iteratively reweighted least squares (IRLS). This algorithm iteratively updates the weights in WLS to converge to the logistic regression MLE solution. See [AoS], Section 13.7 for details.
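A bare-bones sketch of the IRLS idea (illustrative only, not the exact algorithm inside glm), reusing the simulated x and y from the previous sketch:

beta <- rep(0, ncol(x))                              # start from zero
for (it in 1:25) {
  p <- as.vector(plogis(x %*% beta))                 # current fitted probabilities
  w <- p * (1 - p)                                   # weights
  z <- x %*% beta + (y - p) / w                      # working (adjusted) response
  beta <- solve(t(x) %*% (w * x), t(x) %*% (w * z))  # one weighted least squares step
}
as.vector(beta)                                      # matches the MLE from optim/glm above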


Fitting and interpreting logistic regression models


Example: The CORIS dataset

Recall: 462 South African males evaluated for heart disease.

Outcome variable: Coronary heart disease (chd).

Covariates:

- Systolic blood pressure (sbp)
- Cumulative tobacco use (tobacco)
- LDL cholesterol (ldl)
- Adiposity (adiposity)
- Family history of heart disease (famhist)
- Type A behavior (typea)
- Obesity (obesity)
- Current alcohol consumption (alcohol)
- Age (age)


Logistic regression

To run logistic regression in R, we use the glm function, as follows:

> fm = glm(formula = chd ~ .,
           family = "binomial",
           data = coris)
> display(fm)   # display() is from the arm package
            coef.est coef.se
(Intercept)   -6.15    1.31
...
famhist        0.93    0.23
...
obesity       -0.06    0.04
...

(As before, standard errors are the standard deviation of the sampling distribution, and can be used to construct confidence intervals.)


Interpreting the output

Recall that if a coefficient is positive, it increases the probability that the outcome is 1 in the fitted model (since the logistic function is increasing).

So, for example, obesity has a negative coefficient. What does this mean? Do you believe the implication?


Interpreting the output

In linear regression, coefficients were directly interpretable; this is not as straightforward for logistic regression.

Some approaches to interpretation of β̂j (holding all other covariates constant); a short example follows the list:

- β̂j is the change in the log odds ratio for Y per unit change in Xj.

- exp(β̂j) is the multiplicative change in the odds ratio for Y per unit change in Xj.
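For example, using the fitted CORIS model fm from above and the famhist coefficient reported there:

exp(coef(fm))    # multiplicative change in the odds per unit change in each covariate
exp(0.93)        # about 2.5: a family history of heart disease multiplies the estimated
                 # odds of chd by roughly 2.5, holding the other covariates fixed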


The “divide by 4” rule

We can also directly differentiate the fitted model with respect to Xj.

By differentiating g−1(~Xβ̂), we find that the change in the (fitted) probability that Y = 1 per unit change in Xj is:

( exp(~Xβ̂) / [1 + exp(~Xβ̂)]^2 ) β̂j.

Note that the term in parentheses cannot be any larger than 1/4.

Therefore, |β̂j|/4 is an upper bound on the magnitude of the change in the fitted probability that Y = 1, per unit change in Xj.
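A quick numerical check of the rule with the famhist coefficient from the CORIS fit (β̂j ≈ 0.93):

0.93 / 4                    # upper bound of about 0.23 on the change in P(Y = 1)
plogis(0) - plogis(-0.93)   # about 0.22: the change when ~X beta_hat moves from -0.93 to 0,
                            # i.e. a one-unit change in famhist; it stays below the bound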


Residuals

Here is what happens when we plot residuals vs. fitted values:

[Figure: residuals vs. fitted values (residuals.coris against fitted.coris).]

Why is this plot not informative?

Residuals

An alternative approach: bin the residuals based on fitted value, and then plot the average residual in each bin.

E.g., when we divide the data into ≈ 50 bins, here is what we obtain:

[Figure: "Binned residual plot" showing the average residual in each bin against the expected values.]
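One way to construct such a plot by hand is sketched below (the binnedplot function in the arm package produces a similar figure); this assumes chd is coded 0/1 as in the slides:

fitted.coris <- fitted(fm)                     # fitted probabilities
residuals.coris <- coris$chd - fitted.coris    # residuals on the probability scale
bins <- cut(fitted.coris, breaks = 50)         # roughly 50 bins by fitted value
avg_fitted <- tapply(fitted.coris, bins, mean)
avg_resid  <- tapply(residuals.coris, bins, mean)
plot(avg_fitted, avg_resid, main = "Binned residual plot",
     xlab = "Expected Values", ylab = "Average residual")
abline(h = 0, lty = 2)                         # reference line at zero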


Logistic regression as a classifier


Classification

Logistic regression serves as a classifier in the following natural way (a short R sketch follows the list):

- Given the estimated coefficients β̂ and a new covariate vector ~X, compute ~Xβ̂.

- If the resulting value is positive, return Y = 1 as the predicted value.

- If the resulting value is negative, return Y = 0 as the predicted value.
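In R, this rule is a one-liner on top of the fitted model (here newdata is a hypothetical data frame of new covariate values):

eta  <- predict(fm, newdata = newdata, type = "link")   # ~X beta_hat, on the linear predictor scale
yhat <- as.numeric(eta > 0)                             # predict Y = 1 exactly when ~X beta_hat > 0
# equivalently: as.numeric(predict(fm, newdata = newdata, type = "response") > 0.5)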


Linear classification

The idea is to use the fitted model directly to choose the most likely value for Y:

When ~Xβ̂ > 0, the fitted model gives P(Y = 1 | ~X, β̂) > 1/2, so we make the prediction that Y = 1.

Note that logistic regression is therefore an example of a linear classifier: the boundary between covariate vectors where we predict Y = 1 (resp., Y = 0) is linear.


Tuning logistic regression

As with other classifiers, we can tune logistic regression to trade off false positives and false negatives.

In particular, suppose we choose a threshold t, and predict Y = 1 whenever ~Xβ̂ > t.

- When t = 0, we recover the classification rule on the preceding slide. This is the rule that minimizes average 0-1 loss on the training data.

- What happens to our classifier when t → ∞?

- What happens to our classifier when t → −∞?

As usual, you can plot an ROC curve for the resulting family of classifiers as t varies.
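A rough sketch of tracing out the ROC curve by hand on the training data (packages such as pROC or ROCR do this for you); again chd is assumed to be coded 0/1:

eta <- predict(fm, type = "link")        # ~X beta_hat for each observation
ts  <- sort(unique(eta))                 # candidate thresholds
tpr <- sapply(ts, function(t) mean(eta[coris$chd == 1] > t))   # true positive rate at each t
fpr <- sapply(ts, function(t) mean(eta[coris$chd == 0] > t))   # false positive rate at each t
plot(fpr, tpr, type = "l", xlab = "False positive rate", ylab = "True positive rate")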


Additional thoughts on logistic regression


Regularized logistic regression [∗]

As with linear regression, regularized logistic regression is often used in the presence of many features.

In practice, the most common regularization technique is to add the penalty −λ ∑j |βj| to the maximum likelihood problem; this is the equivalent of the lasso for logistic regression.

As for linear regression, this penalty has the effect of selecting a subset of the parameters (for sufficiently large values of the regularization penalty λ).
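A minimal sketch with the glmnet package (alpha = 1 gives the lasso penalty; cv.glmnet picks λ by cross-validation):

library(glmnet)
x <- model.matrix(chd ~ ., data = coris)[, -1]   # covariate matrix, dropping the intercept column
y <- coris$chd
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coef(cvfit, s = "lambda.min")                    # sparse coefficients at the selected lambda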


Do you believe the model?

Logistic regression highlights one of the inherent tensions in inference:

- Inference means that we are trying to understand and explain the population model.

- At the same time, it appears that logistic regression is an estimation procedure that largely exists for its technical convenience. (Do we really believe that the population model is logistic?)

At times like this it is useful to remember:

“All models are wrong, but some are more useful than others.”

– George E.P. Box
