Post on 04-Feb-2021
-
22s:152 Applied Linear Regression
Ch. 14 (sec. 1) and Ch. 15 (sec. 1 & 4): Logistic Regression
————————————————————
Logistic Regression
•When the response variable is a binary variable, such as
– 0 or 1
– live or die
– fail or succeed
then we approach our modeling a little differently.
•What is wrong with our previous modeling?
– Let’s look at an example...
1
-
• Example: Study on lead levels in children
Binary response called highbld:
Either high lead blood level (1) or low lead blood level (0)
One continuous predictor variable called soil:
Measure of the level of lead in the soil in the subject’s backyard
> sl=read.csv("soillead.csv")
> attach(sl)
> head(sl)
highbld soil
1 1 1290
2 0 90
3 1 894
4 0 193
5 1 1410
6 1 410
2
-
The dependent variable plotted against the independent variable:
[Figure: highbld vs. soil, two panels (“Original” and “Jittered”); x-axis soil (0–5000), y-axis highbld (0–1).]
3
-
Consider the usual regression model: Yi = β0 + β1xi + εi
The fitted model and residuals: Ŷ = 0.3178 + 0.0003x
[Figure: three panels — the fitted line over the jittered points (highbld vs. soil), residuals vs. fitted values (lm.out$residuals vs. lm.out$fitted.values), and a normal Q-Q plot of the residuals.]
All sorts of violations here...
4
-
Violations:
- Our usual εi ∼ N(0, σ²) isn’t reasonable.
The errors are not normal, and they don’t have the same variability across all x-values.
- Predictions for observations can be outside [0,1].
For x = 3000, Ŷ = 1.14, and this is not possibly an average of the 0s and 1s at x = 3000 (like a conditional mean given x).
Ŷx=3000 = 1.14, which is not in [0,1].
5
-
•What is a better way to model a 0-1 response using regression?
• At each x value, there’s a certain chance of a 0 and a certain chance of a 1.
[Figure: jittered highbld vs. soil; x-axis soil (0–5000), y-axis highbld (0–1).]
• For example, in this data set, it looks more likely to get a 1 at higher values of x.
• Or, P (Y = 1) changes with the x-value.
6
-
• Conditioning on a given x, we have P (Y = 1), and P (Y = 1) ∈ (0, 1).
•We will consider this probability of getting a 1 given the x-value(s) in our modeling...
πi = P (Yi = 1|Xi)
The previous plot shows that P (Y = 1) depends on the value of x.
• It is actually a transformation of this probability (πi) that we will use as our response in the regression model.
7
-
Odds of an Event
• First, let’s discuss the odds of an event.
When two fair coins are flipped,
P (two heads) = 1/4
P (not two heads) = 3/4
The odds in favor of getting two heads is:
odds = P (2 heads) / P (not 2 heads) = (1/4) / (3/4) = 1/3
or sometimes referred to as 1 to 3 odds.
You’re 3 times as likely to not get 2 heads as you are to get 2 heads.
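The arithmetic above is easy to check numerically. A minimal sketch (in Python here, though the course code is in R) of odds as a function of probability:

```python
# Odds in favor of an event, given its probability p.
def odds(p):
    return p / (1 - p)

p_two_heads = 1 / 4          # P(two heads) with two fair coins
print(odds(p_two_heads))     # (1/4)/(3/4) = 1/3
```

Note that odds of 1/3 correspond to "1 to 3": the event is a third as likely to happen as to not happen.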
8
-
• For a binary variable Y (2 possible outcomes),
odds in favor of Y = 1 is P (Y = 1) / P (Y = 0) = P (Y = 1) / [1 − P (Y = 1)]
• For example, if P (heart attack) = 0.0018, then the odds of a heart attack is
0.0018/0.9982 = 0.0018/(1 − 0.0018) = 0.001803
• The ratio of the odds for two different groups is also a quantity of interest.
For example, consider heart attacks for “male nonsmoker vs. male smoker”.
Suppose P (heart attack) = 0.0036 for a male smoker, and P (heart attack) = 0.0018 for a male nonsmoker.
9
-
Then, the odds ratio (O.R.) for a heart attack in nonsmoker vs. smoker is
O.R. = (odds of a heart attack for non-smoker) / (odds of a heart attack for smoker)
= [Pr(heart attack|non-smoker) / (1 − Pr(heart attack|non-smoker))] / [Pr(heart attack|smoker) / (1 − Pr(heart attack|smoker))]
= (0.0018/0.9982) / (0.0036/0.9964) = 0.4991
• Interpretation of the odds ratio for binary 0-1
– O.R. ≥ 0
– If O.R. = 1.0, then P (Y = 1) is the same in both samples
– If O.R. < 1.0, then P (Y = 1) is less in the numerator group than in the denominator group
– O.R. = 0 if and only if P (Y = 1) = 0 in numerator sample
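The 0.4991 above can be reproduced directly from the two probabilities (a quick Python check; the course uses R, but the arithmetic is identical):

```python
# Odds in favor of an event, given its probability p.
def odds(p):
    return p / (1 - p)

p_nonsmoker = 0.0018  # P(heart attack | male nonsmoker)
p_smoker = 0.0036     # P(heart attack | male smoker)

# O.R. for nonsmoker vs. smoker: numerator group's odds over denominator group's.
odds_ratio = odds(p_nonsmoker) / odds(p_smoker)
print(round(odds_ratio, 4))  # 0.4991
```

Since O.R. < 1, P (Y = 1) is smaller for nonsmokers (the numerator group), as the interpretation rules above say.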
10
-
Back to Logistic Regression...
• The response variable we will model is a transformation of P (Yi = 1) for a given Xi.
• The transformation is the logit transformation
logit(a) = ln[a / (1 − a)]
• The response variable we will use:
logit[P (Yi = 1)] = ln[P (Yi = 1) / (1 − P (Yi = 1))]
This is the loge of the odds that Yi = 1.
Notice: P (Yi = 1) ∈ (0, 1)
P (Yi = 1) / (1 − P (Yi = 1)) ∈ (0, ∞)
and −∞ < ln[P (Yi = 1) / (1 − P (Yi = 1))] < ∞
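These range facts are easy to see numerically: logit maps (0, 1) onto the whole real line, with logit(0.5) = 0 and symmetric values on either side. A small sketch (Python, with arbitrarily chosen probabilities for illustration):

```python
import math

def logit(a):
    # ln(a / (1 - a)), defined for a in (0, 1)
    return math.log(a / (1 - a))

print(logit(0.5))   # 0.0 : even odds
print(logit(0.9))   # positive: probabilities above 0.5 give positive log-odds
print(logit(0.1))   # negative, the mirror image of logit(0.9)
```

As a approaches 1 the log-odds grow without bound, and as a approaches 0 they go to −∞.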
-
• The logistic regression model:
ln[P (Yi = 1) / (1 − P (Yi = 1))] = β0 + β1X1i + · · · + βkXki
– This response on the left isn’t ‘bounded’ by [0,1] even though the Y-values themselves are (having the response bounded by [0,1] was a problem before).
– The response on the left can feasibly be any positive or negative quantity.
– This is a nice characteristic because the right side of the equation can ‘potentially’ give any possible predicted value, −∞ to ∞.
• The logistic regression model is a GENERALIZED LINEAR MODEL. Linear model on the right, something other than the usual continuous Y on the left.
12
-
• Let’s look at the fitted logistic regression model for the lead level data with one X covariate.
ln[P (Yi = 1) / (1 − P (Yi = 1))] = β̂0 + β̂1Xi = −1.5160 + 0.0027 Xi
[Figure: fitted S-shaped curve over the jittered data; x-axis soil (0–5000), y-axis jitter(highbld, factor = 0.12).]
The curve represents the E(Yi|xi).
And...
E(Yi|xi) = 0 · P (Yi = 0|xi) + 1 · P (Yi = 1|xi) = P (Yi = 1|xi)
13
-
•We can manipulate the regression model to put it in terms of P (Yi = 1) = E(Yi)
• If we take this regression model
ln[P (Yi = 1) / (1 − P (Yi = 1))] = −1.5160 + 0.0027 Xi
and solve for P (Yi = 1) we get...
P (Yi = 1) = exp(−1.5160 + 0.0027 Xi) / [1 + exp(−1.5160 + 0.0027 Xi)]
– The value on the right is bounded to [0,1].
– Because our β1 is positive,
∗ As X gets larger, P (Yi = 1) goes to 1.
∗ As X gets smaller, P (Yi = 1) goes to 0.
– The fitted curve, i.e. the function of Xi on the right, is S-shaped (sigmoidal)
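A quick numeric sketch of this inverse transformation (Python, using the fitted coefficients from the slide) confirms the output is always a valid probability and increases with X:

```python
import math

b0, b1 = -1.5160, 0.0027  # fitted coefficients from the lead-level example

def p_hat(x):
    # inverse logit: exp(b0 + b1*x) / (1 + exp(b0 + b1*x)),
    # algebraically the same as 1 / (1 + exp(-(b0 + b1*x)))
    eta = b0 + b1 * x
    return math.exp(eta) / (1 + math.exp(eta))

for x in (0, 500, 1000, 3000, 5000):
    p = p_hat(x)
    assert 0 < p < 1          # bounded, unlike the straight-line fit earlier
    print(x, round(p, 4))
```

Unlike the ordinary linear fit (which gave Ŷ = 1.14 at x = 3000), these fitted values can never escape (0, 1).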
14
-
• In the general model with one covariate:
ln[P (Yi = 1) / (1 − P (Yi = 1))] = β0 + β1Xi
which means
P (Yi = 1) = exp(β0 + β1Xi) / [1 + exp(β0 + β1Xi)] = 1 / (1 + exp[−(β0 + β1Xi)])
• If β1 is positive, we get the S-shape that goes to 1 as Xi goes to ∞ and goes to 0 as Xi goes to −∞ (as in the lead example).
• If β1 is negative, we get the opposite S-shape that goes to 0 as Xi goes to ∞ and goes to 1 as Xi goes to −∞.
15
-
• Another way to think of logistic regression is that we’re modeling a Bernoulli random variable occurring for each Xi, and the Bernoulli parameter πi depends on the covariate value.
Then, Yi|Xi ∼ Bernoulli(πi)
where πi represents P (Yi = 1|Xi)
E(Yi|Xi) = πi and V (Yi|Xi) = πi(1 − πi)
We’re thinking in terms of the conditional distribution of Y |X (we don’t have constant variance across the X values; mean and variance are tied together).
Writing πi as a function of Xi,
πi = exp(β0 + β1Xi) / [1 + exp(β0 + β1Xi)]
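The mean–variance tie-in is concrete: for a Bernoulli(π) response the variance is π(1 − π), so as π changes with X, so does the variance. A small illustration (Python):

```python
# Bernoulli variance as a function of the success probability pi
def bern_var(pi):
    return pi * (1 - pi)

for pi in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(pi, bern_var(pi))  # largest at pi = 0.5, shrinking toward 0 and 1
```

This is exactly why the constant-variance assumption of ordinary regression fails for a 0-1 response.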
16
-
Interpretation of parameters
For the lead levels example.
• Intercept β0:
When Xi = 0, there is no lead in the soil in the backyard. Then,
ln[P (Yi = 1) / (1 − P (Yi = 1))] = β0
So, β0 is the log-odds that a randomly selected child with no lead in their backyard has a high lead blood level.
Or, πi = e^β0 / (1 + e^β0) is the chance that a kid with no lead in their backyard has a high lead blood level.
For this data, β̂0 = −1.5160 or π̂i = 0.1800
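Plugging the fitted intercept into e^β0 / (1 + e^β0) reproduces the 0.1800 (a Python check, using the full-precision estimate from the R output shown later):

```python
import math

b0 = -1.5160568  # fitted intercept from the glm output
pi0 = math.exp(b0) / (1 + math.exp(b0))
print(round(pi0, 4))  # ≈ 0.18, the estimated P(high blood lead) when soil lead is 0
```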
17
-
Interpretation of parameters
For the lead levels example.
• Coefficient β1:
Consider the loge of the odds ratio (O.R.) of having high lead blood levels for the following 2 groups...
1) those with exposure level of x
2) those with exposure level of x + 1
loge{[π2/(1 − π2)] / [π1/(1 − π1)]} = loge[e^β0 e^(β1(x+1)) / (e^β0 e^(β1x))] = loge(e^β1) = β1
β1 is the loge of the O.R. for a 1 unit increase in x.
It compares the groups with x exposure and (x + 1) exposure.
18
-
Or, un-doing the log,
e^β1 is the O.R. comparing the two groups.
For this data, β̂1 = 0.0027 or e^0.0027 = 1.0027
A 1 unit increase in X increases the odds of having a high lead blood level by a factor of 1.0027.
Though this value is small, the range of the X values is large (40 to 5255), so it can have a substantial impact when you consider the full spectrum of x values.
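The “small coefficient, big range” point can be made numerically: per-unit odds ratios multiply, so across the full soil range the factor is e^(β1·Δx). A Python sketch:

```python
import math

b1 = 0.0027  # fitted slope from the lead-level example

# O.R. for a 1-unit increase in soil lead
print(round(math.exp(b1), 4))        # 1.0027

# O.R. comparing the highest (5255) to the lowest (40) observed soil value
x_lo, x_hi = 40, 5255
print(math.exp(b1 * (x_hi - x_lo)))  # a very large factor: tiny per-unit effects compound
```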
19
-
Prediction
•What is the predicted probability of high lead blood level for a child with X = 500?
P (Yi = 1) = exp(−1.5160 + 0.0027 Xi) / [1 + exp(−1.5160 + 0.0027 Xi)]
= 1 / (1 + exp[−(−1.5160 + 0.0027 Xi)])
= 1 / (1 + exp[1.5160 − 0.0027 × 500]) = 0.4586
•What is the predicted probability of high lead blood level for a child with X = 4000?
P (Yi = 1) = 1 / (1 + exp[1.5160 − 0.0027 × 4000]) = 0.9999
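Both predictions are easy to verify (Python, using the rounded coefficients from the slide):

```python
import math

def p_hat(x, b0=-1.5160, b1=0.0027):
    # fitted P(Y = 1 | soil = x) from the logistic model
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

print(round(p_hat(500), 4))   # 0.4586
print(round(p_hat(4000), 4))  # 0.9999
```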
[Figure: fitted logistic curve over the jittered data; x-axis soil (0–5000), y-axis jitter(highbld, factor = 0.12).]
20
-
•What is the predicted probability of high lead blood level for a child with X = 0?
P (Yi = 1) = 1 / (1 + exp[1.5160]) = 0.1800
So, there’s still a chance of having high lead blood level even when the backyard doesn’t have any lead. This is why we don’t see the low-end of our fitted S-curve go to zero.
Testing Significance of Covariate
> glm.out=glm(highbld~soil,family=binomial(logit))
> summary(glm.out)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.5160568 0.3380483 -4.485 7.30e-06 ***
soil 0.0027202 0.0005385 5.051 4.39e-07 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Soil is a significant predictor in the logistic regression model.
21
-
A couple things...
•Method of Maximum Likelihood is used to fit the model.
Estimates β̂j are asymptotically normal, so R uses Z-tests (or Wald tests) for covariate significance in the output. [NOTE: squaring a z-statistic gets you a chi-squared statistic with 1 df.]
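For instance, the soil line in the output above is a Wald test: z = estimate/SE, and squaring it gives the equivalent 1-df chi-squared statistic (a Python check):

```python
est, se = 0.0027202, 0.0005385  # soil estimate and std. error from the glm output

z = est / se
print(round(z, 3))    # 5.051, matching the "z value" column
print(round(z**2, 2)) # the same test as a chi-squared statistic with 1 df
```

Since z² far exceeds 3.84 (the 0.05 critical value of a 1-df chi-squared), soil is significant either way you state the test.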
• General concepts easily extended to logistic regression with many covariates.
22