1
Statistical Analysis SC504/HS927, Spring Term 2008
Introduction to Logistic Regression
Dr. Daniel Nehring
2
Outline
Preliminaries: the SPSS syntax
Linear regression and logistic regression
OLS with a binary dependent variable
Principles of logistic regression
Interpreting logistic regression coefficients
Advanced principles of logistic regression (for self-study)
Source: http://privatewww.essex.ac.uk/~dfnehr
3
PRELIMINARIES
4
The SPSS syntax
A simple programming language giving access to all SPSS operations, including operations not covered in the main interface
Accessible through syntax windows, and through the 'Paste' button in every window of the main interface
Documentation available in the 'Help' menu
5
Using SPSS syntax files
Saved in a separate file format through the syntax window
Run commands by highlighting them and pressing the arrow button
Comments can be entered into the syntax
Copy-paste operations allow easy learning of the syntax
The syntax is always preferable to the main interface: it keeps a log of your work and makes it easy to identify and correct mistakes
6
PART I
7
Simple linear regression

Relation between 2 continuous variables
Regression coefficient β₁:
measures the association between y and x
the amount by which y changes on average when x changes by one unit
Estimated by the least squares method

y = α + β₁x₁   (slope = β₁)

[Figure: fitted regression line of y on x]
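The least-squares estimates for the slide's equation have closed-form solutions. A minimal pure-Python sketch (function and variable names are illustrative):

```python
# Least-squares fit of y = a + b*x:
# b = sum((x - mean_x)*(y - mean_y)) / sum((x - mean_x)**2),  a = mean_y - b*mean_x

def ols_fit(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope: average change in y per one-unit change in x
    a = mean_y - b * mean_x  # intercept
    return a, b

# illustrative data close to the line y = 2x
a, b = ols_fit([1, 2, 3, 4], [2.1, 3.9, 6.0, 8.0])
```

The slope b here is exactly the regression coefficient the slide describes: the average change in y per unit change in x.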
8
Multiple linear regression
Relation between a continuous variable and a set of i continuous variables
Partial regression coefficients βᵢ:
the amount by which y changes on average when xᵢ changes by one unit and all the other xᵢ remain constant
measures the association between xᵢ and y adjusted for all other xᵢ

y = α + β₁x₁ + β₂x₂ + … + βᵢxᵢ
9
Multiple linear regression
y is the predicted / response / dependent variable
the xᵢ are the predictor / explanatory / independent variables

y = α + β₁x₁ + β₂x₂ + … + βᵢxᵢ
10
OLS with a binary dependent variable
Binary variables can take only 2 possible values:
yes/no (e.g. educated to degree level, smoker/non-smoker)
success/failure (e.g. of a medical treatment)
Coded 1 or 0 (by convention 1 = yes/success)
Using OLS with a binary dependent variable, predicted values can be interpreted as probabilities and are expected to lie between 0 and 1
But nothing constrains the regression model to predict values between 0 and 1; values less than 0 and greater than 1 are possible and have no logical interpretation
Approaches which ensure that predicted values lie between 0 and 1 are therefore required, such as logistic regression
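The out-of-range problem can be seen directly by fitting OLS to a tiny binary outcome (data invented purely for illustration, using the standard closed-form least-squares formulas):

```python
def ols_fit(xs, ys):
    # closed-form least-squares slope and intercept
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# binary outcome coded 0/1, as on the slide
xs = [0, 1, 2, 3]
ys = [0, 0, 1, 1]
a, b = ols_fit(xs, ys)           # a ≈ -0.1, b = 0.4
preds = [a + b * x for x in xs]  # the fitted "probabilities"
# preds[0] < 0 and preds[-1] > 1: out of range, with no logical interpretation
```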
Fitting the equation to the data

Linear regression: least squares
Logistic regression: maximum likelihood
Likelihood function: estimates parameters with the property that the likelihood (probability) of the observed data is higher than for any other parameter values
Practically, it is easier to work with the log-likelihood:

ln L(β) = Σᵢ₌₁ⁿ { yᵢ ln π(xᵢ) + (1 − yᵢ) ln[1 − π(xᵢ)] }

where π(xᵢ) is the model's probability that yᵢ = 1.
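The log-likelihood can be evaluated directly for any candidate probabilities π(xᵢ); a minimal sketch (names illustrative):

```python
import math

def log_likelihood(ys, ps):
    """Binary-outcome log-likelihood: sum of y*ln(p) + (1 - y)*ln(1 - p)."""
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for y, p in zip(ys, ps))

# two observations, each assigned probability 0.5: ln L = 2 * ln(0.5)
ll = log_likelihood([1, 0], [0.5, 0.5])
```

Maximum likelihood estimation searches for the parameter values that make this sum as large as possible for the observed sample.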
12
Maximum Likelihood Estimation (MLE)

OLS cannot be used for logistic regression, since the relationship between the dependent and independent variables is non-linear
MLE is used instead to estimate the coefficients on the independent variables (the parameters)
Of all possible values of these parameters, MLE chooses those under which the model would have been most likely to generate the observed sample
13
Logistic regression

Models the relationship between a set of variables xᵢ:
dichotomous (yes/no)
categorical (social class, …)
continuous (age, …)
and a dichotomous (binary) variable Y
14
PART II
15
Logistic regression (1)

'Logistic regression' or 'logit'
p is the probability of an event occurring
1 − p is the probability of the event not occurring
p can take any value from 0 to 1
the odds of the event occurring = p/(1 − p)
the dependent variable in a logistic regression is the natural log of the odds:

ln[ p/(1 − p) ]
16
Logistic regression (2)
although ln[p/(1 − p)] can take any value, p will always range from 0 to 1
the equation to be estimated is:

ln[ p/(1 − p) ] = a + b₁x₁ + b₂x₂ + … + b_k x_k
Logistic regression (3)
[Figure: the logistic transformation, mapping the logit of P(y|x) onto a probability between 0 and 1]
18
Predicting p

let  z = a + b₁x₁ + b₂x₂ + … + b_k x_k

then, to predict p for individual i:

pᵢ = e^(zᵢ) / (1 + e^(zᵢ)),  where zᵢ = a + b₁x₁ᵢ + b₂x₂ᵢ + … + b_k x_kᵢ
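The prediction formula above is the logistic (inverse-logit) function; a small sketch (names illustrative):

```python
import math

def predict_p(a, bs, xs):
    """p = e^z / (1 + e^z), with z = a + b1*x1 + ... + bk*xk."""
    z = a + sum(b * x for b, x in zip(bs, xs))
    return math.exp(z) / (1 + math.exp(z))

# z = 0 gives p = 0.5, the midpoint of the logistic curve
p = predict_p(0, [1.0], [0.0])
```

Whatever value z takes, the output always lies strictly between 0 and 1, which is exactly why logistic regression avoids the out-of-range predictions of OLS.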
Logistic function (1)
[Figure: S-shaped logistic curve; probability of event y (vertical axis, 0.0 to 1.0) against x (horizontal axis)]
20
PART III
21
Interpreting logistic regression coefficients
the intercept is the value of the 'log of the odds' when all independent variables are zero
each slope coefficient is the change in the log odds from a 1-unit increase in the independent variable, controlling for the effects of the other variables
two problems:
log odds are not easy to interpret
the change in p from a 1-unit increase in one independent variable depends on the values of the other independent variables
but the exponent of b (e^b) does not depend on the values of the other independent variables, and is the odds ratio
22
Odds ratio
odds ratio for coefficient on a dummy variable, e.g. female=1 for women, 0 for men
odds ratio = ratio of the odds of event occurring for women to the odds of its occurring for men
odds for women are e^b times the odds for men
23
General rules for interpreting logistic regression coefficients
if b1 > 0, X1 increases p
if b1 < 0, X1 decreases p
if odds ratio >1, X1 increases p
if odds ratio < 1, X1 decreases p
if CI for b1 includes 0, X1 does not have a statistically significant effect on p
if CI for odds ratio includes 1, X1 does not have a statistically significant effect on p
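Because the odds ratio is e^b and exponentiation is monotonic, the last two rules always agree: the CI for b contains 0 exactly when the CI for the odds ratio contains 1. A small sketch (helper names are illustrative, not from the slides):

```python
import math

def significant_from_coef_ci(lo, hi):
    # effect is significant when the CI for b excludes 0
    return not (lo <= 0 <= hi)

def significant_from_or_ci(lo, hi):
    # effect is significant when the CI for the odds ratio excludes 1
    return not (lo <= 1 <= hi)

# e.g. a coefficient CI of (0.060, 0.095): both rules give the same verdict
b_lo, b_hi = 0.060, 0.095
same = significant_from_coef_ci(b_lo, b_hi) == significant_from_or_ci(math.exp(b_lo), math.exp(b_hi))
```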
24
An example: modelling the relationship between disability, age and income in the 65+ population

dependent variable = presence of disability (1 = yes, 0 = no)
independent variables:
X₁  age in years in excess of 65 (i.e. 65 → 0, 70 → 5)
X₂  whether the person has low income (in the lowest third of the income distribution)
data: Health Survey for England, 2000
25
Example: logistic regression estimates for the probability of being disabled, people aged 65+

                 Coeff (b)   Odds ratio   p-value   95% CI on coeff      95% CI on odds ratio
constant         -0.912                   0.000     -1.129 to -0.696
age               0.078      1.081        0.000      0.060 to  0.095     1.062 to 1.100
has low income   -0.270      0.764        0.003     -0.515 to -0.024     0.597 to 0.976

source: estimated from the Health Survey for England, 2000
26
PART IV
27
Odds, log odds, odds ratios and probabilities
odds = p/(1 − p)

log odds = ln[ p/(1 − p) ] = a + b₁x₁ + b₂x₂ + … + b_k x_k

odds ratio for variable k = e^(b_k)

probability:  p = e^(a + b₁x₁ + … + b_k x_k) / [ 1 + e^(a + b₁x₁ + … + b_k x_k) ]
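These quantities are interchangeable; a sketch of the conversions (names illustrative):

```python
import math

def odds_from_p(p):
    # odds = p / (1 - p)
    return p / (1 - p)

def p_from_log_odds(log_odds):
    # inverse of the logit: p = e^z / (1 + e^z)
    return math.exp(log_odds) / (1 + math.exp(log_odds))

p = 0.2
odds = odds_from_p(p)               # 0.25
log_odds = math.log(odds)
p_back = p_from_log_odds(log_odds)  # recovers 0.2
```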
28
Odds, odds ratios and probability of disability among non low income people aged 65+
[Figure: probabilities, odds, and odds ratios (compared with age 65) of disability plotted against age, 65 to 90, for people aged 65+ not on low income]
29
Odds, odds ratios and probabilities

pⱼ = 0.2, i.e. a 20% probability; oddsⱼ = 0.2/(1 − 0.2) = 0.2/0.8 = 0.25
p_k = 0.4; odds_k = 0.4/0.6 = 0.67
relative probability/risk: pⱼ/p_k = 0.2/0.4 = 0.5
odds ratio: oddsⱼ/odds_k = 0.25/0.67 = 0.37
the odds ratio is not equal to the relative probability/risk, except approximately when pⱼ and p_k are small
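The slide's arithmetic can be checked directly:

```python
# numbers from the slide
p_j, p_k = 0.2, 0.4
odds_j = p_j / (1 - p_j)       # 0.25
odds_k = p_k / (1 - p_k)       # 0.67 (= 2/3)
relative_risk = p_j / p_k      # 0.5
odds_ratio = odds_j / odds_k   # 0.375, printed as 0.37 on the slide
# the odds ratio (0.37) is clearly not the relative risk (0.5)
```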
30
Points to note from logit example.xls

if you see an odds ratio of e.g. 1.5 for a dummy variable indicating female, beware of saying 'women have a probability 50% higher than men'. Only if both p's are small can you say this.
it is better to calculate probabilities for example cases and compare these
31
Predicting p

let  z = a + b₁x₁ + b₂x₂ + … + b_k x_k

then, to predict p for individual i:

pᵢ = e^(zᵢ) / (1 + e^(zᵢ)),  where zᵢ = a + b₁x₁ᵢ + b₂x₂ᵢ + … + b_k x_kᵢ
32
E.g.: predicting a probability from our model

Predict disability for someone on low income aged 75:
Add up the linear equation: a (= −0.912) + 10 × 0.078 [age in excess of 65] + 1 × (−0.27) = −0.402
Take the exponent of this to get the odds of being disabled: e^(−0.402) = 0.669
Put the odds over 1 + the odds to give the probability: 0.669/1.669 ≈ 0.40, or a 40 per cent chance of being disabled
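The three steps on this slide can be reproduced directly, with the coefficients taken from the example table:

```python
import math

# coefficients from the example table (Health Survey for England, 2000)
a = -0.912          # constant
b_age = 0.078       # per year of age in excess of 65
b_low_inc = -0.270  # low-income dummy

z = a + 10 * b_age + 1 * b_low_inc  # age 75 -> 10 years over 65; z = -0.402
odds = math.exp(z)                  # odds of being disabled, ~0.669
p = odds / (1 + odds)               # ~0.40: a 40 per cent chance
```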
33
Goodness of fit in logistic regressions

based on improvements in the likelihood of observing the sample
use a chi-square test with the test statistic:

χ² = 2 ln( L_U / L_R ) = 2( ln L_U − ln L_R )

where R and U indicate the restricted and unrestricted models:
unrestricted: all independent variables in the model
restricted: all or a subset of the variables excluded from the model (their coefficients restricted to be 0)
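A sketch of computing the test statistic from two fitted log-likelihoods (the values below are hypothetical, purely to illustrate the formula):

```python
# hypothetical log-likelihoods from fitting the two models
ln_L_unrestricted = -100.0  # all independent variables included
ln_L_restricted = -105.0    # some coefficients restricted to 0

# likelihood-ratio test statistic: 2 * (ln L_U - ln L_R)
chi_sq = 2 * (ln_L_unrestricted - ln_L_restricted)
# compare chi_sq with the chi-square critical value whose degrees of
# freedom equal the number of restricted coefficients
```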
34
Statistical significance of coefficient estimates in logistic regressions

Calculated using standard errors, as in OLS:

t = b̂ / se(b̂)

for large n, |t| > 1.96 means that there is a 5% or lower probability that the true value of the coefficient is 0 (i.e. p ≤ 0.05)
35
95% confidence intervals for logistic regression coefficient estimates
b̂ ± 1.96 × se(b̂)

For CIs of odds ratios, calculate the CIs for the coefficients and take their exponents
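Assuming the standard error can be backed out of a reported CI (se ≈ CI half-width / 1.96, an approximation not stated on the slides), the age row of the example table can be checked against these formulas:

```python
import math

b = 0.078                     # age coefficient from the example table
ci_lo, ci_hi = 0.060, 0.095   # reported 95% CI on the coefficient
se = (ci_hi - ci_lo) / (2 * 1.96)  # back out the standard error, ~0.0089

t = b / se                          # ~8.7, well above 1.96 -> significant
lo = b - 1.96 * se                  # reproduces the CI up to rounding
hi = b + 1.96 * se
or_ci = (math.exp(lo), math.exp(hi))  # CI on the odds ratio, ~ (1.062, 1.100)
```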