Faculty Development Program
Clinical Epidemiology and Clinical Research
TOPIC: Biostatistics 5: Multivariable and Logistic Regression
DATE: May (1:30PM-5:00PM)
LEADERS: Art Evans
OBJECTIVES:
1. Interpret a logistic regression equation.
2. Calculate OR and RR from logistic regression for continuous and categorical
predictor variables.
3. Describe the main assumptions of the logistic model.
4. Describe the main errors in studies that analyze data with logistic regression.
5. Check for interactions in linear and logistic models.
REQUIRED READINGS:
Norman and Streiner. PDQ Statistics. 2nd Ed. Pages 65-69 (ANCOVA); 116-117 (logistic
regression).
Norman and Streiner. Biostatistics: The Bare Essentials. Pages: 119-127.
Concato and Feinstein. The risk of determining risk with multivariable models. Ann
Intern Med. 1993;201-210.
PROBLEMS:
A randomized trial was performed testing a new treatment against placebo with mortality
at one-year as the outcome of interest.
A logistic regression model was used to assess the treatment effect while adjusting for
potential confounders.
Questions:
1. According to the logistic regression model, is there evidence of interaction?
2. Based on the first model (no confounding or interaction considered), what is the
OR that describes the treatment effect? Verify by calculating the OR in the raw
data.
3. What is the treatment OR after adjusting for the potential confounder? Is there
evidence of confounding? How do you make that decision?
Model without the confounder

                          beta coefficient    P value
intercept                       -0.2
treatment (1=Tx, 0=C)           -1.0
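As a reminder of the mechanics behind Question 2 (not a full answer key), the OR for a 0/1 treatment indicator in a logistic model is e^beta. A minimal Python sketch using the treatment beta from the table above (the P values are not reproduced in the handout):

    import math

    # beta coefficient for treatment (1=Tx, 0=C) taken from the table above
    beta_treatment = -1.0

    # In a logistic model, the OR for a one-unit change in X is e^beta.
    odds_ratio = math.exp(beta_treatment)
    print(f"Treatment OR = exp({beta_treatment}) = {odds_ratio:.2f}")  # about 0.37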
Multivariable Linear Regression
1. Confusion: multivariable vs. multivariate
Multivariable means that you are simultaneously considering more than one predictor
variable (independent; X), eg, Y = X1+X2+X3+X4
Multivariate usually means that you are simultaneously considering more than one
outcome variable (dependent; Y), eg, Y1+Y2+Y3 = X1+X2+X3+X4
2. Sample Size: Rough rule of thumb
Linear regression: 10 subjects for every potential predictor variable;
Logistic regression: 10 subjects in the smallest group of the dichotomous outcome
variable for every potential predictor variable;
Multivariate linear regression: 10 subjects for every potential variable, including
the multiple dependent (Y) variables.
Note: this is 10 subjects for every potential predictor, not 10 for every significant
predictor in the final model! (This assumes you are interested in describing a
prediction model, rather than simply being interested in controlling for lots of
potential confounders while examining a specific exposure-disease relationship. If
the latter is true, then the combination of potential confounders counts as 1 variable.)
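To make the rule of thumb concrete, here is a small illustrative calculation in Python. The sample sizes are hypothetical, and this is a rough screening rule, not a formal power calculation:

    # Hypothetical study sizes used only to illustrate the 10-per-variable rule.
    n_subjects = 400              # total sample for a linear regression
    events_in_smaller_group = 60  # rarer outcome count for a logistic regression

    max_predictors_linear = n_subjects // 10                  # 10 subjects per candidate predictor
    max_predictors_logistic = events_in_smaller_group // 10   # 10 events per candidate predictor
    print(max_predictors_linear, max_predictors_logistic)     # 40 and 6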
3. Interpretation of beta coefficients:
The importance of a beta coefficient must be interpreted in light of the units of the
particular X variable. For example, the beta coefficient for height measured in yards
would be 36 times bigger than the beta coefficient for height measured in inches,
because one yard equals 36 inches. Despite the 36-fold difference between these two
beta coefficients, their importance is identical. Therefore, it is impossible to judge a
beta coefficient without knowing the units of measurement.
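A small simulation makes the point about units. This sketch uses made-up heights and weights and fits the same relationship twice, once with height in inches and once in yards; the beta coefficient changes 36-fold, but the fitted relationship is identical:

    import numpy as np

    rng = np.random.default_rng(0)
    height_in = rng.uniform(60, 75, 200)              # heights in inches (simulated)
    weight = 2.0 * height_in + rng.normal(0, 5, 200)  # weight depends on height
    height_yd = height_in / 36.0                      # the same heights in yards

    slope_inches = np.polyfit(height_in, weight, 1)[0]
    slope_yards = np.polyfit(height_yd, weight, 1)[0]
    print(slope_inches, slope_yards, slope_yards / slope_inches)  # ratio is 36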
4. Interactions:
Always look for interactions among predictor variables. Two X variables may appear to
have no relationship with the outcome variable until an interaction term is also
considered in the model.
Interaction terms are the best method to test for important differences among
subgroups.
Very bad method: testing within each subgroup separately and then declaring
interaction present if one subgroup demonstrates a significant difference whereas in the
other subgroup there is no significant difference.
Always test for interaction before trying to simplify the model.
Always check for interaction before checking for confounding. (Remember: Adjusting
for confounding is similar to taking the average among the subgroups. Taking the
average is bad if there is important interaction, ie, the effect is markedly different among
subgroups.)
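For readers working outside SPSS, here is a hedged sketch of how an interaction term is typically tested with Python's statsmodels (variable names and data are hypothetical). The interaction is entered as a product term in one full model, rather than by fitting the subgroups separately:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulated data in which the effect of x1 on y differs by group (a true interaction).
    rng = np.random.default_rng(1)
    n = 300
    df = pd.DataFrame({"x1": rng.normal(size=n), "group": rng.integers(0, 2, n)})
    df["y"] = 1.0 * df["x1"] + 2.0 * df["x1"] * df["group"] + rng.normal(size=n)

    # 'x1 * group' expands to x1 + group + x1:group, so the interaction term is
    # tested within the full model.
    fit = smf.ols("y ~ x1 * group", data=df).fit()
    print(fit.summary().tables[1])  # inspect the x1:group row and its P value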
5. Confounding:
If the primary goal is to estimate the effect of one X variable on Y, while adjusting for
possible confounding from other X variables, then see if the beta coefficient for the main
X changes when all the other Xs are added to the model. If it does, then there is some
confounding. If it changes a lot, then there is a lot of confounding.
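A minimal sketch of that comparison in Python with statsmodels, using simulated data in which age confounds a hypothetical exposure-outcome relationship (the variable names are illustrative):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulated data: age is associated with both the exposure and the outcome.
    rng = np.random.default_rng(2)
    n = 500
    age = rng.uniform(30, 80, n)
    exposure = (age / 100 + rng.normal(0, 0.3, n) > 0.7).astype(int)
    y = 0.5 * exposure + 0.1 * age + rng.normal(0, 1, n)
    df = pd.DataFrame({"y": y, "exposure": exposure, "age": age})

    crude = smf.ols("y ~ exposure", data=df).fit()
    adjusted = smf.ols("y ~ exposure + age", data=df).fit()
    print(crude.params["exposure"], adjusted.params["exposure"])
    # A large shift between the crude and adjusted betas indicates confounding by age.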
6. Test linearity assumption:
For multivariable linear regression (single Y; multiple Xs), the assumptions are:
linear relationship between Y and Xs;
for all possible combinations of X, the distribution of Y is normal with a constant variance.
Eyeball test: SPSS: Graphs: Scatter: Matrix: enter all Xs and Y: look at the row in the
matrix that compares Y to each of the Xs: ask yourself: Is there really a linear
relationship?
Do NOT test the linearity assumption by looking at a table of correlation coefficients
between Y and each of the Xs. Instead, look at the scatterplots.
Check all partial regression plots to see if they are linear:
SPSS: Analyze: Regression: Linear: Plots: select Produce all partial plots
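Outside SPSS, roughly equivalent eyeball checks can be produced with pandas and statsmodels. This is a sketch on made-up data, not the SPSS output itself:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    # Simulated predictors; x3 deliberately has a curved (non-linear) relationship with y.
    rng = np.random.default_rng(3)
    df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
    df["y"] = 2 * df["x1"] - df["x2"] + 0.5 * df["x3"] ** 2 + rng.normal(size=200)

    pd.plotting.scatter_matrix(df, figsize=(8, 8))   # scatterplot matrix: Y vs each X

    fit = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2", "x3"]])).fit()
    sm.graphics.plot_partregress_grid(fit)           # all partial regression plots
    plt.show()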
7. Test for collinearity (multicollinearity):
It's okay for the X variables to be correlated, but it's not okay if they are nearly identical,
with correlations near 1.0 (completely redundant).
Check that the tolerance is > 0.1 for each X variable (collinearity diagnostics). (A tolerance
of < 0.1 is bad and means that something has to be done.)
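Many packages report the variance inflation factor (VIF) rather than tolerance; tolerance is simply 1/VIF, so tolerance < 0.1 corresponds to VIF > 10. A sketch with statsmodels on made-up data, where x2 is deliberately almost identical to x1:

    import numpy as np
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools import add_constant

    rng = np.random.default_rng(4)
    x1 = rng.normal(size=300)
    x2 = x1 + rng.normal(0, 0.05, 300)   # nearly redundant with x1 (collinear on purpose)
    x3 = rng.normal(size=300)
    X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

    for i, name in enumerate(X.columns):
        if name == "const":
            continue
        vif = variance_inflation_factor(X.values, i)
        print(f"{name}: VIF = {vif:.1f}, tolerance = {1 / vif:.3f}")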
8. Test for outliers: unusual values of Y for combinations of Xs
Do NOT plot residuals against the observed values of Y (that plot will always have a positive
slope = 1 - R²).
Instead, plot residuals against the expected (fitted) values of Y.
Cook's distance tells you how much the beta coefficients will change if a particular case
(outlier) is removed. If Cook's distance is > 1, then it's a case with a particularly big
influence and should be double checked to make sure there is no measurement error.
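A short sketch of the same checks with statsmodels, using toy data with one planted influential point:

    import numpy as np
    import statsmodels.api as sm

    # Toy data with one planted influential case (high leverage, large residual).
    rng = np.random.default_rng(5)
    x = rng.normal(size=100)
    y = 3 * x + rng.normal(size=100)
    x[0], y[0] = 4.0, -20.0

    fit = sm.OLS(y, sm.add_constant(x)).fit()
    cooks_d = fit.get_influence().cooks_distance[0]   # one value per case
    print(np.where(cooks_d > 1)[0])                   # cases worth double-checking

    # Plot residuals against the fitted (expected) values, not the observed Y:
    # import matplotlib.pyplot as plt; plt.scatter(fit.fittedvalues, fit.resid)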
9. Choosing the best model (for prediction, rather than explanation):
Among the different methods (forward; backward; stepwise; best subset), backward
elimination is often the best (start with all Xs in the model, take out the most
nonsignificant predictor, and repeat until only significant predictors are left). There
are better ways, but a good rule of thumb is: do it several ways, and if you get a
different answer, then be cautious and get more help.
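As an illustration only, here is a crude backward-elimination loop in Python/statsmodels. The function and argument names are hypothetical, it assumes continuous or 0/1 predictors, and automated selection has well-known pitfalls, which is why the text advises trying several methods:

    import statsmodels.formula.api as smf

    def backward_eliminate(df, outcome, candidates, alpha=0.05):
        """Repeatedly drop the least significant predictor until all remaining
        predictors have P <= alpha. Illustrative sketch, not a recommendation."""
        predictors = list(candidates)
        while predictors:
            fit = smf.ols(f"{outcome} ~ " + " + ".join(predictors), data=df).fit()
            pvals = fit.pvalues.drop("Intercept")
            worst = pvals.idxmax()                 # most nonsignificant predictor
            if pvals[worst] <= alpha:
                return fit                         # everything left is significant
            predictors.remove(worst)
        return None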
10. Multiple Linear Regression in SPSS:
SPSS: Analyze: Regression: Linear only allows continuous variables as predictors.
SPSS: Analyze: General Linear Model: Univariate allows different kinds of predictors.
Logistic Regression
1. Logistic regression models: 2 common goals (test associations vs. make predictions)
If the outcome (Y) variable is dichotomous, then logistic regression allows you to assess
the association between Y and any type of X variable (nominal, ordinal, or interval),
while controlling for other variables (other Xs).
Logistic regression models also allow you to make predictions: for any combination of
predictor (X) variables, what is the probability that Y=1?
2. Logistic equation:
natural log of (odds that Y=1) = b0 + b1X1 + b2X2 + b3X3
3. Beta coefficients in logistic model:
For each of the X variables, there will be a beta coefficient. There will also be a Y
intercept term (except for case control studies).
ln(odds Y=1) = b0 + b1X1 + b2X2 + b3X3
Odds(Y=1) = e^(b0 + b1X1 + b2X2 + b3X3)
If there are no interaction terms, then the odds ratio (OR) for the relationship between Y
and any X is simply e^b, where b is the beta coefficient for that particular X. This odds
ratio is adjusted for all the other Xs in the model.
If X is an ordinal or interval variable, then the odds ratio (e^b) measures the relative
change in odds for every one-unit change in the X variable.
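A minimal statsmodels sketch of these mechanics on simulated data (the variable names and effect sizes are made up): fit a logistic model, exponentiate the beta coefficients to get adjusted ORs with confidence intervals, and use the fitted model for the prediction goal mentioned in item 1.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulated cohort: death depends on a 0/1 treatment and on age.
    rng = np.random.default_rng(6)
    n = 600
    df = pd.DataFrame({"treatment": rng.integers(0, 2, n), "age": rng.uniform(40, 80, n)})
    logit_p = -0.2 - 1.0 * df["treatment"] + 0.02 * (df["age"] - 60)
    df["death"] = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)

    fit = smf.logit("death ~ treatment + age", data=df).fit()
    odds_ratios = np.exp(fit.params)    # adjusted OR per one-unit change in each X
    conf_int = np.exp(fit.conf_int())   # 95% CI on the OR scale
    print(pd.concat([odds_ratios, conf_int], axis=1))

    # Prediction goal: probability that Y=1 for a given combination of Xs.
    print(fit.predict(pd.DataFrame({"treatment": [1], "age": [65]})))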
4. Interactions:
As with other regression models, you must force the computer to look for interactions.
If there are two X variables in the model, then the relationship between Y and X1
(measured as an OR) is adjusted for the average value of X2. However, if the relationship
between Y and X1 (OR) is different for different values of X2, then there is interaction.
5. Interaction is good to find. It means there are important differences among subgroups
of patients (subgroups defined by X2).
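When an interaction term is present, the OR for X1 is no longer a single number; it depends on the value of X2: OR for X1 given X2 = e^(b1 + b3*X2), where b3 is the beta for the interaction term. A tiny worked example with made-up betas:

    import math

    # Hypothetical betas: b1 for X1, b3 for the X1*X2 interaction term.
    b1, b3 = 0.4, 0.9
    for x2 in (0, 1):
        print(f"OR for X1 when X2 = {x2}: {math.exp(b1 + b3 * x2):.2f}")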
6. Sample Size:
You should consider only one X variable for every 10 events in the smaller of the two
subgroups of Y. Again, this rule applies to the total number of potential predictor
variables being considered, not the final number of significant predictors.
However, if the goal is just measuring the association between one main X variable and
Y, while adjusting for several possible confounders, then all the potential confounders
(all the other Xs) can be considered together as the equivalent of one other variable. In
this situation, you would need at least 20-30 events in the smallest subgroup of Y.
7. Ordinal or Interval Predictor Variables: Do they meet the assumption of the model?
There is a linearity assumption for logistic regression models just as there is for linear
regression models. The assumption is that for any change of 1 unit in the X variable, the
OR will be the same (ie, the OR for X=1 compared to X=2 will be the same as the OR for
X=4 compared to X=5). This is the same as saying: there is a straight-line relationship when you plot the X variable on the horizontal axis and the ln(odds Y=1) on the vertical
axis. If this assumption is violated, then the conclusions of the model will be misleading.
Unfortunately, there is no easy test for this assumption. Ideally, you need to visually
inspect the plot, which you must create yourself.
For dichotomous X variables, there is no problem, since this assumption is
automatically satisfied.
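One way to build such a plot yourself, sketched in Python on simulated data: bin the X variable, compute the observed log-odds of Y=1 within each bin, and check by eye whether the points fall roughly on a straight line.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Simulated data where the linearity assumption actually holds.
    rng = np.random.default_rng(7)
    x = rng.uniform(0, 10, 2000)
    y = (rng.random(2000) < 1 / (1 + np.exp(-(-3 + 0.6 * x)))).astype(int)

    df = pd.DataFrame({"x": x, "y": y})
    df["bin"] = pd.qcut(df["x"], 10)                       # deciles of X
    grouped = df.groupby("bin", observed=True).agg(x_mid=("x", "mean"), p=("y", "mean"))
    grouped["logit"] = np.log(grouped["p"] / (1 - grouped["p"]))   # observed ln(odds Y=1)

    plt.plot(grouped["x_mid"], grouped["logit"], "o-")
    plt.xlabel("X")
    plt.ylabel("ln(odds Y=1)")
    plt.show()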
8. Goodness of Fit Test:
Always inspect the goodness of fit test.
It is a test for logistic regression models that compares the expected to the observed
percentages of Y=1 for different combinations of Xs. If it is significant (small P value),
that's bad. It means the model doesn't fit the data well. In that case, look for
important interactions or look for problems with ordinal or interval X variables that
might not be satisfying the linearity assumption. Another reason might be too few
outcome events in one of the subgroups of Y.
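The observed-vs-expected comparison described here is commonly implemented as the Hosmer-Lemeshow test. Below is a rough hand-rolled sketch in Python (pandas and scipy); the function name, the decile grouping, and the `fit`/`df` objects in the usage comment are illustrative assumptions, not a validated routine.

    import numpy as np
    import pandas as pd
    from scipy.stats import chi2

    def hosmer_lemeshow(y_true, p_pred, groups=10):
        """Decile-of-risk goodness-of-fit check (Hosmer-Lemeshow style): compare
        observed vs. expected numbers of Y=1 within groups defined by the model's
        predicted probabilities."""
        d = pd.DataFrame({"y": np.asarray(y_true), "p": np.asarray(p_pred)})
        d["group"] = pd.qcut(d["p"], groups, duplicates="drop")
        g = d.groupby("group", observed=True)
        obs = g["y"].sum()    # observed events per group
        exp = g["p"].sum()    # expected events per group
        n = g.size()
        stat = (((obs - exp) ** 2) / (exp * (1 - exp / n))).sum()
        dof = len(obs) - 2
        return stat, chi2.sf(stat, dof)

    # Usage with a fitted statsmodels logit result named `fit` (hypothetical):
    # stat, p_value = hosmer_lemeshow(df["death"], fit.predict(df))
    # A small P value is bad: it suggests the model does not fit the data well.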