bms 617

Marshall University Genomics Core Facility

Marshall University School of Medicine
Department of Biochemistry and Microbiology

BMS 617

Lecture 12: Multiple and Logistic Regression

Multiple Regression

• In linear regression, we had one independent variable and one dependent (outcome) variable
  – In lab experiments, this is fairly common
  – The investigator manipulates the value of one variable and keeps everything else the same
• In some lab experiments, and in most observational studies, there is more than one independent variable
  – Multiple Regression is used for these scenarios
  – "Multiple Regression" really refers to a collection of different techniques


Aims of Multiple Regression

• Quantifying the effect of one variable of interest while adjusting for the effects of other variables
  – Very common in observational studies
  – The other variables change outside of the control of the investigator
  – These other variables are often called covariates
• Creating an equation which is useful for predicting the value of the outcome variable given the values of the various independent variables
  – For example, predict the probability of cancer recurrence after surgery alone given characteristics of the tumor (grade, stage, etc.) and of the patient (age, height, weight, etc.)
    • Might be used to decide whether or not to use chemotherapy in addition to surgery
• Developing a scientific understanding of the impact of several variables on the outcome


Types of Multiple Regression

• We will look at the following types of multiple regression (there are many others):
  – Multiple Linear Regression
    • The dependent variable is a linear function of the independent variables
  – Logistic Regression
    • The outcome variable is binary (dichotomous, or categorical with two possible outcomes)
    • The log odds of the outcome is modeled as a linear function of the independent variables
  – Proportional Hazards Regression
    • Proportional Hazards Regression is used when the outcome is the elapsed time to a non-recurring event
    • It is effectively used to compute the effect of independent variables on a survival curve


Multiple Linear Regression

• Multiple Linear Regression finds the linear equation which best predicts an outcome variable, Y, from multiple independent variables X1, X2, …, Xk
• Example (from Motulsky): Lead Exposure and Kidney Function
  – Staessen et al. (1992) investigated the relationship between lead concentration in the blood and kidney function
    • Kidney function measured by creatinine clearance
  – Observational study of 965 men
  – Naive approach would be to measure lead concentration and creatinine clearance and analyze just the two variables
  – However, kidney function is known to decrease with age, and lead accumulates in the blood over time
    • Age is a confounding variable
    • Must account for this


Multiple Regression Model

• The model Staessen et al. used was

  $Y_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \beta_3 X_{i,3} + \beta_4 X_{i,4} + \beta_5 X_{i,5} + \varepsilon_i$

• where the variables are

  Variable          Description
  $Y_i$             Creatinine clearance of subject i
  $X_{i,1}$         log(serum lead) of subject i
  $X_{i,2}$         Age of subject i
  $X_{i,3}$         Body mass of subject i
  $X_{i,4}$         log(GGT) of subject i (liver function)
  $X_{i,5}$         1 if subject i had previously taken diuretics, 0 otherwise
  $\varepsilon_i$   Random scatter


Multiple Regression Parameters

• The β values in the equation for the model are the parameters of the model
  – Do not vary from data point to data point
  – Are values associated with the population
  – Will be estimated from the data

• Note that one of the variables ($X_{i,5}$) is categorical, and we use a “dummy variable” in its place


What multiple regression does

• Multiple linear regression finds values for the parameters that make the model predict the actual data as well as possible

• Estimates for β0, … β5 are usually denoted b0 … b5

• Software performing the regression will report the best estimate for each parameter, a confidence interval and p-value for each estimate, and an R² value for the model

• Null hypotheses for the p-values are that the variable provides no information to the model, i.e. that the parameter is zero
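To make "finds values for the parameters" concrete, here is a minimal sketch of a least-squares fit with one continuous variable and one dummy variable. The data are synthetic and the true coefficients (100, -10, -8) are arbitrary choices for illustration; they are not the Staessen et al. data. The helper names are hypothetical, and statistical software would also report confidence intervals and p-values, which this sketch omits.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations update both sides together.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_multiple_linear(rows, y):
    """Least-squares estimates b0..bk for y = b0 + b1*x1 + ... + bk*xk."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 gives the intercept
    k = len(design[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Synthetic data (assumption: made-up coefficients and noise level).
rng = random.Random(0)
true_b = [100.0, -10.0, -8.0]
rows, y = [], []
for _ in range(500):
    x1 = rng.uniform(0.0, 2.0)          # continuous predictor
    x2 = float(rng.random() < 0.3)      # dummy variable: 0 or 1
    noise = rng.gauss(0.0, 5.0)         # random scatter
    rows.append((x1, x2))
    y.append(true_b[0] + true_b[1] * x1 + true_b[2] * x2 + noise)

estimates = fit_multiple_linear(rows, y)
print([round(b, 2) for b in estimates])  # should land near the true values
```

Solving the normal equations directly is fine for a small illustration; production software typically uses more numerically stable decompositions.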


Interpreting the Coefficients

• The coefficients can be interpreted in a similar way to the slope estimate in simple linear regression

• Each represents the change in the dependent variable for a one-unit increase in the corresponding independent variable, keeping all the other independent variables fixed

• In the example, b1 (estimate for log(lead concentration)) was -9.5 ml/min, with a 95% CI of [-18.1, -0.9].

• This means that for every one-unit increase in log(lead concentration), creatinine clearance decreased by 9.5 ml/min on average, if all other variables were kept fixed.



Statistical Significance of the Coefficients

• A one-unit increase in log(lead concentration) means a 10-fold increase in lead concentration (the logarithm is base 10)
• So the average decrease in creatinine clearance corresponding to a 10-fold increase in lead concentration was 9.5 ml/min, and the 95% confidence interval for the decrease was 0.9 ml/min to 18.1 ml/min.
  – Since the 95% CI does not contain 0, the p-value for this coefficient must be less than 0.05
    • This is the p-value for the null hypothesis that the coefficient is zero
    • Alternatively, think of this as a comparison of models:
      – Compare the full model (including this variable) to the model not including this variable
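The arithmetic behind these statements can be sketched in a few lines. The coefficient and confidence interval are the values quoted for the lead example; the doubling case is an added illustration, and the function names are hypothetical.

```python
import math

def outcome_change(coef_per_log10_unit, fold_change):
    """Average change in the outcome for a given fold change in a predictor
    that enters the model as log10(value), other variables held fixed."""
    return coef_per_log10_unit * math.log10(fold_change)

def ci_excludes_zero(lower, upper):
    """True when a confidence interval lies entirely on one side of zero.
    For a 95% CI this corresponds to p < 0.05 for the coefficient."""
    return lower > 0 or upper < 0

b1 = -9.5                 # ml/min per unit of log10(lead), from the example
ci_95 = (-18.1, -0.9)     # 95% CI for b1, from the example

print(outcome_change(b1, 10))            # 10-fold increase → -9.5
print(round(outcome_change(b1, 2), 2))   # doubling → -2.86
print(ci_excludes_zero(*ci_95))          # → True
```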


Interpreting coefficients for “dummy variables”

• One of the variables in the model was really a binary variable
  – Has the subject previously taken diuretics?
  – Coded as 0 for no and 1 for yes
• Estimate for the coefficient for this variable was -8.8 ml/min
  – An increase of one unit in this variable corresponds to a decrease in creatinine clearance of 8.8 ml/min, on average
  – Since the only values are 0 and 1, this means that participants who had previously taken diuretics had an average creatinine clearance 8.8 ml/min lower than those who had not, if all other variables are held equal


Interpreting the R² value for the model

• Multiple linear regression reports an R² value
  – For our example, R² is 0.27

• This means that 27% of the variation in creatinine clearance is accounted for by the model

• The remaining 73% is due to random scatter, or is associated with variables not included in the model

• Unlike simple linear regression, we cannot plot the model as a single line on a two-dimensional graph, because there are several independent variables

• One approach to visualizing the model is to plot the predicted outcome variable from the model against the actual measured value
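As a rough sketch of what the R² value measures, the function below computes it as one minus the ratio of the residual sum of squares to the total sum of squares. The five data points are made up for illustration; the same (actual, predicted) pairs are what one would place on the predicted-versus-actual plot.

```python
def r_squared(actual, predicted):
    """Fraction of the variation in the outcome accounted for by the model."""
    mean_y = sum(actual) / len(actual)
    ss_total = sum((a - mean_y) ** 2 for a in actual)               # scatter around the mean
    ss_resid = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # scatter around the model
    return 1.0 - ss_resid / ss_total

# Made-up measured values and model predictions (assumption, not study data).
actual = [10.0, 12.0, 15.0, 19.0, 24.0]
predicted = [11.0, 12.0, 14.0, 20.0, 23.0]

print(round(r_squared(actual, predicted), 3))  # → 0.968
```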


Multiple Linear Regression Plot


Variable Selection

• The authors of the article collected data on many more variables

• They stated that the other variables did not improve the fit of the model

• Adding additional parameters will almost always increase the R² value
  – Should use the sum-of-squares F test explained earlier to test if there really is an improvement in the model
  – Beware of overfitting (explained later)
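A minimal sketch of the F ratio for comparing a simpler model with a fuller one, assuming the test meant here is the usual extra-sum-of-squares comparison of nested models. The sums of squares and degrees of freedom below are invented numbers, and the function name is hypothetical. Converting F to a p-value needs the F distribution, which is not in the Python standard library, so that step is left out.

```python
def extra_sum_of_squares_f(ss_simple, df_simple, ss_full, df_full):
    """F ratio comparing nested models.

    ss_*: residual sum of squares of each model
    df_*: residual degrees of freedom (data points minus parameters)
    """
    # Relative improvement in fit per extra parameter...
    numerator = (ss_simple - ss_full) / (df_simple - df_full)
    # ...compared with the scatter remaining in the fuller model.
    denominator = ss_full / df_full
    return numerator / denominator

# Invented example: the fuller model has two extra parameters.
f_stat = extra_sum_of_squares_f(ss_simple=1200.0, df_simple=97,
                                ss_full=1000.0, df_full=95)
print(round(f_stat, 2))  # → 9.5
```

A large F suggests that the drop in the sum of squares is more than would be expected from adding parameters that carry no information.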


Logistic Regression

• Logistic Regression is used when the outcome variable is binary
  – i.e. categorical with two possible outcomes

• The general idea is to build a multiple linear model with the outcome variable being the log of the odds
  – i.e. we build a model predicting the log of the odds of one of the two outcomes from the independent variables
  – each parameter describes the change in the log odds when the corresponding variable changes by one unit; exponentiating it gives an odds ratio
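Since the model works on the log-odds scale, it helps to be able to move between probability, odds, and log odds. The following is a small sketch of those conversions; the function names are hypothetical.

```python
import math

def odds_from_probability(p):
    """Odds = probability of the outcome divided by probability of no outcome."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural log of the odds (the quantity the logistic model predicts)."""
    return math.log(odds_from_probability(p))

def probability_from_log_odds(z):
    """Inverse of log_odds: turns a predicted log odds back into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

print(odds_from_probability(0.75))     # → 3.0 (3 to 1)
print(log_odds(0.5))                   # → 0.0 (even odds)
print(probability_from_log_odds(0.0))  # → 0.5
```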


Logistic Regression Example

• We performed chart reviews on 99 post-menopausal women

• Ran a logistic regression for an outcome of diabetes with age at menopause, smoking status, and BMI as independent variables


Logistic Regression Results


Interpreting Logistic Regression Results

• The "Model Summary" box describes how well the model fits the data.
  – -2 Log likelihood is computed from the likelihood of our observed data given the model. Since the likelihood must be between 0 and 1, its log is never positive, so -2 log likelihood is never negative, and a smaller value means a better fit. (Our data do not fit the model well.)
  – R² cannot be calculated in the same way for logistic regression. The remaining two values give two alternative approaches, and their interpretation is similar to that of a regular R². Again, our data do not fit the model well.

• The "Classification Table" describes the accuracy of using the model as a predictor.
  – Use the independent variables to compute the predicted odds, and predict whichever class is more likely
  – Note that adding more variables will generally improve the apparent accuracy on the data used to fit the model; accuracy should really be tested on an independent data set
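Both summaries can be sketched from a list of observed outcomes and the model's predicted probabilities. The five outcomes and probabilities below are invented, the 0.5 cutoff is an assumption that matches "predict the more likely class", and the function names are hypothetical.

```python
import math

def minus_two_log_likelihood(outcomes, probabilities):
    """-2 times the log likelihood of the observed 0/1 outcomes,
    given each subject's predicted probability of outcome 1."""
    total = 0.0
    for y, p in zip(outcomes, probabilities):
        total += math.log(p if y == 1 else 1.0 - p)
    return -2.0 * total

def classification_table(outcomes, probabilities, cutoff=0.5):
    """Counts keyed by (observed, predicted) class."""
    table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for y, p in zip(outcomes, probabilities):
        predicted = 1 if p >= cutoff else 0
        table[(y, predicted)] += 1
    return table

def accuracy(table):
    """Fraction of subjects whose predicted class matches the observed class."""
    return (table[(0, 0)] + table[(1, 1)]) / sum(table.values())

# Invented outcomes and predicted probabilities (not the chart-review data).
outcomes = [1, 0, 1, 0, 1]
probabilities = [0.9, 0.2, 0.4, 0.6, 0.7]

table = classification_table(outcomes, probabilities)
print(table)
print(accuracy(table))  # → 0.6
print(minus_two_log_likelihood(outcomes, probabilities))
```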


Interpreting the Logistic Regression Parameters

• The "Variables in the Equation" box gives the parameter estimates, 95% CIs, and p-values

• The parameter for Smoking is 1.204. This means that a one-unit increase in the smoking variable results in an increase of 1.204 in the log odds of diabetes, with the other variables held fixed.
  – Logs here are natural logs, so the odds are multiplied by $e^{1.204} \approx 3.335$; this factor is the odds ratio
  – This is a dummy variable, so a smoker has about 3.3 times the odds of being diabetic as a non-smoker

• The parameter for BMI is 0.072; $e^{0.072} \approx 1.075$, so an increase of one unit in BMI results in a 1.075-fold increase in the odds of being diabetic.

• The p-values and 95% CIs show that the parameter for smoking is significant at a significance level of 0.05.

• BMI has a p-value of 0.055, so it is not significant at that level.
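Because the model is linear on the log-odds scale, a change of several units multiplies the odds by the one-unit odds ratio raised to that power. A small sketch using the BMI coefficient quoted above; the five-unit case is an added illustration and the function name is hypothetical.

```python
import math

def odds_ratio(coefficient, units=1.0):
    """Factor by which the odds are multiplied when a variable increases by
    `units`, for a logistic regression coefficient on the natural-log scale."""
    return math.exp(coefficient * units)

bmi_coefficient = 0.072  # from the example

print(round(odds_ratio(bmi_coefficient), 3))     # one BMI unit → 1.075
print(round(odds_ratio(bmi_coefficient, 5), 3))  # five BMI units → 1.433
```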


Mathematical Model for Logistic Regression

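• The mathematical setup for logistic regression is:

  $\log(\mathrm{odds}_i) = \beta_0 + \beta_1 X_{i,1} + \dots + \beta_k X_{i,k}$

• where the variables are
  – $\mathrm{odds}_i$: odds of the outcome for subject i
  – $X_{i,j}$: value of variable j for subject i

• For our model, writing S for the smoking variable and B for BMI, the estimates give

  $\log(\mathrm{odds}) = -3.307 + 1.208\,S + 0.071\,B$

  $\mathrm{odds} = e^{-3.307 + 1.208\,S + 0.071\,B}$

  $\mathrm{odds} = e^{-3.307}\, e^{1.208\,S}\, e^{0.071\,B} = 0.037 \times 3.347^{S} \times 1.073^{B}$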
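As a sketch of how such a fitted equation could be used for prediction, the code below plugs values into the estimates quoted above and converts the result to a probability. It assumes S is the 0/1 smoking indicator and B is BMI; the BMI of 30 is a hypothetical input, and the function names are illustrative.

```python
import math

# Estimates quoted for the example model.
B0, B_SMOKING, B_BMI = -3.307, 1.208, 0.071

def predicted_log_odds(smoker, bmi):
    """Linear predictor: log odds of diabetes (smoker is 0 or 1)."""
    return B0 + B_SMOKING * smoker + B_BMI * bmi

def predicted_odds(smoker, bmi):
    return math.exp(predicted_log_odds(smoker, bmi))

def predicted_probability(smoker, bmi):
    odds = predicted_odds(smoker, bmi)
    return odds / (1.0 + odds)

# Hypothetical subjects with BMI 30.
print(round(predicted_probability(smoker=1, bmi=30.0), 3))  # → 0.508
print(round(predicted_probability(smoker=0, bmi=30.0), 3))  # → 0.236

# The ratio of the two odds is e^1.208 whatever the BMI.
print(round(predicted_odds(1, 30.0) / predicted_odds(0, 30.0), 3))  # → 3.347
```

Note that the odds ratio for smoking is constant across BMI values, while the difference in predicted probability is not.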


Summary

• Multiple Linear Regression fits a dependent variable as a linear model of multiple independent variables
  – Provides parameter estimates for each independent variable, along with confidence intervals and p-values
  – The null hypothesis for the p-value is that the variable doesn't contribute to the model
  – Used for finding the effect of a variable while correcting for confounding variables
• Logistic regression is used when the dependent variable is binary
  – Models the log odds of the outcome as a linear function of the independent variables
  – Parameters are the increase in log odds per unit increase in the independent variable
