![Page 1: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/1.jpg)
Linear correlation and linear regression + summary of tests
Week 12 Regression Theory
![Page 2: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/2.jpg)
Recall: Covariance
1
))((),(cov 1
n
YyXxyx
n
iii
![Page 3: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/3.jpg)
Interpreting Covariance
cov(X,Y) > 0 X and Y are positively correlated
cov(X,Y) < 0 X and Y are inversely correlated
cov(X,Y) = 0 X and Y are independent
![Page 4: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/4.jpg)
Correlation coefficient
Pearson’s Correlation Coefficient is standardized covariance (unitless):
yx
yxariancer
varvar
),(cov
![Page 5: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/5.jpg)
Recall dice problem…
Var(x) = = 2.916666
Var(y) = 5.83333
Cov(xy) = 2.91666
707.2
1
8333.591666.2
91666.2r
5.707. 22 R Interpretation of R2: 50% of the total variation in the sum of the two dice is explained by the roll on the first die. Makes perfect intuitive sense!
R2=“Coefficient of Determination” = SSexplained/TSS
![Page 6: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/6.jpg)
Correlation
• Measures the relative strength of the linear relationship between two variables
• Unit-less
• Ranges between –1 and 1
• The closer to –1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker any positive linear relationship
![Page 7: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/7.jpg)
Scatter Plots of Data with Various Correlation Coefficients
Y
X
Y
X
Y
X
Y
X
Y
X
r = -1 r = -.6 r = 0
r = +.3r = +1
Y
Xr = 0
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
![Page 8: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/8.jpg)
Y
X
Y
X
Y
Y
X
X
Linear relationships Curvilinear relationships
Linear Correlation
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
![Page 9: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/9.jpg)
Y
X
Y
X
Y
Y
X
X
Strong relationships Weak relationships
Linear Correlation
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
![Page 10: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/10.jpg)
Linear Correlation
Y
X
Y
X
No relationship
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
![Page 11: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/11.jpg)
Linear regression
http://www.math.csusb.edu/faculty/stanton/m262/regress/regress.html
In correlation, the two variables are treated as equals. In regression, one variable is considered independent (=predictor) variable (X) and the other the dependent (=outcome) variable Y.
![Page 12: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/12.jpg)
What is “Linear”?
• Remember this:• Y=mX+B?
B
m
![Page 13: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/13.jpg)
What’s Slope?
A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.
![Page 14: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/14.jpg)
Simple linear regression
The linear regression model:
Love of Math = 5 + .01*math SAT score
intercept
slope
P=.22; not significant
![Page 15: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/15.jpg)
Prediction
If you know something about X, this knowledge helps you predict something about Y. (Sound familiar?…sound like conditional probabilities?)
![Page 16: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/16.jpg)
Linear Regression Model
Y’s are modeled…
Yi= 100*X + random errori
Follows a normal distribution
Fixed – exactly on the line
![Page 17: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/17.jpg)
Assumptions (or the fine print)
• Linear regression assumes that… – 1. The relationship between X and Y is linear– 2. Y is distributed normally at each value of X– 3. The variance of Y at every value of X is the same
(homogeneity of variances)
• Why? The math requires it—the mathematical process is called “least squares” because it fits the regression line by minimizing the squared errors from the line (mathematically easy, but not general—relies on above assumptions).
![Page 18: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/18.jpg)
Expected value of y:iy
ii xy ˆˆˆ
Expected value of y at level of x: xi=
![Page 19: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/19.jpg)
Residual
)ˆˆ(ˆ iiiii xyyye
We fit the regression coefficients such that sum of the squared residuals were minimized (least squares regression).
![Page 20: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/20.jpg)
Residual
Residual = observed value – predicted value
Predicted value
33.5 weeks
![Page 21: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/21.jpg)
Residual Analysis: check assumptions
• The residual for observation i, ei, is the difference between its observed and predicted value
• Check the assumptions of regression by examining the residuals– Examine for linearity assumption– Examine for constant variance for all levels of X (homoscedasticity) – Evaluate normal distribution assumption– Evaluate independence assumption
• Graphical Analysis of Residuals– Can plot residuals vs. X
iii YYe ˆ
![Page 22: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/22.jpg)
Residual Analysis for Linearity
Not Linear Linear
x
resi
dua
ls
x
Y
x
Y
x
resi
dua
ls
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
![Page 23: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/23.jpg)
Residual Analysis for Homoscedasticity
Non-constant variance Constant variance
x x
Y
x x
Y
resi
dua
ls
resi
dua
ls
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
![Page 24: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/24.jpg)
Residual Analysis for Independence
Not IndependentIndependent
X
Xresi
dua
ls
resi
dua
ls
X
resi
dua
ls
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
![Page 25: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/25.jpg)
As a linear regression…
Standard
Parameter Estimate Error t Value Pr > |t|
Intercept 657.5000000 23.66105065 27.79 <.0001
OddDay 81.7307692 32.81197359 2.49 0.0204
Intercept
represents the
mean value in the
even-day group.
It is significantly
different than 0—
so the average
Eng SAT score is
not 0.
Slope represents the
difference in means
between odd and even
groups. Diff is
significant.
![Page 26: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/26.jpg)
Multiple Linear Regression
• More than one predictor…
= + 1*X + 2 *W + 3 *Z
Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change of the predictor, if all other variables in the model were held constant.
![Page 27: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/27.jpg)
ANOVA is linear regression!
A categorical variable with more than two groups:E.g.: groups 1, 2, and 3 (mutually exclusive)
= (=value for group 1) + 1*(1 if in group 2) + 2 *(1 if in group 3)
This is called “dummy coding”—where multiple binary variables are created to represent being in each category (or not) of a categorical variable
![Page 28: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/28.jpg)
Other types of multivariate regression
Multiple linear regression is for normally distributed outcomes
Logistic regression is for binary outcomes
Cox proportional hazards regression is used when time-to-event is the outcome
![Page 29: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/29.jpg)
Overview of statistical tests
The following table gives the appropriate choice of a statistical test or measure of association for various types of data (outcome variables and predictor variables) by study design.
Continuous outcome
Binary predictorContinuous predictors
e.g., blood pressure= pounds + age + treatment (1/0)
![Page 30: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/30.jpg)
Alternative summary: statistics for various types of outcome data
Outcome Variable
Are the observations independent or correlated?
Assumptions
independent correlated
Continuous(e.g. pain scale, cognitive function)
TtestANOVALinear correlationLinear regression
Paired ttestRepeated-measures ANOVAMixed models/GEE modeling
Outcome is normally distributed (important for small samples).Outcome and predictor have a linear relationship.
Binary or categorical(e.g. fracture yes/no)
Difference in proportionsRelative risksChi-square test Logistic regression
McNemar’s testConditional logistic regressionGEE modeling
Chi-square test assumes sufficient numbers in each cell (>=5)
Time-to-event(e.g. time to fracture)
Kaplan-Meier statisticsCox regression
n/a Cox regression assumes proportional hazards between groups
![Page 31: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/31.jpg)
Continuous outcome (means)
Outcome Variable
Are the observations independent or correlated?Alternatives if the normality assumption is violated (and small sample size):
independent correlated
Continuous(e.g. pain scale, cognitive function)
Ttest: compares means between two independent groups
ANOVA: compares means between more than two independent groups
Pearson’s correlation coefficient (linear correlation): shows linear correlation between two continuous variables
Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes
Paired ttest: compares means between two related groups (e.g., the same subjects before and after)
Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time
Non-parametric statistics
Wilcoxon sign-rank test: non-parametric alternative to the paired ttest
Wilcoxon sum-rank test (=Mann-Whitney U test): non-parametric alternative to the ttest
Kruskal-Wallis test: non-parametric alternative to ANOVA
Spearman rank correlation coefficient: non-parametric alternative to Pearson’s correlation coefficient
![Page 32: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/32.jpg)
Binary or categorical outcomes (proportions)
Outcome Variable
Are the observations correlated? Alternative to the chi-square test if sparse cells:
independent correlated
Binary or categorical(e.g. fracture, yes/no)
Chi-square test: compares proportions between more than two groups
Relative risks: odds ratios or risk ratios
Logistic regression: multivariate technique used when outcome is binary; gives multivariate-adjusted odds ratios
McNemar’s chi-square test: compares binary outcome between correlated groups (e.g., before and after)
Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)
Fisher’s exact test: compares proportions between independent groups when there are sparse data (some cells <5).
McNemar’s exact test: compares proportions between correlated groups when there are sparse data (some cells <5).
![Page 33: Linear correlation and linear regression + summary of tests Week 12 Regression Theory](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f005503460f94c16069/html5/thumbnails/33.jpg)
Time-to-event outcome (survival data)
Outcome Variable
Are the observation groups independent or correlated? Modifications to Cox regression if proportional-hazards is violated:
independent correlated
Time-to-event (e.g., time to fracture)
Kaplan-Meier statistics: estimates survival functions for each group (usually displayed graphically); compares survival functions with log-rank test
Cox regression: Multivariate technique for time-to-event data; gives multivariate-adjusted hazard ratios
n/a (already over time)
Time-dependent predictors or time-dependent hazard ratios (tricky!)