linear models i - vital-it · 2013. 6. 28. · pitfalls in regression ecological regression –...
TRANSCRIPT
![Page 1: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/1.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Linear Models I
http://www.isrec.isb-sib.ch/~darlene/EMBnet/
5 10 15 20
1.0
1.2
1.4
1.6
1.8
Blood Glucose
Sho
rten
ing
Vel
ocity
Correlation and Regression
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Univariate Data (Review)Measurements on a single variable XConsider a continuous (numerical) variableSummarizing X– Numerically
• Center• Spread
– Graphically• Boxplot• Histogram
![Page 2: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/2.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Bivariate DataBivariate data are just what they sound like –data with measurements on two variables; let’s call them X and YHere, we are looking at two continuousvariablesWant to explore the relationship between the two variablesCan also look for association between two discrete variables; we won’t cover that here
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Scatterplot
We can graphically summarize a bivariate data set with a scatterplot (also sometimes called a scatter diagram)Plots values of one variable on the horizontal axis and values of the other on the vertical axisCan be used to see how values of 2 variables tend to move with each other (i.e. how the variables are associated)
![Page 3: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/3.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Scatterplot: positive association
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Scatterplot: negative association
![Page 4: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/4.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Scatterplot: real data example
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Numerical SummaryTypically, a bivariate data set is summarized numerically with 5 summary statisticsThese provide a fair summary for scatterplots with the same general shape as we just saw, like an oval or an ellipse
We can summarize each variable separately : X mean, X SD; Y mean, Y SD
But these numbers don’t tell us how the values of X and Y vary together
![Page 5: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/5.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Correlation CoefficientThe (sample) correlation coefficient r is defined as the average value of the product
(X in SUs)*(Y in SUs)
SU = standard units = (X – mean(X))/SD(X)
r is a unitless quantity-1 ≤ r ≤ 1
r is a measure of LINEAR ASSOCIATION
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
R: correlation
In R: > cor(x,y)
Note, however, that if there are missing values (NA), then you will get an error messageElementary statistical functions in R require
– no missing values, or– explicit statement of what to do with NA
![Page 6: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/6.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
R: NA in statistical functions
For single vector functions (e.g. mean, var, sd), give the argument na.rm=TRUE
For cor, though, there are more possibilities for dealing with NASee the argument use and the methods given there: ?cor
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
What r is...r is a measure of LINEAR ASSOCIATIONThe closer r is to –1 or 1, the more tightly the points on the scatterplot are clustered around a line
The sign of r (+ or -) is the same as the sign of the slope of the line
When r = 0, the points are not LINEARLY ASSOCIATED – this does NOT mean there is NO ASSOCIATION
![Page 7: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/7.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
...and what r is not
r is a measure of LINEAR ASSOCIATIONr does NOT tell us if Y is a function of Xr does NOT tell us if X causes Yr does NOT tell us if Y causes Xr does NOT tell us what the scatterplotlooks like
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
r ≈ 0: random scatter
![Page 8: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/8.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
r ≈ 0: curved relation
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
r ≈ 0: outliers
outliers
![Page 9: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/9.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
r ≈ 0: parallel lines
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
r ≈ 0: different linear trends
![Page 10: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/10.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Correlation is NOT causationYou cannot infer that since X and Y are highly correlated (r close to –1 or 1) that X is causinga change in YY could be causing XX and Y could both be varying along with a third, possibly unknown factor (either causal or not; often ‘time’ ):Polio and soft drinks: US polio cases tended to go up in summer, so do sales of soft drinks => does not mean that soft drinks cause polio
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Predicting shortening velocitySay we are interested in getting a value for shortening velocity (thuesen data)We could measure it, but that may be difficult/expensive/impractical/etc.If we have a measurement on a variable that is related to shortening velocity – such as blood glucose, say –then perhaps there would be some way to use that measurement to estimate or predict shortening velocity
What relation is suggested by the scatterplot?
![Page 11: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/11.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
SV vs. BG
5 10 15 20
1.0
1.2
1.4
1.6
1.8
Blood Glucose
Sho
rten
ing
Vel
ocity
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
(Simple) Linear RegressionRefers to drawing a (particular, special) line through a scatterplotUsed for 2 broad purposes:– Explanation– Prediction
Equation for a line to predict y knowing x (in slope-intercept form) looks like:
y = a + b*xa is called the intercept ; b is the slope
![Page 12: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/12.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Which line?
There are many possible linesthat could be drawn through the cloud of points in the scatterplot ...
How to choose?
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Regression PredictionThe regression prediction says:
when X goes up by 1 SD, predicted Y goes up
**NOT by 1 SD**, but by only r SDs (down if r negative)
This prediction can be expressed as a formula for a line in slope-intercept form:
predicted y = intercept + slope * x,
with slope = r * SD(Y)/SD(X)
intercept = mean(Y) – slope * mean(X)
![Page 13: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/13.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Least Squares
Q: Where does this equation come from?
A: It is the line that is ‘best’ in the sense that it minimizes the sum of the squared errors in the vertical (Y) direction
*
*
*
*
*
errors
X
Y
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Interpretation of parameters
The regression line has two parameters: the slope and the interceptThe regression slope is the average change in Y when X increases by 1 unitThe intercept is the predicted value for Y when X = 0If the slope = 0, then X does not help in predicting Y (linearly)
![Page 14: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/14.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Another view of the regression line
We can divide the scatterplot into regions (X-strips) based on values of X
Within each X-strip, plot the average value of Y (using only Y values that have X values in the X-strip)
This is the graph of averagesThe regression line can be thought of as a smoothed version of the graph of averages
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Scatterplot (again)
![Page 15: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/15.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Creating X-strips
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Graph of averages
![Page 16: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/16.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Residuals
There is an error in making a regression prediction:
error = observed Y – predicted Y
These errors are called residuals
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Pitfalls in regressionecological regression– when the units are aggregated, for example death
rates from lung cancer vs. percentage of smokers in cities => relationship can look stronger than it actually is (we don’t know whether it is the smokers that are dying of lung cancer)
extrapolation– don’t know what the relationship between X and Y looks
like outside the range of the dataregression effect/fallacy– test-retest and regression toward the mean
![Page 17: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/17.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Modeling OverviewWant to capture important features of the relationship between a (set of) variable(s)and one or more response(s)Many models are of the form
g(Y) = f(x) + error
Differences in the form of g, f and distributional assumptions about the error term
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Linear ModelingA simple linear model:
E(Y) = β0 + β1xGaussian measurement model:
Y = β0 + β1x + ε,where ε ~ N(0, σ2)More generally:
Y = Xβ + ε,where Y is n x 1, X is n x p, β is p x 1,ε is n x 1, often assumed N(0, σ2Inxn)
![Page 18: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/18.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Linear ModelingMore generally:
Y = Xβ + ε,Example
Y = β0 + β1x1 + β2x2 + ε,y: response variable, dependent variablex: explanatory variables, independent variables, covariatesβ : coefficients, parameters
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
R: linear modeling with lm
To compute regression coefficients (intercept and slope(s)) in R: lm(y ~ x)
Can read ~ as ‘described (or modeled) by ’Example : to predict ventricular shortening velocity from blood glucose:> lm(short.velocity ~ blood.glucose)Call:
lm(formula = short.velocity ~ blood.glucose)Coefficients:
(Intercept) blood.glucose 1.09781 0.02196
![Page 19: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/19.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
R: using lmYou can do much more complicated modeling with lm
The result of lm is a model object which contains additional information beyond what gets printed
To see some of these other quantities:Lm.result <- lm(Y ~ X)
summary(Lm.result)
Or summary(lm(Y ~ X))
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
R: summarizing lm> summary(lm(short.velocity~blood.glucose))Call:lm(formula = short.velocity ~ blood.glucose)
Residuals:Min 1Q Median 3Q Max
-0.40141 -0.14760 -0.02202 0.03001 0.43490
Coefficients:Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.09781 0.11748 9.345 6.26e-09 ***blood.glucose 0.02196 0.01045 2.101 0.0479 *---Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1
` ' 1
Residual standard error: 0.2167 on 21 degrees of freedomMultiple R-Squared: 0.1737, Adjusted R-squared: 0.1343 F-statistic: 4.414 on 1 and 21 DF, p-value: 0.0479
![Page 20: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/20.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Example in R attach(whiteside)Data: weekly gas consumption before
and after insulation of the house and external average temperature
print(Whiteside) Insul Temp Gas
1 Before -0.8 7.2…26 Before 10.2 2.627 After -0.7 4.8…56 After 9.7 1.5plot(Gas ~ Temp)points(Gas ~ Temp, whiteside, subset = Insul == "After", col="red")points(Gas ~ Temp, whiteside, subset = Insul == "Before", col="blue")lmgasB = lm(Gas ~ Temp, whiteside, subset = Insul == "Before")lmgasA = lm(Gas ~ Temp, whiteside, subset = Insul == "After")abline(lmgasA, col="red")abline(lmgasB, col="blue")
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
CIs and hypothesis tests
With some assumptions about the error distribution, you can make confidence intervals or carry out hypothesis tests :– for the regression line
– prediction interval for future observation
– hypothesis tests for coefficients
We will not worry about the details of these
![Page 21: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/21.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Basic model checkingExamination of residuals– Normality– Time effects– Nonconstant variance– Curvature
Detection of influential observations– Hat matrix
We will do a little of this in the practical
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
QQ-PlotQuantile-quantile plotAssess whether a
sample follows a particular (e.g. normal) distribution (or to compare two samples)
A method for looking for outliers when data are mostly normal
Sam
ple
Theoretical
Sample quantile is 0.125
Value from Normal distributionwhich yields a quantile of 0.125 (= -1.15)
![Page 22: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/22.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Typical deviations from straight line patterns
Outliers
Curvature at both ends (long or short tails)
Convex/concave curvature (asymmetry)
Horizontal segments, plateaus, gaps
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Outliers
![Page 23: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/23.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Long Tails
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Short Tails
![Page 24: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/24.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Asymmetry
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Plateaus/Gaps
![Page 25: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/25.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
Hat values
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
What to do?
![Page 26: Linear Models I - Vital-IT · 2013. 6. 28. · Pitfalls in regression ecological regression – when the units are aggregated, ... response variable, dependent variable x: explanatory](https://reader035.vdocuments.us/reader035/viewer/2022071404/60f7e8c4278c0e446372db83/html5/thumbnails/26.jpg)
EMBnet Course – Introduction to Statistics for Biologists 6 Mar 2007
LinearizationExample
if Y ≈ b * xc,then log(Y) ≈ log( b) + c log( x)
Y’≈ b’ + c x’
if Y ≈ b + exp(cx),then log(Y- b) ≈ log( exp(cx)) = cx
Y’≈ c x