correlation and linear regression microbiology 3053 microbiological procedures
TRANSCRIPT
Correlation and Linear Regression
Microbiology 3053
Microbiological Procedures
Correlation
• Correlation analysis is used when you have measured two continuous variables and want to quantify how consistently they vary together
• The stronger the correlation, the more accurately the value of one variable can be estimated from the other
• The direction and magnitude of correlation are quantified by Pearson’s correlation coefficient, r
  • Perfectly negative (-1.00) to perfectly positive (1.00)
  • No relationship (0.00)
Correlation
• The closer |r| is to 1, the stronger the relationship
  • r = 0 means that knowing the value of one variable tells us nothing about the value of the other
• Correlation analysis uses data that have already been collected
  • Archival
  • Data not produced by experimentation
• Correlation does not show cause and effect but may suggest such a relationship
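
As a minimal sketch of how Pearson’s r is computed from paired measurements, the function below uses only the standard library. The function name and the sample data are hypothetical, invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares about the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired measurements (not from the course data)
hours = [1, 2, 3, 4, 5]
colonies = [12, 19, 33, 38, 52]
r = pearson_r(hours, colonies)
print(f"r = {r:.3f}")
```

Swapping the two arguments gives the same r, which matches the point that correlation has no independent or dependent variable.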
Correlation ≠ Causation
• There is a strong, positive correlation between
  • the number of churches and bars in a town
  • smoking and alcoholism (consider the relationship between smoking and lung cancer)
  • students who eat breakfast and school performance
  • marijuana usage and heroin addiction (vs heroin addiction and marijuana usage)
Visualizing Correlation
• Scatterplots are used to illustrate correlation analysis
  • Assignment of axes does not matter (no independent and dependent variables)
  • Order in which data pairs are plotted does not matter
  • In strict usage, lines are not drawn through correlation scatterplots
Correlations
[Figure: three scatterplots. Strong Negative Correlation (r = -0.9960); Weak Positive Correlation (r = 0.266); No Correlation (r = 0.00)]
Linear Regression
• Used to measure the relationship between two variables
  • Prediction and a cause and effect relationship
  • Does one variable change in a consistent manner with another variable?
    • x = independent variable (cause)
    • y = dependent variable (effect)
• If it is not clear which variable is the cause and which is the effect, linear regression is probably an inappropriate test
Linear Regression
• Calculated from experimental data
  • Independent variable is under the control of the investigator (exact value)
  • Dependent variable is normally distributed
  • Differs from correlation, where both variables are normally distributed and selected at random by investigator
• Regression analysis with more than one independent variable is termed multiple (linear) regression
Linear Regression
• The best-fit line minimizes the sum of the squares of the distances of the data points from the predicted values (on the line)
[Figure: scatterplot of Dependent Variable against Independent Variable with fitted line y = 1.0092x + 8.6509, R² = 0.8863]
Linear Regression
• y = a + bx where
  • a = y intercept (point where x = 0 and the line passes through the y-axis)
  • b = slope of the line, (y2 - y1)/(x2 - x1)
• The slope indicates the nature of the correlation
  • Positive = y increases as x increases
  • Negative = y decreases as x increases
  • 0 = no correlation
    • Same as Pearson’s correlation
    • No relationship between the variables
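
A minimal sketch of how a and b in y = a + bx can be estimated by least squares is shown here. The function name and the dose/response numbers are hypothetical, and the closed-form expressions are the standard ones for a single independent variable.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept: the line passes through (mean_x, mean_y)
    return a, b

# Hypothetical values: x chosen by the investigator, y measured
x = [0, 10, 20, 30, 40]
y = [9.1, 18.7, 29.5, 38.2, 49.6]
a, b = fit_line(x, y)
predicted_at_25 = a + b * 25
print(f"y = {a:.3f} + {b:.3f}x; predicted y at x = 25: {predicted_at_25:.2f}")
```

A positive b here corresponds to the "y increases as x increases" case on the slide.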
Correlation Coefficient (r)
• Shows the strength of the linear relationship between two variables, symbolized by r
• The closer the data points are to the line, the closer r is to 1 or -1
  • r varies from -1 (perfect negative correlation) to 1 (perfect positive correlation)
    • 0 - 0.2 no or very weak association
    • 0.2 - 0.4 weak association
    • 0.4 - 0.6 moderate association
    • 0.6 - 0.8 strong association
    • 0.8 - 1.0 very strong to perfect association
  • null hypothesis is no association (r = 0)
  • Salkind, N. J. (2000) Statistics for people who think they hate statistics. Thousand Oaks, CA: Sage
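
The bands above can be written as a small lookup, sketched below. The function name is hypothetical. The slide’s ranges share their end points (0.2 appears in two bands), so this sketch assumes a boundary value belongs to the stronger band; that choice is not from the source.

```python
def describe_strength(r):
    """Map a correlation coefficient to the descriptive bands listed above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)  # strength depends on magnitude, not sign
    if size < 0.2:
        return "no or very weak association"
    if size < 0.4:
        return "weak association"
    if size < 0.6:
        return "moderate association"
    if size < 0.8:
        return "strong association"
    return "very strong to perfect association"

# The r values quoted on the scatterplot slide
for r in (-0.9960, 0.266, 0.00):
    print(r, "->", describe_strength(r))
```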
Coefficient of Determination (r²)
• Used to estimate the extent to which the dependent variable (y) is under the influence of the independent variable (x)
• r² (the square of the correlation coefficient)
  • Varies from 0 to 1
  • r² = 1 means that the value of y is completely dependent on x (no error or other contributing factors)
• r² < 1 indicates that the value of y is influenced by more than the value of x
Coefficient of Determination
• A measurement of the proportion of variance of y explained by its dependence on x
  • Remainder (1 - r²) is the variance of y that is not explained by x (i.e., error or other factors)
  • e.g., if r² = 0.84, it shows a strong relationship between the variables: x accounts for 84% of the variability of y (and 16% is due to other factors). Whether the relationship is positive or negative comes from the sign of the slope, not from r²
• r² can be calculated for correlation analysis by squaring r, but
  • it is not a measure of variation of y explained by variation in x
  • variation in y is associated with the variance of x (and vice versa)
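
To make "proportion of variance explained" concrete, the sketch below computes r² as 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name and the four data points are hypothetical.

```python
def r_squared(xs, ys):
    """Proportion of the variance of y explained by the least-squares line on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Variation left over after fitting the line
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y about its mean
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data with some scatter about the line
xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]
r2 = r_squared(xs, ys)
print(f"explained: {r2:.0%}, unexplained: {1 - r2:.0%}")
```

For a single independent variable this value equals the square of Pearson’s r for the same pairs.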
Assumptions of Linear Regression
• Independent variable (x) is selected by investigator (not random) and has no associated variance
• For every value of x, values of y have a normal distribution
• Observed values of y differ from the mean value of y at that x (the value on the regression line) by an amount called a residual. (Residuals are normally distributed.)
• The variances of y for all values of x are equal (homoscedasticity)
• Observations are independent (Each individual in the sample is only measured once.)
Linear Regression Data
• Anscombe, F. J. 1973. Graphs in Statistical Analysis. The American Statistician 27(1):17-21.
• The numbers alone do not guarantee that the data have been fitted well!
• Figure 1: Acceptable regression model with observations distributed evenly around the regression line
• Figure 2: Strong curvature suggests that linear regression may not be appropriate (an additional variable may be required)
• Figure 3: A single outlier alters the slope of the line. The point may be erroneous but if not, a different test may be necessary
• Figure 4: Actually a regression line connecting only two points. If the rightmost point were different, the regression line would shift.
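
The point that the numbers alone can mislead can be checked directly. The sketch below fits a line to each of Anscombe’s four data sets, with values as commonly reproduced from the 1973 paper; the helper name is hypothetical. The four summaries come out nearly identical even though the scatterplots look very different.

```python
import math

# Anscombe's four data sets, as commonly reproduced from the 1973 paper
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Return (intercept, slope, r) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope, sxy / math.sqrt(sxx * syy)

for name, (xs, ys) in QUARTET.items():
    a, b, r = summarize(xs, ys)
    print(f"Set {name}: y = {a:.2f} + {b:.3f}x, r = {r:.3f}")
```

Plotting each set, or its residuals, is what separates the acceptable fit from the curved, outlier-driven, and two-point cases.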
What if we’re not sure if linear regression is appropriate?
Residuals
• Homoscedastic
  • Spread of residuals appears random
  • Good regression model
• Heteroscedastic
  • “Funnel” shaped and may be bowed
  • Suggests that a transformation and inclusion of additional variables may be warranted
• Helsel, D.R., and R.M. Hirsch. 2002. Statistical Methods in Water Resources. USGS (http://water.usgs.gov/pubs/twri/twri4a3/)
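
As a rough sketch of what a funnel shape means numerically, the code below computes residuals from a least-squares fit and compares their spread in the upper and lower halves of x. The function names, the made-up data, and the half-and-half comparison are all assumptions for illustration; the comparison is a crude stand-in for inspecting the residual plot, not a formal test.

```python
import math

def residuals(xs, ys):
    """Observed y minus the y predicted by the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def spread_ratio(xs, res):
    """RMS residual in the upper half of x divided by that in the lower half."""
    ordered = [r for _, r in sorted(zip(xs, res))]
    half = len(ordered) // 2
    low, high = ordered[:half], ordered[-half:]
    rms = lambda values: math.sqrt(sum(v * v for v in values) / len(values))
    return rms(high) / rms(low)

# Hypothetical data whose scatter grows with x (a funnel)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.3, 7.6, 10.8, 11.0, 15.5, 14.4]
res = residuals(xs, ys)
ratio = spread_ratio(xs, res)
print(f"upper/lower spread ratio: {ratio:.1f}")
```

A ratio near 1 is what a homoscedastic data set would tend to give; a ratio well above 1 suggests the funnel pattern.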
[Figure: residual plots for Data Sets 1 to 4, each plotting Residuals against X Variable 1]
Outliers
• Values that appear very different from others in the data set
  • Rule of thumb: an outlier is more than three standard deviations from mean
• Three causes
  • Measurement or recording error
  • Observation from a different population
  • A rare event from within the population
• Outliers need to be considered and not simply dismissed
  • May indicate important phenomenon
  • e.g., ozone hole data (outliers removed automatically by analysis program, delaying observation about 10 years)
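
The three-standard-deviation rule of thumb can be sketched as follows. The function name and the readings are hypothetical. In keeping with the slide, the function only flags values for review and removes nothing.

```python
import statistics

def flag_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` sample SDs from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) > threshold * sd]

# Hypothetical readings with one suspicious value
readings = [9, 10, 11, 10, 9, 11, 10, 10, 9, 11,
            10, 10, 9, 11, 10, 10, 9, 11, 10, 10, 50]
flagged = flag_outliers(readings)
print("flagged for review:", flagged)
```

One caveat with this rule: the largest possible deviation in a sample of n values is (n - 1)/√n sample standard deviations, so with 10 or fewer values nothing can ever exceed the 3 SD threshold.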
Outliers
Helsel, D.R., and R.M. Hirsch. 2002. Statistical Methods in Water Resources. USGS (http://water.usgs.gov/pubs/twri/twri4a3/)
When is Linear Regression Appropriate?
• Data should be interval or ratio
• The dependent and independent variables should be identifiable
• The relationship between variables should be linear (if not, a transformation might be appropriate)
• Have you chosen the values of the independent variable?
• Does the residual plot show a random spread (homoscedastic) and does the normal probability plot display a straight line (or does a histogram of residuals show a normal distribution)?
(Normal Probability Plot of Residuals)
The normal probability plot indicates whether the residuals follow a normal distribution, in which case the points will follow a straight line. Expect some moderate scatter even with normal data. Look only for definite patterns like an "S-shaped" curve, which indicates that a transformation of the response may provide a better analysis. (from Design Expert 7.0 from Stat-Ease)
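
One way to produce the points for such a plot is sketched below: each sorted residual is paired with the standard normal quantile expected at its rank. The function name and the residual values are hypothetical, and the plotting position (i - 0.5)/n is one common convention among several. This is not necessarily how Design Expert or a spreadsheet draws the plot.

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Pair each sorted residual with the normal quantile expected at its rank."""
    n = len(residuals)
    ordered = sorted(residuals)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    expected = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, ordered))

# Hypothetical residuals from a fitted line
points = normal_plot_points([0.3, -0.1, 0.0, 0.1, -0.3])
for quantile, residual in points:
    print(f"{quantile:6.2f}  {residual:5.2f}")
```

If the residuals are close to normal, the printed pairs fall near a straight line when plotted against each other.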
(Histogram of Residuals Distribution)
Lineweaver-Burk Plot
The Michaelis-Menten equation to describe enzyme activity:
$$v_o = \frac{V_{max}[S]}{K_m + [S]}$$
is linearized by taking its reciprocal:
$$\frac{1}{v_o} = \frac{K_m}{V_{max}} \cdot \frac{1}{[S]} + \frac{1}{V_{max}}$$
where: y = 1/vo
x = 1/[S]
a = 1/Vmax
b = Km/Vmax
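
The linearization can be checked with a round trip, sketched below: generate rates from the Michaelis-Menten equation, fit a line to the reciprocals, and recover Vmax = 1/a and Km = b/a. The function names and the parameter values (Vmax = 100, Km = 50) are assumptions chosen for illustration, and the synthetic data are noise-free.

```python
def michaelis_menten(s, vmax, km):
    """Initial rate v_o at substrate concentration s."""
    return vmax * s / (km + s)

def lineweaver_burk_fit(s_values, v_values):
    """Estimate (vmax, km) by least squares on y = 1/v against x = 1/s."""
    xs = [1 / s for s in s_values]
    ys = [1 / v for v in v_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope = Km / Vmax
    a = mean_y - b * mean_x  # intercept = 1 / Vmax
    return 1 / a, b / a

# Assumed parameters and substrate concentrations, for illustration only
substrate = [10, 20, 40, 80, 160]
rates = [michaelis_menten(s, vmax=100.0, km=50.0) for s in substrate]
vmax_est, km_est = lineweaver_burk_fit(substrate, rates)
print(f"Vmax = {vmax_est:.2f}, Km = {km_est:.2f}")
```

With real, noisy measurements the reciprocals of the smallest rates carry the largest errors, which is one reason to inspect the residual plot after fitting.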
Mock Enzyme Experiment
[Figure: Michaelis-Menten Plot of v (pennies/min) against S (pennies/m^2)]
Mock Enzyme Experiment
[Figure: Lineweaver-Burk Plot of 1/v (pennies/min)^-1 against 1/S (pennies/m^2)^-1 with fitted line y = 0.7053x + 0.0076, R² = 0.9785]
Mock Enzyme Experiment
[Figure: Eadie-Hofstee plot of v (pennies/min) against v/S (m^2/min) with fitted line y = -85.671x + 124.48, R² = 0.8543]
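
The fitted lines reported on the Lineweaver-Burk and Eadie-Hofstee plots can be turned back into kinetic parameters, as sketched below. The coefficients are the ones printed on the slides; the variable names are hypothetical. The Eadie-Hofstee relations used here (slope = -Km, intercept = Vmax, from v = Vmax - Km·v/[S]) are the standard form and are not stated in the source.

```python
# Coefficients reported on the two plots
lb_slope, lb_intercept = 0.7053, 0.0076    # Lineweaver-Burk: y = 0.7053x + 0.0076
eh_slope, eh_intercept = -85.671, 124.48   # Eadie-Hofstee:   y = -85.671x + 124.48

# Lineweaver-Burk: intercept a = 1/Vmax, slope b = Km/Vmax
lb_vmax = 1 / lb_intercept
lb_km = lb_slope / lb_intercept

# Eadie-Hofstee (standard form): intercept = Vmax, slope = -Km
eh_vmax = eh_intercept
eh_km = -eh_slope

print(f"Lineweaver-Burk: Vmax = {lb_vmax:.1f} pennies/min, Km = {lb_km:.1f} pennies/m^2")
print(f"Eadie-Hofstee:   Vmax = {eh_vmax:.1f} pennies/min, Km = {eh_km:.1f} pennies/m^2")
```

The two linearizations give somewhat different estimates from the same experiment, which fits the earlier point that a high R² alone does not settle whether a fit is good.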
Mock Enzyme Experiment
[Figure: Residual Plot of Residuals against X Variable]
Mock Enzyme Experiment
[Figure: Normal Probability Plot of Y against Sample Percentile]