lecture 12 correlation and linear regression y = ax + b the least squares method of carl friedrich...
Post on 20-Dec-2015
Lecture 12: Correlation and linear regression
y = ax + b
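\[ D = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} [y_i - (a x_i + b)]^2 \]
\[ \frac{\partial D}{\partial a} = -2 \sum_{i=1}^{n} x_i (y_i - a x_i - b) = 0 \]
\[ \frac{\partial D}{\partial b} = -2 \sum_{i=1}^{n} (y_i - a x_i - b) = 0 \]
\[ a = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} \]
\[ b = \bar{y} - a \bar{x} \]
\[ y = a x + (\bar{y} - a \bar{x}) \;\Rightarrow\; y - \bar{y} = a (x - \bar{x}) \]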
The least squares method of Carl Friedrich Gauß.
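[Figure: scatter plot of Y against X (both axes 0 to 20) with the OLRy regression line; Δy is the vertical distance of a point from the line and Δy² its square.]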
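\[ a = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} = \frac{\frac{1}{n} \sum_{i=1}^{n} x_i y_i - \bar{x} \bar{y}}{\frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2} \]
The numerator s_{xy} is the covariance of x and y; the denominator s_x^2 is the variance of x.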
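Correlation coefficient
\[ r = \frac{s_{xy}}{s_x s_y} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \]
\[ r^2 = \frac{\sigma_{xy}^2}{\sigma_x^2 \sigma_y^2} \]
Coefficient of determination
\[ R^2 = \frac{\text{Explained variance}}{\text{Total variance}} \]
\[ s_{xy} = a s_x^2 = r s_x s_y \;\Rightarrow\; a = r \frac{s_y}{s_x} \]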
Slope a and coefficient of correlation r are zero if the covariance is zero.
\[ -1 \le r \le 1 \]
\[ 0 \le r^2 \le 1 \]
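The formulas above can be tried out directly. The following is a minimal sketch in plain Python, assuming the brachypterous (y) and macropterous (x) species counts from the data table that appears later in these notes; the function name `ols_fit` is only illustrative.

```python
# Macropterous (x) and brachypterous (y) species counts on the 17 islands,
# copied from the raw data table in these notes.
x = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]
y = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]


def ols_fit(x, y):
    """Return slope a, intercept b and correlation r for y = a*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and variances with the 1/n normalisation.
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) / n - mean_x * mean_y
    s_xx = sum(xi * xi for xi in x) / n - mean_x ** 2
    s_yy = sum(yi * yi for yi in y) / n - mean_y ** 2
    a = s_xy / s_xx                  # slope = covariance / variance of x
    b = mean_y - a * mean_x          # the line passes through the means
    r = s_xy / (s_xx * s_yy) ** 0.5  # coefficient of correlation
    return a, b, r


a, b, r = ols_fit(x, y)
print(f"a = {a:.4f}, b = {b:.4f}, r = {r:.4f}, r^2 = {r * r:.4f}")
# prints: a = 0.1920, b = 0.4671, r = 0.4151, r^2 = 0.1723
```

The printed values agree with the trend line y = 0.192x + 0.4671 and R² = 0.1723 quoted for this data set.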
[Figure: brachypterous species against macropterous species; y = 0.192x + 0.4671, R² = 0.1723]
[Figure: dimorphic species against macropterous species; y = 0.3875x + 3.7188, R² = 0.4455]
Relationships between macropterous, dimorphic and brachypterous ground beetles on 17 Mazurian lake islands

Brachypterous against macropterous species: positive correlation, r = √(r²) = 0.41. The regression is weak. Macropterous species richness explains only 17% of the variance in brachypterous species richness. Some islands have no brachypterous species. We do not really know which variable is the independent one. There is no clear-cut logical connection.

Dimorphic against macropterous species: positive correlation, r = √(r²) = 0.67. The regression is moderate. Macropterous species richness explains only 45% of the variance in dimorphic species richness. The relationship appears to be non-linear. A log-transformation is indicated (there are no zero counts). We do not really know which variable is the independent one. There is no clear-cut logical connection.
[Figure: brachypterous species against isolation; y = -36.203x + 5.5585, R² = 0.2311]
[Figure: brachypterous species against ln area; y = 0.4894x + 22.094, R² = 0.0037]
Brachypterous species against isolation: negative correlation, r = -√(r²) = -0.48. The regression is weak. Island isolation explains only 23% of the variance in brachypterous species richness. There are two apparent outliers. Without them the whole relationship would vanish, i.e. R² ≈ 0. Outliers have to be eliminated from regression analysis. We have a clear hypothesis about the logical relationship: isolation should be the predictor of species richness.

Brachypterous species against ln area: no correlation, r = √(r²) = 0.06. The regression slope is nearly zero. Area explains less than 1% of the variance in brachypterous species richness. We have a clear hypothesis about the logical relationship: area should be the predictor of species richness.
The matrix perspective
[Figure: brachypterous species against macropterous species; y = 0.192x + 0.4671, R² = 0.1723]
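\[ \begin{pmatrix} 4 \\ 6 \\ 3 \\ \dots \\ 2 \end{pmatrix} = a_1 \begin{pmatrix} 7 \\ 12 \\ 13 \\ \dots \\ 6 \end{pmatrix} + a_0 \begin{pmatrix} 1 \\ 1 \\ 1 \\ \dots \\ 1 \end{pmatrix} = \begin{pmatrix} 7 & 1 \\ 12 & 1 \\ 13 & 1 \\ \dots & \dots \\ 6 & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_0 \end{pmatrix} \]
\[ Y = X a \]
X is not square, so it does not possess an inverse. Multiplying both sides by X^T gives the square matrix X^T X, which can be inverted:
\[ Y = X a \;\Rightarrow\; X^T Y = X^T X a \]
\[ (X^T X)^{-1} X^T Y = (X^T X)^{-1} X^T X a = I a = a \]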
Brachy  Macro  Constant
4       7      1
6       12     1
3       13     1
4       18     1
1       10     1
4       14     1
2       7      1
5       22     1
1       9      1
0       7      1
0       15     1
0       13     1
1       8      1
4       10     1
2       8      1
6       14     1
2       6      1

Transpose X^T
Macro     7  12  13  ...
Constant  1   1   1  ...

Dispersion matrix X^T X
2499  193
193   17

Inverse (X^T X)^{-1}
 0.003248  -0.03687
-0.03687    0.477455

X^T Y
570
45

Coefficients
a1  0.192014
a0  0.467138

\[ Y = X a \;\Rightarrow\; a = (X^T X)^{-1} X^T Y \]
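The matrix calculation can be reproduced step by step. This is a small sketch in plain Python with lists of lists; the helper names (`transpose`, `matmul`, `inverse_2x2`) are made up for this illustration, and a numerical library would normally be used instead.

```python
# Macropterous (x) and brachypterous (y) species counts from the table.
x = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]
y = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]

# Design matrix X: one row per island, columns (Macro, Constant).
X = [[xi, 1.0] for xi in x]
Y = [[yi] for yi in y]  # response as a column vector


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(p, q):
    """Matrix product of two lists of rows."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*q)]
            for row in p]


def inverse_2x2(m):
    (p, q), (r, s) = m
    det = p * s - q * r
    if det == 0:
        raise ValueError("matrix is singular, no inverse exists")
    return [[s / det, -q / det], [-r / det, p / det]]


Xt = transpose(X)
XtX = matmul(Xt, X)                   # dispersion matrix
XtY = matmul(Xt, Y)
coef = matmul(inverse_2x2(XtX), XtY)  # a = (X^T X)^-1 X^T Y
a1, a0 = coef[0][0], coef[1][0]

print(XtX)  # [[2499.0, 193.0], [193.0, 17.0]]
print(XtY)  # [[570.0], [45.0]]
print(f"a1 = {a1:.6f}, a0 = {a0:.6f}")  # a1 = 0.192014, a0 = 0.467138
```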
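Dispersion matrix
\[ X^T X = \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \cdot const \\ \sum_{i=1}^{n} x_i \cdot const & \sum_{i=1}^{n} const^2 \end{pmatrix} = \begin{pmatrix} 2499 & 193 \\ 193 & 17 \end{pmatrix} \]
Variance
\[ s_x^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2 \]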
Covariance
\[ s_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n} \sum_{i=1}^{n} x_i y_i - \bar{x} \bar{y} \]
Raw data X: 17 islands in rows, columns Brachy, Macro, Constant.
Arithmetic mean M: a matrix of the same shape in which every row is (2.64706, 11.3529, 1).

Dispersion matrix X^T X
          Brachy  Macro  Constant
Brachy    185     570    45
Macro     570     2499   193
Constant  45      193    17

Squared means M^T M
          Brachy  Macro   Constant
Brachy    119.12  510.88  45
Macro     510.88  2191.1  193
Constant  45      193     17

X^T X - M^T M
          Brachy  Macro   Constant
Brachy    65.882  59.118  0
Macro     59.118  307.88  0
Constant  0       0       0

Variance and covariance: (X^T X - M^T M) / 17
          Brachy  Macro   Constant
Brachy    3.8754  3.4775  0
Macro     3.4775  18.111  0
Constant  0       0       0

\[ \Sigma = \frac{1}{n} (X^T X - M^T M) = \frac{1}{n} [(X - M)^T (X - M)] \]
\[ \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \dots & \sigma_{1n} \\ \sigma_{21} & \sigma_2^2 & \dots & \sigma_{2n} \\ \dots & \dots & \dots & \dots \\ \sigma_{n1} & \sigma_{n2} & \dots & \sigma_n^2 \end{pmatrix} \]
The covariance matrix is square and symmetric. The diagonal holds the variances, the off-diagonal elements the covariances.
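\[ R = \begin{pmatrix} 1/\sigma_x & 0 \\ 0 & 1/\sigma_y \end{pmatrix} \Sigma \begin{pmatrix} 1/\sigma_x & 0 \\ 0 & 1/\sigma_y \end{pmatrix} = V \Sigma V, \qquad r_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \]

Covariance Σ (variable order: Brachy, Macro)
3.8754  3.4775
3.4775  18.111

V (diagonal matrix of 1/σ)
0.508  0
0      0.235

VΣ
1.9686  1.7665
0.8171  4.2557

VΣV
1       0.4151
0.4151  1

r = 0.4151, r² = 0.1723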
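The same numbers can be obtained in a few lines. This is a sketch in plain Python, assuming the beetle counts from the raw data table; `covariance_matrix` and `correlation_matrix` are illustrative names, and the constant column is left out because its variance and covariances are zero.

```python
brachy = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]
macro = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]


def covariance_matrix(columns):
    """Sigma = (X^T X - M^T M) / n for a list of data columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    # Element (j, k): mean of the products minus the product of the means.
    return [[sum(u * v for u, v in zip(cj, ck)) / n - mj * mk
             for ck, mk in zip(columns, means)]
            for cj, mj in zip(columns, means)]


def correlation_matrix(sigma):
    """R = V Sigma V with V = diag(1 / standard deviation)."""
    inv_sd = [1 / sigma[i][i] ** 0.5 for i in range(len(sigma))]
    size = len(sigma)
    return [[inv_sd[j] * sigma[j][k] * inv_sd[k] for k in range(size)]
            for j in range(size)]


sigma = covariance_matrix([brachy, macro])
corr = correlation_matrix(sigma)

for row in sigma:
    print([round(v, 4) for v in row])
# prints: [3.8754, 3.4775] then [3.4775, 18.1107]
print(round(corr[0][1], 4))  # 0.4151
```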
Non-linear relationships
The species-individuals relationship is obviously non-linear.

[Figure: ground beetles on Mazurian lake islands. Species richness against number of individuals, fitted with three functions and shown on linear and on log-log axes.]
Linear function: y = 0.0056x + 24.305, R² = 0.2963
Logarithmic function: y = 6.0987 ln(x) - 8.3513, R² = 0.6003
Power function: y = 6.7337x^0.2306, R² = 0.67

\[ S = 6.733 I^{0.2308} \]
\[ \ln S = \ln(6.733) + 0.2308 \ln I = 1.907 + 0.2308 \ln I \]
In the log-transformed form, 1.907 is the intercept and 0.2308 the slope.
The power function has the highest R² and therefore explains most of the variance in species richness. The coefficient of determination is a measure of goodness of fit.
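A power function becomes a straight line after taking logarithms of both variables, so it can be fitted with ordinary linear regression. The sketch below illustrates this on synthetic, noise-free points generated from the curve S = 6.733 I^0.2308; the sample values of `individuals` and the function name `fit_power_law` are assumptions made for the example, not the island data.

```python
import math


def fit_power_law(x, y):
    """Fit y = c * x**z by least squares on ln y against ln x."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mean_lx = sum(lx) / n
    mean_ly = sum(ly) / n
    s_xy = sum((u - mean_lx) * (v - mean_ly) for u, v in zip(lx, ly))
    s_xx = sum((u - mean_lx) ** 2 for u in lx)
    z = s_xy / s_xx                      # slope in log-log space
    c = math.exp(mean_ly - z * mean_lx)  # back-transformed intercept
    return c, z


# Hypothetical abundances; species counts follow the fitted curve exactly.
individuals = [50, 100, 200, 500, 1000, 2000, 4000]
species = [6.733 * i ** 0.2308 for i in individuals]

c, z = fit_power_law(individuals, species)
print(f"S = {c:.3f} * I^{z:.4f}, ln(c) = {math.log(c):.3f}")
```

With real data the points scatter around the line, and the R² of the log-log regression measures how well the power function fits.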
Having more than one predictor
[Block diagram: Individuals, Isolation and Area as predictors of Species.]
Describe species richness as a function of the number of individuals, area, and isolation of the islands.
We need a clear hypothesis about dependent and independent variables. Use a block diagram.
Island  Species  Individuals  Area   Isolation
1pog    13       55           0.01   0.088719
2pog    24       149          0.9    0.088592
3pog    31       206          2.1    0.081131
cor     29       3450         6.84   0.089384
dab     31       505          10     0.080644
ful     37       996          9.9    0.094508
gil     54       1895         10     0.093676
guc     27       476          0.92   0.097195
hel     25       325          2.3    0.088938
lip     30       459          4.19   0.088367
mil     34       1410         0.2    0.089204
sos     33       829          20.09  0.087405
swi     34       1704         2.08   0.096915
ter     16       91           0.03   0.085875
wil     21       102          1      0.096584
wron    28       342          0.15   0.01
wros    21       258          0.15   0.01
[Block diagram: Individuals, Isolation, Area, Species.]
Predictors are not independent. The number of individuals depends on area and degree of isolation.
We need linear relationships
We use ln-transformed values of species, area, and individuals. Check for multicollinearity using a correlation matrix, and check for non-linearities using plots.
Of the predictors, area and individuals are highly correlated.
The correlation between area and individuals is highly significant: the probability of H0 (no correlation) is 0.004.
In linear regression analysis correlations of predictors below 0.7 are acceptable.
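A collinearity check of this kind can be sketched as follows, using the Individuals, Area and Isolation columns of the island table and the 0.7 threshold quoted above. The code is plain Python; the names `pearson`, `predictors` and `corr` are illustrative, and the printed r values are not asserted here.

```python
import math

individuals = [55, 149, 206, 3450, 505, 996, 1895, 476, 325, 459,
               1410, 829, 1704, 91, 102, 342, 258]
area = [0.01, 0.9, 2.1, 6.84, 10, 9.9, 10, 0.92, 2.3, 4.19,
        0.2, 20.09, 2.08, 0.03, 1, 0.15, 0.15]
isolation = [0.088719, 0.088592, 0.081131, 0.089384, 0.080644, 0.094508,
             0.093676, 0.097195, 0.088938, 0.088367, 0.089204, 0.087405,
             0.096915, 0.085875, 0.096584, 0.01, 0.01]


def pearson(u, v):
    """Pearson coefficient of correlation of two equally long lists."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


# Individuals and area are ln-transformed, isolation is used as it is.
predictors = {
    "ln_Ind": [math.log(v) for v in individuals],
    "ln_Area": [math.log(v) for v in area],
    "Isolation": isolation,
}
names = list(predictors)
corr = {(p, q): pearson(predictors[p], predictors[q])
        for p in names for q in names}

for i, p in enumerate(names):
    for q in names[i + 1:]:
        flag = "check collinearity" if abs(corr[p, q]) >= 0.7 else "acceptable"
        print(f"{p} vs {q}: r = {corr[p, q]:.3f} ({flag})")
```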
Collinearity
The final data for our analysis
The model
\[ \ln S = a_0 + a_1 \ln Ind + a_2 \ln Area + a_3 Isolation \]
\[ Y = X a \;\Rightarrow\; a = (X^T X)^{-1} X^T Y \]
Multiple linear regression
The vector Y contains the response variable.
The matrix X contains the effect (predictor) variables.
Island  ln_S      Constant  ln_Ind.   ln_Area   Isolation
1pog    2.564949  1         4.007333  -4.60517  0.088719
2pog    3.164068  1         5.003946  -0.10536  0.088592
3pog    3.427515  1         5.327876  0.741937  0.081131
cor     3.366817  1         8.14613   1.922788  0.089384
dab     3.443352  1         6.224558  2.302585  0.080644
ful     3.609114  1         6.903747  2.292535  0.094508
gil     3.985008  1         7.546974  2.302585  0.093676
guc     3.294602  1         6.165418  -0.08338  0.097195
hel     3.236061  1         5.783825  0.832909  0.088938
lip     3.401197  1         6.12905   1.432701  0.088367
mil     3.521447  1         7.251345  -1.60944  0.089204
sos     3.483143  1         6.72022   3.000222  0.087405
swi     3.531251  1         7.440734  0.732368  0.096915
ter     2.772589  1         4.51086   -3.50656  0.085875
wil     3.060271  1         4.624973  0         0.096584
wron    3.332205  1         5.834811  -1.89712  0.01
wros    3.020425  1         5.55296   -1.89712  0.01
The predictor variables have to contain different information.
If X^T X is singular, no inverse exists.
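The multiple regression a = (X^T X)^{-1} X^T Y can be sketched in plain Python by solving the normal equations X^T X a = X^T Y with Gaussian elimination. The ln_S values are copied from the table above, and ln_Ind and ln_Area are recomputed from the raw Individuals and Area columns; the `solve` helper is written for this illustration only. The resulting coefficients and R² are printed but not asserted, since they depend on rounding in the tabulated data.

```python
import math

ln_s = [2.564949, 3.164068, 3.427515, 3.366817, 3.443352, 3.609114,
        3.985008, 3.294602, 3.236061, 3.401197, 3.521447, 3.483143,
        3.531251, 2.772589, 3.060271, 3.332205, 3.020425]
individuals = [55, 149, 206, 3450, 505, 996, 1895, 476, 325, 459,
               1410, 829, 1704, 91, 102, 342, 258]
area = [0.01, 0.9, 2.1, 6.84, 10, 9.9, 10, 0.92, 2.3, 4.19,
        0.2, 20.09, 2.08, 0.03, 1, 0.15, 0.15]
isolation = [0.088719, 0.088592, 0.081131, 0.089384, 0.080644, 0.094508,
             0.093676, 0.097195, 0.088938, 0.088367, 0.089204, 0.087405,
             0.096915, 0.085875, 0.096584, 0.01, 0.01]

# Design matrix with columns (Constant, ln_Ind, ln_Area, Isolation).
X = [[1.0, math.log(i), math.log(a), iso]
     for i, a, iso in zip(individuals, area, isolation)]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: predictors are dependent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


k = len(X[0])
XtX = [[sum(row[j] * row[m] for row in X) for m in range(k)] for j in range(k)]
XtY = [sum(row[j] * yi for row, yi in zip(X, ln_s)) for j in range(k)]
coef = solve(XtX, XtY)  # a0, a1, a2, a3

fitted = [sum(c * v for c, v in zip(coef, row)) for row in X]
residuals = [yi - fi for yi, fi in zip(ln_s, fitted)]
mean_y = sum(ln_s) / len(ln_s)
ss_res = sum(e * e for e in residuals)
ss_tot = sum((yi - mean_y) ** 2 for yi in ln_s)
r_squared = 1 - ss_res / ss_tot  # explained share of the variance

print("coefficients:", [round(c, 3) for c in coef])
print("R^2:", round(r_squared, 3))
```

If two predictor columns carried exactly the same information, the pivot check in `solve` would fail, which is the singular case mentioned above.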
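Fitted model for ln S (coefficient magnitudes; the signs of the terms are not legible in the transcript): constant 2.48, ln Ind 0.15, ln Area 0.07, Isolation 0.91.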
The model explains 78.6% of the variance in species richness. 21.4% of the variance remains unexplained.
The probability that R² is zero is only 0.01%. With a probability of 99.99%, R² > 0, hence the model is statistically significant.
The probabilities that the coefficients deviate from zero: isolation is not a significant predictor.
\[ t = \frac{\text{Coefficient}}{\text{Standard error}} \]
\[ F = t^2 = \frac{r^2 (n - 2)}{1 - r^2} \]
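[Figure: scatter plot of Y against X (both axes 0 to 20) with the two ordinary least squares lines OLRy and OLRx.]
\[ a_{OLRx} = \frac{s_y^2}{s_{xy}} \]
\[ a_{OLRy} = \frac{s_{xy}}{s_x^2} \]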
Model I regression
\[ s_{xy} = a_{OLRy} s_x^2 \;\Rightarrow\; a_{OLRx} = \frac{s_y^2}{s_{xy}} = \frac{s_y^2}{a_{OLRy} s_x^2} \]
What distance to minimize?
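[Figure labels: OLRy minimizes Δy², OLRx minimizes Δx².]
\[ a_{RMA} = \sqrt{a_{OLRy} a_{OLRx}} = \sqrt{\frac{s_{xy}}{s_x^2} \cdot \frac{s_y^2}{s_{xy}}} = \frac{s_y}{s_x} \]
\[ a_{RMA} = \frac{s_y}{s_x} = \frac{a_{OLRy}}{r} \]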
Reduced major axis regression is the geometric average of aOLRy and aOLRx
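As a small numerical illustration, the three slopes can be computed for the beetle data. This is a sketch in plain Python; the function name `regression_slopes` is illustrative, and giving the RMA slope the sign of the covariance is a common convention assumed here, not something stated on the slides.

```python
import math

# Macropterous (x) and brachypterous (y) species counts from the data table.
x = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]
y = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]


def regression_slopes(x, y):
    """Return (a_OLRy, a_OLRx, a_RMA, r), all as slopes in the y-x plane."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y)) / n
    s_xx = sum((u - mean_x) ** 2 for u in x) / n
    s_yy = sum((v - mean_y) ** 2 for v in y) / n
    a_olry = s_xy / s_xx  # minimizes the vertical distances
    a_olrx = s_yy / s_xy  # minimizes the horizontal distances
    # Geometric mean of both slopes, with the sign of the covariance.
    a_rma = math.copysign(math.sqrt(a_olry * a_olrx), s_xy)
    r = s_xy / math.sqrt(s_xx * s_yy)
    return a_olry, a_olrx, a_rma, r


a_olry, a_olrx, a_rma, r = regression_slopes(x, y)
print(f"OLRy = {a_olry:.4f}, OLRx = {a_olrx:.4f}, RMA = {a_rma:.4f}")
print(f"OLRy / r = {a_olry / r:.4f}")  # equals the RMA slope
```

For these data the RMA slope lies between the OLRy and OLRx slopes, as expected for a geometric mean.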
Model II regression
\[ |a_{RMA}| \ge |a_{OLRy}| \]
[Figure: scatter plot of Y against X (both axes 0 to 20) showing the OLRy, OLRx and RMA lines and the distances Δx and Δy.]
Past standard output of linear regression (reduced major axis): parameters and standard errors, and the parametric probability for r = 0.
\[ t = \sqrt{\frac{r^2 (n - 2)}{1 - r^2}}, \qquad df = n - 2 \]
\[ F = t^2 = \frac{r^2 (n - 2)}{1 - r^2} \]
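For the beetle example the test statistic is quickly computed by hand or in code. The sketch below uses the observed r = 0.41508801 and n = 17 from these notes. The critical value 2.131 is the standard two-sided 5% point of the t distribution with 15 degrees of freedom, and the function name `t_from_r` is illustrative.

```python
import math


def t_from_r(r, n):
    """t statistic for H0: r = 0, with n - 2 degrees of freedom."""
    # Written with r rather than r squared so that the sign of r is kept.
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


r, n = 0.41508801, 17
t = t_from_r(r, n)
F = t * t
t_critical = 2.131  # two-sided 5% level, df = 15

print(f"t = {t:.3f}, F = {F:.3f}, df = {n - 2}")
print("significant" if abs(t) > t_critical else "not significant at 5%")
```

The result is consistent with the statement in these notes that Brach and Macro are not significantly correlated.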
We don’t have a clear hypothesis about the causal relationships. In this case RMA is indicated.
Permutation test for statistical significance
Both tests indicate that Brach and Macro are not significantly correlated. The RMA regression slope is insignificant.
Macro  Brach  Los()     Macro  Los()     Macro  Los()     Macro  Los()     Macro  Los()     Macro
7      4      0.335757  14     0.531818  10     0.258728  14     0.296023  10     0.809377  14
12     6      0.787809  10     0.580728  18     0.860314  9      0.524753  8      0.801854  10
13     3      0.310238  12     0.101989  6      0.709402  15     0.826895  15     0.942821  22
18     4      0.626757  22     0.115425  8      0.793515  12     0.064408  13     0.722662  12
10     1      0.220597  13     0.413435  14     0.965281  7      0.25255   7      0.218747  18
14     4      0.012454  6      0.684826  10     0.305505  13     0.976486  8      0.404831  13
7      2      0.909548  9      0.474608  22     0.701483  10     0.170293  22     0.745551  8
22     5      0.299534  10     0.830635  7      0.061196  22     0.517693  14     0.968818  6
9      1      0.177327  8      0.581156  13     0.204792  8      0.355126  10     0.822951  7
7      0      0.953261  7      0.916832  7      0.72657   8      0.38976   6      0.78764   14
15     0      0.242402  7      0.974389  7      0.013131  18     0.639621  7      0.878803  15
13     0      0.595826  13     0.625952  15     0.066869  10     0.511781  7      0.032343  7
8      1      0.596459  8      0.260397  13     0.414809  6      0.489293  14     0.92727   10
10     4      0.880829  14     0.61705   14     0.093979  7      0.504421  12     0.267633  8
8      2      0.548183  15     0.588517  9      0.462482  7      0.630868  13     0.106493  7
14     6      0.790054  7      0.015239  8      0.234162  13     0.778739  18     0.89634   13
6      2      0.999702  18     0.253364  12     0.011327  14     0.815214  9      0.4389    9

r of the five randomized Macro columns shown: 0.099125, -0.05535, 0.302746, 0.358917, -0.0413

N           1000
Observed r  0.41508801
Mean r      0.061
Lower CL    -0.538
Upper CL    0.768
Permutation test for statistical significance
Randomize x or y 1000 times and calculate r each time. Plot the statistical distribution and calculate the lower and upper confidence limits.
[Figure: histogram of the randomized r values (x axis: r from -0.8 to 1; y axis: N) with the lower and upper CL marked; γ > 0.]
Calculating confidence limits
Rank all 1000 coefficients of correlation and take the values at rank positions 25 and 975.
[Figure labels: each tail of the distribution holds Σ N2.5 = 25 values; μ > 0; the observed r is marked.]
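The randomization procedure can be sketched in plain Python as follows. The beetle counts come from the table above; the function names, the fixed `seed`, and the use of `random.Random.shuffle` in place of the spreadsheet random numbers are assumptions of this example, so the confidence limits it prints will differ somewhat from the values in the table.

```python
import math
import random

macro = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]
brach = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]


def pearson(u, v):
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


def permutation_test(x, y, n_perm=1000, seed=0):
    """Shuffle x n_perm times; return observed r, 95% limits and all r."""
    rng = random.Random(seed)  # fixed seed only to make the run repeatable
    observed = pearson(x, y)
    shuffled = list(x)
    rs = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        rs.append(pearson(shuffled, y))
    rs.sort()
    lower_rank = n_perm * 25 // 1000   # rank 25 of 1000
    upper_rank = n_perm * 975 // 1000  # rank 975 of 1000
    return observed, rs[lower_rank - 1], rs[upper_rank - 1], rs


observed, lower, upper, rs = permutation_test(macro, brach)
print(f"observed r = {observed:.4f}")
print(f"95% limits of randomized r: {lower:.3f} to {upper:.3f}")
print("significant" if not lower <= observed <= upper else "not significant")
```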
The RMA regression has a much steeper slope. This slope is often intuitively better.
The coefficient of correlation is independent of the regression method.
The 95% confidence limits of the regression slope mark the range within which the regression slope lies with 95% probability. The lower CL is negative, hence a zero slope is within the 95% confidence limits.
[Figure labels: Upper CL, Lower CL]
In OLRy regression insignificance of slope means also insignificance of r and R2.
[Figure: scatter plot of Y against X (both axes 0 to 20) with the OLRy regression line and the distances Δy and Δy².]
Outliers have a disproportionate influence on correlation and regression.
Outliers should be eliminated from regression analysis.
Instead of the Pearson coefficient of correlation, use Spearman’s rank order correlation.
[Figure: scatter plot of Y against X, both axes from 0 to 7.]
Spearman’s coefficient is the normal (Pearson) correlation calculated on ranked data.
rPearson = 0.79
rSpearman = 0.77
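The idea of a normal correlation on ranked data can be shown in a short sketch. The code is plain Python; tied values receive the average of their ranks, which is the usual convention and an assumption here, and the small data set is invented for the illustration, with one extreme value acting as an outlier.

```python
import math


def pearson(u, v):
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(u, v):
    """Spearman's rank correlation: Pearson's r of the ranks."""
    return pearson(average_ranks(u), average_ranks(v))


# Invented example: monotonic relationship with one extreme y value.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 100]
print(f"Pearson  r = {pearson(x, y):.3f}")
print(f"Spearman r = {spearman(x, y):.3f}")  # 1.000, the order is identical
print(average_ranks([10, 20, 20, 30]))       # [1.0, 2.5, 2.5, 4.0]
```

Because only the order of the values enters the calculation, the extreme value has no more weight than any other point.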
Homework and literature
Refresh:
• Coefficient of correlation
• Pearson correlation
• Spearman correlation
• Linear regression
• Non-linear regression
• Model I and model II regression
• RMA regression
Prepare for the next lecture:
• F-test
• F-distribution
• Variance
Literature:
Łomnicki: Statystyka dla biologów
http://statsoft.com/textbook/