Lecture 12: Correlation and linear regression

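$$ y = ax + b $$

$$ D = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} [y_i - (a x_i + b)]^2 $$

$$ \frac{\partial D}{\partial a} = -2 \sum_{i=1}^{n} x_i (y_i - a x_i - b) = 0 $$

$$ \frac{\partial D}{\partial b} = -2 \sum_{i=1}^{n} (y_i - a x_i - b) = 0 $$

$$ a = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} \qquad b = \bar{y} - a \bar{x} $$

$$ y = ax + (\bar{y} - a\bar{x}) \;\Rightarrow\; y - \bar{y} = a (x - \bar{x}) $$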

The least squares method of Carl Friedrich Gauß.

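[Figure: scatter plot of Y against X (both axes 0 to 20) with the OLRy regression line. Δy is the vertical distance of a point from the line; the method minimizes the sum of Δy².]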
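The closed-form solution of the least squares method can be sketched in a few lines of Python. This is only an illustration: the function name `least_squares` is hypothetical, and the data are the macropterous and brachypterous species counts of the 17 islands that appear later in this lecture.

```python
# Macropterous (x) and brachypterous (y) species counts on 17 islands,
# copied from the data table used later in this lecture.
x = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]
y = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]


def least_squares(x, y):
    """Return slope a and intercept b of y = ax + b (hypothetical helper)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    # a = (sum x_i y_i - n * x_mean * y_mean) / (sum x_i^2 - n * x_mean^2)
    a = (sum_xy - n * x_mean * y_mean) / (sum_xx - n * x_mean ** 2)
    # b = y_mean - a * x_mean
    b = y_mean - a * x_mean
    return a, b


a, b = least_squares(x, y)
print(f"y = {a:.4f}x + {b:.4f}")
```

The result can be compared with the trend line y = 0.192x + 0.4671 reported for these data.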

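$$ a = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x} \bar{y}}{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2} = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2} $$

The numerator s_xy is the covariance of x and y; the denominator s_x² is the variance of x.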

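Correlation coefficient

$$ r = \frac{s_{xy}}{s_x s_y} \qquad r^2 = \frac{s_{xy}^2}{s_x^2 s_y^2} $$

Coefficient of determination

$$ R^2 = \frac{\text{Explained variance}}{\text{Total variance}} $$

$$ s_{xy} = a s_x^2 = r s_x s_y \;\Rightarrow\; a = r \frac{s_y}{s_x} $$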

Slope a and coefficient of correlation r are zero if the covariance is zero.

$$ -1 \le r \le 1 \qquad 0 \le r^2 \le 1 $$
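As a minimal sketch of these definitions, the following Python snippet computes variance, covariance, r and r² with the 1/n convention used in this lecture. The data are the macropterous (x) and brachypterous (y) counts of the 17 islands from the lecture's data table; all variable names are illustrative.

```python
import math

# Macropterous (x) and brachypterous (y) species counts on 17 islands.
x = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]
y = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Covariance and variances, each divided by n.
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / n
s_x2 = sum((xi - x_mean) ** 2 for xi in x) / n
s_y2 = sum((yi - y_mean) ** 2 for yi in y) / n

r = s_xy / math.sqrt(s_x2 * s_y2)   # correlation coefficient
a = s_xy / s_x2                     # regression slope

print(f"s_xy = {s_xy:.4f}, s_x^2 = {s_x2:.4f}, s_y^2 = {s_y2:.4f}")
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
# The slope equals r * s_y / s_x, as derived above.
print(f"a = {a:.4f}, r * s_y / s_x = {r * math.sqrt(s_y2 / s_x2):.4f}")
```

Because the same factor 1/n appears in numerator and denominator, r and a do not change if 1/(n-1) is used instead.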

[Figure: brachypterous species against macropterous species; y = 0.192x + 0.4671, R² = 0.1723.]

[Figure: dimorphic species against macropterous species; y = 0.3875x + 3.7188, R² = 0.4455.]


Relationships between macropterous, dimorphic and brachypterous ground beetles on 17 Mazurian lake islands

Brachypterous against macropterous species:

• Positive correlation; r = √R² = 0.41.
• The regression is weak. Macropterous species richness explains only 17% of the variance in brachypterous species richness.
• Some islands have no brachypterous species.
• We do not really know which variable is the independent one. There is no clear-cut logical connection.

Dimorphic against macropterous species:

• Positive correlation; r = √R² = 0.67.
• The regression is moderate. Macropterous species richness explains only 45% of the variance in dimorphic species richness.
• The relationship appears to be non-linear. Log-transformation is indicated (there are no zero counts).
• We do not really know which variable is the independent one. There is no clear-cut logical connection.

[Figure: brachypterous species against isolation; y = -36.203x + 5.5585, R² = 0.2311.]

[Figure: brachypterous species against ln Area; y = 0.4894x + 22.094, R² = 0.0037.]

Brachypterous species against isolation:

• Negative correlation; r = -√R² = -0.48.
• The regression is weak. Island isolation explains only 23% of the variance in brachypterous species richness.
• There are two apparent outliers. Without them the whole relationship would vanish, i.e. R² ≈ 0.
• Outliers have to be eliminated from regression analysis.
• We have a clear hypothesis about the logical relationships. Isolation should be the predictor of species richness.

Brachypterous species against ln Area:

• No correlation; r = √R² = 0.06.
• The regression slope is nearly zero. Area explains less than 1% of the variance in brachypterous species richness.
• We have a clear hypothesis about the logical relationships. Area should be the predictor of species richness.

The matrix perspective

[Figure: brachypterous species against macropterous species; y = 0.192x + 0.4671, R² = 0.1723.]


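Written for all 17 islands, the regression y = a_1 x + a_0 becomes a system of linear equations:

$$
\begin{pmatrix} 4 \\ 6 \\ 3 \\ \vdots \\ 2 \end{pmatrix}
=
\begin{pmatrix} 7 a_1 + a_0 \\ 12 a_1 + a_0 \\ 13 a_1 + a_0 \\ \vdots \\ 6 a_1 + a_0 \end{pmatrix}
=
\begin{pmatrix} 7 & 1 \\ 12 & 1 \\ 13 & 1 \\ \vdots & \vdots \\ 6 & 1 \end{pmatrix}
\begin{pmatrix} a_1 \\ a_0 \end{pmatrix}
$$

$$ Y = Xa $$

X is not square, so it does not possess an inverse. Multiplying both sides by X^T gives a square matrix X^T X that can be inverted:

$$ Y = Xa \;\Rightarrow\; X^T Y = X^T X a $$

$$ (X^T X)^{-1} X^T Y = (X^T X)^{-1} (X^T X) a = I a = a $$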

Brachy  Macro  Constant
4       7      1
6       12     1
3       13     1
4       18     1
1       10     1
4       14     1
2       7      1
5       22     1
1       9      1
0       7      1
0       15     1
0       13     1
1       8      1
4       10     1
2       8      1
6       14     1
2       6      1

Transpose X^T (first columns):

Macro     7  12  13  ...
Constant  1  1   1   ...

Dispersion matrix X^T X:

2499  193
193   17

Inverse (X^T X)^{-1}:

 0.003248  -0.03687
-0.03687    0.477455

X^T Y:

570
45

Coefficients:

a1 = 0.192014
a0 = 0.467138

$$ Y = Xa \;\Rightarrow\; a = (X^T X)^{-1} X^T Y $$
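The matrix solution above can be reproduced step by step. The sketch below uses plain Python lists so that every operation is visible; the helper names `transpose`, `matmul` and `inverse_2x2` are hypothetical and not part of any library.

```python
# Brachypterous (Y) and macropterous (first column of X) species counts.
brachy = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]
macro = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]


def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)]
            for row in a]


def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix via its determinant."""
    (p, q), (r, s) = m
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]


X = [[xi, 1.0] for xi in macro]      # columns: Macro, Constant
Y = [[yi] for yi in brachy]          # column vector

XtX = matmul(transpose(X), X)        # dispersion matrix
XtY = matmul(transpose(X), Y)
coefficients = matmul(inverse_2x2(XtX), XtY)
a1, a0 = coefficients[0][0], coefficients[1][0]

print("X^T X =", XtX)
print("X^T Y =", XtY)
print(f"a1 = {a1:.6f}, a0 = {a0:.6f}")
```

For more than two coefficients the 2 x 2 inverse has to be replaced by a general solver; the logic of a = (X^T X)^{-1} X^T Y stays the same.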

[Figure: brachypterous species against macropterous species; y = 0.192x + 0.4671, R² = 0.1723.]


Dispersion matrix

$$
X^T X = \begin{pmatrix} 2499 & 193 \\ 193 & 17 \end{pmatrix}
= \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \cdot const_i \\ \sum_{i=1}^{n} x_i \cdot const_i & \sum_{i=1}^{n} const_i^2 \end{pmatrix}
$$

Variance

$$ s_x^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2 $$

Covariance

$$ s_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n} \sum_{i=1}^{n} x_i y_i - \bar{x}\bar{y} $$

Raw data and arithmetic means: X holds the 17 rows of raw data (columns Brachy, Macro, Constant). M has the same shape and holds the column means in every row (2.64706, 11.3529, 1).

Dispersion matrix X^T X:

          Brachy  Macro  Constant
Brachy    185     570    45
Macro     570     2499   193
Constant  45      193    17

Squared means M^T M:

          Brachy  Macro   Constant
Brachy    119.12  510.88  45
Macro     510.88  2191.1  193
Constant  45      193     17

X^T X - M^T M:

          Brachy  Macro   Constant
Brachy    65.882  59.118  0
Macro     59.118  307.88  0
Constant  0       0       0

Variances and covariances, (X^T X - M^T M) / 17:

          Brachy  Macro   Constant
Brachy    3.8754  3.4775  0
Macro     3.4775  18.111  0
Constant  0       0       0

$$ \Sigma = \frac{1}{n} (X^T X - M^T M) = \frac{1}{n} [(X - M)^T (X - M)] $$

$$
\Sigma = \begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \dots & \sigma_{1n} \\
\sigma_{21} & \sigma_2^2 & \dots & \sigma_{2n} \\
\dots & \dots & \dots & \dots \\
\sigma_{n1} & \sigma_{n2} & \dots & \sigma_n^2
\end{pmatrix}
$$

The covariance matrix is square and symmetric. The diagonal holds the variances, the off-diagonal elements hold the covariances.

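$$
R = \begin{pmatrix} 1/s_x & 0 \\ 0 & 1/s_y \end{pmatrix} \Sigma \begin{pmatrix} 1/s_x & 0 \\ 0 & 1/s_y \end{pmatrix}
= \begin{pmatrix} 1 & r_{xy} \\ r_{yx} & 1 \end{pmatrix}
$$

Covariance matrix Σ:

3.8754  3.4775
3.4775  18.111

Matrix V of inverse standard deviations:

0.508  0
0      0.235

VΣ:

1.9686  1.7665
0.8171  4.2557

VΣV (correlation matrix):

1       0.4151
0.4151  1

r = 0.4151
r² = 0.1723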
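The chain from raw data to covariance matrix to correlation matrix can be sketched as follows. This is an illustrative plain-Python version restricted to the Brachy and Macro columns (the constant column only adds zeros); the helper names `transpose` and `matmul` are hypothetical.

```python
import math

brachy = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]
macro = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]
n = len(brachy)


def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)]
            for row in a]


# X: raw data; M: matrix with the column means repeated in every row.
X = [[b, m] for b, m in zip(brachy, macro)]
means = [sum(brachy) / n, sum(macro) / n]
M = [means[:] for _ in range(n)]

XtX = matmul(transpose(X), X)
MtM = matmul(transpose(M), M)

# Covariance matrix: (X^T X - M^T M) / n
cov = [[(XtX[i][j] - MtM[i][j]) / n for j in range(2)] for i in range(2)]

# V: diagonal matrix of inverse standard deviations; R = V * cov * V
V = [[1 / math.sqrt(cov[0][0]), 0.0],
     [0.0, 1 / math.sqrt(cov[1][1])]]
R = matmul(matmul(V, cov), V)

print("covariance matrix:", cov)
print("correlation matrix:", R)
```

The off-diagonal element of R is the correlation coefficient r of the two variables.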

Non-linear relationships

Ground beetles on Mazurian lake islands

The species-individuals relationship is obviously non-linear.

[Figure: species against individuals, linear axes. Linear function: y = 0.0056x + 24.305, R² = 0.2963.]

[Figure: species against individuals, logarithmic axes. Logarithmic function: y = 6.0987 ln(x) - 8.3513, R² = 0.6003.]

[Figure: species against individuals, shown with linear and with logarithmic axes. Power function: y = 6.7337 x^0.2306, R² = 0.67.]

$$ S = 6.733 \, I^{0.2308} $$

$$ \ln S = \ln(6.733) + 0.2308 \ln I = 1.907 + 0.2308 \ln I $$

In the log-transformed form, 1.907 is the intercept and 0.2308 is the slope.

The power function has the highest R² and therefore explains most of the variance in species richness. The coefficient of determination is a measure of goodness of fit.
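A power function can be fitted by ordinary linear regression after ln-transforming both variables. The sketch below assumes that this is how the trend line was obtained; it uses the species and individual counts of the 17 islands from the island data table of this lecture, and the function name `linear_fit` is hypothetical.

```python
import math

# Species richness and numbers of individuals on the 17 islands.
species = [13, 24, 31, 29, 31, 37, 54, 27, 25, 30, 34, 33, 34, 16, 21, 28, 21]
individuals = [55, 149, 206, 3450, 505, 996, 1895, 476, 325, 459, 1410,
               829, 1704, 91, 102, 342, 258]


def linear_fit(x, y):
    """Return slope, intercept and r^2 of the least squares line y = ax + b."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    s_yy = sum((yi - y_mean) ** 2 for yi in y)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return slope, intercept, s_xy ** 2 / (s_xx * s_yy)


ln_i = [math.log(v) for v in individuals]
ln_s = [math.log(v) for v in species]

# ln S = intercept + slope * ln I  <=>  S = exp(intercept) * I ** slope
slope, intercept, r2 = linear_fit(ln_i, ln_s)
print(f"ln S = {intercept:.3f} + {slope:.4f} ln I (r^2 = {r2:.2f})")
print(f"S = {math.exp(intercept):.3f} * I^{slope:.4f}")
```

Note that r² here refers to the ln-transformed data, so it measures the fit of the straight line in the log-log plot.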

Having more than one predictor

[Block diagram: Individuals, Isolation and Area as predictors of Species.]

Describe species richness as a function of the number of individuals, the area, and the isolation of the islands.

We need a clear hypothesis about which variables are dependent and which are independent predictors. Use a block diagram.

Island  Species  Individuals  Area   Isolation
1pog    13       55           0.01   0.088719
2pog    24       149          0.9    0.088592
3pog    31       206          2.1    0.081131
cor     29       3450         6.84   0.089384
dab     31       505          10     0.080644
ful     37       996          9.9    0.094508
gil     54       1895         10     0.093676
guc     27       476          0.92   0.097195
hel     25       325          2.3    0.088938
lip     30       459          4.19   0.088367
mil     34       1410         0.2    0.089204
sos     33       829          20.09  0.087405
swi     34       1704         2.08   0.096915
ter     16       91           0.03   0.085875
wil     21       102          1      0.096584
wron    28       342          0.15   0.01
wros    21       258          0.15   0.01

[Block diagram: Area and Isolation act on Individuals; Individuals, Area and Isolation act on Species.]

Predictors are not independent. The number of individuals depends on area and on the degree of isolation.

We need linear relationships

We use ln-transformed values of species, area, and individuals. We check for multicollinearity using a correlation matrix, and for non-linearities using plots.

Among the predictors, area and individuals are highly correlated.

The correlation between area and individuals is highly significant. The probability of H0 (no correlation) is 0.004.

In linear regression analysis, correlations between predictors below 0.7 are acceptable.
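A collinearity check of this kind can be sketched as a loop over all pairs of predictors. The example assumes the ln-transformed island data of this lecture; the function name `pearson_r` and the dictionary layout are illustrative choices, and the 0.7 threshold is the rule of thumb stated above.

```python
import math

individuals = [55, 149, 206, 3450, 505, 996, 1895, 476, 325, 459, 1410,
               829, 1704, 91, 102, 342, 258]
area = [0.01, 0.9, 2.1, 6.84, 10, 9.9, 10, 0.92, 2.3, 4.19, 0.2, 20.09,
        2.08, 0.03, 1, 0.15, 0.15]
isolation = [0.088719, 0.088592, 0.081131, 0.089384, 0.080644, 0.094508,
             0.093676, 0.097195, 0.088938, 0.088367, 0.089204, 0.087405,
             0.096915, 0.085875, 0.096584, 0.01, 0.01]


def pearson_r(x, y):
    """Pearson coefficient of correlation of two equally long lists."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    s_yy = sum((yi - y_mean) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


predictors = {
    "ln_Ind": [math.log(v) for v in individuals],
    "ln_Area": [math.log(v) for v in area],
    "Isolation": isolation,
}

names = list(predictors)
corr = {(p, q): pearson_r(predictors[p], predictors[q])
        for p in names for q in names}

for i, p in enumerate(names):
    for q in names[i + 1:]:
        flag = "check collinearity" if abs(corr[(p, q)]) >= 0.7 else "acceptable"
        print(f"r({p}, {q}) = {corr[(p, q)]:.3f}  ({flag})")
```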

Collinearity

Multiple linear regression

The model

$$ \ln S = a_0 + a_1 \ln Ind + a_2 \ln Area + a_3 \, Isolation $$

$$ Y = Xa \;\Rightarrow\; a = (X^T X)^{-1} X^T Y $$

The vector Y contains the response variable. The matrix X contains the effect (predictor) variables.

The final data for our analysis

Island  ln_S      Constant  ln_Ind.   ln_Area   Isolation
1pog    2.564949  1         4.007333  -4.60517  0.088719
2pog    3.164068  1         5.003946  -0.10536  0.088592
3pog    3.427515  1         5.327876  0.741937  0.081131
cor     3.366817  1         8.14613   1.922788  0.089384
dab     3.443352  1         6.224558  2.302585  0.080644
ful     3.609114  1         6.903747  2.292535  0.094508
gil     3.985008  1         7.546974  2.302585  0.093676
guc     3.294602  1         6.165418  -0.08338  0.097195
hel     3.236061  1         5.783825  0.832909  0.088938
lip     3.401197  1         6.12905   1.432701  0.088367
mil     3.521447  1         7.251345  -1.60944  0.089204
sos     3.483143  1         6.72022   3.000222  0.087405
swi     3.531251  1         7.440734  0.732368  0.096915
ter     2.772589  1         4.51086   -3.50656  0.085875
wil     3.060271  1         4.624973  0         0.096584
wron    3.332205  1         5.834811  -1.89712  0.01
wros    3.020425  1         5.55296   -1.89712  0.01

The predictor variables have to contain different information.

If X^T X is singular, no inverse exists.
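The normal equations of the multiple regression can be solved without an explicit inverse. The following sketch uses a small Gauss-Jordan solver written for this example (the function name `solve` is hypothetical) and the island data of this lecture; it raises an error when X^T X is singular, which is the situation described above.

```python
import math

species = [13, 24, 31, 29, 31, 37, 54, 27, 25, 30, 34, 33, 34, 16, 21, 28, 21]
individuals = [55, 149, 206, 3450, 505, 996, 1895, 476, 325, 459, 1410,
               829, 1704, 91, 102, 342, 258]
area = [0.01, 0.9, 2.1, 6.84, 10, 9.9, 10, 0.92, 2.3, 4.19, 0.2, 20.09,
        2.08, 0.03, 1, 0.15, 0.15]
isolation = [0.088719, 0.088592, 0.081131, 0.089384, 0.080644, 0.094508,
             0.093676, 0.097195, 0.088938, 0.088367, 0.089204, 0.087405,
             0.096915, 0.085875, 0.096584, 0.01, 0.01]


def solve(a, b):
    """Solve a * x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: predictors carry the same information")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


# Columns of X: Constant, ln_Ind, ln_Area, Isolation. Y: ln_S.
X = [[1.0, math.log(i), math.log(a), iso]
     for i, a, iso in zip(individuals, area, isolation)]
Y = [math.log(s) for s in species]
k = len(X[0])

XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
XtY = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(k)]
coef = solve(XtX, XtY)          # a0, a1, a2, a3

predicted = [sum(c * v for c, v in zip(coef, row)) for row in X]
y_mean = sum(Y) / len(Y)
ss_res = sum((y - p) ** 2 for y, p in zip(Y, predicted))
ss_tot = sum((y - y_mean) ** 2 for y in Y)
r2 = 1 - ss_res / ss_tot

for name, c in zip(["a0", "a1 (ln Ind)", "a2 (ln Area)", "a3 (Isolation)"], coef):
    print(f"{name}: {c:.4f}")
print(f"R^2 = {r2:.3f}")
```

Duplicating a predictor column in X would make X^T X singular and trigger the error, which illustrates why the predictors have to contain different information.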

IsolationAreaIndS 91.0ln07.0ln15.048.2ln

The model explains 78.6% of the variance in species richness. 21.4% of the variance remains unexplained.

The probability that R² is zero is only 0.01%. With a probability of more than 99.9%, R² > 0, and the model is hence statistically significant.

The probabilities that the coefficients deviate from zero: isolation is not a significant predictor.

$$ t = \frac{\text{Coefficient}}{\text{Standard error}} $$

$$ F = t^2 = \frac{r^2 (n - 2)}{1 - r^2} $$

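[Figure: scatter plot of Y against X (both axes 0 to 20) with the two ordinary least squares lines OLRy and OLRx.]

$$ a_{OLRy} = \frac{s_{xy}}{s_x^2} \qquad a_{OLRx} = \frac{s_y^2}{s_{xy}} $$

Model I regression

$$ s_{xy} = a_{OLRy} s_x^2 \qquad s_y^2 = a_{OLRx} s_{xy} = a_{OLRx} a_{OLRy} s_x^2 \;\Rightarrow\; a_{OLRy} \, a_{OLRx} = \frac{s_y^2}{s_x^2} $$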

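What distance to minimize?

[Figure: OLRy minimizes the squared vertical distances Δy², OLRx minimizes the squared horizontal distances Δx².]

$$ a_{RMA} = \sqrt{a_{OLRy} \, a_{OLRx}} = \sqrt{\frac{s_{xy}}{s_x^2} \cdot \frac{s_y^2}{s_{xy}}} = \frac{s_y}{s_x} $$

$$ a_{RMA} = \frac{s_y}{s_x} = \frac{a_{OLRy}}{r} $$

Reduced major axis regression is the geometric average of a_OLRy and a_OLRx.

Model II regression

$$ |a_{RMA}| \ge |a_{OLRy}| $$

[Figure: scatter plot of Y against X (both axes 0 to 20) with the OLRy, OLRx and RMA lines and the distances Δx and Δy.]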
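The relations between the three slopes can be checked numerically. The sketch below uses the macropterous (x) and brachypterous (y) counts of the 17 islands from this lecture; the variable names are illustrative, and the sign handling of the RMA slope (same sign as the covariance) is an assumption that matters only for negative correlations.

```python
import math

x = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]
y = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / n
s_x2 = sum((xi - x_mean) ** 2 for xi in x) / n
s_y2 = sum((yi - y_mean) ** 2 for yi in y) / n
r = s_xy / math.sqrt(s_x2 * s_y2)

a_olry = s_xy / s_x2                 # minimizes vertical distances
a_olrx = s_y2 / s_xy                 # minimizes horizontal distances
# RMA slope: s_y / s_x, with the sign of the covariance (assumption).
a_rma = math.copysign(math.sqrt(s_y2 / s_x2), s_xy)
b_rma = y_mean - a_rma * x_mean      # RMA line passes through the means

print(f"a_OLRy = {a_olry:.4f}, a_OLRx = {a_olrx:.4f}, a_RMA = {a_rma:.4f}")
print(f"geometric mean of OLR slopes = {math.sqrt(a_olry * a_olrx):.4f}")
print(f"a_OLRy / r = {a_olry / r:.4f}")
print(f"RMA line: y = {a_rma:.4f}x + {b_rma:.4f}")
```

For these data the RMA slope is more than twice the OLRy slope, because r is only about 0.42.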

Past standard output of linear regression: reduced major axis

Parameters and standard errors

Parametric probability for r = 0

$$ t = \sqrt{\frac{r^2 (n - 2)}{1 - r^2}} \qquad df = n - 2 $$

$$ F = t^2 = \frac{r^2 (n - 2)}{1 - r^2} $$

We don't have a clear hypothesis about the causal relationships. In this case RMA is indicated.
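The parametric test statistic follows directly from r and n. This is a small worked example for the observed correlation between macropterous and brachypterous species (r = 0.41508801, n = 17, both taken from this lecture); the critical value in the comment comes from standard t tables and is not part of the lecture.

```python
import math

r = 0.41508801   # observed coefficient of correlation (from the lecture)
n = 17           # number of islands

df = n - 2
t = math.sqrt(r ** 2 * (n - 2) / (1 - r ** 2))
F = r ** 2 * (n - 2) / (1 - r ** 2)

print(f"t = {t:.3f} with df = {df}")
print(f"F = t^2 = {F:.3f}")
# For comparison: the two-sided 5% critical value of t at 15 degrees of
# freedom is about 2.13 (standard t table). A smaller t means that r = 0
# cannot be rejected at the 5% level.
```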

Permutation test for statistical significance

Both tests indicate that Brach and Macro are not significantly correlated. The RMA regression slope is insignificant.

Each Los() column holds random numbers; the Macro column next to it is a random permutation of the observed Macro values.

Macro  Brach  Los()     Macro  Los()     Macro  Los()     Macro  Los()     Macro  Los()     Macro
7      4      0.335757  14     0.531818  10     0.258728  14     0.296023  10     0.809377  14
12     6      0.787809  10     0.580728  18     0.860314  9      0.524753  8      0.801854  10
13     3      0.310238  12     0.101989  6      0.709402  15     0.826895  15     0.942821  22
18     4      0.626757  22     0.115425  8      0.793515  12     0.064408  13     0.722662  12
10     1      0.220597  13     0.413435  14     0.965281  7      0.25255   7      0.218747  18
14     4      0.012454  6      0.684826  10     0.305505  13     0.976486  8      0.404831  13
7      2      0.909548  9      0.474608  22     0.701483  10     0.170293  22     0.745551  8
22     5      0.299534  10     0.830635  7      0.061196  22     0.517693  14     0.968818  6
9      1      0.177327  8      0.581156  13     0.204792  8      0.355126  10     0.822951  7
7      0      0.953261  7      0.916832  7      0.72657   8      0.38976   6      0.78764   14
15     0      0.242402  7      0.974389  7      0.013131  18     0.639621  7      0.878803  15
13     0      0.595826  13     0.625952  15     0.066869  10     0.511781  7      0.032343  7
8      1      0.596459  8      0.260397  13     0.414809  6      0.489293  14     0.92727   10
10     4      0.880829  14     0.61705   14     0.093979  7      0.504421  12     0.267633  8
8      2      0.548183  15     0.588517  9      0.462482  7      0.630868  13     0.106493  7
14     6      0.790054  7      0.015239  8      0.234162  13     0.778739  18     0.89634   13
6      2      0.999702  18     0.253364  12     0.011327  14     0.815214  9      0.4389    9

r of each permutation with Brach: 0.099125, -0.05535, 0.302746, 0.358917, -0.0413

N           1000
Observed r  0.41508801
Mean r      0.061
Lower CL    -0.538
Upper CL    0.768

Randomize x or y 1000 times. Calculate r each time. Plot the statistical distribution and calculate the lower and upper confidence limits.

[Figure: histogram of the 1000 permuted coefficients of correlation (N against r, r from -0.8 to 1) with lower CL, upper CL and observed r marked; skewness γ > 0, mean μ > 0.]

Calculating confidence limits

Rank all 1000 coefficients of correlation and take the values at rank positions 25 and 975. Each tail then contains 2.5% of the values: Σ N(2.5%) = 25.
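The permutation procedure can be sketched with the standard `random` module. The seed value and all names are arbitrary choices for this illustration, and the resulting confidence limits depend on the random permutations drawn, so they will not reproduce the table values exactly.

```python
import math
import random

macro = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]
brach = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]


def pearson_r(x, y):
    """Pearson coefficient of correlation of two equally long lists."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    s_yy = sum((yi - y_mean) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


rng = random.Random(12)          # arbitrary seed, only for repeatability
observed_r = pearson_r(macro, brach)

# Randomize x 1000 times and calculate r each time.
shuffled = macro[:]
permuted = []
for _ in range(1000):
    rng.shuffle(shuffled)
    permuted.append(pearson_r(shuffled, brach))

# Rank all coefficients and take rank positions 25 and 975.
permuted.sort()
lower_cl = permuted[24]
upper_cl = permuted[974]

print(f"observed r = {observed_r:.4f}")
print(f"mean permuted r = {sum(permuted) / len(permuted):.3f}")
print(f"lower CL = {lower_cl:.3f}, upper CL = {upper_cl:.3f}")
print("significant" if not lower_cl <= observed_r <= upper_cl else "not significant")
```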

The RMA regression has a much steeper slope. This slope is often intuitively better.

The coefficient of correlation is independent of the regression method.

The 95% confidence limits of the regression slope mark the range that contains the regression slope with 95% probability. The lower CL is negative, hence a zero slope lies within the 95% confidence limits.

In OLRy regression, insignificance of the slope also means insignificance of r and R².

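[Figure: scatter plot of Y against X (both axes 0 to 20) with the OLRy regression line and one outlier at distance Δy; the squared distance Δy² enters the fit.]

Outliers have a disproportionate influence on correlation and regression.

Outliers should be eliminated from regression analysis.

Instead of the Pearson coefficient of correlation, use Spearman's rank order correlation. It is the normal (Pearson) correlation calculated on ranked data.

[Figure: Y against X on ranked data (both axes 0 to 7); r_Pearson = 0.79, r_Spearman = 0.77.]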
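Spearman's rank order correlation can be sketched as a Pearson correlation on ranks. The example below assigns average ranks to ties, which is a common convention but an assumption here; the function names are hypothetical, and the data are the macropterous and brachypterous counts used throughout this lecture (not the data of the figure above).

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1     # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson_r(x, y):
    """Pearson coefficient of correlation of two equally long lists."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    s_yy = sum((yi - y_mean) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def spearman_r(x, y):
    """Spearman correlation: Pearson correlation of the ranked data."""
    return pearson_r(average_ranks(x), average_ranks(y))


macro = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]
brach = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]

print(f"r_Pearson  = {pearson_r(macro, brach):.3f}")
print(f"r_Spearman = {spearman_r(macro, brach):.3f}")
```

Because only the rank order enters the calculation, a single extreme value cannot dominate the result the way it can with the Pearson coefficient.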

Homework and literature

Refresh:

• Coefficient of correlation
• Pearson correlation
• Spearman correlation
• Linear regression
• Non-linear regression
• Model I and model II regression
• RMA regression

Prepare for the next lecture:

• F-test
• F-distribution
• Variance

Literature:

• Łomnicki: Statystyka dla biologów
• http://statsoft.com/textbook/
