TRANSCRIPT
A refresher in applied statistics
Model fitting
- parameter point & interval estimates
Simple and multiple linear regression
ANOVA and ANCOVA
Beate Sick
We use R for performing statistical data analysis. Recommended environment: RStudio.
Main reasons:
• open source
• powerful
• widespread
• reproducible
• transparent
Statistics connects data with models
probability world ↔ data world
Data, sample: describe data, visualize & summarize.
Model: probability calculus, predictions.
Inductive statistics connects the two sides:
- model choice
- parameter estimation
- confidence intervals
- tests
- regression, ANOVA
We use a sample to learn about the population
Results from statistical inference are only correct if the sample was representative.
A sample is representative if it does not systematically differ from the population (e.g. the percentages of males and females are similar in the sample and in the population).
The world of data and the world of models
data/reality ↔ model
sample ↔ population
discrete data/features (numeric or categorical) ↔ discrete random variable (numeric)
continuous data/features (numeric) ↔ continuous random variable (numeric)
observation ↔ random variable
relative frequency ↔ probability P
histogram (scaled) ↔ density (continuous probability distribution)
bar plot of frequencies (scaled; rel. frequency at discrete features) ↔ discrete probability distribution
average x̄ ↔ expected value μ
sample variance s² ↔ variance σ²
The expected value = population mean
• The expected value of a random variable is the average we would get with an infinitely large sample.
• It measures the location of the random variable.
• It corresponds to the centre of mass of the density (see red line).
• It often determines the parameter of the model.
• The expected value can also be calculated from the probability function or density:
  E(X) = Σ_{i=1}^{n} x_i · P(X = x_i)   (discrete case)
  E(X) = ∫ x · f(x) dx   (continuous case)
[Figures: density f(x) of an exponential distribution ("Exponentialverteilung"); probabilities P(X = k) of Po(λ = 2.5)]
The most famous discrete distributions/models

Bernoulli, X ~ Bern(p)
  possible values: {0, 1}
  P(X = 1) = p, P(X = 0) = 1 − p
  expected value: μ = E(X) = p
  variance: σ² = Var(X) = p·(1 − p)
  application: X indicates if an event occurs or not

Binomial, X ~ B(n, p)
  possible values: {0, 1, …, n}
  P(X = k) = C(n, k) · p^k · (1 − p)^(n−k)
  expected value: μ = E(X) = n·p
  variance: σ² = Var(X) = n·p·(1 − p)
  application: X is the number of successes in n independent Bernoulli trials

Poisson, X ~ Po(λ)
  possible values: {0, 1, ...}
  P(X = k) = λ^k · e^(−λ) / k!
  expected value: μ = E(X) = λ
  variance: σ² = Var(X) = λ
  application: X is the number of events in a certain interval or time-bin
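The slides use R throughout; as a language-agnostic cross-check of the table above, here is a small Python sketch that computes mean and variance of a Binomial and a Poisson distribution directly from their probability functions. The parameter values (n = 10, p = 0.3, λ = 2.5) are purely illustrative.

```python
from math import comb, exp, factorial

# Hypothetical parameters for illustration (not taken from the slides).
n, p = 10, 0.3
lam = 2.5

# Binomial: P(X=k) = C(n,k) * p^k * (1-p)^(n-k)
pmf_bin = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
mean_bin = sum(k * q for k, q in enumerate(pmf_bin))
var_bin = sum((k - mean_bin) ** 2 * q for k, q in enumerate(pmf_bin))

# Poisson: P(X=k) = lam^k * e^(-lam) / k!  (truncate the infinite sum at k=99)
pmf_poi = [lam**k * exp(-lam) / factorial(k) for k in range(100)]
mean_poi = sum(k * q for k, q in enumerate(pmf_poi))
var_poi = sum((k - mean_poi) ** 2 * q for k, q in enumerate(pmf_poi))

print(round(mean_bin, 6), round(var_bin, 6))  # n*p = 3.0 and n*p*(1-p) = 2.1
print(round(mean_poi, 6), round(var_poi, 6))  # both equal lam = 2.5
```

The printed values reproduce the E(X) and Var(X) columns of the table.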
The most important continuous distributions/models

Uniform, X ~ U(a, b)
  domain: [a, b]
  density: f(x) = 1/(b − a) for a ≤ x ≤ b, otherwise f(x) = 0
  distribution: F(x) = (x − a)/(b − a) for a ≤ x ≤ b
  expected value: E(X) = (a + b)/2
  variance: Var(X) = (b − a)²/12
  application: if all events have the same probability or if the probability is not known at all

Exponential, X ~ Exp(λ)
  domain: R0+
  density: f(x) = λ·e^(−λx)
  distribution: F(x) = 1 − e^(−λx)
  expected value: E(X) = 1/λ
  variance: Var(X) = 1/λ²
  application: waiting times, time to fail

Normal, X ~ N(μ, σ²)
  domain: R
  density: f(x) = 1/(σ·√(2π)) · e^(−(x−μ)²/(2σ²))
  distribution: F(x) = ∫_{−∞}^{x} 1/(σ·√(2π)) · e^(−(x'−μ)²/(2σ²)) dx'
  expected value: E(X) = μ
  variance: Var(X) = σ²
  application: typical measurements (affected symmetrically by various factors); asymptotic approximation for other distributions
Parameter estimation for the most important distributions
For a distribution family V with X ~ V(parameter set), the relation parameter-E(X)-Var(X) yields a parameter estimator as a function of the data:

Normal, X ~ N(μ, σ²): E(X) = μ, Var(X) = σ²
  μ̂ = Ê(X) = (1/n)·Σ_{i=1}^{n} x_i = x̄
  σ̂² = v̂ar(X) = 1/(n−1)·Σ_{i=1}^{n} (x_i − x̄)²
Binomial, X ~ B(n, p): E(X) = n·p, Var(X) = n·p·(1−p)
  p̂ = average number of successes per n trials
Poisson, X ~ Po(λ): E(X) = λ, Var(X) = λ
  λ̂ = Ê(X) = (1/n)·Σ_{i=1}^{n} x_i = x̄
Exponential, X ~ Exp(λ): E(X) = 1/λ, Var(X) = 1/λ²
  λ̂ = 1/Ê(X) = 1/x̄
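A quick Python sketch of the moment estimator for the exponential case (simulated data; the true rate λ = 4 is an illustrative assumption, not a value from the slides):

```python
import random

# Simulated data; the true rate lam_true = 4 is an illustrative assumption.
random.seed(1)
n = 100_000
lam_true = 4.0
x = [random.expovariate(lam_true) for _ in range(n)]  # X ~ Exp(lambda = 4)

# Moment estimator: E(X) = 1/lambda  =>  lambda_hat = 1 / x_bar
x_bar = sum(x) / n
lam_hat = 1 / x_bar

# Sample variance estimates Var(X) = 1/lambda^2 = 0.0625
s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

print(lam_hat)  # close to 4
print(s2)       # close to 0.0625
```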
The probability density function (pdf)
P(a ≤ X ≤ b) = ∫_a^b f(x) dx = F(b) − F(a), here with a = 0.3, b = 0.6 and f(x) = λ·e^(−λx)
The probability of getting a result between a and b is equal to the area under the density function above the interval [a, b]. The probability is calculated by integrating the density function over the interval [a, b].
In R: pexp(0.6, rate=4) - pexp(0.3, rate=4)
[Figure: exponential density ("Dichtefunktion"); x = waiting time ("Wartezeit")]
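The R call above can be mirrored in Python using the exponential CDF F(q) = 1 − e^(−rate·q); `pexp` below is a stand-in re-implementation of R's function, defined here so the example is self-contained:

```python
from math import exp

def pexp(q, rate):
    """CDF of the exponential distribution: F(q) = 1 - e^(-rate*q).
    (A stand-in for R's pexp, defined here so the example is self-contained.)"""
    return 1.0 - exp(-rate * q)

# P(0.3 <= X <= 0.6) = F(0.6) - F(0.3) for X ~ Exp(rate = 4)
prob = pexp(0.6, rate=4) - pexp(0.3, rate=4)
print(round(prob, 4))  # 0.2105
```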
Data vary! That’s why statisticians can find jobs ;-)
Example: heights of 12-year-old New Zealand schoolgirls.
[Figure: the full population ("if I measured everyone") vs. one random sample with n = 30]
Where is the center of the population?
Visualize: boxplot with memory
Source: http://www.stat.auckland.ac.nz/~wild/09.USCOTSTalk.html
[Animation frames: repeated random samples, each summarized by a boxplot overlaid on the dot plots]
Where is the center of the population? We get more certain with increasing sample size.
Confidence intervals
How sure can I be about the true parameter value?
Goal: we would like to determine, from our sample/observations, an interval which covers the true parameter value with a probability of 95%.
The notch of a boxplot, median ± 1.58·IQR/sqrt(n), covers the median "quite certainly":
boxplot(x, notch=TRUE)
Sample Variation & Central Limit Theorem
Because of sample variation, derived statistics like the mean value also vary from sample to sample.
The sample mean is an unbiased estimator for the population mean E(X).
CLT: the sample mean is approximately normally distributed around the population mean, and its variation decreases with increasing sample size:
X̄ ~a N(μ_x, σ_x²/n) = N(E(X), Var(X)/n)
[Figure: population → samples of size n = 10 → distribution of many sample means]
Animation: http://onlinestatbook.com/stat_sim/sampling_dist/index.html
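A minimal simulation sketch of this statement (Python; the exponential population and the sizes n = 50 and 5000 repeated samples are illustrative choices):

```python
import random
import statistics

# Skewed population: X ~ Exp(2), so E(X) = 0.5 and Var(X) = 0.25.
# Sample size n and the number of repeated samples are illustrative.
random.seed(42)
lam, n, n_samples = 2.0, 50, 5000

sample_means = [
    statistics.fmean(random.expovariate(lam) for _ in range(n))
    for _ in range(n_samples)
]

print(statistics.fmean(sample_means))      # close to E(X) = 0.5
print(statistics.variance(sample_means))   # close to Var(X)/n = 0.25/50 = 0.005
```

Even though the population is strongly skewed, the sample means pile up symmetrically around E(X) with variance Var(X)/n.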
Construction of an exact CI for the expected value μ
X_i i.i.d. ~ N(μ_x, σ_x²), i ∈ {1, 2, ..., n}  ⇒  X̄ ~ N(μ_x, σ_x²/n)
Known σ: test statistic or pivot T = (X̄ − μ_x)/(σ_x/√n) ~ N(0, 1).
Estimated σ: test statistic or pivot T = (X̄ − μ_x)/(σ̂_x/√n) ~ t_{df = n−1}.
The construction of the CI is based on the distribution of T under H0: μ = μ_x:
P( q_{α/2}^{t_{n−1}} ≤ (X̄ − μ_x)/(σ̂_x/√n) ≤ q_{1−α/2}^{t_{n−1}} ) = 1 − α
⇔ P( X̄ − q_{1−α/2}^{t_{n−1}}·σ̂_x/√n ≤ μ_x ≤ X̄ + q_{1−α/2}^{t_{n−1}}·σ̂_x/√n ) = 1 − α
Exact 95% CI for μ_x:  X̄ ± q_{0.975}^{t_{n−1}}·σ̂_x/√n   (with known σ: X̄ ± q_{0.975}^{z}·σ_x/√n)
Construction of an approximate CI for μ
Central Limit Theorem: X_i i.i.d., i = 1, ..., n, n ≥ 25, E(X) = μ_x, Var(X) = σ_x²  ⇒  X̄ ~a N(μ_x, σ_x²/n)
Test statistic or pivot: T = (X̄ − μ_x)/(σ̂_x/√n) ~a N(0, 1)
The construction of the CI is again based on the distribution of T under H0: μ = μ_x:
P( z_{α/2} ≤ (X̄ − μ_x)/(σ̂_x/√n) ≤ z_{1−α/2} ) ≈ 1 − α
⇔ P( X̄ − z_{1−α/2}·σ̂_x/√n ≤ μ_x ≤ X̄ + z_{1−α/2}·σ̂_x/√n ) ≈ 1 − α
Approx. 95% CI for μ_x:  X̄ ± z_{0.975}·σ̂_x/√n = X̄ ± 1.96·σ̂_x/√n
standard error: se(x̄) = sd(x)/√n
(Source: University of Zurich, Department of Biostatistics)

The CI is as random as the sample
95 out of 100 95%-CIs for μ do cover the true population parameter μ = 5 when simulating 100 random samples from a population following N(μ = 5, σ²).
With a 95%-CI we have a risk of 5% that the true population parameter is not contained in the CI.
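The coverage statement can be checked by simulation. A Python sketch using the approximate z-based CI from the previous slide (μ = 5, σ = 2, n = 30 and 1000 runs are illustrative choices):

```python
import random
import statistics

# Coverage of the approximate 95% CI  x_bar +/- 1.96 * s / sqrt(n)
# when sampling from N(mu = 5, sigma = 2); all parameters are illustrative.
random.seed(7)
mu, sigma, n, n_runs = 5.0, 2.0, 30, 1000

covered = 0
for _ in range(n_runs):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = statistics.fmean(x)
    se = statistics.stdev(x) / n ** 0.5  # estimated standard error
    if x_bar - 1.96 * se <= mu <= x_bar + 1.96 * se:
        covered += 1

print(covered / n_runs)  # close to 0.95
```

With the z quantile instead of the exact t quantile, the observed coverage at n = 30 is typically slightly below 95%.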
Source: The Cartoon Guide to Statistics, Larry Gonick and Woollcott Smith
For what purpose do we develop a statistical model?
• Description: describe data by a statistical model. This is done well with a conventional statistical model.
• Explanation: search for the "true" model to understand and causally explain the relationships between variables and to plan interventions. This is difficult with observational data; in medicine we do RCTs to learn about causal effects.
• Prediction: use the model to make reliable predictions. To evaluate and tune a good prediction model it is best to work with train, validation and test data.
Simple linear regression: 1 explanatory variable
Simple Linear Regression
Example: in our first example we investigate how the height of trees depends on the pH-level of the soil.
In India, it was observed that alkaline soil hampers plant growth. This gave rise to a search for tree species which show high tolerance against high pH-values.
An outdoor trial was performed, where 120 trees of a particular species were planted on a big field with varying pH-values.
After 3 years of growth, every tree's height was measured. Additionally, the pH-value of the soil in the vicinity of each tree was determined and recorded.
Scatterplot: tree height vs. pH-value
[Figure: scatterplot of height vs. phvalue]
Which height would we expect at pH = 7.9?
Systematic relation: what is a good model?
[Figure: three candidate models fitted to the tree height vs. pH-value scatterplot]
What is a good model for the relation between pH-value and tree height?
The first model fits the training data perfectly but probably overfits the data.
To evaluate the performance of a model we can use cross-validation: leave out each data point in turn, fit the model with the remaining data, and use the model to predict the left-out value. The best model is the one which produces the best predictions on new or left-out data points.
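The leave-one-out procedure described above can be sketched as follows (Python, with made-up nearly-linear toy data and closed-form simple-regression OLS; none of the numbers come from the slides):

```python
# Closed-form OLS for simple linear regression (intercept + slope).
def ols(xs, ys):
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
          / sum((x - xb) ** 2 for x in xs))
    return yb - b1 * xb, b1  # intercept, slope

# Made-up, nearly linear toy data (roughly y = 2x).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

# Leave-one-out: refit without point i, then predict the left-out point.
sq_errs = []
for i in range(len(x)):
    b0, b1 = ols(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    sq_errs.append((y[i] - (b0 + b1 * x[i])) ** 2)

loocv_mse = sum(sq_errs) / len(sq_errs)
print(loocv_mse)  # small, since the data are nearly linear
```

An overfitting model (e.g. a high-degree polynomial through all points) would score much worse on this left-out-point error despite a perfect fit on the training data.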
Reading the R summary of a linear model
> summary(fit)
Call: lm(formula = height ~ phvalue, data = treeheight)
Coefficients: Estimate Std. Error t-value Pr(>|t|)
(Intercept) 28.7227 2.2395 12.82 <2e-16 ***
phvalue -3.0034 0.2844 -10.56 <2e-16 ***
Residual stand. err.: 1.008 on 121 degrees of freedom
Multiple R-squared: 0.4797, Adjusted R-squared: 0.4754
F-statistic: 111.5 on 1 and 121 DF, p-value: < 2.2e-16
How to read this output:
- intercept row: α̂ with its standard error se(α̂)
- slope row: β̂ with its standard error se(β̂); test value t = (β̂ − 0)/se(β̂); p-value p for H0: β = 0 (analogously for H0: α = 0)
- fitted model: ŷ_i = α̂ + β̂·x_i = 28.7 − 3·x_i
- residual standard error σ̂; degrees of freedom = #obs. − #(estimated parameters)
- R² (in simple regression equals corr(x,y)²); adjusted R² (use in multiple regression)
- F-statistic: global test for the model (is the full model better than only using the intercept?)
Linear regression: a traditional view as seen in many textbooks
Y_i = α + β·x_i + ε_i,  ε_i i.i.d. ~ N(0, σ²)
Systematic part of the model: α + β·x_i. Random part of the model: the random errors ε_i.
The regression line will not run through all the data points; thus, there are random errors.
Meaning of variables/parameters:
- y_i is the response variable (height) of observation i.
- x_i is the predictor variable (pH-value) of observation i.
- α, β are the regression coefficients. They are unknown beforehand and need to be estimated from the data.
- ε_i is the residual or error, i.e. the random difference between observation and regression line.
Linear regression: a more general view
Model for the conditional probability distribution (CPD):
Y_i = (Y | X = x_i) ~ N(μ_{x_i}, σ²), with E(Y | X = x_i) = μ_{x_i} = β0 + β1·x_i
Var(Y_i) = Var(Y | X = x_i) = Var(ε_i) = σ², where ε_i i.i.d. ~ N(0, σ²) (identically and independently distributed)
The predicted value of the linear regression gives only one of the parameters of the CPD: μ_x, which depends on the predictor values. The second parameter of the CPD (σ²) is assumed to be independent of the predictor values and defines the variance of the error term ε.
The marginal probability distribution (MPD) of Y is not restricted: Y is continuous and can have an arbitrary marginal probability distribution, Y ~ V_continuous(arbitrary); only the CPDs are modeled as Normal.
Least Squares Fitting
We need to fit a straight line that fits the data well. Many possible solutions exist; some are good, some are worse. Our paradigm is to fit the line such that the squared errors are minimized.
We minimize the sum of squared residuals:
Σ_{i=1}^{n} r_i² = Σ_{i=1}^{n} (y_i − ŷ_i)² = Σ_{i=1}^{n} (y_i − (α̂ + β̂·x_i))² → min!
http://hspm.sph.sc.edu/courses/J716/demos/LeastSquares/LeastSquaresDemo.html
Remark: according to the Gauss-Markov theorem, the OLS (ordinary least squares) fitting procedure leads to the best linear unbiased estimators (BLUE) of the regression parameters.
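The least-squares criterion can be checked numerically: the closed-form OLS solution β̂1 = Σ(x_i − x̄)(y_i − ȳ)/Σ(x_i − x̄)², α̂ = ȳ − β̂1·x̄ minimizes the sum of squared residuals, and its residuals sum to zero. A Python sketch with made-up data in the spirit of the tree example:

```python
# Made-up data in the spirit of the tree example (height vs. pH).
x = [7.5, 7.8, 8.0, 8.3, 8.6]
y = [5.2, 4.4, 4.7, 3.6, 3.1]

n = len(x)
xb, yb = sum(x) / n, sum(y) / n
b1 = (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
      / sum((xi - xb) ** 2 for xi in x))
b0 = yb - b1 * xb

def ssr(a, b):
    """Sum of squared residuals for the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Nearby lines always have a larger (or equal) sum of squared residuals:
print(ssr(b0, b1) <= ssr(b0 + 0.1, b1))  # True
print(ssr(b0, b1) <= ssr(b0, b1 + 0.1))  # True

# The OLS residuals sum to zero (compare residual assumption a) later on):
print(round(abs(sum(yi - (b0 + b1 * xi) for xi, yi in zip(x, y))), 10))  # 0.0
```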
Linear regression in R
> summary(fit)
Call: lm(formula = height ~ phvalue, data = treeheight)
Coefficients: Estimate Std. Error t-value Pr(>|t|)
(Intercept) 28.7227 2.2395 12.82 <2e-16 ***
phvalue -3.0034 0.2844 -10.56 <2e-16 ***
Residual stand. err.: 1.008 on 121 degrees of freedom
Multiple R-squared: 0.4797 (R²), Adjusted R-squared: 0.4754
F-statistic: 111.5 on 1 and 121 DF, p-value: < 2.2e-16 (global test for the model; will see later)
Intercept: α̂ with se(α̂), test value, p-value. Slope: β̂ with se(β̂), test value, p-value.
Fitted model: ŷ_i = 28.7 − 3·x_i
Least Squares Regression Model
[Figure: tree height vs. pH-value with the fitted regression line, marked at pH = 8]
height(ph) = 28.7 − 3·ph
height(8) = 28.7 − 24 = 4.7
Confidence and "prediction" intervals
The expected value of y at the position x is covered with 95% certainty by the confidence interval.
95% of all individual observations y (of the training data set) are contained in the "prediction" interval.
[Figure: regression model with confidence interval (inner band) and upper/lower prediction limits around E(y(x))]
Interesting intervals when doing regression, for the model Y_i = a + b·x_i + ε_i:
CI for the y-intercept: â ± q_{0.975}^{t_{n−2}}·se(â)
  confint(fit, parm=1, level=0.95)
  2.5 % 97.5 %
  (Intercept) -1513786 -881578.2
CI for the slope: b̂ ± q_{0.975}^{t_{n−2}}·se(b̂)
  confint(fit, parm=2, level=0.95)
  2.5 % 97.5 %
  x 124.4983 153.0250
CI for E(y_k): ŷ_k ± q_{0.975}^{t_{n−2}}·se(ŷ_k)
  predict(fit, new, se.fit=T, interval=c("confidence"))
  fit lwr upr
  1855075 1829770 1880379
Prediction interval for y_k: ŷ_k ± q_{0.975}^{t_{n−2}}·se(pred_k)
  predict(fit, new, se.fit=T, interval=c("prediction"))
  fit lwr upr
  1855075 1728713 1981436
Residual analysis = checking the model assumptions
Assumptions of a linear regression model
Before we continue to look into the results, we need to check if the modelling assumptions are met. Why? Because otherwise we draw invalid conclusions from the results.
The assumption we made here is that the errors ε_i are i.i.d. ~ N(0, σ²).
We use the observed residuals as estimates for the unobserved errors. This implies four things for the residuals:
a) The expected value of r_i is 0: E(r_i) = 0.
b) All r_i have the same variance: Var(r_i) = σ̂².
c) The r_i are normally distributed.
d) The r_i are independent of each other.
Observed residuals serve as estimates for the errors
True model: y_i = β0 + β1·x_{1i} + ε_i  ⇒  ε_i = y_i − (β0 + β1·x_{1i})
Fitted model: ŷ_i = β̂0 + β̂1·x_{1i}  ⇒  r_i = y_i − (β̂0 + β̂1·x_{1i}) = y_i − ŷ_i
[Figure: data point (x_i, y_i), fitted line β̂0 + β̂1·x, and residual r_i; x-axis: independent variable, y-axis: dependent variable]
Standardized and Studentized residuals
The standardized residual r̃_i can be derived from the raw residual r_i by dividing it by an estimate of its standard deviation:
r̃_i = r_i / (σ̂_E·√(1 − H_ii))
where σ̂_E is the residual standard error and H is the hat matrix.
With the same formula we get Studentized residuals if the estimate of the residual standard error σ̂_E is obtained by ignoring the i-th data point.
Model checking: residual analysis in R
There are 4 "standard plots" in R:
- Residuals vs. Fitted, aka Tukey-Anscombe plot
- Normal plot (uses standardized residuals)
- Scale-Location plot (uses standardized residuals)
- Leverage plot (uses standardized residuals)
In R: > plot(fit)
The residual vs. fitted plot, aka the Tukey-Anscombe plot
The Tukey-Anscombe diagram plots the residuals against the fitted values and is the most important model checking tool. This plot is ideal to check if assumptions a) and b) (and partially d)) are met.
A perfect Tukey-Anscombe plot shows a horizontal smoother at height 0, around which the residuals are distributed with the same variance at each point.
Examples of Tukey-Anscombe plots
Source: http://www.uni-forst.gwdg.de/~dgaffre/elan/institut_fbi/skripten/statistica/v10/v10a.html
[Figure: four example plots of residuals ("Residuen") vs. fitted values]
The Normal plot
With the Normal plot we check if the residuals show strong deviations from a Normal distribution. In a perfect Normal plot all points are close to a straight line.
Draw data from a Normal distribution and generate the Normal quantile-quantile plots:
# normal Q-Q plot w/o CI
qqnorm(residuals(fit))
# normal Q-Q plot with CI
library(car)
qqPlot(residuals(fit), dist="norm")
The Scale-Location plot
Here we plot √|r̃_i| vs. ŷ_i and check for constant variance: the spread of the absolute residual values should not change over the range of fitted values, i.e. the variance of the residuals should be constant.
A perfect Scale-Location plot shows a smooth horizontal line.
The Leverage plot
With the Leverage plot we check for influential points with large Cook's distances.
A Leverage plot without points beyond the dashed level curves is fine.
Points with a Cook's distance larger than 0.5 or 1 must be checked further.
What is meant by leverage points?
A high leverage of an observation y_i means that the data point has an extreme predictor value and has the potential to force the regression relation to strongly adapt to that data point. The leverage is simply given by the i-th diagonal element of the hat matrix H, since H_ii·Δy_i is the change in ŷ_i if y_i changes by Δy_i. The average leverage in a regression model with p estimated coefficients and 1 intercept is given by (p+1)/n. We say a data point has high leverage if H_ii ≥ 2·(p+1)/n.
What is Cook's Distance measuring?
With Cook's distance we estimate the potential change in all the fitted values if the i-th data point is omitted from the analysis:
D_i = Σ_k (ŷ_k − ŷ_k^[i])² / ((p+1)·σ̂_E²) = (H_ii / (1 − H_ii)) · r̃_i² / (p+1)
If D_i ≥ 0.5, the i-th data point is called influential.
If D_i ≥ 1, the i-th data point might be really dangerous.
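The leverage formula above can be illustrated numerically. This Python sketch computes the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ for a simple regression with one deliberately extreme predictor value (the data are made up); the leverages sum to the number of estimated coefficients, here p + 1 = 2:

```python
# Simple regression design matrix with an intercept column; the last
# x value is deliberately extreme (all numbers are made up).
x = [1.0, 2.0, 3.0, 4.0, 10.0]
n = len(x)
X = [[1.0, xi] for xi in x]

# (X'X)^-1 for the 2x2 case.
sx, sxx = sum(x), sum(xi * xi for xi in x)
det = n * sxx - sx * sx
inv = [[sxx / det, -sx / det], [-sx / det, n / det]]

# Leverage H_ii = x_i' (X'X)^-1 x_i  (diagonal of the hat matrix).
H_ii = [
    sum(X[i][a] * inv[a][b] * X[i][b] for a in range(2) for b in range(2))
    for i in range(n)
]

print([round(h, 3) for h in H_ii])  # [0.38, 0.28, 0.22, 0.2, 0.92]
print(round(sum(H_ii), 6))          # 2.0 = p + 1 (trace of the hat matrix)
print(H_ii[-1] >= 2 * 2 / n)        # True: x = 10 is a high-leverage point
```

The extreme point clears the 2·(p+1)/n = 0.8 threshold by a wide margin, matching the rule of thumb above.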
Plot residuals vs. predictors
Residuals should not show structure when plotted vs. the fitted values or any of the predictors.
ε ~ N(0, σ²·I), i.i.d.  →  residual plots look o.k.
[Figure: Tukey-Anscombe plot]
What to do if the model assumptions are violated?
Should we apply a transformation to the outcome and the predictors?
See the in-class exercises on transformations for the effect of applying a non-linear transformation to the response or predictor variable.
How can a linear regression model be improved?
Improve the structural form of the model
- add missing predictors
- add interactions
- apply transformations to predictors (change the form of the relationship between y and predictor)
- apply a transformation to the outcome (to stabilize the residual variance & change the form)
Handle extreme values, leverage points, outliers
- outliers and leverage points should be identified with diagnostic plots
- check if the use of robust methods is necessary
- use transformations to make variable distributions less skewed
Make sure that observations within a group are independent
- watch out for unrecorded predictors or an inhomogeneous population
- observations coming from a matched study are not independent (analyze such data e.g. with mixed models)
- subjects may influence other subjects under study
Consider using another model
- e.g. Poisson regression in case of count data
First Aid Transformations (variance-stabilizing transformations)
Always apply these (if no practical reasons speak against it) to both response and predictors:
- Absolute values and concentrations → log transformation: y → log(y)
- Count data → square-root transformation: y → √y
- Proportions → arcsine transformation: y → arcsin(√y)
Motivation for the partial residual plot
The partial residuals r_{j,partial} give insight into the relationship between predictor x_j and the "adjusted outcome", which is corrected for the effect of all other predictors:
r_{j,partial} = y − Σ_{k≠j} β̂_k·x_k = β̂_j·x_j + r
Loosely speaking, partial residuals are the residuals we get when the contributions of all predictors except x_j are removed from the outcome.
A perfect partial residual plot shows a linear relation between r_{j,partial} and x_j.
Partial residual plots in R:
- library(car); crPlots(...)
- library(faraway); prplot(...)
- residuals(fit, type="partial")
[Figures: Mortality vs. log(NOx) (marginal plot) and the partial residual plot for log(NOx)]
Marginal relationship ≠ relation in partial residual plot
The marginal plot of outcome vs. predictor does not take into account the influence of all other predictors on the outcome and is therefore not appropriate if we are interested in the additional influence of a predictor on the outcome, given all other predictors are already in the model.
[Figures: Mortality vs. log(NOx) (marginal plot) and the partial residual plot for log(NOx)]
Partial residual plots help to find transformations for predictors
We want to check if the adjusted relation between an untransformed predictor and the outcome is linear. Hence, we use the partial residual plot, which only shows the "isolated" influence of that predictor on the response.
The observed shape in the partial residual plot indicates if/which transformation we should use for the selected predictor.
[Figure: three example plots of r_{j,partial} vs. x_j: (1) a quadratic transformation of x_j might be appropriate; (2) no transformation required; (3) predictor x_j has no additional explanatory power when added to the model]
ANOVA = ANalysis Of VAriance
Total sample variability (TSS) = variability explained by the model (SSmodel) + unexplained (or error) variability (RSS)
Example with one factorial predictor: do medical doctors spend less time with obese patients?
In an observational study it was measured how much time doctors spend with a patient.
Do medical doctors spend less time with obese patients? How can we test this with linear regression and ANOVA?
An ANOVA with 1 factor with 2 levels is equivalent to a two-sample t-test. (Normality check passed.)
t.test(TIME~WEIGHT, data=dat)
# t = 2.9, df = 67, p-value = 0.0057
# alternative hypothesis: true difference in
# means is not equal to 0
# 95 percent confidence interval:
# 2 11
# sample estimates:
# mean of x mean of y
# 31 25
# do it by regression with one factorial predictor:
fit=lm(TIME~WEIGHT, data=dat)
anova(fit) # get anova-table from lm-object
# Response: TIME
# Df Sum Sq Mean Sq F value Pr(>F)
# WEIGHT 1 776 776 8.16 0.0057 **
# Residuals 69 6561 95
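The claimed equivalence can be verified numerically: for one factor with 2 levels, the ANOVA F statistic equals the square of the pooled two-sample t statistic. A Python sketch with made-up group data (not the doctors' data from the slide):

```python
# Two made-up groups (not the doctors' data from the slide).
g1 = [31.0, 29.5, 33.0, 30.5, 28.0, 34.0]
g2 = [25.0, 26.5, 23.0, 27.0, 24.5, 22.0]

n1, n2 = len(g1), len(g2)
m1, m2 = sum(g1) / n1, sum(g2) / n2
ss1 = sum((v - m1) ** 2 for v in g1)
ss2 = sum((v - m2) ** 2 for v in g2)

# Pooled two-sample t statistic.
sp2 = (ss1 + ss2) / (n1 + n2 - 2)
t = (m1 - m2) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5

# One-way ANOVA F statistic: (SS_model / df_model) / (RSS / df_resid).
grand = (sum(g1) + sum(g2)) / (n1 + n2)
ss_model = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
F = (ss_model / 1) / ((ss1 + ss2) / (n1 + n2 - 2))

print(abs(t ** 2 - F) < 1e-9)  # True: F = t^2
```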
How to test for an effect between >2 groups? Applying 1-way ANOVA with >2 levels
Here we want to investigate if three different treatments result in different levels of the output: folate in red blood cells.
We can apply a regression with the group factor as predictor to investigate this question, given the folate values y in each group are i.i.d. normally distributed (check not shown).
fit=lm(folate~group, data=dat)
anova(fit) # p=0.044
Since p < 5%, we can conclude that there are differences, i.e. the folate level is not the same in all groups.
Remark: if there is only 1 factor as predictor, like treatment group, we talk about 1-way ANOVA regardless of the number of groups.
The ANOVA gets significant. Between which groups are the differences?
The significant ANOVA result only tells us that there are some differences. We need to perform post-hoc tests to investigate between which groups we can really find differences.
We can perform three pair-wise t-tests. In the result of the (uncorrected) pair-wise t-tests, only the t-test comparing group 1 versus group 2 gets significant.
We need to correct for multiple testing, e.g. by Bonferroni correction. Here, this correction leads to non-significance for all 3 tests.
List of post-hoc tests (from wiki):
• Fisher's least significant difference: LSD
• Bonferroni correction
• Duncan's new multiple range test
• Friedman test
• Newman–Keuls method
• Scheffé's method
• Tukey's range test
• Dunnett's test
Multiple linear regression: ≥2 explanatory variables
Multiple linear regression: interpretation of coefficients as used in descriptive modelling
y_i = β0 + β1·x_{i1} + ... + βp·x_{ip} + ε_i,  ε_i ~ N(0, σ²)
β_k = E(y | x_k + 1, all other x fixed) − E(y | x_k, all other x fixed)
The coefficient β_k gives the change of the outcome y when the explanatory variable x_k is increased by one unit and all other variables are held constant.
In matrix notation: y = Xβ + ε, with
y = (y_1, y_2, ..., y_n)ᵀ
X = the n × (p+1) design matrix with rows (1, x_{i1}, ..., x_{ip})
β = (β_0, β_1, ..., β_p)ᵀ
ε = (ε_1, ε_2, ..., ε_n)ᵀ
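As a sketch of the matrix form, the OLS estimate solves the normal equations (XᵀX)β = Xᵀy. The following Python code builds a small design matrix, solves the normal equations with plain Gaussian elimination, and recovers the true coefficients of an illustrative, noise-free model (all numbers are made up):

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    m = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(m):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][m] / M[i][i] for i in range(m)]

# Illustrative model y = 2 + 0.5*x1 - 1.5*x2 (no noise, so OLS is exact).
data = [(x1, x2) for x1 in range(5) for x2 in range(4)]
X = [[1.0, x1, x2] for x1, x2 in data]          # design matrix with intercept
y = [2 + 0.5 * x1 - 1.5 * x2 for x1, x2 in data]

# Normal equations: (X'X) beta = X'y
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
beta = solve(XtX, Xty)
print([round(b, 6) for b in beta])  # [2.0, 0.5, -1.5]
```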
Modeling HDL as an example of multiple regression:
Estimate Std. Error t-value Pr(>|t|)
Intercept 1.16448 0.28804 4.04 <.0001
AGE -0.00092 0.00125 -0.74 0.4602
BMI -0.01205 0.00295 -4.08 <.0001
BLC 0.05055 0.02215 2.28 0.0239
PRSSY -0.00041 0.00044 -0.95 0.3436
DIAST 0.00255 0.00103 2.47 0.0147
GLUM -0.00046 0.00018 -2.50 0.0135
SKINF 0.00147 0.00183 0.81 0.4221
LCHOL 0.31109 0.10936 2.84 0.0051
The predictors of log(HDL) are age, body mass index, blood vitamin C,
systolic and diastolic blood pressures, skinfold thickness, and the log of
total cholesterol. The equation is:
Log(HDL) = 1.16 - 0.00092(Age) -0.012(BMI)+…+ 0.311(LCHOL)
The meanings of the coefficients in the HDL example
Y_i = β0 + β1·x_{i1} + β2·x_{i2} + ... + βp·x_{ip} + ε_i
log(HDL) = 1.16 − 0.00092(Age) − 0.012(BMI) + … + 0.311(LCHOL)
Interpretation of the coefficients on the previous slide:
We need to use the entire equation for making predictions.
Each coefficient β_j measures the difference in expected log(HDL) between 2 subjects if the factor x_j differs by 1 unit between the two subjects, and if all other factors are the same.
E.g., the expected log(HDL) is 0.012 lower in a subject whose BMI is 1 unit greater, but who is the same as the other subject on all other factors.
The meanings of the p-values and the coefficients in the multiple linear regression output
The p-values measure the significance of the association of a factor with log(HDL) in the presence of all other predictors of the model.
This is sometimes expressed as "after accounting for other factors" or "adjusting for other factors", and is called an independent association.
SKINF alone is probably associated. However, its p = 0.42 says that it provides no additional information that helps to predict log(HDL), after accounting for other factors such as BMI.
The p-value and also the coefficient value of a predictor depend in general not only on the association with the outcome variable but also on the other predictors in the model. Only if all predictors are independent does multiple regression lead to the same p-values and coefficients as p simple regressions, each with only one predictor.
Significance vs. Relevance
The larger a sample, the smaller the p-values for the very same predictor effect. Thus, do not confuse a small p-value with an important predictor effect!
More important than p-values:
• Look at the absolute values of (significant) coefficients.
• Look at confidence intervals!
ANCOVA = ANalysis Of COVAriance
Linear regression with continuous and factorial predictors
Output: hours: lifetime of a cutting tool
Predictor 1: rpm: speed of the machine in rpm
Predictor 2: tool: tool type A or B
fit1 <- lm(hours ~ rpm + tool, data=my.dat)
[Figure: Durability of lathe cutting tools; hours vs. rpm with parallel fitted lines for tools A and B]
We have an additive model: the difference between the tools is a shift.
What does interaction mean? Different slopes of continuous variables at different levels of a factor
Do not allow for interaction: fit1 = lm(hours ~ rpm + tool, data=my.dat)
Interaction allowed: fit2 = lm(hours ~ rpm * tool, data=my.dat)
[Figures: Durability of lathe cutting tools; fitted lines without interaction (parallel) and with interaction (different slopes)]
In case of interaction, the slope of the predictor "rpm" changes for different levels of the second predictor "tool".
Do we get the same slope in rpm for tool A and tool B? Is there an interaction between rpm and tool?
fit2 <- lm(hours ~ rpm * tool, data=my.dat)
> summary(fit2)
Coefficients:
            Estimate  Std. Error t value Pr(>|t|)
(Intercept) 32.774760 4.633472   7.073   2.63e-06 ***
rpm         -0.020970 0.006074   -3.452  0.00328 **
toolB       23.970593 6.768973   3.541   0.00272 **
rpm:toolB   -0.011944 0.008842   -1.351  0.19553
---
Residual standard error: 2.968 on 16 degrees of freedom
Multiple R-squared: 0.9105, Adjusted R-squared: 0.8937
F-statistic: 54.25 on 3 and 16 DF, p-value: 1.319e-08
Fitted model: hours = 32.8 − 0.02·rpm + 24·toolB − 0.01·(rpm·toolB)
The main effects are hard to interpret in case of interactions.
Here the interaction seems not to be significant. With ANOVA we can test for nested models whether the more complex model leads to a significant improvement.
How to read a model with interaction?
hours = 32.8 − 0.02·rpm + 24·toolB − 0.01·(rpm·toolB)
tool B (toolB = 1): hours = 32.8 + 24·1 − 0.02·rpm − 0.01·(rpm·1) = 56.8 − 0.03·rpm
tool A (toolB = 0): hours = 32.8 + 24·0 − 0.02·rpm − 0.01·(rpm·0) = 32.8 − 0.02·rpm
In case of interaction, the slope of the predictor "rpm" changes for different levels of the second predictor "tool"; also the intercept changes for the two tools.
[Figure: Durability of lathe cutting tools, with interaction allowed]
Remark: In case of interaction between two continuous predictors, the slope (and intercept) of one predictor changes continuously with the value of the other predictor and vice versa.
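The two per-tool equations can be verified mechanically; this Python sketch plugs the rounded coefficients from the summary into the fitted equation and recovers the per-tool slopes:

```python
# Rounded coefficients from the fitted interaction model shown above.
b0, b_rpm, b_toolB, b_int = 32.8, -0.02, 24.0, -0.01

def hours(rpm, toolB):
    """Fitted value: intercept + rpm effect + tool shift + interaction."""
    return b0 + b_rpm * rpm + b_toolB * toolB + b_int * rpm * toolB

# tool A (toolB = 0): intercept 32.8, slope -0.02
# tool B (toolB = 1): intercept 32.8 + 24 = 56.8, slope -0.02 - 0.01 = -0.03
slope_A = hours(601, 0) - hours(600, 0)
slope_B = hours(601, 1) - hours(600, 1)
print(round(slope_A, 6), round(slope_B, 6))  # -0.02 -0.03
```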
Do we need the complex model with the interaction?
Do not allow for interaction: fit1 = lm(hours ~ rpm + tool, data=my.dat)
Interaction allowed: fit2 = lm(hours ~ rpm * tool, data=my.dat)
[Figures: Durability of lathe cutting tools without and with interaction]
anova(fit2, fit1, test="F")
# p>5%, therefore the interaction is not needed
Steps in linear modelling
0) Preprocessing
- learn the meaning of all variables, check for correlations
- give short and informative names
- check for impossible values, errors
- if they exist (missing, error): set them to NA
- consider imputation methods, but be careful
1) First-aid transformations
- bring all variables to a suitable scale (use also field knowledge)
- routinely apply the first-aid transformations
2) Find a good model
- start with a model including important confounders
- perform a residual analysis
- improve the model by transformations or adding better predictors
- reduce complexity step by step (be aware of introduced biases)
- use your specific knowledge to choose between variables
Limits of linear regression
If your residuals do not follow a Normal distribution (even after transformations), use generalized linear modeling (glm, e.g. logistic regression).
If your predictors show a strong correlation, use shrinkage methods (e.g. lasso).
If your data are not independent, use mixed models or methods for time series.
If you do not have a linear relation, use non-linear regression (e.g. nlm) or generalized additive models (e.g. gam) or tree models.