TRANSCRIPT
A refresher in applied statistics
Model fitting
- parameter point & interval estimates
Simple and multiple linear regression
ANOVA and ANCOVA
Beate Sick
We use R for performing statistical data analysis. Recommended environment: RStudio.
Main reasons:
• open source
• powerful
• widespread
• reproducible
• transparent
Statistics connects data with models
probability world ↔ data world
Data, sample: describe data, visualize & summarize.
Model: probability calculus, predictions.
Inductive statistics connects the two sides:
- model choice
- parameter estimation
- confidence intervals
- tests
- regression, ANOVA
We use a sample to learn about the population
Results from statistical inference are only correct if the sample was representative.
A sample is representative if it does not systematically differ from the population (e.g. the percentages of males and females are similar in the sample and in the population).
The world of data and the world of models
data/reality ↔ model
sample ↔ population
discrete data/features (numeric or categorical) ↔ discrete random variable (numeric)
continuous data/features (numeric) ↔ continuous random variable (numeric)
observation ↔ random variable
relative frequency ↔ probability P
histogram (scaled) ↔ density (continuous probability distribution)
bar plot of frequencies (scaled; rel. frequency at discrete features) ↔ discrete probability distribution
average x̄ ↔ expected value μ
sample variance s² ↔ variance σ²
The expected value = population mean
• The expected value of a random variable is the average we would get with an infinitely large sample.
• It measures the location of the random variable.
• It corresponds to the centre of mass of the density (see red line).
• It often determines the parameter of the model.
• The expected value can also be calculated from the probability function or density:
  E(X) = Σ_{i=1}^{n} x_i · P(X = x_i)   (discrete case)
  E(X) = ∫ x · f(x) dx   (continuous case)
[Figures: density f(x) of an exponential distribution ("Exponentialverteilung"); probabilities P(X = k) of Po(λ = 2.5)]
The most famous discrete distributions/models

Bernoulli, X ~ Bern(p)
  possible values: {0, 1}
  P(X = 1) = p, P(X = 0) = 1 − p
  expected value: μ = E(X) = p
  variance: σ² = Var(X) = p·(1 − p)
  application: X indicates if an event occurs or not

Binomial, X ~ B(n, p)
  possible values: {0, 1, …, n}
  P(X = k) = C(n, k) · p^k · (1 − p)^(n−k)
  expected value: μ = E(X) = n·p
  variance: σ² = Var(X) = n·p·(1 − p)
  application: X is the number of successes in n independent Bernoulli trials

Poisson, X ~ Po(λ)
  possible values: {0, 1, ...}
  P(X = k) = λ^k · e^(−λ) / k!
  expected value: μ = E(X) = λ
  variance: σ² = Var(X) = λ
  application: X is the number of events in a certain interval or time-bin
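The slides use R throughout; as a language-agnostic cross-check of the table above, here is a small Python sketch that computes mean and variance of a Binomial and a Poisson distribution directly from their probability functions. The parameter values (n = 10, p = 0.3, λ = 2.5) are purely illustrative.

```python
from math import comb, exp, factorial

# Hypothetical parameters for illustration (not taken from the slides).
n, p = 10, 0.3
lam = 2.5

# Binomial: P(X=k) = C(n,k) * p^k * (1-p)^(n-k)
pmf_bin = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
mean_bin = sum(k * q for k, q in enumerate(pmf_bin))
var_bin = sum((k - mean_bin) ** 2 * q for k, q in enumerate(pmf_bin))

# Poisson: P(X=k) = lam^k * e^(-lam) / k!  (truncate the infinite sum at k=99)
pmf_poi = [lam**k * exp(-lam) / factorial(k) for k in range(100)]
mean_poi = sum(k * q for k, q in enumerate(pmf_poi))
var_poi = sum((k - mean_poi) ** 2 * q for k, q in enumerate(pmf_poi))

print(round(mean_bin, 6), round(var_bin, 6))  # n*p = 3.0 and n*p*(1-p) = 2.1
print(round(mean_poi, 6), round(var_poi, 6))  # both equal lam = 2.5
```

The printed values reproduce the E(X) and Var(X) columns of the table.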
The most important continuous distributions/models

Uniform, X ~ U(a, b)
  domain: [a, b]
  density: f(x) = 1/(b − a) for a ≤ x ≤ b, otherwise f(x) = 0
  distribution: F(x) = (x − a)/(b − a) for a ≤ x ≤ b
  expected value: E(X) = (a + b)/2
  variance: Var(X) = (b − a)²/12
  application: if all events have the same probability or if the probability is not known at all

Exponential, X ~ Exp(λ)
  domain: R0+
  density: f(x) = λ·e^(−λx)
  distribution: F(x) = 1 − e^(−λx)
  expected value: E(X) = 1/λ
  variance: Var(X) = 1/λ²
  application: waiting times, time to fail

Normal, X ~ N(μ, σ²)
  domain: R
  density: f(x) = 1/(σ·√(2π)) · e^(−(x−μ)²/(2σ²))
  distribution: F(x) = ∫_{−∞}^{x} 1/(σ·√(2π)) · e^(−(x'−μ)²/(2σ²)) dx'
  expected value: E(X) = μ
  variance: Var(X) = σ²
  application: typical measurements (affected symmetrically by various factors); asymptotic approximation for other distributions
Parameter estimation for the most important distributions
For a distribution family V with X ~ V(parameter set), the relation parameter-E(X)-Var(X) yields a parameter estimator as a function of the data:

Normal, X ~ N(μ, σ²): E(X) = μ, Var(X) = σ²
  μ̂ = Ê(X) = (1/n)·Σ_{i=1}^{n} x_i = x̄
  σ̂² = v̂ar(X) = 1/(n−1)·Σ_{i=1}^{n} (x_i − x̄)²
Binomial, X ~ B(n, p): E(X) = n·p, Var(X) = n·p·(1−p)
  p̂ = average number of successes per n trials
Poisson, X ~ Po(λ): E(X) = λ, Var(X) = λ
  λ̂ = Ê(X) = (1/n)·Σ_{i=1}^{n} x_i = x̄
Exponential, X ~ Exp(λ): E(X) = 1/λ, Var(X) = 1/λ²
  λ̂ = 1/Ê(X) = 1/x̄
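A quick Python sketch of the moment estimator for the exponential case (simulated data; the true rate λ = 4 is an illustrative assumption, not a value from the slides):

```python
import random

# Simulated data; the true rate lam_true = 4 is an illustrative assumption.
random.seed(1)
n = 100_000
lam_true = 4.0
x = [random.expovariate(lam_true) for _ in range(n)]  # X ~ Exp(lambda = 4)

# Moment estimator: E(X) = 1/lambda  =>  lambda_hat = 1 / x_bar
x_bar = sum(x) / n
lam_hat = 1 / x_bar

# Sample variance estimates Var(X) = 1/lambda^2 = 0.0625
s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

print(lam_hat)  # close to 4
print(s2)       # close to 0.0625
```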
The probability density function (pdf)
P(a ≤ X ≤ b) = ∫_a^b f(x) dx = F(b) − F(a), here with a = 0.3, b = 0.6 and f(x) = λ·e^(−λx)
The probability of getting a result between a and b is equal to the area under the density function above the interval [a, b]. The probability is calculated by integrating the density function over the interval [a, b].
In R: pexp(0.6, rate=4) - pexp(0.3, rate=4)
[Figure: exponential density ("Dichtefunktion"); x = waiting time ("Wartezeit")]
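The R call above can be mirrored in Python using the exponential CDF F(q) = 1 − e^(−rate·q); `pexp` below is a stand-in re-implementation of R's function, defined here so the example is self-contained:

```python
from math import exp

def pexp(q, rate):
    """CDF of the exponential distribution: F(q) = 1 - e^(-rate*q).
    (A stand-in for R's pexp, defined here so the example is self-contained.)"""
    return 1.0 - exp(-rate * q)

# P(0.3 <= X <= 0.6) = F(0.6) - F(0.3) for X ~ Exp(rate = 4)
prob = pexp(0.6, rate=4) - pexp(0.3, rate=4)
print(round(prob, 4))  # 0.2105
```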
Data vary! That’s why statisticians can find jobs ;-)
Example: heights of 12-year-old New Zealand schoolgirls.
[Figure: the full population ("if I measured everyone") vs. one random sample with n = 30]
Where is the center of the population?
Visualize: boxplot with memory
Source: http://www.stat.auckland.ac.nz/~wild/09.USCOTSTalk.html
[Animation frames: repeated random samples, each summarized by a boxplot overlaid on the dot plots]
Where is the center of the population? We get more certain with increasing sample size.
Confidence intervals
How sure can I be about the true parameter value?
Goal: we would like to determine, from our sample/observations, an interval which covers the true parameter value with a probability of 95%.
The notch of a boxplot, median ± 1.58·IQR/sqrt(n), covers the median "quite certainly":
boxplot(x, notch=TRUE)
Sample Variation & Central Limit Theorem
Because of sample variation, derived statistics like the mean value also vary from sample to sample.
The sample mean is an unbiased estimator for the population mean E(X).
CLT: the sample mean is approximately normally distributed around the population mean, and its variation decreases with increasing sample size:
X̄ ~a N(μ_x, σ_x²/n) = N(E(X), Var(X)/n)
[Figure: population → samples of size n = 10 → distribution of many sample means]
Animation: http://onlinestatbook.com/stat_sim/sampling_dist/index.html
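A minimal simulation sketch of this statement (Python; the exponential population and the sizes n = 50 and 5000 repeated samples are illustrative choices):

```python
import random
import statistics

# Skewed population: X ~ Exp(2), so E(X) = 0.5 and Var(X) = 0.25.
# Sample size n and the number of repeated samples are illustrative.
random.seed(42)
lam, n, n_samples = 2.0, 50, 5000

sample_means = [
    statistics.fmean(random.expovariate(lam) for _ in range(n))
    for _ in range(n_samples)
]

print(statistics.fmean(sample_means))      # close to E(X) = 0.5
print(statistics.variance(sample_means))   # close to Var(X)/n = 0.25/50 = 0.005
```

Even though the population is strongly skewed, the sample means pile up symmetrically around E(X) with variance Var(X)/n.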
Construction of an exact CI for the expected value μ
X_i i.i.d. ~ N(μ_x, σ_x²), i ∈ {1, 2, ..., n}  ⇒  X̄ ~ N(μ_x, σ_x²/n)
Known σ: test statistic or pivot T = (X̄ − μ_x)/(σ_x/√n) ~ N(0, 1).
Estimated σ: test statistic or pivot T = (X̄ − μ_x)/(σ̂_x/√n) ~ t_{df = n−1}.
The construction of the CI is based on the distribution of T under H0: μ = μ_x:
P( q_{α/2}^{t_{n−1}} ≤ (X̄ − μ_x)/(σ̂_x/√n) ≤ q_{1−α/2}^{t_{n−1}} ) = 1 − α
⇔ P( X̄ − q_{1−α/2}^{t_{n−1}}·σ̂_x/√n ≤ μ_x ≤ X̄ + q_{1−α/2}^{t_{n−1}}·σ̂_x/√n ) = 1 − α
Exact 95% CI for μ_x:  X̄ ± q_{0.975}^{t_{n−1}}·σ̂_x/√n   (with known σ: X̄ ± q_{0.975}^{z}·σ_x/√n)
Construction of an approximate CI for μ
Central Limit Theorem: X_i i.i.d., i = 1, ..., n, n ≥ 25, E(X) = μ_x, Var(X) = σ_x²  ⇒  X̄ ~a N(μ_x, σ_x²/n)
Test statistic or pivot: T = (X̄ − μ_x)/(σ̂_x/√n) ~a N(0, 1)
The construction of the CI is again based on the distribution of T under H0: μ = μ_x:
P( z_{α/2} ≤ (X̄ − μ_x)/(σ̂_x/√n) ≤ z_{1−α/2} ) ≈ 1 − α
⇔ P( X̄ − z_{1−α/2}·σ̂_x/√n ≤ μ_x ≤ X̄ + z_{1−α/2}·σ̂_x/√n ) ≈ 1 − α
Approx. 95% CI for μ_x:  X̄ ± z_{0.975}·σ̂_x/√n = X̄ ± 1.96·σ̂_x/√n
standard error: se(x̄) = sd(x)/√n
(Source: University of Zurich, Department of Biostatistics)

The CI is as random as the sample
95 out of 100 95%-CIs for μ do cover the true population parameter μ = 5 when simulating 100 random samples from a population following N(μ = 5, σ²).
With a 95%-CI we have a risk of 5% that the true population parameter is not contained in the CI.
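The coverage statement can be checked by simulation. A Python sketch using the approximate z-based CI from the previous slide (μ = 5, σ = 2, n = 30 and 1000 runs are illustrative choices):

```python
import random
import statistics

# Coverage of the approximate 95% CI  x_bar +/- 1.96 * s / sqrt(n)
# when sampling from N(mu = 5, sigma = 2); all parameters are illustrative.
random.seed(7)
mu, sigma, n, n_runs = 5.0, 2.0, 30, 1000

covered = 0
for _ in range(n_runs):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = statistics.fmean(x)
    se = statistics.stdev(x) / n ** 0.5  # estimated standard error
    if x_bar - 1.96 * se <= mu <= x_bar + 1.96 * se:
        covered += 1

print(covered / n_runs)  # close to 0.95
```

With the z quantile instead of the exact t quantile, the observed coverage at n = 30 is typically slightly below 95%.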
Source: The Cartoon Guide to Statistics, Larry Gonick and Woollcott Smith
For what purpose do we develop a statistical model?
• Description: describe data by a statistical model. This is done well with a conventional statistical model.
• Explanation: search for the "true" model to understand and causally explain the relationships between variables and to plan interventions. This is difficult with observational data; in medicine we do RCTs to learn about causal effects.
• Prediction: use the model to make reliable predictions. To evaluate and tune a good prediction model it is best to work with train, validation and test data.
Simple linear regression: 1 explanatory variable
Simple Linear Regression
Example: in our first example we investigate how the height of trees depends on the pH-level of the soil.
In India, it was observed that alkaline soil hampers plant growth. This gave rise to a search for tree species which show high tolerance against high pH-values.
An outdoor trial was performed, where 120 trees of a particular species were planted on a big field with varying pH-values.
After 3 years of growth, every tree's height was measured. Additionally, the pH-value of the soil in the vicinity of each tree was determined and recorded.
Scatterplot: tree height vs. pH-value
[Figure: scatterplot of height vs. phvalue]
Which height would we expect at pH = 7.9?
Systematic relation: what is a good model?
[Figure: three candidate models fitted to the tree height vs. pH-value scatterplot]
What is a good model for the relation between pH-value and tree height?
The first model fits the training data perfectly but probably overfits the data.
To evaluate the performance of a model we can use cross-validation: leave out each data point in turn, fit the model with the remaining data, and use the model to predict the left-out value. The best model is the one which produces the best predictions on new or left-out data points.
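The leave-one-out procedure described above can be sketched as follows (Python, with made-up nearly-linear toy data and closed-form simple-regression OLS; none of the numbers come from the slides):

```python
# Closed-form OLS for simple linear regression (intercept + slope).
def ols(xs, ys):
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
          / sum((x - xb) ** 2 for x in xs))
    return yb - b1 * xb, b1  # intercept, slope

# Made-up, nearly linear toy data (roughly y = 2x).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

# Leave-one-out: refit without point i, then predict the left-out point.
sq_errs = []
for i in range(len(x)):
    b0, b1 = ols(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    sq_errs.append((y[i] - (b0 + b1 * x[i])) ** 2)

loocv_mse = sum(sq_errs) / len(sq_errs)
print(loocv_mse)  # small, since the data are nearly linear
```

An overfitting model (e.g. a high-degree polynomial through all points) would score much worse on this left-out-point error despite a perfect fit on the training data.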
Reading the R summary of a linear model
> summary(fit)
Call: lm(formula = height ~ phvalue, data = treeheight)
Coefficients: Estimate Std. Error t-value Pr(>|t|)
(Intercept) 28.7227 2.2395 12.82 <2e-16 ***
phvalue -3.0034 0.2844 -10.56 <2e-16 ***
Residual stand. err.: 1.008 on 121 degrees of freedom
Multiple R-squared: 0.4797, Adjusted R-squared: 0.4754
F-statistic: 111.5 on 1 and 121 DF, p-value: < 2.2e-16
How to read this output:
- intercept row: α̂ with its standard error se(α̂)
- slope row: β̂ with its standard error se(β̂); test value t = (β̂ − 0)/se(β̂); p-value p for H0: β = 0 (analogously for H0: α = 0)
- fitted model: ŷ_i = α̂ + β̂·x_i = 28.7 − 3·x_i
- residual standard error σ̂; degrees of freedom = #obs. − #(estimated parameters)
- R² (in simple regression equals corr(x,y)²); adjusted R² (use in multiple regression)
- F-statistic: global test for the model (is the full model better than only using the intercept?)
Linear regression: a traditional view as seen in many textbooks
Y_i = α + β·x_i + ε_i,  ε_i i.i.d. ~ N(0, σ²)
Systematic part of the model: α + β·x_i. Random part of the model: the random errors ε_i.
The regression line will not run through all the data points; thus, there are random errors.
Meaning of variables/parameters:
- y_i is the response variable (height) of observation i.
- x_i is the predictor variable (pH-value) of observation i.
- α, β are the regression coefficients. They are unknown beforehand and need to be estimated from the data.
- ε_i is the residual or error, i.e. the random difference between observation and regression line.
Linear regression: a more general view
Model for the conditional probability distribution (CPD):
Y_i = (Y | X = x_i) ~ N(μ_{x_i}, σ²), with E(Y | X = x_i) = μ_{x_i} = β0 + β1·x_i
Var(Y_i) = Var(Y | X = x_i) = Var(ε_i) = σ², where ε_i i.i.d. ~ N(0, σ²) (identically and independently distributed)
The predicted value of the linear regression gives only one of the parameters of the CPD: μ_x, which depends on the predictor values. The second parameter of the CPD (σ²) is assumed to be independent of the predictor values and defines the variance of the error term ε.
The marginal probability distribution (MPD) of Y is not restricted: Y is continuous and can have an arbitrary marginal probability distribution, Y ~ V_continuous(arbitrary); only the CPDs are modeled as Normal.
Least Squares Fitting
We need to fit a straight line that fits the data well. Many possible solutions exist; some are good, some are worse. Our paradigm is to fit the line such that the squared errors are minimized.
We minimize the sum of squared residuals:
Σ_{i=1}^{n} r_i² = Σ_{i=1}^{n} (y_i − ŷ_i)² = Σ_{i=1}^{n} (y_i − (α̂ + β̂·x_i))² → min!
http://hspm.sph.sc.edu/courses/J716/demos/LeastSquares/LeastSquaresDemo.html
Remark: according to the Gauss-Markov theorem, the OLS (ordinary least squares) fitting procedure leads to the best linear unbiased estimators (BLUE) of the regression parameters.
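The least-squares criterion can be checked numerically: the closed-form OLS solution β̂1 = Σ(x_i − x̄)(y_i − ȳ)/Σ(x_i − x̄)², α̂ = ȳ − β̂1·x̄ minimizes the sum of squared residuals, and its residuals sum to zero. A Python sketch with made-up data in the spirit of the tree example:

```python
# Made-up data in the spirit of the tree example (height vs. pH).
x = [7.5, 7.8, 8.0, 8.3, 8.6]
y = [5.2, 4.4, 4.7, 3.6, 3.1]

n = len(x)
xb, yb = sum(x) / n, sum(y) / n
b1 = (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
      / sum((xi - xb) ** 2 for xi in x))
b0 = yb - b1 * xb

def ssr(a, b):
    """Sum of squared residuals for the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Nearby lines always have a larger (or equal) sum of squared residuals:
print(ssr(b0, b1) <= ssr(b0 + 0.1, b1))  # True
print(ssr(b0, b1) <= ssr(b0, b1 + 0.1))  # True

# The OLS residuals sum to zero (compare residual assumption a) later on):
print(round(abs(sum(yi - (b0 + b1 * xi) for xi, yi in zip(x, y))), 10))  # 0.0
```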
Linear regression in R
> summary(fit)
Call: lm(formula = height ~ phvalue, data = treeheight)
Coefficients: Estimate Std. Error t-value Pr(>|t|)
(Intercept) 28.7227 2.2395 12.82 <2e-16 ***
phvalue -3.0034 0.2844 -10.56 <2e-16 ***
Residual stand. err.: 1.008 on 121 degrees of freedom
Multiple R-squared: 0.4797 (R²), Adjusted R-squared: 0.4754
F-statistic: 111.5 on 1 and 121 DF, p-value: < 2.2e-16 (global test for the model; will see later)
Intercept: α̂ with se(α̂), test value, p-value. Slope: β̂ with se(β̂), test value, p-value.
Fitted model: ŷ_i = 28.7 − 3·x_i
Least Squares Regression Model
[Figure: tree height vs. pH-value with the fitted regression line, marked at pH = 8]
height(ph) = 28.7 − 3·ph
height(8) = 28.7 − 24 = 4.7
Confidence and "prediction" intervals
The expected value of y at the position x is covered with 95% certainty by the confidence interval.
95% of all individual observations y (of the training data set) are contained in the "prediction" interval.
[Figure: regression model with confidence interval (inner band) and upper/lower prediction limits around E(y(x))]
Interesting intervals when doing regression, for the model Y_i = a + b·x_i + ε_i:
CI for the y-intercept: â ± q_{0.975}^{t_{n−2}}·se(â)
  confint(fit, parm=1, level=0.95)
  2.5 % 97.5 %
  (Intercept) -1513786 -881578.2
CI for the slope: b̂ ± q_{0.975}^{t_{n−2}}·se(b̂)
  confint(fit, parm=2, level=0.95)
  2.5 % 97.5 %
  x 124.4983 153.0250
CI for E(y_k): ŷ_k ± q_{0.975}^{t_{n−2}}·se(ŷ_k)
  predict(fit, new, se.fit=T, interval=c("confidence"))
  fit lwr upr
  1855075 1829770 1880379
Prediction interval for y_k: ŷ_k ± q_{0.975}^{t_{n−2}}·se(pred_k)
  predict(fit, new, se.fit=T, interval=c("prediction"))
  fit lwr upr
  1855075 1728713 1981436
Residual analysis = checking the model assumptions
Assumptions of a linear regression model
Before we continue to look into the results, we need to check if the modelling assumptions are met. Why? Because otherwise we draw invalid conclusions from the results.
The assumption we made here is that the errors ε_i are i.i.d. ~ N(0, σ²).
We use the observed residuals as estimates for the unobserved errors. This implies four things for the residuals:
a) The expected value of r_i is 0: E(r_i) = 0.
b) All r_i have the same variance: Var(r_i) = σ̂².
c) The r_i are normally distributed.
d) The r_i are independent of each other.
Observed residuals serve as estimates for the errors
True model: y_i = β0 + β1·x_{1i} + ε_i  ⇒  ε_i = y_i − (β0 + β1·x_{1i})
Fitted model: ŷ_i = β̂0 + β̂1·x_{1i}  ⇒  r_i = y_i − (β̂0 + β̂1·x_{1i}) = y_i − ŷ_i
[Figure: data point (x_i, y_i), fitted line β̂0 + β̂1·x, and residual r_i; x-axis: independent variable, y-axis: dependent variable]
Standardized and Studentized residuals
The standardized residual r̃_i can be derived from the raw residual r_i by dividing it by an estimate of its standard deviation:
r̃_i = r_i / (σ̂_E·√(1 − H_ii))
where σ̂_E is the residual standard error and H is the hat matrix.
With the same formula we get Studentized residuals if the estimate of the residual standard error σ̂_E is obtained by ignoring the i-th data point.
Model checking: residual analysis in R
There are 4 "standard plots" in R:
- Residuals vs. Fitted, aka Tukey-Anscombe plot
- Normal plot (uses standardized residuals)
- Scale-Location plot (uses standardized residuals)
- Leverage plot (uses standardized residuals)
In R: > plot(fit)
The residual vs. fitted plot, aka the Tukey-Anscombe plot
The Tukey-Anscombe diagram plots the residuals against the fitted values and is the most important model checking tool. This plot is ideal to check if assumptions a) and b) (and partially d)) are met.
A perfect Tukey-Anscombe plot shows a horizontal smoother at height 0, around which the residuals are distributed with the same variance at each point.
Examples of Tukey-Anscombe plots
Source: http://www.uni-forst.gwdg.de/~dgaffre/elan/institut_fbi/skripten/statistica/v10/v10a.html
[Figure: four example plots of residuals ("Residuen") vs. fitted values]
The Normal plot
With the Normal plot we check if the residuals show strong deviations from a Normal distribution. In a perfect Normal plot all points are close to a straight line.
Draw data from a Normal distribution and generate the Normal quantile-quantile plots:
# normal Q-Q plot w/o CI
qqnorm(residuals(fit))
# normal Q-Q plot with CI
library(car)
qqPlot(residuals(fit), dist="norm")
The Scale-Location plot
Here we plot √|r̃_i| vs. ŷ_i and check for constant variance: the spread of the absolute residual values should not change over the range of fitted values, i.e. the variance of the residuals should be constant.
A perfect Scale-Location plot shows a smooth horizontal line.
The Leverage plot
With the Leverage plot we check for influential points with large Cook's distances.
A Leverage plot without points beyond the dashed level curves is fine.
Points with a Cook's distance larger than 0.5 or 1 must be checked further.
What is meant by leverage points?
A high leverage of an observation y_i means that the data point has an extreme predictor value and has the potential to force the regression relation to strongly adapt to that data point. The leverage is simply given by the i-th diagonal element of the hat matrix H, since H_ii·Δy_i is the change in ŷ_i if y_i changes by Δy_i. The average leverage in a regression model with p estimated coefficients and 1 intercept is given by (p+1)/n. We say a data point has high leverage if H_ii ≥ 2·(p+1)/n.
What is Cook's Distance measuring?
With Cook's distance we estimate the potential change in all the fitted values if the i-th data point is omitted from the analysis:
D_i = Σ_k (ŷ_k − ŷ_k^[i])² / ((p+1)·σ̂_E²) = (H_ii / (1 − H_ii)) · r̃_i² / (p+1)
If D_i ≥ 0.5, the i-th data point is called influential.
If D_i ≥ 1, the i-th data point might be really dangerous.
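The leverage formula above can be illustrated numerically. This Python sketch computes the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ for a simple regression with one deliberately extreme predictor value (the data are made up); the leverages sum to the number of estimated coefficients, here p + 1 = 2:

```python
# Simple regression design matrix with an intercept column; the last
# x value is deliberately extreme (all numbers are made up).
x = [1.0, 2.0, 3.0, 4.0, 10.0]
n = len(x)
X = [[1.0, xi] for xi in x]

# (X'X)^-1 for the 2x2 case.
sx, sxx = sum(x), sum(xi * xi for xi in x)
det = n * sxx - sx * sx
inv = [[sxx / det, -sx / det], [-sx / det, n / det]]

# Leverage H_ii = x_i' (X'X)^-1 x_i  (diagonal of the hat matrix).
H_ii = [
    sum(X[i][a] * inv[a][b] * X[i][b] for a in range(2) for b in range(2))
    for i in range(n)
]

print([round(h, 3) for h in H_ii])  # [0.38, 0.28, 0.22, 0.2, 0.92]
print(round(sum(H_ii), 6))          # 2.0 = p + 1 (trace of the hat matrix)
print(H_ii[-1] >= 2 * 2 / n)        # True: x = 10 is a high-leverage point
```

The extreme point clears the 2·(p+1)/n = 0.8 threshold by a wide margin, matching the rule of thumb above.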
Plot residuals vs. predictors
Residuals should not show structure when plotted vs. the fitted values or any of the predictors.
ε ~ N(0, σ²·I), i.i.d.  →  residual plots look o.k.
[Figure: Tukey-Anscombe plot]
What to do if the model assumptions are violated?
Should we apply a transformation to the outcome and the predictors?
See the in-class exercises on transformations for the effect of applying a non-linear transformation to the response or predictor variable.
How can a linear regression model be improved?
Improve the structural form of the model
- add missing predictors
- add interactions
- apply transformations to predictors (change the form of the relationship between y and predictor)
- apply a transformation to the outcome (to stabilize the residual variance & change the form)
Handle extreme values, leverage points, outliers
- outliers and leverage points should be identified with diagnostic plots
- check if the use of robust methods is necessary
- use transformations to make variable distributions less skewed
Make sure that observations within a group are independent
- watch out for unrecorded predictors or an inhomogeneous population
- observations coming from a matched study are not independent (analyze such data e.g. with mixed models)
- subjects may influence other subjects under study
Consider using another model
- e.g. Poisson regression in case of count data
First Aid Transformations (variance-stabilizing transformations)
Always apply these (if no practical reasons speak against it) to both response and predictors:
- Absolute values and concentrations → log transformation: y → log(y)
- Count data → square-root transformation: y → √y
- Proportions → arcsine transformation: y → arcsin(√y)
Motivation for the partial residual plot
The partial residuals r_{j,partial} give insight into the relationship between predictor x_j and the "adjusted outcome", which is corrected for the effect of all other predictors:
r_{j,partial} = y − Σ_{k≠j} β̂_k·x_k = β̂_j·x_j + r
Loosely speaking, partial residuals are the residuals we get when the contributions of all predictors except x_j are removed from the outcome.
A perfect partial residual plot shows a linear relation between r_{j,partial} and x_j.
Partial residual plots in R:
- library(car); crPlots(...)
- library(faraway); prplot(...)
- residuals(fit, type="partial")
[Figures: Mortality vs. log(NOx) (marginal plot) and the partial residual plot for log(NOx)]
Marginal relationship ≠ relation in partial residual plot
The marginal plot of outcome vs. predictor does not take into account the influence of all other predictors on the outcome and is therefore not appropriate if we are interested in the additional influence of a predictor on the outcome, given all other predictors are already in the model.
[Figures: Mortality vs. log(NOx) (marginal plot) and the partial residual plot for log(NOx)]
Partial residual plots help to find transformations for predictors
We want to check if the adjusted relation between an untransformed predictor and the outcome is linear. Hence, we use the partial residual plot, which only shows the "isolated" influence of that predictor on the response.
The observed shape in the partial residual plot indicates if/which transformation we should use for the selected predictor.
[Figure: three example plots of r_{j,partial} vs. x_j: (1) a quadratic transformation of x_j might be appropriate; (2) no transformation required; (3) predictor x_j has no additional explanatory power when added to the model]
ANOVA = ANalysis Of VAriance
Total sample variability (TSS) = variability explained by the model (SSmodel) + unexplained (or error) variability (RSS)
Example with one factorial predictor: do medical doctors spend less time with obese patients?
In an observational study it was measured how much time doctors spend with a patient.
Do medical doctors spend less time with obese patients? How can we test this with linear regression and ANOVA?
An ANOVA with 1 factor with 2 levels is equivalent to a two-sample t-test. (Normality check passed.)
t.test(TIME~WEIGHT, data=dat)
# t = 2.9, df = 67, p-value = 0.0057
# alternative hypothesis: true difference in
# means is not equal to 0
# 95 percent confidence interval:
# 2 11
# sample estimates:
# mean of x mean of y
# 31 25
# do it by regression with one factorial predictor:
fit=lm(TIME~WEIGHT, data=dat)
anova(fit) # get anova-table from lm-object
# Response: TIME
# Df Sum Sq Mean Sq F value Pr(>F)
# WEIGHT 1 776 776 8.16 0.0057 **
# Residuals 69 6561 95
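The claimed equivalence can be verified numerically: for one factor with 2 levels, the ANOVA F statistic equals the square of the pooled two-sample t statistic. A Python sketch with made-up group data (not the doctors' data from the slide):

```python
# Two made-up groups (not the doctors' data from the slide).
g1 = [31.0, 29.5, 33.0, 30.5, 28.0, 34.0]
g2 = [25.0, 26.5, 23.0, 27.0, 24.5, 22.0]

n1, n2 = len(g1), len(g2)
m1, m2 = sum(g1) / n1, sum(g2) / n2
ss1 = sum((v - m1) ** 2 for v in g1)
ss2 = sum((v - m2) ** 2 for v in g2)

# Pooled two-sample t statistic.
sp2 = (ss1 + ss2) / (n1 + n2 - 2)
t = (m1 - m2) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5

# One-way ANOVA F statistic: (SS_model / df_model) / (RSS / df_resid).
grand = (sum(g1) + sum(g2)) / (n1 + n2)
ss_model = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
F = (ss_model / 1) / ((ss1 + ss2) / (n1 + n2 - 2))

print(abs(t ** 2 - F) < 1e-9)  # True: F = t^2
```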
How to test for an effect between >2 groups? Applying 1-way ANOVA with >2 levels
Here we want to investigate if three different treatments result in different levels of the output: folate in red blood cells.
We can apply a regression with the group factor as predictor to investigate this question, given the folate values y in each group are i.i.d. normally distributed (check not shown).
fit=lm(folate~group, data=dat)
anova(fit) # p=0.044
Since p < 5%, we can conclude that there are differences, i.e. the folate level is not the same in all groups.
Remark: if there is only 1 factor as predictor, like treatment group, we talk about 1-way ANOVA regardless of the number of groups.
The ANOVA gets significant. Between which groups are the differences?
The significant ANOVA result only tells us that there are some differences. We need to perform post-hoc tests to investigate between which groups we can really find differences.
We can perform three pair-wise t-tests. In the result of the (uncorrected) pair-wise t-tests, only the t-test comparing group 1 versus group 2 gets significant.
We need to correct for multiple testing, e.g. by Bonferroni correction. Here, this correction leads to non-significance for all 3 tests.
List of post-hoc tests (from wiki):
• Fisher's least significant difference: LSD
• Bonferroni correction
• Duncan's new multiple range test
• Friedman test
• Newman–Keuls method
• Scheffé's method
• Tukey's range test
• Dunnett's test
Multiple linear regression: ≥2 explanatory variables
Multiple linear regression: interpretation of coefficients as used in descriptive modelling
y_i = β0 + β1·x_{i1} + ... + βp·x_{ip} + ε_i,  ε_i ~ N(0, σ²)
β_k = E(y | x_k + 1, all other x fixed) − E(y | x_k, all other x fixed)
The coefficient β_k gives the change of the outcome y when the explanatory variable x_k is increased by one unit and all other variables are held constant.
In matrix notation: y = Xβ + ε, with
y = (y_1, y_2, ..., y_n)ᵀ
X = the n × (p+1) design matrix with rows (1, x_{i1}, ..., x_{ip})
β = (β_0, β_1, ..., β_p)ᵀ
ε = (ε_1, ε_2, ..., ε_n)ᵀ
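As a sketch of the matrix form, the OLS estimate solves the normal equations (XᵀX)β = Xᵀy. The following Python code builds a small design matrix, solves the normal equations with plain Gaussian elimination, and recovers the true coefficients of an illustrative, noise-free model (all numbers are made up):

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    m = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(m):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][m] / M[i][i] for i in range(m)]

# Illustrative model y = 2 + 0.5*x1 - 1.5*x2 (no noise, so OLS is exact).
data = [(x1, x2) for x1 in range(5) for x2 in range(4)]
X = [[1.0, x1, x2] for x1, x2 in data]          # design matrix with intercept
y = [2 + 0.5 * x1 - 1.5 * x2 for x1, x2 in data]

# Normal equations: (X'X) beta = X'y
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
beta = solve(XtX, Xty)
print([round(b, 6) for b in beta])  # [2.0, 0.5, -1.5]
```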
Modeling HDL as an example of multiple regression:
Estimate Std. Error t-value Pr(>|t|)
Intercept 1.16448 0.28804 4.04 <.0001
AGE -0.00092 0.00125 -0.74 0.4602
BMI -0.01205 0.00295 -4.08 <.0001
BLC 0.05055 0.02215 2.28 0.0239
PRSSY -0.00041 0.00044 -0.95 0.3436
DIAST 0.00255 0.00103 2.47 0.0147
GLUM -0.00046 0.00018 -2.50 0.0135
SKINF 0.00147 0.00183 0.81 0.4221
LCHOL 0.31109 0.10936 2.84 0.0051
The predictors of log(HDL) are age, body mass index, blood vitamin C,
systolic and diastolic blood pressures, skinfold thickness, and the log of
total cholesterol. The equation is:
Log(HDL) = 1.16 - 0.00092(Age) -0.012(BMI)+…+ 0.311(LCHOL)
The meanings of the coefficients in the HDL example
Y_i = β0 + β1·x_{i1} + β2·x_{i2} + ... + βp·x_{ip} + ε_i
log(HDL) = 1.16 − 0.00092(Age) − 0.012(BMI) + … + 0.311(LCHOL)
Interpretation of the coefficients on the previous slide:
We need to use the entire equation for making predictions.
Each coefficient β_j measures the difference in expected log(HDL) between 2 subjects if the factor x_j differs by 1 unit between the two subjects, and if all other factors are the same.
E.g., the expected log(HDL) is 0.012 lower in a subject whose BMI is 1 unit greater, but who is the same as the other subject on all other factors.
The meanings of the p-values and the coefficients in the multiple linear regression output
The p-values measure the significance of the association of a factor with log(HDL) in the presence of all other predictors of the model.
This is sometimes expressed as "after accounting for other factors" or "adjusting for other factors", and is called an independent association.
SKINF alone is probably associated. However, its p = 0.42 says that it provides no additional information that helps to predict log(HDL), after accounting for other factors such as BMI.
The p-value and also the coefficient value of a predictor depend in general not only on the association with the outcome variable but also on the other predictors in the model. Only if all predictors are independent does multiple regression lead to the same p-values and coefficients as p simple regressions, each with only one predictor.
Significance vs. Relevance
The larger a sample, the smaller the p-values for the very same predictor effect. Thus, do not confuse a small p-value with an important predictor effect!
More important than p-values:
• Look at the absolute values of (significant) coefficients.
• Look at confidence intervals!
ANCOVA = ANalysis Of COVAriance
Linear regression with continuous and factorial predictors
Output: hours: lifetime of a cutting tool
Predictor 1: rpm: speed of the machine in rpm
Predictor 2: tool: tool type A or B
fit1 <- lm(hours ~ rpm + tool, data=my.dat)
[Figure: Durability of lathe cutting tools; hours vs. rpm with parallel fitted lines for tools A and B]
We have an additive model: the difference between the tools is a shift.
What does interaction mean? Different slopes of continuous variables at different levels of a factor
Do not allow for interaction: fit1 = lm(hours ~ rpm + tool, data=my.dat)
Interaction allowed: fit2 = lm(hours ~ rpm * tool, data=my.dat)
[Figures: Durability of lathe cutting tools; fitted lines without interaction (parallel) and with interaction (different slopes)]
In case of interaction, the slope of the predictor "rpm" changes for different levels of the second predictor "tool".
Do we get the same slope in rpm for tool A and tool B? Is there an interaction between rpm and tool?
fit2 <- lm(hours ~ rpm * tool, data=my.dat)
> summary(fit2)
Coefficients:
            Estimate  Std. Error t value Pr(>|t|)
(Intercept) 32.774760 4.633472   7.073   2.63e-06 ***
rpm         -0.020970 0.006074   -3.452  0.00328 **
toolB       23.970593 6.768973   3.541   0.00272 **
rpm:toolB   -0.011944 0.008842   -1.351  0.19553
---
Residual standard error: 2.968 on 16 degrees of freedom
Multiple R-squared: 0.9105, Adjusted R-squared: 0.8937
F-statistic: 54.25 on 3 and 16 DF, p-value: 1.319e-08
Fitted model: hours = 32.8 − 0.02·rpm + 24·toolB − 0.01·(rpm·toolB)
The main effects are hard to interpret in case of interactions.
Here the interaction seems not to be significant. With ANOVA we can test for nested models whether the more complex model leads to a significant improvement.
How to read a model with interaction?
hours = 32.8 − 0.02·rpm + 24·toolB − 0.01·(rpm·toolB)
tool B (toolB = 1): hours = 32.8 + 24·1 − 0.02·rpm − 0.01·(rpm·1) = 56.8 − 0.03·rpm
tool A (toolB = 0): hours = 32.8 + 24·0 − 0.02·rpm − 0.01·(rpm·0) = 32.8 − 0.02·rpm
In case of interaction, the slope of the predictor "rpm" changes for different levels of the second predictor "tool"; also the intercept changes for the two tools.
[Figure: Durability of lathe cutting tools, with interaction allowed]
Remark: In case of interaction between two continuous predictors, the slope (and intercept) of one predictor changes continuously with the value of the other predictor and vice versa.
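The two per-tool equations can be verified mechanically; this Python sketch plugs the rounded coefficients from the summary into the fitted equation and recovers the per-tool slopes:

```python
# Rounded coefficients from the fitted interaction model shown above.
b0, b_rpm, b_toolB, b_int = 32.8, -0.02, 24.0, -0.01

def hours(rpm, toolB):
    """Fitted value: intercept + rpm effect + tool shift + interaction."""
    return b0 + b_rpm * rpm + b_toolB * toolB + b_int * rpm * toolB

# tool A (toolB = 0): intercept 32.8, slope -0.02
# tool B (toolB = 1): intercept 32.8 + 24 = 56.8, slope -0.02 - 0.01 = -0.03
slope_A = hours(601, 0) - hours(600, 0)
slope_B = hours(601, 1) - hours(600, 1)
print(round(slope_A, 6), round(slope_B, 6))  # -0.02 -0.03
```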
Do we need the complex model with the interaction?
Do not allow for interaction: fit1 = lm(hours ~ rpm + tool, data=my.dat)
Interaction allowed: fit2 = lm(hours ~ rpm * tool, data=my.dat)
[Figures: Durability of lathe cutting tools without and with interaction]
anova(fit2, fit1, test="F")
# p>5%, therefore the interaction is not needed
Steps in linear modelling
0) Preprocessing
- learn the meaning of all variables, check for correlations
- give short and informative names
- check for impossible values, errors
- if they exist (missing, error): set them to NA
- consider imputation methods, but be careful
1) First-aid transformations
- bring all variables to a suitable scale (use also field knowledge)
- routinely apply the first-aid transformations
2) Find a good model
- start with a model including important confounders
- perform a residual analysis
- improve the model by transformations or adding better predictors
- reduce complexity step by step (be aware of introduced biases)
- use your specific knowledge to choose between variables
Limits of linear regression
If your residuals do not follow a Normal distribution (even after transformations), use generalized linear modeling (glm, e.g. logistic regression).
If your predictors show a strong correlation, use shrinkage methods (e.g. lasso).
If your data are not independent, use mixed models or methods for time series.
If you do not have a linear relation, use non-linear regression (e.g. nlm) or generalized additive models (e.g. gam) or tree models.