Linear Models
Stat 430
Outline
• Parameter Estimation
• Goodness of Fit
• Matrix Notation: Multivariate Models
• Distributional Assumptions
• Dummy Variables
Olympic Gold Medallists
[Figure, shown on two slides: winning distance (y-axis, 7 to 9) plotted against year (x-axis, 1880 to 2020); the axis titles are not recoverable.]
Interpretation
• We get values for parameters a and b as
  a = -18.85
  b = 0.013
• a is the intercept, i.e. the value for Y if X = 0. In this data the interpretation is a bit obscure: for the year 0 we would expect the winner to jump 18.85 m backwards (quite a feat!)
• b is the average increase that we expect for Y when we increase X by 1 unit: for each year we expect the winner to jump 1.3 cm further; from one Olympics to the next (four years) we'd expect an increase of 5.2 cm
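The arithmetic behind this interpretation can be checked with a few lines of R. This is only a sketch: it uses the rounded estimates quoted on the slide, and `predict_jump` is a hypothetical helper name, not something from the lecture.

```r
# Estimates quoted on the slide (units: metres and years)
a <- -18.85   # intercept
b <- 0.013    # slope, in metres per year

# Hypothetical helper: predicted winning distance (m) for a given year
predict_jump <- function(year) a + b * year

predict_jump(0)   # the intercept: -18.85 m at year 0
b * 100           # slope in cm per year: 1.3
4 * b * 100       # expected increase between two Olympics, in cm: 5.2
```

Because b is rounded to two significant digits, predictions for actual Olympic years computed from these rounded values are only rough.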
Simple Linear Regression
• How do we get a and b?
• How good is the model?
• What are confidence intervals for the parameters a and b, and for predicted values?
Ordinary Least Squares (OLS) Estimation
• Model: y = a + b x
• Data x_1, x_2, x_3, ... and y_1, y_2, y_3, ...
• OLS: find a and b such that they minimize ∑ (a + b x_i - y_i)^2
• a_OLS = -18.85
  b_OLS = 0.013
• b = ∑ (x_i - mean(x)) (y_i - mean(y)) / ∑ (x_i - mean(x))^2
  a = mean(y) - b mean(x)
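As a sketch of how the OLS estimates can be computed, the following R code applies the standard closed-form solution for simple linear regression and compares it with `lm()`. The data are made up for illustration (they are not the Olympic data), and the names `a_hat` and `b_hat` are assumptions.

```r
# Made-up illustration data (not the Olympic data from the slides)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form OLS estimates for the model y = a + b x
b_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a_hat <- mean(y) - b_hat * mean(x)

# The same fit via lm(); coefficients come back as (Intercept), x
fit <- lm(y ~ x)

c(a = a_hat, b = b_hat)   # a = 0.05, b = 1.99
coef(fit)                 # agrees with the closed-form values
```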
Derivation of Estimates
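The slide content for this derivation is not preserved in the transcript. A standard derivation, given here as a sketch and not necessarily the one shown in the lecture, sets the partial derivatives of the sum of squares to zero. The symbols S, \bar{x} and \bar{y} are notation introduced for this sketch.

```latex
% Objective: the sum of squared deviations
S(a, b) = \sum_i (a + b x_i - y_i)^2

% Setting the derivative with respect to a to zero gives the intercept
\frac{\partial S}{\partial a} = 2 \sum_i (a + b x_i - y_i) = 0
\quad\Longrightarrow\quad a = \bar{y} - b \bar{x}

% Setting the derivative with respect to b to zero and substituting a
\frac{\partial S}{\partial b} = 2 \sum_i x_i (a + b x_i - y_i) = 0
\quad\Longrightarrow\quad b \sum_i x_i (x_i - \bar{x}) = \sum_i x_i (y_i - \bar{y})

% Since \sum_i (x_i - \bar{x}) = 0 and \sum_i (y_i - \bar{y}) = 0,
% x_i can be replaced by (x_i - \bar{x}) on both sides
b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
```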
Goodness of Fit
• Coefficient of Determination R^2
• compare amount of variability overall to amount explained in model:
• R^2 = (TSS - SSE)/TSS
• TSS = ∑ (y_i - mean(y))^2
• SSE = ∑ e_i^2
• R^2 is a value in [0,1], with R^2 = 1 indicating perfect fit
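The definition can be sketched in R by computing TSS and SSE by hand and comparing the result with the value reported by `summary()`. The data are made up for illustration, and the variable names are assumptions.

```r
# Made-up illustration data (not from the slides)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)
e   <- residuals(fit)          # residuals e_i = y_i - fitted value

TSS <- sum((y - mean(y))^2)    # total sum of squares
SSE <- sum(e^2)                # sum of squared residuals
R2  <- (TSS - SSE) / TSS       # coefficient of determination

R2
summary(fit)$r.squared         # same value, as reported by R
```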
Extending the Model
• Explanatory variable can be discrete
Example: Running in OZ
• http://www.statsci.org/data/oz/ms212.html
• Students in an introductory statistics class participated in a simple experiment: The students took their own pulse rate. They were then asked to flip a coin. If the coin came up heads, they were to run in place for one minute. Otherwise they sat for one minute. Then everyone took their pulse again. The pulse rates and other physiological and lifestyle data are given in the data.
Pulse 1 vs Pulse 2
[Figure: Pulse2 (y-axis, 60 to 160) plotted against Pulse1 (x-axis, 60 to 140).]
• We should look at the difference in pulse instead
Pulse difference
[Figure: dPulse (y-axis, 0 to 80) by factor(Ran) (x-axis, levels 1 and 2).]
Linear Model with Categorical Variables
• y = a + b_ran
• Use value of b only for those students who did run
• identical to: y = a + b x_ran, where x_ran = 0 for students who did not run, and 1 otherwise (dummy variable)
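A small R sketch of the dummy-variable idea, using made-up pulse differences (not the fitness data): with a 0/1 dummy, the intercept a is the mean of the group coded 0 and b is the difference between the two group means. The names `ran` and `x_ran` are assumptions for this illustration.

```r
# Made-up pulse differences (not the fitness data from the slides)
dPulse <- c(50, 60, 55, 2, -1, 0, 3)
ran    <- c("yes", "yes", "yes", "no", "no", "no", "no")

# Dummy variable: 1 for students who ran, 0 otherwise
x_ran <- ifelse(ran == "yes", 1, 0)

fit <- lm(dPulse ~ x_ran)
coef(fit)                                             # a = 1, b = 54

mean(dPulse[x_ran == 0])                              # equals a
mean(dPulse[x_ran == 1]) - mean(dPulse[x_ran == 0])   # equals b
```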
lm(formula = dPulse ~ Ran, data = fitness)

Residuals:
    Min      1Q  Median      3Q     Max
-41.391  -3.000   1.000   4.609  42.609

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  103.783      4.489   23.12   <2e-16 ***
Ran          -52.391      2.715  -19.30   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14 on 107 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared: 0.7768, Adjusted R-squared: 0.7747
F-statistic: 372.4 on 1 and 107 DF, p-value: < 2.2e-16
Interpretation of Parameters
• Ran has values 1 (= Ran) and 2 (= Sit)
• In the model lm(formula = dPulse ~ Ran, data = fitness), Ran is interpreted as a numeric variable
• Somebody who ran therefore has an average difference in pulse of 103.783 + 1*(-52.391) = 51.392; somebody who sat has an average difference in pulse of 103.783 + 2*(-52.391) = -0.999, i.e. about -1.
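The two predictions can be reproduced directly from the reported coefficients. This sketch uses only the estimates printed in the output above; the names `ran_code` and `pred` are assumptions.

```r
# Coefficients reported for lm(dPulse ~ Ran) with Ran treated as numeric
a <- 103.783   # (Intercept)
b <- -52.391   # slope for Ran

# Coding used in the data: 1 = ran, 2 = sat
ran_code <- c(Ran = 1, Sit = 2)

pred <- a + b * ran_code
pred   # Ran: 51.392, Sit: -0.999
```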
Better, use factor
lm(formula = dPulse ~ factor(Ran), data = fitness)

Residuals:
    Min      1Q  Median      3Q     Max
-41.391  -3.000   1.000   4.609  42.609

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    51.391      2.064    24.9   <2e-16 ***
factor(Ran)2  -52.391      2.715   -19.3   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14 on 107 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared: 0.7768, Adjusted R-squared: 0.7747
F-statistic: 372.4 on 1 and 107 DF, p-value: < 2.2e-16
Parameter Estimates
• By default, R uses baseline coding for all factor variables, i.e. the effect of the first level is always set to zero:

             Estimate Std. Error t value Pr(>|t|)
(Intercept)    51.391      2.064    24.9   <2e-16 ***
factor(Ran)2  -52.391      2.715   -19.3   <2e-16 ***

is interpreted as a difference in pulse of 51.391 + 0 when Ran is equal to 1, and 51.391 - 52.391 when Ran is equal to 2
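Baseline coding can be inspected with `model.matrix()`. The sketch below uses made-up data with the same 1/2 coding (not the fitness data) and shows that the numeric and the factor version of the model give the same fitted values, while only the factor version has directly interpretable coefficients.

```r
# Made-up data using the slides' coding: 1 = ran, 2 = sat
Ran    <- c(1, 1, 1, 2, 2, 2, 2)
dPulse <- c(50, 60, 55, 2, -1, 0, 3)

fit_num <- lm(dPulse ~ Ran)           # Ran treated as a number
fit_fac <- lm(dPulse ~ factor(Ran))   # Ran treated as a factor

# Design matrix under baseline coding: the column factor(Ran)2
# is 0 for the first level (Ran = 1) and 1 for the second (Ran = 2)
model.matrix(fit_fac)

coef(fit_fac)   # (Intercept) = mean of level 1 = 55; factor(Ran)2 = -54
coef(fit_num)   # (Intercept) = 109; Ran = -54

# Both parameterisations describe the same two group means
all.equal(fitted(fit_num), fitted(fit_fac))
```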
Further Extension
• Multiple explanatory variables
[Figure: four panels of dPulse (y-axis, 0 to 80) against factor(Gender) (levels 1 and 2), interaction(factor(Gender), factor(Ran)) (levels 1.1, 2.1, 1.2, 2.2), Age (20 to 45), and Weight (40 to 100).]