Simple Linear Regression
AMS 572, 11/29/2010

Outline
1. Brief History and Motivation – Zhen Gong
2. Simple Linear Regression Model – Wenxiang Liu
3. Ordinary Least Squares Method – Ziyan Lou
4. Goodness of Fit of LS Line – Yixing Feng
5. OLS Example – Lingbin Jin
6. Statistical Inference on Parameters – Letan Lin
7. Statistical Inference Example – Emily Vo
8. Regression Diagnostics – Yang Liu
9. Correlation Analysis – Andrew Candela
10. Implementation in SAS – Joseph Chisari
Brief History and Introduction
Legendre published the earliest form of regression, the method of least squares, in 1805. In 1809, Gauss published the same method. Francis Galton extended the method in the 19th century to describe a biological phenomenon, and Karl Pearson and Udny Yule extended it to a more general statistical context around the turn of the 20th century.
Motivation for Regression Analysis
• Regression analysis is a statistical methodology for estimating the relationship of a response variable to a set of predictor variables.
• When there is just one predictor variable, we use simple linear regression; when there are two or more predictor variables, we use multiple linear regression.
(Diagram: given a newly observed predictor value x, predict the response variable Y.)
Motivation for Regression Analysis
2010 Camry – horsepower at 6000 rpm: 169; highway gasoline consumption: 0.03125 gallon per mile
2010 Milan – horsepower at 6000 rpm: 175; highway gasoline consumption: 0.0326 gallon per mile
2010 Fusion – horsepower at 6000 rpm: 263; highway gasoline consumption: ?
Response variable (Y): highway gasoline consumption
Predictor variable (X): horsepower at 6000 rpm
Simple Linear Regression Model
• A summary of the relationship between a dependent variable (or response variable) Y and an independent variable (or covariate) X.
• Y is assumed to be a random variable, while X, even if it is a random variable, is conditioned on (assumed fixed). Essentially, we are interested in the behavior of Y given that we know X = x.
Good Model
• Regression models attempt to minimize the vertical distance between each observation point and the model line (or curve).
• The length of this line segment is called the residual, modeling error, or simply error.
• Requiring only that the negative and positive errors cancel out (zero overall error) is not enough: many lines satisfy this criterion.
Probabilistic Model
• In simple linear regression, the population regression line is given by
  E(Y) = β0 + β1x
• The actual values of Y are assumed to be the sum of the mean value E(Y) and a random error term ε:
  Y = E(Y) + ε = β0 + β1x + ε
• At any given value of x, the dependent variable Y ~ N(β0 + β1x, σ²).
Least Squares (LS) Fit
Boiling Point of Water in the Alps:

Pressure  Boiling Pt     Pressure  Boiling Pt
20.79     194.5          24.01     201.3
20.79     194.3          25.14     203.6
22.40     197.9          26.57     204.6
22.67     198.4          28.49     209.5
23.15     199.4          27.76     208.6
23.35     199.9          29.04     210.7
23.89     200.9          29.88     211.9
23.99     201.1          30.06     212.2
24.02     201.4
Least Squares (LS) Fit
Find a line that represents the "best" linear relationship:
  y = β0 + β1x
• Problem: the data do not all fall on one line.
• Each observation deviates from the line by a residual
  e_i = y_i − (β0 + β1 x_i),  i = 1, 2, …, n
• We look for the line that minimizes the sum of squared residuals
  Q = Σ [y_i − (β0 + β1 x_i)]²  (sum over i = 1, …, n)
Least Squares (LS) Fit
• To find the parameters that minimize the sum of squared differences, take the partial derivative with respect to each parameter and set it equal to zero:
  ∂Q/∂β0 = −2 Σ (y_i − β0 − β1 x_i) = 0  ⇒  Σ y_i = n β0 + β1 Σ x_i
  ∂Q/∂β1 = −2 Σ x_i (y_i − β0 − β1 x_i) = 0  ⇒  Σ x_i y_i = β0 Σ x_i + β1 Σ x_i²
These are the normal equations.
Least Squares (LS) Fit
• Solving the two normal equations gives
  β̂1 = [n Σ x_i y_i − (Σ x_i)(Σ y_i)] / [n Σ x_i² − (Σ x_i)²]
  β̂0 = [(Σ y_i)(Σ x_i²) − (Σ x_i)(Σ x_i y_i)] / [n Σ x_i² − (Σ x_i)²]
Least Squares (LS) Fit
• To simplify, we introduce
  S_xy = Σ (x_i − x̄)(y_i − ȳ) = Σ x_i y_i − (1/n)(Σ x_i)(Σ y_i)
  S_xx = Σ (x_i − x̄)² = Σ x_i² − (1/n)(Σ x_i)²
  S_yy = Σ (y_i − ȳ)² = Σ y_i² − (1/n)(Σ y_i)²
• Then the LS estimates can be written as
  β̂1 = S_xy / S_xx,  β̂0 = ȳ − β̂1 x̄
• The resulting equation ŷ = β̂0 + β̂1 x is known as the least squares line, which is an estimate of the true regression line.
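The LS formulas can be sketched in a few lines of Python (the deck's own implementation language is SAS, shown later; this is only an illustrative sketch, and the function name `ls_fit` is ours, not from any library). It uses the boiling-point data from the earlier slide:

```python
def ls_fit(x, y):
    """Least squares line via Sxy/Sxx, as on the slides."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx          # slope:     beta1_hat = Sxy / Sxx
    b0 = ybar - b1 * xbar   # intercept: beta0_hat = ybar - beta1_hat * xbar
    return b0, b1

# Boiling-point data from the slide (x = pressure, y = boiling point).
pressure = [20.79, 20.79, 22.40, 22.67, 23.15, 23.35, 23.89, 23.99, 24.02,
            24.01, 25.14, 26.57, 28.49, 27.76, 29.04, 29.88, 30.06]
boiling  = [194.5, 194.3, 197.9, 198.4, 199.4, 199.9, 200.9, 201.1, 201.4,
            201.3, 203.6, 204.6, 209.5, 208.6, 210.7, 211.9, 212.2]

b0, b1 = ls_fit(pressure, boiling)
residuals = [y - (b0 + b1 * x) for x, y in zip(pressure, boiling)]
```

By construction the fitted residuals satisfy both normal equations: they sum to zero, and they are orthogonal to x.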
Goodness of Fit of the LS Line
The fitted values are
  ŷ_i = β̂0 + β̂1 x_i
The residuals
  e_i = y_i − (β̂0 + β̂1 x_i)
are used to evaluate the goodness of fit of the LS line.
Goodness of Fit of the LS Line
The error sum of squares:  SSE = Σ e_i²
The total sum of squares:  SST = Σ (y_i − ȳ)²
The regression sum of squares:  SSR = Σ (ŷ_i − ȳ)²
These satisfy SST = SSR + SSE, since
  Σ (y_i − ȳ)² = Σ (ŷ_i − ȳ)² + Σ (y_i − ŷ_i)² + 2 Σ (y_i − ŷ_i)(ŷ_i − ȳ)
and the cross-product term equals zero.
Goodness of Fit of the LS Line
• The coefficient of determination
  r² = SSR/SST = 1 − SSE/SST
is always between 0 and 1.
• The sample correlation coefficient between X and Y is
  r = S_xy / √(S_xx S_yy)
• For simple linear regression, the square of the sample correlation coefficient equals the coefficient of determination, so r = ±√(r²), taking the sign of β̂1.
Estimation of σ²
The variance σ² measures the scatter of the Y_i around their means E(Y_i). An unbiased estimate of σ² is given by
  s² = SSE/(n − 2) = Σ e_i² / (n − 2)
This estimate of σ² has n − 2 degrees of freedom.
Implementing the OLS Method on Problem 10.4
OLS method: find β0 and β1 minimizing
  Q = Σ [y_i − (β0 + β1 x_i)]²
The time between eruptions of Old Faithful geyser in Yellowstone National Park is random but is related to the duration of the last eruption. The table below shows these times for 21 consecutive eruptions.
Obs | LAST | NEXT    Obs | LAST | NEXT    Obs | LAST | NEXT
 1  | 2.0  |  50      8  | 2.8  |  57     15  | 4.0  |  77
 2  | 1.8  |  57      9  | 3.3  |  72     16  | 4.0  |  70
 3  | 3.7  |  55     10  | 3.5  |  62     17  | 1.7  |  43
 4  | 2.2  |  47     11  | 3.7  |  63     18  | 1.8  |  48
 5  | 2.1  |  53     12  | 3.8  |  70     19  | 4.9  |  70
 6  | 2.4  |  50     13  | 4.5  |  85     20  | 4.2  |  79
 7  | 2.6  |  62     14  | 4.7  |  75     21  | 4.3  |  72
(LAST = duration of the last eruption; NEXT = time to the next eruption.)
Implementing the OLS Method on Problem 10.4
(Figure: scatter plot of NEXT vs. LAST.)
Implementing the OLS Method on Problem 10.4
  x̄ = 3.238,  ȳ = 62.714
  S_xx = Σ (x_i − x̄)² = 22.230
  S_yy = Σ (y_i − ȳ)² = 2844.286
  S_xy = Σ (x_i − x̄)(y_i − ȳ) = 217.629
  SSE = Σ (y_i − ŷ_i)² = 713.687
  SSR = Σ (ŷ_i − ȳ)² = 2130.599
  SST = S_yy = 2844.286
  β̂1 = S_xy / S_xx = 9.790
  β̂0 = ȳ − β̂1 x̄ = 31.013
Implementing the OLS Method on Problem 10.4
The fitted line is
  ŷ = β̂0 + β̂1 x = 31.013 + 9.790x
When x = 3, ŷ = 31.013 + 9.790(3) ≈ 60.4.
  r = √(SSR/SST) = 0.865
Since r is fairly close to 1, we could say that LAST is a good predictor of NEXT.
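As a check on the hand computation above, here is a short Python sketch (an illustration in Python rather than the deck's SAS; the variable names are ours) that reproduces these summary statistics from the Old Faithful data:

```python
# Old Faithful data from the slide: LAST (minutes) and NEXT (minutes).
last = [2.0, 1.8, 3.7, 2.2, 2.1, 2.4, 2.6, 2.8, 3.3, 3.5, 3.7,
        3.8, 4.5, 4.7, 4.0, 4.0, 1.7, 1.8, 4.9, 4.2, 4.3]
nxt  = [50, 57, 55, 47, 53, 50, 62, 57, 72, 62, 63,
        70, 85, 75, 77, 70, 43, 48, 70, 79, 72]

n = len(last)
xbar, ybar = sum(last) / n, sum(nxt) / n
sxx = sum((x - xbar) ** 2 for x in last)
syy = sum((y - ybar) ** 2 for y in nxt)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(last, nxt))

b1 = sxy / sxx             # slope estimate
b0 = ybar - b1 * xbar      # intercept estimate
r = sxy / (sxx * syy) ** 0.5   # sample correlation coefficient
```

The computed values agree with the slide's x̄ = 3.238, S_xx = 22.230, S_xy = 217.629, β̂1 = 9.790, β̂0 = 31.013, and r = 0.865 to rounding.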
Statistical Inference on β0 and β1
Final result: β̂0 and β̂1 are normally distributed and unbiased:
  E(β̂0) = β0,  E(β̂1) = β1
  SD(β̂0) = σ √(Σ x_i² / (n S_xx)),  SD(β̂1) = σ / √S_xx
Standardizing:
  (β̂0 − β0)/SD(β̂0) ~ N(0, 1),  (β̂1 − β1)/SD(β̂1) ~ N(0, 1)
Statistical Inference on β0 and β1
Derivation: treat the x_i's as fixed and use Σ (x_i − x̄) = Σ x_i − n x̄ = 0. Then
  β̂1 = Σ (x_i − x̄)(Y_i − Ȳ) / S_xx = Σ (x_i − x̄) Y_i / S_xx
so β̂1 is a linear combination of the Y_i, and
  β̂0 = Ȳ − β̂1 x̄
Statistical Inference on β0 and β1
Derivation (continued):
  Var(β̂1) = Var( Σ (x_i − x̄) Y_i / S_xx ) = Σ (x_i − x̄)² Var(Y_i) / S_xx² = σ² S_xx / S_xx² = σ² / S_xx
  E(β̂1) = Σ (x_i − x̄) E(Y_i) / S_xx = Σ (x_i − x̄)(β0 + β1 x_i) / S_xx = β1 Σ (x_i − x̄) x_i / S_xx = β1 S_xx / S_xx = β1
Statistical Inference on β0 and β1
Derivation (continued):
  E(β̂0) = E(Ȳ − β̂1 x̄) = E(Ȳ) − x̄ E(β̂1)
         = (1/n) Σ (β0 + β1 x_i) − β1 x̄
         = β0 + β1 x̄ − β1 x̄
         = β0
Statistical Inference on β0 and β1
  Var(β̂0) = Var(Ȳ − β̂1 x̄) = Var(Ȳ) + x̄² Var(β̂1)   (Ȳ and β̂1 are uncorrelated)
           = σ²/n + σ² x̄² / S_xx = σ² Σ x_i² / (n S_xx)
Since σ² is unknown, we estimate it by s², giving the standard errors
  SE(β̂0) = s √(Σ x_i² / (n S_xx)),  SE(β̂1) = s / √S_xx
Pivotal quantities (P.Q.):
  (β̂0 − β0)/SE(β̂0) ~ t_{n−2},  (β̂1 − β1)/SE(β̂1) ~ t_{n−2},  (n − 2)s²/σ² = SSE/σ² ~ χ²_{n−2}
Confidence intervals (CI's):
  β̂0 ± t_{n−2, α/2} SE(β̂0),  β̂1 ± t_{n−2, α/2} SE(β̂1)
Statistical Inference on β0 and β1
A useful application is testing whether there is a linear relationship between x and y.
Hypothesis tests:
  H0: β1 = β1⁰ vs. H1: β1 ≠ β1⁰ — reject H0 at level α if
    |t| = |(β̂1 − β1⁰)/SE(β̂1)| > t_{n−2, α/2}
  H0: β1 = 0 vs. H1: β1 ≠ 0 — reject H0 at level α if
    |t| = |β̂1/SE(β̂1)| > t_{n−2, α/2}
One-sided alternative hypotheses can be tested using one-sided t-tests.
Analysis of Variance (ANOVA)
Mean square: a sum of squares divided by its degrees of freedom.
  MSR = SSR/1,  MSE = SSE/(n − 2)
The F statistic for H0: β1 = 0 is
  F = MSR/MSE = SSR/s² = β̂1² S_xx / s² = (β̂1 / SE(β̂1))² = t²
so f_{1, n−2, α} = t²_{n−2, α/2}.
Analysis of Variance (ANOVA) – ANOVA Table:
Source of Variation | Sum of Squares (SS) | Degrees of Freedom (d.f.) | Mean Square (MS)   | F
Regression          | SSR                 | 1                         | MSR = SSR/1        | F = MSR/MSE
Error               | SSE                 | n − 2                     | MSE = SSE/(n − 2)  |
Total               | SST                 | n − 1                     |                    |
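The identity F = t² can be checked numerically on the Old Faithful data (a Python sketch rather than the deck's SAS; variable names are ours):

```python
last = [2.0, 1.8, 3.7, 2.2, 2.1, 2.4, 2.6, 2.8, 3.3, 3.5, 3.7,
        3.8, 4.5, 4.7, 4.0, 4.0, 1.7, 1.8, 4.9, 4.2, 4.3]
nxt  = [50, 57, 55, 47, 53, 50, 62, 57, 72, 62, 63,
        70, 85, 75, 77, 70, 43, 48, 70, 79, 72]

n = len(last)
xbar, ybar = sum(last) / n, sum(nxt) / n
sxx = sum((x - xbar) ** 2 for x in last)
syy = sum((y - ybar) ** 2 for y in nxt)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(last, nxt))

b1 = sxy / sxx
ssr = b1 ** 2 * sxx              # SSR = beta1_hat^2 * Sxx
sse = syy - ssr                  # SST = SSR + SSE
msr, mse = ssr / 1, sse / (n - 2)
F = msr / mse                    # ANOVA F statistic
t = b1 / ((mse / sxx) ** 0.5)    # t statistic for H0: beta1 = 0
```

The ANOVA quantities match the slide's SSR = 2130.599 and SSE = 713.687, and F equals t² up to floating-point rounding.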
Statistical Inference Example – Testing for Linear Relationship
• Problem 10.4: At α = 0.05, is there a linear trend between the time to the NEXT eruption and the duration of the LAST eruption?
Test H0: β1 = 0 vs. H1: β1 ≠ 0.
Reject H0 if |t| > t_{n−2, α/2}, where
  t = β̂1 / SE(β̂1)
Statistical Inference – Hypothesis Testing
Solution:
  β̂1 = S_xy / S_xx = 217.629 / 22.230 = 9.790
  SSE = Σ (y_i − ŷ_i)² = 713.687
  s = √(SSE/(n − 2)) = √(713.687/19) = 6.129
  SE(β̂1) = s / √S_xx = 6.129 / √22.230 = 1.2999
  t = β̂1 / SE(β̂1) = 9.790 / 1.2999 = 7.531
  t_{n−2, α/2} = t_{19, 0.025} = 2.093
Since |t| = 7.531 > 2.093, we reject H0 and conclude that there is a linear relationship between NEXT and LAST.
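The same test can be reproduced in a few lines of Python (an illustrative sketch, not the deck's SAS implementation; the critical value 2.093 is taken from the slide's t table rather than computed):

```python
last = [2.0, 1.8, 3.7, 2.2, 2.1, 2.4, 2.6, 2.8, 3.3, 3.5, 3.7,
        3.8, 4.5, 4.7, 4.0, 4.0, 1.7, 1.8, 4.9, 4.2, 4.3]
nxt  = [50, 57, 55, 47, 53, 50, 62, 57, 72, 62, 63,
        70, 85, 75, 77, 70, 43, 48, 70, 79, 72]

n = len(last)
xbar, ybar = sum(last) / n, sum(nxt) / n
sxx = sum((x - xbar) ** 2 for x in last)
syy = sum((y - ybar) ** 2 for y in nxt)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(last, nxt))

b1 = sxy / sxx
sse = syy - sxy ** 2 / sxx
s = (sse / (n - 2)) ** 0.5        # estimate of sigma
se_b1 = s / sxx ** 0.5            # SE(beta1_hat)
t = b1 / se_b1                    # test statistic for H0: beta1 = 0
t_crit = 2.093                    # t_{19, 0.025}, from the slide
reject = abs(t) > t_crit
```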
Statistical Inference Example - Confidence and Prediction Intervals
• Problem 10.11 from Tamhane & Dunlop, Statistics and Data Analysis
10.11(a) Calculate a 95% PI for the time to the next eruption if the last eruption lasted 3 minutes.
Problem 10.11 – Prediction Interval
Solution: The formula for a 100(1 − α)% PI for a future observation Y* is given by
  ŷ* ± t_{n−2, α/2} · s · √(1 + 1/n + (x* − x̄)²/S_xx)
Problem 10.11 - Prediction Interval
1 9.790xy
xx
SB
S
6.1292
SSEs
n
[47.238,73.529]
0 1 31.013B y B x
**
0 1Y B B x 31.013 9.790(3)
60.385
2, /2 19,0.025 2.093nt t
* 2*
2, /2
1 ( )1n
xx
x xY t s
n S
2
60.385
1 (3 3.238)(2.093)(6.129) 1
21 22.230
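The PI can be verified with a short Python sketch (illustration only; the critical value 2.093 is taken from the slide):

```python
last = [2.0, 1.8, 3.7, 2.2, 2.1, 2.4, 2.6, 2.8, 3.3, 3.5, 3.7,
        3.8, 4.5, 4.7, 4.0, 4.0, 1.7, 1.8, 4.9, 4.2, 4.3]
nxt  = [50, 57, 55, 47, 53, 50, 62, 57, 72, 62, 63,
        70, 85, 75, 77, 70, 43, 48, 70, 79, 72]

n = len(last)
xbar, ybar = sum(last) / n, sum(nxt) / n
sxx = sum((x - xbar) ** 2 for x in last)
syy = sum((y - ybar) ** 2 for y in nxt)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(last, nxt))

b1 = sxy / sxx
b0 = ybar - b1 * xbar
s = ((syy - b1 ** 2 * sxx) / (n - 2)) ** 0.5   # s = sqrt(SSE / (n - 2))

x_star = 3.0
y_hat = b0 + b1 * x_star
t_crit = 2.093                                  # t_{19, 0.025}, from the slide
half_width = t_crit * s * (1 + 1 / n + (x_star - xbar) ** 2 / sxx) ** 0.5
pi_lo, pi_hi = y_hat - half_width, y_hat + half_width
```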
Problem 10.11 - Confidence Interval
10.11(b) Calculate a 95% CI for the mean time to the next eruption for a last eruption lasting 3 minutes. Compare this confidence interval with the PI obtained in (a).
Problem 10.11 – Confidence Interval
Solution: The formula for a 100(1 − α)% CI for the mean μ* = E(Y*) = β0 + β1 x* is given by
  ŷ* ± t_{n−2, α/2} · s · √(1/n + (x* − x̄)²/S_xx)
where ŷ* = β̂0 + β̂1 x*.
The 95% CI is [57.510, 63.257]. The CI is shorter than the PI because it accounts only for the uncertainty in estimating the mean, not the variability of a new observation.
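The CI differs from the PI only in dropping the "1 +" term inside the square root, which this sketch makes explicit (same hedges as before: an illustrative Python computation with the slide's critical value):

```python
last = [2.0, 1.8, 3.7, 2.2, 2.1, 2.4, 2.6, 2.8, 3.3, 3.5, 3.7,
        3.8, 4.5, 4.7, 4.0, 4.0, 1.7, 1.8, 4.9, 4.2, 4.3]
nxt  = [50, 57, 55, 47, 53, 50, 62, 57, 72, 62, 63,
        70, 85, 75, 77, 70, 43, 48, 70, 79, 72]

n = len(last)
xbar, ybar = sum(last) / n, sum(nxt) / n
sxx = sum((x - xbar) ** 2 for x in last)
syy = sum((y - ybar) ** 2 for y in nxt)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(last, nxt))

b1 = sxy / sxx
b0 = ybar - b1 * xbar
s = ((syy - b1 ** 2 * sxx) / (n - 2)) ** 0.5

x_star, t_crit = 3.0, 2.093
y_hat = b0 + b1 * x_star
hw_ci = t_crit * s * (1 / n + (x_star - xbar) ** 2 / sxx) ** 0.5      # CI half-width
hw_pi = t_crit * s * (1 + 1 / n + (x_star - xbar) ** 2 / sxx) ** 0.5  # PI half-width
ci_lo, ci_hi = y_hat - hw_ci, y_hat + hw_ci
```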
Regression Diagnostics
Checking the model assumptions:
1. E(Y_i) is a linear function of x_i
2. Var(Y_i) = σ² is the same for all x_i
3. The errors ε_i are normally distributed
4. The errors ε_i are independent (for time series data)
Also: checking for outliers and influential observations.
Checking the Model Assumptions
• Residuals:  e_i = y_i − ŷ_i
• The e_i's can be viewed as the "estimates" of the random errors ε_i:
  e_i ~ N(0, σ²[1 − 1/n − (x_i − x̄)²/S_xx])
Checking for Linearity
• If the regression of y on x is linear, then the plot of the residuals e_i vs. x_i should exhibit random scatter around zero.
Checking for Linearity – Tire Wear Data
 i | x_i | y_i    | ŷ_i    | e_i
 1 |  0  | 394.33 | 360.64 |  33.69
 2 |  4  | 329.50 | 331.51 |  −2.01
 3 |  8  | 291.00 | 302.39 | −11.39
 4 | 12  | 255.17 | 273.27 | −18.10
 5 | 16  | 229.33 | 244.15 | −14.82
 6 | 20  | 204.83 | 215.02 | −10.19
 7 | 24  | 179.00 | 185.90 |  −6.90
 8 | 28  | 163.83 | 156.78 |   7.05
 9 | 32  | 150.33 | 127.66 |  22.67
(Figure: scatter plot of y vs. x with the fitted LS line.)
Checking for Linearity – Tire Wear Data
(Figure: residual plot of e_i vs. x. The residuals follow a curved pattern — positive at both ends and negative in the middle — indicating that the linear model is inadequate.)
Checking for Linearity – Data Transformation
If the scatter plot is nonlinear, a transformation of x and/or y (for example x², x³, log x, 1/x, √x, or the corresponding transformations of y) can often linearize the relationship; the appropriate choice depends on the curvature seen in the plot.
Checking for Constant Variance
• If the constant variance assumption is correct, the dispersion of the e_i's is approximately constant with respect to the fitted values ŷ_i.
Checking for Constant Variance
(Figure: residual plot of e vs. ŷ for textbook Example 10.21.)
Checking for Normality
• We can use the residuals to make a normal plot.
(Figure: normal probability plot of the residuals for textbook Example 10.21.)
Checking for Outliers
Definition: an outlier is an observation that does not follow the general pattern of the relationship between y and x. A large standardized residual indicates an outlier:
  e_i* = e_i / SE(e_i),  SE(e_i) = s √(1 − 1/n − (x_i − x̄)²/S_xx)
An observation with |e_i*| > 2 is a candidate outlier.
Checking for Influential Observations
An observation can be influential because it has an extreme x-value, an extreme y-value, or both. Since
  ŷ_i = Σ_j h_ij y_j,  with leverage  h_ii = 1/n + (x_i − x̄)²/S_xx
a large h_ii indicates an influential observation. A rule of thumb: flag observations with h_ii > 2(k + 1)/n, where k is the number of predictors.
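The leverages for the Old Faithful data are easy to compute directly (a Python sketch under the same assumptions as the earlier ones; the rule-of-thumb threshold is the 2(k + 1)/n from the slide):

```python
last = [2.0, 1.8, 3.7, 2.2, 2.1, 2.4, 2.6, 2.8, 3.3, 3.5, 3.7,
        3.8, 4.5, 4.7, 4.0, 4.0, 1.7, 1.8, 4.9, 4.2, 4.3]

n = len(last)
xbar = sum(last) / n
sxx = sum((x - xbar) ** 2 for x in last)

# Leverage of each observation: h_ii = 1/n + (x_i - xbar)^2 / Sxx.
h = [1 / n + (x - xbar) ** 2 / sxx for x in last]

k = 1                           # one predictor in simple linear regression
threshold = 2 * (k + 1) / n
flagged = [i + 1 for i, hii in enumerate(h) if hii > threshold]
```

A sanity check: for simple linear regression the leverages always sum to k + 1 = 2. For this data set the largest leverage belongs to observation 19 (LAST = 4.9, the most extreme x-value), and no observation exceeds the threshold.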
Checking for Influential Observations
(Figure: scatter plot illustrating an influential observation with an extreme x-value pulling the fitted line toward it.)
Why Use Correlation Analysis?
• If the nature of the relationship between X and Y is not known, we can investigate the correlation between them without making any assumptions of causality.
• In order to do this, assume (X, Y) follows the bivariate normal distribution.
The Bivariate Normal Distribution
• (X, Y) has the joint density
  f(x, y) = [1 / (2π σ_X σ_Y √(1 − ρ²))] · exp{ −[1 / (2(1 − ρ²))] [ (x − μ_X)²/σ_X² − 2ρ(x − μ_X)(y − μ_Y)/(σ_X σ_Y) + (y − μ_Y)²/σ_Y² ] }
Why Can We Do This?
• This assumption reduces to the probabilistic model for linear regression, since the conditional distribution of Y given X = x is normal with
  E(Y | X = x) = μ_Y + ρ (σ_Y/σ_X)(x − μ_X),  Var(Y | X = x) = σ_Y²(1 − ρ²)
• So when X = x, the mean of Y is a linear function of x and the variance is constant with respect to x.
So What?
• Under these assumptions we can use the available data to make inferences about ρ.
• First we estimate ρ from the data. Define the sample correlation coefficient R:
  R = S_xy / √(S_xx S_yy)
How Can We Use This?
• The exact distribution of R is very complicated, but we have some options.
• Under the null hypothesis H0: ρ = 0, the distribution of R simplifies, and an exact test exists in this case.
• For arbitrary values of ρ0, we can approximate a function of R with a normal distribution, thanks to R. A. Fisher.
Testing H0: ρ = 0
• Under H0 the distribution of
  T = R √(n − 2) / √(1 − R²)
is t_{n−2}. This is perhaps surprising, but note that the test statistic used to test β1 = 0 is distributed as t_{n−2}, and ρ = 0 if and only if β1 = 0. That the two test statistics are in fact equivalent is shown on pages 382–383 of the text.
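This equivalence can be checked numerically on the Old Faithful data: the correlation t statistic comes out identical to the slope t statistic computed earlier (a Python sketch; variable names are ours):

```python
last = [2.0, 1.8, 3.7, 2.2, 2.1, 2.4, 2.6, 2.8, 3.3, 3.5, 3.7,
        3.8, 4.5, 4.7, 4.0, 4.0, 1.7, 1.8, 4.9, 4.2, 4.3]
nxt  = [50, 57, 55, 47, 53, 50, 62, 57, 72, 62, 63,
        70, 85, 75, 77, 70, 43, 48, 70, 79, 72]

n = len(last)
xbar, ybar = sum(last) / n, sum(nxt) / n
sxx = sum((x - xbar) ** 2 for x in last)
syy = sum((y - ybar) ** 2 for y in nxt)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(last, nxt))

r = sxy / (sxx * syy) ** 0.5
t_corr = r * (n - 2) ** 0.5 / (1 - r ** 2) ** 0.5   # t statistic for H0: rho = 0

b1 = sxy / sxx
s = ((syy - b1 ** 2 * sxx) / (n - 2)) ** 0.5
t_slope = b1 / (s / sxx ** 0.5)                     # t statistic for H0: beta1 = 0
```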
Approximation of R
• Fisher showed that, for n even as small as 10,
  ½ ln[(1 + R)/(1 − R)] ≈ N( ½ ln[(1 + ρ)/(1 − ρ)], 1/(n − 3) )
• Now we can test H0: ρ = ρ0 vs. H1: ρ ≠ ρ0 for arbitrary ρ0. We just compute
  Z = √(n − 3) · ( ½ ln[(1 + R)/(1 − R)] − ½ ln[(1 + ρ0)/(1 − ρ0)] )
and compare |Z| with z_{α/2}.
Almost Finished!
• We now have the tools necessary for inference on ρ. For a confidence interval for ρ, compute
  ½ ln[(1 + R)/(1 − R)] ± z_{α/2}/√(n − 3)
and solve each endpoint back for ρ (i.e., apply the inverse transform tanh to both endpoints).
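For the Old Faithful data (r ≈ 0.865, n = 21), a 95% CI for ρ can be computed this way (a Python sketch; z_{0.025} = 1.96 is the standard normal critical value, and the interval endpoints below are our computed values, not from the slides):

```python
import math

last = [2.0, 1.8, 3.7, 2.2, 2.1, 2.4, 2.6, 2.8, 3.3, 3.5, 3.7,
        3.8, 4.5, 4.7, 4.0, 4.0, 1.7, 1.8, 4.9, 4.2, 4.3]
nxt  = [50, 57, 55, 47, 53, 50, 62, 57, 72, 62, 63,
        70, 85, 75, 77, 70, 43, 48, 70, 79, 72]

n = len(last)
xbar, ybar = sum(last) / n, sum(nxt) / n
sxx = sum((x - xbar) ** 2 for x in last)
syy = sum((y - ybar) ** 2 for y in nxt)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(last, nxt))
r = sxy / (sxx * syy) ** 0.5

z = 0.5 * math.log((1 + r) / (1 - r))   # Fisher's transform of R
half_width = 1.96 / math.sqrt(n - 3)    # z_{0.025} / sqrt(n - 3)
rho_lo = math.tanh(z - half_width)      # solve the transform back for rho
rho_hi = math.tanh(z + half_width)
```

Note the interval is not symmetric about r: the transform stretches the scale near ±1.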
Correlation – Conclusion
• When we are not sure of the relationship between X and Y, assume (X_i, Y_i) is an observation from a bivariate normal distribution. To test H0: ρ = ρ0 vs. H1: ρ ≠ ρ0 at significance level α, compare
  |Z| = √(n − 3) · | ½ ln[(1 + R)/(1 − R)] − ½ ln[(1 + ρ0)/(1 − ρ0)] |
to z_{α/2}. But if ρ0 = 0, compare
  T = R √(n − 2) / √(1 − R²)
to t_{n−2, α/2}.
SAS - Reg Procedure
Proc Reg Data=Regression_Example;
  Title "Regression Example";
  Model Next = Last;
  Plot Next*Last;
  Plot Residual.*Predicted.;
  Output Out=Data_From_Regression Residual=R Predicted=PV;
Run;
Proc Reg Output
(SAS output table shown on slide.)

Plot Next*Last
(Scatter plot with fitted regression line shown on slide.)
SAS - Plotting Regression Line
Symbol1 Value=Dot C=blue I=R;
Symbol2 Value=None C=red I=RLCLM95;
Proc Gplot Data=Regression_Example;
  Title "Regression Line and CIs";
  Plot Next*Last=1 Next*Last=2/Overlay;
Run;

Plotting Regression Line
(Plot of the regression line with 95% confidence limits shown on slide.)
SAS – Checking Homoscedasticity
Proc Reg Data=Regression_Example;
  Title "Regression Example";
  Model Next = Last;
  Plot Residual.*Predicted.;
  Output Out=Data_From_Regression Residual=R Predicted=PV;
Run;

(Residual vs. predicted plot shown on slide.)
SAS - Checking Normality of Residuals
Proc Reg Data=Regression_Example;
  Output Out=Data_From_Regression Residual=R Predicted=PV;
Proc Univariate Data=Data_From_Regression Normal;
  Var R;
  QQplot R / Normal(Mu=est Sigma=est);
Run;

(Normal Q-Q plot of residuals shown on slide.)
Questions?