Simple Linear Regression
Lecture for Statistics 509
November-December 2000
Week of 11/27/2000 Stat 509 - Regression Lecture 2
Correlation and Regression
• Study of association and/or relationship between variables.
• Useful for determining the effect of changes in one variable (called the independent or control variable) on another variable (called the dependent or response variable).
• Regression models could be utilized to determine optimal operating conditions [these conditions specified by the control variables] in order to achieve a certain specified value or yield on the response variable.
• Regression models could also be utilized to predict the value of the response given a value of the independent variable, or could be used for “calibrating” the value of the independent variable to achieve a certain response.
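As a minimal sketch of the two uses just described, the following Python snippet treats a fitted line as a pair (intercept, slope). The coefficient values and the helper names `predict` and `calibrate` are hypothetical and chosen only for illustration.

```python
def predict(intercept, slope, x):
    """Predicted response at a given value x of the independent variable."""
    return intercept + slope * x


def calibrate(intercept, slope, y_target):
    """Value of the independent variable at which the line equals y_target."""
    if slope == 0:
        raise ValueError("a flat line cannot be calibrated")
    return (y_target - intercept) / slope


# Hypothetical coefficients, for illustration only (not from the lecture).
intercept, slope = 2.0, 0.5

y_at_10 = predict(intercept, slope, 10.0)        # prediction
x_for_7 = calibrate(intercept, slope, y_at_10)   # calibration (inverse use)
```

Calibration simply inverts the prediction line, so it is only meaningful when the slope is not zero.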
Some Examples
• Control variable is X = average speed of a car, and the response variable is Y = fuel efficiency of the car. The goal is to determine the speed that optimizes the efficiency of the car.
• Control variable is X = Temperature, while the response variable is Y = Yield in a chemical reaction.
• Control variable is X = amount of fertilizer applied on a plant, while the response variable is Y = yield of this plant.
• Control variable is X = thickness of a stack of bond paper, while the response variable is Y = number of sheets in this stack.
• Control variable is X = average time of studying, while the response variable is Y = GPA.
Population Model
• Each member of the population has a value for the independent variable X and the response variable Y, usually represented by the vector (X, Y).
• For a given value X = x, the variable Y has a certain distribution whose conditional mean is $\mu(x)$ and whose conditional variance is $\sigma^2(x)$.
• This can be visualized as follows: in the subpopulation consisting of units whose values of X equal x, the Y-values have a certain distribution whose mean is $\mu(x)$ and whose variance is $\sigma^2(x)$. When you pick a unit from this subpopulation, the Y-value that you observe is governed by this particular distribution. In particular, this observation can be expressed via
• $Y = \mu(x) + \epsilon$, where $\epsilon$ is some “error term.”
Assumptions for Simple Linear Regression
• $\mu(x) = E(Y \mid X = x) = \alpha + \beta x$. This means that the mean of Y, given X = x, is a linear function of x. $\beta$ is called the regression coefficient or the slope of the regression line; $\alpha$ is the y-intercept.
• $\sigma^2(x) = \sigma^2$ does not depend on x. This is the assumption of “equal variances” or homoscedasticity.
• Furthermore, for the sample data (x1, Y1), (x2, Y2), …, (xn, Yn):
• Y1, Y2, …, Yn are independent observations, and their conditional distributions are all normal.
• In shorthand notation:
• $Y_i = \mu(x_i) + \epsilon_i = \alpha + \beta x_i + \epsilon_i$, $i = 1, 2, \ldots, n$, where $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$ are independent and identically distributed (IID) $N(0, \sigma^2)$.
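To make the model concrete, here is a minimal sketch that simulates responses from a straight-line mean plus IID normal errors. The function name `simulate` and the parameter values used below are assumptions for illustration and do not come from the lecture.

```python
import random


def simulate(xs, alpha, beta, sigma, seed=0):
    """Draw Y_i = alpha + beta * x_i + eps_i with eps_i IID N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]


xs = [1.0, 2.0, 3.0, 4.0, 5.0]

# Hypothetical parameter values, chosen only for illustration.
ys = simulate(xs, alpha=1.0, beta=2.0, sigma=0.5)

# With sigma = 0 every error term vanishes and Y falls exactly on the line.
noiseless = simulate(xs, alpha=1.0, beta=2.0, sigma=0.0)
```

The same standard deviation is used at every x, which is exactly the homoscedasticity assumption.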
Regression Problem
• Given the sample (bivariate) data (x1, Y1), (x2, Y2), …, (xn, Yn), satisfying the linear regression model
• $Y_i = \alpha + \beta x_i + \epsilon_i$ with $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$ IID $N(0, \sigma^2)$
• we would like to address the following questions:
• How should the data be summarized graphically?
• What are the estimators of the parameters $\alpha$, $\beta$, and $\sigma^2$?
• What will be an estimate of the prediction line?
• What are the properties of the estimators of the model parameters?
• How do we test whether the fitted regression model is a significant model?
• How do we construct CIs or test hypotheses concerning parameters?
• How do we perform prediction using the prediction model?
Illustrative Example: On Plasma Etching
• Plasma etching is essential to the fine-line pattern transfer in current semiconductor processes. The paper “Ion Beam-Assisted Etching of Aluminum with Chlorine” in J. Electrochem. Soc. (1985) gives the data below on chlorine flow (x, in SCCM) through a nozzle used in the etching mechanism, and etch rate (y, in 100A/min)
x    1.5   1.5   2.0   2.5   2.5   3.0   3.5   3.5   4.0
y   23.0  24.5  25.0  30.0  33.5  40.0  40.5  47.0  49.0
The Scatterplot
[Figure: Scatterplot of Chlorine Flow and Etch Rate. Horizontal axis: ChlorineFlow; vertical axis: EtchRate.]
Least-Squares Prediction Line
The least-squares (LS) principle for fitting the regression line to the scatterplot states that the best-fitting line
$$\hat{Y} = a + bx$$
is such that the coefficients a and b provide the smallest possible value of the sum of squared deviations between the observed Y-values and their associated predicted values. The predicted values are
$$\hat{Y}_i = a + b x_i, \quad i = 1, 2, \ldots, n,$$
so the quantity that needs to be minimized is given by:
$$Q(a, b) = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - a - b x_i)^2.$$
Using minimization techniques from calculus, the coefficients that provide the minimum value of Q(a, b) are given in the next slide.
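The calculus step is not spelled out in the lecture; a sketch of the standard argument, writing $\bar{x}$ and $\bar{Y}$ for the sample means, is as follows. Setting both partial derivatives of Q to zero gives
$$\frac{\partial Q}{\partial a} = -2 \sum_{i=1}^{n} (Y_i - a - b x_i) = 0, \qquad \frac{\partial Q}{\partial b} = -2 \sum_{i=1}^{n} x_i (Y_i - a - b x_i) = 0.$$
The first equation yields $a = \bar{Y} - b\bar{x}$. Substituting this into the second equation, and using $\sum_{i=1}^{n} \bar{x}\,(Y_i - \bar{Y}) = 0$ and $\sum_{i=1}^{n} \bar{x}\,(x_i - \bar{x}) = 0$, gives
$$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$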
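Formulas for Simple Linear Regression
$$SXX = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2$$
$$SYY = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2$$
$$SXY = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}$$
Sample Correlation Coefficient:
$$r = \frac{SXY}{\sqrt{(SXX)(SYY)}}$$
Estimator of $\beta$:
$$b = \frac{SXY}{SXX}$$
Estimator of $\alpha$:
$$a = \bar{Y} - b\bar{X}$$
Prediction Line:
$$\hat{Y} = a + bX$$
Residuals:
$$R_i = Y_i - \hat{Y}_i = Y_i - (a + bX_i), \quad i = 1, 2, \ldots, n$$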
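As a hedged illustration of these formulas, the sketch below applies them to the plasma etching data from the illustrative example. The function name `fit_line` and the dictionary layout are assumptions made here, not part of the lecture.

```python
import math

# Plasma etching data from the illustrative example.
x = [1.5, 1.5, 2.0, 2.5, 2.5, 3.0, 3.5, 3.5, 4.0]            # chlorine flow
y = [23.0, 24.5, 25.0, 30.0, 33.5, 40.0, 40.5, 47.0, 49.0]   # etch rate


def fit_line(x, y):
    """Return SXX, SYY, SXY, the LS coefficients a and b, and r."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx                    # estimator of the slope
    a = ybar - b * xbar              # estimator of the intercept
    r = sxy / math.sqrt(sxx * syy)   # sample correlation coefficient
    return {"sxx": sxx, "syy": syy, "sxy": sxy, "a": a, "b": b, "r": r}


fit = fit_line(x, y)

# The computational shortcut SXX = sum(X_i^2) - n * Xbar^2 gives the same value.
n = len(x)
sxx_shortcut = sum(xi ** 2 for xi in x) - n * (sum(x) / n) ** 2
```

The deviation form is used inside `fit_line` because it tends to be less sensitive to rounding than the shortcut form; both agree for this small data set.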
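$$SSE = \sum_{i=1}^{n} R_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
$$SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$$
$$SYY = SSR + SSE$$
$$S^2 = MSE = \frac{SSE}{n - 2} \quad \text{is an unbiased estimator of } \sigma^2$$
$$MSR = \frac{SSR}{1}, \qquad F_c = \frac{MSR}{MSE}$$
Coefficient of Determination:
$$R^2 = \frac{SSR}{SYY}$$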
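A minimal sketch of these sums of squares, again using the plasma etching data from the illustrative example; the variable names are choices made here for illustration.

```python
# Plasma etching data from the illustrative example.
x = [1.5, 1.5, 2.0, 2.5, 2.5, 3.0, 3.5, 3.5, 4.0]
y = [23.0, 24.5, 25.0, 30.0, 33.5, 40.0, 40.5, 47.0, 49.0]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Least-squares coefficients b = SXY / SXX and a = Ybar - b * Xbar.
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b = sxy / sxx
a = ybar - b * xbar

fitted = [a + b * xi for xi in x]

sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual sum of squares
ssr = sum((fi - ybar) ** 2 for fi in fitted)            # regression sum of squares
syy = sum((yi - ybar) ** 2 for yi in y)                 # total sum of squares

mse = sse / (n - 2)   # estimate of the error variance
msr = ssr / 1         # regression has one degree of freedom
f_value = msr / mse
r_squared = ssr / syy
```

The decomposition SYY = SSR + SSE holds up to floating-point rounding, which makes it a convenient sanity check on the computation.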
Analysis of Variance Table
Source of Variation   Degrees-of-Freedom   Sum of Squares   Mean Squares   F-Value
Regression            1                    SSR              MSR            MSR/MSE
Error                 n-2                  SSE              MSE
Total                 n-1                  SYY
To test the null hypothesis $H_0: \beta = 0$, compare the F-value (MSR/MSE) to the tabular value obtained from the F-distribution with degrees-of-freedom (1, n-2). If the F-value is larger, then the null hypothesis is rejected, and it is concluded that the regression model is significant (at the prespecified level of significance).
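The decision rule can be sketched as below. The sums of squares are the rounded values reported for the plasma etching data in this lecture's Minitab output; the critical value 5.59 is the 5% point of the F-distribution with (1, 7) degrees of freedom taken from a standard F table, which is an assumption added here and not a number given in the lecture. The function name `f_test` is likewise hypothetical.

```python
def f_test(ssr, sse, n, f_critical):
    """Return the F-value MSR/MSE and whether H0: beta = 0 is rejected."""
    msr = ssr / 1          # regression degrees of freedom = 1
    mse = sse / (n - 2)    # error degrees of freedom = n - 2
    f_value = msr / mse
    return f_value, f_value > f_critical


# SSR and SSE as reported (rounded) for the plasma etching data, n = 9.
# f_critical = 5.59 is a tabular value for F(1, 7) at the 5% level (assumption).
f_value, reject_h0 = f_test(ssr=730.69, sse=45.36, n=9, f_critical=5.59)
```

Here the F-value is far above the tabular value, so the regression would be declared significant at the 5% level.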
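Standard Errors and Confidence Intervals
$$\hat{\sigma}(b) = \sqrt{\frac{MSE}{SXX}}$$
$$\hat{\sigma}(a) = \sqrt{MSE\left(\frac{1}{n} + \frac{\bar{X}^2}{SXX}\right)}$$
$$\hat{\sigma}[\hat{Y}(x_0)] = \sqrt{MSE\left(\frac{1}{n} + \frac{(x_0 - \bar{X})^2}{SXX}\right)}$$
Confidence Interval for $\mu(x_0)$:
$$(a + b x_0) \pm t_{n-2;\,\alpha/2} \sqrt{MSE\left(\frac{1}{n} + \frac{(x_0 - \bar{X})^2}{SXX}\right)}$$
Prediction Interval at $X = x_0$:
$$(a + b x_0) \pm t_{n-2;\,\alpha/2} \sqrt{MSE\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{X})^2}{SXX}\right)}$$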
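The following sketch evaluates these standard errors and intervals for the plasma etching data. The critical value 2.365 is the upper 2.5% point of the t-distribution with 7 degrees of freedom taken from a standard t table (an assumption added here), and the point x0 = 3.0 and the function name `intervals` are arbitrary illustrative choices.

```python
import math

# Plasma etching data from the illustrative example.
x = [1.5, 1.5, 2.0, 2.5, 2.5, 3.0, 3.5, 3.5, 4.0]
y = [23.0, 24.5, 25.0, 30.0, 33.5, 40.0, 40.5, 47.0, 49.0]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b = sxy / sxx
a = ybar - b * xbar
mse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

T_CRIT = 2.365  # t table value for 7 d.f. and alpha/2 = 0.025 (assumption)

se_b = math.sqrt(mse / sxx)
se_a = math.sqrt(mse * (1 / n + xbar ** 2 / sxx))


def intervals(x0, t_crit=T_CRIT):
    """Return (confidence interval for the mean, prediction interval) at x0."""
    center = a + b * x0
    se_mean = math.sqrt(mse * (1 / n + (x0 - xbar) ** 2 / sxx))
    se_pred = math.sqrt(mse * (1 + 1 / n + (x0 - xbar) ** 2 / sxx))
    ci = (center - t_crit * se_mean, center + t_crit * se_mean)
    pi = (center - t_crit * se_pred, center + t_crit * se_pred)
    return ci, pi


ci, pi = intervals(3.0)
```

The prediction interval is always wider than the confidence interval at the same x0, because it also accounts for the variability of a single new observation around its mean.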
Excel Worksheet for Regression Computations
X=ChlorineFlow   Y=EtchRate   X^2     Y^2        XY
1.5              23           2.25    529        34.5
1.5              24.5         2.25    600.25     36.75
2                25           4       625        50
2.5              30           6.25    900        75
2.5              33.5         6.25    1122.25    83.75
3                40           9       1600       120
3.5              40.5         12.25   1640.25    141.75
3.5              47           12.25   2209       164.5
4                49           16      2401       196
24               312.5        70.5    11626.75   902.25
SumX             SumY         SumX2   SumY2      SumXY
SXX   6.5           b     10.60256
SYY   776.055556    a     6.448718
SXY   68.9166667    MSE   6.480311
Regression Analysis from Minitab
The regression equation is: y = 6.45 + 10.6 x
Predictor   Coef      StDev    T       P
Constant    6.449     2.795    2.31    0.054
x           10.6026   0.9985   10.62   0.000
S = 2.546   R-Sq = 94.2%   R-Sq(adj) = 93.3%
Analysis of Variance
Source           DF   SS       MS       F        P
Regression       1    730.69   730.69   112.76   0.000
Residual Error   7    45.36    6.48
Total            8    776.06
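The entries of this output are tied together by simple arithmetic, which the sketch below checks using only the rounded numbers printed in the Minitab output. The adjusted R-squared formula used, 1 - MSE / (SYY / (n - 1)), is the standard definition and is not stated in the lecture; the variable names are choices made here.

```python
# Rounded values copied from the Minitab output above.
coef_a, se_a = 6.449, 2.795
coef_b, se_b = 10.6026, 0.9985
ss_reg, ss_err, ss_tot = 730.69, 45.36, 776.06
df_err, df_tot = 7, 8
f_value = 112.76
s = 2.546

# Each T entry is the coefficient divided by its standard deviation.
t_a = coef_a / se_a
t_b = coef_b / se_b

# In simple linear regression the F-value equals the square of the slope's T.
t_b_squared = t_b ** 2

# S is the square root of the error mean square.
mse = ss_err / df_err

# R-Sq = SSR / SYY; R-Sq(adj) uses mean squares (standard definition).
r_sq = ss_reg / ss_tot
r_sq_adj = 1 - mse / (ss_tot / df_tot)
```

Small discrepancies between `t_b_squared` and the printed F-value are due to rounding in the displayed output.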
Fitted Line in Scatterplot with Bands
[Figure: Regression Analysis of the Plasma Etching Data. Fitted line Y = 6.44872 + 10.6026X (R-Sq = 94.2%) drawn on the scatterplot with 95% CI and 95% PI bands. Horizontal axis: ChlorineFlow; vertical axis: EtchRate.]