purpose of regression analysis regression analysis is used primarily to model causality and provide...
![Page 1: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/1.jpg)
Purpose of Regression Analysis
• Regression analysis is used primarily to model causality and provide prediction
– Predicts the value of a dependent (response) variable based on the value of at least one independent (explanatory) variable
– Explains the effect of the independent variables on the dependent variable
![Page 2: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/2.jpg)
Types of Regression Models
Positive Linear Relationship
Negative Linear Relationship
Relationship NOT Linear
No Relationship
![Page 3: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/3.jpg)
Simple Linear Regression Model
• Relationship between variables is described by a linear function
• The change of one variable causes the change in the other variable
• A dependency of one variable on the other
![Page 4: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/4.jpg)
Population Linear Regression
Population regression line is a straight line that describes the dependence of the average value (conditional mean) of one variable on the other
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
$\mu_{Y|X} = \beta_0 + \beta_1 X_i$ (Population Regression Line, conditional mean)
where $Y_i$ = Dependent (Response) Variable, $X_i$ = Independent (Explanatory) Variable, $\beta_0$ = Population Y intercept, $\beta_1$ = Population Slope Coefficient, $\varepsilon_i$ = Random Error
![Page 5: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/5.jpg)
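Population Linear Regression
(continued)
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ (Observed Value of Y)
[Figure: the population regression line $\mu_{Y|X} = \beta_0 + \beta_1 X_i$ (conditional mean) on X and Y axes; the vertical distance between the observed value of Y at $X_i$ and the line is the random error $\varepsilon_i$.]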
![Page 6: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/6.jpg)
Sample Linear Regression
Sample regression line provides an estimate of the population regression line as well as a predicted value of Y
$Y_i = b_0 + b_1 X_i + e_i$
where $b_0$ = Sample Y Intercept, $b_1$ = Sample Slope Coefficient, $e_i$ = Residual
$\hat{Y} = b_0 + b_1 X$ = Sample Regression Line (Fitted Regression Line, Predicted Value)
![Page 7: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/7.jpg)
Sample Linear Regression
(continued)
• $b_0$ and $b_1$ are obtained by finding the values of $b_0$ and $b_1$ that minimize the sum of the squared residuals
$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2$
• $b_0$ provides an estimate of $\beta_0$
• $b_1$ provides an estimate of $\beta_1$
![Page 8: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/8.jpg)
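Sample Linear Regression
(continued)
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
$Y_i = b_0 + b_1 X_i + e_i$
$\hat{Y}_i = b_0 + b_1 X_i$
[Figure: the observed value of Y at $X_i$ shown against both the population regression line $\mu_{Y|X}$ (error $\varepsilon_i$) and the sample regression line (residual $e_i$, intercept $b_0$, slope $b_1$), on X and Y axes.]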
![Page 9: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/9.jpg)
Interpretation of the Slope and the Intercept
• $\beta_0 = E(Y \mid X = 0)$ is the average value of Y when the value of X is zero.
• $\beta_1 = \dfrac{\Delta E(Y \mid X)}{\Delta X}$ measures the change in the average value of Y as a result of a one-unit change in X.
![Page 10: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/10.jpg)
Interpretation of the Slope and the Intercept
(continued)
• $b_0 = \hat{E}(Y \mid X = 0)$ is the estimated average value of Y when the value of X is zero.
• $b_1 = \dfrac{\Delta \hat{E}(Y \mid X)}{\Delta X}$ is the estimated change in the average value of Y as a result of a one-unit change in X.
![Page 11: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/11.jpg)
Simple Linear Regression: Example
You want to examine the linear dependency of the annual sales of produce stores on their size in square footage. Sample data for seven stores were obtained. Find the equation of the straight line that fits the data best.
| Store | Square Feet | Annual Sales ($1000) |
|---|---|---|
| 1 | 1,726 | 3,681 |
| 2 | 1,542 | 3,395 |
| 3 | 2,816 | 6,653 |
| 4 | 5,555 | 9,543 |
| 5 | 1,292 | 3,318 |
| 6 | 2,208 | 5,563 |
| 7 | 1,313 | 3,760 |
![Page 12: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/12.jpg)
Scatter Diagram: Example
[Figure: Excel output, scatter diagram of Annual Sales ($000) against Square Feet.]
![Page 13: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/13.jpg)
Equation for the Sample Regression Line: Example
$\hat{Y}_i = b_0 + b_1 X_i = 1636.415 + 1.487 X_i$
From Excel Printout:
| | Coefficients |
|---|---|
| Intercept | 1636.414726 |
| X Variable 1 | 1.486633657 |
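The Excel coefficients can be reproduced by hand from the least-squares formulas. The following is a minimal sketch in plain Python, using the seven-store data from the example; the function and variable names are illustrative and not part of the original slides.

```python
# Least-squares fit of annual sales on store size (data from the example).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # in $1000


def fit_simple_regression(x, y):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squares of X about its mean.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx           # sample slope
    b0 = y_bar - b1 * x_bar    # sample Y intercept
    return b0, b1


b0, b1 = fit_simple_regression(square_feet, annual_sales)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

The printed values should agree with the Intercept and X Variable 1 entries of the Excel printout up to rounding.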
![Page 14: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/14.jpg)
Excel Output

Regression Statistics
| | |
|---|---|
| Multiple R | 0.970557 |
| R Square | 0.941981 |
| Adjusted R Square | 0.930378 |
| Standard Error | 611.7515 |
| Observations | 7 |

ANOVA
| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 30380456 | 30380456 | 81.17909 | 0.000281 |
| Residual | 5 | 1871200 | 374239.9 | | |
| Total | 6 | 32251656 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 1636.415 | 451.4953 | 3.624433 | 0.015149 | 475.8109 | 2797.019 |
| X Variable 1 | 1.486634 | 0.164999 | 9.009944 | 0.000281 | 1.06249 | 1.910777 |
![Page 15: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/15.jpg)
Graph of the Sample Regression Line: Example
[Figure: scatter diagram of Annual Sales ($000) against Square Feet with the fitted line $\hat{Y}_i = 1636.415 + 1.487 X_i$.]
![Page 16: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/16.jpg)
Interpretation of Results: Example
The slope of 1.487 means that for each increase of one unit in X, we predict the average of Y to increase by an estimated 1.487 units.
The model estimates that for each increase of one square foot in the size of the store, the expected annual sales are predicted to increase by $1487.
$\hat{Y}_i = 1636.415 + 1.487 X_i$
![Page 17: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/17.jpg)
How Good is the regression?
• $R^2$
• Confidence Intervals
• Residual Plots
• Analysis of Variance
• Hypothesis (t) tests
![Page 18: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/18.jpg)
Measure of Variation: The Sum of Squares
SST = SSR + SSE
Total Sample Variability = Explained Variability + Unexplained Variability
![Page 19: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/19.jpg)
Measure of Variation: The Sum of Squares
(continued)
• SST = total sum of squares
– Measures the variation of the $Y_i$ values around their mean $\bar{Y}$
• SSR = regression sum of squares
– Explained variation attributable to the relationship between X and Y
• SSE = error sum of squares
– Variation attributable to factors other than the relationship between X and Y
![Page 20: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/20.jpg)
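Measure of Variation: The Sum of Squares
(continued)
SST = $\sum (Y_i - \bar{Y})^2$
SSE = $\sum (Y_i - \hat{Y}_i)^2$
SSR = $\sum (\hat{Y}_i - \bar{Y})^2$
[Figure: at a given $X_i$, the distances between the observed $Y_i$, the fitted value $\hat{Y}_i$ on the regression line, and the mean $\bar{Y}$, on X and Y axes.]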
![Page 21: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/21.jpg)
The Coefficient of Determination
• $r^2 = \dfrac{SSR}{SST} = \dfrac{\text{Regression Sum of Squares}}{\text{Total Sum of Squares}}$
• Measures the proportion of variation in Y that is explained by the independent variable X in the regression model
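As a check on the decomposition SST = SSR + SSE and on the value of $r^2$, the sums of squares can be computed directly for the produce store example. This is a sketch in plain Python; the variable names are illustrative.

```python
# Sums of squares and r^2 for the produce store example.
x = [1726, 1542, 2816, 5555, 1292, 2208, 1313]   # square feet
y = [3681, 3395, 6653, 9543, 3318, 5563, 3760]   # annual sales ($1000)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]               # fitted values

sst = sum((yi - y_bar) ** 2 for yi in y)                 # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # unexplained variation
r_squared = ssr / sst

print(f"SST = {sst:.0f}, SSR = {ssr:.0f}, SSE = {sse:.0f}, r^2 = {r_squared:.4f}")
```

The three sums should match the SS column of the ANOVA table in the Excel output, and `r_squared` should match R Square.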
![Page 22: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/22.jpg)
Coefficients of Determination ($r^2$) and Correlation (r)
[Figure: four scatter diagrams of Y against X, each with the fitted line $\hat{Y}_i = b_0 + b_1 X_i$: $r^2 = 1$, r = +1; $r^2 = 1$, r = -1; $r^2 = .8$, r = +0.9; $r^2 = 0$, r = 0.]
![Page 23: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/23.jpg)
Linear Regression Assumptions
1. Linearity
2. Normality
– Y values are normally distributed for each X
– Probability distribution of error is normal
3. Homoscedasticity (Constant Variance)
4. Independence of Errors
![Page 24: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/24.jpg)
Residual Analysis
• Purposes
– Examine linearity
– Evaluate violations of assumptions
• Graphical Analysis of Residuals
– Plot residuals vs. $X_i$, $Y_i$ and time
![Page 25: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/25.jpg)
Residual Analysis for Linearity
[Figure: two cases, "Not Linear" and "Linear", each shown as a plot of Y against X and a plot of the residuals e against X.]
![Page 26: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/26.jpg)
Residual Analysis for Homoscedasticity
[Figure: two cases, "Heteroscedasticity" and "Homoscedasticity", each shown as a plot of Y against X and a plot of SR against X.]
![Page 27: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/27.jpg)
Variation of Errors around the Regression Line
• Y values are normally distributed around the regression line.
• For each X value, the “spread” or variance around the regression line is the same.
[Figure: error distributions f(e) with the same spread drawn at $X_1$ and $X_2$ along the sample regression line, on X and Y axes.]
![Page 28: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/28.jpg)
Residual Analysis: Excel Output for Produce Stores Example
[Figure: Excel residual plot against Square Feet.]
Excel Output
| Observation | Predicted Y | Residuals |
|---|---|---|
| 1 | 4202.344417 | -521.3444173 |
| 2 | 3928.803824 | -533.8038245 |
| 3 | 5822.775103 | 830.2248971 |
| 4 | 9894.664688 | -351.6646882 |
| 5 | 3557.14541 | -239.1454103 |
| 6 | 4918.90184 | 644.0981603 |
| 7 | 3588.364717 | 171.6352829 |
![Page 29: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/29.jpg)
Residual Analysis for Independence
Graphical Approach
Residual is plotted against time to detect any autocorrelation
[Figure: two plots of residuals e against Time. "Not Independent": cyclical pattern. "Independent": no particular pattern.]
![Page 30: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/30.jpg)
Inference about the Slope: t Test
• t test for a population slope
– Is there a linear dependency of Y on X?
• Null and alternative hypotheses
– $H_0$: $\beta_1 = 0$ (no linear dependency)
– $H_1$: $\beta_1 \neq 0$ (linear dependency)
• Test statistic
– $t = \dfrac{b_1 - \beta_1}{S_{b_1}}$ where $S_{b_1} = \dfrac{S_{YX}}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$
– d.f. = n - 2
![Page 31: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/31.jpg)
Example: Produce Store
Data for Seven Stores:
| Store | Square Feet | Annual Sales ($000) |
|---|---|---|
| 1 | 1,726 | 3,681 |
| 2 | 1,542 | 3,395 |
| 3 | 2,816 | 6,653 |
| 4 | 5,555 | 9,543 |
| 5 | 1,292 | 3,318 |
| 6 | 2,208 | 5,563 |
| 7 | 1,313 | 3,760 |
Estimated Regression Equation: $\hat{Y}_i = 1636.415 + 1.487 X_i$
The slope of this model is 1.487.
Is square footage of the store affecting its annual sales?
![Page 32: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/32.jpg)
Inferences about the Slope: t Test Example
$H_0$: $\beta_1 = 0$
$H_1$: $\beta_1 \neq 0$
$\alpha = .05$
df = 7 - 2 = 5
Critical Value(s): t = ±2.5706 (.025 in each rejection tail)
Test Statistic (From Excel Printout):
| | Coefficients ($b_1$) | Standard Error ($S_{b_1}$) | t Stat (t) | P-value |
|---|---|---|---|---|
| Intercept | 1636.4147 | 451.4953 | 3.6244 | 0.01515 |
| Footage | 1.4866 | 0.1650 | 9.0099 | 0.00028 |
Decision: Reject $H_0$
Conclusion: There is evidence that square footage affects annual sales.
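The test statistic in the printout follows from the formulas for $S_{YX}$ and $S_{b_1}$. The sketch below recomputes it in plain Python under $H_0$: $\beta_1 = 0$; the critical value 2.5706 is taken from the slide rather than computed, and the variable names are illustrative.

```python
import math

# t test for the slope in the produce store example.
x = [1726, 1542, 2816, 5555, 1292, 2208, 1313]   # square feet
y = [3681, 3395, 6653, 9543, 3318, 5563, 3760]   # annual sales ($1000)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

# Standard error of the estimate, S_YX = sqrt(SSE / (n - 2)).
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_yx = math.sqrt(sse / (n - 2))

# Standard error of the slope and the t statistic under H0: beta1 = 0.
s_b1 = s_yx / math.sqrt(s_xx)
t_stat = (b1 - 0) / s_b1

critical_value = 2.5706          # from the slide: alpha = .05, df = 5
reject_h0 = abs(t_stat) > critical_value
print(f"S_b1 = {s_b1:.4f}, t = {t_stat:.4f}, reject H0: {reject_h0}")
```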
![Page 33: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/33.jpg)
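The Multiple Regression Model
Relationship between 1 dependent & 2 or more independent variables is a linear function
Population model: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$
where $\beta_0$ = Population Y-intercept, $\beta_1, \ldots, \beta_k$ = Population slopes, $\varepsilon_i$ = Random Error
Sample model: $Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki} + e_i$
where $Y_i$ = Dependent (Response) variable for sample, $X_{1i}, \ldots, X_{ki}$ = Independent (Explanatory) variables for sample model, $e_i$ = Residual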
![Page 34: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/34.jpg)
Population Multiple Regression Model
Bivariate model
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$
$\mu_{Y|X} = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}$ (Response Plane)
[Figure: response plane over the $X_1$ and $X_2$ axes with intercept $\beta_0$; the observed Y at $(X_{1i}, X_{2i})$ lies a distance $\varepsilon_i$ from the plane.]
![Page 35: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/35.jpg)
Sample Multiple Regression Model
Bivariate model
$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + e_i$
$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i}$ (Sample Regression Plane)
[Figure: sample regression (response) plane over the $X_1$ and $X_2$ axes with intercept $b_0$; the observed Y at $(X_{1i}, X_{2i})$ lies a distance $e_i$ from the plane.]
![Page 36: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/36.jpg)
Simple and Multiple Regression Compared
• Coefficients in a simple regression pick up the impact of that variable plus the impacts of other variables that are correlated with it and the dependent variable.
• Coefficients in a multiple regression net out the impacts of other variables in the equation.
![Page 37: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/37.jpg)
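Simple and Multiple Regression Compared: Example
• Two simple regressions:
– $\text{Oil} = \beta_0 + \beta_1\,\text{Temp} + \varepsilon$
– $\text{Oil} = \beta_0 + \beta_1\,\text{Insulation} + \varepsilon$
• Multiple regression:
– $\text{Oil} = \beta_0 + \beta_1\,\text{Temp} + \beta_2\,\text{Insulation} + \varepsilon$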
![Page 38: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/38.jpg)
Multiple Linear Regression Equation
Too complicated
by hand! Ouch!
![Page 39: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/39.jpg)
Interpretation of Estimated Coefficients
• Slope (bi)
– Estimated that the average value of Y changes by bi for each 1 unit increase in Xi holding all other variables constant (ceteris paribus)
– Example: if b1 = -2, then fuel oil usage (Y) is expected to decrease by an estimated 2 gallons for each 1 degree increase in temperature (X1) given the inches of insulation (X2)
• Y-intercept (b0)
– The estimated average value of Y when all Xi = 0
![Page 40: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/40.jpg)
Multiple Regression Model: Example
Develop a model for estimating heating oil used for a single family home in the month of January based on average temperature and amount of insulation in inches.
| Oil (Gal) | Temp (°F) | Insulation |
|---|---|---|
| 275.30 | 40 | 3 |
| 363.80 | 27 | 3 |
| 164.30 | 40 | 10 |
| 40.80 | 73 | 6 |
| 94.30 | 64 | 6 |
| 230.90 | 34 | 6 |
| 366.70 | 9 | 6 |
| 300.60 | 8 | 10 |
| 237.80 | 23 | 10 |
| 121.40 | 63 | 3 |
| 31.40 | 65 | 10 |
| 203.50 | 41 | 6 |
| 441.10 | 21 | 3 |
| 323.00 | 38 | 3 |
| 52.50 | 58 | 10 |
![Page 41: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/41.jpg)
Sample Multiple Regression Equation: Example
$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki}$
$\hat{Y}_i = 562.151 - 5.437 X_{1i} - 20.012 X_{2i}$
Excel Output
| | Coefficients |
|---|---|
| Intercept | 562.1510092 |
| X Variable 1 | -5.436580588 |
| X Variable 2 | -20.01232067 |
For each degree increase in temperature, the estimated average amount of heating oil used is decreased by 5.437 gallons, holding insulation constant.
For each increase in one inch of insulation, the estimated average use of heating oil is decreased by 20.012 gallons, holding temperature constant.
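The slides leave the multiple regression arithmetic to Excel. As an illustration of what the software does, the sketch below solves the normal equations $(X'X)b = X'Y$ for the heating oil data in plain Python. The helper `solve_linear_system` and the other names are illustrative, not from the slides, and a numerical library would normally be used instead.

```python
# Multiple regression of heating oil on temperature and insulation,
# fitted by solving the normal equations (X'X) b = X'Y.
oil = [275.30, 363.80, 164.30, 40.80, 94.30, 230.90, 366.70, 300.60,
       237.80, 121.40, 31.40, 203.50, 441.10, 323.00, 52.50]
temp = [40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58]
insulation = [3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10]


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Design matrix rows: a leading 1 for the intercept, then X1 and X2.
rows = [[1.0, t, ins] for t, ins in zip(temp, insulation)]
k = len(rows[0])
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * yi for r, yi in zip(rows, oil)) for i in range(k)]

b0, b1, b2 = solve_linear_system(xtx, xty)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

The result should agree with the Excel coefficients for the Intercept, X Variable 1 (temperature) and X Variable 2 (insulation) up to rounding.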
![Page 42: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/42.jpg)
Confidence Interval Estimate for the Slope
Provide the 95% confidence interval for the population slope $\beta_1$ (the effect of temperature on oil consumption).
$b_1 \pm t_{n-p-1} S_{b_1}$
| | Coefficients | Lower 95% | Upper 95% |
|---|---|---|---|
| Intercept | 562.151009 | 516.1930837 | 608.108935 |
| X Variable 1 | -5.4365806 | -6.169132673 | -4.7040285 |
| X Variable 2 | -20.012321 | -25.11620102 | -14.90844 |
$-6.169 \le \beta_1 \le -4.704$
The estimated average consumption of oil is reduced by between 4.70 and 6.17 gallons for each increase of 1 °F.
![Page 43: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/43.jpg)
Coefficient of Multiple Determination
• Proportion of total variation in Y explained by all X variables taken together
– $r^2_{Y \cdot 12 \cdots k} = \dfrac{SSR}{SST} = \dfrac{\text{Explained Variation}}{\text{Total Variation}}$
• Never decreases when a new X variable is added to model
– Disadvantage when comparing models
![Page 44: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/44.jpg)
Adjusted Coefficient of Multiple Determination
• Proportion of variation in Y explained by all X variables adjusted for the number of X variables used
– $r^2_{adj} = 1 - \left[ \left(1 - r^2_{Y \cdot 12 \cdots k}\right) \dfrac{n - 1}{n - k - 1} \right]$
– Penalize excessive use of independent variables
– Smaller than $r^2_{Y \cdot 12 \cdots k}$
– Useful in comparing among models
![Page 45: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/45.jpg)
Coefficient of Multiple Determination
Excel Output
Regression Statistics
| | |
|---|---|
| Multiple R | 0.982654757 |
| R Square | 0.965610371 |
| Adjusted R Square | 0.959878766 |
| Standard Error | 26.01378323 |
| Observations | 15 |
R Square: $r^2_{Y \cdot 12} = \dfrac{SSR}{SST}$
Adjusted $r^2$: reflects the number of explanatory variables and sample size; is smaller than $r^2$
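The adjusted value in the Excel output can be recovered from R Square, the number of observations (n = 15) and the number of explanatory variables (k = 2) using the adjustment formula. A small sketch, with an illustrative function name:

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjust r^2 for n observations and k explanatory variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Heating oil example: R Square from the Excel output, n = 15, k = 2.
r2_adj = adjusted_r_squared(0.965610371, n=15, k=2)
print(f"adjusted r^2 = {r2_adj:.6f}")
```

Because $(n-1)/(n-k-1) > 1$ whenever k ≥ 1, the adjusted value is below $r^2$ unless $r^2 = 1$.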
![Page 46: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/46.jpg)
Interpretation of Coefficient of Multiple Determination
• $r^2_{Y \cdot 12} = \dfrac{SSR}{SST} = .9656$
– 96.56% of the total variation in heating oil can be explained by different temperature and amount of insulation
• $r^2_{adj} = .9599$
– 95.99% of the total fluctuation in heating oil can be explained by different temperature and amount of insulation after adjusting for the number of explanatory variables and sample size
![Page 47: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/47.jpg)
Using The Model to Make Predictions
Predict the amount of heating oil used for a home if the average temperature is 30 °F and the insulation is six inches.
$\hat{Y}_i = 562.151 - 5.437 X_{1i} - 20.012 X_{2i}$
$= 562.151 - 5.437(30) - 20.012(6)$
$= 278.969$
The predicted heating oil used is 278.97 gallons
![Page 48: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/48.jpg)
Testing for Overall Significance
• Shows if there is a linear relationship between all of the X variables together and Y
• Use F test statistic
• Hypotheses:
– $H_0$: $\beta_1 = \beta_2 = \cdots = \beta_k = 0$ (no linear relationship)
– $H_1$: at least one $\beta_i \neq 0$ (at least one independent variable affects Y)
• The null hypothesis is a very strong statement
• Almost always reject the null hypothesis
![Page 49: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/49.jpg)
Test for Significance:Individual Variables
• Shows if there is a linear relationship between the variable $X_i$ and Y
• Use t test statistic
• Hypotheses:
– $H_0$: $\beta_i = 0$ (no linear relationship)
– $H_1$: $\beta_i \neq 0$ (linear relationship between $X_i$ and Y)
![Page 50: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/50.jpg)
Residual Plots
• Residuals vs. $\hat{Y}$
– May need to transform variable
• Residuals vs. $X_1$, $X_2$
– May need to transform variable
• Residuals vs. time
– May have autocorrelation
![Page 51: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/51.jpg)
Residual Plots: Example
[Figure: Insulation Residual Plot (residuals against insulation): No Discernable Pattern.]
[Figure: Temperature Residual Plot (residuals against temperature): Maybe some non-linear relationship.]
![Page 52: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/52.jpg)
The Quadratic Regression Model
• Relationship between one response variable and two or more explanatory variables is a quadratic polynomial function
• Useful when scatter diagram indicates non-linear relationship
• Quadratic model:
– $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \varepsilon_i$
• The second explanatory variable is the square of the first variable
![Page 53: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/53.jpg)
Quadratic Regression Model
(continued)
Quadratic models may be considered when scatter diagram takes on the following shapes:
[Figure: four curved scatter-diagram shapes of Y against $X_1$, two with $\beta_2 > 0$ and two with $\beta_2 < 0$.]
$\beta_2$ = the coefficient of the quadratic term
![Page 54: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/54.jpg)
Dummy Variable Models
• Categorical explanatory variable (dummy variable) with two or more levels:
– Yes or no, on or off, male or female
– Coded as 0 or 1
• Only intercepts are different
• Assumes equal slopes across categories
• The number of dummy variables needed is (number of levels - 1)
• Regression model has same form:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$
![Page 55: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/55.jpg)
Dummy-Variable Models (with 2 Levels)
Given:
Y = Assessed Value of House
$X_1$ = Square footage of House
$X_2$ = Desirability of Neighborhood = 0 if undesirable, 1 if desirable
$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i}$
Desirable ($X_2 = 1$): $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2(1) = (b_0 + b_2) + b_1 X_{1i}$
Undesirable ($X_2 = 0$): $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2(0) = b_0 + b_1 X_{1i}$
Same slopes
![Page 56: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/56.jpg)
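Dummy-Variable Models (with 2 Levels)
(continued)
[Figure: two parallel lines of Y (Assessed Value) against $X_1$ (Square footage), both with slope $b_1$: the Desirable Location line has intercept $b_0 + b_2$ and the Undesirable line has intercept $b_0$. Same slopes, intercepts different.]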
![Page 57: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/57.jpg)
Interpretation of the Dummy Variable Coefficient (with 2 Levels)
Example:
$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} = 20 + 5 X_{1i} + 6 X_{2i}$
Y: Annual salary of college graduate in thousands of dollars
$X_1$: GPA
$X_2$: 0 Female, 1 Male
On average, male college graduates are making an estimated six thousand dollars more than female college graduates with the same GPA.
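The interpretation of the dummy coefficient can be checked by evaluating the fitted equation for a male and a female graduate with the same GPA. In this sketch the GPA value of 3.0 is a hypothetical input chosen for illustration, and the function name is not from the slides.

```python
def predicted_salary(gpa, male):
    """Estimated annual salary (thousands of dollars) from the fitted model
    Y-hat = 20 + 5 * GPA + 6 * X2, where X2 = 1 for male and 0 for female."""
    x2 = 1 if male else 0
    return 20 + 5 * gpa + 6 * x2


gpa = 3.0  # hypothetical GPA, the same for both graduates
salary_female = predicted_salary(gpa, male=False)
salary_male = predicted_salary(gpa, male=True)
gap = salary_male - salary_female  # equals b2 regardless of the GPA chosen
print(salary_female, salary_male, gap)
```

The gap equals the dummy coefficient $b_2 = 6$ for any GPA, which is what "same slopes, different intercepts" means in this model.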
![Page 58: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/58.jpg)
Dummy-Variable Models (with 3 Levels)
Given:
Y = Assessed Value of the House (thousands of dollars)
$X_1$ = Square Footage of the House
Style of the House = Split-level, Ranch, Condo (3 Levels; Need 2 Dummy Variables)
$X_2$ = 1 if Split-level, 0 if not
$X_3$ = 1 if Ranch, 0 if not
$\hat{Y}_i = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3$
![Page 59: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/59.jpg)
Interpretation of the Dummy Variable Coefficients (with 3 Levels)
Given the Estimated Model:
$$\hat{Y}_i = 20.43 + 0.045 X_{1i} + 18.84 X_{2i} + 23.53 X_{3i}$$
For Split-level ($X_{2i} = 1$): $\hat{Y}_i = 20.43 + 0.045 X_{1i} + 18.84$
For Ranch ($X_{3i} = 1$): $\hat{Y}_i = 20.43 + 0.045 X_{1i} + 23.53$
For Condo: $\hat{Y}_i = 20.43 + 0.045 X_{1i}$
With the same footage, a Split-level will have an estimated average assessed value of 18.84 thousand dollars more than a Condo.
With the same footage, a Ranch will have an estimated average assessed value of 23.53 thousand dollars more than a Condo.
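The three fitted lines above can be reproduced with a few lines of Python. This is only a sketch: the coefficients are the ones on the slide, while the function names and the 2,000 sq ft example input are hypothetical choices made for illustration.

```python
# Coefficients copied from the estimated model on the slide
# (assessed value in thousand dollars).
B0, B_SQFT, B_SPLIT, B_RANCH = 20.43, 0.045, 18.84, 23.53


def encode_style(style):
    """Return the dummies (X2, X3); Condo is the baseline level (0, 0)."""
    codes = {"split-level": (1, 0), "ranch": (0, 1), "condo": (0, 0)}
    return codes[style.lower()]


def predict_assessed_value(sqft, style):
    """Estimated assessed value for a given square footage and house style."""
    x2, x3 = encode_style(style)
    return B0 + B_SQFT * sqft + B_SPLIT * x2 + B_RANCH * x3


# Hypothetical input: a 2,000 sq ft house in each style.
for style in ("Split-level", "Ranch", "Condo"):
    print(style, round(predict_assessed_value(2000, style), 2))
```

Because the slopes are identical, the gap between any two styles is the same at every square footage; it is just the difference of the dummy coefficients.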
![Page 60: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/60.jpg)
Dummy Variables
• Predict Weekly Sales in a Grocery Store
• Possible independent variables:
– Price
– Grocery Chain
• Data Set:
– Grocery.xls
• Interaction Effect?
![Page 61: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/61.jpg)
Interaction Regression Model
• Hypothesizes interaction between pairs of X variables
– Response to one X variable varies at different levels of another X variable
• Contains two-way cross product terms
– $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i$
• Can be combined with other models
– E.g., dummy-variable model
![Page 62: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/62.jpg)
Effect of Interaction
• Given:
– $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i$
• Without interaction term, effect of X1 on Y is measured by $\beta_1$
• With interaction term, effect of X1 on Y is measured by $\beta_1 + \beta_3 X_2$
• Effect changes as X2 increases
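One way to see where the expression for the effect of X1 comes from is to differentiate the conditional mean of the interaction model with respect to X1. The short derivation below is a sketch that assumes the error term has mean zero.

```latex
E[Y_i \mid X_{1i}, X_{2i}] = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i}

\frac{\partial E[Y_i \mid X_{1i}, X_{2i}]}{\partial X_{1i}} = \beta_1 + \beta_3 X_{2i}
```

Setting $\beta_3 = 0$ removes the interaction term and the effect of X1 reduces to the constant $\beta_1$.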
![Page 63: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/63.jpg)
Interaction Example
Y = 1 + 2X1 + 3X2 + 4X1X2
• X2 = 1: Y = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1
• X2 = 0: Y = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1
Effect (slope) of X1 on Y does depend on X2 value
[Figure: Y (0 to 12) vs. X1 (0 to 1.5), showing the two lines above.]
![Page 64: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/64.jpg)
Interaction Regression Model Worksheet
| Case, i | Yi | X1i | X2i | X1i X2i |
|---------|----|-----|-----|---------|
| 1       | 1  | 1   | 3   | 3       |
| 2       | 4  | 8   | 5   | 40      |
| 3       | 1  | 3   | 2   | 6       |
| 4       | 3  | 5   | 6   | 30      |
| :       | :  | :   | :   | :       |

Multiply X1 by X2 to get X1X2. Run regression with Y, X1, X2, X1X2.
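The worksheet step can be sketched in Python with NumPy. The first part rebuilds the product column for the four worksheet rows. The second part is an assumption-laden illustration: it generates noise-free synthetic data from the earlier example Y = 1 + 2X1 + 3X2 + 4X1X2 and checks that least squares recovers those coefficients. The helper names are hypothetical.

```python
import numpy as np


def add_interaction(x1, x2):
    """Design matrix with columns [1, X1, X2, X1*X2]."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])


def fit_ols(design, y):
    """Ordinary least squares coefficients for the given design matrix."""
    coef, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return coef


# Worksheet rows from the slide: the last column is X1 * X2.
products = add_interaction([1, 8, 3, 5], [3, 5, 2, 6])[:, 3]

# Synthetic, noise-free data (an assumption, not the slide's data set).
rng = np.random.default_rng(0)
sx1 = rng.uniform(0, 1.5, size=30)
sx2 = rng.integers(0, 2, size=30)          # X2 takes the values 0 or 1
sy = 1 + 2 * sx1 + 3 * sx2 + 4 * sx1 * sx2

b = fit_ols(add_interaction(sx1, sx2), sy)
slope_when_x2_is_0 = b[1]                  # beta1
slope_when_x2_is_1 = b[1] + b[3]           # beta1 + beta3
print(np.round(b, 3), slope_when_x2_is_0, slope_when_x2_is_1)
```

With real data the fit would not be exact, and the estimated coefficient on the product column is what one would inspect to judge whether an interaction is present.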
![Page 65: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/65.jpg)
Evaluating Presence of Interaction
• Hypothesize interaction between pairs of independent variables
• Contains 2-way product terms
– $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i$
![Page 66: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/66.jpg)
Using Transformations
• Requires data transformation
• Either or both independent and dependent variables may be transformed
• Can be based on theory, logic or scatter diagrams
![Page 67: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/67.jpg)
Inherently Linear Models
• Non-linear models that can be expressed in linear form
– Can be estimated by least squares in linear form
• Require data transformation
![Page 68: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/68.jpg)
Transformed Multiplicative Model (Log-Log)
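Original: $Y_i = \beta_0 X_{1i}^{\beta_1} X_{2i}^{\beta_2} \varepsilon_i$
Transformed: $\ln Y_i = \ln \beta_0 + \beta_1 \ln X_{1i} + \beta_2 \ln X_{2i} + \ln \varepsilon_i$
[Figure: two panels of Y vs. X1 showing the curve shapes for different ranges of $\beta_1$.]
Similarly for X2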
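A minimal sketch of the log-log estimation, assuming NumPy and strictly positive data: take logs of Y, X1 and X2, fit by least squares, and exponentiate the intercept to get back to the original scale. The synthetic values (2.0, 0.5, -1.5) and the function name are illustrative assumptions.

```python
import numpy as np


def fit_log_log(y, x1, x2):
    """OLS on ln Y = ln b0 + b1 ln X1 + b2 ln X2; returns (b0, b1, b2)."""
    design = np.column_stack([np.ones(len(y)), np.log(x1), np.log(x2)])
    coef, *_ = np.linalg.lstsq(design, np.log(y), rcond=None)
    return np.exp(coef[0]), coef[1], coef[2]


# Noise-free synthetic data from a multiplicative model (illustrative only).
rng = np.random.default_rng(1)
x1 = rng.uniform(1, 10, 50)
x2 = rng.uniform(1, 10, 50)
y = 2.0 * x1 ** 0.5 * x2 ** -1.5

b0, b1, b2 = fit_log_log(y, x1, x2)
print(round(b0, 4), round(b1, 4), round(b2, 4))
```

In the transformed model the slope coefficients are the exponents of the original model, so each one can be read as the approximate percentage change in Y for a one percent change in that X.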
![Page 69: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/69.jpg)
Square Root Transformation
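$$Y_i = \beta_0 + \beta_1 \sqrt{X_{1i}} + \beta_2 \sqrt{X_{2i}} + \varepsilon_i$$
[Figure: Y vs. X1, with curves for $\beta_1 > 0$ and $\beta_1 < 0$.]
Similarly for X2
Transforms one of the above models into one that appears linear. Often used to overcome heteroscedasticity.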
![Page 70: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/70.jpg)
Linear-Logarithmic Transformation
$$Y_i = \beta_0 + \beta_1 \ln(X_{1i}) + \beta_2 \ln(X_{2i}) + \varepsilon_i$$
[Figure: Y vs. X1, with curves for $\beta_1 > 0$ and $\beta_1 < 0$.]
Similarly for X2
Transformed from an original multiplicative model
![Page 71: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/71.jpg)
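Exponential Transformation (Log-Linear)
Original Model: $Y_i = e^{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}} \varepsilon_i$
Transformed Into: $\ln Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ln \varepsilon_i$
[Figure: Y vs. X1, with curves for $\beta_1 > 0$ and $\beta_1 < 0$.]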
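The log-linear fit can be sketched the same way: only Y is logged, and the X variables enter untransformed. The data below are synthetic and noise-free, and the coefficient values (0.5, 0.3, -0.2) and function name are assumptions chosen for illustration.

```python
import numpy as np


def fit_log_linear(y, x1, x2):
    """OLS on ln Y = b0 + b1 X1 + b2 X2; returns the array [b0, b1, b2]."""
    design = np.column_stack([np.ones(len(y)), x1, x2])
    coef, *_ = np.linalg.lstsq(design, np.log(y), rcond=None)
    return coef


# Noise-free synthetic data from an exponential model (illustrative only).
rng = np.random.default_rng(2)
x1 = rng.uniform(0, 2, 40)
x2 = rng.uniform(0, 2, 40)
y = np.exp(0.5 + 0.3 * x1 - 0.2 * x2)

b0, b1, b2 = fit_log_linear(y, x1, x2)
# A one-unit increase in X1 multiplies Y by exp(b1), holding X2 fixed.
growth_factor = np.exp(b1)
print(np.round([b0, b1, b2], 4), round(growth_factor, 4))
```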
![Page 72: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/72.jpg)
Model Building / Model Selection
• Find “the best” set of explanatory variables among all the ones given.
• “Best subset” regression (only linear models)
– Requires a lot of computation ($2^N$ regressions for N candidate variables)
• “Stepwise regression”
• “Common Sense” methodology
– Run regression with all variables
– Throw out variables not statistically significant
– “Adjust” model by including some excluded variables, one at a time
• Tradeoff: Parsimony vs. Fit
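To make the computational cost of best-subset regression concrete, the sketch below enumerates every non-empty subset of the candidate variables and scores each fit. Ranking by adjusted R² is an assumption made here as one common way to balance parsimony against fit; the slides do not name a criterion. The data are synthetic and the function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def adjusted_r2(X, y):
    """Adjusted R^2 of an OLS fit of y on the columns of X plus an intercept."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def best_subset(X, y):
    """Fit every non-empty subset of columns; return (best, number of fits)."""
    n_vars = X.shape[1]
    results = []
    for k in range(1, n_vars + 1):
        for cols in combinations(range(n_vars), k):
            results.append((adjusted_r2(X[:, list(cols)], y), cols))
    return max(results), len(results)


# Synthetic data: Y depends on columns 0 and 2 only (illustrative assumption).
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 3))
y = 1 + 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(scale=0.5, size=60)

best, n_fits = best_subset(X, y)
print("regressions run:", n_fits, "best columns:", best[1])
```

With N = 3 candidates the loop runs 2³ − 1 = 7 regressions (8 counting the intercept-only model); the count doubles with each added candidate, which is why the slide flags the computation.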
![Page 73: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/73.jpg)
Association ≠ Causation !
![Page 74: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/74.jpg)
Regression Limitations
• R² measures the association between independent and dependent variables
– Association ≠ Causation!
• Be careful about doing predictions that involve extrapolation
• Inclusion / Exclusion of independent variables is subject to a type I / type II error
![Page 75: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/75.jpg)
Multi-collinearity
• What?
– When one independent variable is highly correlated (“collinear”) with one or more other independent variables
– Examples:
• square feet and square meters as independent variables to predict house price (1 sq ft is roughly 0.09 sq meters)
• “total rooms” and bedrooms plus bathrooms for a house
• How to detect?
– Run a regression with the “not-so-independent” independent variable (in the examples above: square feet and total rooms) as a function of all other remaining independent variables, e.g.:
• X1 = β0 + β2 X2 + … + βk Xk
– If R² of the above regression is > 0.8, then one suspects multi-collinearity to be present
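The detection rule above can be sketched as an auxiliary regression for each independent variable. The example is an illustration under stated assumptions: the house data are synthetic, square meters are generated from square feet with a little noise, and the function name is hypothetical. The 0.8 cutoff is the one given on the slide.

```python
import numpy as np


def auxiliary_r2(X, j):
    """R^2 from regressing column j of X on all remaining columns."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return 1 - (resid @ resid) / ((target - target.mean()) ** 2).sum()


# Synthetic predictors (assumed for illustration, not real house data).
rng = np.random.default_rng(4)
sq_ft = rng.uniform(800, 3500, 100)
sq_m = sq_ft * 0.0929 + rng.normal(0, 0.5, 100)   # nearly the same measurement
bedrooms = rng.integers(1, 6, 100).astype(float)  # generated independently

X = np.column_stack([sq_ft, sq_m, bedrooms])
r2 = [auxiliary_r2(X, j) for j in range(X.shape[1])]
flagged = [j for j, value in enumerate(r2) if value > 0.8]
print(np.round(r2, 3), "flagged columns:", flagged)
```

The same auxiliary R² is the ingredient of the variance inflation factor, 1 / (1 − R²), so the 0.8 cutoff corresponds to a factor of 5.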
![Page 76: Purpose of Regression Analysis Regression analysis is used primarily to model causality and provide prediction –Predicts the value of a dependent (response)](https://reader033.vdocuments.us/reader033/viewer/2022051416/56649e125503460f94afe1e4/html5/thumbnails/76.jpg)
Multi-collinearity (continued)
• What effect?
– Coefficient estimates are unreliable
– Can still be used for predicting values for Y
– If possible, delete the “not-so-independent” independent variable
• When to check?
– When one suspects that two variables measure the same thing, or when the two variables are highly correlated
– When one suspects that one independent variable is a (linear) function of the other independent variables