statistics and data analysis
Post on 22-Feb-2016
Part 18: Regression Modeling
Statistics and Data Analysis
Professor William Greene
Stern School of Business
IOMS Department
Department of Economics
Linear Regression Models
Least squares results
  Regression model
  Sample statistics
  Estimates of population parameters
How good is the model?
  In the abstract
  Statistical measures of model fit
Assessing the validity of the relationship
Regression Model
Regression relationship
yi = α + β xi + εi
Random εi implies random yi
Observed random yi has two unobserved components:
Explained: α + β xi
Unexplained: εi
The random component εi has zero mean, standard deviation σ, and a normal distribution.
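The model can be made concrete with a small simulation. This is only a sketch: the values of ALPHA, BETA and SIGMA and the range of x are hypothetical choices for illustration and do not come from the slides.

```python
import random
import statistics

# Hypothetical parameter values, chosen only for illustration.
ALPHA, BETA, SIGMA = 10.0, 2.0, 3.0

def simulate(n, seed=0):
    """Draw n observations from y_i = ALPHA + BETA * x_i + eps_i."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 30) for _ in range(n)]
    # eps_i is normal with mean zero and standard deviation SIGMA.
    eps = [rng.gauss(0.0, SIGMA) for _ in range(n)]
    # Each y_i is the explained part plus the unexplained part.
    ys = [ALPHA + BETA * x + e for x, e in zip(xs, eps)]
    return xs, ys, eps

xs, ys, eps = simulate(5000)
print("mean of eps:", round(statistics.mean(eps), 3))
print("std dev of eps:", round(statistics.stdev(eps), 3))
```

In real data only xi and yi are observed; the simulation keeps εi visible so the two components can be inspected separately.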
Linear Regression: Model Assumption
Least Squares Results
Using the Regression Model
Prediction: Use xi as information to predict yi.
The natural predictor is the mean, $\bar{y}$.
xi provides more information.
With xi, the predictor is $\hat{y}_i = a + b x_i$.
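To see why xi helps, the two predictors can be compared on a small data set. The sketch below uses the standard least squares formulas for a and b; the years and salary figures are made up for illustration and are not the data behind the slides.

```python
def least_squares(xs, ys):
    """Return intercept a and slope b of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Made-up data for illustration: years of experience and salary (in $1000s).
years = [1, 3, 5, 8, 12, 20]
salary = [30, 36, 41, 52, 60, 85]

a, b = least_squares(years, salary)
y_bar = sum(salary) / len(salary)

# Sum of squared prediction errors for each predictor.
sse_mean = sum((y - y_bar) ** 2 for y in salary)
sse_line = sum((y - (a + b * x)) ** 2 for x, y in zip(years, salary))
print(f"a = {a:.2f}, b = {b:.2f}")
print(f"SSE using the mean: {sse_mean:.1f}; SSE using a + b x: {sse_line:.1f}")
```

The regression predictor can never have a larger sum of squared errors than the mean, because the mean is the special case b = 0.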
Regression Fits
[Figure: Scatterplot of SALARY vs YEARS. Regression of salary vs. years of experience.]
[Figure: Scatterplot of FUELBILL vs ROOMS. Regression of fuel bill vs. number of rooms for a sample of homes.]
Regression Arithmetic
$y_i = \hat{y}_i + e_i$ = prediction + error
$y_i - \bar{y} = (\hat{y}_i - \bar{y}) + e_i$
A few algebra steps later...
$\sum_{i=1}^{N} (y_i - \bar{y})^2 = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} e_i^2$
TOTAL = Regression (LARGE??) + Residual (small??)
This is the analysis of (the) variance (of y); ANOVA
Analysis of Variance
$\sum_{i=1}^{N} (y_i - \bar{y})^2 = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} e_i^2$
TOTAL = Regression + Residual
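The decomposition can be checked numerically. The following is a minimal sketch with made-up room counts and fuel bills (not the sample shown in the scatterplot); the function name anova is an illustrative choice.

```python
def anova(xs, ys):
    """Return (total SS, regression SS, residual SS) for a least squares fit."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    fitted = [a + b * x for x in xs]
    total = sum((y - y_bar) ** 2 for y in ys)
    regression = sum((f - y_bar) ** 2 for f in fitted)
    residual = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return total, regression, residual

# Made-up data for illustration: number of rooms and fuel bill.
rooms = [3, 4, 5, 6, 7, 8, 9]
bill = [210, 340, 400, 590, 640, 850, 930]

total_ss, regression_ss, residual_ss = anova(rooms, bill)
print(f"TOTAL = {total_ss:.1f}")
print(f"Regression + Residual = {regression_ss:.1f} + {residual_ss:.1f}")
```

The identity holds only when a and b are the least squares values; for any other line a cross-product term would remain.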
Fit of the Model to the Data
The original question about the model fit to the data:
$\sum_{i=1}^{N} (y_i - \bar{y})^2 = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} e_i^2$
TOTAL = Regression (LARGE??) + Residual (small??)
$\frac{\text{TOTAL SS}}{\text{TOTAL SS}} = \frac{\text{Regression SS}}{\text{TOTAL SS}} + \frac{\text{Residual SS}}{\text{TOTAL SS}}$
1 = Proportion Explained + Proportion Unexplained
Explained Variation
The proportion of variation “explained” by the regression is called R-squared (R²).
It is also called the Coefficient of Determination.
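As a worked example, R² can be computed from the sums of squares reported in the Internet Buzz analysis of variance table that appears later in this deck (Regression 7913.6, Residual Error 10751.5, Total 18665.1). The variable names are illustrative.

```python
# Sums of squares as printed in the Internet Buzz ANOVA table.
regression_ss = 7913.6
residual_ss = 10751.5
total_ss = 18665.1

r_squared = regression_ss / total_ss   # proportion explained
unexplained = residual_ss / total_ss   # proportion unexplained

print(f"R-squared = {r_squared:.3f}")
print(f"Proportion unexplained = {unexplained:.3f}")
```

The result agrees with the R-Sq = 42.4% shown in that output.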
Movie Madness
[Figure annotations: Fit, R²]
Regression Fits
[Figure: Regression of Foreign Box Office on Domestic. Overseas = 6.693 + 1.051 Domestic; S = 73.0041, R-Sq = 52.2%, R-Sq(adj) = 52.1%. R² = 0.522]
[Figure: Fitted Line Plot. G = 1.928 + 0.000179 Income; S = 0.370241, R-Sq = 88.0%, R-Sq(adj) = 87.8%. R² = 0.880]
[Figure: R² = 0.424]
[Figure: Scatterplot of Cost vs Output. R² = 0.924]
R² = 0.338
R² is still positive even if the correlation is negative.
R Squared Benchmarks
Aggregate time series: expect .9+
Cross sections: .5 is good. Sometimes we do much better.
Large survey data sets: .2 is not bad.
[Figure: Scatterplot of Cost vs Output. R² = 0.924 in this cross section.]
Correlation Coefficient
$r_{xy} = \text{Correlation}(x,y) = \frac{\text{Sample Cov}[x,y]}{[\text{Sample standard deviation}(x)]\,[\text{Sample standard deviation}(y)]}$
$= \frac{\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})^2}\;\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(y_i-\bar{y})^2}}$
$-1 \le r_{xy} \le 1$
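The formula translates directly into code. This sketch follows the definition term by term (sample covariance over the product of sample standard deviations); the short data lists are made up to show the two extreme cases and one intermediate case.

```python
import math

def correlation(xs, ys):
    """Sample correlation: Cov[x, y] / (sd(x) * sd(y)), each with N - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)

x = [1, 2, 3, 4, 5]
r_up = correlation(x, [2 * v + 1 for v in x])     # points on a rising line
r_down = correlation(x, [10 - 3 * v for v in x])  # points on a falling line
r_mixed = correlation(x, [2, 1, 4, 3, 5])         # scattered points
print(r_up, r_down, r_mixed)
```

The N - 1 divisors cancel between numerator and denominator, so the same value results from using the raw sums.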
Correlations
[Figure: Scatterplot of Overseas vs Domestic. $r_{xy}$ = 0.723]
[Figure: Fitted Line Plot. GDPC = 19826 - 34508 GINI; S = 6574.43, R-Sq = 16.2%, R-Sq(adj) = 15.8%. $r_{xy}$ = -0.402]
[Figure: Scatterplot of C5 vs C6. $r_{xy}$ = +1.000]
R-Squared is $r_{xy}^2$
R-squared is the square of the correlation between yi and the predicted yi, which is a + b xi.
The correlation between yi and (a + b xi) is the same, apart from sign, as the correlation between yi and xi.
Therefore, a regression with a high R² predicts yi well.
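The claim can be verified numerically. In this sketch the data are made up, and R² is computed as Regression SS over Total SS so that it can be compared with the squared correlation.

```python
import math

def corr(us, vs):
    """Sample correlation between two equal-length lists."""
    n = len(us)
    u_bar, v_bar = sum(us) / n, sum(vs) / n
    suv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    suu = sum((u - u_bar) ** 2 for u in us)
    svv = sum((v - v_bar) ** 2 for v in vs)
    return suv / math.sqrt(suu * svv)

# Made-up data for illustration.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.5, 6.9]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar
fitted = [a + b * xi for xi in x]

# R-squared as Regression SS / Total SS.
r_squared = (sum((f - y_bar) ** 2 for f in fitted)
             / sum((yi - y_bar) ** 2 for yi in y))
r_xy = corr(x, y)
r_y_fitted = corr(y, fitted)
print(r_squared, r_xy ** 2, r_y_fitted ** 2)
```

With a positive slope, as here, the correlation of y with the fitted values equals the correlation of y with x; with a negative slope the two differ only in sign.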
Adjusted R-Squared
We will discover when we study regression with more than one variable that a researcher can increase R² just by adding variables to a model, even if those variables do not really explain y or have any real relationship at all.
To have a fit measure that accounts for this, “Adjusted R²” is a number that increases with the correlation, but decreases with the number of variables.
$\bar{R}^2 = 1 - (1 - R^2)\,\frac{N-1}{N-K-1}$
K is the number of "x" variables in the equation.
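The formula can be checked against the Internet Buzz output shown later in the deck, where Total DF = 61 implies N = 62 and there is K = 1 explanatory variable. The function name and the second set of inputs (a weak fit with several variables) are illustrative assumptions.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R-squared: 1 - (1 - R^2) * (N - 1) / (N - K - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Internet Buzz regression: R^2 = Regression SS / Total SS, N = 62, K = 1.
r2_buzz = 7913.6 / 18665.1
adj_buzz = adjusted_r_squared(r2_buzz, n=62, k=1)
print(f"Adjusted R-squared = {adj_buzz:.3f}")

# Hypothetical weak fit with three variables: the adjusted value is negative.
adj_weak = adjusted_r_squared(0.01, n=20, k=3)
print(f"Adjusted R-squared (weak fit) = {adj_weak:.3f}")
```

The first value matches the R-Sq(adj) = 41.4% printed in the Buzz output.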
Movie Madness Fit
[Figure annotation: $\bar{R}^2$]
Notes About Adjusted R²
(1) Adjusted $R^2$ is denoted $\bar{R}^2$. $\bar{R}^2$ is less than $R^2$.
(2) $\bar{R}^2$ is not the square of $\bar{R}$. It is not the square of anything. Adjusted R squared is just a name, not a formula.
(3) Adjusting $R^2$ makes no sense when there is only one variable in the model. You should pay no attention to $\bar{R}^2$ when K = 1.
(4) $\bar{R}^2$ can be less than zero! See point (2).
Is R² Large?
Is there really a relationship between x and y?
We cannot be 100% certain.
We can be “statistically certain” (within limits) by examining R².
F is used for this purpose.
The F Ratio
The original question about the model fit to the data:
$\sum_{i=1}^{N} (y_i - \bar{y})^2 = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} e_i^2$
TOTAL = Regression (LARGE??) + Residual (small??)
We would like the Regression SS to be large and the Residual SS to be small.
$F = \frac{(N-2)\,\text{Regression SS}}{\text{Residual SS}} = \frac{(N-2)\sum_{i=1}^{N}(\hat{y}_i-\bar{y})^2}{\sum_{i=1}^{N} e_i^2} = \frac{(N-2)R^2}{1-R^2}$
Is R2 Large? Since
F = (N-2)R2/(1 – R2), if R2 is “large,” then F will be large.
For a model with one explanatory variable in it, the standard benchmark value for a ‘large’ F is 4.
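A quick calculation shows both forms of F giving the same number. The inputs are the sums of squares from the Internet Buzz output later in the deck, with N = 62 inferred from Total DF = 61; the function name is illustrative.

```python
def f_ratio(r_squared, n):
    """F = (N - 2) * R^2 / (1 - R^2) for a one-variable regression."""
    return (n - 2) * r_squared / (1 - r_squared)

# Internet Buzz regression values.
n = 62
regression_ss, residual_ss, total_ss = 7913.6, 10751.5, 18665.1

f_from_r2 = f_ratio(regression_ss / total_ss, n)
f_from_ss = (n - 2) * regression_ss / residual_ss

print(f"F from R-squared: {f_from_r2:.2f}")
print(f"F from sums of squares: {f_from_ss:.2f}")
print("Larger than the benchmark of 4:", f_from_r2 > 4)
```

Both agree with the F of 44.16 reported in that analysis of variance table.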
Movie Madness Fit
[Figure annotations: R², F]
Why Use F and not R²?
When is R² “large”? We have no benchmarks to decide.
How large is “large”? We have a table for F statistics to determine when F is statistically large: yes or no.
F Table
The “critical value” depends on the number of observations. If F is larger than the appropriate value in the table, conclude that there is a “statistically significant” relationship.
There is a huge F table on pages 732-742 of your text. Analysts now use computer programs, not tables like this, to find the critical values of F for their model/data.
n2 is N-2
Internet Buzz Regression
Regression Analysis: BoxOffice versus Buzz

The regression equation is
BoxOffice = - 14.4 + 72.7 Buzz

Predictor     Coef  SE Coef      T      P
Constant   -14.360    5.546  -2.59  0.012
Buzz         72.72    10.94   6.65  0.000

S = 13.3863   R-Sq = 42.4%   R-Sq(adj) = 41.4%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       1   7913.6  7913.6  44.16  0.000
Residual Error  60  10751.5   179.2
Total           61  18665.1

n2 is N-2
$135 Million
http://www.nytimes.com/2006/06/19/arts/design/19klim.html?ex=1308369600&en=37eb32381038a749&ei=5088&partner=rssnyt&emc=rss
Klimt, to Ronald Lauder
$100 Million … sort of
Stephen Wynn with a Prized Possession, 2007
An Enduring Art Mystery
Why do larger paintings command higher prices?
The Persistence of Memory. Salvador Dali, 1931
The Persistence of Statistics. Hildebrand, Ott and Gray, 2005
Graphics show relative sizes of the two works.
Previously sold for exp(16.6374) = $16.8M
Monet in Large and Small
[Figure: Fitted Line Plot of ln (US$) vs ln (SurfaceArea). ln (US$) = 2.825 + 1.725 ln (SurfaceArea); S = 1.00645, R-Sq = 20.0%, R-Sq(adj) = 19.8%]
Log of $price = a + b log surface area + e
Sale prices of 328 signed Monet paintings
The residuals do not show any obvious patterns that seem inconsistent with the assumptions of the model.
The Data
[Figure: Histogram of ln (SurfaceArea)]
[Figure: Histogram of ln (US$)]
Note on using logs in this context: this is common when analyzing financial measurements (e.g., price) and when percentage changes are more interesting than unit changes (e.g., what is the % premium when the painting is 10% larger?).
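The 10% question can be answered from the fitted line ln (US$) = 2.825 + 1.725 ln (SurfaceArea). The sketch below is an illustration only: the function names and the areas of 1000 and 1100 are hypothetical, and the units of area are whatever the original data used.

```python
import math

# Coefficients from the fitted line plot for the Monet data.
a, b = 2.825, 1.725

def predicted_price(area):
    """Price implied by the fitted log-log line for a given surface area."""
    return math.exp(a + b * math.log(area))

def premium(pct_larger, slope=b):
    """Proportional price increase when area grows by pct_larger."""
    return (1 + pct_larger) ** slope - 1

p10 = premium(0.10)
print(f"Premium for a 10% larger painting: {p10:.1%}")

# The same ratio results from any starting area (1000 here is arbitrary).
ratio = predicted_price(1100) / predicted_price(1000)
print(f"Price ratio, area 1100 vs 1000: {ratio:.4f}")
```

In a log-log model the intercept cancels out of the ratio, so the percentage premium depends only on the slope.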
Monet Regression: There seems to be a regression. Is there a theory?
Conclusions about F
R² answers the question of how well the model fits the data.
F answers the question of whether there is a statistically valid fit (as opposed to no fit).
What remains is the question of whether there is a valid relationship, i.e., is β different from zero?
The Regression Slope
The model is yi = α + β xi + εi
The “relationship” depends on β. If β equals zero, there is no relationship.
The least squares slope, b, is the estimate of β based on the sample. It is a statistic based on a random sample. We cannot be sure it equals the true β.
To accommodate this view, we form a range of uncertainty around b, i.e., a confidence interval.
Uncertainty About the Regression Slope
Hypothetical Regression: Fuel Bill vs. Number of Rooms

The regression equation is
Fuel Bill = -252 + 136 Number of Rooms

Predictor    Coef  SE Coef      T      P
Constant   -251.9    44.88  -5.20  0.000
Rooms       136.2     7.09   19.9  0.000

S = 144.456   R-Sq = 72.2%   R-Sq(adj) = 72.0%

The Coef for Rooms, 136.2, is b, the estimate of β.
The SE Coef for Rooms, 7.09, is the “Standard Error” (SE), the measure of uncertainty about the true value.
The “range of uncertainty” is b ± 2 SE(b). (Actually 1.96, but people use 2.)
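Applying the rule to the hypothetical fuel bill output (b = 136.2, SE = 7.09) gives a concrete interval. The function name and its multiplier parameter are illustrative choices.

```python
def range_of_uncertainty(b, se, multiplier=1.96):
    """Return (low, high) for b plus or minus multiplier * SE(b)."""
    half_width = multiplier * se
    return b - half_width, b + half_width

# Slope and standard error for Rooms in the hypothetical fuel bill output.
low, high = range_of_uncertainty(136.2, 7.09)
print(f"Using 1.96: [{low:.2f} to {high:.2f}]")

# The rule-of-thumb version with 2 in place of 1.96.
low2, high2 = range_of_uncertainty(136.2, 7.09, multiplier=2)
print(f"Using 2:    [{low2:.2f} to {high2:.2f}]")

print("Range includes zero:", low <= 0 <= high)
```

The two versions differ only slightly, which is why 2 is commonly used as shorthand.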
Internet Buzz Regression
Regression Analysis: BoxOffice versus Buzz

The regression equation is
BoxOffice = - 14.4 + 72.7 Buzz

Predictor     Coef  SE Coef      T      P
Constant   -14.360    5.546  -2.59  0.012
Buzz         72.72    10.94   6.65  0.000

S = 13.3863   R-Sq = 42.4%   R-Sq(adj) = 41.4%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       1   7913.6  7913.6  44.16  0.000
Residual Error  60  10751.5   179.2
Total           61  18665.1

Range of Uncertainty for b is 72.72 - 1.96(10.94) to 72.72 + 1.96(10.94) = [51.27 to 94.17]
Elasticity in the Monet Regression:
b = 1.7246. This is the elasticity of price with respect to area.
The confidence interval would be 1.7246 ± 1.96(.1908) = [1.3506 to 2.0986].
The fact that this does not include 1.0 is an important result: prices for Monet paintings are extremely elastic with respect to the area.
Conclusion about b
So, should we conclude the slope is not zero?
Does the range of uncertainty include zero?
No: then you should conclude the slope is not zero.
Yes: then you can’t be very sure that β is not zero.
Tying it together: if the range of uncertainty does not include 0.0, then
The ratio b/SE is larger than 2 (in absolute value).
The square of the ratio is larger than 4.
The square of the ratio is F.
F larger than 4 gave the same conclusion.
They are looking at the same thing.
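The link between the ratio b/SE and F can be checked with the Internet Buzz output (b = 72.72, SE = 10.94, reported F = 44.16). This is a sketch using the rounded values printed in that output, so the square of the ratio matches F only up to rounding.

```python
# Values as printed in the Internet Buzz regression output.
b, se = 72.72, 10.94
f_reported = 44.16

t = b / se          # the ratio b / SE (reported as T = 6.65)
t_squared = t ** 2  # should equal F apart from rounding in the printout

print(f"b/SE = {t:.2f}")
print(f"(b/SE)^2 = {t_squared:.2f}, reported F = {f_reported}")

# The two decision rules agree: |b/SE| > 2 exactly when (b/SE)^2 > 4.
print("Ratio larger than 2:", abs(t) > 2)
print("Square larger than 4:", t_squared > 4)
```

Because the printed b and SE are rounded, a small gap between the squared ratio and the reported F is expected.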
Summary
The regression model – theory
Least squares results, a, b, s, R²
The fit of the regression model to the data
ANOVA and R²
The F statistic and R²
Uncertainty about the regression slope