Statistics and Data Analysis

Part 18: Regression Modeling

Professor William Greene
Stern School of Business
IOMS Department
Department of Economics


Linear Regression Models

Least squares results
  Regression model
  Sample statistics
  Estimates of population parameters

How good is the model?
  In the abstract
  Statistical measures of model fit

Assessing the validity of the relationship


Regression Model

Regression relationship:

yi = α + β xi + εi

Random εi implies random yi.

Observed random yi has two unobserved components:

Explained: α + β xi
Unexplained: εi

The random component εi has zero mean, standard deviation σ, and a normal distribution.
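As a minimal sketch of what the model says, the snippet below generates data from yi = α + β xi + εi with normally distributed εi. The parameter values (alpha = 10, beta = 2, sigma = 3), the x values, and the function name simulate are illustrative assumptions, not values from the lecture.

```python
import random

def simulate(alpha, beta, sigma, xs, seed=0):
    """Draw y_i = alpha + beta * x_i + eps_i with eps_i ~ Normal(0, sigma)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    ys, eps = [], []
    for x in xs:
        e = rng.gauss(0.0, sigma)         # unexplained component
        eps.append(e)
        ys.append(alpha + beta * x + e)   # explained + unexplained
    return ys, eps

# Hypothetical values chosen only for illustration.
xs = [i % 20 for i in range(10000)]
ys, eps = simulate(alpha=10.0, beta=2.0, sigma=3.0, xs=xs)

mean_eps = sum(eps) / len(eps)
print(f"average of the simulated errors: {mean_eps:.3f}")
```

In a large sample the average of the simulated errors should be close to zero, which is what "zero mean" requires of εi.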


Linear Regression: Model Assumption


Least Squares Results


Using the Regression Model

Prediction: Use xi as information to predict yi.

The natural predictor is the mean, $\bar{y}$.

xi provides more information.

With xi, the predictor is $\hat{y}_i = a + b x_i$.
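The two predictors can be compared in a short sketch. The least squares formulas used here (b is the sum of cross products of deviations divided by the sum of squared x deviations, and a = ȳ - b x̄) are the standard ones; they are not spelled out in the extracted slide text. The data and the function name least_squares are hypothetical.

```python
def least_squares(xs, ys):
    """Return (a, b) for the fitted line y-hat = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: co-movement of x and y divided by the variation in x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar  # the fitted line passes through (x-bar, y-bar)
    return a, b

# Hypothetical data, not from the lecture.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
y_bar = sum(ys) / len(ys)   # predictor that ignores x
x_new = 6
y_hat = a + b * x_new       # predictor that uses x

print(f"predictor without x: {y_bar:.2f}")
print(f"predictor with x = {x_new}: {y_hat:.2f}")
```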


Regression Fits

[Figure: Scatterplot of SALARY vs YEARS. Regression of salary vs. years of experience.]

[Figure: Scatterplot of FUELBILL vs ROOMS. Regression of fuel bill vs. number of rooms for a sample of homes.]


Regression Arithmetic

$y_i = \hat{y}_i + e_i$ = prediction + error

$y_i - \bar{y} = (\hat{y}_i - \bar{y}) + e_i$

A few algebra steps later...

$\sum_{i=1}^{N} (y_i - \bar{y})^2 = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} e_i^2$

TOTAL = Regression (LARGE??) + Residual (small??)

This is the analysis of (the) variance (of y); ANOVA.


Analysis of Variance

$\sum_{i=1}^{N} (y_i - \bar{y})^2 = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} e_i^2$

TOTAL = Regression + Residual
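The decomposition can be checked numerically. The sketch below fits a least squares line to a small hypothetical data set (not from the lecture) and confirms that the total sum of squares equals the regression sum of squares plus the residual sum of squares. The identity relies on a and b being the least squares values.

```python
# Hypothetical data, not from the lecture.
xs = [2, 4, 6, 8]
ys = [3, 7, 8, 12]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares slope and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar

y_hat = [a + b * x for x in xs]                  # predictions
e = [y - f for y, f in zip(ys, y_hat)]           # residuals

total_ss = sum((y - y_bar) ** 2 for y in ys)
regression_ss = sum((f - y_bar) ** 2 for f in y_hat)
residual_ss = sum(r ** 2 for r in e)

print(f"TOTAL      = {total_ss:.4f}")
print(f"Regression = {regression_ss:.4f}")
print(f"Residual   = {residual_ss:.4f}")
```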


Fit of the Model to the Data

The original question about the model fit to the data:

$\sum_{i=1}^{N} (y_i - \bar{y})^2 = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} e_i^2$

TOTAL = Regression (LARGE??) + Residual (small??)

$\frac{\text{TOTAL SS}}{\text{TOTAL SS}} = \frac{\text{Regression SS}}{\text{TOTAL SS}} + \frac{\text{Residual SS}}{\text{TOTAL SS}}$

1 = Proportion Explained + Proportion Unexplained


Explained Variation

The proportion of variation “explained” by the regression is called R-squared (R2)

It is also called the Coefficient of Determination
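As a worked illustration, the sums of squares reported in the Analysis of Variance table of the Internet Buzz regression in this deck can be turned into proportions. The variable names are assumptions made for this sketch; the three numbers are copied from that table.

```python
# Sums of squares from the Internet Buzz ANOVA table in this deck.
regression_ss = 7913.6
residual_ss = 10751.5
total_ss = 18665.1

proportion_explained = regression_ss / total_ss    # this is R-squared
proportion_unexplained = residual_ss / total_ss

print(f"R-squared              = {proportion_explained:.3f}")
print(f"proportion unexplained = {proportion_unexplained:.3f}")
```

The explained proportion agrees with the R-Sq = 42.4% shown in that output, and the two proportions sum to 1.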


Movie Madness Fit

[Figure: Movie Madness regression results; callout marks R2.]


Regression Fits

[Figure: Regression of Foreign Box Office on Domestic. Overseas = 6.693 + 1.051 Domestic; S = 73.0041, R-Sq = 52.2%, R-Sq(adj) = 52.1%. Axes: Overseas vs Domestic. R2 = 0.522]

[Figure: Fitted Line Plot. G = 1.928 + 0.000179 Income; S = 0.370241, R-Sq = 88.0%, R-Sq(adj) = 87.8%. Axes: G vs Income. R2 = 0.880]

[Figure: R2 = 0.424]

[Figure: Scatterplot of Cost vs Output. R2 = 0.924]


R2 = 0.338

R2 is still positive even if the correlation is negative.


R Squared Benchmarks

Aggregate time series: expect .9+
Cross sections: .5 is good. Sometimes we do much better.
Large survey data sets: .2 is not bad.

[Figure: Scatterplot of Cost vs Output. R2 = 0.924 in this cross section.]


Correlation Coefficient

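$r_{xy} = \text{Correlation}(x,y) = \frac{\text{Sample Cov}[x,y]}{[\text{Sample standard deviation}(x)]\,[\text{Sample standard deviation}(y)]}$

$= \frac{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2}\;\sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y})^2}}$

$-1 \le r_{xy} \le 1$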
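The formula translates directly into code. This is a minimal sketch using only the standard library; the function name correlation and the data are hypothetical.

```python
import math

def correlation(xs, ys):
    """Sample correlation: covariance over the product of standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)

# Hypothetical data, not from the lecture.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r_flipped = correlation(xs, [-y for y in ys])        # same strength, opposite sign
r_exact = correlation(xs, [2 * x + 1 for x in xs])   # points exactly on a line

print(f"r = {r:.4f}, flipped = {r_flipped:.4f}, exact line = {r_exact:.4f}")
```

Flipping the sign of y flips the sign of r without changing its size, and points lying exactly on an upward sloping line give r = 1, the upper bound.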


Correlations

[Figure: Scatterplot of Overseas vs Domestic. rxy = 0.723]

[Figure: Fitted Line Plot. GDPC = 19826 - 34508 GINI; S = 6574.43, R-Sq = 16.2%, R-Sq(adj) = 15.8%. Axes: GDPC vs GINI. rxy = -.402]

[Figure: Scatterplot of C5 vs C6. rxy = +1.000]


R-Squared is $r_{xy}^2$

R-squared is the square of the correlation between yi and the predicted yi, which is a + bxi.

The correlation between yi and (a + bxi) is the same as the correlation between yi and xi (apart from its sign when b is negative).

Therefore, … A regression with a high R2 predicts yi well.
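The relationship can be checked against figures reported elsewhere in this deck: the Overseas vs Domestic plot reports rxy = 0.723 with R-Sq = 52.2%, and the GDPC vs GINI plot reports rxy = -.402 with R-Sq = 16.2%. The sketch below simply squares the correlations; the dictionary layout is an assumption made for illustration.

```python
# (correlation, reported R-squared) pairs copied from the plots in this deck.
reported = {
    "Overseas vs Domestic": (0.723, 0.522),
    "GDPC vs GINI": (-0.402, 0.162),
}

squared = {name: r ** 2 for name, (r, _) in reported.items()}

for name, (r, r2) in reported.items():
    print(f"{name}: r = {r:+.3f}, r squared = {squared[name]:.3f}, reported R2 = {r2:.3f}")
```

The negative correlation still yields a positive R2, since squaring removes the sign.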


Adjusted R-Squared

As we will discover when we study regression with more than one variable, a researcher can increase R2 just by adding variables to a model, even if those variables do not really explain y or have any real relationship at all.

To have a fit measure that accounts for this, “Adjusted R2” is a number that increases with the correlation, but decreases with the number of variables.

$\bar{R}^2 = 1 - (1 - R^2)\,\frac{N-1}{N-K-1}$

K is the number of "x" variables in the equation.
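A small sketch of the adjustment follows. The function name adjusted_r2 is hypothetical. The first call uses the Internet Buzz regression in this deck, where N = 62 is inferred from the 61 total degrees of freedom in its ANOVA table; the second call uses made-up values to show that the adjusted measure can fall below zero.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k explanatory variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Internet Buzz regression: R2 from its sums of squares, N inferred as 61 + 1.
buzz_r2 = 7913.6 / 18665.1
buzz_adj = adjusted_r2(buzz_r2, n=62, k=1)

# Hypothetical weak fit in a small sample.
weak_adj = adjusted_r2(0.01, n=20, k=1)

print(f"Buzz: R2 = {buzz_r2:.3f}, adjusted R2 = {buzz_adj:.3f}")
print(f"Weak fit: adjusted R2 = {weak_adj:.3f}")
```

The Buzz result agrees with the R-Sq(adj) = 41.4% reported in that output.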


Movie Madness Fit

[Figure: Movie Madness regression results; callout marks $\bar{R}^2$.]


Notes About Adjusted R2

(1) Adjusted $R^2$ is denoted $\bar{R}^2$. $\bar{R}^2$ is less than $R^2$.

(2) $\bar{R}^2$ is not the square of $\bar{R}$. It is not the square of anything. Adjusted R squared is just a name, not a formula.

(3) Adjusting $R^2$ makes no sense when there is only one variable in the model. You should pay no attention to $\bar{R}^2$ when K = 1.

(4) $\bar{R}^2$ can be less than zero! See point (2).


Is R2 Large?

Is there really a relationship between x and y? We cannot be 100% certain. We can be “statistically certain” (within limits) by examining R2. F is used for this purpose.


The F Ratio

The original question about the model fit to the data:

$\sum_{i=1}^{N} (y_i - \bar{y})^2 = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} e_i^2$

TOTAL = Regression (LARGE??) + Residual (small??)

We would like the Regression SS to be large and the Residual SS to be small.

$F = \frac{(N-2)\,\text{Regression SS}}{\text{Residual SS}} = \frac{(N-2)\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{N} e_i^2} = \frac{(N-2)R^2}{1 - R^2}$


Is R2 Large?

Since F = (N-2)R2/(1 - R2), if R2 is “large,” then F will be large.

For a model with one explanatory variable in it, the standard benchmark value for a ‘large’ F is 4.
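The formula shows that F depends on the sample size as well as on R2. The sketch below uses a hypothetical function f_ratio. The first two calls use made-up values; the third uses the rounded R-Sq = 20.0% and the 328 paintings reported for the Monet regression later in this deck, so its result is only approximate.

```python
def f_ratio(r2, n):
    """F statistic for a regression with one explanatory variable."""
    return (n - 2) * r2 / (1.0 - r2)

BENCHMARK = 4.0  # the rule-of-thumb value for a 'large' F from the text

# Hypothetical: the same R2 in a small and a larger sample.
f_small_sample = f_ratio(0.10, n=20)
f_large_sample = f_ratio(0.10, n=200)

# Monet regression, using the rounded R2 reported on the plot.
f_monet = f_ratio(0.20, n=328)

for label, f in [("R2=.10, N=20", f_small_sample),
                 ("R2=.10, N=200", f_large_sample),
                 ("Monet", f_monet)]:
    print(f"{label}: F = {f:.2f}, larger than 4? {f > BENCHMARK}")
```

The same R2 of .10 falls short of the benchmark with 20 observations and clears it easily with 200, which is why R2 alone cannot settle the question.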


Movie Madness Fit

[Figure: Movie Madness regression results; callouts mark R2 and F.]


Why Use F and not R2?

When is R2 “large”? We have no firm benchmarks to decide.

How large is “large”? We have a table for F statistics to determine when F is statistically large: yes or no.


F Table

The “critical value” depends on the number of observations. If F is larger than the appropriate value in the table, conclude that there is a “statistically significant” relationship.

There is a huge F table on pages 732-742 of your text. Analysts now use computer programs, not tables like this, to find the critical values of F for their model/data.

[Figure: F table. n2 is N-2]


Internet Buzz Regression

Regression Analysis: BoxOffice versus Buzz

The regression equation is
BoxOffice = - 14.4 + 72.7 Buzz

Predictor     Coef  SE Coef      T      P
Constant   -14.360    5.546  -2.59  0.012
Buzz         72.72    10.94   6.65  0.000

S = 13.3863   R-Sq = 42.4%   R-Sq(adj) = 41.4%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       1   7913.6  7913.6  44.16  0.000
Residual Error  60  10751.5   179.2
Total           61  18665.1

n2 is N-2
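The entries in the Analysis of Variance table are linked by simple arithmetic, which the sketch below reproduces from the reported DF and SS columns. That each mean square is SS divided by DF, that F is the ratio of the two mean squares, and that S is the square root of the residual mean square are standard conventions for this kind of output rather than statements from the slides; the variable names are assumptions.

```python
import math

# DF and SS columns from the Internet Buzz ANOVA table.
regression_df, regression_ss = 1, 7913.6
residual_df, residual_ss = 60, 10751.5

regression_ms = regression_ss / regression_df   # mean square = SS / DF
residual_ms = residual_ss / residual_df

f_stat = regression_ms / residual_ms            # F = ratio of mean squares
s = math.sqrt(residual_ms)                      # standard error of the regression

print(f"Residual MS = {residual_ms:.1f}")
print(f"F           = {f_stat:.2f}")
print(f"S           = {s:.4f}")
```

Here the residual DF of 60 is the N-2 that the slide labels n2.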


$135 Million

http://www.nytimes.com/2006/06/19/arts/design/19klim.html?ex=1308369600&en=37eb32381038a749&ei=5088&partner=rssnyt&emc=rss

Klimt, to Ronald Lauder


$100 Million … sort of

Stephen Wynn with a Prized Possession, 2007


An Enduring Art Mystery

Why do larger paintings command higher prices?

The Persistence of Memory. Salvador Dali, 1931

The Persistence of Statistics. Hildebrand, Ott and Gray, 2005

Graphics show relative sizes of the two works.


Previously sold for exp(16.6374) = $16.8M


Monet in Large and Small

[Figure: Fitted Line Plot. ln (US$) = 2.825 + 1.725 ln (SurfaceArea); S = 1.00645, R-Sq = 20.0%, R-Sq(adj) = 19.8%. Axes: ln (US$) vs ln (SurfaceArea).]

Log of $price = a + b log surface area + e

Sale prices of 328 signed Monet paintings

The residuals do not show any obvious patterns that seem inconsistent with the assumptions of the model.


The Data

[Figure: Histogram of ln (SurfaceArea)]

[Figure: Histogram of ln (US$)]

Note: Using logs in this context. This is common when analyzing financial measurements (e.g., price) and when percentage changes are more interesting than unit changes. (E.g., what is the % premium when the painting is 10% larger?)
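The question in the note can be answered with the fitted slope from the Monet plot in this deck, b = 1.725. In a log-log model, scaling the area by a factor scales the predicted price by that factor raised to the power b. The function name price_premium is hypothetical, and the calculation treats the fitted line as exact, ignoring the error term.

```python
def price_premium(b, area_increase):
    """Proportional change in predicted price for a proportional change in area."""
    return (1.0 + area_increase) ** b - 1.0

b = 1.725  # slope of ln(US$) on ln(SurfaceArea) from the fitted line plot

premium = price_premium(b, 0.10)   # painting is 10% larger
approx = b * 0.10                  # quick approximation: elasticity times % change

print(f"premium for a 10% larger painting: {premium:.1%}")
print(f"elasticity approximation:          {approx:.1%}")
```

The approximation is close for small changes, which is why b is read as an elasticity.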


Monet Regression: There seems to be a regression. Is there a theory?


Conclusions about F

R2 answers the question of how well the model fits the data.

F answers the question of whether there is a statistically valid fit (as opposed to no fit).

What remains is the question of whether there is a valid relationship, i.e., is β different from zero?


The Regression Slope

The model is yi = α+βxi+εi

The “relationship” depends on β. If β equals zero, there is no relationship.

The least squares slope, b, is the estimate of β based on the sample. It is a statistic based on a random sample. We cannot be sure it equals the true β.

To accommodate this view, we form a range of uncertainty around b. I.e., a confidence interval.


Uncertainty About the Regression Slope

Hypothetical Regression: Fuel Bill vs. Number of Rooms

The regression equation is
Fuel Bill = -252 + 136 Number of Rooms

Predictor    Coef  SE Coef      T      P
Constant   -251.9    44.88  -5.20  0.000
Rooms       136.2     7.09   19.9  0.000

S = 144.456   R-Sq = 72.2%   R-Sq(adj) = 72.0%

The Rooms coefficient, 136.2, is b, the estimate of β.

Its “Standard Error” (SE), 7.09, is the measure of uncertainty about the true value.

The “range of uncertainty” is b ± 2 SE(b). (Actually 1.96, but people use 2)
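Applied to the fuel bill output above, the rule gives the following range. This is a minimal sketch; the function name slope_interval and its multiplier argument are assumptions made for illustration.

```python
def slope_interval(b, se, multiplier=2.0):
    """Range of uncertainty b - multiplier * SE to b + multiplier * SE."""
    return b - multiplier * se, b + multiplier * se

# Rooms coefficient and its standard error from the fuel bill regression.
b, se = 136.2, 7.09

low_2, high_2 = slope_interval(b, se)                      # rule of thumb, 2
low_196, high_196 = slope_interval(b, se, multiplier=1.96) # the exact 1.96

includes_zero = low_2 <= 0.0 <= high_2

print(f"using 2:    [{low_2:.2f} to {high_2:.2f}]")
print(f"using 1.96: [{low_196:.2f} to {high_196:.2f}]")
print(f"range includes zero? {includes_zero}")
```

The two multipliers give nearly the same range, which is why 2 is used in practice.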


Internet Buzz Regression

Regression Analysis: BoxOffice versus Buzz

The regression equation is
BoxOffice = - 14.4 + 72.7 Buzz

Predictor     Coef  SE Coef      T      P
Constant   -14.360    5.546  -2.59  0.012
Buzz         72.72    10.94   6.65  0.000

S = 13.3863   R-Sq = 42.4%   R-Sq(adj) = 41.4%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       1   7913.6  7913.6  44.16  0.000
Residual Error  60  10751.5   179.2
Total           61  18665.1

Range of uncertainty for b is 72.72 - 1.96(10.94) to 72.72 + 1.96(10.94) = [51.27 to 94.17]


Elasticity in the Monet Regression

b = 1.7246. This is the elasticity of price with respect to area.

The confidence interval would be 1.7246 ± 1.96(.1908) = [1.3506 to 2.0986].

The fact that this does not include 1.0 is an important result: prices for Monet paintings are extremely elastic with respect to the area.


Conclusion about b

So, should we conclude the slope is not zero?

Does the range of uncertainty include zero?
  No: then you should conclude the slope is not zero.
  Yes: then you can’t be very sure that β is not zero.

Tying it together. If the range of uncertainty does not include 0.0, then:
  The ratio b/SE is larger than 2 (in absolute value).
  The square of the ratio is larger than 4.
  The square of the ratio is F.
  F larger than 4 gave the same conclusion.
They are looking at the same thing.
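The link between the ratio b/SE and F can be checked with the Internet Buzz output in this deck, which reports Coef = 72.72, SE Coef = 10.94, and F = 44.16. Because the printed values are rounded, the square of the ratio matches F only approximately in this sketch; the variable names are assumptions.

```python
# Values printed in the Internet Buzz regression output.
b = 72.72
se = 10.94
reported_f = 44.16

ratio = b / se            # the T column in the output
ratio_squared = ratio ** 2

print(f"b/SE       = {ratio:.2f}")
print(f"(b/SE)^2   = {ratio_squared:.2f}")
print(f"reported F = {reported_f:.2f}")
print(f"ratio > 2? {abs(ratio) > 2}   F > 4? {reported_f > 4}")
```

Both checks lead to the same conclusion, as the slide says.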


Summary

The regression model: theory
Least squares results: a, b, s, R2
The fit of the regression model to the data
ANOVA and R2
The F statistic and R2
Uncertainty about the regression slope
