statistics and data analysis


Upload: nicole

Post on 22-Feb-2016


Page 1: Statistics and Data Analysis

Part 18: Regression Modeling18-1/44

Statistics and Data Analysis

Professor William Greene
Stern School of Business
IOMS Department
Department of Economics

Page 2: Statistics and Data Analysis


Statistics and Data Analysis

Part 18 – Regression Modeling

Page 3: Statistics and Data Analysis


Linear Regression Models

Least squares results
Regression model
Sample statistics
Estimates of population parameters

How good is the model?
In the abstract
Statistical measures of model fit

Assessing the validity of the relationship

Page 4: Statistics and Data Analysis


Regression Model

Regression relationship:

yi = α + β xi + εi

Random εi implies random yi.

Observed random yi has two unobserved components:

Explained: α + β xi
Unexplained: εi

The random component εi has zero mean, standard deviation σ, and a normal distribution.
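The model on this slide can be made concrete with a short simulation. This is a minimal sketch, not part of the original deck: the function name and the parameter values (alpha = 10, beta = 2, sigma = 1.5) are assumptions chosen only for illustration.

```python
import random

def simulate_regression(alpha, beta, sigma, x_values, seed=0):
    """Draw y_i = alpha + beta * x_i + eps_i with eps_i ~ Normal(0, sigma)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    data = []
    for x in x_values:
        explained = alpha + beta * x   # explained component
        eps = rng.gauss(0.0, sigma)    # unexplained (random) component
        data.append((x, explained, eps, explained + eps))
    return data

# Hypothetical parameter values, not taken from the slides.
sample = simulate_regression(alpha=10.0, beta=2.0, sigma=1.5, x_values=range(1, 11))
for x, explained, eps, y in sample[:3]:
    print(f"x={x}  explained={explained:.2f}  eps={eps:+.2f}  y={y:.2f}")
```

In real data only x and y are observed; the split into explained and unexplained parts is visible here only because the simulation generated it.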

Page 5: Statistics and Data Analysis


Linear Regression: Model Assumption

Page 6: Statistics and Data Analysis


Least Squares Results

Page 7: Statistics and Data Analysis


Using the Regression Model

Prediction: Use xi as information to predict yi.

The natural predictor is the mean, $\bar{y}$.

xi provides more information.

With xi, the predictor is $\hat{y}_i = a + b x_i$.
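As a sketch of how a and b can be obtained, the code below uses the standard least squares formulas b = Sxy / Sxx and a = ybar - b * xbar. The function name and the data are hypothetical, made up for illustration; they are not the salary data plotted in the deck.

```python
def least_squares(x, y):
    """Return (a, b), the least squares intercept and slope for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data: years of experience and salary.
years = [1, 3, 5, 7, 9]
salary = [32000, 41000, 47000, 58000, 62000]

a, b = least_squares(years, salary)
mean_prediction = sum(salary) / len(salary)   # predictor that ignores x
line_prediction = a + b * 6                   # predictor that uses x = 6 years
print(f"a = {a:.1f}, b = {b:.1f}")
print(f"mean predictor: {mean_prediction:.0f}; with x = 6: {line_prediction:.0f}")
```

The mean gives the same prediction for everyone; the fitted line shifts the prediction up or down depending on xi.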

Page 8: Statistics and Data Analysis


Regression Fits

[Figure: Scatterplot of SALARY vs YEARS. Regression of salary vs. years of experience.]

[Figure: Scatterplot of FUELBILL vs ROOMS. Regression of fuel bill vs. number of rooms for a sample of homes.]

Page 9: Statistics and Data Analysis


Regression Arithmetic

$y_i = \hat{y}_i + e_i$ = prediction + error

$y_i - \bar{y} = (\hat{y}_i - \bar{y}) + e_i$

A few algebra steps later...

$\sum_{i=1}^{N} (y_i - \bar{y})^2 = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} e_i^2$

TOTAL = Regression (LARGE??) + Residual (small??)

This is the analysis of (the) variance (of y); ANOVA
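The decomposition can be checked numerically. The sketch below fits a least squares line and confirms that the total sum of squares equals the regression plus residual sums of squares. The helper names and the rooms/fuel bill numbers are hypothetical, not the data behind the deck's plots.

```python
def fit_line(x, y):
    """Least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b * x_bar, b

def sums_of_squares(x, y):
    """Return (total SS, regression SS, residual SS) for the fitted line."""
    a, b = fit_line(x, y)
    y_bar = sum(y) / len(y)
    y_hat = [a + b * xi for xi in x]
    residuals = [yi - yh for yi, yh in zip(y, y_hat)]
    total_ss = sum((yi - y_bar) ** 2 for yi in y)
    regression_ss = sum((yh - y_bar) ** 2 for yh in y_hat)
    residual_ss = sum(e ** 2 for e in residuals)
    return total_ss, regression_ss, residual_ss

# Hypothetical data: number of rooms and fuel bill.
rooms = [3, 4, 5, 6, 7, 8]
fuel_bill = [250, 390, 480, 640, 700, 930]

total_ss, regression_ss, residual_ss = sums_of_squares(rooms, fuel_bill)
print(f"TOTAL      = {total_ss:.1f}")
print(f"Regression = {regression_ss:.1f}")
print(f"Residual   = {residual_ss:.1f}")
print(f"Regression + Residual = {regression_ss + residual_ss:.1f}")
```

The identity holds only for the least squares line with an intercept; for an arbitrary line the cross-product term does not vanish.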

Page 10: Statistics and Data Analysis


Analysis of Variance

$\sum_{i=1}^{N} (y_i - \bar{y})^2 = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} e_i^2$

TOTAL = Regression + Residual

Page 11: Statistics and Data Analysis


Fit of the Model to the Data

The original question about the model fit to the data:

$\sum_{i=1}^{N} (y_i - \bar{y})^2 = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} e_i^2$

TOTAL = Regression (LARGE??) + Residual (small??)

$\frac{\text{TOTAL SS}}{\text{TOTAL SS}} = \frac{\text{Regression SS}}{\text{TOTAL SS}} + \frac{\text{Residual SS}}{\text{TOTAL SS}}$

1 = Proportion Explained + Proportion Unexplained

Page 12: Statistics and Data Analysis


Explained Variation

The proportion of variation “explained” by the regression is called R-squared (R2)

It is also called the Coefficient of Determination

Page 13: Statistics and Data Analysis


Movie Madness

Fit

R2

Page 14: Statistics and Data Analysis


Regression Fits

[Figure: Regression of Foreign Box Office on Domestic. Overseas = 6.693 + 1.051 Domestic; S = 73.0041, R-Sq = 52.2%, R-Sq(adj) = 52.1%. R2 = 0.522]

[Figure: Fitted Line Plot of G vs Income. G = 1.928 + 0.000179 Income; S = 0.370241, R-Sq = 88.0%, R-Sq(adj) = 87.8%. R2 = 0.880]

[Figure: third panel, title not recoverable. R2 = 0.424]

[Figure: Scatterplot of Cost vs Output. R2 = 0.924]

Page 15: Statistics and Data Analysis


R2 = 0.338

R2 is still positive even if the correlation is negative.

Page 16: Statistics and Data Analysis


R Squared Benchmarks

Aggregate time series: expect .9+
Cross sections: .5 is good. Sometimes we do much better.
Large survey data sets: .2 is not bad.

[Figure: Scatterplot of Cost vs Output. R2 = 0.924 in this cross section.]

Page 17: Statistics and Data Analysis


Correlation Coefficient

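$r_{xy} = \text{Correlation}(x,y) = \frac{\text{Sample Cov}[x,y]}{[\text{Sample standard deviation}(x)]\,[\text{Sample standard deviation}(y)]}$

$r_{xy} = \frac{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2}\;\sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y})^2}}$

$-1 \le r_{xy} \le 1$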
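The correlation formula translates directly into code. This is a minimal sketch following the definition term by term; the function name and the data are hypothetical, chosen for illustration.

```python
import math

def sample_correlation(x, y):
    """Sample covariance divided by the product of sample standard deviations."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    sd_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return cov_xy / (sd_x * sd_y)

# Hypothetical data illustrating the two extremes and a case in between.
r_perfect_positive = sample_correlation([1, 2, 3, 4], [2, 4, 6, 8])
r_perfect_negative = sample_correlation([1, 2, 3, 4], [8, 6, 4, 2])
r_noisy = sample_correlation([1, 2, 3, 4, 5], [2, 1, 4, 3, 6])
print(r_perfect_positive, r_perfect_negative, round(r_noisy, 3))
```

The 1/(N-1) factors cancel between numerator and denominator, so the same value results if they are dropped everywhere.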

Page 18: Statistics and Data Analysis


Correlations

[Figure: Scatterplot of Overseas vs Domestic. rxy = 0.723]

[Figure: Fitted Line Plot of GDPC vs GINI. GDPC = 19826 - 34508 GINI; S = 6574.43, R-Sq = 16.2%, R-Sq(adj) = 15.8%. rxy = -.402]

[Figure: Scatterplot of C5 vs C6. rxy = +1.000]

Page 19: Statistics and Data Analysis


R-Squared is $r_{xy}^2$

R-squared is the square of the correlation between yi and the predicted yi which is a + bxi.

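The correlation between yi and (a+bxi) is the same, up to sign, as the correlation between yi and xi. (When b is negative the two correlations have opposite signs, but their squares are equal.)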

Therefore,…. A regression with a high R2 predicts yi well.
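A small numerical check of these statements, using a deliberately negative relationship. This is a sketch with hypothetical data and helper names; it computes R2 from the sums of squares and compares it with the squared correlation.

```python
import math

def corr(u, v):
    """Sample correlation between two equal-length lists."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    suu = sum((ui - u_bar) ** 2 for ui in u)
    svv = sum((vi - v_bar) ** 2 for vi in v)
    return suv / math.sqrt(suu * svv)

# Hypothetical data with a downward-sloping relationship.
x = [1, 2, 3, 4, 5]
y = [9, 7, 8, 4, 2]

# Least squares fit.
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

# R-squared as 1 - Residual SS / Total SS.
residual_ss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
total_ss = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1.0 - residual_ss / total_ss

r_xy = corr(x, y)          # negative here
r_y_yhat = corr(y, y_hat)  # positive: same size, opposite sign
print(f"r_xy = {r_xy:.4f}, corr(y, y_hat) = {r_y_yhat:.4f}, R2 = {r_squared:.4f}")
```

Here the slope is negative, so corr(y, y_hat) equals -r_xy, while R2 equals the square of either one.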

Page 20: Statistics and Data Analysis


Adjusted R-Squared

We will discover, when we study regression with more than one variable, that a researcher can increase R2 just by adding variables to a model, even if those variables do not really explain y or have any real relationship at all.

To have a fit measure that accounts for this, “Adjusted R2” is a number that increases with the correlation, but decreases with the number of variables.

$\bar{R}^2 = 1 - \frac{N-1}{N-K-1}\,(1 - R^2)$

K is the number of "x" variables in the equation.
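The adjustment is a one-line computation. The sketch below applies it to the sums of squares reported in the Internet Buzz regression output that appears elsewhere in this deck (Regression SS = 7913.6, Total SS = 18665.1, N = 62, K = 1). The function name is an assumption, and the second call uses made-up values to show that the result can be negative.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R-squared: 1 - (1 - R^2) * (N - 1) / (N - K - 1)."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k - 1)

# Values taken from the Internet Buzz ANOVA table in this deck.
r_squared_buzz = 7913.6 / 18665.1
adj_buzz = adjusted_r_squared(r_squared_buzz, n=62, k=1)
print(f"R2 = {r_squared_buzz:.3f}, adjusted R2 = {adj_buzz:.3f}")

# Hypothetical weak fit in a small sample: the adjusted value drops below zero.
adj_weak = adjusted_r_squared(0.01, n=10, k=1)
print(f"adjusted R2 for R2 = 0.01, N = 10, K = 1: {adj_weak:.3f}")
```

The first result agrees with the R-Sq(adj) = 41.4% shown in that output.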

Page 21: Statistics and Data Analysis


Movie Madness Fit

$\bar{R}^2$

Page 22: Statistics and Data Analysis


Notes About Adjusted R2

(1) Adjusted $R^2$ is denoted $\bar{R}^2$. $\bar{R}^2$ is less than $R^2$.

(2) $\bar{R}^2$ is not the square of R. It is not the square of anything. Adjusted R squared is just a name, not a formula.

(3) Adjusting $R^2$ makes no sense when there is only one variable in the model. You should pay no attention to $\bar{R}^2$ when K = 1.

(4) $\bar{R}^2$ can be less than zero! See point (2).

Page 23: Statistics and Data Analysis


Is R2 Large?

Is there really a relationship between x and y?
We cannot be 100% certain.
We can be “statistically certain” (within limits) by examining R2.
F is used for this purpose.

Page 24: Statistics and Data Analysis


The F Ratio

The original question about the model fit to the data:

$\sum_{i=1}^{N} (y_i - \bar{y})^2 = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} e_i^2$

TOTAL = Regression (LARGE??) + Residual (small??)

We would like the Regression SS to be large and the Residual SS to be small.

$F = \frac{(N-2)\,\text{Regression SS}}{\text{Residual SS}} = \frac{(N-2)\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{N} e_i^2} = \frac{(N-2)\,R^2}{1 - R^2}$

Page 25: Statistics and Data Analysis


Is R2 Large?

Since F = (N-2)R2/(1 – R2), if R2 is “large,” then F will be large.

For a model with one explanatory variable in it, the standard benchmark value for a ‘large’ F is 4.

Page 26: Statistics and Data Analysis


Movie Madness Fit

R2

F

Page 27: Statistics and Data Analysis


Why Use F and not R2?

When is R2 “large”? We have no benchmarks to decide.

How large is “large?” We have a table for F statistics to determine when F is statistically large: yes or no.

Page 28: Statistics and Data Analysis


F Table

The “critical value” depends on the number of observations. If F is larger than the appropriate value in the table, conclude that there is a “statistically significant” relationship.

There is a huge F table on pages 732-742 of your text. Analysts now use computer programs, not tables like this, to find the critical values of F for their model/data.

n2 is N-2

Page 29: Statistics and Data Analysis


Internet Buzz Regression

Regression Analysis: BoxOffice versus Buzz

The regression equation is
BoxOffice = - 14.4 + 72.7 Buzz

Predictor     Coef  SE Coef      T      P
Constant   -14.360    5.546  -2.59  0.012
Buzz         72.72    10.94   6.65  0.000

S = 13.3863   R-Sq = 42.4%   R-Sq(adj) = 41.4%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       1   7913.6  7913.6  44.16  0.000
Residual Error  60  10751.5   179.2
Total           61  18665.1

n2 is N-2
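The entries in this output can be reproduced from the sums of squares alone. The sketch below recomputes R-Sq, the residual mean square, S, and F from the ANOVA table; the variable names are assumptions, and small differences in the last digit come from the rounding of the printed values.

```python
import math

# Values copied from the Analysis of Variance table above.
regression_ss = 7913.6
residual_ss = 10751.5
n = 62                      # Total DF is N - 1 = 61

total_ss = regression_ss + residual_ss
r_squared = regression_ss / total_ss
residual_ms = residual_ss / (n - 2)          # MS for Residual Error
s = math.sqrt(residual_ms)                   # standard error of the regression
f_from_ss = (n - 2) * regression_ss / residual_ss
f_from_r2 = (n - 2) * r_squared / (1.0 - r_squared)

print(f"Total SS    = {total_ss:.1f}")
print(f"R-Sq        = {r_squared:.1%}")
print(f"Residual MS = {residual_ms:.1f}")
print(f"S           = {s:.4f}")
print(f"F           = {f_from_ss:.2f} (from SS), {f_from_r2:.2f} (from R2)")
```

Both forms of F give the same number, well above the benchmark of 4.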

Page 30: Statistics and Data Analysis


$135 Million

http://www.nytimes.com/2006/06/19/arts/design/19klim.html?ex=1308369600&en=37eb32381038a749&ei=5088&partner=rssnyt&emc=rss

Klimt, to Ronald Lauder

Page 31: Statistics and Data Analysis


$100 Million … sort of

Stephen Wynn with a Prized Possession, 2007

Page 32: Statistics and Data Analysis


An Enduring Art Mystery

Why do larger paintings command higher prices?

The Persistence of Memory. Salvador Dali, 1931

The Persistence of Statistics. Hildebrand, Ott and Gray, 2005

Graphics show relative sizes of the two works.

Page 33: Statistics and Data Analysis


Page 34: Statistics and Data Analysis


Previously sold for exp(16.6374) = $16.8M

Page 35: Statistics and Data Analysis


Monet in Large and Small

[Figure: Fitted Line Plot of ln (US$) vs ln (SurfaceArea). ln (US$) = 2.825 + 1.725 ln (SurfaceArea); S = 1.00645, R-Sq = 20.0%, R-Sq(adj) = 19.8%]

Log of $price = a + b log surface area + e

Sale prices of 328 signed Monet paintings

The residuals do not show any obvious patterns that seem inconsistent with the assumptions of the model.

Page 36: Statistics and Data Analysis


The Data

[Figure: Histogram of ln (SurfaceArea)]

[Figure: Histogram of ln (US$)]

Note: We use logs in this context. This is common when analyzing financial measurements (e.g., price) and when percentage changes are more interesting than unit changes. (E.g., what is the % premium when the painting is 10% larger?)
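The question in the note can be answered from the fitted line ln (US$) = 2.825 + 1.725 ln (SurfaceArea) shown on the Monet slide. In a log-log model, scaling the area by a factor c scales the predicted price by c raised to the slope. The sketch below works this out; the function names and the base area of 1000 are assumptions (the slides do not state the area units), and the result does not depend on the base area.

```python
import math

INTERCEPT = 2.825   # from the fitted line plot in this deck
SLOPE = 1.725       # elasticity of price with respect to area

def predicted_log_price(area):
    """Fitted ln(price) for a given surface area."""
    return INTERCEPT + SLOPE * math.log(area)

def size_premium(fraction_larger):
    """Proportional price increase when the area grows by fraction_larger."""
    return (1.0 + fraction_larger) ** SLOPE - 1.0

# Cross-check using the fitted line directly, at a hypothetical base area.
base_area = 1000.0
ratio = math.exp(predicted_log_price(1.1 * base_area) - predicted_log_price(base_area))

print(f"Premium for a 10% larger painting: {size_premium(0.10):.1%}")
print(f"Price ratio from the fitted line:  {ratio:.4f}")
```

A 10% larger painting is predicted to sell for roughly 18% more, which is what an elasticity above 1 means in practice.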

Page 37: Statistics and Data Analysis


Monet Regression: There seems to be a regression. Is there a theory?

Page 38: Statistics and Data Analysis


Conclusions about F

R2 answers the question of how well the model fits the data.

F answers the question of whether there is a statistically valid fit (as opposed to no fit).

What remains is the question of whether there is a valid relationship – i.e., is β different from zero?

Page 39: Statistics and Data Analysis


The Regression Slope

The model is yi = α+βxi+εi

The “relationship” depends on β. If β equals zero, there is no relationship.

The least squares slope, b, is the estimate of β based on the sample. It is a statistic based on a random sample. We cannot be sure it equals the true β.

To accommodate this view, we form a range of uncertainty around b. I.e., a confidence interval.

Page 40: Statistics and Data Analysis


Uncertainty About the Regression Slope

Hypothetical Regression: Fuel Bill vs. Number of Rooms

The regression equation is
Fuel Bill = -252 + 136 Number of Rooms

Predictor    Coef  SE Coef      T      P
Constant   -251.9    44.88  -5.20  0.000
Rooms       136.2     7.09   19.9  0.000

S = 144.456   R-Sq = 72.2%   R-Sq(adj) = 72.0%

The Rooms coefficient, 136.2, is b, the estimate of β.

Its “Standard Error” (SE), 7.09, is the measure of uncertainty about the true value.

The “range of uncertainty” is b ± 2 SE(b). (Actually 1.96, but people use 2)

Page 41: Statistics and Data Analysis


Internet Buzz Regression

Regression Analysis: BoxOffice versus Buzz

The regression equation is
BoxOffice = - 14.4 + 72.7 Buzz

Predictor     Coef  SE Coef      T      P
Constant   -14.360    5.546  -2.59  0.012
Buzz         72.72    10.94   6.65  0.000

S = 13.3863   R-Sq = 42.4%   R-Sq(adj) = 41.4%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       1   7913.6  7913.6  44.16  0.000
Residual Error  60  10751.5   179.2
Total           61  18665.1

Range of Uncertainty for b is 72.72 - 1.96(10.94) to 72.72 + 1.96(10.94) = [51.27 to 94.17]
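The range of uncertainty is simple arithmetic on the coefficient and its standard error. A minimal sketch, with an assumed function name, using the Buzz coefficient and SE from the output above:

```python
def range_of_uncertainty(b, se, multiplier=1.96):
    """Return (lower, upper) = b -/+ multiplier * SE(b)."""
    return b - multiplier * se, b + multiplier * se

# Coefficient and standard error for Buzz, as printed in the regression output.
lower, upper = range_of_uncertainty(72.72, 10.94)
contains_zero = lower <= 0.0 <= upper

print(f"[{lower:.2f} to {upper:.2f}]")
print(f"Interval contains zero: {contains_zero}")
```

Using the printed, rounded values gives endpoints within about 0.01 of those on the slide; the small gap is presumably because the slide's figures were computed from unrounded estimates.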

Page 42: Statistics and Data Analysis


Elasticity in the Monet Regression:

b = 1.7246. This is the elasticity of price with respect to area.
The confidence interval would be 1.7246 ± 1.96(.1908) = [1.3506 to 2.0986].
The fact that this does not include 1.0 is an important result – prices for Monet paintings are extremely elastic with respect to the area.

Page 43: Statistics and Data Analysis


Conclusion about b

So, should we conclude the slope is not zero?

Does the range of uncertainty include zero?
No: then you should conclude the slope is not zero.
Yes: then you can’t be very sure that β is not zero.

Tying it together. If the range of uncertainty does not include 0.0, then:
The ratio b/SE is larger than 2 in absolute value.
The square of the ratio is larger than 4.
The square of the ratio is F.
F larger than 4 gave the same conclusion.
They are looking at the same thing.
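These links can be checked with the Internet Buzz numbers from earlier in the deck (b = 72.72, SE = 10.94, F = 44.16). The sketch below also shows where the "2" and the "4" come from: the 97.5th percentile of the standard normal distribution is about 1.96, and its square is about 3.84. The variable names are assumptions, and the normal value is a large-sample approximation to the exact critical value.

```python
from statistics import NormalDist

b, se = 72.72, 10.94          # Buzz coefficient and standard error from the deck
reported_f = 44.16            # F from the deck's ANOVA table

z = NormalDist().inv_cdf(0.975)   # about 1.96, the value people round to 2
t_ratio = b / se                  # the T column in the regression output
t_squared = t_ratio ** 2

interval_excludes_zero = not (b - z * se <= 0.0 <= b + z * se)
ratio_exceeds_cutoff = abs(t_ratio) > z

print(f"z = {z:.4f}, z squared = {z * z:.2f}")
print(f"b/SE = {t_ratio:.2f}, squared = {t_squared:.2f}, reported F = {reported_f}")
print(f"Interval excludes zero: {interval_excludes_zero}; |b/SE| > z: {ratio_exceeds_cutoff}")
```

With one explanatory variable the squared ratio equals F exactly; the small difference here comes from using the rounded values printed in the output.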

Page 44: Statistics and Data Analysis


Summary

The regression model – theory
Least squares results, a, b, s, R2
The fit of the regression model to the data
ANOVA and R2
The F statistic and R2
Uncertainty about the regression slope