Statistics and Data Analysis

Part 18: Regression Modeling

Professor William Greene
Stern School of Business
IOMS Department
Department of Economics

Linear Regression Models

Least squares results
  Regression model
  Sample statistics
  Estimates of population parameters

How good is the model?
  In the abstract
  Statistical measures of model fit

Assessing the validity of the relationship

Regression Model

Regression relationship

yi = α + β xi + εi

Random εi implies random yi

Observed random yi has two unobserved components:

Explained: α + β xi
Unexplained: εi

The random component εi has zero mean, standard deviation σ, and a normal distribution.

Linear Regression: Model Assumption

Least Squares Results

Using the Regression Model

Prediction: Use xi as information to predict yi.

Without xi, the natural predictor is the mean, $\bar{y}$.

xi provides more information.

With xi, the predictor is

$$\hat{y}_i = a + b x_i$$
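To make the comparison concrete, here is a minimal sketch that computes the least squares a and b and compares the squared prediction error of the mean alone with that of a + b x. The data values and the helper name `least_squares` are hypothetical, chosen only for illustration; they are not the salary data from the lecture.

```python
from statistics import mean

# Hypothetical data (NOT from the lecture): years of experience and salary.
years = [1, 3, 5, 8, 12, 15, 20]
salary = [32000, 36000, 41000, 52000, 60000, 71000, 83000]


def least_squares(x, y):
    """Return intercept a and slope b of the least squares line y = a + b x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


a, b = least_squares(years, salary)
y_bar = mean(salary)

# Sum of squared prediction errors: mean only versus the fitted line.
sse_mean = sum((yi - y_bar) ** 2 for yi in salary)
sse_line = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(years, salary))

print(f"a = {a:.1f}, b = {b:.1f}")
print(f"squared error using the mean only: {sse_mean:.0f}")
print(f"squared error using a + b x:       {sse_line:.0f}")
```

The line can never do worse than the mean in the sample, because setting b = 0 and a equal to the mean is one of the lines least squares could have chosen.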

Regression Fits

[Figure: Scatterplot of SALARY vs YEARS. Regression of salary vs. years of experience.]

[Figure: Scatterplot of FUELBILL vs ROOMS. Regression of fuel bill vs. number of rooms for a sample of homes.]

Regression Arithmetic

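$$y_i = \hat{y}_i + e_i = \text{prediction} + \text{error}$$

$$y_i - \bar{y} = (\hat{y}_i - \bar{y}) + e_i$$

A few algebra steps later...

$$\sum_{i=1}^{N} (y_i - \bar{y})^2 = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} e_i^2$$

TOTAL = Regression + Residual

Relative to the TOTAL, is the Regression part LARGE? Is the Residual part small?

This is the analysis of (the) variance (of y); ANOVA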
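The decomposition can be checked numerically. The sketch below fits a line by least squares to hypothetical data (not the lecture's fuel bill sample) and confirms that the total sum of squares equals the regression plus residual sums of squares, up to floating point rounding.

```python
from statistics import mean

# Hypothetical data (NOT from the lecture): rooms and fuel bill.
rooms = [3, 4, 5, 6, 7, 8, 9, 10]
fuel = [180, 310, 420, 520, 700, 810, 1000, 1100]

x_bar, y_bar = mean(rooms), mean(fuel)

# Least squares slope and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(rooms, fuel)) / sum(
    (x - x_bar) ** 2 for x in rooms
)
a = y_bar - b * x_bar

fitted = [a + b * x for x in rooms]            # predictions
resid = [y - f for y, f in zip(fuel, fitted)]  # errors e_i

total_ss = sum((y - y_bar) ** 2 for y in fuel)
regression_ss = sum((f - y_bar) ** 2 for f in fitted)
residual_ss = sum(e ** 2 for e in resid)

print(f"TOTAL      = {total_ss:.2f}")
print(f"Regression = {regression_ss:.2f}")
print(f"Residual   = {residual_ss:.2f}")
print(f"Regression + Residual = {regression_ss + residual_ss:.2f}")
```

The identity holds exactly only for the least squares a and b; for any other line the cross term between fitted values and residuals does not vanish.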

Analysis of Variance

$$\sum_{i=1}^{N} (y_i - \bar{y})^2 = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} e_i^2$$

TOTAL = Regression + Residual

Fit of the Model to the Data

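The original question about the model fit to the data:

$$\sum_{i=1}^{N} (y_i - \bar{y})^2 = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} e_i^2$$

TOTAL = Regression + Residual

Relative to the TOTAL, is the Regression part LARGE? Is the Residual part small?

$$\frac{\text{TOTAL SS}}{\text{TOTAL SS}} = \frac{\text{Regression SS}}{\text{TOTAL SS}} + \frac{\text{Residual SS}}{\text{TOTAL SS}}$$

1 = Proportion Explained + Proportion Unexplained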

Explained Variation

The proportion of variation “explained” by the regression is called R-squared (R²).

It is also called the Coefficient of Determination.

Movie Madness

Regression Fits

[Figure: Regression of Foreign Box Office on Domestic. Overseas = 6.693 + 1.051 Domestic. S = 73.0041, R-Sq = 52.2%, R-Sq(adj) = 52.1%.]

R² = 0.522

[Figure: Fitted Line Plot. G = 1.928 + 0.000179 Income. S = 0.370241, R-Sq = 88.0%, R-Sq(adj) = 87.8%.]

R² = 0.880

R² = 0.424

[Figure: Scatterplot of Cost vs Output]

R² = 0.924

R² = 0.338

R² is still positive even if the correlation is negative.

R Squared Benchmarks

Aggregate time series: expect .9+

Cross sections: .5 is good. Sometimes we do much better.

Large survey data sets: .2 is not bad.

[Figure: Scatterplot of Cost vs Output]

R² = 0.924 in this cross section.

Correlation Coefficient

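$$r = \text{Correlation}(x,y) = \frac{\text{Sample Cov}[x,y]}{[\text{Sample standard deviation}(x)]\,[\text{Sample standard deviation}(y)]}$$

$$r = \frac{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2}\;\sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y})^2}}$$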
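The formula translates directly into a few lines of code. This is a minimal sketch; the function name `sample_correlation` and both data sets are hypothetical, chosen to show a perfect positive correlation and a negative one.

```python
from math import sqrt
from statistics import mean


def sample_correlation(x, y):
    """Sample covariance divided by the product of sample standard deviations."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    sd_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    sd_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return cov / (sd_x * sd_y)


# Hypothetical data (NOT from the lecture).
x = [0, 2, 4, 6, 8, 10]
exact_line = [3 + 2 * xi for xi in x]   # y lies exactly on a rising line
falling = [20, 17, 18, 11, 9, 8]        # y tends to fall as x rises

r_perfect = sample_correlation(x, exact_line)
r_down = sample_correlation(x, falling)

print(f"exact line: r = {r_perfect:.3f}")
print(f"falling:    r = {r_down:.3f}")
```

The 1/(N-1) factors cancel between numerator and denominator, so r does not depend on whether N or N-1 is used as the divisor.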

Correlations

[Figure: Scatterplot of Overseas vs Domestic]

rxy = 0.723

[Figure: Fitted Line Plot. GDPC = 19826 - 34508 GINI. S = 6574.43, R-Sq = 16.2%, R-Sq(adj) = 15.8%.]

rxy = -.402

[Figure: Scatterplot of C5 vs C6]

rxy = +1.000

R-Squared is rxy²

R-squared is the square of the correlation between yi and the predicted yi, which is a + bxi.

The correlation between yi and (a + bxi) is the same as the correlation between yi and xi, apart from the sign when b is negative.

Therefore, a regression with a high R² predicts yi well.
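A quick numerical check of these statements, using hypothetical data with a downward-sloping relationship (not the GDPC/GINI data in the figure): the correlation is negative, yet R-squared equals its square and is therefore positive.

```python
from math import sqrt
from statistics import mean


def corr(u, v):
    """Sample correlation between two equal-length lists."""
    u_bar, v_bar = mean(u), mean(v)
    suv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    suu = sum((ui - u_bar) ** 2 for ui in u)
    svv = sum((vi - v_bar) ** 2 for vi in v)
    return suv / sqrt(suu * svv)


# Hypothetical data (NOT from the lecture).
x = [0.20, 0.30, 0.35, 0.40, 0.50, 0.55, 0.60]
y = [21000, 9000, 15000, 4000, 7000, 2500, 3000]

x_bar, y_bar = mean(x), mean(y)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar
fitted = [a + b * xi for xi in x]

r_xy = corr(x, y)
r_y_fitted = corr(y, fitted)
r_squared = 1 - sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / sum(
    (yi - y_bar) ** 2 for yi in y
)

print(f"r(x, y)      = {r_xy:.4f}")        # negative here
print(f"r(y, fitted) = {r_y_fitted:.4f}")  # same size, positive
print(f"r(x, y)^2    = {r_xy ** 2:.4f}")
print(f"R-squared    = {r_squared:.4f}")   # matches r(x, y)^2
```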

Adjusted R-Squared

As we will discover when we study regression with more than one variable, a researcher can increase R² just by adding variables to a model, even if those variables do not really explain y or have any real relationship at all.

To have a fit measure that accounts for this, “Adjusted R²” is a number that increases with the correlation, but decreases with the number of variables.

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{N-1}{N-K-1}$$

K is the number of "x" variables in the equation.

Movie Madness Fit

Notes About Adjusted R²

(1) Adjusted R² is denoted $\bar{R}^2$. $\bar{R}^2$ is less than R².

(2) $\bar{R}^2$ is not the square of $\bar{R}$. It is not the square of anything. “Adjusted R squared” is just a name, not a formula.

(3) Adjusting R² makes no sense when there is only one variable in the model. You should pay no attention to $\bar{R}^2$ when K = 1.

(4) $\bar{R}^2$ can be less than zero! See point (2).
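As a worked example of the adjustment, the sketch below applies the adjusted R-squared formula to the Internet Buzz regression reported in these notes (R-Sq = 42.4%, N = 62 from the 61 total degrees of freedom, K = 1). The second case is hypothetical and is included only to show that the adjusted value can fall below zero. The function name `adjusted_r2` is an assumption for illustration.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: 1 - (1 - R^2)(N - 1)/(N - K - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Internet Buzz regression from these notes: R-Sq = 42.4%, N = 62, K = 1.
buzz = adjusted_r2(0.424, n=62, k=1)

# Hypothetical weak fit: tiny R^2, small sample, three x variables.
weak = adjusted_r2(0.01, n=20, k=3)

print(f"Buzz regression: {buzz:.3f}")  # about 0.414, matching R-Sq(adj) = 41.4%
print(f"Weak fit:        {weak:.3f}")  # negative
```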

Is R² Large?

Is there really a relationship between x and y? We cannot be 100% certain. We can be “statistically certain” (within limits) by examining R². F is used for this purpose.

The F Ratio

The original question about the model fit to the data:

$$\sum_{i=1}^{N} (y_i - \bar{y})^2 = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} e_i^2$$

TOTAL = Regression + Residual

We would like the Regression SS to be large and the Residual SS to be small.

$$F = \frac{(N-2)\,\text{Regression SS}}{\text{Residual SS}} = \frac{(N-2)\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{N} e_i^2} = \frac{(N-2)R^2}{1-R^2}$$

Is R² Large?

Since F = (N-2)R²/(1 - R²), if R² is “large,” then F will be large.

For a model with one explanatory variable in it, the standard benchmark value for a ‘large’ F is 4.
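The link between F and R² can be run in both directions. The sketch below computes F from R² and N, and also inverts the formula to find the R² needed to reach a given F, which follows from solving F = (N-2)R²/(1 - R²) for R². The inputs N = 62 and R-Sq = 42.4% come from the Internet Buzz output in these notes; the function names are hypothetical.

```python
def f_from_r2(r2, n):
    """F ratio for a one-variable regression: (N - 2) R^2 / (1 - R^2)."""
    return (n - 2) * r2 / (1 - r2)


def r2_needed(f, n):
    """R^2 that produces a given F: solve F = (N - 2) R^2 / (1 - R^2) for R^2."""
    return f / (f + n - 2)


# Internet Buzz regression from these notes: R-Sq = 42.4%, N = 62.
f_buzz = f_from_r2(0.424, 62)

# R^2 that just reaches the benchmark F = 4 with the same sample size.
r2_threshold = r2_needed(4, 62)

print(f"F from R^2 = 0.424, N = 62: {f_buzz:.2f}")
print(f"R^2 needed for F = 4, N = 62: {r2_threshold:.4f}")  # 4/64 = 0.0625
```

The computed F is about 44.2 rather than the reported 44.16 because R² is rounded to three digits here. The threshold shows why a fixed R² benchmark is hard to state: with 62 observations, an R² only a little above 0.06 already passes F = 4.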

Movie Madness Fit

Why Use F and not R²?

When is R² “large”? We have no formal benchmark to decide.

How large is “large”? We have a table for F statistics to determine when F is statistically large: yes or no.

F Table

The “critical value” depends on the number of observations. If F is larger than the appropriate value in the table, conclude that there is a “statistically significant” relationship.

There is a huge F table on pages 732-742 of your text. Analysts now use computer programs, not tables like this, to find the critical values of F for their model/data.

In the F table, n2 (the denominator degrees of freedom) is N-2.

Internet Buzz Regression

Regression Analysis: BoxOffice versus Buzz

The regression equation is
BoxOffice = - 14.4 + 72.7 Buzz

Predictor     Coef  SE Coef      T      P
Constant   -14.360    5.546  -2.59  0.012
Buzz         72.72    10.94   6.65  0.000

S = 13.3863   R-Sq = 42.4%   R-Sq(adj) = 41.4%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       1   7913.6  7913.6  44.16  0.000
Residual Error  60  10751.5   179.2
Total           61  18665.1
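The entries in this output are tied together by simple arithmetic. As a rough check, the sketch below starts from only the sums of squares and degrees of freedom in the Analysis of Variance table and recomputes MS, F, R-Sq, and S. The variable names are illustrative.

```python
from math import sqrt

# Sums of squares and degrees of freedom from the ANOVA table above.
regression_ss, regression_df = 7913.6, 1
residual_ss, residual_df = 10751.5, 60

total_ss = regression_ss + residual_ss          # TOTAL = Regression + Residual
regression_ms = regression_ss / regression_df   # MS = SS / DF
residual_ms = residual_ss / residual_df

f_ratio = regression_ms / residual_ms           # F = ratio of the mean squares
r_squared = regression_ss / total_ss            # proportion explained
s = sqrt(residual_ms)                           # S = sqrt of residual MS

print(f"Total SS    = {total_ss:.1f}")     # 18665.1
print(f"Residual MS = {residual_ms:.1f}")  # 179.2
print(f"F           = {f_ratio:.2f}")      # 44.16
print(f"R-Sq        = {r_squared:.1%}")    # 42.4%
print(f"S           = {s:.3f}")            # 13.386
```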

Here n2 is N-2 = 60, the Residual Error degrees of freedom.

$135 Million

http://www.nytimes.com/2006/06/19/arts/design/19klim.html?ex=1308369600&en=37eb32381038a749&ei=5088&partner=rssnyt&emc=rss

Klimt, to Ronald Lauder

$100 Million … sort of

Stephen Wynn with a Prized Possession, 2007

An Enduring Art Mystery

Why do larger paintings command higher prices?

The Persistence of Memory. Salvador Dali, 1931

The Persistence of Statistics. Hildebrand, Ott and Gray, 2005

Graphics show relative sizes of the two works.

Previously sold for exp(16.6374) = $16.8M

Monet in Large and Small

[Figure: Fitted Line Plot. ln (US$) = 2.825 + 1.725 ln (SurfaceArea). S = 1.00645, R-Sq = 20.0%, R-Sq(adj) = 19.8%.]

Log of $price = a + b log surface area + e

Sale prices of 328 signed Monet paintings

The residuals do not show any obvious patterns that seem inconsistent with the assumptions of the model.

The Data

[Figure: Histogram of ln (SurfaceArea)]

[Figure: Histogram of ln (US$)]

Note: Logs are used in this context. This is common when analyzing financial measurements (e.g., price) and when percentage changes are more interesting than unit changes. (E.g., what is the % premium when the painting is 10% larger?)

Monet Regression: There seems to be a regression. Is there a theory?

Conclusions about F

R² answers the question of how well the model fits the data.

F answers the question of whether there is a statistically valid fit (as opposed to no fit).

What remains is the question of whether there is a valid relationship, i.e., is β different from zero?

The Regression Slope

The model is yi = α + β xi + εi

The “relationship” depends on β. If β equals zero, there is no relationship.

The least squares slope, b, is the estimate of β based on the sample. It is a statistic based on a random sample. We cannot be sure it equals the true β.

To accommodate this view, we form a range of uncertainty around b. I.e., a confidence interval.

Uncertainty About the Regression Slope

Hypothetical Regression: Fuel Bill vs. Number of Rooms

The regression equation is
Fuel Bill = -252 + 136 Number of Rooms

Predictor    Coef  SE Coef      T      P
Constant   -251.9    44.88  -5.20  0.000
Rooms       136.2     7.09   19.9  0.000

S = 144.456   R-Sq = 72.2%   R-Sq(adj) = 72.0%

The Rooms coefficient, 136.2, is b, the estimate of β.

Its “Standard Error” (SE), 7.09, is the measure of uncertainty about the true value.

The “range of uncertainty” is b ± 2 SE(b). (Actually 1.96, but people use 2)

Internet Buzz Regression

Regression Analysis: BoxOffice versus Buzz

The regression equation is
BoxOffice = - 14.4 + 72.7 Buzz

Predictor     Coef  SE Coef      T      P
Constant   -14.360    5.546  -2.59  0.012
Buzz         72.72    10.94   6.65  0.000

S = 13.3863   R-Sq = 42.4%   R-Sq(adj) = 41.4%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       1   7913.6  7913.6  44.16  0.000
Residual Error  60  10751.5   179.2
Total           61  18665.1

Range of uncertainty for b: 72.72 - 1.96(10.94) to 72.72 + 1.96(10.94) = [51.28 to 94.16]
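The range of uncertainty is easy to compute directly. This minimal sketch uses the Buzz coefficient and standard error as printed in the output (72.72 and 10.94); with those rounded inputs the endpoints come to about 51.28 and 94.16. The function name `range_of_uncertainty` is hypothetical.

```python
def range_of_uncertainty(b, se, z=1.96):
    """Return (lower, upper) for b plus or minus z standard errors."""
    half_width = z * se
    return b - half_width, b + half_width


# Buzz slope and standard error from the regression output.
lower, upper = range_of_uncertainty(72.72, 10.94)
includes_zero = lower <= 0.0 <= upper

print(f"[{lower:.2f}, {upper:.2f}]")        # [51.28, 94.16]
print(f"includes zero: {includes_zero}")    # False
```

Passing z=2 instead of 1.96 gives the slightly wider "people use 2" version of the interval.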

Elasticity in the Monet Regression:

b = 1.7246. This is the elasticity of price with respect to area.

The confidence interval would be 1.7246 ± 1.96(.1908) = [1.3506 to 2.0986].

The fact that this does not include 1.0 is an important result: prices for Monet paintings are extremely elastic with respect to the area.
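To see what this elasticity means in practice, the sketch below answers the earlier question about the premium for a painting that is 10% larger. In the log-log model, multiplying the area by a factor multiplies the predicted price by that factor raised to the power b. The 10% figure comes from the note on logs; applying the interval endpoints to get a range for the premium is an illustrative assumption, not a calculation from the lecture.

```python
# Monet regression: slope and standard error from the notes.
b = 1.7246
se = 0.1908

area_ratio = 1.10  # painting is 10% larger

# Interval for the elasticity, b plus or minus 1.96 standard errors.
b_low = b - 1.96 * se
b_high = b + 1.96 * se


def premium_pct(elasticity, ratio):
    """Percentage change in predicted price when area is multiplied by ratio."""
    return 100 * (ratio ** elasticity - 1)


premium = premium_pct(b, area_ratio)
premium_low = premium_pct(b_low, area_ratio)
premium_high = premium_pct(b_high, area_ratio)

print(f"elasticity interval: [{b_low:.4f}, {b_high:.4f}]")
print(f"premium for a 10% larger painting: {premium:.1f}%")
print(f"range implied by the interval: {premium_low:.1f}% to {premium_high:.1f}%")
```

Because the whole interval for b lies above 1.0, even the low end of the premium exceeds 10%: price rises more than proportionally with area.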

Conclusion about b

So, should we conclude the slope is not zero?

Does the range of uncertainty include zero?
  No: then you should conclude the slope is not zero.
  Yes: then you can’t be very sure that β is not zero.

Tying it together. If the range of uncertainty does not include 0.0, then:
  The ratio b/SE is larger than 2 in absolute value.
  The square of the ratio is larger than 4.
  The square of the ratio is F.
  F larger than 4 gave the same conclusion.
They are looking at the same thing.
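The Buzz regression gives a quick check of this equivalence. The sketch below squares the ratio b/SE and compares it with the reported F. It also squares the normal critical value 1.96 to show where a benchmark near 4 comes from; connecting the benchmark to 1.96² in this way is a large-sample approximation added here for illustration.

```python
from statistics import NormalDist

# Buzz slope, standard error, and F as reported in the regression output.
b, se = 72.72, 10.94
f_reported = 44.16

t_ratio = b / se
t_squared = t_ratio ** 2

# Two-sided 5% critical value of the standard normal, about 1.96.
z = NormalDist().inv_cdf(0.975)

print(f"b/SE      = {t_ratio:.2f}")      # 6.65, the T in the output
print(f"(b/SE)^2  = {t_squared:.2f}")    # close to F = 44.16
print(f"1.96^2    = {z ** 2:.2f}")       # 3.84, rounded up to 4 in practice
```

The small gap between (b/SE)² and the reported F comes from using the rounded b and SE printed in the output.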

Summary

The regression model: theory
Least squares results: a, b, s, R²
The fit of the regression model to the data: ANOVA and R²
The F statistic and R²
Uncertainty about the regression slope
