Regression Analysis: The Motorpool Example

Page 1:

Regression Analysis

• The Motorpool Example

• Looking just at two-dimensional shadows, we don’t see the true effects of the variables.

• We need a way to look at all the dimensions of a relationship at the same time.

Page 2:

The Regression Model

Costs = α + β1 · Mileage + β2 · Age + β3 · Make + ε

• Costs is the dependent variable.
• Mileage, Age, and Make are the explanatory variables (or independent variables).
• α, β1, β2, β3 are the coefficients of the linear mathematical structure.
• ε is the residual term.

What we’re talking about is sometimes explicitly called “linear regression analysis,” since it assumes that the underlying relationship is linear (i.e., a straight line in two dimensions, a plane in three, and so on)!
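To make the structure concrete, here is a minimal Python sketch (not from the original slides) that fits a linear model of this form by ordinary least squares. The small data array is made up purely for illustration; only the variable names come from the motorpool example.

    import numpy as np

    # Hypothetical motorpool-style data: Mileage (thousands of miles), Age (years),
    # Make (0 = Ford, 1 = Honda), and annual Costs ($).
    mileage = np.array([15.0, 22.0, 9.0, 18.0, 30.0, 12.0])
    age     = np.array([0.0,  2.0, 1.0, 3.0,  4.0,  1.0])
    make    = np.array([1.0,  0.0, 1.0, 0.0,  0.0,  1.0])
    costs   = np.array([610., 920., 480., 880., 1250., 560.])

    # Design matrix with a column of ones for the intercept (alpha).
    X = np.column_stack([np.ones_like(mileage), mileage, age, make])

    # Least-squares estimates of (alpha, beta1, beta2, beta3).
    coef, *_ = np.linalg.lstsq(X, costs, rcond=None)
    print(coef)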

Page 3:

Why Spend All This Time on such a Limited Tool?

• Some interesting relationships are linear.
• All relationships are locally linear!
• Several of the most commonly encountered nonlinear relationships in management can be translated into linear relationships, studied using regression analysis, and the results then untranslated back to the original problem! (This is part of what we’ll learn in Sessions 3 and 4.)

Page 4:

A Few Final Assumptions Concerning the Residual Term

• The validity of regression analysis depends on several assumptions concerning the residual term.

– E[ε] = 0. This is purely a cosmetic assumption. The estimate of α will include any on-average residual effects which are different from zero.

– ε varies normally across the population. While a substantive assumption, this is typically true, due to the Central Limit Theorem, since the residual term is the total of a myriad of other, unidentified explanatory variables. If this assumption is not correct, all statements regarding confidence intervals for individual predictions might be invalid.

• The following additional assumptions will be discussed later in the course.

– StdDev[ε] does not vary with the values of the explanatory variables. (This is called the homoskedasticity assumption.) Again, if this assumption is not correct, all statements regarding confidence intervals for individual predictions might be invalid.

– ε is uncorrelated with the explanatory variables of the model. The regression analysis will “attribute” as much of the variation in the dependent variable as it can to the explanatory variables. If some unidentified factor covaries with one of the explanatory variables, the estimate of that explanatory variable’s coefficient (i.e., the estimate of its effect in the relationship) will suffer from “specification bias,” since the explanatory variable will have both its own effect, and some of the effect of the unidentified variable, attributed to it. This is why, when doing a regression for the purpose of estimating the effect of some explanatory variable on the dependent variable, we try to work with the most “complete” model possible.

Page 5:

1. Predictions

• Given an individual, and some information about that individual, predict what the dependent variable will be.

– What annual maintenance and repair cost (Costs) would you predict for a new (Age = 0) Honda (Make = 1) driven 15,000 miles (Mileage = 15)?

• Regress the dependent variable onto the given variables to get the “prediction equation”. Then make the prediction.

Page 6:

Predictions

The prediction equation:

Costspred = 107.34 + 29.65 · Mileage + 73.96 · Age + 47.43 · Make ( + 0 )

Page 7:

1.1 A Prediction for an Individual

Costspred = 107.34 + 29.65 · 15 + 73.96 · 0 + 47.43 · 1 = $599.49

The margin of error in the prediction (at the 95%-confidence level) is

2.2010 · $55.75 = $122.70,

and so a 95%-confidence interval for the prediction is $599.49 ± $122.70.
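A small Python sketch of this arithmetic, using the rounded numbers shown above (the standard error of the prediction, $55.75, and the critical t-value, 2.2010, are taken directly from the slide):

    # Prediction from the motorpool equation for Mileage = 15, Age = 0, Make = 1.
    intercept, b_mileage, b_age, b_make = 107.34, 29.65, 73.96, 47.43
    prediction = intercept + b_mileage * 15 + b_age * 0 + b_make * 1   # about $599.5

    # 95%-confidence interval for an individual prediction.
    t_value = 2.2010           # critical t-value from the slide
    se_prediction = 55.75      # standard error of the prediction, from the slide
    margin = t_value * se_prediction                 # about $122.7
    low, high = prediction - margin, prediction + margin
    print(round(prediction, 2), round(low, 2), round(high, 2))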

Page 8:

1.2 Prediction: The Estimated Mean for a Subgroup of Similar Individuals

$599.49 ± 2.2010 · $26.67 = $599.49 ± $58.69

• Estimate the mean annual costs for new Hondas driven 15,000 miles.

– The estimate for the group is what we’d predict for any one member of the group. The margin of error in the estimate is computed using the standard error of the estimated mean.

Page 9:

1.3 Sources of Error

Ypred = a + b1 · X1 + … + bk · Xk + 0

Y = α + β1 · X1 + … + βk · Xk + ε

The residual term ε contributes the standard error of the regression, StdDev(ε). The estimated coefficients (a, b1, …, bk) differ from the true coefficients (α, β1, …, βk), and that estimation error is measured by the standard error of the estimated mean. The two sources of error combine into the standard error of the prediction:

(standard error of the prediction)² = (standard error of the regression)² + (standard error of the estimated mean)²
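As a quick check of this relationship, the car-discount prediction output later in these slides reports a standard error of regression of 301.1917, a standard error of the estimated mean of 51.50843, and a standard error of prediction of 305.5644; a two-line Python sketch confirms they fit together:

    import math

    se_regression = 301.1917        # from the car-discount prediction output later in the deck
    se_estimated_mean = 51.50843
    se_prediction = math.sqrt(se_regression**2 + se_estimated_mean**2)
    print(round(se_prediction, 3))  # about 305.564, matching the reported 305.5644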

Page 10:

The Standard Error of the Regression

• Using the prediction equation, we predict for each sample observation.

• The difference between the prediction and the actual value of the dependent variable (i.e., the error) is an estimate of that individual’s residual.

• StdDev(ε) is estimated from these.

Indeed, the regression “process” simply finds the coefficient estimates which minimize the standard error of the regression (or equivalently, which minimize the sum of the squared residuals)!
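A minimal sketch (the function name and notation are my own, not from the slides) of how the standard error of the regression could be computed from the sample residuals, given a design matrix X that includes an intercept column and the fitted coefficients coef:

    import numpy as np

    def standard_error_of_regression(X, y, coef):
        """X: n-by-(k+1) design matrix (intercept column included); coef: fitted coefficients."""
        residuals = y - X @ coef        # estimated residual for each sample observation
        n, p = X.shape                  # p = k + 1 parameters were estimated
        sse = np.sum(residuals**2)      # the quantity least squares minimizes
        return np.sqrt(sse / (n - p))   # divide by the residual degrees of freedom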

Page 11:

A Brief Digression

• What annual maintenance and repair cost (Costs) would you predict for a Honda (Make = 1) driven 15,000 miles (Mileage = 15)?
– Regress Costs onto just Mileage and Make.

Costspred = $678.00.

Page 12:

A Brief Digression

• The prediction made using the reduced model is precisely what we would get if we predicted Age from Mileage and Make, and then Costs from all 3!

Costspred = $678.00.

Page 13:

A Brief Digression

• The prediction made using the reduced model is precisely what we would get if we predicted Age from Mileage and Make, and then Costs from all 3!

The reason we don’t take this latter approach is that the standard error of the prediction here is based on the assumption that the age of the car is precisely 1.061546 years, instead of actually being unknown.

Still, it’s reassuring to see that the numbers all fit together.

Page 14:

2. Estimating an Effect

• An additional thousand miles of driving in the course of a year adds, on average, how much to the year’s maintenance and repair costs?

– It is ESSENTIAL to note that the additional driving changes neither the car’s Age, nor its Make. In order to hold them constant while varying Mileage, we need to work with a model including ALL of the explanatory variables.

Page 15:

Estimating an Effect

The coefficient of Mileage in the most-complete model is our estimate of the impact of a one-unit (1,000-mile) change in Mileage. That coefficient is $29.65 per thousand miles (that is, 29.65 units of the dependent variable per unit of the explanatory variable).

The predictions below examine the impact of an additional thousand miles of driving for a two-year-old Ford and two two-year-old Hondas. In each comparison, the prediction is $29.65 greater for the car driven an additional 1,000 miles.

Page 16:

Estimating an Effect

Each coefficient is an estimate of the “true” coefficient, and is subject to sampling error. One standard-deviation’s-worth of uncertainty in the estimate is given by the standard error of the coefficient.

For example, a 95%-confidence interval for the coefficient of Mileage in the full model is

29.65 ± 2.2010 · 3.92 = 29.65 ± 8.62

Page 17:

3. The Explanatory Power of the Model

• Why do maintenance and repair costs vary from car to car across the current fleet?
– A partial answer is, “Because Mileage, Age, and Make vary from car to car across the fleet.”

• Indeed, variations in those three variables can potentially explain 80.78% of the overall variability in Costs across the fleet!

• This is the adjusted coefficient of determination for our model.

Page 18:

The Explanatory Power of the Model

• Names can vary: The {adjusted, corrected, unbiased} {coefficient of determination, r-squared} all refer to the same thing.
– Without an adjective, the {coefficient of determination, r-squared} refers to a number slightly larger than the “correct” number, and is a throwback to pre-computer days.

• When a new variable is added to a model which actually contributes nothing to the model (i.e., its true coefficient is 0), the adjusted coefficient of determination will, on average, remain unchanged.
– Depending on chance, it might go up or down a bit.
– If negative, interpret it as 0%.
– The thing without the adjective will always go up. That’s obviously not quite “right.”

Page 19:

The Explanatory Power of the Model

• Subtracting the adjusted coefficient of determination from 100% yields the fraction of the population-wide variation in the dependent variable which must be explained by terms still lumped together in the residual.
– If your goal is to explain everything, you want the adjusted coefficient of determination to be large.
– If your goal is to explain something, a very small value might be perfectly acceptable.

Page 20:

4. The Relative Explanatory Importance of the Explanatory Variables: The Beta-Weights

• What explains why maintenance and repair costs vary from car to car across the current fleet?

– (This is the same question as before, but now we seek a more detailed answer.)

– Compare the absolute values of the beta-weights.

Variations in Mileage across the population are roughly twice as important as are variations in Age (1.1531 vs. 0.5597), in helping to explain why Costs vary across the population.

In turn, the fact that the cars vary in Age is more than twice as important as is the fact that some are Fords, and others Hondas (0.5597 vs. 0.2193), in helping to explain why Costs vary.

Page 21:

The Beta-Weights

• You can’t compare regression coefficients directly, since they may carry different dimensions.

• The beta-weights are dimensionless, and combine how much each explanatory variable varies, with how much that variability leads to variability in the dependent variable.
– Specifically, they are the product of each explanatory variable’s standard deviation (how much it varies) and its coefficient (how much its variation affects the dependent variable), divided by the standard deviation of the dependent variable (just to remove all dimensionality).
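As a sketch of this definition, here is the computation for one variable, using the car-discount regression output shown later in the deck so the result can be checked against the reported beta-weight:

    # beta-weight = coefficient * StdDev(explanatory variable) / StdDev(dependent variable)
    sd_discount = 538.665375      # standard deviation of Discount, from the univariate statistics
    sd_income = 10273.7291        # standard deviation of Income
    coef_income = -0.035313       # coefficient of Income in the full car-discount regression

    beta_income = coef_income * sd_income / sd_discount
    print(round(beta_income, 4))  # -0.6735, matching the reported beta-weight for Income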

Page 22:

5. The Significance Levels of the t-Ratios (the p-values)

• How strong is the evidence that Mileage does play a role in the relationship involving all three explanatory variables?
– “Strength of evidence” evokes memories of hypothesis testing!
– If we wish to conclude that the evidence supports the inclusion of Mileage in our model, we must take the opposite as our null hypothesis:

• Mileage would not belong if it had no effect on Costs, i.e., if its true coefficient were 0.

Page 23:

The Significance Levels of the t-Ratios (the p-values)

• Null hypothesis: “The true coefficient of Mileage is 0.”
– Our estimate is 29.65.
– One standard-deviation’s-worth of uncertainty in the estimate is 3.915.
– Our estimate is 7.5726 standard deviations away from the hypothesized true value.
– If the truth really were 0, we’d see something this far away (or further) only 0.0011% of the time.
– The data is an overwhelmingly strong contradiction to the null hypothesis, and therefore …

The evidence is overwhelmingly strong in support of the statement that the true coefficient of Mileage differs from 0, and Mileage does belong in our model.
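A hedged Python sketch of the calculation behind the 0.0011% figure. The slides don’t state the residual degrees of freedom for the motorpool regression, but the 2.2010 critical value used earlier corresponds to 11 degrees of freedom, so that value is assumed here:

    from scipy import stats

    t_ratio = 29.65 / 3.915          # about 7.57 standard errors from zero
    df = 11                          # assumed residual degrees of freedom (implied by the critical t-value 2.2010)
    p_two_sided = 2 * stats.t.sf(abs(t_ratio), df)
    print(f"{p_two_sided:.6%}")      # on the order of 0.001%, as reported on the slide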

Page 24:

Does Make Belong in our Model?

• Null hypothesis: “The true coefficient of Make is 0.”
– Our estimate is 47.43.
– One standard-deviation’s-worth of uncertainty in the estimate is 28.98.
– Our estimate is 1.6366 standard deviations away from the hypothesized true value.
– If the truth really were 0, we’d see something this far away (or further) only 12.9983% of the time.
– The data is a bit of a contradiction to the null hypothesis, and therefore …

There’s only a bit of evidence in support of the statement that the true coefficient of Make differs from 0, and that Make does belong in our model.

So, what should we do? Leave Make in, or take it out?

Page 25:

Does Make Belong in our Model?

• There’s only a bit of evidence in support of the statement that the true coefficient of Make differs from 0, and that Make does belong in our model.
• So, what should we do? Leave Make in, or take it out?
• It depends: Remember, the belief decision must stand on three legs.
– If the Fords and Hondas came from a joint production facility …
• I’d lean towards leaving it out.
– If the Fords came from Detroit, and the Hondas from Kyoto …
• I’d lean towards leaving it in.
– More data might clarify the situation …
• The standard error of the coefficient would drop.
– If the coefficient stayed around $47, the significance level would get closer to zero, building stronger evidence for including the variable.
– If the coefficient shrank towards 0, there would continue to be no real evidence supporting Make’s inclusion, and even if it did belong, its estimated effect would be small.

Page 26:

The Significance Levels of the t-Ratios

• Imagine that you have a model.
– You introduce a new variable into that model.
– The adjusted coefficient of determination increases.

• Does this mean that the new variable belongs in your model?
– Not necessarily! Adding garbage to your model will increase the adjusted coefficient of determination a little bit around half of the time.
– The significance level (of the new variable) tells you if the adjusted coefficient of determination went up by enough to support keeping the new variable.

Page 27:

Summary

1. Predictions
What annual maintenance and repair cost (Costs) would you predict for a new (Age = 0) Honda (Make = 1) driven 15,000 miles (Mileage = 15)?
Regress the dependent variable onto all known (for this individual) explanatory variables.
Look at (prediction) ± (~2)·(standard error of prediction).

Estimate the mean annual costs for new Hondas (note the plural!) driven 15,000 miles.
Regress the dependent variable onto all known explanatory variables.
Look at (prediction) ± (~2)·(standard error of estimated mean).

2. Estimating an Effect
An additional thousand miles of driving in the course of a year adds, on average, how much to the year’s maintenance and repair costs?
Regress the dependent variable onto all explanatory variables (use the most-complete model).
Look at (estimated coefficient) ± (~2)·(standard error of coefficient).

3. The Explanatory Power of the Model
Why do maintenance and repair costs vary from car to car across the current fleet?
Look at the adjusted coefficient of determination to see how much of the variation in the dependent variable can be jointly explained by variations in the included explanatory variables.

4. The Relative Explanatory Importance of the Explanatory Variables
Variation in which explanatory variable is most important in explaining why maintenance and repair costs vary from car to car across the current fleet?
Compare the absolute values of the beta-weights.

5. The Significance Levels of the t-Ratios
How strong is the evidence that Mileage does play a role in the relationship involving all three explanatory variables?
The smaller the significance level, the stronger the evidence that this variable has a non-zero coefficient in this model.

Page 28:

Regression Analysis: How to Do It

Example: The “car discount” dataset

Page 29:

Discounts on Car Purchases

• Of course, no one pays list price for a new car. Realizing this, the owner of a new-car dealership has decided to conduct a study, to attempt to understand better the relationship between customer characteristics and customer success in negotiating a discount from his salespeople.

• He collects data on a sample of 100 purchasers of mid-size cars (he has already sold several thousand of these cars):
– Specifically, he notes the age, annual income, and sex (men were represented by 0, and women by 1, in the coding of sex) of each purchaser (obtained from credit records), together with the discount from list price which the purchaser finally received.

Page 30:

Discounts on Car Purchases

• He collects data on a sample of 100 purchasers of mid-size cars (he has already sold several thousand of these cars):
– He notes the age, annual income, and sex of each purchaser, together with the discount from list price which the purchaser finally received.

Discount ($) negotiated on the purchase of a car; age of purchaser (years), annual income ($), and sex (M/F = 0/1):

Discount   Age   Income   Sex
1003       28    47658    1
1394       41    32126    1
2542       21    28374    1
1658       47    29321    0
1374       29    38016    1
1536       43    25343    0
1402       54    30310    0
692        35    45709    0
947        41    46242    0
1415       19    27933    1
…          …     …        …

Page 31:

Discounts on Car Purchases

• He collects data on a sample of 100 purchasers of mid-size cars (he has already sold several thousand of these cars):
– He notes the age, annual income, and sex of each purchaser, together with the discount from list price which the purchaser finally received.
• Why mid-size cars only?
– To avoid needing to include model/price of car.
• Other possible explanatory variables?
– About the purchaser:
• Negotiation training
• Preparatory research
• Significant other
– About the salesperson:
• Identity
• Biases

Page 32:

Look at the Univariate Statistics

• This will give you a sense of how each variable varies individually:
– Estimate of population mean (or proportion)
– Standard deviation and extremes
– 95%-confidence interval for population mean (or proportion)

• Estimate ± (~2)·(standard error of the mean)
• Estimate ± “margin of error” (at the 95%-confidence level)

Page 33:

Univariate statistics

                               Discount      Age          Income       Sex
mean                           1268.24       37.1         35705.17     0.46
standard deviation             538.665375    9.91122209   10273.7291   0.50090827
standard error of the mean     53.8665375    0.99112221   1027.37291   0.05009083
minimum                        130           19           19119        0
median                         1310.5        37           34401.5      0
maximum                        2542          58           64648        1
range                          2412          39           45529        1
skewness                       -0.018        0.154        0.452        0.163
kurtosis                       -0.710        -0.633       -0.270       -2.014

number of observations                              100
t-statistic for computing 95%-confidence intervals  1.9842

For example, $1,268.24 ± 1.9842 · $53.87, or 46% ± 1.9842 · 5.01%.
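A minimal Python sketch of how these confidence-interval pieces fit together, using the Discount column’s reported summary numbers (the 1.9842 critical value is the t-value for n − 1 = 99 degrees of freedom):

    from scipy import stats

    n = 100
    mean_discount = 1268.24
    sd_discount = 538.665375
    sem = sd_discount / n**0.5                 # standard error of the mean, about 53.87

    t_crit = stats.t.ppf(0.975, df=n - 1)      # about 1.9842
    margin = t_crit * sem
    print(round(mean_discount - margin, 2), round(mean_discount + margin, 2))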

Page 34:

The Full Regression

• The “most-complete” model provides …
– The best predictive model (pretty much)
– The most accurate estimate of the “pure effect” of each explanatory variable on the dependent variable

• Specifically, the difference in the dependent variable typically associated with one unit of difference in one explanatory variable when the others are held constant.

Page 35:

Regression: Discount

                      constant       Age           Income        Sex
coefficient           1971.72565     9.48991379    -0.035313     446.294355
std error of coef     146.147064     3.6320188     0.00366827    64.5567912
t-ratio               13.4914        2.6128        -9.6266       6.9132
significance          0.0000%        1.0423%       0.0000%       0.0000%
beta-weight                          0.1746        -0.6735       0.4150

standard error of regression      301.19175
coefficient of determination      69.68%
adjusted coef of determination    68.74%
number of observations            100
residual degrees of freedom       96
t-statistic for computing 95%-confidence intervals    1.9850

Page 36:

The Adjusted Coefficient of Determination in the Full Model

• How much of the “story” (how much of the overall variation in the dependent variable) is potentially explained by the fact that the explanatory variables themselves vary across the population?

• r² = 1 − Var(ε) / Var(Y) (roughly) = 68.74%

– How can it be increased?
• By including new relevant variables
• Including a new “garbage” variable will leave it, on average, unchanged
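The slides report both the raw and adjusted coefficients of determination but not the adjustment formula. The usual adjustment formula (an assumption on my part, not stated in the deck) reproduces the reported numbers, as this small sketch shows:

    # Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
    r_squared = 0.6968      # coefficient of determination from the full Discount regression
    n, k = 100, 3           # observations and explanatory variables
    adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
    print(round(adj_r_squared, 4))   # about 0.6873, matching the reported 68.74% up to rounding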

Page 37:

The Coefficients

• The coefficient of an explanatory variable in the most-complete model …
– Is an estimate of the average difference in the dependent variable for two distinct individuals who differ (by one unit) only in that explanatory variable.
– Is an estimate of the average difference we’d expect to see in a specific individual if one aspect alone were slightly different (and all other aspects were the same).

• coefficient ± (~2)·(standard error of coefficient)

Page 38:

Regression: Discount

                      constant       Age           Income        Sex
coefficient           1971.72565     9.48991379    -0.035313     446.294355
std error of coef     146.147064     3.6320188     0.00366827    64.5567912
t-ratio               13.4914        2.6128        -9.6266       6.9132
significance          0.0000%        1.0423%       0.0000%       0.0000%
beta-weight                          0.1746        -0.6735       0.4150

standard error of regression      301.19175
coefficient of determination      69.68%
adjusted coef of determination    68.74%
number of observations            100
residual degrees of freedom       96
t-statistic for computing 95%-confidence intervals    1.9850

$9.49 ± 1.9850 · $3.63 per year of Age (with same Income and Sex), or
-$0.0353 ± 1.9850 · $0.0037 per dollar of Income (with same Age and Sex), or
$446.29 ± 1.9850 · $64.56 more for a woman (1) than for a man (0) (with same Age and Income).

Page 39:

Predictions, using most-recent regression

                                  coefficients   values for prediction
constant                          1971.7256
Age                               9.4899138      30          31          30          30
Income                            -0.035313      35000       35000       36000       35000
Sex                               446.29435      1           1           1           0

predicted value of Discount                      1466.762    1476.252    1431.449    1020.468
standard error of prediction                     305.5644    305.2995    305.8382    305.2382
standard error of regression                     301.1917    301.1917    301.1917    301.1917
standard error of estimated mean                 51.50843    49.91316    53.10864    49.53689

confidence level          95.00%
t-statistic               1.9850
residual degr. freedom    96

confidence limits        lower                   860.2218    870.2374    824.3652    414.5748
for prediction           upper                   2073.303    2082.267    2038.533    1626.361

confidence limits        lower                   1364.519    1377.175    1326.029    922.1379
for estimated mean       upper                   1569.006    1575.329    1536.869    1118.798


Page 40:

Tests involving Coefficients

• In the full model, how strongly does the evidence support saying, “Sex ≥ $200”?
• H0: Sex ≤ $200; significance 0.01204% (overwhelmingly strong evidence against H0, hence supporting the original statement).

Inputs to the tool:
446.294    estimate/prediction of unknown quantity
64.557     measure of uncertainty
100        sample size
3          number of explanatory variables in regression (or 0 if dealing with a population mean)

Significance level of the data with respect to the null hypothesis (from the t-distribution with 96 degrees of freedom):
true value ≥ 200     100.00000%
true value = 200     0.02408%
true value ≤ 200     0.01204%

From Session-1’s “Hypothesis_Testing_Tool.xls”
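A hedged Python sketch of what the Excel tool is computing here: the one-sided p-value for the null hypothesis that the Sex coefficient is at most $200, using a t-distribution with 96 residual degrees of freedom.

    from scipy import stats

    estimate = 446.294        # coefficient of Sex in the full Discount regression
    se = 64.557               # its standard error
    hypothesized = 200.0
    df = 96                   # residual degrees of freedom

    t_stat = (estimate - hypothesized) / se        # about 3.82 standard errors above 200
    p_one_sided = stats.t.sf(t_stat, df)           # chance of an estimate this large if the truth were 200
    print(f"{p_one_sided:.5%}")                    # about 0.012%, in line with the tool's 0.01204%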

Page 41:

Tests involving Coefficients

• Other statements?

Statement      Significance level of data                 Strength of evidence
               (with respect to the opposite statement)   supporting statement
Sex ≥ $200     0.01204%                                   overwhelming
Sex ≥ $300     1.28444%                                   very strong
Sex ≥ $350     6.95385%                                   somewhat strong
Sex ≥ $400     23.75235%                                  quite weak

From Session-1’s “Hypothesis_Testing_Tool.xls”

Page 42:

Predictions

• Based on ANY model, what would we predict the dependent variable to be, if all we knew about an individual were the given values for the listed explanatory variables?

• Prediction ± (~2)·(standard error of the prediction)

• What would we expect to see, on average, across a large pool of similar individuals?

• Prediction ± (~2)·(std. error of the estimated mean)

Page 43:

Prediction, using most-recent regression

                                 constant     Age         Income      Sex
coefficients                     1971.726     9.489914    -0.03531    446.2944
values for prediction                         30          35000       1

predicted value of Discount      1466.762
standard error of prediction     305.5644
standard error of regression     301.1917
standard error of estimated mean 51.50843

confidence level          95.00%
t-statistic               1.9850
residual degr. freedom    96

confidence limits for prediction:       lower 860.2218    upper 2073.303
confidence limits for estimated mean:   lower 1364.519    upper 1569.006


$1,466.76 ± 1.9850 · $305.56, an individual prediction for a 30-year-old woman earning $35,000/year

$1,466.76 ± 1.9850 · $51.51, an estimate of the large-group mean for 30-year-old women earning $35,000/year

Page 44:

Significance

• The significance level of the t-ratio (for each variable separately)

• Sometimes called the “p-value” for that variable

– How strong is the evidence that, in a model already containing all of the other explanatory variables, this variable “belongs” (i.e., has a non-zero coefficient of its own)?

– Equivalently, is this a variable whose value we’d like to know when predicting for a specific individual?

• Close to zero = strong evidence it DOES belong (our null hypothesis is that it doesn’t)

Page 45:

Regression: Discount

                      constant       Age           Income        Sex
coefficient           1971.72565     9.48991379    -0.035313     446.294355
std error of coef     146.147064     3.6320188     0.00366827    64.5567912
t-ratio               13.4914        2.6128        -9.6266       6.9132
significance          0.0000%        1.0423%       0.0000%       0.0000%
beta-weight                          0.1746        -0.6735       0.4150

standard error of regression      301.19175
coefficient of determination      69.68%
adjusted coef of determination    68.74%
number of observations            100
residual degrees of freedom       96
t-statistic for computing 95%-confidence intervals    1.9850

Page 46:

Significance (continued)

• Null hypothesis: “In the current model, the true coefficient of this variable is 0.”
– The coefficient of this variable is our estimate.
– (coefficient) / (standard error of the coefficient) tells us how many standard deviations away from the hypothesized truth (0) the estimate is.

• significance = Pr(we’d be this far away just by chance)

– Close to 0% = (recall the coin-flipping story)
• highly contradictory to the null hypothesis
• strongly supportive of the alternative (it DOES belong)

Page 47:

Significance (continued)

• The significance level deals with the marginal contribution of a variable to the current model.

• Adding an irrelevant explanatory variable to a regression model will increase the adjusted coefficient of determination about half the time. The significance level tells us if the coefficient of determination went up by enough to argue that the new variable is relevant.

Page 48:

The Beta-Weights

• Why is Discount varying from one sale to the next?
– What’s the relative explanatory “power” of (variation in) each of the explanatory variables (in explaining the currently-observed variability in the dependent variable across the population)?
– The comparative magnitudes of the beta-weights (for all of the explanatory variables together in the model) answer this question.

Page 49:

Regression: Discount

                      constant       Age           Income        Sex
coefficient           1971.72565     9.48991379    -0.035313     446.294355
std error of coef     146.147064     3.6320188     0.00366827    64.5567912
t-ratio               13.4914        2.6128        -9.6266       6.9132
significance          0.0000%        1.0423%       0.0000%       0.0000%
beta-weight                          0.1746        -0.6735       0.4150

standard error of regression      301.19175
coefficient of determination      69.68%
adjusted coef of determination    68.74%
number of observations            100
residual degrees of freedom       96
t-statistic for computing 95%-confidence intervals    1.9850

• Why does discount vary across the population?

• Primarily, because Income varies.
• Secondarily, because some purchasers are men and others are women (i.e., Sex varies).

Page 50:

The Beta-Weights (continued)

• Each answers the question:
– If two individuals have the same values for all the explanatory variables in the model except one, and for this one their values differ by one standard-deviation’s-worth of variability (in this variable), then their predicted values for the dependent variable would differ by how many standard deviations (of variability in the dependent variable)?

• “Typical” variation in each of the explanatory variables alone can explain (relatively) how much of the observed variability in the dependent variable?

Page 51:

We Can Explore Other Models

• We can drop variables:
– Are older or younger purchasers currently getting larger discounts?
• We can change the dependent variable:
– Are the female purchasers, on average, older or younger than the male purchasers?
– What’s the impact of aging on purchaser income?

Page 52:

Regression: Age

                      constant       Sex
coefficient           38.9074074     -3.9291465
std error of coef     1.32861376     1.95893412
t-ratio               29.2842        -2.0058
significance          0.0000%        4.7639%
beta-weight                          -0.1986

standard error of regression      9.76327735
coefficient of determination      3.94%
adjusted coef of determination    2.96%
number of observations            100
residual degrees of freedom       98
t-statistic for computing 95%-confidence intervals    1.9845

Male purchasers are, on average, 38.91 years old.

Female purchasers are, on average, 3.93 years younger than the men.

Are the female purchasers, on average, older or younger than the male purchasers?

Page 53:

Regression: Discount

                      constant       Age
coefficient           1817.16511     -14.795825
std error of coef     202.794627     5.28272248
t-ratio               8.9606         -2.8008
significance          0.0000%        0.6142%
beta-weight                          -0.2722

standard error of regression      520.957868
coefficient of determination      7.41%
adjusted coef of determination    6.47%
number of observations            100
residual degrees of freedom       98
t-statistic for computing 95%-confidence intervals    1.9845

If the “pure” effect of an additional year of age is to increase a purchaser’s discount, then what explains the negative coefficient of Age in this simpler regression?

• An older patron is likely to have a higher income (which typically is associated with a smaller discount)

• An older patron is more likely to be male (which typically is associated with a smaller discount)

Page 54:

Regression: Discount

                      constant       Age            Sex
coefficient           1292.48764     -8.4694585     630.367989
std error of coef     178.496906     4.34612096     85.9945283
t-ratio               7.2410         -1.9487        7.3303
significance          0.0000%        5.4216%        0.0000%
beta-weight                          -0.1558        0.5862

A Reconciliation across Models

• On these next three slides, we’ll focus on the “older people have higher incomes” effect:

• As a patron ages by a year (and his/her sex stays unchanged!), his/her discount typically drops by $8.47.

Page 55:

As the patron ages by a year (and his/her sex stays unchanged!), his/her income typically rises by $508.58.

Regression: Income

                      constant       Age            Sex
coefficient           19234.7835     508.57672      -5212.6301
std error of coef     3542.55077     86.25558       1706.69615
t-ratio               5.4296         5.8962         -3.0542
significance          0.0000%        0.0000%        0.2913%
beta-weight                          0.4906         -0.2541

Page 56:

The combined age and income effects are precisely what we originally estimated for an additional year of age, when income was not held constant.

Regression: Discount

                      constant       Age           Income        Sex
coefficient           1971.72565     9.48991379    -0.035313     446.294355
std error of coef     146.147064     3.6320188     0.00366827    64.5567912
t-ratio               13.4914        2.6128        -9.6266       6.9132
significance          0.0000%        1.0423%       0.0000%       0.0000%
beta-weight                          0.1746        -0.6735       0.4150

9.48991379     impact of Age
508.57672      additional Income
-17.959372     impact of the additional Income
-8.4694585     net consequence of aging a year and earning more as a result
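This reconciliation is just arithmetic on the coefficients of the three regressions; a short Python sketch using the numbers shown above:

    age_effect_full = 9.48991379     # Age coefficient in the full model (Income held constant)
    income_per_year = 508.57672      # Age coefficient in the Income-on-Age-and-Sex regression
    income_effect = -0.035313        # Income coefficient in the full model

    net_age_effect = age_effect_full + income_per_year * income_effect
    print(round(net_age_effect, 4))  # about -8.4695, matching the Age coefficient in the Age-and-Sex-only model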

Page 57:

Conclusion

Regression: Discount

                      constant       Age            Sex
coefficient           1292.48764     -8.4694585     630.367989
std error of coef     178.496906     4.34612096     85.9945283
t-ratio               7.2410         -1.9487        7.3303
significance          0.0000%        5.4216%        0.0000%
beta-weight                          -0.1558        0.5862

To the extent that Income covaries with Age, if Income is omitted from our model, Age gets “blamed” for part of Income’s effect on Discount.

This yields the most accurate possible predictions based on Age and Sex alone, but grossly misestimates the pure effect of Age.

And that is why we try to use the “most-complete” model to estimate the pure effect of any variable on the dependent variable… and why our next session will focus on building the model itself.