Statistical Inference and Regression Analysis: GB.3302.30. Professor William Greene, Stern School of Business.

[Cover-page graphics: pie chart of percent vs. pizza type; boxplot, histogram, empirical CDF, and normal probability plot of Listing (mean 369687, st. dev. 156865, N = 51, AD = 0.994, p-value = 0.012); scatterplots and a marginal plot of Listing vs. IncomePC.]
Statistical Inference and Regression Analysis: GB.3302.30. Professor William Greene, Stern School of Business, IOMS Department, Department of Economics.

TRANSCRIPT

  • Slide 1
  • Statistical Inference and Regression Analysis: GB.3302.30 Professor William Greene Stern School of Business IOMS Department Department of Economics
  • Slide 2
  • 2/97
  • Slide 3
  • 3/97
  • Slide 4
  • Statistics and Data Analysis Part 7 Regression Model-1 Regression Diagnostics
  • Slide 5
  • 5/97 Using the Residuals How do you know the model is good? Various diagnostics to be developed over the semester. But the first place to look is at the residuals.
  • Slide 6
  • 6/97 Residuals Can Signal a Flawed Model Standard application: Cost function for output of a production process. Compare linear equation to a quadratic model (in logs) (123 American Electric Utilities)
  • Slide 7
  • 7/97 Electricity Cost Function
  • Slide 8
  • 8/97 Candidate Model for Cost Log c = a + b log q + e. Callouts on the scatterplot: most of the points in the leftmost area are above the regression line, most of the points in the middle area are below it, and most of the points in the rightmost area are above it.
  • Slide 9
  • 9/97 A Better Model? Log Cost = α + β1 logOutput + β2 [logOutput]² + ε
  • Slide 10
  • 10/97 Candidate Models for Cost The quadratic equation is the appropriate model. Log c = a + b1 log q + b2 (log q)² + e
  • Slide 11
  • 11/97 Missing Variable Included Residuals from the quadratic cost model Residuals from the linear cost model
  • Slide 12
  • 12/97 Unusual Data Points Outliers have (what appear to be) very large disturbances. Examples: wolf weight vs. tail length; the 500 most successful movies.
  • Slide 13
  • 13/97 Outliers 99.5% of observations will lie within the mean ± 3 standard deviations. (We show (a + bx) ± 3s_e below.) Titanic is 8.1 standard deviations from the regression! Only 0.86% of the 466 observations lie outside the bounds. (We will refine this later.) These observations might deserve a close look.
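    A minimal sketch of this 3-standard-deviation screen in Python (numpy). Everything below is fabricated stand-in data (the coefficients loosely echo the buzz regression shown later); the variable names are illustrative, not the course's data set:

        import numpy as np

        # Fabricated stand-in for the 466-observation movie sample; flags
        # points outside the (a + b x) +/- 3 s_e band described above.
        rng = np.random.default_rng(0)
        x = rng.uniform(0.0, 1.0, 466)
        y = -14.36 + 72.72 * x + rng.normal(0.0, 13.4, 466)

        b, a = np.polyfit(x, y, 1)              # fitted slope and intercept
        resid = y - (a + b * x)
        s_e = resid.std(ddof=2)                 # residual standard deviation
        flagged = np.abs(resid) > 3 * s_e
        print(flagged.sum(), "of", len(y), "observations outside the bounds")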
  • Slide 14
  • 14/97 logPrice = a + b logArea + e Prices paid at auction for Monet paintings vs. surface area (in logs) Not an outlier: Monet chose to paint a small painting. Possibly an outlier: Why was the price so low?
  • Slide 15
  • 15/97 What to Do About Outliers (1) Examine the data (2) Are they due to mismeasurement error or obvious coding errors? Delete the observations. (3) Are they just unusual observations? Do nothing. (4) Generally, resist the temptation to remove outliers. Especially if the sample is large. (500 movies is large. 10 wolves is not.) (5) Question why you think it is an outlier. Is it really?
  • Slide 16
  • 16/97 Regression Options
  • Slide 17
  • 17/97 Diagnostics
  • Slide 18
  • 18/97 On Removing Outliers Be careful about singling out particular observations this way. The resulting model might be a product of your opinions, not the real relationship in the data. Removing outliers might create new outliers that were not outliers before. Statistical inferences from the model will be incorrect.
  • Slide 19
  • Statistics and Data Analysis Part 7 Regression Model-2 Statistical Inference
  • Slide 20
  • 20/97 b As a Statistical Estimator What is the interest in b? b estimates β = dE[y|x]/dx: the effect of a policy variable on the expectation of a variable of interest; the effect of medication dosage on disease response; and many others.
  • Slide 21
  • 21/97 Application: Health Care Data German Health Care Usage Data: 27,326 observations on German households, 1984-1994.
    DOCTOR = 1(number of doctor visits > 0)
    HOSPITAL = 1(number of hospital visits > 0)
    HSAT = health satisfaction, coded 0 (low) to 10 (high)
    DOCVIS = number of doctor visits in last three months
    HOSPVIS = number of hospital visits in last calendar year
    PUBLIC = 1 if insured in public health insurance, 0 otherwise
    ADDON = 1 if insured by add-on insurance, 0 otherwise
    INCOME = household nominal monthly net income in German marks / 10000
    HHKIDS = 1 if children under age 16 in the household, 0 otherwise
    EDUC = years of schooling
    AGE = age in years
    MARRIED = marital status
  • Slide 22
  • 22/97 Regression? Population relationship: Income = α + β Health + ε. For this population, Income = .31237 + .00585 Health + ε, so E[Income | Health] = .31237 + .00585 Health.
  • Slide 23
  • 23/97 Distribution of Health
  • Slide 24
  • 24/97 Distribution of Income
  • Slide 25
  • 25/97 Average Income | Health. Cell counts by health level (they sum to the 27,326 observations):
    Health (HSAT):   0    1    2     3     4     5     6     7     8     9     10
    N_j:            447  255  642  1173  1390  4233  2570  4191  6172  3061  3192
  • Slide 26
  • 26/97 b is a statistic Random because it is a sum of the ε's. It has a distribution, like any sample statistic.
  • Slide 27
  • 27/97 Sampling Experiment 500 samples of N = 52 drawn from the 27,326 (using a random number generator to simulate N observation numbers from 1 to 27,326). Compute b with each sample. Histogram of the 500 values.
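    A minimal sketch of the sampling experiment in Python (numpy). The population below is a fabricated stand-in for the health data (the intercept and slope come from the earlier slide; the noise scale is an assumption), since the original file is not included:

        import numpy as np

        rng = np.random.default_rng(1)

        # Fabricated stand-in for the 27,326-observation population
        # (x = HSAT, y = INCOME); noise scale is assumed.
        N_POP = 27326
        x = rng.integers(0, 11, N_POP).astype(float)
        y = 0.31237 + 0.00585 * x + rng.normal(0.0, 0.15, N_POP)

        def ols_slope(xs, ys):
            xd = xs - xs.mean()
            return xd @ (ys - ys.mean()) / (xd @ xd)

        # 500 samples of N = 52, drawn by generating random observation numbers
        bs = np.array([ols_slope(x[idx], y[idx])
                       for idx in (rng.integers(0, N_POP, 52)
                                   for _ in range(500))])
        print(bs.mean(), bs.std())   # centers near 0.00585; looks normal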
  • Slide 28
  • 28/97
  • Slide 29
  • 29/97 Conclusions Sampling variability. Seems to center on β. Appears to be normally distributed.
  • Slide 30
  • 30/97 Distribution of slope estimator, b Assumptions: (Model) Regression: y_i = α + βx_i + ε_i. (Crucial) Exogenous data: data x and noise ε are independent; E[ε|x] = 0 or Cov(ε, x) = 0. (Temporary) Var[ε|x] = σ², not a function of x (homoscedastic). Results: What are the properties of b?
  • Slide 31
  • 31/97
  • Slide 32
  • 32/97 (1) b is unbiased and linear in ε
  • Slide 33
  • 33/97 (2) b is efficient Gauss-Markov Theorem: like Rao-Blackwell. (Proof in Greene.) The variance of b is smallest among linear unbiased estimators.
  • Slide 34
  • 34/97 (3) b is consistent
  • Slide 35
  • 35/97 Consistency: N=52 vs. N=520
  • Slide 36
  • 36/97 a is unbiased and consistent
  • Slide 37
  • 37/97 Covariance of a and b
  • Slide 38
  • 38/97 Inference about β We have derived the expected value and variance of b. b is a point estimator. We are looking for a way to form a confidence interval. We need a distribution and a pivotal statistic to use.
  • Slide 39
  • 39/97 Normality
  • Slide 40
  • 40/97 Confidence Interval
  • Slide 41
  • 41/97 Estimating sigma squared
  • Slide 42
  • 42/97 Usable Confidence Interval Use s instead of σ. Use the t distribution instead of the normal. The critical t depends on the degrees of freedom: b - t·s_b < β < b + t·s_b
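    A sketch of this interval in Python (scipy), plugging in the buzz-regression numbers from the next slide; note the printed table there uses the normal critical value 1.96 rather than the exact t:

        from scipy import stats

        # Buzz-regression numbers from the next slide: b = 72.7181,
        # s_b = 10.94249, N = 62, so N - 2 = 60 degrees of freedom.
        b, s_b, dof = 72.7181, 10.94249, 60
        t_crit = stats.t.ppf(0.975, dof)           # two-sided 95% critical value
        print(b - t_crit * s_b, b + t_crit * s_b)  # about 50.83 to 94.61
        # (The printed table uses z = 1.96 instead: 51.2712 to 94.1650.)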
  • Slide 43
  • 43/97 Slope Estimator
  • Slide 44
  • 44/97 Regression Results
    -----------------------------------------------------------------------------
    Ordinary least squares regression
    LHS=BOX     Mean                 =   20.72065
                Standard deviation   =   17.49244
                No. of observations  =   62          DegFreedom   Mean square
    Regression  Sum of Squares       =   7913.58         1         7913.57745
    Residual    Sum of Squares       =   10751.5        60          179.19235
    Total       Sum of Squares       =   18665.1        61          305.98555
                Standard error of e  =   13.38627    Root MSE       13.16860
    Fit         R-squared            =   .42398      R-bar squared   .41438
    Model test  F[ 1, 60]            =   44.16247    Prob F > F*     .00000
    --------+--------------------------------------------------------------------
            |              Standard            Prob.        95% Confidence
         BOX|  Coefficient    Error       t    |t|>T*           Interval
    --------+--------------------------------------------------------------------
    Constant|   -14.3600**   5.54587   -2.59   .0121      -25.2297    -3.4903
    CNTWAIT3|    72.7181*** 10.94249    6.65   .0000       51.2712    94.1650
    --------+--------------------------------------------------------------------
    Note: ***, **, * ==> Significance at 1%, 5%, 10% level.
    -----------------------------------------------------------------------------
  • Slide 45
  • 45/97 Hypothesis Test about β Outside the confidence interval is the rejection region for hypothesis tests about β. For the internet buzz regression, the confidence interval is 51.2712 to 94.1650. The hypothesis that β equals zero is rejected.
  • Slide 46
  • Statistics and Data Analysis Part 7-3 Prediction
  • Slide 47
  • 47/97 Predicting y Using the Regression Actual y0 is α + βx0 + ε0. Prediction is ŷ0 = a + bx0 + 0 (the disturbance is predicted by its mean, zero). Error is y0 - ŷ0 = (α - a) + (β - b)x0 + ε0. Variance of the error is Var[a] + x0² Var[b] + 2x0 Cov[a,b] + Var[ε0].
  • Slide 48
  • 48/97 Prediction Variance
  • Slide 49
  • 49/97 Quantum of Solace Actual Box = $67.528882M. a = -14.36, b = 72.7181, N = 62, s_b = 10.94249, s² = 13.3863². Buzz = 0.76, prediction = 40.906. Mean buzz = 0.4824194, Σ(buzz - mean)² = 1.49654. s_forecast = 13.831425. Confidence interval = 40.906 +/- 2.003(13.831425) = 13.239 to 68.527. (Note: the confidence interval contains the actual value.)
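    A quick numerical check of these figures, using the equivalent scalar form of the prediction variance, s_forecast² = s²(1 + 1/N) + (x0 - x̄)²·s_b²:

        # Plugging in the slide's numbers; this scalar form is algebraically
        # the same as Var[a] + x0^2 Var[b] + 2 x0 Cov[a,b] + Var[e0] above.
        s2, s_b, N = 13.3863**2, 10.94249, 62
        x0, xbar = 0.76, 0.4824194
        pred = -14.36 + 72.7181 * x0                          # 40.906
        s_f = (s2 * (1 + 1/N) + (x0 - xbar)**2 * s_b**2) ** 0.5
        print(pred, s_f)                                      # about 13.8314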
  • Slide 50
  • 50/97 Forecasting Out of Sample Per Capita Gasoline Consumption vs. Per Capita Income, 1953-2004. How to predict G for 2012? You would need first to predict Income for 2012. How should we do that?
    Regression Analysis: G versus Income
    The regression equation is G = 1.93 + 0.000179 Income
    Predictor   Coef         SE Coef      T       P
    Constant    1.9280       0.1651       11.68   0.000
    Income      0.00017897   0.00000934   19.17   0.000
    S = 0.370241   R-Sq = 88.0%   R-Sq(adj) = 87.8%
  • Slide 51
  • 51/97 The Extrapolation Penalty The interval is narrowest at x* = x̄, the center of our experience. The interval widens as we move away from the center of our experience, to reflect the greater uncertainty: (1) uncertainty about the prediction of x; (2) uncertainty that the linear relationship will continue to exist as we move farther from the center.
  • Slide 52
  • 52/97 Normality Necessary for t statistics and confidence intervals. Do the residuals reveal whether the disturbances are normal? Standard tests and devices apply.
  • Slide 53
  • 53/97 Normally Distributed Residuals?
    --------------------------------------
    Kolmogorov-Smirnov test of F(E) vs.
    Normal[.00000, 13.27610^2]
    ******* K-S test statistic = .1810063
    ******* 95% critical value = .1727202
    ******* 99% critical value = .2070102
    Normality hyp. should be rejected.
    --------------------------------------
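    A sketch of the same kind of check in Python (scipy). The residuals below are a fabricated stand-in so the snippet runs on its own; in practice, use the residuals from the fitted regression:

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        resid = rng.normal(0.0, 13.2761, 62)     # stand-in residuals

        # K-S test of the residuals against Normal(0, 13.2761), as above
        stat, pval = stats.kstest(resid, "norm", args=(0.0, 13.2761))
        print(stat, pval)   # reject normality if stat exceeds the critical value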
  • Slide 54
  • 54/97 Nonnormal Disturbances Appeal to the central limit theorem. Use the standard normal instead of t. (t is essentially normal if N > 50.)
  • Slide 55
  • Statistics and Data Analysis Part 7-4 Multiple Regression
  • Slide 56
  • 56/97 Box Office and Movie Buzz
  • Slide 57
  • 57/97 Box Office and Budget
  • Slide 58
  • 58/97 Budget and Buzz Effects
  • Slide 59
  • 59/97 An Enduring Art Mystery Why do larger paintings command higher prices? The Persistence of Memory. Salvador Dali, 1931 The Persistence of Statistics. Rice, 2007 Graphics show relative sizes of the two works.
  • Slide 60
  • 60/97 The Data Note: Using logs in this context. This is common when analyzing financial measurements (e.g., price) and when percentage changes are more interesting than unit changes. (E.g., what is the % premium when the painting is 10% larger?)
  • Slide 61
  • 61/97 Monet in Large and Small Log of $price = a + b log surface area + e Sale prices of 328 signed Monet paintings The residuals do not show any obvious patterns that seem inconsistent with the assumptions of the model.
  • Slide 62
  • 62/97 Monet Regression: There seems to be a regression. Is there a theory?
  • Slide 63
  • 63/97 How much for the signature? The sample also contains 102 unsigned paintings.
    Average Sale Price
    Signed       $3,364,248
    Not signed   $1,832,712
    The average price of signed Monets is almost twice that of unsigned.
  • Slide 64
  • 64/97 Can we separate the two effects?
    Average Prices    Small     Large
    Unsigned          346,845   5,795,000
    Signed            689,422   5,556,490
    What do the data suggest? (1) The size effect is huge. (2) The signature effect is confined to the small paintings.
  • Slide 65
  • 65/97 A Multiple Regression Ln Price = a + b1 ln Area + b2 Signed + e, where Signed = 0 if unsigned, 1 if signed. The signature effect is b2.
  • Slide 66
  • 66/97 Monet Multiple Regression
    Regression Analysis: ln (US$) versus ln (SurfaceArea), Signed
    The regression equation is ln (US$) = 4.12 + 1.35 ln (SurfaceArea) + 1.26 Signed
    Predictor          Coef     SE Coef   T       P
    Constant           4.1222   0.5585    7.38    0.000
    ln (SurfaceArea)   1.3458   0.08151   16.51   0.000
    Signed             1.2618   0.1249    10.11   0.000
    S = 0.992509   R-Sq = 46.2%   R-Sq(adj) = 46.0%
    Interpretation (to be explored as we develop the topic): (1) The elasticity of price with respect to surface area is 1.3458, very large. (2) The signature multiplies the price by exp(1.2618) (about 3.5), for any given size.
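    A one-line check of interpretation (2):

        import math

        # A coefficient of 1.2618 on the Signed dummy in a log-price equation
        # multiplies the predicted price by exp(1.2618).
        print(math.exp(1.2618))   # about 3.53: roughly 3.5x, holding size fixed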
  • Slide 67
  • 67/97 Ceteris Paribus in Theory Demand for gasoline: G = f(price, income). Demand (price) elasticity: e_P = % change in G given a % change in P, holding income constant. How do you do that in the real world? How do you change price and hold income constant?
  • Slide 68
  • 68/97 The Real World Data
  • Slide 69
  • 69/97 U.S. Gasoline Market, 1953-2004
  • Slide 70
  • 70/97 Shouldn't Demand Curves Slope Downward?
  • Slide 71
  • 71/97 A Thought Experiment The main driver of gasoline consumption is income, not price. Income is growing over time. We are not holding income constant when we change price! How do we do that?
  • Slide 72
  • 72/97 How to Hold Income Constant? Multiple Regression Using Price and Income
    Regression Analysis: G versus GasPrice, Income
    The regression equation is G = 0.134 - 0.00163 GasPrice + 0.000026 Income
    Predictor   Coef          SE Coef      T       P
    Constant    0.13449       0.02081      6.46    0.000
    GasPrice    -0.0016281    0.0004152    -3.92   0.000
    Income      0.00002634    0.00000231   11.43   0.000
    It looks like the theory works.
  • Slide 73
  • Statistics and Data Analysis Linear Multiple Regression Model
  • Slide 74
  • 74/97 Classical Linear Regression Model The model is y = f(x1, x2, …, xK, β1, β2, …, βK) + ε, a multiple regression model. Important examples: marginal cost in a multiple-output setting; separate age and education effects in an earnings equation. Denote (x1, x2, …, xK) as x (boldface symbol = vector). Form of the model: E[y|x] = a linear function of x. Dependent and independent variables: independent of what? Think in terms of autonomous variation. Can y just change? What causes the change?
  • Slide 75
  • 75/97 Model Assumptions: Generalities Linearity means linear in the parameters. We'll return to this issue shortly. Identifiability: it is not possible, in the context of the model, for two different sets of parameters to produce the same value of E[y|x] for all x vectors. (It is possible for some x.) The conditional expected value of the deviation of an observation from the conditional mean function is zero. The form of the variance of the random variable around the conditional mean is specified. The nature of the process by which x is observed. Assumptions about the specific probability distribution.
  • Slide 76
  • 76/97 Linearity of the Model f(x1, x2, …, xK, β1, β2, …, βK) = x1β1 + x2β2 + … + xKβK. Notation: x1β1 + x2β2 + … + xKβK = x′β. A boldface letter indicates a column vector. x may denote a variable, a function of a variable, or a function of a set of variables. There are K variables on the right-hand side of the conditional mean function. The first variable is usually a constant term. (Wisdom: models should have a constant term unless the theory says they should not.) E[y|x] = β1·1 + β2x2 + … + βKxK (β1·1 = the intercept term).
  • Slide 77
  • 77/97 Linearity Linearity means linear in the parameters, not in the variables: E[y|x] = β1 f1(…) + β2 f2(…) + … + βK fK(…). Each fk(…) may be any function of data. Examples: logs and levels in economics; time trends, and time trends in loglinear models (rates of growth); dummy variables; quadratics, power functions, log-quadratics, trig functions, interactions, and so on.
  • Slide 78
  • 78/97 Linearity Simple linear model: E[y|x] = α + βx. Quadratic model: E[y|x] = α + β1x + β2x². Loglinear model: E[ln y|x] = α + Σk βk ln xk. Semilog: E[y|x] = α + Σk βk ln xk. All are linear (in the parameters). An infinite number of variations.
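    A short sketch (Python/numpy) of why linearity in the parameters is all least squares needs: the quadratic model is fit by ordinary least squares once x² is added as a column of X. The data are fabricated:

        import numpy as np

        rng = np.random.default_rng(0)
        x = rng.uniform(1.0, 10.0, 100)
        y = 2.0 + 0.5 * x - 0.03 * x**2 + rng.normal(0.0, 0.1, 100)

        X = np.column_stack([np.ones_like(x), x, x**2])   # [1, x, x^2]
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # OLS fit
        print(beta)   # close to (2.0, 0.5, -0.03)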
  • Slide 79
  • 79/97 Matrix Notation
  • Slide 80
  • 80/97 Notation Define column vectors of N observations on y and the K x variables. The assumption means that the rank of the matrix X is K. No linear dependencies => FULL COLUMN RANK of the matrix X.
  • Slide 81
  • 81/97 Uniqueness of the Conditional Mean The conditional mean relationship must hold for any set of N observations, i = 1, …, N. Assume that N ≥ K (justified later). E[y1|x] = x1′β, E[y2|x] = x2′β, …, E[yN|x] = xN′β. All N observations at once: E[y|X] = Xβ.
  • Slide 82
  • 82/97 Uniqueness of E[y|X] Now, suppose there is a γ that produces the same expected value: E[y|X] = Xγ. Let δ = β - γ. Then Xδ = Xβ - Xγ = 0. Is this possible? X is an N × K matrix (N rows, K columns). What does Xδ = 0 mean? We assume this is not possible. This is the full rank assumption. Ultimately, it will imply that we can estimate β. This requires N ≥ K.
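    A small numerical illustration (numpy) of the full rank assumption: a redundant column drops the rank of X below K, which is exactly the Xδ = 0 failure:

        import numpy as np

        # Third column of X_bad = 3 * second column, so a nonzero d with
        # X_bad @ d = 0 exists and beta would not be identified.
        x2 = np.arange(1.0, 7.0)
        X_ok  = np.column_stack([np.ones(6), x2])
        X_bad = np.column_stack([np.ones(6), x2, 3.0 * x2])
        print(np.linalg.matrix_rank(X_ok))    # 2  (full column rank, K = 2)
        print(np.linalg.matrix_rank(X_bad))   # 2  (< K = 3: not full rank)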
  • Slide 83
  • 83/97 An Unidentified (But Valid) Theory of Art Appreciation Enhanced Monet Area Effect Model: Height and Width Effects. Log(Price) = β1 + β2 log Area + β3 log AspectRatio + β4 log Height + β5 Signature + ε (AspectRatio = Height/Width). Since Area = Height × Width, log Height = ½(log Area + log AspectRatio): the regressors are exactly linearly dependent, so the model is not identified.
  • Slide 84
  • 84/97 Conditional Homoscedasticity and Nonautocorrelation Disturbances provide no information about each other. Var[εi | X] = σ². Cov[εi, εj | X] = 0 for i ≠ j.
  • Slide 85
  • 85/97 Heteroscedasticity Regression of log of per capita gasoline use on log of per capita income, gasoline price and number of cars per capita for 18 OECD countries for 19 years. The standard deviation varies by country. Countries are ordered by the standard deviation of their 19 residuals.
  • Slide 86
  • 86/97 Autocorrelation log G = β1 + β2 log Pg + β3 log Y + β4 log Pnc + β5 log Puc + ε
  • Slide 87
  • 87/97 Autocorrelation Results from an Incomplete Model
  • Slide 88
  • 88/97 Normal Distribution of ε Used to facilitate finite-sample derivations of certain test statistics. Observations are independent. The assumption will be unnecessary; we will use the central limit theorem for the statistical results we need.
  • Slide 89
  • 89/97 The Linear Model y = Xβ + ε, N observations, K columns in X (usually including a column of ones for the intercept). Standard assumptions about X. Standard assumptions about ε|X: E[ε|X] = 0, E[ε] = 0 and Cov[ε, x] = 0. Regression: E[y|X] = Xβ
  • Slide 90
  • Statistics and Data Analysis Least Squares
  • Slide 91
  • 91/97 Vocabulary Some terms to be used in the discussion: population characteristics and entities vs. sample quantities and analogs; residuals and disturbances; population regression line and sample regression. Objective: learn about the conditional mean function. Estimate β and σ². First step: mechanics of fitting a line to a set of data.
  • Slide 92
  • 92/97 Least Squares
  • Slide 93
  • 93/97 Matrix Results
  • Slide 94
  • 94/97
  • Slide 95
  • 95/97 Moment Matrices
  • Slide 96
  • 96/97 Least Squares Normal Equations
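    The normal equations are X′X b = X′y. A minimal numpy sketch with fabricated data (in practice, np.linalg.lstsq or a QR decomposition is numerically safer than forming X′X explicitly):

        import numpy as np

        rng = np.random.default_rng(0)
        X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
        y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0.0, 0.1, 50)

        b = np.linalg.solve(X.T @ X, X.T @ y)   # b = (X'X)^{-1} X'y
        print(b)                                # near (1.0, 2.0, -0.5)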
  • Slide 97
  • 97/97 Second Order Conditions
  • Slide 98
  • 98/97 Does b Minimize e′e?
  • Slide 99
  • 99/97 Positive Definite Matrix