linear regression models powerful modeling technique tease out relationships between...

Linear Regression Models

Powerful modeling technique.
Tease out relationships between “independent” variables and one “dependent” variable.
Models are not perfect, so they need an error term.
  Sources of error: measurement errors, wrong model, omitted variables, inherent randomness.
Linear models are often misused.

Example: Lake Water Quality

Chlorophyll-a (C): widely used indicator; a measure of eutrophication.
Nitrogen (N): associated with eutrophication.
Q: Golf course development. Nitrogen is expected to increase. By how much will C increase or decrease?
How should we proceed?

Plot C vs. N

[Figure: scatter plot of Chlorophyll (vertical axis, 0 to 150) against Nitrogen (horizontal axis, 5 to 25).]

A “Better” Model

Explain the (single) regression line (model?).
The negative relationship suggests a problem.
Omitted variable: Phosphorus (P).
Want to tease out the effects of N and P separately.
Write a multiple linear regression model:

$$C_i = \beta_0 + \beta_1 P_i + \beta_2 N_i + \varepsilon_i$$

The model is designed to “tease out” the effect of N and the effect of P, separately, on C.
(**) Define and interpret variables, parameters.

Estimation

Use data to estimate the parameter values that give the “best fit”: b0 = -9.4, b1 = 0.3, b2 = 1.2.
Answer: A one-unit increase in N results in about a 1.2-unit increase in C (b2 is the coefficient on N), holding P fixed.
Importance: Omitting phosphorus from the model introduced significant bias!
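The omitted-variable effect can be reproduced on simulated data. The following is a minimal sketch, not the lake data set from the slides: the ranges for P and N, the negative link between them, the noise levels, and the helper `ols` are all assumptions made for illustration, and the slide's estimates are reused as the "true" coefficients only so the numbers look familiar.

```python
import numpy as np

# Synthetic data only: the "true" coefficients reuse the slide's estimates
# (b0=-9.4, b1=0.3, b2=1.2) purely for illustration.
rng = np.random.default_rng(0)
n = 500
P = rng.uniform(20.0, 120.0, size=n)                 # phosphorus (hypothetical range)
N = 25.0 - 0.15 * P + rng.normal(0.0, 1.0, size=n)   # nitrogen, negatively related to P (assumption)
C = -9.4 + 0.3 * P + 1.2 * N + rng.normal(0.0, 2.0, size=n)

def ols(columns, y):
    """Least-squares coefficients for y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b_short = ols([N], C)     # omits P: [intercept, slope on N]
b_full = ols([P, N], C)   # [intercept, slope on P, slope on N]

print("C on N only:  slope on N =", round(float(b_short[1]), 2))
print("C on P and N: slope on N =", round(float(b_full[2]), 2))
```

Under these assumptions the N-only regression gives a negative slope, while the regression that includes P recovers a slope on N close to the value used to generate the data.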

Question: US Gas Consumption

Gasoline consumption produces many negative byproducts.
Policy may be directed at increasing the price of gas to reduce consumption.
But what is the effect of a price change?
Question: What is the price elasticity of demand for gasoline in the U.S.?

Some Gasoline Data

[Figure, first plot: G.POP (vertical axis, 0.7 to 1.2) against YEAR (1962 to 1992).]

[Figure, second plot: G.POP (vertical axis, 0.7 to 1.2) against PG (0.6 to 4.1).]

Gas Data Cont’d

Gas consumption increases through time, but there is no information here about price.
The next plot shows a positive (+) relationship between gas price and gas consumption.
Note: this is the opposite of a demand curve. Something is wrong here…
Just as in the eutrophication problem, we may have omitted important variables.
May have other problems, too.

The OLS “Estimator”

Estimator: a rule or strategy for using data to estimate an unknown parameter. It is defined before the data are drawn.
The Ordinary Least Squares (OLS) estimator finds the parameter values that minimize the sum of squared deviations (see the C vs. N plot).
Several assumptions must hold for the OLS estimator to apply to a model.
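To make "minimizes the sum of squared deviations" concrete, here is a small sketch for a one-regressor model. The data values and the function names `ols_simple` and `sum_sq_dev` are made up for illustration; the closed-form slope and intercept are the standard OLS formulas.

```python
def ols_simple(x, y):
    """Closed-form OLS intercept and slope for y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sum_sq_dev(b0, b1, x, y):
    """Sum of squared deviations of y from the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_simple(x, y)
best = sum_sq_dev(b0, b1, x, y)

# Nudging either parameter away from the OLS values can only increase the sum
nudged = [sum_sq_dev(b0 + d0, b1 + d1, x, y)
          for d0 in (-0.1, 0.1) for d1 in (-0.1, 0.1)]
print("OLS sum of squares:", best, " smallest nudged sum:", min(nudged))
```

Any other line through the same points gives a larger sum of squared deviations than the OLS line, which is what the comparison with the nudged parameters checks.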

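Linear Model

The model must be linear.
Linear in parameters, not in variables.
  Note the difference between a parameter and a variable.
Examples:

$$Y_t = \alpha + \beta X_t + \varepsilon_t$$

$$Y_t = \alpha + \beta X_t^2 + \varepsilon_t$$

[A third example involving $\log(X_t)$ and $Z_t$; its exact form is not recoverable from the extracted text.]

$$R_{t+1} = S_t \, e^{\beta (1 - S_t)} \, \varepsilon_t \quad \text{(Ricker model)}$$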

Transforming Models

The previous “Ricker” model is nonlinear (in the parameter).
Sometimes the model can be transformed so that it is linear.
When plotted, the graph is nonlinear.
Take the log of both sides, giving:

$$\log(R_{t+1}) = \log(S_t) + \beta (1 - S_t) + \log(\varepsilon_t)$$
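The log transformation can be tried on simulated data. This sketch assumes the Ricker form $R_{t+1} = S_t e^{\beta(1 - S_t)} \varepsilon_t$ with a multiplicative lognormal error; the value of `beta_true`, the range of S, and the noise level are hypothetical. After taking logs, $\log(R_{t+1}) - \log(S_t)$ is linear in $\beta$ with no intercept, so a least-squares fit through the origin recovers it.

```python
import math
import random

rng = random.Random(0)
beta_true = 1.5   # hypothetical value, for illustration only

# Simulated stock (S) and recruitment (R) under the assumed Ricker form
S = [rng.uniform(0.2, 2.0) for _ in range(200)]
R = [s * math.exp(beta_true * (1.0 - s)) * rng.lognormvariate(0.0, 0.1) for s in S]

# log(R) - log(S) = beta * (1 - S) + log(eps): linear in beta, no intercept
y = [math.log(r) - math.log(s) for r, s in zip(R, S)]
x = [1.0 - s for s in S]

# Least-squares slope for a regression through the origin
beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
print("estimated beta:", round(beta_hat, 3))
```

The estimate should land close to the value used to generate the data, even though the original relationship between R and S is a curve.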

CLRM (Classical Linear Regression Model): Assumption 1

The dependent variable (Y) is a function of a specific set of independent variables (X’s).
  Linear in parameters
  Additive error
  Coefficients are constant but unknown
Violations are called “specification errors”, e.g.:
  Wrong regressors (a.k.a. independent variables; X’s)
  Nonlinearity
  Changing parameters (e.g. through time)

CLRM: Assumption 2

Disturbances ($\varepsilon_i$’s) are independently and identically distributed ~ $(0, \sigma^2)$.
  Typically we assume $\varepsilon_i \sim N(0, \sigma^2)$
  Mean = 0
  Constant variance, $\sigma^2$ (but unknown)
  Errors uncorrelated with one another
Examples of violations:
  Measurement bias (seep gas flux)
  Heteroskedasticity (variance differs)
  Autocorrelated errors (disturbances correlated)

CLRM: Assumption 3

It is possible to repeat the sample with the same independent variables.
  If we had the same levels of the explanatory variables, would it be possible to generate the same value of Y?
Common violations:
  Errors in variables: measurement error in X.
  Autoregression: when a lagged dependent variable should be an independent variable.
  Simultaneous equations: several relationships act jointly.

Properties of Estimators

Estimators have many properties.
  “6” is an estimator, but not a very good one.
Two main properties we care about:
  Unbiased: the expected difference between the estimator and the thing it is estimating is 0.
  Efficient: small variance (spread).
“6” is biased, but has a very small variance (zero).
The OLS estimator is unbiased and, when the CLRM assumptions hold, has minimum variance among linear unbiased estimators.
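Bias and variance can be seen in a small simulation. This sketch compares the constant estimator “6” with the sample mean, which is the OLS estimator for a model that contains only an intercept; the true mean `mu`, the spread `sigma`, the sample size, and the number of repetitions are hypothetical settings chosen for illustration.

```python
import random
import statistics

rng = random.Random(1)
mu, sigma, n, reps = 10.0, 2.0, 25, 2000   # hypothetical settings

mean_estimates = []
constant_estimates = []
for _ in range(reps):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    mean_estimates.append(statistics.fmean(sample))  # OLS for an intercept-only model
    constant_estimates.append(6.0)                   # the "6" estimator ignores the data

# Bias: average estimate minus the true value; variance: spread across repeated samples
bias_mean = statistics.fmean(mean_estimates) - mu
bias_const = statistics.fmean(constant_estimates) - mu
var_mean = statistics.pvariance(mean_estimates)
var_const = statistics.pvariance(constant_estimates)

print("sample mean: bias", round(bias_mean, 3), "variance", round(var_mean, 3))
print('"6":         bias', bias_const, "variance", var_const)
```

The constant has zero variance but misses the target by the same amount in every sample, while the sample mean is centered on the true value with a variance near `sigma**2 / n`.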

Correlation vs. Causation

Now we know just enough to be dangerous!
Can estimate how any set of variables affects some other variable… Very powerful.
Problem is: correlation doesn’t imply causation! This is why data mining is bad.
  Example: chicken production and global CO2.
  The relationship may be “spurious” (no underlying relationship).
Causation is difficult to tease out statistically.
  “Granger causality”

Violations & Consequences

Problem                                        Consequences
Autocorrelation                                Unbiased, wrong inference
Heteroskedasticity                             Unbiased, wrong inference
Contemporaneous correlation (X, ε correlated)  Biased
Multicollinearity                              Usually OK
Omitted variables                              Biased
Included (irrelevant) regressors               Unbiased, extra noise
True model nonlinear                           Biased, wrong inference

Guide to Model Specification

1. Start with theory to generate model
2. Check assumptions of CLRM
3. Collect and plot data
4. Estimate model, test restrictions
   Possibly perform Box-Cox transform
5. Check R², and “Adjusted R²”
6. Plot residuals; look for patterns
7. Seek explanations for patterns
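For the adjusted R² check in the guide, a commonly used definition is sketched below. The symbols n (number of observations) and k (number of regressors, not counting the intercept) are introduced here for illustration and do not appear in the slides.

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
```

Unlike R², this quantity can fall when a regressor that adds little explanatory power is included, which is why the two are checked together.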

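What’s a Residual?

General form of linear model:

$$Y_i = \alpha + \beta X_i + \varepsilon_i \quad \text{(true)}$$

$$\hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i \quad \text{(predicted)}$$

$$\hat{\varepsilon}_i = Y_i - \hat{Y}_i \quad \text{(“residual”)}$$

Graphically on board.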
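The three quantities (observed, predicted, residual) can be computed directly. This is a minimal sketch on made-up numbers; the variable names are chosen for illustration, and `np.polyfit` with `deg=1` is used here as one convenient way to get the least-squares line.

```python
import numpy as np

# Made-up data, for illustration only
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([1.8, 4.3, 5.9, 8.4, 9.6, 12.1])

# Least-squares line; polyfit returns the slope first, then the intercept
beta_hat, alpha_hat = np.polyfit(X, Y, deg=1)

Y_hat = alpha_hat + beta_hat * X   # predicted values
resid = Y - Y_hat                  # residuals: observed minus predicted

print("residuals:", np.round(resid, 3))
print("sum of residuals:", round(float(resid.sum()), 10))
```

With an intercept in the model, the OLS residuals sum to zero and are uncorrelated with X by construction, so any visible pattern in a residual plot points to something the model has missed.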

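Residual Plots

[Figure, panel “Residuals vs. Fit”: Residuals (vertical axis, -40 to 60) against Fitted : Phosphorus + Nitrogen (50 to 200); points 7, 10, and 14 are labeled.]

[Figure, panel “Normal Quantile Plot”: Residuals (vertical axis, -40 to 60) against Quantiles of Standard Normal (-2 to 2); points 7, 10, and 14 are labeled.]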

Back to Gasoline Consumption

Recall, we are interested in how gas consumption is affected by a price increase (say $0.10/gal.).
Variables:
  Gas consumption per capita (G)
  Gas price (Pg)
  Income (Y)
  New car price (Pnc)
  Used car price (Puc)

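2 Alternative Specifications

Linear specification:

$$G_t = \beta_0 + \beta_1 \mathrm{Pg}_t + \beta_2 Y_t + \beta_3 \mathrm{Pnc}_t + \beta_4 \mathrm{Puc}_t + \varepsilon_t$$

Log-log specification (often used with economic data):

$$\log(G_t) = \beta_0 + \beta_1 \log(\mathrm{Pg}_t) + \beta_2 \log(Y_t) + \beta_3 \log(\mathrm{Pnc}_t) + \beta_4 \log(\mathrm{Puc}_t) + \varepsilon_t$$

One way to test the specification is the Box-Cox transform (see 3 lectures back).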
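In a log-log specification, the coefficient on log price can be read directly as the price elasticity of demand. The sketch below illustrates this on simulated data, not the gasoline data set: it keeps only price and income for brevity, and the elasticity of -0.2, the income exponent, the scale factor, and the noise level are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
Pg = rng.uniform(0.6, 4.1, size=n)    # gas price (hypothetical draws)
Y = rng.uniform(5.0, 10.0, size=n)    # income (hypothetical draws)
elasticity_true = -0.2                # assumed for the simulation only

# Constant-elasticity demand with multiplicative noise
G = 0.5 * Pg**elasticity_true * Y**0.9 * np.exp(rng.normal(0.0, 0.05, size=n))

# Log-log regression: log(G) on an intercept, log(Pg), and log(Y)
X = np.column_stack([np.ones(n), np.log(Pg), np.log(Y)])
coef, *_ = np.linalg.lstsq(X, np.log(G), rcond=None)

print("estimated price elasticity:", round(float(coef[1]), 3))
```

In the linear specification, by contrast, the price coefficient is a change in units of G per unit of price, so an elasticity would have to be computed at a chosen price and quantity.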

Results of Linear Model

Each column shows the parameter estimate, with the p-value of its t-test in parentheses. A low p-value means “statistically significant”.
R² measures the goodness of fit of the model.
A low p-value of the F statistic means the model has explanatory power.

           b0      b1       b2       b3      b4      R²     p (F)
Estimate   -.09    -.04     .0002    -.10    -.04    .97    .000
(p-value)  (.08)   (.002)   (.000)   (.11)   (.08)

Answer to Question

A 1-unit increase in price leads to a 0.04-unit decrease in gas consumption.
Units are: G (1000 gallons), Pg ($).
So a $0.10 increase in gas price leads to, on average, a 4-gallon decrease in gas consumption… not much!
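The unit conversion behind this answer can be written out step by step. The sketch uses the slide's estimate of -0.04 for the price coefficient and the stated units; the variable names are chosen for illustration.

```python
b1 = -0.04            # slide's estimate: change in G (1000 gallons) per $1 change in Pg
price_change = 0.10   # dollars per gallon

delta_G_thousand_gal = b1 * price_change      # change in G, in thousands of gallons
delta_G_gal = delta_G_thousand_gal * 1000.0   # convert to gallons

print(f"{delta_G_gal:.1f} gallons")
```

This is a linear extrapolation, so the same coefficient implies a change ten times as large for a $1.00 price increase.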
