Post on 22-Dec-2015
Linear Regression Models
Powerful modeling technique.
Tease out relationships between “independent” variables and 1 “dependent” variable.
Models not perfect…need an error term.
  Measurement errors, wrong model, omitted variables, inherent randomness.
Linear models often misused.
Example: Lake Water Quality
Chlorophyll-a (C) widely used indicator – measure of eutrophication.
Nitrogen (N) associated with eutrophication.
Q: Golf Course Development. Nitrogen expected to increase. By how much will C increase/decrease?
How should we proceed?
Plot C vs. N
[Figure: scatter plot of Chlorophyll (vertical axis, 0 to 150) against Nitrogen (horizontal axis, 5 to 25)]
A “Better” Model
Explain (single) regression line (model?). Neg. relationship suggests a problem.
Omitted variable: Phosphorus (P)
Want to tease out effect of N, P separately.
Write a Multiple Linear Regression Model:
\[ C_i = \beta_0 + \beta_1 P_i + \beta_2 N_i + \varepsilon_i \]
Model designed to “tease out” effect of N and effect of P, separately, on C.
(**) Define and interpret variables, parameters.
Estimation
Use data to estimate parameter values that give “best fit”: b0 = -9.4, b1 = 0.3, b2 = 1.2
Answer: A one unit increase in N results in about a 1.2 unit increase in C, holding P fixed.
Importance: Omitting phosphorus from model introduced significant bias!!!
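The omitted-variable effect can be reproduced with a small simulation. The sketch below is illustrative only: the data are synthetic (not the lake data from the lecture), the coefficients are the estimates quoted above, and the negative link between P and N is an assumption chosen so that the single regression on N slopes the wrong way. The helper `ols` is a hypothetical name.

```python
import numpy as np

# Synthetic lake data (an assumption for illustration, not the lecture's data).
rng = np.random.default_rng(0)
n = 2000
N = rng.uniform(5, 25, n)                            # nitrogen
P = 300 - 8 * N + rng.normal(0, 30, n)               # phosphorus, negatively related to N
C = -9.4 + 0.3 * P + 1.2 * N + rng.normal(0, 5, n)   # chlorophyll built from the quoted coefficients


def ols(y, *regressors):
    """Return least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b_short = ols(C, N)      # P omitted: the slope on N absorbs part of the effect of P
b_full = ols(C, P, N)    # intercept, b1 (P), b2 (N)

print("N slope, P omitted :", round(b_short[1], 2))
print("N slope, P included:", round(b_full[2], 2))
```

Under these assumptions the short regression should give a negative slope on N (in expectation 1.2 + 0.3 × (-8) = -1.2), while the full regression should recover a value close to 1.2.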
Question: US Gas Consumption
Gasoline consumption produces many negative byproducts.
Policy may be directed at increasing the price of gas to reduce consumption.
But what is effect of price change?
Question: What is the price elasticity of demand for gasoline in the U.S.?
Some Gasoline Data
[Figure: two panels. G.POP (vertical axis, 0.7 to 1.2) against YEAR (1962 to 1992); G.POP (vertical axis, 0.7 to 1.2) against PG (0.6 to 4.1)]
Gas Data Cont’d
Gas consumption increases through time. But no info here about price.
The second plot shows a positive (+) relationship between gas price and gas consumption.
Note opposite of demand curve. Something is wrong here…
Just as in Eutrophication problem, may have omitted important variables.
May have other problems, too.
The OLS “Estimator”
Estimator: A rule or strategy for using data to estimate an unknown parameter. Defined before the data are drawn.
Ordinary Least Squares (OLS) estimator finds value of parameter that minimizes sum of squared deviations (see C vs. N plot).
Several assumptions for OLS estimator to apply to a model.
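To make "minimizes sum of squared deviations" concrete, the sketch below computes the OLS coefficients from the normal equations and checks that nearby coefficient values give a larger sum of squares. The data are simulated, and the "true" line and the name `ssr` are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 50)   # hypothetical true line plus noise

X = np.column_stack([np.ones_like(x), x])


def ssr(b):
    """Sum of squared deviations of y from the line b[0] + b[1] * x."""
    resid = y - X @ b
    return float(resid @ resid)


# OLS solves the normal equations (X'X) b = X'y.
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Coefficient vectors shifted away from the OLS solution should fit worse.
shifts = [[0.1, 0.0], [0.0, 0.05], [-0.2, 0.03]]
perturbed = [b_ols + np.array(d) for d in shifts]

print("OLS estimate:", b_ols)
print("SSR at OLS  :", ssr(b_ols))
print("SSR nearby  :", [ssr(b) for b in perturbed])
```

Because the sum of squares is a convex function of the coefficients, any shift away from the OLS solution increases it, whichever direction is tried.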
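Linear Model
The model must be linear.
  Linear in parameters, not in variables.
  Difference between parameter, variable.
Examples:
[Equations: example models for Y_t in terms of X_t, including power and log(X_t) terms and a second regressor Z_t; and the Ricker model relating R_{t+1} to S_t]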
Transforming Models
Previous “Ricker” model is nonlinear (in the parameter).
Sometimes, can transform model so linear.
When plotted, graph is nonlinear.
Take log of both sides, giving:
\[ \log(R_{t+1}) = \log(S_t) + \alpha (1 - S_t) + \log(\varepsilon_t) \]
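The log transform can be tried on simulated data. This is a sketch under stated assumptions: the Ricker model is taken to have the form R_{t+1} = S_t exp(α(1 - S_t)) ε_t with a single parameter α and a multiplicative error whose log has mean zero. The symbol α, its value, and the spawner range are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha_true = 1.5                          # hypothetical parameter value
S = rng.uniform(0.1, 2.0, 400)            # spawners (scaled units, assumed)
eps = np.exp(rng.normal(0, 0.2, 400))     # multiplicative error, so log(eps) has mean 0
R_next = S * np.exp(alpha_true * (1 - S)) * eps

# Taking logs: log(R_next) - log(S) = alpha * (1 - S) + log(eps),
# which is linear in alpha even though R_next is a nonlinear function of S.
y = np.log(R_next) - np.log(S)
x = 1 - S
alpha_hat = float(x @ y / (x @ x))        # least squares with no intercept

print("estimated alpha:", round(alpha_hat, 3))
```

The estimate should land near the assumed value of 1.5, since the transformed model has an additive error and is linear in the parameter.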
CLRM: Assumption 1
Dependent variable (Y) is function of specific set of independent variables (X’s).
  Linear in parameters
  Additive error
  Coefficients are constant but unknown
Violations called “specification errors”, e.g.
  Wrong regressors (a.k.a. indep. vars; X’s)
  Nonlinearity
  Changing parameters (e.g. through time)
CLRM: Assumption 2
Disturbances (ε_i’s) are independently and identically distributed ~ (0, σ²)
  Typically we assume ε_i ~ N(0, σ²)
  Mean = 0
  Constant variance, σ² (but unknown)
  Errors uncorrelated with one another
Example of violations:
  Measurement Bias (seep gas flux)
  Heteroskedasticity (variance differs)
  Autocorrelated Errors (disturbances correlated)
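One common check for autocorrelated errors, not named in the lecture, is the Durbin-Watson statistic. The sketch below applies it directly to simulated disturbances (in practice it would be applied to regression residuals); the AR(1) coefficient of 0.8 is an assumption.

```python
import numpy as np


def durbin_watson(resid):
    """DW statistic: near 2 for uncorrelated values, below 2 for positive autocorrelation."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))


rng = np.random.default_rng(3)
n = 2000
iid = rng.normal(0, 1, n)        # disturbances that satisfy Assumption 2

# AR(1) disturbances: each value carries over 0.8 of the previous one.
shocks = rng.normal(0, 1, n)
ar1 = np.empty(n)
ar1[0] = shocks[0]
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t - 1] + shocks[t]

dw_iid = durbin_watson(iid)
dw_ar1 = durbin_watson(ar1)
print("DW, iid errors  :", round(dw_iid, 2))
print("DW, AR(1) errors:", round(dw_ar1, 2))
```

For an AR(1) process with coefficient ρ the statistic is roughly 2(1 - ρ), so the autocorrelated series should score well below 2.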
CLRM: Assumption 3
It is possible to repeat the sample with same independent variables.
  If had same levels of explanatory vars, would it be possible to generate same value of Y?
Common Violations:
  Errors in variables – measurement error in X.
  Autoregression – when lagged dependent variable should be independent variable.
  Simultaneous Equations – several relationships act jointly.
Properties of Estimators
Estimators have many properties. “6” is an estimator, but not a very good one.
Two main properties we care about:
  Unbiased: The expected difference between the estimator and the thing it is estimating is 0.
  Efficient: Small variance (spread).
“6” is biased, but has a very small variance (zero).
Under the CLRM assumptions, the OLS estimator is unbiased and has minimum variance among all linear unbiased estimators.
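Bias and variance of the two estimators can be compared by repeated sampling. The sketch below is a Monte Carlo illustration with simulated data; the true slope of 2, the noise level, and the number of replications are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
beta_true = 2.0                        # hypothetical true slope
x = np.linspace(0, 10, 30)             # same X values in every repeated sample
X = np.column_stack([np.ones_like(x), x])
reps = 5000

ols_slopes = np.empty(reps)
for r in range(reps):
    y = 1.0 + beta_true * x + rng.normal(0, 3, x.size)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    ols_slopes[r] = coef[1]

const_estimates = np.full(reps, 6.0)   # the estimator "6" ignores the data

for name, est in [("OLS", ols_slopes), ('"6"', const_estimates)]:
    print(name, "bias:", round(est.mean() - beta_true, 3),
          "variance:", round(est.var(), 4))
```

The constant estimator has zero variance but misses the assumed slope by 4 in every sample, while the OLS slopes scatter around the true value.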
Correlation vs. Causation
Now we know just enough to be dangerous!
Can estimate how any set of variables affects some other variable….Very Powerful.
Problem is: Correlation doesn’t imply Causation! …. Why Data Mining is bad.
  Chicken production, Global CO2.
  May be “spurious” (no underlying relationship).
Difficult to tease out statistically. “Granger Causality”
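A spurious relationship is easy to manufacture. In the sketch below, pairs of series are generated independently, so any "significant" slope is spurious by construction. Trending series (random walks) are used as a stand-in for variables like chicken production and CO2; the function names, sample size, and |t| > 2 cutoff are assumptions for the illustration.

```python
import numpy as np


def slope_t_stat(x, y):
    """t statistic for the slope in a simple OLS regression of y on x."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[1] / np.sqrt(cov[1, 1])


def share_significant(make_series, reps=500, n=100, seed=5):
    """Share of replications with |t| > 2 when x and y are independent by construction."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        x, y = make_series(rng, n), make_series(rng, n)
        hits += int(abs(slope_t_stat(x, y)) > 2)
    return hits / reps


def white_noise(rng, n):
    return rng.normal(size=n)


def random_walk(rng, n):
    return np.cumsum(rng.normal(size=n))   # wandering, trending series


share_noise = share_significant(white_noise)
share_walk = share_significant(random_walk)
print("share 'significant', white noise :", share_noise)
print("share 'significant', random walks:", share_walk)
```

With independent white noise the share should sit near the nominal 5%, while independent random walks typically look "significant" far more often, even though no relationship exists.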
Violations & Consequences

Problem                                       Consequences
Autocorrelation                               Unbiased, wrong inference
Heteroskedasticity                            Unbiased, wrong inference
Contemporaneous Correlation (X, ε corr.)      Biased
Multicollinearity                             Usually OK
Omitted Variables                             Biased
Included (irrelevant) Regressors              Unbiased, extra noise
True model nonlinear                          Biased, wrong inference
Guide to Model Specification
1. Start with theory to generate model
2. Check assumptions of CLRM
3. Collect and plot data
4. Estimate model, test restrictions
   Possibly perform Box-Cox transform
5. Check R², and “Adjusted R²”
6. Plot residuals – look for patterns
7. Seek explanations for patterns
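Step 5 can be sketched in a few lines. The example uses simulated data and a hypothetical helper `r_squared`; it shows why both measures are checked: adding regressors unrelated to Y can never lower R², whereas adjusted R² applies a penalty for each extra parameter.

```python
import numpy as np


def r_squared(y, X):
    """Return (R2, adjusted R2) for an OLS fit of y on X; X must include a column of ones."""
    n, p = X.shape                      # p counts the intercept as a parameter
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    ss_res = resid @ resid
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    adj = 1 - (1 - r2) * (n - 1) / (n - p)
    return float(r2), float(adj)


rng = np.random.default_rng(6)
n = 40
x = rng.normal(size=n)
junk = rng.normal(size=(n, 5))          # regressors unrelated to y
y = 1 + 2 * x + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, junk])

r2_small, adj_small = r_squared(y, X_small)
r2_big, adj_big = r_squared(y, X_big)
print("1 regressor : R2 =", round(r2_small, 3), " adj R2 =", round(adj_small, 3))
print("6 regressors: R2 =", round(r2_big, 3), " adj R2 =", round(adj_big, 3))
```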
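What’s a Residual?
General form of linear model:
\[ Y_i = \alpha + \beta X_i + \varepsilon_i \quad \text{(true)} \]
\[ \hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i \quad \text{(predicted)} \]
\[ \hat{\varepsilon}_i = Y_i - \hat{Y}_i \quad \text{(“residual”)} \]
Graphically on board.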
Residual Plots
[Figure: two panels for the model fitted on Phosphorus + Nitrogen. Residuals vs. Fit: residuals (-40 to 60) against fitted values (50 to 200). Normal Quantile Plot: residuals (-40 to 60) against quantiles of standard normal (-2 to 2). Points 7, 10 and 14 are labeled in both panels.]
Back to Gasoline Consumption
Recall, interested in how gas consumption is affected by price increase (say $0.10/gal.)
Variables:
  Gas consumption per capita (G)
  Gas price (Pg)
  Income (Y)
  New car price (Pnc)
  Used car price (Puc)
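2 Alternative Specifications
Linear specification:
\[ G_t = \beta_0 + \beta_1 Pg_t + \beta_2 Y_t + \beta_3 Pnc_t + \beta_4 Puc_t + \varepsilon_t \]
Log-log specification (often used with economic data):
\[ \log(G_t) = \beta_0 + \beta_1 \log(Pg_t) + \beta_2 \log(Y_t) + \beta_3 \log(Pnc_t) + \beta_4 \log(Puc_t) + \varepsilon_t \]
One way to test specification is Box-Cox Transform (see 3 lectures back)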
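One reason the log-log form suits the elasticity question is that the coefficient on log(Pg) is itself the price elasticity. The sketch below illustrates this on synthetic data only: the elasticity of -0.5, the income effect, and the variable ranges are made-up values, and the car-price regressors are left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
elasticity_true = -0.5                # hypothetical value, not an estimate from the gas data
Pg = rng.uniform(0.6, 4.1, n)         # gas price, range loosely matching the plot axis
Y = rng.uniform(5, 10, n)             # income (arbitrary units)
log_G = (0.2 + elasticity_true * np.log(Pg) + 0.8 * np.log(Y)
         + rng.normal(0, 0.05, n))    # log of per capita consumption

# Regress log(G) on a constant, log(Pg) and log(Y).
X = np.column_stack([np.ones(n), np.log(Pg), np.log(Y)])
coef, *_ = np.linalg.lstsq(X, log_G, rcond=None)
elasticity_hat = float(coef[1])

print("estimated price elasticity:", round(elasticity_hat, 3))
```

In this form a 1% rise in price is read directly as an approximately `elasticity_hat` percent change in consumption, with no unit conversion needed.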
Results of Linear Model
Parameter estimate, (p-value of t-test). Low p-value: “statistically significant”.
R² measures goodness of fit of model.
Low p-value of F statistic means model has explanatory power.

             b0       b1       b2       b3       b4       R²     p (F)
Estimate     -.09     -.04     .0002    -.10     -.04     .97    .000
(p-value)    (.08)    (.002)   (.000)   (.11)    (.08)
Answer to Question
A 1 unit increase in price leads to a .04 unit decrease in gas consumption.
Units are: G (1000 gallons), Pg ($).
So, a $0.10 increase in gas price leads to, on average, a 4 gallon decrease in per capita gas consumption…not much!
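The unit conversion behind the 4 gallon figure can be written out explicitly. The numbers are those quoted above; the variable names are chosen for the illustration.

```python
b1 = -0.04              # estimated coefficient on Pg: change in G (1000 gallons) per $1
price_change = 0.10     # dollars per gallon

# Change in G in thousands of gallons, then converted to gallons.
change_gallons = b1 * price_change * 1000

print(f"{change_gallons:.1f} gallons per capita")
```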