Ch. I Introduction to Regression and Least Squares
TRANSCRIPT
Chapter I. Slide 1
Introduction to Rossi’s MGMT 264B Slides
“I have made this longer than usual, because I lack the time to make it short.” - Pascal
I had the time! A lot of work has gone into simplifying and stripping down to the
essence. This means that the slides will require careful reading. They are designed to be self-contained but efficient! I hope you enjoy reading them.
Read them three times:
1. Before class (write questions in margin)
2. During class
3. After class
Introduction to Rossi’s MGMT 264B Slides
At the end of each chapter, there are symbol and R command glossaries, and a list of important equations.
The following symbols are used throughout the slides:
Note page available
Cool Move but difficult
Requires further thought
Chapter I. Slide 2
I. Introduction to Regression and Least Squares
a. Conditional Prediction
b. Hedonic Pricing and Flat Panel TVs
c. Linear Prediction
d. Least Squares
e. Intuition behind Least Squares
f. Relationship between b and r
g. Decomposing the variance and R2
Chapter I. Slide 3
Chapter I. Slide 4
a. Conditional Prediction
The basic problem:
– Available data
– Formulate a model to predict a variable of interest
– Use prediction to make a business decision
Chapter I. Slide 5
a. Examples of Conditional Prediction
1. Pricing Product Characteristics (Hedonic Pricing):
– Predict market value for various product characteristics
– Decision: optimal product configuration
2. Optimal portfolio choice:
– Predict future joint distribution of asset returns
– Decision: construct optimal portfolio (choose weights)
3. Determination of promotional strategy:
– Predict sales volume response to a price reduction
– Decision: what is the optimal promotional strategy?
Chapter I. Slide 6
b. Hedonic Regression
A Hedonic regression is a method that relates market prices to product characteristics.
The Hedonic regression provides a measure of what the market will bear. Contrast this with conjoint analysis, which measures only the demand side – what consumers are willing to pay. Closely related to the Value Equivalence Line.
Prediction problem: predict what the market will bear for a product that doesn’t necessarily exist, e.g. value a property with no market transaction.
Chapter I. Slide 7
b. Example: Predicting Flat-Panel TV Prices
Problem:
– Predict market price based on observed characteristics.
For example, we are considering introducing a new model with a 58” diagonal. We’d like to know what the market will bear for a TV with “average quality.”
Solution:
– Look at existing products for which we know the price and some observed characteristics
– Build a prediction rule that predicts price as a function of the observed characteristics.
Chapter I. Slide 8
b. Flat-Panel TV Data
What characteristics do we use? We must select factors useful for prediction, and we have to develop a specific quantitative measure of these factors (a variable).
– Many factors or variables affect the price of a flat-panel TV:
• Screen size
• Technology (plasma/LCD/LED)
• Brand name (?)
– Easy to quantify price and size, but what about other variables such as aesthetics?
For simplicity, let’s focus only on screen size.
b. R and course package, PERregress
Chapter I. Slide 9
Where to get R? http://cran.stat.ucla.edu/
R is a free package with versions for Windows, Mac OS, and Linux. Visit the CRAN site nearest you (see above or google “CRAN”) and download and install R.
Next you need to install the “package” made specifically for this course. This is easy to do. The package contains all of the datasets needed for the course as well as some customized functions. Start up R and select the “Packages and Data” menu item. Select a mirror site nearest you (such as UCLA) and then install the package “PERregress.”
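If you prefer typing commands to using the menus, the same installation can be done from the R console. A minimal sketch (it assumes “PERregress” can be found in the repository you select; if it cannot, use the menu route described above):

install.packages("PERregress")   # one-time setup: install the course package from your chosen mirror
library(PERregress)              # load the package at the start of each R session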
b. Install course package, PERregress
Chapter I. Slide 10
R must be version 2.12.1 or higher!
b. Alternatively, install RStudio
Chapter I. Slide 11
See Videos section of CCLE web site.
Chapter I. Slide 12
b. Flat-Panel TV Data
R stores the data as a table or spreadsheet called a dataframe.
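As a sketch of how loading and inspecting such a data frame might look (the dataset name tvdata is only a placeholder here, not necessarily the name used in PERregress):

library(PERregress)   # course package containing the datasets
data(tvdata)          # load the flat-panel TV dataset (hypothetical name)
head(tvdata)          # first few rows: one TV per row, with columns Size and Price
str(tvdata)           # structure of the data frame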
Chapter I. Slide 13
Another way is to think of the data as a scatter plot. In other words, view the data as a collection of points in the plane (Xi, Yi).
b. Flat-Panel TV Data
[Scatter plot of the flat-panel TV data: Price ($500 to $4000) on the vertical axis versus Size (35 to 65 inches) on the horizontal axis]
Chapter I. Slide 14
There appears to be a linear relationship between price and size. Let’s plot a line fit by the “eyeball” method. The R command c() combines two or more numbers into a list called a vector. Here the list contains the intercept and slope!
The line shown was fit by the “eyeball” method.
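A sketch of the plotting commands behind this slide, assuming the flat-panel TV data frame from above (tvdata is still a placeholder name) has been attached so that Size and Price can be used directly:

attach(tvdata)           # make Size and Price accessible by name
plot(Size, Price)        # scatter plot of price ($) against screen size (inches)
abline(c(-1400, 55))     # add the eyeball line: c() makes the vector (intercept, slope)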
c. Linear Prediction
[Scatter plot of Price versus Size with the eyeball line drawn through the points]
Chapter I. Slide 15
c. Linear Prediction
Let Y denote the “dependent” variable and X the “independent” or predictor variable.
Recall that the equation of a line is:
Y = b0 + b1X
where:
b0 = the intercept
b1 = the slope
– The intercept is in the units of Y ($)
– The slope is in units of Y per unit of X ($/inch)
Chapter I. Slide 16
c. Linear Prediction
[Diagram: the line Y = b0 + b1X, showing the intercept b0 on the Y axis and the slope b1 as the rise per unit run]
Our “eyeball” line has b0 = -1400 and b1 = 55.
Chapter I. Slide 17
c. Linear Prediction
We can now predict the price of a flat-panel TV when we know only the size. We simply “read the value off the line” that we drew. For example, given a flat-panel TV with size = 58:
Predicted price = -1400 + 55(58 inches) = $1790
NOTE: the unit conversion from inches to dollars is done for us by the slope coefficient (b1)
Chapter I. Slide 18
c. Linear Fitting & Prediction in R
We can fit a line to the data using the R function lm()!
[R screenshot of the lm() call, with labels marking the dependent variable (left of ~) and the independent variable (right of ~)]
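A sketch of the call, assuming Size and Price are attached as in the earlier plotting sketch:

fit <- lm(Price ~ Size)   # dependent variable ~ independent variable
summary(fit)              # coefficients: intercept about -1408.93, slope about 57.13
abline(fit)               # add the least squares line to the current scatter plot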
Chapter I. Slide 19
c. Linear Fitting & Prediction in R
R chooses a different line from ours. R fit: Price = -1408.93 + 57.13 Size
[Scatter plot of Price versus Size showing both the R (least squares) line and the eyeball line]
Chapter I. Slide 20
c. Linear Fitting & Prediction in R
Let’s look at the R prediction results when size = 58:
Predicted price = -1408.93 + 57.13(58) ≈ $1905
[Scatter plot of Price versus Size with the least squares line; the prediction at Size = 58 is read off the line]
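A sketch of the prediction command, assuming the fitted model fit from the lm() sketch above:

predict(fit, newdata = data.frame(Size = 58))   # predicted price for a 58-inch TV
# close to the hand calculation -1408.93 + 57.13*58, i.e. about $1905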
Chapter I. Slide 21
Natural questions to ask at this point:
– How does R select a best fitting line?
– Can we say anything about the accuracy of the predictions?
– What does all that other R output mean?
c. Linear Fitting & Prediction in R
Chapter I. Slide 22
d. Least Squares
A strategy for estimating the slope and intercept parameters
Data: We observe the data recorded as the pairs (Xi, Yi) i = 1,…,N
Problem: Choose a fitted line, i.e. choose (b0, b1).
A reasonable way to fit a line is to minimize the amount by which the fitted value differs from the actual value. This difference is called the residual.
Chapter I. Slide 23
d. Least Squares: Fitted Values & Residuals
What does “fitted value” mean?
[Plot of the data points and the fitted line; at each Xi, Ŷi is the height of the line and Yi is the observed point]
The dots are the observed values and the line represents our fitted values given by:
Ŷi = b0 + b1Xi
Chapter I. Slide 24
d. Least Squares: Fitted Values & Residuals
What about the residual for the ith observation?
[Plot showing one observation (Xi, Yi), its fitted value Ŷi on the line, and the vertical gap between them]
ei = Yi – Ŷi = Residual
Chapter I. Slide 25
d. Least Squares: Fitted Values & Residuals
Note: The lengths of the red and green lines are the residuals.
[Plot of the data and the fitted line, with vertical segments marking positive residuals (points above the line) and negative residuals (points below the line)]
The fitted value and the residual decompose the Yi observation into two parts: Yi = Ŷi + (Yi – Ŷi) = Ŷi + ei
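In R the fitted values and residuals of a least squares fit can be pulled out directly; a short sketch, again assuming the model fit and the attached data from above, verifies the decomposition Yi = Ŷi + ei:

yhat <- fitted(fit)                      # fitted values  Ŷi = b0 + b1*Xi
e    <- resid(fit)                       # residuals      ei = Yi - Ŷi
all.equal(as.numeric(yhat + e), Price)   # TRUE: each Yi equals Ŷi + ei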
Chapter I. Slide 26
d. The Least Squares Criterion
Ideally we want to minimize the size of all residuals. We must therefore trade off between moving closer to some points and, at the same time, moving away from other points.
The line fitting process:
i. Compute residuals
ii. Square residuals and add up
iii. Pick the best fitting line (intercept and slope)
Least Squares:
choose b0 and b1 to minimize Σei², the sum of squared residuals over i = 1,…,N
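A quick sketch of the criterion in R, assuming Size and Price are attached: compute residuals for two candidate lines, square and add them up, and compare:

e_eyeball <- Price - (-1400 + 55 * Size)         # residuals from the eyeball line
e_ls      <- Price - (-1408.93 + 57.13 * Size)   # residuals from the (rounded) least squares line
sum(e_eyeball^2)   # sum of squared residuals for the eyeball line
sum(e_ls^2)        # smaller: the least squares coefficients minimize this sum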
Chapter I. Slide 27
d. The Least Squares Criterion
Click on chart in slideshow to activate applet
Chapter I. Slide 28
What are the formulas which do the job? Least Squares Solution:
d. The Least Squares Criterion
b1 = Σ(Xi – X̄)(Yi – Ȳ) / Σ(Xi – X̄)² = sXY / sX²    (sums run from i = 1 to N)

where:
sXY = sample covariance of (X, Y)
sX² = sample variance of X

b0 = Ȳ – b1X̄
Chapter I. Slide 29
d. The Least Squares Criterion
Let’s put the data into Excel:
Now simply calculate 10.75/5 to determine the slope = 2.15
Chapter I. Slide 30
e. Intuition Behind the Least Squares Formula
The equations for b0 and b1 show us how to “process” the data (X1, Y1), (X2, Y2),…, (XN, YN) to generate guesses for the intercept and slope. The formulas use sample quantities that we are familiar with and have used in the past: the sample covariance (numerator) and the sample variance (denominator). Can we develop an intuition for the formulas that is based on prediction?
Chapter I. Slide 31
e. Intuition Behind the Least Squares Formula
The intercept: The formula for b0 ensures that the fitted regression line passes through the point of means (X̄, Ȳ). If you put in the average value of X, the least squares prediction is the average value of Y.
If we substitute in for the intercept using the LS formula, we obtain:
Ŷ – Ȳ = b1(X – X̄)
If X is above the mean, then b1 tells us how much to scale this deviation above the mean to produce a forecast of Y relative to the mean of Y.
Chapter I. Slide 32
e. Intuition Behind the Least Squares Formula
We can think of least squares as a two-part process:
i. Plot the point of means
ii. Find the line through that point which has the smallest sum of squared residuals
There are many possible lines that pass through the point of means. The least squares approach suggests that one line is the best. Can we understand this formula by applying a fundamental intuition derived from prediction?
Chapter I. Slide 33
Quick Review: Correlation
An intuition for the slope formula can be developed, but it is more complicated. Let’s first review the basic concept of correlation. The sample correlation coefficient between two variables is defined as:
Remember: the correlation coefficient is a unitless measure of linear association between two variables.
rXY = Σ(Xi – X̄)(Yi – Ȳ) / √[ Σ(Xi – X̄)² · Σ(Yi – Ȳ)² ] = sXY / (sX · sY)
Chapter I. Slide 34
Quick Review: Correlation
Examples of various samples with varying degrees of correlation
[Four scatter plots of samples with varying degrees of correlation: r = 0, r = .5, r = .75, and r = 1]
Chapter I. Slide 35
e. Intuition Behind the Least Squares Formula
What is the intuition for the LS formula for the slope, b1? Does there seem to be any relationship between e and Size in the flat-panel TV dataset?
[Scatter plot of the least squares residuals e against Size: no apparent relationship]
Does it make sense that e and size should not be correlated? YES! Suppose we decide to ignore the least squares fit and make a bad choice for b1 (green line)
[Scatter plot of Price versus Size with a deliberately bad line (the green line) drawn through the data]
Chapter I. Slide 36
e. Intuition Behind the Least Squares Formula
Our bad choice: -2543 + 80 * size
Least squares fit: -1408.93 + 57.13 * size
Chapter I. Slide 37
e. Intuition Behind the Least Squares Formula
Let’s look at the relationship between e and Size for the “bad fit”: Clearly, we have left something on the table in terms of prediction!
[Scatter plot of the bad-fit residuals against Size: a clear relationship remains in the residuals]
We tend to underestimate (+e) the value of the TVs with size below the average size and overestimate (-e) the value of flat-panel TVs larger than average
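A sketch of this comparison in R, assuming Size and Price are attached:

e_bad <- Price - (-2543 + 80 * Size)   # residuals from the deliberately bad line
e_ls  <- resid(lm(Price ~ Size))       # residuals from the least squares line
cor(e_bad, Size)                       # clearly non-zero: "Xness" left in the residuals
cor(e_ls, Size)                        # essentially zero for the least squares fit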
Chapter I. Slide 38
Y = Ŷ + e
e. Intuition Behind the Least Squares Formula
As long as the correlation between e and X is non-zero, we could always adjust our prediction rule to do better. That is, we need to exploit all of the predictive power available in the X values and put it into Ŷ, leaving no “Xness” in the residuals. In summary, each observation is decomposed into fitted and residual values:
Ŷ: “made from X”, corr(Ŷ, X) = 1
e: unrelated to X, corr(e, X) = 0
Chapter I. Slide 39
e. Intuition Behind the Least Squares Formula
The Fundamental Properties of Optimal Prediction:
– The optimal prediction of Y for an “average” or representative X should be the “average” or representative value of Y
– Prediction errors should be unrelated to the information used to formulate the predictions
For linear models:
– If X = X̄, then Ŷ = Ȳ
– corr(Y – Ŷ, X) = corr(e, X) = 0
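Both properties can be checked numerically for the least squares fit; a sketch, assuming the attached flat-panel TV data from before:

fit <- lm(Price ~ Size)
predict(fit, newdata = data.frame(Size = mean(Size)))   # prediction at the average Size ...
mean(Price)                                             # ... equals the average Price
cor(resid(fit), Size)                                   # zero, up to rounding error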
Chapter I. Slide 40
f. The Relationship Between b and r
The least squares slope, b1, can be written as the ratio of the sample covariance to the sample variance of X. This is close to, but not the same as, the sample correlation. Recall that b must “convert” the units of X into the units of Y and is expressed in units of Y per unit of X. The sample correlation coefficient, however, is unit-less. The relationship is given by:
b1 = sXY / sX²

b1 = sXY / sX² = rXY · (sY / sX)
Chapter I. Slide 41
f. The Relationship between b and r
Returning to the flat-panel TV price data, we can use commands in R to compute the least squares estimate of the slope using only simple statistics: the sample covariance sXY and the sample variance sX².
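A sketch of those commands, assuming Size and Price are attached:

sxy <- cov(Size, Price)   # sample covariance of Size and Price
sx2 <- var(Size)          # sample variance of Size
sxy / sx2                 # the least squares slope b1 (about 57.13)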
Chapter I. Slide 42
f. The Relationship between b and r
Now let’s use the relationship between the regression coefficient and the sample correlation, b1 = rXY · (sY / sX), to compute the slope estimate.
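A sketch of the corresponding commands, again assuming Size and Price are attached:

rxy <- cor(Size, Price)   # sample correlation (unit-less)
sy  <- sd(Price)          # sample standard deviation of Price ($)
sx  <- sd(Size)           # sample standard deviation of Size (inches)
rxy * sy / sx             # b1 again: the same answer as sxy/sx2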
Chapter I. Slide 43
g. Decomposing the Variance – The ANOVA Table
We now know that:
• Σei = 0 (which implies ē = 0, as well!)
• corr(e, X) = 0
• corr(Ŷ, e) = 0
We can now use these facts to decompose the total variance of Y:
Var(Y) = Var(Ŷ) + Var(e) + 2 cov(Ŷ, e), or
Var(Y) = Var(Ŷ) + Var(e)
What is true for the average square (the variance) is also true for the total sum of squares:
Σ(Yi – Ȳ)² = Σ(Ŷi – Ȳ)² + Σei²    (sums over i = 1,…,N)
Total Sum of Squares (SST) = Regression SS (SSR) + Error SS (SSE)
Chapter I. Slide 44
Let’s find this on the R printout
g. Decomposing the Variance – The ANOVA Table
[R printout of the ANOVA table, with the SSR and SSE entries marked]
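A sketch of how that table is produced in R, assuming the fitted model from before:

fit <- lm(Price ~ Size)
anova(fit)   # the Size row's Sum Sq is SSR; the Residuals row's Sum Sq is SSE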
Chapter I. Slide 45
g. A Goodness of Fit Measure: R2
We have a good fit if:
• SSR is large
• SSE is small
• The fit would be perfect if SST = SSR
To summarize how close SSR is to SST, we define the coefficient of determination:
R² = SSR/SST = 1 – SSE/SST
Two interpretations:
i. Percentage of variation in Y explained by X
ii. Square of the correlation (hence “r” squared)
Chapter I. Slide 46
g. A Goodness of Fit Measure: R2
R2 on the R printout
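A sketch of several equivalent ways to compute R² in R, assuming the fitted model and attached data from before (the row labels "Size" and "Residuals" are those of the one-regressor ANOVA table above):

a   <- anova(fit)
SSR <- a["Size", "Sum Sq"]        # regression sum of squares
SSE <- a["Residuals", "Sum Sq"]   # error sum of squares
SSR / (SSR + SSE)                 # R² = SSR / SST
cor(Size, Price)^2                # the squared correlation gives the same number
summary(fit)$r.squared            # the value reported on the R printout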
Chapter I. Slide 47
g. Misuse of R2
R-squared is often misused. Some establish arbitrary benchmarks for “high” values; for example, most regard values over .8 as “high.”
“High” values of R² are often associated with claims that the model is:
1. adequate or correctly specified – “valid”
2. highly accurate for prediction
Unfortunately, neither is correct. More later!
Chapter I. Slide 48
Appendix: Derivation of LS Slope
We have satisfied ourselves that corr(e, X) = 0. Now let us use this intuition to derive the least squares formula for b1.
Verification: realize that corr(e, X) = 0 is equivalent to cov(e, X) = 0 (why?). Substitute in for e, using ei = Yi – b0 – b1Xi and b0 = Ȳ – b1X̄:
cov(e, X) = cov(X, e) = 1/(N–1) Σ(Xi – X̄)ei
Σ(Xi – X̄)(Yi – b0 – b1Xi) = Σ(Xi – X̄)(Yi – Ȳ – b1Xi + b1X̄) = 0
Chapter I. Slide 49
Appendix: Derivation of LS Slope
Verification (continued):
Σ(Xi – X̄)[(Yi – Ȳ) – b1Xi + b1X̄] = 0
or
Σ(Xi – X̄)[(Yi – Ȳ) – b1(Xi – X̄)] = 0
or
Σ[(Xi – X̄)(Yi – Ȳ) – b1(Xi – X̄)²] = 0
Solving for b1:
b1 = Σ(Xi – X̄)(Yi – Ȳ) / Σ(Xi – X̄)²
Chapter I. Slide 50
Glossary of Symbols
"X= independent or explanatory variable Y= dependent variable "" b0 - least squares estimate of intercept b1 - least squares estimate of regression slope ei - least squares residual Ŷ - fitted value s - sample standard dev (subscript tells you of what var)
s2 - sample variance (subscript tells you of what var) r - sample correlation coefficient SST - total sum of squares SSR - regression sum of squares SSE - error sum of squares R2 - coefficient of determination, goodness of fit measure
Chapter I. Slide 51
Important Equations
Ŷi = b0 + b1Xi    (fitted value)
ei = Yi – Ŷi    (residual)
b1 = Σ(Xi – X̄)(Yi – Ȳ) / Σ(Xi – X̄)² = sXY / sX²
b0 = Ȳ – b1X̄    (least squares formulae for slope and intercept)
Chapter I. Slide 52
Important Equations
Ŷ – Ȳ = b1(X – X̄)    (the least squares line passes through the point of means)
b1 = sXY / sX² = rXY · (sY / sX)    (relationship between the slope coefficient and the correlation)
R² = SSR/SST = 1 – SSE/SST    (alternative definitions for R-squared)
Glossary of R commands
- abline(c(intercept,slope)): Adds a line with the given intercept and slope to the current plot.
- attach(A): The data frame A is attached to the R search path, so objects in A can be accessed by simply giving their names.
- anova(): Computes analysis of variance (or deviance) tables for one or more fitted model objects.
- c(): Combines values into a vector or list.
- cov(x,y), cor(x,y): Compute the covariance or correlation of variables x and y.
- data(A): Allows access to data frame A that is in the data library.
- head(A): Returns the first parts of a data frame A.
- library("PERregress"): Loads the R package containing datasets and customized functions for our class.
- lm(Y~X): Fits a linear model. Y is the dependent variable and X is the independent variable.
Chapter I. Slide 53
Glossary of R commands
- plot(X,Y): Produces a scatter plot of Y against X.
- predict(): Predictions from the results of various model fitting functions.
- A=read.csv("file_name"): Reads a file in .csv format and creates data frame A from it.
- str(object): Tells you the “structure” of an object, e.g. str(dataframe).
- summary(): Generic function used to produce summaries of the results of various model fitting functions.
- var(X), mean(X), sd(X): Compute the variance, mean, and standard deviation of a variable named X.
Chapter I. Slide 54