MS&E 226: Fundamentals of Data Science
Lecture 2: Linear Regression

Ramesh Johari ([email protected])




Summarizing a sample


A sample

Suppose $\mathbf{Y} = (Y_1, \ldots, Y_n)$ is a sample of real-valued observations.

Simple statistics:

- Sample mean:
  $$\bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i.$$
- Sample median:
  - Order the $Y_i$ from lowest to highest.
  - The median is the average of the $(n/2)$'th and $(n/2+1)$'st elements of this list (if $n$ is even), or the $((n+1)/2)$'th element of this list (if $n$ is odd).
  - More robust to "outliers"


- Sample standard deviation:
  $$\hat{\sigma}_Y = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (Y_i - \bar{Y})^2}.$$
  Measures dispersion of the data. (Why $n - 1$? See homework.)
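These summaries are easy to compute directly from the definitions. A minimal sketch (the vector y is a made-up toy example, for illustration only):

# Toy data (hypothetical values).
y <- c(3, 1, 4, 1, 5, 9, 2, 6)
n <- length(y)

# Sample mean: (1/n) times the sum of the observations.
y_bar <- sum(y) / n
stopifnot(all.equal(y_bar, mean(y)))

# Sample median: n is even here, so average the two middle order statistics.
y_sorted <- sort(y)
y_med <- (y_sorted[n / 2] + y_sorted[n / 2 + 1]) / 2
stopifnot(all.equal(y_med, median(y)))

# Sample standard deviation, with the n - 1 denominator.
sigma_y <- sqrt(sum((y - y_bar)^2) / (n - 1))
stopifnot(all.equal(sigma_y, sd(y)))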


Example in R

Prices of houses in Saratoga County, New York in 2006.

1,728 observations on 16 variables (e.g., price, lotSize, livingArea, etc.). Available via the mosaicData package (background at mosaic-web.org).

> install.packages("mosaicData")

> library(mosaicData)

> mean(SaratogaHouses$livingArea)

[1] 1754.976

> median(SaratogaHouses$livingArea)

[1] 1634.5

> sd(SaratogaHouses$livingArea)

[1] 619.9356


Relationships


Modeling relationships

We focus on a particular type of summarization:

Modeling the relationship between observations.

Formally:

- Let $Y_i$, $i = 1, \ldots, n$, be the $i$'th observed (real-valued) outcome. Let $\mathbf{Y} = (Y_1, \ldots, Y_n)$.
- Let $X_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, p$, be the $i$'th observation of the $j$'th (real-valued) covariate. Let $X_i = (X_{i1}, \ldots, X_{ip})$. Let $\mathbf{X}$ be the matrix whose rows are the $X_i$.


Pictures and names

How to visualize Y and X?

Names for the $Y_i$'s: outcomes, response variables, target variables, dependent variables.

Names for the $X_{ij}$'s: covariates, features, regressors, predictors, explanatory variables, independent variables.

X is also called the design matrix.


Example in R

The SaratogaHouses dataset loaded earlier contains 16 columns, including:

price            price (in dollars)
livingArea       living area (in square feet)
age              age of house (in years)
bedrooms         number of bedrooms
bathrooms        number of bathrooms
heating          type of heating system
newConstruction  whether the house is new construction

[ Note: Always question how variables are defined! ]

Reasonable question:

How is price related to the other variables?


Example in R

> library(tidyverse)

> sh = SaratogaHouses %>%

select(price,

livingArea,

age,

bedrooms,

bathrooms,

heating,

new = newConstruction)


Example in R

> sh
   price livingArea age bedrooms bathrooms         heating new
1 132500        906  42        2       1.0        electric  No
2 181115       1953   0        3       2.5 hot water/steam  No
3 109000       1944 133        4       1.0 hot water/steam  No
4 155000       1944  13        3       1.5         hot air  No
5  86060        840   0        2       1.0         hot air Yes
6 120000       1152  31        4       1.0         hot air  No
...

We will treat price as our outcome variable.


Continuous variables

Variables such as price and livingArea are continuous variables: they are naturally real-valued.

For now we only consider outcome variables that are continuous (like price).

Note: even continuous variables can be constrained:

- Both price and livingArea must be positive.
- bedrooms must be a positive integer.


Categorical variables

Other variables take on only finitely many values, e.g.:

- new is Yes or No if the house is or is not new construction.
- heating is one of the following:
  - electric
  - hot water/steam
  - hot air

These are categorical variables (or factors).


In R: use factor(variable.name) to tell R to treat a variable as a factor.
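For instance, a minimal sketch using the sh data frame from earlier (heating should already be stored as a factor in SaratogaHouses):

# Inspect the levels R has recorded for the categorical variable.
levels(sh$heating)

# If a categorical variable were stored as text instead, convert it first:
sh$heating <- factor(sh$heating)

# In a regression, R automatically expands a factor into indicator
# (dummy) columns, one per non-baseline level:
fm <- lm(data = sh, price ~ 1 + livingArea + heating)
coef(fm)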


Modeling relationships

Goal:

Find a functional relationship $f$ such that:

$$Y_i \approx f(X_i)$$

This is our first example of a "model."

We use models for lots of things:

- Associations and correlations
- Predictions
- Causal relationships


Linear regression models


Linear relationships

We first focus on modeling the relationship between outcomes and covariates as linear.

In other words: find coefficients $\hat{\beta}_0, \ldots, \hat{\beta}_p$ such that:[1]

$$Y_i \approx \hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_p X_{ip}.$$

This is a linear regression model.

[1] We use "hats" on variables to denote quantities computed from data. In this case, whatever the coefficients are, they will have to be computed from the data we were given.


Matrix notation

We can compactly represent a linear model using matrix notation:

- Let $\hat{\beta} = [\hat{\beta}_0, \hat{\beta}_1, \cdots, \hat{\beta}_p]^\top$ be the $(p+1) \times 1$ column vector of coefficients.
- Expand $\mathbf{X}$ to have $p + 1$ columns, where the first column (indexed $j = 0$) is $X_{i0} = 1$ for all $i$.
- Then the linear regression model is that, for each $i$:

  $$Y_i \approx X_i \hat{\beta},$$

  or even more compactly,

  $$\mathbf{Y} \approx \mathbf{X} \hat{\beta}.$$
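In R, the expanded design matrix (including the leading column of 1's) can be inspected directly. A minimal sketch with the sh data frame from earlier:

# model.matrix() builds X from a formula, adding the intercept column.
X <- model.matrix(price ~ 1 + livingArea + bedrooms, data = sh)
head(X)   # first rows: a column of 1's, then the covariates
dim(X)    # n rows, p + 1 columns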


Matrix notation

A picture of Y, X, and �:

[Hand-drawn figure: the $n \times 1$ vector $\mathbf{Y}$, the $n \times (p+1)$ matrix $\mathbf{X}$, and the $(p+1) \times 1$ vector $\hat{\beta}$, annotated with $Y_i \approx \hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_p X_{ip}$.]


Example in R

ggplot(data = sh, aes(x = livingArea, y = price)) +

geom_point()

[Scatterplot: price (y-axis, 0e+00 to 8e+05 dollars) against livingArea (x-axis, 1000 to 5000 square feet).]

Looks like price is positively correlated with livingArea. (Use ggplot via the tidyverse.)


Example in R

Let's build a simple regression model of price against livingArea.

> fm = lm(data = sh, price ~ 1 + livingArea)

> summary(fm)

...

Coefficients:

Estimate ...

(Intercept) 13439.394 ...

livingArea 113.123 ...

...

In other words: price ≈ 13,439.394 + 113.123 × livingArea.

Note: summary(fm) produces lots of other output too! We are going to gradually work in this course to understand what each of those pieces of output means.


Example in R

Here is the model plotted against the data:

[Scatterplot of price against livingArea with the fitted regression line overlaid.]

> ggplot(data = sh, aes(x = livingArea, y = price)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)


Example in R: Multiple regression

We can include multiple covariates in our linear model.

> fm = lm(data = sh, price ~ 1 + livingArea + bedrooms)

> summary(fm)

...

Coefficients:

Estimate ...

(Intercept) 36667.895 ...

livingArea 125.405 ...

bedrooms -14196.769 ...

(Note that the coefficient on livingArea is different now... we will discuss why later.)


How to choose $\hat{\beta}$?

There are many ways to choose $\hat{\beta}$.

We focus primarily on ordinary least squares (OLS):

Choose $\hat{\beta}$ so that

$$\mathrm{SSE} = \text{sum of squared errors} = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2$$

is minimized, where

$$\hat{Y}_i = X_i \hat{\beta} = \hat{\beta}_0 + \sum_{j=1}^p \hat{\beta}_j X_{ij}$$

is the fitted value of the $i$'th observation.

This is what R (typically) does when you call lm. (Later in the course we develop one justification for this choice.)
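A minimal sketch of the SSE computation for the model fit earlier:

fm <- lm(data = sh, price ~ 1 + livingArea)

# SSE: squared gaps between observed and fitted values...
sse <- sum((sh$price - fitted(fm))^2)

# ...which is exactly the sum of squared residuals.
stopifnot(all.equal(sse, sum(residuals(fm)^2)))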


Questions to ask

Here are some important questions to be asking:

- Is the resulting model a good fit?
- Does it make sense to use a linear model?
- Is minimizing SSE the right objective?

We start down this road by working through the algebra of linear regression.


Ordinary least squares: Solution


OLS solution

From here on out we assume that $p < n$ and $\mathbf{X}$ has full rank $= p + 1$.

(What does $p < n$ mean, and why do we need it?)

Theorem. The vector $\hat{\beta}$ that minimizes SSE is given by:

$$\hat{\beta} = \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{Y}.$$

(Check that the dimensions make sense here: $\hat{\beta}$ is $(p+1) \times 1$.)
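A minimal sketch verifying the formula numerically against lm:

X <- model.matrix(price ~ 1 + livingArea + bedrooms, data = sh)
Y <- sh$price

# Solve the normal equations (X'X) beta = X'Y directly.
# (solve(A, b) is numerically preferable to forming an explicit inverse.)
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)

# The result matches the coefficients reported by lm().
fm <- lm(price ~ 1 + livingArea + bedrooms, data = sh)
stopifnot(all.equal(as.vector(beta_hat), unname(coef(fm))))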


OLS solution: Intuition

The SSE is the squared Euclidean norm of $\mathbf{Y} - \hat{\mathbf{Y}}$:

$$\mathrm{SSE} = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \|\mathbf{Y} - \hat{\mathbf{Y}}\|^2 = \|\mathbf{Y} - \mathbf{X}\hat{\beta}\|^2.$$

Note that as we vary $\hat{\beta}$ we range over linear combinations of the columns of $\mathbf{X}$.

The collection of all such linear combinations is the subspace spanned by the columns of $\mathbf{X}$.

So the linear regression question is

What is the “closest” such linear combination to Y?


OLS solution: Geometry

[Hand-drawn figure: $\mathbf{Y}$ is projected orthogonally onto the subspace spanned by the columns of $\mathbf{X}$; the projection is $\hat{\mathbf{Y}} = \hat{\beta}_0 \mathbf{1} + \hat{\beta}_1 X_{\cdot 1} + \cdots + \hat{\beta}_p X_{\cdot p}$, and the residual $\mathbf{Y} - \hat{\mathbf{Y}}$ is orthogonal to that subspace.]


OLS solution: Algebraic proof [*]

Based on [SM], Exercise 3B14:

- Observe that $\mathbf{X}^\top \mathbf{X}$ is symmetric and invertible. (Why?)
- Note that $\mathbf{X}^\top \hat{r} = 0$, where $\hat{r} = \mathbf{Y} - \mathbf{X}\hat{\beta}$ is the vector of residuals. In other words: the residual vector is orthogonal to every column of $\mathbf{X}$.
- Now consider any vector $\beta$ that is $(p+1) \times 1$. Note that: $\mathbf{Y} - \mathbf{X}\beta = \hat{r} + \mathbf{X}(\hat{\beta} - \beta)$.
- Since $\hat{r}$ is orthogonal to $\mathbf{X}$, we get: $\|\mathbf{Y} - \mathbf{X}\beta\|^2 = \|\hat{r}\|^2 + \|\mathbf{X}(\hat{\beta} - \beta)\|^2$.
- The preceding value is minimized when $\mathbf{X}(\hat{\beta} - \beta) = 0$.
- Since $\mathbf{X}$ has rank $p + 1$, the preceding equation has the unique solution $\beta = \hat{\beta}$.


Hat matrix (useful for later) [*]

Since $\hat{\mathbf{Y}} = \mathbf{X}\hat{\beta} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}$, we have:

$$\hat{\mathbf{Y}} = \mathbf{H} \mathbf{Y},$$

where:

$$\mathbf{H} = \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top.$$

$\mathbf{H}$ is called the hat matrix. It projects $\mathbf{Y}$ into the subspace spanned by the columns of $\mathbf{X}$. It is symmetric and idempotent ($\mathbf{H}^2 = \mathbf{H}$).
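A minimal sketch checking these properties numerically (note H is n × n, so this is only practical for modest n):

X <- model.matrix(price ~ 1 + livingArea, data = sh)
H <- X %*% solve(t(X) %*% X) %*% t(X)

# H reproduces the fitted values from lm()...
fm <- lm(data = sh, price ~ 1 + livingArea)
stopifnot(all.equal(as.vector(H %*% sh$price), unname(fitted(fm))))

# ...and is symmetric and idempotent.
stopifnot(all.equal(H, t(H)))
stopifnot(all.equal(H %*% H, H))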


Residuals and $R^2$


Residuals

Let $\hat{r} = \mathbf{Y} - \hat{\mathbf{Y}} = \mathbf{Y} - \mathbf{X}\hat{\beta}$ be the vector of residuals.

Our analysis shows us that $\hat{r}$ is orthogonal to every column of $\mathbf{X}$.

In particular, $\hat{r}$ is orthogonal to the all-1's vector (the first column of $\mathbf{X}$), so:

$$\bar{\hat{Y}} = \frac{1}{n} \sum_{i=1}^n \hat{Y}_i = \frac{1}{n} \sum_{i=1}^n Y_i = \bar{Y}.$$

In other words, the residuals sum to zero, and the original and fitted values have the same sample mean.
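A minimal sketch confirming both facts for the model fit earlier:

fm <- lm(data = sh, price ~ 1 + livingArea)

# The residuals sum to zero (up to floating-point error)...
sum(residuals(fm))

# ...and the fitted values have the same sample mean as the outcomes.
mean(fitted(fm)) - mean(sh$price)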


Residuals

Since $\hat{r}$ is orthogonal to every column of $\mathbf{X}$, we use the Pythagorean theorem to get:

$$\|\mathbf{Y}\|^2 = \|\hat{r}\|^2 + \|\hat{\mathbf{Y}}\|^2.$$

Using equality of sample means we get:

$$\|\mathbf{Y}\|^2 - n\bar{Y}^2 = \|\hat{r}\|^2 + \|\hat{\mathbf{Y}}\|^2 - n\bar{Y}^2.$$
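A minimal numeric check of this decomposition:

fm <- lm(data = sh, price ~ 1 + livingArea)

# ||Y||^2 - (||r||^2 + ||Y_hat||^2) should be zero up to rounding.
sum(sh$price^2) - (sum(residuals(fm)^2) + sum(fitted(fm)^2))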


Residuals

How do we interpret:

$$\|\mathbf{Y}\|^2 - n\bar{Y}^2 = \|\hat{r}\|^2 + \|\hat{\mathbf{Y}}\|^2 - n\bar{Y}^2?$$

Note that $\frac{1}{n-1}(\|\mathbf{Y}\|^2 - n\bar{Y}^2)$ is the sample variance of $\mathbf{Y}$.[2]

Note that $\frac{1}{n-1}(\|\hat{\mathbf{Y}}\|^2 - n\bar{Y}^2)$ is the sample variance of $\hat{\mathbf{Y}}$.

So this relation suggests how much of the variation in $\mathbf{Y}$ is "explained" by $\hat{\mathbf{Y}}$.

[2] Note that the (adjusted) sample variance is usually defined as $\frac{1}{n-1} \sum_{i=1}^n (Y_i - \bar{Y})^2$. You should check this is equal to the expression on the slide!
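For completeness, here is that check: expanding the square and using $\sum_{i=1}^n Y_i = n\bar{Y}$,

$$\sum_{i=1}^n (Y_i - \bar{Y})^2 = \sum_{i=1}^n Y_i^2 - 2\bar{Y} \sum_{i=1}^n Y_i + n\bar{Y}^2 = \|\mathbf{Y}\|^2 - 2n\bar{Y}^2 + n\bar{Y}^2 = \|\mathbf{Y}\|^2 - n\bar{Y}^2.$$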


$R^2$

Formally:

$$R^2 = \frac{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2}$$

is a measure of the fit of the model, with $0 \le R^2 \le 1$.[3]

When $R^2$ is large, much of the outcome sample variance is "explained" by the fitted values.

Note that $R^2$ is an in-sample measurement of fit: we used the data itself to construct a fit to the data.

[3] Note that this result depends on $\bar{\hat{Y}} = \bar{Y}$, which in turn depends on the fact that the all-1's vector is part of $\mathbf{X}$, i.e., that our linear model has an intercept term.


Example in R

The full output of our model earlier includes $R^2$:

> fm = lm(data = sh, price ~ 1 + livingArea)

> summary(fm)

...

Multiple R-squared:  0.5075,    Adjusted R-squared:  0.5072

...

Here Multiple R-squared is the $R^2$ value. (We will discuss adjusted $R^2$ later in the course.)
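As a check, the formula from the previous slide reproduces this number. A minimal sketch:

fm <- lm(data = sh, price ~ 1 + livingArea)

# R^2 = (variation in fitted values) / (variation in outcomes).
r2 <- sum((fitted(fm) - mean(sh$price))^2) /
  sum((sh$price - mean(sh$price))^2)
r2                                  # approximately 0.5075
stopifnot(all.equal(r2, summary(fm)$r.squared))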


Example in R

We can plot the residuals for our earlier model:

[Scatterplot of residuals(fm) (y-axis, −2e+05 to 4e+05) against fitted(fm) (x-axis, 2e+05 to 6e+05).]

> fm = lm(data = sh, price ~ 1 + livingArea)

> qplot(fitted(fm), residuals(fm), alpha = I(0.1))

Note: We generally plot residuals against fitted values, not the original outcomes. Try plotting residuals against the original outcomes to see what happens!


Questions

- What do you hope to see when you plot the residuals?
- Why might $R^2$ be high, yet the model fit poorly?
- Why might $R^2$ be low, and yet the model be useful?
- What happens to $R^2$ if we add additional covariates to the model?
