
Page 1

INTRODUCTION TO MACHINE LEARNING

Regression: Simple and Linear

Page 2

Introduction to Machine Learning

Regression Principle

PREDICTORS → REGRESSION → RESPONSE

Page 3

Introduction to Machine Learning

Example
Shop Data: sales, competition, district size, ...

Data Analyst: Relationship?

● Predictors: competition, advertisement, …

● Response: sales

Shopkeeper: Predictions!

Page 4

Introduction to Machine Learning

Simple Linear Regression
● Simple: one predictor to model the response

● Linear: approximately linear relationship

Linearity Plausible? Scatterplot!

Page 5

Introduction to Machine Learning

Example
● Relationship: advertisement → sales

● Expectation: positively correlated

Page 6

Introduction to Machine Learning

Example
● Observation: upwards linear trend

● First Step: simple linear regression

[Scatterplot: sales vs. advertisement, showing an upward linear trend]

Page 7

Introduction to Machine Learning

Model
Fitting a line: y = β_0 + β_1·x + ε

● Response: y
● Predictor: x
● Intercept: β_0
● Slope: β_1
● Statistical Error: ε

Page 8

Introduction to Machine Learning

Estimating Coefficients
[Scatterplot: Sales vs. Advertisement, with the fitted line and the residuals drawn in]

● True Response: y_i
● Fitted Response: ŷ_i
● Residuals: y_i − ŷ_i
● #Observations: n

Minimize the residual sum of squares: RSS = Σ_{i=1}^{n} (y_i − ŷ_i)²
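To make the least-squares criterion concrete, here is a minimal R sketch (assuming a shop_data data frame with columns sales and ads, as used on the next slides) that computes the fitted responses, residuals, and RSS by hand:

fit <- lm(sales ~ ads, data = shop_data)    # least-squares fit
y_hat <- fitted(fit)                        # fitted responses
res <- shop_data$sales - y_hat              # residuals: true response minus fitted response
rss <- sum(res^2)                           # residual sum of squares, the quantity being minimized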

Page 9

Introduction to Machine Learning

Estimating Coefficients

> my_lm <- lm(sales ~ ads, data = shop_data)

Response: sales, Predictor: ads

> my_lm$coefficients
Returns the estimated coefficients
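As a usage note (a sketch, reusing the my_lm fit from above), the returned named vector can be unpacked into intercept and slope:

coefs <- my_lm$coefficients    # named vector: "(Intercept)" and "ads"
b0 <- coefs["(Intercept)"]     # estimated intercept
b1 <- coefs["ads"]             # estimated slope for the ads predictor
# fitted line: predicted sales = b0 + b1 * ads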

Page 10

Introduction to Machine Learning

Prediction with Regression
Predicting new outcomes

> y_new <- predict(my_lm, x_new, interval = "confidence")

● y_new: estimated response
● my_lm: estimated coefficients
● x_new: new predictor instance (must be a data frame)
● interval = "confidence": provides a confidence interval

Example: Ads: $11,000 → Sales: $380,000
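A minimal sketch of this call (reusing my_lm from the previous slide; the column name ads must match the predictor in the model formula, and it is assumed here that ads is recorded in thousands of dollars, as the plots suggest):

x_new <- data.frame(ads = 11)                              # new predictor instance, as a data frame
y_new <- predict(my_lm, x_new, interval = "confidence")    # estimated response plus confidence bounds
y_new                                                      # columns: fit, lwr, upr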

Page 11

Introduction to Machine Learning

Accuracy: RMSE
Measure of accuracy: RMSE = √( (1/n) · Σ_{i=1}^{n} (y_i − ŷ_i)² )

● True Response: y_i
● Estimated Response: ŷ_i
● #Observations: n

RMSE has a unit and scale → difficult to interpret!
Example: RMSE = $76,000. Meaning?
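A sketch of computing the RMSE in R (assuming the my_lm fit on shop_data from earlier):

res <- my_lm$residuals       # residuals: true response minus fitted response
rmse <- sqrt(mean(res^2))    # root mean squared error, in the same unit and scale as sales
rmse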

Page 12

Introduction to Machine Learning

Accuracy: R-squared
R² = 1 − RSS / Total SS, with Total SS = Σ_{i=1}^{n} (y_i − ȳ)² and ȳ the sample mean response

● Interpretation: % of explained variance
● Close to 1 → good fit!

> summary(my_lm)$r.squared
Returns R-squared

Example: 0.84
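To connect the formula to the code, a sketch (same assumed my_lm and shop_data):

rss <- sum(my_lm$residuals^2)                             # residual sum of squares
tss <- sum((shop_data$sales - mean(shop_data$sales))^2)   # total SS around the sample mean response
1 - rss / tss                                             # should match summary(my_lm)$r.squared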

Page 13

INTRODUCTION TO MACHINE LEARNING

Let’s practice!

Page 14

INTRODUCTION TO MACHINE LEARNING

Multivariable Linear Regression

Page 15

Introduction to Machine Learning

Example
Simple Linear Regression:

> lm(sales ~ ads, data = shop_data)
[Scatterplot: sales vs. nearby competition]

> lm(sales ~ comp, data = shop_data)
[Scatterplot: sales vs. nearby competition]

Each simple model uses only one predictor → loss of information!

Page 16

Introduction to Machine Learning

Multi-Linear Model

Solution: combine the predictors in one multi-linear model!

● Individual effect per predictor

● Higher predictive power

● Higher accuracy

Page 17

Introduction to Machine Learning

Multi-Linear Regression Model

y = β_0 + β_1·x_1 + β_2·x_2 + … + β_p·x_p + ε

● Response: y
● Predictors: x_1, x_2, …, x_p
● Coefficients: β_0, β_1, …, β_p
● Statistical Error: ε

Page 18

Introduction to Machine Learning

Estimating Coefficients

● True Response: y_i
● Fitted Response: ŷ_i
● Residuals: y_i − ŷ_i
● #Observations: n

Minimize the residual sum of squares: RSS = Σ_{i=1}^{n} (y_i − ŷ_i)²

Page 19

Introduction to Machine Learning

Extending!
Extend the methodology to p predictors. More predictors: total inventory, district size, …

> my_lm <- lm(sales ~ ads + comp + ..., data = shop_data)
Response: sales, Predictors: ads, comp, …

Page 20

Introduction to Machine Learning

RMSE & Adjusted R-Squared
More predictors:
● Lower RMSE and higher R-squared
● Higher complexity and cost

Solution: adjusted R-squared
● Penalizes more predictors
● Used to compare models

> summary(my_lm)$adj.r.squared
In the example: 0.819 → 0.906
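A sketch of such a comparison (assuming shop_data with columns sales, ads, and comp, as in the slides):

lm_simple <- lm(sales ~ ads, data = shop_data)          # one predictor
lm_multi  <- lm(sales ~ ads + comp, data = shop_data)   # two predictors
summary(lm_simple)$adj.r.squared                        # adjusted R-squared, simple model
summary(lm_multi)$adj.r.squared                         # adjusted R-squared, multi-linear model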

Page 21

Introduction to Machine Learning

Influence of predictors
● p-value: indicator of the influence of a parameter
● low p-value → more likely the parameter has a significant influence

> summary(my_lm)

Call:
lm(formula = sales ~ ads + comp, data = shop_data)

Residuals:
     Min       1Q   Median       3Q      Max
-131.920  -23.009   -4.448   33.978  146.486

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   228.740     80.592   2.838 0.009084 **
ads            25.521      5.900   4.325 0.000231 ***
comp          -19.234      4.549  -4.228 0.000296 ***

The last column, Pr(>|t|), contains the p-values.

Page 22

Introduction to Machine Learning

Example
● Want 95% confidence → p-value <= 0.05
● Want 99% confidence → p-value <= 0.01

Note: Do not mix up R-squared with p-values!

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   228.740     80.592   2.838 0.009084 **
ads            25.521      5.900   4.325 0.000231 ***
comp          -19.234      4.549  -4.228 0.000296 ***

The Pr(>|t|) column contains the p-values.
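To extract the p-values programmatically (a sketch, using the same my_lm):

coef_table <- summary(my_lm)$coefficients   # matrix: Estimate, Std. Error, t value, Pr(>|t|)
p_values <- coef_table[, "Pr(>|t|)"]        # one p-value per coefficient
p_values <= 0.05                            # significant at 95% confidence?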

Page 23

Introduction to Machine Learning

Assumptions
● Just make a model, make a summary, and look at p-values?

● Not that simple!

● We made some implicit assumptions

Page 24

Introduction to Machine Learning

Verifying Assumptions
Residuals:
● Independent: no pattern in the residual plot?
● Identical Normal: approximately a line in the normal Q-Q plot?

> plot(lm_shop$fitted.values, lm_shop$residuals)
Draws the residual plot

> qqnorm(lm_shop$residuals)
Draws the normal Q-Q plot

[Residual Plot: Estimated Sales vs. Residuals]
[Normal Q-Q Plot: Theoretical Quantiles vs. Residual Quantiles]

Page 25

Introduction to Machine Learning

Verifying Assumptions
[Residual Plot: Estimated Sales vs. Residuals]
[Normal Q-Q Plot: Theoretical Quantiles vs. Residual Quantiles]

● Important to avoid mistakes!

● Alternative tests exist
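One such alternative, not named on the slide but given here as an illustration, is a formal normality test on the residuals, e.g. the Shapiro-Wilk test from base R:

shapiro.test(lm_shop$residuals)   # a small p-value suggests the residuals deviate from normality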

Page 26

INTRODUCTION TO MACHINE LEARNING

Let’s practice!

Page 27

INTRODUCTION TO MACHINE LEARNING

k-Nearest Neighbors and Generalization

Page 28

Introduction to Machine Learning

Non-Parametric Regression

[Scatterplot: y vs. x with a visible non-linear pattern]

Problem: Visible pattern, but not linear

Page 29

Introduction to Machine Learning

Non-Parametric Regression
Problem: Visible pattern, but not linear

Solutions:
● Transformation → Tedious
● Multi-linear Regression → Advanced
● Non-Parametric Regression → Doable

Page 30

Introduction to Machine Learning

Non-Parametric Regression
Problem: Visible pattern, but not linear

Techniques:

● k-Nearest Neighbors

● Kernel Regression

● Regression Trees

● …

No parameter estimation required!

Page 31

Introduction to Machine Learning

k-NN: Algorithm

Given a training set and a new observation:
[Scatterplot: y vs. x, training points plus the new observation]

Page 32

Introduction to Machine Learning

k-NN: Algorithm
Given a training set and a new observation:
1. Calculate the distance in the predictors
[Scatterplot: distances from the new observation to the training points]

Page 33

Introduction to Machine Learning

k-NN: Algorithm
Given a training set and a new observation:
2. Select the k nearest (k = 4)
[Scatterplot: the 4 nearest training points highlighted]

Page 34

Introduction to Machine Learning

k-NN: Algorithm
Given a training set and a new observation:
3. Aggregate the response of the k nearest (mean of the 4 responses)
[Scatterplot: the mean of the 4 selected responses marked]

Page 35

Introduction to Machine Learning

k-NN: Algorithm
Given a training set and a new observation:
4. The outcome is your prediction
[Scatterplot: the aggregated value marked as the prediction for the new observation]
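The four steps translate almost directly into R. A minimal sketch for a single numeric predictor (the function knn_predict and the vectors x, y, x_new are illustrative names, not from the slides):

knn_predict <- function(x, y, x_new, k = 4) {
  d <- abs(x - x_new)    # 1. distance in the predictor
  nn <- order(d)[1:k]    # 2. select the k nearest training points
  mean(y[nn])            # 3. + 4. aggregate their responses: the mean is the prediction
}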

Page 36

Introduction to Machine Learning

Choosing k
● k = 1: Perfect fit on the training set, but poor predictions
● k = #obs in training set: always predicts the overall mean, also poor predictions

Reasonable: k = 20% of #obs in the training set
Bias-variance trade-off!

Page 37

Introduction to Machine Learning

Generalization in Regression
● Built your own regression model

● Worked on training set

● Does it generalize well?!

● Two techniques

● Hold Out: simply split the dataset

● K-fold cross-validation

Page 38

Introduction to Machine Learning

Hold Out Method for Regression
Split the data into a training set and a test set (see the sketch below):
1. Build the regression model on the training set
2. Calculate the RMSE within the training set
3. Predict the outcomes of the test set
4. Calculate the RMSE within the test set
5. Compare the test RMSE and the training RMSE
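A minimal sketch of these steps (assuming shop_data with columns sales, ads, and comp; the 70/30 split is an illustrative choice, not from the slides):

set.seed(1)                                            # reproducible split
n <- nrow(shop_data)
train_idx <- sample(n, size = round(0.7 * n))          # indices for the training set
train <- shop_data[train_idx, ]
test  <- shop_data[-train_idx, ]

fit <- lm(sales ~ ads + comp, data = train)            # 1. build model on the training set
rmse_train <- sqrt(mean(fit$residuals^2))              # 2. RMSE within the training set
pred_test <- predict(fit, test)                        # 3. predict the outcomes of the test set
rmse_test <- sqrt(mean((test$sales - pred_test)^2))    # 4. RMSE within the test set
c(train = rmse_train, test = rmse_test)                # 5. compare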

Page 39

Introduction to Machine Learning

Under- and Overfitting
[Three fits of the same data: an underfit model, a good fit, and an overfit model]

Judge each fit on:
● Fit
● Generalize
● Prediction

Page 40

INTRODUCTION TO MACHINE LEARNING

Let’s practice!