
INTRODUCTION TO MACHINE LEARNING

Regression: Simple and Linear


Regression Principle

PREDICTORS → REGRESSION → RESPONSE


Example
Shop Data: sales, competition, district size, …
Data Analyst: relationship?

● Predictors: competition, advertisement, …

● Response: sales

Shopkeeper: predictions!


Simple Linear Regression
● Simple: one predictor to model the response

● Linear: approximately linear relationship

Linearity plausible? Scatterplot!


Example
● Relationship: advertisement → sales

● Expectation: positively correlated


Example
● Observation: upwards linear trend

● First Step: simple linear regression

[Scatterplot: sales vs. advertisement]

Model
Fitting a line: y = β₀ + β₁·x + ε
● Predictor: x
● Response: y
● Intercept: β₀
● Slope: β₁
● Statistical error: ε


Estimating Coefficients
Minimize the sum of squared residuals, where a residual is the true response minus the fitted response:
RSS = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
with yᵢ the true response, ŷᵢ the fitted response, and n the number of observations.
[Scatterplot: sales vs. advertisement, showing the fitted line and the residuals]
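As a minimal sketch of this criterion, assuming the shop_data data frame with sales and ads columns that the next slide uses:

> fit <- lm(sales ~ ads, data = shop_data)   # least-squares fit
> res <- shop_data$sales - predict(fit)      # true response minus fitted response
> sum(res^2)                                 # RSS, the quantity being minimized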


Estimating Coefficients
> my_lm <- lm(sales ~ ads, data = shop_data)   # formula: response ~ predictor
> my_lm$coefficients                           # returns the estimated coefficients
[Scatterplot: sales vs. advertisement with the fitted line]
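For illustration, a sketch assuming the my_lm fit above; the coefficients come back as a named vector:

> coefs <- my_lm$coefficients
> coefs["(Intercept)"]   # estimated intercept
> coefs["ads"]           # estimated slope for the ads predictor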


Prediction with Regression
Predicting new outcomes: ŷ_new = β̂₀ + β̂₁·x_new, using the estimated coefficients.
> y_new <- predict(my_lm, x_new, interval = "confidence")   # provides a confidence interval
● x_new: the new predictor instance; must be a data frame
● y_new: the estimated response
Example: Ads: $11,000 → Sales: $380,000
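A minimal sketch of the call, assuming ads is measured in thousands of dollars (an assumption) and remembering that the new instance must be a data frame whose column matches the predictor name in the formula:

> x_new <- data.frame(ads = 11)                    # new predictor instance as a data frame
> predict(my_lm, x_new, interval = "confidence")   # point estimate plus confidence bounds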


Accuracy: RMSE
Measure of accuracy:
RMSE = √( (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² )
with yᵢ the true response, ŷᵢ the estimated response, and n the number of observations.
RMSE carries the unit and scale of the response → difficult to interpret!
Example: RMSE = $76,000. Meaning?
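A sketch of the computation, assuming the my_lm fit and shop_data from before:

> res <- shop_data$sales - predict(my_lm)   # true minus estimated response
> sqrt(mean(res^2))                         # RMSE, in the same unit as sales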


Accuracy: R-squared
R² = 1 − RSS / TSS, with total sum of squares TSS = Σᵢ₌₁ⁿ (yᵢ − ȳ)², where ȳ is the sample mean response.
Interpretation: the % of explained variance; close to 1 → good fit!
> summary(my_lm)$r.squared   # returns R-squared
Example: R² = 0.84
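As a hand-computed check, a sketch assuming my_lm and shop_data; the result matches the value summary() reports:

> rss <- sum((shop_data$sales - predict(my_lm))^2)          # residual sum of squares
> tss <- sum((shop_data$sales - mean(shop_data$sales))^2)   # total sum of squares
> 1 - rss / tss                                             # equals summary(my_lm)$r.squared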

INTRODUCTION TO MACHINE LEARNING

Let’s practice!

INTRODUCTION TO MACHINE LEARNING

Multivariable Linear Regression



Example
Simple Linear Regression:
> lm(sales ~ ads, data = shop_data)
> lm(sales ~ comp, data = shop_data)
[Scatterplots: sales vs. advertisement; sales vs. nearby competition]
Loss of information!


Multi-Linear Model
Solution: combine the predictors in one multi-linear model, e.g. sales = β₀ + β₁·ads + β₂·comp + ε, where each slope captures the individual effect of its predictor.
● Higher predictive power
● Higher accuracy


Multi-Linear Regression Model
y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε
● Predictors: x₁, …, xₚ
● Response: y
● Statistical error: ε
● Coefficients: β₀, β₁, …, βₚ


Estimating Coefficients
As before, minimize the sum of squared residuals:
RSS = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
with yᵢ the true response, ŷᵢ the fitted response, and n the number of observations.


Extending!
Extend the methodology to p predictors: total inventory, district size, …
> my_lm <- lm(sales ~ ads + comp + ..., data = shop_data)   # formula: response ~ predictors
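As a concrete sketch with the two predictors that the later summary output names (ads and comp):

> my_lm <- lm(sales ~ ads + comp, data = shop_data)   # multi-linear fit with two predictors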


RMSE & Adjusted R-Squared
More predictors → lower RMSE and higher R-squared, but also higher complexity and cost.
Solution: adjusted R-squared
● Penalizes additional predictors
● Used to compare models
> summary(my_lm)$adj.r.squared   # returns adjusted R-squared
In the example: 0.819 → 0.906
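A sketch of such a comparison, assuming (it is not stated explicitly) that 0.819 belongs to the simple model and 0.906 to the two-predictor model:

> lm_simple <- lm(sales ~ ads, data = shop_data)
> lm_multi <- lm(sales ~ ads + comp, data = shop_data)
> summary(lm_simple)$adj.r.squared   # 0.819 in the example (assumed pairing)
> summary(lm_multi)$adj.r.squared    # 0.906 in the example (assumed pairing)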


Influence of predictors
● p-value: an indicator of a parameter's influence
● The lower the p-value, the more likely the parameter has a significant influence
> summary(my_lm)

Call:
lm(formula = sales ~ ads + comp, data = shop_data)

Residuals:
     Min       1Q   Median       3Q      Max
-131.920  -23.009   -4.448   33.978  146.486

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  228.740     80.592   2.838 0.009084 **
ads           25.521      5.900   4.325 0.000231 ***
comp         -19.234      4.549  -4.228 0.000296 ***

The Pr(>|t|) column lists the p-values.


Example
● Want 95% confidence → p-value ≤ 0.05
● Want 99% confidence → p-value ≤ 0.01
Note: Do not mix up R-squared with p-values!
In the coefficients table above, the p-values of the intercept, ads, and comp are all below 0.01.
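For illustration, a sketch assuming the my_lm fit above; the p-values can also be extracted and tested programmatically:

> pvals <- summary(my_lm)$coefficients[, "Pr(>|t|)"]   # p-value per coefficient
> pvals < 0.05                                         # significant at the 95% level?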


Assumptions
● Just make a model, make a summary, and look at p-values?
● Not that simple!
● We made some implicit assumptions


Verifying Assumptions
Residuals should be:
● Independent: does the residual plot show no pattern?
● Identically normal: is the normal Q-Q plot approximately a line?
[Residual Plot: residuals vs. estimated sales. Normal Q-Q Plot: residual quantiles vs. theoretical quantiles]
> plot(lm_shop$fitted.values, lm_shop$residuals)   # draws the residual plot
> qqnorm(lm_shop$residuals)                        # draws a normal Q-Q plot


Verifying Assumptions
[Normal Q-Q Plot and Residual Plot, as above]

● Important to avoid mistakes!

● Alternative tests exist
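One example of such a test, as a sketch assuming the lm_shop fit above, is the Shapiro-Wilk test of residual normality from base R:

> shapiro.test(lm_shop$residuals)   # null hypothesis: the residuals are normally distributed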

INTRODUCTION TO MACHINE LEARNING

Let’s practice!

INTRODUCTION TO MACHINE LEARNING

k-Nearest Neighbors and Generalization


Non-Parametric Regression

[Scatterplot: y vs. x]
Problem: Visible pattern, but not linear


Non-Parametric Regression
Problem: Visible pattern, but not linear
Solutions:
● Transformation (tedious)
● Multi-linear regression (advanced)
● Non-parametric regression (doable)


Non-Parametric Regression
Problem: Visible pattern, but not linear

Techniques:

● k-Nearest Neighbors

● Kernel Regression

● Regression Trees

● …

No parameter estimation required!


k-NN: Algorithm
Given a training set and a new observation:
1. Calculate the distance in the predictors
2. Select the k nearest (here k = 4)
3. Aggregate the responses of the k nearest (here: the mean of the 4 responses)
4. The outcome is your prediction
[Scatterplots: y vs. x, highlighting the new observation, its k = 4 nearest neighbors, and the resulting prediction]
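A minimal base-R sketch of these four steps for a single numeric predictor; knn_regress, x_train, and y_train are hypothetical names, not from the course:

> knn_regress <- function(x_train, y_train, x_new, k = 4) {
+   d <- abs(x_train - x_new)   # 1. distance in the predictor
+   nn <- order(d)[1:k]         # 2. indices of the k nearest
+   mean(y_train[nn])           # 3.-4. mean of their responses is the prediction
+ }
> knn_regress(x_train = c(4.5, 4.6, 4.8, 4.9, 5.0),
+             y_train = c(45.5, 46.0, 47.5, 48.0, 48.5),
+             x_new = 4.7, k = 4)   # toy data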

Choosing k
● k = 1: perfect fit on the training set, but poor predictions
● k = #observations in the training set: predicts the overall mean, also poor predictions
Reasonable: k = 20% of the #observations in the training set
Bias-variance trade-off!


Generalization in Regression
● Built your own regression model
● Worked on the training set
● Does it generalize well?
● Two techniques:
● Hold out: simply split the dataset
● k-fold cross-validation

Introduction to Machine Learning

Hold Out Method for Regression
Split the dataset into a training set and a test set, then:
1. Build the regression model on the training set
2. Calculate the RMSE within the training set
3. Predict the outcomes of the test set
4. Calculate the RMSE within the test set
5. Compare the test RMSE with the training RMSE
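A sketch of these five steps, assuming shop_data and an (assumed) 70/30 split:

> set.seed(1)                                     # reproducible split
> n <- nrow(shop_data)
> train_idx <- sample(n, size = round(0.7 * n))   # indices of the training set
> train <- shop_data[train_idx, ]
> test <- shop_data[-train_idx, ]
> fit <- lm(sales ~ ads + comp, data = train)     # build model on the training set
> rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))
> rmse(train$sales, predict(fit))                 # training RMSE
> rmse(test$sales, predict(fit, test))            # test RMSE: compare with the training RMSE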


Under and Overfitting
[Three scatterplots of y vs. x: an underfitted model, a well-fitted model, and an overfitted model, each rated on Fit, Generalize, and Prediction]

INTRODUCTION TO MACHINE LEARNING

Let’s practice!
