
Lecture 17

• Interaction Plots

• Simple Linear Regression (Chapter 18.1-18.2)

• Homework 4 due Friday. JMP instructions for question 15.41 are actually for question 15.35.

18.1 Introduction

• In Chapters 18 to 20 we examine the relationship between interval variables via a mathematical equation.

• The motivation for using the technique:

– Forecast the value of a dependent variable (y) from the values of the independent variables (x1, x2, …, xk).

– Analyze the specific relationships between the independent variables and the dependent variable.

Uses of Regression Analysis

• A building maintenance company plans to submit a bid on a contract to clean 40 corporate offices scattered throughout an office complex. The costs incurred by the company are proportional to the number of cleaning crews needed for this task. How many crews will be enough?

• The product manager in charge of a brand of children’s cereal would like to predict demand during the next year. She has available the following “predictor” variables: price of the product, number of children in the target market, price of competitors’ products, effectiveness of advertising, and annual sales this year and in the previous year.

Uses of Regression Analysis

• A community in the Philadelphia area is interested in how crime rates affect property values. If low crime rates increase property values, the community might be able to cover the cost of increased police protection by gains in tax revenues from higher property values.

• A real estate agent wants to more accurately predict the selling price of houses. She believes the following variables affect the price of a house: Size of house (sq. feet), number of bedrooms, frontage of lot, condition and location.

[Figure: house cost plotted against house size.]

Most lots sell for $25,000. Building a house costs about $75 per square foot.

House cost = 25000 + 75(Size)

18.2 The Model

The model has a deterministic component and a probabilistic component.

[Figure: house cost plotted against house size, with the line House cost = 25000 + 75(Size).]

However, house costs vary even among houses of the same size!

18.2 The Model

Since costs behave unpredictably, we add a random component.
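The two components can be sketched in Python. This is only an illustration: the normal distribution for the random component, its standard deviation of $10,000, the seed, and the function names are all assumptions made for the sketch, not values from the lecture.

```python
import random

def deterministic_cost(size_sqft):
    """Deterministic component: lot price plus building cost per square foot."""
    return 25000 + 75 * size_sqft

def simulated_cost(size_sqft, rng, sigma=10000):
    """Deterministic component plus a random error (sigma is an assumed value)."""
    return deterministic_cost(size_sqft) + rng.gauss(0, sigma)

rng = random.Random(0)  # fixed seed so the run is repeatable
size = 2000             # hypothetical house size in square feet

print("Deterministic cost:", deterministic_cost(size))
for _ in range(3):
    # Same size every time, yet the simulated costs differ
    print("Simulated cost:", round(simulated_cost(size, rng)))
```

Every simulated house has the same size, so all of the variation in the printed costs comes from the random component.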

18.2 The Model

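• The first order linear model

$y = \beta_0 + \beta_1 x + \varepsilon$

y = dependent variable

x = independent variable

$\beta_0$ = y-intercept

$\beta_1$ = slope of the line = Rise/Run

$\varepsilon$ = error variable

[Figure: the line plotted with x on the horizontal axis and y on the vertical axis, showing the intercept $\beta_0$ and the slope $\beta_1$ = Rise/Run.]

$\beta_0$ and $\beta_1$ are unknown population parameters, and are therefore estimated from the data.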

Interpreting the Coefficients

• $E(Y \mid X) = \beta_0 + \beta_1 X$

• Roomsclean=1.78+3.70*Number of Crews

• $\beta_0$ is called the y-intercept and $\beta_1$ is called the slope.

• Interpretation of slope: “For every additional cleaning crew, we are able to clean an additional 3.70 rooms on average.”

• Interpretation of intercept: Technically, it is the average number of rooms that can be cleaned with zero cleaning crews, but this does not make sense here because it involves extrapolation.

Simple Regression Model

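• The data $(x_1, y_1), \ldots, (x_n, y_n)$ are assumed to be a realization of

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n$

$\varepsilon_1, \ldots, \varepsilon_n \ \text{iid} \sim N(0, \sigma_\varepsilon^2)$

• $\beta_0 + \beta_1 x_i$ is the “signal” and $\varepsilon_i$ is “noise” (error).

• $\beta_0, \beta_1, \sigma_\varepsilon^2$ are the unknown parameters of the model. The objective of regression is to estimate them.

• What is the interpretation of $\sigma_\varepsilon^2$?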

18.3 Estimating the Coefficients

• The estimates are determined by

– drawing a sample from the population of interest,

– calculating sample statistics,

– producing a straight line that cuts into the data.

Question: What should be considered a good line?

[Figure: data points plotted with x on the horizontal axis and y on the vertical axis.]

The Least Squares (Regression) Line

A good line is one that minimizes the sum of squared differences between the points and the line.

The Least Squares (Regression) Line

Let us compare two lines for the points (1, 2), (2, 4), (3, 1.5), and (4, 3.2).

First line (fitted values 1, 2, 3, 4 at x = 1, 2, 3, 4):

Sum of squared differences = $(2 - 1)^2 + (4 - 2)^2 + (1.5 - 3)^2 + (3.2 - 4)^2 = 7.89$

Second line (horizontal, at y = 2.5):

Sum of squared differences = $(2 - 2.5)^2 + (4 - 2.5)^2 + (1.5 - 2.5)^2 + (3.2 - 2.5)^2 = 3.99$

The smaller the sum of squared differences, the better the fit of the line to the data.
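The comparison of the two lines can be checked with a few lines of Python. The sketch assumes the first line is y = x, which is what the fitted values 1, 2, 3, 4 in the first sum imply; the function names are made up for the illustration.

```python
# The four data points from the comparison
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def sum_squared_differences(points, line):
    """Sum of squared vertical differences between the points and a line."""
    return sum((y - line(x)) ** 2 for x, y in points)

def first_line(x):
    # Assumed from the arithmetic in the first sum: fitted value equals x
    return x

def horizontal_line(x):
    # The second line is horizontal at y = 2.5
    return 2.5

sse_first = sum_squared_differences(points, first_line)
sse_horizontal = sum_squared_differences(points, horizontal_line)

print("First line:     ", round(sse_first, 2))
print("Horizontal line:", round(sse_horizontal, 2))
print("Better fit:", "horizontal" if sse_horizontal < sse_first else "first")
```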

The Estimated Coefficients

To calculate the estimates of the line coefficients that minimize the squared differences between the data points and the line, use the formulas:

$b_1 = \frac{\operatorname{cov}(X, Y)}{s_x^2}$

$b_0 = \bar{y} - b_1 \bar{x}$

The regression equation that estimates the equation of the first order linear model is:

$\hat{y} = b_0 + b_1 x$
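As a rough sketch, the formulas translate directly into Python. The function name `least_squares` is hypothetical, and the data are the four points from the line-comparison slide, reused here only to show that the least squares line beats the horizontal line (sum of squared differences 3.99).

```python
def least_squares(xs, ys):
    """Return (b0, b1) using b1 = cov(X, Y) / s_x^2 and b0 = y_bar - b1 * x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample covariance and sample variance, both with divisor n - 1
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    b1 = cov_xy / var_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]

b0, b1 = least_squares(xs, ys)
# Sum of squared differences for the fitted least squares line
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

print("b0 =", round(b0, 3), " b1 =", round(b1, 3))
print("Sum of squared differences =", round(sse, 3))
```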

Typical Regression Analysis

• Observe pairs of data $(x_1, y_1), \ldots, (x_n, y_n)$.

• Plot the data! See if a simple linear regression model seems reasonable. If necessary, transform the data.

• Suspect (or hope) that the simple regression model (SRM) assumptions are justified.

• Estimate the true regression line $E(y \mid x) = \beta_0 + \beta_1 x$ by the LS regression line $\hat{y} = b_0 + b_1 x$.

• Check the model and make inferences.

• Example 18.2 (Xm18-02)

– A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.

– A random sample of 100 cars is selected, and the data recorded.

– Find the regression line.

Car    Odometer    Price
1      37388       14636
2      44758       14122
3      45833       14016
4      30862       15590
5      31705       15568
6      34010       14718
.      .           .
.      .           .
.      .           .

Odometer is the independent variable x. Price is the dependent variable y.

The Simple Linear Regression Line


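• Solution

– Solving by hand: Calculate a number of statistics

$\bar{x} = 36{,}009.45; \quad s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 43{,}528{,}690$

$\bar{y} = 14{,}822.823; \quad \operatorname{cov}(X, Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = -2{,}712{,}511$

where n = 100.

$b_1 = \frac{\operatorname{cov}(X, Y)}{s_x^2} = \frac{-2{,}712{,}511}{43{,}528{,}690} = -.06232$

$b_0 = \bar{y} - b_1 \bar{x} = 14{,}822.82 - (-.06232)(36{,}009.45) = 17{,}067$

$\hat{y} = b_0 + b_1 x = 17{,}067 - .0623x$

The coefficient -.0623 is the slope of the line. For each additional mile on the odometer, the price decreases by an average of $0.0623.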
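The hand calculation can be reproduced from the summary statistics alone. The sketch below takes the sample means, the sample variance of the odometer readings (43,528,690), and the sample covariance (-2,712,511) as given, since the full 100-car data set is not listed here. The helper `predicted_price` is a hypothetical name.

```python
# Summary statistics for Example 18.2 (n = 100 cars)
x_bar = 36009.45     # mean odometer reading
y_bar = 14822.823    # mean selling price
var_x = 43528690     # sample variance of odometer readings
cov_xy = -2712511    # sample covariance of odometer and price

# Least squares estimates
b1 = cov_xy / var_x
b0 = y_bar - b1 * x_bar

def predicted_price(odometer):
    """Fitted price for a given odometer reading (hypothetical helper)."""
    return b0 + b1 * odometer

print("b1 =", round(b1, 5))
print("b0 =", round(b0))
```

The fitted line should only be used for odometer readings inside the range covered by the sample.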

[Figure: Odometer Line Fit Plot. Price plotted against Odometer, with the fitted line $\hat{y} = 17{,}067 - .0623x$.]

Interpreting the Linear Regression Equation

The intercept is b0 = $17,067.

Do not interpret the intercept as the “price of cars that have not been driven”: the sample contains no data near an odometer reading of 0.

Fitted Values and Residuals

• The least squares line decomposes the data into two parts, $y_i = \hat{y}_i + e_i$, where $\hat{y}_i = b_0 + b_1 x_i$ and $e_i = y_i - \hat{y}_i$.

• $\hat{y}_1, \ldots, \hat{y}_n$ are called the fitted or predicted values.

• $e_1, \ldots, e_n$ are called the residuals.

• The residuals are estimates of the errors $(\varepsilon_1, \ldots, \varepsilon_n)$.
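The decomposition can be illustrated with a short Python sketch. It reuses the four points from the line-comparison slide, and the variable names are chosen for the illustration only.

```python
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
n = len(xs)

# Least squares coefficients (the n - 1 divisors cancel in the ratio)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

# Fitted values and residuals
fitted = [b0 + b1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

for x, y, f, e in zip(xs, ys, fitted, residuals):
    # Each observation splits into fitted value plus residual
    print(f"x={x}  y={y}  fitted={f:.2f}  residual={e:+.2f}")

print("Sum of residuals:", round(sum(residuals), 10))
```

For a least squares line with an intercept, the residuals sum to zero up to rounding error.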

18.4 Error Variable: Required Conditions

• The error $\varepsilon$ is a critical part of the regression model.

• Four requirements involving the distribution of $\varepsilon$ must be satisfied.

– The probability distribution of $\varepsilon$ is normal.

– The mean of $\varepsilon$ is zero: $E(\varepsilon) = 0$.

– The standard deviation of $\varepsilon$ is $\sigma_\varepsilon$ for all values of x.

– The set of errors associated with different values of y are all independent.

The Normality of $\varepsilon$

From the first three assumptions we have: y is normally distributed with mean $E(y) = \beta_0 + \beta_1 x$ and a constant standard deviation $\sigma_\varepsilon$.

[Figure: normal curves for y at x1, x2, and x3, centered at $E(y \mid x_1) = \beta_0 + \beta_1 x_1$, $E(y \mid x_2) = \beta_0 + \beta_1 x_2$, and $E(y \mid x_3) = \beta_0 + \beta_1 x_3$.]

The standard deviation remains constant, but the mean value changes with x.

Estimating $\sigma_\varepsilon$

• The standard error of estimate (root mean squared error) is an estimate of $\sigma_\varepsilon$:

$s_\varepsilon = \sqrt{\frac{1}{n - 2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$

• The standard error of estimate is basically the standard deviation of the residuals.

• If the simple regression model holds, then approximately

– 68% of the data will lie within one $s_\varepsilon$ of the LS line.

– 95% of the data will lie within two $s_\varepsilon$ of the LS line.
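A minimal sketch of the standard error of estimate in Python follows. The function name `standard_error_of_estimate` is hypothetical, and the data are again the four points from the line-comparison slide, which is far too few for the 68% and 95% rules to be meaningful; the sketch only shows the arithmetic.

```python
import math

def standard_error_of_estimate(xs, ys):
    """Root mean squared error of the least squares line, with divisor n - 2."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    # Sum of squared residuals around the fitted line
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))

xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]

s = standard_error_of_estimate(xs, ys)
print("Standard error of estimate:", round(s, 4))
```

The divisor is n - 2 rather than n - 1 because two coefficients, b0 and b1, are estimated from the data.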

Cleaning Crew Example

• Roomsclean=1.78+3.70*Number of Crews

• The building maintenance company is planning to submit a bid on a contract to clean 40 corporate offices scattered throughout an office complex. Currently, the company has only 11 cleaning crews. Will 11 crews be enough?
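As a starting point for the question, the fitted equation gives a point prediction for 11 crews. The sketch below only evaluates the line; it ignores the variability of individual outcomes around the line, so it does not settle whether 11 crews are enough. The function name is hypothetical.

```python
def predicted_rooms_cleaned(crews):
    """Average number of rooms cleaned, from the fitted line in the lecture."""
    return 1.78 + 3.70 * crews

crews_available = 11
rooms_needed = 40

prediction = predicted_rooms_cleaned(crews_available)
print("Predicted rooms cleaned on average:", round(prediction, 2))
print("Meets the 40-office target on average:", prediction >= rooms_needed)
```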

Practice Problems

• 18.4, 18.10, 18.12
