Post on 22-Dec-2015
Lecture 17
• Interaction Plots
• Simple Linear Regression (Chapter 18.1-18.2)
• Homework 4 due Friday. JMP instructions for question 15.41 are actually for question 15.35.
18.1 Introduction
• In Chapters 18 to 20 we examine the relationship between interval variables via a mathematical equation.
• The motivation for using the technique:
– Forecast the value of a dependent variable (y) from the values of independent variables (x1, x2, …, xk).
– Analyze the specific relationships between the independent variables and the dependent variable.
Uses of Regression Analysis
• A building maintenance company plans to submit a bid on a contract to clean 40 corporate offices scattered throughout an office complex. The costs incurred by the company are proportional to the number of cleaning crews needed for this task. How many crews will be enough?
• The product manager in charge of a brand of children’s cereal would like to predict demand during the next year. She has available the following “predictor” variables: price of the product, number of children in target market, price of competitors’ products, effectiveness of advertising, and annual sales this year and the previous year.
Uses of Regression Analysis
• A community in the Philadelphia area is interested in how crime rates affect property values. If low crime rates increase property values, the community might be able to cover the cost of increased police protection by gains in tax revenues from higher property values.
• A real estate agent wants to more accurately predict the selling price of houses. She believes the following variables affect the price of a house: Size of house (sq. feet), number of bedrooms, frontage of lot, condition and location.
18.2 The Model
The model has a deterministic and a probabilistic component.
• Most lots sell for $25,000.
• Building a house costs about $75 per square foot.
• House cost = 25000 + 75(Size)
[Figure: house cost plotted against house size, with the line House cost = 25000 + 75(Size).]
However, house costs vary even among houses of the same size!
18.2 The Model
Since costs behave unpredictably, we add a random component.
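The idea of a deterministic line plus a random component can be sketched in a few lines of Python. This is only an illustration: the lot price and cost per square foot come from the slide, while the size of the random component (`NOISE_SD`), the house size, and the random seed are assumptions made up for the example.

```python
import random

LOT_PRICE = 25000      # from the slide: most lots sell for $25,000
COST_PER_SQFT = 75     # from the slide: about $75 per square foot
NOISE_SD = 5000        # ASSUMED spread of the random component (illustrative only)

def deterministic_cost(size_sqft):
    """Deterministic part of the model: House cost = 25000 + 75(Size)."""
    return LOT_PRICE + COST_PER_SQFT * size_sqft

def simulated_cost(size_sqft, rng):
    """Deterministic part plus a normally distributed random component."""
    return deterministic_cost(size_sqft) + rng.gauss(0, NOISE_SD)

rng = random.Random(0)   # fixed seed so the sketch is reproducible
size = 2000              # hypothetical house size in square feet
costs = [simulated_cost(size, rng) for _ in range(5)]

print("deterministic cost:", deterministic_cost(size))
print("simulated costs for same-size houses:", [round(c) for c in costs])
```

Five houses of the same size get five different simulated costs, all scattered around the same deterministic value.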
18.2 The Model
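• The first-order linear model: $y = \beta_0 + \beta_1 x + \epsilon$
y = dependent variable
x = independent variable
$\beta_0$ = y-intercept
$\beta_1$ = slope of the line = Rise/Run
$\epsilon$ = error variable
[Figure: a line in the (x, y) plane with intercept $\beta_0$ and slope $\beta_1$ = Rise/Run.]
$\beta_0$ and $\beta_1$ are unknown population parameters and are therefore estimated from the data.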
Interpreting the Coefficients
• $E(Y|X) = \beta_0 + \beta_1 X$
• Roomsclean = 1.78 + 3.70*Number of Crews
• $\beta_0$ is called the y-intercept and $\beta_1$ is called the slope.
• Interpretation of slope: “For every additional cleaning crew, we are able to clean an additional 3.70 rooms on average.”
• Interpretation of intercept: Technically, it is the number of rooms that can be cleaned on average with zero cleaning crews, but this does not make sense here because it involves extrapolation.
Simple Regression Model
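• The data $(x_1, y_1), \ldots, (x_n, y_n)$ are assumed to be a realization of
$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \ldots, n, \qquad \epsilon_1, \ldots, \epsilon_n \sim \text{iid } N(0, \sigma^2)$
• $\beta_0 + \beta_1 x_i$ is the “signal” and $\epsilon_i$ is “noise” (error).
• $\beta_0, \beta_1, \sigma^2$ are the unknown parameters of the model. The objective of regression is to estimate them.
• What is the interpretation of $\sigma^2$?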
18.3 Estimating the Coefficients
• The estimates are determined by
– drawing a sample from the population of interest,
– calculating sample statistics,
– producing a straight line that cuts into the data.
Question: What should be considered a good line?
[Figure: scatterplot of y against x.]
The Least Squares (Regression) Line
A good line is one that minimizes the sum of squared differences between the points and the line.
The Least Squares (Regression) Line
Let us compare two lines for the four data points (1,2), (2,4), (3,1.5) and (4,3.2).
• First line (heights 1, 2, 3, 4 at x = 1, 2, 3, 4): Sum of squared differences = (2 - 1)^2 + (4 - 2)^2 + (1.5 - 3)^2 + (3.2 - 4)^2 = 7.89
• Second line (horizontal, at height 2.5): Sum of squared differences = (2 - 2.5)^2 + (4 - 2.5)^2 + (1.5 - 2.5)^2 + (3.2 - 2.5)^2 = 3.99
The smaller the sum of squared differences, the better the fit of the line to the data.
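The comparison of the two lines can be checked with a short Python sketch. The four data points and the horizontal line at 2.5 are taken from the slide; treating the first line as y = x is an assumption read off the slide's arithmetic (its heights are 1, 2, 3, 4 at x = 1, 2, 3, 4), and the function name is hypothetical.

```python
# The four data points shown on the slide.
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def sum_squared_differences(points, line):
    """Sum of squared vertical differences between the points and a line."""
    return sum((y - line(x)) ** 2 for x, y in points)

def first_line(x):
    # ASSUMED to be y = x, inferred from the heights used in the slide's sum.
    return x

def second_line(x):
    # The horizontal line at height 2.5 from the slide.
    return 2.5

sse_first = sum_squared_differences(points, first_line)
sse_second = sum_squared_differences(points, second_line)

print("first line: ", round(sse_first, 2))
print("second line:", round(sse_second, 2))
print("better fit: ", "second" if sse_second < sse_first else "first")
```

Any candidate line can be passed to `sum_squared_differences`; the least squares line is the one for which this quantity is as small as possible.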
The Estimated Coefficients
To calculate the estimates of the line coefficients that minimize the sum of squared differences between the data points and the line, use the formulas:
$b_1 = \dfrac{\text{cov}(X, Y)}{s_x^2}$
$b_0 = \bar{y} - b_1 \bar{x}$
The regression equation that estimates the equation of the first-order linear model is:
$\hat{y} = b_0 + b_1 x$
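As a minimal sketch, the two formulas can be applied directly in Python. The data are the four points from the two-line comparison earlier in the lecture; the function name `least_squares` is hypothetical, and the sample covariance and variance are computed with the n - 1 denominator.

```python
def least_squares(xs, ys):
    """Return (b0, b1) using b1 = cov(X, Y) / s_x^2 and b0 = y_bar - b1 * x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample covariance and sample variance, both with n - 1 in the denominator.
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    b1 = cov_xy / var_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

# The four points (1,2), (2,4), (3,1.5), (4,3.2) from the two-line comparison.
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]

b0, b1 = least_squares(xs, ys)
print(f"y-hat = {b0:.2f} + {b1:.2f} x")
```

In practice a statistics package (JMP in this course) does this calculation, so the sketch is only meant to make the formulas concrete.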
Typical Regression Analysis
• Observe pairs of data $(x_1, y_1), \ldots, (x_n, y_n)$.
• Plot the data! See if a simple linear regression model seems reasonable. If necessary, transform the data.
• Suspect (or hope) that the simple regression model (SRM) assumptions are justified.
• Estimate the true regression line $E(y|x) = \beta_0 + \beta_1 x$ by the LS regression line $\hat{y} = b_0 + b_1 x$.
• Check the model and make inferences.
• Example 18.2 (Xm18-02)
– A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
– A random sample of 100 cars is selected, and the data recorded.
– Find the regression line.

Car   Odometer   Price
1     37388      14636
2     44758      14122
3     45833      14016
4     30862      15590
5     31705      15568
6     34010      14718
. . .

Independent variable x: Odometer
Dependent variable y: Price
The Simple Linear Regression Line
The Simple Linear Regression Line
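• Solution
– Solving by hand: Calculate a number of statistics
$\bar{x} = 36{,}009.45; \quad s_x^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1} = 43{,}528{,}690$
$\bar{y} = 14{,}822.823; \quad \text{cov}(X, Y) = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = -2{,}712{,}511$
where n = 100.
$b_1 = \dfrac{\text{cov}(X, Y)}{s_x^2} = \dfrac{-2{,}712{,}511}{43{,}528{,}690} = -.06232$
$b_0 = \bar{y} - b_1 \bar{x} = 14{,}822.82 - (-.06232)(36{,}009.45) = 17{,}067$
$\hat{y} = b_0 + b_1 x = 17{,}067 - .0623x$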
This is the slope of the line. For each additional mile on the odometer, the price decreases by an average of $0.0623.
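The hand calculation can be reproduced from the summary statistics with a short Python sketch. The four statistics are those given on the slide (with the covariance taken as negative, consistent with the negative slope); the variable names are hypothetical.

```python
# Summary statistics from Example 18.2 (n = 100 cars).
x_bar = 36009.45      # mean odometer reading
y_bar = 14822.823     # mean selling price
cov_xy = -2712511     # sample covariance of odometer and price
var_x = 43528690      # sample variance of odometer

# Least squares estimates from the formulas for b1 and b0.
b1 = cov_xy / var_x
b0 = y_bar - b1 * x_bar

print(f"b1 = {b1:.5f}")
print(f"b0 = {b0:.0f}")

# The slope restated per 1,000 miles, which is often easier to communicate.
drop_per_1000_miles = -b1 * 1000
print(f"average price drop per 1,000 miles: ${drop_per_1000_miles:.2f}")
```

Restating the slope per 1,000 miles is just a change of units; the fitted line itself is unchanged.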
[Figure: Odometer Line Fit Plot. Price plotted against Odometer, with the fitted line $\hat{y} = 17{,}067 - .0623x$.]
Interpreting the Linear Regression Equation
The intercept is b0 = $17,067.
Do not interpret the intercept as the “Price of cars that have not been driven”: there are no data near an odometer reading of 0.
Fitted Values and Residuals
• The least squares line decomposes the data into two parts, $y_i = \hat{y}_i + e_i$, where $\hat{y}_i = b_0 + b_1 x_i$ and $e_i = y_i - \hat{y}_i$.
• $\hat{y}_1, \ldots, \hat{y}_n$ are called the fitted or predicted values.
• $e_1, \ldots, e_n$ are called the residuals.
• The residuals are estimates of the errors $(\epsilon_1, \ldots, \epsilon_n)$.
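A brief Python sketch of this decomposition, using the six cars listed in Example 18.2 and the fitted line with intercept 17,067 and slope -.0623. Only six of the 100 cars are shown in the lecture, so these residuals need not sum to zero as they would over the full sample.

```python
# (odometer, price) for the six cars listed in Example 18.2.
cars = [
    (37388, 14636),
    (44758, 14122),
    (45833, 14016),
    (30862, 15590),
    (31705, 15568),
    (34010, 14718),
]

B0 = 17067      # intercept of the fitted line from the lecture
B1 = -0.0623    # slope of the fitted line from the lecture

# Fitted values y-hat_i = b0 + b1 * x_i and residuals e_i = y_i - y-hat_i.
fitted = [B0 + B1 * x for x, _ in cars]
residuals = [y - f for (_, y), f in zip(cars, fitted)]

for (x, y), f, e in zip(cars, fitted, residuals):
    print(f"odometer={x}  price={y}  fitted={f:.1f}  residual={e:.1f}")
```

A positive residual means the car sold for more than the line predicts for its odometer reading; a negative residual means it sold for less.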
18.4 Error Variable: Required Conditions
• The error $\epsilon$ is a critical part of the regression model.
• Four requirements involving the distribution of $\epsilon$ must be satisfied.
– The probability distribution of $\epsilon$ is normal.
– The mean of $\epsilon$ is zero: $E(\epsilon) = 0$.
– The standard deviation of $\epsilon$ is $\sigma$ for all values of x.
– The set of errors associated with different values of y are all independent.
The Normality of $\epsilon$
From the first three assumptions we have: y is normally distributed with mean $E(y) = \beta_0 + \beta_1 x$ and a constant standard deviation $\sigma$.
[Figure: normal curves centered at $E(y|x_1) = \beta_0 + \beta_1 x_1$, $E(y|x_2) = \beta_0 + \beta_1 x_2$ and $E(y|x_3) = \beta_0 + \beta_1 x_3$. The standard deviation remains constant, but the mean value changes with x.]
Estimating $\sigma$
• The standard error of estimate (root mean squared error),
$s = \sqrt{\dfrac{1}{n - 2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$,
is an estimate of $\sigma$.
• The standard error of estimate is basically the standard deviation of the residuals.
• If the simple regression model holds, then approximately
– 68% of the data will lie within one $s$ of the LS line.
– 95% of the data will lie within two $s$ of the LS line.
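The standard error of estimate can be computed directly from the residuals. The sketch below reuses the four points (1,2), (2,4), (3,1.5), (4,3.2) from the two-line comparison; the coefficients 2.4 and 0.11 are the least squares estimates for those four points, and the function name is hypothetical. With only four points the 68%/95% rule of thumb is not meaningful, so the example only illustrates the formula.

```python
import math

def standard_error_of_estimate(xs, ys, b0, b1):
    """Root mean squared error: sqrt( sum of squared residuals / (n - 2) )."""
    n = len(xs)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))

# Four illustrative points and their least squares line y-hat = 2.4 + 0.11 x.
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
b0, b1 = 2.4, 0.11

s = standard_error_of_estimate(xs, ys, b0, b1)
print(f"s = {s:.3f}")
```

The divisor is n - 2 rather than n - 1 because two coefficients, b0 and b1, are estimated from the data.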
Cleaning Crew Example
• Roomsclean=1.78+3.70*Number of Crews
• The building maintenance company is planning to submit a bid on a contract to clean 40 corporate offices scattered throughout an office complex. Currently, the company has only 11 cleaning crews. Will 11 crews be enough?
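A first step toward answering the question is to plug 11 crews into the fitted equation, as in the Python sketch below. The coefficients are those on the slide; the function name is hypothetical, and the sketch treats the 40 offices as 40 rooms, which is an assumption. It compares only the predicted average with the requirement and ignores the variation around the line, so it should not be read as a complete answer.

```python
INTERCEPT = 1.78   # from the fitted equation on the slide
SLOPE = 3.70       # additional rooms cleaned per additional crew, on average

def predicted_rooms_cleaned(number_of_crews):
    """Predicted average number of rooms cleaned: 1.78 + 3.70 * crews."""
    return INTERCEPT + SLOPE * number_of_crews

rooms_needed = 40       # ASSUMES one room per office in the contract
crews_available = 11

predicted = predicted_rooms_cleaned(crews_available)
print(f"predicted rooms cleaned with {crews_available} crews: {predicted:.2f}")
print("meets the requirement on average:", predicted >= rooms_needed)
```

Because actual results scatter around the line, a fuller answer would also use the standard error of estimate to judge how likely it is that 11 crews fall short of 40.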
Practice Problems
• 18.4,18.10,18.12