5-1bivar.
Unit 5 Correlation and Regression:
Examining and Modeling Relationships Between Variables
Chapters 8 - 12
Outline: Two variables
Scatter Diagrams to display bivariate data
Correlation: Concept, Interpretation, Computation, Cautions
Regression Model: Using a LINE to describe the relation between two variables & for prediction
• Finding "the" line
• Interpreting its coefficients
Residuals, Prediction Errors
Extensions of Simple Linear Regression
5-2bivar.
Four Scatter Diagrams
[Figure: four scatter diagrams]
1. # applicants vs. size of help wanted ad
2. cost per min. ($) vs. CUME rating
3. % delinquent vs. age of credit account (years)
4. last year's sales ($1000) vs. entertain. expenses (x $100)
5-3bivar.
Association
If there is STRONG ASSOCIATION between 2 variables, then knowing one helps a lot in predicting the other.
If there is WEAK ASSOCIATION between 2 variables, then information about one variable does not help much in predicting the other.
Usually, the INDEPENDENT variable is thought to influence the DEPENDENT variable.
[Figure: scatter diagram with axes labeled "dependent variable" and "independent variable"]
5-4bivar.
Summarizing the Relationship Between Two Variables
1. Plot the points in a scatter diagram.
2. Find average for X and average for Y. Plot the point of averages.
3. Find SD(X), which measures horizontal spread of points, and SD(Y), which measures vertical spread of points.
4. Find the correlation coefficient (r), which measures the degree of clustering / spread of points about a line (the SD line).
[Figure: two scatter diagrams, each plotting Y against X]
5-5bivar.
Wood Products Shipments and Employment,
by state, 1989, excl. California
[Figure: scatter diagram of Employment (x 100) and Shipments ($ million)]
5-6bivar.
Wood Products Data
Shipments ($ million)   Employment
        469.8              7,900
        246.4              4,400
        205.4              2,800
        186.5              3,600
        175.8              3,800
        142.9              2,100
        139.7              2,400
        120.6              1,900
        118.0              1,500
        104.3              1,500
         89.9              1,600
         73.5              1,500
         72.6              1,400
         71.4              1,200
         53.9                800
         52.4              1,400
         50.1              1,200
         48.1              1,400
         47.0              1,100
         36.7                800
         27.4                500
         27.3                400
         22.9                300
5-7bivar.
5-8bivar.
5-9bivar.
Linear Association
The correlation coefficient measures the LINEAR relationship between TWO variables.
It is a measure of LINEAR association or clustering around a line.
[Figure: six scatter diagrams illustrating r = 1, r = -1, r near +1, r near -1, r positive near 0, and r negative near 0]
5-10bivar.
Interpretation of r
The closer the correlation coefficient is to 1 (or -1), the more tightly clustered the points are around a line (the SD line).
The SD line passes through all points which are an equal # of SD's away from the average for both variables.
[Figure: two scatter diagrams, one showing positive association and one showing negative association]
5-11bivar.
Twelve Plots, with r
Look in your textbook, pages 127 and 129.
5-12bivar.
5-13bivar.
Computing the Correlation Coefficient, r
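$$r = \frac{\frac{1}{n}\sum (X_i - \bar{X})(Y_i - \bar{Y})}{SD(X)\,SD(Y)}$$
$$= \frac{\frac{1}{n}\sum (X_i Y_i) - \bar{X}\,\bar{Y}}{SD(X)\,SD(Y)}$$
$$= \frac{\mathrm{Covariance}(X,Y)}{SD(X)\,SD(Y)}$$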
Convert each variable to standard units. The average of the products gives the correlation coefficient.
r = average of (z-score for X) (z-score for Y)
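As a rough illustration of the z-score recipe above, the following sketch computes r for the Wood Products data listed earlier in these notes. The function names are illustrative choices, and the SD here divides by n (a different divisor would rescale every z-score, but the average-of-products recipe as written assumes n). The result should land close to the 0.979 reported in the computer output for these data.

```python
from math import sqrt

# Wood Products data from the table in these notes.
shipments = [469.8, 246.4, 205.4, 186.5, 175.8, 142.9, 139.7, 120.6, 118.0,
             104.3, 89.9, 73.5, 72.6, 71.4, 53.9, 52.4, 50.1, 48.1, 47.0,
             36.7, 27.4, 27.3, 22.9]
employment = [7900, 4400, 2800, 3600, 3800, 2100, 2400, 1900, 1500, 1500,
              1600, 1500, 1400, 1200, 800, 1400, 1200, 1400, 1100, 800,
              500, 400, 300]


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """SD with divisor n: root mean square of deviations from the average."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


def correlation(xs, ys):
    """r = average of (z-score for X) * (z-score for Y)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = sd(xs), sd(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return mean(products)


r = correlation(employment, shipments)
print(f"r = {r:.3f}")
```

Because r is computed from standard units, it does not matter that employment is entered here in persons rather than in hundreds.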
5-14bivar.
Example: Computation of r
X | Y | X - X̄ | (X - X̄)² | Y - Ȳ | (Y - Ȳ)² | z-score for X | z-score for Y | product
5-15bivar.
Some Cases When the Correlation Coefficient, r, Does Not Give A Good Indication of Clustering
[Figure: two scatter diagrams, one of Y against X and one against INDEP; r = .155 and r = .536]
5-16bivar.
[Figure: scatter diagram of BRAIN WEIGHT IN KG against BODY WEIGHT IN KG; r = .933 (36 data values)]
5-17bivar.
“No Elephants”
[Figure: scatter diagram of brain weight in grams against body weight in kg; r = .596]
(r = .887, excluding dinosaurs, elephants, humans)
5-18bivar.
All brain data, log transformed
[Figure: scatter diagram of log (brain weight) against log (body weight); r = .856 (all data)]
5-19bivar.
[Figure: scatter diagram of PRICE against COUPON; r = .883 (all data), r = .984 (without flower bonds)]
(Siegel)
5-20bivar.
Interpretation of Empirical Association
1. Descriptive
   Example: Height versus Weight
2. Causal
   Example: Total Cost vs. Volume of Production
3. Nonsense
   Example: Polio Incidence vs. Soft Drink Sales
5-21bivar.
Prediction Using Correlation
1. What is the best prediction of the dependent variable? What if the value of the independent variable is available?
2. What is the likely size of the prediction error?
Fundamental Principle of Prediction
1. Use the mean of the relevant group.
2. SD of the group gives the "likely size of error."
5-22bivar.
5-23bivar.
Diamond State Telephone Company
Demand for LINES versus Proposed MONTHLY charge per line ($)
[Figure: scatter diagram of LINES (vertical axis, 100 to 250) against MONTHLY (horizontal axis, 10 to 35)]
5-24bivar.
Look At The Vertical Strip Corresponding to the Given X Value
[Figure: scatter diagram of Y against X with a vertical strip marked at the given X value]
5-25bivar.
Graph of Averages
[Figure: scatter diagram of LINES against MONTHLY, with the average of LINES in each vertical strip marked by an x]
estimated LINES = 237.495 - 3.867 MONTHLY
5-26bivar.
5-27bivar.
Linearly Related Variables
The REGRESSION LINE is to a scatter diagram as the AVERAGE is to a list of numbers.
The regression line estimates the average values for the dependent variable, Y, corresponding to each value, x, of the independent variable.
5-28bivar.
Linearly Related Variables
If we have 2 variables, linearly related to one another, then knowing the value of one variable (for a particular individual) can help to estimate / predict the value of the other variable.
• If we know nothing re. the value of the independent variable (X), then we estimate the value of the dependent variable to be the OVERALL AVERAGE of the dependent variable (Y).
• If we know that the independent variable (X) has a particular value for a given individual, then we can take a "more educated guess" at the value of the dependent variable (Y).
5-29bivar.
Regression and SD Lines
The REGRESSION LINE for modeling the relation between X (independent variable) and Y (dependent variable) passes through the POINT OF AVERAGES and has slope $r \, SD_Y / SD_X$.
That is, associated with each increase of one SD in X, there is an increase of r SD's in Y, on the average.
The SD LINE for modeling the relation between X (independent variable) and Y (dependent variable) passes through the POINT OF AVERAGES and has slope $SD_Y / SD_X$ (taken with the sign of r).
5-30bivar.
Estimating the Intercept and Slope of the Regression Line
The REGRESSION LINE for modeling the relation between X (independent variable) and Y (dependent variable) is also known as the REGRESSION LINE for predicting Y from X, and has the form
$\hat{Y} = a + b\,X$ = intercept + slope × X.
Here,
b = slope = r SD(Y) / SD(X)
a = intercept = avg(Y) - b avg(X) = avg(Y) - r [SD(Y) / SD(X)] avg(X)
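The slope and intercept formulas can be sketched as a small function. The function name and argument names below are illustrative only; the inputs are the Diamond State Telephone summary statistics quoted in these notes (X = MONTHLY, Y = LINES). Because the inputs are rounded, the result should agree with the computer output (237.495 and -3.867) only up to rounding.

```python
def regression_line(avg_x, sd_x, avg_y, sd_y, r):
    """Return (intercept a, slope b) of the regression line for predicting Y from X."""
    b = r * sd_y / sd_x      # slope = r SD(Y) / SD(X)
    a = avg_y - b * avg_x    # intercept = avg(Y) - b avg(X)
    return a, b


# Diamond State summary statistics: X = MONTHLY, Y = LINES.
a, b = regression_line(avg_x=21.581, sd_x=8.344,
                       avg_y=154.048, sd_y=33.506, r=-0.963)
print(f"estimated LINES = {a:.3f} + ({b:.3f}) MONTHLY")
```

Note that the ratio SD(Y) / SD(X) is the same whether both SDs divide by n or by n - 1, so the slope does not depend on that choice.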
5-31bivar.
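Prediction from a Regression Model
Predicted value of Y corresponding to a given value of X is
$$\begin{aligned}
\hat{Y} &= a + b\,X \\
&= \left(\bar{Y} - r\,\frac{SD_Y}{SD_X}\,\bar{X}\right) + \left(r\,\frac{SD_Y}{SD_X}\right) X \\
&= \bar{Y} - (\bar{X} - X)\left(r\,\frac{SD_Y}{SD_X}\right) \\
&= \bar{Y} - (\bar{X} - X)\,(\text{slope})
\end{aligned}$$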
5-32bivar.
5-33bivar.
TOTAL OBSERVATIONS: 21

                  LINES     MONTHLY
N OF CASES           21          21
MINIMUM         105.000      10.320
MAXIMUM         201.000      34.000
MEAN            154.048      21.581
VARIANCE       1122.648      69.623
STANDARD DEV     33.506       8.344

PEARSON CORRELATION MATRIX

              LINES   MONTHLY
LINES         1.000
MONTHLY      -0.963     1.000

NUMBER OF OBSERVATIONS: 21
5-34bivar.
Diamond State Questions
In the Diamond State Telephone Company example,
avg (LINES) = 154.048      SD (LINES) = 33.506
avg (MONTHLY) = 21.581     SD (MONTHLY) = 8.344
r = -0.963
What are the coordinates for the point of averages?
What is the slope of the regression line?
Suppose the MONTHLY charge was set at $25.00. What would you estimate to be the demand for # LINES from the 62 new businesses?
Suppose the MONTHLY charge was set at $15.00. What would you estimate to be the demand for # LINES from the 62 new businesses?
5-35bivar.
Another Diamond State Question
Suppose the MONTHLY charge was set at $50.00. What would you estimate to be the demand for # LINES from the 62 new businesses?
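One way to work the Diamond State questions is to plug each proposed charge into the fitted line quoted in these notes (estimated LINES = 237.495 - 3.867 MONTHLY). The sketch below also flags charges that fall outside the observed range of MONTHLY (10.32 to 34.00 in the summary table), since the line was fitted only to data in that range. The function name and the flagging rule are illustrative assumptions, not part of the original exercise.

```python
# Fitted line quoted in these notes: estimated LINES = 237.495 - 3.867 MONTHLY
INTERCEPT = 237.495
SLOPE = -3.867

# Observed range of MONTHLY (MINIMUM and MAXIMUM from the summary table).
X_MIN, X_MAX = 10.320, 34.000


def predict_lines(monthly):
    """Return (estimated LINES, True if monthly is outside the observed range)."""
    estimate = INTERCEPT + SLOPE * monthly
    is_extrapolation = not (X_MIN <= monthly <= X_MAX)
    return estimate, is_extrapolation


for charge in (25.00, 15.00, 50.00):
    estimate, is_extrapolation = predict_lines(charge)
    note = " (outside observed range: extrapolation)" if is_extrapolation else ""
    print(f"MONTHLY = ${charge:.2f}: estimated LINES = {estimate:.1f}{note}")
```

The $50.00 case still produces a number, but it rests on the assumption that the linear pattern continues well beyond the charges that were actually observed.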
5-36bivar.
Regression Computer Output
Regression
DEP VAR: LINES   N: 21   MULTIPLE R: 0.963   SQUARED MULTIPLE R: 0.927
ADJ SQRD MULTIPLE R: 0.923   STANDARD ERROR OF ESTIMATE: 9.273

VARIABLE      COEFF  STD ERROR  STD COEF  TOLERANCE        T  P(2 TAIL)
CONSTANT    237.495      5.732     0.000          .   41.432      0.000
MONTHLY      -3.867      0.249    -0.963      1.000  -15.560      0.000

ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES  DF  MEAN-SQUARE  F-RATIO      P
REGRESSION       20819.092   1    20819.092  242.103  0.000
RESIDUAL          1633.860  19       85.993
------------------------------------------------------------------------------------------------------------------
5-37bivar.
Interpreting the Regression Coefficients
5-38bivar.
Other Examples
1. X = Educational expenditure Y = Test scores
2. X = Height of a person Y = Weight of the person
3. X = # Service years of an automobile Y = Operating cost per year
4. X = Total weight of mail bags Y = # Mail orders
5. X = Price of product Y = Unit sales
6. X = Volume Y = Total cost of production
7. X = Calories in a candy bar Y = Grams of fat in the candy bar
8. X = Baseball slugging percentage Y = Player salary
9. X = Weight of a diamond Y = Price of the diamond
10.
11.
12.
5-39bivar.
Wood Products
TOTAL OBSERVATIONS: 23

               SHIPMENT     EMPLOY
N OF CASES           23         23
MINIMUM          22.900      3.000
MAXIMUM         469.800     79.000
MEAN            112.287     19.783
VARIANCE       9931.683    281.087
STANDARD DEV     99.658     16.766

Pearson Correlation Matrix

            SHIPMENT   EMPLOY
SHIPMENT        1.00
EMPLOY         0.979     1.00

Number of Observations: 23
5-40bivar.
5-41bivar.
y = ship, x = employ, line
[Figure: scatter diagram of SHIPMENT (0 to 500) against EMPLOY (0 to 80), with the fitted line]
5-42bivar.
y = employ, x = ship, line
[Figure: scatter diagram of EMPLOY (0 to 80) against SHIPMENT (0 to 500), with the fitted line]
5-43bivar.
Computer Output - 1
DEP VAR: SHIPMENT   N: 23   MULT R: 0.979   SQRD MULT R: 0.958
ADJ SQRD MULTIPLE R: 0.956   STD ERROR OF ESTIMATE: 21.018

VARIABLE     COEFF  STD ERROR  STD COEF  TOLER       T  P(2 TAIL)
CONSTANT    -2.781      6.868     0.000      .  -0.405      0.690
EMPLOY       5.817      0.267     0.979  1.000  21.763      0.000

ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES  DF  MEAN-SQUARE  F-RATIO      P
REGRESSION      209220.316   1   209220.316  473.619  0.000
RESIDUAL          9276.710  21      441.748
-------------------------------------------------------------------------------------------
5-44bivar.
Computer Output - 2
DEP VAR: EMPLOY   N: 23   MULT R: 0.979   SQRD MULT R: 0.958
ADJ SQRD MULT R: 0.956   STD ERROR OF ESTIMATE: 3.536

VARIABLE     COEFF  STD ERROR  STD COEF  TOLER       T  P(2 TAIL)
CONSTANT     1.298      1.125     0.000      .   1.154      0.262
SHIPMENT     0.165      0.008     0.979  1.000  21.763      0.000

ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES  DF  MEAN-SQUARE  F-RATIO      P
REGRESSION        5921.363   1     5921.363  473.619  0.000
RESIDUAL           262.550  21       12.502
--------------------------------------------------------------------------------------------
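The two computer outputs fit two different lines to the same data: one predicts SHIPMENT from EMPLOY, the other predicts EMPLOY from SHIPMENT. The sketch below refits both lines from the Wood Products table, with employment expressed in hundreds as in the output. The helper `least_squares` is an illustrative name, not a library function. It also checks that the product of the two slopes equals r squared, which is why the second slope is not simply the reciprocal of the first.

```python
# Wood Products data from these notes; employment in units of 100 persons.
shipment = [469.8, 246.4, 205.4, 186.5, 175.8, 142.9, 139.7, 120.6, 118.0,
            104.3, 89.9, 73.5, 72.6, 71.4, 53.9, 52.4, 50.1, 48.1, 47.0,
            36.7, 27.4, 27.3, 22.9]
employ = [79, 44, 28, 36, 38, 21, 24, 19, 15, 15, 16, 15, 14, 12, 8, 14,
          12, 14, 11, 8, 5, 4, 3]


def least_squares(xs, ys):
    """Return (intercept, slope) of the regression line for predicting ys from xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


a_ship, b_ship = least_squares(employ, shipment)      # predict SHIPMENT from EMPLOY
a_employ, b_employ = least_squares(shipment, employ)  # predict EMPLOY from SHIPMENT

print(f"SHIPMENT = {a_ship:.3f} + {b_ship:.3f} EMPLOY")
print(f"EMPLOY   = {a_employ:.3f} + {b_employ:.3f} SHIPMENT")
print(f"product of slopes = {b_ship * b_employ:.3f}")
```

Both lines pass through the point of averages, but they coincide only when r is exactly +1 or -1.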
5-45bivar.
Insurance Availability in Chicago
5-46bivar.
Chicago Plots
5-47bivar.
Chicago Insurance, cont.
For cases with income less than or equal to $15,000,
avg (Voluntary) = 6.376          SD (Voluntary) = 3.959
avg (Income) = $10,332.756       SD (Income) = $2,109.819
r = 0.896
Derive the equation for the regression line.
According to this linear model, what is the estimated value for "Voluntary" in a ZIP code area with Income $12,000? ... with Income $9,500?
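A worked sketch of the Chicago questions, using only the summary statistics above and the slope and intercept formulas from these notes (figures are rounded, so the last digit may differ slightly from software output):

```latex
\begin{aligned}
b &= r\,\frac{SD(\text{Voluntary})}{SD(\text{Income})}
   = 0.896 \times \frac{3.959}{2109.819} \approx 0.001681 \text{ per dollar of income} \\
a &= \text{avg(Voluntary)} - b \times \text{avg(Income)}
   \approx 6.376 - 0.001681 \times 10332.756 \approx -11.00 \\
\widehat{\text{Voluntary}} &\approx -11.00 + 0.001681 \times \text{Income} \\
\text{Income} = 12{,}000:\quad \widehat{\text{Voluntary}}
  &\approx 6.376 + 0.001681 \times (12000 - 10332.756) \approx 9.18 \\
\text{Income} = 9{,}500:\quad \widehat{\text{Voluntary}}
  &\approx 6.376 + 0.001681 \times (9500 - 10332.756) \approx 4.98
\end{aligned}
```

The two predictions are written as deviations from the point of averages, which avoids carrying the rounded intercept through the arithmetic.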
5-48bivar.
blank
5-49bivar.
Regression Effect
In virtually all test-retest situations, the bottom group on the first test will, on average, show some improvement on the 2nd test, and the top group will, on average, fall back.
This is called the REGRESSION EFFECT.
The REGRESSION FALLACY is thinking that the regression effect must be due to something important, not just the spread of points around the line.
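A small simulation can make the regression effect concrete. Everything in the sketch below is hypothetical: each simulated person has a fixed "true" level, and each test score is that level plus independent chance error, so nothing real changes between the two tests. Even so, the bottom group on the first test improves on average and the top group falls back.

```python
import random

random.seed(0)  # arbitrary seed, used only so the run is reproducible

N = 10_000
people = []
for _ in range(N):
    true_level = random.gauss(100, 10)         # hypothetical underlying level
    test1 = true_level + random.gauss(0, 10)   # chance error on the first test
    test2 = true_level + random.gauss(0, 10)   # independent chance error on the second
    people.append((test1, test2))

people.sort(key=lambda scores: scores[0])      # rank by first-test score
tenth = N // 10
bottom_group, top_group = people[:tenth], people[-tenth:]


def average_change(group):
    """Average of (second test - first test) within a group."""
    return sum(t2 - t1 for t1, t2 in group) / len(group)


bottom_change = average_change(bottom_group)
top_change = average_change(top_group)
print(f"bottom 10% on test 1: average change = {bottom_change:+.1f}")
print(f"top 10% on test 1:    average change = {top_change:+.1f}")
```

Since the simulation contains no treatment or learning at all, the changes come entirely from the spread of points around the line, which is the point of the regression fallacy.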
5-50bivar.
blank
5-51bivar.
Residuals
Regression methods allow us to estimate the average value of the dependent variable for each value of the independent variable.
Individuals will differ somewhat from the regression estimates.
How much?
5-52bivar.
Country          Economic   Birth Rate
Algeria              2          48
Argentina           19          21
Denmark             34          14
Germany             40          11
Guatemala            8          41
India               12          37
Ireland             20          22
Jamaica             20          31
Japan               37          19
Philippines         19          42
United States       30          15
Russia              46          18
5-53bivar.
Residuals
Prediction error = actual - predicted
= vertical distance from the point to the regression line
5-54bivar.
Residuals for Economically Active Women and Crude Birth Rates
Country          Economic   Birth Rate   Regr. Estim.   Residual
Algeria              2          48           44.1           3.9
Argentina           19          21           30.5          -9.5
Denmark             34          14           18.5          -4.5
Germany             40          11           13.7          -2.7
Guatemala            8          41           39.3           1.7
India               12          37           36.1           0.9
Ireland             20          22           29.7          -7.7
Jamaica             20          31           29.7           1.3
Japan               37          19           16.1           2.9
Philippines         19          42           30.5          11.5
United States       30          15           21.7          -6.7
Russia              46          18            8.9           9.1
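The residual column can be reproduced with a short sketch: fit the regression line to the twelve (Economic, Birth Rate) pairs in these notes, then subtract each regression estimate from the actual birth rate. The variable and function names are illustrative. The fitted line should be close to 45.7 - 0.8 × Economic, which is the line implied by the Regr. Estim. column.

```python
countries = ["Algeria", "Argentina", "Denmark", "Germany", "Guatemala", "India",
             "Ireland", "Jamaica", "Japan", "Philippines", "United States", "Russia"]
economic = [2, 19, 34, 40, 8, 12, 20, 20, 37, 19, 30, 46]
birth_rate = [48, 21, 14, 11, 41, 37, 22, 31, 19, 42, 15, 18]


def least_squares(xs, ys):
    """Return (intercept, slope) of the regression line for predicting ys from xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = least_squares(economic, birth_rate)
estimates = [intercept + slope * x for x in economic]
# Residual = actual - predicted (vertical distance from the point to the line).
residuals = [y - est for y, est in zip(birth_rate, estimates)]

for name, est, res in zip(countries, estimates, residuals):
    print(f"{name:<14} estimate = {est:5.1f}   residual = {res:+5.1f}")
print(f"sum of residuals = {sum(residuals):.6f}")
```

The residuals of a least squares line always sum to zero (up to floating-point error), so their average is zero as well.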
5-55bivar.
Residual Plots
A residual plot should NOT look systematic (no trend or pattern); it should look like just a cloud of points around the horizontal axis.
Problem plots also can tell us something about the data.
5-56bivar.
Residual Plot for Economically Active Women and Crude Birth Rates
[Figure: Residuals (-15 to 15) plotted against Percent Economically Active Women (0 to 50)]
5-57bivar.
Chicago Insurance Case: Residual Plot (versus Income)
5-58bivar.
The Least Squares Property of the Regression Line
Of all lines, the regression line is the one which has the smallest sum of squared residuals (and also the smallest rms error).
Thus, it is The Least Squares Line.
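A quick numerical check of the least squares property, using the birth rate data from these notes: the sum of squared residuals for the regression line is compared with the sums for a few nearby lines. The perturbation sizes are arbitrary choices made for illustration, and this is a spot check rather than a proof.

```python
economic = [2, 19, 34, 40, 8, 12, 20, 20, 37, 19, 30, 46]
birth_rate = [48, 21, 14, 11, 41, 37, 22, 31, 19, 42, 15, 18]


def least_squares(xs, ys):
    """Return (intercept, slope) of the regression line for predicting ys from xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sum_squared_residuals(intercept, slope, xs, ys):
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


a, b = least_squares(economic, birth_rate)
best = sum_squared_residuals(a, b, economic, birth_rate)

# Arbitrary nearby lines: shift the intercept or tilt the slope a little.
nearby_lines = [(a + 1, b), (a - 1, b), (a, b + 0.05), (a, b - 0.05), (a + 1, b - 0.05)]
others = [sum_squared_residuals(a2, b2, economic, birth_rate) for a2, b2 in nearby_lines]

print(f"regression line: {best:.1f}")
for (a2, b2), value in zip(nearby_lines, others):
    print(f"line {a2:.2f} + ({b2:.3f}) x: {value:.1f}")
```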
5-59bivar.
Look at the Scatter Diagram Before Fitting a Regression Model!
For each of the following data sets, the regression equation is
Y = 3.0 + 0.5 X and r = 0.82
Sorry, I didn’t scan in these plots yet.
5-60bivar.
blank
5-61bivar.
How Big Are The Residuals ?
R.M.S. Error of the Regression Line:
The rms error of the regression line says how far typical points are above or below the regression line.
Standard Deviation of Y:
The SD of Y says how far typical points are above or below a horizontal line through the average of y.
In other words, the SD of y is the rms error for predicting y by its average, just ignoring the x-values.
5-62bivar.
How Big Are The Residuals ?
The overall size of the residuals is measured by computing their standard deviation.
The average of the residuals is zero.
Computing the rms error of the regression line:
The rms error of the regression line estimating Y from X can be figured as $\sqrt{1 - r^2}\; SD(Y)$.
Note that here Y is the dependent variable!
The rms error is to the regression line
as
the SD is to the average.
5-63bivar.
How Big Are the Residuals?
Recall the First-Order Linear Model:
$Y = \beta_0 + \beta_1 X + \varepsilon$
$\varepsilon$ = (actual Y-value) - (predicted Y-value)
$\varepsilon$ = prediction error
$\varepsilon$ = residual
The mean of the residuals is zero.
The SD of the residuals is also known as the "root mean squared error of the regression line" (rms error).
5-64bivar.
The overall size of the residuals is measured by computing their standard deviation.
The rms error is to the regression line
as
the SD is to the average
Computing the rms error:
The rms error of the regression line estimating Y from X can be figured as
$$\sqrt{1 - r^2}\; SD(Y)$$
Notes:
1. rms error $= \sqrt{1 - r^2}\; SD(Y) \le SD(Y)$
2. Here Y is the dependent variable!
3. Here we are dividing by n, rather than n-2.
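The shortcut for the rms error can be checked numerically. The sketch below uses the birth rate data from these notes, computes the rms error directly from the residuals (dividing by n), and compares it with the square root of (1 - r²) times SD(Y), where SD(Y) also divides by n. Variable names are illustrative.

```python
from math import sqrt

economic = [2, 19, 34, 40, 8, 12, 20, 20, 37, 19, 30, 46]      # X
birth_rate = [48, 21, 14, 11, 41, 37, 22, 31, 19, 42, 15, 18]  # Y (dependent)

n = len(economic)
mean_x = sum(economic) / n
mean_y = sum(birth_rate) / n
sxx = sum((x - mean_x) ** 2 for x in economic)
syy = sum((y - mean_y) ** 2 for y in birth_rate)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(economic, birth_rate))

r = sxy / sqrt(sxx * syy)
sd_y = sqrt(syy / n)              # SD of Y, dividing by n

slope = sxy / sxx
intercept = mean_y - slope * mean_x
residuals = [y - (intercept + slope * x) for x, y in zip(economic, birth_rate)]

rms_direct = sqrt(sum(e ** 2 for e in residuals) / n)   # root mean square of residuals
rms_shortcut = sqrt(1 - r ** 2) * sd_y                  # shortcut formula

print(f"r = {r:.3f}, SD(Y) = {sd_y:.2f}")
print(f"rms error (direct)   = {rms_direct:.4f}")
print(f"rms error (shortcut) = {rms_shortcut:.4f}")
```

Regression software typically reports a "standard error of estimate" that divides by n - 2 instead, so its figure will be somewhat larger than this rms error for small samples.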
5-65bivar.
Looking At Vertical Strips
5-66bivar.
Looking At Vertical Strips
For an oval cloud of points, the points in a vertical strip are off the regression line (up and down) by amounts similar in size to the rms error of the regression line.
If the diagram is heteroscedastic, the rms error should not be used for individual strips.
5-67bivar.
Using the Normal Curve Inside A Vertical Strip
For an oval cloud of points, the SD within a vertical strip is about equal to the rms error of the regression line.
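As a hedged illustration of using the normal curve inside a vertical strip, the sketch below takes the Diamond State figures from these notes and asks a hypothetical question that is not part of the original slides: at a MONTHLY charge of $25.00, roughly what share of cases would have LINES above 150? It assumes the scatter diagram is an oval cloud, so that the strip is roughly normal with center at the regression estimate and SD equal to the rms error. It also uses the SD of LINES as printed in the summary table, which may use a different divisor than n.

```python
from math import sqrt
from statistics import NormalDist

# Figures quoted in these notes for the Diamond State example.
r = -0.963
sd_lines = 33.506
intercept, slope = 237.495, -3.867

monthly = 25.00     # the vertical strip of interest
threshold = 150     # hypothetical cutoff, chosen only for illustration

center = intercept + slope * monthly       # regression estimate = center of the strip
rms_error = sqrt(1 - r ** 2) * sd_lines    # approximate SD within the strip

strip = NormalDist(mu=center, sigma=rms_error)
share_above = 1 - strip.cdf(threshold)

print(f"center of strip = {center:.2f}, rms error = {rms_error:.2f}")
print(f"approximate share with LINES > {threshold}: {share_above:.3f}")
```

If the diagram is heteroscedastic, the rms error should not be used for individual strips, and this calculation would not apply.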
5-68bivar.
blank
5-69bivar.
Uses for r:
(1) Describes the clustering of the scatter diagram around the SD line, relative to the SD's
(2) Says how the average value of y depends on x: associated with each increase of 1 SD(X) in x, there is a change of r SD(Y) in y, on the average
(3) Gives the accuracy of the regression estimates (the SD of the prediction errors) via the rms error for the regression line, $\sqrt{1 - r^2}\; SD(Y)$
5-70bivar.
Coefficient of Determination
How much of the variation of Y has been explained by X?
(How much better are we at predicting Y when we do know the value of X?)
Compare
$\mathrm{Var}(Y - \bar{Y})$ versus $\mathrm{Var}(Y - \hat{Y})$
$\mathrm{Var}(Y)$ versus $(1 - r^2)\,\mathrm{Var}(Y)$
Thus, the proportion of the variation of Y which is NOT explained by X is
$$\frac{\mathrm{Var}(Y - \hat{Y})}{\mathrm{Var}(Y)} = \frac{(1 - r^2)\,\mathrm{Var}(Y)}{\mathrm{Var}(Y)} = 1 - r^2$$
And the proportion of the variation of Y which IS explained by X is
$r^2$
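The explained and unexplained proportions can be read off an analysis of variance table. The sketch below uses the sums of squares from the Diamond State regression output in these notes; the variable names are illustrative. The explained proportion should agree with the SQUARED MULTIPLE R of 0.927 in that output, and with the square of r = -0.963 up to rounding.

```python
# Sums of squares from the Diamond State ANALYSIS OF VARIANCE table.
ss_regression = 20819.092
ss_residual = 1633.860
ss_total = ss_regression + ss_residual

explained = ss_regression / ss_total     # proportion of variation explained: r squared
unexplained = ss_residual / ss_total     # proportion not explained: 1 - r squared

r = -0.963                               # correlation reported for LINES and MONTHLY
print(f"explained   = {explained:.3f}")
print(f"unexplained = {unexplained:.3f}")
print(f"r squared   = {r ** 2:.3f}")
```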