Download - Unit 5 Correlation and Regression: Examining and Modeling Relationships Between Variables

5-1bivar.

Unit 5 Correlation and Regression:

Examining and Modeling Relationships Between Variables

Chapters 8 - 12

Outline: Two variables

Scatter Diagrams to display bivariate data

Correlation: Concept, Interpretation, Computation, Cautions

Regression Model: Using a LINE to describe the relation between two variables and for prediction
• Finding "the" line
• Interpreting its coefficients

Residuals, Prediction Errors

Extensions of Simple Linear Regression

A.05

5-2bivar.

Four Scatter Diagrams

[Figure: four scatter diagrams]
1. # applicants versus size of help wanted ad
2. cost per min. ($) versus CUME rating
3. % delinquent versus age of credit account (years)
4. last year's sales ($1000) versus entertain. expenses (x $100)

5-3bivar.

If there is STRONG ASSOCIATION between 2 variables, then knowing one helps a lot in predicting the other.

If there is WEAK ASSOCIATION between 2 variables, then information about one variable does not help much in predicting the other.

dependent variable

independent variable

Usually, the INDEPENDENT variable is thought to influence the DEPENDENT variable.

Association

5-4bivar.

Summarizing the Relationship Between Two Variables

1. Plot the points in a scatter diagram.

2. Find average for X and average for Y. Plot the point of averages.

3. Find SD(X), which measures horizontal spread of points, and SD(Y), which measures vertical spread of points.

4. Find the correlation coefficient (r), which measures the degree of clustering / spread of points about a line (the SD line).


5-5bivar.

Wood Products Shipments and Employment,

by state, 1989, excl. California

[Scatter diagram of Employment x 100 and Shipments ($ million); axis ticks run 0 to 250 and 0 to 50]

5-6bivar.

Wood Products Data

Shipments ($ million)    Employment
469.8                    7,900
246.4                    4,400
205.4                    2,800
186.5                    3,600
175.8                    3,800
142.9                    2,100
139.7                    2,400
120.6                    1,900
118.0                    1,500
104.3                    1,500
 89.9                    1,600
 73.5                    1,500
 72.6                    1,400
 71.4                    1,200
 53.9                      800
 52.4                    1,400
 50.1                    1,200
 48.1                    1,400
 47.0                    1,100
 36.7                      800
 27.4                      500
 27.3                      400
 22.9                      300

5-7bivar.

Wood Products Shipments and Employment,

by state, 1989, excl. California

[Scatter diagram with axis label Employment x 100; axis ticks run 0 to 250 and 0 to 50]

5-8bivar.

5-9bivar.

Linear Association

The correlation coefficient measures the LINEAR relationship between TWO variables.

It is a measure of LINEAR association or clustering around a line.

[Figure: six sketches of scatter diagrams]
r near +1; r near -1
r positive, near 0; r negative, near 0
r = 1; r = -1

5-10bivar.

Interpretation of r

The closer the correlation coefficient is to 1 (or -1), the more tightly clustered the points are around a line (the SD line).

The SD line passes through all points that are an equal number of SDs away from the average on both variables.

positive association negative association

5-11bivar.

Twelve Plots, with r

Look in your textbook, pages 127 and 129.

5-12bivar.

5-13bivar.

Computing the Correlation Coefficient, r

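$$ r = \frac{\frac{1}{n}\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{SD(X)\,SD(Y)} = \frac{\frac{1}{n}\sum_i X_i Y_i - \bar{X}\,\bar{Y}}{SD(X)\,SD(Y)} = \frac{\mathrm{Covariance}(X, Y)}{SD(X)\,SD(Y)} $$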

Convert each variable to standard units. The average of the products gives the correlation coefficient.

r = average of (z-score for X) (z-score for Y)
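As an illustration of the z-score recipe, here is a minimal Python sketch applied to the Wood Products data listed earlier in this unit (employment entered in hundreds, as in the summary output). The helper names `mean`, `sd`, and `correlation` are chosen for this example and are not from the slides; the SD here divides by n.

```python
from math import sqrt

# (shipments in $ million, employment in hundreds) for the 23 states
data = [
    (469.8, 79), (246.4, 44), (205.4, 28), (186.5, 36), (175.8, 38),
    (142.9, 21), (139.7, 24), (120.6, 19), (118.0, 15), (104.3, 15),
    (89.9, 16), (73.5, 15), (72.6, 14), (71.4, 12), (53.9, 8),
    (52.4, 14), (50.1, 12), (48.1, 14), (47.0, 11), (36.7, 8),
    (27.4, 5), (27.3, 4), (22.9, 3),
]

def mean(values):
    return sum(values) / len(values)

def sd(values):
    # SD with divisor n (root mean square of deviations from the average)
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))

def correlation(xs, ys):
    # Convert both variables to standard units, then average the products
    mx, my = mean(xs), mean(ys)
    sx, sy = sd(xs), sd(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return mean(products)

shipments = [s for s, _ in data]
employment = [e for _, e in data]
r = correlation(employment, shipments)
print(round(r, 3))  # the slides report r = 0.979 for these data
```

Because the same divisor is used in the SDs and in the average of products, using n or n-1 throughout gives the same r.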

5-14bivar.

Example: Computation of r

X    Y    X - avg(X)    (X - avg(X))^2    Y - avg(Y)    (Y - avg(Y))^2    z-score for X    z-score for Y    product

5-15bivar.

Some Cases When the Correlation Coefficient, r,

Does Not Give A Good Indication of Clustering

[Figure: two scatter diagrams, one of Y against X and one with horizontal axis INDEP; the correlations shown are r = .155 and r = .536]

5-16bivar.

[Scatter diagram: BRAIN WEIGHT IN KG against BODY WEIGHT IN KG]

r = .933 (36 data values)

5-17bivar.

“No Elephants”

[Scatter diagram: brain weight in grams against body weight in kg]

r = .596

(r = .887, excluding dinosaurs, elephants, humans)

5-18bivar.

All brain data, log transformed

[Scatter diagram: log (brain weight) against log (body weight)]

r = .856 (all data)

5-19bivar.

[Scatter diagram: PRICE against COUPON]

r = .883 (all data)
r = .984 (without flower bonds)

(Siegel)

5-20bivar.

Interpretation of Empirical Association

1. Descriptive
Example: Height versus Weight

2. Causal
Example: Total Cost vs. Volume of Production

3. Nonsense
Example: Polio Incidence vs. Soft Drink Sales

5-21bivar.

Prediction Using Correlation

1. What is the best prediction of the dependent variable? What if the value of the independent variable is available?

2. What is the likely size of the prediction error?

Fundamental Principle of Prediction

1. Use the mean of the relevant group.

2. SD of the group gives the "likely size of error."

5-22bivar.

5-23bivar.

Diamond State Telephone Company

Demand for LINES versus Proposed MONTHLY charge per line ($)

[Scatter diagram: LINES against MONTHLY]

5-24bivar.

Look At The Vertical Strip Corresponding to the Given X Value

[Figure: axes Y and X]

5-25bivar.

Graph of Averages

[Scatter diagram: LINES against MONTHLY, with five points marked x]

estimated LINES = 237.495 - 3.867 MONTHLY

5-26bivar.

5-27bivar.

Linearly Related Variables

The REGRESSION LINE is to a scatter diagram as the AVERAGE is to a list of numbers.

The regression line estimates the average values for the dependent variable, Y, corresponding to each value, x, of the independent variable.

5-28bivar.

Linearly Related Variables

If we have 2 variables, linearly related to one another, then knowing the value of one variable (for a particular individual) can help to estimate / predict the value of the other variable.

• If we know nothing re. the value of the independent variable (X), then we estimate the value of the dependent variable to be the OVERALL AVERAGE of the dependent variable (Y).

• If we know that the independent variable (X) has a particular value for a given individual, then we can take a "more educated guess" at the value of the dependent variable (Y).

5-29bivar.

Regression and SD Lines

The REGRESSION LINE for modeling the relation between X (independent variable) and Y (dependent variable) passes through the POINT OF AVERAGES and has slope

$$ r\,\frac{SD_Y}{SD_X} $$

That is, associated with each increase of one SD in X, there is an increase of r SDs in Y, on the average.

The SD LINE for modeling the relation between X (independent variable) and Y (dependent variable) passes through the POINT OF AVERAGES and has slope

$$ \frac{SD_Y}{SD_X} $$

(taken with a negative sign when r is negative).

5-30bivar.

Estimating the Intercept and Slope of the Regression Line

The REGRESSION LINE for modeling the relation between X (independent variable) and Y (dependent variable) is also known as the REGRESSION LINE for predicting Y from X. It has the form

Y = a + b X
= intercept + slope × X.

Here,
b = slope
= r SD(Y) / SD(X)

a = intercept
= avg(Y) - b avg(X)
= avg(Y) - r [SD(Y) / SD(X)] avg(X)

5-31bivar.

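Prediction from a Regression Model

Predicted value of Y corresponding to a given value of X is

$$ \hat{Y} = a + bX = \left(\bar{Y} - r\,\frac{SD_Y}{SD_X}\,\bar{X}\right) + \left(r\,\frac{SD_Y}{SD_X}\right) X = \bar{Y} - (\bar{X} - X)\left(r\,\frac{SD_Y}{SD_X}\right) = \bar{Y} - (\bar{X} - X)\,(\text{slope}) $$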

5-32bivar.

5-33bivar.

TOTAL OBSERVATIONS: 21

                  LINES     MONTHLY
N OF CASES           21          21
MINIMUM         105.000      10.320
MAXIMUM         201.000      34.000
MEAN            154.048      21.581
VARIANCE       1122.648      69.623
STANDARD DEV     33.506       8.344

PEARSON CORRELATION MATRIX

              LINES    MONTHLY
LINES         1.000
MONTHLY      -0.963      1.000

NUMBER OF OBSERVATIONS: 21

5-34bivar.

Diamond State Questions

In the Diamond State Telephone Company example,
avg (LINES) = 154.048, SD (LINES) = 33.506
avg (MONTHLY) = 21.581, SD (MONTHLY) = 8.344
r = -0.963

What are the coordinates for the point of averages?

What is the slope of the regression line?

Suppose the MONTHLY charge was set at $25.00. What would you estimate to be the demand for # LINES from the 62 new businesses?


5-35bivar.

Another Diamond State Question


5-36bivar.

Regression Computer Output

Regression
DEP VAR: LINES   N: 21   MULTIPLE R: 0.963   SQUARED MULTIPLE R: 0.927
ADJ SQRD MULTIPLE R: 0.923   STANDARD ERROR OF ESTIMATE: 9.273

VARIABLE      COEFF   STD ERROR   STD COEF   TOLERANCE          T   P(2 TAIL)
CONSTANT    237.495       5.732      0.000           .     41.432       0.000
MONTHLY      -3.867       0.249     -0.963       1.000    -15.560       0.000

ANALYSIS OF VARIANCE
SOURCE        SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
REGRESSION         20819.092    1     20819.092   242.103   0.000
RESIDUAL            1633.860   19        85.993
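As a rough check, the slope and intercept formulas from this unit can be applied to the Diamond State summary statistics (the averages, SDs, and r reported on the slides) to see how closely they reproduce the coefficients in this output. The function name `regression_line` is illustrative and not part of the slides; small differences from the printed coefficients come from rounding in the summary statistics.

```python
def regression_line(avg_x, sd_x, avg_y, sd_y, r):
    """Return (intercept, slope) of the regression line for predicting Y from X."""
    slope = r * sd_y / sd_x
    intercept = avg_y - slope * avg_x
    return intercept, slope

# X = MONTHLY charge, Y = LINES demanded (summary statistics from the slides)
a, b = regression_line(avg_x=21.581, sd_x=8.344,
                       avg_y=154.048, sd_y=33.506, r=-0.963)
print(f"estimated LINES = {a:.1f} + ({b:.3f}) MONTHLY")

# Predicted demand if the MONTHLY charge were set at $25.00
predicted_lines = a + b * 25.0
print(f"predicted LINES at $25.00: {predicted_lines:.1f}")
```

The same prediction can be written as avg(Y) plus the slope times the distance of X from avg(X), which makes it clear that the line passes through the point of averages.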

5-37bivar.

Interpreting the Regression Coefficients

5-38bivar.

Other Examples

1. X = Educational expenditure Y = Test scores

2. X = Height of a person Y = Weight of the person

3. X = # Service years of an automobile Y = Operating cost per year

4. X = Total weight of mail bags Y = # Mail orders

5. X = Price of product Y = Unit sales

6. X = Volume Y = Total cost of production

7. X = Calories in a candy bar Y = Grams of fat in the candy bar

8. X = Baseball slugging percentage Y = Player salary

9. X = Weight of a diamond Y = Price of the diamond

10.

11.

12.

5-39bivar.

Wood Products

TOTAL OBSERVATIONS: 23

                SHIPMENT     EMPLOY
N OF CASES            23         23
MINIMUM           22.900      3.000
MAXIMUM          469.800     79.000
MEAN             112.287     19.783
VARIANCE        9931.683    281.087
STANDARD DEV      99.658     16.766

Pearson Correlation Matrix

            SHIPMENT   EMPLOY
SHIPMENT       1.00
EMPLOY         0.979     1.00

Number of Observations: 23

5-40bivar.

5-41bivar.

y = ship, x = employ, line

[Scatter diagram with fitted line: horizontal axis EMPLOY (0 to 80), vertical axis 0 to 500]

5-42bivar.

y = employ, x = ship, line

[Scatter diagram with fitted line: horizontal axis SHIPMENT (0 to 500), vertical axis 0 to 80]

5-43bivar.

Computer Output - 1

DEP VAR: SHIPMENT   N: 23   MULT R: 0.979   SQRD MULT R: 0.958
ADJ SQRD MULTIPLE R: 0.956   STD ERROR OF ESTIMATE: 21.018

VARIABLE     COEFF   STD ERROR   STD COEF   TOLER        T   P(2 TAIL)
CONSTANT    -2.781       6.868      0.000       .   -0.405       0.690
EMPLOY       5.817       0.267      0.979   1.000   21.763       0.000

ANALYSIS OF VARIANCE
SOURCE        SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
REGRESSION        209220.316    1     209220.31   473.619   0.000
RESIDUAL            9276.710   21       441.748

5-44bivar.

Computer Output - 2

DEP VAR: EMPLOY   N: 23   MULT R: 0.979   SQRD MULT R: 0.958
ADJ SQRD MULT R: 0.956   STD ERROR OF ESTIMATE: 3.536

VARIABLE     COEFF   STD ERROR   STD COEF   TOLER        T   P(2 TAIL)
CONSTANT     1.298       1.125      0.000       .    1.154       0.262
SHIPMENT     0.165       0.008      0.979   1.000   21.763       0.000

ANALYSIS OF VARIANCE
SOURCE        SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
REGRESSION          5921.363    1      5921.363   473.619   0.000
RESIDUAL             262.550   21        12.502
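The two computer outputs fit two different lines to the same data: one predicts SHIPMENT from EMPLOY, the other predicts EMPLOY from SHIPMENT. The sketch below, which assumes only the Wood Products summary statistics from the slides (the helper `regression_line` is a name made up for this example), shows that swapping the roles of the variables is not the same as algebraically inverting the first line.

```python
def regression_line(avg_x, sd_x, avg_y, sd_y, r):
    """Return (intercept, slope) of the regression line for predicting Y from X."""
    slope = r * sd_y / sd_x
    return avg_y - slope * avg_x, slope

r = 0.979
ship_avg, ship_sd = 112.287, 99.658      # SHIPMENT ($ million)
emp_avg, emp_sd = 19.783, 16.766         # EMPLOY (x 100)

# Line 1: predict SHIPMENT from EMPLOY
a1, b1 = regression_line(emp_avg, emp_sd, ship_avg, ship_sd, r)
# Line 2: predict EMPLOY from SHIPMENT
a2, b2 = regression_line(ship_avg, ship_sd, emp_avg, emp_sd, r)

print(f"SHIPMENT = {a1:.2f} + {b1:.3f} EMPLOY")
print(f"EMPLOY   = {a2:.2f} + {b2:.3f} SHIPMENT")

# Inverting line 1 would give slope 1 / b1, which differs from b2
# unless r is exactly +1 or -1; the product of the two slopes is r squared.
inverted_slope = 1 / b1
print(round(inverted_slope, 3), round(b2, 3), round(b1 * b2, 3))
```

The coefficients differ slightly from the printed output because r and the SDs are rounded on the slides.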

5-45bivar.

Insurance Availability in Chicago

5-46bivar.

Chicago Plots

5-47bivar.

Chicago Insurance, cont.

For cases with income less than or equal to $15,000,
avg (Voluntary) = 6.376, SD (Voluntary) = 3.959
avg (Income) = $10,332.756, SD (Income) = $2,109.819
r = 0.896

Derive the equation for the regression line.

According to this linear model, what is the estimated value for "Voluntary" in a ZIP code area with Income $12,000?... with Income $9,500?

5-48bivar.


5-49bivar.

Regression Effect

In virtually all test-retest situations, the bottom group on the first test will, on average, show some improvement on the 2nd test, and the top group will, on average, fall back.

This is called the REGRESSION EFFECT.

The REGRESSION FALLACY is thinking that the regression effect must be due to something important, not just the spread of points around the line.
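The regression effect can be seen in a small simulation. This is a sketch under made-up assumptions, not data from the course: each score is modeled as a shared "ability" plus independent chance error, so the two tests have a correlation of about 0.5 and nothing real changes between them.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

n = 10_000
pairs = []
for _ in range(n):
    ability = random.gauss(0, 1)
    test1 = ability + random.gauss(0, 1)   # ability plus chance error
    test2 = ability + random.gauss(0, 1)   # same ability, fresh chance error
    pairs.append((test1, test2))

pairs.sort()                 # order individuals by their first-test score
k = n // 10
bottom = pairs[:k]           # bottom 10% on the first test
top = pairs[-k:]             # top 10% on the first test

def average(values):
    return sum(values) / len(values)

bottom_change = average([t2 - t1 for t1, t2 in bottom])
top_change = average([t2 - t1 for t1, t2 in top])

print(f"bottom group, average change: {bottom_change:+.2f}")
print(f"top group, average change:    {top_change:+.2f}")
```

The bottom group improves on average and the top group falls back, even though the simulation contains no treatment or learning effect at all; the change comes only from the spread of points around the line.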

5-50bivar.


5-51bivar.

Residuals

Regression methods allow us to estimate the average value of the dependent variable for each value of the independent variable.

Individuals will differ somewhat from the regression estimates.

How much?

5-52bivar.


Country          Economic   Birth Rate
Algeria                 2           48
Argentina              19           21
Denmark                34           14
Germany                40           11
Guatemala               8           41
India                  12           37
Ireland                20           22
Jamaica                20           31
Japan                  37           19
Philippines            19           42
United States          30           15
Russia                 46           18

5-53bivar.

Residuals

Prediction error = actual - predicted

= vertical distance from the point to the regression line

5-54bivar.

Residuals for Economically Active Women and Crude Birth Rates

Country          Economic   Birth Rate   Regr. Estim.   Residual
Algeria                 2           48           44.1        3.9
Argentina              19           21           30.5       -9.5
Denmark                34           14           18.5       -4.5
Germany                40           11           13.7       -2.7
Guatemala               8           41           39.3        1.7
India                  12           37           36.1        0.9
Ireland                20           22           29.7       -7.7
Jamaica                20           31           29.7        1.3
Japan                  37           19           16.1        2.9
Philippines            19           42           30.5       11.5
United States          30           15           21.7       -6.7
Russia                 46           18            8.9        9.1
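The residual column can be recomputed directly from the table as actual minus predicted. The following sketch uses the birth rates and regression estimates exactly as tabulated; the variable names are chosen for this example.

```python
from math import sqrt

# (country, actual birth rate, regression estimate) from the table above
rows = [
    ("Algeria", 48, 44.1), ("Argentina", 21, 30.5), ("Denmark", 14, 18.5),
    ("Germany", 11, 13.7), ("Guatemala", 41, 39.3), ("India", 37, 36.1),
    ("Ireland", 22, 29.7), ("Jamaica", 31, 29.7), ("Japan", 19, 16.1),
    ("Philippines", 42, 30.5), ("United States", 15, 21.7), ("Russia", 18, 8.9),
]

# Prediction error = actual - predicted
residuals = [round(actual - estimate, 1) for _, actual, estimate in rows]

for (country, _, _), res in zip(rows, residuals):
    print(f"{country:<14} {res:+.1f}")

# The residuals average out to roughly zero (not exactly, because the
# tabulated estimates are rounded); their root-mean-square size is the rms error.
mean_residual = sum(residuals) / len(residuals)
rms_error = sqrt(sum(res ** 2 for res in residuals) / len(residuals))
print(f"mean residual: {mean_residual:.2f}, rms error: {rms_error:.2f}")
```

A positive residual means the country's birth rate is above the regression estimate; a negative residual means it is below.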

5-55bivar.

Residual Plots

A residual plot should NOT look systematic (no trend or pattern), just a cloud of points around the horizontal axis.

Problem plots also can tell us something about the data.

5-56bivar.

Residual Plot for Economically Active Women and Crude Birth Rates

[Plot: Residuals (vertical axis, -15 to 15) against Percent Economically Active Women (horizontal axis, 0 to 50)]

5-57bivar.

Chicago Insurance Case: Residual Plot (versus Income)

5-58bivar.

The Least Squares Property of the Regression Line

Of all lines, the regression line is the one which has the smallest sum of squared residuals (and also the smallest rms error).

Thus, it is The Least Squares Line.
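One way to see the least squares property is to compare the regression line with a few nearby lines. The sketch below uses the Economically Active Women and Crude Birth Rates data from this unit; the nudged lines are arbitrary choices made for illustration.

```python
# x = percent economically active women, y = crude birth rate (12 countries)
xs = [2, 19, 34, 40, 8, 12, 20, 20, 37, 19, 30, 46]
ys = [48, 21, 14, 11, 41, 37, 22, 31, 19, 42, 15, 18]

n = len(xs)
avg_x, avg_y = sum(xs) / n, sum(ys) / n

# Regression slope and intercept; this form of the slope is
# algebraically the same as r * SD(Y) / SD(X)
slope = (sum((x - avg_x) * (y - avg_y) for x, y in zip(xs, ys))
         / sum((x - avg_x) ** 2 for x in xs))
intercept = avg_y - slope * avg_x

def sum_squared_residuals(a, b):
    """Sum of squared vertical distances from the points to the line y = a + b x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

best = sum_squared_residuals(intercept, slope)

# Nudge the intercept and slope in each direction and recompute
nudged = [
    sum_squared_residuals(intercept + da, slope + db)
    for da in (-1.0, 0.0, 1.0)
    for db in (-0.05, 0.0, 0.05)
    if (da, db) != (0.0, 0.0)
]

print(f"regression line: y = {intercept:.1f} + ({slope:.2f}) x")
print(f"sum of squared residuals: {best:.1f}")
print(f"smallest among nudged lines: {min(nudged):.1f}")
```

Every nudged line has a larger sum of squared residuals than the regression line, which is what the least squares property says.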

5-59bivar.

Look at the Scatter Diagram Before Fitting a Regression Model!

For each of the following data sets, the regression equation is

Y = 3.0 + 0.5 X and r = 0.82

Sorry, I didn’t scan in these plots yet.
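The slide does not name the data sets. The equation Y = 3.0 + 0.5 X with r = 0.82 matches the well-known Anscombe's quartet, so the sketch below assumes those are the four data sets intended; that identification is an assumption, not something stated in the slides.

```python
from math import sqrt

# Anscombe's quartet (assumed to be the data sets the slide refers to)
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def fit(xs, ys):
    """Return (intercept, slope, r) using the formulas from this unit."""
    n = len(xs)
    avg_x, avg_y = sum(xs) / n, sum(ys) / n
    sd_x = sqrt(sum((x - avg_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - avg_y) ** 2 for y in ys) / n)
    cov = sum((x - avg_x) * (y - avg_y) for x, y in zip(xs, ys)) / n
    r = cov / (sd_x * sd_y)
    slope = r * sd_y / sd_x
    return avg_y - slope * avg_x, slope, r

results = {name: fit(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (a, b, r) in results.items():
    print(f"{name}: Y = {a:.2f} + {b:.3f} X, r = {r:.3f}")
```

The four sets give nearly identical summary numbers, yet their scatter diagrams look very different (a roughly linear cloud, a curve, a line with one outlier, and a vertical stack with one far point), which is why the plot should be examined before fitting.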

5-60bivar.


5-61bivar.

How Big Are The Residuals ?

R.M.S. Error of the Regression Line:

The rms error of the regression line says how far typical points are above or below the regression line.

Standard Deviation of Y:

The SD of Y says how far typical points are above or below a horizontal line through the average of Y.

In other words, the SD of y is the rms error for predicting y by its average, just ignoring the x-values.

5-62bivar.

How Big Are The Residuals ?

The overall size of the residuals is measured by computing their standard deviation.

The average of the residuals is zero.

Computing the rms error of the regression line:

The rms error of the regression line estimating Y from X can be figured as $\sqrt{1 - r^2} \times SD(Y)$.

Note that here Y is the dependent variable!

The rms error is to the regression line

as

the SD is to the average.

5-63bivar.

How Big Are the Residuals?

Recall the First-Order Linear Model:

Y = β0 + β1X + ε

ε = (actual Y-value) - (predicted Y-value)
= prediction error
= residual

The mean of the residuals is zero. The SD of the residuals is also known as the "root mean squared error of the regression line" (rms error).

5-64bivar.

The overall size of the residuals is measured by computing their standard deviation.

The rms error is to the regression line

as

the SD is to the average

Computing the rms error:

The rms error of the regression line estimating Y from X can be figured as

$$ \text{rms error} = \sqrt{1 - r^2}\; SD(Y) $$

Notes:

1. $\sqrt{1 - r^2}\; SD(Y) \le SD(Y)$

2. Here Y is the dependent variable!

3. Here we are dividing by n, rather than n-2.
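To see what the choice of divisor does in practice, the sketch below uses the sums of squares from the Diamond State regression output earlier in this unit. It computes the rms error with divisor n and compares it with the "standard error of estimate" printed by the software, which divides by n-2. Variable names are chosen for this example.

```python
from math import sqrt

n = 21
regression_ss = 20819.092    # from the ANALYSIS OF VARIANCE table
residual_ss = 1633.860
total_ss = regression_ss + residual_ss

# rms error: root of the mean squared residual, dividing by n
rms_error = sqrt(residual_ss / n)

# The software's STANDARD ERROR OF ESTIMATE divides by n - 2 instead
std_error_of_estimate = sqrt(residual_ss / (n - 2))

# Same rms error via sqrt(1 - r^2) * SD(Y), with SD(Y) also using divisor n
sd_y = sqrt(total_ss / n)
r_squared = regression_ss / total_ss
rms_from_formula = sqrt(1 - r_squared) * sd_y

print(f"rms error (divide by n):           {rms_error:.3f}")
print(f"standard error (divide by n - 2):  {std_error_of_estimate:.3f}")
print(f"sqrt(1 - r^2) * SD(Y):             {rms_from_formula:.3f}")
```

The n-2 version is always a little larger; for large n the two are nearly the same. Note that the SD printed in the summary output (33.506) uses divisor n-1, so plugging it into the formula gives a value between the two.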

5-65bivar.

Looking At Vertical Strips

5-66bivar.

Looking At Vertical Strips

For an oval cloud of points, the points in a vertical strip are off the regression line (up and down) by amounts similar in size to the rms error of the regression line.

If the diagram is heteroscedastic, the rms error should not be used for individual strips.

5-67bivar.

Using the Normal Curve Inside A Vertical Strip

For an oval cloud of points, the SD within a vertical strip is about equal to the rms error of the regression line.

5-68bivar.


5-69bivar.

Uses for r:

(1) Describes the clustering of the scatter diagram around the SD line, relative to the SD's

(2) Says how the average value of y depends on x: associated with each increase of 1 SD(X) in x, there is an increase of r SD(Y) in y, on the average

(3) Gives the accuracy of the regression estimates (the SD of the prediction errors) via the rms error for the regression line

$$ \sqrt{1 - r^2}\; SD(Y) $$

5-70bivar.

Coefficient of Determination

How much of the variation of Y has been explained by X?

(How much better are we at predicting Y when we do know the value of X?)

Compare

$$ \mathrm{Var}(Y - \bar{Y}) \quad \text{versus} \quad \mathrm{Var}(Y - \hat{Y}) $$

$$ \mathrm{Var}(Y) \quad \text{versus} \quad (1 - r^2)\,\mathrm{Var}(Y) $$

Thus, the proportion of the variation of Y which is NOT explained by X is

$$ \frac{\mathrm{Var}(Y - \hat{Y})}{\mathrm{Var}(Y)} = \frac{(1 - r^2)\,\mathrm{Var}(Y)}{\mathrm{Var}(Y)} = 1 - r^2 $$

And the proportion of the variation of Y which IS explained by X is

$$ r^2 $$
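For a concrete case, the explained and unexplained proportions can be read off the sums of squares in the Wood Products regression output (SHIPMENT on EMPLOY) from earlier in this unit. This is a small sketch with illustrative variable names.

```python
regression_ss = 209220.316   # variation of SHIPMENT explained by EMPLOY
residual_ss = 9276.710       # variation left over in the residuals
total_ss = regression_ss + residual_ss

explained = regression_ss / total_ss      # r squared
unexplained = residual_ss / total_ss      # 1 - r squared

print(f"explained by X:     {explained:.3f}")
print(f"not explained by X: {unexplained:.3f}")

# The reported correlation, squared, agrees up to rounding
r = 0.979
print(f"r squared from r = 0.979: {r ** 2:.3f}")
```

The explained proportion matches the SQRD MULT R value of 0.958 printed in the output.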
