Chapter 2 Correlation and Regression
Section 2.1. Introduction
In this chapter, we shall introduce the concepts of correlation and regression.
They are very useful in discovering whether certain simple relationships hold
among variables. Before introducing these concepts, we have to introduce some
basic concepts in statistics, such as the mean, variance and covariance. These
concepts are important not only for correlation and regression, but also for many
other data analysis techniques introduced later in this book. That is why we
introduce them at this early stage.
Since regression analysis is a well-known topic in applied statistics, we do not
intend to give a thorough discussion of it here. We introduce it because it is
important for other techniques. The reader who is interested in this technique should
consult [Dunn and Clark 1974] and [Draper and Smith 1966].
Section 2.2. Mean and Variance
Definition
Let X_i denote a variable and x_{i1}, x_{i2}, …, x_{iM} be the M individual observations of
variable X_i. Then the mean of X_i is defined as

\bar{x}_i = \frac{1}{M} \sum_{j=1}^{M} x_{ij}.
Example 2.1
If a family monthly spending during the year of 1974 is as follows :
Month Spending (in $1000)
January 10.0
Feb. 19.0
Mar. 9.5
Apr. 11.0
May 12.0
June 11.0
July 10.0
Aug. 13.0
Sep. 10.0
Oct. 10.0
Nov. 11.0
Dec. 10.0
Then the mean of the family spending is \bar{x} = 136.5/12 = 11.375 (in $1000).
Given a set of data, it is often practical to normalize the data so that the mean is
0. That is, instead of x_{ij}, we calculate

x'_{ij} = x_{ij} - \bar{x}_i.
Example 2.2
For the problem in Example 2.1, we shall have the following set of data :
Month Spending
January 10.0 – 11.375 = -1.375
Feb. 19.0 – 11.375 = 7.625
Mar. 9.5 – 11.375 = -1.875
Apr. 11.0 – 11.375 = -0.375
May 12.0 – 11.375 = 0.625
June 11.0 – 11.375 = -0.375
July 10.0 – 11.375 = -1.375
Aug. 13.0 – 11.375 = 1.625
Sep. 10.0 – 11.375 = -1.375
Oct. 10.0 – 11.375 = -1.375
Nov. 11.0 – 11.375 = -0.375
Dec. 10.0 – 11.375 = -1.375
It can be said that the above data are more informative. Note that the sign plays
an important role: if a value is positive (negative), the spending is above (below)
average. Thus, one can easily see that in January the family spends less than average
and in May it spends more.
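The mean and the mean-centered data above can be reproduced with a few lines of Python (a minimal sketch; the variable names are our own):

```python
# Monthly spending (in $1000) from Example 2.1.
spending = [10.0, 19.0, 9.5, 11.0, 12.0, 11.0,
            10.0, 13.0, 10.0, 10.0, 11.0, 10.0]

# Mean: the sum of the observations divided by their number.
mean = sum(spending) / len(spending)        # 11.375

# Normalize so that the mean is 0: subtract the mean from each value.
centered = [x - mean for x in spending]     # January gives -1.375, etc.
```

After centering, the values sum to zero, so the sign of each entry directly shows above- or below-average months.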
After the mean is calculated, we can extract more information by calculating the
variance, which is a measure of dispersion.
Definition
Let X_i denote a variable and x_{i1}, x_{i2}, …, x_{iM} denote the M individual observations of X_i.
The variance of X_i is defined as

v_i = \frac{1}{M} \sum_{j=1}^{M} (x_{ij} - \bar{x}_i)^2,

where \bar{x}_i is the mean of X_i.

\sigma_i = \sqrt{v_i} is called the standard deviation of X_i.
Example 2.3
For the family spending problem described in Example 2.1, the variance is
calculated as follows :

v = \frac{1}{12}\left[(-1.375)^2 + (7.625)^2 + (-1.875)^2 + \cdots + (-1.375)^2\right] = \frac{74.5625}{12} \approx 6.214,

and the standard deviation is \sigma = \sqrt{6.214} \approx 2.493.
Let us imagine that we have a set of personnel data in which two variables are
involved. One is the number of years of college education and the other is the body
weight (in kilograms). Some typical examples may be as follows :
Weight Year of College Education
X1 X2
60 4
70 8
75 2
65 4
72 6
73 8
80 4
55 6
Table 2.1
One can see that there is another problem: the values of weight are much
larger than the values of years of college education. Thus variable X1 will dominate
variable X2. Since we want these variables to play equally important roles, we want to
“normalize” them. There are several ways to normalize the data. One
practical and commonly used method is to normalize the data so that the variances are
all 1. The normalization procedure is as follows :
Input : x_{i1}, x_{i2}, …, x_{iM}.
Step 1. Calculate the mean \bar{x}_i = \frac{1}{M}\sum_{j=1}^{M} x_{ij}.
Step 2. Let x'_{ij} = x_{ij} - \bar{x}_i for j = 1, 2, …, M.
Step 3. Let \sigma_i = \sqrt{\frac{1}{M}\sum_{j=1}^{M} (x'_{ij})^2}.
Step 4. Let x''_{ij} = x'_{ij} / \sigma_i for j = 1, 2, …, M.
It should be obvious to the reader that the variance of x''_{i1}, x''_{i2}, …, x''_{iM} is 1.
Example 2.4
For the data shown in Table 2.1, we shall have \bar{x}_1 = 68.75, \sigma_1 \approx 7.71, \bar{x}_2 = 5.25 and \sigma_2 \approx 1.98.
After normalization with respect to means and variances, the data become
X1 (weight) X2 (year of college education)
-1.135 -0.63
0.162 1.38
0.810 -1.64
-0.486 -0.63
0.421 0.37
0.551 1.38
1.459 -0.63
-1.783 0.37
The reader can see that the influence of the units of measurement has now been
eliminated.
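Steps 1-4 of the normalization procedure can be sketched as a small Python function (the function name is our own); applied to the weight column of Table 2.1, it reproduces the first normalized value -1.135:

```python
def normalize(xs):
    """Steps 1-4: center to mean 0, then scale to variance 1."""
    m = len(xs)
    mean = sum(xs) / m                                   # Step 1
    centered = [x - mean for x in xs]                    # Step 2
    sigma = (sum(c * c for c in centered) / m) ** 0.5    # Step 3
    return [c / sigma for c in centered]                 # Step 4

weights = [60, 70, 75, 65, 72, 73, 80, 55]   # X1 of Table 2.1
z = normalize(weights)
```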
Section 2.3. Covariance and Correlation
Having studied some measurements of one variable, we can now study
measurements of more than one variable.
Consider Fig. 2.1. In Fig. 2.1 (a), X2 increases as X1 does. In Fig. 2.1 (b), X2
decreases as X1 increases and we can detect no such relationships in Fig. 2.1 (c). In
this section, we shall introduce the concepts of covariance and correlation, which are
essentially designed to detect the existence or nonexistence of such relationships.

[Fig. 2.1: three scatter plots (a), (b) and (c), each with X1 on the horizontal axis and X2 on the vertical axis.]
Fig. 2.1
Definition
Let variable X1 assume values x_{11}, …, x_{1M} and variable X2 assume values x_{21}, …, x_{2M}. The covariance between X1 and X2 is defined as

v_{12} = \frac{1}{M} \sum_{j=1}^{M} (x_{1j} - \bar{x}_1)(x_{2j} - \bar{x}_2).

If, instead of X1 and X2, we use the notation of variables X and Y, then we use v_{xy}
to denote the covariance between X and Y. If we have two identical variables, then the
covariance between these variables degenerates into the variance of the individual
variable. That is, v_{xx} = v_x.
Example 2.5
Consider the following two variables.
X1 X2
170 130
165 127
150 121
180 140
173 130
184 144
153 125
The covariance between X1 and X2 is computed as follows :

\bar{x}_1 = \frac{1175}{7} \approx 167.86, \quad \bar{x}_2 = \frac{917}{7} = 131.0,

and v_{12} = \frac{1}{7}\sum_{j=1}^{7}(x_{1j} - \bar{x}_1)(x_{2j} - \bar{x}_2) = \frac{591}{7} \approx 84.43.
Example 2.6
Consider the following set of data :
X1 X2
1.000 0.000
0.707 0.707
0.000 1.000
-0.707 0.707
-1.000 0.000
-0.707 -0.707
0.000 -1.000
0.707 -0.707
The covariance between X1 and X2 is computed as follows :

\bar{x}_1 = \bar{x}_2 = 0, \quad v_{12} = \frac{1}{8}\sum_{j=1}^{8} x_{1j} x_{2j} = \frac{0 + 0.5 + 0 - 0.5 + 0 + 0.5 + 0 - 0.5}{8} = 0.

The reader may have already noted that the above eight points lie on a circle.
That the covariance between X1 and X2 is zero is therefore not surprising.
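The covariance definition translates directly into code. The sketch below (function name ours) reproduces both Example 2.5, where the two variables covary strongly, and Example 2.6, where the eight points on a circle give covariance zero:

```python
def covariance(xs, ys):
    """v_xy = (1/M) * sum over j of (x_j - x_bar)(y_j - y_bar)."""
    m = len(xs)
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / m

# Example 2.5: the two variables covary strongly.
x1 = [170, 165, 150, 180, 173, 184, 153]
x2 = [130, 127, 121, 140, 130, 144, 125]
v12 = covariance(x1, x2)   # 591/7, roughly 84.43

# Example 2.6: eight points on a circle give covariance 0.
cx = [1.0, 0.707, 0.0, -0.707, -1.0, -0.707, 0.0, 0.707]
cy = [0.0, 0.707, 1.0, 0.707, 0.0, -0.707, -1.0, -0.707]
v_circle = covariance(cx, cy)
```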
Again, as we discussed before, the covariance is heavily influenced by the units
of measurement. In Example 2.5, X1 may be body height in centimeters and X2 may be
body weight in pounds. If we measure body weight in tons, the covariance may
approach zero simply because the values of X2 are too small, although the two
variables actually do “covary”. If the influence of units of measurement has to be
eliminated, one may use the definition of correlation.
Definition
Let variables X1 and X2 assume values x_{11}, x_{12}, …, x_{1M} and x_{21}, x_{22}, …, x_{2M}
respectively. Let \bar{x}_1 and \bar{x}_2 be the means of X1 and X2 respectively. Let \sigma_1 and \sigma_2 be
the standard deviations of X1 and X2 respectively. Then, the correlation between X1 and
X2 is defined as

r_{12} = \frac{v_{12}}{\sigma_1 \sigma_2}.
Note that if these two variables have been normalized with respect to variances,
the correlation and covariance between these two variables will be the same.
It should be easy for the reader to prove that the correlation satisfies the
following properties :
(1) r11 = 1
(2) r12 = r21
(3) -1 \le r_{12} \le 1
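The correlation and its three properties can be verified numerically; the sketch below (function name ours) applies the definition r_12 = v_12/(σ_1 σ_2) to the data of Example 2.5:

```python
def correlation(xs, ys):
    """r_xy = v_xy / (sigma_x * sigma_y), with variance divisor M."""
    m = len(xs)
    x_bar, y_bar = sum(xs) / m, sum(ys) / m
    dx = [x - x_bar for x in xs]
    dy = [y - y_bar for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy)) / m
    sx = (sum(a * a for a in dx) / m) ** 0.5
    sy = (sum(b * b for b in dy) / m) ** 0.5
    return cov / (sx * sy)

x1 = [170, 165, 150, 180, 173, 184, 153]
x2 = [130, 127, 121, 140, 130, 144, 125]
r12 = correlation(x1, x2)   # roughly 0.936: a strong positive relationship
```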
Given a set of variables X1, X2, …, XN, it is often desirable to describe the
correlations, or covariances, among these variables in matrix form. We shall use V to
denote the covariance matrix and R to denote the correlation matrix. Each matrix is of
dimension N \times N. For the covariance (correlation) matrix V (R), the (i,j)th entry
V[i,j] (R[i,j]) is the covariance v_{ij} (correlation r_{ij}) between Xi and Xj.
Example 2.7
Consider the data in Table 2.2. The covariance matrix and the correlation matrix
are shown in Table 2.3 and Table 2.4 respectively. Let us consider the correlation
matrix. From this matrix, we know that the change in population (X1) from 1950 to
1960 is highly related to the change in employment (X3) during the same period. On
the other hand, the change in median income (X6) is almost unrelated to all other
variables.
SELECTED POPULATION AND ECONOMIC STATISTICS FOR
SELECTED CITIES

X1 : Change in population, 1950-1960 (percent)
X2 : Employment per household, 1950
X3 : Change in employment, 1950-1960 (percent)
X4 : Median age of the population (years)
X5 : Median income, 1950 (dollars)
X6 : Change in median income, 1950-1960 (percent)

X1 X2 X3 X4 X5 X6
1 98.30 1.36 63.50 27.20 3473.00 120.80
2 26.80 1.20 23.30 23.20 2367.00 74.90
3 40.80 1.38 41.90 27.20 2126.00 140.80
4 22.50 1.28 15.70 28.20 4045.00 71.50
5 95.30 1.26 103.20 28.20 5128.00 60.60
6 80.70 1.32 48.90 27.30 3098.00 83.40
7 33.20 1.19 29.00 23.30 1846.00 63.70
8 -15.90 1.09 -9.10 30.50 1932.00 111.30
9 19.20 1.10 22.80 33.40 2592.00 95.70
10 54.90 1.30 40.40 26.30 2880.00 81.30
11 56.40 1.48 43.70 30.90 3042.00 96.40
12 31.00 1.12 19.30 23.90 1746.00 98.70
13 25.60 1.42 36.20 23.60 885.00 464.30
14 51.10 1.23 59.90 23.30 1816.00 66.70
15 27.80 1.38 17.20 30.10 2830.00 93.80
16 16.30 1.14 16.20 32.70 2232.00 112.90
17 264.20 1.41 245.40 26.10 3398.00 99.90
18 108.20 1.30 110.30 26.70 3307.00 74.30
19 77.40 1.29 41.10 24.30 2225.00 87.30
20 -8.70 1.26 -12.40 41.50 5144.00 166.50
21 49.70 1.28 27.10 23.90 2207.00 97.80
22 16.20 1.30 15.40 24.90 2642.00 88.20
23 63.50 1.33 51.40 28.90 2491.00 115.00
24 79.40 1.39 70.00 25.70 2646.00 111.00
25 63.10 1.26 67.50 24.70 2095.00 80.90
26 30.30 1.22 25.90 31.90 2209.00 96.20
27 8.60 1.16 12.70 21.80 1208.00 98.30
28 188.40 1.34 175.60 26.70 3842.00 84.60
29 2.80 1.24 5.50 27.20 1599.00 128.60
30 28.00 1.24 26.30 27.90 2371.00 94.20
31 20.90 1.20 12.50 25.50 2913.00 89.20
32 11.80 1.13 3.80 33.30 2325.00 96.50
33 -3.10 1.07 0.20 31.80 1725.00 108.90
34 161.30 1.27 157.70 25.20 3990.00 79.80
35 15.90 1.30 4.30 29.00 3386.00 67.10
36 12.90 1.21 11.10 27.80 2530.00 83.80
37 23.70 1.15 28.40 22.60 1598.00 84.30
38 24.00 1.21 15.50 31.50 2494.00 90.60
39 2.20 1.30 -4.20 28.00 2819.00 72.20
40 19.40 1.25 18.20 30.10 2400.00 84.80
41 22.10 1.22 13.10 30.90 2069.00 110.40
42 92.90 1.24 85.50 26.10 3502.00 74.20
43 -4.40 1.33 2.30 35.50 4578.00 125.10
44 -4.00 1.21 -2.40 29.90 2443.00 82.50
45 104.90 1.37 49.10 28.70 2638.00 100.10
46 15.50 1.28 12.90 29.20 2274.00 113.70
47 13.80 1.22 22.20 29.40 1972.00 127.10
48 -14.30 1.29 3.30 33.90 5760.00 48.10
49 6.30 1.15 10.30 22.70 3233.00 48.70
50 49.50 1.29 42.60 25.00 1764.00 209.00
Source : 1960 United States Census.
Table 2.2
1 2 3 4 5 6
1 2753.92 2.22 2483.46 -62.07 10876.57 -271.96
2 2.22 0.00 1.85 -0.01 23.23 1.40
3 2483.46 1.85 2339.34 -56.41 11250.71 -181.65
4 -62.07 -0.01 -56.41 14.60 1696.77 -2.79
5 10876.57 23.23 11250.71 1696.77 967468.75 -19475.46
6 -271.96 1.40 -181.65 -2.79 -19475.46 3385.94
The Covariance Matrix for the Data in Table 2.2
Table 2.3
1 2 3 4 5 6
1.00 0.47 0.97 -0.30 0.21 -0.08
0.47 1.00 0.42 -0.05 0.26 0.27
0.97 0.42 1.00 -0.30 0.23 -0.06
-0.30 -0.05 -0.30 1.00 0.45 -0.01
0.21 0.26 0.23 0.45 1.00 -0.34
-0.08 0.27 -0.06 -0.01 -0.34 1.00
The Correlation Matrix for the Data in Table 2.2
Table 2.4
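The matrices V and R are assembled from the pairwise covariances. The sketch below (function names ours) builds both matrices for the small two-variable data of Table 2.1 rather than the 50-city data, but the same code applies to any number of variables:

```python
def cov(xs, ys):
    m = len(xs)
    xb, yb = sum(xs) / m, sum(ys) / m
    return sum((x - xb) * (y - yb) for x, y in zip(xs, ys)) / m

def matrices(columns):
    """Return the covariance matrix V and correlation matrix R (N x N)."""
    n = len(columns)
    V = [[cov(columns[i], columns[j]) for j in range(n)] for i in range(n)]
    sig = [V[i][i] ** 0.5 for i in range(n)]   # standard deviations
    R = [[V[i][j] / (sig[i] * sig[j]) for j in range(n)] for i in range(n)]
    return V, R

# Table 2.1: weight and years of college education.
data = [[60, 70, 75, 65, 72, 73, 80, 55], [4, 8, 2, 4, 6, 8, 4, 6]]
V, R = matrices(data)
```

The diagonal of V holds the individual variances, and the diagonal of R is all ones, as required by property (1).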
Section 2.4: Linear Regression Analysis
Let us consider the case where we have a black box whose input is current and
whose output is voltage.

[Fig. 2.2: a black box with current as input and voltage as output.]
Fig. 2.2

We can measure the current and voltage. Thus,
for every input observation, we have a corresponding output observation. A typical
case may be as follows:
X (current) Y (voltage)
10 21
11 23
12 23
9 19
8 17
13 25
7 14
14 28
Table 2.5
There are many ways in which the current and voltage can be related. The simplest
model is to assume a linear relationship. That is,

y = b_0 + b_1 x.

Of course, it is quite unlikely that the above model holds exactly. We are interested in
the b_0 and b_1 which “best fit” our observed data. The meaning of best fitting can be
explained by considering Fig. 2.3.
[Fig. 2.3 (a) and (b): the data of Table 2.5 plotted twice, each panel with a different candidate straight line drawn through the points.]
Fig. 2.3 (a) Fig. 2.3 (b)
In both figures, the points do not all lie on the line. We may therefore say that
errors occur if we use these straight lines to approximate our data. As can be seen, the
line in Fig. 2.3(a) is much better than that in Fig. 2.3(b). In the following, we shall
show how the best fitting line can be found.
Let us assume that for x_1, x_2, …, x_M, we observe y_1, y_2, …, y_M. Ideally,

\hat{y}_i = b_0 + b_1 x_i.    (2.1)

The observed value corresponding to x_i is y_i. Therefore, we have an error

e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i).    (2.2)

The total sum of squares of errors is

E = \sum_{i=1}^{M} (y_i - b_0 - b_1 x_i)^2.    (2.3)

We shall choose b_0 and b_1 on the basis that they minimize E. This is
achieved by differentiating E with respect to b_0 and b_1:

\frac{\partial E}{\partial b_0} = -2 \sum_{i=1}^{M} (y_i - b_0 - b_1 x_i)    (2.4)

\frac{\partial E}{\partial b_1} = -2 \sum_{i=1}^{M} x_i (y_i - b_0 - b_1 x_i)    (2.5)

The values b_0 and b_1 are found by solving
\frac{\partial E}{\partial b_0} = 0    (2.6)

and

\frac{\partial E}{\partial b_1} = 0.    (2.7)

We have

M b_0 + b_1 \sum_{i=1}^{M} x_i = \sum_{i=1}^{M} y_i    (2.8)

b_0 \sum_{i=1}^{M} x_i + b_1 \sum_{i=1}^{M} x_i^2 = \sum_{i=1}^{M} x_i y_i.    (2.9)

Dividing the above equations by M, we obtain

b_0 + b_1 \bar{x} = \bar{y}    (2.10)

b_0 \bar{x} + b_1 \frac{1}{M}\sum_{i=1}^{M} x_i^2 = \frac{1}{M}\sum_{i=1}^{M} x_i y_i.    (2.11)

Solving (2.10) and (2.11), we obtain

b_1 = \frac{\sum_{i=1}^{M} x_i y_i - M \bar{x} \bar{y}}{\sum_{i=1}^{M} x_i^2 - M \bar{x}^2}, \quad b_0 = \bar{y} - b_1 \bar{x}.    (2.12)

Equivalently,

b_1 = \frac{v_{xy}}{v_x}    (2.14)

b_0 = \bar{y} - b_1 \bar{x}.    (2.15)
Example 2.8:
For the set of data in Table 2.5, we have

M = 8, \bar{x} = 10.5, \bar{y} = 21.25, \sum_{i=1}^{M} x_i^2 = 924, and \sum_{i=1}^{M} x_i y_i = 1861.

Therefore, we have

Y = 2.60 + 1.80X.
We now calculate the values of Y according to the above formula and
compare them with the observed values.
X Y (observed) Y (according to linear regression analysis)
10 21 20.80
11 23 22.40
12 23 24.20
9 19 18.80
8 17 17.00
13 25 26.00
7 14 15.20
14 28 27.80
[Fig. 2.4: the data of Table 2.5 and the fitted line Y = 2.60 + 1.80X; X-axis from 6.00 to 16.00, Y-axis from 12.00 to 32.00.]
Fig. 2.4
Since we assumed that variable Y measures voltage and variable X measures current,
and our linear regression model indicates that the relationship between Y and X can be
approximated by the equation

Y = 2.60 + 1.80X,

we can realize the system in the black box by the following passive and linear circuit.

[Fig. 2.5: a passive linear circuit whose input is the current X and whose output is the voltage Y.]
Fig. 2.5
Let us assume that we are given a certain value of X, say x = 7.5. Can we guess
what Y should be? It would not be unreasonable to use Y = 2.60 + 1.80X to make an
educated guess. That is,

Y = 2.60 + 1.80 × 7.5 = 16.10.
One can see that the linear regression is indeed an information extraction
method. We were given only a set of data to start with. We have now established a
relationship between X and Y and can predict, with some degree of confidence, the
output associated with some unknown input.
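The prediction step is a one-line computation once the line has been fitted (a sketch using the line obtained in Example 2.8):

```python
def predict(x, b0=2.60, b1=1.80):
    """Predict the output of the fitted line Y = 2.60 + 1.80X."""
    return b0 + b1 * x

y_hat = predict(7.5)   # 2.60 + 1.80 * 7.5 = 16.10
```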
Section 2.5: Function Approximation by Linear Regression Analysis
Linear regression analysis can be applied to “function approximation”. This is
illustrated in the following example:
Example 2.9:
Let us assume that we have y = e^{-x^2/2}. In the interval [0, 1], we may approximate
this function by a straight line. Let us use 11 values of x (0, 0.1, 0.2, …, 0.9, 1.0). For
every x_i, we have a corresponding y_i as in the following table:
xi yi
0.0 1.000
0.1 0.995
0.2 0.980
0.3 0.956
0.4 0.923
0.5 0.882
0.6 0.835
0.7 0.782
0.8 0.726
0.9 0.667
1.0 0.606
Using linear regression analysis we obtain

M = 11, \sum x_i = 5.5, \sum y_i = 9.352, \sum x_i^2 = 3.85, \sum x_i y_i = 4.228,

b_1 \approx -0.406, \quad b_0 \approx 1.053.

We have

y = 1.053 - 0.406x.
The function y = e^{-x^2/2} and the line y = 1.053 - 0.406x are now plotted in Fig. 2.6.

[Fig. 2.6: the curve y = e^{-x^2/2} and the approximating line y = 1.053 - 0.406x on the interval [0, 1].]
Fig. 2.6
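Example 2.9 can be reproduced in a few lines. The sketch below evaluates y = e^{-x²/2} exactly rather than using the three-decimal table entries, so the sums of y and xy differ from those above in the third decimal, but the fitted line is the same:

```python
import math

xs = [i / 10 for i in range(11)]             # 0.0, 0.1, ..., 1.0
ys = [math.exp(-x * x / 2) for x in xs]      # y = e^(-x^2/2)

m = len(xs)
sx, sy = sum(xs), sum(ys)                    # 5.5 and roughly 9.354
sxx = sum(x * x for x in xs)                 # 3.85
sxy = sum(x * y for x, y in zip(xs, ys))     # roughly 4.230

# Eq. (2.12): slope and intercept of the least-squares line.
b1 = (sxy - sx * sy / m) / (sxx - sx * sx / m)
b0 = sy / m - b1 * sx / m                    # roughly (1.054, -0.407)
```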
Section 2.6: The Matrix Approach to Linear Regression
In this section, we shall use some concepts from matrix algebra to solve linear regression
problems. Note that the fundamental equation governing y and x is

y = b_0 + b_1 x.

So far as the data are concerned, we have

y_1 = b_0 + b_1 x_1
y_2 = b_0 + b_1 x_2
⋮
y_M = b_0 + b_1 x_M

The above equations can be expressed as

Y = XB,    (2.16)

where

Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_M \end{bmatrix}, \quad B = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix}.    (2.17)

It can be easily proved that

X^T X = \begin{bmatrix} M & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix},    (2.18)

where X^T is the transpose of X, and

X^T Y = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}.    (2.19)
Comparing (2.8) and (2.9) with (2.18) and (2.19), we obtain

X^T X B = X^T Y.    (2.20)

Our problem thus becomes how to obtain B from (2.20). If the inverse of the matrix
X^T X exists, then we have

(X^T X)^{-1} X^T X B = (X^T X)^{-1} X^T Y

or

B = (X^T X)^{-1} X^T Y.    (2.21)
Example 2.10:
Let us consider the data in Example 2.9. We have

X^T X = \begin{bmatrix} 11 & 5.5 \\ 5.5 & 3.85 \end{bmatrix}, \quad X^T Y = \begin{bmatrix} 9.352 \\ 4.228 \end{bmatrix},

(X^T X)^{-1} = \frac{1}{12.1}\begin{bmatrix} 3.85 & -5.5 \\ -5.5 & 11 \end{bmatrix},

B = (X^T X)^{-1} X^T Y \approx \begin{bmatrix} 1.053 \\ -0.406 \end{bmatrix}.

Therefore

y = 1.053 - 0.406x.
Section 2.7: Multiple Linear Regression
So far we have limited ourselves to the discussion of one independent variable.
In reality, we may have a situation where we have one dependent variable and several
input variables, as shown in Fig. 2.7.

[Fig. 2.7: a black box with inputs x1, x2, x3, …, xN and output Y.]
Fig. 2.7
A very simple model would be

y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_N x_N.    (2.22)

Let us assume that for every independent variable set (x_{1i}, x_{2i}, …, x_{Ni}), we have
a corresponding idealized dependent variable

\hat{y}_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_N x_{Ni}.    (2.23)

The total sum of squares of errors is

E = \sum_{i=1}^{M} (y_i - \hat{y}_i)^2.    (2.24)

We have
\frac{\partial E}{\partial b_k} = -2 \sum_{i=1}^{M} x_{ki} (y_i - b_0 - b_1 x_{1i} - \cdots - b_N x_{Ni}) = 0, \quad k = 0, 1, …, N,    (2.25)

where x_{0i} = 1 for all i. As we showed in the one-variable case, we may let

Y = \begin{bmatrix} y_1 \\ \vdots \\ y_M \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_{11} & x_{21} & \cdots & x_{N1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1M} & x_{2M} & \cdots & x_{NM} \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_N \end{bmatrix}

and express the equations in (2.25) in matrix form:

X^T X B = X^T Y.

To obtain B, we have

(X^T X)^{-1} X^T X B = (X^T X)^{-1} X^T Y

or

B = (X^T X)^{-1} X^T Y.    (2.26)
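Formula (2.26) extends to any number of independent variables. The sketch below (function names ours) forms the normal equations X^T X B = X^T Y and solves them with plain-Python Gaussian elimination; the data are generated exactly by y = 1 + 2x_1 + 3x_2, so the fit recovers those coefficients:

```python
def solve(A, b):
    """Solve A v = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    v = [0.0] * n
    for r in range(n - 1, -1, -1):
        v[r] = (M[r][n] - sum(M[r][c] * v[c] for c in range(r + 1, n))) / M[r][r]
    return v

def fit(rows, ys):
    """Least-squares B from (X^T X) B = X^T Y, Eq. (2.26)."""
    X = [[1.0] + list(r) for r in rows]      # prepend the constant column
    n = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(len(X))) for b in range(n)]
           for a in range(n)]
    XtY = [sum(X[i][a] * ys[i] for i in range(len(X))) for a in range(n)]
    return solve(XtX, XtY)

rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3), (4, 1)]
ys = [1 + 2 * a + 3 * b for a, b in rows]    # an exact plane, no noise
B = fit(rows, ys)                            # close to [1.0, 2.0, 3.0]
```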
Example 2.11: (Socio-economic Data)
In this example, we shall use some economic data obtained from [Burns and
Harman 1966], as shown in Table 2.6. We used the median value of houses as the
dependent variable and the other two variables (median school years and misc.
professional services) as independent variables. The linear regression equation
obtained is of the form

Y = b_0 + b_1 X_1 + b_2 X_2.    (2.27)

In Table 2.7, we list the value of the house obtained through the use of (2.27), the
actual value and the percentage of error. The reader can see that the average error is
found to be 14%.
Individual (Tract No.)   X1: Median School Years (unit = 10 years)   X2: Misc. Professional Services   Y: Median Value of House (unit = $1000)
1 1.28 27.0 25.0
2 1.09 1.0 10.0
3 0.87 1.0 9.0
4 1.35 14.0 25.0
5 1.27 14.0 25.0
6 0.83 6.0 12.0
7 1.14 1.0 16.0
8 1.15 6.0 14.0
9 1.25 18.0 18.0
10 1.37 39.0 25.0
11 0.96 8.0 12.0
12 1.14 10.0 13.0
Table 2.6
Individual Tract Y (Actual Value) Y (Estimated) Percentage Error
1 25.0 22.9 0.09
2 10.0 13.7 0.27
3 9.0 9.0 0.00
4 25.0 22.2 0.13
5 25.0 20.4 0.22
6 12.0 8.8 0.36
7 16.0 14.78 0.08
8 14.0 15.96 0.12
9 18.0 20.48 0.12
10 25.0 27.19 0.08
11 12.0 12.12 0.01
12 13.0 16.50 0.21
Average error = 0.14
Table 2.7
Example 2.12: A Set of Artificial Data
In this example, we used the following formula to generate data:

y = a_1 x_1 + a_2 x_2 + a_3 x_3 + \varepsilon,    (2.28)

where \varepsilon denotes noise introduced by a random number generator.
The entire data set is shown in Table 2.8.
I x1 x2 x3 y
1 25.0 34.0 200.0 135.638
2 51.0 70.0 36.0 -208.327
3 85.0 100.0 35.0 -255.589
4 54.0 720.0 51.0 -4983.289
5 45.0 78.0 5.0 339.020
6 70.0 88.0 654.0 559.850
7 22.0 11.0 428.0 586.920
8 1.0 5.0 7.0 -25.658
9 -24.0 -51.0 -750.0 -725.615
10 51.0 -11.0 6.0 351.608
Table 2.8
We then used formula (2.26) to obtain an optimized linear equation describing
the relationship among y, x1, x2 and x3. The equation found was of the form

y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3,    (2.29)

with coefficients very close to those in Eq. (2.28).
This example shows that if the data are governed by a linear equation, linear
regression analysis will correctly reveal this fact.