Modelling and Forecast
CHAPTER 6 : CORRELATION - REGRESSION

6.1 Introduction

So far we have considered only univariate distributions. The averages, dispersion and skewness of a distribution give us a complete idea about its structure. Many a time, however, we come across problems which involve two or more variables. If we carefully study the figures of rainfall and production of paddy, of accidents and motor cars in a city, of demand and supply of a commodity, or of sales and profit, we may find that there is some relationship between the two variables. On the other hand, if we compare the figures of rainfall in America and the production of cars in Japan, we may find that there is no relationship between the two variables. If there is a relation between two variables, i.e. when one variable changes the other also changes in the same or in the opposite direction, we say that the two variables are correlated.

W. J. King: "If it is proved that in a large number of instances two variables tend always to fluctuate in the same or in the opposite direction, then it is established that a relationship exists between the variables. This is called a correlation."

6.2 Correlation

Correlation means the study of the existence, magnitude and direction of the relation between two or more variables. Correlation is very important in technology and in statistics. The famous astronomer Bravais, Prof. Sir Francis Galton, Karl Pearson (who used this concept in biology and in genetics), Prof. Neiswanger and many others have contributed to this subject.

6.3 Types of Correlation

1. Positive and negative correlation
2. Linear and non-linear correlation

A) If two variables change in the same direction (i.e. if one increases the other also increases, or if one decreases the other also decreases), the correlation is called a positive correlation. For example: advertising and sales.

B) If two variables change in the opposite direction (i.e. if one increases, the other decreases and vice versa), the correlation is called a negative correlation. For example: T.V. registrations and cinema attendance.

The nature of the graph gives us the idea of the linear type of correlation between two variables. If the graph is a straight line, the correlation is called a linear correlation; if the graph is not a straight line, the correlation is non-linear or curvi-linear.



For example, if variable x changes by a constant quantity, say 20, then y also changes by a constant quantity, say 4. The ratio between the two always remains the same (1/5 in this case). In case of a curvi-linear correlation this ratio does not remain constant.
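As a quick illustration of this constant-ratio property, here is a minimal Python sketch; the linear series uses the numbers quoted above, while the curvilinear series (y = x²/100) is a hypothetical example of my own.

    # Linear relation: y changes by the same amount (4) whenever x changes by 20,
    # so the ratio 4/20 = 1/5 is constant.  For a curvi-linear relation it is not.
    xs = [20, 40, 60, 80, 100]
    y_linear = [x / 5 for x in xs]            # 4, 8, 12, 16, 20
    y_curved = [x ** 2 / 100 for x in xs]     # hypothetical curvi-linear relation

    print([b - a for a, b in zip(y_linear, y_linear[1:])])   # [4.0, 4.0, 4.0, 4.0]
    print([b - a for a, b in zip(y_curved, y_curved[1:])])   # [12.0, 20.0, 28.0, 36.0]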

6.4 Degrees of Correlation

Through the coefficient of correlation, we can measure the degree or extent of the correlation between two variables. On the basis of the coefficient of correlation we can also determine whether the correlation is positive or negative, and also its degree or extent.

1. Perfect correlation: If two variables change in the same direction and in the same proportion, the correlation between the two is perfect positive. According to Karl Pearson the coefficient of correlation in this case is +1. On the other hand, if the variables change in the opposite direction and in the same proportion, the correlation is perfect negative; its coefficient of correlation is -1. In practice we rarely come across these types of correlation.

2. Absence of correlation: If two series of two variables exhibit no relation between them, or a change in one variable does not lead to a change in the other variable, then we can firmly say that there is no correlation, or absurd correlation, between the two variables. In such a case the coefficient of correlation is 0.

3. Limited degrees of correlation: If two variables are not perfectly correlated, nor is there a perfect absence of correlation, then we term the correlation as limited correlation. It may be positive, negative or zero, but it lies within the limits ±1.

High degree, moderate degree and low degree are the three categories of this kind of correlation. The following table shows the degrees of the coefficient of correlation.

Degrees                    Positive            Negative

Absence of correlation     0                   0
Perfect correlation        +1                  -1
High degree                +0.75 to +1         -0.75 to -1
Moderate degree            +0.25 to +0.75      -0.25 to -0.75
Low degree                 0 to +0.25          0 to -0.25

6.5 Methods Of Determining Correlation

We shall consider the following most commonly used methods: (1) scatter plot, (2) Karl Pearson's coefficient of correlation, (3) Spearman's rank correlation coefficient.

1) Scatter Plot (scatter diagram or dot diagram): In this method the values of the two variables are plotted on graph paper. One is taken along the horizontal axis (x-axis) and the other along the vertical axis (y-axis). By plotting the data, we get points (dots) on the graph which are generally scattered and hence the name 'scatter plot'.

The manner in which these points are scattered suggests the degree and the direction of correlation. The degree of correlation is denoted by 'r' and its direction is given by the sign, positive or negative.

i) If all the points lie on a rising straight line, the correlation is perfectly positive and r = +1 (see fig. 1).

ii) If all the points lie on a falling straight line, the correlation is perfectly negative and r = -1 (see fig. 2).

iii) If the points lie in a narrow strip rising upwards, the correlation is positive and of a high degree (see fig. 3).

iv) If the points lie in a narrow strip falling downwards, the correlation is negative and of a high degree (see fig. 4).

v) If the points are spread widely over a broad strip rising upwards, the correlation is positive but of a low degree (see fig. 5).

vi) If the points are spread widely over a broad strip falling downwards, the correlation is negative and of a low degree (see fig. 6).

vii) If the points are spread (scattered) without any specific pattern, the correlation is absent, i.e. r = 0 (see fig. 7).

Though this method is simple and gives a rough idea about the existence and the degree of correlation, it is not reliable. As it is not a mathematical method, it cannot measure the degree of correlation exactly.
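For readers who want to try the method, a small Python sketch of a scatter plot is given below. It assumes matplotlib is available and uses the father/son heights from the worked example later in this chapter; the interpretation of the picture still has to be done by eye, as described above.

    import matplotlib.pyplot as plt

    # Father/son heights (cm) from the worked example further below
    x = [165, 166, 167, 168, 167, 169, 170, 172]
    y = [167, 168, 165, 172, 168, 172, 169, 171]

    plt.scatter(x, y)
    plt.xlabel("Height of father (cm)")
    plt.ylabel("Height of son (cm)")
    plt.title("Scatter plot: points in a rising strip suggest positive correlation")
    plt.show()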


2) Karl Pearson's coefficient of correlation: It gives a numerical expression for the measure of correlation. It is denoted by 'r'. The value of 'r' gives the magnitude of the correlation and its sign denotes the direction. It is defined as

r = Σ(xi - x̄)(yi - ȳ) / (N σx σy)

where

N = number of pairs of observations, and σx, σy are the standard deviations of x and y.

Note: r is also known as the product-moment coefficient of correlation.

OR r = Σ(xi - x̄)(yi - ȳ) / √[ Σ(xi - x̄)² Σ(yi - ȳ)² ]

OR r = cov(x, y) / (σx σy)

Now the covariance of x and y is defined as

cov(x, y) = (1/N) Σ(xi - x̄)(yi - ȳ).

Example Calculate the coefficient of correlation between the heights of father and son for the following data.

Height of father (cm):   165  166  167  168  167  169  170  172

Height of son (cm):      167  168  165  172  168  172  169  171

Solution: n = 8 ( pairs of observations )


Height of father xi   Height of son yi   x = xi - x̄   y = yi - ȳ   xy        x²        y²
165                   167                -3           -2           6         9         4
166                   168                -2           -1           2         4         1
167                   165                -1           -4           4         1         16
167                   168                -1           -1           1         1         1
168                   172                0            3            0         0         9
169                   172                1            3            3         1         9
170                   169                2            0            0         4         0
172                   171                4            2            8         16        4
Σxi = 1344            Σyi = 1352         0            0            Σxy = 24  Σx² = 36  Σy² = 44

Calculation:

x̄ = Σxi / n = 1344 / 8 = 168 and ȳ = Σyi / n = 1352 / 8 = 169.

Now,

r = Σxy / √(Σx² · Σy²) = 24 / √(36 × 44) = 24 / 39.8 = 0.6 (approximately).

Since r is positive and about 0.6, the correlation is positive and moderate (i.e. direct and reasonably good).
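The arithmetic above can be checked with a few lines of plain Python (nothing is assumed beyond the data in the table):

    from math import sqrt

    father = [165, 166, 167, 168, 167, 169, 170, 172]
    son    = [167, 168, 165, 172, 168, 172, 169, 171]

    n = len(father)
    mean_x, mean_y = sum(father) / n, sum(son) / n        # 168 and 169

    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(father, son))   # 24
    s_xx = sum((x - mean_x) ** 2 for x in father)                          # 36
    s_yy = sum((y - mean_y) ** 2 for y in son)                             # 44

    r = s_xy / sqrt(s_xx * s_yy)
    print(round(r, 3))   # 0.603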

Example From the following data compute the coefficient of correlation between x and y.

Example If the covariance between x and y is 12.3 and the variances of x and y are 16.4 and 13.8 respectively, find the coefficient of correlation between them.


Solution: Given - covariance = cov(x, y) = 12.3

Variance of x (σx²) = 16.4

Variance of y (σy²) = 13.8

Now,

r = cov(x, y) / (σx σy) = 12.3 / √(16.4 × 13.8) = 12.3 / 15.04 = 0.818 (approximately).

Example Find the number of pairs of observations from the following data:

r = 0.25, Σ(xi - x̄)(yi - ȳ) = 60, σy = 4, Σ(xi - x̄)² = 90.

Solution: Given r = 0.25. Since Σ(yi - ȳ)² = n σy² = 16 n,

r = Σ(xi - x̄)(yi - ȳ) / √[ Σ(xi - x̄)² Σ(yi - ȳ)² ]

0.25 = 60 / √(90 × 16 n)

√(1440 n) = 240, so 1440 n = 57600 and n = 40. Hence there are 40 pairs of observations.

If the values of x and y are very big, the calculation becomes very tedious. If we change the variable x to u = (x - x0)/c and y to v = (y - y0)/d, where x0 and y0 are the assumed means of x and y respectively and c and d are common factors, then rxy = ruv.

The formula for r can then be simplified as

r = [ n Σuv - (Σu)(Σv) ] / { √[ n Σu² - (Σu)² ] · √[ n Σv² - (Σv)² ] }
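A short sketch of why this works in practice: the coefficient computed from the coded values u and v is identical to the one computed from x and y. The marks below are hypothetical (the table for the example that follows is not reproduced here); only x0 = 60, c = 4, y0 = 60 and d = 3 are taken from that example.

    from math import sqrt

    def pearson_r(xs, ys):
        # raw-sum form of Karl Pearson's coefficient
        n = len(xs)
        sx, sy = sum(xs), sum(ys)
        sxy = sum(x * y for x, y in zip(xs, ys))
        sxx = sum(x * x for x in xs)
        syy = sum(y * y for y in ys)
        return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

    x = [52, 60, 64, 56, 68, 72, 48, 76, 64, 60]   # hypothetical marks (FRED)
    y = [54, 57, 63, 60, 66, 69, 51, 72, 63, 57]   # hypothetical marks (TED)

    u = [(xi - 60) / 4 for xi in x]                # coded values
    v = [(yi - 60) / 3 for yi in y]

    print(round(pearson_r(x, y), 6) == round(pearson_r(u, v), 6))   # True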

Example Marks obtained by two brothers FRED and TED in 10 tests are as follows:

Find the coefficient of correlation between the two.

Solution: Here x0 = 60, c = 4, y0 = 60 and d = 3.


6.6 Coefficient Of Correlation For Bivariate Grouped Data

When the number of observations is very large, we need to arrange the data into different classes, which are either discrete or continuous. Items having values falling in a particular class are placed together, and those having values falling in another class are placed together. Due to this, the whole data is divided into horizontal rows and vertical columns, with one variable placed horizontally and the other placed vertically. The table so obtained is a two-way frequency distribution table and is called the correlation table or bivariate frequency distribution table. The formula for calculating r for a bivariate distribution is given by

r = [ n Σfuv - (Σfu)(Σfv) ] / { √[ n Σfu² - (Σfu)² ] · √[ n Σfv² - (Σfv)² ] }

STEPS:

1. First write down the mid-points of x along a horizontal row and those of y along a vertical column.
2. Find u = (x - x0)/c for each mid-point of x and v = (y - y0)/d for each mid-point of y.
3. Multiply each frequency f by the corresponding value of u and then by the corresponding value of v to get fuv. Write these numbers in the same cell, at the top.
4. Add the frequencies horizontally and write down the total. Similarly add the frequencies vertically and write down the total.
5. Multiply these totals of x by u to get fu.
6. Multiply these totals of y by v to get fv.
7. Multiply these frequencies by the square of u to get fu².
8. Multiply these frequencies by the square of v to get fv².
9. Add horizontally (or vertically) the top numbers denoting fuv written in each box (or cell) to get Σfuv.
10. Write down Σfu, Σfu², Σfv, Σfv² and Σfuv and then use the above formula.

Example Calculate the coefficient of correlation for the following data.

Age(years)ofHusband

Age (years) of wife Total

10 -20 20 -30 30 -40 40 -50 50 -60

10 - 25 5

3

3 11 7 3

6

8


25 - 35

35 - 45

45 - 55

55 - 65

15

11

14

7

12

3

29

32

22

9

Total 8 29 32 22 9 100

Inserting Σfuv = 94, n = 100, Σfu = -5, Σfv = -5, Σfu² = 119 and Σfv² = 119 in the above formula,

r = [ 100(94) - (-5)(-5) ] / { √[ 100(119) - (-5)² ] · √[ 100(119) - (-5)² ] } = 9375 / 11875 = 0.79 (approximately).
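The final substitution can be verified with a short Python calculation; only the summary sums above are used, the cell-by-cell table is not re-entered.

    from math import sqrt

    n = 100
    sum_fuv, sum_fu, sum_fv = 94, -5, -5
    sum_fu2, sum_fv2 = 119, 119

    r = (n * sum_fuv - sum_fu * sum_fv) / sqrt(
        (n * sum_fu2 - sum_fu ** 2) * (n * sum_fv2 - sum_fv ** 2))
    print(round(r, 2))   # 0.79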

6.7 Probable Error

It is used to help in the interpretation of Karl Pearson's coefficient of correlation 'r' and in judging its reliability; note, however, that 'r' depends on the random sampling and its conditions. It is given by


P. E. = 0.6745 × (1 - r²) / √N

i. If the value of r is less than the P. E., then there is no evidence of correlation, i.e. r is not significant.

ii. If r is more than 6 times the P. E., 'r' is practically certain, i.e. significant.

iii. By adding and subtracting the P. E. to and from 'r', we get the upper and lower limits within which 'r' of the population can be expected to lie.

Symbolically, ρ = r ± P. E.,

where ρ = correlation coefficient of the population.

Example If r = 0.6 and n = 64 find out the probable error of the coefficient of correlation.

Solution: P. E. = 0.6745 × (1 - r²) / √n

= 0.6745 × (1 - 0.36) / √64

= 0.6745 × 0.64 / 8

= 0.054
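A small helper makes this calculation reusable (a sketch in plain Python; the function name is my own):

    from math import sqrt

    def probable_error(r, n):
        # P.E. = 0.6745 (1 - r^2) / sqrt(n)
        return 0.6745 * (1 - r ** 2) / sqrt(n)

    pe = probable_error(0.6, 64)
    print(round(pe, 3))                  # 0.054
    print(0.6 - pe, "to", 0.6 + pe)      # limits for the population correlation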

6.8 Spearman’s Rank Correlation Coefficient

This method is based on the ranks of the items rather than on their actual values. The advantage of this method over the others is that it can be used even when the actual values of the items are unknown. For example, if you want to know the correlation between honesty and wisdom of the boys of your class, you can use this method by giving ranks to the boys. It can also be used to find the degree of agreement between the judgements of two examiners or two judges. The formula is:


R = 1 - [ 6 ΣD² / ( N(N² - 1) ) ]

where R = rank correlation coefficient,

D = difference between the ranks of the two items,

N = the number of observations.

Note: -1 ≤ R ≤ +1.

i) When R = +1: perfect positive correlation or complete agreement in the same direction.

ii) When R = -1: perfect negative correlation or complete agreement in the opposite direction.

iii) When R = 0: no correlation.

Computation:

i. Give ranks to the values of the items. Generally the item with the highest value is ranked 1 and then the others are given ranks 2, 3, 4, ... according to their values in decreasing order.

ii. Find the difference D = R1 - R2, where R1 = rank of x and R2 = rank of y. Note that ΣD = 0 (always).

iii. Calculate D² and then find ΣD².

iv. Apply the formula.

Note:

In some cases there is a tie between two or more items. In such a case each item is given the average of the ranks it covers. For example, if two items would have occupied ranks 4 and 5, each is given the rank (4 + 5)/2 = 4.5. If three items are of equal value at rank 4, each is given the rank (4 + 5 + 6)/3 = 5. If m is the number of items of equal rank, the factor (m³ - m)/12 is added to ΣD². If there is more than one such case, this factor is added as many times as the number of such cases, and then

R = 1 - 6 [ ΣD² + (m³ - m)/12 + ... ] / ( N(N² - 1) ).

Example Calculate ‘ R ’ from the following data.

Student No.:     1   2   3   4   5   6   7   8   9   10

Rank in Maths:   1   3   7   5   4   6   2   10  9   8

Rank in Stats:   3   1   4   5   6   9   7   8   10  2

Solution :

Student No.   Rank in Maths (R1)   Rank in Stats (R2)   D = R1 - R2   D²
1             1                    3                    -2            4
2             3                    1                    2             4
3             7                    4                    3             9
4             5                    5                    0             0
5             4                    6                    -2            4
6             6                    9                    -3            9
7             2                    7                    -5            25
8             10                   8                    2             4
9             9                    10                   -1            1
10            8                    2                    6             36
N = 10                                                  ΣD = 0        ΣD² = 96


Calculation of R:

R = 1 - 6 ΣD² / ( N(N² - 1) ) = 1 - 6(96) / (10 × 99) = 1 - 576/990 = 1 - 0.58 = 0.42.
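The same figure comes out of a direct computation in plain Python, using the ranks from the table above:

    maths = [1, 3, 7, 5, 4, 6, 2, 10, 9, 8]
    stats = [3, 1, 4, 5, 6, 9, 7, 8, 10, 2]

    n = len(maths)
    sum_d2 = sum((r1 - r2) ** 2 for r1, r2 in zip(maths, stats))   # 96
    R = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
    print(round(R, 2))   # 0.42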

Example Calculate ‘ R ’ of 6 students from the following data.

Marks in Stats:    40  42  45  35  36  39

Marks in English:  46  43  44  39  40  43

Solution:

Marks in Stats   R1   Marks in English   R2    R1 - R2   (R1 - R2)² = D²
40               3    46                 1     2         4
42               2    43                 3.5   -1.5      2.25
45               1    44                 2     -1        1
35               6    39                 6     0         0
36               5    40                 5     0         0
39               4    43                 3.5   0.5       0.25
N = 6                                          ΣD = 0    ΣD² = 7.50

Here m = 2, since in the series of marks in English the value 43 is repeated twice. Therefore

R = 1 - 6 [ ΣD² + (m³ - m)/12 ] / ( N(N² - 1) ) = 1 - 6 [ 7.50 + 0.5 ] / (6 × 35) = 1 - 48/210 = 0.77 (approximately).
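A sketch of the same tie-corrected calculation in Python follows; the small ranking helper (which gives rank 1 to the highest mark and average ranks to ties) is my own and not part of the text.

    stats_marks   = [40, 42, 45, 35, 36, 39]
    english_marks = [46, 43, 44, 39, 40, 43]

    def average_ranks(values):
        # rank 1 for the highest value; tied values share the average of their ranks
        order = sorted(values, reverse=True)
        return [(2 * order.index(v) + order.count(v) + 1) / 2 for v in values]

    r1 = average_ranks(stats_marks)      # 3, 2, 1, 6, 5, 4
    r2 = average_ranks(english_marks)    # 1, 3.5, 2, 6, 5, 3.5

    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))   # 7.5
    m = 2                                                # the mark 43 occurs twice
    correction = (m ** 3 - m) / 12                       # 0.5
    R = 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))
    print(round(R, 2))   # 0.77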


6.9 Linear Regression

Correlation gives us an idea of the magnitude and direction of the relation between correlated variables. It is then natural to think of a method that helps us estimate the value of one variable when the other is known. Also, correlation does not imply causation. The fact that the variables x and y are correlated does not necessarily mean that x causes y or vice versa. For example, you may find that the number of schools in a town is correlated with the number of accidents in the town. The reason for the accidents is not school attendance; both increase with what is known as the population. A statistical procedure called regression is concerned with describing such relationships among variables. It assesses the contribution of one or more variables, called independent variables, to the variable which is being explained (the dependent variable). When there is only one independent variable, the relationship is expressed by a straight line. This procedure is called simple linear regression.

Regression can be defined as a method that estimates the value of one variable when that of the other variable is known, provided the variables are correlated. The dictionary meaning of regression is "to go backward." It was used for the first time by Sir Francis Galton in his research paper "Regression towards mediocrity in hereditary stature."

Lines of Regression: In a scatter plot, we have seen that if the variables are highly correlated then the points (dots) lie in a narrow strip. If the strip is nearly straight, we can draw a straight line such that all points are close to it from both sides. Such a line can be taken as an ideal representation of the variation. This line is called the line of best fit if it minimizes the distances of all data points from it.

This line is also called the line of regression. Prediction is now easy because all we need to do is to extend the line and read off the value. Thus to obtain a line of regression, we need to have a line of best fit. But statisticians do not measure the distances by dropping perpendiculars from the points onto the line. They measure deviations (or errors, or residuals as they are called) (i) vertically and (ii) horizontally. Thus we get two lines of regression, as shown in figures (1) and (2).

(1) Line of regression of y on x

Its form is y = a + b x

It is used to estimate y when x is given

(2) Line of regression of x on y

Its form is x = a + b y

It is used to estimate x when y is given.


They are obtained (i) graphically, by the scatter plot, or (ii) mathematically, by the method of least squares.

ii. The method of least squares: Let y = a + b x ..... (1), where a and b are given by the normal equations

Σy = n a + b Σx ..... (2)

Σxy = a Σx + b Σx² ..... (3)

where 'n' is the number of pairs of values of x and y. Solving equations (2) and (3) for a and b and substituting in (1), the line of regression of y on x can be written as

y - ȳ = byx (x - x̄)

where

byx = [ n Σxy - (Σx)(Σy) ] / [ n Σx² - (Σx)² ] = r σy / σx

is called the coefficient of regression of y on x, which is obviously the slope of this line. Interchanging x and y, the equation of the line of regression of x on y is given by

x - x̄ = bxy (y - ȳ)

Naturally bxy is the slope of this line, and it is equal to

bxy = [ n Σxy - (Σx)(Σy) ] / [ n Σy² - (Σy)² ] = r σx / σy.

Note that byx · bxy = r².
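As a sketch of how these formulas behave on real numbers, the block below computes byx, bxy and r for the father/son heights used earlier and checks that byx · bxy = r².

    from math import sqrt

    x = [165, 166, 167, 168, 167, 169, 170, 172]   # heights of fathers (cm)
    y = [167, 168, 165, 172, 168, 172, 169, 171]   # heights of sons (cm)

    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)

    b_yx = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # regression coefficient of y on x
    b_xy = (n * sxy - sx * sy) / (n * syy - sy ** 2)   # regression coefficient of x on y
    r    = (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

    print(round(b_yx, 3), round(b_xy, 3))              # 0.667 0.545
    print(round(b_yx * b_xy, 4), round(r ** 2, 4))     # both 0.3636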

Example A panel of two judges A and B graded a dramatic performance by independently awarding marks as follows:

Solution:


The equation of the line of regression of y on x is y - ȳ = byx (x - x̄), which here gives y - 33 = 0.74 (x - 33).

Inserting x = 38, we get

y - 33 = 0.74 ( 38 - 33 )

y - 33 = 0.74 × 5

y - 33 = 3.7

y = 3.7 + 33

y = 36.7 = 37 ( approximately )

Therefore, Judge B would have given 37 marks to the 8th performance.

Example The two regression equations of the variables x and y are

x = 19.13 - 0.87 y and y = 11.64 - 0.50 x

Find (1) Mean of x’s

(2) Mean of y’s

(3) Correlation coefficient between x and y

Solution:

1. Calculation of the means: Since both regression lines pass through the point (x̄, ȳ), the means satisfy

x̄ = 19.13 - 0.87 ȳ and ȳ = 11.64 - 0.50 x̄.

Solving these simultaneously, ȳ = 3.67 and x̄ = 19.13 - 0.87(3.67) = 15.94.

∴ Mean of x's = 15.94 and Mean of y's = 3.67.


2. Calculation of 'r':

The line x = 19.13 - 0.87 y is the line of regression of x on y. Therefore

bxy = -0.87 ..... (3)

and y = 11.64 - 0.50 x is the line of regression of y on x. Therefore

byx = -0.50 ..... (4)

From (3) and (4), r² = bxy · byx = (-0.87)(-0.50) = 0.435, so r = ± 0.66.

But both regression coefficients are negative, therefore

r = -0.66.
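The whole example can be reproduced in a few lines of Python. This is only a sketch; note that the exact solution of the simultaneous equations gives a mean of x of about 15.93, which the text reports as 15.94 after first rounding ȳ to 3.67.

    from math import sqrt

    # Means: both regression lines pass through (mean_x, mean_y), i.e.
    # mean_x + 0.87 * mean_y = 19.13  and  0.50 * mean_x + mean_y = 11.64.
    det = 1.0 * 1.0 - 0.50 * 0.87
    mean_x = (19.13 * 1.0 - 11.64 * 0.87) / det
    mean_y = (1.0 * 11.64 - 0.50 * 19.13) / det
    print(round(mean_x, 2), round(mean_y, 2))   # 15.93 3.67

    # r: b_xy = -0.87, b_yx = -0.50, and r carries the common (negative) sign.
    r = -sqrt(0.87 * 0.50)
    print(round(r, 2))                          # -0.66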

Example In a partially destroyed laboratory record of an analysis of correlation data, the following results are legible:

Variance of x = 9.

Regression equations: 8 x - 10 y + 66 = 0

40 x - 18 y = 214.

What are (1) the means of x and y, (2) the coefficient of correlation between x and y, and (3) the standard deviation of y?

Solution:

1. Means:

8 x - 10 y = -66 ----- (1)

40 x - 18 y = 214 ----- (2)

Solving (1) and (2) as

40 x - 50 y = -330 ----- (1)

40 x - 18 y = 214 ----- (2)

-32 y = -544

∴ ȳ = 544/32 = 17, and from (1), x̄ = (10 ȳ - 66)/8 = (170 - 66)/8 = 13.

∴ Mean of x's = 13 and Mean of y's = 17.

2. Coefficient of correlation:

Take 8 x - 10 y + 66 = 0 as the line of regression of y on x; then y = 0.8 x + 6.6, so byx = 0.8.

Take 40 x - 18 y = 214 as the line of regression of x on y; then x = 0.45 y + 5.35, so bxy = 0.45.

(This is the only assignment of the two lines for which byx · bxy ≤ 1.)

r² = byx · bxy = 0.8 × 0.45 = 0.36, so r = ± 0.6. Since both regression coefficients are positive, r = 0.6.

3. Standard deviation of y:

Variance of x = σx² = 9, so σx = 3. Since byx = r σy / σx,

0.8 = 0.6 × σy / 3, so σy = 4.

Example From 10 observations of price x and supply y of a commodity the following results were obtained: Σx = 130, Σy = 220, Σx² = 2288, Σxy = 3467.

Compute the line of regression of y on x and interpret the result. Estimate the supply when the price is 16 units.

Solution: The equation of the line of regression of y on x

y = a + b x

Also, from the normal equations

Σy = n a + b Σx and Σxy = a Σx + b Σx², we get

220 = 10 a + 130 b ..... (1)

3467 = 130 a + 2288 b ..... (2)

Solving (1) and (2): multiplying (1) by 13,

2860 = 130 a + 1690 b

3467 = 130 a + 2288 b

On subtraction, 607 = 598 b, so b = 607/598 = 1.015.

Putting b = 1.015 in 220 = 10 a + 130 b, we get a = 8.80.

Hence the equation of the line of regression of y on x is

y = 8.80 + 1.015 x.

The positive regression coefficient shows that supply rises with price, by about 1.015 units for every unit increase in price. When x = 16, we get

y = 8.80 + 1.015 (16) = 25.04,

so the estimated supply at a price of 16 units is about 25 units.
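The same fit can be obtained directly from the summary sums with the closed-form least-squares expressions (a short check of the arithmetic above):

    n = 10
    sum_x, sum_y, sum_x2, sum_xy = 130, 220, 2288, 3467

    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # slope b_yx
    a = (sum_y - b * sum_x) / n                                    # intercept
    print(round(a, 2), round(b, 3))    # 8.8 1.015
    print(round(a + b * 16, 1))        # estimated supply at price 16: about 25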

Example If θ is the acute angle between the two regression lines in the case of two variables x and y, show that

tan θ = [ (1 - r²) / r ] · [ σx σy / (σx² + σy²) ],

where the symbols have their usual meanings. Explain the significance when r = 0 and r = 1.

Solution: The line of regression of y on x has slope m1 = byx = r σy / σx, and the line of regression of x on y, written with y in terms of x, has slope m2 = 1/bxy = σy / (r σx). Hence

tan θ = | (m2 - m1) / (1 + m1 m2) | = | σy/(r σx) - r σy/σx | / ( 1 + σy²/σx² ) = [ (1 - r²) / r ] · [ σx σy / (σx² + σy²) ].

If r = 0, then tan θ = ∞, i.e. θ = π/2: the two regression lines are perpendicular and there is no relationship between the two variables, i.e. they are independent or uncorrelated.

If r = ±1, then tan θ = 0, i.e. θ = 0: the two regression lines are coincident and the correlation is perfect.
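A quick numerical sanity check of the identity is sketched below, using the father/son heights from earlier in the chapter; any bivariate data with 0 < r < 1 would do.

    from math import isclose, sqrt

    x = [165, 166, 167, 168, 167, 169, 170, 172]
    y = [167, 168, 165, 172, 168, 172, 169, 171]

    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    cov   = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

    r = cov / sqrt(var_x * var_y)
    sx, sy = sqrt(var_x), sqrt(var_y)

    m1 = r * sy / sx                 # slope of the regression line of y on x
    m2 = sy / (r * sx)               # slope of the regression line of x on y
    tan_theta = abs((m2 - m1) / (1 + m1 * m2))
    rhs = (1 - r ** 2) / r * (sx * sy) / (var_x + var_y)
    print(isclose(tan_theta, rhs))   # True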