CORRELATION
Bivariate Distribution
Observations are taken on two variables
Two characteristics are measured on n individuals
e.g.: the height (x) and weight (y) of 10 students
A single characteristic is measured on two groups of individuals
e.g.: the height of 10 males (x) and 10 females (y)
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
Height Self-esteem
68 4.1
71 4.6
62 3.8
75 4.4
58 3.2
60 3.1
67 3.8
68 4.1
71 4.3
69 3.7
68 3.5
67 3.2
63 3.7
62 3.3
60 3.4
63 4
65 4.1
67 3.8
63 3.4
61 3.6
Definition
Correlation is used to measure and describe a relationship/association between two variables
A single number which describes the relationship between X and Y is the correlation coefficient. Denoted by ‘r’ or ‘ρ’.
Scatter Diagram
[Figure: scatter plot; x-axis: Height, y-axis: Self Esteem]
[Figure: scatter plot, "Education Level and Lifetime Earnings"; x-axis: Education (Predictor Variable), y-axis: Lifetime Earnings (Criterion Variable)]
X (Education) Y (Income)
8 3.4
7 4.4
6 2.5
5 2.1
4 1.6
3 1.5
2 1.2
1 1
What is the relationship between level of education and lifetime earnings?
Direction of Relationship
A scatter plot shows at a glance the direction of the relationship.
A positive correlation indicates that Y tends to increase as X increases.
A negative correlation indicates that Y tends to decrease as X increases.
No Correlation
In cases where there is no correlation between two variables, the dots are scattered about the plot in an irregular pattern.
Correlation Coefficient
The correlation coefficient measures three characteristics of the relationship between X and Y:
The direction of the relationship
The form of the relationship
The degree of the relationship
Karl Pearson Correlation
Calculation
Calculate the Karl Pearson correlation for the height and self-esteem data on slide 3.
Ans: r = 0.73
Interpretation: The data exhibit a strong positive correlation, indicating that self-esteem tends to increase with height.
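As a check on the answer above, here is a minimal Python sketch of the calculation. It assumes the usual raw-score form of Pearson's coefficient, r = (nΣxy − ΣxΣy) / sqrt((nΣx² − (Σx)²)(nΣy² − (Σy)²)). The function name `pearson_r` is illustrative; the data are the height and self-esteem values listed earlier.

```python
import math

def pearson_r(x, y):
    """Pearson correlation from raw sums (assumed raw-score formula)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Height and self-esteem for the 20 individuals in the table.
height = [68, 71, 62, 75, 58, 60, 67, 68, 71, 69,
          68, 67, 63, 62, 60, 63, 65, 67, 63, 61]
self_esteem = [4.1, 4.6, 3.8, 4.4, 3.2, 3.1, 3.8, 4.1, 4.3, 3.7,
               3.5, 3.2, 3.7, 3.3, 3.4, 4.0, 4.1, 3.8, 3.4, 3.6]

r = pearson_r(height, self_esteem)
print(round(r, 2))  # 0.73
```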
X (Education) Y (Income) XY X^2 Y^2
8 3.4 27.2 64 11.56
7 4.4 30.8 49 19.36
6 2.5 15 36 6.25
5 2.1 10.5 25 4.41
4 1.6 6.4 16 2.56
3 1.5 4.5 9 2.25
2 1.2 2.4 4 1.44
1 1 1 1 1
Total: 36 17.7 97.8 204 48.83
$n = 8,\ \sum X = 36,\ \sum Y = 17.7,\ \sum XY = 97.8,\ \sum X^2 = 204,\ \sum Y^2 = 48.83$
The data shows a high positive correlation between income and education.
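The column totals are enough to reproduce the coefficient by hand. A worked substitution, assuming the raw-score form of Pearson's formula with n = 8, ΣX = 36, ΣY = 17.7, ΣXY = 97.8, ΣX² = 204 and ΣY² = 48.83:

```latex
\begin{align*}
r &= \frac{n\sum XY - \sum X \sum Y}
          {\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]
                 \left[n\sum Y^2 - \left(\sum Y\right)^2\right]}} \\
  &= \frac{8(97.8) - (36)(17.7)}
          {\sqrt{\left[8(204) - 36^2\right]\left[8(48.83) - 17.7^2\right]}} \\
  &= \frac{782.4 - 637.2}{\sqrt{(336)(77.35)}}
   = \frac{145.2}{161.21} \approx 0.90
\end{align*}
```

A value of about 0.90 is consistent with the high positive correlation stated above.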
Drawbacks
Presence of outliers
Nonlinear scatter plot of x and y values
In the next slide, scatter plots are shown for 7 different datasets that have the same correlation, r = 0.70. Is the use of r justified in each case?
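Both drawbacks can be illustrated with a small Python sketch. The data below are hypothetical and chosen only for illustration, and `pearson_r` is an illustrative helper that assumes the raw-score form of Pearson's formula. The first dataset has a perfect but curved relationship; the second is a perfect straight line to which one outlier is added.

```python
import math

def pearson_r(x, y):
    """Pearson correlation from raw sums (assumed raw-score formula)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data: y is completely determined by x, but not linearly.
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [v * v for v in x_curve]
r_curve = pearson_r(x_curve, y_curve)  # numerator is exactly zero

# Hypothetical data: five points on a straight line, then one outlier added.
x_line = [1, 2, 3, 4, 5]
y_line = [1, 2, 3, 4, 5]
r_line = pearson_r(x_line, y_line)
r_outlier = pearson_r(x_line + [6], y_line + [-10])

print(r_curve, r_line, round(r_outlier, 2))
```

In this sketch r is zero for the curved data even though y depends entirely on x, and a single outlier turns a perfect positive correlation into a negative one.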
Rank Correlation
Age (mths), Stopping distance, Age rank, Stopping rank, d, d^2
9 28.4 1 1 0 0
15 29.3 2 2 0 0
24 37.6 3 7 4 16
30 36.2 4 4.5 0.5 0.25
38 36.5 5 6 1 1
46 35.3 6 3 -3 9
53 36.2 7 4.5 -2.5 6.25
60 44.1 8 8 0 0
64 44.8 9 9 0 0
76 47.2 10 10 0 0
$\sum d^2 = 32.5$
Scatter Plot
Calculations
Number in sample (n) = 10
$r = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$
r = 1 - (195 / (10 x 99))
r = 1 - 0.197
r = 0.803
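The rank calculation can be reproduced with a short Python sketch. It assumes tied values share the average of their ranks (as with the two stopping distances of 36.2, which both get rank 4.5) and uses the formula r = 1 − 6Σd² / (n(n² − 1)). The function names `average_ranks` and `rank_correlation` are illustrative.

```python
def average_ranks(values):
    """Rank values from smallest (1) to largest; ties get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def rank_correlation(x, y):
    """Return (r, sum of d^2) using r = 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    rank_x = average_ranks(x)
    rank_y = average_ranks(y)
    sum_d2 = sum((ry - rx) ** 2 for rx, ry in zip(rank_x, rank_y))
    n = len(x)
    return 1 - 6 * sum_d2 / (n * (n * n - 1)), sum_d2

age_months = [9, 15, 24, 30, 38, 46, 53, 60, 64, 76]
stopping_distance = [28.4, 29.3, 37.6, 36.2, 36.5, 35.3, 36.2, 44.1, 44.8, 47.2]

r_s, sum_d2 = rank_correlation(age_months, stopping_distance)
print(sum_d2, round(r_s, 3))  # 32.5 0.803
```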
Probable Error
$P.E.(r) = 0.6745 \, \frac{1 - r^2}{\sqrt{n}}$
If r > 6 P.E., the correlation is regarded as highly significant in the population; otherwise it is regarded as insignificant.
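A minimal Python sketch of this check, assuming the standard form P.E.(r) = 0.6745 (1 − r²) / √n and applying it to the height and self-esteem result (r = 0.73, n = 20). The function name `probable_error` is illustrative.

```python
import math

def probable_error(r, n):
    """Probable error of r, assuming P.E. = 0.6745 * (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1 - r * r) / math.sqrt(n)

r, n = 0.73, 20  # height and self-esteem example
pe = probable_error(r, n)
is_significant = r > 6 * pe  # the slide's rule: significant if r > 6 P.E.

print(round(pe, 3), is_significant)  # 0.07 True
```

Here 6 P.E. is roughly 0.42, so r = 0.73 passes the rule.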
Caution
Correlation does not imply causation.
Example: average temperature (x) in a month and number of ice cream vendors (y), r = 0.92 (highly positive).