Linear Correlation Analysis
Spring 2005
• Superstitions
– Walking under a ladder
– Opening an umbrella indoors
• Empirical Evidence
– Consumption of ice cream and drownings are generally positively correlated. Can we reduce the number of drownings if we prohibit ice cream sales in the summer?
3 kinds of relationships between variables
• Association (also called Correlation or Covariation)
– Both variables tend to be high or low (positive relationship) or one tends to be high when the other is low (negative relationship). Variables do not have independent & dependent roles.
• Prediction
– Variables are assigned independent and dependent roles. Both variables are observed. There is a weak causal implication that the independent predictor variable is the cause and the dependent variable is the effect.
• Causal
– Variables are assigned independent and dependent roles. The independent variable is manipulated and the dependent variable is observed. Strong causal statements are allowed.
General Overview of
Correlational Analysis
• The purpose is to measure the strength of a linear relationship between 2 variables.
• A correlation coefficient does not ensure “causation” (i.e., that a change in X causes a change in Y)
• X is typically the Input, Measured, or Independent variable.
• Y is typically the Output, Predicted, or Dependent variable.
• If, as X increases, there is a predictable shift in the values of Y, a correlation exists.
General Properties of
Correlation Coefficients
• Values can range between +1 and -1
• The value of the correlation coefficient represents the scatter of points on a scatterplot
• You should be able to look at a scatterplot and estimate what the correlation would be
• You should be able to look at a correlation coefficient and visualize the scatterplot
Perfect Linear Correlation
• Occurs when all the points in a scatterplot fall exactly along a straight line.
Positive Correlation (Direct Relationship)
• As the value of X increases, the value of Y also increases
• Larger values of X tend to be paired with larger values of Y (and consequently, smaller values of X and Y tend to be paired)
Negative Correlation (Inverse Relationship)
• As the value of X increases, the value of Y decreases
• Small values of X tend to be paired with large values of Y (and vice versa).
Non-Linear Correlation
• As the value of X increases, the value of Y changes in a non-linear manner
No Correlation
• As the value of X changes, Y does not change in a predictable manner.
• Large values of X seem just as likely to be paired with small values of Y as with large values of Y
Interpretation
• Depends on what the purpose of the study is… but here is a “general guideline”...
• Value = magnitude of the relationship
• Sign = direction of the relationship
Some of the many Types of Correlation Coefficients (there are lots more…)

Name            X variable              Y variable
Pearson r       Interval/Ratio          Interval/Ratio
Spearman rho    Ordinal                 Ordinal
Kendall's Tau   Ordinal                 Ordinal
Phi             Dichotomous             Dichotomous
Intraclass R    Interval/Ratio (Test)   Interval/Ratio (Retest)
Of the coefficients in the table above (there are lots more…), Pearson r, Spearman rho, and Kendall's Tau are the ones we will focus on this semester; all three are included in the SPSS “Bivariate Correlation” procedure.
The Pearson Product-Moment Correlation (r)
• Named after Karl Pearson (1857-1936)
• Both X and Y measured at the Interval/Ratio level
• Most widely used coefficient in the literature
The Pearson Product-Moment Correlation (r)
• A measure of the extent to which paired scores occupy the same or opposite positions within their own distributions
From: Pagano (1994)
Computing Pearson r: Hand Calculation

Computing Pearson r in EXCEL
[Screenshots: Step #1; Step #2: Insert Function (Pearson); Step #3: Select X and Y data; Step #4: Format output]

Subject   X   Y
A         1   2
B         3   5
C         4   3
D         6   7
E         7   5

Pearson r = 0.73
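The same calculation can be sketched in plain Python (no Excel or SPSS required); this is just the standard Pearson formula applied to the five subjects above:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation: the sum of cross-products
    of deviations, divided by the square root of the product of the
    sums of squared deviations."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Data for subjects A-E from the slide
x = [1, 3, 4, 6, 7]
y = [2, 5, 3, 7, 5]
print(round(pearson_r(x, y), 2))  # 0.73, matching the Excel result
```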
Computing Pearson r in SPSS
[Screenshots: Step #1; Step #2: Analyze-Correlate-Bivariate; Step #3: Select X and Y data; Step #4: Means + SD’s]

Output #1: Descriptive Statistics

        Mean   Std. Deviation   N
VARX    4.20   2.387            5
VARY    4.40   1.949            5

Output #2: Correlations

                             VARX   VARY
VARX   Pearson Correlation   1      .731
       Sig. (2-tailed)       .      .161
       N                     5      5
VARY   Pearson Correlation   .731   1
       Sig. (2-tailed)       .161   .
       N                     5      5
Interpretation
• r = 0.73, p = .161
The researchers found a moderate, but not significant, relationship between X and Y

SAMPLE SIZE: One of the many issues involved with the interpretation of correlation coefficients
Descriptive Statistics

        Mean   Std. Deviation   N
VARX    4.20   2.179            25
VARY    4.40   1.780            25

Correlations

                             VARX     VARY
VARX   Pearson Correlation   1        .731**
       Sig. (2-tailed)       .        .000
       N                     25       25
VARY   Pearson Correlation   .731**   1
       Sig. (2-tailed)       .000     .
       N                     25       25

**. Correlation is significant at the 0.01 level (2-tailed).
Interpretation
• r = 0.73, p < .001
The researchers found a significant moderate relationship between X and Y

How can this be?
• The distribution of Pearson r is not symmetrically shaped as r approaches ±1 (see http://davidmlane.com/hyperstat/A98696.html for more information)
• Examining the 95% confidence interval for r
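The effect of sample size can be sketched with the Fisher z transformation, the standard way to build an approximate confidence interval for r (the same r = .731 produces a CI that straddles zero at n = 5 but not at n = 25):

```python
import math

def pearson_r_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for Pearson r via the
    Fisher z transformation: z = atanh(r), SE = 1/sqrt(n - 3)."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Same r = .731, the two sample sizes from the SPSS runs above
print(pearson_r_ci(0.731, 5))   # wide interval that includes 0 -> not significant
print(pearson_r_ci(0.731, 25))  # interval entirely above 0 -> significant
```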
An additional way to Interpret Pearson r
• Coefficient of Determination
– r²
– The proportion of the variability of Y accounted for by X

[Venn diagram: circles for the variability of X and the variability of Y; the area of overlap represents the proportion of variability of Y accounted for by X (value is expressed as a %)]
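For the SPSS example above, the coefficient of determination is a one-line calculation:

```python
r = 0.731            # Pearson r from the SPSS output above
r_squared = r ** 2   # coefficient of determination
print(round(r_squared, 2))  # 0.53 -> about 53% of the variability of Y
                            # is accounted for by X
```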
Correlation Identification Practice
• Let’s see if you can identify the value for the correlation coefficient from a scatterplot…
• Click to begin

[Practice scatterplots of Variable Y vs. Variable X, both axes 0-100]
Outliers
• Observations that clearly appear to be out of range of the other observations.

[Scatterplots illustrating the effect of outliers: r = 0.97 and r = 0.72]
What to do with Outliers
You are stuck with them unless…
• Check to see if there has been a data entry error. If so, fix the data.
• Check to see if these values are plausible. Is this score within the minimum and maximum score possible? If values are impossible, delete the data. Report how many scores were deleted.
• Examine other variables for these subjects to see if you can find an explanation for these scores being so different from the rest. You might be able to delete them if your reasoning is sound.
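A quick synthetic demonstration (these numbers are made up for illustration, not the data behind the slide's scatterplots) shows how a single out-of-range point can drag a correlation down:

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = list(range(1, 11))   # 1..10
y = list(range(1, 11))   # perfectly linear with x: r = 1.0
print(pearson_r(x, y))

# Adding one wild point collapses the coefficient
print(pearson_r(x + [15], y + [1]))  # roughly 0.34
```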
Correlation & Attenuation
• Restricting the range of scores can have a large impact on a correlation coefficient.

[Scatterplot of all data, Variable X and Variable Y from 0 to 100, divided into LOW, MEDIUM, and HIGH ranges of X: r = 0.72]

Low Group: r = 0.55
[Scatterplot of the LOW range only, Variable X from 0 to 35]

Medium Group: r = 0.86
[Scatterplot of the MEDIUM range only, Variable X from 20 to 70]

High Group: r = 0.67
[Scatterplot of the HIGH range only, Variable X from 60 to 100]

LOW: r = 0.55   MEDIUM: r = 0.86   HIGH: r = 0.67
Using all of the data… r = 0.72
[Scatterplot of all data combined, X and Y axes 0 to 140, with Men and Women plotted as separate groups]
Here’s another problem with interpreting Correlation Coefficients that you should watch out for…
Men: r = -0.21
Women: r = +0.22
All data combined: r = +0.89
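This reversal can be reproduced with hypothetical numbers (again invented for illustration, not the slide's data): each group shows a negative trend, but because one group sits at low X/low Y and the other at high X/high Y, pooling them yields a strong positive r:

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

# Within each group Y falls as X rises...
women_x, women_y = [10, 20, 30, 40, 50], [30, 28, 26, 24, 22]
men_x,   men_y   = [70, 80, 90, 100, 110], [80, 78, 76, 74, 72]

print(pearson_r(women_x, women_y))   # -1.0 within women
print(pearson_r(men_x, men_y))       # -1.0 within men
# ...but the pooled correlation is strongly positive
print(pearson_r(women_x + men_x, women_y + men_y))
```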
Reporting a set of Correlation Coefficients in a table
• Complete correlation matrix. Notice redundancy.
• Lower triangular correlation matrix. Values are not repeated. (There is also an upper triangular matrix!)
Spearman Rho (rs)
• Named after Charles E. Spearman (1863-1945)
• Assumptions:
– Data consist of a random sample of n pairs of numeric or non-numeric observations that can be ranked.
– Each pair of observations represents two measurements taken on the same object or individual.
Photo from: http://www.york.ac.uk/depts/maths/histstat/people/sources.htm
Why choose Spearman rho instead of a Pearson r?
• Both X and Y are measured at the ordinal level
• Sample size is small
• X and Y are measured at the interval/ratio level, but are not normally distributed (e.g. are severely skewed)
• X and Y do not follow a bivariate normal distribution
What is a Bivariate Normal Distribution?
[Illustrations of a bivariate normal distribution]
Sample Problem
• Pincherle and Robinson (1974) note a marked inter-observer variation in blood pressure readings. They found that doctors who read high on systolic tended to read high on diastolic. Table 1 shows the mean systolic and diastolic blood pressure readings by 14 doctors.
• Research question: What is the strength of the relationship between the two variables?
Pincherle, G. & Robinson, D. (1974). Mean blood pressure and its relation to other factors determined at a routine executive health examination. J. Chronic Dis., 27, 245-260.
Doctor ID Systolic Diastolic
1 141.8 89.7
2 140.2 74.4
3 131.8 83.5
4 132.5 77.8
5 135.7 85.8
6 141.2 86.5
7 143.9 89.4
8 140.2 89.3
9 140.8 88.0
10 131.7 82.2
11 130.8 84.6
12 135.6 84.4
13 143.6 86.3
14 133.2 85.9
Table 1. Mean blood pressure readings, millimeters mercury, by doctor.
Research question: What is the strength of the relationship between the two variables?
Option #1: Compute a Pearson r
If you do not feel these data meet the assumptions of the Pearson r… then
Option #2: Convert the data to ranks and then compute a Spearman rho
We will be going over how to check the assumptions on Wednesday when we talk about Regression
Computation of Spearman Rho: Step #1
• Rank each X relative to all other observed values of X from smallest to largest in order of magnitude. The rank of the ith value of X is denoted by R(Xi), and R(Xi) = 1 if Xi is the smallest observed value of X.
• Follow the same procedure for the Y variable
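This ranking step can be sketched in Python; the only subtlety is handling ties with midranks (the two 140.2 systolic readings must each receive the average of ranks 8 and 9, i.e. 8.5, as in the tables that follow):

```python
def average_ranks(values):
    """Rank values from smallest to largest (rank 1 = smallest),
    assigning tied values the average of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        # extend j to the end of a run of tied values
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    return ranks

# Systolic readings in doctor-ID order (Table 1)
systolic = [141.8, 140.2, 131.8, 132.5, 135.7, 141.2, 143.9,
            140.2, 140.8, 131.7, 130.8, 135.6, 143.6, 133.2]
print(average_ranks(systolic))
# matches the R(systolic) column: 12, 8.5, 3, 4, 7, 11, 14, 8.5, 10, 2, 1, 6, 13, 5
```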
Doctor ID Systolic Diastolic R(systolic)
11 130.8 84.6 1
10 131.7 82.2 2
3 131.8 83.5 3
4 132.5 77.8 4
14 133.2 85.9 5
12 135.6 84.4 6
5 135.7 85.8 7
2 140.2 74.4 8.5
8 140.2 89.3 8.5
9 140.8 88.0 10
6 141.2 86.5 11
1 141.8 89.7 12
13 143.6 86.3 13
7 143.9 89.4 14
Doctor ID Systolic Diastolic R(systolic) R(diastolic)
2 140.2 74.4 8.5 1
4 132.5 77.8 4 2
10 131.7 82.2 2 3
3 131.8 83.5 3 4
12 135.6 84.4 6 5
11 130.8 84.6 1 6
5 135.7 85.8 7 7
14 133.2 85.9 5 8
13 143.6 86.3 13 9
6 141.2 86.5 11 10
9 140.8 88.0 10 11
8 140.2 89.3 8.5 12
7 143.9 89.4 14 13
1 141.8 89.7 12 14
Doctor ID Systolic Diastolic R(systolic) R(diastolic)
1 141.8 89.7 12 14
2 140.2 74.4 8.5 1
3 131.8 83.5 3 4
4 132.5 77.8 4 2
5 135.7 85.8 7 7
6 141.2 86.5 11 10
7 143.9 89.4 14 13
8 140.2 89.3 8.5 12
9 140.8 88.0 10 11
10 131.7 82.2 2 3
11 130.8 84.6 1 6
12 135.6 84.4 6 5
13 143.6 86.3 13 9
14 133.2 85.9 5 8
Doctor ID Systolic Diastolic R(systolic) R(diastolic) di
1 141.8 89.7 12 14 -2
2 140.2 74.4 8.5 1 7.5
3 131.8 83.5 3 4 -1
4 132.5 77.8 4 2 2
5 135.7 85.8 7 7 0
6 141.2 86.5 11 10 1
7 143.9 89.4 14 13 1
8 140.2 89.3 8.5 12 -3.5
9 140.8 88.0 10 11 -1
10 131.7 82.2 2 3 -1
11 130.8 84.6 1 6 -5
12 135.6 84.4 6 5 1
13 143.6 86.3 13 9 4
14 133.2 85.9 5 8 -3
Doctor ID Systolic Diastolic R(systolic) R(diastolic) di di2
1 141.8 89.7 12 14 -2 4
2 140.2 74.4 8.5 1 7.5 56.25
3 131.8 83.5 3 4 -1 1
4 132.5 77.8 4 2 2 4
5 135.7 85.8 7 7 0 0
6 141.2 86.5 11 10 1 1
7 143.9 89.4 14 13 1 1
8 140.2 89.3 8.5 12 -3.5 12.25
9 140.8 88.0 10 11 -1 1
10 131.7 82.2 2 3 -1 1
11 130.8 84.6 1 6 -5 25
12 135.6 84.4 6 5 1 1
13 143.6 86.3 13 9 4 16
14 133.2 85.9 5 8 -3 9
Σdi² = 132.50
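With Σdi² in hand, rho follows from the standard shortcut formula rs = 1 - 6Σdi² / (n(n² - 1)). A minimal sketch using the di² column above:

```python
# Squared rank differences (the di^2 column of Table 1)
d_squared = [4, 56.25, 1, 4, 0, 1, 1, 12.25, 1, 1, 25, 1, 16, 9]
n = 14

assert sum(d_squared) == 132.50  # matches the table's total

# Shortcut formula (exact only when there are no ties)
rho = 1 - 6 * sum(d_squared) / (n * (n ** 2 - 1))
print(round(rho, 3))  # 0.709 -- a hair above the .708 SPSS reports,
                      # because SPSS applies Pearson's formula to the
                      # ranks and so adjusts for the tied systolic pair
```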
Computing Spearman Rho using SPSS
Analyze-Correlate-Bivariate

Correlations (Spearman’s rho)

                                     SYSTOLIC   DIASTOLI
SYSTOLIC   Correlation Coefficient   1.000      .708**
           Sig. (2-tailed)           .          .005
           N                         14         14
DIASTOLI   Correlation Coefficient   .708**     1.000
           Sig. (2-tailed)           .005       .
           N                         14         14

**. Correlation is significant at the .01 level (2-tailed).
Kendall’s Tau (τ)
• Named after Sir Maurice G. Kendall (1907-1983)
• Based on the ranks of observations
• Values range between -1 and +1
• Computation is more tedious than rs
• Defined as the probability of concordance minus the probability of discordance.
• Typically will yield a different value than rs
Photo from: http://www.york.ac.uk/depts/maths/histstat/people/sources.htm
To find out more about this statistic, see http://www2.chass.ncsu.edu/garson/pa765/assocordinal.htm
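The concordance/discordance definition can be applied directly to the blood-pressure data. The sketch below computes the tau-b variant, which removes tied pairs from the denominator (SPSS reports tau-b in its Bivariate procedure):

```python
import math

# Blood-pressure data from Table 1, in doctor-ID order
systolic = [141.8, 140.2, 131.8, 132.5, 135.7, 141.2, 143.9,
            140.2, 140.8, 131.7, 130.8, 135.6, 143.6, 133.2]
diastolic = [89.7, 74.4, 83.5, 77.8, 85.8, 86.5, 89.4,
             89.3, 88.0, 82.2, 84.6, 84.4, 86.3, 85.9]

n = len(systolic)
concordant = discordant = ties_x = ties_y = 0
for i in range(n):
    for j in range(i + 1, n):
        dx = systolic[i] - systolic[j]
        dy = diastolic[i] - diastolic[j]
        if dx == 0:
            ties_x += 1          # pair tied on X (the two 140.2 readings)
        if dy == 0:
            ties_y += 1          # pair tied on Y (none here)
        if dx != 0 and dy != 0:
            if (dx > 0) == (dy > 0):
                concordant += 1  # pair ordered the same way on both variables
            else:
                discordant += 1

n0 = n * (n - 1) // 2  # total number of pairs
tau_b = (concordant - discordant) / math.sqrt((n0 - ties_x) * (n0 - ties_y))
print(round(tau_b, 3))  # 0.486, matching the SPSS output
```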
Comparison of values for the Blood Pressure Data

Coefficient       Value    Sig. (2-tailed)   N
Pearson r         .418     .136              14
Kendall's tau_b   .486*    .016              14
Spearman's rho    .708**   .005              14

*. Correlation is significant at the .05 level (2-tailed).
**. Correlation is significant at the .01 level (2-tailed).
Types of Correlation Coefficients: The “Pearson Family”

Pearson “family”:
Name                     Symbol   X                    Y
Pearson Product-moment   r        Interval/Ratio       Interval/Ratio
Spearman rho             rs       Ordinal              Ordinal
Phi                      Φ        True Dichotomous     True Dichotomous
Point Biserial           rpb      True Dichotomous     Interval/Ratio
Rank-Biserial            rrb      True Dichotomous     Ordinal

Non-Pearson “family”:
Name                     Symbol   X                    Y
Kendall’s Tau            τ        Ordinal              Ordinal
Biserial                 rb       Forced Dichotomous   Interval/Ratio
Tetrachoric              rt       Forced Dichotomous   Forced Dichotomous

Definitions:
Forced Dichotomous: The variable is assumed to have an underlying normal distribution, but is forced to be a dichotomous variable (e.g. Rich/Poor, Happy/Sad, Smart/Not Smart, etc.)
True Dichotomous: A variable that is nominal and has only two levels.
From: http://www.oandp.org/jpo/library/1996_03_105.asp
• Nonparametric tests should not be substituted for parametric tests when parametric tests are more appropriate. Nonparametric tests should be used when the assumptions of parametric tests cannot be met, when very small numbers of data are used, and when no basis exists for assuming certain types or shapes of distributions (9).
• Nonparametric tests are used if data can only be classified, counted or ordered-for example, rating staff on performance or comparing results from manual muscle tests. These tests should not be used in determining precision or accuracy of instruments because the tests are lacking in both areas.
From:
http://www.unesco.org/webworld/idams/advguide/Chapt4_2.htm
• Pearson correlation is unduly influenced by outliers, unequal variances, non-normality, and nonlinearity. An important competitor of the Pearson correlation coefficient is the Spearman’s rank correlation coefficient. This latter correlation is calculated by applying the Pearson correlation formula to the ranks of the data rather than to the actual data values themselves. In so doing, many of the distortions that plague the Pearson correlation are reduced considerably.
For more information about the effect of ties on Spearman Rho, see…
• Conover, W. J. & Iman, R. L. (1978). Approximations of the critical region for Spearman’s rho with and without ties present. Communications in Statistics, B7(3), 269-282.