Research Methods & Design in Psychology
Lecture 4: Correlation
Lecturer: James Neill
Readings
• Howell (Fundamentals)
– Ch 9 (Correlation)
• Howell (Methods)
– Ch 6 (Categorical Data and Chi-Square)
– Ch 9 (Correlation and Regression)
Overview
• Correlational analyses
• Types of answers
• Types of correlation
• Interpretation
• Assumptions / Limitations
The World is Made of Covariation
Covariations are the Building Block of Complex Models
[Correlation matrix for 10 variables; variable labels are not recoverable]
1.000000  0.607344  0.678905  0.283671  0.256418  0.292699  0.369098  0.369121  0.387294  0.160843
0.607344  1.000000  0.688344  0.223050  0.255222  0.284722  0.368448  0.378470  0.433714  0.170301
0.678905  0.688344  1.000000  0.286396  0.294275  0.376233  0.357735  0.413585  0.459494  0.211005
0.283671  0.223050  0.286396  1.000000  0.781746  0.644679  0.210182  0.271559  0.262021  0.171870
0.256418  0.255222  0.294275  0.781746  1.000000  0.632233  0.226194  0.313878  0.279477  0.198278
0.292699  0.284722  0.376233  0.644679  0.632233  1.000000  0.243968  0.308430  0.353543  0.218881
0.369098  0.368448  0.357735  0.210182  0.226194  0.243968  1.000000  0.561840  0.515174  0.227267
0.369121  0.378470  0.413585  0.271559  0.313878  0.308430  0.561840  1.000000  0.665721  0.240682
0.387294  0.433714  0.459494  0.262021  0.279477  0.353543  0.515174  0.665721  1.000000  0.266195
0.160843  0.170301  0.211005  0.171870  0.198278  0.218881  0.227267  0.240682  0.266195  1.000000
Correlational Research Questions
Are two variables related?
Interesting questions tend to:
• test a novel relationship, e.g.
– “is time spent studying for exams associated with increased incidence of brain cancer?”
• avoid simply showing an expected relationship, e.g.
– “is time spent studying for exams associated with higher exam marks?”
Correlational analyses 1
Correlational analyses are used to examine the extent to which two variables have a simple linear relationship.
Correlations provide the building blocks for:
– Factor analysis
– Reliability
– Regression
– Etc.
Correlational analyses 2
Linear relationship between 2 variables:
– direction
– strength
– ranges from -1 to +1
• Sign indicates direction
• Size indicates strength
Correlational analyses 3
Measures the extent to which:
• differences in one variable can be predicted from differences in the other variable
• one variable varies with another variable
A correlation is also an effect size.
Types of Answers
• No relationship (independence)
• Linear relationship:
– As one variable increases, so does the other (+ve)
– As one variable increases, the other decreases (-ve)
• Non-linear
• Restricted range
• Heterogeneous samples
Types of Correlation
• Phi / Cramer’s V
• Spearman’s rank / Kendall’s Tau b
• Point bi-serial rpb
• Product-moment or Pearson’s r
Graphs and measures of association, by level of measurement of the two variables:
• Nominal by Nominal: clustered bar graph; chi-squared; phi (φ) or Cramer's V
• Nominal by Ordinal: clustered bar graph; chi-squared; phi (φ) or Cramer's V
• Nominal by Int/Ratio: scatterplot, bar chart or error-bar chart; point bi-serial correlation (rpb)
• Ordinal by Ordinal: scatterplot or clustered bar chart; Spearman's rho or Kendall's tau
• Ordinal by Int/Ratio: recode; scatterplot; point bi-serial or Spearman/Kendall
• Int/Ratio by Int/Ratio: scatterplot; product-moment correlation (r)
[Figure: four scatterplot panels with horizontal axes tufte$X1, tufte$X2, tufte$X3 and tufte$X4 (each 0 to 20) and vertical axes 0 to 15]
Tufte: Graphics reveal data.
Nominal by Nominal
Contingency Tables 1
• Bivariate frequency tables
• Marginal totals
• Can include %
Contingency Tables 2
Example
Example
Example
Clustered Bar Graph
• Bar graph of frequencies or percentages in which the categories of one variable lie along the category axis and coloured bars, clustered within each category, indicate the categories of the other variable
Example
Chi-square 1
Example
Example
Phi (φ) & Cramer’s V
Phi (φ)
• Two nominal variables, at least one of them dichotomous (2x2, 2x3, 3x2)
• E.g., Gender & Pass/Fail
Cramer’s V
• Two nominal variables, each with three or more categories (3x3 or greater)
• E.g., Favourite Season x Favourite Sense
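The link between chi-square, phi and Cramer's V can be sketched in a few lines. This is a minimal illustration, assuming the usual definition V = sqrt(chi-square / (N × (k - 1))) with k the smaller of the number of rows and columns; the counts and the function names `chi_square` and `cramers_v` are made up for the example.

```python
from math import sqrt

def chi_square(table):
    """Pearson chi-square statistic and total N for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(table):
    """Cramer's V; for a 2x2 table this equals the absolute value of phi."""
    chi2, n = chi_square(table)
    k = min(len(table), len(table[0]))  # smaller of rows and columns
    return sqrt(chi2 / (n * (k - 1)))

# Hypothetical counts: rows = gender, columns = pass / fail
gender_by_result = [[30, 10],
                    [20, 20]]
chi2, n = chi_square(gender_by_result)
phi = cramers_v(gender_by_result)
print(round(chi2, 3), round(phi, 3))
```

Because phi is derived from chi-square, it carries no sign here; direction has to be read from the table or the clustered bar graph.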
Example
Ordinal by Ordinal
Spearman’s rho (rs)
• For ranked (or recoded to ordinal) data
• Uses product-moment correlation, but interpretations must be adjusted to consider the underlying ranked scales
– e.g., Olympic Placing vs. World Ranking
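To make "uses product-moment correlation" concrete, the sketch below converts each variable to ranks and then applies Pearson's formula to the ranks. The placings are invented, and giving tied values their average rank is an assumption of this sketch (it is the common convention).

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical placings for five athletes
olympic_placing = [1, 2, 3, 4, 5]
world_ranking = [2, 1, 4, 3, 5]
rho = spearman_rho(olympic_placing, world_ranking)
print(round(rho, 2))
```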
Kendall’s Tau-b
• For ordinal/ranked data
• Takes joint (tied) ranks into account
• Ranges from -1 to +1, but only for square tables
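A rough sketch of how tau-b handles ties, assuming the standard definition: count concordant (C) and discordant (D) pairs, and enlarge the denominator for pairs tied on only one variable. The function name and data are illustrative only.

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b from pairwise comparisons (fine for small samples)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: not counted anywhere
            elif dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denominator = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denominator

# Hypothetical ranks without ties, then with ties
tau_no_ties = kendall_tau_b([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
tau_with_ties = kendall_tau_b([1, 2, 2, 3], [1, 2, 3, 3])
print(round(tau_no_ties, 2), round(tau_with_ties, 2))
```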
Dichotomous by Interval/Ratio
Point-biserial correlation (rpb)
• One dichotomous & one continuous variable
• Calculate as for Pearson’s r, but interpretations must be adjusted to consider the underlying dichotomous scale
– e.g., gender and self-esteem
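As a small illustration of "calculate as for Pearson's r", the sketch below codes the dichotomous variable as 0/1 and runs the ordinary formula. The scores are invented. Swapping the 0/1 coding flips the sign of rpb but not its size, which is why the sign has to be interpreted against the coding.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: group coded 0/1, self-esteem on a 1-10 scale
group = [0, 0, 0, 0, 1, 1, 1, 1]
self_esteem = [4, 5, 6, 5, 7, 8, 6, 7]

r_pb = pearson_r(group, self_esteem)
r_pb_recoded = pearson_r([1 - g for g in group], self_esteem)
print(round(r_pb, 3), round(r_pb_recoded, 3))
```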
Example
Example
Interval/Ratio by Interval/Ratio
Product-moment correlation (r)
• For two interval and/or ratio variables
• $r = \frac{\mathrm{Cov}_{XY}}{s_X s_Y}$
Scatterplots
• Plot each pair of observations (X, Y)
• x = predictor variable (independent)
• y = criterion variable (dependent)
• Check for:
– outliers
– linearity
• ‘Line of best fit’
y = a + bx
The correlation between 2 variables is a measure of the degree to which:
• Pairs of numbers (points) cluster together around a best-fitting straight line
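One way to obtain the line y = a + bx is ordinary least squares. The slides do not name the fitting method, so least squares is an assumption here, and the data points are made up.

```python
def line_of_best_fit(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx        # slope: change in y per unit change in x
    a = my - b * mx      # intercept: the line passes through the two means
    return a, b

# Hypothetical (X, Y) observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = line_of_best_fit(x, y)
print(f"y = {a:.1f} + {b:.1f}x")
```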
Scatterplot showing relationship between age & cholesterol
Figure 9.1: Infant Mortality and Number of Physicians
[Scatterplot: Infant Mortality (-6 to 10) by Physicians per 100,000 Population (10 to 20)]
Strong positive (.81)
Figure 9.2: Life Expectancy and Health Care Costs
[Scatterplot: Life Expectancy (Males) (66 to 74) by Health Care Expenditures (200 to 1600)]
Weak positive (.14)
Figure 9.3: Cancer Rate and Solar Radiation
[Scatterplot: Breast Cancer Rate (20 to 34) by Solar Radiation (200 to 600)]
Moderately strong negative (-.76)
Stop global warming: Become a pirate
Correlation Estimation
Indicate level (high, med., or low) and sign of the correlation for:
a) number of guns in community and number of firearm deaths
b) robberies and incidence of drug abuse
c) protected sex and incidence of AIDS
d) community education level and crime rate
e) solar flares and suicide
Covariance
• Variance shared by 2 variables
• Covariance reflects the direction of the relationship:
+ve cov indicates + relationship
-ve cov indicates - relationship.
Cross products
$$\mathrm{Cov}_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1}$$
Covariance – Cross-products
[Scatterplot: Y1 (0 to 3) by X1 (0 to 40), with regions labelled "-ve dev. products" and "+ve dev. products"]
Covariance
• Covariance is dependent on the scale of measurement used
• Can’t compare magnitude of cov across different scales of measurement (e.g., age by weight in kilos versus age by weight in grams).
• Therefore, standardise the covariance to obtain the correlation
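The scale dependence of covariance, and its removal by standardising, can be checked with a short sketch. The ages and weights are invented; the covariance uses the N - 1 denominator.

```python
from math import sqrt

def covariance(x, y):
    """Sample covariance: sum of cross-products of deviations over N - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    sx = sqrt(covariance(x, x))
    sy = sqrt(covariance(y, y))
    return covariance(x, y) / (sx * sy)

# Hypothetical data: the same weights expressed in kilos and in grams
age = [20, 30, 40, 50, 60]
weight_kg = [60, 65, 72, 70, 78]
weight_g = [w * 1000 for w in weight_kg]

cov_kg, cov_g = covariance(age, weight_kg), covariance(age, weight_g)
r_kg, r_g = correlation(age, weight_kg), correlation(age, weight_g)
print(cov_kg, cov_g)                  # covariance grows 1000-fold
print(round(r_kg, 3), round(r_g, 3))  # correlation is unchanged
```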
Correlation formula
$$r = \frac{\mathrm{Cov}_{XY}}{s_X s_Y}$$
The correlation between 2 variables is:
• an effect size, i.e., a standardised measure of the amount of covariation.
Correlation - SPSS
Correlations: Cigarette Consumption per Adult per Day by CHD Mortality per 10,000
Pearson Correlation    .713**
Sig. (2-tailed)        .000
N                      21
**. Correlation is significant at the 0.01 level (2-tailed).
Hypothesis testing
• Almost all sample correlations differ from 0, so the question is: “What is the likelihood that a relationship between variables is a ‘true’ relationship, rather than simply a result of random sampling variability or ‘chance’?”
Significance of Correlation
• Null hypothesis (H0): assumes that there is no ‘true’ relationship
• Alternative hypothesis (H1): assumes that the relationship is real
• We initially assume the null hypothesis, and evaluate whether the data support the alternative hypothesis
• Null hypothesis: H0: ρ = 0
• Alternative hypothesis: H1: ρ ≠ 0
ρ (rho) = population product-moment correlation coefficient
How do we test the null hypothesis?
• Use statistical tests of probability which produce a p value
• Convention specifies a criterion for statistical significance of 0.05 (alpha level)
• We generate a p value and compare it to 0.05.
• If p is less than 0.05, the result is statistically significant: a relationship at least this strong would arise from random sampling variability less than 5% of the time if there were no true relationship
Imprecision in hypothesis testing
• Type I error: rejecting the null hypothesis when it is true
• Type II error: retaining (failing to reject) the null hypothesis when it is false
• Statistical significance is a function of effect size, sample size and alpha level
Significance of Correlation
• A 1- or 2-tailed significance test can be done in an effort to infer to a population
• Result will depend on the power of the study (i.e., the higher the N, the critical p (alpha) and r, the more likely the result is to be significant)
• Alternatively look up r tables with df = N - 2
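As a rough sketch of what the test does, r can be converted to a t statistic with df = N - 2, the same df used for r tables. The p value below comes from the Fisher z normal approximation, which is swapped in here only because it needs nothing beyond the standard library; it is not the exact t-based p that SPSS reports. The function names are illustrative, and r = .713 with N = 21 is taken from the SPSS output in this lecture.

```python
from math import atanh, sqrt
from statistics import NormalDist

def t_from_r(r, n):
    """t statistic for H0: rho = 0, with df = n - 2."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

def approx_two_tailed_p(r, n):
    """Approximate two-tailed p via Fisher's z transformation (requires n > 3)."""
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

r, n = 0.713, 21
t = t_from_r(r, n)
p = approx_two_tailed_p(r, n)
print(f"t({n - 2}) = {t:.2f}, approximate p = {p:.4f}")
```

With these values the approximate p falls well below .01, in line with the two asterisks in the SPSS table.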
Scatterplot showing a confidence interval for a line of best fit
US States 4th Academic Achievement by SES
Significance of Correlation
df (N - 2)    critical r (p = .05)
5             .67
10            .50
15            .41
20            .36
25            .32
30            .30
50            .23
200           .11
500           .07
1000          .05
When we say that the correlation between Age and test Performance is significant, we mean that:
a. there is an important relationship between Age and test Performance
b. the true correlation between Age and Performance in the population is equal to 0
c. the true correlation between Age and Performance in the population is not equal to 0
d. getting older causes you to do poorly on tests
Descriptive Assumptions
• Level of measurement (LOM) >= interval
• Linear relationships
• No outliers
Inferential Assumptions
• Homoscedasticity
• Similar, normal underlying distributions
• Correction for attenuation (unreliability)
• Minimal and normally distributed measurement error
Homoscedasticity
Factors that affect correlation
• Restricted range
• Heterogeneous samples
• Scale has no effect
Coefficient of Determination (r²)
• CoD = The proportion of variance or change in one variable that can be accounted for by another variable.
• e.g., r = .60, r² = .36
Interpreting Correlation (Cohen, 1988)
• A correlation is an effect size, so guidelines re strength can be suggested.
Strength     r           r²
weak         .1 to .3    (1 to 10%)
moderate     .3 to .5    (10 to 25%)
strong       > .5        (> 25%)
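The guidelines can be written as a small lookup. This is only a sketch: how the boundary values (.3, .5) are assigned and the label "negligible" for values below .1 are assumptions of the example, not part of Cohen's table as given here.

```python
def coefficient_of_determination(r):
    """Proportion of variance in one variable accounted for by the other."""
    return r ** 2

def cohen_strength(r):
    """Label the size of r using the weak / moderate / strong bands."""
    size = abs(r)  # the sign gives direction, not strength
    if size > 0.5:
        return "strong"
    if size >= 0.3:
        return "moderate"
    if size >= 0.1:
        return "weak"
    return "negligible"  # assumed label for correlations below .1

for r in (0.60, -0.35, 0.20):
    print(r, round(coefficient_of_determination(r), 2), cohen_strength(r))
```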
Size of Correlation (Cohen, 1988)
WEAK (.1 - .3)
MODERATE (.3-.5)
STRONG (>.5)
Interpreting Correlation 1
Strength       r              r²
very weak      0 - .19        (0 to 4%)
weak           .20 - .39      (4 to 16%)
moderate       .40 - .59      (16 to 36%)
strong         .60 - .79      (36 to 64%)
very strong    .80 - 1.00     (64 to 100%)
Interpreting Correlation 2
Interpreting Correlation 3
Interpreting Correlation 4
Correlation of this scatterplot = .9
[Scatterplot: Y1 by X1, with the X1 axis running from 0 to 40]
Correlation of this scatterplot = .9
[Scatterplot: Y1 by X1, with the X1 axis running from 0 to 100]
What do you estimate the correlation of this scatterplot of height and weight to be?
a. -.5
b. -1
c. 0
d. .5
e. 1
[Scatterplot: HEIGHT (166 to 176) by WEIGHT (65 to 73)]
What do you estimate the correlation of this scatterplot to be?
a. -.5
b. -1
c. 0
d. .5
e. 1
[Scatterplot: Y (2 to 14) by X (4.4 to 5.6)]
What do you estimate the correlation of this scatterplot to be?
a. -.5
b. -1
c. 0
d. .5
e. 1
[Scatterplot: Y (4 to 6) by X (2 to 14)]
Non-linear Relationships
Check scatterplot
Can a linear relationship ‘capture’ the lion’s share of the variance?
If so, use r.
[Scatterplot: X2 (vertical axis, 0 to 40) by Y2 (horizontal axis, 0 to 60)]
Non-linear relationships
• If non-linear, consider transforming variables to ‘create’ a linear relationship. This is equivalent to finding a non-linear mathematical function that describes the relationship between the variables.
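A toy example of such a transformation, assuming an exponential relationship (the data are invented): r understates the relationship on the raw scale, while taking logs of Y makes it exactly linear.

```python
from math import log, sqrt

def pearson_r(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data in which Y doubles with each unit of X
x = [1, 2, 3, 4, 5, 6]
y = [2 ** xi for xi in x]

r_raw = pearson_r(x, y)
r_log = pearson_r(x, [log(v) for v in y])  # log(Y) is linear in X
print(round(r_raw, 3), round(r_log, 3))
```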
Range restriction
• Range restriction is when a sample contains a restricted (or truncated) range of scores
– e.g., fluid intelligence and age < 18 might have a linear relationship
• If the range is restricted, be cautious in generalising beyond the range for which data are available
– e.g., fluid intelligence does not continue to increase linearly with age after age 18
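The caution about generalising can be illustrated with invented scores that rise with age until 18 and then level off. The correlation within the restricted range says little about the correlation across the full range.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores: increase with age up to 18, flat afterwards
ages = list(range(6, 31, 2))                 # 6, 8, ..., 30
scores = [min(age, 18) for age in ages]

under_18 = [(a, s) for a, s in zip(ages, scores) if a < 18]
r_restricted = pearson_r([a for a, _ in under_18], [s for _, s in under_18])
r_full = pearson_r(ages, scores)
print(round(r_restricted, 2), round(r_full, 2))
```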
Range restriction
Heterogeneous samples
• Sub-samples (e.g., males & females) may artificially increase or decrease overall r.
• Solution - calculate r separately for sub-samples & overall, look for differences
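A minimal sketch of the suggested solution, computing r for each sub-sample and for the pooled sample. All values are made up so that the two groups show opposite relationships; the variable names are illustrative.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical weight (kg) and self-esteem scores by gender
male_weight, male_se = [70, 80, 90, 100], [4, 5, 5, 6]
female_weight, female_se = [50, 60, 70, 80], [8, 7, 7, 6]

r_male = pearson_r(male_weight, male_se)
r_female = pearson_r(female_weight, female_se)
r_all = pearson_r(male_weight + female_weight, male_se + female_se)
print(round(r_male, 2), round(r_female, 2), round(r_all, 2))
```

In this toy data the pooled r is negative and moderate in size, which hides the strong positive relationship among males.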
[Scatterplot: H1 (130 to 190) by W1 (50 to 80)]
Scatterplot of Same-sex Relations & Opposite-sex Relations by Gender
boys r = .67
girls r = .52
[Scatterplot: Same Sex Relations (2 to 7) by Opp Sex Relations (0 to 7), with points marked by SEX (female, male)]
Scatterplot of Weight and Self-esteem by Gender
[Scatterplot: SE (0 to 10) by WEIGHT (40 to 120), with points marked by SEX (male, female)]
Males r = .50
Females r = -.48
Effect of Outliers
• Outliers can disproportionately increase or decrease r.
• Options
– compute r with & without outliers
– get more data for outlying values
– recode outliers as having more conservative scores
– transformation
– recode variable into lower level of measurement
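The first option, computing r with and without the outlier, might look like the sketch below. The five observations are invented, and the last case is treated as the outlier purely for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical age and self-esteem scores; the last case is an outlier
age = [10, 20, 30, 40, 80]
self_esteem = [5, 4, 6, 5, 10]

r_with = pearson_r(age, self_esteem)
r_without = pearson_r(age[:-1], self_esteem[:-1])
print(round(r_with, 2), round(r_without, 2))
```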
Age & self-esteem ( r = .63)
[Scatterplot: SE (0 to 10) by AGE (10 to 80)]
Age & self-esteem (outliers removed) r = .23
[Scatterplot: SE (1 to 9) by AGE (10 to 40)]
Checklist
1. Graphs & Scatterplots
– Outliers?
– Linear?
– Does each variable have a reasonable range?
– Are there subsamples to consider?
2. Choose appropriate measure of Association
3. Conduct inferential test (if needed)
4. Interpret/Discuss
Dealing with several correlations
• Scatterplot matrices and correlation matrices organise correlations amongst several variables at once.
• Example (Kliewer et al., 1998)
– 99 young children
– Measured level of witnessed violence, intrusive thoughts, social support, and internalizing symptoms
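A correlation matrix is simply r computed for every pair of variables. The sketch below borrows the four variable names from the example, but the scores are made up for five imaginary children and are not the Kliewer et al. data.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(variables):
    """Return {name: {name: r}} for every pair of named variables."""
    names = list(variables)
    return {a: {b: pearson_r(variables[a], variables[b]) for b in names}
            for a in names}

# Made-up scores for five children (illustration only)
scores = {
    "witnessed_violence": [1, 2, 3, 4, 5],
    "intrusive_thoughts": [2, 1, 4, 3, 5],
    "social_support": [5, 4, 3, 2, 1],
    "internalizing": [1, 3, 2, 5, 4],
}
matrix = correlation_matrix(scores)
for name, row in matrix.items():
    print(f"{name:20s}", "  ".join(f"{r:5.2f}" for r in row.values()))
```

The diagonal is always 1 and the matrix is symmetric, so only the lower or upper triangle needs to be reported.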
Correlation matrix
Scatterplot matrix
Reporting
• Relate back to research hypothesis
• Describe & interpret correlation
– direction of relationship
– size/strength
– significance
• If many correlations, report in a table
• Acknowledge limitations, e.g.,
– Heterogeneity (sub-samples)
– Range restriction
– Causality?
Writing
“Number of children and marital satisfaction were inversely related to each other, r(48) = -.35, p < .05, indicating that contentment in marriage declines as couples elect to have more children. Overall, number of children explained approximately 12% of the variance in marital satisfaction, a small-moderate effect.”
Also see end of Howell (Fundamentals) Ch9 for an example write-up
Key Points
• Covariations are the building blocks of reliability analysis, factor analysis, and multiple regression
• Correlation does not prove causation: the relationship may run in the opposite direction, be co-causal, or be due to other variables
• Check scatterplots to see whether a correlation makes sense
• Choose appropriate measure of association based on levels of measurement
• Use r, r² and statistical significance