bivariate relationships
Statistics Using SPSS: An Integrative Approach, Second Edition
Sharon Lawner Weinberg and Sarah Knapp Abramowitz
Chapter 5: Bivariate Relationships
Summarizing the Relationship Between Two Variables: An Overview
Variable Types | Summary Graphic | Summary Statistic
Both Scale | Scatterplot | Pearson Correlation
Both Ordinal | Scatterplot | Spearman Correlation
An Ordinal & A Scale | Scatterplot | Spearman Correlation
A Scale & A Dichotomy | Scatterplot or Boxplot | Pearson (point biserial) Correlation
Both Dichotomies | Clustered Bar Graph | Pearson (phi-coefficient) Correlation or Contingency Table
The Relationship Between Two Scale Variables: What the Scatterplot Tells Us
• Whether the relationship appears linear
• If it does appear linear, it also tells us:
  • The direction and nature of the linear relationship
  • The relative strength of the linear relationship
Overview: Examples Using the Scatterplot to Describe the Relationship Between Two Scale Variables
• Hamburg data set: FAT and CALORIES.
• States data set: PERTAK (percentage of eligible students taking the SAT) and SATV (average verbal SAT score for the state).
• Currency data set: BILLVALUE (bill denomination) and the number of bills in circulation.
• Marijuana data set: YEAR and the percentage of students reporting that they ever smoked marijuana from 1987-1999.
Creating the Scatterplot
Example: Using the Hamburg data set, describe the relationship between the FAT and the CALORIES of a burger.
Solution: To obtain the scatterplot between FAT and CALORIES for the Hamburg data set, using SPSS, go to Graphs on the main menu bar, Legacy Dialogs, and then Scatter. Click Define. Put CALORIES into the box labeled y-axis and FAT into the box labeled x-axis and click OK.
Scatterplot: FAT vs. CALORIES
[Figure: scatterplot of McDonald's hamburgers, with Grams of Fat on the x-axis (0 to 40) and CALORIES on the y-axis (200 to 600).]
Interpreting the Scatterplot of FAT vs. CALORIES
• A line appears to fit the data well; i.e., there is not a simple curve that would provide a better fit, so a linear model is appropriate.
• The direction of the linear relationship is positive because the slope of the line representing the data is positive. The nature of the linear relationship is that burgers that are relatively high in fat tend also to be relatively high in calories.
• The strength of the linear relationship appears to be strong because the points cluster tightly around the line.
Editing the Scatterplot to Label Points
Go to Graphs on the main menu bar, Legacy Dialogs, and then Scatter. Click Define. Put CALORIES into the box labeled y-axis, FAT into the box labeled x-axis, and NAME in the box labeled label cases by and click OK. Double click on the graph to put it in the Chart Editor. Click on Elements, Show Data Labels. Move Name to the Displayed box and eliminate count. Click Apply, Close.
Labeled Scatterplot: FAT vs CALORIES
Scatterplot: PERTAK vs. SATV
[Figure: scatterplot of U.S. college-bound students by state, with Percentage of Eligible Students Taking the SAT on the x-axis (0 to 100) and Average SAT Verbal in 1997 on the y-axis (460 to 600).]
Interpreting the Scatterplot of SATV vs. PERTAK
• Although the points have a curvilinear shape, a line would appear to represent these points reasonably well, and so we will use it in this case.
• The direction of the linear relationship is negative because the slope of the line representing the data is negative. The nature is that states with a relatively low percentage of students taking the SAT tend to have higher SAT Verbal scores, on average.
• The strength of the linear relationship is more moderate than for the hamburger example because the points in this case do not cluster as tightly around the line.
Scatterplot: Denomination (BILLVALUE) vs. number of bills in circulation.
Note: Use Transform, Compute to combine variables to create a variable for the number of bills in circulation.
[Figure: scatterplot of United States currency, with Denomination on the x-axis (0 to 120) and NUMBER on the y-axis (0 to 7,000,000,000).]
Interpreting the Scatterplot of BILLVALUE vs. NUMBER
• Because the points have a “cloud like” formation, neither a simple curve nor a line is a good fit for these data.
• We conclude that there is little or no relationship between the bill value and the number in circulation.
Scatterplot: Year vs. percentage of high school seniors reporting that they smoked marijuana at least once: 1987-1999.
Note: Use Select Cases to restrict to the appropriate years.
[Figure: scatterplot with YEAR on the x-axis (1985 to 2000) and Percent Used Marijuana on the y-axis (32 to 48).]
Interpreting the Scatterplot of YEAR vs. MARIJUANA
• A simple curve (or two lines) provides a better fit for the data than a single line and is therefore more appropriate than a line for modeling the data.
• The relationship between marijuana use and year is non-linear.
Quantifying the Linear Relationship between Two Scale Variables: Pearson Product Moment Correlation Coefficient
• Often called, simply, correlation, and symbolized by the letter r.
• Before calculating, use a scatterplot to verify that the relationship between the variables appears to be linear.
• Calculated as the average of the product of the z-scores.
• This summary statistic measures the direction, nature, and strength of the linear relationship.
  • Direction: Look at the sign of r (positive or negative).
  • Nature: Look at the sign of r (positive means that high scores on one variable correspond to high scores on the other and low with low; negative means that low scores on one variable correspond to high scores on the other and vice versa).
  • Strength: Look at the magnitude (absolute value) of r. In the social sciences, a good rule of thumb comes from Cohen’s scale: |r| < .1, little or no; .1 <= |r| < .3, weak; .3 <= |r| < .5, moderate; |r| >= .5, strong.
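The "average of the product of the z-scores" definition can be sketched in a few lines of Python. This is a minimal illustration, not the SPSS procedure: the function names `pearson_r` and `cohen_label` and the data values are hypothetical, population standard deviations are assumed so that the plain average of the products equals r, and the strength bands assume cutoffs of .1, .3, and .5 applied to |r|.

```python
from statistics import mean, pstdev


def pearson_r(x, y):
    """Pearson r computed as the mean of the products of paired z-scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two cases")
    mx, my = mean(x), mean(y)
    # Population SDs (divide by N), so the plain mean of the products equals r.
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        raise ValueError("r is undefined when a variable does not vary")
    return mean(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y))


def cohen_label(r):
    """Verbal strength label for r, assuming cutoffs of .1, .3, and .5 on |r|."""
    size = abs(r)
    if size < 0.1:
        return "little or no"
    if size < 0.3:
        return "weak"
    if size < 0.5:
        return "moderate"
    return "strong"


# Hypothetical fat (grams) and calorie values, not the Hamburg data set.
fat = [10, 15, 20, 25, 30]
calories = [270, 320, 430, 530, 560]

r = pearson_r(fat, calories)
print(round(r, 3), cohen_label(r))
```

The sign of the returned value gives the direction and nature of the relationship, and `cohen_label` only looks at the magnitude.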
Obtaining the Pearson Correlation Using SPSS
To use SPSS to obtain the correlation coefficient between CALORIES and FAT, click Analyze on the Main Menu Bar, Correlate, and Bivariate. Move the two variables, CALORIES and FAT, into the Variables box and click OK.
Interpreting the Pearson Correlation Coefficients
• The correlation between FAT and CALORIES is .997 indicating a very strong positive linear relationship: burgers that are relatively high in fat tend also to be relatively high in calories and burgers that are relatively low in fat tend also to be relatively low in calories.
• The correlation between SAT Verbal and the percentage of students taking the SAT is -.86 indicating a strong negative linear relationship: states that have a relatively high verbal SAT average tend to have a relatively low percentage of students taking the SAT and states that have a relatively low verbal SAT average tend to have a relatively high percentage of students taking the SAT.
Other Properties of Correlation
• The strength of the correlation is measured on an ordinal scale.
• Correlation does not imply causation, i.e., when two variables are correlated it is not necessarily true that changing one will result in a predictable change in the other.
• A linear transformation applied to one variable does not change the magnitude of the correlation. The sign of the correlation will change, however, if the transformation involves multiplication by a negative number
• Restricting the range of one of the variables can increase or decrease the magnitude of the correlation
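The last two properties can be checked numerically. The sketch below uses made-up paired scores (not from any of the chapter's data sets) and a hypothetical helper `pearson_r`; in this particular example restricting the range lowers the correlation, but other data could show an increase.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson r from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical paired scores.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]

r = pearson_r(x, y)

# Linear transformation with a positive multiplier: r is unchanged.
r_scaled = pearson_r([2 * a + 10 for a in x], y)

# Linear transformation with a negative multiplier: same magnitude, sign flips.
r_flipped = pearson_r([-3 * a + 1 for a in x], y)

# Restrict the range of x to values of 4 or less and recompute.
kept = [(a, b) for a, b in zip(x, y) if a <= 4]
r_restricted = pearson_r([a for a, _ in kept], [b for _, b in kept])

print(round(r, 3), round(r_scaled, 3), round(r_flipped, 3), round(r_restricted, 3))
```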
Relationships between Two Ordinal or One Ordinal and One Scale:
Scatterplot and Spearman Rank Correlation Coefficient
The Spearman correlation, called Spearman’s rho, is a special case of the Pearson correlation computed on ranked data.
Example: Describe the relationship, or indicate that there is not one, between the amount of time spent in school on homework (HWKIN12) and the amount of time spent out of school on homework (HWKOUT12) in twelfth grade for students in the NELS data set.
Scatterplot: HWKIN12 and HWKOUT12
Obtaining the Spearman Rank Correlation Coefficient Using SPSS
Click Analyze, Correlate, Bivariate. Move the variables HWKIN12 and HWKOUT12 into the Variables box. Click Spearman and click off Pearson in the Correlation Coefficients box. Click OK. Note that when using SPSS, we do not need to transform the data to rankings to obtain the Spearman correlation coefficient. SPSS does this transformation for us.
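To make the ranking step visible, here is a small Python sketch of Spearman's rho as a Pearson correlation computed on ranks. It is an illustration under assumptions, not SPSS's internal routine: tied values are given the average of their ranks, the helper names are hypothetical, and the homework codes are invented rather than taken from the NELS data set.

```python
from math import sqrt


def average_ranks(values):
    """Rank from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson r from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson r applied to the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical ordinal homework codes for six students.
hwk_in = [0, 1, 1, 2, 3, 4]
hwk_out = [1, 0, 2, 2, 4, 3]
print(round(spearman_rho(hwk_in, hwk_out), 2))
```

Because only the ranks enter the calculation, any strictly increasing relationship, linear or not, gives a rho of 1.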
Interpreting the Spearman Rank Correlation Coefficient
• The Spearman correlation is interpreted in the same way as the Pearson correlation.
• In this case, Spearman’s rho = .40, indicating a moderate positive relationship.
• Twelfth grade students in the NELS data set who spend a relatively large amount of time doing homework in school tend also to spend a relatively large amount of time doing homework outside of school, and students who spend a relatively small amount of time doing homework in school tend also to spend a relatively small amount of time doing homework outside of school.
Relationships between One Scale and One Dichotomous Variable
Example using the Hamburg data set: Describe the relationship between calories and cheese.
Interpreting the Correlation When One Variable is Scale and One is Dichotomous
• The correlation between CALORIES and CHEESE is r = .51.
• The correlation is positive indicating that high scores on one variable are associated with high scores on the other.
• CHEESE is coded with 0 (a relatively low score) representing the absence of cheese and 1 (a relatively high score) representing the presence of cheese.
• Burgers with cheese tend to be higher in calories than those without cheese.
• This special case of Pearson correlation is sometimes called the point biserial correlation.
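The following sketch illustrates why the point biserial correlation is just a Pearson correlation with a 0/1 variable. It compares the ordinary Pearson calculation with the standard mean-difference form, r = (M1 - M0) / s * sqrt(p * q), where s is the population standard deviation of the scale variable and p and q are the proportions coded 1 and 0. The function names and the eight burgers are hypothetical, not the Hamburg data set.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson_r(x, y):
    """Pearson r from sums of squares and cross-products."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def point_biserial(groups, scores):
    """Mean-difference form of the correlation between a 0/1 variable and a scale variable."""
    ones = [s for g, s in zip(groups, scores) if g == 1]
    zeros = [s for g, s in zip(groups, scores) if g == 0]
    p = len(ones) / len(scores)  # proportion of cases coded 1
    return (mean(ones) - mean(zeros)) / pstdev(scores) * sqrt(p * (1 - p))


# Hypothetical burgers: 1 = has cheese, 0 = no cheese.
cheese = [0, 0, 0, 1, 1, 1, 1, 0]
calories = [270, 300, 320, 330, 450, 530, 560, 410]

r = pearson_r(cheese, calories)
r_pb = point_biserial(cheese, calories)
print(round(r, 3), round(r_pb, 3))
```

The two values agree, and the sign is positive exactly when the group coded 1 has the higher mean.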
Description of the Impeach Data Set
• On February 12, 1999, for only the second time in the nation’s history, the U.S. Senate voted on whether to remove a President, based on impeachment articles passed by the U.S. House.
• Dozens of political talk shows featured analyses of why senators may have voted the way they did, but such discourse was rarely (if ever) informed by systematic statistical analysis of the votes.
• Professor Alan Reifman of Texas Tech University created this data set about the senators to be used as part of such an analysis. The relevant variable descriptions appear in the following table.
Variables in the Impeach Data Set
Variable Name | Variable Label | Value Label
VOTE1 | Vote on perjury. | 0 = Not Guilty, 1 = Guilty
VOTE2 | Vote on obstruction of justice. | 0 = Not Guilty, 1 = Guilty
PARTY | Political party affiliation. | 0 = Democrat, 1 = Republican
CONSERV | Conservatism. Each senator’s degree of ideological conservatism is based on 1997 voting records as judged by the American Conservative Union, where the scores ranged from 0 to 100 and 100 is most conservative. |
REGION | U.S. Census region from which the senator comes. | 1 = Northeast, 2 = Midwest, 3 = South, 4 = West
SUPPORTC | State voter support for Clinton. The percent of the vote Clinton received in the 1996 Presidential election in the senator’s state. |
RELECT | Year the senator’s seat is up for re-election (1990, 1992, 1994) |
NEWBIE | First term senator? | 0 = No, 1 = Yes
Scatterplot Example: Describe the relationship between conservatism score and the vote on perjury
Interpreting the Correlation between Senators’ Conservatism and Their Vote on Perjury
• The correlation between VOTE1 and conservatism is r = .87, indicating a strong relationship between the two variables.
• The sign of the correlation is positive, so high scores on one variable are associated with high scores on the other.
• VOTE1 is coded with 0 (a relatively low score) representing not guilty and 1 (a relatively high score) representing guilty.
• Senators who are more conservative tended to vote guilty on perjury.
Scatterplot Example: Describe the relationship between conservatism score and the vote on obstruction of justice
Interpreting the Correlation between Senators’ Conservatism and Their Vote on Obstruction of Justice
• The correlation between VOTE2 and conservatism is r = .94, indicating a strong relationship between the two variables and a stronger relationship than that between VOTE1 and conservatism.
• The sign of the correlation is positive, so high scores on one variable are associated with high scores on the other.
• VOTE2 is coded with 0 (a relatively low score) representing not guilty and 1 (a relatively high score) representing guilty.
• Senators who are more conservative tended to vote guilty on obstruction of justice.
Relationships between Two Dichotomous Variables
Example: Is there a relationship between whether or not the senator is first-term and his or her vote on perjury?
Solutions via:
• Clustered bar graph
• Pearson
• Crosstabulation
Using SPSS to Obtain a Clustered Bar Graph
Click Graphs on the main menu bar, Legacy Dialogs, and Bar. Change from Simple to Clustered and click Define. Put VOTE1 in the Category Axis box and NEWBIE in the Define Clusters By box. Click OK.
Clustered Bar Graph
Using SPSS to Obtain the Contingency Table
To obtain the frequencies of each of the four cells (a contingency table or cross-tabulation), click Analyze on the main menu bar, Descriptive Statistics, Crosstabs. Put VOTE1 in the Row(s) box and NEWBIE in the Column(s) box. Click OK.
Contingency Table
Vote on Perjury * First-Term Senator? Crosstabulation (Count)
Vote on Perjury | First-Term: No | First-Term: Yes | Total
Not Guilty | 39 | 16 | 55
Guilty | 23 | 22 | 45
Total | 62 | 38 | 100
Contingency Table Analysis
First term senators tended to vote guilty and more established senators tended to vote not guilty.
Any of the following alternatives may be used to provide statistical support:
• Approximately 62.9 percent (39/62*100) of the non-first term senators voted not guilty whereas 42.1 percent (16/38*100) of the first term senators voted not guilty.
• Approximately 37.1 percent (23/62*100) of the non-first term senators voted guilty whereas 57.9 percent (22/38*100) of the first term senators voted guilty.
• Approximately 70.9 percent (39/55*100) of the not guilty votes came from non-first term senators whereas 51.1 percent (23/45*100) of the guilty votes came from non-first term senators.
• Approximately 29.1 percent (16/55*100) of the not guilty votes came from first term senators whereas 48.9 percent (22/45*100) of the guilty votes came from first term senators.
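The percentages quoted above are column percentages (within each first-term category) and row percentages (within each vote) of the 2 x 2 table. A minimal Python sketch, using the four cell counts from the crosstabulation and hypothetical helper names, reproduces them:

```python
# Cell counts from the Vote on Perjury by First-Term Senator crosstabulation.
# Outer keys are VOTE1 categories; inner keys are NEWBIE categories.
counts = {
    "Not Guilty": {"No": 39, "Yes": 16},
    "Guilty": {"No": 23, "Yes": 22},
}


def column_percentages(table):
    """Percent of each column total (here: within each first-term category)."""
    col_totals = {}
    for row in table.values():
        for col, n in row.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        r: {c: round(100 * n / col_totals[c], 1) for c, n in row.items()}
        for r, row in table.items()
    }


def row_percentages(table):
    """Percent of each row total (here: within each vote)."""
    return {
        r: {c: round(100 * n / sum(row.values()), 1) for c, n in row.items()}
        for r, row in table.items()
    }


col_pct = column_percentages(counts)
row_pct = row_percentages(counts)
print(col_pct)
print(row_pct)
```

Which set to report depends on the question: column percentages compare first-term and established senators, while row percentages describe who cast each kind of vote.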
Correlation Analysis
• The correlation between VOTE1 and NEWBIE is r = .20.
• The sign of the correlation is positive, so high scores on one variable are associated with high scores on the other.
• VOTE1 is coded with 0 (a relatively low score) representing not guilty and 1 (a relatively high score) representing guilty.
• NEWBIE is coded with 0 representing non-first term and 1 representing first term.
• First term senators tended to vote guilty on perjury and more established senators tended to vote not guilty.
• This special case of Pearson correlation is sometimes called the phi coefficient.
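As a check on the reported value, the phi coefficient can be computed directly from the four cell counts of the perjury-by-first-term table and compared with a Pearson correlation on the 100 individual 0/1 cases. The sketch below uses the standard 2 x 2 formula; the function names are hypothetical and the case-level data are rebuilt from the counts rather than read from the Impeach file.

```python
from math import sqrt


def phi_from_counts(n00, n01, n10, n11):
    """Phi for a 2 x 2 table; n_ij = count with first variable i and second variable j."""
    row0, row1 = n00 + n01, n10 + n11
    col0, col1 = n00 + n10, n01 + n11
    return (n11 * n00 - n10 * n01) / sqrt(row0 * row1 * col0 * col1)


def pearson_r(x, y):
    """Pearson r from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Counts indexed by (VOTE1, NEWBIE): 0 = Not Guilty / No, 1 = Guilty / Yes.
cells = {(0, 0): 39, (0, 1): 16, (1, 0): 23, (1, 1): 22}

phi = phi_from_counts(cells[(0, 0)], cells[(0, 1)], cells[(1, 0)], cells[(1, 1)])

# Rebuild one (vote, newbie) pair per senator and correlate the 0/1 codes.
vote1, newbie = [], []
for (v, nb), count in cells.items():
    vote1.extend([v] * count)
    newbie.extend([nb] * count)

r = pearson_r(vote1, newbie)
print(round(phi, 2), round(r, 2))
```

Both routes give the same number, which rounds to .20.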
Relationships between Other Variable Types
• Nominal non-dichotomous or ordinal with fewer than about five categories by dichotomous.
  • Example: Are there regional differences in how the senators tended to vote on obstruction of justice?
• Nominal non-dichotomous or ordinal with fewer than about five categories by scale.
  • Example: Are there regional differences in the typical conservatism score of the senators?
Clustered Bar Graph: Graphically Representing Vote on Obstruction vs Region
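[Figure: clustered bar graph with REGION on the x-axis (Northeast, Midwest, South, West), Count on the y-axis (0 to 20), and bars clustered by Vote on Obstruction (Not Guilty, Guilty).]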
Contingency Table: Tabulating Vote on Obstruction of Justice by Region
Vote on Obstruction of Justice * REGION Crosstabulation (Count)
Vote on Obstruction of Justice | Northeast | Midwest | South | West | Total
Not Guilty | 15 | 12 | 13 | 10 | 50
Guilty | 3 | 12 | 19 | 16 | 50
Total | 18 | 24 | 32 | 26 | 100
Contingency Table Analysis
• Senators from the northeast tended to vote not guilty, while those from the south and west tended to vote guilty and those from the midwest were equally likely to vote guilty or not guilty.
• In particular, approximately 83.3 percent (15/18*100) of the senators from the northeast voted not guilty whereas 50.0 percent (12/24*100) from the midwest, 40.6 percent (13/32*100) from the south, and 38.5 percent (10/26*100) from the west voted not guilty.
• Alternatively, in terms of voting guilty, approximately 16.7 percent (3/18*100) of the senators from the northeast voted guilty whereas 50.0 percent (12/24*100) from the midwest, 59.4 percent (19/32*100) from the south, and 61.5 percent (16/26*100) from the west voted guilty.
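The within-region percentages can be reproduced from the crosstabulation counts with a short Python sketch. The variable names are hypothetical; each count is divided by its region total and multiplied by 100.

```python
# Counts from the Vote on Obstruction of Justice by REGION crosstabulation.
regions = ["Northeast", "Midwest", "South", "West"]
not_guilty = [15, 12, 13, 10]
guilty = [3, 12, 19, 16]

pct_not_guilty = {}
pct_guilty = {}
for region, ng, g in zip(regions, not_guilty, guilty):
    total = ng + g  # number of senators from this region
    pct_not_guilty[region] = round(100 * ng / total, 1)
    pct_guilty[region] = round(100 * g / total, 1)

# Regions ordered from lowest to highest percentage voting guilty.
ordered = sorted(regions, key=lambda region: pct_guilty[region])

print(pct_not_guilty)
print(pct_guilty)
print(ordered)
```

Overall the vote split 50 to 50, so a region's distance from 50 percent shows how far it departs from the Senate as a whole.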
Boxplots: Graphically Representing Conservatism Score by Region
Compare Means or Medians: Comparing Conservatism Scores by Region
Analysis Based on Medians
• Because the data are noticeably skewed for the northeast region, a more appropriate comparison of conservatism across regions is via the median, although results based on the means in this example yield the same result.
• According to the values of the median, the most conservative senators come from the south (72), followed by the west (64), the midwest (50), and the northeast (19.5).
Selection
• The table on the following slide provides guidelines for choosing the appropriate statistic(s) and graphs for describing the relationship between two variables.
• Other combinations may be correct.
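Levels of Measurement

Nominal with 2 categories (dichotomous), paired with:
• Nominal with 2 categories (dichotomous): Pearson correlation or percentages from crosstabulation and clustered bar graph
• Nominal with more than two categories or ordinal with more than two categories, but not more than five categories: Percentages from crosstabulation and clustered bar graph
• Ordinal with five or more categories: Spearman correlation and interactive scatterplot or boxplot
• Scale: Pearson correlation and interactive scatterplot or boxplot

Nominal with more than two categories or ordinal with more than two categories, but not more than five categories, paired with:
• Nominal with 2 categories (dichotomous): Percentages from crosstabulation and clustered bar graph
• Nominal with more than two categories or ordinal with more than two categories, but not more than five categories: Percentages from crosstabulation and clustered bar graph
• Ordinal with five or more categories: Medians or Spearman correlation and interactive scatterplot or boxplot
• Scale: Means or medians and interactive scatterplot or boxplot

Ordinal with five or more categories, paired with:
• Nominal with 2 categories (dichotomous): Spearman correlation and interactive scatterplot or boxplot
• Nominal with more than two categories or ordinal with more than two categories, but not more than five categories: Medians or Spearman correlation and interactive scatterplot or boxplot
• Ordinal with five or more categories: Spearman correlation and scatterplot
• Scale: Spearman correlation and scatterplot

Scale, paired with:
• Nominal with 2 categories (dichotomous): Pearson correlation and interactive scatterplot or boxplot
• Nominal with more than two categories or ordinal with more than two categories, but not more than five categories: Means or medians and interactive scatterplot or boxplot
• Ordinal with five or more categories: Spearman correlation and scatterplot
• Scale: Pearson correlation and scatterplot. Correlation should not be used unless scatterplot is well represented by a line.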
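Because the guidelines are symmetric in the two variables, they can be stored once per unordered pair and looked up in either order. The sketch below is one possible encoding; the short level names (`dichotomous`, `few_categories`, `ordinal_5_plus`, `scale`) and the function `recommend` are hypothetical labels introduced for illustration, and the recommendation strings are copied from the guidelines above.

```python
# Short, hypothetical codes for the four levels of measurement, in table order.
LEVELS = ("dichotomous", "few_categories", "ordinal_5_plus", "scale")

# One entry per unordered pair of levels (the guideline table is symmetric).
GUIDE = {
    ("dichotomous", "dichotomous"):
        "Pearson correlation or percentages from crosstabulation and clustered bar graph",
    ("dichotomous", "few_categories"):
        "Percentages from crosstabulation and clustered bar graph",
    ("dichotomous", "ordinal_5_plus"):
        "Spearman correlation and interactive scatterplot or boxplot",
    ("dichotomous", "scale"):
        "Pearson correlation and interactive scatterplot or boxplot",
    ("few_categories", "few_categories"):
        "Percentages from crosstabulation and clustered bar graph",
    ("few_categories", "ordinal_5_plus"):
        "Medians or Spearman correlation and interactive scatterplot or boxplot",
    ("few_categories", "scale"):
        "Means or medians and interactive scatterplot or boxplot",
    ("ordinal_5_plus", "ordinal_5_plus"):
        "Spearman correlation and scatterplot",
    ("ordinal_5_plus", "scale"):
        "Spearman correlation and scatterplot",
    ("scale", "scale"):
        "Pearson correlation and scatterplot (only if the scatterplot is well represented by a line)",
}


def recommend(level_a, level_b):
    """Return the suggested statistic(s) and graph for a pair of measurement levels."""
    for level in (level_a, level_b):
        if level not in LEVELS:
            raise ValueError(f"unknown level of measurement: {level!r}")
    # Sort the pair into table order so either argument order finds the same entry.
    key = tuple(sorted((level_a, level_b), key=LEVELS.index))
    return GUIDE[key]


# Example: conservatism score (scale) by region (four nominal categories).
print(recommend("scale", "few_categories"))
```

As the slide notes, other combinations may also be correct, so the lookup is a starting point rather than a rule.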