Bivariate Relationships

Chapter 5

Statistics Using SPSS: An Integrative Approach

SECOND EDITION

SHARON LAWNER WEINBERG, SARAH KNAPP ABRAMOWITZ

Summarizing the Relationship Between Two Variables: An Overview

Variable Types | Summary Graphic | Summary Statistic
Both Scale | Scatterplot | Pearson Correlation
Both Ordinal | Scatterplot | Spearman Correlation
An Ordinal & A Scale | Scatterplot | Spearman Correlation
A Scale & A Dichotomy | Scatterplot or Boxplot | Pearson (point biserial) Correlation
Both Dichotomies | Clustered Bar Graph | Pearson (phi-coefficient) Correlation or Contingency Table

The Relationship Between Two Scale Variables: What the Scatterplot Tells Us

• Whether the relationship appears linear

• If it does appear linear, it also tells us:

  • The direction and nature of the linear relationship

  • The relative strength of the linear relationship

Overview: Examples Using the Scatterplot to Describe the Relationship Between Two Scale Variables

• Hamburg data set: FAT and CALORIES.

• States data set: PERTAK (percentage of eligible students taking the SAT) and SATV (average verbal SAT score for the state).

• Currency data set: BILLVALUE (bill denomination) and the number of bills in circulation.

• Marijuana data set: YEAR and the percentage of students reporting that they ever smoked marijuana from 1987-1999.

Creating the Scatterplot

Example: Using the Hamburg data set, describe the relationship between the FAT and the CALORIES of a burger.

Solution: To obtain the scatterplot between FAT and CALORIES for the Hamburg data set, using SPSS, go to Graphs on the main menu bar, Legacy Dialogs, and then Scatter. Click Define. Put CALORIES into the box labeled y-axis and FAT into the box labeled x-axis and click OK.

Scatterplot: FAT vs. CALORIES

[Figure: scatterplot for McDonald's Hamburgers. x-axis: Grams of Fat (0 to 40); y-axis: CALORIES (200 to 600).]

Interpreting the Scatterplot of FAT vs. CALORIES

• A line appears to fit the data well; i.e., there is not a simple curve that would provide a better fit, so a linear model is appropriate.

• The direction of the linear relationship is positive because the slope of the line representing the data is positive. The nature of the linear relationship is that burgers that are relatively high in fat tend also to be relatively high in calories.

• The strength of the linear relationship appears to be strong because the points cluster tightly around the line.

Editing the Scatterplot to Label Points

Go to Graphs on the main menu bar, Legacy Dialogs, and then Scatter. Click Define. Put CALORIES into the box labeled y-axis, FAT into the box labeled x-axis, and NAME in the box labeled label cases by and click OK. Double click on the graph to put it in the Chart Editor. Click on Elements, Show Data Labels. Move Name to the Displayed box and eliminate count. Click Apply, Close.

Labeled Scatterplot: FAT vs CALORIES

Scatterplot: PERTAK vs. SATV

[Figure: scatterplot for U.S. College-Bound Students. x-axis: Percentage of Eligible Students Taking the SAT (0 to 100); y-axis: Average SAT verbal in 1997 (460 to 600).]

Interpreting the Scatterplot of SATV vs. PERTAK

• Although the points have a curvilinear shape, a line would appear to represent these points reasonably well, and so we will use it in this case.

• The direction of the linear relationship is negative because the slope of the line representing the data is negative. The nature is that states with a relatively low percentage of students taking the SAT tend to have higher SAT Verbal scores, on average.

• The strength of the linear relationship is more moderate than for the hamburger example because the points in this case do not cluster as tightly around the line.

Scatterplot: Denomination (BILLVALUE) vs. number of bills in circulation.

Note: Use Transform, Compute to combine variables to create a variable for the number of bills in circulation.

[Figure: scatterplot for United States Currency. x-axis: Denomination (0 to 120); y-axis: NUMBER (0 to 7,000,000,000).]

Interpreting the Scatterplot of BILLVALUE vs. NUMBER

• Because the points have a “cloud like” formation, neither a simple curve nor a line is a good fit for these data.

• We conclude that there is little or no relationship between the bill value and the number in circulation.

Scatterplot: Year vs. percentage of high school seniors reporting that they smoked marijuana at least once: 1987-1999.

Note: Use Select Cases to restrict to the appropriate years.

[Figure: scatterplot. x-axis: YEAR (1985 to 2000); y-axis: Percent Used Marijuana (32 to 48).]

Interpreting the Scatterplot of YEAR vs. MARIJUANA

• A simple curve (or two lines) provides a better fit for the data than a single line and is therefore more appropriate than a line for modeling the data.

• The relationship between marijuana use and year is non-linear.

Quantifying the Linear Relationship between Two Scale Variables: Pearson Product Moment Correlation Coefficient

• Often called, simply, correlation, and symbolized by the letter r.

• Before calculating, use a scatterplot to verify that the relationship between the variables appears to be linear.

• Calculated as the average of the product of the z-scores.

• This summary statistic measures the direction, nature, and strength of the linear relationship.

• Direction: Look at the sign of r (positive or negative).

• Nature: Look at the sign of r (positive means that high scores on one variable correspond to high scores on the other and low with low; negative means that low scores on one variable correspond to high scores on the other and vice versa).

• Strength: Look at the magnitude (absolute value) of r. In the social sciences, a good rule of thumb comes from Cohen’s scale: |r| < .1, little or no; .1 <= |r| < .3, weak; .3 <= |r| < .5, moderate; |r| >= .5, strong.
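The "average of the product of the z-scores" definition can be sketched in a few lines of Python. This is only an illustration, not the SPSS procedure: the function names are made up for this example, the data values are hypothetical (they are not the Hamburg data set), and the strength labels assume cutoffs of .1, .3, and .5 applied to the absolute value of r. The sketch divides by n throughout; dividing by n - 1 in both the standard deviations and the average gives the same r.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson r computed as the average product of paired z-scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Standard deviations that divide by n, so the plain average of the
    # z-score products equals r.
    sd_x = sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    sd_y = sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    products = [((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
                for a, b in zip(x, y)]
    return sum(products) / n


def cohen_strength(r):
    """Label the strength of r using assumed cutoffs of .1, .3, and .5."""
    size = abs(r)
    if size < 0.1:
        return "little or no"
    if size < 0.3:
        return "weak"
    if size < 0.5:
        return "moderate"
    return "strong"


# Hypothetical values for illustration only (not the Hamburg data set).
fat = [9, 13, 21, 30, 31]
calories = [270, 320, 430, 530, 560]
r = pearson_r(fat, calories)
print(round(r, 3), cohen_strength(r))
```

The sign of the returned value gives the direction and nature; the label gives the strength.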

Obtaining the Pearson Correlation Using SPSS

To use SPSS to obtain the correlation coefficient between CALORIES and FAT, click Analyze on the Main Menu Bar, Correlate, and Bivariate. Move the two variables, CALORIES and FAT, into the Variables box and click OK.

Interpreting the Pearson Correlation Coefficients

• The correlation between FAT and CALORIES is .997 indicating a very strong positive linear relationship: burgers that are relatively high in fat tend also to be relatively high in calories and burgers that are relatively low in fat tend also to be relatively low in calories.

• The correlation between SAT Verbal and the percentage of students taking the SAT is -.86 indicating a strong negative linear relationship: states that have a relatively high verbal SAT average tend to have a relatively low percentage of students taking the SAT and states that have a relatively low verbal SAT average tend to have a relatively high percentage of students taking the SAT.

Other Properties of Correlation

• The strength of the correlation is measured on an ordinal scale.

• Correlation does not imply causation; i.e., when two variables are correlated, it is not necessarily true that changing one will result in a predictable change in the other.

• A linear transformation applied to one variable does not change the magnitude of the correlation. The sign of the correlation will change, however, if the transformation involves multiplication by a negative number.

• Restricting the range of one of the variables can increase or decrease the magnitude of the correlation.
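The linear transformation property can be checked numerically. The following sketch uses made-up values and a helper function written for this illustration; it rescales one variable with a positive multiplier and then with a negative multiplier and compares the resulting correlations.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

r_original = pearson_r(x, y)
# Positive multiplier: magnitude and sign are both unchanged.
r_rescaled = pearson_r([2 * v + 10 for v in x], y)
# Negative multiplier: same magnitude, opposite sign.
r_flipped = pearson_r([-3 * v + 7 for v in x], y)

print(r_original, r_rescaled, r_flipped)
```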

Relationships between Two Ordinal Variables or One Ordinal and One Scale Variable: Scatterplot and Spearman Rank Correlation Coefficient

The Spearman correlation, called Spearman’s rho, is a special case of the Pearson correlation computed on ranked data.
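To make "Pearson correlation computed on ranked data" concrete, here is a minimal Python sketch. It assumes the common convention that tied values share the average of their ranks; the function names and the data are hypothetical and are not taken from the NELS data set.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two sets of ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical data: y always increases with x, but not along a line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(spearman_rho(x, y), pearson_r(x, y))
```

Because only the ordering of the values matters, rho reaches 1 for any strictly increasing relationship, even when the Pearson correlation of the raw values is below 1.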

Example: Describe the relationship, or indicate that there is not one, between the amount of time spent in school on homework (HWKIN12) and the amount of time spent out of school on homework (HWKOUT12) in twelfth grade for students in the NELS data set.

Scatterplot: HWKIN12 and HWKOUT12

Obtaining the Spearman Rank Correlation Coefficient Using SPSS

Click Analyze, Correlate, Bivariate. Move the variables HWKIN12 and HWKOUT12 into the Variables box. Click Spearman and click off Pearson in the Correlation Coefficients box. Click OK. Note that when using SPSS, we do not need to transform the data to rankings to obtain the Spearman correlation coefficient. SPSS does this transformation for us.

Interpreting the Spearman Rank Correlation Coefficient

• The Spearman correlation is interpreted in the same way as the Pearson correlation.

• In this case, Spearman’s rho = .40, indicating a moderate positive relationship.

• Twelfth grade students in the NELS data set who spend a relatively large amount of time doing homework in school tend also to spend a relatively large amount of time doing homework outside of school, and students who spend a relatively small amount of time doing homework in school tend also to spend a relatively small amount of time doing homework outside of school.

Relationships between One Scale and One Dichotomous Variable

Example using the Hamburg data set: Describe the relationship between calories and cheese.

Interpreting the Correlation When One Variable is Scale and One is Dichotomous

• The correlation between CALORIES and CHEESE is r = .51.

• The correlation is positive, indicating that high scores on one variable are associated with high scores on the other.

• CHEESE is coded with 0 (a relatively low score) representing the absence of cheese and 1 (a relatively high score) representing the presence of cheese.

• Burgers with cheese tend to be higher in calories than those without cheese.

• This special case of Pearson correlation is sometimes called the point biserial correlation.
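A short sketch can show why a positive correlation with a 0/1 variable means the group coded 1 has the higher mean. It compares the ordinary Pearson calculation with the standard mean-difference form of the point biserial correlation. The function names and data values are hypothetical; they are not the Hamburg data set.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def point_biserial(scores, group):
    """Mean-difference form: (M1 - M0) / SD * sqrt(p * q), with SD using n."""
    n = len(scores)
    ones = [s for s, g in zip(scores, group) if g == 1]
    zeros = [s for s, g in zip(scores, group) if g == 0]
    mean_all = sum(scores) / n
    sd_all = sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    p, q = len(ones) / n, len(zeros) / n
    mean_diff = sum(ones) / len(ones) - sum(zeros) / len(zeros)
    return mean_diff / sd_all * sqrt(p * q)


# Hypothetical burgers for illustration: 1 = cheese, 0 = no cheese.
calories = [270, 320, 430, 530, 420, 590]
cheese = [0, 1, 0, 1, 0, 1]

r_pearson = pearson_r(calories, cheese)
r_pb = point_biserial(calories, cheese)
print(round(r_pearson, 3), round(r_pb, 3))
```

The two values agree, and the sign is positive exactly when the mean for the group coded 1 exceeds the mean for the group coded 0.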

Description of the Impeach Data Set

• On February 12, 1999, for only the second time in the nation’s history, the U.S. Senate voted on whether to remove a President, based on impeachment articles passed by the U.S. House.

• Dozens of political talk shows featured analyses of why senators may have voted the way they did, but such discourse was rarely (if ever) informed by systematic statistical analysis of the votes.

• Professor Alan Reifman of Texas Tech University created this data set about the senators to be used as part of such an analysis. The relevant variable descriptions appear in the following table.

Variables in the Impeach Data Set

Variable Name | Variable Label | Value Label
VOTE1 | Vote on perjury. | 0 = Not Guilty; 1 = Guilty
VOTE2 | Vote on obstruction of justice. | 0 = Not Guilty; 1 = Guilty
PARTY | Political party affiliation. | 0 = Democrat; 1 = Republican
CONSERV | Conservatism. Each senator’s degree of ideological conservatism is based on 1997 voting records as judged by the American Conservative Union, where the scores ranged from 0 to 100 and 100 is most conservative. |
REGION | U.S. Census region from which the senator comes. | 1 = Northeast; 2 = Midwest; 3 = South; 4 = West
SUPPORTC | State voter support for Clinton. The percent of the vote Clinton received in the 1996 Presidential election in the senator’s state. |
RELECT | Year the senator’s seat is up for re-election (1990, 1992, 1994) |
NEWBIE | First term senator? | 0 = No; 1 = Yes

Scatterplot Example: Describe the relationship between conservatism score and the vote on perjury

Interpreting the Correlation between Senators’ Conservatism and Their Vote on Perjury

• The correlation between VOTE1 and conservatism is r = .87, indicating a strong relationship between the two variables.

• The sign of the correlation is positive, so high scores on one variable are associated with high scores on the other.

• VOTE1 is coded with 0 (a relatively low score) representing not guilty and 1 (a relatively high score) representing guilty.

• Senators who are more conservative tended to vote guilty on perjury.

Scatterplot Example: Describe the relationship between conservatism score and the vote on obstruction of justice

Interpreting the Correlation between Senators’ Conservatism and Their Vote on Obstruction of Justice

• The correlation between VOTE2 and conservatism is r = .94, indicating a strong relationship between the two variables and a stronger relationship than that between VOTE1 and conservatism.

• The sign of the correlation is positive, so high scores on one variable are associated with high scores on the other.

• VOTE2 is coded with 0 (a relatively low score) representing not guilty and 1 (a relatively high score) representing guilty.

• Senators who are more conservative tended to vote guilty on obstruction of justice.

Relationships between Two Dichotomous Variables

Example: Is there a relationship between whether or not the senator is first-term and his or her vote on perjury?

Solutions via:

• Clustered bar graph

• Pearson correlation

• Crosstabulation

Using SPSS to Obtain a Clustered Bar Graph

Click Graphs on the main menu bar, Legacy Dialogs, and Bar. Change from Simple to Clustered and click Define. Put VOTE1 in the Category Axis box and NEWBIE in the Define Clusters By box. Click OK.

Clustered Bar Graph

Using SPSS to Obtain the Contingency Table

To obtain the frequencies of each of the four cells (a contingency table or cross-tabulation), click Analyze on the main menu bar, Descriptive Statistics, Crosstabs. Put VOTE1 in the Row(s) box and NEWBIE in the Column(s) box. Click OK.

Contingency Table

Vote on Perjury * First-Term Senator? Crosstabulation (Count)

Vote on Perjury | First-Term Senator? No | First-Term Senator? Yes | Total
Not Guilty | 39 | 16 | 55
Guilty | 23 | 22 | 45
Total | 62 | 38 | 100

Contingency Table Analysis

First term senators tended to vote guilty and more established senators tended to vote not guilty.

Any of the following alternatives may be used to provide statistical support:

• Approximately 62.9 percent (39/62*100) of the non-first term senators voted not guilty whereas 42.1 percent (16/38*100) of the first term senators voted not guilty.

• Approximately 37.1 percent (23/62*100) of the non-first term senators voted guilty whereas 57.9 percent (22/38*100) of the first term senators voted guilty.

• Approximately 70.9 percent (39/55*100) of the not guilty votes came from non-first term senators whereas 51.1 percent (23/45*100) of the guilty votes came from non-first term senators.

• Approximately 29.1 percent (16/55*100) of the not guilty votes came from first term senators whereas 48.9 percent (22/45*100) of the guilty votes came from first term senators.
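The percentages quoted above come from dividing each cell count either by its column total (within first-term status) or by its row total (within vote). A minimal Python sketch of that arithmetic, using the four counts from the crosstabulation, follows; the variable names are illustrative only.

```python
# Cell counts from the Vote on Perjury by First-Term Senator crosstabulation.
counts = {
    ("Not Guilty", "No"): 39, ("Not Guilty", "Yes"): 16,
    ("Guilty", "No"): 23, ("Guilty", "Yes"): 22,
}
votes = ["Not Guilty", "Guilty"]
first_term = ["No", "Yes"]

row_totals = {v: sum(counts[(v, f)] for f in first_term) for v in votes}
column_totals = {f: sum(counts[(v, f)] for v in votes) for f in first_term}

# Percent of each first-term group casting each vote (divide by column total).
column_pct = {(v, f): round(100 * counts[(v, f)] / column_totals[f], 1)
              for v in votes for f in first_term}
# Percent of each vote coming from each first-term group (divide by row total).
row_pct = {(v, f): round(100 * counts[(v, f)] / row_totals[v], 1)
           for v in votes for f in first_term}

for key in counts:
    print(key, "column %:", column_pct[key], "row %:", row_pct[key])
```

Column percentages answer "how did each group of senators vote?", while row percentages answer "who cast each kind of vote?".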

Correlation Analysis

• The correlation between VOTE1 and NEWBIE is r = .20.

• The sign of the correlation is positive, so high scores on one variable are associated with high scores on the other.

• VOTE1 is coded with 0 (a relatively low score) representing not guilty and 1 (a relatively high score) representing guilty.

• NEWBIE is coded with 0 representing non-first term and 1 representing first term.

• First term senators tended to vote guilty on perjury and more established senators tended to vote not guilty.

• This special case of Pearson correlation is sometimes called the phi coefficient.
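The value r = .20 can be recovered directly from the four cell counts of the crosstabulation. The sketch below uses the standard 2 x 2 formula for phi and, as a check, also computes the Pearson correlation on the 100 pairs of 0/1 codes rebuilt from those counts. The function names are made up for this illustration.

```python
from math import sqrt


def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table with first row (a, b) and second row (c, d)."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    sxx = sum((p - mean_x) ** 2 for p in x)
    syy = sum((q - mean_y) ** 2 for q in y)
    return sxy / sqrt(sxx * syy)


# Rows: VOTE1 = 0 (not guilty), 1 (guilty); columns: NEWBIE = 0 (no), 1 (yes).
phi = phi_coefficient(39, 16, 23, 22)

# Rebuild the 100 (VOTE1, NEWBIE) pairs from the cell counts.
pairs = [(0, 0)] * 39 + [(0, 1)] * 16 + [(1, 0)] * 23 + [(1, 1)] * 22
vote1 = [v for v, _ in pairs]
newbie = [n for _, n in pairs]
r = pearson_r(vote1, newbie)

print(round(phi, 2), round(r, 2))
```

Both calculations give the same value, which is why phi is described as a special case of the Pearson correlation.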

Relationships between Other Variable Types

• Nominal non-dichotomous or ordinal with fewer than about five categories by dichotomous.

  • Example: Are there regional differences in how the senators tended to vote on obstruction of justice?

• Nominal non-dichotomous or ordinal with fewer than about five categories by scale.

  • Example: Are there regional differences in the typical conservatism score of the senators?

Clustered Bar Graph: Graphically Representing Vote on Obstruction vs Region

[Figure: clustered bar graph. x-axis: REGION (Northeast, Midwest, South, West); y-axis: Count (0 to 20); clusters: Vote on Obstruction (Not Guilty, Guilty).]

Contingency Table: Tabulating Vote on Obstruction of Justice by Region

Vote on Obstruction of Justice * REGION Crosstabulation (Count)

Vote on Obstruction of Justice | Northeast | Midwest | South | West | Total
Not Guilty | 15 | 12 | 13 | 10 | 50
Guilty | 3 | 12 | 19 | 16 | 50
Total | 18 | 24 | 32 | 26 | 100

Contingency Table Analysis

• Senators from the northeast tended to vote not guilty, while those from the south and west tended to vote guilty and those from the midwest were equally likely to vote guilty or not guilty.

• In particular, approximately 83.3 percent (15/18*100) of the senators from the northeast voted not guilty whereas 50.0 percent (12/24*100) from the midwest, 40.6 percent (13/32*100) from the south, and 38.5 percent (10/26*100) from the west voted not guilty.

• Alternatively, in terms of voting guilty, approximately 16.7 percent (3/18*100) of the senators from the northeast voted guilty whereas 50.0 percent (12/24*100) from the midwest, 59.4 percent (19/32*100) from the south, and 61.5 percent (16/26*100) from the west voted guilty.
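The regional percentages are each region's count divided by that region's total and multiplied by 100. A brief Python sketch of the calculation, using the counts from the crosstabulation (variable names are illustrative only):

```python
# Counts from the Vote on Obstruction of Justice by REGION crosstabulation.
regions = ["Northeast", "Midwest", "South", "West"]
not_guilty = [15, 12, 13, 10]
guilty = [3, 12, 19, 16]

pct_not_guilty = {}
pct_guilty = {}
for region, ng, g in zip(regions, not_guilty, guilty):
    total = ng + g  # number of senators from this region
    pct_not_guilty[region] = round(100 * ng / total, 1)
    pct_guilty[region] = round(100 * g / total, 1)

for region in regions:
    print(region, pct_not_guilty[region], pct_guilty[region])
```

Within each region the two percentages sum to 100, so either set alone is enough to describe the regional differences.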

Boxplots: Graphically Representing Conservatism Score by Region

Compare Means or Medians: Comparing Conservatism Scores by Region

Analysis Based on Medians

• Because the data are noticeably skewed for the northeast region, a more appropriate comparison of conservatism across regions is via the median, although results based on the means in this example yield the same result.

• According to the values of the median, the most conservative senators come from the south (72), followed by the west (64), the midwest (50), and the northeast (19.5).
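The preference for the median with skewed data can be illustrated with a small example. The scores below are invented for this sketch (they are not the senators' conservatism scores); a few very high values pull the mean well above the typical score, while the median stays near the middle of the bulk of the data.

```python
from statistics import mean, median

# Hypothetical, positively skewed scores for illustration only.
scores = [4, 8, 12, 16, 20, 24, 88, 96]

print("mean:", mean(scores))
print("median:", median(scores))
```

When the two summaries rank the groups in the same order, as the slide reports for the regions, either may be reported, but the median is the safer description of a skewed group.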

Selection

• The table on the following slide provides guidelines for choosing the appropriate statistic(s) and graphs for describing the relationship between two variables.

• Other combinations may be correct.

Levels of Measurement | Nominal with 2 categories (dichotomous) | Nominal with more than two categories or ordinal with more than two categories, but not more than five categories | Ordinal with five or more categories | Scale
Nominal with 2 categories (dichotomous) | Pearson correlation or percentages from crosstabulation and clustered bar graph | Percentages from crosstabulation and clustered bar graph | Spearman correlation and interactive scatterplot or boxplot | Pearson correlation and interactive scatterplot or boxplot
Nominal with more than two categories or ordinal with more than two categories, but not more than five categories | | | Medians or Spearman correlation and interactive scatterplot or boxplot | Means or medians and interactive scatterplot or boxplot
Ordinal with five or more categories | Spearman correlation and interactive scatterplot or boxplot | Medians or Spearman correlation and interactive scatterplot or boxplot | Spearman correlation and scatterplot |
Scale | Pearson correlation and interactive scatterplot or boxplot | Means or medians and interactive scatterplot or boxplot | | Pearson correlation and scatterplot. Correlation should not be used unless scatterplot is well represented by a line.
