
Measures of Relationship

Chapter 5 of the textbook introduced you to the two most widely used measures of relationship: the Pearson product-moment correlation and the Spearman rank-order correlation. We will be covering these statistics in this section, as well as other measures of relationship among variables.

What is a Relationship?

Correlation coefficients are measures of the degree of relationship between two or more variables. When we talk about a relationship, we are talking about the manner in which the variables tend to vary together. For example, if one variable tends to increase at the same time that another variable increases, we would say there is a positive relationship between the two variables. If one variable tends to decrease as another variable increases, we would say that there is a negative relationship between the two variables. It is also possible that the variables might be unrelated to one another, so that there is no predictable change in one variable based on knowing about changes in the other variable.

As a child grows from an infant into a toddler into a young child, both the child's height and weight tend to change. Those changes are not always tightly locked to one another, but they do tend to occur together. So if we took a sample of children from a few weeks old to 3 years old and measured the height and weight of each child, we would likely see a positive relationship between the two.

A relationship between two variables does not necessarily mean that one variable causes the other. When we see a relationship, there are three possible causal interpretations. If we label the variables A and B, A could cause B, B could cause A, or some third variable (we will call it C) could cause both A and B. With the relationship between height and weight in children, it is likely that the general growth of children, which increases both height and weight, accounts for the observed correlation. It is very foolish to assume that the presence of a correlation implies a causal relationship between the two variables. There is an extended discussion of this issue in Chapter 7 of the text.

Scatter Plots and Linear Relationships

A helpful way to visualize a relationship between two variables is to construct a scatter plot, which you were briefly introduced to in our discussion of graphical techniques. A scatter plot represents each set of paired scores on a two-dimensional graph, in which the dimensions are defined by the variables. For example, if we wanted to create a scatter plot of our sample of 100 children for the variables of height and weight, we would start by drawing the X and Y axes, labeling one height and the other weight, and marking off the scales so that the range on these axes is sufficient to handle the range of scores in our sample. Let's suppose that our first child is 27 inches tall and 21 pounds. We would find the point on the weight axis that represents 21 pounds and the point on the height axis that represents 27 inches. Where these two points cross, we would put a dot that represents the combination of height and weight for that child, as shown in the figure below.
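The original figure is not reproduced here, but a minimal Python/matplotlib sketch of how such a plot can be drawn is shown below; the height and weight values are invented for illustration.

import matplotlib.pyplot as plt

# Hypothetical height (inches) and weight (pounds) pairs for a few children
height = [27, 29, 31, 33, 35, 38, 40]
weight = [21, 24, 26, 30, 33, 36, 41]

plt.scatter(height, weight)          # one dot per child at (height, weight)
plt.xlabel("Height (inches)")
plt.ylabel("Weight (pounds)")
plt.title("Scatter plot of height and weight")
plt.show()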

We then continue the process for all of the other children in our sample, which might produce the scatter plot illustrated below.


It is always a good idea to produce scatter plots for the correlations that you compute as part of your research. Most will look like the scatter plot above, suggesting a linear relationship. Others will show a distribution that is less organized and more scattered, suggesting a weak relationship between the variables. But on rare occasions, a scatter plot will indicate a relationship that is not a simple linear relationship, but rather shows a complex relationship that changes at different points in the scatter plot. The scatter plot below illustrates a nonlinear relationship, in which Y increases as X increases, but only up to a point; after that point, the relationship reverses direction. Using a simple correlation coefficient for such a situation would be a mistake, because the correlation cannot capture accurately the nature of a nonlinear relationship.


Pearson Product-Moment Correlation

The Pearson product-moment correlation was devised by Karl Pearson in 1895, and it is still the most widely used correlation coefficient. The history behind the mathematical development of this index is fascinating, and those interested in that history can click on the link. But you need not know that history to understand how the Pearson correlation works.

The Pearson product-moment correlation is an index of the degree of linear relationship between two variables that are both measured on at least an interval scale of measurement. The index is structured so that a correlation of 0.00 means that there is no linear relationship, a correlation of +1.00 means that there is a perfect positive relationship, and a correlation of -1.00 means that there is a perfect negative relationship. As you move from zero to either end of this scale, the strength of the relationship increases. You can think of the strength of a linear relationship as how tightly the data points in a scatter plot cluster around a straight line. In a perfect relationship, either negative or positive, the points all fall on a single straight line. We will see examples of that later. The symbol for the Pearson correlation is a lowercase r, which is often subscripted with the two variables. For example, rXY would stand for the correlation between the variables X and Y.


The Pearson product-moment correlation was originally defined in terms of Z-scores. In fact, you can compute the product-moment correlation as the average cross-product of the Z-scores, as shown in the first equation below. But that equation is difficult to use for computations. The more commonly used equation now is the second equation below. Although it looks much more complicated, this second equation is by far the easier of the two to use if you are doing the computations with nothing but a calculator.
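In standard notation, the two equations are:

rXY = Σ (ZX × ZY) / N

rXY = [ N ΣXY − (ΣX)(ΣY) ] / sqrt( [ N ΣX² − (ΣX)² ] × [ N ΣY² − (ΣY)² ] )

where N is the number of pairs of scores and ZX and ZY are the Z-scores on the two variables.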

You can learn how to compute the Pearson product-moment correlation either by hand or using SPSS for Windows by clicking on one of the buttons below. Use the browser's return arrow key to return to this page.
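If you would rather check your hand computations in Python than in SPSS, the scipy library (not part of the original text) returns the same coefficient; the height and weight values below are invented for illustration.

from scipy import stats

# Hypothetical height (inches) and weight (pounds) for a small sample of children
height = [27, 29, 31, 33, 35, 38, 40]
weight = [21, 24, 26, 30, 33, 36, 41]

r, p = stats.pearsonr(height, weight)   # Pearson r and its two-tailed p-value
print(f"r = {r:.3f}, p = {p:.4f}")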

Spearman Rank-Order Correlation

The Spearman rank-order correlation provides an index of the degree of monotonic relationship between two variables that are both measured on at least an ordinal scale of measurement. If one of the variables is on an ordinal scale and the other is on an interval or ratio scale, it is always possible to convert the interval or ratio scale to an ordinal scale. That process is discussed in the section showing you how to compute this correlation by hand.

The Spearman correlation has the same range as the Pearson correlation, and the numbers mean the same thing. A zero correlation means that there is no relationship, whereas correlations of +1.00 and -1.00 mean that there are perfect positive and negative relationships, respectively. The formula for computing this correlation is shown below. Traditionally, the lowercase r with a subscript s is used to designate the Spearman correlation (i.e., rs). The one term in the formula that is not familiar to you is d, which is equal to the difference in the ranks for the two variables. This is explained in more detail in the section that covers the manual computation of the Spearman rank-order correlation.
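The formula is:

rs = 1 − ( 6 Σd² ) / ( n (n² − 1) )

where n is the number of pairs of scores. (This version of the formula assumes there are no, or only a few, tied ranks.)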

The Phi Coefficient

The Phi coefficient is an index of the degree of relationship between two variables that are measured on a nominal scale. Because variables measured on a nominal scale are simply classified by type, rather than measured in the more general sense, there is no such thing as a linear relationship. Nevertheless, it is possible to see if there is a relationship. For example, suppose you want to study the relationship between religious background and occupation. You have a classification system for religion that includes Catholic, Protestant, Muslim, Other, and Agnostic/Atheist. You have also developed a classification for occupations that includes Unskilled Laborer, Skilled Laborer, Clerical, Middle Manager, Small Business Owner, and Professional/Upper Management. You want to see if the distribution of religious preferences differs by occupation, which is just another way of saying that there is a relationship between these two variables.

The Phi coefficient is not used nearly as often as the Pearson and Spearman correlations. Therefore, we will not be devoting space here to the computational procedures. However, interested students can consult advanced statistics textbooks for the details. You can compute Phi easily as one of the options in the crosstabs procedure in SPSS for Windows. Click on the button below to see how.
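For readers working outside SPSS, a rough Python equivalent (not part of the original text) is sketched below: it builds the cross-tabulation and computes Phi as the square root of chi-square divided by the sample size, which is how crosstab procedures typically report it. The data are invented.

import math
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical nominal data: religious background and occupation for ten respondents
religion = ["Catholic", "Protestant", "Muslim", "Catholic", "Other",
            "Protestant", "Muslim", "Catholic", "Protestant", "Other"]
occupation = ["Clerical", "Professional", "Skilled Laborer", "Clerical", "Unskilled Laborer",
              "Professional", "Skilled Laborer", "Middle Manager", "Clerical", "Unskilled Laborer"]

table = pd.crosstab(pd.Series(religion, name="religion"),
                    pd.Series(occupation, name="occupation"))   # contingency table of counts
chi2, p, dof, expected = chi2_contingency(table)
phi = math.sqrt(chi2 / len(religion))                           # Phi = sqrt(chi-square / N)
print(round(phi, 3))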

Advanced Correlational Techniques

Correlational techniques are immensely flexible and can be extended dramatically to solve various kinds of statistical problems. Covering the details of these advanced correlational techniques is beyond the scope of this text and website. However, we have included brief discussions of several advanced correlational techniques on the Student Resource Website, including multidimensional scaling, path analysis, taxonomic search techniques, and statistical analysis of neuroimages.

Nonlinear Correlational Procedures


The vast majority of correlational techniques used in psychology are linear correlations. However, there are times when one can expect to find nonlinear relationships and would like to apply statistical procedures to capture such complex relationships. This topic is far too complex to cover here. The interested student will want to consult advanced statistical textbooks that specialize in regression analyses. 

There are two words of caution that we want to state about using such nonlinear correlational procedures. The first is that, although it is relatively easy to do the computations using modern statistical software, you should not use these procedures unless you actually understand them and their pitfalls. It is easy to misuse the techniques and to be fooled into believing things that are not true by a naive analysis of the output of computer programs.

The second word of caution is that there should be a strong theoretical reason to expect a nonlinear relationship if you are going to use nonlinear correlational procedures. Many psychophysiological processes are by their nature nonlinear, so using nonlinear correlations in studying those processes makes complete sense. But for most psychological processes, there is no good theoretical reason to expect a nonlinear relationship.

Looking for Relationships in the Data

When there are two series of data, there are a number of statistical measures that can be used to capture how the series move together over time.

Correlations and Covariances

The two most widely used measures of how two variables move together (or do not) are the correlation and the covariance. For two data series, X (X1, X2, . . .) and Y (Y1, Y2, . . .), the covariance provides a measure of the degree to which they move together and is estimated by averaging the products of the deviations from the mean for each variable in each period.
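Written out, the estimate of the covariance is:

Covariance(X, Y) = Σ (Xi − mean(X)) × (Yi − mean(Y)) / (n − 1)

where n is the number of periods (some texts divide by n rather than n − 1).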

The sign on the covariance indicates the type of relationship the two variables have. A positive sign indicates that they move together and a negative sign that they move in opposite directions. Although the covariance increases with the strength of the relationship, it is still relatively difficult to draw judgments on the strength of the relationship between two variables by looking at the covariance, because it is not standardized.

The correlation is the standardized measure of the relationship between two variables. It can be computed from the covariance:
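Correlation between X and Y = ρXY = Covariance(X, Y) / (σX × σY)

where σX and σY are the standard deviations of the two series.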

The correlation can never be greater than one or less than negative one. A correlation close to zero indicates that the two variables are unrelated. A positive correlation indicates that the two variables move together, and the relationship is stronger as the correlation gets closer to one. A negative correlation indicates the two variables move in opposite directions, and that relationship gets stronger as the correlation gets closer to negative one. Two variables that are perfectly positively correlated (ρXY = +1) essentially move in perfect proportion in the same direction, whereas two variables that are perfectly negatively correlated (ρXY = −1) move in perfect proportion in opposite directions.

Regressions

            A simple regression is an extension of the correlation/covariance concept. It attempts to explain one variable, the dependent variable, using the other variable, the independent variable.

Scatter Plots and Regression Lines

Keeping with statistical tradition, let Y be the dependent variable and X be the independent variable. If the two variables are plotted against each other with each pair of observations representing a point on the graph, you have a scatterplot, with Y on the vertical axis and X on the horizontal axis.  Figure A1.3 illustrates a scatter plot.


Figure A1.3: Scatter Plot of Y versus X

In a regression, we attempt to fit a straight line through the points that best fits the data. In its simplest form, this is accomplished by finding a line that minimizes the sum of the squared deviations of the points from the line. Consequently, it is called an ordinary least squares (OLS) regression. When such a line is fit, two parameters emerge—one is the point at which the line cuts through the Y-axis, called the intercept of the regression, and the other is the slope of the regression line:

Y = a + bX

The slope (b) of the regression measures both the direction and the magnitude of the relationship between the dependent variable (Y) and the independent variable (X). When the two variables are positively correlated, the slope will also be positive, whereas when the two variables are negatively correlated, the slope will be negative. The magnitude of the slope can be read as follows: for every unit increase in the independent variable (X), the dependent variable (Y) will change by b (the slope).

Estimating Regression Parameters

Although there are statistical packages that allow us to input data and get the regression parameters as output, it is worth looking at how they are estimated in the first place. The slope of the regression line is a logical extension of the covariance concept introduced in the last section. In fact, the slope is estimated using the covariance:

Slope of the regression = b = Covariance(X, Y) / Variance(X)

The intercept (a) of the regression can be read in a number of ways. One interpretation is that it is the value that Y will have when X is zero. Another is more straightforward and is based on how it is calculated: it is the difference between the average value of Y and the slope-adjusted average value of X, that is, a = mean(Y) − b × mean(X).

Regression parameters are always estimated with some error or statistical noise, partly because the relationship between the variables is not perfect and partly because we estimate them from samples of data. This noise is captured in a couple of statistics. One is the R² of the regression, which measures the proportion of the variability in the dependent variable (Y) that is explained by the independent variable (X). It is also a direct function of the correlation between the variables:

R² of the regression = (correlation between X and Y)²

An R² value close to one indicates a strong relationship between the two variables, though the relationship may be either positive or negative. Another measure of noise in a regression is the standard error, which measures the "spread" around each of the two parameters estimated—the intercept and the slope. Each parameter has an associated standard error, which is calculated from the data:

Standard Error of Intercept = SEa = s × sqrt( 1/n + mean(X)² / Σ(Xi − mean(X))² )

Standard Error of Slope = SEb = s / sqrt( Σ(Xi − mean(X))² )

where s is the standard error of the estimate (the standard deviation of the residuals around the regression line) and n is the number of observations.


If we make the additional assumption that the intercept and slope estimates are normally distributed, the parameter estimate and the standard error can be combined to get a t-statistic that measures whether the relationship is statistically significant.

t-Statistic for Intercept = a/SEa

t-Statistic for Slope = b/SEb
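As a sketch of how these quantities can be obtained in Python (not part of the original text), scipy's linregress routine, in reasonably recent versions of scipy, returns the slope, the intercept, and their standard errors, from which the t-statistics follow. The data below are invented.

from scipy import stats

# Hypothetical paired observations of an independent (x) and dependent (y) variable
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

result = stats.linregress(x, y)
t_intercept = result.intercept / result.intercept_stderr   # t-statistic for the intercept
t_slope = result.slope / result.stderr                      # t-statistic for the slope
print(f"a = {result.intercept:.3f}, b = {result.slope:.3f}")
print(f"t(intercept) = {t_intercept:.2f}, t(slope) = {t_slope:.2f}, R-squared = {result.rvalue**2:.3f}")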

For samples with more than 120 observations, a t-statistic greater than 1.96 indicates that the variable is significantly different from zero with 95% certainty (two-tailed), whereas a statistic greater than 2.58 indicates the same with 99% certainty. For smaller samples, the t-statistic has to be larger to have statistical significance.[1]

Using Regressions

Although regressions mirror correlation coefficients and covariances in showing the strength of the relationship between two variables, they also serve another useful purpose. The regression equation described in the last section can be used to estimate predicted values for the dependent variable, based on assumed or actual values for the independent variable. In other words, for any given X, we can estimate what Y should be:

Y = a + b(X)

How good are these predictions? That will depend entirely on the strength of the relationship measured in the regression. When the independent variable explains a high proportion of the variation in the dependent variable (R² is high), the predictions will be precise. When the R² is low, the predictions will have a much wider range.

From Simple to Multiple Regressions

The regression that measures the relationship between two variables becomes a multiple regression when it is extended to include more than one independent variable (X1, X2, X3, X4 . . .) in trying to explain the dependent variable Y. Although the graphical presentation becomes more difficult, the multiple regression yields output that is an extension of the simple regression.

Y = a + bX1 + cX2 + dX3 + eX4

The R² still measures the strength of the relationship, but an additional R² statistic called the adjusted R² is computed to counter the bias that will induce the R² to keep increasing as more independent variables are added to the regression. If there are k independent variables in the regression, the adjusted R² is computed as follows:

Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − k − 1)

where n is the number of observations.

Multiple regressions are powerful tools that allow us to examine the determinants of any variable.
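As an illustration (not from the original text), a multiple regression with its R² and adjusted R² can be run in Python with the statsmodels package; the data below are simulated.

import numpy as np
import statsmodels.api as sm

# Simulated dependent variable y and two independent variables x1, x2
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=50)

X = sm.add_constant(np.column_stack([x1, x2]))   # adds the intercept (constant) term
model = sm.OLS(y, X).fit()
print(model.params)                              # intercept and slope coefficients
print(model.rsquared, model.rsquared_adj)        # R-squared and adjusted R-squared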

Regression Assumptions and Constraints

Both the simple and multiple regressions described in this section also assume linear relationships between the dependent and independent variables. If the relationship is not linear, we have two choices. One is to transform the variables by taking the square, square root, or natural log (for example) of the values and hope that the relationship between the transformed variables is more linear. The other is to run nonlinear regressions that attempt to fit a curve (rather than a straight line) through the data.
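As a sketch of the first option (assuming a roughly power-law relationship, with invented data), one can take logs of both variables and then fit an ordinary straight line:

import numpy as np

# Hypothetical data where y grows roughly as a power of x
x = np.array([1, 2, 4, 8, 16, 32], dtype=float)
y = np.array([2.1, 3.0, 4.1, 5.9, 8.2, 11.8])

b, a = np.polyfit(np.log(x), np.log(y), 1)   # fit log(y) = a + b * log(x)
print(f"log-log slope b = {b:.3f}, intercept a = {a:.3f}")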

There are implicit statistical assumptions behind every multiple regression that we ignore at our own peril. For the coefficients on the individual independent variables to make sense, the independent variables need to be uncorrelated with each other, a condition that is often difficult to meet. When independent variables are correlated with each other, the statistical hazard that is created is called multicollinearity. In its presence, the coefficients on independent variables can take on unexpected signs (positive instead of negative, for instance) and unpredictable values. There are simple diagnostic statistics that allow us to measure how far the data may be deviating from our ideal.
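One common diagnostic is the variance inflation factor (VIF), which can be computed in Python with statsmodels (not part of the original text); the sketch below uses simulated data in which x2 is deliberately built to be correlated with x1.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated design matrix with two correlated independent variables
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.3, size=100)        # x2 is strongly correlated with x1
X = sm.add_constant(np.column_stack([x1, x2]))

# VIF for each independent variable (values well above 5 or 10 suggest multicollinearity)
for i in (1, 2):
    print(f"VIF for variable {i}: {variance_inflation_factor(X, i):.2f}")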

 Correlation Types

Correlation is a measure of association between two variables. The variables are not designated as dependent or independent. The two most popular correlation coefficients are Spearman's correlation coefficient rho and Pearson's product-moment correlation coefficient.


When calculating a correlation coefficient for ordinal data, select Spearman's technique. For interval or ratio-type data, use Pearson's technique.

The value of a correlation coefficient can vary from minus one to plus one. A minus one indicates a perfect negative correlation, while a plus one indicates a perfect positive correlation. A correlation of zero means there is no relationship between the two variables. When there is a negative correlation between two variables, as the value of one variable increases, the value of the other variable decreases, and vice versa. In other words, for a negative correlation, the variables work opposite each other. When there is a positive correlation between two variables, as the value of one variable increases, the value of the other variable also increases. The variables move together.

The standard error of a correlation coefficient is used to determine the confidence intervals around a true correlation of zero. If your correlation coefficient falls outside of this range, then it is significantly different than zero. The standard error can be calculated for interval or ratio-type data (i.e., only for Pearson's product-moment correlation).

The significance (probability) of the correlation coefficient is determined from the t-statistic. The probability of the t-statistic indicates whether the observed correlation coefficient occurred by chance if the true correlation is zero. In other words, it asks if the correlation is significantly different than zero. When the t-statistic is calculated for Spearman's rank-difference correlation coefficient, there must be at least 30 cases before the t-distribution can be used to determine the probability. If there are fewer than 30 cases, you must refer to a special table to find the probability of the correlation coefficient.

Example

A company wanted to know if there is a significant relationship between the total number of salespeople and the total number of sales. They collected data for five months.

Variable 1   Variable 2
207          6907
180          5991
220          6810
205          6553
190          6190
--------------------------------
Correlation coefficient = .921
Standard error of the coefficient = .068
t-test for the significance of the coefficient = 4.100
Degrees of freedom = 3
Two-tailed probability = .0263
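These figures can be checked in Python with scipy (not part of the original example); it reproduces the correlation and two-tailed probability to rounding.

from scipy import stats

salespeople = [207, 180, 220, 205, 190]
sales = [6907, 5991, 6810, 6553, 6190]

r, p = stats.pearsonr(salespeople, sales)
print(round(r, 3), round(p, 4))   # approximately .921 and .0263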

Another Example

Respondents to a survey were asked to judge the quality of a product on a four-point Likert scale (excellent, good, fair, poor). They were also asked to judge the reputation of the company that made the product on a three-point scale (good, fair, poor). Is there a significant relationship between respondents' perceptions of the company and their perceptions of the quality of the product?

Since both variables are ordinal, Spearman's method is chosen. The first variable is the rating for the quality of the product. Responses are coded as 4=excellent, 3=good, 2=fair, and 1=poor. The second variable is the perceived reputation of the company and is coded 3=good, 2=fair, and 1=poor.

Variable 1   Variable 2
4            3
2            2
1            2
3            3
4            3
1            1
2            1
-------------------------------------------
Correlation coefficient rho = .830
t-test for the significance of the coefficient = 3.332
Number of data pairs = 7
Probability must be determined from a table because of the small sample size.
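A Python sketch of the rank-difference computation is shown below (not part of the original example). Note that scipy's built-in spearmanr applies a correction for the tied ranks in these data, so it will return a slightly different value than the simple d-squared formula used here.

import numpy as np
from scipy.stats import rankdata

quality    = np.array([4, 2, 1, 3, 4, 1, 2])
reputation = np.array([3, 2, 2, 3, 3, 1, 1])

d = rankdata(quality) - rankdata(reputation)       # differences between the ranks
n = len(quality)
rho = 1 - (6 * np.sum(d**2)) / (n * (n**2 - 1))    # Spearman rank-order formula
print(round(rho, 3))                               # approximately .830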

Regression

Simple regression is used to examine the relationship between one dependent and one independent variable. After performing an analysis, the regression statistics can be used to predict the dependent variable when the independent variable is known. Regression goes beyond correlation by adding prediction capabilities.

People use regression on an intuitive level every day. In business, a well-dressed man is thought to be financially successful. A mother knows that more sugar in her children's diet results in higher energy levels. The ease of waking up in the morning often depends on how late you went to bed the night before. Quantitative regression adds precision by developing a mathematical formula that can be used for predictive purposes.

For example, a medical researcher might want to use body weight (independent variable) to predict the most appropriate dose for a new drug (dependent variable). The purpose of running the regression is to find a formula that fits the relationship between the two variables. Then you can use that formula to predict values for the dependent variable when only the independent variable is known. A doctor could prescribe the proper dose based on a person's body weight.

The regression line (known as the least squares line) is a plot of the expected value of the dependent variable for all values of the independent variable. Technically, it is the line that "minimizes the squared residuals". The regression line is the one that best fits the data on a scatterplot.

Using the regression equation, the dependent variable may be predicted from the independent variable. The slope of the regression line (b) is defined as the rise divided by the run. The y intercept (a) is the point on the y axis where the regression line would intercept the y axis. The slope and y intercept are incorporated into the regression equation. The intercept is usually called the constant, and the slope is referred to as the coefficient. Since the regression model is usually not a perfect predictor, there is also an error term in the equation.

In the regression equation, y is always the dependent variable and x is always the independent variable. Here are three equivalent ways to mathematically describe a linear regression model.

y = intercept + (slope × x) + error
y = constant + (coefficient × x) + error
y = a + bx + e

The significance of the slope of the regression line is determined from the t-statistic. It is the probability that the observed slope occurred by chance if the true slope is zero. Some researchers prefer to report the F-ratio instead of the t-statistic. The F-ratio is equal to the t-statistic squared.

The t-statistic for the significance of the slope is essentially a test to determine if the regression model (equation) is usable. If the slope is significantly different than zero, then we can use the regression model to predict the dependent variable for any value of the independent variable.

On the other hand, take an example where the slope is zero. It has no prediction ability because for every value of the independent variable, the prediction for the dependent variable would be the same. Knowing the value of the independent variable would not improve our ability to predict the dependent variable. Thus, if the slope is not significantly different than zero, don't use the model to make predictions.


The coefficient of determination (r-squared) is the square of the correlation coefficient. Its value may vary from zero to one. It has the advantage over the correlation coefficient in that it may be interpreted directly as the proportion of variance in the dependent variable that can be accounted for by the regression equation. For example, an r-squared value of .49 means that 49% of the variance in the dependent variable can be explained by the regression equation. The other 51% is unexplained.

The standard error of the estimate for regression measures the amount of variability in the points around the regression line. It is the standard deviation of the data points as they are distributed around the regression line. The standard error of the estimate can be used to develop confidence intervals around a prediction.

Example

A company wants to know if there is a significant relationship between its advertising expenditures and its sales volume. The independent variable is advertising budget and the dependent variable is sales volume. A lag time of one month will be used because sales are expected to lag behind actual advertising expenditures. Data were collected for a six-month period. All figures are in thousands of dollars. Is there a significant relationship between advertising budget and sales volume?

Indep. Var.   Depen. Var.
4.2           27.1
6.1           30.4
3.9           25.0
5.7           29.7
7.3           40.1
5.9           28.8
--------------------------------------------------
Model: y = 9.873 + (3.682 × x) + error
Standard error of the estimate = 2.637
t-test for the significance of the slope = 3.961
Degrees of freedom = 4
Two-tailed probability = .0149
r-squared = .807
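The fitted line can be checked in Python with scipy (not part of the original example); it should reproduce the slope and intercept of the model above.

from scipy import stats

advertising = [4.2, 6.1, 3.9, 5.7, 7.3, 5.9]        # independent variable (thousands of dollars)
sales = [27.1, 30.4, 25.0, 29.7, 40.1, 28.8]        # dependent variable (thousands of dollars)

result = stats.linregress(advertising, sales)
print(round(result.intercept, 3), round(result.slope, 3))   # approximately 9.873 and 3.682
print(round(result.slope / result.stderr, 3))               # t-statistic for the slope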

You might make a statement in a report like this: A simple linear regression was performed on six months of data to determine if there was a significant relationship between advertising expenditures and sales volume. The t-statistic for the slope was significant at the .05 critical alpha level, t(4)=3.96, p=.015. Thus, we reject the null hypothesis and conclude that there was a positive significant relationship between advertising expenditures and sales volume. Furthermore, 80.7% of the variability in sales volume could be explained by advertising expenditures.