1 research methods of applied linguistics and statistics (11) correlation and multiple regression by...

1

Research Methods of Applied Linguistics and Statistics (11)

Correlation and multiple regression

By Qin Xiaoqing

2

Pearson Correlation

The Pearson correlation allows us to establish the strength of relationships between continuous variables.

To show the relationship, the first step is to draw a scatterplot or scattergram, which can help us to obtain a preliminary understanding of this relationship.

The scatterplot can be described in terms of direction, strength and linearity.

3

Correlation and SPSS

The Pearson product-moment coefficient is designed for interval-level (continuous) variables. It can also be used if you have one continuous variable (e.g., scores on a measure of self-esteem) and one dichotomous variable (e.g., sex: M/F).

Spearman rank order correlation is designed for use with ordinal level or ranked data.

SPSS will calculate two types of correlation. First, it will give a simple bivariate correlation (which just means between two variables), also known as zero-order correlation. SPSS will also explore the relationship between two variables, while controlling for another variable. This is known as partial correlation.
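As a rough illustration of what "controlling for another variable" means, the first-order partial correlation can be computed from the three zero-order correlations. The sketch below uses the standard textbook formula; the three input correlations are hypothetical values chosen only for illustration, and the function name is not part of SPSS.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y, controlling for Z."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator


# Hypothetical zero-order correlations (assumed, not from the slides).
r_partial = partial_correlation(r_xy=0.50, r_xz=0.40, r_yz=0.30)
print(round(r_partial, 2))
```

If the partial correlation is clearly smaller than the zero-order correlation, part of the original relationship was shared with the controlled variable.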

4

Direction

Positive relationships represent relationships in which an increase in one variable is associated with an increase in a second.

Negative relationships represent relationships in which an increase in one variable is associated with a decrease in a second.

5

Strength

Strong relationships appear as those in which the dots are very close to a straight line.

Weak relationships appear as those in which the dots are more scattered about a straight line, or farther away from that line.

6

Linearity

Linear relationships are indicated when the pattern of dots on the scatter diagram appears to be straight, or if the points could be represented by drawing a straight line through them.

7

Steps for computation

1. List the score for each S in parallel columns on a data sheet.

2. Square each score and enter these values in the columns labeled X² and Y².

3. Multiply the scores and enter this value in the XY column.

4. Add the values in each column.

5. Insert the values in the formula of correlation coefficient.

\[ r_{xy} = \frac{\sum (z_x z_y)}{N - 1} \]

8

Example

S       X     Y     X²     Y²     XY
1       12    8     144    64     96
2       10    12    100    144    120
3       11    5     121    25     55
4       9     8     81     64     72
5       8     4     64     16     32
6       7     13    49     169    91
7       7     7     49     49     49
8       5     3     25     9      15
9       4     8     16     64     32
10      3     5     9      25     15
Total   76    73    658    629    577
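Steps 1 to 4 of the computation can be sketched in a few lines of Python. The X and Y scores are the ten pairs from the example; the variable names are illustrative only. The column sums should match the Total row.

```python
# X and Y scores for the ten subjects in the example.
x = [12, 10, 11, 9, 8, 7, 7, 5, 4, 3]
y = [8, 12, 5, 8, 4, 13, 7, 3, 8, 5]

# Steps 2 and 3: build the squared and cross-product columns.
x_sq = [xi ** 2 for xi in x]
y_sq = [yi ** 2 for yi in y]
xy = [xi * yi for xi, yi in zip(x, y)]

# Step 4: add the values in each column.
totals = {
    "X": sum(x),
    "Y": sum(y),
    "X2": sum(x_sq),
    "Y2": sum(y_sq),
    "XY": sum(xy),
}
print(totals)
```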

9

Scatterplot

[Figure: scatterplot with L2 proficiency and short-term memory on the axes, each scaled from 2 to 14.]

10

Interpretation of scatterplot

Checking for outliers

Inspecting the distribution of data points:

Are the data points spread all over the place? This suggests a very low correlation.

Are all the points neatly arranged in a narrow cigar shape? This suggests quite a strong correlation.

Could you draw a straight line through the main cluster of points, or would a curved line better represent the points? If a curved line is evident (suggesting a curvilinear relationship), then Pearson correlation should not be used.

What is the shape of the cluster? Is it even from one end to the other? Or does it start off narrow and then get fatter? If this is the case, the data may be violating the assumption of variance homogeneity.

Determining the direction of the relationship between the variables

11

Formula of r for raw score

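\[ r_{xy} = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}} \]

Totals from the example (N = 10):

        X     Y     X²     Y²     XY
Total   76    73    658    629    577

\[ r_{xy} = \frac{10(577) - (76)(73)}{\sqrt{[10(658) - (76)^2][10(629) - (73)^2]}} = \frac{222}{\sqrt{(804)(961)}} = .25 \]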
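A minimal Python sketch of the raw-score formula, applied to the ten pairs of scores from the example. As a cross-check, it also computes r from z scores (the sum of z-score cross-products divided by N − 1, with sample standard deviations); the helper name `z_scores` is an illustrative choice.

```python
import math

x = [12, 10, 11, 9, 8, 7, 7, 5, 4, 3]
y = [8, 12, 5, 8, 4, 13, 7, 3, 8, 5]
n = len(x)

# Raw-score formula.
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r_raw = numerator / denominator


def z_scores(values):
    """Convert scores to z scores using the sample SD (N - 1)."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]


# z-score formula: sum of cross-products of z scores over N - 1.
r_z = sum(zx * zy for zx, zy in zip(z_scores(x), z_scores(y))) / (n - 1)

print(round(r_raw, 2), round(r_z, 2))
```

Both routes give the same coefficient, which rounds to .25 for these data.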

12

Assumptions underlying Pearson correlation

1. The data are measured as scores or ordinal scales that are truly continuous.

2. The scores on the two variables, X and Y, are independent.

3. The data should be normally distributed through their range.

4. The relationship between X and Y must be linear.

13

Interpreting the correlation coefficient

1. When r = .60, the variance overlap between the two measures is r² = .36.

2. The overlap tells us that the two measures provide similar information. In other words, the magnitude of r² indicates the amount of variance in X that is accounted for by Y, or vice versa.

14

Correlation coefficient

If you hope 2 tests measure basically the same thing, .71 isn’t very strong; .80 or .90 may be desirable.

A correlation of .30 or lower may appear weak, but in educational research such a correlation might be very important.

Significance level: p < .05 or p < .01, with df = N − 2
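One common way to test r against zero is to convert it to a t statistic with N − 2 degrees of freedom. The sketch below assumes that approach and uses r = .25 and N = 10 from the worked example; the critical value 2.306 is the two-tailed .05 value of t for df = 8, taken from standard t tables.

```python
import math


def t_for_r(r, n):
    """t statistic for testing H0: rho = 0, with df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r, n = 0.25, 10
df = n - 2
t = t_for_r(r, n)

# Two-tailed critical t at p < .05 for df = 8 (from standard tables).
T_CRITICAL = 2.306
significant = abs(t) > T_CRITICAL

print(df, round(t, 2), significant)
```

With only ten subjects, a correlation of .25 falls well short of the critical value.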

15

r=.10 to .29 or r=–.10 to –.29: small

r=.30 to .49 or r=–.30 to –.49: medium

r=.50 to 1.0 or r=–.50 to –1.0: large
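These bands can be turned into a small lookup, sketched below. The function name is hypothetical, and the label "negligible" for values under .10 is an assumption, since the slide gives no label for that range.

```python
def strength_label(r):
    """Label the size of a correlation using the bands above."""
    size = abs(r)  # the sign gives direction, not strength
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    return "negligible"  # assumed label; not given on the slide


for value in (0.25, -0.45, 0.60):
    print(value, strength_label(value))
```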

16

Presenting the results from correlation

17

Comparing the correlation coefficients for two groups

Sometimes when doing correlational research you may want to compare the strength of the correlation coefficients for two separate groups.
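One widely used way to make this comparison for two independent groups is to convert each r to Fisher's z and test the difference. The sketch below assumes that method; the correlations and group sizes are hypothetical.

```python
import math


def compare_correlations(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations."""
    z1 = math.atanh(r1)  # Fisher's r-to-z transformation
    z2 = math.atanh(r2)
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / standard_error


# Hypothetical results for two groups of 50 learners each.
z_obs = compare_correlations(r1=0.60, n1=50, r2=0.30, n2=50)

# |z| > 1.96 would indicate a difference at p < .05 (two-tailed).
print(round(z_obs, 2), abs(z_obs) > 1.96)
```

In this hypothetical case the two coefficients look quite different, yet the difference does not reach the .05 level.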

18

Factors affecting correlation

A restricted range of scores on either of the variables will reduce the value of r, e.g., age (18-20) and success on an exam.

Extreme outliers in the data.

The presence of extremely high and extremely low scores on a variable, with few in the middle.

Reliability of the data.

Non-linear relationship.

Always check the scatterplot, particularly if you obtain low values of r.

19

Correlation versus causality

Correlation provides an indication that there is a relationship between two variables. It does not, however, indicate that one variable causes the other. The correlation between two variables (A and B) could be due to the fact that A causes B, that B causes A, or (just to complicate matters) that an additional variable (C) causes both A and B. The possibility of a third variable that influences both of your observed variables should always be considered.

20

Statistical vs practical significance

Don’t get too excited if your correlation coefficients are ‘significant’. With large samples, even quite small correlation coefficients can reach statistical significance. Although statistically significant, the practical significance of a correlation of .2 is very limited. You should focus on the actual size of Pearson’s r and the amount of shared variance between the two variables. To interpret the strength of your correlation coefficient you should also take into account other research that has been conducted in your particular topic area. If other researchers in your area have only been able to predict 9 per cent of the variance (a correlation of .3) in a particular outcome (e.g., anxiety), then your study that explains 25 per cent would be impressive in comparison. In other topic areas, 25 per cent of the variance explained may seem small and irrelevant.

21

Linear regression

Multiple regression

22

Understanding regression

Regression is a way of predicting performance on the dependent variable via one or more independent variables.

In simple regression, we predict scores on one variable on the basis of scores on a second.

In multiple regression, we expand the possible sources of prediction and test to see which of many variables and which combination of variables allow us to make the best prediction.

23

Linear regression

Regression and correlation are related procedures. The correlation coefficient is central to simple linear regression. While we can’t make causal claims on the basis of correlation, we can use correlation to predict one variable from another.

We can’t just throw variables into a multiple regression and hope that, magically, answers will appear.

We should have a sound theoretical or conceptual reason for the analysis and, in particular, for the order in which variables enter the equation.

24

Uses of multiple regression

how well a set of variables is able to predict a particular outcome;

which variable in a set of variables is the best predictor of an outcome; and

whether a particular predictor variable is still able to predict an outcome when the effects of another variable are controlled for.

25

Assumptions of multiple regression

Sample size

Stevens (1996) recommends that ‘for social science research, about 15 subjects per predictor are needed for a reliable equation’.

Tabachnick and Fidell (1996, p. 132) give a formula for calculating sample size requirements, taking into account the number of independent variables that you wish to use: N > 50 + 8m (where m = number of independent variables). If you have five independent variables you would need more than 90 cases.

More cases are needed if the dependent variable is skewed.

For stepwise regression there should be a ratio of forty cases for every independent variable.
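The three rules of thumb above are easy to compare side by side. The sketch below simply encodes them; the function names are illustrative, and the rules themselves are the ones quoted on this slide.

```python
def stevens_cases(m):
    """About 15 subjects per predictor (Stevens, 1996)."""
    return 15 * m


def tabachnick_fidell_threshold(m):
    """N should exceed 50 + 8m (Tabachnick and Fidell, 1996)."""
    return 50 + 8 * m


def stepwise_cases(m):
    """Forty cases per independent variable for stepwise regression."""
    return 40 * m


m = 5  # number of independent variables
print(stevens_cases(m), tabachnick_fidell_threshold(m), stepwise_cases(m))
```

For five predictors the rules give 75, a threshold of 90, and 200 cases respectively, so the choice of rule matters a great deal.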

26

Multicollinearity. It exists when the independent variables are highly correlated (r=.9 and above). Multiple regression doesn’t like multicollinearity, and it certainly doesn’t contribute to a good regression model, so always check for this problem before you start.

Outliers. Multiple regression is very sensitive to outliers (very high or very low scores).

Normality, linearity

27

MLAT and language learning

The closer r is to ±1, the smaller the error will be in predicting performance on one variable from performance on the second. The closer r is to 0, the greater the error.

28

Predicting scores using regression

Four pieces of information are needed:

the mean for scores on one variable;

the mean for scores on the second variable;

the S’s score on X; and

the slope of the best-fitting straight line of the joint distribution.

With this information, we can predict the S’s score on Y from X on a mathematical basis. By ‘regressing’ Y on X, predicting Y from X will be possible.

29

Regression line

Lines drawn from each data point to the straight line in the scatterplot show the amount of error. Suppose we square each of these errors and then find the mean of these squared errors. The best-fitting straight line is called the regression line and is technically defined as the line that results in the smallest mean of the squared errors.

We can think of the regression line as being that which is closest to all the dots but, more precisely, it is the one that results in a mean of the squared errors that is less than any other line we might produce.
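The least-squares idea can be checked numerically. The sketch below reuses the X and Y scores from the earlier correlation example purely for illustration (an assumption; this slide has no data of its own). It fits the line and confirms that nudging the intercept or slope in either direction increases the mean squared error.

```python
x = [12, 10, 11, 9, 8, 7, 7, 5, 4, 3]
y = [8, 12, 5, 8, 4, 13, 7, 3, 8, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept.
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
a = mean_y - b * mean_x


def mean_squared_error(intercept, slope):
    """Mean of the squared vertical distances from the points to a line."""
    errors = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return sum(e ** 2 for e in errors) / n


best = mean_squared_error(a, b)

# Rival lines: shift the intercept or tilt the slope a little.
rivals = [
    mean_squared_error(a + 0.5, b),
    mean_squared_error(a - 0.5, b),
    mean_squared_error(a, b + 0.1),
    mean_squared_error(a, b - 0.1),
]
print(round(best, 3), [round(r, 3) for r in rivals])
```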

30

Determining the slope

Convert the MLAT and language learning test scores to z scores for comparability. Then plot the intersection of each S’s z score on the MLAT and on the test. As the z scores on the MLAT increase, they form a ‘run’, the horizontal line of a triangle. At the same time, the z scores on the test increase to form a ‘rise’, the vertical line.

The slope (b) of the regression line is shown as we connect these 2 lines to form the third side of the triangle.

31

Regression coefficient with known r and SD

In the diagram, an increase of say 6 units on the run (MLAT) would equal 2 units of increase on the rise.

The slope is the rise divided by the run. The result is a fraction. That fraction is the correlation coefficient.

The correlation coefficient is the same as the slope of the best-fitting line in a z-score scatterplot. In the triangle, the slope of the regression line was 2÷6, and so r for the two is .33. Suppose the SDs are 8 and 10 for Y and X, respectively.

To obtain the slope, we multiply the correlation coefficient by the standard deviation of Y over the standard deviation of X.

\[ b = r_{xy}\,\frac{s_Y}{s_X} = .33 \times \frac{8}{10} = .26 \]

32

Regression coefficient with raw data

With r and SD, it is very easy to find the slope. With raw data, the formula for slope follows:

\[ b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{N\sum XY - (\sum X)(\sum Y)}{N\sum X^2 - (\sum X)^2} \]
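The deviation form and the computational (raw-total) form of the slope give the same value. A short check, again borrowing the ten X and Y scores from the correlation example as illustrative data:

```python
x = [12, 10, 11, 9, 8, 7, 7, 5, 4, 3]
y = [8, 12, 5, 8, 4, 13, 7, 3, 8, 5]
n = len(x)

# Computational form, using column totals only.
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(xi ** 2 for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
b_totals = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Deviation form, using distances from the means.
mean_x, mean_y = sum_x / n, sum_y / n
b_deviations = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)

print(round(b_totals, 3), round(b_deviations, 3))
```

The computational form is convenient when only the column totals are available.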

33

Example: using TSE to predict TOEFL

Mean on TOEFL=540, SD=40. Mean on TSE=30, SD=4. r=.80, b=8.0

A student achieved 36 on the TSE, 6 higher than the mean. Multiplying that by the slope, we get 8×6=48. So our prediction of TOEFL is mean Y (540) +48=588. The formula follows:

\[ \hat{Y}\ (\text{predicted } Y) = \bar{Y} + b(X - \bar{X}) = 540 + (8)(36 - 30) = 588 \]

Another regression equation is:

\[ \hat{Y} = a + bX \]
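The TOEFL example can be reproduced in a few lines. The sketch derives the slope from r and the two SDs, then predicts with both forms of the equation. The intercept a is computed here as the mean of Y minus b times the mean of X, a standard identity that the slide does not spell out.

```python
mean_toefl, sd_toefl = 540, 40
mean_tse, sd_tse = 30, 4
r = 0.80

# Slope from r and the standard deviations: b = r * (SD of Y / SD of X).
b = r * sd_toefl / sd_tse

# Intercept for the a + bX form (standard identity, not shown on the slide).
a = mean_toefl - b * mean_tse


def predict_from_means(tse_score):
    """Predicted TOEFL = mean Y + b * (X - mean X)."""
    return mean_toefl + b * (tse_score - mean_tse)


def predict_from_intercept(tse_score):
    """Predicted TOEFL = a + b * X."""
    return a + b * tse_score


print(b, a, predict_from_means(36), predict_from_intercept(36))
```

Both forms describe the same line, so they return the same prediction for a TSE score of 36.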

34

Standard error of estimate

There is some overlap in the variance of the two variables. When we square the value of r, we find the degree of shared variance.

Of the original 100% of the variance, with an r=.50, we have accurately accounted for 25% of the variance using the straight line as the basis for prediction. The error variance now is reduced to 75%.

In regression, standard error of estimate (SEE) shows the dispersion of scores away from the straight line. If all the data are tightly clustered on the line, little error is made in prediction.

SEE tells us how much error is likely to occur in prediction.

35

Error variance

To compute the SEE, we need to know the error variance, which is the sum of the squared differences between actual and predicted scores, divided by N-2.

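The square root of this variance is referred to as the SEE (1.35):

\[ \text{SEE} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{N - 2}} \quad \text{or} \quad \text{SEE} = s_Y\sqrt{1 - r^2} = 2.96\sqrt{1 - .89^2} = 1.35 \]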

Mean for X=8, SD=4.47; mean for Y=10.8, SD=2.96; r=.89
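The shortcut form of the SEE needs only the SD of Y and r. A minimal check with the values given here (SD of Y = 2.96, r = .89); the function name is illustrative:

```python
import math


def see_from_r(sd_y, r):
    """Standard error of estimate from the SD of Y and r."""
    return sd_y * math.sqrt(1 - r ** 2)


see = see_from_r(sd_y=2.96, r=0.89)
print(round(see, 2))
```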

36

Confidence interval

68% confidence interval: ± 1 SEE (e.g., ± 1.35): 68% of actual Y scores would fall within ± 1.35 of the predicted Y score.

95% confidence interval: ± 1.96 × SEE

99% confidence interval: ± 2.58 × SEE

Suppose the estimated score is 11.98. Then the 95% confidence interval is between 9.33 (11.98 − 1.35 × 1.96) and 14.63 (11.98 + 1.35 × 1.96).

99% confidence interval? Between 8.50 (11.98 − 3.48) and 15.46 (11.98 + 3.48).
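The interval arithmetic can be wrapped in a small helper, sketched here with the predicted score (11.98) and SEE (1.35) from this example. The function name and the dictionary of multipliers are illustrative.

```python
def interval(predicted, see, multiplier):
    """Lower and upper bounds: predicted score +/- multiplier * SEE."""
    margin = multiplier * see
    return predicted - margin, predicted + margin


predicted, see = 11.98, 1.35
multipliers = {"68%": 1.0, "95%": 1.96, "99%": 2.58}

bounds = {level: interval(predicted, see, m) for level, m in multipliers.items()}
for level, (low, high) in bounds.items():
    print(level, round(low, 2), round(high, 2))
```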

37

Estimated L2 scores predicted from class hours

38

Goodness of fit for regression model: R2

R2, the squared multiple correlation (also called the coefficient of multiple determination), is the proportion of the variance in the dependent variable explained uniquely or jointly by the independent variables.

Adjusted R2 is an adjustment for the fact that when one has a large number of independents, it is possible that R2 will become artificially high simply because some independents' chance variations "explain" small parts of the variance of the dependent.

The greater the number of independents, the more the researcher is expected to report the adjusted coefficient.
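One commonly used adjustment shrinks R2 according to the sample size N and the number of independent variables k. The sketch below assumes that formula, 1 − (1 − R2)(N − 1)/(N − k − 1); the input values are hypothetical.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R2 for n cases and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical model: R2 = .50 with 30 cases.
few_predictors = adjusted_r_squared(r2=0.50, n=30, k=2)
many_predictors = adjusted_r_squared(r2=0.50, n=30, k=5)
print(round(few_predictors, 3), round(many_predictors, 3))
```

The same R2 is penalized more heavily when it was obtained with more independent variables.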

39

T-test

t-tests are used to assess the significance of individual b coefficients, specifically testing the null hypothesis that the regression coefficient is zero.

40

F test

F test is used to test the significance of R, which is the same as testing the significance of R2, which is the same as testing the significance of the regression model as a whole.

If prob(F) < .05, then the model is considered significantly better than would be expected by chance and we reject the null hypothesis of no linear relationship of y to the independents.
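The overall F statistic can be written in terms of R2, the number of independent variables k, and the sample size N. The sketch below uses that standard relationship with hypothetical values; obtaining prob(F) itself requires an F table or statistical software, so only the statistic and its degrees of freedom are computed.

```python
def f_statistic(r2, n, k):
    """F for the model as a whole, with df = (k, n - k - 1)."""
    df_model = k
    df_error = n - k - 1
    f_value = (r2 / df_model) / ((1 - r2) / df_error)
    return f_value, df_model, df_error


# Hypothetical model: R2 = .50, 30 cases, 5 independent variables.
f_value, df_model, df_error = f_statistic(r2=0.50, n=30, k=5)
print(round(f_value, 2), df_model, df_error)
```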

41

Multicollinearity

Multicollinearity is the intercorrelation of independent variables. When an independent variable is regressed on the others, an R2 near 1 violates the assumption of no perfect collinearity, while a high R2 increases the standard error of the beta coefficients and makes assessment of the unique role of each independent variable difficult or impossible.

42

Tolerance or VIF

To assess multivariate multicollinearity, one uses tolerance or VIF, which build in the regressing of each independent on all the others.

As a rule of thumb, if tolerance is less than .20, a problem with multicollinearity is indicated.

When tolerance is close to 0 there is high multicollinearity of that variable with other independents and the b and beta coefficients will be unstable.

The greater the multicollinearity, the lower the tolerance and the larger the standard errors of the regression coefficients.
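Tolerance is 1 minus the R2 obtained when one independent variable is regressed on all the others, and VIF is the reciprocal of tolerance. A minimal sketch, using a hypothetical R2 of .85 and the .20 rule of thumb mentioned above:

```python
def tolerance(r2_on_others):
    """1 - R2 from regressing one independent on all the others."""
    return 1 - r2_on_others


def vif(r2_on_others):
    """Variance inflation factor: the reciprocal of tolerance."""
    return 1 / tolerance(r2_on_others)


# Hypothetical R2 for one predictor regressed on the rest.
r2_j = 0.85
tol = tolerance(r2_j)
problem = tol < 0.20  # rule of thumb from the text

print(round(tol, 2), round(vif(r2_j), 2), problem)
```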

43

Selecting method for predicting variables: Forward selection

This method starts with a model containing none of the explanatory variables. In the first step, the procedure considers variables one by one for inclusion and selects the variable that results in the largest increase in R2. In the second step, the procedure considers variables for inclusion in a model that only contains the variable selected in the first step. In each step, the variable with the largest increase in R2 is selected until, according to an F-test, further additions are judged not to improve the model.
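The procedure can be sketched in plain Python. This is an illustration under stated assumptions, not SPSS's implementation: the stopping rule is a fixed F-to-enter cutoff of 4.0 (an assumed value) rather than a p-value, the data and the names `aptitude` and `motivation` are hypothetical, and the sketch assumes no candidate model fits perfectly (R2 below 1).

```python
def solve(a, b):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(columns, y):
    """R2 of a least-squares fit of y on the columns plus an intercept."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    predicted = [sum(bj * v for bj, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_error = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_error / ss_total


def forward_select(predictors, y, f_to_enter=4.0):
    """Add the predictor giving the largest R2 gain until F-change is too small."""
    n = len(y)
    selected, current_r2 = [], 0.0
    remaining = list(predictors)
    while remaining:
        scores = {
            name: r_squared([predictors[s] for s in selected + [name]], y)
            for name in remaining
        }
        best = max(scores, key=scores.get)
        k = len(selected) + 1  # predictors in the candidate model
        f_change = (scores[best] - current_r2) / ((1 - scores[best]) / (n - k - 1))
        if f_change < f_to_enter:  # assumed cutoff, see lead-in
            break
        selected.append(best)
        remaining.remove(best)
        current_r2 = scores[best]
    return selected, current_r2


# Hypothetical data: the outcome tracks aptitude closely, motivation only weakly.
predictors = {
    "aptitude": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "motivation": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
}
outcome = [2.5, 3.5, 6.3, 7.7, 10.2, 11.8, 14.4, 15.6, 18.1, 19.9]

selected, final_r2 = forward_select(predictors, outcome)
print(selected, round(final_r2, 3))
```

Backward and stepwise selection follow the same pattern, with a removal check in place of, or in addition to, the entry check.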

44

Backward selection

This method starts with a model containing all the variables and eliminates variables one by one, at each step choosing the variable for exclusion as that leading to the smallest decrease in R2. Again, the procedure is repeated until, according to an F-test, further exclusions would represent a deterioration of the model.

45

Stepwise selection

This method is, essentially, a combination of the previous two approaches. Starting with no variables in the model, variables are added as with the forward selection method. In addition, after each inclusion step, a backward elimination process is carried out to remove variables that are no longer judged to improve the model.

46

Interpretation of the results from multiple regression

Checking the assumptions

Evaluating the model

Evaluating each of the independent variables

47

Presenting the results of multiple regression

It would be a good idea to look for examples of the presentation of different statistical analyses in the journals relevant to your topic area. Different journals have different requirements and expectations. Given the severe space limitations in journals these days, often only a brief summary of the results is presented and readers are encouraged to contact the author for a copy of the full results.
