Correlation and Covariance
Correlation andCovariance
Overview
Outcome, Dependent Variable (Y-Axis): Height (Continuous) → Histogram
Predictor Variable (X-Axis): Continuous → Scatterplot; Categorical → Boxplot
Variables
Dependent Variable (Y): Height
Independent Variables (X's): X1, X2, X3, X4
Correlation Matrix for Continuous Variables
chart.Correlation(num2), from the PerformanceAnalytics package
• A deviation is the difference between the mean and an actual data point.
• Deviations can be calculated by taking each score and subtracting the mean from it:
deviation = x_i − x̄
Calculating ‘Error’
• Could we take the errors (deviations) between the mean and the data points and add them up?
Σ(X − X̄) = 0
Score Mean Deviation
1 2.6 -1.6
2 2.6 -0.6
3 2.6 0.4
3 2.6 0.4
4 2.6 1.4
Total = 0
Deviation: Use the Total Error?
• We could add the deviations to find out the total error.
• Deviations cancel out because some are positive and others negative.
• Therefore, we square each deviation.
• If we add these squared deviations we get the sum of squared errors (SS).
Deviation: Sum of Squared Errors
SS = Σ(X − X̄)² = 5.20
Score Mean Deviation Squared Deviation
1 2.6 -1.6 2.56
2 2.6 -0.6 0.36
3 2.6 0.4 0.16
3 2.6 0.4 0.16
4 2.6 1.4 1.96
Total 5.20
Sum of Squared Errors
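The table above can be reproduced in a few lines. The slides use R and Excel; this is an equivalent sketch in Python, using the scores 1, 2, 3, 3, 4 from the slide:

```python
# Scores from the slide; their mean is 2.6
scores = [1, 2, 3, 3, 4]
mean = sum(scores) / len(scores)

# Deviations cancel out: positives and negatives sum to (almost exactly) 0
deviations = [x - mean for x in scores]
total_deviation = sum(deviations)

# Squaring each deviation before summing gives the sum of squared errors (SS)
ss = sum(d ** 2 for d in deviations)
print(mean, total_deviation, ss)   # ≈ 2.6, 0, 5.20
```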
• The variance is measured in units squared.
• This isn’t a very meaningful metric so we take the square root value.
• This is the standard deviation (s).
s = √( Σ(x_i − x̄)² / N ) = √(5.20 / 5) ≈ 1.02
Standard Deviation
• The sum of squares is a good measure of overall variability, but is dependent on the number of scores.
• We calculate the average variability by dividing by the number of scores (n).
• This value is called the variance (s²).
Variance
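The variance and standard deviation on these slides can be checked with Python's standard-library statistics module (pvariance and pstdev divide by N, matching the slide's division by n = 5):

```python
import statistics

scores = [1, 2, 3, 3, 4]

# Variance: average squared deviation (SS / n = 5.20 / 5)
variance = statistics.pvariance(scores)

# Standard deviation: square root of the variance, back in the original units
sd = statistics.pstdev(scores)

print(variance, round(sd, 2))   # 1.04, 1.02
```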
Same Mean, Different Standard Deviation
[Figure: Temperature Variation Across Cities — histograms of hourly temperatures (count of hours) for Austin, Las Vegas, San Diego, San Francisco, and Tampa Bay]
cov(x, y) = Σ(x_i − x̄)(y_i − ȳ) / (N − 1)
Covariance
[Figure: scatterplot of deviations of X and Y from their means; persons 2, 3, and 5 look to have similar magnitudes from their means]
cov(x, y) = Σ(x_i − x̄)(y_i − ȳ) / (N − 1) = 17 / 4 = 4.25
Covariance
• Calculate the error [deviation] between the mean and each subject’s score for the first variable (x).
• Calculate the error [deviation] between the mean and their score for the second variable (y).
• Multiply these error values.
• Add these values and you get the cross-product deviations.
• The covariance is the average of the cross-product deviations:
cov(x, y) = Σ(x_i − x̄)(y_i − ȳ) / (N − 1)
Covariance
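The steps above translate directly into code. A minimal sketch (in Python rather than the slides' R; the toy data here is invented for illustration):

```python
def covariance(x, y):
    """Average cross-product of deviations, dividing by N - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Deviation of each x times deviation of the paired y
    cross_products = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]
    return sum(cross_products) / (n - 1)

# Toy example: y moves twice as far from its mean as x does
print(covariance([1, 2, 3], [2, 4, 6]))   # 2.0
```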
Age Income Education
7 4 3
4 1 8
6 3 5
8 6 1
8 5 7
7 2 9
5 3 3
9 5 8
7 4 5
8 2 2
9 5 2
8 4 2
9 2 3
8 4 7
3 1 4
3 1 3
8 2 6
1 2 5
3 1 7
6 3 3
[Figure: scatterplot of Age vs. Income]
[Figure: Delta A and Delta I — deviations of Age and Income from their own means, plotted for subjects 1–20]
Do they VARY the same way relative to their own means?
cov(Age, Income) = 47.0 / 19 ≈ 2.47
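The 2.47 can be verified from the Age and Income columns of the table above (a Python sketch; the slides do this calculation manually):

```python
# Age and Income columns from the 20-subject table on the earlier slide
age    = [7, 4, 6, 8, 8, 7, 5, 9, 7, 8, 9, 8, 9, 8, 3, 3, 8, 1, 3, 6]
income = [4, 1, 3, 6, 5, 2, 3, 5, 4, 2, 5, 4, 2, 4, 1, 1, 2, 2, 1, 3]

n = len(age)
mean_a = sum(age) / n       # 6.35
mean_i = sum(income) / n    # 3.0

# Sum of cross-product deviations, divided by N - 1
cov_ai = sum((a - mean_a) * (i - mean_i)
             for a, i in zip(age, income)) / (n - 1)
print(round(cov_ai, 2))     # 2.47
```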
• It depends upon the units of measurement. E.g. the covariance of two variables measured in miles might be 4.25, but if the same scores are converted to kilometres, the covariance is 11.
• One solution: standardize it (normalize the data) by dividing by the standard deviations of both variables.
• The standardized version of covariance is known as the correlation coefficient. It is relatively unaffected by units of measurement.
Limitations of Covariance
r = cov_xy / (s_x s_y) = Σ(x_i − x̄)(y_i − ȳ) / ((N − 1) s_x s_y)
The Correlation Coefficient
r = cov_xy / (s_x s_y) = 4.25 / (1.67 × 2.92) ≈ 0.87
• It varies between −1 and +1; 0 = no relationship.
• It is an effect size:
  ±.1 = small effect
  ±.3 = medium effect
  ±.5 = large effect
• Coefficient of determination, r²: by squaring the value of r you get the proportion of variance in one variable shared by the other.
Things to Know about the Correlation
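Dividing the covariance by both standard deviations gives r; continuing the Age/Income example in Python (statistics.stdev divides by N − 1, consistent with the covariance formula above):

```python
import statistics

age    = [7, 4, 6, 8, 8, 7, 5, 9, 7, 8, 9, 8, 9, 8, 3, 3, 8, 1, 3, 6]
income = [4, 1, 3, 6, 5, 2, 3, 5, 4, 2, 5, 4, 2, 4, 1, 1, 2, 2, 1, 3]

n = len(age)
mean_a, mean_i = sum(age) / n, sum(income) / n
cov_ai = sum((a - mean_a) * (i - mean_i)
             for a, i in zip(age, income)) / (n - 1)

# r = cov / (s_x * s_y): unit-free, always between -1 and +1
r = cov_ai / (statistics.stdev(age) * statistics.stdev(income))
r_squared = r ** 2          # proportion of variance shared
print(round(r, 2))          # 0.67 -- a large effect by the rule of thumb above
```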
Correlation
Covariance is high: r ≈ 1
Covariance is low: r ≈ 0
Correlation
[Figure: scatterplot with fitted trend line f(x) = 0.433x + 0.251, R² = 0.442]
Correlation
Need inter-item/variable correlations > .30
Numeric vector: a <- c(1, 2, 5.3, 6, -2, 4)
Character vector: b <- c("one", "two", "three")
Matrix: y <- matrix(1:20, nrow=5, ncol=4)
Data frame:
  d <- c(1, 2, 3, 4)
  e <- c("red", "white", "red", NA)
  f <- c(TRUE, TRUE, TRUE, FALSE)
  mydata <- data.frame(d, e, f)
  names(mydata) <- c("ID", "Color", "Passed")
List: w <- list(name="Fred", age=5.3)
Data Structures
Framework Source: Hadley Wickham
Correlation Matrix
Correlation and Covariance
cov(x, y) = Σ(x_i − x̄)(y_i − ȳ) / (N − 1)
Revisiting the Height Dataset
Galton: Height Dataset
cor(heights)
Error in cor(heights) : 'x' must be numeric
Initial workaround: Create data.frame without the Factors
h2 <- data.frame(h$father,h$mother,h$avgp,h$childNum,h$kids)
cor() function does not handle Factors
Later we will RECODE the variable into a 0, 1
Excel's CORREL() does not either
Histogram of Correlation Coefficients (ranging from −1 to +1)
Correlation Matrix: Both Types
library(car)
scatterplotMatrix(heights)
Zoom in on Gender
Correlation Matrix for Continuous Variables
chart.Correlation(num2), from the PerformanceAnalytics package
Categorical: Revisit Box Plot
Factors/Categorical variables work with boxplots; however, some functions are not set up to handle Factors.
Note there is an equation here: Y = mx + b
Correlation will depend on spread of distributions
Manual Calculation: Note Stdev is Lower
Note that with 0/1 coding the deltas from the mean are small, so the standard deviation is lower, whereas the continuous variable has much more variation (spread).
Categorical: Recode!
Gender recoded as: 0 = Female, 1 = Male
correl() does not work with Factor variables
Formula now works!
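The recode-then-correlate idea can be sketched in a few lines. The slides do this in R and Excel; this Python sketch uses invented gender/height numbers purely for illustration:

```python
import statistics

# Hypothetical data: gender stored as a categorical "factor", height in inches
gender = ["Female", "Female", "Female", "Male", "Male", "Male"]
height = [63, 64, 62, 69, 70, 71]

# Recode the categorical variable as 0/1 so correlation is computable
g = [{"Female": 0, "Male": 1}[x] for x in gender]

n = len(g)
mean_g, mean_h = sum(g) / n, sum(height) / n
cov_gh = sum((gi - mean_g) * (hi - mean_h)
             for gi, hi in zip(g, height)) / (n - 1)
r = cov_gh / (statistics.stdev(g) * statistics.stdev(height))
print(round(r, 2))   # strongly positive: males are taller in this toy data
```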
Correlation: Continuous & Discrete
More examples of cor.test()
• Too many variables are difficult to handle.
• It takes computing power to handle all that data.
• Principal components analysis seeks to identify and quantify those components by analyzing the original, observable variables
• In many cases, we can wind up working with just a few—on the order of, say, three to ten—principal components or factors instead of tens or hundreds of conventionally measured variables.
Overview
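A minimal PCA sketch with NumPy (the slides show no code for this; the small 2-D dataset below is made up, and the variance explained by each component is read off the eigenvalues of the covariance matrix):

```python
import numpy as np

# Ten 2-D observations (hypothetical); the two columns are highly correlated
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# Center the data, then eigendecompose the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)             # divides by N - 1
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; reverse so the largest is first
explained = eigenvalues[::-1] / eigenvalues.sum()
print(explained)   # the first principal component explains most of the variance
```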
[Diagram: observable variables X1, X2, X3 map to principal components (vectors) Z1, Z2, Z3 — which component explains the most variance?]
Principal Components Analysis
Principal Components
Correlation Regression