Correlation and Covariance
Correlation andCovariance
Overview
Outcome, Dependent Variable (Y-Axis): Height (Continuous) → Histogram
Predictor Variable (X-Axis): Continuous → Scatterplot; Categorical → Boxplot
Variables
Dependent Variable (Y): Height
Independent Variables (X's): X1, X2, X3, X4
Correlation Matrix for Continuous Variables
chart.Correlation(num2), from the PerformanceAnalytics package
• A deviation is the difference between the mean and an actual data point.
• Deviations can be calculated by taking each score and subtracting the mean from it:
deviation = x_i − x̄
Calculating ‘Error’
• Could we take the errors (deviations) between the mean and the data points and add them up?
Σ(X − X̄) = 0
Score Mean Deviation
1 2.6 -1.6
2 2.6 -0.6
3 2.6 0.4
3 2.6 0.4
4 2.6 1.4
Total = 0
Deviation: Use the Total Error?
• We could add the deviations to find out the total error.
• Deviations cancel out because some are positive and others negative.
• Therefore, we square each deviation.
• If we add these squared deviations we get the sum of squared errors (SS).
Deviation: Sum of Squared Errors
SS = Σ(X − X̄)² = 5.20
Score Mean Deviation Squared Deviation
1 2.6 -1.6 2.56
2 2.6 -0.6 0.36
3 2.6 0.4 0.16
3 2.6 0.4 0.16
4 2.6 1.4 1.96
Total 5.20
Sum of Squared Errors
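The table above can be reproduced in a few lines. The slides use R and Excel; this is an equivalent sketch in Python, using the scores 1, 2, 3, 3, 4 from the slide:

```python
# Scores from the slide; their mean is 2.6
scores = [1, 2, 3, 3, 4]
mean = sum(scores) / len(scores)

# Deviations cancel out: positives and negatives sum to (almost exactly) 0
deviations = [x - mean for x in scores]
total_deviation = sum(deviations)

# Squaring each deviation before summing gives the sum of squared errors (SS)
ss = sum(d ** 2 for d in deviations)
print(mean, total_deviation, ss)   # ≈ 2.6, 0, 5.20
```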
• The variance is measured in units squared.
• This isn’t a very meaningful metric so we take the square root value.
• This is the standard deviation (s).
s = √( Σ(x_i − x̄)² / N ) = √(5.20 / 5) ≈ 1.02
Standard Deviation
• The sum of squares is a good measure of overall variability, but is dependent on the number of scores.
• We calculate the average variability by dividing by the number of scores (n).
• This value is called the variance (s²).
Variance
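The variance and standard deviation on these slides can be checked with Python's standard-library statistics module (pvariance and pstdev divide by N, matching the slide's division by n = 5):

```python
import statistics

scores = [1, 2, 3, 3, 4]

# Variance: average squared deviation (SS / n = 5.20 / 5)
variance = statistics.pvariance(scores)

# Standard deviation: square root of the variance, back in the original units
sd = statistics.pstdev(scores)

print(variance, round(sd, 2))   # 1.04, 1.02
```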
Same Mean, Different Standard Deviation
[Figure: Temperature Variation Across Cities — histograms of hourly temperatures (count of hours) for Austin, Las Vegas, San Diego, San Francisco, and Tampa Bay]
cov(x, y) = Σ(x_i − x̄)(y_i − ȳ) / (N − 1)
Covariance
[Figure: scatterplot of deviations of X and Y from their means; persons 2, 3, and 5 look to have similar magnitudes from their means]
cov(x, y) = Σ(x_i − x̄)(y_i − ȳ) / (N − 1) = 17 / 4 = 4.25
Covariance
• Calculate the error [deviation] between the mean and each subject’s score for the first variable (x).
• Calculate the error [deviation] between the mean and their score for the second variable (y).
• Multiply these error values.
• Add these values and you get the cross-product deviations.
• The covariance is the average of the cross-product deviations:
cov(x, y) = Σ(x_i − x̄)(y_i − ȳ) / (N − 1)
Covariance
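The steps above translate directly into code. A minimal sketch (in Python rather than the slides' R; the toy data here is invented for illustration):

```python
def covariance(x, y):
    """Average cross-product of deviations, dividing by N - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Deviation of each x times deviation of the paired y
    cross_products = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]
    return sum(cross_products) / (n - 1)

# Toy example: y moves twice as far from its mean as x does
print(covariance([1, 2, 3], [2, 4, 6]))   # 2.0
```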
Age Income Education
7 4 3
4 1 8
6 3 5
8 6 1
8 5 7
7 2 9
5 3 3
9 5 8
7 4 5
8 2 2
9 5 2
8 4 2
9 2 3
8 4 7
3 1 4
3 1 3
8 2 6
1 2 5
3 1 7
6 3 3
[Figure: scatterplot of Age vs. Income]
[Figure: Delta A and Delta I — deviations of Age and Income from their own means, plotted for subjects 1–20]
Do they VARY the same way relative to their own means?
cov(Age, Income) = 47.0 / 19 ≈ 2.47
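The 2.47 can be verified from the Age and Income columns of the table above (a Python sketch; the slides do this calculation manually):

```python
# Age and Income columns from the 20-subject table on the earlier slide
age    = [7, 4, 6, 8, 8, 7, 5, 9, 7, 8, 9, 8, 9, 8, 3, 3, 8, 1, 3, 6]
income = [4, 1, 3, 6, 5, 2, 3, 5, 4, 2, 5, 4, 2, 4, 1, 1, 2, 2, 1, 3]

n = len(age)
mean_a = sum(age) / n       # 6.35
mean_i = sum(income) / n    # 3.0

# Sum of cross-product deviations, divided by N - 1
cov_ai = sum((a - mean_a) * (i - mean_i)
             for a, i in zip(age, income)) / (n - 1)
print(round(cov_ai, 2))     # 2.47
```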
• It depends upon the units of measurement. E.g. the covariance of two variables measured in miles might be 4.25, but if the same scores are converted to kilometres, the covariance is 11.
• One solution: standardize it (normalize the data) by dividing by the standard deviations of both variables.
• The standardized version of covariance is known as the correlation coefficient. It is relatively unaffected by units of measurement.
Limitations of Covariance
r = cov_xy / (s_x s_y) = Σ(x_i − x̄)(y_i − ȳ) / ((N − 1) s_x s_y)
The Correlation Coefficient
r = cov_xy / (s_x s_y) = 4.25 / (1.67 × 2.92) ≈ 0.87
• It varies between −1 and +1; 0 = no relationship.
• It is an effect size:
  ±.1 = small effect
  ±.3 = medium effect
  ±.5 = large effect
• Coefficient of determination, r²: by squaring the value of r you get the proportion of variance in one variable shared by the other.
Things to Know about the Correlation
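Dividing the covariance by both standard deviations gives r; continuing the Age/Income example in Python (statistics.stdev divides by N − 1, consistent with the covariance formula above):

```python
import statistics

age    = [7, 4, 6, 8, 8, 7, 5, 9, 7, 8, 9, 8, 9, 8, 3, 3, 8, 1, 3, 6]
income = [4, 1, 3, 6, 5, 2, 3, 5, 4, 2, 5, 4, 2, 4, 1, 1, 2, 2, 1, 3]

n = len(age)
mean_a, mean_i = sum(age) / n, sum(income) / n
cov_ai = sum((a - mean_a) * (i - mean_i)
             for a, i in zip(age, income)) / (n - 1)

# r = cov / (s_x * s_y): unit-free, always between -1 and +1
r = cov_ai / (statistics.stdev(age) * statistics.stdev(income))
r_squared = r ** 2          # proportion of variance shared
print(round(r, 2))          # 0.67 -- a large effect by the rule of thumb above
```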
Correlation
Covariance is high: r ≈ 1
Covariance is low: r ≈ 0
Correlation
[Figure: scatterplot with fitted trend line f(x) = 0.433x + 0.251, R² = 0.442]
Correlation
Need inter-item/variable correlations > .30
Numeric vector: a <- c(1, 2, 5.3, 6, -2, 4)
Character vector: b <- c("one", "two", "three")
Matrix: y <- matrix(1:20, nrow=5, ncol=4)
Data frame:
  d <- c(1, 2, 3, 4)
  e <- c("red", "white", "red", NA)
  f <- c(TRUE, TRUE, TRUE, FALSE)
  mydata <- data.frame(d, e, f)
  names(mydata) <- c("ID", "Color", "Passed")
List: w <- list(name="Fred", age=5.3)
Data Structures
Framework Source: Hadley Wickham
Correlation Matrix
Correlation and Covariance
cov(x, y) = Σ(x_i − x̄)(y_i − ȳ) / (N − 1)
Revisiting the Height Dataset
Galton: Height Dataset
cor(heights)
Error in cor(heights) : 'x' must be numeric
Initial workaround: Create data.frame without the Factors
h2 <- data.frame(h$father,h$mother,h$avgp,h$childNum,h$kids)
cor() function does not handle Factors
Later we will RECODE the variable into a 0, 1
Excel's CORREL() does not either
Histogram of Correlation Coefficients (ranging from −1 to +1)
Correlation Matrix: Both Types
library(car)
scatterplotMatrix(heights)
Zoom in on Gender
Correlation Matrix for Continuous Variables
chart.Correlation(num2), from the PerformanceAnalytics package
Categorical: Revisit Box Plot
Factors/Categorical variables work with boxplots; however, some functions are not set up to handle Factors.
Note there is an equation here: Y = mx + b
Correlation will depend on spread of distributions
Manual Calculation: Note Stdev is Lower
Note that with 0/1 coding the deltas from the mean are small, so the standard deviation is lower, whereas the continuous variable has much more variation (spread).
Categorical: Recode!
Gender recoded as: 0 = Female, 1 = Male
correl() does not work with Factor variables
Formula now works!
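The recode-then-correlate idea can be sketched in a few lines. The slides do this in R and Excel; this Python sketch uses invented gender/height numbers purely for illustration:

```python
import statistics

# Hypothetical data: gender stored as a categorical "factor", height in inches
gender = ["Female", "Female", "Female", "Male", "Male", "Male"]
height = [63, 64, 62, 69, 70, 71]

# Recode the categorical variable as 0/1 so correlation is computable
g = [{"Female": 0, "Male": 1}[x] for x in gender]

n = len(g)
mean_g, mean_h = sum(g) / n, sum(height) / n
cov_gh = sum((gi - mean_g) * (hi - mean_h)
             for gi, hi in zip(g, height)) / (n - 1)
r = cov_gh / (statistics.stdev(g) * statistics.stdev(height))
print(round(r, 2))   # strongly positive: males are taller in this toy data
```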
Correlation: Continuous & Discrete
More examples of cor.test()
• Too many variables are difficult to handle.
• It takes computing power to handle all that data.
• Principal components analysis seeks to identify and quantify those components by analyzing the original, observable variables
• In many cases, we can wind up working with just a few—on the order of, say, three to ten—principal components or factors instead of tens or hundreds of conventionally measured variables.
Overview
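A minimal PCA sketch with NumPy (the slides show no code for this; the small 2-D dataset below is made up, and the variance explained by each component is read off the eigenvalues of the covariance matrix):

```python
import numpy as np

# Ten 2-D observations (hypothetical); the two columns are highly correlated
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# Center the data, then eigendecompose the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)             # divides by N - 1
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; reverse so the largest is first
explained = eigenvalues[::-1] / eigenvalues.sum()
print(explained)   # the first principal component explains most of the variance
```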
[Diagram: observable variables X1, X2, X3 map to principal components (vectors) Z1, Z2, Z3 — which component explains the most variance?]
Principal Components Analysis
Principal Components
Correlation Regression