Outliers Chapter 5.3 Data Screening

Upload: julian-todd

Post on 21-Jan-2016

Outliers

Chapter 5.3 Data Screening

Outliers can Bias a Parameter Estimate

…and the Error associated with that Estimate

Outliers

• Outlier – case with extreme value on one variable or multiple variables

• Why?
  – Data input error
  – Not a population you meant to sample
  – From the population, but the distribution has really long tails and very extreme values

Outliers

• Outliers – Two Types
• Univariate – for basic univariate statistics
  – Use these when you have ONE DV or Y variable.
• Multivariate – for some univariate statistics and all multivariate statistics
  – Use these when you have multiple continuous variables or lots of DVs.

Outliers

• Univariate
• In a normal z-distribution, anyone with a z-score beyond +/- 3 is in roughly .3% of the population (about .27% across both tails).
• Therefore, we want to eliminate people whose scores are SO far away from the mean that they are very strange.
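
The z-score rule above can be sketched in R as follows. This is only an illustration: the vector `scores` and its planted extreme value are hypothetical, not data from the slides.

```r
# Hypothetical data: 99 typical scores plus one extreme value
set.seed(42)
scores <- c(rnorm(99, mean = 50, sd = 10), 150)

# Share of a normal distribution beyond +/- 3 SDs (both tails), about 0.0027
2 * pnorm(-3)

# scale() centers on the mean and divides by the SD, giving z-scores
z <- as.numeric(scale(scores))

# Flag cases more than 3 SDs from the mean in either direction
is_outlier <- abs(z) > 3
which(is_outlier)

# Keep only the non-flagged cases
scores_clean <- scores[!is_outlier]
```

Note that the extreme case itself inflates the mean and SD used to compute the z-scores, so this rule can be conservative in small samples.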

Outliers

• Univariate outliers are fine and dandy, but you may have lots of data and don’t want to do each column one at a time.
  – Plus, the multivariate outlier analysis works just as well if it’s one column or 500, so let’s just do that.

Outliers

• Multivariate
  – Now we need some way to measure distance from the mean (because z-scores are the distance from the mean in SD units), but from the mean of means (all the means at once!)
• Mahalanobis distance
  – Creates a distance from the centroid (mean of means)

Outliers

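• Mahalanobis
• The centroid is the point in multidimensional space (picture a 3D plot) defined by the means of all the variables; each case’s distance from that point is then measured
  – Similar to Euclidean distance, but scaled by the variances and covariances of the variables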
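
To make the centroid idea concrete, here is a small sketch that computes the distance by hand and compares it with R's built-in `mahalanobis()`. The data frame `dat` and its columns are made up for illustration. One detail worth knowing is that `mahalanobis()` returns the squared distance.

```r
# Hypothetical data: three numeric columns, two of them correlated
set.seed(1)
x1 <- rnorm(50)
dat <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(50),
                  x3 = rnorm(50))

center <- colMeans(dat)   # the centroid: one mean per column
S <- cov(dat)             # covariance matrix of the columns

# Squared Mahalanobis distance by hand: (x - center)' S^-1 (x - center)
diffs <- sweep(as.matrix(dat), 2, center)
d2_manual <- rowSums((diffs %*% solve(S)) * diffs)

# The built-in function returns the same squared distances
d2_builtin <- mahalanobis(dat, center, S)
all.equal(unname(d2_manual), unname(d2_builtin))

# With an identity covariance matrix it reduces to squared Euclidean distance
d2_euclid <- mahalanobis(dat, center, diag(3))
all.equal(unname(d2_euclid), unname(rowSums(diffs^2)))
```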

Outliers

• Mahalanobis
• No set cut off rule
  – Use a chi-square table.
  – DF = # of variables (DVs, variables that you used to calculate Mahalanobis)
  – Use p < .001

NOTE: DF here has NOTHING to do with the DF for hypothesis testing.

Outliers

• So do I delete them?
• Yes: they are far away from the middle!
• No: they may not affect your analysis!
• It depends: I need the sample size!
• SO?!
  – Try it with and without them. See what happens.

FISH!
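
One way to "try it with and without them" is to fit the same model twice and compare the estimates. The sketch below is illustrative only: the data, the planted extreme case, and the choice of a simple regression are assumptions, not part of the original slides.

```r
# Hypothetical data: a clear linear relation plus one extreme case
set.seed(7)
x <- rnorm(40)
y <- 2 * x + rnorm(40)
dat <- data.frame(x = c(x, 6), y = c(y, -10))

# Flag outliers with the Mahalanobis rule (p < .001, df = 2 columns)
mahal <- mahalanobis(dat, colMeans(dat), cov(dat))
keep <- mahal < qchisq(.999, ncol(dat))

# Fit the same model with and without the flagged cases
fit_all   <- lm(y ~ x, data = dat)
fit_noout <- lm(y ~ x, data = dat[keep, ])

# Compare the slope and its standard error side by side
rbind(with_outliers    = summary(fit_all)$coefficients["x", 1:2],
      without_outliers = summary(fit_noout)$coefficients["x", 1:2])
```

Reporting both versions, rather than picking whichever gives the nicer result, is the safeguard against fishing.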

Outliers

• Important side notes:
  – For ANOVA, t-tests, correlation: you will use a fake regression analysis. It’s considered fake because it’s not the real analysis, just a way to get the information you need to do data screening.

Outliers

• Important side notes:
  – For regression based tests: you can run the real regression analysis to get the same information. The rules are altered slightly, so make sure you make notes in the regression section on what’s different.
• You will also use other regression based values for this analysis.

Outliers

• Important side note:
  – Many functions in R have their own data screening options. This guide is for global screening, not specific to one analysis.

Outliers

• First, figure out which columns are factors and leave them out, as all columns passed to the function need to be int or num.
  – filledin_none[ , -c(1,2)] (here the first two columns are dropped)
  – Use that dataset code in the next function.
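
If you are not sure which columns are factors, R can tell you. A minimal sketch, assuming a hypothetical data frame `dat` whose first two columns are factors (mirroring the `-c(1,2)` pattern above):

```r
# Hypothetical data frame: two factor columns followed by numeric scores
dat <- data.frame(id    = factor(c("a", "b", "c", "d")),
                  group = factor(c("x", "x", "y", "y")),
                  q1    = c(1L, 4L, 3L, 5L),
                  q2    = c(2.5, 3.1, 4.8, 1.2))

str(dat)                                  # shows the type of every column

numeric_cols <- sapply(dat, is.numeric)   # TRUE for int and num columns
which(!numeric_cols)                      # positions of the columns to drop

dat_numeric <- dat[ , numeric_cols]       # same result as dat[ , -c(1, 2)] here
```

Selecting by type rather than by position keeps the code working if columns are later reordered.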

Outliers

• Mahalanobis function
• mahalanobis(
  – dataset name,
  – colMeans(dataset name, na.rm = TRUE),
  – cov(dataset name, use = "pairwise.complete.obs")
  – )

Outliers

• mahal = mahalanobis(filledin_none[ , -c(1,2)],
                      colMeans(filledin_none[ , -c(1,2)], na.rm = TRUE),
                      cov(filledin_none[ , -c(1,2)], use = "pairwise.complete.obs"))

Outliers

• Now, let’s get rid of people with bad scores
  – But what is a bad score?
  – Use a chi-square table.
  – DF = # of variables (DVs, variables that you used to calculate Mahalanobis)
  – Use p < .001

• Oh, let’s make R do it.

Outliers

• Use the qchisq function, which finds the cut off score for you.
  – qchisq(1 - pvalue, number of columns)
• cutoff = qchisq(.999, ncol(dataset))
• cutoff = qchisq(.999, ncol(filledin_none[ , -c(1,2)]))
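
To see what `qchisq` is doing, the sketch below prints the p < .001 cutoffs for one to five variables. These match the values in a standard chi-square table; the variable names are chosen here for illustration.

```r
# Cutoffs at p < .001 for 1 to 5 variables (df = number of columns)
n_vars <- 1:5
cutoffs <- qchisq(1 - .001, n_vars)
round(cutoffs, 2)
# 10.83 13.82 16.27 18.47 20.52

# Equivalent call that asks for the upper tail directly
all.equal(cutoffs, qchisq(.001, n_vars, lower.tail = FALSE))
```

The cutoff grows with the number of columns, which is why DF must match the number of variables passed to `mahalanobis()`.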

Outliers

• So, let’s see how many are bad
  – summary(mahal < cutoff)
• Let’s get rid of those peeps
  – noout = filledin_none[ mahal < cutoff, ]
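
Putting the steps together, here is a runnable sketch of the whole screening pass. The data frame below is a made-up stand-in for `filledin_none` (two factor columns, three numeric columns, one planted extreme case); the real dataset is not shown in the slides.

```r
# Hypothetical stand-in for filledin_none
set.seed(123)
n <- 100
filledin_none <- data.frame(id    = factor(seq_len(n)),
                            group = factor(rep(c("a", "b"), n / 2)),
                            q1    = rnorm(n, 50, 10),
                            q2    = rnorm(n, 30, 5),
                            q3    = rnorm(n, 10, 2))
filledin_none[1, c("q1", "q2", "q3")] <- c(120, 70, 30)  # planted extreme case

screen <- filledin_none[ , -c(1, 2)]          # numeric columns only
mahal  <- mahalanobis(screen,
                      colMeans(screen, na.rm = TRUE),
                      cov(screen, use = "pairwise.complete.obs"))
cutoff <- qchisq(1 - .001, ncol(screen))

summary(mahal < cutoff)                       # FALSE counts the flagged rows

# which() drops NA comparisons, so rows with missing scores do not
# come back as all-NA rows the way they would with a bare logical index
noout <- filledin_none[which(mahal < cutoff), ]
nrow(filledin_none) - nrow(noout)             # number of rows removed
```

With complete data, `which(mahal < cutoff)` and `mahal < cutoff` select the same rows. If some rows have missing values, their distance is `NA`; the `which()` form drops those rows, so whether to keep them instead is a separate decision.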