basics of data cleaning. why examine your data? basic understanding of the data set ensure...

27
Basics of Data Cleaning

Upload: eleanor-sparks

Post on 03-Jan-2016

212 views

Category:

Documents


0 download

TRANSCRIPT

Basics of Data Cleaning

Why Examine Your Data?• Basic understanding of the data set

• Ensure statistical and theoretical underpinnings of a given m.v. technique are met

• Concerns about the data– Departures from distribution assumptions

(i.e., normality)– Outliers– Missing Data

Testing Assumptions• MV Normality assumption

– Solution is better

• Violation of MV Normality– Skewness (symmetry)– Kurtosis (peakedness)– Heteroscedascity – Non-linearity

4.003.503.002.502.001.501.00.500.00

8

6

4

2

0

GPA

Negative Skew

Positive Skew

GPA

4.003.503.002.502.001.501.00.500.00

8

6

4

2

0

Kurtosis

Mesokurtic

Leptokurtic

Platykurtic

Skewness & KurtosisStatistics

age140

4

.354

.205

-.266

.407

Valid

Missing

N

Skewness

Std. Error of Skewness

Kurtosis

Std. Error of Kurtosis

SPSS SyntaxFREQUENCIES VARIABLES=age /STATISTICS=SKEWNESS SESKEW KURTOSIS SEKURT /ORDER= ANALYSIS.

Z values = StatisticStd Error

Skewness = .354/.205 = 1.73

Kurtosis = -.266/.407 = -.654

Critical Values for z score.05 +/- 1.96.01 +/- 2.58

Homoscedascity

22

2

24

4

23

3

21

1

21 = 2

2 = 23 = 2

4 = 2e

When there are multiple groups, each group has similar levels of variance(similar standard deviation)

Linearity

Zscore(STRESS)

43210-1-2-3

Zsc

ore

(PE

RF

)

1

0

-1

-2

-3

-4

-5

-6

Testing the Assumptions of Absence of Correlated Errors • Correlated errors means there is an

unmeasured variable affecting the analysis

• Key is to identify the unmeasured variable and to include it in the analysis

• How often do we meet this assumption?

Data Cleaning• Examine

– Individual items/scales (i.e., reliability)– Bivariate relationships– Multivariate relationships

• Techniques to use– Graphs non-normality, heteroscedasticity– Frequencies missing data, out of bounds

values– Univariate outliers (+/- 3 SD from mean)– Mahalanobis Distance (.001)

Graphical Examination

• Single Variable: Shape of Distribution– Histogram – Stem and leaf

• Relationships between two+ variables– Scatterplot

Histogram

0.00 10.00 20.00 30.00 40.00 50.00

yrsexp

0

10

20

30

40

50

60

Fre

qu

en

cy

Mean = 11.8008Std. Dev. = 10.1035N = 364

Scatterplot

10.00 20.00 30.00 40.00 50.00 60.00 70.00 80.00

age

0.00

10.00

20.00

30.00

40.00

50.00

60.00y

rse

xp

Frequencies

race

290 78.2 78.6 78.6

31 8.4 8.4 87.0

19 5.1 5.1 92.1

19 5.1 5.1 97.3

9 2.4 2.4 99.7

1 .3 .3 100.0

369 99.5 100.0

2 .5

371 100.0

white, not hispanic

african american/black

asian/pacific islander

hispanic

other

9.00

Total

Valid

SystemMissing

Total

Frequency Percent Valid PercentCumulative

Percent

Outliers

• Where do outliers come from?– Inclusion of subjects not part of the

population (e.g., ESL response to vocabulary test)

– Legitimate data points*– Extreme values of random error (X = t + e)– Error in observation– Error in data preparation

Univariate Outliers

• Criteria: Mean +/- 3 SD

• Example: Age – Mean = 34.68– SD = 10.05

• Out of range values > 64.83 or < 4.53

Univariate Outliers

20.00 30.00 40.00 50.00 60.00 70.00

age

0

10

20

30

40

50

Fre

qu

en

cy

Mean = 34.6784Std. Dev. = 10.04525N = 370

Multivariate OutliersMahalanobis Distance SPSS SyntaxRegression Var = case VAR1 VAR2 /statistics collin /dependent =case / enter /residuals = outliers(mahal).

Critical Values (case with D > c.v. is m.v. outlier)two variables - 13.82three variables - 16.27four variables - 18.46five variables - 20.52six variables - 22.46

Approaches to Outliers

• Leave them alone

• Delete entire case (listwise)

• Delete only relevant variables (pairwise)

• Trim – highest legitimate value

• Mean substitution

• Imputation

Effects of Outliers

Zscore(HMWRK)

10-1-2-3-4

Zsc

ore

(EX

AM

)

2

1

0

-1

-2

-3

Zscore(HMWRK)

10-1-2-3-4

Zsc

ore

(EX

AM

)

2

1

0

-1

-2

-3

r = .50 r = .32

0

5

10

15

20

25

0 20

Publications

Interviews

N = 500, no outlier

N = 501, outlier present

N = 401, outlier present

N = 301, outlier present

N = 201, outlier present

N = 101, oulier present

N = 51, oulier present

N = 26, outlier present

Effects of Outliers

Major Problems: Missing Data

• Generalizability issues

• Reduces power (sample size)

• Impacts accuracy of results– Accuracy = dispersion around true score

(can be under- or over-estimation)– Varies with MDT used

Dealing with Missing Data• Listwise deletion• Pairwise deletion • Mean substitution • Regression imputation • Hot-deck imputation • Multiple imputation

Dealing with Missing DataIn Order of Accuracy:• Pairwise deletion • Listwise deletion • Regression imputation• Mean substitution • Hot-deck imputation

Dealing with Missing DataMDT Pros Cons

Listwise deletion Easy to useHigh accuracy

Reduces sample size

Pairwise deletion Easy to useHighest accuracy

Problematic in MV analyses; non-positive definite correlation matrix

Mean substitution Easy to useSaves data; preserves sample sizeModerate accuracy

Attenuation of findings

Regression imputation (no error term adjustment)

Saves data; preserves sample sizeModerate accuracy

Difficult to useCan’t use when all predictors are missing

Hot-deck imputation Saves data; preserves sample size

Lots of bias & error

TransformationsBest Transformation to Try• Square Root

• Log

• Inverse

• “Reflect” (mirror image), then transform

Distribution• Moderate deviation from

normality• Substantial deviation

from normality• Severe deviation; esp. j-

shape• Negative skew

• Interpretation of transformed variables?