basics of data cleaning. why examine your data? basic understanding of the data set ensure...
TRANSCRIPT
Why Examine Your Data?• Basic understanding of the data set
• Ensure statistical and theoretical underpinnings of a given m.v. technique are met
• Concerns about the data– Departures from distribution assumptions
(i.e., normality)– Outliers– Missing Data
Testing Assumptions• MV Normality assumption
– Solution is better
• Violation of MV Normality– Skewness (symmetry)– Kurtosis (peakedness)– Heteroscedascity – Non-linearity
Skewness & KurtosisStatistics
age140
4
.354
.205
-.266
.407
Valid
Missing
N
Skewness
Std. Error of Skewness
Kurtosis
Std. Error of Kurtosis
SPSS SyntaxFREQUENCIES VARIABLES=age /STATISTICS=SKEWNESS SESKEW KURTOSIS SEKURT /ORDER= ANALYSIS.
Z values = StatisticStd Error
Skewness = .354/.205 = 1.73
Kurtosis = -.266/.407 = -.654
Critical Values for z score.05 +/- 1.96.01 +/- 2.58
Homoscedascity
22
2
24
4
23
3
21
1
21 = 2
2 = 23 = 2
4 = 2e
When there are multiple groups, each group has similar levels of variance(similar standard deviation)
Testing the Assumptions of Absence of Correlated Errors • Correlated errors means there is an
unmeasured variable affecting the analysis
• Key is to identify the unmeasured variable and to include it in the analysis
• How often do we meet this assumption?
Data Cleaning• Examine
– Individual items/scales (i.e., reliability)– Bivariate relationships– Multivariate relationships
• Techniques to use– Graphs non-normality, heteroscedasticity– Frequencies missing data, out of bounds
values– Univariate outliers (+/- 3 SD from mean)– Mahalanobis Distance (.001)
Graphical Examination
• Single Variable: Shape of Distribution– Histogram – Stem and leaf
• Relationships between two+ variables– Scatterplot
Histogram
0.00 10.00 20.00 30.00 40.00 50.00
yrsexp
0
10
20
30
40
50
60
Fre
qu
en
cy
Mean = 11.8008Std. Dev. = 10.1035N = 364
Scatterplot
10.00 20.00 30.00 40.00 50.00 60.00 70.00 80.00
age
0.00
10.00
20.00
30.00
40.00
50.00
60.00y
rse
xp
Frequencies
race
290 78.2 78.6 78.6
31 8.4 8.4 87.0
19 5.1 5.1 92.1
19 5.1 5.1 97.3
9 2.4 2.4 99.7
1 .3 .3 100.0
369 99.5 100.0
2 .5
371 100.0
white, not hispanic
african american/black
asian/pacific islander
hispanic
other
9.00
Total
Valid
SystemMissing
Total
Frequency Percent Valid PercentCumulative
Percent
Outliers
• Where do outliers come from?– Inclusion of subjects not part of the
population (e.g., ESL response to vocabulary test)
– Legitimate data points*– Extreme values of random error (X = t + e)– Error in observation– Error in data preparation
Univariate Outliers
• Criteria: Mean +/- 3 SD
• Example: Age – Mean = 34.68– SD = 10.05
• Out of range values > 64.83 or < 4.53
Univariate Outliers
20.00 30.00 40.00 50.00 60.00 70.00
age
0
10
20
30
40
50
Fre
qu
en
cy
Mean = 34.6784Std. Dev. = 10.04525N = 370
Multivariate OutliersMahalanobis Distance SPSS SyntaxRegression Var = case VAR1 VAR2 /statistics collin /dependent =case / enter /residuals = outliers(mahal).
Critical Values (case with D > c.v. is m.v. outlier)two variables - 13.82three variables - 16.27four variables - 18.46five variables - 20.52six variables - 22.46
Approaches to Outliers
• Leave them alone
• Delete entire case (listwise)
• Delete only relevant variables (pairwise)
• Trim – highest legitimate value
• Mean substitution
• Imputation
Effects of Outliers
Zscore(HMWRK)
10-1-2-3-4
Zsc
ore
(EX
AM
)
2
1
0
-1
-2
-3
Zscore(HMWRK)
10-1-2-3-4
Zsc
ore
(EX
AM
)
2
1
0
-1
-2
-3
r = .50 r = .32
0
5
10
15
20
25
0 20
Publications
Interviews
N = 500, no outlier
N = 501, outlier present
N = 401, outlier present
N = 301, outlier present
N = 201, outlier present
N = 101, oulier present
N = 51, oulier present
N = 26, outlier present
Effects of Outliers
Major Problems: Missing Data
• Generalizability issues
• Reduces power (sample size)
• Impacts accuracy of results– Accuracy = dispersion around true score
(can be under- or over-estimation)– Varies with MDT used
Dealing with Missing Data• Listwise deletion• Pairwise deletion • Mean substitution • Regression imputation • Hot-deck imputation • Multiple imputation
Dealing with Missing DataIn Order of Accuracy:• Pairwise deletion • Listwise deletion • Regression imputation• Mean substitution • Hot-deck imputation
Dealing with Missing DataMDT Pros Cons
Listwise deletion Easy to useHigh accuracy
Reduces sample size
Pairwise deletion Easy to useHighest accuracy
Problematic in MV analyses; non-positive definite correlation matrix
Mean substitution Easy to useSaves data; preserves sample sizeModerate accuracy
Attenuation of findings
Regression imputation (no error term adjustment)
Saves data; preserves sample sizeModerate accuracy
Difficult to useCan’t use when all predictors are missing
Hot-deck imputation Saves data; preserves sample size
Lots of bias & error