data screening using spss - teaching
TRANSCRIPT
![Page 1: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/1.jpg)
Data screening using SPSSData screening using SPSS
Natalie Loxton
7 July 2008
![Page 2: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/2.jpg)
Recommended TextsRecommended Texts
• Tabachnick & Fidell (2007) Using multivariate statistics (5th ed). Chpt 4
• Field (2003) Discovering statistics using SPSS for windows. Chpt 2windows. Chpt 2
• Cohen, Cohen, West & Aiken (2003) Applied multiple regression/correlation analysis for the behavioural sciences (3rd ed.). Chpt 4
![Page 3: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/3.jpg)
Key assumptions in GLMKey assumptions in GLM
1. Normality (see handout)
2. Homogeneity of variance (depends on analysis)
3. Interval level data (need to check this BEFORE 3. Interval level data (need to check this BEFORE design questionnaire or collect data, e.g., ask for actual age rather than provide age groups)
4. Independence of observations (part of data collection, research design)
![Page 4: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/4.jpg)
General tips before we begin….General tips before we begin….
• Coding books !!!• SPSS file types:
*.sav = data file*.sav = data file*.spo = output file
• *.sps = syntax file • Switch on “Display commands in log”• Set page length to “infinite” to reduce
overall output
![Page 5: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/5.jpg)
Data entry (Between subjects)Data entry (Between subjects)ID Status Friends Alc Income Neurot
1 Lecturer 5 10 20000 10
2 Lecturer 2 15 40000 17
3 Lecturer 0 20 35000 14
4 Lecturer 4 5 22000 134 Lecturer 4 5 22000 13
5 Lecturer 1 30 50000 21
6 Student 10 25 5000 7
7 Student 12 20 100 13
8 Student 15 16 3000 9
9 Student 12 17 10000 14
10 Student 17 18 10 13
![Page 6: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/6.jpg)
Data entry (Within subjects)Data entry (Within subjects)
ID Pre-treatment
Post-Treatment
6mth FU
1 30 20 22
2 25 24 242 25 24 24
3 15 10 5
4 20 18 20
5 32 30 30
![Page 7: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/7.jpg)
Data entry (Mixed design)Data entry (Mixed design)
Client Group Pre Post 6 mth FU
1 Treatment 30 20 22
2 Treatment 25 24 24
3 Treatment 15 10 5
4 Treatment 20 18 204 Treatment 20 18 20
5 Treatment 32 30 30
6 Control 25 25 25
7 Control 20 20 22
8 Control 16 15 10
9 Control 17 20 25
10 Control 18 18 20
![Page 8: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/8.jpg)
•Ensure collect enough data • Drop down menus versus syntax•Assigning ID numbers to cases (after the fact)
compute id= $casenum.
ScreeningScreening
compute id= $casenum.Execute.
Open dataset: “missing data Tabachnick.sav”
Add ID numbers
![Page 9: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/9.jpg)
•Check amount of missing datacompute miss= nmiss (var1 – var K).Execute.
• Types of missing data and estimation
Missing DataMissing Data
• Types of missing data and estimation
![Page 10: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/10.jpg)
Missing completely at random (MCAR)
Missing data is independent of any other measured variable (y2) and independent of the variable itself (y1).
• I.e., SES=y2; depression=y1.
• If participants dropped out across a range of SES levels, then the missing on depression would be independent of SES.
• Little’s MCAR test in MVA indicates whether MCAR or not (want ns)
![Page 11: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/11.jpg)
Missing at random (MAR)
Missing data may be dependent on another measured variable (y2), but is independent of the variable itself (y1).
• I.e., SES=y2; depression=y1.
• If participants only from high levels of SES dropped out , then the missing on depression would be dependent on SES.SES.
• MAR can be inferred if Little’s test is signif but missingness predictable from other vars (other than the variable itself) – tested by Separate Variance Test. MNAR indicated if this test reveals missingness related to the DV
• Consult supervisor or statistics advisor (not my area of expertise so check with someone who is)
![Page 12: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/12.jpg)
Replacing Missing ValuesMean value replacement: NO - restricts the
variance of the variables involved.
If <5% missing data
doesn’t really matter what method
Tabachnick & Fidell (2007), pp. 62-72Tabachnick & Fidell (2007), pp. 62-72
Schafer, J. L., & Graham, J. W. (2002). Missing Data: Our View of the State of the Art. Psychological Methods, 7, 147-177.
Little & Rubin (1987) Statistical analysis with missing data. New York: John Wiley
![Page 13: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/13.jpg)
Replacing Missing ValuesExpectation Maximisation (EM):
Use “Missing values analysis” in SPSS and create new data setMVAtimedrs attdrug atthouse income emplmnt mstatus race/MAXCAT = 25/CATEGORICAL = emplmnt mstatus race/NOUNIVARIATE/NOUNIVARIATE/TTEST NOPROB PERCENT=5/MPATTERN DESCRIBE = timedrs attdrug atthouse income emplmnt
mstatus race/EM ( TOLERANCE=0.001 CONVERGENCE=0.0001 ITERATIONS=25
• OUTFILE='F:\Workshop_cleaning data\missing data tabachnick with EM‘+ ' substitution.sav' ) .
See Table 4.1 in T&F (2007) for output and discussion of limitations of EM substitution
Repeat analysis with EM sub and complete cases and check for similarity
![Page 14: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/14.jpg)
SPSS Exam data (Field, 2000)SPSS Exam data (Field, 2000)
• 100 stats students
• Exam = first year statistics exam scores
• Computer = measure of computer literacy• Computer = measure of computer literacy
• Lecture = percentage of statistics lectures attended
• Numeracy = measure of student’s numeracy out of 15
![Page 15: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/15.jpg)
Give ID numbers to cases
Check min & max scores within range
Types of distributions
Refer to handout for details
freq [VARIABLE LIST]/stat= def skew seskew kurt sekurt /hist= norm.
![Page 16: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/16.jpg)
DistributionsDistributionsStatistics
100 100 100 100
0 0 0 0
58.1000 50.7100 59.7650 4.8500
21.31557 8.26004 21.68478 2.70568
-.107 -.174 -.422 .961
.241 .241 .241 .241
-1.105 .364 -.179 .946
.478 .478 .478 .478
15.00 27.00 8.00 1.00
99.00 73.00 100.00 14.00
Valid
Missing
N
Mean
Std. Deviation
Skewness
Std. Error of Skewness
Kurtosis
Std. Error of Kurtosis
Minimum
Maximum
Percentageon SPSS
examComputer
literacy
Percentageof lecturesattended Numeracy
![Page 17: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/17.jpg)
Give ID numbers to cases
Check min & max scores within range
Types of distributions
Check skewness & kurtosis
Identify univariate outliers
![Page 18: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/18.jpg)
DistributionsDistributionsStatistics
100 100 100 100
0 0 0 0
58.1000 50.7100 59.7650 4.8500
21.31557 8.26004 21.68478 2.70568
-.107 -.174 -.422 .961
.241 .241 .241 .241
-1.105 .364 -.179 .946
.478 .478 .478 .478
15.00 27.00 8.00 1.00
99.00 73.00 100.00 14.00
Valid
Missing
N
Mean
Std. Deviation
Skewness
Std. Error of Skewness
Kurtosis
Std. Error of Kurtosis
Minimum
Maximum
Percentageon SPSS
examComputer
literacy
Percentageof lecturesattended Numeracy
.241* 3 = .723
![Page 19: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/19.jpg)
OutliersOutliers
• Using “Descriptives” ask for Z scores• Identify data points >3.29 or < -3.29
Using syntax:Desc [var1 var2 var3] /SAVE.
*by default, var called z[oldvarname].
*check if there are any greater than Zabs = 3.29.freq [ZVAR]/stat= min max /format=notable.
*list the offenders.temp.sel if ( [ZVAR] > 3.29 or [ZVAR] < - 3.29 ).list id [ZVAR].
![Page 20: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/20.jpg)
Sensitivity to Reward Scale
30.0
27.5
25.0
22.5
20.0
17.5
15.0
12.5
10.0
7.5
5.0
2.5
0.0
Sensitivity to Reward Scale
Fre
quen
cy
120
100
80
60
40
20
0
Std. Dev = 4.33
Mean = 10.0
N = 443.00
OutliersOutliers
Reasons for outliers
1. Data entry error2. Failure to specify 99 or 999 as missing data3. Outlier not a true member of population of interest4. Outlier is a true member of population of interest with an
AUDIT total score
30.0
27.5
25.0
22.5
20.0
17.5
15.0
12.5
10.0
7.5
5.0
2.5
0.0
AUDIT total score
Fre
quen
cy
120
100
80
60
40
20
0
Std. Dev = 5.33
Mean = 6.6
N = 443.00
4. Outlier is a true member of population of interest with an extreme score
What to do?
1. Look at histogram1. Sometimes transforming data can “pull in” the outlier2. Censoring outliers 3. May need to delete case/s and run with and without
outlier
![Page 21: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/21.jpg)
Give ID numbers to cases
Check min & max scores within range
Types of distributions
Check skewness & kurtosis
Identify univariate outliers
Making a record of decisions
![Page 22: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/22.jpg)
Screening SheetsScreening SheetsVariable Outliers? Skewed?
+/-?
Kurtosis? Transformation? Result of Transformation?
Additional Information
![Page 23: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/23.jpg)
BreakBreak
![Page 24: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/24.jpg)
“Eating & Alcohol data”“Eating & Alcohol data”• 430 female university students
• Age = age in years
• Bul = Bulimia Scale
• AUDIT = Measure of hazardous drinking
• SR = Sensitivity to Reward• SR = Sensitivity to Reward
• FES_Coh = Family Cohesion
• Your task – run thru the previous checklist and assign an ID and check for skew, kurtosis, outliers and record on screening sheet (missing data already sorted)
![Page 25: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/25.jpg)
Give ID numbers to cases
Check min & max scores within range
Types of distributions
Check skewness & kurtosis
Identify univariate outliers
Making a record of decisions
![Page 26: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/26.jpg)
Transforming dataTransforming data• Positively skewed data
• Moderate skew – use squareroot (SQRT(var))
• More severe skew – use log (lg10(var))
• Horribly skewed – try inverse (-1/(var))
• Note. If start at zero need to add a constantCOMPUTE var_inv = -1/(var+1) .COMPUTE var_inv = -1/(var+1) .
EXECUTE
AUDIT total score
30.0
27.5
25.0
22.5
20.0
17.5
15.0
12.5
10.0
7.5
5.0
2.5
0.0
AUDIT total score
Fre
quen
cy
120
100
80
60
40
20
0
Std. Dev = 5.33
Mean = 6.6
N = 443.00
AUDIT_SQ
5.50
5.00
4.50
4.00
3.50
3.00
2.50
2.00
1.50
1.00
.50
0.00
AUDIT_SQ
Fre
quen
cy
80
60
40
20
0
Std. Dev = 1.10
Mean = 2.31
N = 443.00
BUL_LOG
1.63
1.56
1.50
1.44
1.38
1.31
1.25
1.19
1.13
1.06
1.00
.94
.88
BUL_LOG
Fre
quen
cy
100
80
60
40
20
0
Std. Dev = .17
Mean = 1.14
N = 442.00
Bulimia 6 point scale
42.5
40.0
37.5
35.0
32.5
30.0
27.5
25.0
22.5
20.0
17.5
15.0
12.5
10.0
7.5
Bulimia 6 point scale
Fre
quen
cy
140
120
100
80
60
40
20
0
Std. Dev = 6.42
Mean = 14.9
N = 442.00
Age
52.5
50.0
47.5
45.0
42.5
40.0
37.5
35.0
32.5
30.0
27.5
25.0
22.5
20.0
17.5
Age
Fre
quen
cy
200
100
0
Std. Dev = 8.03
Mean = 23.5
N = 443.00
AGE_INV
-.020
-.025
-.030
-.035
-.040
-.045
-.050
-.055
-.060
AGE_INV
Fre
quen
cy
200
100
0
Std. Dev = .01
Mean = -.046
N = 443.00
![Page 27: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/27.jpg)
Transforming dataTransforming data• Negatively skewed data
• Need to reflect and transform, then flip
• COMPUTE var_rsq = -1*SQRT(K-var).
• When K = greatest value plus 1
• E.g., COMPUTE coh_rsq = -1*SQRT(10-• E.g., COMPUTE coh_rsq = -1*SQRT(10-coh)
FES Cohesion
10.08.06.04.02.00.0
FES Cohesion
Fre
quen
cy
160
140
120
100
80
60
40
20
0
Std. Dev = 2.80
Mean = 6.2
N = 430.00
![Page 28: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/28.jpg)
Transforming dataTransforming data• Bimodal data
• Check whether 2 underlying pop.s
• Split into dichotomous var around the break and test using analyses that handle dichotomous variables (depends on whether dichotomous variables (depends on whether the dichotomous var is an IV or a DV)
![Page 29: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/29.jpg)
Multivariate screeningMultivariate screeningNeed to run a MRMultivariate outliers – use Mahalanobis
Distance (the distance of a point for a participant from the centre of the participant from the centre of the distribution for all participants)
- use Chi Sq table to see if so distant as to be a statistically significant multivariate outlier
(df = number of IVs; p always set to .001)
![Page 30: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/30.jpg)
Regression resultsRegression results
Residuals Statisticsa
4.8239 9.1463 6.6698 .88096 430
-2.095 2.811 .000 1.000 430
.290 1.227 .543 .179 430
4.7844 9.4943 6.6678 .88577 430
-8.15579 21.92994 .00000 5.27521 430
Predicted Value
Std. Predicted Value
Standard Error ofPredicted Value
Adjusted Predicted Value
Residual
Minimum Maximum Mean Std. Deviation N
-8.15579 21.92994 .00000 5.27521 430
-1.539 4.138 .000 .995 430
-1.552 4.178 .000 1.002 430
-8.29190 22.35648 .00193 5.34797 430
-1.554 4.261 .002 1.006 430
.285 21.991 3.991 3.576 430
.000 .068 .003 .007 430
.001 .051 .009 .008 430
Std. Residual
Stud. Residual
Deleted Residual
Stud. Deleted Residual
Mahal. Distance
Cook's Distance
Centered Leverage Value
Dependent Variable: AUDIT total scorea.
![Page 31: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/31.jpg)
Residuals Statisti
4.8239 9.1463
-2.095 2.811
.290 1.227
Predicted Value
Std. Predicted Value
Standard Error ofPredicted Value
Minimum Maximum
4.7844 9.4943
-8.15579 21.92994
-1.539 4.138
-1.552 4.178
-8.29190 22.35648
-1.554 4.261
.285 21.991
.000 .068
.001 .051
Predicted Value
Adjusted Predicted Value
Residual
Std. Residual
Stud. Residual
Deleted Residual
Stud. Deleted Residual
Mahal. Distance
Cook's Distance
Centered Leverage Value
Dependent Variable: AUDIT total scorea.
![Page 32: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/32.jpg)
Multicollinearity : Correlation between IVsIf <.7 okCheck bivariate correlations between IVs
Multivariate screeningMultivariate screening
Tolerance:Amount of variance NOT predicted by the other IVs>.01 ok Place “tol” in syntax or check collinearity
diagnostics when run MR
![Page 33: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/33.jpg)
Regression resultsRegression results
Coefficientsa
8.709 1.603 5.431 .000
.000 .035 -.001 -.013 .990 .868 1.152
.045 .042 .054 1.066 .287 .892 1.121
-.116 .067 -.090 -1.744 .082 .867 1.154
-.250 .097 -.131 -2.574 .010 .890 1.123
(Constant)
Age
Bulimia 6 point scale
Sensitivity to RewardScale
FES Cohesion
Model1
B Std. Error
UnstandardizedCoefficients
Beta
StandardizedCoefficients
t Sig. Tolerance VIF
Collinearity Statistics
Dependent Variable: AUDIT total scorea.
![Page 34: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/34.jpg)
Normality of residualsNormality of residuals
Scatterplot
Dependent Variable: IRREL
Re
sidu
al
3
2
1
1270
30
212
75
44439
3123
20
971057164
38
29109
636091
13
106
Should be square-ish
Cf T&F
Regression Standardized Predicted Value
43210-1-2
Reg
ress
ion
Sta
nda
rdiz
ed
R 1
0
-1
-2
-3
-4
752054
38
1018672
63
59464234
1107
1039998
9391
8382
81
80 7673
69
68666255
51
48
4745
3736163110
106
100
90
7978
67 65
52
50 4943 40
28
2725
1817
11
10
108979695
92
89
8885 58
57
5335 26
22
19
6
1029484
77
7441
33
3224
8
![Page 35: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/35.jpg)
Need to run multiple analysis:
Non-transformed, outliers in
Non-transformed, outliers out
Deciding which output to reportDeciding which output to report
Transformed, outliers in
Transformed, outliers out
May not needif transformation“fixes” the outliers
![Page 36: Data screening using SPSS - Teaching](https://reader031.vdocuments.us/reader031/viewer/2022020700/61f392cc80758f75056eeca8/html5/thumbnails/36.jpg)
Summary tableSummary tableUse this table to make the decision of which analysis to report
IVs TransformedN Y
Outliers
IN
Outliers
OUT
Rules: 1) If significance doesn’t change (to or from non- significance), report NON-TRANSFORMED, if changes report
TRANSFORMED2) Univariate and multivariate outlier may need to be deleted if
influencing results and/or don’t seem to be part of the pop. of interest. Try to reduce the influence by transformation or censoring. If in doubt consult with supervisor.