learn data preparation and pre-processing in spss with an
TRANSCRIPT
Learn Data Preparation and Pre-
Processing in SPSS With an Online
Survey of Correlates of Heavy
Drinking (2012)
© 2019 SAGE Publications Ltd. All Rights Reserved.
This PDF has been generated from SAGE Research Methods Datasets.
Learn Data Preparation and Pre-
Processing in SPSS With an Online
Survey of Correlates of Heavy
Drinking (2012)
Student Guide
Introduction
This example dataset introduces data preparation and pre-processing. Before any
statistical analysis can take place, most data need to be prepared; however, this
topic is not often covered in general statistics textbooks. This brief practical guide
offers a theoretical overview, as well as tools for approaching data preparation,
originating in Excel and processed into SPSS.
The illustrative example walks through the preparation and pre-processing of data
from an online survey of the correlates of heavy drinking. Heavy drinking is a
complex phenomenon, and it is important for researchers to ask many questions
of participants, which contributes to moderately large and multivariate datasets.
Because dealing with a raw and relatively large dataset at the conclusion of a
study can be a daunting task, it is important to develop some skills to aid preparing
the data for analysis. This example is applicable to any research on complex
phenomena that would generate multivariate datasets.
This page provides links to this sample dataset and a guide to using statistical
software for data preparation and pre-processing.
SAGE
2019 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
2
Page 2 of 14 Learn Data Preparation and Pre-Processing in SPSS With an Online
Survey of Correlates of Heavy Drinking (2012)
What Is Data Preparation and Pre-Processing?
At the conclusion of a survey, one might think the next logical step would be
to simply upload the data where the responses were entered into a statistical
analysis package and, with the help of a statistics textbook, analyze. But in reality,
most data need to be prepared (or pre-processed) before they can be analyzed.
In most cases, a raw dataset is messy: Spreadsheets may include extra notes
and text information that statistical analysis software cannot read, some of the
respondents’ answers (rows) may be nonsense, and some of the variables in raw
form (columns) might not make sense.
Problematically, statistics textbooks almost never present raw data; in order to
more efficiently teach statistical methods such as regression or ANOVA, most
textbook examples and practice sets start with data that are ready to go. And there
are very few practical guides that show us what we need to do in order to prepare
data for analysis. The primary literature on data preparation is largely theoretical
and aimed more at data mining, typically of enormous datasets, and typically in
the corporate world (García, Ramírez-Gallego, Luengo, Benítez, & Herrera, 2016;
Pyle, 1999). Also, an Internet search for tutorials merely leads researchers to the
ads of many companies offering to prepare data for a price.
Given this dearth of practical help, the prospect of handling a raw dataset can be
daunting. But in general, after the extra notes, color codes, comments, and so on
are taken out of a spreadsheet so that it looks like a proper dataset (i.e., a row of
variable names with nothing but numbers below it), there are just two simple steps
to data preparation: (1) cleaning and (2) recoding, each of which will be discussed
in turn.
Data Cleaning (Rows)
Although the term “data cleaning” sounds a bit unethical, it is actually anything
SAGE
2019 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
2
Page 3 of 14 Learn Data Preparation and Pre-Processing in SPSS With an Online
Survey of Correlates of Heavy Drinking (2012)
but (and so should certainly not be confused with more questionable “data
massaging” analytic practices). In general, it is simply seeing that the individuals’
responses make sense. These are the rows of the spreadsheet or dataset—where
one row corresponds to a single person’s responses to all the questions (columns)
on a survey.
Participants’ responses might be problematic for many reasons. Autobots and
repeat survey-takers are not uncommon to online survey venues, for instance. If
the same person (or robot) takes a survey five times for money, then at least four
of those survey responses need to be removed or cleaned out of the dataset.
Often, this is relatively easy to do if repeat survey-takers appear to put decreasing
amounts of effort into each successive survey—only answering a few questions or
just repeatedly clicking the same button (e.g., all “strongly agree” to contradicting
questions, just because “strongly agree” is on the right side of the screen). Also,
if survey questions allow participants to openly write answers to questions, the
answers might not make sense (e.g., song lyrics or gibberish copied and pasted
by bots into the text field).
Although there are advanced methods for detecting robot activity on a survey, the
aforementioned techniques, including gaining a thorough knowledge of one’s own
dataset, are invaluable. It is up to the researcher to decide whether or not an entire
row should be removed or just a cell within it, though a general rule of thumb is to
try to keep as much data as possible.
Also, in cases in which a person does not fit a survey’s inclusion criteria, it is not
necessary to delete this person (i.e., their row of data), as most statistical software
packages allow analyses to be done with subsets of individuals. This might be the
case in which we are interested in seeing survey results in the entire sample and
then in a subset of individuals, such as just women, middle-aged adults, or those
who specifically reported not taking any medication. For these reasons and again,
SAGE
2019 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
2
Page 4 of 14 Learn Data Preparation and Pre-Processing in SPSS With an Online
Survey of Correlates of Heavy Drinking (2012)
it is best to keep as much data as possible, including extra documentation when
available, in the data dictionary.
Data Recoding (Columns)
The term “recoding” also does not sound quite right, but again, it is an important
part of data preparation. Recoding is simply assigning numeric values to
substantive responses in a manner that the resulting variables’ set of values make
substantive sense. Variables are usually given as the columns of the dataset.
Variables’ values might be problematic for various reasons. One of the biggest
reasons for this has to do with differences in online survey software. The truth is
computers are not thinking beings; they will assign numbers to any ordinal scale
associated with any question, although we cannot expect that the numbers will
follow a direction that makes sense to humans. For example, we may have a
question that we know falls on a perfectly good ordinal scale, like: “How satisfied
are you with living in California,” where options include (in this order): extremely,
moderately, a little, not at all, not applicable. But unless the survey software is
told up front what we want the corresponding number values to be, the computer
might assign these numbers as follows, just because of the order: extremely = 1,
moderately = 2, a little = 3, not at all = 4, not applicable = 5. That would mean that
the people scoring highest on this “satisfaction” variable are people do not even
report the question as applicable.
Instead, researchers would want to “reverse the coding” of this question, as well
as set the N/A option to missing or give it some coded value that clarifies it is
not part of the same scale as the other responses (−999 is sometimes used, for
instance). For example: extremely = 3, moderately = 2, a little = 1, not at all =
0, not applicable = blank. Notice also, that “Not at all” seems better represented
numerically by “0” as opposed to by “1” like the software, if left to its own devices,
might do. The good news is we can fix this problem with computer syntax within
SAGE
2019 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
2
Page 5 of 14 Learn Data Preparation and Pre-Processing in SPSS With an Online
Survey of Correlates of Heavy Drinking (2012)
any statistical software package. Doing so is called “recoding” variables—so that
the number codes make substantive sense—for each variable column.
Combining Questions
Additionally, researchers might want to combine questions. This happens often
when we are most interested in a “total score” out of participants’ answers to many
questions. For example, perhaps participants answered 50 questions relating to
depression, and we hope to sum those together to make a “total depression”
score column. As a smaller example, we might add the three questions “how many
times did you cry last week,” “how many times did you feel down last week,”
and “how often did you have trouble sleeping last week” to create a total score
related to depressed behaviors. A new, summed-up variable like this might be
called “TotalD.”
Of note, some of the questions themselves may need to be reverse-coded before
they can be added to a total score (that is, technically speaking, we may need
to “reverse the polarity” of a question item). For example, it would be reasonable
to ask participants about happy mood in our depression questionnaire, but keep
in mind that a high score for happy mood would correspond to a low score, or
subtract from, the total score for depression.
Normalizing Versus Recoding
Lastly, researchers may want to know whether a variable—usually the main
dependent/criterion/outcome variable of interest—is normally distributed. This is
a favorite, and lengthily discussed, topic of statistics textbooks. The gist of this
discussion is that if we cannot compute a mean or average from a variable, or we
can compute one but we do not trust it for various reasons, then it might not be
a good variable in the first place. That is, a good variable has a good amount of
variation—with people representing the high, middle, and low end of our question.
SAGE
2019 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
2
Page 6 of 14 Learn Data Preparation and Pre-Processing in SPSS With an Online
Survey of Correlates of Heavy Drinking (2012)
For example, if our dependent variable is income, it is important to have people
with many different incomes, ultimately to figure out what contributes to those
differences in income.
When a variable does not have such variability, there are many ways to address
the problem though whether or not to “fix” a variable after it is collected is a point
of debate. For example, due to a mathematical phenomenon, we know that if we
take the square root of all the scores of a variable, that variable can look a bit more
normally distributed (i.e., as if there are a lot of middles, some highs and lows,
and very few very-highs and very-lows, creating a bell-shaped curve). Applying
mathematical actions like this is called “transforming” variables. Depending on the
field of study and variables of interest, different kinds of transformations are useful
to know (see Osborne, 2010, and Peng, 2009, for user-friendly tutorials on this
advanced topic on data preparation). That said, actions like taking the square root
of numbers can render them meaningless; for example, there is no such thing in
the real world as a square-rooted dollar.
One solution might be to recode values (Coolidge, 2013). For example, perhaps
our income variable has five socioeconomic strata (Poverty, Lower middle, Middle,
Upper middle, and Affluent), but it turns out that very few people are separately in
each of the three middle classes. Rather than “normalizing” the variable, we can
use recoding to create a new three-tiered variable (Poverty, Middle, and Affluent).
Indeed, depending on sampling issues, we might even need to create just a binary
or dichotomous variable (Poor and Not Poor). Note that if our only option is to
have a dichotomous dependent variable, it is not the end of the world, but we
would need to do statistics that are not based in the classical school of interpreting
means (e.g., non-parametric statistics).
Illustrative Example: An Online Survey of Correlates of Heavy
SAGE
2019 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
2
Page 7 of 14 Learn Data Preparation and Pre-Processing in SPSS With an Online
Survey of Correlates of Heavy Drinking (2012)
Drinking
This example walks through the preparation and pre-processing of data from an
online survey of the correlates of heavy drinking. Because heavy drinking is a
complex and multivariate phenomenon, it is important to ask individuals many
questions about drinking behaviors, their beliefs about alcohol, and other aspects
that might relate to these (Martinez, Sher & Wood, 2014; Scalzo & Martinez,
2017). That said, experimental studies about heavy drinking may incorporate
observational surveys (i.e., supplemental questions or variables that are used as
covariates in experimental research). And it should be noted that each question
that is asked on a survey is considered a variable that could relate to drinking—so
the more questions asked, the bigger a dataset gets. This makes it important to
think ahead, in terms of how big a dataset will ultimately be based on all the
questions that are being asked.
The Data
The present dataset comes from a large online survey that was created with
Survey Monkey survey software and was collected using Amazon’s Mechanical
Turk (MTurk) online forum. The raw dataset has n = 200 respondents and includes
3 demographic questions, 8 questions about participants’ own drinking behaviors,
and 11 questions about participants’ beliefs about others’ sense of approval
surrounding drinking (known as injunctive norms; Larimer, Turner, Mallett &
Geisner, 2004). For confidentiality, the MTurk IDs have been replaced with random
four-digit numbers for each participant; these are the codes that allow us to
differentiate one individual from another, and so some observation/respondent
“ID” is typically one of the first columns in a dataset. Otherwise, the data have
been left completely raw. A full data dictionary (codebook), or explanation of
variables as downloaded from Survey Monkey in their raw form, is included in this
segment.
SAGE
2019 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
2
Page 8 of 14 Learn Data Preparation and Pre-Processing in SPSS With an Online
Survey of Correlates of Heavy Drinking (2012)
Preparing and Pre-Processing the Data
In looking at this dataset, the first thing that pops out is all the blank spaces.
Indeed, many survey respondents do not answer all the questions, or any
questions at all for that matter. Although most statistical software programs can
ignore missing information in a dataset, these blank holes are our first indicator
that the data need to be prepared for analysis. But there are other indicators,
too, which will be discussed in turn. Specifically, first, we must create a proper
dataset. Then, to carry out a full data preparation and pre-processing effort on this
dataset, we must (1) “clean” the participants’ observations (rows) of any problems
we observe, and (2) recode the variables (columns) to make substantive sense.
Create a Proper Dataset
The most important indicator that the data need to be prepared is that the dataset
does not look like a proper dataset. The first row of a proper dataset should
be variable names; each variable name should be one word only and contain
no spaces (the reason for this is that many statistical software programs do not
accept variable names with any spaces, or even treat spaces as the end of one
variable/column and the start of another, which would messily split the information
from one variable into two different places!). Below each of these columns should
be numbers only. But the example dataset has three rows of information. The
top row was added by the author after download and contains variable names
for each column. The next two rows are the survey questions themselves as was
provided by the survey software. Note that the software did not assign variable
names to the questions; this is up to the researcher to do. As long as there is just
one row of variable names at the top, and just numbers in all the cells below, a
dataset can be imported into just about any statistical software package.
An important note should be made at this point. Namely, being able to import
a spreadsheet into SPSS is important for cleaning as will be discussed below.
SAGE
2019 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
2
Page 9 of 14 Learn Data Preparation and Pre-Processing in SPSS With an Online
Survey of Correlates of Heavy Drinking (2012)
But the actual dataset cleaning should never be done in SPSS (i.e., as an SPSS
dataset file). Instead, with each successive data preparation step, save the data
file as a new Excel file and not an SPSS file. This is important to do, should we
ever need to go back and revisit what we previously did. Thus, once we delete
the two extraneous rows of this dataset, the next step is to save it as a new file
in Excel: “dataset-heavy-drinking-data-2012-subset1_v2” (note that the “version
control” tool in Excel is another option for a less manual way to track work and
changes, though the tool’s capability may change as versions of Excel change).
Clean the Observations (Rows)
It always pays to start by inspecting any dataset for problems, no matter how big
the dataset is. In terms of this dataset, there appear to be no strange numbers or
anomalies (though this will be discussed a bit later). Moreover, as the survey was
done on an online platform for money, there may be participant duplicates.
This dataset is too big to keep track of duplicates by hand, so the “frequency”
function of SPSS will be useful. To find duplicates, we run a frequency table of the
“ID” variable (please refer to the companion SPSS guide for help). The frequency
table should show just one of each ID number; if there are more, that individual
took the survey more than once. For example, person #1137 attempted the survey
three times.
While statistical software packages can generally ignore completely missing
values, a more insidious problem comes with people such as #7715, who appear
to have answered the survey questions twice—both times with slightly different
responses. It would be inaccurate and bad research to keep both of these rows.
However, as the researcher, you may use your own judgment about which of
these two rows to delete, if not both (e.g., choosing the one that comes first,
as the most effort might have been given it; randomly choosing one of the two).
Depending on how you choose to clean these duplicates, this dataset should
SAGE
2019 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
2
Page 10 of 14 Learn Data Preparation and Pre-Processing in SPSS With an Online
Survey of Correlates of Heavy Drinking (2012)
comprise a total of between 160 and 170 participants. Although this represents a
loss in statistical power, careful cleaning can represent a gain in the validity and
replicability of your work.
If your dataset is too big to find anomalies, or data “klinkers” (see Coolidge,
2013) by hand, you can use the “frequency” function for each and every variable
because the frequency function tables how often something occurs in your
variable. Even if you have a continuous variable like age, the frequency function
would be able to tell you whether you have ten 18-year-olds, eleven 19-year-olds,
two 20-year-olds, fifteen 22-year-olds, and one 2,000-year-old—clearly a mistake
in how this participant wrote in their age. In this case, because you cannot “guess”
that you have a 20 year old on your hands (e.g., you may instead have someone
born in the year 2000), it is suggested that you delete the offending age (but not
the whole row of data) as a cleaning step.
Save your cleaned dataset as “dataset-heavy-drinking-data-2012-subset1_v3.” At
this point, this is all the work we need to do in Excel, so you could also save
it as “dataset-heavy-drinking-data-2012-subset1_forSPSS” or “dataset-heavy-
drinking-data-2012-subset1_Final.” Next, we will be creating some syntax in
SPSS to recode our variables.
Recode the Variables (Columns)
It is best to create a syntax file—a program of commands—for recoding (please
refer to the companion SPSS guide for help). This way, you (and anyone else
using your dataset) can always import your cleaned Excel file into SPSS, run the
recode syntax program, and do any analyses that you choose. The SPSS syntax
can be saved in SPSS, or alternately, you can save the syntax in a separate Word
document or your email. Of note, it is also possible and often advantageous to
have syntax explaining how the data were or are to be cleaned, though because
it is a more advanced skill, it is recommended to start learning syntax by recoding
SAGE
2019 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
2
Page 11 of 14 Learn Data Preparation and Pre-Processing in SPSS With an Online
Survey of Correlates of Heavy Drinking (2012)
variables.
In their raw form, none of the variables in this dataset are coded in a way that
makes sense. Most apparent is the age variable—just looking at the dataset
without checking the data dictionary (codebook), it looks like there are 2- and
3-year-olds in this study on heavy drinking. This variable clearly needs to be
recoded such that the ordinal values are the actual ages (so that we can calculate
the average age of our participants). The alcohol variables also are all reverse-
coded; the most drinking should correspond with the highest numbers.
Furthermore, the injunctive norms variables also may not make sense if we want
to measure how approving people are of drinking; again, the highest numbers
should indicate the highest approval. Also, “Very disapproving” seems like zero
disapproval and might be recoded as such: 0 = Very disapproving, 1 =
Disapproving, 2 = Neutral, 3 = Approving, and 4 = Very approving. Also, it is
possible we may want to then combine some or all of the injunctive norms
variables to create a total “Approval” score.
The relationship status variable is a nominal, or categorical variable, meaning
that we cannot do many statistical analyses with it in its present form. But we
may have hypotheses about marital status and drinking, such that we want to
correlate this variable with a drinking variable. To do so, we could dichotomize this
variable in a number of ways, including: (1) 0 = Not married, 1 = Married; (2) 0
= Never married, 1 = Married at least once; (3) 0 = Not divorced or separated,
1 = Divorced or separated. Any of these (binary) variables could be correlated
with a (continuous) drinking variable. The specific name of this analysis is called
“point–biserial correlation.” By this same logic, if the sex variable is recoded as 0 =
Female and 1 = Male, then a positive correlation with drinking would be interpreted
to mean that males are more likely to drink more.
Review
SAGE
2019 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
2
Page 12 of 14 Learn Data Preparation and Pre-Processing in SPSS With an Online
Survey of Correlates of Heavy Drinking (2012)
The previous example is a primer on the most fundamental principles of data
preparation and pre-processing, though there are many advanced topics and
debates for those interested. Some topics include what to do with missing data,
how to handle extreme responses or “outliers,” and how (or even whether) to
normalize data, alternative ways to set up a dataset (e.g., in “stacked” form), what
is most ethical in terms of data mining, and when is it most appropriate to make
hypotheses surrounding data and conduct statistical tests.
That said, you should know:
• What a raw (just-recently-collected) dataset looks like versus what a proper
(ready-to-analyze) dataset looks like.
• Theoretically, how to approach preparing data for analysis.
• Practical and conceptual tools for cleaning data rows.
• Practical and conceptual tools for recoding variable columns.
Your Turn
You can download this sample dataset along with a guide showing how to prepare
data using statistical software. The dataset has a number of duplicates and
variables needing to be recoded.
See whether you can reproduce the work included in the practical and How-to
Guides. Aside from duplicates, think of other types of cleaning issues that might
arise in a dataset. Create other variable recodes within this dataset that might be
of interest to you. Formulate hypotheses about how these variables might relate
or associate to each other. Once your dataset is fully prepared for analysis, feel
free to utilize any statistical tests that would be appropriate for addressing your
own hypotheses.
References
SAGE
2019 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
2
Page 13 of 14 Learn Data Preparation and Pre-Processing in SPSS With an Online
Survey of Correlates of Heavy Drinking (2012)
Coolidge, F. L. (2013). Statistics: A gentle introduction, 3rd Edition. Los Angeles,
CA: SAGE.
García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J. M., & Herrera, F.
(2016). Big data preprocessing: Methods and prospects. Big Data Analytics, 1(1),
9. doi:http://dx.doi.org/10.1186/s41044-016-0014-0
Larimer, M. E., Turner, A. P., Mallett, K. A., & Geisner, I. M. (2004). Predicting
drinking behavior and alcohol-related problems among fraternity and sorority
members: Examining the role of descriptive and injunctive norms. Psychology
of Addictive Behaviors, 18(3), 203–212. doi:http://dx.doi.org/10.1037/
0893-164X.18.3.203
Martinez, J. A., Sher, K. J., & Wood, P. K. (2014). Drinking consequences and
subsequent drinking in college students over 4 years. Psychology of Addictive
Behaviors, 28, 1240–1245. doi:http://dx.doi.org/10.1037/a0038352
Osborne, J. W. (2010). Improving your data transformations: Applying the Box-
Cox transformation. Practical Assessment, Research & Evaluation, 15(12).
Retrieved from http://pareonline.net/getvn.asp?v=15&n=12
Peng, C. Y. (2009). Data transformation: Data analysis using SAS® (pp.
119–140). Thousand Oaks, CA: SAGE.
Pyle, D. (1999). Data preparation for data mining (Vol. 1). San Francisco, CA:
Morgan Kaufmann. Retrieved from https://pdfs.semanticscholar.org/470a/
828d5e3962f2917a0092cc6ba46ccfe41a2a.pdf
Scalzo, A. C., & Martinez, J. A. (2017). Not all anxiety is the same: How different
“types” of anxiety uniquely associate to college students’ drinking intentions.
Journal of College Student Development, 58, 943–947. doi:http://dx.doi.org/
10.1353/csd.2017.0073
SAGE
2019 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
2
Page 14 of 14 Learn Data Preparation and Pre-Processing in SPSS With an Online
Survey of Correlates of Heavy Drinking (2012)