learn data preparation and pre-processing in spss with an

Learn Data Preparation and Pre-

Processing in SPSS With an Online

Survey of Correlates of Heavy

Drinking (2012)

© 2019 SAGE Publications Ltd. All Rights Reserved.

This PDF has been generated from SAGE Research Methods Datasets.

Learn Data Preparation and Pre-

Processing in SPSS With an Online

Survey of Correlates of Heavy

Drinking (2012)

Student Guide

Introduction

This example dataset introduces data preparation and pre-processing. Before any

statistical analysis can take place, most data need to be prepared; however, this

topic is not often covered in general statistics textbooks. This brief practical guide

offers a theoretical overview, as well as tools for approaching data preparation,

originating in Excel and processed into SPSS.

The illustrative example walks through the preparation and pre-processing of data

from an online survey of the correlates of heavy drinking. Heavy drinking is a

complex phenomenon, and it is important for researchers to ask many questions

of participants, which contributes to moderately large and multivariate datasets.

Because dealing with a raw and relatively large dataset at the conclusion of a

study can be a daunting task, it is important to develop some skills to aid preparing

the data for analysis. This example is applicable to any research on complex

phenomena that would generate multivariate datasets.

This page provides links to this sample dataset and a guide to using statistical

software for data preparation and pre-processing.

SAGE

2019 SAGE Publications, Ltd. All Rights Reserved.

SAGE Research Methods Datasets Part

2

of 14 Learn Data Preparation and Pre-Processing in SPSS With an Online

Survey of Correlates of Heavy Drinking (2012)

What Is Data Preparation and Pre-Processing?

At the conclusion of a survey, one might think the next logical step would be

to simply upload the data where the responses were entered into a statistical

analysis package and, with the help of a statistics textbook, analyze. But in reality,

most data need to be prepared (or pre-processed) before they can be analyzed.

In most cases, a raw dataset is messy: Spreadsheets may include extra notes

and text information that statistical analysis software cannot read, some of the

respondents’ answers (rows) may be nonsense, and some of the variables in raw

form (columns) might not make sense.

Problematically, statistics textbooks almost never present raw data; in order to

more efficiently teach statistical methods such as regression or ANOVA, most

textbook examples and practice sets start with data that are ready to go. And there

are very few practical guides that show us what we need to do in order to prepare

data for analysis. The primary literature on data preparation is largely theoretical

and aimed more at data mining, typically of enormous datasets, and typically in

the corporate world (García, Ramírez-Gallego, Luengo, Benítez, & Herrera, 2016;

Pyle, 1999). Also, an Internet search for tutorials merely leads researchers to the

ads of many companies offering to prepare data for a price.

Given this dearth of practical help, the prospect of handling a raw dataset can be

daunting. But in general, after the extra notes, color codes, comments, and so on

are taken out of a spreadsheet so that it looks like a proper dataset (i.e., a row of

variable names with nothing but numbers below it), there are just two simple steps

to data preparation: (1) cleaning and (2) recoding, each of which will be discussed

in turn.

Data Cleaning (Rows)

Although the term “data cleaning” sounds a bit unethical, it is actually anything

SAGE



2



but (and so should certainly not be confused with more questionable “data

massaging” analytic practices). In general, it is simply seeing that the individuals’

responses make sense. These are the rows of the spreadsheet or dataset—where

one row corresponds to a single person’s responses to all the questions (columns)

on a survey.

Participants’ responses might be problematic for many reasons. Autobots and

repeat survey-takers are not uncommon to online survey venues, for instance. If

the same person (or robot) takes a survey five times for money, then at least four

of those survey responses need to be removed or cleaned out of the dataset.

Often, this is relatively easy to do if repeat survey-takers appear to put decreasing

amounts of effort into each successive survey—only answering a few questions or

just repeatedly clicking the same button (e.g., all “strongly agree” to contradicting

questions, just because “strongly agree” is on the right side of the screen). Also,

if survey questions allow participants to openly write answers to questions, the

answers might not make sense (e.g., song lyrics or gibberish copied and pasted

by bots into the text field).

Although there are advanced methods for detecting robot activity on a survey, the

aforementioned techniques, including gaining a thorough knowledge of one’s own

dataset, are invaluable. It is up to the researcher to decide whether or not an entire

row should be removed or just a cell within it, though a general rule of thumb is to

try to keep as much data as possible.

Also, in cases in which a person does not fit a survey’s inclusion criteria, it is not

necessary to delete this person (i.e., their row of data), as most statistical software

packages allow analyses to be done with subsets of individuals. This might be the

case in which we are interested in seeing survey results in the entire sample and

then in a subset of individuals, such as just women, middle-aged adults, or those

who specifically reported not taking any medication. For these reasons and again,

SAGE



2



it is best to keep as much data as possible, including extra documentation when

available, in the data dictionary.

Data Recoding (Columns)

The term “recoding” also does not sound quite right, but again, it is an important

part of data preparation. Recoding is simply assigning numeric values to

substantive responses in a manner that the resulting variables’ set of values make

substantive sense. Variables are usually given as the columns of the dataset.

Variables’ values might be problematic for various reasons. One of the biggest

reasons for this has to do with differences in online survey software. The truth is

computers are not thinking beings; they will assign numbers to any ordinal scale

associated with any question, although we cannot expect that the numbers will

follow a direction that makes sense to humans. For example, we may have a

question that we know falls on a perfectly good ordinal scale, like: “How satisfied

are you with living in California,” where options include (in this order): extremely,

moderately, a little, not at all, not applicable. But unless the survey software is

told up front what we want the corresponding number values to be, the computer

might assign these numbers as follows, just because of the order: extremely = 1,

moderately = 2, a little = 3, not at all = 4, not applicable = 5. That would mean that

the people scoring highest on this “satisfaction” variable are people do not even

report the question as applicable.

Instead, researchers would want to “reverse the coding” of this question, as well

as set the N/A option to missing or give it some coded value that clarifies it is

not part of the same scale as the other responses (−999 is sometimes used, for

instance). For example: extremely = 3, moderately = 2, a little = 1, not at all =

0, not applicable = blank. Notice also, that “Not at all” seems better represented

numerically by “0” as opposed to by “1” like the software, if left to its own devices,

might do. The good news is we can fix this problem with computer syntax within

SAGE



2



any statistical software package. Doing so is called “recoding” variables—so that

the number codes make substantive sense—for each variable column.

Combining Questions

Additionally, researchers might want to combine questions. This happens often

when we are most interested in a “total score” out of participants’ answers to many

questions. For example, perhaps participants answered 50 questions relating to

depression, and we hope to sum those together to make a “total depression”

score column. As a smaller example, we might add the three questions “how many

times did you cry last week,” “how many times did you feel down last week,”

and “how often did you have trouble sleeping last week” to create a total score

related to depressed behaviors. A new, summed-up variable like this might be

called “TotalD.”

Of note, some of the questions themselves may need to be reverse-coded before

they can be added to a total score (that is, technically speaking, we may need

to “reverse the polarity” of a question item). For example, it would be reasonable

to ask participants about happy mood in our depression questionnaire, but keep

in mind that a high score for happy mood would correspond to a low score, or

subtract from, the total score for depression.

Normalizing Versus Recoding

Lastly, researchers may want to know whether a variable—usually the main

dependent/criterion/outcome variable of interest—is normally distributed. This is

a favorite, and lengthily discussed, topic of statistics textbooks. The gist of this

discussion is that if we cannot compute a mean or average from a variable, or we

can compute one but we do not trust it for various reasons, then it might not be

a good variable in the first place. That is, a good variable has a good amount of

variation—with people representing the high, middle, and low end of our question.

SAGE



2



For example, if our dependent variable is income, it is important to have people

with many different incomes, ultimately to figure out what contributes to those

differences in income.

When a variable does not have such variability, there are many ways to address

the problem though whether or not to “fix” a variable after it is collected is a point

of debate. For example, due to a mathematical phenomenon, we know that if we

take the square root of all the scores of a variable, that variable can look a bit more

normally distributed (i.e., as if there are a lot of middles, some highs and lows,

and very few very-highs and very-lows, creating a bell-shaped curve). Applying

mathematical actions like this is called “transforming” variables. Depending on the

field of study and variables of interest, different kinds of transformations are useful

to know (see Osborne, 2010, and Peng, 2009, for user-friendly tutorials on this

advanced topic on data preparation). That said, actions like taking the square root

of numbers can render them meaningless; for example, there is no such thing in

the real world as a square-rooted dollar.

One solution might be to recode values (Coolidge, 2013). For example, perhaps

our income variable has five socioeconomic strata (Poverty, Lower middle, Middle,

Upper middle, and Affluent), but it turns out that very few people are separately in

each of the three middle classes. Rather than “normalizing” the variable, we can

use recoding to create a new three-tiered variable (Poverty, Middle, and Affluent).

Indeed, depending on sampling issues, we might even need to create just a binary

or dichotomous variable (Poor and Not Poor). Note that if our only option is to

have a dichotomous dependent variable, it is not the end of the world, but we

would need to do statistics that are not based in the classical school of interpreting

means (e.g., non-parametric statistics).

Illustrative Example: An Online Survey of Correlates of Heavy

SAGE



2



Drinking

This example walks through the preparation and pre-processing of data from an

online survey of the correlates of heavy drinking. Because heavy drinking is a

complex and multivariate phenomenon, it is important to ask individuals many

questions about drinking behaviors, their beliefs about alcohol, and other aspects

that might relate to these (Martinez, Sher & Wood, 2014; Scalzo & Martinez,

2017). That said, experimental studies about heavy drinking may incorporate

observational surveys (i.e., supplemental questions or variables that are used as

covariates in experimental research). And it should be noted that each question

that is asked on a survey is considered a variable that could relate to drinking—so

the more questions asked, the bigger a dataset gets. This makes it important to

think ahead, in terms of how big a dataset will ultimately be based on all the

questions that are being asked.

The Data

The present dataset comes from a large online survey that was created with

Survey Monkey survey software and was collected using Amazon’s Mechanical

Turk (MTurk) online forum. The raw dataset has n = 200 respondents and includes

3 demographic questions, 8 questions about participants’ own drinking behaviors,

and 11 questions about participants’ beliefs about others’ sense of approval

surrounding drinking (known as injunctive norms; Larimer, Turner, Mallett &

Geisner, 2004). For confidentiality, the MTurk IDs have been replaced with random

four-digit numbers for each participant; these are the codes that allow us to

differentiate one individual from another, and so some observation/respondent

“ID” is typically one of the first columns in a dataset. Otherwise, the data have

been left completely raw. A full data dictionary (codebook), or explanation of

variables as downloaded from Survey Monkey in their raw form, is included in this

segment.

SAGE



2



Preparing and Pre-Processing the Data

In looking at this dataset, the first thing that pops out is all the blank spaces.

Indeed, many survey respondents do not answer all the questions, or any

questions at all for that matter. Although most statistical software programs can

ignore missing information in a dataset, these blank holes are our first indicator

that the data need to be prepared for analysis. But there are other indicators,

too, which will be discussed in turn. Specifically, first, we must create a proper

dataset. Then, to carry out a full data preparation and pre-processing effort on this

dataset, we must (1) “clean” the participants’ observations (rows) of any problems

we observe, and (2) recode the variables (columns) to make substantive sense.

Create a Proper Dataset

The most important indicator that the data need to be prepared is that the dataset

does not look like a proper dataset. The first row of a proper dataset should

be variable names; each variable name should be one word only and contain

no spaces (the reason for this is that many statistical software programs do not

accept variable names with any spaces, or even treat spaces as the end of one

variable/column and the start of another, which would messily split the information

from one variable into two different places!). Below each of these columns should

be numbers only. But the example dataset has three rows of information. The

top row was added by the author after download and contains variable names

for each column. The next two rows are the survey questions themselves as was

provided by the survey software. Note that the software did not assign variable

names to the questions; this is up to the researcher to do. As long as there is just

one row of variable names at the top, and just numbers in all the cells below, a

dataset can be imported into just about any statistical software package.

An important note should be made at this point. Namely, being able to import

a spreadsheet into SPSS is important for cleaning as will be discussed below.

SAGE



2



But the actual dataset cleaning should never be done in SPSS (i.e., as an SPSS

dataset file). Instead, with each successive data preparation step, save the data

file as a new Excel file and not an SPSS file. This is important to do, should we

ever need to go back and revisit what we previously did. Thus, once we delete

the two extraneous rows of this dataset, the next step is to save it as a new file

in Excel: “dataset-heavy-drinking-data-2012-subset1_v2” (note that the “version

control” tool in Excel is another option for a less manual way to track work and

changes, though the tool’s capability may change as versions of Excel change).

Clean the Observations (Rows)

It always pays to start by inspecting any dataset for problems, no matter how big

the dataset is. In terms of this dataset, there appear to be no strange numbers or

anomalies (though this will be discussed a bit later). Moreover, as the survey was

done on an online platform for money, there may be participant duplicates.

This dataset is too big to keep track of duplicates by hand, so the “frequency”

function of SPSS will be useful. To find duplicates, we run a frequency table of the

“ID” variable (please refer to the companion SPSS guide for help). The frequency

table should show just one of each ID number; if there are more, that individual

took the survey more than once. For example, person #1137 attempted the survey

three times.

While statistical software packages can generally ignore completely missing

values, a more insidious problem comes with people such as #7715, who appear

to have answered the survey questions twice—both times with slightly different

responses. It would be inaccurate and bad research to keep both of these rows.

However, as the researcher, you may use your own judgment about which of

these two rows to delete, if not both (e.g., choosing the one that comes first,

as the most effort might have been given it; randomly choosing one of the two).

Depending on how you choose to clean these duplicates, this dataset should

SAGE



2



comprise a total of between 160 and 170 participants. Although this represents a

loss in statistical power, careful cleaning can represent a gain in the validity and

replicability of your work.

If your dataset is too big to find anomalies, or data “klinkers” (see Coolidge,

2013) by hand, you can use the “frequency” function for each and every variable

because the frequency function tables how often something occurs in your

variable. Even if you have a continuous variable like age, the frequency function

would be able to tell you whether you have ten 18-year-olds, eleven 19-year-olds,

two 20-year-olds, fifteen 22-year-olds, and one 2,000-year-old—clearly a mistake

in how this participant wrote in their age. In this case, because you cannot “guess”

that you have a 20 year old on your hands (e.g., you may instead have someone

born in the year 2000), it is suggested that you delete the offending age (but not

the whole row of data) as a cleaning step.

Save your cleaned dataset as “dataset-heavy-drinking-data-2012-subset1_v3.” At

this point, this is all the work we need to do in Excel, so you could also save

it as “dataset-heavy-drinking-data-2012-subset1_forSPSS” or “dataset-heavy-

drinking-data-2012-subset1_Final.” Next, we will be creating some syntax in

SPSS to recode our variables.

Recode the Variables (Columns)

It is best to create a syntax file—a program of commands—for recoding (please

refer to the companion SPSS guide for help). This way, you (and anyone else

using your dataset) can always import your cleaned Excel file into SPSS, run the

recode syntax program, and do any analyses that you choose. The SPSS syntax

can be saved in SPSS, or alternately, you can save the syntax in a separate Word

document or your email. Of note, it is also possible and often advantageous to

have syntax explaining how the data were or are to be cleaned, though because

it is a more advanced skill, it is recommended to start learning syntax by recoding

SAGE



2



variables.

In their raw form, none of the variables in this dataset are coded in a way that

makes sense. Most apparent is the age variable—just looking at the dataset

without checking the data dictionary (codebook), it looks like there are 2- and

3-year-olds in this study on heavy drinking. This variable clearly needs to be

recoded such that the ordinal values are the actual ages (so that we can calculate

the average age of our participants). The alcohol variables also are all reverse-

coded; the most drinking should correspond with the highest numbers.

Furthermore, the injunctive norms variables also may not make sense if we want

to measure how approving people are of drinking; again, the highest numbers

should indicate the highest approval. Also, “Very disapproving” seems like zero

disapproval and might be recoded as such: 0 = Very disapproving, 1 =

Disapproving, 2 = Neutral, 3 = Approving, and 4 = Very approving. Also, it is

possible we may want to then combine some or all of the injunctive norms

variables to create a total “Approval” score.

The relationship status variable is a nominal, or categorical variable, meaning

that we cannot do many statistical analyses with it in its present form. But we

may have hypotheses about marital status and drinking, such that we want to

correlate this variable with a drinking variable. To do so, we could dichotomize this

variable in a number of ways, including: (1) 0 = Not married, 1 = Married; (2) 0

= Never married, 1 = Married at least once; (3) 0 = Not divorced or separated,

1 = Divorced or separated. Any of these (binary) variables could be correlated

with a (continuous) drinking variable. The specific name of this analysis is called

“point–biserial correlation.” By this same logic, if the sex variable is recoded as 0 =

Female and 1 = Male, then a positive correlation with drinking would be interpreted

to mean that males are more likely to drink more.

Review

SAGE



2



The previous example is a primer on the most fundamental principles of data

preparation and pre-processing, though there are many advanced topics and

debates for those interested. Some topics include what to do with missing data,

how to handle extreme responses or “outliers,” and how (or even whether) to

normalize data, alternative ways to set up a dataset (e.g., in “stacked” form), what

is most ethical in terms of data mining, and when is it most appropriate to make

hypotheses surrounding data and conduct statistical tests.

That said, you should know:

• What a raw (just-recently-collected) dataset looks like versus what a proper

(ready-to-analyze) dataset looks like.

• Theoretically, how to approach preparing data for analysis.

• Practical and conceptual tools for cleaning data rows.

• Practical and conceptual tools for recoding variable columns.

Your Turn

You can download this sample dataset along with a guide showing how to prepare

data using statistical software. The dataset has a number of duplicates and

variables needing to be recoded.

See whether you can reproduce the work included in the practical and How-to

Guides. Aside from duplicates, think of other types of cleaning issues that might

arise in a dataset. Create other variable recodes within this dataset that might be

of interest to you. Formulate hypotheses about how these variables might relate

or associate to each other. Once your dataset is fully prepared for analysis, feel

free to utilize any statistical tests that would be appropriate for addressing your

own hypotheses.

References

SAGE



2



Coolidge, F. L. (2013). Statistics: A gentle introduction, 3rd Edition. Los Angeles,

CA: SAGE.

García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J. M., & Herrera, F.

(2016). Big data preprocessing: Methods and prospects. Big Data Analytics, 1(1),

9. doi:http://dx.doi.org/10.1186/s41044-016-0014-0

Larimer, M. E., Turner, A. P., Mallett, K. A., & Geisner, I. M. (2004). Predicting

drinking behavior and alcohol-related problems among fraternity and sorority

members: Examining the role of descriptive and injunctive norms. Psychology

of Addictive Behaviors, 18(3), 203–212. doi:http://dx.doi.org/10.1037/

0893-164X.18.3.203

Martinez, J. A., Sher, K. J., & Wood, P. K. (2014). Drinking consequences and

subsequent drinking in college students over 4 years. Psychology of Addictive

Behaviors, 28, 1240–1245. doi:http://dx.doi.org/10.1037/a0038352

Osborne, J. W. (2010). Improving your data transformations: Applying the Box-

Cox transformation. Practical Assessment, Research & Evaluation, 15(12).

Retrieved from http://pareonline.net/getvn.asp?v=15&n=12

Peng, C. Y. (2009). Data transformation: Data analysis using SAS® (pp.

119–140). Thousand Oaks, CA: SAGE.

Pyle, D. (1999). Data preparation for data mining (Vol. 1). San Francisco, CA:

Morgan Kaufmann. Retrieved from https://pdfs.semanticscholar.org/470a/

828d5e3962f2917a0092cc6ba46ccfe41a2a.pdf

Scalzo, A. C., & Martinez, J. A. (2017). Not all anxiety is the same: How different

“types” of anxiety uniquely associate to college students’ drinking intentions.

Journal of College Student Development, 58, 943–947. doi:http://dx.doi.org/

10.1353/csd.2017.0073

SAGE



2



http://dx.doi.org/10.1186/s41044-016-0014-0

http://dx.doi.org/10.1037/0893-164X.18.3.203

http://dx.doi.org/10.1037/0893-164X.18.3.203

http://dx.doi.org/10.1037/a0038352

http://pareonline.net/getvn.asp?v=15&n=12

https://pdfs.semanticscholar.org/470a/828d5e3962f2917a0092cc6ba46ccfe41a2a.pdf

https://pdfs.semanticscholar.org/470a/828d5e3962f2917a0092cc6ba46ccfe41a2a.pdf

http://dx.doi.org/10.1353/csd.2017.0073

http://dx.doi.org/10.1353/csd.2017.0073

learn data preparation and pre-processing in spss with an

Documents