SPSS Introductory Session: Data Entry and Descriptive Stats

SPSS & Quantitative Data Analysis

Kulbir Singh Birak

SPSS is a computer program for analysing quantitative data.

This can range from basic descriptive statistics such as the mean, mode, median and range to powerful tests of significance (that is, whether we accept or reject a hypothesis).

What the data looks like, and what that means if anything.

What is SPSS?

You can access SPSS on the vast majority of PCs at UCS: in these labs, on the Waterfront PCs and on the library PCs.

Additionally, if you wish, you can borrow a copy of SPSS to install on your own home PC or laptop. There are 16 copies in the library; you just need to borrow the disc and input the licence code that comes with it. (Licences do expire; when yours does, you can borrow a new version of SPSS or obtain a new licence code.)

Overnight loan only, or you can bring your laptop in and install it there and then.

Windows version only; no Apple version.

SPSS Access

Overview

Why do numbers matter in research design?

Numbers allow you to do two basic things:

- Count how often “something” happens
- Count how big an issue “something” is

Overview

Once you can count the extent (how often) and nature (for quantitative research a numerical descriptor of an attribute) you can already do some pretty important things. You can answer questions such as:

How common is an issue? For instance, are black children over-represented in care? Are black adults over-represented in psychiatric hospital?

How serious is a particular issue? Or how is it distributed within a sample? For instance, how serious are the concerns about children in families allocated a social worker?

Overview

Once you can count stuff you can start to answer other important and interesting questions, for instance:

Students may often come to you with various questions about SPSS and difficulties that they are having

If you are lucky enough to catch them early on a lot of unnecessary frustration and stress about analysing data can be avoided.

The most important thing a student can do before they even consider methodology, methods or analysis is to have a clear research question/aim and hypotheses in place that conceptualise and operationalise the variables they wish to study.

SPSS and Quantitative Data

Some Basic Definitions

A variable is the “thing” that you’re interested in studying, e.g. depression, gender differences, social deprivation, specific crime rates, levels of emotionality (how emotional someone is) or different types of food!

• Things like depression, gender differences, social deprivation, specific crime rates, levels of emotionality and food type, etc. are called “variables” because they vary.
● Some people are more depressed than others
● Some people are men, and others are women
● Some social policies may be more successful than others
● We may see different crimes committed in different contexts and areas
● Some people are less emotional than others
● Food types can range from pizza to hamburgers to filet mignon, or might be Thai, Ethiopian, Polish or American cuisine, etc.

TO “CONCEPTUALISE” A VARIABLE MEANS TO MAKE CLEAR WHAT YOU MEAN BY THE VARIABLE….

• For example, for the variable “food type,” you need to be clear about whether you mean
• (1) vegetarian or meat, OR
• (2) breakfast, lunch or dinner foods, OR
• (3) Ethiopian, Thai or American foods, OR
• (4) something else!

TO “OPERATIONALISE” A VARIABLE IS TO DECIDE HOW YOU WILL MEASURE IT

• For example, if the variable you’re interested in is depression:
● Will you ask people to rate themselves, and if so, on what sort of a scale?
● Alternatively, will you measure depression by facial expression? By some behaviour that you observe? In some other way?

TO “OPERATIONALISE” A VARIABLE IS TO DECIDE HOW YOU WILL MEASURE IT

• If the variable you’re studying is intelligence & you don’t think Exam scores are a good measure of intelligence, what measure WILL you use?

• Asking these sorts of questions is completing the process of “operationalising” your variables.

• Conceptualisation & Operationalisation are necessary for a Quantitative approach

Exploratory, Descriptive, Causal/Relationship

Exploratory research is undertaken when few or no previous studies exist. The aim is to look for patterns, hypotheses or ideas that can be tested and will form the basis for further research.

Typical research techniques would include case studies, observation and reviews of previous related studies and data.

Data from exploratory studies tends to be qualitative.

Expands on the Exploratory

Descriptive research can be used to identify and classify the elements or characteristics of the subject, e.g. number of days youth offenders remained out of trouble.

Quantitative techniques are most often used to collect, analyse and summarise data.

Causal and Relationship research focuses on being able to predict/hypothesise cause and effect between observed behaviours, or relationships between aspects of behaviour/society/crime rates.

The idea is that Causal and Relationship research is moving a step beyond descriptive research and the quantitative data collected can be used and analysed in a manner that allows the researcher to infer a significant effect/difference or relationship

TYPES OF QUANTITATIVE RESEARCH

Aims and Objectives

• The Quantitative approach sets out at the start of a study with a research question and a hypothesis/prediction

• Hypotheses are formal statements of predictions derived from evidence from earlier research and/or theory.

• The null hypothesis (H0) is a statement of ‘no difference/effect/change’ between the variables

• The experimental hypothesis (H1) is a statement of difference/relationships between variables

QUANTITATIVE DESIGNS AND HYPOTHESES

• Experimental Hypothesis: Students who study for tests in study groups will score significantly better on their exams than students who did not study in study groups

• Null Hypothesis: There will be no significant difference in exam results between students who do and do not study in study groups

EXAMPLE OF HYPOTHESIS

This clarity in the question and hypothesis can make life markedly easier for yourselves and the student in the long run.

However, I appreciate that this is not always easy, and that more often than not it will not be the case for you.

So what I will be covering with you today is a brief introduction to the SPSS interface, how we would go about the initial basics of data entry, and how to begin exploring descriptive data.

If we can, I’ll also take you through examples of some basic significance testing (otherwise I’ll put them up so they are available).

A light but important session.

Going over the basics of how to input data, label your variables clearly and create codebooks.

It’s all about building up your confidence with the interface and developing good practice.

It’s about doing the basics so as to avoid confusion later on, e.g. inputting the data correctly for different types of analysis.

Data Entry and Descriptives

Hopefully you are already familiar with the idea of descriptive data.

As the name suggests they are what we use to describe the data we have.

There’s no point in knowing that the IQ scores between two groups are significantly different if we don’t have a way of describing the scores, and the difference.

Measures of central tendency: Mean, mode, median etc.

Measures of dispersion: Standard deviation etc.

Descriptive Stats

Levels of Measurement

In 1946 Stevens proposed a theory of scales of measurement:

- Nominal data (lowest level of measurement)
- Ordinal data (ordered, but unable to differentiate the distance between points on the scale)
- Interval data (points on the scale are an equal distance apart)
- Ratio data (equal distance between points on the scale, plus a meaningful zero)

Nominal

- Provides the least exact information
- Participants are placed in categories
- Data that is categorical, e.g. gender, colours, shoe type, play behaviour
- Variable must fit into one category
- Measure of frequency
- Numbers may be used but only as category labels
- Central tendency is described using the mode
- Data is represented using a frequency table or bar chart

Examples: Nominal Data

Type of bicycle: mountain bike, road bike, chopper, folding, BMX.

Ethnicity: White British, Afro-Caribbean, Asian, Chinese, other, etc. (note problems with these categories).

Smoking status: smoker, non-smoker.

Ordinal

Simplest true scale; orders measurements along a continuum.

Represents rank position in a group, e.g. 1st, 2nd, 3rd … 10th.

No information on the difference between positions.

Central tendency is described in terms of the median

Dispersion can be measured using the range or inter-quartile range (middle 50% of the distribution)

Ordinal Data

A type of categorical data in which order is important.

Class of degree: 1st class, 2:1, 2:2, 3rd class, fail.

Degree of illness: none, mild, moderate, acute, chronic.

Opinion of students about stats classes: very unhappy, unhappy, neutral, happy, ecstatic!

Interval and ratio variables

According to Fielding & Gilbert (2000), these terms are often used interchangeably, and incorrectly, by social scientists (p. 15).

Interval: ordered categories with no inherent concept of zero (Clark 2004). We can calculate a meaningful distance between categories. There are few real examples of interval variables in the social sciences (Fielding & Gilbert 2000:15).

Ratio: a meaningful zero amount (e.g. income). It is possible to calculate ratios, so a ratio variable also has the interval property (e.g. someone earning £20,000 earns twice as much as someone who earns £10,000) (Fielding & Gilbert 2000:15).

The difference between interval and ratio is usually not important for statistical analysis (Fielding & Gilbert 2000:15).

Interval variables- Examples

Fahrenheit temperature scale: zero is arbitrary, so 40 degrees is not twice as hot as 20 degrees.

IQ tests: there is no such thing as zero IQ, and an IQ of 120 is not twice as intelligent as an IQ of 60.
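One way to see why ratios are not meaningful on an interval scale is to express the same two temperatures in Celsius. This is only an illustrative sketch; the function name `fahrenheit_to_celsius` is an assumption and not part of the session materials.

```python
def fahrenheit_to_celsius(f: float) -> float:
    """Convert a Fahrenheit temperature to Celsius."""
    return (f - 32) * 5 / 9


warm_f, cool_f = 40.0, 20.0
warm_c = fahrenheit_to_celsius(warm_f)
cool_c = fahrenheit_to_celsius(cool_f)

# Ratio of the two temperatures on the Fahrenheit scale
ratio_f = warm_f / cool_f
# Ratio of the very same two temperatures on the Celsius scale
ratio_c = warm_c / cool_c

print(f"Fahrenheit: {warm_f} / {cool_f} = {ratio_f:.2f}")
print(f"Celsius:    {warm_c:.2f} / {cool_c:.2f} = {ratio_c:.2f}")
```

The ratio changes (and even changes sign) when only the zero point and the unit change, which is why "twice as hot" is not a meaningful statement on either scale.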

Question: can we assume that attitudinal data represents real, quantifiable measured categories (i.e. that ‘very happy’ is twice as happy as plain ‘happy’, or that ‘very unhappy’ means no happiness at all)? Statisticians are not in agreement on this.

Ratio variables: Examples

Can be discrete or continuous data.

The distance between any two adjacent units of measurement (intervals) is the same and there is a meaningful zero point (Papadopoulos, 2001).

Income: someone earning £20,000 earns twice as much as someone who earns £10,000.

Height.

Unemployment rate: measured as the number of jobseekers as a percentage of the labour force (Papadopoulos, 2001).

If you are still a little worried about your understanding of quantitative data, please see the Key Information Handout in the folder.

- By David Bowers (Learning Development).
- A reasonable summary of information about quantitative data.
- Data types, appropriate measures of central tendency, etc.

Key Information Handout

Everything we do today is about good practice.

Following the steps today, and developing correct inputting skills, will save you lots of problems and heartache later.

SPSS is fussy when it comes to the way data is entered.

Importance of Good Practice

As SPSS is quantitative data analysis software, you often have to reduce information down to a numerical state.

A codebook allows you to keep a record of these reductions and decisions:

- A record of your own.
- Separate from SPSS.
- Electronic or on paper.
- A list of variables, full names, and how you have coded data.

Codebook

The codes are the numbers you give data to allow SPSS to analyse it.

Most analyses need numbers rather than text, so some variables need to be converted.

E.g. Gender: Female may become 1, Male may become 2.

Relationship Status: Single may become 1, Married 2, Divorced 3, Widowed 4…
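As a rough illustration of what a codebook records, the coding examples above can be written as a simple lookup table. This is a sketch only: the variable names `gender` and `relstatus` and the helper functions are hypothetical, and SPSS itself stores these codes as value labels in Variable View rather than in Python.

```python
# Hypothetical codebook, following the coding examples in the text.
CODEBOOK = {
    "gender": {"Female": 1, "Male": 2},
    "relstatus": {"Single": 1, "Married": 2, "Divorced": 3, "Widowed": 4},
}


def encode(variable: str, response: str) -> int:
    """Return the numeric code recorded in the codebook for a text response."""
    return CODEBOOK[variable][response]


def decode(variable: str, code: int) -> str:
    """Return the text label for a numeric code (reverse lookup)."""
    for label, value in CODEBOOK[variable].items():
        if value == code:
            return label
    raise KeyError(f"{code} is not a code for {variable}")


# Two made-up questionnaire responses: (gender, relationship status)
raw_responses = [("Female", "Married"), ("Male", "Single")]
coded_rows = [
    (encode("gender", g), encode("relstatus", r)) for g, r in raw_responses
]
print(coded_rows)  # [(1, 2), (2, 1)]
```

Keeping the mapping in one place means the same record can be used both to enter the data and to read the numbers back later.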

Coding

SPSS is fussy when it comes to the names you give variables.

You can’t give them a full description in the main (Data) view.

Instead, you can give detailed labels in the Variable view.

Along with a codebook it helps keep the information clear.

Labelling

Available in the email that was circulated to you all.

File: Data Entry Exercise 1 - Optimism Data

We’ll be creating a codebook, setting up SPSS according to the codebook, and then entering the data.

1st Exercise

Good habits

Create a new Folder on your Desktop

Right-click on Desktop> New > Folder > “SPSS”

New Data Folder

Start>All Programs>IBM SPSS Statistics 19.

Depending on the version, it may have a slightly different name.

GIVE IT TIME SPSS IS RENOWNED FOR TAKING AN AGE TO OPEN UP – CLICKING AGAIN ONLY SLOWS IT DOWN MORE AS IT’LL THEN TRY TO OPEN ANOTHER SPSS WINDOW

Open SPSS


Optimism Scale data from 4 participants

Firstly, we are going to prepare a codebook

Coding Data

Optimism Hand-out

Rules for naming of variables

Variable names:

- must be unique (i.e. each variable in a data set must have a different name);
- must begin with a letter (not a number);
- cannot include full stops, spaces or other characters (!, ? * ");
- cannot be a word used as a command by SPSS (all, ne, eq, to, le, lt, by, or, gt, and, not, ge, with).
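The naming rules above can be checked mechanically before you type names into SPSS. The sketch below is not an SPSS feature; `name_problems` is a hypothetical helper, and it makes two assumptions beyond the slide: names are compared case-insensitively, and only letters, digits and underscores are treated as allowed characters (a deliberately conservative reading of "other characters").

```python
import re

# Reserved words listed in the rules above.
RESERVED = {"all", "ne", "eq", "to", "le", "lt", "by", "or", "gt", "and",
            "not", "ge", "with"}


def name_problems(name, existing):
    """Return a list of rule violations for a proposed variable name."""
    problems = []
    # Rule 1: must be unique within the data set (case-insensitive: assumption)
    if name.lower() in {n.lower() for n in existing}:
        problems.append("not unique")
    # Rule 2: must begin with a letter
    if not name[:1].isalpha():
        problems.append("must begin with a letter")
    # Rule 3: no spaces, full stops or other special characters (assumption:
    # only letters, digits and underscores are accepted here)
    if not re.fullmatch(r"[A-Za-z0-9_]*", name):
        problems.append("contains a disallowed character")
    # Rule 4: must not be a reserved word
    if name.lower() in RESERVED:
        problems.append("is a reserved word")
    return problems


existing_names = {"id", "gender"}
for candidate in ["op1", "1op", "op 1", "with", "Gender"]:
    print(candidate, "->", name_problems(candidate, existing_names) or "OK")
```

A name such as `op1` passes every check, while `1op`, `op 1`, `with` and a second `Gender` each break one rule.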

Coding Data

Optimism scale items op1 to op4: enter the number circled, from 1 (strongly disagree) to 5 (strongly agree).

Coding Data

Now we have a codebook to keep things clear we can set up SPSS so it is ready for the data.

SPSS has 3 views: Data, Variable and Output.

By switching to Variable we can define the variables we need.

Creating a data file and inputting data

Defining Variables

Variable View

Naming Variables

Decimals

Labels

Values

Enter the relevant value and label as per your codebook, then click add. When all have been entered, click OK

Define the meaning of the values used in the codebook (Gender) and click add for each.

Values


When entering Likert data always use the limits of the scale (1-5), even if you know that participants may not have selected some responses. You also need to decide whether you are going to enter just the range or every labelled point.

Values

Data comes in different types:

- Categorical (Nominal in SPSS)
- Ordinal
- Scale/Interval (Scale in SPSS)

Different types/measures suit different tests, different measures of central tendency and different forms of visualisation. This makes knowing what type of data you have KEY for successful data analysis.

Measures


Scale refers to interval/ratio level of measurement. There is some debate about data type in relation to Likert data; for our purposes, leave this as Scale.

Nominal refers to categorical.

Measures

Now you have the variables set up ready for the data you can start to enter the actual data

Go to the Data View

Inputting Data According to the Codebook


Saving the File

You’ve saved the data so now it is ‘safe’. You can have a play around with it and try a few different things:

- Delete a case
- Insert a case between existing cases
- Delete a variable
- Insert a variable between existing variables

Try during the workshop/at home so you get more confident with SPSS.

Playing around with the data

Available on LearnUCS.

Different experimental designs require a different style of inputting.

The structure you use will be different between Repeated (Within-Group) and Independent (Between-Group) experimental designs.

Use the wrong structure and the analysis will fall down. It will be meaningless at best.

2nd Exercise: Inputting Repeated and Independent Measures

So, to recap Repeated Measures: the same participants experience all treatments/are in all the groups/conditions.

If you wanted to investigate the effect of music on taking an IQ test, participants would experience the no music condition and the music condition.

Hopefully with some counterbalancing.

Repeated Measures


Again to recap.

Participants are split. One group will experience one treatment/be in one group/condition.

Another group will experience the other.

Each condition will have a unique, non-shared, set of participants.
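The two designs are commonly laid out as follows: in both cases each participant gets one row, but a repeated design has one column per condition, while an independent design has one grouping column plus one score column. The sketch below illustrates this using the music and IQ example; all scores, variable names and the `ids_by_condition` helper are invented for illustration.

```python
# Repeated measures: each participant has a score in every condition,
# so each condition gets its own column.
repeated = [
    {"id": 1, "iq_no_music": 102, "iq_music": 98},
    {"id": 2, "iq_no_music": 110, "iq_music": 113},
]

# Independent measures: each participant is in one condition only, so one
# column codes the group (1 = no music, 2 = music) and one holds the score.
independent = [
    {"id": 1, "condition": 1, "iq": 102},
    {"id": 2, "condition": 1, "iq": 110},
    {"id": 3, "condition": 2, "iq": 98},
    {"id": 4, "condition": 2, "iq": 113},
]


def ids_by_condition(rows):
    """Group participant ids by the condition code they were entered under."""
    groups = {}
    for row in rows:
        groups.setdefault(row["condition"], set()).add(row["id"])
    return groups


groups = ids_by_condition(independent)
# In an independent design no participant should appear in both conditions.
shared = groups[1] & groups[2]
print("Participants in both conditions:", shared)
```

If `shared` is not empty, the data has probably been entered using the wrong structure for an independent design.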

Independent


A quick trick to show you. Good for those who aren’t fond of a screen full of numbers.

If you have coded your variables correctly there is a button you can press that will make the numbers in your Data View appear as the labels you coded.

For example, the 1s and 2s for gender could appear as Female and Male.

Labelling Trick

Data Entry Exercise 1 – Optimism Data Input
Data Entry Exercise 2 – Repeated and Independent

Extra Data Entry Exercises
Exercise 3 – Giving electric shocks
Exercise 4 – Shooting people

We’ve gone through 1 and 2 here. Try them on your own. 3 and 4 are for extra practice. Make sure you are comfortable with data input, coding and labelling.

Exercises

The theory and step-by-step guide will be covered in the slides following immediately below. If you complete the first exercise move onto exercise 2.

Descriptive Exercise 1: survey.sav
The data is from a survey of staff about stress and emotions.
Generate the frequencies for 1) marital status and 2) level of education.

Descriptive Exercise 2: staffsurvey.sav
The data is from a staff survey with Likert scales for agreement and importance of factors.
Generate appropriate descriptive statistics to answer the following questions:

(a) What percentage of the staff in this organisation are permanent employees? (Use the variable employstatus.)
(b) What is the average length of service for staff in the organisation? (Use the variable service.)
(c) What percentage of respondents would recommend the organisation to others as a good place to work? (Use the variable recommend.)

Lab Exercises

When you are trying to find your descriptive stats you need to make sure you use the right ones.

Certain types of data/measure, suit certain types of measures of central tendency and dispersion.

Use the wrong ones and your description of the results will be confusing, wrong and won’t match your inferential statistics.

Types of Variables & Descriptives

Also known as Nominal variables in SPSS.

Data that has been classified and categorised. Take gender: a participant will belong to a particular category of gender. Marital status is another example.

Anything that you can create a discrete classification of. You can even take a scale variable like age and force it into categories (18 and under, 19–25, 26–35, etc.).
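Forcing a scale variable into categories only works cleanly if the bands do not overlap, so that every age falls into exactly one category. A minimal sketch follows, assuming hypothetical band boundaries and a made-up helper named `age_band`; in SPSS itself this would be done by recoding the variable.

```python
def age_band(age: int) -> str:
    """Place an age into exactly one of a set of non-overlapping bands."""
    if age <= 18:
        return "18 and under"
    if age <= 25:
        return "19-25"
    if age <= 35:
        return "26-35"
    return "36 and over"


ages = [17, 18, 19, 25, 26, 40]
bands = [age_band(a) for a in ages]
for age, band in zip(ages, bands):
    print(age, "->", band)
```

Note how the boundary ages (18, 25) each land in a single band, which would not be guaranteed if the bands shared an endpoint.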

Categorical Variables

The measure of central tendency to use for categorical data is the mode.

Frequency of occurrence or amount.

So, using gender as an example, you would use the mode. 2 of the sample might be male, and 8 female. Mode = Female (20% male, 80% female).
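The gender example can be reproduced as a small frequency table. This sketch uses Python's `collections.Counter` purely to show the arithmetic; it is not how SPSS computes its Frequencies output, and the sample of 10 people is the made-up one from the example.

```python
from collections import Counter

# The made-up sample from the example: 2 males and 8 females
genders = ["Male"] * 2 + ["Female"] * 8

counts = Counter(genders)
total = sum(counts.values())

# The mode is the category with the highest frequency
mode_label, mode_count = counts.most_common(1)[0]

# Percentage of the sample falling in each category
percentages = {label: 100 * n / total for label, n in counts.items()}

print("Frequencies:", dict(counts))
print("Mode:", mode_label)
print("Percentages:", percentages)
```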

Categorical

In SPSS you should use the Frequency option when you want the descriptive stats for a categorical variable.

Go to Descriptive Exercise 1 on LearnUCS.

Categorical and Frequency

Save survey.sav to your SPSS folder on the Desktop from LearnUCS

Have a look at survey.sav questionnaire from LearnUCS

Open survey.sav dataset

Descriptive Exercise 1 - Survey

Survey Questionnaire

Frequencies

Frequency Output

This is where graphs and the results from tests (descriptive and inferential) will appear.

Notes recording when you have saved and opened files will appear here too.

If you want to keep what is in the output you must save it specifically.

Saving the data/variable will not save what is in the output, and vice versa.

Output pages

Aside from categorical measures we also have ordinal and scale (interval, sometimes known as ratio too) measures.

These are also generally known as continuous variables.

Usually the mean or median is the measure of central tendency used, and the standard deviation or standard error the measure of dispersion.

Other measures

Ranked or ordered data. Sometimes Likert scales.

Has some similarity to categorical data (You might consider grade brackets to be categories; A, B, C, D, etc).

But importantly they are ranked, so there is meaning to the position. A is better than B, B better than C and so on.

The median is used here. Central point with an equal amount above/below.

Ordinal

So if you had a collection of grades:

- 20 people had an A
- 10 had a B
- 10 had a C
- 10 had a D

Then B would be the median grade, as 20 people had higher, and 20 people had lower.
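The grade example can be checked by giving each grade a rank and finding the middle rank. This is a sketch under assumptions: the numeric ranks (A = 1 to D = 4) are an arbitrary coding chosen here, and `statistics.median_low` is used so that the result is always one of the actual grades rather than a value between two of them.

```python
import statistics

# Arbitrary rank codes for the ordered grades (an assumption of this sketch)
GRADE_RANK = {"A": 1, "B": 2, "C": 3, "D": 4}
RANK_GRADE = {rank: grade for grade, rank in GRADE_RANK.items()}

# 20 people had an A, and 10 each had a B, C and D
grades = ["A"] * 20 + ["B"] * 10 + ["C"] * 10 + ["D"] * 10

ranks = sorted(GRADE_RANK[g] for g in grades)
median_rank = statistics.median_low(ranks)
median_grade = RANK_GRADE[median_rank]

print("Median grade:", median_grade)
```

With 50 people the middle falls between the 25th and 26th person, and both of those hold a B, so the choice of `median_low` makes no difference to the answer here.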

Ordinal

Imagine we wished to find the median for the highest educational level attained by a population

In descriptive exercise 1 (survey) we would click on ‘Analyze’

Using Explore to See the Median


Select ‘Descriptive Statistics’ and then ‘Explore’ from the Drop-down menus


1. When the Explore dialog box opens, move ‘highest educ completed’ from the left pane to the ‘Dependent List’ section

2. Click on ‘Statistics’ and choose ‘Outliers’ and ‘Continue’

3. Click on ‘Plots’ and choose ‘Histograms’ and ‘Normality Plots with tests’ and ‘Continue’

4. Click on ‘OK’

The resulting ‘Output’ in the Output window will show you a number of descriptive stats. We can see the median is 4 for ‘highest educ completed’, which means ‘some additional training’ is the median highest level of education completed among the 439 participants who took part in the survey.

Interval: a scale with artificial limits, no true zero, and usually some form of cap. Intervals are of equal size. IQ scores, for example.

Ratio: has a true zero, constant intervals and potentially little or no cap. Timing scores on a task, for example.

SPSS doesn’t really differentiate between the two. Basically, if it is a form of score it is likely to be Scale.

Scale

The mean is the normal measure of central tendency, and the standard deviation the measure of dispersion.

So 5 people take a maths test. They score 10, 20, 18, 12 and 5. The average would be 13 (total/number of cases).
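The maths test example can be verified with Python's built-in `statistics` module. This is a quick check of the arithmetic rather than a replacement for SPSS; `statistics.stdev` gives the sample standard deviation (dividing by n - 1).

```python
import statistics

# The five maths test scores from the example
scores = [10, 20, 18, 12, 5]

total = sum(scores)                # 65
mean = statistics.mean(scores)     # total / number of cases
sd = statistics.stdev(scores)      # sample standard deviation (n - 1)

print(f"Total = {total}, N = {len(scores)}, mean = {mean}")
print(f"Standard deviation = {sd:.2f}")
```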

Scale

In SPSS we just need the Descriptives option, rather than the Frequencies option.

For example, we might wish to find the mean and standard deviation for ‘age’, ‘total optimism’, ‘total mastery’, ‘total perceived stress’ and ‘total perceived control of internal states’ (PCOISS) for the participants who answered the survey we are using for exercise 1.

Scale Descriptives

Descriptives

Descriptives Output

Sometimes information will be left out of a questionnaire, or the value lost, but you will still need to conduct an analysis.

What happens if someone doesn’t fill in the age box on a questionnaire?

Rather than get rid of all their data you can use the ‘Exclude cases pairwise’ option.

It excludes the case (person) only if they are missing the data required for the specific analysis. They will still be included in any of the analyses for which they have the necessary information.

Missing Data

Exclude cases listwise

A more extreme option. If the participant is missing any data then this option removes them entirely from the analysis.

A matter of judgement as to which to use.
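The difference between the two options can be seen on a tiny made-up data set. The sketch below imitates the idea only, not SPSS's implementation: the cases, variable names and the helpers `pairwise_mean` and `listwise_rows` are all hypothetical, and `None` stands in for a missing value.

```python
import statistics

# Hypothetical responses; None marks a missing value.
cases = [
    {"id": 1, "age": 23, "optimism": 14},
    {"id": 2, "age": None, "optimism": 18},
    {"id": 3, "age": 31, "optimism": None},
    {"id": 4, "age": 40, "optimism": 16},
]
variables = ["age", "optimism"]


def pairwise_mean(rows, var):
    """Mean of one variable, dropping a case only if THAT value is missing."""
    values = [r[var] for r in rows if r[var] is not None]
    return statistics.mean(values)


def listwise_rows(rows, vars_):
    """Keep only the cases with no missing value on any listed variable."""
    return [r for r in rows if all(r[v] is not None for v in vars_)]


# Pairwise: each analysis uses every case that has the value it needs
pairwise = {v: pairwise_mean(cases, v) for v in variables}

# Listwise: any case with a missing value is removed from every analysis
complete = listwise_rows(cases, variables)
listwise = {v: statistics.mean([r[v] for r in complete]) for v in variables}

print("Pairwise means:", pairwise)
print("Listwise means:", listwise, "from", len(complete), "complete cases")
```

Pairwise keeps three cases for each mean, whereas listwise is left with only two complete cases, which is why the choice can change the results.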

Missing Data

Descriptive Exercise 1 – Survey
Descriptive Exercise 2 – Staff Survey

Exercises

Adapted from Green, J. & D’Oliveira, M. (1999). Learning to use statistical tests in psychology. Buckingham, UK: Open University Press.

Flowchart for choosing basic statistics

START

Summarising univariate data?
  Descriptive statistics (mean, standard deviation, variance, etc.)

Categorical & frequency data?
  Between: 1 or 2 sample Chi-square
  Within: McNemar

Relationships?
  Parametric: Pearson's r
  Non-parametric: Spearman's r

Differences? How many independent variables?
  One: how many experimental conditions?
    Two: within or between?
      Between: parametric: unrelated t-test; non-parametric: Mann Whitney
      Within: parametric: related t-test; non-parametric: Wilcoxon
    3 or more: within or between?
      Between: parametric: one-way between groups ANOVA; non-parametric: Kruskal-Wallis or Jonckheere trend test
      Within: parametric: one-way within Ss (repeated measures) ANOVA; non-parametric: Friedman or Page’s L trend test
  Two or more: within or between participants in each condition?
    Between: factorial between groups ANOVA
    Within: factorial within subjects (repeated measures) ANOVA
    Both true: factorial mixed design (split-plot) ANOVA

Coolican, H. (2014). Research Methods and Statistics in Psychology (6th ed.). Hove, UK: Psychology Press.
A good introduction to the quantitative statistics used in the social sciences. It covers the statistics taught in research methods at this level in a clear and comprehensive format.

Pallant, J. (2013). SPSS: Survival Manual (5th ed.). Maidenhead, UK: Open University Press.
A textbook that is of help with the statistical programme SPSS whatever your level, as it takes you through the analysis in a clear, concise, step-by-step manner that allows you to learn while you put it into practice.

Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics (4th ed.). London, UK: Sage.
An engaging text that covers research methods and statistics in a fashion that makes it easy to read and follow.

Recommended Reading

You can use the below link to access the UCS library page that has some useful videos showing how to use SPSS http://libguides.ucs.ac.uk/c.php?g=264784&p=1954991

There is also a course that you can do (set up by Jen Versey, our Psychology technician, and David Mullett from the library support team): https://www.coursesites.com/webapps/Bb-sites-course-creation-BBLEARN/courseHomepage.htmlx?course_id=_383196_1

There is always the IBM SPSS guide that you can access through the help option in SPSS as a starting point.

Web Resources




Descriptive Statistics

Descriptive statistics are statistics that describe data. They essentially summarise the data.

They can be either numerical or graphic.

Numerical statistics come in 2 forms:

- Measurement of central tendency
- Measurement of dispersion

Measure of Central Tendency

There are three measures of central tendency; which one we use depends on our level of measurement. They are:

- Mean: the arithmetic average. The sum of all scores divided by the number of scores.
- Median: the score that falls in the exact centre of the distribution (the middlemost score).
- Mode: the most common/frequently occurring score.

‘the mean’

The formula for the mean is

$$\bar{x} = \frac{\sum x}{N}$$

where $\bar{x}$ = the mean, $\Sigma$ = the sum of, $x$ = the scores and $N$ = the number of scores in the set.

Advantages: Powerful statistic used in estimating population parameters for significant differences and correlations. Most sensitive, and works at an interval level.

Disadvantages: Can be overly sensitive, so it is easily distorted by outlier values.

‘the median’

The measure of central tendency for ordinal data.

Shorthand may be Guildford’s (1956) Mdn.

It is the central value of a set.

A formula used to find the position of the median is

$$k = \frac{N + 1}{2}$$

For odd-numbered data sets this will reveal the central number.

For even-numbered data sets this will reveal the two data points that the median falls between.

When several values in the data set are the same you can use the same method, although it is not strictly correct. However, luckily for us as social scientists, there are statistical packages that will take care of this for us.
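The position formula k = (N + 1) / 2 can be sketched in a few lines. The function names below are hypothetical, and for even-numbered sets the sketch takes the midpoint of the two values the median falls between, which is a common convention rather than something stated on the slide.

```python
def median_position(n: int) -> float:
    """Position k of the median in an ordered set of n scores: (n + 1) / 2."""
    return (n + 1) / 2


def median(values):
    """Median found via the position formula (midpoint used for even n)."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    k = median_position(len(ordered))
    if k.is_integer():
        # Odd n: k points directly at the central score (positions start at 1)
        return ordered[int(k) - 1]
    # Even n: k falls between two positions; take the midpoint (assumption)
    lower = ordered[int(k) - 1]
    upper = ordered[int(k)]
    return (lower + upper) / 2


print(median_position(5), median([5, 3, 8, 1, 9]))
print(median_position(4), median([1, 3, 5, 8]))
```

For the five scores the formula gives k = 3, so the median is the third ordered value; for the four scores k = 2.5, so the median lies between the second and third values.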

‘the mode’

The measure of central tendency for nominal scale data. We are unable to calculate the mean and median with this type of data, but we can see what occurred most often (the highest frequency).

There can be two modes, which we call bi-modal.

Advantages: Most typical, unaffected by extremes, can be more informative than the mean with discrete scales.

Disadvantages: Does not account for differences between values, can’t be used in estimates of population parameters, not all that useful for small sets of data, two modal values are reported for bi-modal data, difficult to estimate accurately when data are grouped into class intervals.

Measures of Spread/Dispersion

[Figure: two distributions, one with high variability and one with low variability]

‘the range’

Report of the top/highest value and the bottom/lowest value.

To calculate the range (the difference between them) you subtract the lower value from the higher value and add 1.

Advantages: Includes extremes, easy to calculate.

Disadvantages: Can be distorted by extremes, can be unrepresentative of the distribution. Doesn’t tell us whether values are close to or spaced out from the mean.

‘the interquartile and semi-interquartile range’

The interquartile range allows us a better insight into how values fall in relation to the central tendency

Instead of the full range, the interquartile range represents the distance between the central 50%, removing the bottom and top 25%. The values are known as the 1st and 3rd quartiles or the 25th and 75th percentiles

Interquartile range

Ordered values, with Q1, M (the median) and Q3 marked on the slide: 3 3 4 5 6 8 10 13 14 16 19

The interquartile range is:

$$Q_3 - Q_1$$

The semi-interquartile range is half of that:

$$\frac{Q_3 - Q_1}{2}$$

Advantages: Representative of central group of values, useful for ordinal data.

Disadvantages: No account of extremes, inaccurate where there are large class intervals.
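The range and interquartile range for the eleven values in the example can be worked out as below. Treat this as a sketch: quartiles can be calculated by several slightly different rules, so SPSS or a textbook may give slightly different values on other data sets. Here Python's `statistics.quantiles` is used with its default ("exclusive") method, and the range follows the "subtract and add 1" definition given earlier.

```python
import statistics

values = [3, 3, 4, 5, 6, 8, 10, 13, 14, 16, 19]

# Range as defined earlier: top value minus bottom value, plus 1
value_range = max(values) - min(values) + 1

# Cut the ordered data into four equal parts: returns Q1, the median and Q3
q1, med, q3 = statistics.quantiles(values, n=4)

iqr = q3 - q1          # interquartile range
semi_iqr = iqr / 2     # semi-interquartile range

print(f"Range = {value_range}")
print(f"Q1 = {q1}, median = {med}, Q3 = {q3}")
print(f"Interquartile range = {iqr}, semi-interquartile range = {semi_iqr}")
```

The interquartile range ignores the bottom and top quarter of the values, so the extreme scores of 3 and 19 have no effect on it.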

Standard deviation and variance

These estimate from a sample how the values of a population are distributed

Standard deviation provides us with an average score telling us how different the scores are from the mean

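Formula for standard deviation (std, SD, stdev):

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

Or, writing $d = x - \bar{x}$ for each score's deviation from the mean:

$$s = \sqrt{\frac{\sum d^2}{n - 1}}$$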
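As a worked illustration, the sample standard deviation (dividing by n - 1) can be calculated by hand for the five maths test scores used earlier (10, 20, 18, 12 and 5). The steps below are only a check of the arithmetic, with $d = x - \bar{x}$ standing for each score's deviation from the mean.

```latex
\begin{align*}
\bar{x} &= \frac{10 + 20 + 18 + 12 + 5}{5} = \frac{65}{5} = 13 \\
\sum d^2 &= (-3)^2 + 7^2 + 5^2 + (-1)^2 + (-8)^2 = 9 + 49 + 25 + 1 + 64 = 148 \\
s^2 &= \frac{\sum d^2}{n - 1} = \frac{148}{4} = 37 \\
s &= \sqrt{37} \approx 6.08
\end{align*}
```

So the scores differ from the mean of 13 by roughly 6 marks on average, and the variance is the squared value, 37.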
