exploratory data analysis

Exploratory Data Analysis – Using SPSS

Purpose: These hand-outs are aimed at developing some fundamental skills in exploratory data analysis, including the interpretation of hypothesis tests. They are not in any way meant to be a statistics primer or a substitute for the skills and knowledge taught during the research methods module. However, these notes are designed to encourage managers to develop more robust approaches to quantitative decision making as the basis for effective measurement and evaluation of service improvement efforts, and in particular to recognise the limitations of using single measures of central tendency such as the average/mean and aggregated measures such as percentages.

The SPSS environment

Running SPSS

Method 1 – If the programme is running locally on your PC

To load SPSS you can select the icon from your list of programmes

Method 2 – Run the programme in ‘thin client’ mode using the University’s Desktop Anywhere function

To access ‘Desktop Anywhere’ use the URL https://connect2.bangor.ac.uk/sgd/

The login screen will appear as follows:

Enter your username and password and select [Login]. After a short pause the screen should be as displayed below:

(You may be asked to run a Java Applet and/or to confirm that you wish to be connected to the server – select [Yes] in both cases if asked.)

Scroll down the left hand column of applications until you come across SPSS V.16, as shown above and then select this option.

There may be a short wait while the system runs some login scripts. Be patient and don’t confuse the PC by entering data on the keyboard or by attempting to select options with the mouse.

After a while the main SPSS window will appear as shown on the next page – the actual options on the menus may vary a little depending on whether you are using SPSS Ver 16 (thin client) or SPSS 19 running locally on your PC.

(NB If SPSS has been used previously by yourself it may display a dialogue box similar to the one below asking you if you want to open existing files etc. For the time being select <Cancel> and you will be returned to the main SPSS data entry window – see above.)

Loading the data file LOS.SAV

NB SPSS data files are given the extension .SAV. For the purposes of this example I am going to assume that the data file has been placed in the <My Documents> folder on the PC’s local drive.

Select File, Open, Data

And the following dialogue box is displayed.

(Should you wish to load data files from other folders then change the Look In: options.)

Select LOS.sav and then select <Open> and the LOS.sav data file will be opened and displayed as shown on the top of the next page.

LOS.sav displayed in Variable View

Depending on the options set, the file may instead be displayed in Data View, see below:

NB You can toggle between Data View and Variable View using the Tabs at the bottom of the screen

Variable View

The <Variable View> tab is the window in which you define the variables that you are going to use, according to the coding principles developed for your study. Therefore, before you can use SPSS effectively to analyse data you need to think very carefully about which data types you are going to use and how you are going to code the data. This is sometimes called creating a codebook.

Name: This column contains the variable name that will be used to identify each of the variables in the data file. These should be listed in your codebook. Each variable name should have no more than 64 characters.

Type: The default value for Type that will appear automatically as you enter your first variable name is Numeric. For most purposes this is all you will need to use. There are some circumstances where other options may be more appropriate.

Width: The default value for Width is 8. This is usually sufficient for most data, unless you have very large values.

Decimals: The default value for Decimals is 0 (ie no decimal places). This is OK for integers (whole numbers) but if your variable has decimal places, you can change this to suit your needs.

Label: The Label column allows you to provide a longer description for your variable. This will be used in the output generated from any analyses conducted by SPSS.

Values: In the Values column you can define the meaning of the values you have used to code your variables. For example if you select the Values cell for the Consultant variable - <Click once> with the mouse:

If you then click on the button that appears in the cell, the values that have been used to code the Consultant variable are displayed on the screen as shown below:

Missing: Sometimes researchers assign specific values to indicate missing values for their data. This is not essential: SPSS will recognise any blank cell as missing data.

Columns: The default column width is usually set at 8. This is sufficient for most purposes - change it only if necessary to accommodate your values. To make your data file smaller (to fit more on the screen), you may choose to reduce the column width. Just make sure you allow enough space for the width of the variable name.

Align: The alignment of the columns is usually set at ‘right’ alignment. There is no real need to change this as numeric data is best displayed in this format so that place values are easier to read.

Measure: The column heading Measure refers to the level (type) of measurement of each of your variables. The default is Scale, which refers to an interval or ratio level of measurement. If your variable consists of categories (eg. sex), then click in the cell, and then on the arrow key that appears. Choose Nominal for categorical data, and Ordinal if your data involve rankings, or ordered values.

A summary of the Variable View options – Field (2009, p.71)

Data View

Within Data View you can enter and edit data almost as you would in a spreadsheet. Normally you would enter new data on the next empty row, according to the coding principles you have developed for analysing your data and stored in the Variable View.

Remember that each row corresponds to a ‘single record’ or in research terms a ‘case’, and the columns may be thought of as ‘fields’ containing data which define that entity, which may be a participant, an episode of care etc. Thus the columns, corresponding to the variables you have defined within Variable View, represent in research terms specific characteristics or groupings eg gender, ward, consultant etc.

The normal cursor control keys work in the same way as they do in Excel, as do the key sequences [Ctrl] + [Home] to move the cursor to the beginning of the data sheet and [Ctrl] + [End] to move the cursor to the end of the data sheet.

Therefore you can edit the data by moving to the required cell and changing the value or editing it using the normal editing keys.

To delete an entire record/case: <click> on the row number of the record/case you wish to delete; the row is highlighted as shown below

And then select Edit, Clear

The case/record will be removed from the dataset.

NB If you have made a mistake then the usual key sequence of [Ctrl] + [Z] or clicking on the undo icon will reverse the change!

To insert a new record/case: <click> on the row number corresponding to the position at which you wish to insert the record; the row is highlighted as shown below:

And then select Edit, Insert Cases

And a blank row appears allowing you to enter a new record/case.

Alternatively you can use the Insert Cases tool, accessed from the top menu bar.

To insert a variable between existing variables: Position your cursor in a cell in the column to the right of where you would like the new variable to appear.

And then select Edit, Insert Variable

And a blank column appears allowing you to enter the name of the new variable, and data.

Alternatively you can use the Insert Variable tool, accessed from the top menu bar.

NB SPSS gives a new variable the default name VAR00001 in the column header – if you <double click> on this you will be taken to the [Variable View] where you will be able to define the new variable properties.

To delete a variable between existing variables: Click once on the column header corresponding to the variable you wish to delete. The entire column is highlighted as shown below.

And then select Edit, Clear

The variable will be removed from the dataset.

NB If you have made a mistake then the usual key sequence of [Ctrl] + [Z] or clicking on the undo icon will reverse the change!

Some Exploratory Analysis

Checking categorical variables

Checking for the number of cases of same day admission and non-same day admission.

Select Analyze, Descriptive Statistics, Frequencies

Click once on the SDA variable in the left hand column and then click on the arrow button to move this into the Variables box.

Click on the Statistics button.

Tick the Minimum and Maximum check boxes, select <Continue> and then select <OK>.

The output is displayed below.
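For readers who like to see the arithmetic behind a Frequencies table, the following is a minimal Python sketch of the same check. It is not SPSS output: the list `sda` is made-up data and the function name `frequency_table` is my own. The point is that a count per code, together with the minimum and maximum, quickly exposes values that should not be there (here a stray 2 in a variable coded 0/1).

```python
from collections import Counter

# Hypothetical SDA codes: 0 = non same day admission, 1 = same day admission.
# The value 2 is a deliberate data-entry error for the check to catch.
sda = [0, 1, 1, 0, 1, 2, 0, 1]


def frequency_table(values):
    """Return (value, count, percent) rows sorted by value."""
    counts = Counter(values)
    total = len(values)
    return [(v, n, 100.0 * n / total) for v, n in sorted(counts.items())]


table = frequency_table(sda)
for value, count, percent in table:
    print(f"{value}: {count} ({percent:.1f}%)")

# Minimum and maximum should lie inside the range of valid codes (0 to 1 here).
print("min:", min(sda), "max:", max(sda))
```

Any code outside the range listed in your codebook should be traced back to the original record and corrected before further analysis.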

Checking Continuous Variables

Checking for the LOS variable.

Select Analyze, Descriptive Statistics, Descriptives

Click once on the LOS variable in the left hand column and then click on the arrow button to move this into the Variables box.

And then click on the Options… button

Ensure that the Mean, Sum, Std. deviation, Variance, Minimum, Maximum, Kurtosis & Skewness are checked

Then select Continue

Then select OK and the output will be similar to that displayed below.

“Descriptives also provides some information concerning the distribution of scores on continuous variables (skewness and kurtosis). This information may be needed if these variables are to be used in parametric statistical techniques (eg t-tests, analysis of variance). The skewness value provides an indication of the symmetry of the distribution. Kurtosis, on the other hand, provides information about the ‘peakedness’ of the distribution. If the distribution is perfectly normal you would obtain a skewness and kurtosis value of 0 (rather an uncommon occurrence in the social sciences).

Positive skewness values indicate positive skew (scores clustered to the left at the low values). Negative skewness values indicate a clustering of scores at the high end (right-hand side of a graph). Positive kurtosis values indicate that the distribution is rather peaked (clustered in the centre), with long thin tails. Kurtosis values below 0 indicate a distribution that is relatively flat (too many cases in the extremes). With reasonably large samples, skewness will not ‘make a substantive difference in the analysis’ (Tabachnick & Fidell, 2001, p. 74). Kurtosis can result in an underestimate of the variance, but this risk is also reduced with a large sample (200+ cases: see Tabachnick & Fidell, 2001, p. 75)”. (Pallant J, 2004, p.51-52)
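To make the sign conventions above concrete, here is a small Python sketch that calculates skewness and kurtosis by hand. I am assuming the commonly used sample-adjusted formulas (the same ones used by spreadsheet SKEW and KURT functions, and to my understanding the ones SPSS reports); the function names and the two small data sets are illustrative only.

```python
import statistics


def sample_skewness(values):
    """Adjusted sample skewness: 0 for a perfectly symmetric sample."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    cubed = sum(((x - mean) / sd) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


def sample_kurtosis(values):
    """Adjusted sample excess kurtosis: negative means flatter than normal."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    fourth = sum(((x - mean) / sd) ** 4 for x in values)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * fourth - correction


symmetric = [1, 2, 3, 4, 5]          # evenly spread, no tail on either side
right_skewed = [1, 1, 2, 2, 3, 10]   # most scores low, one long tail to the right

print(sample_skewness(symmetric))     # effectively zero
print(sample_skewness(right_skewed))  # positive: scores clustered at the low end
print(sample_kurtosis(symmetric))     # negative: a flat distribution
```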

Frequency Distributions (Histograms) and the normal distribution

Once you’ve collected some data a very useful thing to do is to plot a graph of how many times each score occurs. This is known as a frequency distribution, or histogram, which is a graph plotting values of observations on the horizontal axis, with a bar showing how many times each value occurred in the data set. Frequency distributions can be very useful for assessing properties of the distribution of scores.

Frequency distributions come in many different shapes and sizes. It is quite important, therefore, to have some general descriptions for common types of distributions. In an ideal world our data would be distributed symmetrically around the centre of all scores. As such, if we drew a vertical line through the centre of the distribution then it should look the same on both sides. This is known as a normal distribution and is characterized by the bell-shaped curve with which you might already be familiar. This shape basically implies that the majority of scores lie around the centre of the distribution (so the largest bars on the histogram are all around the central value). Also, as we get further away from the centre the bars get smaller, implying that as scores start to deviate from the centre their frequency is decreasing. As we move still further away from the centre our scores become very infrequent (the bars are very short). Many naturally occurring things have this shape of distribution. For example, most men in the UK are about 175 cm tall; some are a bit taller or shorter but most cluster around this value. There will be very few men who are really tall (i.e. above 205 cm) or really short (i.e. under 145 cm). An example of a normal distribution is shown below.

There are two main ways in which a distribution can deviate from normal: (1) lack of symmetry (called skew) and (2) pointyness (called kurtosis). (Field A, 2010, p.19)

Skew & Kurtosis

(Field, 2010, p.20)

Organising the data by Consultant

SPSS enables you to organise the output by categorical variable such as consultant.

Select Data, Split File..

Select Organize output by groups,

Click once on the Consultant variable in the left hand column and then click on the arrow button to move this into the Groups Based On: box.

Then Select <OK>

Select Analyze, Descriptive Statistics, Descriptives

Then Select <OK>

And the output will be as shown on the top of the next page.

Consultant = Morris

Consultant = Jones

Consultant = Roberts

Consultant = Huws
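What Split File does can be pictured as running the same descriptives once per group. The Python sketch below illustrates the idea under stated assumptions: the records are invented (only the consultant names come from the output headings above) and `describe_by_group` is a hypothetical helper, not an SPSS function.

```python
import statistics
from collections import defaultdict

# Hypothetical (consultant, length of stay in days) records.
records = [
    ("Morris", 8), ("Jones", 12), ("Morris", 10), ("Roberts", 7),
    ("Jones", 9), ("Huws", 11), ("Roberts", 5), ("Huws", 13),
]


def describe_by_group(rows):
    """Group the values by label, then summarise each group separately."""
    groups = defaultdict(list)
    for label, value in rows:
        groups[label].append(value)
    return {
        label: {
            "n": len(values),
            "mean": statistics.fmean(values),
            "sd": statistics.stdev(values),  # needs at least 2 cases per group
            "min": min(values),
            "max": max(values),
        }
        for label, values in groups.items()
    }


summary = describe_by_group(records)
for consultant, stats in summary.items():
    print(f"Consultant = {consultant}: {stats}")
```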

To remove the split organisation select Data, Split File.. and then select Analyze all cases, do not create groups

Creating a Frequency Distribution

Histograms are used to display the distribution of a single continuous variable such as LOS.

Select Graphs, Legacy Dialogs, Histogram

Click once on the LOS variable in the left hand column and then click on the arrow button to move this into the Variables box, and check the Display normal curve option.

If you want to add titles select the <Titles…> option, and then select <OK>. The output will be as displayed on the next page.

“Inspection of the shape of the histogram provides information about the distribution of scores on the continuous variable. Many of the statistics discussed in this manual assume that the scores on each of the variables are normally distributed (ie follow the shape of the normal curve)”. (Pallant J, 2004, p.65)
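The counting that sits behind a histogram can be sketched in a few lines of Python. This is only an illustration: the `los` values are invented, the bin width of 5 days is an arbitrary choice of mine, and `histogram_bins` is a hypothetical helper rather than anything SPSS provides.

```python
from collections import Counter

# Hypothetical lengths of stay in whole days.
los = [3, 5, 5, 6, 7, 7, 7, 8, 8, 9, 10, 12, 15, 21]


def histogram_bins(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


bin_width = 5
bins = histogram_bins(los, bin_width)
for start, count in bins.items():
    # One '#' per case gives a rough sideways histogram.
    print(f"{start:>2} to {start + bin_width - 1:<2} {'#' * count}")
```

With these made-up values most cases fall in the 5 to 9 day bin and a few trail off to the right, which is the kind of positively skewed shape worth looking for before relying on tests that assume normality.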

Activity:

Organise the data file by (i) SDA and (ii) by consultant and plot frequency distributions.

TIP To organise the data file by SDA or Consultant select Data, Split File.. and then select Organize output by groups.

Bar graphs

Bar graphs can be simple or very complex, depending on how many variables you wish to include. The bar graph can show the number of cases in particular categories, or it can show the score on some continuous variable for different categories. Basically you need two main variables—one categorical and one continuous. You can also break this down further with another categorical variable if you wish.

To produce a bar chart of the number of same day admission and non-same day admissions by consultant – proceed as follows.

Select Graphs, Legacy Dialogs, Bar…

Select the <Clustered> and then <Define> options

Click once on the Consultant variable in the left hand column and then click on the arrow button to move this into the Category Axis box

Click once on the SDA variable in the left hand column and then click on the arrow button to move this into the Define Clusters By box

If you want to add titles select the <Titles…> option, and then select <OK>. The output will be as displayed on the next page.

Bar Chart for Average LOS by consultant

Select Graphs, Legacy Dialogs, Bar…

Select the <Clustered> and then <Define> options

Click in the Other Statistic (eg mean) option

Move Length of Stay to the Variable box.

Click once on the Consultant variable in the left hand column and then click on the arrow button to move this into the Category Axis box

Click once on the SDA variable in the left hand column and then click on the arrow button to move this into the Define Clusters By box

If you want to add titles select the <Titles…> option, and then select <OK>. The output will be as displayed below.

Scatter Diagrams (X-Y Graph)

Scatterplots are typically used to explore the relationship between two continuous variables (eg LOS and elapsed time to diagnosis). It is a good idea to generate a scatterplot before calculating correlations, as the scatterplot will give you an indication of whether your variables are related in a linear (straight-line) or curvilinear fashion. Only linear relationships are suitable for correlation analyses.

Select Graphs, Legacy Dialogs, Scatter/Dot

Select the <Simple Scatter> and then <Define> options

Click once on the Length of Stay variable in the left hand column and then click on the arrow button to move this into the Y Axis box

Click once on the Elapsed Time variable in the left hand column and then click on the arrow button to move this into the X Axis box

If you want to add titles select the <Titles…> option, and then select <OK>. The output will be as displayed below
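The reason for looking at the scatterplot first can be shown with a short Python sketch of Pearson's correlation coefficient. The function `pearson_r` and both data sets are illustrative assumptions of mine. In the second data set y depends perfectly on x, but in a curved way, and the correlation coefficient comes out at zero; only the plot would reveal the relationship.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical straight-line relationship: y doubles x.
linear_r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])

# Hypothetical curved relationship: y is x squared, symmetric about zero.
curved_r = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(linear_r)
print(curved_r)
```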

Some Exploratory Data Analysis

Cross-Tabulation (Tables)

Cross tabulation is useful for counting the number of occurrences of one type of categorical variable in relation to another. For example, you can create a table of the number of occurrences of same day admission and non-same day admission for each consultant.

Select Analyze, Descriptive Statistics, Crosstabs

The crosstabs dialogue box is displayed as shown below:

Click once on the consultant [Cons] variable in the left hand column and then click on the arrow button to move this variable to form the Row(s) of our table.

Click once on the SDA variable in the left hand column and then click on the arrow button to move this variable to form the Column(s) of our table.

The default treatment of Cells.. for tables containing categorical variables is that SPSS counts observed occurrences. If you select the <Cells..> option you can verify this.

Select the <Continue> option to return to the main Crosstabs dialogue box and then select <OK>. The results of the Crosstab are then displayed in the [Output] tab as shown below.
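A cross-tabulation is simply a count of each combination of two categorical codes. The following Python sketch shows that idea with invented records; `crosstab` is a hypothetical helper of my own and the counts bear no relation to the LOS.sav data.

```python
from collections import Counter

# Hypothetical (consultant, SDA) records: 0 = non same day, 1 = same day.
records = [
    ("Jones", 0), ("Jones", 1), ("Jones", 1),
    ("Morris", 0), ("Morris", 0), ("Roberts", 1),
]


def crosstab(rows):
    """Return {row_label: {column_label: count}} with zero-filled cells."""
    counts = Counter(rows)
    row_labels = sorted({r for r, _ in rows})
    col_labels = sorted({c for _, c in rows})
    return {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}


table = crosstab(records)
for consultant, cells in table.items():
    print(consultant, cells)
```

Combinations that never occur appear as a count of 0 rather than being left out, which mirrors the empty cells you would see in the SPSS table.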

Activity - Crosstabs

1. Rerun the Crosstab created above and this time create a clustered bar chart by selecting the Display clustered bar charts option from the main Crosstabs dialogue box.

TIP You can use the dialogue recall tool from the main menu bar to return to recently used dialogue boxes

2. Create a cross tab and clustered bar chart demonstrating the number of same day admission and non-same day admission for each division.

3. Create a cross tab and clustered bar chart demonstrating the number of same day admission and non-same day admission for each ward.

The output should be similar to that shown below.

Task 1

Task 2

Task 3

Hypothesis Testing – Exploring Relationships between Groups

“Most of these analyses involve comparing the mean score for each group on one or more dependent variables. There are a number of different but related statistics in this group. The main techniques are very briefly described below.

T-tests

T-tests are used when you have two groups (e.g. males and females) or two sets of data (before and after), and you wish to compare the mean score on some continuous variable. There are two main types of t-tests. Paired sample t-tests (also called repeated measures) are used when you are interested in changes in scores for subjects tested at Time 1, and then again at Time 2 (often after some intervention or event). The samples are ‘related’ because they are the same people tested each time. Independent sample t-tests are used when you have two different (independent) groups of people (males and females), and you are interested in comparing their scores. In this case you collect information on only one occasion, but from two different sets of people.

One-way analysis of variance (ANOVA)

One-way analysis of variance is similar to a t-test, but is used when you have two or more groups and you wish to compare their mean scores on a continuous variable. It is called one-way because you are looking at the impact of only one independent variable on your dependent variable. A one-way analysis of variance (ANOVA) will let you know whether your groups differ, but it won’t tell you where the significant difference is (gp1/gp3, gp2/gp3 etc.). You can conduct posthoc comparisons to find out which groups are significantly different from one another. You could also choose to test differences between specific groups, rather than comparing all the groups, by using planned comparisons. Similar to t-tests, there are two types of one-way ANOVAs: repeated measures ANOVA (same people on more than two occasions), and between-groups (or independent samples) ANOVA, where you are comparing the mean scores of two or more different groups of people.” (Pallant J, 2004, p.97)

Conditions for parametric tests

T-tests and ANOVA are from a family of statistical tests known as parametric tests, that is they assume that the data are drawn from a population that is a reasonable approximation to the normal distribution. SPSS provides statistical tests for normality, and you will learn more about this on the Research Methods module. Given that the aim of this short introduction is the interpretation of statistical output from SPSS, for the purposes of this section we will assume that the data are distributed normally. On the next page you will find a summary of the conditions for parametric tests which this section assumes.

Assumptions for Parametric Tests

“Level of measurement: Each of these approaches assumes that the dependent variable is measured at the interval or ratio level, that is, using a continuous scale rather than discrete categories. Wherever possible when designing your study, try to make use of continuous, rather than categorical, measures of your dependent variable. This gives you a wider range of possible techniques to use when analysing your data.

Random sampling: The techniques covered assume that the scores are obtained using a random sample from the population. This is often not the case in real-life research.

Independence of observations: The observations that make up your data must be independent of one another. That is, each observation or measurement must not be influenced by any other observation or measurement.

Normal distribution: It is assumed that the populations from which the samples are taken are normally distributed. In a lot of research (particularly in the social sciences), scores on the dependent variable are not nicely normally distributed. Fortunately, most of the techniques are reasonably ‘robust’ or tolerant of violations of this assumption. With large enough sample sizes (e.g. 30+), the violation of this assumption should not cause any major problems (see discussion of this in Gravetter & Wallnau, 2000, p. 302; Stevens, 1996, p. 242). The distribution of scores for each of your groups can be checked using histograms obtained as part of the Descriptive Statistics, Explore option of SPSS (see Chapter 6). For a more detailed description of this process, see Tabachnick and Fidell (2001, pp. 99–104).

Homogeneity of variance: Techniques in this section make the assumption that samples are obtained from populations of equal variances. This means that the variability of scores for each of the groups is similar. To test this, SPSS performs the Levene test for equality of variances as part of the t-test and analysis of variances analyses. The results are presented in the output of each of these techniques. Be careful in interpreting the results of this test: you are hoping to find that the test is not significant (i.e. a significance level of greater than .05). If you obtain a significance value of less than .05, this suggests that variances for the two groups are not equal, and you have therefore violated the assumption of homogeneity of variance. Don’t panic if you find this to be the case. Analysis of variance is reasonably robust to violations of this assumption, provided the size of your groups is reasonably similar (e.g. largest/smallest=1.5, Stevens, 1996, p. 249). For t-tests you are provided with two sets of results, for situations where the assumption is not violated and for when it is violated. In this case, you just consult whichever set of results is appropriate for your data.” (Pallant J, 2004, p.197-198)

Independent Sample T-Test

This test is used when you want to compare the mean scores of two independent categories (treatments) such as (Same day admission, non same day admission) on one continuous variable such as (LOS).

H1 At the p<=0.05 level the mean lengths of stay for same day admission and non same day admission are statistically different.

Procedure for independent-samples t-test

Select Analyze, Compare means, then Independent Samples T-test.

Click once on the Length of Stay in Days variable in the left hand column and then click on the arrow button to move this variable to the Test Variable(s): box

Click once on the SDA variable in the left hand column and then click on the arrow button to move this variable to the Grouping Variable: box

Click on <Define groups> and type in the numbers used in the data set to code each group. In the current data file 0=non same day admission, 1=same day admission.

Select <Continue>

Select <OK> and the output should be as displayed on the next page

Step 1: Checking the information about the groups

In the Group Statistics box SPSS gives you the mean and standard deviation for each of your groups (in this case: same day admission/non same day admission). It also gives you the number in each group (N). Always check these values first. Do they seem right? Are the N values for same day admission and non same day admission correct? Is there a lot of missing data? If so, find out why. Perhaps you have entered the wrong codes for same day admission and non same day admission (1 and 2, rather than 0 and 1). Check with your codebook.

Step 2: Checking assumptions

The first section of the Independent Samples Test output box gives you the results of Levene’s test for equality of variances. This tests whether the variance (variation) of scores for the two groups (same day admission and non same day admission) is the same. The outcome of this test determines which of the t-values that SPSS provides is the correct one for you to use.

If your Sig. value is larger than .05 (e.g. .07, .10), you should use the first line in the table, which refers to Equal variances assumed.

If the significance level of Levene’s test is p=.05 or less (e.g. .01, .001), this means that the variances for the two groups (same day admission and non same day admission) are not the same. Therefore your data violate the assumption of equal variance.

Don’t panic— SPSS is very kind and provides you with an alternative t-value which compensates for the fact that your variances are not the same. You should use the information in the second line of the t-test table, which refers to Equal variances not assumed.

In this case the p value for Levene’s test is 0.214 which is greater than 0.05 so the assumption of homogeneity of variance has not been compromised and therefore the level of significance for the t-test can be read from the first row labelled ‘equal variances assumed’
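For the curious, the statistic behind Levene’s test can be sketched in Python as below. This is a hedged illustration only: I am assuming the mean-centred form of the test (absolute deviations from each group mean), the data are invented, and `levene_statistic` is my own name. Converting the statistic to a Sig. value needs the F distribution, which SPSS does for you and which is not attempted here.

```python
import statistics


def levene_statistic(groups):
    """Levene's W based on absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every score from its own group mean.
    z = [[abs(x - statistics.fmean(g)) for x in g] for g in groups]
    z_means = [statistics.fmean(zg) for zg in z]
    z_grand = sum(sum(zg) for zg in z) / n_total
    between = sum(len(zg) * (m - z_grand) ** 2 for zg, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zg, m in zip(z, z_means) for v in zg)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical groups: the first two have identical spread, the third is wider.
narrow_a = [1, 2, 3, 4, 5]
narrow_b = [11, 12, 13, 14, 15]
wide = [2, 6, 10, 14, 18]

same_spread = levene_statistic([narrow_a, narrow_b])
different_spread = levene_statistic([narrow_a, wide])
print(same_spread)
print(different_spread)
```

The statistic is zero when the groups vary by exactly the same amount, whatever their means, and grows as the spreads diverge; a large value is what produces a small Sig. value in the SPSS output.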

Step 3: Assessing differences between the groups

To find out whether there is a significant difference between your two groups, refer to the column labelled Sig. (2-tailed), which appears under the section labelled t-test for equality of means. Two values are given. One for equal variance, the other for unequal variance. Choose whichever your Levene’s test result says you should use (see Step 2 above).

If the value in the Sig. (2-tailed) column is equal or less than .05 (e.g. .03, .01, .001), then there is a significant difference in the mean scores on your dependent variable for each of the two groups.

If the value is above .05 (e.g. .06, .10), there is no significant difference between the two groups.

In the example presented in the output below the Sig. (2-tailed) value is 0.258. As this value is above the required cut-off of .05, you conclude that there is not a statistically significant difference in the mean LOS for same day admission and non same day admission.
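The two rows of the t-test table correspond to two slightly different calculations, which the Python sketch below illustrates. The data are invented and the function names are my own; the formulas are the standard pooled-variance t and the Welch (unequal variance) t. The degrees of freedom for the equal-variance version are n1 + n2 - 2, which is why a result reported as t(330) implies 332 cases in total. Turning t into a Sig. (2-tailed) value requires the t distribution and is left to SPSS.

```python
import math
import statistics


def pooled_t(a, b):
    """t and df for the 'Equal variances assumed' row."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.fmean(a) - statistics.fmean(b)) / se, na + nb - 2


def welch_t(a, b):
    """t and approximate df for the 'Equal variances not assumed' row."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    se_sq = va / na + vb / nb
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(se_sq)
    df = se_sq ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Hypothetical lengths of stay (days) for two small groups.
same_day = [9, 12, 7, 10, 12]
non_same_day = [8, 9, 6, 10, 7]

t_equal, df_equal = pooled_t(same_day, non_same_day)
t_welch, df_welch = welch_t(same_day, non_same_day)
print(t_equal, df_equal)
print(t_welch, df_welch)
```

With equal group sizes the two t values coincide and only the degrees of freedom differ; with unequal group sizes and unequal variances the two rows can disagree noticeably, which is why Levene’s test is checked first.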

NB The level of significance (p value) for any statistical test is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis (ie that there is no difference between the independent treatments) were true. It is not the probability that the null hypothesis itself is true. When the p value is less than or equal to 0.05, in other words a difference this large would arise by chance less than 5% of the time if the null hypothesis were true, the test has demonstrated a statistically significant difference and we reject the null hypothesis in favour of the alternative hypothesis.

Presenting the results for independent-samples t-test

An independent-samples t-test was conducted to compare the mean LOS for same day admission and non same day admission. There was no significant difference in LOS for same day admission (M=9.53, SD=4.036) and non same-day admission [M=9.06, SD=3.385; t(330)=1.132, p=.258].

The level of statistical significance associated with this test is 0.258, which is greater than (>) 0.05, therefore the difference in the mean lengths of stay between same day and non same day admission is not statistically significant.

Analysis of Variance

Previously we used an independent-samples t-test to compare the scores of two different groups or conditions. In many research situations, however, we are interested in comparing the mean scores of more than two groups (eg the mean LOS for each of the Wards). In this situation we would use analysis of variance (ANOVA). One-way analysis of variance involves one independent variable (referred to as a factor), which has a number of different levels. These levels correspond to the different groups or conditions.

For example, in comparing the mean length of stay by Ward, the factor is Ward and the dependent variable is a continuous variable (in this case, Length of Stay).

H1 At the p<=0.05 level the mean lengths of stay for each of the wards are statistically different.

Procedure for one-way between-groups ANOVA

Select Analyze, Compare means, then One-Way ANOVA...

Click once on the Length of Stay in Days variable in the left hand column and then click on the arrow button to move this variable to the Dependent List box

Click once on the Ward variable in the left hand column and then click on the arrow button to move this variable to the Factor: box

Click the <Options> button

Click on Descriptive, Homogeneity of variance test, Welch

Click on the <Continue> Button

Click on <OK>

The output should be as displayed below.

A one-way between-groups analysis of variance was conducted to explore the impact of Ward on LOS. There was a statistically significant difference at the p<.05 level in LOS scores across the 4 wards [F(3, 328)=43.568, p<.001].

P=0.145 which is >0.05 therefore the data has not contravened the homogeneity of variance condition.

Only use the level of significance from the robust test of means if there is non homogeneity of variance amongst the groups.
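The F value in the ANOVA table is a ratio of the variation between the group means to the variation within the groups. The Python sketch below works through that calculation on invented data; `one_way_anova` is a hypothetical helper, and the Sig. value (which needs the F distribution) is not computed. The two degrees of freedom are the number of groups minus 1 and the number of cases minus the number of groups, so F(3, 328) corresponds to 4 wards and 332 cases.

```python
import statistics


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups
    )
    # Variation of individual scores around their own group mean.
    ss_within = sum(
        (x - statistics.fmean(g)) ** 2 for g in groups for x in g
    )
    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical lengths of stay (days) on three wards.
wards = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

f_value, df_between, df_within = one_way_anova(wards)
print(f_value, df_between, df_within)
```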

Post Hoc Tests

Post hoc tests enable you to undertake multiple comparisons between the levels of the factor (here, the wards) to determine which, if any, pairs of groups differ significantly. To do this return to the analysis of variance menu.

Select Analyze, Compare means, then One-Way ANOVA...

Click on the <Post Hoc..> option

Select Tukey, then <Continue>, then <OK>

The output will be the same as before but include an additional table labelled multiple comparisons – as shown on the next page.

You should look at this table only if you found a significant difference in the overall ANOVA. That is, if the Sig. value for the main effect was equal to or less than .05. The post hoc tests in this table will tell you exactly where the differences among the groups occur. Look down the column labelled Mean Difference. Look for any asterisks (*) next to the values listed. If you find an asterisk, this means that the two groups being compared are significantly different from one another at the p<.05 level.

In this case the mean LOS for Ward 1 is significantly different to the mean LOS for Ward 3 and Ward 4, and there is no statistical difference in the mean length of stay between Ward 1 & Ward 2.
