
Module 2 - Summarising Data

In the first module we established two key ideas: how to express information in terms of cases, variables and values, and how the diverse ways in which the values of a variable can change is expressed in different levels of measurement.  In this module we will look in greater depth at what we started to do at the end of the first module, which is to make summary statements about our data. This will prepare us for the next module where we can use such summaries to reach what lies at the heart of quantitative analysis: exploring relationships between variables.

Making summary statements: fitting data to models

Making summary statements: example

Modelling data

Standard deviation

A measure of central tendency for ordinal and nominal variables

Frequency distribution and contingency tables

Missing values

Standardisation

Dealing with continuous variables and ordinal or interval level variables with a large number of values

Recoding decisions

Contingency tables or crosstabs

Good table manners

Alternatives to tables: Graphs and charts

Conclusion

Making summary statements: fitting data to models

Before we look further at exploring relationships between variables there is one more important idea to understand: that of producing a model or fit for our data. We do this by making summary statements about it.

Think of the GHS CQDA Practice dataset that we used at the end of module 1. It contains data for about fifteen variables for about 300 people: around 4,500 pieces of information in all. However if we want to use this information to describe these 300 people we need to simplify as well as generalise. Consider the variable age. Our data describes the age in years of each respondent. We could present this information by counting how many respondents reported each age, but since this ranges from 16 to 69 years, this would still be a large amount of detailed information: too much for anyone to take in easily and make much sense of. However we could summarise this information with just one measure: the average, or mean age of the respondents. Converting hundreds of pieces of information into just one piece gives us a clear and quick idea of the general age of our respondents.

We can think of this simple summary of the ages of our respondents as a model. It simplifies reality in order to avoid overwhelming us with unnecessary detail. However such simplification has a cost (few things in life are free!). Being a simplification of reality, our model is no longer a wholly accurate description of reality in all its detail. We can think of the ‘difference’ or ‘distance’ between our model and reality as all the information we have left out, or what is known in statistics as the residual. This gives us the following formula:

Data = Fit + Residual   (D = F + R)

This equation describes the idea that our data can be thought of as the sum of our summary of it (the mean age of respondents in our example above) – which we call the fit or model – and the distance, for each case, between the value given by our simplified model and the value in the actual data – which we call the residual.

Residuals are something we will study in their own right later on. For the moment all we need to understand is that one of the things that makes for a good summary or model of data is to have small residuals. There will usually be a straightforward trade-off between the simplicity of the summary and the size of the residuals.

Let us continue with an example .....

Mean

The midpoint of a set of observations, weighted by each of the values; i.e. we can think of this as the balancing point in a set of data. This applies to the interval and ratio levels of measurement, but not to the nominal or ordinal levels. Because of this, it makes sense to compute an average of an interval variable, whereas it doesn't make sense to do so for ordinal scales.

Making summary statements: an example

Age is a variable at the interval (ratio) level of measurement. That is to say we know not only how much older or younger each respondent is compared to every other one, we can also express this difference as a ratio. Thus we know that someone aged 30 is not only 10 years older than someone aged 20, but also that they are 50% older.

The simplest way we could summarise the ages of all the respondents would be to calculate some measure that told us what the average size of age was. This is a summary of the level of the values of the variable.


One such measure is the mean. It is obtained by adding all the ages (the values of the variable) together and dividing by the number of cases. The mean has a number of properties that make it a very useful measure in statistics.

If we calculate the mean for age for the 278 cases we get the result 41.15 years.

View animation that shows how to calculate the mean in SPSS

This is probably the best single summary we could give for the ages of our 278 respondents. Note that it does not correspond to age in years of any one of them (ages were recorded in whole years). Nor does it tell us whether the range of ages actually clustered closely round about 41 to 42, or whether there were many cases of ages that were very far from this (respondents who were teenagers or in their sixties for example).

The next thing we can do is express the age of each case in the dataset in terms of our ‘model’ of age in general plus its distance from this model, as follows:

D = F + R

The value for a case = the summary value of our model (the mean) + the difference between the mean and the case value (the residual), for example:

60 = 41.15 + 18.85
41 = 41.15 + (-0.15)
46 = 41.15 + 4.85
16 = 41.15 + (-25.15)

The graph shows this process for the first ten cases in our dataset (where the mean of these ten cases, 41.4, is almost the same). The red part of each bar in the diagram shows how far below the mean the cases with values less than 41.4 years lie; the purple part of each bar shows how far above the mean the cases older than 41.4 years lie.

Note that if we add all the residuals together they will equal zero.
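As a minimal sketch of this idea in Python (using the first set of ten ages from the standard deviation exercise later in this module), we can compute the fit and the residuals and confirm that the residuals sum to zero:

```python
ages = [33, 27, 34, 40, 55, 56, 43, 57, 30, 39]

fit = sum(ages) / len(ages)            # the model: one number, the mean (41.4)
residuals = [a - fit for a in ages]    # one residual per case

# Data = Fit + Residual holds exactly for every case...
assert all(abs(a - (fit + r)) < 1e-9 for a, r in zip(ages, residuals))

# ...and the residuals from the mean balance out to zero.
print(fit)                             # 41.4
print(round(sum(residuals), 10))       # 0.0
```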

Why do we want to think of data in this way?

 

Glossary

Interval level of measurement

In interval measurement the distance between attributes does have meaning. For example, when we measure temperature (in Fahrenheit), the distance from 30 to 40 is the same as the distance from 70 to 80. The interval between values is interpretable.

If you're getting confused about these levels of measurement the research methods knowledge base has a useful summary.

Mean

The midpoint of a set of observations, weighted by each of the values; i.e. we can think of this as the balancing point in a set of data. This applies to the interval and ratio levels of measurement, but not to the nominal or ordinal levels. Because of this, it makes sense to compute an average of an interval variable, whereas it doesn't make sense to do so for ordinal scales.


Modelling data  

It might appear to be a very clumsy way of presenting the data, but it makes clear what is involved in making summary statements about data or, to use the correct technical term, modelling it. Were we to add all the residuals together they would add up exactly to zero. That is one reason why the mean is a good summary measure. However the reason they sum to zero is that the residuals from values below the mean exactly balance out the residuals from values above the mean. Now this will happen regardless of how far from the mean any individual values happen to be. The table below illustrates two cases: in one set the values are mostly close to the mean, while in the other they are spread much further from it.

Set 1   Set 2
  33      12
  27      20
  34      34
  40      30
  55      57
  56      59
  43      54
  57      56
  30      33
  38      58

It would be good if we had a measure that also described whether values cluster together or are spread out. It would also be good if this measure were to emphasise the importance of values that lie far from the mean. This measure would tell us more clearly whether the mean was a 'good fit': that is to say, gave us a good guide to all the values, so that the residuals were mostly small, or whether the values were spread out far from the mean, so that the residuals were rather large.

To do this we can multiply each residual by itself (this has the same effect as removing the arithmetic sign, since a negative number multiplied by a negative number is a positive number). Next we can sum these squared residuals and then divide the result by the number of cases. This measure is called the variance. The higher the value of the variance for any variable, the more spread out are the values of the cases. For technical reasons that need not concern us here, the estimate of variance that we get with this procedure slightly underestimates the variance when we have collected information from a sample rather than a whole population. To correct for this we divide by the number of cases minus one. This makes very little difference to the result unless we have a small number of cases. You will find that most software (such as SPSS) will always divide by the number of cases minus one when performing standard deviation calculations. This also means that you will often see two slightly different versions of the formula for calculating a standard deviation. For example the version below uses n rather than n-1. (If you are curious about the n-1 calculation you may wish to learn more about 'degrees of freedom'.)
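The procedure just described can be sketched in a few lines of Python; this is an illustration, not SPSS output. Note the divisor: n for a whole population, n - 1 for a sample (the version SPSS uses).

```python
def variance(values, sample=True):
    """Mean squared residual: divide by n - 1 for a sample, n for a population."""
    mean = sum(values) / len(values)
    squared_residuals = [(v - mean) ** 2 for v in values]
    divisor = len(values) - 1 if sample else len(values)
    return sum(squared_residuals) / divisor

ages = [33, 27, 34, 40, 55, 56, 43, 57, 30, 39]
print(variance(ages, sample=False))  # 111.44   (divides by n = 10)
print(variance(ages, sample=True))   # ~123.82  (divides by n - 1 = 9, as SPSS does)
```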


Since we have squared the residuals to obtain the variance, the variance is expressed in units on a squared scale, which complicates how we interpret it. We can easily correct this by taking the square root of the variance, which will return us to our original 'unsquared' units. This measure is called the standard deviation. You can think of the standard deviation as a measure of the average size of the residuals in our model of the ages of the respondents in our dataset. If the standard deviation is large, then we have a set of values that are spread out far from our mean. If it is small, then we have a set of ages that are mostly close to our mean.

Statisticians use formulae to describe the mathematical operations involved in, for example, calculating the standard deviation of a set of numbers. These formulae use statistical notation so that a set of operations that would be cumbersome to describe in words can be set out very briefly by a few letters and symbols. We do not use statistical notation in this course, but if you intend to go further with quantitative methods it is a good idea to get some practice with it. The formula for the standard deviation is given on p. 42 of Marsh and Elliott (2008) and p. 119 of Fielding and Gilbert (2006), and is shown below.
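In LaTeX notation, the version of the formula that uses n is:

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}{n}} $$

where $x_i$ is the value for each case, $\bar{x}$ is the mean and $n$ is the number of cases; the sample version simply replaces $n$ with $n - 1$.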

Calculating the standard deviation of a set of numbers is the sort of operation where computers come into their own. Calculating the mean and standard deviation for 278 cases by hand would be a tedious process (albeit one you had no choice but to carry out until the advent of affordable calculators about forty years ago). SPSS, however, will do it for you in a fraction of a second.

Let's see an example...


Reminder: a negative number multiplied by a negative number is a positive number. If you need to brush up on the basic rules of arithmetic, try this short maths tutor summary.

Glossary

Model

A representation of a statistical relationship that attempts to provide a simplification of reality.

Residual

This is the unexplained part of a statistical model.

Variance

A measure of spread (dispersion) of the observations, based on the sum of the squares of the differences of each individual value from the mean of all values, divided by the number of cases. It is only applicable to interval and ratio levels of measurement, not to nominal or ordinal levels (because in the latter two cases we cannot measure the distance between individual values and a mean for all values).

Statistical Formulae

Although we will show you formulae in this course and how to understand them, we do not expect you to remember or use them. We will always explain everything in words. However, it is not hard to see that formulae are much more precise and compact than words. If you intend to go on studying quantitative methods - for example by taking Intermediate Inferential Statistics - then it is a good idea to start getting used to formula notation now.

Degrees of Freedom (from Wikipedia Statistics)

A common way to think of degrees of freedom is as the number of independent pieces of information available to estimate another piece of information. More concretely, the number of degrees of freedom is the number of independent observations in a sample of data that are available to estimate a parameter of the population from which that sample is drawn. For example, if we have two observations, when calculating the mean we have two independent observations; however, when calculating the variance, we have only one independent observation, since the two observations are equally distant from the mean.

Standard deviation

The two diagrams at the bottom of this page show visually two sets of ten ages. Both have exactly the same mean (41.4), but the second set has a larger standard deviation, as can be seen from the greater range of the heights of the bars representing each age. The following table lists each set of ten ages. Calculate the standard deviation for each set of ages by hand (use a spreadsheet or calculator to help).

Set 1   Set 2
  33      13
  27      20
  34      34
  40      30
  55      57
  56      59
  43      54
  57      56
  30      33
  39      58

Remember the following steps in calculating the standard deviation:

1. Calculate the mean. (In this example the mean of the set of ten numbers).


2. Calculate the residuals (the distance of each value from the mean, which is the value minus the mean).

3. Square each residual (multiply each residual by itself).

4. Sum the squared residuals (add them up).

5. Divide the result (the sum of the squared residuals) by the number of cases minus one (in this example 10 - 1 = 9). This produces the variance of the ten numbers.

6. Take the square root of the variance to produce the standard deviation.

You do not need to memorise these steps but you do need to understand what each step of the process involves, so as to get a good idea of exactly what a standard deviation is. You also need to become familiar with how to calculate the variance of a variable, and what it means, as we shall return to this concept in the module on regression and correlation.

You should get the result 11.13 for the first set of numbers and 17.39 for the second set. If you get lost, or get the wrong result you can see the correct working here (but work out the example without taking a sneaky look first!):

View the correct working

Sum of first set of numbers = 414
Mean = 41.4
Sum of squared residuals = 1114.4
Divided by 9 (N-1) = 123.8 (= the variance)
Square root of 123.8 = 11.13 (= the standard deviation)

Sum of second set of numbers = 414
Mean = 41.4
Sum of squared residuals = 2720.4
Divided by 9 (N-1) = 302.3 (= the variance)
Square root of 302.3 = 17.39 (= the standard deviation)
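You can check this working in a few lines of Python; the statistics module divides by n - 1, just as in the steps above:

```python
import statistics

set_1 = [33, 27, 34, 40, 55, 56, 43, 57, 30, 39]
set_2 = [13, 20, 34, 30, 57, 59, 54, 56, 33, 58]

for data in (set_1, set_2):
    print(statistics.mean(data),                # 41.4 for both sets
          round(statistics.variance(data), 1),  # 123.8, then 302.3
          round(statistics.stdev(data), 2))     # 11.13, then 17.39
```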

Use SPSS and the GHS CQDA Practice dataset to calculate the mean and standard deviation of respondents' age on finishing education (the variable tea) and the number of years they have been at their current address (the variable reslen). Use the 'Descriptives' command under the 'Analyze' menu. Which variable has greater ‘spread’?

Why should this be?

You should get the answer 2.7 years for age at leaving education and 11.6 years for length of current residence. Most people finish their education around similar ages, from about 16 to 22. However the lengths of time people spend at a particular address are much more variable: from much less than one year to an entire lifetime.

We can also compare the means and standard deviations of a variable, according to the values of another variable. For example, we might want to know if people of different marital statuses, or different sexes, tended to have the same length of residence. To do this in SPSS we go to 'Analyze' / 'Compare Means' / 'Means' and in the dialog box we put the variable whose mean and standard deviation we wish to calculate in the 'Dependent List' box (in this example reslen) and the variable that describes the groups we wish to compare in the 'Independent List' box (in this example marstat or sex), using the arrows to place variables into or retrieve them from these boxes.
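For comparison, here is a rough equivalent of this compare-means procedure in Python with pandas. The miniature dataset below is made up for illustration; in practice the values would come from the GHS CQDA Practice file.

```python
import pandas as pd

# Hypothetical stand-in for the real data: reslen in years, sex as recorded.
ghs = pd.DataFrame({
    "sex":    ["male", "female", "female", "male", "female", "male"],
    "reslen": [3.0, 11.5, 7.0, 25.0, 0.5, 14.0],
})

# Analogue of Analyze > Compare Means > Means in SPSS:
# reslen plays the Dependent List role, sex the Independent List role.
print(ghs.groupby("sex")["reslen"].agg(["mean", "std", "count"]))
```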

Compare the mean length of residence at their current address for men and women, and for single and currently married people using this procedure in SPSS. What do you find?

 

Summary

When we have interval level data the mean gives us a very useful and simple summary of the level of values that a group of cases takes. Instead of having to


describe a large number of separate ages, for example, we can summarise them with one number, the mean of the ages.

The standard deviation gives us another very useful summary measure: this time it tells us how far the cases’ values tend to be from the mean value, or their spread. If the cases are spread out, the value of the standard deviation will be high. If they are clustered tightly together it will be much lower.

With just these two summary measures we can say a great deal about a large volume of interval level data.

The mean and standard deviation are also important because we will come across them in another context in this course. When a variable is measured across a very large number of cases, and the values for the variable result from a large number of different, random factors, its distribution takes the form of a normal curve. This naturally occurring frequency distribution can be defined by just the mean and standard deviation of the variable, and has many useful properties. As we shall see in module 5, one of them is to allow us to calculate the probability that the results we obtain from a small number of cases in a sample can be used to make statements about the wider population from which such a sample was drawn.

Interval level data describing individuals are not that common in the social sciences. Age and other variables relating to the passage of time (e.g. length of residence in a particular place, years spent in school etc.), income, wealth and physical characteristics such as height and weight are about the only things measurable at this level for individuals. However interval variables are common when we measure entities comprising a number of people: organisations, firms, states, elections, cities, offices, birth cohorts, and so on.

What about a measure of central tendency for ordinal variables?

 

Glossary

Mean

The midpoint of a set of observations, weighted by each of the values i.e. we can think of this as the balancing point in a set of data. This applies to the interval and ratio levels of measurement, but not to the nominal or ordinal levels.

Central tendency

This generic term is used to include various types of 'average', such as the mean or median, which occur within the middle of the data distribution.

A measure of central tendency for ordinal and nominal variables


We can calculate the level and spread for interval variables easily because the values we can record for them are ‘real’. E.g. the value 24 for ‘age at finishing education’ directly represents 24 years. But for ordinal variables we only know the ranking of the values, not their actual size. E.g. the code ‘1’ for ‘higher degree’ in our variable for educational level is simply that, a code: it cannot mean that a higher degree is half the size of a ‘first degree’ coded ‘2’.  However we can still produce a summary measure for the variable, using the property that a case in any class is higher or lower than the cases in other classes. Were we to order all the cases by order of the classes into which they fall, we could select the case which falls exactly half way up (or down) this order as the ‘middle’ case and take its value (the class into which it falls) as the ‘middle’ class. This measure is called the median.

If we have an even number of cases the median can fall between the two middle ones (e.g. the fifth and sixth cases of a set of ten numbers). If these two cases have the same values, this will be the value of the median. If they have different values we take the average of the two middle cases (that is we add them together and divide by two).
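A quick illustration of this rule in Python, using the first set of ten ages from the standard deviation exercise:

```python
import statistics

# Ten cases (an even number): ordered, the middle pair is 39 and 40,
# so the median is their average, (39 + 40) / 2 = 39.5.
ages = [33, 27, 34, 40, 55, 56, 43, 57, 30, 39]
print(statistics.median(ages))  # 39.5
```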

We can also use the median as a useful measure for interval level data, especially if we compare it to the mean. If the mean has a higher value than the median, this tells us that the average size of the residuals for values above the median is larger than for those below it (so that a smaller number of cases produces the same total value of residuals from the mean): in other words, the distribution of our cases is skewed, with a small number of cases with high values. A useful way to imagine this is to think of the example of the distribution of incomes in a country. Median income is always below mean income, as there are always a number of people who earn a great deal of money and there is no upper limit to what someone may earn, while by definition the minimum that anyone can earn is nothing. The greater the difference between the median and mean, the greater the degree of income inequality in the country.

SPSS will calculate the median for any variable. Use the 'statistics' button on the 'frequencies' command dialog box.

Fielding and Gilbert (2006, pp. 98-102) give you further examples of calculating a median.

Finally, how can we ‘summarise’ data measured at the nominal level? There is only one summary statistic of much use, since we cannot sensibly rank our cases. All we know is that the classes into which cases fall are different. Thus all we can do is note the class into which the highest number of cases falls, or simply the most common value that the variable takes: this is known as the mode. The mode doesn’t always tell us much. Look at the variable on economic status of respondents in our data set.

What does the mode tell us?

It tells us that most respondents are working

Other summary measures


As well as the standard deviation and variance, there are a number of other summary measures that describe the distribution of interval and ordinal level data. We do not deal with them in great detail here, but concentrate on the most useful ones.

The minimum value and maximum value are simply the lowest and highest values that a variable takes for a group of cases. The difference between them is called the range. For example if you examine the spread of ages in the GHS CQDA Practice dataset you will see that the oldest respondent is 69 years old and the youngest is 16. The range would be 69 - 16 = 53 years.

While the median divides the ranked cases at the midpoint value, the upper and lower quartiles do the same for the first and last 25% of cases. Deciles do the same for each successive 10% of cases, and percentiles, as you've probably already guessed, do this for each one per cent of cases. The 95th percentile, for example, would be the value of the case at the start of the top 5% of cases. The inter-quartile range, sometimes called the midspread (and abbreviated dQ or IQR), subtracts the lower quartile from the upper quartile to measure the difference between the two.

The median and midspread are known as robust measures. This means that they are resistant to unusual values in the data, which are often known as outliers. As the name suggests, an outlier is a value that lies 'outside' the range of the other values that a variable takes.

 

For example, suppose we collected data on infant mortality rates (the number of children per 1,000 live births dying within 12 months of birth) in various countries of the world and obtained the following results:

2.31, 2.46, 2.79, 2.89, 3.23, 3.58, 4.25, 4.85, 6.26, 180.21

Were you to plot these values on graph paper you would quickly realise that one value stands out: 180.21 is thirty times larger than the next nearest value, and forty-five times bigger than the difference between the next highest value and the lowest value. There are two possibilities for our outlier. One is that it is an error. Perhaps an extra zero was included in the number by mistake; perhaps the decimal point was put in the wrong place. There is a handy rule of thumb in statistics that states 'the more interesting a piece of data is, the more likely it is that it is the result of an error'. 'Interesting' data is always worth double checking, no matter how reliable and prestigious its source may be. However the other possibility is that the number is correct. In this example it is correct. The reason is that the last country in the list is Angola. (The countries are, in order, Singapore, Bermuda, Japan, Hong Kong, Iceland, Norway, Slovenia, UK, US and Angola. You can see rates for all the countries of the world and for different time periods at gapminder and at the United Nations World Population Prospects site.)

In this example the existence of a clear outlier points us to something interesting in the data. The comparison is so stark that it leaps out at us. It encourages us to think about what it might be that is so different about this particular case: what marks it apart from the others? In this example it is a question both of relative poverty and region of the world. Infant mortality is much, much higher in sub-Saharan Africa than in any other region (world infant mortality is around 40 per 1,000).

Were we to calculate the mean of infant mortality rates for this group of countries, in an attempt to summarise their experience, we'd obtain a figure of 21.3. This is not really a very helpful figure: by including the outlier in the calculation we produce a rate that is well in excess of the experience of all the countries except Angola, yet well below that of the latter country itself. Similarly if we calculate a standard deviation, we get a very high figure for spread of 55.9. Again this neither captures the fact that many countries have very similar experiences indeed, nor does it tell us just how enormous the gulf between Angola and the other countries actually is.

If, however, we calculate the median, we get a reasonable description of the experience of most of the countries in the group: 3.4.

If we calculate the interquartile range or midspread, we get 2.1: again a reasonable description of the range of experience of the main group of countries, but quite misleading about Angola's situation in relation to them.
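These figures can be verified with a short Python sketch. The interquartile range is left as a comment because its exact value depends on the quartile convention used: SPSS's Tukey hinges give roughly 2.1, while other interpolation methods give values between about 1.9 and 2.5.

```python
import statistics

rates = [2.31, 2.46, 2.79, 2.89, 3.23, 3.58, 4.25, 4.85, 6.26, 180.21]

print(round(statistics.mean(rates), 1))    # 21.3 - dragged upwards by the outlier
print(round(statistics.median(rates), 1))  # 3.4  - unaffected by the outlier
print(round(statistics.stdev(rates), 1))   # 55.9 - hugely inflated by the outlier
```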

We get these rather different results because of the different ways our measures are calculated. Since the median depends purely upon ranking, it is not affected by the actual location of outliers. This makes it a very robust measure. But being robust is not everything. The mean, by using information from all the cases, is sensitive to outliers, and thus brings at least some of the impact of the case of Angola into the picture. We can make almost exactly the same observations about the standard deviation and interquartile range.

There are two lessons to draw from this example. The first is that we need to stay alert to how the way summary measures are calculated enables them to tell rather different stories about data. We need to choose the summary measure appropriate to the task we want it to perform. The second is that the most important thing we should do with any data when we first get it is to have a good look at it, and often visualising it in graphical form is a good way to do this. (We look at techniques for doing this later in the module.) If we did that with the data in this example we would see very quickly that it is very heavily 'skewed': lots of quite small values and one very large one. This ought to make us suspicious. If there is no error in our data we probably have evidence of some kind of powerful structure or process at work: it is certainly not random! From our general knowledge, as soon as we know the identity of the countries many possible explanations suggest themselves. Big skews should also make us ask another question: is it sensible to represent the data by one summary measure at all? It might be that the experience of countries with high infant mortality rates is best summarised separately from those with lower rates. This would make sense too if we went on to discover that the factors influencing such rates were different in the two groups of countries.

The measures discussed above are all available under the statistics button of the frequencies command dialog box in SPSS.

Skewness and kurtosis

Skewness and kurtosis might sound like something you would not like to catch, especially when you discover that your kurtosis can be lepto- or platy-kurtic. They refer to whether the distributions of values for interval level data are symmetrical about the mean, or cluster towards lower or higher values (skewness), and whether the data is concentrated closely around the mean (peakedness, or kurtosis). We'll look at them briefly in Part 2 when we consider the properties of a normal distribution. Outside of economics and demography you're not likely to come across kurtosis much, but it is a good idea to get used to the idea of positive skew (more spread above the mean) and negative skew (more spread below the mean). If you are curious you can find more information in Fielding and Gilbert pp. 110-111 and in Marsh and Elliott (2008) pp. 16 & 21-22.

The limitations of measures of central tendency and spread for nominal and ordinal variables mean that the best way to make summary statements for such variables is by producing tables - which is what we are going to look at in greater detail now. But before we do...

A Warning!

If you have understood the previous section you will appreciate that while measures appropriate for nominal or ordinal data can also be used for interval level data, the opposite does NOT apply. Neither the median nor the mean of a nominal variable exists, since the cases can neither be ranked nor possess a value that can be totalled up. Similarly an ordinal level variable can have no mean.

However, SPSS will happily treat any variable as taking the interval level of measurement and calculate these statistics accordingly (as a computer program it cannot know whether the numerical codes for different variables represent real interval level values or just codes assigned to different nominal or ordinal level values). Thus it is important to understand what statistics can be calculated for the level of data you are working with and avoid asking SPSS to produce results that are meaningless. Remember the golden rule of computing: rubbish in, rubbish out!
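A small illustration of the point, using hypothetical marital status codes: the software computes a 'mean' without complaint, but the result is meaningless because the codes are labels, not quantities.

```python
import statistics

# Codes as in marstat: 1 = single, 2 = married, 3 = separated/divorced/widowed.
marstat_codes = [1, 2, 2, 3, 1, 2, 2, 1, 3, 2]

print(statistics.mean(marstat_codes))  # 1.9 - a meaningless 'mean marital status'
print(statistics.mode(marstat_codes))  # 2   - the mode is the only sensible summary
```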

 

Glossary

Mode

The most common value for a variable. This can apply at any level of measurement. If there are two equal highest frequencies, the distribution is called bimodal and if there are several equal highest frequencies, it is known as a polymodal (multimodal) distribution.

Median


The middle value in an ordered set of observations; i.e. the point at which half the observations lie on either side. This applies to the ordinal, interval or ratio levels of measurement, but not to the nominal level.

Readings

Fielding, J. and Gilbert, N. (2006) Understanding Social Statistics. London: Sage.

Frequency distribution and contingency tables

The limitations of measures of central tendency and spread for nominal and ordinal variables means that the best way to make summary statements for such variables is by producing tables. Data at these levels of measurement are ubiquitous in the social sciences. This means that mastering the art of producing good tables, and interpreting (and criticising) tables produced by others, is a fundamental skill for anyone who wants to think of themselves as a social scientist. We wouldn't give much credence to the social science credentials of someone incapable of reading books or articles.  You should be just as suspicious about anyone who cannot read (or write) a table. You will find that time invested in becoming fluent at reading and producing tables is time very well spent.

Summary statistics of the kind we have just considered (mean, standard deviation, median, interquartile range) work well for interval level variables, but often in social science our information comes at the nominal or ordinal level of measurement, where good single summary measures, especially of spread, cannot be calculated. For example, in the GHS CQDA Practice dataset one of the variables (marstat) describes respondents’ marital status. Since this is a nominal variable, we have no measure of spread, and our only measure of ‘level’ is the mode, which in this case happens to be the category ‘married or living as married’. This summary is not a very useful one for most purposes. This is where tables are vital.

Tables are by far the most common form of summarising and presenting data, and of exploring relationships between different variables. They can be used (with care) for interval as well as nominal and ordinal variables. It is thus important to learn not only how to read and interpret them correctly, but also how to produce them so that others can see clearly the information in them.

Frequency Tables

Legal marital status

                                            Frequency   Percent   Valid percent   Cumulative percent
Valid     1 single, never married                  83      29.9            30.5                 30.5
          2 married and living with
            husband/wife                          152      54.7            55.9                 86.4
          3 separated, divorced or widowed         37      13.3            13.6                100.0
          Total                                   272      97.8           100.0
Missing   9 not recorded                            6       2.2
Total                                             278     100.0

 

As we saw in the last module, a frequency table simply shows the number of cases that take each value of a variable in our data, i.e. the frequency with which each value of a variable appears in our dataset. Another way of describing this is to say that such a table shows the distribution of a variable. The distribution is just a list, in order, of each of the values the variable takes, and the number of cases in our set of observations that take each of these values.

The above frequency table shows the variable marstat which we used in the previous module. It was produced using the frequencies command in SPSS (Analyze / Descriptive Statistics / Frequencies). Each of the values is listed in the second column, and the number of cases that take each value in the third column. We can see at a glance that out of the 272 people who answered the question, 83 respondents were single having never married, 152 were married and 37 were separated, widowed or divorced. The fourth column presents these numbers as a percentage of the total of 278 cases. The table also shows that 6 people did not respond to the question. It may be that they did not wish to tell the interviewer what their legal marital status was, or maybe they were unsure of their exact legal position. SPSS can be told to treat cases for which we do not have information for a variable as missing cases. By contrast the values 1 to 3 are valid for this variable: that is to say they contain the information about the values of the variable we are interested in (in this case legal marital status). SPSS presents this information in the first column of the table.
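To make the logic of the columns concrete, here is a hedged Python sketch that rebuilds the same table from the counts above (83 single, 152 married, 37 separated/divorced/widowed, 6 missing):

```python
import numpy as np
import pandas as pd

marstat = pd.Series([1] * 83 + [2] * 152 + [3] * 37 + [np.nan] * 6)

freq = marstat.value_counts().sort_index()          # counts of the valid values
table = pd.DataFrame({
    "frequency": freq,
    "percent": freq / len(marstat) * 100,           # base: all 278 cases
    "valid percent": freq / marstat.count() * 100,  # base: the 272 valid cases
})
table["cumulative percent"] = table["valid percent"].cumsum()
print(table.round(1))  # reproduces the 29.9 / 30.5 / 30.5 pattern above
```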

Understanding and dealing with missing cases is fundamental to good analysis of tables, so let us first consider this issue.

A common error! Confusing values and their labels.

In SPSS the values of variables are stored as numbers. E.g. the values for the variable MARSTAT are the numbers 1, 2 & 3. The value label for 1 is 'single, never married', for 2 is 'married and living with husband / wife' and so on. For nominal and ordinal variables the numbers themselves have no meaning: they are simply codes used to store the information. For interval level variables the numbers are the values themselves. Thus the variable for age in years needs no value labels, as the number of years is the actual value the variable takes.

Glossary

Contingency tables


These are tables (also known as crosstabs) which present frequencies of values for 2 or more categorical variables for the purpose of analysing the relationship between them. Usually this is done by putting the values for one of the variables in the rows and for the other in the columns. If there is a dependency relationship between the variables, it is usual practice to place the independent variable in the column and the dependent variable in the row, although consideration also needs to be given to how much space there is to show the variables in this way.

Missing values

There are two main situations where we have to deal with the fact that the value of a variable for a case is missing. First, it may be that the information does exist, but we do not have access to it. For example a respondent may have the information but be unwilling to divulge it. It may be that the respondent does not have the information either: in our example they may not actually know their formal legal marital status, even though they have one. Other possibilities are that the interviewer forgot to ask for the information, or that it was not recorded for some other reason. In any context where we collect information, there are always various reasons why we may not be able to collect all of it.

What do we do in such situations? We obviously cannot analyse evidence we do not have! Instead we assume that the distribution of values of a variable for the cases for which we do not have information (the missing cases) is the same as that of the cases for which we do have information (the valid cases). That is to say, we deal with the information we have been able to collect, proceeding on the assumption that there is no difference between these cases and those that for whatever reason, we do not know about.

Often this may be a safe assumption. However it may sometimes be a very dangerous one, especially when it relates to respondents' willingness to divulge information. People with very high incomes may on average be more reluctant to disclose information about themselves than those on average incomes. Voters for one political party may be more disposed to reveal their voting behaviour than those for other parties, and so on. Responses such as 'don't know' or 'no answer' therefore have to be treated with care.

The second situation in which information may be missing is because such information does not exist to be collected in the first place. A variable may not be relevant or applicable to one or more cases. Imagine we had a question about the age at which a respondent became a parent. It would make little sense to ask respondents who were not parents this question. In this situation we do not have the information because it does not and cannot exist, not because we have been unable to capture it. If the question/variable is not applicable to a respondent, then logically that respondent does not come from our target population, and so can safely be excluded from our study without further ado. However, we do always need to keep in mind exactly what or whom our target population comprises. If we do not do this, we risk calculating proportions wrongly by including observations in the denominator of our proportion that can never appear in the numerator. Imagine I am the publisher of a useless and misleading statistics textbook. 10 people have bought it, and all have failed their statistics exam. I only have information on these ten cases. However this might not stop me claiming that 'out of 60 million people in the UK, only ten (around 0.00002%) have not passed their stats exam after reading this book'. Excluding the 59,999,990 'missing cases', the percentage becomes 100%!

We deal with both these situations by assigning values to variables which describe the reason why we do not have the information, where we do not have it. Common values are 'don't know', 'not asked', 'not answered', 'not applicable' and so on. If we then tell SPSS to treat these values as missing, SPSS excludes cases with these values from its calculations. In frequency tables these values are displayed separately. In crosstabs they are omitted altogether from the tables produced.

If we want to include missing values in our analysis, we use the missing values column in the data editor window to instruct SPSS to treat any particular value as missing or valid (SPSS treats any value that is not missing as a valid one). For example if we were preparing a table on respondents' voting intentions, we might want to treat 'don't know' as a valid value and compare it to the others (such as intention to vote for one or other party). Alternatively, we might want to analyse only the intentions of those who knew how they were going to vote, and treat the 'don't knows' as missing.
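In Python with pandas the same distinction can be sketched like this (the codes below are made up; replacing a code with NaN plays the role of defining it as missing in SPSS):

```python
import numpy as np
import pandas as pd

# Suppose 8 codes 'no response', as in the exercise below.
marstat = pd.Series([1, 2, 8, 3, 2, 8, 1])

# Treat 8 as missing: pandas then excludes NaN from counts automatically.
print(marstat.replace(8, np.nan).value_counts())  # 8 no longer appears

# Treat 8 as valid: keep it and it is counted alongside the other values.
print(marstat.value_counts())                     # 8 appears with a count of 2
```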

Try it yourself. The variable marstat in the GHS simple dataset has 6 missing cases coded '8' (No response). Use the data editor window to tell SPSS to treat these cases as valid, and produce a crosstab with sex to find out the sex of those who gave no response to the question on marital status. The following animation shows you how to instruct SPSS to treat values as missing or valid.

View animation on missing values

A common mistake with missing values

It is easy to confuse defining a value as missing in SPSS and labelling a value as 'missing'. Values defined as missing are listed (and can be changed) in the 'missing' column of the SPSS data editor window in variable view mode. If you choose to label a particular value as 'missing' in the value labels column in the SPSS data editor window, this has no effect on how SPSS treats the variable. To avoid confusion it is good practice to avoid using the word 'missing' for a value label and instead use terms such as 'no answer', 'not applicable' or 'don't know', which describe why that value is missing.

Standardisation in Frequency tables

It is obviously clumsy to deal with raw numbers (e.g. 83 out of 278), so we express the proportion of cases that take each value as a percentage of all the cases, as if we always had 100 cases in the columns of our table. This is shown in the fourth column of the frequency table. For example, 29.9% of respondents were single. This is the same as saying that the proportion of respondents who were single is 0.299. Tables are standardised by using percentages or proportions in this way. As we shall see in later modules, we can also think of such proportions as the probabilities that any one case has of taking each of the values. Thus if we picked a case at random from all 278 cases, on average we would pick a respondent who was single 299 times out of 1,000: a probability of 0.299.

Let's look at our table on marital status again:

                                            Frequency   Percent   Valid percent   Cumulative percent
Valid     1 single, never married                  83      29.9            30.5                 30.5
          2 married and living with
            husband/wife                          152      54.7            55.9                 86.4
          3 separated, divorced or widowed         37      13.3            13.6                100.0
          Total                                   272      97.8           100.0
Missing   9 not recorded                            6       2.2
Total                                             278     100.0

The percentages in the fourth column ('percent') are based on all the cases, both valid and missing, but usually we will want to exclude the missing cases from our analysis, and base it on those cases for which we do have information, making the assumption that such cases share their characteristics with those for which we do not have information. This is shown in the fifth column, headed 'valid percent'. Thus, e.g., 30.5% of the cases for which we do have information are single. This is the percentage we are most interested in. The sixth column gives the cumulative percentage for all the values so far. This is handy when dealing with ordinal variables, or when grouping together different values.

When collecting, analysing and presenting information in tables in this way we also need to consider its precision through asking three questions:

1. How accurate does it need to be?
2. How accurate can it be?
3. What level of accuracy can we afford?

1. How accurate does it need to be?

Too much accuracy can be confusing. How much accuracy do we need to be able to present the information clearly? There is no point in including several decimal places in numbers if these are not relevant to the argument we wish to make.

2. How accurate can it be?

If we have captured the information to a certain degree of accuracy, we cannot pretend that the results of any calculation we make with the data improve its accuracy. E.g. if we have captured information on age in whole years at last birthday, we ought to round the information we present to reflect this, by including, at most, one decimal place.

3. What level of accuracy can we afford?

Capturing data requires resources and these are always scarce. In theory, for example, we could capture respondents' ages to the nearest minute. However this would probably require us to consult and verify hospital or other records to obtain this information: a laborious and expensive exercise that would be very unlikely to give us useful information!

Glossary

Proportion

A relative measure that indicates how large a value is relative to other values for a particular variable. It is directly related to a probability value and takes on a value between 0 and 1, or gets expressed as a percentage.

Dealing with continuous variables and ordinal or interval level variables with a large number of values

 Try producing a frequency table for the variable age from the GHS CQDA Practice dataset. It should look like this.

Age     Frequency   Percent   Valid percent   Cumulative percent
16          6         2.2          2.2               2.2
17          7         2.5          2.5               4.7
18          5         1.8          1.8               6.5
19          1          .4           .4               6.8
20          1          .4           .4               7.2
21          8         2.9          2.9              10.1
22          5         1.8          1.8              11.9
23          4         1.4          1.4              13.3
24          4         1.4          1.4              14.7
25          3         1.1          1.1              15.8
26          6         2.2          2.2              18.0
27          5         1.8          1.8              19.8
28          4         1.4          1.4              21.2
29          5         1.8          1.8              23.0
30          5         1.8          1.8              24.8
31          8         2.9          2.9              27.7
32          1          .4           .4              28.1
33          8         2.9          2.9              30.9
34         11         4.0          4.0              34.9
35          7         2.5          2.5              37.4
36          9         3.2          3.2              40.6
37          4         1.4          1.4              42.1
38          8         2.9          2.9              45.0
39         10         3.6          3.6              48.6
40         13         4.7          4.7              53.2
41          7         2.5          2.5              55.8
42          6         2.2          2.2              57.9
43          5         1.8          1.8              59.7
44          4         1.4          1.4              61.2
45          3         1.1          1.1              62.2
46          3         1.1          1.1              63.3
47          4         1.4          1.4              64.7
48          6         2.2          2.2              66.9
49          5         1.8          1.8              68.7
50          5         1.8          1.8              70.5
51          4         1.4          1.4              71.9
52          4         1.4          1.4              73.4
53          9         3.2          3.2              76.6
54          2          .7           .7              77.3
55          6         2.2          2.2              79.5
56          6         2.2          2.2              81.7
57          5         1.8          1.8              83.5
58          5         1.8          1.8              85.3
59          8         2.9          2.9              88.1
60          3         1.1          1.1              89.2
61          2          .7           .7              89.9
62          4         1.4          1.4              91.4
63          5         1.8          1.8              93.2
64          1          .4           .4              93.5
65          6         2.2          2.2              95.7
66          5         1.8          1.8              97.5
67          3         1.1          1.1              98.6
68          3         1.1          1.1              99.6
69          1          .4           .4             100.0
Total     278       100.0        100.0

As you can see such a table contains too much information to be useful! Often we wish to group a number of different values together into a single category or class in order to highlight a more limited range of information from our variable. There is no single rule that guides us about how to produce our new classes, other than that they must be comprehensive and mutually exclusive. We might be interested in particular substantive values to split up our new classes (e.g. the age of retirement, voting age etc.) or in dividing our cases into roughly equally sized groups (e.g. by taking the quartiles). We can use the 'Recode into different variables' command under the 'Transform' menu of SPSS to produce a new version of our variable with our new classes. The animation below shows you how to do this. Practice using this command and its associated dialog box. It is rather complex to start with, but after a few tries it becomes much more straightforward. Try recoding the age variable in the GHS CQDA Practice dataset into a variable with the following age ranges and produce a frequency table of this new variable (remember the new variable will need a name up to eight characters long):

16 – 29
30 – 49
50 – 64
65+
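For comparison, a rough equivalent of this recode in Python with pandas (the ages below are made up for illustration):

```python
import pandas as pd

ages = pd.Series([16, 23, 30, 41, 50, 64, 65, 69])  # hypothetical ages

# Analogue of 'Recode into different variables': the bins are
# comprehensive (they cover 16-69) and mutually exclusive.
agegrp = pd.cut(ages,
                bins=[16, 30, 50, 65, 70],  # lower edges included (right=False)
                right=False,
                labels=["16-29", "30-49", "50-64", "65+"])
print(agegrp.value_counts(sort=False))
```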

View animation on using the Recode Function is SPSS

Warning! Remember, when creating categories, that they must be comprehensive and mutually exclusive. What would be wrong with the following list of age groups?

16 – 30
30 – 50
50 – 75

Show Answer

It is neither exhaustive nor exclusive! For example if I were 30 years old, which category would I fall into? What if I were 77? I would not have a category at all!

Note that while we can group together existing values into broader categories, we cannot do the reverse! For example if respondents had been given the set of four age classes and asked which one they corresponded to (e.g. 16 - 29) we would not be able to disaggregate this information later to discover their actual age in years. A good rule to follow when collecting information is thus to capture it in the most disaggregated form that is practical. We can put it into broader categories for the purposes of analysis later on.

 

Whenever we create new variables in SPSS, we need to save the new version of our dataset with the new variable in it, in the same way that we would save successive versions of a document when writing and editing an essay. WARNING: it is very easy to produce dozens of different versions of a dataset, or dozens of different versions of a variable. When working it is easy to believe that you will remember what changes you made, which variable was a new version of which older version, precisely what the changes were, and so on. You will not!! Time spent carefully and systematically logging what you have done, either as a file on your computer or in a notebook, is not only a good habit to get into: it will, in the longer run, save you substantial amounts of time.

View animation on saving a new version of the dataset to include recoded variables

Glossary Reminder:

Ordinal variables

These are variables that have values which are categories that can be placed in rank order.

Interval variables

These are variables that have values which are categories, in rank order and with known distance between them, but with no definite zero point.

Recoding Variables

When we recode a variable we usually produce a new variable (so that we do not lose the information captured in the original variable). The new variable appears at the end of the list of variables in the 'variable view' version of the data editor window. The next thing we do is give our new variable a variable label (so that we know what this new variable describes) and, if it is a nominal or ordinal variable, value labels (so that we know what each of the value codes represents). In doing this we have, of course, changed our dataset. We therefore need to tell SPSS to save this new version of the dataset with the new variable, and to give it a new name (to distinguish it from the old version of our dataset). The SPSS animation shows you how to do this.

Recoding decisions

When we wish to recode variables into a new set of classes, there are three decisions to be made:

1. Determining classes - How many classes do we want?
2. Interval width - How big should the class interval be?
3. Theoretical and practical class limits - What are the class limits?

1. How many classes do we want?

To be useful, we should choose a number of classes that suits our information purposes, neither too few to lose detail, nor too many to obscure the picture.

Page 25: Part 1 Module 2 Single File

2. How big should the class intervals be?

We could crudely estimate this by dividing the range of values by the number of classes as a first guess. However, we might also look at what might look like relatively homogeneous groupings in the data to decide what would be sensible; i.e. ideally we want members of each class to be similar to each other and different to members of other classes. It might be important to match existing class sizes in other datasets, if comparisons are intended; e.g. in the UK age is often divided into standard European classes (0-5, 6-10, 11-14, 15-24, 25-34, etc.). In other words we do not have to have a constant size of class interval in mind.

3. What are the class limits?

We need to be aware that classes need to be mutually exclusive, but also they need to account for all values (i.e. mutually exclusive and comprehensive). For example, the income ranges of £10k-£14k and £15k-£19k are fine if we record data to the nearest £1,000. However, in a survey recording that a respondent earns £14,423, we would not know to which class they belong. Therefore, while these might be the practical limits for our purposes, we need to be clear where the theoretical boundaries occur. In the above example we would probably assume the real classes are £10,000-£14,499 and £14,500-£19,499. There are some variables where such simple midpoints between classes are not appropriate. For example, there is a cultural norm in the UK that age in years runs to the day before the next birthday, not rounded either side of the midpoint between birthdays; i.e. in the case of 11-14, the real class limits are 11 years 0 days to 14 years and 364 days (365 days in a leap year); thus, 11-14, 15-19 would effectively have a boundary at 15 years.

Glossary

Class interval

This is the width of a range of values. For example, we might set the width of a class of values for an age-group to be 10, i.e. 25 - 34. Alternatively, we might be interested to talk about the range of precision in an estimated value, such as the confidence interval for a population mean.

Class limits (theoretical and practical)

These are the boundaries for a data class that mark the end points of the class interval. Theoretical limits indicate the exact or true boundaries, while for practical purposes we often use more approximate boundaries when classifying observations.

Contingency tables or crosstabs

When we want to look at a relationship between variables, we are interested in whether the frequency distribution of cases on one variable is affected by their distribution on another variable. Another, similar way of describing this is whether the conditional distribution of a variable changes under different conditions (values) of another variable. This may sound complex but it is simply a logically clear way of describing what we do when we compare groups of people defined by one characteristic (e.g. their sex, or which country they live in) according to another characteristic (e.g. the way they voted in an election, or their views on women's voluntary childlessness). We might compare men and women voters to see if they voted in similar ways (i.e. the distribution of their votes across parties was the same) or if they differed (i.e. women were more or less likely than men to vote for particular parties). If sex didn't affect voting, we could say there was no relationship: we could describe this by saying that the distribution of values on one variable did not affect the distribution of values on the other, or that their distributions remained unchanged under different conditions of the other variable. Another way of saying this is that the variables were not correlated, or that the variables were not associated, or that the variables were independent of each other.

When we look at relationships between variables we often have some hunch about the causal processes involved. We will look at the idea of causation in the next module. For the moment we need only note one important point. Data rarely tells us anything directly about causes. The concept of cause and effect comes from theory. However, theories can be tested by our data. When we suspect that one variable may explain something about the distribution of values of another variable, we refer to the first variable as an independent, predictor or explanatory variable, and the second as a dependent, response or outcome variable. In our hypothetical example of sex and voting, we would be hard pressed to come up with a theory about how someone's voting behaviour influenced their sex. We would be on much firmer ground if we were to suggest that sex might influence voting behaviour. Sex would be our independent variable, and voting behaviour our dependent variable.

 

The table below, produced by the crosstabs command in SPSS, shows the relationship between the variables marstat and tenure in the GHS CQDA Practice dataset. My theory might be that marital status has some influence on the kind of housing tenure (owning or renting) people have. I would therefore be interested in knowing if the distribution of people across different tenure statuses (i.e. the values of the variable 'tenure') was the same for people of different marital statuses (i.e. the values of the variable 'marstat'). I could also describe this question as: under different conditions of the variable marstat, is the distribution of the variable tenure the same? Or again: do the conditional distributions of tenure vary across different conditions of marital status? Or again: is the distribution of tenure conditional upon marstat? Or again: is there an association between tenure and marstat?

Tenure * Legal marital status Crosstabulation

                                               Legal marital status
                                      1 Single,      2 Married and    3 Separated,
                                      never          living with      divorced or
                                      married        husband/wife     widowed          Total
Tenure
 1 Owns outright          Count            12              44               10            66
                          % within      14.5%           28.9%            27.0%         24.3%
 2 Buying on a mortgage   Count            41              92               18           151
                          % within      49.4%           60.5%            48.6%         55.5%
 3 Rents                  Count            30              16                9            55
                          % within      36.1%           10.5%            24.3%         20.2%
Total                     Count            83             152               37           272
                          % within     100.0%          100.0%           100.0%        100.0%

('% within' denotes the percentage within each legal marital status category, i.e. column percentages.)

Let's look at the table in more detail...

 

Glossary

Correlation

The degree of association between variables.

Dependent variable

A variable that is hypothesised to change as a result of a change in another (independent) variable.

Independent variable

A variable that is hypothesised to effect a change in another (dependent) variable.

Contingency tables or crosstabs (continued)

Contingency tables can be thought of as a series of adjacent frequency tables. They show the frequency distribution of values for a response or dependent (Y) variable according to (or contingent upon) each of the values of the explanatory or independent (X) variable. The variable displayed in the columns of the table is said to be in the header of the table. The variable displayed in the rows is said to be in the stub of the table. The distribution of values for each variable on its own appears in the table marginals. Each cell of the table shows the number of cases (or count) corresponding to each possible combination of the values for the two variables (e.g. single and owning outright, single and buying on a mortgage, single and renting, married and renting, and so on). Each cell also shows this count standardised in some way. In this case we have standardised by column, expressing each cell count as a percentage of all of the cases in that particular table column.

It is customary (but no more than a rule of thumb) to put the explanatory or independent (X) variable in the header of a table. Crosstabs do not display cases where a value is missing for either variable. If the explanatory (X) variable is in the header of the table and the response (Y) variable is in the stub, the column percentages in the table correspond to the standardised frequency distribution for the valid cases found in the fifth column of the SPSS frequency tables that we looked at earlier. If we compare these columns (comparing column percentages along each row) we can make a rapid visual inspection of the extent to which the distribution of the response variable varies according to the values taken by the explanatory variable. The proportions are calculated within the categories of the explanatory variable; the comparisons are made across them, within the categories of the response variable. There is a golden rule worth following: the proportions of the dependent variable sum to one within the categories of the independent variable.
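If you want to see the same logic outside SPSS, here is a minimal sketch using Python's pandas library; the respondents and values are invented, not taken from the GHS CQDA Practice dataset:

    import pandas as pd

    # Hypothetical mini-dataset: one row per respondent
    df = pd.DataFrame({
        "marstat": ["single", "married", "married", "single", "widowed", "married"],
        "tenure":  ["rents", "owns", "mortgage", "rents", "owns", "mortgage"],
    })

    # Counts: response (Y) variable in the stub (rows),
    # explanatory (X) variable in the header (columns), with marginals
    counts = pd.crosstab(df["tenure"], df["marstat"], margins=True)

    # Column percentages: the proportions in each column sum to one,
    # illustrating the 'golden rule' above
    col_pct = pd.crosstab(df["tenure"], df["marstat"], normalize="columns") * 100

    print(counts)
    print(col_pct.round(1))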


By simple visual inspection of this table we can see that some sort of relationship exists. The proportion of people who own their home outright among those who are, or have ever been, married is about twice that of single people. Conversely, more than one in three single people rent their accommodation: something that only one in ten currently married people do. In the next module we will look at how big such differences across the columns of a table have to be for us to speak of a relationship, but first it is essential to get some practice with visually inspecting crosstabs to look for relationships between variables. The self-test exercises at the end of the module, and in your tutorial, will help you do this. It is worth spending a significant amount of time producing different crosstabs from the GHS CQDA Practice and ScotMP datasets and examining them to see if they provide evidence of a relationship between two variables.

One technique that is essential to master is distinguishing row %'s from column %'s. Row %'s express the number of cases in a cell as a % of the total number of cases in that row of the table. Column %'s express the number of cases in a cell as a % of the total number of cases in that column of the table. In the above example, 14.5% is the column % for single people who own outright: the cell count expressed as a percentage of all single people (marital status being the variable displayed in the columns of the table). We could put this in words as '14.5% of all single people own their house outright.' The row % for this cell of the table would be 100 × (12/66) = 18.2%. This would express the percentage of all people who own their house outright who are single. These two percentages mean different things and it is important not to confuse them.
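The arithmetic behind the two percentages can be written out directly, using the counts from the table above (a quick sketch in Python):

    # Cell: single people who own their home outright
    cell = 12
    column_total = 83   # all single people
    row_total = 66      # all people who own outright

    column_pct = 100 * cell / column_total  # % of single people who own outright
    row_pct = 100 * cell / row_total        # % of outright owners who are single

    print(f"column % = {column_pct:.1f}, row % = {row_pct:.1f}")
    # -> column % = 14.5, row % = 18.2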

 

As well as learning how to interpret tables correctly, we also need to establish what the late Cathy Marsh called 'good table manners': that is, how to set out information clearly, concisely and comprehensively in a table. For others to be able to understand the summary information we present in a table there are some basic and essential rules we must follow...

Good table manners

Cathy Marsh set out seven useful rules in her book Exploring Data that determine what should be in a well-constructed table. Get into the habit of following them.

1. Is there a clear title? Does it concisely but comprehensively describe the contents of the table?

2. Is the source of the table (usually a dataset arising from a survey) given? This usually goes at the foot of the table and should allow a reader to locate the original information or dataset on which the table is based.

3. Does it show clearly the number of cases (N) on which the table is based? We need to know if our conclusions have been reached on the basis of a few cases or a large sample.

4. Are there clear variable and value labels? These should almost ALWAYS be different from those used in SPSS, since the latter have to be so short. You may have a clear idea of what the variable 'marstat' is, for example, but other readers will not. It is essential to describe the variables and the values they take in a way that is both brief and fully understandable to someone with no knowledge of the source data or dataset on which the table is based.

5. Is there enough detail to tell the story, but not so much detail that the story is obscured? Tables with more than a few rows or columns are very difficult to understand. Most tables, especially those produced by people starting out in quantitative analysis, present far too much, over-detailed information. While it is important not to combine classes or values that have an important story to tell in their own right, tables with more than five or six columns or rows confuse the reader and show too much data. Remember: your job is to summarise, to ruthlessly exclude detail, so that it is the main story that shows through.

6. Definitions of percentages: is it clear whether any percentages or proportions in the table are calculated along the rows or down the columns? The procedures used to standardise the data in the table must be made clear. NEVER include percentages or proportions without clearly indicating how they have been calculated. Usually this can be done simply by noting in the header or stub of the table what kind of proportion or percentage has been used, or by showing the total for a row or column as being equal to 1, for proportions, or 100, for percentages.

7. Is there information on any missing values or cases not included in the analysis? This should usually go at the bottom of the table along with the source information.

It is very important to get used to following these rules, and we will enforce them rigorously in course assessments. Observation of these rules is also often a good indicator of the quality of a table in an article or monograph. A poorly organised, ill-labelled table is often a sign of a poorly thought-through argument!

Reading

Marsh, C. (1988) 'Good table manners', in Exploring Data, Cambridge: Polity.

Alternatives to tables: Graphs and charts

While tables often contain the data we need for further analysis and calculations, diagrams and charts can be very good ways of visualising information, not only for a reader to grasp its meaning 'at a glance', but also for us as researchers in exploring our data and looking for patterns within it. Charts and diagrams should be chosen to be appropriate to the level of measurement of the data they are based on. It is also good practice to provide appropriate labels and to show the sources of data, in case the reader wishes to check or develop the analysis further. The STEPS site has a good section on data presentation. Chapters 1 and 2 of Marsh and Elliott also have good material.

Fielding and Gilbert (Chapter 4) give a good summary of different kinds of chart and how to produce them in SPSS. Below are brief descriptions of the three main kinds of chart: pie charts, bar charts and histograms.

Of these, the most important for our purposes in this course is the histogram. It is important to make sure you can produce and interpret histograms correctly, as we will use them to understand the properties of the normal curve.

Graphical representations of nominal and ordinal levels of data include:

Pie charts

Bar charts

Graphical representations of data at higher levels of measurement include:

Histograms


Glossary

Bar chart

A 2- or 3-dimensional diagram that represents the frequencies of a set of values for a variable in a columnar format, with gaps left between the columns to show that there is no continuous scale. It is used with nominal or ordinal variables.

Pie chart

A diagram that shows the frequencies of the categories of a variable as proportions of a circle. It is used with nominal or ordinal variables.

Histogram

A 2- or 3-dimensional diagram that represents the frequencies of a set of values for a variable in a columnar format, with no gaps between the columns, to indicate that the values lie on a continuous scale. It is used with ordinal, interval, or ratio variables only.

Stem and leaf diagram

This is a graphical representation of the raw values for a variable, with the stems being chosen to represent the first few significant figures and the leaves to represent the final significant figure(s).

Pie charts

Since we can always express the frequency distribution of the values for a variable as the proportions out of 1 that each value of the variable takes, we can represent this visually as proportions of the total area of a circle. Pie charts do this by representing the proportion of cases taking each value of a variable as a 'slice' of a pie. In a pie chart, categories are drawn in proportion to their count, as a proportion of 360° (the number of degrees in a circle). The bigger the slice, the bigger the proportion it represents, while the entire 'pie' represents all the cases (the proportions represented by the slices sum to one). Pie charts are useful when the number of different values that the variable takes is relatively small, and when there are not too many values that take a very small proportion of cases (since very small slices are difficult to see). The main value of pie charts is to give a quick impression of the proportion of cases taking any particular value compared to all the others. If we are more interested in comparing proportions with each other, bar charts are often preferable.

Create Pie charts in SPSS using the 'charts' button on the Frequencies dialog box, or the 'Pie' subcommand on the Graphs menu.

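The geometry behind a pie chart is straightforward: each slice's angle is the proportion of cases taking that value multiplied by 360°. A sketch in Python, using the tenure marginals from the table earlier in this module:

    # Slice angles for a pie chart: proportion of cases x 360 degrees
    counts = {"owns outright": 66, "buying on a mortgage": 151, "rents": 55}
    total = sum(counts.values())  # 272

    for value, n in counts.items():
        proportion = n / total
        angle = proportion * 360
        print(f"{value}: {proportion:.3f} of cases, {angle:.1f} degree slice")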

Bar charts

Bar charts plot the number of cases that take each value of a variable against that value, with the length of the bars corresponding to the number of cases. Bar charts are drawn with the frequencies along one axis and the categories on the other. The bars are always of the same width, with a small gap between the categories to show that there is no scale (i.e. that the categories are at the ordinal or nominal level of measurement). It does not matter whether the categories are on the horizontal or vertical axis, as long as the text does not get too squeezed; i.e. there is usually a limit to how much will fit legibly on the horizontal axis.

The scale on the frequencies axis of the chart can show either the number of cases or the percentage of cases. The second option is often more useful, as it means we can use the chart to make comparisons not only between values, but also between different charts.

‘Clustered’ bar charts can also be used to illustrate simple relationships between pairs of variables, by producing ‘clusters’ of bars for each of the values of one variable according to the values taken on another variable. This enables us to see quickly if the distribution of values for our variable of interest changes according to the values taken by another variable. If we use the GHS CQDA Practice dataset to produce a clustered bar chart of the variable marstat, using the variable ‘sex’ to produce the clusters, we can see at a glance whether or not men and women are equally likely to be single, married, divorced, etc.
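As an illustration of the idea (not a reproduction of the GHS figures), a clustered bar chart can be sketched in Python with pandas and matplotlib; the counts below are invented:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical counts of marital status by sex
    table = pd.DataFrame(
        {"men": [40, 70, 15], "women": [43, 82, 22]},
        index=["single", "married", "separated/divorced/widowed"],
    )

    # Convert counts to column percentages so the two sexes are comparable
    pct = table.div(table.sum(axis=0), axis=1) * 100

    pct.plot(kind="bar")  # one cluster of bars per marital status
    plt.ylabel("% within sex")
    plt.title("Marital status by sex (hypothetical data)")
    plt.tight_layout()
    plt.show()

Using percentages rather than raw counts on the frequency axis, as suggested above, keeps the clusters comparable even when the groups differ in size.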


 

Histograms

Bar charts and pie charts work well for data, usually at the nominal or ordinal level of measurement, with relatively few categories. However, for interval level data, and data which is continuous rather than discrete, we need a different approach. The most important form of chart for interval level data is the histogram. In a histogram we group the values of the variable into class intervals (or ranges of values) and then plot bars whose area corresponds to the proportion of cases in each of the class intervals. Each bar is usually labelled with the value at the midpoint of the class interval, or with the values that occur at the start and end points of the class interval. There are no gaps between the bars, because the class intervals group values that are on an interval scale and are represented directly by the scale of the X axis on which they are plotted. The area of each box in the chart corresponds to the relative proportion of the total sample contained in that category. Histograms are important for understanding several aspects of data analysis that we will encounter later on, so it is worthwhile taking some time to ensure you understand them thoroughly. See also Fielding and Gilbert (2006) pp. 84-87 and Marsh and Elliott pp. 13-18.

If you produce histograms in SPSS, it will automatically choose the class intervals for you, but these seldom correspond to class intervals that will show your data most clearly. Alternatively, you can specify to SPSS the kind of class intervals you require.
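As a sketch of specifying class intervals by hand (Python's matplotlib here rather than SPSS; the ages are randomly generated stand-ins for the dataset):

    import random
    import matplotlib.pyplot as plt

    random.seed(1)
    ages = [random.randint(16, 69) for _ in range(300)]  # hypothetical ages

    # Explicit class intervals: 16-24, 25-34, ..., 65-74
    bins = [16, 25, 35, 45, 55, 65, 75]

    plt.hist(ages, bins=bins, edgecolor="black")  # no gaps between the bars
    plt.xlabel("Age in years")
    plt.ylabel("Number of respondents")
    plt.title("Age of respondents (hypothetical data)")
    plt.show()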

Produce two histograms of the age of respondents in the GHS CQDA Practice dataset: one for those who left full-time education before 18, and one for those who did not.

Those who left full-time education later are mostly younger.


Explore

The 'Explore' command in SPSS gives you a useful summary of the statistics available for an interval level or continuous variable. As well as giving such values as the mean and median, and other statistics whose meaning we will learn about later in the course, 'Explore' produces two kinds of graphic: the stem and leaf diagram and the boxplot. Stem and leaf diagrams give you an idea of the spread of values in your data, and in particular whether there are particular values or ranges of values in which most cases are concentrated. Box plots give a very useful, intuitive visual summary of the shape of your data, and are good for making comparisons across different variables, or different subgroups of cases within a variable. In a box plot the line in the centre of the box shows the median for the variable, while the upper and lower boundaries of the box show the upper and lower quartiles. The whiskers extending from the top and bottom of the box extend to 1.5 times the value of the interquartile range (the difference between the upper and lower quartiles). Values above or below the whiskers are considered to be outliers. Outliers are values that lie so far away from the majority of values taken by a variable as to arouse suspicion or special interest. For example, if we found a case in our dataset where age was recorded as 118, we might want to investigate it further. If we saw that the case was recorded as being in full time education, we might then want to check whether 118 was a simple coding error for 18. Alternatively, outliers may not be the result of coding errors, but cases or groups of cases that are interesting by virtue of their unusual values. For example, if we found that most people had fairly short lengths of tenure in their home, but some cases had lived in their homes for a very long period, we might want to investigate what other characteristics these cases possessed that might be associated with such long tenure.
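The whisker rule described above can be computed directly. A sketch in Python, with invented values that include the suspicious age of 118:

    import statistics

    values = [18, 21, 23, 25, 26, 28, 30, 31, 33, 36, 118]  # hypothetical ages

    q1, median, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1                                       # interquartile range
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    print(outliers)  # -> [118]: worth checking for a coding error (e.g. 18?)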

Chapter 6 of Fielding and Gilbert (pp. 124-144) goes through stem and leaf and box plots in detail. Chapter 8 of Marsh and Elliott covers boxplots.

You can also learn more about Boxplots and stem and leaf plots on the STEPS website section on presenting data.

Module 2 conclusion

The mean and standard deviation are two summary measures that allow us to say a lot about interval level data.

Tables are an excellent way of presenting and summarising information for one or more ordinal or nominal variables. If we are looking at only one variable, we use a frequency table. If we are looking at the relationship between variables, we use contingency tables, which we can think of as a series of adjacent frequency tables, one for each value of our predictor variable. Constructing tables so that they are as clear as possible takes time and judgement that is well worth investing.

In tables we are usually looking for various kinds of patterns or relationships in our data. It is to these that we turn in the next module. First we will return to looking at interval level data, and then consider tables once more.

After working through this module you should have learned:

 

How a model comprises the 'fit' and the 'residual'.


How a model can be used to make simple summarising statements about a large volume of data.

How the mean and standard deviation are summary measures of the level and spread of values the cases take on an interval variable.

How to calculate the mean, median, mode, variance and standard deviation for a variable.

How frequency and contingency tables are useful ways of summarising data at the nominal or ordinal level of measurement.

The components of tables: header, stub, cells and marginals.

The strict rules for the presentation of data in tables: 'good table manners'.

How to deal with cases with 'missing' values for a variable in a dataset.

How to standardise data for purposes of comparison between different groups of cases, using proportions or percentages in tables.

How to decide what level of accuracy is needed in the presentation of data.

That the distribution of values a variable takes for a set of cases can also be thought of as the distribution of probabilities that any one case takes a particular value.

What we mean by explanatory, independent, predictor, dependent and response variables.

That the basis of looking for relationships in quantitative data takes the form of investigating whether the frequency distribution of cases on one variable is affected by their distribution on another variable.

How to recode interval level data, or ordinal or nominal level data with a large range of different values, into a smaller number of values or classes, ensuring that the set of classes or values for any variable is both comprehensive and mutually exclusive.

How to save a new version of an SPSS dataset that you have altered in some way (e.g. by creating new variables).

Now complete the end of module activities:

Self-test activity

Tutorial activity - In your tutorial you will go through these questions and exercises.

Further Reading: Fielding and Gilbert Ch. 3, 4 and 5.

Reading from the Course Library: Marsh 1988 'Numerical Summaries of Level and Spread'; Colman and Pulford (2006).

