Some stats and SPSS pointers
8/3/2019
http://privatewww.essex.ac.uk/~scholp/statsquibs
These relate to some specific technical matters I am sometimes asked about but which are not
covered in detail in my courses or maybe passed you by unnoticed. Note, I am not trying to
present all my course material here (you have to take the courses for that), just deal with some
frequently asked questions and things people frequently get confused over/get wrong. Also,
these are not all readily understandable unless you took stats courses already!

How do I round figures down to make them shorter, e.g. 3.852? And how many decimal places should I report?
How do I generate random numbers to help when sampling from a list, or when dividing
subjects randomly into groups? Use the facility at http://www.randomizer.org/form.htm
I have the proficiency scores (or the like) for 30 subjects, and want to divide the cases into
groups based on this. Or I need categories of word stimuli of three different frequencies.
How do I do it?
Can I get phonetic symbols like [] shown on the scales of SPSS graphs?
How do I combine columns of figures I have entered in SPSS, when I want averages for
each person of the figures in the columns (e.g. the scores for separate items in a test)?

What is item analysis? And what does it mean if the F in an ANOVA result is labelled F1 or F2, where there has been an analysis by items as well as by subjects?
How do I eliminate extreme response times in psycholinguistic data? or response times
where the response was wrong?
What does the standard deviation really mean?
When I do a histogram of some scores (interval scale data) I am supposed to look at the
distribution shape the pattern of the heaps on the graph but how do I interpret the
shape I see?
How should one treat rating scale responses? As ordered categories or interval scores?
If my data is not normally distributed, so not suited to t tests and ANOVA, what can I do? What are the transformations I can use?
What really are Likert and Guttman scales, and how should they be constructed? They
both are ways of measuring things via a set of agree-disagree items. Often we use sets of
items of this type that other researchers made but I wonder if anyone actually selected
and rated the items in the approved way in the first place?
What does it mean when SPSS gives you a figure with an E on the end? e.g. 7.012E-02
What are degrees of freedom (df) and how do I report them, if needed?
What are residuals and what do they tell me?
If in a pilot trial of a few subjects I don't get the significant result I want, how can I
estimate how many subjects I would need to probably get a sig result?
How do I do follow-up post hoc paired comparisons and planned comparison tests for any kind of main effect or interaction in ANOVA where more than two groups or conditions were initially compared? SPSS doesn't do all the possibilities, or hides some away.
How do I do post hoc paired comparisons after a Kruskal-Wallis test?
What is Bonferroni adjustment and how can I do it?
What is eta squared and how does SPSS calculate it?
Esp. for ACQUISITION people and SOCIOLINGUISTS. Twenty people in two groups are each measured for the number of times they use the third person s out of all the occasions when they had an opportunity to in compositions, recorded speech etc. (often called potential occurrences or loci). How do you summarise % scores like this? Group % scores for frequency of use of things, or individual % scores?
Esp. for PSYCHOLINGUISTS and people doing repeated measures EXPERIMENTS.
What on earth is a Latin square and how do I use it or some other method of organising
conditions, different types of stimuli etc. in an experimental design?

What are those tests of prerequisites for ANOVA/GLM such as those of Levene, Mauchly etc. all in aid of?
If I have a lot of missing scores, can I fill them in somehow?
Can I check on whether people are responding by random guessing or with bias, and adjust
scores to take account of that?
My subjects all gave several responses to a set of different stimuli, and I have entered the
data in SPSS with each response as a row. So there are several rows for each subject. How
do I turn that into the more usable SPSS layout with one row per subject?
Subjects have been categorised in a parallel way in several different columns. E.g. they
answered a set of questions each of which had the possible response: me, my teacher, my
classmates (i.e. although coded for SPSS as 1, 2, 3, the responses cannot be considered as
degrees of anything on an interval number scale). How do I get SPSS to add up for each
person across the items totals of how many times each category was chosen?
If you are into word association tests, there are a few descriptive stats that one can use
there that one does not find used anywhere else much: The Group overlap coefficient,
Within groups overlap coefficient, and Index of commonality.
Degrees of freedom
Sometimes journals expect you to report these df figures along with other statistics. They are the figures you see quoted in brackets, often subscript, after t, F, Chi squared etc. E.g. instead of t = 2.34 one sees perhaps t(28) = 2.34.

They can usually easily be got from SPSS output where they are not obvious. Look for df. Broadly
they reflect the number of categories in any category variables in the design, and the number of cases
in each group. The exception is designs where only category variables are involved (e.g. where you
would use chi squared): in that instance the df just reflects the number of categories.
Since you will have told the reader the numbers of categories and cases involved anyway, I don't personally see the point of mentioning df. But in case you need to, they mainly turn out to be one less than the numbers you started with, though it can get more complicated.
The df numbers are written subscript, or in brackets, after the statistic t, F or whatever (not the p).
So in a t test comparing two groups, 108 subjects altogether, the df will be 1,106. One might write t1,106 = ..... The first figure is one less than the number of EV categories (2-1=1). The second is the number of cases less one for each group involved (N-2=108-2=106).
In an ANOVA comparing four groups with 108 subjects altogether, df would be 3,104.
In a t test comparing the same group in two conditions, the df for 108 cases will be 1,107.
The df can be more tricky for more complicated designs and interactions. In the output of ANOVA
you will generally see the first df figure you need in line with the main effect or interaction of interest,
and the second one listed as within groups orerrorbelow it.
In a chi squared test with three categories on each scale, the df is 4 because (3-1) x (3-1) = 4. In a chi
squared test with two categories on each scale, the df is 1 because (2-1) x (2-1) = 1.
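If you want to check a df calculation like this outside SPSS, here is a minimal sketch (assuming scipy is installed; the 3x3 table of observed frequencies is invented for illustration):

```python
# Degrees of freedom for a chi-squared test of a contingency table:
# df = (rows - 1) x (columns - 1). scipy reports this directly.
from scipy.stats import chi2_contingency

# A hypothetical 3x3 table of observed frequencies (three categories
# on each scale, as in the example above).
observed = [[10, 20, 30],
            [15, 25, 35],
            [20, 30, 40]]

chi2, p, dof, expected = chi2_contingency(observed)
print(dof)  # (3-1) x (3-1) = 4
```

The same call also gives you the E (expected) frequencies, which are useful when looking at residuals later.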
Why are these figures called 'degrees of freedom', and why are they important? It is basically because
what is important in statistics is not so much the numbers of anything but the numbers of choices or
separate pieces of information involved. Typically there is always one less choice than people etc. If
I have ten assignments to hand back to my class of ten students, I have to make a choice who to give
each one to for the first nine, but for the tenth one there is no choice, as there is only one assignment
left and one person left to give it to. I have no 'freedom' left on the last one.
Here's the statistician's analog of that. 100 people answer a yes-no question and 38 say 'yes' and 62 say 'no'. We want to know if that differs significantly from 50:50. I.e. are they showing a real
preference? There are two categories (yes and no), so we use the binomial test. It might seem that we
have two figures to handle in the test and two comparisons to make. We have to check if the observed
figure of 38 differs from the E of 50 and if O of 62 differs from E of 50. But in fact, of course, the test
need only do one of those. The data has only one degree of freedom. Once the test establishes if 38
differs significantly from 50 for one category, the answer for the other category, whether 62 does so as
well, is fixed. Hence if one calculates statistics by hand one always finds that in the formulae one has
to use the df figures rather than full numbers of cases or categories.
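The 38/62 example can be run directly. A sketch assuming scipy's binomtest is available (the figures are those from the example above):

```python
# The 38 'yes' / 62 'no' example: does the split differ from 50:50?
# Only one comparison is needed (one degree of freedom): once we test
# whether 38 differs from the expected 50, the answer for 62 is fixed.
from scipy.stats import binomtest

result = binomtest(k=38, n=100, p=0.5)
print(result.pvalue)  # below .05, so a real preference
```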
Residuals
These are simply the differences between observed figures (O) and some kind of predicted/expected
figure (E). But they mean different things in different analyses.
Category data: for significant differences/relationships we want them big, because the E figures
represent what is expected under the null hypothesis of NO difference/relationship. In analyses where
just frequencies in categories are involved (e.g. analysed using chi squared or the binomial test), the
residuals are the differences between O and E frequencies. The bigger they are, the more likely that
there is a significant difference involved. In the Labov analysis in class we looked at the table of O
and E values to see where the biggest O-E differences were (for which r use in which store). In fact
chi squared itself is calculated by essentially adding up the residuals for each cell in the table (with a
bit more maths to it). In the binomial test where, say, 20 people are divided 4 saying 'yes' and 16
saying 'no' to a question, we want to know if that differs significantly from a 50-50 split, which would be 10 'yes' and 10 'no' in this instance. So we are concerned with the size of the residual... in this instance 6. The bigger the better, if we want to show a clear preference.
Interval data: for significant relationships we want them small, because the E figures represent what
is expected under the hypothesis of a perfect linear relationship. This is the other place where you
often find residuals being talked about - in data where all the variables are (treated as) interval
(analysed using Pearson r, or regression). Here they are the differences between the observed scores
and the scores predicted by the best fitting line on a scatterplot, showing the EV-DV relationship.
Here obviously the smaller the residuals, the more likely the relationship is significant. Obviously one
can find a best fitting line to any data where cases are scored on two or more interval variables.... but
if most of the observations fall miles away from the line, that does not show a real relationship. Pearson r and regression statistics in effect reflect whether the residuals are generally large or small;
examining scatterplots, when we look at cases (subjects) that are way off line, we are looking at cases
with exceptionally large residuals.
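As an illustration of interval-data residuals, a minimal numpy sketch with invented x and y scores:

```python
# Residuals for interval data: observed scores minus the scores
# predicted by the best-fitting straight line (least squares).
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)   # EV scores (made up)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])     # DV scores (made up)

slope, intercept = np.polyfit(x, y, 1)       # best-fitting line
predicted = slope * x + intercept
residuals = y - predicted                    # O - E for each case

# With least squares the residuals always sum to (about) zero;
# what matters for the strength of the relationship is their SIZE.
print(residuals)
```

A case far off the line on a scatterplot is exactly a case with an exceptionally large residual here.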
Eta squared
This is the measure of relationship that you can get in ANOVAs and the like. A bit like a correlation
coefficient it tells you on a scale 0-1 how much EV-DV relationship there is. Really it is more analogous to r² and can be thought of as a % on a scale 0-100. It is a useful addition to just being told
if a relationship or difference is significant. Many significant differences/relationships in fact are quite
small in terms of the SIZE of the difference/relationship.
SPSS does not calculate eta quite how the books suggest, or even how SPSS help itself seems to
suggest.
In fact every eta sq is calculated so that it is a proportion out of a different total and some of the
variance that goes into the calculation of one of them may also go into the calculation of another, so
none of them can be added sensibly to each other.
So every effect (main or interaction) is out of its own 100%, representing the maximum variance that
it could account for, but not all the variation in DV scores. This applies even where the effects are of the same type and a sensible calculation could be made of the % of variance of the same type
accounted for (e.g. two between subjects main effects - in principle one could calculate what % of the
WS variance they account for together). In fact this is not done.
So the SPSS etas can be compared with each other (This one is accounting for more of the total it
could account for than that one is...) but not really added. Or if you like, the total % if there are three
factors with three main effects, 3 two-way interactions, and one three way, is not 700% but less than
that... but hard to calculate exactly what. (In fact you can see how SPSS calculates the etas: in the sum
of squares column it is simply the sum of squares for the effect of interest divided by the SS of that
effect plus the relevant error SS for that effect. Clearly then it is not calculating the proportion of all
the SS in the entire analysis accounted for by that effect, just the proportion of the SS relevant to that
effect. And also the error SS get re-used in different calculations.)
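The calculation just described can be sketched in a couple of lines (the sums of squares here are invented, not from any real SPSS output):

```python
# Partial eta squared as described above: the effect's sum of squares
# divided by (that SS + the error SS relevant to that effect) --
# NOT the proportion of all the SS in the whole analysis.
def partial_eta_squared(ss_effect, ss_error):
    return ss_effect / (ss_effect + ss_error)

# Made-up sums of squares for illustration:
print(partial_eta_squared(30.0, 70.0))  # 0.3, i.e. 30% of the variance it COULD account for
```

Because each effect has its own denominator, and error SS are re-used, these values can be compared but not sensibly added across effects.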
Post hoc tests of paired comparisons after ANOVA
Wherever a main effect or interaction involves a comparison of more than two means, post hoc tests
can be relevant, as the basic significance value given by the ANOVA does not say which pair or pairs
is/are sig different. If the main or interaction effect from ANOVA comes out significant that just
means that there is a sig difference SOMEWHERE among the means but not between every pair
necessarily. Especially this arises where one or more of the EVs has three or more levels (i.e. groups
or conditions), though it can also arise, say, where you have two two-value EVs and the interaction is
significant. You need a post hoc test to identify where the differences are exactly or just judge it by
eye from a graph or table of means. This situation arises in various ways in ANOVAs, some of which
SPSS deals with straightforwardly, others not.
One might think the solution is just to do loads of familiar t tests comparing the means in pairs as
required, to see which pairs are sig different. Indeed one sees this done in some published work, and
in moderation probably you can get away with this. However, statisticians don't like that. The
statistical issue underlying all this is that, when you do paired comparisons like this, the same means
are getting reused several times in different comparisons. If you have three groups and compare them
in pairs then the mean for group 1 gets used in the comparison both with group 2 and group 3. Now
the more times a mean gets compared with others in repeated statistical tests, the more chances it has
to come out as significantly different just by chance, not reflecting a real population difference.
Remember that if a difference between two means is significant (at the .05 level) that actually MEANS that one would not get a result this different more than 5% of the time or one in twenty
times by chance, due to the vagaries of random sampling, in similar sized samples from a
population where there really was no difference. But another way of looking at that is to say that if
you use the same data in twenty comparisons, then one of the results might be that one-in-twenty
result that looks significant but is actually from a population where there is no difference. The more
tests you do, the more chance of getting a result that looks sig but is not really.
Some adjustment has to be made to compensate for this. Like other activities in life involving pairs,
your tests for multiple paired comparisons should not be unprotected! Post hoc tests and the like
cope with this better than t tests. It is not appropriate to do multiple t tests, at least not without a
Bonferroni adjustment of the sig level (though that is a solution that is seen as rather
overcompensating for the problem). Better is to use a post hoc test designed for such comparisons
(e.g. Tukey, Scheffe, etc.). However, as the SPSS dialog box for post hoc shows, there is a myriad
of options: nobody is certain which is the best, and none are perfect. As a consequence sometimes you
can get an anomalous result that the ANOVA says there is a sig difference somewhere, but the paired
post hoc test does not find any pair significantly different.
The term post hoc is used for where you just want to consider all pairs of means that are possible to
compare, following an overall analysis including all the means, which is the appropriate starting point.
SPSS however limits this term to comparisons between cases in different groups, though statisticians
use the term generally for follow up comparisons of pairs of repeated measures conditions as well.
The term planned comparison (=contrasts in SPSS) is used where you planned specific paired
comparisons, not all the possible ones, such as the comparison of three groups of learners with an NS
group, but not with each other.
The general rule is that for k means there are k(k-1)/2 paired comparisons possible. E.g. if four
groups then 4 x 3 / 2 comparisons, i.e. 6. However, SPSS output usually gives you the pairs twice
over so it looks even more.
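The k(k-1)/2 rule can be checked with a couple of lines (group names invented):

```python
# For k means there are k*(k-1)/2 distinct paired comparisons.
from itertools import combinations

groups = ["G1", "G2", "G3", "G4"]        # k = 4 groups
pairs = list(combinations(groups, 2))    # each pair once, not twice over
print(len(pairs))  # 4 x 3 / 2 = 6
```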
1. An EV with three or more independent groups being compared
E.g. the % correct scores for third singular s of three groups of learners are compared. The basic
ANOVA result says whether there is a significant relationship between the EV and the DV a
difference somewhere among the groups but not exactly where. If the overall result is sig, then to
see which pairs of groups are sig different you need to do post hoc tests. Whether you do the ANOVA via Compare Means... Oneway ANOVA or via General Linear Model... Univariate, you get many many ways of doing the post hoc test offered under the Post Hoc option. Tukey HSD is a
common safe bet.
Basic post hoc tests compare every pair of means. But suppose your groups were two of learners and
one of native speakers and you plan to compare the two learner groups with the NS group (which may
be thought of as a control group) but not with each other. These are often called planned
comparisons and you would do better not to use the post hoc tests which compare every pair, and so
are weaker (less likely to identify sig differences). You get this sort of limited comparison in
Analyze... General Linear Model... Univariate... enter your DV as usual and the three languages variable as a fixed factor. This does a oneway ANOVA exactly like you get with Compare Means... Oneway... except that it gives you some extra options. If you click Contrasts and click the contrast option to get Simple and then click first or last depending on whether the control group is numbered 1 or 3... then (don't forget) click Change... then Continue then OK... you get an output that just does those limited paired comparisons.
2. An EV with three or more repeated measures conditions being compared
E.g. you compare the same people's fluency speaking to the teacher, to peers and to parents. You want to compare each pair of those conditions afterwards. In General Linear Model... Repeated Measures you have to use not what is labelled Post Hoc but rather Options: click the variables into Display Means, tick Compare main effects, and below that choose Bonferroni. This in effect uses t tests with a simple Bonferroni adjustment for multiple comparisons to compare the pairs of
means. Not ideal because overcautious: i.e. likely to lead to you missing a difference that is actually
sig. SPSS should really make Tukey etc. available in repeated measures as well as independent groups
comparisons. Alternatively you can do your own Tukey test as described below.
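A DIY version of this Bonferroni pairwise procedure, outside SPSS: the fluency scores below are invented, and scipy's ttest_rel stands in for the paired comparisons SPSS would run.

```python
# Bonferroni-adjusted pairwise comparisons for three repeated measures
# conditions, using paired t tests (made-up fluency scores, 6 subjects).
from itertools import combinations
from scipy.stats import ttest_rel

scores = {  # one list per condition; same 6 subjects in the same order
    "teacher": [3.1, 2.8, 3.5, 2.9, 3.3, 3.0],
    "peers":   [4.0, 3.9, 4.4, 3.8, 4.1, 4.2],
    "parents": [3.6, 3.2, 3.9, 3.4, 3.7, 3.5],
}

pairs = list(combinations(scores, 2))   # 3 conditions -> 3 pairs
alpha = 0.05 / len(pairs)               # Bonferroni-adjusted sig level
for a, b in pairs:
    t, p = ttest_rel(scores[a], scores[b])
    print(a, "vs", b, "p =", round(p, 4), "sig" if p < alpha else "ns")
```

As the text says, this is overcautious compared with Tukey, but it is easy to do by hand.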
Once again you can alternatively choose limited planned comparisons via the Contrasts option as
above.
3. Interaction in a two way ANOVA with both EVs as groupings
Where there are two EVs that are groupings, the interaction always involves at least 4 subgroups.
Even if both variables are just two groups, like male-female and upper class-middle class, the interaction has four groups involved and, if the interaction is sig, you might want to know which pairs
of those are producing that result, beyond just guessing from a suitable graph.
SPSS does not deal with post hoc for interactions, but in some instances you can do it yourself fairly
simply with a calculator. For instance you can do a Tukey test to test for pairwise differences when you
get a sig interaction in a two way ANOVA with two independent gps factors, where all groups have
the same number of subjects in.
Calculate T = q x √(error mean square / number of people in each group)

Error mean square or error variance is in the original ANOVA table in the output. q is found from the table of the Tukey statistic (ask me for it or see a serious stats textbook which has it in the back. I can't include it here for copyright reasons). Read off the column for the number of
means being compared pairwise, and the row for the df of the error variance/mean square (from
ANOVA table).
Then calculate T and any pair of means differing by more than T is sig different.
If the groups are different sizes, or you wish to save effort, do t tests with Bonferroni adjustment.
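If you have scipy to hand, its studentized_range distribution can replace the printed q table. A sketch with invented ANOVA figures (this assumes a reasonably recent scipy version):

```python
# A by-hand Tukey HSD for equal group sizes: any pair of means differing
# by more than T = q * sqrt(MS_error / n) is significantly different.
# scipy's studentized_range distribution supplies q, so no printed
# table is needed.
from math import sqrt
from scipy.stats import studentized_range

k = 3            # number of means being compared pairwise
df_error = 27    # df of the error mean square, from the ANOVA table
ms_error = 4.0   # error mean square, from the ANOVA table (made up)
n = 10           # subjects per group (equal sizes required)

q = studentized_range.ppf(0.95, k, df_error)   # critical q at alpha = .05
T = q * sqrt(ms_error / n)
print(round(T, 3))   # the minimum mean difference that counts as sig
```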
4. Interaction in a two way ANOVA with both EVs as repeated measures
As for 3. OR Treat it as a oneway repeated measures situation. Enter all the repeated measures
columns as if there were just one factor not two, and follow 2 above. That in effect does the post hoc
for the interaction.
5. Mixed independent groups and repeated measures ANOVAs
As usual, if the result in ANOVA is significant, and more than two means are being compared, one
needs follow-up tests to see which pairs of means are significantly different (or be happy just to judge
it visually from a graph). Each main effect involving 3 or more levels can be dealt with as above, but
the interactions are more of a problem.
Take five repeated measures conditions and two groups.
One can get the main effect multiple comparisons done by SPSS with suitable adjustments as
described in (2) above (i.e. comparing results on the five conditions with each other in pairs, for the
whole sample of subjects lumping both gps together). In fact if one wants all of them there are 10
comparisons.... because there are five conditions, so (5 x 4) / 2 paired comparisons.
In the interaction, since there are 10 means involved for all 5 conditions and two groups, there are (10
x 9) / 2 comparisons potentially, which makes 45.
One can do some of the interaction paired comparisons, by splitting the file and getting SPSS to use
the Bonferroni option again. Those are the comparisons of each condition with each other condition within each group separately. 10 comparisons in each group = 20 in all.
That leaves 25 comparisons that you could not do with any post hoc procedure in SPSS as far as I
know... the comparisons between each of the 5 means for one gp and the five for the other. Ordinary t
tests do not have any required reduction for multiple comparisons like post hoc tests do. However a
simple adjustment by hand is to use the t test but require stricter sig levels. In fact this is really making
the Bonferroni adjustment oneself.
The account immediately above assumed that there was no a priori reason to be interested in any of
those 25 pairs more than any other... It was a DIY post hoc solution.
However, it could be that, for theoretical reasons or whatever, you were not interested in comparing every pair of means, only certain ones. In particular:
- the comparisons of all 5 conditions within each group, done OK with split file and Bonferroni
adjustment..... 20 comparisons
- the comparison of each group with the other on each condition separately. That is in fact only 5
comparisons out of the 25 possible other ones. (I.e. you have no interest in comparisons like that between the lower group on condition A and the higher group on condition C, and the like.) You want to claim, in this instance, that these were what are called 'planned
comparisons' not the usual post hoc 'try everything' type. Then you could reduce the required sig value
of the t test for this part by dividing by 5 not 25 in the Bonferroni adjustment....
In general, then, where there is no post hoc test available in SPSS, the simple but crude solution is to
use ordinary pair comparison statistical tests, but divide the target sig level by the number of potential
comparisons you COULD make, or PLANNED to make, to compensate for making multiple
comparisons. However, this is cruder than using post hoc tests, which take care of this better. You are
more likely to miss sig differences (a so-called Type II error).
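The arithmetic of the adjustment described above amounts to this (numbers from the five-conditions, two-groups example):

```python
# The crude Bonferroni protection: divide the target sig level by the
# number of comparisons you COULD make, or PLANNED to make.
alpha = 0.05

all_pairs = 25   # every between-group comparison across the 5 conditions
planned = 5      # only group-vs-group on each condition separately

print(alpha / all_pairs)  # required p if treating it as fully post hoc
print(alpha / planned)    # required p for the 5 planned comparisons
```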
You don't get a sig result and you want to know how big a sample you would need to get one
If you have gathered data, especially in a pilot study, and not got a significant result, you may want to know
how big a sample you would need to make the result significant. Remember, if you choose a big enough
sample, even a very small difference or relationship may be significant. So if you have the possibility available to increase the size of the sample (i.e. there are more subjects or cases available), and are
desperate to get a significant result, it would be useful to know how many subjects would be ideal.
Some books give formulae to calculate how big a sample you need, but they don't necessarily
straightforwardly fit the situations you have. The following is my best suggestion for an easy way to get an
estimate of required sample size using SPSS facilities.
Basically you create imaginary larger samples simply by using your subjects more than once. Suppose you
have 20 subjects and p=.231 for whatever test you are interested in. You get SPSS to think that you have
three times as many subjects, simply by getting each subject counted three times, and run the test again. Say
then p=.09. Then you get SPSS to think you have four times as many subjects, including each of your
twenty four times, and see again what happens. By trial and error you get to the point where p=.05, and that
gives an estimate of the minimum number of subjects you need to get a significant result.
To get SPSS to count a subject more than once you weight the data, similar to how you are familiar with
doing elsewhere. At Transform... Compute you nominate a new target variable which you might call incr (since it will tell SPSS how many times to increase your sample size). You then enter in the numerical expression space whatever you want the weighting to be. You could start with a weighting like 2. Click OK and you will find a new column called incr with 2 repeated all the way down. If you now go to Data... Weight Cases and weight the data by that column, then SPSS sees your data as having twice as many cases, counting each one twice.
Now do your analysis again and see if it is significant. Go on altering the weighting figure in the incr column via Transform... Compute repeatedly and redoing the analysis until you get a sig difference or relationship. Note that you can enter partial weightings like 3.5 as well.
When by trial and error you achieve a weighting that gives a significant result, multiply it by your original
sample size to see how many subjects you would need. E.g. if your sample from two groups was 20 in all but you only got a sig difference with a weighting of 3.8, then you need at least 20 x 3.8 subjects (= 76), in similar proportions in the two groups as before, to have a chance of getting a sig difference.
Cautions. You have to make sure the new bigger sample IS from the same population as the old one. In the
case of comparisons of groups of course several populations may be involved. Even then, any method of
estimating the required sample size is only approximate, because even truly random samples can vary a lot. Also, with an increase in sample size the actual difference or relationship you are interested in may not actually get any bigger. It is just more likely to be significant. I.e. you may end up showing that there is
indeed a non-zero difference or relationship in the population (which is what significant means), but not
that it is a very large one.
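The same trial-and-error weighting idea can be mimicked outside SPSS by tiling the data. The scores below are invented, and the usual cautions about this kind of estimate apply:

```python
# The weighting trick without SPSS: pretend each subject was observed
# w times by tiling the data, and watch p shrink as w grows.
# (Made-up scores for two groups of 10 subjects.)
import numpy as np
from scipy.stats import ttest_ind

g1 = np.array([12, 14, 11, 15, 13, 12, 16, 14, 13, 15], dtype=float)
g2 = np.array([13, 15, 14, 16, 15, 13, 17, 15, 14, 16], dtype=float)

for w in (1, 2, 3, 4):
    t, p = ttest_ind(np.tile(g1, w), np.tile(g2, w))
    print("weighting", w, "-> p =", round(p, 4))

# When p first drops below .05, that weighting times the original N
# (here 20 subjects) estimates the sample size needed.
```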
Group % scores
Twenty people in two groups are each measured for the number of times they use the third person s
out of all the occasions or loci when they had an opportunity to (often called potential occurrences). Very many linguistic features are measured this way in acquisition and sociolinguistic research. In the former it is often a matter of how often the correct form (in NS terms) is used, as against some erroneous form or omission, on occasions where there was an opportunity to
use it; in the latter it is often a matter of how often one variant out of two or more that make up a
sociolinguistic variable is used.
In all these situations there are two ways of summarising and graphing the data: 1) the group way
and 2) the individual way.
Either 1) you add up all the potential occurrences for each group, and all the occurrences of the form
of interest, and express the second as a percent of the first for each group.
Or 2) you calculate a % score for each person using their individual frequency of the form of interest
and their individual number of potential occurrences. Then for each group you can calculate the
average (mean) % score for that group from the individual scores of its members. However, you have to be aware that this can be a bit misleading for cases whose number of potential occurrences is small:
getting one out of one right is 100% as much as getting 20 right out of 20 possible occasions! It is
common to require at least 5 potential occurrences, otherwise treat a case as missing data.
It is easy to show that the group figures may not come out the same! Here we imagine figures for a
group of two people and see what happens:
Method 1
            Frequency of         Number of potential    % occurrence of
            form of interest     occurrences            form of interest
Person 1    4                    16                     25%
Person 2    8                    10                     80%
Total       12                   26
Group %     (12/26) x 100 = 46.2%

Method 2
            Frequency of         Number of potential    % occurrence of
            form of interest     occurrences            form of interest
Person 1    4                    16                     25%
Person 2    8                    10                     80%
Mean % for group    (25+80)/2 = 52.5%
In fact the two methods will come out the same only when all subjects had the same number of potential
occurrences (e.g. in a test or list reading task).
Many SLA and sociolinguistic studies use method 1. That is fine, if you wish, for the purposes of giving
descriptive statistics and making graphs, provided you make it clear what you are doing, and are aware of
the difference from the other method.
BUT for any inferential statistics you should use method 2, entering the data in SPSS in the form of
one row per person, with a % score for each person. Then, to compare two groups, for example, you use
the independent groups t test on the two sets of scores.
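The two methods, computed from the two-person example above, show the discrepancy directly:

```python
# Two ways of summarising % scores; they differ whenever people have
# different numbers of potential occurrences (loci).
freq = [4, 8]         # occurrences of the form of interest per person
potential = [16, 10]  # potential occurrences (loci) per person

# Method 1: pool everything, then take one group percentage.
method1 = 100 * sum(freq) / sum(potential)                  # (12/26) x 100

# Method 2: a % score per person, then the mean of those scores.
individual = [100 * f / p for f, p in zip(freq, potential)] # 25%, 80%
method2 = sum(individual) / len(individual)                 # (25+80)/2

print(round(method1, 1), round(method2, 1))  # 46.2 52.5
```

It is the method 2 per-person scores, one row per person, that feed into the t test.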
If you were to attempt inferential statistics on the total figures of method 1, you would have to use the
numbers of individual occurrences regardless of people. I.e. if the example above were for one group,
you would represent that group with the proportions 12 and 14 (i.e. 12 occurrences of the form of
interest, versus 14 non-occurrences, making up the total of 26 potential occurrences) and compare those
with the overall proportions for the other group being compared. The test for that is chi squared, and
you do see this used even in some published work for data like this. However, there are at least two major
problems with this which would lead statisticians mostly to regard it as a misuse of chi squared.
- Like for all significance tests, the basic observations (cases) which enter into the test have to
be independent of each other. Now in method 2 the cases are the people, and there is no
problem in seeing scores from different people as being independent of each other. However,
in method 1 the 26 occurrences in the example are the cases, and clearly while some of those
are independent of each other (being from different people), some are likely not (being from
the same person).
- There is also an expectation that populations sampled are homogeneous. From what we have
just said, that is clearly not the case in method 1: the 26 observations representing one group
in the example are a mixture. It cannot be said that each observation is from one population:
it is from a mixture of a population of people and the populations of occurrences of each
separate person.
The only instances where chi squared and method 1 might be defensible would be where the numbers of
potential occurrences are very small, amounting to little more than one or two per person included. OR
where all the potential and actual occurrences come from just one person per group, though that still
does not deal with the independence problem. OR where you feel able to argue that responses from the
same person are as independent as if they were from different people. There is a tradition of
phoneticians making this tacit assumption for things like VOT, on the belief that such things are beyond
the person's ability to control.
Rounding interval scores
Just checking.... do we know how to round figures on interval scales? The mean of a set of scores may
come out as 6.3597, but often we want to express this in shorter form, such as 6.36 or 6.4. Quoting long
strings of numbers after the decimal point can look as if you are just trying to impress with loads of
numbers. Or it may be you are trying to make up for sloppy METHOD by being super-detailed in the
figures quoted in RESULTS.... Best not to do that, since one's measurement is unlikely to be so accurate
that more than two decimal places are relevant (except perhaps where a computer has measured
something for you, like response time...). Generally use three or two decimal places for sig/p values, and
two or one for everything else. Keep it intelligible and round numbers where necessary. But where do you
round up, and where down?
Task
Just round the following figures to two decimal places:
3.852 0.679 18.505 1.006 7.597 20.955 0.602
SPSS often rounds figures on screen (e.g. in the data grid) even though it is holding longer versions in its
memory. You can select for each column how many decimal places it shows on the Data View window.
Answer to the above: 3.85 0.68 18.51 1.01 7.60 20.96 0.60
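If you let software do the rounding, beware that binary floats can behave unexpectedly: Python's built-in round(), for instance, uses round-half-to-even on the nearest representable float and may turn 18.505 into 18.5. A small sketch using the decimal module reproduces the conventional "half rounds up" answers from the task above:

```python
# Conventional "round half up" to two decimal places, via the decimal
# module (plain round(18.505, 2) on a binary float need not give 18.51).
from decimal import Decimal, ROUND_HALF_UP

def round2(s):
    # Parse from a string so the decimal value is exact, then quantize.
    return Decimal(s).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

figures = ["3.852", "0.679", "18.505", "1.006", "7.597", "20.955", "0.602"]
print([str(round2(f)) for f in figures])
# ['3.85', '0.68', '18.51', '1.01', '7.60', '20.96', '0.60']
```

Note that the trailing zeros in 7.60 and 0.60 are kept, which is what you want when reporting to a fixed number of decimal places.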
Decoding interval scores expressed in E notation in SPSS output
Sometimes SPSS produces numbers like 7.012E-02. This is not 7.012.... It is 0.07012.
The E with a minus sign signals the number of places the decimal point has to be moved to the left.
So 1.369E-03 = 0.001369. Etc. The E is a shorthand so as not to write a load of noughts.
Always convert any such figures into the familiar form if you report them in your work.
Correspondingly 7.012E+02 would indicate 701.2.
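If you ever need to convert such figures programmatically, most languages parse E notation directly, e.g. in Python:

```python
# E notation is just a scale factor: the figure after the E says how many
# places to shift the decimal point (minus = left, plus = right).
print(float("7.012E-02"))   # 0.07012
print(float("1.369E-03"))   # 0.001369
print(float("7.012E+02"))   # 701.2
```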
Combining columns of scores for separate items in a test etc. to give a total or average score
Where a test or other instrument produces scores for separate items which then need to be added up to
give a total score for a variable, one could of course add them up off computer and just enter the totals.
However, to check on internal reliability, or to do an analysis by items in addition, or to filter response
times and exclude some, you will need the scores for every item in a separate column, so will have to
enter the data in full.
To then add columns use Transform > Compute in SPSS to create a new column that totals the
separate ones. You enter the title of the new summary column top left in the dialog box, and click the
column names to be added into the top right space, with + between them. That creates a new column
of totals.
However, anyone with a score missing in any column will be missed out and their total will come out
as missing.
If there are missing values in some columns, marked in SPSS by a '.', where subjects failed to respond
or have unanalysable data, you will probably want each person's total really to be the average score
over all the items they answered, not the total (unless you have some reason to count missing as the
same as wrong and so score it 0). You can get this by, in Transform > Compute, inserting in the
Numeric Expression box the function MEAN(numexpr, numexpr,...) from the functions list, and
putting the relevant column labels in the brackets separated by commas. I.e. if you have a set of three
items whose scores are in columns item1, item2, item3, then you would enter MEAN(item1, item2,
item3) in the Numeric Expression box. SPSS then generates a new column with the average score of
each case on the three items or, if they answered fewer, over the ones they answered.
Similarly, if you want to just add, not average, a set of columns, using whatever scores are available,
then to avoid the people with missing values getting recorded with a zero total, use SUM(numexpr,
numexpr,...) in the same way as described for means above.
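The behaviour of MEAN() and SUM() with missing data can be sketched outside SPSS. This is an illustrative Python sketch, not SPSS itself; None stands in for the '.' that SPSS shows for missing values:

```python
# What SPSS's MEAN() and SUM() do with missing values, roughly:
# they work over whatever non-missing items a case has.
def spss_mean(*items):
    present = [x for x in items if x is not None]
    return sum(present) / len(present) if present else None

def spss_sum(*items):
    present = [x for x in items if x is not None]
    return sum(present) if present else None

# A person who scored 1 on item1, missed item2, and scored 0 on item3:
print(spss_mean(1, None, 0))   # 0.5 - mean over the two answered items
print(spss_sum(1, None, 0))    # 1  - total over the answered items
```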
Cutting an interval scale into ordered categories
A common example is deriving a grouping of subjects from something you measured about them
originally on a numerical scale: an explanatory variable such as their ages, English proficiency,
extraversion etc. This is often done casually without due thought, and often in peculiar
idiosyncratic ways by novice researchers, but above all it needs careful thought about why it is done,
and how.
Before you do this, you need to ask if it is necessary at all. Just because some other researcher
had a high prof group and a low prof one does not mean you necessarily have to have groups. When
you derive such groupings from scores originally recorded on a continuous interval scale, obviously
you lose some information. One person may be a bit better than another on the original scores, but
once you decide they both belong in the high prof group, or whatever, they are treated as identical in
any further tests. This may or may not help produce the result you want. Certainly how you divide
subjects into groups, if you do, can drastically affect the result!
Reasons for cutting
There are a number of reasons: some statistical, some related more to research methods, design and
hypotheses.
1. A few statistical techniques require interval scales to be reduced to binary grouping.
Implicational scaling (scalogram analysis) is one method of statistical analysis used in
acquisition research that requires this: subjects simply have to be categorised as having
acquired or mastered each feature of interest or not. So also Varbrul analysis requires
groupings of people who use or don't use some form of sociolinguistic interest.
2. If the true interval nature of a scale is in doubt, that could be a reason to reduce it to
categories (though reduction to rank order would lose less information).
3. If you retain the original scores and look at relationships with other (dependent) variables
then you are into the statistics of correlation, and maybe multiple regression, typically. If
you form groups, then you can identify a mean for each group on the other variables of
interest and compare those means with t tests, ANOVA etc. Both methods will show
relationships between EVs and DVs, but the second will be better (or at least easier in
SPSS) for dealing with:
   i. nonlinear relations, e.g. where high and low proficiency subjects perform
   similarly on some other variable of interest, compared with intermediate subjects;
   ii. interactions between different EVs, e.g. where you want to see the
   combined effects of gender and prof on something: do high prof females differ
   from high prof males in the same way as low prof females differ from low prof males?
   iii. designs involving repeated measures.
4. The goal of the research may be exploratory - precisely to discover useful categories of
subjects.
5. You may wish to identify extreme groups of subjects for comparison. E.g. you want to
compare bilinguals who are English dominant with those who are Welsh dominant. You
do not want more or less balanced bilinguals. So you measure the bilingual dominance of
a sample and will reject the middle scorers, keeping two extreme groups.
6. You need categories to form the IV in an experiment. E.g. you want words of three levels
of frequency to be the stimuli for three conditions in an experiment. Or maybe you want
extreme stimuli - just frequent and rare. Either way you need groups of words, as it is
difficult to use an interval scored variable directly as the EV in a repeated measures
design.
Means of cutting
OK, so you still want to make groups - there are many ways of doing it. To some extent they match
the reasons above. The principles apply to any interval-scored variable that is to be turned into a
grouping. The issue is where to cut the original interval scale so as to obtain two or more groups of
cases.
Cutting at a priori scale values. That is, cutting at predecided score values on the scale, which
would be the same whatever sample you gathered. These values may or may not have some
absolute meaning of the criterion-referenced type. Cf. Reason 1 above. Such a point could be:
One used arbitrarily by previous researchers. Not necessarily a good way to do it if it has
no sound basis, other than for the reason that it then enables you later to compare your
results directly with those of other researchers.
The pass mark used in a particular institution for some English exam, or a succession of
such marks e.g. corresponding to what are called grade A, B, C, D in some institution.
Again such points may be fairly arbitrary, but perhaps meaningful for your research in
allowing you to contextualise it.
Grades with some universal absolute meaning associated with them, maybe in a
professional published test you have used. E.g. you divide subjects into those who got
grade 6 or better in the IELTS test, and those who scored worse, given the widespread use
of this value as a criterion for entry to UK universities. Ranges of scores of the Jacobs
instrument for assessing EFL written compositions, and many international language
tests, have proficiency definitions associated with them. A different example of this type
is to divide a five point rating scale of the type 'strongly agree - agree - neutral -
disagree - strongly disagree' into just two categories: those who showed some agreement
(i.e. the top 2 choices) versus the rest who disagreed or were indifferent. This uses a
division point with some clear meaning of its own (but why then did one not ask the
question in the first place as a two choice item?)
The score on a variable scored as % correct which is conventionally regarded as
indicative that someone has acquired a feature. Acquisition researchers vary in what they
think this score is, but 80% or higher correct use of, say, third person -s would be
regarded by many as enough to put a subject in the group they would say has acquired
the feature. Others argue that only 100% correct indicates true acquisition. Others
that any correct use greater than 0% indicates acquisition has occurred. Again
others use other scores, like number of occurrences of a structure in 5 hours of observation
(Bloom and Lahey 1978, 328) - 5 or more indicating acquisition.
The score on a variable scored as % use of one alternative which is conventionally
regarded as indicative that someone is a clear user of that alternative. Labov in his
famous department store study divided subjects into three groups: those using no [r]
sounds in the words 'fourth floor' said twice, those using them on all four possible
occasions (categorical users), and those in between (i.e. variable users).
Scores defined by how some other relevant group of people performed on the same test or
measure, e.g. for learners you might make use of the mean score of native speakers
doing the test (a criterion group), or perhaps the score which only 15% of NS do better
than (the 85th percentile). Alternatively one might rely on the mean score that large
numbers of learners of the same sort as one's own testees gained in other research (a
reference group). The latter is not often available in language research - more a feature of
standardised NS tests like the British Picture Vocabulary Scale and so on.
Cutting the score scale into halves or equal lengths. That is only easy if the scale has fixed ends,
such as a % score scale, or a test marked out of 40. E.g. you make four groups: those who
scored between 0 and 10, 11-20, etc. (being careful not to label them with overlapping ranges
0-10, 10-20, 20-30). This is often not very meaningful unless the scale has some absolute
meaning (so that half-marks actually means half knowledge of something beyond the test
items), and it produces unequal sized groups. Also it may not even be possible to quite achieve
equal lengths with ease (0-10 actually covers one more point than 11-20!). However, it is a
system that can be used with the same cutting scores on any sample, like the above but unlike
those below. Mitchelmore (1981) suggests that the scale should not be cut into lengths that are
too short, so as to avoid misclassification. Lengths should not be shorter than
1.635 x SD of scores x (1 - reliability). Possibly useful for Reason 2 above.
Cutting so as to achieve equal numbers of subjects/cases in each group. Technically this uses
the median and quartiles. E.g. if you had scored 30 people and want two groups, you simply put
them in rank order on the basis of their scores, and the top 15 (those above the median score)
become the high prof group, those below the low prof one. The cutting score obviously will
differ for different samples and has no real meaning, but generally it is better for later
comparisons if groups have more or less the same numbers of subjects in them. Often used for
Reasons 2, 3, 6 above.
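A median split of this kind is easy to sketch in code; the subject labels and scores here are invented for illustration:

```python
# Median split: rank the cases on their scores, cut at the median,
# and get two groups of (near) equal size.
from statistics import median

scores = {"S1": 34, "S2": 55, "S3": 41, "S4": 62, "S5": 48, "S6": 70}
cut = median(scores.values())                       # 51.5 for these scores

high = [s for s, v in scores.items() if v > cut]    # the "high prof" group
low = [s for s, v in scores.items() if v <= cut]    # the "low prof" group
print(sorted(high), sorted(low))
```

With an odd number of cases, or ties at the median, the two groups will not be exactly equal in size, so you have to decide where the borderline cases go.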
Cutting at the mean, and points related to it. E.g. you divide into those who scored above the
mean (average) and those below. Or four groups: those scoring more than one SD above the
mean, those more than one SD below the mean, those between the mean and one SD above,
and those between the mean and one SD below it. To get three groups you might use the mean
plus or minus half the SD as cutting points. The mean, like the median, is entirely relative to a
particular sample of course. The problem with dividing at the mean is that usually many cases
score near the mean, so cases very close to each other will get put in different groups. If the
original scoring is not perfectly reliable, that in turn means that some cases may be
misclassified.
Cutting into natural groups using low points in the distribution shape. This is a simple form of
cluster analysis and simply looks to see if the subjects in the sample seem to have grouped
themselves (cf. Reason 4 above, and also maybe 2, 3). I.e., looking at a histogram of scores, are
there two or more heaps with a low point on the scale where few scored? Then make the
cutting score the middle of the low point(s). This of course decides both where to cut and,
unlike most methods, how many groups to identify. It may vary from sample to sample but does
reflect the nature of a particular sample better than some of the above methods. It will not work
if the histogram is simply one heap (e.g. with the normal distribution shape), though sometimes
rescaling the histogram with finer divisions may reveal what an initial SPSS histogram may
conceal. As an example, the scores of 217 subjects on a College English exam in Pakistan are
graphed below and it is fairly clear that there are two groups in the sample: those scoring above
58 or so and those below. By comparison the median score, above and below which are equal
numbers of cases, is 50 for this data and appears rather arbitrarily to divide people within one
of the groups that they seem naturally to form.
With all the above methods, but especially the third, researchers may choose to use extreme groups
only. Often where a researcher wants to get clear differences between groups later he/she will help
this along a bit by, say, using the top third and the bottom third of subjects and missing out the middle
third in any later comparisons. Reasons 5 and 6 above.
However you cut, you have to be careful how you describe the result. Very often you will call the groups you
make the 'high proficiency group' and the 'low proficiency group', or the like. But unless your
original test that produced the scores was a criterion referenced one, deciding some absolute level of
prof for each taker, with international equivalence, this can be misleading. Very often the
proficiency test researchers use was a cloze test you made up yourself, or the like. It may well
distinguish students with higher proficiency from those with lower, in the sample of students you are
using. But that does not mean there is any equivalence with what were called 'high prof' students by
some other researcher who used a different test with a different sample in another country. It could be
that all his students, high or low prof, are no better than the worst of your low prof group, and so on.
Only if some standard published test such as FCE or TOEFL was used by all could you match up
across studies and see if there was any real comparability between so-called high prof students in
different studies. In fact close examination shows that many variables used in research have no
absolute definitions of scale points, and most of the above ways of dividing cases into groups only
distinguish in a relative way between who/what has more of something or less, not exactly how much.
The size of the standard deviation
One is quite used to having SPSS calculate the SD along with the mean (= average) of a set of scores
(i.e. for any interval scale).
We are also used to the idea that the SD measures spread of scores around the mean. If all cases
scored the same, the SD would be 0. The bigger the SD, the more spread the scores of different
cases: the more subjects are disagreeing with each other in their scores. And the more that happens
within groups, the harder it is usually to show any convincing differences between groups. Similar
concepts to SD are what statisticians call variance and error. These measures are slightly different
but all, roughly, are averages of the differences between each case's score and the mean. If all cases
score the same, which will also be the mean score, then their differences from the mean are 0, so SD =
0.
Sometimes SPSS fails to perform a procedure because of a problem of 'zero variance'. That means it
found that one of your groups, on one of the variables measured, had an SD of 0. All cases scored the
same. This makes certain statistical procedures impossible: they involve variables and cannot work if
everyone scores the same, as then you have not a variable but a constant. You cannot answer the
question 'what is the relationship between age and reading ability?' if you have obtained data from a
sample who are actually all of the same age!
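The two situations are easy to illustrate with invented scores (Python's statistics module; pstdev is the population SD, the version SPSS calls on when describing a whole group as such):

```python
# SD as a measure of spread: identical scores give SD = 0, which is
# exactly what triggers SPSS's "zero variance" complaints.
from statistics import pstdev

print(pstdev([5, 5, 5, 5]))   # 0.0 - a constant, not a variable
print(pstdev([3, 5, 7, 9]))   # > 0 - scores spread round their mean of 6
```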
So we know what an SD of 0 means, but what about big SDs? There is often no simple maximum
value that the SD can have. But there are some guides to help assess the size of an SD:
It may often be more of interest whether different groups or conditions show similar or different
variation (SD) than how great the SD actually is. In general you assess the size of an SD for
each sample group separately.
If your scores are on a scale with both ends logically fixed (e.g. a test scored out of 40), then the
maximum possible SD, if cases were maximally varied in scores, is half the scale length (well,
actually it will be a shade above that for small numbers of cases, but that is a useful rule of
thumb). So you can assess the size of an SD you get in relation to that. An SD would usually be
regarded as big if it was even as much as half that maximum (i.e. a quarter of the scale length).
On a scale of % correct scores, half the scale is 50. Note that on a five point rating scale
running 1-5, half the scale length is 2. On such scales of course the mean is also limited: it
cannot be a figure outside the end points of the scale. That places further limits on the size of
the SD: the nearer the mean is to the limit of the scale, the smaller the maximum possible
SD.
If your scores are on a scale with one or both ends virtually open, then the SD (and the mean)
could be indefinitely large. In language research many scales are fixed at one end on zero, but
open at the other. E.g. word frequency: words cannot occur less than 0 times, but there is no
clear upper limit to how often they can be observed. So also sentence length: sentences cannot
be shorter than one word, but they can be indefinitely long. Response times in milliseconds
have a hazier bottom limit: there is an indefinite upper limit to how long anyone can take to
respond to a stimulus and, although technically there is a lower limit of zero, nobody can really
respond in zero milliseconds, so there is an indeterminate lower limit to fast responses. With
these scales it is harder to say what is a big SD, but one can use some yardsticks:
One can use the maximum and minimum scores that occur in one's data as indications of
the effective limits of the scale, and as above treat an SD larger than a quarter of the
distance between them as large. For a scale fixed at one end, one could use the distance
between the bottom limit and the highest observed score.
With scales fixed at the bottom end, but open at the high end, the distribution is often
skewed to the right (positively skewed). I.e. scores are heaped near the bottom limit and tail off to
the right. In that situation the SD can be, and often has to be, greater than the mean,
though if the distribution has a perfect Poisson shape, the mean = the square of the SD. If
the mean is some way above the bottom limit, and that limit is 0, and the distribution is
more symmetrical, then people sometimes assess an SD in relation to the mean: if the SD
is as much as or more than half the mean, that indicates very substantial variation among the
scores of a group.
Always look at the distribution shape on a histogram as well as the mean and SD. The shape
may reveal more than anything else.
How to treat rating scale responses
An old problem is how to handle responses to items recorded on scales such as:
strongly agree - agree - neutral - disagree - strongly disagree
always - often - sometimes - never
They are rating scales (not usually called multiple choice). They are clearly ordered choices and there
is uncertainty whether they are really best thought of, and treated statistically, as:
Ordered categories: so you present the results in bar charts, report the % of people
who responded in each category on the scale, and use ordered category statistics to
analyse relationships with other variables.
OR
Interval scores: so you assign a score number to each point on the scale and present
the results as a histogram, report the mean and SD of the scores of a group, and use t
tests, Pearson correlation or whatever when comparing groups or looking for
relationships. The numbering could be e.g. strongly disagree = 0, disagree = 1, and so
on; or if you prefer, strongly disagree = -2, disagree = -1, neutral = 0, etc.
Generally it is far easier for any statistical handling to treat the data the interval score way, as the stats
for interval scores are better known and more versatile in what they can do. The results are usually
easier to absorb as well. Suppose two groups are asked how far they agree that a CALL activity is
easy to understand; Group B is of a higher English level than A. Is it easier to derive some meaning
from being told:
In group A the response was: strongly agree 43.3%, agree 20%, neutral 13.3%, disagree
13.3%, strongly disagree 10%. In group B it was: strongly agree 30%, agree 30%, neutral
10%, disagree 30%, strongly disagree 0%. The difference between the two groups is not
significant (Kolmogorov-Smirnov Z = 0.365, p = 0.999).
OR from
ome stats and SPSS pointers http://privatewww.essex.ac.uk/~scholp/statsquibs
8 of 29 12/18/11 2:58
-
8/3/2019 Some Stats and SPSS Pointers
19/29
The mean agreement response (on a scale -2 for strong disagreement to +2 for strong
agreement) was in group A 0.73 and in group B 0.6. Variation was similar in the two groups,
and moderately high (SDs 1.41, 1.26). The difference between the groups is not significant
(t = 0.265, p = 0.793).
I know which I find easier to follow!
So I advise going for the second interpretation wherever possible, but making sure that when you use
such scales, the way they are used in the data gathering itself justifies this interpretation. In particular:
Make sure the words used for the points of the scales do suggest more or less equal intervals
between one point and the next, otherwise the interval interpretation is invalid.
Accompany the wording with figures in the version presented to respondents, so they are
encouraged to think of the scale as a numerical one, with equal intervals between the
numbers.
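The interval-score treatment is easy to sketch; the responses here are invented, and the -2..+2 coding is the second numbering option mentioned above:

```python
# Coding a five-point agreement scale as interval scores (-2..+2) and
# summarising with the mean and SD, as in the second report style above.
from statistics import mean, stdev

coding = {"strongly disagree": -2, "disagree": -1, "neutral": 0,
          "agree": 1, "strongly agree": 2}

responses = ["agree", "strongly agree", "neutral", "agree", "disagree"]
scores = [coding[r] for r in responses]            # [1, 2, 0, 1, -1]
print(mean(scores), round(stdev(scores), 2))       # 0.6 1.14
```

For the ordered-category treatment you would instead count the % of responses in each of the five categories, as in the first report style above.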
Tests of prerequisites for parametric statistical tests
These tests of prerequisites are only of interest to check if the data is suitable for using some OTHER
test that you are REALLY interested in, because it relates to your actual research questions or
hypotheses. Tests of prerequisites generally apply where ANOVA/GLM is used, though researchers
rarely report having made these checks and we cannot tell if the checks were performed or not! You
generally want them all to be nonsignificant, as that is what shows the data is straightforwardly
suitable for parametric significance tests like ANOVA/GLM.
If the prerequisite test is failed then there may be alternatives within the parametric tests you can use
to compensate, or weaker nonparametric tests you can use instead of straightforward ANOVA etc., or
possible transformations of the data one could do... but often one has to just admit the data is not
perfect for the procedure but carry on and use ANOVA anyway....
Their functions are as follows:
Any parametric significance tests.... t tests, ANOVAs etc. all assume that the populations that
the groups are from have distributions of scores that are normal in shape (i.e. that bell-shaped
distribution you see in all the books). Check with K-S test (though on small samples
everything passes this test!!).
t test for 2 independent groups, and all ANOVAs involving comparisons of 2 or more groups
(with or without also repeated measures). The groups need each to have a similar spread of
scores within them round their respective means (= homogeneity of variance). Check with
Levene's test, which (roughly) decides if the SDs of the groups could be from one population of
SDs, so are similar, or not. The t test for 2 independent groups has alternative versions
depending on whether this prerequisite Levene test is passed (nonsig) or not, but ANOVAs
don't: they all assume the prerequisite test of equal variances is passed.
All ANOVAs involving comparisons of 3 or more repeated measures (with or without
independent group comparisons as well). Here again the spreads of the scores in each condition
need ideally to be similar. Strictly it is the covariation between each pair of conditions that
needs to be similar (= test of sphericity). Check with Mauchly's test (which SPSS automatically
gives you even where you only have two repeated measures, though it applies vacuously there
and need not be looked at). The check, roughly speaking, looks at the correlation between the
scores in each condition and those in each other condition in pairs and sees if the correlations
could all be from a population with one correlation or not. The data would likely not pass if
people who did better on condition A also did so on B but were the worst on C, and so on... If it
is passed (nonsig) then you use the 'sphericity assumed' results in the ANOVA table, otherwise
the ones below those (Greenhouse-Geisser).
ANOVAs with a mixture of repeated measure comparisons and independent groups. Here there
is an extra requirement about the pattern of covariance between conditions in each group
separately being similar also between the groups. Check with Box's M test.
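The logic behind Levene's test can be sketched from scratch: replace each score by its absolute deviation from its group mean, then run a one-way ANOVA F on those deviations; a large F (small p) means the groups' spreads differ. This is only an illustration of the idea on invented data; in practice you would rely on SPSS's own implementation:

```python
# A bare-bones Levene statistic: one-way ANOVA F computed on the
# absolute deviations of each score from its own group's mean.
def levene_F(*groups):
    k = len(groups)
    z = [[abs(x - sum(g) / len(g)) for x in g] for g in groups]
    n = sum(len(g) for g in groups)
    zbar_i = [sum(zi) / len(zi) for zi in z]         # group means of deviations
    zbar = sum(sum(zi) for zi in z) / n              # grand mean of deviations
    between = sum(len(zi) * (zb - zbar) ** 2 for zi, zb in zip(z, zbar_i))
    within = sum((x - zb) ** 2 for zi, zb in zip(z, zbar_i) for x in zi)
    return ((n - k) / (k - 1)) * between / within

# Two groups with identical spread patterns: F = 0, clearly "nonsig",
# i.e. the homogeneity of variance prerequisite is comfortably met.
print(levene_F([1, 2, 3, 4, 5], [11, 12, 13, 14, 15]))   # 0.0
```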
Missing values
Missing values are where cases have scores or categorisations completely missing for some reason,
where most cases did provide data. E.g. they gave no response, were uncooperative, or their response
was unanalysable, etc. (Where subjects have taken a multi-item test or the like to produce their scores,
then they may miss some items but still get a score for the test as a whole. That is a different issue.
You have to decide there whether a missed out item counts as wrong, or whether you allow people to
miss items and as overall test score give them the average score for the set of items they did answer.)
They are usually entered in SPSS by a '.' in the space where a figure should be, unless you have
assigned an actual number that you enter as indicating missing values, and declared it in Variable
View > Missing.
If you have missing values there may be problems:
- You may have very few cases left that you can use in the required statistical analyses:
especially in repeated measures and multivariate designs, if a case has data missing on one
variable/condition included in an analysis, it gets left out totally (i.e. listwise).
- The missing values may not be random: certain kinds of subject may be more prone
to produce them, so using the data without them, or with too few of them, will lead to a
biased result. E.g. young versus older testees; lower versus middle class informants.
If you leave missing values in place, SPSS usually gives you the choice (in Options for a given test)
to treat them listwise or pairwise/test-by-test. This really applies to multiple analyses of the
same data, as within one analysis it usually has to be listwise, meaning that the number of cases used
is the maximum number that has a complete set of data across all the relevant columns. E.g. suppose in
Correlation you want correlations done between every pair of variables in 5 columns (ten pairs, so
ten analyses). The listwise option would get you correlations using just the cases with full data across all 5
columns, so the same number of cases would be used in each analysis. Pairwise would, for each
analysis, use the maximum cases with data on both the relevant columns, so use more of the data, but
different numbers of cases might well be used to calculate different correlations.
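As a rough illustration of the difference (with made-up data; None stands in for SPSS's missing-value dot), a few lines of Python can count the cases each option would use:

```python
# Sketch of listwise vs pairwise case selection (illustrative, not SPSS itself).
# Rows are cases; None marks a missing value.
rows = [
    {"A": 1, "B": 2, "C": None},
    {"A": 2, "B": None, "C": 3},
    {"A": 3, "B": 4, "C": 5},
    {"A": 4, "B": 5, "C": 6},
]

def listwise_n(rows, variables):
    """Cases with complete data on ALL the listed variables."""
    return sum(all(r[v] is not None for v in variables) for r in rows)

def pairwise_n(rows, v1, v2):
    """Cases with complete data on the two variables of one particular analysis."""
    return sum(r[v1] is not None and r[v2] is not None for r in rows)

print(listwise_n(rows, ["A", "B", "C"]))  # 2: only two cases are complete on all three
print(pairwise_n(rows, "A", "B"))         # 3: one more case usable for the A-B analysis
```

Note how each pairwise analysis can use a different number of cases, which is exactly why different correlations in one table may rest on different n.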
Some stats and SPSS pointers, http://privatewww.essex.ac.uk/~scholp/statsquibs, 12/18/11

If you want to fill in missing values, the main principle is that it should not be done in a way that
will clearly and directly influence the result you are interested in. I.e. you should not fill in the missing
values following a principle that will obviously make the difference or relationship which is the focus
of your actual research more marked.
Broadly there are two ways of filling in missings in any column in SPSS (where a column represents a
variable, or a condition in a repeated measures dataset).
A) You fill in with the mean of the scores in the column itself (or if it is in categories, the mode,
which is the most popular category in that column).
B) You fill in by predicting a score from the general correlation of that column with others in the
data: the EM and regression methods.
Imagine data as follows (the dot marks a missing value):

C1  C2
3   5
5   7
7   9
4   .
6   8
If the research question concerns whether there is a relationship between two variables, in C1 and C2
(correlational design), then you do NOT use method B, which would use the correlation that exists
already in the data to fill in missing values. I.e. here, given the perfect positive correlation between the
two sets of scores, method B would fill in the missing value as 6, predicting it from C1. But that will
obviously enhance the perfection of the correlation which it is your aim to discover! So the mean of
the second column (method A) would be a better fill-in value: 7.25.
If on the other hand this were data from the same subjects on the same DV scored in two conditions in
C1 and C2 (repeated measures design), and the research interest is in the difference between the
means of the scores in each column (do they score significantly higher in condition 2?), the better
way to fill in the missing values would be by method B. Method A would simply enhance the level of the
mean of C2, and strengthen its distance from the mean of C1.
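The two fill-in values can be sketched on this toy data in a few lines of Python (the regression here is ordinary least squares, standing in loosely for SPSS's regression/EM estimation):

```python
# Sketch of the two fill-in methods on the toy data above (C2 has one missing).
# Method A: mean of the column; Method B: predict from the correlated column.
c1 = [3, 5, 7, 4, 6]
c2 = [5, 7, 9, None, 8]

observed = [y for y in c2 if y is not None]
fill_a = sum(observed) / len(observed)           # method A: mean of C2 = 7.25

# Least-squares line fitted on the complete (C1, C2) pairs.
pairs = [(x, y) for x, y in zip(c1, c2) if y is not None]
mx = sum(x for x, _ in pairs) / len(pairs)
my = sum(y for _, y in pairs) / len(pairs)
slope = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x, _ in pairs)
intercept = my - slope * mx
fill_b = intercept + slope * c1[c2.index(None)]  # method B: predict from C1 = 6.0

print(fill_a, fill_b)  # 7.25 6.0
```

With this perfectly correlated data the fitted line is C2 = C1 + 2, so method B reproduces exactly the relationship the correlational study is supposed to discover, which is the objection made above.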
For these reasons, when you run correlation-type statistics like Regression and Factor Analysis,
SPSS under Options offers you the choice to fill in missing values with the means (method A) as it
operates. The data in Data View does not get visibly altered: you just find that all the cases have been
used instead of those with missings being left out. Similarly in Regression with optimal scaling, which
works on associations between categories rather than interval scores, there is the choice to use Mode
imputation, which fills in the missings with the most popular category in the relevant column.
In situations where method B is suitable, you have to use Analyze → Missing Value Analysis to
actually fill in the missing values in Data View beforehand. Basic instructions: at the first
box, enter all the columns relevant to the analysis you will be doing, either as quantitative (i.e.
interval) or categorical (categories/nominal). Only the former are actually used in the estimation of
missing scores, though (SPSS does not seem to provide a way of filling in missing category data by
Method B). Tick EM and, if there are some quantitative columns that you don't want used as a basis
for predicting values of missings, click the Variables button and make your selection. Otherwise
all the quantitative columns you declared in the first box are used to predict any missings in each
other. Click the EM button and tick Save completed data; under File name, name a file for it to be
stored in. Then Save → Continue → OK. The procedure will produce various output, but mainly you
are interested in the new stored file of data. If you call it up, you will find the missings all filled in.
In data for independent groups analysis (e.g. t tests, ANOVA), with missings in the DV column, if you
have other columns of dependent variable data not being used in the same analysis, you could use
them to fill in the missings by method B. Otherwise you can only use method A, i.e. use the mean for
the whole DV column (NOT the mean of each group) to fill them in.
Getting phonetic symbols displayed in SPSS graphs
First ensure you have the fonts of your choice (e.g. SILManuscriptIPA etc...) installed in Windows in
the usual way. If they are available to you in Word in the usual way via Insert → Symbol, then they
will be available in SPSS. If not, get a copy of the font file (ending .ttf) and put it in the Fonts
subdirectory of the Windows folder on your PC.
Now, having made a graph in SPSS, click the graph you have created to make it appear in the Chart
editing window. Then click the part you want to put special symbols in, such as the bottom scale, so it
comes up outlined. Next click Format → Text, select the required font from the menu and the size
you want, and click Apply, then Close.
Now when you click the scale of the graph and choose to change the Labels, you can type the
symbols you want. However, you don't initially see them when you type them in the dialog box. You
have to know that in the SIL font shift-t gets you the symbol for the th sound of thick, though it will
look as if you have just got T. Anyway, you have to type all the labels in the new font; you cannot mix
symbols from different fonts, I think. So retype the labels using Change, and Continue. The symbols
you want will appear on the graph itself.
I have not found a way to get symbols that are coded outside the range of the font that is covered by
the keyboard keys, with and without shift. To know what symbols you can get from which key, with
and without shift, you may have to study the table of symbols for your font in advance through a
program such as Word, which displays it through the Insert → Symbol option.
Item Analysis
This term is used in two distinct senses. Both involve data where variables or experimental
conditions are each measured using sets of items in some way.
A) The usual traditional sense found especially in the pedagogical testing literature. Here it
applies in the situation where a set of items is used to measure what is regarded as one single
variable/construct. The set of items is usually thought of as a multi-item test of one thing (e.g.
reading ability, or vocabulary size). However, item analysis may also be applied to, say, a set
of Gardner-type statements for respondents to agree or not with, where a distinct attitude or
orientation is measured by an inventory of five such statements, rather than just one. It can
also apply separately to each set of items designed collectively to measure a single condition
in an experiment. Item analysis in all these instances is the activity of checking whether there
are some items in the set that in some way do not seem to belong there, illuminating how and,
if possible, why they are odd, and maybe removing them or replacing them with better items
when the test is used again. It is closely tied to internal reliability checking, often done these
days with the Cronbach alpha coefficient or Rasch analysis. Removing items that
are odd improves reliability. This sort of item analysis is often done in pilot studies, as it
represents a way of refining the quality of instruments for use in a main study. There are
several statistical criteria for deciding which items are odd in a set that is supposed to be all
measuring one thing. See further my Reliability handouts. Where items are supposed to attract
similar levels of response (e.g. be of similar difficulty), the classical IA approach
involving alpha is appropriate; where items are supposed to be graded, and form an
implicational scale, an approach using IRT/Rasch is better. Where response times are
involved, other criteria may be used to exclude individual responses (specific people on
specific items) rather than whole items.
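As a sketch of the alpha side of this (not a replacement for SPSS's Reliability procedure), Cronbach's alpha can be computed directly from a persons-by-items score table; the data below is invented:

```python
from statistics import variance

def cronbach_alpha(scores):
    """scores: one row per person, one column per item (all measuring one thing).
    alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals)."""
    k = len(scores[0])                                 # number of items
    item_vars = [variance(col) for col in zip(*scores)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Perfectly consistent items (every person scores the same on all of them):
print(cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]))  # 1.0
```

An item that does not "belong" drags alpha down, which is the statistical symptom the item-analysis criteria look for.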
B) The sense in which it is used in some psycholinguistic literature. Here it denotes a
second kind of analysis of data, beyond the usual default one. In an item analysis, instead of
the subjects (usually people) being treated as the cases, the items are treated as the cases.
Hence it is really "analysis with items as cases", rather than "item analysis", and is typically
part of the analysis of the results of a main study. This applies only when a study has several
conditions, each represented by a set of items, but this is very common in psycholinguistic
studies, where subjects' performance in different conditions is often measured by their
responses to sets of stimuli in a repeated measures design. For example, a repeated measures
variable "word frequency" might be constituted as three sets of ten words, of three different
frequency levels, making 30 items for people to respond to in some way; a variable "early vs
late attachment" could be instantiated as two sets of sentences, of two structure types, one in
which a relative clause has to be parsed with an early noun phrase, the other with a late-occurring
one. Often such data arises also in areas such as SLA, applied linguistics and even
sociolinguistic research as well as psycholinguistics, but item analysis in this sense is only
routine in the latter, where it is regarded as a further confirmation of results obtained by the
usual "subject analysis", i.e. analysis with subjects as cases. Where, as often, ANOVA (see
my handouts) is used to analyse the results, the F values for the "subjects as cases"
analysis are reported as F1, and those for the "items as cases" analysis as F2. Statisticians
generally regard analysis with subjects as cases as the sounder basis, due especially to the
independence requirement: cases have to provide independent observations
if the assumptions of inferential statistical tests (e.g. ANOVA) are to be met. While it is
generally not difficult to assume that responses from different people are independent of each
other, it is less certain that responses to different items are independent when the same
people respond to all of them. One has to assume that in psycholinguistic experiments people
do not let their response to one item influence their response to another; phoneticians and
psycholinguists commonly make this assumption.
Imaginary dataset to illuminate both of the above. Suppose we have two groups of ten people (G1
and G2), and each person responds in two conditions (C1 and C2), where 5 items are used to obtain
responses for each condition. As laid out for a customary "subjects as cases" analysis in SPSS, this
would appear as 11 columns and 20 rows, thus. Of course, the items would often not have been
would appear as 11 columns and 20 rows thus. Of course, the items would often not have been
presented to subjects in an experiment in sets, but intermixed with each other and maybe with
additional distracter/filler items that are not scored at all.
Group | C1 item1 | C1 item2 | C1 item3 | C1 item4 | C1 item5 | C2 item1 | C2 item2 | C2 item3 | C2 item4 | C2 item5
10 rows labelled 1, to mark each G1 subject | scores for each G1 person on C1 item 1 | scores for each G1 person on C1 item 2 | etc.
10 rows labelled 2, to mark each G2 subject | scores for each G2 person on C1 item 1 | etc.
To do item analysis (A) above in SPSS, you would split the file by Group and use Analyze →
Scale → Reliability Analysis → Alpha on each set of five items separately (for Rasch analysis
you need other software): four analyses. That means that the internal consistency is always
assessed within a collection of scores which is from a set of items that supposedly measures one
thing, and which comes from a homogeneous group of subjects. After any adjustment of the data
to improve reliability based on the above, you then typically move on to the actual analysis of
results with subjects as cases. You first produce two extra columns which contain the averages of
each five-item set of scores for each person: use Transform → Compute. These "Mean C1" and
"Mean C2" columns each now summarise the performance of subjects in one condition. Those two
columns, together with the Group column, are then used in a mixed two-way ANOVA to see if
there is a significant difference between groups or between conditions, or a significant interaction effect.
That is your "subjects as cases" F1 ANOVA.
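The Transform → Compute step here amounts to row-wise averaging, which can be sketched in Python for one hypothetical subject's row (group code followed by the ten item scores):

```python
from statistics import mean

# One subject's row from the grid above: group code, then five C1 item
# scores and five C2 item scores (the numbers are invented).
row = [1, 3, 4, 5, 4, 4, 6, 7, 6, 7, 9]
group, items = row[0], row[1:]

mean_c1 = mean(items[0:5])   # the "Mean C1" column for this person
mean_c2 = mean(items[5:10])  # the "Mean C2" column for this person
print(group, mean_c1, mean_c2)  # 1 4.0 7.0
```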
For item analysis (B), you need to make the items into the rows. You can do this with Data →
Transpose in SPSS. If you start from the data as displayed above and include all the columns, you
end up with 11 rows, which were previously the columns. There are now columns for each of the
20 subjects. You can then use Transform → Compute to get two new columns calculated which
represent the mean scores for each group of subjects on each item. Then delete the row that
contains the grouping numbers. Add a column of five 1s and five 2s to record which items (now rows)
relate to condition C1 and which to C2. The data should end up much as below. Finally use the
column that records whether an item belongs to C1 or C2, and the two columns of group mean
scores for each item. Again do a mixed two-way ANOVA to see if there is a significant difference
between groups or between conditions, or a significant interaction effect. That is your "items as
cases" F2 ANOVA. Note that what was a repeated measures factor in the F1 subject analysis,
condition, becomes a between-groups factor in the F2 item analysis. The grouping of subjects,
which was a between-groups factor in F1, becomes a repeated measures factor in F2.
G1 subj1 | G1 subj2 | G1 subj3 | etc. to G1 subj10 | G2 subj1 | G2 subj2 | G2 subj3 | etc. to G2 subj10 | Condition | Group 1 | Group 2
5 rows with scores for G1 subj1 on each C1 item | scores for G1 subj2 on each C1 item | etc. | 5 rows labelled 1, to mark each C1 item | mean scores of 10 G1 subjects on each C1 item | mean scores of 10 G2 subjects on each C1 item
5 rows with scores for G1 subj1 on each C2 item | etc. | 5 rows labelled 2, to mark each C2 item | mean scores of 10 G1 subjects on each C2 item | mean scores of 10 G2 subjects on each C2 item
Note, the above account of items-as-cases analysis assumed that the sets of items used to represent
the two conditions were not themselves matched or repeated in any way. I.e. C1 items 1-5 might
have been five nouns as stimuli in some response time experiment, and C2 items 1-5 five verbs,
with no special connection between individual verbs in one set and individual nouns in the other.
If however the items are themselves matched in pairs or repeated in different forms etc. across
conditions, the items as cases analysis should be different. E.g. if C1 items were five verbs in the
past tense and C2 five verbs in the bare infinitive form, the researcher might choose to use the
same five verbs in both conditions (randomised with suitable distracters interspersed when they
are actually presented to subjects). Then the items are individually matched and the items-as-cases
analysis should be done with the items as repeated measures. I.e. in the data grid above for SPSS,
the 5 rows for C2 responses would need to be placed not below the 5 rows for C1 but alongside
them, with the matched items in the same row, to allow a repeated measures comparison of items
as well as subjects.
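The Data → Transpose step, and the computation of the group-mean columns for the items-as-cases analysis, can be sketched in Python on a tiny invented dataset (4 subjects, 2 items):

```python
from statistics import mean

# Subjects as rows, items as columns (invented scores); groups gives each
# subject's group code. Transposing makes the items the cases, as
# Data -> Transpose does in SPSS.
scores = [[1, 2], [3, 4], [5, 6], [7, 8]]
groups = [1, 1, 2, 2]

items_as_rows = [list(col) for col in zip(*scores)]  # now one row per item

# Group-mean columns for the items-as-cases (F2) ANOVA: for each item,
# the mean score of the subjects in each group.
g1_means = [mean(s for s, g in zip(item, groups) if g == 1) for item in items_as_rows]
g2_means = [mean(s for s, g in zip(item, groups) if g == 2) for item in items_as_rows]

print(items_as_rows)       # [[1, 3, 5, 7], [2, 4, 6, 8]]
print(g1_means, g2_means)  # [2, 3] [6, 7]
```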
Checking for guessing or response bias when using certain data-gathering instruments with closed
responses
Checking for guessing
Any instrument where the subjects are given choices to pick from for an answer is potentially open
to guessing, in the sense of picking one option at random, without thought.
For example, the respondent may randomly pick one of the choices because:
- they can't be bothered to think about the question/item and just want to finish quickly
- they don't actually have any relevant knowledge to make a correct choice
- they can't understand the question (language too hard, too long, pragmatically odd, etc.)
- etc.
Clearly the results will not then be a true measure of whatever the researcher intended to measure,
and could even vary if the subjects responded to the same items again on another occasion. I.e. not
valid or even reliable.
This affects multiple choice items, yes/no or agree/disagree items in questionnaires and tests, rating
scales and so forth. Clearly it cannot affect instruments which have open response in some form, i.e.
with no alternatives supplied.
One cannot statistically tell definitely if guessing has taken place or not, but one can check if the
responses are like those one would get from someone who was guessing, or not. Obviously it is quite
possible to get a real result, where people have paid attention and answered sensibly, which happens
to be similar to the guessing one. Only the researcher can judge the interpretation.
You need to calculate what the result would be, on average, for someone who was randomly guessing,
and use the appropriate one sample test (see my LG475-SP handout) to check if the observed result
differs significantly from the one you would get by random guessing.
For example:
1) 30 subjects have to answer yes or no to a question about whether they use the keyword method of
vocab learning or not. Random guessing would yield an expected frequency of 30/2 = 15 "yes"
responses. Use the 50% binomial test.
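For the yes/no case, the exact binomial tail probability under random guessing can be computed directly; a minimal sketch, with an invented observed count of 22 yeses out of 30:

```python
from math import comb

def p_at_least(k, n, p):
    """Probability of k or more 'yes' responses out of n when each
    response is an independent guess with probability p of 'yes'."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 30 subjects answering at random: 15 yeses expected on average.
# How surprising would 22 or more yeses be?
print(p_at_least(22, 30, 0.5))
```

A small tail probability means the observed count is unlike what random guessing would produce; SPSS's binomial test makes the same comparison.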
2) 30 subjects have to pick one of four reasons they are offered for why they are learning English.
Random guessing would yield an expected frequency of 30/4 = 7.5 for each choice. Use the chi-squared one
sample fit test.
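For the four-choice case, the one-sample (goodness-of-fit) chi-squared statistic against the random-guessing expectations is a short computation; the observed counts here are invented:

```python
# Goodness-of-fit chi-squared statistic against random guessing.
# Hypothetical observed counts of the four reasons picked by 30 subjects.
observed = [12, 8, 6, 4]
expected = [30 / 4] * 4  # 7.5 per choice under random guessing

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2)  # compare against the chi-squared distribution with df = 3
```

The statistic is then referred to the chi-squared distribution with df = (number of choices - 1), which is what SPSS's one-sample chi-squared test reports.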
3) 30 subjects have to judge 20 words for whether they exist or not in English. Thus each person gets
a score out of 20 for how many they say exist. The average random guess score would be