
    Some stats and SPSS pointers

These relate to some specific technical matters I am sometimes asked about but which are not covered in detail in my courses, or which may have passed you by unnoticed. Note, I am not trying to present all my course material here (you have to take the courses for that), just to deal with some frequently asked questions and things people frequently get confused over or get wrong. Also, these are not all readily understandable unless you have already taken stats courses!

How do I round figures down to make them shorter, e.g. 3.852? And how many decimal places should I report?

How do I generate random numbers to help when sampling from a list, or when dividing subjects randomly into groups? Use the facility at http://www.randomizer.org/form.htm

I have the proficiency scores (or the like) for 30 subjects, and want to divide the cases into groups based on this. Or I need categories of word stimuli of three different frequencies. How do I do it?

Can I get phonetic symbols like [] shown on the scales of SPSS graphs?

How do I combine columns of figures I have entered in SPSS, when I want averages for each person of the figures in the columns (e.g. the scores for separate items in a test)?

What is item analysis? And what does it mean if the F in an ANOVA result is labelled F1 or F2, where there has been an analysis by items as well as by subjects?

How do I eliminate extreme response times in psycholinguistic data? Or response times where the response was wrong?

What does the standard deviation really mean?

When I do a histogram of some scores (interval scale data) I am supposed to look at the distribution shape (the pattern of the heaps on the graph), but how do I interpret the shape I see?

How should one treat rating scale responses? As ordered categories or interval scores?

If my data is not normally distributed, so not suited to t tests and ANOVA, what can I do? What are the transformations I can use?

What really are Likert and Guttman scales, and how should they be constructed? They are both ways of measuring things via a set of agree-disagree items. Often we use sets of items of this type that other researchers made, but I wonder if anyone actually selected and rated the items in the approved way in the first place?

What does it mean when SPSS gives you a figure with an E on the end? e.g. 7.012E-02

What are degrees of freedom (df) and how do I report them, if needed?

What are residuals and what do they tell me?

If in a pilot trial of a few subjects I don't get the significant result I want, how can I estimate how many subjects I would need to probably get a sig result?

How do I do follow-up post hoc paired comparisons and planned comparison tests for any kind of main effect or interaction in ANOVA where more than two groups or conditions were initially compared? SPSS doesn't do all the possibilities, or hides some away.

How do I do post hoc paired comparisons after a Kruskal-Wallis test?

What is Bonferroni adjustment and how can I do it?

What is eta squared and how does SPSS calculate it?

Esp. for ACQUISITION people and SOCIOLINGUISTS. Twenty people in two groups are each measured for the number of times they use the third person s out of all the occasions when they had an opportunity to, in compositions, recorded speech etc. (often called potential occurrences or loci). How do you summarise % scores like this? Group % scores for frequency of use of things, or individual % scores?


Esp. for PSYCHOLINGUISTS and people doing repeated measures EXPERIMENTS. What on earth is a Latin square, and how do I use it or some other method of organising conditions, different types of stimuli etc. in an experimental design?

What are those tests of prerequisites for ANOVA/GLM, such as those of Levene, Mauchly etc., all in aid of?

If I have a lot of missing scores, can I fill them in somehow?

Can I check on whether people are responding by random guessing or with bias, and adjust scores to take account of that?

My subjects all gave several responses to a set of different stimuli, and I have entered the data in SPSS with each response as a row. So there are several rows for each subject. How do I turn that into the more usable SPSS layout with one row per subject?

Subjects have been categorised in a parallel way in several different columns. E.g. they answered a set of questions each of which had the possible responses: me, my teacher, my classmates (i.e. although coded for SPSS as 1, 2, 3, the responses cannot be considered as degrees of anything on an interval number scale). How do I get SPSS to add up for each person, across the items, totals of how many times each category was chosen?

If you are into word association tests, there are a few descriptive stats that one can use there that one does not find used much anywhere else: the group overlap coefficient, the within-groups overlap coefficient, and the index of commonality.

    Degrees of freedom

Sometimes journals expect you to report these df figures along with other statistics. They are the figures you see quoted in brackets, or often as subscripts, after t, F, chi squared etc. E.g. instead of t = 2.34 one sees perhaps t(28) = 2.34.

They can usually easily be got from SPSS output where they are not obvious: look for df. Broadly they reflect the number of categories in any category variables in the design, and the number of cases in each group. The exception is designs where only category variables are involved (e.g. where you would use chi squared): in that instance the df just reflect the number of categories.

Since you will have told the reader the numbers of categories and cases involved anyway, I don't personally see the point of mentioning df.


But in case you need to, they mainly turn out to be one less than the numbers you started with, though it can get more complicated.

The df numbers are written as subscripts, or in brackets, after the statistic t, F or whatever (not after the p).

So in a t test comparing two groups, 108 subjects altogether, the df will be 1,106. One might write t1,106 = ..... The first figure is one less than the number of EV categories (2-1=1). The second is the number of cases less one for each group involved (N-2 = 108-2 = 106).

In an ANOVA comparing four groups with 108 subjects altogether, the df would be 3,104.

In a t test comparing the same group in two conditions, the df for 108 cases will be 1,107.

The df can be more tricky for more complicated designs and interactions. In the output of ANOVA you will generally see the first df figure you need in line with the main effect or interaction of interest, and the second one listed as within groups or error below it.

In a chi squared test with three categories on each scale, the df is 4 because (3-1) x (3-1) = 4. In a chi squared test with two categories on each scale, the df is 1 because (2-1) x (2-1) = 1.

Why are these figures called 'degrees of freedom', and why are they important? It is basically because what is important in statistics is not so much the numbers of anything but the numbers of choices or separate pieces of information involved. Typically there is always one less choice than there are people etc. If I have ten assignments to hand back to my class of ten students, I have to make a choice about who to give each one to for the first nine, but for the tenth there is no choice, as there is only one assignment left and one person left to give it to. I have no 'freedom' left on the last one.

Here's the statistician's analogue of that. 100 people answer a yes-no question and 38 say 'yes' and 62 say 'no'. We want to know if that differs significantly from 50:50, i.e. are they showing a real preference? There are two categories (yes and no), so we use the binomial test. It might seem that we have two figures to handle in the test and two comparisons to make: we have to check if the observed figure of 38 differs from the E of 50, and if the O of 62 differs from the E of 50. But in fact, of course, the test need only do one of those. The data has only one degree of freedom. Once the test establishes whether 38 differs significantly from 50 for one category, the answer for the other category, whether 62 does so as well, is fixed. Hence if one calculates statistics by hand one always finds that in the formulae one has to use the df figures rather than the full numbers of cases or categories.
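If it helps to see that last point concretely, here is a minimal sketch in Python (not part of the SPSS workflow described here; the data are just the example above). The binomial test only needs one of the two observed counts, precisely because there is only one degree of freedom.

    from scipy.stats import binomtest

    # 100 people answer a yes-no question; 38 say 'yes'. Test against a 50:50 split.
    print(binomtest(k=38, n=100, p=0.5).pvalue)

    # Testing the 'no' count instead gives exactly the same p value:
    # once 38 vs 50 has been assessed, whether 62 differs from 50 is already fixed.
    print(binomtest(k=62, n=100, p=0.5).pvalue)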

    Residuals

These are simply the differences between observed figures (O) and some kind of predicted/expected figure (E). But they mean different things in different analyses.

Category data: for significant differences/relationships we want them big, because the E figures represent what is expected under the null hypothesis of NO difference/relationship.


In analyses where just frequencies in categories are involved (e.g. analysed using chi squared or the binomial test), the residuals are the differences between O and E frequencies. The bigger they are, the more likely it is that there is a significant difference involved. In the Labov analysis in class we looked at the table of O and E values to see where the biggest O-E differences were (for which [r] use in which store). In fact chi squared itself is calculated essentially by adding up the residuals for each cell in the table (with a bit more maths to it). In the binomial test where, say, 20 people are divided 4 saying 'yes' and 16 saying 'no' to a question, we want to know if that differs significantly from a 50-50 split, which would be 10 'yes' and 10 'no' in this instance. So we are concerned with the size of the residual... in this instance 6. The bigger the better, if we want to show a clear preference.

Interval data: for significant relationships we want them small, because the E figures represent what is expected under the hypothesis of a perfect linear relationship. This is the other place where you often find residuals being talked about: in data where all the variables are (treated as) interval (analysed using Pearson r, or regression). Here they are the differences between the observed scores and the scores predicted by the best fitting line on a scatterplot, showing the EV-DV relationship. Here obviously the smaller the residuals, the more likely the relationship is significant. Obviously one can find a best fitting line for any data where cases are scored on two or more interval variables... but if most of the observations fall miles away from the line, that does not show a real relationship. Pearson r and regression statistics in effect reflect whether the residuals are generally large or small; when examining scatterplots, if we look at cases (subjects) that are way off the line, we are looking at cases with exceptionally large residuals.
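Here is a minimal Python sketch of both kinds of residual, with invented figures (scipy is just an illustrative alternative to SPSS here).

    import numpy as np
    from scipy.stats import chi2_contingency, linregress

    # Category data: residuals are O - E for each cell of a frequency table
    observed = np.array([[30, 10],
                         [20, 40]])
    chi2, p, dof, expected = chi2_contingency(observed)
    print(observed - expected)        # big residuals -> difference more likely significant

    # Interval data: residuals are each case's distance from the best fitting line
    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    fit = linregress(x, y)
    print(y - (fit.intercept + fit.slope * x))   # small residuals -> relationship more likely significant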

    Eta squared

This is the measure of relationship that you can get in ANOVAs and the like. A bit like a correlation coefficient, it tells you on a scale 0-1 how much EV-DV relationship there is. Really it is more analogous to r squared, and can be thought of as a % on a scale 0-100. It is a useful addition to just being told whether a relationship or difference is significant: many significant differences/relationships are in fact quite small in terms of the SIZE of the difference/relationship.

SPSS does not calculate eta quite how the books suggest, or even how the SPSS help itself seems to suggest.

In fact every eta squared is calculated so that it is a proportion out of a different total, and some of the variance that goes into the calculation of one of them may also go into the calculation of another, so none of them can sensibly be added to each other.

So every effect (main or interaction) is out of its own 100%, representing the maximum variance that it could account for, not all the variation in DV scores. This applies even where the effects are of the same type and a sensible calculation could be made of the % of variance of the same type accounted for (e.g. two between subjects main effects: in principle one could calculate what % of the WS variance they account for together). In fact this is not done.

So the SPSS etas can be compared with each other ('This one is accounting for more of the total it could account for than that one is...') but not really added. Or if you like, the total %, if there are three factors with three main effects, three two-way interactions, and one three-way interaction, is not 700% but less than that, though it is hard to calculate exactly what.


(In fact you can see how SPSS calculates the etas: in the sums of squares column, it is simply the sum of squares for the effect of interest divided by the SS of that effect plus the relevant error SS for that effect. Clearly then it is not calculating the proportion of all the SS in the entire analysis accounted for by that effect, just the proportion of the SS relevant to that effect. And the error SS also get re-used in different calculations.)
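As a concrete illustration of that calculation, here is a minimal sketch with invented sums of squares (the numbers are made up, not from any real output). What SPSS reports is the 'partial' version, SS for the effect divided by (SS for the effect + the error SS used to test it), rather than SS for the effect divided by the total SS.

    # Hypothetical figures read off an ANOVA table
    ss_effect = 120.0    # SS for the main effect of interest
    ss_error  = 480.0    # error SS used to test that effect
    ss_total  = 1500.0   # total SS for the whole analysis

    partial_eta_sq = ss_effect / (ss_effect + ss_error)   # what SPSS reports: 0.2
    classic_eta_sq = ss_effect / ss_total                 # the textbook version: 0.08
    print(partial_eta_sq, classic_eta_sq)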

    Post hoc tests of paired comparisons after ANOVA

Wherever a main effect or interaction involves a comparison of more than two means, post hoc tests can be relevant, as the basic significance value given by the ANOVA does not say which pair or pairs is/are sig different. If the main or interaction effect from ANOVA comes out significant, that just means that there is a sig difference SOMEWHERE among the means, but not necessarily between every pair. This especially arises where one or more of the EVs has three or more levels (i.e. groups or conditions), though it can also arise, say, where you have two two-value EVs and the interaction is significant. You need a post hoc test to identify where exactly the differences are, or else just judge it by eye from a graph or table of means. This situation arises in various ways in ANOVAs, some of which SPSS deals with straightforwardly, others not.

One might think the solution is just to do loads of familiar t tests comparing the means in pairs as required, to see which pairs are sig different. Indeed one sees this done in some published work, and in moderation you can probably get away with it. However, statisticians don't like that. The statistical issue underlying all this is that, when you do paired comparisons like this, the same means get reused several times in different comparisons. If you have three groups and compare them in pairs, then the mean for group 1 gets used in the comparison both with group 2 and with group 3. Now the more times a mean gets compared with others in repeated statistical tests, the more chances it has to come out as significantly different just by chance, not reflecting a real population difference. Remember that if a difference between two means is significant (at the .05 level) that actually MEANS that one would not get a result this different more than 5% of the time, or one in twenty times, by chance, due to the vagaries of random sampling, in similar sized samples from a population where there really was no difference. But another way of looking at that is to say that if you use the same data in twenty comparisons, then one of the results might be that one-in-twenty result that looks significant but is actually from a population where there is no difference. The more tests you do, the more chance of getting a result that looks sig but is not really.

Some adjustment has to be made to compensate for this. Like other activities in life involving pairs, your tests for multiple paired comparisons should not be unprotected! Post hoc tests and the like cope with this better than t tests. It is not appropriate to do multiple t tests, at least not without a Bonferroni adjustment of the sig level (though that is a solution that is seen as rather overcompensating for the problem). Better is to use a post hoc test designed for such comparisons (e.g. Tukey, Scheffe, etc.). However, as the SPSS dialog box for post hoc shows, there is a myriad of options: nobody is certain which is best, and none are perfect. As a consequence you can sometimes get the anomalous result that the ANOVA says there is a sig difference somewhere, but the paired post hoc test does not find any pair significantly different.


The term post hoc is used where you just want to consider all pairs of means that are possible to compare, following an overall analysis including all the means, which is the appropriate starting point. SPSS however limits this term to comparisons between cases in different groups, though statisticians use the term generally for follow-up comparisons of pairs of repeated measures conditions as well. The term planned comparison (= contrasts in SPSS) is used where you planned specific paired comparisons, not all the possible ones, such as the comparison of three groups of learners with an NS group, but not with each other.

The general rule is that for k means there are k(k-1)/2 paired comparisons possible. E.g. if four groups, then 4 x 3 / 2 comparisons, i.e. 6. However, SPSS output usually gives you the pairs twice over, so it looks like even more.

1. An EV with three or more independent groups being compared

E.g. the % correct scores for third singular s of three groups of learners are compared. The basic ANOVA result says whether there is a significant relationship between the EV and the DV, a difference somewhere among the groups, but not exactly where. If the overall result is sig, then to see which pairs of groups are sig different you need to do post hoc tests. Whether you do the ANOVA via Compare Means... Oneway ANOVA or via General Linear Model... Univariate, you get many, many ways of doing the post hoc test offered under the Post Hoc option. Tukey HSD is a common safe bet.
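Outside SPSS, the same kind of comparison can be sketched in Python; the following is only an illustration using statsmodels' pairwise_tukeyhsd with invented scores for three groups, not a description of the SPSS output.

    import numpy as np
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    # Invented % correct scores for three groups of learners
    scores = np.array([55, 60, 58, 62, 70, 72, 68, 75, 85, 88, 90, 84], dtype=float)
    groups = np.array(['beg'] * 4 + ['int'] * 4 + ['adv'] * 4)

    # Tukey HSD compares every pair of group means, with protection
    # built in for the multiple comparisons
    print(pairwise_tukeyhsd(scores, groups, alpha=0.05))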

Basic post hoc tests compare every pair of means. But suppose your groups were two of learners and one of native speakers, and you plan to compare the two learner groups with the NS group (which may be thought of as a control group) but not with each other. These are often called planned comparisons, and you would do better not to use the post hoc tests, which compare every pair and so are weaker (less likely to identify sig differences). You get this sort of limited comparison in Analyze... General Linear Model... Univariate: enter your DV as usual and the three-group variable as a fixed factor. This does a oneway ANOVA exactly like the one you get with Compare Means... Oneway, except that it gives you some extra options. If you click Contrasts, click the contrast option to get Simple, and then click First or Last depending on whether the control group is numbered 1 or 3... then (don't forget) click Change... then Continue, then OK... you get an output that just does those limited paired comparisons.

2. An EV with three or more repeated measures conditions being compared

E.g. you compare the same people's fluency speaking to the teacher, to peers and to parents. You want to compare each pair of those conditions afterwards. In General Linear Model... Repeated Measures you have to use not what is labelled Post Hoc but rather Options: click the variables into Display Means, tick Compare main effects, and below that choose Bonferroni. This in effect uses t tests with a simple Bonferroni adjustment for multiple comparisons to compare the pairs of means. Not ideal, because it is overcautious: i.e. likely to lead to you missing a difference that is actually sig. SPSS should really make Tukey etc. available in repeated measures as well as independent groups comparisons. Alternatively you can do your own Tukey test as described below.
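The same idea (paired t tests with a Bonferroni correction) can be sketched outside SPSS; this is only an illustration with invented fluency scores and my own variable names.

    from itertools import combinations
    from scipy.stats import ttest_rel

    # Invented fluency scores for the same six people in three conditions
    conditions = {
        'teacher': [3.2, 2.8, 3.5, 3.0, 2.9, 3.4],
        'peers':   [3.8, 3.5, 3.9, 3.6, 3.3, 4.0],
        'parents': [3.0, 2.7, 3.3, 2.9, 2.8, 3.1],
    }

    pairs = list(combinations(conditions, 2))   # 3 conditions -> 3 pairs
    for a, b in pairs:
        t, p = ttest_rel(conditions[a], conditions[b])
        # Bonferroni: multiply each p by the number of comparisons (capped at 1)
        print(a, 'vs', b, 'adjusted p =', min(p * len(pairs), 1.0))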


Once again you can alternatively choose limited planned comparisons via the Contrasts option, as above.

3. Interaction in a two way ANOVA with both EVs as groupings

Where there are two EVs that are groupings, the interaction always involves at least 4 subgroups. Even if both variables are just two groups, like male-female and upper class-middle class, the interaction has four groups involved and, if the interaction is sig, you might want to know which pairs of those are producing that result, beyond just guessing from a suitable graph.

SPSS does not deal with post hoc for interactions, but in some instances you can do it yourself fairly simply with a calculator. For instance you can do a Tukey test for pairwise differences when you get a sig interaction in a two way ANOVA with two independent groups factors, where all groups have the same number of subjects in them.

Calculate T = q x the square root of (error mean square / number of people in each group)

The error mean square or error variance is in the original ANOVA table in the output. q is found from the table of the Tukey statistic (ask me for it, or see a serious stats textbook which has it in the back; I can't include it here for copyright reasons). Read off the column for the number of means being compared pairwise, and the row for the df of the error variance/mean square (from the ANOVA table).

Then calculate T, and any pair of means differing by more than T is sig different.

If the groups are different sizes, or you wish to save effort, do t tests with Bonferroni adjustment.
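If it is easier, the critical q can be looked up from the studentized range distribution in Python rather than from a printed table; the ANOVA figures below are invented purely for illustration.

    from math import sqrt
    from scipy.stats import studentized_range

    ms_error   = 12.5    # error mean square from the ANOVA table (invented)
    df_error   = 36      # df of that error term (invented)
    n_per_cell = 10      # number of subjects in each subgroup (equal sizes assumed)
    k_means    = 4       # number of means being compared pairwise

    q = studentized_range.ppf(0.95, k_means, df_error)   # critical q at the .05 level
    T = q * sqrt(ms_error / n_per_cell)
    print(T)   # any pair of means differing by more than T is sig different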

4. Interaction in a two way ANOVA with both EVs as repeated measures

As for 3. OR treat it as a oneway repeated measures situation: enter all the repeated measures columns as if there were just one factor, not two, and follow 2 above. That in effect does the post hoc for the interaction.

5. Mixed independent groups and repeated measures ANOVAs

As usual, if the result in ANOVA is significant, and more than two means are being compared, one needs follow-up tests to see which pairs of means are significantly different (or be happy just to judge it visually from a graph). Each main effect involving 3 or more levels can be dealt with as above, but the interactions are more of a problem.

Take five repeated measures conditions and two groups.

One can get the main effect multiple comparisons done by SPSS with suitable adjustments as described in (2) above (i.e. comparing results on the five conditions with each other in pairs, for the whole sample of subjects, lumping both groups together). In fact if one wants all of them there are 10 comparisons, because there are five conditions, so (5 x 4) / 2 paired comparisons.


In the interaction, since there are 10 means involved, for all 5 conditions and two groups, there are (10 x 9) / 2 comparisons potentially, which makes 45.

One can do some of the interaction paired comparisons by splitting the file and getting SPSS to use the Bonferroni option again. Those are the comparisons of each condition with each other condition within each group separately: 10 comparisons in each group = 20 in all.

That leaves 25 comparisons that you could not do with any post hoc procedure in SPSS as far as I know: the comparisons between each of the 5 means for one group and the five for the other. Ordinary t tests do not build in any reduction for multiple comparisons the way post hoc tests do. However, a simple adjustment by hand is to use the t test but require stricter sig levels. In fact this is really making the Bonferroni adjustment oneself.

The account immediately above assumed that there was no a priori reason to be interested in any of those 25 pairs more than any other... it was a DIY post hoc solution.

However, it could be that, for theoretical reasons or whatever, you were not interested in comparing every pair of means, only certain ones. In particular:

- the comparisons of all 5 conditions within each group, done OK with split file and Bonferroni adjustment..... 20 comparisons

- the comparison of each group with the other on each condition separately. That is in fact only 5 comparisons out of the 25 possible other ones (i.e. you have no interest in comparisons like that between one group on condition A and the other group on condition C, and so on). You want to claim, in this instance, that these were what are called 'planned comparisons', not the usual post hoc 'try everything' type. Then you could reduce the required sig value of the t test for this part by dividing by 5, not 25, in the Bonferroni adjustment....

In general, then, where there is no post hoc test available in SPSS, the simple but crude solution is to use ordinary pair comparison statistical tests, but divide the target sig level by the number of potential comparisons you COULD make, or PLANNED to make, to compensate for making multiple comparisons. However, this is cruder than using post hoc tests, which take care of this better. You are more likely to miss sig differences (a so-called Type II error).
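A minimal sketch of that DIY adjustment, with invented scores and assuming the five planned group-versus-group comparisons just described:

    from scipy.stats import ttest_ind

    alpha = 0.05
    n_planned = 5                  # group A vs group B on each of the 5 conditions only
    alpha_adj = alpha / n_planned  # Bonferroni-adjusted criterion = .01

    # Invented scores for one condition, two independent groups
    group_a = [12, 15, 14, 16, 13, 15]
    group_b = [18, 20, 17, 21, 19, 22]
    t, p = ttest_ind(group_a, group_b)
    print('significant after adjustment?', p < alpha_adj)
    # Repeat for each of the other four conditions, keeping the same adjusted criterion.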

You don't get a sig result and you want to know how big a sample you would need to get one

If you have gathered data, especially in a pilot study, and not got a significant result, you may want to know how big a sample you would need to make the result significant. Remember, if you choose a big enough sample, even a very small difference or relationship may be significant. So if you have the possibility available of increasing the size of the sample (i.e. there are more subjects or cases available), and are desperate to get a significant result, it would be useful to know how many subjects would be ideal.

Some books give formulae to calculate how big a sample you need, but they don't necessarily fit the situations you have straightforwardly. The following is my best suggestion for an easy way to get an estimate of the required sample size using SPSS facilities.


Basically you create imaginary larger samples simply by using your subjects more than once. Suppose you have 20 subjects and p = .231 for whatever test you are interested in. You get SPSS to think that you have three times as many subjects, simply by getting each subject counted three times, and run the test again. Say then p = .09. Then you get SPSS to think you have four times as many subjects, counting each of your twenty four times, and see again what happens. By trial and error you get to the point where p = .05, and that gives an estimate of the minimum number of subjects you need to get a significant result.

To get SPSS to count a subject more than once you weight the data, similar to how you may be familiar with doing elsewhere. At Transform... Compute you nominate a new target variable which you might call incr (since it will tell SPSS how many times to increase your sample size). You then enter in the numeric expression space whatever you want the weighting to be. You could start with a weighting like 2. Click OK and you will find a new column called incr with 2 repeated all the way down. If you now go to Data... Weight Cases and weight the data by that column, then SPSS sees your data as having twice as many cases, counting each one twice.

Now do your analysis again and see if it is significant. Go on altering the weighting figure in the incr column via Transform... Compute repeatedly, and redoing the analysis, until you get a sig difference or relationship. Note that you can enter partial weightings like 3.5 as well.

When by trial and error you achieve a weighting that gives a significant result, multiply it by your original sample size to see how many subjects you would need. E.g. if your sample from two groups was 20 in all, but you only get a sig difference with a weighting of 3.8, then you need at least 20 x 3.8 subjects (= 76), in similar proportions in the two groups as before, to have a chance of getting a sig difference.
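The same trial-and-error logic can be mimicked in Python by simply duplicating the cases and re-running the test; this is only a rough sketch with invented pilot data (the scores and the choice of weights are made up).

    import numpy as np
    from scipy.stats import ttest_ind

    # Invented pilot data: two groups of 10, difference not significant as it stands
    group_a = np.array([52, 55, 60, 58, 49, 62, 57, 54, 59, 61], dtype=float)
    group_b = np.array([54, 58, 62, 60, 52, 64, 59, 57, 61, 65], dtype=float)

    for weight in (1, 2, 3, 4, 5):
        # count every case 'weight' times, as the SPSS weighting trick does
        t, p = ttest_ind(np.tile(group_a, weight), np.tile(group_b, weight))
        print(weight, round(p, 4))
    # The smallest weight giving p < .05, times the original N, estimates the
    # sample size needed (subject to the cautions below).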

Cautions. You have to make sure the new bigger sample IS from the same population as the old one (in the case of comparisons of groups, of course, several populations may be involved). Even then, any method of estimating the required sample size is only approximate, because even truly random samples can vary a lot. Also, with an increase in sample size the actual difference or relationship you are interested in may not actually get any bigger; it is just more likely to be significant. I.e. you may end up showing that there is indeed a non-zero difference or relationship in the population (which is what significant means), but not that it is a very large one.

    Group % scores

Twenty people in two groups are each measured for the number of times they use the third person s out of all the occasions or loci when they had an opportunity to (often called potential occurrences). Very many linguistic features are measured this way in acquisition and sociolinguistic research. In the former it is often a matter of how often the correct form (in NS terms) is used, as against some erroneous form or omission, on occasions where there was an opportunity to use it; in the latter it is often a matter of how often one variant is used out of the two or more that make up a sociolinguistic variable.

In all these situations there are two ways of summarising and graphing the data: 1) the group way and 2) the individual way.


Either 1) you add up all the potential occurrences for each group, and all the occurrences of the form of interest, and express the second as a percentage of the first for each group.

Or 2) you calculate a % score for each person using their individual frequency of the form of interest and their individual number of potential occurrences. Then for each group you can calculate the average (mean) % score for that group from the individual scores of its members. However, you have to be aware that this can be a bit misleading for cases whose number of potential occurrences is small: getting one out of one right is 100%, just as much as getting 20 right out of 20 possible occasions! It is common to require at least 5 potential occurrences, and otherwise treat a case as missing data.

It is easy to show that the group figures may not come out the same. Here we imagine figures for a group of two people and see what happens:

Method 1
              Frequency of         Number of potential    % occurrence of
              form of interest     occurrences            form of interest
  Person 1    4                    16                     25%
  Person 2    8                    10                     80%
  Total       12                   26
  Group %                                                 (12/26) x 100 = 46.2%

Method 2
              Frequency of         Number of potential    % occurrence of
              form of interest     occurrences            form of interest
  Person 1    4                    16                     25%
  Person 2    8                    10                     80%
  Mean % for group                                        (25 + 80)/2 = 52.5%

In fact the two methods will come out the same only when all subjects had the same number of potential occurrences (e.g. in a test or list reading task).

Many SLA and sociolinguistic studies use method 1. That is fine, if you wish, for the purposes of giving descriptive statistics and making graphs, provided you make it clear what you are doing, and are aware of the difference from the other method.

BUT for any inferential statistics you should use method 2, entering the data in SPSS in the form of one row per person, with a % score for each person. Then, to compare two groups, for example, you use the independent groups t test on the two sets of scores.
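The two-person example from the tables above, worked in a few lines of Python just to show where the two figures come from:

    occurrences = [4, 8]     # frequency of the form of interest, person 1 and person 2
    potential   = [16, 10]   # each person's potential occurrences (loci)

    # Method 1: pool the raw counts for the whole group, then take one %
    group_pct = 100 * sum(occurrences) / sum(potential)

    # Method 2: a % score per person, then the mean of those individual scores
    individual_pct = [100 * o / p for o, p in zip(occurrences, potential)]
    mean_pct = sum(individual_pct) / len(individual_pct)

    print(group_pct, mean_pct)   # about 46.2 versus 52.5: the two methods disagree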


If you were to attempt inferential statistics on the total figures of method 1, you would have to use the numbers of individual occurrences regardless of people. I.e. if the example above were for one group, you would represent that group with the proportions 12 and 14 (i.e. 12 occurrences of the form of interest versus 14 non-occurrences, making up the total of 26 potential occurrences) and compare those with the overall proportions for the other group being compared. The test for that is chi squared, and you do see this used even in some published work for data like this. However, there are at least two major problems with this which would lead most statisticians to regard it as a misuse of chi squared.

- As for all significance tests, the basic observations (cases) which enter into the test have to be independent of each other. Now in method 2 the cases are the people, and there is no problem in seeing scores from different people as being independent of each other. However, in method 1 the 26 occurrences in the example are the cases, and clearly while some of those are independent of each other (being from different people), some are likely not (being from the same person).

- There is also an expectation that the populations sampled are homogeneous. From what we have just said, that is clearly not the case in method 1: the 26 observations representing one group in the example are a mixture. It cannot be said that each observation is from one population: it is from a mixture of a population of people and the populations of occurrences of each separate person.

The only instances where chi squared and method 1 might be defensible would be where the numbers of potential occurrences are very small, amounting to little more than one or two per person included; OR where all the potential and actual occurrences come from just one person per group, though that still does not deal with the independence problem; OR where you feel able to argue that responses from the same person are as independent as if they were from different people. There is a tradition of phoneticians making this tacit assumption for things like VOT, on the belief that such things are beyond the person's ability to control.

    Rounding interval scores

Just checking.... do we know how to round figures on interval scales? The mean of a set of scores may come out as 6.3597, but often we want to express this in shorter form, such as 6.36 or 6.4. Quoting long strings of numbers after the decimal point can look as if you are just trying to impress with loads of numbers. Or it may be that you are trying to make up for sloppy METHOD by being super-detailed in the figures quoted in RESULTS.... Best not to do that, since one's measurement is unlikely to be so accurate that more than two decimal places are relevant (except perhaps where a computer has measured something for you, like response time...). Generally use three or two decimal places for sig/p values, and two or one for everything else. Keep it intelligible and round numbers where necessary. But where do you round up, and where down?

    Task

    Just round the following figures to two decimal places:

    3.852 0.679 18.505 1.006 7.597 20.955 0.602

SPSS often rounds figures on screen (e.g. in the data grid) even though it is holding longer versions in its memory. You can select for each column how many decimal places it shows on the Data View window.

Answer to the above: 3.85, 0.68, 18.51, 1.01, 7.60, 20.96, 0.60
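Incidentally, if you ever round with Python rather than by hand, note that the built-in round() uses a different convention for halves; the decimal module reproduces the familiar 'round half up' rule used in the answers above. A small sketch:

    from decimal import Decimal, ROUND_HALF_UP

    for figure in ['3.852', '0.679', '18.505', '1.006', '7.597', '20.955', '0.602']:
        # quantize to two decimal places, rounding halves upwards (e.g. 18.505 -> 18.51)
        print(Decimal(figure).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP))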

    Decoding interval scores expressed in E notation in SPSS output

Sometimes SPSS produces numbers like 7.012E-02. This is not 7.012... it is 0.07012.

The E with a minus sign signals the number of places the decimal point has to be moved to the left.

So 1.369E-03 = 0.001369, etc. The E is a shorthand so as not to have to write a load of noughts.

Always convert any such figures into the familiar form if you report them in your work.

Correspondingly, 7.012E+02 would indicate 701.2.
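If you want to check a conversion quickly, most software reads this notation directly; a one-line check in Python:

    print(float('7.012E-02'), float('1.369E-03'), float('7.012E+02'))   # 0.07012 0.001369 701.2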

    Combining columns of scores for separate items in a test etc. to give a total or average score

Where a test or other instrument produces scores for separate items which then need to be added up to give a total score for a variable, one could of course add them up off the computer and just enter the totals. However, to check on internal reliability, or to do an analysis by items in addition, or to filter response times and exclude some, you will need the scores for every item in a separate column, so you will have to enter the data in full.

To then add columns, use Transform... Compute in SPSS to create a new column that totals the separate ones. You enter the title of the new summary column top left in the dialog box, and click the column names to be added into the top right space, with + between them. That creates a new column of totals.

However, anyone with a score missing in any column will be missed out and their total will come out as missing.

If there are missing values in some columns, marked in SPSS by a '.', where subjects failed to respond or have unanalysable data, you will probably want each person's total really to be the average score over all the items they answered, not the total (unless you have some reason to count missing as the same as wrong and so score it 0). You can get this by, in Transform... Compute, inserting in the Numeric Expression box the function MEAN(numexpr, numexpr, ...) from the functions list, and putting the relevant column labels in the brackets separated by commas. I.e. if you have a set of three items whose scores are in columns item1, item2, item3, then you would enter MEAN(item1, item2, item3) in the Numeric Expression box. SPSS then generates a new column with the average score of each case on the three items or, if they answered fewer, over the ones they answered.

Similarly, if you want to just add, not average, a set of columns, using whatever scores are available, then to avoid the people with missing values being recorded as having a zero total, use SUM(numexpr, numexpr, ...) in the same way as described for means above.
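For anyone handling the same data outside SPSS, an equivalent in Python with pandas might look like this (the column names item1-item3 are just the example above; skipna=True mirrors the way MEAN() averages over whichever items were answered).

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'item1': [1, 0, 1],
                       'item2': [1, 1, np.nan],   # one missing response
                       'item3': [0, 1, 1]})

    items = ['item1', 'item2', 'item3']
    df['mean_score'] = df[items].mean(axis=1, skipna=True)   # average over answered items
    df['sum_score']  = df[items].sum(axis=1, skipna=True)    # total of whatever is there
    print(df)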


    Cutting an interval scale into ordered categories

A common example is deriving a grouping of subjects from something you measured about them originally on a numerical scale: an explanatory variable such as their age, English proficiency, extraversion, etc. This is often done casually without due thought, and often in peculiar, idiosyncratic ways by novice researchers, but above all it needs careful thought about why it is done, and how.

Before you do this at all, you need to ask if it is necessary at all. Just because some other researcher had a high prof group and a low prof one does not mean you necessarily have to have groups. When you derive such groupings from scores originally recorded on a continuous interval scale, obviously you lose some information. One person may be a bit better than another on the original scores, but once you decide they both belong in the high prof group, or whatever, they are treated as identical in any further tests. This may or may not help produce the result you want. Certainly how you divide subjects into groups, if you do, can drastically affect the result!

    Reasons for cutting

There are a number of reasons: some statistical, some related more to research methods, design and hypotheses.

1. A few statistical techniques require interval scales to be reduced to a binary grouping. Implicational scaling (scalogram analysis) is one method of statistical analysis used in acquisition research that requires this: subjects simply have to be categorised as having acquired or mastered each feature of interest or not. So also varbrule analysis requires groupings of people who use or don't use some form of sociolinguistic interest.

2. If the true interval nature of a scale is in doubt, that could be a reason to reduce it to categories (though reduction to rank order would lose less information).

3. If you retain the original scores and look at relationships with other (dependent) variables, then you are typically into the statistics of correlation, and maybe multiple regression. If you form groups, then you can identify a mean for each group on the other variables of interest and compare those means with t tests, ANOVA etc. Both methods will show relationships between EVs and DVs, but the second will be better (or at least easier in SPSS) for dealing with:

i. nonlinear relations, e.g. where high and low proficiency subjects perform similarly on some other variable of interest, compared with intermediate subjects;

ii. interactions between different EVs, e.g. where you want to see the combined effects of gender and prof on something: do high prof females differ from high prof males in the same way as low prof females differ from low prof males?;

iii. designs involving repeated measures.

4. The goal of the research may be exploratory: precisely to discover useful categories of subjects.

5. You may wish to identify extreme groups of subjects for comparison. E.g. you want to compare bilinguals who are English dominant with those who are Welsh dominant. You do not want more or less balanced bilinguals. So you measure the bilingual dominance of a sample and will reject the middle scorers, keeping two extreme groups.


6. You need categories to form the IV in an experiment. E.g. you want words of three levels of frequency to be the stimuli for three conditions in an experiment. Or maybe you want extreme stimuli: just frequent and rare. Either way you need groups of words, as it is difficult to use an interval-scored variable directly as the EV in a repeated measures design.

    Means of cutting

OK, so you still want to make groups. There are many ways of doing it, and to some extent they match the reasons above. The principles apply to any interval-scored variable that is to be turned into a grouping. The issue is where to cut the original interval scale so as to obtain two or more groups of cases. (A small worked sketch in Python follows at the end of this list of methods.)

Cutting at a priori scale values. That is, cutting at pre-decided score values on the scale, which would be the same whatever sample you gathered. These values may or may not have some absolute meaning of the criterion-referenced type. Cf. Reason 1 above. Such a point could be:

One used arbitrarily by previous researchers. Not necessarily a good way to do it if it has no sound basis, other than that it then enables you later to compare your results directly with those of other researchers.

The pass mark used in a particular institution for some English exam, or a succession of such marks, e.g. corresponding to what are called grades A, B, C, D in some institution. Again such points may be fairly arbitrary, but perhaps meaningful for your research in allowing you to contextualise it.

Grades with some universal absolute meaning associated with them, maybe in a professional published test you have used. E.g. you divide subjects into those who got grade 6 or better in the IELTS test, and those who scored worse, given the widespread use of this value as a criterion for entry to UK universities. Ranges of scores on the Jacobs instrument for assessing EFL written compositions, and many international language tests, have proficiency definitions associated with them. A different example of this type is to divide a five point rating scale of the type strongly agree - agree - neutral - disagree - strongly disagree into just two categories: those who showed some agreement (i.e. the top 2 choices) versus the rest, who disagreed or were indifferent. This uses a division point with some clear meaning of its own (but why then did one not ask the question in the first place as a two choice item?).

The score on a variable scored as % correct which is conventionally regarded as indicative that someone has acquired a feature. Acquisition researchers vary in what they think this score is, but 80% or higher correct use of, say, third person s would be regarded by many as enough to put a subject in the group they would say has acquired the feature. Others argue that only 100% correct indicates true acquisition; others that any correct use greater than 0% indicates acquisition has occurred. Again, others use other scores, like the number of occurrences of a structure in 5 hours of observation (Bloom and Lahey 1978: 328), 5 or more indicating acquisition.


The score on a variable scored as % use of one alternative which is conventionally regarded as indicative that someone is a clear user of that alternative. Labov in his famous department store study divided subjects into three groups: those using no [r] sounds in the words 'fourth floor' said twice, those using them on all four possible occasions (categorical users), and those in between (i.e. variable users).

Scores defined by how some other relevant group of people performed on the same test or measure. E.g. for learners you might make use of the mean score of native speakers doing the test (a criterion group), or perhaps the score which only 15% of NS do better than (the 85th percentile). Alternatively one might rely on the mean score that large numbers of learners of the same sort as one's own testees gained in other research (a reference group). The latter is not often available in language research; it is more a feature of standardised NS tests like the British Picture Vocabulary Scale and so on.

Cutting the score scale into halves or equal lengths. That is only easy if the scale has fixed ends, such as a % score scale, or a test marked out of 40. E.g. you make four groups: those who scored between 0 and 10, 11-20, etc. (being careful not to label them with overlapping ranges 0-10, 10-20, 20-30). This is often not very meaningful unless the scale has some absolute meaning, so that half marks actually means half knowledge of something beyond the test items, and it produces unequal sized groups. Also it may not even be possible to achieve quite equal lengths with ease (0-10 actually covers one more point than 11-20!). However, it is a system that can be used with the same cutting scores on any sample, like the above but unlike those below. Mitchelmore (1981) suggests that the scale should not be cut into lengths that are too short, so as to avoid misclassification: lengths should not be shorter than 1.635 x SD of scores x (1 - reliability). Possibly useful for Reason 2 above.

Cutting so as to achieve equal numbers of subjects/cases in each group. Technically this uses the median and quartiles. I.e. if you had scored 30 people and want two groups, you simply put them in rank order on the basis of their scores, and the top 15 (those above the median score) become the high prof group, those below the low prof one. The cutting score obviously will differ for different samples and has no real meaning, but generally it is better for later comparisons if groups have more or less the same numbers of subjects in them. Often used for Reasons 2, 3 and 6 above.

Cutting at the mean, and points related to it. E.g. you divide into those who scored above the mean (average) and those below. Or four groups: those scoring more than one SD above the mean, those more than one SD below the mean, those between the mean and one SD above, those between the mean and one SD below it. To get three groups you might use the mean plus or minus half the SD as cutting points. The mean, like the median, is entirely relative to a particular sample, of course. The problem with dividing at the mean is that usually many cases score near the mean, so cases very close to each other will get put in different groups. If the original scoring is not perfectly reliable, that in turn means that some cases may be misclassified.

Cutting into natural groups using low points in the distribution shape. This is a simple form of cluster analysis and simply looks to see if the subjects in the sample seem to have grouped themselves (cf. Reason 4 above, and also maybe 2 and 3). I.e. looking at a histogram of scores, are there two or more heaps with a low point on the scale where few scored? Then make the cutting score the middle of the low point(s). This of course decides both where to cut and, unlike most methods, how many groups to identify.


It may vary from sample to sample, but it does reflect the nature of a particular sample better than some of the above methods. It will not work if the histogram is simply one heap (e.g. with the normal distribution shape), though sometimes rescaling the histogram with finer divisions may reveal what an initial SPSS histogram may conceal. As an example, the scores of 217 subjects on a College English exam in Pakistan are graphed below, and it is fairly clear that there are two groups in the sample: those scoring above 58 or so and those below. By comparison, the median score, above and below which are equal numbers of cases, is 50 for this data and appears rather arbitrarily to divide people within one of the groups that they seem naturally to form.

With all the above methods, but especially the third, researchers may choose to use extreme groups only. Often where a researcher wants to get clear differences between groups later, he/she will help this along a bit by, say, using the top third and the bottom third of subjects and missing out the middle third in any later comparisons. Reasons 5 and 6 above.
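Here is the promised sketch of three of the cutting methods above, done in Python with pandas on invented proficiency scores; the cut points and labels are mine, purely for illustration.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    scores = pd.Series(rng.normal(60, 12, size=30).round())   # invented proficiency scores

    # Median split: equal numbers of cases in each group
    median_groups = pd.qcut(scores, 2, labels=['low', 'high'])

    # Equal-length bands on a fixed 0-100 scale
    band_groups = pd.cut(scores, bins=[0, 25, 50, 75, 100],
                         labels=['0-25', '26-50', '51-75', '76-100'])

    # Mean plus/minus half an SD as cut points, giving three groups
    m, s = scores.mean(), scores.std()
    mean_groups = pd.cut(scores, bins=[-np.inf, m - s / 2, m + s / 2, np.inf],
                         labels=['low', 'mid', 'high'])

    print(pd.concat([scores, median_groups, band_groups, mean_groups], axis=1))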

However you cut, you have to be careful how you speak. Very often you will call the groups you make the 'high proficiency group' and the 'low proficiency group', or the like. But unless your original test that produced the scores was a criterion-referenced one, deciding some absolute level of prof for each taker, with international equivalence, this can be misleading. Very often the proficiency test researchers use was a cloze test you put together yourself, or the like. It may well distinguish students with higher proficiency from those with lower, in the sample of students you are using. But that does not mean there is any equivalence with what were called 'high prof' students by some other researcher who used a different test with a different sample in another country. It could be that all his students, high or low prof, are no better than the worst of your low prof group, and so on. Only if some standard published test such as FCE or TOEFL was used by all could you match up across studies and see if there was any real comparability between so-called high prof students in different studies. In fact close examination shows that many variables used in research have no absolute definitions of scale points, and most of the above ways of dividing cases into groups only distinguish in a relative way between who/what has more of something or less, not exactly how much.


    The size of the standard deviation

One is quite used to having SPSS calculate the SD along with the mean (= average) of a set of scores (i.e. for any interval scale).

We are also used to the idea that the SD measures the spread of scores around the mean. If all cases scored the same, the SD would be 0. The bigger the SD, the more spread out the scores of different cases are: the more subjects are disagreeing with each other in their scores. And the more that happens within groups, the harder it usually is to show any convincing differences between groups. Similar concepts to the SD are what statisticians call variance and error. These measures are slightly different, but all, roughly, are averages of the differences between each case's score and the mean. If all cases score the same, which will also be the mean score, then their differences from the mean are 0, so SD = 0.

Sometimes SPSS fails to perform a procedure because of a problem of 'zero variance'. That means it found that one of your groups, on one of the variables measured, had an SD of 0: all cases scored the same. This makes certain statistical procedures impossible: they involve variables and cannot work if everyone scores the same, as then you have not a variable but a constant. You cannot answer the question 'what is the relationship between age and reading ability?' if you have obtained data from a sample who are actually all of the same age!

So we know what an SD of 0 means, but what about big SDs? There is often no simple maximum value that the SD can have. But there are some guides to help assess the size of an SD:

It may often be of more interest whether different groups or conditions show similar or different variation (SD) than how great the SD actually is. In general you assess the size of an SD for each sample group separately.

If your scores are on a scale with both ends logically fixed (e.g. a test scored out of 40), then the maximum possible SD, if cases were maximally varied in scores, is half the scale length (well, actually it will be a shade above that for small numbers of cases, but that is a useful rule of thumb). So you can assess the size of an SD you get in relation to that. An SD would usually be regarded as big if it was even as much as half that maximum (i.e. a quarter of the scale length). On a scale of % correct scores, half the scale is 50. Note that on a five point rating scale running 1-5, half the scale length is 2. On such scales of course the mean is also limited: it cannot be a figure outside the end points of the scale. That places further limits on the size of the SD: the nearer the mean is to the limit of the scale, the smaller the maximum possible SD.

If your scores are on a scale with one or both ends virtually open, then the SD (and the mean) could be indefinitely large. In language research many scales are fixed at one end on zero, but open at the other. E.g. word frequency: words cannot occur less than 0 times, but there is no clear upper limit to how often they can be observed. So also sentence length: sentences cannot be shorter than one word, but they can be indefinitely long. Response times in milliseconds have hazier limits: there is no definite upper limit to how long anyone can take to respond to a stimulus and, although technically there is a lower limit of zero, nobody can really respond in zero milliseconds, so there is an indeterminate lower limit to fast responses. With these scales it is harder to say what is a big SD, but one can use some yardsticks:

One can use the maximum and minimum scores that occur in one's data as indications of the effective limits of the scale, and as above treat an SD larger than a quarter of the distance between them as large. For a scale fixed at one end, one could use the distance between the bottom limit and the highest observed score.

With scales fixed at the bottom end, but open at the high end, the distribution is often positively skewed: scores are heaped near the bottom limit and tail off to the right. In that situation the SD can be, and often has to be, greater than the mean, though if the distribution has a perfect Poisson shape, the mean = the square of the SD. If the mean is some way above the bottom limit, and that limit is 0, and the distribution is more symmetrical, then people sometimes assess an SD in relation to the mean: if the SD is as much as or more than half the mean, that indicates very substantial variation among the scores of a group.

Always look at the distribution shape on a histogram as well as the mean and SD. The shape may reveal more than anything else.
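Purely as an illustration of the rules of thumb above (this is arithmetic you can do anywhere, not something SPSS-specific), here is a minimal sketch in Python. The scores and variable names are invented for the example: it computes the SD of a set of test scores and compares it with a quarter of the scale length (for a scale fixed at both ends) and with half the mean (for a scale fixed only at zero).

    import statistics

    # Invented scores on a test marked out of 40 (scale fixed at both ends: 0-40)
    scores = [12, 25, 31, 18, 22, 27, 35, 14, 20, 29]

    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)        # sample SD, as SPSS reports it

    scale_length = 40 - 0
    print("mean =", round(mean, 2), " SD =", round(sd, 2))

    # Fixed-scale rule of thumb: the max possible SD is roughly half the scale
    # length, so an SD of a quarter of the scale length is already 'big'
    print("quarter of scale length =", scale_length / 4)

    # Open-ended-scale rule of thumb: an SD of half the mean or more
    # indicates very substantial variation within the group
    print("half the mean =", round(mean / 2, 2))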

    How to treat rating scale responses

An old problem is how to handle responses to items recorded on scales such as

strongly agree / agree / neutral / disagree / strongly disagree
always / often / sometimes / never

They are rating scales (not usually called multiple choice). They are clearly ordered choices and there is uncertainty whether they are really best thought of, and treated statistically, as

Ordered categories: so you present the results in bar charts, report the % of people who responded in each category on the scale, and use ordered category statistics to analyse relationships with other variables.

OR

Interval scores: so you assign a score number to each point on the scale and present the results as a histogram, report the mean and SD of the scores of a group, and use t tests, Pearson correlation or whatever when comparing groups or looking for relationships. The numbering could be e.g. strongly disagree = 0, disagree = 1, and so on; or if you prefer strongly disagree = -2, disagree = -1, neutral = 0 etc.

Generally it is far easier for any statistical handling to treat the data the interval score way, as the stats for interval scores are better known and more versatile in what they can do. The results are usually easier to absorb as well. Suppose two groups are asked how far they agree that a CALL activity is easy to understand; Group B is of a higher English level than A. Is it easier to derive some meaning from being told:

In group A the response was: strongly agree 43.3%, agree 20%, neutral 13.3%, disagree 13.3%, strongly disagree 10%. In group B it was: strongly agree 30%, agree 30%, neutral 10%, disagree 30%, strongly disagree 0%. The difference between the two groups is not significant (Kolmogorov-Smirnov Z = 0.365, p = 0.999).

    OR from


The mean agreement response (on a scale from -2 for strong disagreement to +2 for strong agreement) was 0.73 in group A and 0.6 in group B. Variation was similar in the two groups, and moderately high (SDs 1.41, 1.26). The difference between the groups is not significant (t = 0.265, p = 0.793).

I know which I find easier to follow!

So I advise going for the second interpretation wherever possible, but making sure that when you use such scales, the way they are used in the data gathering itself justifies this interpretation. In particular:

Make sure the words used for the points of the scale do suggest more or less equal intervals between one point and the next, otherwise the interval interpretation is invalid.

Accompany the wording with figures in the version presented to respondents, so they are encouraged to think of the scale as a numerical one, with equal intervals between the numbers.
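As an illustration of the interval-score treatment described above (outside SPSS, just to show the arithmetic), here is a minimal sketch in Python: verbal responses are recoded as numbers on the -2 to +2 scale, group means and SDs are reported, and an independent-samples t test compares the groups. The raw responses below are invented, not the real figures from the CALL example.

    from statistics import mean, stdev
    from scipy import stats

    # Map the verbal scale points onto interval-style scores (-2 to +2)
    coding = {"strongly disagree": -2, "disagree": -1, "neutral": 0,
              "agree": 1, "strongly agree": 2}

    # Invented raw responses for two groups
    group_a = ["strongly agree", "agree", "neutral", "disagree", "strongly agree",
               "agree", "strongly disagree", "agree", "neutral", "strongly agree"]
    group_b = ["agree", "agree", "neutral", "disagree", "strongly agree",
               "disagree", "agree", "neutral", "agree", "disagree"]

    a = [coding[r] for r in group_a]
    b = [coding[r] for r in group_b]

    print("Group A: mean %.2f SD %.2f" % (mean(a), stdev(a)))
    print("Group B: mean %.2f SD %.2f" % (mean(b), stdev(b)))

    # Independent-samples t test (equal variances assumed, as in the default SPSS row)
    res = stats.ttest_ind(a, b)
    print("t = %.3f, p = %.3f" % (res.statistic, res.pvalue))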

    Tests of prerequisites for parametric statistical tests

These tests of prerequisites are only of interest to check if the data is suitable for using some OTHER test that you are REALLY interested in, because it relates to your actual research questions or hypotheses. Tests of prerequisites generally apply where ANOVA/GLM is used, though researchers rarely report having made these checks, so we cannot tell if the checks were performed or not! You generally want them all to be nonsignificant, as that is what shows the data is straightforwardly suitable for parametric significance tests like ANOVA/GLM.

If a prerequisite test is failed then there may be alternatives within the parametric tests you can use to compensate, or weaker nonparametric tests you can use instead of straightforward ANOVA etc., or possible transformations of the data one could do... but often one has to just admit the data is not perfect for the procedure and carry on and use ANOVA anyway...

Their functions are as follows:

Any parametric significance tests... t tests, ANOVAs etc. all assume that the populations that the groups are from have distributions of scores that are normal in shape (i.e. that bell-shaped distribution you see in all the books). Check with the K-S test (though on small samples everything passes this test!!).

t test for 2 independent groups, and all ANOVAs involving comparisons of 2 or more groups (with or without also repeated measures). The groups need to each have a similar spread of scores within them round their respective means (= homogeneity of variance). Check with Levene's test, which (roughly) decides if the SDs of the groups could be from one population of SDs, so are similar, or not. The t test for 2 independent groups has alternative versions depending on whether this prerequisite Levene test is passed (nonsig) or not, but ANOVAs don't; they all assume the prerequisite test of equal variances is passed.

All ANOVAs involving comparisons of 3 or more repeated measures (with or without independent group comparisons as well). Here again the spreads of the scores in each condition need ideally to be similar. Strictly it is the covariation between each pair of conditions that needs to be similar (= sphericity). Check with Mauchly's test (which SPSS automatically gives you even where you only have two repeated measures, though it applies vacuously there and need not be looked at). The check, roughly speaking, looks at the correlation between the scores in each condition and those in each other condition, in pairs, and sees if the correlations could all be from a population with one correlation or not. The data would likely not pass if people who did better on condition A also did so on B but were the worst on C, and so on... If it is passed (nonsig) then you use the 'sphericity assumed' results in the ANOVA table, otherwise the ones below those (Greenhouse-Geisser).

ANOVAs with a mixture of repeated measure comparisons and independent groups. Here there is an extra requirement: the pattern of covariance between conditions in each group separately should also be similar between the groups. Check with Box's M test.
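For anyone checking the first two of these prerequisites outside SPSS, here is a minimal sketch in Python using SciPy. The group scores are invented; the point is simply that you want these p values to be nonsignificant (greater than .05). (Shapiro-Wilk is used here for normality; SPSS also offers the K-S test mentioned above.)

    from scipy import stats

    # Invented scores for two independent groups
    group1 = [12, 15, 14, 10, 13, 17, 11, 16, 14, 12]
    group2 = [18, 21, 17, 22, 19, 16, 20, 23, 18, 21]

    # Normality check for each group: nonsignificant p means no evidence
    # of departure from a normal-shaped distribution
    for name, g in [("group1", group1), ("group2", group2)]:
        w, p = stats.shapiro(g)
        print(name, "Shapiro-Wilk p = %.3f" % p)

    # Levene's test for homogeneity of variance across the groups:
    # nonsignificant p means the spreads can be treated as similar
    stat, p = stats.levene(group1, group2)
    print("Levene p = %.3f" % p)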

    Missing values

Missing values are where cases have scores or categorisations completely missing for some reason, where most cases did provide data. E.g. they gave no response, were uncooperative, or their response was unanalysable, etc. (Where subjects have taken a multi-item test or the like to produce their scores, then they may miss some items but still get a score for the test as a whole. That is a different issue: you have to decide there whether a missed-out item counts as wrong, or whether you allow people to miss items and as overall test score give them the average score for the set of items they did answer.) Missing values are usually entered in SPSS as a '.' in the space where a figure should be, unless you have assigned an actual number that you enter as indicating missing values, and declared it under Variable View → Missing.

If you have missing values there may be problems:

- You may have very few cases left that you can use in the required statistical analyses: especially in repeated measures and multivariate designs, if a case has data missing on one variable/condition included in an analysis, it gets left out totally (i.e. listwise).

- The missing values may not be random: certain kinds of subject may be more prone to produce them, so using the data without them, or with too few of them, will lead to a biased result. E.g. young versus older testees; lower versus middle class informants.

If you leave missing values in place, SPSS usually gives the choice (in Options for a given test) for you to treat them listwise or pairwise/test-by-test. This really applies to multiple analyses of the same data, as within one analysis it usually has to be listwise, meaning that the number of cases used is the maximum number that has a complete set of data across all the relevant columns. E.g. if in Correlation you want correlations done between every pair of variables in 5 columns: ten pairs, so ten analyses. The listwise option would get you correlations using just the cases with full data across all 5 columns, so the same number of cases would be used in each analysis. Pairwise would, for each analysis, use the maximum cases with data on both the relevant columns, so use more of the data, but different numbers of cases might well be used to calculate different correlations.
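As an aside for anyone checking the same logic outside SPSS, here is a minimal sketch in Python/pandas of the difference: pandas' DataFrame.corr() uses pairwise deletion by default, while dropping incomplete rows first gives the listwise result. The small dataset is invented for illustration.

    import pandas as pd
    import numpy as np

    # Five variables with scattered missing values (invented data)
    df = pd.DataFrame({
        "v1": [3, 5, 7, 4, 6, 8],
        "v2": [5, 7, 9, np.nan, 8, 10],
        "v3": [2, np.nan, 6, 3, 5, 7],
        "v4": [1, 2, 3, 4, np.nan, 6],
        "v5": [4, 6, 8, 5, 7, np.nan],
    })

    # Pairwise: each correlation uses all cases with data on that pair of columns
    pairwise = df.corr()

    # Listwise: only cases complete on every column enter any correlation
    listwise = df.dropna().corr()

    print("listwise n =", len(df.dropna()), "(pairwise n varies per correlation)")
    print(pairwise.round(2))
    print(listwise.round(2))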

If you want to fill in missing values, the main principle is that it should not be done in some way that will clearly directly influence the result you are interested in. I.e. you should not fill in the missing values following a principle that will obviously make the difference or relationship which is the focus of your actual research more marked.

Broadly there are two ways of filling in missings in any column in SPSS (where a column represents a variable, or a condition in repeated measures data):

A) You fill in with the mean of the scores in the column itself (or, if it is in categories, the mode, which is the most popular category in that column).

B) You fill in by predicting a score from the general correlation of that column with others in the data: the EM and regression methods.

Imagine data as follows:

C1   C2
 3    5
 5    7
 7    9
 4    .
 6    8

If the research question concerns whether there is a relationship between the two variables in C1 and C2 (correlational design), then you do NOT use method B, which would use the correlation that exists already in the data to fill in missing values. I.e. here, given the perfect positive correlation between the two sets of scores, method B would fill in the missing value as 6, predicting it from C1. But that will obviously enhance the perfection of the correlation which it is your aim to discover! So the mean of the second column (method A) would be a better fill-in value: 7.25.

If on the other hand this was data from the same subjects on the same DV scored in two conditions in C1 and C2 (repeated measures design), and the research interest is in the difference between the means of the scores in each column (do they score significantly higher in condition 2?), the better way to fill in the missing value would be method B. Method A would simply enhance the level of the mean of C2, and strengthen its distance from the mean of C1.
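Here is a minimal sketch in Python of the two fill-in methods applied to the toy data above. The regression fill-in uses a simple least-squares line fitted on the complete rows, which is a simplification for illustration, not a reproduction of SPSS's EM/regression routines.

    import numpy as np

    c1 = np.array([3.0, 5.0, 7.0, 4.0, 6.0])
    c2 = np.array([5.0, 7.0, 9.0, np.nan, 8.0])

    observed = ~np.isnan(c2)

    # Method A: fill the missing C2 value with the mean of the observed C2 scores
    fill_a = c2[observed].mean()                      # 7.25

    # Method B: predict the missing C2 value from C1 using a regression fitted on
    # the complete rows (the correlation here is perfect, so it predicts 6)
    slope, intercept = np.polyfit(c1[observed], c2[observed], 1)
    fill_b = slope * c1[~observed][0] + intercept     # 6.0

    print("Method A fill-in:", round(fill_a, 2))
    print("Method B fill-in:", round(fill_b, 2))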

For these reasons, when you run correlation-type statistics like Regression and Factor analysis, SPSS under Options offers you the choice to fill in missing values with the means (method A) as it operates. The data in Data View does not get visibly altered: you just find that all the cases have been used, instead of those with missings being left out. Similarly in Regression with optimal scaling, which works on associations between categories rather than interval scores, there is the choice to use Mode imputation, which fills in the missings with the most popular category in the relevant column.

In situations where method B is suitable, you have to use Analyze → Missing Value Analysis to actually fill in the missing values in the data in Data View beforehand. Basic instructions: at the first box, enter all the columns relevant to the analysis you will be doing, either as quantitative (i.e. interval) or categorical (categories/nominal). Only the former are actually used in the estimation of missing scores, though (SPSS does not seem to provide a way of filling in missing category data by method B). Tick EM and, if there are some quantitative columns that you don't want used as a basis for predicting values of missings, click the Variables button and make your selection. Otherwise all the quantitative columns you declared in the first box are used to predict any missings in each other. Click the EM button and tick Save completed data; under File name, name a file for it to be stored in. Then Save → Continue → OK. The procedure will produce various output, but mainly you are interested in the new stored file of data. If you call it up, you will find the missings all filled in.

In data for independent groups analysis (e.g. t tests, ANOVA), with missings in the DV column, if you have other columns of dependent variable data not being used in the same analysis, you could use them to fill in the missings by method B. Otherwise you can only use method A, i.e. use the mean for the whole DV column (NOT the mean of each group) to fill them in.

    Getting phonetic symbols displayed in SPSS graphs

First ensure you have the fonts of your choice (e.g. SILManuscriptIPA etc.) installed in Windows in the usual way. If they are available to you in Word in the usual way via Insert → Symbol, then they will be available in SPSS. If not, get a copy of the font file (ending .ttf) and put it in the Fonts subdirectory of the Windows folder on your PC.

Now, having made a graph in SPSS, click the graph you have created to make it appear in the Chart editing window. Then click the part you want to put special symbols in, such as the bottom scale, so it comes up outlined. Next click Format → Text, select the required font from the menu and the size you want, and click Apply, then Close.

Now when you click the scale of the graph and choose to change the Labels, you can type the symbols you want. However, you don't initially see them when you type them in the dialog box. You have to know that in the SIL font shift-t gets you the symbol for the 'th' sound of thick, though it will look as if you have just got T. Anyway, you have to type all the labels in the new font; you cannot mix symbols from different fonts, I think. So retype the labels using Change, and Continue. The symbols you want will appear on the graph itself.

I have not found a way to get symbols that are coded outside the range of the font that is covered by the keyboard keys, with and without shift. To know what symbols you can get from which key, with and without shift, you may have to study the table of symbols for your font in advance through a program such as Word, which displays it through the Insert → Symbol option.

    Item Analysis

This term is found used in two distinct senses. Both involve data where variables or experimental conditions are each measured using a set of items in some way.

A) The usual traditional sense, found especially in the pedagogical testing literature. Here it applies in the situation where a set of items is used to measure what is regarded as one single variable/construct. The set of items is usually thought of as a multi-item test of one thing (e.g. reading ability, or vocabulary size). However, item analysis may also be applied to, say, a set of Gardner-type statements for respondents to agree or not with, where a distinct attitude or orientation is measured by an inventory of five such statements, rather than just one. It can also apply separately to each set of items designed collectively to measure a single condition in an experiment. Item analysis in all these instances is the activity of checking whether there are some items in the set that in some way do not seem to belong there, illuminating how and, if possible, why they are odd, and maybe removing them or replacing them with better items when the test is used again. It is closely tied to internal reliability checking, often done these days with the use of the Cronbach alpha coefficient or Rasch analysis. Removing items that are odd improves reliability. This sort of item analysis is often done in pilot studies, as it represents a way of refining the quality of instruments for use in a main study. There are several statistical criteria for deciding what items are odd in a set that is supposed to be all measuring one thing. See further my Reliability handouts. Where items are supposed to attract similar levels of response (e.g. be of similar difficulty), then the classical IA approach involving alpha is appropriate; where items are supposed to be graded, and form an implicational scale, then an approach using IRT/Rasch is better. Where response times are involved, other criteria may be used to exclude responses for specific people on specific items (i.e. individual instances) rather than whole items.

B) The sense in which it is found used in some psycholinguistic literature. Here it denotes a second kind of analysis of data, beyond the usual default one. In an item analysis, instead of the subjects (usually people) being treated as the cases, the items are treated as the cases. Hence it is really 'analysis with items as cases', rather than 'item analysis', and is typically part of the analysis of the results of a main study. This applies only when a study has several conditions, each represented by a set of items, but this is very common in psycholinguistic studies, where subjects' performance in different conditions is often measured by their responses to sets of stimuli in a repeated measures design. For example a repeated measures variable 'word frequency' might be constituted as three sets of ten words, of three different frequency levels, making 30 items for people to respond to in some way; a variable 'early vs late attachment' could be instantiated as two sets of sentences, of two structure types, one in which a relative clause has to be parsed with an early noun phrase, the other with a late occurring one. Often such data arises also in areas such as SLA, applied linguistics and even sociolinguistic research as well as psycholinguistics, but item analysis in this sense is only routine in the latter, where it is regarded as a further confirmation of results obtained by the usual 'subject analysis', i.e. analysis with subjects as cases. Where, as often, ANOVA (see my handouts) is used to analyse the results, then the F values for the subjects-as-cases analysis are reported as F1, and those for the items-as-cases analysis as F2. Statisticians generally regard analysis with subjects as cases as the sounder basis, due especially to the independence requirement. Cases have to be regarded as providing independent observations if the assumptions of inferential statistical tests (e.g. ANOVA) are to be met. While it is generally not difficult to assume that responses from different people are independent of each other, it is not so certain that responses to different items are so independent, when the same people respond to all of them. One has to assume that in psycholinguistic experiments people are unable to make their responses to one item reflect their response to another. This is often assumed by phoneticians and psycholinguists.


Imaginary dataset to illuminate both the above. Suppose we have two groups of ten people (G1 and G2), and each person responds in two conditions (C1 and C2), where 5 items are used to obtain responses for each condition. As laid out for a customary subjects-as-cases analysis in SPSS this would appear as 11 columns and 20 rows, as below. Of course, the items would often not have been presented to subjects in an experiment in sets, but intermixed with each other and maybe with additional distracter/filler items that are not scored at all.

Group | C1 item1 | C1 item2 | C1 item3 | C1 item4 | C1 item5 | C2 item1 | C2 item2 | C2 item3 | C2 item4 | C2 item5

10 rows labelled 1, to mark each G1 subject | scores for each G1 person on C1 item 1 | scores for each G1 person on C1 item 2 | etc. across the remaining C1 and C2 item columns

10 rows labelled 2, to mark each G2 subject | scores for each G2 person on C1 item 1 | etc. across the remaining C1 and C2 item columns

To do item analysis in sense (A) above in SPSS, you would split the file by Group and use Analyze → Scale → Reliability analysis → Alpha on each set of five items separately (or for Rasch analysis, you need other software). Four analyses. That means that the internal consistency is always assessed within a collection of scores which is from a set of items that supposedly measures one thing, and which comes from a homogeneous group of subjects. After any adjustment of the data to improve reliability based on the above, you then typically move on to the actual analysis of results with subjects as cases. You first produce two extra columns which contain the averages of each five-item set of scores for each person. Use Transform → Compute. These 'Mean C1' and 'Mean C2' columns each now summarise the performance of subjects in one condition. Those two columns, together with the Group column, are then used in a mixed two-way ANOVA to see if there is a sig difference between groups or between conditions, or a significant interaction effect. That is your subjects-as-cases F1 ANOVA.
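As an illustration of the arithmetic behind those SPSS steps (not of SPSS itself), here is a minimal sketch in Python that computes Cronbach's alpha for one set of items and then the per-person condition means that would go into the F1 analysis. The item scores are randomly generated stand-ins for real data.

    import numpy as np

    def cronbach_alpha(items):
        # items: one row per person, one column per item
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)          # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)      # variance of the summed test score
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Invented scores: 10 people x 5 items for condition C1
    c1_items = np.random.default_rng(1).integers(0, 6, size=(10, 5))

    print("alpha for the C1 item set: %.2f" % cronbach_alpha(c1_items))

    # The 'Mean C1' column for the subjects-as-cases analysis
    mean_c1 = c1_items.mean(axis=1)
    print("per-person C1 means:", np.round(mean_c1, 2))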

For item analysis in sense (B), you need to make the items into the rows. You can do this with Data → Transpose in SPSS. If you start from the data as displayed above and include all the columns, you end up with 11 rows, which were previously the columns. There are columns now for each of the 20 subjects. You can now use Transform → Compute to get two new columns calculated which represent the mean scores for each group of subjects on each item. Then delete the row that contains the grouping numbers. Add a column of 5 1s and 5 2s to record which items (now rows) relate to condition C1 and which to C2. So the data should end up much as below. Finally use the column that records whether an item belongs to C1 or C2, and the two columns of group mean scores for each item. Again do a mixed two-way ANOVA to see if there is a sig difference between groups or between conditions, or a significant interaction effect. That is your items-as-cases F2 ANOVA. Note that what was a repeated measures factor in the F1 subject analysis, condition, becomes a between-groups factor in the F2 item analysis. The grouping of subjects, which was a between-groups factor in F1, becomes a repeated measures factor in F2.

G1 subj1 | G1 subj2 | G1 subj3 | etc. to G1 subj10 | G2 subj1 | G2 subj2 | G2 subj3 | etc. to G2 subj10 | Condition | Group 1 | Group 2

5 rows with scores for G1 subj1 on each C1 item | scores for G1 subj2 on each C1 item | etc. | ... | ... | ... | ... | ... | 5 rows labelled 1, to mark each C1 item | mean scores of the 10 G1 subjects on each C1 item | mean scores of the 10 G2 subjects on each C1 item

5 rows with scores for G1 subj1 on each C2 item | etc. | ... | ... | ... | ... | ... | ... | 5 rows labelled 2, to mark each C2 item | mean scores of the 10 G1 subjects on each C2 item | mean scores of the 10 G2 subjects on each C2 item

Note, the above account of items-as-cases analysis assumed that the sets of items used to represent the two conditions were not themselves matched or repeated in any way. I.e. C1 items 1-5 might have been five nouns as stimuli in some response time experiment, and C2 items 1-5 five verbs, with no special connection between individual verbs in one set and individual nouns in the other. If however the items are themselves matched in pairs, or repeated in different forms etc. across conditions, the items-as-cases analysis should be different. E.g. if C1 items were five verbs in the past tense and C2 five verbs in the bare infinitive form, the researcher might choose to use the same five verbs in both conditions (randomised with suitable distracters interspersed when they are actually presented to subjects). Then the items are individually matched and the items-as-cases analysis should be done with the items as repeated measures. I.e. in the data grid above for SPSS, the 5 rows for C2 responses would need to be not below the 5 rows for C1 but side by side, with the matched items in the same row, to allow repeated measures comparison of items as well as subjects.
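For anyone wanting to see the reshaping logic outside SPSS, here is a minimal sketch in Python/pandas of the transpose-and-average step described above for the unmatched-items case: subject-by-item scores are transposed so items become the cases, and mean scores per group are computed for each item. All names and scores are invented for the illustration.

    import pandas as pd
    import numpy as np

    rng = np.random.default_rng(0)

    # Subjects-as-cases layout: 20 subjects (rows) x 10 item columns, plus Group
    cols = [f"C1_item{i}" for i in range(1, 6)] + [f"C2_item{i}" for i in range(1, 6)]
    data = pd.DataFrame(rng.integers(0, 6, size=(20, 10)), columns=cols)
    data["Group"] = [1] * 10 + [2] * 10

    # Items-as-cases layout: transpose so each item is a row
    items = data.drop(columns="Group").T
    items["Condition"] = [1] * 5 + [2] * 5        # which condition each item belongs to

    # Mean score of each group's subjects on each item (the 'Group 1'/'Group 2' columns)
    g1_cols = data.index[data["Group"] == 1]
    g2_cols = data.index[data["Group"] == 2]
    items["Group1_mean"] = items[g1_cols].mean(axis=1)
    items["Group2_mean"] = items[g2_cols].mean(axis=1)

    # These item-level means, with Condition as a between-items factor and Group as a
    # repeated measure over items, feed the F2 (items-as-cases) ANOVA described above
    print(items[["Condition", "Group1_mean", "Group2_mean"]])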

Checking for guessing or response bias when using certain data-gathering instruments with closed responses

    Checking for guessing

Any instrument where the subjects are given choices to pick from for an answer is potentially open to guessing, in the sense of picking one option at random, without thought.

For example, the respondent may randomly pick one of the choices because:

they can't be bothered to think about the question/item and just want to finish quickly
they don't actually have any relevant knowledge to make a correct choice
they can't understand the question (language too hard, too long, pragmatically odd etc.)
etc.

    Clearly the results will not then be a true measure of whatever the researcher intended to measure,

    and could even vary if the subjects responded to the same items again on another occasion. I.e. not

    valid or even reliable.

    This affects multiple choice items, yes/no or agree/disagree items in questionnaires and tests, rating

    scales and so forth. Clearly it cannot affect instruments which have open response in some form, i.e.

    with no alternatives supplied.

    One cannot statistically tell definitely if guessing has taken place or not, but one can check if the

    responses are like those one would get from someone who was guessing, or not. Obviously it is quite

    possible to get a real result, where people have paid attention and answered sensibly, which happens

    to be similar to the guessing one. Only the researcher can judge the interpretation.

    You need to calculate what the result would be, on average, for someone who was randomly guessing,

    and use the appropriate one sample test (see my LG475-SP handout) to check if the observed result

    differs significantly from the one you would get by random guessing.

    For example:

    1) 30 subjects have to answer yes or no to a question about whether they use the keyword method of

    vocab learning or not. Random guess frequency of yes would yield a frequency of 30/2 = 15 yes

    responses. Use 50% binomial test.

2) 30 subjects have to pick one of four reasons they are offered for why they are learning English. Random guess frequency of each choice being picked would be 30/4 = 7.5. Use the chi-squared one-sample fit test.
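A minimal sketch in Python of those two checks, with hypothetical observed counts invented for the illustration: a binomial test against a 50% guessing rate for example 1, and a chi-squared goodness-of-fit test against equal expected frequencies for example 2 (binomtest needs a reasonably recent SciPy).

    from scipy import stats

    # Example 1: say 22 of the 30 subjects answered 'yes'.
    # Does that differ significantly from 50-50 random guessing?
    result = stats.binomtest(22, n=30, p=0.5)
    print("binomial test p = %.3f" % result.pvalue)

    # Example 2: hypothetical observed counts for the four offered reasons (30 subjects).
    # Random guessing would give expected counts of 30/4 = 7.5 in each category.
    observed = [14, 8, 5, 3]
    chi2, p = stats.chisquare(observed, f_exp=[7.5, 7.5, 7.5, 7.5])
    print("chi-squared = %.2f, p = %.3f" % (chi2, p))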

    3) 30 subjects have to judge 20 words for whether they exist or not in English. Thus each person gets

    a score out of 20 for how many they say exist. The average random guess score would be