
    Some stats and SPSS pointers

These relate to some specific technical matters I am sometimes asked about but which are not covered in detail in my courses, or which may have passed you by unnoticed. Note, I am not trying to present all my course material here (you have to take the courses for that), just to deal with some frequently asked questions and things people frequently get confused over or get wrong. Also, these are not all readily understandable unless you have already taken stats courses!

How do I round figures down to make them shorter, e.g. 3.852? And how many decimal places should I report?

How do I generate random numbers to help when sampling from a list, or when dividing subjects randomly into groups? Use the facility at http://www.randomizer.org/form.htm

I have the proficiency scores (or the like) for 30 subjects, and want to divide the cases into groups based on this. Or I need categories of word stimuli of three different frequencies. How do I do it?

Can I get phonetic symbols like [] shown on the scales of SPSS graphs?

How do I combine columns of figures I have entered in SPSS, when I want averages for each person of the figures in the columns (e.g. the scores for separate items in a test)?

What is item analysis? And what does it mean if the F in an ANOVA result is labelled F1 or F2, where there has been an analysis by items as well as by subjects?

How do I eliminate extreme response times in psycholinguistic data? Or response times where the response was wrong?

What does the standard deviation really mean?

When I do a histogram of some scores (interval scale data) I am supposed to look at the distribution shape (the pattern of the heaps on the graph), but how do I interpret the shape I see?

How should one treat rating scale responses? As ordered categories or interval scores?

If my data is not normally distributed, so not suited to t tests and ANOVA, what can I do? What are the transformations I can use?

What really are Likert and Guttman scales, and how should they be constructed? They are both ways of measuring things via a set of agree-disagree items. Often we use sets of items of this type that other researchers made, but I wonder if anyone actually selected and rated the items in the approved way in the first place?

What does it mean when SPSS gives you a figure with an E on the end? e.g. 7.012E-02

What are degrees of freedom (df) and how do I report them, if needed?

What are residuals and what do they tell me?

If in a pilot trial of a few subjects I don't get the significant result I want, how can I estimate how many subjects I would need to probably get a sig result?

How do I do follow-up post hoc paired comparisons and planned comparison tests for any kind of main effect or interaction in ANOVA where more than two groups or conditions were initially compared? SPSS doesn't do all the possibilities, or hides some away.

How do I do post hoc paired comparisons after a Kruskal-Wallis test?

What is Bonferroni adjustment and how can I do it?

What is eta squared and how does SPSS calculate it?

Esp. for ACQUISITION people and SOCIOLINGUISTS. Twenty people in two groups are each measured for the number of times they use the third person s out of all the occasions when they had an opportunity to, in compositions, recorded speech etc. (often called potential occurrences or loci). How do you summarise % scores like this? Group % scores for frequency of use of things, or individual % scores?


Esp. for PSYCHOLINGUISTS and people doing repeated measures EXPERIMENTS. What on earth is a Latin square, and how do I use it or some other method of organising conditions, different types of stimuli etc. in an experimental design?

What are those tests of prerequisites for ANOVA/GLM, such as those of Levene, Mauchly etc., all in aid of?

If I have a lot of missing scores, can I fill them in somehow?

Can I check on whether people are responding by random guessing or with bias, and adjust scores to take account of that?

My subjects all gave several responses to a set of different stimuli, and I have entered the data in SPSS with each response as a row. So there are several rows for each subject. How do I turn that into the more usable SPSS layout with one row per subject?

Subjects have been categorised in a parallel way in several different columns. E.g. they answered a set of questions each of which had the possible responses: me, my teacher, my classmates (i.e. although coded for SPSS as 1, 2, 3, the responses cannot be considered as degrees of anything on an interval number scale). How do I get SPSS to add up for each person, across the items, totals of how many times each category was chosen?

If you are into word association tests, there are a few descriptive stats that one can use there that one does not find used much anywhere else: the group overlap coefficient, the within-groups overlap coefficient, and the index of commonality.

    Degrees of freedom

Sometimes journals expect you to report these df figures along with other statistics. They are the figures you see quoted in brackets, or often as subscripts, after t, F, chi squared etc. E.g. instead of t = 2.34 one sees perhaps t(28) = 2.34.

They can usually easily be got from SPSS output where they are not obvious: look for df. Broadly they reflect the number of categories in any category variables in the design, and the number of cases in each group. The exception is designs where only category variables are involved (e.g. where you would use chi squared): in that instance the df just reflect the number of categories.

Since you will have told the reader the numbers of categories and cases involved anyway, I don't personally see the point of mentioning df.


But in case you need to, they mainly turn out to be one less than the numbers you started with, though it can get more complicated.

The df numbers are written as subscripts, or in brackets, after the statistic t, F or whatever (not after the p).

So in a t test comparing two groups, 108 subjects altogether, the df will be 1,106. One might write t1,106 = ..... The first figure is one less than the number of EV categories (2-1=1). The second is the number of cases less one for each group involved (N-2 = 108-2 = 106).

In an ANOVA comparing four groups with 108 subjects altogether, the df would be 3,104.

In a t test comparing the same group in two conditions, the df for 108 cases will be 1,107.

The df can be more tricky for more complicated designs and interactions. In the output of ANOVA you will generally see the first df figure you need in line with the main effect or interaction of interest, and the second one listed as within groups or error below it.

In a chi squared test with three categories on each scale, the df is 4 because (3-1) x (3-1) = 4. In a chi squared test with two categories on each scale, the df is 1 because (2-1) x (2-1) = 1.

Why are these figures called 'degrees of freedom', and why are they important? It is basically because what is important in statistics is not so much the numbers of anything but the numbers of choices or separate pieces of information involved. Typically there is always one less choice than there are people etc. If I have ten assignments to hand back to my class of ten students, I have to make a choice about who to give each one to for the first nine, but for the tenth there is no choice, as there is only one assignment left and one person left to give it to. I have no 'freedom' left on the last one.

Here's the statistician's analogue of that. 100 people answer a yes-no question and 38 say 'yes' and 62 say 'no'. We want to know if that differs significantly from 50:50, i.e. are they showing a real preference? There are two categories (yes and no), so we use the binomial test. It might seem that we have two figures to handle in the test and two comparisons to make: we have to check if the observed figure of 38 differs from the E of 50, and if the O of 62 differs from the E of 50. But in fact, of course, the test need only do one of those. The data has only one degree of freedom. Once the test establishes whether 38 differs significantly from 50 for one category, the answer for the other category, whether 62 does so as well, is fixed. Hence if one calculates statistics by hand one always finds that in the formulae one has to use the df figures rather than the full numbers of cases or categories.
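If it helps to see that last point concretely, here is a minimal sketch in Python (not part of the SPSS workflow described here; the data are just the example above). The binomial test only needs one of the two observed counts, precisely because there is only one degree of freedom.

    from scipy.stats import binomtest

    # 100 people answer a yes-no question; 38 say 'yes'. Test against a 50:50 split.
    print(binomtest(k=38, n=100, p=0.5).pvalue)

    # Testing the 'no' count instead gives exactly the same p value:
    # once 38 vs 50 has been assessed, whether 62 differs from 50 is already fixed.
    print(binomtest(k=62, n=100, p=0.5).pvalue)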

    Residuals

These are simply the differences between observed figures (O) and some kind of predicted/expected figure (E). But they mean different things in different analyses.

Category data: for significant differences/relationships we want them big, because the E figures represent what is expected under the null hypothesis of NO difference/relationship.


In analyses where just frequencies in categories are involved (e.g. analysed using chi squared or the binomial test), the residuals are the differences between O and E frequencies. The bigger they are, the more likely it is that there is a significant difference involved. In the Labov analysis in class we looked at the table of O and E values to see where the biggest O-E differences were (for which [r] use in which store). In fact chi squared itself is calculated essentially by adding up the residuals for each cell in the table (with a bit more maths to it). In the binomial test where, say, 20 people are divided 4 saying 'yes' and 16 saying 'no' to a question, we want to know if that differs significantly from a 50-50 split, which would be 10 'yes' and 10 'no' in this instance. So we are concerned with the size of the residual... in this instance 6. The bigger the better, if we want to show a clear preference.

Interval data: for significant relationships we want them small, because the E figures represent what is expected under the hypothesis of a perfect linear relationship. This is the other place where you often find residuals being talked about: in data where all the variables are (treated as) interval (analysed using Pearson r, or regression). Here they are the differences between the observed scores and the scores predicted by the best fitting line on a scatterplot, showing the EV-DV relationship. Here obviously the smaller the residuals, the more likely the relationship is significant. Obviously one can find a best fitting line for any data where cases are scored on two or more interval variables... but if most of the observations fall miles away from the line, that does not show a real relationship. Pearson r and regression statistics in effect reflect whether the residuals are generally large or small; when examining scatterplots, if we look at cases (subjects) that are way off the line, we are looking at cases with exceptionally large residuals.
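Here is a minimal Python sketch of both kinds of residual, with invented figures (scipy is just an illustrative alternative to SPSS here).

    import numpy as np
    from scipy.stats import chi2_contingency, linregress

    # Category data: residuals are O - E for each cell of a frequency table
    observed = np.array([[30, 10],
                         [20, 40]])
    chi2, p, dof, expected = chi2_contingency(observed)
    print(observed - expected)        # big residuals -> difference more likely significant

    # Interval data: residuals are each case's distance from the best fitting line
    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    fit = linregress(x, y)
    print(y - (fit.intercept + fit.slope * x))   # small residuals -> relationship more likely significant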

    Eta squared

This is the measure of relationship that you can get in ANOVAs and the like. A bit like a correlation coefficient, it tells you on a scale 0-1 how much EV-DV relationship there is. Really it is more analogous to r squared, and can be thought of as a % on a scale 0-100. It is a useful addition to just being told whether a relationship or difference is significant: many significant differences/relationships are in fact quite small in terms of the SIZE of the difference/relationship.

SPSS does not calculate eta quite how the books suggest, or even how the SPSS help itself seems to suggest.

In fact every eta squared is calculated so that it is a proportion out of a different total, and some of the variance that goes into the calculation of one of them may also go into the calculation of another, so none of them can sensibly be added to each other.

So every effect (main or interaction) is out of its own 100%, representing the maximum variance that it could account for, not all the variation in DV scores. This applies even where the effects are of the same type and a sensible calculation could be made of the % of variance of the same type accounted for (e.g. two between subjects main effects: in principle one could calculate what % of the WS variance they account for together). In fact this is not done.

So the SPSS etas can be compared with each other ('This one is accounting for more of the total it could account for than that one is...') but not really added. Or if you like, the total %, if there are three factors with three main effects, three two-way interactions, and one three-way interaction, is not 700% but less than that, though it is hard to calculate exactly what.


(In fact you can see how SPSS calculates the etas: in the sums of squares column, it is simply the sum of squares for the effect of interest divided by the SS of that effect plus the relevant error SS for that effect. Clearly then it is not calculating the proportion of all the SS in the entire analysis accounted for by that effect, just the proportion of the SS relevant to that effect. And the error SS also get re-used in different calculations.)
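As a concrete illustration of that calculation, here is a minimal sketch with invented sums of squares (the numbers are made up, not from any real output). What SPSS reports is the 'partial' version, SS for the effect divided by (SS for the effect + the error SS used to test it), rather than SS for the effect divided by the total SS.

    # Hypothetical figures read off an ANOVA table
    ss_effect = 120.0    # SS for the main effect of interest
    ss_error  = 480.0    # error SS used to test that effect
    ss_total  = 1500.0   # total SS for the whole analysis

    partial_eta_sq = ss_effect / (ss_effect + ss_error)   # what SPSS reports: 0.2
    classic_eta_sq = ss_effect / ss_total                 # the textbook version: 0.08
    print(partial_eta_sq, classic_eta_sq)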

    Post hoc tests of paired comparisons after ANOVA

Wherever a main effect or interaction involves a comparison of more than two means, post hoc tests can be relevant, as the basic significance value given by the ANOVA does not say which pair or pairs is/are sig different. If the main or interaction effect from ANOVA comes out significant, that just means that there is a sig difference SOMEWHERE among the means, but not necessarily between every pair. This especially arises where one or more of the EVs has three or more levels (i.e. groups or conditions), though it can also arise, say, where you have two two-value EVs and the interaction is significant. You need a post hoc test to identify where exactly the differences are, or else just judge it by eye from a graph or table of means. This situation arises in various ways in ANOVAs, some of which SPSS deals with straightforwardly, others not.

One might think the solution is just to do loads of familiar t tests comparing the means in pairs as required, to see which pairs are sig different. Indeed one sees this done in some published work, and in moderation you can probably get away with it. However, statisticians don't like that. The statistical issue underlying all this is that, when you do paired comparisons like this, the same means get reused several times in different comparisons. If you have three groups and compare them in pairs, then the mean for group 1 gets used in the comparison both with group 2 and with group 3. Now the more times a mean gets compared with others in repeated statistical tests, the more chances it has to come out as significantly different just by chance, not reflecting a real population difference. Remember that if a difference between two means is significant (at the .05 level) that actually MEANS that one would not get a result this different more than 5% of the time, or one in twenty times, by chance, due to the vagaries of random sampling, in similar sized samples from a population where there really was no difference. But another way of looking at that is to say that if you use the same data in twenty comparisons, then one of the results might be that one-in-twenty result that looks significant but is actually from a population where there is no difference. The more tests you do, the more chance of getting a result that looks sig but is not really.

Some adjustment has to be made to compensate for this. Like other activities in life involving pairs, your tests for multiple paired comparisons should not be unprotected! Post hoc tests and the like cope with this better than t tests. It is not appropriate to do multiple t tests, at least not without a Bonferroni adjustment of the sig level (though that is a solution that is seen as rather overcompensating for the problem). Better is to use a post hoc test designed for such comparisons (e.g. Tukey, Scheffe, etc.). However, as the SPSS dialog box for post hoc shows, there is a myriad of options: nobody is certain which is best, and none are perfect. As a consequence you can sometimes get the anomalous result that the ANOVA says there is a sig difference somewhere, but the paired post hoc test does not find any pair significantly different.


The term post hoc is used where you just want to consider all pairs of means that are possible to compare, following an overall analysis including all the means, which is the appropriate starting point. SPSS however limits this term to comparisons between cases in different groups, though statisticians use the term generally for follow-up comparisons of pairs of repeated measures conditions as well. The term planned comparison (= contrasts in SPSS) is used where you planned specific paired comparisons, not all the possible ones, such as the comparison of three groups of learners with an NS group, but not with each other.

The general rule is that for k means there are k(k-1)/2 paired comparisons possible. E.g. if four groups, then 4 x 3 / 2 comparisons, i.e. 6. However, SPSS output usually gives you the pairs twice over, so it looks like even more.

1. An EV with three or more independent groups being compared

E.g. the % correct scores for third singular s of three groups of learners are compared. The basic ANOVA result says whether there is a significant relationship between the EV and the DV, a difference somewhere among the groups, but not exactly where. If the overall result is sig, then to see which pairs of groups are sig different you need to do post hoc tests. Whether you do the ANOVA via Compare Means... Oneway ANOVA or via General Linear Model... Univariate, you get many, many ways of doing the post hoc test offered under the Post Hoc option. Tukey HSD is a common safe bet.
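Outside SPSS, the same kind of comparison can be sketched in Python; the following is only an illustration using statsmodels' pairwise_tukeyhsd with invented scores for three groups, not a description of the SPSS output.

    import numpy as np
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    # Invented % correct scores for three groups of learners
    scores = np.array([55, 60, 58, 62, 70, 72, 68, 75, 85, 88, 90, 84], dtype=float)
    groups = np.array(['beg'] * 4 + ['int'] * 4 + ['adv'] * 4)

    # Tukey HSD compares every pair of group means, with protection
    # built in for the multiple comparisons
    print(pairwise_tukeyhsd(scores, groups, alpha=0.05))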

Basic post hoc tests compare every pair of means. But suppose your groups were two of learners and one of native speakers, and you plan to compare the two learner groups with the NS group (which may be thought of as a control group) but not with each other. These are often called planned comparisons, and you would do better not to use the post hoc tests, which compare every pair and so are weaker (less likely to identify sig differences). You get this sort of limited comparison in Analyze... General Linear Model... Univariate: enter your DV as usual and the three-group variable as a fixed factor. This does a oneway ANOVA exactly like the one you get with Compare Means... Oneway, except that it gives you some extra options. If you click Contrasts, click the contrast option to get Simple, and then click First or Last depending on whether the control group is numbered 1 or 3... then (don't forget) click Change... then Continue, then OK... you get an output that just does those limited paired comparisons.

2. An EV with three or more repeated measures conditions being compared

E.g. you compare the same people's fluency speaking to the teacher, to peers and to parents. You want to compare each pair of those conditions afterwards. In General Linear Model... Repeated Measures you have to use not what is labelled Post Hoc but rather Options: click the variables into Display Means, tick Compare main effects, and below that choose Bonferroni. This in effect uses t tests with a simple Bonferroni adjustment for multiple comparisons to compare the pairs of means. Not ideal, because it is overcautious: i.e. likely to lead to you missing a difference that is actually sig. SPSS should really make Tukey etc. available in repeated measures as well as independent groups comparisons. Alternatively you can do your own Tukey test as described below.
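The same idea (paired t tests with a Bonferroni correction) can be sketched outside SPSS; this is only an illustration with invented fluency scores and my own variable names.

    from itertools import combinations
    from scipy.stats import ttest_rel

    # Invented fluency scores for the same six people in three conditions
    conditions = {
        'teacher': [3.2, 2.8, 3.5, 3.0, 2.9, 3.4],
        'peers':   [3.8, 3.5, 3.9, 3.6, 3.3, 4.0],
        'parents': [3.0, 2.7, 3.3, 2.9, 2.8, 3.1],
    }

    pairs = list(combinations(conditions, 2))   # 3 conditions -> 3 pairs
    for a, b in pairs:
        t, p = ttest_rel(conditions[a], conditions[b])
        # Bonferroni: multiply each p by the number of comparisons (capped at 1)
        print(a, 'vs', b, 'adjusted p =', min(p * len(pairs), 1.0))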


Once again you can alternatively choose limited planned comparisons via the Contrasts option, as above.

3. Interaction in a two way ANOVA with both EVs as groupings

Where there are two EVs that are groupings, the interaction always involves at least 4 subgroups. Even if both variables are just two groups, like male-female and upper class-middle class, the interaction has four groups involved and, if the interaction is sig, you might want to know which pairs of those are producing that result, beyond just guessing from a suitable graph.

SPSS does not deal with post hoc for interactions, but in some instances you can do it yourself fairly simply with a calculator. For instance you can do a Tukey test for pairwise differences when you get a sig interaction in a two way ANOVA with two independent groups factors, where all groups have the same number of subjects in them.

Calculate T = q x the square root of (error mean square / number of people in each group)

The error mean square or error variance is in the original ANOVA table in the output. q is found from the table of the Tukey statistic (ask me for it, or see a serious stats textbook which has it in the back; I can't include it here for copyright reasons). Read off the column for the number of means being compared pairwise, and the row for the df of the error variance/mean square (from the ANOVA table).

Then calculate T, and any pair of means differing by more than T is sig different.

If the groups are different sizes, or you wish to save effort, do t tests with Bonferroni adjustment.
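If it is easier, the critical q can be looked up from the studentized range distribution in Python rather than from a printed table; the ANOVA figures below are invented purely for illustration.

    from math import sqrt
    from scipy.stats import studentized_range

    ms_error   = 12.5    # error mean square from the ANOVA table (invented)
    df_error   = 36      # df of that error term (invented)
    n_per_cell = 10      # number of subjects in each subgroup (equal sizes assumed)
    k_means    = 4       # number of means being compared pairwise

    q = studentized_range.ppf(0.95, k_means, df_error)   # critical q at the .05 level
    T = q * sqrt(ms_error / n_per_cell)
    print(T)   # any pair of means differing by more than T is sig different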

4. Interaction in a two way ANOVA with both EVs as repeated measures

As for 3. OR treat it as a oneway repeated measures situation: enter all the repeated measures columns as if there were just one factor, not two, and follow 2 above. That in effect does the post hoc for the interaction.

5. Mixed independent groups and repeated measures ANOVAs

As usual, if the result in ANOVA is significant, and more than two means are being compared, one needs follow-up tests to see which pairs of means are significantly different (or be happy just to judge it visually from a graph). Each main effect involving 3 or more levels can be dealt with as above, but the interactions are more of a problem.

Take five repeated measures conditions and two groups.

One can get the main effect multiple comparisons done by SPSS with suitable adjustments as described in (2) above (i.e. comparing results on the five conditions with each other in pairs, for the whole sample of subjects, lumping both groups together). In fact if one wants all of them there are 10 comparisons, because there are five conditions, so (5 x 4) / 2 paired comparisons.


In the interaction, since there are 10 means involved, for all 5 conditions and two groups, there are (10 x 9) / 2 comparisons potentially, which makes 45.

One can do some of the interaction paired comparisons by splitting the file and getting SPSS to use the Bonferroni option again. Those are the comparisons of each condition with each other condition within each group separately: 10 comparisons in each group = 20 in all.

That leaves 25 comparisons that you could not do with any post hoc procedure in SPSS as far as I know: the comparisons between each of the 5 means for one group and the five for the other. Ordinary t tests do not build in any reduction for multiple comparisons the way post hoc tests do. However, a simple adjustment by hand is to use the t test but require stricter sig levels. In fact this is really making the Bonferroni adjustment oneself.

The account immediately above assumed that there was no a priori reason to be interested in any of those 25 pairs more than any other... it was a DIY post hoc solution.

However, it could be that, for theoretical reasons or whatever, you were not interested in comparing every pair of means, only certain ones. In particular:

- the comparisons of all 5 conditions within each group, done OK with split file and Bonferroni adjustment..... 20 comparisons

- the comparison of each group with the other on each condition separately. That is in fact only 5 comparisons out of the 25 possible other ones (i.e. you have no interest in comparisons like that between one group on condition A and the other group on condition C, and so on). You want to claim, in this instance, that these were what are called 'planned comparisons', not the usual post hoc 'try everything' type. Then you could reduce the required sig value of the t test for this part by dividing by 5, not 25, in the Bonferroni adjustment....

In general, then, where there is no post hoc test available in SPSS, the simple but crude solution is to use ordinary pair comparison statistical tests, but divide the target sig level by the number of potential comparisons you COULD make, or PLANNED to make, to compensate for making multiple comparisons. However, this is cruder than using post hoc tests, which take care of this better. You are more likely to miss sig differences (a so-called Type II error).
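A minimal sketch of that DIY adjustment, with invented scores and assuming the five planned group-versus-group comparisons just described:

    from scipy.stats import ttest_ind

    alpha = 0.05
    n_planned = 5                  # group A vs group B on each of the 5 conditions only
    alpha_adj = alpha / n_planned  # Bonferroni-adjusted criterion = .01

    # Invented scores for one condition, two independent groups
    group_a = [12, 15, 14, 16, 13, 15]
    group_b = [18, 20, 17, 21, 19, 22]
    t, p = ttest_ind(group_a, group_b)
    print('significant after adjustment?', p < alpha_adj)
    # Repeat for each of the other four conditions, keeping the same adjusted criterion.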

You don't get a sig result and you want to know how big a sample you would need to get one

If you have gathered data, especially in a pilot study, and not got a significant result, you may want to know how big a sample you would need to make the result significant. Remember, if you choose a big enough sample, even a very small difference or relationship may be significant. So if you have the possibility available of increasing the size of the sample (i.e. there are more subjects or cases available), and are desperate to get a significant result, it would be useful to know how many subjects would be ideal.

Some books give formulae to calculate how big a sample you need, but they don't necessarily fit the situations you have straightforwardly. The following is my best suggestion for an easy way to get an estimate of the required sample size using SPSS facilities.


Basically you create imaginary larger samples simply by using your subjects more than once. Suppose you have 20 subjects and p = .231 for whatever test you are interested in. You get SPSS to think that you have three times as many subjects, simply by getting each subject counted three times, and run the test again. Say then p = .09. Then you get SPSS to think you have four times as many subjects, counting each of your twenty four times, and see again what happens. By trial and error you get to the point where p = .05, and that gives an estimate of the minimum number of subjects you need to get a significant result.

To get SPSS to count a subject more than once you weight the data, similar to how you may be familiar with doing elsewhere. At Transform... Compute you nominate a new target variable which you might call incr (since it will tell SPSS how many times to increase your sample size). You then enter in the numeric expression space whatever you want the weighting to be. You could start with a weighting like 2. Click OK and you will find a new column called incr with 2 repeated all the way down. If you now go to Data... Weight Cases and weight the data by that column, then SPSS sees your data as having twice as many cases, counting each one twice.

Now do your analysis again and see if it is significant. Go on altering the weighting figure in the incr column via Transform... Compute repeatedly, and redoing the analysis, until you get a sig difference or relationship. Note that you can enter partial weightings like 3.5 as well.

When by trial and error you achieve a weighting that gives a significant result, multiply it by your original sample size to see how many subjects you would need. E.g. if your sample from two groups was 20 in all, but you only get a sig difference with a weighting of 3.8, then you need at least 20 x 3.8 subjects (= 76), in similar proportions in the two groups as before, to have a chance of getting a sig difference.
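The same trial-and-error logic can be mimicked in Python by simply duplicating the cases and re-running the test; this is only a rough sketch with invented pilot data (the scores and the choice of weights are made up).

    import numpy as np
    from scipy.stats import ttest_ind

    # Invented pilot data: two groups of 10, difference not significant as it stands
    group_a = np.array([52, 55, 60, 58, 49, 62, 57, 54, 59, 61], dtype=float)
    group_b = np.array([54, 58, 62, 60, 52, 64, 59, 57, 61, 65], dtype=float)

    for weight in (1, 2, 3, 4, 5):
        # count every case 'weight' times, as the SPSS weighting trick does
        t, p = ttest_ind(np.tile(group_a, weight), np.tile(group_b, weight))
        print(weight, round(p, 4))
    # The smallest weight giving p < .05, times the original N, estimates the
    # sample size needed (subject to the cautions below).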

Cautions. You have to make sure the new bigger sample IS from the same population as the old one (in the case of comparisons of groups, of course, several populations may be involved). Even then, any method of estimating the required sample size is only approximate, because even truly random samples can vary a lot. Also, with an increase in sample size the actual difference or relationship you are interested in may not actually get any bigger; it is just more likely to be significant. I.e. you may end up showing that there is indeed a non-zero difference or relationship in the population (which is what significant means), but not that it is a very large one.

    Group % scores

Twenty people in two groups are each measured for the number of times they use the third person s out of all the occasions or loci when they had an opportunity to (often called potential occurrences). Very many linguistic features are measured this way in acquisition and sociolinguistic research. In the former it is often a matter of how often the correct form (in NS terms) is used, as against some erroneous form or omission, on occasions where there was an opportunity to use it; in the latter it is often a matter of how often one variant is used out of the two or more that make up a sociolinguistic variable.

In all these situations there are two ways of summarising and graphing the data: 1) the group way and 2) the individual way.


Either 1) you add up all the potential occurrences for each group, and all the occurrences of the form of interest, and express the second as a percentage of the first for each group.

Or 2) you calculate a % score for each person using their individual frequency of the form of interest and their individual number of potential occurrences. Then for each group you can calculate the average (mean) % score for that group from the individual scores of its members. However, you have to be aware that this can be a bit misleading for cases whose number of potential occurrences is small: getting one out of one right is 100%, just as much as getting 20 right out of 20 possible occasions! It is common to require at least 5 potential occurrences, and otherwise treat a case as missing data.

It is easy to show that the group figures may not come out the same. Here we imagine figures for a group of two people and see what happens:

Method 1
              Frequency of         Number of potential    % occurrence of
              form of interest     occurrences            form of interest
  Person 1    4                    16                     25%
  Person 2    8                    10                     80%
  Total       12                   26
  Group %                                                 (12/26) x 100 = 46.2%

Method 2
              Frequency of         Number of potential    % occurrence of
              form of interest     occurrences            form of interest
  Person 1    4                    16                     25%
  Person 2    8                    10                     80%
  Mean % for group                                        (25 + 80)/2 = 52.5%

In fact the two methods will come out the same only when all subjects had the same number of potential occurrences (e.g. in a test or list reading task).

Many SLA and sociolinguistic studies use method 1. That is fine, if you wish, for the purposes of giving descriptive statistics and making graphs, provided you make it clear what you are doing, and are aware of the difference from the other method.

BUT for any inferential statistics you should use method 2, entering the data in SPSS in the form of one row per person, with a % score for each person. Then, to compare two groups, for example, you use the independent groups t test on the two sets of scores.
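The two-person example from the tables above, worked in a few lines of Python just to show where the two figures come from:

    occurrences = [4, 8]     # frequency of the form of interest, person 1 and person 2
    potential   = [16, 10]   # each person's potential occurrences (loci)

    # Method 1: pool the raw counts for the whole group, then take one %
    group_pct = 100 * sum(occurrences) / sum(potential)

    # Method 2: a % score per person, then the mean of those individual scores
    individual_pct = [100 * o / p for o, p in zip(occurrences, potential)]
    mean_pct = sum(individual_pct) / len(individual_pct)

    print(group_pct, mean_pct)   # about 46.2 versus 52.5: the two methods disagree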


If you were to attempt inferential statistics on the total figures of method 1, you would have to use the numbers of individual occurrences regardless of people. I.e. if the example above were for one group, you would represent that group with the proportions 12 and 14 (i.e. 12 occurrences of the form of interest versus 14 non-occurrences, making up the total of 26 potential occurrences) and compare those with the overall proportions for the other group being compared. The test for that is chi squared, and you do see this used even in some published work for data like this. However, there are at least two major problems with this which would lead most statisticians to regard it as a misuse of chi squared.

- As for all significance tests, the basic observations (cases) which enter into the test have to be independent of each other. Now in method 2 the cases are the people, and there is no problem in seeing scores from different people as being independent of each other. However, in method 1 the 26 occurrences in the example are the cases, and clearly while some of those are independent of each other (being from different people), some are likely not (being from the same person).

- There is also an expectation that the populations sampled are homogeneous. From what we have just said, that is clearly not the case in method 1: the 26 observations representing one group in the example are a mixture. It cannot be said that each observation is from one population: it is from a mixture of a population of people and the populations of occurrences of each separate person.

The only instances where chi squared and method 1 might be defensible would be where the numbers of potential occurrences are very small, amounting to little more than one or two per person included; OR where all the potential and actual occurrences come from just one person per group, though that still does not deal with the independence problem; OR where you feel able to argue that responses from the same person are as independent as if they were from different people. There is a tradition of phoneticians making this tacit assumption for things like VOT, on the belief that such things are beyond the person's ability to control.

    Rounding interval scores

Just checking.... do we know how to round figures on interval scales? The mean of a set of scores may come out as 6.3597, but often we want to express this in shorter form, such as 6.36 or 6.4. Quoting long strings of numbers after the decimal point can look as if you are just trying to impress with loads of numbers. Or it may be that you are trying to make up for sloppy METHOD by being super-detailed in the figures quoted in RESULTS.... Best not to do that, since one's measurement is unlikely to be so accurate that more than two decimal places are relevant (except perhaps where a computer has measured something for you, like response time...). Generally use three or two decimal places for sig/p values, and two or one for everything else. Keep it intelligible and round numbers where necessary. But where do you round up, and where down?

    Task

    Just round the following figures to two decimal places:

    3.852 0.679 18.505 1.006 7.597 20.955 0.602

SPSS often rounds figures on screen (e.g. in the data grid) even though it is holding longer versions in its memory. You can select for each column how many decimal places it shows on the Data View window.

Answer to the above: 3.85, 0.68, 18.51, 1.01, 7.60, 20.96, 0.60
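Incidentally, if you ever round with Python rather than by hand, note that the built-in round() uses a different convention for halves; the decimal module reproduces the familiar 'round half up' rule used in the answers above. A small sketch:

    from decimal import Decimal, ROUND_HALF_UP

    for figure in ['3.852', '0.679', '18.505', '1.006', '7.597', '20.955', '0.602']:
        # quantize to two decimal places, rounding halves upwards (e.g. 18.505 -> 18.51)
        print(Decimal(figure).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP))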

    Decoding interval scores expressed in E notation in SPSS output

Sometimes SPSS produces numbers like 7.012E-02. This is not 7.012... it is 0.07012.

The E with a minus sign signals the number of places the decimal point has to be moved to the left.

So 1.369E-03 = 0.001369, etc. The E is a shorthand so as not to have to write a load of noughts.

Always convert any such figures into the familiar form if you report them in your work.

Correspondingly, 7.012E+02 would indicate 701.2.
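If you want to check a conversion quickly, most software reads this notation directly; a one-line check in Python:

    print(float('7.012E-02'), float('1.369E-03'), float('7.012E+02'))   # 0.07012 0.001369 701.2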

    Combining columns of scores for separate items in a test etc. to give a total or average score

Where a test or other instrument produces scores for separate items which then need to be added up to give a total score for a variable, one could of course add them up off the computer and just enter the totals. However, to check on internal reliability, or to do an analysis by items in addition, or to filter response times and exclude some, you will need the scores for every item in a separate column, so you will have to enter the data in full.

To then add columns, use Transform... Compute in SPSS to create a new column that totals the separate ones. You enter the title of the new summary column top left in the dialog box, and click the column names to be added into the top right space, with + between them. That creates a new column of totals.

However, anyone with a score missing in any column will be missed out and their total will come out as missing.

If there are missing values in some columns, marked in SPSS by a '.', where subjects failed to respond or have unanalysable data, you will probably want each person's total really to be the average score over all the items they answered, not the total (unless you have some reason to count missing as the same as wrong and so score it 0). You can get this by, in Transform... Compute, inserting in the Numeric Expression box the function MEAN(numexpr, numexpr, ...) from the functions list, and putting the relevant column labels in the brackets separated by commas. I.e. if you have a set of three items whose scores are in columns item1, item2, item3, then you would enter MEAN(item1, item2, item3) in the Numeric Expression box. SPSS then generates a new column with the average score of each case on the three items or, if they answered fewer, over the ones they answered.

Similarly, if you want to just add, not average, a set of columns, using whatever scores are available, then to avoid the people with missing values being recorded as having a zero total, use SUM(numexpr, numexpr, ...) in the same way as described for means above.
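For anyone handling the same data outside SPSS, an equivalent in Python with pandas might look like this (the column names item1-item3 are just the example above; skipna=True mirrors the way MEAN() averages over whichever items were answered).

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'item1': [1, 0, 1],
                       'item2': [1, 1, np.nan],   # one missing response
                       'item3': [0, 1, 1]})

    items = ['item1', 'item2', 'item3']
    df['mean_score'] = df[items].mean(axis=1, skipna=True)   # average over answered items
    df['sum_score']  = df[items].sum(axis=1, skipna=True)    # total of whatever is there
    print(df)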


    Cutting an interval scale into ordered categories

A common example is deriving a grouping of subjects from something you measured about them originally on a numerical scale: an explanatory variable such as their age, English proficiency, extraversion, etc. This is often done casually without due thought, and often in peculiar, idiosyncratic ways by novice researchers, but above all it needs careful thought about why it is done, and how.

Before you do this at all, you need to ask if it is necessary at all. Just because some other researcher had a high prof group and a low prof one does not mean you necessarily have to have groups. When you derive such groupings from scores originally recorded on a continuous interval scale, obviously you lose some information. One person may be a bit better than another on the original scores, but once you decide they both belong in the high prof group, or whatever, they are treated as identical in any further tests. This may or may not help produce the result you want. Certainly how you divide subjects into groups, if you do, can drastically affect the result!

    Reasons for cutting

There are a number of reasons: some statistical, some related more to research methods, design and hypotheses.

1. A few statistical techniques require interval scales to be reduced to a binary grouping. Implicational scaling (scalogram analysis) is one method of statistical analysis used in acquisition research that requires this: subjects simply have to be categorised as having acquired or mastered each feature of interest or not. So also varbrule analysis requires groupings of people who use or don't use some form of sociolinguistic interest.

2. If the true interval nature of a scale is in doubt, that could be a reason to reduce it to categories (though reduction to rank order would lose less information).

3. If you retain the original scores and look at relationships with other (dependent) variables, then you are typically into the statistics of correlation, and maybe multiple regression. If you form groups, then you can identify a mean for each group on the other variables of interest and compare those means with t tests, ANOVA etc. Both methods will show relationships between EVs and DVs, but the second will be better (or at least easier in SPSS) for dealing with:

i. nonlinear relations, e.g. where high and low proficiency subjects perform similarly on some other variable of interest, compared with intermediate subjects;

ii. interactions between different EVs, e.g. where you want to see the combined effects of gender and prof on something: do high prof females differ from high prof males in the same way as low prof females differ from low prof males?;

iii. designs involving repeated measures.

4. The goal of the research may be exploratory: precisely to discover useful categories of subjects.

5. You may wish to identify extreme groups of subjects for comparison. E.g. you want to compare bilinguals who are English dominant with those who are Welsh dominant. You do not want more or less balanced bilinguals. So you measure the bilingual dominance of a sample and will reject the middle scorers, keeping two extreme groups.


6. You need categories to form the IV in an experiment. E.g. you want words of three levels of frequency to be the stimuli for three conditions in an experiment. Or maybe you want extreme stimuli: just frequent and rare. Either way you need groups of words, as it is difficult to use an interval-scored variable directly as the EV in a repeated measures design.

    Means of cutting

OK, so you still want to make groups. There are many ways of doing it, and to some extent they match the reasons above. The principles apply to any interval-scored variable that is to be turned into a grouping. The issue is where to cut the original interval scale so as to obtain two or more groups of cases. (A small worked sketch in Python follows at the end of this list of methods.)

Cutting at a priori scale values. That is, cutting at pre-decided score values on the scale, which would be the same whatever sample you gathered. These values may or may not have some absolute meaning of the criterion-referenced type. Cf. Reason 1 above. Such a point could be:

One used arbitrarily by previous researchers. Not necessarily a good way to do it if it has no sound basis, other than that it then enables you later to compare your results directly with those of other researchers.

The pass mark used in a particular institution for some English exam, or a succession of such marks, e.g. corresponding to what are called grades A, B, C, D in some institution. Again such points may be fairly arbitrary, but perhaps meaningful for your research in allowing you to contextualise it.

Grades with some universal absolute meaning associated with them, maybe in a professional published test you have used. E.g. you divide subjects into those who got grade 6 or better in the IELTS test, and those who scored worse, given the widespread use of this value as a criterion for entry to UK universities. Ranges of scores on the Jacobs instrument for assessing EFL written compositions, and many international language tests, have proficiency definitions associated with them. A different example of this type is to divide a five point rating scale of the type strongly agree - agree - neutral - disagree - strongly disagree into just two categories: those who showed some agreement (i.e. the top 2 choices) versus the rest, who disagreed or were indifferent. This uses a division point with some clear meaning of its own (but why then did one not ask the question in the first place as a two choice item?).

The score on a variable scored as % correct which is conventionally regarded as indicative that someone has acquired a feature. Acquisition researchers vary in what they think this score is, but 80% or higher correct use of, say, third person s would be regarded by many as enough to put a subject in the group they would say has acquired the feature. Others argue that only 100% correct indicates true acquisition; others that any correct use greater than 0% indicates acquisition has occurred. Again, others use other scores, like the number of occurrences of a structure in 5 hours of observation (Bloom and Lahey 1978: 328), 5 or more indicating acquisition.


The score on a variable scored as % use of one alternative which is conventionally regarded as indicative that someone is a clear user of that alternative. Labov in his famous department store study divided subjects into three groups: those using no [r] sounds in the words 'fourth floor' said twice, those using them on all four possible occasions (categorical users), and those in between (i.e. variable users).

Scores defined by how some other relevant group of people performed on the same test or measure. E.g. for learners you might make use of the mean score of native speakers doing the test (a criterion group), or perhaps the score which only 15% of NS do better than (the 85th percentile). Alternatively one might rely on the mean score that large numbers of learners of the same sort as one's own testees gained in other research (a reference group). The latter is not often available in language research; it is more a feature of standardised NS tests like the British Picture Vocabulary Scale and so on.

Cutting the score scale into halves or equal lengths. That is only easy if the scale has fixed ends, such as a % score scale, or a test marked out of 40. E.g. you make four groups: those who scored between 0 and 10, 11-20, etc. (being careful not to label them with overlapping ranges 0-10, 10-20, 20-30). This is often not very meaningful unless the scale has some absolute meaning, so that half marks actually means half knowledge of something beyond the test items, and it produces unequal sized groups. Also it may not even be possible to achieve quite equal lengths with ease (0-10 actually covers one more point than 11-20!). However, it is a system that can be used with the same cutting scores on any sample, like the above but unlike those below. Mitchelmore (1981) suggests that the scale should not be cut into lengths that are too short, so as to avoid misclassification: lengths should not be shorter than 1.635 x SD of scores x (1 - reliability). Possibly useful for Reason 2 above.

Cutting so as to achieve equal numbers of subjects/cases in each group. Technically this uses the median and quartiles. I.e. if you had scored 30 people and want two groups, you simply put them in rank order on the basis of their scores, and the top 15 (those above the median score) become the high prof group, those below the low prof one. The cutting score obviously will differ for different samples and has no real meaning, but generally it is better for later comparisons if groups have more or less the same numbers of subjects in them. Often used for Reasons 2, 3 and 6 above.

Cutting at the mean, and points related to it. E.g. you divide into those who scored above the mean (average) and those below. Or four groups: those scoring more than one SD above the mean, those more than one SD below the mean, those between the mean and one SD above, those between the mean and one SD below it. To get three groups you might use the mean plus or minus half the SD as cutting points. The mean, like the median, is entirely relative to a particular sample, of course. The problem with dividing at the mean is that usually many cases score near the mean, so cases very close to each other will get put in different groups. If the original scoring is not perfectly reliable, that in turn means that some cases may be misclassified.

Cutting into natural groups using low points in the distribution shape. This is a simple form of cluster analysis and simply looks to see if the subjects in the sample seem to have grouped themselves (cf. Reason 4 above, and also maybe 2 and 3). I.e. looking at a histogram of scores, are there two or more heaps with a low point on the scale where few scored? Then make the cutting score the middle of the low point(s). This of course decides both where to cut and, unlike most methods, how many groups to identify.


It may vary from sample to sample, but it does reflect the nature of a particular sample better than some of the above methods. It will not work if the histogram is simply one heap (e.g. with the normal distribution shape), though sometimes rescaling the histogram with finer divisions may reveal what an initial SPSS histogram may conceal. As an example, the scores of 217 subjects on a College English exam in Pakistan are graphed below, and it is fairly clear that there are two groups in the sample: those scoring above 58 or so and those below. By comparison, the median score, above and below which are equal numbers of cases, is 50 for this data and appears rather arbitrarily to divide people within one of the groups that they seem naturally to form.

With all the above methods, but especially the third, researchers may choose to use extreme groups only. Often where a researcher wants to get clear differences between groups later, he/she will help this along a bit by, say, using the top third and the bottom third of subjects and missing out the middle third in any later comparisons. Reasons 5 and 6 above.
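Here is the promised sketch of three of the cutting methods above, done in Python with pandas on invented proficiency scores; the cut points and labels are mine, purely for illustration.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    scores = pd.Series(rng.normal(60, 12, size=30).round())   # invented proficiency scores

    # Median split: equal numbers of cases in each group
    median_groups = pd.qcut(scores, 2, labels=['low', 'high'])

    # Equal-length bands on a fixed 0-100 scale
    band_groups = pd.cut(scores, bins=[0, 25, 50, 75, 100],
                         labels=['0-25', '26-50', '51-75', '76-100'])

    # Mean plus/minus half an SD as cut points, giving three groups
    m, s = scores.mean(), scores.std()
    mean_groups = pd.cut(scores, bins=[-np.inf, m - s / 2, m + s / 2, np.inf],
                         labels=['low', 'mid', 'high'])

    print(pd.concat([scores, median_groups, band_groups, mean_groups], axis=1))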

However you cut, you have to be careful how you speak. Very often you will call the groups you make the 'high proficiency group' and the 'low proficiency group', or the like. But unless your original test that produced the scores was a criterion-referenced one, deciding some absolute level of prof for each taker, with international equivalence, this can be misleading. Very often the proficiency test researchers use was a cloze test you put together yourself, or the like. It may well distinguish students with higher proficiency from those with lower, in the sample of students you are using. But that does not mean there is any equivalence with what were called 'high prof' students by some other researcher who used a different test with a different sample in another country. It could be that all his students, high or low prof, are no better than the worst of your low prof group, and so on. Only if some standard published test such as FCE or TOEFL was used by all could you match up across studies and see if there was any real comparability between so-called high prof students in different studies. In fact close examination shows that many variables used in research have no absolute definitions of scale points, and most of the above ways of dividing cases into groups only distinguish in a relative way between who/what has more of something or less, not exactly how much.


    The size of the standard deviation

One is quite used to having SPSS calculate the SD along with the mean (= average) of a set of scores (i.e. for any interval scale).

We are also used to the idea that the SD measures the spread of scores around the mean. If all cases scored the same, the SD would be 0. The bigger the SD, the more spread out the scores of different cases are: the more subjects are disagreeing with each other in their scores. And the more that happens within groups, the harder it usually is to show any convincing differences between groups. Similar concepts to the SD are what statisticians call variance and error. These measures are slightly different, but all, roughly, are averages of the differences between each case's score and the mean. If all cases score the same, which will also be the mean score, then their differences from the mean are 0, so SD = 0.

Sometimes SPSS fails to perform a procedure because of a problem of 'zero variance'. That means it found that one of your groups, on one of the variables measured, had an SD of 0: all cases scored the same. This makes certain statistical procedures impossible: they involve variables and cannot work if everyone scores the same, as then you have not a variable but a constant. You cannot answer the question 'what is the relationship between age and reading ability?' if you have obtained data from a sample who are actually all of the same age!

So we know what an SD of 0 means, but what about big SDs? There is often no simple maximum value that the SD can have. But there are some guides to help assess the size of an SD:

It may often be of more interest whether different groups or conditions show similar or different variation (SD) than how great the SD actually is. In general you assess the size of an SD for each sample group separately.

If your scores are on a scale with both ends logically fixed (e.g. a test scored out of 40), then the maximum possible SD, if cases were maximally varied in scores, is half the scale length (well, actually it will be a shade above that for small numbers of cases, but that is a useful rule of thumb). So you can assess the size of an SD you get in relation to that. An SD would usually be regarded as big if it was even as much as half that maximum (i.e. a quarter of the scale length). On a scale of % correct scores, half the scale is 50. Note that on a five point rating scale running 1-5, half the scale length is 2. On such scales of course the mean is also limited: it cannot be a figure outside the end points of the scale. That places further limits on the size of the SD: the nearer the mean is to the limit of the scale, the smaller the maximum possible SD.

If your scores are on a scale with one or both ends virtually open, then the SD (and the mean) could be indefinitely large. In language research many scales are fixed at one end on zero, but open at the other. E.g. word frequency: words cannot occur less than 0 times, but there is no clear upper limit to how often they can be observed. So also sentence length: sentences cannot be shorter than one word, but they can be indefinitely long. Response times in milliseconds have hazier limits: there is no definite upper limit to how long anyone can take to respond to a stimulus and, although technically there is a lower limit of zero, nobody can really respond in zero milliseconds, so there is an indeterminate lower limit to fast responses. With these scales it is harder to say what is a big SD, but one can use some yardsticks:

One can use the maximum and minimum scores that occur in one's data as indications of the effective limits of the scale, and as above treat an SD larger than a quarter of the distance between them as large. For a scale fixed at one end, one could use the distance between the bottom limit and the highest observed score.

With scales fixed at the bottom end, but open at the high end, the distribution is often positively skewed: scores are heaped near the bottom limit and tail off to the right. In that situation the SD can be, and often has to be, greater than the mean, though if the distribution has a perfect Poisson shape, the mean = the square of the SD. If the mean is some way above the bottom limit, and that limit is 0, and the distribution is more symmetrical, then people sometimes assess an SD in relation to the mean: if the SD is as much as or more than half the mean, that indicates very substantial variation among the scores of a group.

Always look at the distribution shape on a histogram as well as the mean and SD. The shape may reveal more than anything else.
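Purely as an illustration of the rules of thumb above (this is arithmetic you can do anywhere, not something SPSS-specific), here is a minimal sketch in Python. The scores and variable names are invented for the example: it computes the SD of a set of test scores and compares it with a quarter of the scale length (for a scale fixed at both ends) and with half the mean (for a scale fixed only at zero).

    import statistics

    # Invented scores on a test marked out of 40 (scale fixed at both ends: 0-40)
    scores = [12, 25, 31, 18, 22, 27, 35, 14, 20, 29]

    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)        # sample SD, as SPSS reports it

    scale_length = 40 - 0
    print("mean =", round(mean, 2), " SD =", round(sd, 2))

    # Fixed-scale rule of thumb: the max possible SD is roughly half the scale
    # length, so an SD of a quarter of the scale length is already 'big'
    print("quarter of scale length =", scale_length / 4)

    # Open-ended-scale rule of thumb: an SD of half the mean or more
    # indicates very substantial variation within the group
    print("half the mean =", round(mean / 2, 2))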

    How to treat rating scale responses

An old problem is how to handle responses to items recorded on scales such as

strongly agree / agree / neutral / disagree / strongly disagree
always / often / sometimes / never

They are rating scales (not usually called multiple choice). They are clearly ordered choices and there is uncertainty whether they are really best thought of, and treated statistically, as

Ordered categories: so you present the results in bar charts, report the % of people who responded in each category on the scale, and use ordered category statistics to analyse relationships with other variables.

OR

Interval scores: so you assign a score number to each point on the scale and present the results as a histogram, report the mean and SD of the scores of a group, and use t tests, Pearson correlation or whatever when comparing groups or looking for relationships. The numbering could be e.g. strongly disagree = 0, disagree = 1, and so on; or if you prefer strongly disagree = -2, disagree = -1, neutral = 0 etc.

Generally it is far easier for any statistical handling to treat the data the interval score way, as the stats for interval scores are better known and more versatile in what they can do. The results are usually easier to absorb as well. Suppose two groups are asked how far they agree that a CALL activity is easy to understand; Group B is of a higher English level than A. Is it easier to derive some meaning from being told:

In group A the response was: strongly agree 43.3%, agree 20%, neutral 13.3%, disagree 13.3%, strongly disagree 10%. In group B it was: strongly agree 30%, agree 30%, neutral 10%, disagree 30%, strongly disagree 0%. The difference between the two groups is not significant (Kolmogorov-Smirnov Z = 0.365, p = 0.999).

    OR from


The mean agreement response (on a scale from -2 for strong disagreement to +2 for strong agreement) was 0.73 in group A and 0.6 in group B. Variation was similar in the two groups, and moderately high (SDs 1.41, 1.26). The difference between the groups is not significant (t = 0.265, p = 0.793).

I know which I find easier to follow!

So I advise going for the second interpretation wherever possible, but making sure that when you use such scales, the way they are used in the data gathering itself justifies this interpretation. In particular:

Make sure the words used for the points of the scale do suggest more or less equal intervals between one point and the next, otherwise the interval interpretation is invalid.

Accompany the wording with figures in the version presented to respondents, so they are encouraged to think of the scale as a numerical one, with equal intervals between the numbers.
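As an illustration of the interval-score treatment described above (outside SPSS, just to show the arithmetic), here is a minimal sketch in Python: verbal responses are recoded as numbers on the -2 to +2 scale, group means and SDs are reported, and an independent-samples t test compares the groups. The raw responses below are invented, not the real figures from the CALL example.

    from statistics import mean, stdev
    from scipy import stats

    # Map the verbal scale points onto interval-style scores (-2 to +2)
    coding = {"strongly disagree": -2, "disagree": -1, "neutral": 0,
              "agree": 1, "strongly agree": 2}

    # Invented raw responses for two groups
    group_a = ["strongly agree", "agree", "neutral", "disagree", "strongly agree",
               "agree", "strongly disagree", "agree", "neutral", "strongly agree"]
    group_b = ["agree", "agree", "neutral", "disagree", "strongly agree",
               "disagree", "agree", "neutral", "agree", "disagree"]

    a = [coding[r] for r in group_a]
    b = [coding[r] for r in group_b]

    print("Group A: mean %.2f SD %.2f" % (mean(a), stdev(a)))
    print("Group B: mean %.2f SD %.2f" % (mean(b), stdev(b)))

    # Independent-samples t test (equal variances assumed, as in the default SPSS row)
    res = stats.ttest_ind(a, b)
    print("t = %.3f, p = %.3f" % (res.statistic, res.pvalue))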

    Tests of prerequisites for parametric statistical tests

These tests of prerequisites are only of interest to check if the data is suitable for using some OTHER test that you are REALLY interested in, because it relates to your actual research questions or hypotheses. Tests of prerequisites generally apply where ANOVA/GLM is used, though researchers rarely report having made these checks, so we cannot tell if the checks were performed or not! You generally want them all to be nonsignificant, as that is what shows the data is straightforwardly suitable for parametric significance tests like ANOVA/GLM.

If a prerequisite test is failed then there may be alternatives within the parametric tests you can use to compensate, or weaker nonparametric tests you can use instead of straightforward ANOVA etc., or possible transformations of the data one could do... but often one has to just admit the data is not perfect for the procedure and carry on and use ANOVA anyway...

Their functions are as follows:

Any parametric significance tests... t tests, ANOVAs etc. all assume that the populations that the groups are from have distributions of scores that are normal in shape (i.e. that bell-shaped distribution you see in all the books). Check with the K-S test (though on small samples everything passes this test!!).

t test for 2 independent groups, and all ANOVAs involving comparisons of 2 or more groups (with or without also repeated measures). The groups need to each have a similar spread of scores within them round their respective means (= homogeneity of variance). Check with Levene's test, which (roughly) decides if the SDs of the groups could be from one population of SDs, so are similar, or not. The t test for 2 independent groups has alternative versions depending on whether this prerequisite Levene test is passed (nonsig) or not, but ANOVAs don't; they all assume the prerequisite test of equal variances is passed.

All ANOVAs involving comparisons of 3 or more repeated measures (with or without independent group comparisons as well). Here again the spreads of the scores in each condition need ideally to be similar. Strictly it is the covariation between each pair of conditions that needs to be similar (= sphericity). Check with Mauchly's test (which SPSS automatically gives you even where you only have two repeated measures, though it applies vacuously there and need not be looked at). The check, roughly speaking, looks at the correlation between the scores in each condition and those in each other condition, in pairs, and sees if the correlations could all be from a population with one correlation or not. The data would likely not pass if people who did better on condition A also did so on B but were the worst on C, and so on... If it is passed (nonsig) then you use the 'sphericity assumed' results in the ANOVA table, otherwise the ones below those (Greenhouse-Geisser).

ANOVAs with a mixture of repeated measure comparisons and independent groups. Here there is an extra requirement: the pattern of covariance between conditions in each group separately should also be similar between the groups. Check with Box's M test.
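For anyone checking the first two of these prerequisites outside SPSS, here is a minimal sketch in Python using SciPy. The group scores are invented; the point is simply that you want these p values to be nonsignificant (greater than .05). (Shapiro-Wilk is used here for normality; SPSS also offers the K-S test mentioned above.)

    from scipy import stats

    # Invented scores for two independent groups
    group1 = [12, 15, 14, 10, 13, 17, 11, 16, 14, 12]
    group2 = [18, 21, 17, 22, 19, 16, 20, 23, 18, 21]

    # Normality check for each group: nonsignificant p means no evidence
    # of departure from a normal-shaped distribution
    for name, g in [("group1", group1), ("group2", group2)]:
        w, p = stats.shapiro(g)
        print(name, "Shapiro-Wilk p = %.3f" % p)

    # Levene's test for homogeneity of variance across the groups:
    # nonsignificant p means the spreads can be treated as similar
    stat, p = stats.levene(group1, group2)
    print("Levene p = %.3f" % p)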

    Missing values

Missing values are where cases have scores or categorisations completely missing for some reason, where most cases did provide data. E.g. they gave no response, were uncooperative, or their response was unanalysable, etc. (Where subjects have taken a multi-item test or the like to produce their scores, then they may miss some items but still get a score for the test as a whole. That is a different issue: you have to decide there whether a missed-out item counts as wrong, or whether you allow people to miss items and as overall test score give them the average score for the set of items they did answer.) Missing values are usually entered in SPSS as a '.' in the space where a figure should be, unless you have assigned an actual number that you enter as indicating missing values, and declared it under Variable View → Missing.

If you have missing values there may be problems:

- You may have very few cases left that you can use in the required statistical analyses: especially in repeated measures and multivariate designs, if a case has data missing on one variable/condition included in an analysis, it gets left out totally (i.e. listwise).

- The missing values may not be random: certain kinds of subject may be more prone to produce them, so using the data without them, or with too few of them, will lead to a biased result. E.g. young versus older testees; lower versus middle class informants.

If you leave missing values in place, SPSS usually gives the choice (in Options for a given test) for you to treat them listwise or pairwise/test-by-test. This really applies to multiple analyses of the same data, as within one analysis it usually has to be listwise, meaning that the number of cases used is the maximum number that has a complete set of data across all the relevant columns. E.g. if in Correlation you want correlations done between every pair of variables in 5 columns: ten pairs, so ten analyses. The listwise option would get you correlations using just the cases with full data across all 5 columns, so the same number of cases would be used in each analysis. Pairwise would, for each analysis, use the maximum cases with data on both the relevant columns, so use more of the data, but different numbers of cases might well be used to calculate different correlations.
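As an aside for anyone checking the same logic outside SPSS, here is a minimal sketch in Python/pandas of the difference: pandas' DataFrame.corr() uses pairwise deletion by default, while dropping incomplete rows first gives the listwise result. The small dataset is invented for illustration.

    import pandas as pd
    import numpy as np

    # Five variables with scattered missing values (invented data)
    df = pd.DataFrame({
        "v1": [3, 5, 7, 4, 6, 8],
        "v2": [5, 7, 9, np.nan, 8, 10],
        "v3": [2, np.nan, 6, 3, 5, 7],
        "v4": [1, 2, 3, 4, np.nan, 6],
        "v5": [4, 6, 8, 5, 7, np.nan],
    })

    # Pairwise: each correlation uses all cases with data on that pair of columns
    pairwise = df.corr()

    # Listwise: only cases complete on every column enter any correlation
    listwise = df.dropna().corr()

    print("listwise n =", len(df.dropna()), "(pairwise n varies per correlation)")
    print(pairwise.round(2))
    print(listwise.round(2))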

If you want to fill in missing values, the main principle is that it should not be done in some way that will clearly directly influence the result you are interested in. I.e. you should not fill in the missing values following a principle that will obviously make the difference or relationship which is the focus of your actual research more marked.

Broadly there are two ways of filling in missings in any column in SPSS (where a column represents a variable, or a condition in repeated measures data):

A) You fill in with the mean of the scores in the column itself (or, if it is in categories, the mode, which is the most popular category in that column).

B) You fill in by predicting a score from the general correlation of that column with others in the data: the EM and regression methods.

Imagine data as follows:

C1   C2
 3    5
 5    7
 7    9
 4    .
 6    8

If the research question concerns whether there is a relationship between the two variables in C1 and C2 (correlational design), then you do NOT use method B, which would use the correlation that exists already in the data to fill in missing values. I.e. here, given the perfect positive correlation between the two sets of scores, method B would fill in the missing value as 6, predicting it from C1. But that will obviously enhance the perfection of the correlation which it is your aim to discover! So the mean of the second column (method A) would be a better fill-in value: 7.25.

If on the other hand this was data from the same subjects on the same DV scored in two conditions in C1 and C2 (repeated measures design), and the research interest is in the difference between the means of the scores in each column (do they score significantly higher in condition 2?), the better way to fill in the missing value would be method B. Method A would simply enhance the level of the mean of C2, and strengthen its distance from the mean of C1.
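Here is a minimal sketch in Python of the two fill-in methods applied to the toy data above. The regression fill-in uses a simple least-squares line fitted on the complete rows, which is a simplification for illustration, not a reproduction of SPSS's EM/regression routines.

    import numpy as np

    c1 = np.array([3.0, 5.0, 7.0, 4.0, 6.0])
    c2 = np.array([5.0, 7.0, 9.0, np.nan, 8.0])

    observed = ~np.isnan(c2)

    # Method A: fill the missing C2 value with the mean of the observed C2 scores
    fill_a = c2[observed].mean()                      # 7.25

    # Method B: predict the missing C2 value from C1 using a regression fitted on
    # the complete rows (the correlation here is perfect, so it predicts 6)
    slope, intercept = np.polyfit(c1[observed], c2[observed], 1)
    fill_b = slope * c1[~observed][0] + intercept     # 6.0

    print("Method A fill-in:", round(fill_a, 2))
    print("Method B fill-in:", round(fill_b, 2))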

For these reasons, when you run correlation-type statistics like Regression and Factor analysis, SPSS under Options offers you the choice to fill in missing values with the means (method A) as it operates. The data in Data View does not get visibly altered: you just find that all the cases have been used, instead of those with missings being left out. Similarly in Regression with optimal scaling, which works on associations between categories rather than interval scores, there is the choice to use Mode imputation, which fills in the missings with the most popular category in the relevant column.

In situations where method B is suitable, you have to use Analyze → Missing Value Analysis to actually fill in the missing values in the data in Data View beforehand. Basic instructions: at the first box, enter all the columns relevant to the analysis you will be doing, either as quantitative (i.e. interval) or categorical (categories/nominal). Only the former are actually used in the estimation of missing scores, though (SPSS does not seem to provide a way of filling in missing category data by method B). Tick EM and, if there are some quantitative columns that you don't want used as a basis for predicting values of missings, click the Variables button and make your selection. Otherwise all the quantitative columns you declared in the first box are used to predict any missings in each other. Click the EM button and tick Save completed data; under File name, name a file for it to be stored in. Then Save → Continue → OK. The procedure will produce various output, but mainly you are interested in the new stored file of data. If you call it up, you will find the missings all filled in.

In data for independent groups analysis (e.g. t tests, ANOVA), with missings in the DV column, if you have other columns of dependent variable data not being used in the same analysis, you could use them to fill in the missings by method B. Otherwise you can only use method A, i.e. use the mean for the whole DV column (NOT the mean of each group) to fill them in.

    Getting phonetic symbols displayed in SPSS graphs

First ensure you have the fonts of your choice (e.g. SILManuscriptIPA etc.) installed in Windows in the usual way. If they are available to you in Word in the usual way via Insert → Symbol, then they will be available in SPSS. If not, get a copy of the font file (ending .ttf) and put it in the Fonts subdirectory of the Windows folder on your PC.

Now, having made a graph in SPSS, click the graph you have created to make it appear in the Chart editing window. Then click the part you want to put special symbols in, such as the bottom scale, so it comes up outlined. Next click Format → Text, select the required font from the menu and the size you want, and click Apply, then Close.

Now when you click the scale of the graph and choose to change the Labels, you can type the symbols you want. However, you don't initially see them when you type them in the dialog box. You have to know that in the SIL font shift-t gets you the symbol for the 'th' sound of thick, though it will look as if you have just got T. Anyway, you have to type all the labels in the new font; you cannot mix symbols from different fonts, I think. So retype the labels using Change, and Continue. The symbols you want will appear on the graph itself.

I have not found a way to get symbols that are coded outside the range of the font that is covered by the keyboard keys, with and without shift. To know what symbols you can get from which key, with and without shift, you may have to study the table of symbols for your font in advance through a program such as Word, which displays it through the Insert → Symbol option.

    Item Analysis

This term is found used in two distinct senses. Both involve data where variables or experimental conditions are each measured using a set of items in some way.

A) The usual traditional sense, found especially in the pedagogical testing literature. Here it applies in the situation where a set of items is used to measure what is regarded as one single variable/construct. The set of items is usually thought of as a multi-item test of one thing (e.g. reading ability, or vocabulary size). However, item analysis may also be applied to, say, a set of Gardner-type statements for respondents to agree or not with, where a distinct attitude or orientation is measured by an inventory of five such statements, rather than just one. It can also apply separately to each set of items designed collectively to measure a single condition in an experiment. Item analysis in all these instances is the activity of checking whether there are some items in the set that in some way do not seem to belong there, illuminating how and, if possible, why they are odd, and maybe removing them or replacing them with better items when the test is used again. It is closely tied to internal reliability checking, often done these days with the use of the Cronbach alpha coefficient or Rasch analysis. Removing items that are odd improves reliability. This sort of item analysis is often done in pilot studies, as it represents a way of refining the quality of instruments for use in a main study. There are several statistical criteria for deciding what items are odd in a set that is supposed to be all measuring one thing. See further my Reliability handouts. Where items are supposed to attract similar levels of response (e.g. be of similar difficulty), then the classical IA approach involving alpha is appropriate; where items are supposed to be graded, and form an implicational scale, then an approach using IRT/Rasch is better. Where response times are involved, other criteria may be used to exclude responses for specific people on specific items (i.e. individual instances) rather than whole items.

B) The sense in which it is found used in some psycholinguistic literature. Here it denotes a second kind of analysis of data, beyond the usual default one. In an item analysis, instead of the subjects (usually people) being treated as the cases, the items are treated as the cases. Hence it is really 'analysis with items as cases', rather than 'item analysis', and is typically part of the analysis of the results of a main study. This applies only when a study has several conditions, each represented by a set of items, but this is very common in psycholinguistic studies, where subjects' performance in different conditions is often measured by their responses to sets of stimuli in a repeated measures design. For example a repeated measures variable 'word frequency' might be constituted as three sets of ten words, of three different frequency levels, making 30 items for people to respond to in some way; a variable 'early vs late attachment' could be instantiated as two sets of sentences, of two structure types, one in which a relative clause has to be parsed with an early noun phrase, the other with a late occurring one. Often such data arises also in areas such as SLA, applied linguistics and even sociolinguistic research as well as psycholinguistics, but item analysis in this sense is only routine in the latter, where it is regarded as a further confirmation of results obtained by the usual 'subject analysis', i.e. analysis with subjects as cases. Where, as often, ANOVA (see my handouts) is used to analyse the results, then the F values for the subjects-as-cases analysis are reported as F1, and those for the items-as-cases analysis as F2. Statisticians generally regard analysis with subjects as cases as the sounder basis, due especially to the independence requirement. Cases have to be regarded as providing independent observations if the assumptions of inferential statistical tests (e.g. ANOVA) are to be met. While it is generally not difficult to assume that responses from different people are independent of each other, it is not so certain that responses to different items are so independent, when the same people respond to all of them. One has to assume that in psycholinguistic experiments people are unable to make their responses to one item reflect their response to another. This is often assumed by phoneticians and psycholinguists.


Imaginary dataset to illuminate both the above. Suppose we have two groups of ten people (G1 and G2), and each person responds in two conditions (C1 and C2), where 5 items are used to obtain responses for each condition. As laid out for a customary subjects-as-cases analysis in SPSS this would appear as 11 columns and 20 rows, as below. Of course, the items would often not have been presented to subjects in an experiment in sets, but intermixed with each other and maybe with additional distracter/filler items that are not scored at all.

Group | C1 item1 | C1 item2 | C1 item3 | C1 item4 | C1 item5 | C2 item1 | C2 item2 | C2 item3 | C2 item4 | C2 item5

10 rows labelled 1, to mark each G1 subject | scores for each G1 person on C1 item 1 | scores for each G1 person on C1 item 2 | etc. across the remaining C1 and C2 item columns

10 rows labelled 2, to mark each G2 subject | scores for each G2 person on C1 item 1 | etc. across the remaining C1 and C2 item columns

To do item analysis in sense (A) above in SPSS, you would split the file by Group and use Analyze → Scale → Reliability analysis → Alpha on each set of five items separately (or for Rasch analysis, you need other software). Four analyses. That means that the internal consistency is always assessed within a collection of scores which is from a set of items that supposedly measures one thing, and which comes from a homogeneous group of subjects. After any adjustment of the data to improve reliability based on the above, you then typically move on to the actual analysis of results with subjects as cases. You first produce two extra columns which contain the averages of each five-item set of scores for each person. Use Transform → Compute. These 'Mean C1' and 'Mean C2' columns each now summarise the performance of subjects in one condition. Those two columns, together with the Group column, are then used in a mixed two-way ANOVA to see if there is a sig difference between groups or between conditions, or a significant interaction effect. That is your subjects-as-cases F1 ANOVA.
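As an illustration of the arithmetic behind those SPSS steps (not of SPSS itself), here is a minimal sketch in Python that computes Cronbach's alpha for one set of items and then the per-person condition means that would go into the F1 analysis. The item scores are randomly generated stand-ins for real data.

    import numpy as np

    def cronbach_alpha(items):
        # items: one row per person, one column per item
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)          # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)      # variance of the summed test score
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Invented scores: 10 people x 5 items for condition C1
    c1_items = np.random.default_rng(1).integers(0, 6, size=(10, 5))

    print("alpha for the C1 item set: %.2f" % cronbach_alpha(c1_items))

    # The 'Mean C1' column for the subjects-as-cases analysis
    mean_c1 = c1_items.mean(axis=1)
    print("per-person C1 means:", np.round(mean_c1, 2))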

For item analysis in sense (B), you need to make the items into the rows. You can do this with Data → Transpose in SPSS. If you start from the data as displayed above and include all the columns, you end up with 11 rows, which were previously the columns. There are columns now for each of the 20 subjects. You can now use Transform → Compute to get two new columns calculated which represent the mean scores for each group of subjects on each item. Then delete the row that contains the grouping numbers. Add a column of 5 1s and 5 2s to record which items (now rows) relate to condition C1 and which to C2. So the data should end up much as below. Finally use the column that records whether an item belongs to C1 or C2, and the two columns of group mean scores for each item. Again do a mixed two-way ANOVA to see if there is a sig difference between groups or between conditions, or a significant interaction effect. That is your items-as-cases F2 ANOVA. Note that what was a repeated measures factor in the F1 subject analysis, condition, becomes a between-groups factor in the F2 item analysis. The grouping of subjects, which was a between-groups factor in F1, becomes a repeated measures factor in F2.

G1 subj1 | G1 subj2 | G1 subj3 | etc. to G1 subj10 | G2 subj1 | G2 subj2 | G2 subj3 | etc. to G2 subj10 | Condition | Group 1 | Group 2

5 rows with scores for G1 subj1 on each C1 item | scores for G1 subj2 on each C1 item | etc. | ... | ... | ... | ... | ... | 5 rows labelled 1, to mark each C1 item | mean scores of the 10 G1 subjects on each C1 item | mean scores of the 10 G2 subjects on each C1 item

5 rows with scores for G1 subj1 on each C2 item | etc. | ... | ... | ... | ... | ... | ... | 5 rows labelled 2, to mark each C2 item | mean scores of the 10 G1 subjects on each C2 item | mean scores of the 10 G2 subjects on each C2 item

Note, the above account of items-as-cases analysis assumed that the sets of items used to represent the two conditions were not themselves matched or repeated in any way. I.e. C1 items 1-5 might have been five nouns as stimuli in some response time experiment, and C2 items 1-5 five verbs, with no special connection between individual verbs in one set and individual nouns in the other. If however the items are themselves matched in pairs, or repeated in different forms etc. across conditions, the items-as-cases analysis should be different. E.g. if C1 items were five verbs in the past tense and C2 five verbs in the bare infinitive form, the researcher might choose to use the same five verbs in both conditions (randomised with suitable distracters interspersed when they are actually presented to subjects). Then the items are individually matched and the items-as-cases analysis should be done with the items as repeated measures. I.e. in the data grid above for SPSS, the 5 rows for C2 responses would need to be not below the 5 rows for C1 but side by side, with the matched items in the same row, to allow repeated measures comparison of items as well as subjects.
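For anyone wanting to see the reshaping logic outside SPSS, here is a minimal sketch in Python/pandas of the transpose-and-average step described above for the unmatched-items case: subject-by-item scores are transposed so items become the cases, and mean scores per group are computed for each item. All names and scores are invented for the illustration.

    import pandas as pd
    import numpy as np

    rng = np.random.default_rng(0)

    # Subjects-as-cases layout: 20 subjects (rows) x 10 item columns, plus Group
    cols = [f"C1_item{i}" for i in range(1, 6)] + [f"C2_item{i}" for i in range(1, 6)]
    data = pd.DataFrame(rng.integers(0, 6, size=(20, 10)), columns=cols)
    data["Group"] = [1] * 10 + [2] * 10

    # Items-as-cases layout: transpose so each item is a row
    items = data.drop(columns="Group").T
    items["Condition"] = [1] * 5 + [2] * 5        # which condition each item belongs to

    # Mean score of each group's subjects on each item (the 'Group 1'/'Group 2' columns)
    g1_cols = data.index[data["Group"] == 1]
    g2_cols = data.index[data["Group"] == 2]
    items["Group1_mean"] = items[g1_cols].mean(axis=1)
    items["Group2_mean"] = items[g2_cols].mean(axis=1)

    # These item-level means, with Condition as a between-items factor and Group as a
    # repeated measure over items, feed the F2 (items-as-cases) ANOVA described above
    print(items[["Condition", "Group1_mean", "Group2_mean"]])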

Checking for guessing or response bias when using certain data-gathering instruments with closed responses

    Checking for guessing

Any instrument where the subjects are given choices to pick from for an answer is potentially open to guessing, in the sense of picking one option at random, without thought.

For example, the respondent may randomly pick one of the choices because:

they can't be bothered to think about the question/item and just want to finish quickly
they don't actually have any relevant knowledge to make a correct choice
they can't understand the question (language too hard, too long, pragmatically odd etc.)
etc.

    Clearly the results will not then be a true measure of whatever the researcher intended to measure,

    and could even vary if the subjects responded to the same items again on another occasion. I.e. not

    valid or even reliable.

    This affects multiple choice items, yes/no or agree/disagree items in questionnaires and tests, rating

    scales and so forth. Clearly it cannot affect instruments which have open response in some form, i.e.

    with no alternatives supplied.

    One cannot statistically tell definitely if guessing has taken place or not, but one can check if the

    responses are like those one would get from someone who was guessing, or not. Obviously it is quite

    possible to get a real result, where people have paid attention and answered sensibly, which happens

    to be similar to the guessing one. Only the researcher can judge the interpretation.

    You need to calculate what the result would be, on average, for someone who was randomly guessing,

    and use the appropriate one sample test (see my LG475-SP handout) to check if the observed result

    differs significantly from the one you would get by random guessing.

    For example:

    1) 30 subjects have to answer yes or no to a question about whether they use the keyword method of

    vocab learning or not. Random guess frequency of yes would yield a frequency of 30/2 = 15 yes

    responses. Use 50% binomial test.

2) 30 subjects have to pick one of four reasons they are offered for why they are learning English. Random guess frequency of each choice being picked would be 30/4 = 7.5. Use the chi-squared one-sample fit test.
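A minimal sketch in Python of those two checks, with hypothetical observed counts invented for the illustration: a binomial test against a 50% guessing rate for example 1, and a chi-squared goodness-of-fit test against equal expected frequencies for example 2 (binomtest needs a reasonably recent SciPy).

    from scipy import stats

    # Example 1: say 22 of the 30 subjects answered 'yes'.
    # Does that differ significantly from 50-50 random guessing?
    result = stats.binomtest(22, n=30, p=0.5)
    print("binomial test p = %.3f" % result.pvalue)

    # Example 2: hypothetical observed counts for the four offered reasons (30 subjects).
    # Random guessing would give expected counts of 30/4 = 7.5 in each category.
    observed = [14, 8, 5, 3]
    chi2, p = stats.chisquare(observed, f_exp=[7.5, 7.5, 7.5, 7.5])
    print("chi-squared = %.2f, p = %.3f" % (chi2, p))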

    3) 30 subjects have to judge 20 words for whether they exist or not in English. Thus each person gets

    a score out of 20 for how many they say exist. The average random guess score would be