
STATISTICAL ANALYSIS USING SPSS

Anne Schad Bergsaker

24. September 2020

BEFORE WE BEGIN...

LEARNING GOALS

1. Know about the most common tests that are used in statistical analysis

2. Know the difference between parametric and non-parametric tests

3. Know which statistical tests to use in different situations

4. Know when you should use robust methods

5. Know how to interpret the results from a test done in SPSS, and know if a model you have made is good or not

1

TECHNICAL PREREQUISITES

If you have not got SPSS installed on your own device, use remote desktop by going to view.uio.no.

The data files used for examples are from the SPSS survival manual. These files can be downloaded as a single .zip file from the course website.

Try to do what I do, and follow the same steps. If you missed a step or have a question, don’t hesitate to ask.

2

TYPICAL PREREQUISITES/ASSUMPTIONS FOR ANALYSES

WHAT TYPE OF STUDY WILL YOU DO?

There are (in general) two types of studies: controlled studies and observational studies.

In the former you run a study where you either have parallel groups with different experimental conditions (independent), or you let everyone start out the same and then actively change the conditions for all participants during the experiment (longitudinal).

Observational studies involve observing without intervening, to look for correlations. Keep in mind that these studies cannot tell you anything about cause and effect, but simply co-occurrence.

3

RANDOMNESS

All cases/participants in a data set should, as far as possible, be a random sample. This assumption is at the heart of statistics.

If you do not have a random sample, you will have problems with unforeseen sources of error, and it will be more difficult to draw general conclusions, since you can no longer assume that your sample is representative of the population as a whole.

4

INDEPENDENCE

Measurements from the same person will not be independent. Measurements from individuals who belong to a group, e.g. members of a family, can influence each other, and may therefore not necessarily be independent.

In regular linear regression analysis, data need to be independent.

For t-tests and ANOVA there are special solutions if you have data from the same person at different points in time or under different test conditions.

It is a little more complicated to get around the issue of individuals who have been allowed to influence each other. It can be done, but it is beyond the scope of this course.

5

OUTLIERS

Extreme values that stand out from the rest of the data, or data from a different population, will always make it more difficult to make good models, as these single points will not fit well with the model, while at the same time they may have a great influence on the model itself.

If you have outliers, you may want to consider transforming or trimming (remove the top and bottom 5%, 10%, etc.) your data set, or you can remove single points (if it seems like these are measurement errors). Alternatively, you can use more robust methods that are less affected by outliers.

If you do remove points or change your data set, you have to explain why you do it. It is not enough to say that the data points do not fit the model. The model should be adapted to the data, not the other way around.

6

LINEARITY AND ADDITIVITY

Most of the tests we will run assume that the relationship between variables is linear. A non-linear relation will not necessarily be discovered, and a model based on linearity will not provide a good description of the data.

Additivity means that a model based on several variables is best represented by adding the effects of the different variables. Most regular models assume additivity.

7

HOMOSCEDASTICITY/CONSTANT VARIANCE

Deviations between the data and the model are called residuals. The residuals should be normally distributed, but they should also have more or less constant spread throughout the model. Correspondingly, the variance or spread in data from different categories or groups should also be more or less the same.

If the error in the model changes as the input increases or decreases, we do not have homoscedasticity. We have a problem: heteroscedasticity.

8

NORMALITY OR NORMAL DISTRIBUTION

Most tests assume that something or other is normally distributed (the residuals, the estimates should come from a normal sampling distribution, etc.), and use this to their advantage. This is the case for t-tests, ANOVA, Pearson correlation and linear regression.

Because of the central limit theorem we can assume that for large data sets (more than 30 cases) the parameter estimates will have a normal sampling distribution, regardless of the distribution of the data. However, if you have much spread in the data or many outliers, you will need more cases. It is usually a good idea to have at least 100 just to be safe.

It is still a good idea to check how the data are distributed before we get started on any more complicated analyses.

9

NORMALITY

To check if the data are normally distributed we use Explore: Analyze > Descriptive Statistics > Explore

• Dependent list: The variable(s) that you wish to analyze

• Factor list: Categorical variable that can define groups within the data in the dependent list.

• Label cases by: Makes it easier to identify extreme outliers

10

NORMALITY

Explore: Plots

• Boxplots: Factor levels together - If you have provided a factor variable, this option will make one plot with all the groups

• Boxplots: Dependents together - If you have provided more than one dependent variable, this will put the different variables together in the same graph

• Descriptive: Histogram is usually the most informative choice

• Normality plots with tests: Plots and tables that make it clearer if the data are normal or not

11

NORMALITY

Output: Tests of Normality and Extreme Values

• Tests of Normality: If the data are perfectly normal, Sig. will be greater than 0.05. HOWEVER, this hardly ever happens in large data sets. Therefore it is better to look at the plots to decide if they are normal or not.

• Extreme values: the five largest and smallest values.

12

NORMALITY

• Histogram: Here we see that the data are a little skewed, but overall they are almost normal

• Box plot: Shows much the same as the histogram

13

NORMALITY

• Normal Q-Q plot: The black line shows where the data should be if they are perfectly normal. Except for the right tail, the data lie fairly close to the line

• Detrended normal Q-Q plot: This shows the deviation between the data and the normal distribution more clearly. There is no clear trend in the deviation, which is a good sign, but we see even more clearly that the right tail is heavier than expected according to the normal distribution.
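For readers who want to see the same checks outside the SPSS dialogs, the sketch below runs a Shapiro-Wilk test and draws a histogram and a normal Q-Q plot in Python. It is only an illustration of the idea; the simulated data stand in for whatever variable you are exploring.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(loc=22, scale=4, size=200)      # stand-in for the variable under study

    # Shapiro-Wilk, like SPSS's Tests of Normality: p > 0.05 means no detected deviation
    w, p = stats.shapiro(x)
    print(f"Shapiro-Wilk W = {w:.3f}, p = {p:.3f}")

    # Graphical checks: histogram and normal Q-Q plot
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.hist(x, bins=20)
    stats.probplot(x, dist="norm", plot=ax2)
    plt.show()

As on the slides, the plots usually tell you more than the test itself once the sample is large.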

14

HELP, MY DATA DO NOT FULFILL THE REQUIREMENTS!

In some cases you will be able to choose a non-parametric test instead, which does not have the same strict prerequisites.

Trim the data: Remove the highest and lowest 5%, 10%, 15%, etc.; alternatively, trim based on standard deviation (usually a bad idea, as the standard deviation is heavily influenced by outliers).

Winsorizing: Replace the values of extreme data points with the highest (or lowest) value that is not an outlier.

Bootstrapping: Create many hypothetical sets of samples, based on the values you already have, and do the same types of analysis on all these samples, to obtain interval estimates.

Transform variables that deviate greatly from the normal distribution.

15

BOOTSTRAPPING

Bootstrapping is a form of robust analysis, and it is the most commonly used one. It is also very easy to implement in SPSS.

Bootstrapping is a method where you get SPSS to treat your sample as a kind of population. New cases are drawn from your sample (and then returned), to create a new sample consisting of N cases. Usually N is equal to the number of cases in your sample. This is done as many times as you want, typically at least 1000 times, to create the same number of new samples. For each sample, the statistical parameters you ask for are calculated. Based on the results from all of the samples, interval estimates for each parameter are given, e.g. for the mean or a correlation coefficient.
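The resampling idea itself is simple enough to sketch in a few lines of Python. This is not what SPSS does internally (SPSS offers bias corrected accelerated intervals; the sketch uses plain percentile intervals), just an illustration with made-up data.

    import numpy as np

    rng = np.random.default_rng(42)
    sample = rng.exponential(scale=2.0, size=50)   # a skewed sample, for illustration

    n_boot = 1000
    boot_means = np.empty(n_boot)
    for i in range(n_boot):
        # draw N cases with replacement from the original sample
        resample = rng.choice(sample, size=sample.size, replace=True)
        boot_means[i] = resample.mean()

    lo, hi = np.percentile(boot_means, [2.5, 97.5])  # percentile 95% CI for the mean
    print(f"mean = {sample.mean():.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")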

16

TRANSFORMATION OF ’ABNORMAL’ DATA

This is a last resort, as it can make it harder to interpret the results.

17

NUMBER OF CASES/DATA MEASUREMENTS

There is no single number of cases needed to use a specific statistical test, or to obtain significance. It will depend on the type of analysis and the size of the effect you are looking for. The smaller the effect, the more cases you need.

The central limit theorem states that even though the population the data are taken from is not normal, the estimates you create will have a normal sampling distribution as long as you have enough data; typically you need at least 30 cases. However, 30 cases is the absolute minimum, and is not always enough.

If you have data with a high variance, you will need more than 30 cases, and if you wish to compare groups, the different groups should also have more than 30 cases each.

18

EXPLORE RELATIONSHIP BETWEEN VARIABLES

CORRELATION

Correlation measures the strength of the linear relationship between variables. If two variables are correlated, a change in one variable will correspond with a change in the other.

Correlation is given as a number between -1 and 1, where 1 indicates a perfect correlation. If one variable increases in value, the other will increase. This does not mean that an increase of 1 in one variable will give an equal increase in the other.

Correlation equal to -1 is also a form of perfect correlation, but it indicates that an increase in one variable corresponds to a decrease in the other variable. This is called a negative correlation.

Correlation equal to 0 means that there is no linear relationship between the variables. If one increases, the other will change at random.

Pearson correlation is a parametric test. Spearman correlation and Kendall’s tau are non-parametric. If you have significant outliers you should use one of these instead.

19
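As a point of reference, the three correlation coefficients mentioned above can be computed directly in Python with scipy; the two simulated variables below are only stand-ins.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    x = rng.normal(size=100)
    y = 0.5 * x + rng.normal(scale=0.8, size=100)   # two correlated variables

    r, p_r = stats.pearsonr(x, y)        # parametric
    rho, p_rho = stats.spearmanr(x, y)   # non-parametric, rank based
    tau, p_tau = stats.kendalltau(x, y)  # non-parametric
    print(f"Pearson r = {r:.2f} (p = {p_r:.3f})")
    print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
    print(f"Kendall tau = {tau:.2f} (p = {p_tau:.3f})")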

CORRELATION

20

CONDITIONS FOR REGULAR/PEARSON CORRELATION

You need two (or more) continuous variables

The relationship between the variables should be linear

There should not be any very distinct outliers

The variables should be approximately normal, but this is only very important if you have small data sets.

If you wish to study correlation between ordinal and continuous variables you should use Spearman or Kendall’s tau. If you have non-normal data, or fewer than 30 cases, you can use either Spearman correlation or Pearson correlation in combination with bootstrap.

21

CORRELATION

Find the correlation between two continuous variables: Analyze > Correlate > Bivariate

We use the data set survey.sav. Select the variables you want to investigate. Make sure that Pearson and Two-tailed are selected. If you are unsure which type of correlation to calculate, select Spearman and Kendall’s tau too.

22

HANDLING OF MISSING DATA

Choose Exclude cases pairwise, to include as much data as possible in the analysis. This will only exclude cases when there is a missing value in one of the two variables being compared at a time. Exclude cases listwise will exclude all cases that have missing data points in at least one of the variables included in the analysis. In this case it makes no difference, as we have two variables, but with more than two it can make a difference.

23

BOOTSTRAPPING

If you are not sure if you have enough data for the central limit theorem to save you from a lack of a normal distribution, you can get SPSS to run a quick bootstrapping at the same time. Select Perform bootstrap, and choose how many samples. 1000 is the default and is usually enough, but if you want, you can increase it to 2000. Choose Bias corrected accelerated.

24

CORRELATION

The bootstrapping calculates confidence intervals for our estimates of the sample mean and correlation coefficients. The CIs support what the significance suggests, i.e. that we have a significant correlation, and that in this case the variables are highly correlated. (Small effect: r≥0.1, Medium: r≥0.3, Large: r≥0.5).

25

CORRELATION

Find grouped correlation between two variables

• In order to find the correlation between variables for different groups in your sample, you can use Split File.

• Data > Split File > Compare groups

• Run the same correlation analysis

26

PARTIAL CORRELATION

Analyze > Correlate > Partial

Used when you wish to see how correlated two variables are, while taking into account variation in a third variable that may or may not influence the two variables you are interested in. Move the variables of interest into the box labeled Variables, and the one you wish to control for into Controlling for.

27

PARTIAL CORRELATION

It is useful to select Zero-order correlations. This gives you something to compare with, as SPSS will then also show the correlation between the variables without taking into account variation in the control variable.

28

PARTIAL CORRELATION

Correlation where we control for variation in one or more other variables can be compared to the correlation without. If these are identical, the control variable has no effect. In most cases the correlation will be somewhat reduced, but sometimes the change will be large enough to reduce the correlation substantially (confounding).
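One common way to think about (and compute) a partial correlation is to regress both variables of interest on the control variable and correlate the residuals. The Python sketch below does exactly that with simulated data; it is an illustration of the idea, not SPSS's implementation.

    import numpy as np
    from scipy import stats

    def partial_corr(x, y, z):
        """Correlation between x and y after removing the linear effect of z."""
        res_x = x - np.polyval(np.polyfit(z, x, 1), z)   # residuals of x given z
        res_y = y - np.polyval(np.polyfit(z, y, 1), z)   # residuals of y given z
        return stats.pearsonr(res_x, res_y)[0]

    rng = np.random.default_rng(2)
    z = rng.normal(size=200)             # control variable
    x = z + rng.normal(size=200)
    y = z + rng.normal(size=200)
    print("zero-order correlation:", round(stats.pearsonr(x, y)[0], 2))
    print("partial correlation:   ", round(partial_corr(x, y, z), 2))

Here x and y correlate only because they both depend on z, so the partial correlation drops towards zero: a confounding situation like the one described above.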

29

LINEAR REGRESSION

This is used to look for a linear relationship between variables. Is it possible to "predict" how someone is likely to score on one variable, based on where they score on others?

The errors we make (deviation between sample and population) must be independent.

You should have many more cases than variables. A rule of thumb is N > 50 + 8m, where m is the number of independent variables. This rule is somewhat oversimplified.

It is a common misconception that independent variables must have a normal distribution, but this is not necessary.

Deviations between measurements and predicted values are called residuals, and these need some extra attention after you have created your model.

30

NUMBER OF CASES NEEDED

The number of cases needed will not only depend on the number of predictors/independent variables, but also on the size of the effect you are looking for. (Based on a figure from Field (2017).)

31

ASSUMPTIONS OF LINEAR REGRESSION

You should have a continuous dependent variable, and one or more continuous or dichotomous independent variables.

There should be a linear relationship between the dependent variable and all the independent variables, and the combined effect of all the independent variables should best be expressed as a sum of all the contributions.

Observations should be independent, and you should not have extreme outliers.

Data should be homoscedastic (check after you have made the model).

The residuals should have a normal distribution (check after you have made the model).

If you have more than one independent variable, these should not be strongly correlated with each other, i.e. no multicollinearity.

32

LINEAR REGRESSION

Analyze > Regression > Linear

We use the data set survey.sav. Move the dependent variable to the box labeled Dependent, and start with a selection of independent variables that you want to include in your model, and move them into the box labeled Independents. Click Next.

33

LINEAR REGRESSION

Add the rest of the independent variables you want to include in the box labeled Independents. If you wish to compare with an even more complicated model, you can click Next again and add more variables. If you want all variables to be included from the start, you add all variables in the first block, without making more blocks.

34

LINEAR REGRESSION

Linear Regression: Statistics

Make sure that Estimates and Confidence intervals are selected in Regression Coefficients, and include Model fit, R squared change, Descriptives, Part and partial correlations, Collinearity diagnostics and Casewise diagnostics. This will give a good overview of the quality of the model.

35

LINEAR REGRESSION

Linear Regression: Options

Go with the default option. This is one of few analyses where it is better to use Exclude cases listwise rather than Exclude cases pairwise. If you use pairwise exclusion, you risk getting absurd results from your model, e.g. that it explains more than 100% of the variation.

36

LINEAR REGRESSION

Linear Regression: Plots

The most useful thing to look closer at is the residuals. Therefore we plot the standardized predicted values (ZPRED) against the standardized residuals (ZRESID). In addition we choose Histogram, Normal probability plot and Produce all partial plots.

37

LINEAR REGRESSION

Linear Regression: Save

To check if we have any outliers we should be concerned with, or any cases that are overly influential, we save a few variables associated with the residuals and the predicted values. These are added as separate variables at the right end of your data set.

38

LINEAR REGRESSION

Descriptives provide the usual descriptive statistics for all included variables, both dependent and independent. Correlations present the Pearson correlation between the different variables. Here you can check if the independent variables correlate with the dependent variable, and if any of the independent variables are very highly correlated. Correlation between independents over 0.9 is a bad sign.

39

LINEAR REGRESSION

R Square indicates how much of the variation in the dependent variable is explained or described by the model (multiply by 100 to get %). The ANOVA table indicates if the model in itself is significant. In this case both models are significant, but model 2 describes more of the variation in the dependent variable.

40

LINEAR REGRESSION

Coefficients lists the parameters that indicate the effect of each variable (B), and if these are significant (Sig.). Beta lets you compare the effect of each variable with the others, as if they were measured on the same scale. VIF is a measure of multicollinearity. Values over 10 are a cause for concern.
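For comparison, the sketch below fits a similar multiple regression with Python's statsmodels and computes VIF for each predictor. The variable names are made-up stand-ins, not the actual columns in survey.sav.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(3)
    df = pd.DataFrame({"selfesteem": rng.normal(size=150),
                       "mastery": rng.normal(size=150)})
    df["optimism"] = 1.5 * df["selfesteem"] + 0.8 * df["mastery"] + rng.normal(size=150)

    X = sm.add_constant(df[["selfesteem", "mastery"]])
    model = sm.OLS(df["optimism"], X).fit()
    print(model.summary())          # R-squared, ANOVA F test, coefficients with CIs

    # VIF per predictor; values over ~10 suggest multicollinearity
    for i, name in enumerate(X.columns[1:], start=1):
        print(name, round(variance_inflation_factor(X.values, i), 2))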

41

LINEAR REGRESSION

The Mahalanobis distance and Cook’s distance in particular are useful to see if you have outliers and unusually influential cases. With four independent variables the critical upper limit for Mahalanobis is 18.47. The case in the data set that has a value of 18.64 can be located in the data set, but it is only mildly larger than the critical value, so we will not worry about it. The critical value for Cook’s distance is 1. Anything below that is fine.

42

LINEAR REGRESSION

The residuals are fairly close to a normal distribution, so there is no reason to be worried about violating this assumption. If there had been, we should have tried bootstrapping, to obtain more reliable CIs and significance measures.

43

LINEAR REGRESSION

The P-P plot shows the actual residuals plotted against what you would expect, given a normal distribution. These should follow the black line, and in this case they do, strengthening the assumption that the residuals are normally distributed.

44

LINEAR REGRESSION

Predicted values plotted against residuals can be used to check if the data are homoscedastic. If the points have a funnel shape, it indicates heteroscedasticity, and we should use more robust methods. The blob shape that we have is what you want.

45

LINEAR REGRESSION

A final check of linearity and correlation is to look at the scatter plots of all independent variables against the dependent variable. Not surprisingly, we see little indication of correlation between the dependent variable and the two independent variables that were not significant.

46

LINEAR REGRESSION

Plotting the standardized residuals against the independent variables, we can check if we have independence of errors. Here we want no clear trends.

47

EXPLORE THE DIFFERENCE BETWEEN GROUPS

A SMALL DISCLAIMER

Even though t-tests and ANOVA are usually presented as techniques that are completely different from linear regression, they are in fact based on the same basic mathematical model. The reason they are kept separate is more historical than anything else, and SPSS holds on to this separation even though it is quite artificial.

48

T-TEST

Compare data from two different groups, in order to determine if the two are different. t-tests are usually used to analyze data from controlled studies.

Be aware that there are generally two different types of t-tests: one for independent groups, and one for paired samples, where data are collected from the same participants at two different times (repeated measures).

If the assumptions of the t-test are not met, you can use the Mann-Whitney U test (for independent samples), the Wilcoxon Signed Rank test (repeated measures), a t-test combined with bootstrap, or a robust version of the standard t-test.

49

ASSUMPTIONS OF THE INDEPENDENT T-TEST

You need a continuous dependent variable and a dichotomous categorical variable.

Independent observations/groups. This means that each participant can only be part of one of the groups, e.g. men and women, smokers and non-smokers, etc.

Random selection

No extreme outliers

If you have a small sample (less than 30), the dependent variable should be normally distributed within each of the categories defined by the independent variable.

The variance of the dependent variable should be approximately equal in the two categories defined by the categorical variable. The groups should also be similar in size.

50

T-TEST

Analyze > Compare Means > Independent Samples T Test

We will use the data set survey.sav. Move the dependent continuous variable into the box labeled Test Variable(s), and the independent categorical variable into Grouping Variable. Even if you can test more than one dependent variable at a time, you should not do so. Use MANOVA instead.

51

T-TEST

Click Define Groups...

Here you have to remember how you have coded the categorical variable. Indicate what the two groups are. In this case sex is coded as 1=Man and 2=Woman, so we write 1 and 2 and click Continue. If the codes had been 0 and 1, we would have written those values.

52

T-TEST

Group Statistics provides some descriptive statistics for the different groups, such as mean and standard deviation. Independent Samples Test shows what the difference between the groups is (Mean Difference), and if the difference is significant (Sig. 2-tailed). If Levene’s test is not significant (column 2), we can look at the results from the first row (Equal variances assumed). The t-test in this case is not significant.
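The same decision logic (Levene first, then the ordinary or Welch t-test) can be sketched in Python with scipy; the two simulated groups below are only placeholders.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    men = rng.normal(loc=33, scale=4, size=80)
    women = rng.normal(loc=34, scale=4, size=90)

    # Levene's test: p > 0.05 means equal variances can be assumed
    _, lev_p = stats.levene(men, women)

    # Equal-variance t-test or Welch's t-test, matching the two rows SPSS prints
    t, p = stats.ttest_ind(men, women, equal_var=(lev_p > 0.05))
    print(f"Levene p = {lev_p:.3f}, t = {t:.2f}, p = {p:.3f}")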

53

MANN WHITNEY U TEST

Non-parametric version of a standard t-test for independent samples. You need a continuous dependent variable and a dichotomous independent variable. If you have many extreme outliers you might consider using this test.

Analyze > Nonparametric Tests > Independent Samples

Choose Customize analysis and click Fields

54

MANN WHITNEY U TEST

Choose Use custom field assignments, and move the dependent variable into the box labeled Test Fields. Move the independent variable to Groups, and click on Settings.

55

MANN WHITNEY U TEST

Go to the Settings tab. Under Choose tests, select Mann-Whitney U (2 samples), and click Paste.

56

MANN WHITNEY U TEST

The summary of the test shows the hypothesis we are testing against (no difference between groups), and what the conclusion of the test is, based on significance. In this case there is no significant difference between the groups, and we retain the null hypothesis.

57

MANN WHITNEY U TEST

The significance value of the test, with the corresponding test statistic, is shown in the table Independent Samples Mann-Whitney U.... The histograms of the two groups support the result of the test, that there is no significant difference between the groups.
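A corresponding sketch in Python (again with made-up groups):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    group1 = rng.normal(loc=33, scale=4, size=80)
    group2 = rng.normal(loc=34, scale=4, size=90)

    u, p = stats.mannwhitneyu(group1, group2, alternative="two-sided")
    print(f"Mann-Whitney U = {u:.0f}, p = {p:.3f}")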

58

ASSUMPTIONS OF REPEATED MEASURES T-TEST

You need a continuous dependent variable measured at two different times, or under two different conditions.

Random selection

There should be no extreme outliers in the difference between the two sets of measurements.

The difference between the two measurements should be normally distributed, at least if you have a small sample.

The data should be organized so that each participant has one row, with two different variables representing the data from the two different points in time/conditions.

59

T-TEST (REPEATED MEASURES)

Analyze > Compare Means > Paired Samples T Test

We will use the data set experim.sav. Move the variable containing measurements from time 1 to the box called Paired Variables. Then move the variable containing the second set of measurements into the same box.

60

T-TEST (REPEATED MEASURES)

Paired Samples Statistics shows descriptive statistics such as mean and standard deviation for the two different variables. Paired Samples Correlations provides the correlation between measurements from the two different variables.

61

T-TEST (REPEATED MEASURES)

The final table shows if the test is significant or not, and what the average difference is. Here the difference is 2.67, and the test is highly significant (p<0.001).
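In Python the paired t-test is a one-liner with scipy; the simulated scores below just mimic a drop between the two time points.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    time1 = rng.normal(loc=40, scale=5, size=30)
    time2 = time1 - 2.7 + rng.normal(scale=2, size=30)   # scores drop at time 2

    t, p = stats.ttest_rel(time1, time2)
    print(f"mean difference = {np.mean(time1 - time2):.2f}, t = {t:.2f}, p = {p:.4f}")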

62

WILCOXON SIGNED RANK TEST

Non-parametric alternative to the repeated measures t-test. You need a continuous variable measured on two different occasions. This test is more suitable than a t-test if you have many outliers.

Analyze > Nonparametric Tests > Legacy Dialogs > 2 Related Samples

Move the data from time 1 to Test Pairs, then the data from time 2. Check the box next to Wilcoxon.

63

WILCOXON SIGNED RANK TEST

Two-Related-Samples: Options

Select Quartiles (and Descriptives if you want some basic descriptive statistics as well). Exclude cases test-by-test will include all cases that have data on both occasions, but that may still have missing data in other variables.

64

WILCOXON SIGNED RANK TEST

Descriptive Statistics presents the quartiles. Here we can see that there are signs of differences between the two sets of measurements, since all quartiles from time 2 are lower than for time 1. Test Statistics confirms this (p<0.001). The effect size can be calculated using r = z/√N, where N is the total number of observations (two per case), in this case 4.18/√(2×30) = 0.54, corresponding to a large effect.
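The sketch below runs the Wilcoxon test in Python and converts the two-sided p-value back to an approximate z in order to apply the r = z/√N effect-size formula; the data are simulated.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    time1 = rng.normal(loc=40, scale=5, size=30)
    time2 = time1 - 2.7 + rng.normal(scale=2, size=30)

    w, p = stats.wilcoxon(time1, time2)
    # Approximate z recovered from the two-sided p-value, then r = z / sqrt(N),
    # where N is the total number of observations (here 2 x 30 = 60)
    z = stats.norm.isf(p / 2)
    r = z / np.sqrt(time1.size + time2.size)
    print(f"W = {w:.0f}, p = {p:.4f}, effect size r = {r:.2f}")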

65

ANOVA - ANALYSIS OF VARIANCE

Compare continuous data from two or more different groups, to see if the groups differ.

This test must be adjusted to whether you have independent groups (different people in each group), or if you have repeated measures (the same people in each of the groups). ANOVA assumes that all groups have similar variance.

Alternatives if assumptions are not met: the Kruskal-Wallis test, Friedman’s ANOVA, bootstrap and other robust methods.

66

ASSUMPTIONS OF INDEPENDENT ANOVA

You need a continuous dependent variable, and a categorical independent variable with at least two categories.

Independent measurements. Participants can only belong to one of the categories in the independent variable, and different participants should not be able to affect each other.

Random sample

No extreme outliers

The dependent variable should be approximately normally distributed within each of the categories in the independent variable.

The variance in each group should be similar. Groups should also be of similar sizes.

67

ANOVA - ANALYSIS OF VARIANCE

Analyze > Compare Means > One way ANOVA

We will use the data set survey.sav. Move the dependent variable to the box labeled Dependent List, and the categorical variable to Factor.

68

ANOVA - ANALYSIS OF VARIANCE

One way ANOVA: Options

Select Descriptive, Homogeneity of variance, Brown-Forsythe and Welch under Statistics, and choose Means plot. To include as much data as possible in your analysis, choose Exclude cases analysis by analysis.

69

ANOVA - ANALYSIS OF VARIANCE

One way ANOVA: Post Hoc Multiple Comparisons

You can choose from a wide selection of post hoc tests. Check the SPSS documentation for details about each test. We choose Tukey (if we have more or less equal variance and group sizes), Bonferroni (controls for type I errors) and Games-Howell (in case of differences in variance).

70

ANOVA - ANALYSIS OF VARIANCE

Descriptives shows some basic descriptive statistics for the dependent variable for each of the groups in the independent variable.

71

ANOVA - ANALYSIS OF VARIANCE

Test of Homogeneity... shows if we can assume equal variances. Here the null hypothesis is that they are equal, so we want Sig. to be greater than 0.05. Robust Tests of Equality of Means shows the test results we should use if the variances are different. In this case there are significant differences between the groups.

72

ANOVA - ANALYSIS OF VARIANCE

Since we can assume equal variances, we can also look at the regular ANOVA, which supports the conclusion of the robust tests, that there is a significant difference between groups (Sig.=0.01). The results of the post hoc tests show which groups differ. Here we can see that there is only a significant difference between the oldest and the youngest participants.
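The one-way ANOVA and a Tukey post hoc test can be sketched in Python as below; the three simulated age groups are placeholders for the groups in survey.sav.

    import numpy as np
    from scipy import stats
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    rng = np.random.default_rng(8)
    young = rng.normal(loc=21, scale=4, size=60)
    middle = rng.normal(loc=22, scale=4, size=60)
    old = rng.normal(loc=23, scale=4, size=60)

    f, p = stats.f_oneway(young, middle, old)      # regular one-way ANOVA
    print(f"F = {f:.2f}, p = {p:.4f}")

    # Tukey post hoc comparisons to see which groups differ
    scores = np.concatenate([young, middle, old])
    groups = ["18-29"] * 60 + ["30-44"] * 60 + ["45+"] * 60
    print(pairwise_tukeyhsd(scores, groups))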

73

ANOVA - ANALYSIS OF VARIANCE

The mean of the dependent variable in each age group is plotted against age group, and indicates a clear trend of increasing optimism with age.

74

KRUSKAL WALLIS

Non-parametric alternative to ANOVA for independent groups. You need a continuous dependent variable and a categorical independent variable with two or more groups.

Analyze > Nonparametric Tests > Independent Samples

In the Fields tab, move the dependent variable to Test Fields, and the independent categorical variable to Groups.

75

KRUSKAL WALLIS

Under Settings, choose Kruskal-Wallis 1-way ANOVA, and make sure that Multiple comparisons is set to All pairwise. Choose Test for Ordered Alternatives if the categorical variable is ordinal. Click Paste.

76

KRUSKAL WALLIS

Hypothesis Test Summary shows what the null hypothesis is, and if it should be rejected. In this case the test states that we should go with the alternative hypothesis, that there is a significant difference between groups. Specific test statistics and significance are shown in Independent-Samples Kruskal-Wallis...

77

KRUSKAL WALLIS

Boxplots of the data from the different age groups look like they support our conclusion that optimism is higher in older participants.

78

KRUSKAL WALLIS

Pairwise comparisons... show that there is only a significant difference between the first and the last age groups.
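For reference, the Kruskal-Wallis test itself is a single scipy call (simulated groups again):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(9)
    young = rng.normal(loc=21, scale=4, size=60)
    middle = rng.normal(loc=22, scale=4, size=60)
    old = rng.normal(loc=23, scale=4, size=60)

    h, p = stats.kruskal(young, middle, old)
    print(f"H = {h:.2f}, p = {p:.4f}")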

79

ASSUMPTIONS OF REPEATED MEASURES ANOVA

You need a continuous variable measured at two or more occasions or experimental conditions.

Random selection

No extreme outliers

The dependent variable should be approximately normally distributed at each of the occasions.

The variance of the dependent variable should be similar for all occasions.

The variance of the difference between all possible combinations of occasions should be more or less equal for all combinations (called sphericity). If sphericity cannot be assumed, a correction must be made.

80

ANOVA - ANALYSIS OF VARIANCE (REPEATED MEASURES)

Analyze > General Linear Model > Repeated Measures

We will use the data set called experim.sav. First we must ’create’ the factor variable indicating the different occasions or experimental conditions. All we have to do is provide a name and the number of levels, and click Add.

81

ANOVA - ANALYSIS OF VARIANCE (REPEATED MEASURES)

After clicking Add, this factor will appear in the window below. We can then go ahead and click Define.

82

ANOVA - ANALYSIS OF VARIANCE (REPEATED MEASURES)

The three levels of the factor we created are listed as three separate variables. These must be defined by clicking level 1 and moving the variable containing the measurements of the dependent variable at time 1 over to the box labeled Within-Subjects Variables.

83

ANOVA - ANALYSIS OF VARIANCE (REPEATED MEASURES)

When all the variables representing different occasions/experimental conditions are added, it will look like this. All three levels in the factor have been defined by a variable in the data set.

84

ANOVA - ANALYSIS OF VARIANCE (REPEATED MEASURES)

Repeated Measures: Model

Make sure that Full factorial is selected.

85

ANOVA - ANALYSIS OF VARIANCE (REPEATED MEASURES)

Repeated Measures: Options

Choose Descriptive statistics and Estimates of effect size. If you wish you can also select Parameter estimates.

86

ANOVA - ANALYSIS OF VARIANCE (REPEATED MEASURES)

Repeated Measures: Profile Plots

Move time to Horizontal axis, and click Add. Choose either Line chart or Bar chart (depending on what you prefer). Select Include Error Bars.

87

ANOVA - ANALYSIS OF VARIANCE (REPEATED MEASURES)

Repeated Measures: Estimated Marginal Means

Choose time and move it to Display Means for. Check the box labeled Compare main effects and choose Bonferroni (it is the strictest). Click Continue and Paste.

88

ANOVA - ANALYSIS OF VARIANCE (REPEATED MEASURES)

Descriptive Statistics shows the mean and standard deviation for the dependent variable for each of the different occasions. Multivariate Tests shows significance. Here you can choose the test that is most commonly used in your field. Sig. less than 0.05 indicates a significant difference.

89

ANOVA - ANALYSIS OF VARIANCE (REPEATED MEASURES)

Mauchly’s test of sphericity shows if sphericity can be assumed or not. The null hypothesis is that sphericity can be assumed. If Sig. is less than 0.05, then we need to reject the null hypothesis, and this must be kept in mind when interpreting the rest of the results.

90

ANOVA - ANALYSIS OF VARIANCE (REPEATED MEASURES)

Since we cannot assume sphericity, we must base our conclusion on the three other measures of significant difference. The strictest is Lower-bound, and in this case even this test is significant. We also have significance for the test that there is a linear relationship between the time factor and our dependent variable (see the Tests of Within-Subjects Contrasts table).
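Outside SPSS, a basic repeated measures ANOVA can be sketched with statsmodels' AnovaRM, as below. Note that, unlike SPSS, this simple sketch reports no Mauchly test or sphericity corrections, and the data are simulated.

    import numpy as np
    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    rng = np.random.default_rng(10)
    n = 30
    # Long format: one row per participant per occasion
    long = pd.DataFrame({
        "subject": np.tile(np.arange(n), 3),
        "time": np.repeat([1, 2, 3], n),
        "fear": np.concatenate([rng.normal(42 - 2 * t, 5, size=n) for t in (1, 2, 3)]),
    })

    res = AnovaRM(long, depvar="fear", subject="subject", within=["time"]).fit()
    print(res)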

91

ANOVA - ANALYSIS OF VARIANCE (REPEATED MEASURES)

The mean of the dependent variable at each time, with the corresponding standard error and confidence interval, is presented in Estimated Marginal Means. Pairwise comparisons show that there is a significant difference between all levels, together with the corresponding average difference.

92

ANOVA - ANALYSIS OF VARIANCE (REPEATED MEASURES)

The mean at each of the three occasions is plotted against time, including the uncertainty given by the confidence intervals. This graph shows the linear trend in the participants’ fear of statistics, and how it decreases with time.

93

FRIEDMAN’S ANOVA

Non-parametric alternative to the repeated measures ANOVA. You need a continuous dependent variable that has been measured at two or more occasions/conditions.

Analyze > Nonparametric Tests > Legacy Dialogs > K Related Samples

Move the variables representing the different occasions/conditions to the box labeled Test Variables.

94

FRIEDMAN’S ANOVA

Several Related Samples: Statistics. Select Quartiles and (if you wish) Descriptives.

Several Related Samples: Exact. Choose Exact under Exact tests.

95

FRIEDMAN’S ANOVA

Descriptive statistics show quartiles for each occasion. Ranks provides the mean rank at each of the three different times. The average rank seems to decrease with time, suggesting that there is a relationship between time and fear of statistics. Test Statistics tells if the test is significant or not (Asymp. Sig.<0.05).
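The Friedman test itself is one scipy call (simulated measurements for the three occasions):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(11)
    time1 = rng.normal(loc=42, scale=5, size=30)
    time2 = time1 - 2 + rng.normal(scale=2, size=30)
    time3 = time1 - 4 + rng.normal(scale=2, size=30)

    chi2, p = stats.friedmanchisquare(time1, time2, time3)
    print(f"chi-square = {chi2:.2f}, p = {p:.4f}")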

96

MIXED ANOVA

This is used when you want to compare independent groups, while also looking at measurements on the same individuals from different times/conditions.

Analyze > General Linear Model > Repeated Measures

Like regular repeated measures ANOVA, we have to create the factor that includes the different times/conditions. Give it a name, provide the number of levels and click Add, before clicking Define.

97

MIXED ANOVA

Move the variables corresponding to the different levels of the dependent repeated measures variable to the window Within-subjects variables. Move the factor containing independent groups to the window Between-subjects factors.

98

MIXED ANOVA

Repeated Measures: Options

Choose Descriptive statistics, Estimates of effect size, Parameter estimates and Homogeneity tests.

99

MIXED ANOVA

Repeated Measures: Profile Plots

Move the repeated measures factor (here time) to Horizontal Axis, and the factor with independent groups to Separate Lines, and click Add. Select Line Chart and check the box for Include Error Bars.

100

MIXED ANOVA

Mean and standard deviation for the two independent groups at each of the three time points.

101

MIXED ANOVA

This is a test of covariance between different groups and times. The null hypothesis is that covariance is equal. If this test is not significant, we can assume that covariance does not vary, i.e. that the correlations between different times within the different subgroups defined by the independent variable are the same, which is what we want. Keep in mind that this test can give significant results for large data sets, even when covariances are equal.

102

MIXED ANOVA

The first four rows in Multivariate Tests indicate that there is a significant effect of time. The next four rows indicate that there is no significant effect of the combined factor containing time and groups.

103

MIXED ANOVA

Levene’s test suggests that there is constant variance in the error that the model makes, as none of the tests are significant. Tests of Between-Subjects Effects shows that the groups in the independent variable are not significantly different from each other (p=0.81).

104

MIXED ANOVA

The graph of fear of statistics vs. time for the two different groups strengthens the impression that there is no significant difference between the groups, as the two lines follow each other pretty closely.

105

ANOVA - ANALYSIS OF VARIANCE

Other types of ANOVA

• Two-way ANOVA - More than one independent categorical variable. This allows you to look at differences between more than one type of group, e.g. not just gender, but also age group as well.

• ANCOVA (ANalysis of COVAriance) - Perform ANOVA, while also taking into account one or more continuous variables.

• MANOVA (Multivariate ANOVA) - Look at differences between groups within more than one continuous dependent/outcome variable at the same time.

106

CATEGORICAL OUTCOME VARIABLES

CHI SQUARED(χ2) TEST

You can use this to test if the distribution of data within categorical variables is random or not, i.e. if there is an association between the categorical variables.

This is a non-parametric test, so here we do not need to worry so much about data distribution. However, assuming a random distribution, none of the groups defined by the categorical variables should be too small. If you use two variables with two groups each, this results in 2 × 2 = 4 subgroups. None of these subgroups should have an expected frequency less than 5. For larger tables, at least 80% of the subgroups should have an expected frequency of 5 or more.

Observations should be independent, so the two variables should for instance not represent a pre/post test. If you have data like this you should use McNemar’s test instead.

107
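The sketch below runs the same kind of test in Python on a made-up 2×2 table, prints the expected frequencies (so the "at least 5" rule can be checked), and computes Cramer's V as an effect size.

    import numpy as np
    from scipy import stats

    # 2 x 2 contingency table: rows = sex, columns = smoker / non-smoker (made-up counts)
    table = np.array([[33, 151],
                      [38, 214]])

    # For a 2x2 table, scipy applies the continuity (Yates) correction by default,
    # matching the Continuity Correction row in SPSS's Chi-Square Tests table
    chi2, p, dof, expected = stats.chi2_contingency(table)
    print(f"chi-square = {chi2:.2f}, p = {p:.3f}")
    print("expected frequencies:\n", expected)

    # Phi / Cramer's V as effect size
    n = table.sum()
    cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
    print(f"Cramer's V = {cramers_v:.2f}")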

CHI SQUARED(χ2) TEST

Analyze > Descriptive Statistics > Crosstabs...

We use the data set survey.sav. Move one of the categorical variables to Row(s) and the other to Column(s). Select Display clustered bar charts if you want.

108

CHI SQUARED(χ2) TEST

Crosstabs: Exact

Choose Exact instead of Asymptotic only. That way you run Fisher’s exact test, which is especially useful if you have few cases or low expected frequencies in subgroups. If this is not the case, no correction will be applied to the data anyway, and no harm is done.

109

CHI SQUARED(χ2) TEST

Crosstabs: Statistics

Select Chi-square, Contingency coefficient, Phi and Cramer’s V, and Lambda, so that we get the correct test (χ2) and a measure of effect size (Phi/Cramer’s V). Lambda provides a measure of how much smaller the error becomes if group membership within one variable is predicted based on group membership in the other variable.

110

CHI SQUARED(χ2) TEST

Crosstabs: Cell Display

Select both Observed and Expected in Counts. In addition, select Row, Column and Total in Percentages, and Standardized in Residuals.

111

CHI SQUARED(χ2) TEST

The contingency table shows how cases are distributed between the (in this case) four different subgroups, as well as the expected frequencies. The residuals show the difference between the observed frequencies and the expected frequencies. If a standardized residual is greater than 2.0 (for a 2×3 table or greater), the difference is much greater than expected.

112

CHI SQUARED(χ2) TEST

For larger tables, the most important value in Chi-Square Tests is Pearson Chi-Square. For us (with a 2×2 table) we should rather use Continuity Correction. Since this is not significant, there is no significant difference in smoking between men and women. With a 2×2 table, we should report phi as a measure of effect size; for larger tables you should use Cramer’s V instead. (Small effect: 0.1, Medium effect: 0.3, Large effect: 0.5.)

113

CHI SQUARED(χ2) TEST

The bar chart shows what the test has already shown, that there is no significant difference between men and women when it comes to smoking, as the bars are almost the same size for men and women.

114

LOGISTIC REGRESSION

This is used when you have a categorical outcome variable, i.e. when you are trying to predict group membership based on continuous and/or categorical variables.

It is not dependent on a normal distribution, but it is important that all groups/categories are well represented.

Multicollinearity between independent variables is important to watch out for.

115

ASSUMPTIONS OF LOGISTIC REGRESSION

You need a dependent categorical variable, where the categories are mutually exclusive. It must not be possible to belong to more than one category or group.

You have one or more independent variables that are either continuous or categorical.

Observations are independent.

If you have several continuous independent variables, these should not be heavily correlated.

All categories in the categorical variable should be well represented.

There should be a linear relationship between each predictor and the logit transformation of the outcome variable.

116

LOGISTIC REGRESSION

Analyze > Regression > Binary Logistic

We will use the data set sleep.sav. Move the dependent variable to Dependent and the independent variables you wish to include into Covariates. Click Categorical...

117

LOGISTIC REGRESSION

Logistic Regression: Define Categorical Variables

All categorical variables must be specified as categorical, in order for them to be handled correctly. Here you can choose which category is going to be the reference category. Click the variable, select First or Last, and click Change.

118

LOGISTIC REGRESSION

Logistic Regression: Options

Select Hosmer-Lemeshow goodness-of-fit, Casewise listing of residuals, CI for exp(B) and Include constant in model. Click Continue and Paste.

119

LOGISTIC REGRESSION

The first part of the output provides information on the number of cases, and how the outcome variable is coded. This is useful to remember when interpreting the rest of the results. The reference group for the outcome is coded as 0, in this case no.

120

LOGISTIC REGRESSION

Corresponding coding for all the independent categorical variables, including frequencies for each category. Groups coded as 0 are reference groups, because we chose the first group to be the reference. The Classification Table shows the results of the model we compare with, i.e. the simple model with no independent variables.

121

LOGISTIC REGRESSION

Omnibus test shows the significance of the whole model, which should be below 0.05. Cox & Snell R Square and Nagelkerke R Square estimate how much variation in the outcome is described by the model. Hosmer-Lemeshow tells us if the model is good. Here we want Sig. greater than 0.05.

122

LOGISTIC REGRESSION

With our model, we can now correctly predict 75.1% of the outcomes, compared to 57.3% with the simple model. For those cases where the prediction is incorrect, the cases with a residual greater than 2 are listed in the Casewise list. If the residuals are greater than 2.5, these should be examined more closely. There might be a reason why these particular cases are not described well by the model.

123

LOGISTIC REGRESSION

The model itself is presented in Variables in the Equation. Column B and Exp(B) show effect size, and Sig. shows significance. If Exp(B) is greater than 1 for a specific variable, it means that the odds of ending up in group 1 in the outcome are greater if you either increase the value of the predictor (continuous), or belong to that specific group rather than the reference group (categorical). Those who struggle to fall asleep (prob fall asleep=1) have 2.05 times higher odds of having problems sleeping (prob sleep=1).
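For comparison, a minimal logistic regression sketch in Python with statsmodels is shown below; the variable names and data are made up and do not correspond exactly to the columns in sleep.sav. np.exp() of the fitted coefficients gives the Exp(B) odds ratios discussed above.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(12)
    n = 250
    df = pd.DataFrame({
        "age": rng.integers(18, 80, size=n),
        "probfallasleep": rng.integers(0, 2, size=n),   # 1 = trouble falling asleep
    })
    # Simulated outcome: 1 = reports a sleep problem
    linpred = -3 + 0.02 * df["age"] + 0.7 * df["probfallasleep"]
    df["probsleep"] = rng.binomial(1, 1 / (1 + np.exp(-linpred)))

    X = sm.add_constant(df[["age", "probfallasleep"]])
    model = sm.Logit(df["probsleep"], X).fit()
    print(model.summary())
    print(np.exp(model.params))   # Exp(B): odds ratios, as in Variables in the Equation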

124

LOGISTIC REGRESSION

To test linearity, you can calculate the natural logarithm of all continuous variables (e.g. LN(age), LN(hourwnit)). The interactions LN(variable)×variable can then be included in the logistic regression model. If these interactions are not significant, the assumption holds.

125

LOGISTIC REGRESSION

To check for multicollinearity, you must use ordinary linear regression, but with exactly the same dependent and independent variables as in the logistic regression model. Select Collinearity Diagnostics under Statistics in the dialog box. VIF<10 is the requirement.

126

BACK TO THE STUDY HALL/OFFICE

Start using SPSS as quickly as possible on your own data (or someone else’s for that matter)! The only way to improve your understanding of statistics and SPSS is to use it. Learning by doing.

Make sure you have good textbooks and online resources for when you need to check something. A decent online tutorial can be found e.g. at https://libguides.library.kent.edu/SPSS

If SPSS throws a warning at you saying something about how "validity of subsequent results cannot be ascertained", it REALLY means that you CANNOT trust your results, even if they look good. You need to change your analysis.

Ask Google or YouTube when you get stuck. If that doesn’t help, ask us (statistikk@usit.uio.no)!

127

SUGGESTED BOOKS

In SPSS: SPSS Survival Manual by Julie Pallant

In statistics and SPSS: Discovering Statistics Using IBM SPSS and An Adventure in Statistics by Andy Field

128
